Accelerated Inexact Composite Gradient Methods for Nonconvex Spectral Optimization Problems
JJournal of Machine Learning Research ? (2020) ?-? Submitted 7/22; Published ??/??
Accelerated Inexact Composite Gradient Methods forNonconvex Spectral Optimization Problems
Weiwei Kong [email protected]
Renato D.C. Monteiro [email protected]
School of Industrial and Systems Engineering,Georgia Institute of Technology,Atlanta, GA 30332-0205
Editor:
TBD.
Abstract
This paper presents two inexact composite gradient methods, one inner accelerated andanother doubly accelerated, for solving a class of nonconvex spectral composite optimizationproblems. More specifically, the objective function for these problems is of the form f + f + h where f and f are differentiable nonconvex matrix functions with Lipschitz continuousgradients, h is a proper closed convex matrix function, and both f and h can be expressedas functions that operate on the singular values of their inputs. The methods essentiallyuse an accelerated composite gradient method to solve a sequence of proximal subproblemsinvolving the linear approximation of f and the singular value functions underlying f and h . Unlike other composite gradient-based methods, the proposed methods take advantageof both the composite and spectral structure underlying the objective function in order toefficiently generate their solutions. Numerical experiments are presented to demonstratethe practicality of these methods on a set of real-world and randomly generated spectraloptimization problems. Keywords: nonconvex optimization, spectral functions, inexact proximal point methods,composite gradient methods, accelerated methods, spectral methods
1. Introduction
There are numerous applications in electrical engineering, machine learning, and medicalimaging that can be formulated as nonconvex spectral optimization problems of the formmin U ∈ R m × n n φ ( U ) := f ( U ) + ( f V ◦ σ )( U ) + ( h V ◦ σ )( U ) o , (1)where σ is the function that maps a matrix to its singular value vector (in nonincreasingorder of magnitude), f and f V are continuously differentiable functions with Lipschitzcontinuous gradients, and h V is a proper, lower semicontinuous, convex function. Moreover,such problems are typically formulated so that: (i) the resolvents of λ∂h and λ∂h V areeasy compute for any λ >
0; and (ii) both f V and h V are absolutely symmetric in theirarguments, i.e., they do not depend on the ordering or the sign of their arguments.A typical approach for solving (1) is to employ a composite gradient (CG) method(or an accelerated version of it) that solves composite optimization problems of the formmin U [ g ( U ) + h ( U )], where g is a continuously differentiable function with Lipschitz contin-uous gradient and h is a proper, lower semicontinuous, convex function. More specifically, c (cid:13) https://creativecommons.org/licenses/by/4.0/ . Attribution requirements are providedat http://jmlr.org/papers/v?/???.html . a r X i v : . [ m a t h . O C ] J u l ong and Monteiro the method is applied to (1) with g = f + f V ◦ σ and h = h V ◦ σ and typically does notuse any of the spectral structure underlying f V ◦ σ and h V ◦ σ .Our goal in this paper is to develop two efficient inexact composite gradient (ICG)methods that solve (1) by exploiting the spectral structure underlying the objective function.More specifically, one of the methods, called the inner accelerated ICG (IA-ICG) methodinexactly solves a sequence of matrix prox subproblems of the formmin U ∈ R m × n (cid:26) λ h h∇ f ( Y k − ) , U i + ( f V ◦ σ )( U ) + ( h V ◦ σ )( U ) i + 12 k U − Y k − k (cid:27) (2)where λ > Y k − is the previous iterate. It is shown (see Subsection 4.1)that the effort of finding the required inexact solution Y k of (2) consists of computing onesingular value decomposition (SVD) and applying an accelerated gradient (ACG) algorithmto the related vector prox subproblemmin u ∈ R r (cid:26) λ h f V ( u ) − h c k − , u i + h V ( u ) i + 12 k u k (cid:27) (3)where r = min { m, n } and c k − = σ ( Y k − − λ ∇ f ( Y k − )). Note that (3) is a problem over thevector space R r , and hence, has significantly fewer dimensions than (2) which is a problemover the matrix space R m × n . The other ICG method, called the doubly accelerated ICG(DA-ICG) method, solves a similar prox subproblem as in (2) but with Y k − selected inan accelerated manner (and hence its qualifier of “doubly accelerated”). Given ˆ ρ >
0, it isshown that both methods obtain a pair (ˆ Y, ˆ V ) satisfyingˆ V ∈ ∇ f (ˆ Y ) + ∇ (cid:16) f V ◦ σ (cid:17) (ˆ Y ) + ∂ (cid:16) h V ◦ σ (cid:17) (ˆ Y ) , k ˆ V k ≤ ˆ ρ by solving at most O ( ˆ ρ − ) matrix prox subproblems as in (2).It is worth mentioning that the IA-ICG method can be viewed an inexact version ofthe exact composite gradient (ECG) method applied to (1), which solves a sequence ofsubproblemsmin U ∈ R m × n (cid:26) λ hD ∇ h f + f V ◦ σ i ( Y k − ) , U E + ( h V ◦ σ )( U ) i + 12 k U − Y k − k (cid:27) , (4)where λ > Y k − is the previous iterate. Similarly, the DA-ICG methodcan be viewed as an inexact version of an exact (monotone) accelerated composite gradient(EACG) method, which also solves a sequence of subproblems (4) but with Y k − chosen inan accelerated manner.For high-dimensional instances of (1) where min { m, n } is large, and hence, SVDs areexpensive to compute, it will be shown that the larger the Lipschitz constant of ∇ f V is, the better the performance of the ICG methods is compared to that of their exactcounterparts. This is due to the following facts: (i) solving (4) or (2) involves a single SVDcomputation; (ii) even though (4) requires fewer resolvent evaluations to solve than (2), thecost of solving these subproblems is comparable due to the fact that the aforementionedSVD is the bottleneck step; and (iii) the larger the Lipschitz constant of ∇ f V , is the smallerthe stepsize λ in (4) must be, and hence, the more subproblems of form (4) need to be solvedduring the execution of the exact counterparts. ptimization of Nonconvex Spectral Functions Related works . The earliest complexity analysis of an ACG method for solving nonconvexcomposite problems like the one in (1) is given in (Ghadimi and Lan, 2016). Building onthe results in (Ghadimi and Lan, 2016), many other papers (Drusvyatskiy and Paquette,2019; Ghadimi et al., 2015; Liang et al., 2019) have proposed similar ACG-based methods.Another common approach for solving problems like (1) is to employ an inexact proximalpoint method where each prox subproblem is constructed to be convex, and hence, solvableby an ACG variant. For example, papers (Carmon et al., 2018; Paquette et al., 2017; Konget al., 2019, 2020) present inner accelerated inexact proximal point methods whereas (Liangand Monteiro, 2018) presents a doubly accelerated inexact proximal point method.To the best our knowledge, this paper is the first one to present ICG methods thatexploit both the spectral and composite structure in (1).
Organization of the paper.
Subsection 1.1 gives some notation and basic definitions.Subsection 1.2 presents several real-world problems that are of the form in (1). Section 2presents some necessary background material for describing the ICG methods. Section 3 issplit into three subsections. The first one precisely describes the problem of interest, whilethe last two present the IA-ICG and DA-ICG methods. Section 4 describes an efficientway of solving problem (2) by modifying a solution of problem (3). Section 5 presentssome numerical results. Section 6 establishes the iteration complexity of the ICG methods.Finally, some auxiliary results are presented in Appendices A to D.
This subsection provides some basic notation and definitions.The set of real numbers is denoted by R . The set of non-negative real numbers and theset of positive real numbers is denoted by R + and R ++ respectively. The set of naturalnumbers is denoted by N . The set of complex numbers is C . The set of unitary matrices ofsize n –by– n is U n . For t >
0, define log +1 ( t ) := max { , log( t ) } . Let R n denote a real–valued n –dimensional Euclidean space with norm k · k . Given a linear operator A : R n R p ,the operator norm of A is denoted by k A k := sup {k Az k / k z k : z ∈ R n , z = 0 } . Using theasymptotic notation O , we denote O ( · ) ≡ O (1 + · ).Let ( m, n ) ∈ N and let r = min { m, n } . Given matrices X ∈ R m × n and Y ∈ R n × n ,let the quantities σ ( X ) and λ ( Y ) denote the singular values and eigenvalues of X and Y ,respectively, in nonincreasing order. Let dg : R r R r × r and Dg : R m × n R r be givenpointwise by [dg z ] ij = ( z i , if i = j, , otherwise , [Dg Z ] i = Z ii , for every z ∈ R r , Z ∈ R m × n , and ( i, j ) ∈ { , ..., r } .The following notation and definitions are for a general complete inner product space Z , whose inner product and its associated induced norm are denoted by h· , ·i and k · k respectively. Let ψ : Z 7→ ( −∞ , ∞ ] be given. The effective domain of ψ is denoted bydom ψ := { x ∈ Z : ψ ( x ) < ∞} and ψ is said to be proper if dom ψ = ∅ . For ε ≥
0, the ε -subdifferential of ψ at x ∈ dom ψ is denoted by ∂ ε ψ ( z ) := (cid:8) w ∈ R n : ψ ( z ) ≥ ψ ( z ) + (cid:10) w, z − z (cid:11) − ε, ∀ z ∈ Z (cid:9) , ong and Monteiro and we denote ∂ψ ≡ ∂ ψ . The set of proper, lower semi-continuous, convex functions isdenoted by Conv Z . The convex conjugate ψ is denoted by ψ ∗ . The linear approximationof ψ at a point z ∈ dom ψ is denoted by ‘ ψ ( · ; z ) := ψ ( z ) + h∇ ψ ( z ) , · − z i . The indicatorof a closed convex set C ⊆ Z at a point z ∈ Z is denoted by δ C ( z ), which is 1 if z ∈ C and ∞ otherwise. The local Lipschitz constant of ∇ ψ at two points u, z ∈ Z is denoted by L ψ ( x, y ) = k∇ ψ ( x ) −∇ ψ ( y ) kk x − y k , x = y, , x = y, ∀ x, y ∈ dom ψ. (5) This subsection lists some motivating applications that are of the form in (1). Throughoutthis subsection, we will assume that we have two sparsity-inducing regularizers R = R s + R n and P , where R s and P are continuously differentiable functions with Lipschitz continuousgradients and R n is a proper, lower semicontinuous, and convex function. Let A ∈ R m × n be a given data matrix and let r = min { m, n } . Moreover, let Ω denote asubset of the indices of A . The goal of the general matrix completion problem is to find alow rank approximation of A that is close to A in some sense.A nonconvex formulation (see, for example, Yao and Kwok, 2017) of this problem ismin X ∈ R m × n (cid:26) k P Ω ( X − A ) k F + ( R ◦ σ )( X ) (cid:27) , where P Ω is the function that zeros out the entries of its input that are not in Ω. Given a vector x ∈ R n , let x [ ω ] denote its discrete Fourier transform for some frequency ω .Moreover, for some unknown noisy signal ˜ x ∈ R n and a frequency set Ω ⊆ R + , suppose thatwe are given measurements {| ˜ x [ ω ] |} ω ∈ Ω and vectors a ω ∈ C n such that | h a ω , ˜ x i | = | ˜ x [ ω ] | for every ω ∈ Ω. The goal of the phase retrieval problem is to recover an approximation x of ˜ x such that | h a ω , x i | ≈ | h a ω , ˜ x i | for every ω ∈ Ω.A nonconvex formulation of this problem ismin X ∈ R | Ω |×| Ω | (cid:26) kA ( X ) − b k + ( R ◦ λ )( X ) : X (cid:23) (cid:27) , where λ denotes the function that maps matrices to their eigenvalue vector and the quan-tities A : R | Ω |×| Ω | R | Ω | and b ∈ R | Ω | are given by[ A ( X )] ω = tr( a ω a ∗ ω X ) , b ω = | ˜ x [ ω ] | , ∀ ( X, ω ) ∈ R | Ω |×| Ω | × Ω . In particular, this formulation is a generalization of the one in (Candes et al., 2015) wherethe convex function tr X is replaced with the nonconvex function R . ptimization of Nonconvex Spectral Functions Let c M ∈ R m × n be a given data matrix and let r = min { m, n } . The goal of the robustprincipal component analysis problem is to find an approximation M + E of c M where M is low-rank and E is sparse.A nonconvex formulation of this problem ismin M,E ∈ R m × n (cid:26) k c M − ( M + E ) k F + ( R ◦ σ )( M ) + P ( E ) (cid:27) . In particular, this formulation is a instance of the one in (Wen et al., 2019) where morestructure is imposed on the functions R and P .
2. Background Material
Recall from Section 1 that our interest is in solving (1) by repeated solving a sequence ofprox subproblems as in (2). This section presents some background material regarding (2).This section considers the nonconvex composite optimization (NCO) problemmin u ∈Z { ψ ( u ) := ψ s ( u ) + ψ n ( u ) } , (6)where Z is a finite dimensional inner product space and the functions ψ s and ψ n are assumedto satisfy the following assumptions:(B1) ψ n ∈ Conv Z ;(B2) ψ s is continuously differentiable on Z and satisfies µ k u − y k ≤ ψ s ( u ) − [ ψ s ( y ) + h∇ ψ s ( y ) , u − y i ] ≤ M k u − y k for some ( µ, M ) ∈ R and every u, y ∈ Z .Clearly, problems (1) and (2) are special cases of (6), and hence any definition or resultthat is stated in the context of (6) applies to (1) and/or (2).An important notion of an approximate solution of (6) is as follows: given ˆ ρ >
0, a pair( y r , v r ) is said to be a ˆ ρ –approximate solution of (6) if v r ∈ ∇ ψ s ( y r ) + ∂ψ n ( y r ) , k v r k ≤ ˆ ρ. (7)In Section 3, we develop prox-type methods for finding ˆ ρ –approximate solutions of (1) thatrepeatedly solve (2) inexactly by taking advantage of its spectral decomposition.We now discuss the inexactness criterion under which the subproblems (2) are solved.Again, the criterion is described in the context of (6) as follows. Problem A : Given ( µ, σ ) ∈ R and z ∈ Z , find ( y, v, ε ) ∈ dom ψ × Z × R + suchthat v ∈ ∂ ε (cid:18) ψ − µ k · − y k (cid:19) ( y ) , k v k + 2 ε ≤ σ k y − z k . (8) ong and Monteiro We begin by making three remarks about the above problem. First, if ( y, v, ε ) solvesProblem A with σ = 0, then ( v, ε ) = (0 , z is an exact solution of (6). Hence, theoutput ( y, v, ε ) of Problem A can be viewed as an inexact solution of (6) when σ ∈ R ++ .Second, the input z is arbitrary for the purpose of this section. However, the two methodsdescribed in Section 3 for solving (1) repeatedly solve (2) according to Problem A with theinput z at the k th iteration determined by the iterates generated at the ( k − th iteration.Third, defining the function∆ µ ( u ; y, v ) := ψ ( y ) − ψ ( u ) − h v, u − y i + µ k u − y k ∀ u ∈ dom ψ, (9)another way to express the inclusion in (8) is ∆ µ ( u ; y, v ) ≤ ε for every u ∈ dom ψ . Finally,the R-ACG algorithm presented later in this subsection will be shown to solve Problem A when ψ s is convex. Moreover, it solves a weaker version of Problem A involving ∆ µ (seeProblem B later on) whenever ψ s is not convex and as long as some key inequalities aresatisfied during its execution.A technical issue in our analysis in this paper lies in the ability of refining the outputof Problem A to an approximate solution ( y r , v r ) of (6), i.e., one satisfying the inclusion in(7), in which k v r k is nicely bounded. We now present a refinement procedure that addressesthis issue. Refinement ProcedureInput : a triple (
M, ψ s , ψ n ) satisfying (B1)–(B2) and a pair ( y, v ) ∈ dom ψ n × Z ; Output : a pair ( y r , v r ) satisfying the inclusion in (7); Step set the quantities y r = argmin u ∈Z (cid:26) h∇ ψ s ( y ) − v, u i + M k u − y k + ψ n ( u ) (cid:27) , (10) v r = v + M ( y − y r ) + ∇ ψ s ( y r ) − ∇ ψ s ( y ) , (11)and output ( y r , v r ).The result below presents the key properties of the above procedure. For the sake ofbrevity, we write ( y r , v r ) = RP ( y, v ) to indicate that the pair ( y r , v r ) is the output of theabove procedure with inputs ( M, ψ s , ψ n ) and ( y, v ). Proposition 1.
Let ( M, ψ s , ψ n ) satisfying assumptions (B1)–(B2) and a triple ( y, v, ε ) ∈ dom ψ n × Z ∈ R + be given. Moreover, let ( y r , v r ) = RP ( y, v ) , denote L ψ s ( · , · ) simply by L ( · , · ) where L ψ s ( · , · ) is as in (5) , and let ∆ µ be as in (9) . Then, the following statementshold:(a) v r ∈ ∇ ψ s ( y r ) + ∂ψ n ( y r ) ;(b) for every s ∈ dom ψ n we have ∆ µ ( u ; y, v ) ≥ and, in particular, ∆ µ ( y r ; y, v ) ≥ M k y r − y k ; (12) ptimization of Nonconvex Spectral Functions (c) if ∆ µ ( y r ; y, v ) ≤ ε and ( y, v, ε ) satisfies the inequality in (8) , then k v r k ≤ σ (cid:20) M + L ( y, y r ) √ M (cid:21) k y − z k ; (13) (d) if ( y, v, ε ) solves Problem A , then ∆ µ ( u ; y, v ) ≤ ε for every u ∈ dom ψ n , and, as aconsequence, bound (13) holds.Proof. (a) Using the definition of v r and the optimality of y r , we have that v r = v + M ( y − y r ) + ∇ ψ s ( y r ) − ∇ ψ s ( y ) ∈ ∇ ψ s ( y r ) + ∂ψ n ( y r ) . (b) The fact that ∆ µ ( u ; y, v ) ≥ u ∈ dom ψ n follows from the optimality of y r and the fact that ψ s ≤ ‘ ψ s ( · ; y ) + M k · − y k /
2. The bound (12) follows from Proposition 19with (
L, g, h ) = (
M, ψ s − h v, ·i , ψ n ).(c) Using the assumption that ∆ µ ( y r ; y, v ) ≤ ε , part (b), and the inequality in (8), wehave that k y − y r k ≤ s µ ( y r ; y, v ) M ≤ r εM ≤ σ √ M k y − z k . (14)Using the triangle inequality, the definition of L ( · , · ), (14) and the inequality in (8) again,we conclude that k v r k ≤ k v k + [ M + L ( y, y r )] k y − y r k ≤ σ (cid:20) M + L ( y, y r ) √ M (cid:21) k y − z k . (d) The fact that ∆ µ ( u ; y, v ) ≤ ε for every u ∈ dom ψ n follows immediately from theinclusion in (8) and the definition of ∆ µ in (9). The fact that (13) holds now follows frompart (c).We make a few remarks about Proposition 1. First, it follows from (a) that ( y r , v r )satisfies the inclusion in (7). Second, it follows from (a) and (c) that if σ = 0, then( y r , v r ) = (0 , y r is an exact stationary point of (6). In general, (13) impliesthat the residual k v r k is directly proportional to k y − w k , and hence, becomes smaller asthis quantity approaches zero .Inequality (13) plays an important technical role in the complexity analysis of the twoprox-type methods of Section 3. Sufficient conditions for its validity are provided in (c)and (d), with (c) being the weaker one, in view of (d). When ψ s is convex, it is shownthat every iterate of the R-ACG algorithm presented below always satisfies the inclusionin (8), and hence, verifying the the validity of the sufficient condition in (c) amounts tosimply checking whether the inequality in (8) holds. When ψ s is not convex, verificationof the inclusion in (8), and hence the sufficient condition in (d), is generally not possible,while the one in (c) is. This is a major advantage of the sufficient condition in (c), whichis exploited in this paper towards the development of adaptive prox-type methods whichattempt to approximately solve (6) when ψ s is not convex.For the sake of future reference, we now state the following problem for finding a triple( y, v, ε ) satisfying the sufficient condition in Proposition 1(c). Its statement relies on therefinement procedure preceding Proposition 1. ong and Monteiro Problem B : Given the same inputs as in Problem A , find ( y, v, ε ) ∈ dom ψ × Z × R + satisfying the inequality in (8) and ∆ µ ( y r ; y, v ) ≤ ε, (15)where ∆ µ ( · ; · , · ) is as in (9) and y r is the first component of the refined pair ( y r , v r ) = RP ( y, v ).We now state the aforementioned R-ACG algorithm which solves Problem A when ψ s is convex and solves Problem B whenever ψ s is not convex and two key inequalities aresatisfied, one at every iteration (i.e., (16)) and one at the end of its execution. R-ACG AlgorithmInput : a quadruple ( µ, M, ψ s , ψ n ) satisfying (B1)–(B2) and a pair ( σ, z ); Output : a triple ( y, v, ε ) that solves Problem B or a failure status; Step define ψ := ψ s + ψ n and set z c = z , B = 0, Γ ≡
0, and j = 1; Step compute the iterates ξ j − = 1 + µB j − M − µ , b j − = ξ j − + q ξ j − + 4 ξ j − B j − ,B j = B j − + b j − , ˜ z j − = B j − B j z j − + b j − B j z cj − ,z j = argmin u ∈Z (cid:26) l ψ s ( u ; ˜ z j − ) + ψ n ( u ) + M k u − ˜ z j − k (cid:27) ,z cj = 11 + µB j (cid:20) z cj − − b j − M − µ (˜ z j − − z j ) + µ ( B j − z cj − + a j − z j ) (cid:21) ; Step compute the quantities˜ γ j = l ψ s ( · ; ˜ z j − ) + ψ n + µ k · − ˜ z j − k γ j = ˜ γ j ( z j ) + 1 M − µ h ˜ z j − − z j , · − z j i + µ k · − z j k , Γ j = B j − B j Γ j − + b j − B j γ j − , r j = z c − z cj B j + µ ( z cj − z j ) ,η j = ψ ( z j ) − Γ j ( z cj ) − D r j , z j − z cj E + µ k z j − z cj k . Step if the inequality k B j r j + z j − z k + 2 B j η j ≤ k z j − z k (16)holds, then go to step 4; otherwise, stop with a failure status; Step if the inequality k r j k + 2 η j ≤ σ k z j − z k , (17)holds, then go to step 5; otherwise, go to step 1; ptimization of Nonconvex Spectral Functions Step set ( y, v, ε ) = ( z j , r j , η j ) and compute ( y r , v r ) = RP ( z j , r j ); if the condition∆ µ ( y r ; y, v ) ≤ ε, holds then stop with a success status and output the triple ( y, v, ε ); otherwise, stop with a failure status;It is well-known that the scalar B j updated in step 1 satisfies B j ≥ M max ( j , (cid:18) r µ M (cid:19) j − ) ∀ j ≥ . (18)The next result presents the key properties about the R-ACG algorithm. Proposition 2.
The R-ACG algorithm has the following properties:(a) it stops with either failure or success in at most & s Mµ ! log +1 (cid:16) K σ √ M (cid:17)’ (19) iterations, where K σ := 1 + √ /σ ;(b) if it stops with success, then its output ( y, v, ε ) solves Problem B ;(c) if ψ s is convex then it always stops with success and its output ( y, v, ε ) solvesProblem A .Proof. (a) See Appendix B.(b) This follows from the successful checks in step 4 and 5 of the algorithm.(c) The fact that the algorithm never stops with failure follows from Proposition 20(c)–(d) in Appendix B. The fact that the the algorithm stops with success follows the previousstatement, the successful checks in step 4 and 5 of the algorithm, and the fact that thealgorithm stops in a finite number of iterations in part (a).
3. Inexact Composite Gradient Methods
This section presents the ICG methods and the general problem that they solve. It containsthree subsections. The first one presents Problem of interest and gives a general outline ofthe ICG methods, the second one presents the IA-ICG method, and the third one presentsthe DA-ICG method.For the ease of presentation, the proofs of this section are deferred to Section 6.
This subsection describes Problem that the ICG methods solve and outlines their structure.The ICG methods consider the NCO problemmin u ∈Z [ φ ( u ) := f ( u ) + f ( u ) + h ( u )] (20)where Z is an finite dimensional inner product space and the functions f , f , and h areassumed to satisfy the following assumptions: ong and Monteiro (A1) h ∈ Conv Z ;(A2) f , f are continuously differentiable functions and there exists ( m , M ) ∈ R and( m , M ) ∈ R such that, for i ∈ { , } , we have − m i k u − y k ≤ f i ( u ) − ‘ f i ( u ; y ) ≤ M i k u − y k ∀ u, y ∈ dom h ; (21)(A3) for i ∈ { , } , we have k∇ f i ( u ) − ∇ f i ( y ) k ≤ L i k u − y k ∀ u, y ∈ dom h, where L i := max {| m i | , | M i |} ;(A4) φ ∗ := inf u ∈Z φ ( u ) > −∞ .Note that assumption (A2) implies that assumption (A3) holds when the interior of dom h is nonempty. Under the above assumptions, the ICG methods find an approximate solution(ˆ y, ˆ v ) of (20) as in (7) with ψ s = f + f and ψ n = h , i.e.ˆ v ∈ ∇ f (ˆ y ) + ∇ f (ˆ y ) + ∂h (ˆ y ) , k ˆ v k ≤ ˆ ρ. (22)We now outline the ICG methods. Given a starting point y ∈ dom ψ n and a specialstepsize λ >
0, each method continually calls the R-ACG algorithm of Section 2 to find anapproximate solution of a prox-linear form of (20). More specifically, each R-ACG call isused to tentatively find an approximate solution ofmin u ∈Z (cid:20) ψ ( u ) = λ [ ‘ f ( u ; z ) + f ( u ) + h ( u )] + 12 k u − z k (cid:21) , (23)for some reference point z . For the IA-ICG method, the point z is y for the first R-ACG call and is the last obtained approximate solution for the other R-ACG calls. For theDA-ICG method, the point z is chosen in an accelerated manner.From the output of the k th R-ACG call, a refined pair (ˆ y, ˆ v ) = (ˆ y k , ˆ v k ) is generatedwhich: (i) always satisfies the inclusion of (22); and (ii) is such that min i ≤ k k ˆ v i k → k → ∞ . More specifically, this refined pair is generated by applying the refinementprocedure of Section 2 and adding some adjustments to the resulting output to conformwith our goal of finding an approximate solution as in (22). For the ease of future reference,we now state this specialized refinement procedure. Before proceeding, we introduce theshorthand notation M + i := max { M i , } , m + i := max { m i , } , L i ( x, y ) := L f i ( x, y ) , for i ∈ { , } , to keep its presentation (and future results) concise. Specialized Refinement ProcedureInput : a quadruple ( M , f , f , h ) satisfying (A1)–(A2), a scalar λ >
0, and a triple( y, v, z ) ∈ dom ψ n × Z × Z ; Output : a pair (ˆ y, ˆ v ) satisfying the inclusion of (22); Step compute (ˆ y, v r ) = RP ( y, v ) using the refinement procedure in Section 2 with M = λM +2 + 1 , ψ s = λ [ ‘ f ( · ; z ) + f ] + 12 k · − z k , ψ n = λh ; ptimization of Nonconvex Spectral Functions Step compute the residualˆ v = 1 λ ( v r + z − y ) + ∇ f (ˆ y ) − ∇ f ( z ) , and output (ˆ y, ˆ v ).The result below states some properties about the above procedure. For the sake ofbrevity, we write (ˆ y, ˆ v ) = SRP ( y, v, y ) to indicate that the pair (ˆ y, ˆ v ) is the output of theabove procedure with inputs ( M , f , f , h ), λ , and ( y, v, z ). Lemma 3.
Let ( m , M ) , ( m , M ) , and ( f , f , h ) satisfying assumptions (A1)–(A3) and aquadruple ( z , y, v, ε ) ∈ Z × dom ψ n × Z ∈ R + be given. Moreover, let (ˆ y, ˆ v ) = SRP ( y, v, y ) and define C λ ( x, y ) := 1 + λ h M +2 + L ( x, y ) + L ( x, y ) iq λM +2 , (24) for every x, y ∈ Z . Then, the following statements hold:(a) ˆ v ∈ ∇ f (ˆ y ) + ∇ f (ˆ y ) + ∂h (ˆ y ) ;(b) if ( y, v, ε ) solves Problem B with ( µ, ψ s , ψ n ) as in (26) , then k ˆ v k ≤ (cid:20) L ( y, w ) + 2 + σC λ ( y, ˆ y ) λ (cid:21) k y − z k . It is worth recalling from Section 1 that in the applications we consider, the cost of the R-ACG call is small compared to SVD computation performed that is performed before solvingeach subproblem as in (23). Hence, in the analysis that follows, we present complexityresults related to the number of subproblems solved rather than the total number of R-ACGiterations. We do note, however, that the number of R-ACG iterations per subproblem isfinite in view of Proposition 2(a).
This subsection presents the static IA-ICG method and its (titular) dynamic variant.We first state the static IA-ICG method.
Static IA-ICG MethodInput : function triple ( f , f , h ) and scalar quadruple ( m , M , m , M ) ∈ R satisfying(A1)–(A4), tolerance ˆ ρ >
0, initial point y ∈ dom h , and scalar pair ( λ, σ ) ∈ R ++ × (0 , λM + σ ≤
12 ; (25)
Output : a pair (ˆ y, ˆ v ) satisfying (22) or a failure status; Step let ∆ ( · ; · , · ) be as in (9) with µ = 1, and set k = 1; ong and Monteiro Step use the R-ACG algorithm to tentatively solve Problem B associated with (23), i.e.,with inputs ( µ, M, ψ s , ψ n ) and ( σ, z ) where the former is given by µ = 1 , M = λM +2 + 1 ,ψ s = λ [ ‘ f ( · ; z ) + f ] + 12 k · − z k , ψ n = λh, (26)and z = y k − ; if the R-ACG stops with failure , then stop with a failure status;otherwise, let ( y k , v k , ε k ) denote its output and go to step 2; Step if the inequality ∆ ( y k − ; y k , v k ) ≤ ε k holds, then go to step 3; otherwise, stop with a failure status; Step set (ˆ y k , ˆ v k ) = SRP ( y k , v k , y k − ); if k ˆ v k k ≤ ˆ ρ then stop with a success status and output (ˆ y, ˆ v ) = (ˆ y k , ˆ v k ); otherwise, update k ← k + 1 and go to step 1.Note that the static IA-ICG method may fail without obtaining a pair satisfying (22). InProposition 4(c) below, we state that a sufficient condition for the method to stop success-fully is that f be convex. This property will be important when we present the (dynamic)IA-ICG method, which: (i) repeatedly calls the static method; and (ii) incrementally trans-fers convexity from f to f between each call until a successful termination is achieved.We now make some additional remarks about the above method. First, it performs twokinds of iterations, namely, ones that are indexed by k and ones that are performed by theR-ACG algorithm. We refer to the former kind as outer iterations and the latter kind asinner iterations. Second, in view of (25), if M > < λ < (1 − σ ) / (2 M ) whereasif M ≤ < λ < ∞ .The next result summarizes some facts about the static IA-ICG method. Before pro-ceeding, we first define some useful quantities. For λ > u, w ∈ Z , define e ‘ φ ( u ; w ) := ‘ f ( u ; w ) + f ( u ) + h ( u ) , C λ := 1 + λ ( M +2 + L + L ) q λM +2 . (27) Theorem 4.
The following statements hold about the static IA-ICG method:(a) it stops in O " √ λL + 1 + σC λ √ λ (cid:20) φ ( z ) − φ ∗ ˆ ρ (cid:21) (28) outer iterations, where φ ∗ is as in (A4);(b) if it stops with success, then its output pair (ˆ y, ˆ v ) is a ˆ ρ –approximate solution of (20) ;(c) if f is convex, then it always stops with success. We now make three remarks about the above results. First, if σ = O (1 /C λ ) then (28)reduces to O (cid:20) √ λL + 1 √ λ (cid:21) (cid:20) φ ( z ) − φ ∗ ˆ ρ (cid:21)! . (29)Moreover, comparing the above complexity to the iteration complexity of the ECG methoddescribed in Section 1, which is known (see, for example, Monteiro et al., 2012) to obtain ptimization of Nonconvex Spectral Functions an approximate solution of (20) in O (cid:20) √ λ ( L + L ) + 1 √ λ (cid:21) (cid:20) φ ( z ) − φ ∗ ˆ ρ (cid:21)! (30)iterations, we see that (29) is smaller than (30) in magnitude when L is large. Second,Theorem 4(b) shows that if the method stops with success, regardless of the convexity of f , then its output pair (ˆ y, ˆ v ) is always an approximate solution of (20). Third, in view ofProposition 10, the quantities L and C λ in all of the previous complexity results can bereplaced by their averaged counterparts in (43). As these averaged quantities only dependon { ( y i , ˆ y i ) } ki =1 , we can infer that the static IA-ICG method adapts to the local geometryof its input functions.We now state the (titular) dynamic IA-ICG method that resolves the issue of failure inthe static IA-ICG method. IA-ICG MethodInput : the same as the static IA-ICG method but with an additional parameter ξ > Output : a pair (ˆ y, ˆ v ) satisfying (22); Step set ξ = ξ , ‘ = 1, and f = f − ξ k · k , f = f + ξ k · k ,m = m + ξ, M = M − ξ, m = m − ξ, M = M + ξ ; (31) Step call the static IA-ICG method with inputs ( f , f , h ), ( m , M , m , M ), ˆ ρ , y , and( λ, σ ); Step if the static IA-ICG call stops with a failure status, then set ξ = 2 ξ , update thequantities in (31) with the new value of ξ , increment ‘ = ‘ + 1, and go to step 1;otherwise, let (ˆ y, ˆ v ) be the output pair returned by the static IA-ICG call, stop , and output this pair.Some remarks about the above method are in order. First, in view of (25) and thefact that M is monotonically decreasing, the parameter λ does not need to be changedfor each IA-ICG call. Second, in view of assumption (A2) and Theorem 4(c), the IA-ICGcall in step 1 always terminates with success whenever m ≤
0. As a consequence, thetotal number of IA-ICG calls is at most l log(2 m +2 /ξ ) m . Third, in view of the secondremark and Theorem 4(b), the methods always obtains a ˆ ρ –approximate solution of (20)in a finite number of IA-ICG outer iterations. Finally, in view of second remark again, thetotal number of IA-ICG outer iterations is as in Theorem 4(a) but with: (i) an additionalmultiplicative factor of l log(2 m +2 /ξ ) m ; and (ii) the constants m and M replaced with( m + 2 m +2 ) and ( M + 2 m +2 ), respectively. It is worth mentioning that a more refinedanalysis, such as the one in (Kong et al., 2020), can be applied in order to remove the factorof l log(2 m +2 /ξ ) m from the previously mentioned complexity. ong and Monteiro This subsection presents the static DA-ICG method, but omits its (titular) dynamic variantfor the sake of brevity. We do argue, however, that the dynamic variant can be stated inthe same way as the (dynamic) IA-ICG method of Subsection 6.1 but with the call to thestatic IA-ICG method replaced with a call to the static DA-ICG method of this subsection.We start by stating some additional assumptions. It is assumed that:(i) the set dom h is closed;(ii) there exists a bounded set Ω ⊇ dom h for which a projection oracle exists.We now state the static DA-ICG method. Static DA-ICG MethodInput : function triple ( f , f , h ) and scalar quadruple ( m , M , m , M ) ∈ R satisfying(A1)–(A4), tolerance ˆ ρ >
0, initial point y ∈ dom h , and scalar pair ( λ, σ ) ∈ R ++ × (0 , λM + σ ≤
12 ; (32)
Output : a pair (ˆ y, ˆ v ) satisfying (22) or a failure status; Step let ∆ ( · ; · , · ) be as in (9) with µ = 1, and set A = 0, x = y , and k = 1; Step compute the quantities a k − = 1 + p A k − , A k = A k − + a k − , ˜ x k − = A k − y k − + a k − x k − A k ; (33) Step use the R-ACG algorithm to tentatively solve Problem B associated with (23), i.e.,with inputs ( µ, M, ψ s , ψ n ) and ( σ, z ) where the former is as in (26) and z = ˜ x k − ; ifthe R-ACG stops with success , then let ( y ak , v k , ε k ) denote its output and go to step 3;otherwise, stop with a failure status; Step if the inequality ∆ ( y k − ; y ak , v k ) ≤ ε k holds, then go to step 4; otherwise, stop with a failure status; Step set (ˆ y k , ˆ v k ) = SRP ( y ak , v k , ˜ x k − ) where SRP ( · , · , · ) is described in Subsection 3.1;if k ˆ v k k ≤ ˆ ρ then stop with a success status and output (ˆ y, ˆ v ) = (ˆ y k , ˆ v k ); otherwise,compute x k = argmin u ∈ Ω k u − [ x k − − a k − ( v k + ˜ x k − − y ak )] k ,y k = argmin u ∈ { y k − ,y ak } [ f ( u ) + f ( u ) + h ( u )] , (34)update k ← k + 1, and go to step 1.Note that, similar to the static IA-ICG method, the static DA-ICG method may failwithout obtaining a pair satisfying (22). Proposition 5(c) shows that a sufficient conditionfor the method to stop successfully is that f be convex. Using arguments similar to the ones ptimization of Nonconvex Spectral Functions employed to derive the dynamic IA-ICG method, a dynamic version of DA-ICG method canalso be developed that repeatedly invokes the static DA-ICG in place of the static IA-ICG.We now make some additional remarks about the above method. First, it performs twokinds of iterations, namely, ones that are indexed by k and ones that are performed by theR-ACG algorithm. We refer to the former kind as outer iterations and the latter kind asinner iterations. Second, in view of the update for y k in (34), the collection of function values { φ ( y i ) } ki =0 is non-increasing. Third, in view of (32), if M > < λ < (1 − σ ) / (2 M )whereas if M ≤ < λ < ∞ .It is worth mentioning that the outer iteration scheme of the DA-ICG method is amonotone and inexact generalization of the accelerated gradient (AG) method in (Ghadimiand Lan, 2016). More specifically, the AG method can be viewed as a version of the DA-ICG method where: (i) σ = 0; (ii) the R-ACG algorithm in step 2 is replaced by an exactsolver of (23); and (iii) the update of x k in (34) is replaced by an update involving proxevaluation of the function a k − ( f + h ). Hence, the DA-ICG method can be significantlymore efficient when its R-ACG call is more efficient than an exact solver of (23) and/orwhen the projection onto Ω is more efficient than evaluating the prox of a k − ( f + h ).The next result summarizes some facts about the DA-ICG method. Before proceeding,we introduce the useful constants D h := sup u,z ∈ dom h k u − z k , D Ω := sup u,z ∈ Ω k u − z k , ∆ φ := φ ( y ) − φ ∗ ,d := inf u ∗ ∈Z {k y − u ∗ k : φ ( u ∗ ) = φ ∗ } , E λ,σ := √ λL + 1 + σC λ √ λ . (35) Theorem 5.
The following statements hold about the static DA-ICG method:(a) it stops in O E λ,σ [ m +1 D h + ∆ φ ]ˆ ρ + E λ,σ [ m +1 + 1 /λ ] / D Ω ˆ ρ ! (36) outer iterations;(b) if it stops with success, then its output pair (ˆ y, ˆ v ) is a ˆ ρ –approximate solution of (20) ;(c) if f is convex, then it always stops with success in O E λ,σ m +1 D h ˆ ρ + E λ,σ [ m +1 ] / D Ω ˆ ρ + E / λ,σ d / λ − / ˆ ρ / (37) outer iterations. We now make three remarks about the above results. First, in the “best” scenario ofmax { m , m } ≤
0, we have that (37) reduces to O (cid:20) L + 1 λ (cid:21) / " d / ˆ ρ / , ong and Monteiro which has a smaller dependence on ˆ ρ when compared to (29). In the “worst” scenario ofmin { m , m } >
0, if we take σ = O (1 /C λ ), then (36) reduces to O (cid:20) √ λL + 1 √ λ (cid:21) " m +1 D h + φ ( y ) − φ ∗ ˆ ρ , which has the same dependence on ˆ ρ as in (29). Second, part (c) shows that if the methodstops with an output pair (ˆ y, ˆ v ), regardless of the convexity of f , then that pair is alwaysan approximate solution of (20). Third, in view of Proposition 18, the quantities L and C λ in all of the previous complexity results can be replaced by their averaged counterpartsin (57). As these averaged quantities only depend on { ( y ai , ˆ y i , ˜ x i − ) } ki =1 , we can infer thatthe static DA-ICG method, like the static IA-ICG method of the previous subsection, alsoadapts to the local geometry of its input functions.
4. Exploiting the Spectral Decomposition
Recall that at every outer iteration of the ICG methods in Section 3, a call to the R-ACGalgorithm is made to tentatively solve Problem B (see Subsection 3.1) associated with (23).Our goal in this section is to present a significantly more efficient algorithm (based on theidea outlined in Section 1) for solving the same problem when the underlying problem ofinterest is (1).The content of this section is divided into two subsections. The first one presents theaforementioned algorithm, whereas the second one proves its key properties. This subsection presents an efficient algorithm for solving Problem B associated with (23).Throughout our presentation, we let Z represent the starting point given to the R-ACGalgorithm by the two ICG methods.We first state the aforementioned efficient algorithm. Spectral R-ACG AlgorithmInput : a quadruple ( M , f , f V , h V ) satisfying (A1)–(A3) with ( f , h ) = ( f V , h V ) and atriple ( λ, σ, Z ); Output : a triple (
Y, V, ε ) that solves Problem B associated with (23) or a failure status; Step compute Z λ := Z − λ ∇ f ( Z ) , (38)and a pair ( P, Q ) ∈ U m × U n satisfying Z λ = P [dg σ ( Z λ )] Q ∗ ; Step use the R-ACG algorithm to tentatively solve Problem B associated with (3), i.e.,with inputs ( µ, M, ψ V s , ψ V n ) and ( σ, z ) where the former is given by µ = 1 , M := λM +2 + 1 ,ψ V s := λf V − h σ ( Z λ ) , ·i + 12 k · k , ψ V n := λh V , (39)and z = Dg( P ∗ Z Q ); if the R-ACG stops with success , then let ( y, v, ε ) denote itsoutput and go to step 3; otherwise, stop with a failure status; ptimization of Nonconvex Spectral Functions Step set Y = P (dg y ) Q ∗ and V = P (dg v ) Q ∗ , and output the triple ( Y, V, ε ).We now make three remarks about the above algorithm. First, the matrices P and Q instep 1 can be obtained by computing an SVD of Z λ . Second, in view of Proposition 20(a)and the fact that ( µ, M ) in (39) and (26) are the same, the iteration complexity is thesame as the vanilla R-ACG algorithm. Finally, because the functions ψ V s and ψ V n in (39)have vector inputs over R r , the steps in the spectral R-ACG algorithm are significantly lesscostly than the ones in the R-ACG algorithm, which use functions with matrix inputs over R m × n .The following result, whose proof is in the next subsection, presents the key propertiesof this algorithm. Proposition 6.
The spectral R-ACG algorithm has the following properties:(a) if it stops with success, then its output triple ( Y, V, ε ) solves Problem B associatedwith (23) ;(b) if f is convex, then it always stops with success and its output ( Y, V, ε ) solvesProblem A associated with (23) . For the sake of brevity, let ( ψ s , ψ n ) be as in (26) and, using P and Q from the spectralR-ACG algorithm, define for every ( u, U ) ∈ R r × R m × n , the functions M ( u ) := P (dg u ) Q ∗ , V ( U ) := Dg( P ∗ U Q ) ,ψ ( U ) := ψ s ( U ) + ψ n ( U ) , ψ V ( u ) := ψ V s ( u ) + ψ V n ( u ) . The first result relates ( ψ s , ψ n ) to ( ψ V s , ψ V n ). Lemma 7.
Let ( y, v, ε ) and ( Y, V ) be as in the spectral R-ACG algorithm. Then, thefollowing properties hold:(a) we have ψ V n ( y ) = ψ n ( Y ) , ψ V s ( y ) + B λ = ψ s ( Y ) , where B λ := λf ( Z ) − λ h∇ f ( Z ) , Z i + k Z k F / ;(b) we have V ∈ ∂ ε (cid:18) ψ − k · − Y k F (cid:19) ( Y ) ⇐⇒ v ∈ ∂ ε (cid:18) ψ V − k · − y k (cid:19) ( y ) . (40) Proof. (a) The relationship between ψ V n ,and ψ n is immediate. On the other hand, using thedefinitions of Y, f , and B λ , we have ψ V s ( y ) + B λ = λf ( Y ) − h Z λ , Y i + 12 k Y k F + B λ = λ [ f ( Y ) + f ( Z ) + h∇ f ( Z ) , Y − Z i ] + 12 k Y − Z k F = ψ s ( Y ) . (b) Let S = V + Z λ − Y and s = v + σ ( Z λ ) − y , and note that S = M ( s ). Moreover,in view of part (a) and the definition of ψ , observe that the left inclusion in (40) is equivalent ong and Monteiro to S ∈ ∂ ε ( λ [ f + h ])( Y ). Using this observation, the fact that S and Y have a simultaneousSVD, and Theorem 23 with ( S, s ) = ( S , s ), Ψ = λ [ f + h ], and Ψ V = λ [ f V + h V ], we havethat the left inclusion in (40) is also equivalent to s ∈ ∂ ε ( λ [ f V + h V ])( y ). The conclusion nowfollows from the observing that the latter inclusion is equivalent to the the right inclusionin (40).We are now ready to give the proof of Proposition 6. Proof of Proposition 6. (a) Let ( y, v ) = ( V ( Y ) , V ( V )) and remark that the successful termi-nation of the algorithm implies that the inequality in (8) and (15) hold. Using this remark,the fact that k V k F = k v k , and the bound σ k z j − z k = σ (cid:16) k z j k − h z j , V ( z ) i + k Z k F (cid:17) + σ ( kV ( z ) k − k Z k F ) ≤ σ (cid:16) k Z j k − h Z j , Z i + k Z k F (cid:17) = σ k Z j − Z k F , (41)we then have that the inequality in (8) also holds with ( y, v ) = ( Y, V ).To show the corresponding inequality for (15), let ( Y r , V r ) = RP ( Y, V ) using the refine-ment procedure in Section 2. Moreover, let ( y r , v r ) = RP ( y, v ) and ∆ V ( · ; · , · ) be as in (9),where ( ψ s , ψ n ) = ( ψ V s , ψ V n ). It now follows from (10), (11), Lemma 22 with Ψ = ψ n and S = V + M Y − ∇ ψ s ( Y ), and Lemma 21(b) that Y r , Y, V , and V r have a simultaneous SVD.As a consequence of this, the first remark, and Lemma 7(a), we have that ε ≥ ∆ V ( y r ; y, v ) = ψ V ( y ) − ψ V ( y r ) − h v, y r − y i + 12 k y r − y k = ψ ( Y ) − ψ ( Y r ) − h V, Y r − Y i + 12 k Y r − Y k = ∆ ( Y r ; Y, V ) , and hence that (15) holds with ( y, v ) = ( Y, V ).(b) This follows from part (a), Proposition 2(c), and Lemma 7(b).
5. Computational Results
This section presents computational results that highlight the performance of the IA-ICGand DA-ICG methods, and it contains three subsections. The first one describes the im-plementation details, the second presents computational results related to a set of spectralcomposite problem, while the third gives some general comments about the computationalresults.
This subsection precisely describes the implementation of the methods and experiments ofthis section.We first describe some practical modifications to the IA-ICG method. Given λ > z j , z ) ∈ Z , denote ∆ λφ = 4 λ (cid:20) φ ( z ) − e ‘ φ ( z j ; z ) − M k z j − z k (cid:21) ptimization of Nonconvex Spectral Functions where e ‘ φ is as in (27). Motivated by the first inequality in the descent condition (42), werelax (17) in the R-ACG call to the three separate conditions: k z j − z k ≤ ∆ λφ , k r j k ≤ ∆ λφ ,and 2 η j ≤ ∆ λφ .We now describe some modifications and parameter choices that are common to bothmethods. First, both ICG methods use the spectral R-ACG algorithm of Subsection 4.1 inplace of the R-ACG algorithm of Section 2. Moreover, this R-ACG variant uses a line searchsubroutine for estimating the upper curvature M that is used during its execution. Second,when each of the dynamic ICG methods invoke their static counterparts, the parameters A and y are set to be the last obtained parameters of the previous invocation or the originalparameters if it is the first invocation, i.e., we implement a warm–start strategy. Third, weadaptively update λ at each outer iteration as follows: given the old value of λ = λ old atthe k th outer iteration, the new value of λ = λ new at the ( k + 1) th iteration is given by λ new = λ old , r k ∈ [0 . , . ,λ old · √ . , r k < . ,λ old · √ , r k > . , r k = h λ ( M +2 + 2 m +2 ) + 1 i k y k − ˆ y k kk ˆ v k − h λ ( M +2 + 2 m +2 ) + 1 i ( y k − ˆ y k ) k . Fourth, we take µ = 1 / µ = 1 for each of R-ACG calls in order to reduce thepossibility of a failure from the R-ACG algorithm. Fifth, in view of (41), we relax condition(17) in the vector-based R-ACG call of Subsection 4.1 to k r j k + 2 η j ≤ σ k z j − z k + τ, where τ := σ ( k Z k F − k z k ) ≥
0. Finally, both ICG methods choose the common hyper-parameters ( ξ , λ, σ ) = ( M , /M , /
2) at initialization.We now describe the two other benchmark methods considered. The first is the ECGmethod described in Section 1 with λ = 1 /M . The second is a modification of the acceler-ated gradient (AG) method that was proposed and analyzed in (Ghadimi and Lan, 2016).More specifically, the implementation is a modification of Algorithm 2 in (Ghadimi andLan, 2016, Section 2) in which α k = 2 / ( k + 1), β k = 1 / [2( M + M )], and λ k = kβ k / k ≥ f = f V ◦ σ and h = h V ◦ σ . Second, given a tolerance ˆ ρ > Y ∈ dom h ,every method in this section seeks a pair (ˆ Y, ˆ V ) ∈ dom h × R m × n satisfyingˆ V ∈ ∇ f (ˆ Y ) + ∇ ( f V ◦ σ )(ˆ Y ) + ∂ ( h V ◦ σ )(ˆ Y ) , k ˆ V kk∇ f ( Y ) + ( f V ◦ σ )( Y ) k + 1 ≤ ˆ ρ. Finally, all described algorithms are implemented in MATLAB 2020a and are run on Linux64-bit machines that contain at least 8 GB of memory.
This subsection presents computational results of a set of spectral composite optimizationproblems and contains two sub-subsections. The first one examines a class of nonconvexmatrix completion problems, while the second one examines a class of blockwise matrixcompletion problems. ong and Monteiro Name m n % nonzero min i,j A ij max i,j A ij Jester
506 9437 10.50% 1 10MovieLens 100K
610 9724 1.70% 0.5 5FilmTrust Table 1: Description of the MC data matrices A ∈ R m × n . Given a quadruple ( α, β, µ, θ ) ∈ R , a data matrix A ∈ R m × n , and indices Ω, this subsec-tion considers the following constrained matrix completion (MC) problem:min U ∈ R m × n k P Ω ( U − A ) k F + κ µ ◦ σ ( U ) + τ α ◦ σ ( U )s.t. k U k F ≤ √ mn · max i,j | A ij | , where P Ω is the linear operator that zeros out any entry that is not in Ω and κ µ ( z ) = µβθ n X i =1 log (cid:18) | z i | θ (cid:19) , τ α ( z ) = αβ " − exp − k z k θ ! for every z ∈ R n . Here, the function κ µ + τ α is a nonconvex generalization of the convexelastic net regularizer (see, for example, Sun and Zhang, 2012), and it is well-known (see,for example, Yao and Kwok, 2017) that the function κ µ − µ k · k ∗ is concave, differentiable,and has a (2 βµ/θ )-Lipschitz continuous gradient.We now describe the different data matrices that are considered. Each matrix A ∈ R m × n is obtained from a different collaborative filtering system where each row represents a uniqueuser, each column represents a unique item, and each entry represents a particular rating.Table 1 lists the names of each data set, where the data originates from (in the footnotes),and some basic statistics about the matrices.We now describe the experiment parameters considered. First the starting point Z israndomly generated from a shifted binomial distribution that closely follows the data matrix A . More specifically, the entries of Z are distributed according to a Binomial ( n, µ/n ) − A distribution, where µ is the sample average of the nonzero entries in A , the integer n isthe ceiling of the range of ratings in A , and A is the minimum rating in A . Second, thedecomposition of the objective function is as follows f = 12 k P Ω ( · − A ) k F , f V = µ (cid:20) κ µ ( · ) − βθ k · k (cid:21) + τ α ( · ) , h V = µβθ k · k + δ F ( · ) ,
1. See the ratings in the file “jester_dataset_1_1.zip” from http://eigentaste.berkeley.edu/dataset/ .2. See a subset of the ratings from where each user has rated at least 720 items.3. See the ratings in the file “ml-latest-small.zip” from https://grouplens.org/datasets/movielens/ .4. See the ratings in the file “ratings.txt” under the FilmTrust section in .5. See the ratings in the file “ml-1m.zip” from https://grouplens.org/datasets/movielens/ . ptimization of Nonconvex Spectral Functions where F = { U ∈ R m × n : k U k F ≤ √ mn · max i,j | A ij |} is the set of feasible solutions. Third,in view of the previous decomposition, the curvature parameters are set to be m = 0 , M = 1 , m = 2 βµθ + 2 αβθ exp (cid:18) − θ (cid:19) , M = αβθ , where it can be shown that the smallest and largest eigenvalues of ∇ τ α ( z ) are boundedbelow and above by − αβ exp( − θ/ /θ and αβ/θ , respectively, for every z ∈ R n . Fourth,each problem instance uses a specific data matrix A from Table 1, the hyperparameters( α, β, µ, θ ) = (10 , , ,
1) and ˆ ρ = 10 − , and Ω to be the index set of nonzero entries in thechosen matrix A . Finally, a cutoff time of 10800 seconds is used for the MovieLens 1M dataset and a cutoff time of 7200 seconds is used for the other data sets.We now present the results. Figure 1 contains the plots of the log objective functionvalue against the runtime, listed in increasing order of the smallest dimension in the datamatrix. Figure 1: Function value vs. runtime for the MC problems. Given a quadruple ( α, β, µ, θ ) ∈ R , a block decomposable data matrix A ∈ R m × n withblocks { A i } ki =1 ⊆ R p × q , and indices Ω, this subsection considers the following constrainedblockwise matrix completion (BMC) problem:min U ∈ R m × n k P Ω ( U − A ) k F + k X i =1 [ κ µ ◦ σ ( U i ) + τ α ◦ σ ( U i )]s.t. k U k F ≤ √ mn · max i,j | A ij | , ong and Monteiro where P Ω , κ µ , and τ α are as in Subsection 5.2.1 and U i ∈ R p × q is the i th block of U withthe same indices as A i with respect to A .We now describe the two classes of data matrices that are considered. Every data matrixis a 5-by-5 block matrix consisting of 50-by-100 sized submatrices. Every submatrix containsonly 25% nonzero entries and each data matrix generates its submatrix entries from differentprobability distributions. More specifically, for a sampled probability p ∼ Uniform [0 , Binomial ( n, p ) distribution with n = 10, whilethe other uses a TruncatedNormal ( µ, σ ) distribution with µ = 10 p , σ = 10 p (1 − p ),and upper and lower bounds 0 and 10, respectively.We now describe the experiment parameters considered. First, the the decompositionof the objective function and the quantities Z , ( m , M ), ( m , M ), ˆ ρ , and Ω are the sameas in Subsection 5.2.1. Second, we fix ( β, θ ) = (20 ,
1) and vary ( α, µ, A ) across the differentproblem instances. Finally, a cutoff time of 7200 seconds is used for all of Problem instancestested.We now present the results. Figure 2 contains the plots of the log objective functionvalue against the runtime for the binomial data set, listed in increasing order of M . Thecorresponding plots for the truncated normal data set are similar to the binomial plots sowe omit them for the sake of brevity. Tables 2 and 3 respectively contain the last functionvalues of each algorithm for the binomial and truncated normal data sets, listed in increasingorder of M . Moreover, each row of these tables corresponds to a different choice of ( µ, α )and the bolded numbers highlight which algorithm performed the best in terms of the lastfunction value.Figure 2: Function value vs. runtime for the binomial BMC problems.Parameters Last Function Value ( µ, α ) M ECG AG IA-ICG DA-ICG(1 , .
2) 20 2.13E+04 1.62E+04 1.61E+04 (10 ,
2) 200 2.15E+05 1.44E+05 2.19E+04 (100 ,
20) 2000 2.17E+06 8.24E+05 9.82E+04
Table 2: Last function values for the binomial BMC problems. ptimization of Nonconvex Spectral Functions Parameters Last Function Value ( µ, α ) M ECG AG IA-ICG DA-ICG(1 , .
2) 20 2.14E+04 8.92E+03 1.26E+04 (10 ,
2) 200 2.21E+05 1.75E+05 3.29E+04 (100 ,
20) 2000 2.27E+06 1.71E+06 1.06E+05
Table 3: Last function values for the truncated normal BMC problems.
From the results of the previous subsection, we make a few comments. First, the DA-ICGand IA-ICG methods are generally more efficient than the AG and ECG methods. Second,the DA-ICG method appears to be able to escape local minima more quickly than the othermethods. Third, the larger the constant M is, the more efficient the ICG methods arecompared to the benchmark methods. Finally, the larger the smallest dimension of thematrix space is, the more efficient the inexact methods are compared to the exact ones.
6. ICG Iteration Complexities
This section establishes the iteration complexities for each of the static ICG methods inSection 3.
This subsection establishes the key properties of the static IA-ICG method.
Lemma 8.
Let { ( y i , ˆ y i , ˆ v i ) } ki =1 be the collection of iterates generated by the static IA-ICGmethod. For every i ≥ , we have λ k y i − − y i k ≤ φ ( y i − ) − e ‘ φ ( y i ; y i − ) − M k y i − y i − k ≤ φ ( y i − ) − φ ( y i ) , (42) where e ‘ φ is as in (27) .Proof. Let i ≥ y i , v i , ε i ) be the point output by the i th successful callto the R-ACG algorithm. Moreover, let ∆ ( · ; · , · ) be as in (9) with ( ψ s , ψ n ) given by (26).Using the definition of e ‘ φ , step 2 of the method, and fact that ( y a , v, ε ) = ( y i , v i , ε i ) solvesProblem B in Section 2 with ( µ, ψ s , ψ n ) as in (26), we have that ε i ≥ ∆ ( y i − ; y i , v i ) = λ e ‘ φ ( y i ; y i − ) − λφ ( y i − ) − h v i , y i − y i − i + k y i − y i − k . Rearranging the above inequality and using assumption (A2), (25), and the fact that h a, b i ≥−k a k / − k b k / a, b ∈ Z yields λφ ( y i − ) − λ e ‘ φ ( y i ; y i − ) ≥ h v i , y i − − y i i − ε i + k y i − y i − k = 12 k y i − y i − k − (cid:16) k v i k + 2 ε i (cid:17) ≥ − σ ! k y i − y i − k ong and Monteiro = λM k y i − y i − k + − λM − σ ! k y i − y i − k = λM k y i − y i − k + 14 k y i − y i − k . Rearranging terms yields the first inequality of (42). The second inequality of (42) fol-lows from the first inequality, the fact that e ‘ φ ( y i ; y i − ) + M k y i − y i − k / ≥ φ ( y i ) fromassumption (A2), and the definition of e ‘ φ .The next results establish the rate at which the residual k ˆ v i k tend to 0. Lemma 9.
Let p > be given. Then, for every a, b ∈ R k , we have min ≤ i ≤ k {| a i b i |} ≤ k − p k a k k b k / ( p − . Proof.
Let p > a, b ∈ R k be fixed and let q ≥ p − + q − = 1. Usingthe fact that h x, y i ≤ k x k p k y k q for every x, y ∈ R k , and denoting ˜ a and ˜ b to be vectors withentries | a i | /p and | b i | /p , respectively, we have that k min ≤ i ≤ k {| a i b i |} /p ≤ k X i =1 | a i b i | /p ≤ k ˜ a k p k ˜ b k q = k a k /p k X i =1 | b i | q/p ! /q = (cid:16) k a k k b k q/p (cid:17) /p . Dividing by k , taking the p th power on both sides, and using the fact that p/q = p − ≤ i ≤ k {| a i b i |} ≤ k − p k a k k b k q/p = k − p k a k k b k / ( p − . Proposition 10.
Let { ( y i , ˆ y i , ˆ v i ) } ki =1 be as in Lemma 8 and define the quantities L avg1 ,k := 1 k k X i =1 L ( y i , y i − ) , C avg λ,k := 1 k k X i =1 C λ (ˆ y i , y i ) , (43) where C λ ( · , · ) and C λ are as in (24) and (27) , respectively. Then, we have min i ≤ k k ˆ v i k = O " √ λL avg1 ,k + 1 + σC avg λ,k √ λ φ ( z ) − φ ∗ k (cid:21) / ! + ˆ ρ . Proof.
Using Lemma 3 with ( y, w ) = ( y i , y i − ) and the fact that C λ ( · , · ) ≤ C λ and L ( · , · ) ≤ L , we have k ˆ v i k ≤ E i k y i − y i − k , for every i ≤ k , where E i := 2 + λL ( y i , y i − ) + σC λ (ˆ y i , y i ) λ ∀ i ≥ . ptimization of Nonconvex Spectral Functions As a consequence, using the sum of the second bound in Lemma 8 from i = 1 to k , thedefinitions in (43), and Lemma 9 with p = 3 / a i = E i , and b i = k y i − y i − k for i = 1 to k ,yields min i ≤ k k ˆ v i k ≤ min i ≤ k E i k y i − y i − k ≤ k / k X i =1 E i ! k X i =1 k y i − y i − k ! / = O " √ λL avg1 ,k + 1 + σC avg λ,k √ λ φ ( z ) − φ ∗ k (cid:21) / ! . We are now ready to give the proof of Theorem 4.
Proof of Theorem 4. (a) This follows from Proposition 10, the fact that C λ ( · , · ) ≤ C λ and L f ( · , · ) ≤ L , and the stopping condition in step 3.(b) The fact that (ˆ y, ˆ v ) = (ˆ y k , ˆ v k ) satisfies the inclusion of (22) follows from Lemma 3with ( y, v, w ) = ( y k , v k , y k − ). The fact that k ˆ v k ≤ ˆ ρ follows from the stopping condition instep 3.(c) This follows from Proposition 2(c) and the fact that method stops in finite numberof iterations from part (a). This subsection establishes several key properties of static DA-ICG method.To avoid repetition, we assume throughout this subsection that k ≥ { ( a i , A i , y i , y ai , x i , ˜ x i − , ˆ y i , ˆ v i , v i , ε i ) } ki =1 denote the sequence of all iterates generated by it up to and including the k th iteration.Observe that this implies that the i th DA-ICG outer iteration for any 1 ≤ i ≤ k is successful,i.e., the (only) R-ACG call in step 2 of the DA-ICG method does not stop with failure and∆ ( y i − ; y ai , v i ) ≤ ε i . Moreover, throughout this subsection we let e γ i ( u ) = ‘ f ( u ; ˜ x i − ) + f ( u ) + h ( u ) , γ i ( u ) = e γ i ( y ai ) + 1 λ h v i + ˜ x i − − y ai , u − y ai i . (44)The first set of results present some basic properties about the functions e γ i and γ i aswell as the iterates generated by the method. Lemma 11.
Let $\Delta_\mu(\cdot\,;\cdot,\cdot)$ be as in (9) with $(\psi_s, \psi_n)$ given by (26). Then, the following statements hold for any $s \in \operatorname{dom} h$ and $1 \leq i \leq k$:

(a) $\gamma_i(y^a_i) = \tilde\gamma_i(y^a_i)$;

(b) $x_i = \operatorname{argmin}_{u\in\Omega}\left\{\lambda a_{i-1}\gamma_i(u) + \|u - x_{i-1}\|^2/2\right\}$;

(c) $y^a_i - v_i = \operatorname{argmin}_{u\in\mathcal Z}\left\{\lambda\gamma_i(u) + \|u - \tilde x_{i-1}\|^2/2\right\}$;

(d) $-M_1\|u - \tilde x_{i-1}\|^2/2 \leq \tilde\gamma_i(u) - \phi(u) \leq m_1\|u - \tilde x_{i-1}\|^2/2$;

(e) $\phi(y_{i-1}) \geq \phi(y_i)$ and $\phi(y^a_i) \geq \phi(y_i)$.

Proof.
To keep the notation simple, denote
$$(y^a_+, y_+, y, \tilde x) = (y^a_i, y_i, y_{i-1}, \tilde x_{i-1}), \quad (x_+, x) = (x_i, x_{i-1}), \quad (A_+, A, a) = (A_i, A_{i-1}, a_{i-1}), \quad (v, \varepsilon) = (v_i, \varepsilon_i), \qquad (45)$$
and write $(\gamma, \tilde\gamma) = (\gamma_i, \tilde\gamma_i)$.

(a) This is immediate from the definitions of $\gamma_i$ and $\tilde\gamma_i$ in (44).

(b) Define $\hat x_+ := x - a(v + \tilde x - y^a_+)$. Using the definition of $\gamma_i$ in (44), we have that
$$\operatorname*{argmin}_{u\in\Omega}\left\{\lambda a\gamma(u) + \frac12\|u - x\|^2\right\} = \operatorname*{argmin}_{u\in\Omega}\left\{a\langle v + \tilde x - y^a_+, u - x\rangle + \frac12\|u - x\|^2\right\} = \operatorname*{argmin}_{u\in\Omega}\left\|u - \left(x - a\left[v + \tilde x - y^a_+\right]\right)\right\|^2 = \operatorname*{argmin}_{u\in\Omega}\|u - \hat x_+\|^2 = x_+.$$

(c) Using the definition of $\gamma_i$ in (44), we have that
$$\lambda\nabla\gamma(y^a_+ - v) + (y^a_+ - v) - \tilde x = (v + \tilde x - y^a_+) + (y^a_+ - v) - \tilde x = 0,$$
and hence the point $y^a_+ - v$ is the global minimum of $\lambda\gamma + \|\cdot - \tilde x\|^2/2$.

(d) This follows from assumption (A2) and the definition of $\tilde\gamma_i$ in (44).

(e) This follows immediately from the update rule for $y_i$ in (34).

Lemma 12.
Let $w = \tilde x_{i-1}$, let the pair $(\psi_n, \psi_s)$ be as in (26), and let $\Delta_\mu(\cdot\,;\cdot,\cdot)$ be as in (9) with $(\psi_s, \psi_n)$ given by (26). Then, the following statements hold:

(a) the triple $(y^a_i, v_i, \varepsilon_i)$ solves Problem B and satisfies $\Delta_\mu(y_{i-1}; y^a_i, v_i) \leq \varepsilon_i$, and hence
$$\|v_i\|^2 + 2\varepsilon_i \leq \sigma^2\|y^a_i - \tilde x_{i-1}\|^2, \qquad \Delta_\mu(u; y^a_i, v_i) \leq \varepsilon_i \quad \forall u \in \{\hat y_i, y_{i-1}\}; \qquad (46)$$

(b) if $f_1$ is convex, then $(y^a_i, v_i, \varepsilon_i)$ solves Problem A;

(c) $\Delta_\mu(s; y^a_i, v_i) = \lambda[\gamma_i(s) - \tilde\gamma_i(s)]$ for every $s \in \operatorname{dom} h$;

(d) $\Delta_\mu(y_i; y^a_i, v_i) \leq \varepsilon_i$.

Proof. (a) This follows from step 2 of the DA-ICG method and Proposition 2(b).

(b) This follows from steps 2 and 3 of the DA-ICG method, the fact that $h$ is convex, and Proposition 2(c) with $\psi_s + \psi_n = \lambda\tilde\gamma_i + \|\cdot - \tilde x_{i-1}\|^2/2$.

(c) Using the notation in (45) and the definitions of $(\psi_s, \psi_n)$ and $(\gamma_i, \tilde\gamma_i)$ in (26) and (44), respectively, we have that
$$\begin{aligned}
\Delta_\mu(s; y^a_+, v) &= (\psi_s + \psi_n)(y^a_+) - (\psi_s + \psi_n)(s) - \langle v, y^a_+ - s\rangle + \frac12\|s - y^a_+\|^2 \\
&= \left[\lambda\tilde\gamma(y^a_+) + \frac12\|y^a_+ - \tilde x\|^2\right] - \left[\lambda\tilde\gamma(s) + \frac12\|s - \tilde x\|^2\right] - \langle v, y^a_+ - s\rangle + \frac12\|s - y^a_+\|^2 \\
&= \left[\lambda\gamma(s) + \frac12\|s - \tilde x\|^2\right] - \left[\lambda\tilde\gamma(s) + \frac12\|s - \tilde x\|^2\right] = \lambda\gamma(s) - \lambda\tilde\gamma(s).
\end{aligned}$$

(d) If $y_i = y_{i-1}$, then this follows from step 3 of the method. On the other hand, if $y_i = y^a_i$, then this follows from part (c).

We now state (without proof) some well-known properties of $A_i$ and $a_{i-1}$.

Lemma 13.
For every $1 \leq i \leq k$, we have that:

(a) $a_{i-1}^2 = A_i$;

(b) $i^2/4 \leq A_i \leq i^2$.
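As an illustration (ours, not from the paper), the bounds in Lemma 13(b) can be checked numerically under the standard accelerated stepsize recursion consistent with Lemma 13(a), namely $A_0 = 0$ and $a_{i-1} = (1 + \sqrt{1 + 4A_{i-1}})/2$, the positive root of $a^2 - a - A_{i-1} = 0$, which yields $a_{i-1}^2 = A_{i-1} + a_{i-1} = A_i$.

import math

# Numerical check of Lemma 13(b): i^2/4 <= A_i <= i^2, assuming the standard
# update a_{i-1} = (1 + sqrt(1 + 4*A_{i-1})) / 2 implied by Lemma 13(a).
A = 0.0
for i in range(1, 10001):
    a = (1.0 + math.sqrt(1.0 + 4.0 * A)) / 2.0  # positive root of a^2 - a - A
    A += a                                      # A_i = A_{i-1} + a_{i-1} = a^2
    assert i ** 2 / 4.0 <= A <= i ** 2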
The next two lemmas are technical results that are needed to establish the key inequality in Proposition 16.

Lemma 14.
For every $u \in \operatorname{dom} h$ and $1 \leq i \leq k$, we have that
$$\frac12\left(A_{i-1}\|y_{i-1} - \tilde x_{i-1}\|^2 + a_{i-1}\|u - \tilde x_{i-1}\|^2\right) \leq 2D_\Omega^2 + a_{i-1}D_h^2.$$

Proof.
Throughout the proof, we use the notation in (45). Using the relation $(p+q)^2 \leq 2p^2 + 2q^2$ for every $p, q \in \mathbb R$, Lemma 13(a), the facts that $A \leq A_+$, $x \in \Omega$, and $y, u \in \operatorname{dom} h$, and the definitions of $\tilde x$ in (33) and of $D_\Omega$ and $D_h$ in (35), we conclude that
$$\begin{aligned}
A\|y - \tilde x\|^2 + a\|u - \tilde x\|^2 &= A\left\|\frac{a}{A_+}(y - x)\right\|^2 + a\left\|\frac{A}{A_+}(u - y) + \frac{a}{A_+}(u - x)\right\|^2 \\
&\leq \frac{A}{A_+}\|(y - u) + (u - x)\|^2 + 2a\left[\frac{A^2}{A_+^2}\|u - y\|^2 + \frac{a^2}{A_+^2}\|u - x\|^2\right] \\
&\leq \frac{2A}{A_+}\left(\|u - y\|^2 + \|u - x\|^2\right) + 2a\|u - y\|^2 + \frac{2a}{A_+}\|u - x\|^2 \\
&\leq 2\left[\|u - x\|^2 + (1 + a)\|u - y\|^2\right] \leq 2\left[D_\Omega^2 + (1 + a)D_h^2\right].
\end{aligned}$$
The conclusion now follows from dividing both sides of the above inequalities by 2 and using the fact that $D_h \leq D_\Omega$.

Lemma 15.
For every $u \in \operatorname{dom} h$ and $1 \leq i \leq k$, we have that
$$A_i\left[\phi(y_i) + \left(\frac{1 - \lambda M_1}{2\lambda}\right)\|y^a_i - \tilde x_{i-1}\|^2 - \frac{\|v_i\|^2}{2\lambda}\right] + \frac{1}{2\lambda}\|u - x_i\|^2 \leq A_{i-1}\gamma_i(y_{i-1}) + a_{i-1}\gamma_i(u) + \frac{1}{2\lambda}\|u - x_{i-1}\|^2. \qquad (47)$$

Proof.
Throughout the proof, we use the notation in (45) and write $(\gamma, \tilde\gamma) = (\gamma_i, \tilde\gamma_i)$. We first present two key relations. First, using the definition of $\gamma_i$ in (44) and Lemma 11(c), it follows that
$$\begin{aligned}
\min_{u\in\mathcal Z}\left\{\lambda\gamma(u) + \frac12\|u - \tilde x\|^2\right\} &= \lambda\tilde\gamma(y^a_+) - \langle v + \tilde x - y^a_+, v\rangle + \frac12\|v + \tilde x - y^a_+\|^2 \\
&= \lambda\tilde\gamma(y^a_+) - \|v\|^2 - \langle v, \tilde x - y^a_+\rangle + \frac12\|v + \tilde x - y^a_+\|^2 \\
&= \lambda\tilde\gamma(y^a_+) - \frac12\|v\|^2 + \frac12\|\tilde x - y^a_+\|^2. \qquad (48)
\end{aligned}$$
Second, Lemma 11(b) and the fact that the function $a\gamma + \|\cdot - x\|^2/(2\lambda)$ is $(1/\lambda)$-strongly convex imply that
$$a\gamma(x_+) + \frac{1}{2\lambda}\|x_+ - x\|^2 \leq a\gamma(u) + \frac{1}{2\lambda}\|u - x\|^2 - \frac{1}{2\lambda}\|u - x_+\|^2. \qquad (49)$$
Using (48), Lemma 11(d)-(e), Lemma 13(a), and the fact that $\gamma$ is affine, we have that
$$\begin{aligned}
A_+\left[\phi(y_+) + \left(\frac{1 - \lambda M_1}{2\lambda}\right)\|y^a_+ - \tilde x\|^2\right] &\leq A_+\left[\tilde\gamma(y^a_+) + \frac{1}{2\lambda}\|y^a_+ - \tilde x\|^2\right] = A_+\left[\min_{u\in\mathcal Z}\left\{\gamma(u) + \frac{1}{2\lambda}\|u - \tilde x\|^2\right\} + \frac{\|v\|^2}{2\lambda}\right] \\
&\leq A_+\left[\gamma\left(\frac{Ay + ax_+}{A_+}\right) + \frac{1}{2\lambda}\left\|\frac{Ay + ax_+}{A_+} - \frac{Ay + ax}{A_+}\right\|^2\right] + \frac{A_+\|v\|^2}{2\lambda} \\
&= A\gamma(y) + a\gamma(x_+) + \frac{a^2}{2\lambda A_+}\|x - x_+\|^2 + \frac{A_+\|v\|^2}{2\lambda} \\
&= A\gamma(y) + a\gamma(x_+) + \frac{1}{2\lambda}\|x - x_+\|^2 + \frac{A_+\|v\|^2}{2\lambda}. \qquad (50)
\end{aligned}$$
The conclusion now follows from combining (49) with (50).

We now present an inequality that plays an important role in the analysis of the DA-ICG method.

Proposition 16.
Let $\Delta_\mu(\cdot\,;\cdot,\cdot)$ be as in (9) with $(\psi_s, \psi_n)$ as in (26), and define
$$\theta_i(u) := A_i[\phi(y_i) - \phi(u)] + \frac{1}{2\lambda}\|u - x_i\|^2 \qquad \forall i \geq 0. \qquad (51)$$
For every $u \in \operatorname{dom} h$ satisfying $\Delta_\mu(u; y^a_i, v_i) \leq \varepsilon_i$ and $1 \leq i \leq k$, we have that
$$\frac{A_i}{4\lambda}\|y^a_i - \tilde x_{i-1}\|^2 \leq m_1\left(a_{i-1}D_h^2 + 2D_\Omega^2\right) + \theta_{i-1}(u) - \theta_i(u). \qquad (52)$$

Proof.
Throughout the proof, we use the notation in (45), write $(\gamma, \tilde\gamma) = (\gamma_i, \tilde\gamma_i)$, and set $\theta = \theta_{i-1}$ and $\theta_+ = \theta_i$. Let $u \in \operatorname{dom} h$ be such that $\Delta_\mu(u; y^a_+, v) \leq \varepsilon$. Subtracting $A_+\phi(u)$ from both sides of the inequality in (47) and using the definition of $\theta_+$, we have
$$\begin{aligned}
\frac{A_+}{2\lambda}\left[(1 - \lambda M_1)\|y^a_+ - \tilde x\|^2 - \|v\|^2\right] + \theta_+(u) &= \frac{A_+}{2\lambda}\left[(1 - \lambda M_1)\|y^a_+ - \tilde x\|^2 - \|v\|^2\right] + A_+[\phi(y_+) - \phi(u)] + \frac{1}{2\lambda}\|u - x_+\|^2 \\
&\leq A\gamma(y) + a\gamma(u) - A_+\phi(u) + \frac{1}{2\lambda}\|u - x\|^2 = a[\gamma(u) - \phi(u)] + A[\gamma(y) - \phi(y)] + \theta(u). \qquad (53)
\end{aligned}$$
Moreover, using Lemma 11(d), Lemma 12(a) and (c), and the assumption that $\Delta_\mu(u; y^a_+, v) \leq \varepsilon$, we have that
$$\gamma(s) - \phi(s) = \tilde\gamma(s) - \phi(s) + \frac{\Delta_\mu(s; y^a_+, v)}{\lambda} \leq \frac{m_1}{2}\|s - \tilde x\|^2 + \frac{\varepsilon}{\lambda} \qquad \forall s \in \{u, y\}. \qquad (54)$$
Combining (53), (54), and Lemma 14 then yields
$$\frac{A_+}{2\lambda}\left[(1 - \lambda M_1)\|y^a_+ - \tilde x\|^2 - \|v\|^2\right] + \theta_+(u) \leq \frac{m_1}{2}\left[a\|u - \tilde x\|^2 + A\|y - \tilde x\|^2\right] + \frac{\varepsilon A_+}{\lambda} + \theta(u) \leq m_1\left(aD_h^2 + 2D_\Omega^2\right) + \frac{\varepsilon A_+}{\lambda} + \theta(u).$$
Rearranging the above terms and using (32) together with the first inequality in (46), we conclude that
$$m_1\left(aD_h^2 + 2D_\Omega^2\right) + \theta(u) - \theta_+(u) \geq \frac{A_+}{2\lambda}\left[(1 - \lambda M_1)\|y^a_+ - \tilde x\|^2 - \|v\|^2 - 2\varepsilon\right] \geq \frac{A_+(1 - \lambda M_1 - \sigma^2)}{2\lambda}\|y^a_+ - \tilde x\|^2 \geq \frac{A_+}{4\lambda}\|y^a_+ - \tilde x\|^2.$$

The following result describes some important technical bounds obtained by summing (52) for two different choices of $u$ (possibly changing with $i$) from $i = 1$ to $k$.

Proposition 17.
Let $\Delta_\phi$ and $d_0$ be as in (35) and define
$$S_k := \frac{1}{4\lambda}\sum_{i=1}^k A_i\|y^a_i - \tilde x_{i-1}\|^2. \qquad (55)$$
Then, the following statements hold:

(a) $S_k = O\!\left(k^2\left[m_1 D_h^2 + \Delta_\phi\right] + k\left[m_1 + 1/\lambda\right]D_\Omega^2\right)$;

(b) if $f_1$ is convex, then $S_k = O\!\left(k^2 m_1 D_h^2 + k m_1 D_\Omega^2 + d_0^2/\lambda\right)$.

Proof. (a) Let $\Delta_\mu(\cdot\,;\cdot,\cdot)$ be defined as in (9) with $(\psi_s, \psi_n)$ given by (26). Using (51), the facts that $x_i, y^a_i \in \Omega$ and that $A_i$ is nonnegative and increasing, and the definitions of $\theta_i$ and $D_\Omega$ in (51) and (35), respectively, we have that
$$\sum_{i=1}^k\left[\theta_{i-1}(y_i) - \theta_i(y_i)\right] \leq \sum_{i=1}^k A_{i-1}[\phi(y_{i-1}) - \phi(y_i)] + \frac{1}{2\lambda}\sum_{i=1}^k\|y_i - x_{i-1}\|^2 \leq A_k\sum_{i=1}^k[\phi(y_{i-1}) - \phi(y_i)] + \frac{k}{2\lambda}D_\Omega^2 \leq A_k[\phi(y_0) - \phi_*] + \frac{k}{2\lambda}D_\Omega^2. \qquad (56)$$
Moreover, noting Lemma 12(d) and using Proposition 16 with $u = y_i$, we conclude that (52) holds with $u = y_i$ for every $1 \leq i \leq k$. Summing these $k$ inequalities and using (56), the definition of $S_k$ in (55), and Lemma 13(b) yields the desired conclusion.

(b) Assume now that $f_1$ is convex and let $y_*$ be a point such that $\phi(y_*) = \phi_*$ and $\|y_0 - y_*\| = d_0$. It then follows from Lemma 12(b) and Proposition 1(d) with $(y, v) = (y^a_i, v_i)$ that $\Delta_\mu(y_*; y^a_i, v_i) \leq \varepsilon_i$ for every $1 \leq i \leq k$. The conclusion now follows from an argument similar to the one in (a), but which instead sums (52) with $u = y_*$ from $i = 1$ to $k$ and uses the fact that
$$\sum_{i=1}^k\left[\theta_{i-1}(y_*) - \theta_i(y_*)\right] = \theta_0(y_*) - \theta_k(y_*) \leq \frac{1}{2\lambda}\|y_0 - y_*\|^2 = \frac{d_0^2}{2\lambda},$$
where the inequality is due to the facts that $\theta_k(y_*) \geq 0$ and $A_0 = 0$.

We now establish the rate at which the residual $\|\hat v_i\|$ tends to 0.

Proposition 18.
Let $S_k$ be as in (55). Moreover, define the quantities
$$L^{\rm avg}_{1,k} := \frac1k\sum_{i=1}^k L(y^a_i, \tilde x_{i-1}), \qquad C^{\rm avg}_{\lambda,k} := \frac1k\sum_{i=1}^k C_\lambda(\hat y_i, y^a_i), \qquad (57)$$
where $C_\lambda(\cdot,\cdot)$ and $\overline C_\lambda$ are as in (24) and (27), respectively. Then, we have
$$\min_{i\leq k}\|\hat v_i\| = O\!\left(\left[\sqrt\lambda L^{\rm avg}_{1,k} + \frac{1 + \sigma C^{\rm avg}_{\lambda,k}}{\sqrt\lambda}\right]\sqrt{\frac{S_k}{k^3}}\right) + \hat\rho.$$

Proof.
Let $\ell = \lceil k/2\rceil$. Using Lemma 3 with $(y, w) = (y^a_i, \tilde x_{i-1})$ and the bounds $C_\lambda(\cdot,\cdot) \leq \overline C_\lambda$ and $L(\cdot,\cdot) \leq \overline L$, we have that $\|\hat v_i\| \leq E_i\|y^a_i - \tilde x_{i-1}\|$ for every $\ell \leq i \leq k$, where
$$E_i := \frac{2 + \lambda L(y^a_i, \tilde x_{i-1}) + \sigma C_\lambda(\hat y_i, y^a_i)}{\lambda} \qquad \forall i \geq 1.$$
As a consequence, using the definition of $S_k$ in (55), the definitions in (57), Lemma 9 with $p = 3/2$, $a_i = E_i/\sqrt{A_i}$, and $b_i = \sqrt{A_i}\|y^a_i - \tilde x_{i-1}\|$ for $i \in \{\ell, \ldots, k\}$, Lemma 13(b), and the fact that $k - \ell + 1 \geq k/2$, yields
$$\min_{i\leq k}\|\hat v_i\| \leq \min_{\ell\leq i\leq k} E_i\|y^a_i - \tilde x_{i-1}\| \leq (k - \ell + 1)^{-3/2}\left(\sum_{i=\ell}^k\frac{E_i}{\sqrt{A_i}}\right)\left(\sum_{i=\ell}^k A_i\|y^a_i - \tilde x_{i-1}\|^2\right)^{1/2} \leq \left(\frac2k\right)^{3/2}\frac4k\left(\sum_{i=1}^k E_i\right)(4\lambda S_k)^{1/2} = O\!\left(\left[\sqrt\lambda L^{\rm avg}_{1,k} + \frac{1 + \sigma C^{\rm avg}_{\lambda,k}}{\sqrt\lambda}\right]\sqrt{\frac{S_k}{k^3}}\right).$$

We are now ready to prove Theorem 5.
Proof of Theorem 5. (a) This follows from Proposition 18, Proposition 17(a), the facts that $C_\lambda(\cdot,\cdot) \leq \overline C_\lambda$ and $L(\cdot,\cdot) \leq \overline L$, and the termination condition in step 4.

(b) The fact that $(\hat y, \hat v) = (\hat y_k, \hat v_k)$ satisfies the inclusion of (22) follows from Lemma 3 with $(y, v, w) = (y^a_k, v_k, \tilde x_{k-1})$. The fact that $\|\hat v_k\| \leq \hat\rho$ follows from the stopping condition in step 4.

(c) The fact that the method does not fail follows from Proposition 2(c). The bound in (37) follows from an argument similar to the one in part (a), except that Proposition 17(a) is replaced with Proposition 17(b).

Acknowledgments

Weiwei Kong: This author acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) [funding reference number PGSD3-516700-2018] and the partial support of ONR Grant N00014-18-1-2077.
Renato D.C. Monteiro: The work of this author was partially supported by ONR Grant N00014-18-1-2077.
Appendix A. General Refinement Procedures
The result below, whose proof is in (Kong et al., 2019, Lemma 19), presents properties of a general refinement procedure.
Proposition 19.
Let $h \in \operatorname{Conv}(\mathcal Z)$, $z \in \operatorname{dom} h$, and let $g$ be a differentiable function on $\operatorname{dom} h$ which satisfies $g(u) - \ell_g(u; z) \leq L\|u - z\|^2/2$ for some $L > 0$ and every $u \in \operatorname{dom} g$. Moreover, define the quantities
$$\hat z := \operatorname*{argmin}_u\left\{\ell_g(u; z) + h(u) + \frac{L}{2}\|u - z\|^2\right\}, \qquad \hat q := L(z - \hat z),$$
$$\delta := h(z) - h(\hat z) - \langle\hat q - \nabla g(z), z - \hat z\rangle, \qquad \Delta := (g + h)(z) - (g + h)(\hat z).$$
Then, it holds that
$$\hat q \in \nabla g(z) + \partial h(\hat z), \qquad \delta \geq 0, \qquad \delta + \frac{1}{2L}\|\hat q\|^2 \leq \Delta, \qquad (\hat q, \delta) = \operatorname*{argmin}_{(r,\varepsilon)\in\mathcal Z\times\mathbb R_+}\left\{\frac{1}{2L}\|r\|^2 + \varepsilon : r \in \nabla g(z) + \partial_\varepsilon h(z)\right\}.$$
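To make the procedure concrete, the following Python sketch instantiates Proposition 19 on a hypothetical composite instance of our own choosing, $g(u) = \|Au - b\|^2/2$ and $h = \|\cdot\|_1$, for which the argmin step reduces to soft-thresholding; it then checks the certified bounds $\delta \geq 0$ and $\delta + \|\hat q\|^2/(2L) \leq \Delta$.

import numpy as np

# A minimal sketch of the refinement procedure of Proposition 19 (instance and
# variable names are ours): g(u) = ||Au - b||^2 / 2 and h = ||.||_1.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
g = lambda u: 0.5 * np.sum((A @ u - b) ** 2)
grad_g = lambda u: A.T @ (A @ u - b)
h = lambda u: np.sum(np.abs(u))
L = np.linalg.norm(A.T @ A, 2)          # upper curvature bound for g

z = rng.standard_normal(10)
step = z - grad_g(z) / L
z_hat = np.sign(step) * np.maximum(np.abs(step) - 1.0 / L, 0.0)  # prox of h/L
q_hat = L * (z - z_hat)
delta = h(z) - h(z_hat) - np.dot(q_hat - grad_g(z), z - z_hat)
gap = g(z) + h(z) - g(z_hat) - h(z_hat)  # the quantity Delta

assert delta >= -1e-10                                        # delta >= 0
assert delta + np.dot(q_hat, q_hat) / (2 * L) <= gap + 1e-10  # residual bound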
Appendix B. R-ACG Algorithm

This section presents technical results related to the R-ACG algorithm. The first set of results describes some basic properties of the generated iterates.
Proposition 20. If $\psi_s$ is $\mu$-strongly convex, then the following statements hold:

(a) $z^c_j = \operatorname{argmin}_{u\in\mathcal Z}\left\{B_j\Gamma_j(u) + \|u - z_0\|^2/2\right\}$;

(b) $\Gamma_j \leq \psi$ and $B_j\psi(z_j) \leq \inf_{u\in\mathcal Z}\left\{B_j\Gamma_j(u) + \|u - z_0\|^2/2\right\}$;

(c) $\eta_j \geq 0$ and $r_j \in \partial_{\eta_j}\left(\psi - \mu\|\cdot - z_j\|^2/2\right)(z_j)$;

(d) $\|B_j r_j + z_j - z_0\|^2 + 2B_j\eta_j \leq \|z_j - z_0\|^2$.

Proof. (a) See (Monteiro et al., 2016, Proposition 1).

(b) See (Monteiro et al., 2016, Proposition 1(b)).

(c) The optimality of $z^c_j$ in part (a), the $\mu$-strong convexity of $\Gamma_j$, and the definition of $r_j$ imply that
$$r_j = \frac{z_0 - z^c_j}{B_j} + \mu(z_j - z^c_j) \in \partial\Gamma_j(z^c_j) - \mu(z^c_j - z_j) = \partial\left(\Gamma_j - \frac\mu2\|\cdot - z_j\|^2\right)(z^c_j).$$
Using the above inclusion, the definition of $\eta_j$, and the fact that $\Gamma_j - \mu\|\cdot - z_j\|^2/2$ is convex, it follows that
$$\psi(z) - \frac\mu2\|z - z_j\|^2 \geq \Gamma_j(z) - \frac\mu2\|z - z_j\|^2 \geq \Gamma_j(z^c_j) - \frac\mu2\|z^c_j - z_j\|^2 + \langle r_j, z - z^c_j\rangle = \psi(z_j) + \langle r_j, z - z_j\rangle - \eta_j$$
for every $z \in \operatorname{dom}\psi_n$, which is exactly the desired inclusion. The fact that $\eta_j \geq 0$ follows from taking $z = z_j$ above.

(d) It follows from parts (a)-(b) and the definition of $\eta_j$ that
$$\eta_j \leq \left[\Gamma_j(z^c_j) + \frac{1}{2B_j}\|z^c_j - z_0\|^2\right] - \psi(z_j) + \eta_j = -\frac{1}{B_j}\langle z_0 - z^c_j, z_j - z^c_j\rangle + \frac{1}{2B_j}\|z^c_j - z_0\|^2 = \frac{1}{2B_j}\|z_j - z_0\|^2 - \frac{1}{2B_j}\|z_j - z^c_j\|^2 = \frac{1}{2B_j}\|z_j - z_0\|^2 - \frac{1}{2B_j}\|B_j r_j + z_j - z_0\|^2.$$
Multiplying both sides of the above inequality by $2B_j$ yields the desired conclusion.

The next result presents the general iteration complexity of the algorithm, i.e., Proposition 2(a).

Proof of Proposition 2(a).
Let $\ell$ be the quantity in (19) and suppose that the R-ACG algorithm has not stopped with failure before iteration $\ell$. In view of step 2 of the algorithm, it follows that (17) holds for every $1 \leq j \leq \ell$. We now show that the method must stop with success at the end of the $\ell$-th iteration. Indeed, note that (18), (19), the fact that $K_\sigma > 1$, and the fact that $\log(1 + t) \geq t/(1 + t)$ for $t \geq 0$ imply that
$$B_\ell \geq \frac1M\left(1 + \sqrt{\frac{\mu}{M}}\right)^{2(\ell-1)} \geq \frac{2K_\sigma}{\sigma^2} > 0. \qquad (58)$$
Combining the triangle inequality, (17), the facts that $2/B_\ell \leq \sigma^2/K_\sigma$ and $(2/B_\ell)^2 \leq \sigma^2/K_\sigma$, and the relation $(a + b)^2 \leq 2a^2 + 2b^2$ for all $a, b \in \mathcal Z$, we conclude that
$$\begin{aligned}
\|r_\ell\|^2 + 2\eta_\ell &\leq \max\left\{1/B_\ell^2,\ 1/(2B_\ell)\right\}\left(\|B_\ell r_\ell\|^2 + 4B_\ell\eta_\ell\right) \\
&\leq \max\left\{1/B_\ell^2,\ 1/(2B_\ell)\right\}\left(2\|B_\ell r_\ell + z_\ell - z_0\|^2 + 2\|z_\ell - z_0\|^2 + 4B_\ell\eta_\ell\right) \\
&\leq \max\left\{(2/B_\ell)^2,\ 2/B_\ell\right\}\|z_\ell - z_0\|^2 \leq \frac{\sigma^2}{K_\sigma}\|z_\ell - z_0\|^2 \leq \sigma^2\|z_\ell - z_0\|^2.
\end{aligned}$$

Appendix C. Refined ICG Points
This appendix presents technical results related to the refined points of the ICG methods. The result below proves Lemma 3 from the main body of the paper.
Proof of Lemma 3. (a) Using Proposition 1(a), the definition of $\hat v$, and the definitions of $\psi_s$ and $\psi_n$ in (26), we have that
$$\hat v \in \frac1\lambda\left[\nabla\psi_s(\hat y) + \partial\psi_n(\hat y) + w - \hat y\right] + \nabla f_1(\hat y) - \nabla f_1(w) = \frac1\lambda\left[\lambda\nabla f_1(w) + \lambda\nabla f_2(\hat y) + \lambda\partial h(\hat y)\right] + \nabla f_1(\hat y) - \nabla f_1(w) = \nabla f_1(\hat y) + \nabla f_2(\hat y) + \partial h(\hat y),$$
which is the desired inclusion.

(b) Using assumption (A3), Proposition 1(b), the choice of $M$ in (26), and the fact that $\Delta_\mu(\hat y; y, v) \leq \varepsilon$, we first observe that
$$\|\nabla f_1(\hat y) - \nabla f_1(w)\| - L(y, w)\|y - w\| \leq L(y, \hat y)\|\hat y - y\| \leq L(y, \hat y)\sqrt{\frac{2\Delta_\mu(\hat y; y, v)}{\lambda M_2 + 1}} \leq \frac{\sigma L(y, \hat y)}{\sqrt{\lambda M_2 + 1}}\|y - w\|. \qquad (59)$$
Using now (59), the choice of $M$ in (26), Proposition 1(c) with $L$ replaced by $\lambda L(\cdot,\cdot)$, the fact that $\sigma \leq 1$, and the definition of $C_\lambda(\cdot,\cdot)$, we conclude that
$$\|\hat v\| \leq \frac1\lambda\|v_r\| + \frac1\lambda\|y - w\| + \|\nabla f_1(\hat y) - \nabla f_1(w)\| \leq \left[L(y, w) + \frac{1 + \sigma}{\lambda} + \frac{\sigma\left[\lambda M_2 + 1 + 2\lambda L(y, \hat y)\right]}{\lambda\sqrt{\lambda M_2 + 1}}\right]\|y - w\| \leq \left[L(y, w) + \frac{2 + \sigma C_\lambda(y, \hat y)}{\lambda}\right]\|y - w\|.$$

Appendix D. Spectral Functions
This section presents some results about spectral functions as well as the proof of Proposition 6. It is assumed that the reader is familiar with the key quantities given in Subsection 4.1 (see, e.g., Equations (38) and (39)). We first state two well-known results (see Lewis, 1995; Beck, 2017) about spectral functions.
Lemma 21.
Let $\Psi = \Psi^{\mathcal V}\circ\sigma$ for some absolutely symmetric function $\Psi^{\mathcal V}: \mathbb R^r \to \mathbb R$. Then, the following properties hold:

(a) $\Psi^* = (\Psi^{\mathcal V}\circ\sigma)^* = (\Psi^{\mathcal V})^*\circ\sigma$;

(b) $\nabla\Psi = (\nabla\Psi^{\mathcal V})\circ\sigma$.
Lemma 22. Let $(\Psi, \Psi^{\mathcal V})$ be as in Lemma 21, let the pair $(S, Z) \in \mathcal Z\times\operatorname{dom}\Psi$ be fixed, and let the decomposition $S = P[\operatorname{dg}\sigma(S)]Q^*$ be an SVD of $S$ for some $(P, Q) \in \mathcal U^m\times\mathcal U^n$. If $\Psi \in \operatorname{Conv}(\mathbb R^{m\times n})$ and $\Psi^{\mathcal V} \in \operatorname{Conv}(\mathbb R^r)$, then for every $M > 0$, we have
$$S \in \partial\left(\Psi + \frac{M}{2}\|\cdot\|_F^2\right)(Z) \iff \begin{cases}\sigma(S) \in \partial\left(\Psi^{\mathcal V} + \frac{M}{2}\|\cdot\|^2\right)(\sigma(Z)),\\[4pt] Z = P[\operatorname{dg}\sigma(Z)]Q^*.\end{cases}$$
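In practice, Lemma 22 is what allows the matrix prox subproblems of the ICG methods to be reduced to vector ones. The sketch below (our illustrative instance, not from the paper) takes $\Psi$ to be the nuclear norm, so that $\Psi^{\mathcal V}$ is the $\ell_1$ norm and the matrix prox step becomes soft-thresholding of the singular values.

import numpy as np

# Matrix prox of Psi = nuclear norm via the vector prox of Psi^V = ell_1:
# argmin_U { Psi(U) + (M/2)||U - W||_F^2 } = P dg[u*] Q^*, where
# u* = argmin_u { ||u||_1 + (M/2)||u - sigma(W)||^2 }  (cf. Lemma 22).
rng = np.random.default_rng(3)
W, M = rng.standard_normal((6, 5)), 2.0
P, s, Qt = np.linalg.svd(W, full_matrices=False)
u_star = np.maximum(s - 1.0 / M, 0.0)    # vector soft-thresholding
U_star = P @ np.diag(u_star) @ Qt

obj = lambda U: (np.sum(np.linalg.svd(U, compute_uv=False))
                 + 0.5 * M * np.sum((U - W) ** 2))
for _ in range(100):                      # U_star beats nearby perturbations
    assert obj(U_star) <= obj(U_star + 1e-3 * rng.standard_normal(W.shape)) + 1e-12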
We now present a new result about spectral functions.

Theorem 23. Let $(\Psi, \Psi^{\mathcal V})$ be as in Lemma 21 and let the point $Z \in \mathbb R^{m\times n}$ be such that $\sigma(Z) \in \operatorname{dom}\Psi^{\mathcal V}$. Then, for every $\varepsilon \geq 0$, we have $S \in \partial_\varepsilon\Psi(Z)$ if and only if $\sigma(S) \in \partial_{\varepsilon(S)}\Psi^{\mathcal V}(\sigma(Z))$, where
$$\varepsilon(S) := \varepsilon - \left[\langle\sigma(Z), \sigma(S)\rangle - \langle Z, S\rangle\right] \geq 0. \qquad (60)$$
Moreover, if $S$ and $Z$ have a simultaneous SVD, then $\varepsilon(S) = \varepsilon$.

Proof. Using Lemma 21(a), (60), and the well-known fact that $S \in \partial_\varepsilon\Psi(Z)$ if and only if $\varepsilon \geq \Psi(Z) + \Psi^*(S) - \langle Z, S\rangle$, we have that $S \in \partial_\varepsilon\Psi(Z)$ if and only if
$$\varepsilon(S) = \varepsilon - \left[\langle\sigma(Z), \sigma(S)\rangle - \langle Z, S\rangle\right] \geq \Psi(Z) + \Psi^*(S) - \langle Z, S\rangle - \left[\langle\sigma(Z), \sigma(S)\rangle - \langle Z, S\rangle\right] = \Psi^{\mathcal V}(\sigma(Z)) + (\Psi^{\mathcal V})^*(\sigma(S)) - \langle\sigma(Z), \sigma(S)\rangle,$$
or, equivalently, $\sigma(S) \in \partial_{\varepsilon(S)}\Psi^{\mathcal V}(\sigma(Z))$ and $\varepsilon(S) \geq 0$. To show that the existence of a simultaneous SVD of $S$ and $Z$ implies $\varepsilon(S) = \varepsilon$, it suffices to show that $\langle\sigma(S), \sigma(Z)\rangle = \langle S, Z\rangle$. Indeed, if $S = P[\operatorname{dg}\sigma(S)]Q^*$ and $Z = P[\operatorname{dg}\sigma(Z)]Q^*$ for some $(P, Q) \in \mathcal U^m\times\mathcal U^n$, then we have
$$\langle S, Z\rangle = \langle\operatorname{dg}\sigma(S), P^*P[\operatorname{dg}\sigma(Z)]Q^*Q\rangle = \langle\operatorname{dg}\sigma(S), \operatorname{dg}\sigma(Z)\rangle = \langle\sigma(S), \sigma(Z)\rangle.$$
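The inequality $\langle\sigma(Z), \sigma(S)\rangle \geq \langle Z, S\rangle$ underlying the nonnegativity of the bracketed term in (60) is von Neumann's trace inequality; the following quick check (ours) verifies it on random matrices, together with the equality case under a simultaneous SVD used at the end of the proof.

import numpy as np

# von Neumann trace inequality: <sigma(Z), sigma(S)> >= <Z, S>.
rng = np.random.default_rng(4)
sv = lambda M: np.linalg.svd(M, compute_uv=False)
for _ in range(200):
    Z, S = rng.standard_normal((5, 4)), rng.standard_normal((5, 4))
    assert np.dot(sv(Z), sv(S)) >= np.sum(Z * S) - 1e-10

# Equality when S and Z share a simultaneous SVD (singular values aligned
# in nonincreasing order with the same factors P, Q).
P, s, Qt = np.linalg.svd(rng.standard_normal((5, 4)), full_matrices=False)
d = np.sort(rng.uniform(0.0, 1.0, 4))[::-1]
Z, S = P @ np.diag(s) @ Qt, P @ np.diag(d) @ Qt
assert abs(np.dot(sv(Z), sv(S)) - np.sum(Z * S)) < 1e-8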
References

Amir Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.

Emmanuel J. Candes, Yonina C. Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 57(2):225-251, 2015.

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751-1772, 2018.

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1-2):503-558, 2019.

Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156:59-99, 2016.

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Generalized uniformly optimal methods for nonlinear programming. arXiv e-prints, art. arXiv:1508.07384, August 2015.

Weiwei Kong, Jefferson G. Melo, and Renato D. C. Monteiro. Complexity of a quadratic penalty accelerated inexact proximal point method for solving linearly constrained nonconvex composite programs. SIAM Journal on Optimization, 29(4):2566-2593, 2019.

Weiwei Kong, Jefferson G. Melo, and Renato D. C. Monteiro. An efficient adaptive accelerated inexact proximal point method for solving linearly constrained nonconvex composite problems. Computational Optimization and Applications, 76(2):305-346, 2020.

Adrian S. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2(1):173-183, 1995.

Jiaming Liang and Renato D. C. Monteiro. A doubly accelerated inexact proximal point method for nonconvex composite optimization problems. arXiv e-prints, art. arXiv:1811.11378, November 2018.

Jiaming Liang, Renato D. C. Monteiro, and Chee-Khian Sim. A FISTA-type accelerated gradient algorithm for solving smooth nonconvex composite optimization problems. arXiv e-prints, art. arXiv:1905.07010, May 2019.

Renato D. C. Monteiro, Camilo Ortiz, and Benar F. Svaiter. Gradient methods for minimizing composite functions. Mathematical Programming, pages 1-37, 2012.

Renato D. C. Monteiro, Camilo Ortiz, and Benar F. Svaiter. An adaptive accelerated first-order method for convex optimization. Computational Optimization and Applications, 64:31-73, 2016.

Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for gradient-based non-convex optimization. arXiv e-prints, art. arXiv:1703.10993, March 2017.

Tingni Sun and Cun-Hui Zhang. Calibrated elastic regularization in matrix completion. In Advances in Neural Information Processing Systems, pages 863-871, 2012.

Fei Wen, Rendong Ying, Peilin Liu, and Robert C. Qiu. Robust PCA using generalized nonconvex regularization. IEEE Transactions on Circuits and Systems for Video Technology, 2019.

Quanming Yao and James T. Kwok. Efficient learning with a family of nonconvex regularizers by redistributing nonconvexity. The Journal of Machine Learning Research, 18(1):6574-6625, 2017.