An Optimal High-Order Tensor Method for Convex Optimization
Bo JIANG ∗ Haoyue WANG † Shuzhong ZHANG ‡ April 20, 2020
Abstract
This paper is concerned with finding an optimal algorithm for minimizing a composite convex objective function. The basic setting is that the objective is the sum of two convex functions: the first function is smooth, with up to the $d$-th order derivative information available, and the second function is possibly non-smooth, but its proximal tensor mappings can be computed approximately in an efficient manner. The problem is to find, in that setting, the best possible (optimal) iteration complexity for convex optimization. Along that line, for the smooth case (without the second non-smooth part in the objective), Nesterov proposed ([25], 1983) an optimal algorithm for the first-order methods ($d = 1$) with iteration complexity $O(1/k^2)$, while high-order tensor algorithms (using up to general $d$-th order tensor information) with iteration complexity $O(1/k^{d+1})$ were recently established in [3, 27]. In this paper, we propose a new high-order tensor algorithm for the general composite case, with the iteration complexity of $O(1/k^{(3d+1)/2})$, which matches the lower bound for the $d$-th order methods as established in [27, 31], and hence is optimal. Our approach is based on the Accelerated Hybrid Proximal Extragradient (A-HPE) framework proposed by Monteiro and Svaiter in [24], where a bisection procedure is installed for each A-HPE iteration. At each bisection step a proximal tensor subproblem is approximately solved, and the total number of bisection steps per A-HPE iteration is shown to be bounded by a logarithmic factor in the precision required.
Keywords: convex optimization; tensor method; acceleration; iteration complexity.
Mathematics Subject Classification:

∗ Research Institute for Interdisciplinary Sciences, School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China. Email: [email protected]. Research of this author was supported in part by NSFC Grants 11771269 and 11831002, and Program for Innovative Research Team of Shanghai University of Finance and Economics.
† Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Email: [email protected].
‡ Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, MN 55455, USA (email: [email protected]); joint appointment with Institute of Data and Decision Analytics, The Chinese University of Hong Kong, Shenzhen, China (email: [email protected]). Research of this author was supported in part by the National Science Foundation (Grant CMMI-1462408).

1 Introduction

In this paper, we consider the following composite unconstrained convex optimization problem:
\[
\min_{x \in \mathbb{R}^n} F(x) := f(x) + h(x), \tag{1.1}
\]
where $f$ is differentiable and convex, and $h$ is convex but possibly non-smooth. In this context, we assume that convex tensor (polynomial) proximal mappings regarding $h$ can be approximately computed efficiently. Given that structure, a fundamental quest is to find an optimal algorithm that solves the above problem, using the available derivative information of the smooth part $f$.

In case $F(x) = f(x)$, and only the gradient information of $f$ is available, Nesterov [25] proposed a gradient-type algorithm which achieves the overall iteration complexity of $O(1/k^2)$, matching the lower bound on the iteration complexity of this class of solution methods; hence it is known to be an optimal algorithm among all the first-order methods. Since Nesterov's seminal work [25], and especially in recent years when large-scale machine learning applications have come under the spotlight, there has been a surge of research effort to extend Nesterov's approach to more general settings (see e.g. [4, 12, 19, 14, 30]) and/or to incorporate certain adaptive strategies to enhance the practical performance of the acceleration (see e.g. [20, 29, 13]). At the same time, there has also been considerable research effort to fully understand the underpinning mechanism of the first-order acceleration phenomenon; see e.g. [7, 32, 33, 34].

When the Hessian information is available, Nesterov [26] proposed an acceleration scheme for the cubic regularized Newton method, and he showed that the iteration complexity bound improves from $O(1/k^2)$ to $O(1/k^3)$. A few years later, Monteiro and Svaiter [24] proposed a totally different acceleration scheme, which they termed the Accelerated Hybrid Proximal Extragradient (A-HPE) framework, and they proved that if the second-order information is incorporated into the A-HPE framework then the corresponding accelerated Newton proximal extragradient method has a superior iteration complexity bound of $O(1/k^{7/2})$ over $O(1/k^3)$. In 2018, Arjevani, Shamir and Shiff [31] showed that $O(1/k^{7/2})$ is actually a lower bound for the oracle complexity of the second-order methods for convex smooth optimization. This shows that the accelerated Newton proximal extragradient method is an optimal second-order method.

As evidenced by the special cases $d = 1$ and $d = 2$, there is a clear tradeoff between the level of derivative information required and the overall iteration complexity achieved.
Therefore, a natural and important question arises: What is the exact tradeoff relationship between $d$ and the worst-case iteration complexity? Such a question has in fact been raised and addressed in some way in recent works [5, 10, 11, 22] in the context of nonconvex optimization. For convex optimization, the accelerated cubic regularized Newton method was generalized to the general high-order case [3, 27] with the iteration complexity being $O(1/k^{d+1})$, where $d$ is the order of derivative information used in the algorithm. Jiang, Lin and Zhang [18] extended Nesterov's approach to accommodate the composite optimization (1.1) and relaxed the requirement on the knowledge of problem parameters such as the Lipschitz constants, as well as the requirement on the exact solutions of the subproblems, while maintaining the same iteration bound as in [3, 27]. Along the line of bounding the worst-case iteration complexity using up to the $d$-th order derivative information, there has also been significant progress. Arjevani, Shamir and Shiff [31] showed that the worst-case iteration complexity of any algorithm in that setting cannot be better than $O(1/k^{(3d+1)/2})$. A simplified analysis of the bound can be found in Nesterov [27]. So, there was a gap between the achieved iteration bound $O(1/k^{d+1})$ and the best possible bound of $O(1/k^{(3d+1)/2})$. Clearly at least one of the two bounds is improvable. In this paper, we aim to settle the above theoretical quest by providing a new implementable algorithm whose iteration complexity is precisely $O(1/k^{(3d+1)/2})$. As a result, the tradeoff relationship discussed above is pinned down to be exactly $O(1/k^{(3d+1)/2})$.

Our algorithm is based on the A-HPE framework of Monteiro and Svaiter [24], which is presented as Algorithm 1 in this paper. In fact, our algorithm specifies a way to generate an approximate solution through the use of high-order derivative information via Taylor expansion. In each iteration, such an approximate solution is computed by means of a bisection process. At each bisection step, a regularized convex tensor (polynomial) optimization subproblem is approximately solved. Moreover, we show that, to implement one A-HPE iteration, the number of bisection steps – each calling to solve a convex tensor subproblem – is upper bounded by a logarithmic factor in the inverse of the required precision. Our bisection procedure is similar to the one proposed in [24] for the case $d = 2$; however, a key modification is applied which enables the removal of the so-called "bracketing stage" used in [24]. After submitting the first version of the paper, we became aware of two other independent works [15, 8] establishing similar iteration bounds as ours, with the main difference being that the focus of [15, 8] is on the smooth case $F(x) = f(x)$, while our method accommodates a composite objective function. The common theoretical development by the three groups was subsequently jointly announced in the form of an abstract at the Conference on Learning Theory (COLT) [17]. It is also worth mentioning that, other than the afore-mentioned three papers, there are some other related works on high-order optimization methods [6, 1, 2] based on the large-step A-HPE framework.

The rest of the paper is organized as follows. In Section 2, we introduce some preliminaries including the assumptions and the high-order oracle model used throughout this paper.
Then we present our optimal tensor method and its iteration complexity analysis in Section 3. The line-search subroutine used in the main procedure of our optimal tensor method is presented and analyzed in Section 4. Finally, some technical proofs and lemmas are provided in the appendix.

2 Preliminaries
We denote by $\nabla^d f(x)$ the $d$-th order derivative tensor of the function $f$ at the point $x$, with the $(i_1, \ldots, i_d)$ component given by
\[
\nabla^d f(x)_{i_1,\ldots,i_d} = \frac{\partial^d f}{\partial x_{i_1}\cdots\partial x_{i_d}}(x), \quad \forall\, 1 \le i_1, \ldots, i_d \le n.
\]
Given a $d$-th order tensor $\mathcal{T}$ and vectors $z^1, \ldots, z^d \in \mathbb{R}^n$, we denote
\[
\mathcal{T}[z^1, \ldots, z^d] := \sum_{i_1,\ldots,i_d=1}^n \mathcal{T}_{i_1,\ldots,i_d}\, z^1_{i_1}\cdots z^d_{i_d}.
\]
The operator norm associated with $\mathcal{T}$ is defined as
\[
\|\mathcal{T}\| := \max_{\|z^i\|=1,\ i=1,\ldots,d} \mathcal{T}[z^1, \ldots, z^d].
\]
For given $z^{k+1}, \ldots, z^d$, $\mathcal{T}[z^{k+1}, \ldots, z^d]$ is a $k$-th order tensor with the associated $(i_1, \ldots, i_k)$ component defined as
\[
\mathcal{T}[z^{k+1}, \ldots, z^d]_{i_1,\ldots,i_k} := \sum_{i_{k+1},\ldots,i_d=1}^n \mathcal{T}_{i_1,\ldots,i_k,i_{k+1},\ldots,i_d}\, z^{k+1}_{i_{k+1}}\cdots z^d_{i_d}
\]
for $1 \le i_1, \ldots, i_k \le n$. Denote
\[
(z^1_*, \ldots, z^k_*) := \operatorname*{argmax}_{\|y^i\|=1,\ i=1,\ldots,k} \big(\mathcal{T}[z^{k+1}, \ldots, z^d]\big)[y^1, \ldots, y^k].
\]
One has
\[
\|\mathcal{T}[z^{k+1}, \ldots, z^d]\| = \mathcal{T}[z^1_*, \ldots, z^k_*, z^{k+1}, \ldots, z^d] \le \|\mathcal{T}\|\,\|z^{k+1}\|\cdots\|z^d\|. \tag{2.1}
\]
As a matter of convention, for quantities $x$ and $y$, we use the notation $y = \Theta(x)$ to indicate the relation that there are positive constants $a$ and $b$ such that $ax \le y \le bx$. If $a$ is absent, then we shall indicate the relation as $y = O(x)$.

In this paper, we consider the following high-order oracle model, and the algorithm we are going to propose operates within such an oracle model.

$d$-th Order Oracle Model
• $f$ is $d$ times differentiable with Lipschitz-continuous $d$-th order derivative tensor, with Lipschitz constant $L_d$; i.e.,
\[
\|\nabla^d f(x) - \nabla^d f(y)\| \le L_d \|x - y\| \quad \forall\, x, y \in \mathbb{R}^n, \tag{2.2}
\]
where the left-hand side is the $d$-th order tensor operator norm.
• Given any $x$, the oracle returns $f(x), \nabla f(x), \nabla^2 f(x), \ldots, \nabla^d f(x)$.
• At iteration $k$, $x^k$ is generated as a deterministic function of $h$ and the oracle's responses at linear combinations of $x^0, x^1, \ldots, x^{k-1}$ and $\nabla^i f(x^0), \nabla^i f(x^1), \ldots, \nabla^i f(x^{k-1})$, where $1 \le i \le d$.

Recall that the exact proximal minimization at point $x$ with stepsize $\lambda > 0$ is
\[
\min_{y \in \mathbb{R}^n} f(y) + h(y) + \frac{1}{2\lambda}\|y - x\|^2. \tag{2.3}
\]
To utilize all the derivative information, we consider the regularized tensor approximation of $f(y)$ at point $x$:
\[
f_x(y) := f(x) + \nabla f(x)^\top(y - x) + \frac{1}{2}\nabla^2 f(x)[y - x]^2 + \cdots + \frac{1}{d!}\nabla^d f(x)[y - x]^d + \frac{M}{(d+1)!}\|y - x\|^{d+1}, \tag{2.4}
\]
where $M > 0$ is the coefficient of the regularization term $\|y - x\|^{d+1}$. Then, by (2.2) and the Taylor expansion, we can bound the gap between $f_x(\cdot)$ and $f(\cdot)$ for any $x$ (see Nesterov [27]):

Lemma 2.1
For every $x, y \in \mathbb{R}^n$,
\[
\|\nabla f(y) - \nabla f_x(y)\| \le \frac{L_d + M}{d!}\,\|y - x\|^d.
\]

Therefore, it is natural to consider the tensor approximation of (2.3):
\[
\min_{y \in \mathbb{R}^n} f_x(y) + h(y) + \frac{1}{2\lambda}\|y - x\|^2. \tag{2.5}
\]
In fact, (2.5) is the subproblem to be solved in the Optimal Tensor Method that will be introduced later. Note that similar subproblems have appeared in [24] and [27]. Specifically, the one used in [24] corresponds to $d = 2$ in (2.5) without the term involving $\|y - x\|^{d+1}$ (i.e., $M = 0$), while [27] uses the subproblem that only minimizes $f_x(y)$ (i.e., without the nonsmooth term $h(y)$ and the quadratic regularization term $\frac{1}{2\lambda}\|y - x\|^2$). In contrast, our subproblem above installs both the high-order and the quadratic regularization terms.

Note that the unique solution $y$ of (2.5) is characterized by the following optimality condition:
\[
u \in (\nabla f_x + \partial h)(y), \qquad \lambda u + y - x = 0. \tag{2.6}
\]
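To make the construction of (2.4)–(2.5) concrete, the following sketch evaluates the regularized Taylor model and the subproblem objective for $d = 3$, assuming the derivative tensors at the expansion point have already been computed. The function names and the choice $d = 3$ are ours, purely for illustration; they are not part of the paper's algorithm.

```python
import numpy as np

def regularized_taylor_model(y, x, f0, grad, hess, third, M):
    """Value of the regularized Taylor model f_x(y) in (2.4) for d = 3.

    f0, grad, hess, third are f(x), the gradient, the Hessian, and the
    third-order derivative tensor of f at the expansion point x;
    M is the regularization coefficient (the algorithm later requires M >= L_d).
    """
    z = y - x
    return (f0
            + grad @ z
            + 0.5 * z @ (hess @ z)
            + np.einsum('ijk,i,j,k->', third, z, z, z) / 6.0
            + M / 24.0 * np.linalg.norm(z) ** 4)

def subproblem_objective(y, x, f0, grad, hess, third, M, lam, h=lambda v: 0.0):
    """Objective of the proximal tensor subproblem (2.5):
    f_x(y) + h(y) + ||y - x||^2 / (2 * lam)."""
    return (regularized_taylor_model(y, x, f0, grad, hess, third, M)
            + h(y)
            + np.linalg.norm(y - x) ** 2 / (2.0 * lam))
```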
For a scalar $\epsilon \ge 0$, the $\epsilon$-subdifferential of a proper closed convex function $h$ is defined as
\[
\partial_\epsilon h(x) := \{u \mid h(y) \ge h(x) + \langle y - x, u\rangle - \epsilon, \ \forall\, y \in \mathbb{R}^n\}.
\]
With the above notion in mind, let us consider the following approximate solution for (2.6) (hence (2.5)).
Definition 2.1
Given $(\lambda, x) \in \mathbb{R}_{++} \times \mathbb{R}^n$ and $\hat\sigma \ge 0$, the triplet $(y, u, \epsilon) \in \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}_+$ is called a $\hat\sigma$-approximate solution of (2.5) at $(\lambda, x)$ if
\[
u \in (\nabla f_x + \partial_\epsilon h)(y) \quad \text{and} \quad \|\lambda u + y - x\|^2 + 2\lambda\epsilon \le \hat\sigma^2\|y - x\|^2. \tag{2.7}
\]

Obviously, if $(y, u)$ is the solution pair of (2.6), then $(y, u, 0)$ is a $\hat\sigma$-approximate solution of (2.5) at $(\lambda, x)$ for any $\hat\sigma \ge 0$. In the rest of our analysis, we assume the availability of a subroutine which, for given $(\lambda, x)$ and $\hat\sigma > 0$, returns a $\hat\sigma$-approximate solution $(y, u, \epsilon)$. Let us call this subroutine ATS (Approximate Tensor Subroutine). Different from [27], where a similar subproblem as (2.5) without the possibly nonsmooth function $h(\cdot)$ and the regularization term $\frac{1}{2\lambda}\|y - x\|^2$ is solved exactly, we only assume that an approximate solution in the form of (2.7) is available, and no further assumption on $h(\cdot)$ is required. Note that the possibly nonsmooth function $h(\cdot)$ can be viewed as a fixed parameter in ATS. Once $h(\cdot)$ is given, ATS is called in each step of the bisection search, which itself is a subroutine in the main procedure of our algorithm.
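As a sanity check on Definition 2.1, the following sketch verifies the inequality in (2.7) for a candidate triplet $(y, u, \epsilon)$. The routine is ours and only illustrates the acceptance test that ATS must satisfy, not how ATS itself is implemented; the inclusion part of (2.7) must be guaranteed by the solver that produced the triplet.

```python
import numpy as np

def is_sigma_approximate(y, u, eps, lam, x, sigma_hat):
    """Check the inequality in (2.7):
    ||lam*u + y - x||^2 + 2*lam*eps <= sigma_hat^2 * ||y - x||^2."""
    lhs = np.linalg.norm(lam * u + y - x) ** 2 + 2.0 * lam * eps
    rhs = sigma_hat ** 2 * np.linalg.norm(y - x) ** 2
    return lhs <= rhs
```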
3 The Optimal Tensor Method

Our approach to the optimal tensor algorithm is based on the so-called Accelerated Hybrid Proximal Extragradient (A-HPE) framework proposed by Monteiro and Svaiter [24] for problem (1.1), whose main steps can be schematically sketched below:

Algorithm 1
A-HPE framework
STEP 1.
Let $x_0, y_0 \in \mathbb{R}^n$, $0 < \sigma < 1$, $A_0 = 0$ and $k = 0$.

STEP 2.
If $0 \in \partial F(y_k)$, then STOP.

STEP 3.
Otherwise, find $\lambda_{k+1} > 0$ and $(\tilde y_{k+1}, v_{k+1}, \epsilon_{k+1})$ such that
\[
v_{k+1} \in \partial_{\epsilon_{k+1}} F(\tilde y_{k+1}), \tag{3.1}
\]
\[
\|\lambda_{k+1} v_{k+1} + \tilde y_{k+1} - \tilde x_k\|^2 + 2\lambda_{k+1}\epsilon_{k+1} \le \sigma^2\|\tilde y_{k+1} - \tilde x_k\|^2, \tag{3.2}
\]
where
\[
\tilde x_k = \frac{A_k}{A_k + a_{k+1}}\, y_k + \frac{a_{k+1}}{A_k + a_{k+1}}\, x_k, \qquad
a_{k+1} = \frac{\lambda_{k+1} + \sqrt{\lambda_{k+1}^2 + 4\lambda_{k+1} A_k}}{2}.
\]

STEP 4.
Choose $y_{k+1}$ such that $F(y_{k+1}) \le F(\tilde y_{k+1})$ and let
\[
A_{k+1} = A_k + a_{k+1}, \qquad x_{k+1} = x_k - a_{k+1} v_{k+1}.
\]

STEP 5.
Set $k \leftarrow k + 1$, and go to STEP 2.

Note that in STEP 2, the stopping condition is $0 \in \partial F(y_k)$. However, in practice, the condition is replaced by an approximate version of it (see Algorithm 2 below). In the following, we quote some technical results derived in [24] for A-HPE. Since our proposed algorithm is within that framework, the results in Lemma 3.1 hold true for our method as well, and they will be used in the subsequent analysis.

Lemma 3.1
Suppose the sequence $\{x_k, y_k, \tilde x_k, \tilde y_k\}$ is generated by Algorithm 1. Let $x^*$ be the projection of $x_0$ onto the set of optimal solutions $X^*$, $F^*$ be the optimal value, and $D$ be the distance from $x_0$ to $X^*$. Then for any integer $k \ge 1$, it holds that (Theorem 3.6 in [24])
\[
\frac{1}{2}\|x^* - x_k\|^2 + A_k\big(F(y_k) - F^*\big) + \frac{1-\sigma^2}{2}\sum_{j=1}^k \frac{A_j}{\lambda_j}\|\tilde y_j - \tilde x_{j-1}\|^2 \le \frac{1}{2}D^2. \tag{3.3}
\]
Therefore,
\[
\sum_{j=1}^k \frac{A_j}{\lambda_j}\|\tilde y_j - \tilde x_{j-1}\|^2 \le \frac{D^2}{1-\sigma^2}. \tag{3.4}
\]
Furthermore, $A_k$ and $\lambda_k$ have the following relation (Lemma 3.7 in [24]):
\[
A_k \ge \frac{1}{4}\Big(\sum_{j=1}^k \sqrt{\lambda_j}\Big)^2. \tag{3.5}
\]
If $y_k$ is chosen as $\tilde y_k$ for all $k$, the distance between $y_k$ and $x^*$ can be bounded as follows (Theorem 3.10 in [24]):
\[
\|y_k - x^*\| \le \left(\frac{2}{\sqrt{1-\sigma^2}} + 1\right) D. \tag{3.6}
\]

Now we are ready to propose our optimal tensor method in Algorithm 2.
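Before presenting it, here is a small numeric illustration (ours, not part of the paper) of the growth relation (3.5): iterating the coupling update $a_{k+1} = \big(\lambda_{k+1} + \sqrt{\lambda_{k+1}^2 + 4\lambda_{k+1}A_k}\big)/2$ with a constant stepsize $\lambda_j \equiv \lambda$ and checking it against the lower bound (3.5), which in that case gives $A_k \ge \lambda k^2/4$, i.e., the classical $O(1/k^2)$ accelerated rate once plugged into (3.3).

```python
import math

lam = 0.1          # constant proximal stepsize lambda_j (illustrative value)
A = 0.0            # A_0 = 0
sqrt_sum = 0.0     # running sum of sqrt(lambda_j)

for k in range(1, 11):
    a = (lam + math.sqrt(lam ** 2 + 4.0 * lam * A)) / 2.0   # coupling update for a_{k+1}
    A += a                                                  # A_{k+1} = A_k + a_{k+1}
    sqrt_sum += math.sqrt(lam)
    lower = 0.25 * sqrt_sum ** 2                            # right-hand side of (3.5)
    print(f"k={k:2d}  A_k={A:.4f}  bound={lower:.4f}")      # A_k always dominates the bound
```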
Algorithm 2
The optimal tensor method
STEP 1.
Let $x_0 = y_0 \in \mathbb{R}^n$, $v_0 \in \partial f(x_0)$, $\epsilon_0 = 0$, $k = 0$, and set $0 < \bar\epsilon, \bar\rho < 1$, $M \ge L_d$. Let $\hat\sigma \ge 0$ and $0 < \sigma_l < \sigma_u < 1$ be such that $\sigma := \hat\sigma + \sigma_u < 1$ and $\sigma_l(1+\hat\sigma)^{d-1} < \sigma_u(1-\hat\sigma)^{d-1}$.

STEP 2. If $\|v_k\| \le \bar\rho$ and $\epsilon_k \le \bar\epsilon$, then STOP. Else, go to STEP 3.
STEP 3.
Find $\lambda_{k+1}$ and a $\hat\sigma$-approximate solution $(y_{k+1}, u_{k+1}, \epsilon_{k+1}) \in \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}_+$ of (2.5) at $(\lambda_{k+1}, \tilde x_k)$ such that either
\[
\frac{d!\,\sigma_l}{L_d + M} \le \lambda_{k+1}\|y_{k+1} - \tilde x_k\|^{d-1} \le \frac{d!\,\sigma_u}{L_d + M} \tag{3.7}
\]
or $\|\nabla f(y_{k+1}) + u_{k+1} - \nabla f_{\tilde x_k}(y_{k+1})\| \le \bar\rho$ and $\epsilon_{k+1} \le \bar\epsilon$ hold, where
\[
\tilde x_k = \frac{A_k}{A_k + a_{k+1}}\, y_k + \frac{a_{k+1}}{A_k + a_{k+1}}\, x_k \tag{3.8}
\]
and
\[
a_{k+1} = \frac{\lambda_{k+1} + \sqrt{\lambda_{k+1}^2 + 4\lambda_{k+1} A_k}}{2}. \tag{3.9}
\]
(Note that $\lambda_{k+1}$ appears in both (3.7) and (3.9), and seeking a proper $\lambda_{k+1}$ requires a bisection procedure, to be called Algorithm 3 in Section 4.)

STEP 4.
Let
\[
v_{k+1} = \nabla f(y_{k+1}) + u_{k+1} - \nabla f_{\tilde x_k}(y_{k+1}), \tag{3.10}
\]
\[
A_{k+1} = A_k + a_{k+1}, \qquad x_{k+1} = x_k - a_{k+1} v_{k+1}.
\]
Set $k \leftarrow k + 1$ and go to STEP 2.

At this point, neither Algorithm 1 nor Algorithm 2 has been shown to be implementable. In fact, STEP 3 in both algorithms presented above remains unspecified. Since $\lambda_{k+1}$ appears in both (3.7) and (3.9), it is even unclear why such solutions as required by STEP 3 exist at all. Actually, the double roles played by $\lambda_{k+1}$ in (3.7) and (3.9) are crucial for the overall $O(1/k^{(3d+1)/2})$ convergence rate. As a tradeoff, such $\lambda_{k+1}$ is not easy to find. In Section 4, we shall discuss a practical method to find a proper $\lambda_{k+1}$ (and thus establish a practical implementation of STEP 3 in Algorithm 2) via the Approximate Tensor Subroutine (ATS) in combination with a line-search subroutine.

First, let us remark that Algorithm 2 is indeed a specialization of A-HPE. For simplicity, we let $y_{k+1} = \tilde y_{k+1}$ in STEP 4 of Algorithm 1. Because $(y_{k+1}, u_{k+1}, \epsilon_{k+1})$ is a $\hat\sigma$-approximate solution at $(\lambda_{k+1}, \tilde x_k)$, one has that $u_{k+1} \in (\nabla f_{\tilde x_k} + \partial_{\epsilon_{k+1}} h)(y_{k+1})$, and so we have
\[
v_{k+1} \in \nabla f(y_{k+1}) - \nabla f_{\tilde x_k}(y_{k+1}) + (\nabla f_{\tilde x_k} + \partial_{\epsilon_{k+1}} h)(y_{k+1}) = \nabla f(y_{k+1}) + \partial_{\epsilon_{k+1}} h(y_{k+1}) \subseteq \partial_{\epsilon_{k+1}}(f + h)(y_{k+1}),
\]
which satisfies (3.1). To establish (3.2), we need the following proposition.

Proposition 3.2
Let $(y, u, \epsilon)$ be a $\hat\sigma$-approximate solution of (2.5) at $(\lambda, \tilde x)$ such that (3.7) holds. Define $v := \nabla f(y) + u - \nabla f_{\tilde x}(y)$. Then,
\[
\|\lambda v + y - \tilde x\|^2 + 2\lambda\epsilon \le \left(\hat\sigma + \lambda\,\frac{L_d + M}{d!}\,\|y - \tilde x\|^{d-1}\right)^2\|y - \tilde x\|^2. \tag{3.11}
\]
Consequently,
\[
\|\lambda v + y - \tilde x\|^2 + 2\lambda\epsilon \le \sigma^2\|y - \tilde x\|^2 \quad \text{with } \sigma = \sigma_u + \hat\sigma, \tag{3.12}
\]
where $\sigma_u$ is an input parameter in Algorithm 2 and also appears in (3.7).

Proof. First of all, according to Lemma 2.1, it follows that
\[
\lambda\|u - v\| = \lambda\|\nabla f(y) - \nabla f_{\tilde x}(y)\| \le \lambda\,\frac{L_d + M}{d!}\,\|y - \tilde x\|^d.
\]
Combining the above inequality with (2.7), one has that
\[
\begin{aligned}
\|\lambda v + y - \tilde x\|^2 + 2\lambda\epsilon
&\le \big(\|\lambda u + y - \tilde x\| + \lambda\|u - v\|\big)^2 + 2\lambda\epsilon \\
&= \big(\|\lambda u + y - \tilde x\|^2 + 2\lambda\epsilon\big) + 2\lambda\|u - v\|\,\|\lambda u + y - \tilde x\| + \lambda^2\|u - v\|^2 \\
&\le \hat\sigma^2\|y - \tilde x\|^2 + 2\left(\lambda\,\frac{L_d + M}{d!}\,\|y - \tilde x\|^d\right)\hat\sigma\|y - \tilde x\| + \left(\lambda\,\frac{L_d + M}{d!}\,\|y - \tilde x\|^d\right)^2 \\
&= \left(\hat\sigma + \lambda\,\frac{L_d + M}{d!}\,\|y - \tilde x\|^{d-1}\right)^2\|y - \tilde x\|^2,
\end{aligned}
\]
proving the first inequality. Then, by the right hand side of (3.7), $\lambda\,\frac{L_d + M}{d!}\,\|y - \tilde x\|^{d-1} \le \sigma_u$, and so the second inequality follows. $\Box$

We summarize the above discussion in the theorem below.
Theorem 3.3
Algorithm 2 is a manifestation of the A-HPE framework, and thus the results of Lemma 3.1 hold for the sequence generated by Algorithm 2.
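To make the outer loop concrete, the following sketch (ours) performs the bookkeeping of one iteration of Algorithm 2: forming $\tilde x_k$ via (3.8)–(3.9), building $v_{k+1}$ via (3.10), and updating $A_{k+1}$ and $x_{k+1}$. The callback `find_lambda_and_solution` is a hypothetical placeholder for the combination of the stepsize search and the ATS call specified in STEP 3, which in the paper is realized by the bisection procedure of Section 4.

```python
import numpy as np

def one_iteration(x_k, y_k, A_k, find_lambda_and_solution, grad_f, grad_f_model):
    """One outer iteration of Algorithm 2 (bookkeeping only).

    find_lambda_and_solution(x_tilde_of) is assumed to return
    (lam, x_tilde, y_new, u_new, eps_new) satisfying the requirements of STEP 3;
    grad_f(y) is the true gradient, grad_f_model(x_tilde, y) is grad f_{x_tilde}(y).
    """
    def x_tilde_of(lam):
        a = (lam + np.sqrt(lam ** 2 + 4.0 * lam * A_k)) / 2.0        # (3.9)
        return a, A_k / (A_k + a) * y_k + a / (A_k + a) * x_k        # (3.8)

    lam, x_tilde, y_new, u_new, eps_new = find_lambda_and_solution(x_tilde_of)
    a, _ = x_tilde_of(lam)

    v_new = grad_f(y_new) + u_new - grad_f_model(x_tilde, y_new)     # (3.10)
    A_new = A_k + a
    x_new = x_k - a * v_new
    return x_new, y_new, v_new, eps_new, A_new
```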
Before addressing the implementation of STEP 3 in Algorithm 2, let us first present the overall iteration complexity of Algorithm 2, assuming STEP 3 can be implemented.

Theorem 3.4
Let $D$ be the distance from $x_0$ to $X^*$. Then, for any integer $k \ge 1$, the iterate $y_k$ generated by Algorithm 2 satisfies
\[
F(y_k) - F^* \le \left(\frac{d+1}{2}\right)^{\frac{3d+1}{2}} \frac{2^d}{\big(1 - (\hat\sigma + \sigma_u)^2\big)^{\frac{d-1}{2}}\, d!\,\sigma_l}\, D^{d+1}(L_d + M)\, k^{-\frac{3d+1}{2}}.
\]

The above theorem establishes the $O(1/k^{(3d+1)/2})$ iteration complexity for Algorithm 2. Algorithm 2 falls into the category of the high-order oracle model, whose iteration complexity has a lower bound of $O(1/k^{(3d+1)/2})$; see Arjevani, Shamir and Shiff [31] and Nesterov [27]. The worst-case iteration complexity of Algorithm 2 matches this lower bound, and it is therefore an optimal method.

We first provide a recursive bound on $A_k$ as an intermediate step.

Proposition 3.5
Let $D$ be the distance from $x_0$ to $X^*$. Suppose $\{A_k\}_{k=1}^\infty$ is generated from Algorithm 2. Then
\[
A_k \ge \frac{1}{4}\, C^{-\frac{2p}{q}}\Big(\sum_{j=1}^k A_j^{\frac{1}{q}}\Big)^{2p}, \tag{3.13}
\]
where $q = \frac{3d+1}{d-1}$, $p = \frac{3d+1}{2d+2}$ and $C = \frac{D^2}{1 - (\hat\sigma + \sigma_u)^2}\left(\frac{d!\,\sigma_l}{L_d + M}\right)^{-\frac{2}{d-1}}$.

Proof. Suppose $\{x_k, y_k, \tilde x_k\}$ is the sequence generated by Algorithm 2. Then, according to (3.4) and Proposition 3.2, it holds that
\[
\sum_{j=1}^k \frac{A_j}{\lambda_j}\|y_j - \tilde x_{j-1}\|^2 \le \frac{D^2}{1 - (\hat\sigma + \sigma_u)^2},
\]
which together with the left hand side of (3.7) implies
\[
\sum_{j=1}^k A_j \lambda_j^{-\frac{d+1}{d-1}} = \sum_{j=1}^k \frac{A_j\|y_j - \tilde x_{j-1}\|^2}{\lambda_j}\cdot\frac{1}{\big(\lambda_j\|y_j - \tilde x_{j-1}\|^{d-1}\big)^{\frac{2}{d-1}}} \le \frac{D^2}{1 - (\hat\sigma + \sigma_u)^2}\left(\frac{d!\,\sigma_l}{L_d + M}\right)^{-\frac{2}{d-1}} = C. \tag{3.14}
\]
By the definition of $p$ and $q$, we have $\frac{1}{p} + \frac{1}{q} = 1$. Using Hölder's inequality, together with (3.14), we have
\[
\Big(\sum_{j=1}^k \sqrt{\lambda_j}\Big)^{\frac{1}{p}} C^{\frac{1}{q}} \ge \Big(\sum_{j=1}^k \sqrt{\lambda_j}\Big)^{\frac{1}{p}}\Big(\sum_{j=1}^k A_j \lambda_j^{-\frac{d+1}{d-1}}\Big)^{\frac{1}{q}} \ge \sum_{j=1}^k \lambda_j^{\frac{1}{2p}}\, A_j^{\frac{1}{q}}\, \lambda_j^{-\frac{d+1}{q(d-1)}} = \sum_{j=1}^k A_j^{\frac{1}{q}}.
\]
Consequently, by (3.5),
\[
A_k \ge \frac{1}{4}\Big(\sum_{j=1}^k \sqrt{\lambda_j}\Big)^2 \ge \frac{1}{4}\, C^{-\frac{2p}{q}}\Big(\sum_{j=1}^k A_j^{\frac{1}{q}}\Big)^{2p}. \qquad \Box
\]

Proof of Theorem 3.4.
Let $p$, $q$ and $C$ be defined as in Proposition 3.5. Construct $\{B_i\}$ such that $B_1 = A_1$ and
\[
B_i = T^{-\frac{1 - (2p/q)^{i-1}}{1 - 2p/q}}\,(A_1)^{(2p/q)^{i-1}} \quad \text{for } i \ge 2, \quad \text{where } T := 4\, C^{\frac{2p}{q}}\left(\frac{d+1}{2}\right)^{2p}.
\]
Next, we shall apply induction to show that for any $k \ge 1$,
\[
A_k \ge B_i k^{r_i}, \quad \forall\, i \ge 1, \tag{3.15}
\]
where $r_i = \frac{3d+1}{2}\big[1 - (2p/q)^{i-1}\big]$. When $i = 1$, this is obvious because $A_k \ge A_1 = B_1 k^{r_1}$. Now suppose that for any $k \ge 1$, $A_k \ge B_i k^{r_i}$ for some $i$. Then, by the induction hypothesis and (3.13) it holds that
\[
\begin{aligned}
A_k \ge \frac{1}{4} C^{-\frac{2p}{q}}\Big(\sum_{j=1}^k A_j^{\frac{1}{q}}\Big)^{2p}
&\ge \frac{1}{4} C^{-\frac{2p}{q}}\Big(\sum_{j=1}^k (B_i j^{r_i})^{\frac{1}{q}}\Big)^{2p}
= \frac{1}{4}\left(\frac{B_i}{C}\right)^{\frac{2p}{q}}\Big(\sum_{j=1}^k j^{\frac{r_i}{q}}\Big)^{2p} \\
&\ge \frac{1}{4}\left(\frac{B_i}{C}\right)^{\frac{2p}{q}}\left(\int_0^k x^{\frac{r_i}{q}}\,dx\right)^{2p}
= \frac{1}{4}\left(\frac{B_i}{C}\right)^{\frac{2p}{q}}\left(\frac{1}{1 + r_i/q}\right)^{2p} k^{2p\left(\frac{r_i}{q}+1\right)} \\
&\ge \frac{1}{4}\left(\frac{B_i}{C}\right)^{\frac{2p}{q}}\left(\frac{2}{d+1}\right)^{2p} k^{2p\left(\frac{r_i}{q}+1\right)},
\end{aligned} \tag{3.16}
\]
where the last inequality follows from
\[
\frac{1}{1 + r_i/q} = \frac{1}{1 + \frac{d-1}{2}\big[1 - (2p/q)^{i-1}\big]} \ge \frac{1}{1 + \frac{d-1}{2}} = \frac{2}{d+1}.
\]
Let us further simplify the expression. First of all, from the definition of $T$ and $B_i$, one observes that
\[
\frac{1}{4}\left(\frac{B_i}{C}\right)^{\frac{2p}{q}}\left(\frac{2}{d+1}\right)^{2p} = \frac{B_i^{\frac{2p}{q}}}{T} = \frac{1}{T}\left[T^{-\frac{1 - (2p/q)^{i-1}}{1 - 2p/q}} A_1^{(2p/q)^{i-1}}\right]^{\frac{2p}{q}} = T^{-\frac{1 - (2p/q)^{i}}{1 - 2p/q}} A_1^{(2p/q)^{i}} = B_{i+1}. \tag{3.17}
\]
Then, the construction of $q$ and $r_i$ implies that
\[
2p\left(\frac{r_i}{q} + 1\right) = \frac{3d+1}{d+1}\left(\frac{d-1}{2}\big[1 - (2p/q)^{i-1}\big] + 1\right) = \frac{3d+1}{d+1}\left(\frac{d+1}{2} - \frac{d-1}{2}(2p/q)^{i-1}\right) = \frac{3d+1}{2}\big(1 - (2p/q)^{i}\big) = r_{i+1}, \tag{3.18}
\]
where the second-to-last equality holds true due to the fact that $2p/q = (d-1)/(d+1)$. Now the desired inequality (3.15) follows by combining (3.16), (3.17) and (3.18).

Observe that $2p/q = (d-1)/(d+1) < 1$, so $\lim_{i\to\infty} B_i = T^{-\frac{1}{1-2p/q}} = T^{-\frac{d+1}{2}}$ and $\lim_{i\to\infty} r_i = \frac{3d+1}{2}$. Finally, by letting $i \to \infty$ in (3.15) and using the definition of $C$ in (3.14), we have
\[
A_k \ge T^{-\frac{d+1}{2}} k^{\frac{3d+1}{2}} = \left(\frac{1}{2}\right)^{d+1}\frac{d!\,\sigma_l}{L_d + M}\left(\frac{1 - (\hat\sigma + \sigma_u)^2}{D^2}\right)^{\frac{d-1}{2}}\left(\frac{2}{d+1}\right)^{\frac{3d+1}{2}} k^{\frac{3d+1}{2}}.
\]
Combining it with (3.3), we have
\[
F(y_k) - F^* \le \frac{D^2}{2A_k} \le \left(\frac{d+1}{2}\right)^{\frac{3d+1}{2}} \frac{2^d}{\big(1 - (\hat\sigma + \sigma_u)^2\big)^{\frac{d-1}{2}}\, d!\,\sigma_l}\, D^{d+1}(L_d + M)\, k^{-\frac{3d+1}{2}}. \qquad \Box
\]

In Nesterov's accelerated tensor method [27], an auxiliary function
\[
\psi_k(x) = l_k(x) + M\|x - x_0\|^{d+1} \tag{3.19}
\]
with $l_k$ being some linear function, is constructed to satisfy
\[
\mathcal{R}^1_k:\ \beta_k := \min_x \psi_k(x) - A_k F(y_k) \ge 0, \qquad
\mathcal{R}^2_k:\ \psi_k(x) \le A_k F(x) + M\|x - x_0\|^{d+1}, \ \forall\, x \in \mathbb{R}^n,
\]
where $A_k = \Theta(k^{d+1})$. In fact, the function $\psi_k(x)$ serves as a bridge to guarantee the following relation:
\[
A_k F(y_k) \le \min_x \psi_k(x) \le \psi_k(x^*) \le A_k F^* + M\|x^* - x_0\|^{d+1}. \tag{3.20}
\]
As a result, $F(y_k) - F^* \le \frac{M}{A_k}\|x^* - x_0\|^{d+1}$, yielding the iteration complexity of $O(1/k^{d+1})$.

In the implementation of the high-order A-HPE framework, it is crucial to ensure that condition (3.7) is satisfied. In the remainder of the paper, we shall focus on how to satisfy (3.7) in STEP 3 of Algorithm 2. Our approach is to use bisection on a parameter $\lambda$ (to be introduced later), while calling an Approximate Tensor Subroutine (ATS). Observe that $\tilde x$, which is the point used to define $f_{\tilde x}(y)$ in (2.4) to approximate the smooth function $f(y)$, is heavily dependent on $\lambda$. In other words, we need to search for the point at which the Taylor expansion (2.4) is to be computed. This is a key difference between the A-HPE framework and Nesterov's approach [27]. Once condition (3.7) is satisfied, then inequality (3.3) follows, which leads to the following tighter estimation than (3.20):
\[
A_k F(y_k) + \beta_k \le A_k F^* + \frac{1}{2}\|x^* - x_0\|^2,
\]
where $\beta_k = \frac{1-\sigma^2}{2}\sum_{j=1}^k \frac{A_j}{\lambda_j}\|\tilde y_j - \tilde x_{j-1}\|^2 \ge 0$. Together with the lower bound (3.6), this gives a better lower bound on $A_k$, namely $A_k \ge O(k^{\frac{3d+1}{2}})$, which leads to the optimal iteration complexity presented in Theorem 3.4.

4 The Line-Search Subroutine

After establishing the overall iteration complexity for Algorithm 2, it remains to find a way to implement STEP 3 of the algorithm. In this section we discuss how this can be done, from a special case to the general one; the idea is better illustrated by first considering the special case. For the general composite objective function, assuming the tensor proximal mapping regarding $h(x)$ can be computed, our approach is based on a line-search procedure for the point at which the Taylor expansion is computed.

Let us first consider a special case of Algorithm 2 where $F(x) = f(x)$ in the objective function and $y_{k+1}$ is the exact solution of the following convex tensor proximal point problem:
\[
\min_y\ f_{\tilde x_k}(y) + \frac{1}{2\lambda_{k+1}}\|y - \tilde x_k\|^2.
\]
We shall discuss how to find $\lambda_{k+1}$ to satisfy the alternative condition in STEP 3 of Algorithm 2. Note that for fixed $x_k$ and $y_k$, $\tilde x_k$ and $y_{k+1}$ are uniquely determined by $\lambda_{k+1}$. Therefore the functions $\tilde x_k(\lambda)$ and $y_{k+1}(\lambda)$ are continuous with respect to $\lambda$ (where we denote $\lambda_{k+1}$ by $\lambda$).
Next, we show that:

(i) $\lambda\,\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\|^{d-1} \to 0$ as $\lambda \to 0$;

(ii) either there exists an increasing sequence $\lambda_j \uparrow \infty$ such that $\lambda_j\|y_{k+1}(\lambda_j) - \tilde x_k(\lambda_j)\|^{d-1} \to \infty$ as $j \to \infty$, or there exists $\hat\lambda$ such that $\|\nabla f(y_{k+1}(\lambda))\| \le \bar\rho$ for any $\lambda \ge \hat\lambda$.

Observe that
\[
f_{\tilde x_k(\lambda)}(y_{k+1}(\lambda)) + \frac{1}{2\lambda}\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\|^2 = \min_y\ f_{\tilde x_k(\lambda)}(y) + \frac{1}{2\lambda}\|y - \tilde x_k(\lambda)\|^2 \le f_{\tilde x_k(\lambda)}(\tilde x_k(\lambda)) = f(\tilde x_k(\lambda)) < \infty, \quad \forall\, \lambda > 0,
\]
and $f(\tilde x_k(\lambda))$ is bounded, since $\tilde x_k(\lambda)$ is a convex combination of $x_k$ and $y_k$. Letting $\lambda \to 0$, we obtain $\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\| \to 0$, which implies $\lambda\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\|^{d-1} \to 0$ as $\lambda \to 0$, proving (i).

To prove (ii), it suffices to show that if the "either" part does not hold, then the "or" part must hold. In that case, there must exist $C_0 > 0$ such that, as $\lambda \to \infty$, $\lambda\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\|^{d-1} \le C_0$, and thus $\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\| \to 0$. Moreover, for any $\lambda > 0$,
\[
\nabla f_{\tilde x_k(\lambda)}(y_{k+1}(\lambda)) + \frac{1}{\lambda}\big(y_{k+1}(\lambda) - \tilde x_k(\lambda)\big) = 0.
\]
Letting $\lambda \to \infty$ in the above identity yields that $\nabla f_{\tilde x_k(\lambda)}(y_{k+1}(\lambda)) \to 0$. Recall that in this case we have $\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\| \to 0$; thus $\nabla f(y_{k+1}(\lambda)) \to 0$ as well, proving (ii).

Consequently, either $\lambda\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\|^{d-1} \to 0$ as $\lambda \to 0$ while $\lambda_j\|y_{k+1}(\lambda_j) - \tilde x_k(\lambda_j)\|^{d-1} \to \infty$ as $j \to \infty$, which guarantees the existence of $\lambda$ satisfying (3.7) due to the continuity of $\lambda\|y_{k+1}(\lambda) - \tilde x_k(\lambda)\|^{d-1}$ in $\lambda$; or we have a $\lambda_{k+1}$ such that $\|\nabla f(y_{k+1}(\lambda))\| \le \bar\rho$. In the latter case, since $h(x)$ is not present, $u_{k+1} = \nabla f_{\tilde x_k}(y_{k+1})$ and $\|\nabla f(y_{k+1}) + u_{k+1} - \nabla f_{\tilde x_k}(y_{k+1})\| = \|\nabla f(y_{k+1})\| \le \bar\rho$. Therefore, we have shown that the alternative condition in STEP 3 is indeed satisfiable.

To present the algorithm that computes $\lambda$ satisfying the conditions in STEP 3, we first introduce $\beta_{k+1} = \frac{a_{k+1}}{A_k + a_{k+1}}$. From (3.9), we can see that $\lambda_{k+1} = \frac{a_{k+1}^2}{A_k + a_{k+1}}$. Therefore, we are able to represent $\lambda_{k+1}$ and $\tilde x_k$ by means of $\beta_{k+1}$:
\[
\lambda_{k+1} = \frac{A_k\,\beta_{k+1}^2}{1 - \beta_{k+1}}, \qquad \tilde x_k = \beta_{k+1}\, x_k + (1 - \beta_{k+1})\, y_k.
\]
In the $k$-th iteration, we denote
\[
\lambda(\beta) = \frac{A_k\,\beta^2}{1 - \beta}, \quad \beta \in (0, 1). \tag{4.1}
\]
Its inverse on the domain $\lambda > 0$ is
\[
\beta(\lambda) = \frac{\sqrt{\lambda^2 + 4\lambda A_k} - \lambda}{2A_k},
\]
which is monotonically increasing. We shall perform bisection on $\beta$ instead of $\lambda$ in STEP 3 of Algorithm 2 to search for $\lambda_{k+1}$. In that way, the initial interval for the bisection is $[0, 1]$.

Algorithm 3
Bisection on β based on the subroutine ATS

INPUT: $M \ge L_d$, $\hat\sigma \ge 0$, $0 < \sigma_l < \sigma_u < 1$ with $\sigma := \hat\sigma + \sigma_u < 1$ and $\sigma_l(1+\hat\sigma)^{d-1} < \sigma_u(1-\hat\sigma)^{d-1}$, tolerances $\bar\rho > 0$ and $\bar\epsilon > 0$.

STEP 1. Let $\alpha_+ = \frac{d!\,\sigma_u}{L_d + M}$ and $\alpha_- = \frac{d!\,\sigma_l}{L_d + M}$.

STEP 2. (Bisection Setup) Set $\beta_- = 0$, $\beta_+ = 1$, $\lambda_+ = \lambda(\beta_+) = +\infty$, $\lambda_- = \lambda(\beta_-) = 0$.

STEP 2.a. Let $\beta = \frac{\beta_- + \beta_+}{2}$ and let
\[
\lambda_\beta = \lambda(\beta), \qquad x_\beta = (1 - \beta)\, y_k + \beta\, x_k, \tag{4.2}
\]
and use ATS to compute $(y_\beta, u_\beta, \epsilon_\beta)$ as a $\hat\sigma$-approximate solution at $(\lambda_\beta, x_\beta)$, and $v_\beta = \nabla f(y_\beta) - \nabla f_{x_\beta}(y_\beta) + u_\beta$.

if $\|v_\beta\| \le \bar\rho$ and $\epsilon_\beta \le \bar\epsilon$ then output $(\lambda_\beta, x_\beta, y_\beta, u_\beta, \epsilon_\beta)$ and STOP.
else if $\lambda_\beta\|y_\beta - x_\beta\|^{d-1} \in [\alpha_-, \alpha_+]$ then set $(\beta_{k+1}, \tilde x_k, y_{k+1}, v_{k+1}) = (\beta, x_\beta, y_\beta, v_\beta)$ and STOP.
else if $\lambda_\beta\|y_\beta - x_\beta\|^{d-1} > \alpha_+$ then set $\beta_+ \leftarrow \beta$, and go to STEP 2.a.
else if $\lambda_\beta\|y_\beta - x_\beta\|^{d-1} < \alpha_-$ then set $\beta_- \leftarrow \beta$, and go to STEP 2.a.
end if

We remark that the conditions on $\bar\rho$ and $\bar\epsilon$ are only used in the final stage of the algorithm to decide on a point that is close to the optimum. In an implementation, it is reasonable to set a lower precision at the beginning stage of the algorithm. Now an upper bound for the overall number of iterations required by Algorithm 3 is presented in the following theorem, whose proof is postponed to the subsequent section.
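The following is a minimal Python sketch of Algorithm 3, assuming a black-box callable `ats(lam, x)` that returns a $\hat\sigma$-approximate solution of (2.5) at $(\lambda, x)$. The helper names, the `max_steps` safeguard, and the return convention are ours and purely illustrative.

```python
import numpy as np

def bisection_on_beta(x_k, y_k, A_k, d, alpha_minus, alpha_plus,
                      ats, grad_f, grad_f_model,
                      rho_bar, eps_bar, max_steps=100):
    """Minimal sketch of Algorithm 3 (bisection on beta, calling ATS)."""
    lam_of = lambda beta: A_k * beta ** 2 / (1.0 - beta)        # (4.1)
    beta_minus, beta_plus = 0.0, 1.0

    for _ in range(max_steps):
        beta = 0.5 * (beta_minus + beta_plus)
        lam = lam_of(beta)
        x_beta = (1.0 - beta) * y_k + beta * x_k                # (4.2)
        y_beta, u_beta, eps_beta = ats(lam, x_beta)
        v_beta = grad_f(y_beta) - grad_f_model(x_beta, y_beta) + u_beta

        if np.linalg.norm(v_beta) <= rho_bar and eps_beta <= eps_bar:
            return lam, x_beta, y_beta, u_beta, eps_beta        # near-optimal point found
        residual = lam * np.linalg.norm(y_beta - x_beta) ** (d - 1)
        if alpha_minus <= residual <= alpha_plus:
            return lam, x_beta, y_beta, u_beta, eps_beta        # (3.7) satisfied
        elif residual > alpha_plus:
            beta_plus = beta                                    # lambda too large
        else:
            beta_minus = beta                                   # lambda too small
    raise RuntimeError("bisection did not terminate within max_steps")
```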
Theorem 4.1

Algorithm 3 needs to perform no more than
\[
\Theta\big(\max\{\log_2(\bar\epsilon^{-1}),\ \log_2(\bar\rho^{-1})\}\big) \tag{4.3}
\]
bisection steps before reaching $\lambda_{k+1} > 0$ and a $\hat\sigma$-approximate solution $(y_{k+1}, u_{k+1}, \epsilon_{k+1})$ at $(\lambda_{k+1}, \tilde x_k(\lambda_{k+1}))$ satisfying
\[
\alpha_- \le \lambda_{k+1}\|\tilde x_k(\lambda_{k+1}) - y_{k+1}\|^{d-1} \le \alpha_+,
\]
or returning $v_{k+1}$ and $\epsilon_{k+1}$ such that $\|v_{k+1}\| \le \bar\rho$ and $\epsilon_{k+1} \le \bar\epsilon$.

In this subsection, we establish the iteration bound of Algorithm 3 and give a proof of Theorem 4.1. First, we review some facts about maximal monotone operators. For a point-to-set operator $T: \mathbb{R}^n \rightrightarrows \mathbb{R}^n$, its graph is defined as
\[
\mathrm{Gr}(T) = \{(z, v) \in \mathbb{R}^n \times \mathbb{R}^n \mid v \in T(z)\},
\]
and the operator $T$ is called monotone if
\[
\langle v - \tilde v, z - \tilde z\rangle \ge 0 \quad \forall\, (z, v), (\tilde z, \tilde v) \in \mathrm{Gr}(T),
\]
and $T$ is maximal monotone if it is monotone and maximal in the family of monotone operators with respect to the partial order of inclusion. Given a maximal monotone operator $T: \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ and a scalar $\epsilon \ge 0$, the associated $\epsilon$-enlargement $T^\epsilon: \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is defined as
\[
T^\epsilon(z) = \{v \in \mathbb{R}^n \mid \langle z - \tilde z, v - \tilde v\rangle \ge -\epsilon, \ \forall\, \tilde z \in \mathbb{R}^n,\ \tilde v \in T(\tilde z)\}, \quad \forall\, z \in \mathbb{R}^n.
\]
For a convex function $f$, its subdifferential $\partial f$ is monotone if $f$ is a proper function. If $f$ is a proper lower semicontinuous convex function, then $\partial f$ is maximal monotone [28].

Recall that the optimality condition of subproblem (2.5) is characterized by (2.6), which is
\[
0 \in \lambda(\nabla f_x + \partial h)(y) + y - x = \big(\lambda(\nabla f_x + \partial h) + I\big)(y) - x.
\]
Furthermore, $x$ is optimal to (1.1) if and only if $y = x$. Therefore, it is natural to consider the residual
\[
\varphi(\lambda; x) := \lambda\left\|\big(I + \lambda(\nabla f_x + \partial h)\big)^{-1}(x) - x\right\|
\]
for any $\lambda > 0$, $x \in \mathbb{R}^n$. The above residual was adopted in [24] for the quadratic subproblem. In this paper, to accommodate the high-order information, we consider the following modified residual:
\[
\psi(\lambda; x) := \lambda\left\|\big(I + \lambda(\nabla f_x + \partial h)\big)^{-1}(x) - x\right\|^{d-1}.
\]
We have an immediate property regarding $\psi(\cdot)$.

Proposition 4.2
Let $x \in \mathbb{R}^n$, $\lambda > 0$ and $\hat\sigma \ge 0$. If $(y, u, \epsilon)$ is a $\hat\sigma$-approximate solution of (2.5) at $(\lambda, x)$, then
\[
\lambda(1 - \hat\sigma)^{d-1}\|y - x\|^{d-1} \le \psi(\lambda; x) \le \lambda(1 + \hat\sigma)^{d-1}\|y - x\|^{d-1}. \tag{4.4}
\]

Proof.
From Proposition 7.3 in [24], it holds that
\[
\big(\lambda(1 - \hat\sigma)\|y - x\|\big)^{d-1} \le \varphi^{d-1}(\lambda; x) \le \big(\lambda(1 + \hat\sigma)\|y - x\|\big)^{d-1}. \tag{4.5}
\]
Notice that $\varphi^{d-1}(\lambda; x) = \lambda^{d-2}\,\psi(\lambda; x)$, and so (4.4) readily follows by combining the above inequalities and identity. $\Box$

Lemma 4.3
Let scalars $\bar\rho > 0$, $\bar\epsilon > 0$, $\hat\sigma \ge 0$ and $\alpha > 0$ be given and satisfy $\hat\sigma + \frac{L_d + M}{d!}\alpha =: \sigma_0 < 1$. Suppose
\[
\lambda \ge \max\left\{\alpha^{\frac{1}{d}}\left[\frac{\bar\rho}{1 + \sigma_0}\right]^{-\frac{d-1}{d}},\ \left(\frac{\sigma_0^2\,\alpha^{\frac{2}{d-1}}}{2\bar\epsilon}\right)^{\frac{d-1}{d+1}}\right\}, \tag{4.6}
\]
and $(y, u, \epsilon)$ is a $\hat\sigma$-approximate solution of (2.5) at $(\lambda, x)$ for some vector $x \in \mathbb{R}^n$. Then, one of the following holds: either (a) $\lambda\|y - x\|^{d-1} > \alpha$; or (b) the vector $v := \nabla f(y) - \nabla f_x(y) + u$ satisfies
\[
v \in \big(\nabla f + (\partial h)^\epsilon\big)(y), \qquad \|v\| \le \bar\rho, \qquad \epsilon \le \bar\epsilon. \tag{4.7}
\]

Proof. Suppose that $\lambda$ satisfies (4.6) but not (a), namely
\[
\lambda\|y - x\|^{d-1} \le \alpha. \tag{4.8}
\]
In that case, recall that $\partial_\epsilon h$ is the $\epsilon$-subdifferential of $h$ and $(\partial h)^\epsilon$ is the $\epsilon$-enlargement of the operator $\partial h$. According to Proposition 3 in [9], one has $\partial_\epsilon h(x) \subseteq (\partial h)^\epsilon(x)$ for any $\epsilon \ge 0$ and $x \in \mathbb{R}^n$. Therefore, the inclusion in (4.7) directly follows from Proposition 3.2. Moreover, inequality (3.11) leads to
\[
\lambda\|v\| - \|y - x\| \le \|\lambda v + y - x\| \le \left(\hat\sigma + \lambda\,\frac{L_d + M}{d!}\,\|y - x\|^{d-1}\right)\|y - x\|.
\]
Together with (4.6) and (4.8), the above inequality yields
\[
\|v\| \le \frac{1}{\lambda}\left(1 + \hat\sigma + \frac{L_d + M}{d!}\,\lambda\|y - x\|^{d-1}\right)\|y - x\| \le \frac{1}{\lambda}\,(1 + \sigma_0)\left(\frac{\alpha}{\lambda}\right)^{\frac{1}{d-1}} \le \bar\rho.
\]
On the other hand, inequality (3.11) also implies that
\[
2\lambda\epsilon \le \left(\hat\sigma + \frac{L_d + M}{d!}\,\lambda\|y - x\|^{d-1}\right)^2\|y - x\|^2 \le \left(\hat\sigma + \frac{L_d + M}{d!}\,\alpha\right)^2\|y - x\|^2 = \sigma_0^2\,\|y - x\|^2.
\]
Combined with (4.6) and (4.8), this leads to
\[
\epsilon \le \frac{\sigma_0^2\,\|y - x\|^2}{2\lambda} \le \frac{\sigma_0^2}{2\lambda}\left(\frac{\alpha}{\lambda}\right)^{\frac{2}{d-1}} \le \bar\epsilon.
\]
Hence, (b) must hold in this case. $\Box$
In the rest of this section, we simply let $\alpha = \alpha_-$ in Lemma 4.3 and denote
\[
\bar\lambda = \max\left\{\alpha_-^{\frac{1}{d}}\left[\frac{\bar\rho}{1 + \hat\sigma + \frac{L_d + M}{d!}\alpha_-}\right]^{-\frac{d-1}{d}},\ \left(\frac{\sigma_0^2\,\alpha_-^{\frac{2}{d-1}}}{2\bar\epsilon}\right)^{\frac{d-1}{d+1}}\right\}. \tag{4.9}
\]
Lemma 4.3 implies that if $\lambda$ is sufficiently large, then either Algorithm 3 stops because (4.7) is satisfied, or $\lambda\|y - x\|^{d-1} \ge \alpha_-$, which achieves half of the bisection goal. Now we are ready to prove Theorem 4.1.

Proof of Theorem 4.1. Suppose that Algorithm 3 has performed $j$ bisection steps before triggering the stopping criteria. We aim to show $j \le \Theta\big(\max\{\log_2(\bar\epsilon^{-1}), \log_2(\bar\rho^{-1})\}\big)$. At that iteration let us denote $x_+ = x_{\beta_+}$, $x_- = x_{\beta_-}$, $y_+ = y_{\beta_+}$ and $y_- = y_{\beta_-}$, and we also have $\beta_+ - \beta_- = 2^{-j}$. Denote $\bar\beta = \beta(\bar\lambda)$, where $\bar\lambda$ is as defined in (4.9). If $\bar\beta \le \frac{1}{2}$ then $\frac{1}{1 - \bar\beta} \le 2$; if $\bar\beta > \frac{1}{2}$, then (4.1) gives
\[
\frac{1}{1 - \bar\beta} = \frac{\bar\lambda}{A_k\,\bar\beta^2} < \frac{4\bar\lambda}{A_k} \le \max\left\{\Theta\big((\bar\rho^{-1})^{\frac{d-1}{d}}\big),\ \Theta\big((\bar\epsilon^{-1})^{\frac{d-1}{d+1}}\big)\right\}.
\]
Therefore, in the rest of the proof we may assume $j \ge \log_2\big(2/(1 - \bar\beta)\big)$, for otherwise $j < \log_2\big(2/(1 - \bar\beta)\big) \le \Theta\big(\max\{\log_2(\bar\rho^{-1}), \log_2(\bar\epsilon^{-1})\}\big)$ already holds.

Note that the bisection search starts with $\beta_+ = 1$, corresponding to $\lambda_+ = +\infty$ according to (4.1), if $\beta_+$ is never updated during the procedure. However, the following lemma tells us that after running Algorithm 3 for a number of iterations, $\lambda_+$ will be reduced and upper bounded by some constant depending on $\bar\epsilon$ and $\bar\rho$.

Lemma 4.4
Suppose that Algorithm 3 has performed $j$ bisection steps with $j \ge \log_2\big(2/(1 - \bar\beta)\big)$, where $\bar\beta = \beta(\bar\lambda)$ and $\bar\lambda$ are as defined in (4.9). Then we have
\[
\lambda_+ \le \max\left\{\frac{9A_k}{4},\ \frac{9\bar\lambda}{2}\right\} = \max\left\{\Theta(\bar\epsilon^{-1}),\ \Theta\big((\bar\rho^{-1})^{\frac{d+1}{d}}\big)\right\}. \tag{4.10}
\]

We shall continue our discussion without disruption here and leave the proof of Lemma 4.4 to the appendix. Since Algorithm 3 did not stop before iteration $j$, the bound $\beta_+$ must have been previously updated, and so
\[
\lambda_+\|y_{\beta_+} - x_{\beta_+}\|^{d-1} > \alpha_+, \qquad \lambda_-\|y_{\beta_-} - x_{\beta_-}\|^{d-1} < \alpha_-,
\]
where $\lambda_+$ is upper bounded due to Lemma 4.4. By Proposition 4.2, we have that
\[
\psi_+ := \psi(\lambda_+; x_+) \ge \lambda_+(1 - \hat\sigma)^{d-1}\|y_+ - x_+\|^{d-1} > (1 - \hat\sigma)^{d-1}\alpha_+,
\]
\[
\psi_- := \psi(\lambda_-; x_-) \le \lambda_-(1 + \hat\sigma)^{d-1}\|y_- - x_-\|^{d-1} < (1 + \hat\sigma)^{d-1}\alpha_-.
\]
Consequently,
\[
\psi_+ - \psi_- > (1 - \hat\sigma)^{d-1}\alpha_+ - (1 + \hat\sigma)^{d-1}\alpha_-. \tag{4.11}
\]
The parameters $\alpha_+$ and $\alpha_-$ are pre-specified. Therefore, it suffices to show that $\psi_+ - \psi_-$ is upper bounded by $\beta_+ - \beta_-$ multiplied by some constant factor, and hence the number of bisection steps $j$ can be bounded as well. To this end, denote
\[
\bar y_+ = \big(I + \lambda_+(\nabla f_{x_+} + \partial h)\big)^{-1}(x_+) \quad \text{and} \quad \bar y_- = \big(I + \lambda_-(\nabla f_{x_-} + \partial h)\big)^{-1}(x_-). \tag{4.12}
\]
Then, there exists
\[
\bar u_+ \in (\nabla f_{x_+} + \partial h)(\bar y_+) \quad \text{s.t.} \quad \lambda_+\bar u_+ = x_+ - \bar y_+, \qquad \psi_+ = \lambda_+\|\bar y_+ - x_+\|^{d-1} = \lambda_+^d\|\bar u_+\|^{d-1}, \tag{4.13}
\]
and
It holds that
\[
\lambda_-\|\bar u_+ - \bar u_-\| \le \lambda_-\|\nabla f_{x_+}(\bar y_+) - \nabla f_{x_-}(\bar y_+)\| + |\lambda_+ - \lambda_-|\,\|\bar y_+ - x_+\| + \lambda_-\|x_+ - x_-\|. \tag{4.15}
\]

Note that
\[
\left|a^{d-1} - b^{d-1}\right| = \left|(a - b)\big(a^{d-2} + a^{d-3}b + \cdots + b^{d-2}\big)\right| \le (d-1)\,|a - b|\,\max\{a, b\}^{d-2}, \tag{4.16}
\]
for any $a, b >$
0. Now combining (4.13), (4.14), (4.15) and (4.16) we have | ψ + − ψ − | = (cid:12)(cid:12)(cid:12) λ d + k ¯ u + k d − − λ d − k ¯ u − k d − (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) λ d + − λ d − (cid:12)(cid:12)(cid:12) k ¯ u + k d − + (cid:12)(cid:12)(cid:12) k ¯ u + k d − − k ¯ u − k d − (cid:12)(cid:12)(cid:12) λ d − ≤ | λ + − λ − | dλ d − k ¯ u + k d − + k ¯ u + − ¯ u − k ( d −
1) max {k ¯ u + k , k ¯ u − k} d − λ d − = | λ + − λ − | dλ d − k ¯ u + k d − + ( d − k ¯ u + − ¯ u − k max {k λ − ¯ u + k , k λ − ¯ u − k} d − λ − ≤ | λ + − λ − | dλ d − k ¯ u + k d − + ( d − k ¯ u + − ¯ u − k max {k λ + ¯ u + k , k λ − ¯ u − k} d − λ − = d | λ + − λ − | k ¯ y + − x + k d − + ( d − k ¯ u + − ¯ u − k max {k x + − ¯ y + k , k x − − ¯ y − k} d − λ − ≤ d | λ + − λ − | k ¯ y + − x + k d − + ( d −
1) max {k x + − ¯ y + k , k x − − ¯ y − k} d − × (cid:0) λ − k∇ f x + (¯ y + ) − ∇ f x − (¯ y + ) k + | λ + − λ − |k ¯ y + − x + k + λ − k x + − x − k (cid:1) . (4.17)Next, by applying (1.9), (4.9), Lemma A.2, Lemma A.5 and Lemma A.6, we have λ − ≤ ¯ λ = max n Θ (cid:16) ¯ ǫ − d − d +1 (cid:17) , Θ (cid:16) ¯ ρ − d − d (cid:17) o ≤ max n Θ(¯ ǫ − ) , Θ (cid:16) ¯ ρ − d +1 d (cid:17)o ,λ + − λ − ≤ max n Θ (cid:0) ¯ ǫ − (cid:1) , Θ (cid:16) ¯ ρ − d +1) d (cid:17)o ( β + − β − ) , k x + − ¯ y + k ≤ max n Θ(¯ ǫ − ) , Θ (cid:16) ¯ ρ − d +1 d (cid:17)o , k x − − ¯ y − k ≤ max n Θ (cid:16) ¯ ǫ − d − d +1 (cid:17) , Θ (cid:16) ¯ ρ − d − d (cid:17)o ≤ max n Θ(¯ ǫ − ) , Θ (cid:16) ¯ ρ − d +1 d (cid:17)o , k∇ f x + (¯ y + ) − ∇ f x − (¯ y + ) k ≤ max n Θ (cid:16) ¯ ǫ − d +1 (cid:17) , Θ (cid:16) ¯ ρ − ( d − d +1) d (cid:17)o ( β + − β − ) . | ψ + − ψ − |≤ d max (cid:26) Θ (cid:16) ¯ ǫ − d − (cid:17) , Θ (cid:18) ¯ ρ − ( d +1)2 d (cid:19)(cid:27) ( β + − β − ) + ( d −
1) max n Θ (cid:16) ¯ ǫ − d +2 (cid:17) , Θ (cid:16) ¯ ρ − ( d +1)( d − d (cid:17)o × (cid:26) Θ (cid:16) ¯ ǫ − d − (cid:17) , Θ (cid:18) ¯ ρ − ( d +1)2 d (cid:19)(cid:27) + max n Θ (cid:0) ¯ ǫ − (cid:1) , Θ (cid:16) ¯ ρ − d +1) d (cid:17)o + max n Θ (cid:0) ¯ ǫ − (cid:1) , Θ (cid:16) ¯ ρ − ( d +1) d (cid:17)o ! ( β + − β − ) ≤ max n Θ (cid:16) ¯ ǫ − d +1 (cid:17) , Θ (cid:16) ¯ ρ − (2 d − d +1) d (cid:17)o ( β + − β − ) , where the last inequality is due to d ≥
2. Because β + − β − = j , from (4.11) we have(1 − ˆ σ ) d − α + − (1 + ˆ σ ) d − α − ≤ max n Θ (cid:16) ¯ ǫ − d +1 (cid:17) , Θ (cid:16) ¯ ρ − (2 d − d +1) d (cid:17)o j . The left hand side of the above inequality is a positive constant. Therefore, j ≤ Θ (cid:0) max { log (¯ ǫ − ) , log (¯ ρ − ) } (cid:1) as required. (cid:3) Remark 4.6
In fact, we can quantify the constants in the proof of Theorem 4.1 more explicitly, andobtain the exact form of the bound Θ (cid:0) max { log (¯ ǫ − ) , log (¯ ρ − ) } (cid:1) . Recall that ¯ λ = max α /d − (cid:20) ρ (1 + ˆ σ + L d + Md ! α − ) (cid:21) − /d , σ α / ( d − − ǫ ! d − d +1 and D = (cid:18) √ − σ D (cid:19) . Introduce the following constants G = 4(¯ λ + 4 ¯ C ) ˆ C ,G = D + L d Dd ! max( 94 ¯ C, λ ) ,G = (1 + ˆ σ ) " D + L d D d +11 d ! ¯ λ ,G = d X l =2 h ( l − B l D ( D + G ) l − + B l +1 D ( D + G ) l − i + B D , where ˆ C = d ! σ l ( L d + M ) D d − , ¯ C = max σ D − σ )¯ ǫ , D (3 d − / (2 d ) (1 + σ ) /d (1 − σ ) α d − d ( d − − (cid:18) ρ (cid:19) d +1 d nd B , ..., B d is a sequence defined by B d = k∇ d f ( x ∗ ) k , B l − = k∇ l − f ( x ∗ ) k + 2 D B l , l = 2 , ..., d. Then the complexity bound in Theorem 4.4 can be explicitly expressed by log (cid:18) dG ¯ λ d − + ( d −
1) max( G , G ) d − ¯ λ (2¯ λ G + G G + ¯ λD )(1 − ˆ σ ) d − α + − (1 + ˆ σ ) d − α − (cid:19) . (4.18) It is clear that the dependence of the resulting bound depends logarithmically on the parameters L d , D ,and input parameters α + , α − , σ u , and polynomially on d . The derivation of (4.18) is skipped for thesake of succinctness. Now, for a given ǫ >
0, we denote
\[
\bar D_\epsilon := \sup\big\{\|x - x^*\| : \exists\, y \in \partial F(x) \ \text{s.t.} \ \|y\| < \epsilon\big\}.
\]
Combining the bounds provided in Theorem 3.4, Theorem 4.1 and (4.18), we obtain the overall iteration bound for Algorithm 2 in terms of the
ATS calls as follows:
Theorem 4.7
Given $\epsilon > 0$. Assume that Algorithm 2 is implemented with $M \ge L_d$. Set $\bar\epsilon = \epsilon/2$, $\bar\rho \le \min\left\{\frac{\epsilon}{2\bar D_\epsilon},\ \epsilon\right\}$, and define
\[
K_\epsilon := \left\lceil \frac{d+1}{2}\left(\frac{2^d}{\big(1 - (\hat\sigma + \sigma_u)^2\big)^{\frac{d-1}{2}}\, d!\,\sigma_l}\right)^{\frac{2}{3d+1}}\left(\frac{(L_d + M)\, D^{d+1}}{\epsilon}\right)^{\frac{2}{3d+1}} T_\epsilon\right\rceil,
\]
where $T_\epsilon$ is the logarithmic bound on the number of bisection steps per iteration given explicitly in (4.18). Then, a point $z \in \mathbb{R}^n$ satisfying $F(z) - F^* \le \epsilon$ can be found by Algorithm 2 with no more than $K_\epsilon$ calls of ATS.

Proof.
We consider the two cases separately. In the first case, Algorithm 2 terminates because we find a $k \le K_\epsilon$ such that $\|v_k\| \le \bar\rho$ and $\epsilon_k \le \bar\epsilon$. As $v_k = \nabla f(y_k) - \nabla f_{\tilde x_{k-1}}(y_k) + u_k$ and $u_k \in \nabla f_{\tilde x_{k-1}}(y_k) + \partial_{\epsilon_k} h(y_k)$, we have $v_k \in \nabla f(y_k) + \partial_{\epsilon_k} h(y_k)$. Let $x^*$ be the projection of $x_0$ onto $X^*$. By the convexity of $f$ and $h$,
\[
f(x^*) \ge f(y_k) + \langle \nabla f(y_k), x^* - y_k\rangle, \qquad h(x^*) \ge h(y_k) + \langle v_k - \nabla f(y_k), x^* - y_k\rangle - \epsilon_k.
\]
Summing up the two inequalities above yields
\[
F^* \ge F(y_k) + \langle v_k, x^* - y_k\rangle - \epsilon_k \ge F(y_k) - \bar\rho\,\|y_k - x^*\| - \bar\epsilon.
\]
By the construction of $\bar\rho$, we have that $\|v_k\| \le \bar\rho \le \epsilon$. Together with the definition of $\bar D_\epsilon$, this implies that $\|y_k - x^*\| \le \bar D_\epsilon$. Again, by invoking the construction of $\bar\rho$ and $\bar\epsilon$, it holds that
\[
F^* \ge F(y_k) - \frac{\epsilon}{2} - \frac{\epsilon}{2} = F(y_k) - \epsilon.
\]
In the other case, condition (3.7) holds for every $k \le K_\epsilon$. Then, according to Theorem 3.4, for all $k \le K_\epsilon$ we have
\[
F(y_k) - F^* \le \left(\frac{d+1}{2}\right)^{\frac{3d+1}{2}} \frac{2^d}{\big(1 - (\hat\sigma + \sigma_u)^2\big)^{\frac{d-1}{2}}\, d!\,\sigma_l}\, D^{d+1}(L_d + M)\, k^{-\frac{3d+1}{2}}.
\]
By the definition of $K_\epsilon$ and letting $k = K_\epsilon$ in the above inequality, one has $F(y_{K_\epsilon}) - F^* \le \epsilon$. $\Box$

Note that the implementation of our framework is based on the assumption that the ATS can be computed efficiently. By construction, the objective in (2.5) has a $\frac{1}{\lambda}$-strongly convex smooth part, which can be handled by many state-of-the-art optimization algorithms. However, there are few efficient algorithms customized for problem (2.5). Without further knowledge of the problem structure, the proposed approach is not necessarily more efficient than, e.g., a direct application of a general-purpose convex optimization algorithm to (1.1). At the end of the paper, we shall briefly discuss a method for the subroutine in the special case $d = 3$ introduced by Nesterov [27], which opens the door for this line of research. Of course, how to efficiently solve ATS in general remains a topic for further research.
To conclude this paper, we shall discuss how to compute
ATS efficiently when $d = 3$. Note that in STEP 2.a of Algorithm 3, an Approximate Tensor Subroutine (ATS) is required, which can be implemented in polynomial time in the case of convex optimization. In some applications, ATS may be implemented efficiently if some additional structure on the tensor (Taylor) expansion and/or the function $h$ exists. In this subsection, we show how ATS (i.e., solving problem (2.5)) may be computed efficiently in the absence of the non-smooth part, i.e. $F(x) = f(x)$, when $d = 3$. Note that since $h = 0$, the $\epsilon_\beta$ in the bisection subroutine may simply be set to 0.

In this case, the objective function in (2.5) becomes $f_x(y) + \frac{1}{2\lambda}\|y - x\|^2 = f(x) + \Omega(y - x)$, where
\[
\Omega(z) = z^\top\nabla f(x) + \frac{1}{2}\, z^\top\left(\nabla^2 f(x) + \frac{1}{\lambda} I\right) z + \frac{1}{3!}\nabla^3 f(x)[z]^3 + \frac{M}{24}\|z\|^4.
\]
Therefore, the subproblem (2.5) is equivalent to $\min_{z \in \mathbb{R}^n}\Omega(z)$. Let $M = 3\kappa L_3$ with $\kappa > 1$. Then, a similar argument as in Lemma 4 of [27] implies that the function $\Omega(z)$ satisfies the strong relative smoothness condition
\[
\nabla^2\rho(z) \preceq \nabla^2\Omega(z) \preceq \frac{\kappa + 1}{\kappa - 1}\,\nabla^2\rho(z) \tag{5.1}
\]
with respect to the function
\[
\rho(z) = \frac{1}{2}\, z^\top\left(\frac{\kappa - 1}{\kappa}\nabla^2 f(x) + \frac{\kappa - 1}{\lambda(\kappa + 1)} I\right) z + \frac{M - \kappa L_3}{24}\|z\|^4.
\]
Such a condition allows one to minimize $\Omega(z)$ efficiently by a gradient method described in [21, 27], where we need to solve the following problem in every iteration:
\[
\min_{z \in \mathbb{R}^n}\left(a^\top z + \frac{1}{2}\, z^\top A z + \frac{\gamma}{4}\|z\|^4\right), \qquad A \succeq 0,\ \gamma > 0,
\]
which was considered at the end of Section 5 in [27]. According to a min-max argument in [27], the above problem is shown to be equivalent to
\[
\min_{\tau > 0}\left(\frac{\gamma\tau^2}{4} + \frac{1}{2}\, a^\top(\gamma\tau I + A)^{-1} a\right),
\]
which is a univariate optimization problem with a strongly convex and analytic objective function, hence easily solvable in practice. Note that a matrix inverse operation is required in the univariate optimization above. Since in all iterations of the gradient method described in [21, 27] the matrix $A$ is exactly $\nabla^2 f(x)$ throughout and only the vector $a$ varies, the matrix inverse operation needs to be performed only once. As the gradient method in [21, 27] is linearly convergent, the total computational cost of an ATS call is in the order of $O(n^3 + n^2\log(\bar\rho^{-1}))$.
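As an illustration of the univariate reformulation above, the following sketch (ours; the variable names and the bisection safeguard are purely illustrative, and a practical implementation would factorize $A = \nabla^2 f(x)$ once and reuse it) recovers the minimizer of $a^\top z + \frac12 z^\top A z + \frac{\gamma}{4}\|z\|^4$ from the first-order condition, which reduces to the scalar equation $\tau = \|(A + \gamma\tau I)^{-1}a\|^2$ with $z = -(A + \gamma\tau I)^{-1}a$.

```python
import numpy as np

def solve_quartic_subproblem(a, A, gamma, tol=1e-10, max_iter=200):
    """Minimize a^T z + 0.5 z^T A z + (gamma/4)||z||^4 with A PSD, gamma > 0,
    by bisection on tau, where g(tau) = tau - ||(A + gamma*tau*I)^{-1} a||^2
    is increasing and vanishes at the optimal tau = ||z*||^2."""
    n = len(a)
    I = np.eye(n)

    def z_of(tau):
        return -np.linalg.solve(A + gamma * tau * I, a)

    lo, hi = 0.0, 1.0
    while hi - np.linalg.norm(z_of(hi)) ** 2 < 0:   # grow the bracket until g(hi) >= 0
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mid - np.linalg.norm(z_of(mid)) ** 2 < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return z_of(0.5 * (lo + hi))
```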
Acknowledgments.

We would like to thank Tianyi Lin at UC Berkeley for the helpful discussions at the early stages of the project.
References [1] M.M. Alves, R.D.C. Monteiro and B.F. Svaiter (2014) Primal-dual regularized SQP and SQCQPtype methods for convex programming and their complexity analysis.[2] M.M. Alves, R.D.C. Monteiro and B.F. Svaiter (2016) Iteration-complexity of a Rockafellar’sproximal method of multipliers for convex programming based on second-order approximations. arXiv:1602.06794 [3] M. Baes (2009) Estimate sequence methods: extensions and approximations.
Institute for Opera-tions Research , ETH, Zurich, Switzerland.[4] A. Beck and M. Teboulle (2009) A fast iterative shrinkage-thresholding algorithm for linear inverseproblems.
SIAM Journal on Imaging Sciences , 2(1):183–202.[5] E.G. Birgin, J.L. Gardenghi, J.M. Martinez, S.A. Santos, and Ph.L. Toint (2017) Worst-case eval-uation complexity for unconstrained nonlinear optimization using high-order regularized models.
Mathematical Programming , 163(1-2):359–368.236] B.Bullins (2018) Fast minimization of structured convex quartics.
ArXiv Preprint: 1812.10349 .[7] S. Bubeck, Y.T. Lee, and M. Singh (2015) A geometric alternative to Nesterov’s accelerated gradientdescent.
ArXiv Preprint: 1506.08187 .[8] S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li and A. Sidford (2018) Near-optimal method for highly smoothconvex optimization.
ArXiv Preprint: 1812.08026 .[9] R.S. Burachik, A.N. Iusem and B.F. Svaiter (1997) Enlargement of monotone operators withapplications to variational inequalities, Set-Valued Anal., 5, 159-180.[10] C. Cartis, N.I.M. Gould, and Ph.L. Toint (2017) Improved second-order evaluation complexity forunconstrained nonlinear optimization using high-order regularized models. arXiv:1708.04044 .[11] C. Cartis, N.I.M. Gould, and Ph.L. Toint (2018) Universal regularization methods: varying thepower, the smoothness and the accuracy. arXiv:1811.07057v1 .[12] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan (2011) Better mini-batch algorithms via accel-erated gradient methods. In
NIPS , pages 1647–1655.[13] L. Calatroni and A. Chambolle (2017) Backtracking strategies for accelerated descent methodswith smooth composite objectives. arXiv:1709.09004 .[14] Y. Drori and M. Teboulle (2014) Performance of first-order methods for smooth convex minimiza-tion: a novel approach.
Mathematical Programming , 145(1-2):451–482.[15] A. Gasnikov, P. Dvurechensky, E. Gorbunov, E. Vorontsova, D. Selikhanovych and C. A. Uribe(2018) The global rate of convergence for optimal tensor methods in smooth convex optimization. arXiv:1809.00382 .[16] G.N. Grapiglia and Yu. Nesterov (2018) Accelerated regularized newton methods for minimizingcomposite convex functions.
Technical Report CORE Discussion paper .[17] A. Gasnikov, P. Dvurechensky, E. Gorbunov, E. Vorontsova, D. Selikhanovych, C.A. Uribe,B. Jiang, H. Wang, S. Zhang, S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li and A. Sidford (2019) Nearoptimal methods for minimizing convex functions with Lipschitzp-th derivatives.
In: Conferenceon Learning Theory (COLT) , pp. 1392-1393.[18] B. Jiang, T. Lin, and S. Zhang (2018) A Unified Adaptive Tensor Approximation Scheme toAccelerate Composite Convex Optimization. arXiv:1811.02427 .[19] G. Lan (2012) An optimal method for stochastic composite optimization.
Mathematical Program-ming , 133(1):365–397.[20] Q. Lin and L. Xiao (2014) An adaptive accelerated proximal gradient method and its homotopycontinuation for sparse optimization.
Computational Optimization and Applications , 60(3):633–674.[21] H. Lu and R.M. Freund and Yu. Nesterov (2018) Relatively smooth convex optimization by first-order methods and applications.
SIAM Journal on Optimization , 28 (1), pp. 333-354.2422] J.M. Mart´ınez (2017) On high-order model regularization for constrained optimization.
SIAMJournal on Optimization , 27(4):2447–2458.[23] R.D.C. Monteiro and B.F. Svaiter (2012) Iteration-Complexity of a Newton Proximal Extragra-dient Method for Monotone Variational Inequalities and Inclusion Problems.
SIAM Journal onOptimization , 22, 914-935.[24] R.D.C. Monteiro and B.F. Svaiter (2013) An accelerated hybrid proximal extragradient method forconvex optimization and its implications to second-order methods.
SIAM Journal on Optimization ,23 (2), pp. 1092-1125.[25] Yu. Nesterov (1983) A method for unconstrained convex minimization problem with the rate ofconvergence o (1 /k ). Doklady AN SSSR, translated as Soviet Math. Docl. , 269:543–547.[26] Yu. Nesterov (2008) Accelerating the cubic regularization of Newton’s method on convex problems.
Mathematical Programming , 112(1):159–181.[27] Yu. Nesterov (2018) Implementable tensor methods in unconstrained convex optimization,
COREDiscussion Paper 2018/05, Catholic University of Louvain, Center for Operations Research andEconometrics (CORE) .[28] R.T. Rockafellar (1970) On the maximal monotonicity of subdifferential mappings.
Pacific J.Math. , 33:209-216.[29] K. Scheinberg, D. Goldfarb and X. Bai (2014) Fast first-order methods for composite convex opti-mization with backtracking.
Foundations of Computational Mathematics , 14:389–417.[30] S. Shalev-Shwartz and T. Zhang (2014) Accelerated proximal stochastic dual coordinate ascent forregularized loss minimization. In
ICML , pages 64–72.[31] Y. Arjevani, O. Shamir and R. Shiff (2018) Oracle complexity of second-order methods for smoothconvex optimization.
Mathematical Programming , published online.[32] W. Su, S. Boyd and E.J. Candes (2016) A differential equation for modeling Nesterov’s acceleratedgradient method: theory and insights.
Journal of Machine Learning Research , 17(153):1–43.[33] A. Wibisono, A.C. Wilson and M.I. Jordan (2016) A variational perspective on accelerated methodsin optimization.
Proceedings of the National Academy of Sciences , pages 7351–7358.[34] A.C. Wilson, B. Recht and M.I. Jordan (2016) A Lyapunov analysis of momentum methods inoptimization.
ArXiv Preprint: 1611.02635 . A Proofs of the lemmas in Section 4
We first establish an uniform lower bound as well as an upper bound for the sequence { A k } .25 emma A.1 Let D be the distance of x to X ∗ . Suppose { A k } ℓk =1 is generated from Algorithm 2, andthe algorithm has not stopped at iteration ℓ . Then for any integer ≤ k ≤ ℓ , it holds that A k ≥ d ! σ l ( L d + M ) (cid:16) √ − σ + 2 (cid:17) d − D d − , (1.1) and A k ≤ max σ D ǫ (1 − σ ) , D d − d (1 + σ ) d (1 − σ )( α − ) ( d − d ( d − (cid:18) ρ (cid:19) d +1 d . (1.2) Proof.
We first establish the lower bound. Since { A k } is monotonically increasing, it suffices to lowerbound A . Recall that A = 0 and A = A + a = λ , and the choice of large-step (3.7) in Algorithm 2leads to d ! σ l L d + M ≤ λ k y − ˜ x k d − . Moreover, Lemma 3.1 implies that k x k − x ∗ k ≤ D, and k y k − x ∗ k ≤ (cid:16) √ − σ + 1 (cid:17) D, (1.3)where x ∗ is the projection of x onto the optimal solution set X ∗ . Combining the above two inequalitieswith the fact that ˜ x = x , it follows that k y − ˜ x k ≤ k y − x ∗ k + k x ∗ − ˜ x k ≤ (cid:16) √ − σ + 2 (cid:17) D. Therefore, A = λ ≥ d ! σ l ( L d + M ) (cid:16) √ − σ + 2 (cid:17) d − D d − , which is a uniform lower bound of the sequence { A k } .Next, we provide the upper bound. By invoking (3.12) to ( y k , v k , ǫ k , λ k , ˜ x k − ), it holds that λ k k v k k ≤ (1 + σ ) k y k − ˜ x k − k , (1.4)2 λ k ǫ k ≤ σ k y k − ˜ x k − k . (1.5)Then, combining (1.4) with (3.4) leads to A k λ k k v k k ≤ (1 + σ ) A k λ k k y k − ˜ x k − k ≤ (1 + σ ) D − σ . (1.6)Moreover, it follows from (3.14) that A k λ d +1 d − k ≤ k X j =1 A j λ d +1 d − j ≤ D (1 − σ )( α − ) d − α − = d ! σ l L d + M . Combining the above two inequalities yields A k (1 − σ )( α − ) d − D ! d − d +1 k v k k A k ≤ λ k k v k k A k ≤ σ − σ D, or equivalently, A k ≤ D d − d (1 + σ ) d (1 − σ )( α − ) ( d − d ( d − (cid:18) k v k k (cid:19) d +1 d . (1.7)On the other hand, (1.5) together with (3.4) implies that2 A k ǫ k ≤ A k λ k σ k y k − ˜ x k − k ≤ σ D − σ . Consequently, A k ≤ σ D − σ ) 1 ǫ k . (1.8)Now, since the algorithm has not been terminated, we have either k v k k ≥ ¯ ρ or ǫ k ≥ ¯ ǫ , which combinedwith (1.7) and (1.8) yields that A k ≤ max σ D ǫ (1 − σ ) , D d − d (1 + σ ) d (1 − σ ) (cid:18) α (cid:19) ( d − d ( d − (cid:18) ρ (cid:19) d +1 d , and the conclusion follows. (cid:3) Below we prove Lemma 4.4 and Lemma 4.5 respectively.
Proof of Lemma 4.4.
We first demonstrate that β − ≤ ¯ β = β (¯ λ ) . (1.9)Otherwise, we have β − ≥ ¯ β and λ − > ¯ λ as the function λ ( β ) is strictly increasing in β . This togetherwith Lemma 4.3 and the fact that λ − k y − − x − k ≤ α − , implies that the vector v β − = ∇ f ( y − ) −∇ f x − ( y − ) + u β − satisfies v β − ∈ ( ∇ f + ( ∂h ) ǫ )( y − ) , k v β − k ≤ ¯ ρ, ǫ ≤ ¯ ǫ, and thus the algorithm would have been terminated, yielding a contradiction. So we have β − ≤ ¯ β .Since j ≥ log (2 / (1 − ¯ β )), we have β + = β − + 12 j ≤ ¯ β + 1 − ¯ β β < . (1.10)Together with the monotonicity of the function λ ( β ) this implies that λ + ≤ λ (cid:18) β (cid:19) = A k (cid:0) (1 + ¯ β ) / (cid:1) − (1 + ¯ β ) / A k (cid:0) β (cid:1) − ¯ β ) = ¯ λ (cid:0) β (cid:1) ¯ β . λ + ≤ A k (
1+ ¯ β ) − ¯ β ) ≤ (cid:0) (cid:1) A k when ¯ β ≤ and λ + ≤ ¯ λ (
1+ ¯ β ) ¯ β ≤ (cid:16) / (cid:17) ¯ λ when ¯ β ≥ .Combining the bounds in both cases, we have λ + ≤ max (cid:8) A k / , λ (cid:9) ≤ max n Θ(¯ ǫ − ) , Θ (cid:16) (¯ ρ − ) d +1 d (cid:17) , Θ (cid:16) (¯ ǫ − ) d − d +1 (cid:17) , Θ (cid:16) (¯ ρ − ) d − d (cid:17)o = max n Θ(¯ ǫ − ) , Θ (cid:16) (¯ ρ − ) d +1 d (cid:17)o , where the second inequality is due to the upper bounds of ¯ λ and A k in (4.9) and (1.2) respectively. (cid:3) Proof of Lemma 4.5.
Let ¯ v := ¯ u + − ∇ f x + (¯ y + ) + ∇ f x − (¯ y + ). Then ¯ v ∈ ( ∇ f x − + ∂h )(¯ y + ). By (4.13)and (4.14), it holds that¯ y + − ¯ y − + λ + ¯ u + − λ − ¯ u − = x + − x − ⇐⇒ ¯ y + − ¯ y − + λ − (¯ u + − ¯ u − ) = ( λ − − λ + )¯ u + + x + − x − ⇐⇒ ¯ y + − ¯ y − + λ − (¯ v − ¯ u − ) = λ − (¯ v − ¯ u + ) + ( λ − − λ + )¯ u + + x + − x − . Recall that ¯ v ∈ ( ∇ f x − + ∂h )(¯ y + ) and ¯ u − ∈ ( ∇ f x − + ∂h )(¯ y − ), and from the convexity of f x − + h it holdsthat h ¯ y + − ¯ y − , ¯ v − ¯ u − i ≥ . Therefore, λ − k ¯ v − ¯ u − k ≤ k ¯ y + − ¯ y − + λ − (¯ v − ¯ u − ) k = (cid:13)(cid:13)(cid:13) λ − (¯ v − ¯ u + ) + ( λ − − λ + )¯ u + + x + − x − (cid:13)(cid:13)(cid:13) ≤ λ − k ¯ v − ¯ u + k + (cid:12)(cid:12) λ − − λ + (cid:12)(cid:12) k ¯ u + k + k x + − x − k . Using the previous identity and the triangle inequality of the norms implies that λ − k ¯ u + − ¯ u − k ≤ λ − ( k ¯ u + − ¯ v k + k ¯ v − ¯ u − k ) ≤ λ − k ¯ v − ¯ u + k + (cid:12)(cid:12) λ − − λ + (cid:12)(cid:12) λ − k ¯ u + k + λ − k x + − x − k≤ λ − k∇ f x + (¯ y + ) − f x − (¯ y + ) k + (cid:12)(cid:12) λ − − λ + (cid:12)(cid:12) k ¯ y + − x + k + λ − k x + − x − k . (cid:3) Next we shall present the lemmas with proofs that were used in Section 4.
Lemma A.2
Suppose $\lambda_+, \lambda_-, \beta_+$ and $\beta_-$ are generated by Algorithm 3. When the number of iterations $j$ in Algorithm 3 satisfies $j \ge \log_2(2/(1-\bar{\beta}))$ with $\bar{\beta} = \beta(\bar{\lambda})$ and $\bar{\lambda}$ defined in (4.9), we have
\[
\lambda_+ - \lambda_- \le \left( \max\left\{ \Theta\big(\bar{\epsilon}^{-1}\big),\; \Theta\big(\bar{\rho}^{-\frac{d+1}{d}}\big) \right\} \right)^2 (\beta_+ - \beta_-). \tag{1.11}
\]

Proof.
Since $j \ge \log_2(2/(1-\bar{\beta}))$, inequality (1.10) holds. By the mean-value theorem and the definition of $\lambda(\beta)$, there exists $\eta \in (\beta_-, \beta_+)$ such that
\[
\lambda_+ - \lambda_- = A_k\left(\frac{1}{(1-\eta)^2} - 1\right)(\beta_+ - \beta_-) \le A_k\left(\frac{4}{(1-\bar{\beta})^2} - 1\right)(\beta_+ - \beta_-),
\]
where the inequality uses $\eta < \beta_+ \le \frac{1+\bar{\beta}}{2}$. Recall that
\[
\bar{\beta} = \frac{\sqrt{\bar{\lambda}^2 + 4\bar{\lambda}A_k} - \bar{\lambda}}{2A_k} = \frac{2\bar{\lambda}}{\sqrt{\bar{\lambda}^2 + 4\bar{\lambda}A_k} + \bar{\lambda}}.
\]
The relation between $\beta$ and $\lambda$ in (4.1) then gives
\[
\frac{A_k}{(1-\bar{\beta})^2} = \frac{\bar{\lambda}^2}{A_k\,\bar{\beta}^4} = \frac{\big(\sqrt{\bar{\lambda}^2 + 4\bar{\lambda}A_k} + \bar{\lambda}\big)^4}{16 A_k \bar{\lambda}^2} \le \frac{(\bar{\lambda} + 4A_k)^2}{A_k}.
\]
Therefore, by invoking (4.9), (1.1) and (1.2), we have
\begin{align*}
\lambda_+ - \lambda_- &\le A_k\left(\frac{4}{(1-\bar{\beta})^2} - 1\right)(\beta_+ - \beta_-) \le \frac{4A_k}{(1-\bar{\beta})^2}(\beta_+ - \beta_-) \le \frac{4(\bar{\lambda} + 4A_k)^2}{A_k}(\beta_+ - \beta_-) \\
&\le \left( \max\left\{ \Theta\big(\bar{\epsilon}^{-\frac{d-1}{d+1}}\big),\; \Theta\big(\bar{\rho}^{-\frac{d-1}{d}}\big) \right\} + \max\left\{ \Theta\big(\bar{\epsilon}^{-1}\big),\; \Theta\big(\bar{\rho}^{-\frac{d+1}{d}}\big) \right\} \right)^2 (\beta_+ - \beta_-) \\
&\le \left( \max\left\{ \Theta\big(\bar{\epsilon}^{-1}\big),\; \Theta\big(\bar{\rho}^{-\frac{d+1}{d}}\big) \right\} \right)^2 (\beta_+ - \beta_-). \qquad\Box
\end{align*}

The following lemma is exactly Proposition 4.5 in [23].
Lemma A.3
Let $A: \mathbb{R}^s \rightrightarrows \mathbb{R}^s$ be a maximal monotone operator. Then for any $x, \tilde{x} \in \mathbb{R}^s$ and $\lambda > 0$, we have
\[
\|(I + \lambda A)^{-1}(x) - (I + \lambda A)^{-1}(\tilde{x})\| \le \|x - \tilde{x}\|. \tag{1.12}
\]
Moreover, if $x^* \in A^{-1}(0)$, then
\[
\max\big\{ \|(I + \lambda A)^{-1}(x) - x\|,\; \|(I + \lambda A)^{-1}(x) - x^*\| \big\} \le \|x - x^*\|. \tag{1.13}
\]
Now we can bound the residual in terms of the distance between the current iterate and an optimal solution.

Lemma A.4
Let $T := \nabla f + \partial h$ and $T_x := \nabla f_x + \partial h$. Assume that $x^* \in T^{-1}(0) = (\nabla f + \partial h)^{-1}(0)$, and let $\bar{x}, x \in \mathbb{R}^n$ be given. Then
\[
\|x - (I + \lambda T_{\bar{x}})^{-1}(x)\| \le \|x - x^*\| + \frac{\lambda (L_d + M)}{d!}\|\bar{x} - x^*\|^d. \tag{1.14}
\]
As a consequence, for every $x \in \mathbb{R}^n$, $x^* \in T^{-1}(0)$, and $\lambda > 0$, it holds that
\[
\lambda \|x - (I + \lambda T_{\bar{x}})^{-1}(x)\|^{d-1} \le \lambda \left( \|x - x^*\| + \frac{\lambda (L_d + M)}{d!}\|\bar{x} - x^*\|^d \right)^{d-1}. \tag{1.15}
\]

Proof. Let $r$ be the constant mapping $r(x) = \nabla f(x^*) - \nabla f_{\bar{x}}(x^*)$ for all $x \in \mathbb{R}^n$, and construct $A := T_{\bar{x}} + r$, which is also a maximal monotone operator. Note that $A(x^*) = \nabla f_{\bar{x}}(x^*) + \partial h(x^*) + r(x^*) = \nabla f(x^*) + \partial h(x^*) = T(x^*) \ni 0$, so $x^* \in A^{-1}(0)$. By Lemma A.3,
\[
\|(I + \lambda A)^{-1}(x) - x\| \le \|x - x^*\|. \tag{1.16}
\]
Let $y = x + \lambda(\nabla f(x^*) - \nabla f_{\bar{x}}(x^*))$ and $z = (I + \lambda r + \lambda T_{\bar{x}})^{-1}(y)$. We have
\[
x + \lambda(\nabla f(x^*) - \nabla f_{\bar{x}}(x^*)) = y \in (I + \lambda r + \lambda T_{\bar{x}})(z) = z + \lambda(\nabla f(x^*) - \nabla f_{\bar{x}}(x^*)) + \lambda T_{\bar{x}}(z).
\]
Canceling $\lambda(\nabla f(x^*) - \nabla f_{\bar{x}}(x^*))$ on both sides leads to
\[
(I + \lambda T_{\bar{x}})^{-1}(x) = z = (I + \lambda r + \lambda T_{\bar{x}})^{-1}(y) = (I + \lambda A)^{-1}(y).
\]
Combining the above identity with (1.16) and Lemma 2.1, we have
\begin{align*}
\|x - (I + \lambda T_{\bar{x}})^{-1}(x)\| &= \|x - (I + \lambda A)^{-1}(y)\| \\
&\le \|x - (I + \lambda A)^{-1}(x)\| + \|(I + \lambda A)^{-1}(x) - (I + \lambda A)^{-1}(y)\| \\
&\le \|x - x^*\| + \lambda\|\nabla f(x^*) - \nabla f_{\bar{x}}(x^*)\| \\
&\le \|x - x^*\| + \frac{\lambda (L_d + M)}{d!}\|\bar{x} - x^*\|^d,
\end{align*}
which proves (1.14); (1.15) follows from (1.14) straightforwardly. $\Box$
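Since Lemma A.3 is only quoted from [23], it may help to see it verified on the simplest nontrivial example: take $A = \partial|\cdot|$ on $\mathbb{R}$, whose resolvent $(I+\lambda A)^{-1}$ is the soft-thresholding map and whose unique zero is $x^* = 0$. The following self-contained Python check of (1.12) and (1.13) uses random test points; it is an illustration only, not part of the algorithm.

```python
import random

def soft_threshold(x, lam):
    """Resolvent (I + lam * d|.|)^{-1}(x) of the subdifferential of the absolute value."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

if __name__ == "__main__":
    random.seed(0)
    lam, x_star = 0.7, 0.0          # x_star = 0 is the unique zero of d|.|
    for _ in range(1000):
        x, x_tilde = random.uniform(-5, 5), random.uniform(-5, 5)
        Jx, Jx_tilde = soft_threshold(x, lam), soft_threshold(x_tilde, lam)
        assert abs(Jx - Jx_tilde) <= abs(x - x_tilde) + 1e-12                 # (1.12)
        assert max(abs(Jx - x), abs(Jx - x_star)) <= abs(x - x_star) + 1e-12  # (1.13)
    print("nonexpansiveness checks passed")
```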
Lemma A.5

Suppose $x_+ = x_{\beta_+}$ and $x_- = x_{\beta_-}$ are generated by Algorithm 3, and $\bar{y}_+, \bar{y}_-$ are defined in (4.12). When the number of iterations $j$ in Algorithm 3 satisfies $j \ge \log_2(2/(1-\bar{\beta}))$ with $\bar{\beta} = \beta(\bar{\lambda})$ and $\bar{\lambda}$ defined in (4.9), we have
\[
\|x_+ - x_-\| \le \left(\sqrt{\tfrac{2}{1-\sigma^2}} + 2\right) D\, (\beta_+ - \beta_-), \qquad \|x_+ - \bar{y}_+\| \le \max\left\{ \Theta(\bar{\epsilon}^{-1}),\; \Theta\big(\bar{\rho}^{-\frac{d+1}{d}}\big) \right\},
\]
and
\[
\|x_- - \bar{y}_-\| \le \max\left\{ \Theta\big(\bar{\epsilon}^{-\frac{d-1}{d+1}}\big),\; \Theta\big(\bar{\rho}^{-\frac{d-1}{d}}\big) \right\}.
\]

Proof.
Let $x^*$ be the projection of $x_0$ onto the optimal solution set $X^*$. According to Lemma 3.1, it holds that
\[
\|x_k - x^*\| \le D \quad\text{and}\quad \|y_k - x^*\| \le \left(\sqrt{\tfrac{2}{1-\sigma^2}} + 1\right) D.
\]
By (4.2), we have $x_+ = (1-\beta_+) y_k + \beta_+ x_k$ and $x_- = (1-\beta_-) y_k + \beta_- x_k$. Therefore,
\[
\|x_+ - x^*\| \le \left(\sqrt{\tfrac{2}{1-\sigma^2}} + 1\right) D \quad\text{and}\quad \|x_- - x^*\| \le \left(\sqrt{\tfrac{2}{1-\sigma^2}} + 1\right) D, \tag{1.17}
\]
and
\[
\|x_+ - x_-\| = \|(\beta_+ - \beta_-)(x_k - y_k)\| \le \big(\|x_k - x^*\| + \|y_k - x^*\|\big)(\beta_+ - \beta_-) \le \left(\sqrt{\tfrac{2}{1-\sigma^2}} + 2\right) D\, (\beta_+ - \beta_-).
\]
Since $j \ge \log_2(2/(1-\bar{\beta}))$, we have $\beta_- \le \bar{\beta}$ and inequality (4.10) holds. Applying Lemma A.4 with $x = \bar{x} = x_+$, together with inequality (4.10) and (4.5), we have
\[
\|x_+ - \bar{y}_+\| \le (1+\hat{\sigma})\|x_+ - x^*\| + (1+\hat{\sigma})\frac{\lambda_+ (L_d + M)}{d!}\|x_+ - x^*\|^{d} \le \max\left\{ \Theta(\bar{\epsilon}^{-1}),\; \Theta\big(\bar{\rho}^{-\frac{d+1}{d}}\big) \right\}.
\]
Since $\lambda(\beta)$ is monotonically increasing in $\beta$, the inequality $\beta_- \le \bar{\beta}$ amounts to
\[
\lambda_- \le \bar{\lambda} \le \max\left\{ \Theta\big(\bar{\epsilon}^{-\frac{d-1}{d+1}}\big),\; \Theta\big(\bar{\rho}^{-\frac{d-1}{d}}\big) \right\}.
\]
Finally, applying Lemma A.4 again with $x = \bar{x} = x_-$ and using (4.5) yields
\[
\|x_- - \bar{y}_-\| \le (1+\hat{\sigma})\|x_- - x^*\| + (1+\hat{\sigma})\frac{\lambda_- (L_d + M)}{d!}\|x_- - x^*\|^{d} \le \max\left\{ \Theta\big(\bar{\epsilon}^{-\frac{d-1}{d+1}}\big),\; \Theta\big(\bar{\rho}^{-\frac{d-1}{d}}\big) \right\}. \qquad\Box
\]
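The first display in the proof rests only on the affine dependence of $x_\beta$ on $\beta$ in (4.2). The following throwaway numpy check confirms the identity $x_+ - x_- = (\beta_+ - \beta_-)(x_k - y_k)$ on synthetic data (all vectors and the values of $\beta_\pm$ below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
y_k, x_k = rng.standard_normal(5), rng.standard_normal(5)
beta_minus, beta_plus = 0.3, 0.45

x_minus = (1.0 - beta_minus) * y_k + beta_minus * x_k   # x_{beta_-} as in (4.2)
x_plus = (1.0 - beta_plus) * y_k + beta_plus * x_k      # x_{beta_+} as in (4.2)

# x_+ - x_- = (beta_+ - beta_-)(x_k - y_k), hence the norm bound used in Lemma A.5.
assert np.allclose(x_plus - x_minus, (beta_plus - beta_minus) * (x_k - y_k))
print(np.linalg.norm(x_plus - x_minus))
```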
Lemma A.6

Suppose that $x_+ = x_{\beta_+}$ and $x_- = x_{\beta_-}$ are generated by Algorithm 3, and $\bar{y}_+, \bar{y}_-$ are defined in (4.12). If the number of iterations $j$ in Algorithm 3 satisfies $j \ge \log_2(2/(1-\bar{\beta}))$ with $\bar{\beta} = \beta(\bar{\lambda})$ and $\bar{\lambda}$ defined in (4.9), then we have
\[
\|\nabla f_{x_+}(\bar{y}_+) - \nabla f_{x_-}(\bar{y}_+)\| \le \max\left\{ \Theta\big(\bar{\epsilon}^{-(d-1)}\big),\; \Theta\big(\bar{\rho}^{-\frac{(d+1)(d-1)}{d}}\big) \right\} (\beta_+ - \beta_-).
\]

Proof.
According to the definition of the function $f_x(\cdot)$ in (2.4), it holds that
\begin{align}
\|\nabla f_{x_+}(\bar{y}_+) - \nabla f_{x_-}(\bar{y}_+)\|
&= \Bigg\| \sum_{\ell=1}^{d} \frac{1}{(\ell-1)!}\Big( \nabla^\ell f(x_+)[\bar{y}_+ - x_+]^{\ell-1} - \nabla^\ell f(x_-)[\bar{y}_+ - x_-]^{\ell-1} \Big) \nonumber\\
&\qquad\quad + \frac{M}{d!}\Big( \|\bar{y}_+ - x_+\|^{d-1}(\bar{y}_+ - x_+) - \|\bar{y}_+ - x_-\|^{d-1}(\bar{y}_+ - x_-) \Big) \Bigg\| \nonumber\\
&\le \sum_{\ell=1}^{d} \frac{1}{(\ell-1)!}\Big\| \nabla^\ell f(x_+)[\bar{y}_+ - x_+]^{\ell-1} - \nabla^\ell f(x_-)[\bar{y}_+ - x_-]^{\ell-1} \Big\| \tag{1.18}\\
&\quad + \frac{M}{d!}\Big\| \|\bar{y}_+ - x_-\|^{d-1}(\bar{y}_+ - x_-) - \|\bar{y}_+ - x_+\|^{d-1}(\bar{y}_+ - x_+) \Big\|. \tag{1.19}
\end{align}
Note that (1.19) can be further bounded as follows:
\begin{align}
& \frac{M}{d!}\Big\| \|\bar{y}_+ - x_-\|^{d-1}(\bar{y}_+ - x_-) - \|\bar{y}_+ - x_+\|^{d-1}(\bar{y}_+ - x_+) \Big\| \nonumber\\
&\quad = \frac{M}{d!}\Big\| \big( \|\bar{y}_+ - x_-\|^{d-1} - \|\bar{y}_+ - x_+\|^{d-1} \big)(\bar{y}_+ - x_-) + \|\bar{y}_+ - x_+\|^{d-1}(x_+ - x_-) \Big\| \nonumber\\
&\quad \le \frac{M}{d!}\Big( (d-1)\,\big| \|\bar{y}_+ - x_-\| - \|\bar{y}_+ - x_+\| \big|\, \max\{\|\bar{y}_+ - x_-\|, \|\bar{y}_+ - x_+\|\}^{d-2}\, \|\bar{y}_+ - x_-\| + \|\bar{y}_+ - x_+\|^{d-1}\,\|x_+ - x_-\| \Big) \nonumber\\
&\quad \le \frac{M}{d!}\Big( (d-1)\,\max\{\|\bar{y}_+ - x_-\|, \|\bar{y}_+ - x_+\|\}^{d-2}\, \|\bar{y}_+ - x_-\| + \|\bar{y}_+ - x_+\|^{d-1} \Big)\,\|x_+ - x_-\| \nonumber\\
&\quad \le \frac{M}{d!}\, d\,\big( \|\bar{y}_+ - x_+\| + \|x_+ - x_-\| \big)^{d-1} \left(\sqrt{\tfrac{2}{1-\sigma^2}} + 2\right) D\, (\beta_+ - \beta_-) \nonumber\\
&\quad \le \max\left\{ \Theta\big(\bar{\epsilon}^{-(d-1)}\big),\; \Theta\big(\bar{\rho}^{-\frac{(d+1)(d-1)}{d}}\big) \right\}(\beta_+ - \beta_-), \tag{1.20}
\end{align}
where the first inequality is due to (4.16), the second inequality uses $\big|\|\bar{y}_+ - x_-\| - \|\bar{y}_+ - x_+\|\big| \le \|x_+ - x_-\|$, and the second last inequality is from Lemma A.5 and the fact that $\|\bar{y}_+ - x_-\| \le \|\bar{y}_+ - x_+\| + \|x_+ - x_-\|$.

It remains to bound (1.18). We first show by induction that for $1 \le \ell \le d$ and any convex combination $z$ of $x_-$, $x_+$ and $x^*$,
\[
\|\nabla^\ell f(z)\| \le \Theta(1), \tag{1.21}
\]
\[
\|\nabla^\ell f(x_+) - \nabla^\ell f(x_-)\| \le \Theta(1)\,(\beta_+ - \beta_-). \tag{1.22}
\]
Our induction works backwards, starting from the base case $\ell = d$. Recall that $z$ is a convex combination of $x_-$, $x_+$ and $x^*$. By (1.17) we have $\|z - x^*\| \le \max\{\|x_+ - x^*\|, \|x_- - x^*\|\} \le \Theta(1)$. Therefore,
\[
\|\nabla^d f(z)\| \le \|\nabla^d f(x^*)\| + \|\nabla^d f(x^*) - \nabla^d f(z)\| \le \|\nabla^d f(x^*)\| + L_d \|z - x^*\| \le \Theta(1).
\]
Moreover, by invoking (2.2) and Lemma A.5, we have
\[
\|\nabla^d f(x_+) - \nabla^d f(x_-)\| \le L_d \|x_+ - x_-\| \le L_d \left(\sqrt{\tfrac{2}{1-\sigma^2}} + 2\right) D\, (\beta_+ - \beta_-) \le \Theta(1)\,(\beta_+ - \beta_-).
\]
Now suppose that the conclusion holds for some $\ell + 1 \le d$. Consider
\[
z = t_1 x_- + t_2 x_+ + (1 - t_1 - t_2) x^*, \qquad 0 \le t_1, t_2, \quad t_1 + t_2 \le 1,
\]
and denote $D_1 := 2\left(\sqrt{\tfrac{2}{1-\sigma^2}} + 1\right) D$. By letting $x_t = \frac{t_1}{t_1 + t_2} x_- + \frac{t_2}{t_1 + t_2} x_+$ and using (1.17), we have $\|x_t - x^*\| \le \|x_- - x^*\| + \|x_+ - x^*\| \le D_1$. Consequently,
\begin{align*}
\|\nabla^\ell f(z) - \nabla^\ell f(x^*)\|
&= \|\nabla^\ell f(x^* + (t_1 + t_2)(x_t - x^*)) - \nabla^\ell f(x^*)\| \\
&= \left\| \int_0^{t_1 + t_2} \nabla^{\ell+1} f(x^* + u(x_t - x^*))[x_t - x^*] \, du \right\| \\
&\le \int_0^{t_1 + t_2} \|\nabla^{\ell+1} f(x^* + u(x_t - x^*))\|\, \|x_t - x^*\| \, du \\
&\le (t_1 + t_2)\, D_1\, \Theta(1),
\end{align*}
where the second last inequality is due to (2.1) and the last inequality follows from the induction hypothesis on (1.21). Then it follows that
\[
\|\nabla^\ell f(z)\| \le \|\nabla^\ell f(x^*)\| + (t_1 + t_2)\, D_1\, \Theta(1) \le \Theta(1).
\]
Moreover, by the induction hypothesis on (1.21), applying Lemma A.5 and using (2.1), we have
\[
\|\nabla^\ell f(x_+) - \nabla^\ell f(x_-)\| = \left\| \int_0^1 \nabla^{\ell+1} f(x_- + t(x_+ - x_-))[x_+ - x_-] \, dt \right\| \le \Theta(1)\left(\sqrt{\tfrac{2}{1-\sigma^2}} + 2\right) D\, (\beta_+ - \beta_-) \le \Theta(1)\,(\beta_+ - \beta_-).
\]
Therefore, by induction, (1.21) and (1.22) hold for all $1 \le \ell \le d$.

Now we come back to bound (1.18). For $2 \le \ell \le d$,
\begin{align}
& \Big\| \nabla^\ell f(x_+)[\bar{y}_+ - x_+]^{\ell-1} - \nabla^\ell f(x_-)[\bar{y}_+ - x_-]^{\ell-1} \Big\| \nonumber\\
&\quad \le \Big\| \nabla^\ell f(x_+)[\bar{y}_+ - x_+]^{\ell-1} - \nabla^\ell f(x_+)[\bar{y}_+ - x_-]^{\ell-1} \Big\| + \Big\| \nabla^\ell f(x_+)[\bar{y}_+ - x_-]^{\ell-1} - \nabla^\ell f(x_-)[\bar{y}_+ - x_-]^{\ell-1} \Big\|. \tag{1.23}
\end{align}
Applying Lemma A.5 and (1.21), the first term on the right-hand side of (1.23) can be further upper bounded as follows:
\begin{align}
\Big\| \nabla^\ell f(x_+)[\bar{y}_+ - x_+]^{\ell-1} - \nabla^\ell f(x_+)[\bar{y}_+ - x_-]^{\ell-1} \Big\|
&= \Bigg\| \sum_{j=1}^{\ell-1} \nabla^\ell f(x_+)\Big[ [\bar{y}_+ - x_+]^{j-1}[x_- - x_+][\bar{y}_+ - x_-]^{\ell-j-1} \Big] \Bigg\| \nonumber\\
&\le \sum_{j=1}^{\ell-1} \|\nabla^\ell f(x_+)\|\, \|\bar{y}_+ - x_+\|^{j-1}\, \|\bar{y}_+ - x_-\|^{\ell-j-1}\, \|x_+ - x_-\| \nonumber\\
&\le \sum_{j=1}^{\ell-1} \Theta(1)\, \|\bar{y}_+ - x_+\|^{j-1} \big( \|x_+ - \bar{y}_+\| + \|x_- - x_+\| \big)^{\ell-j-1} \|x_+ - x_-\| \nonumber\\
&\le (\ell-1)\,\Theta(1) \big( \|x_+ - x_-\| + \|x_+ - \bar{y}_+\| \big)^{\ell-2} \left(\sqrt{\tfrac{2}{1-\sigma^2}} + 2\right) D\, (\beta_+ - \beta_-) \nonumber\\
&\le \max\left\{ \Theta\big(\bar{\epsilon}^{-(\ell-2)}\big),\; \Theta\big(\bar{\rho}^{-\frac{(d+1)(\ell-2)}{d}}\big) \right\}(\beta_+ - \beta_-). \tag{1.24}
\end{align}
Moreover, applying Lemma A.5, (1.21) and (2.1) to the second term on the right-hand side of (1.23) gives
\begin{align}
\Big\| \nabla^\ell f(x_+)[\bar{y}_+ - x_-]^{\ell-1} - \nabla^\ell f(x_-)[\bar{y}_+ - x_-]^{\ell-1} \Big\|
&\le \|\nabla^\ell f(x_+) - \nabla^\ell f(x_-)\|\, \|\bar{y}_+ - x_-\|^{\ell-1} \nonumber\\
&\le \Theta(1)\,(\beta_+ - \beta_-) \big( \|x_+ - x_-\| + \|x_+ - \bar{y}_+\| \big)^{\ell-1} \nonumber\\
&\le \max\left\{ \Theta\big(\bar{\epsilon}^{-(\ell-1)}\big),\; \Theta\big(\bar{\rho}^{-\frac{(d+1)(\ell-1)}{d}}\big) \right\}(\beta_+ - \beta_-). \tag{1.25}
\end{align}
Putting (1.23), (1.24) and (1.25) together yields
\[
\Big\| \nabla^\ell f(x_+)[\bar{y}_+ - x_+]^{\ell-1} - \nabla^\ell f(x_-)[\bar{y}_+ - x_-]^{\ell-1} \Big\| \le \max\left\{ \Theta\big(\bar{\epsilon}^{-(\ell-1)}\big),\; \Theta\big(\bar{\rho}^{-\frac{(d+1)(\ell-1)}{d}}\big) \right\}(\beta_+ - \beta_-)
\]
for $\ell = 2, \ldots, d$. When $\ell = 1$, (1.22) guarantees that $\|\nabla f(x_+) - \nabla f(x_-)\| \le \Theta(1)\,(\beta_+ - \beta_-)$. Therefore, the quantity in (1.18) can be bounded as
\[
\sum_{\ell=1}^{d} \frac{1}{(\ell-1)!}\Big\| \nabla^\ell f(x_+)[\bar{y}_+ - x_+]^{\ell-1} - \nabla^\ell f(x_-)[\bar{y}_+ - x_-]^{\ell-1} \Big\| \le \max\left\{ \Theta\big(\bar{\epsilon}^{-(d-1)}\big),\; \Theta\big(\bar{\rho}^{-\frac{(d+1)(d-1)}{d}}\big) \right\}(\beta_+ - \beta_-). \tag{1.26}
\]
Finally, bounding (1.19) and (1.18) by (1.20) and (1.26), respectively, leads to the desired conclusion. $\Box$
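Lemma A.6 is, at bottom, a quantitative statement that the gradient of the regularized Taylor model $f_x(\cdot)$ is Lipschitz in the expansion point $x$. The scalar Python sketch below implements the model gradient $\nabla f_x(y) = \sum_{\ell=1}^{d} \frac{1}{(\ell-1)!}\nabla^\ell f(x)(y-x)^{\ell-1} + \frac{M}{d!}|y-x|^{d-1}(y-x)$ (our reading of (2.4)) for $d = 2$ and an illustrative $f(x) = e^x$, and shows that $|\nabla f_{x_+}(\bar{y}) - \nabla f_{x_-}(\bar{y})|$ shrinks roughly linearly with $|x_+ - x_-|$, mirroring the $(\beta_+ - \beta_-)$ factor in the lemma. Everything here is a hypothetical example, not the paper's implementation.

```python
import math

def grad_model(y, x, derivs, M):
    """Gradient of the d-th order Taylor model of f at x, evaluated at y (scalar case).

    derivs = [f'(x), f''(x), ..., f^(d)(x)]; the last term is the gradient of the
    regularizer M/(d+1)! * |y - x|^(d+1)."""
    d = len(derivs)
    taylor = sum(derivs[l - 1] * (y - x)**(l - 1) / math.factorial(l - 1)
                 for l in range(1, d + 1))
    return taylor + (M / math.factorial(d)) * abs(y - x)**(d - 1) * (y - x)

if __name__ == "__main__":
    f1 = lambda x: math.exp(x)          # illustrative smooth f: all derivatives are exp(x)
    d, M, y_bar = 2, 4.0, 0.3
    for gap in (1e-1, 1e-2, 1e-3):
        x_minus, x_plus = 0.5, 0.5 + gap
        g_minus = grad_model(y_bar, x_minus, [f1(x_minus)] * d, M)
        g_plus = grad_model(y_bar, x_plus, [f1(x_plus)] * d, M)
        print(gap, abs(g_plus - g_minus))
```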