The global rate of convergence for optimal tensor methods in smooth convex optimization
Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, César A. Uribe
Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization
Alexander Gasnikov
gasnikov@yandex.ru
Moscow Institute of Physics and Technology, Institute for Information Transmission Problems, National Research University Higher School of Economics
Pavel Dvurechensky
pavel.dvurechensky@gmail.com
Weierstrass Institute for Applied Analysis and Stochastics, Institute for Information Transmission Problems
Eduard Gorbunov
eduard.gorbunov@phystech.edu
Moscow Institute of Physics and Technology
Evgeniya Vorontsova
vorontsovaea@gmail.com
Far Eastern Federal University
Daniil Selikhanovych
selihanovich.do@phystech.edu
Moscow Institute of Physics and Technology, Institute for Information Transmission Problems
César A. Uribe
cauribe@mit.edu
Massachusetts Institute of Technology
September 2, 2018

Abstract
We consider convex optimization problems with the objective function having a Lipschitz-continuous $p$-th order derivative, where $p \ge 1$. We propose a new tensor method, which closes the gap between the lower $O\big(\varepsilon^{-2/(3p+1)}\big)$ and upper $O\big(\varepsilon^{-1/(p+1)}\big)$ iteration complexity bounds for this class of optimization problems. We also consider uniformly convex functions and show how the proposed method can be accelerated under this additional assumption. Moreover, we introduce a $p$-th order condition number which naturally arises in the complexity analysis of tensor methods under this assumption. Finally, we make a numerical study of the proposed optimal method and show that in practice it is faster than the best known accelerated tensor method. We also compare the performance of tensor methods for $p = 2$ and $p = 3$ and show that the 3rd-order method is superior to the 2nd-order method in practice.

Keywords:
Convex optimization, unconstrained minimization, tensor methods, worst-case complexity, global complexity bounds, condition number
1. Introduction
In this paper, we consider the unconstrained convex optimization problem
\[
  f(x) \to \min_{x \in \mathbb{R}^n},  \qquad (1)
\]
1. The first version of this paper appeared on September 2, 2018 in Russian. In the current version we present a translation into English of the main derivations, extend the analysis from the case of strongly convex objectives to the case of uniformly convex objectives, and add a numerical study of our results.
© A. Gasnikov, P. Dvurechensky, E. Gorbunov, E. Vorontsova, D. Selikhanovych & C.A. Uribe.

where $f$ has a Lipschitz-continuous $p$-th derivative with constant $M_p$. For $p = 1$, first-order methods, i.e., gradient-type methods, are commonly used to solve this problem. The lower bound for the complexity of these methods was proposed in (Nemirovsky and Yudin, 1983; Nesterov, 2004), and an optimal method was introduced in (Nesterov, 1983). The case $p = 2$, i.e., Newton-type methods, was well understood only recently. A nearly optimal method was proposed in (Nesterov, 2008), an optimal method was proposed in (Monteiro and Svaiter, 2013), and a lower bound was obtained in (Agarwal and Hazan, 2018; Arjevani et al., 2018).

The idea of using higher-order derivatives (starting from $p \ge 2$) in optimization has been known at least since the 1970's, see Hoffmann and Kornstaedt (1978). Recently this direction of research became of interest from the point of view of complexity bounds. The unpublished preprint Baes (2009), extending the estimating functions technique of Nesterov (2004), proposes accelerated high-order (tensor) methods for convex problems with complexity $O\big(\big(M_p R^{p+1}/\varepsilon\big)^{1/(p+1)}\big)$, where $p \ge 2$, $\varepsilon$ is the accuracy of the obtained solution $\hat x$, i.e., $f(\hat x) - f^* \le \varepsilon$, $M_p$ is the Lipschitz constant of the $p$-th derivative, and $R$ is an estimate for the distance between the starting point and the closest solution. Nevertheless, the author doubts that the obtained methods are implementable, since the auxiliary problem on each iteration is possibly non-convex. Agarwal and Hazan (2018); Arjevani et al. (2018) construct lower complexity bounds for the class of functions with Lipschitz $p$-th derivative, the sharper of them being $\Omega\big(\big(M_p R^{p+1}/\varepsilon\big)^{2/(3p+1)}\big)$, and conjecture that the upper bound can be improved. Nesterov (2018) proposes implementable tensor methods, showing that an appropriately regularized Taylor expansion of a convex function is again a convex function, thus making the auxiliary problems on each iteration of the tensor methods tractable. The author also provides an accelerated scheme with complexity bound $O\big(\big(M_p R^{p+1}/\varepsilon\big)^{1/(p+1)}\big)$, shows that the complexity of each iteration for $p = 3$ is of the same order as for the case $p = 2$, and conjectures the existence of an optimal scheme with complexity bound $O\big(\big(M_p R^{p+1}/\varepsilon\big)^{2/(3p+1)}\big)$.

The optimal method for the case $p = 1$ has complexity $O\big(\big(M_1 R^{2}/\varepsilon\big)^{1/2}\big)$ (Nesterov, 1983) and for $p = 2$ it has complexity $O\big(\big(M_2 R^{3}/\varepsilon\big)^{2/7}\big)$ (Monteiro and Svaiter, 2013), but the question of existence of optimal methods for $p \ge 3$ remained open. In this paper we extend the framework of Monteiro and Svaiter (2013) and propose optimal tensor methods for all $p \ge 1$.
Our approach is also based on the regularized Taylor step of Nesterov (2018), and, thus, our optimal method for $p = 2$ is different from that of Monteiro and Svaiter (2013).

We also consider problem (1) under the additional assumption that $f$ is uniformly convex, i.e., there exist $2 \le q \le p + 1$ and $\sigma_q > 0$ s.t.
\[
  f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma_q}{q} \|y - x\|^{q}, \quad \forall x, y \in Q.
\]
Under this additional assumption, we show how the restart technique can be applied to accelerate our method and to obtain the complexity
\[
  O\!\left( \left(\frac{M_p}{\sigma_{p+1}}\right)^{\frac{2}{3p+1}} \log \frac{\Delta_0}{\varepsilon} \right), \quad q = p + 1; \qquad
  O\!\left( \left(\frac{M_p\,(\Delta_0)^{\frac{p+1-q}{q}}}{\sigma_q^{\frac{p+1}{q}}}\right)^{\frac{2}{3p+1}} + \log \frac{\Delta_0}{\varepsilon} \right), \quad q < p + 1,
\]
where $f(x_0) - f^* \le \Delta_0$. This bound suggests a natural generalization of the first- and second-order condition numbers (Nesterov, 2008). If $f$ is such that $q = p + 1$, then the complexity of our algorithm depends only logarithmically on the starting point and is proportional to $(\gamma_p)^{\frac{2}{3p+1}}$, where $\gamma_p = M_p/\sigma_{p+1}$ is the $p$-th order condition number. Nemirovsky and Yudin (1983); Nesterov (2004) and Arjevani et al. (2018) propose lower bounds for particular cases of strongly convex functions (i.e., $q = 2$) with $p = 1$ and $p = 2$ respectively. Our upper bounds match them.

As related work, we also mention Birgin et al. (2017); Cartis et al. (2018), who study complexity bounds for tensor methods for finding approximate stationary points, with the main focus on non-convex optimization, which we do not consider in our work. Also, the work (Wibisono et al., 2016) considers tensor methods from the variational perspective and obtains bounds similar to those in Baes (2009). The first version of this paper appeared in arXiv on September 2, 2018. In December 2018, two months after that, Jiang et al. (2018); Bubeck et al. (2018) proposed an algorithm which is very similar to our Algorithm 1. Unlike them, we also analyze the case of uniformly convex functions and propose an algorithm which is faster in this case, see our Algorithm 3. Moreover, we are the first to make a numerical study of tensor methods for $p = 3$ and show that they work in practice.

Our contributions.
• We propose a new optimal tensor method and analyze its iteration complexity.
• We generalize this method to the case of uniformly convex objectives and propose a definition of the $p$-th order condition number.
• We make a numerical study of the proposed method and show that our optimal method is faster than the accelerated tensor method of Nesterov (2018) in practice. We also compare the performance of tensor methods for $p = 2$ and $p = 3$ and show that the 3rd-order method is superior to the 2nd-order method in practice.

Notations and generalities.
For $p \ge 1$, we denote by $\nabla^p f(x)[h_1, \ldots, h_p]$ the directional derivative of the function $f$ at $x$ along the directions $h_i \in \mathbb{R}^n$, $i = 1, \ldots, p$. Since $\nabla^p f(x)[h_1, \ldots, h_p]$ is a symmetric $p$-linear form, its norm is defined as
\[
  \|\nabla^p f(x)\| = \max_{h_1, \ldots, h_p \in \mathbb{R}^n} \left\{ \nabla^p f(x)[h_1, \ldots, h_p] \; : \; \|h_i\| \le 1, \; i = 1, \ldots, p \right\},
\]
or, equivalently,
\[
  \|\nabla^p f(x)\| = \max_{h \in \mathbb{R}^n} \left\{ \big|\nabla^p f(x)[h, \ldots, h]\big| \; : \; \|h\| \le 1 \right\}.
\]
Here, for simplicity, $\|\cdot\|$ is the standard Euclidean norm, but our algorithm and derivations can be generalized to the Euclidean norm given by a general positive semi-definite matrix $B$. We consider convex functions that are $p$ times differentiable on $\mathbb{R}^n$ and satisfy the Lipschitz condition for the $p$-th derivative
\[
  \|\nabla^p f(x) - \nabla^p f(y)\| \le M_p \|x - y\|, \quad x, y \in \mathbb{R}^n.  \qquad (2)
\]
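For intuition, condition (2) controls how well $f$ is approximated by its $p$-th order Taylor model; the following standard consequences (not stated explicitly at this point in the text, but used implicitly below, see Nesterov (2018)) hold for $\Phi_{x,p}(y) = \sum_{r=0}^{p} \frac{1}{r!} \nabla^r f(x)[y-x, \ldots, y-x]$:
\[
  |f(y) - \Phi_{x,p}(y)| \le \frac{M_p}{(p+1)!} \|y - x\|^{p+1}, \qquad \|\nabla f(y) - \nabla \Phi_{x,p}(y)\| \le \frac{M_p}{p!} \|y - x\|^{p}, \qquad \forall x, y \in \mathbb{R}^n.
\]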
2. Optimal Tensor Method
Given a function $f$, a number $p \ge 1$ and a constant $M \ge 0$, define
\[
  T^{f}_{p,M}(x) \in \operatorname*{Arg\,min}_{y \in \mathbb{R}^n} \left\{ \sum_{r=0}^{p} \frac{1}{r!} \nabla^r f(x) \underbrace{[y - x, \ldots, y - x]}_{r} + \frac{M}{(p+1)!} \|y - x\|^{p+1} \right\},  \qquad (3)
\]
and, given a number $L \ge 0$ and a point $z \in \mathbb{R}^n$, define
\[
  F_{L,z}(x) \triangleq f(x) + \frac{L}{2} \|x - z\|^{2}.  \qquad (4)
\]
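To make definition (3) concrete, below is a minimal illustrative sketch (our own, not the authors' code) of the regularized Taylor step for the case $p = 2$, where the model is the cubic-regularized Newton model. The callables `grad` and `hess`, the generic BFGS solver, and all names are our assumptions; efficient specialized solvers for this convex subproblem are discussed in Nesterov (2018).

```python
import numpy as np
from scipy.optimize import minimize

def tensor_step_p2(f, grad, hess, x, M):
    """Approximately compute T^f_{2,M}(x): minimize the second-order Taylor model
    of f at x plus the cubic regularizer (M / 3!) * ||y - x||^3, cf. (3)."""
    g, H = grad(x), hess(x)
    def model(y):
        d = y - x
        return f(x) + g @ d + 0.5 * d @ (H @ d) + (M / 6.0) * np.linalg.norm(d) ** 3
    # a generic solver is used only for illustration; see Nesterov (2018) for
    # efficient ways to solve this convex subproblem
    return minimize(model, x, method="BFGS").x
```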
Theorem 1
Let the sequence $(x_k, y_k, u_k)$, $k \ge 0$, be generated by Algorithm 1. Then
\[
  f(y_N) - f^* \le \frac{c\, M_p \|y_0 - x^*\|^{p+1}}{N^{\frac{3p+1}{2}}}, \qquad c = \frac{2^{\frac{3(p+1)^2+4}{4}}\, (p+1)^{\frac{3p+3}{2}}}{p!}.
\]
Note that this bound allows to obtain an $O\big(\big(M_p R^{p+1}/\varepsilon\big)^{2/(3p+1)}\big)$ iteration complexity. The implementability and the cost of each iteration are discussed below in Section 2.3. The proof of Theorem 1 is based on the framework of Monteiro and Svaiter (2013), which is presented in the next subsection.
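Indeed (a one-line check we add for completeness), with $R = \|y_0 - x^*\|$ the bound of Theorem 1 is below $\varepsilon$ as soon as
\[
  \frac{c\, M_p R^{p+1}}{N^{\frac{3p+1}{2}}} \le \varepsilon \iff N \ge \left(\frac{c\, M_p R^{p+1}}{\varepsilon}\right)^{\frac{2}{3p+1}}.
\]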
Algorithm 1: Optimal Tensor Method
Input: $u_0$, $y_0$ — starting points; $N$ — number of iterations; $A_0 = 0$.
Output: $y_N$.
for $k = 0, 1, 2, \ldots, N - 1$ do
  Choose $L_k$ such that
  \[
    \frac{1}{4} \le \frac{(p+1) M_p}{p!\, L_k} \|y_{k+1} - x_k\|^{p-1} \le \frac{1}{2},  \qquad (5)
  \]
  where
  \[
    a_{k+1} = \frac{1/L_k + \sqrt{1/L_k^2 + 4 A_k / L_k}}{2}, \qquad A_{k+1} = A_k + a_{k+1} \quad \{\text{note that } L_k a_{k+1}^2 = A_{k+1}\},
  \]
  \[
    x_k = \frac{A_k}{A_{k+1}} y_k + \frac{a_{k+1}}{A_{k+1}} u_k, \qquad y_{k+1} = T^{F_{L_k, x_k}}_{p,\, p M_p}(x_k).
  \]
  $u_{k+1} = u_k - a_{k+1} \nabla f(y_{k+1})$
end for
return $y_N$
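The outer loop of Algorithm 1 is simple to implement once the step $y_{k+1}$ and the constant $L_k$ are available. Below is a minimal sketch (our own illustration, not the authors' code), in which `line_search` is an assumed callable returning a pair $(L_k, y_{k+1})$ satisfying condition (5) (e.g., a bisection over $L$ as sketched later in this section) and `grad_f` returns $\nabla f$.

```python
import numpy as np

def optimal_tensor_method(grad_f, line_search, u0, y0, N):
    """Sketch of the outer loop of Algorithm 1."""
    u, y, A = u0.copy(), y0.copy(), 0.0
    for _ in range(N):
        # L_k and y_{k+1} = T^{F_{L_k,x_k}}_{p,pM_p}(x_k), chosen so that (5) holds
        L, y_next = line_search(u, y, A)
        # a_{k+1} solves L_k * a^2 = A_k + a, so that A_{k+1} = L_k * a_{k+1}^2
        a = (1.0 / L + np.sqrt(1.0 / L**2 + 4.0 * A / L)) / 2.0
        A += a
        u = u - a * grad_f(y_next)
        y = y_next
    return y
```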
Monteiro and Svaiter (2013) introduced Algorithm 2 for convex optimization problems. To find $y_{k+1}$ on each iteration, the authors use a gradient-type method for the case $p = 1$ and a trust-region Newton-type method for the case $p = 2$. Their analysis of the algorithm is based on the following theorem.

Theorem 2 ((Monteiro and Svaiter, 2013, Theorem 3.6))
Let the sequence $(x_k, y_k, u_k)$, $k \ge 0$, be generated by Algorithm 2 and define $R := \|y_0 - x^*\|$. Then, for all $N \ge 1$,
\[
  \|u_N - x^*\|^2 + A_N \big(f(y_N) - f(x^*)\big) + \frac{1}{4} \sum_{k=1}^{N} A_k L_{k-1} \|y_k - x_{k-1}\|^2 \le R^2,  \qquad (6)
\]
\[
  f(y_N) - f(x^*) \le \frac{R^2}{A_N}, \qquad \|u_N - x^*\| \le R,  \qquad (7)
\]
\[
  \sum_{k=1}^{N} A_k L_{k-1} \|y_k - x_{k-1}\|^2 \le 4 R^2.  \qquad (8)
\]
We also need the following lemma.

Lemma 3 ((Monteiro and Svaiter, 2013, Lemma 3.7(a)))
Let the sequences $\{A_k, L_k\}$, $k \ge 0$, be generated by Algorithm 2. Then, for all $N \ge 1$,
\[
  A_N \ge \frac{1}{4} \left( \sum_{k=1}^{N} \frac{1}{\sqrt{L_{k-1}}} \right)^{2}.  \qquad (9)
\]
Algorithm 2: Accelerated hybrid proximal extragradient method
Input: $u_0$, $y_0$ — starting points; $N$ — number of iterations; $A_0 = 0$.
Output: $y_N$.
for $k = 0, 1, 2, \ldots, N - 1$ do
  Choose $L_k$ and $y_{k+1}$ s.t.
  \[
    \big\|\nabla F_{L_k, x_k}(y_{k+1})\big\| \le \frac{L_k}{2} \|y_{k+1} - x_k\|,
  \]
  where
  \[
    a_{k+1} = \frac{1/L_k + \sqrt{1/L_k^2 + 4 A_k / L_k}}{2}, \qquad A_{k+1} = A_k + a_{k+1}, \qquad x_k = \frac{A_k}{A_{k+1}} y_k + \frac{a_{k+1}}{A_{k+1}} u_k.
  \]
  $u_{k+1} = u_k - a_{k+1} \nabla f(y_{k+1})$.
end for
return $y_N$

It follows from Algorithm 1 that $y_{k+1} = T^{F_{L_k, x_k}}_{p,\, p M_p}(x_k)$; thus, by (Nesterov, 2018, Lemma 1),
\[
  \big\|\nabla F_{L_k, x_k}(y_{k+1})\big\| \le \frac{(p+1) M_p}{p!} \|y_{k+1} - x_k\|^{p}.
\]
At the same time, by the upper bound in condition (5) of Algorithm 1,
\[
  \frac{(p+1) M_p}{p!\, L_k} \|y_{k+1} - x_k\|^{p-1} \le \frac{1}{2}.
\]
Hence,
\[
  \big\|\nabla F_{L_k, x_k}(y_{k+1})\big\| \le \frac{L_k}{2} \|y_{k+1} - x_k\|,
\]
and we can apply the framework of the previous subsection. What remains is to estimate the growth of $A_N$, which is our next step.

By the lower bound in condition (5) of Algorithm 1,
\[
  \frac{\|y_{k+1} - x_k\|^{p-1}}{L_k} \ge \theta,  \qquad (10)
\]
where $\theta = \frac{p!}{4(p+1) M_p}$. Using this inequality, we prove that
\[
  \sum_{k=1}^{N} A_k L_{k-1}^{\frac{p+1}{p-1}} \le (2R)^{2}\, \theta^{-\frac{2}{p-1}}.  \qquad (11)
\]
Indeed, from (8) and (10) we have
\[
  \theta^{\frac{2}{p-1}} \sum_{k=1}^{N} A_k L_{k-1}^{\frac{p+1}{p-1}} \le \sum_{k=1}^{N} A_k L_{k-1}^{\frac{p+1}{p-1}} \left( \frac{\|y_k - x_{k-1}\|^{p-1}}{L_{k-1}} \right)^{\frac{2}{p-1}} = \sum_{k=1}^{N} A_k L_{k-1} \|y_k - x_{k-1}\|^{2} \le 4 R^{2}.  \qquad (12)
\]
Further, from (11) it follows that
\[
  \sum_{k=1}^{N} \frac{1}{\sqrt{L_{k-1}}} \ge \frac{\theta^{\frac{1}{p+1}}}{(2R)^{\frac{p-1}{p+1}}} \left( \sum_{k=1}^{N} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{2(p+1)}}.  \qquad (13)
\]
To prove this, let us introduce the new variables $z_k = 1/\sqrt{L_{k-1}}$ and consider the following optimization problem, which gives the worst possible value of the l.h.s. of (13):
\[
  \min \sum_{k=1}^{N} z_k \quad \text{s.t.} \quad \sum_{k=1}^{N} A_k z_k^{-\gamma} \le C,  \qquad (14)
\]
where, in accordance with (11),
\[
  \gamma = \frac{2(p+1)}{p-1}, \qquad C = (2R)^{2}\, \theta^{-\frac{2}{p-1}}.
\]
Since the objective and the constraint are separable, this problem can be solved explicitly by the Lagrange principle:
\[
  z_k = \left( \frac{1}{C} \sum_{j=1}^{N} A_j^{\frac{1}{\gamma+1}} \right)^{\frac{1}{\gamma}} A_k^{\frac{1}{\gamma+1}}.
\]
Hence,
\[
  \min_{\sum_{k=1}^{N} A_k z_k^{-\gamma} \le C} \; \sum_{k=1}^{N} z_k = \frac{1}{C^{1/\gamma}} \left( \sum_{k=1}^{N} A_k^{\frac{1}{\gamma+1}} \right)^{\frac{\gamma+1}{\gamma}}.
\]
From this equality, (9) and (13), we have
\[
  A_N \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \sum_{k=1}^{N} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{p+1}}.  \qquad (15)
\]
From this inequality, we obtain that there exists a constant $c$ such that, for all $N \ge 1$,
\[
  A_N \ge \frac{N^{\frac{3p+1}{2}}}{c\, M_p R^{p-1}}.  \qquad (16)
\]
The derivation of the exact value of the constant $c$ can be found in Lemma 5 in the Appendix. This finishes the proof.

First of all, Theorem 1 in Nesterov (2018) says that, with the choice $M = p M_p$ in (3), the subproblem for finding $y_{k+1}$ in step 2 of Algorithm 1 is convex and, thus, is tractable. Moreover, for $p = 2$ this step corresponds to the step of the cubic regularized Newton method of Nesterov and Polyak (2006) and, as shown there, can be computed with essentially the same complexity as solving a linear system. For the case $p = 3$, Nesterov (2018) showed that this step can also be computed efficiently. In both cases the complexity of calculating $y_{k+1}$ is $\tilde O(n^3)$.

Let us now discuss the process of finding $L_k$ such that inequality (5) holds. By construction,
\[
  y_{k+1} = \operatorname*{arg\,min}_{y \in \mathbb{R}^n} \left\{ \sum_{r=0}^{p} \frac{1}{r!} \nabla^r f(x_k) \underbrace{[y - x_k, \ldots, y - x_k]}_{r} + \frac{p M_p}{(p+1)!} \|y - x_k\|^{p+1} + \frac{L_k}{2} \|y - x_k\|^{2} \right\}.
\]
This problem is strongly convex and, thus, has a unique solution for each $L_k > 0$. Hence, $y_{k+1}$ is uniquely defined by $L_k$. At the same time, if $L_k \to 0$, then $y_{k+1} \to \tilde y_k$ with
\[
  \tilde y_k \in \operatorname*{Arg\,min}_{y \in \mathbb{R}^n} \left\{ \sum_{r=0}^{p} \frac{1}{r!} \nabla^r f(x_k) \underbrace{[y - x_k, \ldots, y - x_k]}_{r} + \frac{p M_p}{(p+1)!} \|y - x_k\|^{p+1} \right\}
\]
being a fixed point. Whence,
\[
  \frac{(p+1) M_p}{p!\, L_k} \|y_{k+1} - x_k\|^{p-1} \to +\infty.
\]
On the other hand, if $L_k \to +\infty$, then $y_{k+1} \to x_k$ and
\[
  \frac{(p+1) M_p}{p!\, L_k} \|y_{k+1} - x_k\|^{p-1} \to 0.
\]
By the continuity of the dependence of $y_{k+1}$ on $L_k$, we see that there exists $L_k$ such that inequality (5) holds. An appropriate value of $L_k$ can be found by an extended line-search procedure as in (Monteiro and Svaiter, 2013, Section 7). The details of the complexity of the line search can be found in Jiang et al. (2018); Bubeck et al. (2018), where the authors prove a bound of $\tilde O(1)$ calls of $T^{F_{L_k, x_k}}_{p,\, p M_p}(x_k)$ per iteration.
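As a simple illustration of this line search over $L_k$ (our own sketch, not the procedure of the cited papers), one can bisect geometrically on $L$. Here `regularized_tensor_step(x, L)` is an assumed callable returning $T^{F_{L,x}}_{p,pM_p}(x)$, and the bracket $[1/4, 1/2]$ corresponds to condition (5) as stated above.

```python
import math
import numpy as np

def find_L(regularized_tensor_step, x, p, M_p, L_lo=1e-10, L_hi=1e10, max_iter=200):
    """Geometric bisection on L for condition (5):
    1/4 <= (p+1) M_p ||y(L) - x||^{p-1} / (p! L) <= 1/2.
    The monitored ratio decreases as L grows (see the limits discussed above)."""
    def ratio(L):
        y = regularized_tensor_step(x, L)
        return (p + 1) * M_p * np.linalg.norm(y - x) ** (p - 1) / (math.factorial(p) * L), y
    for _ in range(max_iter):
        L = math.sqrt(L_lo * L_hi)      # geometric midpoint of the bracket
        val, y = ratio(L)
        if val > 0.5:
            L_lo = L                    # ratio too large -> increase L
        elif val < 0.25:
            L_hi = L                    # ratio too small -> decrease L
        else:
            return L, y
    raise RuntimeError("line search for L_k did not converge")
```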
3. Extension for Uniformly Convex Case
In this section, we additionally assume that the objective function is uniformly convex of degree $q \ge 2$, i.e., there exists $\sigma_q > 0$ s.t.
\[
  f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma_q}{q} \|y - x\|^{q}, \quad \forall x, y \in Q.  \qquad (17)
\]
We also assume that $q \le p + 1$. As a corollary (take $x = x^*$ in (17) and use $\nabla f(x^*) = 0$),
\[
  f(y) \ge f(x^*) + \frac{\sigma_q}{q} \|y - x^*\|^{q}, \quad \forall y \in Q,  \qquad (18)
\]
where $x^*$ is a solution to problem (1). We show how the restart technique can be used to accelerate Algorithm 1 under this additional assumption.
Algorithm 3: Restarted Optimal Tensor Method
Input: $p$, $M_p$, $q$, $\sigma_q$, $z_0$, $\Delta_0$ s.t. $f(z_0) - f^* \le \Delta_0$.
for $k = 0, 1, \ldots$ do
  Set $\Delta_k = \Delta_0 \cdot 2^{-k}$ and
  \[
    N_k = \max\left\{ \left\lceil \left( \frac{2 c\, M_p\, q^{\frac{p+1}{q}}}{\sigma_q^{\frac{p+1}{q}}}\, \Delta_k^{\frac{p+1-q}{q}} \right)^{\frac{2}{3p+1}} \right\rceil, \; 1 \right\}.  \qquad (19)
  \]
  Set $z_{k+1} = y_{N_k}$ as the output of Algorithm 1 started from $z_k$ and run for $N_k$ steps.
  Set $k = k + 1$.
end for
Output: $z_k$.
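A minimal sketch of this restart scheme (our own illustration, not the authors' code): `run_algorithm_1(z, N)` is an assumed callable running Algorithm 1 for $N$ iterations from $z$, and the formula for $N_k$ follows (19) as written above.

```python
import math

def restarted_optimal_tensor_method(run_algorithm_1, z0, p, M_p, q, sigma_q,
                                     Delta0, c, num_restarts):
    """Sketch of Algorithm 3: restart Algorithm 1, halving the residual bound."""
    z, Delta = z0, Delta0
    for _ in range(num_restarts):
        # N_k from (19): enough steps of Algorithm 1 to guarantee f(z_{k+1}) - f* <= Delta / 2
        N = max(math.ceil((2.0 * c * M_p * q ** ((p + 1) / q)
                           * Delta ** ((p + 1 - q) / q)
                           / sigma_q ** ((p + 1) / q)) ** (2.0 / (3 * p + 1))), 1)
        z = run_algorithm_1(z, N)
        Delta /= 2.0
    return z
```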
Theorem 4
Let the sequence $z_k$, $k \ge 0$, be generated by Algorithm 3. Then
\[
  \frac{\sigma_q}{q} \|z_k - x^*\|^{q} \le f(z_k) - f^* \le \Delta_0 \cdot 2^{-k},
\]
and the total number of steps of Algorithm 1 is bounded by ($c$ is defined in (16))
\[
  \left( 2 c\, q^{\frac{p+1}{q}} \right)^{\frac{2}{3p+1}} \frac{M_p^{\frac{2}{3p+1}}}{\sigma_q^{\frac{2(p+1)}{q(3p+1)}}}\, (\Delta_0)^{\frac{2(p+1-q)}{q(3p+1)}} \cdot \sum_{i=0}^{k} 2^{-i\,\frac{2(p+1-q)}{q(3p+1)}} + k.
\]
Proof
Let us prove the first statement of the theorem by induction. For $k = 0$ it holds by the choice of $\Delta_0$. If it holds for some $k \ge 0$, then, by the choice of $N_k$, we have
\[
  \frac{c\, M_p}{N_k^{\frac{3p+1}{2}}} \left( \frac{q \Delta_k}{\sigma_q} \right)^{\frac{p+1}{q}} \le \frac{\Delta_k}{2}.
\]
By (18),
\[
  \|z_k - x^*\|^{p+1} \le \left( \frac{q \big(f(z_k) - f^*\big)}{\sigma_q} \right)^{\frac{p+1}{q}} \le \left( \frac{q \Delta_k}{\sigma_q} \right)^{\frac{p+1}{q}},
\]
since, by our assumption, $q \le p + 1$. Combining the above two inequalities and Theorem 1, we obtain
\[
  f(z_{k+1}) - f^* \le \frac{c\, M_p \|z_k - x^*\|^{p+1}}{N_k^{\frac{3p+1}{2}}} \le \frac{\Delta_k}{2} = \Delta_{k+1}.
\]
It remains to bound the total number of steps of Algorithm 1. Denote $\tilde c = \big( 2 c\, q^{\frac{p+1}{q}} \big)^{\frac{2}{3p+1}}$. Then
\[
  \sum_{i=0}^{k} N_i \le \tilde c\, \frac{M_p^{\frac{2}{3p+1}}}{\sigma_q^{\frac{2(p+1)}{q(3p+1)}}} \sum_{i=0}^{k} \big( \Delta_0 \cdot 2^{-i} \big)^{\frac{2(p+1-q)}{q(3p+1)}} + k
  \le \tilde c\, \frac{M_p^{\frac{2}{3p+1}}}{\sigma_q^{\frac{2(p+1)}{q(3p+1)}}} (\Delta_0)^{\frac{2(p+1-q)}{q(3p+1)}} \cdot \sum_{i=0}^{k} 2^{-i\,\frac{2(p+1-q)}{q(3p+1)}} + k.
\]

Let us make several remarks on the complexity of the restarted scheme in different settings. It is easy to see from Theorem 4 that, to achieve an accuracy $\varepsilon$, i.e., to find a point $\hat x$ s.t. $f(\hat x) - f^* \le \varepsilon$, the number of tensor steps in Algorithm 3 is
\[
  O\!\left( \frac{M_p^{\frac{2}{3p+1}}}{\sigma_q^{\frac{2(p+1)}{q(3p+1)}}} (\Delta_0)^{\frac{2(p+1-q)}{q(3p+1)}} + \log \frac{\Delta_0}{\varepsilon} \right), \quad q < p + 1, \qquad \text{and} \qquad
  O\!\left( \left( \left(\frac{M_p}{\sigma_{p+1}}\right)^{\frac{2}{3p+1}} + 1 \right) \log \frac{\Delta_0}{\varepsilon} \right), \quad q = p + 1.
\]
Theorem 4 suggests a natural generalization of the first- and second-order condition numbers Nesterov (2008). If $f$ is such that $q = p + 1$, then the complexity of Algorithm 3 depends only logarithmically on the starting point and is proportional to $(\gamma_p)^{\frac{2}{3p+1}}$, where $\gamma_p = M_p / \sigma_{p+1}$ is the $p$-th order condition number. Unfortunately, if $q < p + 1$, the complexity depends polynomially on the initial objective residual $\Delta_0$, which, in general, is not controlled.

An interesting special case is $q = 2$ and $p \ge 2$ and, as a consequence, $q < p + 1$. As can be seen from Theorem 2 (see also Bubeck et al. (2018)), the sequence generated by Algorithm 1 is bounded by some $R = O(\|x_0 - x^*\|)$. Hence, the constant $M_2$ can be estimated as $M_2 \le M_p R^{p-2}$. At the same time, in (Nesterov, 2008, Sect. 6), it is shown that the cubic regularized Newton method of Nesterov and Polyak (2006) has a region of quadratic convergence of the form
\[
  \left\{ x \; : \; f(x) - f^* \le \frac{\sigma_2^{3}}{M_2^{2}} \right\} \supseteq \left\{ x \; : \; f(x) - f^* \le \frac{\sigma_2^{3}}{M_p^{2} R^{2(p-2)}} \right\}.
\]
To enter this region, Algorithm 3 requires
\[
  O\!\left( \frac{M_p^{\frac{2}{3p+1}}}{\sigma_2^{\frac{p+1}{3p+1}}} (\Delta_0)^{\frac{p-1}{3p+1}} + \log \frac{\Delta_0 M_p^{2} R^{2(p-2)}}{\sigma_2^{3}} \right)
  = O\!\left( \frac{M_p^{\frac{2}{3p+1}}}{\sigma_2^{\frac{p+1}{3p+1}}} (\Delta_0)^{\frac{p-1}{3p+1}} + \log \frac{M_p^{2} \Delta_0^{p-1}}{\sigma_2^{p+1}} \right)  \qquad (20)
\]
tensor steps, where we used the inequality $R^2 \le 2\Delta_0/\sigma_2$, which follows from (18). After entering the region of quadratic convergence, Algorithm 3 can be switched to the cubic regularized Newton method Nesterov and Polyak (2006), which has final-stage complexity (Nesterov and Polyak, 2006, Sect. 6)
\[
  O\!\left( \log \log \frac{\sigma_2^{3}}{M_2^{2}\, \varepsilon} \right) = O\!\left( \log \log \frac{\sigma_2^{3}}{M_p^{2} R^{2(p-2)}\, \varepsilon} \right).
\]
Summing this bound and (20), we obtain the total complexity of this switching procedure for obtaining a small accuracy $\varepsilon$. Note that the second term in (20) is typically dominated by the first one, so we can ignore it without loss of generality.

Finally, let us compare our upper bound with known lower bounds. For the case $p = 1$, $q = 2$, our complexity bound coincides with the lower bound for first-order methods Nemirovsky and Yudin (1983); Nesterov (2004). Arjevani et al. (2018) propose lower bounds for second-order methods for the case $p = 2$, $q = 2$, and our complexity bound coincides with their lower bound up to the change of variables $D = \sqrt{2\Delta_0/\sigma_2}$, which is natural since, in this case, $f$ is strongly convex.
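For concreteness (a sanity check we add; it is not part of the original derivation), consider the smooth strongly convex case $p = 1$, $q = 2$, so that $q = p + 1$. The bound of Theorem 4 then reduces, up to constant factors, to
\[
  O\!\left( \left(\frac{M_1}{\sigma_2}\right)^{\frac{1}{2}} \log \frac{\Delta_0}{\varepsilon} \right),
\]
which is the classical $\sqrt{L/\mu}\,\log(1/\varepsilon)$ complexity of accelerated gradient methods for smooth strongly convex problems and matches the corresponding lower bound.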
4. Numerical Analysis
In this section, we analyze and compare the performance of Algorithm 1 with the accelerated tensor method proposed in Nesterov (2018). We study the numerical performance on two classes of functions. The first is a parametric family of objective functions that are difficult for all tensor methods Nesterov (2018), defined as
\[
  f_m(x) = \eta_{p+1}(A_m x) - x_1,  \qquad (21)
\]
where, for an integer parameter $p \ge 1$,
\[
  \eta_{p+1}(x) = \frac{1}{p+1} \sum_{i=1}^{n} |x_i|^{p+1},
\]
$1 \le m \le n$, $x \in \mathbb{R}^n$, and $A_m$ is the $n \times n$ block-diagonal matrix
\[
  A_m = \begin{pmatrix} U_m & 0 \\ 0 & I_{n-m} \end{pmatrix}, \quad \text{with} \quad
  U_m = \begin{pmatrix}
    1 & -1 &        &        & 0 \\
      &  1 & -1     &        &   \\
      &    & \ddots & \ddots &   \\
      &    &        &   1    & -1 \\
    0 &    &        &        &  1
  \end{pmatrix},  \qquad (22)
\]
and $I_{n-m}$ is the $(n-m) \times (n-m)$ identity matrix. For a detailed description of the high-order derivatives of this class of functions and of its optimality properties, see Nesterov (2018).
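The following short sketch (our own, not the authors' code) constructs this instance for $p = 3$ together with its gradient; it is only meant to make (21)–(22) concrete.

```python
import numpy as np

def make_hard_instance(n, m, p=3):
    """Build f_m(x) = eta_{p+1}(A_m x) - x_1 from (21)-(22) and its gradient."""
    A = np.eye(n)
    for i in range(m - 1):          # U_m: ones on the diagonal, -1 on the superdiagonal
        A[i, i + 1] = -1.0
    def f(x):
        z = A @ x
        return np.sum(np.abs(z) ** (p + 1)) / (p + 1) - x[0]
    def grad(x):
        z = A @ x
        g = A.T @ (np.sign(z) * np.abs(z) ** p)
        g[0] -= 1.0
        return g
    return f, grad, A
```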
Figure 1 shows the normalized optimality gap along the iterations generated by the accelerated tensor method from Nesterov (2018) in Figure 1(a), and by Algorithm 1 in Figure 1(b). We denote the minimum function value by $f^*$. For both experiments we have used $p = 3$ and $n = k = \{5, 10, 15, 20, 25\}$. These numerical results show that Algorithm 1 requires a much smaller number of iterations than the accelerated tensor method from Nesterov (2018) to reach the same optimality gap for the class of "bad" functions described in Nesterov (2018). For example, for the case $n = k = 25$, Algorithm 1 reaches the desired accuracy in far fewer iterations than the accelerated tensor method.

Figure 1: A performance comparison between the accelerated tensor method in Nesterov (2018) (shown in (a)) and Algorithm 1 (shown in (b)). We minimize an instance of the family of functions in (21) with $p = 3$ and various values of the dimension $n$ and $k$. Note that the $x$-axis scaling of the two figures is different.

As a second set of numerical results, we study the performance of the proposed method on the non-regularized logistic regression problem. For this problem we are given a set of $d$ data pairs $\{y_i, w_i\}$, $1 \le i \le d$, where $y_i \in \{1, -1\}$ is the class label of object $i$ and $w_i \in \mathbb{R}^n$ is the set of features of object $i$. We are interested in finding a vector $x$ that solves the following optimization problem:
\[
  \frac{1}{d} \sum_{i=1}^{d} \ln\big( 1 + \exp\big( -y_i \langle w_i, x \rangle \big) \big) \to \min_{x \in \mathbb{R}^n}.  \qquad (23)
\]
Figure 2 shows the simulation results for the logistic regression problem (23) on various datasets. As in Figure 1, we compare the performance of Algorithm 1 and of the accelerated tensor method in Nesterov (2018). In Figure 2(a) and Figure 2(b) we generate synthetic data: first we define a vector $\hat x \in [-1, 1]^n$ with every entry chosen uniformly at random; the set of features of each object $i$, i.e., $w_i \in [-1, 1]^n$, also has every entry chosen uniformly at random; finally, each label is computed as $y_i = \operatorname{sign}(\langle w_i, \hat x \rangle)$ (a small sketch of this setup is given at the end of this section). For Figure 2(a) we set $n = 10$ and $d = 100$, while in Figure 2(b) we set $n = 100$ and $d = 1000$. Figure 2(c) uses the mushroom dataset ($d = 8124$ and $n = 112$) Dheeru and Karra Taniskidou (2017), and Figure 2(d) uses the a9a dataset ($d = 32561$ and $n = 123$) Dheeru and Karra Taniskidou (2017).

For the logistic regression problem we do not have access to the optimal function value in general; thus, we plot only the cost function evaluated at the current iterate. As expected from the theoretical results, Algorithm 1 requires one order of magnitude fewer iterations than the accelerated tensor method from Nesterov (2018) to achieve the same function value.

In Appendix B, we numerically compare the performance of the accelerated tensor method from Nesterov (2018) for $p = 2$ and $p = 3$, as well as its accelerated and non-accelerated versions.

Figure 2: Performance comparison for the non-regularized logistic regression problem between the accelerated tensor method from Nesterov (2018) and Algorithm 1. (a) Uses synthetic data with $n = 10$ and $d = 100$, (b) uses synthetic data with $n = 100$ and $d = 1000$, (c) uses the mushroom dataset ($d = 8124$ and $n = 112$) Dheeru and Karra Taniskidou (2017), and (d) uses the a9a dataset ($d = 32561$ and $n = 123$) Dheeru and Karra Taniskidou (2017).
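A minimal sketch (our own illustration, not the authors' code) of the objective (23) and of the synthetic data generation described above; the uniform ranges $[-1, 1]$ follow the description in the text.

```python
import numpy as np

def make_synthetic_logreg(n, d, seed=0):
    """Synthetic logistic-regression data as described for Figures 2(a)-(b)."""
    rng = np.random.default_rng(seed)
    x_hat = rng.uniform(-1.0, 1.0, size=n)       # hidden vector x_hat in [-1, 1]^n
    W = rng.uniform(-1.0, 1.0, size=(d, n))      # rows are the feature vectors w_i
    y = np.sign(W @ x_hat)                       # labels y_i = sign(<w_i, x_hat>)

    def f(x):
        # objective (23): (1/d) * sum_i ln(1 + exp(-y_i <w_i, x>))
        margins = y * (W @ x)
        return np.mean(np.logaddexp(0.0, -margins))

    return f, W, y
```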
Acknowledgments

The authors are grateful to Yurii Nesterov for fruitful discussions. The work of A. Gasnikov was supported by RFBR 18-29-03071 mk and was prepared within the framework of the HSE University Basic Research Program and funded by the Russian Academic Excellence Project '5-100'; the work of P. Dvurechensky and E. Vorontsova was supported by RFBR 18-31-20005 mol-a-ved; and the work of E. Gorbunov was supported by the grant of the President of the Russian Federation MD-1320.2018.1.
References
Naman Agarwal and Elad Hazan. Lower bounds for higher-order convex optimization. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 774–792. PMLR, 06–09 Jul 2018. URL http://proceedings.mlr.press/v75/agarwal18a.html.

Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, May 2018. ISSN 1436-4646. doi: 10.1007/s10107-018-1293-1. URL https://doi.org/10.1007/s10107-018-1293-1.

Michel Baes. Estimate sequence methods: extensions and approximations. Technical report, 2009.

E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and Ph. L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1):359–368, May 2017. ISSN 1436-4646. doi: 10.1007/s10107-016-1065-8. URL https://doi.org/10.1007/s10107-016-1065-8.

Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. arXiv:1812.08026, 2018.

Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Improved second-order evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. arXiv:1708.04044, 2018.

Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

K. H. Hoffmann and H. J. Kornstaedt. Higher-order necessary conditions in abstract mathematical programming. Journal of Optimization Theory and Applications, 26(4):533–568, Dec 1978. ISSN 1573-2878. doi: 10.1007/BF00933151. URL https://doi.org/10.1007/BF00933151.

Bo Jiang, Haoyue Wang, and Shuzhong Zhang. An optimal high-order tensor method for convex optimization. arXiv:1812.06557, 2018.

R. Monteiro and B. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013. doi: 10.1137/110833786. URL https://doi.org/10.1137/110833786.

A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, New York, 1983.

Yu. Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, Mar 2008. ISSN 1436-4646. doi: 10.1007/s10107-006-0089-x. URL https://doi.org/10.1007/s10107-006-0089-x.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.

Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Technical report, CORE UCL, 2018. URL https://alfresco.uclouvain.be/alfresco/service/guest/streamDownload/workspace/SpacesStore/aabc2323-0bc1-40d4-9653-1c29971e7bd8/coredp2018_05web.pdf. CORE Discussion Paper 2018/05.

Yurii Nesterov and Boris Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006. ISSN 1436-4646. doi: 10.1007/s10107-006-0706-8. URL http://dx.doi.org/10.1007/s10107-006-0706-8.

Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization: Supplementary Material
Appendix A. Technical lemmas
Lemma 5
Consider a sequence $\{A_k\}_{k \ge 1}$ of non-negative numbers such that
\[
  A_N \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \sum_{k=1}^{N} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{p+1}},  \qquad (24)
\]
where $p \ge 1$, $\theta = \frac{p!}{4(p+1) M_p}$, and $M_p, R > 0$. Then for all $k \ge 1$ we have
\[
  A_k \ge \frac{k^{\frac{3p+1}{2}}}{c\, M_p R^{p-1}},  \qquad (25)
\]
where
\[
  c = \frac{2^{\frac{3(p+1)^2+4}{4}}\, (p+1)^{\frac{3p+3}{2}}}{p!}.  \qquad (26)
\]
We prove (25) by induction. For $k = 1$, inequality (24) gives
\[
  A_1 \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}}\, A_1^{\frac{p-1}{p+1}}
  \iff A_1^{\frac{2}{p+1}} \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}}
  \iff A_1 \ge \frac{\theta}{2^{p+1} (2R)^{p-1}},
\]
and the last inequality implies (25) for $k = 1$ and all $p \ge 1$. Now assume that inequality (25) holds for all $k \le N$ with some $N \ge 1$; we establish (25) for $k = N + 1$. We have
\[
  A_{N+1} \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \sum_{k=1}^{N+1} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{p+1}}
  \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \sum_{k=1}^{N} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{p+1}}
  \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \frac{1}{c M_p R^{p-1}} \right)^{\frac{p-1}{p+1}} \left( \sum_{k=1}^{N} k^{\frac{p-1}{2}} \right)^{\frac{3p+1}{p+1}}.  \qquad (27)
\]
Since $\frac{p-1}{2} \ge 0$, the function $x^{\frac{p-1}{2}}$ is non-decreasing and, as a consequence, we get
\[
  \sum_{k=2}^{N} k^{\frac{p-1}{2}} \ge \int_{1}^{N} x^{\frac{p-1}{2}}\, dx = \frac{2}{p+1} \left( N^{\frac{p+1}{2}} - 1 \right) \ge \frac{2}{p+1} N^{\frac{p+1}{2}} - 1.  \qquad (29)
\]
Using this fact and $\sum_{k=1}^{N} k^{\frac{p-1}{2}} = 1 + \sum_{k=2}^{N} k^{\frac{p-1}{2}}$, we continue (27) as
\[
  A_{N+1} \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \frac{1}{c M_p R^{p-1}} \right)^{\frac{p-1}{p+1}} \left( \frac{2}{p+1} N^{\frac{p+1}{2}} \right)^{\frac{3p+1}{p+1}}.  \qquad (28)
\]
For all $N \ge 1$ we have
\[
  \left( \frac{N}{N+1} \right)^{\frac{3p+1}{2}} = \left( 1 - \frac{1}{N+1} \right)^{\frac{3p+1}{2}} \ge \left( 1 - \frac{1}{2} \right)^{\frac{3p+1}{2}} = \frac{1}{2^{\frac{3p+1}{2}}}.
\]
From this and (28) we obtain that, for all $N \ge 1$,
\[
  A_{N+1} \ge \frac{1}{2^{\frac{3p+1}{2}}} \cdot \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \frac{1}{c M_p R^{p-1}} \right)^{\frac{p-1}{p+1}} \left( \frac{2}{p+1} \right)^{\frac{3p+1}{p+1}} (N+1)^{\frac{3p+1}{2}}.
\]
It remains to check that the constant $c$ given by (26) guarantees
\[
  \frac{1}{2^{\frac{3p+1}{2}}} \cdot \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \frac{1}{c M_p R^{p-1}} \right)^{\frac{p-1}{p+1}} \left( \frac{2}{p+1} \right)^{\frac{3p+1}{p+1}} \ge \frac{1}{c M_p R^{p-1}},
\]
which follows by a direct calculation using $\theta = \frac{p!}{4(p+1) M_p}$ (note that the powers of $R$ on both sides cancel). This finishes the proof.

Appendix B. Comparison of the accelerated tensor method from Nesterov (2018) for p = 2 and p = 3

In this appendix, we numerically compare the performance of the accelerated tensor method proposed in (Nesterov, 2018) for $p = 2$ and $p = 3$. We also compare the accelerated and non-accelerated versions of this method.
Figure 3: Performance of tensor methods and accelerated tensor methods for $p = 2$ and $p = 3$ on a difficult instance (21) for all unconstrained minimization tensor methods, with $n = 100$ and $m = 50$.

Similarly as in Figure 1 and Figure 2, we present the numerical results for the class of bad functions defined in (21) and for one instance of the logistic regression problem. In Figure 3, we compare the behavior of the following methods: 1) the tensor method Nesterov (2018) for $p = 3$; 2) the accelerated tensor method Nesterov (2018) for $p = 3$; 3) the tensor method Nesterov (2018) for $p = 2$; 4) the accelerated tensor method Nesterov (2018) for $p = 2$. Again, the optimal function value is denoted by $f^*$. Interestingly, we observe that the non-accelerated method outperforms the accelerated method for the first $m$ iterations. Since the corresponding theorem from Nesterov (2018) applies only for $k \le m$, we do not study the behaviour of the methods for a larger number of iterations. Even in this simple setting it is still non-trivial how to implement tensor methods for such bad examples of functions.
Figure 4: Function value achieved by the iterates of the accelerated tensor method for the logistic regression problem on the Covertype dataset Dheeru and Karra Taniskidou (2017). Number of samples $d = 20000$, dimension $n = 55$.
In Figure 4, we consider the behaviour of the same set of methods as in Figure 3, but for the logistic regression problem defined in (23) on the Covertype dataset Dheeru and Karra Taniskidou (2017). Again, we notice that in both cases the non-accelerated version works better in our experiments.

First of all, we point out that tensor methods are, in general, non-trivial to implement, so obtaining better implementations is an interesting direction for future work. Secondly, we conjecture that the slow convergence we see in our experiments is due to the large value of $M_p$ that we use; by tuning the parameters one can obtain better convergence in practice.