The global rate of convergence for optimal tensor methods in smooth convex optimization
Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, César A. Uribe
Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization
Alexander Gasnikov
gasnikov@yandex.ru
Moscow Institute of Physics and Technology, Institute for Information Transmission Problems, National Research University Higher School of Economics
Pavel Dvurechensky
pavel.dvurechensky@gmail.com
Weierstrass Institute for Applied Analysis and Stochastics, Institute for Information Transmission Problems
Eduard Gorbunov
eduard.gorbunov@phystech.edu
Moscow Institute of Physics and Technology
Evgeniya Vorontsova
vorontsovaea@gmail.com
Far Eastern Federal University
Daniil Selikhanovych
selihanovich.do@phystech.edu
Moscow Institute of Physics and Technology, Institute for Information Transmission Problems
César A. Uribe
cauribe@mit.edu
Massachusetts Institute of Technology
September 2, 2018

Abstract
We consider convex optimization problems with the objective function having a Lipschitz-continuous $p$-th order derivative, where $p \ge 1$. We propose a new tensor method, which closes the gap between the lower $O\big(\varepsilon^{-2/(3p+1)}\big)$ and upper $O\big(\varepsilon^{-1/(p+1)}\big)$ iteration complexity bounds for this class of optimization problems. We also consider uniformly convex functions and show how the proposed method can be accelerated under this additional assumption. Moreover, we introduce a $p$-th order condition number which naturally arises in the complexity analysis of tensor methods under this assumption. Finally, we make a numerical study of the proposed optimal method and show that in practice it is faster than the best known accelerated tensor method. We also compare the performance of tensor methods for $p = 2$ and $p = 3$ and show that the 3rd-order method is superior to the 2nd-order method in practice.

Keywords:
Convex optimization, unconstrained minimization, tensor methods, worst-case complexity, global complexity bounds, condition number
1. Introduction
In this paper, we consider the unconstrained convex optimization problem
\[
  f(x) \to \min_{x \in \mathbb{R}^n},  \qquad (1)
\]
1. The first version of this paper appeared on September 2, 2018 in Russian. In the current version we present a translation into English of the main derivations, extend the analysis from the case of strongly convex objectives to the case of uniformly convex objectives, and add a numerical study of our results.
© A. Gasnikov, P. Dvurechensky, E. Gorbunov, E. Vorontsova, D. Selikhanovych & C.A. Uribe.

where $f$ has a Lipschitz-continuous $p$-th derivative with constant $M_p$. For $p = 1$, first-order methods, i.e., gradient-type methods, are commonly used to solve this problem. The lower bound for the complexity of these methods was proposed in (Nemirovsky and Yudin, 1983; Nesterov, 2004), and an optimal method was introduced in (Nesterov, 1983). The case $p = 2$, i.e., Newton-type methods, was well understood only recently. A nearly optimal method was proposed in (Nesterov, 2008), an optimal method was proposed in (Monteiro and Svaiter, 2013), and a lower bound was obtained in (Agarwal and Hazan, 2018; Arjevani et al., 2018).

The idea of using higher-order derivatives (starting from $p \ge 2$) in optimization has been known at least since the 1970's, see Hoffmann and Kornstaedt (1978). Recently this direction of research became of interest from the point of view of complexity bounds. The unpublished preprint Baes (2009), extending the estimating functions technique of Nesterov (2004), proposes accelerated high-order (tensor) methods for convex problems with complexity $O\big(\big(M_p R^{p+1}/\varepsilon\big)^{1/(p+1)}\big)$, where $p \ge 2$, $\varepsilon$ is the accuracy of the obtained solution $\hat x$, i.e., $f(\hat x) - f^* \le \varepsilon$, $M_p$ is the Lipschitz constant of the $p$-th derivative, and $R$ is an estimate for the distance between the starting point and the closest solution. Nevertheless, the author doubts that the obtained methods are implementable, since the auxiliary problem on each iteration is possibly non-convex. Agarwal and Hazan (2018); Arjevani et al. (2018) construct lower complexity bounds for the class of functions with Lipschitz $p$-th derivative, the sharper of them being $\Omega\big(\big(M_p R^{p+1}/\varepsilon\big)^{2/(3p+1)}\big)$, and conjecture that the upper bound can be improved. Nesterov (2018) proposes implementable tensor methods, showing that an appropriately regularized Taylor expansion of a convex function is again a convex function, thus making the auxiliary problems on each iteration of the tensor methods tractable. The author also provides an accelerated scheme with complexity bound $O\big(\big(M_p R^{p+1}/\varepsilon\big)^{1/(p+1)}\big)$, shows that the complexity of each iteration for $p = 3$ is of the same order as for the case $p = 2$, and conjectures the existence of an optimal scheme with complexity bound $O\big(\big(M_p R^{p+1}/\varepsilon\big)^{2/(3p+1)}\big)$.

The optimal method for the case $p = 1$ has complexity $O\big(\big(M_1 R^{2}/\varepsilon\big)^{1/2}\big)$ (Nesterov, 1983) and for $p = 2$ it has complexity $O\big(\big(M_2 R^{3}/\varepsilon\big)^{2/7}\big)$ (Monteiro and Svaiter, 2013), but the question of existence of optimal methods for $p \ge 3$ remained open. In this paper we extend the framework of Monteiro and Svaiter (2013) and propose optimal tensor methods for all $p \ge 1$.
Our approach is also based on the regularized Taylor step of Nesterov (2018), and, thus, our optimal method for $p = 2$ is different from that of Monteiro and Svaiter (2013).

We also consider problem (1) under the additional assumption that $f$ is uniformly convex, i.e., there exist $2 \le q \le p + 1$ and $\sigma_q > 0$ s.t.
\[
  f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma_q}{q} \|y - x\|^{q}, \quad \forall x, y \in Q.
\]
Under this additional assumption, we show how the restart technique can be applied to accelerate our method and to obtain the complexity
\[
  O\!\left( \left(\frac{M_p}{\sigma_{p+1}}\right)^{\frac{2}{3p+1}} \log \frac{\Delta_0}{\varepsilon} \right), \quad q = p + 1; \qquad
  O\!\left( \left(\frac{M_p\,(\Delta_0)^{\frac{p+1-q}{q}}}{\sigma_q^{\frac{p+1}{q}}}\right)^{\frac{2}{3p+1}} + \log \frac{\Delta_0}{\varepsilon} \right), \quad q < p + 1,
\]
where $f(x_0) - f^* \le \Delta_0$. This bound suggests a natural generalization of the first- and second-order condition numbers (Nesterov, 2008). If $f$ is such that $q = p + 1$, then the complexity of our algorithm depends only logarithmically on the starting point and is proportional to $(\gamma_p)^{\frac{2}{3p+1}}$, where $\gamma_p = M_p/\sigma_{p+1}$ is the $p$-th order condition number. Nemirovsky and Yudin (1983); Nesterov (2004) and Arjevani et al. (2018) propose lower bounds for particular cases of strongly convex functions (i.e., $q = 2$) with $p = 1$ and $p = 2$ respectively. Our upper bounds match them.

As related work, we also mention Birgin et al. (2017); Cartis et al. (2018), who study complexity bounds for tensor methods for finding approximate stationary points, with the main focus on non-convex optimization, which we do not consider in our work. Also, the work (Wibisono et al., 2016) considers tensor methods from the variational perspective and obtains bounds similar to those in Baes (2009). The first version of this paper appeared in arXiv on September 2, 2018. In December 2018, two months after that, Jiang et al. (2018); Bubeck et al. (2018) proposed an algorithm which is very similar to our Algorithm 1. Unlike them, we also analyze the case of uniformly convex functions and propose an algorithm which is faster in this case, see our Algorithm 3. Moreover, we are the first to make a numerical study of tensor methods for $p = 3$ and show that they work in practice.

Our contributions.
• We propose a new optimal tensor method and analyze its iteration complexity.
• We generalize this method to the case of uniformly convex objectives and propose a definition of the $p$-th order condition number.
• We make a numerical study of the proposed method and show that our optimal method is faster than the accelerated tensor method of Nesterov (2018) in practice. We also compare the performance of tensor methods for $p = 2$ and $p = 3$ and show that the 3rd-order method is superior to the 2nd-order method in practice.

Notations and generalities.
For $p \ge 1$, we denote by $\nabla^p f(x)[h_1, \ldots, h_p]$ the directional derivative of the function $f$ at $x$ along the directions $h_i \in \mathbb{R}^n$, $i = 1, \ldots, p$. Since $\nabla^p f(x)[h_1, \ldots, h_p]$ is a symmetric $p$-linear form, its norm is defined as
\[
  \|\nabla^p f(x)\| = \max_{h_1, \ldots, h_p \in \mathbb{R}^n} \left\{ \nabla^p f(x)[h_1, \ldots, h_p] \; : \; \|h_i\| \le 1, \; i = 1, \ldots, p \right\},
\]
or, equivalently,
\[
  \|\nabla^p f(x)\| = \max_{h \in \mathbb{R}^n} \left\{ \big|\nabla^p f(x)[h, \ldots, h]\big| \; : \; \|h\| \le 1 \right\}.
\]
Here, for simplicity, $\|\cdot\|$ is the standard Euclidean norm, but our algorithm and derivations can be generalized to the Euclidean norm given by a general positive semi-definite matrix $B$. We consider convex functions that are $p$ times differentiable on $\mathbb{R}^n$ and satisfy the Lipschitz condition for the $p$-th derivative
\[
  \|\nabla^p f(x) - \nabla^p f(y)\| \le M_p \|x - y\|, \quad x, y \in \mathbb{R}^n.  \qquad (2)
\]
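For intuition, condition (2) controls how well $f$ is approximated by its $p$-th order Taylor model; the following standard consequences (not stated explicitly at this point in the text, but used implicitly below, see Nesterov (2018)) hold for $\Phi_{x,p}(y) = \sum_{r=0}^{p} \frac{1}{r!} \nabla^r f(x)[y-x, \ldots, y-x]$:
\[
  |f(y) - \Phi_{x,p}(y)| \le \frac{M_p}{(p+1)!} \|y - x\|^{p+1}, \qquad \|\nabla f(y) - \nabla \Phi_{x,p}(y)\| \le \frac{M_p}{p!} \|y - x\|^{p}, \qquad \forall x, y \in \mathbb{R}^n.
\]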
2. Optimal Tensor Method
Given a function $f$, a number $p \ge 1$ and a constant $M \ge 0$, define
\[
  T^{f}_{p,M}(x) \in \operatorname*{Arg\,min}_{y \in \mathbb{R}^n} \left\{ \sum_{r=0}^{p} \frac{1}{r!} \nabla^r f(x) \underbrace{[y - x, \ldots, y - x]}_{r} + \frac{M}{(p+1)!} \|y - x\|^{p+1} \right\},  \qquad (3)
\]
and, given a number $L \ge 0$ and a point $z \in \mathbb{R}^n$, define
\[
  F_{L,z}(x) \triangleq f(x) + \frac{L}{2} \|x - z\|^{2}.  \qquad (4)
\]
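To make definition (3) concrete, below is a minimal illustrative sketch (our own, not the authors' code) of the regularized Taylor step for the case $p = 2$, where the model is the cubic-regularized Newton model. The callables `grad` and `hess`, the generic BFGS solver, and all names are our assumptions; efficient specialized solvers for this convex subproblem are discussed in Nesterov (2018).

```python
import numpy as np
from scipy.optimize import minimize

def tensor_step_p2(f, grad, hess, x, M):
    """Approximately compute T^f_{2,M}(x): minimize the second-order Taylor model
    of f at x plus the cubic regularizer (M / 3!) * ||y - x||^3, cf. (3)."""
    g, H = grad(x), hess(x)
    def model(y):
        d = y - x
        return f(x) + g @ d + 0.5 * d @ (H @ d) + (M / 6.0) * np.linalg.norm(d) ** 3
    # a generic solver is used only for illustration; see Nesterov (2018) for
    # efficient ways to solve this convex subproblem
    return minimize(model, x, method="BFGS").x
```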
Theorem 1
Let the sequence $(x_k, y_k, u_k)$, $k \ge 0$, be generated by Algorithm 1. Then
\[
  f(y_N) - f^* \le \frac{c\, M_p \|y_0 - x^*\|^{p+1}}{N^{\frac{3p+1}{2}}}, \qquad c = \frac{2^{\frac{3(p+1)^2+4}{4}}\, (p+1)^{\frac{3p+3}{2}}}{p!}.
\]
Note that this bound allows to obtain an $O\big(\big(M_p R^{p+1}/\varepsilon\big)^{2/(3p+1)}\big)$ iteration complexity. The implementability and the cost of each iteration are discussed below in Section 2.3. The proof of Theorem 1 is based on the framework of Monteiro and Svaiter (2013), which is presented in the next subsection.
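Indeed (a one-line check we add for completeness), with $R = \|y_0 - x^*\|$ the bound of Theorem 1 is below $\varepsilon$ as soon as
\[
  \frac{c\, M_p R^{p+1}}{N^{\frac{3p+1}{2}}} \le \varepsilon \iff N \ge \left(\frac{c\, M_p R^{p+1}}{\varepsilon}\right)^{\frac{2}{3p+1}}.
\]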
Algorithm 1: Optimal Tensor Method
Input: $u_0$, $y_0$ — starting points; $N$ — number of iterations; $A_0 = 0$.
Output: $y_N$.
for $k = 0, 1, 2, \ldots, N - 1$ do
  Choose $L_k$ such that
  \[
    \frac{1}{4} \le \frac{(p+1) M_p}{p!\, L_k} \|y_{k+1} - x_k\|^{p-1} \le \frac{1}{2},  \qquad (5)
  \]
  where
  \[
    a_{k+1} = \frac{1/L_k + \sqrt{1/L_k^2 + 4 A_k / L_k}}{2}, \qquad A_{k+1} = A_k + a_{k+1} \quad \{\text{note that } L_k a_{k+1}^2 = A_{k+1}\},
  \]
  \[
    x_k = \frac{A_k}{A_{k+1}} y_k + \frac{a_{k+1}}{A_{k+1}} u_k, \qquad y_{k+1} = T^{F_{L_k, x_k}}_{p,\, p M_p}(x_k).
  \]
  $u_{k+1} = u_k - a_{k+1} \nabla f(y_{k+1})$
end for
return $y_N$
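The outer loop of Algorithm 1 is simple to implement once the step $y_{k+1}$ and the constant $L_k$ are available. Below is a minimal sketch (our own illustration, not the authors' code), in which `line_search` is an assumed callable returning a pair $(L_k, y_{k+1})$ satisfying condition (5) (e.g., a bisection over $L$ as sketched later in this section) and `grad_f` returns $\nabla f$.

```python
import numpy as np

def optimal_tensor_method(grad_f, line_search, u0, y0, N):
    """Sketch of the outer loop of Algorithm 1."""
    u, y, A = u0.copy(), y0.copy(), 0.0
    for _ in range(N):
        # L_k and y_{k+1} = T^{F_{L_k,x_k}}_{p,pM_p}(x_k), chosen so that (5) holds
        L, y_next = line_search(u, y, A)
        # a_{k+1} solves L_k * a^2 = A_k + a, so that A_{k+1} = L_k * a_{k+1}^2
        a = (1.0 / L + np.sqrt(1.0 / L**2 + 4.0 * A / L)) / 2.0
        A += a
        u = u - a * grad_f(y_next)
        y = y_next
    return y
```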
Monteiro and Svaiter (2013) introduced Algorithm 2 for convex optimization problems. To find $y_{k+1}$ on each iteration, the authors use a gradient-type method for the case $p = 1$ and a trust-region Newton-type method for the case $p = 2$. Their analysis of the algorithm is based on the following theorem.

Theorem 2 ((Monteiro and Svaiter, 2013, Theorem 3.6))
Let the sequence $(x_k, y_k, u_k)$, $k \ge 0$, be generated by Algorithm 2 and define $R := \|y_0 - x^*\|$. Then, for all $N \ge 1$,
\[
  \|u_N - x^*\|^2 + A_N \big(f(y_N) - f(x^*)\big) + \frac{1}{4} \sum_{k=1}^{N} A_k L_{k-1} \|y_k - x_{k-1}\|^2 \le R^2,  \qquad (6)
\]
\[
  f(y_N) - f(x^*) \le \frac{R^2}{A_N}, \qquad \|u_N - x^*\| \le R,  \qquad (7)
\]
\[
  \sum_{k=1}^{N} A_k L_{k-1} \|y_k - x_{k-1}\|^2 \le 4 R^2.  \qquad (8)
\]
We also need the following lemma.

Lemma 3 ((Monteiro and Svaiter, 2013, Lemma 3.7(a)))
Let the sequences $\{A_k, L_k\}$, $k \ge 0$, be generated by Algorithm 2. Then, for all $N \ge 1$,
\[
  A_N \ge \frac{1}{4} \left( \sum_{k=1}^{N} \frac{1}{\sqrt{L_{k-1}}} \right)^{2}.  \qquad (9)
\]
Algorithm 2: Accelerated hybrid proximal extragradient method
Input: $u_0$, $y_0$ — starting points; $N$ — number of iterations; $A_0 = 0$.
Output: $y_N$.
for $k = 0, 1, 2, \ldots, N - 1$ do
  Choose $L_k$ and $y_{k+1}$ s.t.
  \[
    \big\|\nabla F_{L_k, x_k}(y_{k+1})\big\| \le \frac{L_k}{2} \|y_{k+1} - x_k\|,
  \]
  where
  \[
    a_{k+1} = \frac{1/L_k + \sqrt{1/L_k^2 + 4 A_k / L_k}}{2}, \qquad A_{k+1} = A_k + a_{k+1}, \qquad x_k = \frac{A_k}{A_{k+1}} y_k + \frac{a_{k+1}}{A_{k+1}} u_k.
  \]
  $u_{k+1} = u_k - a_{k+1} \nabla f(y_{k+1})$.
end for
return $y_N$

It follows from Algorithm 1 that $y_{k+1} = T^{F_{L_k, x_k}}_{p,\, p M_p}(x_k)$; thus, by (Nesterov, 2018, Lemma 1),
\[
  \big\|\nabla F_{L_k, x_k}(y_{k+1})\big\| \le \frac{(p+1) M_p}{p!} \|y_{k+1} - x_k\|^{p}.
\]
At the same time, by the upper bound in condition (5) of Algorithm 1,
\[
  \frac{(p+1) M_p}{p!\, L_k} \|y_{k+1} - x_k\|^{p-1} \le \frac{1}{2}.
\]
Hence,
\[
  \big\|\nabla F_{L_k, x_k}(y_{k+1})\big\| \le \frac{L_k}{2} \|y_{k+1} - x_k\|,
\]
and we can apply the framework of the previous subsection. What remains is to estimate the growth of $A_N$, which is our next step.

By the lower bound in condition (5) of Algorithm 1,
\[
  \frac{\|y_{k+1} - x_k\|^{p-1}}{L_k} \ge \theta,  \qquad (10)
\]
where $\theta = \frac{p!}{4(p+1) M_p}$. Using this inequality, we prove that
\[
  \sum_{k=1}^{N} A_k L_{k-1}^{\frac{p+1}{p-1}} \le (2R)^{2}\, \theta^{-\frac{2}{p-1}}.  \qquad (11)
\]
Indeed, from (8) and (10) we have
\[
  \theta^{\frac{2}{p-1}} \sum_{k=1}^{N} A_k L_{k-1}^{\frac{p+1}{p-1}} \le \sum_{k=1}^{N} A_k L_{k-1}^{\frac{p+1}{p-1}} \left( \frac{\|y_k - x_{k-1}\|^{p-1}}{L_{k-1}} \right)^{\frac{2}{p-1}} = \sum_{k=1}^{N} A_k L_{k-1} \|y_k - x_{k-1}\|^{2} \le 4 R^{2}.  \qquad (12)
\]
Further, from (11) it follows that
\[
  \sum_{k=1}^{N} \frac{1}{\sqrt{L_{k-1}}} \ge \frac{\theta^{\frac{1}{p+1}}}{(2R)^{\frac{p-1}{p+1}}} \left( \sum_{k=1}^{N} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{2(p+1)}}.  \qquad (13)
\]
To prove this, let us introduce the new variables $z_k = 1/\sqrt{L_{k-1}}$ and consider the following optimization problem, which gives the worst possible value of the l.h.s. of (13):
\[
  \min \sum_{k=1}^{N} z_k \quad \text{s.t.} \quad \sum_{k=1}^{N} A_k z_k^{-\gamma} \le C,  \qquad (14)
\]
where, in accordance with (11),
\[
  \gamma = \frac{2(p+1)}{p-1}, \qquad C = (2R)^{2}\, \theta^{-\frac{2}{p-1}}.
\]
Since the objective and the constraint are separable, this problem can be solved explicitly by the Lagrange principle:
\[
  z_k = \left( \frac{1}{C} \sum_{j=1}^{N} A_j^{\frac{1}{\gamma+1}} \right)^{\frac{1}{\gamma}} A_k^{\frac{1}{\gamma+1}}.
\]
Hence,
\[
  \min_{\sum_{k=1}^{N} A_k z_k^{-\gamma} \le C} \; \sum_{k=1}^{N} z_k = \frac{1}{C^{1/\gamma}} \left( \sum_{k=1}^{N} A_k^{\frac{1}{\gamma+1}} \right)^{\frac{\gamma+1}{\gamma}}.
\]
From this equality, (9) and (13), we have
\[
  A_N \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \sum_{k=1}^{N} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{p+1}}.  \qquad (15)
\]
From this inequality, we obtain that there exists a constant $c$ such that, for all $N \ge 1$,
\[
  A_N \ge \frac{N^{\frac{3p+1}{2}}}{c\, M_p R^{p-1}}.  \qquad (16)
\]
The derivation of the exact value of the constant $c$ can be found in Lemma 5 in the Appendix. This finishes the proof.

First of all, Theorem 1 in Nesterov (2018) says that, with the choice $M = p M_p$ in (3), the subproblem for finding $y_{k+1}$ in step 2 of Algorithm 1 is convex and, thus, is tractable. Moreover, for $p = 2$ this step corresponds to the step of the cubic regularized Newton method of Nesterov and Polyak (2006) and, as shown there, can be computed with essentially the same complexity as solving a linear system. For the case $p = 3$, Nesterov (2018) showed that this step can also be computed efficiently. In both cases the complexity of calculating $y_{k+1}$ is $\tilde O(n^3)$.

Let us now discuss the process of finding $L_k$ such that inequality (5) holds. By construction,
\[
  y_{k+1} = \operatorname*{arg\,min}_{y \in \mathbb{R}^n} \left\{ \sum_{r=0}^{p} \frac{1}{r!} \nabla^r f(x_k) \underbrace{[y - x_k, \ldots, y - x_k]}_{r} + \frac{p M_p}{(p+1)!} \|y - x_k\|^{p+1} + \frac{L_k}{2} \|y - x_k\|^{2} \right\}.
\]
This problem is strongly convex and, thus, has a unique solution for each $L_k > 0$. Hence, $y_{k+1}$ is uniquely defined by $L_k$. At the same time, if $L_k \to 0$, then $y_{k+1} \to \tilde y_k$ with
\[
  \tilde y_k \in \operatorname*{Arg\,min}_{y \in \mathbb{R}^n} \left\{ \sum_{r=0}^{p} \frac{1}{r!} \nabla^r f(x_k) \underbrace{[y - x_k, \ldots, y - x_k]}_{r} + \frac{p M_p}{(p+1)!} \|y - x_k\|^{p+1} \right\}
\]
being a fixed point. Whence,
\[
  \frac{(p+1) M_p}{p!\, L_k} \|y_{k+1} - x_k\|^{p-1} \to +\infty.
\]
On the other hand, if $L_k \to +\infty$, then $y_{k+1} \to x_k$ and
\[
  \frac{(p+1) M_p}{p!\, L_k} \|y_{k+1} - x_k\|^{p-1} \to 0.
\]
By the continuity of the dependence of $y_{k+1}$ on $L_k$, we see that there exists $L_k$ such that inequality (5) holds. An appropriate value of $L_k$ can be found by an extended line-search procedure as in (Monteiro and Svaiter, 2013, Section 7). The details of the complexity of the line search can be found in Jiang et al. (2018); Bubeck et al. (2018), where the authors prove a bound of $\tilde O(1)$ calls of $T^{F_{L_k, x_k}}_{p,\, p M_p}(x_k)$ per iteration.
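As a simple illustration of this line search over $L_k$ (our own sketch, not the procedure of the cited papers), one can bisect geometrically on $L$. Here `regularized_tensor_step(x, L)` is an assumed callable returning $T^{F_{L,x}}_{p,pM_p}(x)$, and the bracket $[1/4, 1/2]$ corresponds to condition (5) as stated above.

```python
import math
import numpy as np

def find_L(regularized_tensor_step, x, p, M_p, L_lo=1e-10, L_hi=1e10, max_iter=200):
    """Geometric bisection on L for condition (5):
    1/4 <= (p+1) M_p ||y(L) - x||^{p-1} / (p! L) <= 1/2.
    The monitored ratio decreases as L grows (see the limits discussed above)."""
    def ratio(L):
        y = regularized_tensor_step(x, L)
        return (p + 1) * M_p * np.linalg.norm(y - x) ** (p - 1) / (math.factorial(p) * L), y
    for _ in range(max_iter):
        L = math.sqrt(L_lo * L_hi)      # geometric midpoint of the bracket
        val, y = ratio(L)
        if val > 0.5:
            L_lo = L                    # ratio too large -> increase L
        elif val < 0.25:
            L_hi = L                    # ratio too small -> decrease L
        else:
            return L, y
    raise RuntimeError("line search for L_k did not converge")
```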
3. Extension for Uniformly Convex Case
In this section, we additionally assume that the objective function is uniformly convex of degree $q \ge 2$, i.e., there exists $\sigma_q > 0$ s.t.
\[
  f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma_q}{q} \|y - x\|^{q}, \quad \forall x, y \in Q.  \qquad (17)
\]
We also assume that $q \le p + 1$. As a corollary (take $x = x^*$ in (17) and use $\nabla f(x^*) = 0$),
\[
  f(y) \ge f(x^*) + \frac{\sigma_q}{q} \|y - x^*\|^{q}, \quad \forall y \in Q,  \qquad (18)
\]
where $x^*$ is a solution to problem (1). We show how the restart technique can be used to accelerate Algorithm 1 under this additional assumption.
Algorithm 3: Restarted Optimal Tensor Method
Input: $p$, $M_p$, $q$, $\sigma_q$, $z_0$, $\Delta_0$ s.t. $f(z_0) - f^* \le \Delta_0$.
for $k = 0, 1, \ldots$ do
  Set $\Delta_k = \Delta_0 \cdot 2^{-k}$ and
  \[
    N_k = \max\left\{ \left\lceil \left( \frac{2 c\, M_p\, q^{\frac{p+1}{q}}}{\sigma_q^{\frac{p+1}{q}}}\, \Delta_k^{\frac{p+1-q}{q}} \right)^{\frac{2}{3p+1}} \right\rceil, \; 1 \right\}.  \qquad (19)
  \]
  Set $z_{k+1} = y_{N_k}$ as the output of Algorithm 1 started from $z_k$ and run for $N_k$ steps.
  Set $k = k + 1$.
end for
Output: $z_k$.
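A minimal sketch of this restart scheme (our own illustration, not the authors' code): `run_algorithm_1(z, N)` is an assumed callable running Algorithm 1 for $N$ iterations from $z$, and the formula for $N_k$ follows (19) as written above.

```python
import math

def restarted_optimal_tensor_method(run_algorithm_1, z0, p, M_p, q, sigma_q,
                                     Delta0, c, num_restarts):
    """Sketch of Algorithm 3: restart Algorithm 1, halving the residual bound."""
    z, Delta = z0, Delta0
    for _ in range(num_restarts):
        # N_k from (19): enough steps of Algorithm 1 to guarantee f(z_{k+1}) - f* <= Delta / 2
        N = max(math.ceil((2.0 * c * M_p * q ** ((p + 1) / q)
                           * Delta ** ((p + 1 - q) / q)
                           / sigma_q ** ((p + 1) / q)) ** (2.0 / (3 * p + 1))), 1)
        z = run_algorithm_1(z, N)
        Delta /= 2.0
    return z
```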
Theorem 4
Let the sequence $z_k$, $k \ge 0$, be generated by Algorithm 3. Then
\[
  \frac{\sigma_q}{q} \|z_k - x^*\|^{q} \le f(z_k) - f^* \le \Delta_0 \cdot 2^{-k},
\]
and the total number of steps of Algorithm 1 is bounded by ($c$ is defined in (16))
\[
  \left( 2 c\, q^{\frac{p+1}{q}} \right)^{\frac{2}{3p+1}} \frac{M_p^{\frac{2}{3p+1}}}{\sigma_q^{\frac{2(p+1)}{q(3p+1)}}}\, (\Delta_0)^{\frac{2(p+1-q)}{q(3p+1)}} \cdot \sum_{i=0}^{k} 2^{-i\,\frac{2(p+1-q)}{q(3p+1)}} + k.
\]
Proof
Let us prove the first statement of the theorem by induction. For $k = 0$ it holds by the choice of $\Delta_0$. If it holds for some $k \ge 0$, then, by the choice of $N_k$, we have
\[
  \frac{c\, M_p}{N_k^{\frac{3p+1}{2}}} \left( \frac{q \Delta_k}{\sigma_q} \right)^{\frac{p+1}{q}} \le \frac{\Delta_k}{2}.
\]
By (18),
\[
  \|z_k - x^*\|^{p+1} \le \left( \frac{q \big(f(z_k) - f^*\big)}{\sigma_q} \right)^{\frac{p+1}{q}} \le \left( \frac{q \Delta_k}{\sigma_q} \right)^{\frac{p+1}{q}},
\]
since, by our assumption, $q \le p + 1$. Combining the above two inequalities and Theorem 1, we obtain
\[
  f(z_{k+1}) - f^* \le \frac{c\, M_p \|z_k - x^*\|^{p+1}}{N_k^{\frac{3p+1}{2}}} \le \frac{\Delta_k}{2} = \Delta_{k+1}.
\]
It remains to bound the total number of steps of Algorithm 1. Denote $\tilde c = \big( 2 c\, q^{\frac{p+1}{q}} \big)^{\frac{2}{3p+1}}$. Then
\[
  \sum_{i=0}^{k} N_i \le \tilde c\, \frac{M_p^{\frac{2}{3p+1}}}{\sigma_q^{\frac{2(p+1)}{q(3p+1)}}} \sum_{i=0}^{k} \big( \Delta_0 \cdot 2^{-i} \big)^{\frac{2(p+1-q)}{q(3p+1)}} + k
  \le \tilde c\, \frac{M_p^{\frac{2}{3p+1}}}{\sigma_q^{\frac{2(p+1)}{q(3p+1)}}} (\Delta_0)^{\frac{2(p+1-q)}{q(3p+1)}} \cdot \sum_{i=0}^{k} 2^{-i\,\frac{2(p+1-q)}{q(3p+1)}} + k.
\]

Let us make several remarks on the complexity of the restarted scheme in different settings. It is easy to see from Theorem 4 that, to achieve an accuracy $\varepsilon$, i.e., to find a point $\hat x$ s.t. $f(\hat x) - f^* \le \varepsilon$, the number of tensor steps in Algorithm 3 is
\[
  O\!\left( \frac{M_p^{\frac{2}{3p+1}}}{\sigma_q^{\frac{2(p+1)}{q(3p+1)}}} (\Delta_0)^{\frac{2(p+1-q)}{q(3p+1)}} + \log \frac{\Delta_0}{\varepsilon} \right), \quad q < p + 1, \qquad \text{and} \qquad
  O\!\left( \left( \left(\frac{M_p}{\sigma_{p+1}}\right)^{\frac{2}{3p+1}} + 1 \right) \log \frac{\Delta_0}{\varepsilon} \right), \quad q = p + 1.
\]
Theorem 4 suggests a natural generalization of the first- and second-order condition numbers Nesterov (2008). If $f$ is such that $q = p + 1$, then the complexity of Algorithm 3 depends only logarithmically on the starting point and is proportional to $(\gamma_p)^{\frac{2}{3p+1}}$, where $\gamma_p = M_p / \sigma_{p+1}$ is the $p$-th order condition number. Unfortunately, if $q < p + 1$, the complexity depends polynomially on the initial objective residual $\Delta_0$, which, in general, is not controlled.

An interesting special case is $q = 2$ and $p \ge 2$ and, as a consequence, $q < p + 1$. As can be seen from Theorem 2 (see also Bubeck et al. (2018)), the sequence generated by Algorithm 1 is bounded by some $R = O(\|x_0 - x^*\|)$. Hence, the constant $M_2$ can be estimated as $M_2 \le M_p R^{p-2}$. At the same time, in (Nesterov, 2008, Sect. 6), it is shown that the cubic regularized Newton method of Nesterov and Polyak (2006) has a region of quadratic convergence of the form
\[
  \left\{ x \; : \; f(x) - f^* \le \frac{\sigma_2^{3}}{M_2^{2}} \right\} \supseteq \left\{ x \; : \; f(x) - f^* \le \frac{\sigma_2^{3}}{M_p^{2} R^{2(p-2)}} \right\}.
\]
To enter this region, Algorithm 3 requires
\[
  O\!\left( \frac{M_p^{\frac{2}{3p+1}}}{\sigma_2^{\frac{p+1}{3p+1}}} (\Delta_0)^{\frac{p-1}{3p+1}} + \log \frac{\Delta_0 M_p^{2} R^{2(p-2)}}{\sigma_2^{3}} \right)
  = O\!\left( \frac{M_p^{\frac{2}{3p+1}}}{\sigma_2^{\frac{p+1}{3p+1}}} (\Delta_0)^{\frac{p-1}{3p+1}} + \log \frac{M_p^{2} \Delta_0^{p-1}}{\sigma_2^{p+1}} \right)  \qquad (20)
\]
tensor steps, where we used the inequality $R^2 \le 2\Delta_0/\sigma_2$, which follows from (18). After entering the region of quadratic convergence, Algorithm 3 can be switched to the cubic regularized Newton method Nesterov and Polyak (2006), which has final-stage complexity (Nesterov and Polyak, 2006, Sect. 6)
\[
  O\!\left( \log \log \frac{\sigma_2^{3}}{M_2^{2}\, \varepsilon} \right) = O\!\left( \log \log \frac{\sigma_2^{3}}{M_p^{2} R^{2(p-2)}\, \varepsilon} \right).
\]
Summing this bound and (20), we obtain the total complexity of this switching procedure for obtaining a small accuracy $\varepsilon$. Note that the second term in (20) is typically dominated by the first one, so we can ignore it without loss of generality.

Finally, let us compare our upper bound with known lower bounds. For the case $p = 1$, $q = 2$, our complexity bound coincides with the lower bound for first-order methods Nemirovsky and Yudin (1983); Nesterov (2004). Arjevani et al. (2018) propose lower bounds for second-order methods for the case $p = 2$, $q = 2$, and our complexity bound coincides with their lower bound up to the change of variables $D = \sqrt{2\Delta_0/\sigma_2}$, which is natural since, in this case, $f$ is strongly convex.
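For concreteness (a sanity check we add; it is not part of the original derivation), consider the smooth strongly convex case $p = 1$, $q = 2$, so that $q = p + 1$. The bound of Theorem 4 then reduces, up to constant factors, to
\[
  O\!\left( \left(\frac{M_1}{\sigma_2}\right)^{\frac{1}{2}} \log \frac{\Delta_0}{\varepsilon} \right),
\]
which is the classical $\sqrt{L/\mu}\,\log(1/\varepsilon)$ complexity of accelerated gradient methods for smooth strongly convex problems and matches the corresponding lower bound.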
4. Numerical Analysis
In this section, we analyze and compare the performance of Algorithm 1 with the accelerated tensor method proposed in Nesterov (2018). We study the numerical performance on two classes of functions. The first is a parametric family of objective functions that are difficult for all tensor methods Nesterov (2018), defined as
\[
  f_m(x) = \eta_{p+1}(A_m x) - x_1,  \qquad (21)
\]
where, for an integer parameter $p \ge 1$,
\[
  \eta_{p+1}(x) = \frac{1}{p+1} \sum_{i=1}^{n} |x_i|^{p+1},
\]
$1 \le m \le n$, $x \in \mathbb{R}^n$, and $A_m$ is the $n \times n$ block-diagonal matrix
\[
  A_m = \begin{pmatrix} U_m & 0 \\ 0 & I_{n-m} \end{pmatrix}, \quad \text{with} \quad
  U_m = \begin{pmatrix}
    1 & -1 &        &        & 0 \\
      &  1 & -1     &        &   \\
      &    & \ddots & \ddots &   \\
      &    &        &   1    & -1 \\
    0 &    &        &        &  1
  \end{pmatrix},  \qquad (22)
\]
and $I_{n-m}$ is the $(n-m) \times (n-m)$ identity matrix. For a detailed description of the high-order derivatives of this class of functions and of its optimality properties, see Nesterov (2018).
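The following short sketch (our own, not the authors' code) constructs this instance for $p = 3$ together with its gradient; it is only meant to make (21)–(22) concrete.

```python
import numpy as np

def make_hard_instance(n, m, p=3):
    """Build f_m(x) = eta_{p+1}(A_m x) - x_1 from (21)-(22) and its gradient."""
    A = np.eye(n)
    for i in range(m - 1):          # U_m: ones on the diagonal, -1 on the superdiagonal
        A[i, i + 1] = -1.0
    def f(x):
        z = A @ x
        return np.sum(np.abs(z) ** (p + 1)) / (p + 1) - x[0]
    def grad(x):
        z = A @ x
        g = A.T @ (np.sign(z) * np.abs(z) ** p)
        g[0] -= 1.0
        return g
    return f, grad, A
```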
Figure 1 shows the normalized optimality gap along the iterations generated by the accelerated tensor method from Nesterov (2018) in Figure 1(a), and by Algorithm 1 in Figure 1(b). We denote the minimum function value by $f^*$. For both experiments we have used $p = 3$ and $n = k = \{5, 10, 15, 20, 25\}$. These numerical results show that Algorithm 1 requires a much smaller number of iterations than the accelerated tensor method from Nesterov (2018) to reach the same optimality gap for the class of "bad" functions described in Nesterov (2018). For example, for the case $n = k = 25$, Algorithm 1 reaches the desired accuracy in far fewer iterations than the accelerated tensor method.

Figure 1: A performance comparison between the accelerated tensor method in Nesterov (2018) (shown in (a)) and Algorithm 1 (shown in (b)). We minimize an instance of the family of functions in (21) with $p = 3$ and various values of the dimension $n$ and $k$. Note that the $x$-axis scaling of the two figures is different.

As a second set of numerical results, we study the performance of the proposed method on the non-regularized logistic regression problem. For this problem we are given a set of $d$ data pairs $\{y_i, w_i\}$, $1 \le i \le d$, where $y_i \in \{1, -1\}$ is the class label of object $i$ and $w_i \in \mathbb{R}^n$ is the set of features of object $i$. We are interested in finding a vector $x$ that solves the following optimization problem:
\[
  \frac{1}{d} \sum_{i=1}^{d} \ln\big( 1 + \exp\big( -y_i \langle w_i, x \rangle \big) \big) \to \min_{x \in \mathbb{R}^n}.  \qquad (23)
\]
Figure 2 shows the simulation results for the logistic regression problem (23) on various datasets. As in Figure 1, we compare the performance of Algorithm 1 and of the accelerated tensor method in Nesterov (2018). In Figure 2(a) and Figure 2(b) we generate synthetic data: first we define a vector $\hat x \in [-1, 1]^n$ with every entry chosen uniformly at random; the set of features of each object $i$, i.e., $w_i \in [-1, 1]^n$, also has every entry chosen uniformly at random; finally, each label is computed as $y_i = \operatorname{sign}(\langle w_i, \hat x \rangle)$ (a small sketch of this setup is given at the end of this section). For Figure 2(a) we set $n = 10$ and $d = 100$, while in Figure 2(b) we set $n = 100$ and $d = 1000$. Figure 2(c) uses the mushroom dataset ($d = 8124$ and $n = 112$) Dheeru and Karra Taniskidou (2017), and Figure 2(d) uses the a9a dataset ($d = 32561$ and $n = 123$) Dheeru and Karra Taniskidou (2017).

For the logistic regression problem we do not have access to the optimal function value in general; thus, we plot only the cost function evaluated at the current iterate. As expected from the theoretical results, Algorithm 1 requires one order of magnitude fewer iterations than the accelerated tensor method from Nesterov (2018) to achieve the same function value.

In Appendix B, we numerically compare the performance of the accelerated tensor method from Nesterov (2018) for $p = 2$ and $p = 3$, as well as its accelerated and non-accelerated versions.

Figure 2: Performance comparison for the non-regularized logistic regression problem between the accelerated tensor method from Nesterov (2018) and Algorithm 1. (a) Uses synthetic data with $n = 10$ and $d = 100$, (b) uses synthetic data with $n = 100$ and $d = 1000$, (c) uses the mushroom dataset ($d = 8124$ and $n = 112$) Dheeru and Karra Taniskidou (2017), and (d) uses the a9a dataset ($d = 32561$ and $n = 123$) Dheeru and Karra Taniskidou (2017).
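A minimal sketch (our own illustration, not the authors' code) of the objective (23) and of the synthetic data generation described above; the uniform ranges $[-1, 1]$ follow the description in the text.

```python
import numpy as np

def make_synthetic_logreg(n, d, seed=0):
    """Synthetic logistic-regression data as described for Figures 2(a)-(b)."""
    rng = np.random.default_rng(seed)
    x_hat = rng.uniform(-1.0, 1.0, size=n)       # hidden vector x_hat in [-1, 1]^n
    W = rng.uniform(-1.0, 1.0, size=(d, n))      # rows are the feature vectors w_i
    y = np.sign(W @ x_hat)                       # labels y_i = sign(<w_i, x_hat>)

    def f(x):
        # objective (23): (1/d) * sum_i ln(1 + exp(-y_i <w_i, x>))
        margins = y * (W @ x)
        return np.mean(np.logaddexp(0.0, -margins))

    return f, W, y
```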
Acknowledgments

The authors are grateful to Yurii Nesterov for fruitful discussions. The work of A. Gasnikov was supported by RFBR 18-29-03071 mk and was prepared within the framework of the HSE University Basic Research Program and funded by the Russian Academic Excellence Project '5-100'; the work of P. Dvurechensky and E. Vorontsova was supported by RFBR 18-31-20005 mol-a-ved; and the work of E. Gorbunov was supported by the grant of the President of the Russian Federation MD-1320.2018.1.
References
Naman Agarwal and Elad Hazan. Lower bounds for higher-order convex optimization. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 774–792. PMLR, 06–09 Jul 2018. URL http://proceedings.mlr.press/v75/agarwal18a.html.

Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, May 2018. ISSN 1436-4646. doi: 10.1007/s10107-018-1293-1. URL https://doi.org/10.1007/s10107-018-1293-1.

Michel Baes. Estimate sequence methods: extensions and approximations. Technical report, 2009.

E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and Ph. L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1):359–368, May 2017. ISSN 1436-4646. doi: 10.1007/s10107-016-1065-8. URL https://doi.org/10.1007/s10107-016-1065-8.

Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. arXiv:1812.08026, 2018.

Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Improved second-order evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. arXiv:1708.04044, 2018.

Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

K. H. Hoffmann and H. J. Kornstaedt. Higher-order necessary conditions in abstract mathematical programming. Journal of Optimization Theory and Applications, 26(4):533–568, Dec 1978. ISSN 1573-2878. doi: 10.1007/BF00933151. URL https://doi.org/10.1007/BF00933151.

Bo Jiang, Haoyue Wang, and Shuzhong Zhang. An optimal high-order tensor method for convex optimization. arXiv:1812.06557, 2018.

R. Monteiro and B. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013. doi: 10.1137/110833786. URL https://doi.org/10.1137/110833786.

A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, New York, 1983.

Yu. Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, Mar 2008. ISSN 1436-4646. doi: 10.1007/s10107-006-0089-x. URL https://doi.org/10.1007/s10107-006-0089-x.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.

Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Technical report, CORE UCL, 2018. URL https://alfresco.uclouvain.be/alfresco/service/guest/streamDownload/workspace/SpacesStore/aabc2323-0bc1-40d4-9653-1c29971e7bd8/coredp2018_05web.pdf. CORE Discussion Paper 2018/05.

Yurii Nesterov and Boris Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006. ISSN 1436-4646. doi: 10.1007/s10107-006-0706-8. URL http://dx.doi.org/10.1007/s10107-006-0706-8.

Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization: Supplementary Material
Appendix A. Technical lemmas
Lemma 5
Consider a sequence $\{A_k\}_{k \ge 1}$ of non-negative numbers such that
\[
  A_N \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \sum_{k=1}^{N} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{p+1}},  \qquad (24)
\]
where $p \ge 1$, $\theta = \frac{p!}{4(p+1) M_p}$, and $M_p, R > 0$. Then for all $k \ge 1$ we have
\[
  A_k \ge \frac{k^{\frac{3p+1}{2}}}{c\, M_p R^{p-1}},  \qquad (25)
\]
where
\[
  c = \frac{2^{\frac{3(p+1)^2+4}{4}}\, (p+1)^{\frac{3p+3}{2}}}{p!}.  \qquad (26)
\]
We prove (25) by induction. For $k = 1$, inequality (24) gives
\[
  A_1 \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}}\, A_1^{\frac{p-1}{p+1}}
  \iff A_1^{\frac{2}{p+1}} \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}}
  \iff A_1 \ge \frac{\theta}{2^{p+1} (2R)^{p-1}},
\]
and the last inequality implies (25) for $k = 1$ and all $p \ge 1$. Now assume that inequality (25) holds for all $k \le N$ with some $N \ge 1$; we establish (25) for $k = N + 1$. We have
\[
  A_{N+1} \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \sum_{k=1}^{N+1} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{p+1}}
  \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \sum_{k=1}^{N} A_k^{\frac{p-1}{3p+1}} \right)^{\frac{3p+1}{p+1}}
  \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \frac{1}{c M_p R^{p-1}} \right)^{\frac{p-1}{p+1}} \left( \sum_{k=1}^{N} k^{\frac{p-1}{2}} \right)^{\frac{3p+1}{p+1}}.  \qquad (27)
\]
Since $\frac{p-1}{2} \ge 0$, the function $x^{\frac{p-1}{2}}$ is non-decreasing and, as a consequence, we get
\[
  \sum_{k=2}^{N} k^{\frac{p-1}{2}} \ge \int_{1}^{N} x^{\frac{p-1}{2}}\, dx = \frac{2}{p+1} \left( N^{\frac{p+1}{2}} - 1 \right) \ge \frac{2}{p+1} N^{\frac{p+1}{2}} - 1.  \qquad (29)
\]
Using this fact and $\sum_{k=1}^{N} k^{\frac{p-1}{2}} = 1 + \sum_{k=2}^{N} k^{\frac{p-1}{2}}$, we continue (27) as
\[
  A_{N+1} \ge \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \frac{1}{c M_p R^{p-1}} \right)^{\frac{p-1}{p+1}} \left( \frac{2}{p+1} N^{\frac{p+1}{2}} \right)^{\frac{3p+1}{p+1}}.  \qquad (28)
\]
For all $N \ge 1$ we have
\[
  \left( \frac{N}{N+1} \right)^{\frac{3p+1}{2}} = \left( 1 - \frac{1}{N+1} \right)^{\frac{3p+1}{2}} \ge \left( 1 - \frac{1}{2} \right)^{\frac{3p+1}{2}} = \frac{1}{2^{\frac{3p+1}{2}}}.
\]
From this and (28) we obtain that, for all $N \ge 1$,
\[
  A_{N+1} \ge \frac{1}{2^{\frac{3p+1}{2}}} \cdot \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \frac{1}{c M_p R^{p-1}} \right)^{\frac{p-1}{p+1}} \left( \frac{2}{p+1} \right)^{\frac{3p+1}{p+1}} (N+1)^{\frac{3p+1}{2}}.
\]
It remains to check that the constant $c$ given by (26) guarantees
\[
  \frac{1}{2^{\frac{3p+1}{2}}} \cdot \frac{\theta^{\frac{2}{p+1}}}{4\, (2R)^{\frac{2(p-1)}{p+1}}} \left( \frac{1}{c M_p R^{p-1}} \right)^{\frac{p-1}{p+1}} \left( \frac{2}{p+1} \right)^{\frac{3p+1}{p+1}} \ge \frac{1}{c M_p R^{p-1}},
\]
which follows by a direct calculation using $\theta = \frac{p!}{4(p+1) M_p}$ (note that the powers of $R$ on both sides cancel). This finishes the proof.

Appendix B. Comparison of the accelerated tensor method from Nesterov (2018) for p = 2 and p = 3

In this appendix, we numerically compare the performance of the accelerated tensor method proposed in (Nesterov, 2018) for $p = 2$ and $p = 3$. We also compare the accelerated and non-accelerated versions of this method.
Figure 3: Performance of tensor methods and accelerated tensor methods for $p = 2$ and $p = 3$ on a difficult instance (21) for all unconstrained minimization tensor methods, with $n = 100$ and $m = 50$.

Similarly as in Figure 1 and Figure 2, we present the numerical results for the class of bad functions defined in (21) and for one instance of the logistic regression problem. In Figure 3, we compare the behavior of the following methods: 1) the tensor method Nesterov (2018) for $p = 3$; 2) the accelerated tensor method Nesterov (2018) for $p = 3$; 3) the tensor method Nesterov (2018) for $p = 2$; 4) the accelerated tensor method Nesterov (2018) for $p = 2$. Again, the optimal function value is denoted by $f^*$. Interestingly, we observe that the non-accelerated method outperforms the accelerated method for the first $m$ iterations. Since the corresponding theorem from Nesterov (2018) applies only for $k \le m$, we do not study the behaviour of the methods for a larger number of iterations. Even in this simple setting it is still non-trivial how to implement tensor methods for such bad examples of functions.
Figure 4: Function value achieved by the iterates of the accelerated tensor method for the logistic regression problem on the Covertype dataset Dheeru and Karra Taniskidou (2017). Number of samples $d = 20000$, dimension $n = 55$.
In Figure 4, we consider the behaviour of the same set of methods as in Figure 3, but for the logistic regression problem defined in (23) on the Covertype dataset Dheeru and Karra Taniskidou (2017). Again, we notice that in both cases the non-accelerated version works better in our experiments.

First of all, we point out that tensor methods are, in general, non-trivial to implement, so obtaining better implementations is an interesting direction for future work. Secondly, we conjecture that the slow convergence we see in our experiments is due to the large value of $M_p$ that we use; by tuning the parameters one can obtain better convergence in practice.