Near-Optimal Hyperfast Second-Order Method for Convex Optimization and its Sliding
Dmitry Kamzolov$^{a}$ and Alexander Gasnikov$^{a,b,c}$. $^{a}$ Moscow Institute of Physics and Technology, Moscow, Russia; $^{b}$ National Research University Higher School of Economics, Moscow, Russia; $^{c}$ Institute for Information Transmission Problems RAS, Moscow, Russia. June 30, 2020
Abstract
In this paper, we present a new Hyperfast Second-Order Method with convergence rate $O(N^{-5})$ up to a logarithmic factor for convex functions with Lipschitz third derivative. The method is based on two ideas. The first comes from the superfast second-order scheme of Yu. Nesterov (CORE Discussion Paper 2020/07, 2020). It allows implementing a third-order scheme by solving the subproblem using only the second-order oracle; this method converges with rate $O(N^{-4})$. The second idea comes from the work of Kamzolov et al. (arXiv:2002.01004): the inexact near-optimal third-order method. In this work, we improve its convergence and merge it with the scheme of solving the subproblem using only the second-order oracle. As a result, we get convergence rate $O(N^{-5})$ up to a logarithmic factor. This convergence rate is near-optimal and the best known up to this moment. Further, we investigate the situation when the objective is a sum of two functions and improve the sliding framework from Kamzolov et al. (arXiv:2002.01004) for second-order methods.

In recent years, higher-order (tensor) methods for convex optimization problems have been actively developed. The primary impulse was the work of Yu. Nesterov [34] on the possibility of an implementable tensor method. He proposed a smart regularization of the Taylor approximation that makes the subproblem convex and hence implementable. Yu. Nesterov also proposed accelerated tensor methods [33, 34]; later A. Gasnikov et al. [4, 14, 15, 24] proposed a near-optimal tensor method via the Monteiro-Svaiter envelope [32] with line search and obtained a near-optimal convergence rate up to a logarithmic factor. Starting from 2018-2019, interest in this topic has been rising. There are many developments in tensor methods, such as tensor methods for Hölder-continuous higher-order derivatives [19, 42], proximal methods [6], tensor methods for minimizing the gradient norm of a convex function [11, 19], inexact tensor methods [18, 25, 35], and a near-optimal composition of tensor methods for a sum of two functions [25]. There are also results on local convergence and on convergence for strongly convex functions [7, 13, 14]. See [13] for more references on applications of tensor methods.

At the very beginning of 2020, Yurii Nesterov proposed a Superfast Second-Order Method [37] that converges with rate $O(N^{-4})$ for convex functions with Lipschitz third-order derivative. This method uses only second-order information at each iteration, but assumes additional smoothness via the Lipschitz third-order derivative. Here we should note that for first-order methods the worst-case example cannot be improved by additional smoothness, because it is a specific quadratic function that has all higher-order derivatives bounded [35]. But for second-order methods, one can see that the worst-case example does not have a Lipschitz third-order derivative. This means that under this additional assumption the classical lower bound $O(N^{-7/2})$ can be beaten, and Nesterov proposed such a method, converging with rate $O(N^{-4})$ up to a logarithmic factor. The main idea of this method is to run a third-order method with an inexact solution of the Taylor approximation subproblem, obtained by a linearly convergent method from [37] that works with inexact gradients. With inexact gradients, it becomes possible to replace the direct computation of the third derivative by an inexact model that uses only first-order information.
Note that for non-convex problems it was previously proved that additional smoothness may speed up algorithms [1, 3, 18, 38, 43]. In this paper, we propose a Hyperfast Second-Order Method for convex functions with Lipschitz third-order derivative with convergence rate $O(N^{-5})$ up to a logarithmic factor. For that, we first introduce the Inexact Near-optimal Accelerated Tensor Method, based on the methods from [4, 25], and prove its convergence. Next, we apply the Bregman-Distance Gradient Method from [18, 37] to solve the Taylor approximation subproblem up to the desired accuracy. This leads us to the Hyperfast Second-Order Method, and we prove its convergence rate. This method has a near-optimal convergence rate for convex functions with Lipschitz third-order derivative, the best known up to this moment.

We also propose a Hyperfast Second-Order Sliding for a sum of two convex functions with Lipschitz third-order derivatives. It is based on the ideas of [26, 27] and [25]. Sliding frameworks allow us to separate the oracle complexities of the different functions: we compute only the necessary number of derivatives for each function, as if they were treated separately and not as a sum. So we use the sliding for third-order methods from [25] and solve the inner subproblem by the Bregman-Distance Gradient Method from [37]. As a result, we get a method with separated oracle complexities and convergence rate $O(N^{-5})$ up to a logarithmic factor. This method has near-optimal oracle complexities for a sum of two convex functions with Lipschitz third-order derivatives, the best known at the moment.

Note that for first-order methods in the non-convex case it was shown earlier (see [5] and references therein) that additional smoothness assumptions lead to additional acceleration. In the convex case, as far as we know, the works of Yu. Nesterov [35, 37] are the first ones where such an idea was developed. However, there are some results [45] that allow using tensor acceleration for first-order schemes; this additional acceleration requires additional assumptions on smoothness, more restrictive than bounds on high-order derivatives.
In what follows, we work in a finite-dimensional linear vector space $E = \mathbb{R}^n$, equipped with the Euclidean norm $\|\cdot\| = \|\cdot\|_2$. We consider the following convex optimization problem:
$$\min_x f(x), \qquad (1)$$
where $f(x)$ is a convex function with Lipschitz $p$-th derivative, which means that
$$\|D^p f(x) - D^p f(y)\| \le L_p \|x-y\|. \qquad (2)$$
Then the Taylor approximation of the function $f(x)$ can be written as follows:
$$\Omega_p(f,x;y) = f(x) + \sum_{k=1}^{p} \frac{1}{k!}\, D^k f(x)[y-x]^k, \quad y \in \mathbb{R}^n.$$
By (2) and standard integration we get the next two inequalities:
$$|f(y) - \Omega_p(f,x;y)| \le \frac{L_p}{(p+1)!}\|y-x\|^{p+1}, \qquad \|\nabla f(y) - \nabla\Omega_p(f,x;y)\| \le \frac{L_p}{p!}\|y-x\|^{p}. \qquad (3)$$
Problem (1) can be solved by tensor methods [34] or their accelerated versions [4, 15, 24, 33]. These methods have the following basic step:
$$T_{H_p}(x) = \operatorname*{argmin}_{y}\left\{\tilde\Omega_{p,H_p}(f,x;y)\right\}, \quad \text{where}\quad \tilde\Omega_{p,H_p}(f,x;y) = \Omega_p(f,x;y) + \frac{H_p}{p!}\|y-x\|^{p+1}. \qquad (4)$$
For $H_p \ge L_p$ this subproblem is convex and hence implementable. But what if we cannot solve this subproblem exactly? In the paper [37] the Inexact $p$th-Order Basic Tensor Method (BTMI$_p$) and the Inexact $p$th-Order Accelerated Tensor Method (ATMI$_p$) were introduced. They have convergence rates $O(k^{-p})$ and $O(k^{-(p+1)})$, respectively.

In this section, we introduce the Inexact $p$th-Order Near-optimal Accelerated Tensor Method (NATMI$_p$) with the improved convergence rate $\tilde O\!\left(k^{-(3p+1)/2}\right)$, where $\tilde O(\cdot)$ means up to a logarithmic factor. It is an improvement of the Accelerated Taylor Descent from [4] and a generalization of the Inexact Accelerated Taylor Descent from [25]. Firstly, we introduce the definition of an inexact subproblem solution. Any point from the set
$$N^{\gamma}_{p,H_p}(x) = \left\{ T \in \mathbb{R}^n : \|\nabla\tilde\Omega_{p,H_p}(f,x;T)\| \le \gamma\|\nabla f(T)\| \right\} \qquad (5)$$
is an inexact subproblem solution, where $\gamma \in [0,1]$ is an accuracy parameter; $N^{0}_{p,H_p}(x)$ is the exact solution of the subproblem.
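To make (4) and (5) concrete for the case $p=3$ used later in the paper, the following sketch evaluates the gradient of the regularized Taylor model and the acceptance test defining $N^{\gamma}_{3,H_3}(x)$. It is only an illustration under the assumption that the derivatives are available as dense arrays; the helper names model_grad3 and in_N_gamma are ours and do not come from any referenced implementation.

\begin{verbatim}
import numpy as np

def model_grad3(y, x, grad_x, hess_x, third_x, H3):
    """Gradient in y of the regularized Taylor model (4) for p = 3:
    Omega_3(f, x; y) + (H3 / 3!) * ||y - x||^4.
    grad_x = grad f(x), hess_x = D^2 f(x), third_x = D^3 f(x) (n x n x n array)."""
    h = y - x
    r2 = h @ h
    return (grad_x
            + hess_x @ h
            + 0.5 * np.einsum('ijk,j,k->i', third_x, h, h)  # (1/2) D^3 f(x)[h]^2
            + (2.0 * H3 / 3.0) * r2 * h)                    # gradient of (H3/6)||h||^4

def in_N_gamma(y, x, grad_f_y, grad_x, hess_x, third_x, H3, gamma):
    """Membership test for the inexact solution set N^gamma_{3, H3}(x) from (5)."""
    lhs = np.linalg.norm(model_grad3(y, x, grad_x, hess_x, third_x, H3))
    return lhs <= gamma * np.linalg.norm(grad_f_y)
\end{verbatim}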
Next we propose Algorithm 1.

Algorithm 1: Inexact $p$th-Order Near-optimal Accelerated Tensor Method (NATMI)
Input: convex function $f:\mathbb{R}^n\to\mathbb{R}$ such that $\nabla^p f$ is $L_p$-Lipschitz, $H_p=\xi L_p$, where $\xi$ is a scaling parameter, and $\gamma$, the desired accuracy of the subproblem solution.
Set $A_0=0$, $x_0=y_0$.
for $k=0$ to $K-1$ do
    Compute a pair $\lambda_{k+1}>0$ and $y_{k+1}\in\mathbb{R}^n$ such that
    $$\frac{1}{2} \le \frac{\lambda_{k+1}H_p}{(p-1)!}\,\|y_{k+1}-\tilde{x}_k\|^{p-1} \le \frac{p}{p+1}, \quad \text{where } y_{k+1}\in N^{\gamma}_{p,H_p}(\tilde{x}_k), \qquad (6)$$
    and
    $$a_{k+1}=\frac{\lambda_{k+1}+\sqrt{\lambda_{k+1}^2+4\lambda_{k+1}A_k}}{2},\qquad A_{k+1}=A_k+a_{k+1},\qquad \tilde{x}_k=\frac{A_k}{A_{k+1}}y_k+\frac{a_{k+1}}{A_{k+1}}x_k.$$
    Update $x_{k+1}:=x_k-a_{k+1}\nabla f(y_{k+1})$
return $y_K$
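The following sketch shows how one outer iteration of Algorithm 1 can be organized: the pair $(\lambda_{k+1},y_{k+1})$ satisfying (6) is found by a simple search over $\lambda$, and then the coefficients $a_{k+1}$, $A_{k+1}$, $\tilde{x}_k$ and the point $x_{k+1}$ are updated. This is our own schematic illustration, not the authors' implementation: the bracketing bounds and the oracle solve_subproblem (which must return a point of $N^{\gamma}_{p,H_p}(\tilde{x}_k)$) are assumptions.

\begin{verbatim}
import math
import numpy as np

def natmi_step(x_k, y_k, A_k, grad_f, solve_subproblem, H_p, p,
               lam_lo=1e-12, lam_hi=1e12, max_bisect=100):
    """One (schematic) iteration of Algorithm 1 (NATMI).

    solve_subproblem(x_tilde) must return an inexact minimizer
    y in N^gamma_{p, H_p}(x_tilde) of the regularized Taylor model at x_tilde.
    lambda_{k+1} is searched so that condition (6) holds:
        1/2 <= lam * H_p / (p-1)! * ||y - x_tilde||^{p-1} <= p/(p+1).
    """
    fact = math.factorial(p - 1)
    for _ in range(max_bisect):
        lam = math.sqrt(lam_lo * lam_hi)              # geometric bisection on lambda
        a = (lam + math.sqrt(lam**2 + 4.0 * lam * A_k)) / 2.0
        A_next = A_k + a
        x_tilde = (A_k / A_next) * y_k + (a / A_next) * x_k
        y_next = solve_subproblem(x_tilde)
        t = lam * H_p / fact * np.linalg.norm(y_next - x_tilde) ** (p - 1)
        if t < 0.5:
            lam_lo = lam                              # step too short: increase lambda
        elif t > p / (p + 1.0):
            lam_hi = lam                              # step too long: decrease lambda
        else:
            x_next = x_k - a * grad_f(y_next)         # dual (estimate-sequence) update
            return x_next, y_next, A_next
    raise RuntimeError("condition (6) could not be bracketed")
\end{verbatim}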
To get the convergence rate of Algorithm 1 we prove two auxiliary lemmas. The first lemma gives an intermediate inequality that connects the inexactness of the subproblem solution with the analysis of the method.

Lemma 1. If $y_{k+1}\in N^{\gamma}_{p,H_p}(\tilde{x}_k)$, then
$$\|\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1})\| \le \frac{\gamma}{1-\gamma}\cdot\frac{(p+1)H_p+L_p}{p!}\,\|y_{k+1}-\tilde{x}_k\|^{p}. \qquad (7)$$

Proof. From the triangle inequality we get
$$\|\nabla f(y_{k+1})\| \le \|\nabla f(y_{k+1})-\nabla\Omega_p(f,\tilde{x}_k;y_{k+1})\| + \|\nabla\Omega_p(f,\tilde{x}_k;y_{k+1})-\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1})\| + \|\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1})\|$$
$$\overset{(3),(4),(5)}{\le} \frac{L_p}{p!}\|y_{k+1}-\tilde{x}_k\|^{p} + \frac{(p+1)H_p}{p!}\|y_{k+1}-\tilde{x}_k\|^{p} + \gamma\|\nabla f(y_{k+1})\|.$$
Hence,
$$(1-\gamma)\|\nabla f(y_{k+1})\| \le \frac{(p+1)H_p+L_p}{p!}\|y_{k+1}-\tilde{x}_k\|^{p},$$
and finally from (5) we get
$$\|\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1})\| \le \gamma\|\nabla f(y_{k+1})\| \le \frac{\gamma}{1-\gamma}\cdot\frac{(p+1)H_p+L_p}{p!}\|y_{k+1}-\tilde{x}_k\|^{p}. \qquad \square$$

The next lemma plays a crucial role in the proof of the convergence of Algorithm 1. It is a generalization, to the inexact subproblem, of Lemma 3.1 from [4].
Lemma 2. If $y_{k+1}\in N^{\gamma}_{p,H_p}(\tilde{x}_k)$, $H_p=\xi L_p$ with $\frac{1}{2}\ge\gamma+\frac{1}{2\xi(p+1)}$, and
$$\frac{1}{2}\le\frac{\lambda_{k+1}H_p}{(p-1)!}\,\|y_{k+1}-\tilde{x}_k\|^{p-1}\le\frac{p}{p+1}, \qquad (8)$$
then
$$\|y_{k+1}-(\tilde{x}_k-\lambda_{k+1}\nabla f(y_{k+1}))\|\le\sigma\,\|y_{k+1}-\tilde{x}_k\| \quad\text{with}\quad \sigma=\frac{p\xi+1-\xi+2\gamma\xi}{(1-\gamma)\,2p\xi}\le 1. \qquad (9)$$

Proof. Note that, by definition,
$$\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1}) = \nabla\Omega_p(f,\tilde{x}_k;y_{k+1}) + \frac{H_p(p+1)}{p!}\|y_{k+1}-\tilde{x}_k\|^{p-1}(y_{k+1}-\tilde{x}_k). \qquad (10)$$
Hence,
$$y_{k+1}-\tilde{x}_k = \frac{p!}{H_p(p+1)\|y_{k+1}-\tilde{x}_k\|^{p-1}}\left(\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1})-\nabla\Omega_p(f,\tilde{x}_k;y_{k+1})\right). \qquad (11)$$
Then, by the triangle inequality, we get
$$\|y_{k+1}-(\tilde{x}_k-\lambda_{k+1}\nabla f(y_{k+1}))\| = \big\|\lambda_{k+1}\big(\nabla f(y_{k+1})-\nabla\Omega_p(f,\tilde{x}_k;y_{k+1})\big) + \lambda_{k+1}\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1})$$
$$\qquad\qquad + \big(y_{k+1}-\tilde{x}_k+\lambda_{k+1}(\nabla\Omega_p(f,\tilde{x}_k;y_{k+1})-\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1}))\big)\big\|$$
$$\overset{(3),(11)}{\le} \frac{\lambda_{k+1}L_p}{p!}\|y_{k+1}-\tilde{x}_k\|^{p} + \lambda_{k+1}\|\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1})\| + \left|\lambda_{k+1}-\frac{p!}{H_p(p+1)\|y_{k+1}-\tilde{x}_k\|^{p-1}}\right|\,\|\nabla\tilde\Omega_{p,H_p}(f,\tilde{x}_k;y_{k+1})-\nabla\Omega_p(f,\tilde{x}_k;y_{k+1})\|$$
$$\overset{(7),(10)}{\le} \|y_{k+1}-\tilde{x}_k\|\left(\frac{\lambda_{k+1}L_p}{p!}\|y_{k+1}-\tilde{x}_k\|^{p-1} + \frac{\lambda_{k+1}\gamma}{1-\gamma}\cdot\frac{(p+1)H_p+L_p}{p!}\|y_{k+1}-\tilde{x}_k\|^{p-1}\right) + \left|\frac{\lambda_{k+1}(p+1)H_p}{p!}\|y_{k+1}-\tilde{x}_k\|^{p-1}-1\right|\,\|y_{k+1}-\tilde{x}_k\|$$
$$\overset{(8)}{\le} \|y_{k+1}-\tilde{x}_k\|\left(1+\frac{\lambda_{k+1}}{p!}\left(L_p-(p+1)H_p+\frac{\gamma}{1-\gamma}\big((p+1)H_p+L_p\big)\right)\|y_{k+1}-\tilde{x}_k\|^{p-1}\right).$$
Since the expression in the inner parentheses is non-positive under our assumption on $\gamma$ and $\xi$, the lower bound in (8) and simple calculations give
$$\sigma = 1+\frac{1}{2pH_p}\left(L_p-(p+1)H_p+\frac{\gamma}{1-\gamma}\big((p+1)H_p+L_p\big)\right) = 1+\frac{1}{2p\xi}\left(1-(p+1)\xi+\frac{\gamma}{1-\gamma}\big((p+1)\xi+1\big)\right)$$
$$= 1+\frac{1-(p+1)\xi+2\gamma(p+1)\xi}{(1-\gamma)\,2p\xi} = \frac{p\xi+1-\xi+2\gamma\xi}{(1-\gamma)\,2p\xi}.$$
It remains to guarantee $\sigma\le 1$. For that we need
$$(1-\gamma)\,2p\xi \ge p\xi+1-\xi+2\gamma\xi \iff (p+1)\xi-1 \ge 2\gamma\xi(p+1) \iff \frac{1}{2}-\frac{1}{2\xi(p+1)}\ge\gamma. \qquad \square$$

We have proved the main lemma for the convergence rate theorem; the other parts of the proof are the same as in [4]. As a result, we get the next theorem.
Theorem 1.
Let $f$ be a convex function whose $p$th derivative is $L_p$-Lipschitz and let $x_*$ denote a minimizer of $f$. Then Algorithm 1 converges with rate
$$f(y_k)-f(x_*) \le \tilde O\!\left(\frac{H_p R^{p+1}}{k^{(3p+1)/2}}\right),$$
where $R=\|x_0-x_*\|$ is the distance from the starting point to the minimizer.

In the recent work [37] it was mentioned that for the convex optimization problem (1) with a first-order oracle (returning the gradient) the well-known complexity bound $\left(L_1R^2/\varepsilon\right)^{1/2}$ cannot be beaten even if we assume that all $L_p<\infty$. This is because of the structure of the worst-case function
$$f_p(x) = |x_1|^{p+1}+|x_2-x_1|^{p+1}+\dots+|x_n-x_{n-1}|^{p+1},$$
with $p=1$ for first-order methods. It is obvious that $f_1(x)$ satisfies the condition $L_p<\infty$ for all natural $p$, so additional smoothness assumptions do not allow additional acceleration. The same thing takes place, for example, for $p=3$: in this case we also have $L_p<\infty$ for all natural $p$. But what about $p=2$? In this case $L_3=\infty$. It means that $f_2(x)$ cannot be the proper worst-case function for a second-order method under additional smoothness assumptions. So the following question arises: is it possible to improve the bound $\left(L_2R^3/\varepsilon\right)^{2/7}$? At the very beginning of 2020, Yu. Nesterov gave a positive answer. For this purpose, he proposed to use an accelerated third-order method that requires $\tilde O\!\left(\left(L_3R^4/\varepsilon\right)^{1/4}\right)$ iterations while using only a second-order oracle [34]. So all this means that if $L_3<\infty$, then there are methods that can be much faster than $\tilde O\!\left(\left(L_2R^3/\varepsilon\right)^{2/7}\right)$.

In this section, we improve the convergence speed and reach the near-optimal rate up to a logarithmic factor. We consider problem (1) with $p=3$, hence $L_3<\infty$. In the previous section we proved that Algorithm 1 converges. Now we fix the parameters of this method:
$$p=3,\qquad \gamma=\frac{1}{2p}=\frac{1}{6},\qquad \xi=\frac{2p}{p+1}=\frac{3}{2}.$$
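For these parameter values the contraction factor in (9) can be evaluated directly (our own arithmetic):
$$\sigma=\frac{p\xi+1-\xi+2\gamma\xi}{(1-\gamma)\,2p\xi}=\frac{3\cdot\frac{3}{2}+1-\frac{3}{2}+2\cdot\frac{1}{6}\cdot\frac{3}{2}}{\left(1-\frac{1}{6}\right)\cdot 2\cdot 3\cdot\frac{3}{2}}=\frac{4.5+1-1.5+0.5}{7.5}=0.6,$$
and the admissibility condition of Lemma 2 holds, since $\gamma+\frac{1}{2\xi(p+1)}=\frac{1}{6}+\frac{1}{12}=\frac{1}{4}\le\frac{1}{2}$.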
By (9) we get $\sigma=0.6$, which is rather close to the exact-case value $\sigma=0.5$. For such parameters we get the following convergence speed of Algorithm 1 to reach accuracy $\varepsilon$:
$$N_{\mathrm{out}} = \tilde O\!\left(\left(\frac{L_3R^4}{\varepsilon}\right)^{1/5}\right).$$
Note that at every step of Algorithm 1 we need to solve the following subproblem with accuracy $\gamma=1/6$:
$$\min_y\left\{\langle\nabla f(x_i),y-x_i\rangle+\frac{1}{2}\nabla^2 f(x_i)[y-x_i]^2+\frac{1}{6}D^3 f(x_i)[y-x_i]^3+\frac{L_3}{4}\|y-x_i\|^4\right\}. \qquad (12)$$
In [18] it was proved that problem (12) can be solved by the Bregman-Distance Gradient Method (BDGM) with linear convergence speed. According to [37], BDGM can be improved to work with inexact gradients of the objective. This makes it possible to approximate $D^3 f(x)$ by gradients and to avoid computing $D^3 f(x)$ at each step. As a result, in [37] it was proved that subproblem (12) can be solved up to accuracy $\gamma=1/6$ with one computation of the Hessian and $O\!\left(\log\frac{\|\nabla f(x_i)\|+\|\nabla^2 f(x_i)\|}{\varepsilon}\right)$ computations of the gradient. We use BDGM to solve the subproblem from Algorithm 1 and, as a result, we get the following Hyperfast Second-Order Method by merging NATMI and BDGM.
Algorithm 2: Hyperfast Second-Order Method
Input: convex function $f:\mathbb{R}^n\to\mathbb{R}$ with $L_3$-Lipschitz third-order derivative.
Set $A_0=0$, $x_0=y_0$.
for $k=0$ to $K-1$ do
    Compute a pair $\lambda_{k+1}>0$ and $y_{k+1}\in\mathbb{R}^n$ such that
    $$\frac{1}{2}\le\frac{3\lambda_{k+1}L_3}{4}\,\|y_{k+1}-\tilde{x}_k\|^{2}\le\frac{3}{4},\quad\text{where } y_{k+1}\in N^{1/6}_{3,\,3L_3/2}(\tilde{x}_k) \text{ is computed by Algorithm 3,}$$
    and
    $$a_{k+1}=\frac{\lambda_{k+1}+\sqrt{\lambda_{k+1}^2+4\lambda_{k+1}A_k}}{2},\qquad A_{k+1}=A_k+a_{k+1},\qquad \tilde{x}_k=\frac{A_k}{A_{k+1}}y_k+\frac{a_{k+1}}{A_{k+1}}x_k.$$
    Update $x_{k+1}:=x_k-a_{k+1}\nabla f(y_{k+1})$
return $y_K$

Algorithm 3: Bregman-Distance Gradient Method (BDGM)
Set $z_0=\tilde{x}_k$ and choose the finite-difference parameter $\tau$ proportional to $\delta/\|\nabla f(\tilde{x}_k)\|$ as in [37].
Set the objective function
$$\varphi_k(z)=\langle\nabla f(\tilde{x}_k),z-\tilde{x}_k\rangle+\frac{1}{2}\nabla^2 f(\tilde{x}_k)[z-\tilde{x}_k]^2+\frac{1}{6}D^3 f(\tilde{x}_k)[z-\tilde{x}_k]^3+\frac{L_3}{4}\|z-\tilde{x}_k\|^4.$$
Set the feasible set
$$S_k=\left\{z:\|z-\tilde{x}_k\|\le\left(\frac{2\|\nabla f(\tilde{x}_k)\|}{L_3}\right)^{1/3}\right\}.$$
Set the scaling function
$$\rho_k(z)=\frac{1}{2}\left\langle\nabla^2 f(\tilde{x}_k)(z-\tilde{x}_k),z-\tilde{x}_k\right\rangle+\frac{L_3}{4}\|z-\tilde{x}_k\|^4.$$
for $i\ge 0$ do
    Compute the approximate gradient $g_{\varphi_k,\tau}(z_i)$ by (13).
    If $\|g_{\varphi_k,\tau}(z_i)\|\le\frac{1}{6}\|\nabla f(z_i)\|-\delta$, then STOP and return $z_i$; else
    $$z_{i+1}=\operatorname*{argmin}_{z\in S_k}\left\{\langle g_{\varphi_k,\tau}(z_i),z-z_i\rangle+2\left(1+\tfrac{1}{\sqrt{2}}\right)\beta_{\rho_k}(z_i,z)\right\}.$$
return $z_i$

Here $\beta_{\rho_k}(z_i,z)$ is the Bregman distance generated by $\rho_k(z)$:
$$\beta_{\rho_k}(z_i,z)=\rho_k(z)-\rho_k(z_i)-\langle\nabla\rho_k(z_i),z-z_i\rangle.$$
By $g_{\varphi_k,\tau}(z)$ we denote an inexact gradient of the subproblem (12):
$$g_{\varphi_k,\tau}(z)=\nabla f(\tilde{x}_k)+\nabla^2 f(\tilde{x}_k)[z-\tilde{x}_k]+\frac{1}{2}g^{\tau}_{\tilde{x}_k}(z)+L_3\|z-\tilde{x}_k\|^2(z-\tilde{x}_k), \qquad (13)$$
where $g^{\tau}_{\tilde{x}_k}(z)$ is an inexact approximation of $D^3 f(\tilde{x}_k)[z-\tilde{x}_k]^2$:
$$g^{\tau}_{\tilde{x}_k}(z)=\frac{1}{\tau^2}\left(\nabla f(\tilde{x}_k+\tau(z-\tilde{x}_k))+\nabla f(\tilde{x}_k-\tau(z-\tilde{x}_k))-2\nabla f(\tilde{x}_k)\right).$$
In the paper [37] it is proved that one can choose
$$\delta=O\!\left(\frac{\varepsilon}{\|\tilde{x}_k-x_*\|+\left(\|\nabla f(\tilde{x}_k)\|/L_3\right)^{1/3}}\right);$$
then the total number of inner iterations equals
$$T_k(\delta)=O\!\left(\ln\frac{G+H}{\varepsilon}\right),$$
where $G$ and $H$ are uniform upper bounds for the norms of the gradients and Hessians computed at the points generated by the main algorithm.
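The only non-standard ingredient of Algorithm 3 is the approximation (13) of the third-derivative term by gradients. The sketch below (our own illustration; the test function and the helper names are not taken from the paper) checks the central-difference formula on a function whose third derivative is known in closed form, and assembles the inexact model gradient:

\begin{verbatim}
import numpy as np

def fd_third_term(grad_f, x, h, tau):
    """Central-difference approximation of D^3 f(x)[h, h] from three
    gradient calls: (grad f(x + tau h) + grad f(x - tau h) - 2 grad f(x)) / tau^2."""
    return (grad_f(x + tau * h) + grad_f(x - tau * h) - 2.0 * grad_f(x)) / tau**2

def g_phi_tau(grad_f, grad_x, hess_x, x, z, L3, tau):
    """Inexact gradient (13) of the subproblem objective at z."""
    h = z - x
    return grad_x + hess_x @ h + 0.5 * fd_third_term(grad_f, x, h, tau) \
           + L3 * (h @ h) * h

# Sanity check on f(x) = sum(x_i^4): D^3 f(x)[h, h] = 24 * x * h^2 componentwise.
grad_f = lambda x: 4.0 * x**3
x = np.array([1.0, -0.5, 2.0])
h = np.array([0.3, 1.0, -0.2])
approx = fd_third_term(grad_f, x, h, tau=1e-3)
exact = 24.0 * x * h**2
print(np.max(np.abs(approx - exact)))  # tiny: for this cubic gradient the
                                       # difference is exact up to rounding
\end{verbatim}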
Finally, we get the next theorem.

Theorem 2. Let $f$ be a convex function whose third derivative is $L_3$-Lipschitz and let $x_*$ denote a minimizer of $f$. Then, to reach accuracy $\varepsilon$, Algorithm 2 with Algorithm 3 for solving the subproblem computes
$$N_1=\tilde O\!\left(\left(\frac{L_3R^4}{\varepsilon}\right)^{1/5}\right)$$
Hessians and
$$N_2=\tilde O\!\left(\left(\frac{L_3R^4}{\varepsilon}\right)^{1/5}\log\frac{G+H}{\varepsilon}\right)$$
gradients, where $G$ and $H$ are uniform upper bounds for the norms of the gradients and Hessians computed at the points generated by the main algorithm.

One can generalize this result to uniformly (strongly) convex functions by using the inverse restart-regularization trick from [16]. So, the main observation of this section is as follows: if $L_3<\infty$, then we can use this superfast second-order algorithm instead of the optimal one considered above, to make our sliding faster (in the convex and uniformly convex cases).

In this section, we consider the problem
$$\min_x f(x)=g(x)+h(x),$$
where $g(x)$ and $h(x)$ are convex functions such that $\nabla^3 g$ is $L_{3,g}$-Lipschitz and $\nabla^3 h$ is $L_{3,h}$-Lipschitz, with $L_{3,g}\le L_{3,h}$. In [25] an algorithmic framework for the composition of tensor methods, also called sliding, was proposed. This framework separates the oracle complexities; hence, we get a much smaller number of oracle calls for the function $g(x)$. In this section we combine the sliding technique and the Hyperfast Second-Order Method (Algorithm 2) to get the Hyperfast Second-Order Sliding. Firstly, we need a NATMI version with a smooth composite part as an outer basic method (here we use the terminology introduced in [37]).

Algorithm 4: Third-Order NATMI with a smooth composite part
Input: convex functions $h(x)$ and $g(x)$ such that $\nabla^3 h$ is $L_{3,h}$-Lipschitz and $\nabla^3 g$ is $L_{3,g}$-Lipschitz.
Set $A_0=0$, $x_0=y_0$.
for $k=0$ to $K-1$ do
    Compute a pair $\lambda_{k+1}>0$ and $y_{k+1}\in\mathbb{R}^n$ such that
    $$\frac{1}{2}\le\frac{3\lambda_{k+1}L_{3,g}}{4}\,\|y_{k+1}-\tilde{x}_k\|^{2}\le\frac{3}{4},$$
    where
    $$y_{k+1}\in\left\{T\in E:\ \|\nabla\tilde\Omega_{3,\,3L_{3,g}/2}(g,\tilde{x}_k;T)+\nabla h(T)\|\le\frac{1}{6}\|\nabla f(T)\|\right\}$$
    is solved by Algorithm 3, and
    $$a_{k+1}=\frac{\lambda_{k+1}+\sqrt{\lambda_{k+1}^2+4\lambda_{k+1}A_k}}{2},\qquad A_{k+1}=A_k+a_{k+1},\qquad \tilde{x}_k=\frac{A_k}{A_{k+1}}y_k+\frac{a_{k+1}}{A_{k+1}}x_k.$$
    Update $x_{k+1}:=x_k-a_{k+1}\nabla g(y_{k+1})-a_{k+1}\nabla h(y_{k+1})$
return $y_K$

The convergence of this method can be proved by combining the proof of NATMI with that of CATD from [25]; we get the same convergence rate. Now we propose the Hyperfast Second-Order Sliding Method. It contains three levels of methods: the outer Algorithm 4 for the function $g(x)$ with $h(x)$ as a composite part; in the middle, Algorithm 4 solves the outer subproblem with the model of $g(x)$ and $h(x)$; and the deepest level, Algorithm 3, solves the inner subproblem for the sum of the two models of $g(x)$ and $h(x)$.
Algorithm 5: Hyperfast Second-Order Sliding
Input: convex functions $g(x)$ and $h(x)$ such that $\nabla^3 h$ is $L_{3,h}$-Lipschitz and $\nabla^3 g$ is $L_{3,g}$-Lipschitz.
Set $z_0=y_0=x_0$.
for $k=0$ to $K-1$ do
    Run Algorithm 4 for the problem $g(x)+h(x)$, where $h(x)$ is the composite part.
    for $m=0$ to $M-1$ do
        Run Algorithm 4 up to the desired accuracy for the subproblem
        $$\min_y\left(\tilde\Omega_{3,\,3L_{3,g}/2}(g,\tilde{x}_k;y)+h(y)\right).$$
return $y_K$

The total complexity is the product of the complexities of each submethod in Algorithm 5. Hence, from [25] we get the following convergence speed for the outer and the middle method. To reach $f(x_N)-f(x_*)\le\varepsilon$, we need $N_g$ computations of the derivatives of $g(x)$ and $N_h$ computations of the derivatives of $h(x)$, where
$$N_g=\tilde O\!\left(\left(\frac{L_{3,g}R^4}{\varepsilon}\right)^{1/5}\right), \qquad (14)$$
$$N_h=\tilde O\!\left(\left(\frac{L_{3,h}R^4}{\varepsilon}\right)^{1/5}\right). \qquad (15)$$
Also, from the previous section we know the convergence rate of the inner Algorithm 3, which equals $O\!\left(\ln\frac{G+H}{\varepsilon}\right)$.
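The three-level structure of Algorithm 5 can be summarized by the following schematic sketch (our own pseudocode-style summary; hyperfast_sliding, g_oracle, h_oracle and solve_bdgm are hypothetical names, and the momentum and line-search details of Algorithms 3 and 4 are omitted). Its only purpose is to show where the oracles of $g$ and $h$ are called, which is what yields the separated complexities (14) and (15):

\begin{verbatim}
def hyperfast_sliding(g_oracle, h_oracle, x0, K, M, solve_bdgm):
    """Schematic three-level structure of Algorithm 5.

    g_oracle / h_oracle: return the derivatives (gradient, Hessian, third
    derivative) of g and h at a point.
    Outer level: a step for f = g + h with h kept as a composite term, so the
      oracle of g is queried once per outer iteration (N_g calls in total).
    Middle level: the outer subproblem min_y Omega~_3(g, x; y) + h(y) is solved
      with the (cheap, polynomial) model of g as the composite part, so the
      oracle of h is queried here (N_h calls in total).
    Inner level: each middle-level Taylor subproblem is a sum of two regularized
      models and is handled by BDGM (Algorithm 3) using gradients and one Hessian.
    """
    y = x0
    for k in range(K):                          # outer level (Algorithm 4 on g)
        g_model = g_oracle(y)                   # one call to the oracle of g
        z = y
        for m in range(M):                      # middle level (Algorithm 4 on h)
            h_model = h_oracle(z)               # one call to the oracle of h
            z = solve_bdgm(g_model, h_model, z) # inner level (Algorithm 3, BDGM)
        y = z
    return y
\end{verbatim}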
This leads us to the following theorem.

Theorem 3. Assume $g(x)$ and $h(x)$ are convex functions with Lipschitz third derivatives and $L_{3,g}<L_{3,h}$. Then Algorithm 5 converges to $f(x_N)-f(x_*)\le\varepsilon$ with $N_g$ from (14) computations of the Hessian of $g(x)$, $N_h$ from (15) computations of the Hessian of $h(x)$, $O\!\left(N_g\ln\frac{G+H}{\varepsilon}\right)$ computations of the gradient of $g(x)$ and $O\!\left(N_h\ln\frac{G+H}{\varepsilon}\right)$ computations of the gradient of $h(x)$, where $G$ and $H$ are uniform upper bounds for the norms of the gradients and Hessians of $f(x)$ computed at the points generated by the main algorithm.

One can generalize this result to a sum of $n$ functions sorted by $L_{3,f_i}$ and applied consecutively; it is also possible to separate them into batches.

Many modern machine learning applications reduce in the end to the following optimization problem:
$$\min_{x\in\mathbb{R}^n}\ \frac{1}{m}\sum_{k=1}^{m}f_k(x)+g(x).$$
Sometimes these problems are convex and have uniformly bounded higher-order derivatives. For example, in the click prediction model we have to solve the logistic regression problem $f_k(x)=\log\left(1+\exp(-y_k\langle a_k,x\rangle)\right)$ with a (quadratic) regularizer $g(x)$ [40] (see, e.g., http://towardsdatascience.com/mobile-ads-click-through-rate-ctr-prediction). So the most appropriate conditions to apply the developed second-order scheme are the following. First, the complexity of calculating the Hessian, $O(m\cdot n\cdot T_{\nabla f_k})$, is of the same order as the complexity of inverting the Hessian, $O(n^{2.5})$. Here $T_{\nabla f_k}$ is the complexity of one $\nabla f_k(x)$ calculation; this formula is sometimes just a rough upper bound. For example, if we consider logistic regression and the matrix $A=[a_1,\dots,a_m]$ is sparse, we can improve the bound to $O(ms^2)$, where $s$ is an upper bound on the number of nonzero elements in $\{a_k\}_{k=1}^m$; for the click prediction model $s\ll n$. Second, the number of terms $m$ is not too big, so that the first-order variance-reduction schemes [28], with complexity $O\!\left(\left(m+\sqrt{mL_1R^2/\varepsilon}\right)T_{\nabla f_k}\right)$, or SGD [8], with complexity $O\!\left(\left(L_0^2R^2/\varepsilon^2\right)T_{\nabla f_k}\right)$, do not dominate the hyperfast second-order scheme. In particular, for the click prediction model, if $m=\Omega\!\left(n^{2.5}/s^2\right)$, the complexity of the proposed hyperfast scheme is $\tilde O\!\left(\left(ms^2+n^{2.5}\right)\varepsilon^{-1/5}\right)=\tilde O\!\left(ms^2\varepsilon^{-1/5}\right)$, which can be better than the complexity of the optimal variance-reduction scheme $\tilde O\!\left(ms+s\sqrt{m/\varepsilon}\right)$ and of SGD $O\!\left(s\varepsilon^{-2}\right)$, depending on the relation between $m$, $s$ and $\varepsilon$ (here $\varepsilon$ is a "relative" accuracy). For example, if $s=O(1)$ and $\varepsilon=O(m^{-5/3})$, the hyperfast scheme is the best one. Unfortunately, this requirement on the accuracy $\varepsilon$ is not very practical. But it is important to note that if $\varepsilon=O(m^{-5/9})$, the hyperfast scheme is still better than SGD. So if we allow parallel calculations, the variance-reduction scheme fails and we should compare our approach only with SGD; in this case we have rather reasonable requirements on the accuracy. Note that all these formulas can be rewritten in the strongly convex case. Note also that in the non-convex case we hope that the proposed approach can sometimes also work well in practice, due to the theoretical results concerning the auxiliary problem for tensor methods in the non-convex case [18] and the optimistic practical results concerning applications of the Monteiro-Svaiter accelerated proximal envelope to non-convex problems [22].
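To make the oracle costs in the click-prediction example concrete, here is a minimal sketch (our own illustration, not from the paper) of the batched gradient and Hessian of the regularized logistic loss; for an $s$-sparse data matrix the two matrix products cost $O(ms)$ and $O(ms^2)$ arithmetic operations, and inverting the resulting $n\times n$ Hessian costs about $O(n^3)$ by Gaussian elimination (cheaper bounds are discussed below):

\begin{verbatim}
import numpy as np

def logreg_oracle(A, y, x, lam):
    """Gradient and Hessian of
    (1/m) * sum_k log(1 + exp(-y_k <a_k, x>)) + (lam/2) ||x||^2.
    A: (m, n) data matrix, y: (m,) labels in {-1, +1}."""
    m, n = A.shape
    t = y * (A @ x)                      # margins y_k <a_k, x>,  cost O(m n)
    sig = 1.0 / (1.0 + np.exp(t))        # sigma(-t_k)
    grad = -(A.T @ (y * sig)) / m + lam * x
    D = sig * (1.0 - sig)                # second derivative of the logistic loss
    hess = (A.T * D) @ A / m + lam * np.eye(n)   # A^T diag(D) A, cost O(m n^2)
    return grad, hess
\end{verbatim}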
Several remarks are in order here. The standard Gaussian elimination with complexity $O(n^3)$ a.o. is not optimal: the best known theoretical method requires $O(n^{2.37})$ a.o. [29], but it is not a practical one. So above we use the complexity corresponding to the best theoretical complexity among practical methods; the bound $O(n^{2.5})$ a.o. seems to be a folklore result. Some specialists consider that the best practical methods have complexity close to that of the Gaussian method, $O(n^3)$ a.o.; for our purposes this is even better if we want to choose $m$ bigger. Note also that the complexity of the auxiliary problem is determined not only by the complexity of the Hessian inversion: an additional logarithmic factor appears [34], but we skip it for simplicity.

Without loss of generality, in machine learning applications one may always consider $m$ to be not larger than $\tilde O(\varepsilon^{-2})$ in the convex (but not strongly convex) case [12, 41]; this is possible due to proper regularization. Note also that this remark and the sensitivity of tensor methods to gradient estimation [17, 43, 44] suggest that if $m$ is bigger than $\tilde O(\varepsilon^{-2})$, one can randomly select a subsum with $\tilde O(\varepsilon^{-2})$ terms and solve only this problem instead of using batched gradients, Hessians, etc. As far as we know, at the moment there are no high-order variance-reduction schemes. The quadratic regularizer does not play any role here until the strongly convex case is considered. However, in the distributed context, by using statistically preconditioned algorithms one may significantly reduce $m$ [21], so the estimate becomes a practical one. For SGD we can parallelize calculations on $O\!\left(\frac{\sigma^2R^2/\varepsilon^2}{\sqrt{L_1R^2/\varepsilon}}\right)$ nodes [9, 46]. But if we have the nonsmooth case ($L_0<\infty$ but $L_1=\infty$) or $n$ is small enough, the best strategy known to us is to use the ellipsoid method with a batched gradient: the number of iterations is $\tilde O(n^2)$ (it seems that this number can be improved to $\tilde O(n)$ for more advanced schemes [30]) and at each iteration one should calculate a batch of size $\tilde O(\sigma^2R^2/\varepsilon^2)$.

In this paper, we have presented the Inexact Near-optimal Accelerated Tensor Method and improved its convergence rate. This improvement makes it possible to solve the Taylor approximation subproblem inexactly, using only the second-order oracle, and to obtain a method with convergence rate $O(N^{-5})$ up to a logarithmic factor. This method is a combination of the Inexact Third-Order Near-Optimal Accelerated Tensor Method with the Bregman-Distance Gradient Method for solving the inner subproblem. Finally, assuming that the problem is a sum of two functions, we propose the Hyperfast Second-Order Sliding to separate the oracle complexities of the two functions. All these methods have near-optimal convergence rates for the given problem classes and are the best known at the moment.

Though it seems there is no way to obtain super-fast first- and zero-order schemes, we may use the proposed approach to obtain new classes of first- and zero-order schemes. Namely, for convex smooth enough problems (and analogously for strongly convex ones) it is possible to propose a first-order method that requires $\tilde O(n\varepsilon^{-1/5})$ gradient calculations and that $\tilde O(\varepsilon^{-1/5})$ times inverts a Hessian (with complexity $O(n^{2.5})$ a.o. for each inversion). This result is not optimal from the point of view of modern complexity theory: there exists a theoretical algorithm [30] that requires $\tilde O(n)$ gradient calculations and $\tilde O(n^2)$ a.o. per iteration, but at the current moment this result is far from being practical. Analogously, one can show that it is possible to propose a zero-order method that requires $\tilde O(n^2\varepsilon^{-1/5})$ function value calculations and that $\tilde O(\varepsilon^{-1/5})$ times inverts a Hessian.
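The first-order variant mentioned above rests on the fact that a Hessian can be reconstructed from $O(n)$ gradient calls (and, for the zero-order variant, from $O(n^2)$ function values). A minimal sketch of such a finite-difference reconstruction (our own illustration; the step size and the symmetrization are assumptions, not a prescription from the paper):

\begin{verbatim}
import numpy as np

def hessian_from_gradients(grad_f, x, tau=1e-6):
    """Approximate the Hessian of f at x from n + 1 gradient calls
    (forward differences along coordinate directions), then symmetrize."""
    n = x.size
    g0 = grad_f(x)
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = tau
        H[:, i] = (grad_f(x + e) - g0) / tau
    return 0.5 * (H + H.T)   # the exact Hessian is symmetric
\end{verbatim}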
The same remarks about optimality apply here, but instead of [30] one should refer to [31].

In this paper, we have developed a near-optimal Hyperfast Second-Order method for sufficiently smooth convex problems in terms of convergence in the function value. Based on the technique from the work [11], one can also develop a near-optimal Hyperfast Second-Order method for sufficiently smooth convex problems in terms of convergence in the norm of the gradient. In particular, based on the work [20], one may show that the complexity of this approach applied to the dual problem of the entropy-regularized optimal transport problem (where $n$ is the linear dimension of the transport plan matrix) can be better than the complexity of the accelerated gradient method and of the accelerated Sinkhorn algorithm [10, 20]. Note that the best theoretical bounds for this problem are also far from being practical [2, 23, 30, 39].

Note that in March 2020 a new preprint of Yu. Nesterov appeared, where the same result $\tilde O(\varepsilon^{-1/5})$ was obtained based on another proximal accelerated envelope [36]. Both approaches (the one described in this paper and the one in [36]) also have the same complexity in terms of the logarithmic factors: factors $\ln\varepsilon^{-1}$ appear due to the line search and due to the accuracy to which the auxiliary problem has to be solved.

Acknowledgements

We would like to thank Yurii Nesterov, Pavel Dvurechensky and Cesar Uribe for fruitful discussions. We also would like to thank Soomin Lee (Yahoo), Erick Ordentlich (Yahoo), Andrey Vorobyev (Huawei), Evgeny Yanitsky (Huawei) and Olga Vasiukova (Huawei) for motivating us by concrete problem formulations.

Funding
The work of D. Kamzolov was funded by RFBR, project number 19-31-27001. The work of A.V. Gasnikov in the first part of the paper was supported by RFBR grant 18-29-03071 mk. A.V. Gasnikov was also partially supported by the Yahoo! Research Faculty Engagement Program.
References

[1] Ernesto G Birgin, JL Gardenghi, José Mario Martínez, Sandra Augusta Santos, and Ph L Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1-2):359-368, 2017.
[2] Jose Blanchet, Arun Jambulapati, Carson Kent, and Aaron Sidford. Towards optimal running times for optimal transport. arXiv preprint arXiv:1810.07717, 2018.
[3] Sébastien Bubeck, Qijia Jiang, Yin-Tat Lee, Yuanzhi Li, and Aaron Sidford. Complexity of highly parallel non-smooth convex optimization. In Advances in Neural Information Processing Systems, pages 13900-13909, 2019.
[4] Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. In Conference on Learning Theory, pages 492-507, 2019.
[5] Y Carmon, JC Duchi, O Hinder, and A Sidford. Lower bounds for finding stationary points ii: first-order methods. arXiv preprint arXiv:1711.00841, 2017.
[6] Nikita Doikov and Yurii Nesterov. Contracting proximal methods for smooth convex optimization. arXiv preprint arXiv:1912.07972, 2019.
[7] Nikita Doikov and Yurii Nesterov. Local convergence of tensor methods. arXiv preprint arXiv:1912.02516, 2019.
[8] John C Duchi. Introductory lectures on stochastic optimization. The Mathematics of Data, 25:99-185, 2018.
[9] Darina Dvinskikh and Alexander Gasnikov. Decentralized and parallelized primal and dual accelerated methods for stochastic convex programming problems. arXiv preprint arXiv:1904.09015, 2019.
[10] Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. arXiv preprint arXiv:1802.04367, 2018.
[11] Pavel Dvurechensky, Alexander Gasnikov, Petr Ostroukhov, César A Uribe, and Anastasiya Ivanova. Near-optimal tensor methods for minimizing the gradient norm of convex function. arXiv preprint arXiv:1912.03381, 2019.
[12] Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. arXiv preprint arXiv:1902.10710, 2019.
[13] Alexander Gasnikov. Universal gradient descent. arXiv preprint arXiv:1711.00394, 2017.
[14] Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, and César A Uribe. Optimal tensor methods in smooth convex and uniformly convex optimization. In Conference on Learning Theory, pages 1374-1391, 2019.
[15] Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, César A Uribe, Bo Jiang, Haoyue Wang, Shuzhong Zhang, Sébastien Bubeck, et al. Near optimal methods for minimizing convex functions with Lipschitz p-th derivatives. In Conference on Learning Theory, pages 1392-1393, 2019.
[16] Alexander Vladimirovich Gasnikov and Dmitry A Kovalev. A hypothesis about the rate of global convergence for optimal methods (Newton's type) in smooth convex optimization. Computer Research and Modeling, 10(3):305-314, 2018.
[17] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Generalized uniformly optimal methods for nonlinear programming. Journal of Scientific Computing, 79(3):1854-1881, 2019.
[18] Geovani Nunes Grapiglia and Yurii Nesterov. On inexact solution of auxiliary problems in tensor methods for convex optimization. arXiv preprint arXiv:1907.13023, 2019.
[19] Geovani Nunes Grapiglia and Yurii Nesterov. Tensor methods for minimizing functions with Hölder continuous higher-order derivatives. arXiv preprint arXiv:1904.12559, 2019.
[20] Sergey Guminov, Pavel Dvurechensky, Tupitsa Nazary, and Alexander Gasnikov. Accelerated alternating minimization, accelerated Sinkhorn's algorithm and accelerated iterative Bregman projections. arXiv preprint arXiv:1906.03622, 2019.
[21] Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically preconditioned accelerated gradient method for distributed optimization. arXiv preprint arXiv:2002.10726, 2020.
[22] Anastasiya Ivanova, Dmitry Grishchenko, Alexander Gasnikov, and Egor Shulgin. Adaptive catalyst for smooth convex optimization. arXiv preprint arXiv:1911.11271, 2019.
[23] Arun Jambulapati, Aaron Sidford, and Kevin Tian. A direct $\tilde{O}(1/\epsilon)$ iteration parallel algorithm for optimal transport. In Advances in Neural Information Processing Systems, pages 11355-11366, 2019.
[24] Bo Jiang, Haoyue Wang, and Shuzhong Zhang. An optimal high-order tensor method for convex optimization. In Conference on Learning Theory, pages 1799-1801, 2019.
[25] Dmitry Kamzolov, Alexander Gasnikov, and Pavel Dvurechensky. On the optimal combination of tensor optimization methods. arXiv preprint arXiv:2002.01004, 2020.
[26] Guanghui Lan. Gradient sliding for composite optimization. Mathematical Programming, 159(1-2):201-235, 2016.
[27] Guanghui Lan and Yuyuan Ouyang. Accelerated gradient sliding for structured convex optimization. arXiv preprint arXiv:1609.04905, 2016.
[28] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 171(1-2):167-215, 2018.
[29] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, pages 296-303, 2014.
[30] Yin Tat Lee and Aaron Sidford. Solving linear programs with sqrt(rank) linear system solves. arXiv preprint arXiv:1910.08033, 2019.
[31] Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with oracles. In Building Bridges II, pages 317-335. Springer, 2019.
[32] Renato DC Monteiro and Benar Fux Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092-1125, 2013.
[33] Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
[34] Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Mathematical Programming, pages 1-27, 2019.
[35] Yurii Nesterov. Inexact accelerated high-order proximal-point methods. Technical report, CORE Discussion Paper 2020, Université catholique de Louvain, Center for Operations Research and Econometrics, 2020.
[36] Yurii Nesterov. Inexact high-order proximal-point methods with auxiliary search procedure. Technical report, 2020.
[37] Yurii Nesterov. Superfast second-order methods for unconstrained convex optimization. Technical report, CORE Discussion Paper 2020, Université catholique de Louvain, Center for Operations Research and Econometrics, 2020.
[38] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
[39] Kent Quanrud. Approximating optimal transport with linear programs. arXiv preprint arXiv:1810.05957, 2018.
[40] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[41] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT, 2009.
[42] Chaobing Song and Yi Ma. Towards unified acceleration of high-order algorithms under Hölder continuity and uniform convexity. arXiv preprint arXiv:1906.00582, 2019.
[43] Z Wang, Y Zhou, Y Liang, and G Lan. Cubic regularization with momentum for nonconvex optimization. In Proc. Conference on Uncertainty in Artificial Intelligence (UAI), 2019.
[44] Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. A note on inexact condition for cubic regularized Newton's method. arXiv preprint arXiv:1808.07384, 2018.
[45] Ashia Wilson, Lester Mackey, and Andre Wibisono. Accelerating rescaled gradient descent. arXiv preprint arXiv:1902.08825, 2019.
[46] Blake E Woodworth, Jialei Wang, Adam Smith, Brendan McMahan, and Nati Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In