Improved Exploiting Higher Order Smoothness in Derivative-free Optimization and Continuous Bandit
Vasilii Novitskii · Alexander Gasnikov
Abstract
We consider the β-smooth (satisfying the generalized Hölder condition with parameter β > 2) stochastic convex optimization problem with a zero-order one-point oracle. The best known result was [1]:
E[f(x̄_N) − f(x*)] = Õ( n² / (γ N^{(β−1)/β}) )
in the γ-strongly convex case, where n is the dimension. In this paper we improve this bound:
E[f(x̄_N) − f(x*)] = Õ( n^{2−1/β} / (γ N^{(β−1)/β}) ).

Keywords: zeroth-order optimization · convex problem · stochastic optimization · one-point bandit · smoothing kernel

This work is based on results presented at the 63rd MIPT Conference held in November 2020. The research of A. Gasnikov was partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project no. 0714-2020-0005. The work of V. Novitskii was supported by the Andrei M. Raigorodskii Scholarship in Optimization.

V. Novitskii
Moscow Institute of Physics and Technology, Russia
E-mail: [email protected]

A. Gasnikov
Moscow Institute of Physics and Technology, Russia
Institute for Information Transmission Problems RAS, Russia
Weierstrass Institute for Applied Analysis and Stochastics, Germany
1 Introduction

We study the problem of zero-order stochastic optimization, in which the aim is to minimize an unknown convex or strongly convex function when no gradient realization is available, but at each iteration a function value corrupted by some additive noise ξ can be observed. We also study the closely related problem of continuous stochastic bandits. These problems have received significant attention in the literature (see [1-4, 6-9, 14, 15]) and are fundamental for many applications where the derivative of the function is not available or is hard to calculate.

The goal of this paper is to exploit higher order smoothness of the function to improve the performance of projected gradient-like algorithms. Our approach is outlined in Algorithm 1: a sequential algorithm receives at each iteration two function values under some noise, querying the function at the points x_k + δ_k and x_k − δ_k, where δ_k = τ_k r_k e_k. Here r_k is a uniformly distributed random scalar, e_k is uniformly distributed on the Euclidean sphere, and τ_k is a tunable parameter of the algorithm: the smaller τ_k is, the smaller the approximation error of the gradient ‖g̃_k − ∇f(x_k)‖ (in this article we use only the Euclidean norm), but the larger the variance of ‖g̃_k‖, so a trade-off between these two terms is needed. Our approach uses the kernel smoothing technique proposed by Polyak and Tsybakov in [11], which helps to exploit higher order smoothness.
Algorithm 1 Zero-order Stochastic Projected Gradient
Requires: Kernel K: [−1, 1] → R, step sizes α_k > 0, parameters τ_k > 0.
Initialization: Generate scalars r_1, ..., r_N uniformly on [−1, 1] and vectors e_1, ..., e_N uniformly on the Euclidean unit sphere S^n = {e ∈ R^n : ‖e‖ = 1}.
for k = 1, ..., N do
  1. Query y_k := f(x_k + τ_k r_k e_k) + ξ_k and y'_k := f(x_k − τ_k r_k e_k) + ξ'_k.
  2. Define g̃_k := (n/(2τ_k)) (y_k − y'_k) K(r_k) e_k.
  3. Update x_{k+1} := Π_Q(x_k − α_k g̃_k).
end for
Output: {x_k}_{k=1}^{N}.
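For illustration, a minimal Python sketch of Algorithm 1 might look as follows. The function and parameter names (zero_order_spg, oracle, project, tau, alpha) are our own illustrative choices, not part of the paper:

```python
import numpy as np

def zero_order_spg(oracle, project, x1, N, tau, alpha, K, rng=None):
    """Sketch of Algorithm 1 (zero-order stochastic projected gradient).

    oracle(x)  -- noisy one-point oracle returning f(x) plus additive noise
    project(x) -- Euclidean projection onto the feasible set Q
    tau(k), alpha(k) -- schedules of the smoothing and step-size parameters
    K(r)       -- smoothing kernel on [-1, 1] satisfying the moment conditions (2)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x1.size
    x = np.array(x1, dtype=float)
    iterates = []
    for k in range(1, N + 1):
        r = rng.uniform(-1.0, 1.0)
        e = rng.normal(size=n)
        e /= np.linalg.norm(e)                       # uniform direction on the unit sphere
        y_plus = oracle(x + tau(k) * r * e)          # noisy value at x_k + tau_k r_k e_k
        y_minus = oracle(x - tau(k) * r * e)         # noisy value at x_k - tau_k r_k e_k
        g = n / (2.0 * tau(k)) * (y_plus - y_minus) * K(r) * e   # gradient estimate
        x = project(x - alpha(k) * g)                # projected gradient step
        iterates.append(x.copy())
    return np.mean(np.stack(iterates), axis=0)       # averaged estimator
```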
In algorithms like Algorithm 1, two settings are usually considered. The first is to obtain a function value at one point with some noise ("one-point" multi-armed bandit); the second is to observe function values at two points with the same noise at each iteration ("two-point" multi-armed bandit). Using three or more points does not dramatically change the results compared with two points [5]. Note that although our algorithm obtains two function values per iteration, they come with different noises ξ_k and ξ'_k, so it is correct to regard Algorithm 1 as a one-point method and to compare it with one-point algorithms.

In this paper we study higher order smooth functions f, that is, functions satisfying the generalized Hölder condition with parameter β > 2 (see (1) below). We address the following question: what is the performance of Algorithm 1, namely, the explicit dependence of the convergence rate on the main parameters n (dimension), N (number of iterations), γ (strong convexity parameter in the strongly convex case) and β? To answer this question we prove upper bounds for Algorithm 1.
Contributions. Our main contributions can be summarized as follows:
1. For the strongly convex case: under an adversarial noise assumption (see Assumption 1) we establish, for all β > 2, the bound Õ( n^{2−1/β}/(γ N^{(β−1)/β}) ) on the optimization error of Algorithm 1.
2. For the convex case: under an adversarial noise assumption (see Assumption 1) we establish, for all β > 2, that after N(ε) = Õ( n^{(2β−1)/(β−1)}/ε^{2β/(β−1)} ) iterations of Algorithm 1 applied to the regularized function f_γ(x) := f(x) + (ε/(2R²))‖x − x_0‖² we achieve an optimization error less than or equal to ε.
For clarity we compare our results with the state-of-the-art ones in Table 1 (dependence of the optimization error ε on the number of iterations N, the dimension n, and β, γ) and Table 2 (dependence of the number of iterations N on the optimization error ε, the dimension n, and β, γ). To summarize the results we use Õ(·), which coincides with O(·) up to a logarithmic factor.
Table 1 The dependence of the optimization error ε on N (number of iterations), n (dimension), γ and β.
- Lower bound [1]: strongly convex O( n/(γ N^{(β−1)/β}) ); convex O( √n/N^{(β−1)/(2β)} )
- This work (2020): strongly convex Õ( n^{2−1/β}/(γ N^{(β−1)/β}) ); convex Õ( n^{1−1/(2β)}/N^{(β−1)/(2β)} )
- Akhavan, Pontil, Tsybakov (2020) [1]: strongly convex Õ( n²/(γ N^{(β−1)/β}) ); convex Õ( n/N^{(β−1)/(2β)} )
- Bach, Perchet (2016) [2]: strongly convex O( n^{2−2/(β+1)}/(γ N^{(β−1)/(β+1)}) ); convex O( n^{1−1/(β+1)}/N^{(β−1)/(2(β+1))} )
- Gasnikov et al. (2015), β = 2 [8]: strongly convex Õ( n/(γ √N) ); convex Õ( √n/N^{1/4} )
- Akhavan, Pontil, Tsybakov (2020), special case β = 2 [1]: strongly convex Õ( n/(γ √N) ); convex Õ( √n/N^{1/4} )
- Zhang et al. (2020) [15]: strongly convex O( n/(γ √N) ); convex O( √n/N^{1/4} )
Table 2 The dependence of N (number of iterations) on ε (optimization error), n (dimension), γ and β.
- Lower bound [1]: strongly convex O( n^{β/(β−1)}/(γε)^{β/(β−1)} ); convex O( n^{β/(β−1)}/ε^{2β/(β−1)} )
- This work (2020): strongly convex Õ( n^{(2β−1)/(β−1)}/(γε)^{β/(β−1)} ); convex Õ( n^{(2β−1)/(β−1)}/ε^{2β/(β−1)} )
- Akhavan, Pontil, Tsybakov (2020) [1]: strongly convex Õ( n^{2β/(β−1)}/(γε)^{β/(β−1)} ); convex Õ( n^{2β/(β−1)}/ε^{2β/(β−1)} )
- Bach, Perchet (2016) [2]: strongly convex O( n^{2β/(β−1)}/(γε)^{(β+1)/(β−1)} ); convex O( n^{2β/(β−1)}/ε^{2(β+1)/(β−1)} )
- Gasnikov et al. (2015), β = 2 [8]: strongly convex Õ( n²/(γε)² ); convex Õ( n²/ε⁴ )
- Akhavan, Pontil, Tsybakov (2020), special case β = 2 [1]: strongly convex Õ( n²/(γε)² ); convex Õ( n²/ε⁴ )
- Zhang et al. (2020) [15]: strongly convex O( n²/(γε)² ); convex O( n²/ε⁴ )

Comments on Table 1 and Table 2.
1. Note that in Table 1 and Table 2 the right column is obtained from the central one by taking γ ∼ ε.
2. Note that the results of this work have a better dependence ε(N) or N(ε) than Gasnikov's one-point method only if β > 2.
3. It is assumed that γ ≥ N^{−1/2+1/(2β)} (otherwise it is better to use the methods for the convex case) and (see [1]) 2γ ≤ max_{x∈Q} ‖∇f(x)‖.
4. The bounds marked in blue are not given in this article or in the references, but they can be derived.
5. Too optimistic bounds O( n^{1−2/(β+1)}/(γ N^{(β−1)/(β+1)}) ) and O( n/ε^{2(β+1)/(β−1)} ) were claimed in [2] instead of O( n^{2−2/(β+1)}/(γ N^{(β−1)/(β+1)}) ) and O( n^{2β/(β−1)}/ε^{2(β+1)/(β−1)} ): Akhavan, Pontil and Tsybakov [1] found an error in Lemma 2 of [2], where a factor d of the dimension (n in our notation) is missing.

2 Preliminaries

In this section we give the necessary notation, definitions and assumptions.

2.1 Notation

Let ⟨·,·⟩ and ‖·‖ be the standard inner product and the Euclidean norm on R^n, respectively. For every closed convex set Q ⊂ R^n and every x ∈ R^n let Π_Q(x) denote the Euclidean projection of x onto Q.

2.2 Problem

We address the constrained minimization problem
f(x) → min_{x∈Q},
where f: U_ε(Q) → R is a convex or strongly convex function and Q ⊂ R^n is a convex compact set (with the Euclidean metric). The optimization problem can be formulated as follows: find a sequence {x_k}_{k=1}^{N} ⊂ Q minimizing the average regret
(1/N) Σ_{k=1}^{N} E[f(x_k) − f(x*)].
If the average regret is less than or equal to ε, then the optimization error of the averaged estimator x̄_N = (1/N) Σ_{k=1}^{N} x_k is also less than or equal to ε:
E[f(x̄_N) − f(x*)] ≤ (1/N) Σ_{k=1}^{N} E[f(x_k) − f(x*)] ≤ ε.
The values f(x_k + τ_k r_k e_k) and f(x_k − τ_k r_k e_k) are given with additive noise ξ_k and ξ'_k respectively (see Algorithm 1). Recall that Algorithm 1 is randomized: the scalars r_1, ..., r_N are distributed uniformly on [−1, 1] and the vectors e_1, ..., e_N are distributed uniformly on the Euclidean unit sphere S^n = {e ∈ R^n : ‖e‖ = 1}.
Assumption 1
For all k = 1, 2, ..., N it holds that:
1. E[ξ_k²] ≤ σ² and E[(ξ'_k)²] ≤ σ², where σ ≥ 0;
2. the random variables ξ_k and ξ'_k are independent from e_k and r_k, and the random variables e_k and r_k are independent.

We assume neither that ξ_k and ξ'_k have zero mean nor that {ξ_k}_{k=1}^N and {ξ'_k}_{k=1}^N are i.i.d., since condition 2 of Assumption 1 allows us to avoid these assumptions.

Let l denote the maximal integer strictly less than β. Let F_β(L) denote the set of all functions f: R^n → R that are l times differentiable and for all x, z ∈ U_ε(Q) satisfy the Hölder condition
| f(z) − Σ_{0≤|m|≤l} (1/m!) D^m f(x)(z − x)^m | ≤ L‖z − x‖^β,  (1)
where L > 0,
the sum is over the multi-indices m = (m_1, ..., m_n) ∈ N^n, we use the notation m! = m_1!···m_n!, |m| = m_1 + ··· + m_n, and we define
D^m f(x) z^m = ( ∂^{|m|} f(x)/(∂x_1^{m_1} ··· ∂x_n^{m_n}) ) z_1^{m_1}···z_n^{m_n}, for all z = (z_1, ..., z_n) ∈ R^n.
Let F_{γ,β}(L) denote the set of γ-strongly convex functions f ∈ F_β(L). Recall that f is called γ-strongly convex for some γ > 0 if for all x, z ∈ R^n it holds that f(z) ≥ f(x) + ⟨∇f(x), z − x⟩ + (γ/2)‖x − z‖².

2.5 Kernel

For the gradient estimator g̃_k we use a kernel K: [−1, 1] → R satisfying
E[K(r)] = 0, E[rK(r)] = 1, E[r^j K(r)] = 0 for j = 2, ..., l, E[|r|^β |K(r)|] < ∞,  (2)
where r is a random variable uniformly distributed on [−1, 1].
This helps us to get better bounds on the gradient bias ‖g̃_k − ∇f(x_k)‖ (see Theorem 1 for details).

A weighted sum of Legendre polynomials gives an example of such kernels:
K_β(r) := Σ_{m=0}^{l(β)} p_m'(0) p_m(r),  (3)
where l(β) is the maximal integer strictly less than β, p_m(r) = √(2m+1) L_m(r), and L_m(u) is the Legendre polynomial. We have E[p_m(r) p_{m'}(r)] = δ(m − m'). As {p_m(r)}_{m=0}^{j} is a basis for the polynomials of degree less than or equal to j, we can represent u^j = Σ_{m=0}^{j} b_m p_m(u) for some real coefficients {b_m}_{m=0}^{j} (they depend on j). Let us calculate the expectation for j = 0, 1, ..., l(β):
E[r^j K_β(r)] = Σ_{m=0}^{j} b_m p_m'(0) = (r^j)'|_{r=0} = δ(j − 1),
here δ(0) = 1 and δ(x) = 0 if x ≠ 0. We have proved that the presented K_β(r) satisfies (2). We have the following kernels for different values of β (see Figure 1):
K_β(r) = 3r, β ∈ [2, 3],
K_β(r) = (15r/4)(5 − 7r²), β ∈ (3, 5],
K_β(r) = (105r/64)(99r⁴ − 126r² + 35), β ∈ (5, 7].
Fig. 1
Examples of kernels from (3) for β = 3, 5, 7, 11
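As a sanity check, the construction (3) and the moment conditions (2) can be verified numerically. The sketch below is our own illustration (the helper name make_kernel is ours); it uses numpy.polynomial.legendre for the Legendre polynomials and simple quadrature for the moments:

```python
import math
import numpy as np
from numpy.polynomial.legendre import Legendre

def make_kernel(beta):
    """Kernel K_beta from (3): sum_{m <= l(beta)} p_m'(0) * p_m(r),
    where p_m(r) = sqrt(2m+1) * L_m(r) is orthonormal for r ~ Uniform[-1, 1]."""
    l = math.ceil(beta) - 1                       # maximal integer strictly less than beta
    coeffs = []
    for m in range(l + 1):
        L_m = Legendre.basis(m)
        norm = math.sqrt(2 * m + 1)
        coeffs.append((norm * L_m.deriv()(0.0), norm, L_m))   # (p_m'(0), normalization, L_m)
    return lambda r: sum(w * s * L(r) for (w, s, L) in coeffs)

K = make_kernel(5)                                # for beta in (3, 5] this equals 15r/4 * (5 - 7 r^2)
r = np.linspace(-1.0, 1.0, 20001)
vals = np.array([K(t) for t in r])
for j in range(5):
    # moment E[r^j K(r)] from (2): should be 1 for j = 1 and 0 for j = 0, 2, 3, 4
    print(j, round(0.5 * np.trapz(r ** j * vals, r), 6))
```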
For Theorem 1 and Theorem 2 we need to introduce the constants
κ_β = ∫_{−1}^{1} |u|^β |K(u)| du  (4)
and
κ = ∫_{−1}^{1} K²(u) du.  (5)
It is proved in [2] that κ_β and κ do not depend on n; they depend only on β:
κ_β ≤ 2√2 (β − 1),  (6)
κ ≤ 2√2 β^{3/2}.  (7)
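For a concrete β these constants are easy to evaluate numerically. A small self-contained check (our own illustration, not from the paper, using the explicit β ∈ (3, 5] kernel listed above):

```python
import numpy as np

# kappa_beta and kappa from (4)-(5), evaluated numerically for the beta = 5 kernel
u = np.linspace(-1.0, 1.0, 20001)
Ku = 15.0 * u / 4.0 * (5.0 - 7.0 * u ** 2)
print(np.trapz(np.abs(u) ** 5 * np.abs(Ku), u))   # kappa_beta from (4), beta = 5
print(np.trapz(Ku ** 2, u))                        # kappa from (5)
```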
3 Upper bounds

In this section we prove upper bounds on the optimization error of Algorithm 1 for strongly convex functions (Theorem 1) and for convex functions (Theorem 2).
Theorem 1
Let f ∈ F_{γ,β}(L) with γ, L > 0 and β > 2. Let Assumption 1 hold and let Q be a convex compact subset of R^n. Let f be G-Lipschitz on the Euclidean τ_1-neighborhood of Q.
Then the optimization error of the averaged estimator x̄_N = (1/N) Σ_{k=1}^{N} x_k, where the points x_k are given by Algorithm 1 with parameters
τ_k = ( 3κσ²n / (2(β−1)(κ_β L)²) )^{1/(2β)} k^{−1/(2β)},  α_k = 2/(γk),  k = 1, ..., N,
satisfies
E[f(x̄_N) − f(x*)] ≤ (1/γ) ( n^{2−1/β} A_1 / N^{(β−1)/β} + A_2 n(1 + ln N)/N ),
where A_1 = 3β(κσ²)^{(β−1)/β}(κ_β L)^{2/β}, A_2 = c_* κ G², and κ_β, κ are the constants depending only on β defined in (4) and (5).
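For example, for β = 3 the prescription of Theorem 1 gives τ_k ∝ n^{1/6} k^{−1/6} with α_k = 2/(γk), and the leading term of the bound becomes Õ( n^{5/3}/(γ N^{2/3}) ), which improves on the Õ( n²/(γ N^{2/3}) ) guarantee of [1].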
Proof. Step 1. Fix an arbitrary x ∈ Q. As x_{k+1} is the Euclidean projection we have ‖x_{k+1} − x‖² ≤ ‖x_k − α_k g̃_k − x‖², which is equivalent to
⟨g̃_k, x_k − x⟩ ≤ (‖x_k − x‖² − ‖x_{k+1} − x‖²)/(2α_k) + (α_k/2)‖g̃_k‖².  (8)
By the strong convexity assumption we have
f(x_k) − f(x) ≤ ⟨∇f(x_k), x_k − x⟩ − (γ/2)‖x_k − x‖².  (9)
Combining the last two inequalities we obtain
f(x_k) − f(x) ≤ ⟨∇f(x_k) − g̃_k, x_k − x⟩ + (‖x_k − x‖² − ‖x_{k+1} − x‖²)/(2α_k) + (α_k/2)‖g̃_k‖² − (γ/2)‖x_k − x‖².  (10)
Taking the conditional expectation given x_k with respect to e_k, r_k, ξ_k and ξ'_k we obtain
f(x_k) − f(x) ≤ ⟨∇f(x_k) − E[g̃_k | x_k], x_k − x⟩ + (α_k/2) E[‖g̃_k‖² | x_k] + (‖x_k − x‖² − E[‖x_{k+1} − x‖² | x_k])/(2α_k) − (γ/2)‖x_k − x‖².  (11)
Step 2 (Bounding the bias term). Our aim is to bound the first term in (11), namely ⟨∇f(x_k) − E[g̃_k | x_k], x_k − x⟩. Using the Taylor expansion we have
f(x_k + τ_k r_k e_k) = f(x_k) + ⟨∇f(x_k), τ_k r_k e_k⟩ + Σ_{2≤|m|≤l} ((τ_k r_k)^{|m|}/m!) D^m f(x_k) e_k^m + R(τ_k r_k e_k),  (12)
where by assumption |R(τ_k r_k e_k)| ≤ L‖τ_k r_k e_k‖^β = L(τ_k |r_k|)^β. Thus,
g̃_k = ( ⟨∇f(x_k), τ_k r_k e_k⟩ + Σ_{2≤|m|≤l, |m| odd} ((τ_k r_k)^{|m|}/m!) D^m f(x_k) e_k^m + (R(τ_k r_k e_k) − R(−τ_k r_k e_k) + ξ_k − ξ'_k)/2 ) (n/τ_k) K(r_k) e_k.  (13)
Using the properties of the smoothing kernel K (in particular E[r_k K(r_k)] = 1), the independence of e_k and r_k (Assumption 1) and the fact that E[e_k e_k^T] = (1/n) I_{n×n}, we obtain
E_{e_k, r_k}[ ⟨∇f(x_k), τ_k r_k e_k⟩ (n/τ_k) K(r_k) e_k | x_k ] = ∇f(x_k).  (14)
Using the fact that E[r_k^{|m|} K(r_k)] = 0 for 2 ≤ |m| ≤ l and for |m| = 0, and Assumption 1, we have
E[ ( Σ_{2≤|m|≤l, |m| odd} ((τ_k r_k)^{|m|}/m!) D^m f(x_k) e_k^m + (ξ_k − ξ'_k)/2 ) (n/τ_k) K(r_k) e_k | x_k ] = 0.  (15)
Combining (13), (14) and (15) and using the definition of κ_β we obtain
|⟨∇f(x_k) − E[g̃_k | x_k], x_k − x⟩| = | E[ ((R(τ_k r_k e_k) − R(−τ_k r_k e_k))/2) (n/τ_k) K(r_k) ⟨e_k, x_k − x⟩ | x_k ] |
≤ L τ_k^{β−1} E_{r_k}[|r_k|^β |K(r_k)|] · n · E_{e_k}[|⟨e_k, x_k − x⟩|] ≤ κ_β L √n τ_k^{β−1} ‖x_k − x‖,  (16)
where in the last inequality we used E_e[|⟨e, s⟩|] ≤ √(E_e[⟨e, s⟩²]) = ‖s‖/√n (a fact from the concentration of measure theory). Applying the inequality ab ≤ (a² + b²)/2 to the last expression in (16) we finally get
|⟨∇f(x_k) − E[g̃_k | x_k], x_k − x⟩| ≤ ((κ_β L)²/γ) n τ_k^{2(β−1)} + (γ/4)‖x_k − x‖².  (17)
Step 3 (Bounding the second moment of the gradient estimator). Our aim is to estimate E[‖g̃_k‖² | x_k], which is the second term in (11). The expectation here is with respect to e_k, r_k, ξ_k and ξ'_k. To lighten the presentation, and without loss of generality, we drop the subscript k in all quantities. We have
‖g̃‖² = (n²/(4τ²)) ‖(f(x + τre) − f(x − τre) + ξ − ξ') K(r) e‖² = (n²/(4τ²)) (f(x + τre) − f(x − τre) + ξ − ξ')² K²(r).  (18)
Using the inequality (a + b + c)² ≤ 3(a² + b² + c²) and Assumption 1 we get
E[‖g̃‖² | x] ≤ (3n²/(4τ²)) ( E[(f(x + τre) − f(x − τre))² K²(r) | x] + 2κσ² ).  (19)
Lemma 9 in [13] states that for any function f which is G-Lipschitz with respect to the 2-norm, if e is uniformly distributed on the Euclidean unit sphere, then
√( E[(f(e) − E[f(e)])²] ) ≤ cG/√n,  (20)
where c is a numerical constant. Noting that E_e[f(x + e)] = E_e[f(x − e)] by symmetry of the sphere, the inequality (a + b)² ≤ 2(a² + b²) together with (20) gives
E_e[(f(x + e) − f(x − e))² | x] ≤ 2 E_e[(f(x + e) − E_e[f(x + e)])²] + 2 E_e[(f(x − e) − E_e[f(x − e)])²] ≤ 4c²G²/n,  (21)
and hence, rescaling (f is G-Lipschitz on the τ_1-neighborhood of Q and |r| ≤ 1),
E[(f(x + τre) − f(x − τre))² | x, r] ≤ 4c²(τ|r|)²G²/n ≤ 4c²τ²G²/n.  (22)
Substituting (22) into (19), using the independence of e and r, the bound E[r²K²(r)] ≤ κ, and restoring the subscript k, we finally get
E[‖g̃_k‖² | x_k] ≤ κ ( c_* nG² + 3(nσ)²/(2τ_k²) ),  (23)
where c_* = 3c².
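The concentration bound (20) is easy to illustrate numerically; the following Monte-Carlo check (our own illustration with a linear, hence G-Lipschitz, test function, not part of the proof) shows the 1/√n scaling:

```python
import numpy as np

# for f(e) = <a, e> with ||a|| = 1 (G = 1), the standard deviation over the sphere is 1/sqrt(n)
rng = np.random.default_rng(0)
n = 50
a = rng.normal(size=n)
a /= np.linalg.norm(a)
e = rng.normal(size=(200000, n))
e /= np.linalg.norm(e, axis=1, keepdims=True)    # uniform points on the unit sphere
vals = e @ a
print(vals.std(), 1.0 / np.sqrt(n))              # both are approximately 0.141
```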
Step 4. Let ρ_k denote E[‖x_k − x‖²]. Substituting (17) and (23) into (11), taking the full expectation and summing over k, we obtain
Σ_{k=1}^{N} E[f(x_k) − f(x)] ≤ Σ_{k=1}^{N} ( ((κ_β L)²/γ) n τ_k^{2(β−1)} + (α_k/2) κ ( c_* nG² + 3(nσ)²/(2τ_k²) ) ) + Σ_{k=1}^{N} ( (ρ_k − ρ_{k+1})/(2α_k) − (γ/2 − γ/4) ρ_k ).  (24)
Let ρ_{N+1} = 0. Then setting α_k = 2/(γk) yields
Σ_{k=1}^{N} ( (ρ_k − ρ_{k+1})/(2α_k) − (γ/4) ρ_k ) ≤ ρ_1 ( 1/(2α_1) − γ/4 ) + Σ_{k=2}^{N+1} ρ_k ( 1/(2α_k) − 1/(2α_{k−1}) − γ/4 ) = ρ_1 ( γ/4 − γ/4 ) + Σ_{k=2}^{N+1} ρ_k ( γ/4 − γ/4 ) = 0.  (25)
Substituting (25) into (24) with α_k = 2/(γk) we obtain
Σ_{k=1}^{N} E[f(x_k) − f(x)] ≤ (1/γ) Σ_{k=1}^{N} ( [ n(κ_β L)² τ_k^{2(β−1)} + 3κσ² n²/(2kτ_k²) ] + c_* κ nG²/k ).  (26)
If σ > 0, then τ_k = ( 3κσ²n / (2(β−1)(κ_β L)²) )^{1/(2β)} k^{−1/(2β)} is the minimizer of the expression in the square brackets. Plugging this τ_k into (26) and using two inequalities:
for the expression in the square brackets, Σ_{k=1}^{N} k^{−(β−1)/β} ≤ βN^{1/β} (which holds for β > 2), and for the term after the square brackets, Σ_{k=1}^{N} 1/k ≤ 1 + ln N, we get
Σ_{k=1}^{N} E[f(x_k) − f(x)] ≤ (1/γ) ( n^{2−1/β} A_1 N^{1/β} + A_2 n(1 + ln N) )  (27)
with A_1 and A_2 from the formulation of Theorem 1. Due to the convexity of f we finally prove the theorem:
E[f(x̄_N) − f(x*)] ≤ (1/γ) ( n^{2−1/β} A_1 / N^{(β−1)/β} + A_2 n(1 + ln N)/N ).  (28)  □

We emphasize that the usage of the kernel smoothing technique, the measure concentration inequality, and the assumption that ξ_k is independent from e_k and r_k (Assumption 1) lead to results better than the state-of-the-art ones for β > 2. Moreover, we assume neither zero mean of ξ_k and ξ'_k nor that {ξ_k}_{k=1}^N and {ξ'_k}_{k=1}^N are i.i.d.

Theorem 2
Let f ∈ F_β(L) be convex with L > 0 and β > 2. Let Assumption 1 hold and let Q be a convex compact subset of R^n. Let f be G-Lipschitz on the Euclidean τ_1-neighborhood of Q. Let x̄_N denote (1/N) Σ_{k=1}^{N} x_k.
Then we achieve the optimization error E[f(x̄_N) − f(x*)] ≤ ε after N(ε) steps of Algorithm 1 with the settings from Theorem 1 applied to the regularized function f_γ(x) := f(x) + (γ/2)‖x − x_0‖², where γ ≤ ε/R², R = ‖x_0 − x*‖, and x_0 ∈ Q is an arbitrary point:
N(ε) = max{ (2R√A_1)^{2β/(β−1)} n^{(2β−1)/(β−1)} / ε^{2β/(β−1)}, (2R√(c'A_2))^{2(1+ρ)} n^{1+ρ} / ε^{2(1+ρ)} },
where A_1 = 3β(κσ²)^{(β−1)/β}(κ_β L)^{2/β} and A_2 = c_* κ G² are the constants from Theorem 1, and ρ > 0 is an arbitrarily small positive number.
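In code, the reduction used in Theorem 2 amounts to wrapping the one-point oracle of f into an oracle for the regularized function f_γ before running the algorithm. A minimal sketch (our own illustration, intended for the zero_order_spg sketch given in the introduction):

```python
import numpy as np

def regularized_oracle(oracle, x0, eps, R):
    """One-point oracle for f_gamma(x) = f(x) + (gamma/2)*||x - x0||^2 with gamma = eps / R^2."""
    gamma = eps / R ** 2
    def oracle_gamma(x):
        return oracle(x) + 0.5 * gamma * np.linalg.norm(x - x0) ** 2
    return oracle_gamma
```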
Proof. Step 1. Let x* and x*_γ denote argmin_{x∈Q} f(x) and argmin_{x∈Q} f_γ(x), respectively. Setting γ = ε/R² and using the inequality f_γ(x*_γ) ≤ f_γ(x*), we obtain
f(x̄_N) − f(x*) = f_γ(x̄_N) − f_γ(x*) − (γ/2)‖x̄_N − x_0‖² + (γ/2)‖x* − x_0‖²
≤ f_γ(x̄_N) − f_γ(x*) + (γ/2)‖x* − x_0‖² ≤ f_γ(x̄_N) − f_γ(x*_γ) + ε/2.  (29)
Step 2. Now we apply Theorem 1 to f_γ(x) and bound the right-hand side by ε/2:
E[f_γ(x̄_N) − f_γ(x*_γ)] ≤ (1/γ) ( n^{2−1/β} A_1 / N^{(β−1)/β} + A_2 n(1 + ln N)/N ) ≤ ε/2.  (30)
The inequality (30) holds if (recall γ = ε/R²)
max{ n^{2−1/β} A_1 / N^{(β−1)/β}, A_2 n(1 + ln N)/N } ≤ γε/4 = ε²/(4R²).  (31)
It is true that 1 + ln N ≤ c' N^{ρ/(1+ρ)} for some c' >
0. So the inequality (31) holds if
N ≥ max{ (2R√A_1)^{2β/(β−1)} n^{(2β−1)/(β−1)} / ε^{2β/(β−1)}, (2R√(c'A_2))^{2(1+ρ)} n^{1+ρ} / ε^{2(1+ρ)} }.  (32)
The inequalities (29) and (30) yield E[f(x̄_N) − f(x*)] ≤ ε.  □

4 Numerical experiments

In our experiment [10] we compare Algorithm 1 (with β = 3 and β = 5), proposed in this paper, with Gasnikov's one-point method and with Akhavan's method for the special case β = 2. We consider the problem of minimizing the quadratic function f(x) = (1/4)x_1² + x_2² + 4x_3² on the Euclidean ball Q = {x ∈ R³ : ‖x‖ ≤ 1}. The starting point is x_0 with a fixed norm ‖x_0‖. The dependence of f(x̄_N) − f(x*) (the optimization error) on N (the iteration number) is presented in Figure 2: we report the mean of the optimization error together with its 0.95-confidence interval. As the higher-order Lipschitz constants of the quadratic oracle are equal to zero, for Algorithm 1 we choose a small value of L.
Fig. 2 Dependence of the optimization error on the iteration number for Gasnikov's method, Akhavan's method (β = 2) and Algorithm 1 (β = 3 and β = 5)
Figure 2 shows that exploiting higher order smoothness allows Algorithm 1 to outperform the methods that do not use it.
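A rough way to reproduce this kind of experiment is sketched below, reusing the zero_order_spg function from the introduction. The noise level, starting point, ball radius and number of iterations are our own illustrative choices, not the paper's exact settings:

```python
import numpy as np

def f(x):
    return 0.25 * x[0] ** 2 + x[1] ** 2 + 4.0 * x[2] ** 2          # quadratic from Section 4

def noisy_oracle(x, sigma=1e-3, rng=np.random.default_rng(1)):
    return f(x) + sigma * rng.normal()                              # one-point oracle with additive noise

def project_ball(x, radius=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else radius * x / nrm                 # projection onto Q = {||x|| <= 1}

gamma = 0.5                                                         # strong convexity (smallest Hessian eigenvalue)
x1 = np.array([0.3, 0.3, 0.2])                                      # starting point inside Q (illustrative)
xbar = zero_order_spg(noisy_oracle, project_ball, x1, N=10000,
                      tau=lambda k: 0.5 * k ** (-1.0 / 6.0),        # k^{-1/(2*beta)} schedule for beta = 3
                      alpha=lambda k: 2.0 / (gamma * k),
                      K=lambda r: 3.0 * r)                          # kernel for beta in [2, 3]
print(f(xbar))                                                      # optimization error; f(x*) = 0
```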
5 Conclusion

The results of this paper can be generalized to saddle-point problems; recently GANs and reinforcement learning have generated considerable interest in saddle-point problems, see [12].
Another possible generalization of this paper is obtaining large-deviation (high-probability) bounds on the optimization error. We cannot obtain upper bounds in terms of large-deviation probabilities (rather than in expectation) under Assumption 1: exploiting higher order smoothness with the help of kernels under rather general noise assumptions (non-zero mean) leads to large values of ‖g̃_k − ∇f(x_k)‖, and this can cause problems with large-deviation rates. It remains an open question whether large-deviation bounds can be obtained under non-zero-mean noise. It also remains an open question whether a better dependence of the optimization error on the dimension n and the strong convexity parameter γ can be obtained.

References
1. Akhavan, A., Pontil, M., Tsybakov, A.B.: Exploiting higher order smoothness in derivative-free optimization and continuous bandits. arXiv preprint arXiv:2006.07862 (2020)
2. Bach, F., Perchet, V.: Highly-smooth zero-th order online optimization. In: Conference on Learning Theory, pp. 257–283 (2016)
3. Bubeck, S., Lee, Y.T., Eldan, R.: Kernel-based methods for bandit convex optimization. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 72–85 (2017)
4. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. Society for Industrial and Applied Mathematics (2009). DOI 10.1137/1.9780898718768
5. Duchi, J.C., Jordan, M.I., Wainwright, M.J., Wibisono, A.: Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61(5), 2788–2806 (2015)
6. Gasnikov, A., Dvurechensky, P., Kamzolov, D.: Gradient and gradient-free methods for stochastic convex optimization with inexact oracle. arXiv preprint arXiv:1502.06259 (2015)
7. Gasnikov, A., Dvurechensky, P., Nesterov, Y.: Stochastic gradient methods with inexact oracle. arXiv preprint arXiv:1411.4218 (2014)
8. Gasnikov, A.V., Krymova, E.A., Lagunovskaya, A.A., Usmanova, I.N., Fedorenko, F.A.: Stochastic online optimization. Single-point and multi-point non-linear multi-armed bandits. Convex and strongly-convex case. Automation and Remote Control 78(2), 224–234 (2017)
9. Larson, J., Menickelly, M., Wild, S.M.: Derivative-free optimization methods. Acta Numerica 28, 287–404 (2019). DOI 10.1017/S0962492919000060
10. Novitskii, V.: Zeroth-order algorithms for smooth saddle-point problems (2020). URL https://cutt.ly/bjxQHRY
11. Polyak, B.T., Tsybakov, A.B.: Optimal order of accuracy of search algorithms in stochastic optimization. Problemy Peredachi Informatsii 26(2), 45–53 (1990)
12. Sadiev, A., Beznosikov, A., Dvurechensky, P., Gasnikov, A.: Zeroth-order algorithms for smooth saddle-point problems. arXiv preprint arXiv:2009.09908 (2020)
13. Shamir, O.: An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research 18 (2017)