Stopping rules for accelerated gradient methods with additive noise in gradient
Artem Vasin · Alexander Gasnikov · Vladimir Spokoiny
Abstract
In this article, we investigate an accelerated first-order method, namely, the method of similar triangles, which is optimal in the class of convex (strongly convex) problems with a Lipschitz gradient. The paper considers a model of additive noise in a gradient and a Euclidean prox-structure for not necessarily bounded sets. Convergence estimates are obtained in the case of strong convexity and its absence, and a stopping criterion is proposed for not strongly convex problems.
The research of A. Gasnikov was supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03, project no. 0714-2020-0005. The work of A. Vasin was supported by the Andrei M. Raigorodskii Scholarship in Optimization.

A. Vasin
Moscow Institute of Physics and Technology, Russia

A. Gasnikov
Moscow Institute of Physics and Technology, Russia
Institute for Information Transmission Problems RAS, Russia
Weierstrass Institute for Applied Analysis and Stochastics, Germany

V. Spokoiny
Weierstrass Institute for Applied Analysis and Stochastics, Germany
Institute for Information Transmission Problems RAS, Russia
We consider L -smooth ( µ -strongly) convex optimization problem ( µ ≥ x ∈ Q f ( x ) . This means that Q is convex set, and for all x, y ∈ Q : f ( x ) + (cid:104)∇ f ( x ) , y − x (cid:105) + µ (cid:107) y − x (cid:107) ≤ f ( y ) , (cid:107)∇ f ( y ) − ∇ f ( x ) (cid:107) ≤ L (cid:107) y − x (cid:107) . In the analysis of the rates of convergence of different first-order methods theserelations are typically rewrite as follows [15,9,6,27,4,25,37,34,50,21,47,23,13] f ( x ) + (cid:104)∇ f ( x ) , y − x (cid:105) + µ (cid:107) y − x (cid:107) ≤ f ( y ) ≤ f ( x ) + (cid:104)∇ f ( x ) , y − x (cid:105) + L (cid:107) y − x (cid:107) . (1)Note, that the last relation is a consequence of the previous ones and in generalis not equivalent to them [49,25].In many applications, especially for gradient-free methods (when estimat-ing the gradient by finite differences [11,44,7]) optimization problems in in-finite dimensional spaces (such examples arise when solving inverse problems[31,26]) instead of an access to ∇ f ( x ) we have an access to its inexact approx-imation ˜ ∇ f ( x ).The two most popular conception of inexactness of gradient in practice are[42]: for all x ∈ Q (cid:107) ˜ ∇ f ( x ) − ∇ f ( x ) (cid:107) ≤ δ, (2) (cid:107) ˜ ∇ f ( x ) − ∇ f ( x ) (cid:107) ≤ α (cid:107)∇ f ( x ) (cid:107) , α ∈ [0 , . (3)For the first conception (2) several results about the accumulation of error canbe found in [42,12,10,1], but all these results are still far from to be optimisticin general. The reason was described in [41]. We can explain this reason byvery simple example: min x ∈ R n f ( x ) := 12 n (cid:88) i =1 λ i · ( x i ) , (4)where 0 ≥ µ = λ ≤ λ ≤ ... ≤ λ n = L , L ≥ µ . The solution of this problemis x ∗ = 0. Assume that inexactness takes place only in the first component.That is instead of ∂f ( x ) /∂x = µx we have an access to ˜ ∂f ( x ) /∂x = µx − δ .For simple gradient dynamic x k = x k − − L ˜ ∇ f ( x k − ) , topping rules for accelerated gradient methods with additive noise in gradient 3 we can conclude that for all k ∈ N x k ≥ δL − (1 − µ/L ) k − (1 − µ/L ) ≥ δ µ . (5)Hence ∗ f ( x k ) − f ( x ∗ ) ≥ δ µ . So we have a problem with (5), since µ can be to small ( µ (cid:46) ε – degenerateregime, where ε – desired accuracy in function value) in denominator of theRHS. We may expect even more serious troubles for accelerated gradient meth-ods, since they are more sensitive to the level of noise [16,25]. The solution ofthis problem is well known (see, for example, [41,42,35]): to propose a stoppingrule for the considered algorithm or to use regularization µ ∼ ε [25]. Roughlyspeaking, for non accelerated algorithms in [41,42] it was proved that if δ ∼ ε ,then it’s possible to reach ε -accuracy in function value (with almost the samenumber of iterations as for no noise case δ = 0) by applying computationallyconvenient stopping rule. In this paper we show that it’s sufficient to have δ ∼ ε both for primal-dualnon accelerated and accelerated gradient type methods [37, 25]. Primal-dualityof methods is used to build computationally convenient stopping rule in degen-erate regime. We emphasize, that the results δ ∼ ε has a simple explanation(see section 2) and one might think that it is well known. But to the best ofour knowledge the best results for accelerated methods require δ ∼ ε / . So weconsider our observation (that δ ∼ ε ) to be an important part of this paper,although it has rather simple explanation. The situation with the second criteria (3) is significantly better. 
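Returning to the toy problem (4)–(5), the following minimal Python sketch (an illustration added here, not code from the paper; the spectrum, noise level and step count are assumptions) runs plain gradient descent with an additive error −δ in the first partial derivative. The first coordinate drifts to roughly δ/µ, so the attained accuracy in function value stalls at a level of order δ²/µ:

```python
import numpy as np

# Illustrative spectrum: mu = lambda_1 <= ... <= lambda_n = L (all values are assumptions)
n, mu, L = 50, 1e-3, 1.0
lam = np.linspace(mu, L, n)
delta = 1e-4                       # additive error, only in the first partial derivative

def grad_noisy(x):
    g = lam * x                    # exact gradient of f(x) = 0.5 * sum_i lam_i * x_i^2
    g[0] -= delta                  # inexact first component, as in example (4)
    return g

x = np.zeros(n)                    # start at the true solution x* = 0
for _ in range(200_000):
    x = x - grad_noisy(x) / L      # simple gradient dynamics x_k = x_{k-1} - (1/L) * grad~f(x_{k-1})

f_gap = 0.5 * np.sum(lam * x**2)   # f(x_k) - f(x*)
print(f"f-gap ~ {f_gap:.2e}, predicted floor of order delta^2 / mu = {delta**2 / mu:.2e}")
```

Here the first coordinate settles near δ/µ = 0.1 even though the method started at the exact optimum, which is exactly the degenerate-regime trouble discussed above.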
For nonaccelerated algorithms inexactness in this case lead only to the decelerationof convergence ∼ (1 − α ) − -times [42]. This result holds true with the relaxedstrong convexity assumption [25] (Polyak–Lojasiewicz condition). For acceler-ated case to the best of our knowledge this is an open problem to estimateaccumulation of an error [25]. In this paper we show that if α (cid:46) (cid:0) µL (cid:1) / in µ -strongly convex case and(on k -th iteration) α k (cid:46) (cid:0) k (cid:1) / in degenerate regime we do not have anydeceleration. Numerical experiments demonstrate that in general for α largerthan mentioned above thresholds the convergence may slow down a lot up todivergence for considered accelerated method. Note, that close results (with the requirement α (cid:46) (cid:0) µL (cid:1) / ) in the case µ (cid:29) ε were recently obtained by using another techniques in Stochastic Op-timization with decision dependent distribution [18] and Policy Evaluationin Reinforcement Learning via reduction to stochastic Variational Inequalitywith Markovian noise [33]. In [33,18] it was assumed that (cid:107) ˜ ∇ f ( x ) − ∇ f ( x ) (cid:107) ≤ B (cid:107) x − x ∗ (cid:107) , α ∈ [0 , . (6) ∗ This bound corresponds to the worst-case philosophy concerning the choice of con-sidered example for considered class of methods [36,37,9,25]. We expect more interestingresults here by considering average-case complexity [46,40] (spectrum { λ i } average). A. Vasin and A. Gasnikov and V. Spokoiny Since x ∗ is a solution, from Fermat’s principle ∇ f ( x ∗ ) = 0. Therefore, (cid:107)∇ f ( x ) (cid:107) = (cid:107)∇ f ( x ) − ∇ f ( x ∗ ) (cid:107) ≤ L (cid:107) x − x ∗ (cid:107) . So if (3) holds true then 6 also holds true with B = αL . Important results in gradient error accumulation for first-order methods weredeveloped in the cycle of works of O. Devolder, F. Glineur and Yu. Nesterov2011–2014 [14,16,17,15]. In these works authors were motivated by (1). Theidea is to “relax” (1), assuming inexactness in gradient. So they introduceinexact gradient ˜ ∇ f ( x ), satisfying for all x, y ∈ Qf ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) y − x (cid:107) − δ ≤ f ( y ) ≤ f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + L (cid:107) y − x (cid:107) + δ. (7)Such a definition allows to develop precise theory for error accumulation forfirst-order methods.Namely, it was proved that for non-accelerated gradient methods f ( x k ) − f ( x ∗ ) = O (cid:18) min (cid:26) LR k + δ, LR exp (cid:16) − µL k (cid:17) + δ (cid:27)(cid:19) , (8)and for accelerated ones [16,20] f ( x k ) − f ( x ∗ ) = O (cid:32) min (cid:40) LR k + kδ, LR exp (cid:18) − (cid:114) µL k (cid:19) + (cid:115) Lµ δ (cid:41)(cid:33) , (9)where R = (cid:107) x start − x ∗ (cid:107) – the distance between starting point and the solution x ∗ . If x ∗ is not unique we take such x ∗ that is the closest to x start . Both ofthese bounds are unimprovable [16,17]. See also [15,22,32] for “indermediate”situations between accelerated and non-accelerated methods.Following to [17] we may reduce conception (2) to (7) by putting δ = δ (7) = δ L + δ µ (cid:39) δ µ (10)and changing 2-times constant µ, L . The key observations here are (cid:104) ˜ ∇ f ( x ) − ∇ f ( x ) , y − x (cid:105) ≤ L (cid:107) ˜ ∇ f ( x ) − ∇ f ( x ) (cid:107) + L (cid:107) y − x (cid:107) , (cid:104) ˜ ∇ f ( x ) − ∇ f ( x ) , y − x (cid:105) ≥ µ (cid:107) ˜ ∇ f ( x ) − ∇ f ( x ) (cid:107) − µ (cid:107) y − x (cid:107) . 
So, when µ > f ( x k ) − f ( x ∗ ) = topping rules for accelerated gradient methods with additive noise in gradient 5 ε when † µ (cid:38) ε we should put δ (2) ∼ ε that is good and rather expected.Unfortunately, for accelerated methods from this approach we will have δ (2) ∼ ε / . That is far from what we’ve declared in section 1. To improve this it’sworth to propose more detailed conception rather then (7).In the following works [16,19,20,48,47] the conception (7) was further de-veloped f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) y − x (cid:107) − δ (cid:107) y − x (cid:107) ≤ f ( y ) ≤ f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + L (cid:107) y − x (cid:107) + δ . (11)In this case (8) and (9) take a form for non-accelerated gradient methods f ( x k ) − f ( x ∗ )= O (cid:18) min (cid:26) LR k + ˜ Rδ + δ , LR exp (cid:16) − µL k (cid:17) + ˜ Rδ + δ (cid:27)(cid:19) , (12)and for accelerated ones [16,20] f ( x k ) − f ( x ∗ )= O (cid:32) min (cid:40) LR k + ˜ Rδ + kδ , LR exp (cid:18) − (cid:114) µL k (cid:19) + ˜ Rδ + (cid:115) Lµ δ (cid:41)(cid:33) , (13)where ˜ R is the maximal distance between generated points and the solution.Thus from (12), (13) we may conclude that if ˜ R is bounded, ‡ then bychoosing δ = δ (2) , δ = δ L , we will have the desired result: it is possible to reach f ( x k ) − f ( x ∗ ) = ε with δ (2) ∼ ε .But in general situation there is a problem in the assumption “if ˜ R isbounded”. As we may see from example (4) in general degenerate regime onlysuch bound ˜ R (cid:39) R + δ (2) µ (cid:38) R + δ (2) ε takes place [25]. This dependence spoils the result. The growth of ˜ R we observein different experiments. In the paper below we investigate this problem. In † If µ (cid:46) ε , we can regularize the problem and guarantee the required condition [25].Another advantage of strong convexity is possibility to use the norm of inexact gradientfor the stopping criteria [25], like in [41]. But regularization requires some prior knowledgeabout the size of the solution [25]. Since we typically don not have such information theprocedure becomes more difficult via applying the restarts [27,25]. ‡ In many situations this is true. For example, when Q is bounded, when µ (cid:29) ε . A. Vasin and A. Gasnikov and V. Spokoiny particular, we propose an alternative approach to regularization § that is basedon “early stopping” ¶ of considered iterative procedure by developing properstopping rule.Now we explain how to reduce relative inexactness (3) to (7) and to apply(9) when µ (cid:29) ε . Since f ( x ) has Lipschitz gradient from (3), (7) we may derivethat after k iterations (where k is greater than (cid:112) L/µ on a logarithmic factorlog (cid:0) LR /ε (cid:1) , where ε – accuracy in function value) f ( x k ) − f ( x ∗ ) (9) , (10) (cid:39) ε (cid:115) Lµ δ µ (cid:39) (cid:115) Lµ δ µ (3) , (7) (cid:39) (cid:115) Lµ α max t =1 ,...,k (cid:107)∇ f ( x t ) (cid:107) µ ≤ (cid:115) Lµ Lα max t =1 ,...,k ( f ( x k ) − f ( x ∗ )) µ (cid:46) (cid:115) Lµ Lα ( f ( x ) − f ( x ∗ )) µ . (14)To guarantee that (restart condition) f ( x k ) − f ( x ∗ ) ≤
12 ( f ( x ) − f ( x ∗ ))we should have α (cid:46) (cid:0) µL (cid:1) / . Then we restart the method. After log ( ∆f /ε )restarts we can guarantee the desired ε -accuracy in function value. In degener-ate case the calculations are more tricky, but the idea remains the same withthe replacing (cid:112) L/µ to k (see (9)) that lead to α k (cid:46) (cid:0) k (cid:1) / . More accurateanalysis in the subsequent part of the paper allows to improve these bounds: α (cid:46) (cid:0) µL (cid:1) / , α k (cid:46) (cid:0) k (cid:1) / .Below we’ll concentrate only on accelerated method and choose the methodwith one projection (Similar Triangles Method (STM)), see [28,10,30,48,23]and reference there in. We decided to choose this method because: 1) it’sprimal-dual [28]; 2) has a nice theory of how to bound ˜ R in no noise regime[28,37] ( ˜ R ≤ R ) and noise one [30]; 3) and has previously been intensivelyinvestigated, see [23] and references there in. In this section we describe only two directions where inexact gradient play animportant role. We emphasise that although the results below are not new,the way they are presented is of some value in our opinion and can be usefulfor specialist in these directions. § By using regularization we can guarantee µ ∼ ε and therefore with δ (2) ∼ ε we havethe desired ˜ R (cid:39) R . ¶ This terminology is popular also in Machine Learning community, where “early stop-ping” is used also as alternative to regularization to prevent overfitting [29].topping rules for accelerated gradient methods with additive noise in gradient 7 x ∈ Q ⊆ R n f ( x ) . In some applications we do not have an access to gradient ∇ f ( x ) of targetfunction, but can calculate the value of ‖ f ( x ) with accuracy δ f [11]: | ˜ f ( x ) − f ( x ) | ≤ δ f . In this case there exist different conceptions for full gradient estimation (see[7] and references there in). For example (below we assuming that f has L p -Lipschitz p -order derivatives in 2-norm), – ( p -order finite-differences) ˜ ∇ i f ( x ) = ˜ f ( x + he i ) − ˜ f ( x − he i )2 h for p = 2 , where e i is i -th ort. Here we have δ = √ nO (cid:18) L p h p + δ f h (cid:19) in the conception (2), see [7]. Optimal choice of h guarantees δ ∼ √ nδ pp +1 f .From section 1 we know that it is possible to solve the problem with accu-racy (in function value) ε ∼ δ . Hence, δ f ∼ (cid:18) ε √ n (cid:19) p +1 p . Unfortunately, such simple idea does not give tight lower bound in theclass of algorithm that has sample complexity Poly( n, ε ) [44] (obtained for p = 0, that is only Lipschitz-continuity of f required): δ f ∼ max (cid:26) ε √ n , εn (cid:27) . (15)Note, that instead of finite-difference approximation approach in some ap-plications we can use kernel approach [43,3]. The interest to this alternativehas grown last time [2,39]. ‖ Note, that the approach describe above required that function values should be availablenot only in Q , but also in some (depends on approach we used) vicinity of Q . This problemcan be solved in a two different ways. The first one is “margins inward approach” [8]. Thesecond one is “continuation” f to R n with preserving of convexity and Lipschitz continuity[44]: f new ( x ) := f (cid:0) proj Q ( x ) (cid:1) + α min y ∈ Q (cid:107) x − y (cid:107) . A. Vasin and A. Gasnikov and V. Spokoiny – (Gaussian Smoothed Gradients) ˜ ∇ f ( x ) = 1 h E ˜ f ( x + he ) e, where e ∈ N (0 , I n ) is standard normal random vector. Here we have δ = O (cid:18) n p/ L p h p + √ nδ f h (cid:19) in the conception (2), see [38,7]. 
Optimal choice of h guarantees δ ∼ ( nδ f ) pp +1 . Hence, δ f ∼ ε p +1 p n . That is also does not match the lower bound. Moreover, here (and in theapproach below) we have additional difficulty: how to estimate ˜ f ( x ). Wecan do it only roughly, for example, by using Monte Carlo approach [7].This is a payment for the better quality of approximation! – (Sphere Smoothed Gradients) ˜ ∇ f ( x ) = nh E ˜ f ( x + he ) e, where e is random vector with uniform distribution in a unit sphere (withcenter at 0) in R n . Here we have δ = O (cid:18) L p h p + nδ f h (cid:19) in the conception (2), see [7]. Optimal choice of h guarantees δ ∼ ( nδ f ) pp +1 .Hence, δ f ∼ ε p +1 p n . That is also does not match the lower bound. One can consider that the lasttwo approach are almost the same, but below we describe more accurateresult concerning Sphere smoothing. We do not know how to obtain such aresult for Gaussian smoothing. The results is as follows [16,44]: For Spheresmoothed gradient in conception (7) we have δ (cid:39) L h + √ nδ f ˜ Rh , (16)where L is Lipschitz constant of f and L = min (cid:110) L , L h (cid:111) in (7), when p = 1 and L = L h , when p = 0. The bound (16) is more accurate thanthe previous ones, since it corresponds to the first part of the lower bound(15). Indeed, by choosing properly h in (16) we obtain ε ∼ δ ∼ n / δ / f .Hence, δ f ∼ ε √ n . The rest part ( δ f ∼ εn ) of lower bound (15) is also tight, see [5]. topping rules for accelerated gradient methods with additive noise in gradient 9 The last calculations (see (16)) additionally confirm that the conceptionof inexactness and algorithms we use and develop in section 2 are also tight(optimal) enough. Otherwise, it’d be hardly possible to reach lower bound byusing gradient-free methods reduction to gradient ones and proposed analysisof an error accumulation for gradient-type methods.3.2 Inverse problemsAnother rather big direction of research where gradients are typically availableonly approximately is optimization in a Hilbert spaces [51]. Such optimizationproblems arise, in particular, in inverse problems theory [31].We start with the reminder of how to calculate a derivative in generalHilbert space. Let J ( q ) := J ( q, u ( q )) , where u ( q ) is determine as unique solution of G ( q, u ) = 0 . Assume that G q ( q, u ) is invertible, then G q ( q, u ) + G u ( q, u ) ∇ u ( q ) = 0 , hence ∇ u ( q ) = − [ G u ( q, u )] − G q ( q, u ) . Therefore ∇ J ( q ) := J q ( q, u ) + J u ( q, u ) ∇ u ( q ) = J q ( q, u ) − J u ( q, u ) [ G u ( q, u )] − G q ( q, u ) . The same result could be obtained by considering Lagrange functional L ( q, u ; ψ ) = J ( q, u ( q )) + (cid:104) ψ, G ( q, u ) (cid:105) with L u ( q, u ; ψ ) = 0 , G q ( q, u ) = 0and ∇ J ( q ) = L q ( q, u ; ψ ) . Indeed, by simple calculations we can relate these two approaches, where ψ ( q, u ) = − (cid:2) G u ( q, u ) T (cid:3) − J u ( q, u ) T . Now we demonstrate this technique on inverse problem for elliptic initial-boundary value problem.Let u be the solution of the following problem (P) u xx + u yy = 0 , x, y ∈ (0 , ,u (1 , y ) = q ( y ) , y ∈ (0 , , u x (0 , y ) = 0 , y ∈ (0 , ,u ( x,
0) = u ( x,
1) = 0 , x ∈ (0 , . The first two relations − u xx − u yy = 0 , x, y ∈ (0 , ,q ( y ) − u (1 , y ) = 0 , y ∈ (0 , , we denote as G ( q, u ) = ¯ G · ( q, u ) = 0 and the last two ones as u ∈ Q .Assume that we want to estimate q ( y ) ∈ L (0 ,
1) by observing b ( y ) = u (0 , y ) ∈ L (0 , u ( x, y ) ∈ L ((0 , × (0 , q J ( q ) := min u : ¯ G · ( q,u )=0 ,u ∈ Q J ( q, u ) := J ( u ) = (cid:90) | u (0 , y ) − b ( y ) | dy. (17)We can solve (17) numerically. This problem is convex quadratic optimizationproblem. We can directly apply Lagrange multipliers principle to (17), see [51]: L ( q, u ; ψ := ( ψ ( x, y ) , λ ( y ))) = J ( u )+ (cid:104) ψ, ¯ G · ( q, u ) (cid:105) = (cid:90) | u (0 , y ) − b ( y ) | dy − (cid:90) (cid:90) ( u xx + u yy ) ψ ( x, y ) dxdy + (cid:90) ( q ( y ) − u (1 , y )) λ ( y ) dy. To obtain conjugate problem for ψ we should vary L ( q, u ; ψ ) on δu satisfying u ∈ Q : δ u L ( q, u ; ψ ) = 2 (cid:90) ( u (0 , y ) − b ( y )) δu (0 , y ) dy − (cid:90) (cid:90) ( δu xx + δu yy ) ψ ( x, y ) dxdy − (cid:90) δu (1 , y ) λ ( y ) dy, (18)where δu x (0 , y ) = 0 , y ∈ (0 , ,δu ( x,
0) = δu ( x,
1) = 0 , x ∈ (0 , . Using integration by part, from (18) we can derive δ u L ( q, u ; ψ ) = (cid:90) (2 ( u (0 , y ) − b ( y )) − ψ x (0 , y )) δu (0 , y ) dy − (cid:90) ψ (1 , y ) δu x (1 , y ) dy − (cid:90) ψ ( x, δu y ( x, dx + (cid:90) ψ ( x, δu y ( x, dy + (cid:90) (cid:90) ( ψ xx + ψ yy ) δu ( x, y ) dxdy + (cid:90) ( ψ x (1 , y ) − λ ( y )) δu (1 , y ) dy. topping rules for accelerated gradient methods with additive noise in gradient 11 Consider corresponding conjugate problem (D) ψ xx + ψ yy = 0 , x, y ∈ (0 , ,ψ x (0 , y ) = 2 ( u (0 , y ) − b ( y )) , y ∈ (0 , ,ψ (1 , y ) = 0 , y ∈ (0 , ,ψ ( x,
0) = ψ ( x,
1) = 0 , x ∈ (0 , λ ( y ) = ψ x (1 , y ) , y ∈ (0 , . (19)These relations appears since δ u L ( q, u ; ψ ) = 0 and δu (0 , y ) , δu x (1 , y ) , δu (1 , y ) ∈ L (0 , δu y ( x, , δu y ( x, ∈ L (0 , δu ( x, y ) ∈ L ((0 , × (0 , J ( q ) = min u :( q,u ) ∈ ( P ) J ( u ) = min u : ¯ G · ( q,u )=0 ,u ∈ Q J ( u ) = min u ∈ Q max ψ ∈ ( D ) L ( q, u ; ψ ) , from the Demyanov–Danskin’s formula [45] ∗∗ ∇ J ( q ) = ∇ q min u ∈ Q max ψ ∈ ( D ) L ( q, u ; ψ ) = L q ( q, u ( q ); ψ ( q )) , where u ( q ) is the solution of (P) and ψ ( q ) is the solution of (D) where ψ x (0 , y ) = 2 ( u (0 , y ) − b ( y )) , y ∈ (0 , u (0 , y ) depends on q ( y ) via (P) and, at the same time, the pair ( u ( q ) , ψ ( q ))is the solution of min u ∈ Q max ψ ∈ ( D ) L ( q, u ; ψ )saddle-point problem. Since δ ψ L ( q, u ; ψ ) = 0 entails ¯ G · ( q, u ) = 0 that is form(P) if we add u ∈ Q and δ u L ( q, u ; ψ ) = 0, when u ∈ Q entails (D) as we’veshown above.Note also that L q ( q, u ( q ); ψ ( q ))( y ) = λ ( y ) , y ∈ (0 , . Hence, due to (19) ∇ J ( q )( y ) = ψ x (1 , y ) , y ∈ (0 , ∇ J ( q )( y ) calculation to the solution of two correct initial-boundaryvalue problem for elliptic equation in a square (P) and (D) [31]. ∗∗ The same result in more simple situation (without additional constraint u ∈ Q ) weconsider at the beginning of this section. We don’t apply Demyanov–Danskin’s formula anduse inverse function theorem.2 A. Vasin and A. Gasnikov and V. Spokoiny This result can be also interpreted in a little bit different manner. Weintroduce a linear operator A : q ( y ) := u (1 , y ) (cid:55)→ u (0 , y ) . Here u ( x, y ) is the solution of problem (P). It was shown in [31] that A : L (0 , → L (0 , . Conjugate operator is [31] A ∗ : p ( y ) := ψ x (0 , y ) (cid:55)→ ψ x (1 , y ) , A ∗ : L (0 , → L (0 , . Here ψ ( x, y ) is the solution of conjugate problem (D). So, by considering J ( q )( y ) = (cid:107) Aq − b (cid:107) , we can write ∇ J ( q )( y ) = A ∗ (2 ( Aq − b )) , that completely corresponds to the same scheme as described above: Based on q ( y ) we solve (P) and obtain u (0 , y ) = Aq ( y ) and define p ( y ) =2 ( u (0 , y ) − b ( y )) . Based on p ( y ) we solve (D) and calculate ∇ J ( q )( y ) = A ∗ p ( y ) = ψ x (1 , y ) . So inexactness in gradient ∇ J ( q ) arises since we can solve (P) and (D) onlynumerically.The described above technique can be applied to many different inverseproblems [31] and optimal control problems [51]. Note that for optimal con-trol problems in practice another strategy widely used. Namely, instead ofapproximate calculation of gradient, optimization problem replaced by approx-imate one (for example, by using finite-differences schemes). For this reduced(finite-dimensional) problem the gradient is typically available precisely [24].Moreover, in [24] the described above Lagrangian approach is based to ex-plain the core of automatic differentiation where the function calculation treerepresented as system of explicitly solvable interlocking equations. We consider convex optimization problem on a convex (not necessarily bounded)set Q ⊆ R n : min x ∈ Q f ( x ) . Assume that (cid:107) ˜ ∇ f ( x ) − ∇ f ( x ) (cid:107) (cid:54) δ, (20)where ˜ ∇ f ( x ) oracle gradient value. We consider two cases: Q is a compact setand Q is unbounded, for example R n . 
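For the assumptions and experiments below it is convenient to model the oracle (20) explicitly. The following helper (an illustrative sketch, not part of the paper; the names are ours) wraps an exact gradient so that the returned vector deviates from it by at most δ in norm:

```python
import numpy as np

def additive_noise_oracle(grad, delta, seed=0):
    """Return grad~ with ||grad~(x) - grad(x)|| <= delta for all x, cf. (20)."""
    rng = np.random.default_rng(seed)

    def grad_tilde(x):
        r = rng.standard_normal(np.shape(x))
        r *= delta / max(np.linalg.norm(r), 1e-16)   # noise vector r_x of norm exactly delta
        return grad(x) + r

    return grad_tilde
```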
We define the constant $R = \|x_{\mathrm{start}} - x^*\|$ to be the distance between the solution $x^*$ and the starting point $x_{\mathrm{start}}$; if $x^*$ is not unique, we take the $x^*$ that is closest to $x_{\mathrm{start}}$. We assume that the function $f$ has a Lipschitz gradient with constant $L_f$:
$$\forall x, y \in Q, \quad \|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|. \quad (21)$$
This implies the inequality
$$\forall x, y \in Q, \quad f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L_f}{2}\|x - y\|^2. \quad (22)$$
We will use the following lemma:

Lemma 1 (Fenchel inequality)
Let $(E, \langle\cdot,\cdot\rangle)$ be a Euclidean space. Then for all $\lambda \in \mathbb{R}_+$ and all $u, v \in E$ the following inequality holds:
$$\langle u, v\rangle \le \frac{\lambda}{2}\|u\|_E^2 + \frac{1}{2\lambda}\|v\|_E^2.$$
From the previous assumptions we can get an upper bound with the inexact oracle.
Claim 1 ∀ x, y ∈ Q, the following estimate holds: f ( y ) (cid:54) f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + L (cid:107) x − y (cid:107) + δ , where L = 2 L f , δ = δ L f .Proof The proof follows from f ( y ) (cid:54) f ( x ) + (cid:104)∇ f ( x ) , y − x (cid:105) + L f (cid:107) x − y (cid:107) (cid:54)(cid:54) f ( x )+ (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + 12 L f (cid:107)∇ f ( x ) − ˜ ∇ f ( x ) (cid:107) + L f (cid:107) x − y (cid:107) + L f (cid:107) x − y (cid:107) (cid:54)(cid:54) f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + L (cid:107) x − y (cid:107) + δ . We also assume strong convexity of f with parameter µ , however µ mayequal zero – this corresponds to the ordinary convexity, supposed initially.Further we will use only a consequence of this: f ( x ) + (cid:104)∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) (cid:54) f ( y ) . (23)We obtain similar to claim 1 two lower bounds with inexact oracle. Claim 2 ∀ x, y ∈ Q , the following estimate holds: f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) − δ (cid:107) x − y (cid:107) (cid:54) f ( y ) , where δ = δ . Proof
Using Cauchy inequality and (23) we obtain: f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) − δ (cid:107) x − y (cid:107) (cid:54) f ( x )++ (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) − (cid:107) ˜ ∇ f ( x ) − ∇ f ( x ) (cid:107) (cid:107) x − y (cid:107) (cid:54)(cid:54) f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) −− (cid:104) ˜ ∇ f ( x ) − ∇ f ( x ) , y − x (cid:105) = f ( x ) + (cid:104)∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) (cid:54) f ( y ) ⇒ f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) − δ (cid:107) x − y (cid:107) (cid:54) f ( y ) . Claim 3 ∀ x, y ∈ Q , if in (23) µ (cid:54) = 0 , the following estimate holds, f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) y − x (cid:107) − δ (cid:54) f ( y ) , where δ = δ µ .Proof Trivial calculations bring f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) − δ = f ( x ) + (cid:104)∇ f ( x ) , y − x (cid:105) ++ (cid:104) ˜ ∇ f ( x ) − ∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) − δ . Using lemma 1 we obtain: f ( x ) + (cid:104) ˜ ∇ f ( x ) , y − x (cid:105) + µ (cid:107) x − y (cid:107) − δ (cid:54) f ( x )++ (cid:104)∇ f ( x ) , y − x (cid:105) + δ µ + µ (cid:107) x − y (cid:107) + µ (cid:107) y − x (cid:107) − δ == f ( x ) + (cid:104)∇ f ( x ) , y − x (cid:105) + µ (cid:107) y − x (cid:107) (cid:54) f ( y ) . The last two inequalities give different results in convergence under certainconditions. We will study two models based on statements 2, 3 and we willdenote them by the index τ , that is denote: µ = µ,µ = µ . (24)Further in the text, we will use statements 3 and 2 in the notation correspond-ing to (24). topping rules for accelerated gradient methods with additive noise in gradient 15 In this section we describe an accelerated method we choose to investigategradient-error accumulation.
Algorithm 1
STM(L, µ, τ, x_start), Q ⊆ R^n. Input:
Starting point x start , number of steps N Set ˜ x = x start , Set A = L , Set α = L , ψ ( x ) = (cid:107) x − ˜ x (cid:107) + α (cid:16) f (˜ x ) + (cid:104) ˜ ∇ f (˜ x ) , x − ˜ x (cid:105) + µ (cid:107) x − ˜ x (cid:107) (cid:17) , Set z = argmin y ∈ Q ψ ( y ), Set x = z . for k = 1 . . . N do α k = µ τ A k − L + (cid:114) µ τ A k − L + A k − µ τ A k − ,A k = A k − + α k , ˜ x k = A k − x k − + α k z k − A k ,ψ k ( x ) = ψ k − ( x ) + α k (cid:16) ( f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , x − ˜ x k (cid:105) + µ (cid:107) x − ˜ x k (cid:107) (cid:17) ,z k = argmin y ∈ Q ψ k ( y ) ,x k = A k − x k − + α k z k A k . end forOutput: x N . Figure 5 describes the position of the vertices. On the sides, not theirlengths are marked, but the relationships in the corresponding sides in thesimilarity of triangles. In the case Q = R n , we can simplify the step of thealgorithm by replacing it with: z k = z k − − α k A k µ τ (cid:16) ˜ ∇ f (˜ x k )+ µ τ ( z k − − ˜ x k ) (cid:17) . We define constant:˜ R = max (cid:54) k (cid:54) N {(cid:107) z k − x ∗ (cid:107) , (cid:107) x k − x ∗ (cid:107) , (cid:107) ˜ x k − x ∗ (cid:107) } . We will also write down several identities that we will need in the proofs A k ( x k − ˜ x k ) = α k ( z k − ˜ x k ) + A k − ( x k − − ˜ z k ) , µ τ A k − (cid:107) z k − z k − (cid:107) = L (cid:107) x k − ˜ x k (cid:107) ,A k − (cid:107) ˜ x k − x k − (cid:107) = α k (cid:107) ˜ x k − z k − (cid:107) . (25)Some of the identities can be obtained from geometric considerations, for ex-ample, from a figure, others by direct substitution into the definitions of thesequences x k , ˜ x k , z k . Also very important are the estimates for the sequence A k . Fig. 1
Geometry of Similar Triangles method [28]
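A compact Python sketch of the iteration of Algorithm 1 for the unconstrained Euclidean case Q = R^n, where z_k = argmin_y ψ_k(y) has a closed form. The step size α_k is recovered assuming the relation Lα_k² = A_k(1 + µ_τ A_{k−1}) used in the proof of Claim 4, and the initialization α₀ = A₀ = 1/L follows the listing above; this is our own sketch under those assumptions, not the authors' reference implementation.

```python
import numpy as np

def stm(grad_tilde, x_start, L, mu_tau=0.0, n_steps=100):
    """Similar Triangles Method with an inexact gradient oracle, for Q = R^n."""
    x_tilde = np.asarray(x_start, dtype=float).copy()
    alpha = 1.0 / L                        # alpha_0 = A_0 = 1/L
    A = alpha
    g = grad_tilde(x_tilde)
    # psi_k is minimized in closed form:
    # z_k = (x_tilde_0 + sum_j alpha_j * (mu_tau * x_tilde_j - g_j)) / (1 + mu_tau * A_k)
    acc = x_tilde + alpha * (mu_tau * x_tilde - g)
    z = acc / (1.0 + mu_tau * A)           # z_0 = argmin psi_0
    x = z.copy()
    for _ in range(n_steps):
        b = 1.0 + mu_tau * A               # solve L * alpha^2 = (A + alpha) * b for alpha_k
        alpha = (b + np.sqrt(b * b + 4.0 * L * A * b)) / (2.0 * L)
        A_new = A + alpha
        x_tilde = (A * x + alpha * z) / A_new
        g = grad_tilde(x_tilde)
        acc += alpha * (mu_tau * x_tilde - g)
        z = acc / (1.0 + mu_tau * A_new)   # z_k = argmin psi_k over R^n
        x = (A * x + alpha * z) / A_new
        A = A_new
    return x
```

For Q = R^n this closed-form update of z_k coincides with the simplified step mentioned after the listing.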
Claim 4 If µ (cid:54) = 0 and ∀ k ∈ N the following inequality holds: A k (cid:62) A k − λ µ τ ,L , where θ µ τ ,L = µ τ L , λ µ τ ,L = (cid:18) θ µ τ ,L + 12 (cid:113) θ µ τ ,L + 4 θ µ τ ,L (cid:19) . Proof
Using the definition of the sequences A k and solving the quadratic equa-tion, we obtain: A k (1 + µ τ A k − ) = Lα k = L ( A k − A k − ) = LA k − LA k A k − + LA k − ⇔⇔ A k − A k (1 + A k − (2 + θ µ τ ,L ))+ A k − = 0 , D = (1 + 2 A k − + θ µ τ ,L ) − A k A k,apex = 12 + (cid:18) θ µ τ ,L (cid:19) A k ⇒ A k = 12 (cid:16) (1 + (2 + θ µ τ ,L )) A k − + √D (cid:17) √D = (cid:114) θ µ τ ,L + 4) A k − + (cid:16) θ µ τ ,L + 4 θ µ,L (cid:17) A k − (cid:62) A k − (cid:113) θ µ,L + 4 θ µ,L ⇒⇒ A k (cid:62) (cid:16) θ µ τ ,L + (cid:113) θ µ τ ,L + 4 θ µ τ ,L (cid:17) A k − == (cid:18) θ µ τ ,L + 12 (cid:113) θ µ τ ,L + 4 θ µ τ ,L (cid:19) A k − ⇒⇒ A k (cid:62) L λ kµ τ ,L , λ µ τ ,L = (cid:18) θ µ τ ,L + 12 (cid:113) θ µ τ ,L + 4 θ µ τ ,L (cid:19) . topping rules for accelerated gradient methods with additive noise in gradient 17 Corollary 1 λ µ τ ,L > (cid:18) (cid:112) θ µ τ ,L (cid:19) = (cid:18) (cid:112) θ µ τ ,L + 14 θ µ τ ,L (cid:19) , (cid:18) (cid:112) θ µ τ ,L (cid:19) > e θ µτ ,L . Claim 5 If µ (cid:54) = 0 ∀ k ∈ N the following inequality holds: k (cid:88) j =0 A j A k (cid:54) (cid:115) Lµ τ .λ µ τ ,L = (cid:16) θ µ τ ,L + (cid:113) θ µ τ ,L + 4 θ µ τ ,L (cid:17) , θ µ τ ,L = µ τ L . Proof
Using previous claim we can reduce this amount exponentially. k (cid:88) j =0 A j A k (cid:54) k (cid:88) j =0 λ − jµ τ ,L = λ k +1 µ τ ,L − λ k +1 µ τ ,L − λ kµ τ ,L (cid:54) λ µ τ ,L λ µ τ ,L − (cid:54) (cid:115) Lµ τ . Claim 6 If µ = 0 then: A k (cid:62) ( k + 1) L .
Proof If $\mu = 0$ then $A_k = L\alpha_k^2$ and, solving the quadratic equation, we get:
$$\alpha_k = \frac{1 + \sqrt{1 + 4L^2\alpha_{k-1}^2}}{2L}.$$
Then by induction it is easy to get that
$$\alpha_k \ge \frac{k+1}{2L} \;\Rightarrow\; A_k = L\alpha_k^2 \ge \frac{(k+1)^2}{4L}.$$
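A quick numerical sanity check (illustrative only, using our reading of the recursion and of the bound in Claim 6) that A_k indeed grows at least as (k+1)²/(4L) when µ = 0:

```python
import numpy as np

L = 1.0
alpha, A = 1.0 / L, 1.0 / L                  # alpha_0 = A_0 = 1/L, as in the listing of Algorithm 1
for k in range(1, 1001):
    alpha = (1.0 + np.sqrt(1.0 + 4.0 * (L * alpha) ** 2)) / (2.0 * L)
    A += alpha
    assert A >= (k + 1) ** 2 / (4.0 * L)     # the bound of Claim 6
print("Claim 6 bound holds for the first 1000 iterations")
```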
Claim 7 If µ = 0 we have: k (cid:88) j =0 A j A k (cid:54) k. Proof
The proof follows from the simple calculations: k (cid:88) j =0 A j A k (cid:54) k (cid:88) j =0 a j a k (cid:54) kα k − α k = kα k − ( L + (cid:113) L + α k − ) (cid:54) k. Lemma 2 ∀ k (cid:62) the following inequality holds: ψ k ( z k ) (cid:62) ψ k − ( z k − ) + 1 + µ τ A k − (cid:107) z k − z k − (cid:107) ++ α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) + µ τ (cid:107) z k − ˜ x k (cid:107) (cid:17) . Proof
From the definition of the ψ k − function, it has a minimum at the point z k − , then: (cid:104)∇ ψ k − ( z k − ) , z k − z k − (cid:105) (cid:62) , ∇ ψ k − ( z k − ) = ( z k − − ˜ x ) ++ k − (cid:88) j =0 α j (cid:16) ˜ ∇ f (˜ x j ) + µ τ ( z k − − ˜ x j ) (cid:17) ⇒⇒ ψ k ( z k ) = ψ k − ( z k ) + α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) + µ τ (cid:107) z k − ˜ x k (cid:107) (cid:17) == 12 (cid:107) z k − ˜ x (cid:107) + k − (cid:88) j =0 α j (cid:16) f (˜ x j ) + (cid:104) ˜ ∇ f (˜ x j ) , z k − ˜ x j (cid:105) + µ τ (cid:107) z k − ˜ x j (cid:107) (cid:17) ++ α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) (cid:17) + µ τ (cid:107) z k − ˜ x k (cid:107) . From (25) and the above we obtain: ψ k ( z k ) (cid:62) (cid:107) z k − − ˜ x (cid:107) + (cid:104) z k − − ˜ x , z k − z k − (cid:105) + 12 (cid:107) z k − − z k (cid:107) ++ k − (cid:88) j =0 α j (cid:16) f (˜ x j ) + (cid:104) ˜ ∇ f (˜ x j ) , z k − ˜ x j (cid:105) + µ τ (cid:107) z k − ˜ x j (cid:107) (cid:17) ++ α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) + µ τ (cid:107) z k − ˜ x k (cid:107) (cid:17) == k − (cid:88) j =0 α j (cid:16) (cid:104) ˜ ∇ f (˜ x j ) + µ τ ( z k − − ˜ x j ) , z k − − z k (cid:105) (cid:17) ++ k − (cid:88) j =0 α j (cid:16) f (˜ x j ) + (cid:104) ˜ ∇ f (˜ x j ) , z k − ˜ x j (cid:105) + µ (cid:107) z k − ˜ x j (cid:107) (cid:17) ++ α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) + µ τ (cid:107) z k − ˜ x k (cid:107) (cid:17) + 12 (cid:107) z k − − ˜ x (cid:107) + 12 (cid:107) z k − − z k (cid:107) . Using the linearity of the dot product, we split the sum by two and apply to: µ τ (cid:104) z k − − ˜ x j , z k − − z k (cid:105) . topping rules for accelerated gradient methods with additive noise in gradient 19 Equality from (25), and finally get: ψ k ( z k ) (cid:62) (cid:107) z k − − ˜ x (cid:107) + 1 + µ τ A k − (cid:107) z k − − z k (cid:107) ++ k − (cid:88) j =0 α j (cid:16) f (˜ x j ) + (cid:104) ˜ ∇ f (˜ x j ) , z k − − ˜ x j (cid:105) + µ τ (cid:107) z k − − ˜ x j (cid:107) (cid:17) ++ α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) + µ τ (cid:107) z k − ˜ x k (cid:107) (cid:17) == ψ k − ( z k − ) + 1 + µ τ A k − (cid:107) z k − z k − (cid:107) ++ α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) + µ τ (cid:107) z k − ˜ x k (cid:107) (cid:17) . Remark 1
In the case µ = 0, we obtain a corollary from the strongly convexity of functions ψ k and their definition, that is: ψ k ( z k ) = ψ k − ( z k ) + α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) (cid:17) ⇒ ψ k ( z k ) (cid:62) ψ k − ( z k − ) + 12 (cid:107) z k − z k − (cid:107) + α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) (cid:17) . Here we will describe some results based on the previously presented lemmasand statements.6.1 Additive noise and main theorems.
Theorem 1 ∀ k ∈ N the following inequality holds: A k f ( x k ) (cid:54) ψ k ( z k ) + δ k (cid:88) j =0 A j + 2 ˜ Rδ A k . Proof
Base, k = 0: f ( x ) (cid:54) f (˜ x ) + (cid:104) ˜ ∇ f (˜ x ) , x − ˜ x (cid:105) + L (cid:107) x − ˜ x (cid:107) + δ (cid:54)(cid:54) Lψ ( z ) − Lµ (cid:107) z − ˜ x (cid:107) + δ (cid:54) Lψ ( z ) + δ . Induction step: A k f ( x k ) − A k − δ (cid:107) x k − − ˜ x k (cid:107) (cid:54)(cid:54) A k (cid:18) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , x k − ˜ x k (cid:105) + L (cid:107) x k − ˜ x k (cid:107) + δ (cid:19) − A k − δ (cid:107) x k − − ˜ x k (cid:107) . Using equations (25) we obtain: A k f ( x k ) − A k − δ (cid:107) x k − − ˜ x k (cid:107) (cid:54)(cid:54) A k − (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , x k − − ˜ x k (cid:105) (cid:17) + α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) (cid:17) ++ (1 + µ A k − )2 (cid:107) z k − z k − (cid:107) + A k δ − A k − δ (cid:107) x k − − ˜ x k (cid:107) (cid:54)(cid:54) A k − f ( x k − ) + α k ( f (˜ x k ) (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) ++ 1 + µ A k − (cid:107) z k − z k − (cid:107) + A k δ . Using the induction hypothesis, we obtain: A k f ( x k ) − A k − δ (cid:107) x k − − ˜ x k (cid:107) (cid:54) ψ k − ( z k − ) + δ k − (cid:88) j =0 A j + 2 ˜ Rδ A k − ++ 1 + µ A k − (cid:107) z k − z k − (cid:107) + α k (cid:16) f (˜ x k ) + (cid:104) ˜ ∇ f (˜ x k ) , z k − ˜ x k (cid:105) (cid:17) + A k δ . Using lemma 2 we can get: A k f ( x k ) (cid:54) A k − δ (cid:107) x k − − ˜ x k (cid:107) + ψ k ( z k ) + δ k (cid:88) j =0 A j + 2 ˜ Rδ A k − == ψ k ( z k ) + δ k (cid:88) j =0 A j + 2 ˜ Rδ A k − + α k (cid:107) ˜ x k − z k − (cid:107) (cid:54)(cid:54) ψ k ( z k ) + δ k (cid:88) j =0 A j + 2 ˜ Rδ A k − + α k ( (cid:107) z k − − x ∗ (cid:107) + (cid:107) ˜ x k − x ∗ (cid:107) ) δ (cid:54)(cid:54) ψ k ( z k ) + δ k (cid:88) j =0 A j + 2 ˜ Rδ A k − + 2 α k ˜ R ⇒⇒ A k f ( x k ) (cid:54) ψ ( z k ) + δ k (cid:88) j =0 A j + 2 ˜ Rδ A k . Remark 2
We should note that this inequality is true both in the case of µ (cid:54) = 0 and inthe case of µ = 0. Theorem 2 If µ (cid:54) = 0 ∀ k ∈ N the following inequality holds: A k f ( x k ) (cid:54) ψ k ( z k ) + δ k (cid:88) j =0 A j + δ k − (cid:88) j =0 A j The proof repeats verbatim theorem 1, except for claim 2, replaced by claim 3. topping rules for accelerated gradient methods with additive noise in gradient 21
Theorem 3 If δ = δ = δ = 0 then ˜ R = R .Proof Using theorem 1 we get A k f ( x k ) (cid:54) ψ k ( z k ) then:12 (cid:107) z k − x ∗ (cid:107) = 12 (cid:107) z k − x ∗ (cid:107) + A k f ( x k ) − A k f ( x k ) (cid:54)(cid:54) ψ k ( z k ) + 12 (cid:107) z k − x ∗ (cid:107) − A k f ( x k ) (cid:54) k (cid:88) j =0 α j ( f (˜ x j ) + (cid:104)∇ f (˜ x k ) , x ∗ − ˜ x k (cid:105) ++ µ (cid:107) x ∗ − x k (cid:107) ) + 12 (cid:107) x ∗ − ˜ x (cid:107) (cid:54) A k f ( x k ) − A k f ( x k ) + 12 (cid:107) x − x ∗ (cid:107) = 12 R . We now prove bound for the sequence x k , similarly for ˜ x k . We prove by in-duction, so assume fairness for k −
1; the base case is obvious.
$$\|x_k - x^*\| = \left\|\frac{A_{k-1}}{A_k}(x_{k-1} - x^*) + \frac{\alpha_k}{A_k}(z_k - x^*)\right\| \le \frac{A_{k-1}}{A_k}\|x_{k-1} - x^*\| + \frac{\alpha_k}{A_k}\|z_k - x^*\| \le \frac{A_{k-1} + \alpha_k}{A_k}\,R = R.$$
Both inequalities take place with µ (cid:54) = 0 f ( x N ) − f ( x ∗ ) (cid:54) LR exp (cid:18) − (cid:114) µL N (cid:19) + (cid:32) (cid:115) Lµ (cid:33) δ + 3 ˜ Rδ ,f ( x N ) − f ( x ∗ ) (cid:54) LR exp (cid:18) − (cid:114) µ L N (cid:19) + (cid:32) (cid:115) Lµ (cid:33) δ + (cid:32) (cid:115) Lµ (cid:33) δ . Proof
Using, all of the above is easy to show what is required, the proof ofboth convergence is the same with the replacement of theorem 1 by theorem 2and replacement claim 2 by claim 3, therefore, we present only the proof ofthe first inequality. A N f ( x N ) (cid:54) ψ N ( z N ) + δ N (cid:88) j =0 A j + 2 ˜ Rδ A N (cid:54) (cid:107) x ∗ − ˜ x (cid:107) ++ δ N (cid:88) j =0 A j + 2 ˜ Rδ A N + N (cid:88) j =0 α k ( f (˜ x j ) + (cid:104) ˜ ∇ f (˜ x j ) , x ∗ − ˜ x j (cid:105) + µ (cid:107) x ∗ − x i (cid:107) ) (cid:54)(cid:54) δ N (cid:88) j =0 A j + 2 ˜ Rδ A N + N (cid:88) j =0 α k ( ˜ Rδ + f ( x ∗ )) + 12 R == δ N (cid:88) j =0 A j + 3 ˜ Rδ A N + A N f ( x ∗ ) + 12 R ⇐⇒⇐⇒ f ( x N ) − f ( x ∗ ) (cid:54) LR exp (cid:18) − (cid:114) µL N (cid:19) + (cid:32) (cid:115) Lµ (cid:33) δ + 3 ˜ Rδ . Remark 3 If µ = 0 we can get analogue of the first convergence, repeating the proof usingclaims 6, 7. f ( x N ) − f ( x ∗ ) (cid:54) LR N + 3 ˜ Rδ + N δ . Remark 4
Suppose µ = 0, then consider the auxiliary problem: f µ ( x ) = f ( x ) + µ (cid:107) x − ˜ x (cid:107) → min x ∈ Q , ∇ f µ ( x ) = ∇ f ( x ) + µ ( x − ˜ x ) , ˜ ∇ f µ ( x ) = ˜ ∇ f ( x ) + µ ( x − ˜ x ) , (cid:107) ˜ ∇ f µ ( x ) − ∇ f µ ( x ) (cid:107) = (cid:107) ˜ ∇ f ( x ) − ∇ f ( x ) (cid:107) (cid:54) δ. The resulting function will satisfy the condition that the gradient is Lipschitz,that is ∀ x, y ∈ Q : (cid:107)∇ f µ ( x ) − ∇ f µ ( y ) (cid:107) = (cid:107) ( ∇ f ( x ) − ∇ f ( y )) + µ ( x − y ) (cid:107) (cid:54)(cid:54) (cid:107) ( ∇ f ( x ) − ∇ f ( y ) (cid:107) + µ (cid:107) x − y (cid:107) (cid:54)(cid:54) L f (cid:107) x − y (cid:107) + µ (cid:107) x − y (cid:107) (cid:54) ( L f + µ ) (cid:107) x − y (cid:107) . We will assume, that µ <
1. That is, we can let L µ = 2( L f + 1) = L + 2 (cid:62) L f µ .The resulting function will already be strongly convex, which means that thesecond model is applicable to it τ = 2. Using theorem 4 we can get the followinginequality: x ∗ µ = argmin x ∈ Q f µ ( x ) ,R µ = (cid:107) x ∗ µ − ˜ x (cid:107) ,f µ ( x k ) − f µ ( x ∗ µ ) (cid:54) L µ R µ λ k µ , L µ + (cid:32) (cid:115) L + 4 µ (cid:33) ( δ + δ ) ⇒ f µ ( x k ) − f µ ( x ∗ µ ) (cid:54) LR µ exp (cid:18) − (cid:114) µ L + 2) k (cid:19) + (cid:32) (cid:115) L + 4 µ (cid:33) (cid:18) L + 1 µ (cid:19) δ ,f µ ( x ∗ µ ) (cid:54) f ( x ∗ ) + µ R . Then we can get convergence rate for not regularized function: f ( x k ) − f ( x ∗ ) (cid:54) f µ ( x k ) − f ( x ∗ ) (cid:54) f µ ( x k ) − f ( x ∗ µ ) + µ R (cid:54)(cid:54) LR µ exp (cid:18) − (cid:114) µ L + 2) k (cid:19) ++ (cid:32) (cid:115) L + 4 µ (cid:33) (cid:18) L + 1 µ (cid:19) δ + µ R . topping rules for accelerated gradient methods with additive noise in gradient 23 Using strong convexity of the function f µ we get: f ( x ∗ ) + µ R µ (cid:54) f ( x ∗ µ ) + µ R µ = f µ ( x ∗ µ ) (cid:54) f µ ( x ∗ ) = f ( x ∗ ) + µ R ⇒ R µ (cid:54) R. Finally we get convergence: f ( x k ) − f ( x ∗ ) (cid:54) LR exp (cid:18) − (cid:114) µ L + 2) k (cid:19) ++ (cid:32) (cid:115) L + 4 µ (cid:33) (cid:18) L + 1 µ (cid:19) δ + µ R . We choose value for parameter µ in the remark 9. Remark 5
If we consider the problem in the first model τ = 1, the case µ = 0 and assumethat (cid:107) x ∗ (cid:107) (cid:54) R ∗ . Then we choose a starting point for the ST M algorithm ina ball of radius R ∗ , specifically put x start = 0. R = (cid:107) x ∗ − ˜ x (cid:107) (cid:54) R ∗ Let us formulate a stopping rule for the this model ( ∀ ζ > f ( x k ) − f ( x ∗ ) (cid:54) kδ + R ∗ δ + δ k (cid:88) j =1 α j A k (cid:107) ˜ x j − z j − (cid:107) + ζ. Lemma 3 (Bound for ˜ R ) Before the stopping criterion is satisfied, the fol-lowing inequality holds: ˜ R (cid:54) R. Proof
Note, that from (cid:107) z k − x ∗ (cid:107) (cid:54) R we get (cid:107) x k − x ∗ (cid:107) (cid:54) R, (cid:107) ˜ x k − x ∗ (cid:107) (cid:54) R similarly to theorem 3. But it’s worth noting that to estimate (cid:107) ˜ x k − x ∗ (cid:107) , onlyinequalities are required for all j (cid:54) k − (cid:107) ˜ x k − x ∗ (cid:107) = (cid:107) A k − A k ( x k − − x ∗ ) + α k A k ( z k − − x ∗ ) (cid:107) (cid:54)(cid:54) A k − A k (cid:107) x k − − x ∗ (cid:107) + α k A k (cid:107) z k − − x ∗ (cid:107) (cid:54) R An analysis of the proof of theorem 2 gives a stronger convergence: A k f ( x k ) (cid:54) ψ k ( z k ) + δ k (cid:88) j =0 A j + δ k (cid:88) j =1 α j (cid:107) ˜ x j − z j − (cid:107) Then, using the convexity of the function ψ k we get: A k f ( x k ) + 12 (cid:107) z k − x ∗ (cid:107) (cid:54) (cid:107) z k − x ∗ (cid:107) + ψ k ( z k ) + δ k (cid:88) j =0 A j ++ δ k (cid:88) j =1 α j (cid:107) ˜ x j − z j − (cid:107) (cid:54) ψ k ( x ∗ ) + δ k (cid:88) j =0 A j ++ δ k (cid:88) j =1 α j (cid:107) ˜ x j − z j − (cid:107) (cid:54) R + A k f ( x ∗ ) + δ k (cid:88) j =0 A j ++ δ k (cid:88) j =1 α j (cid:107) ˜ x j − z j − (cid:107) + δ k (cid:88) j =0 α j (cid:107) z k − x ∗ (cid:107) ⇒
12 ( R − (cid:107) z k − x ∗ (cid:107) ) (cid:62)(cid:62) A k ( f ( x k ) − f ( x ∗ )) − kδ + δ k (cid:88) j =1 α j A k (cid:107) ˜ x j − z j − (cid:107) + R ∗ δ + ζ (cid:62) . Therefore, when the stopping criterion is met, we will receive the estimate: f ( x k ) − f ( x ∗ ) (cid:54) kδ + δ R ∗ + δ k (cid:88) j =0 α j (cid:107) ˜ x j − z j − (cid:107) + ζ. From remark 3 we get an estimate of the number of iterations: N stop (cid:62) (cid:115) LR ζ .f ( x N ) − f ( x ∗ ) (cid:54) LR N + N δ + R ∗ δ + δ N (cid:88) j =1 α j A N (cid:107) ˜ x j − z j − (cid:107) (cid:54)(cid:54) N δ + R ∗ δ + δ N (cid:88) j =1 (cid:107) ˜ x j − z j − (cid:107) + ζ ⇒ LR N (cid:54) ζ ⇔ N (cid:62) LR ζ . Summing up, we obtain the following theorem:
Theorem 5
For model τ = 1 with µ = 0 , using stopping rule: f ( x N ) − f ( x ∗ ) (cid:54) N δ + R ∗ δ + δ N (cid:88) j =1 α j A N (cid:107) ˜ x j − z j − (cid:107) + ζ. We can guarantee, that: ˜ R (cid:54) R. And the criterion is reached after: N stop = (cid:34) (cid:115) LR ζ (cid:35) + 1 . topping rules for accelerated gradient methods with additive noise in gradient 25 L = 2 L f . Where L f – Lipschitz constant of ∇ f . From theorem 2 and similar to theorem 4reasoning we get: f ( x k ) − f ( x ∗ ) (cid:54) R A k + 32 k − (cid:88) j =0 A j α (cid:107)∇ f ( x k ) (cid:107) A k µ + 3 α (cid:107)∇ f ( x k ) (cid:107) µ ,∆ k = f ( x k ) − f ( x ∗ ) ,∆ k (cid:54) R A k + 32 k − (cid:88) j =0 A j α (cid:107)∇ f ( x j ) (cid:107) A k µ + 3 α µ (cid:107)∇ f ( x k ) (cid:107) . We define: θ = 3 Lα µ (1 − Lα µ ) ,λ = R − Lα µ . From inequality: (cid:107)∇ f ( x k ) (cid:107) (cid:54) L ( f ( x k ) − f ( x ∗ )) . We obtain: ∆ k (cid:54) λA k + θ k − (cid:88) j =0 A j A k ∆ j . In these designations by induction we can obtain:
Claim 8 ∆ k (cid:54) (1 + θ ) k − A k λ + θ A (1 + θ ) k − A k ∆ . Proof
Base, k = 1 is obvious. Induction step: ∆ k (cid:54) λA k + θ k − (cid:88) j =0 A j A k ∆ j (cid:54)(cid:54) λA k + k − (cid:88) j =0 (cid:18) A j A k (1 + θ ) j − A k λ + θ A (1 + θ ) j − A k ∆ (cid:19) + A A k ∆ (cid:54)(cid:54) λA k + k − (cid:88) j =0 (cid:18) λ (1 + θ ) j A k + θ A (1 + θ ) j A k ∆ (cid:19) + A A k ∆ == (1 + θ ) k − A k λ + θ A (1 + θ ) k − A k ∆ . That is we can formulate the following inequality: f ( x k ) − f ( x ∗ ) (cid:54) λ (1 + θ ) k A k + θ A (1 + θ ) k A k ( f ( x ) − f ( x ∗ )) . Using corollary 1 we can estimate: A k (cid:62) (cid:18) (cid:114) µ L (cid:19) k A . (26)We will choose an alpha such that:1 + θA k (cid:54)
11 + √ (cid:112) µL . Using (26) and definition of θ we obtain, that we should choose α from: α (cid:54) (cid:118)(cid:117)(cid:117)(cid:116) √ (cid:16) Lµ (cid:17) + Lµ α = O (cid:18)(cid:16) µL (cid:17) (cid:19) (27)From simple inequality:1 + 13 √ (cid:114) µL > exp (cid:18) √ (cid:114) µL (cid:19) . We get the following theorem:
Theorem 6
If in the model described in (3) in the strongly convex case wecan chose α according to (27) we obtain: f ( x k ) − f ( x ∗ ) (cid:54) (cid:18) LR − α + 3 Lα µ (1 − α ) ( f ( x ) − f ( x ∗ )) (cid:19) exp (cid:18) − √ (cid:114) µL (cid:19) . Corollary 2
Under the conditions of the theorem, we obtain convergence inthe argument: (cid:107) x k − x ∗ (cid:107) (cid:54) R (cid:18) Lµ (1 − α ) + 3 L α µ (1 − α ) (cid:19) exp (cid:18) − √ (cid:114) µL (cid:19) . Proof
This is a direct consequence of the inequalities:
$$f(x_k) - f(x^*) \le \frac{L}{2}\|x_k - x^*\|^2, \qquad f(x_k) - f(x^*) \ge \frac{\mu}{2}\|x_k - x^*\|^2.$$

Remark 6
Using Theorem 3 and assuming that $Q$ is a compact set, we can take $R = \mathrm{diam}(Q)$ instead of $\|x_{\mathrm{start}} - x^*\|$; then we can also bound $\tilde{R} \le R$, and this will simplify the bounds in Theorem 4. Remark 7
With the same assumption µ (cid:54) = 0 we obtain a comparison of the two conver-gences in the Theorem 4. Recall that: δ = δ, δ = δ L , δ = δ µ . So if δ < R (cid:113) Lµ µ + (cid:113) Lµ ( √ − L . Then the accumulation of noise in the model corresponding to τ = 2, thatdescribed in (3) is less than in model τ = 1, described in (2). Remark 8
If we use model τ = 2, described in theorem 4 one can set the desired accuracyof the solution. f ( x N ) − f ( x ∗ ) (cid:54) ε. Then we get from theorem 4 that: f ( x N ) − f ( x ∗ ) (cid:54) LR exp (cid:18) − (cid:114) µ L N (cid:19) + (cid:32) (cid:115) Lµ (cid:33) δ + (cid:32) (cid:115) Lµ (cid:33) δ ,f ( x N ) − f ( x ∗ ) (cid:54) LR exp (cid:18) − (cid:114) µ L N (cid:19) + (cid:32) L + µ (cid:112) µ L ( √ (cid:33) δ . That is we can get estimates for δ value and number of steps N : (cid:32) L + µ (cid:112) µ L ( √ (cid:33) δ (cid:54) ε ,δ (cid:54) √ ε (cid:115) √ (cid:115) L + µ (cid:112) µ L ,δ = O (cid:32) √ ε ( L + µ ) ( µ L ) (cid:33) ; LR exp (cid:18) − (cid:114) µ L N (cid:19) (cid:54) ε ,N (cid:62) (cid:115) Lµ (cid:0) ln 2 LR + ln ε − (cid:1) ,N = O (cid:32)(cid:115) Lµ ln LR ε (cid:33) . Remark 9
Using remark 4 and previous remark 8, we can found similar bounds. Remindthat: f ( x N ) − f ( x ∗ ) (cid:54) LR exp (cid:18) − (cid:114) µ L + 2) N (cid:19) ++ (cid:32) (cid:115) L + 4 µ (cid:33) (cid:18) L + 1 µ (cid:19) δ + µ R . However we should value of the parameter µ . We will let: µ = 23 εR . Using inequality: δ (cid:32) (cid:115) L + 4 µ (cid:33) (cid:18) µ + LµL (cid:19) (cid:54) ε . And the selected value of the parameter mu we get required value of error δ : δ (cid:54) (cid:18) (cid:19) (cid:112) √ L + 4 R − ε ,δ = O (cid:16) L − R − ε (cid:17) . topping rules for accelerated gradient methods with additive noise in gradient 29 Similarly, get an estimate of the number of steps: LR exp (cid:18) − (cid:114) µ L + 2) N (cid:19) (cid:54) ε ,N (cid:62) √ L + 24 R ln 2 LR + 2 √ L + 4 1 √ ε ln 1 ε ,N = O (cid:32)(cid:114) Lε ln LR ε (cid:33) . Remark 10
Using remark 5 and theorem 5 we can apply it to problem: Ax = b,A ∈ GL n ( R ) . Solving such a problem is equivalent to solving the convex optimization prob-lem: f ( x ) = 12 (cid:107) Ax − b (cid:107) → min , ∇ f ( x ) = A T ( Ax − b ) . We will assume similarly the estimate of the norm x ∗ : (cid:107) x ∗ (cid:107) (cid:54) R ∗ . Let the original problem be solved with an ε accuracy in the sense: (cid:107) Ax − b (cid:107) (cid:54) ε ,f ( x ) − f ( x ∗ ) = 12 (cid:107) Ax − b (cid:107) (cid:54) ε,ε = 12 ε . When the algorithm stops, we get the convergence: f ( x N stop ) − f ( x ∗ ) (cid:54) N δ + 3 δ R ∗ ,N stop = (cid:34) (cid:115) LR ζ (cid:35) + 1 . Then we choose δ, ζ from the following conditions: ζ (cid:54) ε ,δ (cid:54) (cid:18) L √ R (cid:19) ε ,δ (cid:54) ε R ∗ . For example, we can let: δ = C R,R ∗ ,L ε, C R,R ∗ ,L = min (cid:40) L √ R , R ∗ (cid:41) . Then the number of steps required is expressed as: N ε = (cid:34) (cid:114) LR ε (cid:35) + 1 . Accordingly, the estimate required for solving the problem of linear equations: N ε = (cid:34) √ LR ε (cid:35) + 1 . Remark 11
The work considered a model of additive noise in equation (20), similar to [41]; that is, we can consider that $\tilde\nabla f(x) = \nabla f(x) + r_x$, $\|r_x\| \le \delta$. Similarly to that work, a stopping criterion was proposed for the
ST M algo-rithm, as was proposed for gradient descent. x k +1 = x k − L ∇ f ( x k ) . Note that in the same noise model, the convergence estimate in both consideredcases will be: j N = argmin (cid:54) k (cid:54) N f ( x k ) ,y N = x j N ,f ( y N ) − f ( x ∗ ) = O (cid:18) LR N + δ L + ˜ Rδ (cid:19) ,f ( y N ) − f ( x ∗ ) = O (cid:18) LR exp (cid:16) − µL N (cid:17) + δ L + ˜ Rδ (cid:19) .f ( y N ) − f ( x ∗ ) = O (cid:18) LR exp (cid:16) − µ L (cid:17) + δ L + δ µ (cid:19) , ˜ R = max k (cid:54) N (cid:107) x k − x ∗ (cid:107) . Despite the fact that in the work [17], a slightly different model was considered,namely ( δ, L ) and ( δ, L, µ ) oracle (equation 3.1 Definition 1 in [17]) similarorders of convergence were obtained, that is theorem 4 and relevant remark topping rules for accelerated gradient methods with additive noise in gradient 31
3. Namely, function satisfies the ( δ, L, µ ) model at point x ∈ Q means, thatexists functions f δ ( x ) and ψ δ ( x, y ), such that: ∀ y ∈ Qµ (cid:107) x − y (cid:107) (cid:54) f ( x ) − f δ ( y ) − ψ δ ( x, y ) (cid:54) L (cid:107) x − y (cid:107) + δ. Similarly to papers [48], [17], the results also hold in the case of an unboundedset Q (result in [48] is on the page 26, obtained for fast adaptive gradientmethod page 13). Stopping criteria are also formulated, which give an estimateon ˜ R for a non-compact Q , remind that:˜ R = max (cid:54) k (cid:54) N {(cid:107) x k − x ∗ (cid:107) , (cid:107) ˜ x k − x ∗ (cid:107) , (cid:107) z k − x ∗ (cid:107) } . We also note that a similar models of ( δ, ∆, L ) and ( δ, ∆, L, µ ) oracle wasconsidered in the work [47]. Moreover, the function satisfies ( δ, ∆, L, µ )–model f ( y ) (cid:54) f δ + ψ ( y, x ) + ∆ (cid:107) x − y (cid:107) + δ + LV ( y, x ) ,f δ + ψ ( x ∗ , x ) + µV ( y, x ) (cid:54) f ( x ∗ ) ,f ( x ) − δ (cid:54) f δ ( x ) (cid:54) f ( x ) ,ψ ( x, x ) = 0 . Here V ( x, y ) – Bregman divergence. At the same time, an adaptive analogue ofSTM was considered. As well as similar estimations for a δ and number of steps N , following [17] (page 24, remarks 11 – 14 ), namely there are remarks 8, 9,10. Also considered an example of using regularization to obtain convergencein the model τ = 1, for the case µ = 0. For testing
ST M for degenerate problems, the function described in [37] onpage 69, that is: f ( x ) = L x + k − (cid:88) j =0 ( x j − x j +1 ) + x k − L x x ∗ = (cid:18) − k + 1 , . . . , − kk + 1 , , . . . , (cid:19) T (cid:54) k (cid:54) dim x These two plots reflect the convergence of the method at the first 50 000 and10 000 iterations, respectively, at different δ . Fig. 2
First test – first 50 000 steps.
Fig. 3
First test – first 10 000 steps.
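For reference, the first test could be reproduced along the following lines, reusing the additive_noise_oracle and stm sketches given earlier. The explicit form of the degenerate worst-case quadratic below is our reading of the function from [37], and the dimension, noise level and number of steps are placeholders rather than the values used by the authors.

```python
import numpy as np

def nesterov_worst_case(n, L):
    """Degenerate (mu = 0) worst-case quadratic, in the form f(x) = (L/4) * (0.5 * x^T A x - x_1)."""
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiagonal second-difference matrix
    e1 = np.zeros(n)
    e1[0] = 1.0
    f = lambda x: (L / 4.0) * (0.5 * x @ A @ x - x[0])
    grad = lambda x: (L / 4.0) * (A @ x - e1)
    return f, grad

n, L, delta = 100, 1.0, 1e-3
f, grad = nesterov_worst_case(n, L)
grad_tilde = additive_noise_oracle(grad, delta)   # oracle (20), sketched earlier
x_out = stm(grad_tilde, x_start=np.zeros(n), L=L, mu_tau=0.0, n_steps=10_000)
print("final objective value:", f(x_out))
```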
Let us also consider a figure with the two types of noise.
Fig. 4
Second test – relative and additive types of noises comparison.
To compare the convergence on a degenerate problem with different α parameters in the case of relative noise, consider the following graph. Fig. 5
Third test – relative noise with different values of α for µ = 0.

The last figure shows that for α ≤ 0.71 the convergence of the method does not deteriorate, but we can assume the existence of a threshold value α* ≈ 0.71 such that for larger values of α the method diverges.

Also, for testing on strongly convex functions, an analogue of the finite-dimensional Nesterov function from [37] (page 78) was used, that is:
$$f(x) = \frac{\mu(\chi - 1)}{4}\left(\frac{1}{2}\Big(x_1^2 + \sum_{j=1}^{n-1}(x_j - x_{j+1})^2\Big) - x_1\right) + \frac{\mu}{2}\|x\|^2, \qquad \chi = \frac{L}{\mu},$$
$$\nabla f(x) = \left(\frac{\mu(\chi - 1)}{4}A + \mu E\right)x - \frac{\mu(\chi - 1)}{4}e_1, \qquad e_1 = (1, 0, \dots, 0)^T,$$
where E is the identity operator and A is the tridiagonal matrix
$$A = \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix}.$$
Then the minimum of f, attained at x*, can be found from a system of linear equations. Let us consider the graphs of the residuals for different parameters δ of the additive noise. Fig. 6
Fourth test – δ ∈ { . , . , . , . , . , . } . topping rules for accelerated gradient methods with additive noise in gradient 35 Fig. 7
Fifth test – mean of 30 tests, level of approximation and required number of steps.
The last plot confirms Theorem 4 and Remark 8. Similarly to the degenerate case, consider the behavior of the method for different values of the parameter α. Fig. 8
Sixth test – relative noise with different values of α for µ > 0.

Note that in the strongly convex case we obtain a property similar to the degenerate case: for α values less than a certain threshold value α* (from the figure we can assume a value of 0.71) the convergence of the method does not deteriorate, and for larger α values the method diverges. Fig. 9
Seventh test – relative noise with different values of α for other L and µ.

Figure 9 shows that when the parameters L and µ are changed, the value of the assumed threshold α* does not change much. We also note that such threshold values turned out to be approximately equal for the degenerate and the strongly convex problem.

The authors are grateful to Eduard Gorbunov for useful discussions.
References
1. Ajalloeian, A., Stich, S.U.: Analysis of sgd with biased gradient estimators. arXivpreprint arXiv:2008.00051 (2020)2. Akhavan, A., Pontil, M., Tsybakov, A.B.: Exploiting higher order smoothness inderivative-free optimization and continuous bandits. arXiv preprint arXiv:2006.07862(2020)3. Bach, F., Perchet, V.: Highly-smooth zero-th order online optimization. In: V. Feldman,A. Rakhlin, O. Shamir (eds.) 29th Annual Conference on Learning Theory,
Proceedingsof Machine Learning Research , vol. 49, pp. 257–283. PMLR, Columbia University, NewYork, New York, USA (2016). URL http://proceedings.mlr.press/v49/bach16.html
4. Beck, A.: First-order methods in optimization. SIAM (2017)5. Belloni, A., Liang, T., Narayanan, H., Rakhlin, A.: Escaping the local minima via sim-ulated annealing: Optimization of approximately convex functions. In: P. Gr¨unwald,E. Hazan, S. Kale (eds.) Proceedings of The 28th Conference on Learning Theory,
Pro-ceedings of Machine Learning Research , vol. 40, pp. 240–265. PMLR, Paris, France(2015). URL http://proceedings.mlr.press/v40/Belloni15.html
6. Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization (Lecture Notes).Personal web-page of A. Nemirovski (2015)topping rules for accelerated gradient methods with additive noise in gradient 377. Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: A theoretical and empiricalcomparison of gradient approximations in derivative-free optimization. arXiv preprintarXiv:1905.01332 (2019)8. Beznosikov, A., Sadiev, A., Gasnikov, A.: Gradient-free methods with inexact oraclefor convex-concave stochastic saddle-point problem. In: International Conference onMathematical Optimization Theory and Operations Research, pp. 105–119. Springer(2020)9. Bubeck, S.: Convex optimization: Algorithms and complexity. arXiv preprintarXiv:1405.4980 (2014)10. Cohen, M.B., Diakonikolas, J., Orecchia, L.: On acceleration with noise-corrupted gra-dients. arXiv preprint arXiv:1805.12591 (2018)11. Conn, A., Scheinberg, K., Vicente, L.: Introduction to Derivative-Free Optimization.Society for Industrial and Applied Mathematics (2009). DOI 10.1137/1.9780898718768.URL http://epubs.siam.org/doi/abs/10.1137/1.9780898718768
12. d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM Journal onOptimization (3), 1171–1183 (2008)13. d’Aspremont, A., Scieur, D., Taylor, A.: Acceleration methods. arXiv preprintarXiv:2001.09545 (2021)14. Devolder, O.: Stochastic first order methods in smooth convex optimization. COREDiscussion Paper 2011/70 (2011)15. Devolder, O.: Exactness, inexactness and stochasticity in first-order methods for large-scale convex optimization. Ph.D. thesis, ICTEAM and CORE, Universit´e Catholiquede Louvain (2013)16. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex opti-mization with inexact oracle. Mathematical Programming (1), 37–75 (2014). DOI10.1007/s10107-013-0677-5. URL http://dx.doi.org/10.1007/s10107-013-0677-5
17. Devolder, O., Glineur, F., Nesterov, Y., et al.: First-order methods with inexact oracle:the strongly convex case. CORE Discussion Papers , 47 (2013)18. Drusvyatskiy, D., Xiao, L.: Stochastic optimization with decision-dependent distribu-tions. arXiv preprint arXiv:2011.11173 (2020)19. Dvinskikh, D., Gasnikov, A.: Decentralized and parallelized primal and dual acceleratedmethods for stochastic convex programming problems. Journal of Inverse and Ill-posedProblems (2021)20. Dvinskikh, D.M., Turin, A.I., Gasnikov, A.V., Omelchenko, S.S.: Accelerated and nonaccelerated stochastic gradient descent in model generality. Matematicheskie Zametki (4), 515–528 (2020)21. Dvurechensky, P.: Numerical methods in large-scale optimization: inexact oracle andprimal-dual analysis. HSE. Habilitation (2020)22. Dvurechensky, P., Gasnikov, A.: Stochastic intermediate gradient method for con-vex problems with stochastic inexact oracle. Journal of Optimization Theory andApplications (1), 121–145 (2016). DOI 10.1007/s10957-016-0999-6. URL http://dx.doi.org/10.1007/s10957-016-0999-6
23. Dvurechensky, P., Staudigl, M., Shtern, S.: First-order methods for convex optimization.arXiv preprint arXiv:2101.00935 (2021)24. Evtushenko, Y.G.: Optimization and fast automatic differentiation. Computing Centerof RAS, Moscow (2013)25. Gasnikov, A.: Universal gradient descent. arXiv preprint arXiv:1711.00394 (2017)26. Gasnikov, A., Kabanikhin, S., Mohammed, A., Shishlenin, M.: Convex optimization inhilbert space with applications to inverse problems. arXiv preprint arXiv:1703.00267(2017)27. Gasnikov, A.V., Gasnikova, E.V., Nesterov, Y.E., Chernov, A.V.: Efficient numericalmethods for entropy-linear programming problems. Computational Mathematics andMathematical Physics (4), 514–524 (2016). DOI 10.1134/S0965542516040084. URL http://dx.doi.org/10.1134/S0965542516040084
28. Gasnikov, A.V., Nesterov, Y.E.: Universal method for stochastic composite optimizationproblems. Computational Mathematics and Mathematical Physics (1), 48–64 (2018)29. Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep learning, vol. 1. MIT pressCambridge (2016)8 A. Vasin and A. Gasnikov and V. Spokoiny30. Gorbunov, E., Dvinskikh, D., Gasnikov, A.: Optimal decentralized distributed algo-rithms for stochastic convex optimization. arXiv preprint arXiv:1911.07363 (2019)31. Kabanikhin, S.I.: Inverse and ill-posed problems: theory and applications, vol. 55. WalterDe Gruyter (2011)32. Kamzolov, D., Dvurechensky, P., Gasnikov, A.V.: Universal intermediate gradientmethod for convex problems with inexact oracle. Optimization Methods and Softwarepp. 1–28 (2020)33. Kotsalis, G., Lan, G., Li, T.: Simple and optimal methods for stochastic variationalinequalities, ii: Markovian noise and policy evaluation in reinforcement learning. arXivpreprint arXiv:2011.08434 (2020)34. Lan, G.: First-order and Stochastic Optimization Methods for Machine Learning.Springer (2020)35. Nemirovski, A.S.: Regularizing properties of the conjugate gradient method for ill-posedproblems. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki (3), 332–347(1986)36. Nemirovsky, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization.J. Wiley & Sons, New York (1983)37. Nesterov, Y.: Lectures on convex optimization, vol. 137. Springer (2018)38. Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions.Found. Comput. Math. (2), 527–566 (2017). DOI 10.1007/s10208-015-9296-2. URL https://doi.org/10.1007/s10208-015-9296-2 . First appeared in 2011 as CORE dis-cussion paper 2011/1639. Novitskii, V., Gasnikov, A.: Improved exploiting higher order smoothness in derivative-free optimization and continuous bandit. arXiv preprint arXiv:2101.03821 (2021)40. Pedregosa, F., Scieur, D.: Average-case acceleration through spectral density estimation.arXiv preprint arXiv:2002.04756 (2020)41. Poljak, B.: Iterative algorithms for singular minimization problems. In: Nonlinear Pro-gramming 4, pp. 147–166. Elsevier (1981)42. Polyak, B.: Introduction to Optimization. New York, Optimization Software (1987)43. Polyak, B.T., Tsybakov, A.B.: Optimal order of accuracy of search algorithms instochastic optimization. Problemy Peredachi Informatsii (2), 45–53 (1990)44. Risteski, A., Li, Y.: Algorithms and matching lower bounds for approximately-convexoptimization. Advances in Neural Information Processing Systems , 4745–4753 (2016)45. Rockafellar, R.T.: Convex analysis, vol. 36. Princeton university press (1970)46. Scieur, D., Pedregosa, F.: Universal asymptotic optimality of polyak momentum. In:International Conference on Machine Learning, pp. 8565–8572. PMLR (2020)47. Stonyakin, F.: Adaptive methods for variational inequalities, minimization problemsand functional with generalized growth condition. MIPT. Habilitation (2020)48. Stonyakin, F., Tyurin, A., Gasnikov, A., Dvurechensky, P., Agafonov, A., Dvinskikh,D., Pasechnyuk, D., Artamonov, S., Piskunova, V.: Inexact relative smoothness andstrong convexity for optimization and variational inequalities by inexact model. arXivpreprint arXiv:2001.09013 (2020)49. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation andexact worst-case performance of first-order methods. Mathematical Programming161