Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations
Yossi Arjevani (New York University), Yair Carmon (Stanford University), John C. Duchi (Stanford University), Dylan J. Foster (MIT), Ayush Sekhari (Cornell University), Karthik Sridharan (Cornell University)
Abstract
We design an algorithm which finds an ε-approximate stationary point (with ‖∇F(x)‖ ≤ ε) using O(ε^{-3}) stochastic gradient and Hessian-vector products, matching guarantees that were previously available only under a stronger assumption of access to multiple queries with the same random seed. We prove a lower bound which establishes that this rate is optimal and, surprisingly, that it cannot be improved using stochastic p-th order methods for any p ≥ 2, even when the first p derivatives of the objective are Lipschitz. Together, these results characterize the complexity of non-convex stochastic optimization with second-order methods and beyond. Expanding our scope to the oracle complexity of finding (ε, γ)-approximate second-order stationary points, we establish nearly matching upper and lower bounds for stochastic second-order methods. Our lower bounds here are novel even in the noiseless case.

Introduction

Let F : R^d → R have Lipschitz continuous gradient and Hessian, and consider the task of finding an (ε, γ)-second-order stationary point (SOSP), that is, x ∈ R^d such that

    ‖∇F(x)‖ ≤ ε and ∇²F(x) ≽ −γI.    (1)

This task plays a central role in the study of non-convex optimization: for functions satisfying a weak strict saddle condition (Ge et al., 2015), exact SOSPs (with ε = γ = 0) are local minima, and therefore the condition (1) serves as a proxy for approximate local optimality. (However, it is NP-hard to decide whether an exact SOSP is a local minimum or a higher-order saddle point (Murty and Kabadi, 1987).) Moreover, for a growing set of non-convex optimization problems arising in machine learning, SOSPs are in fact global minima (Ge et al., 2015, 2016; Sun et al., 2018; Ma et al., 2019). Consequently, there has been intense recent interest in the design of efficient algorithms for finding approximate SOSPs (Jin et al., 2017; Allen-Zhu, 2018a; Carmon et al., 2018; Fang et al., 2018; Tripuraneni et al., 2018; Xu et al., 2018; Fang et al., 2019).

In stochastic approximation tasks, particularly those motivated by machine learning, access to the objective function is often restricted to stochastic estimates of its gradient: for each query point x ∈ R^d we observe ∇̂F(x, z), where z ∼ P_z is a random variable such that

    E[∇̂F(x, z)] = ∇F(x) and E‖∇̂F(x, z) − ∇F(x)‖² ≤ σ₁².    (2)
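To make the oracle model (2) concrete, here is a minimal sketch of a single-point bounded-variance stochastic gradient oracle for a least-squares objective; the instance and all names are our illustrative assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 10, 1000
A = rng.normal(size=(n_samples, d))               # rows play the role of z
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n_samples)

def grad_oracle(x):
    """Single-point oracle (2): a fresh z ~ P_z per call, unbiased for ∇F(x),
    where F(x) = (1/2n) Σ_i (⟨a_i, x⟩ − b_i)²."""
    z = rng.integers(n_samples)
    a = A[z]
    return (a @ x - b[z]) * a                     # gradient of ½(⟨a,x⟩ − b_z)²

x = np.zeros(d)
exact = A.T @ (A @ x - b) / n_samples             # exact ∇F(x), for reference
est = np.mean([grad_oracle(x) for _ in range(5000)], axis=0)
print(np.linalg.norm(est - exact))                # small: the estimator is unbiased
```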
Figure 1: The elbow effect: For stochastic oracles, the optimal complexity sharply improves from ε^{-4} for p = 1 to ε^{-3} for p = 2, but there is no further improvement for p > 2. For noiseless oracles, the optimal complexity begins at ε^{-2} for p = 1 and smoothly approaches ε^{-1} as the derivative order p → ∞.

The restriction (2) typically arises due to computational considerations (when ∇̂F(·, z) is much cheaper to compute than ∇F(·), as in empirical risk minimization or Monte Carlo simulation), or due to the fundamental online nature of the problem at hand (e.g., when x represents a routing scheme and z represents traffic on a given day). However, for many problems with additional structure, we have access to extra information. For example, we often have access to stochastic second-order information in the form of a Hessian estimator ∇̂²F(x, z) satisfying

    E[∇̂²F(x, z)] = ∇²F(x) and E‖∇̂²F(x, z) − ∇²F(x)‖²_op ≤ σ₂².    (3)

In this paper, we characterize the extent to which the stochastic Hessian information (3), as well as higher-order information, contributes to the efficiency of finding first- and second-order stationary points. We approach this question from the perspective of oracle complexity (Nemirovski and Yudin, 1983), which measures efficiency by the number of queries to estimators of the form (2), and possibly (3), required to satisfy the condition (1).

We provide new upper and lower bounds on the stochastic oracle complexity of finding ε-stationary points and (ε, γ)-SOSPs. In brief, our main results are as follows.

• Finding ε-stationary points: The elbow effect. We propose a new algorithm that finds an ε-stationary point (γ = ∞) with O(ε^{-3}) stochastic gradients and stochastic Hessian-vector products. We furthermore show that this guarantee is not improvable via a complementary Ω(ε^{-3}) lower bound. All previous algorithms achieving O(ε^{-3}) complexity require "multi-point" queries, in which the algorithm can query stochastic gradients at multiple points for the same random seed. Moreover, we show that Ω(ε^{-3}) remains a lower bound for stochastic p-th order methods for all p ≥ 2.

• Finding (ε, γ)-stationary points: Improved algorithm and nearly matching lower bound. We extend our algorithm to find (ε, γ)-stationary points using O(ε^{-3} + ε^{-2}γ^{-2} + γ^{-5}) stochastic gradient and Hessian-vector products, and prove a nearly matching Ω(ε^{-3} + γ^{-5}) lower bound.

In the remainder of this section we overview our results in greater detail. Unless otherwise stated, we assume F has both Lipschitz gradient and Hessian. To simplify the overview, we focus on the dependence on ε^{-1} and γ^{-1} while keeping the other parameters (namely, the initial optimality gap F(x^{(0)}) − inf_{x∈R^d} F(x), the Lipschitz constants of ∇F and ∇²F, and the variances of their estimators) held fixed. Our main theorems give explicit dependence on these parameters.

Finding stationary points (γ = ∞)

We first describe our developments for the task of finding ε-approximate first-order stationary points (satisfying (1) with γ = ∞), and subsequently extend our results to general γ. The reader may also refer to Table 1 for a succinct comparison of upper bounds.

Variance reduction via Hessian-vector products: A new gradient estimator.
Using stochastic gradients and stochastic Hessian-vector products as primitives, we design a new variance-reduced gradient estimator. Plugging it into standard stochastic gradient descent (SGD), we obtain an algorithm that returns a point x̂ satisfying E‖∇F(x̂)‖ ≤ ε and requires O(ε^{-3}) stochastic gradient and HVP queries in expectation. In comparison, vanilla SGD requires O(ε^{-4}) queries (Ghadimi and Lan, 2013), and the previously best known rate under our assumptions was O(ε^{-3.5}), attained by both cubic-regularized Newton's method and a restarted variant of SGD (Tripuraneni et al., 2018; Fang et al., 2019).

Our approach builds on a line of work by Fang et al. (2018); Zhou et al. (2018); Wang et al. (2019); Cutkosky and Orabona (2019) that also develops algorithms with complexity O(ε^{-3}), but requires a "multi-point" oracle in which the algorithm can query the stochastic gradient at multiple points for the same random seed. Specifically, in the n-point variant of this model, the algorithm can query at the set of points (x₁, ..., x_n) and receive

    ∇̂F(x₁, z), ..., ∇̂F(x_n, z), where z ∼ P_z is drawn afresh at each round,    (4)

and where the estimator ∇̂F(x, z) is unbiased and has bounded variance in the sense of (2). The aforementioned works achieve O(ε^{-3}) complexity using n = 2 simultaneous queries, while our new algorithm achieves the same rate using n = 1 (i.e., z is drawn afresh at each query), but using stochastic Hessian-vector products in addition to stochastic gradients. However, we show in Appendix B that under the statistical assumptions made in these works, the two-point stochastic gradient oracle model is strictly stronger than the single-point stochastic gradient/Hessian-vector product oracle we consider here. On the other hand, unlike our algorithm, these works do not require a Lipschitz Hessian.

The algorithms that achieve complexity O(ε^{-3}) using two-point queries work by estimating gradient differences of the form ∇F(x) − ∇F(x′) using ∇̂F(x, z) − ∇̂F(x′, z) and applying recursive variance reduction (Nguyen et al., 2017). Our primary algorithmic contribution is a second-order stochastic estimator for ∇F(x) − ∇F(x′) which avoids simultaneous queries while maintaining comparable error guarantees. To derive our estimator, we note that

    ∇F(x) − ∇F(x′) = ∫₀¹ ∇²F(xt + x′(1 − t))(x − x′) dt,

and use K queries to the stochastic Hessian estimator (3) to numerically approximate this integral.

Method | Requires ∇̂²F? | Complexity bound | Additional assumptions
SGD (Ghadimi and Lan, 2013) | No | O(ε^{-4}) | None
Restarted SGD (Fang et al., 2019) | No | O(ε^{-3.5}) | ∇̂F Lipschitz almost surely
Subsampled regularized Newton (Tripuraneni et al., 2018) | Yes* | O(ε^{-3.5}) | None
Recursive variance reduction (e.g., Fang et al., 2018) | No | O(ε^{-3}) | Mean-squared smoothness, simultaneous queries (see Appendix B)
SGD with HVP-RVR (Algorithm 2) | Yes* | O(ε^{-3}) | None
Subsampled Newton w/ HVP-RVR (Algorithm 3) | Yes | O(ε^{-3}) | None

Table 1: Comparison of guarantees for finding ε-stationary points (i.e., E‖∇F(x)‖ ≤ ε) for a function F with Lipschitz gradient and Hessian. See Table 2 for explicit dependence on problem parameters. Algorithms marked with * require only stochastic Hessian-vector products.
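To see what the multi-point model (4) buys, here is a sketch (continuing the illustrative least-squares setup from the earlier snippet) of a two-point query: both gradient estimates share one draw of z, so their difference concentrates at scale ‖x − x′‖ rather than σ₁.

```python
def two_point_grad_diff(x, x_prev):
    """Two-point query (4) with n = 2: evaluate both stochastic gradients
    on the SAME draw of z; the result is an unbiased estimate of
    ∇F(x) − ∇F(x_prev)."""
    z = rng.integers(n_samples)
    a = A[z]
    # For this instance the difference collapses to (a @ (x − x_prev)) * a,
    # whose norm is O(‖x − x_prev‖): the mean-squared smoothness effect.
    return (a @ x - b[z]) * a - (a @ x_prev - b[z]) * a
```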
Specifically, our estimator takes the form

    (1/K) ∑_{k=0}^{K−1} ∇̂²F( x·(1 − k/K) + x′·(k/K), z^{(k)} ) (x − x′),    (5)

where z^{(k)} are drawn i.i.d. from P_z. Unlike the usual estimator ∇̂F(x, z) − ∇̂F(x′, z), the estimator (5) is biased. Nevertheless, we show that choosing K dynamically according to K ∝ ‖x − x′‖² provides adequate control over both bias and variance while maintaining the desired query complexity. Combining the integral estimator (5) with recursive variance reduction, we attain O(ε^{-3}) complexity.
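A direct rendering of the integral estimator (5) in the same illustrative setup (a sketch only; the choice of K follows Algorithm 1 below):

```python
def hvp_oracle(x_query, v):
    """Stochastic Hessian-vector product oracle (3): fresh z per call.
    (For this least-squares instance the Hessian a aᵀ is constant in
    x_query, but we keep the argument to match the general interface.)"""
    z = rng.integers(n_samples)
    a = A[z]
    return a * (a @ v)

def integral_grad_diff(x, x_prev, K):
    """Estimator (5): approximate ∇F(x) − ∇F(x_prev) by summing K
    stochastic HVPs along the segment between x_prev and x; each HVP uses
    an independent z, so no simultaneous queries are needed."""
    delta = (x - x_prev) / K
    out = np.zeros_like(x)
    for k in range(K):
        out += hvp_oracle(x_prev + k * delta, delta)  # = (1/K) Σ ∇̂²F(·)(x − x_prev)
    return out
```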
Demonstrating the power of second-order information. For functions with Lipschitz gradient and Hessian, we prove an Ω(ε^{-3.5}) lower bound on the minimax oracle complexity of algorithms for finding stationary points using only stochastic gradients (2). This lower bound is an extension of the results of Arjevani et al. (2019a), who showed that for functions with Lipschitz gradient but not Lipschitz Hessian, the optimal rate is Θ(ε^{-4}) using only stochastic gradients (2). Together with our new O(ε^{-3}) upper bound, this lower bound reveals that stochastic Hessian-vector products offer an Ω(ε^{-0.5}) improvement in the oracle complexity of finding stationary points in the single-point query model. This contrasts with the noiseless optimization setting, where finite gradient differences can approximate Hessian-vector products arbitrarily well, meaning these oracle models are equivalent.

Demonstrating the limitations of higher-order information (p > 2). For algorithms that can query both stochastic gradients and stochastic Hessians, we prove a lower bound of Ω(ε^{-3}) on the oracle complexity of finding an expected ε-stationary point. (We formally prove our results for the structured class of zero-respecting algorithms (Carmon et al., 2019a); the lower bounds extend to general randomized algorithms via arguments similar to those of Arjevani et al. (2019a).) This proves that our O(ε^{-3}) upper bound is tight in ε, despite using only stochastic Hessian-vector products rather than full stochastic Hessian queries. (More precisely, our estimator (5) only requires stochastic Hessian-vector products, whose computation is often roughly as expensive as that of a stochastic gradient (Pearlmutter, 1994).)

Notably, our Ω(ε^{-3}) lower bound extends to settings where stochastic higher-order oracles are available, i.e., when the first p derivatives are Lipschitz and we have bounded-variance estimators {∇̂^q F(·, ·)}_{q≤p}. The lower bound holds for any finite p, and thus, as a function of the oracle order p, the minimax complexity has an elbow (Figure 1): for p = 1 the complexity is Θ(ε^{-4}) (Arjevani et al., 2019a), while for all p ≥ 2 it is Θ(ε^{-3}). This means that smoothness and stochastic derivatives beyond the second order cannot improve the leading term in rates of convergence to stationarity, establishing a fundamental limitation of stochastic high-order information. This highlights another contrast with the noiseless setting, where p-th order methods enjoy improved complexity for every p (Carmon et al., 2019a).

As we discuss in Appendix B, for multi-point stochastic oracles (4), the rate O(ε^{-3}) is attainable even without stochastic Hessian access. Moreover, our Ω(ε^{-3}) lower bound for stochastic p-th order oracles holds even when multi-point queries are allowed. Consequently, when viewed through the lens of worst-case oracle complexity, our lower bounds show that even stochastic Hessian information is not helpful in the multi-point setting.

Finding second-order stationary points: general γ. We incorporate our recursive variance-reduced Hessian-vector-product-based gradient estimator into an algorithm that combines SGD with negative curvature search. Under the slightly stronger (relative to (3)) assumption that the stochastic Hessians have almost surely bounded error, we prove that, with constant probability, the algorithm returns an (ε, γ)-SOSP after performing O(ε^{-3} + ε^{-2}γ^{-2} + γ^{-5}) stochastic gradient and Hessian-vector product queries.
A lower bound for finding second-order stationary points. We prove a minimax lower bound which establishes that the stochastic second-order oracle complexity of finding (ε, γ)-SOSPs is Ω(ε^{-3} + γ^{-5}). Consequently, the algorithms we develop have optimal worst-case complexity in the regimes γ = O(ε^{2/3}) and γ = Ω(ε^{1/2}). Compared to our lower bounds for finding ε-stationary points, proving the Ω(γ^{-5}) lower bound requires a more substantial modification of the constructions of Carmon et al. (2019a) and Arjevani et al. (2019a). In fact, our lower bound is new even in the noiseless regime (i.e., σ₁ = σ₂ = 0), where it becomes Ω(ε^{-1.5} + γ^{-3}); this matches the guarantee of cubic-regularized Newton's method (Nesterov and Polyak, 2006) and consequently characterizes the optimal rate for finding approximate SOSPs using noiseless second-order methods.

Additional related work. We briefly survey additional upper and lower complexity bounds related to our work and place our results within their context. The works of Monteiro and Svaiter (2013); Arjevani et al. (2019b); Agarwal and Hazan (2018) delineate the second-order oracle complexity of convex optimization in the noiseless setting; Arjevani and Shamir (2017) treat the finite-sum setting. For functions with Lipschitz gradient and Hessian, oracle access to the Hessian significantly accelerates convergence to ε-approximate global minima, reducing the complexity from Θ(ε^{-0.5}) to Θ(ε^{-2/7}). However, since the hard instances for first-order convex optimization are quadratic (Nemirovski and Yudin, 1983; Arjevani and Shamir, 2016; Simchowitz, 2018), assuming Lipschitz continuity of the Hessian does not improve the complexity if one only has access to a first-order oracle. This contrasts with the case of finding ε-approximate stationary points of non-convex functions with noiseless oracles. There, Lipschitz continuity of the Hessian improves the first-order oracle complexity from Θ(ε^{-2}) to O(ε^{-1.75}), with a lower bound of Ω(ε^{-12/7}) for deterministic algorithms (Carmon et al., 2017, 2019b). Additional access to the full Hessian further improves this complexity to Θ(ε^{-1.5}), and for p-th order oracles with Lipschitz p-th derivative, the complexity further improves to Θ(ε^{-(1+p)/p}) (Carmon et al., 2019a); see Figure 1.

Paper organization. We formally introduce our notation and oracle model in Section 2. Section 3 contains our results concerning the complexity of finding ε-first-order stationary points: algorithmic upper bounds (Section 3.1) and algorithm-independent lower bounds (Section 3.2). Following a similar outline, Section 4 describes our upper and lower bounds for finding (ε, γ)-SOSPs. We conclude the paper in Section 5 with a discussion of directions for further research. Additional technical comparison with related work is given in Appendices A and B, and proofs are given in Appendix C through Appendix G.
Notation.
We let C^p denote the class of p-times differentiable real-valued functions, and let ∇^q F denote the q-th derivative of a given function F ∈ C^p for q ∈ {1, ..., p}. Given a function F ∈ C¹, we let ∇_i F(x) := [∇F(x)]_i = (∂/∂x_i)F(x). When F ∈ C² is twice differentiable, we define ∇²_{ij}F(x) := [∇²F(x)]_{ij} = (∂²/∂x_i∂x_j)F(x), and similarly define [∇^p F(x)]_{i₁,i₂,...,i_p} = (∂^p/∂x_{i₁}···∂x_{i_p})F(x) for p-th order derivatives. For a vector x ∈ R^d, ‖x‖ denotes the Euclidean norm and ‖x‖_∞ denotes the ℓ_∞ norm. For matrices A ∈ R^{d×d}, ‖A‖_op denotes the operator norm. More generally, for symmetric p-th order tensors T, we define the operator norm via ‖T‖_op = sup_{‖v‖=1} |⟨T, v^{⊗p}⟩|, and we let T[v^{(1)}, ..., v^{(p)}] = ⟨T, v^{(1)} ⊗ ··· ⊗ v^{(p)}⟩. Note that for a vector x ∈ R^d the operator norm ‖x‖_op coincides with the Euclidean norm ‖x‖. We let S^d denote the space of symmetric matrices in R^{d×d}. We let B_r(x) denote the Euclidean ball of radius r centered at x ∈ R^d (with dimension clear from context). We adopt non-asymptotic big-O notation, where f = O(g) for f, g : X → R₊ if f(x) ≤ Cg(x) for some constant C > 0.

Preliminaries

We study the problem of finding ε-stationary and (ε, γ)-second-order stationary points in the standard oracle complexity framework (Nemirovski and Yudin, 1983), which we briefly review here.
Function classes.
We consider p-times differentiable functions satisfying standard regularity conditions, and define

    F_p(∆, L_p) = { F : R^d → R | F ∈ C^p, F(0) − inf_x F(x) ≤ ∆, ‖∇^q F(x) − ∇^q F(y)‖_op ≤ L_q‖x − y‖ for all x, y ∈ R^d, q ∈ [p] },

so that L_p := (L₁, ..., L_p) specifies the Lipschitz constants of the q-th order derivatives ∇^q F with respect to the operator norm. We make no restriction on the ambient dimension d.
For a given function F ∈ F_p(∆, L_p), we consider a class of stochastic p-th order oracles defined by a distribution P_z over a measurable set Z and an estimator

    O^p_F(x, z) := ( F̂(x, z), ∇̂F(x, z), ∇̂²F(x, z), ..., ∇̂^p F(x, z) ),    (6)

where {∇̂^q F(·, z)}_{q=0}^p are unbiased estimators of the respective derivatives. That is, for all x, E_{z∼P_z}[F̂(x, z)] = F(x) and E_{z∼P_z}[∇̂^q F(x, z)] = ∇^q F(x) for all q ∈ [p]. Given variance parameters σ_p = (σ₁, ..., σ_p), we define the oracle class O_p(F, σ_p) to be the set of all stochastic p-th order oracles for which the variance of the derivative estimators satisfies

    E_{z∼P_z} ‖∇̂^q F(x, z) − ∇^q F(x)‖²_op ≤ σ_q², q ∈ [p].    (7)

The upper bounds in this paper hold even when σ₀² := max_{x∈R^d} Var(F̂(x, z)) is infinite, while our lower bounds hold when σ₀ = 0, so to reduce notation, we leave dependence on this parameter tacit.
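As a purely illustrative instance of this oracle class (our construction, not the paper's), the sketch below wraps F(x) = ½‖x‖² with noise calibrated so that the variance bounds (7) hold with the prescribed σ₁ and σ₂:

```python
import numpy as np

class SyntheticSecondOrderOracle:
    """Stochastic second-order oracle (6) for F(x) = ½‖x‖², so that
    ∇F(x) = x and ∇²F(x) = I. Noise norms are fixed at sigma1 / sigma2,
    making the bounds (7) hold with equality; the noise is mean-zero by
    sign symmetry, so the estimators are unbiased."""
    def __init__(self, d, sigma1, sigma2, rng):
        self.d, self.sigma1, self.sigma2, self.rng = d, sigma1, sigma2, rng

    def query(self, x):
        g = self.rng.normal(size=self.d)
        g *= self.sigma1 / np.linalg.norm(g)            # ‖gradient noise‖ = σ₁
        W = self.rng.normal(size=(self.d, self.d))
        W = (W + W.T) / 2
        W *= self.sigma2 / np.linalg.norm(W, 2)         # symmetric, ‖W‖_op = σ₂
        return 0.5 * x @ x, x + g, np.eye(self.d) + W   # (F̂, ∇̂F, ∇̂²F)
```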
Optimization protocol. We consider stochastic p-th order optimization algorithms that access an unknown function F ∈ F_p(∆, L_p) through multiple rounds of queries to a stochastic p-th order oracle (O^p_F, P_z) ∈ O_p(F, σ_p). When queried at x^{(t)} in round t, the oracle performs an independent draw of z^{(t)} ∼ P_z and answers with O^p_F(x^{(t)}, z^{(t)}). (For p ≥ 2, we take each ∇̂^p F(x, z) to be a symmetric tensor.) Algorithm queries depend on F only through the oracle answers; see, e.g., Arjevani et al. (2019a, Section 2) for a more formal treatment.

Finding ε-stationary points

In this section we focus on the task of finding ε-approximate stationary points (satisfying ‖∇F(x)‖ ≤ ε). As prior work observes (cf. Carmon et al., 2017; Allen-Zhu, 2018a), stationary point search is a useful primitive for achieving the end goal of finding second-order stationary points (1). We begin by describing algorithmic upper bounds on the complexity of finding stationary points with stochastic second-order oracles, and then proceed to match their leading terms with general p-th order lower bounds.

Upper bounds

Our algorithms rely on recursive variance reduction (Nguyen et al., 2017): we sequentially estimate the gradient at the points {x^{(t)}}_{t≥1} by accumulating cheap estimators of ∇F(x^{(τ)}) − ∇F(x^{(τ−1)}) for τ = t₀ + 1, ..., t, where at iteration t₀ we reset the gradient estimator by computing a high-accuracy approximation of ∇F(x^{(t₀)}) with many oracle queries. Our implementation of recursive variance reduction, Algorithm 1, differs from previous approaches (Fang et al., 2018; Zhou et al., 2018; Wang et al., 2019) in three aspects.

1. In Line 8 we estimate differences of the form ∇F(x^{(τ)}) − ∇F(x^{(τ−1)}) by averaging stochastic Hessian-vector products. This allows us to do away with multi-point queries and operate under weaker assumptions than prior work (see Appendix B), but it also introduces bias to our estimator, which makes its analysis more involved. This is the key novelty in our algorithm.

2. Rather than resetting the gradient estimator every fixed number of steps, we reset with a user-defined probability b (Line 2); this makes the estimator stateless and greatly simplifies its analysis, especially when we use a varying value of b to find second-order stationary points.

3. We dynamically select the batch size K for estimating gradient differences based on the distance between iterates (Line 1), while prior work uses a constant batch size. Our dynamic batch size scheme is crucial for controlling the bias in our gradient estimator, while still allowing for large step sizes as in Wang et al. (2019).
Algorithm 1: Recursive variance reduction with stochastic Hessian-vector products (HVP-RVR)

// Gradient estimator for F ∈ F₂(∆, L₂) given a stochastic oracle in O₂(F, σ₂).
function HVP-RVR-Gradient-Estimator_{ε,b}(x, x_prev, g_prev):
1: Set K = ⌈ (σ₂² + L₂ε)·‖x − x_prev‖² / (bε²) ⌉ and n = ⌈σ₁²/ε²⌉.
2: Sample C ∼ Bernoulli(b).
3: if C = 1 or g_prev = ⊥ then
4:   Query the oracle n times at x and set g ← (1/n) ∑_{j=1}^n ∇̂F(x, z^{(j)}), where z^{(j)} i.i.d. ∼ P_z.
5: else
6:   Define x^{(k)} := (k/K)·x + (1 − k/K)·x_prev for k ∈ {0, ..., K}.
7:   Query the oracle at the set of points (x^{(k)})_{k=0}^{K−1} to compute
8:     g ← g_prev + ∑_{k=1}^K ∇̂²F(x^{(k−1)}, z^{(k)})(x^{(k)} − x^{(k−1)}), where z^{(k)} i.i.d. ∼ P_z.
9: return g.
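For concreteness, here is a Python rendering of Algorithm 1 (a sketch under the illustrative oracle interface from the earlier snippets; `grad_oracle(x)` and `hvp_oracle(x, v)` are assumed to perform one oracle query each):

```python
def hvp_rvr_gradient_estimator(x, x_prev, g_prev, eps, b_reset,
                               sigma1, sigma2, L2,
                               grad_oracle, hvp_oracle, rng):
    """Algorithm 1 (HVP-RVR). With probability b_reset (or on the first
    call), reset with a batch of n stochastic gradients; otherwise update
    g_prev with the integral HVP estimator (5), using the dynamic batch
    size K from Line 1."""
    if g_prev is None or rng.random() < b_reset:
        n = max(1, int(np.ceil(sigma1**2 / eps**2)))
        return np.mean([grad_oracle(x) for _ in range(n)], axis=0)
    K = max(1, int(np.ceil((sigma2**2 + L2 * eps)
                           * np.linalg.norm(x - x_prev)**2
                           / (b_reset * eps**2))))
    delta = (x - x_prev) / K
    g = g_prev.copy()
    for k in range(K):
        g += hvp_oracle(x_prev + k * delta, delta)   # one HVP per segment
    return g
```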
Algorithm 2: Stochastic gradient descent with HVP-RVR

Input: Oracle (O²_F, P_z) ∈ O₂(F, σ₂) for F ∈ F₂(∆, L₁, L₂). Precision parameter ε.
1: Set η = 1/√(L₁² + σ₂² + εL₂), T = ⌈∆/(ηε²)⌉, and b = min{ 1, ηε√(σ₂² + εL₂)/σ₁ }.
2: Initialize x^{(0)} = x^{(1)} ← 0 and g^{(0)} ← ⊥.
3: for t = 1 to T do
4:   g^{(t)} ← HVP-RVR-Gradient-Estimator_{ε,b}(x^{(t)}, x^{(t−1)}, g^{(t−1)}).
5:   x^{(t+1)} ← x^{(t)} − ηg^{(t)}.
6: return x̂ chosen uniformly at random from {x^{(t)}}_{t=1}^T.
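A matching sketch of Algorithm 2's main loop (again illustrative; it assumes the `hvp_rvr_gradient_estimator` sketch above and the problem constants as inputs):

```python
def sgd_with_hvp_rvr(x0, eps, Delta, L1, L2, sigma1, sigma2,
                     grad_oracle, hvp_oracle, rng):
    """Algorithm 2: SGD driven by the HVP-RVR estimator, with step size,
    horizon, and reset probability set as in the pseudocode."""
    eta = 1.0 / np.sqrt(L1**2 + sigma2**2 + eps * L2)
    T = int(np.ceil(Delta / (eta * eps**2)))
    b = min(1.0, eta * eps * np.sqrt(sigma2**2 + eps * L2) / sigma1)
    x_prev, x, g = x0.copy(), x0.copy(), None
    iterates = []
    for _ in range(T):
        g = hvp_rvr_gradient_estimator(x, x_prev, g, eps, b,
                                       sigma1, sigma2, L2,
                                       grad_oracle, hvp_oracle, rng)
        x_prev, x = x, x - eta * g
        iterates.append(x.copy())
    return iterates[rng.integers(len(iterates))]   # uniformly random iterate
```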
The core of our analysis is the following lemma, which bounds the gradient estimation error and expected oracle complexity. To state the lemma, we let {x^{(t)}}_{t≥1} be a sequence of queries to Algorithm 1, and let g^{(t)} = HVP-RVR-Gradient-Estimator_{ε,b}(x^{(t)}, x^{(t−1)}, g^{(t−1)}) be the sequence of estimates it returns.

Lemma 1. For any oracle in O₂(F, σ₂) and F ∈ F₂(∆, L₂), Algorithm 1 guarantees that E‖g^{(t)} − ∇F(x^{(t)})‖² ≤ ε² for all t ≥ 1. Furthermore, conditional on x^{(t−1)}, x^{(t)} and g^{(t−1)}, the t-th execution of Algorithm 1 with reset probability b uses at most

    O( b·σ₁²/ε² + ‖x^{(t)} − x^{(t−1)}‖² · (σ₂² + εL₂)/(bε²) + 1 )

stochastic gradient and Hessian-vector product queries in expectation.

We prove the lemma in Appendix C by bounding the per-step variance using the HVP oracle's variance bound (7), and by bounding the per-step bias relative to ∇F(x^{(t)}) − ∇F(x^{(t−1)}) using the Lipschitz continuity of the Hessian.

Our first algorithm for finding ε-stationary points, Algorithm 2, is simply stochastic gradient descent using the HVP-RVR gradient estimator (Algorithm 1); we bound its complexity by O(ε^{-3}). Before stating the result formally, we briefly sketch the analysis here. The standard analysis of SGD with step size η ≤ 1/L₁ shows that its iterates satisfy E‖∇F(x^{(t)})‖² ≤ (2/η)·E[F(x^{(t)}) − F(x^{(t+1)})] + O(1)·E‖g^{(t)} − ∇F(x^{(t)})‖². Telescoping over T steps, using Lemma 1, and substituting in the initial suboptimality bound ∆, this implies that

    (1/T) ∑_{t=0}^{T−1} E‖∇F(x^{(t)})‖² ≤ 2∆/(ηT) + O(ε²).    (8)

Taking T = Ω(∆/(ηε²)), we are guaranteed that a uniformly selected iterate has expected squared gradient norm O(ε²). To account for oracle complexity, we observe from Lemma 1 that T calls to Algorithm 1 require at most T(bσ₁²/ε² + 1) + ∑_{t=1}^T E‖x^{(t)} − x^{(t−1)}‖² · (σ₂² + L₂ε)/(bε²) oracle queries in expectation. Using x^{(t)} − x^{(t−1)} = −ηg^{(t−1)}, Lemma 1 and (8) imply that ∑_{t=1}^T E‖x^{(t)} − x^{(t−1)}‖² ≤ O(Tη²ε²). We then choose b to balance out the terms Tb(σ₁²/ε²) and Tη²(σ₂² + L₂ε)/b. This gives the following complexity guarantee, which we prove in Appendix E.1.
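Spelling out this final balancing step (a worked calculation consistent with the parameter settings in Algorithm 2):

```latex
T\,b\,\frac{\sigma_1^2}{\epsilon^2}
= T\,\frac{\eta^2(\sigma_2^2 + L_2\epsilon)}{b}
\;\iff\;
b = \frac{\eta\,\epsilon\sqrt{\sigma_2^2 + L_2\epsilon}}{\sigma_1},
\qquad\text{and with } T = \Theta\!\Big(\tfrac{\Delta}{\eta\epsilon^2}\Big),
\quad
T\,b\,\frac{\sigma_1^2}{\epsilon^2}
= \Theta\!\Big(\frac{\Delta\,\sigma_1\sqrt{\sigma_2^2 + L_2\epsilon}}{\epsilon^3}\Big)
\lesssim \frac{\Delta\,\sigma_1\sigma_2}{\epsilon^3}
+ \frac{\Delta\,L_2^{1/2}\sigma_1}{\epsilon^{5/2}},
```

which is exactly the value of b in Algorithm 2 and recovers the first two terms of the bound in the theorem below.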
Theorem 1. For any function F ∈ F₂(∆, L₁, L₂), stochastic second-order oracle in O₂(F, σ₁, σ₂), and ε < min{σ₁, √(∆L₁)}, with constant probability, Algorithm 2 returns a point x̂ such that ‖∇F(x̂)‖ ≤ ε and performs at most

    O( ∆σ₁σ₂/ε³ + ∆L₂^{0.5}σ₁/ε^{2.5} + ∆L₁/ε² )

stochastic gradient and Hessian-vector product queries.

The oracle complexity of Algorithm 2 depends on the Lipschitz parameters of F only through lower-order terms in ε, with the leading term scaling only with the variance of the gradient and Hessian estimators. In the low-noise regime where σ₁ < ε and σ₂ < max{L₁, √(L₂ε)}, the complexity becomes O(∆L₁ε^{-2} + ∆L₂^{0.5}ε^{-1.5}), which is simply the maximum of the noiseless guarantees for gradient descent and Newton's method. We remark, however, that in the noiseless regime σ₁ = σ₂ = 0, a slightly better guarantee O(∆L₁^{0.5}L₂^{0.25}ε^{-1.75} + ∆L₂^{0.5}ε^{-1.5}) is achievable (Carmon et al., 2017).

In the noiseless setting, any algorithm that uses only first-order and Hessian-vector product queries must have complexity scaling with L₁, but full Hessian access can remove this dependence (Carmon et al., 2019b). We show that the same holds true in the stochastic setting: Algorithm 3, a subsampled cubic-regularized trust-region method using Algorithm 1 for gradient estimation, enjoys a complexity bound independent of L₁. We defer the analysis to Appendix E.2 and state the guarantee as follows.
Theorem 2. For any function F ∈ F₂(∆, ∞, L₂), stochastic second-order oracle in O₂(F, σ₁, σ₂), and ε < σ₁, with constant probability, Algorithm 3 returns a point x̂ such that ‖∇F(x̂)‖ ≤ ε and performs at most

    O( (∆σ₁σ₂/ε³)·log^{1.5} d + ∆L₂^{0.5}σ₁/ε^{2.5} )

stochastic gradient and Hessian queries.

The guarantee of Theorem 2 constitutes an improvement in query complexity over Theorem 1 in the regime L₁ ≳ (1 + σ₁/ε)(σ₂ + √(L₂ε)). However, depending on the problem, full stochastic Hessians can be up to d times more expensive to compute than stochastic Hessian-vector products.
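Each iteration of Algorithm 3 (below) solves a cubic-regularized model over a ball. As a minimal illustration, here is a heuristic solver sketch (ours, not the paper's; since the subproblem is nonconvex, projected gradient descent only returns a stationary point of the model rather than a certified global minimizer):

```python
def cubic_trust_region_step(g, H, M, radius, iters=200):
    """Heuristic solver for  min_{‖s‖ ≤ radius} ⟨g,s⟩ + ½ sᵀHs + (M/6)‖s‖³
    via projected gradient descent on the model."""
    lr = 1.0 / (np.linalg.norm(H, 2) + M * radius + 1e-12)  # safe step size
    s = np.zeros_like(g)
    for _ in range(iters):
        model_grad = g + H @ s + 0.5 * M * np.linalg.norm(s) * s
        s -= lr * model_grad
        norm = np.linalg.norm(s)
        if norm > radius:
            s *= radius / norm                              # project onto ball
    return s   # the update is then x_next = x + s
```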
Algorithm 3: Subsampled cubic-regularized trust-region method with HVP-RVR

Input: Oracle (O²_F, P_z) ∈ O₂(F, σ₂) for F ∈ F₂(∆, ∞, L₂). Precision parameter ε.
1: Set M = 5·max{ L₂, εσ₂² log(d)/σ₁² }, η = (1/25)·√(ε/M), T = ⌈∆/(ηε)⌉, and n_H = ⌈σ₂²η² log(d)/ε²⌉.
2: Set b = min{ 1, η√(σ₂² + εL₂)/σ₁ }.
3: Initialize x^{(0)} = x^{(1)} ← 0 and g^{(0)} ← ⊥.
4: for t = 1 to T do
5:   Query the oracle n_H times at x^{(t)} and compute H^{(t)} ← (1/n_H) ∑_{j=1}^{n_H} ∇̂²F(x^{(t)}, z^{(t,j)}), where z^{(t,j)} i.i.d. ∼ P_z.
6:   g^{(t)} ← HVP-RVR-Gradient-Estimator_{ε,b}(x^{(t)}, x^{(t−1)}, g^{(t−1)}).
7:   Set the next iterate as
       x^{(t+1)} ← argmin_{y: ‖y−x^{(t)}‖ ≤ η} { ⟨g^{(t)}, y − x^{(t)}⟩ + ½⟨y − x^{(t)}, H^{(t)}(y − x^{(t)})⟩ + (M/6)‖y − x^{(t)}‖³ }.
8: return x̂ chosen uniformly at random from {x^{(t)}}_{t=2}^{T+1}.

Lower bounds

Having presented stochastic second-order methods with an O(ε^{-3}) complexity bound for finding ε-stationary points, we next show that this rate cannot be improved. In fact, we show that this rate is optimal even when one is given access to stochastic higher derivatives of any order. We prove our lower bounds for the class of zero-respecting algorithms, which subsumes the majority of existing optimization methods; see Appendix G.1 for a formal definition. We believe that existing techniques (Carmon et al., 2019a; Arjevani et al., 2019a) can strengthen our lower bounds to apply to general randomized algorithms; for brevity, we do not pursue this here.

The lower bounds in this section closely follow a recent construction by Arjevani et al. (2019a, Section 3), who prove lower bounds for stochastic first-order methods. To establish complexity bounds for p-th order methods, we extend the 'probabilistic zero-chain' gradient estimator introduced in Arjevani et al. (2019a) to high-order derivative estimators. The most technically demanding part of our proof is a careful scaling of the basic construction to simultaneously meet multiple Lipschitz continuity and variance constraints. Deferring the proof details to Appendix G.1, our lower bound is as follows.
Theorem 3. For all p ∈ N, ∆, L_p, σ_p > 0 and ε ≤ O(σ₁), there exists F ∈ F_p(∆, L_p) and (O^p_F, P_z) ∈ O_p(F, σ_p), such that for any p-th order zero-respecting algorithm, the number of queries required to obtain an ε-stationary point with constant probability is bounded from below by

    Ω(1) · (∆σ₁²/ε³) · min{ min_{q∈{2,...,p}} (σ_q/σ₁)^{1/(q−1)}, min_{q′∈{1,...,p}} (L_{q′}/ε)^{1/q′} }.    (9)

(A construction of dimension Θ( (∆/ε) · min{ min_{q∈{2,...,p}} (σ_q/σ₁)^{1/(q−1)}, min_{q′∈{1,...,p}} (L_{q′}/ε)^{1/q′} } ) realizes this lower bound.)

In the case of second-order oracles (p = 2), Theorem 3 specializes to the oracle complexity lower bound

    Ω(1) · min{ ∆σ₁σ₂/ε³, ∆L₂^{0.5}σ₁²/ε^{3.5}, ∆L₁σ₁²/ε⁴ },    (10)

which is tight in that it matches (up to numerical constants) the convergence rate of Algorithm 2 in the regime where ∆σ₁σ₂ε^{-3} dominates both the upper bound in Theorem 1 and expression (10). The lower bound (10) is also tight when second-order information is not available or reliable (σ₂ is infinite or very large, respectively): standard SGD matches the ε^{-4} term (Ghadimi and Lan, 2013), while more sophisticated variants based on restarting (Fang et al., 2019) and normalized updates with momentum (Cutkosky and Mehta, 2020) match the ε^{-3.5} term (the former up to logarithmic factors); neither of these algorithms requires stochastic second derivative estimation.

Theorem 3 implies that while higher-order methods (with p > 2) might achieve better dependence on the variance parameters than the upper bounds for Algorithm 2 or Algorithm 3, they cannot improve the ε^{-3} scaling. This highlights a fundamental limitation for higher-order methods in stochastic non-convex optimization which does not exist in the noiseless case. Indeed, without noise the optimal rate for finding an ε-stationary point with a p-th order method is Θ(ε^{-(1+p)/p}) (Carmon et al., 2019a); we illustrate this contrast in Figure 1.

Altogether, the results presented in this section fully characterize (with respect to dependence on ε) the complexity of finding ε-stationary points with stochastic second-order methods and beyond in the single-point query model. We briefly remark that the lower bound in (9) immediately extends to multi-point queries, which shows that even second-order methods offer little benefit once two or more simultaneous queries are allowed.

Finding second-order stationary points

Having established rates of convergence for finding ε-stationary points, we now turn our attention to (ε, γ)-second-order stationary points, which have the additional requirement that λ_min(∇²F(x)) ≥ −γ, i.e., that F is γ-weakly convex around x. This section follows the general organization of the prequel: we first design and analyze an algorithm with improved upper bounds, and then develop nearly matching lower bounds that apply to a broad class of algorithms.

Upper bounds

Our first contribution for this section is an algorithm that enjoys improved complexity for finding (ε, γ)-second-order stationary points, and that achieves this using only stochastic gradient and Hessian-vector product queries. To guarantee second-order stationarity, we follow the established technique of interleaving an algorithm for finding a first-order stationary point with negative curvature descent (Carmon et al., 2017; Allen-Zhu, 2018a). However, we employ a randomized variant of this approach. Specifically, at every iteration we flip a biased coin to determine whether to perform a stochastic gradient step or a stochastic negative curvature descent step.

Our algorithm estimates stochastic gradients using the HVP-RVR scheme (Algorithm 1), where the value of the restart probability b depends on the type of the previous step (gradient or negative curvature). To implement negative curvature descent, we apply Oja's method (Oja, 1982; Allen-Zhu and Li, 2017), which detects directions of negative curvature using only stochastic Hessian-vector product queries. For technical reasons pertaining to the analysis of Oja's method, we require the stochastic Hessians to be bounded almost surely, i.e., ‖∇̂²F(x, z) − ∇²F(x)‖_op ≤ σ̄₂ a.s.; we let Ō₂(F, σ₁, σ̄₂) denote the class of such bounded noise oracles.
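As an illustration of the negative-curvature search component (a sketch only; the step size and acceptance rule are simplified assumptions, not the tuned choices from Appendix D.4): running Oja's method on the shifted matrix L₁I − ∇²F(x), whose top eigenvector is the direction of most negative curvature of the Hessian, requires nothing beyond stochastic Hessian-vector products.

```python
def oja_negative_curvature(x, hvp_oracle, L1, T_oja, d, rng):
    """Oja-style power iteration on L1·I − ∇̂²F(x,·) using only stochastic
    HVP queries; returns a candidate negative-curvature direction."""
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    step = 1.0 / (L1 * np.sqrt(T_oja))           # illustrative step size
    for _ in range(T_oja):
        v = v + step * (L1 * v - hvp_oracle(x, v))   # (L1·I − ∇̂²F) v
        v /= np.linalg.norm(v)
    return v   # accept as a descent direction if the estimated curvature
               # along v is at most −γ/2
```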
Under this assumption, Algorithm 4 (whose description is deferred to Appendix F) enjoys the following convergence guarantee.

Theorem 4. For any function F ∈ F₂(∆, L₁, L₂), stochastic Hessian-vector product oracle in Ō₂(F, σ₁, σ̄₂), ε ≤ min{σ₁, √(∆L₁)}, and γ ≤ min{σ̄₂, L₁, √(εL₂)}, with constant probability, Algorithm 4 returns a point x̂ such that

    ‖∇F(x̂)‖ ≤ ε and λ_min(∇²F(x̂)) ≥ −γ,

and performs at most

    Õ( ∆σ₁σ̄₂/ε³ + ∆L₂σ₁σ̄₂/(γ²ε²) + ∆L₂²(σ̄₂ + L₁)²/γ⁵ + ∆L₁/ε² )

stochastic gradient and Hessian-vector product queries. (The notation Õ(·) hides lower-order terms and logarithmic dependence on the dimension d; see the proof in Appendix F for the complete description of the algorithm and the full complexity bound, including lower-order terms.)

Similarly to the case of finding ε-stationary points (see the discussion preceding Theorem 2), using full stochastic Hessian information allows us to design an algorithm (Algorithm 5) which removes the dependence on L₁ from the theorem above. Moreover, estimating negative curvature directly from empirical Hessian estimates saves us the need to use Oja's method, which means that we do not need the additional boundedness assumption on the stochastic Hessian used by Algorithm 4. We defer the complete description and analysis of Algorithm 5 to Appendix F.2, and state its complexity guarantee below.
Theorem 5. For any function F ∈ F₂(∆, ∞, L₂), stochastic second-order oracle in O₂(F, σ₁, σ₂), ε ≤ σ₁, and γ ≤ min{σ₂, √(εL₂), (∆L₂²)^{1/3}}, with constant probability, Algorithm 5 returns a point x̂ such that

    λ_min(∇²F(x̂)) ≥ −γ and ‖∇F(x̂)‖ ≤ ε,

and performs at most

    Õ( ∆σ₁σ₂/ε³ + ∆L₂σ₁σ₂/(γ²ε²) + ∆L₂²σ₂²/γ⁵ )

stochastic gradient and Hessian queries.

Lower bounds

We now develop lower complexity bounds for the task of finding (ε, γ)-stationary points. To do so, we prove new lower bounds for the simpler sub-problem of finding a γ-weakly convex point, i.e., a point x such that λ_min(∇²F(x)) ≥ −γ (with no restriction on ‖∇F(x)‖). Lower bounds for finding (ε, γ)-SOSPs follow as the maximum (or, equivalently, the sum) of the lower bounds we develop here and the lower bounds for finding ε-stationary points given in Theorem 3. To see why this is so, let F_ε and F_γ be hard instances for finding ε-stationary and γ-weakly-convex points respectively, and consider the "direct sum" F_{ε,γ}(x) := F_ε(x₁, ..., x_{d₁}) + F_γ(x_{d₁+1}, ..., x_d); this is a hard instance for finding (ε, γ)-SOSPs that inherits all the regularity properties of its constituent functions.
Our construction makes certifying a lower bound on λ_min(∇²F(x)) possible only when essentially none of the entries of x is zero. Given T > 0, we define the hard function

    G_T(x) := Ψ(1)Λ(x₁) + ∑_{i=2}^T [ Ψ(−x_{i−1})Λ(−x_i) + Ψ(x_{i−1})Λ(x_i) ],    (11)

where Ψ(x) := exp(1 − (2x − 1)^{−2})·1{x > 1/2} (as in Carmon et al. (2019a)) and Λ(x) := 8(e^{−x²} − 1). We take the oracle's gradient estimator ∇̂G_T to be exactly equal to ∇G_T, so that the lower bound holds even for σ₁ = 0. Appropriately scaling G_T allows us to tune the Lipschitz constants of its derivatives and the variances of the estimators; this establishes the complexity bounds below (see Appendix G.2 for a full derivation).
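The scaling step amounts to the following standard calculation: setting F_T(x) := α·G_T(x/β) for α, β > 0, each derivative and the curvature threshold rescale as

```latex
\nabla^q F_T(x) = \frac{\alpha}{\beta^{q}}\,\nabla^q G_T(x/\beta),
\qquad
L_q(F_T) = \frac{\alpha}{\beta^{q+1}}\,L_q(G_T),
\qquad
\lambda_{\min}\!\big(\nabla^2 F_T(x)\big)
= \frac{\alpha}{\beta^{2}}\,\lambda_{\min}\!\big(\nabla^2 G_T(x/\beta)\big),
```

so α and β can be chosen to satisfy the Lipschitz constraints L_p while placing the weak-convexity threshold at the target γ; independently scaling the noise in the Hessian estimator tunes the variances σ_p.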
Theorem 6. Let p ≥ 2 and ∆, L_p, σ_p > 0 be fixed. If γ ≤ O(min{σ₂, L₁}), then there exists F ∈ F_p(∆, L_p) and (O^p_F, P_z) ∈ O_p(F, σ_p) such that for any stochastic p-th order zero-respecting algorithm, the number of queries to O^p_F required to obtain a γ-weakly convex point with constant probability is at least

    Ω(1) · ∆σ₂²L₂²/γ⁵ for p = 2,
    Ω(1) · (∆σ₂²/γ³) · min{ min_{q∈{3,...,p}} (σ_q/σ₂)^{2/(q−2)}, min_{q′∈{2,...,p}} (L_{q′}/γ)^{2/(q′−1)} } for p > 2.    (12)

(A construction of dimension Θ( (∆/γ) · min{ min_{q∈{3,...,p}} (σ_q/σ₂)^{2/(q−2)}, min_{q′∈{2,...,p}} (L_{q′}/γ)^{2/(q′−1)} } ) realizes the lower bound.)

Theorem 6 is new even in the noiseless case (in which σ₁ = ··· = σ_p = 0), where it specializes to

    Ω(1) · (∆/γ) · min_{q∈{2,...,p}} (L_q/γ)^{2/(q−1)}.    (13)

For the class F_p(∆, L_p), the lower bound (13) further simplifies to ∆L_p^{2/(p−1)}γ^{−(p+1)/(p−1)}, which is attained by the p-th order regularization method given in Cartis et al. (2017, Theorem 3.6). Together, these results characterize the deterministic complexity of finding γ-weakly convex points with noiseless p-th order methods.

Returning to the stochastic setting, the bound in Theorem 6, when combined with Theorem 3, implies the following oracle complexity lower bound for finding (ε, γ)-SOSPs with zero-respecting stochastic second-order methods (p = 2):

    Ω(1) · ( min{ ∆σ₁σ₂/ε³, ∆L₂^{0.5}σ₁²/ε^{3.5}, ∆L₁σ₁²/ε⁴ } + ∆σ₂²L₂²/γ⁵ ).    (14)

Our lower bound matches the ε^{-3} and γ^{-5} terms in the upper bound given by Theorem 4, but does not match the mixed term ε^{-2}γ^{-2} appearing in the upper bound. (Young's inequality only gives ε^{-3} + γ^{-5} ≥ Ω(ε^{-9/5}γ^{-2}).) Overall, the rates match whenever γ = Ω(ε^{1/2}) or γ = O(ε^{2/3}).

For p ≥ 3, the lower bound (12) scales only as γ^{-3}, while the optimal rate in the noiseless regime, γ^{−(p+1)/(p−1)}, continues improving for all p. However, we are not yet aware of an algorithm using stochastic third-order information or higher that can achieve the γ^{-3} complexity bound.

Discussion

This paper provides a fairly complete picture of the worst-case oracle complexity of finding stationary points with a stochastic second-order oracle: for ε-stationary points we characterize the leading ε^{-3} term exactly, and for (ε, γ)-SOSPs we characterize the leading γ^{-5} term for a wide range of parameters. Nevertheless, our results point to a number of open questions.

Benefits of higher-order information for γ-weakly convex points. Our upper and lower bounds (in Theorem 5 and Theorem 6) resolve the optimal rate to find an (ε, γ)-stationary point for p = 2, i.e., when F is second-order smooth and the algorithm can query stochastic gradient and Hessian information. Furthermore, Theorem 3 shows that higher-order information (p ≥ 3) cannot improve the dependence of the rate on the first-order stationarity parameter ε. However, our lower bound for the dependence on γ scales as γ^{-5} for p = 2, but only as γ^{-3} for p ≥ 3. The weaker lower bound for p ≥ 3 leaves open whether stochastic higher-order methods can improve over the γ^{-5} complexity attained with second-order information; resolving this gap is an interesting direction for future work. (Indeed, when high-order noise moments are assumed finite, the term min_{q∈{3,...,p}}(σ_q/σ₂)^{2/(q−2)} can no longer be disregarded. This, in turn, implies that for sufficiently small γ, one cannot improve over γ^{-3} scaling, as seen by (12).)
Global methods. For statistical learning and sample average approximation problems, it is natural to consider problem instances of the form F(x) = E[F̂(x, z)]. For this setting, a more powerful oracle model is the global oracle, in which samples z^{(1)}, ..., z^{(n)} are drawn i.i.d. and the learner observes the entire function F̂(·, z^{(t)}) for each t ∈ [n]. Global oracles are more powerful than stochastic p-th order oracles for every p, and lead to improved rates in the convex setting (Foster et al., 2019). Is it possible to beat the ε^{-3} elbow for such oracles, or do our lower bounds extend to this setting?
Adaptivity and instance-dependent complexity. Our lower bounds show that stochastic higher-order methods cannot improve the ε^{-3} oracle complexity attained with stochastic gradients and Hessian-vector products. Furthermore, in the multi-point query model, stochastic second-order information does not even lead to improved rates over stochastic first-order information. However, these conclusions could be artifacts of our worst-case point of view: are there natural families of problem instances for which higher-order methods can adapt to additional problem structure and obtain stronger instance-dependent convergence guarantees? Developing a theory of instance-dependent complexity that can distinguish adaptive algorithms stands out as an exciting research prospect.

Acknowledgements

We thank Blake Woodworth and Nati Srebro for helpful discussions. YA acknowledges partial support from the Sloan Foundation and Samsung Research. JCD acknowledges support from the NSF CAREER award CCF-1553086, ONR YIP N00014-19-2288, the Sloan Foundation, NSF HDR 1934578 (Stanford Data Science Collaboratory), and the DAWN Consortium. DF acknowledges the support of TRIPODS award 1740751. KS acknowledges support from NSF CAREER Award 1750575 and a Sloan Research Fellowship.
References
Naman Agarwal and Elad Hazan. Lower bounds for higher-order convex optimization. In Conference On Learning Theory, pages 774–792, 2018.

Zeyuan Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In Advances in Neural Information Processing Systems, pages 1165–1175, 2018a.

Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2675–2686, 2018b.

Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster algorithms for matrix multiplicative weight updates. In International Conference on Machine Learning, 2017.

Yossi Arjevani and Ohad Shamir. On the iteration complexity of oblivious first-order optimization algorithms. In International Conference on Machine Learning, pages 908–916, 2016.

Yossi Arjevani and Ohad Shamir. Oracle complexity of second-order methods for finite-sum problems. In Proceedings of the 34th International Conference on Machine Learning, pages 205–213, 2017.

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019a.

Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, 178(1-2):327–360, 2019b.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. In Proceedings of the 34th International Conference on Machine Learning, pages 654–663, 2017.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, May 2019a.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points II: First-order methods. Mathematical Programming, September 2019b.

Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Improved second-order evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. arXiv preprint arXiv:1708.04044, 2017.

Ashok Cutkosky and Harsh Mehta. Momentum improves normalized SGD. In International Conference on Machine Learning, 2020.

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, 2019.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.

Cong Fang, Zhouchen Lin, and Tong Zhang. Sharp analysis for nonconvex SGD escaping from saddle points. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pages 1192–1234, 2019.

Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, and Blake Woodworth. The complexity of making the gradient small in stochastic convex optimization. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 1319–1345, 2019.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732, 2017.

Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2348–2358, 2017.

Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. Foundations of Computational Mathematics, 2019.

Lester Mackey, Michael I Jordan, Richard Y Chen, Brendan Farrell, and Joel A Tropp. Matrix concentration inequalities via the method of exchangeable pairs. The Annals of Probability, 42(3):906–945, 2014.

Renato DC Monteiro and Benar Fux Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.

Katta G Murty and Santosh N Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

Arkadi Nemirovski and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, pages 2613–2621, 2017.

Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.

Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

Max Simchowitz. On the randomized complexity of minimizing a convex quadratic function. arXiv preprint arXiv:1807.09386, 2018.

Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.

Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908, 2018.

Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost and momentum: Faster stochastic variance reduction algorithms. In Advances in Neural Information Processing Systems, 2019.

Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5530–5540, 2018.

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3925–3936, 2018.

Contents of Appendix
A Detailed comparison with existing rates
B Comparison: multi-point queries and mean-squared smoothness
C Variance-reduced gradient estimator (HVP-RVR)
D Supporting technical results
  D.1 Error bound for empirical Hessian
  D.2 Descent lemma for stochastic gradient descent
  D.3 Descent lemma for cubic-regularized trust-region method
  D.4 Stochastic negative curvature search
E Upper bounds for finding ε-stationary points
  E.1 Proof of Theorem 1
  E.2 Proof of Theorem 2
F Upper bounds for finding (ε, γ)-second-order-stationary points
  F.1 Full statement and proof for Algorithm 4
  F.2 Full statement and proof for Algorithm 5
G Lower bounds
  G.1 Proof of Theorem 3
    G.1.1 Bounding the operator norm of ∇^p_i F_T
  G.2 Proof of Theorem 6

A Detailed comparison with existing rates
Method | Uses ∇̂²F? | Complexity bound | Additional assumptions
SGD (Ghadimi and Lan, 2013) | No | O(∆L₁σ₁²ε^{-4}) | None
Restarted SGD (Fang et al., 2019) | No | O(∆L₂^{0.5}σ₁²ε^{-3.5})† | ∇̂F Lipschitz almost surely
Normalized SGD (Cutkosky and Mehta, 2020) | No | O(∆L₂^{0.5}σ₁²ε^{-3.5})† | None
Subsampled regularized Newton (Tripuraneni et al., 2018) | Yes* | O(∆L₂^{0.5}σ₁²ε^{-3.5})† | None
Recursive variance reduction (e.g., Fang et al., 2018) | No | O(∆σ₁σ_mss ε^{-3} + ∆L₁ε^{-2}) | Mean-squared smoothness (σ_mss ≤ σ₂), simultaneous queries (Appendix B)
SGD with HVP-RVR (Algorithm 2) | Yes* | O(∆σ₁σ₂ε^{-3} + ∆L₂^{0.5}σ₁ε^{-2.5} + ∆L₁ε^{-2}) | None
Subsampled Newton with HVP-RVR (Algorithm 3) | Yes | O(∆σ₁σ₂ε^{-3} + ∆L₂^{0.5}σ₁ε^{-2.5} + ∆σ₂ε^{-2}) | None

Table 2: Detailed comparison of guarantees for finding ε-stationary points (satisfying E‖∇F(x)‖ ≤ ε) for a function F with L₁-Lipschitz gradient and L₂-Lipschitz Hessian. Here ∆ is the initial optimality gap, and σ_p² is the variance of ∇̂^p F. Algorithms marked with * require only stochastic Hessian-vector products. Complexity bounds marked with † show only the leading-order term in ε.

B Comparison: multi-point queries and mean-squared smoothness
Stochastic first-order methods that utilize variance reduction (Lei et al., 2017; Fang et al., 2018; Zhou et al., 2018) employ the following mean-squared smoothness (MSS) assumption on the stochastic gradient estimator:

    E‖∇̂F(x, z) − ∇̂F(y, z)‖² ≤ L̄²‖x − y‖² for all x, y ∈ R^d.

Since E[∇̂F(x, z)] = ∇F(x), this is equivalent to assuming

    E‖∇̂F(x, z) − ∇̂F(y, z) − (∇F(x) − ∇F(y))‖² ≤ σ_mss²‖x − y‖² for all x, y ∈ R^d,    (15)

for some σ_mss ≤ 2L̄. In fact, while it always holds that L̄ ≤ L₁ + σ_mss, inspection of the results of Fang et al. (2018); Wang et al. (2019) shows one can replace L̄ with σ_mss in the leading terms of their complexity bounds without any change to the algorithms.

Algorithms that take advantage of the MSS structure rely on the following additional simultaneous query assumption (which is a special case of (4) for n = 2):

    We may query x, y ∈ R^d and observe O_F(x, z) and O_F(y, z) for the same draw of z ∼ P_z.    (16)

In empirical risk minimization problems, z represents the datapoint index and possibly data augmentation parameters, and the value of z is typically part of the query, which means that assumption (16) indeed holds. In certain online learning settings, however, the assumption can fail. For example, the variable z could represent the instantaneous power demands in an electric grid, and testing two grid configurations for the same grid state might be impractical.

We observe that assuming access to both an MSS gradient estimator and simultaneous two-point queries is stronger than assuming a bounded variance stochastic Hessian-vector product estimator. This holds because the former allows us to simulate the latter with finite differencing. Formally, we have the following.
Observation 1. Let F have an L₂-Lipschitz Hessian, let ∇̂F satisfy (15), and assume we have access to a two-point query oracle as in (16). Then, for any δ > 0 and any unit vector u, the Hessian-vector product estimator

    ∇̂²F_δ(x, z)u := (1/δ)[∇̂F(x + δ·u, z) − ∇̂F(x, z)]    (17)

satisfies

    ‖E[∇̂²F_δ(x, z)u] − ∇²F(x)u‖ ≤ L₂δ/2 and E‖∇̂²F_δ(x, z)u − ∇²F(x)u‖² ≤ σ_mss² + L₂²δ²/4.
Proof. We have E[∇̂²F_δ(x, z)u] = (1/δ)[∇F(x + δ·u) − ∇F(x)], and by Lipschitz continuity of ∇²F,

    ‖∇F(x + δ·u) − ∇F(x) − ∇²F(x)[δu]‖ ≤ (L₂/2)δ²‖u‖² = (L₂/2)δ²,

which implies the bound on the bias. To bound the variance, we note that

    E‖∇̂²F_δ(x, z)u − E[∇̂²F_δ(x, z)u]‖² ≤ (1/δ²)·E‖∇̂F(x + δu, z) − ∇̂F(x, z) − [∇F(x + δu) − ∇F(x)]‖² ≤ (1/δ²)·σ_mss²‖δu‖² = σ_mss²,

by the MSS property (15). ∎

We conclude from Observation 1 that Algorithm 2, which only requires stochastic Hessian-vector products, attains O(ε^{-3}) complexity under assumptions no stronger than those of previous algorithms. In fact, we now show that our assumptions are strictly weaker than those of prior work. That is, while an MSS gradient estimator implies a bounded variance Hessian estimator, the opposite is not true in general. This is simply due to the fact that in our oracle model, ∇̂F and ∇̂²F can be completely unrelated. Consider for example the case where P_z is uniform on {−1, 1} and

    ∇̂F(x, z) = ∇F(x) + (x/‖x‖)·z for x ≠ 0, ∇̂F(x, z) = ∇F(x) for x = 0, while ∇̂²F(x, z) = ∇²F(x).

Clearly ∇̂F is not MSS, even though ∇̂²F has zero variance.

There is, however, an important setting where bounded variance for ∇̂²F does imply that ∇̂F is MSS. Suppose that the derivative of ∇̂F(x, z) exists, and has the form

    ∇[∇̂F(x, z)] = ∇̂²F(x, z).    (18)

That is, the Hessian estimator is the Jacobian of the gradient estimator. In this case, bounded variance for the Hessian estimator implies mean-squared smoothness.
Let F have gradient and Hessian estimators (cid:100) ∇ F and (cid:91) ∇ F satisfying (3) and (18).Then (cid:100) ∇ F has the MSS property (15) with σ mss ≤ σ . Proof.
Under the property (18), we have (cid:100) ∇ F ( x, z ) − (cid:100) ∇ F ( y, z ) − [ ∇ F ( x ) − ∇ F ( y )]= (cid:90) (cid:16) (cid:91) ∇ F ( xt + y (1 − t ) , z ) − ∇ F ( xt + y (1 − t )) (cid:17) ( x − y ) dt. Taking the squared norm, applying Jensen’s inequality, and substituting the variance bound (3)gives the MSS property (15).The property (18) holds for empirical risk minimization, where we have the more general relation (cid:91) ∇ p F ( x, z ) = ∇ p (cid:98) F ( x, z ) for any p ; That is, all the stochastic derivative estimators are themselvesthe derivatives of a single stochastic function. Therefore, by Observation 1 and Observation 2, inempirical risk minimization settings, mean-square smoothness is essentially equivalent to boundedvariance of the stochastic Hessian estimator. C Variance-reduced gradient estimator (
HVP-RVR ) In this section we prove Lemma 1. First, we formally describe the protocol in which our optimizationalgorithms query the gradient estimator
HVP-RVR-Gradient-Estimator described in Algorithm 1, anddefine some additional notation.Given a function F ∈ F (∆ , L , L ) and a stochastic second-order oracle in O ( F, σ ), theoptimization algorithm interacts with
HVP-RVR-Gradient-Estimator by sequentially querying points (cid:8) x ( t ) (cid:9) ∞ t =1 with reset probabilities (cid:8) b ( t ) (cid:9) ∞ t =1 , to obtain estimates g ( t ) for ∇ F ( x ( t ) ) for each time t ;that is, x ( t ) = A ( t ) ( g (0) , g (1) , . . . , g ( t − ; r ( t − ) , b ( t ) = B ( t ) ( r ( t − ) , and g ( t ) = HVP-RVR-Gradient-Estimator (cid:15),b ( t ) ( x ( t ) , x ( t − , g ( t − ) , (19)where A ( t ) , B ( t ) are measurable mappings modeling the optimization algorithm and { r ( t ) } is anindependent sequence of random seeds. That is, Lemma 1 holds for any sequence of queries where x ( t ) , and b ( t ) are adapted to the filtration G ( t ) = σ (cid:16) { g ( j ) , r ( j ) } j We prove that E (cid:13)(cid:13) g ( t ) − ∇ F ( x ( t ) ) (cid:13)(cid:13) ≤ (cid:32) − E [ b ( t ) ]2 (cid:33) E (cid:13)(cid:13) g ( t − − ∇ F ( x ( t − ) (cid:13)(cid:13) + E [ b ( t ) ]2 (cid:15) , whence the result follows by a simple induction whose basis is E (cid:13)(cid:13) g (1) − ∇ F ( x (1) ) (cid:13)(cid:13) ≤ σ n ≤ (cid:15) . Let C ( t ) denote the value of the coin toss in the t th call to Algorithm 1 (Line 3), recalling that C ( t ) ∼ Bernoulli( b ( t ) ). Writing e ( t ) = g ( t ) − ∇ F ( x ( t ) ) for brevity, we have E (cid:104)(cid:13)(cid:13) e ( t ) (cid:13)(cid:13) (cid:12)(cid:12)(cid:12) b ( t ) (cid:105) = b ( t ) E (cid:104)(cid:13)(cid:13) e ( t ) (cid:13)(cid:13) (cid:12)(cid:12)(cid:12) C ( t ) = 1 (cid:105) + (1 − b ( t ) ) E (cid:104)(cid:13)(cid:13) e ( t ) (cid:13)(cid:13) (cid:12)(cid:12)(cid:12) C ( t ) = 0 (cid:105) . (20)Clearly, E (cid:104)(cid:13)(cid:13) e ( t ) (cid:13)(cid:13) (cid:12)(cid:12)(cid:12) C ( t ) = 1 (cid:105) ≤ σ n = (cid:15) . (21)Moreover, conditional on C ( t ) = 0, we have from the definition of the gradient estimator that e ( t ) = e ( t − + ψ ( t ) , where ψ ( t ) := K ( t ) (cid:88) k =1 (cid:91) ∇ F ( x ( t,k − , z ( t,k ) ) (cid:16) x ( t,k ) − x ( t,k − (cid:17) − ∇ F ( x ( t ) ) + ∇ F ( x ( t − ) , and K ( t ) = (cid:38) (cid:0) σ + L (cid:15) (cid:1) b ( t ) (cid:15) · (cid:107) x ( t ) − x ( t − (cid:107) (cid:39) , (22)where x ( t,k ) and x ( t,k ) respectively denote the values of x ( k ) and z ( k ) (defined on Line 8) during the t th call to Algorithm 1.We may therefore decompose the error conditional on C ( t ) = 0 as E (cid:104)(cid:13)(cid:13) e ( t ) (cid:13)(cid:13) (cid:12)(cid:12)(cid:12) C ( t ) = 0 (cid:105) ( i ) = E (cid:13)(cid:13) e ( t − + E (cid:2) ψ ( t ) (cid:12)(cid:12) G ( t ) (cid:3)(cid:13)(cid:13) + E (cid:13)(cid:13) ψ ( t ) − E (cid:2) ψ ( t ) (cid:12)(cid:12) G ( t ) (cid:3)(cid:13)(cid:13) ii ) ≤ E (cid:20)(cid:18) b ( t ) (cid:19)(cid:13)(cid:13) e ( t − (cid:13)(cid:13) (cid:21) + E (cid:20)(cid:18) b ( t ) (cid:19)(cid:13)(cid:13) E (cid:2) ψ ( t ) (cid:12)(cid:12) G ( t ) (cid:3)(cid:13)(cid:13) (cid:21) + E (cid:13)(cid:13) ψ ( t ) − E (cid:2) ψ ( t ) (cid:12)(cid:12) G ( t ) (cid:3)(cid:13)(cid:13) , (23)where ( i ) is due to e ( t − ∈ G ( t ) and ( ii ) is due to Young’s inequality.The facts that z ( t,k ) is independent from G ( t ) , that ∇ F ( x ( t ) ) − ∇ F ( x ( t − ) ∈ G ( t ) , and that (cid:91) ∇ F ( · ) is unbiased give E (cid:104) ψ ( t ) (cid:12)(cid:12)(cid:12) G ( t ) (cid:105) = K ( t ) (cid:88) k =1 ∇ F ( x ( t,k − ) (cid:16) x ( t,k ) − x ( t,k − (cid:17) − ∇ F ( x ( t ) ) + ∇ F ( x ( t − )22or every t . 
Consequently, the scaling (22) and Hessian estimator variance bound imply E (cid:20)(cid:13)(cid:13)(cid:13) ψ ( t ) − E (cid:104) ψ ( t ) (cid:12)(cid:12) G ( t ) (cid:105)(cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) G ( t ) (cid:21) ( (cid:63) ) = 1( K ( t ) ) K ( t ) (cid:88) k =1 E (cid:20)(cid:13)(cid:13)(cid:13) ( (cid:91) ∇ F ( x ( t,k − , z ( t,k ) ) − ∇ F ( x ( t,k − ))( x ( t ) − x ( t − ) (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) G ( t ) (cid:21) ≤ K ( t ) ) K ( t ) (cid:88) k =1 E (cid:20)(cid:13)(cid:13)(cid:13) (cid:91) ∇ F ( x ( t,k − , z ( t,k ) ) − ∇ F ( x ( t,k − ) (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) G ( t ) (cid:21)(cid:13)(cid:13) x ( t ) − x ( t − (cid:13)(cid:13) ≤ σ · (cid:107) x ( t ) − x ( t − (cid:107) K ( t ) ≤ b ( t ) · (cid:15) , (24)where the equality ( (cid:63) ) above is due to the fact that z ( t, , . . . , z ( t,K ( t ) ) are i.i.d., as well as x ( t,k ) − x ( t,k − = K ( t ) ( x ( t ) − x ( t − ).Next, we observe that Taylor’s theorem and fact that F has L -Lipschitz Hessian implies that (cid:107)∇ F ( x (cid:48) ) − ∇ F ( x ) − ∇ ( x ) F ( x (cid:48) − x ) (cid:107) ≤ L (cid:107) x (cid:48) − x (cid:107) for all x, x (cid:48) ∈ R d . Therefore, (cid:13)(cid:13)(cid:13) E (cid:104) ψ ( t ) (cid:12)(cid:12)(cid:12) G ( t ) (cid:105)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) K ( t ) (cid:88) k =1 ∇ F ( x ( t,k ) ) − ∇ F ( x ( t,k − ) − ∇ F ( x ( t,k − ) (cid:16) x ( t,k ) − x ( t,k − (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ K ( t ) (cid:88) k =1 (cid:13)(cid:13)(cid:13) ∇ F ( x ( t,k ) ) − ∇ F ( x ( t,k − ) − ∇ F ( x ( t,k − ) (cid:16) x ( t,k ) − x ( t,k − (cid:17)(cid:13)(cid:13)(cid:13) ≤ K ( t ) · L · (cid:32) (cid:107) x ( t ) − x ( t − (cid:107) K ( t ) (cid:33) ≤ b ( t ) · (cid:15) , (25)where we used (22) again.Substituting back through equations (25), (24), (23), (21) and (20), we have E (cid:13)(cid:13) e ( t ) (cid:13)(cid:13) ≤ E (cid:104) b ( t ) · (cid:15) + (1 − b ( t ) ) (cid:16) (1 + b ( t ) ) (cid:13)(cid:13) e ( t − (cid:13)(cid:13) + (1 + b ( t ) )( b ( t ) (cid:15) ) + b ( t ) · (cid:15) (cid:17)(cid:105) ≤ (cid:16) − E [ b ( t ) ]2 (cid:17) E (cid:13)(cid:13) g ( t − − ∇ F ( x ( t − ) (cid:13)(cid:13) + E [ b ( t ) ]2 (cid:15) ≤ (cid:15) , as required; the second inequality follows from algebraic manipulation and the fact that e ( t − isindependent of b ( t ) by assumption.The following lemma bounds the number of oracle queries made per call to the gradient estimator. Lemma 3. The expected number of stochastic oracle queries made by HVP-RVR-Gradient-Estimator when called a single time with arguments ( x , x prev , g prev ) and parameters ( (cid:15), b ) is at most (cid:32) bσ (cid:15) + ( σ + L (cid:15) ) · (cid:107) x − x prev (cid:107) b(cid:15) (cid:33) . Proof. Let m denote the number of oracle calls made by the gradient estimator when invoked witharguments ( x , x prev , g prev ). For any call to the estimator, there are two cases, either (a) C = 1, or(b) C = 0. In the first case, the gradient estimator queries the oracle n times at the point x andreturns the empirical average of the returned stochastic estimates (see Line 5 in Algorithm 1). Thus,23 = n for this case. In the second case, the estimator queries the oracle once for each point inthe set (cid:0) x ( k − (cid:1) Kk =1 , and updates the gradient using a stochastic path integral as in Line 8. 
Thus, m = K for this case.Combining the two cases, using C ∼ Bernoulli( b ) and substituting in the values of n and K , weget E [ m ] = Pr( C = 1) E [ m | C = 1] + Pr( C = 0) E [ m | C = 0]= E [ b · n + (1 − b ) · K ]= (cid:24) bσ (cid:15) (cid:25) + (cid:38) σ + L (cid:15) ) · (cid:107) x − x prev (cid:107) b(cid:15) (cid:39) ≤ (cid:32) bσ (cid:15) + ( σ + L (cid:15) ) · (cid:107) x − x prev (cid:107) b(cid:15) + 1 (cid:33) , where the final inequality follows from (cid:100) x (cid:101) ≤ x + 1. D Supporting technical results D.1 Error bound for empirical Hessian In order to find the negative curvature direction at a given point or to build a cubic regularizedsub-model, Algorithm 3 and Algorithm 5 estimate the Hessian by computing an empirical averageof the stochastic Hessian queries to the oracle. The following lemma is a standard result whichbounds the expected error for the empirical Hessian. Lemma 4. Given a function F ∈ F (∆ , ∞ , L ) , a stochastic oracle in O ( F, σ ) and a point x , let H := m (cid:80) mi =1 (cid:91) ∇ F ( x, z ( i ) ) denote the empirical Hessian at the point x estimated using m stochasticqueries at x , where z ( i ) i . i . d . ∼ P z . Then E (cid:104)(cid:13)(cid:13) H − ∇ F ( x ) (cid:13)(cid:13) (cid:105) ≤ σ log( d ) m . Proof. This is an immediate consequence of Lemma 5 below, using A i := (cid:91) ∇ F ( x, z ( i ) ) and B := ∇ F ( x ). Lemma 5. Let ( A i ) ni =1 be a collection of i.i.d. matrices in S d , with E [ A i ] = B and E (cid:107) A i − B (cid:107) ≤ σ . Then it holds that E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 A i − B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ σ log dn . Proof. We drop the normalization by n throughout this proof. We first symmetrize. Observe thatby Jensen’s inequality we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 A i − B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ E A E A (cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 A i − A (cid:48) i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = E A E A (cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 ( A i − B ) − ( A (cid:48) i − B ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E A E A (cid:48) E (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:15) i (( A i − B ) − ( A (cid:48) i − B )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ E A E (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:15) i ( A i − B ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) , where ( A (cid:48) ) ni =1 is a sequence of independent copies of ( A i ) ni =1 and ( (cid:15) i ) ni =1 are Rademacher randomvariables. Henceforth we condition on A . Let p = log d , and let (cid:107)·(cid:107) S p denote the Schatten p -norm.In what follows, we will use that for any matrix X , (cid:107) X (cid:107) op ≤ (cid:107) X (cid:107) S p ≤ e / (cid:107) X (cid:107) op . To begin, wehave E (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:15) i ( A i − B ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ E (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:15) i ( A i − B ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) S p ≤ E (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:15) i ( A i − B ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) pS p /p , where the second inequality follows by Jensen. 
We now apply the matrix Khintchine inequality(Mackey et al., 2014, Corollary 7.4), which implies that E (cid:15) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:15) i ( A i − B ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) pS p /p ≤ (2 p − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 ( A i − B ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) S p ≤ (2 p − n (cid:88) i =1 (cid:107) ( A i − B ) (cid:107) S p ≤ e (2 p − n (cid:88) i =1 (cid:107) ( A i − B ) (cid:107) . Putting all the developments so far together and taking expectation with respect to A , we have E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 A i − B (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ e (2 p − n (cid:88) i =1 E A i (cid:107) ( A i − B ) (cid:107) ≤ e (2 p − nσ . To obtain the final result we normalize by n . D.2 Descent lemma for stochastic gradient descent The following lemma characterizes the effect of gradient descent update step used by Algorithm 2and Algorithm 4. Lemma 6. Given a function F ∈ F (∆ , L , ∞ ) , a point x , and gradient estimator g at x, define y := x − ηg. Then, for any η ≤ L , the point y satisfies F ( x ) − F ( y ) ≥ η (cid:107)∇ F ( x ) (cid:107) − η (cid:107)∇ F ( x ) − g (cid:107) . Proof. Since, the gradient of F is L -Lipschitz, we have F ( y ) ≤ F ( x ) + (cid:104)∇ F ( x ) , y − x (cid:105) + L (cid:107) y − x (cid:107) i ) = F ( x ) − η (cid:104)∇ F ( x ) , g (cid:105) + L η (cid:107) g (cid:107) = F ( x ) − η (cid:104)∇ F ( x ) − g, g (cid:105) − η (cid:107) g (cid:107) + L η (cid:107) g (cid:107) ii ) ≤ F ( x ) + η (cid:107)∇ F ( x ) − g (cid:107)(cid:107) g (cid:107) − η (cid:18) − L η (cid:19) (cid:107) g (cid:107) iii ) ≤ F ( x ) + η (cid:107)∇ F ( x ) − g (cid:107) − η (cid:18) − L η (cid:19) (cid:107) g (cid:107) iv ) ≤ F ( x ) + η (cid:107)∇ F ( x ) − g (cid:107) − η (cid:107) g (cid:107) v ) ≤ F ( x ) + 3 η (cid:107)∇ F ( x ) − g (cid:107) − η (cid:107)∇ F ( x ) (cid:107) , (26)where ( i ) uses that y − x = ηg , ( ii ) is due to the Cauchy-Schwarz inequality, ( iii ) is given by anapplication of the AM-GM inequality and ( iv ) holds because η ≤ L . Finally, ( v ) follows by invokingJensen’s inequality for the function (cid:107)·(cid:107) to upper bound (cid:107)∇ F ( x ) (cid:107) ≤ (cid:16) (cid:107)∇ F ( x − g ) (cid:107) + (cid:107) g (cid:107) (cid:17) .Rearranging the terms in (26), we get, F ( x ) − F ( y ) ≥ η (cid:107)∇ F ( x ) (cid:107) − η (cid:107)∇ F ( x ) − g (cid:107) . D.3 Descent lemma for cubic-regularized trust-region method The following lemmas establish properties for the updates step involving constrained minimizationof the cubic regularized model in used in Algorithm 3 and Algorithm 5. Lemma 7. Given a function F ∈ F (∆ , ∞ , L ) , gradient estimator g ∈ R d and hessian estimator H ∈ S d , define m x ( y ) = F ( x ) + (cid:104) g, y − x (cid:105) + H y − x, y − x ] + M (cid:107) y − x (cid:107) , and let y ∈ arg min z ∈ B η ( x ) m x ( z ) . Then, for any M ≥ L and η ≥ , the point y satisfies F ( x ) − F ( y ) ≥ M (cid:107) y − x (cid:107) − √ M (cid:107)∇ F ( x ) − g (cid:107) + 4 η √ M (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) . Proof. 
Since ∇ F is L -Lipschitz, we have F ( y ) − F ( x ) ≤ F ( x ) + (cid:104)∇ F ( x ) , y − x (cid:105) + 12 ∇ F ( x )[ y − x, y − x ] + L (cid:107) y − x (cid:107) − F ( x ) ( i ) = m x ( y ) + L − M (cid:107) y − x (cid:107) + (cid:104)∇ F ( x ) − g, y − x (cid:105) + 12 ∇ F ( x )[ y − x, y − x ] − H [ y − x, y − x ] − m x ( x ) ( ii ) ≤ − M (cid:107) y − x (cid:107) + (cid:107)∇ F ( x ) − g (cid:107)(cid:107) y − x (cid:107) + 12 (cid:13)(cid:13) ∇ F ( x )[ y − x, · ] − H [ y − x, · ] (cid:13)(cid:13) (cid:107) y − x (cid:107) , (27)where ( i ) follows from the definition of m x ( · ) and ( ii ) follows by the fact that y ∈ arg min y (cid:48) B η ( x ) m x ( y (cid:48) ),along with an application of the Cauchy-Schwarz inequality for remainder of the terms, and because M ≥ L . Additionally, using Young’s inequality, we have (cid:107)∇ F ( x ) − g (cid:107)(cid:107) y − x (cid:107) ≤ √ M (cid:107)∇ F ( x ) − g (cid:107) + M (cid:107) y − x (cid:107) , (cid:13)(cid:13) ∇ F ( x )[ y − x, · ] − H [ y − x, · ] (cid:13)(cid:13) (cid:107) y − x (cid:107) ≤ √ M (cid:13)(cid:13) ∇ F ( x )[ y − x, · ] − H [ y − x, · ] (cid:13)(cid:13) + M (cid:107) y − x (cid:107) . Plugging these bounds into (27), we have F ( y ) − F ( x ) ≤ − M (cid:107) y − x (cid:107) + 8 √ M (cid:107)∇ F ( x ) − g (cid:107) + 4 √ M (cid:13)(cid:13) ∇ F ( x )[ y − x, · ] − H [ y − x, · ] (cid:13)(cid:13) ( i ) ≤ − M (cid:107) y − x (cid:107) + 8 √ M (cid:107)∇ F ( x ) − g (cid:107) + 4 √ M (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op (cid:107) y − x (cid:107) ( ii ) ≤ − M (cid:107) y − x (cid:107) + 8 √ M (cid:107)∇ F ( x ) − g (cid:107) + 4 √ M (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op · η , where ( i ) follows by the definition of the operator norm and ( ii ) follows by observing that (cid:107) y − x (cid:107) ≤ η .Rearranging the terms, we have F ( x ) − F ( y ) ≥ M (cid:107) y − x (cid:107) − √ M (cid:107)∇ F ( x ) − g (cid:107) + 4 η √ M (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) . Lemma 8. Under the same setting as Lemma 7, the point y satisfies (cid:26) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:27) ≤ η (cid:107) y − x (cid:107) + 2 M η (cid:16) (cid:107)∇ F ( x ) − g (cid:107) + η (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op (cid:17) . Proof. There are two scenarios: ( i ) either y lies on the boundary of B η ( x ), or ( ii ) y is in the interiorof B η ( x ). In the first case, (cid:107) y − x (cid:107) = η . 
In the second case, (cid:107)∇ F ( y ) (cid:107) ( i ) ≤ (cid:13)(cid:13) ∇ F ( y ) − ∇ F ( x ) − ∇ F ( x )[ y − x, · ] (cid:13)(cid:13) + (cid:13)(cid:13) ∇ F ( x ) + ∇ F ( x )[ y − x, · ] (cid:13)(cid:13) ( ii ) ≤ L (cid:107) y − x (cid:107) + (cid:13)(cid:13) ∇ F ( x ) + ∇ F ( x )[ y − x, · ] (cid:13)(cid:13) ( iii ) ≤ L (cid:107) y − x (cid:107) + (cid:107)∇ F ( x ) − g (cid:107) + (cid:13)(cid:13) ∇ F ( x )[ y − x, · ] − H [ y − x, · ] (cid:13)(cid:13) + (cid:107) g + H [ y − x, · ] (cid:107) ( iv ) ≤ L (cid:107) y − x (cid:107) + (cid:107)∇ F ( x ) − g (cid:107) + (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op · η + (cid:107) g + H [ y − x, · ] (cid:107) ( v ) ≤ L + M (cid:107) y − x (cid:107) + (cid:107)∇ F ( x ) − g (cid:107) + (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op · η, (28)where ( i ) follows by triangle inequality, ( ii ) follows by Taylor expansion of ∇ F ( y ) at x and observingthat F is L -hessian Lipschitz, ( iii ) follows by another application of the triangle inequality, ( iv )follows from Cauchy-Schwarz inequality and observing that (cid:107) y − x (cid:107) ≤ η , and ( v ) follows by usingfirst order optimization conditions for y ∈ arg min B η ( x ) m x ( y ), i.e., (cid:107)∇ (cid:98) m x ( y ) (cid:107) = 0 , or, g + H [ y − x, · ] + M (cid:107) y − x (cid:107) ( y − x ) = . Rearranging the terms in (28), we get, (cid:107) y − x (cid:107) ≥ L + M (cid:16) (cid:107)∇ F ( y ) (cid:107) − (cid:107)∇ F ( x ) − g (cid:107) − (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op · η (cid:17) . (cid:107) y − x (cid:107) < η or (cid:107) y − x (cid:107) = η ) must hold, we have, (cid:107) y − x (cid:107) ≥ min (cid:26) η , L + M (cid:16) (cid:107)∇ F ( y ) (cid:107) − (cid:107)∇ F ( x ) − g (cid:107) − η · (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) (cid:17)(cid:27) ≥ min (cid:26) η , L + M (cid:107)∇ F ( y ) (cid:107) (cid:27) − L + M (cid:107)∇ F ( x ) − g (cid:107) − ηL + M (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op . Rearranging the terms, and using the fact that M ≥ L , we havemin (cid:26) M η , (cid:107)∇ F ( y ) (cid:107) (cid:27) ≤ M (cid:107) y − x (cid:107) + (cid:107)∇ F ( x ) − g (cid:107) + η (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op . Finally, using the fact that for any a, b ≥ 0, min { a, b } ≤ a { b ≥ a } , we have M η (cid:26) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:27) ≤ M (cid:107) y − x (cid:107) + (cid:107)∇ F ( x ) − g (cid:107) + η (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op , or, equivalently, (cid:26) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:27) ≤ η (cid:107) y − x (cid:107) + 2 M η (cid:16) (cid:107)∇ F ( x ) − g (cid:107) + η (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op (cid:17) . Lemma 9. Consider the same setting as Lemma 7, but let H ∈ S d and g ∈ R d be random variables.Then the random variable y satisfies E [ F ( x ) − F ( y )] ≥ M η 60 Pr (cid:16) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:17) − √ M · E (cid:104) (cid:107)∇ F ( x ) − g (cid:107) (cid:105) − η √ M · E (cid:104)(cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) (cid:105) , where Pr( · ) and E [ · ] are taken with respect to the randomness over H and g . Proof. For the ease of notation, let χ and ζ denote the error in the gradient estimator g and thehessian estimator H at x respectively, i.e. χ := (cid:107)∇ F ( x ) − g (cid:107) and ζ := (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op . We prove the desired statement by combining the following two results. 
• First, plugging x = x , and z = y in to Lemma 7, we have F ( x ) − F ( y ) ≥ M (cid:107) y − x (cid:107) − √ M χ t − √ M ( ηζ ) . Taking expectations on both the sides, we get, E [ F ( x ) − F ( y )] ≥ M E (cid:104) (cid:107) y − x (cid:107) (cid:105) − √ M E (cid:104) ( χ ) (cid:105) − √ M E (cid:104) ( ηζ ) (cid:105) ≥ M E (cid:104) (cid:107) y − x (cid:107) (cid:105) − √ M (cid:0) E (cid:2) χ t (cid:3)(cid:1) − √ M (cid:0) η E (cid:2) ζ t (cid:3)(cid:1) , (29)where the last inequality follows from an application of Jensen’s inequality.28 Similarly, plugging x = x , z = y in Lemma 8, we get (cid:26) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:27) ≤ η (cid:107) y − x (cid:107) + 2 M η ( χ + ηζ ) . Raising both the sides with the exponent of , we get (cid:26) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:27) ≤ (cid:18) η (cid:107) y − x (cid:107) + 2 M η ( χ + ηζ ) (cid:19) ≤ η (cid:107) y − x (cid:107) + 5 M η (cid:16) χ + ( ηζ ) (cid:17) . Taking expectations on both the sides and rearranging the terms implies that E (cid:104) (cid:107) x ( t +1) − x (cid:107) (cid:105) ≥ η (cid:18) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:19) − M E (cid:104) χ + ( ηζ ) (cid:105) ≥ η (cid:18) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:19) − M (cid:16)(cid:0) E (cid:2) χ t (cid:3)(cid:1) + (cid:0) η E (cid:2) ζ t (cid:3)(cid:1) (cid:17) , (30)where the last inequality follows from an application of the Jensen’s inequality.Plugging (30) into (29), we get E [ F ( x ) − F ( y )] ≥ M η 60 Pr (cid:18) (cid:107)∇ F ( y ) (cid:107) ≥ M η (cid:19) − √ M (cid:0) E (cid:2) χ t (cid:3)(cid:1) − η √ M (cid:0) E (cid:2) ζ t (cid:3)(cid:1) . The final statement follows from the above inequality by using the definition of χ and ζ . D.4 Stochastic negative curvature search The following lemma establishes properties of the negative curvature search step used in Algorithm 4and Algorithm 5. Lemma 10. Let γ > , and F ∈ F (∆ , ∞ , L ) be given. Let x ∈ R d be given, and let H ∈ S d be arandom variable (representing a stochastic estimator for the Hessian at x ). Define y via y := (cid:26) x + rγL · u, if λ min ( H ) ≤ − γ,x, otherwise. , where r is an independent Rademacher random variable and u is an arbitrary unit vector such that H [ u, u ] ≤ − γ . Then, the point y satisfies E [ F ( x ) − F ( y )] ≥ γ L Pr( λ min ( H ) ≤ − γ ) − γ L E (cid:104)(cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op (cid:105) , where Pr( · ) and E [ · ] are taken with respect to the randomness in H and r . Proof. There are two cases: either (a) λ min ( H ) > − γ , or, (b) λ min ( H ) ≤ − γ . In the first case, y = x , and thus, F ( y ) − F ( x ) = 0 ≤ γ L (cid:13)(cid:13) H − ∇ F ( x ) (cid:13)(cid:13) op (31)29n the second case, Taylor expansion for F ( y ) at F ( x ) implies that F ( y ) ≤ F ( x ) + (cid:104)∇ F ( x ) , ˜ u (cid:105) + 12 ∇ F ( x )[˜ u, ˜ u ] + L (cid:107) ˜ u (cid:107) , where ˜ u := rγL · u . 
Taking expectations on both the sides with respect to r , we get E r [ F ( y )] ( i ) = F ( x ) + γ L ∇ F ( x )[ u, u ] + γ L (cid:107) u (cid:107) ≤ F ( x ) + γ L (cid:0) H [ u, u ] + ∇ F ( x )[ u, u ] − H [ u, u ] (cid:1) + γ L (cid:107) u (cid:107) ii ) = F ( x ) + γ L (cid:16) − γ + (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op (cid:17) + γ L ≤ F ( x ) − γ L + γ L (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op , (32)where ( i ) is given by the fact that E r [ (cid:104)∇ F ( x ) , ru (cid:105) ] = 0, and ( ii ) follows from the fact that u ischosen such that E (cid:2) ∇ F ( x )[ u, u ] (cid:3) ≤ − γ and (cid:107) u (cid:107) = 1, and the fact that for any matrix A andvector b , (cid:107) Ab (cid:107) ≤ (cid:107) A (cid:107) op (cid:107) b (cid:107) .Since, one of the two cases ( λ min ( H ) > − γ or λ min ( H ) ≤ − γ ) must hold, combining (31) and(32), we have E r [ F ( y )] ≤ F ( x ) − γ L { λ min ( H ) ≤ − γ } + γ L (cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op . Taking expectation on both the sides gives the desired statement: E [ F ( x ) − F ( y )] ≥ γ L Pr( λ min ( H ) ≤ − γ ) − γ L E (cid:104)(cid:13)(cid:13) ∇ F ( x ) − H (cid:13)(cid:13) op (cid:105) . The following lemma establishes properties of Oja’s method ( Oja ), as used in Algorithm 4. Lemma 11 (Allen-Zhu (2018b), Lemma 5.3) . The procedure Oja takes as input a point x ∈ R d , astochastic Hessian-vector product oracle O F ∈ O ( F, σ , ¯ σ ) for some function F ∈ F (∆ , L , ∞ ) , aprecision parameter γ > and a failure probability δ ∈ (0 , , and runs outputs u ∈ R d ∪ {⊥} suchthat with probability at least − δ , either a) u = ⊥ , and ∇ F ( x ) (cid:23) − γI .b) if u (cid:54) = ⊥ , then (cid:107) u (cid:107) = 1 and (cid:104) u, ∇ F ( x ) u (cid:105) ≤ − γ .Moreover, when invoked as above, the procedure uses at most O (cid:32) (¯ σ + L ) γ log (cid:18) dδ (cid:19)(cid:33) queries to the stochastic Hessian-vector product oracle. Note that if this event fails, the algorithm still returns either ⊥ or a unit vector u . Upper bounds for finding (cid:15) -stationary points E.1 Proof of Theorem 1 Proof of Theorem 1. In the following, we first show that Algorithm 2 returns a point (cid:98) x suchthat, E [ (cid:107)∇ F ( (cid:98) x ) (cid:107) ] ≤ (cid:15) . We then bound the expected number of oracle queries used throughoutthe execution. Since, η = √ L +¯ σ +˜ (cid:15)L ≤ L and F has L -Lipschitz gradient, Lemma 6 implies that the point x ( t +1) computed using the update rule x ( t +1) ← x ( t ) − ηg ( t ) satisfies η (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) ≤ F ( x ( t ) ) − F ( x ( t +1) ) + 3 η (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13)(cid:13) . Telescoping the above from t from 1 to T , this implies η T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) ≤ F ( x (0) ) − F ( x ( T +1) ) + 3 η T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13)(cid:13) ≤ ∆ + 3 η T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13)(cid:13) , where the last inequality follows from the fact that F ( x (0) ) − F ( x ( T +1) ) ≤ ∆. Next, takingexpectation on both the sides (with respect to the stochasticity of the oracle and the algorithm’sinternal randomization), we get η E (cid:34) T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) (cid:35) ≤ ∆ + 3 η T (cid:88) t =1 E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13)(cid:13) (cid:21) . 
Using Lemma 2, we have E (cid:104)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13) (cid:105) ≤ (cid:15) for all t ≥ 1. Dividing both the sides by ηT ,and plugging in the value of the parameters T and η , we get, E (cid:34) T T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) (cid:35) ≤ ηT + 6 T T (cid:88) t =1 E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13)(cid:13) (cid:21) ≤ ηT + 6 (cid:15) ≤ (cid:15) . (33)Thus, for (cid:98) x chosen uniformly at random from the set (cid:0) x ( t ) (cid:1) Tt =1 , we have E (cid:107)∇ F ( (cid:98) x ) (cid:107) = 1 T T (cid:88) t =1 E (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) ≤ (cid:118)(cid:117)(cid:117)(cid:116) E (cid:34) T T (cid:88) t =1 (cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13) (cid:35) ≤ (cid:15). Finally, Markov’s inequality implies that with probability at least , (cid:107)∇ F ( (cid:98) x ) (cid:107) ≤ (cid:15). (34) In the proof, we show convergence to a 32 (cid:15) -stationary point. A simple change of variable, i.e. running Algorithm 2with (cid:15) ← (cid:15) , returns a point (cid:98) x that enjoys the guarantee that (cid:107)∇ F (ˆ x ) (cid:107) ≤ (cid:15) . ound on the number of oracle queries. Algorithm 2 queries the stochastic oracle in onlywhen it invokes HVP-RVR in Line 4 to compute the gradient estimate g ( t ) at time t . Let M denotethe total number of oracle calls made up until time T . Invoking Lemma 3 to bound the expectednumber of stochastic oracle calls for each t ≥ 1, and ignoring all the mutiplicative constants, we get E [ M ] ≤ T (cid:88) t =1 E (cid:34) bσ (cid:15) + (cid:13)(cid:13) x ( t +1) − x ( t ) (cid:13)(cid:13) · (cid:0) σ + (cid:15)L (cid:1) b(cid:15) + 1 (cid:35) ( i ) ≤ O (cid:32) T (cid:88) t =1 E (cid:34) bσ (cid:15) + (cid:13)(cid:13) ηg ( t ) (cid:13)(cid:13) · (cid:0) σ + (cid:15)L (cid:1) b(cid:15) + 1 (cid:35)(cid:33) ( ii ) ≤ O (cid:32) ∆ η(cid:15) · (cid:32) bσ (cid:15) + E (cid:34) T T (cid:88) t =1 (cid:13)(cid:13)(cid:13) g ( t ) (cid:13)(cid:13)(cid:13) (cid:35) · η (cid:0) σ + (cid:15)L (cid:1) b(cid:15) + 1 (cid:33)(cid:33) ( iii ) = O (cid:32) ∆ η(cid:15) · (cid:32) bσ (cid:15) + η (cid:0) σ + (cid:15)L (cid:1) b + 1 (cid:33)(cid:33) , (35)where ( i ) is given by plugging in the update rule from Line 5 and by dropping multiplicativeconstants, ( ii ) is given by rearranging the terms, plugging in the value of T and using that T ≥ (cid:15) ≤ √ ∆ L , and ( iii ) follows by observing that E (cid:34) T T (cid:88) t =1 (cid:13)(cid:13)(cid:13) g ( t ) (cid:13)(cid:13)(cid:13) (cid:35) ≤ E (cid:34) T T (cid:88) t =1 (cid:13)(cid:13)(cid:13) g ( t ) − ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) + 1 T T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) (cid:35) ≤ (cid:15) , as a consequence of Lemma 2 and the bound in (33). Next, note that since we assume (cid:15) < σ ,and since we have η ≤ √ σ + (cid:15)L , the parameter b is equal to η(cid:15) √ σ + (cid:15)L σ (as this is smaller than 1).Thus, plugging the value of b and η in the bound (35), we get, E [ m ( T )] = O (cid:32) ∆ σ (cid:112) σ + (cid:15)L (cid:15) + ∆ (cid:112) L + σ + (cid:15)L (cid:15) (cid:33) = O (cid:18) ∆ σ σ (cid:15) + ∆ σ √ L (cid:15) . + ∆ σ (cid:15) + ∆ L (cid:15) + ∆ √ L (cid:15) . (cid:19) . Using Markov’s inequality, we have that with probability at least , M ≤ O (cid:18) ∆ σ σ (cid:15) + ∆ σ √ L (cid:15) . + ∆ σ (cid:15) + ∆ L (cid:15) + ∆ √ L (cid:15) . (cid:19) . 
(36)The final statement follows by taking a union bound with failure probabilities for (34) and (36). E.2 Proof of Theorem 2 Proof of Theorem 2. In the following, we first show that Algorithm 3 returns a point ˆ x , suchthat with probability at least , (cid:107)∇ F (ˆ x ) (cid:107) ≤ (cid:15) . We then bound, with probability at least , thetotal number of oracle queries made up until time T .Note that, using Lemma 2 and Lemma 4, we have for all t ≥ E (cid:104) (cid:107)∇ F (cid:16) x ( t ) (cid:17) − g ( t ) (cid:107) (cid:105) ≤ (cid:15) , and E (cid:104) (cid:107)∇ F (cid:16) x ( t ) (cid:17) − H ( t ) (cid:107) op (cid:105) ≤ (cid:15) η . (37)32hus, for each t ≥ 1, invoking Lemma 9 and plugging in the bounds from (37), and using the valueof η , we get E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) (cid:105) ≥ M η 60 Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t +1) ) (cid:13)(cid:13)(cid:13) ≥ M η (cid:17) − (cid:15) √ M ≥ (cid:15) √ M (cid:18) Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t +1) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17) − (cid:19) . Telescoping this inequality from t = 1 to T , we have that E (cid:104) F ( x (1) ) − F ( x ( T +1) ) (cid:105) ≥ (cid:15) √ M · T · (cid:32) T T (cid:88) t =1 Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t +1) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17) − (cid:33) = 240 (cid:15) √ M · T · (cid:18) Pr( (cid:107)∇ F ( (cid:98) x ) (cid:107) ≥ (cid:15) ) − (cid:19) , where the equality follows because (cid:98) x is sampled uniformly at random from the set (cid:8) x ( t ) (cid:9) T +1 t =2 . Next,using the fact that, F ( x ( t ) ) − F (cid:0) x ( T +1) (cid:1) ≤ ∆, rearranging the terms, and plugging in the value of T , we get Pr( (cid:107)∇ F ( (cid:98) x ) (cid:107) ≥ (cid:15) ) ≤ ∆ √ M (cid:15) T + 116 ≤ . Thus, with probability at least , (cid:107)∇ F ( (cid:98) x ) (cid:107) ≤ (cid:15). (38) Bound on the number of oracle queries. Algorithm 3 queries the stochastic oracle in Line 5and Line 6 only to compute the respective Hessian and gradient estimates. Let M h and M g denotethe total number of stochastic oracle queries made by Line 5 and Line 6 till time T respectively.Further, Let M = M h + M g denote the total number of oracle queries made till time T .In what follows, we first bound E [ M h ] and E [ M g ]. Then, we invoke Markov’s inequality todeduce that the desired bound on M holds with probability at least .1. Bound on E [ M h ] . Since the algorithm queries the stochastic Hessian oracle n H times periteration, M h = T · n H . Plugging the values of T , n H and M as specified in Algorithm 3, andignoring multiplicative constant, we get, E [ M h ] = (cid:38) √ M (cid:15) . (cid:39) · (cid:24) σ η log( d ) (cid:15) (cid:25) ≤ O (cid:32) ∆ √ M(cid:15) . + ∆ σ log( d ) (cid:15) . √ M (cid:33) ≤ O (cid:32) ∆ √ L (cid:15) . + ∆ σ (cid:15) + ∆ σ σ (cid:112) log( d ) (cid:15) (cid:33) , (39)where the first inequality above follows from the fact that ∆ √ M(cid:15) . ≥ (cid:15) ≤ ∆ M and using the identity (cid:100) x (cid:101) ≤ x + 1 for x ≥ Bound on E [ M g ] . 
Invoking Lemma 3 for each t ≥ 1, we get E [ M g ] = 6 T (cid:88) t =1 E (cid:34) bσ (cid:15) + ( σ + L (cid:15) ) · (cid:13)(cid:13) x ( t ) − x ( t − (cid:13)(cid:13) b(cid:15) + 1 (cid:35) ( i ) = O (cid:18) T · (cid:18) bσ (cid:15) + ( σ + L (cid:15) ) · η b(cid:15) + 1 (cid:19)(cid:19) ( ii ) = O (cid:18) ∆ η(cid:15) · (cid:18) bσ (cid:15) + ( σ + L (cid:15) ) · η b(cid:15) + 1 (cid:19)(cid:19) (40)where ( i ) follows by observing (cid:13)(cid:13) x ( t ) − x ( t − (cid:13)(cid:13) ≤ η due to the update rule in Line 7 and ( ii ) isgiven by plugging in the value of T ≤ O ( ∆ η(cid:15) ) for the natural choice of parameter (cid:15) = O (∆ M ).Next, note that since M > L , and since we assume (cid:15) < σ , the parameter b is equal to η √ σ + (cid:15)L σ (which is smaller than 1). Thus, plugging the value of b and η in the bound (40),we get E [ M g ] = O (cid:32) ∆ σ (cid:112) σ + (cid:15)L (cid:15) + ∆ √ M(cid:15) . (cid:33) = O (cid:18) ∆ σ σ (cid:15) + ∆ σ √ L (cid:15) . + ∆ σ (cid:15) (cid:112) log( d ) + ∆ √ L (cid:15) . (cid:19) , (41)where the second equality follows by using that (cid:15) ≤ σ to simplify the term ∆ √ M(cid:15) . .Adding (41) and (39), the total number of oracle queries made by Algorithm 3 till time T is bounded,in expectation, by E [ M ] = E [ M g + M h ] = O (cid:18) ∆ σ σ (cid:15) (cid:112) log( d ) + ∆ σ √ L (cid:15) . + ∆ σ (cid:15) (cid:112) log( d ) + ∆ √ L (cid:15) . (cid:19) . Using Markov’s inequality, we get that, with probability at least , M ≤ O (cid:18) ∆ σ σ (cid:15) (cid:112) log( d ) + ∆ σ √ L (cid:15) . + ∆ σ (cid:15) (cid:112) log( d ) + ∆ √ L (cid:15) . (cid:19) . (42)The final statement follows by taking a union bound for the failure probability of (38) and (42).34 Upper bounds for finding ( (cid:15), γ ) -second-order-stationary points F.1 Full statement and proof for Algorithm 4 Algorithm 4 Stochastic gradient descent with negative curvature search and HVP-RVR Input: Oracle ( O F , P z ) ∈ O ( F, σ , ¯ σ ) for F ∈ F (∆ , L , L ). Precision parameters (cid:15), γ . Set η = min (cid:26) γ(cid:15)L , √ L +¯ σ + (cid:15)L (cid:27) , T = (cid:108) L γ + η(cid:15) (cid:109) , p = γ γ +10∆ L η(cid:15) , δ = γ L . Set b g = min { , η(cid:15) √ ¯ σ + (cid:15)L σ } and b H = min { , γ √ ¯ σ + (cid:15)L σ L } . Initialize x (0) , x (1) ← g (1) ← HVP-RVR-Gradient-Estimator (cid:15),b g ( x (1) , x (0) , ⊥ ). for t = 1 to T do Sample Q t ∼ Bernoulli( p ). if Q t = 1 then x ( t +1) ← x ( t ) − η · g ( t ) . g ( t +1) ← HVP-RVR-Gradient-Estimator (cid:15),b g ( x ( t +1) , x ( t ) , g ( t ) ). else u ( t ) ← Oja (cid:0) x ( t ) , O F , γ, δ (cid:1) . // Oja’s algorithm (Lemma 11). if u ( t ) ≡ ⊥ then x ( t +1) ← x ( t ) . g ( t +1) ← g ( t ) . else Sample r ( t ) ∼ Uniform( {− , } ). x ( t +1) ← x ( t ) + γL · r ( t ) · u ( t ) . g ( t +1) ← HVP-RVR-Gradient-Estimator (cid:15),b H ( x ( t +1) , x ( t ) , g ( t ) ). return (cid:98) x chosen uniformly at random from (cid:0) x ( t ) (cid:1) Tt =1 . Proof of Theorem 4. We first show that Algorithm 4 returns a point (cid:98) x such that, E [ (cid:107)∇ F ( (cid:98) x ) (cid:107) ] ≤ (cid:15) and λ min (cid:0) ∇ F ( (cid:98) x ) (cid:1) ≥ − γ . We then bound the expected number of oracle queries used throughoutthe execution.To begin, note that, for any t ≥ 1, there are two scenarios: (a) either Q t = 1 and x ( t +1) is setusing the update rule in Line 7, or, (b) Q t = 0 and we set x ( t +1) using Line 10, respectively. Weanalyze the two cases separately below. Case 1: Q t = 1 . 
Since, η ≤ √ L +¯ σ +˜ (cid:15)L ≤ L and F has L -Lipschitz gradient, using Lemma 6,we have F ( x ( t ) ) − F ( x ( t +1) ) ≥ η (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) − η (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13)(cid:13) . Taking expectation on both the sides, while conditioning on the event that Q t = 1, we get E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 1 (cid:105) ≥ η E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) (cid:21) − η E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13)(cid:13) (cid:21) ≥ η E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) (cid:21) − η(cid:15) , (43)where the last inequality follows using Lemma 2.35 ase 2: Q t = 0 . Let E Oja ( t ) denote the event that Oja succeeds at time t , in the sense that theevent in Lemma 11 holds: ( i ) if u ( t ) = ⊥ then ∇ F ( x ( t ) ) (cid:23) − γI , and ( ii ) otherwise, u ( t ) satisfies (cid:104) u ( t ) , ∇ F ( x ( t ) ) u ( t ) (cid:105) ≤ − γ .Then, using Lemma 12, we are guaranteed that E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 0 (cid:105) ≥ γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − L γ Pr (cid:16) ¬ E Oja ( t ) | Q t = 0 (cid:17)(cid:19) . In particular, we are guaranteed by Lemma 11 that E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 0 (cid:105) ≥ γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − L γ δ (cid:19) . (44)Combining the two cases ( Q t = 0 and Q t = 1) from (43) and (44) above, we get E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) (cid:105) (45)= (cid:88) q ∈{ , } Pr( Q t = q ) E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = q (cid:105) ≥ − p ) γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − L γ δ (cid:19) + p (cid:18) η E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) (cid:21) − η(cid:15) (cid:19) . (46)Using that E (cid:104)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13) (cid:105) ≥ (8 (cid:15) ) · Pr (cid:0)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13) ≥ (cid:15) (cid:1) and that δ ≤ γ L , we have E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) (cid:105) ≥ − p ) γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − (cid:19) + 8 pη(cid:15) (cid:18) Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17) − (cid:19) . Telescoping this inequality for t from 1 to T and using the bound E (cid:2) F ( x (1) ) − F ( x ( T +1) ) (cid:3) ≤ ∆, weget ∆ ≥ E (cid:104) F ( x (1) ) − F ( x ( T +1) ) (cid:105) ≥ T (1 − p ) γ L (cid:16) T T − (cid:88) t =0 Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − (cid:17) + 8 T pη(cid:15) (cid:16) T T − (cid:88) t =0 Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17) − (cid:17) ( i ) ≥ T (1 − p ) γ L (cid:18) Pr (cid:0) λ min ( ∇ F ( (cid:98) x )) ≤ − γ (cid:1) − (cid:19) + 8 T pη(cid:15) (cid:16) Pr( (cid:107)∇ F ( (cid:98) x ) (cid:107) ≥ (cid:15) ) − (cid:17) ( ii ) ≥ (cid:18) Pr (cid:0) λ min ( ∇ F ( (cid:98) x )) ≤ − γ (cid:1) + Pr( (cid:107)∇ F ( (cid:98) x ) (cid:107) ≥ (cid:15) ) − (cid:19) , (47)where ( i ) follows because (cid:98) x is sampled uniformly at random from (cid:0) x ( t ) (cid:1) Tt =1 and ( ii ) follows fromLemma 14. 
Rearranging the terms, we getPr (cid:0) λ min ( ∇ F ( (cid:98) x )) ≤ − γ (cid:1) + Pr( (cid:107)∇ F ( (cid:98) x ) (cid:107) ≥ (cid:15) ) ≤ , which further implies that Pr (cid:0) λ min ( ∇ F ( (cid:98) x )) ≥ − γ ∧ (cid:107)∇ F ( (cid:98) x ) (cid:107) ≤ (cid:15) (cid:1) ≥ . (48)36 ound on the number of oracle queries. At every iteration, Algorithm 4 queries the stochasticoracle in either Line 8 or Line 17 (to compute the stochastic gradient estimator and to execute Oja’salgorithm, respectively), and possibly Line 10 (to update the gradient estimator after a negativecurvature step). Let m g ( t ) denote the total number of stochastic oracle queries made by Line 8 orLine 17 at time t , and let M g = (cid:80) Tt =1 m g ( t ). Further, let M nc denote the total number of oraclecalls made by Line 10, and further let M = M g + M nc be the total number of oracle queries madeup until time T .In what follows, we first bound E [ M g ] and E [ M nc ]. Then, we invoke Markov’s inequality tobound M with probability at least . Bound on M g . For any t > 0, there are two scenarios, either (a) Q t = 1 and we go throughLine 7, or (b) Q t = 0 and Line 17 is executed. Thus, E [ M g ] = T (cid:88) t =1 Pr( Q t = 0) E [ m g ( t ) | Q t = 0] + T (cid:88) t =1 Pr( Q t = 1) E [ m g ( t ) | Q t = 1] (49)We denote the two terms on the right hand side above by ( A ) and ( B ), respectively. We boundthem separately as follows. • Bound on ( A ) . Using Lemma 3 with the fact that Pr( Q t = 0) = 1 − p , we get( A ) = O (1) T (cid:88) t =1 (1 − p ) · E (cid:20) b H σ (cid:15) + (cid:13)(cid:13) x ( t +1) − x ( t ) (cid:13)(cid:13) · ¯ σ + (cid:15)L b H (cid:15) + 1 (cid:12)(cid:12)(cid:12) Q t = 0 (cid:21) ( i ) = O (cid:32) T · (1 − p ) · (cid:32) γσ (cid:112) ¯ σ + (cid:15)L L (cid:15) + γ (cid:15) · ¯ σ + (cid:15)L L + 1 (cid:33)(cid:33) ( ii ) ≤ O (cid:32) ∆ L σ (cid:112) ¯ σ + (cid:15)L γ (cid:15) + ∆(¯ σ + (cid:15)L ) γ(cid:15) + ∆ L γ (cid:33) , (50)where ( i ) is given by plugging in (cid:107) x ( t ) − x ( t − (cid:107) = γ/L . The inequality ( ii ) follows by usingthe bound on T · (1 − p ) from Lemma 14. • Bound on ( B ) . 
Using Lemma 3 with the fact that Pr( Q t = 1) = p , we get( B ) = O (1) T (cid:88) t =1 p · E (cid:20) b g σ (cid:15) + (cid:13)(cid:13) x ( t +1) − x ( t ) (cid:13)(cid:13) · ¯ σ + (cid:15)L b g (cid:15) + 1 (cid:12)(cid:12)(cid:12) Q t = 1 (cid:21) ( i ) = O (1) T (cid:88) t =1 p · E (cid:20) b g σ (cid:15) + (cid:13)(cid:13) ηg ( t ) (cid:13)(cid:13) · ¯ σ + (cid:15)L b g (cid:15) + 1 (cid:12)(cid:12)(cid:12) Q t = 1 (cid:21) ( ii ) ≤ O (cid:32) ∆ η(cid:15) · (cid:16) E (cid:34) T T (cid:88) t =1 (cid:13)(cid:13)(cid:13) g ( t ) (cid:13)(cid:13)(cid:13) (cid:35) · η (¯ σ + (cid:15)L ) b g (cid:15) + b g σ (cid:15) + 1 (cid:17)(cid:33) ( iii ) = O (cid:16) ∆ σ (cid:112) ¯ σ + (cid:15)L (cid:15) + ∆ η(cid:15) (cid:17) , (51)where ( i ) follows by plugging in the update rule from Line 7 (when Q t = 1), ( ii ) follows byrearranging the terms and using the bound on T · p from Lemma 14, and ( iii ) is follows from37he choices of b g (in particular, our assumption that (cid:15) ≤ σ implies that b g = η(cid:15) √ ¯ σ + (cid:15)L σ ) and η , as well as the following bound for E (cid:104) T (cid:80) Tt =1 (cid:13)(cid:13) g ( t ) (cid:13)(cid:13) (cid:105) : E (cid:34) T T (cid:88) t =1 (cid:107) g ( t ) (cid:107) (cid:35) ≤ E (cid:34) T T (cid:88) t =1 (cid:13)(cid:13)(cid:13) g ( t ) − ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) + 2 T T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) (cid:35) ≤ O (cid:0) (cid:15) + (cid:107)∇ F ( (cid:98) x ) (cid:107) (cid:1) ≤ O ( (cid:15) ) , where the last inequality is uses Lemma 2 and Lemma 13.Combining the bounds from (50) and (51) in (49), we have E [ M g ] ≤ O (cid:32) ∆ L σ (cid:112) ¯ σ + (cid:15)L γ (cid:15) + ∆(¯ σ + (cid:15)L ) γ(cid:15) + ∆ L γ + ∆ σ (cid:112) ¯ σ + (cid:15)L (cid:15) + ∆ η(cid:15) (cid:33) . (52) Bound on M nc . Using the law of total probability with the observation that Algorithm 4 entersLine 10 only if Q t = 0, we get E (cid:34) T (cid:88) t =1 m nc ( t ) (cid:35) = T (cid:88) t =1 (cid:88) q ∈{ , } Pr( Q t = q ) E [ m nc ( t ) | Q t = q ]= T (cid:88) t =1 Pr( Q t = 0) E [ m nc ( t ) | Q t = 0]= T · (1 − p ) · n H ≤ O (cid:18) ∆ L γ · n H (cid:19) , (53)where n H denotes the number of oracle queries made by Oja , the last inequality follows by bounding T · (1 − p ) as in (47). Note that Lemma 11 implies that for δ = γ L , n H ≤ O (cid:32) (¯ σ + L ) γ log (cid:18) L γ d (cid:19)(cid:33) . (54)Combining the above bounds for M g and M nc (in (52) and (53) respectively), we get E [ M ] ≤ E [ M g + M nc ]= O (cid:32) ∆ L σ (cid:112) ¯ σ + (cid:15)L γ (cid:15) + ∆(¯ σ + (cid:15)L ) γ(cid:15) + ∆ L γ + ∆ σ (cid:112) ¯ σ + (cid:15)L (cid:15) + ∆ η(cid:15) + ∆ L γ · n H (cid:33) . Plugging in the value of η from Algorithm 4 and n H from (54), and using Markov’s inequality, weget that, with probability at least , M = O (cid:32) ∆ σ (cid:112) ¯ σ + (cid:15)L (cid:15) + ∆ L (cid:0) σ ¯ σ + √ (cid:15)L + γ ¯ σ /L + γ(cid:15) (cid:1) γ (cid:15) + ∆ L γ (cid:32) (¯ σ + L ) γ log (cid:18) L γ d (cid:19)(cid:33) + O (cid:32) ∆ L γ + ∆ (cid:112) L + ¯ σ + (cid:15)L (cid:15) (cid:33) . (55)38gnoring the lower-order terms, we have M = (cid:101) O (cid:32) ∆ σ ¯ σ (cid:15) + ∆ L σ ¯ σ γ (cid:15) + ∆ L (¯ σ + L ) γ (cid:33) . The final statement follows by taking a union bound for the failure probability of the claims in (48)and (55). Lemma 12. 
Under the setting of Theorem 4, we are guaranteed that E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 0 (cid:105) ≥ γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − L γ Pr (cid:16) ¬ E Oja ( t ) | Q t = 0 (cid:17)(cid:19) . Proof. Recall that Algorithm 4 calls Oja with the precision parameter 2 γ . To begin, suppose that E Oja ( t ) holds. Then if Oja returns ⊥ , then λ min (cid:0) ∇ F ( x ( t ) ) (cid:1) ≥ − γ , otherwise Oja returns a unitvector u ( t ) such that ∇ F ( x ( t ) )[ u ( t ) , u ( t ) ] ≤ − γ . Thus, using Lemma 10 with H = ∇ F ( x ( t ) ) and u ( t ) , we conclude that—conditioned on the history up to time t , and on Q t = 0—we have { E Oja ( t ) } ( F ( x ( t ) ) − F ( x ( t +1) )) ≥ γ L { λ min ( ∇ F ( x ( t ) )) ≤ − γ ∧ E Oja ( t ) } . In particular, this implies that F ( x ( t ) ) − F ( x ( t +1) ) ≥ γ L (cid:16) { λ min ( ∇ F ( x ( t ) )) ≤ − γ } − {¬ E Oja ( t ) } (cid:17) − {¬ E Oja ( t ) } ( F ( x ( t ) ) − F ( x ( t +1) )) . Taking conditional expectations, this further implies that E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 0 (cid:105) ≥ γ L (cid:16) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − Pr (cid:16) ¬ E Oja ( t ) | Q t = 0 (cid:17)(cid:17) − E (cid:104) {¬ E Oja ( t ) } ( F ( x ( t ) ) − F ( x ( t +1) )) | Q t = 0 (cid:105) . Now, consider the term E (cid:104) {¬ E Oja ( t ) } ( F ( x ( t ) ) − F ( x ( t +1) )) | Q t = 0 (cid:105) = Pr( ¬ E Oja ( t ) | Q t = 0) · E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 0 , ¬ E Oja ( t ) (cid:105) . Given that Oja fails, there are two cases two consider: The first case is where it returns ⊥ (eventhough we may not have λ min (cid:0) ∇ F ( x ( t ) ) (cid:1) ≥ − γ ), which we denote by P t = 0, and the secondcase is that it returns some vector u ( t ) (which may not actually satisfy ∇ F ( x ( t ) )[ u ( t ) , u ( t ) ] ≤ − γ ),which we denote P t = 1. If P t = 0, we have x ( t +1) − x ( t ) , so E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 0 , ¬ E Oja ( t ) , P t = 0 (cid:105) = 0 . Otherwise, using a third-order Taylor expansion, and following the same reasoning as the proof ofLemma 10, we have E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 0 , ¬ E Oja ( t ) , P t = 1 (cid:105) E (cid:20) γ L ∇ F ( x )[ u ( t ) , u ( t ) ] + γ L (cid:107) u ( t ) (cid:107) | Q t = 0 , ¬ E Oja ( t ) , P t = 1 (cid:21) ≤ γ L L + γ L ≤ γ L L . Combining this bound with the earlier inequalities (and being rather loose with constants), weconclude that E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = 0 (cid:105) ≥ γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − (cid:18) L γ (cid:19) Pr (cid:16) ¬ E Oja ( t ) | Q t = 0 (cid:17)(cid:19) ≥ γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − L γ Pr (cid:16) ¬ E Oja ( t ) | Q t = 0 (cid:17)(cid:19) . Lemma 13. Under the same setting as Theorem 4, the point (cid:98) x returned by Algorithm 4 satisfies E (cid:104) (cid:107)∇ F ( x ( t ) ) (cid:107) (cid:105) ≤ (cid:15) . Proof. Starting from (46) in the proof of Theorem 4, we have E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) (cid:105) ≥ − p ) γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − L γ δ (cid:19) + p (cid:18) η E (cid:104) (cid:107)∇ F ( x ( t ) ) (cid:107) (cid:105) − η(cid:15) (cid:19) . 
Ignoring the positive term Pr (cid:0) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:1) on the right hand side in the above, we get E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) (cid:105) ≥ pη (cid:16) E (cid:104) (cid:107)∇ F ( x ( t ) ) (cid:107) (cid:105) − (cid:15) (cid:17) − − p ) γ L L γ δ. Telescoping this inequality for t from 1 to T and using that F ( x (1) ) − F ( x ( T +1) ) ≤ ∆, we get∆ ≥ T pη (cid:0) E (cid:2) (cid:107)∇ F ( (cid:98) x ) (cid:107) (cid:3) − (cid:15) (cid:1) − T − p ) γ L L γ δ ≥ ∆4 (cid:15) (cid:0) E (cid:2) (cid:107)∇ F ( (cid:98) x ) (cid:107) (cid:3) − (cid:15) (cid:1) − L γ δ, where the last inequality follows from Lemma 14. Rearranging the terms, we get E (cid:2) (cid:107)∇ F ( (cid:98) x ) (cid:107) (cid:3) ≤ (cid:15) + 280 (cid:15) · L γ δ ≤ (cid:15) , where the last inequality uses that δ ≤ γ L . Lemma 14. For the values of the parameters T and p specified in Algorithm 4, η(cid:15) ≤ T p ≤ η(cid:15) , and, L γ ≤ T (1 − p ) ≤ L γ . roof. Since, η ≤ √ L +¯ σ + (cid:15)L ≤ L and (cid:15) ≤ √ ∆ L , we have that T ≥ η(cid:15) ≥ L (cid:15) ≥ . Thus, using the fact that x ≤ (cid:100) x (cid:101) ≤ x for all x ≥ 1, we get20∆ L γ + 2∆ η(cid:15) ≤ T ≤ L γ + 4∆ η(cid:15) . (56)Consequently, by plugging in the values of T and p , we have T (1 − p ) = (cid:24) L γ + 2∆ η(cid:15) (cid:25) · (cid:18) − γ γ + 10∆ L η(cid:15) (cid:19) ≤ (cid:18) L γ + 4∆ η(cid:15) (cid:19) · (cid:18) L η(cid:15) γ + 10∆ L η(cid:15) (cid:19) = 40∆ L γ , where the first inequality is due to (56). Similarly, we have that T (1 − p ) ≥ (cid:18) L γ + 2∆ η(cid:15) (cid:19) · (cid:18) L η(cid:15) γ + 10∆ L η(cid:15) (cid:19) = 20∆ L γ . Together, the above two bounds imply that20∆ L γ ≤ T (1 − p ) ≤ L γ . The bound on T · p follows similarly. 41 .2 Full statement and proof for Algorithm 5 Algorithm 5 Subsampled cubic-regularized trust-region method with HVP-RVR Input: Stochastic second-order oracle ( O F , P z ) ∈ O ( F, σ ), where F ∈ F (∆ , ∞ , L ).Precision parameter (cid:15) . Set M = 4 max (cid:110) L , σ (cid:15) log( d ) σ (cid:111) , η = 30 (cid:112) (cid:15)M , T = (cid:108) L γ + ∆ √ M (cid:15) / (cid:109) , p = √ Mγ / √ Mγ / +540 L (cid:15) / . Set m = (cid:108) · · σ log( d ) (cid:15)M (cid:109) , m = (cid:108) σ log( d ) γ (cid:109) . Set b g = min (cid:26) , η √ σ + (cid:15)L σ (cid:27) and b H = min (cid:26) , γ √ σ + (cid:15)L σ L (cid:27) . Initialize x (0) , x (1) ← g (1) ← HVP-RVR-Gradient-Estimator (cid:15),b g (cid:0) x (1) , x (0) , ⊥ (cid:1) . for t = 1 to T do Sample Q t ∼ Bernoulli( p ) with bias p . if Q t = 1 then Query the oracle m times at x ( t ) and compute H ( t )1 ← m m (cid:88) j =1 (cid:91) ∇ F ( x ( t ) , z ( t,j ) ) , where z ( t,j ) i . i . d . ∼ P z . Set the next point x ( t +1) as x ( t +1) ← arg min (cid:107) y − x ( t ) (cid:107)≤ η (cid:10) g ( t ) , y − x ( t ) (cid:11) + 12 (cid:10) y − x ( t ) , H ( t )1 ( y − x ( t ) ) (cid:11) + M (cid:107) y − x ( t ) (cid:107) . g ( t +1) ← HVP-RVR-Gradient-Estimator (cid:15),b g (cid:0) x ( t +1) , x ( t ) , g ( t ) (cid:1) . else Query the oracle m times at x ( t ) and compute H ( t )2 ← m m (cid:88) j =1 (cid:91) ∇ F ( x ( t ) , z ( t,j ) ) , where z ( t,j ) i . i . d . ∼ P z . if λ min (cid:16) H ( t )2 (cid:17) ≤ − γ then Find a unit vector u ( t ) such that H ( t )2 (cid:2) u ( t ) , u ( t ) (cid:3) ≤ − γ . x ( t +1) ← x ( t ) + γL · r ( t ) · u ( t ) , where r ( t ) ∼ Uniform( {− , } ). g ( t +1) ← HVP-RVR-Gradient-Estimator (cid:15),b H (cid:0) x ( t +1) , x ( t ) , g ( t ) (cid:1) . else x ( t +1) ← x ( t ) . 
g ( t +1) ← g ( t ) . return (cid:98) x chosen uniformly at random from (cid:8) x ( t ) (cid:9) T − t =1 .42 roof of Theorem 5. We first show that Algorithm 5 returns a point (cid:98) x such that, (cid:107)∇ F ( (cid:98) x ) (cid:107) ≤ (cid:15) and λ min (cid:0) ∇ F ( (cid:98) x ) (cid:1) ≥ − γ . We then bound the expected number of oracle queries used throughoutthe execution.Before we delve into the proof, first note that using Lemma 2, we have for all t ≥ E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − g ( t ) (cid:13)(cid:13)(cid:13) (cid:21) ≤ (cid:15) . Further, using Lemma 4 with our choice of m and m , we have, for all t ≥ E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − H ( t )1 (cid:13)(cid:13)(cid:13) (cid:21) ≤ (cid:15)M , and, E (cid:20)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) − H ( t )2 (cid:13)(cid:13)(cid:13) (cid:21) ≤ γ . (57)To begin the proof, we observe that for any t ≥ 0, there are two scenarios: (a) either Q t = 1 andthe algorithm goes through Line 9, or, (b) Q t = 0 and the algorithm goes through Line 15. Weanalyze the two cases separately below.(a) Case 1: Q t = 1 . In this case, we set x ( t +1) using the update rule in Line 9. Invoking Lemma 9with the bound in (57) and η = 30 (cid:112) (cid:15)M , we get E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) (cid:12)(cid:12)(cid:12) Q t = 1 (cid:105) ≥ (cid:15) / √ M (cid:18) Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t +1) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17) − (cid:19) . (58)(b) Case 2: Q t = 0 . In this case, either λ min (cid:16) H ( t )2 (cid:17) > − γ , in which case we set x ( t +1) = x ( t ) , orwe compute x ( t +1) using the update rule in Line 15 in Algorithm 5. Thus, using Lemma 10with (57), we get E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) (cid:12)(cid:12)(cid:12) Q t = 0 (cid:105) ≥ γ L (cid:18) Pr (cid:16) λ min (cid:16) H ( t )2 (cid:17) ≤ γ (cid:17) − (cid:19) . (59)Combining the two cases ( Q t = 0 or Q t = 1) from (58) and (59) above, we get E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) (cid:105) = (cid:88) q ∈{ , } Pr( Q t = q ) E (cid:104) F ( x ( t ) ) − F ( x ( t +1) ) | Q t = q (cid:105) ≥ (1 − p ) · γ L (cid:18) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − (cid:19) + p · (cid:15) / √ M (cid:18) Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t +1) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17) − (cid:19) . Telescoping the inequality above for t from 0 to T − 1, and using the bound E (cid:2) F ( x (0) ) − F ( x ( T ) ) (cid:3) ≤ ∆,we get∆ ≥ E (cid:104) F ( x (0) ) − F ( x ( T ) ) (cid:105) ≥ T (1 − p ) γ L (cid:32) T T − (cid:88) t =0 Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) − (cid:33) + 450 T p(cid:15) / √ M (cid:32) T T (cid:88) t =1 Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17) − (cid:33) i ) ≥ (cid:32) T T − (cid:88) t =0 Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) + 1 T T (cid:88) t =1 Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17) − (cid:33) ( ii ) ≥ (cid:32) T − T − (cid:88) t =1 (cid:16) Pr (cid:16) λ min ( ∇ F ( x ( t ) )) ≤ − γ (cid:17) + Pr (cid:16)(cid:13)(cid:13)(cid:13) ∇ F ( x ( t ) ) (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:17)(cid:17) − (cid:33) ( iii ) ≥ (cid:18) (cid:0) Pr (cid:0) λ min ( ∇ F ( (cid:98) x )) ≤ − γ (cid:1) + Pr( (cid:107)∇ F ( (cid:98) x ) (cid:107) ≥ (cid:15) ) (cid:1) − (cid:19) , (60)where the inequality in ( i ) follows from Lemma 15. 
The inequality in ( ii ) is given by ignoring the(non-negative) terms Pr (cid:0) ∇ F ( x (0) ) ≤ − γ (cid:1) and Pr (cid:0)(cid:13)(cid:13) ∇ F ( x ( T ) ) (cid:13)(cid:13) ≥ (cid:15) (cid:1) on the right-hand side andusing the fact that T ≥ 6. Finally, ( iii ) follows by recalling the definition of (cid:98) x as samples uniformlyat random from the set ( x ( t ) ) T − t =1 . Rearranging the terms, we getPr (cid:0) λ min ( ∇ F ( (cid:98) x )) ≤ − γ (cid:1) + Pr( (cid:107)∇ F ( (cid:98) x ) (cid:107) ≥ (cid:15) ) ≤ , which further implies that the returned point (cid:98) x satisfiesPr (cid:0) λ min ( ∇ F ( (cid:98) x )) ≥ − γ ∧ (cid:107)∇ F ( (cid:98) x ) (cid:107) ≤ (cid:15) (cid:1) ≥ . (61) Bound on the number of oracle queries. Let us first introduce some notation to count thenumber of oracle calls made in each iteration of the algorithm. • On Line 10 and Line 16, Algorithm 5 queries the stochastic oracle through the subroutine HVP-RVR-Gradient-Estimator . Let m g ( t ) denote the total number of oracle queries resultingfrom either line at iteration t . • Let m h, ( t ) and m h, ( t ) denote the total number of oracle calls made by Line 8 and Line 12at iteration t to compute H ( t )1 and H ( t )2 respectively.Define M g , M h, and M h, by (cid:80) Tt =1 m g ( t ), (cid:80) Tt =1 m h, ( t ) and (cid:80) Tt =1 m h, ( t ) respectively. In whatfollows, we give separate bounds for E [ M g ], E [ M h, ] and E [ M h, ]. The final statement on the totalnumber of oracle calls follows by an application of Markov’s inequality. Bound on E [ M g ] . For any t > 0, there are two scenarios, either (a) Q t = 1 and we update x ( t +1) through Line 9, or (b) Q t = 0 and we update x ( t +1) through Line 15 orLine 18. Thus, using the lawof total expectation E [ M g ] = T − (cid:88) t =0 Pr( Q t = 0) E [ m g ( t ) | Q t = 0] + T − (cid:88) t =0 Pr( Q t = 1) E [ m g ( t ) | Q t = 1] . (62)We denote the two terms on the right hand side above by ( A ) and ( B ), respectively. We boundthem separately in as follows.(a) Bound on ( A ) . Using Lemma 3 with the fact that Pr( Q t = 0) = 1 − p , we get( A ) = 6 T (cid:88) t =1 (1 − p ) · E (cid:34) b H σ (cid:15) + ( σ + L (cid:15) ) · (cid:13)(cid:13) x ( t +1) − x ( t ) (cid:13)(cid:13) b H (cid:15) + 1 (cid:12)(cid:12)(cid:12) Q t = 0 (cid:35) i ) = 6 T (1 − p ) · (cid:18) b H σ (cid:15) + ( σ + L (cid:15) ) · γ b H (cid:15) L + 1 (cid:19) ( ii ) = O (cid:18) ∆ L γ · (cid:18) b H σ (cid:15) + ( σ + L (cid:15) ) · γ b H (cid:15) L + 1 (cid:19)(cid:19) ( iii ) = O (cid:32) ∆ L σ (cid:112) σ + (cid:15)L γ (cid:15) + ∆ (cid:0) σ + (cid:15)L (cid:1) γ(cid:15) + ∆ L γ (cid:33) , (63)where ( i ) holds because when Q t = 0, we either have (cid:13)(cid:13) x ( t ) − x ( t − (cid:13)(cid:13) ≤ γL (if we follow theupdate rule in Line 15) or (cid:13)(cid:13) x ( t ) − x ( t − (cid:13)(cid:13) = 0 (if we follow Line 18). The inequality ( ii ) usesthe bound on T · (1 − p ) from Lemma 15 and ( iii ) follows from plugging in the value of b H .(b) Bound on ( B ) . Using Lemma 3 with the definition Pr( Q t = 1) = p , we get( B ) = 6 T (cid:88) t =1 p · E (cid:34) b g σ (cid:15) + ( σ + L (cid:15) ) · (cid:13)(cid:13) x ( t +1) − x ( t ) (cid:13)(cid:13) b g (cid:15) + 1 (cid:12)(cid:12)(cid:12) Q t = 1 (cid:35) ( i ) = 6 T p · (cid:18) b g σ (cid:15) + ( σ + L (cid:15) ) · η b g (cid:15) + 1 (cid:19) ( ii ) = O (cid:32) ∆ √ M(cid:15) . 
Bound on the number of oracle queries. Let us first introduce some notation to count the number of oracle calls made in each iteration of the algorithm.
• On Line 10 and Line 16, Algorithm 5 queries the stochastic oracle through the subroutine HVP-RVR-Gradient-Estimator. Let m_g(t) denote the total number of oracle queries resulting from either line at iteration t.
• Let m_{h,1}(t) and m_{h,2}(t) denote the total number of oracle calls made by Line 8 and Line 12 at iteration t to compute H₁^{(t)} and H₂^{(t)}, respectively.
Define M_g, M_{H,1} and M_{H,2} by Σ_{t=1}^T m_g(t), Σ_{t=1}^T m_{h,1}(t) and Σ_{t=1}^T m_{h,2}(t), respectively. In what follows, we give separate bounds for E[M_g], E[M_{H,1}] and E[M_{H,2}]. The final statement on the total number of oracle calls follows by an application of Markov's inequality.

Bound on E[M_g]. For any t > 0, there are two scenarios: either (a) Q_t = 1 and we update x^{(t+1)} through Line 9, or (b) Q_t = 0 and we update x^{(t+1)} through Line 15 or Line 18. Thus, using the law of total expectation,
    E[M_g] = Σ_{t=0}^{T−1} Pr(Q_t = 0) E[m_g(t) | Q_t = 0] + Σ_{t=0}^{T−1} Pr(Q_t = 1) E[m_g(t) | Q_t = 1].    (62)
We denote the two terms on the right-hand side above by (A) and (B), respectively, and bound them separately as follows.

(a) Bound on (A). Using Lemma 3 with the fact that Pr(Q_t = 0) = 1 − p, we get
    (A) = 6 Σ_{t=1}^T (1−p) · E[ b_H σ₁²/ε² + (σ₁² + L₁ε) · ‖x^{(t+1)} − x^{(t)}‖/(b_H ε²) + 1 | Q_t = 0 ]
    (i)= 6T(1−p) · ( b_H σ₁²/ε² + (σ₁² + L₁ε) · γ/(b_H ε² L₂) + 1 )
    (ii)= O( (ΔL₂²/γ³) · ( b_H σ₁²/ε² + (σ₁² + L₁ε) · γ/(b_H ε² L₂) + 1 ) )
    (iii)= O( ΔL₂σ₁√(σ₁² + εL₁)/(γ²ε²) + Δ(σ₁² + εL₁)/(γε²) + ΔL₂²/γ³ ),    (63)
where (i) holds because when Q_t = 0, we either have ‖x^{(t+1)} − x^{(t)}‖ ≤ γ/L₂ (if we follow the update rule in Line 15) or ‖x^{(t+1)} − x^{(t)}‖ = 0 (if we follow Line 18). The inequality (ii) uses the bound on T·(1−p) from Lemma 15, and (iii) follows from plugging in the value of b_H.

(b) Bound on (B). Using Lemma 3 with the definition Pr(Q_t = 1) = p, we get
    (B) = 6 Σ_{t=1}^T p · E[ b_g σ₁²/ε² + (σ₁² + L₁ε) · ‖x^{(t+1)} − x^{(t)}‖/(b_g ε²) + 1 | Q_t = 1 ]
    (i)= 6Tp · ( b_g σ₁²/ε² + (σ₁² + L₁ε) · η/(b_g ε) + 1 )
    (ii)= O( (Δ√M/ε^{3/2}) · ( b_g σ₁²/ε² + (σ₁² + L₁ε) · η/(b_g ε) + 1 ) )
    (iii)= O( Δσ₁√(σ₁² + εL₁)/ε³ + Δ√M/ε^{3/2} )
    (iv)= O( Δσ₁√(σ₁² + εL₁)/ε³ + Δ√L₂/ε^{3/2} + Δσ₂√(log d)/ε² ),    (64)
where (i) is given by the update rule in Line 9 and the fact that HVP-RVR-Gradient-Estimator uses the parameter b_g in this case, and (ii) follows by using the bound on T·p from Lemma 15. The inequality (iii) follows because, for the choice of the parameters η and M and the assumed range of ε in the theorem statement, b_g = η√(σ₁² + εL₁)/σ₁ < 1. Finally, the inequality (iv) is given by plugging in the value of M and using that ε ≤ σ₂.

Plugging the bounds in (63) and (64) back into (62), we get
    E[M_g] = O( ΔL₂σ₁√(σ₁² + εL₁)/(γ²ε²) + Δ(σ₁² + εL₁)/(γε²) + ΔL₂²/γ³ )
             + O( Δσ₁√(σ₁² + εL₁)/ε³ + Δ√L₂/ε^{3/2} + Δσ₂√(log d)/ε² ).    (65)

Bound on E[M_{H,1}]. For each t ≥ 0, Algorithm 5 samples an independent Bernoulli Q_t with bias E[Q_t] = p and executes Line 8 if Q_t = 1. For every such pass through Line 8, the algorithm queries the stochastic Hessian oracle m₁ times. Thus,
    E[M_{H,1}] = E[ Σ_{t=0}^{T−1} 1{Q_t = 1} · m₁ ] = T · p · m₁
    (i)= O( (Δ√M/ε^{3/2}) · ⌈σ₂² log(d)/(εM)⌉ ) = O( Δ√L₂/ε^{3/2} + Δσ₂√(log d)/ε² ),    (66)
where (i) follows by plugging in the values of m₁ and M as specified in Algorithm 5 (using that ε ≤ σ₂ to simplify), and using the bound on T·p from Lemma 15.

Bound on E[M_{H,2}]. The algorithm executes Line 12 only if Q_t = 0, which happens with probability 1 − p. For every such pass through Line 12, the algorithm queries the stochastic Hessian oracle m₂ times. Consequently,
    E[M_{H,2}] = E[ Σ_{t=0}^{T−1} 1{Q_t = 0} · m₂ ] = T · (1−p) · m₂
    (i)= O( (ΔL₂²/γ³) · ⌈σ₂² log(d)/γ²⌉ ) = O( ΔL₂²σ₂² log(d)/γ⁵ + ΔL₂²/γ³ ),    (67)
where (i) follows by plugging in the value of m₂ as specified in Algorithm 5, and using the bound on T·(1−p) from Lemma 15.

Adding together all the bounds above (from (65), (66), and (67)), we have that the total number of oracle queries made by Algorithm 5 up to time T is bounded in expectation by
    E[M] = E[M_g + M_{H,1} + M_{H,2}]
         = O( ΔL₂²σ₂² log(d)/γ⁵ + ΔL₂σ₁√(σ₁² + εL₁)/(γ²ε²) + Δ(σ₁² + εL₁)/(γε²) + ΔL₂²/γ³ )
           + O( Δσ₁√(σ₁² + εL₁)/ε³ + Δ√L₂/ε^{3/2} + Δσ₂√(log d)/ε² ).
Using Markov's inequality, this implies that with probability at least 3/4, M is bounded by a constant multiple of the same expression. Ignoring the lower-order terms, we have
    M = Õ( ΔL₂²σ₂²/γ⁵ + ΔL₂σ₁²/(γ²ε²) + Δσ₁²/ε³ ).    (68)
The final statement follows by a union bound, using the failure probabilities for (61) and (68).
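The expected-count computations above reduce, via the law of total expectation, to products of the form T·p·m₁ and T·(1−p)·m₂. The following Python sketch checks this identity by simulation; all numbers (and the helper name) are illustrative, not taken from the paper.

```python
import random

# Monte Carlo check: each iteration flips Q_t ~ Bernoulli(p) and performs m1
# Hessian queries when Q_t = 1 (Line 8) and m2 queries when Q_t = 0 (Line 12).
def simulate_queries(T=2000, p=0.2, m1=30, m2=400, trials=500):
    h1 = h2 = 0
    for _ in range(trials):
        for _ in range(T):
            if random.random() < p:
                h1 += m1
            else:
                h2 += m2
    return h1 / trials, h2 / trials

mh1, mh2 = simulate_queries()
print(mh1, "~ T*p*m1 =", 2000 * 0.2 * 30)
print(mh2, "~ T*(1-p)*m2 =", 2000 * 0.8 * 400)
```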
Lemma 15. For the values of the parameters T and p specified in Algorithm 5, we have
    Δ√M/(30ε^{3/2}) ≤ T·p ≤ Δ√M/(15ε^{3/2})  and  18ΔL₂²/γ³ ≤ T·(1−p) ≤ 36ΔL₂²/γ³.

Proof. Under the assumption that γ³ ≤ ΔL₂², we have that T ≥ 18ΔL₂²/γ³ ≥ 18. Thus, using the fact that x ≤ ⌈x⌉ ≤ 2x for any x ≥ 1, we get
    18ΔL₂²/γ³ + Δ√M/(30ε^{3/2}) ≤ T ≤ 36ΔL₂²/γ³ + Δ√M/(15ε^{3/2}).    (69)
Thus, plugging in the values of T and p, we get
    T(1−p) = ⌈ 18ΔL₂²/γ³ + Δ√M/(30ε^{3/2}) ⌉ · ( 1 − √Mγ³/(√Mγ³ + 540L₂²ε^{3/2}) )
           ≤ ( 36ΔL₂²/γ³ + Δ√M/(15ε^{3/2}) ) · 540L₂²ε^{3/2}/(√Mγ³ + 540L₂²ε^{3/2}) = 36ΔL₂²/γ³,
where the first inequality is due to (69). Similarly, we have that
    T(1−p) ≥ ⌈ 18ΔL₂²/γ³ + Δ√M/(30ε^{3/2}) ⌉ · 540L₂²ε^{3/2}/(√Mγ³ + 540L₂²ε^{3/2}) ≥ 18ΔL₂²/γ³.
Together, the above two bounds imply that 18ΔL₂²/γ³ ≤ T(1−p) ≤ 36ΔL₂²/γ³. The bound on T·p follows similarly.

G Lower bounds

G.1 Proof of Theorem 3

In this section, we prove Theorem 3. We begin by generalizing the lower bound framework of Arjevani et al. (2019a), which centers around the notion of zero-respecting algorithms and stochastic gradient estimators called probabilistic zero-chains, to higher-order derivatives. Given a qth-order tensor T ∈ R^{⊗_q d}, we define support{T} := {i ∈ [d] | T_i ≠ 0}, where T_i is the (q−1)th-order tensor given by [T_i]_{j_1,...,j_{q−1}} = T_{i,j_1,...,j_{q−1}}. Given a tuple of tensors T = (T^{(1)}, T^{(2)}, ...), we let support{T} := ∪_i support{T^{(i)}} be the union of the supports of the T^{(i)}. Lastly, given an algorithm A and an oracle O^p_F, we let x^{(t)}_A[O^p_F] denote the (possibly randomized) t-th query point generated by A when fed information from O^p_F (i.e., x^{(t)}_A[O^p_F] is a measurable function of {O^p_F(x^{(i)}, z^{(i)})}_{i=1}^{t−1}, and possibly a random seed r^{(t)}).

Definition 1. A stochastic pth-order algorithm A is zero-respecting if for any function F and any pth-order oracle O^p_F, the iterates {x^{(t)}}_{t∈N} produced by A by querying O^p_F satisfy
    support(x^{(t)}) ⊆ ∪_{i<t} support{ ∇̂F(x^{(i)}, z^{(i)}), ..., ∇̂^p F(x^{(i)}, z^{(i)}) }.

Definition 2. A collection of derivative estimators ∇̂F(x,z), ..., ∇̂^p F(x,z) for a function F forms a probability-ρ zero-chain if
    Pr( ∃x : prog( ∇̂F(x,z), ..., ∇̂^p F(x,z) ) = prog_0(x) + 1 ) ≤ ρ  and
    Pr( ∃x : prog( ∇̂F(x,z), ..., ∇̂^p F(x,z) ) = prog_0(x) + i ) = 0, for all i > 1.
No constraint is imposed for i ≤ prog_0(x).

Lemma 16. Let ∇̂F(x,z), ..., ∇̂^p F(x,z) be a collection of probability-ρ zero-chain derivative estimators for F : R^T → R, and let O^p_F be an oracle with O^p_F(x,z) = (∇̂^q F(x,z))_{q∈[p]}. Let {x^{(t)}_A[O^p_F]} be a sequence of queries produced by A ∈ A_zr(K) interacting with O^p_F. Then, with probability at least 1 − δ,
    prog( x^{(t)}_A[O^p_F] ) < T, for all t ≤ (T − log(1/δ))/(2ρ).

The proof of Lemma 16 is a simple adaptation of the proof of Lemma 1 of Arjevani et al. (2019a) to high-order zero-respecting methods; we provide it here for completeness. The proof idea is that any zero-respecting algorithm must activate coordinates in sequence, and must wait on average at least Ω(1/ρ) rounds between activations, leading to a total wait time of Ω(T/ρ) rounds.
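This waiting-time intuition is easy to simulate. The Python sketch below uses toy parameters; the per-round reveal probability is an assumption matching the probability-ρ zero-chain property, not a claim about any specific oracle.

```python
import random

# Each oracle call reveals the next coordinate independently with probability
# rho; reaching progress T should take about T/rho rounds on average.
def rounds_to_reach(T=50, rho=0.01):
    progress, t = 0, 0
    while progress < T:
        t += 1
        if random.random() < rho:  # coordinate progress+1 is revealed this round
            progress += 1
    return t

trials = [rounds_to_reach() for _ in range(200)]
print("mean rounds:", sum(trials) / len(trials), "~ T/rho =", 50 / 0.01)
```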
Proof. Let {∇̂^q F(x^{(i)}, z^{(i)})}_{q∈[p]} denote the oracle responses for the i-th query made at the point x^{(i)}, and let G^{(i)} be the natural filtration generated by the algorithm's iterates, the oracle randomness, and the oracle answers up to time i. We measure the progress of the algorithm through two quantities:
    π^{(t)} := max_{i≤t} prog( x^{(i)} ) = max{ j ≤ d : x^{(i)}_j ≠ 0 for some i ≤ t },
    δ^{(t)} := max_{i≤t} prog( ∇̂^q F(x^{(i)}, z^{(i)}) ) = max{ j ≤ d : [∇̂^q F(x^{(i)}, z^{(i)})]_j ≠ 0 for some i ≤ t and q ∈ [p] }.
Note that π^{(t)} is the largest non-zero coordinate in support{(x^{(i)})_{i≤t}}, and that π^{(0)} = 0 and δ^{(0)} = 0. Thus, for any zero-respecting algorithm,
    π^{(t)} ≤ δ^{(t−1)},    (72)
for all t. Moreover, observe that with probability one,
    prog( ∇̂^q F(x^{(t)}, z^{(t)}) ) ≤ prog_0(x^{(t)}) + 1 ≤ π^{(t)} + 1 ≤ δ^{(t−1)} + 1,    (73)
where the first inequality follows by the zero-chain property. Further, using the probability-ρ zero-chain property, it follows that conditioned on G^{(i)}, with probability at least 1 − ρ,
    prog( ∇̂^q F(x^{(t)}, z^{(t)}) ) ≤ prog_0(x^{(t)}) ≤ π^{(t)} ≤ δ^{(t−1)}.    (74)
Combining (73) and (74), we have that conditioned on G^{(i−1)},
    δ^{(t−1)} ≤ δ^{(t)} ≤ δ^{(t−1)} + 1  and  Pr[ δ^{(t)} = δ^{(t−1)} + 1 ] ≤ ρ.
Thus, denoting the increments ι^{(t)} := δ^{(t)} − δ^{(t−1)}, we have via the Chernoff method,
    Pr[ δ^{(t)} ≥ T ] = Pr[ Σ_{j=1}^t ι^{(j)} ≥ T ] ≤ E[ exp( Σ_{j=1}^t ι^{(j)} ) ] / exp(T)
        = e^{−T} E[ Π_{i=1}^t E[ exp(ι^{(i)}) | G^{(i−1)} ] ] ≤ e^{−T} (1 − ρ + ρe)^t ≤ e^{2ρt − T}.
Thus, Pr[δ^{(t)} ≥ T] ≤ δ for all t ≤ (T − log(1/δ))/(2ρ); combined with (72), this yields the desired result.

In light of Lemma 16, our lower bound strategy is as follows. We construct a function F ∈ F^p(Δ, L_p) that both admits probability-ρ zero-chain derivative estimators and has large gradients for all x ∈ R^T with prog_0(x) < T. Together with Lemma 16, this ensures that any zero-respecting algorithm interacting with a pth-order oracle must perform Ω(T/ρ) steps to make the gradient of F small. We make this approach concrete by adopting the construction used in Arjevani et al. (2019a), and adjusting it so as to be consistent with the additional high-order Lipschitz and variance parameters. For each T ∈ N, we define
    F_T(x) := −Ψ(1)Φ(x_1) + Σ_{i=2}^T [ Ψ(−x_{i−1})Φ(−x_i) − Ψ(x_{i−1})Φ(x_i) ],    (75)
where the component functions Ψ and Φ are
    Ψ(x) = 0 for x ≤ 1/2,  Ψ(x) = exp( 1 − 1/(2x − 1)² ) for x > 1/2,  and  Φ(x) = √e ∫_{−∞}^x e^{−t²/2} dt.    (76)
We start by collecting some relevant properties of F_T.

Lemma 17 (Carmon et al. (2019a)). The function F_T satisfies:
1. F_T(0) − inf_x F_T(x) ≤ Δ₀·T, where Δ₀ = 12.
2. For p ≥ 1, the pth order derivatives of F_T are ℓ_p-Lipschitz continuous, where ℓ_p ≤ e^{2.5p log p + cp} for a numerical constant c < ∞.
3. For all x ∈ R^T, p ∈ N and i ∈ [T], we have ‖∇^p_i F_T(x)‖_op ≤ ℓ_{p−1}.
4. For all x ∈ R^T and p ∈ N, prog(∇^p F_T(x)) ≤ prog_0(x) + 1.
5. For all x ∈ R^T, if prog_0(x) < T then ‖∇F_T(x)‖ ≥ |∇_{prog_0(x)+1} F_T(x)| > 1.

Proof. Parts 1 and 2 follow from Lemma 3 in Carmon et al. (2019a) and its proof; Part 3 is proven in Section G.1.1; Part 4 follows from Observation 3 in Carmon et al. (2019a); and Part 5 is the same as Lemma 2 in Carmon et al. (2019a).
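For intuition, the construction (75)-(76) is straightforward to evaluate numerically. The Python sketch below implements Ψ, Φ and F_T and checks Lemma 17.5 at a point with prog_0(x) = 2 < T; the finite-difference gradient is purely an illustration device, not part of the proof.

```python
import math

def Psi(x):
    # Psi from (76): vanishes on (-inf, 1/2], smooth bump beyond it
    return 0.0 if x <= 0.5 else math.exp(1.0 - 1.0 / (2.0 * x - 1.0) ** 2)

def Phi(x):
    # sqrt(e) * integral_{-inf}^x e^{-t^2/2} dt, expressed via the error function
    return math.sqrt(math.e) * math.sqrt(math.pi / 2) * (1.0 + math.erf(x / math.sqrt(2)))

def F_T(x):
    val = -Psi(1.0) * Phi(x[0])
    for i in range(1, len(x)):
        val += Psi(-x[i - 1]) * Phi(-x[i]) - Psi(x[i - 1]) * Phi(x[i])
    return val

def grad_norm(x, h=1e-6):
    # central finite differences, adequate for illustrating Lemma 17.5
    g = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        g.append((F_T(xp) - F_T(xm)) / (2 * h))
    return math.sqrt(sum(v * v for v in g))

# prog_0(x) = 2 < T = 5, so Lemma 17.5 predicts a gradient norm exceeding 1:
print(grad_norm([1.0, 1.0, 0.0, 0.0, 0.0]))
```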
The derivative estimators we use are defined as
    [ ∇̂^q F_T(x, z) ]_i := ( 1 + 1{ i > prog_0(x) } ( z/ρ − 1 ) ) · ∇^q_i F_T(x),    (77)
where z ~ Bernoulli(ρ).

Lemma 18. The estimators ∇̂^q F_T form a probability-ρ zero-chain, are unbiased for ∇^q F_T, and satisfy
    E‖ ∇̂^q F_T(x, z) − ∇^q F_T(x) ‖² ≤ ℓ²_{q−1} (1 − ρ)/ρ, for all x ∈ R^T.    (78)

Proof. First, we observe that E[ ∇̂^q F_T(x, z) ] = ∇^q F_T(x) for all x ∈ R^T, as E[z/ρ] = 1. Second, we argue that the probability-ρ zero-chain property holds. Recall that prog_α(x) is non-increasing in α. Therefore, by Lemma 17.4, [ ∇̂^q F_T(x, z) ]_i = ∇^q_i F_T(x) = 0 for all i > prog_0(x) + 1, all x ∈ R^T and all z ∈ {0, 1}. In addition, since z ~ Bernoulli(ρ), we have Pr( ∃x : prog( ∇̂F_T(x,z), ..., ∇̂^p F_T(x,z) ) = prog_0(x) + 1 ) ≤ ρ, establishing that the oracle is a probability-ρ zero-chain.
To bound the variance of the derivative estimators, we observe that ∇̂^q F_T(x, z) − ∇^q F_T(x) has at most one nonzero (q−1)th-order tensor slice, at the index i_x = prog_0(x) + 1. Therefore,
    E‖ ∇̂^q F_T(x, z) − ∇^q F_T(x) ‖² = ‖ ∇^q_{i_x} F_T(x) ‖² E( z/ρ − 1 )² = ‖ ∇^q_{i_x} F_T(x) ‖² (1 − ρ)/ρ ≤ (1 − ρ) ℓ²_{q−1}/ρ,
where the final inequality is due to Lemma 17.3, establishing the variance bound in (78).
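The estimator (77) is simply a Bernoulli mask on the single "active" coordinate slice. The one-dimensional Python sketch below checks unbiasedness and the (1−ρ)/ρ variance factor of Lemma 18 empirically; all numbers are illustrative.

```python
import random

# For a coordinate beyond prog_0(x), the true derivative d is replaced by
# d*z/rho with z ~ Bernoulli(rho): unbiased, with variance d^2*(1-rho)/rho.
def estimate(d, rho):
    z = 1.0 if random.random() < rho else 0.0
    return d * (z / rho)

rho, d, n = 0.05, 2.0, 200_000
samples = [estimate(d, rho) for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
print("mean ~ d:", mean)
print("variance ~ d^2*(1-rho)/rho:", var, "vs", d * d * (1 - rho) / rho)
```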
Proof of Theorem 3. We now prove Theorem 3 by scaling the construction F_T appropriately. Let Δ₀ and the ℓ_q be the numerical constants in Lemma 17. Let the accuracy parameter ε, initial suboptimality Δ, derivative order p ∈ N, smoothness parameters L₁, ..., L_p, and variance parameters σ₁, ..., σ_p be fixed. We set
    F*_T(x) = α F_T(βx),
for some scalars α and β to be determined. The relevant properties of F*_T scale as follows:
    F*_T(0) − inf_x F*_T(x) = α ( F_T(0) − inf_x F_T(x) ) ≤ αΔ₀T,    (79)
    ‖ ∇^{q+1} F*_T(x) ‖ = αβ^{q+1} ‖ ∇^{q+1} F_T(βx) ‖ ≤ αβ^{q+1} ℓ_q,    (80)
    ‖ ∇F*_T(x) ‖ = αβ ‖ ∇F_T(βx) ‖ ≥ αβ, ∀x s.t. prog_0(x) < T.    (81)
The corresponding scaled derivative estimators ∇̂^q F*_T(x, z) = αβ^q ∇̂^q F_T(βx, z) clearly form a probability-ρ zero-chain. Therefore, by Lemma 16, we have that for every zero-respecting algorithm A interacting with O^p_{F*_T}, with probability at least 1/2, prog_0( x^{(t)}_A[O^p_F] ) < T for all t ≤ (T − 1)/(2ρ). Hence, since prog_0(x) ≤ prog(x) for any x ∈ R^T, we have by Lemma 17.5,
    E‖ ∇F*_T( x^{(t)}_A[O^p_F] ) ‖ = αβ E‖ ∇F_T( βx^{(t)}_A[O^p_F] ) ‖ ≥ αβ/2, ∀t ≤ (T − 1)/(2ρ).    (82)
We bound the variance of the scaled derivative estimators as
    E‖ ∇̂^q F*_T(x, z) − ∇^q F*_T(x) ‖² = α²β^{2q} E‖ ∇̂^q F_T(βx, z) − ∇^q F_T(βx) ‖² ≤ α²β^{2q} ℓ²_{q−1} (1 − ρ)/ρ,
where the last inequality follows by Lemma 18. Our goal now is to meet the following set of constraints:
• Δ-constraint: αΔ₀T ≤ Δ.
• L_q-constraint: αβ^{q+1} ℓ_q ≤ L_q, for q ∈ [p].
• ε-constraint: αβ ≥ 2ε.
• σ_q-constraint: α²β^{2q} ℓ²_{q−1} (1 − ρ)/ρ ≤ σ²_q, for q ∈ [p].
Generically, since there are more inequalities to satisfy than the number of degrees of freedom (α, β, T and ρ) in our construction, not all inequalities can be activated (that is, met with equality) simultaneously. Different compromises will yield different rates. First, to have a tight dependence in terms of ε, we activate the ε-constraint by setting α = 2ε/β. Next, we activate the σ₁-constraint by setting ρ = min{ (αβℓ₀/σ₁)², 1 } = min{ (2εℓ₀/σ₁)², 1 }. The bound on the variance of the qth-order derivative now reads
    α²β^{2q} ℓ²_{q−1} (1 − ρ)/ρ ≤ σ₁² α²β^{2q} ℓ²_{q−1} / (αβℓ₀)² = ℓ²_{q−1} β^{2(q−1)} σ₁² / ℓ₀², q = 2, ..., p.
Since β is the only remaining degree of freedom which can be tuned to meet (though not necessarily activate) the σ_q-constraints for q = 2, ..., p and the L_q-constraints for q = 1, ..., p, we are forced to set
    β = min_{q=2,...,p; q'=1,...,p} min{ ( ℓ₀σ_q/(ℓ_{q−1}σ₁) )^{1/(q−1)}, ( L_{q'}/(2εℓ_{q'}) )^{1/q'} }.    (83)
Lastly, we activate the Δ-constraint by setting
    T = ⌊ Δ/(αΔ₀) ⌋ = ⌊ Δβ/(2Δ₀ε) ⌋.
Assuming (2εℓ₀/σ₁)² ≤ 1 and T ≥ 3, we have by (82) that the number of oracle queries required to obtain an ε-stationary point for F*_T is bounded from below by
    (T − 1)/(2ρ) = (1/(2ρ)) ( ⌊Δβ/(2Δ₀ε)⌋ − 1 )
    (*)≥ (1/(2ρ)) · Δβ/(4Δ₀ε) = (σ₁²/(8ℓ₀²ε²)) · Δβ/(4Δ₀ε)
    = (Δσ₁²/(32Δ₀ℓ₀²ε³)) · min_{q=2,...,p; q'=1,...,p} min{ ( ℓ₀σ_q/(ℓ_{q−1}σ₁) )^{1/(q−1)}, ( L_{q'}/(2εℓ_{q'}) )^{1/q'} },    (84)
where (*) uses ⌊ξ⌋ − 1 ≥ ξ/2 for ξ ≥ 3, implying the desired bound. Lastly, we note that one can obtain tight lower complexity bounds for deterministic oracles by setting ρ = 1. Following the same chain of inequalities as in (84), in this case we get a lower oracle-complexity bound of
    (Δ/(8Δ₀ε)) · min_{q=1,...,p} ( L_q/(2εℓ_q) )^{1/q}.    (85)
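The parameter-setting recipe above is mechanical and can be packaged as in the following Python sketch. The helper name and the placeholder values of the constants ℓ_q are ours, for illustration only; they are not the paper's constants.

```python
import math

# Given problem parameters, return alpha, beta, rho, T per the proof of
# Theorem 3, plus the resulting query lower bound (T - 1)/(2*rho) from (84).
def lower_bound_params(eps, Delta, L, sigma, ell, Delta0=12.0):
    p = len(L)  # L[q-1] = L_q, sigma[q-1] = sigma_q, ell[q] = ell_q for q = 0..p
    beta_sigma = min((ell[0] * sigma[q - 1] / (ell[q - 1] * sigma[0])) ** (1.0 / (q - 1))
                     for q in range(2, p + 1))
    beta_L = min((L[q - 1] / (2.0 * eps * ell[q])) ** (1.0 / q) for q in range(1, p + 1))
    beta = min(beta_sigma, beta_L)                        # equation (83)
    alpha = 2.0 * eps / beta                              # activates the eps-constraint
    rho = min((2.0 * eps * ell[0] / sigma[0]) ** 2, 1.0)  # activates the sigma_1-constraint
    T = math.floor(Delta * beta / (2.0 * Delta0 * eps))   # activates the Delta-constraint
    return alpha, beta, rho, T, (T - 1) / (2.0 * rho)

print(lower_bound_params(eps=0.01, Delta=1.0, L=[1.0, 1.0], sigma=[1.0, 1.0],
                         ell=[1.0, 1.0, 1.0]))
```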
G.1.1 Bounding the operator norm of ∇^p_i F_T

In this subsection we complete the proof of Lemma 17 by proving Part 3. Our proof follows along the lines of the proof of Lemma 3 of Carmon et al. (2019a). Let x ∈ R^T and i₁, ..., i_p ∈ [T], and note that by the chain-like structure of F_T, ∂_{i₁} ··· ∂_{i_p} F_T(x) is non-zero only if |i_j − i_k| ≤ 1 for all j, k ∈ [p]. A straightforward calculation yields
    | ∂_{i₁} ··· ∂_{i_p} F_T(x) | ≤ max_{i∈[T]} max_{δ∈{0,1}^{p−1}∪{0,−1}^{p−1}} | ∂_{i+δ_1} ··· ∂_{i+δ_{p−1}} ∂_i F_T(x) |    (86)
        ≤ max_{k∈[p]} { sup_{ξ∈R} |Ψ^{(k)}(ξ)| · sup_{ξ'∈R} |Φ^{(p−k)}(ξ')| } ≤ exp( 2.5p log p + 4p + 9 ) ≤ ℓ_{p−1} 2^{−p+1},
where the penultimate inequality is due to Lemma 1 of Carmon et al. (2019a). Therefore, for a fixed i ∈ [T], we have
    ‖ ∇^p_i F_T(x) ‖_op (a)= sup_{‖v‖=1} | ⟨ ∇^p_i F_T(x), v^{⊗(p−1)} ⟩ | = sup_{‖v‖=1} | Σ_{i₁,...,i_{p−1}∈[T]} ∂_{i₁} ··· ∂_{i_{p−1}} ∂_i F_T(x) v_{i₁} ··· v_{i_{p−1}} |
        (b)≤ Σ_{δ∈{0,1}^{p−1}∪{0,−1}^{p−1}} | ∂_{i+δ_1} ··· ∂_{i+δ_{p−1}} ∂_i F_T(x) |
        (c)≤ 2^{p−1} · ℓ_{p−1} 2^{−p+1} ≤ ℓ_{p−1},
where (a) follows from the definition of the operator norm, (b) follows by the chain-like structure of F_T, and (c) follows from (86), concluding the proof.

G.2 Proof of Theorem 6

In this section we prove Theorem 6 following the schema outlined in Section 4.2. We start by collecting all the relevant properties of Ψ and Λ from the construction in (11).

Lemma 19. The functions Ψ and Λ satisfy the following properties:
1. For all x ≤ 1/2 and for all k ∈ N ∪ {0}, Ψ^{(k)}(x) = 0.
2. The function Ψ is non-negative and its first- and second-order derivatives are bounded: 0 ≤ Ψ ≤ e, 0 ≤ Ψ' ≤ √(54/e), −32.5 ≤ Ψ'' ≤ 32.5.
3. The function Λ and its first- and second-order derivatives are bounded: −1 ≤ Λ ≤ 1, −8 ≤ Λ' ≤ 8, −22 ≤ Λ'' ≤ 22.
4. Both Ψ and Λ are infinitely differentiable, and for all k ∈ N, we have
    sup_x |Ψ^{(k)}(x)| ≤ exp( (3k/2) log(3k/2) )  and  sup_x |Λ^{(k)}(x)| ≤ 8√e · exp( (3(k+1)/2) log(3(k+1)/2) ).

Proof. Parts 1-3 are immediate. Part 4 follows from Lemma 1 of Carmon et al. (2019a) and by noting that
    sup_x |Λ^{(k)}(x)| = 8√e sup_x |Φ^{(k+1)}(x)| ≤ 8√e · exp( (3(k+1)/2) log(3(k+1)/2) ).

Using these basic properties of Ψ and Λ, we establish the following properties of the construction G_T (analogous to Lemma 17).

Lemma 20. The function G_T satisfies the following properties:
1. G_T(0) − inf_x G_T(x) ≤ Δ̃₀T, with Δ̃₀ = 40.
2. For p ≥ 1, the pth order derivatives of G_T are ℓ̃_p-Lipschitz continuous, where ℓ̃_p ≤ e^{cp log p + c'p} for numerical constants c, c' < ∞.
3. For all x ∈ R^T and i ∈ [T], we have ‖∇^p_i G_T(x)‖_op ≤ ℓ̃_{p−1}.
4. For all x ∈ R^T and q ∈ [p], prog( ∇^{(q)} G_T(x) ) ≤ prog_{1/2}(x) + 1.
5. For all x ∈ R^T, if prog_{1/2}(x) < T − 1 then λ_min( ∇²G_T(x) ) ≤ −1.5, and λ_min( ∇²G_T(x) ) ≤ 700 otherwise.

Proof. We prove the individual parts of the lemma one by one:
1. Since Ψ(0) = Λ(0) = 0, we have
    G_T(0) = −Ψ(1)Λ(0) + Σ_{i=2}^T [ Ψ(0)Λ(0) + Ψ(0)Λ(0) ] = 0.
On the other hand, for any x,
    G_T(x) = −Ψ(1)Λ(x_1) + Σ_{i=2}^T [ Ψ(−x_{i−1})Λ(−x_i) + Ψ(x_{i−1})Λ(x_i) ] ≥ −2eT (by Lemma 19.2 and Lemma 19.3) ≥ −40T.
2. The proof follows along the same lines as Lemma 3 of Carmon et al. (2019a), together with the derivative bounds stated in Lemma 19.4.
3. The claim follows using the same calculation as in Section G.1.1, with the derivative bounds replaced by those in Lemma 19.4, mutatis mutandis.
4. The claim follows from Observation 3 in Carmon et al. (2019a), mutatis mutandis.
5. We have
    ∂G_T/∂x_j = −Ψ(−x_{j−1})Λ'(−x_j) + Ψ(x_{j−1})Λ'(x_j) − Ψ'(−x_j)Λ(−x_{j+1}) + Ψ'(x_j)Λ(x_{j+1}).    (87)
Therefore, for any x ∈ R^T, ∇²G_T(x) is a tridiagonal matrix, specified as follows:
    ∇²G_T(x)_{i,j} =
        Ψ(−x_{i−1})Λ''(−x_i) + Ψ(x_{i−1})Λ''(x_i) + Ψ''(−x_i)Λ(−x_{i+1}) + Ψ''(x_i)Λ(x_{i+1})   if i = j,
        Ψ'(−x_j)Λ'(−x_i) + Ψ'(x_j)Λ'(x_i)   if j = i − 1,
        Ψ'(−x_i)Λ'(−x_j) + Ψ'(x_i)Λ'(x_j)   if j = i + 1,
        0   otherwise.
The following facts can be verified by a straightforward calculation:
(i) Ψ(x) ≥ 1 for x ≥ 1; (ii) Ψ''(x) = 0 for |x| < 1/2 (by Lemma 19.1); (iii) Λ''(x) ≤ −1.5 for |x| < 1/2.
Now, for k := prog_{1/2}(x) + 1 < T, we have, by definition, that |x_{k+1}|, |x_k| < 1/2 ≤ |x_{k−1}|, implying
    λ_min( ∇²G_T(x) ) = min_{y∈R^T} yᵀ∇²G_T(x)y / yᵀy   (Rayleigh quotient)
        ≤ e_kᵀ∇²G_T(x)e_k / e_kᵀe_k = ∇²G_T(x)_{k,k}
        = Ψ(−x_{k−1})Λ''(−x_k) + Ψ(x_{k−1})Λ''(x_k) + Ψ''(−x_k)Λ(−x_{k+1}) + Ψ''(x_k)Λ(x_{k+1})
        ≤ Ψ(−x_{k−1})Λ''(−x_k) + Ψ(x_{k−1})Λ''(x_k)   ((ii) and |Λ| ≤ 1)
        = Ψ(|x_{k−1}|)Λ''( sign{x_{k−1}} x_k )   (Ψ(x) = 0, ∀x ≤ 1/2)
        ≤ −1.5 · 1 = −1.5.   ((i) and (iii))
Otherwise, if nothing is assumed on x, then the same chain of inequalities, using k = 2, can be used to bound the minimal eigenvalue of ∇²G_T(x):
    λ_min( ∇²G_T(x) ) ≤ e_kᵀ∇²G_T(x)e_k / e_kᵀe_k = ∇²G_T(x)_{k,k}
        = Ψ(−x_{k−1})Λ''(−x_k) + Ψ(x_{k−1})Λ''(x_k) + Ψ''(−x_k)Λ(−x_{k+1}) + Ψ''(x_k)Λ(x_{k+1})
        ≤ 2(22e + 32.5) ≤ 700,
thus giving the desired bound.

We employ similar derivative estimators to the proof of Theorem 3, only this time we provide a noiseless estimate for the gradient. Formally, we set
    [ ∇̂^q G_T(x, z) ]_i := ∇_i G_T(x) if q = 1,  and  ( 1 + 1{ i > prog_{1/2}(x) } ( z/ρ − 1 ) ) · ∇^q_i G_T(x) if q ≥ 2,    (88)
where z ~ Bernoulli(ρ). The dynamics of zero-respecting methods can now be characterized in a manner analogous to the proof of Theorem 3. The only difference is that here, since Λ'(0) = Ψ'(0) = 0, it follows that prog( ∇G_T(x) ) = prog_{1/2}(x). Therefore, the collection of estimators defined above is a probability-ρ zero-chain, with respect to prog_{1/2} (rather than prog_0 as in Definition 2), in which the variance of the gradient estimator is 0; a key property that shall be used soon. (Using prog_{1/2}, rather than prog_0, carries one major disadvantage: our bounds for finding γ-weakly convex points cannot be directly extended to arbitrary randomized algorithms using the technique presented in Section 3.4 of Carmon et al. (2019a) as is; at least, not without degrading the dependence on problem parameters. We defer such an extension to future work.) Following the proof of Lemma 16, mutatis mutandis, gives us the same bound on the number of non-zero entries acquired over time. That is, we have that with probability at least 1 − δ,
    prog_{1/2}( x^{(t)}_A[O^p_F] ) < T, for all t ≤ (T − log(1/δ))/(2ρ),    (89)
where we employ the same notation as in Lemma 16. The proof now proceeds along the same lines as the proof of Theorem 3. The estimators have variance bounded as
    E‖ ∇̂^q G_T(x, z) − ∇^q G_T(x) ‖² ≤ 0 for q = 1,  and  ≤ ℓ̃²_{q−1} (1 − ρ)/ρ for q ≥ 2, for all x ∈ R^T,    (90)
which can be established in the same fashion as Lemma 18, by invoking Lemma 20.3 and Lemma 20.4.
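The Rayleigh-quotient step used in the proof of Lemma 20.5 above, namely λ_min(A) ≤ e_kᵀAe_k = A_{kk} for symmetric A, is easy to check numerically. The Python sketch below estimates λ_min of a made-up tridiagonal matrix by shifted power iteration and compares it with the smallest diagonal entry; it is an illustration, not the paper's construction.

```python
# lambda_min(A) <= min_k A_{kk}: a sufficiently negative diagonal entry
# certifies negative curvature of a symmetric matrix.
def lambda_min_power(A, iters=5000):
    # power iteration on s*I - A (s from a Gershgorin bound) approximates
    # the eigenvector of lambda_min(A); the Rayleigh quotient recovers it
    n = len(A)
    s = 1.0 + max(sum(abs(v) for v in row) for row in A)
    v = [1.0 / n] * n
    for _ in range(iters):
        w = [s * v[i] - sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))

A = [[2.0, 0.5, 0.0],
     [0.5, -1.6, 0.5],
     [0.0, 0.5, 3.0]]
print(lambda_min_power(A), "<= min diagonal entry:", min(A[i][i] for i in range(3)))
```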
Proof of Theorem 6. We now complete the proof of Theorem 6 for p ≥ 2 by scaling G_T appropriately. Let Δ̃₀ and the ℓ̃_p be the numerical constants from Lemma 20. Let the accuracy parameter γ, initial suboptimality Δ, derivative order p ∈ N, smoothness parameters L₁, ..., L_p, and variance parameters σ₁, σ₂, ..., σ_p be fixed. We let
    G*_T(x) := α G_T(βx),
for scalars α and β to be determined. The relevant properties of G*_T are as follows:
    G*_T(0) − inf_x G*_T(x) = α ( G_T(0) − inf_x G_T(x) ) ≤ αΔ̃₀T,    (91)
    ‖ ∇^{q+1} G*_T(x) ‖ = αβ^{q+1} ‖ ∇^{q+1} G_T(βx) ‖ ≤ αβ^{q+1} ℓ̃_q,    (92)
    λ_min( ∇²G*_T(x) ) = αβ² λ_min( ∇²G_T(βx) ) ≤ −αβ², ∀x s.t. prog_{1/2}(x) < T.    (93)
The corresponding scaled derivative estimators ∇̂^q G*_T(x, z) = αβ^q ∇̂^q G_T(βx, z) clearly form a probability-ρ zero-chain; thus by (89), we have that for every zero-respecting algorithm A interacting with O^p_{G*_T}, with probability at least 1 − 1/(4 · 700), prog_{1/2}( x^{(t)}_A[O^p_F] ) < T − 1 for all t ≤ (T − 1)/(2ρ). Therefore, since prog_{1/2}(x) ≤ prog(x) for any x ∈ R^T, we have by Lemma 20.5,
    E[ λ_min( ∇²G*_T( x^{(t)}_A[O^p_F] ) ) ] = αβ² E[ λ_min( ∇²G_T( βx^{(t)}_A[O^p_F] ) ) ]
        ≤ αβ² ( −1.5 · (1 − 1/(4 · 700)) + 700 · 1/(4 · 700) ) ≤ −αβ²,    (94)
for any t ≤ (T − 1)/(2ρ). The variance of the scaled derivative estimators can be bounded as
    E‖ ∇̂^q G*_T(x, z) − ∇^q G*_T(x) ‖² = α²β^{2q} E‖ ∇̂^q G_T(βx, z) − ∇^q G_T(βx) ‖² ≤ α²β^{2q} ℓ̃²_{q−1} (1 − ρ)/ρ,
where the last inequality is by (90). Our goal now is to meet the following set of constraints:
• Δ-constraint: αΔ̃₀T ≤ Δ.
• L_q-constraint: αβ^{q+1} ℓ̃_q ≤ L_q, for q = 1, ..., p.
• γ-constraint: −αβ² ≤ −5γ.
• σ_q-constraint: α²β^{2q} ℓ̃²_{q−1} (1 − ρ)/ρ ≤ σ²_q, for q = 1, ..., p.
As there are more inequalities to satisfy than the four degrees of freedom (α, β, T and ρ) in our construction, generically, not all inequalities can be activated (that is, met with equality) simultaneously. Different compromises may yield different bounds. First, to have a tight dependence in terms of γ, we activate the γ-constraint by setting α = 5γ/β². Next, we activate the σ₂-constraint by setting ρ = min{ (αβ²ℓ̃₁/σ₂)², 1 } = min{ (5ℓ̃₁γ/σ₂)², 1 }. The bound on the variance of the qth derivative, for q = 3, ..., p, now reads
    α²β^{2q} ℓ̃²_{q−1} (1 − ρ)/ρ ≤ σ₂² α²β^{2q} ℓ̃²_{q−1} / (αβ²ℓ̃₁)² = ℓ̃²_{q−1} β^{2(q−2)} σ₂² / ℓ̃₁², q = 3, ..., p.
Since β is the only degree of freedom which can be tuned to meet (though not necessarily activate) the σ_q-constraints for q = 3, ..., p, and the L_{q'}-constraints for q' = 2, ..., p, we are forced to set
    β = min_{q=3,...,p; q'=2,...,p} min{ ( ℓ̃₁σ_q/(ℓ̃_{q−1}σ₂) )^{1/(q−2)}, ( L_{q'}/(5ℓ̃_{q'}γ) )^{1/(q'−1)} }.    (95)
Note that, by definition, the σ₁-constraint always holds (as the variance of the gradient estimator is zero; see (90)). To satisfy the L₁-constraint, i.e., αβ²ℓ̃₁ ≤ L₁, we must have
    γ ≤ L₁/(5ℓ̃₁).    (96)
This constraint holds without loss of generality,
as L₁ also bounds the absolute values of the Hessian eigenvalues (in other words, any point x is trivially O(L₁)-weakly convex). Lastly, we activate the Δ-constraint by setting
    T = ⌊ Δ/(αΔ̃₀) ⌋ = ⌊ Δβ²/(5Δ̃₀γ) ⌋.
Assuming (5ℓ̃₁γ/σ₂)² ≤ 1 (that is, γ = O(σ₂)) and T ≥ 3, we have by (94) that the number of oracle queries required to obtain a point x such that λ_min( ∇²G*_T(x) ) > −γ is bounded from below by
    (T − 1)/(2ρ) = (1/(2ρ)) ( ⌊Δβ²/(5Δ̃₀γ)⌋ − 1 )
    (*)≥ (1/(2ρ)) · Δβ²/(10Δ̃₀γ) ≥ (σ₂²/(50ℓ̃₁²γ²)) · Δβ²/(10Δ̃₀γ)
    = (Δσ₂²/(500ℓ̃₁²Δ̃₀γ³)) · min_{q=3,...,p; q'=2,...,p} min{ ( ℓ̃₁σ_q/(ℓ̃_{q−1}σ₂) )^{2/(q−2)}, ( L_{q'}/(5ℓ̃_{q'}γ) )^{2/(q'−1)} },    (97)
where (*) uses ⌊ξ⌋ − 1 ≥ ξ/2 for ξ ≥ 3, implying the desired result (note that this bound does not depend on L₁ and σ₁).

If σ₂ = ··· = σ_p = 0, we obtain the following lower complexity bound for noiseless oracles (where ρ is effectively set to one), assuming γ = O(L₁) (this holds without loss of generality, as we discussed above). As before, we set α = 5γ/β². The L₁-constraint is satisfied under the same condition stated in (96). Thus, letting
    β = min_{q=2,...,p} ( L_q/(5ℓ̃_qγ) )^{1/(q−1)},
it follows that our construction is L_q-Lipschitz for every q = 1, ..., p. Following the same chain of inequalities as in (97) yields an oracle complexity lower bound of
    Δβ²/(10Δ̃₀γ) = ( Δ/(10Δ̃₀γ) ) · min_{q=2,...,p} ( L_q/(5ℓ̃_qγ) )^{2/(q−1)}.
Note that this bound does not depend on L₁.