On the Iteration Complexity of Oblivious First-Order Optimization Algorithms
Yossi Arjevani
Weizmann Institute of Science
[email protected]

Ohad Shamir
Weizmann Institute of Science
[email protected]
Abstract
We consider a broad class of first-order optimization algorithms which are oblivious, in the sense that their step sizes are scheduled regardless of the function under consideration, except for limited side-information such as smoothness or strong convexity parameters. With the knowledge of these two parameters, we show that any such algorithm attains an iteration complexity lower bound of Ω(√(L/ǫ)) for L-smooth convex functions, and Ω̃(√(L/µ) ln(1/ǫ)) for L-smooth µ-strongly convex functions. These lower bounds are stronger than those in the traditional oracle model, as they hold independently of the dimension. To attain these, we abandon the oracle model in favor of a structure-based approach which builds upon a framework recently proposed in [1]. We further show that without knowing the strong convexity parameter, it is impossible to attain an iteration complexity better than Ω̃((L/µ) ln(1/ǫ)). This result is then used to formalize an observation regarding L-smooth convex functions, namely, that the iteration complexity of algorithms employing time-invariant step sizes must be at least Ω(L/ǫ).

1 Introduction

The ever-increasing utility of mathematical optimization in machine learning and other fields has led to a great interest in understanding the computational boundaries of solving optimization problems. Of particular interest is the class of unconstrained smooth, and possibly strongly convex, optimization problems. Formally, we consider the problem

  min_{x ∈ R^d} f(x),

where f : R^d → R is convex and L-smooth, i.e.,

  ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for some L > 0,

and possibly µ-strongly convex, that is,

  f(y) ≥ f(x) + ⟨y − x, ∇f(x)⟩ + (µ/2)‖y − x‖² for some µ > 0.
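These two defining inequalities are easy to sanity-check numerically. The following sketch (ours, not part of the paper; the quadratic and the sampled range are arbitrary) verifies both conditions for a one-dimensional quadratic, which is a-smooth and a-strongly convex with L = µ = a:

```python
import random

# Sketch (ours): spot-check the smoothness and strong convexity inequalities
# above for f(x) = 0.5 * a * x^2 in one dimension, where L = mu = a.

a = 2.0
f = lambda x: 0.5 * a * x * x
grad = lambda x: a * x

rng = random.Random(0)
for _ in range(1000):
    x, y = rng.uniform(-5, 5), rng.uniform(-5, 5)
    # smoothness: |f'(x) - f'(y)| <= L |x - y|
    assert abs(grad(x) - grad(y)) <= a * abs(x - y) + 1e-12
    # strong convexity: f(y) >= f(x) + (y - x) f'(x) + (mu/2) (y - x)^2
    assert f(y) >= f(x) + (y - x) * grad(x) + 0.5 * a * (y - x) ** 2 - 1e-9
```

For this particular f the strong convexity inequality holds with equality, which is why the check passes with only a rounding tolerance.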
In this work, we address the question of how fast one can expect to solve this sort of problem to a prescribed level of accuracy, using methods based on first-order information (gradients, or more generally sub-gradients) alone.

The standard approach to quantifying the computational hardness of optimization problems is through the Oracle Model. In this approach, one models the interaction of a given optimization algorithm with some instance from a class of functions as a sequence of queries, issued by the algorithm, to an external first-order oracle procedure. Upon receiving a query point x ∈ R^d, the oracle reports the corresponding value f(x) and gradient ∇f(x). In their seminal work, Nemirovsky and Yudin [11] showed that for any first-order optimization algorithm, there exists an L-smooth and µ-strongly convex function f : R^d → R such that the number of queries required to obtain an ǫ-optimal solution x̃, which satisfies

  f(x̃) < min_{x ∈ R^d} f(x) + ǫ,

is at least

  Ω̃( min{d, √κ} ln(1/ǫ) ),  µ > 0,    (1)
  Ω̃( min{d ln(1/ǫ), √(L/ǫ)} ),  µ = 0,

where κ := L/µ is the so-called condition number. This lower bound, although based on information considerations alone, is tight. Concretely, it is achieved by a combination of Nesterov's well-known accelerated gradient descent (AGD, [12]), with an iteration complexity of

  Õ( √κ ln(1/ǫ) ),  µ > 0,    (2)
  O( √(L/ǫ) ),  µ = 0,

and the center-of-gravity method (MCG, [9, 14]), whose iteration complexity is O(d ln(1/ǫ)).

Although the combination of MCG and AGD appears to achieve optimal iteration complexity, this is not the case when focusing on computationally efficient algorithms. In particular, the per-iteration cost of MCG scales poorly with the problem dimension, rendering it impractical for high-dimensional problems.
In other words, not taking into account the computational resources needed for processing first-order information limits the ability of the oracle model to give a faithful picture of the complexity of optimization.

To overcome this issue, [1] recently proposed the framework of p-Stationary Canonical Linear Iterative algorithms (p-SCLI), in which, instead of modeling the way algorithms acquire information on the function at hand, one assumes certain dynamics which restrict the way new iterates are generated. This framework includes a large family of computationally efficient first-order algorithms whose update rules, when applied to quadratic functions, reduce to a recursive application of some fixed linear transformation on the most recent p points (in other words, p indicates the number of previous iterates stored by the algorithm in order to compute a new iterate). The paper showed that the iteration complexity of p-SCLIs over smooth and strongly convex functions is bounded from below by

  Ω̃( κ^{1/p} ln(1/ǫ) ).    (3)

Crucially, as opposed to the classical lower bounds in (1), the lower bound in (3) holds for any dimension d > 0. This implies that even for fixed d, the iteration complexity of p-SCLI algorithms must scale with the condition number. That being said, the lower bound in (3) raises a few major issues which we wish to address in this work. (Following standard conventions, here tilde notation hides logarithmic factors in the smoothness parameter, the strong convexity parameter and the distance of the initialization point from the minimizer.)

• Practical first-order algorithms in the literature only attain this bound for p = 1, 2 (by standard gradient descent and AGD, respectively), so the lower bound appears intuitively loose. Nevertheless, [1] showed that this bound is actually tight for all p.
The reason for this discrepancy is that the bound for p > 2 was shown to be attained by p-SCLI algorithms whose updates require exact knowledge of spectral properties of the Hessian, which is computationally prohibitive to obtain in large-scale problems. In this work, we circumvent this issue by systematically considering the side-information available to the algorithm. In particular, we show that under the realistic assumption that the algorithm may only utilize the strong convexity and smoothness of the objective function, the lower bound in (3) can be substantially improved.

• The lower bound stated above is limited to stationary optimization algorithms, whose coefficients α_j, β_j are not allowed to change in time (see Section 2.2).

• The formulation suggested in [1] does not allow generating more than one iterate at a time. This requirement is not met by many popular optimization algorithms for finite-sum minimization.

• Lastly, whereas the proofs in [1] are elaborate and technically complex, the proofs we provide here are relatively short and simple.

In its simplest form, the framework we consider is concerned with algorithms which generate iterates by repeatedly applying the following simple update rule:

  x^(k+1) = Σ_{j=1}^p ( α_j ∇f(x^(k+1−j)) + β_j x^(k+1−j) ),    (4)

where α_j, β_j ∈ R denote the corresponding coefficients. A clear advantage of this class of algorithms is that, given the corresponding gradients, the computational cost of executing each update rule scales linearly with the dimension of the problem and with p.

This basic formulation already subsumes popular first-order optimization algorithms. For example, at each iteration the Gradient Descent (GD) method generates a new iterate by computing a linear combination of the current iterate and its gradient, i.e.,

  x^(k+1) = x^(k) + α ∇f(x^(k))    (5)

for some real scalar α.
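To make the oblivious form of (5) concrete, here is a minimal sketch (ours, not from the paper) of GD as a 1-CLI with coefficients α₁ = −1/L, β₁ = 1. The separable quadratic objective is our own choice, picked only so the minimizer is known in closed form:

```python
# Sketch (ours): gradient descent viewed as an oblivious 1-CLI, i.e. update
# rule (4) with p = 1, alpha_1 = -1/L, beta_1 = 1. Illustrative objective:
# f(x) = 0.5 * x^T diag(eta) x + eta^T x, with minimizer x* = (-1, ..., -1).

def gd_oblivious(eta, L, iters):
    d = len(eta)
    x = [0.0] * d                                        # initialize at the origin
    for _ in range(iters):
        grad = [e * xi + e for e, xi in zip(eta, x)]     # grad f(x) = diag(eta) x + eta
        x = [xi - g / L for xi, g in zip(x, grad)]       # beta_1 * x + alpha_1 * grad
    return x

x = gd_oblivious(eta=[1.0, 2.0, 4.0], L=4.0, iters=200)  # spectrum inside [mu, L] = [1, 4]
```

Note that the step size depends only on the side-information L, never on the particular η: this is exactly what obliviousness means here.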
Another important example is a stationary variant of AGD [13] and the heavy-ball method (e.g., [16]), which generate iterates according to

  x^(k+1) = β₁ x^(k) + α₁ ∇f(x^(k)) + β₂ x^(k−1) + α₂ ∇f(x^(k−1)).    (6)

In this paper, we follow a generalized form of (4) which is exhibited by standard optimization algorithms: GD, conjugate gradient descent, sub-gradient descent, AGD, the heavy-ball method, coordinate descent, quasi-Newton methods, the ellipsoid method, etc. The main difference between these methods is how much effort one is willing to put into computing the coefficients of the optimization process. We call these methods first-order p-Canonical Linear Iterative optimization algorithms (abbreviated p-CLI throughout). We note that our framework (as a method to prove lower bounds) also applies to stochastic algorithms, as long as the expected update rule (conditioned on the history) follows a generalized form similar to (4).

In the context of machine learning, many algorithms for minimizing finite sums of functions with, possibly, a regularization term (also known as Regularized Empirical Risk Minimization) also fall into our framework, e.g., Stochastic Average Gradient (SAG, [17]), Stochastic Variance Reduced Gradient (SVRG, [7]), Stochastic Dual Coordinate Ascent (SDCA, [19]), Stochastic Dual Coordinate Ascent without Duality (SDCA without duality, [18]) and SAGA [3], to name a few, and as such are subject to the same lower bounds established through this framework.

In its full generality, the formulation of this framework is too rich to say much. In what follows, we shall focus on oblivious p-CLIs, which satisfy the realistic assumption that the coefficients α_j, β_j do not depend on the specific function under consideration. Instead, they can only depend on time and some limited side-information on the function (this term will be made precise in Definition 1).
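As an illustration of the stationary 2-CLI form (6), the following sketch (ours; the coefficient values are hand-picked for this example and are not prescribed by the paper) runs a heavy-ball-type update on a one-dimensional quadratic:

```python
# Sketch (ours): the form (6) with beta_1 = 1 + b, alpha_1 = a, beta_2 = -b,
# alpha_2 = 0, i.e. a heavy-ball-type update, on f(x) = 0.5 * h * x^2 + h * x
# (minimizer x* = -1). The constants a, b are illustrative and stable for this h.

def two_cli(h, a, b, iters):
    x_prev = x = 0.0
    for _ in range(iters):
        grad = h * x + h
        # x^{k+1} = (1 + b) x^k + a * grad f(x^k) - b * x^{k-1}
        x, x_prev = (1 + b) * x + a * grad - b * x_prev, x
    return x

x = two_cli(h=2.0, a=-0.2, b=0.2, iters=300)
```

Since the coefficients are constant across iterations, this is a stationary p-CLI in the sense defined below.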
In particular, we show that the iteration complexity of oblivious p-CLIs over L-smooth and µ-strongly convex functions, whose coefficients are allowed to depend on µ and L, is

  Ω̃( √κ ln(1/ǫ) ),  µ > 0,    (7)
  Ω̃( √(L/ǫ) ),  µ = 0.

Note that, in addition to being dimension-independent (similarly to (3)), this lower bound holds regardless of p. We further stress that the algorithms discussed earlier which attain the lower bound stated in (3) are not oblivious and require more knowledge of the objective function.

In the paper, we also demonstrate other cases where the side-information available to the algorithm crucially affects its performance, such as knowing vs. not knowing the strong convexity parameter.

Finally, we remark that this approach of modeling the structure of optimization algorithms, as opposed to the more traditional oracle model, can also be found in [16, 8, 5, 4]. However, whereas these works are concerned with upper bounds on the iteration complexity, in this paper we primarily focus on lower bounds.

To summarize, our main contributions are the following:

• In Section 2.1, we propose a novel framework which substantially generalizes the framework introduced in [1], and includes a large part of modern first-order optimization algorithms.

• In Section 2.2, we identify within this framework the class of oblivious optimization algorithms, whose step sizes are scheduled regardless of the function at hand, and provide an iteration complexity lower bound as given in (7). We improve upon [1] by establishing lower bounds which hold both for smooth functions and for smooth and strongly convex functions, using simpler and shorter proofs. Moreover, in addition to being dimension-independent, the lower bounds we derive here are tight. In the context of machine learning optimization problems, the same lower bound is shown to hold on the bias of methods for finite sums with a regularization term, such as SAG, SAGA, SDCA without duality and SVRG.
• Some oblivious algorithms for L-smooth and µ-strongly convex functions admit a linear convergence rate using step sizes which are scheduled regardless of the strong convexity parameter (e.g., standard GD with a step size of 1/L; see Section 3 in [17] and Section 5 in [3]). In Section 4.1, we show that adapting to 'hidden' strong convexity, without explicitly incorporating the strong convexity parameter, results in an inferior iteration complexity of

  Ω̃( κ ln(1/ǫ) ).    (8)

This result sheds some light on a major issue regarding the scheduling of step sizes of optimization algorithms.

• In Section 4.2, we discuss the class of stationary optimization algorithms, which use time-invariant step sizes, over L-smooth functions, and show that they admit a tight iteration complexity of

  Ω( L/ǫ ).    (9)

In particular, this bound implies that in terms of dependency on the accuracy parameter ǫ, SAG and SAGA admit an optimal iteration complexity w.r.t. the class of stochastic stationary p-CLIs. Acceleration schemes, such as [6, 10], are able to break this bound by re-scheduling these algorithms in a non-stationary (though oblivious) way.

In the sequel we present our framework for analyzing first-order optimization algorithms. We begin by providing a precise definition of a class of optimization problems, accompanied by some side-information. We then formally define the framework of p-CLI algorithms and the corresponding iteration complexity.

Definition 1 (Class of Optimization Problems). A class of optimization problems C is an ordered pair (F, I), where F is a family of functions defined over the same domain, and I : F → I is a mapping which provides for each f ∈ F the corresponding side-information element in some set I. The domain of the functions in F is denoted by dom(C).

For example, let us consider quadratic functions of the form

  x ↦ ½ xᵀQx + qᵀx,

where Q ∈ R^{d×d} is a positive semidefinite matrix whose spectrum lies in Σ ⊆ R₊, and q ∈ R^d.
Here, each instance may be accompanied with either a complete specification of Σ; lower and upper bounds for Σ; just an upper bound for Σ; a rough approximation of Q⁻¹ (e.g., via sketching techniques); etc. We will see that the exact nature of the side-information strongly affects the iteration complexity, and that this differentiation between the family of functions under consideration and the type of side-information is not mere pedantry, but a crucial necessity.

We now turn to rigorously define first-order p-CLI optimization algorithms. The basic formulation shown in (4) does not allow generating more than one iterate at a time. The framework which we present below relaxes this restriction to allow a greater generality, which is crucial for incorporating optimization algorithms for finite sums (see Stochastic p-CLIs in Section 2.2). We further extend (4) to allow non-differentiable functions and constraints into this framework, by generalizing gradients to sub-gradients.

Definition 2 (First-order p-CLI). An optimization algorithm is called a first-order p-Canonical Linear Iterative (p-CLI) optimization algorithm over a class of optimization problems C = (F, I(·)), if given an instance f ∈ F and an arbitrary set of p initialization points x₁, …, x_p ∈ dom(C), it operates by iteratively generating points for which

  x_i^(k+1) ∈ Σ_{j=1}^p ( A_ij^(k) ∂f + B_ij^(k) )( x_j^(k) ),  k = 0, 1, 2, …,    (10)

holds, where the coefficients A_ij^(k), B_ij^(k) are some linear operators which may depend on I(f).

Formally, the expression A_ij^(k) ∂f in (10) denotes the composition of A_ij^(k) and the sub-gradient operator. Likewise, the r.h.s. of (10) is to be understood as an evaluation of the sum of the two operators A_ij^(k) ∂f and B_ij^(k) at x_j^(k).
At this level of generality, the framework encompasses very different kinds of optimization algorithms. We shall see that various assumptions regarding the coefficients' complexity and the side-information yield different lower bounds on the iteration complexity.

We note that although this framework concerns algorithms whose update rules are based on a fixed number of points, a large part of the results shown in this paper holds in the case where p grows indefinitely in accordance with the number of iterations.

We now turn to provide a formal definition of iteration complexity. We assume that the point returned after k iterations is x_p^(k). This assumption merely serves as a convention and is not necessary for our bounds to hold.

Definition 3 (Iteration Complexity). The iteration complexity IC(ǫ) of a given p-CLI w.r.t. a given problem class C = (F, I) is defined to be the minimal number of iterations K such that

  f( E x_p^(k) ) − min_{x ∈ dom(C)} f(x) < ǫ,  ∀ k ≥ K,

uniformly over F, where the expectation is taken over all the randomness introduced into the optimization process (see Stochastic p-CLIs below).

For simplicity, when stating bounds in this paper, we shall omit the dependency of the iteration complexity on the initialization points. The precise dependency can be found in the corresponding proofs.

Types of p-CLIs and Scope of Work

As mentioned before, we cannot say much about the framework in its full generality. In this paper, we restrict our attention to the following three (partially overlapping) classes of p-CLIs:

Stationary p-CLIs, where the coefficients are allowed to depend exclusively on side-information (see Definition 3). In particular, the coefficients are not allowed to change with time. Seemingly restrictive, this class of p-CLIs subsumes many efficient optimization methods, especially when coupled with stochasticity (see below). Notable stationary p-CLIs are: GD with a fixed step size [13], stationary AGD [13] and the heavy-ball method [16].
Oblivious p-CLIs, where the coefficients are allowed to depend on side-information, as well as to change over time. Notable algorithms here are GD and AGD with step sizes which are scheduled irrespectively of the function under consideration [13], and the sub-gradient descent method (e.g., [20]).

Stochastic p-CLIs, where (10) holds with respect to E x_j^(k), that is,

  E x_i^(k+1) ∈ Σ_{j=1}^p ( A_ij^(k) ∂f + B_ij^(k) )( E x_j^(k) ).    (11)

Stochasticity is an efficient machinery for tackling optimization problems where forming the gradient is prohibitive, but engineering an efficient unbiased estimator is possible. Such situations occur frequently in the context of machine learning, where one is interested in minimizing finite sums of a large number of convex functions,

  min_{x ∈ R^d} F(x) := Σ_{i=1}^m f_i(x),
in which case forming a sub-gradient of F at a given point may be too expensive. Notable optimization algorithms for variants of this problem are SAG, SDCA without duality, SVRG and SAGA, all of which are stationary stochastic p-CLIs. Moreover, as opposed to algorithms which produce only one new point at each iteration (e.g., (4)), these algorithms sometimes update a few points at the same time. To illustrate this, let us express SAG as a stochastic stationary (m+1)-CLI. In order to avoid the computationally demanding task of forming the exact gradient of F at each iteration, SAG uses the first m points to store estimates of the gradients of the individual functions,

  y_i ≈ ∇f_i( x_{m+1}^(k) ),  i = 1, …, m.

At each iteration, SAG sets y_i = ∇f_i( x_{m+1}^(k) ) for some randomly chosen i ∈ [m], and then updates x_{m+1}^(k) accordingly, by making a gradient step with a fixed step size using the new estimate for ∇F( x_{m+1}^(k) ). This implies that the expected update rule of SAG is stationary and satisfies (11).

As opposed to an oblivious schedule of step sizes, many optimization algorithms set the step sizes according to the first-order information accumulated during the optimization process. A well-known example of such a non-oblivious schedule is conjugate gradient descent, whose update rule can be expressed as follows:

  x₁^(k+1) = x₂^(k),
  x₂^(k+1) = ( α ∂f + (1 + β) I ) x₂^(k) − β x₁^(k),    (12)

where the step sizes are chosen so as to minimize f( x₂^(k+1) ) over α, β ∈ R. Other algorithms employ coefficients whose schedule does not depend directly on first-order information. For example, at each iteration coordinate descent updates one coordinate of the current iterate, by completely minimizing the function at hand along some direction. In our formulation, such update rules are expressed using coefficients which are diagonal matrices.
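The SAG bookkeeping described above can be sketched in a few lines. This is our own one-dimensional illustration, not the authors' code; the component functions, step size and iteration budget are all assumptions made for the example:

```python
import random

# Sketch (ours): SAG as a stochastic stationary (m+1)-CLI in one dimension,
# on F(x) = sum_i f_i(x) with f_i(x) = 0.5 * h_i * x^2 + h_i * x, so that the
# common minimizer is x* = -1. The h_i and the step size are illustrative.

def sag(h, step, iters, seed=0):
    rng = random.Random(seed)
    m = len(h)
    y = [0.0] * m            # the first m "points": stored gradient estimates
    g_sum = 0.0              # running sum of the y_i
    x = 0.0                  # the (m+1)-st point: the actual iterate
    for _ in range(iters):
        i = rng.randrange(m)
        g_new = h[i] * x + h[i]        # refresh one stored gradient: f_i'(x)
        g_sum += g_new - y[i]
        y[i] = g_new
        x -= step * g_sum / m          # fixed step on the averaged estimate
    return x

x = sag(h=[1.0, 2.0, 3.0], step=0.02, iters=50000)
```

Both the step size and the update pattern are time-invariant, so the expected update satisfies the stationarity condition (11).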
In a sense, the most expensive coefficients used in practice are the ones employed by Newton's method, which in this framework may be expressed as follows:

  x₁^(k+1) = ( I − (∇²f)⁻¹ ∇f ) x₁^(k).    (13)

The algorithms mentioned above (conjugate gradient descent, coordinate descent and Newton's method), as well as other non-oblivious p-CLI optimization algorithms, such as quasi-Newton methods (e.g., [15]) and the ellipsoid method (e.g., [2]), will not be further considered in this paper.

Oblivious p-CLIs

Having formally defined the framework, we are now in a position to state our first main result. Perhaps the most common side-information used by practical algorithms consists of the strong convexity and smoothness parameters of the objective function. Oblivious p-CLIs with such side-information tend to have low per-iteration cost and a straightforward implementation. However, this lack of adaptivity to the function being optimized results in an inevitable lower bound on the iteration complexity:

Theorem 1.
Suppose the smoothness parameter L and the strong convexity parameter µ are known, i.e., I(·) = {L, µ}. Then the iteration complexity of any oblivious, possibly stochastic, p-CLI optimization algorithm is bounded from below by

  Ω̃( √κ ln(1/ǫ) ),  µ > 0,    (14)
  Ω( √(L/ǫ) ),  µ = 0,

where κ := L/µ.
As discussed in the introduction, Theorem 1 significantly improves upon the lower bound obtained in [1] in three major aspects:

• It holds both for smooth functions and for smooth and strongly convex functions.

• In both the strongly convex and non-strongly convex cases, the bounds we derive are tight for all p (note that if the coefficients are scalars and time-invariant, then for smooth and strongly convex functions a better lower bound of Ω̃(κ ln(1/ǫ)) holds; see Theorem 8 in [1]).

• It considers a much wider class of algorithms, namely, methods which may use a different step size at each iteration and may freely update each of the p points.

We stress again that, in contrast to (1), this lower bound does not scale with the dimension of the problem.

The proof of Theorem 1, including the logarithmic factors and constants which appear in the lower bound, is found in Appendix A.1, and can be roughly sketched as follows. First, we consider L-smooth and µ-strongly convex quadratic functions of the form

  x ↦ ½ xᵀ diag(η) x + ηᵀ x,  η ∈ [µ, L]^d,

over R^d, all of which share the same minimizer,

  x* = −diag⁻¹(η) η = −(1, …, 1)ᵀ.

Next, we observe that each iteration of a p-CLI involves an application of A∂f + B, a linear expression in ∂f whose coefficients are some linear operators, to the current points x_j^(k), j = 1, …, p, which are then summed up to form the next iterate. Applying this argument inductively, and setting the initialization points to be zero, we see that the point returned by the algorithm at the k'th iteration can be expressed as follows:

  x_p^(k) = ( s₁(η₁) η₁, …, s_d(η_d) η_d )ᵀ,

where the s_i(η) are real polynomials of degree k − 1. Here, the fact that the coefficients are scheduled obliviously, i.e., do not depend on the very choice of η, is crucial (when analyzing other types of p-CLIs, one may encounter cases where the coefficients of s(η) are not constants, in which case the resulting expression may not be a polynomial).
Bearing in mind that our goal is to bound the distance to the minimizer −(1, …, 1)ᵀ (which, in this case, is equivalent to the iteration complexity up to logarithmic factors), we are thus led to ask how small |s(η)η + 1| can be. Formally, we aim to bound

  max_{η ∈ [µ,L]} | s(η)η + 1 |

from below. To this end, we use the properties of the well-known Chebyshev polynomials, by which we derive the following lower bound:

  min_{s(η) ∈ R[η], deg(s) = k−1}  max_{η ∈ [µ,L]} | s(η)η + 1 |  ≥  ( (√κ − 1)/(√κ + 1) )^k.

Both the classical oracle-model lower bounds and our analysis of p-CLIs exploit the idea that, when applied to some strongly convex quadratic function ½xᵀQx + qᵀx over R^d, the k'th iterate can be expressed as s(Q)q for some real polynomial s(η) ∈ R[η] of degree at most k − 1. Bounding the iteration complexity is then essentially reduced to the question of how well one can approximate Q⁻¹ using such polynomials. However, the approach here uses a fundamentally different technique for achieving this: whereas the oracle model does not impose any restrictions on the coefficients of s(η), the framework of p-CLIs allows us to effectively control the way these coefficients are produced. The excessive freedom in choosing s(η) constitutes a major weakness of the oracle model and prevents obtaining iteration complexity bounds significantly larger than the dimension d. To see why, note that by the Cayley-Hamilton theorem, there exists a real polynomial s(η) of degree at most d − 1 such that s(Q) = −Q⁻¹. Therefore, the d'th iterate can potentially be s(Q)q = −Q⁻¹q, the exact minimizer. We avoid this limited applicability of the oracle model by adopting a more structural approach, which allows us to restrict the kind of polynomials which can be produced by practical optimization algorithms.
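The Cayley-Hamilton point is easy to check numerically. Here is a small sketch (ours; the particular matrix and vector are arbitrary) for d = 2, where the characteristic polynomial yields a degree-1 polynomial s with s(Q) = −Q⁻¹:

```python
# Sketch (ours): for d = 2, Cayley-Hamilton gives Q^2 - tr(Q) Q + det(Q) I = 0,
# hence s(eta) = (eta - tr(Q)) / det(Q) is a degree-(d-1) polynomial with
# s(Q) = -Q^{-1}, so s(Q) q is exactly the minimizer of 0.5 x^T Q x + q^T x.

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

Q = [[3.0, 1.0], [1.0, 2.0]]          # positive definite, d = 2
q = [1.0, -1.0]
tr = Q[0][0] + Q[1][1]
det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]

s_of_Q_q = [(matvec(Q, q)[i] - tr * q[i]) / det for i in range(2)]   # s(Q) q

# Verify: the gradient Q x + q vanishes at x = s(Q) q, so it is the minimizer.
residual = [matvec(Q, s_of_Q_q)[i] + q[i] for i in range(2)]
```

So an oracle-model algorithm that could pick the coefficients of s freely would reach the exact minimizer in d iterations, which is exactly the freedom the structural approach rules out.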
Furthermore, our framework is more flexible in the sense that the coefficients of s(η) may be formed by optimization algorithms which do not necessarily fall into the category of first-order algorithms, e.g., coordinate descent.

It is instructive to contrast our approach with another structural approach for deriving lower bounds, which was proposed by [13]. Nesterov [13] considerably simplifies the technique employed by Nemirovsky and Yudin [11], at the cost of introducing an additional assumption regarding the way new iterates are generated. Specifically, it is assumed that each new iterate lies in the span of all the gradients acquired earlier. Similarly to [11], this approach does not yield dimension-independent lower bounds. Moreover, such an approach may break in the presence of conditioning mechanisms (which, essentially, aim to handle poorly-conditioned functions by multiplying the corresponding gradients by some matrix). In our framework, such conditioning is handled through non-scalar coefficients. Thus, as long as the conditioning matrices depend solely on µ and L, our lower bounds remain valid.

Below we discuss the effect of not knowing the strong convexity parameter exactly on the iteration complexity of oblivious p-CLIs. In particular, we show that the ability of oblivious p-CLIs to obtain an iteration complexity which scales like √κ crucially depends on the quality of the strong convexity estimate of the function under consideration. Moreover, we show that stationary p-CLIs are strictly weaker than general oblivious p-CLIs for smooth non-strongly convex functions, in the sense that stationary p-CLIs cannot obtain an iteration complexity of O(√(L/ǫ)).

The fact that decreasing the amount of side-information increases the iteration complexity is best demonstrated by a family of quadratic functions which we already discussed, namely,

  x ↦ ½ xᵀQx + qᵀx,

where Q ∈ R^{d×d} is positive semidefinite with spectrum in Σ ⊆ R₊, and q ∈ R^d.
In Theorem 8 of [1], it is shown that if Q is given in advance, but q is unknown, then the iteration complexity of stationary p-CLIs which follow (4) is

  Ω̃( κ^{1/p} ln(1/ǫ) ).

It is further shown that this lower bound is tight (see Appendix A in [1]). In Theorem 1 we show that if both the smoothness and the strong convexity parameters {µ, L} are known, then the corresponding lower bound for this kind of algorithms is

  Ω̃( √κ ln(1/ǫ) ).

As mentioned earlier, this lower bound is tight and is attained by a stationary version of AGD.

However, what if only the smoothness parameter L is known a priori? The following theorem shows that in this case the iteration complexity is substantially worse. For reasons which will become clear later, it will be convenient to denote the strong convexity parameter and the condition number of a given function f by µ(f) and κ(f), respectively.

Theorem 2.
Suppose that only the smoothness parameter L is known, i.e., I(·) = {L}. If the iteration complexity of a given oblivious, possibly stochastic, p-CLI optimization algorithm is

  Õ( κ(f)^α ln(1/ǫ) ),    (15)

then α ≥ 1.

Theorem 2 pertains to the important issue of optimal schedules for step sizes. Concretely, it implies that, in the absence of the strong convexity parameter, one is still able to schedule the step sizes according to the smoothness parameter so as to obtain an exponential convergence rate, but only to the limited extent of a linear dependency on the condition number (as mentioned before, this sub-optimality in terms of dependence on the condition number can also be found in [17] and [3]). This bound is tight and is attained by standard gradient descent (GD).

Theorem 2 also emphasizes the superiority of standard GD in cases where the true strong convexity parameter is poorly estimated. Such situations may occur when one underestimates the true strong convexity parameter, for instance by using only the strong convexity introduced by an explicit regularization term. Specifically, if µ̂ denotes our estimate for the true strong convexity parameter µ (where µ̂ ≤ µ to ensure convergence), then Theorem 1 already implies that, for a fixed accuracy level, the worst-case iteration complexity of our algorithm is on the order of √(L/µ̂), whereas standard GD with 1/L step sizes has an iteration complexity on the order of L/µ. Thus, if our estimate is too conservative, i.e., µ̂ < µ²/L, then the iteration complexity of GD is µ/√(Lµ̂) ≥ 1 times better. Theorem 2 further strengthens this statement, by indicating that if our estimate does not depend on the true strong convexity parameter, then the iteration complexity of GD is even more favorable, by a factor of µ/µ̂ ≥ 1, compared to our algorithm.

The proof of Theorem 2, which appears in Appendix A.2, is again based on a reduction to an approximation problem via polynomials.
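The comparison above is easy to put in numbers. The following sketch (ours; the specific values of L, µ and µ̂ are arbitrary) contrasts the two iteration-count estimates, ignoring constants and logarithmic factors, and also simulates GD with step 1/L to confirm the (1 − µ/L) per-step contraction behind the κ ln(1/ǫ) rate:

```python
import math

# Sketch (ours): comparing iteration counts, up to constants and logs, for GD
# with step 1/L versus AGD tuned with an underestimate mu_hat of mu. We also
# run GD on f(x) = 0.5 * mu * x^2 to confirm the (1 - mu/L) contraction.

L, mu = 1.0, 1e-2                        # true condition number kappa = 100
mu_hat = 1e-6                            # conservative estimate: mu_hat < mu**2 / L

agd_estimate = math.sqrt(L / mu_hat)     # ~ sqrt(L/mu_hat) = 1000 per accuracy target
gd_estimate = L / mu                     # ~ kappa = 100 per accuracy target
advantage = agd_estimate / gd_estimate   # equals mu / sqrt(L * mu_hat) = 10

x = 1.0
for _ in range(100):                     # kappa GD steps with step size 1/L
    x -= mu * x / L
# the error has shrunk by (1 - 1/kappa)^kappa, roughly a single factor of e
```

This is only a back-of-the-envelope count, but it matches the µ/√(Lµ̂) factor in the discussion above.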
In contrast to the proof of Theorem 1, which employs Chebyshev polynomials, here only elementary algebraic manipulations are needed.

Another implication of Theorem 2 is that the coefficients of optimal stationary p-CLIs for smooth and strongly convex functions must have an explicit dependence on the strong convexity parameter. In the next section we shall see that this fact is also responsible for the inability of stationary p-CLIs to obtain a rate of O(√(L/ǫ)) for L-smooth convex functions.

4.2 No Acceleration for Stationary Algorithms over Smooth Convex Functions

Below, we prove that, as opposed to oblivious p-CLIs, stationary p-CLIs (namely, p-CLIs with time-invariant coefficients) over L-smooth convex functions can obtain an iteration complexity no better than O(L/ǫ). An interesting implication of this is that some current methods for minimizing finite sums of functions, such as SAG and SAGA (which are in fact stationary p-CLIs), cannot be optimal in this setting, and that time-changing coefficients are essential to obtain optimal rates. This further motivates the use of current acceleration schemes (e.g., [6, 10]), which turn a given stationary algorithm into a non-stationary oblivious one.

The proof of this result is based on a reduction from the class of p-CLIs over L-smooth convex functions to p-CLIs over L-smooth and µ-strongly convex functions, where the strong convexity parameter is given explicitly. This reduction allows us to apply the lower bound of Theorem 2 to p-CLIs designed for smooth non-strongly convex functions.

We now turn to describe the reduction in detail.
In his seminal paper, Nesterov [12] presents the AGD algorithm and shows that it obtains a convergence rate of

  f(x_k) − f(x*) ≤ 2L‖x₀ − x*‖² / (k + 2)²    (16)

for L-smooth convex functions which admit at least one minimizer (accordingly, throughout the rest of this section we shall assume that the functions under consideration admit at least one minimizer, i.e., argmin(f) ≠ ∅). In addition, Nesterov proposes a restarting scheme for this algorithm which, assuming the strong convexity parameter is known, allows one to obtain an iteration complexity of Õ(√κ ln(1/ǫ)). Scheme 4.2 shown below forms a simple generalization of the scheme discussed in that paper, and allows one to explicitly introduce a strong convexity parameter into the dynamics of (not necessarily oblivious) p-CLIs over L-smooth convex functions.

Scheme 4.2 (Restarting Scheme)

Parameters:
  Smoothness parameter L > 0
  Strong convexity parameter µ > 0
  Convergence parameters α > 0, C > 0
Given: a p-CLI P over L-smooth functions with
  f(x_k) − f* ≤ CL‖x̄ − x*‖² / k^α
for any initialization vector x̄.
Iterate: for t = 1, 2, …
  Restart the step size schedule of P
  Initialize P at x̄
  Run P for (4CL/µ)^{1/α} iterations
  Set x̄ to be the last iterate of this execution
End

The following lemma provides an upper bound on the iteration complexity of p-CLIs obtained through Scheme 4.2.

Lemma 1.
The convergence rate of a p-CLI algorithm obtained by applying Scheme 4.2 with the corresponding set of parameters $L, \mu, C, \alpha$ is $\tilde{O}(\sqrt[\alpha]{\kappa}\ln(1/\epsilon))$, where $\kappa = L/\mu$ denotes the condition number.

Proof  Suppose $\mathcal{P}$ is a p-CLI as stated in Scheme 4.2, and let $f$ be an $L$-smooth and $\mu$-strongly convex function. Each external iteration of the scheme involves running $\mathcal{P}$ for $k = \lceil (4CL/\mu)^{1/\alpha} \rceil$ iterations. Thus, for any point $\bar{x}$,
$$ f(x^{(k)}) - f^* \le \frac{CL\|\bar{x} - x^*\|^2}{\big((4CL/\mu)^{1/\alpha}\big)^\alpha} = \frac{\mu\|\bar{x} - x^*\|^2}{4}. $$
Also, $f$ is $\mu$-strongly convex, so $\|\bar{x} - x^*\|^2 \le \frac{2}{\mu}(f(\bar{x}) - f(x^*))$, and therefore
$$ f(x^{(k)}) - f^* \le \frac{\mu}{4} \cdot \frac{2(f(\bar{x}) - f(x^*))}{\mu} = \frac{f(\bar{x}) - f(x^*)}{2}. $$
That is, after each external iteration the sub-optimality in objective value is halved. Thus, after $T$ external iterations we get
$$ f\big(x^{(T\lceil(4C\kappa)^{1/\alpha}\rceil)}\big) - f^* \le \frac{f(\bar{x}_0) - f(x^*)}{2^T}, $$
where $\bar{x}_0$ denotes the initialization point. Hence, the iteration complexity of obtaining an $\epsilon$-optimal solution is
$$ O\left( \sqrt[\alpha]{4C\kappa}\, \log_2\!\left(\frac{f(\bar{x}_0) - f(x^*)}{\epsilon}\right) \right). $$

The stage is now set to prove the statement made at the beginning of this section. Let $\mathcal{P}$ be a stationary p-CLI over $L$-smooth functions with a convergence rate of $O(L/k^\alpha)$, and let $\mu \in (0, L)$ be the strong convexity parameter of the function to be optimized. We apply Scheme 4.2 to obtain a new p-CLI which, according to Lemma 1, admits an iteration complexity of $\tilde{O}(\sqrt[\alpha]{\kappa}\ln(1/\epsilon))$. But since $\mathcal{P}$ is stationary, the resulting p-CLI under Scheme 4.2 is again $\mathcal{P}$ (that is, stationary p-CLIs are invariant with respect to Scheme 4.2). Now, $\mathcal{P}$ is a p-CLI over smooth, non-strongly convex functions, and as such its coefficients do not depend on $\mu$. Therefore, by Theorem 2, we get that $\alpha \le 1$. Thus, we arrive at the following corollary:

Corollary 1.
If the iteration complexity of a given stationary p-CLI over $L$-smooth convex functions is $O(\sqrt[\alpha]{L/\epsilon})$, then $\alpha \le 1$. The lower bound above is tight, and is attained by standard gradient descent.
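Scheme 4.2 can be sketched in code. The snippet below (our own illustration, not from the paper; the quadratic objective, the plain gradient-descent inner solver, and all constants are our choices) uses the classical rate $f(x_k) - f^* \le (L/2)\|\bar{x} - x^*\|^2/k$ of gradient descent with step size $1/L$, which matches the scheme's template with $C = 1/2$ and $\alpha = 1$; each outer round should then at least halve the sub-optimality:

```python
import numpy as np

def gd(grad, x0, step, iters):
    # plain gradient descent: a stationary (time-invariant step size) method
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# L-smooth, mu-strongly convex quadratic f(x) = 0.5 x^T diag(h) x, with f* = 0
L, mu = 10.0, 0.5
h = np.linspace(mu, L, 20)          # eigenvalues in [mu, L]
f = lambda x: 0.5 * np.dot(h * x, x)
grad = lambda x: h * x              # minimizer x* = 0

# Scheme 4.2 with C = 1/2, alpha = 1 (GD satisfies f(x_k) - f* <= (L/2)||x0 - x*||^2 / k)
C, alpha = 0.5, 1.0
inner = int(np.ceil((4 * C * L / mu) ** (1 / alpha)))  # ceil((4*C*kappa)^(1/alpha)) inner iterations

x = np.ones(20)
vals = [f(x)]
for t in range(8):                   # outer (restart) rounds
    x = gd(grad, x, 1.0 / L, inner)  # re-initialize at the last iterate
    vals.append(f(x))

# each outer round at least halves the sub-optimality (here f* = 0)
assert all(vals[t + 1] <= 0.5 * vals[t] for t in range(8))
```

Note that for a stationary inner method such as plain gradient descent, "restarting the step size schedule" is a no-op, which is exactly the invariance exploited in the argument leading to Corollary 1: the halving here comes from the guarantee of the inner method itself, not from any acceleration effect.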
In this work, we propose the framework of first-order p-CLIs and show that it can be efficiently utilized to derive bounds on the iteration complexity of a wide class of optimization algorithms, namely oblivious, possibly stochastic, p-CLIs over smooth and strongly convex functions. We believe that these results are just the tip of the iceberg, and that the generality offered by this framework can be successfully instantiated for many other classes of algorithms. For example, it is straightforward to derive a lower bound of $\Omega(1/\epsilon^2)$ for 1-CLIs over 1-Lipschitz (possibly non-smooth) convex functions using the following set of functions:
$$ \left\{ \|x - c\| \;\middle|\; c \in \mathbb{R}^d \right\}. $$
How to derive lower bounds for other types of p-CLIs in the non-smooth setting is left to future work.

Acknowledgments:
This research is supported in part by an FP7 Marie Curie CIG grant, the Intel ICRI-CI Institute, and Israel Science Foundation grant 425/13. We thank Nati Srebro for several helpful discussions and insights.
References

[1] Yossi Arjevani, Shai Shalev-Shwartz, and Ohad Shamir. On lower and upper bounds for smooth and strongly convex optimization problems. arXiv preprint arXiv:1503.06833, 2015.
[2] Mikhail J. Atallah. Algorithms and Theory of Computation Handbook. CRC Press, 1998.
[3] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[4] Yoel Drori. Contributions to the Complexity Analysis of Optimization Algorithms. PhD thesis, Tel-Aviv University, 2014.
[5] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. arXiv preprint arXiv:1504.01577, 2015.
[6] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. arXiv preprint arXiv:1506.07512, 2015.
[7] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[8] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. arXiv preprint arXiv:1408.3595, 2014.
[9] A. Yu. Levin. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pages 1244–1247, 1965.
[10] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.
[11] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York, 1983.
[12] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[13] Yurii Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004.
[14] Donald J. Newman. Location of the maximum on unimodal surfaces. Journal of the ACM, 12(3):395–398, 1965.
[15] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[16] Boris T. Polyak. Introduction to Optimization. Optimization Software, New York, 1987.
[17] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
[18] Shai Shalev-Shwartz. SDCA without duality. arXiv preprint arXiv:1502.06177, 2015.
[19] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.
[20] Naum Zuselevich Shor. Minimization Methods for Non-Differentiable Functions, volume 3. Springer Science & Business Media, 2012.
A Proofs
A.1 Proof for Theorem 1
Let us apply the given oblivious p-CLI algorithm to a quadratic function of the form
$$ f: \mathbb{R}^d \to \mathbb{R}, \quad x \mapsto \tfrac{1}{2} x^\top Q x + q^\top x, $$
where $Q = \operatorname{diag}(\eta, \dots, \eta)$ and $q = -v\eta$ for some $\eta \in [\mu, L]$ and $v \neq 0 \in \mathbb{R}^d$. In particular, the norm of the unique minimizer is $\|x^*\| = \|-Q^{-1}q\| = \|v\|$. We set the initialization points to zero, i.e., $\mathbb{E}x^{(0)}_j = 0$ for $j = 1, \dots, p$, and denote the corresponding coefficient matrices by $A^{(k)}_{ij}, B^{(k)}_{ij} \in \mathbb{R}^{d\times d}$. The crux of the proof is that, as long as $\eta$ lies in $[\mu, L]$, the side-information $\{\mu, L\}$ remains consistent, and therefore the coefficients remain unchanged.

First, we express $\mathbb{E}x^{(k+1)}_i$ in terms of $Q, q$ and $\mathbb{E}x^{(k)}_1, \dots, \mathbb{E}x^{(k)}_p \in \mathbb{R}^d$. By Definition 3 we have, for any $i \in [p]$,
$$ \mathbb{E}x^{(k+1)}_i = \sum_{j=1}^p \left( A^{(k)}_{ij}\partial f + B^{(k)}_{ij} \right)\!\big(\mathbb{E}x^{(k)}_j\big) = \sum_{j=1}^p \left( A^{(k)}_{ij}\big(Q\,\mathbb{E}x^{(k)}_j + q\big) + B^{(k)}_{ij}\mathbb{E}x^{(k)}_j \right) = \sum_{j=1}^p \big(A^{(k)}_{ij}Q + B^{(k)}_{ij}\big)\mathbb{E}x^{(k)}_j + \sum_{j=1}^p A^{(k)}_{ij}q. $$

Our next step is to reduce the problem of minimizing $f$ to a polynomial approximation problem. We claim that for any $k \ge 1$ and $i \in [p]$ there exist $d$ real polynomials $s_{k,i,1}(\eta), \dots, s_{k,i,d}(\eta)$ of degree at most $k-1$ such that
$$ \mathbb{E}x^{(k)}_i = \big(s_{k,i,*}(\eta)\big)\eta, \qquad (17) $$
where $s_{k,i,*}(\eta) := (s_{k,i,1}(\eta), \dots, s_{k,i,d}(\eta))^\top$. Let us prove this claim by mathematical induction. For $k = 1$ we have
$$ \mathbb{E}x^{(1)}_i = \sum_{j=1}^p \big(A^{(0)}_{ij}Q + B^{(0)}_{ij}\big)\mathbb{E}x^{(0)}_j + \sum_{j=1}^p A^{(0)}_{ij}q = -\sum_{j=1}^p A^{(0)}_{ij}v\,\eta, \qquad (18) $$
showing that the base case holds. For the induction step, assume the statement holds for some $k \ge 1$ with $s_{k,i,j}(\eta)$ as above. Then
$$ \mathbb{E}x^{(k+1)}_i = \sum_{j=1}^p \big(A^{(k)}_{ij}Q + B^{(k)}_{ij}\big)\mathbb{E}x^{(k)}_j + \sum_{j=1}^p A^{(k)}_{ij}q = \left( \sum_{j=1}^p \big(A^{(k)}_{ij}\operatorname{diag}(\eta,\dots,\eta) + B^{(k)}_{ij}\big)s_{k,j,*}(\eta) - \sum_{j=1}^p A^{(k)}_{ij}v \right)\eta. \qquad (19) $$
The expression inside the last parentheses is a vector with $d$ entries, each of which is a real polynomial of degree at most $k$. This concludes the induction step (note that the derivations of equalities (18) and (19) above are exactly where we use the fact that $A^{(k)}_{ij}$ and $B^{(k)}_{ij}$ have no functional dependency on $\eta$).

We are now ready to estimate the sub-optimality of $\mathbb{E}x^{(k)}_p$, the expected point returned by the algorithm at the $k$'th iteration. Setting $m \in \operatorname{argmax}_j |v_j|$, we have
$$ \left\| \mathbb{E}x^{(k)}_p - x^* \right\| = \big\|\big(s_{k,p,*}(\eta)\big)\eta + v\big\| \ge \big|s_{k,p,m}(\eta)\eta + v_m\big| = |v_m|\,\big|s_{k,p,m}(\eta)\eta/v_m + 1\big|. \qquad (20) $$
By Lemma 2 in Appendix B, there exists $\eta \in [\mu, L]$ such that
$$ \big|s_{k,p,m}(\eta)\eta/v_m + 1\big| \ge \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}, $$
where $\kappa = L/\mu$. Defining $Q$ and $q$ accordingly, and choosing, e.g., $v = R\,e_1$, where $R$ denotes a prescribed distance, yields
$$ \left\| \mathbb{E}x^{(k)}_p - x^* \right\| \ge R\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k}. $$
Using the fact that $f$ is $\mu$-strongly convex concludes the proof of the first part of the theorem.

For the smooth case we need to estimate $f(\mathbb{E}x^{(k)}_p) - f^*$. Let $\eta \in (0, L]$ and define $Q$ and $q$ accordingly. Inequality (20) yields
$$ f(\mathbb{E}x^{(k)}_p) - f(x^*) = \frac{1}{2}\big(\mathbb{E}x^{(k)}_p - x^*\big)^\top Q \big(\mathbb{E}x^{(k)}_p - x^*\big) \ge \frac{v_m^2}{2}\,\eta\,\big(s_{k,p,m}(\eta)\eta/v_m + 1\big)^2. \qquad (21) $$
Now, by Lemma 3 in Appendix B,
$$ \min_{s(\eta),\;\deg s \le k-1}\; \max_{\eta \in (0,L]}\; \eta\,\big(s(\eta)\eta + 1\big)^2 \ge \frac{L}{(2k+1)^2}. $$
Thus, choosing $v = R\,e_1$ concludes the proof.

A.2 Proof for Theorem 2

The proof of this theorem follows the exact reduction used in the proof of Theorem 1 (see Appendix A.1 above). The only difference is that here $\mu$ is allowed to be any real number in $(0, L)$. This consideration reduces our problem to yet another polynomial approximation problem.
For completeness, we provide the full proof here. Let us apply the given oblivious p-CLI algorithm to a quadratic function of the form
$$ f: \mathbb{R}^d \to \mathbb{R}, \quad x \mapsto \tfrac{1}{2} x^\top Q x + q^\top x, $$
where $Q = \operatorname{diag}(\eta, \dots, \eta)$ and $q = -v\eta$ for some $\eta \in (0, L)$ and $v \neq 0 \in \mathbb{R}^d$. In particular, the norm of the unique minimizer is $\|x^*\| = \|-Q^{-1}q\| = \|v\|$. We set the initialization points to zero, i.e., $\mathbb{E}x^{(0)}_j = 0$ for $j = 1, \dots, p$, and denote the corresponding coefficient matrices by $A^{(k)}_{ij}, B^{(k)}_{ij} \in \mathbb{R}^{d\times d}$. The crux of the proof is that, as long as $\eta$ lies in $(0, L)$, the side-information $\{\mu, L\}$ remains consistent, and therefore the coefficients remain unchanged.

First, we express $\mathbb{E}x^{(k+1)}_i$ in terms of $Q, q$ and $\mathbb{E}x^{(k)}_1, \dots, \mathbb{E}x^{(k)}_p \in \mathbb{R}^d$. By Definition 3 we have, for any $i \in [p]$,
$$ \mathbb{E}x^{(k+1)}_i = \sum_{j=1}^p \left( A^{(k)}_{ij}\partial f + B^{(k)}_{ij} \right)\!\big(\mathbb{E}x^{(k)}_j\big) = \sum_{j=1}^p \left( A^{(k)}_{ij}\big(Q\,\mathbb{E}x^{(k)}_j + q\big) + B^{(k)}_{ij}\mathbb{E}x^{(k)}_j \right) = \sum_{j=1}^p \big(A^{(k)}_{ij}Q + B^{(k)}_{ij}\big)\mathbb{E}x^{(k)}_j + \sum_{j=1}^p A^{(k)}_{ij}q. $$

Our next step is to reduce the problem of minimizing $f$ to a polynomial approximation problem. We claim that for any $k \ge 1$ and $i \in [p]$ there exist $d$ real polynomials $s_{k,i,1}(\eta), \dots, s_{k,i,d}(\eta)$ of degree at most $k-1$ such that
$$ \mathbb{E}x^{(k)}_i = \big(s_{k,i,*}(\eta)\big)\eta, \qquad (22) $$
where $s_{k,i,*}(\eta) := (s_{k,i,1}(\eta), \dots, s_{k,i,d}(\eta))^\top$. Let us prove this claim by mathematical induction. For $k = 1$ we have
$$ \mathbb{E}x^{(1)}_i = \sum_{j=1}^p \big(A^{(0)}_{ij}Q + B^{(0)}_{ij}\big)\mathbb{E}x^{(0)}_j + \sum_{j=1}^p A^{(0)}_{ij}q = -\sum_{j=1}^p A^{(0)}_{ij}v\,\eta, \qquad (23) $$
showing that the base case holds. For the induction step, assume the statement holds for some $k \ge 1$ with $s_{k,i,j}(\eta)$ as above. Then
$$ \mathbb{E}x^{(k+1)}_i = \sum_{j=1}^p \big(A^{(k)}_{ij}Q + B^{(k)}_{ij}\big)\mathbb{E}x^{(k)}_j + \sum_{j=1}^p A^{(k)}_{ij}q = \left( \sum_{j=1}^p \big(A^{(k)}_{ij}\operatorname{diag}(\eta,\dots,\eta) + B^{(k)}_{ij}\big)s_{k,j,*}(\eta) - \sum_{j=1}^p A^{(k)}_{ij}v \right)\eta. \qquad (24) $$
The expression inside the last parentheses is a vector with $d$ entries, each of which is a real polynomial of degree at most $k$. This concludes the induction step (note that the derivations of equalities (23) and (24) above are exactly where we use the fact that $A^{(k)}_{ij}$ and $B^{(k)}_{ij}$ have no functional dependency on $\eta$).

We are now ready to estimate the sub-optimality of $\mathbb{E}x^{(k)}_p$, the point returned by the algorithm at the $k$'th iteration. Setting $m \in \operatorname{argmax}_j |v_j|$, we have
$$ \left\| \mathbb{E}x^{(k)}_p - x^* \right\| = \big\|\big(s_{k,p,*}(\eta)\big)\eta + v\big\| \ge \big|s_{k,p,m}(\eta)\eta + v_m\big| = |v_m|\,\big|s_{k,p,m}(\eta)\eta/v_m + 1\big|. \qquad (25) $$
By Lemma 4, there exists $\eta \in (L/2, L)$ such that
$$ \big|s_{k,p,m}(\eta)\eta/v_m + 1\big| \ge (1 - \eta/L)^{k+1}. \qquad (26) $$
Defining $Q$ and $q$ accordingly, and choosing, e.g., $v = R\,e_1$, where $R$ denotes a prescribed distance, yields
$$ \left\| \mathbb{E}x^{(k)}_p - x^* \right\| \ge R\,(1 - \eta/L)^{k+1}. $$
Using the fact that $f$ is $L/2$-strongly convex concludes the proof.

B Technical Lemmas
Below, we provide three lemmas which are used to bound from below the quantity $|s(\eta)\eta + 1|$ over different domains of $\eta$, where $s(\eta)$ is a real polynomial. For brevity, we denote the set of real polynomials of degree at most $k$ by $P_k$.

Lemma 2. Let $s(\eta) \in P_k$, and let $0 < \mu < L$. Then
$$ \max_{\eta \in [\mu, L]} |s(\eta)\eta + 1| \ge \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k+1}, $$
where $\kappa := L/\mu$.

Proof
Denote
$$ q(\eta) := T_{k+1}^{-1}\!\left(\frac{L+\mu}{L-\mu}\right) T_{k+1}\!\left(\frac{2\eta - \mu - L}{L-\mu}\right), $$
where $T_k(\eta)$ denotes the Chebyshev polynomial of degree $k$,
$$ T_k(\eta) = \begin{cases} \cos(k \arccos(\eta)) & |\eta| \le 1 \\ \cosh(k \operatorname{arcosh}(\eta)) & \eta \ge 1 \\ (-1)^k \cosh(k \operatorname{arcosh}(-\eta)) & \eta \le -1. \end{cases} \qquad (27) $$
It follows that $|T_{k+1}(\eta)| \le 1$ for $\eta \in [-1, 1]$, and $T_{k+1}(\cos(j\pi/(k+1))) = (-1)^j$ for $j = 0, \dots, k+1$. Accordingly,
$$ |q(\eta)| \le T_{k+1}^{-1}\!\left(\frac{L+\mu}{L-\mu}\right), \quad \eta \in [\mu, L], $$
and
$$ q(\theta_j) = (-1)^j\, T_{k+1}^{-1}\!\left(\frac{L+\mu}{L-\mu}\right), \quad j = 0, \dots, k+1, $$
where $\theta_j = \frac{\cos(j\pi/(k+1))(L-\mu) + \mu + L}{2}$. Suppose, for the sake of contradiction, that
$$ \max_{\eta \in [\mu, L]} |s(\eta)\eta + 1| < \max_{\eta \in [\mu, L]} |q(\eta)|. $$
Then, for $r(\eta) = q(\eta) - (1 + s(\eta)\eta)$, we have $r(\theta_j) > 0$ for even $j$ and $r(\theta_j) < 0$ for odd $j$. Hence $r(\eta)$ has $k+1$ roots in $[\mu, L]$. But since $r(0) = 0$ and $\mu > 0$, it follows that $r(\eta)$ has at least $k+2$ roots, which contradicts the fact that the degree of $r(\eta)$ is at most $k+1$. Therefore,
$$ \max_{\eta \in [\mu, L]} |s(\eta)\eta + 1| \ge \max_{\eta \in [\mu, L]} |q(\eta)| = T_{k+1}^{-1}\!\left(\frac{\kappa+1}{\kappa-1}\right), \quad \kappa = L/\mu. $$
Since $(\kappa+1)/(\kappa-1) \ge 1$, we have by Equation (27),
$$ T_k\!\left(\frac{\kappa+1}{\kappa-1}\right) = \cosh\!\left(k \operatorname{arcosh}\!\left(\frac{\kappa+1}{\kappa-1}\right)\right) = \cosh\!\left(k \ln\!\left(\frac{\kappa+1}{\kappa-1} + \sqrt{\left(\frac{\kappa+1}{\kappa-1}\right)^2 - 1}\right)\right) = \cosh\!\left(k \ln\!\left(\frac{\kappa + 2\sqrt{\kappa} + 1}{\kappa - 1}\right)\right) $$
$$ = \cosh\!\left(k \ln\!\left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)\right) = \frac{1}{2}\left( \left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)^k + \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \right) \le \left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)^k. $$
Hence,
$$ \max_{\eta \in [\mu, L]} |s(\eta)\eta + 1| \ge \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k+1}. $$
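The closed form used at the end of the proof can be checked numerically (our own sanity check, not part of the paper; it uses NumPy's Chebyshev module and illustrative values of $\kappa$ and $k$): for $\eta \ge 1$, $T_k(\eta) = \cosh(k \operatorname{arcosh}(\eta))$, and at $\eta = (\kappa+1)/(\kappa-1)$ this equals $\tfrac{1}{2}(\rho^k + \rho^{-k})$ with $\rho = (\sqrt{\kappa}+1)/(\sqrt{\kappa}-1)$:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

kappa = 7.0
k = 6
eta = (kappa + 1) / (kappa - 1)
rho = (np.sqrt(kappa) + 1) / (np.sqrt(kappa) - 1)

# evaluate T_k via its Chebyshev-series representation: basis vector e_k
coeffs = np.zeros(k + 1)
coeffs[k] = 1.0
Tk = C.chebval(eta, coeffs)

assert np.isclose(Tk, np.cosh(k * np.arccosh(eta)))   # the eta >= 1 branch of (27)
assert np.isclose(Tk, 0.5 * (rho ** k + rho ** (-k)))  # the closed form in the proof
assert 1.0 / Tk >= (1.0 / rho) ** k                    # hence T_k^{-1} >= ((sqrt(kappa)-1)/(sqrt(kappa)+1))^k
```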
Lemma 3. Let $s(\eta) \in P_k$, and let $0 < L$. Then
$$ \max_{\eta \in [0, L]} \sqrt{\eta}\,|s(\eta)\eta + 1| \ge \frac{\sqrt{L}}{2k+3}. $$

Proof
First, we define
$$ q(\eta) = \begin{cases} \dfrac{(-1)^{k+1}}{2k+3}\sqrt{L/\eta}\; T_{2k+3}\big(\sqrt{\eta/L}\big) & \eta \neq 0 \\ 1 & \eta = 0, \end{cases} $$
where $T_k(\eta)$ is the $k$'th Chebyshev polynomial (see (27)). Let us show that $q(\eta)$ is a polynomial of degree $k+1$ in $\eta$ and that $q(0) = 1$. The trigonometric identity
$$ \cos\alpha + \cos\beta = 2\cos\!\left(\frac{\alpha-\beta}{2}\right)\cos\!\left(\frac{\alpha+\beta}{2}\right), $$
together with (27), yields the recurrence formula
$$ T_k(\eta) = 2\eta\, T_{k-1}(\eta) - T_{k-2}(\eta). $$
Noticing that $T_0(\eta) = 1$ and $T_1(\eta) = \eta$ (also by (27)), we can use mathematical induction to prove that Chebyshev polynomials of odd degree contain only odd powers, and that the coefficient of the first power $\eta$ in $T_{2k+3}(\eta)$ is indeed $(-1)^{k+1}(2k+3)$. Equivalently, we get that $q(\eta)$ is a polynomial of degree $k+1$ and that $q(0) = 1$. Next, note that for
$$ \theta_j = L\cos^2\!\left(\frac{j\pi}{2k+3}\right) \in [0, L], \quad j = 0, \dots, k+1, $$
we have
$$ \max_{\eta \in [0, L]} \eta^{1/2}|q(\eta)| = \theta_j^{1/2}|q(\theta_j)| = \frac{\sqrt{L}}{2k+3}. $$
Now, suppose, for the sake of contradiction, that
$$ \max_{\eta \in [0, L]} \eta^{1/2}|s(\eta)\eta + 1| < \max_{\eta \in [0, L]} \eta^{1/2}|q(\eta)|. $$
In particular, $\theta_j^{1/2}|s(\theta_j)\theta_j + 1| < \theta_j^{1/2}|q(\theta_j)|$, and since $\theta_j > 0$, we have $|s(\theta_j)\theta_j + 1| < |q(\theta_j)|$. We proceed in a similar way to the proof of Lemma 2. For $r(\eta) = q(\eta) - (1 + s(\eta)\eta)$, we have $r(\theta_j) > 0$ for even $j$ and $r(\theta_j) < 0$ for odd $j$. Hence $r(\eta)$ has $k+1$ roots in $[\theta_{k+1}, L]$. But since $r(0) = 0$ and $\theta_{k+1} > 0$, it follows that $r(\eta)$ has at least $k+2$ roots, which contradicts the fact that the degree of $r(\eta)$ is at most $k+1$. Therefore,
$$ \max_{\eta \in [0, L]} \eta^{1/2}|s(\eta)\eta + 1| \ge \max_{\eta \in [0, L]} \eta^{1/2}|q(\eta)| \ge \frac{\sqrt{L}}{2k+3}, $$
concluding the proof.
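The construction of $q$ in this proof can be verified computationally (our own check, with illustrative values of $k$ and $L$): the odd Chebyshev polynomial $T_{2k+3}$ has only odd powers with first-power coefficient $(-1)^{k+1}(2k+3)$, so $q$ is a degree-$(k+1)$ polynomial with $q(0) = 1$, and the attained maximum of $\sqrt{\eta}\,|q(\eta)|$ over $[0, L]$ is $\sqrt{L}/(2k+3)$:

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import polynomial as P

k = 3
L_ = 2.0
n = 2 * k + 3

# power-basis coefficients of T_n (an odd polynomial)
basis = np.zeros(n + 1)
basis[n] = 1.0
p = C.cheb2poly(basis)                         # p[m] = coefficient of x^m in T_n
assert np.allclose(p[0::2], 0.0)               # even powers vanish
assert np.isclose(p[1], (-1) ** (k + 1) * n)   # first-power coefficient is (-1)^{k+1}(2k+3)

# q(eta) = (-1)^{k+1}/n * sqrt(L/eta) * T_n(sqrt(eta/L)), as a polynomial in eta:
# substituting x = sqrt(eta/L) shows q has degree k+1, coefficients p[2j+1]/L^j (scaled)
q = np.array([(-1) ** (k + 1) / n * p[2 * j + 1] / L_ ** j for j in range(k + 2)])
assert np.isclose(q[0], 1.0)                   # q(0) = 1

# the maximum of sqrt(eta)*|q(eta)| over [0, L] equals sqrt(L)/(2k+3)
grid = np.linspace(0.0, L_, 20001)
vals = np.sqrt(grid) * np.abs(P.polyval(grid, q))
assert np.isclose(vals.max(), np.sqrt(L_) / n, rtol=1e-3)
```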
Lemma 4. Let $s(\eta) \in P_k$, and let $0 < L$. Then exactly one of the two following holds:
1. For any $\epsilon > 0$, there exists $\eta \in (L - \epsilon, L)$ such that $|s(\eta)\eta + 1| > (1 - \eta/L)^{k+1}$.
2. $s(\eta)\eta + 1 = (1 - \eta/L)^{k+1}$ identically.

Proof
It suffices to show that if (1) does not hold, then $s(\eta)\eta + 1 = (1 - \eta/L)^{k+1}$. Suppose that there exists $\epsilon > 0$ such that for all $\eta \in (L - \epsilon, L)$ it holds that
$$ |s(\eta)\eta + 1| \le \left(1 - \frac{\eta}{L}\right)^{k+1}. $$
Define
$$ q(\eta) := s\big(L(1-\eta)\big)\,L(1-\eta) + 1 \qquad (28) $$
and denote the corresponding coefficients by $q(\eta) = \sum_{j=0}^{k+1} q_j \eta^j$. We show by induction that $q_j = 0$ for all $j = 0, \dots, k$. For $j = 0$: since for any $\eta \in \big(0,\, 1 - (L-\epsilon)/L\big)$,
$$ |q(\eta)| \le \left(1 - \frac{L(1-\eta)}{L}\right)^{k+1} = \eta^{k+1}, $$
it holds that
$$ |q_0| = |q(0)| = \left|\lim_{\eta \to 0^+} q(\eta)\right| \le \lim_{\eta \to 0^+} \eta^{k+1} = 0. $$
Now, if $q_0 = \cdots = q_{m-1} = 0$ for some $m < k+1$, then
$$ |q_m| = \left|\lim_{\eta \to 0^+} \frac{q(\eta)}{\eta^m}\right| \le \lim_{\eta \to 0^+} \eta^{k+1-m} = 0, $$
proving the induction claim. This, in turn, implies that $q(\eta) = q_{k+1}\eta^{k+1}$. Now, by Equation (28), it follows that $q_{k+1} = q(1) = 1$. Hence $q(\eta) = \eta^{k+1}$. Lastly, using Equation (28) again yields
$$ s(\eta)\eta + 1 = q\!\left(1 - \frac{\eta}{L}\right) = \left(1 - \frac{\eta}{L}\right)^{k+1}, $$
as required.
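The equality case of Lemma 4 can be verified by direct polynomial arithmetic (our own check, with illustrative values of $k$ and $L$): take the polynomial $s$ defined by $s(\eta)\eta + 1 = (1 - \eta/L)^{k+1}$, i.e., $s(\eta) = \big((1-\eta/L)^{k+1} - 1\big)/\eta$, and confirm that $q(\eta) = s(L(1-\eta))\,L(1-\eta) + 1$ from (28) reduces to $\eta^{k+1}$:

```python
import numpy as np
from numpy.polynomial import polynomial as P

k = 4
L_ = 3.0

# coefficients (low to high) of (1 - eta/L)^{k+1}
base = P.polypow([1.0, -1.0 / L_], k + 1)

# s(eta) = ((1 - eta/L)^{k+1} - 1) / eta: drop the constant term, shift down one degree
s = base.copy()
s[0] -= 1.0
assert abs(s[0]) < 1e-12     # the constant term cancels exactly
s = s[1:]                    # divide by eta

# q(eta) = s(L(1 - eta)) * L(1 - eta) + 1, evaluated on a grid of eta values
grid = np.linspace(0.0, 1.0, 101)
u = L_ * (1.0 - grid)
q = P.polyval(u, s) * u + 1.0
assert np.allclose(q, grid ** (k + 1))   # q(eta) = eta^{k+1}, as the proof asserts
```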