DynaNewton - Accelerating Newton's Method for Machine Learning
Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann
Department of Computer Science, ETH Zurich, Switzerland
Abstract
Newton's method is a fundamental technique in optimization with quadratic convergence within a neighborhood around the optimum. However, reaching this neighborhood is often slow and dominates the computational costs. We exploit two properties specific to empirical risk minimization problems to accelerate Newton's method, namely, subsampling training data and increasing strong convexity through regularization. We propose a novel continuation method, where we define a family of objectives over increasing sample sizes and with decreasing regularization strength. Solutions on this path are tracked such that the minimizer of the previous objective is guaranteed to be within the quadratic convergence region of the next objective to be optimized. Thereby every Newton iteration is guaranteed to achieve super-linear contractions with regard to the chosen objective, which becomes a moving target. We provide a theoretical analysis that motivates our algorithm, called DYNANEWTON, and characterizes its speed of convergence. Experiments on a wide range of data sets and problems consistently confirm the predicted computational savings.
1 Introduction

In machine learning, we often fix a function class with parameters x ∈ R^d, define a non-negative family of loss functions φ_z, a regularizer Ω, and then aim to minimize a regularized sample loss over training data S,

    f^ν_S(x) := (1/|S|) Σ_{z∈S} φ^ν_z(x),   φ^ν_z(x) := φ_z(x) + ν Ω(x).   (1)

Justified by theories like Tikhonov regularization or structural risk minimization, we know that we can control the expected risk of the minimizers x*_ν of f^ν_S, i.e. avoid overfitting, by choosing the regularization strength ν appropriately.

In this paper, we focus on Newton's method for optimization, which obeys quadratic convergence towards an extremal point when initialized sufficiently close to such a point [20]. However, despite this unmatched speed of convergence, Newton's method has shortcomings for large-scale machine learning problems of the type described above: (i) Reaching the quadratically convergent regime through an initial damped phase [7] is where most of the computation is typically spent, unless one has access to an initial solution close enough to the optimum. (ii) Being a batch algorithm, each Newton iteration involves a complete pass over the entire data set to compute the required gradient and Hessian matrix. (iii) The computation of the Newton update requires solving a linear system of equations, which is a challenge, in particular if the data dimensionality is high.

The strategy presented in this paper is as follows. Following the ideas of [8], we present a systematic way to dynamically subsample the data so as to match up statistical and computational accuracy, leading to significant savings in the amount of overall computation. Moreover, employing this together with an adaptive control of the regularization strength, we define a continuation method [1], where we track solutions computed for problems with fewer data and with stronger regularization. We use previous solutions in order to compute the starting point for the next Newton iteration(s), possibly operating on a larger sample. An ideal sketch of the situation is shown in Figure 1. This is meant to address challenges (i) and (ii). Finally, in order to overcome challenge (iii), we also empirically investigate our strategy for a popular quasi-Newton method, namely BFGS, which computes approximations to the inverse Hessian through closed-form rank-one matrix updates.

Figure 1: Illustration of the continuation method approach proposed in this paper: Assume we choose ν ∝ 1/n. After optimizing the risk for a small subsample, we continuously increase the sample size (by a constant factor) and proportionally lower ν. We want to guarantee that each solution provides a starting point that is within the quadratic convergence region of the subsequent optimization. These regions are depicted by colored circles.
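To make the objective in Eq. (1) concrete, the following sketch spells out f^ν_S together with its gradient and Hessian for the ℓ2-regularized logistic loss used later in our experiments; the helper names are illustrative and Ω(x) = ½‖x‖² is assumed.

```python
import numpy as np

def logistic_risk(x, A, b, nu):
    """Regularized empirical risk f_S^nu(x) of Eq. (1) for the logistic loss.

    A: (n, d) matrix of feature vectors, b: (n,) labels in {-1, +1},
    nu: regularization strength, Omega(x) = 0.5 * ||x||^2.
    """
    margins = b * (A @ x)                        # per-sample margins
    loss = np.mean(np.log1p(np.exp(-margins)))   # (1/|S|) sum_z phi_z(x)
    return loss + 0.5 * nu * (x @ x)

def logistic_grad_hess(x, A, b, nu):
    """Gradient and Hessian of f_S^nu at x (as needed for a Newton step)."""
    n, d = A.shape
    p = 1.0 / (1.0 + np.exp(b * (A @ x)))        # sigmoid(-b * a^T x)
    grad = -(A.T @ (b * p)) / n + nu * x
    w = p * (1.0 - p)                            # per-sample curvature weights
    hess = (A.T * w) @ A / n + nu * np.eye(d)
    return grad, hess
```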
2 Related Work

Adaptive data sub-sampling. Recent results have shown that the finite sum structure of the empirical risk can be exploited to achieve linear convergence for strongly convex objectives [15, 10, 23]. This is accomplished by revisiting samples and storing their gradients in order to reduce the variance of future update directions. Although these methods achieve fast convergence on the empirical risk, they do not explicitly consider the expected risk, which has been separately studied in the literature on learning theory. It is usually analysed with the help of uniform convergence bounds that take the generic form [5]

    E_S [ sup_{x∈F} | f_S(x) − E_z φ_z(x) | ] ≤ H(n),   (2)

where the expectation is over a random sample S of size n. Here H is a bound that depends on n, usually through a ratio n/d, where d is the capacity of F (e.g. the VC dimension) [6]. (For ease of exposition, we assume that all regularized risks f^ν_S can be governed by the same function H.)

The recent work of [8] simultaneously exploits the properties of the empirical risk as well as the concentration bounds from learning theory to achieve fast convergence to the expected risk. This approach uses a dynamic sample size schedule that matches up optimization error and statistical accuracy. Although this approach was tailored specifically for variance-reduced stochastic gradient methods, we here show how a similar adaptive sample size strategy can be used in the context of non-stochastic approaches such as Newton's method.
Regularization paths and continuation methods. There is a rich body of literature on numerical continuation methods, a reference work being [1]. The basic idea is to define a family of objectives, in the simplest case with a single parameter t ∈ [0; T], and to optimize over a sequence of objectives f_t with increasing t, such that the sequence approaches some desired final f = f_T. The general motivation is that following the solution path x*_t = argmin_x f_t(x) may be computationally more efficient than optimizing the (typically) harder problem f directly. A prominent example is deterministic annealing [21], a deterministic variant of simulated annealing [16]. Here the family of objectives is parametrized by the computational analogue of temperature. Similar techniques known as graduated optimization have also been proposed in computer vision [4] and in machine learning, a recent example being [13].

In machine learning, the model complexity is typically controlled through a regularizer. In structural risk minimization, choosing a good regularizer plays an important role in the bias-variance trade-off for model selection [22]. Another virtue of the regularization factor is its influence on the performance of the optimization procedure, as can be seen through continuation methods or regularization path techniques [12, 3]. This is formally described by defining a family of objective functions f_t := f_{ν_t} with a decreasing sequence ν_t which progressively provides a better approximation of f. The typical goal pursued in this line of work is to combine optimization with model selection, although [12, Section 4.3] also reports computational savings. Our use of a continuation method is purely motivated by computational complexity and justified by a rigorous analysis of the quadratic convergence regime of Newton's method, as described in detail in the next section.

3 DynaNewton

3.1 Newton's Method

Assume that we have a µ-strongly convex function f : R^d → R, which we want to minimize over solutions x ∈ R^d. A Newton step defines the increment as

    Δx = −[∇²f(x)]^{−1} ∇f(x),   x ← x + Δx.   (3)

An equivalent way to define Newton increments, without the need to invert the Hessian, is implicitly as the solution of the linearized optimality condition

    ∇f(x + Δx) ≈ ∇f(x) + ∇²f(x) Δx = 0,   (4)

as can be verified by plugging in Eq. (3).

Newton's method converges to the optimal solution x* := argmin_x f(x) in a finite number of steps. The speed of convergence is characterized by two distinct phases that depend on the distance to x*. The first phase is a damped phase with slow convergence, while the second phase has quadratic convergence and is triggered when entering a region close to x*.

In order to formally characterize this region of quadratic convergence, an important quantity is the Newton decrement function [7] defined as

    λ_f(x) := sqrt( ∇f(x)ᵀ [∇²f(x)]^{−1} ∇f(x) ),   λ_f : R^d → R_{≥0}.   (5)

We will directly make use of this definition in conjunction with an additional requirement that we impose on f, namely that it is self-concordant [19]. Self-concordance allows for an elegant, affine-invariant characterization of the quadratic convergence region, as detailed in [7], leading to the sufficient condition that λ_f(x) ≤ η, where η ∈ (0; 1/4) is a constant whose exact value depends on control parameters for the line search.
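For reference, a single damped Newton step together with the decrement λ_f of Eq. (5) can be sketched as follows; this is a generic textbook-style implementation, and the backtracking constants are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def newton_step(x, f, grad_hess, c=0.25, rho=0.5):
    """One damped Newton step with backtracking line search.

    f(x) returns the objective value; grad_hess(x) returns (gradient, Hessian).
    Also returns the Newton decrement lambda_f(x) of Eq. (5), which indicates
    whether x already lies in the quadratic convergence region (lambda <= eta).
    """
    g, H = grad_hess(x)
    dx = -np.linalg.solve(H, g)             # Newton increment, Eq. (3)
    lam = np.sqrt(g @ -dx)                  # lambda_f(x) = sqrt(g^T H^{-1} g)
    t = 1.0
    while f(x + t * dx) > f(x) - c * t * lam ** 2:
        t *= rho                            # backtracking: the damped phase
    return x + t * dx, lam
```

Once λ_f(x) drops below η, the full step t = 1 is accepted and the iterates contract quadratically; the continuation method below is designed so that this is already the case at every hand-over.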
The self-concordance property restricts the set of functions f, yet it is known that in practice a similar analysis often optimistically applies to a wider range of functions. For further details, we refer the reader to the discussion in [7, Section 9.6] and the work on logistic regression in [2].

Note that strong convexity implies ∇²f ⪰ µI and thus [∇²f]^{−1} ⪯ (1/µ)I, so that immediately

    λ_f(x) ≤ (1/√µ) ‖∇f(x)‖,   (6)

and we arrive at the simple sufficient condition ‖∇f(x)‖ ≤ η √µ on the gradient norm at x.

Algorithm 1 Basic Newton continuation method: a priori (V1) and data-adaptive variant (V2).
    given sample S, number of iterations T
    V1: given sequence (µ_t, m_t);   V2: given starting point (µ_0, m_0)
    x_0 ← argmin_x f_{µ_0,m_0}(x; S)
    for t = 1, …, T do
        V1: do nothing;   V2: compute µ_t and m_t
        compute Newton increment Δx for f_t(x) := f_{µ_t,m_t}(x; S) at x_{t−1}
        x_t ← x_{t−1} + Δx
    end for

3.2 Continuation Method

We study the two-parametric family of regularized empirical risk functions that are defined via smooth and convex, non-negative loss functions φ_z and relative to a full sample S* = (z_1, …, z_N). We make use of the definitions in Eq. (1), and we think of f^ν_S as being indexed by ν as well as n = |S|, where S = (z_1, …, z_n) consists of the first n samples of S*. The canonical regularizer we consider is Ω(x) = ½‖x‖², and we will sometimes utilize the commonly used heuristic of choosing µ ∝ 1/m to focus on a simpler one-parametric family.

We want to implement the abstract procedure described in Algorithm 1, where we either pre-generate a sequence of problems f_t or, alternatively, construct (µ_t, m_t) greedily in a data-adaptive manner. The key condition for making this work as expected (by design) is that we are able to establish that

    λ_{f_t}(x*_{t−1}) ≤ η   or, more conservatively,   ‖∇f_t(x*_{t−1})‖ ≤ η √µ_t,   (7)

which will assure (under appropriate assumptions, e.g. self-concordance) that the minimizer of the previous optimization problem provides a starting point that is within the quadratic convergence region of the subsequent optimization problem, yielding a proper "hand-over" of the solution as illustrated in Figure 1.

Let us first assume that we fix S and that µ := µ_{t−1}. We are seeking the range of possible ν := µ_t ≤ µ for which we can guarantee the condition in Eq. (7).
Lemma 1. Let Ω(·) = ½‖·‖², x*_µ := argmin_x f_µ(x), and B_µ := η² / (µ ‖x*_µ‖²). Then for any ν with

    µ ≥ ν ≥ µ ( 1 − ½ ( sqrt(B_µ² + 4 B_µ) − B_µ ) )   (8)

it holds that λ_{f_ν}(x*_µ) ≤ η.
Proof. By the first-order optimality condition for x*_µ, we have that ∇f(x*_µ) = −µ x*_µ, hence

    ∇f_ν(x*_µ) = ∇f(x*_µ) + ν x*_µ = (ν − µ) x*_µ,   i.e.   ‖∇f_ν(x*_µ)‖ = (µ − ν) ‖x*_µ‖.   (9)

By Eq. (6) we have λ_{f_ν}(x*_µ) ≤ (1/√ν) ‖∇f_ν(x*_µ)‖, so it suffices that

    (1/√ν) ‖∇f_ν(x*_µ)‖ ≤ η   ⟺   ‖x*_µ‖ ≤ η √ν / (µ − ν)   ⟺   ν² − (2 + B_µ) µ ν + µ² ≤ 0.   (10)

Solving the quadratic inequality for ν yields the claim.
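In code, the admissible range of Lemma 1 amounts to a one-line computation; the sketch below (a hypothetical helper, directly transcribing Eq. (8)) returns the smallest regularization strength to which we may safely switch.

```python
import numpy as np

def min_admissible_nu(mu, x_mu, eta):
    """Smallest nu <= mu for which Lemma 1 guarantees lambda_{f_nu}(x*_mu) <= eta.

    mu:   current regularization strength
    x_mu: minimizer x*_mu of f_mu (the previous solution)
    eta:  radius of the quadratic convergence region
    """
    B = eta ** 2 / (mu * np.dot(x_mu, x_mu))                   # B_mu of Lemma 1
    return mu * (1.0 - 0.5 * (np.sqrt(B ** 2 + 4 * B) - B))    # lower end of Eq. (8)
```

Any ν in the interval [min_admissible_nu(mu, x_mu, eta), µ] keeps the previous minimizer inside the quadratic convergence region of f_ν.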
Corollary 2. Assume that φ_z(0) ≤ Φ (∀z). Lemma 1 remains valid with B_µ replaced by B := η² / (2Φ) ≤ B_µ.

Proof. One can bound ‖x*_µ‖² easily through the following argument, exploiting the non-negativity of φ_z:

    Φ ≥ f(0) = f_µ(0) ≥ f_µ(x*_µ) = f(x*_µ) + (µ/2) ‖x*_µ‖² ≥ (µ/2) ‖x*_µ‖²   ⟹   ‖x*_µ‖² ≤ 2Φ/µ.   (11)

Hence B_µ = η²/(µ‖x*_µ‖²) ≥ η²/(2Φ) = B, and replacing B_µ by the smaller B only shrinks the admissible range in Eq. (8), so the guarantee is preserved.

Note that Corollary 2 gives an a priori rate guarantee, which only depends on the easily computable Φ. This is a strong argument in favor of a continuation method, yet it may not result in the most efficient schedules for reducing µ. We will revisit this issue in Section 3.5.

3.4 Increasing Sample Size

We now generalize our analysis to the case where we also increase the sample size. The basic challenge is that we need to upper bound the gradient norm contributions coming from new data points. Intuitively, this depends on the generalization capability of the current iterate.
Lemma 3.
Assume f has Lipschitz continuous gradients with constant L. Let S be a given n-sample, which we split into S_1 = (z_1, …, z_m) and S_2 = (z_{m+1}, …, z_n), where m ≤ n is arbitrary. Define x*_1 := argmin_x f_{S_1}(x). With high probability over S_1 it holds that

    E_{S_2} ‖∇f_S(x*_1)‖² ≤ L ((n − m)/n) H(m).   (12)
See Appendix.
Theorem 4.
Under the same conditions as Lemma 3, let x*_1 := argmin_x f^µ_{S_1}(x). Then for arbitrary ν ≤ µ and β ∈ (0; ∞),

    E_{S_2} ‖∇f^ν_S(x*_1)‖² ≤ (1 + β) (ν − µ)² ‖x*_1‖² + (1 + β^{−1}) ((n − m)/n) L H(m).
See Appendix.

The theorem directly implies that we can construct a sequence of objective functions with sample size increasing at a geometric rate.
Corollary 5.
There exists an α* ∈ [0; 1) such that for all α ∈ [α*; 1] with m/α ∈ Z, the following holds for ν := αµ and S = (z_1, …, z_{m/α}):

    E_{S_2} ‖∇f^ν_S(x*_1)‖² ≤ η² ν,   (13)

i.e. the hand-over condition of Eq. (7) holds in expectation; α* can be explicitly computed as the root of a third order polynomial.

Proof. See Appendix.
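For the a priori variant (V1 in Algorithm 1), such a geometric schedule can be generated ahead of time. The following sketch scans for the smallest shrink factor α satisfying the Corollary 5 condition with β = 1 (cf. Eq. (22) in the appendix); the constants L, the bound on ‖x*‖², and the capacity function H are assumed inputs, e.g. obtained a priori via Corollary 2, and are not estimated from data here.

```python
import numpy as np

def smallest_alpha(mu, m, x_norm_sq, eta, L, H, grid=1000):
    """Smallest shrink factor alpha in (0, 1] fulfilling the Corollary 5
    condition (Theorem 4 with beta = 1, nu = alpha * mu, n = m / alpha)."""
    for alpha in np.linspace(1e-3, 1.0, grid):
        lhs = 2 * mu**2 * (1 - alpha)**2 * x_norm_sq + 2 * (1 - alpha) * L * H(m)
        if lhs <= alpha * mu * eta**2:      # expected squared gradient <= eta^2 * nu
            return alpha
    return 1.0

def apriori_schedule(mu0, m0, N, **bounds):
    """Geometric (mu_t, m_t) schedule for variant V1 of Algorithm 1."""
    schedule, mu, m = [(mu0, m0)], mu0, m0
    while m < N:
        alpha = smallest_alpha(mu, m, **bounds)
        if alpha >= 1.0:                    # no hand-over can be guaranteed; stop growing
            break
        mu, m = alpha * mu, min(int(np.ceil(m / alpha)), N)
        schedule.append((mu, m))
    return schedule
```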
3.5 Data-Adaptive Strategy

The above analysis is largely data-independent and just involves a few constants: L, Φ, proportionality constants involving H, and µ ∝ 1/m. As such, it leads to geometric rates that can be quite conservative. We will now show a data-adaptive strategy that – up to small approximation errors – maximally increases the sample size and, equivalently, maximally decreases the regularization strength, within the desired range.

First of all, note that we can easily compute the gradient on an increased sample. Denote S = (z_1, …, z_n) and S_1 = (z_1, …, z_m), m ≤ n, and define x*_1 = argmin_x f^µ_{S_1}(x). Then

    ∇f^ν_S(x*_1) = (1/n) Σ_{k=m+1}^{n} ∇φ_{z_k}(x*_1) + ν (1 − µm/(νn)) x*_1.   (14)

Note that if µ/ν = n/m, then the second term vanishes. As we need to compute the gradient anyway as part of the Newton iteration, we get it for free. We now could approximate

    λ_{f^ν_S}(x*_1) ≤ (1/√ν) ‖∇f^ν_S(x*_1)‖,   (15)

but this results in bounds that are not very tight and hence underestimate the possible rate. We thus propose a tighter bound based on a Taylor approximation of the inverse Hessian at the previous iteration, something that we can compute with a little bit of extra computational effort. (Note that computing a Newton update is usually done via an LU decomposition, which is cheaper than explicitly inverting the Hessian.) Let ν ≤ µ and define H := ∇²f^µ_{S_1}(x*_1); then

    [H − (µ − ν) I]^{−1} = H^{−1} + (µ − ν) H^{−2} + O((µ − ν)²),   (16)
and thus, with the additional assumption that H ≈ ∇²f^µ_S(x*_1),

    [λ_{f^ν_S}(x*_1)]² ≈ ∇f^ν_S(x*_1)ᵀ H^{−1} ∇f^ν_S(x*_1) + (µ − ν) ‖H^{−1} ∇f^ν_S(x*_1)‖².   (17)

For instance, in the setting ν ∝ 1/n, it is straightforward to (numerically) find the maximal n – or equivalently minimal ν – such that the above approximation is < η². In our experiments, we have found this approximation to be correct up to a few percent relative error.

Algorithm 2 Data-adaptive Newton's method – DYNANEWTON
    given sample S and (µ_0, m_0); define S_0 := S_{m_0}
    x_0 ← argmin_x f^{µ_0}_{S_0}(x)
    for t = 1, …, T do
        find the smallest α such that λ_{f_α}(x_{t−1}) ≤ η, where f_α := f^ν_{S'}, ν := α µ_{t−1}, S' := S_n, n := m_{t−1}/α
        (µ_t, m_t) ← (ν, n)
        compute Newton increment Δx for f_α at x_{t−1}
        x_t ← x_{t−1} + Δx
    end for

The complete algorithm is summarized in Algorithm 2. In order to find an α that fulfils Eq. (22), we need to compute an estimate of ‖∇f^µ_{S_2}(x*_1)‖². We solve this problem by first assuming that x_{t−1} is a good approximation of the minimizer of f_{t−1} and then computing the estimate from the samples (z_{m+1}, …, z_{m+k}). We can avoid the first approximation at the cost of performing additional Newton iterations per stage instead of just one, but we have found this not to be necessary in any of our experiments. Further experimental results are provided in the appendix.
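One possible realization of the inner search of Algorithm 2 is sketched below, under the assumption H ≈ ∇²f^µ_S(x_{t−1}) stated above; the helper grad_phi and the candidate grid for α are illustrative placeholders rather than the authors' implementation.

```python
import numpy as np

def adaptive_alpha(x, H, grad_phi, Z, m, mu, eta, alphas=None):
    """Data-adaptive choice of (nu, n) for one stage of Algorithm 2.

    x:        previous iterate x_{t-1} (assumed close to the minimizer of f_{t-1})
    H:        Hessian of f^mu_{S_1} at x, available from the previous Newton step
    grad_phi: callable z -> gradient of phi_z at x          (placeholder name)
    Z:        the full ordered sample z_1, ..., z_N
    m, mu:    previous sample size and regularization strength
    """
    if alphas is None:
        alphas = np.linspace(0.99, 0.05, 95)   # scanned from cautious to aggressive
    Hinv = np.linalg.inv(H)                    # reused for every candidate alpha
    best = (mu, m)                             # fall back to "no change"
    for alpha in alphas:
        nu, n = alpha * mu, min(int(np.ceil(m / alpha)), len(Z))
        # gradient on the enlarged sample, Eq. (14): only the new samples are touched
        g = sum(grad_phi(z) for z in Z[m:n]) / n + nu * (1.0 - mu * m / (nu * n)) * x
        # squared Newton decrement via the Taylor expansion of Eq. (17)
        Hg = Hinv @ g
        lam_sq = g @ Hg + (mu - nu) * (Hg @ Hg)
        if lam_sq > eta ** 2:                  # hand-over no longer guaranteed
            break
        best = (nu, n)
    return best
```

The returned pair (ν, n) then defines f_t for the next Newton increment.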
4 Experiments

Datasets. We apply ℓ2-regularized logistic regression on a set of 4 datasets of varying size and dimensionality, summarized in Table 1. We use part of the data points as the training set and the remaining ones as the test set.

DATASET            SIZE    NUMBER OF FEATURES
A9A
W8A
COVTYPE.BINARY
SUSY
Table 1: Details of the datasets used in our experiments.
Comparison to standard baselines. We compare DYNANEWTON (cf. Algorithm 2) to Newton's method and – as a competitive SGD variant – to SAGA. We used a step size of ∼ 1/L for SAGA, as suggested in previous work [10, 14]. The results presented in Figure 2 show significant speed-ups compared to Newton's method. DYNANEWTON also outperforms SAGA and gets a very accurate solution after less than 6 epochs on all datasets. We also evaluated the solution quality on the expected risk and provide the complete results of this evaluation in the Appendix (see Figure 6). In order to relate convergence on the empirical and expected risks, we plotted a horizontal dotted line that shows the iteration at which DYNANEWTON reached convergence on the test set. This demonstrates that DYNANEWTON also achieves significant gains in terms of test error. As the cost per iteration is typically higher for Newton's method on a single machine, we also present a comparison in terms of running time in the Appendix (see Figure 5). Although the gains are more moderate when measuring time, it is worth pointing out that Newton is inherently much easier to parallelize, as discussed in [9], and further gains are thus to be expected in a distributed setting.
Increment factor test. We evaluate the influence of α on the convergence of the algorithm without data-adaptivity. As can be seen in Figure 3, a small value of α leads to faster convergence, while it might also make the algorithm diverge if chosen too aggressively small. In contrast, the data-adaptive approach adapts the value of α and yields a stable convergence behavior on all datasets.
Figure 2: Suboptimality on the empirical risk vs. effective number of epochs. The plots show the suboptimality of the empirical risk on the full data set S* with ν = 1/N. The horizontal axis represents the number of effective epochs, i.e. the number of passes over the whole training data set. The dotted horizontal line refers to statistical accuracy (explained in the main text). (Panels: A9A, W8A, COVTYPE, SUSY; curves: dyna-newton, newton, saga.)
Figure 3: Increment factor test on the W8A dataset. The plot shows the suboptimality of the empirical risk on the full data set S* with ν = 1/N. In these experiments we used the increment factor α = m_{t−1}/m_t. It is clear that the smallest α value causes leaving the region of quadratic convergence and hence leads to divergence. This behavior is excluded by the data-adaptive approach.
Figure 4: Robustness against the initialization point on the A9A dataset. We here show the suboptimality on the empirical risk against time for three different initialization points x_0, shown in red, blue, and black (panels: NEWTON, DYNA-NEWTON).

We observed empirically that the data-adaptive method settles on a roughly constant value of α, which explains why the non-adaptive approach with a comparable fixed α achieves a similar – but slightly inferior – performance.
Importance of the initialization point. We investigate the role of the initialization point on the convergence of Newton's method and DYNANEWTON. The results shown in Figure 4 demonstrate that bad initialization points (i.e. far away from the optimum) significantly slow down the convergence of Newton's method, as they require more steps to get inside the ball of quadratic convergence. By comparison, a poor initialization does not significantly affect DYNANEWTON.

5 Conclusion

We proposed a continuation method variant of Newton's method that dynamically adapts the sample size and the regularization strength such that each subproblem provides a starting point that is within the quadratic convergence region of the subsequent optimization problem. We provided a theoretical analysis that characterizes the conditions required for achieving a proper hand-over and also devised a data-adaptive strategy of discretizing the solution path.

Our empirical results demonstrate significant speed-ups across a wide range of datasets, both in terms of empirical as well as expected risk. In particular, the speed of convergence on the latter is quite remarkable, often getting close to optimal solutions within a few effective epochs. All results seem to be in good agreement with our theory and its predictions.

In the appendix, we provide further empirical evidence that shows that our results extend to non-exact Newton methods such as L-BFGS. This is of special interest for training large deep networks, for which a distributed implementation of L-BFGS has been shown to provide significant speed-ups in terms of training time [9]. While our analysis does not provide any guarantees for this case, it seems that the proposed continuation method has merits beyond the results presented in this paper. A better understanding of the behavior of quasi-Newton methods is a topic of future work.

Acknowledgments
We would like to thank Aryan Mokhtari for helpful discussions on Newton’s method.
References

[1] E. L. Allgower and K. Georg. Numerical continuation methods: an introduction, volume 13. Springer Science & Business Media, 2012.
[2] F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
[3] F. R. Bach, R. Thibaux, and M. I. Jordan. Computing regularization paths for learning multiple kernels. Advances in Neural Information Processing Systems, 17:73–80, 2005.
[4] A. Blake and A. Zisserman. Visual reconstruction, volume 2. MIT Press, Cambridge, 1987.
[5] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[6] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] H. Daneshmand, A. Lucchi, and T. Hofmann. Starting small - learning with adaptive sample sizes. In International Conference on Machine Learning, 2016.
[9] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
[10] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[11] J. E. Dennis, Jr. and J. J. Moré. Quasi-Newton methods, motivation and theory. SIAM Review, 19(1):46–89, 1977.
[12] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. The Journal of Machine Learning Research, 5:1391–1415, 2004.
[13] E. Hazan, K. Y. Levy, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. arXiv preprint arXiv:1503.03712, 2015.
[14] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems 28, pages 2296–2304, 2015.
[15] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[16] S. Kirkpatrick. Optimization by simulated annealing: Quantitative studies. Journal of Statistical Physics, 34(5-6):975–986, 1984.
[17] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
[18] A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Transactions on Signal Processing, 62(23):6089–6104, 2014.
[19] Y. Nesterov, A. Nemirovskii, and Y. Ye. Interior-point polynomial algorithms in convex programming, volume 13. SIAM, 1994.
[20] J. M. Ortega and W. C. Rheinboldt. Iterative solution of nonlinear equations in several variables, volume 30. SIAM, 1970.
[21] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, 1998.
[22] V. Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.
[23] J. Wang, H. Wang, and N. Srebro. Reducing runtime by recycling samples. arXiv preprint arXiv:1602.02136, 2016.
Appendix
A.1 Proofs

Lemma 3
Proof.
Using a classical concentration inequality from statistical learning theory, we get (see also [8]) with high probability over S_1 that

    E_{S_2} [ f_S(x*_1) − min_x f_S(x) ] ≤ ((n − m)/n) H(m),   (18)

which bounds the suboptimality of the minimizer of the smaller m-sample on the larger n-sample. Furthermore, by virtue of smoothness,

    ‖∇f_S(x*_1)‖² ≤ L [ f_S(x*_1) − min_x f_S(x) ].   (19)

Putting both inequalities together yields the claim.

Theorem 4
Proof.

    ∇f^ν_S(x*_1) = (m/n) ∇f_{S_1}(x*_1) + ((n − m)/n) ∇f_{S_2}(x*_1) + ν x*_1   (20)
                 = −(m/n) µ x*_1 + ((n − m)/n) ∇f^µ_{S_2}(x*_1) − ((n − m)/n) µ x*_1 + ν x*_1
                 = (ν − µ) x*_1 + ((n − m)/n) ∇f^µ_{S_2}(x*_1).

Here we have repeatedly exploited the first-order optimality condition of x*_1, i.e. ∇f^µ_{S_1}(x*_1) = 0. Now, taking squared norms on both sides, we apply the generalized parallelogram identity ‖a + b‖² ≤ (1 + β) ‖a‖² + (1 + β^{−1}) ‖b‖² with β ∈ (0; ∞):

    ‖∇f^ν_S(x*_1)‖² ≤ (1 + β) (ν − µ)² ‖x*_1‖² + (1 + β^{−1}) ((n − m)/n)² ‖∇f^µ_{S_2}(x*_1)‖².   (21)

Taking the expectation with regard to S_2 on both sides and applying Lemma 3 concludes the proof.

Corollary 5
Proof.
For concreteness, set β = 1. We want to choose α such that

    2 µ² ‖x*_1‖² (1 − α)² + 2 L H(m) (1 − α) ≤ α µ η².   (22)

For α = 1, the left-hand side is 0, while the right-hand side is strictly positive. As the derivative of the left-hand side is finite at α = 1, we can solve for the smallest α* < 1 such that the inequality holds for all α ∈ [α*; 1], as claimed.
A.2 Running time

We compare the running time of DYNANEWTON against other baselines in Figure 5. Although we see more moderate gains in comparison to SAGA in terms of computational time (especially in early iterations), it is worth pointing out that Newton is inherently much easier to parallelize, as discussed in [9], and further gains are thus to be expected in a distributed setting.
A.3 Expected risk
We present the results in terms of expected error in Figure 6. This largely confirms the results obtained on the training set. Although all methods achieve convergence after less than 3 epochs, DYNANEWTON achieves even faster convergence than Newton and SAGA.
Figure 5: Suboptimality on the empirical risk vs. time. The vertical axis shows the suboptimality of the empirical risk, i.e. f_{ν,n}(x_t) − f*_{ν,n}, where ν = 1/n. The horizontal axis represents run time in seconds. The horizontal dotted line corresponds to the iteration at which convergence is reached on the expected risk (see details in the main text). (Panels: A9A, W8A, COVTYPE, SUSY; curves: dyna-newton, newton, saga.)

A.4 Approximation
The analysis we developed assumes that we reach x*_µ := argmin_x f_µ(x) before switching to ν. We empirically check the robustness to an approximate solution x̂_µ obtained by performing a single Newton step instead of the number of steps that guarantees convergence up to numerical precision (see Section 9.5 in [7]). As shown in Figure 7, the impact of the resulting approximate solution is almost negligible.

A.5 BFGS
One shortcoming of Newton's method is that it requires solving a linear equation system involving the Hessian matrix, which may be impractical for large and high-dimensional datasets. Approximate variants known as quasi-Newton methods [11] have thus been developed, such as the popular BFGS or its limited-memory version L-BFGS [17]. Quasi-Newton methods such as BFGS do not require computing the Hessian matrix but instead construct a quadratic model of the objective function by successive evaluations of the gradient. There seems to be a gap between theoretically guaranteed convergence rates and the empirically observed effectiveness of BFGS, in particular on ill-conditioned problems [18] and for non-convex problems [9].

We used the inspiration provided by our continuation method approach to develop a variant of L-BFGS that, like DYNANEWTON, adaptively changes the sample size and the regularizer. We name this approach DYNALBFGS. The pseudo-code of this method is the same as Algorithm 2, except that the Newton increment is computed from the approximate L-BFGS Hessian matrix instead of the true Hessian. We evaluate the performance of DYNALBFGS for the task of ℓ2-regularized logistic regression on the two datasets described in the main paper. The results shown in Figure 8 demonstrate significant gains compared to L-BFGS. We also investigate the performance of DYNALBFGS on training a convolutional neural network consisting of two convolutional and pooling layers with one fully-connected layer. We include results on the standard MNIST dataset in Figure 9. Although our analysis does not extend to non-convex functions, we nevertheless still observe significant gains.
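As a point of reference, the sketch below shows a much-simplified version of this idea on top of an off-the-shelf solver (scipy's L-BFGS-B): the data-adaptive α search is replaced by a fixed doubling schedule with ν = 1/n, each stage is warm-started from the previous solution, and all names and constants (m0, the number of inner iterations) are assumptions of this illustration.

```python
import numpy as np
from scipy.optimize import minimize

def dyna_lbfgs(value_and_grad, N, d, m0=512, inner_iters=5):
    """Warm-started L-BFGS over a growing sample with nu = 1/n (sketch).

    value_and_grad(x, n) must return (f^{1/n}_{S_n}(x), gradient), i.e. the
    regularized risk and its gradient on the first n samples.
    """
    x, n = np.zeros(d), min(m0, N)
    while True:
        res = minimize(lambda w: value_and_grad(w, n), x, jac=True,
                       method="L-BFGS-B", options={"maxiter": inner_iters})
        x = res.x                           # warm start for the next, larger problem
        if n == N:
            return x
        n = min(2 * n, N)                   # grow the sample, i.e. lower nu = 1/n
```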
Figure 6: Suboptimality on the test set vs. number of epochs. The vertical axis shows the suboptimality of the expected risk, i.e. φ_T(x) = Σ_{z∈T} φ_z(x) / |T|, where T is the test set. The horizontal axis represents the number of epochs. (Panels: A9A, W8A, COVTYPE, SUSY; curves: dyna-newton, newton, saga.)
Figure 7: Effect of the approximate solution obtained by DYNANEWTON. We here check the effect of the approximate solution obtained by DYNANEWTON by iterating for 1 epoch against the exact solution obtained by iterating for 6 epochs. (Panels: 1. Empirical suboptimality on A9A; 2. Empirical suboptimality on COVTYPE; 3. Test error on A9A; 4. Test error on COVTYPE; curves: exact-solution, dyna-newton.)
Figure 8: Comparison of BFGS vs. DYNALBFGS. We here provide the empirical suboptimality as well as the test error on the A9A and SUSY datasets. We would like to point out that the range of the y-axis between the top and bottom row is different, as convergence on the test set was typically observed after 1 or 2 epochs. (Panels: 1. Empirical suboptimality on A9A; 2. Empirical suboptimality on SUSY; 3. Test error on A9A; 4. Test error on SUSY; curves: dyna-bfgs, bfgs.)
Figure 9: Comparison of BFGS vs. DYNALBFGS for training neural networks. We here show the performance of DYNALBFGS on the standard MNIST dataset described above. (Panels: 1. Training error; 2. Test error; curves: LBFGS, Dyna-LBFGS.)