Low Rank Saddle Free Newton: Algorithm and Analysis
Thomas O’Leary-Roseberry, Nick Alger, Omar Ghattas

Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX. Correspondence to: Thomas O’Leary-Roseberry <[email protected]>.

Abstract
Many tasks in engineering and machine learning involve minimizing a high dimensional non-convex function. The existence of saddle points poses a central challenge in practice. The Saddle Free Newton (SFN) algorithm can rapidly escape high dimensional saddle points by using the absolute value of the Hessian of the empirical risk function. In SFN, a Lanczos type procedure is used to approximate the absolute value of the Hessian. Motivated by recent empirical work noting that neural network training Hessians are typically low rank, we propose approximation via scalable randomized low rank methods. Such factorizations can be efficiently inverted via the Sherman-Morrison-Woodbury formula. We derive bounds on convergence rates in expectation for a stochastic version of the algorithm, which quantify the errors incurred in subsampling as well as in approximating the Hessian via low rank factorization. We test the method on standard neural network training benchmark problems: MNIST and CIFAR10. Numerical results demonstrate that, in addition to avoiding saddle points, the method can converge faster than first order methods, and that the Hessian can be subsampled significantly relative to the gradient while retaining superior performance.
1. Introduction
We consider the stochastic optimization problem
$$\min_{w \in \mathbb{R}^d} F(w) = \int \ell(w; x, y)\, d\nu(x, y) \quad (1)$$
where $\ell$ is a smooth loss function, $w \in \mathbb{R}^d$ is the vector of optimization variables, and the data pairs $(x, y)$ are distributed with joint probability distribution $\nu(x, y)$. $F : \mathbb{R}^d \to \mathbb{R}$ is referred to as the expected risk function. This problem arises in machine learning, where the goal is to reconstruct a mapping $x \mapsto y$ with a deep neural network or other model parameterized by $w$. In practice $\nu$ is not known, and instead one only has access to samples $x_i, y_i \sim \nu$. Given a sample dataset $X = \{(x_i, y_i) \sim \nu\}_{i=1}^{N_X}$, a Monte Carlo approximation of the expected risk minimization problem, Equation (1), leads to the empirical risk minimization problem:
$$\min_{w \in \mathbb{R}^d} F_X(w) = \frac{1}{N_X} \sum_{i=1}^{N_X} \ell(w; x_i, y_i). \quad (2)$$
Optimization problem (2) serves as a surrogate for (1).

In many settings such as deep learning, problems (1) and (2) are nonconvex. Nonconvexity makes it computationally intractable (NP-hard) to find global minima (Bertsekas, 1997; Murty & Kabadi, 1987). It is also generally not the case that the global minimizer of (2) is the global minimizer of (1). Thus iterative (gradient-based) methods of the form
$$w_{k+1} = w_k + \alpha_k p_k \quad (3)$$
are typically used to explore the nonconvex energy landscape of (2), searching for local minima that generalize well to unseen data. Here $p_k$ is a search direction, and $\alpha_k$ is a step length parameter.

Nonconvex energy landscapes typically contain many strict saddle points (stationary points with at least one direction of negative curvature). Much work has been dedicated to understanding how first order methods perform in the vicinity of strict saddle points (Ge et al., 2015; Jin et al., 2017; Lee et al., 2017; 2016). Saddle points slow the local convergence of first order methods (methods in which $p_k$ is constructed using only gradient information); these methods typically escape saddle points only asymptotically. Newton's method without modification converges locally to strict saddle points, since gradient components initially oriented away from the saddle are reoriented towards the saddle point by the associated negative eigenvalues of the Hessian.

Newton methods can be adapted to escape saddle points by enforcing positive definiteness of the Hessian matrix, $H := \nabla^2 F_X$. One approach that facilitates fast escape from saddle points involves replacing the Hessian with the absolute value of the Hessian, $|H|$ (Gill et al., 1981). By the absolute value of the Hessian, we mean the matrix that is the same as the Hessian, except the negative eigenvalues are flipped to be positive. Specifically, let the spectral decomposition of the Hessian be given as follows:
$$H = U \Lambda U^T = \sum_{i=1}^d \lambda_i u_i u_i^T, \quad (4)$$
where the eigenvalues $\lambda_i$ are sorted such that $|\lambda_i| \geq |\lambda_j|$ for all $i < j$, and $u_i \in \mathbb{R}^d$ are the associated eigenvectors. The absolute value of the Hessian thus is
$$|H| = \sum_{i=1}^d |\lambda_i| u_i u_i^T.$$
Then $|H|$ is used in place of $H$ within the Newton system. Moreover, rank deficiency of the Hessian is often addressed via $\ell_2$ regularization or Levenberg-Marquardt damping.
With these modifications the Newton system becomes
$$(|H| + \gamma I) p_k = -g_k, \quad (5)$$
where $\gamma$ is a regularization or damping parameter, and $g_k$ is the gradient. By using $|H|$ rather than $H$, iterates $w_k$ escape saddle points quickly, because flipping the signs of the negative eigenvalues causes saddle points to repel the iterates.

Classically, $|H|$ is computed by forming the dense Hessian matrix, which requires $O(N_X d^2)$ work, then performing a spectral decomposition of this matrix, which requires $O(d^3)$ work. In deep learning $d$ is large, so formation and factorization of the Hessian is not computationally tractable. The Saddle Free Newton (SFN) algorithm (Dauphin et al., 2014) addresses this difficulty by using the Lanczos procedure to form an approximation of $|H|$. Computing the Lanczos approximation of order $r$ does not require forming or factorizing the Hessian matrix; instead only the application of the Hessian to $r$ vectors is required.

Typically it is too expensive to use all $N_X$ samples at each iteration. It is standard practice, therefore, to subsample the gradient and Hessian at each iteration (i.e., use a small subset of the data pairs to form the gradient and Hessian). Recently it has been shown that Newton-type algorithms converge rapidly even when the Hessian is subsampled more than the gradient, because the variance of the Hessian is typically smaller than that of the gradient (Erdogdu & Montanari, 2015; Roosta-Khorasani & Mahoney, 2016a;b; Xu et al., 2016). In the remainder of this paper, all Hessians and gradients are subsampled. We denote the gradient batch size at iteration $k$ by $N_{X_k}$ and the Hessian batch size by $N_{S_k}$.
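To make the saddle-repulsion mechanism concrete, the following small NumPy sketch (our illustration, not code from any of the implementations discussed here) contrasts a damped Newton step with the saddle-free step (5) on an indefinite quadratic whose only stationary point is a strict saddle:

```python
# Minimal sketch: damped Newton vs. the saddle-free step (5) on
# f(w) = 0.5 * w^T A w, whose stationary point w = 0 is a strict saddle.
import numpy as np

A = np.diag([2.0, -1.0])      # Hessian with one negative eigenvalue
w = np.array([1.0, 1.0])      # current iterate
g = A @ w                     # gradient of the quadratic at w
gamma = 1e-2                  # damping parameter (illustrative value)

# Damped Newton: solve (H + gamma I) p = -g; attracted to the saddle.
p_newton = np.linalg.solve(A + gamma * np.eye(2), -g)

# Saddle free Newton: flip negative eigenvalues, solve (|H| + gamma I) p = -g.
lam, U = np.linalg.eigh(A)    # spectral decomposition H = U diag(lam) U^T
H_abs = U @ np.diag(np.abs(lam)) @ U.T
p_sfn = np.linalg.solve(H_abs + gamma * np.eye(2), -g)

print(w + p_newton)  # ~ [0, 0]: steps straight into the saddle point
print(w + p_sfn)     # ~ [0, 2]: repelled along the negative-curvature direction
```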
2. Low Rank Saddle Free Newton
Recent empirical studies of their spectra show that neural network training Hessians are typically numerically low rank, and when away from local minima often have at least one large magnitude negative eigenvalue (Alain et al., 2019; Ghorbani et al., 2019; Sagun et al., 2016). In this work we therefore propose forming a low rank approximation of the Hessian,
$$H_r = [\nabla^2 F]_r = U_r \Lambda_r U_r^T = \sum_{i=1}^r \lambda_i u_i u_i^T, \quad (6)$$
using the double pass randomized eigenvalue decomposition in Algorithm 5.3 of (Halko et al., 2011), instead of the Lanczos procedure. The rank $r$ can be chosen such that the trailing eigenvalues are smaller than some tolerance: $|\lambda_j| < \epsilon_H$ for all $j > r$. Like Lanczos, this algorithm avoids forming the Hessian, and instead employs only products of the Hessian with random vectors. (In practice the Hessian is oversampled: $r + p$ random vectors are used, where $p$ is typically much smaller than $r$.) We use this low rank approximation, in combination with the Sherman-Morrison-Woodbury formula, to solve the modified Newton system (5) as follows:
$$p_k = -\left[\frac{1}{\gamma} I_d - \frac{1}{\gamma^2} U_r \left(|\Lambda_r|^{-1} + \frac{1}{\gamma} I_r\right)^{-1} U_r^T\right] g_k. \quad (7)$$
This is the Sherman-Morrison-Woodbury identity $(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}$ applied with $A = \gamma I_d$, $U = U_r$, $C = |\Lambda_r|$, and $V = U_r^T$. The low rank saddle free Newton (LRSFN) method is summarized in Algorithm 1.

Algorithm 1 Low Rank Saddle Free Newton
Given $w_0$
while not converged do
  Compute the randomized approximation $U_r^{(k)} \Lambda_r U_r^{(k)T} \approx H_r$
  Compute $p_k$ using Equation (7)
  $\alpha_k$ given or computed via line search
  $w_{k+1} = w_k + \alpha_k p_k$
end while
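As a concrete rendering of Algorithm 1 and Equation (7), the following NumPy sketch performs a single LRSFN step. It assumes only a callback hess_mat that applies the (subsampled) Hessian to a block of vectors; the function names and parameter values are our own, and the double pass decomposition is a simplified reading of Algorithm 5.3 of (Halko et al., 2011), not the released hessianlearn implementation:

```python
# One LRSFN step via double pass randomized eigendecomposition + SMW solve.
import numpy as np

def lrsfn_step(hess_mat, g, r=20, p=10, gamma=1e-1, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    d = g.shape[0]
    # Range finder: r + p independent Hessian products with a Gaussian block.
    Omega = rng.standard_normal((d, r + p))
    Q, _ = np.linalg.qr(hess_mat(Omega))
    # Second pass: project the Hessian onto the subspace and diagonalize.
    T = Q.T @ hess_mat(Q)
    lam, S = np.linalg.eigh(T)
    # Keep the r eigenpairs of largest magnitude (assumed bounded away from 0).
    idx = np.argsort(-np.abs(lam))[:r]
    Lam_r, U_r = lam[idx], Q @ S[:, idx]
    # Sherman-Morrison-Woodbury solve of (|H_r| + gamma I) p = -g, Eq. (7).
    inner = np.diag(1.0 / np.abs(Lam_r)) + np.eye(r) / gamma
    return -(g / gamma - U_r @ np.linalg.solve(inner, U_r.T @ g) / gamma**2)
```

The dominant costs are the $2(r+p)$ Hessian products, which are mutually independent, and a dense eigensolve of size $r + p$; the SMW application itself is $O(dr)$ once the factors are available.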
3. Low Rank vs. Krylov
On one hand, the Eckart-Young Theorem states that the low rank approximation (6) is optimal in the spectral and Frobenius norms. On the other hand, Krylov approximations such as Lanczos are optimal in the sense of certain polynomial approximations; for more details see (Saad, 2003). So which one is better here? We claim that randomized low rank approximation is better in this setting. It (a) leads to solutions that generalize better when the Hessian is subsampled, (b) is better at escaping saddle points, and (c) is better suited for modern parallel computer architectures.

(a) The objective function is most sensitive to perturbations of $w$ in directions corresponding to eigenvalues of large magnitude, since the energy landscape has large curvature in these directions. These directions typically persist when different sets of subsamples are used to approximate the Hessian. Directions corresponding to eigenvalues of small magnitude are less important, since the objective function is less sensitive to perturbations in these directions; they also tend to fluctuate when different sets of subsamples are used to approximate the Hessian. Since Krylov methods approximate the whole spectrum, they waste computational effort attempting to approximate eigenvalues of small magnitude that vary depending on the random subsamples used. Low rank approximation only approximates the large eigenvalues, and therefore leads to solutions that generalize better.

(b) Krylov subspace approximations are heavily dependent on the initial vector for the subspace. In Newton-Krylov methods such as SFN, the gradient is the initial vector. However, in the vicinity of a saddle point the gradient may have small components in eigenvector directions corresponding to eigenvalues that are negative but large in magnitude. Randomized low rank methods are better than Krylov methods at capturing these large magnitude directions when the gradient is small in these directions. Hence LRSFN pushes iterates away from saddle points more strongly than Krylov-based Saddle Free Newton, as the sketch below illustrates.

(c) Krylov methods are inherently serial, while the matrix-vector products required by randomized low rank methods are independent and therefore easily parallelized.

For a general discussion of randomized methods and Krylov methods, see (Martinsson & Tropp, 2020).
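Point (b) can be seen in a contrived but instructive example (our own). Below, the gradient has no component along the eigenvector of the large negative eigenvalue, so the Krylov subspace generated from it never contains that direction, while a Gaussian probe captures a component of it almost surely:

```python
# Toy demonstration of point (b): a Krylov space built from the gradient
# cannot see a negative-curvature direction in which the gradient has zero
# component; a random probe can.
import numpy as np

H = np.diag([-5.0, 2.0, 1.0])    # large negative eigenvalue along e_1
g = np.array([0.0, 1.0, 1.0])    # gradient orthogonal to that direction

# Krylov basis K_3(H, g) = span{g, Hg, H^2 g}: first entries are all zero.
K = np.column_stack([g, H @ g, H @ H @ g])
print(np.linalg.matrix_rank(K))  # 2: the subspace never contains e_1

# A Gaussian random probe has a nonzero component along e_1 almost surely.
Omega = np.random.default_rng(0).standard_normal((3, 2))
Q, _ = np.linalg.qr(H @ Omega)
print(np.abs(Q[0, :]).max())     # nonzero: the range finder sees e_1
```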
4. Semi-Stochastic Convergence Rate
In this section we prove asymptotic bounds on convergence rates of LRSFN, in expectation with respect to the training data used. These convergence rates quantify errors due to sampling, and errors due to the randomized procedure for performing the low rank approximation of the Hessian. We present an expected bound on the convergence rate of the semi-stochastic version of the randomized LRSFN algorithm; in this case only the Hessian is subsampled, so there is no associated sampling error for the gradient. We denote by $E_\nu$ the expectation with respect to $\nu$, by $E_k$ the conditional expectation at iteration $k$ taken over all Hessian batches of size $N_{S_k}$, and by $E_\rho$ the expectation taken with respect to the Gaussian random matrices used in the randomized low rank approximation.

Assumption 1 (Bounded variance of Hessian components). There exists $\sigma > 0$ such that
$$\left\| E_\nu \left[ \left( \nabla^2 F_i(w) - \nabla^2 F(w) \right)^2 \right] \right\| \leq \sigma^2 \quad (8)$$
for all $w \in \mathbb{R}^d$.

Lemma 1.
If Assumption 1 holds, then
$$E_k \left[ E_\rho \left[ \left\| \left( H_r^{(k)} + \gamma I \right)(w_k) - \nabla^2 F(w_k) \right\| \right] \right] \leq C \left| \lambda_{r+1}^{(k)} \right| + \gamma + \frac{\sigma}{\sqrt{N_{S_k}}}. \quad (9)$$
We have $C = 1$ in the case that the low rank approximation is exact, and $C = \left( 1 + \frac{4\sqrt{d(r+p)}}{p-1} \right)$ when randomized low rank approximation is used with oversampling parameter $p$. Here, we use the shorthand $E_k[\lambda_r^{(k)}] = \lambda_r^{(k)}$.

Proof. When the randomized low rank decomposition is used, we have
$$E_k \left[ E_\rho \left[ \left\| \left( H_r^{(k)} + \gamma I \right)(w_k) - \nabla^2 F(w_k) \right\| \right] \right] \leq E_k \left[ E_\rho \left[ \left\| \left( H_r^{(k)} + \gamma I \right)(w_k) - \nabla^2 F_{S_k}(w_k) \right\| \right] \right] + E_k \left[ \left\| \nabla^2 F_{S_k}(w_k) - \nabla^2 F(w_k) \right\| \right]. \quad (10)$$
We may bound the first term in (10) as follows:
$$E_k \left[ E_\rho \left[ \left\| \left( H_r^{(k)} + \gamma I \right)(w_k) - \nabla^2 F_{S_k}(w_k) \right\| \right] \right] \leq E_k \left[ \left( 1 + \frac{4\sqrt{d(r+p)}}{p-1} \right) \left| \lambda_{r+1}^{(k)} \right| + \gamma \right] = \left( 1 + \frac{4\sqrt{d(r+p)}}{p-1} \right) \left| \lambda_{r+1}^{(k)} \right| + \gamma. \quad (11)$$
The inequality in (11) comes from Equation 1.8 in (Halko et al., 2011). The second term in (10) is bounded by $\sigma / \sqrt{N_{S_k}}$, due to the bounded variance of Hessian components, via Lemma 2.3 in (Bollapragada et al., 2018).

Theorem 1 (Expected convergence rate for semi-stochastic randomized low rank Newton). Let $w_k$ be in the basin of attraction of a local minimum $w^*$, and suppose that $\alpha_k = 1$, the Tikhonov regularization parameter $\gamma > 0$ satisfies $\gamma \neq -\lambda_i$ for all eigenvalues $\lambda_i$ of $\nabla^2 F$, the Hessian is Lipschitz continuous with Lipschitz constant $M > 0$, and Assumption 1 holds. The iterates of the semi-stochastic randomized low rank Newton method satisfy the following bound:
$$E_{k,\rho} \left[ \| w_{k+1} - w^* \| \right] \leq c_1 \| w_k - w^* \| + c_2 \| w_k - w^* \|^2, \quad (12)$$
where
$$c_1 = \frac{1}{|\lambda^* + \gamma|} \left[ \left( 1 + \frac{4\sqrt{d(r+p)}}{p-1} \right) \left| \lambda_{r+1}^{(k)} \right| + \gamma + \frac{\sigma}{\sqrt{N_{S_k}}} \right], \quad (13)$$
$$c_2 = \frac{M}{2\,|\lambda^* + \gamma|}, \quad (14)$$
and $\lambda^*$ is the eigenvalue of $\nabla^2 F$ closest to $-\gamma$.

Proof.
By a derivation in Lemma 2.2 of (Bollapragada et al., 2018) we have the following bound:
$$\left\| \nabla^2 F(w_k)(w_k - w^*) - \nabla F(w_k) \right\| \leq \frac{M}{2} \| w_k - w^* \|^2. \quad (15)$$
This bound, the bound in Lemma 1, and the triangle inequality yield the desired result:
$$
\begin{aligned}
E_{k,\rho}[\|w_{k+1} - w^*\|] &= E_{k,\rho}\left[\left\|w_k - w^* - \left[H_r^{(k)} + \gamma I\right]^{-1}(w_k)\,\nabla F(w_k)\right\|\right] \\
&\leq \frac{1}{|\lambda^* + \gamma|}\, E_{k,\rho}\Big[\big\|\big(\big[H_r^{(k)} + \gamma I\big](w_k) - \nabla^2 F(w_k)\big)(w_k - w^*) + \nabla^2 F(w_k)(w_k - w^*) - \nabla F(w_k)\big\|\Big] \\
&\leq \frac{1}{|\lambda^* + \gamma|}\, E_k\Big[\big\|\big(\big[H_r^{(k)} + \gamma I\big](w_k) - \nabla^2 F(w_k)\big)(w_k - w^*)\big\|\Big] + \frac{1}{|\lambda^* + \gamma|}\big\|\nabla^2 F(w_k)(w_k - w^*) - \nabla F(w_k)\big\| \\
&\leq \frac{1}{|\lambda^* + \gamma|}\left(C\left|\lambda_{r+1}^{(k)}\right| + \gamma + \frac{\sigma}{\sqrt{N_{S_k}}}\right)\|w_k - w^*\| + \frac{M}{2\,|\lambda^* + \gamma|}\|w_k - w^*\|^2.
\end{aligned}
$$

When the error in the Hessian approximation approaches zero, one recovers the classic quadratic convergence bound of Newton's method. Fast super-linear convergence can be observed when the low rank approximation is accurate ($|\lambda_{r+1}| \ll 1$) and the Hessian sampling error is small (small variance of Hessian components $\sigma^2$, or large Hessian batch size $N_{S_k}$). For the SFN method of Dauphin, a similar bound can be proven that has the Krylov approximation error in the linear error constant $c_1$, in place of the error stemming from the randomized approximation of the Hessian. In the fully stochastic case, an additional constant error term is incurred from the gradient sampling error. See (O'Leary-Roseberry et al., 2019) for more detailed convergence rates.
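To get a feel for the constants, one can plug illustrative values (ours, not taken from the experiments below) into the linear error constant of Lemma 1. With $d = 10^5$, $r = 20$, and $p = 10$,
$$C = 1 + \frac{4\sqrt{d(r+p)}}{p-1} = 1 + \frac{4\sqrt{3 \times 10^6}}{9} \approx 7.7 \times 10^2,$$
so the randomized contribution to $c_1$ is roughly $770\,|\lambda_{r+1}^{(k)}|$. The bound therefore predicts a fast linear rate only once the Hessian is numerically low rank, with $|\lambda_{r+1}^{(k)}|$ several orders of magnitude below $|\lambda^* + \gamma|$; this is consistent with the empirical observation, discussed in Section 5, that training Hessians become numerically low rank after a few iterations.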
5. Numerical Experiments
We demonstrate the effectiveness of the LRSFN algorithm on the standard neural network benchmark problems MNIST (LeCun & Cortes, 2010) and CIFAR10 (Krizhevsky et al., 2010). We compare against standard first order methods as well as the Lanczos based SFN algorithm. Our LRSFN method outperforms these other methods.

Our focus is on comparing how the optimization methods perform on fixed neural network training problems, within a fixed number of neural network sweeps (we define a sweep as a forward or adjoint evaluation of the network). Finding optimal architectures for a given input-output representation is outside the scope of this work. We compare the performance of LRSFN against an existing implementation of the SFN algorithm (Fernandes, 2019), as well as Adam and gradient descent (GD). In our implementation of LRSFN we implement separate batching for the Hessian and gradient. In the SFN code of (Fernandes, 2019) this is not implemented, so we are constrained to use the same data for the gradient and Hessian in that method. We make some direct comparisons between the methods, where both use the same gradient and Hessian data. In general we compare all methods based on the number of neural network sweeps, as this is the primary computational cost in neural network training. For both MNIST and CIFAR10 we take 10,000 training data samples, from which the batches used in training are sampled. We set aside testing data samples in order to compute generalization (testing) errors. For all training problems we take a fixed $\gamma > 0$ and $r = 20$; $r$ is both the rank of the low rank approximation and the Krylov dimension.

For the MNIST classification problem, we compare LRSFN against SFN, Adam, and GD. For GD and LRSFN we use line search (LS) to select $\alpha_k$ at each iteration. For Adam we test step lengths $\alpha_k = 0.001, 0.01$ (larger step lengths produced highly oscillatory behavior). The loss function for the training problem is the cross-entropy function. For GD and LRSFN we use gradient batch size $N_{X_k} = 10{,}000$, and for LRSFN we use Hessian batch size $N_{S_k} = 1000$. For Adam and SFN we use $N_{X_k} = 1000$ for the gradient, and, as noted before, for SFN $N_{X_k} = N_{S_k}$. We use smaller batches for SFN so that it does not run out of neural network sweeps after just a few iterations. The network has one hidden layer with 100 units; the size of the configuration space is $d = 79{,}510$.

The methods are judged based on how well the trained neural network classifies unseen data. Figures 1, 2, and 3 show that both SFN and LRSFN outperformed the first order methods and were less prone to overfitting. It is clear from the training and testing error plots that the first order methods overfit. Both of the second order methods generalized better to unseen data, and LRSFN had the highest classification accuracy.

In our second set of numerical experiments, we train convolutional autoencoders on the MNIST and CIFAR10 data sets. For the convolutional autoencoder training problem, a least squares loss function is used to measure the error in reconstructing input images with a four layer autoencoder network. For the MNIST problem $d = 517$, and for the CIFAR10 problem $d = 1543$. Since the goal of the autoencoder is to compress information, the training problem is overdetermined.
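For concreteness, the following sketch (our illustration; the released code (O'Leary-Roseberry, 2020) presumably uses automatic differentiation rather than the finite differences shown, and all names here are hypothetical) shows two ingredients the experiments rely on: drawing independent gradient and Hessian batches of different sizes, and forming matrix-free Hessian products from gradient evaluations alone:

```python
# Sketch of separate gradient/Hessian batching and matrix-free Hessian products.
# `grad(w, idx)` is an assumed callback returning the gradient of the empirical
# risk over the samples indexed by idx.
import numpy as np

def draw_batches(n_data, n_grad, n_hess, rng):
    # Independent batches: N_{X_k} samples for the gradient and a (typically
    # much smaller) N_{S_k} samples for the Hessian.
    grad_idx = rng.choice(n_data, size=n_grad, replace=False)
    hess_idx = rng.choice(n_data, size=n_hess, replace=False)
    return grad_idx, hess_idx

def hess_mat_fd(grad, w, idx, V, eps=1e-5):
    # Central-difference Hessian-matrix product over the Hessian batch idx;
    # each column of V is probed independently, so this loop parallelizes.
    cols = [(grad(w + eps * v, idx) - grad(w - eps * v, idx)) / (2 * eps)
            for v in V.T]
    return np.column_stack(cols)
```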
Figure 1. MNIST classification training error vs. neural network sweeps
Figure 2. MNIST classification testing error vs. neural network sweeps
Figure 3. MNIST classification testing accuracy vs. neural network sweeps
Figure 4. MNIST autoencoder training error comparison, SFN vs. LRSFN
Figure 5. MNIST autoencoder testing error comparison, SFN vs. LRSFN
First we present a head-to-head comparison of LRSFN and SFN for the same Hessian and gradient batch sizes, $N_{X_k} = N_{S_k} = 1000$. Figures 4 and 5 demonstrate that, for the MNIST autoencoder training problem, LRSFN outperformed SFN. In what follows we present a general comparison of methods for the MNIST and CIFAR10 autoencoder training problems. Guided by the analysis, we advocate using large gradient batches for LRSFN, since the gradient sampling error shows up as a constant term in convergence rate bounds such as expression (12). We accordingly use $N_{X_k} = 10{,}000$ for LRSFN and GD, and use line search for both. For Adam and SFN we take $N_{X_k} = 1000$, and for Adam we choose step lengths $\alpha_k = 0.01, 0.1$. For both SFN and LRSFN we take the Hessian batch size to be $N_{S_k} = 1000$.

For the MNIST dataset, Figures 6 and 7 show that LRSFN performed better, both in terms of training and generalization error, than any of the other methods. Both Adam with $\alpha_k = 0.1$ and GD with line search performed better than the SFN algorithm on this particular problem. Adam with $\alpha_k = 0.01$ was too slow to converge in the allotted number of neural network sweeps.

Similarly, for the CIFAR10 dataset, Figures 8 and 9 show that LRSFN performs best in both testing and training error, although in this case gradient descent with line search performs almost as well. SFN is not far behind, and both Adam variants do not perform well and tend to overfit. Note that LRSFN is implemented with a non-monotone line search: after five backtracking iterations have been performed, if no descent direction is found, LRSFN takes the step anyway; this leads to the spikes that can be seen in Figure 8.
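A sketch of the non-monotone backtracking just described, under assumed values for the initial step and sufficient-decrease constant (neither is specified above):

```python
# Non-monotone backtracking line search: after max_backtrack halvings without
# sufficient decrease, the step is taken anyway (as described in the text).
def nonmonotone_linesearch(f, w, p, g, alpha0=1.0, c1=1e-4, max_backtrack=5):
    alpha, f0 = alpha0, f(w)
    for _ in range(max_backtrack):
        if f(w + alpha * p) <= f0 + c1 * alpha * g.dot(p):  # Armijo condition
            return alpha
        alpha *= 0.5
    return alpha  # no sufficient decrease found: accept the step anyway
```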
Figure 6. General comparison of training errors for MNIST autoencoder training
Figure 7. General comparison of testing errors for MNIST autoencoder training

As in other work, we observe the Hessian to be highly indefinite and of high rank at random initial guesses, but numerically low rank after a few iterations (Alain et al., 2019; Ghorbani et al., 2019; Sagun et al., 2016). In the paper of Dauphin et al., numerical results showed that the magnitude of the largest negative eigenvalues decreased over the iterations of the SFN algorithm (Dauphin et al., 2014). We observe similar behavior for the MNIST autoencoder problem: the largest negative eigenvalue decreased by four orders of magnitude between the initial guess and the 70th iterate (Figures 10 and 11). In contrast, Figure 12 shows that SFN is unable to escape the indefinite regions as well as LRSFN.
Figure 8. General comparison of training errors for CIFAR10 autoencoder training
Figure 9. General comparison of testing errors for CIFAR10 autoencoder training
Figure 10. Spectrum of the MNIST autoencoder training problem Hessian at the initial guess
The effect of escaping indefinite regions can be seen even more clearly for the CIFAR10 spectra in Figures 13 and 14. At the initial guess for CIFAR10 the Hessian is highly indefinite, while at the 70th iterate of LRSFN at least the first 100 eigenvalues are positive. Figure 15 demonstrates again that SFN is unable to escape indefinite regions as well as LRSFN.

In what follows we investigate the variance of the dominant 100 eigenvalues of the Hessian during training. We compute one spectrum with $N_{S_k} = 10{,}000$, and ten further subsampled Hessians with $N_{S_k} = 1000$. In Figures 16 through 21 we use "full" to denote the 10,000-sample Hessian, and "sub $i$" to denote the $i$th 1000-sample Hessian. The general trend in Figures 16 through 21 is that the subsampled Hessians agree with the "full" Hessian for the dominant eigenvalues, but diverge for the trailing eigenvalues. This supports our claim that LRSFN is well suited for use with subsampled Hessians, because a low rank approximation of the subsampled Hessian is a good approximation of the dominant modes of the full Hessian.
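The concentration of the dominant eigenvalues under subsampling can be mimicked in a toy simulation (entirely our own construction): by the central limit theorem, the mean of $n$ i.i.d. sample Hessians behaves like the population Hessian plus noise of size $O(1/\sqrt{n})$, which swamps the trailing eigenvalues long before the dominant ones:

```python
# Toy model of the subsampling study: dominant eigenvalues of averaged sample
# Hessians concentrate, trailing eigenvalues are lost in sampling noise.
import numpy as np

rng = np.random.default_rng(0)
d = 100
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
H_pop = U @ np.diag(10.0 ** (-np.arange(d) / 10.0)) @ U.T  # decaying spectrum

def mean_sample_hessian(n, sigma=0.05):
    # CLT surrogate: mean of n noisy sample Hessians ~ H_pop + sigma/sqrt(n) * W
    W = rng.standard_normal((d, d))
    return H_pop + sigma / np.sqrt(n) * (W + W.T) / np.sqrt(2)

lam_full = np.linalg.eigvalsh(mean_sample_hessian(10_000))[::-1]
lam_sub = np.linalg.eigvalsh(mean_sample_hessian(1_000))[::-1]
print(lam_full[:5], lam_sub[:5])    # leading eigenvalues match to a few percent
print(lam_full[-5:], lam_sub[-5:])  # trailing eigenvalues are pure noise
```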
Figure 11. Spectrum of the MNIST autoencoder training problem Hessian after 70 iterations of LRSFN
Figure 12. Spectrum of the MNIST autoencoder training problem Hessian after 70 iterations of SFN
Figure 13. Spectrum of the CIFAR10 autoencoder training problem Hessian at the initial guess
Figure 14. Spectrum of the CIFAR10 autoencoder training problem Hessian after 70 iterations of LRSFN
Figure 15. Spectrum of the CIFAR10 autoencoder training problem Hessian after 70 iterations of SFN
Figure 16. Subsampled Hessian spectra for MNIST at iteration 0
Figure 17. Subsampled Hessian spectra for MNIST at iteration 25

For the CIFAR10 dataset we investigate the effect of the Hessian batch size $N_{S_k}$ on the convergence of the LRSFN algorithm. Figure 22 shows that similar convergence was observed from $N_{S_k} = 50$ all the way to $N_{S_k} = 10{,}000$; in this case not much was gained by using the full batch. Figure 23 demonstrates that significant computational economy can be had by Hessian subsampling. Second order methods thus need not require much more work per outer iteration than first order methods: because the Hessian can be aggressively subsampled, they need be only slightly more expensive per iteration.
Figure 18. Subsampled Hessian spectra for MNIST at iteration 50
Figure 19. Subsampled Hessian spectra for CIFAR10 at iteration 0
Figure 20. Subsampled Hessian spectra for CIFAR10 at iteration 25
Figure 21. Subsampled Hessian spectra for CIFAR10 at iteration 50
Figure 22. LRSFN CIFAR10 training error vs. Newton iterations for different Hessian batch sizes $N_{S_k} \in \{50, 100, 500, 1000, 2500, 5000, 10000\}$
Figure 23. LRSFN CIFAR10 training error vs. neural network sweeps for different Hessian batch sizes
Code used to generate these results can be found in the following repository (O'Leary-Roseberry, 2020).
6. Conclusion
In this paper we presented the low rank saddle free Newton (LRSFN) method. A randomized method is used to form a low rank approximation of the loss Hessian; the absolute value of this low rank approximation is used in conjunction with the Sherman-Morrison-Woodbury formula to solve the regularized Newton linear system.

The low rank approximation captures the dominant subspace of the Hessian better than the Lanczos method used in the SFN method (Dauphin et al., 2014). Resolving the dominant subspace of the Hessian allows iterates to escape indefinite regions with large negative eigenvalues faster. When subsampling is used to construct the Hessian, our results show that the dominant subspace of the full Hessian is well approximated by the dominant subspace of subsampled Hessians, while the trailing subspace is less well approximated. Thus low rank approximation, which captures the dominant subspace and ignores the trailing subspace, yields better generalization than Lanczos approximation, which attempts to approximate the entire spectrum at once. Moreover, the randomized low rank algorithm is inherently parallel, unlike Krylov based methods.

We prove a semi-stochastic convergence rate bound for the method, which shows that LRSFN can achieve fast convergence in expectation with respect to the Hessian batching.

We numerically test the LRSFN algorithm on two standard neural network benchmark problems, MNIST and CIFAR10. Numerical results show that the LRSFN algorithm outperforms the SFN algorithm as well as standard first order methods in both classification and autoencoder training problems. The LRSFN algorithm achieves faster convergence and better generalization error than the other methods.
References
Alain, G., Roux, N. L., and Manzagol, P.-A. Negative eigenvalues of the Hessian in deep neural networks. arXiv preprint arXiv:1902.02366, 2019.

Bertsekas, D. P. Nonlinear Programming. Journal of the Operational Research Society, 48(3):334–334, 1997.

Bollapragada, R., Byrd, R. H., and Nocedal, J. Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis, 39(2):545–578, 2018.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.

Erdogdu, M. A. and Montanari, A. Convergence rates of sub-sampled Newton methods. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 3052–3060. Curran Associates, Inc., 2015.

Fernandes, D. SaddleFreeOptimizer. https://github.com/dave-fernandes/SaddleFreeOptimizer/, 2019.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842, 2015.

Ghorbani, B., Krishnan, S., and Xiao, Y. An investigation into neural net optimization via Hessian eigenvalue density. arXiv preprint arXiv:1901.10159, 2019.

Gill, P. E., Murray, W., and Wright, M. H. Practical Optimization. Academic Press, 1981.

Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.

Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). 2010.

LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246–1257, 2016.

Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., and Recht, B. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.

Martinsson, P. G. and Tropp, J. Randomized numerical linear algebra: Foundations & algorithms. arXiv preprint arXiv:2002.01387, 2020.

Murty, K. G. and Kabadi, S. N. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

O'Leary-Roseberry, T. hessianlearn. https://github.com/tomoleary/hessianlearn, 2020.

O'Leary-Roseberry, T., Alger, N., and Ghattas, O. Inexact Newton methods for stochastic non-convex optimization with applications to neural network training. arXiv preprint arXiv:1905.06738, 2019.

Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737, 2016a.

Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016b.

Saad, Y. Iterative Methods for Sparse Linear Systems, volume 82. SIAM, 2003.

Sagun, L., Bottou, L., and LeCun, Y. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.

Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C., and Mahoney, M. W. Sub-sampled Newton methods with non-uniform sampling. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, 2016.