Early Stopping without a Validation Set
Maren Mahsereci, Lukas Balles, Christoph Lassner, Philipp Hennig
Maren Mahsereci [email protected]
Max Planck Institute for Intelligent Systems, Spemannstraße, Tübingen, Germany
Lukas Balles [email protected]
Max Planck Institute for Intelligent Systems, Spemannstraße, Tübingen, Germany
Christoph Lassner∗ [email protected]
Max Planck Institute for Intelligent Systems, Spemannstraße, Tübingen, Germany
Philipp Hennig [email protected]
Max Planck Institute for Intelligent Systems, Spemannstraße, Tübingen, Germany
Abstract
Early stopping is a widely used technique to prevent poor generalization performance when training an over-expressive model by means of gradient-based optimization. To find a good point to halt the optimizer, a common practice is to split the dataset into a training and a smaller validation set to obtain an ongoing estimate of the generalization performance. We propose a novel early stopping criterion that is based on fast-to-compute local statistics of the computed gradients and that entirely removes the need for a held-out validation set. Our experiments show that this is a viable approach in the setting of least-squares and logistic regression, as well as neural networks.
1 Introduction

The training of parametric machine learning models often involves the formal task of minimizing the expectation of a loss (risk) over a population p(x) of data, of the form

L(w) = E_{x∼p(x)}[ℓ(w, x)],   (1)

where the loss function ℓ(w, x) quantifies the performance of the parameter vector w ∈ R^D on the data point x. In practice, though, the data distribution p(x) is usually unknown, and Eq. 1 is approximated by the empirical risk

L_D(w) = (1/M) Σ_{x∈D} ℓ(w, x).   (2)

Here, D denotes a dataset of size M = |D| with instances drawn independently from p(x). Often there is easy access to the gradient of ℓ, and gradient-based optimizers can be used to minimize the empirical risk. The gradient descent (GD) algorithm, for example, updates an estimate w_t for the minimizer of L_D according to w_{t+1} = w_t − α_t ∇L_D(w_t) with ∇L_D(w) = (1/M) Σ_{x∈D} ∇ℓ(w, x) and some hand-tuned or adaptive step sizes α_t. In practice, however, evaluating ∇L_D can become expensive for very large M, making it impossible to make progress in a reasonable time. Instead, stochastic optimization methods are used, which rely on coarser but much cheaper gradient estimates obtained by randomly choosing a mini-batch B ⊂ D of size |B| = m ≪ M from the training set and computing ∇L_B(w) = (1/m) Σ_{x∈B} ∇ℓ(w, x). The gradient descent update then becomes w_{t+1} = w_t − α_t ∇L_B(w_t), and the corresponding iterative algorithm is commonly known as stochastic gradient descent (SGD) [17].

∗ Equally affiliated with: Bernstein Center for Computational Neuroscience, Otfried-Müller-Str. 25, Tübingen, Germany

1.1 Overfitting, Regularization and Early Stopping

Since the risk L is virtually always unknown, a key question arising when minimizing the empirical risk L_D is how the performance of a model trained on a finite dataset D generalizes to unseen data.
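As a brief aside, the GD and SGD updates introduced above can be sketched in a few lines of numpy (a minimal illustration with a squared loss on synthetic linear data; the dataset, dimensions, and step size are our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset D of M points (phi, y) for a linear model.
M, D_dim = 200, 3
Phi = rng.normal(size=(M, D_dim))
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true + 0.1 * rng.normal(size=M)

def grad_loss(w, Phi_b, y_b):
    """Mini-batch gradient of the squared loss l(w, (phi, y)) = (y - phi @ w)^2;
    an unbiased estimate of grad L_D when the batch is drawn uniformly from D."""
    residual = Phi_b @ w - y_b
    return 2.0 * Phi_b.T @ residual / len(y_b)

# SGD: w_{t+1} = w_t - alpha_t * grad L_B(w_t) with |B| = m << M.
w, alpha, m = np.zeros(D_dim), 0.05, 16
for t in range(500):
    batch = rng.choice(M, size=m, replace=False)
    w = w - alpha * grad_loss(w, Phi[batch], y[batch])
```

With a constant step size, the iterate fluctuates around the empirical minimizer, but for this well-conditioned problem it lands close to the generating parameters.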
Performance can be measured by the loss itself or by other quantities, e.g., the mean accuracy in classification problems. Typically, to measure the generalization performance, a finite test set is entirely withheld from the training procedure, and the performance of the final model is evaluated on it. This test loss, however, is also only an estimator for L (in the same sense as the train loss), with a finite stochastic error whose variance drops linearly with the test set size. If the model used is overly expressive, minimizing the empirical risk (Eq. 2) exactly, or close to exactly, will usually result in poor test performance, since the model overfits to the training data. There is a range of measures that can be taken to mitigate this effect; textbooks like Bishop [3] give an overview of general concepts, and chapter 7 of Goodfellow et al. [6] gives a comprehensive summary targeted at deep learning. Some widely used concepts are briefly discussed in the following paragraphs.

Model selection techniques choose a model from a hypothesis class which, under some measure, has the closest level of complexity to the given dataset. They alter the form of the loss function ℓ in Eq. 2 over an outer optimization loop (first find a good ℓ, then optimize L_D), such that the final optimization on L_D is conducted on an adequately expressive model. This can, but does not need to, constrain the number of variables of the model. In the case of deep neural networks, the number of variables can even significantly exceed the number of training examples [8, 19, 20, 7].

If the dataset is not sufficiently representative of the data distribution, an opposite (although not incompatible) approach is to artificially enrich it to match a complex model. Data augmentation artificially enlarges the training set by adding transformations/perturbations of the training data.
This can range from injecting noise [18, 23] to carefully tuned contrast and colorspace augmentation [8]. Finally, a widely used provision against overfitting is to add regularization terms to the objective function that penalize the parameter vector w, typically measured by the ℓ2 or ℓ1 norm [9]. These terms constrain the magnitude of w. They tend to drive individual parameters toward zero or, in the ℓ1 case, enforce sparsity [3, 6]. In linear regression, these concepts are known as ridge regression (regularized least-squares) and LASSO regularization [21], respectively.

Despite these countermeasures, high-capacity models will often overfit in the course of the optimization process. While the loss on the training set decreases throughout the optimization procedure, the test loss saturates at some point and starts to increase again. This undesirable effect is usually countered by early stopping the optimization process, meaning that, for a given model, the optimizer is halted if a user-designed early stopping criterion is met. This is complementary to the model and data design techniques mentioned above and does not undo possible poor design choices of ℓ. It merely ensures that we do not minimize the empirical risk L_D of a given model beyond the point of best generalization. In practice, however, it is often more accessible to 'early-stop' a high-capacity model, for algorithmic purposes or because of restrictions to a specific model class, and early stopping is thus preferred or even enforced by the model designer.

Arguably the gold standard of early stopping is to monitor the loss on a validation set [14, 16, 15]. For this, a (usually small) portion of the training data is split off, and its loss is used as an estimate of the generalization loss L (again in the same sense as Eq. 2), leaving less effective training data to define the training loss L_D. An ongoing estimate of this generalization performance is then tracked, and the optimizer is halted when the generalization performance drops again.
This procedure has many advantages, especially for very large datasets, where splitting off a part has minor or no effect on the generalization performance of the learned model. Nevertheless, there are a few obvious drawbacks. Evaluating the model on the validation set at regular intervals can be computationally expensive. More importantly, the choice of the size of the validation set poses a trade-off: a small validation set has a large stochastic error, which can lead to a misguided stopping decision. Enlarging the validation set yields a more reliable estimate of generalization, but reduces the remaining amount of training data, depriving the model of potentially valuable information. This trade-off is not easily resolved, since it is influenced by properties of the data distribution (the variance Λ introduced in Eq. 3 below) and subject to practical considerations, e.g., redundancy in the dataset.

Recently, Maclaurin et al. [11] introduced an interpretation of (stochastic) gradient descent in the framework of variational inference. As a side effect, this motivated an early-stopping criterion based on the estimation of the marginal likelihood, which is done by tracking the change in entropy of the posterior distribution of w induced by each optimization step. Since the method requires estimation of the Hessian diagonals, it comes with considerable computational overhead.

Figure 1: Sketch of the early stopping criterion. Left: marginal distribution of function values defined by the left expression in Eq. 3. The mean L is shown in thick solid orange; shaded areas indicate two standard deviations. The full dataset defines one realization of this distribution, which is shown in dashed blue (same as L_D of Eq. 2). Middle: same as the left plot, but for the corresponding gradients. The pdf is defined by the right expression in Eq. 3, and the corresponding ∇L_D is shown in dashed blue. Right: orange and blue as in the middle plot; red shaded areas define the desired stopping regions (details in text). The vertical red shaded area shows the region around the minimizer of L where ∇L_D is likely to be zero; if gradients are within this area, the optimization process is halted. This can be translated into a simple stopping criterion (horizontal shaded area, see text for details); if gradients are within this area, the optimizer stops.

The following section motivates and derives a cheap and scalable early stopping criterion which is solely based on local statistics of the computed gradients. In particular, it does not require a held-out validation set, thus enabling the optimizer to use all available training data.

2 An Evidence-Based Early Stopping Criterion

This section derives a novel criterion for early stopping in stochastic gradient descent. We first introduce notation and model assumptions (§2.1) and motivate the idea of evidence-based stopping (§2.2). Section 2.3 covers the more intuitive case of gradient descent; Section 2.4 extends it to stochastic settings.
2.1 Notation and Model Assumptions

Let S be some set of instances sampled independently from p(x). The following holds for any such S, but specifically for the training set D, a subsampled mini-batch B, and any validation or test set. Using the same notation as in Eq. 2, L_S(w) and ∇L_S(w) are unbiased estimators of L(w) and ∇L(w), respectively. Since the elements in S are independent draws from p(x), by the Central Limit Theorem L_S(w) and ∇L_S(w) are approximately normally distributed according to

L_S(w) ∼ N(L(w), Λ(w)/|S|)   and   ∇L_S(w) ∼ N(∇L(w), Σ(w)/|S|)   (3)

with population (co-)variances Λ(w) = var_{x∼p(x)}[ℓ(w, x)] ∈ R and Σ(w) = cov_{x∼p(x)}[∇ℓ(w, x)] ∈ R^{D×D}, respectively. The (co-)variances of L_S(w) and ∇L_S(w) both scale inversely proportionally to the dataset size |S|. In the population limit |S| → ∞, Eq. 3 concentrates on L(w) and ∇L(w). To simplify notation, the argument (w) will occasionally be dropped, e.g., L_S(w) =: L_S.

2.2 Evidence-Based Stopping

The perhaps obvious but crucial observation at the heart of the criterion proposed below is that even the full, but finite, dataset is just a finite-variance sample from a population: by Eq. 3, the estimators L_D and ∇L_D are approximately Gaussian samples around their expectations L and ∇L, respectively. Figure 1 provides an illustrative, one-dimensional sketch. The left subplot shows the marginal distribution of function values (Eq. 3, left). The true, but usually unknown, optimization objective L (Eq. 1) is the mean of this distribution and is shown in solid orange. The objective L_D (Eq. 2), which is optimized in practice and is fixed by the training set D, defines one realization of this distribution and is shown in dashed blue.

In general, the minimizers of L and L_D need not be the same. Often, for a finite but large number of parameters w ∈ R^D, the loss L_D can be optimized to be very small.
When this is the case, the model tends to overfit to the training data and thus performs poorly on newly generated (test) data T ∼ p(x) with T ∩ D = ∅. A widely used technique to prevent overfitting is to stop the optimization process early. The idea is that variations of the training examples which do not contain information for generalization are mostly learned at the very end of the optimization process, where the weights w are fine-tuned. In practice, the true minimum of L is unknown; however, the approximate errors of the estimators L_D and ∇L_D are accessible at every position w. Local estimators for the diagonal of Σ(w) have been successfully used before [12, 2] and can be computed efficiently even for very high-dimensional optimization problems. Here, the variance estimator of the gradient distribution is denoted as ˆΣ(w) ≈ var_{x∼p(x)}[∇ℓ(w, x)] with

ˆΣ(w) = 1/(|S| − 1) Σ_{x∈S} (∇ℓ(w, x) − ∇L_S(w))^⊙2,

where ⊙2 denotes the elementwise square and S is either the full dataset D or a mini-batch B.

Since the minimizers of L and L_D are not generally identical, their gradients will also cross zero at different locations w. The middle plot of Figure 1 illustrates this behavior. Similar to the left plot, it shows a marginal distribution, but this time over gradients (right expression in Eq. 3). The true gradient ∇L is the mean of this distribution and is shown in solid orange. The one realization defined by the dataset D is shown in dashed blue and corresponds to the dashed blue function values L_D of the left plot. Ideally, the optimizer should stop in an area of w-space where possible minima are likely to occur if different datasets of the same size were sampled from p. In the sketch, this is encoded as the red vertical shaded area in the right plot.
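In code, this variance estimator is a one-liner over a stack of per-example gradients (a minimal sketch; the function name and the (|S|, D) array layout are our choices):

```python
import numpy as np

def gradient_variance_estimate(per_example_grads):
    """Elementwise gradient variance estimate Sigma_hat(w) of Section 2.2:
    Sigma_hat = 1/(|S|-1) * sum_{x in S} (grad l(w,x) - grad L_S(w))**2,
    computed from per-example gradients stacked as a (|S|, D) array."""
    g = np.asarray(per_example_grads)
    mean_grad = g.mean(axis=0)                  # the estimator grad L_S(w)
    sigma_hat = ((g - mean_grad) ** 2).sum(axis=0) / (g.shape[0] - 1)
    return mean_grad, sigma_hat

# Tiny usage example with |S| = 4 two-dimensional gradients.
grads = np.array([[1.0, 0.0], [0.8, 0.2], [1.2, -0.2], [1.0, 0.0]])
mean_grad, sigma_hat = gradient_variance_estimate(grads)
```

This is exactly the unbiased sample variance along the batch axis; in deep learning frameworks, the same quantity can be accumulated during the backward pass, as discussed in the implementation remark of Section 2.4.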
It is the area around the minimizer of L where ∇L ± two standard deviations still encloses zero. Since ∇L is unknown, however, this criterion is hard to use in practice and must be turned into a statement about ∇L_D. Denote the minimizer of L by w∗ = arg min_w L(w), and the population variance of the gradients at w∗ by Σ∗ := Σ(w∗). A similar criterion that captures this desideratum in essence is to stop when the collected gradients ∇L_D become consistently very small in comparison to the error Σ∗/M (red horizontal shaded area). Close enough to the minima of L_D and L, the two criteria roughly coincide (intersection of the red vertical and horizontal shaded areas). A measure for this is the probability

p(∇L_D | ∇L = 0) = N(∇L_D; 0, Σ∗/M),   (4)

of observing ∇L_D, were it generated by a true zero gradient ∇L = 0. This can be seen as the evidence of the trivial model class p(∇L) = δ(∇L), with p(∇L_D) = ∫ p(∇L_D | ∇L) p(∇L) d∇L (in principle, more general models can be formulated, which lead to a richer class of stopping criteria). If the gradients ∇L_D become too small, or 'too probable' (stepping into the horizontal shaded area), they are less likely to still carry information about ∇L and rather represent noise due to the finiteness of the dataset; then the optimizer should stop. Using these assumptions, the next section derives a stopping criterion for the gradient descent algorithm, which can then be extended to stochastic gradient descent as well.

2.3 Early Stopping for Gradient Descent

When using gradient descent, the whole dataset is used to compute the gradient ∇L_D in each iteration. Still, this gradient estimator has an error in comparison to the true gradient ∇L, which is encoded in the covariance matrix Σ. In practice, Σ is unknown; the variance estimator ˆΣ described in Section 2.2, however, is always accessible. In addition, Eq. 4 requires the gradient variance Σ∗ at the true minimum, which is unknown in practice.
Again, it can be approximated by Σ(w_t), the gradient variance at the current position w_t of the optimizer. This is a sensible choice if the optimizer is in convergence and already close to a minimum. Thus, at every position w, an approximation to p(∇L_D) of Eq. 4 is

p(∇L_D(w)) ≈ ∏_{k=1}^{D} N(∇L_D^k(w); 0, ˆΣ_k(w)/M).   (5)

Though a simplification, this allows for fast and scalable computation, since the dimensions are treated independently of each other. To derive an early stopping criterion based only on ∇L_D, we borrow the idea of the previous section that the optimizer should halt when the gradients become so small that they are unlikely to still carry information about ∇L, and combine it with well-known techniques from statistical hypothesis testing. Specifically: stop when

log p(∇L_D) − E_{∇L_D∼p}[log p(∇L_D)] > 0.   (6)

Here, E[·] is the expectation operator. According to Eq. 6, the optimizer stops when the logarithmic evidence of the gradients is larger than its expected value, roughly meaning that more gradient samples ∇L_D lie inside of some expected range. In particular, combining Eq. 5 with Eq. 6 and scaling by the dimension D of the objective gives

(2/D) [log p(∇L_D) − E_{∇L_D∼p}[log p(∇L_D)]] = 1 − (M/D) Σ_{k=1}^{D} [(∇L_D^k)² / ˆΣ_k] > 0.   (7)

This criterion (hereafter called the EB-criterion, for 'evidence-based') is very intuitive: if all gradient elements lay at exactly one standard deviation of their estimator's error from zero, i.e., (∇L_D^k)² = ˆΣ_k/M, then Σ_k (∇L_D^k)²/ˆΣ_k = Σ_k ˆΣ_k/(M ˆΣ_k) = D/M; thus the left-hand side of Eq. 7 would become zero and the optimizer would stop. We note on the side that Eq. 7 defines a mean criterion over all elements of the parameter vector w. This implicitly assumes that all dimensions converge on roughly the same time scale, such that weighing the fractions f_k := M·(∇L_D^k)²/ˆΣ_k equally is justified.
If an optimization problem involves parameters that converge at different speeds, for example different layers of a neural network (or the biases and weights inside one layer), it may be appropriate to compute one stopping criterion per subset of parameters with roughly similar time scales. In Section 3.4 we will use this slight variation of Eq. 7 for experiments on a multi-layer perceptron.

2.4 Extension to Stochastic Gradient Descent

It is straightforward to extend the stopping criterion of Eq. 7 to stochastic gradient descent (
SGD); the estimator ∇L_D is replaced with an even more uncertain ∇L_B by sub-sampling the training dataset at each iteration. The local gradient generation is

∇L_B = ∇L_D + η = ∇L + ν   with   η ∼ N(0, Σ_obs),   ν ∼ N(0, Σ/M + Σ_obs).   (8)

Combining this with Eq. 3 yields Σ/M + Σ_obs = Σ/m, and thus Σ_obs = ((M − m)/(mM)) Σ. Analogously to Eqs. 4, 5 and 7, this results in an early stopping criterion for stochastic gradient descent:

(2/D) [log p(∇L_B) − E_{∇L_B∼p}[log p(∇L_B)]] = 1 − (m/D) Σ_{k=1}^{D} [(∇L_B^k)² / ˆΣ_k] > 0.   (9)

Remark on implementation:
Computing the stopping criterion is straightforward, given that the variance estimate ˆΣ is available. In this case, it amounts to an element-wise division of the squared gradient by the variance, followed by an aggregation over all dimensions. Balles et al. [2, §4.2] comment on this issue and present a way of computing the variance estimate ˆΣ implicitly in contemporary software frameworks, increasing the computational cost of, e.g., a backward pass of a neural network by a factor of only about 1.25.

3 Experiments

For proof-of-concept experiments, we evaluate the EB-criterion on a number of standard classification and regression problems. For illustration and analysis, Sections 3.1 and 3.2 show a least-squares toy problem and large synthetic quadratic problems; Sections 3.3 and 3.4 deal with the more realistic settings of logistic regression on the well-known Wisconsin Breast Cancer Dataset (WDBC) [24] and a multi-layer perceptron on the handwritten digits dataset MNIST [10]. Section 3.5 contains experiments for logistic regression, as well as for a shallow neural network, on the SECTOR dataset [4]; the SECTOR dataset complements MNIST and WDBC in the sense that it has a much less favorable feature-to-datapoint ratio, increasing the gains in generalization performance when all available training data can be used.

Figure 2: Results for logistic regression on the Wisconsin Breast Cancer dataset. Results for the two variants are color-coded: red for validation set-based early stopping, blue for the evidence-based criterion of Eq. 7. The middle plot shows test loss versus the number of optimization steps for both methods. The top row shows validation loss; since the validation loss decreases over the whole optimization process, it does not induce a stopping point.
The bottom row shows the evolution of the stopping criterion, inducing a stopping decision indicated by the blue vertical bar.

Figure 3:
Least-squares toy problem. Top left: logarithmic losses vs. the number of optimization steps (colors in legend); shaded areas indicate two standard deviations ±√(Λ/|S|) of the noisy loss estimates computed during the optimization (Eq. 3). Bottom left: evolution of the EB-criterion (Eq. 7); the green vertical bar indicates the induced stopping point. For the steps marked with color-coded vertical bars, the model fit is illustrated in the right column; orange iteration: sub-optimal fit (ŷ(w) in solid dark blue) to the training data (gray crosses); green iteration: fit when the EB-criterion of Eq. 7 indicates stopping; red iteration: the model ŷ has already overfitted to the training data.

3.1 Least-Squares Toy Problem

We begin with a toy regression problem on artificial data generated from a one-dimensional linear function y with additive Gaussian noise. This simple setup allows us to illustrate the model fit at various stages of the optimization process, and it provides us with the true generalization performance, since we can generate large amounts of test data. We use a largely over-parametrized 50-dimensional linear regression model ŷ(w, x) = wᵀφ(x), which contains the ground-truth features (bias and linear) and additional periodic features with varying frequencies. The features φ(x) = [1, x, sin(a₁x), cos(a₁x), ..., sin(a_p x), cos(a_p x)]ᵀ with p = 24 obviously define a massively over-parametrized model for the true function and are thus prone to overfitting. We fit the model by minimizing the squared error, i.e., the loss function is ℓ(w, (x, y)) = (y − ŷ(w, x))². We use 20 samples for training and about 10 for validation, and then train the model using gradient descent. The results are shown in Figure 3; both the validation loss and the EB-criterion find an acceptable point to stop the optimization procedure, thus preventing overfitting.
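The toy experiment can be reproduced in spirit with a few lines of numpy (a hedged sketch: the frequencies, learning rate, and noise level are our guesses, not the paper's exact values; the stopping rule is Eq. 7 with the variance estimator of Section 2.2):

```python
import numpy as np

rng = np.random.default_rng(1)

# 1-d linear ground truth with additive Gaussian noise; M = 20 training points.
M = 20
x = rng.uniform(-1.0, 1.0, size=M)
y = 2.0 * x + 1.0 + 0.3 * rng.normal(size=M)

# Over-parametrized features [1, x, sin(a_1 x), cos(a_1 x), ..., cos(a_24 x)].
cols = [np.ones_like(x), x]
for a in range(1, 25):
    cols += [np.sin(a * x), np.cos(a * x)]
Phi = np.stack(cols, axis=1)                    # shape (20, 50)

D = Phi.shape[1]
w = np.zeros(D)
alpha = 0.01
eb_values, stopped_at = [], None

for t in range(20000):
    residual = Phi @ w - y                      # shape (M,)
    per_example = 2.0 * residual[:, None] * Phi # per-example gradients, (M, D)
    g = per_example.mean(axis=0)                # full-batch gradient, grad L_D
    sigma_hat = per_example.var(axis=0, ddof=1) # elementwise variance estimate
    # EB-criterion (Eq. 7): stop when 1 - (M/D) * sum_k g_k^2 / Sigma_hat_k > 0.
    eb = 1.0 - (M / D) * np.sum(g**2 / (sigma_hat + 1e-12))
    eb_values.append(eb)
    if eb > 0:
        stopped_at = t
        break
    w -= alpha * g
```

The criterion starts strongly negative (the gradients still carry signal) and rises toward zero as the informative gradient directions are exhausted; the loop can then halt without ever consulting a validation set.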
3.2 Synthetic Quadratic Problems

We construct synthetic quadratic optimization problems of the form L(w) = (w − w∗)ᵀB(w − w∗), where B ∈ R^{D×D} is a positive definite matrix and w∗ ∈ R^D is the global minimizer of L(w); the gradient is ∇L ∝ B(w − w∗). In this controlled environment we can test the EB-criterion on different configurations of eigen-spectra, for example uniform, exponential, or structured (a few large, many small eigenvalues). The matrix B is constructed by defining a diagonal matrix Γ ∈ R^{D×D}, which contains the eigenvalues on its diagonal, and a random rotation R ∈ R^{D×D}, which is drawn from the Haar measure on the D-dimensional unit sphere [5]; then B := RΓRᵀ. We artificially define the 'empirical' loss L_D(w) by moving the true minimizer w∗ by a Gaussian random variable ζ_D, such that L_D(w) = (w − w∗ + ζ_D)ᵀB(w − w∗ + ζ_D) with ζ_D ∼ N(0, Λ). Thus ∇L_D = ∇L + Bζ_D with Bζ_D ∼ N(0, BΛBᵀ), and we define ˆΣ/|D| := diag(BΛBᵀ). For the experiments we chose D = 10 as the input dimension and zero (w∗ = 0) as the true minimizer of L. Figure 4 shows results for three different types of eigen-spectra. The EB-criterion performs well across the different
Figure 4:
Synthetic quadratic problems for three different structures of eigen-spectra: uniform, exponential, structured. Middle row: logarithmic (exact) test loss in red and train loss in gray; bottom row: evolution of the EB-criterion, inducing a stopping decision indicated by the blue vertical bar.

types of partially ill-conditioned problems and induces meaningful stopping decisions; this worked well for different noise levels Λ (Figure 4 shows Λ = 10·I; note that the covariance matrix BΛBᵀ of the gradient is dense).

We noticed, however, that another assumption is crucial for the EB-criterion, which might also explain the slightly early stopping decisions for the logistic regressor on WDBC (Figure 2 in the subsequent section) and for full-batch GD on MNIST (Figure 7, column 1). Eq. 6 implicitly assumes that, on its path to the minimum of the empirical loss L_D, the optimizer passes by a better minimizer with higher generalization performance; this allows using variances only (in the form of ˆΣ) in the stopping criterion; there is no information about the bias (the direction of the shift w∗ − w∗_D), because this is fundamentally hard to know.

The assumption is usually well justified: primarily because otherwise early stopping would not be a viable concept in the first place, and second because over-fitting is usually associated with 'too large' weights (weights are initialized small, and regularizers that pull weights toward zero are often a good idea); on the way from small weights (under-fitting) to too large weights (over-fitting), optimizers usually pass a better point with weights of intermediate size. If the assumption is fundamentally violated, the EB-criterion will stop too early. We can artificially construct this setup by initializing the optimizer with weights that lead to an optimization path that does not exhibit any over-fitting; this is depicted in Figure 5.
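The construction of these test problems is easy to replicate (a sketch; the eigenvalue ranges are illustrative choices of ours, and we take the gradient of the quadratic as written, i.e., including its factor of 2):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 10

def haar_rotation(D, rng):
    """Random rotation R drawn from the Haar measure on the orthogonal group,
    via QR decomposition of a Gaussian matrix with sign-fixed diagonal [5]."""
    Q, R = np.linalg.qr(rng.normal(size=(D, D)))
    return Q * np.sign(np.diag(R))

# Three eigen-spectra in the spirit of Figure 4: uniform, exponential, structured.
spectra = {
    "uniform": np.linspace(0.5, 1.5, D),
    "exponential": np.exp(np.linspace(-3.0, 1.0, D)),
    "structured": np.array([10.0, 10.0] + [0.1] * (D - 2)),
}

R = haar_rotation(D, rng)
B = R @ np.diag(spectra["exponential"]) @ R.T    # B := R Gamma R^T

# 'Empirical' loss: shift the true minimizer w* = 0 by zeta_D ~ N(0, Lambda).
zeta = rng.normal(size=D)                        # Lambda = I here
grad_L = lambda w: 2.0 * B @ w                   # gradient of the true loss
grad_LD = lambda w: 2.0 * B @ (w + zeta)         # gradient of L_D
```

By construction, B has exactly the prescribed eigenvalues, and the gradient perturbation B·ζ has the dense covariance BΛBᵀ described in the text.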
The setup is identical to the one in Figure 4 (B and w∗, as well as ζ_D and w∗_D, are identical); the only difference is the initialization of the weights w for the optimization process. Since, with this initialization, the lowest point of L that can be reached by minimizing L_D is w∗_D, any early stopping decision will lead to under-fitting. In Figure 5, the (exact) test loss flattens out and does not increase again for all three configurations; the assumptions of the EB-criterion are violated, and it induces a sub-optimal stopping decision. Figure 6 illustrates these two scenarios in a 2D sketch.
Figure 5:
Synthetic quadratic problems for three different structures of eigen-spectra; subplots and colors as in Figure 4. Weights are initialized such that the model cannot overfit, as can be seen from the exact test loss (red), which flattens out but does not increase again; the assumptions of the EB-criterion are violated, and it induces a sub-optimal stopping decision.

3.3 Logistic Regression on WDBC

Next, we apply the EB-criterion to logistic regression on the Wisconsin Breast Cancer dataset. The task is to classify cell nuclei (described by features such as radius, area, symmetry, et cetera) as either malignant or benign. We conduct a second-order polynomial expansion of the original 30 features (i.e., features of the form x_i x_j), resulting in 496 effective features. Of the 569 instances in the dataset, we withhold 369, a relatively large share, for testing purposes in order to get a reliable estimate of the generalization performance. The remaining 200 instances are available for training the classifier. We perform two training runs: one with early stopping based on a validation set of 60 instances (reducing the training set to 140 instances) and one using the full training set and early stopping with the EB-criterion derived in Section 2.3.

If parameters converge at different speeds during the optimization, as indicated in Section 2.3, it is sensible to compute the criterion separately for different subgroups of parameters. Generally, if we split the parameters into N disjoint subgroups S_i ⊂ {1, ..., D} and denote D_i = |S_i|, the criterion reads

(1/N) Σ_{i=1}^{N} [1 − (M/D_i) Σ_{k∈S_i} (∇L_D^k)² / ˆΣ_k] > 0.

Since bias and weight gradients usually

Figure 6: Illustration of implicit early-stopping assumptions:
Contours of the true loss L(w) in red; contours of the optimizer's objective L_D(w) in gray; their minimizers w∗ and w∗_D are marked as crosses. The EB-criterion induces a stopping decision, which is roughly described by the blue shaded area. Blue solid line: path of an optimizer that passes by weights of better generalization performance than w∗_D; it is stopped by the EB-criterion when it enters the blue shaded area, resulting in better generalization performance. Red solid line: path of an optimizer that cannot overfit, since the weights were initialized such that w∗_D yields the best reachable generalization performance. The assumptions of the EB-criterion are violated, and it thus induces a sub-optimal stopping decision that might lead to under-fitting.

have different magnitudes, they converge at different speeds when trained with the same learning rate. For logistic regression, we thus treat the weight vector and the bias parameter of the logistic regressor as separate subgroups. Since the criterion above is noisy, we also smooth it with an exponential running average. The results are depicted in Figure 2. The effect of the additional training data is clearly visible, resulting in lower test losses throughout the optimization process. In this scarce-data setting, the validation loss, computed on a small set of only 60 instances, is clearly misleading (top plot). It decreases throughout the optimization process and thus fails to find a suitable stopping point. The bottom plot of Fig. 2 shows the evolution of the EB-criterion. The induced stopping point is not optimal (in that it does not coincide with the point of minimal test loss) but falls into an acceptable region. Thanks to the additional training data, the test loss at the stopping point is lower than any test loss attainable when withholding a validation set.
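The subgroup variant is a direct generalization of Eq. 7 and takes only a few lines (a sketch with made-up numbers; the grouping into 'weights' and 'bias' mirrors the logistic-regression setup described above):

```python
import numpy as np

def eb_criterion_grouped(grads, sigma_hat, groups, M):
    """Subgroup variant of the EB-criterion (Section 3.3):
    (1/N) * sum_i [ 1 - (M / D_i) * sum_{k in S_i} g_k^2 / Sigma_hat_k ] > 0,
    where the index sets S_i partition {0, ..., D-1}."""
    terms = [1.0 - (M / len(S_i)) * np.sum(grads[S_i] ** 2 / sigma_hat[S_i])
             for S_i in groups]
    return np.mean(terms)

# Made-up example: three 'weight' dimensions whose gradients have shrunk into
# their noise level, and one 'bias' dimension that is still converging.
g = np.array([0.01, -0.02, 0.015, 0.5])
s = np.array([0.04, 0.05, 0.03, 0.02])
groups = [np.arange(3), np.array([3])]
value = eb_criterion_grouped(g, s, groups, M=100)
stop = value > 0   # False: the bias group keeps the optimizer running
```

Averaging per-group terms prevents a large, fast-converging group from drowning out a small group (such as a bias vector) that still carries gradient signal.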
3.4 Multi-Layer Perceptron on MNIST

For a non-convex optimization problem, we train a multi-layer perceptron (MLP) on the well-studied problem of hand-written digit classification on the MNIST dataset (28 × 28 gray-scale images). We use an MLP with five hidden layers of 2500, 2000, 1500, 1000 and 500 units, respectively, ReLU activations, and a standard cross-entropy loss over the 10 outputs with soft-max activation (∼
12 million trainable parameters). We treat each weight matrix and each bias vector of the network as a separate subgroup, as described in Section 3.3. The MNIST dataset contains 60k training images, which we split into 40k-10k-10k train, test and validation sets. Again, the criterion is smoothed by an exponential running average. The results for full-batch gradient descent are shown in Column 1 of Figure 7, and
SGD runs with mini-batch size 128 and three different learning rates in Columns 2-4 of the same figure. The relatively large validation set (10k images) yields accurate estimates of the generalization performance. Consequently, the stopping points more or less coincide with the points of minimal test loss. The reduced training set size leads to only slightly higher test losses. Since the strength of the EB-criterion lies in utilizing the additional training data and in the fact that validation losses, too, are only inexact estimates of the generalization error, both of these points favor the early stopping criterion based on the validation loss in this setting. Still, for all three SGD runs (Columns 2-4 in Figure 7), the EB-criterion performs as well as or better than the validation set-induced method. An additional observation is that the quality of the stopping points induced by the EB-criterion varies between the different training configurations; it is thus arguably not as stable as in setups where the validation loss is very reliable. For gradient descent (full training set in each iteration, Column 1 of Figure 7), the EB-criterion performs reasonably well; however (and very similarly to the gradient descent run for the logistic regression on
40 2 4 6 8 − − . − . − .
40 2 4 6 8 − . . Figure 7:
Multi-layer perceptron on MNIST:
Column 1: full batch gradient descent with learningrate 0.01; columns 2-4
SGD with a mini-batch size of 128 and learning rates 0.003, 0.005 and0.01, respectively. Results are color-coded: red for validation set-based early stopping, blue for the EB -criterion. Middle row: logarithmic test loss versus the number of optimization steps for bothmethods; top row logarithmic validation loss; minimal point induces a stopping decision (red verticalbar); bottom row: evolution of the EB -criterion, stopping decision as blue vertical bar; details in text.WDBC in Figure 2) chooses to stop a bit too early, and thus does result in a slightly worse test setperformance. The difference is not very much (test loss red: − . , blue − . ) but it also clearlydoes not outperform the nearly exactly positioned stopping point induced by this well calibratedvalidation loss. Finally, we trained a logistic regressor and a shallow fully-connected neural network on the SECTORdataset[4]. It contains 6412 training and 3207 test datapoints with 55 197 features each, thus having aless favorable feature-to-datapoint ratio than for example MNIST (784 features vs. 60 000 datapoints).The features are extracted from web-pages of companies and the classes describe 105 differentindustry sectors. The shallow network has one hidden layer with 200 hidden units; the logisticregressor, thus contains ∼ . million, and the shallow net ∼ . million trainable parameters.Experiments are set up in the same style as the ones in Section 3.3 and 3.4. We use of thetraining data for the validation set; this yields 1282 validation examples and a reduced number of5130 training examples. Figure 8 shows results; columns 1-2 for the logistic regressor and columns3-4 for the shallow net. Since the size of the dataset is quite small, the gap between test lossesis quite large (middle row, full training set (blue), reduced train set, due to validation split (red)).Both architectures do not overfit properly, the test loss rather flattens out, although we trained botharchitectures for very long ( . 
· steps) and initialized weights close to zero. The EB -criterionis again a bit too cautious, and induces stopping when the test loss starts to flatten out; but since itallows utilization of all training data, it beats the validation set on both architectures. For the EB -criterion, we compute f k = m ( ∇ L k B ) / ˆΣ k for each gradient element k . This quantitycan be understood as a ‘signal-to-noise ratio’ and the EB -criterion takes the mean over the individual f k . As a side experiment, we employ the same idea in an element-wise fashion: we stop the trainingfor an individual parameter w k ∈ R (not to be confused with the full parameter vector w t ∈ R D . − . − . − . l ogv a li d a ti on l o ss − . − . − . − . l og t e s tl o ss − − . . number of steps in 1e+05 c r it e r i on number of steps in 1e+05 − · − · − . − . − . .
10 1 2 − − . . number of steps in 1e+05 number of steps in 1e+05 Figure 8:
Colums 1-2:
Logistic regression on SECTOR;
SGD with batch size 128 and learning rates0.03 and 0.003 respectively;
Colums 3-4:
Shallow net on SECTOR;
SGD with batch size 128 andlearning rates 0.03 and 0.003 respectively. Plots and colors as in Figure 7; text for details. − − . − . l og a r it h m i c l o ss . . . . number of steps in 1e+04 s w it c h e do ff p a r a m e t e r s traintest weightsbiasesfull net Figure 9:
Greedy element-wise stopping for a multi-layer perceptron on MNIST.
Columns:
SGD withbatch size 128 and learning rates 0.003, 0.005 and 0.01, respectively.
Top row logarithmic training(gray) and test loss (blue).
Bottom row fraction of weights where learning has been shut off by thegreedy element-wise stopping; each weight matrix (red), each bias vector (blue), full net (green).at iteration t ) as soon as f k falls below the threshold. Importantly, this is not a sparsification ofthe parameter vector, since w k is not set to zero when being switched off but merely fixed at itscurrent value. We smooth successive f k over multiple steps using an exponential moving average;these averages are initialized at high values, resulting in a warm-up phase where all weights are‘active’. Figure 9 presents results; intriguingly, immediately after the warm-up phase the training of aconsiderable fraction of all weights (10 percent or more, depending on the training configuration)is being stopped. This fraction increases further as training progresses. Especially towards the endwhere overfitting sets in, a clear signal can be seen; the fraction of weights where learning has beenstopped suddenly increases at a higher rate. Despite this reduction in effective model complexity, thenetwork reaches test losses comparable to our training runs without greedy element-wise stopping(test losses in Figure 7). The fraction of switched-off parameters towards the end of the optimizationprocess reaches up to 80 percent in a single layer and around 50 percent for the whole net.11 Conclusion
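The greedy element-wise scheme described above can be sketched in a few lines. This is our own illustrative code, not the authors' implementation: function names, the concrete threshold value, and the use of per-sample mini-batch gradients to estimate Σ̂ are assumptions.

```python
import numpy as np

def elementwise_snr(grads):
    """f_k = m * (mean gradient_k)^2 / Sigma_hat_k from an (m, D) array of
    per-sample gradients; Sigma_hat_k is the sample variance of element k."""
    m = grads.shape[0]
    mean = grads.mean(axis=0)
    var = grads.var(axis=0, ddof=1) + 1e-12  # guard against exactly-zero variance
    return m * mean**2 / var

def greedy_elementwise_step(w, grads, active, f_avg, lr=0.01, beta=0.9, thresh=1.0):
    """One SGD step with greedy element-wise stopping: an element whose
    exponentially smoothed f_k drops below `thresh` is frozen at its current
    value (not set to zero) and, being greedy, is never reactivated."""
    f_avg = beta * f_avg + (1 - beta) * elementwise_snr(grads)
    active = active & (f_avg >= thresh)          # once off, stays off
    w = w - lr * active * grads.mean(axis=0)     # frozen elements keep their value
    return w, active, f_avg
```

Initializing `f_avg` at a large value reproduces the warm-up phase in which all weights are active.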
Conclusion

We presented the EB-criterion, a novel approach to the problem of determining a good point for early stopping in gradient-based optimization. In contrast to existing methods, it does not rely on a held-out validation set and enables the optimizer to utilize all available training data. We exploit fast-to-compute statistics of the observed gradient to assess when it represents noise originating from the finiteness of the training set, rather than an informative gradient direction. The presented method is so far applicable in gradient descent as well as stochastic gradient descent settings and adds little overhead in computation time and memory consumption. In our experiments, we presented results for linear least-squares fitting, logistic regression and a multi-layer perceptron, proving the general concept to be viable. Furthermore, preliminary findings on element-wise early stopping open up the possibility to monitor and control model fitting at a higher level of detail.

References

[1] L. Balles and P. Hennig. Follow the signs for robust stochastic optimization. ArXiv e-prints, May 2017.
[2] L. Balles, J. Romero, and P. Hennig. Coupling adaptive batch sizes with learning rates. ArXiv e-prints, Dec. 2016.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines, 2011.
[5] P. Diaconis and M. Shahshahani. The subgroup algorithm for generating uniform random variables. Probability in the Engineering and Informational Sciences, 1(1):15-32, 1987.
[6] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), volume 25, pages 1097-1105, 2012.
[9] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems (NIPS), volume 4, pages 950-957, 1991.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[11] D. Maclaurin, D. Duvenaud, and R. P. Adams. Early stopping is nonparametric variational inference. Technical Report arXiv:1504.01344 [stat.ML], 2015.
[12] M. Mahsereci and P. Hennig. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems (NIPS), volume 28, pages 181-189, 2015.
[13] J. Martens. New perspectives on the natural gradient method. CoRR, abs/1412.1193, 2014. URL http://arxiv.org/abs/1412.1193.
[14] N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In Proceedings of the 2nd International Conference on Neural Information Processing Systems, pages 630-637. MIT Press, 1989.
[15] L. Prechelt. Early Stopping — But When?, pages 53-67. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-35289-8. doi: 10.1007/978-3-642-35289-8_5.
[16] R. Reed. Pruning algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740-747, 1993.
[17] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, Sep. 1951.
[18] J. Sietsma and R. J. Dow. Creating artificial neural networks that generalize. Neural Networks, 4(1):67-79, 1991.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[21] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[22] T. Tieleman and G. Hinton. RMSprop Gradient Optimization, 2015.
[23] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1096-1103. ACM, 2008.
[24] W. H. Wolberg, W. N. Street, and O. L. Mangasarian. UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set, Jan. 2011. URL http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

—Supplements—

RMSPROP
This section explores the differences and similarities of SGD + EB-criterion and RMSPROP. It is meant as a means of gaining better intuition rather than as a comparison of competitors; both methods were derived for different purposes and could in principle be combined.

EB-Criterion   The non-greedy elementwise EB-criterion can be formulated as

    c_t = β c_{t−1} + (1 − β)(1 − f_t^{EB-crit})
    w_{t+1} = w_t − α · I[c_t ≤ 0] ⊙ ∇L_B(w_t)          (10)

for some conservative smoothing constant β ∈ (0, 1), usually close to 1, a learning rate α, and the fraction f_t^{EB-crit} := |B| [∇L_B(w_t)^{⊙2} ⊘ Σ̂(w_t)] as defined in Section 3.6. The symbols '⊙' and '⊘' denote elementwise multiplication and division, respectively, and I[·] is the elementwise indicator function. In contrast to the greedy implementation of Section 3.6, where switched-off learning rates stayed switched off, Eq. 10 allows learning to be switched on again.
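Read literally, Eq. 10 amounts to masking the gradient with the smoothed criterion. A minimal sketch (our own naming and default constants, with Σ̂ estimated from per-sample gradients; not the authors' reference code):

```python
import numpy as np

def eb_step(w, grads, c, lr=0.01, beta=0.99):
    """One step of the non-greedy elementwise EB-criterion, Eq. 10.

    grads : (|B|, D) per-sample gradients; c : smoothed criterion c_{t-1}.
    Elements with c_t <= 0 take a plain SGD step; the rest are frozen for now
    but, unlike the greedy variant, may be switched on again later."""
    m = grads.shape[0]
    g = grads.mean(axis=0)
    sigma_hat = grads.var(axis=0, ddof=1) + 1e-12   # Sigma_hat(w_t), guarded
    f_eb = m * g**2 / sigma_hat                     # f_t^{EB-crit} = |B| grad^2 / Sigma_hat
    c = beta * c + (1 - beta) * (1.0 - f_eb)        # c_t
    w = w - lr * (c <= 0) * g                       # I[c_t <= 0] masks the update
    return w, c
```

A strong-signal element keeps c_t far below zero and continues to train; an element whose mini-batch gradients average to pure noise has f ≈ 0, so c_t drifts towards 1 and the element is eventually frozen.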
RMSPROP

RMSPROP [22] is a well-known optimization algorithm that scales learning rates elementwise by an exponential running average of gradient magnitudes; specifically:

    v_t = γ v_{t−1} + (1 − γ) ∇L_B(w_t)^{⊙2}
    w_{t+1} = w_t − α ∇L_B(w_t) ⊘ √v_t          (11)

again for some smoothing constant γ ∈ (0, 1), usually γ ≈ 0.9, and a learning rate α. Let z_t^max be the largest element of the factor z_t := 1 ⊘ √v_t; then the second line of Eq. 11 can be rewritten as

    w_{t+1} = w_t − α z_t^max (z_t / z_t^max) ⊙ ∇L_B(w_t) .          (12)

The fraction f_t^{RMSPROP} := z_t / z_t^max ∈ (0, 1] describes the scaling of the learning rates relative to the largest one: if the i-th element of f_t^{RMSPROP} is very small, the learning of the corresponding parameter is damped heavily relative to a full step of size α z_t^max. This can be interpreted as 'switching off' the learning of these parameters, similarly to the elementwise EB-criterion.
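Eqs. 11-12 translate directly into code. The sketch below (ours, not reference code; the `eps` guard is an assumption) also exposes the relative scaling f_t^{RMSPROP}:

```python
import numpy as np

def rmsprop_step(w, g, v, lr=0.001, gamma=0.9, eps=1e-12):
    """One RMSPROP step (Eq. 11), also returning the relative learning-rate
    scaling f_t^{RMSPROP} = z_t / z_t^max from Eq. 12."""
    v = gamma * v + (1 - gamma) * g**2   # running average of squared gradients
    z = 1.0 / (np.sqrt(v) + eps)         # z_t = 1 ./ sqrt(v_t)
    f_rms = z / z.max()                  # in (0, 1]; small values = heavy damping
    w = w - lr * g * z                   # identical to w - lr * z.max() * f_rms * g
    return w, v, f_rms
```

The two forms of the update are algebraically identical; f_rms simply reads off how strongly each coordinate is damped relative to the largest step. A single step on gradient elements of very different magnitude (say 1 and 100, starting from v = 0) moves both coordinates by the same amount, which anticipates the gradient-sign interpretation discussed in the next subsection.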
5.3 Connections and Differences

The following table gives a rough overview of the possible learning rates for each method:

    method           step size domain    maximal step size    minimal step size
    SGD              {α}                 α                    α
    SGD + EB-crit    {0, α}              α                    0 (only when converged)
    RMSPROP          (0, α z_t^max]      α z_t^max            > 0

The table shows that SGD + EB-criterion is a very minor variation of SGD, in the sense that it can additionally set the learning rate to zero, but only for converged parameters, in order to prevent overfitting. It does not improve the convergence properties of SGD while it is still training, since the sizes of the 'active' learning rates remain unchanged; in particular, it does not explicitly encode curvature or other geometric properties of the loss.

In contrast, RMSPROP also adapts the absolute value of the largest possible step at every iteration by a varying factor z_t^max, and scales the other steps relative to it. It is based on the steepest descent direction in w-space, measured by a weighted norm whose weight matrix is the inverse Fisher information matrix F_t at every position w_t. If the learned conditional distribution approximates the true conditional data distribution well, F_t also approximates the expected Hessian of the loss [13]. RMSPROP thus encodes geometric information, which allows for faster convergence compared to SGD.

Another interpretation of RMSPROP, much closer in spirit to the EB-criterion, has recently been formulated by Balles and Hennig [1]. It is possible to associate the RMSPROP update of Eq. 11 with local gradient and variance estimators, according to

    −α ∇L_B(w_t) ⊘ √v_t ≈ −α sign[∇L(w_t)] ⊘ √(1 + diag[Σ(w_t)] ⊘ (|B| ∇L(w_t)^{⊙2}))          (13)

since ∇L_B(w_t) ≈ E_{x∼p(x)}[∇L_B(w_t)] = ∇L(w_t), and

    v_t ≈ E_{x∼p(x)}[∇L_B(w_t)^{⊙2}] = ∇L(w_t)^{⊙2} + diag[Σ(w_t)] / |B| .          (14)

The right-hand side of Eq. 13 contains the term 1 ⊘ snr_t := diag[Σ(w_t)] ⊘ (|B| ∇L(w_t)^{⊙2}), which closely resembles the inverse of f_t^{EB-crit}. Gradients with a small signal-to-noise ratio snr_t thus get shortened, while noise-free gradients induce steps of equal(!) size −α · sign[∇L(w_t)] in every direction (note that these are independent of the magnitude of ∇L_B); RMSPROP can thus be seen as an elementwise stochastic gradient-sign estimator that is mildly damped when gradients are noisy.
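The estimator view of Eqs. 13-14 can be checked numerically. The following sketch uses a toy setup of our own: a single parameter whose mini-batch gradients are drawn i.i.d. as g ~ N(∇L, Σ/|B|), with RMSPROP's running average computed on fresh draws:

```python
import numpy as np

rng = np.random.default_rng(0)
grad_true, var_true, batch = 2.0, 8.0, 64          # stand-ins for ∇L, diag[Σ], |B|
alpha, gamma = 0.001, 0.9

v, steps = 0.0, []
for t in range(5000):
    g = rng.normal(grad_true, np.sqrt(var_true / batch))
    v = gamma * v + (1 - gamma) * g**2             # RMSPROP running average, Eq. 11
    if t >= 1000:                                  # discard burn-in
        steps.append(alpha * g / np.sqrt(v))       # realized RMSPROP step length

v_pred = grad_true**2 + var_true / batch                        # Eq. 14
snr = batch * grad_true**2 / var_true                           # |B| ∇L² / Σ
step_pred = alpha * np.sign(grad_true) / np.sqrt(1 + 1 / snr)   # Eq. 13
step_actual = float(np.mean(steps))
```

With these numbers the average realized step agrees with the prediction of Eq. 13 to within a few percent, as the approximation suggests.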
We have now explored algebraic as well as behavioral connections between SGD + EB-criterion and RMSPROP; the following paragraphs summarize the above points and list some noteworthy distinctions.

Geometry encoding: RMSPROP encodes geometric information about the objective and can be loosely associated with second-order methods that perform an approximate diagonal preconditioning at every iteration. Alternatively, it can be interpreted as a stochastic sign estimator, scaling each step with the inverse gradient magnitude and damping it in the presence of noise. In contrast, the EB-criterion is just a mild add-on to SGD; it does not alter learning rates due to curvature or other geometric effects.

Mild damping vs. stopping: The EB-criterion defines a strict threshold, justified by a statistical test, for when learning should be terminated. RMSPROP implements a vaguer version, in the sense that the optimizer should move somewhat 'less' in directions of uncertain gradients. Even if the signal-to-noise ratio snr_t falls well below the threshold of the stopping decision induced by the EB-criterion (roughly snr_t < 1), RMSPROP just reduces the step proportionally to the inverse of the square root, ∼ (1 + 1/snr_t)^{−1/2}: e.g. for snr_t = 0.1, where the EB-criterion stops, the RMSPROP step is reduced by a factor of only 1/√11 ≈ 0.3.

Smoothing and bias: The derivation of Eq. 13 omits the geometric smoothing contribution of γ that is present in the RMSPROP update of Eq. 11. In contrast, the EB-criterion relies on local (non-smoothed) computations of Σ̂(w_t); this is essential for a stopping decision, since large gradient samples are usually associated with large variances as well. Smoothing the latter would bias learning towards following large gradients; in the case of RMSPROP, it does bias towards larger steps for high-variance samples.

Footnote to the steepest-descent remark above: if the loss ℓ can be interpreted as a negative log-likelihood, this is an approximation to the steepest descent direction in distribution space, where an approximation to the KL-divergence defines the measure.

The views presented above give insight into the internal workings of RMSPROP as well as the EB-criterion. It is apparent that, even though RMSPROP shortens high-variance directions, they are not damped enough to prevent overfitting the objective to the data.
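The damping factor quoted in the 'Mild damping vs. stopping' paragraph is a one-liner to reproduce:

```python
import math

def rmsprop_damping(snr):
    """Relative RMSPROP step-length reduction (1 + 1/snr)^(-1/2) from Eq. 13."""
    return (1.0 + 1.0 / snr) ** -0.5

# At snr_t = 0.1 the EB-criterion has long decided to stop, yet RMSPROP still
# takes roughly 30% of a noise-free step: 1/sqrt(11) ≈ 0.30.
print(rmsprop_damping(0.1))
```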
For an empirical comparison, we run RMSPROP, SGD with the elementwise EB-criterion (as in Eq. 10), and an instance of vanilla SGD on a multi-layer perceptron on MNIST, similar to the setup in Section 3.4. For the SGD instance that uses the EB-criterion, the fraction of switched-off parameters is defined as

    P_t^{EB-crit} := (1/D) Σ_{i=1}^{D} I[c_{i,t} > 0] .          (15)

The percentage of 'switched-off' parameters for RMSPROP can be roughly described as the fraction P_t^{RMSPROP} of parameters whose f_t^{RMSPROP} (defined in Section 5.2) lies below a threshold T ∈ (0, 1):

    P_t^{RMSPROP} := (1/D) Σ_{i=1}^{D} I[f_{i,t}^{RMSPROP} < T] .          (16)

The same smoothing factor γ = β was used for both methods to allow a meaningful comparison. Figure 10 depicts the results. The first row shows training losses (light colors) and test losses (corresponding dark colors) of all three methods. Rows 3-7 show the evolution of P_t^{RMSPROP} for the five choices T ∈ {10^{−5}, 10^{−4}, 10^{−3}, 10^{−2}, 10^{−1}}; the second row shows P_t^{EB-crit}. As mentioned above, in contrast to the 'greedy' implementation of Section 3.6 (where switched-off learning rates stayed switched off), and for a more natural comparison to RMSPROP, we allowed learning rates to be switched on again as well. The results for P_t^{RMSPROP} and P_t^{EB-crit} are color-coded as in Figure 9 of the main paper: green for the full net, and additionally red for the weight matrices and orange for the biases per layer.

The test losses of vanilla SGD and SGD + EB-criterion are almost identical, while the training loss of SGD + EB-criterion is a bit more conservative than that of vanilla SGD. This is expected, since the EB-criterion ideally should not impair generalization performance, but might lead to larger training losses at convergence due to the overfitting prevention. Already at the beginning of training, SGD + EB-criterion switches off about 10-20% of all learning rates; after that, the fraction increases to about 50% (green line, second row). Since the EB-criterion only detects convergence, the curve is quite monotonic, exhibiting no significant jumps.

RMSPROP converges a bit faster, as expected, and the plots for P_t^{RMSPROP} are richer in structure. In particular, one layer seems to have significantly smaller learning rates, for both biases and weights, than the other layers. Overall, the difference between the largest learning rate and all others tends to increase roughly over the course of the optimization process (especially for T = 10^{−1}, green line, last row). There are also significant jumps in all the curves, in contrast to the rather monotonically increasing line of SGD + EB-criterion. This indicates nontrivial scaling of the absolute as well as relative sizes of the learning rates throughout the optimization process; also, no learning rate is smaller than 10^{−5} times the largest one at any iteration (third row, green line at exactly zero).

In the future, a combination of both (learning-rate scaling and overfitting prevention), i.e. combining the EB-criterion with advanced search directions like RMSPROP, is desirable.
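Eqs. 15 and 16 are plain counting operations over the per-parameter quantities; a sketch with our own naming:

```python
import numpy as np

def frac_switched_off_eb(c):
    """Eq. 15: fraction of parameters whose smoothed criterion c_{i,t} indicates
    that learning is currently switched off under Eq. 10 (i.e. c_{i,t} > 0)."""
    return float(np.mean(c > 0))

def frac_switched_off_rmsprop(f_rms, T):
    """Eq. 16: fraction of parameters whose relative learning-rate scaling
    f_{i,t}^{RMSPROP} lies below the threshold T."""
    return float(np.mean(f_rms < T))
```

Both functions take the full D-dimensional vectors (c_t or f_t^{RMSPROP}) and return a scalar in [0, 1], which is what is plotted over the optimization steps in Figure 10.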
Figure 10: Comparison of RMSPROP and SGD + EB-criterion on a multi-layer perceptron on MNIST; batch size is 120. Top row: logarithmic training loss (light colors) and test loss (corresponding dark colors) for vanilla SGD (gray), SGD + EB-criterion (red) and RMSPROP (blue). Row 2: fraction of weights P_t^{EB-crit} for which learning has been shut off by the elementwise stopping; each weight matrix (red), each bias vector (blue), full net (green). Rows 3-7: same as row 2, but for P_t^{RMSPROP}.