Investigating the interaction between gradient-only line searches and different activation functions
Dominic Kafka a, Daniel N. Wilke b

Centre for Asset and Integrity Management (C-AIM), Department of Mechanical and Aeronautical Engineering, University of Pretoria, Pretoria, South Africa.
a [email protected]
b [email protected]

Abstract
Gradient-only line searches (GOLS) adaptively determine step sizes along search directions for the discontinuous loss functions resulting from dynamic mini-batch sub-sampling in neural network training. Step sizes in GOLS are determined by localizing Stochastic Non-Negative Associated Gradient Projection Points (SNN-GPPs) along descent directions. These are identified by a sign change in the directional derivative from negative to positive along a descent direction. Activation functions are a significant component of neural network architectures as they introduce the non-linearities essential for complex function approximations. The smoothness and continuity characteristics of the activation functions directly affect the gradient characteristics of the loss function to be optimized. Therefore, it is of interest to investigate the relationship between activation functions and different neural network architectures in the context of GOLS. We find that GOLS are robust for a range of activation functions, but sensitive to the Rectified Linear Unit (ReLU) activation function in standard feedforward architectures. The zero derivative in ReLU's negative input domain can lead to the gradient vector becoming sparse, which severely affects training. We show that implementing architectural features such as batch normalization and skip connections can alleviate these difficulties and benefit training with GOLS for all activation functions considered.
Keywords:
Artificial Neural Networks, Gradient-only Line Searches, Learning Rates, Activation Functions, ResNet
Preprint submitted to Neural Networks, February 25, 2020

1. Introduction

The recent introduction of Gradient-Only Line Searches (GOLS) (Kafka and Wilke, 2019a) has enabled learning rates to be determined automatically in the discontinuous loss functions of neural network training with dynamic mini-batch sub-sampling (MBSS). The discontinuous nature of the dynamic MBSS loss is a direct result of successively sampling different mini-batches from the training data at every function evaluation, introducing a sampling error (Kafka and Wilke, 2019a). To determine step sizes, GOLS locates
Stochastic Non-Negative Associated Gradient Projection Points (SNN-GPPs), which manifest as sign changes from negative to positive in the directional derivative along a descent direction. This allows GOLS to strike a balance between the benefits of training using dynamic MBSS, such as 1) increasing the training algorithm's exposure to training data (Bottou, 2010) and 2) overcoming local minima (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2015), and the ability to localize optima in discontinuous loss functions (Kafka and Wilke, 2019b). This is in contrast to minimization line searches (Arora, 2011), which find false local minima induced by sampling error discontinuities. These have been found to be uniformly spread along the descent direction, rendering minimization line searches ineffective for determining representative step sizes (Kafka and Wilke, 2019b,a).

Previous work has shown that the Gradient-Only Line Search that is Inexact (GOLS-I) is capable of determining step sizes for training algorithms beyond stochastic gradient descent (SGD) (Robbins and Monro, 1951), such as Adagrad (Duchi et al., 2011), which incorporates approximate second order information (Kafka and Wilke, 2019a). GOLS-I has also been demonstrated to outperform probabilistic line searches (Mahsereci and Hennig, 2017), provided mini-batch sizes are not too small (< 50 for the investigated problems) (Kafka and Wilke, 2019). The gradient-only optimization paradigm has recently also shown promise in the construction of approximation models to conduct line searches (Chae and Wilke, 2019).

Some of the most important factors governing the nature of the computed gradients are: 1) the neural network architecture, 2) the activation functions (AFs) used within the architecture, 3) the loss function implemented, and 4) the mini-batch size used to evaluate the loss, to name a few. In this study, we concentrate on the influence of activation functions (AFs) on the training performance of GOLS for different neural network architectures. The stability characteristics of neural network losses with different activation functions have been extensively studied (Liu et al., 2017). Activation functions have a direct influence on the loss surface properties of a neural network and, by extension, dictate the nature of the gradients used in GOLS. Therefore, we empirically study how different activation functions affect training when implementing GOLS to determine learning rates. We also consider the effect of architectural features such as batch normalization (Ioffe and Szegedy, 2015) and skip connections (He et al., 2016) on training architectures with different AFs using GOLS.

The AFs considered in this study can be split primarily into two categories, namely:

1. The saturation class (Xu et al., 2016): including Sigmoid (Han and Morag, 1995), Tanh (Bergstra et al., 2009) and Softsign (Karlik, 2015); and
2. The sparsity class: including ReLU (Glorot and Bordes, 2011), leaky ReLU (Maas et al., 2013) and ELU (Clevert et al., 2016).

A minimal implementation sketch of both classes is given below.
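To make the two classes concrete, the following minimal sketch (our own illustration, not taken from the authors' code) defines the six AFs and evaluates their derivatives with PyTorch's automatic differentiation. The input range of [-5, 5] is an assumption chosen purely for illustration, mirroring the style of Figure 1.

    import torch

    # Saturation class: derivatives tend towards zero as the input tends to +infinity.
    def sigmoid(z):  return 1.0 / (1.0 + torch.exp(-z))
    def tanh(z):     return torch.tanh(z)
    def softsign(z): return z / (1.0 + torch.abs(z))

    # Sparsity class: unit derivative over the positive input domain.
    def relu(z):               return torch.clamp(z, min=0.0)
    def leaky_relu(z, a=0.01): return torch.where(z > 0, z, a * z)
    def elu(z, a=1.0):         return torch.where(z > 0, z, a * (torch.exp(z) - 1.0))

    # Derivatives via automatic differentiation, as plotted in Figure 1(b) and (d).
    z = torch.linspace(-5.0, 5.0, 101, requires_grad=True)
    for f in (sigmoid, tanh, softsign, relu, leaky_relu, elu):
        (df_dz,) = torch.autograd.grad(f(z).sum(), z)
        print(f.__name__, "max derivative =", float(df_dz.max()))

Running the loop reproduces the qualitative observation below: Sigmoid's maximum derivative is 0.25, while the remaining AFs reach a maximum derivative of 1.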
The respective function values and derivatives of both classes are shown in Figure 1 over a bounded input domain. The function values begin either at 0 (as for Sigmoid) or at -1 (Tanh and Softsign) and have an upper limit of 1. The function values and derivatives of the saturation class are generally smooth and continuous. The outlier among the saturation AFs chosen for this study is Softsign, which has a derivative that is continuous, but not smooth where the input is 0. The derivative characteristics of the Sigmoid AF are also notable among the saturation class AFs. The maximum derivative value of Sigmoid is 0.25, which is a factor of 4 lower than those of Tanh and Softsign, which have unit derivatives at the origin.

The original sparsity AF, ReLU, was introduced to recreate in artificial neural networks the sparseness and switching behaviour observed in neuroscientific studies (Glorot and Bordes, 2011). ReLU is characterized by having an output of zero in the negative input domain, and a linear output with unit gradient in the positive domain. This makes the function values of ReLU continuous but non-smooth, while the derivative is step-wise discontinuous at input 0. As with ReLU, the sparsity class is characterized by having linear outputs in the positive input domain, while the derivatives approach zero as the negative input domain tends to −∞. However, the derivatives in the positive input domain are always 1, which is a critical difference to the saturation class of AFs.

Figure 1: (a,c) Function values and (b,d) derivatives of the activation functions considered in our investigations, grouped into the (a,b) saturation and (c,d) sparsity classes respectively. The primary difference between the saturation and sparsity classes is the derivative in the positive input domain. The saturation class is characterized by derivatives that tend towards zero as the input tends toward +∞. Conversely, the sparsity class is characterized by unit derivatives over all of the positive input domain. This gives the sparsity class AFs behaviour that approximates a "switch", being either "on" or "off".

The leaky ReLU AF relaxes the absolute sparsity of ReLU by allowing a small constant derivative in the negative input domain. However, the non-smooth function value and the discontinuous derivative properties of ReLU are maintained. The ELU AF is a further modification that enforces smoothness in the function value and continuity in the derivative. However, the derivative remains non-smooth. The formulations of leaky ReLU and ELU are both claimed to improve training performance over ReLU (Clevert et al., 2016).

We consider the difference in loss function characteristics of the selected AFs for a simple neural network problem presented in Figure 2. The contours of the mean squared error (MSE) loss function are depicted for a single hidden layer feedforward neural network with 10 hidden nodes, fitted to the Iris dataset (Fisher, 1936). The plots are generated by taking uniformly spaced steps, α_1 and α_2, along the two random perpendicular unit directions, u_1 and u_2, as inspired by Li et al. (2017).

Figure 2: Contours of the mean squared error loss function along two random perpendicular unit directions, u_1 and u_2, computed for a single hidden layer feedforward neural network using different activation functions ((a) Sigmoid, (b) Tanh, (c) Softsign, (d) ReLU, (e) leaky ReLU, (f) ELU). The hidden layer consists of 10 hidden nodes and the architecture is applied to the classic Iris dataset (Fisher, 1936).

The contours of the Sigmoid AF represent a smooth loss function containing a single minimum, where the difference in function value over the sampled domain is ±15. The use of the Tanh AF results in a larger range in function value of over ±120 and the emergence of an additional local minimum. Although the loss function range of Softsign, at ±60, is between that of Sigmoid and Tanh, in this case the two local minima drift further apart in the sampled domain. A characteristic feature of the saturation class of AFs is that the loss function contours are smooth. Amongst the sparsity class of AFs, the ReLU AF retains the multi-modal nature of Tanh and Softsign, but demonstrates abrupt changes in contour characteristics. The modification of ReLU to include leaky gradients (leaky ReLU) softens these abrupt characteristics slightly, evidenced by the contours around the local minimum at
α_1 ≈ −2 and α_2 ≈ −2. However, ELU impacts the loss characteristics the most within the sparsity class, as it smooths out the contours of the loss and brings the two local minima closer together. ELU also results in a larger range of function values over the sampled domain, encompassing a change of ±300, compared to ±120 for ReLU and leaky ReLU.

It is clear that the choice of AF can significantly influence the characteristics of the loss function landscape. By extension, these changes translate to the respective loss function gradients. Previous studies have confirmed that loss functions with higher curvature cause SNN-GPPs to be more localized in space (Kafka and Wilke, 2019b). Consequently, the aim is to investigate and quantify the effect of the choice of AF on the performance of GOLS in determining step sizes for dynamic MBSS neural network training; and, if performance is adversely affected, to establish what can be done to improve or restore it.
2. Connections: Gradient-only line searches and activation functions
Consider neural network loss functions formulated as:

L(x) = (1/M) ∑_{b=1}^{M} ℓ(t̂_b(x); t_b),   (1)

where x ∈ R^p denotes the vector of weights parameterizing the neural network, the training dataset of M samples is given by {t_1, ..., t_M}, and ℓ(t̂_b(x); t_b) is the elemental loss evaluating x (via the neural network model prediction t̂_b(x)) in terms of the training sample t_b. Implementing backpropagation (Werbos, 1994) allows for efficient evaluation of the analytical gradient of L(x) with regards to x:

∇L(x) = (1/M) ∑_{b=1}^{M} ∇ℓ(t̂_b(x); t_b).   (2)

When L(x) and ∇L(x) are evaluated using the full training dataset of M samples, the smoothness and continuity characteristics of both L(x) and ∇L(x) are subject only to the smoothness and continuity characteristics of the AFs used in the neural network that constructs t̂_b(x).

In order to conduct neural network training, the loss function in Equation (1) is minimized. Consider the standard gradient-descent update, x_{n+1} = x_n − α ∇L(x_n) (Arora, 2011). An iteration, n, is performed when the parameters, x_n, are updated to a new state, x_{n+1}. To determine the step size, α, line searches are performed at every iteration, n, of the training algorithm in pursuit of a good minimum. Thus an iteration, n, encompasses exactly one line search. Line searches are conducted by finding the optimum of a univariate function, F_n(α), constructed from the current solution, x_n, along a search direction, d_n. If full-batch sampling is implemented in the univariate function F_n(α), we define:

F_n(α) = f(x_n(α)) = L(x_n + α d_n),   (3)

with corresponding directional derivative

F'_n(α) = dF_n(α)/dα = d_n^T · ∇L(x_n + α d_n).   (4)

Figures 3(a) and (b) show examples of F_n(α) and the corresponding directional derivative F'_n(α) for different AFs. For compactness, the explicit dependency on variables such as α is dropped in further discussions, unless specifically required. In Figure 3, F_n and F'_n are constructed along the normalized search direction d_n = (u_1 + u_2)/‖u_1 + u_2‖ in the illustrative example introduced in Section 1. Note how all instances of F_n are continuous. This means that minimization line searches (Arora, 2011) can be used to determine step sizes for training in full-batch sampled loss functions. The minimizer of F_n, namely α*, subsequently becomes the step size for iteration n, i.e. α_{n,I_n} = α*, where I_n is the number of function evaluations required during the line search to find the optimum at iteration n. This notation can also be used to describe fixed step sizes. In such cases, α_{n,I_n} is a predetermined constant value at every iteration, and I_n = 1.

However, modern datasets and corresponding network architectures have memory requirements that exceed current computational resources (Krizhevsky et al., 2012). Therefore, only a fraction of the training data, B ⊂ {1, ..., M} with |B| ≪ M, is loaded into memory at a given time. This is referred to as mini-batch sub-sampling (MBSS). Omitting training data to construct mini-batches invariably results in a sampling error associated with the MBSS loss, compared to the full-batch sampled loss. Some training approaches employ static MBSS, where mini-batches are fixed for at least the duration of a search direction, d_n (Friedlander and Schmidt, 2011; Bollapragada et al., 2017; Kungurtsev and Pevny, 2018; Kafka and Wilke, 2019a).
Figure 3: Univariate functions and directional derivatives of (a) F_n and (b) F'_n, using full-batch sampling, and (c) F̃_n and (d) F̃'_n, using dynamic mini-batch sub-sampling, for a selection of activation functions.

Alternatively, dynamic MBSS can be implemented, where a new mini-batch is resampled for every evaluation, i, of the loss function. It has been shown that dynamic MBSS, also referred to as approximate optimization (Bottou, 2010), can benefit a given training algorithm by exposing it to larger amounts of information per search direction (Kafka and Wilke, 2019a). We therefore define the dynamic MBSS approximations of L(x) and ∇L(x), respectively, as:

L̃(x) = (1/|B_{n,i}|) ∑_{b ∈ B_{n,i}} ℓ(t̂_b(x); t_b),   (5)

and

g̃(x) = (1/|B_{n,i}|) ∑_{b ∈ B_{n,i}} ∇ℓ(t̂_b(x); t_b).   (6)

Note that the mini-batch, B_{n,i}, sampled for an instance of L̃(x) and g̃(x) is consistent between the pair, and B_{n,i} is only resampled at a new evaluation of the pair L̃(x) and g̃(x) (Kafka and Wilke, 2019a). However, the act of abruptly alternating between the sampling errors associated with different mini-batches, B_{n,i}, interrupts the smoothness and continuity characteristics of the loss function, irrespective of the choice of AF used in the neural network architecture. This results in point-wise discontinuous loss, L̃(x), and gradient, g̃(x), functions. Although E[L̃(x)] = L(x) and E[g̃(x)] = ∇L(x) (Tong and Liu, 2005), the probability of encountering critical points, g̃(x*) = 0, is infeasibly low in dynamic MBSS loss functions. Additionally, the point-wise discontinuities between consecutive evaluations of L̃(x) result in the emergence of spurious candidate local minima (Wilson and Martinez, 2003; Schraudolph and Graepel, 2003; Schraudolph et al., 2007), which have been shown to be approximately uniformly distributed over the loss landscape (Kafka and Wilke, 2019b).

By substituting L̃(x) and g̃(x) into Equations (3) and (4) respectively, we obtain the dynamic MBSS univariate function F̃_n(α) and the corresponding directional derivative F̃'_n(α). Consider Figures 3(c) and (d) for a range of AFs. Note how sampling B_{n,i} uniformly from the full training dataset in Figure 3(c) results in local minima all along the search direction. Although the directional derivatives in Figure 3(d) get close to zero, none is a critical point, i.e. F̃'_n(α*) ≠ 0. This is best illustrated by the first local minimum along the search direction for ReLU. Using full-batch sampling, a clear local minimum, F_n(α*), can be observed in Figure 3(a)(ReLU), with the corresponding critical point, F'_n(α*) = 0, in Figure 3(b)(ReLU). Both are indicated by the first red circle from left to right along F_n and F'_n respectively. Using dynamic MBSS in Figures 3(c)(ReLU) and (d)(ReLU), none of the directional derivatives are critical points, F̃'_n ≠ 0, and local minima are located all along F̃_n, illustrated by small red points.

The discontinuities in F̃_n make minimization ineffective for determining step sizes in dynamic MBSS loss functions (Kafka and Wilke, 2019a). Historically, this has led to the popularity of subgradient methods for neural network training, using a priori determined step sizes (Schraudolph, 1999; Boyd et al., 2003; Smith, 2015).
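As an illustration of Equations (5) and (6), and of how F̃'_n(α) is evaluated in practice, the following PyTorch sketch draws a new mini-batch B_{n,i} at every call, so that consecutive evaluations along the same direction carry different sampling errors. It is a minimal sketch under our own naming (mbss_gradient, data and targets are illustrative), not the authors' implementation.

    import torch

    def mbss_gradient(model, loss_fn, data, targets, x, batch_size):
        """Dynamic MBSS gradient g~(x) of Eq. (6): a new mini-batch B_{n,i} is
        sampled at every call, so repeated evaluations are point-wise discontinuous."""
        idx = torch.randint(0, data.shape[0], (batch_size,))        # resample B_{n,i}
        torch.nn.utils.vector_to_parameters(x, model.parameters())  # load the trial point x
        model.zero_grad()
        loss_fn(model(data[idx]), targets[idx]).backward()
        return torch.nn.utils.parameters_to_vector(
            [p.grad for p in model.parameters()]).detach()

    def directional_derivative(model, loss_fn, data, targets, x_n, d_n, alpha, batch_size):
        """F~'_n(alpha) = d_n . g~(x_n + alpha * d_n), cf. Eqs. (4)-(6)."""
        g = mbss_gradient(model, loss_fn, data, targets, x_n + alpha * d_n, batch_size)
        return torch.dot(d_n, g)

Passing the full dataset as a single "mini-batch" recovers the deterministic full-batch quantities F_n and F'_n of Equations (3) and (4).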
Line searches were first introduced into dynamic MBSS loss functions by Mahsereci and Hennig (2017), who determine step sizes by using probabilistic surrogates along F̃_n to estimate the location of optima. An alternative approach is the use of Gradient-Only Line Searches (GOLS) (Kafka and Wilke, 2019a; Kafka and Wilke, 2019), which employ an extension of the gradient-only optimality criterion (Wilke et al., 2013; Snyman and Wilke, 2018), namely the Stochastic Non-Negative Associated Gradient Projection Point (SNN-GPP) (Kafka and Wilke, 2019a), given as follows:
Definition 2.1. SNN-GPP: A stochastic non-negative associated gradient projection point (SNN-GPP) is defined as any point, x_snngpp, for which there exists r_u > 0 such that

∇f(x_snngpp + λu)^T u ≥ 0, ∀ u ∈ {y ∈ R^p | ‖y‖ = 1}, ∀ λ ∈ (0, r_u],   (7)

with non-zero probability (Kafka and Wilke, 2019a).

Subsequently, a ball, B_ε, exists that bounds all possible SNN-GPPs of a surrounding neighbourhood, where each neighbourhood contains one true optimum:

Definition 2.2. B_ε: Consider a dynamic mini-batch sub-sampled loss function L̃(x) of a continuous, smooth and convex full-batch loss function L(x), such that each sampled mini-batch with associated L(x) that is used to evaluate L̃(x) has the same smoothness, continuity and convexity characteristics as L(x). Then there exists a ball,

B_ε(x) = {x ∈ R^p : ‖x − x*‖ < ε, 0 < ε < ∞, x ≠ x*},   (8)

that contains all the stochastic non-negative gradient projection points (SNN-GPPs), where x* is the minimum of L(x) (Kafka and Wilke, 2019a).

Along any univariate function, F_n, an SNN-GPP manifests as a sign change in the directional derivative from negative, F'_n < 0, to positive, F'_n > 0, along the descent direction. In the deterministic, full-batch setting, F_n, the SNN-GPP reduces to the critical point associated with a local minimum, i.e. F_n(α_snngpp) = F_n(α*), and the ball B_ε reduces to a single point. In the stochastic setting of dynamic MBSS losses, the ball B_ε has a finite range that depends on the variance of the directional derivative as well as the expected curvature δE[F̃'_n]/δα in a neighbourhood (Kafka and Wilke, 2019b).

The SNN-GPP and B_ε can be visually illustrated in Figure 3(d). With the Sigmoid activation function, there is a single neighbourhood, with a large ball B_ε containing all SNN-GPPs. There exist numerous sign changes from negative to positive along the descent direction in B_ε, due to the variance of F̃'_n and the slow change in F̃'_n along α. In the case of Tanh, ReLU and ELU, there are two neighbourhoods in which SNN-GPPs can be found. These neighbourhoods are separated by a maximum, as demonstrated by F'_n. Note how the SNN-GPP definition ignores maxima, as it considers only sign changes from negative to positive along the descent direction. The F̃'_n plots of Tanh, ReLU and ELU demonstrate how the size of the ball B_ε decreases with a decrease in variance and an increase in expected curvature. These F̃'_n plots also show that the characteristics of B_ε can change in different neighbourhoods of the loss function, and vary according to each AF.

It has been shown that an exact GOLS will converge to an SNN-GPP within the ball B_ε (Kafka and Wilke, 2019a). Therefore, GOLS determine the step size at iteration n of a training algorithm by locating an SNN-GPP such that α_{n,I_n} = α_snngpp. It has also been demonstrated that the Gradient-Only Line Search that is Inexact (GOLS-I) behaves in a manner consistent with Lyapunov's global stability theorem (Lyapunov, 1992; Kafka and Wilke, 2019). The latter proof was developed in the context of loss functions that are positive and coercive, and AFs that result in strict descent (Kafka and Wilke, 2019). Subsequently, this paper explores how GOLS perform with a larger range of AFs, and considers the implications a given AF may have on a neural network architecture for a given problem.
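To illustrate how a sign change from negative to positive in the directional derivative can be located in practice, the sketch below implements a simple gradient-only line search with a conservative strategy: grow the step until the directional derivative becomes non-negative, then bisect the bracketing interval. This is a simplified stand-in written for this discussion, not the authors' GOLS-I or GOLS-B implementations, and the bounds and tolerance are arbitrary assumptions.

    def gradient_only_line_search(dphi, alpha_0=1e-8, alpha_max=1e2, growth=2.0, tol=1e-6):
        """Locate an SNN-GPP along a descent direction using only directional
        derivatives dphi(alpha): grow the step until the sign changes from
        negative to positive, then bisect the bracketing interval."""
        a, b = 0.0, alpha_0
        while dphi(b) < 0.0:              # still descending: grow the step
            a, b = b, growth * b
            if b > alpha_max:
                return b                  # no sign change found within the allowed range
        while b - a > tol:                # bisect the sign-change bracket [a, b]
            m = 0.5 * (a + b)
            if dphi(m) < 0.0:
                a = m
            else:
                b = m
        return 0.5 * (a + b)              # step size alpha_n at the located SNN-GPP

With dynamic MBSS, every call to dphi (for example the directional_derivative sketch above) resamples a mini-batch, so the located sign change is an SNN-GPP within the ball B_ε rather than a deterministic critical point.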
3. Contribution
In this paper we empirically study the interaction between activation functions and neural network architectures when using Gradient-Only Line Searches (GOLS) to determine step sizes for dynamic MBSS loss functions. In our investigations we consider six activation functions, as introduced in Section 1, in the context of 1) shallow and deep feedforward classification networks, and 2) architectural features such as batch normalization and skip connections. To this end, we use 13 datasets to construct a range of training problems, where we primarily use the Gradient-Only Line Search that is Inexact (GOLS-I) (Kafka and Wilke, 2019a) to determine step sizes for training. Depending on the nature of the investigation, the Gradient-Only Line Search with Bisection (GOLS-B) (Kafka and Wilke, 2019a) and fixed step sizes are sporadically used as benchmarks against which the performance of GOLS-I can be compared. Overall, our investigations demonstrate that GOLS-I is effective in determining step sizes in a range of feedforward neural network architectures using different activation functions. However, we also give examples where activation function selection can significantly impede training performance with GOLS-I. We show that these difficulties can be alleviated by modifying the network architecture of a given problem. Therefore, this paper serves as a practical guide for neural network practitioners to improve the construction of network architectures that promote efficient training using GOLS.
4. Empirical Study
In our studies we consider three different types of training problems, namely:

1. Foundational problems: using small classification datasets with various feedforward neural network architectures.
2. MNIST with NetII: a training problem used by Mahsereci and Hennig (2017) to explore early stochastic line searches.
3. CIFAR10 with ResNet18: a state of the art architecture including skip connections (He et al., 2016) and batch normalization (Ioffe and Szegedy, 2015).

The foundational training problems are used to explore the influence of AFs on training performance in the context of 1) hidden layer height and depth, 2) GOLS-I and GOLS-B, as well as constant step sizes, and 3) full-batch versus dynamic MBSS training. The NetII problem is used to demonstrate the potential sensitivity of training problems to the choice of AF, and subsequently to show the corrective effect of batch normalization on training. Skip connections are another architectural consideration of interest with different AFs in the context of GOLS. The relationship between AFs and neural networks with skip connections is explored using the ResNet18 architecture with the CIFAR10 dataset, as adapted from the implementation by Liu (2018).

The datasets used in this study, along with their respective properties, are listed in Table 1. All datasets were scaled using the standard transform (Z-score). For the foundational problems (spanning datasets 1 to 11), the standard transform was applied to each individual input, D, while for MNIST and CIFAR10 the standard transform was applied over each image channel. MNIST has a single grey scale channel of 28x28 pixels (a total of 784 inputs), while CIFAR10 has 3 colour channels of 32x32 pixels each (a total of 3072 inputs). Since the small datasets are not separated into training and test datasets by default, we divided them manually into training, validation and test datasets with a ratio of 2:1:1 respectively. We choose this division to demonstrate that the manual construction of validation and test datasets results in representative, unbiased hold-out datasets; therefore, we expect similar performance between validation and test datasets. Conversely, both the MNIST and CIFAR10 datasets have predetermined test datasets, which are used directly.
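The preprocessing just described can be sketched as follows; the helper names are our own and the epsilon guard and seed are illustrative assumptions, not details from the study's code.

    import torch

    def standard_transform(x, dim=0):
        """Z-score scaling: subtract the mean and divide by the standard deviation.
        Use dim=0 for per-input statistics on tabular data, or dim=(0, 2, 3) for
        per-channel statistics on image tensors of shape (N, C, H, W)."""
        mean, std = x.mean(dim=dim, keepdim=True), x.std(dim=dim, keepdim=True)
        return (x - mean) / (std + 1e-8)   # epsilon guards constant features

    def split_2_1_1(num_samples, seed=0):
        """Random 2:1:1 split into training, validation and test index sets."""
        perm = torch.randperm(num_samples, generator=torch.Generator().manual_seed(seed))
        n_train, n_val = num_samples // 2, num_samples // 4
        return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]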
Table 1: Properties of the datasets considered for this study, listing each dataset's number (No.), name, author, number of observations, number of inputs, D, and number of classes, K.
Table 2 summarizes the 11 investigations performed in this study on the corresponding neural network training problems. For the foundational problems we implement shallow nets with the number of hidden nodes being half of the input dimension, 0.5D, and twice the input dimension, 2D. We also implement a deep architecture with 6 hidden layers of 2D nodes. We conduct training limited by iterations for all training problems except NetII, which is limited in the number of function evaluations, as prescribed by Mahsereci and Hennig (2017). We couple each of the training problems with the activation functions discussed in Section 2. Dynamic MBSS search directions, d_n = −g̃(x), were supplied by the line search stochastic gradient descent (LS-SGD) algorithm (Robbins and Monro, 1951; Kafka and Wilke, 2019a) for all training runs except those of the deep architecture applied to the foundational problems, which use Adagrad (Duchi et al., 2011; Kafka and Wilke, 2019a) instead. We adopt the convention whereby the name of a method is the combination of the line search used to determine the step size and the algorithm used to provide the search direction. For example, using GOLS-I to determine step sizes for Adagrad is denoted "GOLS-I Adagrad". As indicated in Table 2, step sizes for LS-SGD were predominantly determined using GOLS-I, while GOLS-B and fixed step sizes were alternatively implemented, depending on the investigation performed. The fixed step size, α_{n,I_n} = 0.05, used in investigation 4 was manually tuned to give good training performance for the foundational problems with the ReLU AF. The fixed step sizes chosen for investigation 8 were selected in order to highlight a range of training performance, from slow to unstable, for the NetII training problem with the ReLU AF.
Inv. | Datasets | Architecture (hidden layers) | BN  | Step sizes                     | Budget        | |B_{n,i}| | Loss
1    | 1-11     | [0.5D]                       | No  | GOLS-I SGD                     | 3000 Its.     | 32        | MSE
2    | 1-11     | [2D]                         | No  | GOLS-I SGD                     | 3000 Its.     | 32        | MSE
3    | 1-11     | [2D, 2D, 2D, 2D, 2D, 2D]     | No  | GOLS-I Adagrad                 | 3000 Its.     | 32        | MSE
4    | 1-11     | [2D]                         | No  | α_{n,I_n} = 0.05               | 3000 Its.     | –         | MSE
5    | 1-4      | [2D]                         | No  | GOLS-B SGD                     | 3000 Its.     | 64        | MSE
6    | 1-4      | [2D]                         | No  | GOLS-B SGD                     | 3000 Its.     | M         | MSE
7    | 12       | NetII [1000, ...]            | No  | GOLS-I SGD                     | 40,000 evals. | –         | –
8    | 12       | NetII [1000, ...]            | No  | fixed α_{n,I_n} (three values) | 40,000 evals. | –         | –
9    | 12       | NetII [1000, ...]            | Yes | GOLS-I SGD                     | 40,000 evals. | –         | –
10   | 13       | ResNet18                     | Yes | GOLS-I SGD                     | 40,000 Its.   | 128       | BCE
11   | 13       | ResNet18                     | No  | GOLS-I SGD                     | 40,000 Its.   | 128       | BCE

Table 2: Parameters and settings governing the implemented network architectures (with and without batch normalization (BN)) and their training in the various investigations (Inv.).
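To make the naming convention above concrete, the following sketch wires a search direction (here the dynamic MBSS steepest-descent direction of LS-SGD) to a gradient-only line search. It reuses the mbss_gradient and gradient_only_line_search helpers sketched in Section 2 and is only an outline of the procedure under our own naming, not the GOLS-I algorithm itself.

    import torch

    def train_gols_sgd(model, loss_fn, data, targets, batch_size, iterations, line_search):
        """Outline of a 'GOLS <line search> SGD' run: at each iteration n, build the
        dynamic MBSS steepest-descent direction d_n = -g~(x_n) and let the line search
        return the step size alpha_n along it."""
        x = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
        for n in range(iterations):
            g = mbss_gradient(model, loss_fn, data, targets, x, batch_size)
            d = -g                                       # LS-SGD search direction
            dphi = lambda alpha: float(torch.dot(
                d, mbss_gradient(model, loss_fn, data, targets, x + alpha * d, batch_size)))
            x = x + line_search(dphi) * d                # e.g. gradient_only_line_search
        torch.nn.utils.vector_to_parameters(x, model.parameters())
        return model

Supplying a different search direction (for example an Adagrad-scaled gradient) in place of d = -g gives the corresponding "GOLS-I Adagrad" style combination.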
All training runs were conducted using PyTorch 1.0 (pytorch.org, 2019). By default, He initialization (He et al., 2015) is used for networks implementing the ReLU and leaky ReLU AFs, while Xavier initialization (Glorot and Bengio, 2010) is used for networks with the remaining AFs considered in this study. For the foundational and NetII training problems, 10 training runs were conducted for each dataset and AF combination, whereas for CIFAR10 with ResNet18 a single training run per AF is performed.
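A minimal sketch of the initialization rule just described is given below; whether the normal or uniform variants were used is not stated here, so the normal variants are shown purely for illustration.

    import torch.nn as nn

    def init_by_activation(module, activation="relu"):
        """He initialization for ReLU/leaky ReLU layers, Xavier initialization for
        the remaining activation functions (a sketch of the rule described above)."""
        if isinstance(module, nn.Linear):
            if activation in ("relu", "leaky_relu"):
                nn.init.kaiming_normal_(module.weight, nonlinearity=activation)
            else:
                nn.init.xavier_normal_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    # Usage: model.apply(lambda m: init_by_activation(m, activation="tanh"))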
4.1. Software

In pursuit of transparency and reproducibility, we have made code available at https://github.com/gorglab/GOLS. The posted repositories include user-friendly versions of the source code used in our investigations. These include GPU compatible examples of all the GOLS methods developed in Kafka and Wilke (2019a), including GOLS-I and GOLS-B as used in this study. The example codes are self-contained and accessible, so as to create an environment suited to exploring the properties of GOLS.
5. Results
The results of our empirical study are ordered according to the training problems considered, namely 1) the foundational problems, 2) the MNIST dataset with the NetII architecture, and 3) the CIFAR10 dataset with the ResNet18 architecture. Note that the loss is used to evaluate training performance for the foundational problems, while the classification errors of the datasets are plotted to evaluate NetII and ResNet18. Note also that results given in terms of iterations are not representative of computational cost, as a number of function evaluations can be performed per iteration when line searches are implemented. However, giving results in terms of iterations allows for comparison between line searches with different costs, while assessing the reduction in loss (i.e. the quality) provided by the line searches.
The average training, validation and test losses with corresponding average step sizes for the foundational problems are given in Figure 4. The results of the foundational problems are averaged over all the respective datasets and their corresponding 10 training runs. This allows a representative trend to be demonstrated for each AF over a number of datasets. The results of the first analysis, with hidden layers of size 0.5D, are shown in the first column, Figure 4(a). Firstly, it is evident that step sizes estimated by GOLS-I result in effective training over a range of datasets and AFs. The mean training loss continually decreases, while that of both the validation and test datasets increases after 500 iterations, indicating the onset of overfitting. The consistency between validation and test losses suggests that both validation and test datasets are large enough to give an unbiased assessment of the neural networks' generalization performance.

Figure 4: Training, validation and test losses with corresponding log of step sizes for the foundational problems with various network architectures, including (a) investigation 1: single hidden layer (HL) networks with 0.5D hidden nodes, (b) investigation 2: single HL networks with 2D nodes, and (c) investigation 3: networks with six HLs of 2D nodes using GOLS-I Adagrad. For (a) to (c), GOLS-I was used to determine step sizes, while (d) investigation 4 implements LS-SGD with a fixed step size of α_{n,I_n} = 0.05 for a single HL network with 2D nodes.

In investigation 1, shown in Figure 4(a), ReLU is convincingly the worst performer. Conversely, the Sigmoid AF is the best performer in training, while the performance of the remaining AFs is almost indistinguishable. The best validation and test losses also belong to Sigmoid, with a marginal advantage over the remaining AFs (except ReLU, where the difference is significant). Interestingly, the step sizes of Sigmoid, presented in the last row of Figure 4(a), diverge significantly from those of the remaining AFs. We postulate that this phenomenon compensates for Sigmoid's small derivative magnitudes, see Figure 1(b). As mentioned, the maximum derivative magnitude of the Sigmoid AF is
0.25, while for the remaining AFs it is 1. Cumulatively, this results in progressively smaller gradients for Sigmoid as the training algorithm progresses closer towards an optimum. Subsequently, this prompts GOLS-I to increase the step sizes in the search for univariate SNN-GPPs. This behaviour is not common among the foundational problem datasets, but the larger step sizes of individual problems dominate the calculation of the average step sizes.

The ReLU AF was introduced to promote sparsity within a neural network, which is meant to approximate brain processes observed in neuroscience (Glorot and Bordes, 2011), where only a fraction of the network is used at a given time. However, ReLU was developed predominantly for large neural networks, which makes it unclear whether the poor performance observed in Figure 4(a) is due to the use of GOLS-I for determining learning rates, or due to the selected network architecture being too small.

We therefore perform training runs with increased hidden layer sizes of 2D for investigation 2, shown in Figure 4(b). Hence, the hidden layers of the network architectures trained in investigation 2 are 4 times larger than those considered in investigation 1. However, this increase has little effect on improving the training performance of ReLU. Instead, there is a shift between the relative performances of the remaining AFs. Subsequently, leaky ReLU outperforms the Sigmoid AF in training. However, this does not translate to generalization, where Sigmoid still outperforms leaky ReLU.

Presuming that the investigated architectures are still too small to cater for the sparsity that ReLU induces, we increase the number of hidden layers to 6, with 2D nodes each, in investigation 3. To aid training with deeper layers, we employ GOLS-I Adagrad for this analysis. The results shown in Figure 4(c) exhibit an average improvement in training loss over the shallower architectures, while the training performance of ReLU remains poor. For investigation 4, we instead implement LS-SGD with a fixed step size of α_{n,I_n} = 0.05 for all AFs. It is clear that the training performance of ReLU improves significantly with α_{n,I_n} = 0.05. However, this is not coincidental, as the fixed step size was manually tuned for this purpose. This indicates that the step sizes determined by GOLS-I are not effective for ReLU with the foundational training problems. Interestingly, the variance between the training performances of the remaining AFs is higher with LS-SGD using a fixed step size than when implementing GOLS-I SGD. The Sigmoid AF performs significantly worse, due to its comparatively smaller derivatives, as discussed above. This confirms that GOLS-I is capable of adapting its step sizes to the properties of different AFs in feedforward network architectures, with the exception of ReLU.
What makes ReLU the outlier among the considered AFs is that it enforces sparsity in an absolute manner, i.e. the derivative in the negative input domain is exactly zero. Previous studies have shown that the use of ReLU with the MSE loss can cause L̃(x) and g̃(x) to be zero over a range of x (Kafka and Wilke, 2019b). This breaks the assumptions of positivity, coerciveness and strict descent, namely those of Lyapunov's global stability theorem, which govern the convergence of GOLS (Kafka and Wilke, 2019). The reason for GOLS-I's inability to train feedforward networks with ReLU is as follows: conducting updates with step sizes that are too large, as is possible in an inexact line search such as GOLS-I, can cause numerous nodes within a ReLU network to enter the negative input domain. If the step sizes are large enough, the affected nodes remain "off" irrespective of the variance in the incoming data. Subsequently, large parts of the network may be "off" permanently, causing the gradient vector to have numerous zero-valued partial derivatives, i.e. become sparse. This results in no change to weights with zero partial derivatives during update steps, effectively terminating training for the deactivated portion of the architecture.

Figure 5: Training loss, test loss and the log of step sizes for a subset of datasets (DS) from the foundational problems dataset pool, trained using GOLS-B SGD (a,b) with mini-batch size |B_{n,i}| = 64 for investigation 5, and (c,d) using full batches, M, in investigation 6. Results are shown for (a,c) dataset 1 and (b,d) datasets 2-4.

These considerations, as well as the results observed in Figure 4, suggest that the step sizes determined by GOLS-I are too large to result in stable training with ReLU. A contributing factor is that GOLS-I's initial accept condition (Kafka and Wilke, 2019a) allows univariate SNN-GPPs to be overshot in a controlled manner. Although overshooting has been shown to aid the training performance of LS-SGD (Kafka and Wilke, 2019a; Kafka and Wilke, 2019), it may be too aggressive to be implemented with ReLU AFs. Therefore, investigations 5 and 6 focus on whether GOLS-B (Kafka and Wilke, 2019a), with a conservative SNN-GPP bracketing strategy, is capable of determining step sizes for LS-SGD in feedforward networks with ReLU AFs. The conservative bracketing strategy grows the minimum step size by a factor of 2 until a positive directional derivative is observed. This increases the probability of encountering SNN-GPPs in B_ε that are closest to x_n along the descent direction.

In investigations 5 and 6, we more closely consider the loss functions of the first 4 foundational problems with ReLU AFs. The focus is primarily on the change in characteristics between dynamic MBSS and full-batch sampled losses. Therefore, we find the largest mini-batch size of a power of 2 that allows MBSS for the first 4 problems. Due to separating the problem data into training, validation and test datasets, the training dataset sizes for the first 4 problems are limited, and the largest common mini-batch size of a power of 2 is |B| = 64, which is implemented for investigation 5. Subsequently, investigation 6 uses full-batch sampling for the same datasets. The results for both investigations are shown in Figure 5, where all step sizes are determined using GOLS-B. Satisfied that the validation and test datasets have been chosen representatively in Figure 4, we omit the validation dataset losses in Figure 5.
We also separate the results of dataset 1, in Figures 5(a) and (c), from those of datasets 2-4, in Figures 5(b) and (d), as these have distinct training characteristics.

The results for training on dataset 1, which is the smallest dataset considered among the foundational problems, with |B_{n,i}| = 64 on the single hidden layer architecture with 2D nodes, are shown in Figure 5(a). Here, training the ReLU architecture is still unstable, while training the networks of the other AFs is effective. Note that the step sizes of all AFs are significantly smaller and noisier with GOLS-B SGD compared to those of GOLS-I SGD. This is due to a combination of the conservative bracketing strategy and the lack of overshoot, compared to GOLS-I. However, this conservative approach aids in successfully training feedforward networks with the ReLU AF for datasets 2-4 in Figure 5(b). This confirms that the larger step sizes determined by GOLS-I led to the divergent training behaviour observed in Figure 4. However, this improved stability comes at the expense of training performance, as GOLS-B is an exact line search, which uses an order of magnitude more function evaluations per iteration compared to GOLS-I (Kafka and Wilke, 2019a).

The full-batch sampled loss, or true loss function, results for investigation 6 are plotted in Figures 5(c) and (d). The training performance of dataset 1 with ReLU in Figure 5(c) shows little significant improvement in comparison to Figure 5(a). The average training loss is only marginally better, with a mean drop in loss of 0.022 for full-batch training. This indicates that the deterministic loss function pertaining to dataset 1 with ReLU has descent directions leading into flat planes, which trap the training algorithm.

Generally, the determined step sizes for all AFs are significantly more stable for the full-batch case than with |B_{n,i}| = 64. There are no incidences of minimum step sizes, as all search directions are deterministic descent directions. The variance that remains is due to the qualities of the deterministic descent direction itself, where the step size to the SNN-GPP along each descent direction is different. However, the step sizes of ReLU networks are particularly noisy, as GOLS-B SGD contends with the boundary between obtaining gradients that are dense or sparse. Compared to dataset 1, this variance in step size is significantly reduced for datasets 2-4, which indicates that the boundary between obtaining dense and sparse gradient vectors along a descent direction is less abrupt, prompting less aggressive changes in step sizes. This is echoed by the training performance for ReLU with datasets 2-4, which is competitive with that of the remaining AFs. This suggests that some ReLU feedforward network architectures construct small positive directional derivatives that "push" the line search back from zero-valued domains in the loss function, an observation confirmed by Kafka and Wilke (2019b).

An additional interesting observation between Figures 5(b) and 5(d) is that the minimum test losses of most AFs are lower in the dynamic MBSS case than when using full-batch training. It is clear that training stagnates in Figure 5(b) compared to Figure 5(d), where the training loss decreases at a more rapid rate. However, as training slows in Figure 5(b), the test losses for all AFs except Sigmoid are lower than their full-batch equivalents in Figure 5(d). This indicates that dynamic MBSS together with a conservative gradient-only line search can either slow training (as is the case for Sigmoid) or act as a regularizer during training (as is the case for all other AFs), which warrants future study.

This investigation has demonstrated that successfully determining step sizes using GOLS for feedforward neural networks with ReLU AFs is sensitive to both the line search strategy used and the architecture of the given problem. In the examples shown, GOLS-B effectively resolved step sizes for LS-SGD in larger networks with ReLU AFs, while GOLS-I SGD was unable to conduct reliable training on the same problems. Step sizes that are too large, and the variance produced by dynamic MBSS, lead to detailed features such as small positive directional derivatives being ignored. This impedes the ability of GOLS-I to determine effective step sizes for the ReLU AF. GOLS-B resolves more conservative step sizes, but bears a high computational cost. Instead, relaxing the absolute sparsity of ReLU, by implementing the leaky ReLU or ELU AF, significantly improves GOLS-I's ability to determine effective step sizes in dynamic MBSS losses for the feedforward architectures considered in this study. The non-zero derivatives of ELU and leaky ReLU in their negative input domains ensure that g̃(x) remains dense. This guarantees that all weights participate in update steps, and allows the training algorithm to recover from previous step sizes that were too large.

Next, we present a larger training problem, where the choice of AF significantly influences training performance using GOLS-I SGD.
The NetII architecture, in combination with the well-known MNIST dataset, has been used to demonstrate the ability of probabilistic line searches to determine learning rates in dynamic MBSS loss functions (Mahsereci and Hennig, 2017). Mahsereci and Hennig (2017) only implement the Tanh AF for this problem, which we extend by analysing all the AFs considered in Section 2 for investigation 7. In addition, we quantify the effect of batch normalization in investigation 9. The training and test classification errors, as well as accompanying step sizes for the different AFs, are given in Figure 6. We remind the reader that the results presented for investigations 7-8 are given in function evaluations, to be consistent with Mahsereci and Hennig (2017).

In investigation 7, the training performance between different AFs varies significantly for the standard NetII architecture with GOLS-I SGD in Figures 6(a)-(c). It is immediately evident that ReLU is also unstable in the standard NetII architecture using GOLS-I SGD. The overall best performance is obtained using leaky ReLU, followed by Sigmoid. The next best performer is ELU, with Softsign marginally outperforming Tanh. Leaky ReLU trains particularly noisily in the first 5,000 function evaluations, before establishing a clear lead over the remaining AFs. Sigmoid trains slower than Tanh, Softsign and ELU during the first 10,000 function evaluations, but outperforms these thereafter. The relative performances of the AFs generalize, as they are also reflected in the test classification errors. There are a few indications that the resulting loss functions of the leaky ReLU and Sigmoid AFs have different characteristics to those of Tanh, Softsign and ELU, namely: 1) the significantly lower training and test errors after 40,000 function evaluations, 2) the higher variance in error during training, and 3) step sizes that are up to two orders of magnitude larger than those of the rest. We speculate that this is due to specific interactions between activation function properties and the neural network architecture.

Both Sigmoid and leaky ReLU are AFs that "activate" in the positive input domain and tend towards zero in the negative input domain.
Figure 6: Training and test classification errors with corresponding log of step sizes obtained for the MNIST dataset with the NetII architecture for different activation functions. Step sizes for LS-SGD are determined by (a)-(c) GOLS-I for the standard NetII architecture with all considered AFs, (d)-(f) a range of fixed step sizes for the standard NetII architecture with only ReLU, and (g)-(i) GOLS-I for NetII with batch normalization and all considered AFs.

Therefore, we postulate that the ability of these AFs to approximate zero function value outputs, while having non-zero derivatives, constructs loss function landscapes that are easier to traverse with GOLS-I SGD. This supports a study by Xu et al. (2016), which found that a "penalized Tanh", which reduces the output magnitudes of the function values and derivatives of Tanh in the negative input domain, performed competitively with sparsity class activation functions in deep convolutional neural network training. We postulate that the ability to significantly reduce the absolute function value of a node decreases the information passed forward into the network. Subsequently, this reduces the sensitivity of nodes further downstream to the nodes that have low function value outputs. This contributes towards uncoupling parameters in the optimization space, x, thus changing the nature of the loss function and resulting in the unique training characteristics observed for Sigmoid and leaky ReLU.

Since ReLU demonstrates the same training difficulties with GOLS-I SGD as investigated in Section 5.1, we consider training ReLU using LS-SGD with a range of fixed step sizes for investigation 8 in Figures 6(d)-(f). Again, training performance improves for ReLU using LS-SGD with fixed step sizes over GOLS-I SGD. However, the variance between each of the 10 training runs performed increases proportionately to the fixed step size. Compared to the relatively consistent performance of the other AFs with GOLS-I SGD, this result is unsatisfactory. Although individual training runs with larger fixed step sizes can train competitively, performance remains sensitive to the choice of α_{n,I_n}. This highlights that training using fixed step sizes is not a practical alternative to using GOLS-I with LS-SGD in feedforward neural networks with ReLU activation functions. As argued in Section 5.1, implementing GOLS-B instead is computationally too demanding. It is therefore preferable to explore alternative means by which the benefits of GOLS-I can be extended to ReLU architectures.

The problem of training ReLU feedforward architectures using GOLS-I, as considered in Section 5.1, is summarized in Figure 7. At initialization, the distribution of information entering ReLU's input domain is centred around 0 (He et al., 2016), see Figure 7(a). This allows the "switching" mechanism proposed for ReLU to occur, whereby some data samples cause the node to "fire" (inputs in the positive domain) and others keep the node dormant (inputs in the negative domain). As discussed, large variance in gradient norms and the nature of a line search can cause step sizes that are spuriously too large, resulting in large changes in weight updates. Such updates can shift all the training dataset variance propagated through the network far into the negative or positive input domain of ReLU, see Figure 7(b). If all the data variance is in the positive input domain (scenario 2), gradients are available to allow subsequent update steps to correct for this shift.
However, if all the dataset variance is in the negative input domain (scenario 1), the zero derivative of ReLU prohibits information from travelling through the activation function to subsequent nodes. When this occurs to a significant portion of the network architecture, the flow of information through the network is significantly hampered and training stagnates.

Figure 7: Schematic of the distribution of incoming information to a ReLU activation function (a) at initialization, (b) without batch normalization (BN) after spurious updates due to step sizes that are too large, and (c) after implementing batch normalization. Batch normalization centres the incoming information distribution around 0 and scales it, ensuring that the ReLU activation function remains active with a reasonable probability. Subsequently, g̃(x) is more likely to remain dense than with the standard feedforward architecture.

Batch normalization (Ioffe and Szegedy, 2015) is a method by which the inputs to a layer of nodes are continually centred by the mean of the previous layer's output and scaled by the corresponding variance. Applying batch normalization results in the distribution of information entering a node remaining around the centre of the ReLU input domain, see Figure 7(c), increasing the likelihood of g̃(x) for the whole architecture remaining dense. This should increase the ability of GOLS-I to determine step sizes for ReLU, as partial derivatives are more likely to be non-zero, even after spurious updates.

We implement batch normalization for NetII with the given AFs for investigation 9 and show the results in Figures 6(g)-(i). It is clear that the training performance of ReLU is drastically improved with GOLS-I. For the first time, it is possible to obtain competitive performance with ReLU using GOLS-I. Additionally, all activation functions show accelerated training with batch normalization over the standard NetII architecture. The best training performances are shared by leaky ReLU and Softsign. However, the test errors are more comparable for all AFs after 40,000 function evaluations. One exception is Tanh, which diverges after 5,000 function evaluations. However, the training and test errors achieved with batch normalization before 5,000 function evaluations are lower than those achieved for the standard architecture after 40,000 function evaluations. Therefore, although the reason for Tanh's divergence is worthy of further investigation, in this study we are satisfied with observing improved performance with Tanh due to batch normalization.

It is noteworthy that the step sizes for NetII with batch normalization have a consistent magnitude between different AFs and vary less in comparison to those determined for the standard NetII architecture. Since batch normalization alters the scaling of the search direction, the loss function appears more spherical to LS-SGD, which results in the step sizes being more consistent between AFs. However, this does not result in equal training performance between AFs (as seen for Tanh), which indicates that the different AFs still contribute unique characteristics to the loss function. Interestingly, the error variance characteristics also change for many AFs.
Leaky ReLU and Sigmoid errors remain noisy, but AFs such as Tanh, Softsign and ELU exhibit more variance with batch normalization than without, as their respective derivatives are highest around 0.

In summary, this investigation has shown that:
1. Larger feedforward networks (than those investigated in Section 5.1) with ReLU AFs can also be unstable when training with GOLS-I.
2. The interaction between neural network architecture and AF can lead to significant differences in training performance with GOLS-I (even when the systemically poor performance of ReLU is ignored).
3. Implementing constant learning rates is not a viable alternative to determining step sizes with GOLS-I for ReLU AFs.
4. Instead, batch normalization significantly improves the training of ReLU architectures using GOLS-I.
5. Lastly, batch normalization accelerated training for all AFs with GOLS-I in our investigation.

Consider now the interaction between architectures that use skip connections (He et al., 2016) and different AFs using GOLS-I SGD. To this end, we implement the ResNet18 architecture, as applied to the CIFAR10 dataset. Skip connections ensure that the information flow to subsequent nodes remains unimpeded, regardless of whether an activation function such as ReLU prohibits information from travelling through a given node. The role of AFs in this case is to manipulate information that is additional to that transferred by the skip connections, i.e. the "residuals". The interaction between the "skip-transferred" information and the residuals constructs the mapping behaviour between the input and output domains of the neural network. The standard ResNet18 architecture includes batch normalization.

For investigation 10, we compare training and test classification errors, as well as corresponding step sizes, for ResNet18 with the considered range of AFs in Figures 8(a)-(c). In this case, there is a distinct difference in performance between the sparsity class and the saturation class of AFs. ReLU and leaky ReLU perform best in terms of training, with a slight advantage over ELU. However, this difference is less prominent in the test classification errors, where ELU is competitive with the rest of its class. A similar clustering occurs between Tanh and Softsign for the saturation class. Both perform very similarly in training and test errors, with only a slight advantage belonging to Tanh. The Sigmoid AF is convincingly the worst performer of the considered AFs. Though its training is slow and stable, the test losses are considerably noisier than those of the remaining AFs.

Using Xavier initialization (Glorot and Bengio, 2010), the initial weights for the Sigmoid AF are distributed around 0 in the input domain. However, an input distribution around zero corresponds to a function value output centred around 0.5 for Sigmoid. This is in contrast to the remaining saturation class AFs, which have outputs centred around 0. As shown by Glorot and Bengio (2010), the cumulative effect of 0.5 outputs over a number of hidden layers can push Sigmoid AFs in later layers into saturation, where the derivative is significantly decreased. Batch normalization counters this problem by re-centring the outputs of network layers around 0, where the derivative is at a maximum. Additionally, the maximum derivative of Sigmoid is 0.25, which over many layers diminishes the gradient during backpropagation due to the chain rule.
Batch normalization also compensates for this property by scaling the variance to be 1 for every layer, ensuring that the gradient magnitudes remain adequately scaled. It is clear that batch normalization has to do a significant amount of "correcting" for the Sigmoid AF in ResNet18. We suspect that the combination of these factors leads GOLS-I to estimate small initial step sizes for Sigmoid, with slow subsequent step size growth. However, the step sizes gradually increase to the point where they are comparable to those of the other AFs after 20,000 iterations.

Apart from the step sizes of the Sigmoid AF, the step size magnitudes between the remaining AFs are similar and approximately constant.
Figure 8: (a) Log training error, (b) log test error and (c) log step sizes for the CIFAR10 dataset with the ResNet18 architecture trained using GOLS-I SGD. The standard ResNet architecture includes batch-norm layers, which are omitted in (d)-(f) in order to highlight the effect of skip connections on training with the sparsity-enforcing ReLU activation function.

Since ResNet is constructed with both skip connections and batch normalization, it is unclear which of the two architectural features contributes more significantly towards improving training performance with ReLU. We have considered the contribution of batch normalization to improving training in Section 5.3. Here, we consider skip connections more closely. The difference between standard feedforward nodes and skip-connected nodes is illustrated in Figure 9 with the ReLU AF. A standard node, shown in Figure 9(a), passes the incoming distribution through the activation function at the node, which augments the distribution according to its characteristics. As demonstrated in Figure 7, this can be problematic with the ReLU AF if poor updates occur, as the propagation of gradients can become obstructed. For a skip-connected node, shown in Figure 9(b), the incoming distribution is added to the output of the standard node. This ensures that even in the worst case, where no information passes through the AF, the incoming distribution always propagates forward. Subsequently, evaluated gradients will always be dense.

Figure 9: (a) Standard node; (b) node with skip connection.
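A minimal sketch of the two node types, assuming a PyTorch-style fully-connected formulation (the class names, width and use of nn.Linear are illustrative; the residual branches in ResNet18 itself consist of convolutional and batch-norm layers):

    import torch
    import torch.nn as nn

    class StandardNode(nn.Module):
        # Figure 9(a): the incoming distribution must pass through ReLU; if the
        # pre-activation is pushed entirely negative, the output (and hence the
        # backpropagated gradient through this node) is zero.
        def __init__(self, width=64):
            super().__init__()
            self.linear = nn.Linear(width, width)
            self.act = nn.ReLU()

        def forward(self, x):
            return self.act(self.linear(x))

    class SkipNode(nn.Module):
        # Figure 9(b): the incoming distribution is added to the output of the
        # standard node, so information always propagates forward even if the
        # ReLU branch outputs zero.
        def __init__(self, width=64):
            super().__init__()
            self.node = StandardNode(width)

        def forward(self, x):
            return x + self.node(x)

Because the skip path in SkipNode is an identity addition, the derivative of its output with respect to its input contains an identity term regardless of whether the ReLU branch is active, which is what keeps the evaluated gradients dense.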
In investigation 11 we observe the influence of skip connections in ResNet18. Therefore, investigation 11 repeats the analysis of investigation 10, albeit without the use of batch normalization. The results are shown in Figures 8(d)-(f). As is consistent with investigations 7 and 9 performed in Section 5.3, training performance slows without the use of batch normalization for all AFs. The results of investigation 11 show that training progresses for all AFs, with the exception of the Sigmoid AF. We postulate that this drop in performance is due to scaling difficulties in deep neural networks driven by Sigmoid's positive offset, small maximum derivative and saturating nature (Glorot and Bengio, 2010), which are subsequently not corrected by batch normalization.

We confirmed this phenomenon by conducting numerous additional runs with modified versions of the AFs considered in this study. One such modification was a version of the Tanh AF, namely 0.·Tanh + 0.·Sigmoid − 1, which also has a maximum derivative of 0.5 while being centred around 0. This suggests that passing through the origin is an important AF property for promoting effective training with deep networks, an assumption held for both Xavier (Glorot and Bengio, 2010) and He (He et al., 2015) initialization strategies. Additionally, a modified leaky ReLU with a reduced maximum output derivative was included in these runs.
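For reference, both initialization strategies mentioned above set the weight variance from the layer fan-in $n_{\text{in}}$ and fan-out $n_{\text{out}}$:
\[
\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} \quad \text{(Xavier)},
\qquad
\mathrm{Var}(W) = \frac{2}{n_{\text{in}}} \quad \text{(He)},
\]
and both derivations assume an activation that passes through the origin, a property that the positively offset Sigmoid does not satisfy.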
Importantly, training ResNet18 without batch normalization with the ReLU AF using GOLS-I SGD is not only stable, it also shares the best training performance with leaky ReLU. This is significant, as it demonstrates the effectiveness of skip connections in ensuring that gradients remain dense with ReLU AFs. This allows ReLU's performance to be directly compared to that of the remaining AFs without batch normalization. Both ReLU and leaky ReLU are outperformed by ELU during the first half of training, but overtake it during the latter half. Figures 8(d) and (e) suggest that within the context of skip connections, the difference in performance between ReLU and leaky ReLU is insignificant.

It is an emerging trend that the sparsity class of AFs gradually begins to outperform the saturation class as the size of the considered neural networks increases. For the foundational problems of Section 5.1, this difference is marginal as the number of nodes in the architecture increases. For the NetII problem in Section 5.3, two of the top three performers are from the sparsity class for architectures with and without batch normalization. In the ResNet18 architecture, all of the sparsity class AFs outperform all of the saturation class AFs, both with and without batch normalization. This suggests that the sparsity property becomes increasingly useful as the size of the network increases. This is consistent with how sparsity operates, as the number of "channels" available to construct the mapping between input and output spaces increases with growing architecture size.

This investigation demonstrates that skip connections are effective in ensuring that $\tilde{g}(x)$ remains dense for ReLU AFs, where sparsity is enforced. In cases where the outputs of ReLU nodes are zero, it is only the residual that is zero, while the information of previous nodes is still passed to subsequent layers through skip connections. This drastically improves GOLS-I SGD's ability to train ReLU neural network architectures. In addition, batch normalization layers contribute further benefit for all AFs considered in our investigations, by increasing robustness and accelerating training. Additionally, this investigation supports the trends observed in Sections 5.1 and 5.3, where the training performance of larger architectures using GOLS-I SGD is improved by implementing sparsity class AFs.
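As a practical diagnostic for this density argument (a sketch only; model denotes any PyTorch module whose gradients have been populated by a backward pass), the fraction of zero entries in the evaluated gradient can be monitored during training:

    import torch

    def gradient_sparsity(model):
        # Fraction of exactly-zero entries in the evaluated gradient vector,
        # computed after loss.backward(); values close to 1 indicate that large
        # parts of the network are inactive (e.g. dead ReLU nodes).
        zeros, total = 0, 0
        for p in model.parameters():
            if p.grad is not None:
                zeros += (p.grad == 0).sum().item()
                total += p.grad.numel()
        return zeros / max(total, 1)

In our experience with the feedforward ReLU architectures discussed above, a rising sparsity value of this kind coincides with the training difficulties encountered by GOLS-I, whereas skip-connected architectures keep it low.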
6. Conclusion
In this study, we consider the interaction between gradient-only line searches and a variety of neural networks constructed with six different activation functions. The activation functions considered are split into two classes, namely the saturation class (Sigmoid, Tanh and Softsign) and the sparsity class (ReLU, leaky ReLU and ELU). Gradient-only line searches are used to determine the step sizes for gradient-based optimizers in full-batch and dynamic mini-batch sub-sampled (MBSS) loss functions. In our study we implement the gradient-only line search that is inexact (GOLS-I) as well as the Gradient-Only Line Search with Bisection (GOLS-B) (Kafka and Wilke, 2019a). Eleven investigations are conducted using 13 different datasets with a total of 37 network architectures, some using the cross entropy loss and others the mean squared error loss. Training problems include 11 foundational datasets with feedforward neural networks, the MNIST dataset with NetII (Mahsereci and Hennig, 2017) and the CIFAR10 dataset with ResNet18 (He et al., 2016). These problems cover a range of architectural features, including batch normalization (Ioffe and Szegedy, 2015) and skip connections.

We find that GOLS is effective in determining the step sizes in dynamic MBSS training for all but a few combinations of activation functions and network architectures. The small neural networks show a close grouping in training performance between the considered activation functions, with a slight advantage belonging to non-linear, saturation type activations. However, feedforward networks with the ReLU activation function, coupled with GOLS, performed poorly in training. Analyses with NetII and ResNet18 without batch normalization show that a particular activation function can significantly improve the training performance for a given network architecture with GOLS-I. For NetII, the best performers were the leaky ReLU and Sigmoid activation functions, while the troublesome performance of ReLU with GOLS-I seen in the foundational problems recurred for NetII.

Our investigations suggest that the predominant cause of GOLS-I's inability to train ReLU architectures is weight updates with step sizes that are too large. These updates shift the distribution of information entering a ReLU activation function fully into its inactive domain, thus producing zero outputs and halting the flow of information through a node. This can lead to large portions of the network becoming and remaining inactive, which leads to the gradient vector being sparse in these cases. In addition, the implication that there exist domains in the loss function that have zero gradients means that ReLU loss functions with feedforward architectures can break the assumptions of Lyapunov's global stability theorem, which govern the convergence properties of GOLS-I.

However, the training difficulties encountered with ReLU architectures using GOLS-I can be alleviated by implementing batch normalization and skip connections, as used in residual networks. Batch normalization centres the distribution of information flowing between sequential network layers around zero, increasing the number of active nodes in the network. Alternatively, skip connections by design guarantee the propagation of input information throughout the network architecture. These technologies ensure that the gradient vector remains dense, allowing GOLS-I to recover from spurious updates when they occur.
Both skip connections and batch normalization render ReLU competitive with the other activation functions. Batch normalization has the added benefit of accelerating training for all activation functions. Training ResNet18 using GOLS-I with and without batch normalization demonstrated that implementing batch normalization together with skip connections results in a double benefit: stability for ReLU, as well as accelerated training for all activation functions.

Gradient-only line searches are effective at determining adaptive step sizes for gradient descent based training algorithms. Our studies have demonstrated that the interaction between activation functions and neural network architectures matters. The ResNet18 training problem showed a clear distinction between the saturation and sparsity classes of activation functions, with superior training performance belonging to the latter group. The properties of an activation function's derivative have a direct effect on the nature of the loss function presented to the optimization algorithm. Our studies suggest that activation functions which promote sparsity are better suited to larger network architectures than classical saturation type activation functions. Additionally, significant difficulties can be encountered when training feedforward ReLU architectures with GOLS-I. Therefore, we suggest that practitioners consider technologies such as batch normalization and skip connections when constructing neural network architectures to be trained with GOLS-I.
Acknowledgements
This work was supported by the Centre for Asset and Integrity Management (C-AIM), Department of Mechanical and Aeronautical Engineering, University of Pretoria, Pretoria, South Africa. We would also like to thank NVIDIA for sponsoring the Titan X Pascal GPU used in this study.
References
Arora, J. (2011). Introduction to Optimum Design, Third Edition. Academic Press Inc.

Bergstra, J., Desjardins, G., Lamblin, P., and Bengio, Y. (2009). Quadratic Polynomials Learn Better Image Features. In Technical Report 1337, IT Department and Operations Research, University of Montreal, pages 1–11.

Bollapragada, R., Byrd, R., and Nocedal, J. (2017). Adaptive Sampling Strategies for Stochastic Optimization. arXiv:1710.11258, pages 1–32.

Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. In COMPSTAT 2010, Keynote, Invited and Contributed Papers, volume 19, pages 177–186.

Boyd, S., Xiao, L., and Mutapcic, A. (2003). Subgradient Methods. In Lecture Notes of EE392o, Stanford University, volume 1, pages 1–21.

Chae, Y. and Wilke, D. N. (2019). Empirical Study Towards Understanding Line Search Approximations for Training Neural Networks. arXiv:1909.06893 [stat.ML], pages 1–30.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The Loss Surfaces of Multilayer Networks. In AISTATS 2015, volume 38, pages 192–204.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In ICLR 2016, pages 1–14.

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization. In ICLR 2014, pages 1–9.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(July):2121–2159.

Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2):179–188.

Friedlander, M. P. and Schmidt, M. (2011). Hybrid Deterministic-Stochastic Methods for Data Fitting. arXiv:1104.2373 [cs.LG], pages 1–26.

Glorot, X. and Bengio, Y. (2010). Understanding the Difficulty of Training Deep Feedforward Neural Networks. arXiv:1308.0850, pages 1–8.

Glorot, X. and Bordes, A. (2011). Deep Sparse Rectifier Neural Networks. In Proceedings of Machine Learning Research, volume 15, pages 315–323.

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively Characterizing Neural Network Optimization Problems. In ICLR 2015, pages 1–11.

Han, J. and Morag, C. (1995). The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning. In Mira, J. and Sandoval, F., editors, Natural to Artificial Neural Computation. Lecture Notes in Computer Science, volume 930. Springer.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [cs.CV], pages 1–11.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.

Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs.LG], pages 1–9.

Johnson, B., Tateishi, R., and Xie, Z. (2012). Using Geographically Weighted Variables for Image Classification. Remote Sensing Letters, 3(6):491–499.

Kafka, D. and Wilke, D. N. (2019). Gradient-Only Line Searches: An Alternative to Probabilistic Line Searches. arXiv:1903.09383 [stat.ML], pages 1–25.

Kafka, D. and Wilke, D. N. (2019a). Resolving Learning Rates Adaptively by Locating Stochastic Non-Negative Associated Gradient Projection Points Using Line Searches. Unpublished: in review at the Journal of Global Optimization.

Kafka, D. and Wilke, D. N. (2019b). Visual Interpretation of the Robustness of Non-Negative Associative Gradient Projection Points over Function Minimizers in Mini-Batch Sampled Loss Functions. arXiv:1903.08552 [stat.ML], pages 1–32.

Karlik, B. (2015). Performance Analysis of Various Activation Functions in Generalized MLP Architectures of Neural Networks. International Journal of Artificial Intelligence and Expert Systems, 1(4):111–122.

Krizhevsky, A. and Hinton, G. E. (2009). Learning Multiple Layers of Features from Tiny Images. University of Toronto.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In NIPS 2012, pages 1–9.

Kungurtsev, V. and Pevny, T. (2018). Algorithms for Solving Optimization Problems Arising from Deep Neural Net Models: Smooth Problems. arXiv:1807.00172 [math.OC], pages 1–5.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2017). Visualizing the Loss Landscape of Neural Nets. arXiv:1712.09913, pages 1–21.

Liu, K. (2018). 95.16% on CIFAR10 with PyTorch. https://github.com/kuangliu/pytorch-cifar. Accessed: 2018-09-30.

Liu, P., Zeng, Z., and Wang, J. (2017). Multistability of Delayed Recurrent Neural Networks with Mexican Hat Activation Functions. Neural Computation, 29(2):423–457.

Lucas, D. D., Klein, R., Tannahill, J., Ivanova, D., Brandon, S., Domyancic, D., and Zhang, Y. (2013). Failure Analysis of Parameter-Induced Simulation Crashes in Climate Models. Geoscientific Model Development, 6(4):1157–1171.

Lyapunov, A. M. (1992). The General Problem of the Stability of Motion. International Journal of Control, 55(3):531–534.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. In ICML 2013, volume 28, page 6.

Mahsereci, M. and Hennig, P. (2017). Probabilistic Line Searches for Stochastic Optimization. Journal of Machine Learning Research, 18:1–59.

Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., and Consonni, V. (2013). Quantitative Structure–Activity Relationship Models for Ready Biodegradability of Chemicals. Journal of Chemical Information and Modeling, 53(4):867–878.

Prechelt, L. (1994). PROBEN1 – A Set of Neural Network Benchmark Problems and Benchmarking Rules (Technical Report 21-94). Technical report, Universität Karlsruhe.

pytorch.org (2019). PyTorch. https://pytorch.org/. Version: 1.0.

Robbins, H. and Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407.

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. CoRR, abs/1312.6:1–22.

Schraudolph, N. N. (1999). Local Gain Adaptation in Stochastic Gradient Descent. 1999:569–574.

Schraudolph, N. N. and Graepel, T. (2003). Combining Conjugate Direction Methods with Stochastic Approximation of Gradients. Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, AISTATS 2003, pages 2–7.

Schraudolph, N. N., Yu, J., and Günter, S. (2007). A Stochastic Quasi-Newton Method for Online Convex Optimization. International Conference on Artificial Intelligence and Statistics, pages 436–443.

Smith, L. N. (2015). Cyclical Learning Rates for Training Neural Networks. arXiv:1506.01186.

Smith, S. L., Kindermans, P.-J., Ying, C., and Le, Q. V. (2017). Don't Decay the Learning Rate, Increase the Batch Size. arXiv:1711.00489.

Snyman, J. A. and Wilke, D. N. (2018). Practical Mathematical Optimization, volume 133 of Springer Optimization and Its Applications. Springer International Publishing, Cham.

Tong, F. and Liu, X. (2005). Samples Selection for Artificial Neural Network Training in Preliminary Structural Design. Tsinghua Science & Technology, 10(2):233–239.

Werbos, P. J. (1994). The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley-Interscience, New York, NY, USA.

Wilke, D. N., Kok, S., Snyman, J. A., and Groenwold, A. A. (2013). Gradient-Only Approaches to Avoid Spurious Local Minima in Unconstrained Optimization. Optimization and Engineering, 14(2):275–304.

Wilson, D. R. and Martinez, T. R. (2003). The General Inefficiency of Batch Training for Gradient Descent Learning. Neural Networks, 16(10):1429–1451.

Xu, B., Huang, R., and Li, M. (2016). Revise Saturated Activation Functions. arXiv:1602.05980 [cs.LG].