AutoLR: An Evolutionary Approach to Learning Rate Policies
Pedro Carvalho, Nuno Lourenço, Filipe Assunção, Penousal Machado
Pedro Carvalho, University of Coimbra, CISUC, DEI, Coimbra, [email protected]
Nuno Lourenço, University of Coimbra, CISUC, DEI, Coimbra, [email protected]
Filipe Assunção, University of Coimbra, CISUC, DEI / University of Lisbon, Coimbra, [email protected]
Penousal Machado, University of Coimbra, CISUC, DEI, Coimbra, [email protected]
ABSTRACT
The choice of a proper learning rate is paramount for good Artificial Neural Network training and performance. In the past, one had to rely on experience and trial-and-error to find an adequate learning rate. Presently, a plethora of state-of-the-art automatic methods exist that make the search for a good learning rate easier. While these techniques are effective and have yielded good results over the years, they are general solutions. This means the optimization of learning rate for specific network topologies remains largely unexplored. This work presents AutoLR, a framework that evolves Learning Rate Schedulers for a specific Neural Network Architecture using Structured Grammatical Evolution. The system was used to evolve learning rate policies that were compared with a commonly used baseline value for learning rate. Results show that training performed using certain evolved policies is more efficient than the established baseline and suggest that this approach is a viable means of improving a neural network's performance.
CCS CONCEPTS
• Computing methodologies → Genetic programming; Supervised learning; Neural networks.

KEYWORDS
Learning Rate Schedulers, Structured Grammatical Evolution
ACM Reference Format:
Pedro Carvalho, Nuno Lourenço, Filipe Assunção, and Penousal Machado. 2020. AutoLR: An Evolutionary Approach to Learning Rate Policies. In Genetic and Evolutionary Computation Conference (GECCO '20), July 8–12, 2020, Cancún, Mexico. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3377930.3390158
1 INTRODUCTION
The study of Artificial Neural Networks (ANNs) is a prominent field in modern Artificial Intelligence (AI). These networks' defining characteristic
is that they are able to learn how to perform a certain task when provided with an appropriate architecture, data, and resources. The networks have a set of internal parameters known as weights, and training is the process through which these are modified so that the network is able to solve a given problem. Fine-tuning the weights of ANNs is crucial in order to obtain a consistently useful system. Several parameters regulate training, and one of the most important is the learning rate. In fact, according to [11, p. 424], if we only have the chance to modify one hyperparameter, the focus should be on the learning rate.

The learning rate determines the magnitude of the changes that are made to the weights. Consequently, the choice of an adequate learning rate is paramount for effective training. When the value of the learning rate is too small, the network is unable to make impactful changes to its weights, making training slow. On the other hand, if the learning rate is too high, the system makes radical changes even in response to small mistakes, causing inconsistent and unpredictable behaviour. On top of this, research suggests that the best training results are achieved by adjusting the learning rate over the course of the training process [22]. One way to make these adjustments is to update the learning rate as training progresses. The functions responsible for these adjustments are known as learning rate policies. There is a subset of these functions known as learning rate schedulers, i.e., functions that are periodically called during training and return a new learning rate based on multiple training characteristics, such as the current learning rate or the number of performed iterations.

The main objective of this work is to devise an approach that is able to evolve learning rate policies for specific neural network architectures, in order to improve their performance. Concretely, we developed AutoLR, a system that allows us to study the viability of this approach and how it may contribute to the field of learning rate optimization as a whole. Learning rate policies can take many different shapes [23], and it will therefore be notable if our system is capable of automatically discovering functions that are variations of the ones found in the literature. Such a result is interesting because, if this approach is able to evolve solutions that are widely accepted, it is possible that the same ideas can be used to find still undiscovered, better methods. We are also interested in inspecting the evolved schedulers and comparing them with human-designed schedulers to obtain meaningful insights. The contributions of this paper are:
• The proposal of AutoLR, a framework based on SGE that performs automatic optimisation of learning rate schedulers.
• The design, testing, and analysis of experiments that validate the use of evolutionary algorithms to optimise learning rate schedulers.
• Evidence that the evolved policies are competitive and have characteristics that allow them to thrive in the problems at hand.

The remainder of the paper is organised as follows: Section 2 introduces the background concepts and surveys the key works related to the optimisation of learning rate schedulers; Section 3 describes AutoLR, the methodology proposed for the evolution of learning rate schedulers; Section 4 details the experimental setup and discusses the experimental results; finally, Section 5 summarises the main conclusions of the paper and addresses future work.

Figure 1: Example of an Artificial Neural Network.
2 BACKGROUND
This section provides the necessary context for the reader to understand the rest of the paper. Section 2.1 introduces Artificial Neural Networks; Section 2.2 details Structured Grammatical Evolution; and Section 2.3 surveys works related to learning rate optimization and learning rate schedulers.
2.1 Artificial Neural Networks
Artificial Neural Networks (ANNs) are a machine learning approach that draws inspiration from the biological neural networks seen in nature in order to create a computing system that is able to learn. These systems are composed of a set of nodes (known as neurons) and edges (known as synaptic weights). An example of the general structure of an ANN is depicted in Figure 1.

Although these networks can have different architectures (e.g., LSTM [13], ResNet [12]) we will, without loss of generality, focus on feed-forward ANNs. In these models the nodes are grouped into separate layers connected sequentially. These layers are flanked by an input and an output layer, responsible for receiving the data that the network will process and for yielding the result of the network's calculations, respectively. Edges are directional connections between two nodes from different layers and are the means
through which information travels through the network. Each node performs an operation (i.e., a mathematical function) on the values it receives from the previous layer and sends the new value to all nodes it is connected to in the next layer. Every edge has a weight that scales the value it carries, i.e., the value that a node outputs is always adjusted before it is provided to the nodes in the next layer.

These networks can be used to solve tasks of many different types, and the ideas presented in this work are widely applicable to different types of ANNs. Without loss of generality, we focus on the optimisation of a learning rate scheduler for a supervised learning classification problem. In supervised learning the system is tasked with learning a function that can separate data instances into their respective classes; to achieve this, the network is provided with a set of labeled instances.

The training of ANNs is an iterative process where the network compares its attempted classifications of a subset of examples with the expected ones and adjusts its weights to get closer to the correct results. A function known as the loss function compares the classifications and measures how incorrect the network's output was. The size of the changes made to the weights is partially given by the error returned by the loss function (a larger error leads to larger changes). Another parameter, the learning rate, determines the magnitude of the adjustments that are made to the weights. The learning rate is the main subject of this paper. For more details on ANNs refer to [9].

Deep Neural Networks (DNNs) are a subset of ANNs notable for being able to perform representation learning, meaning the networks are able to automatically extract the features required to solve the problem. This is often associated with the need for deeper architectures, i.e., a greater number of hidden layers, which allows the networks to tackle harder problems. In the current work we focus on Convolutional Neural Networks (CNNs) [10], a DNN topology that is known to work well on spatially-related data (e.g., images). An example of the architecture of a CNN is shown in Figure 2. Two layer types are commonly used in CNNs: convolutional and pooling layers. More details can be found in [17].

Figure 2: Example of the architecture of a Convolutional Neural Network.
2.2 Structured Grammatical Evolution
In the current work we perform the optimisation of learning rate schedulers using Structured Grammatical Evolution (SGE). SGE is a variant of Grammatical Evolution (GE) [19] that uses an altered genotype representation to address the main limitations of GE: low locality and high redundancy. In GE the genotype is encoded as a single list of integers, where each integer encodes a grammatical expansion possibility. In contrast, in SGE there is a separate list for each non-terminal symbol; this avoids the need for the modulo operation when performing the genotype-to-phenotype mapping.

These approaches add another layer of decision making, however, in the form of the grammar design [15]. The grammar used for any GE experiment defines what kind of programs the engine is able to create, and this has many implicit consequences. The most obvious one is that the provided grammar must encompass solutions that can solve the problem at hand. While this seems trivial, it must be understood that not knowing the composition of the desired program is one of the main motivations to use this type of system in the first place. This also means, however, that the grammar's specificity can be increased as more knowledge of the problem becomes available, aiding the search process.

This type of EA is suited for this work because the functions we are looking to evolve are very specific. Our domain knowledge is therefore high, and there is a strong understanding of what the desired program looks like. As previously mentioned, we can use this knowledge to create a grammar that enhances results by narrowing the search space. An in-depth explanation of these algorithms can be found in [9, 21].
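To make the representation concrete, the toy sketch below (illustrative grammar and function names, not AutoLR's actual code) maps an SGE genotype, one integer list per non-terminal, to a phenotype; each gene indexes an expansion option directly, so no modulo operation is needed:

    from collections import defaultdict

    # Illustrative toy grammar: each non-terminal maps to its expansion options.
    GRAMMAR = {
        "<expr>": [["(", "<expr>", "<op>", "<expr>", ")"], ["<var>"]],
        "<op>":   [["+"], ["*"]],
        "<var>":  [["x"], ["y"]],
    }

    def sge_map(genotype, start="<expr>"):
        counters = defaultdict(int)  # next unused gene, tracked per non-terminal
        def expand(symbol):
            if symbol not in GRAMMAR:  # terminal symbols are emitted as-is
                return symbol
            gene = genotype[symbol][counters[symbol]]
            counters[symbol] += 1
            option = GRAMMAR[symbol][gene]  # direct indexing: no modulo needed
            return "".join(expand(s) for s in option)
        return expand(start)

    # One integer list per non-terminal symbol, as in SGE.
    print(sge_map({"<expr>": [0, 1, 1], "<op>": [0], "<var>": [0, 1]}))  # (x+y)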
2.3 Learning Rate Optimization
In the context of this work, hyper-parameters are the set of parameters that configure an ANN and its training strategy. The learning rate is one such parameter; its role is to scale the changes made to the network weights during training. Research suggests that hyper-parameter optimization is effective in improving a system's performance without adding complexity [5].
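To make this role concrete, in plain gradient descent (the standard textbook update, not a formula specific to this paper) the learning rate \eta scales the step taken against the gradient of the loss L with respect to the weights:

    w_{t+1} = w_t - \eta \, \nabla_w L(w_t)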
2.3.1 Static Learning Rate. The traditional approach is to use a single learning rate for the entire training process [20]. Under these circumstances all optimization must be done before training starts. Oftentimes the programmer must rely on expertise and intuition to guess adequate learning rate values. While automatic solutions to this problem exist, they are, to the best of our knowledge, either comparable to manual optimization [5] or non-trivial to implement [6]. Much of the difficulty of finding a convenient solution to this issue stems from the fact that hyper-parameters are inter-dependent [7]. This means that even when an ideal learning rate is found, there is no guarantee that this value remains optimal (or even usable) as the other parameters are tweaked.
2.3.2 Dynamic Learning Rate. The reasons stated in the previous section make the use of a static learning rate a possible drawback. It is desirable that the method used to determine the learning rate is robust enough that performance does not dip with every change to the system. To increase flexibility we would ideally have a method to change the learning rate as training progresses, i.e., even if the initial value is not adequate the system has a chance to correct its course. We refer to this strategy as a dynamic learning rate.

The most uncomplicated policy for varying the learning rate can be inferred intuitively. It is expected that, as training progresses, the ANN's performance gradually improves as it gets better at solving the task at hand. If the system is potentially closer to its objective, it seems desirable that it does not stray from its course. This is to say that, in order to improve, the network requires progressively finer tuning; this can be achieved with a decaying learning rate (meaning that the learning rate decreases as training progresses). There are, however, issues frequently encountered during training that make this approach less than ideal. Better performance is rarely an indicator that the network is closer to a perfect solution, and a decaying learning rate leaves the system susceptible to early stagnation in a local optimum (even though a local optimum is sufficient for many situations). Despite these limiting factors, decaying learning rates can lead to improvements over static ones, as seen in [22].

To expand on these ideas we need the concepts of exploration and exploitation, the two complementary strategies used in heuristic optimization. Exploration is the idea of using a mechanism that helps the algorithm visit solutions that do not seem as promising, in an attempt to avoid falling into a local optimum. The contrasting technique is exploitation, where the approach is adjusted to make sure the algorithm is able to find the local optimum once it reaches a promising region. Finding a proper balance between these two strategies is crucial for further improvement of dynamic learning rates.
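For illustration, one common decaying policy (a generic textbook form, not a policy taken from this work) is exponential decay, where the learning rate shrinks with the epoch index t at a rate k > 0:

    \eta_t = \eta_0 \, e^{-kt}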
Smith [23] proposes the use of a cyclical learning rate. This approach fluctuates the learning rate between a maximum and a minimum bound. While the system uses no information about whether or not it is stuck, by periodically increasing the learning rate it is able to explore the search space more effectively. This technique is consequently less vulnerable to early stagnation than decaying learning rate policies. This method is, to the best of our knowledge, the most efficient use of dynamic learning rates.
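In our notation, the triangular schedule from [23] can be sketched as follows, where t is the iteration counter and s is the step size (half the cycle length); the exact form should be checked against the original paper:

    cycle = \lfloor 1 + t/(2s) \rfloor, \qquad x = |t/s - 2 \cdot cycle + 1|
    \eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min}) \cdot \max(0, 1 - x)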
2.3.3 Adaptive Learning Rate. Further improvements in this area can still be achieved if the system responsible for assigning the learning rate has access to information throughout training. This means that we will now study algorithms that can acknowledge when training is stagnating as it is happening. From this point onward we refer to these methods as adaptive learning rates.

These techniques unlock one more avenue of optimization. So far we have been working with a single-value learning rate, but with this extra information it becomes desirable to use a vector of values instead. Consider the following scenario: an ANN is being trained for 100 epochs with a single-value adaptive learning rate. One specific weight of the network reaches a near-optimal value within the first 5 epochs, but all of the others are still off the mark. An adaptive learning rate recognizes this and has to decide the ideal learning rate value for the next epoch. On the one hand, a small learning rate benefits the fine-tuning of the weight that is already performing well; a larger learning rate, on the contrary, allows the sub-optimal weights to find better values. Using vectors of learning rates allows the system to keep a learning rate value for each weight, making the most of these nuanced situations [14]. Several algorithms [8, 16, 24] have been built on this theoretical foundation, and these systems are the best-performing learning rate policies we know of.
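As an example of this family, AdaGrad [8] maintains a per-dimension learning rate by accumulating squared gradients (sketched here in our notation, with g_{t,i} the gradient for weight i at step t and \epsilon a small constant):

    w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2} + \epsilon} \, g_{t,i}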
3 AUTOLR
AutoLR is a framework created to apply evolutionary algorithms to learning rate policy optimization. While SGE handles the evolutionary process, the system's novelty comes from using the algorithm to explore new possibilities in the learning rate policy
search space. This is achieved through the design of a grammar that is able to effectively navigate part of this space and a fitness function that can accurately measure each policy's quality.

Figure 3: Example of a learning rate scheduler.
3.1 Learning Rate Schedulers. The scope of this work is limited to evolving learning rate schedulers, which we define as in the Keras [1] library: functions that are called periodically during training (each epoch, in this case) and update the learning rate value. In other words, we are evolving the initial learning rate and the ensuing variation function. These functions' inputs are the learning rate of the previous epoch and the number of performed epochs, and they return a single learning rate for all dimensions. Using the terminology established so far, this means the evolved policies can be either static or dynamic learning rate solutions. It is important to define the range of our solutions, as this establishes which conventional techniques should be kept in mind during analysis.

Figure 3 depicts an example of a learning rate scheduler. In this case the ANN trains using a learning rate of 0.1 for the first 10 epochs, while the condition epoch < 10 is met. At the 10th epoch the scheduler automatically decreases the learning rate to 0.05. Following the same rationale, after the 50th epoch the learning rate to use is 0.01. The search space that we consider is detailed in the next sub-section.

3.2 Grammar. The grammar (Figure 4) defines the search space of the learning rate schedulers. The individuals created by this grammar typically resolve into a sequence of chained if-else conditions (created by the logic_expr production) that, once evaluated, yield a learning rate (provided by the terminals in lr_const). This means that the system creates dynamic learning rate policies most of the time. A notable exception is that the system can resolve the initial expr production directly into a lr_const, creating a static learning rate policy.

An if_func is a simple function that does the same as a regular if-then-else construct. Since the code for this system was written in Python, this function was created so all individuals could be described in a single line that can be read easily by the user. The code for this function is shown in Algorithm 1.

    <expr>       ::= if_func(<logic_expr>, <expr>, <expr>) | <lr_const>
    <logic_expr> ::= learning_rate <logic_op> <lr_const> | epoch <logic_op> <ep_const>
    <logic_op>   ::= < | ≤ | > | ≥
    <lr_const>   ::= 0.001 | 0.002 | 0.003 | ... | 0.098 | 0.099 | 0.1
    <ep_const>   ::= 1 | 2 | 3 | ... | 98 | 99 | 100

Figure 4: Grammar used for the optimisation of learning rate schedulers.
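For illustration, the scheduler of Figure 3 is one phenotype this grammar can produce. The sketch below (assuming the Keras LearningRateScheduler callback API that the paper targets, with if_func as in Algorithm 1) shows how such an individual could be plugged into training:

    import tensorflow as tf

    def if_func(condition, state1, state2):
        # Single-expression if-then-else used by evolved individuals (Algorithm 1).
        return state1 if condition else state2

    def evolved_policy(epoch, learning_rate):
        # Phenotype equivalent to the scheduler of Figure 3.
        return if_func(epoch < 10, 0.1, if_func(epoch < 50, 0.05, 0.01))

    # Keras invokes the schedule once per epoch with (epoch, current learning rate).
    scheduler = tf.keras.callbacks.LearningRateScheduler(evolved_policy)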
The conditions used by if_func are generated by logic_expr. This production compares one of the input variables (learning_rate, epoch) with the corresponding constants (lr_const and ep_const, respectively) using one of the logical operators in logic_op. logic_op includes all logical operators with the exception of equality (==) and inequality (!=). Conditions using these two operators are too specific, since they only return a different value for a single constant; in the vast majority of situations they do not change the policy's behaviour, making them unable to contribute meaningfully to the evolutionary process.

The constants chosen for lr_const and ep_const are 100 evenly spaced values between the minimum and maximum value of each variable. Note that these production rules have been abridged in the figure: only a few of the lowest and highest possible values are shown, so that the range is accurately portrayed whilst keeping the figure brief. Our training starts in epoch 1 and ends in epoch 100; since we are also using 100 values for this constant, we used every possible epoch value (every natural number from 1 to 100) for ep_const. lr_const values are more complicated, as there is an infinite number of valid learning rates. We keep the learning rate bounded between 0.001 and 0.1, as all values in this range are suitable for training.

Despite its simplicity, this grammar is capable of creating a large variety of individuals. While it is not possible for our trees to exactly recreate the dynamic solution functions mentioned in Section 2.3, they can reproduce approximated versions that exhibit similar behaviour.

3.3 Fitness Function. As the main hypothesis implies, we are looking to evolve learning rate policies. This means that we will be using an EA on a population of learning rate policies. Additionally, our hypothesis demands that an individual's fitness be some measure of the network's performance when trained using that specific solution. This is necessary since, if the evolutionary process is not successful, its results will not address the question we posed.
Algorithm 1: Template of the code used to implement the if_func routine.

    params: condition, state1, state2
    if condition then
        return state1;
    else
        return state2;

We decided that the best way to assess a policy's performance was through the function seen in Algorithm 2. That is, we train the network and assess its performance using the accuracy metric.

Algorithm 2: Simplified version of the fitness function used to evaluate a learning rate policy.

    params: network, learning_rate_policy, training_data, test_data
    trained_network ← train(network, learning_rate_policy, training_data);
    fitness_score ← get_test_accuracy(trained_network, test_data);
    return fitness_score;

To elaborate on the algorithm above, our fitness function uses the following components:

• network - The ANN we are optimizing the learning rate scheduler for. This network is the same throughout the entire evolutionary run.
• learning_rate_policy - An evolved learning rate scheduler that we want to evaluate.
• training/test_data - The data of the problem the ANN will be attempting to solve. As the names imply, training data is used for training; test data is a separate set of examples used to evaluate the network's performance once training is complete. In the actual fitness function the training data is further split into training and validation sets (see Section 4.3), but this distinction is temporarily omitted for explanation's sake.

The evaluation function has two phases. First, the network is trained; this is where the policy under evaluation affects the process. The train function returns the provided network with its weights changed through the training process. We could, at this point, also retrieve the best performance the network achieved during training. We do not take this approach as it is not the most accurate measure of an ANN's real effectiveness. The objective of training is that the network learns a set of weights that solve the proposed problem, but the data used for training is only a sample of all possible inputs. As training progresses the network gradually becomes too attached to the training data, a phenomenon known as overfitting. An overfit network is too constrained to the training data and does not represent the general learning problem. This happens because data often contains noise, i.e., information that is not important to solve the task. It is not desirable for the network to learn to produce solutions based on this noise, as that will hurt its performance on inputs not included in training. Consequently, we measure the effectiveness of training by how well the network performs on a second set of data that it has not come into contact with. We call this second set the test data.

Every policy is evaluated using the same network and training data, meaning that the learning rate scheduler is the only varying component between individuals. Since all other hyper-parameters are fixed, and the datasets used are balanced, we consider the trained network's accuracy on the test data to be an adequate measure of the policy's fitness (a sketch of this evaluation is given at the end of this section).

In the context of our work, learning rate policies are executable computer code. We use the Python language specifically, as it has vast support for ANN handling through the TensorFlow [3] library. An EA is also needed for our system; we chose a GE-based evolutionary engine as it gives us a flexible and readable means of defining the problem space in the form of grammars. In particular, we chose SGE [18] for its Python implementation and its superior results over regular GE. Our hypothesis also demands a mindful choice of network architecture. Since we are looking for optimization in specific scenarios, we want to avoid generic architectures. We therefore decided to use a CNN model evolved specifically for image classification, obtained with Deep Evolutionary Network Structured Representation (DENSER) [4].
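A minimal sketch of this evaluation, assuming a Keras-style model compiled with an accuracy metric; build_network and the data tuples are illustrative placeholders, not names from the actual code:

    import tensorflow as tf

    def evaluate_policy(build_network, policy, train_data, val_data, test_data):
        model = build_network()  # the same fixed architecture for every individual
        model.fit(
            *train_data,
            epochs=100,                # 20 in the no-early-stop experiments
            batch_size=1000,
            validation_data=val_data,
            callbacks=[
                tf.keras.callbacks.LearningRateScheduler(policy),
                # Early stop is only used in the first experimental setup (Table 1).
                tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
            ],
        )
        _, test_accuracy = model.evaluate(*test_data)
        return test_accuracy  # fitness: accuracy on data unseen during training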
4 EXPERIMENTATION
The objective of this work is to promote the automatic optimisation of learning rate schedulers for a fixed-topology network. Section 4.1 introduces the topology of the network used; Section 4.2 details the dataset; Section 4.3 describes the experimental setup; and Section 4.4 analyses and discusses the experimental results.
4.1 Network Architecture
The network architecture we used was automatically generated with DENSER [4], a grammar-based NeuroEvolution approach. The CNN optimised by DENSER was evaluated using a fixed learning rate strategy, and thus it is likely that better learning policies exist. The architecture was generated for the CIFAR-10 dataset using a fixed learning rate of 0.01, with the individuals trained for 10 epochs. The details of how the network was created are important, as they might inform our conclusions later on. The specific topology of the network is described in Figure 5.
4.2 Dataset
We opted to use Fashion-MNIST instead of the network's native CIFAR-10 because it is a dataset on which training is faster. This dataset is composed of 70000 instances: 60000 for training and 10000 for testing. Each instance is a 28×28 grayscale image, which contrasts with CIFAR-10's 32×32 RGB images. We scale our images to 32×32 RGB, as they would not fit the network's input layer otherwise. This scaling was performed using the nearest-neighbour method; to pass from one to three channels, we replicate the single channel three times.
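A minimal sketch of this preprocessing, assuming the TensorFlow/Keras utilities the paper's stack provides (the test set is treated identically):

    import numpy as np
    import tensorflow as tf

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    x = x_train[..., np.newaxis].astype("float32")      # (60000, 28, 28, 1)
    x = tf.image.resize(x, (32, 32), method="nearest")  # nearest-neighbour upscale
    x = tf.repeat(x, 3, axis=-1)                        # replicate channel: RGB shape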
4.3 Experimental Setup
We divide the experimental setup into two parts: the parameters used for the evolutionary search (Section 4.3.1); and those used for a longer training after the end of evolution (Section 4.3.2).
Figure 5: Topology of the used CNN (Input, Conv2D, BatchNormalization, Conv2D, Conv2D, Flatten, Dense, Dropout, Dense, Dense, Dense, Dense).

Figure 6: Example images from the Fashion-MNIST dataset.
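The sketch below mirrors the layer sequence of Figure 5; all filter counts, unit counts, and the dropout rate are illustrative placeholders, since the DENSER-evolved values are not reproduced in the text:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),  # 10 Fashion-MNIST classes
    ])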
4.3.1 Evolutionary Search. The experimental parameters are summarised in Table 1. They are organized into five sections:

• SGE Parameters – parameters of the evolutionary engine.
• Dataset Parameters – number of instances in each of the data partitions.
• Early Stop – the stop condition used to halt the training of the ANN.
• Training Data Augmentation – real-time data augmentation parameters.
• Network Training Parameters – parameters used when training the ANN.

    SGE Parameter                 Value
    Number of runs                10
    Number of generations         50
    Number of individuals         5
    Mutation rate                 0.15

    Dataset Parameter             Value
    Training set                  7000 instances from the training data
    Validation set                1500 instances from the training data
    Test set                      1500 instances from the training data

    Training Data Augmentation    Value
    Feature-wise Center           True
    Feature-wise Std. Deviation   True
    Rotation Range                20
    Width Shift Range             0.2
    Height Shift Range            0.2
    Horizontal Flip               True

    Early Stop                    Value
    Patience                      3
    Metric                        Validation Loss
    Condition                     Stop if validation loss does not improve
                                  in 3 consecutive epochs

    Network Training Parameter    Value
    Batch Size                    1000
    Epochs                        100 / 20
    Metrics                       Accuracy

Table 1: Experimental parameters.

    Dataset Parameter             Value
    Train set                     52500 instances from training data
    Validation set                7500 instances from training data
    Test set                      10000 instances from test data

Table 2: Dataset partitions used for post-evolution testing.

Our experimental parameters were picked with some considerations in mind. Since evolutionary algorithms are very demanding in terms of computational resources, it was paramount that the parameters used allowed us to perform meaningful evolutionary runs that could be completed in an acceptable time-frame. This motivated the selection of parameters that still effectively reproduce an evolutionary strategy. Additionally, the fitness function operates on a fraction of the dataset, as training with all 60000 training examples was too time-consuming. We also picked the training parameters accordingly. Ideally, we would perform evolution over 100 training epochs with no early stop, since we are trying to optimize the network's performance as much as possible. Instead, we performed two sets of experiments: (i) using 100 epochs and an early stop mechanism; and (ii) using 20 epochs with no early stop. We started by reducing the computational cost through the implementation of an early stop mechanism. Notwithstanding, we were concerned that the evolutionary process would exploit this mechanism, which motivated the 20-epoch experiment, where no early stop is used and the cost is instead reduced by shortening the training.
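The augmentation parameters of Table 1 map directly onto the Keras ImageDataGenerator API; a minimal sketch (variable names illustrative):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        featurewise_center=True,
        featurewise_std_normalization=True,
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        horizontal_flip=True,
    )
    datagen.fit(x_train)  # feature-wise statistics require a pass over the data
    # model.fit(datagen.flow(x_train, y_train, batch_size=1000), ...)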
4.3.2 Post-evolution Testing. After the evolutionary process is complete we need to properly assess the quality of the generated policies. The testing routine is the same as our fitness function, differing only in the data used (seen in Table 2). In our testing routine we use all training instances (splitting them into training and validation) to train the network using the policy we want to evaluate. This network is subsequently tested using all test data (which was not used previously) to obtain an unbiased test accuracy. We also track each policy's best validation accuracy to gain additional insight into how well the learned weights generalize.

Finally, we need to decide on a policy to serve as a baseline. We chose a static learning rate policy of 0.01 for three reasons. First, the network was evolved using this learning rate (as explained in Section 4.1), which assures us that this is an adequate learning rate for this network, making it a good benchmark for our evolved policies. Second, this particular constant is the most common policy in widely used Deep Learning frameworks [1, 2]; we believe that benchmarking against such a widely used policy is a proper way to test our hypothesis. Third, we had to use a baseline that has access to the same information as our evolved methods. The adaptive techniques referred to in Section 2.3, for example, use the gradient of the loss function to make more precise adjustments to a per-dimension learning rate; the fact that these methods have access to additional information means they are not suitable as benchmarks.

We have three testing scenarios:

• The first scenario (1) is the same as the first evolutionary scenario, i.e., training is done for 100 epochs with the early stop mechanism.
• The second scenario (2) trains for only 20 epochs, with no early stop.
• The third scenario (3) trains for 100 epochs, but the early stop mechanism is disabled.

The first and second scenarios exist primarily so we can see how the evolved policies compare with the baseline under the conditions they were evolved in. Scenario 3 yields the most important results, as its conditions represent the typical use case of a neural network. To make the discussion clearer, the evolved policies will be referred to as policy A (the best policy evolved with the early stop mechanism) and policy B (the best policy evolved with no early stop). These evolved policies were tested in their evolutionary environments (scenarios 1 and 2 for policies A and B, respectively) and in scenario 3. The baseline policy was tested in all three scenarios.
4.4 Experimental Results
Table 3 summarizes the results of our experimentation, showing the average and standard deviation of the accuracy of a given policy in a specific scenario over five runs. As detailed in Section 4.3.2, each run trains the network using the chosen policy and subsequently tests its accuracy on the 10000 test instances.

Table 3: Accuracy of the evolved policies (A and B) in their evolutionary environments (scenarios 1 and 2, respectively) and in scenario 3 (representative of an actual use case), compared with the baseline policy.

4.4.1 Scenario 1. This scenario yields results that are not intuitive given the circumstances. Training in this scenario can be halted by an early stop mechanism. Since policy A was evolved using this same kind of training, it would be expected to perform well in these conditions. However, the results show the opposite: policy A, in fact, performs far worse than the baseline when early stop is in use. Analysing individual results showed that this policy occasionally triggers the early stop in the first few epochs (this can be observed in the large standard deviation associated with these trials). There are several interpretations of the implications this has for the validity of the evolutionary process. On the one hand, it can be argued that this demonstrates an issue with the evolutionary process, since the policy is not a consistent solution to the problem it is supposed to solve. While the policy is indeed an inconsistent solution, we do not believe this implies any problem with the evolution: the fact that this policy can, on occasion, yield the best performance implies that the genetic information of this individual is useful for the evolutionary process.

4.4.2 Scenario 2. These results are more in line with our expectations. We can observe that, albeit only marginally, policy B shows better test accuracy than the baseline when trained under the parameters it was evolved for. It is noteworthy that policy B does not have superior accuracy in validation. This suggests that the evolved policy is outperforming the baseline in its ability to generalize when moved to a different set of data.

4.4.3 Scenario 3. This scenario was designed to test which policy is able to get the most out of this network's architecture, and it gave the most important set of results. They show that, under these conditions, the best accuracy this network achieved was obtained using an evolved policy for training. On average, policy A performs better than the baseline on the test set by 1.2%, and it obtains these good results more consistently. Another interesting result is that policy B (previously notable for its ability to generalize) suffers the biggest dip in performance from validation to test in this scenario. Ideally, both evolved policies would outperform the baseline. There are, however, some possible limiting factors; namely, it is possible that the shorter training duration used in scenario 2 discourages the evolution of policies that translate well into scenario 3. This topic is discussed further in Section 4.4.4, where we analyse policy B's shape.
4.4.4 Evolved Policy Shapes. As discussed in Section 1, we are interested in analysing the shapes that our evolved policies take; in this section we analyse the shapes of the previously discussed policies A and B. These policies can be observed in Figures 7 and 8, which show how the learning rate changed over time, as well as a vertical line signalling the epoch at which training using the policy stopped.

Figure 7: Policy A.

Figure 8: Policy B.

Observing the shape of policy A (seen in Figure 7) led to some interesting insights. Initially, it seemed that this policy only had the best performance during evolution because its shape cheated the early stop mechanism. We suspected that, by frequently using a high learning rate, it might be possible to create false improvements that trick the system. To elaborate, it is feasible for a policy to routinely worsen and subsequently improve its performance on purpose in order to pass the early stop check. We can, in fact, observe that this policy is able to train for a long time despite the early stop, as the vertical line shows. As a comparison, the baseline policy typically triggers the early stop between epochs 20 and 30, which means that policy A is able to train for twice as long.

Policy B took on a very different shape. Despite the erratic behaviour shown past epoch 60, this policy is effectively a static learning rate, as its training always ended at epoch 20 (as a reminder, no early stop was used in the evolution of this policy). While this initially seems disappointing (finding an adequate constant is not something that requires such a complex system), it is important to understand that, due to the reduced training duration, the benefits of using a dynamic policy in this context may be negligible, stifling the probability that they show up in evolution. The idea that the evolution of dynamic methods is suppressed under these circumstances is further supported by the fact that all twelve of the best policies during the evolution of policy B were constants. In this context, the twelve best policies are the set of policies that were, at some point during evolution, the best policy across all runs.

We have, up until this point, observed two types of evolved function shapes (within the individuals that perform well). The first type is constants; these comprise the majority of the search space, so their presence is expected. The second type can be observed in policy A; we refer to these as oscillator policies. We believe that these policies are approximations of the policies used in [23]. While it would be disingenuous to claim that we are evolving cyclical policies, it seems feasible that the evolved oscillator policies are effective for the same reasons as the cyclical ones. It is notable that, while many known dynamic policies are decaying policies, we have not observed any well-performing evolved policies with a similar shape.
5 CONCLUSION
In this work we posed the question of whether evolving learning rate policies is a viable way of improving a network architecture's performance. To test this, we designed and developed AutoLR, a framework that optimizes learning rate policies using SGE. This framework was then used to create two evolved policies, which were tested and compared with a widely used baseline policy. Both evolved policies were able to improve on the established baseline in some capacity. Moreover, the network's best recorded performance was achieved with an evolved policy, suggesting that evolving learning rate policies for a specific architecture did in fact improve the network's performance. Additionally, some of the evolved policies resemble man-made policies seen in [23], suggesting that the system might have implicitly discovered the ideas that make such policies effective. In the future we would like to expand the range of policies that can be evolved, to enable meaningful comparisons with a wider array of state-of-the-art methods.
ACKNOWLEDGMENTS
This work is funded by national funds through the FCT - Foundation for Science and Technology, I.P., within the scope of the project CISUC - UID/CEC/00326/2020, and by the European Social Fund, through the Regional Operational Program Centro 2020. Filipe Assunção is partially funded by Fundação para a Ciência e Tecnologia (FCT), Portugal, under the PhD grant SFRH/BD/114865/2016. We also thank the NVIDIA Corporation for the hardware granted to this research.
REFERENCES
[1] Keras. https://keras.io.
[3] TensorFlow. https://www.tensorflow.org.
[4] Filipe Assunção, Nuno Lourenço, Penousal Machado, and Bernardete Ribeiro. 2018. DENSER: deep evolutionary network structured representation. Genetic Programming and Evolvable Machines (27 Sep 2018).
[5] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, NIPS 2011, 1–9.
[6] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (2012), 281–305.
[7] Thomas M. Breuel. 2015. The Effects of Hyperparameters on SGD Training of Neural Networks. arXiv:1508.02788. http://arxiv.org/abs/1508.02788.
[8] John Duchi, Elad Hazan, and Yoram Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization. In COLT 2010 - The 23rd Conference on Learning Theory 12 (2010), 257–269.
[9] A. E. Eiben and James E. Smith. 2015. Introduction to Evolutionary Computing (2nd ed.). Springer Publishing Company, Incorporated.
[10] Kunihiko Fukushima. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics (1980). https://doi.org/10.1007/BF00344251.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. ResNet. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2016). https://doi.org/10.1109/CVPR.2016.90. arXiv:1512.03385.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation (1997). https://doi.org/10.1162/neco.1997.9.8.1735.
[14] Robert A. Jacobs. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks (1988). https://doi.org/10.1016/0893-6080(88)90003-2.
[15] Maarten Keijzer, Michael O'Neill, Conor Ryan, and Mike Cattolico. 2002. Grammatical Evolution Rules: The Mod and the Bucket Rule. In Genetic Programming, 5th European Conference, EuroGP 2002, Kinsale, Ireland, April 3-5, 2002, Proceedings. 123–130. https://doi.org/10.1007/3-540-45984-7_12.
[16] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. (2014), 1–15. arXiv:1412.6980. http://arxiv.org/abs/1412.6980.
[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[18] Nuno Lourenço, Francisco B. Pereira, and Ernesto Costa. 2016. Unveiling the properties of structured grammatical evolution. Genetic Programming and Evolvable Machines 17, 3 (01 Sep 2016), 251–289.
[19] Michael O'Neill and Conor Ryan. 2001. Grammatical evolution. IEEE Transactions on Evolutionary Computation (2001). https://doi.org/10.1109/4235.942529.
[20] Russell Reed and Robert J. Marks II. 1999. Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks. MIT Press.
[21] Conor Ryan, Michael O'Neill, and JJ Collins. 2018. Handbook of Grammatical Evolution. Springer.
[22] Andrew Senior, Georg Heigold, Marc'Aurelio Ranzato, and Ke Yang. 2013. An empirical study of learning rates in deep neural networks for speech recognition. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. https://doi.org/10.1109/ICASSP.2013.6638963.
[23] Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).
[24] Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701.