Neuroevolutionary Transfer Learning of Deep Recurrent Neural Networks through Network-Aware Adaptation
AbdElRahman ElSaid, Joshua Karns, Alexander Ororbia II, Daniel Krutz, Zimeng Lyu, Travis Desell
A Preprint
June 5, 2020

Abstract
Transfer learning entails taking an artificial neural network (ANN) that is trained on a source dataset and adapting it to a new target dataset. While this has been shown to be quite powerful, its use has generally been restricted by architectural constraints. Previously, in order to reuse and adapt an ANN's internal weights and structure, the underlying topology of the ANN being transferred across tasks had to remain mostly the same, while a new output layer was attached and the old output layer's weights discarded. This work introduces network-aware adaptive structure transfer learning (N-ASTL), an advancement over prior efforts that removes this restriction. N-ASTL utilizes statistical information related to the source network's topology and weight distribution to inform how new input and output neurons are to be integrated into the existing structure. Results show improvements over the prior state of the art, including the ability to transfer on challenging real-world datasets where transfer was not previously possible, and improved generalization over RNNs trained without transfer.

Keywords: Neuroevolution · Recurrent Neural Networks · Transfer Learning
∗ Rochester Institute of Technology, Rochester, New York

1 Introduction

As predictive analytics becomes increasingly commonplace, there is a widespread and growing number of real-world systems that stand to benefit from an improved ability to forecast and predict data. In many fields, such as power systems, aviation, transportation, and manufacturing, developing engines for predictive analytics requires time series forecasting algorithms capable of adapting to noisy, constantly changing sensors as well as major system modifications or upgrades. Mechanical systems, such as self-driving cars, robotics, aircraft, and unmanned aerial systems (UAS) in particular, would be significantly improved by the ability to more accurately predict potential equipment and system failures, which is critically important for cost and safety reasons. Developing predictive models for these systems poses a particularly challenging problem, given that these systems experience rapid changes in their operation and sensor capabilities.

Transfer learning potentially offers a solution. It has already proven to be a powerful tool for improving the optimization of deep artificial neural networks (ANNs), allowing them to reuse and fine-tune knowledge gained from training on large, prior datasets. However, transfer learning has mostly been limited to problems which require minimal to no topological changes, i.e., changing the number of neurons and/or their synaptic connectivity, so that the previously trained weights and structure can be more easily fine-tuned to the new target dataset. This is common in ANN specialization, often referred to as "pre-training". Gupta et al. specialized a recurrent neural network (RNN) trained to predict different phenotypes from clinical data by retraining it to predict previously unseen phenotypes with limited data [1]. Zhang et al. applied the same principle when predicting the remaining useful life (RUL) of various systems when data was scarce [2]. Other examples include [3, 4, 2, 5].

Approaches involving some structural network modification include work by Mun et al., which removed an ANN's output layer, replacing it with two additional hidden layers and a new output layer [6]. Partial knowledge transfer has also been done through the use of pre-trained word and phrase embeddings [7, 8]. Hinton et al. proposed the concept of "knowledge distillation", where an ensemble of teacher models is "compressed" into a single pupil model [9]. Tang et al. also conducted the converse of this experiment: they trained a complex pupil model using a simpler teacher model [10]. Their findings demonstrated that knowledge gathered by a simple teacher model can be transferred to a more complex pupil model, yielding greater generalization capability. Deo et al. also concatenated mid-level pre-trained features from two datasets as input to a target feedforward network [11].

Neuroevolutionary approaches have mostly focused on using indirect encodings for transfer learning, such as Taylor et al. utilizing NeuroEvolution of Augmenting Topologies (NEAT [12]) to evolve task mappings that translate trained neural network policies between a source and target [13]. Yang et al. took this concept further by designing networks for cross-domain, cross-application, and cross-lingual transfer settings [14]. Vierbancsics et al. used a different approach, where an indirect encoding was evolved that could be applied to ANN tasks requiring different neural structures [15].

To date, transfer learning research has almost exclusively focused on classification tasks, with most work focusing on feedforward and convolutional neural networks (CNNs), and has mostly avoided any major architectural changes to the ANNs being transferred. While some studies have utilized recurrent neural networks (RNNs), they have not attempted to develop RNN models for time series data forecasting, a very challenging regression problem (see Section 4 for example prediction problems and why traditional statistical auto-regressive forecasting methods like ARIMA are insufficient). This work advances transfer learning into this area, as these capabilities will play a crucial role in developing predictive engines for the previously described systems.

Prior work introduced adaptive structure transfer learning (ASTL), where input and output nodes were added and removed to adapt the structure of an RNN evolved on a source dataset to a target dataset [16]. However, this work did not take into account the overall structure and weight distribution of the transferred network, only adding minimal connections for the newly added input and output nodes. This work proposes a significantly improved approach for transferring RNN knowledge across tasks through network-aware adaptive structure transfer learning (N-ASTL). N-ASTL overcomes the limitations of ASTL by utilizing statistical information about the source RNN's topology and weight distribution to inform the adaptation process. N-ASTL shows significant improvements over previously reported results, including successful transfer learning on the challenging datasets where ASTL previously failed.
2 Methodology

This work utilizes the Evolutionary eXploration of Augmenting Memory Models (EXAMM) algorithm [17] to drive the neuroevolution process. EXAMM evolves progressively larger RNNs through a series of mutation and crossover (reproduction) operations. Mutations can be edge-based: split edge, add edge, enable edge, add recurrent edge, and disable edge; or they can be higher-level node-based mutations: disable node, enable node, add node, split node, and merge node. The type of node to be added is selected uniformly at random from a suite of simple neurons and complex memory cells: ∆-RNN units [18], gated recurrent units (GRUs) [19], long short-term memory cells (LSTMs) [20], minimal gated units (MGUs) [21], and update-gate RNN cells (UGRNNs) [22]. This allows EXAMM to select for the best performing recurrent memory units. EXAMM also allows deep recurrent connections, which enable the RNN to directly use information beyond the previous time step. These deep recurrent connections have proven to offer significant improvements in model generalization, even yielding models that outperform state-of-the-art gated architectures [23]. To the authors' knowledge, these capabilities are not available in other neuroevolution frameworks capable of evolving RNNs, which is the primary reason EXAMM was selected to serve as the foundation for this work. Due to space limitations, we refer the reader to Ororbia et al. [17] for more details on the EXAMM algorithm. The N-ASTL and ASTL implementations have been made freely available and incorporated into the EXAMM GitHub repository (https://github.com/travisdesell/exact).

To speed up the neuroevolution process, EXAMM utilizes an asynchronous, distributed computing strategy that incorporates the concept of islands to promote speciation. This mechanism encourages both exploration and exploitation of massive search spaces. A master process maintains the populations for each island and generates new RNN candidate models from the islands in a round-robin manner. Workers receive candidate models and locally train them with backpropagation through time (BPTT), making EXAMM a memetic algorithm. When a worker completes the training of an RNN, that RNN is inserted back into the island that it originated from. Then, if the number of RNNs in an island exceeds the island's maximum population size, the RNN with the worst fitness score, i.e., validation set mean squared error (MSE), is deleted.

This asynchronous approach is particularly important given that the generated RNNs have different topologies, with each candidate model requiring a different amount of time to train. This strategy allows the workers to complete the training of the generated RNNs at whatever speed they are capable of, yielding an algorithm that is naturally load-balanced. Unlike synchronous parallel evolutionary strategies, EXAMM easily scales up to any number of available processors, allowing population sizes that are independent of processor availability. The EXAMM codebase has a multi-threaded implementation for multi-core CPUs as well as an MPI [24] implementation that allows EXAMM to readily leverage high performance computing resources.

To initialize the island populations, EXAMM "seeds" each island with the minimal network topology possible for the given inputs and outputs, i.e., a topology with no hidden nodes where each input node has a single feed-forward connection to each output node. Each island uses this minimal genome as the first RNN in its population, which is ultimately sent to the first worker requesting an RNN to be trained. Subsequent requests for work from that island create new RNN candidates from mutations of the seed network until the population is full. When an island population is full, EXAMM starts generating new RNNs from that island utilizing both mutation and intra-island crossover (both parents are selected within that same island). When all island populations are full, EXAMM then generates additional new RNNs through an inter-island crossover process. This crossover selects the first parent from the island that an RNN is being generated from and matches it with the most fit RNN from another, randomly selected island, which serves as the second parent.
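As an illustration of this insert-and-evict dynamic, the following minimal Python sketch models a bounded island population and the master's round-robin selection. The class and method names are hypothetical, not EXAMM's actual API.

```python
class Island:
    """Illustrative sketch of one EXAMM island: a bounded population where
    trained genomes are inserted and, once capacity is exceeded, the genome
    with the worst fitness (highest validation MSE) is evicted."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.population = []  # (genome, validation_mse) pairs; lower MSE is better

    def insert(self, genome, validation_mse):
        # A worker finished training `genome`; insert it into its island of origin.
        self.population.append((genome, validation_mse))
        if len(self.population) > self.max_size:
            # Delete the RNN with the worst fitness score (largest validation MSE).
            worst = max(self.population, key=lambda pair: pair[1])
            self.population.remove(worst)

    def best(self):
        return min(self.population, key=lambda pair: pair[1])


def next_island(islands, request_counter):
    """The master generates new candidates from islands in round-robin order."""
    return islands[request_counter % len(islands)]
```

Because workers report results asynchronously, insertion order depends only on when training finishes, which is what makes the scheme naturally load-balanced.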
3 Transfer Learning Strategies

Algorithm 1: Removal of Unused Nodes and Edges

function REMOVEUNUSED(SeedNetwork sn, Param[] targetOutputs, Param[] targetInputs)
    ▷ Remove unused input and output nodes
    for all InputNode i in sn.inputNodes do
        if i.param not in targetInputs then sn.removeNode(i)
    for all OutputNode o in sn.outputNodes do
        if o.param not in targetOutputs then sn.removeNode(o)
    ▷ Mark reachability of edges and hidden nodes
    sn.markForwardReachability()
    sn.markBackwardReachability()
    for all Node n in sn.hiddenNodes do
        if !n.forwardReachable() or !n.backwardReachable() then n.setDisabled()
    for all Edge e in sn.edges do
        if !e.forwardReachable() or !e.backwardReachable() then e.setDisabled()
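The forward/backward reachability marking that Algorithm 1 relies on can be sketched as two breadth-first searches over the network graph. This is an illustrative Python sketch, with hypothetical names and a simple edge-list representation rather than EXAMM's actual data structures.

```python
from collections import deque

def reachable(starts, adjacency):
    """Breadth-first search over a directed graph given as {node: set(successors)}."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def prune_vestigial(edges, inputs, outputs):
    """Return the nodes that are both forward reachable (from an input) and
    backward reachable (to an output), plus the edges between them; everything
    else is vestigial and would be disabled (not deleted), so later mutation
    and crossover operations can re-enable it."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)
    fwd = reachable(inputs, forward)    # path exists from some enabled input
    bwd = reachable(outputs, backward)  # path exists to some output
    keep = fwd & bwd
    live_edges = [(s, d) for (s, d) in edges if s in keep and d in keep]
    return keep, live_edges
```

For example, a hidden node whose only outgoing path was through a removed output node is forward reachable but not backward reachable, so it is flagged as vestigial.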
Prior work on adaptive structure transfer learning (ASTL) proposed a simple scheme for transfer learning in EXAMM by hijacking the island seeding process to modify a previously trained RNN instead of the minimal genome [16]. The previously trained RNN is adapted to a new dataset through the following steps: i) remove unused outputs, ii) connect new outputs to all inputs, iii) remove unused inputs, iv) connect new inputs to all outputs, and v) disable vestigial hidden nodes and edges.

The disabling of vestigial structures is crucial since removing the unused inputs and outputs can potentially disconnect parts of the RNN's topology, which, if retained, would yield wasted computation. To safeguard against this, all edges and nodes are flagged for forward reachability, i.e., there is a path to the edge or node from an enabled input node, and backward reachability, i.e., there is a path from the node or edge to any output. Nodes and edges which are not both forward and backward reachable are labeled as vestigial and disabled. They can, however, later be reconnected and enabled via EXAMM's mutation/crossover operations, essentially bootstrapping the learning process by enabling easy reuse of previously learned neural circuits. This process is formalized in Algorithms 1 and 2.
Algorithm 2: ASTL Seed Network Adaptation

function ASTL(SeedNetwork sn, Param[] targetOutputs, Param[] targetInputs)
    REMOVEUNUSED(sn, targetOutputs, targetInputs)
    ▷ Add new input and output nodes
    for all Param ti in targetInputs do
        if !sn.hasInputForParam(ti) then sn.addInputNode(new InputNode(ti))
    for all Param to in targetOutputs do
        if !sn.hasOutputForParam(to) then sn.addOutputNode(new OutputNode(to))
    ▷ Connect all new input and output nodes
    for all InputNode i in sn.inputNodes do
        if i.param in targetInputs then
            for all OutputNode o in sn.outputNodes do
                sn.addEdge(i, o, weight ← U(−·, ·))
    for all OutputNode o in sn.outputNodes do
        if o.param in targetOutputs then
            for all InputNode i in sn.inputNodes do
                sn.addEdge(i, o, weight ← U(−·, ·))

While ASTL had some preliminary success in transferring RNNs for time series modeling tasks, it only added connections between input and output nodes and ignored the internal latent structure of the network being transferred. Furthermore, it re-initialized all weights in the network, retaining only the source network's structure. In this work, we generalize ASTL to a process we call network-aware adaptive structure transfer learning (N-ASTL), which utilizes information about the seed network to improve the transfer learning process. Our hypothesis is that any existing RNN used as a seed network already contains useful information about the form of its topology as well as its weight distribution, which can be exploited during transfer. Thus, N-ASTL leverages knowledge of a seed network's connectivity and weight distribution to inform how it connects new input and output nodes. N-ASTL involves three strategies, detailed below.
In ASTL, new node biases and edge weights were initialized uniformly at random, U(−·, ·), similar to how EXAMM initializes weights in the minimal seed genome. In N-ASTL, before adapting any structure, the mean, µw, and standard deviation, σw, of the seed network's weights are computed. Afterwards, when new edges are generated during the seed network adaptation process, their weights are initialized according to a dynamic normal (Gaussian) distribution driven by µw and σw, i.e., N(µw, σw). This mirrors how EXAMM performs epigenetic/Lamarckian weight and bias initialization when performing mutation and crossover operations.

Algorithm 3 presents the output-aware input connection procedure for N-ASTL. Similar to the dynamic weight initialization scheme, before adapting any structure of the seed network, the mean, µo, and standard deviation, σo, of the number of outgoing connections of each input and hidden node in the network are calculated. Following this, the unused input and output nodes are removed from the seed network, with any resulting vestigial hidden nodes and edges appropriately disabled. The new output and input nodes are then added to the network. Each new input node is connected to either output nodes or enabled hidden nodes, with the number of connections randomly selected according to a Gaussian distribution, under the restriction that at least one connection must be made, i.e., max(1, N(µo, σo)). This ensures that all input nodes are connected to the seed network in a functional way that also follows a distribution similar to the seed RNN's existing structure.
Algorithm 3: N-ASTL Seed Network Adaptation: Output-Aware Input Connection

function NASTL-INPUTS(SeedNetwork sn, Param[] targetOutputs, Param[] targetInputs)
    REMOVEUNUSED(sn, targetOutputs, targetInputs)
    µw ← sn.getWeightMean()
    σw ← sn.getWeightStdDev()
    µo ← sn.getMeanOutputs()
    σo ← sn.getStdDevOutputs()
    ▷ Connect the new input nodes
    for all InputNode i in sn.inputNodes do
        if i.param in targetInputs then
            nInputs ← max(1, N(µo, σo))
            Node[] nodes ← sn.getEnabledHiddenNodes() ∪ sn.getOutputNodes()
            SHUFFLE(nodes)
            for j ← 1 to nInputs do
                sn.addEdge(i, nodes[j], weight ← N(µw, σw))

Algorithm 4 presents the input-aware output connection procedure for N-ASTL. New output nodes are connected in a way similar to that used for the new input nodes. Before any adaptation, the mean, µi, and standard deviation, σi, of the number of incoming connections of each output and hidden node in the network are calculated. After removing unused input and output nodes (along with disabling vestigial edges and hidden nodes) and then adding the new input and output nodes, each new output node is potentially wired to any input node or enabled hidden node. The number of connections is sampled from a Gaussian distribution over the number of inputs, again with the restriction that at least one connection is made, i.e., max(1, N(µi, σi)).

Algorithm 4: N-ASTL Seed Network Adaptation: Input-Aware Output Connection

function NASTL-OUTPUTS(SeedNetwork sn, Param[] targetOutputs, Param[] targetInputs)
    REMOVEUNUSED(sn, targetOutputs, targetInputs)
    µw ← sn.getWeightMean()
    σw ← sn.getWeightStdDev()
    µi ← sn.getMeanInputs()
    σi ← sn.getStdDevInputs()
    ▷ Connect the new output nodes
    for all OutputNode o in sn.outputNodes do
        if o.param in targetOutputs then
            nOutputs ← max(1, N(µi, σi))
            Node[] nodes ← sn.getEnabledHiddenNodes() ∪ sn.getInputNodes()
            SHUFFLE(nodes)
            for j ← 1 to nOutputs do
                sn.addEdge(nodes[j], o, weight ← N(µw, σw))

Figure 1: The data used for transfer learning comes from three different airframes: (a) Cessna 172 Skyhawk, (b) Piper PA-28 Cherokee, (c) Piper PA-44 Seminole (images under creative commons licenses).
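To make the network-aware connection concrete, the following Python sketch combines the weight statistics used for epigenetic initialization with the output-aware input connection of Algorithm 3 (Algorithm 4 is symmetric). The function names and the edge-list representation are hypothetical, not EXAMM's actual API; fan-out statistics are computed only over hidden nodes here for brevity, whereas the full algorithm also includes input nodes.

```python
import random
from statistics import mean, pstdev

def nastl_connect_input(new_input, seed_edges, hidden_nodes, output_nodes, rng):
    """Sketch of N-ASTL's output-aware input connection (Algorithm 3).
    `seed_edges` is a list of (src, dst, weight) tuples from the seed network."""
    # Weight distribution statistics of the seed network: N(mu_w, sigma_w).
    weights = [w for _, _, w in seed_edges]
    mu_w, sigma_w = mean(weights), pstdev(weights)

    # Mean/stddev of the number of outgoing connections per hidden node.
    fan_out = {}
    for src, _, _ in seed_edges:
        fan_out[src] = fan_out.get(src, 0) + 1
    counts = [fan_out.get(n, 0) for n in hidden_nodes]
    mu_o, sigma_o = mean(counts), pstdev(counts)

    # At least one connection must always be made: max(1, N(mu_o, sigma_o)).
    n_conns = max(1, round(rng.gauss(mu_o, sigma_o)))
    candidates = list(hidden_nodes) + list(output_nodes)
    rng.shuffle(candidates)
    # New edge weights follow the seed network's weight distribution.
    return [(new_input, dst, rng.gauss(mu_w, sigma_w)) for dst in candidates[:n_conns]]
```

The key contrast with ASTL is visible here: targets are drawn from hidden as well as output nodes, the connection count follows the seed network's own fan-out statistics, and weights are sampled from N(µw, σw) rather than a fixed uniform range.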
Figure 2: Example of the pitch and E1 EGT1 parameters from a PA28 flight in the NGAFID dataset.

4 Experiments

To compare to the prior state-of-the-art results reported for the original ASTL, we utilize the same open source aviation-based dataset, which has been provided as part of the EXAMM GitHub repository (https://github.com/travisdesell/exact/tree/master/datasets/2019_ngafid_transfer). The source data for this transfer learning problem consists of flights gathered from the National General Aviation Flight Information Database (http://ngafid.org). It includes three different airframes, with flights coming from Cessna 172 Skyhawks (C172s), Piper PA-28 Cherokees (PA28s), and Piper PA-44 Seminoles (PA44s). Each of the flights came from a different aircraft. The different airframes have significant design differences (see Figure 1): the C172s are "high wing" (wings are on the top) with a single engine, the PA28s are "low wing" (wings are on the bottom) with a single engine, and the PA44s are low wing with dual engines. The flight data consists of per-second readings, with the duration of each flight data file varying from flight to flight. Each airframe shares common sensor parameters; the C172 and PA44 add 7 additional sensor parameters which the PA28 does not have, C172s have 3 additional engine parameters which PA-28s and PA-44s do not have, and PA-44s add further parameters, mostly related to the second engine, which C172s and PA-28s do not have. All available parameters were used as inputs, and Appendix A provides a detailed tabular description of which sensors each airframe has and which were used as prediction outputs (if available). Figure 2 provides an example of the data being predicted, showing the pitch and E1 EGT1 values from a PA28 flight and illustrating the challenges involved. The data is very noisy, containing sudden, non-stationary, and non-seasonal changes as well as varying correlations to the other input parameters. As a result, traditional statistical methods, e.g., those from the auto-regressive integrated moving average (ARIMA) family of models, are not well suited to the task.

All EXAMM neuroevolution runs utilized multiple islands, each with a fixed maximum population size. New RNNs were generated via a mutation rate of 70%, an intra-island crossover rate of 20%, and an inter-island crossover rate of 10%. All of EXAMM's mutation operations were utilized except for split edge, each chosen with a uniform 10% chance. EXAMM generated new nodes by selecting from a set that included simple neurons and ∆-RNN, GRU, LSTM, MGU, and UGRNN memory cells uniformly at random. Recurrent connections could span any time-skip up to a fixed maximum, sampled uniformly.

All RNNs were locally trained for a fixed number of epochs via stochastic gradient descent (SGD), using backpropagation through time (BPTT) [25] to compute gradients, all with the same hyperparameters. RNN weights were initialized via EXAMM's Lamarckian strategy (described in [17]), which allows child RNNs to reuse parental weights, significantly reducing the number of epochs required for the neuroevolution's local RNN training steps. SGD was run with a fixed learning rate η and used Nesterov momentum µ. For the memory cells with forget gates, a positive constant was added to the forget gate bias (motivated by [26]). To prevent exploding gradients, gradient clipping [27] was used when the norm of the gradient exceeded a fixed threshold. To combat vanishing gradients, gradient boosting (the opposite of clipping) was used when the gradient norm fell below a fixed lower threshold. These parameters were selected by hand based on prior experience with this dataset.

Figure 3: Convergence rates (in terms of best MSE on validation data) for the EXAMM runs starting from scratch (Basic C172, PA28 or PA44) compared to starting with a seed network transferred from a different airframe when predicting engine exhaust gas temperature (EGT) values. Panels: (a) C172 to PA28, (b) PA44 to PA28, (c) PA28 to C172, (d) PA44 to C172, (e) PA28 to PA44, (f) C172 to PA44.

One major goal of this work is to facilitate and enable the fast development of predictive systems. In the realm of aviation, this could mean that a given organization may operate a fleet of aircraft of certain airframes and employ RNN estimators as part of their predictive systems. Instead of having to train new RNNs from scratch every time existing airframes are modified, new airframes start being utilized, or sensor systems are upgraded, they can instead adapt RNNs trained on existing airframes and transfer them over for use in these new or modified systems. This transfer process would also require less data compared to training estimators from scratch. To mirror such a scenario, we used
Figure 4: Convergence rates (in terms of best MSE on validation data) for the EXAMM runs starting from scratch (Basic C172, PA28 or PA44) compared to starting with a seed network transferred from a different airframe when predicting the non-engine parameters (AltAGL, IAS, LatAc, NormAc, Pitch and Roll). Panels: (a) C172 to PA28, (b) PA44 to PA28, (c) PA28 to C172, (d) PA44 to C172, (e) PA28 to PA44, (f) C172 to PA44.

EXAMM to evolve and train RNNs on each of the three airframes (C172s, PA28s and PA44s), with each run evaluating a fixed number of generated and trained RNNs (genomes). This was repeated multiple times for each airframe. The flight data files were split into training and validation data, with the first portion of each set of flights used for training and the remainder used as validation data.

To appropriately evaluate our proposed N-ASTL methodology, we utilized the same prediction task defined in [16], which was to predict the exhaust gas temperature (EGT) engine parameters for the target airframe, as well as a new task that entailed predicting non-engine parameters. RNNs predicting on PA28 data would predict engine 1 (E1) EGT1, those for C172s would predict E1 EGT1-4, and RNNs predicting on PA44 data would predict both E1 EGT1-4 and engine 2 (E2) EGT1-4. The non-engine prediction parameters were Altitude Miles Above Sea Level (AltMSL), Indicated Air Speed (IAS), Lateral Acceleration (LatAc), Normal Acceleration (NormAc), Pitch and Roll. We investigated the transfer learning methods using each airframe as both a source and a target, resulting in six different transfer learning examples: C172 to PA28, C172 to PA44, PA28 to C172, PA28 to PA44, PA44 to C172, and PA44 to PA28.
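The gradient clipping and boosting scheme used during the local BPTT training, described earlier, can be sketched as a single rescaling function. This is an illustrative sketch; the threshold values in the usage below are hypothetical, as the paper's actual values were lost in extraction.

```python
import math

def rescale_gradient(grad, clip_threshold, boost_threshold):
    """If the gradient norm exceeds `clip_threshold`, scale the gradient down
    to that norm (clipping, to prevent exploding gradients); if it falls below
    `boost_threshold`, scale it up to that norm (boosting, the opposite of
    clipping, to combat vanishing gradients); otherwise leave it unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return grad  # nothing to rescale
    if norm > clip_threshold:
        scale = clip_threshold / norm
    elif norm < boost_threshold:
        scale = boost_threshold / norm
    else:
        scale = 1.0
    return [g * scale for g in grad]
```

Both operations preserve the gradient's direction and only adjust its magnitude, which is why they can share one rescaling step.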
Table 1: Transfer Learning Tasks

Task           Inputs Added   Inputs Removed   Outputs Added   Outputs Removed
PA28 to PA44   13             0                4               0
PA28 to C172   8              0                3               0
C172 to PA28   0              8                0               3
C172 to PA44   10             3                4               0
PA44 to PA28   0              13               0               7
PA44 to C172   3              10               0               7

Table 1 presents the different transfer learning tasks examined, along with how many input and output parameters were added and removed in each example. Input nodes were added or removed by the previously described strategies so as to utilize all available sensor inputs for the target data. Likewise, outputs were added or removed to predict all the available EGT parameters. For the non-engine parameters, no outputs needed to be added or removed. For each set of repeated runs, the best genome found after all genome evaluations was selected as a seed network for the transfer learning strategies. The selected seed networks were then utilized to evaluate the different adaptation strategies. We examined ASTL and N-ASTL independently, as well as using them together (denoted as ASTL+N-ASTL) for connecting the new input and output nodes. For each of these three strategies, we used either ASTL weight initialization or N-ASTL epigenetic weight initialization (denoted with +epi). For the runs where PA28 was the target, since we only removed inputs and outputs, we only examined ASTL with and without epigenetic weight initialization, as there were no new nodes to connect.
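The added/removed counts in Table 1 follow directly from matching sensor parameters by name between the source and target airframes. A minimal sketch, using hypothetical stand-ins for the NGAFID sensor names:

```python
def adaptation_plan(source_inputs, target_inputs, source_outputs, target_outputs):
    """Parameters present only in the target must be added to the seed network;
    parameters present only in the source must be removed (with any resulting
    vestigial structure disabled, per Algorithm 1)."""
    return {
        "inputs_added": sorted(set(target_inputs) - set(source_inputs)),
        "inputs_removed": sorted(set(source_inputs) - set(target_inputs)),
        "outputs_added": sorted(set(target_outputs) - set(source_outputs)),
        "outputs_removed": sorted(set(source_outputs) - set(target_outputs)),
    }
```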
5 Results

Figures 3 and 4 compare the different seed network adaptation strategies and their combinations, using each of the airframes (C172, PA28 and PA44) as a source to be transferred to each of the other airframes (the targets), evolved to predict the engine parameter values and non-engine parameter values, respectively.

For the experiments using the PA28 data as a target, since no input or output nodes were added, transfer learning was tested with and without the N-ASTL epigenetic weight initialization. In all four cases, epigenetic weight initialization yielded strong error reduction and learned the task much quicker. Additionally, improved results were seen when transferring from the PA44 RNNs as a source, whereas the prior ASTL results had failed.

For the experiments using the C172 data as a target, large improvements were again seen with epigenetic weight initialization. In all cases, N-ASTL+epi learned the quickest and resulted in the lowest error, with N-ASTL+ASTL+epi providing the next best results, followed by ASTL+epi. Even without epigenetic weights, N-ASTL outperformed the strategies involving ASTL connectivity.

For the experiments using the PA44 data as a target, N-ASTL strongly shows its significance. Prior results with ASTL were unable to improve over utilizing EXAMM from scratch on the PA44 data, while in this case, even without epigenetic weights, N-ASTL is still able to improve over EXAMM from scratch. Additionally, the use of epigenetic weights shows a very large improvement, with all three versions showing significant improvement over N-ASTL alone. In three of the four cases, N-ASTL+epi again learned the fastest and found the most accurate results, except in the case of C172 to PA44 on the engine parameter predictions, where ASTL+epi and N-ASTL+ASTL+epi performed comparatively, though slightly, better.
Overall, we find these results strongly positive, as they show that the N-ASTL strategies yield significantly better performance across all experiments. Furthermore, in almost every case, the N-ASTL strategies created better performing RNNs with half the number of evaluated genomes used when evolving RNNs "from scratch". Looking at the curvature of these plots, the RNNs evolved from scratch, as well as the ASTL and non-epigenetic weight tests, were converging to significantly worse performance than the N-ASTL runs. This suggests that transferring RNNs trained on other data and seeding the islands with them in a network-aware manner produces RNN predictors that generalize far better.

Epigenetic weight initialization significantly improved the results in all cases, showing that utilizing network-aware weight distributions for weight initialization is highly important to the transfer learning process. Additionally, in the cases that involved adding or removing inputs and outputs (i.e., those transferring to C172 or PA44), N-ASTL without epigenetic weight initialization outperformed RNNs that were evolved and trained from scratch, suggesting that utilizing network-aware topology information is also important to the process. This is further backed by the fact that in all but two of those cases the N-ASTL+epi runs provided the best results. The two cases where it did not, PA28 to PA44 and C172 to PA44, were those where additional outputs were added; this suggests that when additional outputs are added, making sure they are connected to all inputs is important. On the other hand, making sure new inputs are connected to all existing outputs appears less important.
This work investigates the use of a novel network-aware adaptive structure transfer learning strategy (N-ASTL) tofurther speed transfer learning of deep RNNs. N-ASTL utilizes statistical information about the source RNN’s topologyand weight distributions to inform how it should be adapted to new data sets which have different input and outputparameters, necessitating the use of a different neural architecture. These strategies were evaluated using the challengingreal world problem of performing transfer learning of RNNs trained to predict aviation engine parameters betweenthree different airframes with different designs and engines.N-ASTL provided significant performance improvements over prior state of the art, which did not take into accountnetwork topology or weight distributions. Further, N-ASTL was shown to be able to successfully perform transferlearning on tasks where transfer learning was not previously possible. Interestingly enough, this work shows that inmany cases the transfer learning strategies are able to evolve RNNs that outperform ones which started from scratch butwere evolved and trained on the target dataset for twice as long. In many cases, the curvature of those plots suggestthey would never reach the performance of the RNNs seeded by a transferred network. This suggests that the transferlearning strategy is able to evolve more robust and generalizable RNNs, as performance of the non-transferred RNNslevels off at a significantly lower accuracy than the transfer learning evolved RNNs. These results are significant andshowcase the use of transfer learning as a means to enhance predictive systems in aviation, with applications to otherdomains involving time series data of different input and output dimensionalities.This study also opens up a number of directions for future work. While N-ASTL only seeds EXAMM with a singleadapted network structure, the manner in which it connects new inputs and outputs is stochastic. 
This provides the potential to generate multiple seed network candidates to provide more initial variety to EXAMM's island populations, which could lead to improved reliability and robustness when searching for optimal RNN architectures. Additionally, there appears to be a difference between adding inputs and adding outputs. While the tests which only added inputs but not outputs achieved the best results with N-ASTL using epigenetic weights, the tests which added outputs (those transferring to PA44) showed better performance when also adding in ASTL connectivity. It is worth examining whether N-ASTL plus a modified version of ASTL, one which only fully connects new outputs to inputs, may prove better in these cases. Further, N-ASTL has shown that connecting new inputs and outputs to nodes in the hidden layers is important. It is worth further study to see if simply connecting new inputs and outputs to all hidden nodes provides any benefit.

Lastly, while this work has focused on the challenging problem of time series prediction with RNNs, future work will involve applying N-ASTL to RNN classification tasks, such as natural language processing, to see if transferring between different language dictionaries or word and character embeddings provides similar improvements, as well as to convolutional neural networks, allowing transfer learning between images and output spaces of different shapes and sizes.
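The multiple-seed-candidate direction mentioned above can be sketched as follows; because the adaptation step is stochastic, repeated runs with different random states yield distinct seed networks. The helper name and the toy genome encoding here are illustrative assumptions, not part of EXAMM.

```python
import random

def generate_seed_candidates(base_genome, adapt, n, seed=1234):
    # Run the stochastic adaptation n times with different RNG states,
    # producing n distinct seed networks that could populate EXAMM's
    # islands with more initial variety.
    return [adapt(base_genome, random.Random(seed + i)) for i in range(n)]

# Hypothetical adaptation: wire one new input to a randomly chosen hidden node.
adapt = lambda genome, rng: genome + [("E1 MAP", rng.choice(["h1", "h2", "h3"]))]
candidates = generate_seed_candidates([("IAS", "h1")], adapt, 5)
```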
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Combustion Systems under Award Number
References

[1] Priyanka Gupta, Pankaj Malhotra, Lovekesh Vig, and Gautam Shroff. Transfer learning for clinical time series analysis using recurrent neural networks. arXiv preprint arXiv:1807.01705, 2018.
[2] Ansi Zhang, Honglei Wang, Shaobo Li, Yuxin Cui, Zhonghao Liu, Guanci Yang, and Jianjun Hu. Transfer learning with deep recurrent neural networks for remaining useful life estimation. Applied Sciences, 8(12):2416, 2018.
[3] Seunghyun Yoon, Hyeongu Yun, Yuna Kim, Gyu-tae Park, and Kyomin Jung. Efficient transfer learning schemes for personalized language modeling using recurrent neural network. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[4] Guido Zarrella and Amy Marsh. MITRE at SemEval-2016 Task 6: Transfer learning for stance detection. arXiv preprint arXiv:1606.03784, 2016.
[5] Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. Multi-domain dialog state tracking using recurrent neural networks. arXiv preprint arXiv:1506.07190, 2015.
[6] Seongkyu Mun, Suwon Shon, Wooil Kim, David K Han, and Hanseok Ko. Deep neural network based learning and transferring mid-level audio features for acoustic scene classification. In , pages 796–800. IEEE, 2017.
[7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[9] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[10] Zhiyuan Tang, Dong Wang, and Zhiyong Zhang. Recurrent neural network training with dark knowledge transfer. In , pages 5900–5904. IEEE, 2016.
[11] Ratneel Vikash Deo, Rohitash Chandra, and Anuraganand Sharma. Stacked transfer learning for tropical cyclone intensity prediction. arXiv preprint arXiv:1708.06539, 2017.
[12] Kenneth Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
[13] Matthew E Taylor, Shimon Whiteson, and Peter Stone. Transfer via inter-task mappings in policy search reinforcement learning. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 37. ACM, 2007.
[14] Zhilin Yang, Ruslan Salakhutdinov, and William W Cohen. Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345, 2017.
[15] Phillip Verbancsics and Kenneth O Stanley. Evolving static representations for task transfer. Journal of Machine Learning Research, 11(May):1737–1769, 2010.
[16] AbdElRahman ElSaid, Joshua Karns, Zimeng Lyu, Daniel Krutz, Alexander G. Ororbia, and Travis Desell. Neuro-evolutionary transfer learning through structural adaptation. In The 23rd International Conference on the Applications of Evolutionary Computation (EvoStar: EvoApps 2020), Seville, Spain, April 2020.
[17] Alexander Ororbia, AbdElRahman ElSaid, and Travis Desell. Investigating recurrent neural network memory structures using neuro-evolution. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '19, pages 446–455, New York, NY, USA, 2019. ACM.
[18] Alexander G. Ororbia II, Tomas Mikolov, and David Reitter. Learning simpler language models with the differential state framework. Neural Computation, 0(0):1–26, 2017. PMID: 28957029.
[19] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[20] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[21] Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, and Zhi-Hua Zhou. Minimal gated unit for recurrent neural networks. International Journal of Automation and Computing, 13(3):226–234, 2016.
[22] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capacity and trainability in recurrent neural networks. arXiv preprint arXiv:1611.09913, 2016.
[23] Travis Desell, AbdElRahman ElSaid, and Alexander G. Ororbia. An empirical exploration of deep recurrent connections using neuro-evolution. In The 23rd International Conference on the Applications of Evolutionary Computation (EvoStar: EvoApps 2020), Seville, Spain, April 2020.
[24] Message Passing Interface Forum. MPI: A message-passing interface standard. The International Journal of Supercomputer Applications and High Performance Computing, 8(3/4):159–416, Fall/Winter 1994.
[25] Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[26] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pages 2342–2350, 2015.
[27] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
Appendices
Appendix A Aviation Dataset Parameters
This paper utilizes 12 flight logs each from aircraft of three different airframes: the Cessna 172 Skyhawk (C172), Piper-Archer 28 Cherokee (PA-28), and Piper-Archer 44 Seminole (PA-44). These aircraft share common sensor parameters, and each additionally has varying sensor parameters for its engine(s). The data files used in this work are freely available in comma-separated value (CSV) format as part of the EXAMM GitHub repository: https://github.com/travisdesell/exact/tree/master/datasets/2019_ngafid_transfer. The following table presents which sensors are present for each airframe and which were used as prediction outputs (if available); all available parameters were used as inputs.

| Parameter Name | Cessna 172 Skyhawk | Piper-Archer 28 Cherokee | Piper-Archer 44 Seminole | Potential Output (Engine) | Potential Output (Non-Engine) |
|---|---|---|---|---|---|
| Altitude Above Ground Level (AltAGL) | x | x | x | | |
| Barometric Altitude (AltB) | x | x | x | | |
| GPS Altitude (AltGPS) | x | x | x | | |
| Altitude Miles Above Sea Level (AltMSL) | x | x | x | | x |
| Fuel Quantity Left (FQtyL) | x | x | x | | |
| Fuel Quantity Right (FQtyR) | x | x | x | | |
| Ground Speed (GndSpd) | x | x | x | | |
| Indicated Air Speed (IAS) | x | x | x | | x |
| Lateral Acceleration (LatAc) | x | x | x | | x |
| Normal Acceleration (NormAc) | x | x | x | | x |
| Outside Air Temperature (OAT) | x | x | x | | |
| Pitch | x | x | x | | x |
| Roll | x | x | x | | x |
| True Airspeed (TAS) | x | x | x | | |
| Vertical Speed (VSpd) | x | x | x | | |
| Vertical Speed Gs (VSpdG) | x | x | x | | |
| Wind Direction (WndDir) | x | x | x | | |
| Wind Speed (WndSpd) | x | x | x | | |
| Absolute Barometric Pressure (BaroA) | x | | x | | |
| Engine 1 Cylinder Head Temperature 1 (E1 CHT1) | x | | x | | |
| Engine 1 Cylinder Head Temperature 2 (E1 CHT2) | x | | | | |
| Engine 1 Cylinder Head Temperature 3 (E1 CHT3) | x | | | | |
| Engine 1 Cylinder Head Temperature 4 (E1 CHT4) | x | | | | |
| Engine 1 Exhaust Gas Temperature 1 (E1 EGT1) | x | x | x | x | |
| Engine 1 Exhaust Gas Temperature 2 (E1 EGT2) | x | | x | x | |
| Engine 1 Exhaust Gas Temperature 3 (E1 EGT3) | x | | x | x | |
| Engine 1 Exhaust Gas Temperature 4 (E1 EGT4) | x | | x | x | |
| Engine 1 Fuel Flow (E1 FFlow) | x | x | x | | |
| Engine 1 Oil Pressure (E1 OilP) | x | x | x | | |
| Engine 1 Oil Temperature (E1 OilT) | x | x | x | | |
| Engine 1 Rotations Per Minute (E1 RPM) | x | x | x | | |
| Engine 1 Manifold Absolute Pressure (E1 MAP) | | | x | | |
| Engine 2 Cylinder Head Temperature 1 (E2 CHT1) | | | x | | |
| Engine 2 Exhaust Gas Temperature 1 (E2 EGT1) | | | x | x | |
| Engine 2 Exhaust Gas Temperature 2 (E2 EGT2) | | | x | x | |
| Engine 2 Exhaust Gas Temperature 3 (E2 EGT3) | | | x | x | |
| Engine 2 Exhaust Gas Temperature 4 (E2 EGT4) | | | x | x | |
| Engine 2 Fuel Flow (E2 FFlow) | | | x | | |
| Engine 2 Oil Pressure (E2 OilP) | | | x | | |
| Engine 2 Oil Temperature (E2 OilT) | | | x | | |
| Engine 2 Rotations Per Minute (E2 RPM) | | | x | | |
| Engine 2 Manifold Absolute Pressure (E2 MAP) | | | x | | |
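As a concrete illustration of what transferring between airframes entails, the following sketch computes which input parameters a transferred network must gain or drop given two airframes' parameter sets. The parameter subsets used here are small excerpts for illustration, not the full sensor lists above.

```python
def transfer_deltas(source_params, target_params):
    # Inputs the transferred RNN must gain, and inputs it loses,
    # when moving from the source airframe to the target airframe.
    added = sorted(target_params - source_params)
    removed = sorted(source_params - target_params)
    return added, removed

# Illustrative excerpts of the C172 and PA-44 parameter sets.
c172 = {"AltMSL", "IAS", "E1 EGT1", "E1 CHT1"}
pa44 = {"AltMSL", "IAS", "E1 EGT1", "E2 EGT1", "E1 MAP"}
added, removed = transfer_deltas(c172, pa44)
# added == ["E1 MAP", "E2 EGT1"], removed == ["E1 CHT1"]
```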