Augmenting Neural Differential Equations to Model Unknown Dynamical Systems with Incomplete State Information
Robert R. Strauss
Los Alamos High School, Los Alamos, NM 87544
[email protected]
August 25, 2020

ABSTRACT
Neural Ordinary Differential Equations [1, 2] replace the right-hand side F(.) of a conventional ODE with a neural net, which, by virtue of the universal approximation theorem [3], can be trained to represent any function [4, 2]. Its input is the current state vector x̂ and its output is the derivative:

∂x̂(t)/∂t = F[x̂(t)], with initial condition x̂(0) = x̂₀.    (1)

If we knew the true, analytic function F(.), then we could train a neural net approximation directly on the output of F(.) for varied inputs. But when we do not know the function itself and have only the state trajectories (the time evolution of the ODE system), we can still train the neural ODE to learn a representation of the underlying but unknown ODE.

However, if the state vector at a given time, x̂, is incompletely known, then the right-hand side of the ODE is generally undefined. The derivatives needed to propagate the system are then unavailable, preventing training against the state trajectory data.

We show here that a specially augmented Neural ODE (ANODE) [5] can learn the system model even when given incomplete state vector x̂ information. As a worked example, we apply neural ODEs to the Lotka-Volterra problem of three interacting species: rabbits, wolves, and bears. We show that even when all the data for the bear population time series is removed, the remaining time series of the rabbits and wolves is sufficient to learn the dynamical system, despite the incomplete state information and no knowledge of the ODE functional form. This is surprising since a conventional right-hand side ODE function cannot output the correct derivatives without the full state as input. A core, desirable property of every ODE system, preserved by the neural ODE, is that trajectories in the state space cannot cross.
The primary purpose of the ANODE architecture is not to predict the missing information but rather to enforce that non-crossing property even with missing state information. Surprisingly, we show that only by luck will the missing bear population be predicted correctly, even though the trajectory of the wolf and rabbit populations is extrapolated correctly. In essence, the ANODE finds a different neural ODE solution that has behavior identical to the true one on the known channels, but the functional form is not unique. Finally, we implemented the augmented neural ODEs and differential equation solvers in the Julia programming language [6] within an existing neural ODE framework [7].
The evolution of a system can be computed when it is encoded as an ordinary differential equation. Unfortunately, it is common not to know the functional form of a system of interest. By appeal to the universal approximation theorem [3], a neural network can be trained to approximate the ODE governing the system. Since a neural ODE [1, 2] can be trained from data, it can learn the rules governing a physical system with an unknown ODE functional form.

However, an ODE cannot be evolved when the state of the system is incompletely known. For example, the rate of change of populations in the Lotka-Volterra model of a predator-prey system depends only on the predator and prey populations; but if the number of predators is not known, then the rate of change of the prey cannot be computed, and so the system cannot be evolved in time.

A complication occurs in training a neural ODE when the state trajectory is missing for some of the state variables. For example, if we know only the prey population time series but not the predator population, then the training loss function won't demand the predator population be predicted correctly.

Intuitively, there is some reason not to despair. If we knew the exact functional form of the right-hand side F(.), but not its parameters, then, in favorable cases, we might still be able to infer a varying predator population just by observing the prey population over time. That is, a falling prey population requires there exist a predator, and it falls faster the more numerous the predators are. Thus perhaps we might infer the proportional fluctuation of the predators from just the prey population dynamics, even without knowing the initial condition.

A worse situation occurs when the reason we lack the data on a variable is that we don't even know the variable is a vital part of the state equations.
For example, if we were not aware bears existed, a two-species ODE for rabbit and wolf population dynamics would be missing dependencies on an evolving bear predator population. This doesn't simply make the problem harder; it may entirely prevent a solution by a two-variable ODE. The feasible trajectory space of a two-variable ODE (rabbits and wolves) does not include every trajectory available to those same two variables when embedded in a three-variable ODE (rabbits, wolves, and bears).

This paper examines that hard challenge: whether the system can still be solved when the entire functional form of the right-hand side is unknown and, simultaneously, all of the training data for one state variable's trajectory is missing. Furthermore, we restrict the training to short time series but require that the model extrapolate to long-duration time series in testing. In each case study, we find the neural ODE successfully learns from just short time-series data and then successfully predicts over much longer time spans, even for initial conditions absent from the training set. The usefulness of neural ODEs for modeling unknown dynamical systems is thus demonstrated.

Our principal work on this was completed in October 2019 and was subsequently published and presented in multiple peer-reviewed venues [8, 9, 10, 11]. In those we described our two novel neural ODE architectures as "recurrent external channel" neural ODEs to distinguish them from the conventional recurrent neural network (internal hidden nodes). As it happens, one of our two architectures is similar to one presented at NeurIPS 2019 by Dupont et al., but neither group was aware of the other's contemporaneous development, and so until now we have not cited each other [5, 12]. In the interest of not fragmenting public terminology, we now adopt Dupont et al.'s catchy acronym "ANODE" in this summary publication.
The "recurrent external channels" described in our previous work are logically the same as the augmentation variables "â(t)" in Dupont et al.

Dupont et al. addressed a different problem than the one we posed. Moreover, that work set the undefined initial condition of the augmentation variables to a constant (zero), which was sufficient for their posed problem. In general, however, this does not fully resolve the trajectory-crossing problem in ODE phase space that the ANODE (a.k.a. recurrent external channel) architecture was intended to address in the first place. In particular, it did not provide feasible solutions in the problems we posed for unknown systems (case 2 or case 3 below). To fully resolve this in case 3, we added a second neural net to compute unique initial conditions for the recurrent external channels, conditioned on the data itself. Furthermore, this required that we train both of these neural nets simultaneously. (See Methods section.)

For our case-1 (validation) experiment, we deliberately chose an ODE functional form and neural architecture identical to those of Rackauckas et al. to conveniently verify our implementation. Surprisingly, we observed a different, significantly improved outcome arising from a modest change in the training protocol. The changes (discussed below) appear to make the solution process more robust for somewhat stiff oscillatory equations, and that protocol advancement was critical in the more challenging cases 2 and 3 presented below.
As our concrete example, we shall use predator-prey population evolution modeled by the Lotka-Volterra equations (2). We refer to the species in this model as rabbits, which are prey only, and wolves, which eat rabbits. The canonical two-species Lotka-Volterra equations are:

∂r/∂t = αr − βrw
∂w/∂t = δrw − γw    (2)

where w is the wolf population, r is the rabbit population, and α, β, γ, and δ are constants.

Adding bears as an apex predator, eating both rabbits and wolves, the three-species system becomes:

∂r/∂t = αr − βrw − εrb
∂w/∂t = δrw − γw − ζwb
∂b/∂t = ηrb + θbw − ιb    (3)

where b is the bear population and all Greek letters are constants.

To train the neural ODE we need training and testing data sets in the form of populations evolving over time. We arbitrarily choose some "true" parameters (α, β, ε, δ, γ, ζ, η, θ, ι) and then integrate the ODEs to generate the "true" population evolution over time. We sample this continuous trajectory coarsely over a limited time range to generate the training data supplied to the neural ODE (which we may pretend are measurements from the field). (For purposes of plotting the "true" or predicted time series, we show continuous or finely sampled waveforms.)

After generating the data sets, no part of this "true" equation is used in the neural ODE. None of the parameters of the true equation are supplied to the neural ODE, nor is the functional form given. The goal here is not to invert the neural ODE to an analytic function or to estimate the parameters of the "true" function; it is only to make a neural net that correctly extrapolates the system dynamics. Thus, once we create the data, the reader could safely forget they ever saw the true ODE. This aligns with the target application of inferring neural ODEs for systems with unknown physics, for which the equations would, of course, not be provided like this.

This subsection is a quick primer just for readers unfamiliar with neural ODEs.
A system can be simulated by integrating its differential equation. This means adding the rate of change of the state to the current state to get the future state, then repeating to produce a time series of the state of the system.

To avert a common misunderstanding, we note that the neural net does not directly predict the state of the system at a future point in time. It acts only as a replacement for the functional form of the ordinary differential equation, meaning a neural ODE must also be integrated through time to reach the future point [13]. Alternatively, one can view it as the asymptotic limit of a residual neural net with an infinite number of layers [14, 15].

At first, this seems counter-intuitive. Direct prediction would take a single evaluation passed through the neural network, while integrating may take thousands of evaluations, and so seems much less efficient. However, integrating through time not only produces better results but, more importantly, preserves the non-crossing trajectory rule of ODE systems, which is not preserved by a neural net on its own. Since physics very often takes the form of a differential equation, this embeds strong restrictions on the possible outcomes. Enforcing this deterministic, rate-of-change structure on the neural ODE makes it easier to learn the correct physics. Additionally, this form can enforce physical conservation laws.
One may ask: how is the neural net weight training accomplished when thousands of iterative passes through the integrator and neural net are needed to compute the trajectory used in the loss function? Backpropagating to the neural weight gradients requires the integrator itself to support automatic differentiation.
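To make the points above concrete, here is a minimal sketch of a neural ODE forward pass (in Python for accessibility; the paper's implementation is in Julia). A small, untrained, randomly initialized MLP stands in for F(.), and a fixed-step Euler loop integrates it; the architecture and step sizes here are illustrative stand-ins, not the paper's.

```python
import math, random

# A small MLP plays the role of F(.): it maps the current state to its time
# derivative. An integrator then turns those derivative outputs into a
# trajectory. Real training would backpropagate through this whole loop.
random.seed(0)

def make_mlp(n_in, n_hidden, n_out):
    # one tanh hidden layer; weights are random (untrained) for illustration
    W1 = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[random.gauss(0, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    def f(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return f

F = make_mlp(2, 16, 2)  # 2 state variables in -> 2 derivatives out

def integrate(F, x0, dt, n_steps):
    xs = [list(x0)]
    for _ in range(n_steps):
        dx = F(xs[-1])  # the net predicts the rate of change, not the state
        xs.append([x + dt * d for x, d in zip(xs[-1], dx)])
    return xs

trajectory = integrate(F, [1.0, 1.0], 0.01, 100)  # 100 evaluations of F
```

Note that reaching one future time point costs many evaluations of F, which is exactly the trade-off discussed above: the integration loop is what buys the continuity and non-crossing guarantees.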
We aim to show neural ODEs can learn to model a system governed by unknown equations with missing state variables. For pedagogical illustration, we proceed in three stages of increasing complexity. First, a neural ODE is trained to match two-species dynamics; then three-species dynamics with one species' population evolution withheld from the training to simulate an unobserved variable in the state; and finally, three-species dynamics with all state information about one species deleted. The first case is a well-known test problem, included to introduce our notation and validate our process. It can be solved with a simple neural ODE containing a neural network with two inputs (the populations of the two species) and two outputs (their respective rates of change). Novel augmented neural ODE architectures are required to solve the second two problems. These incorporate an augmented recurrent channel (Fig 3), and also a second external neural net to provide an initial condition (Fig 4).

1. Can a neural net predict how quickly wolves will eat rabbits?
(a) First, to test whether a neural ODE really can learn to model a system without knowing an equation (just from data), we train a neural ODE to model a two-species predator-prey population system without giving it any knowledge of the equations.
2. What if there are uncounted bears in the forest?
(a) Next, to test how the neural ODE handles an unmeasured variable, we add a third species to the system but hide its data, except for the initial population, from the neural ODE.
3. What if we had no idea there were bears in the first place?
(a) Finally, to test how a neural ODE handles a hidden-variable problem, we fully hide the third species from the neural ODE and force it to model the other two species. It has no hint that there even is a third species.

In each case, the training is on the species populations "measured" over short time spans at coarsely sampled time intervals, then tested on finely sampled longer time spans with different initial conditions. The loss function is the total squared difference between the "measurements" (the time series of the true populations) and the prediction (the time series from integrating the neural network). Training is done in mini-batches of different initial conditions.
The neural ODE is trained on population time series data with nothing withheld. This control validates that the neural ODE is sufficiently deep to reproduce the ODE dynamics and that the training protocol is sufficient on short time spans. Testing compares the extrapolation accuracy on long time-span test sets.
In this case, the data for the bear population is withheld while training the neural network, except that we supply the true bear population at t=0 as the initial condition. The training loss function contains just the wolf and rabbit populations, ignoring the bear population. If you like, you may imagine it as a partially known variable: you know bears are in the forest affecting the system, but they are measured only once, at the start, because it is too dangerous for the graduate students to repeatedly count bears in the woods.

One can see that this architecture has the capacity to correctly represent the differential equation, since it has three inputs and three outputs. That is, if the neural network just happened to begin emitting the correct bear rate of change given the correct number of bears as input, that would be self-consistent and thus would work just as well as the previous model. But one might doubt it can be successfully trained to achieve that. First, since there is no constraint on the "bear" channel in the loss function, it does not have to emit the correct bear population to achieve low loss. It could, for example, emit any monotonic transformation of the bear population (e.g. the cube root of the tangent of the bear population), because
there are infinitely many possible ODE functional forms that are self-consistent with the wolf and rabbit population trajectories. Furthermore, this ambiguity might make it too hard to learn any model at all without the helpful constraint of the withheld bear population evolution in the training data.

The ODE neural net has three inputs. Two of these are the wolf and rabbit populations. Intuitively, we would like to think of the third one as "bears". But as we shall see, it does not turn out to be the bear population; it is just an additional recurrent external channel. There are three outputs, which provide the rates of change of the rabbit, the wolf, and the recurrent external channel to the integrator.
This is almost the same as the previous test, except the external channel is not initialized to the bear population at t=0. We devised another approach to initializing it. Simply supplying a constant, like zero, would be unlikely to generalize to a correct result on testing data. This can be seen by imagining two test data sets generated starting with the same initial number of rabbits and wolves but different numbers of bears. The different number of bear predators means the trajectories must be different, but any trained ANODE has a single deterministic output for the same three inputs (wolves, rabbits, and zero). Thus it could not match both data sets. This is a special case of the general rule that ODEs cannot produce crossing trajectories.

Thus more than just the ANODE architecture is needed to resolve this. We implemented a separate neural network outside of the neural ODE which receives not just the t=0 wolf and rabbit populations but the first ten time points of the wolf and rabbit populations. It outputs a single scalar to initialize (at t=0) the recurrent external channel. Thus the initial conditions delivered to the neural ODE itself are the ordinary t=0 values of the rabbit and wolf populations, as well as this derived initial condition for the recurrent channel.

Note that both the initial-condition predictor and the neural net inside the ODE must be trained simultaneously. We are not separately training the external net to predict some known bear population; we would not have that initial bear information, since we are pretending we don't know there are bears at all. This means we back-propagate through both neural nets, including through all iterations of the solver and neural ODE, and onward through the initial-condition-determining network.
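The case-3 forward pass can be sketched as below (Python for accessibility; the paper's code is Julia). Both networks are placeholders with made-up weights and dynamics, since the point is only the wiring: an initializer net maps the first ten observed (r, w) points to a scalar initial condition for the augmented channel, and the full three-component state is then integrated.

```python
import random

# Placeholder networks: in the paper both are trained jointly by
# backpropagating through the solver; here they are untrained stand-ins.
random.seed(1)
_init_weights = [random.gauss(0, 0.1) for _ in range(20)]

def init_net(rw_window):
    # 20 inputs (10 rabbit/wolf pairs) -> 1 scalar initial condition
    flat = [v for pair in rw_window for v in pair]
    return sum(w * v for w, v in zip(_init_weights, flat))

def neural_ode_rhs(state):
    # 3 inputs -> 3 derivatives; illustrative dynamics, not a trained net
    r, w, a = state
    return [0.1 * r - 0.05 * r * w, 0.05 * r * w - 0.1 * w, 0.01 * a]

def forward(rw_window, dt, n_steps):
    a0 = init_net(rw_window)                        # derived initial condition
    state = [rw_window[0][0], rw_window[0][1], a0]  # r(0), w(0), a(0)
    traj = [state]
    for _ in range(n_steps):
        d = neural_ode_rhs(state)
        state = [s + dt * ds for s, ds in zip(state, d)]
        traj.append(state)
    return traj

window = [(1.0 + 0.01 * k, 1.0 - 0.01 * k) for k in range(10)]
traj = forward(window, 0.01, 50)
```

The loss on the rabbit and wolf channels of `traj` would drive gradients back through the integration loop and into both `neural_ode_rhs` and `init_net` simultaneously, as described above.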
Through training, the prediction from integrating the neural ODE goes from uninformed and inaccurate to matching the true solution of the Lotka-Volterra equations. The gradual progression of training to match the Lotka-Volterra equations is shown in figure 1. To test whether the neural network actually learned to generalize rather than memorize, the model is applied to an initial condition it hasn't trained on and, even more stringently, tested over a longer time span, so it must extrapolate using the F(.) it has learned. The plots in figure 2 are the results of these tests in each experiment.

To test what it learned, the neural ODE and Lotka-Volterra equations are solved with an initial condition not seen in the training set and over a longer period of time than in training. Figure 2a shows the prediction from the neural ODE matches the true solution to the Lotka-Volterra equations very well under these testing conditions. This shows the neural net was successfully trained by back-propagation through the ODE solver. That it extrapolates to the held-out testing data establishes that the neural network learned the physics of the system rather than memorizing the training data. This demonstrates neural ODEs are able to learn the physics of unknown systems if no data is withheld.
Again, to test what it learned, the neural ODE and Lotka-Volterra equations are solved with an initial condition not seen in the training set and over a longer period of time than in training. Figure 2b shows the prediction from integrating the neural ODE matches the true solution to the Lotka-Volterra equations in only two of the three species. This is confusing at first. How can the neural ODE model the rabbits and wolves without knowing the bear population?
Figure 1: Training (Wolves and Rabbits). All panels: x-axis is time, y-axis is population. Dots are the prediction from integrating the neural ODE; lines are the true solution to the Lotka-Volterra equations. The four panels show successive stages of training: (a) before training, (b) after some training, (c) after more training, (d) after even more training. By the end, the dots align well with the lines, meaning the trained neural ODE accurately predicts the true solution.
We were initially surprised that the recurrent external state variable did not match the bear population. It seemed that, since we gave it the true bear population as the initial condition, the simplest outcome would be to reproduce the original ODE. But it did not, and it didn't have to, because the loss function did not impose this. The network doesn't need to predict the bear population; the recurrent channel only needs to transmit and update some information that aids it in predicting the other two populations. However, it is hard to imagine a general class of transformations that would also be self-consistent with every initial value matching the bear population at t=0. Thus we expected it would be biased toward finding the true bear population, and we were surprised when it found some alternative solution.
Figure 2c shows the prediction from integrating the neural ODE matches the true solution to the Lotka-Volterra equations in two of the three species, and has different behavior in the third. Again, only the rabbit and wolf populations match, and they match very well. Notice that, unlike the previous case, it now does not match the initial value of the bears. It does seem to have oscillatory dynamics like the true bear population, but the magnitudes and phases evolve differently over time.
We have shown not only that a neural ODE can be trained to model systems with unknown physics from measurements alone, but that it can resolve cases where the system (1) has known but unmeasurable variables or (2) depends on unanticipated and unmeasured state variables. While this has the suspicious aura of magic, not physics, it is simply because knowing that something can be modeled by an ODE, even without knowing the ODE, imposes very strong constraints on the outcome. The neural ODE formulation is thus an efficient and highly adaptable means of learning the
Figure 2: All panels: x-axis is time, y-axis is population; lines are the solution of the Lotka-Volterra equations, dots are the prediction from integrating the neural ODE. In the three-species panels, blue is rabbits, orange is wolves, green is bears; purple dots are predicted rabbits, yellow dots are predicted wolves, blue dots are the predicted bears/recurrent variable. In the two-species panel, blue is rabbits, orange is wolves; green dots are predicted rabbits, purple dots are predicted wolves. (a) Testing (Wolves and Rabbits): the prediction from integrating the neural ODE matches the true solution of the Lotka-Volterra equations, showing neural ODEs can learn the physics of a system and extrapolate to other initial conditions when provided enough training data. (b) Testing (Hidden Bears): two of the three dotted series match the lines, while the third behaves differently, showing neural ODEs can learn to model some variables of a system without training on data for the others. (c) Testing (What Bears?): two of the three series match, while the third dotted series is far from its corresponding line, showing neural ODEs can model some variables of a system while other components are completely hidden in training.
underlying model from data while automatically filtering out all ODE-incompatible models. This achieves something that might otherwise be incomprehensible or defy easy parameter estimation techniques.

Our case 1 was intentionally patterned after the Lotka-Volterra neural ODE studied by Rackauckas et al. so we could compare results. Rackauckas et al. investigated the problem of reverse engineering a simple analytic functional form from a trained neural ODE. They also trained on short-time data sets, and found that the reverse-engineered analytic function could extrapolate to longer time spans far better than their trained neural ODE. Intriguingly, while their trained neural ODE extrapolated poorly, our identical-architecture neural ODE extrapolated very well for all the parameter variations we tried. We speculated that subtle training differences might matter more than one might expect, including (1) small changes in the number of time points in the interval, (2) small changes in the time span (ours was the typical interval between population spikes), and (3) the use of batch training from many different initial conditions.

Tentatively exploring these, we found only the last had a profound impact. Training by serial stochastic gradient descent usually terminated in quite poor local minima. In contrast, mini-batch training was excellent. Counter-intuitively, we observed that giving the training process data over longer time spans produced even worse results. Intuitively, spans covering many oscillatory spikes should have been more informative, just as Fourier estimation works better with time ranges longer than a period. But evidently they are practically harder for gradient methods to optimize in the context of an ODE integration process. Instead, when those long intervals were split up into short-interval mini-batches, the training was efficient and converged to better solutions.
That is, short time spans of spiky time series are much easier to optimize but perform poorly in extrapolation; mini-batches, however, allow gradients on all the different phases of the population cycle at once, enabling both easier fitting and better extrapolation. Thus the most critical difference for extrapolation was not simply the total amount of data we used compared to Rackauckas et al. (who used less) but the division of long multi-cycle time series into batches of short time series. We think this is an unexpected and valuable practical finding. (Chris Rackauckas has since updated the neural ODE optimization API and added documentation to the
DiffEqFlux.jl library to facilitate batch learning. Unfortunately, our code, available on GitHub, predates that, so it does not demonstrate this nice new API.)

In cases 2 and 3, the values of the recurrent external state variable did not match the bear truth data. The functional form the neural net has learned is not the canonical Lotka-Volterra equations. For example, in the canonical form there is a simple quadratic monomial (wolves times bears) representing the loss bears impose on wolves. This could be replaced by other ways of coupling these state variables when we have the freedom to make the external state variable anything we want. For example, one could make this channel tangent(bears*wolves+rabbits), and there would still exist some ODE set that would correctly propagate the populations of wolves and rabbits. So the neural net is locked into neither using the same state variables nor estimating the same functional form used to generate the data.

Thus, except by chance, the third channel should never give the original "bear" population. Even when we biased the system toward being bear-like in case 2 (by supplying the "bear" initial condition), the external channel still did not discover a state variable that followed the bear population. It is simply an abstract external channel that provides the recurrent information necessary to make this a 3-variable, not a 2-variable, state space.

This raises an intriguing set of questions that are beyond the scope of this work. Namely, does the additional freedom unleashed by the infinite number of possible solutions make the training process "easier" or "harder"? Given stochastic descent as the training method, will random mini-batch orders lead to different solutions, or are there attractors that recur frequently? If one reverse engineered many different functional forms of the implicit ODE in the manner of Rackauckas et al., would they all belong to some easily described transformation group?
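The batching protocol discussed above (splitting long multi-cycle time series into short windows, each treated as its own initial condition and target) can be sketched as follows. The window length and stride here are arbitrary illustrative choices, and the sketch is in Python rather than the Julia of our implementation.

```python
# Split one long trajectory into many short (initial condition, target
# series) training examples; mini-batches then mix windows drawn from all
# phases of the population cycle.
def make_windows(trajectory, window_len, stride):
    windows = []
    for start in range(0, len(trajectory) - window_len + 1, stride):
        seg = trajectory[start:start + window_len]
        windows.append((seg[0], seg))  # (initial condition, target series)
    return windows

long_series = [[1.0 + 0.1 * t, 1.0 - 0.05 * t] for t in range(100)]
windows = make_windows(long_series, window_len=10, stride=5)
```

Each mini-batch would then sample several of these windows, integrate the neural ODE from each window's initial condition, and sum the losses before taking a gradient step.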
Our new neural network architecture was designed specifically to address the central problem arising from incomplete state specifications. It preserves the core property of all stationary ODEs: trajectories don't cross. Crossing trajectories do not occur in the fully specified ODE of a known system simply because the next integration step along the time derivative ∂x̂(t)/∂t = F[x̂(t)] depends only on the current position x̂(t). Two ramifications follow for unknown systems. First, if one is unaware of an additional member variable of an N-dimensional x̂, then a position in the reduced state space of N−1 dimensions can map to many states in the N-dimensional space with different gradients, and so
paths cross. Second, even if one knows the problem is N-dimensional, if the initial condition isn't specified for all the elements of x̂, the same problem arises.

Figure 3: Diagram of the structure of the first new augmented neural ODE. Wolf and rabbit populations are input to the neural ODE and their respective rates of change are output. The third variable (previously symbolizing the bear population) is also input, and its rate of change is also output.

The first problem is solved by augmenting an N−1 dimensional ODE with an additional idler variable meant to make this mapping unique again. We may not know what this variable is, but we just add it on. This need not be done blindly: we need only add as many variables as required to remove trajectory crossings in the training data.

Figure 4: Diagram of the structure of the second new augmented neural ODE. Neural net 1 takes in the first 10 data points of the wolf and rabbit populations and outputs the initial condition for the third variable. (This is not trained to output the exact bear population; it is trained in conjunction with the neural ODE, so it only guesses a useful number, not the real bear population.)

Second, how do we initialize these new variables, or any other variables for which we lack initial data? We address this by training a second neural net which takes in a time series of the variables whose trajectory we do know and outputs a unique initial condition. Thus the trajectory ambiguity of the unknown initial condition is removed, and again paths don't cross. We note that this degeneracy could be broken by any arbitrary mapping of known data to a unique initial condition (e.g. a hash function), but such a mapping would be unlikely to have nice continuity properties and would thus require a deeper neural net to handle the arbitrary discontinuity. Therefore, we used a neural net to estimate the initial condition more smoothly. We implemented the augmented neural ODEs and differential equation solvers in the
Julia programming language [6] within an existing neural ODE framework [7], using Julia version 1.0. The truth ODE function for the Lotka-Volterra equations was written and solved in each case to generate truth data. Neural networks were created with the Flux package [16]. Differential equations were solved with the DifferentialEquations package [17], and back-propagation through a neural ODE used the DiffEqFlux package [4].

The neural ODE network structure was intentionally kept the same as ones used for solving the canonical Lotka-Volterra problem: 3 dense hidden layers, 32 nodes wide, using swish activations. These had 2 (case 1) or 3 (cases 2 and 3)
inputs/outputs. The structure of our initial-condition neural net was 20 inputs (10 each for r and w) and one output (the initial condition for the external channel), with 3 hidden dense layers, 32 nodes wide, with swish activation. Training used the Flux.jl package [16] with ADAM optimization. The code and documentation written for this project can be found online at https://github.com/robertstrauss/nnDiffEq .

For the benefit of readers less familiar with neural ODEs, this subsection raises and answers a few straw-man questions that we have been asked many times at presentations of this work.

The enabling innovations of this work are the "recurrent ODE channel" and the process of initializing it. Crucially, this is distinct from a "recurrent channel" in the conventional neural network sense of the term "recurrent". Unlike what is normally meant by a recurrent node, this "recurrent channel" actually passes through the ODE solver too and is integrated.

Besides imposing the ODE trajectory rules on the channel, there is also a crucial practical reason we cannot (easily) use a conventional, non-integrated, recurrent neural network internal state. The ODE solver algorithm itself might jump back and forth in time (e.g. adaptive step size, successive approximation, backward-forward algorithms). By letting the channel be differential, we let the solver itself handle the intricate bookkeeping of the time spacing and ordering of its calculations, just as it does for integrating the other, normal channels. For clarity, we note that for the special case of a simple, fixed-time-step, forward Euler integrator, one could in principle get away with the recurrent channel being a conventional hidden internal state instead of having it output the derivative of the channel to be passed through the integrator.
But in that trivial case, there would also be no need to pass any of the channels through the integrator, since a recurrent neural net could also be tasked with learning the Euler integration itself.

Since the use of an ODE integrator seems to add complexity, one might ask: why not just have a neural net directly predict the entire solution instead?
Surprisingly, the ODE actually simplifies the process while also ensuring the solution has the properties we expect. An ODE forces the curves to be continuous and non-crossing no matter how sparsely sampled (something not enforced by an ordinary neural net). Thus we are exploiting the prior knowledge that there is some way of describing the system with an ordinary differential equation, even if we do not know the ODE or even all the variables of the system. Similarly, this can enforce conservation laws. Although in this example there is no conservation law between wolf and rabbit populations, in other cases this becomes important.

It is also simpler because it is recurrent: one uses the same neural net weights for every output node of the trajectory, no matter when it occurs in the time series, so there are far fewer weights to train. Being smaller, the network is easier and faster to optimize. And it can ensure the solution obeys consistent behaviors. For example, in a stationary ODE, if all the free variables in a system return to a set of values that they already had, the same behavior should follow now as when it previously happened. If something happens twice, expect the same outcome each time, regardless of when it occurs. Direct prediction from a recurrent neural network can violate this principle.
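This "same state, same behavior" property of a stationary ODE can be checked numerically: because the right-hand side depends on the state alone and never on the clock, the flow from a given state is identical no matter when that state occurs. A small Python sketch, using illustrative two-species coefficients (not the paper's fitted values):

```python
import numpy as np

# Autonomous two-species Lotka-Volterra right-hand side: a function of
# the state alone, never explicitly of time.  Coefficients are illustrative.
def f(x):
    r, w = x
    return np.array([1.1 * r - 0.4 * r * w,
                     0.1 * r * w - 0.4 * w])

def euler(x0, dt, nsteps):
    x = np.asarray(x0, dtype=float)
    for _ in range(nsteps):
        x = x + dt * f(x)
    return x

# Flow (semigroup) property: running 300 steps from x0 lands on the same
# point as running 100 steps, then restarting for 200 more steps from
# wherever the system got to.  A net predicting x(t) directly from t is
# under no obligation to satisfy this.
x0 = [2.0, 1.0]
one_run = euler(x0, 0.01, 300)
restart = euler(euler(x0, 0.01, 100), 0.01, 200)
assert np.allclose(one_run, restart)
```

The same invariance is what makes trajectories in state space unable to cross: two crossing trajectories would demand two different derivatives at the same state.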
Why didn't we use a parameter-fitting model containing dozens of guesses at possible mathematical combinations of the input (such as a Taylor series), rather than an entire neural network, to find the rule?
Parameter fitting seemingly has many advantages: far fewer trainable parameters (a parameter-fitting model might have a few dozen while a neural net has thousands), and a readable outcome telling you which terms were needed, thus giving the functional form at the end. However, recent research has shown that directly curve-fitting to an ensemble of possible functional forms does not seem to be as robust [4]. It is unclear why this is, but for now it appears the best strategy for model inference with limited or noisy data is to train the neural ODE first, then use it to generate additional data for parameter fitting to an ensemble of functions [4].

The third channel is absolutely required: it would be impossible for a neural ODE to predict correctly with only two of the three populations as input and no other information, because the correct output depends on all three populations. That is, for the same rabbit and wolf populations, the derivative output is a multi-valued function that differs depending on the bear population. We create the recurrent ODE channel to give the neural ODE more information, in the hope that this will be enough to allow it to predict correctly. This setup makes logical sense because, for a system for which we lack equations, we may not even realize a variable is at play in the system. This "recurrent channel" acts as an open slot for variables which we do not know about but which are required for the prediction of the system.
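The multi-valued-derivative obstruction is easy to exhibit. With illustrative three-species coefficients (not the paper's), two full states that agree on rabbits and wolves but differ on bears produce different derivatives for the visible variables:

```python
import numpy as np

# Hypothetical three-species Lotka-Volterra right-hand side in which
# bears prey on both rabbits and wolves.  Coefficients are for
# illustration only.
def F(state):
    r, w, b = state
    dr = 1.1 * r - 0.4 * r * w - 0.2 * r * b
    dw = 0.1 * r * w - 0.4 * w - 0.1 * w * b
    db = 0.05 * r * b + 0.02 * w * b - 0.3 * b
    return np.array([dr, dw, db])

# Same (rabbits, wolves) = (2, 1), different bear populations:
d_low_bears = F(np.array([2.0, 1.0, 0.5]))
d_high_bears = F(np.array([2.0, 1.0, 2.0]))

# Restricted to the visible (r, w) pair, the derivative is multi-valued:
# identical observed state, different rates of change.
assert not np.allclose(d_low_bears[:2], d_high_bears[:2])
```

A right-hand side taking only (r, w) as input therefore cannot reproduce both derivatives; the recurrent ODE channel supplies the extra degree of freedom that disambiguates them.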
Thanks to Dr. Charlie Strauss for mentoring and editing. Thanks to Chris Rackauckas for his inspirational online video tutorials and early access to his preprint on Universal Neural ODEs [4].
References

[1] K.-i. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous time recurrent neural networks," Neural Networks, vol. 6, no. 6, pp. 801–806, 1993.
[2] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, "Neural ordinary differential equations," in Advances in Neural Information Processing Systems 31, 2018. [Online]. Available: https://arxiv.org/abs/1806.07366
[3] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[4] C. Rackauckas, Y. Ma, J. Martensen, C. Warner, K. Zubov, R. Supekar, D. Skinner, and A. Ramadhan, "Universal differential equations for scientific machine learning," 2020. [Online]. Available: https://arxiv.org/abs/2001.04385
[5] E. Dupont, A. Doucet, and Y. W. Teh, "Augmented neural ODEs," in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 3140–3150. [Online]. Available: http://papers.nips.cc/paper/8577-augmented-neural-odes.pdf
[6] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, "Julia: A fresh approach to numerical computing," SIAM Review, vol. 59, no. 1, pp. 65–98, 2017. [Online]. Available: https://epubs.siam.org/doi/10.1137/141000671
[7] C. Rackauckas, M. Innes, Y. Ma, J. Bettencourt, L. White, and V. Dixit, "DiffEqFlux.jl - a Julia library for neural differential equations," 2019. [Online]. Available: https://arxiv.org/abs/1902.02376
[8] R. Strauss, "Using neural differential equations to model unknown dynamical systems," 2019, the Southwest Region Junior Science and Humanities Symposium, abstract submitted December 16, 2019, paper reviewed March 2020, oral presentation March 13-14, 2020.
[9] ——, "Automatic differentiation and AI applied to computational physics," 2019, (abstract submitted December 10, 2019, paper reviewed April 2020, oral presentation online April 15, 2020, Algorithm Innovation Award May 2020). [Online]. Available: https://supercomputingchallenge.org/19-20/finalreports/1003/Neural_Differential_Equations_supercomputing.pdf
[10] ——, "Using neural differential equations to model unknown dynamical systems," 2020, 1st place Los Alamos County science fair peer-reviewed poster, Los Alamos, NM, February 1, 2020.
[11] ——, "Using neural differential equations to model unknown dynamical systems," 2020, 1st place Southwest New Mexico regional science fair peer-reviewed poster, Las Vegas, NM, March 7, 2020.
[12] E. Dupont, A. Doucet, and Y. W. Teh, "Augmented neural ODEs," 2019. [Online]. Available: https://arxiv.org/abs/1904.01681
[13] J. Sinai. (2019) Understanding neural ODEs. [Online]. Available: https://jontysinai.github.io/jekyll/update/2019/01/18/understanding-neural-odes.html
[14] A. Honchar. (2019) Neural ODEs: breakdown of another deep learning breakthrough. [Online]. Available: https://towardsdatascience.com/neural-odes-breakdown-of-another-deep-learning-breakthrough-3e78c7213795
[15] R. D. Beer, "On the dynamics of small continuous-time recurrent neural networks," Adaptive Behavior, vol. 3, no. 4, pp. 469–509, 1995. [Online]. Available: https://doi.org/10.1177/105971239500300405
[16] M. Innes, E. Saba, K. Fischer, D. Gandhi, M. C. Rudilosso, N. M. Joy, T. Karmali, A. Pal, and V. Shah, "Fashionable modelling with Flux," CoRR, vol. abs/1811.01457, 2018. [Online]. Available: https://arxiv.org/abs/1811.01457
[17] C. Rackauckas and Q. Nie, "DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia," Journal of Open Research Software, vol. 5, no. 1, 2017.