A Shooting Formulation of Deep Learning
François-Xavier Vialard∗, Roland Kwitt†, Susan Wei‡, Marc Niethammer§

June 19, 2020
Abstract
Continuous-depth neural networks can be viewed as deep limits of discrete neural networks whose dynamics resemble a discretization of an ordinary differential equation (ODE). Although important steps have been taken to realize the advantages of such continuous formulations, most current techniques are not truly continuous-depth as they assume identical layers. Indeed, existing works throw into relief the myriad difficulties presented by an infinite-dimensional parameter space in learning a continuous-depth neural ODE. To this end, we introduce a shooting formulation which shifts the perspective from parameterizing a network layer-by-layer to parameterizing over optimal networks described only by a set of initial conditions. For scalability, we propose a novel particle-ensemble parametrization which fully specifies the optimal weight trajectory of the continuous-depth neural network. Our experiments show that our particle-ensemble shooting formulation can achieve competitive performance, especially on long-range forecasting tasks. Finally, though the current work is inspired by continuous-depth neural networks, the particle-ensemble shooting formulation also applies to discrete-time networks and may lead to a fertile new area of research in deep learning parametrization.
1 Introduction

Deep neural networks (DNNs) are closely related to optimal control (OC), a well-established field with both theoretical and practical insights on offer [25, 24, 20]. The sought-for control variable of DNNs corresponds to the parameters of the neural network layers. Talking about an optimal control requires the definition of a control cost, i.e., a norm on the control variable. We explore the ramifications of such a control cost in the context of DNN parametrization. For simplicity, we focus on continuous formulations in the spirit of neural ODEs [14]. However, both discrete and continuous OC formulations exist [13, 5, 38] and our approach could also be developed for the former.

Initial work on continuous DNN formulations was motivated by the realization that a
ResNet [21, 22] resembles Euler forward time-integration [20, 24]. Specifically, the forward pass of some input vector x̃ through a network with L layers, specified as x(0) = x̃ and x(j+1) = x(j) + f(x(j), θ(j)), j = 0, 1, ..., L−1, closely relates to an explicit Euler [37] discretization of the ODE

    ẋ(t) = f(t, x(t), θ(t)),   x(0) = x̃,   0 ≤ t ≤ T.    (1.1)

In the ODE-inspired DNN setting, we seek an optimal θ such that the terminal prediction given by x(T), i.e., the solution to Eq. (1.1) at time T, minimizes ℓ(x(T)) for a task-specific loss function ℓ.

∗ LIGM, UPEM; [email protected]
† Department of Computer Science, University of Salzburg; [email protected]
‡ School of Mathematics and Statistics, University of Melbourne; [email protected]
§ Department of Computer Science, University of North Carolina at Chapel Hill; [email protected]

Figure 1:
Optimization in the neural ODE (NODE) framework of [14] (left) amounts to a forward pass with the gradient computed via backpropagation. Optimization under the shooting principle (middle) turns the forward-backward system into a forward second-order system, where we essentially run the backpropagation equation forward. We use a Hamiltonian particle ensemble (right) consisting of (position, momentum) pairs (q_j, p_j) to make the shooting efficient. In shooting, θ is time-dependent, whereas in NODE θ(t) = θ for all t.

Although Eq. (1.1) with time-varying parameter θ(t) can be considered as a neural network with an infinite number of layers, current implementations of ODE-inspired networks largely assume parameters θ that are fixed in time [14, 15], i.e., ∀t: θ(t) = θ, or that follow a designed, prescribed dynamics [44]. Instead, we explore time-varying θ(t). We employ regularization (i.e., a control cost) to render estimating θ(t) well-posed and to assure regularity of the resulting flow. Hence, we explore the regularized loss

    E(θ) = ∫_0^T R(θ(t)) dt + λ ℓ(x(T)),   λ ∈ ℝ₊,   subject to Eq. (1.1),    (1.2)

where R: ℝᵈ → ℝ₊, so that θ ↦ ∫_0^T R(θ(t)) dt is a complexity measure of θ ∈ L²([0, T], ℝᵈ) corresponding to the control cost. Instead of directly optimizing over the set of time-dependent θ(t) as in standard ResNets, we restrict the optimization set to those θ which are critical points of E(θ), thereby dramatically reducing the number of parameters. In doing so, one can describe the optimization task as an initial value problem. Namely, we show that we can rewrite the loss in Eq. (1.2) solely in terms of the input x(0) and a corresponding finite-dimensional momentum variable, p(0). Such an approach, just like optimizing the initial speed of a mass particle to reach a given point, is called a shooting method in numerical analysis [31] and control [11], giving its name to our new formulation.

The first two panels of Fig.
1 illustrate the difference between the optimization of a neural ODE (NODE) via [14] and our shooting formulation. Since, in practice, we have multiple inputs x̃_i, i = 1, ..., n, there is an initial momentum vector p_i corresponding to each of them. If the shooting formulation is to scale up to a large sample size n, we must take care that the parametrization does not grow linearly with n. To this end, we propose what we call the Hamiltonian particle-ensemble parametrization. It is a finite set of particles, where each particle is a (position, momentum) pair. The initial conditions of these particle pairs {(q_j, p_j)}, j = 1, ..., K (where K ≪ n), completely determine θ(t). This is illustrated in the rightmost panel of Fig. 1. Once the optimized set of particles has been computed, the computational efficiency of the forward model, similarly to NODE [14], is retained for vector fields f that are linear in their parameters θ(t).

Our contributions are as follows: 1) We introduce a shooting formulation for DNNs, amounting to an initial-value formulation for neural network parametrization. This allows for optimization over the original network parameter space via optimizing over the initial conditions of critical networks only; 2) We propose an efficient implementation of the shooting approach based on a novel particle-ensemble parametrization in which a set of initial particles (the (position, momentum) pairs) describes the space of putative optimal network parameters; 3) We propose the UpDown model which gives rise to explicit shooting equations; 4) We prove universality for the flows of the UpDown vector field and demonstrate in experiments its good performance on several prediction tasks.
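The correspondence between Eq. (1.1) and a ResNet forward pass can be made concrete in a few lines. A minimal sketch (the scalar residual map f and all parameter values below are illustrative assumptions, not taken from the paper):

```python
import math

def euler_forward(f, x0, theta_of_t, T=1.0, L=10):
    """Explicit Euler integration of x'(t) = f(x(t), theta(t)) with L steps of
    size h = T/L; for h = 1 this is exactly the ResNet update
    x(j+1) = x(j) + f(x(j), theta(j))."""
    h = T / L
    x = list(x0)
    for j in range(L):
        fx = f(x, theta_of_t(j * h))
        x = [xi + h * fi for xi, fi in zip(x, fx)]
    return x

# Illustrative scalar "layer": f(x, theta) = tanh(theta * x).
f = lambda x, th: [math.tanh(th * xi) for xi in x]
x_T = euler_forward(f, [1.0], theta_of_t=lambda t: 0.5, T=1.0, L=100)
```

With step size one, `euler_forward` reduces to the discrete residual network; refining the step while keeping T fixed yields the continuous-depth limit.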
2 Related work

We draw inspiration from two separate branches of research: 1) continuous formulations of neural networks [14] and 2) shooting approaches for deformable image registration [2, 28, 30].
Continuous-depth neural networks.
Continuous equivalents of ResNets [21, 22] have been developed in [33, 20], but naïve implementations are memory-demanding since backpropagation requires differentiating through the numerical integrator. Two approaches can address this unfavorable memory footprint. Neural ODE [14] does not store intermediate values in the forward pass, but recomputes them by integrating the forward model backward. This is easily possible only if the forward model is numerically invertible and the formulation is time-continuous.¹ Instead, checkpointing [18] is a general approach to reduce memory requirements by selectively recomputing parts of the forward solution [19]. Our work can easily be combined with these numerical approaches.

Solving implicit equations.
A recent line of works, including deep equilibrium models [7] and implicit residual networks [32], has shown that it may not always be necessary to freely parametrize all the layers in a network. Specifically, in [7] and [32], the parameters of each layer are defined via an implicit equation motivated by weight tying, thus improving expressiveness and reducing the number of parameters, while decreasing the memory footprint via implicit differentiation. Instead, our work increases expressiveness and reduces the number of parameters via particle-based shooting.
Invertibility and expressiveness.
Based on the similarity with continuous-time integration, constraining the norm of a layer in a ResNet will result in an invertible network, as in [9, 23]. Invertibility is also explored in [42], where it is enforced (as in our setting) via a penalty on the norm. These works show that standard learning tasks can be performed on top of a one-to-one transformation. Recent theoretical developments [43] show that indeed capping a NODE or i-ResNet [9] with a single linear layer gives universal approximation for non-invertible continuous functions. Further, expressiveness can be increased by moving to more complex models, e.g., by introducing additional dimensions as explored in augmented NODE [15]. In [44] (AnodeV2), Zhang et al. treat time-dependent θ(t). Weights are evolved jointly with the state of the continuous DNN. While this weight evolution could, in principle, also be captured by a learned weight network, the authors argue that this would result in a large increase in parameters and therefore opt for explicitly parametrizing these evolutions (e.g., via a reaction-diffusion equation). In contrast, our method does not rely on learning a separate weight network or on explicitly specifying a weight evolution. Instead, our evolving weights are a direct consequence of the shooting equations which, in turn, are a direct consequence of penalizing network parameters (the control cost) over time; a large increase in parameters does not occur.

Hamiltonian approaches.
Toth et al. [36] proposed Hamiltonian generative networks to learn the Hamiltonian governing the evolution of a physical system. Specifically, they learn Hamiltonian vector fields in the latent space of an image encoder-decoder architecture. Sæmundsson et al. [34] also learn the underlying dynamics of a system from time-dependent data, starting from a discrete Lagrangian combined with a variational integrator. This motivates particular network structures, e.g., Newtonian networks where the potential energy is learned via a neural network. Although sharing common tools, our work completely differs from this line of research in the sense that we exploit Hamiltonian mechanics to parametrize general continuous neural networks. In principle, our work applies to most network architectures and is not specific to physical data.

Finally, we mention that shooting approaches have been applied successfully in other areas such as diffeomorphic image matching [28, 2, 30]. However, the decisive difference here is in the dimensionality of the underlying space: in image registration it is typically a 3D image; for DNNs, it is usually a high-dimensional space (ℝᵈ) with data sampled from a distribution in that space.

¹ In a discrete setting, resolving the forward model in the backward direction generally requires costly solving of implicit equations. This can be done (it is, e.g., done for invertible ResNets [9]). In general, an explicit numerical solution for forward time-integration becomes implicit in the backward direction and vice versa.

3 Methodology

We consider, for simplicity, a supervised learning task where the input and target spaces are
X ⊂ ℝᵈ and Y, respectively, and sampled data are denoted by {(x̃_i, ỹ_i)}, i = 1, ..., n, ⊂ X × Y. The goal is to learn the weight θ(t) in the following flow equation

    ẋ_i(t) = f(x_i(t), θ(t)),   x_i(0) = x̃_i,   0 ≤ t ≤ T    (3.1)

such that it minimizes the loss Σ_{i=1}^n ℓ(x_i(T), ỹ_i) for some loss function ℓ. In existing works, the weight is chosen independent of time, i.e., θ(t) = θ [14], or specific evolution equations are postulated for it [26, 44]. Such strategies show the difficulty of addressing infinite-dimensional parametrizations of time-dependent θ and the need for regularization for well-posedness [17, 26, 20]. Instead of parametrizing θ(t) directly, we aim at penalizing θ(t) according to the regularity of f(·, θ(t)) to arrive at a well-posed problem. Specifically, we consider a regularization term R(θ(t)) (discussed in §3.1) and propose to minimize, for λ > 0,

    E_n(θ) = ∫_0^T R(θ(t)) dt + λ Σ_{i=1}^n ℓ(x_i(T), ỹ_i)   subject to Eq. (3.1).    (3.2)

Note that upon discretizing the time t (i.e., having a number of parameters proportional to the number of timesteps) this is similar to a ResNet with weight decay. For a ResNet or a NODE, optimization is based on computing the parameter gradient via a forward pass followed by backpropagation (see left panel of Fig. 1).
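Discretized, Eq. (3.2) is just a ResNet-with-weight-decay objective. A minimal sketch for a scalar toy model, where f(x, θ) = θ·tanh(x) and R(θ) = θ² are illustrative assumptions of ours:

```python
import math

def discretized_objective(thetas, xs, ys, lam=1.0, T=1.0):
    """E_n(theta) of Eq. (3.2) after forward-Euler discretization:
    h * sum_t R(theta_t)  +  lam * sum_i l(x_i(T), y_i),
    with one parameter theta_t per time step (i.e., per layer)."""
    L = len(thetas)
    h = T / L
    reg = h * sum(th * th for th in thetas)  # discretized integral of R(theta(t))
    fit = 0.0
    for x0, y in zip(xs, ys):
        x = x0
        for th in thetas:                    # Euler pass through the "layers"
            x = x + h * th * math.tanh(x)
        fit += (x - y) ** 2                  # squared-error loss l(x(T), y)
    return reg + lam * fit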
Optimality equations.
The optimality conditions for Eq. (3.2) in continuous time are:

    ẋ_i(t) − f(x_i(t), θ(t)) = 0,   x_i(0) = x̃_i,                                (data evolution)
    ṗ_i(t) + ∂_x f(x_i(t), θ(t))^T p_i(t) = 0,   p_i(T) = −∇ℓ(x_i(T), ỹ_i),      (adjoint evolution)
    ∂_θ R(θ(t)) − Σ_{i=1}^n ∂_θ f(x_i(t), θ(t))^T p_i(t) = 0.                    (compatibility)    (3.3)

The first equation is the forward model; the second is the adjoint equation, solved backward in time in order to compute the gradient with respect to the parameters. At convergence, the third equation is also satisfied. This last equation encodes the optimality of the layer at timestep t, as is the case for an argmin layer or weight tying [3]. Its left-hand side corresponds to the gradient with respect to the parameter θ but, as we shall see, it will allow us to compute θ directly via the (position, momentum) pairs in our particle shooting formulation. The shooting approach simply replaces the optimization set by the set of critical points of Eq. (3.2) expressed in these optimality conditions. That is, we only optimize over solutions fulfilling Eq. (3.3).

Shooting principle.
The shooting method is standard in optimal control [11] and can be formulated as follows: since, at optimality, the system in Eq. (3.3) is satisfied, one can turn this system into a forward model defined only by its initial conditions {(x_i(0), p_i(0))}, i = 1, ..., n, which specify the entire trajectory of optimal parameters. We refer to the forward model defined by Eq. (3.3) as the shooting equations. Unfortunately, this initial-condition parametrization still requires all initial conditions x_i(0) and their corresponding momenta p_i(0) for i = 1, ..., n. Since this does not scale to very large datasets, we propose an approximation using a collection of particles, as described next.

Hamiltonian particle ensemble.
In the limit and ideal case where the data distribution is known, the optimality equations can be approximated using a collection of particles which follow the Hamiltonian system (see Appendix A). We thus consider a collection of particles (q_j, p_j), j = 1, ..., K, in ℝᵈ × ℝᵈ that drive the evolution of the entire population (x_i), i = 1, ..., n, in ℝᵈ through the following forward model

    ẋ_i(t) − f(x_i(t), θ(t)) = 0,   x_i(0) = x̃_i,                      (data evolution)
    q̇_j(t) − f(q_j(t), θ(t)) = 0,
    ṗ_j(t) + ∂_x f(q_j(t), θ(t))^T p_j(t) = 0,
    ∂_θ R(θ(t)) − Σ_{j=1}^K ∂_θ f(q_j(t), θ(t))^T p_j(t) = 0,          (Hamiltonian equations)    (3.4)

with initial conditions {(q_j(0), p_j(0))}, j = 1, ..., K, where the gradient with respect to this new parametrization is computed via backpropagation. This set of (position, momentum) pairs is termed the Hamiltonian particle ensemble. As the number of particles is reduced, so is the number of free parameters; see Appendix B. Indeed, varying the Hamiltonian particle ensemble allows for controlling the tradeoff between reconstruction and network complexity.
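To illustrate the shooting principle itself — the endpoint of a trajectory is a function of the initial conditions alone, which then become the only optimization variables — here is a deliberately tiny example on a harmonic oscillator with a bisection search over the initial momentum (both the system and the search method are illustrative choices of ours, not part of the method above):

```python
def shoot(p0, q0=0.0, T=1.0, steps=1000):
    """Forward-Euler integration of q' = p, p' = -q; returns the endpoint q(T),
    which depends on the initial momentum p0 only."""
    h = T / steps
    q, p = q0, p0
    for _ in range(steps):
        q, p = q + h * p, p - h * q
    return q

def solve_p0(target, lo=-10.0, hi=10.0, iters=60):
    """Bisection on p0 so that the trajectory 'shoots' to q(T) = target
    (q(T) is monotone increasing in p0 for this linear system)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if shoot(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

p0 = solve_p0(0.5)   # initial momentum whose trajectory ends at q(1) = 0.5
```

In the paper's setting the role of `p0` is played by the particle initial conditions, and the bisection is replaced by gradient-based optimization through the forward model.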
The main computational bottleneck in the forward model of Eq. (3.4) is the implicit parametrization of θ by the last equation. Making it explicit is key to rendering shooting computationally tractable.

Linear in parameter - quadratic penalty.
In the simplest case, the space of functions f is a linear space parametrized by θ(t), and a quadratic penalty amounts to a kinetic penalty. Specifically, as a motivating example, consider the forward model (with component-wise activation function σ)

    f(x(t), θ(t)) = A(t) σ(x(t)) + b(t)    (3.5)

with A ∈ L²([0, T], ℝ^{d×d}), b ∈ L²([0, T], ℝᵈ) and θ(t) = [A(t), b(t)]. With the quadratic regularizer R(θ(t)) = ½ Tr(A(t)^T M_A A(t)) + ½ b(t)^T M_b b(t), where M_A, M_b are positive-definite matrices, the particle shooting equations are

    q̇_j(t) = A(t) σ(q_j(t)) + b(t),   ṗ_j(t) = −dσ(q_j(t))^T A(t)^T p_j(t),
    A(t) = M_A^{−1} ( Σ_{j=1}^K p_j(t) σ(q_j(t))^T ),   b(t) = M_b^{−1} ( Σ_{j=1}^K p_j(t) ),    (3.6)

with given initial conditions (p_j(0), q_j(0)). We emphasize that θ(t) is explicitly defined by {(p_j(t), q_j(t))}, j = 1, ..., K, and the computational cost is reduced to matrix multiplications.

As is well known [4], the Hamiltonian flow preserves the Hamiltonian function. In the "linear in parameter - quadratic regularization" case, this preserved quantity, denoted H(p(t), q(t)), equals R(θ(t)), which is the kinetic energy of the system of particles. As a first consequence, the objective functional can be rewritten as H(p(0), q(0)) + λ Σ_{i=1}^n ℓ(x_i(T), ỹ_i). This clearly allows for direct optimization over (p(0), q(0)), i.e., shooting. As a second consequence, since the vector field has constant norm (its squared norm is the kinetic energy), it gives a quantitative bound on the regularity of the flow map at time t = T, explicit in terms of H(p(0), q(0)). In addition (Appendix A), the Rademacher complexity of the generated flows with bounded H(p(0), q(0)) can also be controlled.

Nonlinear in parameter and non-quadratic penalty.
A standard ResNet structure uses vector fields of the type (in convolutional form or not)

    f(x(t), θ(t)) = θ_1(t) σ(θ_2(t) x(t) + b_2(t)) + b_1(t),    (3.7)

where θ_1(t) ∈ L(ℝⁿ, ℝᵈ) and θ_2(t) ∈ L(ℝᵈ, ℝⁿ). This forward model can also be handled in our shooting approach, since the shooting equations in Eq. (3.3) are completely specified by the Hamiltonian H(p, x, θ) = p^T f(x, θ) − R(θ). Automatic differentiation can be used (see Appendix C) to implement the forward model

    ẋ(t) = ∂H/∂p (p(t), x(t), θ(t)),   ṗ(t) = −∂H/∂x (p(t), x(t), θ(t)),   θ(t) ∈ argmax_θ H(p(t), x(t), θ).    (3.8)

Important bottlenecks appear, since the third equation is nonlinear and potentially associated with a non-convex optimization problem. This could be addressed by unrolling the optimization corresponding to the last equation, resulting in increased computational cost. In addition, in this nonlinear case, the Hamiltonian function is no longer (in general) equal to R(θ(t)), even in the quadratic regularization setting. Therefore, results on the smoothness or Rademacher complexity would no longer be guaranteed as in the linear - quadratic penalty case. Last, quadratic regularization has no known theoretical results for the Rademacher complexity of functions generated by Eq. (3.7) with bounded norm. Norms for which the Rademacher complexity of this class of functions is known [16] to be bounded are called Barron norms; these are non-smooth and non-convex, which would add to the difficulty. To circumvent these issues while retaining expressiveness and theoretical guarantees in the linear parametrization setting, we next introduce the UpDown model.
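For the linear-in-parameter, quadratic-penalty case, by contrast, the particle system of Eq. (3.6) is explicit and cheap to integrate. A sketch for d = 1, σ = tanh and M_A = M_b = 1 (these simplifications are ours):

```python
import math

def particle_shooting(q0, p0, x0, T=1.0, steps=200):
    """Forward-Euler integration of Eq. (3.6): the weights A(t), b(t) are
    recomputed from the K particles (q_j, p_j) at every step and drive both
    the particles and the data points x_i."""
    h = T / steps
    q, p, x = list(q0), list(p0), list(x0)
    for _ in range(steps):
        A = sum(pj * math.tanh(qj) for pj, qj in zip(p, q))  # A = M_A^{-1} sum_j p_j sigma(q_j)^T
        b = sum(p)                                           # b = M_b^{-1} sum_j p_j
        q_new = [qj + h * (A * math.tanh(qj) + b) for qj in q]
        p = [pj - h * (1.0 - math.tanh(qj) ** 2) * A * pj for pj, qj in zip(p, q)]
        x = [xi + h * (A * math.tanh(xi) + b) for xi in x]   # data driven by the same theta(t)
        q = q_new
    return q, p, x

q, p, x = particle_shooting(q0=[0.5, -0.5], p0=[0.1, -0.1], x0=[1.0])
```

Only the K initial pairs (q_j(0), p_j(0)) are free parameters; gradients with respect to them would be obtained by backpropagating through this loop.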
UpDown model
The key idea is to transform the vector field of Eq. (3.7) into a model which is linear in its parameters and to which quadratic regularization can be applied. To this end, we introduce the additional state v(t) = θ_2(t) x(t) + b_2(t), which we differentiate with respect to time. We obtain v̇(t) = θ̇_2(t) x(t) + ḃ_2(t) + θ_2(t) ẋ(t) and, replacing ẋ(t) by its formula, we get v̇(t) = θ̇_2(t) x(t) + ḃ_2(t) + θ_2(t)(θ_1(t) σ(v(t)) + b_1(t)). Thus, using the additional state variable v(t), we propose the following ODE system, denoted the UpDown model, introducing the (integer-valued) inflation factor α ≥ 1:

    ẋ(t) = θ_1(t) σ(v(t)) + b_1(t),
    v̇(t) = θ_3(t) x(t) + b_3(t) + θ_2(t) σ(v(t)),    (3.9)

with x(t) ∈ ℝᵈ, v(t) ∈ ℝ^{αd}. For the data evolution, the initial conditions for x(t) are given by the data samples {x̃_i}. We parametrize the initial conditions for v(t) using an affine map via v(0) = Θ(x(0)) + b₀, where Θ ∈ L(ℝᵈ, ℝ^{αd}), b₀ ∈ ℝ^{αd}. In Appendix D, we prove the following theorem:

Theorem 1.
Given the flow of a time-dependent vector field defined on a compact domain C of ℝᵈ, which is time-continuous and Lipschitz, we denote by φ(T, x(0)) its flow at time T from starting value x(0). Then, there exists a parametrization of the UpDown model for which its solution is ε-close to the flow: sup_{x(0) ∈ C} ‖φ(T, x(0)) − x(T)‖ ≤ ε.

Notably, in the proof, the dimension of the hidden state v is used twice: first, for having a sufficient number of neurons in Eq. (3.7) to approximate a stationary vector field (the standard universality property of the multilayer perceptron) and, second, for approximating time-dependent vector fields. Therefore, at the cost of introducing a possibly large number of dimensions, the UpDown model is universal in the class of time-dependent NODEs. As shown in Appendix D, this universality result transfers to our shooting formulation. Due to its additional dimensions, it is also likely to be universal in the space of functions (i.e., not necessarily injective ones). We focus on the UpDown model in our experiments. Note also that while we derived our theory for vector-valued evolutions for simplicity, similar linear-in-parameter evolution equations can, for example, be derived for convolutional neural networks.
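A forward pass of the UpDown system in Eq. (3.9) can be sketched directly. For illustration we take d = 1, inflation factor α = 2, and hold the parameters constant in time (in the shooting formulation they would evolve via the particles); the parameter names θ_1, θ_2, θ_3 follow our reconstruction of the equations and should be read as assumptions:

```python
import math

def updown_forward(x0, theta1, b1, theta3, b3, theta2, Theta, B, T=1.0, steps=100):
    """Forward-Euler integration of the UpDown model for d = 1:
        x' = theta1 . sigma(v) + b1               (down: alpha*d -> d)
        v' = theta3 * x + b3 + theta2 @ sigma(v)  (up: d -> alpha*d)
    with affine initialization v(0) = Theta * x(0) + B; alpha = len(Theta)."""
    alpha = len(Theta)
    h = T / steps
    x = x0
    v = [Theta[k] * x0 + B[k] for k in range(alpha)]
    for _ in range(steps):
        sv = [math.tanh(vk) for vk in v]
        dx = sum(t1 * s for t1, s in zip(theta1, sv)) + b1
        dv = [theta3[k] * x + b3[k] + sum(theta2[k][l] * sv[l] for l in range(alpha))
              for k in range(alpha)]
        x += h * dx
        v = [vk + h * dvk for vk, dvk in zip(v, dv)]
    return x
```

Note how the hidden state v "inflates" the dynamics to αd dimensions while the parameters enter only linearly, which is what makes the quadratic penalty and explicit shooting equations applicable.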
4 Experiments

Our goal is to demonstrate that it is possible to learn DNNs by optimizing only over the initial conditions of critical networks. This is made possible via shooting and made efficient via our particle parametrization. A key difference to prior work is that our approach allows us to capture time-dependent (i.e., layer-dependent in the discrete setting) parameters without discretizing these parameters at every time-point. Comparisons to other neural-ODE-like methods are not straightforward, as hyper-parameters and different implementations may make results difficult to compare. For consistency, we provide four different formulations (based on the UpDown model of §3.2).

• The static direct model foregoes the Hamiltonian particle ensemble and instead directly optimizes over time-constant parameters: θ(t) = θ. Everything else, including the UpDown model, stays unchanged. This model is most closely related to NODE [14] and augmented NODE [15].

• The static with particles model is similar to the static direct model. However, instead of directly optimizing over a time-constant θ, it uses a set of (position, momentum) pairs (i.e., particles, as in our dynamic with particles model below) to parametrize θ indirectly.

• We call our proposed shooting model dynamic with particles. It is parametrized via a set of initial conditions of (position, momentum) pairs, which evolve over time and fully specify θ(t).

• Finally, we consider the dynamic direct model, which uses a piece-wise time-constant θ(t). It essentially chains together multiple static direct models and is closely related to a discrete ResNet in the sense that multiple blocks (we use five) are used in succession. However, each block involves time-integrating the UpDown model. While the dynamic with particles model captures θ(t) indirectly via particles and shooting, the dynamic direct model requires many more parameters as it represents θ(t) directly. We show results for the dynamic direct model for a subset of the experiments.

All experiments use the UpDown model with quadratic penalty function R. Detailed experimental settings, including weights for the quadratic penalty function, can be found in Appendix E.

Simple 1D function regression.
We approximate a simple quadratic-like function y = x² + 3/(1 + x²), which is non-invertible. We use 15 particles for our experiments. Fig. 2 shows the test loss and the network complexity, as measured by the log Frobenius norm integrated over time [29], for the different models as a function of the inflation factor α (cf. §3.2). On average, the dynamic with particles model shows the best fits with the lowest complexity measures, indicating the simplest network parametrization. Note that the static with particles approach results in the lowest complexity measures only because it cannot properly fit the function, as indicated by the high test loss. Additional results for a cubic function y = x³ are in Appendix F.
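The network complexity measure reported here — the time-integral of the log Frobenius norm of the weight matrix [29] — can be approximated from weights sampled along the trajectory; the trapezoidal rule and the precise form below are our assumptions:

```python
import math

def log_frobenius_complexity(A_samples, T=1.0):
    """Trapezoidal approximation of  integral_0^T log ||A(t)||_F dt  from
    weight matrices A(t) sampled at uniformly spaced time points."""
    def frob(A):
        return math.sqrt(sum(a * a for row in A for a in row))
    vals = [math.log(frob(A)) for A in A_samples]
    h = T / (len(vals) - 1)
    return h * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])
```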
Figure 2:
Fit for the quadratic-like function y = x² + 3/(1 + x²) over 10 random initializations. Left: test loss; Right: time-integral of the log of the Frobenius norm complexity. Lower is better for both measures. * indicates the number of removed outliers (outside the interquartile range (IQR) by ≥ 1.5 × IQR); α denotes the inflation factor.
[Figure 3 panels: short- and long-range results (loss and log complexity per model and α), the fit and long-range prediction of the dynamic with particles model for ẋ = Ax³, and loss vs. number of parameters (1,116; 2,172; 4,284; 8,508).]
Figure 3:
Fit for the spiral (short- and long-range): losses for the different models as well as the time-integral of the log of the Frobenius norm complexity measure. Lower is better for both measures. The * symbol indicates how many outliers were removed and α denotes the inflation factor.

Spiral.

Next, we revisit the spiral ODE example of [14], following the nonlinear dynamics ẋ = Ax³, x ∈ ℝ² (where the power is component-wise). We fix x(0) = [2, 0]^T, use A = [[−0.1, 2.0], [−2.0, −0.1]] and evolve the dynamics for time T = 10. The training data consists of snippets from this trajectory, all of the same length. We use an L² norm loss (calculated on all intermediate time-points) and 15 particles. Our goal is to show that we can obtain the best fit to the training data due to our dynamic model. Fig. 3 (top) shows that we can indeed obtain similar or better fits (lower losses) for a similar number of parameters while achieving the lowest network complexity measures. Fig. 3 (bottom) shows the corresponding results for the validation data consisting of the original long trajectory starting from initial value x(0). Interestingly, by pasting together short-range solutions we are successful in predicting the long-range trajectory despite training on short-range trajectory snippets.

Concentric circles.
To study the impact of the inflation factor α in a classification regime, we replicate the concentric circles setting of [15]. The task is learning to separate points sampled from two disjoint annuli in ℝ². While we are less interested in the learned flow (as in [15]), we study how often the proposed UpDown (dynamic with particles) model perfectly fits the training data as a function of α.
To the right, we show the success rate over 50 training runs for three choices of α and 20 particles (4% for α = 5, 8% for α = 10, and 26.7% for α = 20). Notably, the effect of α is only visible if the classification loss is down-weighted so that the regularization, R, dominates. Otherwise, for the tested α, the model always fits the data. The experiment is consistent with [15], where it is shown that increasing the space on which

    Method                      MSE ± σ
    GPPVAE-DIS†                 …
    GPPVAE-JOINT†               …
    ODE²VAE†                    …
    ODE²VAE-KL†                 …
    Ours (stat. direct)         0.0126 ± …
    Ours (stat. w particles)    0.0125 ± …
    Ours (dyn. w particles)     0.0122 ± …
Figure 4:
Left: Image (per-pixel) MSE (measured at the marked time point) averaged over all testing sequences of the rotated MNIST dataset. Right: Two testing sequences and predictions (marked blue) for all 16 time points when the image at t = 0 is given as input (marked red). Results marked with † are taken from [40].

an ODE is solved allows for easy separation of the data and leads to less complex flows. The latter is also observed for our model; visualizations can be found in Appendix F.
Rotating MNIST.
Here, we are given sequences of a rotating MNIST digit (along 16 angles, linearly spaced in [0, 2π]). The task is learning to synthesize the digit at any rotation angle, given only the first image of a sequence. We replicate the setup of [40] and consider rotated versions of the digit "3". We identify each rotation angle with a time point t_i and randomly drop four time points of each sequence during training. One fixed time point is consistently left out and later evaluated during testing. We use the same convolutional autoencoder as [40], with the UpDown model operating in the internal representation space after the encoder. During training, the encoder receives the first image of a sequence (always at angle 0°), the UpDown model integrates forward to the desired time points, and the decoder decodes these representations. As loss, we measure the mean-squared error (MSE) of the decoder outputs. Fig. 4 lists the MSE (at the left-out angle), averaged over all testing sequences, and shows two example sequences with predictions for all time points (100 particles, α = 10).

While all UpDown variants substantially lower the MSE previously reported in the literature, they exhibit comparable performance. To better understand the differences, we visualize the internal representation space of the autoencoder by projecting all 16 internal representations (i.e., the outputs of the UpDown models after receiving the output of the encoder) of each testing image onto the two largest principal components, shown to the right (different colors indicate the different rotation angles). This qualitative result shows that allowing for a time-dependent parametrization leads to a more structured latent space of the autoencoder.
Bouncing balls.
Finally, we replicate the "bouncing balls" experiment of [40]. This is similar to the rotating MNIST experiment, but the underlying dynamics are more complex. In particular, we are given 10,000 (training) image sequences of bouncing balls at 20 different time points [35]. The task is learning to predict future time points after seeing the first three images of a sequence. We use the same convolutional autoencoder of [40] and minimize image (per-pixel) MSE (using all 20 time points for training). Our UpDown model operates in the internal representation space of the encoder (50-dimensional in our experiments²). In test mode, the network receives the first three images of a sequence and predicts 10 time points ahead. We measure the image (per-pixel) MSE and average the results (per time point) over all 500 testing sequences. For model selection, we rely on the provided validation set. Our UpDown (dynamic with particles) model uses 100 particles. Fig. 5 (left) lists the averaged

² We did not further experiment with this hyperparameter, so potentially better results can be obtained.

[Figure 5, left legend: ODE²VAE† (…), DDPAE† (…), DTSBN-S† (…), Ours (…).]

Figure 5:
Left: Image (per-pixel) MSE for predicting 10 time points ahead (after receiving the first three inputs of a sequence), averaged over all testing sequences (numbers in parentheses indicate the MSE when additionally averaged over all prediction time points). Results marked with † are taken from [40]. Right: Two testing sequences with predictions (marked blue).
MSE per time point, plotted against the approaches listed in [40]. Fig. 5 (right) shows two testing sequences with predictions (the three input time points are not shown). The UpDown static and static with particles models achieve comparable MSE values.

We demonstrated that it is possible to parametrize DNNs via initial conditions of (position, momentum) pairs. While our experiments are admittedly still simple, the results are encouraging, as they show that 1) the particle-based approach can achieve competitive performance compared to direct parametrizations and 2) time-dependent parametrizations are useful for obtaining simpler networks and can be realized with significantly fewer parameters using particle-based shooting. Our work opens up many follow-up questions and formulations. E.g., we presented our approach for a model with continuous dynamics, but the particle and shooting formalisms can also be applied to discrete-time models. Further, for simplicity, we focused on continuous variants of multi-layer perceptrons, but similar linear-in-parameter models can be formulated for convolutional neural networks. Models that are nonlinear in their parameters hold the promise of connections to optimal mass transport theory and to theoretical complexity results, which we touched upon for our
UpDown model. Indeed, this change of paradigm in the parametrization may result in new quantitative results on the generalization properties of the constructed networks. Lastly, it will be fascinating to explore how well the approach generalizes to more complex problems, how many particles are needed to switch from a standard deep network to its shooting formulation, and how optimizing over critical points of the original optimization problem via shooting relates to network generalization.
Broader Impact
Our work advances knowledge of how to parametrize deep neural networks. Specifically, it shifts the parametrization of deep neural networks from a layer-by-layer perspective to an initial-value parametrization. At this point, the work is theoretical and conceptual in nature, and so is its likely current broader impact.

Acknowledgments and Disclosure of Funding
This research project was initiated during a one-month invitation of M. Niethammer by the Labex Bézout, supported by the French National Research Agency ANR-10-LABX-58.
References

[1] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst., 2(4):303–314, 1989.
[2] F.-X. Vialard, L. Risser, D. Rueckert, and C. J. Cotter. Diffeomorphic 3D image registration via geodesic shooting using an efficient adjoint calculation. IJCV, 97(2):229–241, 2012.
[3] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In ICML, 2017.
[4] V. I. Arnold. Mathematical Methods of Classical Mechanics. Springer, 1978.
[5] M. Athans and P. L. Falb. Optimal Control: An Introduction to the Theory and Its Applications. Courier Corporation, 2013.
[6] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
[7] S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In NeurIPS, 2019.
[8] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2003.
[9] J. Behrmann, W. Grathwohl, R. T. Chen, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. In ICML, 2019.
[10] J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
[11] J. F. Bonnans. The shooting approach to optimal control problems. In IFAC International Workshop on Adaptation and Learning in Control and Signal Processing, 2013.
[12] M. Bruveris and F.-X. Vialard. On completeness of groups of diffeomorphisms. J. Eur. Math. Soc. (JEMS), 19(5):1507–1544, 2017.
[13] A. Bryson and Y. Ho. Applied Optimal Control: Optimization, Estimation, and Control. Hemisphere, New York, 1975.
[14] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.
[15] E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In NeurIPS, 2019.
[16] W. E, C. Ma, and L. Wu. Barron spaces and the compositional function spaces for neural network models. arXiv, 2019. https://arxiv.org/abs/1906.08039.
[17] C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. M. Oberman. How to train your neural ODE: the world of Jacobian and kinetic regularization. In ICML, 2020.
[18] A. Gholami, K. Keutzer, and G. Biros. ANODE: Unconditionally accurate memory-efficient gradients for neural ODEs. In IJCAI, 2019.
[19] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, volume 105. SIAM, 2008.
[20] E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Probl., 34(1):014004, 2017.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[23] J.-H. Jacobsen, A. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. In ICLR, 2018.
[24] Q. Li, L. Chen, C. Tai, and E. Weinan. Maximum principle based algorithms for deep learning. JMLR, 18(1):5998–6026, 2017.
[25] G.-H. Liu and E. A. Theodorou. Deep learning theory review: An optimal control and dynamical systems perspective. arXiv, 2019. https://arxiv.org/abs/1908.10920.
[26] S. Massaroli, M. Poli, J. Park, A. Yamashita, and H. Asama. Dissecting neural ODEs. arXiv, 2020. https://arxiv.org/abs/2002.08071.
[27] A. Maurer. A vector-contraction inequality for Rademacher complexities. In R. Ortner, H. U. Simon, and S. Zilles, editors, Algorithmic Learning Theory, pages 3–17. Springer, 2016.
[28] M. Miller, A. Trouvé, and L. Younes. Geodesic shooting for computational anatomy. J. Math. Imaging Vis., 24:209–228, 2006.
[29] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In NeurIPS, 2017.
[30] M. Niethammer, Y. Huang, and F.-X. Vialard. Geodesic regression for image time-series. In MICCAI, 2011.
[31] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes, 3rd edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[32] V. Reshniak and C. Webster. Robust learning with implicit residual networks. arXiv, 2019. https://arxiv.org/abs/1905.10479.
[33] L. Ruthotto and E. Haber. Deep neural networks motivated by partial differential equations. J. Math. Imaging Vis., pages 1–13, 2019.
[34] S. Sæmundsson, A. Terenin, K. Hofmann, and M. P. Deisenroth. Variational integrator networks for physically meaningful embeddings. arXiv, 2019. https://arxiv.org/abs/1910.09349.
[35] I. Sutskever, G. Hinton, and G. Taylor. The recurrent temporal restricted Boltzmann machine. In NeurIPS, 2009.
[36] P. Toth, D. J. Rezende, A. Jaegle, S. Racanière, A. Botev, and I. Higgins. Hamiltonian generative networks. In ICLR, 2020.
[37] L. N. Trefethen. Finite Difference and Spectral Methods for Ordinary and Partial Differential Equations. 1996.
[38] J. L. Troutman. Variational Calculus and Optimal Control: Optimization with Elementary Convexity. Springer Science & Business Media, 2012.
[39] M. J. Wainwright. High-Dimensional Statistics: A Non-asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[40] C. Yildiz, M. Heinonen, and H. Lahdesmaki. ODE²VAE: Deep generative second order ODEs with Bayesian neural networks. In NeurIPS, 2019.
[41] L. Younes. Shapes and Diffeomorphisms. Springer, 2010.
[42] L. Younes. Diffeomorphic learning. arXiv, 2018. https://arxiv.org/abs/1806.01240.
[43] H. Zhang, X. Gao, J. Unterman, and T. Arodz. Approximation capabilities of neural ODEs and invertible residual networks. In ICML, 2020.
[44] T. Zhang, Z. Yao, A. Gholami, J. E. Gonzalez, K. Keutzer, M. W. Mahoney, and G. Biros. ANODEV2: A coupled neural ODE framework. In NeurIPS, 2019.

Supplementary material
The following sections discuss additional aspects of our approach. Specifically, Sec. A shows that the optimality condition underlying our shooting formulation can be approximated via particles. Sec. B discusses the number of free parameters of our shooting approach in relation to the number of free parameters for direct optimization. Sec. C explains how the shooting equations can be automatically derived via automatic differentiation. Sec. D shows the universality of our UpDown model. Sec. E provides details on our experimental setup. Lastly, Sec. F shows some additional experimental results.
A Expectation approximation of optimality equations
In this section, we show that the particle shooting approach can be viewed as approximating the optimality equations via a collection of particles.
A.1 Variational setup
Suppose the data consists of an input $X \in \mathbb{R}^d$. Let $f(\cdot, \theta(t))$ be a vector field on $\mathbb{R}^d$, e.g., a shallow hidden layer or a linear-in-parameter layer. Consider the flow $\varphi := \varphi(T, \cdot)$ generated by $f$ according to

$$\frac{d}{dt}\varphi(t, x) = f(\varphi(t, x), \theta(t)), \quad \varphi(0, x) = x. \quad (A.1)$$

We consider the general task of minimizing

$$\mathrm{Reg}(\varphi) + \lambda\, \mathbb{E}[\ell(\varphi(X))], \quad (A.2)$$

where $\lambda$ is a positive regularization parameter. Without loss of generality, set the terminal time to $T = 1$. Letting $\rho_0$ denote the probability density of $X$, minimizing Eq. (A.2) is equivalent to minimizing

$$\mathrm{Reg}(\varphi) + \lambda \inf_\varphi \int_{\mathbb{R}^d} \ell(\varphi(x)) \rho_0(x)\, dx.$$

This can be rewritten as

$$\mathrm{Reg}(\varphi) + \lambda \inf_\varphi \int_{\mathbb{R}^d} \ell(x') \rho_1(x')\, dx',$$

where $\rho_1(x) := \rho(1, x)$ and $\rho(t, x)$ solves the continuity equation

$$\partial_t \rho(t, x) + \mathrm{div}(\rho(t, x) f(x, \theta(t))) = 0, \quad \rho(0, x) = \rho_0(x),$$

where $\mathrm{div}$ is the divergence operator on vector fields. Note that $\rho_1$ can be regarded as the density representing the data at time 1.

A.2 Choice of regularization
We discuss two different possibilities for regularizing the map $\varphi$ via

$$\mathrm{Reg}(\varphi) = \int_0^1 R(\theta(t))\, dt. \quad (A.3)$$

The first possibility is a quadratic penalty, where $R(\theta(t)) = \frac{1}{2}\|\theta(t)\|^2$ is (half) the squared Frobenius norm of the parameter $\theta$. Since the space of vector fields is a finite-dimensional linear space, it can be endowed with a scalar product, which turns this space into a reproducing kernel Hilbert space (RKHS). Therefore, the linear-in-parameter, quadratic-regularization setting is a particular case of vector fields encoded by $f(\cdot, \theta(t)) \in H$, with $H$ an RKHS embedded in the $W^{1,\infty}$ vector fields. This setting leverages strong analytical and geometrical foundations [41, 12]: 1) When the activation function is smooth (very often, the RKHS is composed of smooth vector fields), the flow map $\varphi$ is guaranteed to be a one-to-one smooth map (i.e., a diffeomorphism). For instance, with the UpDown model, it is a homeomorphism in $(q, v)$. Moreover, the quadratic penalty induces a right-invariant distance on the set of flows generated by Eq. (A.1), and the distance to the identity of the resulting flow can be bounded by $\mathrm{Reg}(\varphi)$ (see [41, 12] for more details in a Sobolev setting). 2) When the activation function is of ReLU type, the resulting map is still a $W^{1,\infty}$ one-to-one map (i.e., a homeomorphism) and has Lipschitz regularity.

The other type of regularization we will discuss is Barron norm regularization for the shallow hidden layer of Eq. (3.7). First, recall Barron's norm [16],

$$\|\theta\|_B := \frac{1}{n} \sum_{j=1}^n \|\theta_{2,j}\| \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right),$$

where $\theta_{2,j}$ denotes the $j$-th column of $\theta_2$ and $[\theta_1]_j$ the $j$-th row of $\theta_1$. As discussed in the main text, the reason we consider a Barron norm penalty for the shallow hidden layer, rather than a quadratic penalty, is its theoretical properties. Indeed, the Rademacher complexity is bounded for the combination of a shallow hidden layer with a Barron norm penalty, but not when combined with a quadratic penalty.
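As a concrete illustration of how this norm is computed, the following sketch evaluates the Barron-type norm of a shallow hidden layer directly from its weight matrices, using the column/row convention above (the layer sizes and weights are hypothetical, chosen only for illustration):

```python
import numpy as np

def barron_norm(theta1, b1, theta2):
    """Barron-type norm of the shallow hidden layer x -> theta2 @ sigma(theta1 @ x + b1).

    For each hidden unit j, take the norm of the j-th column of theta2 times
    (norm of the j-th row of theta1 + |b1_j|), averaged over the n hidden units.
    """
    n = theta1.shape[0]                         # number of hidden units
    out_cols = np.linalg.norm(theta2, axis=0)   # ||theta2_{:,j}|| for each j
    in_rows = np.linalg.norm(theta1, axis=1)    # ||[theta1]_j|| for each j
    return float(np.sum(out_cols * (in_rows + np.abs(b1))) / n)

# hypothetical small layer: d = 2 inputs, n = 3 hidden units
theta1 = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])  # n x d
b1 = np.array([0.0, 1.0, 0.0])
theta2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])    # d x n
norm_B = barron_norm(theta1, b1, theta2)
```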
A.3 Rademacher complexity of bounded energy flows.
In this section, we consider the flows generated by a neural ODE approach with a kinetic energy penalty, for which we address the question of bounding the Rademacher complexity of the generated flows. The flow of a vector field $f(\cdot, \theta(t))$ is a vector-valued map denoted by $\varphi$. Let us first treat the Rademacher complexity of a single component $\varphi_k$ of the flow map.

Theorem 2.
Let $\mathcal{F}$ be a space of vector fields defined on a compact space $C \subset \mathbb{R}^d$. Assume that the Rademacher complexity on $n$ points in $C$ of each component $f_k(t, \cdot)$, $k = 1, \ldots, d$, of the vector fields is controlled by $M(n, t)$. Then the Rademacher complexity of each component of the flows at time 1 is bounded by $\int_0^1 M(n, t)\, dt$.

Proof. Recall that the Rademacher complexity (see [39]) of a class of functions $\mathcal{F}$ is defined, for $Z = (z_1, \ldots, z_n) \in C^n$, as

$$\mathrm{Rad}_{\mathcal{F}, Z} := \mathbb{E}\Big[ \sup_{g \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i g(z_i) \Big],$$

where the $(\varepsilon_i)_{i=1,\ldots,n}$ are i.i.d. Rademacher random variables. Our hypothesis ensures $\mathrm{Rad}_{\mathcal{F}, Z} \le M(n, t)$ at every time $t$. Apply the definition of the flow to get

$$\varphi(1, x) = x + \int_0^1 f(\varphi(t, x), \theta(t))\, dt.$$

Therefore,

$$\mathbb{E}\Big[ \sup_{\varphi \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \varphi_k(z_i) \Big] \le \mathrm{Rad}_{\{\mathrm{Id}\}, Z} + \int_0^1 \mathbb{E}\Big[ \sup_{f(\cdot, \theta(t))} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_k(\varphi(t, z_i), \theta(t)) \Big] dt \le \int_0^1 M(n, t)\, dt.$$

(Footnote: i.e., smoothness asks for Lipschitz regularity of the first derivative, which ensures existence and uniqueness of the flow.)
In the previous formula, we used the fact that the Rademacher complexity of a class containing just a single map is zero.
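The quantity being bounded can also be estimated numerically. The sketch below gives a Monte Carlo estimate of the empirical Rademacher complexity for a small, finite class of scalar functions; the class and evaluation points are arbitrary toy choices, not from the paper:

```python
import numpy as np

def empirical_rademacher(G, Z, n_draws=2000, seed=0):
    """Monte Carlo estimate of E[sup_{g in G} (1/n) sum_i eps_i g(z_i)]
    for a finite class G of scalar functions, evaluated on the points Z."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    vals = np.array([[g(z) for z in Z] for g in G])  # |G| x n table of values
    total = 0.0
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        total += np.max(vals @ eps) / n              # sup over the class
    return total / n_draws

Z = np.linspace(-1.0, 1.0, 8)
G = [lambda z: 0.0, lambda z: z, lambda z: -z]       # toy function class
rad = empirical_rademacher(G, Z)
```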
Corollary 3.
Let $H$ be an RKHS of vector fields whose kernel $k$ satisfies $\|k(x, x)\|_\infty \le 1$. Then the set $\mathcal{F}$ of flows at time 1 of time-dependent vector fields in a ball of radius $R$ satisfies $\mathrm{Rad}_{\mathcal{F}, n} \le \frac{R}{\sqrt{n}}$, where $\mathrm{Rad}_{\mathcal{F}, n}$ is the Rademacher complexity for $n$ points.

Proof. The Rademacher complexity of the ball of radius $R$ in the RKHS $H$ is upper bounded by $\mathrm{Rad}_{H, n} \le \frac{R}{\sqrt{n}}$ [8, Lemma 22]. We then directly apply Theorem 2.

A similar result also holds for vector fields generated by the shallow hidden layer of Eq. (3.7), see [16]. Last, we note that the result and its proof also hold if one uses the following Rademacher complexity for vector-valued functions [27],

$$\mathrm{Rad}_{\mathcal{F}, Z} := \mathbb{E}\Big[ \sup_{g \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d \varepsilon_{ij} g_j(z_i) \Big],$$

for $g = (g_j)_{j=1,\ldots,d} \in \mathcal{F}$.

A.4 Optimality equations and Hamiltonian ensemble approximation
We detail the optimality equations in the case where the data are represented via a probability measure. As mentioned above, the regularity of the map is enforced via a penalty on the weights at each time point, i.e., the integral $\int_0^1 R(\theta(t))\, dt$ or, even more generally, $\int_0^1 R(\theta(t), \rho(t))\, dt$. Using Lagrange multipliers, the flow constraint can be enforced, and minimizers of the energy should be saddle points of

$$\mathcal{L}(\rho, \theta, p) := \lambda \int_{\mathbb{R}^d} \ell(x) \rho_1(x)\, dx + \int_0^1 R(\theta(t))\, dt + \int_0^1 \int_{\mathbb{R}^d} p(t, x) \big( \partial_t \rho(t, x) + \mathrm{div}(\rho(t, x) f(x, \theta(t))) \big)\, dx\, dt,$$

where $p(t, x)$ is a time- and space-dependent function. The optimality equations are then

$$\begin{cases} \partial_t \rho(t, x) + \mathrm{div}(\rho(t, x) f(t, x, \theta(t))) = 0, \\ \partial_t p(t, x) + \nabla p(t, x) \cdot f(t, x, \theta(t)) = \frac{\delta R}{\delta \rho}(\theta(t), \rho(t)), \\ \frac{\delta R}{\delta \theta}(\theta(t), \rho(t)) - \int_{\mathbb{R}^d} \frac{\delta f}{\delta \theta}(x)^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx = 0, \end{cases} \quad (A.4)$$

where $\nabla p$ is the gradient of $p(t, x)$ w.r.t. $x$ and $\delta$ denotes differentiation w.r.t. the indicated argument. In practice, one does not have access to the full distribution, and the variational setup needs to be approximated. As proposed in the main text, we approximate it using a collection of particles that follow the optimality equations, which are Hamiltonian evolution equations for this collection of particles. The particles $(q_i, p_i)$ are defined by their state and costate; the quantities $p(t, x)$ and $\rho(t, x)$ are simply approximated by $p_i(t)$, $q_i(t)$ satisfying the Hamiltonian equations associated with the empirical measure $\rho(t, \cdot) = \frac{1}{N} \sum_{i=1}^N \delta_{q_i(t)}(\cdot)$, the corresponding function $p(t, \cdot)$ being only defined at the points $q_i$.
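To make this concrete, here is a minimal numerical sketch of the particle approximation for the simple linear-in-parameter field $f(x, \theta) = \theta\sigma(x)$ with a quadratic penalty, where the third optimality equation gives $\theta(t)$ explicitly from the particles (the forward-Euler discretization, step count, and tanh nonlinearity are illustrative choices, not the paper's implementation):

```python
import numpy as np

def sigma(x):           # elementwise nonlinearity (tanh, as an illustration)
    return np.tanh(x)

def dsigma(x):
    return 1.0 - np.tanh(x) ** 2

def theta_from_particles(Q, P):
    """Parameters for f(x, theta) = theta @ sigma(x) under a quadratic penalty:
    empirical version of the explicit optimality condition,
    theta(t) = (1/N) sum_i p_i sigma(q_i)^T."""
    N = Q.shape[0]
    return sum(np.outer(p, sigma(q)) for q, p in zip(Q, P)) / N

def shoot(Q0, P0, n_steps=100, T=1.0):
    """Integrate the particle Hamiltonian system with forward Euler:
    qdot_i = theta sigma(q_i),  pdot_i = -diag(sigma'(q_i)) theta^T p_i."""
    Q, P = Q0.copy(), P0.copy()
    h = T / n_steps
    for _ in range(n_steps):
        theta = theta_from_particles(Q, P)
        dQ = sigma(Q) @ theta.T          # row i: theta @ sigma(q_i)
        dP = -dsigma(Q) * (P @ theta)    # row i: -diag(sigma'(q_i)) theta^T p_i
        Q, P = Q + h * dQ, P + h * dP
    return Q, P

rng = np.random.default_rng(0)
Q0 = rng.standard_normal((5, 3))  # 5 particles in R^3 (states)
P0 = rng.standard_normal((5, 3))  # their costates
Q1, P1 = shoot(Q0, P0)
```

Note that the weight trajectory $\theta(t)$ is never stored: it is recomputed at every step from the current particle ensemble, which is exactly the parameter reduction exploited in the paper.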
As the number of particles tends to infinity, the optimal trajectory in parameter space can be well approximated, and one can expect to recover the optimal trajectory. However, we do not explore the following question here: how well can a set of Hamiltonian particles approximate Eq. (A.4)? We simply remark that this question is directly connected to the expressiveness and generalization properties of the constructed neural network and is also probably data dependent.

A.5 Linear in parameters - quadratic energy
Now let us examine in detail models that are linear in parameters and have quadratic energy on the parameters: this case is the simplest to study and computationally not as demanding as the nonlinear case. As mentioned above, the set of possible vector fields $f(\cdot, \theta(t))$ is a finite-dimensional linear space, which is a reproducing kernel Hilbert space when endowed with an $L^2$ norm. Since all Hilbert norms in finite dimensions are equivalent, this choice of regularization is universal in this class of quadratic penalties.

1. The vector field is $f(\cdot, \theta(t)) = \theta \cdot \sigma$, where $\sigma$ is a vector of maps. In this case, the coupling term in the optimality equation reads

$$\frac{\delta f}{\delta \theta}(x)^* \big( \nabla p(t, x)\, \rho(t, x) \big) = \int_{\mathbb{R}^d} \sigma(x)^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx.$$

2. If the penalty $R$ only depends on $\theta$ and is quadratic, $R(\theta(t)) = \frac{1}{2}\|\theta(t)\|^2$, then one has $\frac{\delta R}{\delta \theta}(\theta(t), \rho(t)) = \theta(t)$.

Thus, under these two conditions, the parameters are explicit in terms of $p$, $\rho$, and $\sigma$:

$$\theta(t) = \int_{\mathbb{R}^d} \sigma(x)^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx. \quad (A.5)$$

Remark 4.
If, instead of a quadratic regularization on the parameters, we were to choose an RKHS norm (in the infinite-dimensional case) as the penalty, this would introduce the kernel applied to the r.h.s. of Eq. (A.5).
Having a formula for the parameter $\theta(t)$, one could be tempted to derive an evolution equation for $\theta$. This equation is known as the EPDiff equation [41] and is unfortunately not a closed equation on the set of parameters $\theta(t)$ themselves. Our approach is therefore one possible way to approximate it.

A.6 Nonlinear in parameters - energy which depends on the distribution
For exposition purposes, we present two cases of interest which are not contained in the previous section.
Example of Barron’s norm.
Obviously, a shallow hidden layer $\mathrm{shl}_\theta$ is not linear in its parameters. We have already discussed that it is then appropriate to endow the space with norms such as Barron's norm [16]. For instance, one has

$$\mathrm{shl}_\theta(x) = \theta_2 \sigma(\theta_1 x + b_1), \quad (A.6)$$

where $\theta_2 \in L(\mathbb{R}^n, \mathbb{R}^d)$ and $\theta_1 \in L(\mathbb{R}^d, \mathbb{R}^n)$. The Barron norm of the shallow hidden layer is then

$$\|\mathrm{shl}_\theta\|_B := \frac{1}{n} \sum_{j=1}^n \|\theta_{2,j}\| \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right),$$

where $\theta_{2,j}$ denotes the $j$-th column of $\theta_2$ and $[\theta_1]_j$ denotes the $j$-th row of $\theta_1$. Let us consider the case $R(\theta(t)) = \|\mathrm{shl}_{\theta(t)}\|_B$. In this case, one has the following optimality equations to solve:

$$\lambda \theta_{2,j} \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right) = \int_{\mathbb{R}^d} \sigma([\theta_1]_j x + [b_1]_j)^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx,$$
$$\lambda \|\theta_{2,j}\| \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right) \partial \|[\theta_1]_{jk}\| = \int_{\mathbb{R}^d} \big[ d\sigma([\theta_1]_j x + [b_1]_j)(x_k) \big]^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx,$$
$$\lambda \|\theta_{2,j}\| \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right) \partial \|[b_1]_{jk}\| = \int_{\mathbb{R}^d} \big[ d\sigma([\theta_1]_j x + [b_1]_j)(x_k) \big]^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx.$$

This norm is of $L^1$ type, and the optimization of this type of function, which involves sparsity, is a well-explored field [6]. We leave experiments with this norm for future work.

$L^2$ regularization, optimal transport. Last, we briefly mention a model that is part of our framework and has the advantage of not specifying a norm on the space of vector fields. In case there is no obvious norm to use on the space of vector fields, it is possible to use an $L^2$-type penalty directly on the vector fields themselves. Indeed, one way to be rather independent of the choice of the parametrization of the map consists in introducing a cost that represents the $L^2$ norm of the map. However, the $L^2$ norm depends on the choice of a measure, and this measure can naturally be chosen as the density of the data, $\rho(t, x)$. More precisely, one can use

$$R(f, \rho(t)) = \frac{1}{2} \int_{\mathbb{R}^d} \|f(x, \theta)\|^2 \rho(t, x)\, dx. \quad (A.7)$$

In such a case, this formulation resembles finding an optimal transport (OT) map between $\rho_0$ and $\rho_1$. Specifically, optimal transport is an optimization problem which can be solved via a fluid-dynamic formulation [10] introducing the kinetic penalty above. However, the two models (OT and the one defined by the regularization Eq. (A.7)) differ in the sense that the optimization set for optimal transport is much larger, and the above formulation is a parametrized approximation of OT. This parametrized approximation needs to retain generalization properties of the optimized map. Note, however, that in the limit where the number of neurons goes to infinity, optimal transport will be well approximated, since the optimization is performed on a dense subset of all vector fields. Obviously, fixing the choice of a shallow hidden layer design implies a choice for $n$ in Eq. (3.7), which thus gives a regularization of the computed approximation of the optimal transport map.

Computational burden.
In both cases, the implicit equation corresponding to the third equation of Eq. (A.4) has to be solved at each layer of the discretization. We experimented with a simple strategy of unrolling the related minimization scheme. An efficient approach to solving such implicit equations will be necessary for practical implementations.
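As a sanity check of such an unrolling strategy, the following sketch solves the implicit equation for the linear-in-parameter, quadratic-penalty case, where the fixed point is known in closed form (the step size and iteration count are illustrative; in a genuinely nonlinear case, the right-hand side would itself depend on $\theta$ and the same loop structure would still apply):

```python
import numpy as np

def sigma(x):
    return np.tanh(x)

def solve_theta_unrolled(Q, P, lr=0.5, n_iter=50):
    """Solve the implicit optimality equation
        theta - (1/N) sum_i p_i sigma(q_i)^T = 0
    (linear-in-parameter layer, quadratic penalty) by unrolling plain
    gradient descent, as a stand-in for a generic implicit solver."""
    N, d = Q.shape
    theta = np.zeros((d, d))
    for _ in range(n_iter):
        # residual of the implicit equation; recomputed each iteration so the
        # same loop works when the right-hand side depends on theta
        target = sum(np.outer(p, sigma(q)) for q, p in zip(Q, P)) / N
        theta = theta - lr * (theta - target)
    return theta

rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 3))
P = rng.standard_normal((4, 3))
theta_hat = solve_theta_unrolled(Q, P)
theta_star = sum(np.outer(p, sigma(q)) for q, p in zip(Q, P)) / 4  # closed form
```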
B Analysis of the number of free parameters
It is instructive to compare the number of parameters of a shooting approach with the typical approach of optimizing a neural network (where the parameter dependency at optimality is only considered implicitly at convergence of the numerical solution, rather than explicitly during the shooting). We focus on the cases of affine and convolutional layers for illustration. Consider a DNN with a depth of $D$ layers, each of them parametrized by a shallow hidden layer of $P$ parameters. The number of free parameters is then $DP$, compared to $2KS$, where $K$ is the number of active particles, each of state-space size $S$ (each particle contributes a position and a momentum of size $S$). Hence, solutions with fewer than $DP/(2S)$ particles provide benefits in the number of free parameters.

(Footnote: For example, $S$ for our UpDown model simply corresponds to the dimension of its state space: $S = (\alpha + 1)d$, where $\alpha$ is the inflation factor and $d$ the data dimension. Note that in our experiments with the UpDown model we also learned an affine map from the initial conditions $x(0)$ to the initial conditions $v(0)$. Such a map has $\alpha d(d+1)$ parameters. These parameters are included in the table of Fig. 3 and in Tables 1/2 summarizing the number of model parameters. However, we will not consider these parameters in our discussion here, as they would equally apply to both a shooting and a direct optimization approach and could also be avoided by simply initializing $v(0)$ to zero. A similar initialization-to-zero approach is, for example, commonly taken in ResNets when increasing the number of feature channels [21].)
Most remarkably, the number of free parameters is always $2KS$, regardless of the number of parameters of a particular layer, as the layer parameters are obtained via the shooting equations from the particle states. This is a consequence of regularizing the parameters in our loss, which couples them across time at optimality. We make this clearer in what follows.
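The bookkeeping above can be spelled out in a few lines of arithmetic; the depth, dimension, and particle counts below are hypothetical values chosen only to illustrate the comparison (each particle is counted as a (position, momentum) pair):

```python
def direct_params(D, P):
    """Free parameters when directly optimizing D layers with P parameters each."""
    return D * P

def shooting_params(K, S):
    """Free parameters for K particles, each a (position, momentum) pair in R^S."""
    return 2 * K * S

# hypothetical affine model: depth D, data dimension d, K particles
D, d, K = 20, 10, 15
S = d                    # particle state dimension for the affine model
P = d * (d + 1)          # an affine layer has d^2 + d parameters

n_direct = direct_params(D, P)     # 20 * 110 = 2200
n_shoot = shooting_params(K, S)    # 2 * 15 * 10 = 300
# shooting uses fewer free parameters whenever K < D * P / (2 * S)
```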
Affine layers
Recall that in our simple example of Sec. 3.1 the parameters $\theta(t) = [A(t), b(t)]$ of our affine model are given as

$$A(t) = M_A^{-1} \Big( -\sum_{j=1}^K p_j(t) \sigma(q_j(t))^\top \Big), \quad b(t) = M_b^{-1} \Big( -\sum_{j=1}^K p_j(t) \Big).$$

Here, $A(t)$ has $d^2$ and $b(t)$ has $d$ parameters; these parameters are indirectly given by the set of particles $\{(q_i(t), p_i(t))\}$ at any given time. Hence, for this model $S = d$ and $P = d(d+1)$. If we assume we have $K$ particles and compare to a discrete layer implementation of this model, then the particle-based approach has fewer free parameters if $2Kd < Dd(d+1)$. Importantly, the state-space dimension $d$ only enters the number of free parameters linearly for the particle approach ($2Kd$), while there is a quadratic dependence for direct optimization ($Dd(d+1)$). This is a direct consequence of the optimality condition, which couples the parameters $\theta(t)$ across time. One can see this phenomenon in action in the expression for $A(t)$ above, where the matrix $A$ is expressed as a sum of matrices $p_j(t)\sigma(q_j(t))^\top$ of rank $\le 1$. Concretely, a particle-based shooting approach uses fewer parameters if the number of particles $K < D(d+1)/2$. Another interesting observation based on this example is that even if we had only considered a linear model (i.e., without the bias term $b(t)$), the number of parameters for the particles would have still remained at $2KS$. This is again a consequence of optimality and of our parameter regularization. Note that this also means that even though our UpDown model

$$\dot{x}(t) = \theta_1(t)\sigma(v(t)) + b_1(t), \quad \dot{v}(t) = \theta_2(t)x(t) + b_2(t) + \theta_3(t)\sigma(v(t)),$$

has significantly more parameters $\theta(t) = [\theta_1(t), b_1(t), \theta_2(t), b_2(t), \theta_3(t)]$ when directly optimized, this has no direct impact on the number of free parameters of its particle-based parameterization. Only the state-space dimension matters.
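A sketch of this affine-layer reconstruction from particles (with $M_A$ and $M_b$ taken as identity matrices for simplicity, tanh as a stand-in nonlinearity, and hypothetical shapes):

```python
import numpy as np

def sigma(x):
    return np.tanh(x)

def affine_params_from_particles(Q, P, M_A=None, M_b=None):
    """Recover the affine-layer parameters A(t), b(t) from the particle set
    {(q_j(t), p_j(t))}: A = M_A^{-1}(-sum_j p_j sigma(q_j)^T),
    b = M_b^{-1}(-sum_j p_j). M_A, M_b default to identity for illustration."""
    A_rhs = -sum(np.outer(p, sigma(q)) for q, p in zip(Q, P))
    b_rhs = -P.sum(axis=0)
    A = A_rhs if M_A is None else np.linalg.solve(M_A, A_rhs)
    b = b_rhs if M_b is None else np.linalg.solve(M_b, b_rhs)
    return A, b

rng = np.random.default_rng(2)
Q = rng.standard_normal((6, 4))   # 6 particles in R^4
P = rng.standard_normal((6, 4))
A, b = affine_params_from_particles(Q, P)
```

Since each summand is an outer product, $A$ built from $K$ particles has rank at most $K$, which is the low-rank structure noted in the text.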
Concretely, if we were to instead consider a model of the form

$$\dot{x}(t) = \theta_1(t)\sigma(v(t)), \quad \dot{v}(t) = \theta_2(t)x(t),$$

the particle-based parameterization would stay unchanged! Only the way in which one infers $\theta(t)$ from the particles changes.

Convolutional layers
Shooting approaches for convolutional models can also be derived. We did not experiment with such models in this work. However, we show here that the number of free parameters may also be decreased with a particle-based approach; this will be interesting to explore in future work. Specifically, for convolutional layers a particle-based parameterization could be particularly effective, as one typically has quadratic complexity in the number of filters between convolutional layers (i.e., if a layer with $N$ feature channels is followed by a layer with $M$ feature channels, this requires estimating $N \times M$ convolutional filters and hence drastically influences the number of parameters for large $N$ or $M$). In contrast, a particle-based shooting approach does not increase the number of parameters, as it ties them together via the optimality conditions expressed by the shooting equations. As a rough estimate, for a standard convolutional ResNet with $D = 50$, the product $DP$ is so large that a substantial number of active particles can be afforded while still reducing the number of free parameters.

General Remarks
Nevertheless, all model parameters (e.g., $[A(t), b(t)]$ or all convolutional filters of a convolutional layer) are still instantiated during computation. It is important to note that, regardless of the chosen number of particles, a shooting neural network is a possible optimal solution (for a given data set) at any given time, not only at convergence. One optimizes over the family of possible optimal neural networks with the goal of finding the element within this family that best matches the observations.

C Automatic shooting
The general shooting equations were presented in Eq. (3.3). We then proceeded to explicitly derive the shooting equations for a continuous DNN with linear-in-parameter layers and UpDown layers in §3.1 and §3.2, respectively. While this was instructive, it is somewhat cumbersome, in particular for more complex models or when moving to convolutional networks. Fortunately, in practice these shooting equations do not need to be derived by hand. Indeed, they are completely specified by the Hamiltonian $H(p, x, \theta) = p^\top(\dot{x} - f(t, x, \theta)) + R(\theta)$, in the sense that the shooting equations in Eq. (3.3) are computed via differentiation of $H$. Specifically, the shooting equations in Eq. (3.3) are equivalently given by

$$\dot{x} = \frac{\partial H(p, x, \theta)}{\partial p}, \quad \dot{p} = -\frac{\partial H(p, x, \theta)}{\partial x}, \quad \theta \in \arg\min_\theta H(p, x, \theta).$$

As discussed above, the last equation can be replaced by solving $\partial_\theta R(\theta) - \sum_{i=1}^N \partial_\theta f(t, x_i, \theta)^\top p_i = 0$. Automatic differentiation can be used to obtain the shooting equations automatically. As fitting a shooting model requires differentiating the shooting equations, we in effect end up differentiating twice. This can be done seamlessly using modern deep learning libraries, such as PyTorch.

D Universality of the UpDown model
In this section, we set out to demonstrate that the UpDown model is universal in the sense that its associated flow can come $\varepsilon$-close to the flow of any well-behaved time-dependent vector field. Define the shallow hidden layer vector field with time-varying parameters $\theta(t) = (\theta_1(t), \theta_2(t), b_1(t))$ as

$$\mathrm{shl}_\theta(x) = \theta_2 \sigma(\theta_1 x + b_1).$$

While shooting with the shallow hidden layer is theoretically appealing, as $\mathrm{shl}_\theta$ is universal [1], it would result in implicit shooting equations. We first show that the UpDown model introduced in Sec. 3.2 can give the same flow as the shallow hidden layer (Lemma 6) and then leverage this relationship to show that the UpDown model inherits the universality of the shallow hidden layer (Proposition 7). We recall the reformulation presented in Sec. 3.2.
Lemma 5.
Let $\mathrm{shl}_{\theta(t)}: \mathbb{R}^d \to \mathbb{R}^d$ be a time-dependent vector field with $\theta_1(t)$ and $b_1(t)$ piecewise $C^1$ and $\theta_2(t)$, $b_2(t)$ continuous. Then there exists a parametrization of the UpDown model that gives the same flow at a fixed time, $T = 1$.

Proof. We rewrite the differential equation

$$\dot{q} = \theta_2(t)\sigma(\theta_1(t)q + b_1(t)) + b_2(t)$$

by introducing the additional state variable $v = \theta_1(t)q + b_1(t)$, which we differentiate w.r.t. time. We obtain $\dot{v} = \dot{\theta}_1(t)q + \dot{b}_1(t) + \theta_1(t)\dot{q}$. Replacing $\dot{q}$ by its formula, we get $\dot{v} = \dot{\theta}_1(t)q + \dot{b}_1(t) + \theta_1(t)\big(\theta_2(t)\sigma(v(t)) + b_2(t)\big)$. The system can thus be rewritten in the UpDown form

$$\dot{q} = \tilde{\theta}_1(t)\sigma(v(t)) + \tilde{b}_1(t), \quad \dot{v} = \tilde{\theta}_2(t)q(t) + \tilde{b}_2(t) + \tilde{\theta}_3(t)\sigma(v(t)). \quad (D.1)$$

Therefore, with the initial conditions $v(0) = \theta_1(0)q(0) + b_1(0)$ and $q(0) = q_0$, the two systems of ordinary differential equations are equivalent.

Note that the key point in this lemma is the loss of regularity in the evolution of $\theta$, since we differentiated once in time. For that reason, we now show that adding more dimensions via the inflation factor $\alpha$ alleviates this issue. It is likely possible to prove a universality result using only $\alpha = 1$; we leave this question for future work. Experimentally, however, the inflation factor has a crucial effect on the performance of the optimization, as discussed in Sec. 4.

Lemma 6.
Let $\mathrm{shl}_{\theta(t)}: \mathbb{R}^d \to \mathbb{R}^d$ be a time-dependent vector field with $\theta(t)$ piecewise continuous. Then there exists a parametrization of the UpDown model that gives the same flow.

Proof. Without loss of generality, we only treat the case of one discontinuity in time of the parametrization; we thus assume that $\theta(t)$ is continuous on $[0, t_0)$ and $[t_0, 1]$. We consider $q, v_1, v_2 \in \mathbb{R}^d$, where $q, v_1$ are defined as in Lemma 5. Up to time $t_0$, we define $v_2$ as an affine function of $v_1$, which implies (differentiating w.r.t. time) that $v_2$ follows an evolution equation similar to that of $v_1$ and thus can be encoded in the general form of Eq. (D.1). On $[t_0, 1]$, $q(t), v_2(t)$ are then defined by the evolution Eq. (D.1) in order to coincide with the flow of $\mathrm{shl}_{\theta(t)}$ on $[t_0, 1]$, for well-chosen parameters as in Lemma 5. Since the value of $v_1(t)$ is not used in the evolution equation of $q$ on $[t_0, 1]$, we can simply extend it by any valid evolution equation of the form of Eq. (D.1). In the general case, we decompose the time interval $[0, 1]$ into $k$ intervals $[t_i, t_{i+1})$ on which $\theta(t)$ is continuous, and the proposed method extends directly using an inflation factor $\alpha = k$, introducing $v_k \in \mathbb{R}^d$.

Note that the result of this lemma gives an equality between the two flows defined on the whole space $\mathbb{R}^d$. The next result is an approximation result which holds on a compact domain $K \subset \mathbb{R}^d$. For a function $f: \mathbb{R}^d \to \mathbb{R}$, we denote $\|f\|_{K,\infty} = \sup_{x \in K} |f(x)|$.

Proposition 7.
The UpDown model is universal in the class of time-dependent vector fields. Let $K \subset \mathbb{R}^d$ be a compact domain. For every time-dependent vector field $w: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ (continuous in time and Lipschitz in space) with associated flow $\varphi(t, x)$, there exist time-dependent parameters of the UpDown model such that the solution of
$$
\begin{cases}
\dot q = \theta_1\,\sigma(v) + b_1,\\
\dot v = \theta_3\,q + b_2 + \theta_2\,\sigma(v),
\end{cases}
$$
is $\varepsilon$-close to the solution $\varphi(1, x)$, i.e. $\|\varphi(1, x) - q(1, x)\|_{K,\infty} \le \varepsilon$. Note that the case $\alpha = 1$ is similar in its formulation to a second-order model on $q$.

Proof. The proof is standard and we include it here for self-containedness. It is a consequence of [1] and Lemma 6. Let $B(0, r)$ be a ball of radius $r$ in $\mathbb{R}^d$ which contains $\varphi(t, K)$ for all times $t \in [0, 1]$. The flow associated with a given time-dependent vector field $v(t, \cdot)$ can be approximated by the flow of a vector field which is piecewise constant in time; i.e., let $\varepsilon > 0$ be a positive real; by continuity in time of $v(t, \cdot)$, there exists a decomposition of $[0, 1]$ into $k$ intervals $[t_i, t_{i+1}]$ and Lipschitz vector fields $v_i(x) = \mathrm{shl}_{\theta_i}(x)$ such that $\|v_i(x) - v(t, x)\|_{B(0,r),\infty} \le \varepsilon$ for $t \in [t_i, t_{i+1}]$. Denote by $w(t, \cdot)$ the time-dependent vector field defined by $w(t, \cdot) = v_i(\cdot)$ for all $t \in [t_i, t_{i+1}]$. Thus, denoting the flow of $v(t, \cdot)$ by $\varphi_v$ and the flow of $w(t, \cdot)$ by $\varphi_w$, we get
$$
\|\varphi_v(1, x) - \varphi_w(1, x)\| \le \int_0^1 \|v(t, \varphi_v(t, x)) - v(t, \varphi_w(t, x))\| + \|v(t, \varphi_w(t, x)) - w(t, \varphi_w(t, x))\| \,\mathrm{d}t,
$$
$$
\|\varphi_v(1, x) - \varphi_w(1, x)\|_{K,\infty} \le \int_0^1 \mathrm{Lip}(v)\,\|\varphi_v(t, x) - \varphi_w(t, x)\|_{K,\infty} + \|v(t, \cdot) - w(t, \cdot)\|_{B(0,r),\infty} \,\mathrm{d}t \le \int_0^1 \mathrm{Lip}(v)\,\|\varphi_v(t, x) - \varphi_w(t, x)\|_{K,\infty} \,\mathrm{d}t + \varepsilon\,,
$$
where $\mathrm{Lip}(v)$ denotes a bound on the Lipschitz constant of $v(t, x)$ w.r.t.
$x \in B(0, r)$ for all $t \in [0, 1]$. Then, the Grönwall lemma [41] gives
$$
\|\varphi_v(1, x) - \varphi_w(1, x)\|_{K,\infty} \le \varepsilon\, e^{\mathrm{Lip}(v)}. \tag{D.2}
$$
By Lemma 6, $\varphi_w(1, x)$ can be approximated by the flow of the UpDown model, and the result is obtained by the triangle inequality.

In this section, we focused on a universality result in the space of time-dependent vector fields. Interestingly, due to the additional dimensions, it is likely that the model is universal in the space of functions as well. This conjecture is supported by the quadratic 1D function regression example, which shows that the UpDown model is able to capture some maps which are not homeomorphic. We leave this question for future work.
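The change of variable in Lemma 5 can be checked numerically: integrating the $\mathrm{shl}_{\theta(t)}$ ODE and the augmented system of Eq. (D.1) side by side yields the same $q$ at $T = 1$. Below is a minimal sketch for the scalar case $d = 1$ with $\sigma = \tanh$; the smooth parameter choices and the `rk4` helper are illustrative, not taken from the paper.

```python
import math

# Hypothetical smooth parameter choices (not from the paper); sigma = tanh.
th1 = lambda t: 1.0
th2 = lambda t: 1.0 + 0.5 * t      # d(th2)/dt = 0.5
b1  = lambda t: 0.1
b2  = lambda t: 0.2 * t            # d(b2)/dt = 0.2
sigma = math.tanh

def rk4(f, y0, t0, t1, n):
    """Classical fourth-order Runge-Kutta for a vector-valued ODE y' = f(t, y)."""
    h, t, y = (t1 - t0) / n, t0, list(y0)
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
        k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
        k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

# Single-hidden-layer field: qdot = th1 * sigma(th2 * q + b2) + b1.
shl = lambda t, y: [th1(t) * sigma(th2(t) * y[0] + b2(t)) + b1(t)]

# Augmented system (D.1) in the state (q, v), with v = th2 * q + b2:
def updown(t, y):
    q, v = y
    return [th1(t) * sigma(v) + b1(t),
            0.5 * q + th2(t) * th1(t) * sigma(v) + 0.2 + th2(t) * b1(t)]

q0 = 0.3
q_shl = rk4(shl, [q0], 0.0, 1.0, 1000)[0]
q_ud = rk4(updown, [q0, th2(0.0) * q0 + b2(0.0)], 0.0, 1.0, 1000)[0]
assert abs(q_shl - q_ud) < 1e-6  # the two flows agree at T = 1
```

The initial condition `th2(0) * q0 + b2(0)` is exactly the $v(0) = \theta_2(0)q(0) + b_2(0)$ required by the lemma; with any other initialization of $v$ the two trajectories would diverge.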
E Experimental settings
This section describes our experimental settings. We use our UpDown model for all experiments and simply use a weighted Frobenius norm penalty for all parameters. Specifically, we weigh this penalty equally for all parameters except for $\theta_3$, which we penalize by 10. In our experiments we have observed better convergence properties for higher penalties on $\theta_3$. This might be due to the special role that $\theta_3$ plays in the model, as it subsumes a quadratic term in the original derivation of the UpDown model (see Sec. 3.2). In all experiments we also optimize over the affine map from $x(0)$ to $v(0)$ for the data evolution.

Simple function regression
We use 500 epochs for all experiments. For all particle-based experiments we freeze the positions of the particles for the first 50 epochs. We use a ReLU activation function and the MSE loss. We weight the MSE loss by 100 and the parameter norm loss by 1. We use 500 training samples, 1,000 testing samples, and 1,000 validation samples, and a batch size of 50. Note that for these simple examples there is in practice no real difference between the training, testing, and validation data, as the number of samples is large and the domain is [−., .]. We initialize the particle positions uniformly at random in [−., .] and draw the momenta from a Gaussian distribution with zero mean and standard deviation . . All time-integrations are done via a fourth-order Runge–Kutta integrator with time-step 0.1. For optimization we use Adam with a learning rate of 0.01 and the ReduceLROnPlateau learning rate scheduler of PyTorch with a learning rate reduction factor of 0.5.

Spiral
The spiral data is generated between time $t = 0$ and $t = 10$ with 200 uniformly spaced timepoints. Training is only on small time snippets with an approximate length of . time-units. Evaluation is on these short time snippets as well as on the entire trajectory by pasting together solutions for these short time snippets, i.e., an individual short solution starts where the previous one ends. Settings for the spiral are the same as for the simple function regression, with the following exceptions. We use 1,500 epochs. The step-size for the fourth-order Runge–Kutta integrator is 0.05. The MSE loss is still weighted by 100, but the parameter norm loss only by 0.01. We randomly draw 100 new training samples during each epoch and use 100 evaluation samples, 1,000 short-range samples, and 1 long-range testing sample. All samples are randomly drawn from the trajectory. However, as the trajectory is traversed at highly nonuniform speed, the samples are drawn from a uniform distribution across the trace of the spiral. As for the simple function regression experiment, there is little practical difference between the training, validation, and testing data, as the problem is so simple. However, this is not of concern in these experiments, as the prime objective is to study the fitting behavior of the different models. We use a batch size of 100.

Rotating MNIST
We use the data provided by the authors of [40] and follow the same autoencoder architecture, except that our encoder maps into a 20-dimensional representation space. The number of particles is set to 100 and the inflation factor $\alpha$ is set to 10. For optimization, we use Adam with a learning rate of 0.001 and the CosineAnnealingLR learning rate scheduler of PyTorch. We train for 500 epochs with a batch size of 25 and the parameter norm loss weight set to 0.1.
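For reference, CosineAnnealingLR follows a closed-form half-cosine decay of the learning rate. A minimal pure-Python sketch of the schedule is shown below; `eta_min = 0` and `t_max = 500` are assumptions matching our settings (no warm restarts).

```python
import math

def cosine_annealing_lr(step, eta_max=0.001, eta_min=0.0, t_max=500):
    """Learning rate at `step` under cosine annealing:
    eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * step / t_max))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * step / t_max))

assert abs(cosine_annealing_lr(0) - 0.001) < 1e-12       # starts at the base learning rate
assert abs(cosine_annealing_lr(500) - 0.0) < 1e-12       # decays to eta_min at t_max
assert cosine_annealing_lr(250) < cosine_annealing_lr(100)  # monotone decay on [0, t_max]
```

In contrast to ReduceLROnPlateau (used for the function regression and spiral experiments), this schedule is independent of the observed loss.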
Bouncing balls
As in the rotating MNIST experiment, we rely on the data provided by the authors of [40], follow their autoencoder architecture, and set the dimensionality of the representation space of the encoder to 50. The first three images of each sequence are provided to the encoder by concatenating the images along the channel dimension. The inflation factor $\alpha$ is set to 20 and we use 100 particles. We optimize over 100 epochs using Adam with the CosineAnnealingLR learning rate scheduler of PyTorch, the learning rate set to 0.001, and the parameter norm loss weight set to 0.0001.
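The channel-wise concatenation of the first three frames can be sketched as follows; the frame resolution and sequence length are illustrative placeholders (the actual values follow the data of [40]).

```python
import numpy as np

# Hypothetical shapes: a sequence of T grayscale frames, each C x H x W.
T, C, H, W = 20, 1, 32, 32
sequence = np.random.rand(T, C, H, W)

# The encoder sees the first three frames stacked along the channel dimension,
# i.e. a single (3*C) x H x W input instead of three separate C x H x W images.
encoder_input = np.concatenate([sequence[0], sequence[1], sequence[2]], axis=0)
assert encoder_input.shape == (3 * C, H, W)
```

Providing three consecutive frames gives the encoder access to velocity information, which a single frame of a bouncing-ball sequence cannot convey.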
F Additional results
Simple function regression
In Section 4, we considered approximating a quadratic-like function. Here we show parallel results for approximating a cubic function $y = x^3$. We also include some additional figures for the quadratic-like regression function. Note that whereas the cubic function is invertible (but not diffeomorphic), the quadratic-like one considered in Section 4 is a simple example of a non-invertible function. Tab. 1 shows the number of parameters for the four different formulations for both regression functions. Fig. 6 shows, for the cubic regression, the test loss and the network complexity, as measured by the Frobenius norm [29], for the four formulations. On average, the particle-based approaches show the best fits with the lowest complexity measures, indicating the simplest network parameterization. Note, however, that while the dynamic particle approach greatly outperformed the static particle approach for the quadratic-like function (see Fig. 2), this is not the case here. In fact, the static particle approach shows slightly better fits than the dynamic one. This might be because the cubic function is significantly simpler to fit and hence may not benefit as much from the dynamic approach. To illustrate that fitting the quadratic-like function is indeed harder, Figs. 7 and 8 show function fits for different numbers of particles for the cubic function and the quadratic-like function, respectively. All these fits are for the particle-based dynamic UpDown model. Clearly, very few particles can achieve reasonable fits for a simple function. As few as two particles already show a good fit for the cubic function, whereas the quadratic-like function requires more particles. This supports our hypothesis that fitting more complex functions may require more particles.

Since our approach is based on the time-integration of the UpDown model, it is interesting to see 1) how the mapping is expressed across time and 2) how the parameters, $\theta(t)$, of the UpDown model change over time. Fig. 9 shows example mappings for the cubic and the quadratic-like function, respectively. The estimated mappings are highly regular. Lastly, Figs. 10 and 11 show the time-evolutions of the model parameters for the cubic and the quadratic-like function for two different inflation factors. While different parameters show different dynamics, clear changes over time can be observed. In particular, $\theta_2(t)$ and $b_2(t)$ show strong changes. These parameters mostly control the behavior of the hidden high-dimensional state, $v$, whereas $\theta_3(t)$ is penalized significantly more in our model (see Sec. E) and consequently shows more moderate changes.

Table 1:
Number of parameters for the cubic and quadratic-like simple function regression experiments.
Figure 6:
Function fit (15 particles) for the cubic $y = x^3$ for 10 random initializations. Left: test loss; Right: time-integral of the log of the Frobenius norm complexity. Lower is better for both measures. * indicates the number of removed outliers (outside the interquartile range (IQR) by ≥ 1.5 × IQR); $\alpha$ denotes the inflation factor.

Spiral
Tab. 2 shows the number of parameters in each of the four formulations for the spiral experiment. This table complements the table in Fig. 3, which only showed the number of parameters when using 15 particles.

[Figure 7 panels: fits with 2, 5, 15, and 25 particles.]

Figure 7:
Fits for the cubic function with inflation factor 16 and for different numbers of particles. Vertical lines indicate particle positions after optimization. While subtle, the figures suggest that using more particles allows for a better approximation of the function. This is confirmed by the test loss values in Figure 6 (bottom left).

[Figure 8 panels: fits with 2, 5, 15, and 25 particles.]

Figure 8:
Fits for the quadratic-like function for inflation factor 16 with different numbers of particles. Vertical lines indicate particle positions after optimization. As this function is more complex than the cubic function, 2 and 5 particles are not sufficient for a good fit, but 15 and 25 particles result in a well-fitting approximation.

Figure 9: Mapping of the cubic function (left) and the quadratic-like function (right). As can be seen, the mappings are highly regular.

Table 2:
Number of parameters for the spiral experiment.
Inflation factor                       16       32       64       128
static/dynamic w/ particles, 15     1,116    2,172    4,284     8,508
static/dynamic w/ particles, 25     1,796    3,492    6,884    13,668
static/dynamic w/ particles, 50     3,496    6,792   13,384    26,568
static direct, n/a                  1,282    4,610   17,410    67,586
dynamic direct, n/a                 6,026   22,282   85,514   334,858
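The counts in Tab. 2 can be reproduced from the model dimensions. The bookkeeping below is our reconstruction inferred from the reported numbers for the spiral ($d = 2$), not a definition taken from the main text; in particular, the factor of 5 time control points for the dynamic direct parametrization is an inference.

```python
# Reconstructed parameter bookkeeping for Tab. 2 (spiral: d = 2). These formulas
# are inferred from the reported counts, not taken from the main text.
d = 2

def affine_map(alpha):
    # Affine map from x(0) in R^d to v(0) in R^(alpha*d): weight matrix + bias.
    return alpha * d * d + alpha * d

def particle_params(n_particles, alpha):
    # Each particle carries a position and a momentum for the full
    # state (q, v) in R^(d + alpha*d).
    return n_particles * 2 * (d + alpha * d) + affine_map(alpha)

def direct_params(alpha, n_time_points=1):
    # UpDown parameters: theta_1 (d x alpha*d), b_1 (d), theta_2 (alpha*d x alpha*d),
    # b_2 (alpha*d), theta_3 (alpha*d x d); 'dynamic direct' repeats them per time point.
    per_time = d * alpha * d + d + (alpha * d) ** 2 + alpha * d + alpha * d * d
    return n_time_points * per_time + affine_map(alpha)

assert particle_params(15, 16) == 1116 and particle_params(50, 128) == 26568
assert direct_params(16) == 1282 and direct_params(128) == 67586
# Dynamic direct matches with 5 time control points (our inference).
assert direct_params(16, n_time_points=5) == 6026
assert direct_params(128, n_time_points=5) == 334858
```

Under this reconstruction, the particle-based formulations grow linearly in $\alpha$ while the direct parametrizations grow quadratically (through the $(\alpha d)^2$ term of $\theta_2$), which explains the gap in Tab. 2 at large inflation factors.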
Figure 10:
Weight evolution across time (i.e., continuous depth) for 15 particles when fitting the cubic function using the UpDown model: $\dot x(t) = \theta_1(t)\sigma(v(t)) + b_1(t)$, $\dot v = \theta_3(t)x(t) + b_2(t) + \theta_2(t)\sigma(v(t))$. Results are for the dynamic with particles approach. Top: inflation factor 16. Bottom: inflation factor 64. Changes in parameter values can clearly be observed.
Figure 11:
Weight evolution across time (i.e., continuous depth) for 15 particles when fitting the quadratic-like function using the UpDown model: $\dot x(t) = \theta_1(t)\sigma(v(t)) + b_1(t)$, $\dot v = \theta_3(t)x(t) + b_2(t) + \theta_2(t)\sigma(v(t))$. Results are for the dynamic with particles approach. Top: inflation factor 16. Bottom: inflation factor 64. Changes in parameter values can clearly be observed.