A Shooting Formulation of Deep Learning
François-Xavier Vialard∗, Roland Kwitt†, Susan Wei‡, Marc Niethammer§

June 19, 2020
Abstract
Continuous-depth neural networks can be viewed as deep limits of discrete neural networks whose dynamics resemble a discretization of an ordinary differential equation (ODE). Although important steps have been taken to realize the advantages of such continuous formulations, most current techniques are not truly continuous-depth as they assume identical layers. Indeed, existing works throw into relief the myriad difficulties presented by an infinite-dimensional parameter space in learning a continuous-depth neural ODE. To this end, we introduce a shooting formulation which shifts the perspective from parameterizing a network layer-by-layer to parameterizing over optimal networks described only by a set of initial conditions. For scalability, we propose a novel particle-ensemble parametrization which fully specifies the optimal weight trajectory of the continuous-depth neural network. Our experiments show that our particle-ensemble shooting formulation can achieve competitive performance, especially on long-range forecasting tasks. Finally, though the current work is inspired by continuous-depth neural networks, the particle-ensemble shooting formulation also applies to discrete-time networks and may lead to a fertile new area of research in deep learning parametrization.
1 Introduction

Deep neural networks (DNNs) are closely related to optimal control (OC), a well-established field with both theoretical and practical insights on offer [25, 24, 20]. The sought-for control variable of DNNs corresponds to the parameters of the neural network layers. Talking about an optimal control requires the definition of a control cost, i.e., a norm on the control variable. We explore the ramifications of such a control cost in the context of DNN parametrization. For simplicity, we focus on continuous formulations in the spirit of neural ODEs [14]. However, both discrete and continuous OC formulations exist [13, 5, 38] and our approach could also be developed for the former.

Initial work on continuous DNN formulations was motivated by the realization that a
ResNet [21, 22] resembles Euler forward time-integration [20, 24]. Specifically, the forward pass of some input vector x̃ through a network with L layers, specified as x(0) = x̃ and x(j+1) = x(j) + f(x(j), θ(j)), j = 0, 1, ..., L−1, closely relates to an explicit Euler [37] discretization of the ODE

    ẋ(t) = f(t, x(t), θ(t)),   x(0) = x̃,   0 ≤ t ≤ T.    (1.1)

In the ODE-inspired DNN setting, we seek an optimal θ such that the terminal prediction given by x(T), i.e., the solution to Eq. (1.1) at time T, minimizes ℓ(x(T)) for a task-specific loss function ℓ.

∗ LIGM, UPEM; [email protected]
† Department of Computer Science, University of Salzburg; [email protected]
‡ School of Mathematics and Statistics, University of Melbourne; [email protected]
§ Department of Computer Science, University of North Carolina at Chapel Hill; [email protected]

Figure 1:
Optimization in the neural ODE (NODE) framework of [14] (left) amounts to a forward pass with the gradient computed via backpropagation. Optimization under the shooting principle (middle) turns the forward-backward system into a forward second-order system, where we essentially run the backpropagation equation forward. We use a Hamiltonian particle ensemble (right) consisting of (position, momentum) pairs (q_j, p_j) to make the shooting efficient. In shooting, θ is time-dependent, whereas in NODE θ(t) = θ for all t.

Although Eq. (1.1) with time-varying parameter θ(t) can be considered as a neural network with an infinite number of layers, current implementations of ODE-inspired networks largely assume parameters θ that are fixed in time [14, 15], i.e., ∀t: θ(t) = θ, or that follow a designed, prescribed dynamics [44]. Instead, we explore time-varying θ(t). We employ regularization (i.e., a control cost) to render estimating θ(t) well-posed and to assure regularity of the resulting flow. Hence, we explore the regularized loss

    E(θ) = ∫_0^T R(θ(t)) dt + λ ℓ(x(T)),   λ ∈ ℝ₊,   subject to Eq. (1.1),    (1.2)

where R: ℝᵈ → ℝ₊, so that θ ↦ ∫_0^T R(θ(t)) dt is a complexity measure of θ ∈ L²([0, T], ℝᵈ) corresponding to the control cost. Instead of directly optimizing over the set of time-dependent θ(t) as in standard ResNets, we restrict the optimization set to those θ which are critical points of E(θ), thereby dramatically reducing the number of parameters. In doing so, one can describe the optimization task as an initial value problem. Namely, we show that we can rewrite the loss in Eq. (1.2) solely in terms of the input x(0) and a corresponding finite-dimensional momentum variable, p(0). Such an approach, just like optimizing the initial speed of a mass particle to reach a given point, is called a shooting method in numerical analysis [31] and control [11], giving its name to our new formulation.

The first two panels of Fig.
1 illustrate the difference between the optimization of a neural ODE (NODE) via [14] and our shooting formulation. Since, in practice, we have multiple inputs x̃_i, i = 1, ..., n, there is an initial momentum vector p_i corresponding to each of them. If the shooting formulation is to scale up to a large sample size n, we must take care that the parametrization does not grow linearly with n. To this end, we propose what we call the Hamiltonian particle-ensemble parametrization. It is a finite set of particles, where each particle is a (position, momentum) pair. The initial conditions of these particle pairs {(q_j, p_j)}, j = 1, ..., K (where K ≪ n), completely determine θ(t). This is illustrated in the rightmost panel of Fig. 1. Once the optimized set of particles has been computed, the computational efficiency of the forward model, similarly to NODE [14], is retained for vector fields f that are linear in their parameters θ(t).

Our contributions are as follows: 1) We introduce a shooting formulation for DNNs, amounting to an initial-value formulation for neural network parametrization. This allows for optimization over the original network parameter space via optimizing over the initial conditions of critical networks only; 2) We propose an efficient implementation of the shooting approach based on a novel particle-ensemble parametrization in which a set of initial particles (the (position, momentum) pairs) describes the space of putative optimal network parameters; 3) We propose the UpDown model which gives rise to explicit shooting equations; 4) We prove universality for the flows of the UpDown vector field and demonstrate in experiments its good performance on several prediction tasks.
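The correspondence between Eq. (1.1) and a ResNet forward pass can be made concrete in a few lines. A minimal sketch (the scalar residual map f and all parameter values below are illustrative assumptions, not taken from the paper):

```python
import math

def euler_forward(f, x0, theta_of_t, T=1.0, L=10):
    """Explicit Euler integration of x'(t) = f(x(t), theta(t)) with L steps of
    size h = T/L; for h = 1 this is exactly the ResNet update
    x(j+1) = x(j) + f(x(j), theta(j))."""
    h = T / L
    x = list(x0)
    for j in range(L):
        fx = f(x, theta_of_t(j * h))
        x = [xi + h * fi for xi, fi in zip(x, fx)]
    return x

# Illustrative scalar "layer": f(x, theta) = tanh(theta * x).
f = lambda x, th: [math.tanh(th * xi) for xi in x]
x_T = euler_forward(f, [1.0], theta_of_t=lambda t: 0.5, T=1.0, L=100)
```

With step size one, `euler_forward` reduces to the discrete residual network; refining the step while keeping T fixed yields the continuous-depth limit.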
2 Related work

We draw inspiration from two separate branches of research: 1) continuous formulations of neural networks [14] and 2) shooting approaches for deformable image registration [2, 28, 30].
Continuous-depth neural networks.
Continuous equivalents of ResNets [21, 22] have been developed in [33, 20], but naïve implementations are memory-demanding since backpropagation requires differentiating through the numerical integrator. Two approaches can address this unfavorable memory footprint. Neural ODE [14] does not store intermediate values in the forward pass, but recomputes them by integrating the forward model backward. This is easily possible only if the forward model is numerically invertible and the formulation is time-continuous.¹ Instead, checkpointing [18] is a general approach to reduce memory requirements by selectively recomputing parts of the forward solution [19]. Our work can easily be combined with these numerical approaches.

Solving implicit equations.
A recent line of works, including deep equilibrium models [7] and implicit residual networks [32], has shown that it may not always be necessary to freely parametrize all the layers in a network. Specifically, in [7] and [32], the parameters of each layer are defined via an implicit equation motivated by weight tying, thus improving expressiveness and reducing the number of parameters, while decreasing the memory footprint via implicit differentiation. Instead, our work increases expressiveness and reduces the number of parameters via particle-based shooting.
Invertibility and expressiveness.
Based on the similarity with continuous-time integration, constraining the norm of a layer in a ResNet will result in an invertible network, as in [9, 23]. Invertibility is also explored in [42], where it is enforced (as in our setting) via a penalty on the norm. These works show that standard learning tasks can be performed on top of a one-to-one transformation. Recent theoretical developments [43] show that indeed capping a NODE or i-ResNet [9] with a single linear layer gives universal approximation for non-invertible continuous functions. Further, expressiveness can be increased by moving to more complex models, e.g., by introducing additional dimensions as explored in augmented NODE [15]. In [44] (AnodeV2), Zhang et al. treat time-dependent θ(t). Weights are evolved jointly with the state of the continuous DNN. While this weight evolution could, in principle, also be captured by a learned weight network, the authors argue that this would result in a large increase in parameters and therefore opt for explicitly parametrizing these evolutions (e.g., via a reaction-diffusion equation). In contrast, our method does not rely on learning a separate weight network or on explicitly specifying a weight evolution. Instead, our evolving weights are a direct consequence of the shooting equations which, in turn, are a direct consequence of penalizing network parameters (the control cost) over time; a large increase in parameters does not occur.

Hamiltonian approaches.
Toth et al. [36] proposed Hamiltonian generative networks to learn the Hamiltonian governing the evolution of a physical system. Specifically, they learn Hamiltonian vector fields in the latent space of an image encoder-decoder architecture. Sæmundsson et al. [34] also learn the underlying dynamics of a system from time-dependent data, starting from a discrete Lagrangian combined with a variational integrator. This motivates particular network structures, e.g., Newtonian networks where the potential energy is learned via a neural network. Although sharing common tools, our work completely differs from this line of research in the sense that we exploit Hamiltonian mechanics to parametrize general continuous neural networks. In principle, our work applies to most network architectures and is not specific to physical data.

Finally, we mention that shooting approaches have been applied successfully in other areas such as diffeomorphic image matching [28, 2, 30]. However, the decisive difference here is in the dimensionality of the underlying space: in image registration it is typically a 3D image; for DNNs, it is usually a high-dimensional space (ℝᵈ) with data sampled from a distribution in that space.

¹ In a discrete setting, resolving the forward model in the backward direction generally requires costly solving of implicit equations. This can be done (it is, e.g., done for invertible ResNets [9]). In general, an explicit numerical solution for forward time-integration becomes implicit in the backward direction and vice versa.

3 Methodology

We consider, for simplicity, a supervised learning task where the input and target spaces are
X ⊂ ℝᵈ and Y, respectively, and sampled data are denoted by {(x̃_i, ỹ_i)}, i = 1, ..., n, ⊂ X × Y. The goal is to learn the weight θ(t) in the following flow equation

    ẋ_i(t) = f(x_i(t), θ(t)),   x_i(0) = x̃_i,   0 ≤ t ≤ T    (3.1)

such that it minimizes the loss Σ_{i=1}^n ℓ(x_i(T), ỹ_i) for some loss function ℓ. In existing works, the weight is chosen independent of time, i.e., θ(t) = θ [14], or specific evolution equations are postulated for it [26, 44]. Such strategies show the difficulty of addressing infinite-dimensional parametrizations of time-dependent θ and the need for regularization for well-posedness [17, 26, 20]. Instead of parametrizing θ(t) directly, we aim at penalizing θ(t) according to the regularity of f(·, θ(t)) to arrive at a well-posed problem. Specifically, we consider a regularization term R(θ(t)) (discussed in §3.1) and propose to minimize, for λ > 0,

    E_n(θ) = ∫_0^T R(θ(t)) dt + λ Σ_{i=1}^n ℓ(x_i(T), ỹ_i)   subject to Eq. (3.1).    (3.2)

Note that upon discretizing the time t (i.e., having a number of parameters proportional to the number of timesteps) this is similar to a ResNet with weight decay. For a ResNet or a NODE, optimization is based on computing the parameter gradient via a forward pass followed by backpropagation (see left panel of Fig. 1).
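Discretized, Eq. (3.2) is just a ResNet-with-weight-decay objective. A minimal sketch for a scalar toy model, where f(x, θ) = θ·tanh(x) and R(θ) = θ² are illustrative assumptions of ours:

```python
import math

def discretized_objective(thetas, xs, ys, lam=1.0, T=1.0):
    """E_n(theta) of Eq. (3.2) after forward-Euler discretization:
    h * sum_t R(theta_t)  +  lam * sum_i l(x_i(T), y_i),
    with one parameter theta_t per time step (i.e., per layer)."""
    L = len(thetas)
    h = T / L
    reg = h * sum(th * th for th in thetas)  # discretized integral of R(theta(t))
    fit = 0.0
    for x0, y in zip(xs, ys):
        x = x0
        for th in thetas:                    # Euler pass through the "layers"
            x = x + h * th * math.tanh(x)
        fit += (x - y) ** 2                  # squared-error loss l(x(T), y)
    return reg + lam * fit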
Optimality equations.
The optimality conditions for Eq. (3.2) in continuous time are:

    ẋ_i(t) − f(x_i(t), θ(t)) = 0,   x_i(0) = x̃_i,                                (data evolution)
    ṗ_i(t) + ∂_x f(x_i(t), θ(t))^T p_i(t) = 0,   p_i(T) = −∇ℓ(x_i(T), ỹ_i),      (adjoint evolution)
    ∂_θ R(θ(t)) − Σ_{i=1}^n ∂_θ f(x_i(t), θ(t))^T p_i(t) = 0.                    (compatibility)    (3.3)

The first equation is the forward model; the second is the adjoint equation, solved backward in time in order to compute the gradient with respect to the parameters. At convergence, the third equation is also satisfied. This last equation encodes the optimality of the layer at timestep t, as is the case for an argmin layer or weight tying [3]. Its left-hand side corresponds to the gradient with respect to the parameter θ but, as we shall see, it will allow us to compute θ directly via the (position, momentum) pairs in our particle shooting formulation. The shooting approach simply replaces the optimization set by the set of critical points of Eq. (3.2) expressed in these optimality conditions. That is, we only optimize over solutions fulfilling Eq. (3.3).

Shooting principle.
The shooting method is standard in optimal control [11] and can be formulated as follows: since, at optimality, the system in Eq. (3.3) is satisfied, one can turn this system into a forward model defined only by its initial conditions {(x_i(0), p_i(0))}, i = 1, ..., n, which specify the entire trajectory of optimal parameters. We refer to the forward model defined by Eq. (3.3) as the shooting equations. Unfortunately, this initial-condition parametrization still requires all initial conditions x_i(0) and their corresponding momenta p_i(0) for i = 1, ..., n. Since this does not scale to very large datasets, we propose an approximation using a collection of particles, as described next.

Hamiltonian particle ensemble.
In the limit and ideal case where the data distribution is known, the optimality equations can be approximated using a collection of particles which follow the Hamiltonian system (see Appendix A). We thus consider a collection of particles (q_j, p_j), j = 1, ..., K, in ℝᵈ × ℝᵈ that drive the evolution of the entire population (x_i), i = 1, ..., n, in ℝᵈ through the following forward model

    ẋ_i(t) − f(x_i(t), θ(t)) = 0,   x_i(0) = x̃_i,                      (data evolution)
    q̇_j(t) − f(q_j(t), θ(t)) = 0,
    ṗ_j(t) + ∂_x f(q_j(t), θ(t))^T p_j(t) = 0,
    ∂_θ R(θ(t)) − Σ_{j=1}^K ∂_θ f(q_j(t), θ(t))^T p_j(t) = 0,          (Hamiltonian equations)    (3.4)

with initial conditions {(q_j(0), p_j(0))}, j = 1, ..., K, where the gradient with respect to this new parametrization is computed via backpropagation. This set of (position, momentum) pairs is termed the Hamiltonian particle ensemble. As the number of particles is reduced, so is the number of free parameters; see Appendix B. Indeed, varying the Hamiltonian particle ensemble allows for controlling the tradeoff between reconstruction and network complexity.
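To illustrate the shooting principle itself — the endpoint of a trajectory is a function of the initial conditions alone, which then become the only optimization variables — here is a deliberately tiny example on a harmonic oscillator with a bisection search over the initial momentum (both the system and the search method are illustrative choices of ours, not part of the method above):

```python
def shoot(p0, q0=0.0, T=1.0, steps=1000):
    """Forward-Euler integration of q' = p, p' = -q; returns the endpoint q(T),
    which depends on the initial momentum p0 only."""
    h = T / steps
    q, p = q0, p0
    for _ in range(steps):
        q, p = q + h * p, p - h * q
    return q

def solve_p0(target, lo=-10.0, hi=10.0, iters=60):
    """Bisection on p0 so that the trajectory 'shoots' to q(T) = target
    (q(T) is monotone increasing in p0 for this linear system)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if shoot(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

p0 = solve_p0(0.5)   # initial momentum whose trajectory ends at q(1) = 0.5
```

In the paper's setting the role of `p0` is played by the particle initial conditions, and the bisection is replaced by gradient-based optimization through the forward model.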
The main computational bottleneck in the forward model of Eq. (3.4) is the implicit parametrization of θ by the last equation. Making it explicit is key to rendering shooting computationally tractable.

Linear in parameter - quadratic penalty.
In the simplest case, the space of functions f is a linear space parametrized by θ(t), and a quadratic penalty amounts to a kinetic penalty. Specifically, as a motivating example, consider the forward model (with component-wise activation function σ)

    f(x(t), θ(t)) = A(t) σ(x(t)) + b(t)    (3.5)

with A ∈ L²([0, T], ℝ^{d×d}), b ∈ L²([0, T], ℝᵈ) and θ(t) = [A(t), b(t)]. With the quadratic regularizer R(θ(t)) = ½ Tr(A(t)^T M_A A(t)) + ½ b(t)^T M_b b(t), where M_A, M_b are positive-definite matrices, the particle shooting equations are

    q̇_j(t) = A(t) σ(q_j(t)) + b(t),   ṗ_j(t) = −dσ(q_j(t))^T A(t)^T p_j(t),
    A(t) = M_A^{−1} ( Σ_{j=1}^K p_j(t) σ(q_j(t))^T ),   b(t) = M_b^{−1} ( Σ_{j=1}^K p_j(t) ),    (3.6)

with given initial conditions (p_j(0), q_j(0)). We emphasize that θ(t) is explicitly defined by {(p_j(t), q_j(t))}, j = 1, ..., K, and the computational cost is reduced to matrix multiplications.

As is well known [4], the Hamiltonian flow preserves the Hamiltonian function. In the "linear in parameter - quadratic regularization" case, this preserved quantity, denoted H(p(t), q(t)), equals R(θ(t)), which is the kinetic energy of the system of particles. As a first consequence, the objective functional can be rewritten as H(p(0), q(0)) + λ Σ_{i=1}^n ℓ(x_i(T), ỹ_i). This clearly allows for direct optimization over (p(0), q(0)), i.e., shooting. As a second consequence, since the vector field has constant norm (its squared norm is the kinetic energy), it gives a quantitative bound on the regularity of the flow map at time t = T, explicit in terms of H(p(0), q(0)). In addition (Appendix A), the Rademacher complexity of the generated flows with bounded H(p(0), q(0)) can also be controlled.

Nonlinear in parameter and non-quadratic penalty.
A standard ResNet structure uses vector fields of the type (in convolutional form or not)

    f(x(t), θ(t)) = θ_1(t) σ(θ_2(t) x(t) + b_2(t)) + b_1(t),    (3.7)

where θ_1(t) ∈ L(ℝⁿ, ℝᵈ) and θ_2(t) ∈ L(ℝᵈ, ℝⁿ). This forward model can also be handled in our shooting approach, since the shooting equations in Eq. (3.3) are completely specified by the Hamiltonian H(p, x, θ) = p^T f(x, θ) − R(θ). Automatic differentiation can be used (see Appendix C) to implement the forward model

    ẋ(t) = ∂H/∂p (p(t), x(t), θ(t)),   ṗ(t) = −∂H/∂x (p(t), x(t), θ(t)),   θ(t) ∈ argmax_θ H(p(t), x(t), θ).    (3.8)

Important bottlenecks appear, since the third equation is nonlinear and potentially associated with a non-convex optimization problem. This could be addressed by unrolling the optimization corresponding to the last equation, resulting in increased computational cost. In addition, in this nonlinear case, the Hamiltonian function is no longer (in general) equal to R(θ(t)), even in the quadratic regularization setting. Therefore, results on the smoothness or Rademacher complexity would no longer be guaranteed as in the linear - quadratic penalty case. Last, quadratic regularization has no known theoretical results for the Rademacher complexity of functions generated by Eq. (3.7) with bounded norm. Norms for which the Rademacher complexity of this class of functions is known [16] to be bounded are called Barron norms; these are non-smooth and non-convex, which would add to the difficulty. To circumvent these issues while retaining expressiveness and theoretical guarantees in the linear parametrization setting, we next introduce the UpDown model.
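For the linear-in-parameter, quadratic-penalty case, by contrast, the particle system of Eq. (3.6) is explicit and cheap to integrate. A sketch for d = 1, σ = tanh and M_A = M_b = 1 (these simplifications are ours):

```python
import math

def particle_shooting(q0, p0, x0, T=1.0, steps=200):
    """Forward-Euler integration of Eq. (3.6): the weights A(t), b(t) are
    recomputed from the K particles (q_j, p_j) at every step and drive both
    the particles and the data points x_i."""
    h = T / steps
    q, p, x = list(q0), list(p0), list(x0)
    for _ in range(steps):
        A = sum(pj * math.tanh(qj) for pj, qj in zip(p, q))  # A = M_A^{-1} sum_j p_j sigma(q_j)^T
        b = sum(p)                                           # b = M_b^{-1} sum_j p_j
        q_new = [qj + h * (A * math.tanh(qj) + b) for qj in q]
        p = [pj - h * (1.0 - math.tanh(qj) ** 2) * A * pj for pj, qj in zip(p, q)]
        x = [xi + h * (A * math.tanh(xi) + b) for xi in x]   # data driven by the same theta(t)
        q = q_new
    return q, p, x

q, p, x = particle_shooting(q0=[0.5, -0.5], p0=[0.1, -0.1], x0=[1.0])
```

Only the K initial pairs (q_j(0), p_j(0)) are free parameters; gradients with respect to them would be obtained by backpropagating through this loop.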
UpDown model
The key idea is to transform the vector field of Eq. (3.7) into a model which is linear in its parameters and to which quadratic regularization can be applied. To this end, we introduce the additional state v(t) = θ_2(t) x(t) + b_2(t), which we differentiate with respect to time. We obtain v̇(t) = θ̇_2(t) x(t) + ḃ_2(t) + θ_2(t) ẋ(t) and, replacing ẋ(t) by its formula, we get v̇(t) = θ̇_2(t) x(t) + ḃ_2(t) + θ_2(t)(θ_1(t) σ(v(t)) + b_1(t)). Thus, using the additional state variable v(t), we propose the following ODE system, denoted the UpDown model, introducing the (integer-valued) inflation factor α ≥ 1:

    ẋ(t) = θ_1(t) σ(v(t)) + b_1(t),
    v̇(t) = θ_3(t) x(t) + b_3(t) + θ_2(t) σ(v(t)),    (3.9)

with x(t) ∈ ℝᵈ, v(t) ∈ ℝ^{αd}. For the data evolution, the initial conditions for x(t) are given by the data samples {x̃_i}. We parametrize the initial conditions for v(t) using an affine map via v(0) = Θ(x(0)) + b₀, where Θ ∈ L(ℝᵈ, ℝ^{αd}), b₀ ∈ ℝ^{αd}. In Appendix D, we prove the following theorem:

Theorem 1.
Given the flow of a time-dependent vector field defined on a compact domain C of ℝᵈ, which is time-continuous and Lipschitz, we denote by φ(T, x(0)) its flow at time T from starting value x(0). Then, there exists a parametrization of the UpDown model for which its solution is ε-close to the flow: sup_{x(0) ∈ C} ‖φ(T, x(0)) − x(T)‖ ≤ ε.

Notably, in the proof, the dimension of the hidden state v is used twice: first, for having a sufficient number of neurons in Eq. (3.7) to approximate a stationary vector field (the standard universality property of the multilayer perceptron) and, second, for approximating time-dependent vector fields. Therefore, at the cost of introducing a possibly large number of dimensions, the UpDown model is universal in the class of time-dependent NODEs. As shown in Appendix D, this universality result transfers to our shooting formulation. Due to its additional dimensions, it is also likely to be universal in the space of functions (i.e., not necessarily injective ones). We focus on the UpDown model in our experiments. Note also that while we derived our theory for vector-valued evolutions for simplicity, similar linear-in-parameter evolution equations can, for example, be derived for convolutional neural networks.
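A forward pass of the UpDown system in Eq. (3.9) can be sketched directly. For illustration we take d = 1, inflation factor α = 2, and hold the parameters constant in time (in the shooting formulation they would evolve via the particles); the parameter names θ_1, θ_2, θ_3 follow our reconstruction of the equations and should be read as assumptions:

```python
import math

def updown_forward(x0, theta1, b1, theta3, b3, theta2, Theta, B, T=1.0, steps=100):
    """Forward-Euler integration of the UpDown model for d = 1:
        x' = theta1 . sigma(v) + b1               (down: alpha*d -> d)
        v' = theta3 * x + b3 + theta2 @ sigma(v)  (up: d -> alpha*d)
    with affine initialization v(0) = Theta * x(0) + B; alpha = len(Theta)."""
    alpha = len(Theta)
    h = T / steps
    x = x0
    v = [Theta[k] * x0 + B[k] for k in range(alpha)]
    for _ in range(steps):
        sv = [math.tanh(vk) for vk in v]
        dx = sum(t1 * s for t1, s in zip(theta1, sv)) + b1
        dv = [theta3[k] * x + b3[k] + sum(theta2[k][l] * sv[l] for l in range(alpha))
              for k in range(alpha)]
        x += h * dx
        v = [vk + h * dvk for vk, dvk in zip(v, dv)]
    return x
```

Note how the hidden state v "inflates" the dynamics to αd dimensions while the parameters enter only linearly, which is what makes the quadratic penalty and explicit shooting equations applicable.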
4 Experiments

Our goal is to demonstrate that it is possible to learn DNNs by optimizing only over the initial conditions of critical networks. This is made possible via shooting and made efficient via our particle parametrization. A key difference to prior work is that our approach allows us to capture time-dependent (i.e., layer-dependent in the discrete setting) parameters without discretizing these parameters at every time-point. Comparisons to other neural-ODE-like methods are not straightforward, as hyper-parameters and different implementations may make results difficult to compare. For consistency, we provide four different formulations (based on the UpDown model of §3.2).

• The static direct model foregoes the Hamiltonian particle ensemble and instead directly optimizes over time-constant parameters: θ(t) = θ. Everything else, including the UpDown model, stays unchanged. This model is most closely related to NODE [14] and augmented NODE [15].

• The static with particles model is similar to the static direct model. However, instead of directly optimizing over a time-constant θ, it uses a set of (position, momentum) pairs (i.e., particles, as in our dynamic with particles model below) to parametrize θ indirectly.

• We call our proposed shooting model dynamic with particles. It is parametrized via a set of initial conditions of (position, momentum) pairs, which evolve over time and fully specify θ(t).

• Finally, we consider the dynamic direct model, which uses a piece-wise time-constant θ(t). It essentially chains together multiple static direct models and is closely related to a discrete ResNet in the sense that multiple blocks (we use five) are used in succession. However, each block involves time-integrating the UpDown model. While the dynamic with particles model captures θ(t) indirectly via particles and shooting, the dynamic direct model requires many more parameters as it represents θ(t) directly. We show results for the dynamic direct model for a subset of the experiments.

All experiments use the UpDown model with quadratic penalty function R. Detailed experimental settings, including weights for the quadratic penalty function, can be found in Appendix E.

Simple 1D function regression.
We approximate a simple quadratic-like function y = x² + 3/(1 + x²), which is non-invertible. We use 15 particles for our experiments. Fig. 2 shows the test loss and the network complexity, as measured by the log Frobenius norm integrated over time [29], for the different models as a function of the inflation factor α (cf. §3.2). On average, the dynamic with particles model shows the best fits with the lowest complexity measures, indicating the simplest network parametrization. Note that the static with particles approach results in the lowest complexity measures only because it cannot properly fit the function, as indicated by the high test loss. Additional results for a cubic function y = x³ are in Appendix F.
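The network complexity measure reported here — the time-integral of the log Frobenius norm of the weight matrix [29] — can be approximated from weights sampled along the trajectory; the trapezoidal rule and the precise form below are our assumptions:

```python
import math

def log_frobenius_complexity(A_samples, T=1.0):
    """Trapezoidal approximation of  integral_0^T log ||A(t)||_F dt  from
    weight matrices A(t) sampled at uniformly spaced time points."""
    def frob(A):
        return math.sqrt(sum(a * a for row in A for a in row))
    vals = [math.log(frob(A)) for A in A_samples]
    h = T / (len(vals) - 1)
    return h * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])
```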
Figure 2:
Fit for the quadratic-like function y = x² + 3/(1 + x²) over 10 random initializations. Left: test loss; Right: time-integral of the log of the Frobenius norm complexity. Lower is better for both measures. * indicates the number of removed outliers (outside the interquartile range (IQR) by ≥ 1.5 × IQR); α denotes the inflation factor.
[Figure 3 panels: short- and long-range results (loss and log complexity per model and α), the fit and long-range prediction of the dynamic with particles model for ẋ = Ax³, and loss vs. number of parameters (1,116; 2,172; 4,284; 8,508).]
Figure 3:
Fit for the spiral (short- and long-range): losses for the different models as well as the time-integral of the log of the Frobenius norm complexity measure. Lower is better for both measures. The * symbol indicates how many outliers were removed and α denotes the inflation factor.

Spiral.

Next, we revisit the spiral ODE example of [14], following the nonlinear dynamics ẋ = Ax³, x ∈ ℝ² (where the power is component-wise). We fix x(0) = [2, 0]^T, use A = [[−0.1, 2.0], [−2.0, −0.1]] and evolve the dynamics for time T = 10. The training data consists of snippets from this trajectory, all of the same length. We use an L² norm loss (calculated on all intermediate time-points) and 15 particles. Our goal is to show that we can obtain the best fit to the training data due to our dynamic model. Fig. 3 (top) shows that we can indeed obtain similar or better fits (lower losses) for a similar number of parameters while achieving the lowest network complexity measures. Fig. 3 (bottom) shows the corresponding results for the validation data consisting of the original long trajectory starting from initial value x(0). Interestingly, by pasting together short-range solutions we are successful in predicting the long-range trajectory despite training on short-range trajectory snippets.

Concentric circles.
To study the impact of the inflation factor α in a classification regime, we replicate the concentric circles setting of [15]. The task is learning to separate points sampled from two disjoint annuli in ℝ². While we are less interested in the learned flow (as in [15]), we study how often the proposed UpDown (dynamic with particles) model perfectly fits the training data as a function of α.
To the right, we show the success rate over 50 training runs for three choices of α and 20 particles (4% for α = 5, 8% for α = 10, and 26.7% for α = 20). Notably, the effect of α is only visible if the classification loss is down-weighted so that the regularization, R, dominates. Otherwise, for the tested α, the model always fits the data. The experiment is consistent with [15], where it is shown that increasing the space on which

    Method                      MSE ± σ
    GPPVAE-DIS†                 …
    GPPVAE-JOINT†               …
    ODE²VAE†                    …
    ODE²VAE-KL†                 …
    Ours (stat. direct)         0.0126 ± …
    Ours (stat. w particles)    0.0125 ± …
    Ours (dyn. w particles)     0.0122 ± …
Figure 4:
Left: Image (per-pixel) MSE (measured at the marked time point) averaged over all testing sequences of the rotated MNIST dataset. Right: Two testing sequences and predictions (marked blue) for all 16 time points when the image at t = 0 is given as input (marked red). Results marked with † are taken from [40].

an ODE is solved allows for easy separation of the data and leads to less complex flows. The latter is also observed for our model; visualizations can be found in Appendix F.
Rotating MNIST.
Here, we are given sequences of a rotating MNIST digit (along 16 angles, linearly spaced in [0, 2π]). The task is learning to synthesize the digit at any rotation angle, given only the first image of a sequence. We replicate the setup of [40] and consider rotated versions of the digit "3". We identify each rotation angle with a time point t_i and randomly drop four time points of each sequence during training. One fixed time point is consistently left out and later evaluated during testing. We use the same convolutional autoencoder as [40], with the UpDown model operating in the internal representation space after the encoder. During training, the encoder receives the first image of a sequence (always at angle 0°), the UpDown model integrates forward to the desired time points, and the decoder decodes these representations. As loss, we measure the mean-squared error (MSE) of the decoder outputs. Fig. 4 lists the MSE (at the left-out angle), averaged over all testing sequences, and shows two example sequences with predictions for all time points (100 particles, α = 10).

While all UpDown variants substantially lower the MSE previously reported in the literature, they exhibit comparable performance. To better understand the differences, we visualize the internal representation space of the autoencoder by projecting all 16 internal representations (i.e., the outputs of the UpDown models after receiving the output of the encoder) of each testing image onto the two largest principal components, shown to the right (different colors indicate the different rotation angles). This qualitative result shows that allowing for a time-dependent parametrization leads to a more structured latent space of the autoencoder.
Bouncing balls.
Finally, we replicate the "bouncing balls" experiment of [40]. This is similar to the rotating MNIST experiment, but the underlying dynamics are more complex. In particular, we are given 10,000 (training) image sequences of bouncing balls at 20 different time points [35]. The task is learning to predict future time points after seeing the first three images of a sequence. We use the same convolutional autoencoder of [40] and minimize image (per-pixel) MSE (using all 20 time points for training). Our UpDown model operates in the internal representation space of the encoder (50-dimensional in our experiments²). In test mode, the network receives the first three images of a sequence and predicts 10 time points ahead. We measure the image (per-pixel) MSE and average the results (per time point) over all 500 testing sequences. For model selection, we rely on the provided validation set. Our UpDown (dynamic with particles) model uses 100 particles. Fig. 5 (left) lists the averaged

² We did not further experiment with this hyperparameter, so potentially better results can be obtained.

[Figure 5, left legend: ODE²VAE† (…), DDPAE† (…), DTSBN-S† (…), Ours (…).]

Figure 5:
Left: Image (per-pixel) MSE for predicting 10 time points ahead (after receiving the first three inputs of a sequence), averaged over all testing sequences (numbers in parentheses indicate the MSE when additionally averaged over all prediction time points). Results marked with † are taken from [40]. Right: Two testing sequences with predictions (marked blue).
MSE per time point, plotted against the approaches listed in [40]. Fig. 5 (right) shows two testing sequences with predictions (the three input time points are not shown). The UpDown static and static with particles models achieve comparable MSE values.

We demonstrated that it is possible to parametrize DNNs via initial conditions of (position, momentum) pairs. While our experiments are admittedly still simple, the results are encouraging, as they show that 1) the particle-based approach can achieve competitive performance compared to direct parametrizations and 2) time-dependent parametrizations are useful for obtaining simpler networks and can be realized with significantly fewer parameters using particle-based shooting. Our work opens up many follow-up questions and formulations. E.g., we presented our approach for a model with continuous dynamics, but the particle and shooting formalisms can also be applied to discrete-time models. Further, for simplicity, we focused on continuous variants of multi-layer perceptrons, but similar linear-in-parameter models can be formulated for convolutional neural networks. Models that are nonlinear in their parameters hold the promise of connections to optimal mass transport theory and to theoretical complexity results, which we touched upon for our
UpDown model. Indeed, this change of paradigm in the parametrization may result in new quantitative results on the generalization properties of the constructed networks. Lastly, it will be fascinating to explore how well the approach generalizes to more complex problems, how many particles are needed to switch from a standard deep network to its shooting formulation, and how optimizing over critical points of the original optimization problem via shooting relates to network generalization.
Broader Impact
Our work advances knowledge of how to parametrize deep neural networks. Specifically, it shifts the parametrization of deep neural networks from a layer-by-layer perspective to an initial-value parametrization. At this point, the work is theoretical and conceptual in nature, and so is its likely current broader impact.

Acknowledgments and Disclosure of Funding
This research project was initiated during a one-month invitation of M. Niethammer by the Labex Bézout, supported by the French National Research Agency ANR-10-LABX-58.
References

[1] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst., 2(4):303–314, 1989.
[2] F.-X. Vialard, L. Risser, D. Rueckert, and C. J. Cotter. Diffeomorphic 3D image registration via geodesic shooting using an efficient adjoint calculation. IJCV, 97(2):229–241, 2012.
[3] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In ICML, 2017.
[4] V. I. Arnold. Mathematical Methods of Classical Mechanics. Springer, 1978.
[5] M. Athans and P. L. Falb. Optimal Control: An Introduction to the Theory and Its Applications. Courier Corporation, 2013.
[6] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
[7] S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In NeurIPS, 2019.
[8] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2003.
[9] J. Behrmann, W. Grathwohl, R. T. Chen, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. In ICML, 2019.
[10] J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
[11] J. F. Bonnans. The shooting approach to optimal control problems. In IFAC International Workshop on Adaptation and Learning in Control and Signal Processing, 2013.
[12] M. Bruveris and F.-X. Vialard. On completeness of groups of diffeomorphisms. J. Eur. Math. Soc. (JEMS), 19(5):1507–1544, 2017.
[13] A. Bryson and Y. Ho. Applied Optimal Control: Optimization, Estimation, and Control. Hemisphere, New York, 1975.
[14] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.
[15] E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In NeurIPS, 2019.
[16] W. E, C. Ma, and L. Wu. Barron spaces and the compositional function spaces for neural network models. arXiv, 2019. https://arxiv.org/abs/1906.08039.
[17] C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. M. Oberman. How to train your neural ODE: the world of Jacobian and kinetic regularization. In ICML, 2020.
[18] A. Gholami, K. Keutzer, and G. Biros. ANODE: Unconditionally accurate memory-efficient gradients for neural ODEs. In IJCAI, 2019.
[19] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, volume 105. SIAM, 2008.
[20] E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Probl., 34(1):014004, 2017.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[23] J.-H. Jacobsen, A. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. In ICLR, 2018.
[24] Q. Li, L. Chen, C. Tai, and E. Weinan. Maximum principle based algorithms for deep learning. JMLR, 18(1):5998–6026, 2017.
[25] G.-H. Liu and E. A. Theodorou. Deep learning theory review: An optimal control and dynamical systems perspective. arXiv, 2019. https://arxiv.org/abs/1908.10920.
[26] S. Massaroli, M. Poli, J. Park, A. Yamashita, and H. Asama. Dissecting neural ODEs. arXiv, 2020. https://arxiv.org/abs/2002.08071.
[27] A. Maurer. A vector-contraction inequality for Rademacher complexities. In R. Ortner, H. U. Simon, and S. Zilles, editors, Algorithmic Learning Theory, pages 3–17. Springer, 2016.
[28] M. Miller, A. Trouvé, and L. Younes. Geodesic shooting for computational anatomy. J. Math. Imaging Vis., 24:209–228, 2006.
[29] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In NeurIPS, 2017.
[30] M. Niethammer, Y. Huang, and F.-X. Vialard. Geodesic regression for image time-series. In MICCAI, 2011.
[31] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes, 3rd edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[32] V. Reshniak and C. Webster. Robust learning with implicit residual networks. arXiv, 2019. https://arxiv.org/abs/1905.10479.
[33] L. Ruthotto and E. Haber. Deep neural networks motivated by partial differential equations. J. Math. Imaging Vis., pages 1–13, 2019.
[34] S. Sæmundsson, A. Terenin, K. Hofmann, and M. P. Deisenroth. Variational integrator networks for physically meaningful embeddings. arXiv, 2019. https://arxiv.org/abs/1910.09349.
[35] I. Sutskever, G. Hinton, and G. Taylor. The recurrent temporal restricted Boltzmann machine. In NeurIPS, 2009.
[36] P. Toth, D. J. Rezende, A. Jaegle, S. Racanière, A. Botev, and I. Higgins. Hamiltonian generative networks. In ICLR, 2020.
[37] L. N. Trefethen. Finite Difference and Spectral Methods for Ordinary and Partial Differential Equations. 1996.
[38] J. L. Troutman. Variational Calculus and Optimal Control: Optimization with Elementary Convexity. Springer Science & Business Media, 2012.
[39] M. J. Wainwright. High-Dimensional Statistics: A Non-asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[40] C. Yildiz, M. Heinonen, and H. Lahdesmaki. ODE²VAE: Deep generative second order ODEs with Bayesian neural networks. In NeurIPS, 2019.
[41] L. Younes. Shapes and Diffeomorphisms. Springer, 2010.
[42] L. Younes. Diffeomorphic learning. arXiv, 2018. https://arxiv.org/abs/1806.01240.
[43] H. Zhang, X. Gao, J. Unterman, and T. Arodz. Approximation capabilities of neural ODEs and invertible residual networks. In ICML, 2020.
[44] T. Zhang, Z. Yao, A. Gholami, J. E. Gonzalez, K. Keutzer, M. W. Mahoney, and G. Biros. ANODEV2: A coupled neural ODE framework. In NeurIPS, 2019.

Supplementary material
The following sections discuss additional aspects of our approach. Specifically, Sec. A shows that the optimality condition underlying our shooting formulation can be approximated via particles. Sec. B discusses the number of free parameters of our shooting approach in relation to the number of free parameters for direct optimization. Sec. C explains how the shooting equations can be automatically derived via automatic differentiation. Sec. D shows the universality of our UpDown model. Sec. E provides details on our experimental setup. Lastly, Sec. F shows some additional experimental results.
A Expectation approximation of optimality equations
In this section, we show that the particle shooting approach can be viewed as approximating the optimality equations via a collection of particles.
A.1 Variational setup
Suppose the data consists of an input $X \in \mathbb{R}^d$. Let $f(\cdot, \theta(t))$ be a vector field on $\mathbb{R}^d$, e.g., a shallow hidden layer or a linear-in-parameter layer. Consider the flow $\varphi := \varphi(T, \cdot)$ generated by $f$ according to

$$\frac{d}{dt}\varphi(t, x) = f(\varphi(t, x), \theta(t)), \quad \varphi(0, x) = x. \quad (A.1)$$

We consider the general task of minimizing

$$\mathrm{Reg}(\varphi) + \lambda\, \mathbb{E}[\ell(\varphi(X))], \quad (A.2)$$

where $\lambda$ is a positive regularization parameter. Without loss of generality, set the terminal time to $T = 1$. Letting $\rho_0$ denote the probability density of $X$, minimizing Eq. (A.2) is equivalent to minimizing

$$\mathrm{Reg}(\varphi) + \lambda \inf_\varphi \int_{\mathbb{R}^d} \ell(\varphi(x)) \rho_0(x)\, dx.$$

This can be rewritten as

$$\mathrm{Reg}(\varphi) + \lambda \inf_\varphi \int_{\mathbb{R}^d} \ell(x') \rho_1(x')\, dx',$$

where $\rho_1(x) := \rho(1, x)$ and $\rho(t, x)$ solves the continuity equation

$$\partial_t \rho(t, x) + \mathrm{div}(\rho(t, x) f(x, \theta(t))) = 0, \quad \rho(0, x) = \rho_0(x),$$

where $\mathrm{div}$ is the divergence operator on vector fields. Note that $\rho_1$ can be regarded as the density representing the data at time 1.

A.2 Choice of regularization
We discuss two different possibilities for regularizing the map $\varphi$ via

$$\mathrm{Reg}(\varphi) = \int_0^1 R(\theta(t))\, dt. \quad (A.3)$$

The first possibility is a quadratic penalty, where $R(\theta(t)) = \frac{1}{2}\|\theta(t)\|^2$ is (half) the squared Frobenius norm of the parameter $\theta$. Since the space of vector fields is a finite-dimensional linear space, it can be endowed with a scalar product, which turns this space into a reproducing kernel Hilbert space (RKHS). Therefore, the linear-in-parameter, quadratic-regularization setting is a particular case of vector fields encoded by $f(\cdot, \theta(t)) \in H$, with $H$ an RKHS embedded in the $W^{1,\infty}$ vector fields. This setting leverages strong analytical and geometrical foundations [41, 12]: 1) When the activation function is smooth (very often, the RKHS is composed of smooth vector fields), the flow map $\varphi$ is guaranteed to be a one-to-one smooth map (i.e., a diffeomorphism). For instance, with the UpDown model, it is a homeomorphism in $(q, v)$. Moreover, the quadratic penalty induces a right-invariant distance on the set of flows generated by Eq. (A.1), and the distance to the identity of the resulting flow can be bounded by $\mathrm{Reg}(\varphi)$ (see [41, 12] for more details in a Sobolev setting). 2) When the activation function is of ReLU type, the resulting map is still a $W^{1,\infty}$ one-to-one map (i.e., a homeomorphism) and has Lipschitz regularity.

The other type of regularization we will discuss is Barron norm regularization for the shallow hidden layer of Eq. (3.7). First, recall Barron's norm [16],

$$\|\theta\|_B := \frac{1}{n} \sum_{j=1}^n \|\theta_{2,j}\| \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right),$$

where $\theta_{2,j}$ denotes the $j$-th column of $\theta_2$ and $[\theta_1]_j$ the $j$-th row of $\theta_1$. As discussed in the main text, the reason we consider a Barron norm penalty for the shallow hidden layer, rather than a quadratic penalty, is its theoretical properties. Indeed, the Rademacher complexity is bounded for the combination of a shallow hidden layer with a Barron norm penalty, but not when combined with a quadratic penalty.
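As a concrete illustration of how this norm is computed, the following sketch evaluates the Barron-type norm of a shallow hidden layer directly from its weight matrices, using the column/row convention above (the layer sizes and weights are hypothetical, chosen only for illustration):

```python
import numpy as np

def barron_norm(theta1, b1, theta2):
    """Barron-type norm of the shallow hidden layer x -> theta2 @ sigma(theta1 @ x + b1).

    For each hidden unit j, take the norm of the j-th column of theta2 times
    (norm of the j-th row of theta1 + |b1_j|), averaged over the n hidden units.
    """
    n = theta1.shape[0]                         # number of hidden units
    out_cols = np.linalg.norm(theta2, axis=0)   # ||theta2_{:,j}|| for each j
    in_rows = np.linalg.norm(theta1, axis=1)    # ||[theta1]_j|| for each j
    return float(np.sum(out_cols * (in_rows + np.abs(b1))) / n)

# hypothetical small layer: d = 2 inputs, n = 3 hidden units
theta1 = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])  # n x d
b1 = np.array([0.0, 1.0, 0.0])
theta2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])    # d x n
norm_B = barron_norm(theta1, b1, theta2)
```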
A.3 Rademacher complexity of bounded energy flows.
In this section, we consider the flows generated by a neural ODE approach with a kinetic energy penalty, for which we address the question of bounding the Rademacher complexity of the generated flows. The flow of a vector field $f(\cdot, \theta(t))$ is a vector-valued map denoted by $\varphi$. Let us first treat the Rademacher complexity of a single component $\varphi_k$ of the flow map.

Theorem 2.
Let $\mathcal{F}$ be a space of vector fields defined on a compact space $C \subset \mathbb{R}^d$. Assume that the Rademacher complexity on $n$ points in $C$ of each component $f_k(t, \cdot)$, $k = 1, \ldots, d$, of the vector fields is controlled by $M(n, t)$. Then the Rademacher complexity of each component of the flows at time 1 is bounded by $\int_0^1 M(n, t)\, dt$.

Proof. Recall that the Rademacher complexity (see [39]) of a class of functions $\mathcal{F}$ is defined, for $Z = (z_1, \ldots, z_n) \in C^n$, as

$$\mathrm{Rad}_{\mathcal{F}, Z} := \mathbb{E}\Big[ \sup_{g \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i g(z_i) \Big],$$

where the $(\varepsilon_i)_{i=1,\ldots,n}$ are i.i.d. Rademacher random variables. Our hypothesis ensures $\mathrm{Rad}_{\mathcal{F}, Z} \le M(n, t)$ at every time $t$. Apply the definition of the flow to get

$$\varphi(1, x) = x + \int_0^1 f(\varphi(t, x), \theta(t))\, dt.$$

Therefore,

$$\mathbb{E}\Big[ \sup_{\varphi \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \varphi_k(z_i) \Big] \le \mathrm{Rad}_{\{\mathrm{Id}\}, Z} + \int_0^1 \mathbb{E}\Big[ \sup_{f(\cdot, \theta(t))} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_k(\varphi(t, z_i), \theta(t)) \Big] dt \le \int_0^1 M(n, t)\, dt.$$

(Footnote: i.e., smoothness asks for Lipschitz regularity of the first derivative, which ensures existence and uniqueness of the flow.)
In the previous formula, we used the fact that the Rademacher complexity of a class containing just a single map is zero.
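The quantity being bounded can also be estimated numerically. The sketch below gives a Monte Carlo estimate of the empirical Rademacher complexity for a small, finite class of scalar functions; the class and evaluation points are arbitrary toy choices, not from the paper:

```python
import numpy as np

def empirical_rademacher(G, Z, n_draws=2000, seed=0):
    """Monte Carlo estimate of E[sup_{g in G} (1/n) sum_i eps_i g(z_i)]
    for a finite class G of scalar functions, evaluated on the points Z."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    vals = np.array([[g(z) for z in Z] for g in G])  # |G| x n table of values
    total = 0.0
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        total += np.max(vals @ eps) / n              # sup over the class
    return total / n_draws

Z = np.linspace(-1.0, 1.0, 8)
G = [lambda z: 0.0, lambda z: z, lambda z: -z]       # toy function class
rad = empirical_rademacher(G, Z)
```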
Corollary 3.
Let $H$ be an RKHS of vector fields whose kernel $k$ satisfies $\|k(x, x)\|_\infty \le 1$. Then the set $\mathcal{F}$ of flows at time 1 of time-dependent vector fields in a ball of radius $R$ satisfies $\mathrm{Rad}_{\mathcal{F}, n} \le \frac{R}{\sqrt{n}}$, where $\mathrm{Rad}_{\mathcal{F}, n}$ is the Rademacher complexity for $n$ points.

Proof. The Rademacher complexity of the ball of radius $R$ in the RKHS $H$ is upper bounded by $\mathrm{Rad}_{H, n} \le \frac{R}{\sqrt{n}}$ [8, Lemma 22]. We then directly apply Theorem 2.

A similar result also holds for vector fields generated by the shallow hidden layer of Eq. (3.7), see [16]. Last, we note that the result and its proof also hold if one uses the following Rademacher complexity for vector-valued functions [27],

$$\mathrm{Rad}_{\mathcal{F}, Z} := \mathbb{E}\Big[ \sup_{g \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d \varepsilon_{ij} g_j(z_i) \Big],$$

for $g = (g_j)_{j=1,\ldots,d} \in \mathcal{F}$.

A.4 Optimality equations and Hamiltonian ensemble approximation
We detail the optimality equations in the case where the data are represented via a probability measure. As mentioned above, the regularity of the map is enforced via a penalty on the weights at each time point, i.e., the integral $\int_0^1 R(\theta(t))\, dt$ or, even more generally, $\int_0^1 R(\theta(t), \rho(t))\, dt$. Using Lagrange multipliers, the flow constraint can be enforced, and minimizers of the energy should be saddle points of

$$\mathcal{L}(\rho, \theta, p) := \lambda \int_{\mathbb{R}^d} \ell(x) \rho_1(x)\, dx + \int_0^1 R(\theta(t))\, dt + \int_0^1 \int_{\mathbb{R}^d} p(t, x) \big( \partial_t \rho(t, x) + \mathrm{div}(\rho(t, x) f(x, \theta(t))) \big)\, dx\, dt,$$

where $p(t, x)$ is a time- and space-dependent function. The optimality equations are then

$$\begin{cases} \partial_t \rho(t, x) + \mathrm{div}(\rho(t, x) f(t, x, \theta(t))) = 0, \\ \partial_t p(t, x) + \nabla p(t, x) \cdot f(t, x, \theta(t)) = \frac{\delta R}{\delta \rho}(\theta(t), \rho(t)), \\ \frac{\delta R}{\delta \theta}(\theta(t), \rho(t)) - \int_{\mathbb{R}^d} \frac{\delta f}{\delta \theta}(x)^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx = 0, \end{cases} \quad (A.4)$$

where $\nabla p$ is the gradient of $p(t, x)$ w.r.t. $x$ and $\delta$ denotes differentiation w.r.t. the indicated argument. In practice, one does not have access to the full distribution, and the variational setup needs to be approximated. As proposed in the main text, we approximate it using a collection of particles that follow the optimality equations, which are Hamiltonian evolution equations for this collection of particles. The particles $(q_i, p_i)$ are defined by their state and costate; the quantities $p(t, x)$ and $\rho(t, x)$ are simply approximated by $p_i(t)$, $q_i(t)$ satisfying the Hamiltonian equations associated with the empirical measure $\rho(t, \cdot) = \frac{1}{N} \sum_{i=1}^N \delta_{q_i(t)}(\cdot)$, the corresponding function $p(t, \cdot)$ being only defined at the points $q_i$.
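To make this concrete, here is a minimal numerical sketch of the particle approximation for the simple linear-in-parameter field $f(x, \theta) = \theta\sigma(x)$ with a quadratic penalty, where the third optimality equation gives $\theta(t)$ explicitly from the particles (the forward-Euler discretization, step count, and tanh nonlinearity are illustrative choices, not the paper's implementation):

```python
import numpy as np

def sigma(x):           # elementwise nonlinearity (tanh, as an illustration)
    return np.tanh(x)

def dsigma(x):
    return 1.0 - np.tanh(x) ** 2

def theta_from_particles(Q, P):
    """Parameters for f(x, theta) = theta @ sigma(x) under a quadratic penalty:
    empirical version of the explicit optimality condition,
    theta(t) = (1/N) sum_i p_i sigma(q_i)^T."""
    N = Q.shape[0]
    return sum(np.outer(p, sigma(q)) for q, p in zip(Q, P)) / N

def shoot(Q0, P0, n_steps=100, T=1.0):
    """Integrate the particle Hamiltonian system with forward Euler:
    qdot_i = theta sigma(q_i),  pdot_i = -diag(sigma'(q_i)) theta^T p_i."""
    Q, P = Q0.copy(), P0.copy()
    h = T / n_steps
    for _ in range(n_steps):
        theta = theta_from_particles(Q, P)
        dQ = sigma(Q) @ theta.T          # row i: theta @ sigma(q_i)
        dP = -dsigma(Q) * (P @ theta)    # row i: -diag(sigma'(q_i)) theta^T p_i
        Q, P = Q + h * dQ, P + h * dP
    return Q, P

rng = np.random.default_rng(0)
Q0 = rng.standard_normal((5, 3))  # 5 particles in R^3 (states)
P0 = rng.standard_normal((5, 3))  # their costates
Q1, P1 = shoot(Q0, P0)
```

Note that the weight trajectory $\theta(t)$ is never stored: it is recomputed at every step from the current particle ensemble, which is exactly the parameter reduction exploited in the paper.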
As the number of particles tends to infinity, the optimal trajectory in parameter space can be well approximated, and one can expect to recover the optimal trajectory. However, we do not explore the following question here: how well can a set of Hamiltonian particles approximate Eq. (A.4)? We simply remark that this question is directly connected to the expressiveness and generalization properties of the constructed neural network and is also probably data dependent.

A.5 Linear in parameters - quadratic energy
Now let us examine in detail models that are linear in parameters and have quadratic energy on the parameters: this case is the simplest to study and computationally not as demanding as the nonlinear case. As mentioned above, the set of possible vector fields $f(\cdot, \theta(t))$ is a finite-dimensional linear space, which is a reproducing kernel Hilbert space when endowed with an $L^2$ norm. Since all Hilbert norms in finite dimensions are equivalent, this choice of regularization is universal in this class of quadratic penalties.

1. The vector field is $f(\cdot, \theta(t)) = \theta \cdot \sigma$, where $\sigma$ is a vector of maps. In this case, the coupling term in the optimality equation reads

$$\frac{\delta f}{\delta \theta}(x)^* \big( \nabla p(t, x)\, \rho(t, x) \big) = \int_{\mathbb{R}^d} \sigma(x)^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx.$$

2. If the penalty $R$ only depends on $\theta$ and is quadratic, $R(\theta(t)) = \frac{1}{2}\|\theta(t)\|^2$, then one has $\frac{\delta R}{\delta \theta}(\theta(t), \rho(t)) = \theta(t)$.

Thus, under these two conditions, the parameters are explicit in terms of $p$, $\rho$, and $\sigma$:

$$\theta(t) = \int_{\mathbb{R}^d} \sigma(x)^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx. \quad (A.5)$$

Remark 4.
If, instead of a quadratic regularization on the parameters, we were to choose an RKHS norm (in the infinite-dimensional case) as the penalty, this would introduce the kernel applied to the r.h.s. of Eq. (A.5).
Having a formula for the parameter $\theta(t)$, one could be tempted to derive an evolution equation for $\theta$. This equation is known as the EPDiff equation [41] and is unfortunately not a closed equation on the set of parameters $\theta(t)$ themselves. Our approach is therefore one possible way to approximate it.

A.6 Nonlinear in parameters - energy which depends on the distribution
For exposition purposes, we present two cases of interest which are not contained in the previous section.
Example of Barron’s norm.
Obviously, a shallow hidden layer $\mathrm{shl}_\theta$ is not linear in its parameters. We have already discussed that it is then appropriate to endow the space with norms such as Barron's norm [16]. For instance, one has

$$\mathrm{shl}_\theta(x) = \theta_2 \sigma(\theta_1 x + b_1), \quad (A.6)$$

where $\theta_2 \in L(\mathbb{R}^n, \mathbb{R}^d)$ and $\theta_1 \in L(\mathbb{R}^d, \mathbb{R}^n)$. The Barron norm of the shallow hidden layer is then

$$\|\mathrm{shl}_\theta\|_B := \frac{1}{n} \sum_{j=1}^n \|\theta_{2,j}\| \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right),$$

where $\theta_{2,j}$ denotes the $j$-th column of $\theta_2$ and $[\theta_1]_j$ denotes the $j$-th row of $\theta_1$. Let us consider the case $R(\theta(t)) = \|\mathrm{shl}_{\theta(t)}\|_B$. In this case, one has the following optimality equations to solve:

$$\lambda \theta_{2,j} \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right) = \int_{\mathbb{R}^d} \sigma([\theta_1]_j x + [b_1]_j)^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx,$$
$$\lambda \|\theta_{2,j}\| \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right) \partial \|[\theta_1]_{jk}\| = \int_{\mathbb{R}^d} \big[ d\sigma([\theta_1]_j x + [b_1]_j)(x_k) \big]^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx,$$
$$\lambda \|\theta_{2,j}\| \left( \|[\theta_1]_j\| + \|[b_1]_j\| \right) \partial \|[b_1]_{jk}\| = \int_{\mathbb{R}^d} \big[ d\sigma([\theta_1]_j x + [b_1]_j)(x_k) \big]^\top \big( \nabla p(t, x)\, \rho(t, x) \big)\, dx.$$

This norm is of $L^1$ type, and the optimization of this type of function, which involves sparsity, is a well-explored field [6]. We leave experiments with this norm for future work.

$L^2$ regularization, optimal transport. Last, we briefly mention a model that is part of our framework and has the advantage of not specifying a norm on the space of vector fields. In case there is no obvious norm to use on the space of vector fields, it is possible to use an $L^2$-type penalty directly on the vector fields themselves. Indeed, one way to be rather independent of the choice of the parametrization of the map consists in introducing a cost that represents the $L^2$ norm of the map. However, the $L^2$ norm depends on the choice of a measure, and this measure can naturally be chosen as the density of the data, $\rho(t, x)$. More precisely, one can use

$$R(f, \rho(t)) = \frac{1}{2} \int_{\mathbb{R}^d} \|f(x, \theta)\|^2 \rho(t, x)\, dx. \quad (A.7)$$

In such a case, this formulation resembles finding an optimal transport (OT) map between $\rho_0$ and $\rho_1$. Specifically, optimal transport is an optimization problem which can be solved via a fluid-dynamic formulation [10] introducing the kinetic penalty above. However, the two models (OT and the one defined by the regularization Eq. (A.7)) differ in the sense that the optimization set for optimal transport is much larger, and the above formulation is a parametrized approximation of OT. This parametrized approximation needs to retain generalization properties of the optimized map. Note, however, that in the limit where the number of neurons goes to infinity, optimal transport will be well approximated, since the optimization is performed on a dense subset of all vector fields. Obviously, fixing the choice of a shallow hidden layer design implies a choice for $n$ in Eq. (3.7), which thus gives a regularization of the computed approximation of the optimal transport map.

Computational burden.
In both cases, the implicit equation corresponding to the third equation of Eq. (A.4) has to be solved at each layer of the discretization. We experimented with a simple strategy of unrolling the related minimization scheme. An efficient approach to solving such implicit equations will be necessary for practical implementations.
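As a sanity check of such an unrolling strategy, the following sketch solves the implicit equation for the linear-in-parameter, quadratic-penalty case, where the fixed point is known in closed form (the step size and iteration count are illustrative; in a genuinely nonlinear case, the right-hand side would itself depend on $\theta$ and the same loop structure would still apply):

```python
import numpy as np

def sigma(x):
    return np.tanh(x)

def solve_theta_unrolled(Q, P, lr=0.5, n_iter=50):
    """Solve the implicit optimality equation
        theta - (1/N) sum_i p_i sigma(q_i)^T = 0
    (linear-in-parameter layer, quadratic penalty) by unrolling plain
    gradient descent, as a stand-in for a generic implicit solver."""
    N, d = Q.shape
    theta = np.zeros((d, d))
    for _ in range(n_iter):
        # residual of the implicit equation; recomputed each iteration so the
        # same loop works when the right-hand side depends on theta
        target = sum(np.outer(p, sigma(q)) for q, p in zip(Q, P)) / N
        theta = theta - lr * (theta - target)
    return theta

rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 3))
P = rng.standard_normal((4, 3))
theta_hat = solve_theta_unrolled(Q, P)
theta_star = sum(np.outer(p, sigma(q)) for q, p in zip(Q, P)) / 4  # closed form
```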
B Analysis of the number of free parameters
It is instructive to compare the number of parameters of a shooting approach with the typical approach of optimizing a neural network (where the parameter dependency at optimality is only considered implicitly at convergence of the numerical solution, rather than explicitly during the shooting). We focus on the cases of affine and convolutional layers for illustration. Consider a DNN with a depth of $D$ layers, each of them parametrized by a shallow hidden layer of $P$ parameters. The number of free parameters is then $DP$, compared to $2KS$, where $K$ is the number of active particles, each of state-space size $S$ (each particle contributes a position and a momentum of size $S$). Hence, solutions with fewer than $DP/(2S)$ particles provide benefits in the number of free parameters.

(Footnote: For example, $S$ for our UpDown model simply corresponds to the dimension of its state space: $S = (\alpha + 1)d$, where $\alpha$ is the inflation factor and $d$ the data dimension. Note that in our experiments with the UpDown model we also learned an affine map from the initial conditions $x(0)$ to the initial conditions $v(0)$. Such a map has $\alpha d(d+1)$ parameters. These parameters are included in the table of Fig. 3 and in Tables 1/2 summarizing the number of model parameters. However, we will not consider these parameters in our discussion here, as they would equally apply to both a shooting and a direct optimization approach and could also be avoided by simply initializing $v(0)$ to zero. A similar initialization-to-zero approach is, for example, commonly taken in ResNets when increasing the number of feature channels [21].)
Most remarkably, the number of free parameters is always $2KS$, regardless of the number of parameters of a particular layer, as the layer parameters are obtained via the shooting equations from the particle states. This is a consequence of regularizing the parameters in our loss, which couples them across time at optimality. We make this clearer in what follows.
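The bookkeeping above can be spelled out in a few lines of arithmetic; the depth, dimension, and particle counts below are hypothetical values chosen only to illustrate the comparison (each particle is counted as a (position, momentum) pair):

```python
def direct_params(D, P):
    """Free parameters when directly optimizing D layers with P parameters each."""
    return D * P

def shooting_params(K, S):
    """Free parameters for K particles, each a (position, momentum) pair in R^S."""
    return 2 * K * S

# hypothetical affine model: depth D, data dimension d, K particles
D, d, K = 20, 10, 15
S = d                    # particle state dimension for the affine model
P = d * (d + 1)          # an affine layer has d^2 + d parameters

n_direct = direct_params(D, P)     # 20 * 110 = 2200
n_shoot = shooting_params(K, S)    # 2 * 15 * 10 = 300
# shooting uses fewer free parameters whenever K < D * P / (2 * S)
```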
Affine layers
Recall that in our simple example of Sec. 3.1 the parameters $\theta(t) = [A(t), b(t)]$ of our affine model are given as

$$A(t) = M_A^{-1} \Big( -\sum_{j=1}^K p_j(t) \sigma(q_j(t))^\top \Big), \quad b(t) = M_b^{-1} \Big( -\sum_{j=1}^K p_j(t) \Big).$$

Here, $A(t)$ has $d^2$ and $b(t)$ has $d$ parameters; these parameters are indirectly given by the set of particles $\{(q_i(t), p_i(t))\}$ at any given time. Hence, for this model $S = d$ and $P = d(d+1)$. If we assume we have $K$ particles and compare to a discrete layer implementation of this model, then the particle-based approach has fewer free parameters if $2Kd < Dd(d+1)$. Importantly, the state-space dimension $d$ only enters the number of free parameters linearly for the particle approach ($2Kd$), while there is a quadratic dependence for direct optimization ($Dd(d+1)$). This is a direct consequence of the optimality condition, which couples the parameters $\theta(t)$ across time. One can see this phenomenon in action in the expression for $A(t)$ above, where the matrix $A$ is expressed as a sum of matrices $p_j(t)\sigma(q_j(t))^\top$ of rank $\le 1$. Concretely, a particle-based shooting approach uses fewer parameters if the number of particles $K < D(d+1)/2$. Another interesting observation based on this example is that even if we had only considered a linear model (i.e., without the bias term $b(t)$), the number of parameters for the particles would have still remained at $2KS$. This is again a consequence of optimality and of our parameter regularization. Note that this also means that even though our UpDown model

$$\dot{x}(t) = \theta_1(t)\sigma(v(t)) + b_1(t), \quad \dot{v}(t) = \theta_2(t)x(t) + b_2(t) + \theta_3(t)\sigma(v(t)),$$

has significantly more parameters $\theta(t) = [\theta_1(t), b_1(t), \theta_2(t), b_2(t), \theta_3(t)]$ when directly optimized, this has no direct impact on the number of free parameters of its particle-based parameterization. Only the state-space dimension matters.
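A sketch of this affine-layer reconstruction from particles (with $M_A$ and $M_b$ taken as identity matrices for simplicity, tanh as a stand-in nonlinearity, and hypothetical shapes):

```python
import numpy as np

def sigma(x):
    return np.tanh(x)

def affine_params_from_particles(Q, P, M_A=None, M_b=None):
    """Recover the affine-layer parameters A(t), b(t) from the particle set
    {(q_j(t), p_j(t))}: A = M_A^{-1}(-sum_j p_j sigma(q_j)^T),
    b = M_b^{-1}(-sum_j p_j). M_A, M_b default to identity for illustration."""
    A_rhs = -sum(np.outer(p, sigma(q)) for q, p in zip(Q, P))
    b_rhs = -P.sum(axis=0)
    A = A_rhs if M_A is None else np.linalg.solve(M_A, A_rhs)
    b = b_rhs if M_b is None else np.linalg.solve(M_b, b_rhs)
    return A, b

rng = np.random.default_rng(2)
Q = rng.standard_normal((6, 4))   # 6 particles in R^4
P = rng.standard_normal((6, 4))
A, b = affine_params_from_particles(Q, P)
```

Since each summand is an outer product, $A$ built from $K$ particles has rank at most $K$, which is the low-rank structure noted in the text.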
Concretely, if we were to instead consider a model of the form

$$\dot{x}(t) = \theta_1(t)\sigma(v(t)), \quad \dot{v}(t) = \theta_2(t)x(t),$$

the particle-based parameterization would stay unchanged! Only the way in which one infers $\theta(t)$ from the particles changes.

Convolutional layers
Shooting approaches for convolutional models can also be derived. We did not experiment with such models in this work. However, we show here that the number of free parameters may also be decreased with a particle-based approach; this will be interesting to explore in future work. Specifically, for convolutional layers a particle-based parameterization could be particularly effective, as one typically has quadratic complexity in the number of filters between convolutional layers (i.e., if a layer with $N$ feature channels is followed by a layer with $M$ feature channels, this requires estimating $N \times M$ convolutional filters and hence drastically influences the number of parameters for large $N$ or $M$). In contrast, a particle-based shooting approach does not increase the number of parameters, as it ties them together via the optimality conditions expressed by the shooting equations. As a rough estimate, for a standard convolutional ResNet with $D = 50$, the product $DP$ is so large that a substantial number of active particles can be afforded while still reducing the number of free parameters.

General Remarks
Nevertheless, all model parameters (e.g., $[A(t), b(t)]$ or all convolutional filters of a convolutional layer) are still instantiated during computation. It is important to note that, regardless of the chosen number of particles, a shooting neural network is a possible optimal solution (for a given data set) at any given time, not only at convergence. One optimizes over the family of possible optimal neural networks with the goal of finding the element within this family that best matches the observations.

C Automatic shooting
The general shooting equations were presented in Eq. (3.3). We then proceeded to explicitly derive the shooting equations for a continuous DNN with linear-in-parameter layers and UpDown layers in §3.1 and §3.2, respectively. While this was instructive, it is somewhat cumbersome, in particular for more complex models or when moving to convolutional networks. Fortunately, in practice these shooting equations do not need to be derived by hand. Indeed, they are completely specified by the Hamiltonian $H(p, x, \theta) = p^\top(\dot{x} - f(t, x, \theta)) + R(\theta)$, in the sense that the shooting equations in Eq. (3.3) are computed via differentiation of $H$. Specifically, the shooting equations in Eq. (3.3) are equivalently given by

$$\dot{x} = \frac{\partial H(p, x, \theta)}{\partial p}, \quad \dot{p} = -\frac{\partial H(p, x, \theta)}{\partial x}, \quad \theta \in \arg\min_\theta H(p, x, \theta).$$

As discussed above, the last equation can be replaced by solving $\partial_\theta R(\theta) - \sum_{i=1}^N \partial_\theta f(t, x_i, \theta)^\top p_i = 0$. Automatic differentiation can be used to obtain the shooting equations automatically. As fitting a shooting model requires differentiating the shooting equations, we in effect end up differentiating twice. This can be done seamlessly using modern deep learning libraries, such as PyTorch.

D Universality of the UpDown model
In this section, we set out to demonstrate that the UpDown model is universal in the sense that its associated flow can come $\varepsilon$-close to the flow of any well-behaved time-dependent vector field. Define the shallow hidden layer vector field with time-varying parameters $\theta(t) = (\theta_1(t), \theta_2(t), b_1(t))$ as

$$\mathrm{shl}_\theta(x) = \theta_2 \sigma(\theta_1 x + b_1).$$

While shooting with the shallow hidden layer is theoretically appealing, as $\mathrm{shl}_\theta$ is universal [1], it would result in implicit shooting equations. We first show that the UpDown model introduced in Sec. 3.2 can give the same flow as the shallow hidden layer (Lemma 6) and then leverage this relationship to show that the UpDown model inherits the universality of the shallow hidden layer (Proposition 7). We recall the reformulation presented in Sec. 3.2.
Lemma 5.
Let $\mathrm{shl}_{\theta(t)}: \mathbb{R}^d \to \mathbb{R}^d$ be a time-dependent vector field with $\theta_1(t)$ and $b_1(t)$ piecewise $C^1$ and $\theta_2(t)$, $b_2(t)$ continuous. Then there exists a parametrization of the UpDown model that gives the same flow at a fixed time, $T = 1$.

Proof. We rewrite the differential equation

$$\dot{q} = \theta_2(t)\sigma(\theta_1(t)q + b_1(t)) + b_2(t)$$

by introducing the additional state variable $v = \theta_1(t)q + b_1(t)$, which we differentiate w.r.t. time. We obtain $\dot{v} = \dot{\theta}_1(t)q + \dot{b}_1(t) + \theta_1(t)\dot{q}$. Replacing $\dot{q}$ by its formula, we get $\dot{v} = \dot{\theta}_1(t)q + \dot{b}_1(t) + \theta_1(t)\big(\theta_2(t)\sigma(v(t)) + b_2(t)\big)$. The system can thus be rewritten in the UpDown form

$$\dot{q} = \tilde{\theta}_1(t)\sigma(v(t)) + \tilde{b}_1(t), \quad \dot{v} = \tilde{\theta}_2(t)q(t) + \tilde{b}_2(t) + \tilde{\theta}_3(t)\sigma(v(t)). \quad (D.1)$$

Therefore, with the initial conditions $v(0) = \theta_1(0)q(0) + b_1(0)$ and $q(0) = q_0$, the two systems of ordinary differential equations are equivalent.

Note that the key point in this lemma is the loss of regularity in the evolution of $\theta$, since we differentiated once in time. For that reason, we now show that adding more dimensions via the inflation factor $\alpha$ alleviates this issue. It is likely possible to prove a universality result using only $\alpha = 1$; we leave this question for future work. Experimentally, however, the inflation factor has a crucial effect on the performance of the optimization, as discussed in Sec. 4.

Lemma 6.
Let $\mathrm{shl}_{\theta(t)}: \mathbb{R}^d \to \mathbb{R}^d$ be a time-dependent vector field with $\theta(t)$ piecewise continuous. Then there exists a parametrization of the UpDown model that gives the same flow.

Proof. Without loss of generality, we only treat the case of one discontinuity in time of the parametrization; we thus assume that $\theta(t)$ is continuous on $[0, t_0)$ and $[t_0, 1]$. We consider $q, v_1, v_2 \in \mathbb{R}^d$, where $q, v_1$ are defined as in Lemma 5. Up to time $t_0$, we define $v_2$ as an affine function of $v_1$, which implies (differentiating w.r.t. time) that $v_2$ follows an evolution equation similar to that of $v_1$ and thus can be encoded in the general form of Eq. (D.1). On $[t_0, 1]$, $q(t), v_2(t)$ are then defined by the evolution Eq. (D.1) in order to coincide with the flow of $\mathrm{shl}_{\theta(t)}$ on $[t_0, 1]$, for well-chosen parameters as in Lemma 5. Since the value of $v_1(t)$ is not used in the evolution equation of $q$ on $[t_0, 1]$, we can simply extend it by any valid evolution equation of the form of Eq. (D.1). In the general case, we decompose the time interval $[0, 1]$ into $k$ intervals $[t_i, t_{i+1})$ on which $\theta(t)$ is continuous, and the proposed method extends directly using an inflation factor $\alpha = k$, introducing $v_k \in \mathbb{R}^d$.

Note that the result of this lemma gives an equality between the two flows defined on the whole space $\mathbb{R}^d$. The next result is an approximation result which holds on a compact domain $K \subset \mathbb{R}^d$. For a function $f: \mathbb{R}^d \to \mathbb{R}$, we denote $\|f\|_{K,\infty} = \sup_{x \in K} |f(x)|$.

Proposition 7.
The UpDown model is universal in the class of time-dependent vector fields. Let $K \subset \mathbb{R}^d$ be a compact domain. For every time-dependent vector field $w: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ (continuous in time and Lipschitz in space) with associated flow $\varphi(t, x)$, there exist time-dependent parameters of the UpDown model such that the solution of
$$
\begin{cases}
\dot q = \theta_1\,\sigma(v) + b_1,\\
\dot v = \theta_3\,q + b_2 + \theta_2\,\sigma(v),
\end{cases}
$$
is $\varepsilon$-close to the solution $\varphi(1, x)$, i.e. $\|\varphi(1, x) - q(1, x)\|_{K,\infty} \le \varepsilon$. Note that the case $\alpha = 1$ is similar in its formulation to a second-order model on $q$.

Proof. The proof is standard and we include it here for self-containedness. It is a consequence of [1] and Lemma 6. Let $B(0, r)$ be a ball of radius $r$ in $\mathbb{R}^d$ which contains $\varphi(t, K)$ for all times $t \in [0, 1]$. The flow associated with a given time-dependent vector field $v(t, \cdot)$ can be approximated by the flow of a vector field which is piecewise constant in time; i.e., let $\varepsilon > 0$ be a positive real; by continuity in time of $v(t, \cdot)$, there exists a decomposition of $[0, 1]$ into $k$ intervals $[t_i, t_{i+1}]$ and Lipschitz vector fields $v_i(x) = \mathrm{shl}_{\theta_i}(x)$ such that $\|v_i(x) - v(t, x)\|_{B(0,r),\infty} \le \varepsilon$ for $t \in [t_i, t_{i+1}]$. Denote by $w(t, \cdot)$ the time-dependent vector field defined by $w(t, \cdot) = v_i(\cdot)$ for all $t \in [t_i, t_{i+1}]$. Thus, denoting the flow of $v(t, \cdot)$ by $\varphi_v$ and the flow of $w(t, \cdot)$ by $\varphi_w$, we get
$$
\|\varphi_v(1, x) - \varphi_w(1, x)\| \le \int_0^1 \|v(t, \varphi_v(t, x)) - v(t, \varphi_w(t, x))\| + \|v(t, \varphi_w(t, x)) - w(t, \varphi_w(t, x))\| \,\mathrm{d}t,
$$
$$
\|\varphi_v(1, x) - \varphi_w(1, x)\|_{K,\infty} \le \int_0^1 \mathrm{Lip}(v)\,\|\varphi_v(t, x) - \varphi_w(t, x)\|_{K,\infty} + \|v(t, \cdot) - w(t, \cdot)\|_{B(0,r),\infty} \,\mathrm{d}t \le \int_0^1 \mathrm{Lip}(v)\,\|\varphi_v(t, x) - \varphi_w(t, x)\|_{K,\infty} \,\mathrm{d}t + \varepsilon\,,
$$
where $\mathrm{Lip}(v)$ denotes a bound on the Lipschitz constant of $v(t, x)$ w.r.t.
$x \in B(0, r)$ for all $t \in [0, 1]$. Then, the Grönwall lemma [41] gives
$$
\|\varphi_v(1, x) - \varphi_w(1, x)\|_{K,\infty} \le \varepsilon\, e^{\mathrm{Lip}(v)}. \tag{D.2}
$$
By Lemma 6, $\varphi_w(1, x)$ can be approximated by the flow of the UpDown model, and the result is obtained by the triangle inequality.

In this section, we focused on a universality result in the space of time-dependent vector fields. Interestingly, due to the additional dimensions, it is likely that the model is universal in the space of functions as well. This conjecture is supported by the quadratic 1D function regression example, which shows that the UpDown model is able to capture some maps which are not homeomorphic. We leave this question for future work.
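The change of variable in Lemma 5 can be checked numerically: integrating the $\mathrm{shl}_{\theta(t)}$ ODE and the augmented system of Eq. (D.1) side by side yields the same $q$ at $T = 1$. Below is a minimal sketch for the scalar case $d = 1$ with $\sigma = \tanh$; the smooth parameter choices and the `rk4` helper are illustrative, not taken from the paper.

```python
import math

# Hypothetical smooth parameter choices (not from the paper); sigma = tanh.
th1 = lambda t: 1.0
th2 = lambda t: 1.0 + 0.5 * t      # d(th2)/dt = 0.5
b1  = lambda t: 0.1
b2  = lambda t: 0.2 * t            # d(b2)/dt = 0.2
sigma = math.tanh

def rk4(f, y0, t0, t1, n):
    """Classical fourth-order Runge-Kutta for a vector-valued ODE y' = f(t, y)."""
    h, t, y = (t1 - t0) / n, t0, list(y0)
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
        k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
        k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

# Single-hidden-layer field: qdot = th1 * sigma(th2 * q + b2) + b1.
shl = lambda t, y: [th1(t) * sigma(th2(t) * y[0] + b2(t)) + b1(t)]

# Augmented system (D.1) in the state (q, v), with v = th2 * q + b2:
def updown(t, y):
    q, v = y
    return [th1(t) * sigma(v) + b1(t),
            0.5 * q + th2(t) * th1(t) * sigma(v) + 0.2 + th2(t) * b1(t)]

q0 = 0.3
q_shl = rk4(shl, [q0], 0.0, 1.0, 1000)[0]
q_ud = rk4(updown, [q0, th2(0.0) * q0 + b2(0.0)], 0.0, 1.0, 1000)[0]
assert abs(q_shl - q_ud) < 1e-6  # the two flows agree at T = 1
```

The initial condition `th2(0) * q0 + b2(0)` is exactly the $v(0) = \theta_2(0)q(0) + b_2(0)$ required by the lemma; with any other initialization of $v$ the two trajectories would diverge.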
E Experimental settings
This section describes our experimental settings. We use our UpDown model for all experiments and simply use a weighted Frobenius norm penalty for all parameters. Specifically, we weigh this penalty equally for all parameters except for $\theta_3$, which we penalize by 10. In our experiments we have observed better convergence properties for higher penalties on $\theta_3$. This might be due to the special role that $\theta_3$ plays in the model, as it subsumes a quadratic term in the original derivation of the UpDown model (see Sec. 3.2). In all experiments we also optimize over the affine map from $x(0)$ to $v(0)$ for the data evolution.

Simple function regression
We use 500 epochs for all experiments. For all particle-based experiments we freeze the positions of the particles for the first 50 epochs. We use a ReLU activation function and the MSE loss. We weight the MSE loss by 100 and the parameter norm loss by 1. We use 500 training samples, 1,000 testing samples, and 1,000 validation samples, and a batch size of 50. Note that for these simple examples there is in practice no real difference between the training, testing, and validation data, as the number of samples is large and the domain is [−., .]. We initialize the particle positions uniformly at random in [−., .] and draw the momenta from a Gaussian distribution with zero mean and standard deviation . . All time-integrations are done via a fourth-order Runge–Kutta integrator with time-step 0.1. For optimization we use Adam with a learning rate of 0.01 and the ReduceLROnPlateau learning rate scheduler of PyTorch with a learning rate reduction factor of 0.5.

Spiral
The spiral data is generated between time $t = 0$ and $t = 10$ with 200 uniformly spaced timepoints. Training is only on small time snippets with an approximate length of . time-units. Evaluation is on these short time snippets as well as on the entire trajectory by pasting together solutions for these short time snippets, i.e., an individual short solution starts where the previous one ends. Settings for the spiral are the same as for the simple function regression, with the following exceptions. We use 1,500 epochs. The step-size for the fourth-order Runge–Kutta integrator is 0.05. The MSE loss is still weighted by 100, but the parameter norm loss only by 0.01. We randomly draw 100 new training samples during each epoch and use 100 evaluation samples, 1,000 short-range samples, and 1 long-range testing sample. All samples are randomly drawn from the trajectory. However, as the trajectory is traversed at highly nonuniform speed, the samples are drawn from a uniform distribution across the trace of the spiral. As for the simple function regression experiment, there is little practical difference between the training, validation, and testing data, as the problem is so simple. However, this is not of concern in these experiments, as the prime objective is to study the fitting behavior of the different models. We use a batch size of 100.

Rotating MNIST
We use the data provided by the authors of [40] and follow the same autoencoder architecture, except that our encoder maps into a 20-dimensional representation space. The number of particles is set to 100 and the inflation factor $\alpha$ is set to 10. For optimization, we use Adam with a learning rate of 0.001 and the CosineAnnealingLR learning rate scheduler of PyTorch. We train for 500 epochs with a batch size of 25 and the parameter norm loss weight set to 0.1.
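For reference, CosineAnnealingLR follows a closed-form half-cosine decay of the learning rate. A minimal pure-Python sketch of the schedule is shown below; `eta_min = 0` and `t_max = 500` are assumptions matching our settings (no warm restarts).

```python
import math

def cosine_annealing_lr(step, eta_max=0.001, eta_min=0.0, t_max=500):
    """Learning rate at `step` under cosine annealing:
    eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * step / t_max))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * step / t_max))

assert abs(cosine_annealing_lr(0) - 0.001) < 1e-12       # starts at the base learning rate
assert abs(cosine_annealing_lr(500) - 0.0) < 1e-12       # decays to eta_min at t_max
assert cosine_annealing_lr(250) < cosine_annealing_lr(100)  # monotone decay on [0, t_max]
```

In contrast to ReduceLROnPlateau (used for the function regression and spiral experiments), this schedule is independent of the observed loss.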
Bouncing balls
As in the rotating MNIST experiment, we rely on the data provided by the authors of [40], follow their autoencoder architecture, and set the dimensionality of the representation space of the encoder to 50. The first three images of each sequence are provided to the encoder by concatenating the images along the channel dimension. The inflation factor $\alpha$ is set to 20 and we use 100 particles. We optimize over 100 epochs using Adam with the CosineAnnealingLR learning rate scheduler of PyTorch, the learning rate set to 0.001, and the parameter norm loss weight set to 0.0001.
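The channel-wise concatenation of the first three frames can be sketched as follows; the frame resolution and sequence length are illustrative placeholders (the actual values follow the data of [40]).

```python
import numpy as np

# Hypothetical shapes: a sequence of T grayscale frames, each C x H x W.
T, C, H, W = 20, 1, 32, 32
sequence = np.random.rand(T, C, H, W)

# The encoder sees the first three frames stacked along the channel dimension,
# i.e. a single (3*C) x H x W input instead of three separate C x H x W images.
encoder_input = np.concatenate([sequence[0], sequence[1], sequence[2]], axis=0)
assert encoder_input.shape == (3 * C, H, W)
```

Providing three consecutive frames gives the encoder access to velocity information, which a single frame of a bouncing-ball sequence cannot convey.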
F Additional results
Simple function regression
In Section 4, we considered approximating a quadratic-like function. Here we show parallel results for approximating a cubic function $y = x^3$. We also include some additional figures for the quadratic-like regression function. Note that whereas the cubic function is invertible (but not diffeomorphic), the quadratic-like one considered in Section 4 is a simple example of a non-invertible function. Tab. 1 shows the number of parameters for the four different formulations for both regression functions. Fig. 6 shows, for the cubic regression, the test loss and the network complexity, as measured by the Frobenius norm [29], for the four formulations. On average, the particle-based approaches show the best fits with the lowest complexity measures, indicating the simplest network parameterization. Note, however, that while the dynamic particle approach greatly outperformed the static particle approach for the quadratic-like function (see Fig. 2), this is not the case here. In fact, the static particle approach shows slightly better fits than the dynamic one. This might be because the cubic function is significantly simpler to fit and hence may not benefit as much from the dynamic approach. To illustrate that fitting the quadratic-like function is indeed harder, Figs. 7 and 8 show function fits for different numbers of particles for the cubic function and the quadratic-like function, respectively. All these fits are for the particle-based dynamic UpDown model. Clearly, very few particles can achieve reasonable fits for a simple function. As few as two particles already show a good fit for the cubic function, whereas the quadratic-like function requires more particles. This supports our hypothesis that fitting more complex functions may require more particles.

Since our approach is based on the time-integration of the UpDown model, it is interesting to see 1) how the mapping is expressed across time and 2) how the parameters, $\theta(t)$, of the UpDown model change over time. Fig. 9 shows example mappings for the cubic and the quadratic-like function, respectively. The estimated mappings are highly regular. Lastly, Figs. 10 and 11 show the time-evolutions of the model parameters for the cubic and the quadratic-like function for two different inflation factors. While different parameters show different dynamics, clear changes over time can be observed. In particular, $\theta_2(t)$ and $b_2(t)$ show strong changes. These parameters mostly control the behavior of the hidden high-dimensional state, $v$, whereas $\theta_3(t)$ is penalized significantly more in our model (see Sec. E) and consequently shows more moderate changes.

Table 1:
Number of parameters for the cubic and quadratic-like simple function regression experiments.
Figure 6:
Function fit (15 particles) for the cubic $y = x^3$ for 10 random initializations. Left: test loss; Right: time-integral of the log of the Frobenius norm complexity. Lower is better for both measures. * indicates the number of removed outliers (outside the interquartile range (IQR) by ≥ 1.5 × IQR); $\alpha$ denotes the inflation factor.

Spiral
Tab. 2 shows the number of parameters in each of the four formulations for the spiral experiment. This table complements the table in Fig. 3, which only showed the number of parameters when using 15 particles.

[Figure 7 panels: fits with 2, 5, 15, and 25 particles.]

Figure 7:
Fits for the cubic function with inflation factor 16 and for different numbers of particles. Vertical lines indicate particle positions after optimization. While subtle, the figures suggest that using more particles allows for a better approximation of the function. This is confirmed by the test loss values in Figure 6 (bottom left).

[Figure 8 panels: fits with 2, 5, 15, and 25 particles.]

Figure 8:
Fits for the quadratic-like function for inflation factor 16 with different numbers of particles. Vertical lines indicate particle positions after optimization. As this function is more complex than the cubic function, 2 and 5 particles are not sufficient for a good fit, but 15 and 25 particles result in a well-fitting approximation.

Figure 9: Mapping of the cubic function (left) and the quadratic-like function (right). As can be seen, the mappings are highly regular.

Table 2:
Number of parameters for the spiral experiment.
Inflation factor                       16       32       64       128
static/dynamic w/ particles, 15     1,116    2,172    4,284     8,508
static/dynamic w/ particles, 25     1,796    3,492    6,884    13,668
static/dynamic w/ particles, 50     3,496    6,792   13,384    26,568
static direct, n/a                  1,282    4,610   17,410    67,586
dynamic direct, n/a                 6,026   22,282   85,514   334,858
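The counts in Tab. 2 can be reproduced from the model dimensions. The bookkeeping below is our reconstruction inferred from the reported numbers for the spiral ($d = 2$), not a definition taken from the main text; in particular, the factor of 5 time control points for the dynamic direct parametrization is an inference.

```python
# Reconstructed parameter bookkeeping for Tab. 2 (spiral: d = 2). These formulas
# are inferred from the reported counts, not taken from the main text.
d = 2

def affine_map(alpha):
    # Affine map from x(0) in R^d to v(0) in R^(alpha*d): weight matrix + bias.
    return alpha * d * d + alpha * d

def particle_params(n_particles, alpha):
    # Each particle carries a position and a momentum for the full
    # state (q, v) in R^(d + alpha*d).
    return n_particles * 2 * (d + alpha * d) + affine_map(alpha)

def direct_params(alpha, n_time_points=1):
    # UpDown parameters: theta_1 (d x alpha*d), b_1 (d), theta_2 (alpha*d x alpha*d),
    # b_2 (alpha*d), theta_3 (alpha*d x d); 'dynamic direct' repeats them per time point.
    per_time = d * alpha * d + d + (alpha * d) ** 2 + alpha * d + alpha * d * d
    return n_time_points * per_time + affine_map(alpha)

assert particle_params(15, 16) == 1116 and particle_params(50, 128) == 26568
assert direct_params(16) == 1282 and direct_params(128) == 67586
# Dynamic direct matches with 5 time control points (our inference).
assert direct_params(16, n_time_points=5) == 6026
assert direct_params(128, n_time_points=5) == 334858
```

Under this reconstruction, the particle-based formulations grow linearly in $\alpha$ while the direct parametrizations grow quadratically (through the $(\alpha d)^2$ term of $\theta_2$), which explains the gap in Tab. 2 at large inflation factors.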
Figure 10:
Weight evolution across time (i.e., continuous depth) for 15 particles when fitting the cubic function using the UpDown model: $\dot x(t) = \theta_1(t)\sigma(v(t)) + b_1(t)$, $\dot v = \theta_3(t)x(t) + b_2(t) + \theta_2(t)\sigma(v(t))$. Results are for the dynamic with particles approach. Top: inflation factor 16. Bottom: inflation factor 64. Changes in parameter values can clearly be observed.
Figure 11:
Weight evolution across time (i.e., continuous depth) for 15 particles when fitting the quadratic-like function using the UpDown model: $\dot x(t) = \theta_1(t)\sigma(v(t)) + b_1(t)$, $\dot v = \theta_3(t)x(t) + b_2(t) + \theta_2(t)\sigma(v(t))$. Results are for the dynamic with particles approach. Top: inflation factor 16. Bottom: inflation factor 64. Changes in parameter values can clearly be observed.