Inferring Time-Dependent Distribution Functions from Kinematic Snapshots
Keir Darling⋆ and Lawrence M. Widrow†
Department of Physics, Engineering Physics & Astronomy, Queen's University, Stirling Hall, Kingston, ON K7L 3N6, Canada
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
We propose a method for constructing the time-dependent phase space distribution function (DF) of a collisionless system from an isolated kinematic snapshot. In general, the problem of mapping a single snapshot to a time-dependent function is intractable. Here we assume a finite series representation of the DF, constructed from the spectrum of the system's Koopman operator. This reduces the original problem to one of mapping a kinematic snapshot to a discrete spectrum rather than to a time-dependent function. We implement this mapping with a convolutional neural network (CNN). The method is demonstrated on two example models: the quantum simple harmonic oscillator, and a self-gravitating isothermal plane. The latter system exhibits phase space spiral structure similar to that observed in Gaia Data Release 2.
Key words: Galaxy: kinematics and dynamics – Galaxy: structure – Galaxy: disc
⋆ E-mail: [email protected]
† E-mail: [email protected]

1 INTRODUCTION

Disc galaxies, such as the Milky Way, exhibit bars, spiral arms, and warps. Such features are often too large to be treated as linear perturbations from an equilibrium model. In this work we introduce a novel scheme to construct galaxy models that include nonlinear time-dependent phenomena. The method relaxes the common assumption that isolated galaxies are close to equilibrium while retaining the attractive feature of superposition from perturbation theory. The ultimate goal of this endeavor is to determine a mapping from a kinematic snapshot to a dynamical model. With the Milky Way, the motivation for considering this problem comes from the vast catalog of stellar positions and velocities being made available by the Gaia mission (Gaia Collaboration et al. 2018). In this case, the dynamical model is described by the stellar phase space distribution function (DF).

In our scheme, the map from kinematic snapshot to DF takes the form of a convolutional neural network (CNN), which we train on a set of N-body simulations. For each simulation, we construct a data matrix whose columns correspond to a time-dependent scalar observable of the phase space coordinates. The observable here is an approximation of the DF on a grid in phase space, that is, the coarse-grained DF. Our scheme is based on the idea that there exists a linear operator that determines the time evolution of observables even if the evolution in the coordinates is nonlinear. Such an operator is frequently called the Koopman operator. It follows that the DF can be expanded as a series using the eigendecomposition of this operator. We approximate the eigenfunctions and eigenvalues of the Koopman operator using a data-driven method known as Dynamic Mode Decomposition (DMD). The CNN then maps snapshots to eigenfunctions and eigenvalues, from which the time-dependent DF is recovered.

The opportunity to infer dynamics from a single kinematic snapshot has been recognized for some time. For example, Bovy et al. (2010) solve the problem of modeling the dynamics of planets in the solar system and the gravitational force law from the positions and velocities of the planets at a single instant in time. Their analysis employs Bayesian inference with the key assumption that the planets are on stable orbits. The steady-state assumption is also at the heart of orbit-based schemes such as the Schwarzschild method (Schwarzschild 1979) and Made-to-Measure (Syer & Tremaine 1996). These approaches build the DF from a set of orbits subject to the condition that the system is in equilibrium and that observables derived from the DF fit observational constraints. More recently, Green & Ting (2020) proposed Hamiltonian neural networks as a means of inferring the gravitational potential of a collisionless system from a snapshot of the DF. This approach also assumes a stationary DF in a time-independent potential.

We focus on the collisionless Boltzmann equation (CBE) coupled to gravity through the Poisson equation. The standard analytic method for treating this nonlinear system is to first construct an equilibrium model and then study departures from equilibrium by solving the corresponding linearized equations (Binney & Tremaine 2008; Binney 2013). Conversely, one can simulate the full nonlinear dynamics of galaxies via N-body techniques. One way to connect simulations with analytic methods is via the spectral analysis methods of Sellwood & Athanassoula (1986). This approach was originally developed to study the emergence of bars and spiral structure in isolated disc galaxies from equilibrium initial conditions, though it has been extended to include bending waves and warps by Chequers & Widrow (2017). The idea is to take a sequence of snapshots, each of which comprises a set of observables such as the surface density across the disc, Σ(
R, φ, t), where R and φ are polar coordinates in a disc-centered system. One then computes a Fourier series in azimuthal mode number m and frequency ω: Σ̃(R, m, ω). Though the method does not explicitly assume that the Fourier components are linear, it is motivated by perturbation theory, since linear perturbations about an axisymmetric equilibrium state have definite m and parity about the mid-plane. The framework therefore makes it difficult to study nonlinear couplings that may arise with spiral structure and warps (Masset & Tagger 1997a,c).

The phase space spirals discovered by Antoja et al. (2018) in kinematic data from the Gaia mission (Gaia Collaboration et al. 2018) provide another example of a galactic structure that is difficult to study using perturbation theory. The spirals appear in the z − v_z projection of the stellar DF, where z is the position of a star relative to the mid-plane of the Galaxy and v_z is the z-component of its velocity. This observation is suggestive of phase mixing in z − v_z following a perturbation to the disc 300−
900 Myr ago (Antoja et al. 2018; Schönrich & Binney 2018; Darling & Widrow 2019; Bland-Hawthorn et al. 2019; Li & Shen 2019; Laporte et al. 2019). Tellingly, perturbation theory entirely misses the spirals. Mathur (1990), Weinberg (1991), and Widrow & Bonner (2015) determined modes of the isothermal plane, a model for the vertical structure of galactic discs devised by Spitzer (1942) and Camm (1950). These modes include a seiche mode, where the slab as a whole oscillates in the gravitational potential of a background halo, as well as higher-order modes that correspond to compression-rarefaction or breathing of the disc. Weinberg (1991) carried out N-body simulations by passing a perturbing mass through the slab. While the seiche mode was easily excited, the higher-order modes were not. A likely explanation is that the discrete higher-order modes lie close to a continuum, and therefore any disturbance tends to rapidly phase-mix away.

We propose an alternative means of dealing with nonlinearity in the CBE-Poisson system, which draws on perturbation theory and spectral analysis while addressing their shortcomings. As in spectral analysis, we consider a sequence of snapshots from the evolution of a system. Here we take the snapshots to be of the DF, which acts as a scalar observable function of the phase space of the system. The crux of our method is that the dynamics in the space of observables are linear even if the dynamics in phase space are nonlinear. The basis for the transformation between these two spaces comes from Koopman theory (Koopman 1931; Mezić 2005; Rowley et al. 2009), and the linear Koopman operator that determines the linear dynamics can be estimated from simulations using DMD (Schmid 2010; Tu et al. 2013). The advantage of this over the spectral analysis methods of Sellwood & Athanassoula (1986) is that it does not make a priori assumptions about the form of the modes. In particular, it does not assume definite m or parity about the mid-plane. Instead, the form of the modes is determined by the data. Moreover, the equilibrium (or dominant zero-frequency) state and the time-dependent modes are determined simultaneously. A detailed study of the isothermal plane using DMD found that the technique could identify structures leading to long-lived spirals (Darling & Widrow 2018) similar to those seen in Gaia data (Antoja et al. 2018). In addition, a preliminary study of a disc galaxy simulation produced DMD modes that include both m = 2 spirals and m = 1 warps (Widrow et al. 2019), which are suggestive of the physics described in Masset & Tagger (1997b).

The outline of this paper is as follows. We begin in Section 2 with some background on collisionless self-gravitating systems in a Hamiltonian framework to contextualize our proposed methodology. This is followed by a summary of Koopman theory, and the computational techniques employed in approximating the Koopman operator, in Section 3. We then provide an overview of the specific problem we will solve before proposing our solution in Section 4. Section 5 contains two applications of our method to test systems: first a simple quantum harmonic oscillator, and then the more astrophysically motivated isothermal plane model. Discussion of future work and concluding remarks are in Section 6.
2 COLLISIONLESS SELF-GRAVITATING SYSTEMS

Consider a collisionless system with one spatial degree of freedom q, conjugate momentum p, and time parameter t. For such a system, construct the continuous DF, f(q,p,t), and the corresponding configuration space density

ρ(q,t) = ∫_{−∞}^{∞} f(q,p,t) dp. (1)

The equations of motion in this system are derived from the single-particle or mean-field Hamiltonian

H(q,p) = p^2/2 + Φ(q,t), (2)

where the potential Φ(q,t) depends on the full configuration space distribution ρ(q,t) and on external masses described by Φ_ext(q). We take ρ(q,t) and f(q,p,t) to comprise a self-consistent solution to the collisionless Boltzmann equation (CBE). That is, we have

∂f/∂t + {f, H} = ∂f/∂t + p ∂f/∂q − (∂Φ/∂q)(∂f/∂p) = 0, (3)

∇^2 Φ(q,t) = 4πG ρ(q,t) + ∇^2 Φ_ext(q), (4)

where G denotes the gravitational constant and {f, H} the Poisson bracket of the functions f and H. Since the potential depends on the DF, the system is nonlinear and there is no fully general prescription for generating solutions in either the time-dependent or the time-independent case. However, if an equilibrium solution is known, then one can study time-dependent behavior analytically by considering the linearized CBE.

2.1 Perturbation Theory

For a known equilibrium DF, f_0(q,p), one may consider a perturbative solution

f(q,p,t) = f_0(q,p) + ε f_1(q,p,t),  |f_1| ≃ |f_0| and |ε| ≪ 1. (5)

Substituting this DF into equation 3 and dropping terms of order ε^2 and higher, we obtain the linearized CBE,

∂f_1/∂t + {f_1, H_0} + {f_0, Φ_1} = 0. (6)

Solving equation 6 for f_1(q,p,t) yields a set of generally time-dependent modes of the DF. Additionally, since equation 6 is linear, given a set of modes f_{1j}(q,p,t), each a solution of equation 6, any linear combination of the f_{1j}(q,p,t) is also a solution, with corresponding potential modes Φ_{1j} paired with the f_{1j}. We can then consider, instead of the DF in equation 5, one of the form

f(q,p,t) = f_0(q,p) + ε Σ_j f_{1j}(q,p,t). (7)

In this paper we seek a DF of the form

f(q,p,t) = Σ_j f_j(q,p,t) = Σ_j b_j ψ_j(q,p) e^{ω_j t}, (8)

where the b_j ∈ C are constant coefficients, the ψ_j(q,p) ∈ C are linearly independent functions, and the ω_j ∈ C are complex frequencies that determine the oscillatory and damping behavior of each term in the sum. Let us assume here that Re{ω_j} ≤ 0, so that no term grows without bound (equilibrium corresponds to the ω_j = 0 terms). Nevertheless, it is useful to separate out these terms from the sum. Supposing we have a single zero-frequency mode at j = 0, we may write

f(q,p,t) = f_0(q,p) + Σ_{j≠0} b_j ψ_j(q,p) e^{ω_j t}, (9)

where f_0(q,p) = b_0 ψ_0(q,p) is the equilibrium DF. In perturbation theory, each term in equation 9 is a solution to the linearized CBE, which arises from an a priori known f_0(q,p). Conversely, in our method we treat all terms in the sum equally and determine them simultaneously, without linearizing the CBE.

3 KOOPMAN THEORY AND DYNAMIC MODE DECOMPOSITION

When we describe a galaxy with f(q,p,t), we are using the abstraction of a phase space probability density to describe a more fundamental dynamical system, namely the evolution of all N stars as they interact with each other and with external forces.
A general form for the equations of motion of such a system is

dx(t)/dt = g(x(t)), (10)

where x ∈ R^{2N} is a vector containing the phase space coordinates of the N stars, and g(x) is a generally nonlinear function determined by Hamilton's equations. Since this system can be prohibitively complex for large numbers of stars, we generally opt for a statistical description in the form of a probability density function, which describes the probability of finding a particle in a volume of phase space. This function, f(q,p,t), is constrained by Hamilton's equations and probability conservation, and can be shown to obey the flow of an incompressible fluid. When we identify the evolution of x with that of f(q,p,t), we are constructing a scalar observable function of the 2N-dimensional phase space, that is, f : x ∈ R^{2N} ↦ R.

Koopman theory facilitates a genuinely linear description of systems that are nonlinear in their state space (here, phase space). The system evolution is described in terms of observables on the phase space rather than the phase space coordinates themselves. These observables evolve linearly in time according to the Koopman operator. At this point we transition to a discrete-time description, both for notational convenience and because we will ultimately work with discrete-time simulations. Our evolution equation is then

x(t_{m+1}) = g(x(t_m)). (11)

Suppose that we have an observable represented by a scalar function f : x ↦ R. This could be a density, or any other scalar function of x, but for our purposes the observable function is to be identified with the DF. The Koopman operator U maps f(x) to another function Uf(x) defined by the composition of g(x) and f(x). Explicitly, we have the mapping

Uf(x(t_m)) = f(g(x(t_m))). (12)

From equation 11, g(x(t_m)) = x(t_{m+1}), and therefore

Uf(x(t_m)) = f(x(t_{m+1})). (13)

That is, U is a linear evolution operator for the observable f(x). Note that the Koopman operator acts on the space of all possible observables, of which f(x) is a single element. The observable space is infinite dimensional, and consequently so is the Koopman operator itself. So the nonlinear system in equation 11 has become linear, with the caveat that the dynamics are represented in an infinite-dimensional space. The advantage of describing the dynamics with equation 13 rather than equation 11 lies in the eigenfunction expansion of linear operators: the evolution of the system can be represented in terms of the eigendecomposition of the Koopman operator.
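To make the action of U concrete, consider the following minimal sketch (Python; the map g and the observable f are hypothetical stand-ins of our own, not models from this paper). The Koopman operator advances the observable by composition, as in equations 12 and 13.

```python
import numpy as np

def g(x):
    # Hypothetical nonlinear discrete-time dynamics (a pendulum-like map).
    q, p = x
    return np.array([q + 0.1 * p, p - 0.1 * np.sin(q)])

def f(x):
    # Hypothetical scalar observable of the state.
    q, p = x
    return q**2 + p**2

def Uf(x):
    # Koopman operator acting on f: (Uf)(x) = f(g(x))  (equation 12).
    return f(g(x))

x0 = np.array([1.0, 0.0])
x1 = g(x0)
# (Uf)(x0) equals the observable evaluated one step later (equation 13).
assert np.isclose(Uf(x0), f(x1))
```

Although g is nonlinear in the state, U is linear in the observable: U(a f + b h) = a Uf + b Uh for any observables f, h and scalars a, b.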
Since the exact Koopman operator is infinite dimensional, its explicit form cannot be determined in general. We instead compute finite-dimensional approximations with an algorithm called Dynamic Mode Decomposition (DMD). This method determines the best-fit linear evolution operator, along with its eigenfunctions and eigenvalues, for time-series data. Here we provide a brief overview of DMD; a detailed introduction can be found in Kutz et al. (2016).

In what follows we assume a time series comprised of m snapshots of the scalar observable function f(q,p,t), arranged in the data matrix

F = [ f(t_1)  f(t_2)  ...  f(t_m) ], (14)

where each column f(t_i) is the DF sampled on a grid in (q,p) at time t_i and reshaped into a column vector. In DMD, one looks for series solutions of the continuous-time linear flow equation

df(t)/dt = 𝒜 f(t), (15)

that are the best fit for the discrete-time data in F. The general solution of equation 15 is given by equation 8, where ψ_j and ω_j are the complex-valued eigenfunctions and eigenvalues of the evolution operator 𝒜, that is

𝒜 ψ_j = ω_j ψ_j. (16)

The coefficients b_j are determined by the initial condition f(t_1). To obtain these solutions in practice, one considers the corresponding discrete-time system

f(t_{m+1}) = A f(t_m), (17)

where the discrete-time and continuous-time operators are related by

A = e^{𝒜 Δt}, (18)

and Δt is the temporal sampling period in the simulation. The discrete-time operator has the eigendecomposition

A ψ_j = λ_j ψ_j, (19)

where

ω_j = ln(λ_j)/Δt. (20)

Given an estimate for the discrete-time evolution operator of F, we can construct series solutions as in equation 8 from the eigendecomposition of A. The rank, r, of the estimated evolution operator is chosen at the outset and determines the number of terms in the series solution.

When we apply DMD with state vectors mapped into observables, we are treating the linear DMD evolution operator A as a finite-dimensional, rank-r approximation of the Koopman operator. Further, it can be shown that the eigenvalues of A are indeed eigenvalues of U and that the eigenvectors of A are related to the eigenvectors of U (Tu et al. 2013).

To apply perturbation theory to collisionless dynamics, one must first determine an equilibrium solution to the CBE. This solution generates a linear operator, which replaces the original nonlinear operator of the CBE. The resulting linear system is an approximation that is only valid when perturbations are small relative to the equilibrium solution. Conversely, when we "linearize" a system with the Koopman approach, we are not constructing a linearized operator dependent on a zeroth-order solution, but rather transforming the system to an infinite-dimensional space which naturally admits a linear representation of the system.
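For reference, here is a minimal sketch of the rank-r procedure described above, in the style of exact DMD (Tu et al. 2013). This is our own illustration, not the authors' code, and the array layout is an assumption:

```python
import numpy as np

def dmd(F, r, dt):
    """Rank-r Dynamic Mode Decomposition of the snapshot matrix F.

    F  : (n_grid, m) array whose columns are successive snapshots f(t_i)
    r  : truncation rank of the estimated evolution operator
    dt : temporal sampling period

    Returns modes (columns of Psi), continuous-time frequencies omega, and
    amplitudes b, such that f(t) ~ Psi @ (b * exp(omega * t)) (equation 8).
    """
    X, Y = F[:, :-1], F[:, 1:]            # pairs f(t_k) -> f(t_{k+1})
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]    # rank-r truncation
    # Best-fit discrete-time operator projected onto the leading subspace.
    Atilde = U.conj().T @ Y @ Vh.conj().T / s
    lam, W = np.linalg.eig(Atilde)        # eigendecomposition (equation 19)
    Psi = Y @ Vh.conj().T / s @ W         # exact DMD modes
    omega = np.log(lam) / dt              # frequencies (equation 20)
    b = np.linalg.lstsq(Psi, X[:, 0], rcond=None)[0]  # fixed by f(t_1)
    return Psi, omega, b
```

A call such as `Psi, omega, b = dmd(F, r=10, dt=0.1)` then supplies the quantities appearing in equation 8.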
4 INFERRING DYNAMICS FROM A SINGLE SNAPSHOT

In this section, we propose a method for inferring a time-dependent model from a single phase space snapshot. This inference problem can be stated as estimating the function G : f(q,p;t_0) ↦ f(q,p,t) that maps a snapshot at time t_0 to the time-dependent DF. The goal is to construct an appropriate non-parametric model for the function G. Such a model is trained on a set of simulated systems to facilitate making inferences from snapshots sampled from an unknown DF. We will use CNNs to approximate the mapping G.

The DMD series solution plays a crucial role of "compression" in our methodology. When determining a rank-r DMD solution for a sequence of m snapshots, each with grid resolution M_q × M_p (where we always have r ≪ m), we store all of the information the rank-r decomposition is able to capture in r eigenfunctions and r eigenvalues. This amounts to r(1 + M_q M_p) complex numbers, in contrast to the m M_q M_p real numbers needed to store the entire simulation. Additionally, the eigenvalues and eigenfunctions come in conjugate pairs, so we need only store half of them. Note that the coefficients in the DMD series can be obtained by projecting f(q,p;t_0) onto the eigenfunctions; they therefore do not need to be stored.

Since the spatiotemporal behavior of the DF is captured in the r eigenfunctions and eigenvalues, we can map the snapshot f(q,p;t_0) to these quantities instead of to an explicit time series. Additionally, it is easier to design two separate models that perform the mappings for the eigenvalues and eigenvectors separately than one model that attempts to map f(q,p;t_0) to the full spectrum. The input of the model is a snapshot of the DF, f(q,p;t_0), which acts as a Koopman observable of the state space, and the output is the eigendecomposition of a finite-dimensional approximation of the Koopman operator. From this point on we separate the problem into two functions, G_ω : f(q,p;t_0) ↦ Ω and G_ψ : f(q,p;t_0) ↦ Ψ, where for a rank-r evolution operator,

Ω = diag(ω_{−r/2}, ..., ω_0, ..., ω_{r/2−1}), (21)

Ψ = [ ψ_{−r/2}  ...  ψ_0  ...  ψ_{r/2−1} ]. (22)

That is, G_ψ maps the snapshot to the static phase space structures that form a basis for our predicted DFs, and G_ω determines the temporal characteristics in the form of damping coefficients and frequencies. In Section 5 we use separate neural network models for each of these mappings.

Note that we must choose the rank r, which controls the complexity of the model, when designing the network and constructing the training data. In the work presented here, r is a fixed parameter of G. Were we to allow r to vary, we would not be able to use the simple non-parametric techniques described here. In many cases it is sufficient to work with a relatively low, fixed rank, as we are most interested in dominant, persisting structure rather than the weak persisting contributions and transients that tend to arise in higher-rank DMD models.
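Given the predicted spectrum (Ψ, Ω) and the input snapshot, the time-dependent DF follows directly from equation 8. A minimal sketch, assuming NumPy arrays shaped as in the DMD sketch of Section 3:

```python
import numpy as np

def reconstruct_df(Psi, omega, f0, times):
    """Evaluate f(t) = sum_j b_j psi_j exp(omega_j t)  (equation 8).

    Psi   : (n_grid, r) complex eigenfunctions, one per column
    omega : (r,) complex frequencies
    f0    : (n_grid,) snapshot f(q, p; t0) used to fix the coefficients b_j
    times : (n_t,) evaluation times, measured relative to the snapshot time t0
    """
    # Coefficients from projecting the snapshot onto the eigenfunctions.
    b = np.linalg.lstsq(Psi, f0, rcond=None)[0]
    dynamics = b[:, None] * np.exp(np.outer(omega, times))
    # The real part gives the physical coarse-grained DF on the grid.
    return (Psi @ dynamics).real
```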
Some basic knowledge of convolutional networks is assumed in Section 5; the essential background on this topic is covered here, with supplemental information in Section A2. For a comprehensive review of CNNs, we refer the reader to Goodfellow et al. (2016). We realize the function G as a non-parametric regression model implemented with CNNs. Koopman theory was previously combined with neural networks by Lusch et al. (2018), who used them to determine the mapping from system coordinates to Koopman observables. To our knowledge, neural networks have not previously been used to map single snapshots from a time series to the eigenfunctions and eigenvalues from which the time evolution of a system can be constructed.

For an N-dimensional input space, the learnable parameters of the CNN are the elements of N-dimensional filters that are used in convolution operations. The CNN comprises a series of layers. The mapping performed in each layer is the convolution of the layer's input with that layer's filters, followed by the application of an activation function to the result of the convolution. A simple concrete example for a 2-dimensional input is visualized in Fig. 1. This diagram makes clear the signal path from the input matrix through to the output, and the exact role the learnable filters play in the mapping performed by the block. We cascade multiple blocks of this form in the models of Section 5, and combine them with down-sampling and reshaping techniques to obtain our desired output space dimension. A minimal sketch of such a block is given below.

There are two main benefits to using CNNs here. First, in contrast to neural networks based on linear combinations (see Section A1), CNNs share parameters among multiple operations. This makes training and evaluating models involving large data volumes much more efficient than for the equivalent conventional network. The examples in Section 5 are of low dimension and could be handled by conventional networks. However, it would be prohibitively expensive to scale the resulting models up to large spaces like the 6-dimensional phase space required for complete modeling of a galaxy; CNNs provide a comparatively feasible route to higher dimensions. More importantly, CNNs are designed specifically to identify spatial structure in the input. In mapping a phase space snapshot to a basis for the system's evolution, we should expect that there is meaningful spatial structure in the snapshot that correlates with particular system properties.

Figure 1. Example of a single convolutional block for a 60 × 60 input image (a DF f(q,p) sampled on a 60 × 60 grid). The block consists of four 3 × 3 filter kernels k_1, ..., k_4; each convolution x ∗ k_i yields a 58 × 58 convolution output, to which the ReLU activation function is applied, giving a 58 × 58 × 4 convolutional block output.
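The following sketch implements the block of Fig. 1; the use of PyTorch is our choice for illustration (the framework is an assumption, not specified by the text):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolutional block as in Fig. 1: four 3x3 filters followed by ReLU."""

    def __init__(self):
        super().__init__()
        # One input channel (the DF snapshot) and four learnable 3x3 kernels;
        # with no padding, a 60x60 input yields a 58x58 map per kernel.
        self.conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

block = ConvBlock()
snapshot = torch.randn(1, 1, 60, 60)  # a batch containing one 60x60 snapshot
print(block(snapshot).shape)          # torch.Size([1, 4, 58, 58])
```

Cascading several such blocks, with down-sampling between them and a reshaping stage at the end, produces the output dimensions required for Ψ and Ω.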
5 APPLICATIONS

Here we apply the methodology described in the previous sections to two toy model problems. First, in Section 5.1, we consider the Schrödinger equation with a harmonic potential. This is followed by the isothermal plane model (or Spitzer sheet) in Section 5.2, where we must use the methods described in Section 3 to search for transformations to a space in which the dynamics become linear.
5.1 The Quantum Harmonic Oscillator

We begin with the quantum harmonic oscillator for two reasons. First, as the Schrödinger equation is linear, its solutions may always be written exactly as eigenfunction expansions of the form in equation 8. Additionally, since the Hamiltonian is a Hermitian operator, the eigenvalues are strictly real, which simplifies the regression problem involved in determining the mapping from snapshot to eigenvalues. In the particular case of a harmonic potential, we have analytic expressions for the eigenfunctions and eigenvalues, so we can omit the DMD step of our procedure and focus solely on the problem of inferring system dynamics from a single snapshot with a non-parametric model. Further, the Schrödinger equation admits a probabilistic description of configuration space dynamics, in contrast to the full phase space needed in the case of the CBE. This means the input and output spaces are significantly smaller than in the collisionless dynamics case.

This example is not entirely unrelated to stellar dynamics. In fact, the Schrödinger equation has been applied to collisionless self-gravitating systems by Widrow & Kaiser (1993). In that paper, the authors constructed a system comprised of the Schrödinger and Poisson equations such that the usual coupled collisionless Boltzmann and Poisson equations are recovered in a particular limit.
Consider observations of a wavefunction that satisfies the Schrödinger equation with a harmonic potential. For convenience we take ℏ = m = 1 and write

∂Ψ(q,t)/∂t = −i H Ψ(q,t),  H = −(1/2) ∂^2/∂q^2 + (1/2) ξ^2 q^2. (23)

Suppose that the potential curvature ξ is unknown. The problem then becomes: given one time-series sample Ψ(q;t_0) from the evolution of a wavefunction, with unknown initial condition and potential curvature ξ, can we determine past and future states of the wavefunction? Although we do not know the curvature ξ, we will assume that it belongs to a normal distribution of possible values with mean ξ̄ and variance σ_ξ^2, denoted N(ξ̄, σ_ξ^2). The idea here is that in order to efficiently produce training data, and subsequently train a model for use on a system, some knowledge of the system parameters must be available.

As equation 23 is a linear time evolution equation, we have the familiar general solution obtainable through separation of variables,

Ψ(q,t) = Σ_{j=0}^{∞} a_j ψ_j(q) e^{−i E_j t}, (24)

where the eigenfunctions ψ_j(q) and eigenvalues E_j satisfy the linear differential eigenvalue equation

H ψ_j(q) = E_j ψ_j(q). (25)

To be consistent with Section 3, we define the complex frequencies

ω_j = −i E_j. (26)

The eigenfunctions of equation 23 are well known to be

ψ_j(q) = (ξ/π)^{1/4} (2^j j!)^{−1/2} H_j(√ξ q) e^{−ξ q^2/2}, (27)

where H_j(q) are the Hermite polynomials of order j. The corresponding eigenvalues are given by

E_j = (j + 1/2) ξ. (28)

The input-output pairs of the mapping G are (Ψ(q;t_0), (ψ_j, ω_j)), where the input Ψ(q;t_0) is an evolved state of a random initial condition, and the output (ψ_j, ω_j) is the eigendecomposition of the Hamiltonian operator. For each element of the training data set, we (i) sample ξ from N(ξ̄, σ_ξ^2); (ii) generate a random initial condition comprised of the ground state in superposition with excited states assigned random coefficients; (iii) determine the eigenvectors and eigenvalues, (ψ_j, ω_j), for the chosen ξ; and (iv) evolve the initial condition to a randomly selected time t_0 to produce Ψ(q;t_0). For the purposes of our calculations, we approximate the eigenfunctions by evaluating equation 27 on a 500-point grid in q. This means that the non-parametric model sees the outputs in the training set simply as arrays, and has no prior knowledge of the explicit form of the eigenfunctions, namely that they contain Hermite polynomials.

For the training data we sample 5 × potentials from N(ξ̄, σ_ξ^2) with mean potential curvature ξ̄ = 5 and variance σ_ξ^2 = 1. We use a test set of 10 points, with values of ξ sampled from a uniform distribution, U(ξ̄ − 3σ_ξ, ξ̄ + 3σ_ξ). This means the model is tested on approximately the same number of points for each value of ξ within the range of the uniform distribution, facilitating investigation into how well the model generalizes to systems away from ξ = ξ̄. This will be made clear when we evaluate the model performance in Figs. 4 and 5.
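As an illustration of steps (i)-(iv), the following sketch builds one training pair; the grid extent, number of retained modes, and snapshot time range are our assumptions, not values from the paper:

```python
import numpy as np
from scipy.special import eval_hermite, factorial

rng = np.random.default_rng()
q = np.linspace(-6.0, 6.0, 500)               # 500-point grid in q

def eigenfunction(j, q, xi):
    # Equation 27: psi_j(q) for curvature xi (hbar = m = 1).
    norm = (xi / np.pi) ** 0.25 / np.sqrt(2.0**j * factorial(j))
    return norm * eval_hermite(j, np.sqrt(xi) * q) * np.exp(-xi * q**2 / 2)

xi = rng.normal(5.0, 1.0)                     # (i) draw the curvature
n_modes = 6                                   # (assumed) number of modes kept
a = rng.normal(size=n_modes) + 1j * rng.normal(size=n_modes)
a /= np.linalg.norm(a)                        # (ii) random superposition
psi = np.array([eigenfunction(j, q, xi) for j in range(n_modes)])
E = (np.arange(n_modes) + 0.5) * xi           # (iii) eigenvalues (equation 28)
t0 = rng.uniform(0.0, 10.0)                   # (iv, assumed time range)
snapshot = (a[:, None] * psi * np.exp(-1j * E[:, None] * t0)).sum(axis=0)
# Training pair: input snapshot, outputs (psi, -1j * E) stored as arrays.
```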
For this problem we use simple one-dimensional CNNs. Following the discussion in Section 4, we split the problem of determining the mapping G into the two sub-problems described by G_ψ : Ψ(q;t_0) ↦ ψ_j and G_ω : Ψ(q;t_0) ↦ ω_j, each implemented by its own CNN. Both models use the same convolutional structure, but differ in the sizes of their outputs and in their hyperparameters. The shared network architecture is shown in Table A2.

The eigenvector predictor G_ψ is trained with a learning rate of 1 × − and a patience parameter of 50. Evaluated on the test data, the trained model has a mean squared error of 4. × −, which corresponds to ∼ 0.05% of the mean eigenvector magnitude. The eigenvalue predictor is trained with a learning rate of 1 × − and patience of 100. This yields a mean squared error of 0.018 on the test data, which is ∼ 0.2% of the average eigenvalue magnitude.

Fig. 2 shows representative examples of model performance evaluated on the test data. The values of ξ shown in the three panels correspond to the mean value of the parameter, along with values exactly two and three standard deviations away from the mean. The models are successful in estimating both the eigenfunctions and the eigenvalues. Note that although the eigenvalues are predicted with relatively high accuracy, we should not expect them to exactly match the true values. Since the eigenvalues are arguments of temporal exponential functions, small errors lead to large changes in the wavefunction as time progresses. That is, although the wavefunction has the correct qualitative behavior, it goes out of phase with the exact solution on a time scale proportional to the inverse of the error in the eigenvalue. This point is illustrated in Fig. 3, where we show the time dependence of the true and predicted wavefunctions for the test case from the first row of Fig. 2. The predicted wavefunction is in good qualitative agreement with the true one, though it is clearly out of phase by t ≃ .

Figs. 4 and 5 show the density of test points in the space of ξ and average absolute error on predictions for ψ and ω. For both models, the error increases as ξ approaches the tails of the distribution N(ξ̄, σ_ξ^2), consistent with the expectation from exposing the model to more points close to ξ̄ than away from it.

Figure 2. Comparison of true eigenvectors and eigenvalues with model predictions. The eigenvectors are shown in the large left-side panels, and the eigenvalues in the small right-side panels. True values and model predictions are shown as gray solid lines and black dotted lines respectively.

Figure 3. Time evolution of the true (top) and model output (bottom) wave functions. Each thin black curve is a time-series sample from their evolution. The thick magenta curves are drawn at much larger sampling intervals to emphasize how the two wave functions differ at identical times.

Figure 4. Density of test points in the space of the model parameter and mean absolute prediction error on the eigenvectors. The blue histogram is the normalized density in this space, and the black curves are the corresponding contours. Dashed lines indicate how far a test point's ξ value is from the mean, in one-σ_ξ increments.

Figure 5. Same as Fig. 4 but for the eigenvalue predictor.

Note that the error distributions for G_ψ and G_ω differ greatly. The eigenvalue model, shown in Fig. 5, has what looks to be a single elongated region of high density, which aligns with a roughly quadratic dependence on ξ. Additionally, the distribution of error at any fixed value of ξ has approximately the same spread as at any other ξ value, as can be seen in the contours. Conversely, the eigenvector distribution, shown in Fig. 4, has two local maxima at the edges of the ξ range. These result from the spread in error being much tighter at ξ further than 2σ_ξ from ξ̄ than at ξ near ξ̄. The takeaway is that the upper and lower bounds on the error in the eigenvalue predictor are relatively independent of ξ, whereas for the eigenvector predictor the error bounds differ greatly near ξ̄ and converge as one moves towards the tails of N(ξ̄, σ_ξ^2). This indicates that the eigenvector predictor is more susceptible to error in regions of the parameter space on which it has not been heavily trained than the eigenvalue predictor. It is
possible that this is simply a result of the more complex nature of the eigenvector problem, G_ψ : Ψ(q;t_0) ↦ ψ_j, with its much larger output space compared to the eigenvalue problem, G_ω : Ψ(q;t_0) ↦ ω_j. Whether this can be effectively remedied with further training, as opposed to modifications to the network architecture, is currently unclear and not within the scope of this work. It should be noted, however, that in the framework as presented, eigenvector predictors are not as robust to incorrect assumptions about system parameters as their eigenvalue counterparts.

We end this section with a remark that, with the knowledge required to produce this training data, one could potentially learn the spectrum of the system from a single snapshot by simply assuming a dominant ground state and fitting a Gaussian for the j = 0 mode. The curvature of the potential could then be inferred without the need for this framework. The purpose of this example is not to revolutionize the way we solve for the time dependence of the harmonic oscillator, but rather to take a simple system and see if a non-parametric model is capable of accurately mapping snapshot to spectrum. A more interesting problem would be to consider the Schrödinger equation with an unknown functional form of the potential, where we train the model on simulations for many different potentials and initial conditions. Since this problem approaches the complexity of the self-gravitating systems we aim to study, we move on to a system described by the CBE and Poisson equation.

5.2 The Isothermal Plane

The isothermal plane model was developed by Spitzer (1942) and Camm (1950), and used to study the vertical structure of stellar discs by Freeman (1978) and van der Kruit & Searle (1981). A particular modification of this model has been shown to admit a DMD representation in Darling & Widrow (2019). Following this, we proceed with the isothermal plane to demonstrate how the framework proposed in Section 4 applies to a nonlinear, astrophysical model.
The equilibrium DF and corresponding density are given by

f_eq(z, v_z) = ρ_0 (2πσ_z^2)^{−1/2} e^{−E_z/σ_z^2}, (29)

and

ρ_eq(z) = ρ_0 e^{−Φ(z)/σ_z^2}, (30)

where E_z = v_z^2/2 + Φ(z) is the vertical energy and σ_z is the vertical velocity dispersion. This density, along with a corresponding equilibrium potential, comprises a potential-density pair satisfying the Poisson equation. Defining ρ_0 = σ_z^2/(2πG z_0^2), such a potential is

Φ_eq(z) = 2σ_z^2 ln cosh(z/z_0). (31)

As in Darling & Widrow (2019), we separate the gravitational potential into a time-independent part, which could come from masses external to the disc, and a time-dependent or live part coming from the mutual interactions of stars within the disc. Explicitly, the total potential is

Φ(z,t) = Φ_ext(z) + Φ_live(z,t), (32)

where Φ_ext = (1 − α) Φ_eq. The time-dependent term Φ_live corresponds to the mutual interaction of disc stars with masses reduced by a factor of α relative to the isolated case. This creates a scenario where the parameter α, herein called the live fraction, quantifies the relative dominance of self-gravity and the external potential in the evolution of the system. In the particular system realization here, σ_z = 20 km s^{−1} and z_0 = 500 pc, such that the surface density is Σ = 2 z_0 ρ_0 = 60 M_⊙ pc^{−2}.

Unlike in the harmonic oscillator example, the time evolution equation (equation 3) is nonlinear and we cannot in general find analytic expressions for eigenfunctions of the DF. Instead, we estimate a linear time evolution operator and its corresponding eigendecomposition with DMD.
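To make the coarse-grained observable concrete, the following sketch (our construction; the particle number, grid size, and grid extents are assumptions) draws equilibrium particles from equation 29 and bins them into an estimate of the DF on a (z, v_z) grid, anticipating the N-body data described below:

```python
import numpy as np

rng = np.random.default_rng()
sigma_z, z0 = 20.0, 500.0   # km/s and pc, the values used in this section
N, M = 100_000, 60          # particle number and grid size (assumed)

# Positions: rho_eq ~ sech^2(z/z0), with CDF P(z) = (1 + tanh(z/z0)) / 2,
# sampled by inverse transform.
u = rng.uniform(1e-9, 1.0 - 1e-9, size=N)
z = z0 * np.arctanh(2.0 * u - 1.0)
# Velocities: Maxwellian with dispersion sigma_z (equation 29 factorizes).
vz = rng.normal(0.0, sigma_z, size=N)

# Coarse-grained DF: counts per phase space cell (equal-mass particles),
# normalized by particle number and cell volume.
f_grid, z_edges, v_edges = np.histogram2d(
    z, vz, bins=M, range=[[-4 * z0, 4 * z0], [-4 * sigma_z, 4 * sigma_z]])
cell = (z_edges[1] - z_edges[0]) * (v_edges[1] - v_edges[0])
f_grid /= N * cell
```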
The data for our CNN model are generated by running simulations of the isothermal plane and then computing the eigendecomposition of the evolution operator with DMD. The inputs are isolated snapshots of the simulations taken at randomly generated times, biased to be away from the start-up period of the simulation. The outputs, which consist of the eigenfunctions and eigenvalues of the evolution operator in equation 15, are determined with DMD as described in Section 3. Each simulated isothermal plane has a randomly generated α sampled from a probability distribution. The parameter α acts analogously to the potential curvature parameter ξ in the Schrödinger equation example of Section 5.1. As in that example, we specify a mean value, ᾱ = 0.5, and a standard deviation, σ_α = 0.1. We generate 1200 input-output pairs, with live fractions drawn from the Gaussian distribution N(ᾱ, σ_α^2). From this data set we use 1000 pairs for training and 200 as a test set. For additional testing, to probe how well the model generalizes, we generate 5000 input-output pairs with α values sampled from the uniform distribution U(ᾱ − 3σ_α, ᾱ + 3σ_α).

To be clear, we do not simulate the flow of the DF f(q,p,t) generated by the CBE directly. Rather, we perform N-body simulations, evolving a finite number of discrete particles according to Hamilton's equations. It follows that the state vectors of our system consist of the kinematic information for each of these N particles. We do not construct our data matrices directly from these vectors, but rather map them into observables first. Our observables in this case are estimates of the DF f(q,p,t) on a grid of cells in q and p. Estimates of the DF are obtained from the number of particles per phase space cell, weighted by mass and divided by the cell volume. The resulting matrix for a single time step is then reshaped into a column vector, from which, for many time steps, we can build the DMD data matrix as in equation 14.

The problem of estimating G for this model is significantly more complex than in the previous example. We now have two-dimensional eigenfunctions, complex eigenvalues, and a time-dependent potential. Interestingly, however, the simple network architecture introduced in the previous example does not require great modification. We replace the one-dimensional configuration-space convolution operators with two-dimensional phase space ones and allow for complex eigenfunctions and eigenvalues. We also reduce the number of layers to speed up the training process. The CNN architecture is summarized in Tables A3 and A4. In principle, there should be a single mode with ω = 0, which we identify with equilibrium. However, in DMD it is common for the eigenvalue of this mode to have a small real part and zero imaginary part; formally, the mode then slowly decays. In this example, we side-step the problem by omitting the eigenvalue corresponding to the equilibrium component in G_ω. This allows the training process to focus on the complex eigenvalues of the time-dependent modes.

We train G_ψ and G_ω with learning rates of 5 × − and 10 −, and patience parameters of 200 and 50, respectively. The networks were evaluated on the test snapshots, for which we know the correct eigenvectors and eigenvalues. In this test we observe that the eigenvectors are predicted with a mean squared error of 1. × −, which corresponds to ∼ 0.3% of the average eigenvector magnitude. The eigenvalue model yields a mean squared error of 2. × −, which is ∼ 0.02% of the average eigenvalue magnitude.

Representative examples of model performance on test data are shown in Fig. 6 and Fig. 7 for the eigenfunction and eigenvalue models, respectively. Here we show test results for live fractions α = ᾱ and α = ᾱ ± 2σ_α. It is important to look at results on either side of the mean parameter value in this case, as the behavior of the system is drastically different for live fractions above and below α = ᾱ, in contrast to the ξ dependence in Section 5.1. In particular, for α close to zero we have kinematic particles that undergo phase mixing, while for α near unity the model is dominated by self-gravity, and bending and breathing modes (Weinberg 1991) are present.

Fig. 8 shows how the models generalize over the parameter space α ∈ [ᾱ − 3σ_α, ᾱ + 3σ_α]. Since the isothermal plane model requires N-body simulations, we are restricted in the amount of data we can generate. Therefore, we compute the average error and its standard deviation as a function of α, in analogy to the error densities in Figs. 4 and 5. The dependence of error on α is represented by a solid line for the mean value in each bin, with a transparent band representing the standard deviation. As was the case in Section 5.1, there is a general increase in error as one moves away from the mean, although there is a more pronounced asymmetry in the region α ∈ [ᾱ − σ_α, ᾱ + σ_α] for the eigenvalue predictor here than in the harmonic oscillator model, which we believe is a result of the smaller training set. Additionally, as with the harmonic oscillator models, the prediction errors are more tightly clustered around the maximum error for parameter values beyond 2σ_α from the mean. This indicates that the model will consistently under-perform in this region. For relatively small training sets like those used here, one should have sufficient knowledge of the system to be able to estimate the parameter to within two standard deviations of its true value.

6 DISCUSSION

The challenge in galactic dynamics is to construct models that incorporate time-dependent phenomena but are constrained by kinematic observations at a single epoch. The most promising approaches to this problem combine theory and simulations, each of which has its limitations. In particular, theoretical models often begin with the assumption that systems are close to equilibrium. On the other hand, it can be difficult to tune simulations to match observations, since the space of possible initial conditions can be dauntingly large and the likelihood function describing the goodness of fit between simulation and data difficult to define.

In this paper, we considered a model for the vertical structure of the Galactic disc with a single free parameter α quantifying the degree to which the system is self-gravitating. When self-gravity dominates, DMD identifies bending and breathing modes of the type discussed in Mathur (1990), Weinberg (1991), Widrow et al. (2014), and Widrow & Bonner (2015). Spiral modes are found when gravity is dominated by an external field (Darling & Widrow 2019). Here, we demonstrated that a CNN can recover the time-dependent DF of a system from a single snapshot of unknown α, with the correct phase space structure and temporal characteristics. This method can be applied to the Gaia data in an effort to better understand the physics of the z − v_z spirals found by Antoja et al. (2018).

A classic example of DMD from the field of fluid mechanics is the turbulent flow around an obstruction.
A similar situation might be at play in galactic discs, where spiral structure is continually generated through a combination of swing amplification and dynamical instabilities (Sellwood & Carlberg 2014, 2021). This spiral structure can then couple nonlinearly to bending of the disc, as described in Masset & Tagger (1997b). The methodology presented here provides a novel way of exploring these phenomena. Already, preliminary results indicate that DMD can identify modes that involve both m = 2 spirals and m = 1 warps (Widrow et al. 2019). The next step is to run a set of realistic Milky Way simulations on which a CNN can be trained to map a single snapshot to DMD modes.

As with other non-parametric methods, the model is only as good as the training set. Computational time will be an issue if the method is to be applied to full three-dimensional galaxy simulations, since the training data require computationally expensive N-body simulations. Nevertheless, we believe that DMD and CNNs are promising tools for efficiently extracting information about the dynamics of the Milky Way from the abundant incoming data from Gaia.

ACKNOWLEDGEMENTS
This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada. KD was supported by an Ontario Graduate Scholarship.
DATA AVAILABILITY STATEMENT
The data and software underlying this article are availableon reasonable request.
Figure 6.
Comparison of true and predicted eigenfunctions for the isothermal plane model. Panels A, B, and C correspond to live fractions of α = ᾱ − 2σ_α = 0.3, α = ᾱ = 0.5, and α = ᾱ + 2σ_α = 0.7.

Figure 7.
True and predicted eigenvalues of the evolution operatorfor live fractions of α = ¯ α − σ α = 0 . α = ¯ α = 0 . α = ¯ α + 2 σ α = 0 . REFERENCES
Antoja T., et al., 2018, Nature, 561, 360Binney J., 2013, New Astron. Rev., 57, 29Binney J., Tremaine S., 2008, Galactic Dynamics: Second Edition.Princeton University PressBland-Hawthorn J., et al., 2019, MNRAS, 486, 1167Bovy J., Murray I., Hogg D. W., 2010, APJ, 711, 1157Camm G. L., 1950, MNRAS, 110Chequers M. H., Widrow L. M., 2017, MNRAS, 472, 2751Darling K., Widrow L. M., 2018, MNRAS, 484, 1050Darling K., Widrow L. M., 2019, MNRAS, 490, 114Dozat T., 2015.Freeman K. C., 1978, in Berkhuijsen E. M., Wielebinski R., eds,IAU Symposium Vol. 77, Structure and Properties of NearbyGalaxies. pp 3–10
Figure 8.
Logarithm of the absolute error in the quantity Q as itdepends on the live-fraction α , with Q denoting either ψ or ω .The curves shown here are computed by computing the averagelogarithmic absolute error in α − bins. The red curve shows Q = ω and the blue curve shows Q = ψ . In both cases, the bandcorresponds to the standard deviation of the logarithmic errorwithin each bin.Gaia Collaboration et al., 2018, A&A, 616, A11Goodfellow I., Bengio Y., Courville A., 2016, Deep Learning. MITPressGreen G. M., Ting Y.-S., 2020, arXiv e-prints, p.arXiv:2011.04673Koopman B. O., 1931, Proceedings of the National Academy ofSciences, 17, 315Kutz J. N., Brunton S. L., Brunton B. W., Proctor J. L., 2016, Dy-namic Mode Decomposition: Data-Driven Modeling of Com-plex Systems. SIAMLaporte C. F. P., Minchev I., Johnston K. V., G´omez F. A., 2019,MNRAS, 485, 3134Li Z.-Y., Shen J., 2019, arXiv e-prints,Lusch B., Kutz J. N., Brunton S. L., 2018, Nature Communica-MNRAS , 1–12 (2021) nferring Distribution Functions tions, 9, 4950Masset F., Tagger M., 1997a, A&A, 318, 747Masset F., Tagger M., 1997b, A&A, 318, 747Masset F., Tagger M., 1997c, A&A, 322, 442Mathur S. D., 1990, MNRAS, 243, 529Mezi´c I., 2005, Nonlinear Dynamics, 41, 309Rowley C. W., Mezic I., Bagheri S., Schlatter P., HenningsonD. S., 2009, Journal of Fluid Mechanics, 641, 115–127Schmid P. J., 2010, Journal of Fluid Mechanics, 656, 5Sch¨onrich R., Binney J., 2018, MNRAS, 481, 1501Schwarzschild M., 1979, ApJ, 232, 236Sellwood J. A., Athanassoula E., 1986, MNRAS, 221, 195Sellwood J. A., Carlberg R. G., 2014, ApJ, 785, 137Sellwood J. A., Carlberg R. G., 2021, MNRAS, 500, 5043Spitzer L., 1942, APJ, 95Syer D., Tremaine S., 1996, MNRAS, 282, 223Tu J. H., Rowley C. W., Luchtenburg D. M., Brunton S. L., KutzJ. N., 2013, arXiv e-prints, p. arXiv:1312.0041Weinberg M. D., 1991, ApJ, 373, 391Widrow L. M., Bonner G., 2015, MNRAS, 450, 266Widrow L. M., Kaiser N., 1993, ApJ, 416, L71Widrow L. M., Barber J., Chequers M. H., Cheng E., 2014, MN-RAS, 440, 1971Widrow L. M., K. D., Li H., 2019, Proceedings of the InternationalAstronomical Union, 14, 65–70van der Kruit P. C., Searle L., 1981, A&A, 95, 105 APPENDIX A: NEURAL NETWORKS
Here we provide a basic overview of some important details of NNs in the context of the problem described in Section 4, supplying only the information necessary for the concepts discussed in the main body of this paper. We refer the reader to Goodfellow et al. (2016) for a comprehensive discussion.

For our purposes, a NN is simply a means of representing a function G : x ↦ y that maps some quantity x to another, y. In this section we use x to denote an arbitrary vector input of a NN, and X an arbitrary matrix or tensor input. Similarly, we let y be an arbitrary vector output, and Y a matrix or tensor output. We make these distinctions because the dimensions of the input and output spaces of G depend on the dimension of the system state space. Explicitly, we have the idealized function

y = G(x) ,    (A1)

where the vector quantities may be exchanged for matrix or tensor quantities. We seek a representation of this function that we can use to make predictions for y. Consider then a particular realization

y = G̃(x, θ) ,    (A2)

where θ is a vector of parameters that determine the particular realization. We construct NNs to be particular realizations of G, which we train to determine an optimal set of parameters θ. The exact role the parameters play depends on the architecture of the network. In what follows we give a brief overview of two basic types of architecture. We do not make explicit use of the first in this paper, but it serves as a starting point for the second, which is used in Section 5.

Figure A1. Feed-forward neural network with zero hidden layers, equivalent to a two-dimensional linear regression problem.
A1 Feed Forward Neural Networks
The simplest type of NN is the feed forward network or multi-layer perceptron (MLP). At this point it is important to consider the model as segmented into layers: an input layer, an output layer, and a set of hidden layers in between. For MLPs, layers are comprised of parallel nodes. The number of nodes in the input and output layers must match the sizes of x and y respectively, but the number of nodes per hidden layer can be chosen independently. We consider only fully-connected networks here; that is, between two adjacent layers k and k + 1, all nodes in layer k are connected to all nodes in layer k + 1, as shown in Fig. A1 or A2. In the simplest case, the mapping from layer k to layer k + 1 is linear, prescribed by a matrix of weights W and a vector of biases c. For a two-layer, linear network,

y = W^⊤ x + c .    (A3)

A graphical representation of this for the two-dimensional case, x, y ∈ R², is shown in Fig. A1. Note that this is just a multivariate linear equation, and the problem of determining the optimal parameters is the familiar linear regression problem.

There is no reason to expect that the function we are after, G, should in general be well represented by a linear function, so we would like the NN realization to be nonlinear if necessary. To introduce nonlinearity, we may apply a nonlinear function h(x), called the activation function, in the mapping from layer to layer. The two-layer case in equation (A3) then becomes

y = h(W^⊤ x + c) .    (A4)

A commonly used nonlinear activation function is the rectified linear unit (ReLU), given by

ReLU(x) = max{0, x} .    (A5)

In this paper we use only linear, h(x) = x, and ReLU activation functions, opting for ReLU at all layers except the output, where we choose a linear activation function. For more on activation functions, see Goodfellow et al. (2016).

Note that in equations (A4) and (A5) there are zero hidden layers. We now move to a simple four-layer model (two hidden layers). The architecture of this network is shown in Fig. A2, and the corresponding equations describing the mapping from input to output are

y_1 = ReLU(W_1^⊤ x + c_1)
y_2 = ReLU(W_2^⊤ y_1 + c_2)
y = W_3^⊤ y_2 + c_3 .    (A6)
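As a concrete illustration, the forward pass defined by equations (A3)–(A6) can be written in a few lines of NumPy. This sketch is purely illustrative: the layer sizes match the 2 → 3 → 3 → 2 example of Fig. A2, and the weights are random placeholders rather than trained values.

import numpy as np

def relu(x):
    # Rectified linear unit, equation (A5): ReLU(x) = max{0, x}
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected MLP (equations A6/A7):
    ReLU on all layers except the last, which is linear."""
    y = x
    for W, c in zip(weights[:-1], biases[:-1]):
        y = relu(W.T @ y + c)                 # hidden layers: y_i = ReLU(W_i^T y_{i-1} + c_i)
    return weights[-1].T @ y + biases[-1]     # output layer: linear activation

# Four-layer (two hidden layer) example: 2 -> 3 -> 3 -> 2
rng = np.random.default_rng(0)
weights = [rng.normal(size=s) for s in [(2, 3), (3, 3), (3, 2)]]
biases = [np.zeros(3), np.zeros(3), np.zeros(2)]
print(mlp_forward(np.array([0.5, -1.0]), weights, biases))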
Figure A2. Simple example of a feed forward neural network with two hidden layers. Input and output nodes are shown in black, and hidden nodes are gray.

Table A1. Tabular description of the network represented by Fig. A2.

Type    Nodes   Activation
Dense   2       ReLU
Dense   3       ReLU
Dense   3       ReLU
Dense   2       Linear

The network in Fig. A2 can also be described by Table A1; a sketch of how such a table translates into code is given below. We will use only tables for the larger networks employed in Section 5.
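Written out in code, the network of Table A1 might look as follows, assuming a Keras/TensorFlow stack (the paper does not specify its software) and reading the table's first row as the two-node input layer:

import tensorflow as tf

# The 2 -> 3 -> 3 -> 2 network of Fig. A2 / Table A1.
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(2,)),
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.Dense(2, activation="linear"),
])
model.summary()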
The complexity of the network can be increased or decreased by adjusting the number of hidden layers and the number of nodes per layer. The mapping from input to output for an arbitrary MLP is a simple extension of equations (A6); that is, for N layers we have

y_i = ReLU(W_i^⊤ y_{i−1} + c_i) ,  i = 1, 2, ... N − 2 ,  y_0 = x
y = W_{N−1}^⊤ y_{N−2} + c_{N−1} .    (A7)

A2 Convolutional Neural Networks
In this paper we use convolutional neural networks (CNNs). As the name implies, CNNs replace the linear transformation (the weight matrix) applied at every layer of an MLP before the activation function with a convolution of the layer input with a filter kernel. That is, MLP weight matrices are replaced with kernels, and the linear transformation operation is replaced with convolution. One can think of the number of filters per layer in a CNN as analogous to the number of nodes per layer in an MLP. The parameters to be learned by the network to best model the function G are the elements of the kernels. In Fig. 1 we show a detailed view of a simple example convolutional layer.

We also use what are called pooling layers in CNNs. These layers do not have any learnable parameters; they typically apply a sampling function that computes a summary statistic of segments of the input to produce a size-reduced matrix at the output. The user specifies a "stride" that determines the sampling resolution. For example, if one uses a stride-2 maximum pooling layer (denoted MaxPool-2 in this paper) on a 60 × 60 input, the output is a 30 × 30 matrix where each element is the maximum of the corresponding 4-element block in the input matrix; a short sketch of this operation is given below.
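A minimal NumPy sketch of the MaxPool-2 operation just described; deep learning libraries provide this as a built-in layer, so the explicit implementation here is for illustration only:

import numpy as np

def max_pool_2(x):
    """Stride-2 maximum pooling: each output element is the maximum of
    the corresponding 2x2 (4-element) block of the input matrix."""
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "input dimensions must be even"
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.random.default_rng(1).normal(size=(60, 60))
print(max_pool_2(x).shape)  # (30, 30), as in the example above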
It is common to combine convolutional layers with the linear transformation based layers of the MLP in applications where convolutional layers are desired, but the output of the function to be modeled is not necessarily suited to the outputs of convolutional layers. An example of such a model is shown in Fig. A3.

Figure A3. Simple example of a NN model comprised of convolutional, pooling, and MLP layers. The input and output layers are portrayed in dark gray. Light gray layers have a ReLU activation function. MLP layers are denoted by one-dimensional data volumes and are labeled by their weight matrices W. Three-dimensional data volumes denote convolutional and pooling layers. Convolutional layers are labeled by their filter kernels K, and pooling layers are shown in red.

A3 Training
To train a model, we determine the optimal set of parameters such that G̃ best approximates the desired function G. How to go about this is an active area of research, and we only briefly summarize the simplest training scenario to provide context for Section 5. First we need data with which to train the network: we provide inputs x and have the model predict outputs ỹ. We then use a loss function L(y, ỹ, θ) to quantify how well the network predicts the true y. Optimization of the parameters can be achieved by applying gradient descent algorithms to the loss function. In our models we use the NADAM algorithm (Dozat 2015). For our purposes, the relevant quantities for training are the learning rate, which sets the step size in the gradient descent algorithm, and the patience parameter, which sets the number of iterations the optimization algorithm should tolerate with marginal change in the loss function.
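The simplest version of this procedure can be sketched as follows, assuming a Keras/TensorFlow stack (the paper does not name its software). The model, data, learning rate, and patience values are placeholders, and the patience parameter is realized here through an early-stopping callback:

import numpy as np
import tensorflow as tf

# Toy stand-in for one of the predictor networks.
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(8,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="linear"),
])

# NADAM optimizer; the learning rate sets the gradient descent step size.
model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=1e-3),
              loss="mse")   # loss L(y, y_pred) quantifies prediction error

# Patience: tolerate this many epochs of marginal change in the loss.
stopper = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)

rng = np.random.default_rng(2)
x_train, y_train = rng.normal(size=(256, 8)), rng.normal(size=(256, 4))
model.fit(x_train, y_train, validation_split=0.2,
          epochs=100, callbacks=[stopper], verbose=0)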
Table A2. Shared network architecture for both harmonic oscillator predictors. The filters consist of 9-element vectors. G_ψ uses mean squared error loss and has output shape N_o = r × M_q. G_ω uses mean absolute error loss and has output shape N_o = r.

Type             Nodes/Filters   Activation
Convolution 1D   8               ReLU
Convolution 1D   8               ReLU
MaxPool-2
Convolution 1D   16              ReLU
Convolution 1D   16              ReLU
MaxPool-2
Convolution 1D   32              ReLU
Convolution 1D   32              ReLU
MaxPool-2
Convolution 1D   16              ReLU
Convolution 1D   16              ReLU
MaxPool-2
Reshape
Dense            N_o             Linear
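As with Table A1, the Table A2 description maps directly onto code. A sketch assuming Keras/TensorFlow, where the input shape, the "same" padding, and the use of Flatten for the table's Reshape row are all assumptions not fixed by the table:

import tensorflow as tf

def build_shared_predictor(n_samples, n_channels, n_out):
    """Sketch of the Table A2 architecture: pairs of 1D convolutions with
    9-element filters, separated by stride-2 max pooling, ending in a
    linear dense layer of size N_o (= r x M_q for G_psi, r for G_omega)."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(n_samples, n_channels)))
    for n_filters in (8, 16, 32, 16):
        model.add(tf.keras.layers.Conv1D(n_filters, 9, padding="same",
                                         activation="relu"))
        model.add(tf.keras.layers.Conv1D(n_filters, 9, padding="same",
                                         activation="relu"))
        model.add(tf.keras.layers.MaxPooling1D(pool_size=2))
    model.add(tf.keras.layers.Flatten())  # the table's Reshape row
    model.add(tf.keras.layers.Dense(n_out, activation="linear"))
    return model

# Hypothetical shapes: e.g. G_omega with r = 10 modes.
model = build_shared_predictor(n_samples=64, n_channels=1, n_out=10)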
Table A3. Network architecture of G_ψ for the isothermal plane model. The filters have 3 × 3 elements.

Type             Nodes/Filters     Activation
Convolution 2D   8                 ReLU
MaxPool-2
Convolution 2D   8                 ReLU
MaxPool-2
Convolution 2D   16                ReLU
MaxPool-2
Convolution 2D   16                ReLU
MaxPool-2
Reshape
Dense            r × M_q × M_p     Linear

Table A4. Network architecture of G_ω for the isothermal plane model. The filters have 3 × 3 elements.

This paper has been typeset from a TeX/LaTeX file prepared by the author.