Discovering conservation laws from trajectories via machine learning
Seungwoong Ha and Hawoong Jeong∗
Department of Physics, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea
(Dated: February 9, 2021)

Invariants and conservation laws convey critical information about the underlying dynamics of a system, yet it is generally infeasible to find them without any prior knowledge. We propose ConservNet to achieve this goal, a neural network that extracts a conserved quantity from grouped data where the members of each group share invariants. Trained with a novel and intuitive loss function called noise-variance loss, ConservNet learns the hidden invariants in each group of multi-dimensional observables in a data-driven, end-to-end manner. We demonstrate the capability of our model with simulated systems having invariants as well as a real-world double pendulum trajectory. ConservNet successfully discovers underlying invariants from these systems from a small number of data points, namely fewer than several thousand. Since the model is robust to noise and data conditions compared to the baseline, our approach is directly applicable to experimental data for discovering hidden conservation laws and relationships between variables.

∗ Also at Center for Complex Systems, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea; [email protected]
I. INTRODUCTION
Modern science greatly depends on the mathematical modeling of given systems and finding the internal structures between observables. One of the most important concepts in system modeling is the invariants that underlie the system dynamics, which provide significant information about structural symmetries and low-dimensional embeddings of the system. Invariants and symmetries are fundamental building blocks of nearly all physical systems in nature, such as classical systems with Hamiltonians, gauge orbits, trajectories under Lorentz transformations, and many dynamical systems with coupled differential equations. Scientists have long attempted to identify the hidden correlations and interactions among the state variables of such systems by discovering the conserved quantities and underlying symmetries.

Recently, amid phenomenal advances in machine learning in the physical sciences [1–11], various studies have contributed to the automation of science [12–21], referring to current efforts to reveal scientific concepts and construct models solely from observed data without human intervention. Following this line, several studies have attempted to accomplish the automated discovery of conserved quantities with neural networks [18–21]; limitations of these works, though, include the requirement of additional non-automated preprocessing, the need for a great number of datasets from different conditions, and the ability to only infer the number of invariants. Real-world empirical data are often sparse, noisy, and scattered into small groups, and hence a model for automated discovery needs to be robust to such harsh conditions.

In this study, we introduce ConservNet, a neural network to discover conserved quantities in grouped data, such as trajectories, without any prior knowledge of the system. Instead of explicitly restricting the model to ensure certain symmetries, we propose a novel loss function that facilitates the model to learn the boundary of the manifold shaped by conservation laws. We show that ConservNet robustly finds a conserved quantity by reducing the intra-group variance of its output while preventing convergence into trivial constant functions. Our model can be applied to a variety of realistic data conditions, is robust to noise and nuisance variables, and employs a pipeline from raw data to invariants in an end-to-end manner that enables the direct extraction of symbolic formulas. We examine the capability of ConservNet to discover conserved quantities by applying it to five model systems ranging from synthetic invariants to physical models that cover diverse functional forms, along with experimental trajectory data of a double pendulum. The robustness of our method strongly demonstrates the potential of ConservNet to be applied to real systems where data is sparse and no conservation laws are known.
II. METHODS AND MATERIALS

A. Noise-Variance Loss
Throughout this paper, the data condition (N, M) indicates that the data is divided into N groups, in which each group shares the same invariant and has M data points. Our goal is to find conserved quantities hidden in such grouped d-dimensional data that are expected to have at least one invariant. In this study, we assume that the system has an invariant function V that satisfies V(x_ij) = C_i for all x_ij ∈ G_i, where G_i denotes the i-th group. Also, we presume that V is meaningful, implying that the invariant function should show its invariance only for the data with the corresponding symmetry. More formally, let us consider a set of manifolds S = {M_1, M_2, ...} in which each manifold M_i is constructed by removing a degree of freedom, in the form of a conservation law with value C_i, from the full space R^d. We presume that our invariant function satisfies V(x_ij) = C_i if and only if x_ij ∈ M_i.

FIG. 1. Overview of ConservNet and schematic description of the role of noise-variance loss. Each group of data x_ij, which is a time series of planet trajectories in this example, is fed into model F_θ and optimized to minimize the variance loss Var(F_θ(x_ij)) and the spreading loss |Q − Var(F_θ(x_ij + ε_ij))|, with spreading constant Q and spreading noise ε_ij restrained by R = max(||ε_ij||).

In order for the model to properly approximate the invariant, it needs to satisfy two important criteria. First, the desired model should produce a ground-truth conserved quantity C, or at least a value strongly correlated with the true invariant. Second, the model output from the same group should be equal in the ideal case, or at least its deviation should be minimized.

To satisfy the second criterion, the loss function L for the neural output F_θ should decrease the intra-group variance of the outputs from each group, and thus the following variance term should be minimized:

  L_{i,var} = Var(F_θ(x_ij))                                          (1)
            = (1/M) Σ_j F_θ(x_ij)² − [ (1/M) Σ_j F_θ(x_ij) ]²,        (2)

where x_ij ∈ R^d is the j-th input data point of dimension d from group i.

Here, naive optimization of this loss function will generally fall into a trivial minimum. As an example, the whole class of simple multivariate functions f : R^d → C for any real value C becomes one of the global minima of Eq. (2), since any input will yield a constant and the variance becomes zero. Convergence to such a trivial solution would violate the first criterion in our case.

Thus, we have to prevent F_θ from trivial convergence in order to capture a meaningful invariant, which we desire to be a true invariant. One way to inhibit this is by adopting a spreading term that increases the variance of the output from improper input, such as input with noise. In this work, we call such outputs from noisy inputs noised outputs. The key insight here is that, since a constant function would yield constant values for every input, we induce the model not to yield constant values from any input data that is perturbed away from the proper manifold M ⊂ R^d. This spreading loss can be expressed as

  L_{i,noise} = Var(F_θ(x_ij)) + |Q − Var(F_θ(x_ij + ε_ij))|,         (3)

where Q is the spreading constant and ε_ij denotes a Gaussian noise vector, the L₂ norm of which is bounded by R = max(||ε_ij||). Here, Q restricts the absolute value of the variance of the noised output, since optimization without this constraint would lead F_θ into a diverging function, ignoring the variance minimization term. Thus, the relative scale of Q and R controls the fineness of the spreading. Combining the two terms and summing over all groups, the loss function for ConservNet becomes

  L = Σ_i L_i = Σ_i [ Var(F_θ(x_ij)) + |Q − Var(F_θ(x_ij + ε_ij))| ].  (4)

We propose this new loss function for capturing a conserved quantity as the noise-variance (NV) loss, as schematically depicted in Fig. 1. One can regard NV loss as a competition between two pressures: one that reduces the variance of the outputs from the given data, and another that increases the variance of the noised output. This balance therefore promotes the neural network to accurately separate the boundary of the manifold from nearby points.
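For concreteness, the NV loss of Eq. (4) can be written in a few lines of PyTorch. This is a minimal sketch under our own conventions, not the authors' released code: the tensor layout (one group per row), the helper name nv_loss, and the way the noise norm is bounded by R are all illustrative assumptions.

import torch

def nv_loss(model, x, Q=1.0, R=1.0):
    # x: (N, M, d) tensor; one row of M d-dimensional points per group.
    eps = torch.rand_like(x) * 2 - 1                     # uniform in [-1, 1]^d (assumed sampler)
    norms = eps.norm(dim=-1, keepdim=True)
    eps = eps * (R / norms).clamp(max=1.0)               # rescale into the L2 ball of radius R
    var_data = model(x).squeeze(-1).var(dim=1)           # intra-group variance, shape (N,)
    var_noised = model(x + eps).squeeze(-1).var(dim=1)   # variance of the noised outputs
    return (var_data + (Q - var_noised).abs()).sum()     # Eq. (4)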
ConservNet attempts to find a function that reduces intra-group variance while not being a trivial constant function. We prove that NV loss inhibits trivial convergence by implicitly preventing the divergence of F_θ from becoming zero in Appendix A.

TABLE I. Systems and invariants for verification. We use fixed parameter values α, β, δ, γ for the Lotka–Volterra system and m = 1, GM = 1 for the Kepler problem. For the double pendulum case, the Hamiltonian of the ideal model is given.

System               Invariant formula
S1                   C = x₁² − x₁x₂ + 3x₃x₄
S2                   C = 3x₁ + 2 sin(x₂) + √|x₃| x₁
S3                   C = 2x₁x₂ − (ln(|x₃ + x₄|) − x₁)/x₂
Lotka–Volterra       C = α ln(x) + δ ln(y) − βx − γy
Kepler problem       C₁ = xv_y − yv_x
                     C₂ = (m/2)(v_x² + v_y²) − GMm/r
                     C₃ = p × L − mk r̂
Double pendulum      C_ideal = ½(m₁ + m₂)L₁²ω₁² + ½m₂L₂²ω₂²
(experiment)                   + m₂L₁L₂ω₁ω₂ cos(θ₁ − θ₂)
                               − gL₁(m₁ + m₂) cos(θ₁) − gm₂L₂ cos(θ₂)

B. Neural model construction and training
ConservNet is a feed-forward neural network constructed with 4 hidden layers with a layer width of 320 neurons and a single output neuron, using Mish [22] as a non-linear activation function. Our model receives target system data x_ij and produces a single scalar value F_θ(x_ij) that aims to approximate the mapping function from states on the manifolds to conserved quantities. The noise vector ε_ij is newly sampled at every mini-batch. In practice, we employ the standard deviation σ(x) instead of the variance Var(x) = σ²(x), and thus Eq. (4) becomes

  L = Σ_i L_i = Σ_i [ σ(F_θ(x_ij)) + |Q − σ(F_θ(x_ij + ε_ij))| ].     (5)

As a baseline for comparison, we trained a recently proposed Siamese neural network (SNN) [20] along with our model. This SNN architecture extracts an invariant by classifying whether two data points are from the same instance or not, similar to [18]. Both ConservNet and the SNN are trained with the Adam [23] optimizer using PyTorch [24] for 50,000 epochs with early stopping. For all experiments, Q = 1 and the spreading noise vector ε_ij is sampled as a uniform random vector with maximum norm R = 1 (see Appendix F for details on hyperparameter selection).
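A sketch of the described architecture in PyTorch (the class name and constructor arguments are ours; nn.Mish implements the activation of [22]):

import torch.nn as nn

class ConservNet(nn.Module):
    # Feed-forward network: 4 hidden layers of width 320, Mish activations,
    # and a single output neuron approximating F_θ(x_ij).
    def __init__(self, in_dim, width=320, n_hidden=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(d, width), nn.Mish()]
            d = width
        layers.append(nn.Linear(d, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)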
C. Model systems and datasets

In this study, the ability of ConservNet is tested with three synthetic systems, two simulated model systems, and a real double pendulum dataset from [13]. The functional form of each invariant is presented in Table I. The three synthetic systems S1, S2, and S3 are constructed to show a variety of functional forms such as polynomial, trigonometric, logarithmic, and rational functions.

FIG. 2. Model performances of ConservNet. (a) Ground-truth invariants C versus fitted ConservNet outputs F̂_θ = aF_θ + b for S1, S2, S3, the Lotka–Volterra equation, and the Kepler problem, plotted under data condition (20, 100) with R² values of 0.9994, 0.9919, 0.9918, 0.9866, and 0.9974 for the respective panels. Points with the same color share the same invariant values but are plotted at jittered values for visualization. The black dots in each group indicate the invariant and mean output value of that group, with an identity line (black, dotted) drawn for comparison. Error bars denote the standard deviation of each group of data points. For each of the five model systems, (a, b) denotes the fitted linear coefficients given in each panel. (b) Training loss, test loss, Pearson correlation (ρ), and mean intra-group standard deviation (σ̄) while training 16,000 epochs of ConservNet for invariant S1.

For the Lotka–Volterra system (dx/dt = αx − γxy, dy/dt = −βy + δxy) [25] and the Kepler problem (H_Kepler = p²/2m − GMm/r), data are simulated by numerical integration with Euler's method. We find that normalizing the data to match the scale between variables improves overall performance, and thus variables with maximum values exceeding 10 are rescaled by a factor of 0.1. Rescaling inputs also encourages the model output to be more linear in the true invariant; although unsupervised neural models can learn an arbitrary function h(C(x)) of the invariant, the output can still be linearized as h(C(x)) ≈ aC(x) + b with constants a and b if the output range is restricted to a small region.
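The linear fit F̂_θ = aF_θ + b used throughout the figures can be obtained by ordinary least squares; a minimal sketch (the helper name is ours):

import numpy as np

def linearize(f_out, c_true):
    # Fit a, b so that a * F_θ + b best matches the ground-truth invariant C.
    A = np.stack([f_out, np.ones_like(f_out)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, c_true, rcond=None)
    return a * f_out + b, (a, b)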
III. RESULTS

We prepare 2,000 training data points with various data conditions (N, M) and an equal number of test data for all simulated systems, which is a notably small amount compared to modern deep learning and other related studies [19–21] that typically employ more than 10,000 data points.

A. Model performances
The model performance of ConservNet is evaluated by the two aforementioned criteria: high correlation with the ground-truth invariant and small intra-group variance. We use the Pearson correlation ρ and the mean intra-group standard deviation σ̄ = (1/N) Σ_i σ_i for each criterion. Figure 2(a) illustrates the notable performance of ConservNet, achieving strong Pearson correlation and small intra-group variation in every model system. For the case of multiple invariants in the Kepler problem, ConservNet captures the angular momentum first and finds the energy second when the angular momentum is controlled (see Appendix E for the analysis of multiple invariants). Figure 2(b) shows training statistics of ConservNet for S1: as training proceeds, σ̄ decreases and ρ approaches 1. ConservNet shows consistent performance for different data conditions (N, M), as presented in Appendix F.
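Both evaluation criteria are straightforward to compute; a sketch, assuming flat arrays of outputs, ground-truth invariants, and integer group labels:

import numpy as np

def evaluate(f_out, c_true, group_ids):
    # Pearson correlation with the ground-truth invariant ...
    rho = np.corrcoef(f_out, c_true)[0, 1]
    # ... and mean intra-group standard deviation: sigma_bar = (1/N) sum_i sigma_i.
    sigma_bar = np.mean([f_out[group_ids == g].std() for g in np.unique(group_ids)])
    return rho, sigma_bar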
B. Model robustness
We further investigate the capability and robustness of ConservNet by applying several different conditions prevalent in experimental data, and then compare the performance with the baseline SNN. First, we check the impact of noise on the datasets by adding Gaussian noise N(0, s²) with various strengths s. Figure 3(a) shows that ConservNet gives consistent performance under the noised condition, with better correlation compared to the baseline. We find that if the data consists of pure noise without any invariants, our model alerts us to this through strong overfitting, with high test loss and intra-group variance, as presented in Appendix F.

FIG. 3. Robustness of ConservNet (CN). (a) Pearson correlation of ConservNet and the SNN for invariants S1, S2, and S3 with varying noise strength s. (b) Pearson correlation for the reinforced datasets (S2+ and Kepler+) that include nuisance variables not appearing in the invariants.

In a real scenario, there might be irrelevant variables in an observed dataset that do not compose the meaningful invariant. Filtering out such nuisance variables is crucial for data-driven discovery without any prior knowledge. We test our model with two reinforced datasets. First, we concatenate one extra variable x₄ ~ N(0, 1) to the S2 dataset to construct S2+ with a noisy variable. Second, we transform the data of the Kepler problem from Cartesian coordinates (x, ẋ, y, ẏ) into polar coordinates (r, ṙ, θ, θ̇) to construct Kepler+. In the polar coordinates, θ becomes a cyclic coordinate and neither ṙ nor θ appears in the angular momentum r²θ̇, different from the original Cartesian form xẏ − yẋ, where all of the state variables appear. The Pearson correlation of each case is plotted in Fig. 3(b), where ConservNet shows robust performance even with the existence of the nuisance variables and coordinate transformation. We find that the SNN strongly overfits and shows low performance when there are unused variables, possibly due to the nature of classifiers and the absence of a proper regularizer.
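The two reinforced datasets are simple transformations of the originals; a sketch of their construction (function names are ours):

import numpy as np

def make_kepler_plus(x, vx, y, vy):
    # Cartesian -> polar transform for Kepler+; the angular momentum
    # becomes r**2 * theta_dot, in which neither r_dot nor theta appears.
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    r_dot = (x * vx + y * vy) / r
    theta_dot = (x * vy - y * vx) / r**2
    return np.stack([r, r_dot, theta, theta_dot], axis=-1)

def make_s2_plus(data, rng=np.random.default_rng(0)):
    # Append the nuisance variable x4 ~ N(0, 1) to the S2 data.
    return np.concatenate([data, rng.normal(size=(len(data), 1))], axis=1)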
C. Single trajectory of real data

Finally, we apply our model to a real double pendulum trajectory from [13], which is a challenging task in a number of ways. According to [13], the data is extracted from raw video via a motion tracking system, and hence it does not strictly obey any conservation laws due to noise and friction. This means that even a perfectly trained model output will not be the same as the ideal Hamiltonian of a double pendulum. Furthermore, the model has to discover the invariant in an extreme data condition where only a single trajectory (N = 1) with a limited number of data points (M = 654) is available for training. Note that the SNN is inapplicable to this case since it needs at least two groups of data to compare (N ≥ 2).

Figure 4(a) shows that the model output F_θ stably remains constant for the training and test trajectories but not for the noised trajectory, verifying that ConservNet falls into neither trivial convergence nor overfitting and properly captures the functional form of the invariant. We further check two-dimensional cross-sections of the model output by fixing two variables and varying the other two, and compare them with the cross-sections of the ideal four-dimensional Hamiltonian constructed with the constants from [13]. The results are shown in Fig. 4(b). Considering the inherent friction and the restricted regions of the data points, both heatmaps are similar enough that inference of the abstract functional form is possible, especially in the regions where the data points are present. To summarize, ConservNet successfully captured the conserved quantity from a real double pendulum system under extreme data conditions.

FIG. 4. ConservNet results for real double pendulum data. (a) Model output F_θ(x) and the noised model output F_θ(x + ε) with ε = (ε₁, ε₂) (top), and double pendulum trajectories θ₁, θ₂ and noised trajectories (bottom) versus time. Data in the shaded area are used for training (from 0 s to 6.54 s), with the remaining data used for testing (6.54 s to 8.18 s). (b) 2D heatmaps of the model output F_θ (left) and the ideal Hamiltonian (right) for (θ₁, θ₂, ω₁, ω₂) = (0, 0, ω₁, ω₂) (top), where the cross-section takes the form C = a₁ω₁² + a₂ω₂² + a₃ω₁ω₂ + a₄, and for (θ₁, θ₂, ω₁, ω₂) = (θ₁, θ₂, 10, 10) (bottom), where C = b₁ cos(θ₁ − θ₂) + b₂ cos(θ₁) + b₃ cos(θ₂) + b₄. The training data points are scattered in the left panels, while the ideal formulas for the cross-sections are presented in the right panels. Here, the ideal heatmaps are drawn with constants (a₁, a₂, a₃, a₄) and (b₁, b₂, b₃, b₄) determined from the physical parameters of [13].
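For reference, the ideal energy of Table I against which the right panels of Fig. 4(b) are drawn can be evaluated as follows (the physical parameters m₁, m₂, L₁, L₂ must be taken from the experimental setup of [13]):

import numpy as np

def ideal_energy(th1, th2, w1, w2, m1, m2, L1, L2, g=9.81):
    # Kinetic part of the ideal double-pendulum Hamiltonian ...
    T = (0.5 * (m1 + m2) * L1**2 * w1**2 + 0.5 * m2 * L2**2 * w2**2
         + m2 * L1 * L2 * w1 * w2 * np.cos(th1 - th2))
    # ... plus the gravitational potential part.
    V = -(m1 + m2) * g * L1 * np.cos(th1) - m2 * g * L2 * np.cos(th2)
    return T + V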
IV. DISCUSSION

We have introduced ConservNet, a neural framework to discover a conserved quantity from grouped data. Interpreting the results of the model output shows that ConservNet learns meaningful invariants in systems with various data conditions, even from a single trajectory. In real practice, where the ground-truth invariant is unknown, we may identify the symbolic form of the invariant by sorting the output values and employing off-the-shelf polynomial regression or symbolic regression algorithms. We illustrate a result of such an application for invariant S1 in Appendix D.

ACKNOWLEDGMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1A2B3006930).
Appendix A: Proof for Noise-Variance Loss

1. Simple loss and its limitation
In this study, we assume that the dataset consists of N groups with group size M and has a meaningful invariant function V that satisfies V(x_ij) = C_i for all x_ij ∈ G_i, where G_i denotes the i-th group.

One may construct a simple loss function with a variance-decreasing term only, such as

  L_simple = Σ_i L_{simple,i} = Σ_i Var(F_θ(x_ij)),                   (A1)

where x_ij is an input data point. It is simple to show that, by performing gradient descent on L_simple for the network parameters θ, the outputs from the same group will approach the same constant value.

Theorem A.1. The global minimum of the functional L_simple is F_θ(x_ij) = C_i with some constant C_i for all groups.

Proof of Theorem A.1. The neural model F_θ tunes the output by optimizing F_{θ,ij} = F_θ(x_ij) through the network parameters θ. The condition on L_{simple,i} for a stationary point becomes

  ∂L_{simple,i}/∂F_{θ,ij} = ∂ Var(F_{θ,ij}) / ∂F_{θ,ij} = 0           (A2)

for all F_{θ,ij}. By expanding the variance function with μ_i^F = (1/M) Σ_j F_{θ,ij}, we get

  ∂L_{simple,i}/∂F_{θ,ij}
    = (1/M) ∂/∂F_{θ,ij} Σ_{j'} (F_{θ,ij'} − μ_i^F)²
    = (2/M) Σ_{j'} (F_{θ,ij'} − μ_i^F) ∂(F_{θ,ij'} − μ_i^F)/∂F_{θ,ij}
    = (2/M) [ (F_{θ,ij} − μ_i^F) − (1/M) Σ_{j'} (F_{θ,ij'} − μ_i^F) ]
    = (2/M) (F_{θ,ij} − μ_i^F) = 0,                                   (A3)

since Σ_{j'} (F_{θ,ij'} − μ_i^F) = 0. This means F_{θ,ij} = μ_i^F at the stationary point, indicating that every F_{θ,ij} is a constant C_i = μ_i^F. Also, by checking its second derivative, we get

  ∂²L_{simple,i}/∂F_{θ,ij}² = ∂/∂F_{θ,ij} (2/M)(F_{θ,ij} − μ_i^F) = (2/M)(1 − 1/M) > 0,   (A4)

since M is a natural number. Hence, this stationary point is the global minimum.

Corollary A.1.1. The constant function F_const(x) = C for any x ∈ R^d satisfies the condition for the global minimum of the functional L_simple.

Obviously, the intra-group variance will be 0 if all of the model outputs from the same group become the same constant. But zero intra-group variance is not a sufficient condition for a meaningful invariant, as mentioned in the main manuscript. Any modern deep learning architecture with perceptrons and feed-forward networks (including ConservNet) can express the constant function by reducing the weights to zero (i.e., ignoring the input completely), and hence is prone to learning such a simple function rather than a meaningful invariant. In Fig. 5, we can see that the model trained with L_simple falls into this trap; it does not properly capture the given invariant, and instead shows nearly constant behavior even though the training and test losses rapidly converged to zero in the early stages of training.

FIG. 5. Model performance of ConservNet when the model is trained with L_simple. (a) Training loss, test loss, correlation with the ground-truth invariant (ρ), and mean intra-group standard deviation (σ̄) during training for 20,000 epochs for invariant S1. (b) Ground-truth invariants C versus fitted ConservNet outputs F̂_θ = aF_θ + b for invariant S1 (R² = 0.5317).
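The stationarity condition of Eq. (A3) is easy to verify numerically with automatic differentiation; a small sanity check we add for illustration:

import torch

# Check that d Var(F) / d F_j = (2/M) * (F_j - mean(F)), as in Eq. (A3).
F = torch.randn(8, requires_grad=True)     # M = 8 outputs of a single group
F.var(unbiased=False).backward()           # population variance, as in the proof
expected = (2.0 / len(F)) * (F.detach() - F.detach().mean())
print(torch.allclose(F.grad, expected, atol=1e-6))   # prints True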
2. Noise-variance loss
Now, we focus on the proposed noise-variance loss (NV loss) for ConservNet:

  L = Σ_i L_i = Σ_i [ Var(F_θ(x_ij)) + |Q − Var(F_θ(x_ij + ε_ij))| ]
              = Σ_i [ A_i + |B_i| ],                                  (A5)

where Q is a spreading constant, ε_ij denotes a spreading noise vector, A_i = Var(F_θ(x_ij)), and B_i = Q − Var(F_θ(x_ij + ε_ij)). The function consists of the same term as L_simple (A_i) and an additional term that keeps the variance of the noised output at a certain value (B_i). We want to show that the minimization of this loss function avoids trivial convergence by flipping its behavior when the noised output variance becomes too low.

Theorem A.2. The constant function F_const(x) = C for all x ∈ R^d and some constant C is not a minimum of L_i.

Proof of Theorem A.2. To find a stationary point, we again apply a partial derivative to L. Due to the absolute value in B_i, the partial derivative becomes

  ∂|B_i|/∂F_{θ,ij} = ∂√(B_i²)/∂F_{θ,ij} = sgn(B_i) ∂B_i/∂F_{θ,ij}
                   = −sgn(B_i) [ ∂ Var(F_θ(x_ij + ε_ij)) / ∂F_{θ,ij} ].   (A6)

By Taylor expansion, F_θ(x_ij + ε_ij) = F_{θ,ij} + ∇F_{θ,ij}·ε_ij + O(ε_ij²), so

  −sgn(B_i) [ ∂ Var(F_θ(x_ij + ε_ij)) / ∂F_{θ,ij} ]
    = −sgn(B_i) (1/M) ∂/∂F_{θ,ij} Σ_j (F_{θ,ij} + ∇F_{θ,ij}·ε_ij − μ_i^{F∇})²
    = −sgn(B_i) (2/M) (F_{θ,ij} + ∇F_{θ,ij}·ε_ij − μ_i^{F∇}),         (A7)

where ∂(∇F_{θ,ij}·ε_ij)/∂F_{θ,ij} = 0 and μ_i^{F∇} = (1/M) Σ_j (F_{θ,ij} + ∇F_{θ,ij}·ε_ij).

Combining the two terms, A_i and B_i, the entire partial derivative has two cases depending on the sign of B_i:

  ∂L/∂F_{θ,ij} = (2/M) [ (F_{θ,ij} − μ_i^F) − (F_{θ,ij} + ∇F_{θ,ij}·ε_ij − μ_i^{F∇}) ]
               = (2/M) (μ_i^∇ − ∇F_{θ,ij}·ε_ij)  ≡ C_i                if sgn(B_i) = 1,
  ∂L/∂F_{θ,ij} = (2/M) [ (F_{θ,ij} − μ_i^F) + (F_{θ,ij} + ∇F_{θ,ij}·ε_ij − μ_i^{F∇}) ]
               = (2/M) [ 2F_{θ,ij} + ∇F_{θ,ij}·ε_ij − (μ_i^{F∇} + μ_i^F) ]  ≡ D_i   if sgn(B_i) = −1,   (A8)

where μ_i^∇ = (1/M) Σ_j ∇F_{θ,ij}·ε_ij. When the noised output variance is greater than Q (case D_i), both terms cooperate to reduce the variance of F_θ regardless of the input. But when the noised output variance becomes smaller than Q (case C_i), the terms that contributed to the cooperation cancel out, and the functional now has a global maximum instead of a global minimum. In this regime, the constant function becomes the only solution, as follows.

Lemma A.3.
A constant function F_const(x) = C for all x ∈ R^d and some constant C is a global maximum of L_i when sgn(B_i) = 1.

Proof of Lemma A.3. First, C_i has its critical point when every ∇F_{θ,ij}·ε_ij becomes the constant μ_i^∇. This is a global maximum, since any deviation from the constant will decrease the objective, as the second derivative test shows:

  ∂²L_{i, sgn(B_i)=1} / ∂(∇F_{θ,ij}·ε_ij)²
    = ∂/∂(∇F_{θ,ij}·ε_ij) (2/M)(μ_i^∇ − ∇F_{θ,ij}·ε_ij)
    = (2/M)(1/M − 1) < 0.                                             (A9)

Since the noise vector ε_ij can have an arbitrary value, the only way to satisfy the condition for the global maximum is ∇F_{θ,ij} = 0, which means that the function is constant everywhere.

Now, suppose that the constant function is one of the minima of L_i. Then it can only exist in the region where sgn(B_i) = −1. But for a constant function, sgn(B_i) = 1, since B_i = Q − Var(F_θ(x_ij + ε_ij)) = Q − 0 = Q > 0. This is contradictory, and hence the constant function cannot be a minimum of the NV loss.

Intuitively, NV loss prevents trivial convergence by maintaining a non-zero value for the divergence of F_θ, which cannot be accomplished by the constant function.

FIG. 6. Ground-truth invariants C, i.e., the total energy, versus fitted ConservNet outputs F̂_θ = aF_θ + b for the Kepler problem (R² = 0.8841), plotted under data condition (20, 100) from a controlled dataset where the angular momentum over the dataset is fixed to L = −0.5.
Appendix B: Dataset Detail

We prepare a total of six systems for training: S1, S2, S3, the Lotka–Volterra system, the Kepler problem, and the real double pendulum.

1. Synthetic systems
The datasets are composed by first randomly drawing the relevant variables except for the final one, and then calculating the last variable so that it preserves the overall conserved quantity. We tried to maximize the variety of the simulated data by setting the noise distribution and variable range for each system as differently as possible. While producing each dataset, we restricted the absolute value of the calculated final variable, consequently rejecting sets of variables that would force the last variable toward an extremely diverging value.
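As an illustration, one group of S1 data can be generated as follows. The invariant form C = x₁² − x₁x₂ + 3x₃x₄ and the role of x₁ as the solved-for variable follow our reconstruction of Tables I and II, so this sketch should be read under that assumption:

import numpy as np

def sample_s1_group(C, M, rng=np.random.default_rng(0)):
    rows = []
    while len(rows) < M:
        x2, x3, x4 = rng.normal(0.0, 2.0, size=3)
        # Solve x1**2 - x2*x1 + (3*x3*x4 - C) = 0 for the held-out variable x1.
        disc = x2**2 - 4.0 * (3.0 * x3 * x4 - C)
        if disc < 0:
            continue                        # reject draws with no real solution
        rows.append((0.5 * (x2 + np.sqrt(disc)), x2, x3, x4))   # take one root
    return np.array(rows)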
2. Physical systems
We generate the data from the physical systems by integrating the respective differential equations with Euler's method and subsampling the resulting trajectories. For the Lotka–Volterra system, we simulate the dynamics for 100M steps for a dataset of batch size M with time interval dt = 0.01. The obtained data are further subsampled every 100 steps, effectively setting the time interval between data points to 1. We scale x, y in the Lotka–Volterra equation and the position coordinates x, y in the Kepler problem by a factor of 0.1.
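A sketch of this procedure for the Lotka–Volterra system (the function signature is ours):

import numpy as np

def simulate_lv(x0, y0, alpha, beta, gamma, delta, M, dt=0.01, sub=100):
    # Euler integration of dx/dt = alpha*x - gamma*x*y, dy/dt = -beta*y + delta*x*y
    # for 100*M steps, keeping every 100th state so that consecutive samples
    # are one time unit apart.
    x, y, traj = x0, y0, []
    for step in range(sub * M):
        x, y = x + dt * (alpha * x - gamma * x * y), y + dt * (-beta * y + delta * x * y)
        if (step + 1) % sub == 0:
            traj.append((x, y))
    return 0.1 * np.array(traj)             # rescale by the factor 0.1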
3. Real double pendulum
The double pendulum data is adopted from [13], where two trials of double pendulum data are available. We use the first trial, consisting of 818 data points with a four-dimensional time series (θ₁, θ₂, ω₁, ω₂). Each data point corresponds to 0.01 s, making the total data length 8.18 s; the first 6.54 s (654 points) are used for training and the remaining 1.64 s for testing. We scale ω₁, ω₂ by a factor of 0.1 to match the θ₁ and θ₂ scale.
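The corresponding preprocessing is a one-line split and rescale (the function name and array layout are ours):

import numpy as np

def preprocess_pendulum(data, train_len=654):
    # data: (818, 4) array of (theta1, theta2, omega1, omega2);
    # rescale omega by 0.1 to match the theta scale, then split.
    data = data.copy()
    data[:, 2:] *= 0.1
    return data[:train_len], data[train_len:]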
Appendix C: Training Detail

Both the SNN and ConservNet are composed of six layers of multi-layer perceptrons (MLPs) with a layer width of 320, where the input dimension varies with the target system, and have a single output neuron. Note that the original SNN [20] used two layers of MLPs with a layer width of 160; we found that increasing the layer depth and width generally increased the overall performance of both the SNN and ConservNet.

During training, we found that the training result of the SNN varies significantly with random initialization and is highly prone to overfitting. We report that the SNN shows good performance in the best case (training and test accuracy of 100% and 99.75%, with high correlation to the true invariant), while in the worst case it overfits severely (test accuracy near the 50% chance level of the binary classification, with near-zero correlation).
03% withcorrelation 0 . TABLE II. Variable range and sample distribution of training and test data for each system. Variables with a star ( ∗ ) inthe distribution column are calculated after the other variables were drawn from the sample distribution, while the dash ( − )indicates a non-samplable variable. For physical systems (Lotka–Volterra and the Kepler problem), the distribution columnindicates each variable’s initial distribution (and hence differs from the actual range as shown). In the remarks column, theterm rescaled means that the variables are normalized by a given factor before constructing the dataset for the model training.Invariant System formula Variable Distribution Actual range RemarksS1 C = x − x x + x x ∗ [ − . , .
99] Model invariant x ∼ N (0 ,
2) [ − . , . x ∼ N (0 ,
2) [ − . , . x ∼ N (0 ,
2) [ − . , . C - [ − . , . C = 3 x + 2 sin( x ) + (cid:112) | x | x x ∼ U ( − ,
3) [ − . , .
52] Model invariant x ∗ [ − . , . x ∼ U ( − ,
3) [ − . , . C - [ − . , . C = 2 x x − (ln( | x + x | ) − x ) /x x ∗ [ − . , .
99] Model invariant x ∼ U ( − ,
10) [ − . , . x ∼ U (0 . ,
5) [0 . , . x ∼ U ( − ,
10) [ − . , . C - [1 . , . x ∼ U (1 ,
10) [0 . , . x , x rescaled (0 . C = α ln( x ) + δ ln( y ) − βx − γy x ∼ U (1 ,
10) [0 . , . C - [ − . , . x ∗ [ − , y ∼ U ( − ,
5) [ − . , .
28] Eccentricity e < . C = xv y − yv x v x ∼ U ( − ,
5) [ − . , . x, y rescaled (0 . v y ∼ U ( − ,
5) [ − . , . C - [1 . , . θ - [ − . , . C ideal = L ( m + m ) ω + m L ω θ - [ − . , . ω , ω rescaled (0 . m m L L ω ω cos( θ − θ ) ω - [ − . , . − gL ( m + m ) cos( θ ) − gm L cos( θ ) ω - [ − . , . For a fair comparison, we test every combination oflearning rates [0 . , . , . , . ,
000 epochs with early stopping,Adam[23] optimizer, and no particular regularizer. Thebatch size for mini-batch training is fixed to 64 for theSNN and tentative for Conservnet, where its batch size isfixed as respective group size M . Training takes severalhours on a single GeForce GTX 1080, depending on thebatch size. Appendix D: Extraction of symbolic formula fromConservNet results
Appendix D: Extraction of symbolic formula from ConservNet results

FIG. 7. Pearson correlation ρ with varying data conditions during 50,000 epochs of training ConservNet for the invariants. Here, the total data number is fixed to 2,000, and hence the data conditions (2, 1000), (4, 500), (10, 200), (20, 100), (40, 50), and (100, 20) are tested.

In a real scenario, identifying the explicit functional form of the invariant, rather than just its numerical value, is often crucial for understanding the system and its inherent symmetry. This can be done by performing polynomial or symbolic regression on the ConservNet output, as mentioned in the main manuscript, but many regression methods are prone to overfitting if the data is erroneous. Hence, besides its usefulness, the successful retrieval of symbolic functions from data also indicates the high quality of the model output. As an exemplary case, we perform ridge regression with polynomial features of the input data on the output of ConservNet for invariant S1. The result of the regression for order 2, keeping only the terms with non-negligible fitted coefficients, is

  F_θ ≈ x₁² − x₁x₂ + 3x₃x₄.                                           (D1)

We can see that the approximated output is the same as S1, showing that ConservNet can provide reliable output for the extraction of symbolic formulas.
Appendix E: ConservNet for a system with multiple invariants
In the main manuscript, we show that ConservNet discovers the angular momentum in the Kepler problem among three possible invariants. Since the model is designed to output a single value from a single output neuron, jointly finding multiple invariants requires a modification of the current architecture. In [20], the authors verified their model (SNN) by fixing the angular momentum of the given dataset and performing the same training. We test our model in a similar setting, using a controlled dataset to examine whether the model can discover a second invariant. Figure 6 shows that the model output has a strong correlation with a second invariant, namely the total energy of the orbital system, from the controlled dataset. Modifying ConservNet for the simultaneous discovery of multiple invariants would be an interesting future direction.
FIG. 8. Pearson correlation ρ with varying spreading constant Q (fixed R = 1) and maximum noise norm R (fixed Q = 1) during 20,000 epochs of training ConservNet for the invariant S1.

FIG. 9. Training loss, test loss, and mean intra-group standard deviation (σ̄) during 5,000 epochs of training ConservNet for random data.
Appendix F: Robustness of ConservNet performances in various conditions

1. Results with various data conditions

In the main manuscript, we fixed all of the simulated data conditions to (20, 100). In Fig. 7, ρ for all five simulated systems under different data conditions is plotted. We can confirm that ConservNet shows good performance (ρ > 0.9) for all simulated settings, including both extreme ends: from the case where only a single, long dataset is available, to the case where a hundred different trials with a short period of observation were recorded.
2. Hyperparameters and spreader selection
In the main manuscript, we set Q = 1 and R = 1 for all experiments. But in our NV loss, the relative magnitude between Q and R determines the strength of the spreader, and one may question the relationship between this specific choice of hyperparameters and the model performance. To test the robustness of ConservNet under other hyperparameters, we use the S1 dataset and measure ρ while varying the Q and R values. The results are shown in Fig. 8, which verifies that the model performance is practically unaffected by the choice of specific hyperparameters.

We also test different types of spreaders by restraining the noise with different norms. Instead of the L₂-norm used in the main experiment, we restrain the L₁-norm or the L∞-norm to 1. Both spreaders achieve results comparable to the L₂ case for S1.
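One way to realize the three spreaders (the exact sampling scheme is not specified in the source, so this sampler is an assumption):

import torch

def bounded_noise(shape, R=1.0, norm="l2"):
    eps = torch.rand(shape) * 2 - 1              # uniform in [-1, 1]^d
    if norm == "linf":
        return R * eps                           # max-norm already bounded by 1
    p = 1 if norm == "l1" else 2
    n = eps.norm(p=p, dim=-1, keepdim=True)
    return eps * (R / n).clamp(max=1.0)          # rescale into the Lp ball of radius R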
3. System with no specific invariant
If the system under study has no specific invariant, a good model for invariant discovery should notice such an absence. We train ConservNet with random data consisting of five-dimensional Gaussian random vectors X ~ N(μ, Σ), where μ ∈ R⁵ and Σ = I. ConservNet alerts us to the absence of an invariant by showing strong overfitting and large intra-group deviations, as shown in Fig. 9.

[1] J. Carrasquilla and R. G. Melko, Machine learning phases of matter, Nature Physics 13, 431 (2017).
[2] K. Ch'ng, J. Carrasquilla, R. G. Melko, and E. Khatami, Machine learning phases of strongly correlated fermions, Physical Review X 7, 031038 (2017).
[3] E. P. van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Learning phase transitions by confusion, Nature Physics 13, 435 (2017).
[4] Y. Zhang and E.-A. Kim, Quantum loop topography for machine learning, Physical Review Letters 118, 216401 (2017).
[5] G. Carleo and M. Troyer, Solving the quantum many-body problem with artificial neural networks, Science 355, 602 (2017).
[6] P. Baldi, P. Sadowski, and D. Whiteson, Searching for exotic particles in high-energy physics with deep learning, Nature Communications 5, 4308 (2014).
[7] P. Ponte and R. G. Melko, Kernel methods for interpretable machine learning of order parameters, Physical Review B 96, 205146 (2017).
[8] P. Zhang, H. Shen, and H. Zhai, Machine learning topological invariants with neural networks, Physical Review Letters 120, 066401 (2018).
[9] N. Sun, J. Yi, P. Zhang, H. Shen, and H. Zhai, Deep learning topological invariants of band insulators, Physical Review B 98, 085402 (2018).
[10] G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Neural-network quantum state tomography, Nature Physics 14, 447 (2018).
[11] M. Rafayelyan, J. Dong, Y. Tan, F. Krzakala, and S. Gigan, Large-scale optical reservoir computing for spatiotemporal chaotic systems prediction, Physical Review X 10, 041037 (2020).
[12] J. Bongard and H. Lipson, Automated reverse engineering of nonlinear dynamical systems, Proceedings of the National Academy of Sciences 104, 9943 (2007).
[13] M. Schmidt and H. Lipson, Distilling free-form natural laws from experimental data, Science 324, 81 (2009).
[14] T. Wu and M. Tegmark, Toward an artificial intelligence physicist for unsupervised learning, Physical Review E 100, 033311 (2019).
[15] H. Li, X.-q. Shi, M. Huang, X. Chen, M. Xiao, C. Liu, H. Chaté, and H. Zhang, Data-driven quantitative modeling of bacterial active nematics, Proceedings of the National Academy of Sciences 116, 777 (2019).
[16] K. Champion, B. Lusch, J. N. Kutz, and S. L. Brunton, Data-driven discovery of coordinates and governing equations, Proceedings of the National Academy of Sciences 116, 22445 (2019).
[17] R. Iten, T. Metger, H. Wilming, L. Del Rio, and R. Renner, Discovering physical concepts with neural networks, Physical Review Letters 124, 010508 (2020).
[18] A. Decelle, V. Martin-Mayor, and B. Seoane, Learning a local symmetry with neural networks, Physical Review E 100, 050102 (2019).
[19] Y.-i. Mototake, Interpretable conservation law estimation by deriving the symmetries of dynamics from trained deep neural networks, arXiv preprint arXiv:2001.00111 (2019).
[20] S. J. Wetzel, R. G. Melko, J. Scott, M. Panju, and V. Ganesh, Discovering symmetry invariants and conserved quantities by interpreting Siamese neural networks, Physical Review Research 2, 033499 (2020).
[21] Z. Liu and M. Tegmark, AI Poincaré: Machine learning conservation laws from trajectories, arXiv preprint arXiv:2011.04698 (2020).
[22] D. Misra, Mish: A self regularized non-monotonic activation function, arXiv preprint arXiv:1908.08681 (2019).
[23] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, in Advances in Neural Information Processing Systems (2019), pp. 8026–8037.
[25] Y. Takeuchi, Global Dynamical Properties of Lotka-Volterra Systems (World Scientific, 1996).
[26] S. Ha and H. Jeong, nokpil/conservnet, http://doi.org/10.5281/zenodo.4491096