[PDF] A Bayesian neural network predicts the dissolution of compact planetary systems

Abstract

Despite over three hundred years of effort, no solutions exist for predicting when a general planetary configuration will become unstable. We introduce a deep learning architecture to push forward this problem for compact systems. While current machine learning algorithms in this area rely on scientist-derived instability metrics, our new technique learns its own metrics from scratch, enabled by a novel internal structure inspired from dynamics theory. Our Bayesian neural network model can accurately predict not only if, but also when a compact planetary system with three or more planets will go unstable. Our model, trained directly from short N-body time series of raw orbital elements, is more than two orders of magnitude more accurate at predicting instability times than analytical estimators, while also reducing the bias of existing machine learning algorithms by nearly a factor of three. Despite being trained on compact resonant and near-resonant three-planet configurations, the model demonstrates robust generalization to both non-resonant and higher multiplicity configurations, in the latter case outperforming models fit to that specific set of integrations. The model computes instability estimates up to five orders of magnitude faster than a numerical integrator, and unlike previous efforts provides confidence intervals on its predictions. Our inference model is publicly available in the SPOCK package, with training code open-sourced.

Full PDF

AA Bayesian neural network predicts the dissolution of compact planetary systems

Miles Cranmer a, † , Daniel Tamayo a , Hanno Rein b,c , Peter Battaglia d , Samuel Hadden e ,Philip J. Armitage f,g , Shirley Ho g,a,h , and David N. Spergel g,a a Princeton University, Princeton, New Jersey, USA b University of Toronto at Scarborough, Toronto, Ontario, Canada c University of Toronto, Toronto, Ontario, Canada d DeepMind, London, UK e Center for Astrophysics | Harvard & Smithsonian, Cambridge, Massachusetts, USA f Stony Brook University, Stony Brook, New York, USA g Flatiron Institute, New York, New York, USA h Carnegie Mellon University, Pittsburgh, Pennsylvania, USA † Corresponding author. Email: [email protected]

Abstract

Despite over three hundred years of eﬀort, no solutions exist for predicting when a general planetary conﬁg-uration will become unstable. We introduce a deep learning architecture to push forward this problem forcompact systems. While current machine learning algorithms in this area rely on scientist-derived instabilitymetrics, our new technique learns its own metrics from scratch, enabled by a novel internal structure inspiredfrom dynamics theory. Our Bayesian neural network model can accurately predict not only if, but also when acompact planetary system with three or more planets will go unstable. Our model, trained directly from shortN-body time series of raw orbital elements, is more than two orders of magnitude more accurate at predictinginstability times than analytical estimators, while also reducing the bias of existing machine learning algo-rithms by nearly a factor of three. Despite being trained on compact resonant and near-resonant three-planetconﬁgurations, the model demonstrates robust generalization to both non-resonant and higher multiplicityconﬁgurations, in the latter case outperforming models ﬁt to that speciﬁc set of integrations. The modelcomputes instability estimates up to ﬁve orders of magnitude faster than a numerical integrator, and unlikeprevious eﬀorts provides conﬁdence intervals on its predictions. Our inference model is publicly available inthe

SPOCK ∗ package, with training code open-sourced ‡ . The ﬁnal growth of terrestrial bodies in current theories of planetformation occurs in a phase of giant impacts (1). During this stage, thenumber of planets slowly declines as bodies collide and merge (2, 3).Close planetary encounters and the wide dynamic range exhibited bythe times between consecutive collisions computationally limit currentnumerical eﬀorts to model this process. Two theoretical roadblocksimpede the development of a more eﬃcient iterative map for modelingplanet formation. First, one must predict a distribution of instabilitytimes from a given initial orbital conﬁguration. Second, one mustpredict a distribution of post-collision orbital architectures (e.g., 4)subject to mass, energy and angular momentum constraints. Towardsthis end, we focus on the long-standing former question of instabilitytime prediction.In the compact dynamical conﬁgurations that characterize theplanet formation process, the simpler two-planet case is understoodanalytically. In this limit, instabilities are driven by the interactionsbetween nearby mean-motion resonances (MMRs), i.e., integer com-mensurabilities between the orbital periods of the planets like the 3:2MMR between Pluto and Neptune (5–8). While the general higher-multiplicity case is not yet understood, two important results guideour analysis and provide an important test for any model. First,when planets are initialized on circular orbits, chaos is driven by theoverlap of 3-body MMRs between trios of adjacent planets (9), andtheoretical estimates of the timescale required for the orbits to reach ∗ https://github.com/dtamayo/spock ‡ https://github.com/MilesCranmer/bnn_chaos_model orbit-crossing conﬁgurations accurately match numerical integrations(10). As we show below, such analytical estimates perform poorly inthe generic eccentric case where the eﬀects of 2-body MMRs aredominant (10, 11). However, analytical and empirical studies agreethat while the dynamical behavior changes strongly from the two tothree-planet case (3, 12–18), three-planet systems are the simplestprototype for predictions at higher multiplicities in compact systems(10, 11).We recently presented a machine learning model, dubbed the Sta-bility of Planetary Orbital Conﬁgurations Klassiﬁer, or SPOCK,trained to classify the stability of compact planetary systems overtimescales of 10 orbits (11). This represented a long-term eﬀort toexploit the substantial but incomplete current analytical understanding(5, 6, 8, 19, 20) to engineer summary metrics that captured these sys-tems’ chaotic dynamics; these features were then used by the machinelearning model to classify whether the input conﬁguration would bestable over 10 orbits.While simple binary stability classiﬁcation is eﬀective for con-straining physical and orbital parameters consistent with long-termstability (21), other applications like modeling terrestrial planet for-mation require the prediction of continuous instability times. Ad-ditionally, several ﬁelds in which it is challenging to ﬁnd eﬀectivehand-picked features—such as computer vision, speech recognition,and text translation—have been revolutionized by neural networks inthe last decade (notable early breakthroughs include 22–24). Ratherthan relying on domain expert input, these ﬂexible models learndata-driven features that can often signiﬁcantly outperform human-engineered approaches. A key theme with deep learning models is1 a r X i v : . [ a s t r o - ph . E P ] J a n hat their structure resembles the hand-designed algorithm, but withadded ﬂexibility parametrized by neural networks (for discussion, see25). For example, modern computer vision models consist of learnedconvolutional ﬁlters which take the place of hand-designed ﬁlters inclassic algorithms (26).Pursuing a deep learning approach, we present a neural networkthat, trained only on short-time series of the orbits in compact plane-tary systems, not only improves on long-term predictions of previousmodels based on engineered features (11, 27), but also signiﬁcantlyreduces the model bias and improves generalization beyond the train-ing set. We design our model as a Bayesian neural network, whichnaturally incorporates conﬁdence intervals into its instability time pre-dictions, accounting both for model uncertainty as well as the intrinsicuncertainty due to the chaotic dynamics. Finally, unlike previous ma-chine learning models based on decision trees (11, 27), our modelis diﬀerentiable. That is, we can extract from the model estimatesof the derivatives of the predicted instability times with respect tothe parameters deﬁning the orbital conﬁguration in question. Suchgradient information can signiﬁcantly speed up parameter estimationusing Hamiltonian Monte Carlo techniques (28). We focus on the regime leading to typical compact multi-planet sys-tems observed to date, with mass ratios with the central star rangingfrom 10 − (roughly the ratio of the Moon-mass embryos thought toinitially characterize the giant impact phase, relative to the Sun) to10 − (roughly Neptune’s mass relative to the Sun). As detailed inthe appendix, we place planets on nearly coplanar orbits, with ad-jacent planets spaced within 30 mutual Hill radii § of one another(e.g., 29). Orbital eccentricities in observed systems are often poorlyconstrained, so we consider the range from initially circular to orbit-crossing values.We train our model on the set of 113,543 publicly available, com-pact three-planet conﬁgurations in and near strong resonances from(11). In particular, this “resonant” dataset initializes one pair of plan-ets in or near a strong MMR, while the third planet’s orbital parametersare chosen randomly. This choice focuses the training set on the nar-row resonant regions of phase space where the dynamical behaviorchanges most strongly, and we later test the model’s generalization tonon-resonant systems in section 3.Each initial condition was integrated for 10 orbits of the innermostplanet using the WHFast integrator (30) in the REBOUND N-bodypackage (31). If at any point two planets came within a distance of oneanother given by the sum of their Hill radii, the simulation was stoppedand the instability time was recorded. Because gravity is scale invari-ant, the instability time 𝑡 inst is most usefully non-dimensionalized bythe innermost orbital period 𝑃 orb . Given the large dynamic range intimescales over which instabilities can occur, we deﬁne the dimen-sionless log instability time 𝑇 ≡ log ( 𝑡 inst / 𝑃 orb ) . Conﬁgurationswith instability times longer than 10 orbits ( 𝑇 >

9) were labeled asstable, and integration was stopped.

To predict systems’ instability times, we perform a computationallycheap numerical integration of the ﬁrst 10 orbits, and use this time § The mutual Hill radius 𝑅 𝐻 is a relevant length scale within which the gravity of theplanets dominates that of the star 𝑅 𝐻 ≈ 𝑎 ( 𝜇 + 𝜇 ) / , where 𝑎 is the inner planet’ssemi-major axis, and 𝜇 and 𝜇 are each planet’s mass ratio relative to the star. series to make long-term predictions. Each of the three planets’ 3Dpositions and velocities correspond to six standard orbital elements(the appendix), which we record at 𝑛 𝑡 =

100 equally spaced outputsacross the short integration. In addition we pass the three constantmass ratios for each planet relative to the star, as well as the timevalue, for a combined input matrix of real values 𝑋 ∈ R + 𝑛 𝑡 for agiven conﬁguration.Because the dynamics of compact multi-planet systems are chaotic,instability times for a given initial orbital conﬁguration are eﬀectivelynon-deterministic. Nevertheless, numerical experiments (32, 33) haveshown that instability times for unstable, compact multi-planetary sys-tems settle to well deﬁned, approximately log-normal distributions.Thus, rather than predicting a single instability time for a given orbitalconﬁguration, our model maps from an input initial orbital conﬁgu-ration to a predicted log-normal distribution of instability times, i.e.,a Gaussian distribution of 𝑇 with mean 𝜇 and variance 𝜎 . Thisgives the network the ﬂexibility to both model the fundamental uncer-tainties imposed by the chaotic dynamics, and to incorporate modeluncertainty into its predictions by assigning larger widths to conﬁgu-rations it is less sure about.In our initial eﬀorts, we experimented with various combinationsof convolutional neural networks (see reviews by 34–36), long short-term memory networks (37), 1D scattering transforms (38), regularmulti-layer perceptrons (MLP, see 25) and gaussian processes (39).All of these models underperformed or tended to overﬁt the data.The fundamental challenge for making such predictions is the sharptransitions in dynamical behavior at MMRs, where instability timescan change by several orders of magnitude (40) over sub-percentchanges in planetary orbital periods, i.e., in the original space oforbital elements. We found substantially improved performance bystructurally splitting the problem into three components: 1) Find atransformation from the sharply punctuated space of orbital elementsto new variables. 2) Calculate statistical summaries of the time seriesin these transformed variables. 3) Use these summary features topredict a log-normal distribution over instability times for the inputorbital conﬁguration, parametrized by mean 𝜇 and variance 𝜎 . Thisis illustrated in ﬁg. 1.We model steps 1 and 3 with neural networks 𝑓 and 𝑓 , respec-tively. In step 2, we choose to use a well-motivated but non-standardpooling operation, and calculate both the mean and variance for eachtransformed time series (deep neural networks traditionally use one ofeither sum, mean, or max pooling). This structure was motivated bythe hand-engineered approach of (11), whose features are means andvariances of theoretically-motivated instability metrics. In this case,we give the machine learning model additional ﬂexibility to learn itsown instability metrics. This design process is analogous to how theinvention of convolutional neural networks was motivated by creat-ing similar structures to hand-engineered algorithms which convolveﬁlters over an image. Our model is parametrized by 𝑚 neural network weights 𝜃 ≡( 𝜽 , 𝜽 ) , 𝜃 ∈ R 𝑚 ( 𝜽 for 𝑓 and 𝜽 for 𝑓 ). Deﬁning the trainingset 𝐷 as the collection of input orbital conﬁgurations and their asso-ciated N-body instability times, we seek the most likely set of modelparameters given the data, i.e., we maximize 𝑃 ( 𝜃 | 𝐷 ) , which is in turnproportional to 𝑃 ( 𝐷 | 𝜃 ) 𝑃 ( 𝜃 ) .Our model predicts a log-normal distribution of instability times forany input orbital conﬁguration. For a given set of network weights 𝜃 ,the likelihood 𝑃 ( 𝐷 | 𝜃 ) is then simply the product of the probabilitiesthat each training set example’s output 𝑇 𝑖 is drawn from the associated2 ime NumericalIntegration:

System goesunstable atdistant time TShort timeintegration recordedTime

Single scalar

Orbital elementsover time: ...

Neural Network 1( f ) Learned featuresover time: ...

Compute average andvariance over timefor each learned feature ... ...

Neural Network 2( f ) Time

Figure 1: Schematic of our model. The model numerically integrates 10,000 orbits for a compact 3-planet system (top) and records orbitalelements at 100 times. Neural network 𝑓 creates learned summary features from these elements at each time. Neural network 𝑓 takes theaverage and variance of these features as input and estimates a distribution over possible instability times.3aussian N ( 𝜇 𝑖 , 𝜎 𝑖 ) predicted by the model. As discussed above,this choice is motivated by the numerical result that the distributionin 𝑇 is normal, for diﬀerent conﬁgurations with a wide range of meaninstability times (33).Note that we have 4 < 𝑇 ≤ 𝑇 <

𝑇 > 𝑇 = 𝑇 . Thus, we build a truncated normal distribution with a cutoﬀ at 𝑇 =

4, and the cumulative probability of the Gaussian above 𝑇 = 𝑃 ( 𝜃 ) is a prior on the neural network’s parameters. The per-parameter prior here is unimportant: what matters is the prior inducedon the output of the model , and we use an uninformative Gaussian prioron parameters to induce an uninformative prior on the output. See(41) for a detailed discussion of priors in Bayesian deep learning. By having our model predict a distribution of instability times with aﬁnite width, we account for intrinsic uncertainty (sometimes referredto as “aleatoric” uncertainty). However, we also wish to include ex-trapolation uncertainty (or “epistemic” uncertainty) for systems thatdiﬀer from those found in the training set. To do this we marginal-ize over potential model parameters, with what is referred to as a“Bayesian Neural Network” (BNN). This is a neural network whoseparameters are distributions rather than point values, and is trainedwith Bayesian inference. These models naturally fold in extrapolationuncertainty via marginalization.This concept is familiar in traditional statistical inference, whereone can marginalize out the internal nuisance parameters of a modelusing Markov Chain Monte Carlo (MCMC) techniques. The factthat neural networks typically have millions of parameters rendersMCMC computationally prohibitive, and various practical simpliﬁca-tions are adopted for implementing a BNN. The most common strategyis Monte Carlo dropout (42, 43) which treats the neural network’s pa-rameters as independent Bernoulli random variables, and has beenused in several astronomical applications (44–47). A selection ofother techniques includes Bayes by Backprop (48), Bayesian Layers(49), variants of normalizing ﬂows (e.g., 50), Bayes by Hypernet(51, 52), and many other strategies. One recently-proposed strategy,named “MultiSWAG” (53, 54) learns a distribution over the posteriorof weights that best ﬁt the training set, without a diagonal covarianceassumption, and is much closer to standard MCMC inference. Weexperimented with a selection of common techniques: Monte Carlodropout, Bayes by Backprop, and MultiSWAG, and found that “Mul-tiSWAG” produced the best accuracy and uncertainty estimations onthe validation dataset.To move beyond a single best-ﬁt set of parameters 𝜃 , SWAG, or“Stochastic Weight Averaging Gaussian” (53, 55), instead ﬁts a Gaus-sian to a mode of the posterior over 𝜃 , assuming a low-rank covariancematrix. This was extended in (54) to “MultiSWAG,” which repeatsthis process for several modes of the weight posterior, to help ﬁll outthe highly degenerate parameter space. This technique is summarizedbelow:1. Train 𝑓 and 𝑓 simultaneously via stochastic gradient descentuntil the parameters settle into a minimum of the weight posterior.2. Increase the learning rate and continue training. This causesthe optimizer to take a random walk in parameter space nearthe minima, which is assumed to look like a high-dimensionalGaussian. 3. Accumulate the average parameters along this random walk aswell as a low-rank approximation of the covariance matrix.4. The average parameters not only provide better generalizationperformance (stochastic weight averaging or SWA), but we haveadditionally ﬁt a Gaussian to a mode of the parameter posterior.We can thus sample weights from this Gaussian to marginalizeover parameters. This is SWAG (53).5. The next step is to repeat this process from a diﬀerent randominitialization of the parameters. This will ﬁnd another mode ofthe parameter posterior.6. Fit ∼

30 diﬀerent modes of the parameter space. We can thensample weights from multiple modes of the parameter posterior,which gives us a more rigorous uncertainty estimates. This isMultiSWAG (54).Training a neural network through stochastic gradient descent isapproximately the same as Bayesian sampling of the weights (56, 57),so this aforementioned process allows one to learn a Bayesian posteriorover the weights of a neural network 𝑃 MultiSWAG ( 𝜃 ) .Once we have learned 𝑃 MultiSWAG , we can draw from it to samplea set of network weights 𝜃 . This model then predicts a log-normaldistribution of instability times with mean 𝜇 and variance 𝜎 for thegiven input orbital conﬁguration, from which we can ﬁnally sample alog instability time T. We can write a forward model for this predictionas follows: ( 𝜽 , 𝜽 ) ∼ 𝑃 MultiSWAG ( 𝜃 ) , (1) y 𝑡 = 𝑓 ( x 𝑡 ; 𝜽 ) for all x 𝑡 ≡ 𝑋 : ,𝑡 , (2) z ∼ ( E 𝑡 [ y 𝑡 ] , Var 𝑡 [ y 𝑡 ]) , (3) ( 𝜇, 𝜎 ) = 𝑓 ( z ; 𝜽 ) , (4) 𝑇 instability = 𝑇 for 𝑇 ∼ N ( 𝜇, 𝜎 ) , (5)where 𝑡 is a time step from 1 to 100. Here, we have labeled y 𝑡 asthe learned transformed variables for a single time step of the system(brown cells in ﬁg. 1), and z as the average and variance of these trans-formed variables over time. To account for statistical errors due to ourﬁnite number of time series samples, we sample the z from normaldistributions with frequentist estimates of the variance in the samplemean and variance: Var 𝑡 [ y 𝑡 ] 𝑛 𝑡 and 𝑡 [ y 𝑡 ] ( 𝑛 𝑡 − ) , respectively. A Bayesiangraphical model for this is shown in the appendix. Repeatedly sam-pling in this way provides a predicted distribution of T given the inputorbital conﬁguration, marginalized over the posterior distribution ofnetwork weights 𝜃 .We split our data into 60/20/20% train/validation/test, train ourmodel on 60% of our ≈ ,

000 training examples of resonant andnear resonant systems, and validate it on half of the remaining data totune the hyperparameters. Hyperparameters for our model are givenin the supplementary material, and we also release the code to trainand evaluate our model.With this trained model, we then explore its performance on theremaining 20% holdout data from the resonant dataset, as well asother datasets described below.

For a given orbital conﬁguration, our probabilistic model producesone sample of 𝑇 . If a given sample is above 𝑇 =

9, we treat the4ample as a “stable” prediction. Since we are unable to make speciﬁctime predictions above the maximum integration time in our trainingdataset of 𝑇 =

9, we resample from a user-deﬁned prior 𝑃 ( 𝑇 | 𝑇 ≥ ) for each occurrence. For the purposes of this study, we assume asimple analytic form for this prior, though followup work on this prioris ongoing (see supplementary).For all results, we sample 10,000 predicted values of the posteriorover 𝑇 per planetary system. We compare our predictions againstseveral alternatives which are explained below. Since the models wecompare against can only produce point estimates while our modelpredicts a distribution, we take the median of our model’s predictedposterior over 𝑇 . This is used for plotting points, as well as forcomputing root-mean-square prediction errors.We ﬁrst compute the N-body versus predicted (median) 𝑇 valueover the holdout test dataset of ≈ ,

000 examples not seen duringtraining, which can be seen in the bottom middle panel of ﬁg. 2. Wereiterate that the N-body instability times measured for the variousorbital conﬁgurations in our training set are not ‘true’ instability times,but rather represent single draws from the diﬀerent planetary systems’respective instability time distributions, established by their chaoticdynamics. To estimate a theoretical limit (bottom right panel of ﬁg. 2)we use the results from (33), who ﬁnd that the T values measuredby N-body integrations (x-axis of ﬁg. 2) should be approximatelynormally distributed around the mean instability time predicted by anideal model. We use a random standard deviation drawn from thevalues measured empirically for compact systems by (33), which theyﬁnd are sharply peaked around ≈ .

43 dex, independent of whether ornot the system is near MMRs, and valid across a wide range of meaninstability times. We plot this representative intrinsic width of 0.43dex as dotted lines on all panels for comparison.While we defer a detailed comparison to previous work to thefollowing section, we measure a root mean square error (RMSE) of1.02 dex for our model on the holdout test set. We note that whilethe RMSE is an intuitive metric for comparing models, it does notprovide a full picture for a model that is trained on a diﬀerent lossfunction to predict both 𝜇 and 𝜎 , A model that can predict its own 𝜎 will sacriﬁce worse 𝜇 accuracy in challenging regions of parameterspace to better predict it on more easily predictable conﬁgurations.For comparison, if we weight the RMSE by the predicted signal-to-noise ratio (SNR), 𝜇 / 𝜎 , the model achieves 0.87 dex, within afactor of ≈ 𝜎 accurately capture the spread of N-body times around thepredicted mean values 𝜇 . For each test conﬁguration, we predict 𝜇 ,subtract it from its respective 𝑇 measured by N-body integration, anddivide by the predicted 𝜎 . If this distribution approximates a Gaussiandistribution of zero mean and unit variance, the model’s uncertaintyestimates are accurate. We ﬁnd that a Komolgorov-Smirnov test can-not conﬁdently distinguish our predictions from this ideal Gaussian(p-value of 0.056), and plot the two distributions in ﬁg. 8 of thesupplementary material. Guided by the dynamical intuition that short-timescale instabilitiesare driven by the interaction of MMRs (5, 8, 11), we chose to train ourmodel on systems with particular period ratios and orbital elements

Resonant

Model RMSE Classif.accur. Bias † (4, 5) Bias(8, 9)Obertas et al. (2017) 2.12 0.628 1.04 -1.71Petit et al. (2020) 3.22 0.530 3.99 0.54Tamayo et al. (2020) 1.48 0.946 2.07 -0.62Modiﬁed ∗ Tamayo+20 0.99 0.946 0.65 -0.60Ours 1.02 0.952 0.29 -0.38Ours, SNR-weighted 0.87 0.971 0.18 -0.25

Theoretical limit 0.43 0.992 0.05 -0.04

Random

Obertas et al. (2017) 2.41 0.721 2.15 -0.93Petit et al. (2020) 3.09 0.517 4.17 0.50Tamayo et al. (2020) 1.24 0.949 1.16 -0.59Modiﬁed ∗ Tamayo+20 1.14 0.945 0.79 -0.70Ours 1.20 0.939 0.40 -0.51Ours, SNR-weighted 1.09 0.959 0.23 -0.49

Theoretical limit 0.44 0.989 0.06 -0.04 (dex) (AUC) (dex) (dex) † Average diﬀerence between predicted minus true 𝑇 in given range ∗ Modiﬁed and retrained for regression.Table 1: Statistical summaries of each estimator applied to a holdouttest portion of the resonant dataset, and all of the random dataset.in the narrow ranges near such resonances where the dynamical be-havior changes sharply (40). It is therefore important to test how wellsuch a model generalizes to a more uniform coverage of parameterspace, given that most observed orbital architectures are not in MMRs(possibly because such conﬁgurations typically have short lifetimesand have been eliminated). Additionally, previous work has typicallyignored the sharp variations near MMRs to ﬁt overall trends in insta-bility times (40), so a test on resonant systems would not provide afair comparison.For this generalization test and comparison, we use the ‘random’dataset of (11), 25,000 three-planet systems with the same mass ra-tio and inclination distributions as above, and eccentricities drawnlog-uniformly from ≈ − to orbit-crossing. Rather than drawingnear-integer period ratios as in our resonant training set, the spacingbetween adjacent planets is drawn uniformly between [3.5, 30] mutualHill radii (see 11).We ﬁnd that our model exhibits only a minor loss in performance(1.20 vs 1.02 dex RMSE) generalizing to this uniform distributionof orbital conﬁgurations table 1. This supports the assertion thatinstabilities in compact systems within 10 orbits are dominantlydriven by the MMRs we focused on in our training sample (11).To compare our results to the extensive set of past eﬀorts, we divideprevious approaches into three broad groups.First, many authors have run N-body integrations along low-dimensional cuts through the parameter space of compact orbital con-ﬁgurations, and ﬁt simple functional forms to the resulting trends ininstability times. For example, several studies have highlighted thesteep dependence on interplanetary separation, while ﬁxing orbits tobe coplanar and initially circular, and planets to be equal mass andequally separated from one another (12, 14, 16, 17, 40). We comparethe performance of the ﬁt from such a study (40), using ﬁve equallyspaced Earth-mass planets (mass ratio ≈ × − ) on our random5 he model of Obertas et al. (2017) The model of Petit et al. (2020) Tamayo et al. (2020), retrained for regressionOur model Our model, applied to the resonant dataset Theoretical limitFigure 2: Density plots showing the predicted versus true instability times for various models on the random dataset. All predictions outside of 𝑇 ∈ [ , ] are moved to the edge of the range. For the plot showing the predictions of our model, transparency shows relative model-predictedSNR. The theoretical limit, using the numerical experiments of (33), is given in the lower-right plot. The 0 .

43 dex RMSE average from this isused to give the dotted contours in all plots.test set in the top left panel of ﬁg. 2, with a resulting RMSE of 2.41.Followup studies have incorporated the eﬀect of ﬁnite inclinationsand eccentricities (13, 15, 58, 59), but consider equal initial eccen-tricities, planetary masses, etc. in order to ﬁt simple models. Weconclude that while such controlled experiments yield insight into theunderlying dynamics (9, 15, 60), instability times depend sensitivelyon masses and several orbital parameters, rendering empirical ﬁts tolow-dimensional cuts in the parameter space of limited applicability.In the following section we perform the converse generalization test,where we ask our model to instead predict on the N-body simulationsused for the ﬁt by (40), and ﬁnd good agreement.Second, previous authors have developed analytical instability timeestimates from ﬁrst principles. These have been most successful inthe limit of initially circular orbits, where three-body MMRs havebeen identiﬁed as the dominant driver of chaos (9). Recent work (10)has extended this theory to provide accurate instability time estimates.We will compare the predictions of our model to this limit of initiallycircular orbits in the next section. Here we simply emphasize the pointby (10) that such predictions perform poorly at ﬁnite eccentricities(top middle panel of ﬁg. 2), likely due to the dominant eﬀects ofstronger two-body MMRs. The fact that the analytic model predictsthe vast majority of systems to be stable implies that most of our testconﬁgurations would be stable on circular orbits, but that ﬁnite orbitaleccentricities strongly modulate instability times.The ﬁnal approach is to make predictions across the high-dimensional space of orbital conﬁgurations using machine learning (11, 27). We consider two variants of (11) adapted for regression.The ﬁrst, labeled ‘Tamayo et al. (2020)’, is to simply use an identicalmodel as (11), but map the probability estimates of stability past 10 orbits through an inverse cumulative distribution of a log-normal withan optimized constant standard deviation. The second, labeled ‘Mod-iﬁed Tamayo+20’, the model is an XGBoost (61) regression model(rather than classiﬁcation) retrained on the same features as used in(11).We ﬁnd that our model achieves similar performance to the Modi-ﬁed Tamayo+20 variant (top right panel of ﬁg. 2, table 1), though thelatter exhibits signiﬁcant bias. We quantify this bias for each model inthe range 𝑇 ∈ ( , ) and 𝑇 ∈ ( , ) . As is evident in table 1 as well asﬁg. 2, the model introduced in this work exhibits signiﬁcantly reducedbias compared to other models. Including SNR weighting furtherreduces bias. Bias is a measure of the generalizability of a modelto out-of-distribution data (see chapter 7 of 62), so is an importantmetric for understanding how these predictive models will extrapolateto new data. Our model achieves predictions that are more than twoorders of magnitude more accurate than the analytic models in eachcase: e.g., 10 . / . ≈ × when comparing our SNR-weightedmodel with (40) on the random dataset.Finally, we can make a comparison to the original classiﬁcationmodel of (11) by using our regression model as a form of classiﬁer.We count the fraction of samples above 𝑇 = As a second generalization test of our model, we compare its per-formance on the limiting case considered by (40). This case of ﬁveequally spaced, Earth-mass planets on initially circular and coplanarorbits diﬀers signiﬁcantly from our training set of resonant and nearresonant, eccentric and inclined conﬁgurations of three planets withunequal masses. This dataset contains 17500 simulations numericallyintegrated for 10 orbits (40). This generalization to a limiting set ofhigher multiplicity conﬁgurations provides a stringent test of whetherthe model has learned features of the dynamics or whether it is naivelyinterpolating across the distribution of orbital conﬁgurations presentin our training dataset.To extend our three-planet predictions to higher multiplicity sys-tems, we perform the same short integration for all planets, but passtime series for each adjacent trio of planets to the model separately.The model samples a single instability time for each adjacent trio,and the minimum across this set is adopted as the instability time forthe system, as an estimate of the time for the ﬁrst trio to go unstable.This procedure is then repeated, and we record the median and conﬁ-dence intervals of the resultant distribution in 𝑇 . Such a reduction ofcompact multi-planet systems to sets of adjacent trios has been pro-posed on theoretical (9, 10) as well as empirical (11) grounds. Thisis motivated by the fact that the perturbative eﬀects of planets on oneanother fall oﬀ exponentially with separation (9, 10), so non-adjacentinteractions can largely be ignored.The predictions can be seen in ﬁg. 3, and are remarkably accu-rate despite our model never seeing a system with ﬁve planets duringtraining. We overplot the analytical result of (10) in magenta devel-oped from ﬁrst principles for such cases with initially circular orbits,including a manual correction for ﬁve-planet systems. Our modelcaptures the same overall trend, but additionally identiﬁes the variousdips, which correspond to locations of diﬀerent MMRs (40). Weemphasize that our model was trained on the general eccentric casewhere the magenta model of (10) does not apply (ﬁg. 2), yet the gener-alization to this limiting case is excellent. In addition to matching theoverall trend of (10), our model captures the additional instability timemodulations at MMRs, as can be seen more clearly in the residuals inﬁg. 5 of the appendix. Additionally, our new model generalizes muchbetter than the predictions of the modiﬁed regression model based on(11) based on manually engineered features (gold). In industry machine learning, one is focused on making predictions asaccurate as possible, even at the expense of a more interpretable model.However, in physics, we are fundamentally interested in understandingproblems from ﬁrst principles.Obtaining such an explicit interpretation of our model will be dif-ﬁcult. However, as a ﬁrst step we consider the feature importancesof our model: what orbital elements is it using to make predictions,and does this align with expectations? To do this feature analysis, weexploit the diﬀerentiability of our model with respect to its input. Wecalculate the gradient of the predicted 𝜇 value our model with respectto the input instantaneous orbital elements; this is also referred to asa “saliency map” (25). This gives us a multi-dimensional array overfeature, simulation, time step, and model, representing how much thepredicted 𝜇 value will change should that feature be inﬁnitesimally in-creased. We compute the variance of the gradients over time and eachsimulation, and then average these variances over sampled network weights 𝜃 . This gives us a coarse representation of the importance ofeach feature. We show this visually in ﬁg. 4.To compare these importances to other work, (11) argue empiricallythat the short-timescale instabilities we probe here in compact systemsare driven speciﬁcally by the interactions between MMRs. A classicalresult of celestial mechanics is that in the absence of such MMRs, thelong-term dynamics keeps the semi-major axes ﬁxed. Variations inthe semi-major axes during the short integrations thus act as a proxyfor the importance of nearby MMRs (27), and we see that indeed, thesemi-major axes exhibit the highest feature importance in our modelﬁg. 4. The fact that the model ascribes comparable feature importanceto any given orbital element for each of the three planets also suggestsa physically-reasonable model.We note that there is a small but nonzero signiﬁcance of the “Time”instantaneous feature. This can be interpreted as being importantbecause the model takes the ﬁrst 10 orbits as input, and can predictinstability for the system as low as 10 . . Thus, the orbital parametersgiven at 10 orbits may be more important than the orbital parametersat 10 orbits for predicting such unstable systems, and thus the timefeature is used. The time feature would be less important for a systemthat goes unstable near 10 orbits, as the relative importance of thesystem’s parameters at 10 orbits is comparable to that at 10 orbit.Because we chose to structure our model to take means and vari-ances of the times series of the learned transformed variables ﬁg. 7,it may be possible to extract explicit interpretations of our model viasymbolic regression. Given that our approach is structurally similar tothat of a graph neural network (63), the frameworks of (64–66) wouldbe particularly applicable. This would be done by ﬁnding analyticforms for 𝑓 , representing each of the transformed variables, and thenﬁnding an analytic form for 𝑓 , to compute the instability time giventhe transformed variables. This type of explicit analysis of the modelwill be considered in future work. We have described a probabilistic machine learning model—aBayesian neural network—that can accurately predict a distributionover possible instability times for a given compact multi-planet ex-oplanet system. Our model is trained on the raw orbital parametersof a multi-planet system over a short integration, and learns its owninstability metrics. This is contrasted by previous machine learningapproaches which have given their models hand-designed instabilitymetrics based on specialized domain knowledge.Our model is more than two orders of magnitude more accurateat predicting instability times than analytical estimators, while alsoreducing the bias of existing learned models by nearly a factor of three.We also demonstrate that our model generalizes robustly to ﬁve-planetconﬁgurations eﬀectively drawn from a one-dimensional cut throughthe broad parameter space used to train the model. This improves onthe estimates of analytic and other learned models, despite our modelonly being trained on compact three-planet systems.Our model’s computational speedup over N-body integrations bya factor of up to 10 enables a broad range of applications, such asusing stability constraints to rule out unphysical conﬁgurations andconstrain orbital parameters (21), and to develop eﬃcient terrestrialplanet formation models. Towards this end, our model will be madepublicly available through the SPOCK ¶ package, with training codealso available in a separate repo ‖ . Acknowledgements. ¶ https://github.com/dtamayo/spock ‖ https://github.com/MilesCranmer/bnn_chaos_model t i i i e e e m m m a a a Figure 4: Feature importances in the model, calculated as the root-mean-square of the gradients of the model’s output with respect to theinput features, and normalized.Miles Cranmer would like to thank Andrew Gordon Wilson andDan Foreman-Mackey for advice on Bayesian deep learning tech-niques, and Andrew Gordon Wilson, Dan Foreman-Mackey, and Se-bastian Wagner-Carena for comments on a draft of this paper. PhilipArmitage, Shirley Ho, and David Spergel’s work is supported by theSimons Foundation. This work made use of several Python softwarepackages: numpy (67), scipy (68), sklearn (69), jupyter (70), matplotlib (71), pandas (72), torch (73), pytorch-lightning (74), and tensorflow (75).

References [1] Eiichiro Kokubo and Shigeru Ida. Dynamics and accretionof planetesimals.

Progress of Theoretical and ExperimentalPhysics , 2012(1):01A308, 2012.[2] Kathryn Volk and Brett Gladman. Consolidating and crushingexoplanets: Did it happen here?

The Astrophysical JournalLetters , 806(2):L26, 2015. [3] Bonan Pu and Yanqin Wu. Spacing of kepler planets: sculptingby dynamical instability.

The Astrophysical Journal , 807(1):44,2015.[4] Scott Tremaine. The Statistical Mechanics of Planet Orbits.

Astrophysical Journal , 807(2):157, Jul 2015. doi: 10.1088/0004-637X/807/2/157.[5] Jack Wisdom. The resonance overlap criterion and the onset ofstochastic behavior in the restricted three-body problem.

TheAstronomical Journal , 85:1122–1133, 1980.[6] K. M. Deck, M. Payne, and M. J. Holman. First-order Res-onance Overlap and the Stability of Close Two-planet Sys-tems.

Astrophysical Journal , 774:129, September 2013. doi:10.1088/0004-637X/774/2/129.[7] Antoine C. Petit, Jacques Laskar, and Gwenaël Boué. Hill sta-bility in the AMD framework.

Astronomy & Astrophysics , 617:A93, Sep 2018. doi: 10.1051/0004-6361/201833088.[8] Sam Hadden and Yoram Lithwick. A criterion for the onset ofchaos in systems of two eccentric planets.

The AstronomicalJournal , 156(3):95, 2018.[9] Alice C Quillen. Three-body resonance overlap in closely spacedmultiple-planet systems.

Monthly Notices of the Royal Astronom-ical Society , 418(2):1043–1054, 2011.[10] Antoine C Petit, Gabriele Pichierri, Melvyn B Davies, and An-ders Johansen. The path to instability in compact multi-planetarysystems.

Astronomy & Astrophysics , 641:A176, 2020.[11] Daniel Tamayo, Miles Cranmer, Samuel Hadden, Hanno Rein,Peter Battaglia, Alysa Obertas, Philip J Armitage, Shirley Ho,David N Spergel, Christian Gilbertson, et al. Predicting the long-term stability of compact multiplanet systems.

Proceedings ofthe National Academy of Sciences , 117(31):18194–18205, 2020.[12] J. E. Chambers, G. W. Wetherill, and A. P. Boss. The Stabilityof Multi-Planet Systems.

Icarus , 119:261–268, February 1996.doi: 10.1006/icar.1996.0019.[13] Keiko Yoshinaga, Eiichiro Kokubo, and Junichiro Makino. Thestability of protoplanet systems.

Icarus , 139(2):328–335, 1999.[14] F. Marzari and S. J. Weidenschilling. Eccentric Extrasolar Plan-ets: The Jumping Jupiter Model.

Icarus , 156:570–579, April2002. doi: 10.1006/icar.2001.6786.815] Ji-Lin Zhou, Douglas NC Lin, and Yi-Sui Sun. Post-oligarchicevolution of protoplanetary embryos and the stability of plane-tary systems.

The Astrophysical Journal , 666(1):423, 2007.[16] Peter Faber and Alice C Quillen. The total number of giantplanets in debris discs with central clearings.

Monthly Noticesof the Royal Astronomical Society , 382(4):1823–1828, 2007.[17] Andrew W Smith and Jack J Lissauer. Orbital stability of systemsof closely-spaced planets.

Icarus , 201(1):381–394, 2009.[18] Yuji Matsumoto, Makiko Nagasawa, and Shigeru Ida. The or-bital stability of planets trapped in the ﬁrst-order mean-motionresonances.

Icarus , 221(2):624–631, November 2012. doi:10.1016/j.icarus.2012.08.032.[19] Boris V Chirikov. A universal instability of many-dimensionaloscillator systems.

Physics reports , 52(5):263–379, 1979.[20] Sam Hadden. An Integrable Model for the Dynamics of Plane-tary Mean-motion Resonances.

Astronomical Journal , 158(6):238, Dec 2019. doi: 10.3847/1538-3881/ab5287.[21] Daniel Tamayo, Christian Gilbertson, and Daniel Foreman-Mackey. Stability constrained characterization of multiplanetsystems, 2020.[22] Alex Krizhevsky, Ilya Sutskever, and Geoﬀrey E. Hinton. Im-agenet classiﬁcation with deep convolutional neural networks.In

Proceedings of the 25th International Conference on Neu-ral Information Processing Systems - Volume 1 , NIPS’12, page1097–1105, Red Hook, NY, USA, 2012. Curran Associates Inc.[23] Alex Graves, Abdel-rahman Mohamed, and Geoﬀrey Hinton.Speech recognition with deep recurrent neural networks. In , pages 6645–6649. IEEE, 2013.[24] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence tosequence learning with neural networks. In

Advances in neuralinformation processing systems , pages 3104–3112, 2014.[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

DeepLearning . MIT Press, 2016.[26] David G Lowe. Object recognition from local scale-invariantfeatures. In

Proceedings of the seventh IEEE international con-ference on computer vision , volume 2, pages 1150–1157. Ieee,1999.[27] Daniel Tamayo, Ari Silburt, Diana Valencia, Kristen Menou,Mohamad Ali-Dib, Cristobal Petrovich, Chelsea X Huang,Hanno Rein, Christa van Laerhoven, Adiv Paradise, et al. Amachine learns to predict the stability of tightly packed plan-etary systems.

The Astrophysical Journal Letters , 832(2):L22,2016.[28] Eric Agol, Caroline Dorn, Simon L Grimm, Martin Turbet, ElsaDucrot, Laetitia Delrez, Michael Gillon, Brice-Olivier Demory,Artem Burdanov, Khalid Barkaoui, et al. Reﬁning the transittiming and photometric analysis of trappist-1: Masses, radii,densities, dynamics, and ephemerides, 2020. [29] Lauren M Weiss, Geoﬀrey W Marcy, Erik A Petigura, Ben-jamin J Fulton, Andrew W Howard, Joshua N Winn, Howard TIsaacson, Timothy D Morton, Lea A Hirsch, Evan J Sinukoﬀ,et al. The california-kepler survey. v. peas in a pod: planetsin a kepler multi-planet system are similar in size and regularlyspaced.

The Astronomical Journal , 155(1):48, 2018.[30] Hanno Rein and Daniel Tamayo. whfast: a fast and unbiasedimplementation of a symplectic wisdom–holman integrator forlong-term gravitational simulations.

Monthly Notices of theRoyal Astronomical Society , 452(1):376–388, 2015.[31] Hanno Rein and S-F Liu. Rebound: an open-source multi-purpose n-body code for collisional dynamics.

Astronomy &Astrophysics , 537:A128, 2012.[32] David R Rice, Frederic A Rasio, and Jason H Steﬀen. Survivalof non-coplanar, closely packed planetary systems after a closeencounter.

Monthly Notices of the Royal Astronomical Society ,481(2):2205–2212, 2018.[33] Naireen Hussain and Daniel Tamayo. Fundamental limits fromchaos on instability time predictions in compact planetary sys-tems.

Monthly Notices of the Royal Astronomical Society , 491(4):5258–5267, 2019.[34] Alex Waibel, Toshiyuki Hanazawa, Geoﬀrey Hinton, KiyohiroShikano, and Kevin J Lang. Phoneme recognition using time-delay neural networks.

IEEE transactions on acoustics, speech,and signal processing , 37(3):328–339, 1989.[35] Yann LeCun, Bernhard Boser, John S Denker, Donnie Hen-derson, Richard E Howard, Wayne Hubbard, and Lawrence DJackel. Backpropagation applied to handwritten zip code recog-nition.

Neural computation , 1(4):541–551, 1989.[36] Waseem Rawat and Zenghui Wang. Deep convolutional neu-ral networks for image classiﬁcation: A comprehensive review.

Neural computation , 29(9):2352–2449, 2017.[37] Sepp Hochreiter and Jürgen Schmidhuber. Long short-termmemory.

Neural computation , 9(8):1735–1780, 1997.[38] Mathieu Andreux, Tomás Angles, Georgios Exarchakis, RobertoLeonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka,Stéphane Mallat, Joakim Andén, Eugene Belilovsky, JoanBruna, Vincent Lostanlen, Matthew J. Hirn, Edouard Oyallon,Sixin Zhang, Carmine Cella, and Michael Eickenberg. Kymatio:Scattering Transforms in Python, December 2018.[39] Carl Edward Rasmussen and Christopher K. I. Williams.

Gaus-sian Processes for Machine Learning . 2006.[40] Alysa Obertas, Christa Van Laerhoven, and Daniel Tamayo. Thestability of tightly-packed, evenly-spaced systems of earth-massplanets orbiting a sun-like star.

Icarus , 293:52–58, 2017. doi:10.1016/j.icarus.2017.04.010.[41] Andrew Gordon Wilson. The case for bayesian deep learning,2020.[42] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Ap-proximation: Representing Model Uncertainty in Deep Learn-ing, June 2015.943] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete Dropout, May2017.[44] Yashar D. Hezaveh, Laurence Perreault Levasseur, and Philip J.Marshall. Fast automated analysis of strong gravitational lenseswith convolutional neural networks.

Nature , 548(7669):555–557, August 2017. doi: 10.1038/nature23463.[45] Laurence Perreault Levasseur, Yashar D. Hezaveh, and Risa H.Wechsler. Uncertainties in Parameters Estimated with NeuralNetworks: Application to Strong Gravitational Lensing.

As-trophysical Journal, Letters , 850(1):L7, November 2017. doi:10.3847/2041-8213/aa9704.[46] Henry W. Leung and Jo Bovy. Deep learning of multi-elementabundances from high-resolution spectroscopic data.

MonthlyNotices of the RAS , 483(3):3255–3277, March 2019. doi: 10.1093/mnras/sty3217.[47] Sebastian Wagner-Carena, Ji Won Park, Simon Birrer, Philip J.Marshall, Aaron Roodman, and Risa H. Wechsler. HierarchicalInference With Bayesian Neural Networks: An Application toStrong Gravitational Lensing, October 2020.[48] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, andDaan Wierstra. Weight Uncertainty in Neural Networks, May2015.[49] Dustin Tran, Mike Dusenberry, Mark van der Wilk, and DanĳarHafner. Bayesian layers: A module for neural network uncer-tainty. In

Advances in Neural Information Processing Systems ,pages 14660–14672, 2019.[50] Christos Louizos and Max Welling. Multiplicative normalizingﬂows for variational bayesian neural networks, 2017.[51] Nick Pawlowski, Andrew Brock, Matthew C. H. Lee, MartinRajchl, and Ben Glocker. Implicit Weight Uncertainty in NeuralNetworks, November 2017.[52] David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner,Alexandre Lacoste, and Aaron Courville. Bayesian Hypernet-works, October 2017.[53] Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov,and Andrew Gordon Wilson. A Simple Baseline for BayesianUncertainty in Deep Learning, February 2019.[54] Andrew Gordon Wilson and Pavel Izmailov. Bayesian DeepLearning and a Probabilistic Perspective of Generalization,February 2020.[55] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, DmitryVetrov, and Andrew Gordon Wilson. Averaging Weights Leadsto Wider Optima and Better Generalization, March 2018.[56] Stephan Mandt, Matthew D Hoﬀman, and David M Blei.Stochastic gradient descent as approximate bayesian inference.

The Journal of Machine Learning Research , 18(1):4873–4907,2017.[57] Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A.Louis. Is SGD a Bayesian sampler? Well, almost, June 2020.[58] B Funk, G Wuchterl, R Schwarz, E Pilat-Lohinger, and S Eggl.The stability of ultra-compact planetary systems.

Astronomy &Astrophysics , 516:A82, 2010. [59] Dong-Hong Wu, Rachel C Zhang, Ji-Lin Zhou, and Jason HSteﬀen. Dynamical instability and its implications for planetarysystem architecture.

Monthly Notices of the Royal AstronomicalSociety , 484(2):1538–1548, 2019.[60] Almog Yalinewich and Cristobal Petrovich. Nekhoroshev esti-mates for the survival time of tightly packed planetary systems,2019.[61] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable treeboosting system. In

Proceedings of the 22nd ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining , KDD ’16, pages 785–794, New York, NY, USA, 2016.ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785.[62] Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

Theelements of statistical learning: data mining, inference, andprediction . Springer Science & Business Media, 2009.[63] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, AlvaroSanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski,Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner,Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer,George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Vic-toria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Push-meet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and RazvanPascanu. Relational inductive biases, deep learning, and graphnetworks, June 2018.[64] Miles D. Cranmer, Rui Xu, Peter Battaglia, and Shirley Ho.Learning Symbolic Physics with Graph Networks, September2019.[65] Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, RuiXu, Kyle Cranmer, David Spergel, and Shirley Ho. DiscoveringSymbolic Models from Deep Learning with Inductive Biases,June 2020.[66] Miles Cranmer. PySR: Fast & Parallelized Symbolic Regressionin Python/Julia, September 2020. URL https://doi.org/10.5281/zenodo.4052869 .[67] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt,Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser,Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern,Matti Picus, Stephan Hoyer, Marten H. van Kerkwĳk, MatthewBrett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe,Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, TylerReddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke,and Travis E. Oliphant. Array programming with NumPy.

Na-ture , 585(7825):357–362, September 2020. ISSN 1476-4687.doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2 .[68] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haber-land, Tyler Reddy, David Cournapeau, Evgeni Burovski, PearuPeterson, Warren Weckesser, Jonathan Bright, Stéfan J. van derWalt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Niko-lay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern,Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore,Jake Vand erPlas, Denis Laxalde, Josef Perktold, Robert Cimr-man, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M.Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van10ulbregt, and SciPy 1. 0 Contributors. SciPy 1.0: Funda-mental Algorithms for Scientiﬁc Computing in Python.

Na-ture Methods , 17:261–272, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2.[69] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vin-cent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vander-plas, Alexandre Passos, David Cournapeau, Matthieu Brucher,Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: MachineLearning in Python.

Journal of Machine Learning Research , 12(85):2825–2830, 2011.[70] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez,Brian Granger, Matthias Bussonnier, Jonathan Frederic, KyleKelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, PaulIvanov, Damián Avila, Saﬁa Abdalla, and Carol Willing. JupyterNotebooks – a publishing format for reproducible computationalworkﬂows. In F. Loizides and B. Schmidt, editors,

Positioningand Power in Academic Publishing: Players, Agents and Agen-das , pages 87 – 90. IOS Press, 2016.[71] Michael Droettboom, John Hunter, Thomas A Caswell, Eric Fir-ing, Jens Hedegaard Nielsen, Phil Elson, Benjamin Root, DarrenDale, Jae-Joon Lee, Jouni K. Seppänen, Damon McDougall, An-drew Straw, Ryan May, Nelle Varoquaux, Tony S Yu, Eric Ma,Charlie Moad, Steven Silvester, Christoph Gohlke, Peter Würtz,Thomas Hisch, Federico Ariza, Cimarron, Ian Thomas, JamesEvans, Paul Ivanov, Jeﬀ Whitaker, Paul Hobson, mdehoon, andMatt Giuca. matplotlib: matplotlib v1.5.1, January 2016.[72] Wes McKinney. Data Structures for Statistical Computing inPython. In Stéfan van der Walt and Jarrod Millman, editors,

Proceedings of the 9th Python in Science Conference , pages 56– 61, 2010. doi: 10.25080/Majora-92bf1922-00a.[73] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, JamesBradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, NataliaGimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf,Edward Yang, Zachary DeVito, Martin Raison, Alykhan Te-jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, JunjieBai, and Soumith Chintala. PyTorch: An Imperative Style,High-Performance Deep Learning Library. In H. Wallach,H. Larochelle, A. Beygelzimer, F. d ' Alché-Buc, E. Fox, andR. Garnett, editors,

Advances in Neural Information ProcessingSystems 32 , pages 8024–8035. Curran Associates, Inc., 2019.[74] WA Falcon. Pytorch lightning.

GitHub. Note:https://github.com/PyTorchLightning/pytorch-lightning , 3,2019.[75] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, AndyDavis, Jeﬀrey Dean, Matthieu Devin, Sanjay Ghemawat, Geof-frey Irving, Michael Isard, et al. Tensorﬂow: A system forlarge-scale machine learning. In { USENIX } Symposiumon Operating Systems Design and Implementation ( { OSDI } ,pages 265–283, 2016.[76] Laura Dietz. Directed factor graph notation for generative mod-els. Technical report, Max Planck Institute for Informatics, Saar-brücken, Germany, 2010.[77] Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, andweight decay, March 2018. [78] Leslie N. Smith and Nicholay Topin. Super-Convergence: VeryFast Training of Neural Networks Using Large Learning Rates,August 2017.11 ppendix Petit et al. (2020) Our modelFigure 5: Residuals for predictions of instability time on the 5-planet dataset in ﬁg. 3.

Parameters.

A planetary orbit can be described with six coordinates at any given time. We choose to use eccentricity ( 𝑒 ), semi-major axis( 𝑎 , normalized to the initial innermost planet’s value), inclination ( 𝑖 ), longitude of the ascending node ( Ω ), longitude of pericenter ( 𝜛 ), andtrue longitude ( 𝜃 ). We also pass the mass of each planet normalized to the star’s mass to each call of 𝑓 in eq. (2). For each of the angularcoordinates excluding inclination, we split into two values—one for the sine and one for the cosine of the value—before passing to the ﬁrstneural network. Likelihoods.

Here we give a mathematical derivation of the likelihood used to train our model. Our goal is to estimate the distribution 𝑃 ( 𝑇 | 𝑋 ) , for a timeseries data 𝑋 ∈ R + 𝑛 𝑡 . We construct a probabilistic model deﬁned by eqs. (1) to (5), with parameters 𝜃 ∈ R 𝑚 , where 𝑚 isthe total number of parameters, which takes a timeseries for 3 planets, 𝑋 , and produces a normal probability distribution over 𝑇 , parametrizedby 2 scalars: ( 𝜇 𝜃 ( 𝑋 ) , 𝜎 𝜃 ( 𝑋 )) . 𝜇 𝜃 is the center of the instability time, and 𝜎 𝜃 is the standard deviation in that estimate.The distribution over 𝑇 parametrized by the model is equal to: 𝑃 ( 𝑇 | 𝑋, 𝜃 ) =  𝐴 ( 𝜇 𝜃 ( 𝑋 ) ,𝜎 𝜃 ( 𝑋 ))√ 𝜋 𝜎 𝜃 ( 𝑋 ) exp (cid:18) − ( 𝑇 − 𝜇 𝜃 ( 𝑋 )) 𝜎 𝜃 ( 𝑋 ) (cid:19) , 𝑇 < (cid:32) + erf (cid:18) 𝜇 𝜃 ( 𝑋 )− √ 𝜎 𝜃 ( 𝑋 ) (cid:19)(cid:33) 𝑃 ( 𝑇 | 𝑇 ≥ ) , 𝑇 ≥ . (6)This distribution is motivated by two things. First, as in the paper (33), exoplanet instability times usually follow a normal distribution inlogarithmic instability time, regardless of how large this time is. Therefore, we predict a normal distribution in 𝑇 for times under 𝑇 < orbits, hence we use a model that is independent of 𝑃 ( 𝑇 | 𝑇 ≥ ) . Wecalculate the cumulative probability of the normal distribution falling 𝑇 ≥ 𝐴 isa normalization function from the fact that we cut oﬀ the probability at 𝑇 =

4. Thus, 𝐴 ( 𝜇 𝜃 ( 𝑋 ) , 𝜎 𝜃 ( 𝑋 )) = + erf (cid:169)(cid:173)(cid:171) 𝜇 𝜃 ( 𝑋 )− √︃ 𝜎 𝜃 ( 𝑋 ) (cid:170)(cid:174)(cid:172) . (7)This term, from our prior that 𝑇 >

4, helps remove bias from our model, as can be seen in ﬁg. 2. Without this term in the model, we wouldbe artiﬁcially punishing predictions at low 𝑇 values.Assuming we produce a pointwise dataset 𝐷 = {( 𝑇 𝑖 , 𝑋 𝑖 )} 𝑖 = 𝑁 via numerical integration, where 𝑇 𝑖 ≡ orbits, the log-likelihood for this model is equal to:log 𝑃 ( 𝐷 | 𝜃 ) ∝ ∑︁ 𝑖  − ( 𝑇 𝑖 − 𝜇 𝜃 ( 𝑋 𝑖 )) 𝜎 𝜃 ( 𝑋 𝑖 ) − log (cid:0) 𝜎 𝜃 ( 𝑋 𝑖 ) (cid:1) − log (cid:169)(cid:173)(cid:173)(cid:171) + erf (cid:169)(cid:173)(cid:171) 𝜇 𝜃 ( 𝑋 𝑖 )− √︃ 𝜎 𝜃 ( 𝑋 𝑖 ) (cid:170)(cid:174)(cid:172)(cid:170)(cid:174)(cid:174)(cid:172) , 𝑇 𝑖 < (cid:169)(cid:173)(cid:173)(cid:171) + erf (cid:169)(cid:173)(cid:171) 𝜇 𝜃 ( 𝑋 𝑖 )− √︃ 𝜎 𝜃 ( 𝑋 𝑖 ) (cid:170)(cid:174)(cid:172)(cid:170)(cid:174)(cid:174)(cid:172) , 𝑇 𝑖 ≡ , (8)assuming a ﬁxed prior 𝑃 ( 𝑇 | 𝑇 ≥ ) . Note how this decouples the loss for the stable values 𝑇 𝑖 ≥ 𝑃 ( 𝑇 | 𝑇 ≥ ) , meaning thechoice of prior will have no eﬀect on our model, and can be chosen by a user after training. Examples of this are plotted in ﬁg. 6.Figure 6: Example likelihoods for various choices of 𝜇, 𝜎 corresponding to eq. (6). Example priors are given for conﬁgurations stable past10 , though note that these have no eﬀect on the learned model.Now, we also marginalize the model parameters 𝜃 , to incorporate epistemic uncertainty, and account for model biases due to the randomseed used in initializing and training the model. We proceed as follows: 𝑃 ( 𝑇 | 𝑋, 𝐷 ) ∝ ∫ 𝑃 ( 𝑇, 𝜃 | 𝑋, 𝐷 ) 𝑑𝜃 (9) ∝ ∫ 𝑃 ( 𝑇 | 𝑋, 𝐷, 𝜃 ) 𝑃 ( 𝜃 | 𝑋, 𝐷 ) 𝑑𝜃 (10) ∝ ∫ 𝑃 ( 𝑇 | 𝑋, 𝜃 ) 𝑃 ( 𝜃 | 𝐷 ) 𝑑𝜃. (11)We ﬁrst maximize the likelihood of the model, to ﬁnd 𝑃 ( 𝜃 | 𝐷 ) . We factor the joint probability using ﬁg. 7, and proceed: 𝑃 ( 𝜃, 𝐷 ) ∝ 𝑃 ( 𝜃 ) (cid:214) 𝑖 ∫ 𝑃 ( 𝑋 𝑖 ) 𝑃 ( 𝑇 𝑖 | 𝜇 𝑖 , 𝜎 𝑖 ) 𝑃 ( 𝜇 𝑖 , 𝜎 𝑖 | 𝜃, 𝑋 𝑖 ) 𝑑𝜇 𝑖 𝑑𝜎 𝑖 (12)Using 𝑃 ( 𝜃 | 𝐷 ) ∝ 𝑃 ( 𝜃, 𝐷 ) 𝑃 ( 𝐷 ) (13) ⇒ 𝑃 ( 𝜃 | 𝐷 ) ∝ 𝑃 ( 𝜃 ) (cid:214) 𝑖 ∫ 𝑃 ( 𝑇 𝑖 | 𝜇 𝑖 , 𝜎 𝑖 ) 𝑃 ( 𝜇 𝑖 , 𝜎 𝑖 | 𝜃, 𝑋 𝑖 ) 𝑑𝜇 𝑖 𝑑𝜎 𝑖 , (14)where 𝑃 ( 𝑇 𝑖 | 𝜇 𝑖 , 𝜎 𝑖 ) is given by eq. (6) as the log-likelihood for our model, and 𝑃 ( 𝜇 𝑖 , 𝜎 𝑖 | 𝜃, 𝑋 𝑖 ) 𝑑𝜇 𝑖 𝑑𝜎 𝑖 is our forward model given eqs. (2)to (5). Finally, we can write down the loss of our model, our function to minimize, as the negative logarithm of this:Loss ( 𝜃 ) = − log (cid:0) 𝑃 ( 𝜃 ) (cid:1) − ∑︁ 𝑖 E ( 𝜇 𝑖 ,𝜎 𝑖 )∼ 𝑓 ( 𝑋 𝑖 ; 𝜃 ) log (cid:16) 𝑃 ( 𝑇 𝑖 | 𝜇 𝑖 , 𝜎 𝑖 ) (cid:17) , (15)where 𝑓 ( 𝑋 𝑖 ; 𝜃 ) is the combined model eqs. (2) to (5) for a given 𝜃 , and E is used to refer to the fact that z is sampled in eq. (3), so we averagethe loss over samples. We set 𝑃 ( 𝜃 ) equal to a zero-centered uninformative Gaussian prior over the parameters. If this were a neural densityestimator instead of a full Bayesian Neural Network, we would minimize this for a single value of 𝜃 . Alternatively, we can sample 𝜃 ∼ 𝑃 ( 𝜃 | 𝐷 ) with a Bayesian Neural Network (BNN) algorithm. We use the MultiSWAG algorithm to do this (54), as described in the main text, and ﬁndthe distribution 𝑃 MultiSWAG ( 𝜃 ) ≈ 𝑃 ( 𝜃 | 𝐷 ) . 13 𝑖 𝑗 𝑓 y 𝑖 𝑗 ( E [·] , Var [·]) z 𝑖 𝜃𝜃𝜃 𝜃 N 𝜇𝜇𝜇 𝜃 Σ 𝜃 𝑓 𝜃𝜃𝜃 𝜇 𝑖 𝜎 𝑖 N 𝑇 𝑖 For each time step 𝑗 ∈ { ..𝑛 𝑡 } For each simulation 𝑖 ∈ { ..𝑁 } For each mode of the parameter surface

Figure 7: Bayesian graphical model representing our inference scheme for the instability time with Bayesian neural networks, which goes fromtimeseries created via short-term integration ( { x 𝑖 𝑗 } 𝑗 = [ 𝑛 𝑡 ] ) to a prediction of a (logarithmic) instability time ( 𝑇 𝑖 ) for each simulation 𝑖 . 𝑓 and 𝑓 are neural networks parametrized by 𝜃 ≡ ( 𝜃𝜃𝜃 , 𝜃𝜃𝜃 ) . A distribution over 𝜃 is learned according to the likelihood 𝑃 ( 𝜃 ) (cid:206) 𝑖 𝑃 ( 𝑇 𝑖 |{ x 𝑖 𝑗 } 𝑗 ; 𝜃 ) .The model is given in eqs. (1) to (5). Notation follows that of (76). Vectors are bolded and matrices are capitalized. Training.

We have 𝑛 𝑡 =

100 uniformly spaced samples of our integration over 10 ,

000 initial orbits (the unit orbit is the initial period of theinnermost planet). During training, we randomly select between 5 and 100 time steps, with replacement, to feed to the model. This is a typeof data augmentation that improves generalization of the model.Since we are working with a varying number of orbit samples, we also sample the averages and variances from Gaussians over theirfrequentist distributions: 𝜇 ± Var 𝑡 [ y 𝑡 ] 𝑛 𝑡 for the mean, and 𝜎 ± 𝑡 [ y 𝑡 ] ( 𝑛 𝑡 − ) for the variance, where 𝑛 𝑡 is the number of samples, and Var 𝑡 is thesample variance. This will naturally allow the model to grow increasingly certain if a longer time series is given as input, since the averagesand variances of the transformed coordinates are less subject to small-sample uncertainty. Hyperparameters.

For our ﬁnal model, we set both 𝑓 and 𝑓 to be a MLPs with ReLU activations: 1 hidden layer and 40 hidden nodeseach (i.e., a matrix multiplication, a ReLU, a matrix multiplication, a ReLU, and another matrix multiplication). The number of calculatedtransformed features from 𝑓 is 20, meaning 𝑓 takes 40 features as input. We take 500,000 stochastic gradient descent optimization stepswith random batches of simulations with a batch size of 2000 with a cosine-annealed learning rate (77, 78) from 5 × − down to 5 × − This is followed by 50,000 additional steps at a ﬁxed learning rate (presumably within a mode of the posterior) of 2 . × − , to record pointsof a Gaussian mode on the weight posterior. Gradient clipping of the L2-norm is used with a clipping value of 0 .

1. A small amount of noiseis added to the input features and summary statistics to encourage the model to only use features that are useful: a KL-divergence loss termis added to the loss function on this noise, with a multiplier of 10 − on the input and 10 − on the summary. This noise is not added duringevaluation, only training. We choose 5 as the minimum number of time steps to pass to the model. We rescale the data to have a zero meanand unit variance for each feature (i.e., we normalize with a mean and variance calculated over the entire training set and time series). All ofthese parameters were found with the hyperopt ∗∗ package with a 20% validation set, with a smaller number of training steps and acceleratedlearning rate schedule. Finally, we train an ensemble of 30 independent models: this represents 30 modes of the weight posterior. For asummary of these details, see our training code at https://github.com/MilesCranmer/bnn_chaos_model . Approximating the cumulative distribution of a Gaussian.

Due to precision issues of 32-bit ﬂoating point numbers, our auto-diﬀerentiationsoftware, torch (73), is increasingly incapable of accurately calculating log (cid:0) + erf ( 𝑥 ) (cid:1) and its gradients as 𝑥 decreases below −

1. Becausewe heavily rely on this function in our log-likelihood for training our model, and need to pass gradients through it, we needed to approximateit for large negative 𝑥 values. Otherwise, we found that the gradients in our model would often approach very large values and training wouldfail. We approximate this function with an analytic continuation via symbolic regression using PySR (66). We generate values of this functionin high precision code, and then ﬁt analytical forms with

PySR . We ﬁnd that the function is very accurately approximated over 𝑥 ∈ [− , − ] ∗∗ https://github.com/hyperopt/hyperopt . . . . . . Error over sigma . . . . . . . D e n s i t y Model error distributionGaussian distribution

Figure 8: A comparison of the residual between our model and the true value with a true Gaussian of the same standard deviation, with thesame number of samples. What this shows is that if our model predicts an uncertainty of 𝜎 =

1, then 68% of the true values will be withinthat 𝜎 range, as expected for a true Gaussian distribution.by: log (cid:0) + erf ( 𝑥 ) (cid:1) ≈ − . + . ∗ 𝑥 − . ∗ 𝑥 + . ∗ 𝑥 + . ∗ exp ( 𝑥 ) , and this function has equivalent asymptotic properties. We therefore use this formula in place of log (cid:0) + erf ( 𝑥 ) (cid:1) in our learning code. torch code is given as follows, to replace any appearance of log (cid:0) + erf ( 𝑥 ) (cid:1) in code: def safe_log_erf (x): """ Compute log(1+erf(x)) with an approximate analytic continuation below -1""" base_mask = x < -1value_giving_zero = torch. zeros_like (x, device=x. device )x_under = torch.where(base_mask , x, value_giving_zero )x_over = torch.where (~ base_mask , x, value_giving_zero )f_under = lambda x: (0. 485660082730562 *x + 0. 643278438654541 *torch.exp(x) +0. 00200084619923262 *x ** 3 - 0. 643250926022749 - 0. 955350621183745 *x ** 2)f_over = lambda x: torch.log(1.0+torch.erf(x))return f_under ( x_under ) + f_over ( x_over) Theoretical limit.

In (33), the authors measure the distribution of instability times for various orbital conﬁgurations. Taking an initial orbitalconﬁguration, the authors perturb the system by machine precision, and measure the instability time, and repeat. For each system, the authorsthen measure the mean instability time, 𝜇 (in log-space), as well as the standard deviation, 𝜎 (in log-space, modeled as a log-normal). Whatthis means is that we can deﬁne a “theoretical limit” to the accuracy with which we can predict the instability time, and this accuracy isbounded by 𝜎 , for we cannot predict the instability time for a given system better than within one 𝜎 standard deviation on average.For the purposes of this paper, we simulate an optimal estimator by taking a particular instability time, and then making a random predictionfor its instability time within one 𝜎 of the actual instability time. (33) found that 𝜎 , while it is diﬀerent for diﬀerent conﬁgurations, does notcorrelate to 𝜇 , so for the numerical value of 𝜎 , we simply randomly select numerical 𝜎 values from those released for (33). On average, thesestandard deviations are 0 ..