A Bayesian neural network predicts the dissolution of compact planetary systems
Miles Cranmer, Daniel Tamayo, Hanno Rein, Peter Battaglia, Samuel Hadden, Philip J. Armitage, Shirley Ho, David N. Spergel
AA Bayesian neural network predicts the dissolution of compact planetary systems
Miles Cranmer a, † , Daniel Tamayo a , Hanno Rein b,c , Peter Battaglia d , Samuel Hadden e ,Philip J. Armitage f,g , Shirley Ho g,a,h , and David N. Spergel g,a a Princeton University, Princeton, New Jersey, USA b University of Toronto at Scarborough, Toronto, Ontario, Canada c University of Toronto, Toronto, Ontario, Canada d DeepMind, London, UK e Center for Astrophysics | Harvard & Smithsonian, Cambridge, Massachusetts, USA f Stony Brook University, Stony Brook, New York, USA g Flatiron Institute, New York, New York, USA h Carnegie Mellon University, Pittsburgh, Pennsylvania, USA † Corresponding author. Email: [email protected]
Abstract
Despite over three hundred years of effort, no solutions exist for predicting when a general planetary config-uration will become unstable. We introduce a deep learning architecture to push forward this problem forcompact systems. While current machine learning algorithms in this area rely on scientist-derived instabilitymetrics, our new technique learns its own metrics from scratch, enabled by a novel internal structure inspiredfrom dynamics theory. Our Bayesian neural network model can accurately predict not only if, but also when acompact planetary system with three or more planets will go unstable. Our model, trained directly from shortN-body time series of raw orbital elements, is more than two orders of magnitude more accurate at predictinginstability times than analytical estimators, while also reducing the bias of existing machine learning algo-rithms by nearly a factor of three. Despite being trained on compact resonant and near-resonant three-planetconfigurations, the model demonstrates robust generalization to both non-resonant and higher multiplicityconfigurations, in the latter case outperforming models fit to that specific set of integrations. The modelcomputes instability estimates up to five orders of magnitude faster than a numerical integrator, and unlikeprevious efforts provides confidence intervals on its predictions. Our inference model is publicly available inthe
SPOCK ∗ package, with training code open-sourced ‡ . The final growth of terrestrial bodies in current theories of planetformation occurs in a phase of giant impacts (1). During this stage, thenumber of planets slowly declines as bodies collide and merge (2, 3).Close planetary encounters and the wide dynamic range exhibited bythe times between consecutive collisions computationally limit currentnumerical efforts to model this process. Two theoretical roadblocksimpede the development of a more efficient iterative map for modelingplanet formation. First, one must predict a distribution of instabilitytimes from a given initial orbital configuration. Second, one mustpredict a distribution of post-collision orbital architectures (e.g., 4)subject to mass, energy and angular momentum constraints. Towardsthis end, we focus on the long-standing former question of instabilitytime prediction.In the compact dynamical configurations that characterize theplanet formation process, the simpler two-planet case is understoodanalytically. In this limit, instabilities are driven by the interactionsbetween nearby mean-motion resonances (MMRs), i.e., integer com-mensurabilities between the orbital periods of the planets like the 3:2MMR between Pluto and Neptune (5–8). While the general higher-multiplicity case is not yet understood, two important results guideour analysis and provide an important test for any model. First,when planets are initialized on circular orbits, chaos is driven by theoverlap of 3-body MMRs between trios of adjacent planets (9), andtheoretical estimates of the timescale required for the orbits to reach ∗ https://github.com/dtamayo/spock ‡ https://github.com/MilesCranmer/bnn_chaos_model orbit-crossing configurations accurately match numerical integrations(10). As we show below, such analytical estimates perform poorly inthe generic eccentric case where the effects of 2-body MMRs aredominant (10, 11). However, analytical and empirical studies agreethat while the dynamical behavior changes strongly from the two tothree-planet case (3, 12–18), three-planet systems are the simplestprototype for predictions at higher multiplicities in compact systems(10, 11).We recently presented a machine learning model, dubbed the Sta-bility of Planetary Orbital Configurations Klassifier, or SPOCK,trained to classify the stability of compact planetary systems overtimescales of 10 orbits (11). This represented a long-term effort toexploit the substantial but incomplete current analytical understanding(5, 6, 8, 19, 20) to engineer summary metrics that captured these sys-tems’ chaotic dynamics; these features were then used by the machinelearning model to classify whether the input configuration would bestable over 10 orbits.While simple binary stability classification is effective for con-straining physical and orbital parameters consistent with long-termstability (21), other applications like modeling terrestrial planet for-mation require the prediction of continuous instability times. Ad-ditionally, several fields in which it is challenging to find effectivehand-picked features—such as computer vision, speech recognition,and text translation—have been revolutionized by neural networks inthe last decade (notable early breakthroughs include 22–24). Ratherthan relying on domain expert input, these flexible models learndata-driven features that can often significantly outperform human-engineered approaches. A key theme with deep learning models is1 a r X i v : . [ a s t r o - ph . E P ] J a n hat their structure resembles the hand-designed algorithm, but withadded flexibility parametrized by neural networks (for discussion, see25). For example, modern computer vision models consist of learnedconvolutional filters which take the place of hand-designed filters inclassic algorithms (26).Pursuing a deep learning approach, we present a neural networkthat, trained only on short-time series of the orbits in compact plane-tary systems, not only improves on long-term predictions of previousmodels based on engineered features (11, 27), but also significantlyreduces the model bias and improves generalization beyond the train-ing set. We design our model as a Bayesian neural network, whichnaturally incorporates confidence intervals into its instability time pre-dictions, accounting both for model uncertainty as well as the intrinsicuncertainty due to the chaotic dynamics. Finally, unlike previous ma-chine learning models based on decision trees (11, 27), our modelis differentiable. That is, we can extract from the model estimatesof the derivatives of the predicted instability times with respect tothe parameters defining the orbital configuration in question. Suchgradient information can significantly speed up parameter estimationusing Hamiltonian Monte Carlo techniques (28). We focus on the regime leading to typical compact multi-planet sys-tems observed to date, with mass ratios with the central star rangingfrom 10 − (roughly the ratio of the Moon-mass embryos thought toinitially characterize the giant impact phase, relative to the Sun) to10 − (roughly Neptune’s mass relative to the Sun). As detailed inthe appendix, we place planets on nearly coplanar orbits, with ad-jacent planets spaced within 30 mutual Hill radii § of one another(e.g., 29). Orbital eccentricities in observed systems are often poorlyconstrained, so we consider the range from initially circular to orbit-crossing values.We train our model on the set of 113,543 publicly available, com-pact three-planet configurations in and near strong resonances from(11). In particular, this “resonant” dataset initializes one pair of plan-ets in or near a strong MMR, while the third planet’s orbital parametersare chosen randomly. This choice focuses the training set on the nar-row resonant regions of phase space where the dynamical behaviorchanges most strongly, and we later test the model’s generalization tonon-resonant systems in section 3.Each initial condition was integrated for 10 orbits of the innermostplanet using the WHFast integrator (30) in the REBOUND N-bodypackage (31). If at any point two planets came within a distance of oneanother given by the sum of their Hill radii, the simulation was stoppedand the instability time was recorded. Because gravity is scale invari-ant, the instability time 𝑡 inst is most usefully non-dimensionalized bythe innermost orbital period 𝑃 orb . Given the large dynamic range intimescales over which instabilities can occur, we define the dimen-sionless log instability time 𝑇 ≡ log ( 𝑡 inst / 𝑃 orb ) . Configurationswith instability times longer than 10 orbits ( 𝑇 >
9) were labeled asstable, and integration was stopped.
To predict systems’ instability times, we perform a computationallycheap numerical integration of the first 10 orbits, and use this time § The mutual Hill radius 𝑅 𝐻 is a relevant length scale within which the gravity of theplanets dominates that of the star 𝑅 𝐻 ≈ 𝑎 ( 𝜇 + 𝜇 ) / , where 𝑎 is the inner planet’ssemi-major axis, and 𝜇 and 𝜇 are each planet’s mass ratio relative to the star. series to make long-term predictions. Each of the three planets’ 3Dpositions and velocities correspond to six standard orbital elements(the appendix), which we record at 𝑛 𝑡 =
100 equally spaced outputsacross the short integration. In addition we pass the three constantmass ratios for each planet relative to the star, as well as the timevalue, for a combined input matrix of real values 𝑋 ∈ R + 𝑛 𝑡 for agiven configuration.Because the dynamics of compact multi-planet systems are chaotic,instability times for a given initial orbital configuration are effectivelynon-deterministic. Nevertheless, numerical experiments (32, 33) haveshown that instability times for unstable, compact multi-planetary sys-tems settle to well defined, approximately log-normal distributions.Thus, rather than predicting a single instability time for a given orbitalconfiguration, our model maps from an input initial orbital configu-ration to a predicted log-normal distribution of instability times, i.e.,a Gaussian distribution of 𝑇 with mean 𝜇 and variance 𝜎 . Thisgives the network the flexibility to both model the fundamental uncer-tainties imposed by the chaotic dynamics, and to incorporate modeluncertainty into its predictions by assigning larger widths to configu-rations it is less sure about.In our initial efforts, we experimented with various combinationsof convolutional neural networks (see reviews by 34–36), long short-term memory networks (37), 1D scattering transforms (38), regularmulti-layer perceptrons (MLP, see 25) and gaussian processes (39).All of these models underperformed or tended to overfit the data.The fundamental challenge for making such predictions is the sharptransitions in dynamical behavior at MMRs, where instability timescan change by several orders of magnitude (40) over sub-percentchanges in planetary orbital periods, i.e., in the original space oforbital elements. We found substantially improved performance bystructurally splitting the problem into three components: 1) Find atransformation from the sharply punctuated space of orbital elementsto new variables. 2) Calculate statistical summaries of the time seriesin these transformed variables. 3) Use these summary features topredict a log-normal distribution over instability times for the inputorbital configuration, parametrized by mean 𝜇 and variance 𝜎 . Thisis illustrated in fig. 1.We model steps 1 and 3 with neural networks 𝑓 and 𝑓 , respec-tively. In step 2, we choose to use a well-motivated but non-standardpooling operation, and calculate both the mean and variance for eachtransformed time series (deep neural networks traditionally use one ofeither sum, mean, or max pooling). This structure was motivated bythe hand-engineered approach of (11), whose features are means andvariances of theoretically-motivated instability metrics. In this case,we give the machine learning model additional flexibility to learn itsown instability metrics. This design process is analogous to how theinvention of convolutional neural networks was motivated by creat-ing similar structures to hand-engineered algorithms which convolvefilters over an image. Our model is parametrized by 𝑚 neural network weights 𝜃 ≡( 𝜽 , 𝜽 ) , 𝜃 ∈ R 𝑚 ( 𝜽 for 𝑓 and 𝜽 for 𝑓 ). Defining the trainingset 𝐷 as the collection of input orbital configurations and their asso-ciated N-body instability times, we seek the most likely set of modelparameters given the data, i.e., we maximize 𝑃 ( 𝜃 | 𝐷 ) , which is in turnproportional to 𝑃 ( 𝐷 | 𝜃 ) 𝑃 ( 𝜃 ) .Our model predicts a log-normal distribution of instability times forany input orbital configuration. For a given set of network weights 𝜃 ,the likelihood 𝑃 ( 𝐷 | 𝜃 ) is then simply the product of the probabilitiesthat each training set example’s output 𝑇 𝑖 is drawn from the associated2 ime NumericalIntegration:
System goesunstable atdistant time TShort timeintegration recordedTime
Single scalar
Orbital elementsover time: ...
Neural Network 1( f ) Learned featuresover time: ...
Compute average andvariance over timefor each learned feature ... ...
Neural Network 2( f ) Time
Figure 1: Schematic of our model. The model numerically integrates 10,000 orbits for a compact 3-planet system (top) and records orbitalelements at 100 times. Neural network 𝑓 creates learned summary features from these elements at each time. Neural network 𝑓 takes theaverage and variance of these features as input and estimates a distribution over possible instability times.3aussian N ( 𝜇 𝑖 , 𝜎 𝑖 ) predicted by the model. As discussed above,this choice is motivated by the numerical result that the distributionin 𝑇 is normal, for different configurations with a wide range of meaninstability times (33).Note that we have 4 < 𝑇 ≤ 𝑇 <
𝑇 > 𝑇 = 𝑇 . Thus, we build a truncated normal distribution with a cutoff at 𝑇 =
4, and the cumulative probability of the Gaussian above 𝑇 = 𝑃 ( 𝜃 ) is a prior on the neural network’s parameters. The per-parameter prior here is unimportant: what matters is the prior inducedon the output of the model , and we use an uninformative Gaussian prioron parameters to induce an uninformative prior on the output. See(41) for a detailed discussion of priors in Bayesian deep learning. By having our model predict a distribution of instability times with afinite width, we account for intrinsic uncertainty (sometimes referredto as “aleatoric” uncertainty). However, we also wish to include ex-trapolation uncertainty (or “epistemic” uncertainty) for systems thatdiffer from those found in the training set. To do this we marginal-ize over potential model parameters, with what is referred to as a“Bayesian Neural Network” (BNN). This is a neural network whoseparameters are distributions rather than point values, and is trainedwith Bayesian inference. These models naturally fold in extrapolationuncertainty via marginalization.This concept is familiar in traditional statistical inference, whereone can marginalize out the internal nuisance parameters of a modelusing Markov Chain Monte Carlo (MCMC) techniques. The factthat neural networks typically have millions of parameters rendersMCMC computationally prohibitive, and various practical simplifica-tions are adopted for implementing a BNN. The most common strategyis Monte Carlo dropout (42, 43) which treats the neural network’s pa-rameters as independent Bernoulli random variables, and has beenused in several astronomical applications (44–47). A selection ofother techniques includes Bayes by Backprop (48), Bayesian Layers(49), variants of normalizing flows (e.g., 50), Bayes by Hypernet(51, 52), and many other strategies. One recently-proposed strategy,named “MultiSWAG” (53, 54) learns a distribution over the posteriorof weights that best fit the training set, without a diagonal covarianceassumption, and is much closer to standard MCMC inference. Weexperimented with a selection of common techniques: Monte Carlodropout, Bayes by Backprop, and MultiSWAG, and found that “Mul-tiSWAG” produced the best accuracy and uncertainty estimations onthe validation dataset.To move beyond a single best-fit set of parameters 𝜃 , SWAG, or“Stochastic Weight Averaging Gaussian” (53, 55), instead fits a Gaus-sian to a mode of the posterior over 𝜃 , assuming a low-rank covariancematrix. This was extended in (54) to “MultiSWAG,” which repeatsthis process for several modes of the weight posterior, to help fill outthe highly degenerate parameter space. This technique is summarizedbelow:1. Train 𝑓 and 𝑓 simultaneously via stochastic gradient descentuntil the parameters settle into a minimum of the weight posterior.2. Increase the learning rate and continue training. This causesthe optimizer to take a random walk in parameter space nearthe minima, which is assumed to look like a high-dimensionalGaussian. 3. Accumulate the average parameters along this random walk aswell as a low-rank approximation of the covariance matrix.4. The average parameters not only provide better generalizationperformance (stochastic weight averaging or SWA), but we haveadditionally fit a Gaussian to a mode of the parameter posterior.We can thus sample weights from this Gaussian to marginalizeover parameters. This is SWAG (53).5. The next step is to repeat this process from a different randominitialization of the parameters. This will find another mode ofthe parameter posterior.6. Fit ∼
30 different modes of the parameter space. We can thensample weights from multiple modes of the parameter posterior,which gives us a more rigorous uncertainty estimates. This isMultiSWAG (54).Training a neural network through stochastic gradient descent isapproximately the same as Bayesian sampling of the weights (56, 57),so this aforementioned process allows one to learn a Bayesian posteriorover the weights of a neural network 𝑃 MultiSWAG ( 𝜃 ) .Once we have learned 𝑃 MultiSWAG , we can draw from it to samplea set of network weights 𝜃 . This model then predicts a log-normaldistribution of instability times with mean 𝜇 and variance 𝜎 for thegiven input orbital configuration, from which we can finally sample alog instability time T. We can write a forward model for this predictionas follows: ( 𝜽 , 𝜽 ) ∼ 𝑃 MultiSWAG ( 𝜃 ) , (1) y 𝑡 = 𝑓 ( x 𝑡 ; 𝜽 ) for all x 𝑡 ≡ 𝑋 : ,𝑡 , (2) z ∼ ( E 𝑡 [ y 𝑡 ] , Var 𝑡 [ y 𝑡 ]) , (3) ( 𝜇, 𝜎 ) = 𝑓 ( z ; 𝜽 ) , (4) 𝑇 instability = 𝑇 for 𝑇 ∼ N ( 𝜇, 𝜎 ) , (5)where 𝑡 is a time step from 1 to 100. Here, we have labeled y 𝑡 asthe learned transformed variables for a single time step of the system(brown cells in fig. 1), and z as the average and variance of these trans-formed variables over time. To account for statistical errors due to ourfinite number of time series samples, we sample the z from normaldistributions with frequentist estimates of the variance in the samplemean and variance: Var 𝑡 [ y 𝑡 ] 𝑛 𝑡 and 𝑡 [ y 𝑡 ] ( 𝑛 𝑡 − ) , respectively. A Bayesiangraphical model for this is shown in the appendix. Repeatedly sam-pling in this way provides a predicted distribution of T given the inputorbital configuration, marginalized over the posterior distribution ofnetwork weights 𝜃 .We split our data into 60/20/20% train/validation/test, train ourmodel on 60% of our ≈ ,
000 training examples of resonant andnear resonant systems, and validate it on half of the remaining data totune the hyperparameters. Hyperparameters for our model are givenin the supplementary material, and we also release the code to trainand evaluate our model.With this trained model, we then explore its performance on theremaining 20% holdout data from the resonant dataset, as well asother datasets described below.
For a given orbital configuration, our probabilistic model producesone sample of 𝑇 . If a given sample is above 𝑇 =
9, we treat the4ample as a “stable” prediction. Since we are unable to make specifictime predictions above the maximum integration time in our trainingdataset of 𝑇 =
9, we resample from a user-defined prior 𝑃 ( 𝑇 | 𝑇 ≥ ) for each occurrence. For the purposes of this study, we assume asimple analytic form for this prior, though followup work on this prioris ongoing (see supplementary).For all results, we sample 10,000 predicted values of the posteriorover 𝑇 per planetary system. We compare our predictions againstseveral alternatives which are explained below. Since the models wecompare against can only produce point estimates while our modelpredicts a distribution, we take the median of our model’s predictedposterior over 𝑇 . This is used for plotting points, as well as forcomputing root-mean-square prediction errors.We first compute the N-body versus predicted (median) 𝑇 valueover the holdout test dataset of ≈ ,
000 examples not seen duringtraining, which can be seen in the bottom middle panel of fig. 2. Wereiterate that the N-body instability times measured for the variousorbital configurations in our training set are not ‘true’ instability times,but rather represent single draws from the different planetary systems’respective instability time distributions, established by their chaoticdynamics. To estimate a theoretical limit (bottom right panel of fig. 2)we use the results from (33), who find that the T values measuredby N-body integrations (x-axis of fig. 2) should be approximatelynormally distributed around the mean instability time predicted by anideal model. We use a random standard deviation drawn from thevalues measured empirically for compact systems by (33), which theyfind are sharply peaked around ≈ .
43 dex, independent of whether ornot the system is near MMRs, and valid across a wide range of meaninstability times. We plot this representative intrinsic width of 0.43dex as dotted lines on all panels for comparison.While we defer a detailed comparison to previous work to thefollowing section, we measure a root mean square error (RMSE) of1.02 dex for our model on the holdout test set. We note that whilethe RMSE is an intuitive metric for comparing models, it does notprovide a full picture for a model that is trained on a different lossfunction to predict both 𝜇 and 𝜎 , A model that can predict its own 𝜎 will sacrifice worse 𝜇 accuracy in challenging regions of parameterspace to better predict it on more easily predictable configurations.For comparison, if we weight the RMSE by the predicted signal-to-noise ratio (SNR), 𝜇 / 𝜎 , the model achieves 0.87 dex, within afactor of ≈ 𝜎 accurately capture the spread of N-body times around thepredicted mean values 𝜇 . For each test configuration, we predict 𝜇 ,subtract it from its respective 𝑇 measured by N-body integration, anddivide by the predicted 𝜎 . If this distribution approximates a Gaussiandistribution of zero mean and unit variance, the model’s uncertaintyestimates are accurate. We find that a Komolgorov-Smirnov test can-not confidently distinguish our predictions from this ideal Gaussian(p-value of 0.056), and plot the two distributions in fig. 8 of thesupplementary material. Guided by the dynamical intuition that short-timescale instabilitiesare driven by the interaction of MMRs (5, 8, 11), we chose to train ourmodel on systems with particular period ratios and orbital elements
Resonant
Model RMSE Classif.accur. Bias † (4, 5) Bias(8, 9)Obertas et al. (2017) 2.12 0.628 1.04 -1.71Petit et al. (2020) 3.22 0.530 3.99 0.54Tamayo et al. (2020) 1.48 0.946 2.07 -0.62Modified ∗ Tamayo+20 0.99 0.946 0.65 -0.60Ours 1.02 0.952 0.29 -0.38Ours, SNR-weighted 0.87 0.971 0.18 -0.25
Theoretical limit 0.43 0.992 0.05 -0.04
Random
Obertas et al. (2017) 2.41 0.721 2.15 -0.93Petit et al. (2020) 3.09 0.517 4.17 0.50Tamayo et al. (2020) 1.24 0.949 1.16 -0.59Modified ∗ Tamayo+20 1.14 0.945 0.79 -0.70Ours 1.20 0.939 0.40 -0.51Ours, SNR-weighted 1.09 0.959 0.23 -0.49
Theoretical limit 0.44 0.989 0.06 -0.04 (dex) (AUC) (dex) (dex) † Average difference between predicted minus true 𝑇 in given range ∗ Modified and retrained for regression.Table 1: Statistical summaries of each estimator applied to a holdouttest portion of the resonant dataset, and all of the random dataset.in the narrow ranges near such resonances where the dynamical be-havior changes sharply (40). It is therefore important to test how wellsuch a model generalizes to a more uniform coverage of parameterspace, given that most observed orbital architectures are not in MMRs(possibly because such configurations typically have short lifetimesand have been eliminated). Additionally, previous work has typicallyignored the sharp variations near MMRs to fit overall trends in insta-bility times (40), so a test on resonant systems would not provide afair comparison.For this generalization test and comparison, we use the ‘random’dataset of (11), 25,000 three-planet systems with the same mass ra-tio and inclination distributions as above, and eccentricities drawnlog-uniformly from ≈ − to orbit-crossing. Rather than drawingnear-integer period ratios as in our resonant training set, the spacingbetween adjacent planets is drawn uniformly between [3.5, 30] mutualHill radii (see 11).We find that our model exhibits only a minor loss in performance(1.20 vs 1.02 dex RMSE) generalizing to this uniform distributionof orbital configurations table 1. This supports the assertion thatinstabilities in compact systems within 10 orbits are dominantlydriven by the MMRs we focused on in our training sample (11).To compare our results to the extensive set of past efforts, we divideprevious approaches into three broad groups.First, many authors have run N-body integrations along low-dimensional cuts through the parameter space of compact orbital con-figurations, and fit simple functional forms to the resulting trends ininstability times. For example, several studies have highlighted thesteep dependence on interplanetary separation, while fixing orbits tobe coplanar and initially circular, and planets to be equal mass andequally separated from one another (12, 14, 16, 17, 40). We comparethe performance of the fit from such a study (40), using five equallyspaced Earth-mass planets (mass ratio ≈ × − ) on our random5 he model of Obertas et al. (2017) The model of Petit et al. (2020) Tamayo et al. (2020), retrained for regressionOur model Our model, applied to the resonant dataset Theoretical limitFigure 2: Density plots showing the predicted versus true instability times for various models on the random dataset. All predictions outside of 𝑇 ∈ [ , ] are moved to the edge of the range. For the plot showing the predictions of our model, transparency shows relative model-predictedSNR. The theoretical limit, using the numerical experiments of (33), is given in the lower-right plot. The 0 .
43 dex RMSE average from this isused to give the dotted contours in all plots.test set in the top left panel of fig. 2, with a resulting RMSE of 2.41.Followup studies have incorporated the effect of finite inclinationsand eccentricities (13, 15, 58, 59), but consider equal initial eccen-tricities, planetary masses, etc. in order to fit simple models. Weconclude that while such controlled experiments yield insight into theunderlying dynamics (9, 15, 60), instability times depend sensitivelyon masses and several orbital parameters, rendering empirical fits tolow-dimensional cuts in the parameter space of limited applicability.In the following section we perform the converse generalization test,where we ask our model to instead predict on the N-body simulationsused for the fit by (40), and find good agreement.Second, previous authors have developed analytical instability timeestimates from first principles. These have been most successful inthe limit of initially circular orbits, where three-body MMRs havebeen identified as the dominant driver of chaos (9). Recent work (10)has extended this theory to provide accurate instability time estimates.We will compare the predictions of our model to this limit of initiallycircular orbits in the next section. Here we simply emphasize the pointby (10) that such predictions perform poorly at finite eccentricities(top middle panel of fig. 2), likely due to the dominant effects ofstronger two-body MMRs. The fact that the analytic model predictsthe vast majority of systems to be stable implies that most of our testconfigurations would be stable on circular orbits, but that finite orbitaleccentricities strongly modulate instability times.The final approach is to make predictions across the high-dimensional space of orbital configurations using machine learning (11, 27). We consider two variants of (11) adapted for regression.The first, labeled ‘Tamayo et al. (2020)’, is to simply use an identicalmodel as (11), but map the probability estimates of stability past 10 orbits through an inverse cumulative distribution of a log-normal withan optimized constant standard deviation. The second, labeled ‘Mod-ified Tamayo+20’, the model is an XGBoost (61) regression model(rather than classification) retrained on the same features as used in(11).We find that our model achieves similar performance to the Modi-fied Tamayo+20 variant (top right panel of fig. 2, table 1), though thelatter exhibits significant bias. We quantify this bias for each model inthe range 𝑇 ∈ ( , ) and 𝑇 ∈ ( , ) . As is evident in table 1 as well asfig. 2, the model introduced in this work exhibits significantly reducedbias compared to other models. Including SNR weighting furtherreduces bias. Bias is a measure of the generalizability of a modelto out-of-distribution data (see chapter 7 of 62), so is an importantmetric for understanding how these predictive models will extrapolateto new data. Our model achieves predictions that are more than twoorders of magnitude more accurate than the analytic models in eachcase: e.g., 10 . / . ≈ × when comparing our SNR-weightedmodel with (40) on the random dataset.Finally, we can make a comparison to the original classificationmodel of (11) by using our regression model as a form of classifier.We count the fraction of samples above 𝑇 = As a second generalization test of our model, we compare its per-formance on the limiting case considered by (40). This case of fiveequally spaced, Earth-mass planets on initially circular and coplanarorbits differs significantly from our training set of resonant and nearresonant, eccentric and inclined configurations of three planets withunequal masses. This dataset contains 17500 simulations numericallyintegrated for 10 orbits (40). This generalization to a limiting set ofhigher multiplicity configurations provides a stringent test of whetherthe model has learned features of the dynamics or whether it is naivelyinterpolating across the distribution of orbital configurations presentin our training dataset.To extend our three-planet predictions to higher multiplicity sys-tems, we perform the same short integration for all planets, but passtime series for each adjacent trio of planets to the model separately.The model samples a single instability time for each adjacent trio,and the minimum across this set is adopted as the instability time forthe system, as an estimate of the time for the first trio to go unstable.This procedure is then repeated, and we record the median and confi-dence intervals of the resultant distribution in 𝑇 . Such a reduction ofcompact multi-planet systems to sets of adjacent trios has been pro-posed on theoretical (9, 10) as well as empirical (11) grounds. Thisis motivated by the fact that the perturbative effects of planets on oneanother fall off exponentially with separation (9, 10), so non-adjacentinteractions can largely be ignored.The predictions can be seen in fig. 3, and are remarkably accu-rate despite our model never seeing a system with five planets duringtraining. We overplot the analytical result of (10) in magenta devel-oped from first principles for such cases with initially circular orbits,including a manual correction for five-planet systems. Our modelcaptures the same overall trend, but additionally identifies the variousdips, which correspond to locations of different MMRs (40). Weemphasize that our model was trained on the general eccentric casewhere the magenta model of (10) does not apply (fig. 2), yet the gener-alization to this limiting case is excellent. In addition to matching theoverall trend of (10), our model captures the additional instability timemodulations at MMRs, as can be seen more clearly in the residuals infig. 5 of the appendix. Additionally, our new model generalizes muchbetter than the predictions of the modified regression model based on(11) based on manually engineered features (gold). In industry machine learning, one is focused on making predictions asaccurate as possible, even at the expense of a more interpretable model.However, in physics, we are fundamentally interested in understandingproblems from first principles.Obtaining such an explicit interpretation of our model will be dif-ficult. However, as a first step we consider the feature importancesof our model: what orbital elements is it using to make predictions,and does this align with expectations? To do this feature analysis, weexploit the differentiability of our model with respect to its input. Wecalculate the gradient of the predicted 𝜇 value our model with respectto the input instantaneous orbital elements; this is also referred to asa “saliency map” (25). This gives us a multi-dimensional array overfeature, simulation, time step, and model, representing how much thepredicted 𝜇 value will change should that feature be infinitesimally in-creased. We compute the variance of the gradients over time and eachsimulation, and then average these variances over sampled network weights 𝜃 . This gives us a coarse representation of the importance ofeach feature. We show this visually in fig. 4.To compare these importances to other work, (11) argue empiricallythat the short-timescale instabilities we probe here in compact systemsare driven specifically by the interactions between MMRs. A classicalresult of celestial mechanics is that in the absence of such MMRs, thelong-term dynamics keeps the semi-major axes fixed. Variations inthe semi-major axes during the short integrations thus act as a proxyfor the importance of nearby MMRs (27), and we see that indeed, thesemi-major axes exhibit the highest feature importance in our modelfig. 4. The fact that the model ascribes comparable feature importanceto any given orbital element for each of the three planets also suggestsa physically-reasonable model.We note that there is a small but nonzero significance of the “Time”instantaneous feature. This can be interpreted as being importantbecause the model takes the first 10 orbits as input, and can predictinstability for the system as low as 10 . . Thus, the orbital parametersgiven at 10 orbits may be more important than the orbital parametersat 10 orbits for predicting such unstable systems, and thus the timefeature is used. The time feature would be less important for a systemthat goes unstable near 10 orbits, as the relative importance of thesystem’s parameters at 10 orbits is comparable to that at 10 orbit.Because we chose to structure our model to take means and vari-ances of the times series of the learned transformed variables fig. 7,it may be possible to extract explicit interpretations of our model viasymbolic regression. Given that our approach is structurally similar tothat of a graph neural network (63), the frameworks of (64–66) wouldbe particularly applicable. This would be done by finding analyticforms for 𝑓 , representing each of the transformed variables, and thenfinding an analytic form for 𝑓 , to compute the instability time giventhe transformed variables. This type of explicit analysis of the modelwill be considered in future work. We have described a probabilistic machine learning model—aBayesian neural network—that can accurately predict a distributionover possible instability times for a given compact multi-planet ex-oplanet system. Our model is trained on the raw orbital parametersof a multi-planet system over a short integration, and learns its owninstability metrics. This is contrasted by previous machine learningapproaches which have given their models hand-designed instabilitymetrics based on specialized domain knowledge.Our model is more than two orders of magnitude more accurateat predicting instability times than analytical estimators, while alsoreducing the bias of existing learned models by nearly a factor of three.We also demonstrate that our model generalizes robustly to five-planetconfigurations effectively drawn from a one-dimensional cut throughthe broad parameter space used to train the model. This improves onthe estimates of analytic and other learned models, despite our modelonly being trained on compact three-planet systems.Our model’s computational speedup over N-body integrations bya factor of up to 10 enables a broad range of applications, such asusing stability constraints to rule out unphysical configurations andconstrain orbital parameters (21), and to develop efficient terrestrialplanet formation models. Towards this end, our model will be madepublicly available through the SPOCK ¶ package, with training codealso available in a separate repo ‖ . Acknowledgements. ¶ https://github.com/dtamayo/spock ‖ https://github.com/MilesCranmer/bnn_chaos_model t i i i e e e m m m a a a Figure 4: Feature importances in the model, calculated as the root-mean-square of the gradients of the model’s output with respect to theinput features, and normalized.Miles Cranmer would like to thank Andrew Gordon Wilson andDan Foreman-Mackey for advice on Bayesian deep learning tech-niques, and Andrew Gordon Wilson, Dan Foreman-Mackey, and Se-bastian Wagner-Carena for comments on a draft of this paper. PhilipArmitage, Shirley Ho, and David Spergel’s work is supported by theSimons Foundation. This work made use of several Python softwarepackages: numpy (67), scipy (68), sklearn (69), jupyter (70), matplotlib (71), pandas (72), torch (73), pytorch-lightning (74), and tensorflow (75).
References [1] Eiichiro Kokubo and Shigeru Ida. Dynamics and accretionof planetesimals.
Progress of Theoretical and ExperimentalPhysics , 2012(1):01A308, 2012.[2] Kathryn Volk and Brett Gladman. Consolidating and crushingexoplanets: Did it happen here?
The Astrophysical JournalLetters , 806(2):L26, 2015. [3] Bonan Pu and Yanqin Wu. Spacing of kepler planets: sculptingby dynamical instability.
The Astrophysical Journal , 807(1):44,2015.[4] Scott Tremaine. The Statistical Mechanics of Planet Orbits.
Astrophysical Journal , 807(2):157, Jul 2015. doi: 10.1088/0004-637X/807/2/157.[5] Jack Wisdom. The resonance overlap criterion and the onset ofstochastic behavior in the restricted three-body problem.
TheAstronomical Journal , 85:1122–1133, 1980.[6] K. M. Deck, M. Payne, and M. J. Holman. First-order Res-onance Overlap and the Stability of Close Two-planet Sys-tems.
Astrophysical Journal , 774:129, September 2013. doi:10.1088/0004-637X/774/2/129.[7] Antoine C. Petit, Jacques Laskar, and Gwenaël Boué. Hill sta-bility in the AMD framework.
Astronomy & Astrophysics , 617:A93, Sep 2018. doi: 10.1051/0004-6361/201833088.[8] Sam Hadden and Yoram Lithwick. A criterion for the onset ofchaos in systems of two eccentric planets.
The AstronomicalJournal , 156(3):95, 2018.[9] Alice C Quillen. Three-body resonance overlap in closely spacedmultiple-planet systems.
Monthly Notices of the Royal Astronom-ical Society , 418(2):1043–1054, 2011.[10] Antoine C Petit, Gabriele Pichierri, Melvyn B Davies, and An-ders Johansen. The path to instability in compact multi-planetarysystems.
Astronomy & Astrophysics , 641:A176, 2020.[11] Daniel Tamayo, Miles Cranmer, Samuel Hadden, Hanno Rein,Peter Battaglia, Alysa Obertas, Philip J Armitage, Shirley Ho,David N Spergel, Christian Gilbertson, et al. Predicting the long-term stability of compact multiplanet systems.
Proceedings ofthe National Academy of Sciences , 117(31):18194–18205, 2020.[12] J. E. Chambers, G. W. Wetherill, and A. P. Boss. The Stabilityof Multi-Planet Systems.
Icarus , 119:261–268, February 1996.doi: 10.1006/icar.1996.0019.[13] Keiko Yoshinaga, Eiichiro Kokubo, and Junichiro Makino. Thestability of protoplanet systems.
Icarus , 139(2):328–335, 1999.[14] F. Marzari and S. J. Weidenschilling. Eccentric Extrasolar Plan-ets: The Jumping Jupiter Model.
Icarus , 156:570–579, April2002. doi: 10.1006/icar.2001.6786.815] Ji-Lin Zhou, Douglas NC Lin, and Yi-Sui Sun. Post-oligarchicevolution of protoplanetary embryos and the stability of plane-tary systems.
The Astrophysical Journal , 666(1):423, 2007.[16] Peter Faber and Alice C Quillen. The total number of giantplanets in debris discs with central clearings.
Monthly Noticesof the Royal Astronomical Society , 382(4):1823–1828, 2007.[17] Andrew W Smith and Jack J Lissauer. Orbital stability of systemsof closely-spaced planets.
Icarus , 201(1):381–394, 2009.[18] Yuji Matsumoto, Makiko Nagasawa, and Shigeru Ida. The or-bital stability of planets trapped in the first-order mean-motionresonances.
Icarus , 221(2):624–631, November 2012. doi:10.1016/j.icarus.2012.08.032.[19] Boris V Chirikov. A universal instability of many-dimensionaloscillator systems.
Physics reports , 52(5):263–379, 1979.[20] Sam Hadden. An Integrable Model for the Dynamics of Plane-tary Mean-motion Resonances.
Astronomical Journal , 158(6):238, Dec 2019. doi: 10.3847/1538-3881/ab5287.[21] Daniel Tamayo, Christian Gilbertson, and Daniel Foreman-Mackey. Stability constrained characterization of multiplanetsystems, 2020.[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Im-agenet classification with deep convolutional neural networks.In
Proceedings of the 25th International Conference on Neu-ral Information Processing Systems - Volume 1 , NIPS’12, page1097–1105, Red Hook, NY, USA, 2012. Curran Associates Inc.[23] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton.Speech recognition with deep recurrent neural networks. In , pages 6645–6649. IEEE, 2013.[24] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence tosequence learning with neural networks. In
Advances in neuralinformation processing systems , pages 3104–3112, 2014.[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
DeepLearning . MIT Press, 2016.[26] David G Lowe. Object recognition from local scale-invariantfeatures. In
Proceedings of the seventh IEEE international con-ference on computer vision , volume 2, pages 1150–1157. Ieee,1999.[27] Daniel Tamayo, Ari Silburt, Diana Valencia, Kristen Menou,Mohamad Ali-Dib, Cristobal Petrovich, Chelsea X Huang,Hanno Rein, Christa van Laerhoven, Adiv Paradise, et al. Amachine learns to predict the stability of tightly packed plan-etary systems.
The Astrophysical Journal Letters , 832(2):L22,2016.[28] Eric Agol, Caroline Dorn, Simon L Grimm, Martin Turbet, ElsaDucrot, Laetitia Delrez, Michael Gillon, Brice-Olivier Demory,Artem Burdanov, Khalid Barkaoui, et al. Refining the transittiming and photometric analysis of trappist-1: Masses, radii,densities, dynamics, and ephemerides, 2020. [29] Lauren M Weiss, Geoffrey W Marcy, Erik A Petigura, Ben-jamin J Fulton, Andrew W Howard, Joshua N Winn, Howard TIsaacson, Timothy D Morton, Lea A Hirsch, Evan J Sinukoff,et al. The california-kepler survey. v. peas in a pod: planetsin a kepler multi-planet system are similar in size and regularlyspaced.
The Astronomical Journal , 155(1):48, 2018.[30] Hanno Rein and Daniel Tamayo. whfast: a fast and unbiasedimplementation of a symplectic wisdom–holman integrator forlong-term gravitational simulations.
Monthly Notices of theRoyal Astronomical Society , 452(1):376–388, 2015.[31] Hanno Rein and S-F Liu. Rebound: an open-source multi-purpose n-body code for collisional dynamics.
Astronomy &Astrophysics , 537:A128, 2012.[32] David R Rice, Frederic A Rasio, and Jason H Steffen. Survivalof non-coplanar, closely packed planetary systems after a closeencounter.
Monthly Notices of the Royal Astronomical Society ,481(2):2205–2212, 2018.[33] Naireen Hussain and Daniel Tamayo. Fundamental limits fromchaos on instability time predictions in compact planetary sys-tems.
Monthly Notices of the Royal Astronomical Society , 491(4):5258–5267, 2019.[34] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, KiyohiroShikano, and Kevin J Lang. Phoneme recognition using time-delay neural networks.
IEEE transactions on acoustics, speech,and signal processing , 37(3):328–339, 1989.[35] Yann LeCun, Bernhard Boser, John S Denker, Donnie Hen-derson, Richard E Howard, Wayne Hubbard, and Lawrence DJackel. Backpropagation applied to handwritten zip code recog-nition.
Neural computation , 1(4):541–551, 1989.[36] Waseem Rawat and Zenghui Wang. Deep convolutional neu-ral networks for image classification: A comprehensive review.
Neural computation , 29(9):2352–2449, 2017.[37] Sepp Hochreiter and Jürgen Schmidhuber. Long short-termmemory.
Neural computation , 9(8):1735–1780, 1997.[38] Mathieu Andreux, Tomás Angles, Georgios Exarchakis, RobertoLeonarduzzi, Gaspar Rochette, Louis Thiry, John Zarka,Stéphane Mallat, Joakim Andén, Eugene Belilovsky, JoanBruna, Vincent Lostanlen, Matthew J. Hirn, Edouard Oyallon,Sixin Zhang, Carmine Cella, and Michael Eickenberg. Kymatio:Scattering Transforms in Python, December 2018.[39] Carl Edward Rasmussen and Christopher K. I. Williams.
Gaus-sian Processes for Machine Learning . 2006.[40] Alysa Obertas, Christa Van Laerhoven, and Daniel Tamayo. Thestability of tightly-packed, evenly-spaced systems of earth-massplanets orbiting a sun-like star.
Icarus , 293:52–58, 2017. doi:10.1016/j.icarus.2017.04.010.[41] Andrew Gordon Wilson. The case for bayesian deep learning,2020.[42] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Ap-proximation: Representing Model Uncertainty in Deep Learn-ing, June 2015.943] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete Dropout, May2017.[44] Yashar D. Hezaveh, Laurence Perreault Levasseur, and Philip J.Marshall. Fast automated analysis of strong gravitational lenseswith convolutional neural networks.
Nature , 548(7669):555–557, August 2017. doi: 10.1038/nature23463.[45] Laurence Perreault Levasseur, Yashar D. Hezaveh, and Risa H.Wechsler. Uncertainties in Parameters Estimated with NeuralNetworks: Application to Strong Gravitational Lensing.
As-trophysical Journal, Letters , 850(1):L7, November 2017. doi:10.3847/2041-8213/aa9704.[46] Henry W. Leung and Jo Bovy. Deep learning of multi-elementabundances from high-resolution spectroscopic data.
MonthlyNotices of the RAS , 483(3):3255–3277, March 2019. doi: 10.1093/mnras/sty3217.[47] Sebastian Wagner-Carena, Ji Won Park, Simon Birrer, Philip J.Marshall, Aaron Roodman, and Risa H. Wechsler. HierarchicalInference With Bayesian Neural Networks: An Application toStrong Gravitational Lensing, October 2020.[48] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, andDaan Wierstra. Weight Uncertainty in Neural Networks, May2015.[49] Dustin Tran, Mike Dusenberry, Mark van der Wilk, and DanijarHafner. Bayesian layers: A module for neural network uncer-tainty. In
Advances in Neural Information Processing Systems ,pages 14660–14672, 2019.[50] Christos Louizos and Max Welling. Multiplicative normalizingflows for variational bayesian neural networks, 2017.[51] Nick Pawlowski, Andrew Brock, Matthew C. H. Lee, MartinRajchl, and Ben Glocker. Implicit Weight Uncertainty in NeuralNetworks, November 2017.[52] David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner,Alexandre Lacoste, and Aaron Courville. Bayesian Hypernet-works, October 2017.[53] Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov,and Andrew Gordon Wilson. A Simple Baseline for BayesianUncertainty in Deep Learning, February 2019.[54] Andrew Gordon Wilson and Pavel Izmailov. Bayesian DeepLearning and a Probabilistic Perspective of Generalization,February 2020.[55] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, DmitryVetrov, and Andrew Gordon Wilson. Averaging Weights Leadsto Wider Optima and Better Generalization, March 2018.[56] Stephan Mandt, Matthew D Hoffman, and David M Blei.Stochastic gradient descent as approximate bayesian inference.
The Journal of Machine Learning Research , 18(1):4873–4907,2017.[57] Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A.Louis. Is SGD a Bayesian sampler? Well, almost, June 2020.[58] B Funk, G Wuchterl, R Schwarz, E Pilat-Lohinger, and S Eggl.The stability of ultra-compact planetary systems.
Astronomy &Astrophysics , 516:A82, 2010. [59] Dong-Hong Wu, Rachel C Zhang, Ji-Lin Zhou, and Jason HSteffen. Dynamical instability and its implications for planetarysystem architecture.
Monthly Notices of the Royal AstronomicalSociety , 484(2):1538–1548, 2019.[60] Almog Yalinewich and Cristobal Petrovich. Nekhoroshev esti-mates for the survival time of tightly packed planetary systems,2019.[61] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable treeboosting system. In
Proceedings of the 22nd ACM SIGKDDInternational Conference on Knowledge Discovery and DataMining , KDD ’16, pages 785–794, New York, NY, USA, 2016.ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785.[62] Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
Theelements of statistical learning: data mining, inference, andprediction . Springer Science & Business Media, 2009.[63] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, AlvaroSanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski,Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner,Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer,George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Vic-toria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Push-meet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and RazvanPascanu. Relational inductive biases, deep learning, and graphnetworks, June 2018.[64] Miles D. Cranmer, Rui Xu, Peter Battaglia, and Shirley Ho.Learning Symbolic Physics with Graph Networks, September2019.[65] Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, RuiXu, Kyle Cranmer, David Spergel, and Shirley Ho. DiscoveringSymbolic Models from Deep Learning with Inductive Biases,June 2020.[66] Miles Cranmer. PySR: Fast & Parallelized Symbolic Regressionin Python/Julia, September 2020. URL https://doi.org/10.5281/zenodo.4052869 .[67] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt,Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser,Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern,Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, MatthewBrett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe,Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, TylerReddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke,and Travis E. Oliphant. Array programming with NumPy.
Na-ture , 585(7825):357–362, September 2020. ISSN 1476-4687.doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2 .[68] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haber-land, Tyler Reddy, David Cournapeau, Evgeni Burovski, PearuPeterson, Warren Weckesser, Jonathan Bright, Stéfan J. van derWalt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Niko-lay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern,Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore,Jake Vand erPlas, Denis Laxalde, Josef Perktold, Robert Cimr-man, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M.Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van10ulbregt, and SciPy 1. 0 Contributors. SciPy 1.0: Funda-mental Algorithms for Scientific Computing in Python.
Na-ture Methods , 17:261–272, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2.[69] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vin-cent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vander-plas, Alexandre Passos, David Cournapeau, Matthieu Brucher,Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: MachineLearning in Python.
Journal of Machine Learning Research , 12(85):2825–2830, 2011.[70] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez,Brian Granger, Matthias Bussonnier, Jonathan Frederic, KyleKelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, PaulIvanov, Damián Avila, Safia Abdalla, and Carol Willing. JupyterNotebooks – a publishing format for reproducible computationalworkflows. In F. Loizides and B. Schmidt, editors,
Positioningand Power in Academic Publishing: Players, Agents and Agen-das , pages 87 – 90. IOS Press, 2016.[71] Michael Droettboom, John Hunter, Thomas A Caswell, Eric Fir-ing, Jens Hedegaard Nielsen, Phil Elson, Benjamin Root, DarrenDale, Jae-Joon Lee, Jouni K. Seppänen, Damon McDougall, An-drew Straw, Ryan May, Nelle Varoquaux, Tony S Yu, Eric Ma,Charlie Moad, Steven Silvester, Christoph Gohlke, Peter Würtz,Thomas Hisch, Federico Ariza, Cimarron, Ian Thomas, JamesEvans, Paul Ivanov, Jeff Whitaker, Paul Hobson, mdehoon, andMatt Giuca. matplotlib: matplotlib v1.5.1, January 2016.[72] Wes McKinney. Data Structures for Statistical Computing inPython. In Stéfan van der Walt and Jarrod Millman, editors,
Proceedings of the 9th Python in Science Conference , pages 56– 61, 2010. doi: 10.25080/Majora-92bf1922-00a.[73] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, JamesBradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, NataliaGimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf,Edward Yang, Zachary DeVito, Martin Raison, Alykhan Te-jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, JunjieBai, and Soumith Chintala. PyTorch: An Imperative Style,High-Performance Deep Learning Library. In H. Wallach,H. Larochelle, A. Beygelzimer, F. d ' Alché-Buc, E. Fox, andR. Garnett, editors,
Advances in Neural Information ProcessingSystems 32 , pages 8024–8035. Curran Associates, Inc., 2019.[74] WA Falcon. Pytorch lightning.
GitHub. Note:https://github.com/PyTorchLightning/pytorch-lightning , 3,2019.[75] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, AndyDavis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geof-frey Irving, Michael Isard, et al. Tensorflow: A system forlarge-scale machine learning. In { USENIX } Symposiumon Operating Systems Design and Implementation ( { OSDI } ,pages 265–283, 2016.[76] Laura Dietz. Directed factor graph notation for generative mod-els. Technical report, Max Planck Institute for Informatics, Saar-brücken, Germany, 2010.[77] Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, andweight decay, March 2018. [78] Leslie N. Smith and Nicholay Topin. Super-Convergence: VeryFast Training of Neural Networks Using Large Learning Rates,August 2017.11 ppendix Petit et al. (2020) Our modelFigure 5: Residuals for predictions of instability time on the 5-planet dataset in fig. 3.
Parameters.
A planetary orbit can be described with six coordinates at any given time. We choose to use eccentricity ( 𝑒 ), semi-major axis( 𝑎 , normalized to the initial innermost planet’s value), inclination ( 𝑖 ), longitude of the ascending node ( Ω ), longitude of pericenter ( 𝜛 ), andtrue longitude ( 𝜃 ). We also pass the mass of each planet normalized to the star’s mass to each call of 𝑓 in eq. (2). For each of the angularcoordinates excluding inclination, we split into two values—one for the sine and one for the cosine of the value—before passing to the firstneural network. Likelihoods.
Here we give a mathematical derivation of the likelihood used to train our model. Our goal is to estimate the distribution 𝑃 ( 𝑇 | 𝑋 ) , for a timeseries data 𝑋 ∈ R + 𝑛 𝑡 . We construct a probabilistic model defined by eqs. (1) to (5), with parameters 𝜃 ∈ R 𝑚 , where 𝑚 isthe total number of parameters, which takes a timeseries for 3 planets, 𝑋 , and produces a normal probability distribution over 𝑇 , parametrizedby 2 scalars: ( 𝜇 𝜃 ( 𝑋 ) , 𝜎 𝜃 ( 𝑋 )) . 𝜇 𝜃 is the center of the instability time, and 𝜎 𝜃 is the standard deviation in that estimate.The distribution over 𝑇 parametrized by the model is equal to: 𝑃 ( 𝑇 | 𝑋, 𝜃 ) = 𝐴 ( 𝜇 𝜃 ( 𝑋 ) ,𝜎 𝜃 ( 𝑋 ))√ 𝜋 𝜎 𝜃 ( 𝑋 ) exp (cid:18) − ( 𝑇 − 𝜇 𝜃 ( 𝑋 )) 𝜎 𝜃 ( 𝑋 ) (cid:19) , 𝑇 < (cid:32) + erf (cid:18) 𝜇 𝜃 ( 𝑋 )− √ 𝜎 𝜃 ( 𝑋 ) (cid:19)(cid:33) 𝑃 ( 𝑇 | 𝑇 ≥ ) , 𝑇 ≥ . (6)This distribution is motivated by two things. First, as in the paper (33), exoplanet instability times usually follow a normal distribution inlogarithmic instability time, regardless of how large this time is. Therefore, we predict a normal distribution in 𝑇 for times under 𝑇 < orbits, hence we use a model that is independent of 𝑃 ( 𝑇 | 𝑇 ≥ ) . Wecalculate the cumulative probability of the normal distribution falling 𝑇 ≥ 𝐴 isa normalization function from the fact that we cut off the probability at 𝑇 =
4. Thus, 𝐴 ( 𝜇 𝜃 ( 𝑋 ) , 𝜎 𝜃 ( 𝑋 )) = + erf (cid:169)(cid:173)(cid:171) 𝜇 𝜃 ( 𝑋 )− √︃ 𝜎 𝜃 ( 𝑋 ) (cid:170)(cid:174)(cid:172) . (7)This term, from our prior that 𝑇 >
4, helps remove bias from our model, as can be seen in fig. 2. Without this term in the model, we wouldbe artificially punishing predictions at low 𝑇 values.Assuming we produce a pointwise dataset 𝐷 = {( 𝑇 𝑖 , 𝑋 𝑖 )} 𝑖 = 𝑁 via numerical integration, where 𝑇 𝑖 ≡ orbits, the log-likelihood for this model is equal to:log 𝑃 ( 𝐷 | 𝜃 ) ∝ ∑︁ 𝑖 − ( 𝑇 𝑖 − 𝜇 𝜃 ( 𝑋 𝑖 )) 𝜎 𝜃 ( 𝑋 𝑖 ) − log (cid:0) 𝜎 𝜃 ( 𝑋 𝑖 ) (cid:1) − log (cid:169)(cid:173)(cid:173)(cid:171) + erf (cid:169)(cid:173)(cid:171) 𝜇 𝜃 ( 𝑋 𝑖 )− √︃ 𝜎 𝜃 ( 𝑋 𝑖 ) (cid:170)(cid:174)(cid:172)(cid:170)(cid:174)(cid:174)(cid:172) , 𝑇 𝑖 < (cid:169)(cid:173)(cid:173)(cid:171) + erf (cid:169)(cid:173)(cid:171) 𝜇 𝜃 ( 𝑋 𝑖 )− √︃ 𝜎 𝜃 ( 𝑋 𝑖 ) (cid:170)(cid:174)(cid:172)(cid:170)(cid:174)(cid:174)(cid:172) , 𝑇 𝑖 ≡ , (8)assuming a fixed prior 𝑃 ( 𝑇 | 𝑇 ≥ ) . Note how this decouples the loss for the stable values 𝑇 𝑖 ≥ 𝑃 ( 𝑇 | 𝑇 ≥ ) , meaning thechoice of prior will have no effect on our model, and can be chosen by a user after training. Examples of this are plotted in fig. 6.Figure 6: Example likelihoods for various choices of 𝜇, 𝜎 corresponding to eq. (6). Example priors are given for configurations stable past10 , though note that these have no effect on the learned model.Now, we also marginalize the model parameters 𝜃 , to incorporate epistemic uncertainty, and account for model biases due to the randomseed used in initializing and training the model. We proceed as follows: 𝑃 ( 𝑇 | 𝑋, 𝐷 ) ∝ ∫ 𝑃 ( 𝑇, 𝜃 | 𝑋, 𝐷 ) 𝑑𝜃 (9) ∝ ∫ 𝑃 ( 𝑇 | 𝑋, 𝐷, 𝜃 ) 𝑃 ( 𝜃 | 𝑋, 𝐷 ) 𝑑𝜃 (10) ∝ ∫ 𝑃 ( 𝑇 | 𝑋, 𝜃 ) 𝑃 ( 𝜃 | 𝐷 ) 𝑑𝜃. (11)We first maximize the likelihood of the model, to find 𝑃 ( 𝜃 | 𝐷 ) . We factor the joint probability using fig. 7, and proceed: 𝑃 ( 𝜃, 𝐷 ) ∝ 𝑃 ( 𝜃 ) (cid:214) 𝑖 ∫ 𝑃 ( 𝑋 𝑖 ) 𝑃 ( 𝑇 𝑖 | 𝜇 𝑖 , 𝜎 𝑖 ) 𝑃 ( 𝜇 𝑖 , 𝜎 𝑖 | 𝜃, 𝑋 𝑖 ) 𝑑𝜇 𝑖 𝑑𝜎 𝑖 (12)Using 𝑃 ( 𝜃 | 𝐷 ) ∝ 𝑃 ( 𝜃, 𝐷 ) 𝑃 ( 𝐷 ) (13) ⇒ 𝑃 ( 𝜃 | 𝐷 ) ∝ 𝑃 ( 𝜃 ) (cid:214) 𝑖 ∫ 𝑃 ( 𝑇 𝑖 | 𝜇 𝑖 , 𝜎 𝑖 ) 𝑃 ( 𝜇 𝑖 , 𝜎 𝑖 | 𝜃, 𝑋 𝑖 ) 𝑑𝜇 𝑖 𝑑𝜎 𝑖 , (14)where 𝑃 ( 𝑇 𝑖 | 𝜇 𝑖 , 𝜎 𝑖 ) is given by eq. (6) as the log-likelihood for our model, and 𝑃 ( 𝜇 𝑖 , 𝜎 𝑖 | 𝜃, 𝑋 𝑖 ) 𝑑𝜇 𝑖 𝑑𝜎 𝑖 is our forward model given eqs. (2)to (5). Finally, we can write down the loss of our model, our function to minimize, as the negative logarithm of this:Loss ( 𝜃 ) = − log (cid:0) 𝑃 ( 𝜃 ) (cid:1) − ∑︁ 𝑖 E ( 𝜇 𝑖 ,𝜎 𝑖 )∼ 𝑓 ( 𝑋 𝑖 ; 𝜃 ) log (cid:16) 𝑃 ( 𝑇 𝑖 | 𝜇 𝑖 , 𝜎 𝑖 ) (cid:17) , (15)where 𝑓 ( 𝑋 𝑖 ; 𝜃 ) is the combined model eqs. (2) to (5) for a given 𝜃 , and E is used to refer to the fact that z is sampled in eq. (3), so we averagethe loss over samples. We set 𝑃 ( 𝜃 ) equal to a zero-centered uninformative Gaussian prior over the parameters. If this were a neural densityestimator instead of a full Bayesian Neural Network, we would minimize this for a single value of 𝜃 . Alternatively, we can sample 𝜃 ∼ 𝑃 ( 𝜃 | 𝐷 ) with a Bayesian Neural Network (BNN) algorithm. We use the MultiSWAG algorithm to do this (54), as described in the main text, and findthe distribution 𝑃 MultiSWAG ( 𝜃 ) ≈ 𝑃 ( 𝜃 | 𝐷 ) . 13 𝑖 𝑗 𝑓 y 𝑖 𝑗 ( E [·] , Var [·]) z 𝑖 𝜃𝜃𝜃 𝜃 N 𝜇𝜇𝜇 𝜃 Σ 𝜃 𝑓 𝜃𝜃𝜃 𝜇 𝑖 𝜎 𝑖 N 𝑇 𝑖 For each time step 𝑗 ∈ { ..𝑛 𝑡 } For each simulation 𝑖 ∈ { ..𝑁 } For each mode of the parameter surface
Figure 7: Bayesian graphical model representing our inference scheme for the instability time with Bayesian neural networks, which goes fromtimeseries created via short-term integration ( { x 𝑖 𝑗 } 𝑗 = [ 𝑛 𝑡 ] ) to a prediction of a (logarithmic) instability time ( 𝑇 𝑖 ) for each simulation 𝑖 . 𝑓 and 𝑓 are neural networks parametrized by 𝜃 ≡ ( 𝜃𝜃𝜃 , 𝜃𝜃𝜃 ) . A distribution over 𝜃 is learned according to the likelihood 𝑃 ( 𝜃 ) (cid:206) 𝑖 𝑃 ( 𝑇 𝑖 |{ x 𝑖 𝑗 } 𝑗 ; 𝜃 ) .The model is given in eqs. (1) to (5). Notation follows that of (76). Vectors are bolded and matrices are capitalized. Training.
We have 𝑛 𝑡 =
100 uniformly spaced samples of our integration over 10 ,
000 initial orbits (the unit orbit is the initial period of theinnermost planet). During training, we randomly select between 5 and 100 time steps, with replacement, to feed to the model. This is a typeof data augmentation that improves generalization of the model.Since we are working with a varying number of orbit samples, we also sample the averages and variances from Gaussians over theirfrequentist distributions: 𝜇 ± Var 𝑡 [ y 𝑡 ] 𝑛 𝑡 for the mean, and 𝜎 ± 𝑡 [ y 𝑡 ] ( 𝑛 𝑡 − ) for the variance, where 𝑛 𝑡 is the number of samples, and Var 𝑡 is thesample variance. This will naturally allow the model to grow increasingly certain if a longer time series is given as input, since the averagesand variances of the transformed coordinates are less subject to small-sample uncertainty. Hyperparameters.
For our final model, we set both 𝑓 and 𝑓 to be a MLPs with ReLU activations: 1 hidden layer and 40 hidden nodeseach (i.e., a matrix multiplication, a ReLU, a matrix multiplication, a ReLU, and another matrix multiplication). The number of calculatedtransformed features from 𝑓 is 20, meaning 𝑓 takes 40 features as input. We take 500,000 stochastic gradient descent optimization stepswith random batches of simulations with a batch size of 2000 with a cosine-annealed learning rate (77, 78) from 5 × − down to 5 × − This is followed by 50,000 additional steps at a fixed learning rate (presumably within a mode of the posterior) of 2 . × − , to record pointsof a Gaussian mode on the weight posterior. Gradient clipping of the L2-norm is used with a clipping value of 0 .
1. A small amount of noiseis added to the input features and summary statistics to encourage the model to only use features that are useful: a KL-divergence loss termis added to the loss function on this noise, with a multiplier of 10 − on the input and 10 − on the summary. This noise is not added duringevaluation, only training. We choose 5 as the minimum number of time steps to pass to the model. We rescale the data to have a zero meanand unit variance for each feature (i.e., we normalize with a mean and variance calculated over the entire training set and time series). All ofthese parameters were found with the hyperopt ∗∗ package with a 20% validation set, with a smaller number of training steps and acceleratedlearning rate schedule. Finally, we train an ensemble of 30 independent models: this represents 30 modes of the weight posterior. For asummary of these details, see our training code at https://github.com/MilesCranmer/bnn_chaos_model . Approximating the cumulative distribution of a Gaussian.
Due to precision issues of 32-bit floating point numbers, our auto-differentiationsoftware, torch (73), is increasingly incapable of accurately calculating log (cid:0) + erf ( 𝑥 ) (cid:1) and its gradients as 𝑥 decreases below −
1. Becausewe heavily rely on this function in our log-likelihood for training our model, and need to pass gradients through it, we needed to approximateit for large negative 𝑥 values. Otherwise, we found that the gradients in our model would often approach very large values and training wouldfail. We approximate this function with an analytic continuation via symbolic regression using PySR (66). We generate values of this functionin high precision code, and then fit analytical forms with
PySR . We find that the function is very accurately approximated over 𝑥 ∈ [− , − ] ∗∗ https://github.com/hyperopt/hyperopt . . . . . . Error over sigma . . . . . . . D e n s i t y Model error distributionGaussian distribution
Figure 8: A comparison of the residual between our model and the true value with a true Gaussian of the same standard deviation, with thesame number of samples. What this shows is that if our model predicts an uncertainty of 𝜎 =
1, then 68% of the true values will be withinthat 𝜎 range, as expected for a true Gaussian distribution.by: log (cid:0) + erf ( 𝑥 ) (cid:1) ≈ − . + . ∗ 𝑥 − . ∗ 𝑥 + . ∗ 𝑥 + . ∗ exp ( 𝑥 ) , and this function has equivalent asymptotic properties. We therefore use this formula in place of log (cid:0) + erf ( 𝑥 ) (cid:1) in our learning code. torch code is given as follows, to replace any appearance of log (cid:0) + erf ( 𝑥 ) (cid:1) in code: def safe_log_erf (x): """ Compute log(1+erf(x)) with an approximate analytic continuation below -1""" base_mask = x < -1value_giving_zero = torch. zeros_like (x, device=x. device )x_under = torch.where(base_mask , x, value_giving_zero )x_over = torch.where (~ base_mask , x, value_giving_zero )f_under = lambda x: (0. 485660082730562 *x + 0. 643278438654541 *torch.exp(x) +0. 00200084619923262 *x ** 3 - 0. 643250926022749 - 0. 955350621183745 *x ** 2)f_over = lambda x: torch.log(1.0+torch.erf(x))return f_under ( x_under ) + f_over ( x_over) Theoretical limit.
In (33), the authors measure the distribution of instability times for various orbital configurations. Taking an initial orbitalconfiguration, the authors perturb the system by machine precision, and measure the instability time, and repeat. For each system, the authorsthen measure the mean instability time, 𝜇 (in log-space), as well as the standard deviation, 𝜎 (in log-space, modeled as a log-normal). Whatthis means is that we can define a “theoretical limit” to the accuracy with which we can predict the instability time, and this accuracy isbounded by 𝜎 , for we cannot predict the instability time for a given system better than within one 𝜎 standard deviation on average.For the purposes of this paper, we simulate an optimal estimator by taking a particular instability time, and then making a random predictionfor its instability time within one 𝜎 of the actual instability time. (33) found that 𝜎 , while it is different for different configurations, does notcorrelate to 𝜇 , so for the numerical value of 𝜎 , we simply randomly select numerical 𝜎 values from those released for (33). On average, thesestandard deviations are 0 ..