MNRAS, 1–13 (2020) Preprint 15 February 2021 Compiled using MNRAS LaTeX style file v3.0
Black Hole Weather Forecasting with Deep Learning: A Pilot Study
Roberta Duarte★, Rodrigo Nemmen†, João Paulo Navarro
Universidade de São Paulo, Instituto de Astronomia, Geofísica e Ciências Atmosféricas, Departamento de Astronomia, São Paulo, SP 05508-090, Brazil
NVIDIA
Accepted XXX. Received YYY; in original form ZZZ
ABSTRACT
In this pilot study, we investigate the use of a deep learning (DL) model to temporally evolve the dynamics of gas accreting onto a black hole in the form of a radiatively inefficient accretion flow (RIAF). We have trained a machine to forecast such a spatiotemporally chaotic system, i.e. black hole weather forecasting, using a convolutional neural network (CNN) and a training dataset which consists of numerical solutions of the hydrodynamical equations for a range of initial conditions. We find that deep neural networks seem to learn black hole accretion physics well and evolve the accretion flow orders of magnitude faster than traditional numerical solvers, while maintaining a reasonable accuracy for a long time. For instance, CNNs predict well the temporal evolution of a RIAF over a long duration of 8 × 10^4 GM/c^3, which corresponds to 80 dynamical times at r = 100 GM/c^2. The DL model is able to evolve flows from initial conditions not present in the training dataset with good accuracy. Our approach thus seems to generalize well. Once trained, the DL model evolves a turbulent RIAF on a single GPU four orders of magnitude faster than usual fluid dynamics integrators running in parallel on 200 CPU cores. We speculate that a data-driven machine learning approach should be very promising for accelerating not only fluid dynamics simulations, but also general relativistic magnetohydrodynamic ones.
Key words: accretion, accretion discs – black hole physics – hydrodynamics – methods: statistical – MHD – methods: numerical
1 INTRODUCTION

Black holes (BHs) are infinitely deep gravitational potential wells surrounded by event horizons: surfaces that separate the outside world from the region of the BH from which nothing escapes. When matter falls into such a hole it forms a disk-like structure due to the barrier posed by angular momentum conservation; given this barrier, magnetic forces in an ionized plasma supply the friction required to allow gas to fall onto the BH (Balbus 2003). These stresses also convert some of the gravitational potential energy of the accretion flow into heat and can release a substantial fraction of its rest mass, providing the primary power source behind active galactic nuclei (AGNs), black hole binaries and gamma-ray bursts (Meier 2002).

At the same time that magnetic stresses make BHs shine through the release of electromagnetic radiation, they also generate turbulence in the accretion flow, thereby turning accreting BHs into test beds of fluid dynamics. The nonlinear partial differential equations that need to be solved in order to describe the turbulence, gravity and radiation in the spacetimes around BHs are intractable analytically. The traditional approach to such a problem has been to numerically solve the partial differential equations behind the conservation laws for the system (e.g. Gammie et al. 2003; Toro 2009). With the numerical solutions, one can then perform detailed studies of the multidimensional gas dynamics and radiative properties of BH accretion flows, i.e. BH weather forecasting, being only limited by the available computational resources. Such numerical simulations have been a key aspect in providing a framework for interpreting the multitude of observations of black holes and their environments (e.g. Event Horizon Telescope Collaboration et al. 2019b).

Numerical simulations of fluid dynamics, and particularly BH accretion models, are costly.

★ E-mail: [email protected]
† E-mail: [email protected]
Currently, a single general relativistic magnetohydrodynamic (GRMHD) 3D model with a competitive spatial resolution (e.g. N_r × N_θ × N_φ with 192 cells in the azimuthal direction) requires about 100k CPU-hours to be evolved for a duration of 5000 GM/c^3 (e.g., Porth et al. 2017; hereafter we adopt units such that GM = c = 1, i.e. both GM/c^2 and GM/c^3 are unity). Therefore, a scientist who desires to reproduce or improve upon such a model needs access to a CPU cluster with thousands of cores. The computational cost can be reduced by a factor of ≈15 if one adopts a code optimized for graphics processing units (GPUs; Grete et al. 2019; Liska et al. 2019a), but one would still need access to a GPU cluster. Large computational costs are of course not an exclusive issue of BH astrophysics; they are also a problem in many other fields. One particular example is cosmological simulations of large scale structure formation, where one needs to evolve the gravitational assembly of dark matter haloes and their baryonic physics (e.g. Vogelsberger et al. 2014; Schaye et al. 2015).

A recent innovation is using machine learning (ML) as an approach to computational simulations of large spatiotemporally chaotic systems (Brunton et al. 2020). The basic idea is to use the tools of ML to "learn from experience": instead of directly simulating the physical processes involved, devise the prediction as a computer vision problem and infer the evolution of the system from the sequence of input data cubes that comprise previous states. In other words, this is a data-driven, physics-free approach in which a ML model learns to approximate the physics from the training examples alone and not by incorporating a priori knowledge about the equations underlying the processes (e.g. Jaeger & Haas 2004).

Deep learning (DL) is a particular class of ML models which is proving quite promising for data-driven forecasting of complex systems. DL is a type of supervised learning, where the model is trained (or fitted, in astronomical jargon) with many input-output examples (LeCun et al. 2015). For example, given many galaxy images labeled as elliptical or spiral, learn to predict whether a given image is that of an elliptical or a spiral (e.g. Hausen & Robertson 2019). This is achieved by updating the weights of a multi-layered (i.e. deep) neural network (NN) via gradient descent with a differentiable loss function.
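The "learn from experience" framing above amounts to building supervised input-output pairs from a movie of simulated states. A minimal numpy sketch of that idea follows; the five-frame window sizes mirror the setup described later in the paper but are otherwise illustrative:

```python
import numpy as np

def make_pairs(frames, n_in=5, n_ahead=5):
    """Stack n_in consecutive frames as input channels; the target is the
    corresponding block n_ahead frames later (window sizes are illustrative)."""
    xs, ys = [], []
    for t in range(len(frames) - n_in - n_ahead + 1):
        xs.append(np.stack(frames[t:t + n_in], axis=-1))
        ys.append(np.stack(frames[t + n_ahead:t + n_ahead + n_in], axis=-1))
    return np.array(xs), np.array(ys)

# Toy "movie" of 20 constant frames labeled by their time index.
movie = [np.full((400, 200), float(t)) for t in range(20)]
x, y = make_pairs(movie)
print(x.shape, y.shape)  # (11, 400, 200, 5) (11, 400, 200, 5)
```

Each training example thus pairs a short window of past states with a window of future states, with no reference to the underlying equations.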
The flexibility of deep nets renders DL a good approximator for functions which are too complex to have an analytical form (Cybenko 1989; Hornik 1991; Zhou 2020). For DL to produce acceptable results, a large amount of training data is essential (e.g. LeCun et al. 2015; Krizhevsky et al. 2017).

We review a couple of exciting applications of this approach for data-driven forecasting of nonlinear systems in the time domain. Tompson et al. (2016) used a convolutional neural network (hereafter CNN) combined with an unsupervised learning framework to learn the 3D solutions of the inviscid Euler equation. Their data-driven simulations outperform other methods and show good generalization properties. Similarly, King et al. (2018) and Mohan et al. (2019) obtained promising results using a long short-term memory (LSTM) CNN architecture. Pathak et al. (2018) employed a reservoir computing paradigm to forecast the solutions of the chaotic Kuramoto-Sivashinsky equation over the large duration of six Lyapunov times. Using deep nets without convolutions, Breen et al. (2019) were able to forecast the 3-body problem with good accuracy; Battaglia et al. (2016) did something similar with CNNs for N-body systems with N < 20. Finally, Agrawal et al. (2019) and Chattopadhyay et al. (2020) used CNNs and capsule nets, respectively, for traditional Earth-weather forecasting, with promising results.

Here we address two interrelated questions. The first question is one of practical order: is it possible to obtain acceptable solutions of the fluid conservation equations (such as the Navier-Stokes equation) faster than an explicit numerical solver using DL techniques? The second question is more fundamental: can DL learn and forecast turbulence well? Concretely, how good is it at predicting the future of a spatiotemporally chaotic system comprised of a turbulent fluid? If it is able to forecast the future, for how long is the quality of the forecast acceptable?

This work is a pilot study of DL techniques applied to BH weather forecasting. The astrophysical setting which provides the training dataset for our study consists of a hydrodynamical simulation of a BH accreting gas from a geometrically thick torus of very hot gas, also called a radiatively inefficient accretion flow (RIAF; e.g. Yuan & Narayan 2014; Almeida & Nemmen 2020). RIAFs are thought to be the most common mode of BH accretion in present-day galaxies (e.g. Ho 2008; Nemmen et al. 2014; Event Horizon Telescope Collaboration et al. 2019a), where BHs are accreting at mass accretion rates Ṁ ≪ Ṁ_Edd, where Ṁ_Edd is the Eddington accretion rate. For this reason, RIAFs are widely modeled by numerically solving the conservation laws (e.g. Porth et al. 2019).

For the learning algorithm, we use the well-known U-Net architecture (Ronneberger et al. 2015), which is commonly used to extract patterns from datasets with spatial (images) and temporal (videos) coherence (e.g. Karpathy et al. 2014; Chen et al. 2016). The BH simulation generates the spacetime distribution of the density field, which we feed to the DL model. The challenge then is how well the DL model predicts the future state of the density field. We quantify the performance of the DL physics-free approach by comparing a number of indicators with those obtained from the explicit solution to the conservation equations.

The paper is organized as follows: in section 2 we present the fiducial equation-based model that we used to generate the training data and the features. In section 3 we describe the DL model and the custom loss function that we developed. In section 4 we describe the results. In section 5 we present the discussion and, finally, in section 6 we finish with the conclusions of the work.

An aside about nomenclature. From here onwards, when we refer to a "model" we mean the DL model trained with the data from the hydrodynamical simulation. A "frame" corresponds to an instance of the training dataset (as in a movie frame).
2 TRAINING DATA

Our data-driven approach is supervised, meaning that for each training example x we provide the correct answer y. The ML model then learns the relation x → y that best fits the provided training examples. Our goal is to first provide a meaningful set of examples to teach the model about the physics represented in the data. In the ML context, "learn" has a straightforward meaning, but one challenging to achieve and prove in practice: we should perform inference from unseen data using the trained model, delivering predictions ŷ (the learned model's output) as close as possible to the ground truth y (the data). If the trained model is successful, it will generalize well; strong generalization would imply, in this context, reproducing the spatiotemporal evolution of a BH accretion flow simulation even for different initial conditions.

Concretely, we are interested in training a model to reproduce the density field ρ(r, t), which is the feature used to find the best model (cf. Figure 1). Our dataset was generated from two-dimensional hydrodynamical simulations of viscous accretion onto a Schwarzschild BH (Almeida & Nemmen 2020, hereafter AN). They consist of nine simulations with durations ranging from 8 × 10^4 to 8 × 10^5 GM/c^3, which are extremely long by today's standards. These simulations differ in the details of their initial conditions, more specifically their angular momentum distribution and amount of shear stress. The setup of the simulations can be found in Appendix A.

Each density map has a resolution of 400 × 200 pixels in polar coordinates (N_r × N_θ). We use a non-uniform mesh which increases the resolution for smaller values of r, i.e. closer to the BH (AN). From the point of view of computer vision DL applications this is a novelty, since such applications usually represent images or videos as regular, uniform grids.

Figure 1.
Simulated gas density (logarithm) around a black hole, comprising one example of the training set. The lengths are in units of the Schwarzschild radius, R_s ≡ 2GM/c^2.

3 DEEP LEARNING MODEL

Our NN architecture is based on the U-Net, proposed by Ronneberger et al. (2015) (cf. Appendix C for details on the architecture). We feed the model with a 3-dimensional array composed of R × θ × t coordinates. The network is composed of an encoder and a decoder. While the encoder maps the input into the latent space, where the data are mapped into a compressed representation, the decoder maps the latent space into another array which is the output (i.e. the forward pass) of the DL model.

Since our data present spatial correlations, we adopt convolutional layers, which capture these correlations. Convolutional layers put together in a sequence result in a convolutional neural network (CNN). CNNs are powerful tools to solve problems in which the data present spatial and temporal coherence, where classical multi-layer perceptron (Murtagh 1991) approaches fail, either for lack of numerical accuracy or due to their high computational complexity.

Here we give a summary of how a CNN works; for in-depth accounts please refer to LeCun et al. (2015) and Goodfellow et al. (2016). The free parameters of a CNN are the elements of the convolution filters. In the so-called forward pass, we perform successive convolutions followed by non-linear activation functions until we reach the last layer, which produces the final prediction ŷ. An error function L, called the loss function in the ML community, measures the difference between ŷ (the prediction from the DL model) and the target value y (the data or ground truth). As usual in statistical model fitting, we want to find the values of the parameters that minimize L. Here, the parameters are the weights in the convolution filters.

The process of finding min L(x; w) (x represents the data) is a non-convex optimization problem, with no guarantee of reaching a global minimum. We adjust the weights iteratively, in a stochastic manner, until we find a set of parameters that best minimizes the loss function for a given training dataset. We have the freedom of choosing the loss function as any statistic that conveys the difference between ŷ and y. Common choices are the mean absolute error and the mean absolute percentage error.

ML algorithms have other free parameters, called hyperparameters, which control the learning cycle, i.e. the number of iterations, convergence criteria and network architecture. The hyperparameters vary freely, are independent of each other, and strongly impact the quality of the trained model. We used a grid search method combined with a random search method (Bergstra & Bengio 2012) to find the best combinations. Our method evaluates several combinations of hyperparameters with random values, using the coefficient of determination R^2 as a metric on the validation set, whose optimization gives the best values of the hyperparameters.

We performed data preparation before feeding the algorithm with the frames, as detailed in Appendix B. The input is a block of shape (400, 200, 5), and the output is the equivalent block five frames ahead. The channels in the blocks represent five consecutive frames, providing temporal information to the model. Our CNN is trained with blocks of shape (64, 400, 200, 5), where 64 is the batch size.

A suitable loss function ensures a good convergence of the learning procedure.
One contribution of our work is a definition of L adapted to the challenge of dealing with a density that varies spatiotemporally, as is natural for BH accretion, where the density increases towards smaller radii.

When the NN is trained with many examples sharing the same features, it may suffer from bias towards the features that are over-represented in the training set. In our case, this takes the form of the over-representation of low-density regions of the flow (in code units), which occupy a larger volume than the high-density regions. To account for this bias, we propose a hierarchical loss function built upon separate functions whose weights vary depending on the region of the accretion flow encompassed, written as

L = L_T + α L_HD + β L_IN + γ L_TORUS + δ L_ATM.    (1)

The convention for each loss function component building the total value L is the following: L_T is the loss for the entire flow; L_HD for the high-density region; L_IN for the inner region; L_TORUS for the torus; and L_ATM for the atmosphere, with each region defined by cuts on the mean normalized density ρ̄. α, β, γ and δ are hyperparameters. We choose the component losses as mean absolute errors, since the absolute error is robust when dealing with outliers, except for L_T, which is a mean squared error. We considered other losses for our problem; however, the choice reflected in equation 1 returned the best results.

The relative size of the regions used to define the components of the loss in equation 1 is fixed in time and chosen by eye. This is appropriate when performing training on the dataset from a single simulation (hereafter called the "one-sim" case). When the dataset comes from multiple simulations with different initial conditions (hereafter the "multi-sim" case), we use a different loss given
Figure 2.
Schematic diagram illustrating the different regions of the flow where the components of the hierarchical loss function are calculated. In each box, we calculate the difference between the prediction ŷ and the target y in the marked region. The total loss L is the sum of all these contributions.

Hyperparameter   Value
Batch size       64
Learning rate    5 × 10^-
α, β, γ, δ

Table 1.
The best hyperparameters, found by the grid search and random search methods.

by

L = L_T + a L_H + b L_L,    (2)

where a and b are hyperparameters, and L_H and L_L are the losses computed for all regions with normalized density above and below ρ̄ = 0.8, respectively. Table 1 presents the best hyperparameters found.
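Equations (1) and (2) can be sketched as follows. The region masks and thresholds below are illustrative modelling choices, not the exact cuts used in our training:

```python
import numpy as np

def hierarchical_loss(y_true, y_pred, masks, alpha, beta, gamma, delta):
    """Equation (1): L = L_T + a*L_HD + b*L_IN + c*L_TORUS + d*L_ATM.
    L_T is a mean squared error over the whole flow; the region terms are
    mean absolute errors restricted to boolean masks (the density cuts that
    define the regions are assumptions here)."""
    mae = lambda m: float(np.mean(np.abs(y_true[m] - y_pred[m]))) if m.any() else 0.0
    l_t = float(np.mean((y_true - y_pred) ** 2))
    return (l_t + alpha * mae(masks["hd"]) + beta * mae(masks["in"])
            + gamma * mae(masks["torus"]) + delta * mae(masks["atm"]))

def multisim_loss(y_true, y_pred, rho_bar, a, b, thr=0.8):
    """Equation (2): L = L_T + a*L_H + b*L_L, splitting the domain on a
    normalized-density threshold."""
    mae = lambda m: float(np.mean(np.abs(y_true[m] - y_pred[m]))) if m.any() else 0.0
    l_t = float(np.mean((y_true - y_pred) ** 2))
    return l_t + a * mae(rho_bar > thr) + b * mae(rho_bar <= thr)

# Illustrative masks built from a toy normalized density field.
rho_true = np.linspace(0.0, 1.0, 10000).reshape(100, 100)
masks = {"hd": rho_true > 0.8, "in": rho_true < 0.1,
         "torus": (rho_true >= 0.1) & (rho_true <= 0.8), "atm": rho_true < 0.05}
print(hierarchical_loss(rho_true, rho_true, masks, 1.0, 1.0, 1.0, 1.0))  # 0.0
```

A perfect prediction gives L = 0; each weighted term penalizes errors within its own region, which counteracts the volume over-representation of low densities.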
We performed two numerical experiments. In the one-sim experiment, we train the CNN on PNSS3, one of the longest-duration numerical simulations, in order to quantify the ability of the network to learn from a single simulation. Our model was trained with 70% of PNSS3's data, of which 10% is used as validation to obtain the hyperparameters and to use the EarlyStopping tool. The remaining 20% is used as the ground truth y, which is compared with the model's prediction ŷ.

In the multi-sim experiment, we train the DL model with eight simulations and omit model PL0SS3 in order to evaluate the generalization power of the network. We exclude PL0SS3 from the
Case       Set               Number of frames   Time (GM/c^3)
One-sim    Total             2678               530164
One-sim    Training          1875               371194
One-sim    Cross-validation  268                53056
One-sim    Test              535                105914
Multi-sim  Total             5015               992820
Multi-sim  Training          3511               695073
Multi-sim  Cross-validation  351                69487
Multi-sim  Test              1153               228259

Table 2.
The frame intervals of each set for the one-sim and multi-sim cases. The first column indicates the case (one-sim or multi-sim), the second column the set's name, and the third and fourth columns the number of frames and the time duration.

training, since its accretion flow is much thicker compared to the other models. The data preparation for each simulation is the same as for
PNSS3 in the one-sim case.

In the multi-sim approach, we generate a vector storing the initial configuration information for each system. Each configuration can be described by three properties: the angular momentum profile, the viscosity profile and the value of the α parameter (cf. Appendix A), each property comprising a component of the vector v = (v1, v2, v3), where the following rules set the values of the components:

• If the angular momentum profile is PN then v1 = 0; if it is PL then v1 = 1.
• If the viscosity profile is ST then v2 = 0; if it is SS then v2 = 1.
• v3 encodes the value of α, with a distinct value assigned to each of the three values of α used in AN (which include α = 0.01).

v is concatenated in the architecture's latent space, as in Wang et al. (2020). We also match the number of snapshots in each simulation in order to avoid bias towards models with longer durations.

We train our models using 70% of the data as the training set, and 10% and 20% as the cross-validation and test sets, respectively. In the one-sim case we have 2678 frames to train our model, whereas in the multi-sim case we have a total of 5015 frames from all simulations. In the multi-sim case, we assess the performance of the learned model against the
PL0SS3 simulation, which has 709 frames and is not used in the training. We use the first 250 frames to quantify the generalization power of the model.

There are two types of forecasts that we perform using the DL model. The direct forecast consists of the DL model computing a prediction for the immediate next step once fed with a single input frame from the hydrodynamical simulations. The iterative forecast consists of iterative computations of the DL model on top of its own output. In other words, in both approaches the learned model receives the input state only once; the direct forecast evaluates how well the learned model advances in time for one time step, whereas the iterative forecast is representative of longer-duration forecasting. Figure 3 illustrates both types of forecasting.

To evaluate the quality of the DL model forecasts, it is useful to quantify the difference between the target data provided by the hydrodynamical simulation (i.e. the ground truth) and the DL model forecast. For this purpose, we compute the difference between densities in logarithmic space as Δ log ρ ≡ log T − log P, where T and P are the target and learned-model prediction density arrays.

Figure 3. The two types of forecasts used in this work, direct (left panel) and iterative (right panel). M is the already-trained model, x is the snapshot from the simulation, and ŷ is the prediction. In the iterative forecast, we feed x as the first step and then compute the predictions by iteration.

Spatial averages are denoted by the usual bar above the corresponding variable. These averages are performed adopting a flat spacetime, since our training data were generated from Newtonian simulations. In addition to Δ log ρ as a measure of the quality of the DL model forecasts, we also use the mean absolute error (hereafter MAE) typical of ML studies, MAE ≡ ⟨|Δ log ρ|⟩. During training, we use R^2 to quantify the performance of the learned model. Another useful quantity to compare between the DL prediction and the ground-truth data is the rest mass of the flow: the mass tells us whether the DL model learned mass conservation well.

4 RESULTS

Figure 4 shows the prediction using the direct method in the one-sim case. We use the last test-set snapshot as an input to the trained model. We can see that the DL model predicts that the domain of the torus, which is characterized by the larger densities in the flow, has a density difference of up to ten per cent compared to the target data (max |Δ log ρ| ≈ 0.04). Figure 5 shows the masses, ρ̄ and the per cent difference between prediction and target as a function of time. The largest difference between target and prediction in both the masses and ρ̄ is small, below one per cent.

We now turn to the iterative forecast, starting from the last cross-validation snapshot and then iterating the predictions through the DL model. Figure 6 shows the resulting prediction after 25, 50 and 100 iterations (respectively, 125, 250 and 500 frames later). We see that the general shape and density distribution of the torus itself is preserved in the three cases.
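The error accumulation under iteration can be illustrated with the Δ log ρ and MAE metrics defined above. The "model" below is a purely hypothetical 1% damping stand-in, used only to show how the two forecast modes and the metrics fit together:

```python
import numpy as np

def delta_log_rho(target, pred):
    """Delta log rho = log10(T) - log10(P), the log-space forecast error."""
    return np.log10(target) - np.log10(pred)

def mae_log(target, pred):
    """MAE = mean of |Delta log rho| over the domain."""
    return float(np.mean(np.abs(delta_log_rho(target, pred))))

def direct_forecast(model, x):
    """Direct forecast: one application of the trained model (Fig. 3, left)."""
    return model(x)

def iterative_forecast(model, x, n_iter):
    """Iterative forecast: feed the model its own output (Fig. 3, right)."""
    preds = []
    for _ in range(n_iter):
        x = model(x)
        preds.append(x)
    return preds

# Hypothetical stand-in for a trained model: damps the density by 1% per step.
toy_model = lambda rho: 0.99 * rho
rho0 = np.full((400, 200), 1e-3)
preds = iterative_forecast(toy_model, rho0, 100)
```

For this toy model the log-space error grows linearly with the number of iterations, which is the kind of cumulative drift the iterative forecast is designed to expose.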
However, we see a cumulative discrepancy in the density near the event horizon and along the evacuated funnel, with the density increasing in the DL model compared to the target data. This indicates that lower-density regions can be problematic for the DL forecasting.

Figure 7 shows the temporal evolution of the mass for three different radial ranges: the whole domain, small radii and large radii. Looking at the left panel, we see that after some time the mass in the DL model begins increasing exponentially. The middle and right panels pin down exactly where the mass divergence occurs: at small radii; regions located at larger radii conserve mass much better. It is as if the infalling gas were getting stuck in the low-density region where a coronal wind is expected. In the multi-sim case, we will show that this mass conservation issue vanishes when the model learns from data coming from multiple simulations.

We achieved a substantial speed-up when advancing the state of the accretion flow into the future using CNNs compared with the usual numerical solutions of the fluid dynamics. For instance, AN generated our training dataset by solving the flow conservation equations on 400 CPU cores of the AGUIA-USP computer cluster, which contains Intel Xeon E7-2870 CPUs, whereas the trained DL model computes a frame in about 0.03 seconds: a speed-up of a factor of about 5000 per frame.

Figure 8 shows how much time the numerical simulation and the DL model take to advance the state of the accreting BH over the same interval, not taking the training time into account. This corresponds to an overall speed-up of a factor of about 3 × 10^4. We also show the comparison between the number of CPU-hours both approaches take. If we take the training time into account, the speed-up decreases to 38×, which is roughly consistent with what would be expected from solving the conservation equations using a pure GPU code (e.g. Liska et al. 2019b).

Here, we present the results of the model trained not on only one simulation, but on several of them, each with different initial conditions (i.e. the multi-sim case). We report how well the DL model forecasts a system whose initial conditions are outside the training dataset, with the aim of quantifying the generalization capabilities of the model to unseen regions of the parameter space. How well does the model generalize?

Figure 9 shows the corresponding direct prediction of the multi-sim model after being fed as input the simulation data of PL0SS3. The figure illustrates that the density contrast between prediction and target is at most 10%. One feature of the predictions is that their accuracy seems to be correlated with the spatial resolution of the input data. Concretely, parts of the domain with larger cell sizes (i.e. lower resolution) have correspondingly larger values of Δ log ρ compared to mesh regions with higher spatial resolution. We observe this behavior towards the poles and at large radii, precisely the parts of the domain where the resolution was by design lower in AN.

Figure 10 shows how the mass, average density and the difference between target and data vary as a function of time based on the direct prediction approach.
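Computing these mass and average-density curves requires integrating the predicted density over the non-uniform polar mesh. A sketch under the assumption of axisymmetric 2D (r, θ) cells follows; the logarithmic example mesh below is illustrative, not AN's exact grid:

```python
import numpy as np

def mass_on_polar_grid(rho, r_edges, theta_edges):
    """Integrate density over a (possibly non-uniform) 2D polar mesh,
    assuming axisymmetry: M = integral of rho r^2 sin(theta) dr dtheta dphi.
    rho: (N_r, N_theta) cell-centred densities; edges: 1D arrays of cell faces."""
    # exact volume of each axisymmetric cell:
    # (2*pi/3) * (r_out^3 - r_in^3) * (cos th_in - cos th_out)
    r3 = r_edges ** 3
    dv_r = (r3[1:] - r3[:-1]) / 3.0
    dv_t = np.cos(theta_edges[:-1]) - np.cos(theta_edges[1:])
    vol = 2.0 * np.pi * np.outer(dv_r, dv_t)       # (N_r, N_theta)
    return float(np.sum(rho * vol))

# Illustrative mesh: logarithmic radial spacing (finer near the BH, as in AN)
# and a uniform theta mesh; a constant density should give M = rho * V_shell.
r_edges = np.geomspace(1.0, 100.0, 401)            # 400 radial cells
theta_edges = np.linspace(0.0, np.pi, 201)         # 200 polar cells
rho = np.ones((400, 200))
m = mass_on_polar_grid(rho, r_edges, theta_edges)
```

Using exact cell volumes makes the diagnostic insensitive to the non-uniform spacing, so any mass drift reflects the forecast itself rather than the quadrature.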
Both the mass and ρ̄ of the DL model remain fairly constant, displaying a systematic difference of less than 5% compared to the target. The third panel shows the per cent error between target and prediction, which remains roughly constant at about 2%.

Now we use the iterative prediction to assess whether the DL model trained on multiple simulations is able to robustly evolve a system consisting of initial conditions that were not present in the training data. Starting from the PL0SS3 data, we iterate the predictions.
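Recall that in the multi-sim case the model is conditioned on each simulation's initial configuration through the vector v. That encoding can be sketched as follows; the PN/PL and ST/SS rules follow the text, while the numeric values assigned to each α are hypothetical here:

```python
def config_to_vector(ang_mom_profile, viscosity_profile, alpha):
    """Encode a simulation's initial configuration as v = (v1, v2, v3),
    to be concatenated in the network's latent space (as in Wang et al. 2020).
    The alpha -> v3 lookup below is an assumed mapping for illustration."""
    v1 = {"PN": 0.0, "PL": 1.0}[ang_mom_profile]
    v2 = {"ST": 0.0, "SS": 1.0}[viscosity_profile]
    v3 = {0.3: 1.0, 0.1: 0.3, 0.01: 0.01}[alpha]   # hypothetical encoding
    return (v1, v2, v3)

print(config_to_vector("PL", "SS", 0.01))  # (1.0, 1.0, 0.01)
```

Conditioning on such a vector is what lets a single trained network distinguish between the different initial-condition families during inference.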
Figure 4.
Comparison of the density between the training dataset (target, left panel) and the DL model prediction (middle panel) for the one-sim case. The color maps the logarithm of the density. The right panel shows the target-prediction difference. The y-axis shows the z-component and the x-axis the radius, both in units of R_s. Both the highest and the lowest |Δ log ρ| are small.

Figure 5.
Temporal evolution of the gas mass, average density and the difference between target data and prediction expressed as a per cent error, using direct predictions (one-sim). The dashed red curve represents the target while the solid green curve represents the prediction. The plot shows that the model can learn mass conservation well, with an error of less than 1%. m and ρ̄ correspond to the initial mass and mean density within the box displayed in Figure 4.

Figure 11 shows the results for 10, 25 and 50 iterations, which advance the dataset into the future by 50, 125 and 250 frames, respectively. The time difference between two consecutive frames is about 197 GM/c^3. We see that the DL model predictions become increasingly different from the target data as the number of iterations increases, in the region near the torus border and towards the poles. We believe this is because these regions contain a smaller number of cells (i.e. they are lower-resolution regions), which, as we described previously, leads to failure modes of the DL model. In this low-resolution region, Δ log ρ increases to a value as large as two, even though the mean difference in the equator remains low. We again achieve a substantial speed-up with the DL model with respect to conventional, state-of-the-art CPU numerical fluid dynamics solvers.

5 DISCUSSION

We have used a DL model based on CNNs that was trained on the solutions of fluid dynamics conservation equations for the following problem: an accreting black hole surrounded by a radiatively inefficient accretion flow, where the rotating gas is subject to internal viscous stresses, pressure forces and gravity. Our aim was to assess the performance of DL techniques in evolving the spatiotemporal density distribution of a more realistic astrophysical simulation dataset. We considered two different training datasets: the one-sim case, where we
Figure 6.
Comparison between the target and the iterative predictions (one-sim case) at three different times, corresponding to 125, 250 and 500 time steps after the last snapshot of the cross-validation set.

trained the model on the density spacetime distribution for one single initial condition, and the multi-sim case, where the training set included several different initial conditions. We considered two methods for time-evolving the density distribution provided at a specific time: the direct approach, where the DL model advances one time step given the input data, and the iterative approach, where the model computes many different future states once provided with a density time-slice. The iterative approach is particularly relevant, since it probes the longer-term forecasting capabilities of the trained model.

We begin by discussing the shortcomings of our one-sim experiment, in which the training dataset comes from one simulation,
Figure 7.
Temporal evolution of the gas mass using the iterative approach within three different regions of the flow (one-sim case): all r (left panel), r < Rₛ (middle panel) and r > Rₛ (right panel). The dashed line represents the DL prediction while the solid line represents the target data. The mass increases exponentially at small radii.
Figure 8.
Comparison of the wall time needed to evolve the accretion flow using the iterative method (one-sim case), in seconds (left panel) and CPU-hours (right panel). Shorter is better in this plot. The black rectangle indicates the performance of traditional numerical simulations based on the Godunov approach implemented on a CPU cluster, whereas the light green rectangle shows the performance of the DL model inference on a single GPU. The DL model was previously trained on two GPUs.
PNSS3. One limitation of these training data is that the torus is relatively stable dynamically throughout the entire duration of the original simulation by AN, displaying weak convective turbulence, which translates into little variability in the density distribution of the accretion flow (i.e. it is almost a "frozen torus"). Another shortcoming of the training data is the reduced number of cells as one approaches the poles or moves towards larger radii. This dependence of the spatial resolution was implemented by design by AN, because the most important physical phenomena that the authors wanted to study with the simulations did not occur near the poles or at larger radii, so a lower resolution in these regions was acceptable. However, we have found that the subdomains with lower resolution ended up affecting the learning. Concretely,
Figure 9.
The density predicted by the DL model for the dataset PL0SS3 compared with the target data, using direct predictions, multi-sim case.
the largest differences between the DL predictions and the target data occurred precisely near the poles or at larger radii. We plan on investigating ways of accounting for nonuniform meshes in the training data a posteriori. We have found in the one-sim DL model a strong mass injection in the region between the poles and the bulk of the gaseous torus, at polar angles up to θ ≈ 70° (the exact angles depend on the initial conditions of the training dataset). Since our training data do not include velocities, we are unable to say whether this extra mass is outflowing or inflowing. We are also not sure why the neural network injects vigorous mass in this region; this deserves to be investigated further. Our goal with this pilot project was to study the ability of DL models to learn from simulated data of spatiotemporally chaotic astrophysical systems and to reproduce the behavior of these systems over time. We showed that the model could capture the system's spatial features while conserving mass well when trained with a large dataset consisting of flows evolved from different initial conditions.
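For concreteness, the mass-conservation diagnostic of the kind shown in Figures 5 and 10 can be sketched in a few lines. This is an illustrative sketch only; the function and variable names are ours, not taken from the paper's code, and we assume density snapshots on a grid whose (possibly nonuniform) cell volumes are known:

```python
import numpy as np

def percent_mass_error(pred_rho, target_rho, cell_volume):
    """Per cent error in total gas mass between predicted and target
    density snapshots, summed over the analysis box. `cell_volume`
    can encode a nonuniform grid; density arrays are (time, i, j)."""
    m_pred = (pred_rho * cell_volume).sum(axis=(1, 2))
    m_true = (target_rho * cell_volume).sum(axis=(1, 2))
    return 100.0 * np.abs(m_pred - m_true) / m_true

# toy check: a uniform 1 per cent overdensity gives a 1 per cent mass error
target = np.ones((3, 16, 16))   # three snapshots of a 16x16 density field
pred = 1.01 * target
vol = np.ones((16, 16))         # uniform cell volumes (an assumption here)
err = percent_mass_error(pred, target, vol)
print(err)  # ≈ [1. 1. 1.]
```

With the true, nonuniform cell volumes of the simulation mesh, the same function would weight the under-resolved polar cells correctly.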
Figure 10.
Temporal evolution of mass, average density and the difference between target and prediction using the direct prediction method, multi-sim case. The dashed red curve represents the target while the solid green curve represents the prediction.
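The iterative forecasting method can be sketched as an autoregressive rollout. In this illustrative sketch (names are ours), we assume, consistently with the frame counts quoted for Figure 11, that each pass of the model advances a five-frame block which is then fed back as the next input:

```python
import numpy as np

def rollout(model, block, n_iterations):
    """Iterative forecasting: each pass maps a block of consecutive
    density frames to the next block, which is fed back as input."""
    outputs = []
    for _ in range(n_iterations):
        block = model(block)   # advance one block (here: 5 frames) per pass
        outputs.append(block)
    return np.concatenate(outputs, axis=0)

toy_model = lambda b: b + 0.0   # stand-in: identity "forecast"
seed = np.zeros((5, 8, 8))      # five seed frames of an 8x8 density field
future = rollout(toy_model, seed, n_iterations=10)
print(future.shape)  # (50, 8, 8): 10 iterations advance 50 frames
```

The direct approach corresponds to a single pass (`n_iterations=1`), always fed with ground-truth input; only the rollout exposes the long-term error accumulation discussed in the text.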
The model can also capture well the range of variation of the density. The model can simulate the accretion flow over a duration of ∼ 8 × 10⁴ M while conserving the mass. This duration corresponds to 80 dynamical times at r = 100 GM/c², or to thousands of dynamical times closer to the inner edge of the domain. During this time, the DL model achieved a speed-up of a factor of ∼ 3 × 10⁴ compared to regular numerical integrators for astrophysical fluid dynamics; this means reducing the duration of the computations from 3 days to only 15 seconds. Taken together, these results indicate that DL models offer the potential to dramatically accelerate numerical fluid dynamics simulations while conserving mass, energy and momentum well.
In this pilot study we have used a deep learning model to evolve in time a spatiotemporally chaotic astrophysical system consisting of an accreting black hole which feeds on a large, turbulent gas reservoir. In other words, we have trained a machine to perform black hole weather forecasting, using a convolutional neural network and a dataset consisting of numerical solutions to the conservation equations for a range of initial conditions. The training set was obtained from a set of hydrodynamical, pseudo-Newtonian simulations appropriate for a Schwarzschild black hole surrounded by a radiatively inefficient accretion flow which extends from 2 to 400 Schwarzschild radii. Our main conclusions can be summarized as follows:
(i) We find that convolutional neural networks predict well the temporal evolution of the system over a long duration of 8 × 10⁴ GM/c³, which corresponds to 80 dynamical times at r = 100 GM/c².
(ii) The reality imagined by the deep learning model drifts from the training dataset over time (an "artificial Alzheimer"), with the density gradually building up and the error increasing to ∼
90 per cent after 8 × 10⁴ GM/c³.
(iii) When we train the DL model with several simulations of accretion flows spanning multiple initial conditions, the resulting model is able to evolve an accretion flow with initial conditions not present in the training dataset for a duration of 40000 GM/c³. Our approach thus seems to generalize well.
(iv) The DL model seems to "learn too well" from the training dataset. This results in artifacts in the regions of the flow near the poles, where the mesh contains fewer cells, since our hydrodynamical simulations are based on a nonuniform grid. In such regions, we observe an injection of mass by the DL model, which can be attenuated by training the CNN with more models.
(v) Once trained, the DL model evolves on a single GPU a turbulent accretion flow ∼ 3 × 10⁴ times faster than traditional numerical fluid dynamics integrators running on 200 CPUs.
In conclusion, our results indicate that a deep learning model gives acceptable solutions of the fluid conservation equations orders of magnitude faster than an explicit numerical solver. We believe that a data-driven machine learning approach is very promising for accelerating not only fluid dynamics simulations, but also general relativistic magnetohydrodynamic ones. This topic certainly deserves further investigation.

ACKNOWLEDGEMENTS
We acknowledge productive discussions with Ivan Almeida, Fabio Cafardo, Gustavo Soares, Nando de Freitas, Mike Walmsley, Leandro Kerber, Amelie Saintonge, Françoise Combes and Andrew Humphrey. We also acknowledge the useful discussions that occurred during the Khipu 2019 and UK-Brazil Frontiers of Science 2020 workshops. R. D. was supported by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) Proex. R. N. was supported by FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) under grant 2017/01461-2. The Black Hole Group received the donation of two GPUs from NVIDIA: a Quadro P6000 under the GPU Grant Program and a GP100 from NVIDIA Brasil.
REFERENCES
Abadi M., et al., 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, http://tensorflow.org/
Agrawal S., Barrington L., Bromberg C., Burge J., Gazen C., Hickey J., 2019, Machine Learning for Precipitation Nowcasting from Radar Images (arXiv:1912.12132)
Almeida I., Nemmen R., 2020, MNRAS, 492, 2553
Balbus S. A., 2003, ARA&A, 41, 555
Battaglia P., Pascanu R., Lai M., Rezende D. J., Kavukcuoglu K., 2016, in Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS'16. Curran Associates Inc., Red Hook, NY, USA, pp 4509–4517
Bergstra J., Bengio Y., 2012, Journal of Machine Learning Research
Breen P. G., Foley C. N., Boekholt T., Portegies Zwart S., 2019, arXiv e-prints, p. arXiv:1910.07291
Brunton S. L., Noack B. R., Koumoutsakos P., 2020, Annual Review of Fluid Mechanics, 52, 477
Chattopadhyay A., Nabizadeh E., Hassanzadeh P., 2020, Journal of Advances in Modeling Earth Systems, 12, e2019MS001958
Chen Y., Jiang H., Li X., Ghamisi P., 2016, IEEE Transactions on Geoscience and Remote Sensing
Chollet F., et al., 2015, Keras, https://github.com/fchollet/keras
Cybenko G., 1989, Mathematics of Control, Signals and Systems, 2, 303
Event Horizon Telescope Collaboration et al., 2019a, ApJ, 875, L1
Event Horizon Telescope Collaboration et al., 2019b, ApJ, 875, L5
Gammie C. F., McKinney J. C., Tóth G., 2003, ApJ, 589, 444
Goodfellow I., Bengio Y., Courville A., 2016, Deep Learning. MIT Press
Grete P., Glines F. W., O'Shea B. W., 2019, arXiv e-prints, p. arXiv:1905.04341
Hausen R., Robertson B., 2019, Morpheus: A Deep Learning Framework For Pixel-Level Analysis of Astronomical Image Data (arXiv:1906.11248)
Ho L. C., 2008, ARA&A, 46, 475
Figure 11.
Comparison between target, prediction and Δ log ρ after 10, 25 and 50 iterations.
Figure 12.
Temporal evolution of mass predicted using the iterative method, multi-sim training, for three regions: all domain, inner radii (r < Rₛ) and outer radii (r > Rₛ).
Figure 13.
Time taken to evolve the PL0SS3 simulation using standard fluid dynamics solvers versus deep learning using the iterative approach, multi-sim training (shorter is better). The DL model evolves the black hole accretion flow ∼ 7 × 10⁴ times faster.
Hornik K., 1991, Neural Networks, 4, 251
Jaeger H., Haas H., 2004, Science, 304, 78
Karpathy A., Toderici G., Shetty S., Leung T., Sukthankar R., Fei-Fei L., 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition
King R., Hennigh O., Mohan A., Chertkov M., 2018, arXiv e-prints, p. arXiv:1810.07785
Krizhevsky A., Sutskever I., Hinton G. E., 2017, Commun. ACM, 60, 84
LeCun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436
Liska M., et al., 2019a, arXiv e-prints, p. arXiv:1912.10192
Liska M., Tchekhovskoy A., Ingram A., van der Klis M., 2019b, MNRAS, 487, 550
Meier D. L., 2002, New Astronomy Reviews
Mohan A., Daniel D., Chertkov M., Livescu D., 2019, arXiv e-prints, p. arXiv:1903.00033
Murtagh F., 1991, Neurocomputing, 2, 183
Nemmen R. S., Storchi-Bergmann T., Eracleous M., 2014, MNRAS, 438, 2804
Pathak J., Hunt B., Girvan M., Lu Z., Ott E., 2018, Phys. Rev. Lett., 120, 024102
Penna R. F., Kulkarni A., Narayan R., 2013, A&A, 559, A116
Porth O., Olivares H., Mizuno Y., Younsi Z., Rezzolla L., Moscibrodzka M., Falcke H., Kramer M., 2017, Computational Astrophysics and Cosmology, 4, 1
Porth O., et al., 2019, ApJS, 243, 26
Prechelt L., 1996, in Neural Networks: Tricks of the Trade
Ronneberger O., Fischer P., Brox T., 2015, in Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer
Schaye J., et al., 2015, MNRAS, 446, 521
Shakura N. I., Sunyaev R. A., 1973, A&A, 24, 337
Sola J., Sevilla J., 1997, IEEE Transactions on Nuclear Science, 44, 1464
Stone J. M., Pringle J. E., Begelman M. C., 1999, MNRAS, 310, 1002
Tompson J., Schlachter K., Sprechmann P., Perlin K., 2016, CoRR, abs/1607.03597
Toro E., 2009, Riemann Solvers and Numerical Methods for Fluid Dynamics: A Practical Introduction.
Springer Berlin Heidelberg, https://books.google.com.br/books?id=SqEjX0um8o0C
Vogelsberger M., et al., 2014, Nature, 509, 177
Wang Y., Bilinski P., Bremond F. F., Dantcheva A., 2020, in WACV 2020 - Winter Conference on Applications of Computer Vision. Snowmass Village, United States, https://hal.archives-ouvertes.fr/hal-02368319
Yuan F., Narayan R., 2014, ARA&A, 52, 529
Zhou D.-X., 2020, Applied and Computational Harmonic Analysis, 48, 787
APPENDIX A: FLUID DYNAMICS SIMULATION DETAILS
We summarize the properties (kinematic viscosity ν, Shakura-Sunyaev parameter α, and the angular momentum profile ℓ(r)) of nine simulations in Table A1. AN explore two parametrizations of the kinematic viscosity, similarly to Stone et al. (1999):
• ν = α r^(1/2) is the "K-model" in Stone et al. (1999). This parametrization of ν is referred to by AN as ST.
• ν = α c_s²/Ω_K is the parametrization by Shakura & Sunyaev (1973), referred to by AN as SS.
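For illustration, the two viscosity parametrizations can be written as simple functions. This is a sketch under our assumptions: code units with GM = 1, the r^(1/2) scaling for the ST model, and c_s²/Ω_K for the SS model, as reconstructed above; the names are ours:

```python
import numpy as np

GM = 1.0  # gravitational parameter in code units (our assumption)

def nu_st(r, alpha):
    """'ST' viscosity (K-model of Stone et al. 1999): nu = alpha * r**0.5."""
    return alpha * np.sqrt(r)

def nu_ss(r, alpha, c_s):
    """'SS' viscosity (Shakura & Sunyaev 1973): nu = alpha * c_s**2 / Omega_K."""
    omega_k = np.sqrt(GM / r**3)  # Keplerian angular velocity
    return alpha * c_s**2 / omega_k

print(nu_st(4.0, 0.1))       # 0.2
print(nu_ss(1.0, 0.1, 1.0))  # 0.1 (Omega_K = 1 at r = 1 for GM = 1)
```

Note that the ST prescription depends only on position, while the SS prescription responds to the local thermodynamic state through the sound speed.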
Name     ℓ(R)   ν    α      T (M)
PNST01   PN     ST   0.01   8.0
PNST1    PN     ST   0.1    0.9
PNSS1    PN     SS   0.1    4.5
PNSS3    PN     SS   0.3    3.3
PL0ST1   PL     ST   0.1    0.8
PL0SS3   PL     SS   0.3    2.1
PL2SS1   PL     SS   0.1    1.4
PL2SS3   PL     SS   0.3    2.1
Table A1.
The initial configuration of each simulation: angular momentum profile, viscosity parametrization, and the α parameter. The last column shows the duration of each simulation.
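Table A1 can be encoded as a small lookup structure, which makes it easy to select runs by configuration. The sketch below is ours (not from the paper's code) and simply transcribes the table rows:

```python
# Table A1 as a lookup: name -> (angular momentum profile,
# viscosity parametrization, alpha, duration T in the table's units)
SIMULATIONS = {
    "PNST01": ("PN", "ST", 0.01, 8.0),
    "PNST1":  ("PN", "ST", 0.1, 0.9),
    "PNSS1":  ("PN", "SS", 0.1, 4.5),
    "PNSS3":  ("PN", "SS", 0.3, 3.3),
    "PL0ST1": ("PL", "ST", 0.1, 0.8),
    "PL0SS3": ("PL", "SS", 0.3, 2.1),
    "PL2SS1": ("PL", "SS", 0.1, 1.4),
    "PL2SS3": ("PL", "SS", 0.3, 2.1),
}

def with_viscosity(profile):
    """Names of all runs using a given viscosity parametrization."""
    return [name for name, (_, nu, _, _) in SIMULATIONS.items() if nu == profile]

print(with_viscosity("SS"))  # ['PNSS1', 'PNSS3', 'PL0SS3', 'PL2SS1', 'PL2SS3']
```

Such a lookup is convenient, e.g., for assembling the multi-sim training set from all runs sharing a viscosity law.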
In the parametrizations of ν, Ω_K is the Keplerian angular velocity, c_s is the sound speed, and α is the Shakura-Sunyaev parameter (Shakura & Sunyaev 1973). In accretion flow simulations, α is the parameter associated with turbulence (Stone et al. 1999; Shakura & Sunyaev 1973). AN used an α varying with R, but for our purposes we used only the data where α is constant, with values in the range 0.01 ≤ α ≤ 0.3. The angular momentum profile ℓ(r) is important since it describes the torus' kinetic energy. AN investigated two angular momentum profiles:
• ℓ(R) ∝ R^a, with a = 0, a = 0.2, and a = 0.4. Since it is a power law, we will refer to it as PL.
• ℓ(R) = constant for R < R_S, and ℓ(R) = ℓ_K otherwise, where ℓ_K is the Keplerian specific angular momentum. Since it is based on Penna et al. (2013), we will refer to it as PN.
For our purposes, we will distinguish the angular momentum profile as PN or PL and the viscosity profile ν as ST or SS. The time difference between two snapshots is Δt ≈ 197 M.

APPENDIX B: DATA PREPARATION
To avoid a biased model, we performed data preparation, which consists of normalizing the density values, cropping the grid of the simulations, and creating 4D arrays that serve as input and output of our network. First, we normalize the density values to the range [0, 1] using a logarithmic normalization:

ρ_NORM = [log(ρ) − log(min ρ)] / [log(max ρ) − log(min ρ)].   (B1)

The normalization avoids bias by putting all values on the same scale; in our raw data the density spans many orders of magnitude, and without normalization the largest values might dominate. Normalization also helps to speed up the learning, leading to faster convergence (Sola & Sevilla 1997). The next step of the data preparation is the cropping of the grid, performed as follows: in the radial direction, we removed 144 cells spanning 9500 GM/c², encompassing mostly low-density atmosphere; in the polar direction, we removed eight cells, corresponding to a thin wedge along each pole. We performed
Figure B1.
We append five consecutive density fields to form a block; stacking all blocks yields a 4D array whose first dimension N is the number of snapshots. t_n is the density field associated with the n-th snapshot.
Figure C1.
Illustration of the UNet architecture used to train the model.
the crop after initial tests in which the atmosphere dominated the learning procedure. Our main interest is to evaluate how much the model can learn the accretion flow dynamics, so removing the atmosphere does not present major drawbacks. The final step is to build the blocks that feed the network. Since we want to forecast density fields one Δt ahead, we build the blocks to incorporate temporal information. The scheme in Fig. B1 shows how we build the blocks: we attach five consecutive density fields, creating a block, with N being the number of blocks. N is defined from the point where the accretion rate becomes stationary.

APPENDIX C: ARCHITECTURE
The neural network architecture UNet (Ronneberger et al. 2015) used in this paper is shown in Table C1 and illustrated in Figure C1. The encoder is composed of four blocks, each with two convolutional layers and a MaxPooling layer. The encoder receives the input and transforms it, generating intermediate feature maps with lower dimensionality in the (x, y) plane. The latent space, which also consists of convolutions, connects the encoder to the decoder. The decoding operations transform the generated feature maps, gradually upsampling them until the output tensors reach the same dimensions as the inputs. There are skip connections between the encoder outputs and the decoder inputs: the concatenations connect the information of the encoder with the decoder. In this way, the decoder will
Name       Operation
Input      –
Conv-1-1   Conv2D + ReLU
Conv-1-2   Conv2D + ReLU
Max-1      MaxPooling2D
Conv-2-1   Conv2D + ReLU
Conv-2-2   Conv2D + ReLU
Max-2      MaxPooling2D
Conv-3-1   Conv2D + ReLU
Conv-3-2   Conv2D + ReLU
Max-3      MaxPooling2D
Conv-4-1   Conv2D + ReLU
Conv-4-2   Conv2D + ReLU
Max-4      MaxPooling2D
Conv-5-1   Conv2D + ReLU
Conv-5-2   Conv2D + ReLU
UpSam-1    UpSampling2D
Conc-1     Conc(UpSam-1, Conv-4-2)
Conv-6-1   Conv2D + ReLU
Conv-6-2   Conv2D + ReLU
UpSam-2    UpSampling2D
Conc-2     Conc(UpSam-2, Conv-3-2)
Conv-7-1   Conv2D + ReLU
Conv-7-2   Conv2D + ReLU
UpSam-3    UpSampling2D
Conc-3     Conc(UpSam-3, Conv-2-2)
Conv-8-1   Conv2D + ReLU
Conv-8-2   Conv2D + ReLU
UpSam-4    UpSampling2D
Conc-4     Conc(UpSam-4, Conv-1-2)
Conv-9-1   Conv2D + ReLU
Conv-9-2   Conv2D + ReLU
Output     Conv2D + ReLU
Table C1.
Details of the UNet architecture used in this project. Conc is concatenation and N is the batch size.
base their decision on the primary information of the encoder. We feed the network with the training set and use the validation set for the early-stopping procedure (Prechelt 1996). We performed all the DL experiments on two NVIDIA Pascal-architecture GPUs, a Quadro GP100 and a Quadro P6000. The implementation was done in Keras (Chollet et al. 2015) v2.1 with the
TensorFlow (Abadi et al. 2015) v1.8 backend.
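The shape bookkeeping of the encoder-decoder with skip connections can be illustrated without a DL framework. The sketch below is ours and only mimics the structure: a stand-in 1×1 "convolution" replaces the real trained Conv2D kernels, and two levels replace the four of Table C1; it shows how pooling halves the (x, y) dimensions and how upsampled decoder maps are concatenated with encoder maps of matching size:

```python
import numpy as np

def maxpool2(x):
    """2x2 max pooling over an (H, W, C) array (H, W assumed even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv_relu(x, out_channels, rng):
    """Stand-in for Conv2D + ReLU: a random 1x1 convolution (illustrative)."""
    h, w, c = x.shape
    kernel = rng.standard_normal((c, out_channels))
    return np.maximum(x @ kernel, 0.0)

def tiny_unet(x, rng):
    """Two-level UNet bookkeeping: encode, bottleneck, decode with skips."""
    e1 = conv_relu(x, 16, rng)                         # (H, W, 16)
    e2 = conv_relu(maxpool2(e1), 32, rng)              # (H/2, W/2, 32)
    b = conv_relu(maxpool2(e2), 64, rng)               # (H/4, W/4, 64) bottleneck
    d2 = conv_relu(np.concatenate([upsample2(b), e2], axis=-1), 32, rng)   # skip
    d1 = conv_relu(np.concatenate([upsample2(d2), e1], axis=-1), 16, rng)  # skip
    return conv_relu(d1, 1, rng)                       # (H, W, 1) output

rng = np.random.default_rng(0)
out = tiny_unet(np.ones((64, 64, 5)), rng)  # five stacked density frames in
print(out.shape)  # (64, 64, 1): one frame out, same (x, y) size as the input
```

The real model replaces `conv_relu` with trained 2D convolutions (Keras `Conv2D`), `maxpool2` with `MaxPooling2D`, and `upsample2` with `UpSampling2D`, with four encoder/decoder levels as in Table C1.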
Figure D1.
We show the 10th step after the training set ends for all simulations. Each simulation is at a different gravitational time at the 10th step, but the time step Δt is the same for all of them. We plot the density profile of the target and the prediction, as well as the difference Δ, for four systems. The PNSS3 system is the one used in the "one simulation" case.

APPENDIX D: "ALL SYSTEMS" CASE
Figures D1 and D2 show the direct predictions for all simulations, equivalent to one time step Δt into the future. We see that the model manages to predict all simulations, as in the "one simulation" case, provided we feed it the previous snapshot of the simulation.

This paper has been typeset from a TEX/LATEX file prepared by the author.
Figure D2.
The same as in Figure D1 for the rest of the simulations. We show further analysis of the PL0SS3 simulation, since the model was not trained with this configuration.