A Taylor Based Sampling Scheme for Machine Learning in Computational Physics
Paul Novello∗†, Gaël Poëtte†, David Lugato†, Pietro Marco Congedo∗

∗ INRIA Paris Saclay, {paul.novello,pietro.congedo}@inria.fr
† CEA CESTA, {paul.novello, gael.poette, david.lugato}@cea.fr

Second Workshop on Machine Learning and the Physical Sciences (NeurIPS 2019), Vancouver, Canada.

Abstract
Machine Learning (ML) is increasingly used to construct surrogate models for physical simulations. We take advantage of the ability to generate data using numerical simulation programs to train ML models better and achieve accuracy gains with no performance cost. We elaborate a new data sampling scheme based on Taylor approximation to reduce the error of a Deep Neural Network (DNN) when learning the solution of an ordinary differential equation (ODE) system.
Computational physics allows the simulation of various physical phenomena when real measurements and experiments are very difficult to perform, or even unaffordable. These simulations can themselves come with a prohibitive computational cost, so that they are very often replaced by surrogate models [15, 10]. Recently, Machine Learning (ML) has proven its efficiency in several domains, and it shares similarities with classical surrogate models. The question of replacing some costly parts of simulation codes by ML models thus becomes relevant. Towards this long-term objective, this paper tackles a first step by investigating new ways to increase the accuracy of ML models at a fixed performance cost.

To this end, we leverage the ability to control the sampling of the training data. Several previous works have hinted at the importance of the training set, in the context of Deep Learning [2], Active Learning [13] and Reinforcement Learning [8]. Our methodology relies on adapting the data generation procedure to gain accuracy with the same ML model, exploiting the information coming from the derivatives of the quantity of interest w.r.t. the inputs of the model. This approach is similar to Active Learning in the sense that we adapt our training strategy to the problem, but it differs in that it takes place before the training of the model. It is therefore complementary to active learning methods (see [13] for a review and [6, 16] for more recent applications). The efficiency of this original methodology is tested on the approximation by a Deep Neural Network (DNN) of a stiff Bateman Ordinary Differential Equation (ODE) system. Solving this system at a moderate cost is essential in many physical simulations (neutronics [3, 5], combustion [4], detonics [11], etc.).
To introduce the general methodology, we consider the problem of approximating a function $f : S \subset \mathbb{R}^{n_i} \to \mathbb{R}^{n_o}$, where $S$ is a subspace defined by the physics of interest. Let $f_\theta$ be a model whose parameters $\theta$ have to be found in order to minimize a loss function $L(\theta) = \|f - f_\theta\|$. In supervised learning, we are usually given a training data set of $N$ points, $\{X_1, ..., X_N\}$, and their pointwise values $\{f(X_1), ..., f(X_N)\}$. These points allow us to statistically estimate $L(\theta)$ and then to use optimization algorithms to find a minimum of $L(\theta)$ w.r.t. $\theta$.

Amongst ML techniques, we chose $f_\theta$ to be a DNN for two reasons. First, DNNs outperform most other ML models in high dimensions, i.e. $n_i, n_o \gg 1$, which is often the case in computational physics. Second, recent advances in Deep Learning frameworks have made DNNs much more efficient: better optimization algorithms, as well as compatibility with GPUs and TPUs, have greatly increased their performance.

Sampling hypothesis - The methodology is based on the assumption that a given DNN yields better accuracy when the dataset focuses on regions where $f$ is locally steeper. To identify these regions, we make use of the order-$n$ Taylor approximation of $f$ (in multi-index notation):

$$f(X + \epsilon) \underset{\|\epsilon\| \to 0}{=} \sum_{0 \le |k| \le n} \epsilon^k \frac{\partial^k f(X)}{k!} + o(\|\epsilon\|^n), \qquad (1)$$

$$D^n_\epsilon(X) = \sum_{1 \le |k| \le n} \epsilon^k \frac{\|\partial^k f(X)\|}{k!}. \qquad (2)$$

The quantity $f(X + \epsilon) - f(X)$, derived using (1), gives an indication of how much $f$ changes around $X$. By neglecting the orders above $\epsilon^n$, it is then possible to find the regions of interest by focusing on $D^n_\epsilon$, given by (2). Notice that $D^n_\epsilon$ is evaluated using $\|\partial^k f(X)\|$ instead of $\partial^k f(X)$, so that the derivatives do not cancel each other. The next steps are to evaluate $D^n_\epsilon$ and to sample from it.

Evaluating $D^n_\epsilon(X)$ - Equation (2) involves the computation of the derivatives of $f$. Usually in supervised learning, only $\{f(X_1), ..., f(X_N)\}$ is provided and the derivatives of $f$ are unknown. Here, however, the dataset is drawn from a numerical simulation software, so it is possible either to use finite differences to approximate the derivatives, or to compute them exactly using automatic differentiation if we have access to the implementation. In any case, $\{\partial^k f(X_1), ..., \partial^k f(X_N)\}$, and then $\{D^n_\epsilon(X_1), ..., D^n_\epsilon(X_N)\}$, can be computed along with $\{f(X_1), ..., f(X_N)\}$.

Sampling procedure - According to the previous assumption, we want to sample more where $D^n_\epsilon$ is higher. To this end, we can build a probability density function (pdf) from $D^n_\epsilon$, which is possible since $D^n_\epsilon \ge 0$. It remains to normalize it, but in practice it is enough to consider a distribution $d \propto D^n_\epsilon$. Here, to approximate $d$ we use a Gaussian Mixture Model (GMM) with pdf $d_{GMM}$, which we fit to $\{D^n_\epsilon(X_1), ..., D^n_\epsilon(X_N)\}$ using the Expectation-Maximization (EM) algorithm. $N'$ new data points $\{\bar{X}_1, ..., \bar{X}_{N'}\}$ can then be sampled, with $\bar{X}_i \sim d_{GMM}$.
Finally, using the simulation software, we obtain $\{f(\bar{X}_1), ..., f(\bar{X}_{N'})\}$, add it to $\{f(X_1), ..., f(X_N)\}$ and train our DNN on the whole dataset.

Methodology recapitulation - Our methodology, which we call Taylor Based Sampling (TBS), is recapitulated in Algorithm 1.

Algorithm 1: Taylor Based Sampling (TBS)
1:  Require: $\epsilon$, $N$, $N'$, $n_{GMM}$, $n$
2:  Sample $\{X_1, ..., X_N\}$, with $X_i \sim \mathcal{U}(S)$
3:  for $0 \le k \le n$ do   ▷ Note that order $k = 0$ builds $\{f(X_1), ..., f(X_N)\}$
4:      Compute $\{\partial^k f(X_1), ..., \partial^k f(X_N)\}$ using the simulation software
5:  Compute $\{D^n_\epsilon(X_1), ..., D^n_\epsilon(X_N)\}$ using (2)
6:  Approximate $d \propto D^n_\epsilon$ with a GMM, using the EM algorithm to obtain a density $d_{GMM}$
7:  $i \leftarrow 1$   ▷ Rejection method to prevent sampling outside $S$
8:  while $i \le N'$ do
9:      Sample $\bar{X}_i \sim d_{GMM}$
10:     if $\bar{X}_i \notin S$ then
11:         discard $\bar{X}_i$
12:     else
13:         $i \leftarrow i + 1$
14: Compute $\{f(\bar{X}_1), ..., f(\bar{X}_{N'})\}$
15: Add $\{f(\bar{X}_1), ..., f(\bar{X}_{N'})\}$ to $\{f(X_1), ..., f(X_N)\}$

Some remarks on specific lines of Algorithm 1 follow.
Line 1: The choices of $\epsilon$, the number of Gaussian distributions $n_{GMM}$ and $N'$ are not mandatory at this stage. Indeed, they are not a prerequisite to the computation of $\partial^k f(X)$, which is the computationally costly part of evaluating $D^n_\epsilon$. This allows choosing the parameters $\epsilon$ and $n_{GMM}$ a posteriori. In this work, our selection criterion is to avoid sparsity of $\{\bar{X}_1, ..., \bar{X}_{N'}\}$ over $S$. We use the Python package scikit-learn [12], and more specifically the GaussianMixture class.
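To make this step concrete, here is a minimal sketch using scikit-learn's GaussianMixture. Since EM fits a density to samples rather than to function values, one plausible reading of "fitting to the $D^n_\epsilon$ values" is to resample the uniform points with probabilities proportional to those values before fitting; this resampling trick, the helper names, and the box bounds representing $S$ are our assumptions, not the paper's code. The rejection loop anticipates the truncated-GMM sampling of lines 7-13, discussed below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def fit_d_gmm(X, D_values, n_gmm=10, n_resample=100_000):
    """Fit d_GMM approximating d, with d proportional to D^n_eps.

    sklearn's EM fits a density to *samples*, so we first resample the
    uniformly drawn points X with probabilities proportional to their
    D^n_eps values, then fit the mixture to the resampled points.
    """
    p = np.asarray(D_values, dtype=float)
    p /= p.sum()
    idx = rng.choice(len(X), size=n_resample, p=p)
    return GaussianMixture(n_components=n_gmm).fit(X[idx])

def sample_in_S(gmm, n_prime, lower, upper):
    """Draw n_prime points from the GMM truncated to the box
    S = [lower, upper] by rejection (Algorithm 1, lines 7-13)."""
    kept = []
    while len(kept) < n_prime:
        x, _ = gmm.sample(n_prime)  # sklearn returns (samples, labels)
        inside = np.all((x >= lower) & (x <= upper), axis=1)
        kept.extend(x[inside])
    return np.array(kept[:n_prime])
```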
Line 2: Usually in physical simulations, the input subspace $S$ is bounded. Without a priori information on $f$, we sample the first points uniformly in $S$.
Line 3-4: To compute the derivatives of $f$, and because we have access to its implementation, we use the Python package jax [9], which allows automatic differentiation of numpy code.
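As an illustration, the sketch below evaluates $D^n_\epsilon$ for $n = 2$ with jax, assuming an isotropic $\epsilon = \epsilon_0(1, ..., 1)$. The toy function f is only a placeholder with the right signature ($\mathbb{R}^3 \to \mathbb{R}^2$), not the Bateman solution; the structure of the computation follows Eq. (2).

```python
import jax
import jax.numpy as jnp

def f(x):
    # Placeholder with the right signature, R^3 -> R^2 (inputs (u0, eta0, t),
    # outputs (u(t), eta(t))); any smooth jax-traceable function works here.
    u0, eta0, t = x
    return jnp.array([u0 * jnp.exp(-eta0 * t), eta0 * (1.0 + u0 * t)])

jac_f = jax.jacfwd(f)               # shape (n_o, n_i): first derivatives
hess_f = jax.jacfwd(jax.jacfwd(f))  # shape (n_o, n_i, n_i): second derivatives

def D2(x, eps0):
    """Criterion of Eq. (2) for n = 2, with isotropic eps = eps0 * (1, ..., 1)."""
    J = jac_f(x)
    H = hess_f(x)
    # |k| = 1 terms: one multi-index per input dimension, k! = 1
    order1 = eps0 * jnp.sum(jnp.linalg.norm(J, axis=0))
    # |k| = 2 terms: the factor 1/2 covers k! = 2 on the diagonal and the
    # fact that each off-diagonal pair appears twice in the symmetric Hessian
    order2 = 0.5 * eps0**2 * jnp.sum(jnp.linalg.norm(H, axis=0))
    return order1 + order2

eps0 = 1e-2  # illustrative magnitude only
print(D2(jnp.array([1.0, 0.5, 2.0]), eps0))
```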
Line 7-13: Because the support of a GMM is not bounded, some points can be sampled outside $S$. We recommend discarding these points and sampling until all points are inside $S$. This rejection method is equivalent to sampling points from a truncated GMM.

We apply our method to the resolution of the Bateman equations, the ODE system

$$\begin{cases} \partial_t u(t) = v\,\sigma_a \cdot \eta(t)\,u(t), \\ \partial_t \eta(t) = v\,\Sigma_r \cdot \eta(t)\,u(t), \end{cases} \quad \text{with initial conditions} \quad \begin{cases} u(0) = u_0, \\ \eta(0) = \eta_0, \end{cases}$$

with $u \in \mathbb{R}^+$, $\eta \in (\mathbb{R}^+)^M$, $\sigma_a^T \in \mathbb{R}^M$, $\Sigma_r \in \mathbb{R}^{M \times M}$. Here, $f : (u_0, \eta_0, t) \mapsto (u(t), \eta(t))$. For physical applications, $M$ ranges from tens to thousands. We consider the particular case $M = 1$, so that $f : \mathbb{R}^3 \to \mathbb{R}^2$, with $f(u_0, \eta_0, t) = (u(t), \eta(t))$. The advantage of $M = 1$ is that we have access to an analytic, cheap-to-compute solution for $f$. It allows us to conduct extensive analyses for the design of our methodology. Of course, this particular case can also be solved using a classical ODE solver, which allows us to test the method end to end, so that it can be generalized to higher dimensions ($M > 1$).
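For reference, here is a minimal explicit Euler integrator for the $M = 1$ system (the text below mentions that the data points were sampled with an explicit Euler scheme). The coefficient values $v$, $\sigma_a$, $\Sigma_r$ and the step count are illustrative placeholders, not those of the paper.

```python
import numpy as np

def bateman_m1(u0, eta0, t, v=1.0, sigma_a=-1.0, sigma_r=-0.5, n_steps=10_000):
    """Explicit Euler integration of the M = 1 Bateman system
        du/dt   = v * sigma_a * eta * u
        deta/dt = v * sigma_r * eta * u
    so that f(u0, eta0, t) = (u(t), eta(t)).  Negative coefficients are
    chosen here so that the toy solution decays instead of blowing up."""
    dt = t / n_steps
    u, eta = float(u0), float(eta0)
    for _ in range(n_steps):
        du, deta = v * sigma_a * eta * u, v * sigma_r * eta * u
        u, eta = u + dt * du, eta + dt * deta
    return u, eta

print(bateman_m1(1.0, 0.5, 2.0))  # one training pair: X = (1.0, 0.5, 2.0), f(X)
```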
All DNN trainings have been performed in Python, with TensorFlow [1]. We used a fully connected DNN with hyperparameters chosen using a simple grid search. The final values are: 2 hidden layers, "ReLU" activation function, and 32 units for each layer, trained with the Mean Squared Error (MSE) loss function using the Adam optimization algorithm with a batch size of 50000, for 40000 epochs, on $N + N' = 50000$ points, with $N = N'$. We first trained the model for $(u(t), \eta(t)) \in \mathbb{R}^2$ with a uniform sampling, which we call basic sampling (BS) ($N' = 0$), and then with TBS for several values of $n$, $n_{GMM}$ and $\epsilon = \epsilon_0(1, 1, 1)$, to be able to find good values. We finally select $\epsilon_0 = 5 \times 10^{-\bullet}$, $n = 2$ and $n_{GMM} = 10$. The data points used in this case have been sampled with an explicit Euler scheme. This experiment has been repeated 50 times to ensure statistical significance of the results.
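A tf.keras sketch of the architecture and training setup just described (2 hidden layers of 32 ReLU units, MSE loss, Adam, batch size 50000, 40000 epochs); the random arrays stand in for the actual BS/TBS training sets.

```python
import numpy as np
import tensorflow as tf

# Architecture from the grid search described above.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),            # (u0, eta0, t)
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2),              # (u(t), eta(t))
])
model.compile(optimizer="adam", loss="mse")

# Random stand-ins for the N + N' = 50000 training points and their values.
X_train = np.random.rand(50_000, 3).astype("float32")
Y_train = np.random.rand(50_000, 2).astype("float32")

# batch_size = 50000 makes each epoch a single full-batch gradient step;
# 40000 epochs as in the text (reduce for a quick test).
model.fit(X_train, Y_train, batch_size=50_000, epochs=40_000, verbose=0)
```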
Table 1 summarizes the MSE, i.e. the $L^2$ norm of the error of $f_\theta$, and the $L^\infty$ norm, with $L^\infty(\theta) = \max_{X \in S} |f(X) - f_\theta(X)|$, obtained at the end of the training phase. This last metric is important because the goal in computational physics is not only to be accurate on average, which is measured by the MSE, but to be accurate over the whole input space $S$. Those norms are estimated using the same test data set of $N_{test} = 50000$ points. The values are the means of the independent experiments, displayed with a confidence interval. These results reflect an error reduction of 6.6% for $L^2$ and of 45.3% for $L^\infty$, which means that TBS mostly improves the $L^\infty$ error of $f_\theta$. Moreover, the $L^\infty$ error confidence intervals do not intersect, so the gain is statistically significant for this norm.
Table 1: Comparison between BS and TBS (mean ± confidence interval over the 50 repetitions)

| Sampling | $L^2$ error | $L^\infty$ error | AEG   | AEL   |
|----------|-------------|------------------|-------|-------|
| BS       | · ± ·       | · ± ·            | -     | -     |
| TBS      | · ± ·       | · ± ·            | · ± · | · ± · |
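The two error metrics of Table 1 can be estimated on a held-out set as follows; `f_true` stands for the reference solution (analytic, or the Euler integrator sketched earlier) and is an assumption of this sketch.

```python
import numpy as np

def l2_linf_errors(model, f_true, X_test):
    """Estimate the MSE and the L-infinity error of f_theta on a test set."""
    Y_true = np.array([f_true(x) for x in X_test])
    Y_pred = model.predict(X_test, verbose=0)
    err = np.abs(Y_true - Y_pred)
    return np.mean(err**2), np.max(err)

# Example use, with the Euler integrator above as reference solution:
# X_test = np.random.rand(50_000, 3)
# mse, linf = l2_linf_errors(model, lambda x: bateman_m1(*x), X_test)
```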
Figure 1a shows how the DNN can perform for an average prediction. Figure 1b illustrates the benefits of TBS relative to BS on the $L^\infty$ error (Figure 2b). These two figures confirm the previous observation about the gain in $L^\infty$ error. Finally, Figure 2a displays $u_0, \eta_0 \mapsto \max_t D^n_\epsilon(u_0, \eta_0, t)$ w.r.t. $(u_0, \eta_0)$, and shows the region where $D^n_\epsilon$ is highest; TBS hence focuses on this region. Note that for the readability of this plot, the values are capped: otherwise, only the few points with high $D^n_\epsilon$ would be visible. Figure 2b displays $u_0, \eta_0 \mapsto g_{\theta_{BS}}(u_0, \eta_0) - g_{\theta_{TBS}}(u_0, \eta_0)$, with $g_\theta : u_0, \eta_0 \mapsto \max_t \|f(u_0, \eta_0, t) - f_\theta(u_0, \eta_0, t)\|$, where $\theta_{BS}$ and $\theta_{TBS}$ denote the parameters obtained after a training with BS and TBS, respectively. It can be interpreted as the error reduction achieved with TBS. The highest error reduction occurs in the expected region: indeed, more points are sampled where $D^n_\epsilon$ is higher. The error is slightly increased in the rest of $S$, which could be explained by a sparser sampling of this region. However, as summarized in Table 1, the average error loss (AEL) of TBS is around six times lower than the average error gain (AEG), with

$$\mathrm{AEG} = \mathbb{E}_{u_0, \eta_0}\left(Z\,\mathbb{1}_{Z > 0}\right), \qquad \mathrm{AEL} = \mathbb{E}_{u_0, \eta_0}\left(Z\,\mathbb{1}_{Z < 0}\right),$$

where $Z(u_0, \eta_0) = g_{\theta_{BS}}(u_0, \eta_0) - g_{\theta_{TBS}}(u_0, \eta_0)$. In practice, AEG and AEL are estimated using uniform grid integration, and averaged over the experiments.
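A sketch of this uniform-grid estimation; the callables `g_bs` and `g_tbs` (the $g_\theta$ maps above) and the grid bounds are assumptions of the example.

```python
import numpy as np

def aeg_ael(g_bs, g_tbs, u_grid, eta_grid):
    """Uniform-grid estimation of AEG and AEL.  g_bs and g_tbs map
    (u0, eta0) to the worst-case-in-time error of the corresponding
    model, i.e. the g_theta of the text."""
    U, E = np.meshgrid(u_grid, eta_grid, indexing="ij")
    Z = np.vectorize(g_bs)(U, E) - np.vectorize(g_tbs)(U, E)
    aeg = np.mean(Z * (Z > 0))  # contribution of points where Z > 0
    ael = np.mean(Z * (Z < 0))  # contribution of points where Z < 0
    return aeg, ael

# Example: aeg_ael(g_bs, g_tbs, np.linspace(0, 1, 100), np.linspace(0, 1, 100))
```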
Figure 1: $t \mapsto f_\theta(u_0, \eta_0, t)$ for $f_\theta$ obtained with the two samplings: (a) for a randomly chosen $(u_0, \eta_0)$; (b) for the $(u_0, \eta_0)$ resulting in the highest point-wise error.

Figure 2: (a) $u_0, \eta_0 \mapsto \max_t D^n_\epsilon(u_0, \eta_0, t)$ w.r.t. $(u_0, \eta_0)$; (b) $u_0, \eta_0 \mapsto g_{\theta_{BS}}(u_0, \eta_0) - g_{\theta_{TBS}}(u_0, \eta_0)$.

Results are promising for this case of the Bateman equations ($M = 1$). We selected a simple numerical simulation to efficiently test the initial assumption and elaborate our methodology. Its application to more expensive physical simulations is the next step of this work. By then, several problems have to be tackled. First, TBS does not scale well to higher dimensions, because it involves the computation of $\|\partial^k f(X)\|$, i.e. $n_o \times n_i^{n+1}$ derivatives for each $D^n_\epsilon$. This issue is less important when using automatic differentiation than finite differences, and we can compute the derivatives for only a few points (i.e. choose $N < N'$) to ease it, but we will investigate how to translate the initial assumption to high dimensions. Moreover, the exploration of the input space is more difficult in this case, because the initial sampling is sparser for the same $N$ when $n_i$ increases. In this work, $\epsilon$, $N'$ and $n$ have only been chosen empirically, and we arbitrarily selected a GMM to approximate $d$. These choices have to be questioned in order to improve the methodology. Finally, this paper focused on accuracy rather than performance, whereas performance is the goal of using a surrogate in place of a simulation code. Extending TBS to higher dimensions will allow us to investigate this aspect. Besides, a DNN may not be the best ML model in every situation in terms of performance, but the advantage of TBS is that it can be applied to any ML model.

We described a new approach to sample training data based on a Taylor approximation, in the context of ML for the approximation of physical simulation codes. Though not specific to Deep Learning, we applied this method to the approximation of the solution of a physical ODE system by a Deep Neural Network and increased its accuracy for the same model architecture. In addition to the leads mentioned above, the idea of using the derivatives of numerical simulations to better train ML models should be explored. This idea has already been investigated in other fields such as gradient-enhanced kriging [7] or Hermite interpolation [14], and could lead to ML approaches based on higher orders. An example of application could be to include the derivatives as new training points and make a DNN learn these derivatives.

References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA, 2009. ACM.

[3] Adrien Bernède and Gaël Poëtte. An unsplit Monte-Carlo solver for the resolution of the linear Boltzmann equation coupled to (stiff) Bateman equations. Journal of Computational Physics, 354:211–241, 2018.

[4] M. Bisi and L. Desvillettes. From reactive Boltzmann equations to reaction–diffusion systems. Journal of Statistical Physics, 124(2):881–912, 2006.

[5] Jan Dufek, Dan Kotlyar, and Eugene Shwageraus. The stochastic implicit Euler method – a stable coupling scheme for Monte Carlo burnup calculations. Annals of Nuclear Energy, 60:295–300, 2013.

[6] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML '17, pages 1183–1192. JMLR.org, 2017.

[7] Zhong-Hua Han, Yu Zhang, Chen-Xing Song, and Ke-Shi Zhang. Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization. AIAA Journal, 55(12):4330–4346, 2017.

[8] Tang Jie and Pieter Abbeel. On a connection between importance sampling and the likelihood ratio policy gradient. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1000–1008. Curran Associates, Inc., 2010.

[9] Matt Johnson, Roy Frostig, Dougal Maclaurin, and Chris Leary. JAX: Autograd and XLA. https://github.com/google/jax.

[10] Lu Lu, Xuhui Meng, Zhiping Mao, and George E. Karniadakis. DeepXDE: A deep learning library for solving differential equations. CoRR, abs/1907.04502, 2019.

[11] D. Lucor, C. Enaux, H. Jourdren, and P. Sagaut. Stochastic design optimization: Application to reacting flows. Computer Methods in Applied Mechanics and Engineering, 196(49):5047–5062, 2007.

[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[13] Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.

[14] A. Spitzbart. A generalization of Hermite's interpolation formula. The American Mathematical Monthly, 67(1):42–46, 1960.

[15] B. Sudret, S. Marelli, and J. Wiart. Surrogate models for uncertainty quantification: An overview. In 2017 11th European Conference on Antennas and Propagation (EUCAP), pages 793–797, March 2017.

[16] Christoph Zimmer, Mona Meister, and Duy Nguyen-Tuong. Safe active learning for time-series modeling with Gaussian processes. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, pages 2735–2744, USA, 2018. Curran Associates Inc.