AdS/Deep-Learning made easy: simple examples
December 23, 2020
Mugeon Song,a Maverick S. H. Oh,a,b Yongjun Ahn,a and Keun-Young Kima

a Department of Physics and Photon Science, Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea
b Department of Physics, University of California, Merced, Merced, CA, USA
E-mail: [email protected], [email protected], [email protected], [email protected]

Abstract:
Deep learning has been widely and actively used in various research areas. Recently, in the gauge/gravity duality, a new deep learning technique, the so-called AdS/DL (Deep Learning), has been proposed [1, 2]. The goal of this paper is to describe the essence of the AdS/DL in the simplest possible setups, for those who want to apply it to the subject of emergent spacetime as a neural network. For prototypical examples, we choose simple classical mechanics problems. This method differs a little from standard deep learning techniques in the sense that we not only obtain the right final answers but also gain physical understanding of the learning parameters.
Keywords:
Gauge/gravity duality, Deep learning
1 Introduction
Machine learning or deep learning [3] techniques have become very useful and novel tools in various research areas. Recently, an interesting machine learning idea has been proposed by Hashimoto et al. in [1, 2], where the authors apply deep learning techniques to problems in the gauge/gravity duality [4, 5]. They showed that the spacetime metric can be "deep-learned" from the boundary conditions of the scalar field which lives in that space. The essential deep learning (DL) idea of [1, 2] is to construct the neural network (NN) by using the structure of the differential equation. The discretized version of the differential equation includes the information of physical parameters such as a metric. The discretized variable plays the role of the different "layers" of the NN and the dynamic variables correspond to nodes. Therefore, training the NN means training the physical parameters, so, at the end of the day, we can extract the trained physical parameters. This idea is dubbed AdS/DL (Deep Learning).

In this paper, we apply the AdS/DL technique to simple classical mechanics problems such as Fig. 1. By considering simple examples, we highlight the essential idea of AdS/DL without many details of the gauge/gravity duality, which will be useful for those who want to apply this method to the subject of emergent spacetime as a neural network. Furthermore, our work will be a good starting point to learn a physics-friendly NN technique rather than the classical way from computer science.

Let us describe a prototypical problem. Suppose that we want to figure out the force in the black box shown in Fig. 1. We are given only initial and final data, for example, the initial and final position and velocity, (x_i, v_i) and (x_f, v_f). A standard method is to start with an (educated) guess for a functional form of the force (say, F(x, v)). One can use this "trial" force to simulate the system by solving Newton's equation.
After trial-and-error simulation and comparison with experimental data, we may be able to obtain the approximate functional form of the force.

Figure 1. A ball goes through a "black-box" and the velocity of the ball changes from v_i at t_i to v_f at t_f. It is very challenging to retrieve the information inside the black-box when the given data is limited to initial and final data.

However, if the force is complicated enough, it will not be easy to make a good guess at first glance, and it will not be easy to modify the trial function in a simple way. In this situation, machine learning can be a very powerful method to obtain the force in the black box.

Usually, when there is a big enough input-output data set, classical DL techniques with a NN, even without considering the physical meaning of the NN or the structure of the problem, can surely make a model that takes input data points and gives matching output values in the trained region, because that is what DL is good at. A wide and deep enough feed-forward NN with linear and nonlinear transformations can trivially achieve such convergence, as the Universal Approximation Theorem (UAT) guarantees [6, 7]. Retrieving physical parameters from such a model is not easy, because the network in general has little to do with the mathematical structures of the models we want to understand. However, if we build a NN in a way that reflects the mathematical structure of the problem, as in the AdS/DL, we are able to retrieve physical information from the model. In this case, the discretized time (t) plays the role of the layers and the dynamic variables (x, v) correspond to the nodes. The unknown force is encoded in the NN, so it will be trained.

This paper is organized as follows. In section 2, the general framework of building and training a NN from the EOM is introduced. In sections 3 and 4, example problems are tackled with the methodology described in section 2. Section 3 covers a simpler example with one variable (one-dimensional velocity) while section 4 deals with a problem with two variables (one-dimensional position and velocity). We conclude in section 5.
2 General framework

The general framework can be divided into three major parts. First, a training data set is generated using the EOM of a system and a widely-used ordinary differential equation solver; we used an adaptive Runge-Kutta method of order 5(4) [8] (scipy.integrate.ode with the integrator dopri5). Second, a NN is built from the EOM with randomly initialized parameters based on the Euler method. Third, the NN is trained with the training data sets. After these three steps, the resultant learned parameters are compared against the right parameters to see if the learning was successful. The first part, training data set generation, is trivial, so we elaborate starting from the second part. For comparison, the variables (φ, π, η) in [1] correspond to (x, v, t) in section 4 of this paper.

Figure 2. The NN structure with two kinematic variables, x and v. Each circular node denotes a neuron with its own variable. The lines between nodes show which nodes are directly correlated with which nodes. The initial values (x^{(0)}, v^{(0)}) are calculated from two input information nodes I^{(in)}_1 and I^{(in)}_2 by a pre-processing transformation T_pre. The kinematic variables propagate along the NN from the 0th layer to the Nth layer with the rule given by the EOM. The final values (x^{(N)}, v^{(N)}) are used to calculate the model's output information nodes \bar{I}^{(out)}_1 and \bar{I}^{(out)}_2, which are compared against the true output values I^{(out)}_1 and I^{(out)}_2 given by the training data set. The number of nodes in the input and output layers can vary depending on the experimental setup.

2.1 Building Neural Network from EOM

In this section, we review how to build a NN from the EOM, following the framework suggested by [2]. Fig. 2 shows the basic structure of the NN of our interest.
It is a feed-forward network, which means the propagation of variables is one-directional without any circular feedback. Its depth (the number of layers) is set to N + 1 (from 0 to N) excluding the input and output layers, while the width (the number of nodes per layer) is kept as two. The propagation rule from one layer to the next layer is given by the differential equations from the EOM, with the learning parameters of interest.

Here, T_pre is the transformation from the input layer to the 0th layer (pre-processing), which is the identity transformation in our cases, and T_post is the transformation from the Nth layer to the output layer (post-processing). The input layer and the pre-processing transformation are used when the experimental data need to be pre-processed into the kinematic variables that appear in the EOM. If there is no need for such pre-processing, one may omit the input layer, which is the case for the rest of this paper. We, however, chose to include the input layer in this section for more general applications which require pre-processing. (Measurement of velocity using the Doppler effect can be a good example of an experimental setup requiring nontrivial pre- and post-processing transformations. In that case, one input/output information node can be the initial/final frequency information, while T_pre/T_post connects it to the initial/final speed values in the NN layers, respectively.) The output layer corresponds to a set of experimental measurements after the propagation of variables with the EOM. For our cases, it will be the final variables and/or flags showing whether or not the trained data points give valid outputs. The details of how we set up the output layer are discussed in the following sections.

There are two main differences between a NN in our setup and a usual feed-forward NN. First, in our setup, the width of the NN stays constant, which is nothing but the number of kinematic variables used in the learning process. In usual cases, however, the width of the NN may vary for different layers to hold more versatility. Second, the propagation rule is set by the EOM and there are relatively few learning parameters, whereas most components of the propagation rule of a usual NN are set as learning parameters. (In a usual NN, the activation function is fixed as a nonlinear function, such as a sigmoid function or a rectified unit function, and the weight matrix is trained.) From the NN perspective, our setup may look restrictive, but from the physics perspective, it is more desirable because we may indeed obtain physical understanding of the inner structure of the NN: we want to "understand" the system rather than simply having answers.

How is the propagation rule given by the EOM? Let us assume that, as time changes from t_i to t_f, the following EOM holds:

  \ddot{x} = f(x, \dot{x}),   (2.1)

or,

  v = \dot{x},  \dot{v} = f(x, v).   (2.2)

If we discretize the time of (2.2) and take every time slice as a layer, we may construct a deep neural network with the structure of Fig. 2 with the following propagation rule, which is essentially the Euler method:

  x^{(k+1)} = x^{(k)} + v^{(k)} \Delta t,
  v^{(k+1)} = v^{(k)} + f(x^{(k)}, v^{(k)}) \Delta t,   (2.3)

where x^{(k)} and v^{(k)} are the variables at time t_i + k \Delta t (the k-th layer), with \Delta t := (t_f - t_i)/N.

Another way of writing (2.3) is to separate the linear part and the nonlinear part. The linear transformation can be represented by a weight matrix, W^{(k)} for the k-th layer, while the nonlinear transformation is called an activation function, \varphi^{(k)} for the k-th layer, so that the k-th layer variable set,

  \mathbf{x}^{(k)} = ( x^{(k)}, v^{(k)} )^T,   (2.4)

propagates to the (k+1)-th layer by

  \mathbf{x}^{(k+1)} = \varphi^{(k)} ( W^{(k)} \mathbf{x}^{(k)} ),   (2.5)

where

  W^{(k)} = \begin{pmatrix} 1 & \Delta t \\ 0 & 1 \end{pmatrix},  \varphi^{(k)} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} a \\ b + f(x^{(k)}, v^{(k)}) \Delta t \end{pmatrix}.   (2.6)

In this way, the NN is built from the EOM, and different layers mean different times, except for the input and output layers.
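The propagation rule (2.3) can be sketched directly in code. The snippet below is a minimal illustration, not the paper's implementation: it propagates (x, v) through N Euler layers for a hypothetical known force f(x, v) = -x (a harmonic oscillator standing in for the unknown learnable function).

```python
import math

def propagate(x0, v0, f, t_i, t_f, N):
    """Forward pass of the EOM network: one Euler step per layer, eq. (2.3)."""
    dt = (t_f - t_i) / N
    x, v = x0, v0
    for _ in range(N):  # layer k -> layer k+1
        # Tuple assignment uses the old (x, v) on the right-hand side, matching
        # x^(k+1) = x^(k) + v^(k) dt,  v^(k+1) = v^(k) + f(x^(k), v^(k)) dt.
        x, v = x + v * dt, v + f(x, v) * dt
    return x, v

# Hypothetical stand-in force f(x, v) = -x, whose exact solution is
# x(t) = cos(t), v(t) = -sin(t) for x(0) = 1, v(0) = 0.
xN, vN = propagate(1.0, 0.0, lambda x, v: -x, 0.0, 1.0, 1000)
```

Each loop iteration is exactly one application of W^{(k)} followed by \varphi^{(k)} in (2.5)-(2.6); in the AdS/DL setup the values f(x^{(k)}, v^{(k)}) would be learnable rather than fixed.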
The learning parameters, f(x^{(k)}, v^{(k)}) in our case, are randomly set within a reasonable range. To distinguish the true output (training data) from the model output, the model output is written with a bar on the variable name, whereas the true output is written without a bar. The model output \bar{\mathbf{x}}^{(out)} can then be expressed as

  \bar{\mathbf{x}}^{(out)} \equiv T_{post} \left( \varphi^{(N-1)} \left( W^{(N-1)} \cdots \varphi^{(0)} \left( W^{(0)}\, T_{pre}(\mathbf{x}^{(in)}) \right) \right) \right),   (2.7)

where \mathbf{x}^{(in)} = ( I^{(in)}_1, I^{(in)}_2 )^T and \bar{\mathbf{x}}^{(out)} = ( \bar{I}^{(out)}_1, \bar{I}^{(out)}_2 )^T. The true output from the training data is denoted as \mathbf{x}^{(out)} = ( I^{(out)}_1, I^{(out)}_2 )^T.

2.2 Training Neural Network

Note that the weight matrix W^{(k)} and the activation function \varphi^{(k)} of the NN are constructed according to the EOM as shown in (2.6). Thus, our goal is to train the function f(x^{(k)}, v^{(k)}) using the NN and input/output data. A single pair (\mathbf{x}^{(in)}, \mathbf{x}^{(out)}) is called a training data point, and a whole collection of them {(\mathbf{x}^{(in)}, \mathbf{x}^{(out)})} is called a training data set. From the training data set, one can define an error function (a.k.a. loss function) as

  E = \frac{1}{n_{batch}} \sum_{batch} \left| \bar{\mathbf{x}}^{(out)} - \mathbf{x}^{(out)} \right|^2 + E_{reg},   (2.8)

where a batch is a part of the data set chosen for one learning cycle and n_batch is the number of data points in one batch. For example, if there are 500 data points in total and 100 data points are used for one batch of the learning process, n_batch is 100 and five learning cycles cover the whole data set, which is called one epoch of learning. The summation over "batch" means that we add up the term from every data point in the batch. Dividing the data set into batches makes the learning process more efficient, especially when the data set is big.
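The error function (2.8) can be sketched as follows. This is a schematic stand-in: the smoothness form of E_reg and the coefficient c = 0.03 anticipate the regularizers used in the later sections and are illustrative assumptions here, not a definitive choice.

```python
def batch_error(pred, true, params, c=0.03):
    """Loss of eq. (2.8): mean squared batch error plus a regularization term.
    `params` is the learnable array (e.g. a discretized force); the smoothness
    penalty and the coefficient c = 0.03 are illustrative choices."""
    n = len(pred)
    data_err = sum((p - t) ** 2 for p, t in zip(pred, true)) / n
    e_reg = c * sum((params[i + 1] - params[i]) ** 2
                    for i in range(len(params) - 1))
    return data_err + e_reg

# A perfectly flat parameter array contributes no regularization error,
# so the loss reduces to the mean squared output mismatch: (0 + 0.25)/2.
E = batch_error([1.0, 2.0], [1.0, 2.5], [0.0, 0.0, 0.0])
```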
To optimize multiple parameters with enough stability, many epochs of learning are required.

The first term in (2.8) is the L²-norm error of the batch, calculated from the difference between the output of the NN model \bar{\mathbf{x}}^{(out)} and the true output from the training data set \mathbf{x}^{(out)}; it is one of the most widely-used error functions. The second term, E_reg, is the regularization error, which makes unphysical solutions (e.g. unnecessarily zigzagging solutions) unfavorable in learning. The details of E_reg will be given in the following sections. Note that the error function defined here is only one example among possible choices; the structure of E can vary depending on the nature of the problem. Please refer to Sec. 4 for a variation.

In general, the value of E depends on both the weight matrix and the activation function. For our model, however, the weight matrices are constant in the sense that they are not learning parameters of the NN, so the activation functions \varphi^{(k)}, or more specifically the parameters f(x^{(k)}, v^{(k)}), are the only parameters to be learned while minimizing the value of E. As an optimizer (learning mechanism), the two most classic choices are stochastic gradient descent (SGD) and Adam, where the former is more stable and the latter is faster in many cases [9]. We used the Adam method with Python 3 and PyTorch as a general machine learning environment.

3 Case 1: Finding a Force F(v)

In this section, we describe the basic idea of our method using one of the simplest examples. Here, we use only one kinematic variable, the one-dimensional velocity v, to extract the information of the velocity-dependent drag force F_{1,True}(v) of a given system. This example is very simple but the application of the DL methodology is relevant and clear. The drag force is designed to be non-trivial to fully test the capability of the methodology.

Figure 3. The problem setup of case 1.
A ball in a known constant gravitational acceleration g (downward) goes through a "black-box" filled with a homogeneous medium with an unknown drag force F(v). From experiments, multiple initial and final velocity values, v_i and v_f, are recorded at fixed initial and final times t_i and t_f.

Figure 4. The diagram of the deep neural network for case 1.
Problem definition
We consider the problem setup described in Fig. 3. A ball with mass m is dropped with the initial velocity v_i at time t_i through a medium with an unknown, complicated drag force F(v) under a constant gravitational acceleration g downward. At time t_f, the velocity v_f is recorded. The times t_i and t_f are fixed whereas v_i varies, as does v_f, so that we have the input-output data set {(v_i, v_f)} for training. The EOM is given as follows, and we want to find the drag force F(v):

  \dot{v} = -g + \frac{F(v)}{m}.   (3.1)

Method
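The training pairs (v_i, v_f) are obtained by integrating (3.1) with a high-order solver. The paper uses an adaptive dopri5 integrator; the sketch below uses a fixed-step classic RK4 as a simple pure-Python stand-in, with the force F passed in as a function.

```python
import math

def rk4_final_velocity(v_i, F, m=1.0, g=10.0, t_i=0.0, t_f=4.0, steps=4000):
    """Integrate dv/dt = -g + F(v)/m, eq. (3.1), from t_i to t_f with RK4."""
    h = (t_f - t_i) / steps
    rhs = lambda v: -g + F(v) / m
    v = v_i
    for _ in range(steps):
        k1 = rhs(v)
        k2 = rhs(v + 0.5 * h * k1)
        k3 = rhs(v + 0.5 * h * k2)
        k4 = rhs(v + h * k3)
        v += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return v

# Sanity checks against closed forms: with no drag, v_f = v_i - g*(t_f - t_i);
# with linear drag F(v) = -v and m = 1, v(t) = (v_i + g) * exp(-t) - g.
v_free = rk4_final_velocity(5.0, lambda v: 0.0)
v_drag = rk4_final_velocity(5.0, lambda v: -v)
```

In practice any well-tested ODE library serves the same purpose; the key point is only that the data generator is independent of the NN that will be trained.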
Because we only have one kinematic variable v, it is enough to build a NN with one node per layer (a width of one), as described in Fig. 4. We omitted the input and output layers in Fig. 4, since the 0th and Nth layer values, v^{(0)} and v^{(N)}, are themselves used as the input and output, v_i and \bar{v}_f, without any pre- or post-processing. The propagation rule of the NN is written as follows:

  v^{(k+1)} = v^{(k)} - \left( g - \frac{F(v^{(k)})}{m} \right) \Delta t.   (3.2)

The initial velocity values v_i are evenly spaced with a gap of 5, and the corresponding v_f is calculated from an ordinary differential equation (ODE) solver independent of the NN, which is shown as thick gray points in Fig. 5(a). The total number of collected data points (v_i, v_f) is n_data = 51.

As mentioned above, it is possible to build a NN with one kinematic variable v and learn F_1 from the training data set. The depth of the NN is set to N = 10. The drag force F_1 is modeled as an array of size L = 251, where its i-th element F_{1,i} corresponds to the value of the drag force when |v| = i (i = 0, 1, 2, ..., 250); F_{1,i} = F_1(i). The array can hold the information of the drag force for integer speed values |v| ∈ [0, 250], where 250 is the upper limit of the speed of the ball during the data collection with the true drag force F_{1,True}(v). When the speed is not an integer, which is true for most cases, the value is linearly interpolated from the two nearest integer values. For example, if v = 0.4, the drag force value is calculated by F(0.4) = (1 - 0.4) × F(0) + 0.4 × F(1).

Our goal is to train F_1 to yield F_{1,True}. Let us now refine our notation by adding the superscript (j) to denote the intermediate outputs by F_1^{(j)}. The initial drag force F_1^{(0)} is set by L = 251 uniform random numbers as a "first guess"; see the red wiggly line in Fig. 5(b). The L elements of the drag force array are the learning parameters, which are updated in the direction of reducing the value of the error function. The error gets minimized as learning proceeds, and the error at the j-th learning cycle, E^{(j)}, is:

  E^{(j)} = \frac{1}{n_{batch}} \sum_{batch} \left| \bar{v}_f^{(j)} - v_f \right|^2 + \left( F_1^{(j)}(0) \right)^2 + c \sum_{i=0}^{L-2} \left( F_1^{(j)}(i+1) - F_1^{(j)}(i) \right)^2.   (3.3)

Here, the first term is the L²-norm error that trains the parameters to match the model output of the final velocity values, \bar{v}_f = v^{(N)}, with the true final velocity values, v_f, for a given batch input v_i. Meanwhile, the number of data points is small enough in this case, so we choose to use the whole data set for every learning cycle; n_batch = n_data = 51. To put a preference on a physically sensible profile of F_1, two regularization terms are introduced. The first, (F_1(0))², reflects a physical requirement: F_1(0) = 0, meaning there should be no drag force when v = 0. The second, c \sum_{i=0}^{L-2} (F_{1,i+1} - F_{1,i})^2, is a mean squared error between adjacent F_1 array values, which gives a preference to smoother profiles; it is not plausible for the drag force to have a spiky zigzag profile. In our computation c = 0.03 is used, and we explain how to choose a proper value of c at the end of this section. As an optimizer, the Adam method is used. For the numerical work, we chose m = 1, t_i = 0, t_f = 4, g = 10.
(During the learning process, however, some data points can have |v| > 250 by chance because of the initial random drag force profile. In that case, the drag force value F(v = 250) is used.)
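The interpolated lookup into the force array described above can be sketched as follows. A toy three-entry array stands in for the L = 251 learnable entries; callers pass the speed |v|, and out-of-range speeds clamp to the last entry as in the footnote.

```python
def force_lookup(F_arr, speed):
    """Linear interpolation of the force array at a non-integer speed,
    e.g. F(0.4) = (1 - 0.4) * F(0) + 0.4 * F(1)."""
    s = min(max(speed, 0.0), float(len(F_arr) - 1))  # clamp to [0, L-1]
    i = int(s)
    if i == len(F_arr) - 1:   # at or beyond the last entry: no right neighbor
        return F_arr[i]
    frac = s - i
    return (1.0 - frac) * F_arr[i] + frac * F_arr[i + 1]

F_arr = [0.0, 10.0, 14.0]        # toy stand-in for the L = 251 entries
val = force_lookup(F_arr, 0.4)   # (1 - 0.4)*0 + 0.4*10 = 4.0
```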
Figure 5. Case 1: comparison of the trained data for different epochs against the true data. (a) Learning progress of \bar{v}_f with different epochs. (b) Learning progress of F(v) with different epochs. With a big enough epoch number, for example 1000, the trained data (blue points/curve) agree with the true data (gray points/curve).
Examples
As an example force, a hypothetical (complicated) profile F_{1,True}(v), eq. (3.4), is assumed, built from the envelope v(300 - v)/1000 multiplying a combination of sine and cosine terms in v, plus an additional oscillatory term; it is shown as the gray line in Fig. 5(b). The training data set {(v_i, v_f)} is collected from it.

The learning result is shown in Fig. 5. In Fig. 5(a), the model output set {\bar{v}_f} is shown together with the training data set {v_f} for different epoch numbers. The NN model learns how to match the two precisely by modifying the learning parameters F(v) as the number of epochs increases. How F(v) is trained over different epoch numbers is shown in Fig. 5(b). As these plots show, the NN model matched {\bar{v}_f} with {v_f} accurately and discovered the F_{1,True} profile with high accuracy for a big enough epoch number.

To further test the capability of the NN to discover the drag force, drag force profiles with different shapes are tested with the same scheme. As Fig. 6 shows, the NN discovered the right F_{1,True} profiles accurately as well at epoch = 1000. From the figures, it is clear that both regularization terms (one setting F(0) = 0 and the other enforcing smoothness) are guiding the learning correctly by filtering out unphysical solutions.

We wrap up this section by discussing the choice of c for regularization. The value of c controls the smoothness of the F profile. Fig. 7(a) and 7(b) show the effect of different c values on the drag force and the error. If c is too big (c = 3, yellow dots), the F profile after the learning process ends up being too flat and the error remains very high. If c is too small (c = 3 × 10⁻³, red dots), the F profile stays spiky and the error remains relatively high as well. It turns out that c = 3 × 10⁻² (green dots) is suitable, showing that the trained F_1 (for the purpose of testing our method, the force profiles in Fig. 6 are generated by fitting artificially chosen complicated data: a high-order polynomial in v, and a polynomial in v modulated by tanh window factors, respectively)
Figure 6. Learning results for different F_{1,True} profiles. In each panel, the gray curve is the true profile and the colored curves show the trained F(v) at epochs 0, 100, 200, and 1000.

Figure 7. The regularization coefficient c dependence of the training results, for c = 3 × 10⁻³, 3 × 10⁻², and 3. (a) Trained drag force profiles for different c's: if c is too small the force profile is not smooth enough, and if c is too big, the profile becomes flat and deviates from the true profile. (b) Decreasing tendency of E for different c's: we may choose c such that the error saturates to the smallest value.

overlaps very well with F_{1,True} while resulting in the minimum error. Indeed, this value of c can be found by investigating how the error decreases as learning proceeds. As shown in Fig. 7(b), the c value which gives the minimum error can serve as the best value for regularization.

4 Case 2: Finding a Force F(x)

In the second case, two one-dimensional kinematic variables, x and v, come into play to retrieve the position-dependent force F_{2,True}(x) of a system from given data. Again, the force is designed to be non-trivial to fully examine the capability of the methodology. The content is divided into three subsections as well: problem definition, method, and examples.

Problem definition
As shown in Fig. 8, a ball is shot at the position x_i with the initial velocity v_i at time t_i. The initial position x_i belongs to the range (x_i^min, x_i^max), and the initial velocity is also chosen in a certain range, so that we have a window of training data. (A remark on the regularization of the previous section: choosing c by the minimum error is valid in the coarse-grained sense; if we want to fine-tune the value of c, we need to be more careful to remove the artificial effect of the regularization term on the entire error.)

Figure 8. The problem setup of case 2. A ball goes through a black-box with an unknown force field F(x), without any friction. The ball is dropped with speed v_i at time t_i and position x_i ∈ (x_i^min, x_i^max). At t_f, if the ball is in the vicinity of x_f, a speedometer reads its velocity v_f and the initial kinematic variable set (x_i, v_i) is taken as a positive data point (kind κ = 0); otherwise, the data point is negative (κ = 1).

Figure 9. The diagram of the deep neural network for case 2. An input data point at the 0th layer, (x^{(0)}, v^{(0)}), propagates to the Nth layer. Every data point gives the first output \bar{κ} = T_post(x^{(N)}), while the second output v^{(N)} is given only for the positive data points (κ = 0).

At the fixed final time t_f, if the ball is in the vicinity of x_f (within x_f ± ε), the initial kinematic variable set (x_i, v_i) is taken as a positive data point (kind κ = 0) and its velocity v_f is recorded. If the ball is not within x_f ± ε when t = t_f, the data point (x_i, v_i) is taken as a negative one (kind κ = 1) and we assume that we cannot measure a corresponding v_f value. In other words, a positive data point holds two output values, κ = 0 and v_f, while a negative point holds one output value, κ = 1. The EOM is given as \ddot{x} = F(x)/m and we want to find the force F_2.
The EOM can be separated into two first-order differential equations as follows:

  \dot{x} = v,  \dot{v} = \frac{1}{m} F(x).   (4.1)

Method
Because we have two kinematic variables, x and v, a setup with two nodes per layer (a width of two) is used for building the NN, as shown in Fig. 9. The input layer is omitted, while the output layer is formed using a post-processing transformation T_post on \bar{x}_f = x^{(N)}, which judges whether or not \bar{x}_f is in the vicinity of x_f at t_f: \bar{κ} = T_post(\bar{x}_f) ≃ 0 for |\bar{x}_f - x_f| ≤ ε (model-positive) and ≃ 1 for |\bar{x}_f - x_f| > ε (model-negative). More discussion of T_post follows shortly. For positive data points (κ = 0), their \bar{v}_f = v^{(N)} values are recorded as well. The propagation rule from the EOM is written as follows:

  x^{(k+1)} = x^{(k)} + v^{(k)} \Delta t,
  v^{(k+1)} = v^{(k)} + \frac{F(x^{(k)})}{m} \Delta t.   (4.2)

The depth of the NN is set to N = 20. As with the drag force F_1 of case 1, the force field F_2 is modeled as an array that holds the force values at integer positions; F_2 is an array of size L = 21 covering the integer positions x ∈ [0, 20]. The i-th component of F_2 is the force at x = i; F_{2,i} = F_2(i). When a position value is not an integer, the force is linearly interpolated from those of the two nearest integer positions.

Our goal is to train F_2 to yield F_{2,True}. The model's force profile at the j-th learning cycle, F_2^{(j)}, approaches F_{2,True} as j increases, if the learning is correctly designed and performed. The initial F_2 array, F_2^{(0)}, is set by normal random numbers as a "first guess"; see the red wiggly line in Fig. 11(e).

There are two things for the NN model to learn. First, the model needs to distinguish positive and negative data points: it should match the \bar{κ} of a given data point (x_i, v_i) with its actual κ.
Second, the model should be able to match the model's final velocity \bar{v}_f = v^{(N)} with the true final velocity v_f for the positive data points (x_i, v_i, κ = 0).

To reflect these, we need two terms in the error function, one for \bar{κ} and κ and the other for \bar{v}_f and v_f, in addition to the regularization error, which gives a preference to smoother profiles. The error function at the j-th learning cycle is as follows:

  E^{(j)} = \frac{N_1}{n_{batch}} \sum_{batch} \left| \bar{κ}^{(j)} - κ \right|^2 + \frac{N_2}{n_{batch,κ=0}} \sum_{batch,κ=0} \left| \bar{v}_f^{(j)} - v_f \right|^2 + c \sum_{i=0}^{L-2} \left( F_{2,i+1}^{(j)} - F_{2,i}^{(j)} \right)^2,   (4.3)

where the first and second terms are L²-norm errors normalized by their sizes n_batch and n_{batch,κ=0}, respectively, scaled by the coefficients N_1 and N_2 that control the relative importance of the two terms (both are set to one in our case). The third term is the regularization error for the smoothness of the profile (the mean squared error between adjacent F_2 array elements) with the coefficient c. The model output of the kind variable, \bar{κ} = T_post(\bar{x}_f = x^{(N)}), is calculated using the following post-processing transformation T_post, which gives T_post(|x - x_f| ≤ ε) ≃ 0 and T_post(|x - x_f| > ε) ≃ 1:

  T_post(\bar{x}_f) = \frac{1}{2} \left( \tanh[20((\bar{x}_f - x_f) - ε)] - \tanh[20((\bar{x}_f - x_f) + ε)] \right) + 1
                    = \frac{1}{2} \left( \tanh[20(\bar{x}_f - ε)] - \tanh[20(\bar{x}_f + ε)] \right) + 1  (since x_f = 0).   (4.4)

(a) Blue dots are positive data (κ = 0) and orange dots are negative data (κ = 1).
(b) v_f data for the positive data (κ = 0).

Figure 10. Input data for case 2.
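The smooth indicator (4.4) is straightforward to implement. The sketch below assumes x_f = 0 as in the paper; the tolerance eps = 0.5 is a hypothetical value (the paper fixes ε, but its exact value is not essential to the idea), and the steepness factor 20 is taken from (4.4).

```python
import math

def t_post(x_f_bar, x_f=0.0, eps=0.5, k=20.0):
    """Eq. (4.4): ~0 when |x_f_bar - x_f| <= eps (model-positive),
    ~1 otherwise (model-negative); smooth, hence differentiable."""
    d = x_f_bar - x_f
    return 0.5 * (math.tanh(k * (d - eps)) - math.tanh(k * (d + eps))) + 1.0

inside = t_post(0.0)    # well inside the window -> close to 0
outside = t_post(3.0)   # far outside the window -> close to 1
```

Because t_post is built from tanh's, its derivative with respect to \bar{x}_f (and hence to the force parameters upstream) is well-defined everywhere, which is exactly why a step function is avoided.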
The reason to use an analytic function form for T_post rather than a step function is to enable the optimizer to differentiate the error function in parameter space and find the direction in which to update the parameters to minimize the error; with a step function, an optimizer would not be able to find such a direction. For the learning setup, the following values are used: n_batch = 200 and n_{batch,κ=0} = 100, together with fixed values of the regularization coefficient c and the tolerance ε. The termination condition was epochs = 500. As an optimizer, the Adam method is used. For the numerical work, we chose m = 1, t_i = 0, t_f = 4, (x_i^min, x_i^max) = (10, 20), and x_f = 0, with v_i chosen in a range wide enough to collect positive data points in our setting.

Examples
As an example force field, a hypothetical (complicated) polynomial form of F_{2,True}(x), eq. (4.5), is assumed and the training data set {(x_i, v_i, κ, v_f)} is collected; it is shown as the gray line in Fig. 11(a). The experimental input data points (x_i, v_i) are generated uniformly at random in their preset ranges, and the output data ({κ = 0, v_f} for positive data points and {κ = 1} for negative data points) are collected using an ODE solver with F_{2,True}(x). The number of collected data points for training is 2,000 in total (1,000 positive, 1,000 negative). Fig. 10 shows the training data points. Fig. 10(a) shows the initial kinematic variables (x_i, v_i), where positive (κ = 0) and negative (κ = 1) data points are marked in blue and orange, respectively. Fig. 10(b) shows the distribution of the final velocity v_f of the positive data points with respect to the initial position x_i.

(a) Before learning: κ and \bar{κ} comparison. Blue: positive (κ = 0), Orange: model-positive (\bar{κ} ≃ 0), Green: intersection (κ = 0 and \bar{κ} ≃ 0).
(b) Before learning: v_f and \bar{v}_f comparison. Blue: training data (v_f), Green: model propagation result (\bar{v}_f).
(c) After learning: κ and \bar{κ} comparison. Blue: positive (κ = 0), Orange: model-positive (\bar{κ} ≃ 0), Green: intersection (κ = 0 and \bar{κ} ≃ 0).
(d) After learning: v_f and \bar{v}_f comparison. Blue: training data (v_f), Green: model propagation result (\bar{v}_f). (e) Learning progress of F(x) with different epochs. Gray: F_{2,True}(x).

Figure 11. The model's output before (a, b) and after (c, d) learning; the learning progress of F(x) is shown in (e). (a) (x_i, v_i) plot of positive and model-positive data points before learning: most points' κ values are guessed incorrectly as \bar{κ} by the model NN. (b) (x_i, v_f) and (x_i, \bar{v}_f) plots for positive data points before learning: most v_f values are guessed incorrectly as \bar{v}_f by the model NN. (c) (x_i, v_i) plot of positive and model-positive data points after learning: most points' κ values are learned correctly as \bar{κ} by the model NN. (d) (x_i, v_f) and (x_i, \bar{v}_f) plots for positive data points after learning: most v_f values are learned correctly as \bar{v}_f by the model NN. (e) As the epoch increases, the F(x) profile approaches F_{2,True}(x).
Figure 12. Learning results from different F_True profiles.
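The data-collection step described above can be sketched in a few lines. The force profile, parameter ranges, boundary position, and step sizes below are illustrative assumptions rather than the values used in the paper; positive events (κ = 0) are those whose trajectory reaches the detection boundary, where v_f is recorded, while trajectories that never arrive are labeled negative (κ = 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Hypothetical stand-in for the true force profile of Eq. (4.5)
    return -(np.tanh(x - 5.0) + np.tanh(x - 15.0) + 3.0) / 3.0

def propagate(x_i, v_i, dt=0.01, n_steps=4000, x_goal=0.0):
    """Euler propagation of Newton's equation.
    Returns (kappa, v_f): kappa = 0 (positive) if the particle reaches
    x_goal, with its final velocity v_f; kappa = 1 (negative) otherwise."""
    x, v = x_i, v_i
    for _ in range(n_steps):
        x, v = x + v * dt, v + f_true(x) * dt
        if x <= x_goal:
            return 0, v        # positive event: v_f is observed
    return 1, None             # negative event: no v_f recorded

# collect the training set {(x_i, v_i, kappa, v_f)}
data = []
for _ in range(200):                # the paper uses 2,000 points in total
    x_i = rng.uniform(10.0, 20.0)   # assumed range of initial positions
    v_i = rng.uniform(-2.0, 2.0)    # assumed range of initial velocities
    kappa, v_f = propagate(x_i, v_i)
    data.append((x_i, v_i, kappa, v_f))
```

With a repulsive barrier added to the assumed force near the boundary, part of the trajectories would be reflected and recorded as negative events, reproducing a mixed data set like Fig. 10.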
The learning process and results are put together in Fig. 11. In Fig. 11(a) and Fig. 11(b), the before-learning training data points are shown together with the model data points. Fig. 11(a) shows the initial kinematic variables (x_i, v_i) of the positive data points (κ = 0) in blue and the model-positive points (κ̄ ≈ 0) in orange, with their intersection marked in green; the green portion shows how well the model matches κ̄ with κ. Note that the intersection of negative and model-negative points (κ = 1 and κ̄ ≈ 1) is omitted for clarity. Fig. 11(b) shows the distribution of the final velocity values from the positive data points (v_f) in blue and those from model propagation (v̄_f) in green; their discrepancy means the model does not yet match v̄_f with v_f correctly. It is clear that the NN model before learning matches neither κ̄ with κ nor v̄_f with v_f.

Fig. 11(c) and Fig. 11(d) show the after-learning plots corresponding to Fig. 11(a) and Fig. 11(b), respectively. From the increased portion of green dots in Fig. 11(c) and the accurate matching between the blue and green dots in Fig. 11(d), it is clear that the NN model after learning matches the outputs correctly. Meanwhile, Fig. 11(e) shows how F(x) is trained over the epochs: the profile of F proceeds from the initial random distribution to the true profile F_True by matching κ̄ with κ and v̄_f with v_f while guided by the regularization error. From these plots, we see that the NN model matched both sets of output variables correctly and accurately discovered F_True.

To further test the capability of the NN to discover force fields, force-field profiles of different shapes are tested with the same scheme. As Fig. 12 shows, the NN discovered the right F_True profiles accurately for both cases at epoch = 500. From this result, we can conclude that the NN built with this methodology is capable of learning complicated force fields of various shapes.
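A minimal version of this training loop can be written without any DL library: represent the unknown F(x) by its values on a grid (the learnable parameters), propagate each data point through the Euler-discretized equation of motion, and descend the gradient of the mismatch between the model prediction v̄_f and the data v_f. Everything below is a simplified sketch under assumptions: the hidden force, the ranges and grid, and the finite-difference gradient (standing in for backpropagation) are illustrative, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# "experiment": a simple hidden force that the model is supposed to discover
def f_hidden(x):
    return 1.0 - 0.3 * x

def propagate(force, x0, v0, dt=0.02, n_steps=50, grid=None):
    """Vectorized Euler propagation for time T = n_steps*dt.
    `force` is either a callable or grid values to interpolate."""
    x, v = x0.copy(), v0.copy()
    for _ in range(n_steps):
        a = force(x) if callable(force) else np.interp(x, grid, force)
        x, v = x + v * dt, v + a * dt
    return v                                    # final velocities

# training data: (x_i, v_i) -> v_f generated with the hidden force
X0 = rng.uniform(0.0, 5.0, 40)
V0 = rng.uniform(0.0, 2.0, 40)
VF = propagate(f_hidden, X0, V0)

# model: learnable force values on a grid, interpolated between nodes
grid = np.linspace(0.0, 8.0, 9)
theta = np.zeros_like(grid)                     # initial guess F(x) = 0

def loss(theta):
    vf_model = propagate(theta, X0, V0, grid=grid)
    return np.mean((vf_model - VF) ** 2)        # mismatch of v_f and model v_f

loss_init = loss(theta)
eps, lr = 1e-4, 1.0
for epoch in range(300):                        # plain gradient descent;
    g = np.zeros_like(theta)                    # central finite differences
    for i in range(len(theta)):                 # stand in for backpropagation
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (loss(tp) - loss(tm)) / (2 * eps)
    theta -= lr * g
loss_final = loss(theta)
```

Only grid points actually visited by trajectories are constrained by the data, which is why a regularization term, as used in the paper, helps keep the learned profile smooth elsewhere.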
For the purpose of testing our method, the two force profiles in Fig. 12 are generated by fitting artificially chosen complicated data; their functional forms are (cos(πx/⋯) − ⋯) + (x − ⋯) and −(tanh(x − 5) + tanh(x − 15) + 3), respectively.

5 Conclusions

In this paper we analysed classical mechanics problems using the DL. The main idea of our problem is explained in Fig. 1: how to find the unknown force, by a deep learning technique, only from the initial and final data sets. When the equation of motion (EOM) of a system is given with a set of initial conditions, calculating the propagation of variables numerically is usually not hard. On the other hand, retrieving the EOM (or the unknown force in the equations) from a given data set can be a very challenging task, especially with limited types and amounts of information (e.g. only initial and final data).

By constructing the NN reflecting the EOM (Fig. 2), together with enough input and output data, we successfully obtained the unknown complicated forces. The learning progress in estimating the unknown forces is shown in Fig. 5(b), Fig. 6, Fig. 11(e), and Fig. 12. They show that our DL method successfully discovers the right force profiles without being stuck at local minima or at one of the multiple mathematically possible solutions.

There are two major advantages of our method. First, the approach with DL can easily find a complicated answer, one that does not allow much intuition to "correctly guess" its right form. Second, contrary to usual NN techniques, our approach trains physical quantities such as the unknown force assigned in the NN, which is important for understanding physics.

Our framework can be generalized in a few directions. First, we can consider many-particle cases and/or higher-dimensional problems. In this case, the number of kinematic variables increases, which means the width of the NN in Fig. 2 increases. Second, we can improve our discretization method (2.2) by adding higher-order corrections or by using the neural ODE technique developed in [10]. Third, we may apply our method to more complicated problems. For example, we may consider scattering experiments by unknown forces, which need not be simple power-law forces or even central forces.
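As a concrete illustration of the second direction, the first-order Euler step underlying the discretization (2.2) can be swapped for a higher-order one. The sketch below is a hypothetical example, not from the paper: it compares the global error of the Euler and classical RK4 steps on the harmonic oscillator x'' = −x, whose exact solution is known.

```python
import numpy as np

def rhs(y):
    # state y = (x, v); harmonic oscillator x'' = -x
    return np.array([y[1], -y[0]])

def euler_step(y, dt):
    # first-order step, analogous to the discretization (2.2)
    return y + dt * rhs(y)

def rk4_step(y, dt):
    # classical fourth-order Runge-Kutta step
    k1 = rhs(y)
    k2 = rhs(y + 0.5 * dt * k1)
    k3 = rhs(y + 0.5 * dt * k2)
    k4 = rhs(y + dt * k3)
    return y + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt, n = 0.01, 628                     # integrate to T = 6.28 (about one period)
y_e = y_r = np.array([1.0, 0.0])      # x(0) = 1, v(0) = 0
for _ in range(n):
    y_e = euler_step(y_e, dt)
    y_r = rk4_step(y_r, dt)

x_exact = np.cos(n * dt)              # exact solution x(t) = cos(t)
err_euler = abs(y_e[0] - x_exact)
err_rk4 = abs(y_r[0] - x_exact)
```

The same replacement can be made layer by layer inside the NN, trading extra operations per layer for a smaller discretization error at a fixed number of layers.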
Last but not least, this work will be pedagogical and heuristic for those who want to apply AdS/DL to the emergence of spacetime as a neural network.

In a broader view, the examples in this paper can enhance the mutual understanding of physics and computational science, in the context of both education and research, by providing an interesting bridge between them.
Acknowledgments
We would like to thank Koji Hashimoto, Akinori Tanaka, Chang-Woo Ji, Hyun-Gyu Kim, and Hyun-Sik Jeong for valuable discussions and comments. This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2017R1A2B4004810) and by the GIST Research Institute (GRI) grant funded by GIST in 2020. M. Song and M. S. H. Oh contributed equally as first authors.
References

[1] K. Hashimoto, S. Sugishita, A. Tanaka and A. Tomiya, Deep learning and the AdS/CFT correspondence, Phys. Rev. D (2018) 046019.
[2] K. Hashimoto, S. Sugishita, A. Tanaka and A. Tomiya, Deep Learning and Holographic QCD, Phys. Rev. D (2018) 106014.
[3] A. Tanaka, A. Tomiya and K. Hashimoto, Deep Learning and Physics, Springer (to appear in February 2021).
[4] J. Zaanen, Y.-W. Sun, Y. Liu and K. Schalm, Holographic Duality in Condensed Matter Physics, Cambridge University Press (2015).
[5] M. Ammon and J. Erdmenger, Gauge/gravity duality: Foundations and applications, Cambridge University Press, Cambridge (2015).
[6] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks (1989) 359.
[7] C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Berlin, Heidelberg (2006).
[8] E. Hairer, S.P. Nørsett and G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems, Springer Series in Computational Mathematics 8, Springer-Verlag Berlin Heidelberg, 2nd ed. (1993).
[9] D.P. Kingma and J. Ba, Adam: A method for stochastic optimization, in International Conference on Learning Representations, Y. Bengio and Y. LeCun, eds., 2015, http://arxiv.org/abs/1412.6980.
[10] K. Hashimoto, H.-Y. Hu and Y.-Z. You, Neural ODE and Holographic QCD.