On Theory-training Neural Networks to Infer the Solution of Highly Coupled Differential Equations
M. Torabi Rad∗, A. Viardin, and M. Apel
Access e.V., Intzestr. 5, 52072 Aachen, Germany
Deep neural networks are transforming fields ranging from computer vision to computational medicine, and we recently extended their application to the field of phase-change heat transfer by introducing theory-trained neural networks (TTNs) for a solidification problem [1]. Here, we present general, in-depth, and empirical insights into theory-training networks for learning the solution of highly coupled differential equations. We analyze the deteriorating effects of an oscillating loss on the ability of a network to satisfy the equations at the training data points, measured by the final training loss, and on the accuracy of the inferred solution. We introduce a theory-training technique that, by leveraging regularization, eliminates those oscillations, decreases the final training loss, and improves the accuracy of the inferred solution, with no additional computational cost. We then present guidelines that allow a systematic search for the network that has the optimal training time and inference accuracy for a given set of equations; following these guidelines can reduce the number of tedious training iterations in that search. Finally, a comparison between theory-training and its rival, the conventional method of solving differential equations using discretization, attests that the advantages of theory-training are not necessarily limited to high-dimensional sets of equations. The comparison also reveals a limitation of the current theory-training framework that may restrict its application in domains where extreme accuracies are necessary.
I. INTRODUCTION
Machine learning is a branch of artificial intelligence that utilizes statistical learning methods to enable computers to learn from data, identify patterns, and make predictions using minimal task-specific programming instructions. In the machine learning domain, neural networks are computing systems that consist of several simple but highly interconnected processing elements called neurons. These neurons map one or multiple inputs to one or multiple outputs. Each neuron has a bias and connection weights, the values of which are determined in a process called training. After a network is trained, it can be used to infer the outputs for yet-unseen inputs.

Deep neural networks are now transforming fields such as speech recognition, computer vision, and computational medicine. We recently extended their application to the field of phase-change heat transfer. In a procedure we termed theory-training, we used a theoretical (i.e., mathematical) model to train neural networks to simulate a solidification problem. In the literature [2–4], similar networks are also referred to as "physics-informed neural networks". We, however, use the term "Theory-Trained Neural Networks" (TTNs) because we believe the latter term correctly emphasizes that the most important part of the process (i.e., training) is performed using a theoretical model, and not, for example, experimental measurements or image data sets, which can also contain information about the underlying physics of a problem. The term "theory-training" also allows one to distinguish these networks from those trained using external data.

The essential elements of theory-training are displayed in Figure 1 but discussed here only briefly; for details, the reader is referred to [1, 2]. The figure shows a TTN with only one hidden layer and two nodes in that layer. Note that the network has eight connection weights and five biases.
The TTN is being theory-trained using a highly coupled set of equations with three unknowns and two differential equations coupled through an algebraic relation. These equations will be discussed further in Section II. In the schematic, each solid arrow (red/black) is associated with a trainable network parameter (connection weights/node biases), and the dashed arrows show the flow of data between the different elements. The input and outputs of the network are an array representing the time at the training data points and the model unknowns, respectively. The derivatives of the outputs with respect to time are calculated in a process called automatic differentiation. The outputs and their time derivatives are used to calculate the training loss, which is equal to the sum of the imbalances in the model equations and the initial conditions of the problem. If the loss is less than a small, pre-determined value, then the training ends, and the TTN can be used to infer the solution on new inputs. Otherwise, the trainable network parameters θ need to be updated, using an optimizer, such that the value of the loss is reduced.

One of the main advantages of TTNs is that they do not need any prior knowledge of the solution of the governing equations or any external data for training. They essentially self-train by relying on the ability of a neural network to learn the solution of partial differential equations (PDEs).

∗ Corresponding author. Email addresses: [email protected]; [email protected]

FIG. 1: A schematic showing the essential elements of theory-training for a network with one hidden layer and two nodes in that layer. The input to the network is an array containing the times at the training points. The solid arrows (red/black) are associated with trainable network parameters (weights/biases), and the dashed arrows show the flow of data between different steps in the process.
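The training loop just described can be illustrated with a minimal, self-contained sketch. The code below is not the authors' implementation: it uses a toy network (one hidden layer with two tanh nodes, as in the schematic) and a toy equation du/dt = −u with u(0) = 1 in place of the solidification model, a hand-coded chain-rule derivative in place of TensorFlow's automatic differentiation, and plain finite-difference gradient descent in place of the Adam/SLSQP optimizers; the learning rate and epoch count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(p, t):
    """Tiny network as in Fig. 1: input t, one hidden layer with two tanh nodes.
    Returns the output u(t) and its derivative du/dt, the latter computed
    analytically (a stand-in for automatic differentiation)."""
    w1, b1, w2, b2 = p[0:2], p[2:4], p[4:6], p[6]
    h = np.tanh(np.outer(t, w1) + b1)        # hidden activations, shape (N, 2)
    u = h @ w2 + b2                          # network output, shape (N,)
    du = ((1.0 - h**2) * w1) @ w2            # chain rule: du/dt
    return u, du

def training_loss(p, t):
    """Sum of the equation imbalance and the initial-condition imbalance
    for the toy ODE du/dt = -u, u(0) = 1."""
    u, du = forward(p, t)
    residual = du + u                        # imbalance of the equation
    u0, _ = forward(p, np.array([0.0]))
    return np.mean(residual**2) + (u0[0] - 1.0)**2

t_train = np.linspace(0.0, 1.0, 50)          # times at the training points
p = rng.normal(0.0, 0.5, size=7)             # 6 weights + ... (7 trainables here)

loss_init = training_loss(p, t_train)
for epoch in range(2000):                    # crude gradient-descent "training"
    eps, g = 1e-6, np.zeros_like(p)
    base = training_loss(p, t_train)
    for i in range(p.size):                  # finite-difference gradient
        q = p.copy(); q[i] += eps
        g[i] = (training_loss(q, t_train) - base) / eps
    p -= 0.05 * g

loss_final = training_loss(p, t_train)       # inference: forward(p, t_new)
```

After training, the network can be evaluated at any new time without re-solving the equation, which is the "inference" step emphasized in the text.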
In the literature, that ability is sometimes referred to by the term "solving PDEs"; we, however, prefer and use the term "inferring (or learning) the solution of PDEs" instead. The reason is that, after successful training, TTNs have the power to predict the solution of a PDE at unseen data points without actually solving the PDE, and the term "solving PDEs" simply neglects that powerful capability.

Theory-training is a new and active research topic and, following the pioneering works of Raissi et al. [2–4], has so far been successfully used for learning the solution of PDEs in fields such as solid mechanics [5, 6], biomechanics [7], and materials science [1]. Despite those successes, the topic is still in its infancy, and theory-training still faces numerous open questions. Some of these questions arise from the yet-incomplete understanding of different aspects of neural networks in the broader Deep Learning (DL) literature. For example, the dependence of the training or inference performance of a neural network on the size of the network or of the training data is still under debate [8]. As another example, while some studies suggest that deep nets are preferred over shallow ones in computer vision or speech recognition tasks, there is evidence that even single-layer fully connected feedforward nets can have a performance similar to deep nets (see [9] and references therein).

Another class of open questions arises because some topics that are studied extensively in the broader DL literature have not yet gained attention in the context of theory-training. For example, regularization has such a central role in ML research that it is rivaled in importance only by optimization [10]. Training techniques such as L2 weight regularization are commonly used in computer vision or natural language processing tasks. For TTNs, however, as correctly pointed out by Raissi [11], the effect of regularization, especially on inference accuracy, is not clear.
The question becomes central when comparing theory-training with a conventional and rival method: solving differential equations by discretizing the derivatives. Making any judicious statement favoring either method requires one to first achieve the optimal accuracy (for a given set of parameters) with both methods and only then make the comparison. For example, comparing the predictions of a TTN that is not trained to its full learning capacity with tightly converged finite-difference results is of no relevance. In a method that uses discretization, the predictions can be made as accurate as desired by simply refining the discretization steps. In theory-training, however, it is entirely unclear how the accuracy of the predictions changes with the training (hyper-)parameters.
II. EQUATIONS

A. A highly coupled set of equations
The set of equations used here for theory-training represents a model for the solidification of a binary alloy under an external, uniform, and constant heat extraction. The set consists of equations for the conservation of energy and solute in the liquid phase and a thermodynamic constitutive relation representing the liquidus line of the phase diagram. The only difference between this set and the one used in our previous study [1] is in the energy equation. Here, that equation does not have any spatial derivatives and, instead, includes a constant source term representing the heat extraction. It reads

∂T∗/∂t∗ = (1/S) ∂φs/∂t∗ + q̇∗   (1)

where T∗, t∗, S, and q̇∗ represent the scaled temperature and time, the Stefan number, and the scaled heat extraction rate, respectively. The equations for the scaled liquid concentration C∗l and the solid fraction φs read

[k φs + (1 − φs)] ∂C∗l/∂t∗ = (1 − k)(1 + C∗l) ∂φs/∂t∗   (2)

and

T∗ > 0 → φs = 0
−1 < T∗ ≤ 0 → T∗ + C∗l = 0   (3)
T∗ ≤ −1 → φs = 1,

respectively, where k is the partition coefficient. The initial conditions read

T∗(t∗ = 0) = 0;  C∗l(t∗ = 0) = 0   (4)

Solving equations (1) to (4) using even a conventional method that discretizes the equations is not a trivial task, for the following two reasons. First, one of the unknowns (i.e., φs) does not have an explicit relation to be calculated from; therefore, a numerical scheme for updating it is required [18]. Second, due to the low values of the Stefan number S (see the first term on the right-hand side of equation (1)), there is a strong nonlinear coupling between the equations. Despite these difficulties, we have shown that theory-training can infer the solution even in the presence of spatial derivatives in the equations [1]. Here, however, those derivatives are disregarded, for the following two reasons. First, as already discussed in the introduction, the main objective of the present paper is to give in-depth insights into theory-training highly coupled differential equations.
That objective can be perfectly accomplished without making the parts of the study that do not serve it unnecessarily complex. Second, in the absence of spatial derivatives, we were able to derive the following analytical solution for equation (1):

T∗ex = { S q̇∗(1 − k) t∗ + S(1 − k) + 1 − √( [S q̇∗(1 − k) t∗ − S(1 − k) + 1]² + 4S(1 − k) ) } / [2S(1 − k)]   (5)

The exact analytical solutions for C∗l and φs were outlined in our previous work and are not repeated here. Having an exact solution was critical in the present study because it allowed us to quantify the L2 and H1 prediction errors, which are defined as [12]

||u − u_ex||_L2 = [ ∫ (u − u_ex)² dt ]^(1/2)   (6)

and

||u − u_ex||_H1 = [ ∫ (u − u_ex)² dt + ∫ (u̇ − u̇_ex)² dt ]^(1/2)   (7)

respectively, where u can be either T∗, C∗l, or φs.

B. Standard and regularized theory-training loss
The training loss is calculated by first calculating the errors in satisfying equations (1) to (3) at the different training data points as

E^i_T = ∂T∗,i/∂t∗ − (1/S) ∂φ^i_s/∂t∗ − q̇∗   (8)

E^i_Cl = [k φ^i_s + (1 − φ^i_s)] ∂C∗,i_l/∂t∗ − (1 − k)(1 + C∗,i_l) ∂φ^i_s/∂t∗   (9)

E^i_liq = φ^i_s for T∗,i > 0;  E^i_liq = T∗,i + C∗,i_l for −1 < T∗,i ≤ 0;  E^i_liq = φ^i_s − 1 for T∗,i ≤ −1   (10)

and the errors in satisfying the initial conditions as

E_ic,T = T∗(t∗ = 0);  E_ic,Cl = C∗l(t∗ = 0)   (11)

Note that because the theoretical model here consists of ODEs, it suffices to use only one data point to enforce the initial conditions. Therefore, in equation (11), unlike in equations (8) to (10), the error terms do not contain superscripts. We distinguish two different losses. The first one is the standard training loss L_S, which is defined as the sum of the errors associated with equations (8) to (11) across all the training data points. It is calculated from

L_S = (1/N) Σ_{i=1..N} [ (E^i_T)² + (E^i_Cl)² + (E^i_liq)² ] + [ (E_ic,T)² + (E_ic,Cl)² ]   (12)

where N is the number of training data points that cover the interval 0 < t∗ ≤
1. It is this loss that was minimized in our previous study [1] and in Raissi et al. [2–4, 11]. The regularized theory-training loss L_R is defined as

L_R = L_S + γ Σ_{l=1..n_h} ||W_l||²   (13)

The second term on the right-hand side represents the sum of the squared L2 norms of the weight matrices of the n_h hidden layers. In that term, γ and n_h are the hyper-parameters controlling the regularization strength and the number of hidden layers, respectively. Let us now discuss a subtle point about regularization (i.e., having a non-zero γ in equation (13)) that becomes highly relevant in the theory-training context. In the deep learning of computer vision tasks, for example, the training loss has no physical meaning. It is only a function used to fit the training data. In such tasks, following too closely the patterns specific to a training dataset will degrade the network's inference performance. This is a problem known as overfitting, and it is commonly avoided using different regularization methods. Regularization, essentially, prevents a network from reducing the loss to the lowest values it potentially can (without regularization). Unlike in computer vision tasks, in theory-training the loss has a clear physical meaning. It represents, for example, the imbalance in the conservation of energy or matter. Imagine theory-training a single equation with a network that is large enough to make the approximation errors negligible [13], and assume that the training and "inference" points are the same, such that there are no generalization errors. In this scenario, the L2 error of the "inferred" solution will be proportional to the standard loss L_S. Therefore, preventing the network from reducing L_S to the lowest values it potentially can will have a negative impact on the inference accuracy, which can be viewed as a side-effect of regularization.
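As a concrete illustration, the residuals of equations (8) to (10), the initial-condition errors of equation (11), and the standard loss of equation (12) can be assembled as follows. This is a sketch rather than the authors' code: in the actual framework the fields and their time derivatives come from the network outputs and automatic differentiation, whereas here they are simply passed in as arrays, and the function name is illustrative.

```python
import numpy as np

def standard_loss(T, dT, Cl, dCl, phi, dphi, T0, Cl0, S, k, qdot):
    """Standard theory-training loss L_S of equation (12).

    T, Cl, phi: scaled temperature, liquid concentration, and solid fraction
    at the N interior training points; dT, dCl, dphi: their time derivatives;
    T0, Cl0: the network outputs at t* = 0; S, k, qdot: model parameters.
    """
    E_T = dT - dphi / S - qdot                       # energy residual, eq. (8)
    E_Cl = (k * phi + (1.0 - phi)) * dCl \
        - (1.0 - k) * (1.0 + Cl) * dphi              # solute residual, eq. (9)
    E_liq = np.select(                               # liquidus residual, eq. (10)
        [T > 0.0, (T > -1.0) & (T <= 0.0)],
        [phi, T + Cl],
        default=phi - 1.0)
    mean_sq = np.mean(E_T**2 + E_Cl**2 + E_liq**2)   # interior points, eq. (12)
    return mean_sq + T0**2 + Cl0**2                  # initial conditions, eq. (11)

# Sanity check: before solidification starts (phi_s = 0, C*_l = 0), equation (1)
# gives T* = qdot * t*, which must produce a zero loss while T* > 0.
t = np.linspace(0.02, 1.0, 50)
qdot = 1.0                                           # sign chosen here so T* > 0
L = standard_loss(T=qdot * t, dT=np.full_like(t, qdot),
                  Cl=np.zeros_like(t), dCl=np.zeros_like(t),
                  phi=np.zeros_like(t), dphi=np.zeros_like(t),
                  T0=0.0, Cl0=0.0, S=0.1, k=0.1, qdot=qdot)
```

A trained network should drive this quantity toward zero at the training points; the sanity check simply confirms that an exact pre-solidification solution yields L_S = 0.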
This side-effect, along with other potential side-effects discussed in [14, 15], is, perhaps, the reason why TTNs in the literature are usually not regularized.

To use regularization without the abovementioned side-effects, we propose to perform theory-training in two parts: in the first part, the network is trained by minimizing L_R and, in the second part, it is trained by minimizing L_S. This technique can be easily implemented by calculating γ in equation (13) from

γ = γ₀ for n < n_s;  γ = 0 for n ≥ n_s   (14)

where n is the training epoch, n_s is the switchover epoch, and γ₀ is the regularization strength used in the first part of training.

III. RESULTS
All the networks theory-trained in the present study have hyperbolic tangent activation functions and were trained using the Adaptive Moment (Adam) optimizer [16] in the first part of training (i.e., n < n_s) and the Sequential Least SQuares Programming (SLSQP) optimizer for the remaining epochs until convergence. Both of these optimizers are available in TensorFlow. Unless otherwise mentioned, the values of the switchover epoch n_s and the number of internal training data points were 15,000 and 200, respectively. For all the hyper-parameters, except the initial learning rate λ of the Adam optimizer, the TensorFlow default values were used; the value of λ, as we pointed out in our previous study, is critical and is determined from a modified version of the relation we proposed in [1], which reads as follows. The modification is to ensure that, for networks with DW < 10, λ will not exceed 0.01.

λ = 0.01 × min[ 1, (DW/10)^(−1/2) ]   (15)

The results are organized as follows. In Section III A, the effects of partial regularization on the training loss and inference accuracy are investigated. The results of that section are from the base TTN. In Section III B, we discuss how the accuracy of the solution inferred by theory-training can be improved by moving from the base TTN to the one that has the optimal inference accuracy. To better understand the reasons underlying the observations of Section III B, in Section III C, we analyze the statistics of the parameters learned by TTNs with different depths and widths. Using the findings of Section III B, we present, in Section III D, theory-training guidelines. Finally, in Section III E, we compare theory-training with a rival, conventional method that discretizes the derivatives.

A. Partial Regularization Theory-training Technique (PRT)
Figure 2 shows the variations with epochs of the total training loss (the black curve) and of the losses associated with equations (8), (9), and (10) (the green, blue, and red curves, respectively) for the base TTN, in the absence of any regularization (i.e., γ = 0 in equation (14)). Again, the base TTN is the one with the lowest number of hidden layers and nodes per hidden layer that can reduce the loss to reasonably low values. For the current set of equations, those values were found to be D = 1 and W = 2 (displayed schematically in Figure 1). The losses associated with the initial conditions (i.e., equation (11)) are not displayed because they were much lower than the displayed losses. The plot on the left shows the changes during both the first and second parts of training, while the plot on the right is a close-up showing only the second part. The vertical dashed line represents the switchover epoch, at which γ was set from γ₀ to zero and the optimizer was switched from Adam to SLSQP.

From the figure, it can be seen that the main contributions to the overall loss are from those equations (within the set) that contain time derivatives (i.e., equations (8) and (9)). Following the black curve in the plot, one can note how the Adam optimizer initially decreases the total loss monotonically and relatively quickly. At around epoch 3,000 (highlighted by the downward-inclined arrow), however, the loss starts to oscillate, and the rate of decrease in the loss suddenly drops. As the training continues, the oscillations become progressively stronger. These oscillations, which can deteriorate the predictions of a TTN [1], are commonly observed with the Adam optimizer [16]. They can be eliminated using a lower initial learning rate, but that will slow down the convergence (because it will increase the number of theory-training epochs required to reach a converged loss). In connection with the discussion of the next figure, we will show how PRT can eliminate them without slowing the convergence.
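A minimal sketch of how the PRT schedule can be implemented is shown below. The function names, the weight-matrix shapes, and the γ₀ value are illustrative; the schedule itself is exactly equation (14): the L2 weight penalty of equation (13) is active only during the first part of training and is dropped at the switchover epoch n_s.

```python
import numpy as np

def prt_gamma(epoch, n_s, gamma0):
    """Regularization strength of equation (14): gamma0 before the
    switchover epoch n_s, zero afterwards."""
    return gamma0 if epoch < n_s else 0.0

def regularized_loss(L_S, weight_matrices, epoch, n_s, gamma0):
    """L_R of equation (13); reduces to the standard loss L_S once
    the second part of training begins."""
    penalty = sum(np.sum(W**2) for W in weight_matrices)  # squared L2 norms
    return L_S + prt_gamma(epoch, n_s, gamma0) * penalty

# Example: the base TTN of Fig. 1 has weight matrices of shapes (1, 2) and
# (2, 3), i.e., eight connection weights in total.
Ws = [np.ones((1, 2)), np.ones((2, 3))]
early = regularized_loss(0.5, Ws, epoch=100, n_s=15000, gamma0=1e-4)    # L_R
late = regularized_loss(0.5, Ws, epoch=15000, n_s=15000, gamma0=1e-4)   # L_S
```

The point of the construction is that no extra computation is added: the same loss routine is called every epoch, and only the value of γ changes at the switchover.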
Despite the oscillations, the average value of the loss keeps decreasing, and, at the switchover epoch (represented by the dashed vertical line), the loss has already been reduced substantially. After that, the second optimizer takes over (see the right plot in the figure) and, in fewer than two hundred epochs, reduces the loss further, before the loss converges and training ends. Again, these results were obtained in the absence of any regularization.

FIG. 2: The total loss (black) and the losses associated with the different equations of the model (green, blue, and red) as a function of the training epoch, in the absence of any regularization (i.e., γ = 0 in equation (14)), during the first and second parts of training (left) and a close-up showing the second part only (right). The dashed vertical line represents the switchover epoch. The displayed results are for the base TTN.

FIG. 3: Results similar to those displayed in Figure 2 but here with partial regularization (i.e., a positive γ in equation (14)). Note how partial regularization has allowed the network to reach an oscillation-free plateau at the end of the first part of training and how the second optimizer has become more effective in reducing the loss.

To investigate the effect of PRT on the training performance, we re-trained the base TTN (i.e., the network schematically displayed in Figure 1) by increasing γ in equation (14) from zero to a small positive value, while keeping all the other hyper-parameters the same. The training curves for this new experiment are displayed in Figure 3. By investigating the black curve in the figure and comparing it with the black curve in the left plot of Figure 2, it can be seen that PRT causes the total loss to plateau at an epoch that is, interestingly, almost the same as the epoch at which the oscillations in the left plot of the previous figure started. Due to the early plateauing, the loss at the switchover epoch is about ten times higher than it was without PRT (see Figure 2). In other words, the loss oscillations have been eliminated, but at the cost of a higher loss at the switchover epoch. Nonetheless, the final loss is about ten times smaller than it was without PRT (as displayed in the second plot of Figure 2). This suggests that oscillations in the first part of training degrade the effectiveness of the second optimizer in reducing the loss. That optimizer performs better when it "receives" an oscillation-free loss, and PRT helps to achieve that plateau. A formal analysis explaining the reason for this observation is beyond the current understanding of how different optimizers navigate the complex and highly non-convex loss landscape of a neural network. Nonetheless, the fact that, in the right plot of Figure 3, the second optimizer performs about 1,500 epochs before converging, while in the right plot of Figure 2 it performs fewer than a hundred epochs, may be viewed as a sign that, in the latter plot, the optimizer is "stuck" in some relatively low-quality minima.

To provide further evidence that the increase in the training performance with PRT is actually due to reaching the oscillation-free plateau at the end of the first part of training, we performed two additional training experiments, both without PRT. In the first of these experiments, all the training hyper-parameters were kept the same as those in Figure 2 except the switchover epoch n_s, which was reduced from 15,000 to 5,000. The latter epoch is slightly after the epoch at which the oscillations in Figure 2 started. The motivation behind decreasing n_s was to switch over to the second optimizer before the oscillations become significant.

FIG. 4: Results similar to those displayed in Figure 2 but here with an earlier switchover to the second part of training (i.e., n_s = 5,000 instead of 15,000).
In the second additional experiment, we increased the learning rate (from 0.01 to 0.05) while keeping all the other training hyper-parameters the same as in the first additional experiment (so that the oscillations become significant even at n_s = 5,000). The training curves for these two experiments are displayed in Figure 4 and Figure 5, respectively. By comparing the final values of the loss, it can be seen that, again, the final value of the loss is lower (i.e., the training performance is higher) when an oscillation-free plateau is reached at the end of the first part of training. Therefore, these two additional experiments further support the conclusion already drawn in connection with the discussion of Figure 2 and Figure 3: the second optimizer performs better when it "receives" an oscillation-free loss, and that plateau can be reached, without slowing down the convergence, using PRT.

Figure 6 shows the L2 and H1 error norms (the left and right bar plots, respectively) of the different predicted variables (see the horizontal axis) obtained from the base network when it was trained without (grey) and with (black) PRT. It is evident that PRT reduces both error norms for all the variables by about a factor of two. This improvement in accuracy is attributed to the improved training performance analyzed in connection with Figure 3. Note that the improvement is obtained with no additional training cost, by simply selecting a positive γ in equation (14). It can also be seen that the errors in the different variables of the model have similar values. This similarity is attributed to the scaling of the equations, which has resulted in variables that are all of order unity.

The results presented so far were all obtained from the base TTN, which, again, is the network with the lowest number of hidden layers and nodes per hidden layer that can reduce the loss to reasonably low values (i.e., satisfy the governing equations at the training points reasonably well). For the present set of equations, those values were one and two, respectively. This indicates that the solution of a model even as highly coupled as the one studied here can be learned by theory-training a network with only one hidden layer and two nodes in that layer. It is expected that, for a set of equations even more complex than the ones theory-trained here, those minimum values will be higher. Nonetheless, identifying the base TTN (i.e., finding those minimum values) should be the starting point of any theory-training practice.

FIG. 5: Results similar to those displayed in Figure 4 but here with a higher initial learning rate (i.e., λ = 0.05 instead of 0.01).

FIG. 6: A comparison between the L2 (left) and H1 (right) prediction errors of the different variables (horizontal axis) predicted in the absence (blue) and presence (black) of PRT.

B. From the base TTN to the optimal TTN to improve inference accuracy
Once the base TTN is identified, the next natural question is whether increasing its depth and/or width will increase the inference accuracy. To address this question, we performed theory-training experiments on more than seventy TTNs that were deeper and/or wider than the base TTN. These TTNs are represented by the different markers in Figure 7. The base TTN is also included in the plot (the purple marker). The blue markers represent networks that are deeper but not wider than the base TTN. The black markers represent networks that are wider but not deeper than the base. Networks that are both deeper and wider than the base TTN are represented by the green (when D < W) and red (when D > W) markers, respectively. The number of trainable parameters in each network, n_p, is equal to the sum of the number of connection weights, W + (D − 1)W² + 3W, and the number of biases, DW + 3. Each TTN was trained using PRT and with two different numbers of training points N_f (one of them the default N_f = 200). The switchover epoch was 15,000. The experiments were performed on a PC with an Intel Core i5-7500 CPU @ 3.40GHz and 16.0 GB of memory. The results of these experiments are summarized in Figure 8 and are discussed next.

FIG. 7: Depth and width of the different networks theory-trained in the present study. The area of a marker is linearly proportional to the number of parameters in the corresponding TTN.

Figure 8 displays the final training loss (left) and the maximum inference error (right) of the TTNs that are deeper and/or wider than the base TTN (see the figure caption for the description of the different colors and solid/empty markers). The left plot in the figure is analyzed first. It can be seen that, for TTNs that are only wider than the base TTN (the black markers), as n_p increases, the final training loss decreases initially (i.e., for n_p < 30) until it reaches a minimum value (represented by the dashed horizontal line). After that, further increasing n_p causes the loss only to oscillate around that minimum. The initial decrease can be seen as intuitive because increasing the size of a network should increase its ability to fit the training data and should, therefore, reduce the training loss. Focusing on the blue markers in the plot shows that, for TTNs that are only deeper than the base TTN, regardless of the value of n_p, there is no clear relationship between n_p and the final training loss. Interestingly, in these networks, even a small increase in the size of the network (i.e., adding a single layer) can drastically (by two orders of magnitude) change (decrease or increase) the final training loss. Also note that the training losses of most of the TTNs that are both wider and deeper than the base TTN (red and green markers) are similar to the loss of a net that is only wider or deeper than the base. This suggests that, concerning the final training loss, networks that are deeper and wider than the base TTN have no consistent advantage over networks that are only wider or deeper than that TTN.

The right plot in Figure 8 displays the maximum of the L2 errors of the different variables (T∗, φs, and C∗l) as a function of the final training loss. The maximum H1 errors are not displayed because their variations with the training loss were similar to those of the L2 errors. In the plot, the size of a marker is linearly proportional to the size of the TTN the marker represents (in other words, the bigger a marker is, the larger the TTN was). It is evident that larger TTNs (i.e., larger markers) do not necessarily have lower errors, and the error of any network that is both deeper and wider than the base TTN (green/red markers) is similar to the error of a network that is only deeper or wider than that network.
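The parameter count n_p quoted above can be written as a small helper (a sketch; the input and output sizes are fixed by the model: one input, the time, and three outputs, T∗, C∗l, and φs). For the base TTN (D = 1, W = 2) it reproduces the eight connection weights and five biases noted in connection with Figure 1.

```python
def n_params(D, W, n_in=1, n_out=3):
    """Number of trainable parameters of a fully connected TTN with D hidden
    layers of width W: weights W + (D-1)W^2 + 3W plus biases DW + 3
    (for n_in = 1 input and n_out = 3 outputs)."""
    n_weights = n_in * W + (D - 1) * W**2 + n_out * W
    n_biases = D * W + n_out
    return n_weights + n_biases

# Base TTN of Fig. 1 (D=1, W=2): 8 weights + 5 biases = 13 trainable parameters
```

For example, n_params(8, 12) gives 1,155, roughly a hundred times the 13 parameters of the base TTN, consistent with the comparison made in the text.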
For example, the network with D = 8 and W = 12 (the biggest marker in the plot, located almost at the center) has an error close to that of the base TTN despite having about a hundred times more parameters. This suggests that, concerning the inference performance also, networks that are deeper and wider than the base TTN have no consistent advantage over networks that are only wider or deeper than that TTN. It can also be seen that networks that are deeper but not wider than the base TTN (blue markers) show two properties that make them somewhat distinct from, and superior to, all the other nets, which are wider than the base TTN (black, green, and red markers). The former nets have the lowest prediction error for a given training loss, and their maximum error correlates almost linearly with the loss. The reason the latter property is advantageous is discussed next. First, note that, unlike the training loss, the inference errors are, in general, entirely unknown to a theory-trainer (unless the equations admit an exact solution, which is rarely the case). This makes it impossible to determine how to change D or W such that the error decreases, unless the loss and errors are correlated monotonically.

FIG. 8: Training loss (left) and maximum inference error (right) of the base TTN and TTNs that are deeper and/or wider than the base (see the label). In the right plot, the size of a marker is linearly proportional to the size of the TTN the marker represents, showing that larger nets do not necessarily give more accurate predictions. Note that nets that are deeper but not wider than the base (blue markers) have the lowest error for a given loss and have a nearly linear relationship between loss and error. In the right plot, the markers distinguished by a cross represent the optimal TTN.
In the presence of such a correlation, a theory-trainer can simply monitor the effect of a change in D or W on the training loss, and if the loss decreases due to that change, the error can also be expected to decrease. Our results suggest that networks that are deeper but not wider than the base enjoy such a correlation. Among these TTNs, the one that has the lowest loss also has (almost) the lowest error; it is termed the optimal TTN and is distinguished by the cross in the right plot of the figure. Note that the optimal TTN has an inference accuracy that is about ten times better than that of the base TTN.

C. Statistics of the learned parameters
The results presented in Section III A showed that the loss decreases much further after the switchover epoch. Obviously, the values of the network parameters (weights and biases) have a critical role in determining its outputs and, consequently, the loss. Therefore, the difference between the loss at the switchover and final epochs can be due to the changes in the parameters between those epochs. In addition, the results presented in Section III B showed that networks that are only deeper than the base TTN have an advantage over the ones that are wider than the base TTN. This distinction between the former and latter networks may also be linked to possible differences between the parameters that those two types of networks learn. To investigate these issues, in this section, we analyze the summary statistics and the distributions of the learned parameters.

In Figure 9, the normalized (with the number of weights) distributions of the weights at the switchover and final epochs are compared. The left and right panels make the comparison for a network that is wider and deeper than the base, respectively. It can be seen that, for both networks, at the final epoch, the weights have taken values that are less similar to each other than they were at the switchover epoch: the inter-variability of the weights within a network has increased after the switchover epoch. In the left plot, that increase is the result of the post-switchover decrease in the clustering around the mean (for example, at the switchover epoch, more than thirty-five percent of the weights are essentially zero). In the right plot, however, the increase in the inter-variability is the result of the post-switchover learning of a few weights with entirely different values, thereby increasing the weight range. We observed the increase in weight inter-variability for other networks as well (nets with D = 1 and W = 3, D = 2 and W = 2, D = 1 and W = 7, D = 5 and W = 2, D = 1 and W = 31, and D = 25 and W = 2), but those results are not displayed here for brevity. Therefore, we believe that the increase in the inter-variability of the weights is the reason for the order-of-magnitude further decrease in the loss after the switchover observed in Section III A.

FIG. 9: Comparison between the normalized distributions of the weights at the switchover (red) and final (blue) epochs for a network that is only wider (left) and only deeper (right) than the base TTN.

FIG. 10: Standard deviation (left) and absolute mean (right) of the final weights as a function of the number of trainable parameters n_p for networks resulting from going deeper and/or wider than the base TTN (see the label). Note the separation of the blue markers from the black ones.

The left and right panels in Figure 10 show, respectively, the standard deviation and the absolute mean of the final weights of the different TTNs (markers) as a function of n_p. In the plots, the area of a marker is selected to be linearly proportional to the maximum prediction error of the corresponding TTN (i.e., the smaller a marker is, the lower the prediction error was) to show that there is no clear relationship between the size and the prediction error. It is readily evident that the blue markers are entirely separate from the black ones. The distinction between the blue and black markers was not observed in a similar plot (not included for brevity) for the biases. This indicates that the only major difference between
FIG. 11: Epoch-evolving histograms of weights learned in different layers for the TTN with D = 8 and W = 12. the parameters of nets that are only deeper than the base TTN with those that are wider is the higher inter-variabilityof the weights in the former networks. It is also interesting to note that among the networks that are both deeperand wider than the base, those with D ¿ W (i.e., more deep than wide) have a higher standard deviation of weights.This suggests that among TTNs of similar size, those that are deeper are able to learn weights with relatively higherintervariability.Figure 11 and Figure 12 display, respectively, the variations of the weight distributions with training epochs forthe TTN with D = 8 and W = 12 and the TTN with D = 8 and W = 4. These TTNs are selected for display asrepresentatives of TTNs with W > D (i.e., black and green markers in Figure 7) and ones with
W < D (i.e., blueand red markers in Figure 7). In Figure 10, these two TTNs are distinguished by a cross in the left plot. Each graphwithin these figures consists of slices that are stacked in the direction that appears perpendicular to the plane of view.Each slice represents a histogram where the horizontal axis shows the binned values, and the height is proportional tothe percentage of the weights with a value within the bin. Slices that belong to earlier epochs are further ”back” anddarker, while the ones that belong to later epochs are close to the foreground and lighter in color. Figure 11 showsthat the weights of the different hidden layers evolve such that the final weights become more or less the same and allclustered around the mean zero. Such behavior is not observed in Figure 12. This suggests that in a network that iswide enough, adding more hidden layers simply forces the network to learn the same weight values again and again,rather than having it learn new weights.
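The per-layer weight distributions analyzed above can be computed for any network by binning each layer's weights and normalizing by that layer's weight count. A minimal sketch of that bookkeeping follows; the random matrices are hypothetical stand-ins for weight snapshots that would be taken during an actual training run:

```python
import numpy as np

def layer_histograms(layer_weights, bins=20, value_range=(-3.0, 3.0)):
    """For each layer, return the percentage of weights falling in each bin."""
    hists = []
    for w in layer_weights:
        counts, edges = np.histogram(w.ravel(), bins=bins, range=value_range)
        hists.append(100.0 * counts / w.size)  # normalize by the number of weights
    return hists, edges

# Hypothetical snapshot for a network with D = 8 hidden layers of width W = 12.
rng = np.random.default_rng(0)
snapshot = [rng.normal(0.0, 0.5, size=(12, 12)) for _ in range(8)]
hists, edges = layer_histograms(snapshot)
```

Stacking the returned histograms for snapshots taken at successive epochs gives the slice-per-epoch view described above, and taking `np.std` over the concatenated final weights gives the inter-variability measure plotted against n_p.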
D. Theory-training guidelines
FIG. 12: Epoch-evolving histograms of the weights learned in different layers for the TTN with D = 8 and W = 4.

Based on the empirical observations analyzed in Section III B, we present the following theory-training guidelines. Starting from a small network, find the base TTN by gradually increasing the depth and width such that the final training loss becomes reasonably low. For the set of equations we theory-trained here, the base TTN had only one hidden layer and two nodes in that layer. If the complexity of the equations is increased even further, that TTN can be expected to have a larger size; nonetheless, it can still be easily identified because any TTN smaller than the base will have a much higher training loss. Train the base TTN (and all the others that will follow) in two parts, where the loss is regularized only in the first part (see Section III A for details). Then, train networks that have more hidden layers than the base TTN. In increasing the number of hidden layers, note that adding a layer can drastically change (both decrease and increase) the training loss. Among the TTNs that are deeper but not wider than the base, the one that has the lowest loss (i.e., the optimal TTN) can be expected to have the most accurate predictions and should be used for inference. The reason for prioritizing adding layers over adding more neurons per existing layer (i.e., going deeper over going wider) is not that the former is more likely to reduce the errors. It is only because networks that are only deeper than the base have a monotonic correlation between the loss and the inference error. Therefore, for these networks, decreasing the training loss, which is a quantity that is always available to the practitioner, can be expected to decrease the inference errors; those errors are generally not available (unless the equations admit an exact solution, which is rarely the case).
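The two-part training prescribed above (regularize first, then drop the regularization once the loss plateaus) can be made concrete on a toy problem. The least-squares objective and all numbers below are hypothetical stand-ins for an actual theory-training loss, chosen only to illustrate the switchover mechanics:

```python
import numpy as np

# Toy stand-in for a theory-training loss: a least-squares objective
# minimized by plain gradient descent, with an L2 penalty that is
# active only during the first part of training (before the switchover).
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)

def gradient(w, lam):
    # d/dw [ mean((A w - b)^2) + lam * ||w||^2 ]
    return 2.0 * A.T @ (A @ w - b) / len(b) + 2.0 * lam * w

w = np.zeros(5)
for epoch in range(2000):
    lam = 1e-2 if epoch < 1000 else 0.0  # switchover: drop the penalty
    w -= 1e-2 * gradient(w, lam)

final_loss = float(np.mean((A @ w - b) ** 2))
```

In an actual TTN, the same switch is a one-line change: rebuild the loss without its regularization term at the switchover epoch and continue training from the current parameters.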
These guidelines are intended to ensure that the TTN that is ultimately used for inference is trained to its full realizable learning capacity, provides the most accurate predictions among TTNs that have different sizes, and does not have additional, unnecessary neurons.

Before proceeding, we need to mention that the above guidelines are based only on the theory-training experiments we performed here. Therefore, a necessary future study is to test their validity under different training conditions, for example, where a different set of equations is theory-trained. Note that such a set will still need to admit an exact solution so that the inference errors can be quantified. If the new set of equations is more complex, finding the base TTN may require more trial and error (because one will have to increase both D and W). However, after that TTN is identified, we expect the guidelines to be reliable, mainly because all the individual equations are lumped into a single scalar loss, and that loss is the only interface between the network and the equations. In addition, in the next section, we will show that the 10X increase in the inference accuracy obtained by switching from the base to the optimal TTN (see the last paragraph in Section III B) does not depend on N_f. This provides some confidence that, under training conditions different from the ones explored here, the guidelines will still be reliable, if not entirely accurate. For example, there may be scenarios where, among the TTNs that are deeper but not wider than the base, the one that has the lowest training loss does not have exactly the best inference accuracy. Clarifying these issues obviously requires further research. Nonetheless, the value of the present guidelines, and the critical role they can play in making any judicious statement concerning the comparison between theory-training and rival methods of solving differential equations, is further highlighted in the next section.

E. Comparison with a conventional method
It is known that theory-training a high-dimensional set of equations will be significantly faster than solving those equations using conventional, discretization-based methods for solving differential equations, because the latter suffer from the so-called curse of dimensionality [17]. For a low-dimensional set of equations, however, it is entirely unclear whether theory-training has any advantages over those methods. The equations in the present study are a good candidate to address that issue (because they are one-dimensional in the spatio-temporal domain). To that end, we solved equations (1) to (3) by discretizing the time derivatives using the backward Euler method and using the solid fraction updating scheme of Torabi Rad and Beckermann [18]. Again, such a scheme is necessary because φ_s in the equations does not have an explicit relation from which it can be calculated. Due to the high coupling between the equations, within each time step, they had to be solved iteratively until all the variables converged (i.e., did not change with further iterations within a user-prescribed criterion). These simulations were performed using a simple (70 lines of code) Python script.

Figure 13 displays the maximum L2 errors (left plot) and computational times (right plot) of theory-training (solid curves) and the method using discretization (dashed curves) as a function of the number of training/discretization points N_f. The latter method will be referred to as the conventional method. The computational times of theory-training were almost equal to the training times because the inference was nearly instantaneous. Theory-training results are displayed for the base TTN (black curves), the optimal TTN (red curves), and the TTN with D = 2 and W = 6 (blue curves). The latter TTN has the same number of parameters as the optimal TTN.
Before comparing the results of theory-training and the conventional method, it should be mentioned that the 10X improvement in the inference accuracy that was obtained by switching from the base to the optimal TTN (see the right plot in Figure ?? and the discussion at the end of Section III B) persists regardless of the value of N_f. This provides some confidence that the guidelines introduced in Section III D, which were based on observations with N_f = 200 and 400 only, do not depend on N_f and are general. Also, note that the training time of the optimal TTN is higher than that of the base TTN by less than thirty percent, indicating that the 10X improvement in the inference accuracy is obtained with no significant increase in the computational cost. It would be interesting to see whether such a gain in the inference accuracy exists for a set of equations that contains spatial derivatives, especially in the presence of sharp gradients. For such a system, if the gain turns out to be significant, it may reflect itself by somehow alleviating a reported shortcoming of theory-training that concerns having overly diffuse interfaces, instead of ones that should be as sharp as possible [6]; this is another issue that requires further research.

Comparing the red and dashed-black curves in the figure shows that the computational time required by theory-training to reach the lowest error achievable by the optimal TTN (represented by the horizontal dotted line in the left plot) is similar to the corresponding time with the conventional method: one hundred seconds compared to eighty seconds. The minor difference between the two times should not distract, because these computations were performed using different computer codes (TensorFlow vs. a simple Python script). The important point to consider here is that the times are of the same order of magnitude even though the equations are only one-dimensional (in the spatio-temporal domain).
Therefore, if the dimensionality of the equations is increased by only one, then theory-training can be expected to clearly defeat the conventional method. This is because such an increase will require increasing N_f (the number of training/discretization points), and that increase will increase (almost linearly) only the computational time of the conventional method (see the right plot and note that a thousand-fold increase in N_f has almost no impact on the theory-training times). The similarity between the computational times, which leads to the somewhat unexpected result that, even on a low-dimensional set of equations, theory-training can have advantages over rival, conventional methods, is attributed to the high coupling between the equations studied here. In the presence of such coupling, and when the equations are discretized, numerous iterations have to be performed at each time step for the variables to converge. Before proceeding, note that if the base TTN were used in the comparison with the conventional method, the result would have been the erroneous and misleading conclusion that, on a low-dimensional set of equations, theory-training cannot have any computational advantage over discretizing, as the latter can reach the minimum error achievable by the base TTN more than ten times faster. This again highlights the importance of reaching the optimal computational time and inference accuracy, using, for example, the guidelines in Section III D, before making any comparison against the rival methods.

FIG. 13: Comparison between the maximum L2 errors (left) and computational times (right) of theory-training (solid curves) and a rival method for solving differential equations that discretizes the derivatives (dashed curves), showing that even on a one-dimensional set of equations, theory-training can be as fast as a discretizing method.

From the left plot, it can also be seen that, while the discretization errors decrease linearly as N_f increases, the theory-training errors decrease with increasing N_f only initially; then, the errors cease decreasing further, despite orders-of-magnitude increases in N_f. This implies that, unlike in the conventional methods, where errors can be decreased as much as desired by following an instruction as simple as increasing the number of discretization points, in theory-training, decreasing the errors is a far more challenging task. That challenge further highlights the importance of our guidelines, which can help to decrease those errors, even though that decrease is only one order of magnitude. Furthermore, recalling that the errors of the more than seventy networks analyzed in Section III B were all between 5 × − and 5 × − (in variables that are dimensionless and of order unity) implies a disadvantage of theory-training: regardless of the number of training points or the depth or width of the network, theory-training errors cannot be made arbitrarily low. This disadvantage may limit the application of theory-training, especially in domains where extremely low errors are required. Nonetheless, an error within the range we achieved here will be entirely acceptable for numerous engineering applications, such as macroscale modeling of solidification. In such applications, currently, the linear increase of the computational times with the number of discretization points limits the simulation domain size and resolution, which results in simulations that do not fully resolve critical defects [19], forcing engineers to rely on simplified criteria [20] to predict them. If theory-training highly complex sets of equations that can predict those defects turns out to be successful, then the simulation domain size will not have to be limited by the available computational resources; this is another area that requires further research.

IV. CONCLUSIONS
By analyzing in detail the results of numerous computational experiments, we presented in-depth insights into theory-training neural networks to infer the solution of highly coupled differential equations. The results showed that the final training loss, which represents a TTN's ability to satisfy the equations at the training points, and the accuracy of the solution inferred at new data points can both deteriorate when the loss oscillates with the training epochs. We presented a Partial Regularization Training technique (PRT) that is based on performing the training in two parts. In the first part, a regularized loss is minimized. After that loss has reached a plateau, training is continued by minimizing a non-regularized loss and ends after the latter loss has converged. The results showed that PRT eliminates the oscillations and, without increasing the training cost, gives a factor-of-two improvement in the inference accuracy. That improvement was attributed to a factor-of-ten decrease in the final training loss.

Among networks with different sizes, those that were deeper but not wider than the base TTN, which is the network with the lowest number of hidden layers and nodes per layer that is still able to decrease the training loss to reasonably low values, enjoyed a monotonic correlation between the training loss and the inference error. The existence of such a correlation allowed us to present the following guidelines. In theory-training a set of equations, the base TTN should be identified first. Then, different TTNs that are all deeper but not wider than the base TTN need to be trained. Among the latter TTNs, the one with the lowest training loss, which is termed the optimal TTN, can be expected to have the lowest inference error, even when those errors cannot be quantified.
Following these guidelines will provide some confidence that the network ultimately used for inference has optimal training and inference times, as it does not have any additional, unnecessary neurons, was trained to its full realizable learning capacity, and provides optimally accurate predictions.

A detailed comparison was made between theory-training and a rival, conventional method for solving differential equations that discretizes the derivatives. The results attested that, even for a one-dimensional set of equations, theory-training can be as fast as conventional methods when the equations are highly coupled. This suggests that the advantages of theory-training over the conventional methods are not necessarily limited to high-dimensional sets of equations and can be realized in modeling physical systems, where the dimensions are limited to four. Nonetheless, the comparison also revealed a limitation of theory-training: unlike in conventional methods, in theory-training, errors cannot be made arbitrarily low, regardless of the number of training points or the network depth or width. This may limit the application of the present theory-training framework in domains where extreme accuracies are necessary.

It should be emphasized that our observations and guidelines were based only on the training experiments we performed here and lack a rigorous mathematical proof. Therefore, they are not guaranteed to hold if, for example, training is performed using an entirely different set of equations. Nonetheless, these observations can motivate further theoretical research on neural networks. For example, it would be interesting to see if the linear correlation between the training loss and the inference error that we observed for networks that are deep but not wide has a theoretical reason and/or holds for other systems of equations.
V. ACKNOWLEDGMENT
This study was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC-2023 Internet of Production - 390621612. The authors appreciate fruitful discussions with Dr. Georg J. Schmitz of ACCESS e.V. MTR appreciates the comments by Dr. Ameya D. Jagtap of Brown University on regularization.

[1] Torabi Rad, M., Viardin, A., Schmitz, G. & Apel, M. Theory-training deep neural networks for an alloy solidification benchmark problem. Computational Materials Science 180, 109687 (2020).
[2] Raissi, M., Perdikaris, P. & Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, 686-707 (2019).
[3] Raissi, M., Wang, Z., Triantafyllou, M. & Karniadakis, G. Deep learning of vortex-induced vibrations. Journal of Fluid Mechanics 861, 119-137 (2018).
[4] Raissi, M. & Karniadakis, G. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics 357, 125-141 (2018).
[5] Haghighat, E. & Juanes, R. SciANN: A Keras/TensorFlow wrapper for scientific computations and physics-informed deep learning using artificial neural networks. Computer Methods in Applied Mechanics and Engineering 373, 113552 (2021).
[6] Haghighat, E., Raissi, M., Moure, A., Gomez, H. & Juanes, R. A deep learning framework for solution and discovery in solid mechanics. arXiv preprint arXiv:2003.02751 (2020).
[7] Sahli Costabal, F., Yang, Y., Perdikaris, P., Hurtado, D. & Kuhl, E. Physics-informed neural networks for cardiac activation mapping. Frontiers in Physics 8 (2020).
[8] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B. & Sutskever, I. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292 (2019).
[9] Ba, J. & Caruana, R. Do deep networks really need to be deep? arXiv preprint arXiv:1312.6184 (2013).
[10] Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning. MIT Press (2016).
[11] Raissi, M. Deep hidden physics models: Deep learning of nonlinear partial differential equations. The Journal of Machine Learning Research 19(1), 932-955 (2018).
[12] Brenner, S. & Scott, L. The Mathematical Theory of Finite Element Methods. Springer (2011).
[13] Jin, P., Lu, L., Tang, Y. & Karniadakis, G. Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness. Neural Networks 130, 85-99 (2020).
[14] Jagtap, A., Kawaguchi, K. & Karniadakis, G. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics 404, 109136 (2020).
[15] Jagtap, A., Kawaguchi, K. & Karniadakis, G. Locally adaptive activation functions with slope recovery for deep and physics-informed neural networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 476, 20200334 (2020).
[16] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).