Discovering dependencies in complex physical systems using Neural Networks
Sachin Kasture∗

OptoAI, 2805DL, Gouda, The Netherlands

(Dated: February 1, 2021)

In today's age of data, discovering relationships between different variables is an interesting and challenging problem. The problem becomes even more critical for complex dynamical systems, such as weather forecasting and econometric models, which can show highly non-linear behaviour. A method based on mutual information and deep neural networks is proposed as a versatile framework for discovering non-linear relationships ranging from functional dependencies to causality. We demonstrate the application of this method to actual multivariable non-linear dynamical systems. We also show that the method can find relationships even for datasets with a small number of data points, as is often the case with empirical data.

∗ [email protected]
I. INTRODUCTION
Finding relationships between different variables in large datasets [1–3] is an important problem with ramifications in fields ranging from environmental science to economics and genetic networks. Understanding which variables affect a certain quantity becomes increasingly challenging when these relationships are highly non-linear, like those occurring in dynamical systems with several variables. Quite often in a large dataset with several variables, only a few variables may significantly affect the target variable, and identifying these variables is a vital first step in exploring the dependencies in more detail.

Several methods exist that can help find dependencies and correlations between variables. However, most of these methods are good at detecting a certain class of functions while failing for others. Some methods are quite good at detecting functional dependencies between 2 variables [1, 4]; they have, however, not been demonstrated in a multi-variable scenario where a target variable depends on several input variables. Finding functional dependencies has been explored extensively in the context of relational databases [5, 6]. However, these methods rely on finding exact functional relationships by finding all attributes which have a one-to-one or one-to-many relationship with a certain column Y. This approach does not work well for small databases which are just a sample of the true distribution, since in these cases one-to-one relations are more likely to occur by chance. Also, in such cases it is difficult to reliably find the smallest subset of variables which are sufficient to describe Y. These methods do not offer any control over what kind of functional relationships may intuitively be considered good or interesting candidates, nor do they provide any kind of score to evaluate functional dependencies.

In this paper, we use neural networks as devices to model non-linear behaviour and find complex non-linear relationships. In particular, deep neural networks (DNNs), which consist of more than 1 hidden layer, are excellent candidates for efficiently modelling multi-variable non-linear polynomial functions with a small number of neurons [7, 8]. Additionally, a regularization mechanism allows us to control the complexity of the model we wish to consider [9]. Neural networks have recently been used to discover physical concepts, identify phase transitions and design quantum experiments [10–12]. To help find dependencies, we use a DNN-based autoencoder architecture which consists of an encoder-decoder pair: the encoder maps the input space to a latent space, while the decoder maps the latent space to the output space. This architecture has been used, amongst other applications, for non-linear Principal Component Analysis (PCA), where the goal is to find a compressed representation of data [13]. As such, the input and the output of the autoencoder are conventionally the same. In our method the input will be X, the set of input features, and Y is the target feature or set of features. We then use compression of mutual information in the latent space to derive a loss function which can be minimized to find the smallest set of features in X which can be used to reliably reconstruct Y. The loss function can also be used to assign a score to compare the functional dependencies on different sets of input parameters. We then demonstrate this method by finding dependencies in chaotic dynamical systems.
We also show that this method can be used to find non-linear causal connections in the Granger sense for chaotic systems [14–16], even for a small dataset of 100 samples.
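Before developing the theory, it is useful to fix the architecture in code. The following is a minimal PyTorch sketch of the encoder-noise-decoder construction described above; the single elementwise weight W_i per input feature corresponds to the linear encoder used in the theory below, while the decoder width, activation, noise level and layer count are illustrative assumptions rather than the exact configuration used in our experiments.

    import torch
    import torch.nn as nn

    class DependencyAutoencoder(nn.Module):
        """Sketch of the X -> L -> Y map: a linear encoder with one weight
        W_i per input feature, a Gaussian noise channel on the latent code,
        and a non-linear decoder reconstructing the target Y."""

        def __init__(self, n_in, n_out, sigma_enc=0.1):
            super().__init__()
            # Encoder weights: l_i = W_i * x_i (see Section II)
            self.W = nn.Parameter(torch.ones(n_in))
            self.sigma_enc = sigma_enc
            # Decoder theta_dec(l); width and activation are assumptions
            self.decoder = nn.Sequential(
                nn.Linear(n_in, 32),
                nn.BatchNorm1d(32),
                nn.Tanh(),
                nn.Linear(32, n_out),
            )

        def forward(self, x):
            l = self.W * x
            if self.training:
                # Additive Gaussian noise channel on the latent code
                l = l + self.sigma_enc * torch.randn_like(l)
            return self.decoder(l), l

Reading off dependencies then amounts to inspecting which encoder weights survive training, as quantified by the loss function derived next.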
II. THEORY

We now derive a loss function using the information bottleneck method [17], based on the fact that the latent intermediate layer can be used to extract only the relevant information from X and reconstruct Y from it. We represent this latent representation by L. We also assume a Markov chain Y → X → L, which means P(Y|X,L) = P(Y|X); this holds because X and Y correspond to observed ground-truth data. We now use the fact that we want to extract only that information from X which can reconstruct Y, and we quantify this information using the Shannon mutual information [17, 18]. We therefore want to maximize the quantity I(L,Y) − λ_enc I(L,X), where the first and second terms describe the capacity of the decoder and the encoder respectively, and λ_enc determines the relative weight between the two.

FIG. 1. Comparison between x_i and the correspondingly scaled l_i for (a)-(d) different values of y_i = dx_i/dt of equation (17). In the plots where l_i is essentially noise, information from the corresponding x_i is not used to reconstruct y_i by the decoder; fac is a scaling factor chosen so that x_i and l_i/fac are comparable.

We can write I(L,Y) as

    I(L,Y) = \int dy\,dl\, p(y,l) \log\frac{p(y|l)}{p(y)} = \int dl\, p(l) \int dy\, p(y|l) \log p(y|l) + H(Y)    (1)

where H(Y) is the Shannon entropy. We neglect H(Y) since it is fixed by the data. Since it is very difficult to calculate p(y|l), we approximate it by another analytic function φ(y|l). Using the fact that the KL divergence, which measures the 'distance' between 2 probability distributions, is always non-negative,

    KL(p(y|l), \phi(y|l)) \geq 0 \;\Rightarrow\; \int dy\, p(y|l) \log p(y|l) \geq \int dy\, p(y|l) \log \phi(y|l)    (2)

we can write

    I(L,Y) \geq \int dy\,dl\, p(y,l) \log \phi(y|l).    (3)

We can now choose an appropriate function for φ(y|l) which allows us to derive a suitable loss function as well as to tune the complexity of the decoder. The output of the decoder is given by θ_dec(l), the composite function of the decoder neural network acting on the latent variable l. To also include an additional L1 [9] regularization parameter, which helps restrict the magnitude of the weights in the decoder neural network, we use the following form for φ(y|l):

    \phi(y|l) = e^{-(\theta_{dec}(l)-y)^2/\sigma_{dec}^2 \,-\, \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots)}    (4)

where θ_d1, θ_d2, etc. are the weights of the different neurons in the decoder network. Therefore we can write

    I(L,Y) \geq -\int dy\,dl\, p(y,l)\left[\frac{(\theta_{dec}(l)-y)^2}{\sigma_{dec}^2} + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots)\right].    (5)

Now we use the fact that p(y,l) = \int dx\, p(x,y,l) = \int dx\, p(l|x,y)\, p(x,y). Using the Markov-chain condition, this can be written as p(y,l) = \int dx\, p(l|x)\, p(x,y). Approximating \int dx\,dy\, p(x,y) A(x,y) \approx (1/M)\sum_{k=1}^{M} A(x_k, y_k), where M is the number of distinct data points, we can write

    I(L,Y) \geq -(1/M)\sum_{k=1}^{M} \int dl\, p(l|x_k)\left[\frac{(\theta_{dec}(l)-y_k)^2}{\sigma_{dec}^2} + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots)\right].    (6)
Similarly, we can define I(L,X) as

    I(L,X) = \int dl\,dx\, p(x,l) \log\frac{p(l|x)}{p(l)} = \int dx\,dl\, p(x,l) \log p(l|x) - \int dl\, p(l) \log p(l).    (7)

We now again use another analytical function g(l) in place of p(l), and use the non-negativity of the KL divergence to get

    I(L,X) = \int dl\,dx\, p(x,l) \log p(l|x) - \int dl\, p(l) \log p(l) \leq \int dx\,dl\, p(x,l) \log\frac{p(l|x)}{g(l)}.    (8)

For convenience we use a Gaussian function centred at 0,

    g(l) = e^{-\sum_i l_i^2/\sigma_{enc}^2}    (9)

where l = (l_1, l_2, \ldots) are the different components of l and σ_enc is an adjustable parameter. For p(l|x) we use

    p(l|x) = \prod_i e^{-(l_i - W_i x_i)^2/\sigma_{enc}^2}    (10)

where x = (x_1, x_2, \ldots). This means we use a linear transformation of X and add independent Gaussian noise with variance σ_enc² and mean 0 to each component. We now plug definitions (9) and (10) into equation (8) and obtain

    I(L,X) \leq \int dx\,dl\, p(x,l) \log e^{-\sum_i W_i x_i (W_i x_i - 2 l_i)/\sigma_{enc}^2}.    (11)

Writing p(x,l) = p(x) p(l|x), we can write the above equation as

    I(L,X) \leq -\int dx\,dl\, p(x) \prod_i e^{-(l_i - W_i x_i)^2/\sigma_{enc}^2} \left[\frac{\sum_i W_i x_i (W_i x_i - 2 l_i)}{\sigma_{enc}^2}\right].    (12)

Using the approximation \int dx\, p(x) A(x) \approx (1/M)\sum_{k=1}^{M} A(x_k), we can write

    I(L,X) \leq -(1/M)\sum_{k=1}^{M} \int dl\, \prod_i e^{-(l_i - W_i x_{ki})^2/\sigma_{enc}^2} \left[\frac{\sum_i W_i x_{ki} (W_i x_{ki} - 2 l_i)}{\sigma_{enc}^2}\right].    (13)

Similarly, substituting equation (10) into equation (6) and assuming σ_enc to be small enough that e^{-(l_i - W_i x_i)^2/\sigma_{enc}^2} \approx \delta(l_i - W_i x_i), we obtain

    I(L,Y) - \lambda_{enc} I(L,X) \geq -(1/M)\sum_{k=1}^{M}\left[\frac{(\theta_{dec}(l)-y_k)^2}{\sigma_{dec}^2} + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots) + \lambda_{enc}\sum_i \frac{(W_i x_{ki})^2}{\sigma_{enc}^2}\right].    (14)

FIG. 2. The fan-in causality pattern of the set of delay equations (18), for the ξ_ij values used to obtain the results in Figure 3.

Therefore we can define a loss function to be minimized as

    L = (1/M)\sum_{k=1}^{M}\left[\frac{(\theta_{dec}(l)-y_k)^2}{\sigma_{dec}^2} + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots) + \lambda_{enc}\sum_i \frac{(W_i x_{ki})^2}{\sigma_{enc}^2}\right].    (15)

We observe that the first term tries to minimize the least-squares difference between θ_dec(l) and y, and the second term controls the size of the weights of the decoder, which in turn controls the maximum degree of the polynomials the decoder NN can approximate. For the third term, we see that as we increase λ_enc, the NN will try to keep (W_i x_ki)² small to keep the total loss function small. Assuming now that we standardize our data so that the x_i on average have similar magnitudes, we absorb them into λ_enc. The third term will then be smallest when only those W_i are non-zero which are required to reproduce Y.
Using this intuition, and the fact that the term inside the summation over i in equation (15) is always non-negative, we can further simplify the loss function to

    L = (1/M)\sum_{k=1}^{M}\left[\frac{(\theta_{dec}(l)-y_k)^2}{\sigma_{dec}^2}\right] + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots) + \lambda_{enc}\sum_i |W_i|    (16)

where we have merged σ_enc with λ_enc. This way we treat both the encoder and decoder weights on equal terms using L1 regularization; from a practical standpoint, L1 is advantageous since it can shrink weights faster.

FIG. 3. Comparison between Y_i and the correspondingly scaled l_i for (a)-(c) different values of y_i = Y_i for the set of delay equations (18). In the plots where l_i is noise, information from the corresponding x_i is not used to reconstruct y_i by the decoder.
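The loss of equation (16) translates almost line by line into code. The sketch below assumes the DependencyAutoencoder sketched in Section I; the hyperparameter values and the absorption of σ_dec into the mean-square term are illustrative.

    import torch

    def bottleneck_loss(model, x, y, lambda_enc=0.1, lambda_dec=0.005):
        """Equation (16): mean-square reconstruction error plus L1 penalties
        on the decoder weights (lambda_dec) and encoder weights (lambda_enc)."""
        y_pred, _ = model(x)
        mse = ((y_pred - y) ** 2).mean()  # first term, sigma_dec absorbed
        # Second term: L1 over all decoder parameters (biases included for brevity)
        l1_dec = sum(p.abs().sum() for p in model.decoder.parameters())
        # Third term: L1 on the encoder weights drives irrelevant W_i to zero
        l1_enc = model.W.abs().sum()
        return mse + lambda_dec * l1_dec + lambda_enc * l1_enc

After training, the inputs x_i whose encoder weights W_i have shrunk to zero are read off as irrelevant for reconstructing Y; this is the signature visible in Figures 1 and 3, where the corresponding latent components l_i are pure noise.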
III. APPLICATION

For further study we use a NN in which the encoder has 2 linear layers; this gives us a mapping X → L. We then add Gaussian noise to the latent variables, l_i → l_i + N(0, σ_enc²). The latent code is then sent through a multilayer decoder network with non-linear activation functions to give the output θ_dec(l). We perform batch normalization between intermediate neural-network layers [19]; this prevents changes in the data distributions between adjacent layers and allows the network to learn at a higher learning rate. We then minimize the loss function in equation (16) using stochastic gradient descent with different batch sizes. We can tune the values of the regularization parameters λ_enc and λ_dec to obtain as low a value of the loss function as possible; this choice may also depend on our prior knowledge about the complexity of the system. The data is split into a training set and a validation set: the training data is used to build the model, while the validation set checks how well the model generalizes. The basic heuristic for tuning these parameters is as follows. After fixing the learning rate for the gradient descent, we first increase the value of λ_dec, which fixes the complexity of the functions the decoder can simulate. We then increase the value of λ_enc while monitoring the mean-square error, and stop when the mean-square error is as small as possible for both the training and the validation set.

We now use this method to infer relationships in well-known non-linear systems. We first consider the Lorenz96 non-linear system, defined as

    \frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\, x_{i-1} - x_i + F    (17)

where i goes from 1 to N, with N the number of oscillators and x_{N+1} = x_1, x_{-1} = x_{N-1}, x_0 = x_N. F is the driving term, and we choose F = 8, for which the system behaves in the chaotic regime. Figure 1 shows the results for N = 5. We run the method N = 5 times, each time with y = dx_i/dt for i from 1 to 5. We see that the latent representation l_i is basically just the added Gaussian noise when y has no dependency on the corresponding x_i. The number of data points was 3000, the learning rate was 0.0001, and the values of λ_dec, λ_enc were 0 and 0.1 respectively. The training was run for 1000 epochs with a batch size of 300.

Next we apply the NN to infer causal relationships in a set of non-linear delay equations. For this we look at the following set of equations:

    Y_i(t+1) = \Big(\xi_{ii} - \sum_{j=1,2,3} \xi_{ij}\, Y_j(t)\Big)\, Y_i(t)    (18)

for i = 1, 2, 3. We choose parameters ξ_ij which correspond to the fan-in pattern shown in Figure 2. The values of ξ are as follows: ξ_11 = 4, ξ_22 = 3, ξ_33 = 2, ξ_21 = 0.…, ξ_31 = −0.6. These parameters correspond to a chaotic regime.

FIG. 4. FD versus MR for different values of λ_enc. The legend also indicates the non-linear system for the plotted data; 'dde' stands for the delay difference equations (18).
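For concreteness, the two benchmark datasets could be generated along the following lines. This is a sketch under stated assumptions: the Euler integration step and initial conditions for equation (17), the matrix form of equation (18), and the value used for ξ_21 are illustrative and not taken from the text.

    import numpy as np

    def lorenz96(x, F=8.0):
        """Right-hand side of equation (17) with periodic indexing."""
        return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

    def simulate_lorenz96(N=5, steps=3000, dt=0.01, F=8.0, seed=0):
        rng = np.random.default_rng(seed)
        x = F + 0.01 * rng.standard_normal(N)  # small perturbation around F
        traj, derivs = [], []
        for _ in range(steps):
            dx = lorenz96(x, F)
            traj.append(x.copy())
            derivs.append(dx.copy())
            x = x + dt * dx  # simple Euler step (illustrative)
        return np.array(traj), np.array(derivs)  # inputs X, targets y_i = dx_i/dt

    # Fan-in coupling matrix of Figure 2; xi_21 = 0.3 is an illustrative value
    xi = np.array([[ 4.0, 0.0, 0.0],
                   [ 0.3, 3.0, 0.0],
                   [-0.6, 0.0, 2.0]])

    def simulate_dde(xi, steps=50, seed=0):
        """Coupled maps of equation (18): Y_i(t+1) = (xi_ii - sum_j xi_ij Y_j(t)) Y_i(t)."""
        rng = np.random.default_rng(seed)
        Y = rng.uniform(0.1, 0.5, size=3)
        out = [Y.copy()]
        for _ in range(steps):
            Y = (np.diag(xi) - xi @ Y) * Y
            out.append(Y.copy())
        return np.array(out)  # X = [Y_1(t), Y_2(t), Y_3(t)], targets y = Y_i(t+1)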
In this case both Y_2 and Y_3 are causally driven by Y_1. A fan-in pattern is a good test because correlation-based tests would falsely infer a causal relationship between Y_2 and Y_3 [2]. To infer the causal relationships, we run the NN with y = Y_i(t+1) and input X = [Y_1(t), Y_2(t), Y_3(t)]. From Figure 3 we can see that we are able to correctly infer the dependencies, even for a very small dataset of 50 points. The plots were obtained for a learning rate of 0.001 and λ_enc, λ_dec values of 0.1 and 0.005 respectively; the number of epochs was 1500 with a batch size of 32. We also summarize the performance of this method using 2 metrics, the false discovery rate (FD) and the miss rate (MR), defined as

    FD = \frac{FP}{FP + TP}, \qquad MR = \frac{FN}{FN + TP}    (19)

where FN, FP and TP are the numbers of false negatives, false positives and true positives respectively. Here a positive means that a variable has been discovered to be independent of the output, and a negative means that a variable has been discovered to be related to the output. These data are obtained over 20 independent runs of the model. For the Lorenz96 model, the best result is obtained with λ_enc = 0.…; Figure 4 plots FD versus MR for different values of λ_enc for both systems.
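The metrics in equation (19) can be computed from the pooled results of the independent runs; a small helper might look as follows, with the paper's convention that a 'positive' is a variable declared independent of the output.

    import numpy as np

    def fd_mr(pred_independent, true_independent):
        """FD and MR of equation (19). Boolean arrays with one entry per
        (run, variable); True = declared/actually independent ('positive')."""
        pred = np.asarray(pred_independent)
        true = np.asarray(true_independent)
        tp = np.sum(pred & true)    # correctly declared independent
        fp = np.sum(pred & ~true)   # dependent variable declared independent
        fn = np.sum(~pred & true)   # independent variable declared dependent
        fd = fp / max(fp + tp, 1)   # guard against empty denominators
        mr = fn / max(fn + tp, 1)
        return fd, mr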
IV. CONCLUSION

The proposed NN-based approach is a versatile platform for inferring relationships, especially in complex non-linear systems, since neural networks are a powerful tool for modelling such non-linear functions. Even though it is difficult to infer the exact functional form of a relationship using a NN, this method can help locate functional dependencies between variables in a multivariable system; these variables can then be probed more extensively to find the functional (or approximately functional) form of the relationships. Methods based on sparse regression have been used in the past to find functional relationships; however, they rely on prior knowledge of the set of basis functions to use for the regression. The proposed method has no such requirement and, with a large enough NN, can simulate any complex non-linear function. Besides locating functional relationships, it can also help infer causal relationships in non-linear data, as seen in the discussed example, where it correctly inferred the causal relationships even for a small dataset of 50 samples.
V. ACKNOWLEDGEMENTS
The author would like to thank Akshatha Mohan for helpful comments and critical assessment of the manuscript.

[1] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, Science, 1518 (2011).
[2] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky, Proceedings of the National Academy of Sciences, 6286 (2010).
[3] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Proceedings of the National Academy of Sciences, 3932 (2016).
[4] A. Dembo, A. Kagan, and L. A. Shepp, Bernoulli, 343 (2001).
[5] J. Liu, J. Li, C. Liu, and Y. Chen, IEEE Transactions on Knowledge and Data Engineering, 251 (2012).
[6] Y. Huhtala, The Computer Journal, 100 (1999).
[7] H. W. Lin, M. Tegmark, and D. Rolnick, Journal of Statistical Physics, 1223 (2017).
[8] D. Rolnick and M. Tegmark, arXiv:1705.05502 [cs, stat] (2018).
[9] R. Tibshirani, Journal of the Royal Statistical Society: Series B (Methodological), 267 (1996).
[10] R. Iten, T. Metger, H. Wilming, L. del Rio, and R. Renner, Physical Review Letters, 010508 (2020).
[11] B. S. Rem, N. Käming, M. Tarnowski, L. Asteria, N. Fläschner, C. Becker, K. Sengstock, and C. Weitenberg, Nature Physics, 917 (2019).
[12] A. A. Melnikov, H. Poulsen Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, Proceedings of the National Academy of Sciences, 1221 (2018).
[13] G. E. Hinton, Science, 504 (2006).
[14] M. Detto, A. Molini, G. Katul, P. Stoy, S. Palmroth, and D. Baldocchi, The American Naturalist, 524 (2012).
[15] J. Runge, J. Heitzig, V. Petoukhov, and J. Kurths, Physical Review Letters, 258701 (2012).
[16] H. Ma, K. Aihara, and L. Chen, Scientific Reports, 7464 (2015).
[17] N. Tishby, F. C. Pereira, and W. Bialek, arXiv:physics/0004057 (2000).
[18] C. Giannella and E. Robertson, Information Systems 29