Discovering dependencies in complex physical systems using Neural Networks
Sachin Kasture∗

OptoAI, 2805DL, Gouda, The Netherlands

(Dated: February 1, 2021)

In today's age of data, discovering relationships between different variables is an interesting and challenging problem. The problem becomes even more critical for complex dynamical systems, such as weather forecasting and econometric models, which can show highly non-linear behaviour. A method based on mutual information and deep neural networks is proposed as a versatile framework for discovering non-linear relationships ranging from functional dependencies to causality. We demonstrate the application of this method to actual multivariable non-linear dynamical systems. We also show that the method can find relationships even for datasets with a small number of data points, as is often the case with empirical data.

∗ [email protected]
I. INTRODUCTION
Finding relationships between different variables in large datasets [1–3] is an important problem with ramifications in fields ranging from environmental science to economics and genetic networks. Understanding which variables affect a certain quantity becomes increasingly challenging when these relationships are highly non-linear, like those occurring in dynamical systems with several variables. Quite often in a large dataset with several variables, only a few variables may significantly affect the target variable, and identifying these variables is a vital first step in exploring the dependencies in more detail.

Several methods exist that can help find dependencies and correlations between variables. However, most of these methods are good at detecting a certain class of functions while failing for others. Some methods are quite good at detecting functional dependencies between 2 variables [1, 4]; they have, however, not been demonstrated in a multi-variable scenario where a target variable depends on several input variables. Finding functional dependencies has been explored extensively in the context of relational databases [5, 6]. However, these methods rely on finding exact functional relationships by finding all attributes which have a one-to-one or one-to-many relationship with a certain column Y. This approach does not work well for small databases which are just a sample of the true distribution, since in these cases one-to-one relations are more likely to occur by chance. Also, in such cases it is difficult to reliably find the smallest subset of variables which are sufficient to describe Y. These methods do not offer any control over what kind of functional relationships may intuitively be considered good or interesting candidates, nor do they provide any kind of score to evaluate functional dependencies.

In this paper, we use neural networks as devices to model non-linear behaviour and find complex non-linear relationships. In particular, deep neural networks (DNNs), which consist of more than 1 hidden layer, are excellent candidates for efficiently modelling multi-variable non-linear polynomial functions with a small number of neurons [7, 8]. Additionally, a regularization mechanism allows us to control the complexity of the model we wish to consider [9]. Neural networks have recently been used to discover physical concepts, identify phase transitions and design quantum experiments [10–12]. To help find dependencies, we use a DNN-based autoencoder architecture which consists of an encoder-decoder pair: the encoder maps the input space to a latent space, while the decoder maps the latent space to the output space. This architecture has been used, amongst other applications, for non-linear Principal Component Analysis (PCA), where the goal is to find a compressed representation of data [13]. As such, the input and the output of the autoencoder are conventionally the same. In our method the input will be X, the set of input features, and Y is the target feature or set of features. We then use compression of mutual information in the latent space to derive a loss function which can be minimized to find the smallest set of features in X which can be used to reliably reconstruct Y. The loss function can also be used to assign a score to compare the functional dependencies on different sets of input parameters. We then demonstrate this method by finding dependencies in chaotic dynamical systems.
We also show that this method can be used to find non-linear causal connections in the Granger sense for chaotic systems [14–16], even for a small dataset of 100 samples.
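Before developing the theory, it is useful to fix the architecture in code. The following is a minimal PyTorch sketch of the encoder-noise-decoder construction described above; the single elementwise weight W_i per input feature corresponds to the linear encoder used in the theory below, while the decoder width, activation, noise level and layer count are illustrative assumptions rather than the exact configuration used in our experiments.

    import torch
    import torch.nn as nn

    class DependencyAutoencoder(nn.Module):
        """Sketch of the X -> L -> Y map: a linear encoder with one weight
        W_i per input feature, a Gaussian noise channel on the latent code,
        and a non-linear decoder reconstructing the target Y."""

        def __init__(self, n_in, n_out, sigma_enc=0.1):
            super().__init__()
            # Encoder weights: l_i = W_i * x_i (see Section II)
            self.W = nn.Parameter(torch.ones(n_in))
            self.sigma_enc = sigma_enc
            # Decoder theta_dec(l); width and activation are assumptions
            self.decoder = nn.Sequential(
                nn.Linear(n_in, 32),
                nn.BatchNorm1d(32),
                nn.Tanh(),
                nn.Linear(32, n_out),
            )

        def forward(self, x):
            l = self.W * x
            if self.training:
                # Additive Gaussian noise channel on the latent code
                l = l + self.sigma_enc * torch.randn_like(l)
            return self.decoder(l), l

Reading off dependencies then amounts to inspecting which encoder weights survive training, as quantified by the loss function derived next.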
II. THEORY

We now derive a loss function using the information bottleneck method [17], based on the fact that the latent intermediate layer can be used to extract only the relevant information from X and reconstruct Y from it. We represent this latent representation by L. We also assume a Markov chain Y → X → L, which means P(Y|X,L) = P(Y|X); this holds because X and Y correspond to observed ground-truth data. We now use the fact that we want to extract only that information from X which can reconstruct Y, and we quantify this information using the Shannon mutual information [17, 18]. We therefore want to maximize the quantity I(L,Y) − λ_enc I(L,X), where the first and second terms describe the capacity of the decoder and the encoder respectively, and λ_enc determines the relative weight between the two.

FIG. 1. Comparison between x_i and the correspondingly scaled l_i for (a)-(d) different values of y_i = dx_i/dt of equation (17). In the plots where l_i is essentially noise, information from the corresponding x_i is not used to reconstruct y_i by the decoder; fac is a scaling factor chosen so that x_i and l_i/fac are comparable.

We can write I(L,Y) as

    I(L,Y) = \int dy\,dl\, p(y,l) \log\frac{p(y|l)}{p(y)} = \int dl\, p(l) \int dy\, p(y|l) \log p(y|l) + H(Y)    (1)

where H(Y) is the Shannon entropy. We neglect H(Y) since it is fixed by the data. Since it is very difficult to calculate p(y|l), we approximate it by another analytic function φ(y|l). Using the fact that the KL divergence, which measures the 'distance' between 2 probability distributions, is always non-negative,

    KL(p(y|l), \phi(y|l)) \geq 0 \;\Rightarrow\; \int dy\, p(y|l) \log p(y|l) \geq \int dy\, p(y|l) \log \phi(y|l)    (2)

we can write

    I(L,Y) \geq \int dy\,dl\, p(y,l) \log \phi(y|l).    (3)

We can now choose an appropriate function for φ(y|l) which allows us to derive a suitable loss function as well as to tune the complexity of the decoder. The output of the decoder is given by θ_dec(l), the composite function of the decoder neural network acting on the latent variable l. To also include an additional L1 [9] regularization parameter, which helps restrict the magnitude of the weights in the decoder neural network, we use the following form for φ(y|l):

    \phi(y|l) = e^{-(\theta_{dec}(l)-y)^2/\sigma_{dec}^2 \,-\, \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots)}    (4)

where θ_d1, θ_d2, etc. are the weights of the different neurons in the decoder network. Therefore we can write

    I(L,Y) \geq -\int dy\,dl\, p(y,l)\left[\frac{(\theta_{dec}(l)-y)^2}{\sigma_{dec}^2} + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots)\right].    (5)

Now we use the fact that p(y,l) = \int dx\, p(x,y,l) = \int dx\, p(l|x,y)\, p(x,y). Using the Markov-chain condition, this can be written as p(y,l) = \int dx\, p(l|x)\, p(x,y). Approximating \int dx\,dy\, p(x,y) A(x,y) \approx (1/M)\sum_{k=1}^{M} A(x_k, y_k), where M is the number of distinct data points, we can write

    I(L,Y) \geq -(1/M)\sum_{k=1}^{M} \int dl\, p(l|x_k)\left[\frac{(\theta_{dec}(l)-y_k)^2}{\sigma_{dec}^2} + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots)\right].    (6)
Similarly, we can define I(L,X) as

    I(L,X) = \int dl\,dx\, p(x,l) \log\frac{p(l|x)}{p(l)} = \int dx\,dl\, p(x,l) \log p(l|x) - \int dl\, p(l) \log p(l).    (7)

We now again use another analytical function g(l) in place of p(l), and use the non-negativity of the KL divergence to get

    I(L,X) = \int dl\,dx\, p(x,l) \log p(l|x) - \int dl\, p(l) \log p(l) \leq \int dx\,dl\, p(x,l) \log\frac{p(l|x)}{g(l)}.    (8)

For convenience we use a Gaussian function centred at 0,

    g(l) = e^{-\sum_i l_i^2/\sigma_{enc}^2}    (9)

where l = (l_1, l_2, \ldots) are the different components of l and σ_enc is an adjustable parameter. For p(l|x) we use

    p(l|x) = \prod_i e^{-(l_i - W_i x_i)^2/\sigma_{enc}^2}    (10)

where x = (x_1, x_2, \ldots). This means we use a linear transformation of X and add independent Gaussian noise with variance σ_enc² and mean 0 to each component. We now plug definitions (9) and (10) into equation (8) and obtain

    I(L,X) \leq \int dx\,dl\, p(x,l) \log e^{-\sum_i W_i x_i (W_i x_i - 2 l_i)/\sigma_{enc}^2}.    (11)

Writing p(x,l) = p(x) p(l|x), we can write the above equation as

    I(L,X) \leq -\int dx\,dl\, p(x) \prod_i e^{-(l_i - W_i x_i)^2/\sigma_{enc}^2} \left[\frac{\sum_i W_i x_i (W_i x_i - 2 l_i)}{\sigma_{enc}^2}\right].    (12)

Using the approximation \int dx\, p(x) A(x) \approx (1/M)\sum_{k=1}^{M} A(x_k), we can write

    I(L,X) \leq -(1/M)\sum_{k=1}^{M} \int dl\, \prod_i e^{-(l_i - W_i x_{ki})^2/\sigma_{enc}^2} \left[\frac{\sum_i W_i x_{ki} (W_i x_{ki} - 2 l_i)}{\sigma_{enc}^2}\right].    (13)

Similarly, substituting equation (10) into equation (6) and assuming σ_enc to be small enough that e^{-(l_i - W_i x_i)^2/\sigma_{enc}^2} \approx \delta(l_i - W_i x_i), we obtain

    I(L,Y) - \lambda_{enc} I(L,X) \geq -(1/M)\sum_{k=1}^{M}\left[\frac{(\theta_{dec}(l)-y_k)^2}{\sigma_{dec}^2} + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots) + \lambda_{enc}\sum_i \frac{(W_i x_{ki})^2}{\sigma_{enc}^2}\right].    (14)

FIG. 2. The fan-in causality pattern of the set of delay equations (18), for the ξ_ij values used to obtain the results in Figure 3.

Therefore we can define a loss function to be minimized as

    L = (1/M)\sum_{k=1}^{M}\left[\frac{(\theta_{dec}(l)-y_k)^2}{\sigma_{dec}^2} + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots) + \lambda_{enc}\sum_i \frac{(W_i x_{ki})^2}{\sigma_{enc}^2}\right].    (15)

We observe that the first term tries to minimize the least-squares difference between θ_dec(l) and y, and the second term controls the size of the weights of the decoder, which in turn controls the maximum degree of the polynomials the decoder NN can approximate. For the third term, we see that as we increase λ_enc, the NN will try to keep (W_i x_ki)² small to keep the total loss function small. Assuming now that we standardize our data so that the x_i on average have similar magnitudes, we absorb them into λ_enc. The third term will then be smallest when only those W_i are non-zero which are required to reproduce Y.
Using this intuition, and the fact that the term inside the summation over i in equation (15) is always non-negative, we can further simplify the loss function to

    L = (1/M)\sum_{k=1}^{M}\left[\frac{(\theta_{dec}(l)-y_k)^2}{\sigma_{dec}^2}\right] + \lambda_{dec}(|\theta_{d1}| + |\theta_{d2}| + \ldots) + \lambda_{enc}\sum_i |W_i|    (16)

where we have merged σ_enc with λ_enc. This way we treat both the encoder and decoder weights on equal terms using L1 regularization; from a practical standpoint, L1 is advantageous since it can shrink weights faster.

FIG. 3. Comparison between Y_i and the correspondingly scaled l_i for (a)-(c) different values of y_i = Y_i for the set of delay equations (18). In the plots where l_i is noise, information from the corresponding x_i is not used to reconstruct y_i by the decoder.
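The loss of equation (16) translates almost line by line into code. The sketch below assumes the DependencyAutoencoder sketched in Section I; the hyperparameter values and the absorption of σ_dec into the mean-square term are illustrative.

    import torch

    def bottleneck_loss(model, x, y, lambda_enc=0.1, lambda_dec=0.005):
        """Equation (16): mean-square reconstruction error plus L1 penalties
        on the decoder weights (lambda_dec) and encoder weights (lambda_enc)."""
        y_pred, _ = model(x)
        mse = ((y_pred - y) ** 2).mean()  # first term, sigma_dec absorbed
        # Second term: L1 over all decoder parameters (biases included for brevity)
        l1_dec = sum(p.abs().sum() for p in model.decoder.parameters())
        # Third term: L1 on the encoder weights drives irrelevant W_i to zero
        l1_enc = model.W.abs().sum()
        return mse + lambda_dec * l1_dec + lambda_enc * l1_enc

After training, the inputs x_i whose encoder weights W_i have shrunk to zero are read off as irrelevant for reconstructing Y; this is the signature visible in Figures 1 and 3, where the corresponding latent components l_i are pure noise.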
III. APPLICATION

For further study we use a NN in which the encoder has 2 linear layers; this gives us a mapping X → L. We then add Gaussian noise to the latent variables, l_i → l_i + N(0, σ_enc²). The latent code is then sent through a multilayer decoder network with non-linear activation functions to give the output θ_dec(l). We perform batch normalization between intermediate neural-network layers [19]; this prevents changes in the data distributions between adjacent layers and allows the network to learn at a higher learning rate. We then minimize the loss function in equation (16) using stochastic gradient descent with different batch sizes. We can tune the values of the regularization parameters λ_enc and λ_dec to obtain as low a value of the loss function as possible; this choice may also depend on our prior knowledge about the complexity of the system. The data is split into a training set and a validation set: the training data is used to build the model, while the validation set checks how well the model generalizes. The basic heuristic for tuning these parameters is as follows. After fixing the learning rate for the gradient descent, we first increase the value of λ_dec, which fixes the complexity of the functions the decoder can simulate. We then increase the value of λ_enc while monitoring the mean-square error, and stop when the mean-square error is as small as possible for both the training and the validation set.

We now use this method to infer relationships in well-known non-linear systems. We first consider the Lorenz96 non-linear system, defined as

    \frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\, x_{i-1} - x_i + F    (17)

where i goes from 1 to N, with N the number of oscillators and x_{N+1} = x_1, x_{-1} = x_{N-1}, x_0 = x_N. F is the driving term, and we choose F = 8, for which the system behaves in the chaotic regime. Figure 1 shows the results for N = 5. We run the method N = 5 times, each time with y = dx_i/dt for i from 1 to 5. We see that the latent representation l_i is basically just the added Gaussian noise when y has no dependency on the corresponding x_i. The number of data points was 3000, the learning rate was 0.0001, and the values of λ_dec, λ_enc were 0 and 0.1 respectively. The training was run for 1000 epochs with a batch size of 300.

Next we apply the NN to infer causal relationships in a set of non-linear delay equations. For this we look at the following set of equations:

    Y_i(t+1) = \Big(\xi_{ii} - \sum_{j=1,2,3} \xi_{ij}\, Y_j(t)\Big)\, Y_i(t)    (18)

for i = 1, 2, 3. We choose parameters ξ_ij which correspond to the fan-in pattern shown in Figure 2. The values of ξ are as follows: ξ_11 = 4, ξ_22 = 3, ξ_33 = 2, ξ_21 = 0.…, ξ_31 = −0.6. These parameters correspond to a chaotic regime.

FIG. 4. FD versus MR for different values of λ_enc. The legend also indicates the non-linear system for the plotted data; 'dde' stands for the delay difference equations (18).
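For concreteness, the two benchmark datasets could be generated along the following lines. This is a sketch under stated assumptions: the Euler integration step and initial conditions for equation (17), the matrix form of equation (18), and the value used for ξ_21 are illustrative and not taken from the text.

    import numpy as np

    def lorenz96(x, F=8.0):
        """Right-hand side of equation (17) with periodic indexing."""
        return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

    def simulate_lorenz96(N=5, steps=3000, dt=0.01, F=8.0, seed=0):
        rng = np.random.default_rng(seed)
        x = F + 0.01 * rng.standard_normal(N)  # small perturbation around F
        traj, derivs = [], []
        for _ in range(steps):
            dx = lorenz96(x, F)
            traj.append(x.copy())
            derivs.append(dx.copy())
            x = x + dt * dx  # simple Euler step (illustrative)
        return np.array(traj), np.array(derivs)  # inputs X, targets y_i = dx_i/dt

    # Fan-in coupling matrix of Figure 2; xi_21 = 0.3 is an illustrative value
    xi = np.array([[ 4.0, 0.0, 0.0],
                   [ 0.3, 3.0, 0.0],
                   [-0.6, 0.0, 2.0]])

    def simulate_dde(xi, steps=50, seed=0):
        """Coupled maps of equation (18): Y_i(t+1) = (xi_ii - sum_j xi_ij Y_j(t)) Y_i(t)."""
        rng = np.random.default_rng(seed)
        Y = rng.uniform(0.1, 0.5, size=3)
        out = [Y.copy()]
        for _ in range(steps):
            Y = (np.diag(xi) - xi @ Y) * Y
            out.append(Y.copy())
        return np.array(out)  # X = [Y_1(t), Y_2(t), Y_3(t)], targets y = Y_i(t+1)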
In this case both Y_2 and Y_3 are causally driven by Y_1. A fan-in pattern is a good test because correlation-based tests would falsely infer a causal relationship between Y_2 and Y_3 [2]. To infer the causal relationships, we run the NN with y = Y_i(t+1) and input X = [Y_1(t), Y_2(t), Y_3(t)]. From Figure 3 we can see that we are able to correctly infer the dependencies, even for a very small dataset of 50 points. The plots were obtained for a learning rate of 0.001 and λ_enc, λ_dec values of 0.1 and 0.005 respectively; the number of epochs was 1500 with a batch size of 32. We also summarize the performance of this method using 2 metrics, the false discovery rate (FD) and the miss rate (MR), defined as

    FD = \frac{FP}{FP + TP}, \qquad MR = \frac{FN}{FN + TP}    (19)

where FN, FP and TP are the numbers of false negatives, false positives and true positives respectively. Here a positive means that a variable has been discovered to be independent of the output, and a negative means that a variable has been discovered to be related to the output. These data are obtained over 20 independent runs of the model. For the Lorenz96 model, the best result is obtained with λ_enc = 0.…; Figure 4 plots FD versus MR for different values of λ_enc for both systems.
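The metrics in equation (19) can be computed from the pooled results of the independent runs; a small helper might look as follows, with the paper's convention that a 'positive' is a variable declared independent of the output.

    import numpy as np

    def fd_mr(pred_independent, true_independent):
        """FD and MR of equation (19). Boolean arrays with one entry per
        (run, variable); True = declared/actually independent ('positive')."""
        pred = np.asarray(pred_independent)
        true = np.asarray(true_independent)
        tp = np.sum(pred & true)    # correctly declared independent
        fp = np.sum(pred & ~true)   # dependent variable declared independent
        fn = np.sum(~pred & true)   # independent variable declared dependent
        fd = fp / max(fp + tp, 1)   # guard against empty denominators
        mr = fn / max(fn + tp, 1)
        return fd, mr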
IV. CONCLUSION

The proposed NN-based approach is a versatile platform for inferring relationships, especially in complex non-linear systems, since neural networks are a powerful tool for modelling such non-linear functions. Even though it is difficult to infer the exact functional form of a relationship using a NN, this method can help locate functional dependencies between variables in a multivariable system; these variables can then be probed more extensively to find the functional (or approximately functional) form of the relationships. Methods based on sparse regression have been used in the past to find functional relationships; however, they rely on prior knowledge of the set of basis functions to use for the regression. The proposed method has no such requirement and, with a large enough NN, can simulate any complex non-linear function. Besides locating functional relationships, it can also help infer causal relationships in non-linear data, as seen in the discussed example, where it correctly inferred the causal relationships even for a small dataset of 50 samples.
V. ACKNOWLEDGEMENTS
The author would like to thank Akshatha Mohan for helpful comments and critical assessment of the manuscript.

[1] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, Science, 1518 (2011).
[2] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky, Proceedings of the National Academy of Sciences, 6286 (2010).
[3] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Proceedings of the National Academy of Sciences, 3932 (2016).
[4] A. Dembo, A. Kagan, and L. A. Shepp, Bernoulli, 343 (2001).
[5] J. Liu, J. Li, C. Liu, and Y. Chen, IEEE Transactions on Knowledge and Data Engineering, 251 (2012).
[6] Y. Huhtala, The Computer Journal, 100 (1999).
[7] H. W. Lin, M. Tegmark, and D. Rolnick, Journal of Statistical Physics, 1223 (2017).
[8] D. Rolnick and M. Tegmark, arXiv:1705.05502 [cs, stat] (2018).
[9] R. Tibshirani, Journal of the Royal Statistical Society: Series B (Methodological), 267 (1996).
[10] R. Iten, T. Metger, H. Wilming, L. del Rio, and R. Renner, Physical Review Letters, 010508 (2020).
[11] B. S. Rem, N. Käming, M. Tarnowski, L. Asteria, N. Fläschner, C. Becker, K. Sengstock, and C. Weitenberg, Nature Physics, 917 (2019).
[12] A. A. Melnikov, H. Poulsen Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel, Proceedings of the National Academy of Sciences, 1221 (2018).
[13] G. E. Hinton, Science, 504 (2006).
[14] M. Detto, A. Molini, G. Katul, P. Stoy, S. Palmroth, and D. Baldocchi, The American Naturalist, 524 (2012).
[15] J. Runge, J. Heitzig, V. Petoukhov, and J. Kurths, Physical Review Letters, 258701 (2012).
[16] H. Ma, K. Aihara, and L. Chen, Scientific Reports, 7464 (2015).
[17] N. Tishby, F. C. Pereira, and W. Bialek, arXiv:physics/0004057 (2000).
[18] C. Giannella and E. Robertson, Information Systems 29