Data-driven formulation of natural laws by recursive-LASSO-based symbolic regression
Yuma Iwasaki* & Masahiko Ishida
System Platform Research Laboratories, NEC Corporation, Kawasaki 211-8666, Japan
PRESTO, JST, Saitama 322-0012, Japan
NEC-AIST Quantum Technology Cooperative Research Laboratory, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8568, Japan
Abstract
Discovery of new natural laws has long relied on the inspiration of a few geniuses. Recently, however, machine learning technologies, which analyze big data without human prejudice or bias, have been expected to find novel natural laws. Here we demonstrate that our proposed machine learning method, recursive-LASSO-based symbolic (RLS) regression, enables data-driven formulation of natural laws from noisy data. RLS regression recurrently repeats feature generation and feature selection, eventually constructing a data-driven model with highly nonlinear features. This data-driven formulation method is quite general and can thus discover new laws in various scientific fields.
Introduction
The formulation of natural laws, describing the world by mathematical expressions, has been based on data observation and deep consideration by scientists. For example, there is a story that Isaac Newton observed an apple falling from a tree and, after deep deliberation, conceived the law of universal gravitation (there are also other versions of this story). Natural laws have been formulated by only a small number of geniuses. Recently, machine learning modeling using big data has been expected to contribute to data-driven discovery of new natural laws, because data on natural phenomena (e.g., physical, chemical, biological, and astronomical phenomena) are being accumulated rapidly all over the world. Data-driven approaches by machine learning have the potential to excavate from such big data novel rules that we have never noticed. For instance, Baldi et al. used deep learning to construct a high-energy-physics model of particle colliders. Their deep learning model was highly predictive, which means that data-driven model construction was made successful by machine learning. However, it is difficult for scientists to understand phenomena represented by deep learning models because the models are hard to interpret: a deep learning model is expressed as connections between large numbers of perceptrons (neurons), and its interpretation by humans is not feasible. A natural law is discovered only when humans recognize and understand a phenomenon modeled as a simple mathematical expression, so the process of machine learning modeling is in itself insufficient for discovering new physical rules and laws. For the modeling of natural laws, it is therefore ideal to construct a data-driven model as a nonlinear mathematical expression. If machine learning automatically generates such a nonlinear mathematical expression model from observed data without prior knowledge, it can directly discover a new natural law.
Symbolic regression techniques using a genetic algorithm (GA) allow constructing a data-driven nonlinear mathematical expression model. Here we propose a simple symbolic regression method, recursive-LASSO-based symbolic regression (referred to here as the RLS regression method), for constructing a data-driven nonlinear mathematical expression model, and we demonstrate that the proposed method can perform data-driven formulation of natural laws from noisy data. Figure 1 gives a rough explanation of the RLS regression, showing iteration of a nonlinear-feature-generation step and a feature-selection step. For simplicity of explanation, we here assume only three original features X = {x_1, x_2, x_3}. RLS regression first makes nonlinear features using the four basic arithmetic operations (and squaring) with the original features X. Figure 2a shows the generated nonlinear features. Second, feature selection (LASSO) is performed with these features, and the selected features are regarded as new features X′. Here we assume that three features X′ = {x_1², x_1x_2, x_2 + x_3} are selected. Again, the RLS regression makes nonlinear features by using both the original features X and the new features X′, with de-duplication (e.g., x_1 · x_1 duplicates the already-selected x_1²). Figure 2b shows the secondary nonlinear features. Note that the features generated second (Fig. 2b) are more complex than those generated first (Fig. 2a). The RLS regression searches for proper nonlinear features automatically by repeating this cycle recursively. Details of this algorithm and parameter settings are provided in the Methods section.

Results
As simple examples, we demonstrate that the RLS regression can represent two nonlinear natural phenomena. The parameter settings of the RLS regression, given in the Methods section, are the same for both example cases. The data sets used are synthetic, with Gaussian noise (standard deviation σ).

Data-driven formulation of small particles falling.
Figure 3a shows an illustration of a falling particle in a fluid. The particle radius D_p and particle density ρ_p are respectively 0.0001–0.0002 m and 5000–6000 kg/m³ (values shown here as random-value ranges). The fluid density ρ_f is 1000–2000 kg/m³ and the fluid viscosity η is 0.0008–0.012 Pa·s. The gravitational acceleration g is assumed to be 9.8 m/s². The number of data sets (observations) is 1000. The terminal velocity v_t (m/s) includes Gaussian noise with standard deviation σ = 0.005. With D_p, ρ_p, ρ_f, and η, we tried formulating terminal-velocity models v_t = f(D_p, ρ_p, ρ_f, η) by the RLS regression method. For comparison, LASSO and neural network (NN) models, which do not need a manual listing of potential constant values and potential nonlinear features beforehand, were also constructed. Figures 3b, c, and d show the accuracy of the LASSO, NN, and RLS regression models, respectively. The four-fold cross-validation mean absolute errors (CV-MAEs) of the LASSO, NN, and RLS regression models are … × 10⁻³, … × 10⁻³, and … × 10⁻³, respectively. Thus, although the LASSO makes an interpretable mathematical model as shown in Fig. 3e, its CV-MAE is larger than those of the NN and RLS regression models. This is to be expected because the modeled phenomenon is not expressed by a simple linear combination of D_p, ρ_p, ρ_f, and η. The NN model, on the other hand, has better prediction accuracy because the highly flexible model can express this nonlinear relationship. As shown by the inset in Fig. 3c, however, it is quite difficult for a person to interpret the complex NN model expressed as connections between large numbers of perceptrons (neurons). The prediction accuracy of the RLS regression model is comparable to that of the NN model. Moreover, we can understand the interpretable RLS regression model, which accordingly is of great help in understanding the physics of the modeled phenomenon.
When we ignore small-regression-coefficient terms, the RLS regression model becomes as follows:

v_t = −0.0039 + 0.602 D_p²(ρ_p − ρ_f)g / η    (1)

This model is consistent with a known natural law, Stokes' law:

v_t = D_p²(ρ_p − ρ_f)g / (18η)    (2)

Thus the RLS regression method can construct a data-driven nonlinear mathematical expression model describing Stokes' law without any prior knowledge. Although the NN model's high prediction accuracy suggests that it too represents Stokes' law, scientists cannot understand the inside of the model, so it does not inspire them to investigate the modeled phenomenon.
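Consistency with equation (2) can be spot-checked numerically: regressing synthetic noisy terminal velocities on the single candidate feature D_p²(ρ_p − ρ_f)g/η should recover a slope near 1/18 ≈ 0.056. The following is our own minimal sketch, not the authors' pipeline; the stated parameter ranges are sampled uniformly since the exact sampling scheme is not specified.

```python
import numpy as np

rng = np.random.default_rng(4)
n, g = 1000, 9.8

# Uniform samples from the ranges stated in the text.
D_p   = rng.uniform(1e-4, 2e-4, n)     # particle radius, m
rho_p = rng.uniform(5000, 6000, n)     # particle density, kg/m^3
rho_f = rng.uniform(1000, 2000, n)     # fluid density, kg/m^3
eta   = rng.uniform(0.0008, 0.012, n)  # fluid viscosity, Pa*s

# Noisy terminal velocities generated by Stokes' law (equation 2).
feature = D_p**2 * (rho_p - rho_f) * g / eta
v_t = feature / 18 + rng.normal(0.0, 0.005, n)

# Least-squares slope through the origin; should be close to 1/18.
slope = feature @ v_t / (feature @ feature)
print(slope)
```

With σ = 0.005 the recovered slope agrees closely with 1/18, which is the sense in which a data-driven model such as equation (1) can be checked against equation (2).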
Data-driven formulation of rigid-body collision.
To demonstrate a case with a larger number of original features, we next model a rigid-spherical-body (ball) collision phenomenon, as shown in Fig. 4a. First, a red ball with mass m_1 (1–30 g) is rolling on the hill at a velocity v_1 (0.01–0.3 m/s). At the foot of the slope, a blue ball with mass m_2 (1–10 g) is rolling at a velocity v_2 (0.01–0.1 m/s). After a few seconds, the red ball accelerates down the slope and collides with the blue ball. Finally, the velocities of the red and blue balls become v_1′ (0.01–1 m/s) and v_2′ (0.01–3 m/s), respectively. The height of the hill is L, including Gaussian noise with σ = 0.03. The number of data sets (observations) is 1000. The gravitational acceleration g is assumed to be 9.8 m/s². A green ball with mass m_3 (1–50 g) just stays on the hill from beginning to end. With m_1, m_2, m_3, v_1, v_2, v_1′, and v_2′, we tried formulating the data-driven model L = f(m_1, m_2, m_3, v_1, v_2, v_1′, v_2′) by the RLS regression method. For comparison, LASSO and NN models were also constructed. Figures 4b, c, and d show the accuracies of the LASSO, NN, and RLS regression models. The LASSO and RLS regression model equations are shown in Fig. 4e. The LASSO correctly removed the m_3 term, which clearly has nothing to do with the collision phenomenon shown in Fig. 4a, by feature selection with L1 regularization. However, the CV-MAE of the LASSO model is high because this phenomenon cannot be expressed by a simple linear combination of m_1, m_2, m_3, v_1, v_2, v_1′, and v_2′. The NN and RLS regression models have better prediction accuracy because these highly flexible models can express nonlinear relationships between variables. The NN model, whose visualization is shown in Fig. 4c, has low interpretability, while the RLS regression model is highly interpretable, and thus we can easily investigate the physics of the phenomenon. Simplifying the RLS regression model by ignoring the small-regression-coefficient term and the intercept term, it becomes as follows:
L = 0.058(v_1′² − v_1²) + 0.059(m_2/m_1)v_2′² − 0.06(m_2/m_1)v_2²    (3)

This model is consistent with the correct-answer equation derived by applying the law of conservation of energy, where the kinetic energy (½mv²) and the potential energy (mgL) are conserved before and after the collision:

m_1gL + ½m_1v_1² + ½m_2v_2² = ½m_1v_1′² + ½m_2v_2′²

L = (1/(2g))(v_1′² − v_1²) + (m_2/(2m_1g))v_2′² − (m_2/(2m_1g))v_2²    (4)

We demonstrated that the RLS regression allows data-driven models to be constructed as interpretable mathematical expressions. Because we need to confirm that the RLS regression models are consistent with the physically true models, in this paper we show physically known phenomena as examples. When the RLS regression method is applied to an unsettled issue (e.g., a chaotic phenomenon, an astronomical phenomenon, an unknown compound), it may find a new rule expressed mathematically.
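Equation (4) can be verified directly: for any frictionless descent followed by a one-dimensional elastic collision, the height recovered from the observed velocities must equal the true height. The following small check is our own construction, treating the balls as point masses and using the standard elastic-collision formulas (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
g = 9.8

for _ in range(100):
    m1, m2 = rng.uniform(0.001, 0.030), rng.uniform(0.001, 0.010)  # kg
    v1, v2 = rng.uniform(0.01, 0.3), rng.uniform(0.01, 0.1)        # m/s
    L_true = rng.uniform(0.05, 0.5)                                # m

    # Red ball's speed at the foot of the slope (energy conservation).
    u1 = np.sqrt(v1**2 + 2 * g * L_true)

    # One-dimensional elastic collision of the two balls.
    v1p = ((m1 - m2) * u1 + 2 * m2 * v2) / (m1 + m2)
    v2p = ((m2 - m1) * v2 + 2 * m1 * u1) / (m1 + m2)

    # Equation (4): recover the height L from the observed velocities.
    L_rec = (v1p**2 - v1**2) / (2 * g) + m2 * (v2p**2 - v2**2) / (2 * m1 * g)
    assert abs(L_rec - L_true) < 1e-9
```

Since the elastic collision conserves kinetic energy exactly, the recovered L matches the true height to floating-point precision, confirming the algebra behind equation (4).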
Discussion
Figures 5a and b show the prediction accuracies of the NN and RLS regression models for no-noise data (σ = 0) in the rigid-body-collision case study. The prediction accuracy of the RLS regression model was better than that of the NN model. This is to be expected because the RLS regression model succeeded in describing the physically true model. In this case, the RLS regression model was created as

L = −0.00000805 − 0.051(v_1² − v_1′²) − 0.051(m_2/m_1)v_2′(v_2 − v_2′) − 0.051(m_2/m_1)v_2(v_2 − v_2′)    (5)

and simplified by ignoring the very small intercept term:

L = 0.051(v_1′² − v_1²) + 0.051(m_2/m_1)v_2′² − 0.051(m_2/m_1)v_2²    (6)

Even considering the values of the coefficients (1/(2g) ≈ 0.051), this data-driven model is almost identical to the physically true model (equation 4). The NN model, on the other hand, is only an approximation whose basis functions are sigmoid and linear ones. Figures 5c and d show the prediction accuracies of the NN and RLS regression models for large-noise data (σ = 0.15). The equation of the RLS regression model is as follows:

L = 0.131 − 0.082(m_2/m_1)(v_2 − v_2′²) − 0.015(v_1′² − v_1²)v_1′ + 0.003(m_2/m_1)v_2′²    (7)

Although the RLS regression failed to construct the physically true model (equation 4) because of the too-large noise, the prediction accuracy of the RLS regression model is comparable to that of the NN model. In summary, we proposed the recursive-LASSO-based symbolic (RLS) regression method, in which the generation of nonlinear features and feature selection are repeated recursively so that, finally, highly nonlinear features are created automatically. Using this method, we demonstrated that two simple natural laws including highly nonlinear relationships can be formulated in a data-driven way as interpretable mathematical expressions without any prior knowledge. The RLS regression model can give us valuable input and impetus for constructing new physical rules leading to novel scientific discoveries.
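The noise behaviour reported here can be reproduced in miniature. If the true nonlinear features of equation (4) are given, a sparse linear fit recovers the physical coefficient 1/(2g) ≈ 0.051 from clean data and drifts away as σ grows. This sketch is our own simplification (scikit-learn's Lasso on pre-built features, with independently sampled velocities) rather than the authors' glmnet setup:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, g = 1000, 9.8

m1  = rng.uniform(0.001, 0.030, n)  # kg
m2  = rng.uniform(0.001, 0.010, n)
v1  = rng.uniform(0.01, 0.3, n)     # m/s
v2  = rng.uniform(0.01, 0.1, n)
v1p = rng.uniform(0.01, 1.0, n)
v2p = rng.uniform(0.01, 3.0, n)

# The three true nonlinear features of equation (4).
X = np.column_stack([v1p**2 - v1**2,
                     (m2 / m1) * v2p**2,
                     (m2 / m1) * v2**2])

def fitted_coefs(sigma):
    """Heights from equation (4) plus Gaussian noise, then a sparse fit."""
    y = X @ np.array([1.0, 1.0, -1.0]) / (2 * g) + rng.normal(0.0, sigma, n)
    return Lasso(alpha=1e-8, max_iter=200_000).fit(X, y).coef_

print(fitted_coefs(0.0))   # close to [0.051, 0.051, -0.051]
print(fitted_coefs(0.15))  # visibly degraded
```

With σ = 0 the three coefficients land on 1/(2g) with matching signs; with σ = 0.15 the small-magnitude v_2² feature is the first to be estimated badly, mirroring how the large-noise model (equation 7) loses the true functional form while remaining reasonably predictive.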
Methods

RLS regression.
The recursive-LASSO-based symbolic (RLS) regression method can construct a data-driven nonlinear mathematical expression model. The method iterates a nonlinear-feature-generation step and a feature-selection step. For simplicity of explanation, we here assume only three original features X = {x_1, x_2, x_3}. In the nonlinear-feature-generation step, nonlinear features X_g are created by the four basic arithmetic operators (and the squaring operator) with the original explanatory variables X. Figure 2a is the list of X_g. In the feature-selection step, X_g is screened by a LASSO-like algorithm in which the evaluation function of LASSO is modified as follows:

‖y − X_g β‖² + λ Σ_{m=1}^{M} A_m(φ)|β_m|    (8)

Here y, X_g, β, and λ are respectively the objective-variable vector, the design matrix of the nonlinear features X_g, the regression-coefficient vector, and a regularization parameter. M is the number of nonlinear features X_g, and φ is the number of occurrences of original features (x_1, x_2, x_3) in each nonlinear feature; for example, φ of x_1, x_1x_2x_3, and x_1x_2²x_3²/x_1 are respectively 1, 3, and 6. Through φ, the factor A_m(φ) controls the magnitude of the regularization for each nonlinear feature. Here we just use a step function:

A_m(φ) = { 10 (φ > φ′), 1 (φ ≤ φ′) }    (9)

In this study, φ′ was set to the number of original features (i.e., φ′ was 4 for the small-particle-falling case study and 7 for the rigid-body-collision case study). The function A_m(φ) prevents the selection of nonlinear features that are too complex (e.g., deeply nested ratios of products and sums of the original features). Exploring a model search space including features that are too complex is a waste of time, because equations of natural laws are often simple as well as nonlinear. Nonlinear features X′, some of which can be a bit more complex than the original features X, are selected in the feature-selection step by minimizing equation (8).
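The per-feature penalty A_m(φ) in equation (8) is what glmnet's penalty.factor implements. In scikit-learn the same effect can be emulated by column scaling, since penalizing |β_m| by a weight w_m is equivalent to running plain LASSO on x_m/w_m and dividing the fitted coefficient by w_m afterwards. A sketch of that emulation (our own code, not the authors'; the data and weights are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, weights, alpha):
    """LASSO with per-feature penalty weights, as in equation (8):
    minimize (1/2n)||y - Xb||^2 + alpha * sum_m w_m * |b_m|
    (scikit-learn's parameterization of the LASSO objective)."""
    w = np.asarray(weights, dtype=float)
    model = Lasso(alpha=alpha, max_iter=100_000).fit(X / w, y)
    return model.coef_ / w  # undo the column scaling

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.5 * X[:, 2]  # feature 2 is genuinely useful...

# ...but if the step function of equation (9) marks it as "too complex"
# (weight 10), it is shrunk far harder than under a uniform penalty.
flat = weighted_lasso(X, y, weights=[1, 1, 1], alpha=0.03)
step = weighted_lasso(X, y, weights=[1, 1, 10], alpha=0.03)
print(flat, step)
```

Under the A_m-style weights, the heavily penalized feature's coefficient collapses toward zero while the cheap features keep fitting the data, which is how equation (8) biases the search toward simple expressions.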
Here we assume that three terms X′ = {x_1², x_1x_2, x_2 + x_3} are selected. Again, the RLS regression makes nonlinear features by using both the original features X and the selected nonlinear features X′, with de-duplication (e.g., x_1 · x_1 duplicates the already-selected x_1²). Figure 2b shows the nonlinear features generated in the second feature-generation step. Note that they are a bit more complex than the features generated in the first one (Fig. 2a). The RLS regression automatically searches for proper nonlinear features by recursively repeating this cycle. Using the 4-fold cross-validation error, a final model is selected from among the models (with their X′) created in each iteration. If the number of features in X′ is too large, the next nonlinear-feature-generation and feature-selection steps take a long time to compute; we therefore impose an upper limit K on the number of features in X′ in the LASSO-like algorithm. The parameters of the RLS regression algorithm are the same in both case studies (small-particle falling and rigid-body collision). The upper limit K was set to 40. The threshold φ′ in A_m(φ) is 2 + the number of x. For the A_m(φ) setting, the "penalty.factor" argument of the cv.glmnet function in the glmnet package of the R programming language was used. For the 4-fold cross-validation, the mean absolute error (MAE) was used. This RLS regression algorithm has room for customization, and here we mention three examples. The first would be to expand the nonlinearity: in the nonlinear-feature-generation step we consider just the four simple arithmetic operators and the square operator, but we could include other operators, such as those for exponential and trigonometric functions, so that the RLS regression could make more complex features such as x_1e^{x_2x_3} and x_1cos(x_2). The second customization would be to refine A_m(φ); in this study we used a quite simple function, the step function defined in equation (9).
However, we could design a more advanced A_m(φ) for more effective model exploration. The third customization would be to employ other feature-selection methods instead of the LASSO-like algorithm; for example, the maximal information coefficient (MIC) could be used in the feature-selection part to improve the RLS regression modeling.

References
1. Curtarolo, S. et al. AFLOW.ORG: A distributed materials properties repository from high-throughput ab initio calculations. Comput. Mater. Sci.
2. APL Materials.
3. J. Phys. Chem. C.
4. J. Phys. Mater.
5. Nucleic Acids Res. D712–D717 (2011).
6. Maglott, D., Ostell, J., Pruitt, K. D. & Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. D52–D57 (2011).
7. Gardner, D. et al. The Neuroscience Information Framework: a data and knowledge environment for neuroscience. Neuroinformatics.
8. Nucleic Acids Res. D695–D699 (2012).
9. Kanz, C. et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. D29–D33 (2005).
10. Zdeborova, L. New tool in the box. Nat. Phys. 420–421 (2017).
11. Rodriguez-Nieva, J. F. & Scheurer, M. S. Identifying topological order through unsupervised machine learning. Nat. Phys.
12. Nat. Commun.
13. Science.
14. Sci. Adv. eaay2631 (2020).
15. McRee, R. K. Symbolic regression using nearest neighbor indexing. Proceedings of the 12th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO '10, pp. 1983–1990 (2010).
16. Stijven, S., Minnebo, W. & Vladislavleva, K. Separating the wheat from the chaff: on feature selection and feature importance in regression random forests and symbolic regression. Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO '11, pp. 623–630 (2011).
17. McConaghy, T. FFX: fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX (eds Riolo, R., Vladislavleva, E. & Moore, J.) (Springer, 2011).
18. Arnaldo, I., O'Reilly, U.-M. & Veeramachaneni, K. Building predictive models via feature synthesis. Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation (ACM, 2015), pp. 983–990.
19. Brunton, S. L., Proctor, J. L. & Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. U.S.A.
20. Chaos.
21. Proceedings of the International Multiconference of Engineers and Computer Scientists.
22. Phys. Rev. Lett.
23. Stokes, G. G. Trans. Camb. Phil. Soc., 8–106 (1851).
24. Hastie, T., Tibshirani, R. & Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations (CRC Press, 2015).
25. Speed, T. A correlation for the 21st century. Science.
Acknowledgements
This work was supported by JST-PRESTO βAdvanced Materials Informatics through Comprehensive Integration among Theoretical, Experimental, Computational and Data-Centric Sciencesβ (Grant No. JPMJPR17N4).
Figures
Figure 1 | Overview of data-driven formulation using the RLS regression.
Without prior knowledge of natural science, machine learning can construct a data-driven model from observed big data. The proposed machine learning method, recursive-LASSO-based symbolic (RLS) regression, repeats feature-generation and feature-selection steps recursively and constructs interpretable mathematical expression models of highly nonlinear phenomena.
Figure 2 | Explanation of the RLS regression algorithm.
a. When we assume three original features X = {x_1, x_2, x_3}, the list of nonlinear features generated by the first feature-generation step using the three original features. After that, feature selection, e.g., LASSO, is carried out over the 21 features. b. Assuming that the three features X′ = {x_1², x_1x_2, x_2 + x_3} are selected by the first feature-selection step, the list of features generated by the second feature-generation step using the original features X = {x_1, x_2, x_3} and the features X′ = {x_1², x_1x_2, x_2 + x_3} selected by the first feature-selection step. By repeating this cycle recursively, the RLS regression automatically searches for proper nonlinear features. Details of this algorithm and parameter settings are provided in the Methods section.

Figure 3 | Data-driven formulation of small particles falling in fluid. a.
Illustration of a falling particle with the particle radius D_p, particle density ρ_p, fluid density ρ_f, fluid viscosity η, and particle terminal velocity v_t. The prediction accuracy of the data-driven model v_t = f(D_p, ρ_p, ρ_f, η) using b. LASSO, c. a neural network (NN; the model visualization is also shown), and d. RLS regression. The NN and RLS regression models have better prediction accuracy than the LASSO model. e. Equations of the LASSO and RLS regression models. The RLS regression constructs a data-driven model including highly nonlinear features; in this case, Stokes' law was automatically constructed as the second term of the RLS regression model.

Figure 4 | Data-driven formulation of rigid-body collision. a.
Illustration of rigid-body (ball) collision. First, a red ball with mass m_1 is rolling on the hill at a velocity v_1. At the foot of the slope, whose height is L, a blue ball with mass m_2 is rolling at a velocity v_2. After a few seconds, the red ball accelerates down the slope and collides with the blue ball. Finally, the velocities of the red and blue balls become v_1′ and v_2′, respectively. The prediction accuracy of the data-driven model L = f(m_1, m_2, m_3, v_1, v_2, v_1′, v_2′) using b. LASSO, c. a neural network (NN; the model visualization is also shown), and d. RLS regression. The NN and RLS regression models have better prediction accuracy than the LASSO model. e. Equations of the LASSO and RLS regression models. The RLS regression constructs a data-driven model including highly nonlinear features; in this case, the law of conservation of energy was automatically constructed in the RLS regression model.

Figure 5 | Investigation of noise robustness of the RLS regression model.
In the rigid-body-collision case study, the noise robustness of the NN and RLS regression models was investigated. a. The prediction accuracy of the NN model for no-noise data (σ = 0). b. The prediction accuracy of the RLS regression model for no-noise data (σ = 0). The RLS regression model has quite high prediction accuracy because it is a physically true model. c. The prediction accuracy of the NN model for large-noise data (σ = 0.15). d. The prediction accuracy of the RLS regression model for large-noise data (σ = 0.15). The prediction accuracy of the RLS regression model is comparable to that of the NN model.