Bayesian machine scientist to compare data collapses for the Nikuradse dataset
Ignasi Reichardt, Jordi Pallares Marta Sales-Pardo, Roger Guimera
AA Bayesian machine scientist to compare data collapses for the Nikuradse dataset
Ignasi Reichardt, ∗ Jordi Pallar`es, † Marta Sales-Pardo, ‡ and Roger Guimer`a
1, 3, § Department of Chemical Engineering, Universitat Rovira i Virgili, Tarragona 43007, Catalonia, Spain Department of Mechanical Engineering, Universitat Rovira i Virgili, Tarragona 43007, Catalonia, Spain ICREA, Barcelona 08010, Catalonia, Spain (Dated: April 28, 2020)Ever since Nikuradse’s experiments on turbulent friction in 1933, there have been theoretical attempts todescribe his measurements by collapsing the data into single-variable functions. However, this approach, whichis common in other areas of physics and in other fields, is limited by the lack of rigorous quantitative methodsto compare alternative data collapses. Here, we address this limitation by using an unsupervised method to findanalytic functions that optimally describe each of the data collapses for the Nikuradse data set. By descalingthese analytic functions, we show that a low dispersion of the scaled data does not guarantee that a data collapseis a good description of the original data. In fact, we find that, out of all the proposed data collapses, the originalone proposed by Prandtl and Nikuradse over 80 years ago provides the best description of the data so far, andthat it also agrees well with recent experimental data, provided that some model parameters are allowed to varyacross experiments.
In the early 1930s, Johann Nikuradse conducted experi-ments to measure the friction sustained by a turbulent flowin a rough pipe [1]. With remarkable accuracy, he measuredthe dependency of the friction factor f on two dimensionlessquantities, the Reynolds number Re and the relative rough-ness, that is, the ratio r/k between the size of the irregularitiesand the radius of the pipe (Fig. 1). Over eight decades later,and despite the fundamental and practical importance of theproblem, the functional relationship f = h (Re , r / k) remainsunknown.Numerous works [1–7] have attempted to solve the problemby collapsing Nikuradse’s data into a function ¯ f = ¯ h ( x ) thatdepends on a single variable x combining both the Reynoldsnumber (or the so-called turbulent Reynolds number, Re τ )and the relative roughness. For example, Prandtl and Niku-radse [3] proposed the collapse f − / +2 log( r/k ) = ¯ h (Re τ · r / k) , which suggests that a transformed turbulent friction fac-tor only depends on the product of the turbulent Reynoldsnumber and the relative roughness [1, 2].Data collapses such as this are common in many areas, andaim at establishing the combinations of independent variablesthat actually affect the dependent variable, disregarding theexact functional form of the dependency. These approacheshave been particularly useful in critical phenomena, becausethe scaling behavior of critical systems implies that, close tothe critical point, all relevant functions must be generalizedhomogeneous functions and therefore collapse under partic-ularly simple scaling transformations [8, 9]. This realizationhas led to important insights in physics and other disciplines,even for systems that are not at a critical point. For exam-ple, data collapses have enabled the characterization of thestatistical laws that govern the growth of human organizations[10, 11].Perhaps the major challenge of approaches based on scalingand data collapse is the fact that, in principle, many alternativecollapses are possible, especially when the data are noisy. Inthe Nikuradse dataset, for example, different arguments maylead to very different collapses (Fig. 2, Tab. I). In such cases, Re0.20.40.60.8 l og ( f ) FIG. 1. The Nikuradse dataset, displaying the turbulent friction fac-tor f as a function of the Reynolds number Re for different values ofthe inverse relative roughness k/r , which are indicated in the legend. the goodness of a collapse is typically evaluated qualitativelyby how well the empirical data obtained under different condi-tions fall onto a single well-defined “scaling function” ¯ h [12].In the few remarkable instances in which the goodness of thecollapse is quantified, existing methods rely on interpolationof the datasets to approximate the scaling function, and onmeasuring deviations in the collapsed data [13]. Additionally,they allow only to compare a scaling function with differentparameter values, but not to compare scaling functions thatare mathematically different.Here, we argue that the goodness of a data collapse shouldbe measured on the original unscaled data because: (i) thescaling function distorts the data by stretching some regionsand compressing others, thus potentially obscuring deviationsin the collapse; (ii) the distortion can lead to collapse func-tions with very different ranges, thus making comparison be-tween collapses meaningless. We propose a rigorous alterna-tive to evaluate the quality of data collapses, and show thatscaling functions that seem better at collapsing the data oftendo not describe the original unscaled data well. Our approachproceeds by first obtaining the most plausible functional formfor the scaling function ¯ h , for which we use an unsupervisedalgorithm that explores systematically the space of possible a r X i v : . [ phy s i c s . d a t a - a n ] A p r Re ( r / k ) l og ( R e / f ) (a) 10 x1.501.752.002.25 l og ( f ) (f) 10 Re0.20.40.60.8 l og ( f ) (k) G o l d e n f e l d Re + C s Re ( r / k ) l og ( f R e ) (b) 10 x56 l og ( f ) (g) 10 Re0.40.60.8 l og ( f ) (l) T a o Re + C s Re ( r / k ) l og ( f R e ) (c) 10 x56 l og ( f ) (h) 10 Re0.40.60.8 l og ( f ) (m) L i &H u a i k + s f / + l og ( r / k ) (d) 10 x0.51.01.52.0 f (i) 10 Re0.40.60.8 l og ( f ) (n) P r a nd t l k + s U + a v g (e) 10 x051015 f (j) 10 Re0.40.60.8 l og ( f ) (o) S h e e t a l. FIG. 2. (a)-(e) Nikuradse data collapsed according to each approach (Goldenfeld [4], Tao [5], Li and Huai [7], Prandtl [2], and She et al. [6]).(f)-(j) Models for the collapsed data. The solid line represents the most plausible closed-form mathematical model identified by the Bayesianmachine scientist for each collapse. (k)-(o) Unscaled models. Same as the middle column but unscaled to the original variables. functional forms for ¯ h ; following Refs. [14, 15] we call thisalgorithm a machine scientist. Next, we unscale this functionand quantify how it fits the original data, as opposed to thecollapsed data. For the Nikuradse dataset, we find that thebest collapse is the one originally proposed by Nikuradse andPrandtl [1, 2]. We also study what is the applicability of thebest models to recent datasets on turbulent friction in roughpipes.Let us first formalize the problem. The data collapse hy-pothesis posits that, under the appropriate transformation, allthe observed data D follow a single law ¯ f = ¯ h ( x ) [16]. Toidentify the most plausible candidate for ¯ h , we start by con-sidering the probability p ( h i | D ) that an expression h i is the correct one given the data, which can be written as [15] p ( h i | D ) = exp [ −L ( h i )] Z . (1)Here, Z = (cid:80) i exp [ −L ( h i )] is the partition function, and L ( h i ) = − log p ( h i , D ) is the description length of the model[15, 17]. The description length plays the role of a free energyin a physical system, and the model with minimum descrip-tion length is the most plausible one. The description lengthcannot, in general, be calculated exactly; however, it can beapproximated as L ( h i ) = B ( h i ) / − log p ( h i ) , where B ( h i ) is the Bayesian information criterion of expression h i [18, 19]and the prior p ( h i ) is the probability assigned to h i before anydata are observed. This prior acts as an expression regularizer TABLE I. Summary of the data collapses for the Nikuradse dataset.Reference Scaling variable, x Scaling function, ¯ h Most plausible model, h ( x ) Goldenfeld [4] Re / r/k f Re / c log (cid:0) ( c x ) c + x + e c (cid:1) Tao a [5] Re / + C s Re ( r/k ) / f Re c (cid:0) c c − x x + c (cid:1) + x Li&Huai b [7] Re / + C s Re ( r/k ) α f Re c ( c + c x ( c + c x )) Prandtl c [2] Re (cid:112) f/ k/r ) f − / +2log( r/k ) c ( c ( c x + c ) + c x ) She et al. c,d [6] Re (cid:112) f/ k/r ) (1 /κ ) ln(Re (cid:112) f/ )+B-1/ (2 (cid:112) f/ (cid:0) − c + log (cid:0) c x c + c + x (cid:1)(cid:1) e c a C s = 3 × − b C s = 1 × − and α = 1 / η/ , with η = 0 . c The independent variable defined by Prandtl and used by She et al. is denoted k + s in Figs. 2(d) and 2(e). d Following She et al., this function is denoted ∆ U +avg in Fig. 2(e). B is a Reynolds-dependent correction applied for Re (cid:112) f/ < (seedefinition in [6]). and requires certain hypotheses; following previous work [15]we use the maximum entropy p ( h i ) that is consistent with em-pirically observed frequencies of each operation [20].We explore the space of possible mathematical expressionsusing the Metropolis algorithm, by means of what has beencalled a Bayesian machine scientist [15, 20]. The Bayesianmachine scientist draws upon concepts developed in symbolicregression[21], exponential random graphs [22] and Bayesiannetwork sampling [15, 23, 24]; it is guaranteed to asymptot-ically sample from the stationary distribution p ( h i | D ) and isconsistent, that is, given enough data it will assign the high-est plausibility (the shortest description length) to the correctmodel with probability approaching one. Among all mod-els explored by the machine scientist using the Metropolisalgorithm, we select the model with the minimum descrip-tion length as the most plausible one [25]. Specifically, we letthe machine scientist sample expressions for each of the col-lapses ( ¯ f , x ) proposed for Nikuradse’s data (Table I). Fromthese samplings, we obtain the most plausible expression foreach data collapse (Fig. 2(f)-(j)).Despite fitting the collapsed data tightly, most of the func-tions do not reproduce the features of the original data whenunscaled, particularly in the low and intermediate Reynoldsregimes (Fig. 2(k)-(o)). This is important because, as men-tioned earlier, collapse and scaling theories are usually evalu-ated by how close the data appear to be in the collapse. Ourresults for the Nikuradse data show how this can be mislead-ing; some of the data collapses that have been proposed areeffectively a zoom-out , in which the curves for each rough-ness are stretched until the separation in the vertical axis isno longer visible. This can be achieved by multiplying boththe scaling variable and the scaling function by a quantity thatspans a broad range, typically a power of the Reynolds num-ber. As we show here, a function fitting such a stretched curvedoes not necessarily recover the correct f = h (Re , r / k) de-pendency when unscaled.In Fig. 3, we show the mean absolute error (MAE) for eachof the scalings when the original variables are recovered. Un-like the deviations that one measures in the collapsed data, weargue that this quantity is a comparable and reliable measureof the true goodness of the collapse. From this, we conclude Goldenfeld Tao Li&Huai Prandtl She et al.0.000.010.02 M ean ab s o l u t e e rr o r FIG. 3. Mean absolute error (MAE) of the most plausible model foreach data collapse, calculated on the original unscaled variables ofthe Nikuradse dataset (Fig. 2(k)-(o)). The error bars in indicate the95% confidence interval for the mean. that only the data collapses proposed by Prandtl [2] and byShe et al. [6] are accurate representations of the Nikuradsedataset in all regions. The remaining data collapses are onlygood representations of some regimes.The collapses proposed by Prandtl [2] and by She et al. [6]both use as their scaling variable the roughness Reynoldsnumber k + s , and the scaling function depends, again on bothcases, on the inverse square root of f . She et al. argued thatthe spread in the Prandtl-scaled data about k + s ∼ is too largefor it to be considered a good collapse. Therefore, they in-troduced an extra parameter p , which encodes the behaviourat the transitionally rough regime, about k + s ∼ in the scaleddata (see Fig. 5 in [6]). Effectively, p is an exponent that tunesthe sharpness of the transition between the different turbulentregimes with the roughness. Moreover, an ad-hoc correctionwas added to improve the collapse (the so-called B term).This correction applies only to Re τ ≡ Re (cid:112) f / < andthus makes the scaling function defined piecewise, introduc-ing a number of extra parameters. Despite these refinements,our approach suggests that the Prandtl collapse still providessmoother and more accurate unscaled curves.To further understand the validity of the original scaling byPrandtl and the need for the corrections introduced by Sheet al., we finally turn to the main criticism against Prandtl’sscaling, namely that it does not hold for more recent em-pirical data from the Princeton experiments with honed [26]and commercial [27] pipes. These experiments used smootherpipes than those used by Nikuradse ( k/r = 8716 and k/r = x1012 l og ( f ) (a) NikuradseCommercialHoned Re0.00.20.40.60.8 l og ( f ) (c) P r a nd t l x051015 l og ( f ) (b) NikuradseCommercialHoned Re0.00.20.40.60.8 l og ( f ) (d) S h e e t a l. FIG. 4. (a)-(b) Models for the Nikuradse and Princeton (honed pipewith k/r = 8716 and commercial pipe with k/r = 8065 ) datasetscollapsed according to the approaches proposed by Prandtl [2] (top)and She et al. [6] (bottom). The solid lines represent the most plausi-ble closed-form mathematical model identified by the Bayesian ma-chine scientist for all datasets simultaneously. (c)-(d) Unscaled mod-els. , respectively), and explored flows with Re up to × .To investigate the applicability of Prandtl’s collapse to thesedata, we check whether the machine scientist is able to gener-ate a new set of expressions fitting both Nikuradse’s and thePrinceton data (Fig. 4). Since the newer data seem to devi-ate from both Prandtl’s and She et al.’s collapses (Fig. 4), wesearch for a single mathematical expression fitting the threedatasets but with potentially different parameter values, ac-counting for different details arising, for example, from thegeometry of the irregularities in each pipe. This is possiblethanks to the fact that the machine scientist samples mathe-matical expressions and, once an expression is selected, it canbe fit separately to each dataset. The plausibility of the ex-pression is then given by the total description length L ( h i ) = 12 (cid:88) d B d ( h i ) − log p ( h i ) , (2)where B d ( h i ) is the Bayesian information criterion calculatedon dataset d .The most plausible model fitting both the Nikuradse and thePrinceton datasets using Prandtl’s collapse is ¯ f ( x ) = sinh (cid:16) c + ( c + x ) e − ( c x ) c (cid:17) , (3)whereas the most plausible model for She’s collapse is ¯ f ( x ) = (cid:16) c log (cid:16) ( c x c + c ) (cid:17)(cid:17) c . (4)As above, the machine scientist uncovers, for each collapse,several models that are almost equally plausible. Therefore, TABLE II.scaling Data set c c c c P r a nd tl Nikuradse 1.3 -3.6 3.1 0.69Honed 1.3 -4.2 6.0 0.70Commercial 1.3 -14 1.5 × S h e Nikuradse 0.79 0.15 3.6 1.1Honed 0.71 0.14 4.1 1.0Commercial 0.77 0.34 2.7 1.1
Nikuradse Commercial Honed0.0000.0020.0040.0060.008 M ean ab s o l u t e e rr o r PrandtlShe et al.
FIG. 5. Mean absolute error (MAE) of the most plausible model forPrandtl’s and for She et al.’s data collapses, calculated on the originalunscaled variables of the Nikuradse dataset and the two Princetondatasets (right column in Fig. 4). The error bars in indicate the 95%confidence interval for the mean. the models in Eqs. (3) and (4) should be taken as just an in-stance of the models describing each of the data collapses, andit is beyond our aims to discuss their putative physical mean-ing here. However, in both scaling laws we get two parame-ters whose value is shared (or very similar) among pipe types,whereas two others take pipe-specific values over a broaderrange (Tab. II). Therefore, it seems plausible that two param-eters are universal, whereas the other two may be dependenton properties of the material or the irregularities in it. In anycase, the addition of the Princeton experiments does not seemto justify the corrections introduced by She et al. Indeed, ourapproach suggests that the commercial and the honed pipesare equally well described by both data collapses (Fig. 5).Machine learning tools are increasingly applied to shedlight into physics problems [28], from detecting phase tran-sitions [29, 30] to approximating the wave function of many-body quantum systems [31]. Here, we have used a Bayesianmachine scientist [15], which automatically uncovers closed-form mathematical equations from data, to compare data col-lapses for the Nikuradse dataset. The machine scientist en-ables us to sample a wealth of analytic expressions that de-scribe each data collapse, using only probabilistic model-selection arguments and without introducing any heuristics orbiases. We then use the expressions obtained by the Bayesianmachine scientist to evaluate the goodness of the models onthe unscaled data; as we show, the common practice of evalu-ating goodness of fit on the collapsed data leads to misleadingconclusions.In the Nikuradse dataset, our approach favors the origi-nal data collapse proposed by Prandtl over 80 years ago and,conversely, disfavors the data collapses proposed more re-cently [4–7]. Our method suggests that these more recent ap-proaches produce an apparent collapse of the data mainly by compressing the transition region between flow regimes; theyonly describe the experimental data in limiting regimes butnot in these transition regions (although it is fair to note thattheir goal was, precisely, to describe some limiting behaviorsand not all regimes). Our approach also shows that, contraryto what has been suggested, Prandtl’s collapse is compatiblewith more recent experiments of turbulent friction, providedthat some parameters in the scaling function are allowed todepend on the details of the pipe.Although here we have focused on the Nikuradse dataset,we think that our approach can be applied to other systemson which data collapses are used to understand the underlyingphysics or mechanisms. Scalings and data collapses are, in-deed, powerful tools; we argue that, combined with rigorousmodel selection and machine learning, they have the potentialto become even more insightful.We thank A. Arenas for pointing us towards the Nikuradsedataset. This project has received funding from the SpanishMinisterio de Economia y Competitividad (FIS2015-71563-ERC, FIS2016-78904-C3-P-1, and DPI2016-75791-C2-1-P)and from the Government of Catalonia (2017SGR-896). ∗ [email protected] † [email protected] ‡ [email protected] § [email protected][1] J. Nikuradse, Tech. Mem. , 60 (1933).[2] L. Prandtl, (1933).[3] In his original paper, Nikuradse already presented his datascaled according to Prandtl’s collapse. However, this collapseis mentioned in [6], where Bodenschatz attributes it to Prandtl,Nikuradse’s supervisor.[4] N. Goldenfeld, Phys. Rev. Lett. , 044503 (2006).[5] J. Tao, Phys. Rev. Lett. , 264502 (2009).[6] Z.-S. She, Y. Wu, X. Chen, and F. Hussain, New J. Phys. ,093054 (2012). [7] S. Li and W. Huai, PLoS ONE , e0154408 (2016).[8] H. E. Stanley, Rev. Mod. Phys. , S358 (1999).[9] G. I. Barenblatt, Scaling , Cambridge Texts in Applied Mathe-matics (Cambridge University Press, 2003).[10] M. H. R. Stanley, L. A. N. Amaral, S. V. Buldyrev, S. Havlin,H. Leschhorn, P. Maass, M. A. Salinger, and H. E. Stanley,Nature , 804 (1996).[11] Y. Lee, L. A. N. Amaral, M. Meyer, D. Canning, and H. E.Stanley, Phys. Rev. Lett. , 3275 (1998).[12] We refer to ¯ h as the scaling function although, strictly speaking,this function is not always a generalized homogeneous function.[13] S. M. Bhattacharjee and F. Seno, J. Phys. A , 6375 (2001).[14] J. Evans and A. Rzhetsky, Science , 399 (2010).[15] R. Guimer`a, I. Reichardt, A. Aguilar-Mogas, F. A. Massucci,M. Miranda, J. Pallar`es, and M. Sales-Pardo, Sci. Adv. ,eaav6971 (2020).[16] In general, x could be a vector of independent variables insteadof just one variable as in Nikuradse’s data.[17] P. D. Gr¨unwald, The Minimum Description Length Principle (The MIT Press, Cambridge, Massachusetts, 2007).[18] G. Schwarz, Ann. Stat. , 461 (1978).[19] T. Ando, Bayesian model selection and statistical modeling (CRC Press, 2010).[20] See Supplemental Material, which includes Ref. [32].[21] M. Schmidt and H. Lipson, Science , 81 (2009).[22] A. Caimo and N. Friel, Soc. Netw. , 41 (2011).[23] S. Horv´at, E. Czabarka, and Z. Toroczkai, Phys. Rev. Lett. ,158701 (2015).[24] R. Fischer, J. C. Leit˜ao, T. P. Peixoto, and E. G. Altmann, Phys.Rev. Lett. , 188701 (2015).[25] In practice, all sampled expressions describe the data similarlywell, and none of the results below depend on which expres-sions we choose.[26] M. A. Shockling, J. J. Allen, and A. J. Smits, J. Fluid Mech. , 267285 (2006).[27] L. I. Langelandsvik, G. J. Kunkel, and A. J. Smits, J. FluidMech. , 323339 (2008).[28] L. Zdeborov´a, Nat. Phys. , 420 (2017).[29] J. Carrasquilla and R. G. Melko, Nat. Phys. , 431 (2017).[30] E. P. L. van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Nat.Phys. , 435 (2017).[31] G. Carleo and M. Troyer, Science , 602 (2017).[32] D. J. Earl and M. W. Deem, Phys. Chem. Chem. Phys.7