Inferring temporal dynamics from cross-sectional data using Langevin dynamics
Pritha Dutta a,b,∗, Rick Quax c,d,∗∗, Loes Crielaard c,e, Peter M.A. Sloot b,c,f

a Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore 637335, Singapore
b Complexity Institute Singapore, Nanyang Technological University, Singapore 637335, Singapore
c Institute for Advanced Study, University of Amsterdam, Amsterdam 1012 GC, The Netherlands
d Computational Science Lab, University of Amsterdam, Amsterdam 1098 XH, The Netherlands
e Department of Public Health, Amsterdam UMC, University of Amsterdam, Amsterdam Public Health Research Institute, Meibergdreef 9, Amsterdam, The Netherlands
f HPC Institute, ITMO University, St. Petersburg, The Russian Federation
Abstract
Cross-sectional studies are widely prevalent since they are more feasible to conduct compared to longitudinal studies. However, cross-sectional data lack the temporal information required to study the evolution of the underlying processes. Nevertheless, this is essential to develop predictive computational models, which is the first step towards causal modelling. We propose a method for inferring computational models from cross-sectional data using Langevin dynamics. This method can be applied to any system that can be described as effectively following a free energy landscape, such as protein folding, stem cell differentiation and reprogramming, and social systems involving human interaction and social norms. A crucial assumption in our method is that the data-points are gathered from a system in (local) equilibrium. The result is a set of stochastic differential equations which capture the temporal dynamics, by assuming that groups of data-points are subject to the same free energy landscape and amount of noise. Our method is a 'baseline' method which initiates the development of computational models which can be iteratively enhanced through the inclusion of expert knowledge. We validate the proposed method against two population-based longitudinal datasets and observe significant predictive power in comparison with random choice algorithms. We also show how the predictive power of our 'baseline' model can be enhanced by incorporating domain expert knowledge. Our method addresses an important obstacle for model development in fields dominated by cross-sectional datasets.
Keywords: cross-sectional data, predictive computational models, pseudo-longitudinal data, Langevin dynamics
1. Introduction
Longitudinal studies require a huge investment in terms of time, money, and effort, depending on the system studied. For instance, biological experiment techniques such as sequencing-based assays destroy

∗ Corresponding author; [email protected]
∗∗ Authors contributed equally

Figure 1: The proposed procedure to infer a computational model from a cross-sectional dataset. (A) The energy landscape F(x) = −ax² + bx⁴ is used to generate a surrogate dataset, using the true probability density (solid red line). The dashed black line is the estimated probability density. (B) From the estimated probability density and the Boltzmann distribution we estimate the free energy landscape F(x). The red ball represents a data-point moving downslope towards the attractor, as shown by the red arrow. (C) The force field, which is the derivative of the free energy landscape, leads to the deterministic dynamics of the Langevin equation. The red arrows show the direction in which a data-point in any of the four partitions will be forced to move by the force field. The regions around 1 and −1 correspond to the two attractors. (D) The resulting predicted dynamics of a single data-point and the occasional transition from one attractor to the other, across the tipping point at x = 0. The inset figure is the histogram of the values of this data-point across time. The histogram almost captures the shape of the histogram of the cross-sectional data shown in (A) and is slightly different because we have taken a short time range to improve visibility of the predicted dynamics.

dynamics from cross-sectional data and highlight the assumptions. We validate the technique using two population-based longitudinal datasets, where the first time-point is used as the cross-sectional data and the subsequent time-points are used for model validation.

2. Methods

We consider that each data-point, denoted as x, is vector-valued and can change over time.
A data-point represents a set of attributes (such as BMI) of an individual. We assume that all data-points in a cross-sectional dataset of size N have had sufficient time to mix and explore the state space by the time at which they are observed. Thus, we assume that even if there was a major perturbation, a system of data-points has converged to a stable distribution at the time of our observation. This implies that all the data-points follow a probability distribution, p(x), which is stationary. Further, this implies that N is large enough to effectively estimate p(x) from the dataset.

In general it is not possible to derive the temporal evolution, dx/dt, from the stationary probability distribution p(x). This is because there are multiple dx/dt which can lead to the same stationary distribution. For example, a data-point rotating clockwise and another data-point rotating anti-clockwise both have the same circular stationary distribution, but their dx/dt are different. We solve this problem by assuming that each data-point tends to follow the same free energy landscape, F(x), in a 'downslope' manner, i.e., moves in time in the direction of −dF/dx. F(x) effectively assigns an energy value to each possible vector x. The assumption is that systems have a tendency to minimize their energy, although random fluctuations may sometimes have the opposite effect. The minimum energy states correspond to the attractors in F(x). This assumption could be valid for social systems where individuals tend to follow norms, adhere to social conventions, and identify themselves with groups [11, 12]. These influences could lead individuals to move towards the same attractors.
Thus, in these applications, the probability distribution of data-points could be used to make predictions, whereas in other applications this may not be the case.

A free energy landscape can be considered analogous to an uneven hillside, and the data-points in the landscape are balls rolling down the hillside. The balls will eventually come to rest in a stable state in the valleys. These valleys are the attractors in the free energy landscape. Trajectories across the landscape correspond to the progression of the data-points. Our method estimates the free energy landscape based on the probability distribution of the data-points. Roughly speaking, the free energy landscape is approximately the inverse of the probability distribution of the data-points. Thus, the attractors in this estimated energy landscape will approximately correspond to the peaks in the probability distribution, which correspond to the norms. For instance, a free energy landscape estimated from a probability distribution of the body weights of a group of individuals will have attractors that correspond approximately to the most probable weights in that group.

The energy landscape, F(x), can be derived from p(x) through the Boltzmann distribution [19],

βF(x) = −log p(x). (1)

Here, the constant β is interpreted as the inverse of temperature, or noise level: the lower the value of β, the larger the effect of random fluctuations on x, and thus the lower the influence of F on the data-points. The assumption behind this relation is that the data-points are non-interacting. The other assumption, i.e., that the data-points are in equilibrium, is already met by the assumption of p(x) having reached stationarity.

In addition to this deterministic tendency given by F, there is a random movement (noise) which is independent of F and is uncorrelated over time, defined by a Wiener process W(t) [20].
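To make the inversion in Eq. 1 concrete, the sketch below estimates p̂(x) with a Gaussian kernel and takes F̂(x) = −log p̂(x). It is illustrative only: the paper's implementation is in Mathematica, and the synthetic bimodal sample, SciPy's gaussian_kde, and the convention β = 1 are assumptions of this sketch, not the authors' code.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for a cross-sectional dataset assumed to be in
# (local) equilibrium: a bimodal sample with peaks near -1 and +1.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-1.0, 0.4, 2500), rng.normal(1.0, 0.4, 2500)])

# Estimate p(x) with a Gaussian kernel (Silverman bandwidth, as in the
# text) and invert the Boltzmann relation with beta = 1.
kde = gaussian_kde(data, bw_method="silverman")
xs = np.linspace(-3.0, 3.0, 601)
p_hat = kde(xs)
F_hat = -np.log(p_hat)

# Since F_hat = -log p_hat is monotone, the minima of F_hat (the
# attractors) coincide with the peaks of p_hat (the norms).
assert np.argmin(F_hat) == np.argmax(p_hat)
```

Because −log is monotone, any β > 0 would only rescale F̂ without moving its attractors, which is why β can later be fixed to 1 without loss of generality.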
Thus, the displacement of each data-point consists of the sum of a deterministic component and a stochastic component, which leads to the following overdamped Langevin dynamics equation [10]:

dx/dt = −β dF/dx + σ dW/dt. (2)

Here, β controls the relative strength of the deterministic force and σ controls the relative strength of the noisy movement. When β → 0, the deterministic component becomes negligible, and thus the data-points diffuse randomly over the state space in all directions. When σ → 0, the stochastic component becomes negligible, and all data-points converge to one or more local minima of F(x), after which no further change occurs. Clearly, a balance is needed between these two opposing processes, which will control the degree of clustering and the amount of variance observed in the predicted distribution, p̂_β,σ(x), and its match with the p(x) from the data.

For determining the values of β and σ, it is important to realize that only their ratio changes the stationary data distribution. Thus, if we fix the timescale of the deterministic component by setting β = 1 without loss of generality, then the remaining free parameter σ controls the ratio of the random and deterministic forces. This timescale can be fixed because if we multiply both parameters in Eq. 2 with a constant τ, the velocity dx/dt is also multiplied by τ. This means that x changes τ times faster over time. In principle, it is impossible to derive how fast x changes per unit time from the data available at a single time-point. We can, however, still predict the directions of the data-points, which is our main interest.

To explain the numerical algorithm we will generate a set of data-points moving over a known free energy landscape and show how the algorithm recovers the dynamics. Consider a true underlying free energy landscape, F(x) = −ax² + bx⁴, from Landau's second-order phase transition formalism, which contains two attractor states as long as a > 0, b > 0. We set a = 3 and b = 1, where a controls the height of the energy barrier separating the two attractors. Assuming the Boltzmann distribution (Eq. 1), we generate a dataset consisting of 5000 i.i.d. data-points from the probability density function p(x) = e^(ax² − bx⁴)/Z, where Z is the normalizing constant. p(x) is shown in Fig. 1A as a solid red line. The sampling is done through the inverse transform sampling method [21].

We will now treat this dataset as given and 'forget' the F(x) used to generate it. The true p(x) always has an exponential form due to the assumption of a Boltzmann distribution. Therefore, a Gaussian kernel density estimation algorithm is used to estimate the data distribution. A Gaussian kernel density estimate is defined as p̂(x) = (nh√(2π))⁻¹ Σ_{i=1}^{n} e^(−(x − x_i)²/(2h²)), where the x_i are the data-points and the parameter h is determined using Silverman's optimization procedure [22]. We would like to highlight here that even though the parameters h and σ represent similar concepts (both control a standard deviation), they implement different components. The parameter h controls the standard deviation of the Gaussian kernel density estimate obtained from the data, whereas σ controls the ratio of the random and deterministic forces in Eq. 2. The estimated p̂(x) from the dataset is shown in Fig. 1A as a dashed black line. This estimated p̂(x) is then used to estimate the free energy landscape as F̂(x) = −log p̂(x). By fixing the timescale of the deterministic dynamics through β = 1, the deterministic term, dF̂/dx, of the Langevin dynamics (Eq. 2) is given as

dF̂/dx = − [ Σ_{i=1}^{n} ((x_i − x)/h²) e^(−(x − x_i)²/(2h²)) ] / [ Σ_{i=1}^{n} e^(−(x − x_i)²/(2h²)) ]. (3)

The stochastic part is then fully determined by an optimal choice for σ. The optimal σ is determined by using the Hellinger distance, H(p̂_β=1,σ(x), p(x)), as the cost function.
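The surrogate-data step and the drift of Eq. 3 can be sketched as follows. This is a hedged Python re-implementation (the published code is in Mathematica); the grid resolution and the textbook Silverman bandwidth formula are assumptions of the sketch.

```python
import numpy as np

# Inverse transform sampling of 5000 i.i.d. points from
# p(x) = exp(a*x^2 - b*x^4)/Z with a = 3, b = 1, on a fine grid.
a, b = 3.0, 1.0
grid = np.linspace(-3.0, 3.0, 6001)
cdf = np.cumsum(np.exp(a * grid**2 - b * grid**4))
cdf /= cdf[-1]
rng = np.random.default_rng(1)
samples = np.interp(rng.uniform(size=5000), cdf, grid)  # invert the CDF

# Eq. (3): dF_hat/dx of the KDE-based free energy F_hat = -log p_hat.
def dF_hat_dx(x, data, h):
    w = np.exp(-((x - data) ** 2) / (2.0 * h**2))
    return -np.sum((data - x) / h**2 * w) / np.sum(w)

h = 1.06 * samples.std() * len(samples) ** (-0.2)  # Silverman's rule
# The force -dF_hat/dx pushes a point at x = 0.5 toward the right-hand
# attractor (positive direction) and mirrors this on the left.
assert -dF_hat_dx(0.5, samples, h) > 0
assert -dF_hat_dx(-0.5, samples, h) < 0
```

The sign checks make the point of Fig. 1C: between the tipping point and an attractor, the estimated force points towards that attractor.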
The Hellinger distance is defined as H(p, q)² = (1/2) ∫ (√(p(x)) − √(q(x)))² dx. To evaluate the goodness of fit for each choice of σ during this optimization process, we need to estimate the stationary probability density p̂_β=1,σ(x) numerically. Since integration techniques for stochastic differential equations are computationally intensive, we choose to discretize the domain of the data, x, into discrete points with distance Δx, which is calculated as Δx = σ√Δt/f. We use a time-step Δt such that σ√(1000Δt) = (x_max − x_min)/2. This means that it would take a random Wiener diffusion process with diffusion parameter σ about 1000 time-steps to reach the entire range of x values. The expected displacement of a random diffusion process after Δt time is σ√Δt, which we divide up into f discrete points. Thus, f controls the 'fineness' of this grid. We find that f = 10 produces accurate results at low memory cost. x_min and x_max are calculated from the data by taking the floor of the minimum value and the ceiling of the maximum value of the standardized data, respectively. In this case, we get x_max = −x_min = 3 from the generated dataset. We obtain 601 discrete values for x and a sparse, approximately band-diagonal transition matrix W of dimension 601 × 601. To calculate the transition probabilities in W, we first determine the possible positions that a data-point, starting at position x, can reach after Δt time. The displacement due to the deterministic force is given by x + (−dF/dx)Δt. Taking this value as the mean, we determine from the 601 discrete values of x the values (y) that fall within 4 standard deviations of this mean, i.e., within ±4σ√Δt of the mean. We then calculate the probabilities of displacement to these y values due to random diffusion, by assuming a normal distribution with mean x + (−dF/dx)Δt and standard deviation σ√Δt. Thus, W_{x→y} = P(y | N(x − (dF/dx)Δt, σ√Δt)), where P() denotes the probability density function.

For discrete Markov processes, the stationary distribution vector π can be found directly by solving the set of linear equations πW = π, which is computationally an efficient operation since it reduces to finding the left eigenvector of W having an eigenvalue of 1. Before performing this operation, we normalize the rows of W. The initial distribution vector, π₀, is calculated from the data as π₀(x) = ∫_{x−Δx/2}^{x+Δx/2} p(x′) dx′.
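A minimal sketch of this discrete Markov approximation follows, assuming the known Landau drift in place of the KDE estimate and choosing σ² = 2 so that, with β = 1, the Boltzmann density e^(ax² − bx⁴) is the theoretical stationary distribution of Eq. 2; the time-step and boundary handling are illustrative choices, not the paper's exact settings.

```python
import numpy as np

# One-step Gaussian kernel on a 601-point grid: mean = deterministic
# displacement, sd = sigma*sqrt(dt); rows of W are transition
# probabilities from each grid point.
a, b = 3.0, 1.0
xs = np.linspace(-3.0, 3.0, 601)
dt, sigma = 0.01, np.sqrt(2.0)
force = 2 * a * xs - 4 * b * xs**3           # -dF/dx for F = -a x^2 + b x^4

mean = xs + force * dt                        # deterministic displacement
sd = sigma * np.sqrt(dt)
W = np.exp(-((xs[None, :] - mean[:, None]) ** 2) / (2 * sd**2))
W /= W.sum(axis=1, keepdims=True)             # row-stochastic transition matrix

# Stationary distribution: left eigenvector of W with eigenvalue 1.
vals, vecs = np.linalg.eig(W.T)
pi = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1.0))]))
pi /= pi.sum()

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    p, q = p / p.sum(), q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

boltzmann = np.exp(a * xs**2 - b * xs**4)     # unnormalized reference density
print(round(hellinger(pi, boltzmann), 3))
```

Rows near the grid boundary are truncated and renormalized, which acts approximately as a reflecting boundary; the printed Hellinger distance quantifies how closely the eigenvector recovers the Boltzmann reference.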
Finally, we compute the Hellinger distance, H(p̂_β=1,σ(x), p(x)), for which π is converted to a continuous function, p̂_β=1,σ(x), using third-order spline interpolation. If changing σ no longer reduces the Hellinger distance, the procedure terminates and the Langevin dynamics (Eq. 2) is completely specified. In Fig. 1D the resulting predicted dynamics of a single data-point is shown, where the optimal σ̂ ≈ . A larger σ pertains to adding more random perturbations, making transitions more frequent in both directions. Another interesting possibility is to add a term to the free energy function pointing towards the desired attractor state, inducing an asymmetric force.

Suppose we introduce an asymmetric intervention by adding a term to the free energy function which makes the left attractor preferred. The free energy function is then represented as G(x) = F(x) + cx. Here, c controls the strength of the effect of the intervention. To determine the relative amount of effort required for such an intervention, we consider the relative change of velocity of an individual due to this intervention, denoted by r(c), defined through the second moments of displacement. We assume that W₁ denotes a Wiener process at time t with mean (dG/dx)t and standard deviation σ√t, and W₂ denotes a Wiener process at time t with mean (dF/dx)t and standard deviation σ√t. The second moments of these two Wiener processes are denoted by E[W₁²] and E[W₂²], respectively. Thus, r is defined as the square root of

r² = ∫_{−∞}^{∞} p(x) (E[W₁²] − E[W₂²]) / E[W₂²] dx
   = ∫_{−∞}^{∞} p(x) t²((dG/dx)² − (dF/dx)²) / ((dF/dx)² t² + σ² t) dx
   = ∫_{−∞}^{∞} p(x) t²((−2ax + 4bx³ + c)² − (−2ax + 4bx³)²) / ((2ax − 4bx³)² t² + σ² t) dx
   = ∫_{−∞}^{∞} p(x) t²(−4acx + 8bcx³ + c²) / ((2ax − 4bx³)² t² + σ² t) dx
   = ∫_{−∞}^{∞} p(x) t(−4acx + 8bcx³ + c²) / ((2ax − 4bx³)² t + σ²) dx.
(4)

Here, p(x) = e^(ax² − bx⁴)/Z, where Z is the normalizing constant. From Eq. 4, we see that r ∝ √t/σ. Thus, r depends more strongly on σ (r ∝ σ⁻¹) than it does on t (r ∝ √t), since the power of t is half that of σ. To set t to an interpretable value, we choose the mean time between tipping point (at x = 0) transitions, estimated by integrating the Langevin dynamics equation (Eq. 2) over time between 0 and 10 . A tipping point transition is observed when two consecutive values of x have different signs. Using this method, we obtain t_c ≈ . Taking t_c, σ̂, and c = 1 as an example, we find that r ≈ 2.5%: relative to the existing influences on x, such as social norms (summarized into F and σ), 2.5% additional force would be required in order to make the left-hand attractor contain about

∫_{−∞}^{0} e^(−G(x)) dx / ∫_{−∞}^{∞} e^(−G(x)) dx ≈ .1%

of the individuals in equilibrium, instead of 50% under the symmetric F. We note that it is impossible to say in absolute terms how much effort is needed to implement the intervention, which would entail listing all existing current forces that lead to the currently observed distribution. We also note that this cannot be used to infer how an intervention should be implemented, i.e., which are the causal relationships that should be exploited in order to achieve such an intervention (e.g., stimulating physical activity versus a healthier diet).

The displacement of an individual i for a Δt time-increment is represented by a normal distribution with the displacement due to the deterministic force, (−dF/dx Δt)_i, as the mean and σ√Δt as the standard deviation. Using this normal distribution, we determine for each individual i the probability of positive displacement and the probability of negative displacement, denoted as P^i_PD and P^i_ND, respectively.

Here, we quantify the prediction accuracy of our model for individual i, which we denote as A^i. If the observed displacement for individual i from the data is positive, we assign P^i_PD as the prediction accuracy of our model, and if the observed displacement is negative, we assign P^i_ND as the prediction accuracy of our model for individual i. We also determine the maximum prediction accuracy of our model for individual i, denoted as A^i_maximum, as the maximum of the two probabilities P^i_PD and P^i_ND.
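Under this normal displacement model, P^i_PD and P^i_ND follow directly from the Gaussian CDF. A small self-contained sketch (the function names and example numbers are hypothetical, not from the paper):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def displacement_probs(mu, sigma, dt):
    """P_PD and P_ND for one individual whose displacement over dt is
    Normal(mu, sigma*sqrt(dt)), with mu the deterministic displacement
    (-dF/dx * dt) for that individual, as described in the text."""
    s = sigma * sqrt(dt)
    p_nd = norm_cdf(-mu / s)      # P(displacement < 0)
    return 1.0 - p_nd, p_nd

# An individual pushed toward lower values (mu < 0) is predicted to
# decrease: P_ND > P_PD.
p_pd, p_nd = displacement_probs(-0.3, sigma=1.0, dt=0.1)
assert p_nd > p_pd
```

The two probabilities sum to one, so whichever matches the observed direction is the accuracy A^i, and their maximum is A^i_maximum.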
We standardise the data to have zero median and unit standard deviation in order to have a mathematically convenient representation of the variable.

To determine the optimal value of Δt, we employ the metric of Euclidean distance: we vary Δt from 1 to 100 in increments of 1, and select the value that gives the minimum Euclidean distance between the displacement due to the deterministic force, (−dF/dx Δt), predicted by our model and the observed displacement from the data.

As a further validation step, we define an ideal case based on the expectation that individuals should move towards the model's attractor. Assuming this behaviour, we determine the 'maximally achievable' prediction accuracy of our model. The purpose of comparing our model prediction with this ideal case is to test if the prediction accuracy of our model increases when the displacement of individuals is according to this expectation. The displacement in the ideal case is taken as the negative of the standardised data.

In addition to the above validation tests, we compare the prediction accuracy of our model to random choice under the null hypothesis of random displacement. We randomly choose between P^i_PD and P^i_ND for each individual i and then take the average over all individuals to obtain the prediction accuracy by random choice. We repeat this random choice step 1000 times to obtain a distribution of mean prediction accuracies of random choice and then determine the 95% confidence interval of this distribution. If the average prediction accuracy of our model is greater than the upper limit of this confidence interval, which we denote as U_CI, we can say that our model prediction is significantly better than the prediction obtained by random choice.
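The random-choice baseline and the scaling of Eq. 5 can be sketched as follows, with synthetic probabilities standing in for real model output (the sample size and distribution are illustrative assumptions):

```python
import numpy as np

# Random-choice null model: for each individual pick P_PD or P_ND at
# random, average over individuals, repeat 1000 times, and take the
# upper limit of the 95% interval (U_CI).
rng = np.random.default_rng(2)
n = 200
p_pd = rng.uniform(0.3, 0.9, size=n)      # synthetic placeholder probabilities
p_nd = 1.0 - p_pd

means = []
for _ in range(1000):
    pick = rng.integers(0, 2, size=n).astype(bool)
    means.append(np.where(pick, p_pd, p_nd).mean())
u_ci = np.percentile(means, 97.5)

def scaled_accuracy(a_avg, a_avg_max, u_ci):
    """Eq. (5): > 0 means better than random choice, 1 at the maximum."""
    return (a_avg - u_ci) / (a_avg_max - u_ci)

a_avg_max = np.maximum(p_pd, p_nd).mean()  # average maximum accuracy
print(scaled_accuracy(0.6, a_avg_max, u_ci))
```

By construction the scaling maps U_CI to 0 and the average maximum accuracy to 1, matching the two red reference lines of Fig. 2.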
We also scale the average prediction accuracy (A^average) to better compare with U_CI and the average maximum prediction accuracy (A^average_maximum), as

A^average_scaled = (A^average − U_CI) / (A^average_maximum − U_CI). (5)

After the above scaling, U_CI will correspond to zero and A^average_maximum will correspond to 1. If A^average_scaled is above zero, we say that our model prediction is better than random choice; if it is below zero, then our model prediction is indistinguishable from random choice. Thus, we validate our model predictions against data as well as test whether the prediction accuracy of our model is significantly better than that obtained by random choice.

We validate our model against two population-based longitudinal datasets: the College study dataset [23] and the Hoorn study dataset [24, 25, 26], where we consider the first time-point as the cross-sectional data and the subsequent time-points are compared to the model prediction.

The College study dataset [23] is a longitudinal dataset with weight measured over 5 time-points: baseline, 6 weeks, 6 months, 12 months, and 24 months. This study was conducted with 294 female first-year students [age: 18. ± .44 years; all values are expressed as mean ± SD unless otherwise specified] recruited from two universities in Philadelphia. We preprocessed this dataset to include only those participants who had their weights measured for all 5 time-points, and obtained 162 participants. We calculated the BMI (in kg/m²) of these 162 participants from their weights and heights; their baseline BMI distribution is 23. ± . kg/m².

The Hoorn study dataset [24, 25, 26] is a longitudinal dataset with BMIs measured over 2 time-points: baseline and 7 years. The baseline study [24, 26] was conducted in 2006-2007 in the Dutch city of Hoorn and included 2,807 participants. The follow-up study [25, 26] was conducted in 2013-2015 and included 1,734 participants out of the 2,807 participants in the baseline study. We preprocessed this dataset to include only those participants who had their BMI measured for both time-points, and obtained 1,727 participants with age 53. ± .53 years and baseline BMI 26. ± .88 kg/m².

The data that support the findings of this study are available from a second party. Restrictions apply to the availability of these data, which were used under license for this study. Data are available from Lowe et al. [23] and Rutters et al. [26] on request. The code for the proposed method is written in the Mathematica programming language and is available at https://github.com/Pritha17/langevin-crosssectional .

Figure 2: (A) The scaled average prediction accuracies, A^average_scaled (Eq. 5), corresponding to the 4 time-points (6 weeks, 6 months, 12 months, and 24 months) of the College study dataset, and the single time-point of 7 years of the Hoorn study dataset. The vertical bars at each point represent the error bar for the respective model prediction accuracy calculated from 1000 bootstrap samples. The solid red line at 0 corresponds to the upper limit of the 95% confidence interval of the distribution of mean prediction accuracies of random choice, U_CI (refer to Methods: Validation procedure), and the solid red line at 1 corresponds to the average maximum model prediction accuracy, A^average_maximum (refer to Methods: Validation procedure). Model prediction accuracies above the line at 0 are significantly better than the prediction accuracy of random choice. (B) The 'maximally achievable' scaled average prediction accuracy of our model obtained by validating against the ideal case when individuals change their BMI in the expected manner, i.e., decrease or increase their weight tending towards the attractor BMI (refer to Methods: Validation procedure).
3. Results
We present a baseline method of inferring a predictive computational model from cross-sectional data based on assumptions that do not depend on expert knowledge. We validate our model predictions against two longitudinal datasets: the College study dataset [23] and the Hoorn study dataset [24, 25, 26], where we consider the first time-point as the cross-sectional data and the subsequent time-points are compared to the model predictions. The College study dataset contains BMI values over 5 time-points: baseline, 6 weeks, 6 months, 12 months, and 24 months. The Hoorn study dataset contains BMI values over 2 time-points: baseline and 7 years. Fig. 2 shows our model prediction accuracies (A^average_scaled, Eq. 5) using these two datasets. Our model shows significant predictive power (green circles in Fig. 2(A)) compared to the prediction of a random choice algorithm (solid red line at 0 in Fig. 2). Additionally, our model prediction accuracy improves further when we incorporate expert knowledge into our model, as shown by the blue crosses, cyan left triangles, and magenta diamonds in Fig. 2(A).

Figure 3: The relative number of positive and negative displacements in each BMI bin at (A) 6 weeks, (B) 6 months, (C) 12 months, and (D) 24 months, obtained from the College study dataset. The relative number of positive and negative displacements in each BMI bin is calculated as (number of positive displacements − number of negative displacements) / (number of positive displacements + number of negative displacements). This relative number will be 1 or −1 if all displacements in a particular BMI bin are in the same direction (either positive or negative). If a particular BMI bin has mixed displacement directions then the relative number will be between −1 and 1. A positive relative number indicates that there are more positive displacements than negative displacements, and vice-versa.

To test if the predictive power of our model is enhanced with the incorporation of domain expert knowledge, we apply the empirical observation from epidemiology that individuals from different BMI categories follow different landscapes [27, 28]. In accordance, we cluster the datasets based on the standard BMI categories: underweight (BMI < 18.5), normal weight (18.5 ≤ BMI < 25), overweight (25 ≤ BMI < 30), and obese (30 ≤ BMI) [29]. We respectively obtain clusters of sizes 0, 119, 39, and 4 from the College study dataset, and 12, 739, 729, and 247 from the Hoorn study dataset. Since the underweight and obese clusters from the College study dataset have only 0 and 4 individuals, respectively, and the underweight cluster from the Hoorn study dataset has only 12 individuals, we disregard those clusters. We observe that if we consider separate attractor landscapes for these different clusters, the model prediction accuracy increases significantly (Fig. 2A). The attractor in each cluster approximately corresponds to the BMI that is most prevalent relative to the group of individuals in that cluster. In addition, we select a narrow BMI range to see if clustering individuals having almost identical BMIs further improves the prediction accuracy. Accordingly, we select all individuals having BMI in the narrow range of 21 ≤ BMI < 22, and obtain 29 individuals from the College study dataset and 96 individuals from the Hoorn study dataset. We observe that the model prediction accuracy increases further (Fig. 2A). Thus, from the above results we can conclude that the incorporation of domain expert knowledge into our baseline method further enhances the predictive power of our model.

We also define an ideal case based on the hypothesis that all individuals should move towards the attractor, which corresponds to the norm BMI (the BMI that is most prevalent relative to the group of individuals under consideration). This means that individuals with a BMI that is greater than the attractor BMI should decrease their BMI, and vice-versa. Assuming this behaviour, we define a 'maximally achievable' model prediction accuracy. As observed from Fig. 2B, these 'maximally achievable' model prediction accuracies are significantly higher than the actual prediction accuracies using the real data. This reflects that not all individuals in the dataset are changing their BMI as expected, i.e., individuals with a BMI that is greater than the attractor BMI sometimes increase their BMI, and vice-versa. Based on this observation, we further analyse the data to see if individuals having the same BMI indeed have displacements in the same direction. We select 15 BMI bins based on the data (the bins are shown as x-axis labels in Fig. 3). We place an individual in BMI bin x if the individual's BMI falls in the range x ≤ BMI < x + 1. For example, we place an individual in BMI bin 25 if the individual's BMI falls in the range 25 ≤ BMI < 26. Then, we calculate the displacements from baseline to 6 weeks, 6 months, 12 months, and 24 months. Fig. 3 shows the relative number of individuals having positive and negative displacements in each BMI bin, calculated as (number of positive displacements − number of negative displacements) / (number of positive displacements + number of negative displacements). If all individuals in a particular BMI bin have displacements in the same direction (positive or negative), then this relative number will be 1 or −1, as shown by the red solid lines in Fig. 3. If individuals in a particular BMI bin have mixed displacement directions then this relative number will be between −1 and 1. A positive relative number indicates that there are more positive displacements than negative displacements, and vice-versa. We observe from Fig. 3 that this relative number is between 1 and −1 for most of the BMI bins. Thus, individuals having similar BMIs have both positive and negative displacements, indicating that individual behaviour is inherently random.
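The Fig. 3 statistic can be reproduced schematically as follows; the BMI values here are synthetic placeholders, not the College study data:

```python
import numpy as np

# Relative number of positive vs. negative displacements per BMI bin,
# (n_pos - n_neg) / (n_pos + n_neg), as in Fig. 3. Values lie in
# [-1, 1]; mixed directions give values strictly in between.
rng = np.random.default_rng(3)
bmi_baseline = rng.uniform(17.0, 32.0, size=500)            # hypothetical BMIs
bmi_followup = bmi_baseline + rng.normal(0.0, 1.0, size=500)
disp = bmi_followup - bmi_baseline

edges = np.arange(17, 33)                  # bin x holds x <= BMI < x + 1
idx = np.digitize(bmi_baseline, edges) - 1
rel = np.full(len(edges) - 1, np.nan)
for k in range(len(edges) - 1):
    d = disp[idx == k]
    n_pos, n_neg = int((d > 0).sum()), int((d < 0).sum())
    if n_pos + n_neg:
        rel[k] = (n_pos - n_neg) / (n_pos + n_neg)

assert np.nanmin(rel) >= -1.0 and np.nanmax(rel) <= 1.0
```

With symmetric random displacements, most bins end up strictly between −1 and 1, which is exactly the mixed-direction pattern the paper reports for the real data.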
4. Discussion
Cross-sectional studies are widely prevalent since they require less investment in terms of time, money, and effort compared to longitudinal studies. However, since these data lack temporal information, they cannot be used directly to study the evolution of the underlying processes. Nevertheless, this temporal information is essential to develop predictive computational models, which is the first step towards causal modelling. In this work, we present a novel method to infer predictive computational models from cross-sectional data using Langevin dynamics. This method can be applied to any system that can be described as effectively following a free energy landscape, such as protein folding, stem cell differentiation and reprogramming, and social systems involving human interaction and social norms.

Our method in itself infers predictive computational models from cross-sectional data based on the Langevin dynamics and two assumptions. The first assumption is that of well-mixedness or ergodicity, i.e., the data-points are sufficiently mixed at the time of observation and are at (local) equilibrium. Thus, we assume that even if there was a major perturbation, a system of data-points has converged to a stable distribution at the time of our observation. This means that the mean value of all data-points at a particular time-point and the mean value of any one data-point across time remain the same. This assumption would be valid in the short term, where we do not expect any major perturbation to the system. The second assumption is that each data-point tends to follow the same free energy landscape in a 'downslope' manner; two data-points need not follow each other even if they have the same stationary distribution. For instance, one data-point rotating clockwise and another data-point rotating anti-clockwise will have the same circular stationary distribution, but their changes over time, dx/dt, are different.
The second assumption is based on the fact that systems have a tendency to minimize their energy; hence, even though the changes over time for the two data-points may be different, they move in time in the same direction, i.e., downslope of a convex free energy landscape. For instance, two individuals will have different rates of change of BMI over time, but both will tend towards the norm BMI.

As opposed to black-box machine learning techniques, our technique is based on interpretable assumptions, such that domain expert knowledge can be readily incorporated. That is, the causal interpretation of the resulting model can be enhanced by adding domain expert knowledge in the form of statements of causal and non-causal relationships. This should lead to increased and more robust predictive power, as is indeed demonstrated by our clustering of different categories of BMI.

It is important to realize that the proposed method can only estimate directions of progression, not velocities. This is because, in principle, it is impossible to derive how fast a data-point (for instance, a BMI) changes per unit of time from single time-point data. In other words, the timescale of the predicted dynamics remains unknown. We can only estimate, for example, whether an individual will increase or decrease his/her BMI, and whether one individual progresses slower or faster than another. In some cases it is possible to estimate a timescale from the data, for instance by quantifying the relative frequency of tipping point transitions in the model and comparing it to knowledge or data about it.
A timescale can also be inferred from known statistical properties of the rates of change in reality; for instance, the maximum sustainable rate of weight loss observed in a population, about 2 kg per month, was used in our previous work [17].

The usefulness of our method lies in the fact that it can help reveal the short-term dynamics from a single time-point, without any dependence on expert knowledge, and this can serve as a starting point for developing causal models with the help of expert knowledge. We believe that the proposed method is sufficiently simple to use, as well as interpretable, to initiate the iterative development of computational models for any system that can be described as effectively following a free energy landscape, and can thus help in studying the progression of important processes.
5. Conclusions
We have proposed a method to infer predictive computational models from cross-sectional data based on Langevin dynamics for systems that effectively follow a free energy landscape. Our method shows significant predictive power when validated against two independent population-based longitudinal datasets which represent systems involving human interaction and social norms. The performance of our method could be further improved by taking expert domain knowledge into account. This suggests that our method can indeed be an effective modelling paradigm for systems that can be interpreted as effectively following a free energy landscape. Our method can bootstrap the use of the already abundant cross-sectional datasets to study the evolution of the underlying processes and initiate the iterative development of predictive computational models.
Author contributions
R.Q. conceptualised the study and performed the mathematical derivations. P.D. carried out the simulations and analysis. P.D. and R.Q. drafted the manuscript with critical input from L.C. and P.M.A.S. All authors approved the final version.
Declaration of interests
The authors declare no competing interests.
Acknowledgment
We thank Michael Lowe, Professor (Department of Psychology, Drexel University), for providing us with the College study dataset. We also thank Jeroen Bruggeman, Associate Professor (Department of Social and Behavioural Sciences, University of Amsterdam), Debraj Roy, Assistant Professor (Computational Science Lab, University of Amsterdam), and Nadege Merabet, PhD (University of Amsterdam) for reviewing the paper. This work is supported by the NTU Research Scholarship, DINAMICS (ZonMw, Netherlands Organization for Health Research and Development, project number: 531003015), Social HealthGames (NWO, the Dutch Science Foundation, project number: 645.003.002), Computational Modelling of Criminal Networks and Value Chains (Nationale Politie, project number: 2454972), and TO AITION (EU Horizon 2020 programme, call: H2020-SC1-2018-2020, grant number: 848146).