Inverse Ising techniques to infer underlying mechanisms from data
Hong-Li Zeng (曾红丽)† and Erik Aurell‡

School of Science, Nanjing University of Posts and Telecommunications, and New Energy Technology Engineering Laboratory of Jiangsu Province, Nanjing 210023, China
Nordita, Royal Institute of Technology and Stockholm University, SE-106 91 Stockholm, Sweden
KTH – Royal Institute of Technology, AlbaNova University Center, SE-106 91 Stockholm, Sweden
Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, 30-348 Kraków, Poland
February 14, 2020
Abstract
As a problem in data science, the inverse Ising (or Potts) problem is to infer the parameters of a Gibbs-Boltzmann distribution of an Ising (or Potts) model from samples drawn from that distribution. The algorithmic and computational interest stems from the fact that this inference task cannot be done efficiently by the maximum likelihood criterion, since the normalizing constant of the distribution (the partition function) cannot be calculated exactly and efficiently. The practical interest on the other hand flows from several outstanding applications, of which the most well known has been predicting spatial contacts in protein structures from tables of homologous protein sequences. Most applications to date have been to data produced by a dynamical process which, as far as is known, cannot be expected to satisfy detailed balance. There is therefore no a priori reason to expect the distribution to be of the Gibbs-Boltzmann type, and no a priori reason to expect that inverse Ising (or Potts) techniques should yield useful information. In this review we discuss two types of problems where progress nevertheless can be made. We find that, depending on model parameters, there are phases where the distribution is in fact close to a Gibbs-Boltzmann distribution, the non-equilibrium nature of the underlying dynamics notwithstanding. We also discuss the relation between inferred Ising model parameters and the parameters of the underlying dynamics.
Keywords:
Inverse Ising problem, kinetic Ising model, statistical genetics, fitness reconstruction
PACS:

∗ The work of H.-L. Zeng was supported partially by the National Natural Science Foundation of China (Grant No. 11705097), partially by the Natural Science Foundation of Jiangsu Province (Grant No. BK20170895), the Jiangsu Government Scholarship for Overseas Studies of 2018, and the Scientific Research Foundation of Nanjing University of Posts and Telecommunications (NY217013). The work of EA was partially supported by the Foundation for Polish Science through the TEAM-NET project (contract no. POIR.04.04.00-00-17C1/18-00).
† Corresponding author. E-mail: [email protected]
‡ Corresponding author. E-mail: [email protected]

1. Introduction

The Gibbs-Boltzmann distribution of the Ising model on $L$ spins is

\[
P(\mathbf{s}) = \frac{1}{Z}\,\exp\Big(-\beta\Big(\sum_i \theta_i s_i + \sum_{i<j} J_{ij} s_i s_j\Big)\Big), \tag{1}
\]

where $\beta$ is the inverse temperature, and $Z$ is the partition function, defined as

\[
Z = \sum_{\mathbf{s}} \exp\Big(-\beta\Big(\sum_i \theta_i s_i + \sum_{i<j} J_{ij} s_i s_j\Big)\Big). \tag{2}
\]

The parameters of the model are $L$ external fields $\{\theta_i\}_{i=1}^L$ and $L(L-1)/2$ coupling constants or interactions $\{J_{ij}\}_{i<j}$. (For later reference we prefer to denote the number of spins in the model by $L$, for "loci"; the more customary letter $N$ will be reserved for the number of samples drawn from the distribution, following a convention used in statistics.) The Gibbs-Boltzmann distribution of a Potts model is defined in a similar way, except that each variable can take $q$ values ($q = 2$ for the Ising model) and the model parameters are vectors and matrices ($\{\theta_i^{(\alpha)}\}$ for $1 \le \alpha \le q$ and $\{J_{ij}^{(\alpha,\alpha')}\}$ for $1 \le \alpha, \alpha' \le q$). (By reparametrization invariance the number of independent parameters is respectively $q-1$ for the vector and $(q-1)^2$ for the matrix, which for $q = 2$ gives only one parameter of each type, as in (1).)

From the viewpoint of physics, (1) is the equilibrium distribution at inverse temperature $\beta$ corresponding to the Ising energy function (or Hamiltonian) [1–3]. The traditional Ising problem of statistical mechanics is to determine properties of the distribution $P(\mathbf{s})$ from the model parameters $\{\theta_i, J_{ij}\}$. The probability distribution $P(\mathbf{s})$, or ensemble, will be reflected in samples drawn independently from that distribution. Combining the two steps of estimating the ensemble and sampling from the distribution, the direct Ising problem can be defined as the problem of estimating an empirical probability distribution over samples from model parameters. The inverse Ising problem is then the opposite problem of inferring model parameters from samples drawn from the distribution [4–6].

To stress the inverse nature of the problem it is useful to introduce some notation from statistics. The class of distributions (1), with values of the external fields and interactions in some set, is called an exponential family ("exponential" because the parameters all appear in the exponent, and "family" because a whole set of parameters is considered). The inverse Ising problem is accordingly called parameter inference in an exponential family [7]. The most basic way to infer parameters from independent samples from one and the same probability distribution is maximum likelihood (ML). For computational reasons ML is often formulated in logarithmic coordinates as maximum log-likelihood. Given $N$ independent samples from (1), maximum log-likelihood amounts to the convex optimization problem

\[
\{\theta_i^*, J_{ij}^*\}_{\mathrm{ML}} = \operatorname*{argmax}\Big[-\sum_i \theta_i \langle s_i\rangle - \sum_{i<j} J_{ij}\langle s_i s_j\rangle - \tfrac{1}{\beta}\log Z\Big], \tag{3}
\]

where $\langle s_i\rangle$ and $\langle s_i s_j\rangle$ are the empirical averages computed from the samples. The star on the parameters on the left-hand side marks that these are inferred, and the superscript ML indicates the inference method. The only reason (3) is a difficult task is that the forward problem of computing $Z$ from the parameters is difficult. The effect of the parameter $\beta$ cannot be separated from an overall scale of $\{\theta_i^*\}$ and $\{J_{ij}^*\}$, and therefore only appears in (3) through the log-partition function $\log Z(\beta, \{\theta_i^*\}, \{J_{ij}^*\})$. From now on we will, when not specified otherwise, set $\beta$ equal to one.
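For very small systems the optimization (3) can be made concrete, since $Z$ can be computed by explicit enumeration. The following is a minimal illustrative sketch (ours, not from the original paper; all names are our own), with $\beta = 1$ and the overall minus sign of (1) absorbed into the parameters:

```python
import itertools
import numpy as np

def moments_exact(theta, J):
    """Ensemble magnetizations <s_i> and correlations <s_i s_j> of an Ising
    distribution P(s) ~ exp(theta.s + s.J.s / 2), computed by enumerating
    all 2**L configurations (feasible only for small L).  The minus sign
    and beta of Eq. (1) are absorbed into the parameters."""
    L = len(theta)
    S = np.array(list(itertools.product([-1.0, 1.0], repeat=L)))
    logw = S @ theta + 0.5 * np.einsum('si,ij,sj->s', S, J, S)
    p = np.exp(logw - logw.max())          # stabilized Boltzmann weights
    p /= p.sum()                           # normalizing = dividing by Z
    return p @ S, np.einsum('s,si,sj->ij', p, S, S)

# Maximum likelihood by gradient ascent: the gradient of the average
# log-likelihood is (data moments) - (model moments), so iterate until
# the two agree.
L, eta = 5, 0.5
rng = np.random.default_rng(0)
samples = rng.choice([-1.0, 1.0], size=(1000, L))   # stand-in for real data
m_data, c_data = samples.mean(0), samples.T @ samples / len(samples)
theta, J = np.zeros(L), np.zeros((L, L))
for _ in range(500):
    m_model, c_model = moments_exact(theta, J)
    theta += eta * (m_data - m_model)
    dJ = eta * (c_data - c_model)
    np.fill_diagonal(dJ, 0.0)              # keep zero self-couplings
    J += dJ
```

For larger $L$ the enumeration over $2^L$ states is of course exactly what cannot be done, which is the subject of the approximate methods reviewed in Section 2.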
For $N$ independent samples this means $NL$ data items, but (3) only depends on $L(L+1)/2$ numbers computed from the data. Those numbers (here means and correlations) are called sufficient statistics for inference in an exponential family [8,9]. A second fundamental fact is that maximum likelihood inference gives the same result as maximizing Shannon entropy conditioned on the sufficient statistics. From the physical point of view this follows directly from (1) being an equilibrium distribution, which minimizes free energy. Maximizing Shannon entropy conditioned on some chosen set of empirical averages is called the maximum-entropy [10–12] or max-entropy approach to statistical inference. By the above, such a set of empirical averages is in one-to-one relation with a set of parameters in an exponential family for which they are sufficient statistics. This relation between exponential parameters and empirical averages is called conjugacy, or, in Information Geometry [13,14], a duality. The max-entropy approach with a given set of empirical averages is equivalent to maximum likelihood inference in an exponential family with the conjugate parameters.

In physics, (1) appears as a (canonical) equilibrium distribution of a system interacting with a heat bath. Let two configurations of the system be $\mathbf{s}$ and $\mathbf{s}'$, and let the probability per unit time of the system making the change from $\mathbf{s}$ to $\mathbf{s}'$ be $W_{\mathbf{s},\mathbf{s}'}$. Then equilibrium is reached if the transition rates satisfy the detailed balance conditions [15]

\[
P(\mathbf{s})\, W_{\mathbf{s},\mathbf{s}'} = P(\mathbf{s}')\, W_{\mathbf{s}',\mathbf{s}}. \tag{4}
\]

In equilibrium, transitions from $\mathbf{s}$ to $\mathbf{s}'$ and from $\mathbf{s}'$ to $\mathbf{s}$ are equally likely. As a consequence there cannot be chains of states such that cyclic transitions in one direction ($\mathbf{s}_1 \to \mathbf{s}_2 \to \cdots \to \mathbf{s}_k \to \mathbf{s}_1$) are more likely than in the opposite direction ($\mathbf{s}_1 \to \mathbf{s}_k \to \cdots \to \mathbf{s}_2 \to \mathbf{s}_1$). Chemistry and biology offer many examples where such cycles appear, from chemical oscillations of the Belousov-Zhabotinsky type to the cell cycle and circadian rhythms [16,17]. This immediately says that not all dynamics on discrete state spaces can satisfy detailed balance, and so not all such dynamics can be expected to have stationary distributions like (1).

If we focus on single-spin flips and $P(\mathbf{s})$ in (1), we can write the detailed balance conditions as a relation between the spin flip rates $r_i(+,\mathbf{s}_{\setminus i})$ and $r_i(-,\mathbf{s}_{\setminus i})$:

\[
r_i(-,\mathbf{s}_{\setminus i}) = r_i(+,\mathbf{s}_{\setminus i})\, e^{-2\beta\theta_i - 2\beta\sum_j J_{ij} s_j}, \tag{5}
\]

where $r_i(-)$ is the rate for spin $i$ to flip from down to up, and $r_i(+)$ is the rate from up to down. Both of them depend on the configuration of all the other spins, written $\mathbf{s}_{\setminus i}$. Alternatively we can write (5) as

\[
r_i(\mathbf{s}) = \gamma_i(\mathbf{s}_{\setminus i})\, e^{-\beta \Delta_i E(\mathbf{s})/2}, \tag{6}
\]

where $r_i(\mathbf{s})$ is the rate of flipping spin $i$ in configuration $\mathbf{s}$, $\Delta_i E(\mathbf{s})$ is the energy change when doing so, and $\gamma_i(\mathbf{s}_{\setminus i})$ is an overall rate which does not depend on the value of spin $i$. Different Monte Carlo procedures (or Markov chain Monte Carlo (MCMC) algorithms) differ by this overall rate $\gamma_i(\mathbf{s}_{\setminus i})$.
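The consistency between (4) and (6) is easy to check numerically. Below is a minimal sketch (our own illustration; all names are ours), verifying detailed balance for single-spin-flip rates of the form (6) with $\gamma_i = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
L, beta = 6, 1.0
theta = rng.normal(0, 0.5, L)
J = rng.normal(0, 0.5, (L, L))
J = (J + J.T) / 2                      # symmetric couplings
np.fill_diagonal(J, 0.0)

def energy(s):
    # Energy of (1): E(s) = sum_i theta_i s_i + sum_{i<j} J_ij s_i s_j
    return theta @ s + 0.5 * s @ J @ s

def rate(s, i):
    # Rate (6) with gamma_i = 1: r_i(s) = exp(-beta * dE / 2),
    # where dE is the energy change of flipping spin i in configuration s
    s2 = s.copy(); s2[i] = -s2[i]
    return np.exp(-beta * (energy(s2) - energy(s)) / 2)

# Detailed balance (4) for single-spin flips: P(s) r_i(s) = P(s') r_i(s')
s = rng.choice([-1.0, 1.0], L)
for i in range(L):
    s2 = s.copy(); s2[i] = -s2[i]
    lhs = np.exp(-beta * energy(s)) * rate(s, i)
    rhs = np.exp(-beta * energy(s2)) * rate(s2, i)
    assert np.isclose(lhs, rhs)
```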
To give an example of a spin-flip dynamics which does not satisfy detailed balance we point to the class of focused algorithms for constraint satisfaction problems, invented by Christos Papadimitriou now three decades ago [18–26]. In such algorithms one imagines that the energy function is a sum of local terms, each of which is one or zero. A solution is a configuration where all the energy terms are zero (a zero-energy ground state). A focused algorithm is one where the rate of flipping spin $i$ is zero unless at least one of the constraints depending on $i$ is unsatisfied, but where the dynamics otherwise remains partly random. (The first condition, focusing, can be satisfied in the equilibrium algorithm (6) by taking $\beta$ to infinity (zero temperature); but then the algorithm is a deterministic greedy search, and no longer random.) It is clear that under such a dynamics one can flip into a satisfying state, but once there the dynamics stops; one cannot flip out. It is well known that focused algorithms such as "walksat" outperform equilibrium algorithms in many important applications [19,24].

Let us now go back to the problem of inferring the parameters of the Ising model (1) when the data has been generated by some process which may or may not satisfy detailed balance. The inference procedure is at this point treated as a black box. What does this mean? Does it even make sense? When does it make sense?

In equilibrium statistical mechanics the answer is clear and simple: the procedure makes sense if the data was generated by a process in detailed balance with an energy function in the same exponential family, and in a phase where sampling is possible. The first condition simply means that if the data was generated from a process with, say, third-order interactions between the spins, those interactions will not be recovered by inferring only first-order and second-order interactions. The second condition means that the parameters have to be such that the dynamics explores enough configurations that there is enough information to infer from. A trivial example where this is not the case is zero temperature, where the configuration goes to a local minimum of the energy and then does not change. A more subtle example is a spin glass phase, where for large but not infinite $\beta$ only part of the Gibbs distribution (1) will be sampled by an MCMC algorithm unless the simulation time is exponentially large in system size [1]. Inference from naturally generated samples that are "stuck in one valley" has long been known to be impossible for the class of inverse Ising methods surveyed here [27]. For specific problems and with more tailored methods such a task is sometimes nevertheless possible [28]. Inference from samples that are drawn uniformly from such a distribution has on the other hand been shown to be possible, and even easy [29]. Such uniform samples, however, have to be generated by methods that either require a large computational effort (long simulation time), or one needs to restart the simulation many times with new random initial values, which corresponds to real data from many separate time series.

Once we step out of the realm of equilibrium dynamics we are much more in the dark. For the specific example of the Symmetric Simple Exclusion Process (SSEP) it is known that the stationary distribution, i.e. the equivalent of (1), contains interactions of all orders [30,31], meaning all single-spin and pairwise terms as in (1), all three-spin interactions, and so on.
This is so even though the SSEP dynamics is entirely specified by nearest-neighbor pairwise exclusion, and the non-equilibrium aspects lie only in the boundary conditions, particle exchanges with reservoirs. When the dynamics can be described as depending on energy changes with some non-equilibrium element such as focusing at every step (a "bulk-driven non-equilibrium process"), the possibilities for the stationary distributions are wider still. The outcome of an inverse Ising procedure applied to such data may therefore be completely unrelated to the parameters of the mechanisms that gave rise to the data. The computational complexity and the amount of data required to infer the parameters of any kind of non-equilibrium steady state from snapshots have been shown to be daunting [32–34]. Nevertheless, this is the setting of the most successful and interesting applications of inverse Ising techniques to date [35,36]. Why is this?

In this review we will present two cases where the above problem can be analyzed and/or studied in simulations. The first case is kinetic Ising models with possibly different values of the pairwise parameters $J_{ij}$ and $J_{ji}$. When $J_{ij} = J_{ji}$ (symmetric kinetic Ising models) this is nothing but a Monte Carlo procedure to compute $P(\mathbf{s})$ in (1). Models where $J_{ij} \neq J_{ji}$ (asymmetric kinetic Ising models) have however also been widely studied, e.g. as model systems in neuroscience [27,37,38]. The kinetic Ising models hence interpolate between equilibrium and non-equilibrium systems. They also illustrate that more efficient inference procedures than inverse Ising are available if one can use a time series, and not only independent samples from a stationary distribution.

The second case is a slightly more involved spin dynamics that models evolution under mutations, Darwinian selection (fitness), finite-$N$ effects (genetic drift) and recombination (sex). We will here see that inverse Ising works in certain ranges of the parameters describing the relative strengths of mutations, fitness and sex, but not in others. We will also see that the relation is not trivial; non-trivial theory is needed to translate the results from inverse Ising into inferred fitness that can be compared to model parameters.

This review is organized as follows. In Section 2 we summarize for completeness some inverse Ising techniques; this topic is already covered by excellent reviews, to which we refer for more details and a wider palette of techniques. In Section 3 we introduce the kinetic Ising problem in its symmetric and asymmetric forms and present characteristic results, and in Section 4 we present two applications of those techniques taken from earlier work by one of us (HLZ). Section 5 presents on the other hand a class of problems in population genetics, and Section 6 contains an outlook and discussion.
2. Techniques for Inverse Ising
The inverse Ising problem has been studied under several different names, such as statistical inference in exponential families (as above), Boltzmann machines, maximum-entropy modeling, Direct Coupling Analysis (DCA), logistic regression techniques, and more. For small enough systems (small enough $L$) maximum likelihood (3) is computationally feasible, for instance by the iterative method also known as the Boltzmann machine [39]. The idea of that very widely used method is to adjust the parameters in the exponential family so as to make the empirical averages and the ensemble averages of the conjugate sufficient statistics agree.

For large $L$, maximum likelihood (ML) is not computationally efficient, meaning that it requires an effort increasing exponentially with $L$. It should be said that for a given fixed $L$, what is and is not computationally feasible changes with time and the development of computer hardware. Nevertheless, for many applications that have been of interest, either ML has not been feasible, or other inference schemes have given comparable results with less effort. In any case, it has been an interesting theoretical challenge to design and analyze schemes that make a different trade-off between accuracy and computational speed than ML.

The state of the art of inverse Ising was recently extensively reviewed in [6], and we will here only provide a background for the later sections. A first type of inference method attempts to circumvent the computational challenge of ML by estimating the partition function $Z$ efficiently. Such methods are collectively known as mean-field inference, because they rely on mean-field techniques. The by far most common version of mean-field inference relies on a variational ansatz in terms of magnetizations, which yields the physical mean-field equations of the Ising model

\[
m_i = \tanh\Big(h_i + \sum_j J_{ij} m_j\Big). \tag{7}
\]

In this equation only $m_i$ is taken from the data, and there are only $L$ equations. By using also linear response,

\[
c_{ij} = \langle s_i s_j\rangle - \langle s_i\rangle\langle s_j\rangle = \frac{\partial m_i}{\partial h_j}, \tag{8}
\]

one finds the naive mean-field inference formula [40]

\[
J^{*,nMF}_{ij} = -\left(c^{-1}\right)_{ij}. \tag{9}
\]

The above expression is computationally quite convenient, as it reduces a complicated inference task to matrix inversion. One may note that (9) is the same formula as for inferring the interaction matrix of a Gaussian model (the precision matrix, in information theory) from data. It is an elementary property of multidimensional centered Gaussian distributions that they can be written $P(\mathbf{x}) = \mathcal{N} \exp\left(-\tfrac{1}{2}\mathbf{x}^{T} C^{-1}\mathbf{x}\right)$, where $C$ is the covariance matrix. The precision matrix (the model parameters) can therefore be inferred as the inverse of $C$ (the data). The difference is that for an Ising model (9) is only approximate, and does not always hold with good accuracy; for the SK model (to be discussed below) it holds for instance at high temperature (weak interactions), but not at low temperature. If needed, one can combine (7) and (9) to also estimate the external fields, i.e.

\[
h^{*,nMF}_i = \tanh^{-1}(m_i) - \sum_j J^{*,nMF}_{ij} m_j. \tag{10}
\]
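In code, naive mean-field inference is only a few lines. A minimal sketch (ours), assuming `samples` is an $N \times L$ array of $\pm 1$ spins and that all $|m_i| < 1$:

```python
import numpy as np

def nmf_inference(samples):
    """Naive mean-field inference (9)-(10): couplings from the inverse of
    the connected correlation matrix, then fields from the mean-field
    equation (7).  Assumes |m_i| < 1 for all spins."""
    m = samples.mean(axis=0)              # magnetizations <s_i>
    C = np.cov(samples, rowvar=False)     # c_ij = <s_i s_j> - <s_i><s_j>
    J = -np.linalg.inv(C)                 # Eq. (9)
    np.fill_diagonal(J, 0.0)              # discard self-couplings
    h = np.arctanh(m) - J @ m             # Eq. (10)
    return h, J
```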
More advanced mean-field methods than naive mean-field are obtained by starting from more advanced approximations than (7). The best-known of these is TAP (Thouless-Anderson-Palmer) [41], which starts from

\[
m_i = \tanh\Big(h_i + \sum_j J_{ij} m_j - m_i \sum_j J_{ij}^2 \left(1 - m_j^2\right)\Big). \tag{11}
\]

Using linear response then gives $J^*_{ij}$ as the solution of a quadratic equation,

\[
J^{*,TAP}_{ij} + 2\, m_i m_j \left(J^{*,TAP}_{ij}\right)^2 = -\left(c^{-1}\right)_{ij}. \tag{12}
\]

A general feature of inference methods of this type is that in the variational ansatz the data is only taken into account through the single-variable marginals, i.e. through the magnetizations. It is only linear response (8), which is an exact property of the full Ising model but not of the variational ansatz, that brings the two-variable marginals back into play.

Another type of mean-field inference equation attempts to find the Ising model which best fits the data. The variational parameters are then the magnetizations ($m_i$) and correlations ($c_{ij}$), conjugate to the model parameters $h_i$ and $J_{ij}$. This approach was first developed as an iterative procedure called "susceptibility propagation" [42,43] and only later shown to also yield equations like (9) and (12), in which ratios of hyperbolic functions appear on the left-hand side while the right-hand side is still just the inverse of the correlation matrix [44]. An alternative derivation of this elegant approach can be found in [6], which also contains a survey of many more methods that have been introduced and tested in the literature.

A different type of inference gives up on the ambition to approximate the partition function, and hence the full probability distribution $P(\mathbf{s})$. Instead one tries to infer the parameters from some other property which can be computed efficiently. The most widely used such method is maximum pseudo-likelihood [45], or pseudo-likelihood maximization (PLM). This starts from the conditional probability of the Ising model,

\[
P(s_i \mid \mathbf{s}_{\setminus i}) = \frac{\exp\Big(-\beta\Big(\theta_i s_i + \sum_j J_{ij} s_i s_j\Big)\Big)}{\sum_{s'=\pm 1} \exp\Big(-\beta\Big(\theta_i s' + \sum_j J_{ij} s' s_j\Big)\Big)}. \tag{13}
\]

In contrast to (1), there is now no longer any difficult-to-compute normalization factor. The denominator of (13) is the normalization of a distribution over only one Ising spin, and hence has only two terms. When treated in the same way as ML (3), (13) leads to $L$ inference problems, one for each spin $i$:

\[
\left(\theta^*_i, J^*_{ij}\right)^{PML,i} = \operatorname*{argmax}\Big[-\theta_i \langle s_i\rangle - \sum_j J_{ij} \langle s_i s_j\rangle - \tfrac{1}{\beta}\langle \log \zeta_i\rangle\Big], \tag{14}
\]

where $\zeta_i$ is the sum in the denominator of (13). The left-hand side emphasizes that this is inference "as seen from spin $i$" (by maximizing the conditional probability of spin $i$). To get the final answer one needs to combine $J^{*,PML,i}_{ij}$ and $J^{*,PML,j}_{ij}$, typically by taking their average.

In the limit of infinite data PLM will almost surely find the same parameters as ML, a property referred to as statistical consistency. (The formal definition of statistical consistency is that as the number of samples goes to infinity, the optimum of the estimator converges in probability to the right answer. This holds for ML and PLM and some other inference methods to be discussed below, but does not hold for mean-field inference methods: in the limit of infinite data the sample averages used in mean-field will almost surely be the same as ensemble averages, but the recovered parameters will not be the true ones, because physical mean-field is in itself approximate. For a discussion, see e.g. [6] and references cited therein.) In applications PLM has often been found to outperform both naive and advanced mean-field inference [6]. Why that is so cannot be said to be completely understood, since the number of samples in real data sets is finite. The error of mean-field inference compared to PLM in the infinite-sample limit (lack of statistical consistency) could therefore be compensated by the error of PLM when used on a finite number of samples. Empirically this has mostly not been found to be the case, but that may partially be a consequence of the kinds of data sets that have been considered in the literature.
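Pseudo-likelihood maximization is similarly compact in code, since each of the $L$ problems (14) is a smooth convex optimization. A minimal sketch (ours; `scipy` is assumed available, $\beta = 1$, the sign of (1) is absorbed into the parameters, and `lam` is an $L_2$ regularization strength of the kind discussed below):

```python
import numpy as np
from scipy.optimize import minimize

def plm_one_spin(samples, i, lam=0.01):
    """PLM 'as seen from spin i', Eq. (14): maximize the conditional
    log-likelihood (13) of s_i given the other spins, with an L2 penalty."""
    N, L = samples.shape
    s_i = samples[:, i]
    others = np.delete(samples, i, axis=1)        # configurations s_{\i}

    def neg_pll(x):
        theta, J_row = x[0], x[1:]
        H = theta + others @ J_row                # effective field on spin i
        # log P(s_i | s_\i) = s_i H - log(2 cosh H) = s_i H - logaddexp(H, -H)
        ll = np.mean(s_i * H - np.logaddexp(H, -H))
        return -ll + lam * np.sum(J_row ** 2)

    res = minimize(neg_pll, np.zeros(L), method='L-BFGS-B')
    # return the field and the full row of couplings (zero at position i)
    return res.x[0], np.insert(res.x[1:], i, 0.0)

# The two estimates of J_ij, from the problems of spin i and of spin j,
# are typically combined by averaging: J[i, j] = (row_i[j] + row_j[i]) / 2.
```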
High-dimensional statistics is the branch of modern statistics where the number of samples ($N$) is assumed to grow together with, or more slowly than, the number of parameters (here $L(L+1)/2$). Common sense says that if there are fewer samples than parameters and no other information, then the parameters cannot be fully determined by the data. This rule of thumb has to be applied with care, because often there is other information, used explicitly or implicitly; we will refer to a few such cases below.

Nevertheless, the rule of thumb points to something important, namely that in the important application of inverse Potts methods to contact prediction in protein structures [46,47], the number of parameters is typically about $20^2 \cdot 100^2$ (for 20 types of amino acids in a protein of 100 residues), which is four million, while the number of samples is rarely more than a hundred thousand. All inference methods outlined above are therefore, in this application, used in regimes where they are under-sampled, and so need to be regularized. For naive mean-field inference a regularization by pseudo-counts (adding fictitious uniformly distributed samples) was used in [46,47], while an $L_2$-regularization was used in [48] and an $L_1$-regularization in [49]. For PLM, similarly, $L_2$-regularization was used in [50] and [51,52].

An important aspect of all inference is the family from which one tries to infer parameters. This can be given a Bayesian interpretation as an a priori distribution over parameters; the more one knows in that direction, the better the inference can be. Many regularizers can be seen as logarithms of Bayesian prior distributions, so that the analogy also works the other way: regularized inference is equivalent to inference with a prior (the exponential of the regularizer), and can therefore work better because it uses more information. It was shown in [53] that $L_1$-regularized PLM can find the graph structure using relatively few samples, given certain assumptions that were later shown to be restrictive [54]. Nevertheless, using and analyzing thresholding also in the retained predictions, the authors of [55] were able to show that $L_1$-regularized PLM can indeed find the graph structure using on the order of $\log L$ samples. The same authors also showed that $L_2$-regularized PLM with thresholding, as used in the plmDCA software of [52], can recover the parameters in $L_\infty$ norm using on the order of $\log L$ samples.

A second and equally important aspect is the evaluation criterion. The criterion in [53] is graphical: the objective is to infer properties of the model (the non-zero interactions) which can be represented as a graph. It is obvious that inference under this criterion will be difficult without a gap separating the distribution of interaction parameters from zero. Information theory imposes limits on the smallest couplings that can be retrieved from a finite amount of data [55,56]; given finite data it is simply not possible to distinguish a parameter which is strictly zero from one which is only very small. Another type of criterion is metrical, most often the squared differences of the actual and inferred parameter values [6,27]. Yet another is probabilistic, determined by some difference between the two probability distributions of the form (1), one with the actual parameters and one with the inferred parameters.
Two examples of probabilistic criteria are the Kullback-Leibler divergence and the variational distance. An advantage of probabilistic criteria is that they focus on typical differences between samples, and not on parameter differences, which may (sometimes) not matter so much for the samples observed. However, since evaluating them requires sampling from the distributions, this is also a disadvantage.

Important results have been obtained as to how many samples are required for successful inference when both the a priori distributions and the criteria are varied, first in [53] under strong assumptions, and more recently in [55,57]. These two latter papers also introduce a different objective function, the Interaction Screening Objective (ISO), which depends on the same local quantities as pseudo-likelihood, and which provably outperforms PLM in terms of expected error for a given number of samples, providing a near sample-optimal guarantee. ISO has also more recently been generalized to learn Ising models in the presence of samples corrupted by independent noise [58], and to the case of Potts models and beyond pairwise interactions [59].

In practice, and in many successful applications to real data, criteria have been of the type "correctly recovering the $k$ largest interactions", colloquially known as "top-$k$". Performance under such criteria is straightforward to analyze empirically when there is a known answer: one simply compiles two lists of the $k$ largest parameters and the interactions they refer to, and then compares the two lists. For instance, one can check what fraction of the $k$ largest inferred interactions can also be found among the $k$ largest actual interactions, which is known as the $k$-True Positive Rate, or $TPR(k)$. In the application of inverse Potts methods to contact prediction in protein structures [46,47], $k$ has commonly been taken to be around 100. The requirement that the number of retained parameters be less than the number of samples has hence been respected, with a large margin. The theoretical analysis of performance under this type of criterion is however more involved, as the distribution of the largest values of a random background is an extreme deviations problem. One approach is to leverage an $L_\infty$ norm guarantee [55,59]; for another, using large deviation theory, see [60,61].
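Evaluating the top-$k$ criterion is itself straightforward when the true model is known. A minimal sketch (ours) of $TPR(k)$ for two coupling matrices:

```python
import numpy as np

def tpr_k(J_true, J_inferred, k=100):
    """k-True Positive Rate: the fraction of the k largest (in magnitude)
    inferred couplings that are also among the k largest true couplings."""
    iu = np.triu_indices_from(J_true, k=1)          # all pairs i < j
    top_true = set(np.argsort(-np.abs(J_true[iu]))[:k])
    top_inferred = set(np.argsort(-np.abs(J_inferred[iu]))[:k])
    return len(top_true & top_inferred) / k
```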
2.2. Time series and alltime inverse Ising techniques

In the following Section 3 we will consider inference from data generated by a kinetic Ising model, and in Section 4 we will consider applications of this technique to data from neuroscience and from finance. The main message of these sections will be that if you have time series data, it is usually better to do inference on the time-labeled data. As we will show, even when the dynamics is of the type (6), respects detailed balance, and has stationary distribution (1), it can be faster and easier to infer $J_{ij}$ from the dynamical law than by inverse Ising techniques.

Nevertheless, even if the data was generated by a dynamic process, we do not always have time series data. (Or we have samples at uneven times so that the time information is hard to use, or the times at which they were sampled are unknown; the cases may differ depending on the data set.) In Sections 5.1 and 5.4 we will consider models of evolution, intended as stylized descriptions of the kind of genetic / protein data on which inverse Ising (Potts) techniques have been applied successfully [36,46,47,62]. The underlying dynamics is then that of $N_{tot}$ individuals (genomes / genetic signatures / proteins) of size (genomic length) $L$ evolving for a time $T$, while the data consists of $N$ individuals sampled at one time.

Averages at any given time will have errors which go down as $(N_{tot})^{-1/2}$, typically a very small number for real data sets, but not necessarily very small in a simulation. For the evaluation of how simulations match theory it is therefore of interest to also consider, as input data to inverse Ising, variants of naive mean-field (9) and PLM (14) where the averages are computed both over samples and over time. We refer to these variants as alltime versions of the respective algorithms.
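Concretely, the alltime variants only change how the sufficient statistics are computed: every (sample, time) pair is pooled as if it were an independent draw. A minimal sketch (ours), for trajectories stored as an array of shape (number of samples, number of times, $L$):

```python
import numpy as np

def alltime_statistics(trajs):
    """Means and connected correlations pooled over samples and time,
    to be fed into nMF (9) or PLM (14) as if they were i.i.d. data."""
    N, T, L = trajs.shape
    flat = trajs.reshape(N * T, L)    # every (sample, time) pair is one draw
    m = flat.mean(axis=0)
    C = np.cov(flat, rowvar=False)
    return m, C
```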
3. A Model: Kinetic Asynchronous Ising Dynamics
A standard approach to sampling the equilibrium Ising model is Glauber dynamics [63,64]. On the level of probability distributions it is formulated as the master equation

\[
\frac{d}{dt}\, p(s_1, \ldots, s_L; t) = \sum_i \omega_i(-s_i)\, p(s_1, \ldots, -s_i, \ldots, s_L; t) - \sum_i \omega_i(s_i)\, p(\mathbf{s}; t), \tag{15}
\]

where $\omega_i(s_i)$ is the flipping rate, i.e., the probability per unit time for the state of the $i$th spin to change from $s_i$ to $-s_i$ while the other spins are momentarily unchanged. Equation (15) shows that the configuration $s_1, \ldots, s_L$ is destroyed by a flip of any spin $s_i$ (a loss term), but can also be created by a flip from any configuration of the form $s_1, \ldots, -s_i, \ldots, s_L$ (a gain term). The flipping rate of spin $i$ is

\[
\omega_i(\mathbf{s}) = \frac{\gamma}{1 + e^{2 s_i\left(\theta_i + \sum_j J_{ij} s_j\right)}} = \frac{\gamma}{2}\Big[1 - s_i \tanh\Big(\theta_i + \sum_j J_{ij} s_j\Big)\Big]. \tag{16}
\]

The parameter $\gamma$ is an overall rate which in Glauber dynamics is assumed to be the same for all spins. The left-hand side depends on the whole configuration $\mathbf{s}$ because the values of all the other spins enter on the right-hand side. The inverse temperature $\beta$ is here set to one; as noted above, it can be absorbed in the parameters.

For small $L$, (15) can be simulated by solving $2^L$ linear ordinary differential equations. For larger $L$, (15) can only be simulated by a Monte Carlo procedure. This means that one considers $N$ separate spin configurations $\mathbf{s}^{(1)}, \ldots, \mathbf{s}^{(N)}$, each of which is evolved in time. The empirical probability distribution

\[
P_e(\mathbf{s}, t) = \frac{1}{N} \sum_{n=1}^{N} \delta_{\mathbf{s}, \mathbf{s}^{(n)}(t)} \tag{17}
\]

is then an approximation of $P(\mathbf{s},t)$ in (15). We note (trivially) that for large systems $P_e(\mathbf{s},t)$ will typically be either zero or $1/N$; the chance that among $N$ separate spin configurations two are exactly equal is very small. $P_e(\mathbf{s},t)$ hence approximates $P(\mathbf{s},t)$ as to certain summary statistics, such as single-spin averages (magnetizations), but typically cannot approximate $P(\mathbf{s},t)$ very well as to the values for individual configurations.

For simplicity of presentation we will here focus on the time-homogeneous case where all parameters are time independent. Distributions will then eventually relax to a stationary state, and we will assume that this process has taken place. Inference can then be done by treating samples at different times as independent, i.e. by the type of alltime algorithms discussed in Section 2.2. For the rest of this section $N$ (the number of different time series) will hence be one. Indeed, since in the Monte Carlo procedure the different samples do not interact, one can limit oneself to just one time series, as long as one is interested in properties of the statistically stationary state reached at large times.

The dynamics of a configuration $\mathbf{s}(t)$ is governed by the same rates as in (15). In the Monte Carlo simulation scheme it is convenient to consider spin $i$ as responding to an effective field built from the external field $\theta_i$ and the interactions with all the other spins. This effective field is time-dependent, because the configurations of the other spins change in time, viz.

\[
H_i(t) = \sum_j J_{ij} s_j(t) + \theta_i, \tag{18}
\]

and the instantaneous rates are then

\[
\omega_i(\mathbf{s}, t) = \frac{\gamma}{2}\left[1 - s_i(t) \tanh\left(H_i(t)\right)\right]. \tag{19}
\]

One approach to simulation is to introduce a small time step increment $\delta t$ and to flip each spin at each time with probability $\omega_i(\mathbf{s},t)\,\delta t$.
For this scheme to simulate (15) one must take $\delta t$ so small that the chance of any other spin flipping in the same short time interval is negligible. This scheme can be said to rely on $L \cdot t/\delta t$ random variables, one for the decision whether or not to flip each spin in each time interval. Since on average less than one spin will flip in each time interval, the probabilities of these variables have to be heavily biased towards not flipping.

A computationally more efficient scheme is to first consider the rate of the event that any spin flips. That is

\[
\omega_{TOT}(\mathbf{s}, t) = \sum_i \omega_i(\mathbf{s}, t). \tag{20}
\]

As long as no spin flips, this overall rate does not change. The waiting time until any spin has flipped is therefore an exponentially distributed random variable with rate $\omega_{TOT}$, and the chance that it was spin $i$ that flipped is $\omega_i/\omega_{TOT}$. The dynamics can then be simulated in discrete steps starting from a configuration $\mathbf{s}^{(0)}$ at $t_0$, such that flips take place at times $t_1, t_2, \ldots$. Initially the rates are $\{\omega_i(\mathbf{s}^{(0)})\}$ and $t_1 - t_0$ is an exponentially distributed random variable with rate $\omega_{TOT}(\mathbf{s}^{(0)}) = \sum_i \omega_i(\mathbf{s}^{(0)})$. The first spin to flip will be the $j$th spin with probability $\omega_j(\mathbf{s}^{(0)})/\omega_{TOT}(\mathbf{s}^{(0)})$; after the flip all rates are updated to $\{\omega_i(\mathbf{s}^{(1)})\}$, and the process is repeated. This algorithm is called the Gillespie algorithm [65], and relies on $L \cdot t/\Delta t$ random variables, where $\Delta t$ is a characteristic time interval between flips. At the price of a slightly more complicated structure it is thus faster than the first algorithm by a ratio $\Delta t/\delta t$. Furthermore this method is exact; $\Delta t$ is a property of the dynamics and not of the simulation scheme.

A third approach is to update at each step a spin $i$ picked uniformly at random with probability $\gamma\delta t$. After such an update, which may or may not change the spin value, the new value will be

\[
s_i(t + \delta t) = \begin{cases} +1 & \text{with probability } 1/\{1 + \exp[-2\beta H_i(t)]\} \\ -1 & \text{with probability } 1/\{1 + \exp[2\beta H_i(t)]\} \end{cases}
\]

From this we can evaluate the rate of flipping of spin $i$ per unit time to be

\[
\begin{cases} \gamma/\{1 + \exp[2\beta H_i(t)]\} & \text{when } s_i(t) = +1 \\ \gamma/\{1 + \exp[-2\beta H_i(t)]\} & \text{when } s_i(t) = -1 \end{cases}
\]

in agreement with (19). This scheme again relies on $L \cdot t/\delta t$ random variables.
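A minimal sketch (ours) of the Gillespie scheme for the rates (19), with an arbitrary (not necessarily symmetric) coupling matrix:

```python
import numpy as np

def gillespie_glauber(J, theta, t_max, gamma=1.0, seed=0):
    """Gillespie simulation of the asynchronous Glauber dynamics (15)-(19):
    draw an exponential waiting time with the total rate (20), then pick
    which spin flips with probability proportional to its rate."""
    rng = np.random.default_rng(seed)
    L = len(theta)
    s = rng.choice([-1.0, 1.0], L)
    t, times, configs = 0.0, [0.0], [s.copy()]
    while t < t_max:
        H = J @ s + theta                          # effective fields, Eq. (18)
        w = 0.5 * gamma * (1.0 - s * np.tanh(H))   # flip rates, Eq. (19)
        w_tot = w.sum()                            # total rate, Eq. (20)
        t += rng.exponential(1.0 / w_tot)          # waiting time to next flip
        i = rng.choice(L, p=w / w_tot)             # which spin flips
        s[i] = -s[i]
        times.append(t)
        configs.append(s.copy())
    return np.array(times), np.array(configs)
```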
As illustrative examples we will now look at symmetric and asymmetric SK models [66], which are defined as follows. First we introduce couplings $J_{ij}$ with no restriction on $i$ and $j$. Such a matrix can be split into its symmetric and antisymmetric parts. We write

\[
J_{ij} = J^{s}_{ij} + k J^{as}_{ij}, \qquad k \ge 0, \tag{21}
\]

where $J^{s}_{ij}$ and $J^{as}_{ij}$ are the symmetric and antisymmetric interactions respectively:

\[
J^{s}_{ij} = J^{s}_{ji}, \qquad J^{as}_{ij} = -J^{as}_{ji}. \tag{22}
\]

The parameter $k$ in equation (21) measures the degree of asymmetry of the interactions $J_{ij}$. With $k = 0$ the $J_{ij}$ are fully symmetric, a model whose stationary distribution is (1). Any $k \neq 0$ means that $J_{ij}$ and $J_{ji}$ are not the same, and we have a non-equilibrium dynamics. The SK kinetic model, extended to non-equilibrium [67], means taking both the symmetric and the antisymmetric couplings to be identically and independently Gaussian distributed random variables with mean zero and variances

\[
\left\langle \left(J^{s}_{ij}\right)^2 \right\rangle = \left\langle \left(J^{as}_{ij}\right)^2 \right\rangle = \frac{g^2}{N}\,\frac{1}{1 + k^2}. \tag{23}
\]

This parametrization is chosen such that the total coupling matrix $J$ follows a Gaussian distribution

\[
p(J_{ij}) \propto \exp\left(-\frac{(J_{ij} - \mu)^2}{2\sigma^2}\right) \tag{24}
\]

with mean $\mu = 0$ and variance $\sigma^2 = g^2/N$, independently of $k$.

The interactions $J_{ij}$ define spin update rates (16) or (19). To see that asymmetric interactions do not lead to Gibbs distributions (1), it is useful to temporarily change the parametrization so that there are only three non-zero interactions, $J_{ij} = J_{jk} = J_{ki} = J$, all large. All other couplings are zero, and all $\theta_i$ are also zero. Assume that initially the three spins $s_i$, $s_j$ and $s_k$ are all up, i.e. $+{+}{+}$. They will then all have the same (small) flip rate $\gamma/(1 + e^{2J})$, and one of them will flip first; let that be spin $i$, so that the next state is $-{+}{+}$. After this has happened, the (much larger) rate either for $i$ to flip back or for $k$ to flip will be $\gamma/(1 + e^{-2J})$. A flip of spin $i$ will hence almost surely either go back to the starting state $+{+}{+}$ after two flips, or lead to the configuration $-{+}{-}$. This second state will in turn almost surely lead to $+{+}{-}$ or $-{-}{-}$. The first of these is a shift of the state after the first flip to the left, and by circular permutation symmetry it must be more likely that the shifts continue in that direction rather than to the right. The second is on the other hand obviously the mirror image of the starting state, and all rates are again low. Flipping out of $-{-}{-}$ would lead to $+{-}{-}$, which would give $+{-}{+}$ and then $+{+}{+}$ or $-{-}{+}$, which is also a shift to the left. A dynamics with some similarities to the above, where motion surely goes only in one direction, is the basis of Edsger Dijkstra's famous self-stabilizing system under distributed control [68]; for a physics perspective, see [69].

Many techniques for inverse Ising as discussed above in Section 2 have been applied to data from asynchronous Ising (or similar) dynamics, mainly for neuroscience applications [4,70–72]. Since our purpose here is to compare to inference using a time series, we will for the equilibrium case consider just the simplest method, which is naive mean-field (nMF) (9). On the methodological side much work has been done on applying inverse Ising techniques to synchronous versions of Ising dynamics [73–76]; this work will not be covered here. Dynamic mean-field inference as used below was originally developed for synchronous updates in [77], see also [78]. Inference in more realistic (and more complex) models from neuroscience has also been carried out, but is beyond the scope of this review, see [72,79,80].
We now derive versions of nMF and TAP inference for asynchronously updated kinetic models, following [81].
For the kinetic Ising model with Glauber dynamics, the state of spin $i$ is time dependent, $s_i(t)$; thus time-dependent means and correlations are naturally defined as

\[
m_i(t) = \langle s_i(t)\rangle, \qquad c_{ij}(t+\tau, t) = \langle s_i(t+\tau)\, s_j(t)\rangle - m_i(t+\tau)\, m_j(t). \tag{25}
\]

Then, with the master equation (15) and the flipping rate (16) (measuring time in units of $1/\gamma$), we have equations of motion for means and correlations:

\[
\frac{d m_i(t)}{dt} = -m_i(t) + \left\langle \tanh\left[H_i(t)\right]\right\rangle, \tag{26}
\]

\[
\frac{d \langle s_i(t)\, s_j(t')\rangle}{dt} = -\langle s_i(t)\, s_j(t')\rangle + \left\langle \tanh\left[H_i(t)\right] s_j(t')\right\rangle. \tag{27}
\]

In the forward problem of statistical physics we would here have the closure problem: the left-hand side is the time derivative of an average, while the right-hand side contains an average of higher order. In the inverse problem we start by observing that the term on the left-hand side and the first term on the right-hand side of equations (26) and (27) can be taken from data. The second term on the right-hand side contains averages of the tanh function and involves all kinds of higher-order correlations. The equations thus have to be closed with respect to these terms, but in a slightly different way than in the forward problem.

We introduce the notation

\[
b_i = \theta_i + \sum_j J_{ij} m_j \tag{28}
\]

for the non-fluctuating part of the argument of the tanh, and rewrite $H_i(t) = \theta_i + \sum_j J_{ij} s_j(t)$ as

\[
H_i(t) = b_i + \sum_j J_{ij}\, \delta s_j(t), \tag{29}
\]

where the sum depends on the fluctuating part $\delta s_j(t) = s_j(t) - m_j$. At lowest order we neglect fluctuations altogether and close the equation for the magnetizations as

\[
\frac{d m_i(t)}{dt} = -m_i(t) + \tanh b_i(t) \qquad \text{(lowest-order closure)}. \tag{30}
\]

If this equation reaches a stationary state it must satisfy $m_i = \tanh b_i$, which we recognize as the physical mean-field equation (7). To the same lowest order, (27) is $\frac{d}{dt}\langle s_i(t) s_j(t')\rangle = -\langle s_i(t) s_j(t')\rangle + m_i(t) m_j(t')$, which relaxes to the uncorrelated state.

The first non-trivial equation is obtained by expanding (27) to first order, which gives

\[
\langle s_i(t)\, s_j(t')\rangle + \frac{d \langle s_i(t)\, s_j(t')\rangle}{dt} = m_i m_j + \left(1 - m_i^2\right) \sum_k J_{ik} \left\langle \delta s_k(t)\, \delta s_j(t')\right\rangle, \tag{31}
\]

where we have used (30) and stationarity to identify the derivative of the tanh function as $(1 - m_i^2)$. Introducing

\[
C_{ij}(t, t') = \langle \delta s_i(t)\, \delta s_j(t')\rangle = \langle s_i(t)\, s_j(t')\rangle - m_i m_j \tag{32}
\]

and

\[
D_{ij}(t, t') = C_{ij}(t, t') + \frac{d C_{ij}(t, t')}{dt}, \tag{33}
\]

we have

\[
D_{ij}(t, t') = \left(1 - m_i^2\right) \sum_k J_{ik}\, C_{kj}(t, t'). \tag{34}
\]

While this equation holds (to this order) for any two times $t$ and $t'$, it is especially convenient in the limit $t' \to t$. Similarly to the procedure in naive mean-field inference (9), we can then invert (34) to arrive at an asynchronous mean-field inference formula

\[
J^{*,asyn\text{-}nMF} = A^{-1} D C^{-1}, \tag{35}
\]

where $A$ is the diagonal matrix given by $A_{ij} = \delta_{ij}(1 - m_i^2)$. Equation (35) is a linear matrix equation with respect to $J_{ij}$; we can solve it for $J_{ij}$ directly for asynchronous Ising models.
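As a minimal sketch (ours) of (35): estimate $m_i$, $C_{ij}$ and $D_{ij}$ from a spin history resampled on a regular time grid (spacing `dt`, in units of $1/\gamma$), approximating the time derivative in (33) by a finite difference:

```python
import numpy as np

def asynchronous_nmf(configs, dt):
    """Asynchronous naive mean-field inference (35): J = A^{-1} D C^{-1}.
    `configs` holds spin configurations sampled at regular intervals `dt`;
    dC/dt in Eq. (33) is approximated as (C(dt) - C(0)) / dt."""
    m = configs.mean(axis=0)
    ds = configs - m
    C0 = ds.T @ ds / len(ds)                      # equal-time correlations
    C1 = ds[1:].T @ ds[:-1] / (len(ds) - 1)       # lag-one correlations C(dt)
    D = C0 + (C1 - C0) / dt                       # D = C + dC/dt, Eq. (33)
    A_inv = np.diag(1.0 / (1.0 - m ** 2))         # A_ii = 1 - m_i^2
    return A_inv @ D @ np.linalg.inv(C0)
```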
Figure 1 shows scatter plots of the tested couplings versus the recovered ones. The tested model in Figure 1(a) is the symmetric SK model with $k = 0$ in equation (21), while Figure 1(b) shows the fully asymmetric SK model with $k = 1$. The couplings are reconstructed by equilibrium nMF (9) (black dots) and by asynchronous nMF (35) (red dots), respectively.

Figure 1: Scatter plots of true versus reconstructed couplings. (a) Reconstruction for the symmetric SK model with $k = 0$; (b) inference for the fully asymmetric SK model with $k = 1$. Red dots: couplings inferred with the asynchronous nMF approximation; black dots: couplings inferred with the equilibrium nMF approximation. The recovered asynchronous $J_{ij}$ in (a) are symmetrized, while those in (b) are not. The other parameters are $N = 20$ and $\theta = 0$ for both panels.

As shown in Figure 1(a), both methods have the same ability to recover the tested symmetric SK model. Nevertheless, the couplings inferred by asynchronous nMF need to be symmetrized to give the same results as equilibrium nMF, especially for short data lengths (not shown here). Figure 1(b) shows that, for the fully asymmetric SK model with $k = 1$, asynchronous nMF works much better than equilibrium nMF. This clearly shows that equilibrium inference methods are typically not suitable for non-equilibrium processes, while asynchronous inference works for both equilibrium and non-equilibrium processes.

By a similar procedure we can also derive a higher-order approximation, which we refer to as dynamic TAP. The starting point is to redefine the $b_i(t)$ term in the tanh to include a term analogous to the static TAP equation (11). We then first have

\[
H_i(t) = b_i - m_i \sum_{k \neq i} J_{ik}^2 \left(1 - m_k^2\right) + \sum_k J_{ik}\, \delta s_k(t), \tag{36}
\]

from which the lowest-order equation for the stationary state is of the TAP form. The second step is to expand the tanh function in (27) around $b_i - m_i \sum_{k \neq i} J_{ik}^2 (1 - m_k^2)$ to third order and to keep terms up to third order in $J$. In this way we get an inference formula which is formally the same as in the nMF approximation,

\[
J^{*,asyn\text{-}TAP} = A^{-1} D C^{-1}, \tag{37}
\]

where only the matrix $A$ is different:

\[
A_{ij} = \delta_{ij} \left(1 - m_i^2\right) \Big[1 - \left(1 - m_i^2\right) \sum_j J_{ij}^2 \left(1 - m_j^2\right)\Big]. \tag{38}
\]

Equation (37) is a function of the couplings $J$, and is therefore a nonlinear equation for the matrix $J$. It can be solved in two ways. One way is iterative: start from reasonable initial values $J_{ij}$ and insert them in the right-hand side of (37); the resulting $J_{ij}$ is the solution after one iteration, and can again be inserted in the right-hand side to get the second iteration, and so on:

\[
J^{t+1} = A(J^{t})^{-1} D C^{-1}. \tag{39}
\]

The alternative is to cast the inference formula into a set of cubic equations. For equation (38), denoting

\[
F_i = \left(1 - m_i^2\right) \sum_j J_{ij}^2 \left(1 - m_j^2\right) \tag{40}
\]

and plugging this into equation (37), we get the following equation for $J_{ij}$:

\[
J^{asyn\text{-}TAP}_{ij} = \frac{V_{ij}}{\left(1 - m_i^2\right)\left(1 - F_i\right)}, \tag{41}
\]

where $V_{ij} = [D C^{-1}]_{ij}$. Substituting equation (41) into equation (40), we obtain a cubic equation for $F_i$:

\[
F_i \left(1 - F_i\right)^2 - \frac{\sum_j V_{ij}^2 \left(1 - m_j^2\right)}{1 - m_i^2} = 0. \tag{42}
\]

With the physical solution for $F_i$, we get the reconstructed TAP couplings as

\[
J^{asyn\text{-}TAP}_{ij} = \frac{J^{asyn\text{-}nMF}_{ij}}{1 - F_i}. \tag{43}
\]
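The cubic route (40)–(43) can be sketched as follows (ours; it mirrors the nMF sketch above, and simply takes the smallest real root of (42) as the physical one):

```python
import numpy as np

def asynchronous_tap(configs, dt):
    """Dynamic TAP inference (37)-(43): compute V = D C^{-1} as in the
    asynchronous nMF sketch, solve the cubic (42) for F_i per row, then
    rescale row i of the nMF couplings by 1/(1 - F_i), Eq. (43)."""
    m = configs.mean(axis=0)
    ds = configs - m
    C0 = ds.T @ ds / len(ds)
    C1 = ds[1:].T @ ds[:-1] / (len(ds) - 1)
    D = C0 + (C1 - C0) / dt
    V = D @ np.linalg.inv(C0)                    # V_ij = [D C^{-1}]_ij
    J = np.empty_like(V)
    for i in range(len(m)):
        rhs = np.sum(V[i] ** 2 * (1 - m ** 2)) / (1 - m[i] ** 2)
        # Eq. (42): F (1 - F)^2 = rhs, i.e. F^3 - 2 F^2 + F - rhs = 0;
        # take the smallest real root as the physical solution
        roots = np.roots([1.0, -2.0, 1.0, -rhs])
        F = min(r.real for r in roots if abs(r.imag) < 1e-9)
        J[i] = V[i] / ((1 - m[i] ** 2) * (1 - F))
    return J
```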
To emphasize how different inference from a time series is compared to inference from samples, we will now show that maximum likelihood inference of such dynamics from such data is possible. We will also show that this approach admits approximation schemes different from mean-field. The presentation follows [82].

The log-likelihood of observing a full time series of a set of interacting spins is analogous to the probability of a history of a Poisson point process [15]. The probability space of events in some time period $[0 : t]$ consists of the number of jumps ($n$), the times of these jumps ($t_1, t_2, \ldots, t_n$) and which spin jumps at each time ($i_1, i_2, \ldots, i_n$). The measure over this space is proportional to the uniform measure over $n$ times a weight

\[
\mu^{(1)}_{i_1} dt_1 \cdots \mu^{(n)}_{i_n} dt_n \cdot \exp\left(-\mu^{(1)} t_1 - \mu^{(2)} (t_2 - t_1) - \cdots - \mu^{(n+1)} (t - t_n)\right),
\]

where $\mu^{(k)}_{i_k}$ is the jump rate, in the open time interval $(t_{k-1} : t_k)$, of the event that actually took place at time $t_k$, and $\mu^{(k)} = \sum_i \mu^{(k)}_i$. We recall from the discussion of the Gillespie algorithm that in the open time interval $(t_{k-1} : t_k)$ all the rates stay the same, and that the length of the interval is an exponentially distributed random variable whose parameter is the sum of all the rates. In another time interval some or all of the rates can be different.

A rigorous construction of the above path probability can be found in Appendix A of [83]. Here we will follow a more heuristic approach and introduce a small finite time $\delta t$, such that we can use the first simulation approach discussed above in Section 3. The objective function to maximize is then

\[
\mathcal{L} = \sum_{i,t} \log\left[(1 - \gamma\delta t)\, \delta_{s_i(t+\delta t), s_i(t)} + \gamma\delta t\, \frac{e^{s_i(t+\delta t) H_i(t)}}{2\cosh H_i(t)}\right]. \tag{44}
\]

The sums in (44) go over all spins $i$ and all times separated by the small increment $\delta t$. The terms in (44) can be understood as the lowest-order approximation (linear in $\delta t$) of $\log \prod_{i,t} P_i(s_i(t+\delta t) \mid \mathbf{s}(t))$, where $P_i$ is the conditional probability of spin $i$ at time $t + \delta t$, conditioned on the configuration of all spins at time $t$. Maximum likelihood inference of dynamics from a time series is therefore analogous to pseudo-likelihood maximization (14) from independent samples. At the price of potentially very many and very biased samples (at most times no spin will jump), this indicates that inference from a time series is a fundamentally easier task.

Separating times with and without spin flips in (44), the resulting learning rules will be

\[
\delta J_{ij} \propto \frac{\partial \mathcal{L}}{\partial J_{ij}} = \sum_{\text{flips}} \left[s_i(t+\delta t) - \tanh(H_i(t))\right] s_j(t) + \gamma\delta t \sum_{\text{no flips}} q_i(t)\, s_i(t+\delta t)\, s_j(t), \tag{45}
\]

with $q_i(t) \equiv \left[1 - \tanh^2\left(H_i(t)\right)\right]$, and this includes the rule for the $\theta_i$ with the convention $J_{i0} = \theta_i$, $s_0(t) = 1$. Following [82], where we also considered the case that the times where nothing happens are known, we will refer to (45) as the "spin-history-only" (SHO) algorithm.

Similarly to mean-field inference, (45) can also be averaged, which gives the learning rule

\[
\delta J_{ij} \propto \gamma^{-1} \dot{C}_{ij}(0) + C_{ij}(0) - \left\langle \tanh(H_i(t))\, s_j(t)\right\rangle, \tag{46}
\]

which we refer to as AVE [82].
The AVE rule requires knowing the equal-time correlations, their derivatives at $\tau = 0$, and $\langle \tanh(H_i(t))\, s_j(t)\rangle$. This latter quantity depends on the model parameters (through $H_i(t)$), so, in practice, estimating it at each learning step requires knowing the entire spin history, the same data as SHO learning needs.
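A minimal sketch (ours) of one evaluation of the SHO gradient (45), for a spin history discretized with time step `dt` (at most one flip per spin and step is assumed; $\beta = 1$):

```python
import numpy as np

def sho_gradient(J, theta, spins, dt, gamma=1.0):
    """Gradient of the log-likelihood (44) following the SHO rule (45).
    `spins` is a (T, L) array of configurations at times 0, dt, 2*dt, ..."""
    s_now, s_next = spins[:-1], spins[1:]
    H = s_now @ J.T + theta                 # effective fields H_i(t)
    flipped = s_next != s_now               # (T-1, L) boolean flip record
    resid = s_next - np.tanh(H)             # s_i(t+dt) - tanh H_i(t)
    q = 1.0 - np.tanh(H) ** 2               # q_i(t) of Eq. (45)
    flip_term = np.where(flipped, resid, 0.0)
    noflip_term = gamma * dt * np.where(flipped, 0.0, q * s_next)
    g = flip_term + noflip_term             # per-(t, i) weights
    dJ = g.T @ s_now                        # sum_t g_i(t) s_j(t)
    dtheta = g.sum(axis=0)                  # the theta rule: s_0(t) = 1
    return dJ, dtheta

# Learning: iterate J += eta * dJ, theta += eta * dtheta to convergence.
```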
All four methods now introduced to infer parameters from a time series (nMF, TAP, SHO and AVE) will produce a fully connected network structure. Similarly to inverse Ising from samples, we may want to include $L_1$ penalties to recover a graphical structure [84]. Such effects are considered in [85], which shows that inferring the sparsity structure from time series data is both a feasible and a reliable procedure.

In this section, performance tests of the four above-introduced algorithms for recovering the parameters of asynchronous Ising models are presented. We compare the two ML algorithms,
SHO and AVE, to each other and to the two mean-field algorithms, nMF and TAP.

The tested model is, as discussed above, the fully asymmetric SK model ($J_{ij}$ is independent of $J_{ji}$); the $J_{ij}$ are identically and independently distributed Gaussian variables with zero mean and variance $g^2/N$. As a performance measure we use the mean square error ($\epsilon$), which measures the $L_2$ distance between the inferred parameters and the underlying parameters used to generate the data:

\[
\epsilon = \frac{\sum_{i \neq j} \left(J^*_{ij} - J^{True}_{ij}\right)^2}{N(N-1)}, \tag{47}
\]

where $J^{True}_{ij}$ are the true values of the interactions and $J^*_{ij}$ are the inferred ones. We study the reconstruction error for different data lengths $L$, system sizes $N$, external fields $\theta$ and coupling strengths $g$.

Figure 2 shows the performance of these algorithms. In each panel, the two ML-based learning methods, SHO and
AVE, appear to perform equally well for large enough $L$, since they effectively use the same data (the spin history).

Figure 2: Mean square error ($\epsilon$) versus (a) data length $L$, (b) system size $N$, (c) external field $\theta$ and (d) temperature $1/g$. Black squares: nMF; red circles: TAP; blue up triangles: SHO; pink down triangles: AVE. The parameters are $N = 20$ and $\theta = 0$, except when varied in a panel.

Note the overall trend in figure 2(a): the reconstruction gets better with longer data length $L$ for both the ML- and the mean-field-based methods. Figure 2(b) shows that the MSE of the ML algorithms is insensitive to $N$, while the two mean-field algorithms improve as $N$ becomes larger; in these calculations, the average numbers of updates and flips per spin were kept constant, taking $L = 5 \times N$. Figure 2(c) shows that the performance of the two ML algorithms is also not at all sensitive to $\theta$, while nMF and TAP work noticeably less well with a non-zero $\theta$. The effects of (inverse) $g$ are depicted in figure 2(d). For fixed $L$, all the algorithms do worse at strong couplings (large $g$). nMF and TAP do so in a much more pronounced fashion, with errors growing approximately exponentially with $g$ for $g$ greater than $\approx 0.2$. In the weak-coupling limit, all algorithms perform roughly similarly, as already seen in figure 2(a).

To summarize, the ML methods recover the model better, but in general more slowly. The mean-field-based learning rules (nMF and TAP) are much faster in inferring the couplings, but have worse accuracy compared with that of the ML-based iterative learning rules (AVE and SHO).
4. Example Applications of the Asynchronous Ising Model
Inverse Ising methods have been applied to a wide range of data analysis problems, using both equilibrium reconstruction methods and kinetic ones. In this section, based on [82] and [86], we will present as illustrations applications to one data set of neuronal spike trains, and one data set of transaction data for stocks on a financial market. Both areas have been investigated extensively in the last ten years. We refer to [87–93] for more recent neuronal data and discussions of inference in this context, and to [94–105] for a sample of contributions considering financial data.

For the neuronal data, we show two ML-based learning rules. When considering the data as a time series we use the
AVE method (46), while when considering the same data as independent samples from the Gibbs distribution (1) we use a Boltzmann machine (BM, introduced below). We find that for this data the couplings between the neurons obtained in the two ways are comparable. This means that although there is no a priori reason for this to be so, the dynamic process of this neuronal system apparently satisfies detailed balance, or has a stationary distribution of the form (1) for other reasons. One clear difference is the self-couplings from one neuron to itself, which are absent in (1) but typically present in the dynamic model. A further difference is that to infer parameters of (1) from samples, the ensemble averages have to be generated by a Monte Carlo procedure. Although both methods are based on ML, the dynamic version is thus considerably faster than BM.

For the financial stock trade data we show two mean-field-based algorithms. When considering the data as a time series we use the asynchronous nMF method (35), while when considering the same data as independent samples from (1) we use naive mean-field inference (here called equilibrium nMF) (9). We note that we here apply inverse Ising inference to binary data obtained by transforming a time series of financial transactions (see below). Again we find that the results from the two procedures are comparable, except that asynchronous nMF allows the inference of self-couplings, as well as of directed links (asymmetric couplings).

Neurons are the computational units of the brain. While each neuron is actually a cell with a complicated internal structure, there is a long history of considering simplified models where the state of a neuron at a given time is a Boolean variable. Zero (or down, or $-1$) then means resting, and one (or up, or $+1$) means firing, or having a spike of activity. In most neural data, most neurons are resting most of the time.
Data description and representation.
The neuronal spike trains are from a salamander retina under stimulation by a repeated 26.5-second movie clip. This data set records the spiking times of the neurons and has a total duration of 3180 seconds (120 repetitions of the movie clip). Here, only the first $N = 20$ neurons with the highest firing rates in the data set are considered. The data had been binned with time windows of 20 ms (the typical time scale of the auto-correlation function of a neuron) in a previous study [106]. However, since we are using the kinetic model, we can study this data set using a much shorter time bin, which leads to low enough firing rates and (almost) never more than one spike per bin. Then the temporal correlations with time delays between neuron pairs, as well as the self-correlations, become important.

For the asynchronous Ising model, the time bins are $\delta t = 1/(\gamma N)$. For neuronal data, $\gamma$ can be interpreted as the inverse of the time scale of the auto-correlation function, which is typically 10 ms or more [106]. To generate a binary spin history from this spike train data set, the spike trains should be separated into time bins of length $\gamma\delta t = 1/
20$. This means the size of the time bins should be chosen as $\delta t = 1/(20\gamma)$, i.e. 0.5 ms for $\gamma^{-1} = 10$ ms, in the data translation. Denote the total number of firings of neuron $i$ as $F_i$, and $t_i^f$ as the firing time of the $f$th spike of neuron $i$, where $i = 1, \ldots, N$ and $f = 1, \ldots, F_i -
1$; then the mapping of the spike history is as follows:

\[
s_i(t) = \begin{cases} +1, & \text{if } t \in \left[t_i^f,\ \min\left(t_i^{f+1},\, t_i^f + X\right)\right) \text{ with } X \sim \mathrm{Exp}(\gamma^{-1}) \\ -1, & \text{otherwise} \end{cases} \tag{48}
\]

where $X$ is a period drawn from an exponential distribution with mean 10 ms. In this way, we obtain the asynchronous type of data needed for the asynchronous model.
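A minimal sketch (ours) of the mapping (48), with the bin size and time scale discussed above ($\delta t = 0.5$ ms, $\gamma^{-1} = 10$ ms); `spike_times` is the sorted array of spike times of one neuron, in seconds:

```python
import numpy as np

def spikes_to_spins(spike_times, t_end, dt=0.5e-3, tau=10e-3, seed=0):
    """Map a spike train to an asynchronous +/-1 spin history, Eq. (48):
    a neuron is 'up' from each spike until the next spike or until an
    exponentially distributed period X ~ Exp(mean tau) has elapsed,
    whichever comes first; otherwise it is 'down'."""
    rng = np.random.default_rng(seed)
    s = -np.ones(int(t_end / dt))
    for f, t_f in enumerate(spike_times):
        X = rng.exponential(tau)
        t_next = spike_times[f + 1] if f + 1 < len(spike_times) else t_end
        up_until = min(t_next, t_f + X)
        s[int(t_f / dt):int(up_until / dt)] = 1.0
    return s
```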
Inference methods. For this fairly small system we use two types of ML to learn the parameters of (1). In the equilibrium case (3) this can be done with the iterative method called the Boltzmann machine (BM), which is defined by the updates

\[
\delta\theta_i = \eta\left(\langle s_i\rangle_{Data} - \langle s_i\rangle_{Model}\right), \qquad \delta J_{ij} = \eta\left(\langle s_i s_j\rangle_{Data} - \langle s_i s_j\rangle_{Model}\right).
\]

In the above, the "learning rate" $\eta$ is a relaxation parameter. For larger systems BM does not scale, since computing the ensemble averages $\langle s_i\rangle_{Model}$ and $\langle s_i s_j\rangle_{Model}$ is costly, but for the data under consideration here it is a feasible method.
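A minimal sketch (ours) of BM learning, estimating the model averages at each step by single-spin heat-bath Monte Carlo (the sign of (1) is absorbed into the parameters; burn-in and convergence checks are omitted):

```python
import numpy as np

def bm_learning(m_data, c_data, n_iter=200, n_mc=20000, eta=0.05, seed=0):
    """Boltzmann machine learning: adjust parameters until model moments
    match data moments, with <.>_Model estimated by heat-bath sweeps."""
    rng = np.random.default_rng(seed)
    L = len(m_data)
    theta, J = np.zeros(L), np.zeros((L, L))
    for _ in range(n_iter):
        s = rng.choice([-1.0, 1.0], L)
        m_mod, c_mod = np.zeros(L), np.zeros((L, L))
        for _ in range(n_mc):
            i = rng.integers(L)
            H = theta[i] + J[i] @ s              # effective field on spin i
            # heat-bath update: P(s_i = +1) = (1 + tanh H) / 2
            s[i] = 1.0 if rng.random() < 0.5 * (1 + np.tanh(H)) else -1.0
            m_mod += s
            c_mod += np.outer(s, s)
        m_mod, c_mod = m_mod / n_mc, c_mod / n_mc
        theta += eta * (m_data - m_mod)          # delta-theta rule
        dJ = eta * (c_data - c_mod)              # delta-J rule
        np.fill_diagonal(dJ, 0.0)
        J += dJ
        J = (J + J.T) / 2                        # keep couplings symmetric
    return theta, J
```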
When retaining the time-series nature of the data we on the other hand use the AVE learning rule of equation (46).

Figure 3: Inferred asynchronous versus equilibrium couplings for retinal data. Red open dots show the self-couplings, which by convention are equal to zero for the equilibrium model.
Inference Results.
In the current inference of retinal functional connections, the values of model parameters such as the window size δt and the inverse time scale γ are set a priori, according to previous studies of the equilibrium Ising model. This obviates systematic studies over parameter values.

As presented in figure 3, the couplings inferred by BM and by the asynchronous kinetic Ising model are very close to each other. We also tested what happens to the couplings of the asynchronous model if during learning we symmetrize the coupling matrix at each iteration (by adding its transpose and dividing by two) and set the self-couplings to zero. We find that the resulting asynchronous couplings get even closer to the equilibrium ones, which is consistent with the conclusion for kinetic Ising data.

However, the asynchronous model allows the inference of self-couplings (diagonal elements of the coupling matrix), which are not present in the equilibrium model. As shown in figure 3, the diagonals from the equilibrium model equal zero by convention and are denoted by the open red dots. Furthermore, in contrast to the symmetric couplings of the equilibrium model, the asynchronous model provides more detail, as the inferred couplings are directed and asymmetric.

This result provides a guide for the use of the equilibrium Ising model: if the asynchronous couplings were far away from the equilibrium ones, it would imply that the real dynamical process did not satisfy the Gibbs equilibrium conditions, and that the final distribution of states is not of the Gibbs-Ising form. Inferring the equilibrium model is moreover computationally hard, requiring time-consuming Monte Carlo sampling, while the asynchronous approach does not. The asynchronous learning rules thus allow the inference of functional connections that, for the retinal data, largely agree with the equilibrium model, but the inference is much faster.

In this case study we apply the equilibrium nMF (9) and asynchronous nMF (35) algorithms to infer a financial network from trade data on 100 stocks. The recorded time series are transformed into binary variables by local averaging and thresholding. This introduces additional parameters that have to be studied extensively to understand the behavior of the system. The couplings inferred by the asynchronous nMF method are quite similar to the equilibrium ones, and both produce network communities with similar industrial features. The asynchronous method is however more detailed, as its inferred couplings are directed, while the equilibrium ones are symmetric.
Data description and representation.
The data consists of transaction records from the New York Stock Exchange (NYSE) over a few years. Each trade is characterized by a time, a traded volume, and a price. We focus on the trades for 100 trading days between 02.01.2003 and 30.05.2003, and only the trading volumes and trading times are utilized in this study. To avoid the opening and closing periods of the stock exchange, the central 10^4 seconds of each day are employed, as in [107]. Since a sliding window is adopted, two parameters enter the data transformation: one is the size of the sliding time window (denoted ∆t), the other is the shift, which is fixed at 1 second.

For stock i, the sum of the volumes V_i(t, ∆t) traded in the window [t, t + ∆t) is compared with a given volume threshold V_i^{th} = χ V_i^{av} ∆t, where V_i^{av} is the average (over the whole time series) volume of the considered stock traded per second, and χ is a parameter controlling the volume threshold:

s_i(t) = \begin{cases} +1, & \text{if } V_i(t, \Delta t) \geq V_i^{th} \\ -1, & \text{if } V_i(t, \Delta t) < V_i^{th} \end{cases} \qquad (49)

We explored the parameters ∆t and χ systematically, with the goal of finding values for which the inferred couplings contain interesting information. Figure 4 shows the traded-volume information for the mortgage company Fannie Mae (FNM); with the mapping of equation (49), windows above the blue threshold line in figure 4 map to +1 and those below to −1.

Figure 4: Traded volumes V_i(t) for FNM: red for summed volumes during time intervals ∆t, blue for the threshold V_i^{th} = χ V_i^{av} ∆t. Parameters: ∆t = 50 s and χ = 1.
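In code, the transformation (49) amounts to a sliding-window sum and a threshold; a minimal sketch (function name and array layout are our own):

```python
import numpy as np

def volumes_to_spins(volumes, window, chi):
    """Binarize per-second traded volumes following eq. (49).

    volumes : array (n_stocks, n_seconds), traded volume per second
    window  : sliding-window length Delta_t in seconds (shift fixed at 1 s)
    chi     : threshold parameter; V_i^th = chi * V_i^av * Delta_t
    """
    v_av = volumes.mean(axis=1, keepdims=True)           # V_i^av per second
    kernel = np.ones(window)
    # summed volume V_i(t, Delta_t) over each window [t, t + Delta_t)
    v_win = np.array([np.convolve(v, kernel, mode="valid") for v in volumes])
    return np.where(v_win >= chi * v_av * window, 1.0, -1.0)
```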
Inference methods.

With the transformed binary variables, the magnetizations m_i and the correlations C_{ij}(τ) are defined as in (25). With these, two different inference methods within the nMF approximation are utilized for the reconstruction.

• Equilibrium nMF (i ≠ j), which only uses the equal-time correlations [108]:

J_{ij} = -\left( C(0)^{-1} \right)_{ij}

• Asynchronous nMF [81], which uses the derivative of the time-lagged correlations \dot{C}_{ij}(\tau), as given in equation (35) and rewritten here as:

J_{ij} = \frac{1}{1 - m_i^2} \left( \frac{dC(\tau)}{d\tau}\bigg|_{\tau=0} C(0)^{-1} \right)_{ij}
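Both estimators act directly on sample statistics, as in the following sketch; note that the derivative \dot{C}(0) is here crudely approximated by a one-step finite difference rather than by the four-point linear fit used in the text below, and all names are our own.

```python
import numpy as np

def nmf_couplings(s, tau=1):
    """Equilibrium and asynchronous nMF couplings from a +/-1 series s (T x N)."""
    m = s.mean(axis=0)                               # magnetizations m_i
    ds = s - m                                       # fluctuations
    C0 = ds.T @ ds / len(s)                          # equal-time C(0)
    Ct = ds[tau:].T @ ds[:-tau] / (len(s) - tau)     # time-lagged C(tau)
    J_eq = -np.linalg.inv(C0)                        # equilibrium nMF
    np.fill_diagonal(J_eq, 0.0)
    dC = (Ct - C0) / tau                             # dC/dtau near tau = 0
    J_asyn = (dC @ np.linalg.inv(C0)) / (1.0 - m**2)[:, None]
    return J_eq, J_asyn
```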
Reconstruction Results.

Massive explorations over different values of the window size ∆t and of χ were carried out in order to obtain meaningful interactions between stocks. A natural rough criterion is that couplings contain interesting information if they are large in absolute value, since these indicate strong interactions between stocks. For asynchronous inference, the derivative of the time-lagged correlations \dot{C}_{ij}(\tau) is computed through a linear fit of the function C_{ij}(τ) using four points: C(0), C(∆t/5), C(2∆t/5) and C(3∆t/5).

Figure 5: Histograms of couplings inferred by equilibrium nMF and re-scaled asynchronous nMF. Black squares: N(J^{asyn}) re-scaled to have the same standard deviation as N(J^{eq}). ∆t = 200 seconds for both methods.

Figure 5 shows that both inference methods give similar distributions of couplings. For comparison, the distributions are re-scaled so as to have the same standard deviation. It can be remarked that the inferred couplings have a strictly positive mean and a long positive tail. This prevalence of positive couplings can intuitively be linked with the market-mode phenomenon [109–112]: a large eigenvalue appears, corresponding to a collective activity of all stocks, as illustrated in figure 6.

Figure 6: Histograms of the eigenvalues of the equal-time connected correlation matrix. ∆t = 100 seconds.

The similarity of interaction matrices J and J′ inferred by different methods can be measured by a similarity quantity Q_{J,J'}, defined as

Q_{J,J'} = \frac{\sum_{i,j} J_{ij} J'_{ij}}{\sum_{i,j} \max\left( J_{ij}^2, J_{ij}'^{\,2} \right)} \qquad (50)

This measure compares the elements of the two matrices one by one and gives a global similarity. It takes real values between 1 (when J_{ij} = J'_{ij} for all i and j) and −1 (when J_{ij} = −J'_{ij} for all i and j), and values close to zero indicate uncorrelated couplings. The value of Q is smaller than 0.02 in absolute value when all elements of J and J′ are drawn independently at random from one and the same Gaussian distribution of mean zero, for different values of the standard deviation of that distribution. Here, the value of Q between the couplings inferred by equilibrium nMF and by asynchronous nMF (at ∆t = 50 s) is far above this noise level, which indicates that the two methods are not independent of each other.
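In code, the measure (50) is essentially one line (a sketch, with our own function name):

```python
import numpy as np

def similarity_Q(J1, J2):
    """Global similarity (50) between two coupling matrices:
    1 when J1 == J2, -1 when J1 == -J2, near 0 for unrelated couplings."""
    return np.sum(J1 * J2) / np.sum(np.maximum(J1**2, J2**2))
```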
Next, we present two inferred financial networks, recovered by the equilibrium and the asynchronous nMF methods respectively. As the inferred financial networks are densely connected, we focus only on the largest couplings, which can be explained by closely related activities of the considered stocks. Figure 7(a) shows that with equilibrium inference, more than half the stocks in the data can be displayed on a network where almost all links have simple economic interpretations.

The network of figure 7(a) presents different communities, where each color represents one industrial sector. They are mostly determined by a common industrial activity, and some of the links are very easy to explain by proximity of activities (and are often quite robust). Examples are the pairs FNM - FRE (Fannie Mae - Freddie Mac, active in home loans and mortgages), UNP - BNI (Union Pacific Corporation - Burlington Northern Santa Fe Corporation, railroads), BLS - SBC (BellSouth - SBC Communications, two telecommunications companies now merged in AT&T), DOW - DD (Dow - DuPont, chemical companies), MRK - PFE (Merck & Co. - Pfizer, pharmaceutical companies), and KO - PEP (The Coca-Cola Company - PepsiCo, beverages). These last two companies display strong links with the medical sector at different scales of volume and time, as KO here with MDT (Medtronic) and JNJ (Johnson & Johnson). The medical sector is itself linked to the pharmaceutical sector with PFE, MRK, LLY (Lilly), BMY (Bristol-Myers Squibb) and SGP (Schering-Plough). Telecommunications (BLS, SBC) are linked to electric power with DUK (Duke Energy).

Figure 7: Inferred financial networks, showing only the largest interaction strengths (proportional to the width of links and arrows). Colors are indicative, and chosen by a modularity-based community detection algorithm [16]. ∆t = 100 seconds. (a): equilibrium inference; (b): asynchronous inference, with τ = 20 seconds.

GE (General Electric) is, for a large range of parameters, a very central node, which is consistent with its diversified activities. In figure 7(a) the relation between PG (Procter & Gamble) and WMT (Walmart), both retailers of consumer goods, appears at this level of interaction strength through GE. The banking sector, shown as a chain in light blue, is linked to the sector of electronic technology (in dark blue). Moreover, the defense and aerospace sector, shown in magenta, is linked to engines and machinery with CAT (Caterpillar Inc.) and DE (John Deere), and, more strangely, to packaged food with CAG (ConAgra Foods), SYY (Sysco) and K (Kellogg Company).

Figure 7(b) presents the results from asynchronous nMF under the same conditions. It shows that the results of equilibrium and asynchronous inference are consistent, and that asynchronous inference provides additional information, as it infers a directed network; for instance, the financial sector is here seen to be influenced also by the medical sector. Detailed descriptions of each stock can be found in [113].

From these network examples we draw two basic conclusions. First, they show the market mode (most of the interaction strengths found are positive, which indicates that the financial market has clear collective behavior) [110,111], even though only trading-time and volume information is considered: stocks tend to be traded, or not traded, at the same time. In addition, the strongest inferred interactions can easily be understood from similarities in the industrial activities of the considered stocks. This means that financial activity tends to concentrate on a certain activity sector at a certain time.
For price dynamics this phenomenon is well known [109,112,114], but it is perhaps more surprising that it appears also when only traded-volume information is used.
5. Fitness inference in population genetics
We now turn to inverse Ising (Potts) techniques applied to sequence-type biological data. This has variously been called Direct Coupling Analysis (DCA) [36,43,46,47] and max-entropy modeling [35,115]; as noted above, other names are also in use. We will use the terms inverse Ising and DCA interchangeably.

A common feature of all these applications is that the input is a static table of N · L symbols. Each row is a sequence of L symbols from data, and there are N such rows (N samples). A breakthrough application has been to identify residues (amino acid molecules) that are spatially close in proteins (chains of amino acids). The table then represents a family of proteins with supposedly similar structure and supposedly the same origin, and each row is the amino acid sequence of a member of that family [36,43,46,47]. The basic idea is that two columns in the table (two positions in the protein structure) have non-trivial statistical dependency if their joint variation influences biological fitness. Such co-dependency in biological fitness is called epistasis. The most immediate cause of epistasis among loci inside one gene coding for one protein is through structure [116]. Often this is pictorially motivated by a mutation changing the charge, hydrophilicity/hydrophobicity or size of one member of a residue pair, which then changes the relative fitness of variants (alleles) of the other member of the pair. In certain other cases dependencies discovered by DCA can be attributed to other causes than structure [117,118], but those cases appear to be relatively rare.

Many details are needed to turn the above into a powerful tool for protein structure prediction. One aspect is that proteins in a family typically have different lengths, and that therefore the N · L table is not directly taken from the data, but only after multiple sequence alignment, which has to be done with the help of bioinformatics software, or the ready alignment taken from a database such as PFAM [119,120]. Another is that predicting contacts is only one ingredient in a much larger computational pipeline which uses inter-molecular force fields, predictions of secondary structure and solubility, and know-how developed in the protein science community over many years. Still, impressive results have been achieved [121–123]. It should be noted that if the goal is to predict protein structure, a purely data-driven approach is possible, where a model of the deep neural network type is trained on large training sets comprised of sequence-structure pairs. As has been widely reported, such an approach from Google DeepMind currently outperforms model-based learning methods such as DCA for this task [124,125]. The price is a computational cost beyond what most academic researchers can afford, and a lack of interpretability of the inferred model, which could be close to (1), but could also be very different.

Beyond protein structures, DCA has been used to predict nucleotide-nucleotide contacts in RNAs [126], multiple-scale protein-protein interactions [118], amino acid–nucleotide interactions in RNA-protein complexes [127], interactions between HIV and the host immune system [128–130], and other synergistic effects not necessarily related to spatial contacts [131–133]. Of particular relevance for the following are applications of DCA to whole-genome sequence data from bacterial populations, in [134] on Streptococcus pneumoniae and in [135] on
Neisseria gonorrhoeae. Standard versions of DCA are rather compute-intensive for genome-scale inference tasks, but methodological speed-ups [136,137] and alternative approaches [138] have quickly been developed. Antibiotic resistance is an important medical problem throughout the world, and so is the relative paucity of new drugs. Combinatorial drug treatments are therefore promising avenues in the search for new treatment strategies. The obstacle is the combinatorial explosion of combinations: if there are L potential individual targets there are L(L−1)/2 potential target pairs, and so on. The hope is that DCA could be one way (out of many) to predict which combinations may have an effect, on the grounds that they are already reflected as epistasis in natural sequence data. In that respect it was promising that Skwark et al in [134] were able to retrieve interactions between members of the Penicillin-Binding Protein (PBP) family of proteins; resistance to antibiotics of the β-lactam family of compounds is in S. pneumoniae associated with alterations in their target enzymes, which are the PBPs [139].

Evolution is a dynamic process. We should imagine that the biological sequence data used in DCA are as in Section 2.2 (or more involved). The underlying dynamics is that of N_tot sequences (a number which could change with time, but which we will assume constant) of length L (which could also change, but which we will also assume constant), which evolve for a time T. At the end of the process we sample N sequences. In protein data T is typically of the order of hundreds of millions of years, and the model is obviously simplified. In the bacterial whole-genome data of [134,135], T may be as short as years or decades, and the model may be closer to reality. In any case, the goal is to infer fitness from the sampled sequences, and to understand when that can (or cannot) reasonably be done by DCA.

We will structure the discussion as follows. In Section 5.1 we will discuss the dynamics of a population in a fitness landscape, on which there is a large literature both in population genetics and in statistical physics. We will there define what we mean by fitness, and introduce recombination. In Section 5.2 we will present the important concept of Quasi-Linkage Equilibrium (QLE), originally due to Kimura, and in Section 5.3 we will state the relation between inferred interactions and underlying fitness that holds in QLE. Numerical examples and tests are presented in Section 5.4.

That there exist formal similarities between the dynamics of genomes in a population and entities (spins) in statistical physics has been known for a long time. Fokker-Planck equations to describe the change of probability distributions over allele frequencies were introduced by Fisher almost a century ago [140,141], and later, in a very clear and concise manner, by Kolmogorov [142]. The link has been reviewed several times from the side of statistical physics, for instance in [143] and [144]. Central to the discussion in the following will be recombination (or sex), by which two parents give rise to an offspring whose genome is a mixture of the genomes of the parents. From the point of view of statistical physics, recombination is a kind of collision phenomenon. It therefore cannot be described by linear equations (Fokker-Planck-like equations) but can conceivably be described by nonlinear equations (Boltzmann-like equations). The mechanisms to be discussed are of this type, where Boltzmann's
Stosszahlansatz is used to factorize the collision operator.

All mammals reproduce sexually, as do almost all birds, reptiles and fishes, and most insects. Many plants and fungi can reproduce both sexually and asexually. Recombination in bacteria is much less of an all-or-none affair: typically only some genetic material is passed from a donor to a recipient, directly or indirectly. The main forms of bacterial recombination are conjugation (direct transfer of DNA from a donor to a recipient), transformation (the ability to take up DNA from the surroundings), and transduction (transfer of genetic material by the intermediary of viruses). The relative rate of recombination in bacteria varies greatly between species, and also within one species, depending on conditions. As one example we quote a tabulation of the ratio of recombination rate to mutation rate in
S. pneumoniae, which has been measured to vary from less than one to over forty [145]. There are long-standing theoretical arguments against the possibility of complex life without sex, as a consequence of Eigen's “error catastrophe” [146,147]. It is likely that most forms of life use some form of recombination, albeit perhaps not all the time, and though the relative rate of recombination to other processes may be small. In the following we will eventually assume that recombination is faster than other processes, which may be as much the exception as the norm in bacteria and other microorganisms. Such a “dense-gas” regime (in the analogy with collisions) is however the one for which a theory is available and can be used at the present time.

The driving forces of evolution are hence assumed to be genetic drift, mutations, recombination, and fitness variations. The first refers to the element of chance: in a finite population it is not certain which genotypes will reproduce and leave descendants in later generations. The last three describe the expected success or failure of different genotypes.

Genetic drift can be explained by considering N different genomes s_1, . . . , s_N. Under neutral evolution all genomes have an equal chance to survive into the next generation, but that does not mean all will do so. In a Wright-Fisher model one considers a new generation with N new genomes s'_1, . . . , s'_N, where each one is drawn randomly with uniform probability from the previous generation. The chance (or risk) that a given individual does not survive from one generation to the next is then (1 − 1/N)^N, which is about e^{−1} ≈ 0.37 for large N. A model of N individuals evolving under mutations and genetic drift which happen synchronously is also called a Wright-Fisher model, and when they happen asynchronously, a Moran model [144]. If the genome (or the variability of the genome) consists of only one biallelic locus (one Ising spin, L = 1), then the state of a population is given by the number k of individuals where the allele is “up” (N − k individuals then have the “down” allele). The dynamics of the Moran model can be seen as the dynamics of N spins, where each spin can flip on its own, copy the state of another spin, or do nothing. It can also be seen as transitions on a finite lattice, where k = 0 means all spins are down, and k = N means all spins are up. The probability distribution over this variable changes by a Master equation, where the variable can take values 0, 1, . . . , N. If the mutation rate is zero, the two end states of the lattice are absorbing: eventually all individuals will be up, or all will be down. If the mutation rate is non-zero but small, the stationary probability distribution is concentrated on small and large k, and transitions between the two macroscopic states happen only rarely. For a very pedagogical discussion of these classical facts we refer to [144].

The evolution of the distribution over L biallelic loci under mutations has many similarities to (15), and always satisfies detailed balance (this is not generally true for more than one allele per locus and a general mutation matrix). As for the rate r_i(s) in (6), the rate of mutations can and generally does depend on genomic position (“mutation hotspots”) and on the alleles at other loci (“genomic context”). For theoretical discussion and simulations it is however more convenient to assume an overall uniform flipping rate, as in (16).
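The neutral part of this dynamics (drift plus mutations) is straightforward to simulate directly; below is a minimal Wright-Fisher sketch, with all names and parameter values our own, which for small μ indeed spends most of its time near the absorbing states k = 0 and k = N.

```python
import numpy as np

def wright_fisher_drift(N=200, L=1, mu=0.001, T=2000, seed=0):
    """Neutral Wright-Fisher dynamics of N binary (+/-1) genomes of length L.

    Each generation, N offspring are drawn with replacement from the current
    population (genetic drift); every locus then flips with probability mu.
    Returns the trajectory k(t) of 'up'-allele counts at the first locus.
    """
    rng = np.random.default_rng(seed)
    pop = rng.choice([-1, 1], size=(N, L))
    k = np.empty(T, dtype=int)
    for t in range(T):
        pop = pop[rng.integers(N, size=N)]        # resample parents uniformly
        flips = rng.random(pop.shape) < mu        # uniform mutation rate
        pop = np.where(flips, -pop, pop)
        k[t] = np.sum(pop[:, 0] == 1)
    return k
```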
A fitness landscape means a propensity for a given genotype to propagate its genomic material to the next generation. Fitness relations can in general be more complicated, and even non-transitive, as in game-theoretic settings where “A beats B”, “B beats C” and “C beats A” [148–152]. We will assume that the fitness of a genotype s which carries allele s_i at locus i depends on single-locus variations and pair-wise co-variations, that is

F(s) = F_0 + \sum_i f_i s_i + \sum_{ij} f_{ij} s_i s_j \qquad (51)

The first term is an overall constant. The second term, linear in the genome, is called the additive component of fitness. The last term, quadratic in the genome, is called the epistatic component of fitness. The dynamics due to Darwinian selection on the level of populations is thus

\frac{\partial P(s)}{\partial t}\Big|_{sel} = P(s) \left( F(s) - \langle F(s) \rangle \right) \qquad (52)

where ⟨F(s)⟩ = Σ_s F(s) P(s) is the average fitness over the population. Evolutionary dynamics due to fitness in a fitness landscape is thus quadratic in the distribution function, since P(s)⟨F(s)⟩ is quadratic in P(s). The conditions under which the combined dynamics under mutations and fitness satisfies detailed balance is a kind of integrability condition. On the level of dynamics of allele frequencies this condition is known as the existence of a Svirezhev-Shahshahani potential [153–156]; see also [157].

Recombination (or sex) is the mixing of genetic material between different individuals. In diploid organisms (such as humans) every individual has two copies of each separate component of its genetic material (chromosome), where one comes from the father and one from the mother, each of whom also has two copies, one from each grandparent. When passing from the parents to the child, the material from the grandparents is mixed in the process called cross-over, so that one chromosome of the child inherited from one parent typically consists of segments alternately taken from the two chromosomes of that parent.

In haploid organisms the situation is both simpler, since each organism has only one copy of its genetic material, and more complicated, since the mixing of information can happen in many different ways. It is convenient to postulate a dynamics like a physical collision process

\frac{\partial P_1(s)}{\partial t}\Big|_{rec} = r \sum_{\xi, s'} C(\xi) \left[ Q(s^{(1)}, s^{(2)}) P_2(s^{(1)}, s^{(2)}) - Q(s, s') P_2(s, s') \right] \qquad (53)

where r is an overall rate of sex compared to other processes, Q(s^{(1)}, s^{(2)}) is the chance of individuals s^{(1)} and s^{(2)} mating (reaction probability), and C(ξ) is the chance that they produce an offspring given by a pattern ξ [158,159] (probability of the outcome of the reaction). On the left-hand side we have the single-genome distribution function P_1, and on the right-hand side the two-genome distribution function P_2; the equations are closed by a Stosszahlansatz

P_2(s^{(1)}, s^{(2)}) = P_1(s^{(1)}) P_1(s^{(2)}) \qquad (54)

The complexities of recombination can then be accommodated by the two functions Q and C. A method to infer recombination hotspots in bacterial genomes was discussed in [160], and the issue was also discussed in [161], in relation to the same “Maela” data set used in [134]. Detailed descriptions of the relation between (s, s′) on the one hand and (s^{(1)}, s^{(2)}) on the other, as parametrized by ξ, can be found in [158] and [159].
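For small L, the population-level dynamics can also be prototyped directly on the genotype distribution; the following is a minimal Euler step of the selection term (52), with all names ours (note that the double sum in (51), as written, counts each symmetric pair twice).

```python
import numpy as np
from itertools import product

def selection_step(P, f, f2, dt=0.01):
    """One Euler step of the selection dynamics (52) for L biallelic loci.

    P  : array of length 2**L, probability of each genotype
    f  : additive fitness components f_i (length L)
    f2 : epistatic components f_ij (L x L, symmetric, zero diagonal)
    """
    L = len(f)
    genotypes = np.array(list(product([-1, 1], repeat=L)))  # all 2**L genomes
    F = genotypes @ f + np.einsum("si,ij,sj->s", genotypes, f2, genotypes)
    F_mean = P @ F                                          # <F(s)>
    P = P + dt * P * (F - F_mean)                           # eq. (52)
    return P / P.sum()                                      # guard rounding drift
```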
5.2. The Quasi-Linkage Equilibrium Phase

The concept of Quasi-Linkage Equilibrium (QLE) and its relation to sex was discovered by the population geneticist Motoo Kimura [162–164], and later developed further by Richard Neher and Boris Shraiman in two influential papers [158,165]. We will refer to this theory of QLE as Kimura-Neher-Shraiman (KNS) theory. To define QLE and state the main result of KNS we must first introduce the simpler concept of Linkage Equilibrium (LE), which goes back to the work of Hardy and Weinberg more than a century ago [166,167].

Consider two loci A and B at which there can be, respectively, n_A and n_B alleles. The configuration of one genome with respect to A and B is then (x_A, x_B), where x_A takes values in {1, . . . , n_A} and x_B takes values in {1, . . . , n_B}. The configuration of a population of N individuals is the set [(x_A^{(s)}, x_B^{(s)})], where s ranges from 1 to N. This set defines the empirical probability distribution with respect to A and B as

P_{AB}(x_A, x_B) = \frac{1}{N} \sum_{s=1}^{N} \delta_{x_A^{(s)}, x_A}\, \delta_{x_B^{(s)}, x_B}, \qquad (55)

where δ_{a,b} is the Kronecker delta. Similarly we can define distributions over one locus, P_A(x_A) = (1/N) Σ_{s=1}^{N} δ_{x_A^{(s)}, x_A}, and analogously P_B(x_B). The distribution of genomes in a population over loci A and B is said to be in Linkage Equilibrium (LE) if the alleles x_A and x_B are independent under the empirical distribution, i.e. if P_{AB}(x_A, x_B) = P_A(x_A) P_B(x_B). All other distributions are in Linkage Disequilibrium (LD).
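A sketch of the empirical distribution (55) and of its deviation from LE (names and conventions are ours):

```python
import numpy as np

def linkage_disequilibrium(xA, xB, nA, nB):
    """Empirical two-locus distribution (55) and its deviation from LE.

    xA, xB : integer allele indices (0..nA-1, 0..nB-1) for N individuals.
    Returns P_AB and the matrix P_AB - P_A P_B, which vanishes in LE.
    """
    N = len(xA)
    P_AB = np.zeros((nA, nB))
    np.add.at(P_AB, (xA, xB), 1.0 / N)       # sum of Kronecker deltas
    P_A, P_B = P_AB.sum(axis=1), P_AB.sum(axis=0)
    return P_AB, P_AB - np.outer(P_A, P_B)
```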
Specializing, for completeness, to the case of interest here, where all loci are biallelic and epistatic contributions to fitness are quadratic (pairwise dependencies): Quasi-Linkage Equilibrium is the subset of distributions in LD where the joint distribution over loci is the Gibbs distribution (1). The fundamental insight of Kimura was that such distributions appear naturally in sexually reproducing populations where recombination is fast [162–164]. In this setting epistatic contributions to fitness are a small effect, since there is a lot of mixing of genomes between individuals from one generation to the next. The dependencies (parameters J_{ij}) are also small, such that the distributions over alleles are almost independent. In other words, the distributions in QLE which appear in KNS theory are close to being in Linkage Equilibrium. Nevertheless, the parameters θ_i and J_{ij} in (1) are here consequences of a dynamical evolution law.

The derivation of (1) from the dynamics described above in Section 5 has been given in the recent literature [158,159] and will therefore not be repeated here. We will instead just state the most important result of KNS theory. This is

J_{ij} = \frac{f_{ij}}{r c_{ij}} \qquad (56)

where c_{ij} characterizes the amount of recombination between loci i and j. Referring to the dynamics (53), this quantity is defined as

c_{ij} = \sum_{\xi} C(\xi) \left( \xi_i (1 - \xi_j) + (1 - \xi_i) \xi_j \right) \qquad (57)

In words, c_{ij} is simply the probability that the alleles at loci i and j were inherited from different parents. In most models of recombination this will depend on the genomic distance between i and j, such that c_{ij} will be close to zero when i and j are close, and then grow towards 1/2 when they are far apart.

5.3. Inferred interactions and underlying fitness

Turning the concepts around, (56) can be interpreted as an inference formula for epistatic fitness from genomic data:

f^{*}_{ij} = J^{*}_{ij} \cdot r c_{ij} \qquad (58)

where * indicates an inferred value. The parameter J^{*}_{ij} can be determined from data by DCA, while the parameters r and c_{ij} have to be determined by other means. However, since the QLE phase is characterized by the J_{ij} being small, or, alternatively, by f_{ij} being smaller than r c_{ij}, we can make the simplifying assumption that c_{ij} ≈ 1/2 for all pairs of loci we consider. Formula (58) then says that the underlying fitness parameters f_{ij} are proportional to the inferred Ising parameters J_{ij}, with the proportionality constant r c_{ij} ≈ r/2 taken into account, as long as f_{ij} remains smaller than r c_{ij}. It is not currently clear if there also exists an extension of (58) which holds when f_{ij} is of the order of, or larger than, r c_{ij}, including the case when i and j are close (“hitch-hiking mutations”).

5.4. Numerical examples and tests

We here describe results obtained from simulating a finite population using the FFPopSim software [168]. Partial results in the same direction were reported in [159]; more complete results, though not using the singletime versions of the algorithms as we will here, in [169].

In a finite population, statistical genetics as described above only holds on average; when following one population in time, fluctuations of order N^{-1/2} appear for observables such as single-locus frequencies and pair-wise loci-loci correlations. Figures 8(a) and (b) report FFPopSim simulations of allele frequencies and of one specified pair-wise loci-loci correlation, showing that these fluctuations can in practice (in simulations) be quite large. Figure 8(c) presents the fitness reconstructed by DCA-PLM (blue dots) and DCA-nMF (red dots) against the underlying fitness. Both methods exhibit a clear trend along the diagonal, though with fluctuations.
Figure 8: Left panel (a): temporal behavior of all allele frequencies f_i; data recorded every 5 generations. Middle panel (b): an example of a pairwise correlation changing with time; with finite population size there are strong fluctuations in the system. Right panel (c): scatter plot of the reconstructed against the underlying fitness, with DCA-nMF (red dots) and DCA-PLM (blue dots) used for the J_{ij}. Parameters: number of loci L = 25, number of individuals N = 200, mutation rate μ = 0.01, recombination rate r = 0.1, crossover rate ρ = 0.5.

In the following simulations the number of loci is L = 25, the number of individuals N = 500 (to avoid singular single-generation correlation matrices), the number of generations T = 500 × 5, and the crossover rate ρ = 0.5. The varied parameters are the mutation rate μ, the recombination rate r and the strength of fitness σ. In the following we discuss what one can observe by systematically varying these three parameters.

Furthermore, it is of interest to see how KNS inference performs when averaging results from singletime data. By this we mean that we infer parameters from snapshots, and then average those inferred parameters over the times of the snapshots. In [169], in contrast, inference using alltime versions of the data was studied, where inference is done only once on the pooled data. In figure 9 we show the phase diagrams of epistatic fitness inference. The color indicates the relative root-mean-square error of the fitness reconstruction, where lighter color means larger error (the mean square error ϵ as defined in (47) is used, for consistency, in the scatter plots below). Panel (a) spans the mutation rate μ and the recombination rate r, while panel (b) spans the fitness strength σ and r. Both show broad ranges of parameters where KNS theory recovers fitness well. The inference phase diagrams based on singletime data shown here are quite close to those presented in [169] using alltime data.

Figure 9: Phase diagram for epistatic fitness recovery with DCA-nMF J_{ij} from the average of singletime data. Left panel (a): mutation rate μ versus recombination rate r; for large recombination and low mutation, KNS inference works but with heavy fluctuations, as shown in figure 10(c), while for small r the assumptions of KNS theory are not satisfied. Right panel (b): epistatic fitness strength σ versus r; for large recombination and very small fitness, KNS inference does not work.

When mutation rates are very low, the frequencies of most loci are frozen at 0 or 1 most of the time. This is a classical fact for the evolution of one single locus, as discussed above, but it also holds more generally; for an evolving population simulated with the FFPopSim software it was demonstrated in [169]. In this regime fitness recovery is hence impossible, as there is not enough variation. On the other hand, KNS inference theory does not hold for high enough μ, as one of its assumptions is that recombination is a faster process than mutation. Thus, three points on the μ-r phase diagram are picked, with the same μ and differing r, marked as a sad face, a smiling face and a not-that-sad face respectively. The corresponding scatter plots are presented in figure 10. As expected, KNS inference works, though with very heavy fluctuations, for very high r, and does not hold for low r.

Figure 10: Scatter plots of inferred epistatic fitness against the underlying fitness, based on the averaged results from singletime data; the DCA-nMF algorithm is used for the J_{ij}. Left panel (a), sad face: r = 0.1, KNS theory cannot be satisfied here (ϵ = 6.66e-06). Middle panel (b), smiling face: r = 0.5, inference works (ϵ = 3.99e-06). Right panel (c), not-that-sad face: r = 0.9, KNS theory works but with very heavy fluctuations (ϵ = 6.61e-06).

To see if there are differences between inference with averages over singletime data and alltime inference, the scatter plots corresponding to figure 10 (singletime) are presented in figure 11 (alltime). For the parameters illustrated here, the difference between the two approaches is small.

Figure 11: Scatter plots corresponding to those of figure 10, but from alltime averages; the DCA-nMF algorithm is used for the J_{ij} here as well. The parameters of each sub-panel are the same as in figure 10 (ϵ = 6.64e-06, 4.01e-06 and 6.58e-06 for panels (a), (b) and (c) respectively).
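To make the procedure behind figures 9–11 concrete, the following sketch combines snapshot-wise DCA-nMF with the rescaling (58) under the simplifying assumption c_ij ≈ 1/2. It is a bare-bones illustration with our own function names, not the FFPopSim-based pipeline itself.

```python
import numpy as np

def kns_fitness(snapshots, r, c=0.5):
    """Snapshot-averaged ('singletime') epistatic-fitness reconstruction, eq. (58).

    snapshots : list of (N, L) arrays of +/-1 genotypes, one per recorded time
    r         : recombination rate; c approximates c_ij for all pairs
    """
    J_sum = 0.0
    for s in snapshots:
        ds = s - s.mean(axis=0)
        C = ds.T @ ds / len(s)                # connected correlation matrix
        J = -np.linalg.inv(C)                 # DCA-nMF couplings J*
        np.fill_diagonal(J, 0.0)
        J_sum = J_sum + J
    return (J_sum / len(snapshots)) * r * c   # f*_ij = J*_ij * r * c_ij
```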
6. Discussion and Perspectives
Inverse Ising/Potts, or DCA, has emerged as a powerful paradigm of biological data analysis which has helped to revolutionize protein structure prediction. For the first time it has been shown to be possible to predict protein structure from sequence, though crucially from many similar sequences, not from a single one. The central idea which has made this possible is to exploit statistical dependencies encoded by a postulated Gibbs distribution (1) of the Ising/Potts form over sequence space. While DCA has recently been overtaken by more complex AI learning methods of the deep learning type, it remains the case that it was the success of DCA that showed this to be possible. Many other applications have appeared, some of them in areas where AI learning methods are not likely to succeed due to lack of training examples.

In this review we have striven to put these developments in the context of statistical physics. On some level a distribution over sequences must be arrived at by an evolutionary process which, though it may be complicated, shares aspects of non-equilibrium spin dynamics. Indeed, these analogies have been noted for a long time, and have been explored from the viewpoints of both (theoretical) population genetics and statistical physics. We have here added the dimension of learning: how knowledge of the type of dynamics and inference techniques can be used together to deduce biological parameters from data. We have also considered more direct applications of kinetic Ising models to model the evolution of neurons and of economic data, and how to infer connections from such data.

The main conclusions are as follows. First, we have stressed that dynamics that does not fulfill detailed balance can have practically arbitrarily complicated stationary states, even if interactions are only pair-wise. It can therefore not be the case that inverse Ising/Potts generally gives useful information: in the wrong parameter phase it is instead much more likely to yield garbage. The simplest example is inference in the asynchronous kinetic Ising models discussed in Section 3: those models contain parameters (the anti-symmetric combination J_{ij} − J_{ji}) that are simply not present in the Ising distribution (1). DCA, by whichever algorithm, will therefore never be able to find them. Even more, the stationary distribution in such models is quite different from (1), and DCA is also not able to find the symmetric combination J_{ij} + J_{ji} either (unless the anti-symmetric combination is relatively small). On the other hand, straightforward methods relying on inference from time series are able to recover symmetric and anti-symmetric combinations equally easily. The moral of this part of our review is simple: if you have time series data, you should use it to infer dynamic models; it is both a more general and an easier procedure.

Second, we have considered evolutionary dynamics in finite populations under selection, mutations and recombination. Following the pioneering work of Kimura and, more recently, of Neher and Shraiman, we discussed how the high-recombination regime leads to a distribution of the type (1), where the parameters can be inferred by DCA. We have noted that in the same high-recombination regime the effective interaction parameters are small, which corresponds to the high-temperature regime in inverse Ising. Hence inference in the high-recombination regime is limited by finite-sample noise.
Given finite data, inference therefore works best in an intermediate regime: not too high recombination (because then the statistical co-variances will be too weak), and not too low recombination (because then the Kimura-Neher-Shraiman theory does not apply). Crucially, we have observed that though the parameters inferred by DCA on such data are related to fitness, they are not the fitness parameters governing the evolutionary dynamics itself. The relation is admittedly a simple proportionality, at least for pairs of loci far enough apart on the genome, but it is not an identity. The moral of this part of our review is thus: if you have a theory connecting the underlying mechanism which you want to clarify to the data which you can use, then you are well advised to analyze the data using that theory.

Many open questions remain in the field of DCA, of which we will only discuss some that are closely connected to the main thrust of our argument. Kimura-Neher-Shraiman (KNS) theory is a huge step forward towards an understanding of what is actually inferred in such a procedure, but it is obviously only a first step. Most directly, both figure 10(a) and figure 11(a) strongly suggest a functional relationship. These plots were obtained in parameter regions where KNS theory cannot be expected to be valid, and indeed it is not: the mean square error between inferred and underlying fitness parameters is large. Since the plots suggest a functional relationship, there should however be another theory, which at this point is unknown. In other words, KNS is not the end, but should be the starting point for developing theories connecting fitness (and other evolutionary parameters) to distributions over sequences in much wider settings, and ways to learn such parameters from sequence data. In particular, since KNS theory is only valid when the fitness parameters f_{ij} are smaller than the compounded recombination parameters r c_{ij}, KNS is likely not valid for the very strongest epistatic effects, which are potentially the most interesting and biologically relevant.

Much work also deserves to be done to incorporate further biological realism into KNS and/or its successor theories and software. Among the many important effects (most discussed above) which have not been taken into account in this review we list

• Multi-allele loci
• Realistic mutation matrices that vary over a genome and depend on the transitions
• Mutations that do not act on single loci, i.e. insertions and deletions (indels)
• Other models of fitness and other distributions of e.g. the pair-wise parameters f_{ij}
• More realistic models of recombination, incorporating also recombination hotspots
• More types of recombination, as appropriate for bacterial evolution
• Effects of population growth and bottle-necks

Many kinds of simulation software have been developed in the computational biology community, for instance the fwdpp [170] software suite used recently in [171]. The objective would not be to redo or replace such software packages, but to reuse them in the context of theory-driven inference.

One further direction important to pursue is the effect of spatial and environmental separation, believed to be a main mechanism behind speciation and the emergence of biological variation in general. Its effects in models of the Wright-Fisher-Moran type were emphasized in [144]. Spatial separation would in general tend to counter-act recombination, in that individuals which could recombine if they met are in fact unlikely to meet.
For instance, a bacterium with one of the highest known recombination rates is the human pathogen
Helicobacter pylori (the cause of stomach ulcers), but two such bacteria can actually only recombine when they find themselves in the stomach of the same host. Strains of H. pylori can thus be distinguished on a global scale, and only merge when their human host populations overlap [172].
Acknowledgment
The material in Sections 2, 3 and 4 is partly based on the PhD work of one of us (HLZ, Aalto University, Finland, 2014). We thank colleagues and collaborators from that time, particularly Mikko Alava, John Hertz, Hamed Mahmoudi, Matteo Marsili and Yasser Roudi. The material in Section 5 was motivated by many discussions in the statistical physics community, for which we thank Johannes Berg, Simona Cocco, Bert Kappen, David Lacoste, Andrey Lokhov, Rémi Monasson, Roberto Mulet, Andrea Pagnani, Luca Peliti, Federico Ricci-Tersenghi, Chris Sander, Alex Schug, Erik Van Nimwegen, Martin Weigt, Riccardo Zecchina, and many others. For the actual development of the material in Section 5, one of us (EA) is much indebted to a discussion with Boris Shraiman, and to collaborations and/or discussions with Magnus Ekeberg, Yueheng Lan, Cecilia Lövkvist, Martin Weigt, Marcin Skwark, Andrea Pagnani, Christoph Feinauer, Hai-Jun Zhou, Chen-Yi Gao, Angelo Vulpiani, Fabio Cecconi, Timo Koski, Arne Elofsson, Daniel Falush, Kaisa Thorell, Yoshiyuki Kabashima, Jukka Corander, Santeri Puranen, Yingying Xu, Alexander Mozeika, Rémi Lemoy and Onur Dikmen. The material in Section 5.4 is the outcome of an ongoing collaboration with Simona Cocco, Eugenio Mauro and Rémi Monasson, whom we thank for many fruitful discussions and suggestions at various stages of the work. For this part we also thank Richard Neher for the FFPopSim software package, and for valuable comments. We finally thank Johannes Berg and Andrey Lokhov for constructive comments on the manuscript.
References

[1] Mézard M, Parisi G and Virasoro M A 1987
Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications (World Scientific) [2] Fischer K and Hertz J A 1991
Spin Glasses (Cambridge University Press) [3] Mézard M and Montanari A 2009
Information, Physics, and Computation (Oxford University Press)[4] Schneidman E, Berry M J, Segev R and Bialek W 2006
Nature
Front. Comput. Neurosci. Adv. Phys. Foundations and Trends in Machine Learning C. R. Acad. Sci. Paris
Trans. Am. Math. Soc. Phys. Rev.
PLOS Comput. Biol. PLOS Comput. Biol. e1004726 [13] Amari S I, Barndorff-Nielsen O E, Kass R E, Lauritzen S L and Rao C R 1987 Chapter 2: Differential Geometrical Theory of Statistics [14] Amari S I and Nagaoka H 2000 Methods of Information Geometry, Translations of Mathematical Monographs vol. 191 (American Mathematical Society) [15] Kampen N V 2007
Stochastic Processes in Physics and Chemistry: Third Edition (Elsevier)[16] Kuramoto Y 1984
Chemical Oscillations, Waves, and Turbulence (Springer Series in Synergistics)[17] Goldbeter A 2010
Biochemical Oscillations and Cellular Rhythms: The Molecular Bases of Periodic and Chaotic Behaviour (Cambridge University Press) [18] Papadimitriou C H 1991 Proceedings 32nd Annual Symposium on Foundations of Computer Science pp. 163–169 [19] Selman B, Kautz H and Cohen B 1996
DIMACS Series in Discrete Mathematics and Theoretical ComputerScience vol. 26[20] Barthel W, Hartmann A K and Weigt M 2003
Phys. Rev. E Neural Information Processing Systems 2004 [22] Seitz S, Alava M and Orponen P 2005
J. Stat. Mech.: Theory Exp
P06006[23] Alava M, Ardelius J, Aurell E, Kaski P, Krishnamurthy S, Orponen P and Seitz S 2008
Proc. Natl. Acad.Sci. U.S.A.
Discrete Appl. Math.
Phys. Rev. E Phys. Rev. Lett.
Front. Comput. Neurosci. Phys. Rev. E Phys. Rev. Lett.
J. Stat. Phys.
J. Stat. Mech.: Theory Exp
P07023[32] Dettmer S L, Nguyen H C and Berg J 2016
Phys. Rev. E J. Stat. Mech.: Theory Exp
J. Stat. Mech.: Theory Exp
PLOS Comput.Biol. e1003408[36] Cocco S, Feinauer C, Figliuzzi M, Monasson R and Weigt M 2018 Rep. Prog. Phys. J. Phys. A: Math. Gen. L675[38] Roudi Y, Tyrcha J and Hertz J 2009
Phys. Rev. E Cogn. Sci. Neural Comput. Philos. Mag. J. Physiol.-Paris
Proc. Natl. Acad. Sci. U.S.A.
J. Stat. Mech.: Theory Exp
P08015 [45] Besag J 1975
J. Royal Stat. Soc. D Proc. Natl. Acad. Sci. U.S.A.
E1293[47] Marks D S, Colwell L J, Sheridan R, Hopf T A, Pagnani A, Zecchina R and Sander C 2011
PLoS ONE e28766[48] Jones D T, Buchan D W A, Cozzetto D and Pontil M 2012 Bioinformatics [50] Kamisetty H, Ovchinnikov S and Baker D 2013 Proc. Natl. Acad. Sci.
Phys. Rev. E J. Comput. Phys.
Ann. Stat. Proceedings of the 22nd International Conference on Neural InformationProcessing Systems
NIPS-09 (Red Hook, NY, USA: Curran Associates Inc.) pp. 1303–1311[55] Lokhov A Y, Vuffray M, Misra S and Chertkov M 2018
Sci. Adv. e1700791[56] Santhanam N P and Wainwright M J 2012 IEEE Trans. Inf. Theory Advances in Neural Information Processing Systems29 , (eds.) Lee D D, Sugiyama M, Luxburg U V, Guyon I and Garnett R (Curran Associates, Inc.) pp.2595–2603[58] Goel S, Kane D M and Klivans A R 2019
Proceedings of the Thirty-Second Conference on Learning Theory ,(eds.) Beygelzimer A and Hsu D vol. 99 of
Proceedings of Machine Learning Research (Phoenix, USA:PMLR) pp. 1449–1469[59] Vuffray M, Misra S and Lokhov A Y 2019 Efficient Learning of Discrete Graphical Models arXiv:1902.00600[60] Xu Y, Aurell E, Corander J and Kabashima Y 2017 ArXiv 1704.01459[physics.data-an][61] Xu Y, Puranen S, Corander J and Kabashima Y 2018
Phys. Rev. E PLOS Comput. Biol. e1004182[63] Glauber R J 1963 J. Math. Phys. J. Phys. Soc. Jpn J. Phys. Chem. Phys. Rev. Lett. Phys. Rev. A Commun. ACM J. Phys. Conf.
Phys. Rev. E J.Neurosci. Proc. Natl. Acad. Sci. U.S.A.
Phys. Rev. Lett.
J. Stat. Mech.: Theory Exp
P03031 [75] Mézard M and Sakellariou J 2011
J. Stat. Mech.: Theory Exp
L07001[76] Zhang P 2012
J. Stat. Mech.: Theory Exp
Phys. Rev. E Phys. Rev. E Nature
J. Stat. Mech.: Theory Exp
P10012[81] Zeng H L, Aurell E, Alava M and Mahmoudi H 2011
Phys. Rev. E Phys. Rev. Lett.
Scaling Limits of Interacting Particle Systems vol. 320 (Springer-Verlag)[84] Wainwright M J, Ravikumar P and Lafferty J D 2007
Adv. Neural. Inf. Process. Syst. Phys. Scr. J. Stat. Mech.: Theory Exp
P07008[87] Roudi Y, Dunn B and Hertz J 2015
Current Opinion in Neurobiology
38 (Large-Scale Recording Technology)
Curr. Opin. Struct. Biol. J. Stat. Mech.: Theory Exp
J. Neural. Eng. eLife e47012[92] Sadeghi K and Berry M J 2020 BioRxiv
Phys. Rev. E Market Microstructure and Liquidity The 27th Chinese Control and Decision Conference (2015 CCDC) (IEEE) pp. 238–243[96] Borysov S S, Roudi Y and Balatsky A V 2015
Eur. Phys. J. B Market Microstructure and Liquidity Entropy Entropy J. Phys. Conf.
[101] Maximum entropy and network approaches to systemic risk and foreign exchange, Ph.D. thesis (School of Arts & Sciences, Boston University) [102] Alossaimy A N M and Stemler T 2019
Using Complex Networks to Uncover Interaction in Stock Markets
Ph.D. thesis (The University of Western Australia)[103] Bucci F, Benzaquen M, Lillo F and Bouchaud J P 2019 arXiv:1901.05332 [104] Hoffmann T, Peel L, Lambiotte R and Jones N S 2020
Sci. Adv. eaav1478[105] Ikeda Y and Takeda H 2020 arXiv:2001.04097 [106] Segev R, Puchalla J and Berry M J 2006 J. Neurophysiol. J. Stat. Mech.: Theory Exp
P10012[108] Kappen H J and Rodriguez F 1998
Neural Comput. Eur. Phys. J. B Theory of financial risk and derivative pricing: from statistical physics to risk management (Cambridge University Press) [111] Mantegna R N and Stanley H E 2003 An introduction to econophysics: correlations and complexity in finance [112] Biely C and Thurner S 2008
Quant. Finance [113] Connectivity inference with asynchronously updated kinetic Ising models, Ph.D. thesis (Aalto University) [114] Kullmann L, Kertész J and Kaski K 2002
Phys. Rev. E PLOS Comput. Biol. e1004182[116] Phillips P C 2008 Nat. Rev. Genet.
855 [117] Gueudré T, Baldassi C, Zamparo M, Weigt M and Pagnani A 2016
Proc. Natl. Acad. Sci. U.S.A.
Proc. Natl. Acad. Sci. U.S.A.
E2662[119] Pfam 32.0 2018 http://pfam.xfam.org/[120] El-Gebali S, Mistry J, Bateman A, Eddy S R, Luciani A, Potter S C, Qureshi M, Richardson L J, SalazarG A, Smart A, Sonnhammer E L, Hirsh L, Paladin L, Piovesan D, Tosatto S C and Finn R D 2018
Nucleic Acids Res. D427 [121] Ovchinnikov S, Park H, Varghese N, Huang P S, Pavlopoulos G A, Kim D E, Kamisetty H, Kyrpides N C and Baker D 2017
Science
Bioinformatics i23[123] Ovchinnikov S, Park H, Kim D E, DiMaio F and Baker D 2018 Proteins et al. Nature
Nucleic Acids Res. Cell
Immunity Phys. Rev.E Proc. Natl. Acad.Sci. U.S.A.
E564[131] Figliuzzi M, Jacquier H, Schug A, Tenaillon O and Weigt M 2016
Mol. Biol. Evol. Nat.Biotechnol. Proc. Natl. Acad. Sci. U.S.A.
E9026[134] Skwark M J, Croucher N J, Puranen S, Chewapreecha C, Pesonen M, Xu Y Y, Turner P, Harris S R,Beres S B, Musser J M, Parkhill J, Bentley S D, Aurell E and Corander J 2017
PLoS Genet. e1006508 [135] Schubert B, Maddamsetti R, Nyman J, Farhat M R and Marks D S 2018 bioRxiv 325993 [136] Puranen S, Pesonen M, Pensar J, Xu Y Y, Lees J A, Bentley S D, Croucher N J and Corander J 2018 Microb. Genom. [137] Gao C Y, Zhou H J and Aurell E 2018 Phys. Rev. E Nucleic Acids Res. e112 [139] Hakenbeck R, Brückner R, Denapaite D and Maurer P 2012 Future Microbiol. P. Roy. Soc. Edinb. The Genetical Theory of Natural Selection (Clarendon) [142] Kolmogorov A N 1935
Dokl. Akad. Nauk SSSR J. Stat. Mech.: Theory Exp
P07018[145] Chaguza C, Andam C P, Harris S R, Cornick J E, Yang M, Bricio-Moreno L, Kamng’ona A W, Parkhill J,French N, Heyderman R S, Kadioglu A, Everett D B, Bentley S D and Hanage W P 2016 mBio e01053[146] Eigen M 1971 Naturwissenschaften Proc. Natl. Acad. Sci. U.S.A. Evolution and the Theory of Games (Cambridge University Press)[149] Nowak M A and Sigmund K 2004
Science
Phys. Rev. Lett.
Sci. Rep. Science
A New Mathematical Framework for the Study of Linkage and Selection (AmericanMathematical Society)[154] B¨urger R 2000
The mathematical theory of selection, recombination, and mutations vol. 228 (Wiley)[155] Svirezhev Y and Passekov V 2012
Fundamentals of mathematical evolutionary genetics vol. 22 (Springer Science & Business Media) [156] Huillet T E 2017
J. Stat. Phys.
Rev. Mod. Phys. Phys. Biol. Mol. Biol. Evol. Nat. Genet. Evolution J. Appl. Probab. Genetics Proc. Natl. Acad. Sci. U.S.A. et al.
Science Jahresh. Ver. vaterl. Naturkd. Württemb Genetics
Mol. Biol. Evol. PLoS Genet. 13