Inverse Ising inference by combining Ornstein-Zernike theory with deep learning
Soma Turi and Alpha A. Lee ∗ Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, United Kingdom
Inferring a generative model from data is a fundamental problem in machine learning. It is well known that the Ising model is the maximum entropy model for binary variables which reproduces the sample mean and pairwise correlations. Learning the parameters of the Ising model from data is the challenge. We establish an analogy between the inverse Ising problem and the Ornstein-Zernike formalism in liquid state physics. Rather than analytically deriving the closure relation, we use a deep neural network to learn the closure from simulations of the Ising model. We show, using simulations as well as biochemical datasets, that the deep neural network model outperforms systematic field-theoretic expansions, is more data-efficient than the pseudolikelihood method, and can generalize well beyond the parameter regime of the training data. The neural network is able to learn from synthetic data, which can be generated with relative ease, to give accurate predictions on real world datasets.
Drawing meaningful inference from correlations amongst variables is a fundamental problem in science. The central challenge is to decipher the probability distribution that generates the correlations given a set of observations, and then predict properties of unknown samples with the inferred model. Many probabilistic models have been proposed in the literature [1, 2]. Focusing on capturing the sample mean and pairwise correlations, the simplest model, in the sense of maximum entropy, is a Boltzmann probability distribution with a Hamiltonian that contains terms linear and bilinear in the variables [3]. For continuous variables, this is the multivariate Gaussian distribution; for discrete variables, this model is the inverse problem of the Ising model in statistical physics.

The inverse Ising model, also known as Boltzmann learning [4], has found applications in different disciplines [5] such as understanding neural spike trains [6, 7] and bird flocks [8], and predicting the structures of proteins [9-11] and RNA [12] as well as their fitness landscapes [13-16] using evolutionary data. The inverse Ising model disentangles correlations in the dataset into pairwise interactions, and thus can distinguish direct variable-variable interactions from indirect correlations mediated through other variables.

However, exact maximum likelihood inference of the inverse Ising model is numerically challenging because it requires computing the partition function at each step of the optimisation [4]. To overcome this challenge, many approximate techniques have been developed [17]. Those techniques are typically leading terms of asymptotic expansions that relate the sample mean and correlation to the Ising parameters in limits where the coupling is tractable, e.g. asymptotically small correlations [18-20], a small number of coupled clusters [21-23], or a tree-like structure [24]. (For comprehensive reviews, see [5, 25].) A relatively recent class of algorithms maximizes the pseudolikelihood rather than the likelihood [26, 27]. Although the pseudolikelihood method avoids partition function evaluations and converges to the maximum likelihood solution in the limit of infinite data, its algorithmic complexity scales with the number of samples.

In this Letter, we eschew analytical asymptotic expansions and instead use a deep learning model to infer the relationship between the Ising parameters and the sample mean and correlation. We motivate an analogy between the inverse Ising model and the Ornstein-Zernike formalism in liquid state physics. We show that a deep neural network, trained using simulations of Ising models, can approximate an Ornstein-Zernike-like closure for the inverse Ising model. The deep neural network is generalizable and achieves an accuracy beyond analytical approximations for biochemical datasets as well as simulations with disparately different parameters compared to the training data. The neural network learns from synthetic data, which can be generated with relative ease, to give accurate predictions on real world datasets with a runtime independent of the dataset size.
We begin by stating the inverse Ising problem: we are given p sequences of N variables, {s^α}_{α=1}^p, where each variable s_i can take values ±1. The maximum entropy distribution with mean and pairwise correlations agreeing with the data, i.e.

⟨σ_i⟩_P = ⟨s_i⟩_data,   ⟨σ_i σ_j⟩_P = ⟨s_i s_j⟩_data,   (1)

is the Ising model

P(σ) = (1/Z) exp( Σ_i h_i σ_i + Σ_{i>j} J_ij σ_i σ_j ).   (2)

The inverse Ising problem is the inference of the parameters {h_i, J_ij} from data. This is challenging because the correlation between variables i and j, C_ij = ⟨s_i s_j⟩ − ⟨s_i⟩⟨s_j⟩, can be large even if J_ij = 0, provided there is an intervening variable k such that J_ik and J_kj are large. In other words, the pairwise correlation C_ij one detects in a dataset is a measure of the interaction between i and j that is mediated by all other variables.

To make further progress, we follow the Ornstein-Zernike formalism [28, 29] in liquid state physics to deconvolve direct interactions and indirect correlations. We first consider a one-component, homogeneous and isotropic liquid. Molecules interact with a pairwise additive potential v(r_12), where r_12 is the distance between particles 1 and 2. The liquid structure is characterised by the radial distribution function, g(r), which is the probability of observing a molecule at distance r away from a molecule at the origin. Ornstein and Zernike noticed that g(r) can be long ranged even when v(r) is short ranged, because g(r) accounts for both the direct interaction between two molecules and the indirect interactions with surrounding molecules. Their crucial insight is to introduce a quantity known as the direct correlation function, c(r), and write the total correlation function h(r_12) ≡ g(r_12) − 1 as

h(r_12) = c(r_12) + ∫ dr_3 c(r_13) c(r_32) + ···   (3)

where the first term captures the direct influence of molecule 1 on 2, the second term captures the influence of molecule 1 on 2 mediated by molecule 3, and higher order terms capture correlations induced by more intervening molecules. Equation (3) can be rewritten in the compact form h(r_12) = c(r_12) + ∫ dr_3 c(r_13) h(r_32), which is known as the Ornstein-Zernike equation (we have scaled the standard correlation functions by the density to make the link with the Ising model more apparent later). To close the problem, we need a relation between c(r), h(r), and v(r). The crucial feature of many closures is that they are local and take the form

f( c(r), h(r), v(r); ρ ) = 0   (4)

where f is a real-valued function (not a functional) and ρ is the liquid density [29]. The locality of the closure relationship is approximate [30], albeit a very good approximation for a wide variety of inter-particle interactions.

Having introduced the Ornstein-Zernike formalism and closure, we return to the inverse Ising problem. The sample correlation, C_ij, is a discrete analogue of h(r_ij) in the Ornstein-Zernike formalism. Therefore, we introduce the direct correlation D_ij, the discrete analogue of c(r_ij), and replace the integrals in Equation (3) by sums over lattice sites, i.e.

C_ij = δ_ij + D_ij + Σ_k D_ik D_kj + Σ_{k,l} D_ik D_kl D_lj + ···.   (5)

In matrix form,

C = (I − D)^{-1}   or   D = I − C^{-1}.   (6)

Taking J = −C^{-1} is known as the mean-field approximation or Direct Coupling Analysis [10, 31].
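To make Eqs. (5)-(6) concrete, the minimal Python sketch below (our illustration, not code from the paper's repository; all names are ours) estimates the connected correlation matrix from ±1 samples and applies the mean-field inversion D = I − C^{-1}:

```python
import numpy as np

def mean_field_couplings(samples):
    """Estimate couplings via the mean-field (naive) inversion J ~= -C^{-1}.

    samples : (p, N) array of +/-1 spins.
    Returns (J_mf, C), where C is the connected correlation matrix.
    """
    means = samples.mean(axis=0)                                  # <s_i>
    C = samples.T @ samples / samples.shape[0] - np.outer(means, means)
    C_inv = np.linalg.inv(C)                                      # may be ill-conditioned if p is small
    J_mf = -C_inv
    np.fill_diagonal(J_mf, 0.0)                                   # only off-diagonal entries are couplings
    return J_mf, C
```

In practice C can be ill-conditioned when p is not much larger than N, so a small ridge term is often added to the diagonal before inverting.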
Now we introduce the key assumption of this Letter: we posit that the locality heuristic of closure relations for liquids applies to the inverse Ising problem. This crucial assumption will be verified later by comparing with simulations. We posit a function of four real variables, F(·, ·, ·, ·), such that

J_ij ≈ F( C_ij, [C^{-1}]_ij, ⟨s_i⟩, ⟨s_j⟩ ).   (7)

Unlike the case of homogeneous fluids, we need an extra equation to determine the fields h_i. Motivated by the free energy of inhomogeneous fluids [32], we posit a function G(·, ·, ·, ·) such that

h_i ≈ G( tanh^{-1}⟨s_i⟩, [C^{-1}]_ii, Σ_{j≠i} J_ij ⟨s_j⟩, Σ_{j≠i} C_ij ⟨s_j⟩ ),   (8)

where J_ij is given by F. We note that the first few terms of most approximate analytical inverse Ising theories do take the form of Equations (7)-(8), with different F and G depending on the approximations [5].

F and G are complicated functions. Rather than analytically determining what they are, we use a deep learning approach and approximate them with a richly parameterised interpolating function. Our hypothesis is that the neural network is able to find F and G that are better than mean-field expansions by inferring directly from data.
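Because the ansatz in Eqs. (7)-(8) is local, each coupling depends on only four scalars. Below is a sketch of how such a closure could be evaluated once a regressor F has been trained (the actual architecture is described in the Supplemental Material; the function and variable names here are our own illustration):

```python
import numpy as np

def predict_couplings(C, means, F):
    """Apply a local closure J_ij ~ F(C_ij, [C^-1]_ij, <s_i>, <s_j>), as in Eq. (7).

    C     : (N, N) connected correlation matrix
    means : (N,) sample means <s_i>
    F     : callable mapping an (n_pairs, 4) feature array to n_pairs couplings,
            e.g. a trained neural network (hypothetical stand-in here)
    """
    N = C.shape[0]
    C_inv = np.linalg.inv(C)
    i, j = np.triu_indices(N, k=1)                     # each unordered pair once
    features = np.column_stack([C[i, j], C_inv[i, j], means[i], means[j]])
    J = np.zeros((N, N))
    J[i, j] = J[j, i] = F(features)                    # symmetric couplings
    return J
```

Training F then amounts to regressing features computed from simulated (C, ⟨s⟩) pairs onto the known couplings used to generate them.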
The training data is prepared by running 20 simulations with N = 50 Ising sites. In each simulation, the Ising parameters are drawn from a normal distribution, J_ij = J_ji ∼ N(0, β_J/√N) and h_i ∼ N(0, β_h), i.e. the Sherrington-Kirkpatrick model [33] with a random field. The variances of the distributions are fixed in each simulation but vary across the 20 simulations, with β_J ∼ uniform(0.…, ….5) and β_h ∼ uniform(0.…, ….). The samples are generated with … × N steps of Markov chain Monte Carlo (MCMC), sampled at every 10N steps.

F and G are approximated using multilayer neural networks. An l-layer neural network approximates functions by l successive non-linear compositions, i.e. y = W_l σ( W_{l−1} ··· σ( W_1 x ) ), where W_i ∈ R^{M_i × M_{i−1}} is a weight matrix inferred from data, M_i is the number of units in layer i, and σ(·) is a non-linear function. We use a neural network because it is a numerically tractable way of representing complex functions, and the parameters W_i can be efficiently inferred with backpropagation. The neural network architecture is discussed in the Supplemental Material and released on GitHub.

The computational cost of evaluating the neural network to obtain one entry of J is independent of the number of variables or data, hence the total complexity is O(N^2). The largest cost is computing C^{-1}, which is O(N^3). The pseudolikelihood approximation, the state-of-the-art method in the literature, has complexity O(N^2 p); thus our algorithm is computationally less intensive than the pseudolikelihood method in the large data limit, where the pseudolikelihood method is mathematically exact; we will show that our method is more data-efficient in the intermediate data regime.
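As an illustration of how the synthetic training data described above can be generated, here is an unoptimised single-spin-flip Metropolis sampler for the Sherrington-Kirkpatrick model with random fields (our sketch; the default parameter values are illustrative, and only the recording interval of 10N flips follows the text):

```python
import numpy as np

def sample_sk_model(N=50, beta_J=1.0, beta_h=0.5, n_samples=1000, seed=None):
    """Generate (J, h, samples) for an SK model with random fields via Metropolis MCMC.

    Couplings J_ij = J_ji are Gaussian with width beta_J / sqrt(N) and zero diagonal;
    fields h_i are Gaussian with width beta_h. A configuration is recorded every
    10*N single-spin-flip attempts.
    """
    rng = np.random.default_rng(seed)
    J = rng.normal(0.0, beta_J / np.sqrt(N), size=(N, N))
    J = np.triu(J, 1)
    J = J + J.T                                      # symmetric, zero diagonal
    h = rng.normal(0.0, beta_h, size=N)
    s = rng.choice([-1, 1], size=N).astype(float)
    samples = []
    for step in range(10 * N * n_samples):
        i = rng.integers(N)
        delta = 2.0 * s[i] * (J[i] @ s + h[i])       # increase in -log P from flipping spin i
        if delta <= 0 or rng.random() < np.exp(-delta):
            s[i] = -s[i]
        if (step + 1) % (10 * N) == 0:
            samples.append(s.copy())
    return J, h, np.array(samples)
```

The recorded configurations can then be turned into the means and correlations that form the training inputs of the closure.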
FIG. 1. The neural network model is accurate, generalisable and robust to sampling noise. The RMS error, γ, of predicting (A) J_ij and (B) h_i as a function of "inverse temperature" β. (C)-(D) The RMS error for predicting J_ij and h_i as a function of p, the number of MCMC samples used to estimate C_ij and ⟨s_i⟩, for β = 2.

To test the generality and accuracy of the model, we simulate Ising models with N = 70 sites (note that the model is only trained on N = 50 sites), J_ij = J_ji ∼ N(0, β/√N) and h_i ∼ N(0, 0.…β). A large value of β corresponds to stronger coupling, and thus further away from the perturbative regimes that underlie analytical theories. Figure 1A-B shows the root mean square error of recovering J_ij and h_i as a function of β. The neural network model is more accurate than the Thouless-Anderson-Palmer approximation (TAP), a third order high temperature expansion [34]. The pseudolikelihood approximation slightly outperforms the neural network at high temperature, where essentially all mean-field methods become accurate, but crucially the neural network outperforms the pseudolikelihood approximation at low temperatures, where correlations become non-trivial. Importantly, the neural network is accurate even for β ∈ (2.…, …), beyond the parameter regime of the training data.
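For completeness, the error metric used in Figure 1 (taking γ to be the root-mean-square deviation between inferred and true parameters, as the caption states) can be computed as in this short sketch of ours:

```python
import numpy as np

def rms_errors(J_true, J_pred, h_true, h_pred):
    """Root-mean-square errors gamma_J and gamma_h over couplings and fields."""
    iu = np.triu_indices_from(J_true, k=1)             # count each coupling once
    gamma_J = np.sqrt(np.mean((J_pred[iu] - J_true[iu]) ** 2))
    gamma_h = np.sqrt(np.mean((h_pred - h_true) ** 2))
    return gamma_J, gamma_h
```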
FIG. 2. The neural network model is accurate even for very non-Gaussian coupling matrices. (A) The RMS error, γ_J, of inferring J for the nearest neighbour Ising model; the inset shows the inferred coupling matrix for J = 2.… and p = 10… samples. (B) The inferred contact map for a model protein, indexed by site number.

As a further test, we consider a one-dimensional Ising model with nearest-neighbour coupling J_ij = J_ji = J( δ_{i,j+1} + δ_{i+1,j} ). The coupling matrix recovered from the neural network (inset of Figure 2A) is strongly localised on the off-diagonal elements, even though all of the training data have delocalised coupling matrices. We use the analytically determined correlation matrix as input to the neural network to focus on the error of the locality approximation, and Figure 2A confirms that this intrinsic error is low. Inspired by the use of inverse Ising models in protein structure and fitness prediction [10, 15, 16], Figure 2B shows that the neural network approximation can accurately recover a model contact map, and is more data efficient than the pseudolikelihood method. In Figure 2B, we consider the bovine pancreatic trypsin inhibitor protein (PDB ID: 5PTI), a benchmark example in ref [35] with N = 58 amino acids. For this example, we assume the "ground truth" couplings are known (to what extent amino acid interactions are pairwise additive is a separate question [36]), and take J_ij = e^{−d_ij/(… Å)}/√N, where d_ij is the C_β distance between residues i and j. The correlation matrix is computed using MCMC; samples are taken every 10N steps to ensure independence, and in total we acquire p samples. The generalisability of the neural network to non-Gaussian distributed couplings confirms the approximate locality of the true Ornstein-Zernike-like closure. The functions F and G can be accurately approximated as long as the training data spans the four-dimensional input space.

To interrogate what the neural network has learnt, we focus on the case h_i = 0 and ⟨s_i⟩ = 0 for all i, so that the closure is a 3D surface. Figure 3 shows that the neural network learns non-trivial corrections to pathologies in analytical theories: mean-field theory predicts J_ij = −[C^{-1}]_ij, yet this approximation significantly overestimates J_ij [37].

FIG. 3. The neural network learns to correct the overestimation of J_ij in mean-field theory. The figure shows the coupling J_ij predicted by the neural network closure as a function of C_ij and [C^{-1}]_ij for ⟨s_i⟩ = 0.

The neural network automatically corrects this by learning a sub-linear function relating J_ij to [C^{-1}]_ij. This is conceptually akin to phenomenologically imposing a large regularisation, a technique discussed in the physics literature [37], except that here the appropriate regularisation is inferred directly from data. Moreover, the neural network learns to use C_ij, another way to estimate the coupling, and only predicts a large value of J_ij if both C_ij and [C^{-1}]_ij are large (cf. the contours on the C_ij–[C^{-1}]_ij plane). More generally, approximate analytical methods have some bias depending on aggregate quantities, e.g. inverse temperature and sparsity. The neural network learns those quantities by comparing C and its inverse, and then applies an appropriate correction.

We will now go beyond synthetic data, where the ground truth is known, and apply our method to two problems in computational biology and chemistry.
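Before moving to the applications, the nearest-neighbour benchmark of Figure 2A can be set up analytically. A minimal sketch of ours follows, under the assumption of zero field and free boundary conditions, for which the exact pair correlation of the chain is tanh(J)^{|i−j|}:

```python
import numpy as np

def chain_correlations(N, J):
    """Exact correlation matrix of a 1D Ising chain with coupling J, zero field,
    and free boundaries: <s_i s_j> = tanh(J) ** |i - j|."""
    idx = np.arange(N)
    return np.tanh(J) ** np.abs(idx[:, None] - idx[None, :])

def chain_couplings(N, J):
    """Nearest-neighbour coupling matrix J_ij = J (delta_{i,j+1} + delta_{i+1,j})."""
    return J * (np.eye(N, k=1) + np.eye(N, k=-1))
```

Feeding this analytic correlation matrix into the closure isolates the error of the locality approximation, as discussed above.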
Fitness landscape of HIV-1 Gag: Developing a predictive model for the fitness of the HIV virus as a function of the amino acid sequence of the Group-specific antigen (Gag) protein is a challenge in vaccine development. Recent works showed that the fitness landscape can be inferred from the statistics of sequences found in patients [14]. The hypothesis is that the frequency of observing conserved sites and sets of correlated mutations reflects the contribution of those residues to fitness [38]. Therefore, the fitness of an unknown sequence can be predicted by its log probability computed using a generative model for the sequences observed in patients. We replicate the analysis of ref [39] for the HIV-1 Gag protein using the Ising representation for sequences, except that the Ising parameters are inferred using the neural network. Figure 4 shows that the log probability predicted by our model is highly correlated with experimental measurements of replication capacity, with a correlation coefficient r = 0.86, slightly higher than the state-of-the-art model [39] (r = 0.…).
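A minimal sketch (ours) of the scoring step: a ±1-encoded sequence is scored by its log probability under the inferred Ising model of Eq. (2), dropping the constant log Z, which is the same for every sequence and therefore irrelevant for rankings:

```python
import numpy as np

def ising_log_score(s, h, J):
    """Unnormalised log probability of a +/-1 configuration s under Eq. (2):
    sum_i h_i s_i + sum_{i>j} J_ij s_i s_j (the constant -log Z is omitted).
    Assumes J is symmetric with zero diagonal."""
    return h @ s + 0.5 * s @ J @ s
```

Mutant sequences can then be ranked by this score relative to the wild type, as in Figure 4.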
FIG. 4. The log probability of a mutant, E, is strongly correlated with the replication capacity RC. The subscript WT denotes wild type. Details of sequences and experiments can be found in ref [39]. A pseudocount of 0.… is used.

Tox21 Challenge: A key challenge in drug discovery is predicting whether an unknown molecule will bind to a particular receptor.
We consider the Tox21 challenge, which reported binding affinities of small molecules against a panel of 12 toxicologically relevant receptors [41, 42]. Molecules are represented as vectors recording the presence (1) or absence (−1) of chemical groups (cf. Supplemental Material). Each receptor is treated independently. As there are more variables than data points, we undress finite sampling noise from the correlation matrix using an eigenvalue thresholding method inspired by random matrix theory [43-45] (cf. Supplemental Material). Separate Ising models are inferred for active ({J^b_ij, h^b_i}) and inactive ({J^{n−b}_ij, h^{n−b}_i}) molecules. We score a molecule f by the log probability ratio E(f) = Σ_i (h^b_i − h^{n−b}_i) f_i + Σ_{i>j} (J^b_ij − J^{n−b}_ij) f_i f_j, up to an additive constant from the partition functions that is the same for every molecule.
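The correlation-cleaning step mentioned above can be illustrated with a common random-matrix recipe: eigenvalues below the Marchenko-Pastur edge are treated as sampling noise and flattened. This is a generic sketch of ours, not necessarily the exact procedure in the paper's Supplemental Material:

```python
import numpy as np

def clean_correlation(C, p):
    """Suppress sampling noise in an N x N correlation matrix estimated from p samples.

    Eigenvalues below the Marchenko-Pastur upper edge (1 + sqrt(N/p))**2 are replaced
    by their mean (preserving the trace of the noise bulk); eigenvectors are kept.
    """
    N = C.shape[0]
    edge = (1.0 + np.sqrt(N / p)) ** 2
    w, V = np.linalg.eigh(C)
    bulk = w <= edge
    if bulk.any():
        w = w.copy()
        w[bulk] = w[bulk].mean()                      # flatten the noise bulk
    return (V * w) @ V.T                              # V diag(w) V^T
```

The cleaned matrix can then be used in place of the raw correlation matrix when evaluating the closure, and the two inferred models are compared through the log probability ratio defined above.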
FIG. 5. The inverse Ising model accurately predicts protein-ligand binding for a panel of toxicologically relevant receptors in the Tox21 challenge [41, 42]. The figure shows the true positive rate, averaged across the 12 receptors in the challenge, as a function of the false positive rate (average AUC = 0.85).

AAL thanks M. P. Brenner and R. Monasson for insightful discussions and comments, and J. P. Barton for making available the data of Ref [39]. AAL acknowledges the Winton Programme for the Physics of Sustainability for funding.

Data availability: Codes and data to reproduce results in this paper are available in [47].

∗ [email protected]
[1] Y. Bengio, A. Courville, and P. Vincent, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1798 (2013).
[2] R. Salakhutdinov, Annual Review of Statistics and Its Application, 361 (2015).
[3] E. T. Jaynes, Physical Review, 620 (1957).
[4] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive Science, 147 (1985).
[5] H. C. Nguyen, R. Zecchina, and J. Berg, Advances in Physics, 197 (2017).
[6] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek, Nature, 1007 (2006).
[7] S. Cocco, S. Leibler, and R. Monasson, Proceedings of the National Academy of Sciences, 14058 (2009).
[8] W. Bialek, A. Cavagna, I. Giardina, T. Mora, E. Silvestri, M. Viale, and A. M. Walczak, Proceedings of the National Academy of Sciences, 4786 (2012).
[9] A. Lapedes, B. Giraud, and C. Jarzynski, arXiv preprint arXiv:1207.2484 (2012).
[10] D. S. Marks, L. J. Colwell, R. Sheridan, T. A. Hopf, A. Pagnani, R. Zecchina, and C. Sander, PLoS ONE, e28766 (2011).
[11] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, Proceedings of the National Academy of Sciences, E1293 (2011).
[12] E. De Leonardis, B. Lutz, S. Ratz, S. Cocco, R. Monasson, A. Schug, and M. Weigt, Nucleic Acids Research, 10444 (2015).
[13] K. Shekhar, C. F. Ruberman, A. L. Ferguson, J. P. Barton, M. Kardar, and A. K. Chakraborty, Physical Review E, 062705 (2013).
[14] A. L. Ferguson, J. K. Mann, S. Omarjee, T. Ndungu, B. D. Walker, and A. K. Chakraborty, Immunity, 606 (2013).
[15] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon, and M. Weigt, Molecular Biology and Evolution, 268 (2015).
[16] T. A. Hopf, J. B. Ingraham, F. J. Poelwijk, M. Springer, C. Sander, and D. S. Marks, Nature Biotechnology, 128 (2017).
[17] M. Opper and D. Saad, Advanced Mean Field Methods: Theory and Practice (MIT Press, 2001).
[18] H. J. Kappen and F. d. B. Rodríguez, Neural Computation, 1137 (1998).
[19] T. Tanaka, Physical Review E, 2302 (1998).
[20] V. Sessak and R. Monasson, Journal of Physics A: Mathematical and Theoretical, 055001 (2009).
[21] S. Cocco and R. Monasson, Physical Review Letters, 090601 (2011).
[22] S. Cocco and R. Monasson, Journal of Statistical Physics, 252 (2012).
[23] J. P. Barton, E. De Leonardis, A. Coucke, and S. Cocco, Bioinformatics, 3089 (2016).
[24] H. C. Nguyen and J. Berg, Journal of Statistical Mechanics: Theory and Experiment, P03004 (2012).
[25] S. Cocco, C. Feinauer, M. Figliuzzi, R. Monasson, and M. Weigt, arXiv preprint arXiv:1703.01222 (2017).
[26] E. Aurell and M. Ekeberg, Physical Review Letters, 090201 (2012).
[27] A. Decelle and F. Ricci-Tersenghi, Physical Review Letters, 070603 (2014).
[28] L. S. Ornstein and F. Zernike, in Proc. Acad. Sci. Amsterdam, Vol. 17 (1914) pp. 793-806.
[29] J.-P. Hansen and I. R. McDonald, Theory of Simple Liquids (Academic Press, 2013).
[30] R. Fantoni and G. Pastore, The Journal of Chemical Physics, 10681 (2004).
[31] A. R. Kinjo, Biophysics and Physicobiology, 117 (2015).
[32] J. Percus, Journal of Statistical Physics, 1157 (1988).
[33] D. Sherrington and S. Kirkpatrick, Physical Review Letters, 1792 (1975).
[34] D. J. Thouless, P. W. Anderson, and R. G. Palmer, Philosophical Magazine, 593 (1977).
[35] S. Cocco, R. Monasson, and M. Weigt, PLoS Computational Biology, e1003176 (2013).
[36] H. Jacquin, A. Gilson, E. Shakhnovich, S. Cocco, and R. Monasson, PLoS Computational Biology, e1004889 (2016).
[37] J. P. Barton, S. Cocco, E. De Leonardis, and R. Monasson, Physical Review E, 012132 (2014).
[38] A. K. Chakraborty and J. P. Barton, Reports on Progress in Physics, 032601 (2017).
[39] J. K. Mann, J. P. Barton, A. L. Ferguson, S. Omarjee, B. D. Walker, A. Chakraborty, and T. Ndung'u, PLoS Computational Biology, e1003776 (2014).
[40] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, 1998).
[41] R. Huang, M. Xia, S. Sakamuru, J. Zhao, S. A. Shahane, M. Attene-Ramos, T. Zhao, C. P. Austin, and A. Simeonov, Nature Communications (2016).
[42] R. Huang and M. Xia, Frontiers in Environmental Science, 3 (2017).
[43] J.-P. Bouchaud and M. Potters, in The Oxford Handbook of Random Matrix Theory (Oxford University Press, 2011).
[44] A. A. Lee, M. P. Brenner, and L. J. Colwell, Proceedings of the National Academy of Sciences, 13564 (2016).
[45] A. A. Lee, M. P. Brenner, and L. J. Colwell, Physical Review Letters, 208101 (2017).
[46] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, Chemical Science 9, 513 (2018).