Reconstruction of Pairwise Interactions using Energy-Based Models
A Preprint
Christoph Feinauer
Department of Decision Sciences
Bocconi Institute for Data Science and Analytics (BIDSA)
Bocconi University, Milan, Italy
[email protected]
Carlo Lucibello
Department of Decision Sciences
Bocconi Institute for Data Science and Analytics (BIDSA)
Bocconi University, Milan, Italy
[email protected]
December 15, 2020

Abstract
Pairwise models like the Ising model or the generalized Potts model have found many successful applications in fields like physics, biology, and economics. Closely connected is the problem of inverse statistical mechanics, where the goal is to infer the parameters of such models given observed data. An open problem in this field is how to train these models when the data contain additional higher-order interactions that are not present in the pairwise model. In this work, we propose an approach based on Energy-Based Models and pseudolikelihood maximization to address these complications: we show that hybrid models, which combine a pairwise model and a neural network, can lead to significant improvements in the reconstruction of pairwise interactions. We show these improvements to hold consistently when compared to a standard approach using only the pairwise model and to an approach using only a neural network. This is in line with the general idea that simple interpretable models and complex black-box models are not necessarily a dichotomy: interpolating between these two classes of models can retain some advantages of both.

Keywords: Pairwise Models, Neural Networks, Inverse Ising, Energy-Based Models
1 Introduction

An important class of distributions used in the modeling of natural systems is the exponential family of pairwise models. Commonly investigated in the statistical physics community, pairwise models are a popular method for the analysis of categorical sequence data. Examples of data to which they have been successfully applied include protein sequence data [1, 2, 3], neuronal recordings [4, 5], magnetic spins [6], and economic and social networks [7, 8, 9]. One main advantage of these models is their relative simplicity: the probability assigned to a sequence s of binary or categorical variables is of the form p(s) ∝ exp(−E(s)), where E is a simple function of s, meaning that it consists of terms that depend on only one or two variables. The parameters quantifying the pairwise interactions are typically called couplings.

Given this simple form, the parameters can often be given a direct interpretation in terms of the underlying system. Especially the couplings have been shown to contain highly non-trivial information in many cases: the couplings in the so-called Potts models for protein sequence data can be seen as a measure of the strength of co-evolutionary pressure between parts of the sequence and can be used for the prediction of structural features [1]; the couplings in models for neuronal recordings can be seen as the functional couplings between neurons [4]; the couplings in magnetic systems of interacting spins can be seen as describing their physical interaction strength.

While pairwise models have been surprisingly successful in many fields, they have clear limitations: if the data generating process contains important interactions that cannot be described as pairwise interactions, the models might fail to capture important variability. Even worse, if such interactions are strong enough, the pairwise models might even cease to describe the pairwise interactions properly, since they might contain effective pairwise interactions that try to account for the variability of the higher-order interactions. In fact, it is known in the literature that, for example, some variability in protein sequences is due to higher-order epistasis involving more than 2 residues [10].

Several methods have been proposed to address such problems, for example the 'manual' addition of higher-order interaction terms based on a close look at the data [11], or the addition of complete sets of higher-order interactions, for example all terms involving triplets of variables [12]. Another option is to abandon simple pairwise models and adopt more complicated, but also more expressive, methods, for example neural networks. These models can in principle capture interactions of all orders and can be trained for a specific task, for example the extraction of structural information [13], the generation of new samples [14], or the creation of generic embeddings [15]. While this strategy has led to unprecedented successes in many fields, it also comes at the cost of a higher computational demand and the loss of the interpretability of the single parameters defining the distribution. Moreover, failing to encode prior knowledge on the data generating process, these black-box methods are far from optimal in terms of sample efficiency.

In this paper, we propose a combination of these two approaches: we develop a strategy for keeping the simple model with its advantages in place, but add a neural network to help with the more complex patterns in the data. This approach seems sensible in cases where we suspect or know that a simple model is able to capture most of the variability in the data, but that it might fail to capture some additional aspects or even gets confused by them. We implement this idea by defining a new energy function

E(s) = E_pw(s) + E_nn(s),    (1.1)

where E_pw is a pairwise model and E_nn is a neural network that maps a configuration s to a real number. We then look at cases where the data generating process contains a simple part, corresponding to another pairwise model, and a more complicated part, corresponding to higher-order interactions. The hope is that the neural network picks up these higher-order interactions and thus helps the pairwise model in matching the pairwise interactions of the generative process. We will focus on the so-called inverse problem of statistical mechanics, that is, reconstructing the pairwise couplings of a generative model that also contains some unknown higher-order interaction terms.

2 Models and Methods

We consider a probability distribution p(s) over all possible configurations of N binary variables, {−1, +1}^N.
Any such distribution with support over the whole space can be written in the form

p(s) = \exp(-E(s))/Z,    (2.1)

where E : s → \mathbb{R} is the so-called energy function and Z is a normalization constant called the partition function. Denoting with \mathcal{I} the power set of {1, ..., N}, the energy can be uniquely expressed by the expansion

E(s) = -\sum_{I \in \mathcal{I}} \xi_I \prod_{i \in I} s_i,    (2.2)

where ξ_I ∈ \mathbb{R} is the interaction coefficient for the term containing the variables specified by I. Such expansions are known in theoretical computer science and Boolean algebra as Fourier expansions, and the corresponding parameters ξ_I are called Fourier coefficients [16].
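Since the parity functions \prod_{i \in I} s_i are orthonormal under the uniform distribution, each coefficient can be read off as ξ_I = -2^{-N} \sum_s E(s) \prod_{i \in I} s_i (this is Eq. (2.5) below). The following minimal sketch, in Python with hypothetical names rather than the authors' code, computes a coefficient of an arbitrary energy function by exhaustive enumeration, which is feasible only for small N:

import itertools
import numpy as np

def fourier_coefficient(energy, N, I):
    """Exact xi_I = -2^{-N} * sum_s E(s) * prod_{i in I} s_i, cf. Eq. (2.2)."""
    total = 0.0
    for bits in itertools.product([-1.0, 1.0], repeat=N):
        s = np.array(bits)
        chi = np.prod(s[list(I)]) if I else 1.0  # parity function prod_{i in I} s_i
        total += energy(s) * chi
    return -total / 2 ** N

# Example: for E(s) = -0.7 * s_0 * s_1 the coefficient xi_{0,1} is 0.7.
energy = lambda s: -0.7 * s[0] * s[1]
print(fourier_coefficient(energy, N=4, I=(0, 1)))  # -> 0.7 (up to float error)

The 2^N-term sum is what the sampling techniques mentioned next replace for larger systems.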
Determining specific coefficients from a black-box function E can be done efficiently through sampling techniques (see Section 2.2), and coefficients larger than a given threshold can be determined using the Goldreich-Levin algorithm [16]. This is useful in our setting, since these techniques also apply when the energy E is parametrized using an arbitrary neural network.

The class of models with ξ_I = 0 whenever |I| > 2 are called pairwise models, defined by the energy

E_pw(s) = -\sum_i h_i s_i - \sum_{i<j} J_{ij} s_i s_j,    (2.3)

where the parameters h_i and J_ij are usually called fields and couplings.

2.1 Energy-based models

Energy-based models (EBMs) [18] define the probability distribution directly through an energy function, which can be parametrized for example by a neural network [25, 27], and have several advantages compared to other generative models like Generative Adversarial Networks [20] or Variational Autoencoders [21]. The most important ones related to the present work are their relative uniformity and simplicity and their compositionality (both also mentioned in [19]). By uniformity and simplicity we refer to the fact that, due to the generic formulation using a single energy function, tasks like training, sampling and analysis can often be formulated generically, independent of the exact parametrization of the energy. By compositionality, we refer to the idea that EBMs can be combined easily by summing their respective energies, leading to a so-called product of experts model [22]. In fact, the central idea of this paper is to combine two different energy functions: one simple, and one complex. The main disadvantage of EBMs is that the normalization constant appearing in the definition of the probability cannot be calculated for typical input sizes, which makes training and sampling harder. However, many training methods like contrastive divergence [23] or pseudolikelihoods (see SM Section S1) can be adapted for EBMs, and sampling can be done using standard MCMC algorithms like Metropolis-Hastings [24].

While such models are in principle capable of fitting any probability distribution with increasing layer sizes, the amount of data and the computational resources needed are typically much larger than for fitting a pairwise model. In addition, they might not be very efficient: in the trivial case where the generating distribution is in fact a pairwise distribution, the number of parameters that need to be introduced to the EBM might greatly exceed the ones needed when fitting a simpler distribution.

2.2 Hybrid models and reconstruction of couplings

The models we use in this work are hybrid models of the form

E(s) = E_pw(s) + E_nn(s) = -\sum_i h_i s_i - \sum_{i<j} J_{ij} s_i s_j + E_nn(s),    (2.4)

where E_nn is a neural network mapping a configuration s to a real number, as in Eq. (1.1).

Figure 2.1: Representation of the basic idea of this work: given a generative distribution that contains a strong pairwise part but also higher-order interactions, we fit an energy-based model including a pairwise part and a neural network. The hope is that the neural network captures the higher-order interactions, while the pairwise parts match up after training.

Written as the expansion in Eq. (2.2), the energy of such a model can contain in principle interactions of all orders. For small system sizes, we can extract the corresponding interaction parameters by observing from Eq. (2.2) that for a generic energy E(s) we have

\xi_I = -\mathbb{E}_s\left[ E(s) \prod_{i \in I} s_i \right],    (2.5)

where the expectation is with respect to the uniform distribution over all 2^N possible configurations. Since we do not limit the capacity of the neural network, E_nn(s) can also contain significant pairwise interactions. Therefore, we may have \mathbb{E}_s[E(s) s_i s_j] ≉ -J_{ij}. We show below that this can indeed be observed in specific situations, and approach the problem as follows: we reconstruct the couplings from E = E_pw + E_nn using Eq. (2.5). We refer to these effective couplings as reconstructed couplings,

\hat{J}_{ij} = -\mathbb{E}_s[ E(s) s_i s_j ],    (2.6)

as opposed to the explicit couplings J in the trained model (2.4).
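A minimal Monte Carlo sketch of this reconstruction, assuming only a black-box energy function (function and variable names are ours, not the paper's code): \hat{J} is estimated by averaging -E(s) s_i s_j over uniformly drawn configurations, as in Eq. (2.6).

import numpy as np

def reconstruct_couplings(energy, N, n_samples=100_000, seed=0):
    """Monte Carlo estimate of hat{J}_ij = -E_s[E(s) s_i s_j], Eq. (2.6)."""
    rng = np.random.default_rng(seed)
    J_hat = np.zeros((N, N))
    for _ in range(n_samples):
        s = rng.choice([-1.0, 1.0], size=N)    # uniform configuration
        J_hat -= energy(s) * np.outer(s, s)    # accumulate -E(s) s_i s_j
    J_hat /= n_samples
    np.fill_diagonal(J_hat, 0.0)               # keep only off-diagonal couplings
    return J_hat

The same function applies unchanged whether 'energy' wraps the full hybrid model or only the neural network part.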
The reconstruction is performed only at the end of the training, and is approximated for large systems with Monte Carlo samples in our experiments. As an alternative, in SM Section S2, we show that it can also be done during training, which effectively limits the pairwise interactions in the hybrid model to the pairwise part.

We use the same reconstruction method for extracting coefficients in models consisting only of the neural network, without the explicit pairwise part, in order to understand whether using an explicit pairwise term in model (2.4) brings any advantage. While we do this here only for comparison and use only simple multi-layer perceptrons (MLPs), we note that it would be an interesting avenue of research to use more advanced neural network models and see whether the extracted couplings can be used in applications where pairwise models are typically employed.

2.3 Training

The difficulties in evaluating the normalization constant in energy-based models make density evaluation intractable, and efficient sampling becomes problematic as well. Many techniques have been proposed for the challenging task of training EBMs, the most commonly used ones being contrastive divergence with Langevin dynamics [22, 28], noise-contrastive estimation [29], and score matching [30]. In this work, we use pseudolikelihood maximization to train the parameters of the model given the data [31]. This method is very popular for the training of pairwise models [32, 33, 34] and is furthermore very similar to the training approach for state-of-the-art neural network models summarily called self-supervised learning, which transforms the task of unsupervised learning on unlabeled data into a supervised learning task by training the model to predict an artificially masked part of the data from the unmasked part. This technique is used, for example, when training the self-attention based BERT models [35].

Given a single mini-batch {s^b}_{b=1}^B of B training configurations, we use the negative pseudolikelihood loss function

L = -\frac{1}{B} \sum_{b=1}^B \sum_{i=1}^N \log p(s_i^b | s_{/i}^b),    (2.7)

where the quantity p(s_i^b | s_{/i}^b) corresponds to the probability of observing s_i^b given the other variables in s^b, excluding s_i^b. This loss function can be calculated for a generic energy-based model over binary configurations using N forward passes. For a pairwise model, we can instead use more efficient calculation schemes. For implementation details see Supplementary Section S1. It is worth mentioning that the interaction screening approach of Ref. [36] provides an alternative, with well-understood sample complexity guarantees, to the pseudolikelihood framework used here.

We train the models by standard stochastic gradient descent with batch size B = 1024 and a fixed learning rate. We did not find consistent improvements for the hybrid models when applying an L2 regularization, and do not apply it in this work. We did find, however, a slight improvement for models containing only the pairwise energy E_pw, as explained in detail below. We trained all models for 250 epochs.
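To make the procedure concrete, the following hedged sketch implements a hybrid energy in the spirit of Eq. (2.4) and the loss of Eq. (2.7) in PyTorch. This is our illustration, not the authors' code; class names, the hidden size, and the learning rate are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridEnergy(nn.Module):
    """Hybrid energy E(s) = E_pw(s) + E_nn(s), Eq. (2.4), for s in {-1,+1}^N."""
    def __init__(self, N, hidden=128):
        super().__init__()
        self.h = nn.Parameter(torch.zeros(N))     # fields h_i
        self.J = nn.Parameter(torch.zeros(N, N))  # couplings (upper triangle used)
        self.mlp = nn.Sequential(nn.Linear(N, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s):                         # s: (B, N) tensor of +-1 spins
        J = torch.triu(self.J, diagonal=1)        # enforce i < j
        e_pw = -(s @ self.h) - torch.einsum('bi,ij,bj->b', s, J, s)
        return e_pw + self.mlp(s).squeeze(-1)     # E_pw(s) + E_nn(s)

def neg_pseudolikelihood(model, s):
    """Eq. (2.7) for binary variables: one spin flip (one extra forward pass) per site."""
    E = model(s)                                  # energies of the data batch
    loss = 0.0
    for i in range(s.shape[1]):
        s_flip = s.clone()
        s_flip[:, i] = -s_flip[:, i]
        # -log p(s_i | s_/i) = log(1 + exp(E(s) - E(s with spin i flipped)))
        loss = loss + F.softplus(E - model(s_flip)).mean()
    return loss

# One SGD step on a (1024, N) batch of +-1 spins (hyperparameters illustrative):
# model = HybridEnergy(N=64); opt = torch.optim.SGD(model.parameters(), lr=1e-3)
# opt.zero_grad(); neg_pseudolikelihood(model, batch).backward(); opt.step()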
3 Experimental Setting

In this work, the experimental setting is given by a data generating distribution p_G(s) ∝ exp(−E_G(s)) over the configurations {−1, +1}^N, where E_G(s) contains a pairwise part and an additional number of higher-order interactions:

E_G(s) = E^G_pw(s) + E^G_ho(s) = -\sum_{i<j} J^G_{ij} s_i s_j - \gamma \sum_{I \in \mathcal{I}_{ho}} \xi^G_I \prod_{i \in I} s_i,    (3.1)

where \mathcal{I}_{ho} is the set of index sets of the higher-order interactions, ξ^G_I are the corresponding coefficients, and the parameter γ controls the overall strength of the higher-order part.

4 Results

We analyse the effect that additional higher-order interactions in the generating process might have on the reconstruction of the pairwise couplings by training the same models on data from generators with different higher-order strength γ.

Figure 4.1: Reconstruction error for different system sizes N and different models as a function of the higher-order strength γ in the data generator. The data is generated by a pairwise model with N additional interactions involving only 3 variables (see Eq. (3.1)). We show means and standard deviations over 5 runs. The reconstructed couplings for the Hybrid and the MLP only model are calculated using Eq. (2.6). Both the hybrid and the MLP only model had a single layer of 128 hidden neurons.

We call the criterion that we adopt to measure the reconstruction performance the reconstruction error ε. It is a relative measure of the deviation of the inferred couplings \hat{J}_{ij} from the true ones J^G_{ij}:

\epsilon = \sqrt{ \sum_{i<j} ( \hat{J}_{ij} - J^G_{ij} )^2 / \sum_{i<j} ( J^G_{ij} )^2 }.    (4.1)

In Fig. 4.1 we show the reconstruction error for the different models as a function of γ. The explicit and the reconstructed couplings for the hybrid model yield similar results, meaning that the learned E_nn(s) function is approximately orthogonal to the pairwise family in this experiment. It is interesting to note that the neural network alone with 128 hidden neurons is insufficient to reconstruct the couplings. This confirms the idea that the explicit pairwise model is useful in training. However, we will later show that using networks with much larger capacity, the MLP only model can approach the hybrid model performance in some of the settings explored.
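For concreteness, a minimal sketch of the reconstruction error of Eq. (4.1) (our naming, not the authors' code):

import numpy as np

def reconstruction_error(J_hat, J_gen):
    """Relative error between inferred and generator couplings, Eq. (4.1)."""
    iu = np.triu_indices_from(J_gen, k=1)      # each pair (i, j) with i < j once
    num = np.sum((J_hat[iu] - J_gen[iu]) ** 2)
    den = np.sum(J_gen[iu] ** 2)
    return np.sqrt(num / den)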
Using the same experimental setting as in the previous section, we investigate in detail how closely the trained hybrid model matches the generator.

Figure 4.2: Inferred versus true interactions for system size N = 64. The generator includes N triplet interactions and γ is set to . (Left) Blue points refer to pairwise interactions, orange points to all triplet interactions present in the generator, and green to random triplet interactions not present in the generator. (Center and Right) Relation of the energies between the submodels of the generator (pairwise and higher-order) and the trained model (pairwise and neural network). The color intensity is proportional to the density of points. The hybrid model contained a single hidden layer with 128 hidden neurons. All interactions were estimated using Eq. (2.5) via Monte Carlo sampling.

In Fig. 4.2 (left) we compare the reconstructed interaction parameters from the hybrid model through Eqs. (2.5) and (2.6) to the corresponding ones in the generator. The interaction parameters that we estimate are all pairwise interactions, the N triplet interactions that are present in the generator, and N random triplet interactions not present in the generator. Pairwise interactions are well fitted, as are the strongest triplet interactions in the generator. Some weaker triplet interactions in the generator are underestimated instead. The triplet interactions not contained in the generator are close to 0 in the hybrid model. These results indicate that the hybrid model does not only learn an effective model of the generator, but extracts the true interactions in the underlying system.

In Fig. 4.2 (center and right) we show that the energies calculated from the pairwise part in the generator are strongly correlated with the energies from the pairwise part in the trained model, and that the energies calculated from the trained neural network are strongly correlated with the energies coming from the higher-order interactions in the generator. See also Supplementary Fig. S3 for the same experiment on a smaller system.

In order to evaluate the impact of the neural architecture used in the hybrid model (2.4), we repeat the experiments with different sizes for the hidden layer of the MLP. As in the previous section, we keep the higher-order interactions in the generator restricted to N triplets, where N is the system size. We vary the number of hidden neurons over several powers of 2. The results in Fig. 4.3 indicate that the size of the neural network has only a small effect on the error above a certain threshold. While using a pure pairwise model for training leads to a quickly increasing reconstruction error (as already visible in Fig. 4.1), the addition of a single layer neural network with even a small number of hidden neurons (on the order of the system size N) leads to a significantly better reconstruction of the pairwise couplings in the generator.

Figure 4.3: Reconstruction error of couplings in the presence of triplet interactions in the generator, for varying number of hidden neurons in the trained model and different values of γ (panels for system sizes N = 16 and N = 64). We used M = 10 training samples for N = 16 and M = 5· training samples for N = 64. Shown are means and standard deviations over 5 independent samples.

Varying the number of hidden neurons also allows us to test the hypothesis that a sufficiently large neural network on its own is enough for inferring the pairwise couplings. In this setting, the models containing only an MLP approach the performance of the hybrid model only for N = 16 and for very wide networks, while a large gap remains at N = 64. We note that where models based only on a neural network perform well in terms of the reconstruction error, the hybrid model obtains comparable results with two orders of magnitude fewer parameters. It is also to be said, however, that this comparison is not completely fair, since the hybrid model contains an inductive prior by design which the pure neural network model lacks. Still, we take this observation as evidence that adding a pairwise part to the trained model is sensible if the generating distribution is expected to contain a significant pairwise part.

In the preceding sections we restricted ourselves to triplet interactions in the generator. In order to probe the limits of our approach, we repeat the experiments with generators that contain N higher-order interactions up to order 10, leaving all other characteristics like training set size and training approach the same. The order of each interaction is chosen from a uniform distribution between 3 and 10, and the variables involved in each interaction are a random subset of all variables.

We note that this is a very ambitious test: our hope is that the neural network picks up the higher-order interactions in the generator, which are of the type ξ \prod_{i=1}^{I} s_i, where I is the interaction order and ξ the corresponding parameter. This means that we try to fit a combination of overlapping sparse parity problems of up to 10 inputs. While constructing a solution to a single instance of such a problem is easy using a single hidden layer with continuous weights (see e.g. [38] and the sketch below), parity functions are generally considered among the hardest functions to learn from data [39]. While we might be able to alleviate this problem by adding more layers to the neural network, we consider this to be out of scope for the current work, and note that in a realistic application the size of the underlying interactions is often not known. Even in this hard case, however, one could expect that the neural network contributes to the quality of training by fitting at least some of the variability due to the higher-order interactions.
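To illustrate why a single instance is easy to represent, here is one hedged construction in the spirit of [38] (the specific weights are ours): a single ReLU hidden layer whose units count the number n of +1 spins and linearly interpolate the sawtooth n mod 2 computes the parity of k spins exactly.

import numpy as np

def parity_one_hidden_layer(s):
    """Exact parity prod_i s_i of k spins via a single ReLU hidden layer."""
    s = np.asarray(s, dtype=float)
    k = len(s)
    n = np.sum((s + 1.0) / 2.0)                 # number of +1 spins; each hidden
    hidden = np.maximum(0.0, n - np.arange(k))  # unit j computes ReLU(n - j)
    w = 2.0 * (-1.0) ** np.arange(k)
    w[0] = 1.0                                  # output weights 1, -2, +2, -2, ...
    n_mod_2 = hidden @ w                        # piecewise-linear sawtooth = n mod 2
    return (-1.0) ** k * (1.0 - 2.0 * n_mod_2)  # prod_i s_i = (-1)^k (-1)^n

s = np.array([1, -1, -1, 1, -1])
assert parity_one_hidden_layer(s) == np.prod(s)  # -> -1

Representing such weights is easy; learning them from samples is the hard part, which is the point of [39].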
While also in this setting we report generally better performance of the hybrid approach over the pairwise only and neural network only approaches, the gain is not as large as in the case of triplet interactions of the preceding sections (see Supplementary Fig. S2). Moreover, in this setting the explicit couplings of the hybrid model significantly deviate from the couplings reconstructed using Eq. (2.6) at the end of training, as can be seen in Fig. 4.4. While the additional reconstruction step is computationally cheap, these observations suggest that additional constraints for keeping the pairwise interactions in the neural network small might lead to further improvements. In Supplementary Section S2 we present a rough way of doing this and speculate about more sophisticated approaches.

Figure 4.4: Reconstruction error of couplings in the presence of higher-order (3 to 10) interactions in the generator as a function of the training epoch. The higher-order interaction strength γ was set to . , the system size is N = 64, and we used M = 5· training samples. (Left) Reconstruction error for several independent systems using only a pairwise model for training. The systems corresponding to the colored lines are also used in the 3 right panels. (Right 3 panels) Reconstruction error given by pairwise only models (solid lines), and by hybrid models using either explicit or reconstructed couplings. The hybrid models contained a single hidden layer with 128 hidden units.

5 Conclusions

In this work we have shown that adding neural networks to pairwise models can improve the quality of the reconstruction of pairwise interactions if the distribution underlying the data generating process contains additional higher-order interactions, as typically occurs in natural data. While both the explicitly pairwise part and the neural network part of the hybrid model may contribute to the reconstructed couplings in general, we showed that in certain settings the neural network and the pairwise model specialize in fitting the separate parts of the generating model.

There are many directions for future investigations. Systematic exploration of the neural architecture employed, which we did not pursue at great length in this work, could yield significant improvements. Different training methods for energy-based models could be applied, possibly speeding up simulations or giving more robust predictions. We also did not check the quality of the trained models when used as generative distributions, which might be an important factor when applying similar methods, for example, to protein design.
In addition, constraining the neural network to account only for higher-order interactions in a more sophisticated way might lead to further improvements.

To the best of our knowledge, this work is the first that solves the inverse problem by using the couplings reconstructed from a neural network. This leads to another line of possible research, where the training of possibly very large and complex generative models without explicit pairwise couplings is followed by a reconstruction step. In principle, common architectures like GANs or autoregressive networks could be adapted at little additional computational cost. The next immediate step, however, would be to screen the current application domains of pairwise models and translate the improvements observed in the well-controlled settings of this work to real-world data. Many of these fields present idiosyncrasies, for example in the data characteristics, the expected topology of the underlying interactions, or the additional tricks in training or preprocessing that are important for achieving good results when using purely pairwise models. We therefore expect that some additional adaptations of our method will be necessary in these cases. Given, however, that in most or all applications pairwise models are used as effective models, and that one would expect higher-order interactions to play a role in almost all complicated real-world scenarios, we believe that our work presents a very promising perspective.

References

[1] Faruck Morcos, Andrea Pagnani, Bryan Lunt, Arianna Bertolino, Debora S Marks, Chris Sander, Riccardo Zecchina, José N Onuchic, Terence Hwa, and Martin Weigt. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences of the United States of America, 108(49):E1293–301, Dec 2011.
[2] Debora S Marks, Thomas A Hopf, and Chris Sander. Protein structure prediction from sequence variation. Nature Biotechnology, 30(11):1072–1080, 2012.
[3] Simona Cocco, Christoph Feinauer, Matteo Figliuzzi, Rémi Monasson, and Martin Weigt. Inverse statistical physics of protein sequences: A key issues review. Reports on Progress in Physics, 81(3):1–18, 2018.
[4] Yasser Roudi, Joanna Tyrcha, and John Hertz. Ising model for neural data: model quality and approximate methods for extracting functional connectivity. Physical Review E, 79(5):051915, 2009.
[5] Gašper Tkačik, Olivier Marre, Dario Amodei, Elad Schneidman, William Bialek, and Michael J Berry II. Searching for collective behavior in a large network of sensory neurons. PLoS Computational Biology, 10(1):e1003408, 2014.
[6] Daniel S Fisher and David A Huse. Ordered phase of short-range Ising spin-glasses. Physical Review Letters, 56(15):1601, 1986.
[7] Dietrich Stauffer. Social applications of two-dimensional Ising models. American Journal of Physics, 76(4):470–473, 2008.
[8] Didier Sornette. Physics and financial economics (1776–2014): puzzles, Ising and agent-based models. Reports on Progress in Physics, 77(6):062001, 2014.
[9] Gavin Hall and William Bialek. The statistical mechanics of Twitter communities. Journal of Statistical Mechanics: Theory and Experiment, 2019(9):093406, 2019.
[10] Michael Waechter, Kathrin Jaeger, Stephanie Weissgraeber, Sven Widmer, Michael Goesele, and Kay Hamacher. Information-theoretic analysis of molecular (co)evolution using graphics processing units.
ECMLS 2012 – 3rd International Emerging Computational Methods for the Life Sciences Workshop, pages 49–58, 2012.
[11] Christoph Feinauer, Marcin J. Skwark, Andrea Pagnani, and Erik Aurell. Improving contact prediction along three dimensions. PLoS Computational Biology, 10(10), 2014.
[12] Michael Schmidt and Kay Hamacher. hoDCA: higher order direct-coupling analysis. BMC Bioinformatics, 19(1):1–5, 2018.
[13] Jian Peng and Jinbo Xu. RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins: Structure, Function, and Bioinformatics, 79(S10):161–171, 2011.
[14] Adam J Riesselman, Jung-Eun Shin, Aaron W Kollasch, Conor McMahon, Elana Simon, Chris Sander, Aashish Manglik, Andrew C Kruse, and Debora S Marks. Accelerating protein design using autoregressive generative models. bioRxiv, page 757252, 2019.
[15] Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, page 622803, 2019.
[16] Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, 2016.
[18] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[19] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[21] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[22] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[23] Miguel A Carreira-Perpinan and Geoffrey E Hinton. On contrastive divergence learning. In AISTATS, volume 10, pages 33–40, 2005.
[24] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[25] Alan Morningstar and Roger G Melko. Deep learning the Ising model near criticality. The Journal of Machine Learning Research, 18(1):5975–5991, 2017.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[27] Dian Wu, Lei Wang, and Pan Zhang. Solving statistical mechanics using variational autoregressive networks. Physical Review Letters, 122(8):080602, 2019.
[28] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, volume 32, pages 3608–3618. Curran Associates, Inc., 2019.
[29] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.
In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
[30] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.
[31] Julian Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, pages 616–618, 1977.
[32] Erik Aurell and Magnus Ekeberg. Inverse Ising inference using all the data. Physical Review Letters, 108(9):090201, 2012.
[33] Magnus Ekeberg, Cecilia Lövkvist, Yueheng Lan, Martin Weigt, and Erik Aurell. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E, 87(1):012707, 2013.
[34] Aurélien Decelle and Federico Ricci-Tersenghi. Pseudolikelihood decimation algorithm improving the inference of the interaction network in a general class of Ising models. Physical Review Letters, 112:070603, 2014.
[35] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[36] Marc Vuffray, Sidhant Misra, and Andrey Lokhov. Efficient learning of discrete graphical models. Advances in Neural Information Processing Systems, 33, 2020.
[37] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020.
[38] Leonardo Franco and Sergio A Cannas. Generalization properties of modular networks: implementing the parity function. IEEE Transactions on Neural Networks, 12(6):1306–1313, 2001.
[39] Gerald Tesauro and Bob Janssens. Scaling relationships in back-propagation learning. Complex Systems, 2(1):39–44, 1988.
[40] Aapo Hyvärinen. Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Computation, 18(10):2283–2292, 2006.
[41] Alexander Mozeika, Onur Dikmen, and Joonas Piili. Consistent inference of a general model using the pseudolikelihood method. Physical Review E, 90(1):010101, 2014.

Supplemental Material

S1 Using Pseudolikelihoods for training EBMs

Pseudolikelihoods are often used as an alternative to an intractable or at least computationally expensive likelihood [31]. The method has been applied successfully to pairwise models [40, 32, 33, 34]. We show here how it can be applied to a generic Energy-Based Model, and add some considerations specific to pairwise models. We note that while maximum pseudolikelihood is a widely applied method for training simple Energy-Based Models, to the best of our knowledge this is the first time it has been used for training deep feed-forward neural networks.

We assume the data that we want to model to consist of configurations (s_1, ..., s_N) of categorical variables of length N, and we will use q to denote the number of categories. A common method for fitting a probability distribution p_Θ(s) with parameters Θ to a training set of sequences {s^m}_{m=1}^M is to find the Θ* for which

\Theta^* = \mathrm{argmax}_\Theta \sum_{m=1}^M \log p_\Theta(s^m),    (S1)

which corresponds to a maximum-likelihood solution.
For an Energy-Based Model (EBM) p_Θ(s) = e^{−E_Θ(s)}/Z_Θ, where E_Θ(s) is the energy function, this would correspond to solving

\Theta^* = \mathrm{argmax}_\Theta \frac{1}{M} \sum_{m=1}^M \left[ -E_\Theta(s^m) - \log Z_\Theta \right],    (S2)

for example by gradient descent methods. The general problem with this approach is that the normalization constant Z_Θ = \sum_s e^{−E_Θ(s)}, where we sum over all possible configurations s, contains q^N terms. This is intractable even for modest N, even in the case of binary variables, where q = 2. The idea of pseudolikelihoods is to replace the likelihood objective by

\Theta^* = \mathrm{argmax}_\Theta \frac{1}{M} \sum_{i=1}^N \sum_{m=1}^M \log p_\Theta( s_i^m | s_{/i}^m ),    (S3)

where p_Θ(s_i^m | s_{/i}^m) is the probability of symbol s_i^m in sequence m, given the other symbols. We therefore train the distribution by using it to predict a missing symbol from the other symbols.

Other variations are possible, for example to discard the sum over i and find a separate maximizer Θ*_i for every i independently. We found the approach with the sum to be conceptually easier, and in the applications known to us the performance seems to be the same [33]. While it can be shown that this new objective has the same maximum as the original likelihood under certain conditions [41], this is for example not generally true if the training samples come from a different model class than p_Θ, which is the case in our setting. In this work, we are interested in whether we can make training with this objective work in practice, and refrain from further theoretical analysis. We note that we have not restricted the form of E_Θ. In the models we analyse in this work, the energy is calculated as the sum of the energy of a pairwise model and a neural network.

Neglecting the sum over i and m for the time being, we can write the quantity log p_Θ(s_i | s_{/i}) for an EBM as

\log p_\Theta(s_i | s_{/i}) = \log \frac{ p_\Theta(s) }{ p_\Theta(s_{/i}) } = \log \frac{ p_\Theta(s) }{ \sum_{\hat{s}_i = 1}^q p_\Theta( \hat{s}_i, s_{/i} ) },    (S4)

where we used the notation (\hat{s}_i, s_{/i}) for the configuration s after s_i has been replaced with \hat{s}_i. Since the normalization constant Z_Θ appears in both the numerator and the denominator, it cancels, and we are left with

\log p_\Theta(s_i | s_{/i}) = \log \frac{ e^{ -E_\Theta(s) } }{ \sum_{\hat{s}_i = 1}^q e^{ -E_\Theta( \hat{s}_i, s_{/i} ) } } = -\log \left( 1 + \sum_{\hat{s}_i \neq s_i} e^{ E_\Theta(s) - E_\Theta( \hat{s}_i, s_{/i} ) } \right).    (S5)

The sum in this expression can be computed efficiently, using q evaluations of E. This means that, including the sum over i and replacing the sum over m with a sum over a mini-batch of size B in a stochastic gradient descent (SGD) setting, we need q · N · B evaluations of E for a single gradient step, corresponding to q · N forward passes.

In the case of binary strings with s_i ∈ {±1} and a pairwise model E_Θ(s) = -\sum_i h_i s_i - \sum_{i<j} J_{ij} s_i s_j, the conditional probabilities can be written in closed form. Defining the local field

h_i(s_{/i}) = h_i + \sum_{j \neq i} J_{ij} s_j,    (S6)

the single flip \hat{s}_i = -s_i changes the energy by 2 s_i h_i(s_{/i}), so that Eq. (S5) reduces to

\log p_\Theta(s_i | s_{/i}) = -\log\left( 1 + e^{ -2 s_i h_i(s_{/i}) } \right),    (S7)

which can be evaluated for all sites at once without additional forward passes.
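A hedged sketch of this closed-form pairwise case, vectorized over sites and samples (names are ours, not the paper's code):

import numpy as np

def neg_pseudolikelihood_pairwise(h, J, S):
    """Mean of -log p(s_i | s_/i) over sites and samples, via Eqs. (S6)-(S7).

    h: fields (N,); J: symmetric couplings (N, N) with zero diagonal;
    S: configurations (M, N) with entries in {-1, +1}.
    """
    local_fields = h + S @ J                  # h_i + sum_{j != i} J_ij s_j, per sample
    # -log p(s_i | s_/i) = log(1 + exp(-2 s_i * local_field_i))
    return np.mean(np.logaddexp(0.0, -2.0 * S * local_fields))

With a zero diagonal and symmetric J, the matrix product computes all N local fields for all M samples in one step, which is the efficiency gain over the generic q · N forward passes.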
S2 Constraining the pairwise interactions during training

During training, we did not enforce a division of labour between the two parts of the hybrid models, which means that the neural network is not discouraged in any way from also fitting pairwise interactions. While extracting the pairwise coefficients from the entire hybrid model and constructing an effective pairwise model is a way of solving this after training, it would be more satisfactory to include this also in the training procedure. The cleanest way of ensuring only higher-order interactions in the neural network would be to constrain the optimization of the neural network to the part of parameter space where it does not contain pairwise interactions. In practice, Eq. (2.5) could be used to create a regularization term penalizing all pairwise interactions:

\frac{1}{N^2} \sum_{ij} \left( \mathbb{E}_s\left[ s_i s_j E_{nn}(s) \right] \right)^2 = \mathbb{E}_{s,s'}\left[ E_{nn}(s) \, E_{nn}(s') \, q(s,s')^2 \right],    (S8)

where the expectation is over uniformly sampled Ising configurations and q(s,s') = \frac{1}{N} \sum_i s_i s'_i is the overlap between two configurations. This expression can be approximately evaluated by Monte Carlo sampling (see the sketch at the end of this section). While this approach seems promising, we did not pursue it in this exploratory analysis.

A different approach is instead to counter the pairwise interactions in the neural network by using an additional pairwise model. To this end, we define a new energy

E(s) = E_{pw}(s) + E_{nn}(s) - \hat{E}_{pw}(s).    (S9)

Here, E_pw and E_nn are the same as in the hybrid models of the preceding sections. The new term \hat{E}_pw is another pairwise model, but it is excluded from the gradient descent step and we set its couplings explicitly every k epochs. The values of these couplings are the pairwise interactions extracted from E_nn using Eq. (2.5). The idea is to estimate the pairwise terms in the expansion of the neural network energy E_nn and absorb these interactions in the additional \hat{E}_pw, which we therefore call an absorber model. After setting the couplings of this absorber, the last two terms on the right hand side of Eq. (S9) should contain approximately no pairwise interactions, i.e.

\sum_s s_i s_j \left( E_{nn}(s) - \hat{E}_{pw}(s) \right) \approx 0    (S10)

for all i, j. This leaves the term E_pw as the only one with significant pairwise interactions. While we could in principle do this after every epoch or even after every gradient step, this would make the computations unfeasibly slow, since at every step we would estimate the pairwise interactions in E_nn using Monte Carlo samples. We therefore restrict ourselves to doing the estimate less frequently, every k = 5 epochs in our experiments.

Figure S1: Reconstruction error with higher-order (3 to 10) interactions in the generator for γ = 1. and trained with M = 5· samples. The training samples in this figure are the same as in Fig. 4.4. (Left panel) Reconstruction error for several independent runs using only a pairwise model for training. The training samples corresponding to the colored lines were further used as a training set for the hybrid and absorber models shown in the 3 right panels. (Right 3 panels) Reconstruction error for trained models containing only pairwise terms (solid lines), reconstruction error for hybrid models (dashed lines), and reconstruction error for hybrid models with absorber terms (dotted lines).

In Fig. S1 it can be seen that using these additional absorbers improves the training of the couplings significantly. We used the same training samples as for Fig. 4.4 and also left the other training characteristics the same. The results are very similar to what would have been obtained by reconstructing the couplings at every step (compare also Fig. 4.4). While this also means that there was no strong improvement over the approach of reconstructing the couplings after the training has ended, we think it still noteworthy that enforcing that the pairwise model E_pw should be solely responsible for fitting the pairwise interactions is possible during training.
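For illustration, a hedged Monte Carlo sketch of the penalty in Eq. (S8) (naming is ours; as stated above, this regularizer was not used in the reported training runs):

import numpy as np

def pairwise_penalty(energy_nn, N, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E_{s,s'}[E_nn(s) E_nn(s') q(s,s')^2], cf. Eq. (S8)."""
    rng = np.random.default_rng(seed)
    S  = rng.choice([-1.0, 1.0], size=(n_samples, N))  # batch of configurations s
    Sp = rng.choice([-1.0, 1.0], size=(n_samples, N))  # independent batch of s'
    E  = np.array([energy_nn(s) for s in S])
    Ep = np.array([energy_nn(s) for s in Sp])
    q  = (S * Sp).sum(axis=1) / N                      # overlaps q(s, s')
    return float(np.mean(E * Ep * q ** 2))

The absorber couplings themselves can be obtained with the same estimator as the reconstructed couplings of Eq. (2.6), applied to E_nn alone.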
S3 Additional Figures

Figure S2: Pairwise reconstruction errors with higher-order (size 3 to 10) interactions in the generator, for varying number of hidden neurons in the trained model and 3 different values of γ (panels for system sizes N = 16 and N = 64). The number of these interactions is equal to N. Training was done with M = 10 samples for N = 16 and M = 5· samples for N = 64. Shown are means and standard deviations over 5 runs. The reconstruction error is defined in Eq. (4.1). The blue line corresponds to the reconstruction error calculated using the couplings of the pairwise part of the hybrid model, the red line to the pairwise interactions reconstructed from the complete hybrid model using Eq. (2.6).

Figure S3: Inferred versus true interactions for system size N = 16. The generator included N triplet interactions, γ was set to . , and the hybrid model had a single hidden layer with 128 hidden neurons.