Generating Tertiary Protein Structures via an Interpretative Variational Autoencoder
Xiaojie Guo, Yuanqi Du, Sivani Tadepalli, Liang Zhao, Amarda Shehu
XIAOJIE GUO,
Department of Information Sciences and Technology, George Mason University, Fairfax, VA, USA
SIVANI TADEPALLI,
Department of Computer Science, George Mason University, Fairfax, VA, USA
LIANG ZHAO,
Department of Information Sciences and Technology, George Mason University, Fairfax, VA, USA
AMARDA SHEHU,
Department of Computer Science, George Mason University, Fairfax, VA, USA
Much scientific enquiry across disciplines is founded upon a mechanistic treatment of dynamic systems that ties form to function. A highly visible instance of this is in molecular biology, where an important goal is to determine functionally-relevant forms/structures that a protein molecule employs to interact with molecular partners in the living cell. This goal is typically pursued under the umbrella of stochastic optimization with algorithms that optimize a scoring function. Research repeatedly shows that current scoring functions, though steadily improving, correlate weakly with molecular activity. Inspired by recent momentum in generative deep learning, this paper proposes and evaluates an alternative approach to generating functionally-relevant three-dimensional structures of a protein. Though deep generative models typically struggle with highly-structured data, the work presented here circumvents this challenge via graph-generative models. A comprehensive evaluation of several deep architectures shows the promise of generative models in directly revealing the latent space for sampling novel tertiary structures, as well as in highlighting axes/factors that carry structural meaning and open the black box often associated with deep models. The work presented here is a first step towards interpretative, deep generative models becoming viable and informative complementary approaches to protein structure prediction.

Additional Key Words and Phrases: Deep generative models, variational autoencoder, disentanglement, protein structure prediction
ACM Reference Format:
Xiaojie Guo, Sivani Tadepalli, Liang Zhao, and Amarda Shehu. 2019. Generating Tertiary Protein Structures via an Interpretative Variational Autoencoder. 1, 1 (April 2019), 17 pages. https://doi.org/*****
Introduction

Decades of scientific enquiry across disciplines have demonstrated just how fundamental form is to function [7]. A central question that arises in various scientific domains is how to effectively explore the space of all possible forms of a dynamic system to uncover those that satisfy the constraints needed for the system to be active/functional. In computational structural biology we find a most visible instantiation of this question in the problem of de-novo (or template-free) protein structure prediction (PSP). PSP seeks to determine one or more biologically-active/native forms/structures of a protein molecule from knowledge of its chemical composition (the sequence of amino acids bonded to one another to form a protein chain) [32].
PSP has a natural and elegant formulation as an optimization problem [37] and is routinely tackled with stochastic optimization algorithms that are tasked with exploring a vast and high-dimensional structure space. These algorithms sample one structure at a time, biased by an expert-designed scoring/energy function that scores sampled structures based on the likelihood of these structures being biologically-active/native [46]. While great progress has been made in stochastic optimization for PSP, there are fundamental challenges: how to achieve a resource-aware balance between exploration (exploring more of the structure space so as not to miss novel structures) and exploitation (improving structures so as to reach local optima of the scoring function) in vast and high-dimensional search spaces, and how to deal with inherently-inaccurate scoring functions that often drive towards score-optimal but non-native structures [28].

Stochastic optimization frameworks that conduct a biased (by the scoring function) exploration of a protein's structure space do not learn not to generate structures that are unfit according to the scoring function. Efforts to inject a learning component by Shehu and others struggle with how to best connect the structure generation mechanism/engine in these frameworks with the evaluation engine [40]. A machine learning (ML) framework would seemingly provide a natural setting; however, while discriminative ML models can in principle be built to evaluate tertiary structures, and a growing body of research is showing their promise [10, 12, 13, 33], effective generative ML models for tertiary protein structures have so far been elusive.

The majority of work on leveraging deep neural networks (NNs) for problems related to PSP has been on the prediction of distance or contact maps. The latter are alternative representations of tertiary molecular structures that dispense with storing the Cartesian coordinates of atoms in a molecule and instead record only pairwise distances (or, as in contact maps, 0/1 bits to indicate a distance above or not above a threshold). Work on contact map (similarly, distance matrix) prediction focuses on the following setting: given the amino-acid sequence of a protein molecule of interest, predict one contact map that represents the most likely native structure of the target protein.
The predicted contact map (or distance matrix) can be utilized to recover a compatible tertiary structure via optimization algorithms that treat the predicted contacts/distances as restraints [1, 50].

Early work on prediction of a contact or distance matrix from a given amino-acid sequence [29] employed bidirectional recursive NNs (BRNNs). RaptorX-Contact started a thread of productive research on residual convolutional NNs (CNNs) [52]. DNCON2 leveraged a two-stage CNN [2]. DeepContact deepened the framework via a 9-layer residual CNN with 32 filters [35]. DeepCOV alleviated some of the demands for evolutionary information [23]. PconsC4 further limited input features and significantly shortened prediction time [41]. SPOT-Contact extended RaptorX-Contact by adding a 2D-RNN stage downstream of the CNN [19]. TripletRes put together four CNNs trained end-to-end [34].

It is worth emphasizing that the above NNs for contact or pairwise distance prediction predict one instance from a given amino-acid sequence. These models do not generate multiple contact maps or multiple distance matrices to give any view of the diversity of physically-realistic (and possibly functionally-relevant) structures in the structure space of a given protein. The objective of these models is to leverage the strong bias in the input features (which include amino-acid sequence, physico-chemical information of amino acids, co-evolutionary information, secondary structures, solvent accessibility, and other more sophisticated features) to get the model to output the one prediction that is most likely to represent the native structure. While the performance of these models varies [49] and is beyond the scope of this paper, this body of work does not leverage generative deep learning and is not focused on generating a distribution of tertiary structures for a given amino-acid sequence (whether directly in Cartesian space or indirectly through the intermediate representation of contact maps or pairwise distance matrices).

Work in our laboratory has integrated some of these contact prediction models, such as RaptorX-Contact, in optimization-based structure generation engines to evaluate the quality of a sampled tertiary structure in lieu of or in addition to a scoring/energy function [56]. AlphaFold can also be seen as a similar effort, where a deep NN is used to predict a pairwise distance matrix for a given amino-acid sequence and then predicted distances are encapsulated in a penalty-based scoring function to guide a gradient descent-based optimization algorithm assembling fragments into tertiary structures [45].

While it remains more challenging to generate a distribution of tertiary structures (whether directly or indirectly through the concept of contact maps or distance matrices), some preliminary efforts have been made [4, 5, 22, 43]. Work in [5] investigates the promise of such models for protein de-novo design. Other work that focuses on PSP remains preliminary, lacks rigorous evaluation on metrics of interest in PSP and, more generally, protein modeling, and does not expose what these models have actually learned.

Inspired by recent momentum in generative deep learning, we propose here a novel, generative approach. Leveraging recent opportunities in graph generative learning, we put forth a disentanglement-enhanced contact variational autoencoder (VAE), to which we refer as DCO-VAE from now on.
As any other deep neural network, DCO-VAE learns from data, which we obtain from an initial population generated by the state-of-the-art stochastic optimization engine Rosetta [31]. As a VAE, DCO-VAE learns from this data the latent space encapsulating the distribution and then directly generates more data (tertiary structures) from the latent space.

DCO-VAE addresses several hurdles that stand in the way of generative models for highly-structured data. First, each tertiary structure is represented as a graph via the concept of the contact graph, and DCO-VAE learns the latent space of such graphs, directly sampling novel contact graphs from this space. In this first version, readily-available 3D reconstruction webservers, such as CONFOLD [1], are used to obtain tertiary structures corresponding to DCO-VAE-generated graphs. Second, DCO-VAE addresses the issue of interpretability via the disentanglement mechanism. In one of several insightful analyses presented in this paper, we demonstrate how DCO-VAE is more powerful than a version with no disentanglement mechanism, CO-VAE, as well as other deep architectures. We show that DCO-VAE generates a distribution that does not simply reproduce the input distribution but instead contains novel and better-quality structures. A thorough comparison with several alternative generative models is also presented. In addition, the disentanglement mechanism integrated in DCO-VAE elucidates what exactly about the tertiary structures the "axes"/factors of the learned latent space control, thus opening the black box that is often associated with deep neural networks.

The paper proceeds as follows. First, we present a summary of related work. DCO-VAE, the input data, and other components of our approach (reconstruction of tertiary structures and various analyses and metrics) are then described in detail. The evaluation of the obtained distribution and the interpretability of DCO-VAE is presented next. The paper concludes with a summary of remaining challenges and future opportunities.
Related Work

The concept of a contact graph or a contact map has long been leveraged in molecular biology, particularly in protein modeling [53]. Briefly, the contact map is an alternative, compressed representation of a tertiary molecular structure. We can think of it as an adjacency matrix, where the entry [i, j] encodes whether the atoms at positions i, j are spatially proximate; the latter uses distance thresholds, whose typical values (in Å) vary in the computational biology literature. Typically, the main carbon atoms of each amino acid (the CA atoms) are considered. Some works focus instead on the carbon beta atoms (connecting the side-chain atoms to the backbone atoms of an amino acid). In this paper, we take the convention of CA-based contact maps.

Interest in predicting contacts from a given amino-acid sequence has a long history in protein structure prediction, as it was thought to be perhaps less challenging than predicting other representations (such as Cartesian coordinates). Many machine learning approaches have been used, including support vector machines [10, 53], random forests [33], neural networks [15, 18, 42, 48, 51], and deep learning [12, 13], with varied success. These approaches struggle with how best to capture the inherent structure in the input training data and how to discover the pertinent latent space.

More recent methods leverage the concept of the latent space via variational autoencoders [6, 36, 39]. Specifically, Ma et al. [39] develop a deep learning-based approach that uses convolutions and a variational autoencoder (a CVAE) to cluster tertiary structures obtained from folding simulations and discover intermediate states from the folding pathways. Anand et al. [6] apply generative adversarial networks (GANs) for de-novo protein design. The authors encode protein structures in terms of pairwise distances between CA atoms, eliminating the need for the generative model to learn translational and rotational symmetries. Livi et al. [36] are the first to treat the contact map as a graph and propose a generative model by learning the probability distribution of the existence of a link/contact given the distance between two nodes.

To the best of our knowledge, no deep graph-generative models have been applied to contact map generation. The DCO-VAE we present in this paper is the first to do so. The method leverages developments in graph-generative neural networks (GNNs). Most of the existing GNN-based approaches to generating general graphs have been proposed in the last two years and are based on VAEs [44, 47] or GANs [8]. Most of these approaches generate nodes and edges sequentially to form a whole graph, leading to issues of sensitivity to generation order and high time costs for large graphs. Specifically, GraphVAE [47] represents each graph by its adjacency matrix and feature vector and utilizes the VAE model to learn the distribution of the graphs conditioned on a latent representation at the graph level. Graphite [16] and VGAE [27] encode the nodes of each graph into node-level embeddings and predict the links between each pair of nodes to generate a graph. In contrast, GraphRNN [55] builds an autoregressive generative model on node-and-edge sequences with a long short-term memory (LSTM) model and shows good scalability.

These current graph generation models have two drawbacks: (i) the encoder and decoder architectures are not powerful enough to handle real-world graphs with complex structural relationships; (ii) they are black boxes, lacking interpretability.
In the biology domain, we need to understand the features learned from the observed data and how the contact maps are generated.

In this paper, we additionally address the issue of interpretability via the concept of disentanglement. Recently, a variety of disentangled representation learning algorithms have been proposed for VAEs. Assuming that the data is generated from independent factors of variation (e.g., object orientation and lighting conditions in images of objects), disentanglement as a meta-prior encourages these factors to be captured by different independent variables in the representation. Learning via disentanglement should result in a concise abstract representation of the data useful for a variety of downstream tasks, and promises improved sample efficiency. This has motivated a search for ways of learning disentangled representations, where perturbations of an individual dimension of the latent code z perturb the corresponding x in an interpretable manner. A well-known example is the β-VAE [20]. This has prompted a number of approaches that modify the VAE objective by adding, removing, or altering the weight of individual terms [3, 9, 14, 24, 30, 38, 59].

Marrying recent developments in disparate communities, we present here a VAE that generates contact graphs (of tertiary protein structures) and learns disentangled representations that allow interpreting how the learned latent factors control changes at the contact graph level. By utilizing known 3D reconstruction tools, we additionally show how the factors control changes in Cartesian space. The following section provides further details on DCO-VAE and the other models we design and assess for their ability to generate physically-realistic tertiary protein structures.
Comparison Methods

We utilize three deep graph generation models to validate the superiority of our proposed model. VGAE [27] is a framework for unsupervised learning on graph-structured data based on the variational autoencoder (VAE). This model makes use of latent variables and is capable of learning interpretable latent representations for undirected graphs. Graphite [16] is an algorithmic framework for unsupervised learning of representations over nodes in large graphs using deep latent-variable generative models. This model parameterizes variational autoencoders (VAEs) with graph neural networks and uses a novel iterative graph refinement strategy inspired by low-rank approximations for decoding. GraphRNN [55] is a deep autoregressive model designed to approximate any distribution of graphs with minimal assumptions about their structure. GraphRNN learns to generate graphs by training on a representative set of graphs and decomposes the graph generation process into a sequence of node and edge formations, conditioned on the graph structure generated so far.
Training and Testing Dataset
The evaluation is carried out on 13 proteins of varying lengths (53 to 146 amino acids long) and folds (α, β, α+β, and coil) that are used as a benchmark to evaluate structure prediction methods [57, 58]. For each protein, given its amino-acid sequence in fasta format, we run the Rosetta AbInitio protocol available in the Rosetta suite [31] to obtain an ensemble of 50,000– ,000 structures. For each structure, we only retain the central carbon (CA) atom of each amino acid (effectively discarding the other atoms). Each such CA-only structure is converted into a contact graph, as described above. The contact graph dataset of each protein is split into a training and a testing set in an x:y split. The deep models described above are trained separately on each protein; a sketch of this data-preparation step is given below.

An overview of the Disentanglement-enhanced Contact-VAE (DCO-VAE) is illustrated in Fig. 6. By observing many decoy contact maps for a protein sequence, it learns the distribution of these decoys and generates many more decoys for that protein sequence. DCO-VAE has two main properties: contact generation and interpretation. Contact generation is an implementation of a deep graph variational autoencoder: the graph encoder learns the distribution of the decoys of a given sequence, and new decoys are generated from a low-dimensional latent code. The generated contact maps can then be recovered into 3D structures by existing methods. Contact interpretation ensures that each element in the latent representation controls an interpretable factor of the generated protein structure. By changing the value of an element of the latent code continuously, we can control different properties of the generated protein structure. The proposed method also has a variant that does not consider disentanglement but only targets the generation of contact maps, named Contact-VAE (CO-VAE). CO-VAE has the same structure as DCO-VAE, except for the objective function used when training the generative model. The details of the methods are presented in the Section of Methods.
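As an illustration, the CA-extraction and split steps might look as follows. This is a minimal sketch, assuming decoys are available as PDB-format files; the 80/20 split fraction and the helper names are illustrative, not the settings used in this work.

```python
import numpy as np

def load_ca_coordinates(pdb_path: str) -> np.ndarray:
    """Extract the CA-atom coordinates (an N x 3 array) from a PDB-format decoy."""
    coords = []
    with open(pdb_path) as f:
        for line in f:
            # PDB fixed-column format: atom name in columns 13-16, x/y/z in 31-54.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                coords.append([float(line[30:38]), float(line[38:46]), float(line[46:54])])
    return np.asarray(coords)

def split_decoys(maps: list, train_fraction: float = 0.8, seed: int = 0):
    """Randomly split one protein's contact-map decoys into training/testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(maps))
    cut = int(train_fraction * len(maps))
    return [maps[i] for i in idx[:cut]], [maps[i] for i in idx[cut:]]
```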
Comparing the generated contact maps with the native one
The proposed methods have been evaluated on 13 datasets, each of which corresponds to a protein. For each protein, which is a sequence of amino acids, we use thousands of its contact map decoys to train the CO-VAE and DCO-VAE models. Then, based on each well-trained model, the same number of contact maps are generated and compared with the only native one. For each generated contact map and the native one, we compute the precision, recall, coverage, and F1 score between them, as shown in Table 1. In addition, the performance of our methods is also compared with that of other graph generation methods from the machine learning domain (Graphite [16], GVAE [27], GraphRNN [55]). The algorithm and parameter settings of the proposed CO-VAE and DCO-VAE models and the comparison methods are described in detail in the Section of Methods.

As shown in Table 1, the graphs generated by CO-VAE and DCO-VAE both have very high F1 scores and outperform the other methods by a large margin. Specifically, our precision is 0.67 and 0.71 on average, over 53.4% and 56.7% higher than the best performance of the comparison methods; our coverage is 0.55 and 0.58 on average, over 12.7% and 24.8% higher; our recall is 0.57 and 0.59 on average, over 12.2% and 15.2% higher; our F1 score is 0.61 and 0.64 on average, over 52.4% and 54.7% higher. Moreover, the proposed methods (CO-VAE and DCO-VAE) obtain the best performance in the four metrics in almost 76% of the proteins. This indicates that CO-VAE and DCO-VAE truly learn the underlying distribution of the contact map decoys for an amino-acid sequence. The good performance of the proposed methods relies on their special architecture, where each amino acid has its unique parameters for extracting features from its contacts, and structural similarity among different amino acids is considered in the generation process.
Table 1. Evaluation of generated contact maps by comparison to the native one for different proteins
Protein | Model    | Precision | Coverage | Recall | F1_score
1DTJA   | Graphite | 0.1320    | 0.5000   | 0.5000 | 0.2031
1DTJA   | GVAE     | 0.1390    | 0.5109   | 0.5000 | 0.2101
1DTJA   | GraphRNN | 0.3153    | 0.1301   | 0.1301 | 0.1733
1DTJA   | DCO-VAE  | 0.6136    |          |        |
1DTDB   | Graphite | 0.1526    | 0.4893   | 0.5002 | 0.2338
1DTDB   | GVAE     | 0.1526    | 0.4892   | 0.5001 | 0.2339
1DTDB   | GraphRNN | 0.2974    | 0.2696   | 0.2755 | 0.2858
1DTDB   | CO-VAE   | 0.7127    |          |        |
        | DCO-VAE  | 0.4638    | 0.4363   | 0.4531 | 0.4584
1C8CA   | Graphite | 0.1322    | 0.4907   | 0.5002 | 0.2091
1C8CA   | GVAE     | 0.1319    | 0.4900   | 0.4995 | 0.2088
1C8CA   | GraphRNN | 0.3381    | 0.3065   | 0.3124 | 0.3239
1C8CA   | CO-VAE   | 0.6513    |          |        |
1CC5    | Graphite | 0.1234    | 0.4843   | 0.4997 | 0.1980
1CC5    | GVAE     | 0.1234    | 0.4846   | 0.4999 | 0.1979
1CC5    | GraphRNN | 0.3489    | 0.2733   | 0.2820 | 0.3117
1CC5    | DCO-VAE  | 0.8965    | 0.5864   | 0.6051 | 0.7225
1HZ6A   | Graphite | 0.1500    |          |        |
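The per-map scores in Table 1 can be computed as in the following minimal sketch. The strict upper-triangle convention is our assumption, and coverage is omitted here since its exact definition follows the paper's Methods.

```python
import numpy as np

def contact_scores(generated: np.ndarray, native: np.ndarray):
    """Precision, recall, and F1 of a generated contact map against the native one.

    Both arguments are binary N x N matrices; only the strict upper triangle is
    scored so that the diagonal and symmetric duplicates are not double-counted.
    """
    iu = np.triu_indices(native.shape[0], k=1)
    gen, nat = generated[iu].astype(bool), native[iu].astype(bool)
    tp = np.logical_and(gen, nat).sum()
    precision = tp / max(gen.sum(), 1)
    recall = tp / max(nat.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```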
Percentage of native and non-native contacts in a structure
For each generated contact map, we calculate the percentage of its contacts that are indeed native contacts, namely those contacts present in the native structure. Specifically, given the contacts of a generated structure (each model) and the known native structure, we calculate the percentage of contacts in the native structure that are also in the reconstructed structure. The percentage for each reconstructed structure (model) is calculated for each protein per method, and then an average over all reconstructed structures is reported per protein per method. In addition, we also calculate the percentage of non-native contacts: the contacts that are in a reconstructed structure but not in the native structure. For each model, we calculate the number of non-native contacts and divide it by the number of amino acids in the protein. Then we report the average percentage over all structures per protein per method. The comparison methods are the same ones mentioned above.

As shown in Table 2, the proposed CO-VAE and DCO-VAE both have the highest percentage of native contacts and the lowest percentage of non-native contacts in around 86% of all the proteins, often by a very large margin (e.g., 33% on average for native contacts and 61% on average for non-native contacts compared to the best performance of the comparison methods). The results demonstrate that the proposed methods can accurately generate contacts like the native ones, and the generated contact maps have a similar balance between contacts and non-contacts in the structure. Moreover, even though DCO-VAE enforces disentanglement in the objective function of the model, its performance on contact map generation remains unaffected, and is sometimes even better than that of CO-VAE.
Table 2. Evaluation of the percentage of contacts and non-contacts in the generated structures
Protein ID | Dimension | Metric type  | VGAE  | Graphite | GraphRNN | CO-VAE | DCO-VAE
1AIL       | 73        | contacts     | 49.41 | 49.47    | 35.94    |        |
           | 78        | non-contacts | 89.39 | 89.39    | 65.10    | 14.54  |
           | 64        | non-contacts | 86.68 | 86.68    | 68.11    | 9.65   |
           | 99        | non-contacts | 90.82 | 90.83    | 72.88    | 10.77  |
           | 62        | non-contacts | 85.15 | 85.19    | 72.65    | 19.65  |
           | 66        | non-contacts | 86.93 | 86.96    | 71.81    |        |
           | 94        | non-contacts | 99.00 | 99.00    | 98.73    |        |
           | 133       | non-contacts | 93.56 | 93.61    | 76.86    | 19.92  |
Evaluating the learned graph distribution
Since the successful generation of contact maps relies on successfully learning the distribution of the contact maps, we evaluate whether the generated contact maps follow the learned distributions. By regarding each contact map as a graph, we can calculate four properties of each graph: density, number of edges, average degree correlation, and transitivity. The density of a graph is the ratio of the number of edges to the number of possible edges. The average degree correlation measures the similarity of connections in the graph with respect to node degree. Transitivity is the overall probability for the graph to have adjacent nodes interconnected. All these properties can be calculated with the open-source API NetworkX.

The distance between the distribution of the generated contact maps and the distribution of the training sets in terms of the four properties is then measured by three metrics: the Pearson correlation coefficient, the Bhattacharyya distance, and the Earth Mover's Distance (EMD). In statistics, the Pearson correlation coefficient (PCC) is a measure of the linear correlation between two variables X and Y; here, X and Y refer to a given graph property of the generated graphs and the training graphs, respectively. The Bhattacharyya distance and EMD both measure the similarity of two probability distributions, the smaller the better. The results for one example protein are shown in Table 3; we also compare our proposed methods with the comparison methods. The results for the other proteins can be seen in the supplemental materials.

As shown in Table 3, considering EMD distance, the proposed CO-VAE has the best performance regarding all the graph properties. For the Bhattacharyya distance, the proposed CO-VAE outperformed the comparison methods by 33.38%, and the disentangled version outperformed them by 33.68%. The Pearson similarity is very small, which demonstrates that the generated graphs vary to some degree from the training set, ensuring the diversity of the generated contact maps. Specifically, in terms of EMD distance, the proposed CO-VAE and DCO-VAE achieve 1.4 and 2.9 on average, which is 65.8% and 29.2% smaller than the best-performing comparison method.
Table 3. Evaluation of the learned distributions by comparison to the training sets for protein 1DTJA

Graph Property  | Method   | Pearson | Bhattacharyya | EMD
Density         | Graphite | 0.008   | 5.410         | 0.465
Density         | GVAE     | 0.007   | 5.415         | 0.466
Density         | GraphRNN | 0.022   |               |
Density         | DCO-VAE  | 0.004   | 3.740         | 0.004
Number of Edges | Graphite | 0.007   | 5.410         | 1327
Number of Edges | GVAE     | 0.007   | 5.415         | 1329
Number of Edges | GraphRNN | 0.022   |               |
Number of Edges | DCO-VAE  | 0.004   | 3.741         | 11.29
Ave-Degree Cor  | Graphite | 0.030   | 5.056         | 0.136
Ave-Degree Cor  | GVAE     | 0.009   | 5.401         | 0.030
Ave-Degree Cor  | GraphRNN | Nan     | 3.413         | Nan
Ave-Degree Cor  | CO-VAE   | 0.005   | 5.310         |
Ave-Degree Cor  | DCO-VAE  | 0.004   |               |
Transitivity    | DCO-VAE  | 0.005   | 3.571         | 0.027
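The graph properties and distribution distances above can be computed as in the following sketch; the histogram-based Bhattacharyya estimate and the equal-length truncation for the Pearson coefficient are our assumptions, not details fixed by the paper.

```python
import numpy as np
import networkx as nx
from scipy.stats import pearsonr, wasserstein_distance

def graph_properties(adj: np.ndarray) -> dict:
    """Density, edge count, average degree correlation (assortativity), and
    transitivity of one contact graph; self-loops are dropped first."""
    A = adj.copy()
    np.fill_diagonal(A, 0)
    G = nx.from_numpy_array(A)
    return {"density": nx.density(G),
            "n_edges": G.number_of_edges(),
            "avg_degree_cor": nx.degree_assortativity_coefficient(G),
            "transitivity": nx.transitivity(G)}

def bhattacharyya(x, y, bins: int = 30) -> float:
    """Histogram-based Bhattacharyya distance between two 1-D samples."""
    lo, hi = min(np.min(x), np.min(y)), max(np.max(x), np.max(y))
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()
    return -np.log(np.sum(np.sqrt(p * q)) + 1e-12)

def distribution_distances(gen_vals: np.ndarray, train_vals: np.ndarray) -> dict:
    """Compare one property's distribution over generated vs. training graphs."""
    n = min(len(gen_vals), len(train_vals))  # pearsonr needs equal-length samples
    return {"pearson": pearsonr(gen_vals[:n], train_vals[:n])[0],
            "bhattacharyya": bhattacharyya(gen_vals, train_vals),
            "emd": wasserstein_distance(gen_vals, train_vals)}
```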
Visualizing the 3D Structure from Generated Contact Maps
Given the contact maps generated by the proposed CO-VAE and DCO-VAE, we use the existing tool CONFOLD [1] to reconstruct each contact map into its corresponding 3D structure. To visualize and evaluate the resulting 3D structures of the generated proteins, we use the tool PyMOL [11], a widely-used and user-sponsored molecular visualization system. We show the results both for the proposed CO-VAE model and for DCO-VAE, which is disentangled, and we also compare our proposed models with the graph generative model GraphRNN. As shown in Fig. 1, the proposed CO-VAE and DCO-VAE have better performance in recovering both the secondary and the tertiary structure.

Fig. 1. Reconstructing the 3D structure from the generated contact maps: we reconstruct the 3D structure based on the generated contact maps of five proteins (i.e., 1DTJA, 1AIL, 1DTDB, 1HHP, and 1SAP) shown as examples; the contact maps are generated by the proposed CO-VAE and DCO-VAE, as well as by the comparison method GraphRNN.
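Since CONFOLD consumes contacts in the CASP RR text format, a small export helper suffices to bridge the generated maps to the reconstruction step. The helper below is our illustrative sketch; the minimum sequence separation and the fixed 8 Å upper bound are assumptions rather than the paper's settings.

```python
import numpy as np

def write_rr(contact_map: np.ndarray, sequence: str, path: str, min_sep: int = 3):
    """Write a binary contact map in CASP RR format for input to CONFOLD.

    Each contact line is 'i j 0 8 confidence' (1-indexed residue pair with an
    assumed 8 Angstrom upper bound); pairs closer than min_sep in sequence are
    skipped, since trivial local contacts carry no folding information.
    """
    n = contact_map.shape[0]
    with open(path, "w") as f:
        f.write(sequence + "\n")  # RR files begin with the amino-acid sequence
        for i in range(n):
            for j in range(i + min_sep, n):
                if contact_map[i, j]:
                    f.write(f"{i + 1} {j + 1} 0 8 1.0\n")
```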
Interpreting the disentangled latent representation
It is important to be able to measure the level of disentanglement achieved by different models. In this subsection, we qualitatively demonstrate that our proposed DCO-VAE framework consistently discovers more latent factors and disentangles them in a cleaner fashion. By learning a latent code representation of a protein structure, we assume each variable in the latent code corresponds to a certain factor or property in generating the protein structure. Thus, by changing the value of one variable continuously and fixing the remaining variables, we can visualize the corresponding change in the generated contact map and 3D protein structure.

First, an example protein, 1DTJA, is analyzed and displayed. As shown in Fig. 2 and Fig. 3, we use the convention that in each row one latent code varies from left to right while the other latent codes and noise are fixed. Different rows correspond to different random samples of fixed latent codes and noise. For instance, in Fig. 2, a column contains four generated contact map samples for each variable, and a row shows the generated contact maps for five values between 1 and 10000 of a certain variable of the latent code, with the other noise fixed. From the changing contact maps in each row, it is easy to identify which part of the contact map is related to, or controlled by, the corresponding variable. We highlight the part controlled by each variable with a red circle in Fig. 2. The corresponding 3D structure visualization is shown in Fig. 3. In this way, we can easily interpret the role of each variable in the latent code in controlling the generation of the contact maps and the 3D structure. Second, some other proteins are also shown in Fig. 4 and Fig. 5, which demonstrates that the interpretable model generalizes to other proteins.
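The traversal itself is straightforward to script. Below is a minimal sketch, where `decoder` stands for the trained DCO-VAE decoder (a hypothetical handle, not an interface the paper defines) and the traversal values mirror the 1-to-10000 range used in the figures.

```python
import torch

@torch.no_grad()
def latent_traversal(decoder, z_base: torch.Tensor, dim: int,
                     values=(1, 2500, 5000, 7500, 10000)):
    """Vary one latent variable while holding the others fixed.

    decoder maps a latent code to N x N edge probabilities; z_base is a sampled
    code. Returns one generated contact map per traversal value, thresholded at 0.5.
    """
    maps = []
    for v in values:
        z = z_base.clone()
        z[dim] = float(v)          # perturb only the dim-th latent variable
        probs = decoder(z)         # edge probabilities p(A_ij | Z)
        maps.append((probs > 0.5).int().numpy())
    return maps
```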
Discussion

In summary, our study aims at utilizing a deep neural network-based generative model for generating tertiary protein structures as well as interpreting the generation process. To the best of our knowledge, we have shown for the first time the development and application of an interpretative graph variational autoencoder for the problem of protein structure generation and interpretation. This demonstrated the promise of generative models in directly revealing the latent space for sampling novel tertiary structures, highlighting axes/factors that carry structural meaning, and opening the black box of deep models.

Fig. 2. Contact map interpretation for protein 1DTJA: four semantic factors are discovered in the latent variables, which can control the local structural features of the contact maps; the value of each latent variable travels from 1 to 10000, and five segments are selected to visualize the local structural feature variations.

By treating the tertiary protein structure construction problem as the generation of contact maps, we proposed a new VAE, named Contact Map VAE (CO-VAE), with a new graph encoder and decoder, along with a disentangled variant (DCO-VAE). It first learns the underlying distribution of contact maps and then generates additional contact maps by sampling from the learned distribution. The similarity of the structures (e.g., in terms of precision, recall, F1 score, and coverage), as well as the positive contact percentage between the generated contact maps and the native ones, demonstrated the quality of the generated contact maps. The proposed methods (CO-VAE and DCO-VAE) show great advantages over the existing models, obtaining the best performance in the four metrics in almost 76% of the proteins, and the highest percentage of native contacts as well as the lowest percentage of non-native contacts in around 86% of all the proteins. Furthermore, we reconstructed the generated contact maps into 3D protein structures. The visualization of the reconstructed 3D proteins illustrated that the generated contact maps are valid and useful for further generating the 3D tertiary protein structure.

To further investigate whether the proposed CO-VAE indeed learns the distribution of the observed contact map samples, we generated a set of contact maps and compared their underlying distribution with that of the real contact maps. Since it is difficult to directly evaluate the graph distribution, we resorted to evaluating the distribution of graph properties, such as density, edge numbers, average degree correlation, and transitivity. The small EMD and Bhattacharyya distances between the learned and real distributions, in terms of all four properties of graphs generated by CO-VAE and DCO-VAE, validated that the underlying graph distribution is effectively discovered; these distances outperformed those calculated from the graphs generated by the comparison methods by 33.4% and 47.5% on average. Though the Pearson correlation score was very low, it showed that the diversity of the generated graphs was ensured, since it measures the strength and direction of a linear relationship between two variables rather than the similarity of two distributions.

Fig. 3. Tertiary protein structure interpretation for protein 1DTJA: tertiary protein structures are reconstructed based on the generated contact maps in Fig. 2; the amino acids within the same secondary structure are shown in the same color.

Fig. 4. Contact map interpretation for different proteins: the generated contact maps of four proteins (i.e., 1AOY, 1HHP, 1ISUA, and 2H5ND) are visualized as examples; for each protein, one of the semantic factors discovered in the latent variables is shown to control the local structural features of the contact maps; the value of the latent variable travels from 1 to 10000, and five segments are selected to visualize the local structural feature variations.

Fig. 5. Tertiary protein structure interpretation for different proteins: tertiary protein structures are reconstructed based on the generated contact maps in Fig. 4; the amino acids within the same secondary structure are shown in the same color.

Next, to explore our generative model's capability of interpreting the process of contact map generation, we enhanced our CO-VAE by increasing the weight of the second term of the training objective, leading to disentanglement among the latent variables and hence a new interpretable model, named DCO-VAE. The learned latent variables in DCO-VAE are expected to be related to the factors that influence the formation of the contact maps. As a result, for each latent variable, by varying the value of this variable from 1 to 10,000 while fixing the values of the others, a certain local part of the generated contact maps showed obvious trends (e.g., contracting or stretching). This demonstrated that the learned latent variables are effectively disentangled, which indicated the potential semantic factors in the formation of contact maps and the corresponding protein structures.

From one perspective, though some deep generative models have been applied to protein contact map generation, they cannot utilize the relationships among amino acids, as they do not treat the contact maps as graph-structured data with graph generative models. From another perspective, though many interpretable learning models have been explored and applied to image generation, there are no interpretable models that can discover the semantic factors controlling the process of protein folding. In summary, to the best of our knowledge, this is the first time an interpretable deep generative model for graphs has been applied to protein structure prediction, and its effectiveness in generating good-quality protein structures is demonstrated. In addition, the proposed DCO-VAE can also be applied to other real-world applications where there is a need for graph-structured data generation and interpretation.

Many promising and challenging research topics originate from the work in this paper. For example, it would be interesting and potentially beneficial to develop an end-to-end tertiary protein structure generation model that operates directly on 3D structures instead of contact maps. This is because there is a gap between contact map generation and the 3D structure formation process, and the learned variables can only explain the formation of contact maps rather than the 3D structure. In addition, the exploration of a node-edge joint generative model would also be highly interesting. The proposed DCO-VAE model focuses on generating the graph topology (i.e., via contact maps) instead of node features (e.g., properties and types of amino acids). Jointly generating both graph topology and node features could be important in some cases, such as when directly generating the 3D structure, where the node features can be the 3D positions.
Methods

Formulating tertiary structure in a recoverable manner requires preserving information not only on which atom is bonded to which other atom but, more importantly, on which atoms are in proximity of each other in three-dimensional space. As shown in Fig. 6, to address this issue, we employ the contact map graph, which can be trivially computed from tertiary structures, as the input to our model. Recovering a tertiary structure from a contact map is, in turn, comparatively straightforward [1].
Fig. 6. Overall schematic of the proposed generative learning framework.
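The "trivial computation" referenced above amounts to thresholding pairwise CA-CA distances. A minimal sketch follows; the 8 Å cutoff is an assumed, commonly used value rather than one fixed by the text.

```python
import numpy as np

def contact_map(ca_coords: np.ndarray, threshold: float = 8.0) -> np.ndarray:
    """Binary N x N contact map from the (N, 3) CA coordinates of a structure."""
    dists = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    A = (dists <= threshold).astype(np.float32)
    np.fill_diagonal(A, 1.0)  # diagonal set to 1, matching the model's assumption
    return A
```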
Hence, the contact map is a graph-based representation of a tertiary structure that effectively embeds the three-dimensional Cartesian space into a two-dimensional space. Specifically, in our approach, let the contact graph G = (E, F) be associated with its edge attribute tensor E ∈ R^{N×N×L}, which reduces to the adjacency matrix A when L = 1, and its node attribute matrix F ∈ R^{N×L} (as shown in Fig. 6), denoting the identity of each atom by a one-hot embedding, where N is the number of atoms over which contacts are computed; one can do so for all atoms in a molecule, or for representative atoms to control the size of the input space. The edge and node attributes are rich mechanisms for adding information. For instance, node attributes can store not just the identities of the amino acids but also their PSSM profile (derived from the Position-Specific Scoring Matrix), as well as their solvent accessibility and secondary structure as derived from a given tertiary structure. The edge attributes can encode additional information about contacts, such as their exact distance and/or the contact predicted for that pair of amino acids (vertices) from sequence information alone. In our experiments, we are given an undirected, unweighted graph G as described above, and we only use the adjacency matrix A of G (we assume the diagonal elements are set to 1, i.e., every node is connected to itself) and node attributes F storing the identities of the amino acids, with L = 20 (the number of amino-acid types).
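A minimal sketch of the node attribute matrix F used here, with one-hot amino-acid identities (L = 20):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid types (L = 20)

def node_features(sequence: str) -> np.ndarray:
    """One-hot node attribute matrix F in R^{N x L} encoding amino-acid identity."""
    F = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(sequence):
        F[i, AMINO_ACIDS.index(aa)] = 1.0
    return F
```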
Deep Contact Map Variational Auto-encoder
The task in this paper requires learning disentangled generative models on contact maps that encode graph information. Although disentanglement enhancement and deep graph generative models are each attracting fast-increasing attention in recent years, their synergistic integration has rarely been explored. To learn the distribution of the contact map decoys of an amino-acid sequence, we first propose a new graph variational autoencoder framework that consists of a powerful graph encoder, a graph decoder, and a latent graph code disentanglement mechanism. We first introduce the model without the disentanglement mechanism, named the Deep Contact Map Variational Autoencoder (CO-VAE).

Specifically, given the adjacency matrix A of a contact map, we introduce stochastic latent variables z_i, summarized in a 1 × H vector Z. The overall architecture contains a graph encoder and decoder, which are trained by optimizing the variational lower bound L with respect to the variational parameters W_i:

L = E_{q(Z|F,A)}[log p(A|Z,F)] − KL[q(Z|F,A) || p(Z)],    (1)

where KL[q(·) || p(·)] is the Kullback-Leibler divergence between q(·) and p(·). The first term is the reconstruction loss of the generated contact maps, and the second term enforces the inferred latent vectors to be close to the prior latent distribution. We take a Gaussian prior p(Z) = ∏_i p(z_i) = ∏_i N(z_i | 0, I).

For the encoder, we take a simple inference model parameterized by a two-layer Graph Convolutional Neural Network (GCN) [26]:

q(Z|F,A) = ∏_{i=1}^{N} q(z_i|F,A), where q(z_i|F,A) = N(z_i | μ_i, diag(σ_i^2)).    (2)

Here μ = GCN_μ(F,A) is the matrix of mean vectors, inferred by a GCN; similarly, log σ = GCN_σ(F,A) is the standard deviation of the latent vectors, inferred by another GCN. It is thus possible to sample Z from the distribution of the latent vectors q(Z|F,A).

For the generative model, we utilize the graph decoder network proposed in our previous work [17]. That work first proposed graph deconvolution-based graph decoders, which achieve the best performance in the graph generation task when the node set of the graph is fixed. Thus, we choose this graph decoder as part of our CO-VAE:

p(A|Z,F) = ∏_{j=1}^{N} ∏_{i=1}^{N} p(A_ij | Z),    (3)

where A_ij are the elements of A. We perform full-batch gradient descent and make use of the re-parameterization trick [25] for training. The architecture and mathematical operations of the graph encoder modelling q(Z|F,A) and the decoder modelling p(A|Z,F) are detailed in the supplemental materials.
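Eqs. 1-3 can be prototyped compactly. The sketch below is a simplified stand-in, not the paper's implementation: the encoder follows the two-layer GCN of Eq. 2, but the graph-deconvolution decoder of [17] is replaced by a small MLP over the latent codes, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A: torch.Tensor) -> torch.Tensor:
    """Symmetric GCN normalization D^{-1/2} A D^{-1/2}; A already has self-loops."""
    d_inv_sqrt = A.sum(-1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(-1) * A * d_inv_sqrt.unsqueeze(0)

class CoVAE(nn.Module):
    """Simplified CO-VAE: two-layer GCN encoder (Eq. 2), factorized Bernoulli
    edge decoder (Eq. 3), and the ELBO of Eq. 1 as the training loss."""

    def __init__(self, n_nodes: int, n_feats: int, hidden: int = 32, latent: int = 8):
        super().__init__()
        self.W0 = nn.Linear(n_feats, hidden, bias=False)      # shared GCN layer 1
        self.W_mu = nn.Linear(hidden, latent, bias=False)     # GCN_mu head
        self.W_sigma = nn.Linear(hidden, latent, bias=False)  # GCN_sigma head
        # Stand-in decoder: maps the flattened latent codes to N x N edge logits.
        self.decoder = nn.Sequential(
            nn.Linear(n_nodes * latent, 256), nn.ReLU(),
            nn.Linear(256, n_nodes * n_nodes))
        self.n_nodes = n_nodes

    def encode(self, Fmat, A):
        A_hat = normalize_adj(A)
        H = F.relu(A_hat @ self.W0(Fmat))
        return A_hat @ self.W_mu(H), A_hat @ self.W_sigma(H)  # mu, log sigma

    def forward(self, Fmat, A, beta: float = 1.0):
        # Fmat (N x L) and A (N x N) are float tensors; A has unit diagonal.
        mu, log_sigma = self.encode(Fmat, A)
        z = mu + torch.randn_like(mu) * torch.exp(log_sigma)  # reparameterization [25]
        logits = self.decoder(z.reshape(-1)).reshape(self.n_nodes, self.n_nodes)
        recon = F.binary_cross_entropy_with_logits(logits, A, reduction="sum")
        kl = -0.5 * torch.sum(1 + 2 * log_sigma - mu.pow(2) - torch.exp(2 * log_sigma))
        return recon + beta * kl  # beta = 1 is CO-VAE; beta > 1 gives DCO-VAE (Eq. 4)
```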
Disentangled Contact Map VAE

Next we introduce the Contact Map VAE with the disentanglement mechanism, which is inspired by the β-VAE [20]. The β-VAE was proposed for the automated discovery of interpretable, factorised latent representations from data in a completely unsupervised manner, and has been used in the image domain [21] and in Natural Language Processing [54]. It is a modification of the variational autoencoder (VAE) framework that introduces an adjustable hyperparameter β to balance latent channel capacity and independence constraints against reconstruction accuracy. We thus propose the Disentangled Contact Map VAE (DCO-VAE) to interpret the latent representations for protein generation. We augment the CO-VAE framework with a single hyperparameter β that modulates the learning constraints applied to the model. These constraints impose a limit on the capacity of the latent information channel and control the emphasis on learning statistically independent latent factors. Eq. 1 can be re-written to arrive at the DCO-VAE formulation, with the addition of the β coefficient:

L = E_{q(Z|F,A)}[log p(A|Z,F)] − β KL[q(Z|F,A) || p(Z)],    (4)

where β = 1 recovers the CO-VAE objective and β > 1 places a stronger constraint on the latent bottleneck, encouraging disentangled latent representations.
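Using the CoVAE sketch above, training DCO-VAE only changes the β passed to the loss. The loop below is a hypothetical sketch: the `training_maps` iterable and the β value are our placeholders, not the paper's settings.

```python
import torch

model = CoVAE(n_nodes=73, n_feats=20)   # e.g., 1AIL has 73 residues (Table 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for Fmat, A in training_maps:           # hypothetical iterable of (F, A) decoy pairs
    loss = model(Fmat, A, beta=4.0)     # beta > 1 enforces disentanglement (Eq. 4)
    opt.zero_grad()
    loss.backward()
    opt.step()
```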
DATA AVAILABILITY

All data, models, and source code are freely available to readers upon contacting the authors.
REFERENCES
[1] Badri Adhikari, Debswapna Bhattacharya, Renzhi Cao, and Jianlin Cheng. 2015. CONFOLD: residue-residue contact-guided ab initio protein folding. Proteins: Structure, Function, and Bioinformatics 83, 8 (2015), 1436–1449.
[2] B. Adhikari, J. Hou, and J. Cheng. 2018. DNCON2: improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 34 (2018), 1466–1472.
[3] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. 2016. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016).
[4] Namrata Anand, Raphael Eguchi, and Po-Ssu Huang. 2019. Fully differentiable full-atom protein backbone generation. (2019).
[5] Namrata Anand and Possu Huang. 2018. Generative modeling for protein structures. In Advances in Neural Information Processing Systems. 7494–7505.
[6] Namrata Anand and Possu Huang. 2018. Generative modeling for protein structures. In Advances in Neural Information Processing Systems. 7494–7505.
[7] D. D. Boehr and P. E. Wright. 2008. How do proteins interact? Science (2008).
[8] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. 2018. NetGAN: Generating graphs via random walks. arXiv preprint arXiv:1803.00816 (2018).
[9] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. 2018. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems. 2610–2620.
[10] Jianlin Cheng and Pierre Baldi. 2007. Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics 8, 1 (2007), 113.
[11] Warren Lyford DeLano. 2002. PyMOL.
[12] Pietro Di Lena, Ken Nagata, and Pierre Baldi. 2012. Deep architectures for protein contact map prediction. Bioinformatics 28, 19 (2012), 2449–2457.
[13] Jesse Eickholt and Jianlin Cheng. 2012. Predicting protein residue–residue contacts using deep networks and boosting. Bioinformatics 28, 23 (2012), 3066–3072.
[14] Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, Narayanaswamy Siddharth, Brooks Paige, Dana H Brooks, Jennifer Dy, and Jan-Willem van de Meent. 2018. Structured disentangled representations. arXiv preprint arXiv:1804.02086 (2018).
[15] Piero Fariselli, Osvaldo Olmea, Alfonso Valencia, and Rita Casadio. 2001. Prediction of contact maps with neural networks and correlated mutations. Protein Engineering 14, 11 (2001), 835–843.
[16] Aditya Grover, Aaron Zweig, and Stefano Ermon. 2018. Graphite: Iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459 (2018).
[17] Xiaojie Guo, Lingfei Wu, and Liang Zhao. 2018. Deep Graph Translation. arXiv preprint arXiv:1805.09980 (2018).
[18] Nicholas Hamilton, Kevin Burrage, Mark A Ragan, and Thomas Huber. 2004. Protein contact prediction using patterns of correlation. Proteins: Structure, Function, and Bioinformatics 56, 4 (2004), 679–684.
[19] J. Hanson, K. Paliwal, T. Litfin, Y. Yang, and Y. Zhou. 2018. Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks. Bioinformatics 34 (2018), 4039–4045.
[20] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2, 5 (2017), 6.
[21] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV). 172–189.
[22] John Ingraham, Adam Riesselman, Chris Sander, and Debora Marks. 2019. Learning protein structure with a differentiable simulator. In International Conference on Learning Representations.
[23] D. T. Jones and S. M. Kandathil. 2018. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 34 (2018), 3308–3315.
[24] Hyunjik Kim and Andriy Mnih. 2018. Disentangling by factorising. arXiv preprint arXiv:1802.05983 (2018).
[25] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[26] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[27] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
[28] A. Kryshtafovych, B. Monastyrskyy, K. Fidelis, T. Schwede, and A. Tramontano. 2017. Assessment of model accuracy estimations in CASP12. Proteins: Struct, Funct, Bioinf 86, Suppl 1 (2017), 345–360.
[29] P. Kukic, P. Mirabello, G. Tradigo, I. Walsh, P. Veltri, and G. Pollastri. 2014. Toward an accurate prediction of inter-residue distances in proteins using 2D recursive neural networks. BMC Bioinf 15 (2014), 6.
[30] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. 2017. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848 (2017).
[31] A. Leaver-Fay, M. Tyka, S. M. Lewis, O. F. Lange, J. Thompson, R. Jacak, et al. 2011. ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487 (2011), 545–574.
[32] J. Lee, P. Freddolino, and Y. Zhang. 2017. Ab initio protein structure prediction. In From Protein Structure to Function with Bioinformatics (2 ed.), D. J. Rigden (Ed.). Springer London, Chapter 1, 3–35.
[33] Yunqi Li, Yaping Fang, and Jianwen Fang. 2011. Predicting residue–residue contacts using random forest models. Bioinformatics 27, 24 (2011), 3379–3384.
[34] Y. Li, C. Zhang, E. W. Bell, D.-J. Yu, and Y. Zhang. 2019. Ensembling multiple raw coevolutionary features with deep residual neural networks for contact-map prediction in CASP13. Proteins: Struct, Funct, Bioinf 87, 12 (2019), 1082–1091.
[35] Y. Liu, P. Palmedo, Q. Ye, B. Berger, and J. Peng. 2018. Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst 6, e3 (2018), 65–74.
[36] Lorenzo Livi, Enrico Maiorino, Alessandro Giuliani, Antonello Rizzi, and Alireza Sadeghian. 2016. A generative model for protein contact networks. Journal of Biomolecular Structure and Dynamics 34, 7 (2016), 1441–1454.
[37] A. Liwo, J. Lee, D. R. Ripoll, J. Pillardy, and H. A. Scheraga. 1999. Protein structure prediction by global optimization of a potential energy function. Proc Natl Acad Sci USA 96, 10 (1999), 5482–5485.
[38] Romain Lopez, Jeffrey Regier, Michael I Jordan, and Nir Yosef. 2018. Information constraints on auto-encoding variational bayes. In Advances in Neural Information Processing Systems. 6114–6125.
[39] Heng Ma, Debsindhu Bhowmik, Hyungro Lee, Matteo Turilli, Michael T Young, Shantenu Jha, and Arvind Ramanathan. 2019. Deep generative model driven protein folding simulation. arXiv preprint arXiv:1908.00496 (2019).
[40] T. Maximova, R. Moffatt, B. Ma, R. Nussinov, and A. Shehu. 2016. Principles and Overview of Sampling Methods for Modeling Macromolecular Structure and Dynamics. PLoS Comp. Biol. 12, 4 (2016), e1004619.
[41] M. Michel, D. M. Hurtado, and A. Elofsson. 2019. PconsC4: fast, accurate and hassle-free contact predictions. Bioinformatics 35 (2019), 2677–2679.
[42] Gianluca Pollastri and Pierre Baldi. 2002. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 18, suppl_1 (2002), S62–S70.
[43] Sari Sabban and Mikhail Markovsky. 2019. RamaNet: Computational De Novo Protein Design using a Long Short-Term Memory Generative Adversarial Neural Network. BioRxiv (2019), 671552.
[44] Bidisha Samanta, Abir De, Gourhari Jana, Pratim Kumar Chattaraj, Niloy Ganguly, and Manuel Gomez Rodriguez. 2019. NeVAE: A deep generative model for molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1110–1117.
[45] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, et al. 2019. Protein structure prediction using multiple deep neural networks in CASP13. Proteins: Struct, Funct, Bioinf 87, 12 (2019), 1141–1148.
[46] A. Shehu. 2013. Probabilistic Search and Optimization for Protein Energy Landscapes. In Handbook of Computational Molecular Biology, S. Aluru and A. Singh (Eds.). Chapman & Hall/CRC Computer & Information Science Series.
[47] Martin Simonovsky and Nikos Komodakis. 2018. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks. Springer, 412–422.
[48] Allison N Tegge, Zheng Wang, Jesse Eickholt, and Jianlin Cheng. 2009. NNcon: improved protein contact map prediction using 2D-recursive neural networks. Nucleic Acids Research 37, suppl_2 (2009), W515–W518.
[49] M. Torrisi, G. Pollastri, and Q. Le. 2020. Deep learning methods in protein structure prediction. Comput and Struct Biotech J (2020), 1–10.
[50] M. Vendruscolo, E. Kussell, and E. Domany. 1997. Recovery of protein structure from contact maps. Folding and Design 1, 5 (1997), 295–306.
[51] Ian Walsh, Davide Baù, Alberto JM Martin, Catherine Mooney, Alessandro Vullo, and Gianluca Pollastri. 2009. Ab initio and template-based prediction of multi-class distance maps by two-dimensional recursive neural networks. BMC Structural Biology 9, 1 (2009), 5.
[52] S. Wang, S. Sun, Z. Li, R. Zhang, and J. Xu. 2017. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol 13 (2017), e1005324.
[53] Sitao Wu and Yang Zhang. 2008. A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics 24, 7 (2008), 924–931.
[54] Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. 2018. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems. 1031–1042.
[55] Jiaxuan You, Rex Ying, Xiang Ren, William L Hamilton, and Jure Leskovec. 2018. GraphRNN: Generating realistic graphs with deep auto-regressive models. arXiv preprint arXiv:1802.08773 (2018).
[56] A. Zaman, P. Parthasarathy, and A. Shehu. 2019. Using Sequence-Predicted Contacts to Guide Template-free Protein Structure Prediction. In ACM Conf on Bioinf and Comp Biol (BCB). Niagara Falls, NY, 154–160.
[57] A. Zaman and A. Shehu. 2019. Balancing multiple objectives in conformation sampling to control decoy diversity in template-free protein structure prediction. BMC Bioinformatics 20, 1 (2019), 211. https://doi.org/10.1186/s12859-019-2794-5
[58] G. Zhang, L. Ma, X. Wang, and X. Zhou. 2018. Secondary Structure and Contact Guided Differential Evolution for Protein Structure Prediction. IEEE/ACM Trans Comput Biol and Bioinf (2018). https://doi.org/10.1109/TCBB.2018.2873691
[59] Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2017. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262 (2017).
ACKNOWLEDGEMENTS
This work was supported in part by the National Science Foundation Grant No. 1907805 to AS and LZ and a Jeffress Memorial Trust Award to LZ and AS. The authors additionally thank members of the Zhao and Shehu laboratories for valuable feedback during this work.
AUTHOR CONTRIBUTIONS
XG implemented and evaluated the proposed methodologies, as well as drafted the manuscript. ST assisted withpreparation of the input data and the evaluation of reconstructed structures. LZ and AS conceptualized the methodologiesand provided guidance on implementation and evaluation, as well as edited and finalized the manuscript.