Mimetic Neural Networks: A unified framework for Protein Design and Folding
Moshe Eliasof, Tue Boesen, Eldad Haber, Chen Keasar, Eran Treister
Abstract
Recent advancements in machine learning techniques for protein folding motivate better results in its inverse problem – protein design. In this work we introduce a new graph mimetic neural network, MimNet, and show that it is possible to build a reversible architecture that solves the structure and design problems in tandem, allowing us to improve protein design when the structure is better estimated. We use the ProteinNet data set and show that the state of the art results in protein design can be improved, given recent architectures for protein folding.
1. Introduction
Protein folding has been an open challenge in science for many years (Finkelstein & Galzitskaya, 2004; Ołdziej et al., 2005; Rose et al., 2006; AlQuraishi, 2019a). The goal of protein folding is to predict the 3D structure of a peptide chain given its amino acid (residue) composition. Traditional methods for folding were based on a physical understanding of the interaction potentials. However, they require considerable computational resources and often converge to a local minimum (Nedwidek & Hecht, 1997). In recent years, techniques based on machine learning, and in particular on deep neural networks, have been proposed for the solution of the problem (AlQuraishi, 2019a; Drori et al., 2019; Xu, 2019; Senior et al., 2020), showing major improvement in protein structure estimation. Such methods utilize advances in machine learning and network architectures and, perhaps more importantly, large protein data sets and advanced, non-trivial pre-processing techniques. These data sets are used in order to find homologous proteins by computing Multiple Sequence Alignments (MSA). The MSA is then used to compute first order statistics in terms of Position-Specific Score Matrices (PSSM) and second order statistics in terms of the covariance matrices of the homologous proteins. Since the second order statistics can yield an approximation to the distances between residues, they play a key role in the most accurate folding methods.
Recently, the problem of protein folding drew large attention with the summary of the CASP 14 competition (Liu et al., 2021). However, using first order information only (which suffices for protein design) has received notably less attention. Works related to ours can be found in (Li et al., 2017; Gao et al., 2018; AlQuraishi, 2019a; Torrisi et al., 2020). As we show in our numerical results in Sec. 3, our approach yields superior results given the first order data. Employing deep learning for the protein design task is a relatively new idea. An approach similar to ours was recently presented in (Strokach et al., 2020), where graph methods were proposed for protein design, reporting promising results by treating the problem as a graph node classification problem and surpassing other de-novo design codes. While (Strokach et al., 2020) and our method share some similarities, ours is largely different as we use reversible architectures, which offer numerous advantages discussed in the following. As we show in Sec. 3.2, our approach obtains better results on a large data set derived from the Protein Data Bank (PDB).
Main Contribution
The main part of this work is the introduction of a framework that unifies the treatment of protein folding and design. Our framework mimics the physics of folding a protein using a neural network. Hence, we coin the term Mimetic Deep Neural Networks (MimNet), which we apply to graphs describing protein structures. While our work focuses on protein folding and design, the proposed network can be applied with any node or edge data that is available, and thus it suits both first and second order statistics.

The main idea is to generate a reversible transformation from the structure to the sequence and vice versa by using reversible architectures. These networks allow us to jointly train the folding and the design problems, utilizing both the sequence and the structure of the protein simultaneously. We explore a family of neural networks that are designed to do just that. These are neural network architectures inspired by Hamiltonian dynamics and hyperbolic differential equations (Chang et al., 2018; Ruthotto & Haber, 2019). Such networks can propagate forward and backward, and they can utilize any type of layer, from structured to graph convolution or attention, harnessing recent advances in the understanding of protein folding architectures. In this paper, we particularly explore the use of Multiscale Graph Convolutional Networks (Multiscale GCNs), which are graph-based deep learning methods (Gao & Ji, 2019; Wang et al., 2019; Eliasof & Treister, 2020). Such networks can efficiently mimic the pairwise interactions of the potential in a three dimensional physical system.

Reversible networks are bidirectional, and therefore it is natural to train them for both folding and design simultaneously, effectively doubling the amount of data with respect to the network parameters. Another important advantage of such a network is its memory footprint. Since the network is reversible, it is possible to train an arbitrarily long network without storing the activations, at the cost of doubling the computation in the backward pass (Chang et al., 2018). This enables the use of very deep networks that are impossible to use otherwise.

The physical folding process can be described by a second order differential equation derived from Hamiltonian dynamics. Reversible architectures that are inspired by Hamiltonian dynamics can be used to simulate this process. One can therefore claim that such a mimetic network is more faithful to the physics of the protein folding problem compared to a standard deep network such as a ResNet (He et al., 2016).

The rest of the paper is organized as follows. In Sec. 2 we discuss the problem and introduce the key mathematical ideas which constitute the building blocks of our network. In particular, we discuss multiscale reversible networks and different types of graph convolution techniques that are used to solve the problem. We then define our MimNet and its objective functions. In Sec. 3 we perform numerical experiments with data obtained from ProteinNet (AlQuraishi, 2019b). ProteinNet is a publicly available data set that contains both sequences and PSSMs and thus allows for the training of a folding network with first order statistics, as done in (AlQuraishi, 2019a). The size of the data set, its structure, and its carefully selected division into training, validation and testing splits allow one to rigorously test the design problem as well. Finally, in Sec. 4 we discuss the results and summarize the paper.
2. Methods
Before discussing the particular network and architecture, we define the data and the functions of the folding and design problems. Specifically, assume that S ∈ 𝒮 is a 20 × n matrix that represents a protein sequence of n amino acids. Let S⁺ ∈ 𝒮⁺ be additional data that is related to the sequence, such as the PSSM and possibly covariance information derived from MSAs. Also, let X ∈ 𝒳 be a 3 × n matrix that represents the protein structure (coordinates). We define the mapping F : 𝒮 × 𝒮⁺ → 𝒳 as the folding mapping. This mapping takes the information in S and S⁺ and maps it into the estimated coordinates X̂ that reveal the structure of the protein. Throughout the paper we write F(S) instead of F(S, S⁺) for brevity. Consider now the opposite mapping from the space 𝒳 to the space 𝒮 × 𝒮⁺. We denote this mapping as F† : 𝒳 → 𝒮 × 𝒮⁺, and it can be thought of as a pseudo-inverse of the mapping F. These mappings can be learnt separately and independently, as has been done so far. However, since F and F† are closely related, it is tempting to jointly learn them, utilizing both the sequence and its attributes, as well as the structure of the protein, in tandem.

We now review the concept of a mimetic deep neural network, that is, a neural network that mimics the physics of the dynamics of the folding process. To this end, a deep network can be thought of as a time discretization of a differential equation (Chen et al., 2019; Ruthotto & Haber, 2019). According to this interpretation, each layer represents the state of the system at some particular pseudo-time. The mimetic properties are first discussed in pseudo-time, namely, how the network propagates from one layer to the other. The second mimetic property considers the spatial domain, meaning, how a particular residue in the protein interacts with another residue. These properties are put together to generate a mimetic deep neural network that imitates molecular dynamics simulations using network architectures that are derived from discretized differential operators in time and space (Ruthotto & Haber, 2019; Eliasof & Treister, 2020). The treatment in both space and time is put together within a network optimization procedure to train the system and yield a network that can solve both the folding and the design problems.

In this subsection we show how to build a mimetic network in time by using reversible dynamics. Reversible systems play a major role in physics for applications that range from Hamiltonian dynamics to wave equations. Broadly speaking, a reversible system is one that can propagate forward in time without information loss and therefore can propagate backwards in time. Simple physical examples are a pendulum or a wave. These systems (in their idealized form) do not change their entropy, and therefore allow for forward or backward integration in time. Typical molecular dynamics is solved using reversible methods (Saitou & Nei, 1987), that is, by integrating Hamiltonian dynamics, and therefore it is natural to explore neural network architectures with similar properties.

To be more specific, given the input for the folding task [S, S⁺] (e.g., the concatenation of the one-hot encoded sequence and the PSSM matrices), we first apply

    Y_0 = q([S, S⁺], θ_e),    (2.1)

where Y_0 contains n_f channels of n-length sequence features, embedded by the transformation q(·, ·), parameterized by the weights θ_e. This layer transforms the input to the latent space of the network.
Here we use a 1D convolution for q, but other transformations may also be suitable.
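For concreteness, below is a minimal sketch of such an embedding layer, assuming the input [S, S⁺] is the 20-channel one-hot sequence stacked with the 20-channel PSSM; the channel counts, tensor layout, and variable names are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the embedding q of equation 2.1: a 1D convolution
# lifting the 40-channel input [S, S+] (20-channel one-hot sequence
# concatenated with a 20-channel PSSM) into an n_f-channel latent space.
n_in, n_f = 40, 128
q = nn.Conv1d(n_in, n_f, kernel_size=1)

S = torch.zeros(1, 20, 350)            # toy one-hot sequence, length n = 350
S_plus = torch.rand(1, 20, 350)        # toy PSSM
Y0 = q(torch.cat([S, S_plus], dim=1))  # latent state Y_0 of shape (1, n_f, n)
```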
The initial state Y_0 and its velocity vector V_0 are then pushed forward by a deep residual neural network. In particular, we consider a network with the following structure:

    V_{j+1} = V_j + h · f(Y_j, θ_j),    (2.2a)
    Y_{j+1} = Y_j + h · g(V_{j+1}, θ_j),    (2.2b)

where j = 0, ..., T is the layer index, h is a parameter that represents a time step size, and θ_j are learnt parameters that characterize the j-th layer. The system in equation 2.2 can be interpreted as a Verlet-type discretization of a dynamical system with learnable forces that are the gradients of some potential function. A particular case of such dynamics is obtained by setting g = Id (the identity transformation), yielding the second order dynamics

    Y_{j+1} = 2Y_j − Y_{j−1} + h² f(Y_j, θ_j).    (2.3)

This scheme is clearly reversible, regardless of the choice of f (which we discuss in Sec. 2.3), since we can express Y_{j−1} as a function of Y_j and Y_{j+1}. The propagation forward (and backward) is not complete without defining the boundary conditions Y_{−1} and Y_{T+1}. Here we arbitrarily choose Y_{−1} = Y_0 and Y_{T+1} = Y_T, that is, initializing the network with zero velocity, i.e., V_0 = 0. An illustration of the dynamics is plotted in Fig. 1.

Figure 1. The architecture of MimNet with graph convolution layers. An embedding layer transforms the input into a latent space, which then propagates through GCN layers, each fed with the outputs of the two previous layers. A graph that represents the protein structure is computed after each layer. The final layer is then projected back to obtain residue coordinates.

Given the final state of the system Y_T, we predict the coordinates X by projecting Y_T onto a 3 dimensional space,

    X̂ = q⁺(Y_T, θ_f),    (2.4)

where X̂ are the predicted coordinates. The transformation q⁺(·, ·) can be realized by a neural network, and we choose it to be a learnable projection matrix of size n_f × 3, such that the final feature maps are projected to 3D coordinates. The layer in equation 2.4 may also contain additional constraints. In particular, we may demand that |X̂_i − X̂_{i−1}| = c, constraining the distance between every two consecutive residues to c = 3.8 Å. We have found that when the data is noisy, implementing this constraint is needed in order to obtain physical results (see Sec. 2.5.3).

In the forward pass described above, the folding problem is solved, where we march from the protein design attributes (as in Sec. 2.1) to its coordinates. In the backward pass, we solve the design problem, where our goal is to predict the sequence given its coordinates. We start the backward pass by embedding the coordinates into the network feature space, i.e.,

    Y_T = (q⁺)*(X, θ_f),    (2.5)

where (q⁺)* is the adjoint of the transformation q⁺. We then march backwards, solving equation 2.3 for Y_{j−1} given Y_j and Y_{j+1}, and finally use the adjoint of q to propagate from Y_0 to the sequence space,

    [Ŝ, Ŝ⁺] = q*(Y_0, θ_e),    (2.6)

where [Ŝ, Ŝ⁺] are the predicted protein design attributes. These forward and backward passes couple the design and the folding tasks together into a single network that, similarly to the physical dynamics, can be integrated (in time) from sequence to coordinates and backwards from coordinates to a sequence.
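The reversibility of equation 2.3 is easy to demonstrate in code. The following minimal sketch uses a toy stand-in for the learned layer f: it propagates forward with the Verlet update and then recovers an earlier state exactly, which is the property that allows training without storing activations.

```python
import torch

def f(Y):
    # toy stand-in for the learned graph layer f(Y_j, theta_j)
    return -torch.tanh(Y)

h = 0.1
Y_prev = torch.randn(128, 350)  # Y_{-1} = Y_0 (zero initial velocity)
Y_curr = Y_prev.clone()
states = [Y_curr]

for _ in range(6):              # forward (folding-direction) Verlet steps
    Y_next = 2 * Y_curr - Y_prev + h ** 2 * f(Y_curr)
    Y_prev, Y_curr = Y_curr, Y_next
    states.append(Y_curr)

# Reversibility: Y_{j-1} = 2 Y_j - Y_{j+1} + h^2 f(Y_j), recovered exactly,
# so no intermediate activations need to be stored during training.
Y_recovered = 2 * states[-2] - states[-1] + h ** 2 * f(states[-2])
assert torch.allclose(Y_recovered, states[-3], atol=1e-5)
```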
Sec. 2.1 considers the propagation of the network from its initial condition (a sequence) to its final one (a 3D structure), and vice versa. The discussion was agnostic to the choice of the function f(·, ·) in equation 2.3 that realizes the network at hand. In this section we review the concept of a graph network and discuss its computation. The idea behind a graph based method is rooted in the physics of the problem. Energy based simulations can be thought of as pairwise interactions on a graph, based on the L₂ distance between the residues. Indeed, as the distance between residues becomes smaller, the interaction between them is stronger. This motivates us to use machine learning techniques that mimic this property. As the dynamical system evolves, the interaction between pairs of close residues is significantly larger compared to far ones.

One of the most successful techniques for image and speech processing is Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; Goodfellow et al., 2016). The method relies on the structured grids on which sequences and images are defined. That is, every element has neighbouring elements in a structured manner. In recent years, similar ideas were extended to more complex geometries and manifolds, which can be naturally represented by a graph (Ranjan et al., 2018; Hanocka et al., 2019; Wang et al., 2019). The main idea is to replace the structured convolution with a graph based convolution. That is, rather than convolving each location with its near neighbours defined by the sequence, define the distance between each location based on the graph node or edge features, and then convolve the residues that are close on the graph.

To be more specific, we let Y_j be the state at the j-th layer. Then, we define a graph convolution block as follows:

    f(Y_j) = −C*(θ_j, σ(C(θ_j, Y_j))),    (2.7)

where C(θ_j, ·) is the graph convolution operator with its learned associated weights θ_j. This operator spatially resembles a discrete differential operator, e.g., a mass term, a graph Laplacian, or an edge gradient (Eliasof & Treister, 2020). σ(·) is the ReLU activation function. The operator C* is the adjoint operator of C (like a transposed convolution), applied using the same weights θ_j. This way, assuming that σ is a monotonically non-decreasing function that either zeroes its input or preserves its sign, we get a symmetric and positive semi-definite operator. We use the negative sign in front of the layer such that the operator f(·) is negative, which is important if we are to generate stable dynamics; see (Ruthotto & Haber, 2019) for details and analysis.

Many graph based networks employ a graph convolution with fixed connectivity (Ranjan et al., 2018; Bouritsas et al., 2019). This is reasonable if the final topology is known. However, for protein folding we start with an unknown structure which evolves (is learnt) from the data. Therefore, rather than using a fixed graph for the network, we let the graph evolve throughout the network. We thus recompute a weighted graph Laplacian at each layer, or, for computational saving, every T U-net layers. To this end, we compute the weighted distance matrix between every two residues,

    W_j = exp(−α⁻¹ D(Y_j)),    (2.8)

where α is a scaling parameter (we set α = 10) and D is the squared L₂ distance between every two residues,

    D(Y) = diag(YᵀY) 𝟙ᵀ + 𝟙 diag(YᵀY)ᵀ − 2 YᵀY.    (2.9)

The vector 𝟙 is a vector of ones of appropriate size. Using the weight matrix we define the graph Laplacian as

    L_j = diag(W_j 𝟙) − W_j.    (2.10)

The approach of dynamically updating the connectivity of the graph was also suggested in (Wang et al., 2019).
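As a sketch of equations 2.8–2.10 (under the reconstruction above, where the Laplacian is built from the weight matrix W_j), the dense graph can be recomputed from the current features in a few lines; the function name and shapes are illustrative assumptions.

```python
import torch

def graph_laplacian(Y, alpha=10.0):
    # Y: (n_f, n) feature matrix; columns correspond to residues
    sq = (Y * Y).sum(dim=0, keepdim=True)                 # squared norms, (1, n)
    D = (sq + sq.t() - 2.0 * Y.t() @ Y).clamp(min=0.0)    # equation 2.9
    W = torch.exp(-D / alpha)                             # equation 2.8
    return torch.diag(W.sum(dim=1)) - W                   # equation 2.10

Y = torch.randn(128, 350)
L = graph_laplacian(Y)   # dense (350, 350) weighted graph Laplacian
```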
Our strategy differs in that, instead of picking k nearest neighbors to be equally weighted regardless of their distances, we use a weighted and fully connected graph. That is, our graph Laplacian is a dense matrix. This is reasonable since the typical size of a protein in our data is less than 1,000 residues, and, similarly to various physical applications, every two residues interact according to their distance (Nedwidek & Hecht, 1997). Further, the weighting of the edges makes the graph Laplacian continuously differentiable with respect to the network, which aids its training.

The limitation of graph based networks, similar to other convolution methods, is that they generate strong local interactions only. Hence, spatially-distant connections may suffer from weak interactions (due to small weights), and information will spread slowly within the network, requiring more layers to compensate. An elegant way to have long interactions and pass information between far-away parts of the graph is to consider a multiscale framework.

To this end, instead of a standard graph convolution C in equation 2.7, we use a multiscale mechanism that is similar to a U-net (Ronneberger et al., 2015; Shah et al., 2018), where coarse scale approximations of the protein are composed. In particular, in the multiscale version of equation 2.7 we choose C to be the encoder part of a U-net, and the operator C* is the transposed operation, which has a decoder structure (parameterized by the same weights). Together, they form a symmetric graph U-net. The reversibility of the network remains, since equation 2.3 is reversible for every f, and in particular for our symmetric U-net.

Our graph U-net is comprised of n_Levels graph scales. At each level we perform a GCN block, where we use both the graph and sequence neighbors in our convolutions:

    Y_{j+1} = ω_j Y_j + σ(N(K_j(Y_j + Y_j L_j))),    (2.11)

where K_j is a 1D convolution with a kernel of size 9, connecting nodes along the protein sequence, and L_j is the graph Laplacian operator from equation 2.10. N is the instance normalization layer, and σ is the ReLU activation. ω_j equals 1 when graph coarsening is not applied, and 0 otherwise. On the coarsest level of the U-net, we perform 2 convolution steps as in equation 2.11. At each level, the graph differential operator L_j is re-computed on the coarse graph, allowing for simple and inexpensive computations between scales. In addition, since the protein has a simple linear underlying chain, we use linear coarsening, implemented by simple average 1D pooling; this is illustrated in Fig. 3. In the decoder part of the U-net we apply the transposed operators, and to refine our graph (unpooling) we use a linear interpolation along the chain. To propagate information between matching levels we add long skip-connections after each convolution, for a stable training scheme (see Fig. 2).

Combining our building blocks together, we now define our bi-directional mimetic architecture, called MimNet. The network consists of three main components: the opening embedding layer, stacked graph U-net modules, and a closing embedding layer. At the start and end of MimNet we use the embedding layers of equation 2.1 and equation 2.4, both of which are implemented using a simple 1 × 1 convolution of appropriate size (the input [S, S⁺] has 40 channels). At the core of our network we employ a series of T graph U-net modules. Each graph U-net is defined according to Sec. 2.3, and all of them are of identical dimensions. That is, each has n_f channels of equal dimensions on the finest level.
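A minimal sketch of one GCN step of equation 2.11 follows; the module layout (batch dimension, padding, keyword names) is an assumption made for illustration, and the final line shows the linear coarsening between U-net levels as plain average pooling.

```python
import torch
import torch.nn as nn

class GCNStep(nn.Module):
    # one step of equation 2.11: Y <- omega * Y + ReLU(N(K(Y + Y L)))
    def __init__(self, n_f=128):
        super().__init__()
        self.K = nn.Conv1d(n_f, n_f, kernel_size=9, padding=4)  # sequence conv
        self.N = nn.InstanceNorm1d(n_f)

    def forward(self, Y, L, omega=1.0):
        # Y: (batch, n_f, n); L: (n, n) dense graph Laplacian
        Z = torch.relu(self.N(self.K(Y + Y @ L)))
        return omega * Y + Z            # omega = 0 when coarsening is applied

step = GCNStep()
Y = torch.randn(1, 128, 350)
L = torch.eye(350)                      # placeholder Laplacian for a shape check
Y_next = step(Y, L)
Y_coarse = nn.functional.avg_pool1d(Y_next, kernel_size=2)  # linear coarsening
```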
Our MimNet allows us to build the physics of the problem into the neural network. This needs to be followed by a carefully designed training process. In particular, care needs to be taken when choosing the appropriate problem to minimize, and the appropriate loss functions and regularization. We now discuss these choices for our training.
2.5.1. The Optimization Problem
Since we have a bidirectional network, we use both directions to train it. We define the objective function

    J(θ) = (1/N) Σ_j ℓ_fold(F(S_j, θ), X_j) + (1/N) Σ_j ℓ_design(F†(X_j, θ), S_j) + β R(θ).    (2.12)

Here θ are all the parameters of the network, F is the forward network from sequence to coordinates, and F† is the backward mapping from coordinates to sequence. The loss functions ℓ_fold and ℓ_design are chosen to measure the discrepancy between the estimated and true coordinates, and between the predicted and true sequence design, respectively. The choice of these functions is discussed next. Finally, R(·) is a regularization term that ensures the stability of the network and is described below.
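Schematically, one optimization step over the objective in equation 2.12 could look as follows; `fold`, `design`, `l_fold`, `l_design` and `reg` are hypothetical stand-ins for F, F†, the two losses and R(θ), not names from the authors' code.

```python
def training_step(batch, fold, design, l_fold, l_design, reg, beta, optimizer):
    # batch: list of (S, X) pairs of sequence attributes and coordinates
    loss = 0.0
    for S, X in batch:
        loss = loss + l_fold(fold(S), X)        # folding direction
        loss = loss + l_design(design(X), S)    # design direction, same weights
    loss = loss / len(batch) + beta * reg()     # equation 2.12
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```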
2.5.2. Loss Function for the Design Problem

The loss function for the design problem is rather straightforward. At every residue location we attempt to recover the individual residue out of the 20 possible ones. Denoting

    Ŝ = F†(X, θ),    (2.13)

we interpret Ŝ as a matrix whose ij-th entry is the probability of the j-th residue to be of residue type i. Therefore, it is straightforward to use the cross entropy as a metric, setting

    ℓ_design = −Σ S ⊙ log(F†(X, θ)).    (2.14)

Note that the network output is exactly the definition of the PSSM; therefore, if one assumes that the PSSM of the sequence is available, then it is possible to use the KL divergence between the computed and observed PSSMs as a distance metric. This is because the PSSM represents the true probability of the j-th residue in the sequence to be of type i. While this strategy is always possible during training, using it for inference can be difficult. Indeed, no mapping known to us is given from a PSSM to a particular residue. However, having a PSSM as an answer may allow for greater flexibility when designing a protein, since there is not necessarily a unique answer to the design process, and the PSSM represents this ambiguity. One can always use the maximum probability of the PSSM to design a particular protein. Although there is no guarantee, this often leads to better structural stability (Bershtein et al., 2008; Chandler et al., 2020).
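A sketch of the design loss, assuming the network output has already been normalized into per-residue probabilities of shape (n, 20); the cross-entropy branch corresponds to equation 2.14, and the optional branch is the KL divergence against an observed PSSM.

```python
import torch

def design_loss(pred_probs, S_onehot, pssm=None, eps=1e-8):
    # pred_probs, S_onehot, pssm: (n, 20) rows of per-residue probabilities
    if pssm is None:
        # cross entropy against the one-hot sequence (equation 2.14)
        return -(S_onehot * torch.log(pred_probs + eps)).sum(dim=1).mean()
    # KL(pssm || prediction): the network output is treated as a predicted PSSM
    return (pssm * (torch.log(pssm + eps)
                    - torch.log(pred_probs + eps))).sum(dim=1).mean()
```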
Figure 2. A graph U-net. GCN is defined in equation 2.11. Pool and unpool denote graph coarsening and refinement, respectively. Skip connection denotes a summation of the respective feature maps.

Figure 3. Coarsening a protein. The features of each two residues in the chain (left) are averaged together and a new coarse graph is computed (right) for the coarse protein. The graph Laplacian is computed directly from the rediscretized coarse protein.

2.5.3. Loss Function for the Folding Problem

We turn our attention to the loss function for the folding problem. Clearly, one cannot simply compare the coordinates obtained by the network, F(S), to the observed coordinates of the sequence, as they are invariant with respect to rotation and translation. Similarly to (AlQuraishi, 2019a), one can instead compare the distance matrices obtained from the coordinates. Let D_s(X) = sqrt(D(X)) be the pairwise distance matrix, with D as in equation 2.9. The distance matrix is invariant to rotations and translations. Thus, it is possible to compare the distances obtained from the true coordinates, D_s(X), to the distances of the predicted coordinates, D_s(F(S)), by their dRMSD in equation 2.15.

In an average protein, residue distances typically range from a few Angstroms to dozens of Angstroms. Minimizing the L₂ distance is therefore focused on the large scale structure of the protein and can neglect the small scale structures, as they contribute remarkably less to the loss. Also, it is well known that the distance between 1-hop (immediate) neighboring residues is approximately 3.8 Å. This motivated previous works to use a threshold value and ignore distances larger than that threshold. For example, in AlphaFold (Senior et al., 2020) a distance of 22 Å was used as the cutoff. Here we use a slightly more conservative value of 7 residues, which translates into 7 × 3.8 Å = 26.6 Å.

Another aspect of the particular PDB data is that it can be noisy. As a result, some of the distances between residues are not physical. In particular, residues that are ℓ positions apart cannot have a distance larger than ℓ × 3.8 Å. Unfortunately, the data presents many such pairs. Additionally, some of the residues are missing information (whether their sequence or coordinates). We therefore do not consider such entries of the data, masking them during training and inference. To summarize, the loss can be expressed as

    ℓ_fold = sqrt(1/n_M) ‖ M ⊙ (D_s(F(S, θ)) − D_s(X)) ‖_F,    (2.15)

where M is a masking matrix that masks the parts of the data that are non-physical or missing, and n_M is the number of non-zeros in the matrix.
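A sketch of the masked dRMSD loss of equation 2.15; the tensor layout (coordinates as a 3 × n matrix, a binary n × n mask) is an assumption made for illustration.

```python
import torch

def pairwise_dist(X, eps=1e-8):
    # X: (3, n) coordinates -> (n, n) distance matrix D_s(X)
    diff = X.unsqueeze(2) - X.unsqueeze(1)          # (3, n, n)
    return (diff.pow(2).sum(dim=0) + eps).sqrt()

def fold_loss(X_pred, X_true, mask):
    # mask: (n, n) binary matrix zeroing missing or non-physical pairs
    diff = mask * (pairwise_dist(X_pred) - pairwise_dist(X_true))
    n_mask = mask.sum().clamp(min=1.0)
    return torch.sqrt((diff ** 2).sum() / n_mask)   # masked dRMSD
```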
2.5.4. Regularization

The last component in our optimization scheme is the regularization term R(θ). We write θ = [θ_1, ..., θ_L], where θ_j are the parameters used for the j-th layer. Stability of the dynamical system is obtained if the total variation of its parameters is small (Ruthotto & Haber, 2019). Thus we choose the following regularization function:

    R(θ) = Σ_j ‖θ_{j+1} − θ_j‖.    (2.16)

Note that we do not use the standard Tikhonov regularization (so-called weight decay) on the weights, as it does not guarantee smoothness in time, which is crucial for reversible networks and integration in time (Celledoni et al., 2020).
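A sketch of the smoothness regularizer of equation 2.16, assuming the per-layer weights are collected in a list of equally-shaped tensors:

```python
import torch

def smoothness_reg(thetas):
    # total variation in (pseudo-)time of the per-layer weights, equation 2.16
    return sum(torch.norm(t1 - t0) for t0, t1 in zip(thetas, thetas[1:]))
```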
3. Numerical Experiments
We verify our method by performing three sets of experiments: protein folding, protein design, and an ablation study to quantify the contribution of the reversible learning.
For the experiments we used the data set supplied by ProteinNet (AlQuraishi, 2019b). The data contains proteins processed from the PDB, and is organized to hold training, validation and testing splits specifically for CASP 7-12. For instance, one of the CASP ProteinNet splits contains 42,338 training proteins that are less than 1,000 residues long, 224 proteins for validation, and 81 test proteins. The data set was used in (AlQuraishi, 2019a) and more recently in (Drori et al., 2019). We use the same thinning of the data as reported in (AlQuraishi, 2019a). While the first order statistics are available, the second order statistics cannot be downloaded freely and require complex and expensive pre-processing. We therefore use only first order statistics in this work and compare to other recent methods that use identical information. Note that the recent successes of AlphaFold2, as well as of other methods in CASP 14, were achieved using second order statistics. Our network is generic and can use both first order statistics (in terms of node attributes) and second order statistics (in terms of edge attributes). Comparing to better and more recent results that use second order statistics requires additional data that are proprietary to different organizations, and is therefore not done in this work. We believe that the ProteinNet data set constitutes a great leap forward, as it allows scientists to compare methods on the same footing, similarly to the impact of ImageNet (Deng et al., 2009) on the computer vision community.

Network and optimization settings
Throughout our experiments, we use MimNet as described in Sec. 2.4, with n_f = 128, n_Levels = 3 and T = 6. We use the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 1. Our network is trained for 100 epochs, and we multiply the learning rate by a factor of 0.9 every 2 epochs. Our experiments are carried out on an Nvidia Titan RTX. Our code is implemented in PyTorch (Paszke et al., 2019).
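In PyTorch, the stated schedule (Adam, initial learning rate 1e-4, multiplied by 0.9 every 2 epochs) can be reproduced as below; `model` is a placeholder module standing in for MimNet, not the authors' training script.

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for MimNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)

for epoch in range(100):
    # ... one pass over the training set with batch size 1 ...
    scheduler.step()            # multiplies the learning rate by 0.9 every 2 epochs
```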
Comparisons

To compare our results we use two recent works. First, for protein folding we compare to the work of (AlQuraishi, 2019a). While that work does not hold the state of the art results, it is the only recent work known to us that uses first order information only. Furthermore, it uses the ProteinNet data, which allows us to directly compare our results to the published ones. Second, for protein design, we compare the work of (Strokach et al., 2020) to ours. The network proposed in that paper, named ProteinSolver, uses a graph neural network for the solution of the design problem. ProteinSolver obtains state of the art results by using a sophisticated graph representation of the protein and its features. The ProteinSolver work shows a remarkable improvement over previous design works, and we therefore find it to be a good benchmark.
As discussed in Sec. 2.5.2, we measure the KL divergence between the predicted and ground-truth PSSMs. This metric allows for soft assignments of the sequence design, which is more flexible and natural than predicting one-hot labels, due to the multiple design possibilities given a structure. Since the data already includes such probabilities in terms of the PSSM, it is only natural to use such a loss. Our results, reported in Tab. 1, show a major improvement over the recent work ProteinSolver. The results for ProteinSolver were obtained by evaluating the published pre-trained model on the ProteinNet data set for CASP 7-12. We stress that ProteinSolver was pre-trained on larger data sourced from the PDB, which is the same source as ProteinNet.

Our folding experiment uses first order data (the PSSM and a one-hot encoding matrix of the sequence), obtained from the ProteinNet data set. We therefore compare to the recent work of (AlQuraishi, 2019a), which uses identical data. Our experiments suggest that the use of GCNs can significantly improve the accuracy of protein folding, as we report in Tab. 2.
Table 1. KL divergence comparison of recent protein design methods. The average of FM (novel folds) and TBM (known folds) targets is shown.

Model                                     CASP7   CASP8   CASP9   CASP10   CASP11   CASP12
ProteinSolver (Strokach et al., 2020)      1.73    1.61    1.63     1.5      1.67     1.62
MimNet (ours)                              0.99    0.88    0.87     0.83     0.96     0.95
Table 2. dRMSD [Å] comparison of recent protein folding methods. The average of FM (novel folds) and TBM (known folds) targets is shown.

Model                       CASP7   CASP8   CASP9   CASP10   CASP11   CASP12
RGN (AlQuraishi, 2019a)      7.45    6.60    7.60     8.45     7.95     8.80
MimNet (ours)                4.97    4.88    5.14     5.31     5.80     5.37
Table 3. Comparison between Coordinates to Design (C→D) and reversible learning (C↔D) on CASP 7-12. Results are reported in KL divergence.

Dataset    C→D    C↔D

One of the key contributions of our work is the introduction of reversible networks for jointly learning protein folding and design, which has not been done until now. We therefore examine the significance of the reversibility scheme. That is, we compare the behavior of our network when trained in both directions versus the case of optimizing it in only one direction (from sequence to coordinates, or vice versa). Our results are summarized in Tabs. 3 and 4, suggesting that coupling the learning of the folding and design problems can lead to better results in both folding and design; that is, one can obtain a better protein design if the network can also be used to fold, and vice versa. We believe that this is due to the two problems being tightly coupled, as well as the effective doubling of the data processed by the network. Similar behaviour is obtained when using reversible architectures to solve problems such as normalizing flows (Yang et al., 2019).
4. Conclusion
In this work we have introduced a novel approach that unifies the treatment of protein folding and protein design. Our methodology is based on a combination of two recently developed ideas: reversible networks inspired by Hamiltonian dynamics, and multiscale graph convolutional networks. Using first order statistics only, we have obtained improved results for the folding task. However, more importantly, we have shown a significant improvement on the protein design task, achieving a KL divergence loss that is less than half of that of a recently published work. We attribute this success to the use of recent protein folding architectures, as well as to the extensive data sets that allow for better training of the proposed architecture.

Table 4. Comparison between Design to Coordinates (D→C) and reversible learning (C↔D) on CASP 7-12. Results are reported in dRMSD [Å].

Dataset    D→C    C↔D

Acknowledgements and Funding
The work was funded by Genomica.ai and MITACS. ME is supported by the Kreitman high-tech scholarship.
References
AlQuraishi, M. End-to-end differentiable learning of protein structure. Cell Systems, 8(4):292–301, 2019a.

AlQuraishi, M. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinformatics, 20, 2019b.

Basanta, B., Bick, M. J., Bera, A. K., Norn, C., Chow, C. M., Carter, L. P., Goreshnik, I., Dimaio, F., and Baker, D. An enumerative algorithm for de novo design of proteins with diverse pocket structures. Proceedings of the National Academy of Sciences, 117(36):22135–22145, 2020.

Bershtein, S., Goldin, K., and Tawfik, D. S. Intense neutral drifts yield robust and evolvable consensus proteins. Journal of Molecular Biology, 379(5):1029–1044, 2008.

Bouritsas, G., Bokhnyak, S., Ploumpis, S., Bronstein, M., and Zafeiriou, S. Neural 3D morphable models: Spiral convolutional networks for 3D shape representation learning and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7213–7222, 2019.

Celledoni, E., Ehrhardt, M. J., Etmann, C., McLachlan, R. I., Owren, B., Schönlieb, C.-B., and Sherry, F. Structure preserving deep learning. arXiv preprint arXiv:2006.03364, 2020.

Chandler, P. G., Broendum, S. S., Riley, B. T., Spence, M. A., Jackson, C. J., McGowan, S., and Buckle, A. M. Strategies for Increasing Protein Stability, pp. 163–181. Springer US, New York, NY, 2020.

Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Drori, I., Thaker, D., Srivatsa, A., Jeong, D., Wang, Y., Nan, L., Wu, F., Leggas, D., Lei, J., Lu, W., Fu, W., Gao, Y., Karri, S., Kannan, A., Moretti, A., AlQuraishi, M., Keasar, C., and Pe'er, I. Accurate protein structure prediction by embeddings and deep learning representations, 2019.

Eliasof, M. and Treister, E. DiffGCN: Graph convolutional networks via differential operators and algebraic multigrid pooling, 2020.

Finkelstein, A. and Galzitskaya, O. Physics of protein folding. Physics of Life Reviews, 1(1):23–56, 2004.

Gao, H. and Ji, S. Graph U-Nets. In International Conference on Machine Learning, pp. 2083–2092. PMLR, 2019.

Gao, W., Mahajan, S. P., Sulam, J., and Gray, J. J. Deep learning in protein structural modeling and design. Patterns, 1(9):100142, 2020.

Gao, Y., Wang, S., Deng, M., and Xu, J. RaptorX-Angle: real-value prediction of protein backbone dihedral angles through a hybrid method of clustering and deep learning. BMC Bioinformatics, 19(4), 2018.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Hanocka, R., Hertz, A., Fish, N., Giryes, R., Fleishman, S., and Cohen-Or, D. MeshCNN: a network with an edge. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Li, H., Hou, J., and Adhikari, B. Deep learning methods for protein torsion angle prediction. BMC Bioinformatics, 18:417–426, 2017.

Liu, J., Wu, T., Guo, Z., Hou, J., and Cheng, J. Improving protein tertiary structure prediction by deep learning and distance prediction in CASP14. bioRxiv, 2021.

Nedwidek, M. and Hecht, M. Minimized protein structures: a little goes a long way. Proceedings of the National Academy of Sciences, 94(19), 1997.

Ołdziej, S., Czaplewski, C., Liwo, A., Chinchio, M., Nanias, M., Vila, J. A., Khalili, M., Arnautova, Y. A., Jagielska, A., Makowski, M., Schafroth, H. D., Kaźmierkiewicz, R., Ripoll, D. R., Pillardy, J., Saunders, J. A., Kang, Y. K., Gibson, K. D., and Scheraga, H. A. Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: Assessment in two blind tests. Proceedings of the National Academy of Sciences, 102(21):7547–7552, 2005.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Ranjan, A., Bolkart, T., Sanyal, S., and Black, M. J. Generating 3D faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 704–720, 2018.

Richardson, J. S. and Richardson, D. C. The de novo design of protein structures. Trends in Biochemical Sciences, 14(7):304–309, 1989.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation, 2015.

Rose, G. D., Fleming, P., Banavar, J., and Maritan, A. A backbone-based theory of protein folding. Proceedings of the National Academy of Sciences, 103(45), 2006.

Ruthotto, L. and Haber, E. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, pp. 1–13, 2019.

Saitou, N. and Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4:406–425, 1987.

Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A. W., Bridgland, A., et al. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, 2020.

Shah, S., Ghosh, P., Davis, L. S., and Goldstein, T. Stacked U-Nets: A no-frills approach to natural image segmentation, 2018.

Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A., and Kim, P. M. Fast and flexible protein design using deep graph neural networks. Cell Systems, 11(4):402–411.e4, 2020.

Torrisi, M., Pollastri, G., and Le, Q. Deep learning methods in protein structure prediction. Computational and Structural Biotechnology Journal, 18:1301–1310, 2020.

Vassura, M., Margara, L., Di Lena, P., Medri, F., Fariselli, P., and Casadio, R. Reconstruction of 3D structures from protein contact maps. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(3):357–367, 2008.

Wang, J., Cao, H., Zhang, J. Z. H., and Qi, Y. Computational protein design with deep learning neural networks. Nature, 8(6349), 2018.

Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., and Solomon, J. M. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.

Xu, J. Distance-based protein folding powered by deep learning. Proceedings of the National Academy of Sciences, 116(34):16856–16865, 2019.

Xu, Y., Verma, D., Sheridan, R. P., Liaw, A., Ma, J., Marshall, N. M., McIntosh, J., Sherer, E. C., Svetnik, V., and Johnston, J. M. Deep dive into machine learning models for protein engineering. Journal of Chemical Information and Modeling, 60(6):2773–2790, 2020.

Yang, G., Huang, X., Hao, Z., Liu, M.-Y., Belongie, S., and Hariharan, B. PointFlow: 3D point cloud generation with continuous normalizing flows, 2019.

Yennamalli, R. M. Protein design. In Ranganathan, S., Gribskov, M., Nakai, K., and Schönbach, C. (eds.),