Deep Learning in Protein Structural Modeling and Design
Wenhao Gao,† Sai Pooja Mahajan,† Jeremias Sulam,‡ and Jeffrey J. Gray∗,†

†Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD
‡Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD
E-mail: [email protected]
Abstract
Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understanding and engineering biological systems at the molecular level. In this review, we summarize the recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches that use deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is directed to help both computational biologists to gain familiarity with the deep learning methods applied in protein modeling, and computer scientists to gain perspective on the biologically meaningful problems that may benefit from deep learning techniques.

Introduction

Proteins are linear polymers that fold into specific conformations to function. The incredible variety of three-dimensional structures, determined by the combination and order in which the twenty amino acids thread the protein polymer chain (the sequence of the protein), enables the sophisticated functionality of proteins responsible for most biological activities. Hence, obtaining the structures of proteins is of paramount importance both for understanding the fundamental biology of health and disease and for developing therapeutic molecules. While protein structure is primarily determined by sophisticated experimental techniques such as X-ray crystallography, NMR spectroscopy and, increasingly, cryo-electron microscopy, computational structure prediction from the genetically encoded amino acid sequence of a protein has been employed as an alternative when experimental approaches are limited. Computational methods have been used to predict the structure of proteins, illustrate the mechanism of biological processes, and determine the properties of proteins. Further, all naturally occurring proteins are the result of an evolutionary process of random variants arising under various selective pressures. Through this process, nature has explored only a small subset of the theoretically possible protein sequence space. To explore a broader sequence and structural space that potentially contains proteins with enhanced or novel properties, techniques such as de novo design can be employed to generate new biological molecules that have the potential to tackle many outstanding challenges in biomedicine and biotechnology.
While the application of machine learning and more general statistical methods in protein modeling can be traced back decades, recent advances in machine learning, especially in deep learning (DL) related techniques, have opened up new avenues in many areas of protein modeling. DL is a set of machine learning techniques based on stacked neural network layers that parameterize functions in terms of compositions of affine transformations and non-linear activation functions. Their ability to extract domain-specific features that are adaptively learned from data for a particular task often enables them to surpass the performance of more traditional methods. DL has made dramatic impacts on digital applications like image classification, speech recognition, and game playing. Success in these areas has inspired an increasing interest in more complex data types, including protein structures. In the most recent Critical Assessment of Structure Prediction (CASP13, held in 2018), a biennial community experiment to determine the state of the art in protein structure prediction, DL-based methods accomplished a striking improvement in model accuracy (see Figure 1), especially in the "difficult" target category where comparative modeling (starting with a known, related structure) is ineffective. The CASP13 results show that the complex mapping from amino acid sequence to three-dimensional protein structure can be successfully learned by a neural network and generalized to unseen cases. Concurrently, for the protein design problem, progress in the field of deep generative models has spawned a range of promising approaches.

In this review, we summarize the recent progress in applying DL techniques to the problem of protein modeling and discuss the potential pros and cons. We limit our scope to protein structure and function prediction, protein design with DL (see Figure 2), and a wide array of popular frameworks used in these applications. We discuss the importance of protein representation and, for the first time, summarize the approaches to protein design based on DL. We also emphasize the central importance of protein structure, following the sequence → structure → function paradigm, and argue that approaches based on structures may be most fruitful. We refer the reader to other review papers for more information on applications of DL in biology and medicine, bioinformatics, structural biology, folding and dynamics, antibody modeling, and structural annotation and prediction of proteins.

Figure 1: (a) Model accuracy (GDT TS) compares Cα atoms in a model to those in the corresponding experimental structure (higher numbers are more accurate). Target difficulty is based on sequence and structure similarity to other proteins with known experimental structures (see Kryshtafovych et al. for details). Figure from Kryshtafovych et al. (2019). (b) Number of FM+FM/TBM (FM: free modeling, TBM: template-based modeling) domains (out of 43) solved to a TM-score threshold for all groups in CASP13. AlphaFold ranked first among them, showing that the progress is mainly due to the development of DL-based methods. Figure from Senior et al. (2020).

Figure 2: Schematic comparison of three major tasks in protein modeling: function prediction, structure prediction, and protein design. In function prediction, the sequence and/or the structure is known and the functionality is needed as output of a neural net. In structure prediction, sequence is the known input and structure is the unknown output.
Protein design starts from the desired functionality or, a step further, a structure that can perform this functionality. The desired output is a sequence that can fold into that structure or that has such functionality.
Protein structure prediction and design
The prediction of protein three-dimensional structure from amino acid sequence has been a grand challenge in computational biophysics for decades.
Folding of peptide chains is a fundamental concept in biophysics, and atomic-level structures of proteins and complexes are often the starting point to understand their function and to modulate or engineer them. Thanks to recent advances in next-generation sequencing technology, there are now over 180 million protein sequences recorded in the UniProt dataset. In contrast, only 158,000 experimentally determined structures are available in the Protein Data Bank (PDB). Thus, computational structure prediction is a critical problem of both practical and theoretical interest.

More recently, the advances in structure prediction have led to an increasing interest in the protein design problem. In design, the objective is to obtain a novel protein sequence that will fold into a desired structure or perform a specific function, such as catalysis. Naturally occurring proteins represent only an infinitesimal subset of all possible amino acid sequences, selected by the evolutionary process to perform specific biological functions. Proteins with more robustness (higher thermal stability, resistance to degradation) or enhanced properties (faster catalysis, tighter binding) might lie in the space that has not been explored by nature but is potentially accessible by de novo design. The current approach for computational de novo design is based on physical and evolutionary principles and requires significant domain expertise. Some successful examples include novel folds, enzymes, vaccines, novel protein assemblies, ligand-binding proteins, and membrane proteins.

The current methodology for computational protein structure prediction is largely based on Anfinsen's thermodynamic hypothesis, which states that the native structure of a protein must be the one with the lowest free energy, governed by the energy landscape of all possible conformations associated with its sequence. Finding the lowest-energy state is challenging because of the immense space of possible conformations available to a protein, also known as the "sampling problem" or Levinthal's paradox. Furthermore, the approach requires accurate free-energy functions to describe the protein energy landscape and rank different conformations based on their energy, referred to as the "scoring problem". In light of these challenges, current computational techniques rely heavily on multi-scale approaches. Low-resolution, coarse-grained energy functions are employed to capture large-scale conformational sampling such as hydrophobic burial and the formation of local secondary structural elements. Higher-resolution energy functions are employed to explicitly model finer details such as amino acid side-chain packing, hydrogen bonding, and salt bridges.

Protein design problems, sometimes known as the inverse of structure prediction problems, require a similar toolbox. Instead of sampling the conformational space, a protein design protocol samples the sequence space that folds into the desired topology. Current efforts can be broadly divided into two classes: modifying an existing protein with known sequence and properties, or generating novel proteins with sequences unrelated to those found in nature. The former protocol evolves an existing protein's amino acid sequence (and, as a result, its structure and properties); the latter is called de novo protein design. Despite significant progress in the field of computational protein structure prediction and design over the last several decades, accurate structure prediction and reliable design both remain challenging.
Conventional approaches rely heavily on the accuracy of the energy functions that describe protein physics and on the efficiency of the sampling algorithms that explore the immense protein sequence and structure space.
Deep learning architectures
In conventional computational approaches, predictions from data are made by means of physical equations and modeling. Machine learning puts forward a different paradigm in which algorithms automatically infer – or learn – a relationship between inputs and outputs from a set of hypotheses. Consider a collection of N training samples comprising features x in an input space X (e.g., amino acid sequences) and corresponding labels y in some output space Y (e.g., residue pairwise distances), where $\{x_i, y_i\}_{i=1}^N$ are sampled independently and identically distributed from some joint distribution P. Additionally, consider a function $f : X \to Y$ in some function class H, and a loss function $\ell : Y \times Y \to \mathbb{R}$ that measures how much f(x) deviates from the corresponding label y. The goal of supervised learning is to find a function $f \in H$ which minimizes the expected loss, $\mathbb{E}[\ell(f(x), y)]$, for (x, y) sampled from P. Since one does not have access to the true distribution but rather to N samples from it, the popular Empirical Risk Minimization (ERM) approach seeks to minimize the loss over the training samples instead. In neural network models, in particular, the function class is parameterized by a collection of weights. Denoting these parameters collectively by θ, ERM boils down to an optimization problem of the form

$$\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \ell\left(f_{\theta}(x_i),\, y_i\right). \tag{1}$$

The choice of the network determines how the hypothesis class is parameterized. Deep neural networks typically implement a non-linear function as the composition of affine maps, $\mathcal{W}_l : \mathbb{R}^{n_l} \to \mathbb{R}^{n_{l+1}}$, where $\mathcal{W}_l(x) = W_l x + b_l$, and other non-linear activation functions, σ(·). Rectified Linear Units (ReLU) and max-pooling are some of the most popular non-linear transformations applied in practice. The architecture of the model determines how these functions are composed, the most popular option being their sequential composition

$$f(x) = \mathcal{W}_L\, \sigma\!\left(\mathcal{W}_{L-1}\, \sigma\!\left(\mathcal{W}_{L-2}\, \sigma\!\left(\cdots\, \mathcal{W}_2\, \sigma\!\left(\mathcal{W}_1 x\right)\right)\right)\right)$$

for a network with L layers. Computing f(x) is typically referred to as the forward pass.

We will not dwell on the details of the optimization problem in Eq. (1), which is typically carried out via stochastic gradient descent algorithms or variations thereof, efficiently implemented via back-propagation. Rather, in this section we summarize some of the most popular models widely used in protein structural modeling. High-level diagrams of the major architectures are shown in Figure 3.
Figure 3: Schematic representation of several architectures used in protein modeling and design. (a) CNNs are widely used in structure prediction. (b) RNNs learn in an auto-regressive way and can be used for sequence generation. (c) The VAE can be jointly trained on proteins and their properties to construct a latent space correlated with properties. (d) In the GAN setting, a mapping from a prior distribution to the design space can be obtained via adversarial training.
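To make Eq. (1) and the forward pass concrete, the sketch below trains a small feed-forward network by stochastic gradient descent on synthetic data; the architecture, dimensions, and hyperparameters are illustrative assumptions rather than any model discussed in this review.

```python
# Minimal ERM sketch: a feed-forward network f_theta fit by SGD (Eq. 1).
import torch
import torch.nn as nn

# Toy dataset: N samples of 20-dimensional features with scalar labels.
N, d_in = 256, 20
x = torch.randn(N, d_in)
y = torch.randn(N, 1)

# Sequential composition of affine maps and ReLU non-linearities,
# i.e., the forward pass f(x) = W_L σ(... σ(W_1 x)).
f = nn.Sequential(
    nn.Linear(d_in, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

loss_fn = nn.MSELoss()                       # the loss ℓ(f(x), y)
opt = torch.optim.SGD(f.parameters(), lr=1e-2)

for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(f(x), y)                  # empirical risk over the N samples
    loss.backward()                          # back-propagation
    opt.step()                               # gradient step on θ
```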
Convolutional network architectures are most commonly applied to image analysis and other problems where shift-invariance or co-variance is needed. Inspired by the fact that an object in an image can be shifted and still be the same object, CNNs adopt convolutional kernels for the layer-wise affine transformation to capture this translational invariance. A 2D convolutional kernel w applied to 2D image data x can be defined as:

$$S(i, j) = (x * w)(i, j) = \sum_{m}\sum_{n} x(m, n)\, w(i - m,\, j - n) \tag{2}$$

where S(i, j) represents the output at position (i, j), x(m, n) is the value of the input x at position (m, n), w(i − m, j − n) is the parameter of kernel w at position (i − m, j − n), and the summation is over all possible positions. An important variant of the CNN is the residual network (ResNet), which incorporates skip-connections between layers. These modifications have shown great advantages in practice, aiding the optimization of these typically huge models. CNNs, especially ResNets, have been widely used in protein structure prediction. An example is AlphaFold, in which the input is given by residue pairwise features and the output is a corresponding residue distance map (Figure 3a).

Recurrent architectures are based on applying several iterations of the same function along a sequential input. This can be seen as an unfolded architecture, and has been widely used to process sequential data, such as written text and time-series data. An example of an RNN approach in the context of protein prediction is using an N-terminal subsequence of a protein to predict the next amino acid in the protein (Figure 3b; e.g.,
Müller et al.). With an initial hidden state $h^{(0)}$ and sequential data $[x^{(1)}, x^{(2)}, \ldots, x^{(n)}]$, we can obtain hidden states recursively:

$$h^{(t)} = g^{(t)}\left(x^{(t)}, x^{(t-1)}, x^{(t-2)}, \ldots, x^{(1)}\right) = f\left(h^{(t-1)}, x^{(t)}; \theta\right), \tag{3}$$

where f represents a function or transformation from one position to the next, and $g^{(t)}$ represents the accumulated transformation up to position t. The hidden state vector at position i, $h^{(i)}$, contains all the information that has been seen before. As the same set of parameters (usually called a cell) can be applied recurrently along the sequential data, an input of variable length can be fed to an RNN. Due to the gradient vanishing and explosion problem (the error signal decreases or increases exponentially during training), more recent variants of the standard RNN, namely the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), are more widely used.

Auto-Encoders (AEs), unlike the models discussed so far, provide a model for unsupervised learning. Within this unsupervised framework, an auto-encoder does not learn labeled outputs but instead attempts to learn some representation of the original input. This is typically accomplished by training two parametric maps: an encoder function $g : X \to \mathbb{R}^m$ that maps an input x to an m-dimensional representation or latent space, and a decoder $f : \mathbb{R}^m \to X$ intended to implement the inverse map so that $f(g(x)) \approx x$. Typically, the latent representation is of small dimension (m is smaller than the ambient dimension of X) or constrained in some other way (e.g., through sparsity).

Variational Auto-Encoders (VAEs), in particular, provide a stochastic map between the input space and the latent space. This is beneficial because, while the input space may have a highly complex distribution, the distribution of the representation z can be much simpler, e.g., Gaussian. These methods are derived from variational inference, a method from machine learning that approximates probability densities through optimization. The stochastic encoder, given by the inference model $q_{\phi}(z|x)$ and parametrized by weights φ, is trained to approximate the true posterior distribution of the representation given the data, $p_{\theta}(z|x)$. The decoder, on the other hand, provides an estimate for the data given the representation, $p_{\theta}(x|z)$. Direct optimization of the resulting objective is intractable, however. Thus, training is done by maximizing the "Evidence Lower BOund" (ELBO), $\mathcal{L}_{\theta,\phi}(x)$, instead, which provides a lower bound on the log-likelihood of the data:

$$\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - D_{\mathrm{KL}}\left(q_{\phi}(z|x) \,\|\, p_{\theta}(z)\right). \tag{4}$$

Here, $D_{\mathrm{KL}}(q \,\|\, p)$ is the Kullback–Leibler (KL) divergence, which quantifies the distance between distributions q and p. Employing Gaussians for the factorized variational and likelihood distributions, as well as employing a change of variables via differentiable maps, allows for the efficient optimization of these architectures. An example of applying a VAE in the protein modeling field is learning a representation of anti-microbial protein sequences (Figure 3c; e.g., Das et al.). The resulting continuous real-valued representation can then be used to generate new sequences likely to have antimicrobial properties.
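As a concrete illustration of Eq. (4), the sketch below implements a minimal Gaussian VAE loss with the reparameterization trick; the linear encoder and decoder, the dimensions, and the unit-variance likelihood are simplifying assumptions, not the architecture of any model cited here.

```python
# Minimal VAE sketch: Gaussian encoder q_phi(z|x), decoder p_theta(x|z),
# and the negative ELBO of Eq. (4) as the training loss.
import torch
import torch.nn as nn

d_x, d_z = 100, 8                      # input and latent dimensions
enc = nn.Linear(d_x, 2 * d_z)          # mean and log-variance of q_phi(z|x)
dec = nn.Linear(d_z, d_x)              # mean of p_theta(x|z)

def neg_elbo(x):
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
    recon = dec(z)
    # E_q[log p(x|z)] under a unit-variance Gaussian likelihood (up to a constant)
    log_px_z = -0.5 * ((x - recon) ** 2).sum(dim=-1)
    # Closed-form KL(q_phi(z|x) || p(z)) against a standard normal prior
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)
    return (kl - log_px_z).mean()      # minimize the negative ELBO

x = torch.randn(32, d_x)               # toy batch
neg_elbo(x).backward()
```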
Generative Adversarial Networks (GANs) are another class of unsupervised (generative) models. Unlike VAEs, GANs are trained by an adversarial game between two models, or networks: a generator, G, which, given a sample z from some simple distribution $p_z(z)$ (e.g., Gaussian), seeks to map it to the distribution of some data class (e.g., natural-looking images); and a discriminator, D, whose task is to detect whether samples are real (i.e., belonging to the true distribution of the data, $p_{\mathrm{data}}(x)$) or fake (produced by the generator). With this game-based setup, the generator model is trained by maximizing the error rate of the discriminator, thereby training it to "fool" the discriminator. The discriminator, on the other hand, is trained to foil such fooling. The original objective function is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]. \tag{5}$$

Training is performed by stochastic optimization of this differentiable loss function. While intuitive, this original GAN objective can suffer from issues such as mode collapse and instabilities during training. The Wasserstein GAN (WGAN) is a popular extension of the GAN which introduces a Wasserstein-1 distance measure between distributions, leading to easier and more robust training in practice. An example of a GAN in the context of protein modeling is the work of Anand et al., who learned the distribution of protein backbone distances to generate novel protein-like folds (Figure 3d). One network, G, generates folds, and a second network, D, aims to distinguish real folds from generated ones.
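The sketch below shows the alternating updates behind Eq. (5) on toy data. The tiny architectures and the Gaussian stand-in for $p_{\mathrm{data}}$ are illustrative assumptions, and the generator step uses the common non-saturating variant of the objective.

```python
# Minimal GAN sketch: alternating discriminator/generator updates (Eq. 5).
import torch
import torch.nn as nn

d_z, d_x = 8, 2
G = nn.Sequential(nn.Linear(d_z, 32), nn.ReLU(), nn.Linear(32, d_x))
D = nn.Sequential(nn.Linear(d_x, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, d_x) + 3.0        # toy stand-in for p_data(x)
    fake = G(torch.randn(64, d_z))           # samples pushed through G

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: "fool" the discriminator (non-saturating variant).
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()
```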
Protein representation

One of the most fundamental challenges in protein modeling is the prediction of functionality from sequence or structure. Function prediction is typically formulated as a supervised learning problem. The property to predict can either be a protein-level property, such as a classification as an enzyme or non-enzyme, or a residue-level property, such as the sites or motifs of phosphorylation (DeepPho) and polyadenylation (Terminitor). The challenging part here, and in the following models, is how to represent the protein. Representation refers to the encoding of a protein that serves as the input for prediction tasks or the output for generation tasks. Although a deep neural network is in principle capable of extracting complex features, a well-chosen representation can make learning more effective and efficient. In this section, we introduce the representations of proteins commonly used in DL models (sequence-based and structure-based) and one special form of representation relevant to computational modeling of proteins: coarse-grained models.

Sequence-based representations

As the amino acid sequence contains the information essential to reach the folded structure for most proteins, it is widely used as an input in functional prediction and structure prediction tasks. The amino acid sequence, like other sequential data, is typically converted into a one-hot-encoding-based representation (each residue is represented with one high bit to identify the amino acid type and all the others low) that can be directly used in many sequence-based DL techniques. However, this representation is inherently sparse, and thus sample-inefficient. There are many easily accessible additional features that can be concatenated with the amino acid sequence to provide structural, evolutionary, and biophysical information (Figure 4).
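A minimal sketch of the one-hot encoding just described, assuming the standard 20-letter amino acid alphabet:

```python
# One-hot encoding: each residue becomes a 20-dimensional indicator vector.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence: str) -> np.ndarray:
    """Encode a protein sequence as an (L, 20) one-hot matrix."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        encoding[pos, AA_INDEX[aa]] = 1.0
    return encoding

print(one_hot("MGSSHH").shape)   # (6, 20)
```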
Figure 4: Different types of representation schemes applied to a protein.

Some widely used features include predicted secondary structure, high-level biological features such as sub-cellular localization and unique functions, and physical descriptors such as the AAIndex, hydrophobicity, ability to form hydrogen bonds, charge, solvent-accessible surface area, etc. A sequence can also be augmented with additional data from sequence databases, such as multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), or pairwise residue co-evolution features. Table 1 lists typical features as used in CUProtein.

Because the performance of machine learning algorithms depends highly on the features we choose, labor-intensive and domain-based feature engineering was vital for traditional machine learning projects. Now, the exceptional feature-extraction ability of neural networks makes it possible to "learn" the representation, with or without giving the model any labels. As publicly available sequence data are abundant (see Table 2), a well-learned representation that utilizes these data to capture more information is of particular interest. The class of algorithms that address the label-less learning problem fall under the umbrella of unsupervised or semi-supervised learning, which extracts information from unlabeled data to reduce the number of labeled samples needed.

The most straightforward way to learn from amino acid sequence is to directly apply natural language processing (NLP) algorithms. Word2Vec and Doc2Vec are groups of algorithms widely used for learning word or paragraph embeddings. These models are trained by either predicting a word from its context or predicting its context from one central word. To apply these algorithms, Asgari and Mofrad first proposed a Word2Vec-based model called BioVec that interprets the non-overlapping 3-mer sequences of amino acids (e.g., alanine-glutamine-lysine, or AQK) as "words" and lists of shifted "words" as "sentences". They then represent a protein as the summation of all overlapping sequence fragments of length k, or k-mers (called ProtVec). Predictions based on the ProtVec representation outperformed state-of-the-art machine learning methods in Pfam protein family classification (93% accuracy, versus 75% for previous methods). Many Doc2Vec-type extensions were developed based on the "3-mer" protocol. Kimothi et al. showed that non-overlapping k-mers perform better than overlapping ones, and Yang et al. compared the performance of all Doc2Vec frameworks for thermostability and enantioselectivity prediction.
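To make the k-mer "words" idea concrete, the sketch below trains a toy Word2Vec model on overlapping 3-mers and forms a ProtVec-style protein vector by summing the k-mer embeddings. It assumes the gensim package (4.x API), and the two-sequence corpus is purely for illustration.

```python
# ProtVec-style sketch: 3-mers as "words", protein = sum of k-mer vectors.
import numpy as np
from gensim.models import Word2Vec

def kmers(seq: str, k: int = 3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

corpus = ["MGSSHHHHHH", "MKTAYIAKQR"]          # toy stand-in for a database
model = Word2Vec([kmers(s) for s in corpus],
                 vector_size=64, window=5, min_count=1)

def protvec(seq: str) -> np.ndarray:
    """Sum the embeddings of all overlapping 3-mers in the sequence."""
    return np.sum([model.wv[km] for km in kmers(seq)], axis=0)

print(protvec("MGSSHHHHHH").shape)             # (64,)
```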
Notably, their sequence representation15ased model, UniRep Fusion, was able to outperform stability ranking predictions made byRosetta, which uses sequence, structure, and a scoring function trained on various biophysicaldata.Auto-encoders can also provide representations for subsequent supervised tasks. Dinget al. showed that a VAE model is able to capture evolutionary relationships between se-quences and stability of proteins, while Sinai et al. and Riesselman et al. showed that thelatent vectors learned from VAEs are able to predict the effects of mutations on fitness andactivity for a range of proteins such as poly(A)-binding protein, DNA methyltransferase and β -lactamase. Recently, a lower-dimensional embedding of the sequence was learned forthe more complex task of structure prediction. Alley et al.’s UniRep surpassed former mod-els, but since UniRep is trained on 24 million sequences and previous models ( e.g. , Prot2Vec)were trained on much smaller datasets (0.5 million), it is not clear if the improvement was dueto better methods or the larger training dataset. Rao et al. introduced multiple biological-relevant semi-supervised learning tasks, TAPE, and benchmarked the performance againstvarious protein representations. Their results show conventional alignment-based inputs stilloutperform current self-supervised models on multiple tasks, and the performance on a singletask cannot evaluate the capacity of models. A comprehensive and persuasive comparisonof representations is required.
Structure-based representations

Since the most important functions of a protein (binding, signaling, catalysis, etc.) can be traced back to its 3D structure, the direct use of 3D structural information, and analogously, learning a good representation based on 3D structure, are highly desired. The direct use of raw 3D representations (such as coordinates of atoms) is hindered by considerable challenges, including the processing of unnecessary information due to translation, rotation, and permutation of atomic indexing. Townshend et al. and Simonovsky and Meyers obtained a translationally invariant 3D representation of each residue by voxelizing its atomic neighborhood for a grid-based 3D CNN model. Alternatively, the torsion angles of the protein backbone, which are invariant to translation and rotation, can fully recapitulate the protein backbone structure under the common assumption that variation in bond lengths and angles is negligible. AlQuraishi employed backbone torsion angles to represent the 3D structure of the protein as a 1D data vector. However, because a change in a backbone torsion angle at a residue affects the inter-residue distances between all preceding and subsequent residues, these 1D variables are highly interdependent, which can frustrate learning. To circumvent these limitations, many approaches use 2D projections of 3D protein structure data, such as residue–residue distance and contact maps, and pseudo-torsion and bond angles that capture the relative orientations between pairs of residues. While these representations guarantee translational and rotational invariance, they do not guarantee invertibility back to the 3D structure. The structure must be reconstructed by applying constraints on distance or contact parameters using algorithms such as gradient-descent minimization, multidimensional scaling, a program like the Crystallography and NMR System (CNS), or in conjunction with an energy-function-based protein structure prediction program.
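As a minimal example of such a 2D projection, the sketch below computes a pairwise Cα–Cα distance map, and a thresholded contact map, from 3D coordinates; the random coordinates and the 8 Å contact cutoff are illustrative.

```python
# Translation- and rotation-invariant 2-D projection: the Cα distance map.
import numpy as np

def distance_map(ca_coords: np.ndarray) -> np.ndarray:
    """ca_coords: (L, 3) Cα positions; returns the (L, L) distance matrix."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

coords = np.random.rand(50, 3) * 30.0          # toy coordinates in Å
dmap = distance_map(coords)
contacts = dmap < 8.0                          # a common contact threshold
```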
An alternative to the above approaches for representing protein structures is the use of a graph, i.e., a collection of nodes or vertices connected by edges. Such a representation is highly amenable to the Graph Neural Network (GNN) paradigm, which has recently emerged as a powerful framework for non-Euclidean data in which the data are represented with relationships and inter-dependencies, or edges, between objects, or nodes. While the representation of proteins as graphs and the application of graph theory to study their structure and properties have a long history, efforts to apply GNNs to protein modeling and design are quite recent. As a benchmark, many GNNs have been applied to classify enzymes from non-enzymes in the PROTEINS and D&D datasets. Fout et al. utilized a GNN in developing a model for protein–protein interface prediction. In their model, the node features comprised residue composition and conservation, accessible surface area, residue depth, and protrusion index; the edge features comprised a distance and an angle between the normal vectors of the amide planes of each node/residue. A similar framework was used to predict antibody–antigen binding interfaces. Zamora-Resendiz and Crivelli and Gligorijevic et al. further generalized and validated the use of graph-based representations and the Graph Convolutional Network (GCN) framework in protein function prediction tasks, using a Class Activation Map (CAM) to interpret the structural determinants of the functionalities. Torng and Altman applied GCNs to model pocket-like cavities in proteins to predict the interaction of proteins with small molecules, and Ingraham et al. adopted a graph-based transformer model to perform a protein sequence design task. These examples demonstrate the generality and potential of graph-based representations and GNNs to encode structural information for protein modeling.

The surface of a protein or a cavity is an information-rich region that encodes how a protein may interact with other molecules and its environment. Recently, Gainza et al. used a geometric DL framework to learn a surface-based representation of the protein, called MaSIF. They calculated "fingerprints" for patches on the protein surface using geodesic convolutional layers, which were further used to perform tasks such as binding-site prediction or ultra-fast protein–protein interaction (PPI) search. The performance of MaSIF approached the baseline of current methods in docking and function prediction, providing a proof of concept to inspire more applications of geometry-based representation learning.
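Returning to the graph representation above, the sketch below builds a minimal residue-level graph with edges between residues whose Cα atoms fall within a cutoff; real models attach node and edge features (conservation, surface area, orientations, etc.) that are omitted here.

```python
# Residue-level protein graph: nodes are residues, edges join close Cα pairs.
import numpy as np

def build_graph(ca_coords: np.ndarray, cutoff: float = 8.0):
    """Return an edge list [(i, j), ...] for residue pairs within `cutoff` Å."""
    n = len(ca_coords)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(ca_coords[i] - ca_coords[j]) < cutoff:
                edges.append((i, j))
    return edges

coords = np.random.rand(50, 3) * 30.0          # toy Cα coordinates
edges = build_graph(coords)
```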
Force fields

A high-quality force field (or, more generally, a score function) for sampling and/or ranking models (decoys) is one of the most vital requirements for protein structural modeling.
A force field describes the potential energy surface of a protein. A score function may contain knowledge-based terms that do not necessarily have a valid physical meaning and that are designed to distinguish near-native conformations from non-native ones (for example, by learning the GDT TS). A molecular dynamics (MD) or Monte Carlo (MC) simulation with a state-of-the-art force field or score function can reproduce reasonable statistical behaviors of biomolecules.
Current DL-based efforts to learn force fields can be divided into two classes: "fingerprint"-based and graph-based. Behler and Parrinello developed roto-translationally invariant features, i.e., the Behler–Parrinello fingerprint, to encode the atomic environment for neural networks to learn potential surfaces from Density Functional Theory (DFT) calculations.
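A minimal sketch of one such roto-translationally invariant feature, a radial symmetry function of the Behler–Parrinello type, is shown below; the parameter values (eta, r_s, and the cutoff) are illustrative assumptions.

```python
# Behler–Parrinello-style radial fingerprint of atom i's environment.
import numpy as np

def cutoff_fn(r: np.ndarray, r_c: float) -> np.ndarray:
    """Smooth cosine cutoff that decays to zero at r_c."""
    fc = 0.5 * (np.cos(np.pi * r / r_c) + 1.0)
    return np.where(r < r_c, fc, 0.0)

def radial_fingerprint(coords: np.ndarray, i: int, eta: float = 1.0,
                       r_s: float = 0.0, r_c: float = 6.0) -> float:
    """Sum of Gaussians of neighbor distances, damped by the cutoff function."""
    r = np.linalg.norm(np.delete(coords, i, axis=0) - coords[i], axis=-1)
    return float(np.sum(np.exp(-eta * (r - r_s) ** 2) * cutoff_fn(r, r_c)))

atoms = np.random.rand(20, 3) * 5.0            # toy atomic coordinates in Å
g_i = radial_fingerprint(atoms, i=0)           # invariant to rotation/translation
```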
Smith et al. extended this framework and tested its accuracy by simulating systems of up to 312 atoms (Trp-cage) for 1 ns.
Another family, which includes the deep tensor neural network (DTNN) and SchNet, utilizes graph convolutions to learn a representation for each atom within its chemical environment. Though the prediction quality and the ability to learn a representation with novel chemical insight make the graph-based approach increasingly popular, its application has mainly focused on small organic molecules, as it scales poorly to larger systems.

A shift towards DL-based score functions is anticipated, especially due to the enormous gains in speed and efficiency. For example, Zhang et al. showed that MD simulation on a neural potential was able to reproduce energies, forces, and time-averaged properties comparable to ab initio MD (AIMD) at a cost that scales linearly with system size, compared to the cubic scaling typical for AIMD with DFT.
Though these force fields are, in principle, generalizable to larger systems, direct applications of neural potentials to model full proteins are still rare. PhysNet, trained on a set of small peptide fragments (at most eight heavy atoms), was able to generalize to deca-alanine (Ala10), and ANI-1x and AIMNet have been tested on chignolin (10 residues) and Trp-cage (20 residues) within the ANI-MD benchmark dataset.
Lahey and Rowley and Wang et al. combined the Quantum Mechanics/Molecular Mechanics (QM/MM) strategy with neural potentials to model docking with small ligands and larger proteins (up to 82 residues).

Coarse-grained models

Coarse-grained models are higher-level abstractions of biomolecules, such as using a single pseudo-atom or bead to represent multiple atoms, grouped based on local connectivity and/or chemical properties. Coarse-graining smoothens the energy landscape, and thereby helps avoid trapping in local minima and speeds up conformational sampling.
One can learn the atomic-level properties to construct a fast and accurate neural coarse-grained model once the coarse-grained mapping is given. Early attempts to apply DL-based methods to coarse-graining focused on water molecules with roto-translationally invariant features.
Wang et al. developed CGNet and learned a coarse-grained model of the mini-protein chignolin, in which the atoms of each residue are mapped to the corresponding Cα atom. The free-energy surface learned with CGNet is quantitatively correct, and MD simulations performed with the CGNet potential predict the same set of metastable states (folded, unfolded, and misfolded). Also, the level of coarse-graining (e.g., a single coarse-grained atom to represent a residue versus two coarse-grained atoms, one for the backbone and one for the side chain) is critical to the performance of coarse-grained models. For this purpose, Wang and Gómez-Bombarelli applied an encoder-decoder-based model to explicitly learn the lower-dimensional representation of proteins by minimizing the information loss at different levels of coarse-graining.
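As a minimal sketch of the kind of coarse-graining map such models build on, the code below collapses the atoms of each residue into a single bead; using the average atom position rather than the Cα atom is a simplification for illustration.

```python
# One-bead-per-residue coarse-graining of all-atom coordinates.
import numpy as np

def coarse_grain(atom_coords: np.ndarray, residue_ids: np.ndarray) -> np.ndarray:
    """Average the atom positions within each residue into one bead."""
    return np.stack([atom_coords[residue_ids == res].mean(axis=0)
                     for res in np.unique(residue_ids)])

coords = np.random.rand(100, 3)                # toy all-atom coordinates
res_ids = np.repeat(np.arange(10), 10)         # 10 residues x 10 atoms
cg_coords = coarse_grain(coords, res_ids)      # (10, 3) bead positions
```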
Structure prediction

The most successful application of DL in the field of protein modeling so far has been the prediction of protein structure. Protein structure prediction is formulated as a well-defined problem with clear inputs and outputs: predict the 3D structure (output) given the amino acid sequence (input), with experimental structures as the ground truth (labels). This problem perfectly fits the classical supervised learning approach, and once the problem is defined in these terms, the remaining challenge is to choose a framework to handle the complex relationship between input and output. The CASP experiment for structure prediction is held every two years and has served as a platform for DL to compete with state-of-the-art methods and, impressively, outshine them in certain categories. We will first discuss the application of DL to the protein folding problem, and then comment on some problems related to structure determination. Table 3 summarizes major DL efforts in structure prediction.
Protein folding

Before the notable successes of DL at CASP12 (2016) and CASP13 (2018), the state-of-the-art methodology employed complex workflows based on a combination of fragment insertion and structure optimization methods, such as simulated annealing with a score function or energy potential. Over the last decade, the introduction of co-evolution information in the form of evolutionary coupling analysis (ECA) improved predictions. ECA relies on the rationale that residue pairs in contact in 3D space tend to evolve or mutate together; otherwise, a mutation in one partner would disrupt the structure, destabilizing the fold or inducing a large conformational change. Thus, evolutionary couplings derived from sequencing data suggest distance relationships between residue pairs and aid structure construction from sequence through contact or distance constraints. Since co-evolution information relies on statistical averaging of sequence information from a large number of MSAs, this approach is not effective when the protein target has only a few sequence homologs. Neural networks were, at first, introduced to deduce evolutionary couplings between distant homologs, thereby improving ECA-type contact predictions for contact-assisted protein folding.
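To illustrate the raw co-variation signal that ECA builds on, the sketch below computes the mutual information between two columns of a toy alignment; real ECA methods such as direct coupling analysis go further and disentangle direct couplings from indirect, transitive ones.

```python
# Mutual information between two MSA columns as a simple co-variation score.
import numpy as np
from collections import Counter

def column_mi(msa, i, j):
    """Mutual information between columns i and j of an aligned sequence set."""
    n = len(msa)
    pairs = Counter((s[i], s[j]) for s in msa)
    col_i = Counter(s[i] for s in msa)
    col_j = Counter(s[j] for s in msa)
    return sum((c / n) * np.log((c / n) / ((col_i[a] / n) * (col_j[b] / n)))
               for (a, b), c in pairs.items())

msa = ["MKTAY", "MRSAY", "MKTAY", "MRSAY"]     # toy alignment
print(column_mi(msa, 1, 2))                    # columns 1 and 2 co-vary (~0.69)
```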
While the application of neural networks to learn inter-residue protein contacts dates back to the early 2000s, more recently this approach was adopted by MetaPSICOV (2-layer NN),
PConsC2 (2-layer NN), and CoinDCA-NN (5-layer NN), which combined neural networks with ECAs. However, there was no significant advantage to neural nets compared to other machine learning methods at that time.
Figure 5: Two representative DL approaches to protein structure prediction. (a) Residue distance prediction by RaptorX: the overall network architecture of the deep dilated ResNet used in CASP13. Inputs to the first-stage, one-dimensional convolutional layers are a sequence profile, predicted secondary structure, and solvent accessibility. The output of the first stage is then converted into a two-dimensional matrix by concatenation and fed into a deep ResNet along with pairwise features (co-evolution information, pairwise contact and distance potential). A discretized inter-residue distance is the output. Additional network layers can be attached to predict torsion angles and secondary structures. Figure from Xu and Wang (2019). (b) Direct structure prediction: overview of the recurrent geometric network (RGN) approach. The raw amino acid sequence along with a PSSM are fed as input features, one residue at a time, to a bidirectional LSTM net. Three torsion angles for each residue are predicted to directly construct the three-dimensional structure. Figure from AlQuraishi (2019).

In 2017, Wang et al. proposed RaptorX-Contact, a residual neural network (ResNet) based model which, for the first time, employed a deep neural network for protein contact prediction, significantly improving the accuracy on blind, challenging targets with novel folds. RaptorX-Contact ranked first on free modeling (FM) targets at CASP12. Its architecture (Figure 5a) entails (1) a 1D ResNet that inputs MSAs, predicted secondary structure, and solvent accessibility (from the DL-based prediction tool RaptorX-Property) and (2) a 2D ResNet with dilations that inputs the 1D ResNet output and inter-residue co-evolution information from CCMpred.
In its original formulation, RaptorX-Contact outputs a binary classification of contacting versus non-contacting residue pairs. Later versions were trained to learn a multi-class classification of the distance distributions between Cβ atoms. The primary contributors to the accuracy of the predictions were the co-evolution information from CCMpred and the depth of the 2D ResNet, suggesting that the deep neural network learned co-evolution information better than previous methods. Later, the method was extended to predict Cα–Cα, Cα–Cγ, Cγ–Cγ, and N–O distances and torsion angles (DL-based RaptorX-Angle), and all five distances, torsions, and secondary structure predictions were converted to constraints for folding by CNS. At CASP12, however, RaptorX-Contact (in its original, contact-based formulation) and DL drew limited attention because the difference between top-ranked predictions from DL-based methods and hybrid DCA-based methods was small.

This situation changed at CASP13, when one DL-based model, AlphaFold, developed by team A7D (DeepMind), ranked first and significantly improved the accuracy on "free modeling" (no templates available) targets (Figure 1). The A7D team modified the traditional simulated annealing protocol with DL-based predictions and tested three protocols based on deep neural networks. Two protocols used memory-augmented simulated annealing (with domain segmentation and fragment assembly) with potentials generated from predicted inter-residue distance distributions and predicted GDT TS, respectively, whereas the third protocol directly applies gradient-descent optimization to a hybrid potential combining the predicted distances and the Rosetta score. For the distance prediction network, a deep ResNet, similar to that of RaptorX, inputs MSA data and predicts the probability distribution of distances between β-carbons. A second network was trained to predict the GDT TS of a candidate structure with respect to the true or native structure. The simulated annealing process was improved with a Conditional Variational Auto-Encoder (CVAE) model that constructs a mapping between the backbone torsions and a latent space conditioned on sequence. With this network, the team generated a database of nine-residue fragments for the memory-augmented simulated annealing system. Gradient-based optimization performed slightly better than the simulated annealing, suggesting that traditional simulated annealing is no longer necessary and state-of-the-art performance can be reached by simply optimizing a network-predicted potential. AlphaFold's authors, like the RaptorX-Contact group, emphasized that the accuracy of predictions relied heavily on the learned distance distributions and co-evolutionary data.

Yang et al. further improved the accuracy of predictions on CASP13 targets using a shallower network than former models (61 versus 220 ResNet blocks in AlphaFold) by additionally training their neural network model (named trRosetta) to learn inter-residue orientations along with β-carbon distances. The geometric features (Cα–Cβ torsions, pseudo-bond angles, and azimuthal rotations) directly describe the relevant coordinates for the physical interaction of two amino acid side chains. These additional outputs created a significant improvement on a relatively fixed DL framework, suggesting that there is room for additional improvement.
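The following highly simplified sketch illustrates the idea behind the gradient-based protocol: treat predicted inter-residue distance distributions as a differentiable potential and optimize coordinates directly. The "predicted" probabilities below are random placeholders, soft binning stands in for the spline-based potential used in practice, and no torsion parameterization or Rosetta score terms are included.

```python
# Toy "folding by gradient descent" on a binned distance potential.
import torch

n_res, n_bins = 30, 64
bin_centers = torch.linspace(2.0, 22.0, n_bins)

# Placeholder for network-predicted distance distributions per residue pair.
probs = torch.softmax(torch.randn(n_res, n_res, n_bins), dim=-1)

coords = torch.randn(n_res, 3, requires_grad=True)    # coordinates to optimize
opt = torch.optim.Adam([coords], lr=0.05)

for step in range(500):
    opt.zero_grad()
    diff = coords.unsqueeze(0) - coords.unsqueeze(1)
    d = torch.sqrt((diff ** 2).sum(-1) + 1e-8)        # pairwise distances
    # Soft-assign each distance to bins; score by -log p(bin).
    w = torch.softmax(-(d.unsqueeze(-1) - bin_centers) ** 2, dim=-1)
    nll = -(w * torch.log(probs + 1e-9)).sum(dim=-1)
    loss = torch.triu(nll, diagonal=1).sum()          # each pair counted once
    loss.backward()
    opt.step()
```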
An alternative and intuitive approach to structure prediction is to directly learn the mapping from sequence to structure with a neural network. AlQuraishi developed such an end-to-end differentiable protein structure predictor, called the Recurrent Geometric Network (RGN), that allows direct prediction of the torsion angles to construct the protein backbone (Figure 5b). The RGN is a bi-directional LSTM that inputs a sequence, PSSM, and positional information and outputs predicted backbone torsions. Overall 3D structure predictions are within 1-2 Å of those made by top-ranked groups at CASP13, and this approach boasts a considerable advantage in prediction time compared to strategies that learn potentials. Moreover, the method does not use MSA-based information and could potentially be improved with the inclusion of evolutionary information. The RGN strategy is generalizable and well-suited for protein structure prediction. Several generative methods (see below) also entail end-to-end structure prediction models, like the CVAE framework used by AlphaFold, albeit with more limited success.

Protein-protein interface (PPI) prediction identifies residues at the interface of two proteins forming a complex. Once the interface residues are determined, a local search and scoring protocol can be used to determine the structure of the complex. As in protein folding, efforts have focused on learning to classify residue pairs as contacting or not. For example, Townshend et al. developed a 3D CNN model (SASNet) that voxelizes the three-dimensional environment around the target residue, and Fout et al. developed a GCN-based model with each interacting partner represented as a graph. Unlike these methods, which start from the unbound structures, Zeng et al. reused a model trained on single-chain proteins (i.e., RaptorX-Contact) to predict PPIs from sequence information alone, resulting in RaptorX-Complex, which outperforms ECA-based methods at contact prediction.
Another interesting approach directly compares the geometry of two protein patches. Gainza et al. trained their MaSIF model by minimizing the Euclidean distances between complementary surface patches on the two proteins while maximizing the distances between non-interacting surface patches.
This step is followed by a quick nearest-neighbor scan to predict binding partners. The accuracy of MaSIF was comparable to traditional docking methods. However, MaSIF, like existing methods, showed low prediction accuracy for targets that involve conformational changes upon binding.
Membrane proteins (MPs) are partially or fully embedded in a hydrophobic environment composed of a lipid bilayer; consequently, they exhibit hydrophobic motifs on their surface, unlike the majority of proteins, which are water-soluble. Li et al. used a DL transfer-learning framework comprising one-shot learning from non-MPs to MPs.
They showed that transfer learning works surprisingly well here because the most frequently occurring contact patterns in soluble proteins and membrane proteins are similar. Other efforts include classification of the trans-membrane topology.
Since experimental biophysical data are sparse for membrane proteins, Alford and Gray compiled a collection of twelve diverse benchmark sets for membrane protein prediction and design, for testing and learning implicit membrane energy models.
Loop modeling is a special case of structure prediction where most of the 3D protein structure is given, but the coordinates of some segments of the polypeptide are missing and need to be completed. Loops are irregular and sometimes flexible segments, and thus their structures have been difficult to capture experimentally or computationally.
So far, DL frameworks based on inter-residue distance prediction (similar to protein structure prediction) and frameworks that treat the distances between loop residues and the remaining residues as an image-inpainting problem have been applied to loop modeling. Recently, Ruffolo et al. used a RaptorX-like network setup and a trRosetta geometric representation to predict the structure of antibody hypervariable complementarity-determining region (CDR) H3 loops, which are critical for antigen binding.
Protein design

We divide the current DL approaches to protein design into two broad categories. The first uses knowledge of other sequences (either "all" sequenced proteins or a certain class of proteins) to design sequences directly (Table 4). These approaches are well-suited to create new proteins with functionality matching existing proteins based on sequence information alone. The second class follows the "fold-before-function" scheme and seeks to stabilize specific 3D structures, perhaps, but not necessarily, with the intent to perform a desired function (Tables 5 and 6). The first approach can be described as function → sequence (structure-agnostic), and the second approach fits the traditional step-wise inverse design: function → structure → sequence.

Design in sequence space

Approaches that attempt to design sequences directly parallel work in the field of NLP, where an auto-regressive framework is common, most notably the RNN. In language processing, an RNN model is able to take the beginning of a sentence and predict the next word in that sentence. Likewise, given a starting amino acid residue or a sequence of residues, a protein design model can output a categorical distribution over the 20 amino acid residues for the next position in the sequence. The next residue is sampled from this categorical distribution, which in turn is used as the input to predict the following one. Following this approach, new sequences, sampled from the distribution of the training data, are generated with the goal of having properties similar to those in the training set (see the sketch below).

Müller et al. first applied an LSTM RNN framework to learn the sequence patterns of anti-microbial peptides (AMPs), a highly specialized sequence space of cationic, amphipathic helices. The same group then applied this framework to design membranolytic anticancer peptides (ACPs). Twelve of the generated peptides were synthesized, and six of them killed MCF7 human breast adenocarcinoma cells with at least three-fold selectivity against human erythrocytes. In another application, instead of a traditional RNN, Riesselman et al. used a residual causal dilated CNN in an auto-regressive way and generated a functional single-domain antibody library conditioned on the naive immune repertoires of llamas, though experimental validation was not presented. Such applications could potentially speed up and simplify the task of generating sequence libraries in the lab.
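Below is a minimal sketch of the auto-regressive sampling loop described above: an LSTM (untrained here, with illustrative dimensions) emits logits over the 20 amino acids, and each sampled residue is fed back in as the next input.

```python
# Auto-regressive peptide sampling with an (untrained) LSTM.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
n_aa, d_h = len(AMINO_ACIDS), 64

embed = nn.Embedding(n_aa, d_h)
rnn = nn.LSTM(d_h, d_h, batch_first=True)
head = nn.Linear(d_h, n_aa)                    # logits over the 20 residues

def sample_sequence(length: int = 25, start: str = "M") -> str:
    seq = [AMINO_ACIDS.index(start)]
    state = None
    for _ in range(length - 1):
        x = embed(torch.tensor([[seq[-1]]]))   # last residue as input
        out, state = rnn(x, state)
        dist = torch.distributions.Categorical(logits=head(out[0, -1]))
        seq.append(int(dist.sample()))         # sample the next residue
    return "".join(AMINO_ACIDS[i] for i in seq)

print(sample_sequence())
```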
Another approach to sequence generation is mapping a latent space to the sequence space; common strategies to train such a mapping include AEs and GANs. As mentioned earlier, AEs are trained to learn a bi-directional mapping between a discrete design space (sequences) and a continuous real-valued space (the latent space). Thus, many applications of AEs employ the learned latent representation to capture the sequence distribution of a specific class of proteins and, subsequently, to predict the effect of variations in sequence (or mutations) on protein function. The utility of this learned latent space, however, goes further. A well-trained real-valued latent space can be used to interpolate between two training samples, or even extrapolate beyond the training data to yield novel sequences. One such example is the PepCVAE model. Following a semi-supervised learning approach, Das et al. trained a VAE model on a large dataset of unlabeled sequences and then refined the model for the AMP subspace using a 15,000-sequence labeled dataset. By concatenating a conditional code indicating whether a peptide is antimicrobial, the CVAE framework allows efficient sampling of AMPs selectively from the broader peptide space. More than 82% of the generated peptides were predicted to exhibit antimicrobial properties according to a state-of-the-art AMP classifier.

Unlike AEs, GANs focus on learning a uni-directional mapping from a continuous real-valued space to the design space. In an early example, Killoran et al. developed a model that combines a standard GAN with activation maximization to design DNA sequences that bind to a specific protein.
Repecka et al. trained ProteinGAN on the bacterial enzyme malate dehydrogenase (MDH) to generate new enzyme sequences that were active and soluble in vitro, some with over 100 mutations, with a 24% success rate.
Another interesting GAN-based framework is Gupta and Zou's Feedback GAN (FBGAN), which learns to generate complementary DNA sequences for peptides.
They add a feedback-loop architecture to optimize the synthetic gene sequences for desired properties using an oracle (an external function analyzer). At every epoch, they update the positive training data of the discriminator with high-scoring sequences from the generator, so that the score of generated sequences increases gradually. They demonstrated the efficacy of their model by successfully biasing generated sequences towards anti-microbial activity and a desired secondary structure.
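The feedback loop itself is simple enough to sketch end-to-end. In the toy, self-contained version below, a random generator and a crude cationic-fraction score stand in for the paper's GAN generator and external analyzer; only the data-swapping logic reflects the FBGAN idea.

```python
# Toy FBGAN-style feedback loop with stand-in generator and oracle.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate(n, length=20):
    """Stand-in generator: random peptide sequences."""
    return ["".join(random.choices(AMINO_ACIDS, k=length)) for _ in range(n)]

def oracle(seq):
    """Stand-in analyzer: a score in [0, 1] (here, the cationic fraction)."""
    return sum(aa in "KRH" for aa in seq) / len(seq)

positive_data = generate(64)                   # initial "real" training set
for epoch in range(10):
    # ... one epoch of GAN training on positive_data would happen here ...
    candidates = generate(64)
    good = [s for s in candidates if oracle(s) > 0.3]
    # Feedback: high-scoring generated sequences replace the oldest positives.
    positive_data = positive_data[len(good):] + good
```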
Design with structure as intermediate

Within the fold-before-function scheme, one first picks or designs a protein fold or topology according to certain desirable properties, then determines an amino acid sequence that could fold into that structure (function → structure → sequence). The main challenge in generating feasible protein structures might still be choosing a suitable representation of protein structures. Anand and Huang tested various representations (full-atom, torsion-only, etc.) with a deep convolutional GAN (DCGAN) framework that generates sequence-agnostic, fixed-length short protein structural fragments. They found that the distance map of Cα atoms gives the most meaningful protein structures, though the asymmetry of ψ and φ torsion angles was only recovered with torsion-based representations. Later, they extended this work to all atoms in the backbone and combined it with a recovery network to avoid the time-consuming structure reconstruction process. They showed that some of the designed folds are stable in molecular simulation. Further work in this direction, such as conditioned generation or variable-length generation, would enable the design of folds with desired functions.

As for amino acid sequence design given a protein structure, under the supervised learning setting most efforts use the native sequences as the ground truth and the recovery rate of native sequences (i.e., the percentage of the designed sequence that matches the native one) as a success metric. For comparison, Kuhlman and Baker reported sequence-recovery rates of 51% for core residues and 27% for all amino acid residues using traditional de novo design approaches.
Since the mapping between sequence and structure is not one-to-one (many sequences fold into a neighborhood of each structure), it is not clear that higher sequence-recovery rates are meaningful. A class of efforts, pioneered by the SPIN model, inputs a five-residue sliding window to predict the amino acid probabilities at the center position, generating sequences compatible with a desired structure. The features in such models include φ and ψ dihedrals, a sequence profile of a 5-residue fragment derived from similar structures, and a rotamer-based energy profile of the target residue using the DFIRE potential. SPIN reached a 30.7% sequence-recovery rate, and Wang et al. and O'Connell et al.'s SPIN2 further improved it to 34%. Another class of efforts inputs the voxelized local environment of an amino acid residue. In Zhang et al.'s and Shroff et al.'s models, the voxelized local environment was fed into a 3D CNN framework to predict the most stable residue type at the center of the region.
Shroff et al. reported a 70% recovery rate, and the mutation sites were validated experimentally.
Anand et al. trained a similar model to design sequences for a given backbone.
Their protocol involves iteratively sampling from predicted conditional distributions, and it recovered from 33% to 87% of native sequence identities. They tested their model by designing sequences for five proteins, including a de novo
TIM-barrel. The designed sequences were 30-40% identical to the native sequences, and the predicted structures were 2-5 Å RMSD from the native conformation. Another approach is to generate the full sequence instead of fragments, conditioned on a target structure. Greener et al. trained a CVAE model to generate sequences conditioned on a protein topology represented as a string.
The resulting sequences were verified to be stable in molecular simulation. Karimi et al. developed gcWGAN, which combines a conditional GAN (CGAN) with a guidance strategy to bias the generated sequences towards a desired structure.
They employed a fast structure prediction algorithm as an "oracle" to assess the output sequences and provide feedback to refine the model. They examined the model on six folds using Rosetta-based structure prediction, and gcWGAN had higher TM-score distributions and more diverse sequence profiles than the CVAE.
Another notable experiment is Ingraham et al.'s graph transformer model, which takes a structure represented as a graph as input and outputs a sequence profile. They treat sequence design as a machine translation problem, i.e., a translation from structure to sequence. Like the original transformer model, they adopt an encoder–decoder framework with self-attention mechanisms that dynamically learn the relationships among residues. They measured their results by perplexity, a metric widely used in speech recognition; the per-residue perplexity (lower is better) for single chains was 9.15, below the 12.86 perplexity of SPIN2.
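Perplexity is the exponential of the mean negative log-likelihood that the model assigns to the native residues, so a perfect model scores 1 and a model that is uniform over the 20 amino acids scores 20:

```python
import math

def per_residue_perplexity(log_probs):
    """Perplexity from per-residue log-likelihoods (natural log) of the native
    amino acids under the model: exp(mean negative log-likelihood)."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# A uniform model assigns log(1/20) to every residue, giving perplexity 20.
assert abs(per_residue_perplexity([math.log(1 / 20)] * 100) - 20.0) < 1e-9
```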
Outlook and conclusion
In this review, we have summarized the current state-of-the-art DL techniques applied to the problems of protein structure prediction and design. As in many other areas, DL shows the potential to revolutionize the field of protein modeling. While DL originated in computer vision, NLP, and machine learning, its fast development, combined with knowledge from operations research, game theory, and variational inference, among other fields, has produced many new and powerful frameworks for solving increasingly complex problems. The application of DL to biomolecular structure has just begun, and we expect to see more methodology development and applications in protein modeling and design. We observed several trends:

Experimental validation: An important gap in current DL work on protein modeling, especially protein design (with a few notable exceptions), is the lack of experimental validation. Past blind challenges (e.g., CASP and CAPRI) and past design claims have shown that experimental validation is of paramount importance in a field where computational models are still prone to error. A key next stage is to engage collaborations between machine learning experts and experimental protein engineers to test and validate these emerging approaches.
Importance of benchmarking: In other fields of machine learning, standardized benchmarks have triggered rapid progress. CASP is a great example: it provides a standardized platform for benchmarking diverse algorithms, including emerging DL-based approaches. Well-defined questions and proper evaluation (especially experimental) would lead to more open competition among a broader range of groups and, eventually, to more diverse and powerful algorithms.
Imposing a physics-based prior: One common topic in the machine learning community is how to exploit existing domain knowledge to reduce the training burden. Unlike classical ML problems such as image classification, protein modeling is subject to a wide range of biophysical principles that restrict the space of plausible solutions. Examples in related fields include imposing a physics-based model prior, adding a regularization term with physical meaning, and adopting formulations that conserve physical symmetries. Similarly, in protein modeling, well-established empirical observations can help restrict the solution space, such as the Ramachandran distribution of backbone torsion angles and the Dunbrack or Richardson libraries of side-chain rotamer conformations. One illustrative possibility is sketched below.
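As a sketch (our own illustration, not a published method), one might add a regularization term that penalizes predicted backbone torsions with low probability under an empirical Ramachandran-style density; `toy_log_density` below is a stand-in for a density estimated from experimental structures:

```python
import torch

def torsion_prior_penalty(phi_psi, log_density):
    """Penalize predicted (phi, psi) torsions in proportion to their negative
    log-density under an empirical Ramachandran-style distribution."""
    return -log_density(phi_psi).mean()

# Toy stand-in for a density fit to experimental structures: an isotropic
# Gaussian centered on the alpha-helical region, (phi, psi) ~ (-60, -45) degrees.
center = torch.deg2rad(torch.tensor([-60.0, -45.0]))
toy_log_density = lambda tors: -((tors - center) ** 2).sum(dim=-1) / (2 * 0.3 ** 2)

pred_torsions = torch.randn(50, 2, requires_grad=True)   # placeholder predictions
penalty = torsion_prior_penalty(pred_torsions, toy_log_density)
penalty.backward()   # gradients can flow back into the structure model during training
```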
Closed-loop design: The performance of DL methods relies heavily on data quality, but publicly available datasets may not cover important regions of sample space because of what was experimentally accessible when the data were collected. Furthermore, datasets may contain harmful noise from non-uniform experimental protocols and conditions. A possible solution is to combine model training with experimental data generation: one may devise a closed-loop strategy that generates experimental data on the fly for queries (model inputs) that are most likely to improve the model, and updates the training dataset with the newly generated data (a schematic loop is sketched below). For such a strategy to be feasible, automated synthesis and characterization are necessary. Since high-throughput synthesis and testing of proteins (or DNA and RNA) can be carried out in parallel, automation is possible; while such a strategy may seem far-fetched, automated platforms such as those from Ginkgo Bioworks or Transcriptic are already on the market.
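A schematic version of such a loop, under obvious assumptions (the trainer, uncertainty score, and robotic `run_experiments` callables are hypothetical stand-ins), might look like:

```python
import random

def closed_loop(train, uncertainty, run_experiments, pool, rounds=5, batch=96):
    """Closed-loop training: each round queries the candidates the current model
    is least certain about, measures them, and retrains on the accumulated data."""
    data, model = [], None
    for _ in range(rounds):
        ranked = sorted(pool, key=lambda c: uncertainty(model, c), reverse=True)
        chosen, pool = ranked[:batch], ranked[batch:]
        data.extend(run_experiments(chosen))          # robotic synthesis + assay
        model = train(data)                           # retrain on everything so far
    return model, data

# Toy demo with stand-in callables.
model, data = closed_loop(
    train=lambda d: ("model", len(d)),
    uncertainty=lambda m, c: random.random(),
    run_experiments=lambda cs: [(c, random.random()) for c in cs],
    pool=[f"seq{i}" for i in range(1000)],
)
```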
Reinforcement learning: Another approach to overcoming limited data availability is reinforcement learning (RL), in which biologically meaningful data may be generated on the fly in simulated environments such as the Foldit game. In the most famous application of RL, AlphaGo Zero, an agent learned and mastered the game from the game environment alone. There are already examples of RL in chemistry and electrical engineering, used to optimize organic molecules or the placement of computational chips. One suitable protein modeling problem for an RL algorithm would be training an agent to make a series of "moves" to fold a protein, similar to the Foldit game; a schematic policy-gradient loop for such an agent is sketched below.
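This is a purely schematic REINFORCE-style loop; `ToyFoldEnv` is a hypothetical stand-in whose random rewards would in practice come from, e.g., an energy improvement or similarity to a target structure:

```python
import torch
import torch.nn as nn

class ToyFoldEnv:
    """Toy stand-in for a Foldit-like environment: 'moves' would perturb torsions
    and the reward is a random proxy for an energy improvement."""
    def __init__(self, n_steps=20):
        self.n_steps = n_steps
    def reset(self):
        self.t = 0
        return torch.zeros(8)                 # placeholder state features
    def step(self, action):
        self.t += 1
        return torch.randn(8), float(torch.randn(())), self.t >= self.n_steps

policy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))  # 4 "moves"
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_episode(env, gamma=0.99):
    """One REINFORCE episode: sample moves, collect rewards, then take a
    policy-gradient step on the discounted returns."""
    state, log_probs, rewards, done = env.reset(), [], [], False
    while not done:
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
    returns, g = [], 0.0
    for r in reversed(rewards):               # discounted return for each step
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return sum(rewards)

total_reward = reinforce_episode(ToyFoldEnv())
```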
Model interpretability: One should keep in mind that a neural network represents nothing more (and nothing less) than a powerful and flexible regression model. Additionally, because of their deeply nested structure, neural networks tend to be regarded as "black boxes", i.e., too complicated for practitioners to understand the learned parameters and functions. Although model interpretability in ML is a rapidly developing field, many popular approaches, such as saliency analysis for image classification models, remain far from satisfactory. Other approaches offer more reliable interpretations, but their application to DL models in protein modeling has been largely missing. As a result, current DL models offer limited understanding of the complex patterns they learn.
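For concreteness, the simplest form of saliency analysis (vanilla gradients) applied to a sequence classifier is sketched below; the toy model and one-hot input are placeholders:

```python
import torch
import torch.nn as nn

def saliency(model, one_hot_seq, target_class):
    """Vanilla gradient saliency: the magnitude of d(score)/d(input) for each
    sequence position and amino-acid channel."""
    x = one_hot_seq.clone().requires_grad_(True)
    model(x)[0, target_class].backward()
    return x.grad.abs().squeeze(0)             # (L, 20) attribution map

# Toy demo: a random linear classifier over a one-hot 30-residue sequence.
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(30 * 20, 2))
seq = torch.zeros(1, 30, 20)
seq[0, torch.arange(30), torch.randint(0, 20, (30,))] = 1.0
attributions = saliency(toy_model, seq, target_class=0)
```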
The "sequence → structure → function" paradigm: We know from molecular biophysics that a sequence translates into function through the physical intermediary of a three-dimensional molecular structure. Allosteric proteins, for instance, may adopt different structural conformations under different physiological conditions (e.g., pH) or environmental stimuli (e.g., small molecules, inhibitors), reminding us that context is as important as protein sequence. That is, despite Anfinsen's hypothesis, sequence alone does not always fully determine structure. Some proteins require chaperones to fold to their native structure, meaning that a sequence can end up in non-native conformations when the kinetics of folding to the native state are unfavorable in the absence of a chaperone. Since many powerful DL algorithms in NLP operate on sequential data, it may seem reasonable to train DL models on protein sequences alone, and in principle, with a suitable framework and training, DL could disentangle the underlying relationships between sequence and structural elements. However, a careful selection of DL frameworks that are structure- or mechanism-aware will accelerate learning and improve predictive power. Indeed, many successful DL frameworks applied so far (e.g., convolutional neural networks or graph convolutional neural networks) factor in the importance of learning on structural information.

Finally, with the hope of gaining insight into the fundamental science of biomolecules, there is a desire to link artificial intelligence (AI) approaches to the underlying biochemical and biophysical principles that drive biomolecular function. For more practical purposes, a deeper understanding of the underlying principles and hidden patterns that lead to pathology is important in the development of therapeutics. Thus, while efforts strictly limited to sequences are abundant, we believe that models with structural insights will play a more critical role in the future.

Acknowledgement
This work was supported by the National Institutes of Health through grant R01-GM078221.
References
(1) Slabinski, L.; Jaroszewski, L.; Rodrigues, A. P.; Rychlewski, L.; Wilson, I. A.; Lesley, S. A.; Godzik, A. The challenge of protein structure determination: lessons from structural genomics. Protein Science, 2472–2482.
(2) Markwick, P. R. L.; Malliavin, T.; Nilges, M. Structural Biology by NMR: Structure, Dynamics, and Interactions. PLoS Computational Biology, e1000168.
(3) Jonic, S.; Vénien-Bryan, C. Protein structure determination by electron cryo-microscopy. Current Opinion in Pharmacology, 636–642.
(4) Kryshtafovych, A.; Schwede, T.; Topf, M.; Fidelis, K.; Moult, J. Critical assessment of methods of protein structure prediction (CASP)—Round XIII. Proteins: Structure, Function, and Bioinformatics, 1011–1020.
(5) Hollingsworth, S. A.; Dror, R. O. Molecular dynamics simulation for all. Neuron, 1129–1143.
(6) Ranjan, A.; Fahad, M. S.; Fernandez-Baca, D.; Deepak, A.; Tripathi, S. Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1–1.
(7) Huang, P. S.; Boyken, S. E.; Baker, D. The coming of age of de novo protein design. Nature, 320–327.
(8) Yang, K. K.; Wu, Z.; Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nature Methods, 687–694.
(9) Bohr, H.; Bohr, J.; Brunak, S.; Cotterill, R. M. J.; Fredholm, H.; Lautrup, B.; Petersen, S. A novel approach to prediction of the 3-dimensional structures of protein backbones by neural networks. FEBS Letters, 43–46.
(10) Schneider, G.; Wrede, P. The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophysical Journal, 335–344.
(11) Schneider, G.; Schrödl, W.; Wallukat, G.; Müller, J.; Nissen, E.; Rönspeck, W.; Wrede, P.; Kunze, R. Peptide design by artificial neural networks and computer-based evolutionary search. Proceedings of the National Academy of Sciences of the United States of America, 12179–12184.
(12) Ofran, Y.; Rost, B. Predicted protein–protein interaction sites from local sequence information. FEBS Letters, 236–239.
(13) Nielsen, M.; Lundegaard, C.; Worning, P.; Lauemøller, S. L.; Lamberth, K.; Buus, S.; Brunak, S.; Lund, O. Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Science, 1007–1017.
(14) LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature, 436–444.
(15) Angermueller, C.; Pärnamaa, T.; Parts, L.; Stegle, O. Deep learning for computational biology. Molecular Systems Biology, 878.
(16) Ching, T.; Himmelstein, D. S.; Beaulieu-Jones, B. K.; Kalinin, A. A.; Do, B. T.; Way, G. P.; Ferrero, E.; Agapow, P.-M.; Zietz, M.; Hoffman, M. M., et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 20170387.
(17) Mura, C.; Draizen, E. J.; Bourne, P. E. Structural biology meets data science: Does anything change? Current Opinion in Structural Biology, 95–102.
(18) Noé, F.; De Fabritiis, G.; Clementi, C. Machine learning for protein folding and dynamics. Current Opinion in Structural Biology, 77–84.
(19) Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M. S. Deep learning for visual understanding: A review. Neurocomputing, 27–48.
(20) Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 55–75.
(21) Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 354.
(22) Senior, A. W. et al. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins, 1141–1148.
(23) Ingraham, J.; Garg, V.; Barzilay, R.; Jaakkola, T. Generative models for graph-based protein design. Advances in Neural Information Processing Systems, 15820–15831.
(24) Anand, N.; Huang, P. Generative modeling for protein structures. Advances in Neural Information Processing Systems, 7494–7505.
(25) O'Connell, J.; Li, Z.; Hanson, J.; Heffernan, R.; Lyons, J.; Paliwal, K.; Dehzangi, A.; Yang, Y.; Zhou, Y. SPIN2: Predicting sequence profiles from protein structures using deep neural networks. Proteins: Structure, Function, and Bioinformatics, 629–633.
(26) Senior, A. W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A. W.; Bridgland, A., et al. Improved protein structure prediction using potentials from deep learning. Nature, 1–5.
(27) Li, Y.; Huang, C.; Ding, L.; Li, Z.; Pan, Y.; Gao, X. Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods, 4–21.
(28) Noé, F.; Tkatchenko, A.; Müller, K.-R.; Clementi, C. Machine learning for molecular simulation. Annual Review of Physical Chemistry, 361–390.
(29) Graves, J.; Byerly, J.; Priego, E.; Makkapati, N.; Parish, S. V.; Medellin, B.; Berrondo, M. A Review of Deep Learning Methods for Antibodies. Antibodies, 12.
(30) Kandathil, S. M.; Greener, J. G.; Jones, D. T. Recent developments in deep learning applied to protein structure prediction. Proteins: Structure, Function, and Bioinformatics, 1179–1189.
(31) Le, Q.; Torrisi, M.; Pollastri, G. Deep learning methods in protein structure prediction. Computational and Structural Biotechnology Journal, 1301–1310.
(32) Pauling, L.; Niemann, C. The structure of proteins. Journal of the American Chemical Society, 1860–1867.
(33) Kuhlman, B.; Bradley, P. Advances in protein structure prediction and design. Nature Reviews Molecular Cell Biology, 681–697.
(34) UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research, D506–D515.
(35) Kuhlman, B.; Dantas, G.; Ireton, G. C.; Varani, G.; Stoddard, B. L.; Baker, D. Design of a novel globular protein fold with atomic-level accuracy. Science, 1364–1368.
(36) Donnelly, A. E.; Murphy, G. S.; Digianantonio, K. M.; Hecht, M. H. A de novo enzyme catalyzes a life-sustaining reaction in Escherichia coli. Nature Chemical Biology, 253.
(37) Correia, B. E.; Bates, J. T.; Loomis, R. J.; Baneyx, G.; Carrico, C.; Jardine, J. G.; Rupert, P.; Correnti, C.; Kalyuzhniy, O.; Vittal, V., et al. Proof of principle for epitope-focused vaccine design. Nature, 201.
(38) King, N. P.; Sheffler, W.; Sawaya, M. R.; Vollmar, B. S.; Sumida, J. P.; André, I.; Gonen, T.; Yeates, T. O.; Baker, D. Computational design of self-assembling protein nanomaterials with atomic level accuracy. Science, 1171–1174.
(39) Tinberg, C. E.; Khare, S. D.; Dou, J.; Doyle, L.; Nelson, J. W.; Schena, A.; Jankowski, W.; Kalodimos, C. G.; Johnsson, K.; Stoddard, B. L., et al. Computational design of ligand-binding proteins with high affinity and selectivity. Nature, 212–216.
(40) Joh, N. H.; Wang, T.; Bhate, M. P.; Acharya, R.; Wu, Y.; Grabe, M.; Hong, M.; Grigoryan, G.; DeGrado, W. F. De novo design of a transmembrane Zn2+-transporting four-helix bundle. Science, 1520–1524.
(41) Anfinsen, C. B. Principles that govern the folding of protein chains. Science, 223–230.
(42) Levinthal, C. Are there pathways for protein folding? Journal de Chimie Physique, 44–45.
(43) Li, B.; Fooksa, M.; Heinze, S.; Meiler, J. Finding the needle in the haystack: towards solving the protein-folding problem computationally. Critical Reviews in Biochemistry and Molecular Biology, 1–28.
(44) LeCun, Y.; Boser, B. E.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W. E.; Jackel, L. D. Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, 396–404.
(45) He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
(46) Jordan, M. I. Serial order: A parallel distributed processing approach. Advances in Psychology, 471–495.
(47) Müller, A. T.; Hiss, J. A.; Schneider, G. Recurrent Neural Network Model for Constructive Peptide Design. Journal of Chemical Information and Modeling, 472–479.
(48) Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Computation, 1735–1780.
(49) Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint, 1406.1078.
(50) Hinton, G. E.; Zemel, R. S. Autoencoders, minimum description length and Helmholtz free energy. Advances in Neural Information Processing Systems, 3–10.
(51) Kingma, D. P.; Welling, M. Auto-encoding variational bayes. arXiv preprint, 1312.6114.
(52) Kingma, D. P.; Welling, M. An introduction to variational autoencoders. arXiv preprint, 1906.02691.
(53) Blei, D. M.; Kucukelbir, A.; McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 859–877.
(54) Das, P.; Wadhawan, K.; Chang, O.; Sercu, T.; Santos, C. D.; Riemer, M.; Chenthamarakshan, V.; Padhi, I.; Mojsilovic, A. PepCVAE: Semi-supervised targeted design of antimicrobial peptide sequences. arXiv preprint, 1810.07743.
(55) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 2672–2680.
(56) Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv preprint, 1701.07875.
(57) Kurach, K.; Lucic, M.; Zhai, X.; Michalski, M.; Gelly, S. A large-scale study on regularization and normalization in GANs. arXiv preprint, 1807.04720.
(58) Anand, N.; Eguchi, R.; Huang, P.-S. Fully differentiable full-atom protein backbone generation. ICLR Workshop, 2019.
(59) Niepert, M.; Ahmed, M.; Kutzkov, K. Learning convolutional neural networks for graphs. International Conference on Machine Learning, 2014–2023.
(60) Luo, F.; Wang, M.; Liu, Y.; Zhao, X.-M.; Li, A. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics.
(61) Yang, C.; Li, C.; Nip, K. M.; Warren, R. L.; Birol, I. Terminitor: Cleavage Site Prediction Using Deep Learning Models. bioRxiv, 710699.
(62) Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1798–1828.
(63) Romero, P. A.; Krause, A.; Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proceedings of the National Academy of Sciences, E193–E201.
(64) Bedbrook, C. N.; Yang, K. K.; Rice, A. J.; Gradinaru, V.; Arnold, F. H. Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLoS Computational Biology, e1005786.
(65) Ofer, D.; Linial, M. ProFET: Feature engineering captures high-level protein functions. Bioinformatics, 3429–3436.
(66) Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Research, D202–D205.
(67) Wang, S.; Peng, J.; Ma, J.; Xu, J. Protein secondary structure prediction using deep convolutional neural fields. Scientific Reports, 18962.
(68) Drori, I.; Thaker, D.; Srivatsa, A.; Jeong, D.; Wang, Y.; Nan, L.; Wu, F.; Leggas, D.; Lei, J.; Lu, W., et al. Accurate Protein Structure Prediction by Embeddings and Deep Learning Representations. arXiv preprint, 1911.05531.
(69) Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv preprint, 1301.3781.
(70) Le, Q.; Mikolov, T. Distributed representations of sentences and documents. International Conference on Machine Learning, 1188–1196.
(71) Asgari, E.; Mofrad, M. R. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, e0141287.
(72) El-Gebali, S.; Mistry, J.; Bateman, A.; Eddy, S. R.; Luciani, A.; Potter, S. C.; Qureshi, M.; Richardson, L. J.; Salazar, G. A.; Smart, A., et al. The Pfam protein families database in 2019. Nucleic Acids Research, D427–D432.
(73) Cai, C.; Han, L.; Ji, Z. L.; Chen, X.; Chen, Y. Z. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Research, 3692–3697.
(74) Aragues, R.; Sali, A.; Bonet, J.; Marti-Renom, M. A.; Oliva, B. Characterization of protein hubs by inferring interacting motifs from protein interactions. PLoS Computational Biology, e178.
(75) Kimothi, D.; Soni, A.; Biyani, P.; Hogan, J. M. Distributed representations for biological sequence analysis. arXiv preprint, 1608.05949.
(76) Yang, K. K.; Wu, Z.; Bedbrook, C. N.; Arnold, F. H. Learned protein embeddings for machine learning. Bioinformatics, 2642–2648.
(77) Alley, E. C.; Khimulya, G.; Biswas, S.; AlQuraishi, M.; Church, G. M. Unified rational protein engineering with sequence-only deep representation learning. bioRxiv, 589333.
(78) Krause, B.; Lu, L.; Murray, I.; Renals, S. Multiplicative LSTM for sequence modelling. arXiv preprint, 1609.07959.
(79) Ding, X.; Zou, Z.; Brooks III, C. L. Deciphering protein evolution and fitness landscapes with latent space models. Nature Communications, 1–13.
(80) Sinai, S.; Kelsic, E.; Church, G. M.; Nowak, M. A. Variational auto-encoding of protein sequences. arXiv preprint, 1712.03346.
(81) Riesselman, A. J.; Ingraham, J. B.; Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 816–822.
(82) Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, P.; Canny, J.; Abbeel, P.; Song, Y. Evaluating protein transfer learning with TAPE. Advances in Neural Information Processing Systems, 9689–9701.
(83) Townshend, R.; Bedi, R.; Dror, R. O. Generalizable protein interface prediction with end-to-end learning. arXiv preprint, 1807.01297.
(84) Simonovsky, M.; Meyers, J. DeeplyTough: Learning Structural Comparison of Protein Binding Sites. Journal of Chemical Information and Modeling, 2356–2366.
(85) AlQuraishi, M. End-to-End Differentiable Learning of Protein Structure. Cell Systems, 292–301.e3.
(86) Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Computational Biology, e1005324.
(87) Yang, J.; Anishchenko, I.; Park, H.; Peng, Z.; Ovchinnikov, S.; Baker, D. Improved protein structure prediction using predicted inter-residue orientations. bioRxiv, 846279.
(88) Brunger, A. T. Version 1.2 of the Crystallography and NMR system. Nature Protocols, 2728.
(89) Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; Sun, M. Graph neural networks: A review of methods and applications. arXiv preprint, 1812.08434.
(90) Ahmed, E.; Saint, A.; Shabayek, A.; Cherenkova, K.; Das, R.; Gusev, G.; Aouada, D.; Ottersten, B. Deep learning advances on different 3D data representations: A survey. arXiv preprint, 1808.01462.
(91) Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P. S. A comprehensive survey on graph neural networks. arXiv preprint, 1901.00596.
(92) Vishveshwara, S.; Brinda, K.; Kannan, N. Protein structure: insights from graph theory. Journal of Theoretical and Computational Chemistry, 187–211.
(93) Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; Leskovec, J. Hierarchical graph representation learning with differentiable pooling. Advances in Neural Information Processing Systems, 4800–4810.
(94) Borgwardt, K. M.; Ong, C. S.; Schönauer, S.; Vishwanathan, S.; Smola, A. J.; Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, i47–i56.
(95) Dobson, P. D.; Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 771–783.
(96) Fout, A.; Byrd, J.; Shariat, B.; Ben-Hur, A. Protein interface prediction using graph convolutional networks. Advances in Neural Information Processing Systems, 6530–6539.
(97) Bailey-Kellogg, C.; Pittala, S. Learning Context-aware Structural Representations to Predict Antigen and Antibody Binding Interfaces. bioRxiv, 658054.
(98) Zamora-Resendiz, R.; Crivelli, S. Structural Learning of Proteins Using Graph Convolutional Neural Networks. bioRxiv, 610444.
(99) Gligorijevic, V.; Renfrew, P. D.; Kosciolek, T.; Leman, J. K.; Cho, K.; Vatanen, T.; Berenberg, D.; Taylor, B. C.; Fisk, I. M.; Xavier, R. J., et al. Structure-Based Function Prediction using Graph Convolutional Networks. bioRxiv, 786236.
(100) Torng, W.; Altman, R. B. Graph Convolutional Neural Networks for Predicting Drug-Target Interactions. Journal of Chemical Information and Modeling, 4131–4149.
(101) Gainza, P.; Sverrisson, F.; Monti, F.; Rodola, E.; Bronstein, M. M.; Correia, B. E. Deciphering interaction fingerprints from protein molecular surfaces. bioRxiv, 606202.
(102) Bronstein, M. M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 18–42.
(103) Nerenberg, P. S.; Head-Gordon, T. New developments in force fields for biomolecular simulations. Current Opinion in Structural Biology, 129–138.
(104) Derevyanko, G.; Grudinin, S.; Bengio, Y.; Lamoureux, G. Deep convolutional networks for quality assessment of protein folds. Bioinformatics, 4046–4053.
(105) Best, R. B.; Zhu, X.; Shim, J.; Lopes, P. E.; Mittal, J.; Feig, M.; MacKerell Jr, A. D. Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone φ, ψ and side-chain χ1 and χ2 dihedral angles. Journal of Chemical Theory and Computation, 3257–3273.
(106) Weiner, S. J.; Kollman, P. A.; Case, D. A.; Singh, U. C.; Ghio, C.; Alagona, G.; Profeta, S.; Weiner, P. A new force field for molecular mechanical simulation of nucleic acids and proteins. Journal of the American Chemical Society, 765–784.
(107) Alford, R. F.; Leaver-Fay, A.; Jeliazkov, J. R.; O'Meara, M. J.; DiMaio, F. P.; Park, H.; Shapovalov, M. V.; Renfrew, P. D.; Mulligan, V. K.; Kappel, K., et al. The Rosetta all-atom energy function for macromolecular modeling and design. Journal of Chemical Theory and Computation, 3031–3048.
(108) Behler, J.; Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical Review Letters, 146401.
(109) Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 3192–3203.
(110) Smith, J. S.; Nebgen, B.; Lubbers, N.; Isayev, O.; Roitberg, A. E. Less is more: Sampling chemical space with active learning. The Journal of Chemical Physics, 241733.
(111) Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 1–8.
(112) Schütt, K. T.; Sauceda, H. E.; Kindermans, P.-J.; Tkatchenko, A.; Müller, K.-R. SchNet–A deep learning architecture for molecules and materials. The Journal of Chemical Physics, 241722.
(113) Zhang, L.; Han, J.; Wang, H.; Car, R.; Weinan, E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Physical Review Letters, 143001.
(114) Unke, O. T.; Meuwly, M. PhysNet: A neural network for predicting energies, forces, dipole moments, and partial charges. Journal of Chemical Theory and Computation, 3678–3693.
(115) Zubatyuk, R.; Smith, J. S.; Leszczynski, J.; Isayev, O. Accurate and transferable multitask prediction of chemical properties with an atoms-in-molecules neural network. Science Advances, eaav6490.
(116) Lahey, S.-L. J.; Rowley, C. N. Simulating protein–ligand binding with neural network potentials. Chemical Science, 2362–2368.
(117) Wang, Z.; Han, Y.; Li, J.; He, X. Combining the Fragmentation Approach and Neural Network Potential Energy Surfaces of Fragments for Accurate Calculation of Protein Energy. The Journal of Physical Chemistry B, 3027–3035.
(118) Senn, H. M.; Thiel, W. QM/MM methods for biomolecular systems. Angewandte Chemie International Edition, 1198–1229.
(119) Kmiecik, S.; Gront, D.; Kolinski, M.; Wieteska, L.; Dawid, A. E.; Kolinski, A. Coarse-grained protein models and their applications. Chemical Reviews, 7898–7936.
(120) Zhang, L.; Han, J.; Wang, H.; Car, R.; E, W. DeePCG: Constructing coarse-grained models via deep neural networks. The Journal of Chemical Physics, 034101.
(121) Patra, T. K.; Loeffler, T. D.; Chan, H.; Cherukara, M. J.; Narayanan, B.; Sankaranarayanan, S. K. A coarse-grained deep neural network model for liquid water. Applied Physics Letters, 193101.
(122) Wang, J.; Olsson, S.; Wehmeyer, C.; Pérez, A.; Charron, N. E.; De Fabritiis, G.; Noé, F.; Clementi, C. Machine learning of coarse-grained molecular dynamics force fields. ACS Central Science, 755–767.
(123) Wang, W.; Gómez-Bombarelli, R. Learning Coarse-Grained Particle Latent Space with Auto-Encoders. Advances in Neural Information Processing Systems.
(124) Marks, D. S.; Colwell, L. J.; Sheridan, R.; Hopf, T. A.; Pagnani, A.; Zecchina, R.; Sander, C. Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE, e28766.
(125) Ma, J.; Wang, S.; Wang, Z.; Xu, J. Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning. Bioinformatics, 3506–3513.
(126) Skwark, M. J.; Raimondi, D.; Michel, M.; Elofsson, A. Improved contact predictions using the recognition of protein like contact patterns. PLoS Computational Biology, e1003889.
(127) Fariselli, P.; Olmea, O.; Valencia, A.; Casadio, R. Prediction of contact maps with neural networks and correlated mutations. Protein Engineering, 835–843.
(128) Horner, D. S.; Pirovano, W.; Pesole, G. Correlated substitution analysis and the prediction of amino acid structural contacts. Briefings in Bioinformatics, 46–56.
(129) Jones, D. T.; Singh, T.; Kosciolek, T.; Tetchner, S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics, 999–1006.
(130) Monastyrskyy, B.; d'Andrea, D.; Fidelis, K.; Tramontano, A.; Kryshtafovych, A. Evaluation of residue–residue contact prediction in CASP10. Proteins: Structure, Function, and Bioinformatics, 138–153.
(131) Xu, J.; Wang, S. Analysis of distance-based protein structure prediction by deep learning in CASP13. Proteins: Structure, Function, and Bioinformatics, 1069–1081.
(132) Moult, J.; Fidelis, K.; Kryshtafovych, A.; Schwede, T.; Tramontano, A. Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins: Structure, Function, and Bioinformatics, 7–15.
(133) Wang, S.; Li, W.; Liu, S.; Xu, J. RaptorX-Property: a web server for protein structure property prediction. Nucleic Acids Research, W430–W435.
(134) Seemayer, S.; Gruber, M.; Söding, J. CCMpred - Fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics, 3128–3130.
(135) Xu, J. Distance-based protein folding powered by deep learning. Proceedings of the National Academy of Sciences, 16856–16865.
(136) Gao, Y.; Wang, S.; Deng, M.; Xu, J. RaptorX-Angle: real-value prediction of protein backbone dihedral angles through a hybrid method of clustering and deep learning. BMC Bioinformatics, 100.
(137) AlQuraishi, M. AlphaFold at CASP13. Bioinformatics, 4862–4865.
(138) Zemla, A.; Venclovas, Č.; Moult, J.; Fidelis, K. Processing and analysis of CASP3 protein structure predictions. Proteins: Structure, Function and Genetics, 22–29.
(139) Kingma, D. P.; Mohamed, S.; Rezende, D. J.; Welling, M. Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems, 3581–3589.
(140) Zeng, H.; Wang, S.; Zhou, T.; Zhao, F.; Li, X.; Wu, Q.; Xu, J. ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Research, W432–W437.
(141) Li, Z.; Wang, S.; Yu, Y.; Xu, J. Predicting membrane protein contacts from non-membrane proteins by deep transfer learning. arXiv preprint, 1704.07207.
(142) Tsirigos, K. D.; Peters, C.; Shu, N.; Käll, L.; Elofsson, A. The TOPCONS web server for consensus prediction of membrane protein topology and signal peptides. Nucleic Acids Research, W401–W407.
(143) Alford, R. F.; Gray, J. J. Diverse scientific benchmarks reveal optimization imperatives for implicit membrane energy functions. bioRxiv, 168021.
(144) Stein, A.; Kortemme, T. Improvements to robotics-inspired conformational sampling in Rosetta. PLoS ONE, e63090.
(145) Ruffolo, J. A.; Guerra, C.; Mahajan, S. P.; Sulam, J.; Gray, J. J. Geometric Potentials from Deep Learning Improve Prediction of CDR H3 Loop Structures. bioRxiv, 940254.
(146) Nguyen, S. P.; Li, Z.; Xu, D.; Shang, Y. New deep learning methods for protein loop modeling. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 596–606.
(147) Li, Z.; Nguyen, S. P.; Xu, D.; Shang, Y. Protein Loop Modeling Using Deep Generative Adversarial Network. 1085–1091.
(148) Waghu, F. H.; Gopi, L.; Barai, R. S.; Ramteke, P.; Nizami, B.; Idicula-Thomas, S. CAMP: Collection of sequences and structures of antimicrobial peptides. Nucleic Acids Research, D1154–D1158.
(149) Grisoni, F.; Neuhaus, C. S.; Gabernet, G.; Müller, A. T.; Hiss, J. A.; Schneider, G. Designing anticancer peptides by constructive machine learning. ChemMedChem, 1300–1302.
(150) Riesselman, A. J.; Shin, J.-E.; Kollasch, A. W.; McMahon, C.; Simon, E.; Sander, C.; Manglik, A.; Kruse, A. C.; Marks, D. S. Accelerating Protein Design Using Autoregressive Generative Models. bioRxiv, 757252.
(151) Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv preprint, 1511.07122.
(152) Killoran, N.; Lee, L. J.; Delong, A.; Duvenaud, D.; Frey, B. J. Generating and designing DNA with deep generative models. arXiv preprint, 1712.06148.
(153) Repecka, D.; Jauniskis, V.; Karpus, L.; Rembeza, E.; Zrimec, J.; Poviloniene, S.; Rokaitis, I.; Laurynenas, A.; Abuajwa, W.; Savolainen, O., et al. Expanding functional protein sequence space using generative adversarial networks. bioRxiv, 789719.
(154) Gupta, A.; Zou, J. Feedback GAN (FBGAN) for DNA: a Novel Feedback-Loop Architecture for Optimizing Protein Functions. arXiv preprint, 1804.01694.
(155) Ramachandran, G. N. Stereochemistry of polypeptide chain configurations. Journal of Molecular Biology, 95–99.
(156) Kuhlman, B.; Baker, D. Native protein sequences are close to optimal for their structures. Proceedings of the National Academy of Sciences, 10383–10388.
(157) Li, Z.; Yang, Y.; Faraggi, E.; Zhan, J.; Zhou, Y. Direct prediction of profiles of sequences compatible with a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles. Proteins: Structure, Function, and Bioinformatics, 2565–2573.
(158) Wang, J.; Cao, H.; Zhang, J. Z.; Qi, Y. Computational protein design with deep learning neural networks. Scientific Reports, 6349.
(159) O'Connell, J.; Li, Z.; Hanson, J.; Heffernan, R.; Lyons, J.; Paliwal, K.; Dehzangi, A.; Yang, Y.; Zhou, Y. SPIN2: Predicting sequence profiles from protein structures using deep neural networks. Proteins: Structure, Function, and Bioinformatics, 629–633.
(160) Zhang, Y.; Chen, Y.; Wang, C.; Lo, C.-C.; Liu, X.; Wu, W.; Zhang, J. ProDCoNN: Protein Design using a Convolutional Neural Network. Proteins: Structure, Function, and Bioinformatics, 819–829.
(161) Shroff, R.; Cole, A. W.; Morrow, B. R.; Diaz, D. J.; Donnell, I.; Gollihar, J.; Ellington, A. D.; Thyer, R. A structure-based deep learning framework for protein engineering. bioRxiv, 833905.
(162) Anand, N.; Eguchi, R. R.; Derry, A.; Altman, R. B.; Huang, P. Protein sequence design with a learned potential. bioRxiv, 895466.
(163) Greener, J. G.; Moffat, L.; Jones, D. T. Design of metalloproteins and novel protein folds using variational autoencoders. Scientific Reports, 1–12.
(164) Taylor, W. R. A "periodic table" for protein structures. Nature, 657–660.
(165) Karimi, M.; Zhu, S.; Cao, Y.; Shen, Y. De Novo Protein Design for Novel Folds using Guided Conditional Wasserstein Generative Adversarial Networks (gcWGAN). bioRxiv, 769919.
(166) Hou, J.; Adhikari, B.; Cheng, J. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics, 1295–1303.
(167) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 5999–6009.
(168) Jelinek, F.; Mercer, R. L.; Bahl, L. R.; Baker, J. K. Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, S63–S63.
(169) Sutton, R. S.; Barto, A. G. Reinforcement learning: An introduction; MIT Press, 2018.
(170) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. 248–255.
(171) Mayr, A.; Klambauer, G.; Unterthiner, T.; Hochreiter, S. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science, 80.
(172) Brown, N.; Fiscato, M.; Segler, M. H.; Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. Journal of Chemical Information and Modeling, 1096–1108.
(173) Lutter, M.; Ritter, C.; Peters, J. Deep lagrangian networks: Using physics as model prior for deep learning. arXiv preprint, 1907.04490.
(174) Greydanus, S.; Dzamba, M.; Yosinski, J. Hamiltonian Neural Networks. arXiv preprint, 1906.01563.
(175) Raissi, M.; Perdikaris, P.; Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 686–707.
(176) Zepeda-Núñez, L.; Chen, Y.; Zhang, J.; Jia, W.; Zhang, L.; Lin, L. Deep Density: circumventing the Kohn-Sham equations via symmetry preserving neural networks. arXiv preprint, 1912.00775.
(177) Han, J.; Li, Y.; Lin, L.; Lu, J.; Zhang, J.; Zhang, L. Universal approximation of symmetric and anti-symmetric functions. arXiv preprint, 1912.01765.
(178) Shapovalov, M. V.; Dunbrack Jr, R. L. A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure, 844–858.
(179) Hintze, B. J.; Lewis, S. M.; Richardson, J. S.; Richardson, D. C. MolProbity's ultimate rotamer-library distributions for model validation. Proteins: Structure, Function, and Bioinformatics, 1177–1189.
(180) Jensen, K. F.; Coley, C. W.; Eyke, N. S. Autonomous discovery in the chemical sciences part I: Progress. Angewandte Chemie International Edition, 2–38.
(181) Coley, C. W.; Eyke, N. S.; Jensen, K. F. Autonomous discovery in the chemical sciences part II: Outlook. Angewandte Chemie International Edition, 2–25.
(182) Coley, C. W.; Thomas, D. A.; Lummiss, J. A.; Jaworski, J. N.; Breen, C. P.; Schultz, V.; Hart, T.; Fishman, J. S.; Rogers, L.; Gao, H., et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science, eaax1566.
(183) Barrett, R.; White, A. D. Iterative Peptide Modeling With Active Learning And Meta-Learning. arXiv preprint, 1911.09103.
(184) You, J.; Liu, B.; Ying, R.; Pande, V.; Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems, 6410–6421.
(185) Zhou, Z.; Kearnes, S.; Li, L.; Zare, R. N.; Riley, P. Optimization of molecules via deep reinforcement learning. Scientific Reports, 1–10.
(186) Mirhoseini, A.; Goldie, A.; Yazgan, M.; Jiang, J.; Songhori, E.; Wang, S.; Lee, Y.-J.; Johnson, E.; Pathak, O.; Bae, S., et al. Chip Placement with Deep Reinforcement Learning. arXiv preprint, 2004.10746.
(187) Cooper, S.; Khatib, F.; Treuille, A.; Barbero, J.; Lee, J.; Beenen, M.; Leaver-Fay, A.; Baker, D.; Popović, Z., et al. Predicting protein structures with a multiplayer online game. Nature, 756–760.
(188) Koepnick, B.; Flatten, J.; Husain, T.; Ford, A.; Silva, D.-A.; Bick, M. J.; Bauer, A.; Liu, G.; Ishida, Y.; Boykov, A., et al. De novo protein design by citizen scientists. Nature, 390–394.
(189) Zeiler, M. D.; Fergus, R. Visualizing and understanding convolutional networks. European Conference on Computer Vision, 818–833.
(190) Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F.; Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint, 1706.03825.
(191) Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. Proceedings of the 34th International Conference on Machine Learning, 3319–3328.
(192) Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity checks for saliency maps. Advances in Neural Information Processing Systems, 9505–9515.
(193) Shrikumar, A.; Greenside, P.; Kundaje, A. Learning Important Features Through Propagating Activation Differences. arXiv preprint, 1704.02685.
(194) Lundberg, S. M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems, 4768–4777.
(195) Langan, R. A.; Boyken, S. E.; Ng, A. H.; Samson, J. A.; Dods, G.; Westbrook, A. M.; Nguyen, T. H.; Lajoie, M. J.; Chen, Z.; Berger, S., et al. De novo design of bioactive protein switches. Nature, 205–210.
(196) Jones, D. T.; Buchan, D. W.; Cozzetto, D.; Pontil, M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics, 184–190.
(197) Lena, P. D.; Nagata, K.; Baldi, P. Deep architectures for protein contact map prediction. Bioinformatics, 2449–2457.
(198) Eickholt, J.; Cheng, J. Predicting protein residue–residue contacts using deep networks and boosting. Bioinformatics, 3066–3072.
(199) Jones, D. T.; Kandathil, S. M. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics, 3308–3315.
(200) Hanson, J.; Paliwal, K.; Litfin, T.; Yang, Y.; Zhou, Y. Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks. Bioinformatics.
(201) Kandathil, S. M.; Greener, J. G.; Jones, D. T. Prediction of interresidue contacts with DeepMetaPSICOV in CASP13. 1092–1099.
(202) Hou, J.; Wu, T.; Cao, R. Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. 1165–1178.
(203) Zheng, W.; Li, Y.; Zhang, C.; Pearce, R.; Mortuza, S. M.; Zhang, Y. Deep-learning contact-map guided protein structure prediction in CASP13. 1149–1164.
(204) Wu, Q.; Peng, Z.; Anishchenko, I.; Cong, Q.; Baker, D.; Yang, J. Protein contact prediction using metagenome sequence data and residual neural networks. Bioinformatics, 41–48.
(205) Brookes, D. H.; Listgarten, J. Design by adaptive sampling.
(206) Yu, C.-H.; Qin, Z.; Martin-Martinez, F. J.; Buehler, M. J. A Self-Consistent Sonification Method to Translate Amino Acid Sequences into Musical Compositions and Application in Protein Design Using Artificial Intelligence. ACS Nano, 7471–7482.
(207) Costello, Z.; Martin, H. G. How to hallucinate functional proteins. arXiv preprint, 1903.00458.
(208) Chhibbar, P.; Joshi, A. Generating protein sequences from antibiotic resistance genes data using Generative Adversarial Networks.
(209) Davidsen, K.; Olson, B. J.; DeWitt III, W. S.; Feng, J.; Harkins, E.; Bradley, P.; Matsen IV, F. A. Deep generative models for T cell receptor protein sequences. eLife.
(210) Han, X.; Zhang, L.; Zhou, K.; Wang, X. ProGAN: Protein solubility generative adversarial nets for data augmentation in DNN framework. Computers & Chemical Engineering, 106533.
(211) Brookes, D. H.; Park, H.; Listgarten, J. Conditioning by adaptive sampling for robust design. arXiv preprint, 1901.10060.
(212) Sabban, S.; Markovsky, M. RamaNet: Computational De Novo Protein Design using a Long Short-Term Memory Generative Adversarial Neural Network. bioRxiv, 671552.
(213) Wang, J.; Cao, H.; Zhang, J. Z.; Qi, Y. Computational Protein Design with Deep Learning Neural Networks. Scientific Reports, 1–16.
(214) Chen, S.; Sun, Z.; Lin, L.; Liu, Z.; Liu, X.; Chong, Y.; Lu, Y.; Zhao, H.; Yang, Y. To Improve Protein Sequence Profile Prediction through Image Captioning on Pairwise Residue Distance Map. Journal of Chemical Information and Modeling, 391–399.
(215) Strokach, A.; Becerra, D.; Corbi-Verge, C.; Perez-Riba, A.; Kim, P. M. Designing real novel proteins using deep graph neural networks. bioRxiv, 868935.
(216) Karimi, M.; Zhu, S.; Cao, Y.; Shen, Y. De Novo Protein Design for Novel Folds using Guided Conditional Wasserstein Generative Adversarial Networks (gcWGAN). Bioinformatics, 1–8.
(217) Qi, Y.; Zhang, J. Z. H. DenseCPD: Improving the Accuracy of Neural-Network-Based Computational Protein Sequence Design with DenseNet. Journal of Chemical Information and Modeling.
[Table: Features contained in the CUProtein dataset (amino acid sequence, PSSM, MSA covariance, coarse secondary structure, pairwise distance matrices of Cα or Cβ atoms, and backbone torsion angles φ and ψ), with their dimensions, value types, and input/output roles. Data from Drori et al.]
[Table: A summary of publicly available molecular biology databases: the European Bioinformatics Institute (EMBL-EBI, https://www.ebi.ac.uk), the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov), the Protein Data Bank (PDB, https://www.rcsb.org), the Nucleic Acid Database (NDB, http://ndbserver.rutgers.edu), the Universal Protein Resource (UniProt, http://www.uniprot.org), and the Sequence Read Archive (SRA, within NCBI), with brief descriptions and approximate sizes.]
[Table: A summary of structure prediction models, listing architecture, training dataset, performance, test set, and citation for models including PSICOV, CMAPpro, DNCON, CCMpred, PconsC, MetaPSICOV, RaptorX-Contact, RaptorX-Distance, DeepCov, SPOT, DeepMetaPSICOV, MULTICOM, C-I-TASSER, AlphaFold, MapPred, trRosetta, RGN, and the model of Drori et al.]
[Table: Generative models to identify sequence from function (design for function), listing architecture, output type, dataset, and performance for models including those of Killoran et al., Sinai et al., and Müller et al., as well as PepCVAE, FBGAN, DeepSequence, DbAS-VAE, BioSeqVAE, PEVAE, Vampire, ProGAN, ProteinGAN, and CbAS-VAE.]
[Table: Generative models for protein structure prediction: DCGANs over Cα–Cα and backbone distances (Anand et al.) and RamaNet over torsion angles (Sabban and Markovsky).]
[Table: Generative models to identify sequence from structure (protein design), listing architecture, input, dataset, and sequence recovery for SPIN, SPIN2, the MLP of Wang et al., the CVAE of Greener et al., SPROF, ProDCoNN, the 3D CNN of Shroff et al., ProteinSolver, gcWGAN, the graph transformer of Ingraham et al., DenseCPD, and the 3D CNN of Anand et al.]