Design of metalloproteins and novel protein folds using variational autoencoders
Joe G Greener+, Lewis Moffat+, and David T Jones*

Department of Computer Science, University College London, Gower Street, London WC1E 6BT; and Francis Crick Institute, 1 Midland Road, London NW1 1AT

+ these authors contributed equally to this work
* [email protected]

ABSTRACT
The design of novel proteins has many applications but remains an attritional process with success in isolated cases. Meanwhile, deep learning technologies have exploded in popularity in recent years and are increasingly applicable to biology due to the rise in available data. We attempt to link protein design and deep learning by using variational autoencoders to generate protein sequences conditioned on desired properties. Potential copper and calcium binding sites are added to non-metal binding proteins without human intervention and compared to a hidden Markov model. In another use case, a grammar of protein structures is developed and used to produce sequences for a novel protein topology. One candidate structure is found to be stable by molecular dynamics simulation. The ability of our model to confine the vast search space of protein sequences and to scale easily has the potential to assist in a variety of protein design tasks.
Introduction
The computational design and redesign of proteins provides a route to create new protein structures and functions. The 'inverse folding problem' of finding a sequence that folds to a given structure or carries out a given function is challenging due to the vast number of possible sequences, the difficulty of assessing the suitability of a sequence, and the marginal stability of protein structures. From early attempts to add functional motifs to existing structures, the field has progressed through the design of a novel fold to the use of high-throughput assays to test thousands of sequences simultaneously.

Metalloproteins have been shown to be incredibly abundant and important to cell function. Designing metalloproteins offers the potential not only to improve our understanding of their function but also to develop new applications within research and industrial settings. Designing metal binding sites in proteins is a relatively mature research area and a variety of methods have been successfully applied, ranging from the deterministic to the probabilistic. For example, random mutagenesis has recently been shown to be successful in developing metalloproteins. A range of machine learning techniques have been applied to site prediction in the past decade, although they are not prevalent within the design task outside of computationally validating designed sequences before experimental characterization.

Another task within the realm of protein design is designing entirely new protein topologies. It is challenging but opens up new applications as well as exploring the limits of the available fold space. Similar to metalloproteins, the design process has not yet seen pervasive use of machine learning algorithms.

Typical workflows for the design of novel metalloproteins, and novel protein topologies, consist of designing a complex template followed by computational selection of the best designs before experimental validation. For metalloproteins this also includes choosing a metal and identifying relevant binding residues. The computational selection typically relies on computationally costly force-field based molecular dynamics (MD) simulations. This makes improving the efficiency of the computational pipeline a valuable pursuit.

Outside of protein design, another field that has seen recent development is that of deep learning research. An aspect of this development has been the rise of powerful new generative methods that leverage deep architectures in order to learn complex distributions. One of the most widely used deep generative methods, and the one used in this work, is the variational autoencoder (VAE). VAEs use a combination of an inference mechanism and a generative mechanism in order to learn a compressed and compact latent representation of the input data. The generative mechanism can be used to sample complex synthetic data, and the inference mechanism allows sampling of synthetic data that is similar to a specified real input. Furthermore, this two-mechanism structure allows for auxiliary tasks, like semi-supervised learning.

Deep generative models are only recently being applied to protein sequences, and this leaves a variety of potential applications not yet explored. For example, some work has been done using deep learning to generate protein sequences with desired features; however, these approaches do not use generative models. Instead they use deterministic deep models that take in a variety of engineered features to produce a sequence.
One example of using deep generative models to produce proteins has been published. This uses an autoregressive recurrent neural network trained on a database of antimicrobial peptides; however, it does not take into account structure explicitly or learn a compact latent representation of the data.

In this work, a deep generative model is presented using conditional variational autoencoders (CVAEs) for small single domain protein sequences. This model aims to reduce the initial search space by learning a distribution that narrows down the space of all possible proteins to a smaller space of proteins that are more likely to be natively ordered. This work explores two applications: firstly, unsupervised design of metal binding sites in pre-existing proteins, and secondly, producing proteins that fold into a novel topology.

For the metalloprotein design task the model is conditioned on metal binding ability for 8 different metals. Proteins are then designed by adding a new metal binding site where there had not been one. The second task takes inspiration from similar work which uses VAEs with data described by a grammar. Instead of producing novel small molecules like that study, the second task produces protein sequences. In order to condition the model in this task on a specified protein fold, a discrete representation of protein tertiary and secondary structure was required. This requirement was met by developing a context-free grammar (CFG) for protein fold structure. The grammar uses a 'periodic table' of protein folds from which a vectorized structural representation can be drawn. This task also uses both the generative and inference aspects of the VAE in an iterative method to explore the latent space, allowing sampling of sequences demonstrating a particular desired attribute.

Code and documentation required to reproduce the results of the paper and generate further sequences, along with a copy of the trained model, is freely available at https://github.com/psipred/protein-vae.

Methods
Data collection
Proteins are extracted from the Protein Data Bank (PDB) with the following criteria: monomers only, chain length no greater than 120 residues and no DNA/RNA present. Only chains with single domains are retained by comparison with CATH. This gives 3,785 protein chains. Homologues were found for each chain by running blastp against the UniRef90 dataset of available sequences clustered at the 90% similarity level. An E-value threshold of 10− was used and up to 50 hits were added to the dataset for each PDB structure. The length of blastp hits was limited to 80%-120% of the query sequence, with an upper limit of 140 residues.

Information on metal binding is collected from MetalPDB for 8 metals: Fe (bound to 10% of proteins in the dataset), Zn (6%), Ca (4%), Na (2%), Cu (2%), Mg (1%), Cd (0.9%) and Ni (0.5%). For the model conditioned on metal binding sites, another sequence is added to the dataset to act as a negative training example for each sequence in the dataset that binds a metal. This sequence is identical except that metal binding residues identified in MetalPDB are mutated to a random amino acid drawn from a distribution where each amino acid has the same frequency as in the whole training set. For homologous sequences, the residues in the corresponding places in the multiple sequence alignment are mutated if they are the same amino acid as in the PDB structure. The dataset used for the metal binding model has 148,000 entries including the homologous sequences and negative training examples.

To construct the hidden Markov model (HMM) used for comparison to the VAE, hmmbuild from HMMER was run with default parameters on a multiple sequence alignment (MSA) generated from the results of a blastp query of the sequence.
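As an illustration of the negative-example construction described above, the following is a minimal Python sketch that mutates annotated binding residues to amino acids drawn from the background distribution of the training set. The function name and the way the background frequencies are supplied are assumptions for illustration, not the published code.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def make_negative_example(sequence, binding_positions, background_freqs):
    """Mutate metal binding residues (0-based positions) to random amino
    acids drawn from the training set's background distribution."""
    seq = list(sequence)
    weights = [background_freqs[aa] for aa in AMINO_ACIDS]
    for pos in binding_positions:
        seq[pos] = random.choices(AMINO_ACIDS, weights=weights)[0]
    return "".join(seq)
```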
Context free grammar

A grammar for protein structures allows conditioning of the output of a VAE. We base our grammar on Taylor's 'periodic table' of protein structures. This considers protein structures to belong to one of three basic forms (αβα layer, αββα layer and αβ barrel) with secondary structural elements added or removed to form individual topologies. The formal context-free grammar that describes the topology strings is expressed as G = {N, Σ, P, S} where:

• Non-terminal symbols, N = {'<topology>', '<element>', '<orientation>', '<layer>', '<position>'}
• Terminal symbols, Σ = {'+', '-', 'A', 'B', 'C', 'D', '+0', '+1', '+2', '+3', '+4', '+5', '+6', '-1', '-2', '-3', '-4', '-5', '-6'}
• Production rules, P = {'<topology>' → '<element><topology>', '<topology>' → '<element>', '<element>' → '<orientation><layer><position>', '<orientation>' → ['+', '-'] (2 rules), '<layer>' → ['A', 'B', 'C', 'D'] (4 rules), '<position>' → ['+0', '+1', '+2', '+3', '+4', '+5', '+6', '-1', '-2', '-3', '-4', '-5', '-6'] (13 rules)}
• Start symbol, S = '<topology>'

For example, the reductase-related bacterial protein with PDB ID 2CU6 fits form 0-3-2 (αβα) with topology string '-C+0+B+0-B-1+C-1-B-2', where B and C are the layers, prefixed by their relative orientation to the first strand in the sheet and suffixed by their position relative to the first element in each layer. This example is shown in Figure 1.

Proteins in our dataset were assigned to topology strings using an assignment of topology strings to SCOP folds. Each protein was compared to all assigned representative SCOP proteins and the highest TMAlign score above 0.6, indicating a high chance of the same fold, was used to select the topology string to assign. This allowed assignment of 65% of proteins in the dataset described above to 325 unique topology strings. This 65%, equating to 105,000 sequences including homologues, was used to train the CVAE for the fold generation model.
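Topology strings produced by this grammar can be parsed mechanically. The sketch below, with illustrative names rather than ones from the published code, splits a topology string into its <element> triples:

```python
import re

# Each <element> is <orientation><layer><position>, e.g. '-C+0' or '+B-2'
ELEMENT_RE = re.compile(r"([+-])([ABCD])([+-]\d)")

def parse_topology(topology):
    """Split a Taylor topology string into (orientation, layer, position) triples."""
    elements = ELEMENT_RE.findall(topology)
    # Sanity check: the elements must tile the whole string
    assert "".join("".join(e) for e in elements) == topology, "invalid topology string"
    return elements

print(parse_topology("-C+0+B+0-B-1+C-1-B-2"))
# [('-', 'C', '+0'), ('+', 'B', '+0'), ('-', 'B', '-1'), ('+', 'C', '-1'), ('-', 'B', '-2')]
```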
Model

In this section we describe how a variational autoencoder can be used to produce novel protein sequences when conditioned on an attribute such as a grammar or a metal code. VAEs simultaneously train two models, an inference model $q_\phi(z \mid x)$ and a generative model $p_\theta(x \mid z)\,p_\theta(z)$, for data $x$ and the latent variable $z$. For this application both models are also conditioned on a chosen attribute of the sequences, $a$ - see Figure 2. Both models are jointly optimized using the tractable variational Bayes approach, which maximizes the evidence lower bound (ELBO):

$$\log p(x \mid a) \geq \mathbb{E}_{q_\phi(z \mid x, a)}\left[\log \frac{p_\theta(x \mid z, a)\,p_\theta(z)}{q_\phi(z \mid x, a)}\right] = \mathcal{L}(\theta, \phi; x) \tag{1}$$

$$= \mathbb{E}_{q_\phi(z \mid x, a)}\left[\log p_\theta(x \mid z, a)\right] - \mathrm{KL}\left(q_\phi(z \mid x, a) \,\|\, p_\theta(z)\right) \tag{2}$$

This equates to minimizing the reconstruction loss on $x$ and the Kullback-Leibler (KL) divergence between the inference model and a prior $p(z)$, usually characterized by an exponential family distribution, most typically a standard Gaussian:

$$q_\phi(z \mid x, a) = \mathcal{N}(z \mid \mu_q(x, a), \sigma_q(x, a)) \tag{3}$$

$$p_\theta(z) = \mathcal{N}(z \mid 0, I) \tag{4}$$

Using the reparameterization trick, which incorporates the prior using a linear transform of noise, ensures that both models are differentiable and can be optimized using stochastic gradient descent (SGD).

In terms of layerwise construction of the model, we define a linear block (LB) as

$$\mathrm{LB}(x) = \mathrm{ReLU}(\mathrm{BN}(Wx + b))$$

where ReLU is a rectified linear unit and BN is batch normalization. It was found that inclusion of batch normalization improved reconstruction accuracy, which has been documented particularly in the case of VAEs. We further define a multi-layer perceptron (MLP) as three sequential LBs. Both the decoder and the encoder contain one MLP. In the case of the decoder, the MLP is followed by a linear layer with a sigmoid activation function that produces the output sequence. The MLP in the encoder produces the Gaussian parameters as follows, where it is assumed $a$ and $x$ are concatenated as $y$:

$$\mu = \mathrm{Lin}(\mathrm{MLP}(y)) \tag{5}$$

$$\sigma = \mathrm{Softplus}(\mathrm{Lin}(\mathrm{MLP}(y))) \tag{6}$$

where Lin is a linear layer. Different sizes of latent space were tried and the final size chosen for sequence generation is 16 dimensions. The sizes of the layers within the MLP are 512, 256 and 128 units. The order as presented is used for the encoder but reversed for the decoder.
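A minimal PyTorch sketch of this architecture, using the stated layer sizes and latent dimension, is given below; the class and variable names are illustrative and any details not specified in the text are assumptions, not the published code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_block(d_in, d_out):
    # LB(x) = ReLU(BN(Wx + b)), as defined above
    return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU())

class CVAE(nn.Module):
    def __init__(self, seq_dim=3080, attr_dim=8, latent_dim=16):
        super().__init__()
        sizes = [512, 256, 128]
        # Encoder MLP: three LBs of 512, 256, 128 units
        enc = [linear_block(seq_dim + attr_dim, sizes[0])]
        enc += [linear_block(a, b) for a, b in zip(sizes, sizes[1:])]
        self.encoder = nn.Sequential(*enc)
        self.mu = nn.Linear(sizes[-1], latent_dim)
        self.sigma = nn.Linear(sizes[-1], latent_dim)  # Softplus applied in encode()
        # Decoder MLP: layer order reversed, then a linear layer with sigmoid
        dec = [linear_block(latent_dim + attr_dim, sizes[-1])]
        dec += [linear_block(a, b) for a, b in zip(reversed(sizes), list(reversed(sizes))[1:])]
        self.decoder = nn.Sequential(*dec, nn.Linear(sizes[0], seq_dim), nn.Sigmoid())

    def encode(self, x, a):
        h = self.encoder(torch.cat([x, a], dim=1))
        return self.mu(h), F.softplus(self.sigma(h))

    def decode(self, z, a):
        return self.decoder(torch.cat([z, a], dim=1))

    def forward(self, x, a):
        mu, sigma = self.encode(x, a)
        z = mu + sigma * torch.randn_like(sigma)  # reparameterization trick
        return self.decode(z, a), mu, sigma
```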
Each sequence is represented as a series of one-hot encoded tensors with 22 possible positions. These represent the 20 amino acids, a padding symbol to reach the maximum length, and a symbol for any non-standard amino acid such as selenocysteine. Given $L_{\max} = 140$, the sequences are sized as 140 × 22 before being flattened to 3080, at which point they are inserted into the network concatenated with a tensor representing the conditioning attribute. In the case of metal binding sites this is a 1D tensor containing 8 values, each set to either zero or one to denote the absence or presence respectively of one of the eight metal binding sites. For the structure grammar this is a 1265-long flattened tensor of the one-hot encoded rules. Each rule is described by a one-hot tensor with 23 options, corresponding to the 22 rules in P plus a blank rule for padding, and up to 55 rules are needed to produce any given topology in the dataset. When converting produced sequences to their amino acid representations, tensors are resized from 3080 to 140 × 22 and the argmax is taken across the second dimension to choose the most likely amino acid for each position, leaving a sequence 140 long.

The model was optimized using an Adam optimizer with a learning rate of 5 × 10−, a batch size of 512, and β values of 0.9 and 0.9. These values were optimized by grid search. The KL divergence was set to zero at the beginning of training before being multiplied by a linearly increasing "burn-in" factor until the full KL loss was incorporated. This is to avoid the documented behavior of the KL loss turning off latent units early in training. Models were trained until convergence. All models were built using the PyTorch framework for the Python programming language. Training of the final model took roughly 6 hours to converge using an NVIDIA GTX 1080 Ti GPU.
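The loss with the linear KL "burn-in" can be written as a short sketch; the schedule length is a placeholder, as the paper does not state it.

```python
import torch
import torch.nn.functional as F

def elbo_loss(x_hat, x, mu, sigma, step, burn_in_steps=10000):
    """Reconstruction loss plus KL divergence, with the KL term scaled by a
    linearly increasing burn-in factor (0 -> 1) to stop the KL loss from
    switching off latent units early in training."""
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL(N(mu, sigma^2) || N(0, I))
    kl = -0.5 * torch.sum(1 + torch.log(sigma**2) - mu**2 - sigma**2)
    burn_in = min(1.0, step / burn_in_steps)
    return recon + burn_in * kl
```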
Sampling procedure - structure task

With a maximum sequence length of 140 residues there are $20^{140}$ possible proteins, implying that even a reduced sequence space for a particular structure will likely contain a very large number of potential sequences. It has also been shown that within protein space the optimal set of sequences for a particular structure is surprisingly small. This being the case, a sequential sampling-analysis-sampling method was used to traverse the space towards finding the region containing sequences that belong to this optimal set.

For designing a new topology, the generator was sampled to produce 1,000 different sequences from across the space. These sequences were analyzed, as described below, to find the best sequence of those generated. This sequence is then fed into the inference mechanism to find the mean and standard deviation tensors that define its latent code. These are then used to sample 1,000 more sequences using the generative network. This is analogous to searching the space around the best sequence to find potentially better sequences in the local area. This process is repeated several times to traverse the search space before finding a final best sequence - see Figure 3.

In order to select promising sequences at each iteration, generated sequences are threaded onto a relevant template structure for the desired fold using SCWRL4 with default options. Energy minimisation is carried out using the Rosetta relax protocol with the thorough option and 3 structures generated per sequence. The top 5 sequences by score are taken forward for ab initio structure generation using Rosetta. Fragment generation is carried out and 10,000 structures are generated using the AbinitioRelax protocol with default parameters. Structures close to the template are found using TMAlign and the best sequence is input to the VAE in a new cycle. Each iteration step takes around 500 hours of computation on a single CPU but is easily parallelized, making the iteration process computationally tractable.
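A sketch of this sampling-analysis-sampling loop is given below. The scoring pipeline (SCWRL4 threading, Rosetta relaxation and ab initio generation, TMAlign comparison) is abstracted into a placeholder score_fn, and the model is assumed to follow the CVAE sketch from the Methods; this is an illustration of the loop structure, not the published code.

```python
import torch

def iterative_refinement(model, attr, score_fn, n_samples=1000, n_iterations=3, latent_dim=16):
    """Sampling-analysis-sampling over the latent space. score_fn stands in
    for the SCWRL4/Rosetta/TMAlign selection pipeline described above."""
    model.eval()
    with torch.no_grad():
        # Initial broad sampling from the standard Gaussian prior
        z = torch.randn(n_samples, latent_dim)
        a = attr.expand(n_samples, -1)
        best = max(model.decode(z, a), key=score_fn)
        for _ in range(n_iterations):
            # Encode the current best sequence to get its latent distribution...
            mu, sigma = model.encode(best.unsqueeze(0), attr.unsqueeze(0))
            # ...then sample locally around it and keep the best-scoring sequence
            z = mu + sigma * torch.randn(n_samples, latent_dim)
            best = max(model.decode(z, a), key=score_fn)
    return best
```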
Sampling procedure - metalloprotein task

For design of metalloproteins the sampling process is simpler, as removing or adding a binding site is done with a pre-existing sequence. The protein chosen is fed into the inference mechanism with its conditioning attribute, and 1,000 samples are then drawn from the generator using its mean and standard deviation tensors.

To select the best candidate sequence while limiting human oversight, a neural network classifier was trained to predict metal binding potential using the same dataset employed in training the VAE. Instead of having the metal binding code c as the conditioning set, it was used as a target set in a supervised fashion, i.e. predicting c given sequence x.

The classifier was built using the same linear blocks used for the VAE (six layers of 1024, 512, 256, 128, 64 and 8 hidden units respectively), where the last layer used a sigmoid activation function instead of a ReLU. This is because predicting metal binding is a multi-label classification task, each label being the protein's ability to bind one of eight metals. The loss function used was binary cross entropy and optimization was carried out using the same parameters as for the VAE. The inverse ratio of the number of proteins that bind a given metal was used to weight the loss function to reduce the effect of imbalanced classes.

Cross-validation was performed by creating a 90%/10% split between the training set and a validation set. To prevent proteins from the same families existing in both sets, ECOD was used: no protein sequence in the validation set was a member of the same ECOD family as any protein in the training set. To prevent overfitting during training, early stopping was performed by saving after every epoch and choosing the model that best minimized the validation loss. The MDM2 and TSG-6 sequences explored in the results were also placed in the validation set to avoid them being used for training, though only non-metal binding sequences are present for these proteins. The trained classifier was given the 1,000 sampled sequences and the sequence with the highest predicted metal binding potential for the selected metal was chosen as the final candidate sequence.
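A sketch of the classifier under these specifications follows. The layer sizes and loss weighting follow the text, while anything beyond that (names, exact weighting arithmetic) is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_block(d_in, d_out):  # as in the CVAE sketch
    return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU())

class MetalClassifier(nn.Module):
    def __init__(self, seq_dim=3080, n_metals=8):
        super().__init__()
        sizes = [seq_dim, 1024, 512, 256, 128, 64]
        blocks = [linear_block(a, b) for a, b in zip(sizes, sizes[1:])]
        # Final layer: sigmoid rather than ReLU, one output per metal (multi-label)
        self.net = nn.Sequential(*blocks, nn.Linear(sizes[-1], n_metals), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

# Per-metal weights: inverse of binding frequency (frequencies taken from the
# dataset description in the Methods), to reduce the effect of class imbalance
freqs = torch.tensor([0.10, 0.06, 0.04, 0.02, 0.02, 0.01, 0.009, 0.005])
class_weights = 1.0 / freqs

def weighted_bce(pred, target):
    loss = F.binary_cross_entropy(pred, target, reduction="none")
    return (loss * class_weights).mean()
```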
Molecular dynamics

All molecular dynamics (MD) runs were carried out using the GROMACS package. Energy minimisation was conducted using a steepest descent energy minimisation of 5,000 steps in a vacuum and the OPLS-AA force field. MD runs were conducted using periodic boundary conditions, SPC water, charge-neutralising counter ions, the OPLS-AA force field and a 2 fs timestep. An initial energy minimisation was followed by a constant temperature and volume equilibration for 100 ps, then a constant pressure and temperature equilibration for 100 ps. Production MD was run for 200 ns.
Results

In order to carry out protein design tasks we developed a CVAE that is able to generate protein sequences with certain properties. At its simplest the model is able to act as an encoder and decoder of protein sequences, with a mean sequence identity between training set sequences and their encoded-decoded form of 49.5% for the model conditioned on metal binding sites. Example sequences generated by the model are shown in Figure 4. The conservation of residues across protein families is broadly reproduced by the model. Residues conserved by evolution are generally not varied by the model, whereas residues not conserved by evolution are varied in the sequence output. For example, a MSA was formed from 1,000 sequences generated from one lysozyme sequence (PDB ID 1IIZ) in the dataset using the model conditioned on the grammar. The conservation per residue as given by Jensen-Shannon divergence shows a Pearson correlation coefficient of 0.72 with values from a MSA from a blastp query of the sequence.
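The per-residue conservation comparison could be reproduced along the following lines; this is a minimal sketch assuming a Jensen-Shannon divergence between each column's amino acid distribution and a background distribution, which may differ in detail from the scoring actually used.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_conservation(msa_column, background):
    """Jensen-Shannon divergence between a column's amino acid distribution
    and a background distribution; higher means more conserved. Gap handling
    is ignored in this sketch."""
    counts = np.array([msa_column.count(aa) for aa in AMINO_ACIDS], dtype=float)
    probs = counts / counts.sum()
    return jensenshannon(probs, background) ** 2  # squared distance = divergence
```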
Generation of metal binding sites

The model is able to add potential metal binding sites to proteins. The SWIB domain of human MDM2, a protein in the dataset used to train the model, was investigated as it is not known to bind metal ions and it has high sequence identity when encoded and decoded by the model. 1,000 sequences were generated by encoding and decoding this sequence using the VAE with the copper flag turned on, i.e. copper binding was requested. In order to explore the benefits of using a VAE to generate sequences, we compared these sequences to those generated by a HMM. A HMM produces sequences from an MSA assuming a Markov process with unobserved states. Sequences generated with the VAE using the MDM2 sequence as input show less variability than sequences generated with a HMM produced from a MSA of the blastp hits of MDM2. This is expected as we are using a single sequence as input rather than a protein family.

In order to assess the stability of generated sequences in the structure of the input sequence, sequences were threaded through this structure (PDB ID 3LBL) and the Rosetta energy scores after relaxation were compared. As we are comparing different sequences adopting the same structure, a lower Rosetta energy score indicates a better suitability to the given structure. The median score is -182 energy units for the VAE sequences, -115 for the HMM sequences and -184 for the native sequence. This indicates that sequences generated by the VAE are likely to fold to the same structure as the input sequence, whereas the HMM sequences are likely to adopt a different structure or not fold to a stable structure. Hence HMM sequences show more sequence and structural variation than VAE sequences.

When a copper binding site is requested from the VAE, more copper-binding motifs are observed in the output sequences than for the HMM. 10% of sequences have histidine residues close in space (8% for the HMM) and 4% of sequences have His-x-His on an α-helix (2% for the HMM). When the output sequences are limited to those with more histidines than the native sequence, 42% of sequences from the VAE have close histidines (17% for the HMM) and 16% of sequences have His-x-His on an α-helix (4% for the HMM). This indicates that the VAE is able to generate sequences with a high chance of folding to the same structure as the input sequence, yet can still add metal binding motifs to the sequences. Whilst copper-binding motifs are not explicitly requested from the HMM, fewer copper binding motifs are seen than for the VAE despite the higher sequence variability. The ability to generate sequences with histidines close in space and with copper-binding motifs on helices shows that the VAE is learning structural information.
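A minimal sketch of this kind of motif counting is shown below; the spacer length in the His-x-His motif is not recoverable from the text here, so it is left as a parameter rather than fixed.

```python
import re

def count_his_motifs(sequences, spacer=3):
    """Count sequences containing a His-x(spacer)-His motif. The spacer
    length used in the paper is an assumption here, hence the parameter."""
    pattern = re.compile(f"H.{{{spacer}}}H")
    return sum(1 for seq in sequences if pattern.search(seq))
```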
A discriminator trained to predict metal binding sequences (see the Methods) was used to select the sequence with the highest copper binding character from the 1,000 generated. The top-ranked sequence has two potential new copper binding sites when compared to the input sequence, as shown in Figure 5. One of the potential sites, site A, involves the modification of a site with a histidine residue to a site with two histidines and a cysteine to form a linear motif. The other potential site, site B, involves the addition of two histidine residues close in structure but 18 residues apart in sequence to form a non-linear motif. This sequence has 54% identity to the input sequence.

Another protein, the link module of human TSG-6 (PDB ID 1O7B), was selected for similar reasons to the SWIB domain. In this case calcium binding was requested. The generated sequences show more aspartate residues, known to bind calcium, than the native sequence in 29% of cases. The metal binding discriminator was used to select the sequence from 1,000 generated sequences most likely to bind calcium. The second-highest ranked sequence shows a potential new calcium binding site formed of a glutamate added by the model and a native glutamate 3 residues apart, as shown in Figure 5. This sequence has 89% identity to the input sequence. In both cases above a single sequence was used as input and a sequence was returned that shows potential metal binding character. These results suggest that the model is able to add metal binding sites without having to manually examine individual protein structures.

Another test of the model's ability to add metal binding sites is whether it can add a known site back to a protein excluded from the training set. This was tested with the classic copper-binding protein plastocyanin. The instances of plastocyanin with native sequences were removed from the dataset; the instances with metal binding residues mutated to other amino acids were left in. When copper binding sequences are requested from the model trained on this new dataset, with the mutated plastocyanin sequence as input, 87% of 1,000 sequences have the crucial copper-binding residue His37 present even though it was mutated in the training examples. 11% of the sequences have the full copper-binding site of His37, Cys84 and His87 present.
Generation of proteins with a given topology

Next, we explore the ability of a CVAE to generate sequences for a given protein topology. In this case, the output is conditioned on a grammar of protein structures. Note that when generating sequences for a given topology only the topology is used as input; this differs from the metal binding case above, where a protein sequence was used as input along with the metal binding information. First, sequences were generated for a topology common in the dataset, the two-layer sandwich with three β-strands and one α-helix (topology '+B+0-C+0+B+2-B+1' - see the Methods). In this case the sequences generated show similarity to the training sequences with the given topology, with 55% of generated sequences having 50% or greater sequence identity to at least one training sequence. Threading generated sequences through four PDB structures with the given topology (PDB IDs 1AHO, 1JZA, 2FKL and 2M8B), followed by Rosetta energy minimisation, gives energy scores that are similar to those of the PDB structures in many cases. 18% of generated sequences have energy scores within 10% of the native PDB scores for at least one of the above four structures, which represent only a subset of available structures of the fold. In this case the CVAE seems to be generating viable homologues of the sequences used for training.

Next, the model was used to generate sequences for a novel topology. This topology was made by picking a structure with a well-assigned Taylor topology - the reductase-related bacterial protein with PDB ID 2CU6 - and modifying the loops connecting the central β-strands with ModLoop (see Figure 6). The secondary structural elements are arranged identically in 3D space, but the different connection of the loops creates a new topology and leads to a large rearrangement of the sequence. The resulting topology ('-C+0+B+0-B-2+C-1-B-1') is not assigned to any structure in the PDB, and a structural search with the modified protein using DALI does not give any close matches, the only hits being similar in topology to 2CU6.

Sequences were generated for this topology using the model, and an iterative procedure of selecting promising sequences via ab initio structure generation and localized sampling was used to refine the sequence generation (see the Methods). After three iterations of this process a sequence of 86 residues was generated that had structures in the pool of ab initio predicted structures similar to the modified 2CU6 structure. This is shown in Figure 6. The region between the β-strand elements contains two α-helices rather than one and there is less β-strand character, but the overall topology of α-helices behind a β-strand with certain connections is correct. In fact the PSIPRED prediction of the region not forming a β-sheet in the generated structure is of a β-sheet, and with minor rearrangement of the structure a β-sheet could be formed.

Although the modified 2CU6 backbone was used to guide the search, no sequence information from 2CU6 was utilised and the sequence is generated purely by the CVAE. There are no similar sequences in the PDB when searching with an E-value threshold of 10. The sequence shows some similarity only to amidophosphoribosyltransferase sequences, with the highest scoring blastp hit having 46% sequence identity over 55% of the query protein. However, structures of amidophosphoribosyltransferases in the PDB show no similarity to the requested topology.
Interestingly, the sequence shows 28% sequence identity to a sequence generated by RosettaDesign from the generated structure backbone in Figure 6, indicating some agreement with established design protocols.

In order to probe the stability of this structure and gain information on whether the sequence adopts it, MD simulations were carried out. MD simulations can lead to stabilisation of collapsed structures other than the native structure due to favouring hydrophobic interactions, but they are still a useful tool for probing potential structures. The structure showed little deviation over three runs of 200 ns simulation time when considering root mean square deviation (RMSD) and radius of gyration. This is shown in Figure 7. Although longer simulations would be required to check more thoroughly for unfolding or for a different folded state, this is some indication that the fold is stable. Experimental testing would be required to verify whether this prediction is correct. However, these results indicate that a CVAE is able to generate viable sequences for protein topologies both known and not previously seen. It is hoped that these results generalize to the vast number of topologies described by the grammar.
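The RMSD and radius of gyration analysis in Figure 7 could be reproduced along these lines, here sketched with the MDAnalysis library; the file names are placeholders and the authors' actual analysis tooling is not specified in the text.

```python
import MDAnalysis as mda
from MDAnalysis.analysis import rms

# Load topology and production trajectory (placeholder file names)
u = mda.Universe("model.gro", "production.xtc")
backbone = u.select_atoms("backbone")

# Backbone RMSD of each frame to the energy-minimised starting structure
ref = mda.Universe("model.gro")
rmsd = rms.RMSD(u, ref, select="backbone").run()

# Backbone radius of gyration per frame
rg = [backbone.radius_of_gyration() for ts in u.trajectory]
```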
Discussion

This work has indicated that CVAEs are able to carry out protein design tasks by conditioning output sequences on desired properties such as metal binding sites and given topologies. The model is able to learn structural properties, even with fewer latent dimensions than the 16 used in the final model. For example, with 2 latent dimensions the homologues of each input sequence cluster together in the latent space. The sequences also show some separation by CATH class, and sequences of the same CATH architecture tend to be close in the latent space. These properties are shown in Figure 8. As would be expected, the first latent dimension correlates with molecular weight as the model learns the length of the input sequence.

Sequences generated by the model described here show similar conservation patterns to native sequence families. However, the covariation of residues present across sequence families is not reproduced by the model. This is to be expected for two reasons: each PDB sequence in the dataset is limited to 50 sequence homologues, and the input to the model is a single sequence rather than a multiple sequence alignment. There are developments to the model that could lead to sequences being generated with realistic covariation signals. These include training directly on multiple sequence alignments, having a dataset with proteins that have more homologues, and using regularization techniques to promote covarying outputs. It has been shown that taking into account covariation is important for effective protein design. Another potential use of this is to generate additional sequences for use in residue contact prediction methods.

Our model has shown promise in conditioning output sequences in terms of metal binding or a given topology. There are other properties of proteins that could be explored using the model in a similar way. These include protein-protein interactions, which have been organised systematically in an analogous way to Taylor topologies; allosteric sites, which have been designed into proteins previously; taking account of multiple conformations in the design process; and designing proteins by linking together large fragments from the PDB. Another approach is to use activation maximization on trained models to move directly towards sampling proteins that have measurable and desired attributes, for example sampling a sequence containing a particular motif.

After years of effort, protein design is becoming a systematic tool showing promise in many areas. However, the difficulties of navigating a vast search space and scaling design efforts from specific proteins to the whole of protein space remain a challenge. The rise of deep learning has the potential to meet this challenge due to its ability to learn complex patterns without supervision and the small computational cost of using a model once it is trained. In addition, the increasing amount of sequence and structural data will aid data-intensive methods such as deep learning. It is hoped that methods similar to the CVAE presented here will be a valuable addition to the protein design pipeline.

Data availability

Code and documentation required to reproduce the results of the paper and generate further sequences, along with a copy of the trained model, is freely available at https://github.com/psipred/protein-vae.

References
1. Huang, P., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature, 320–327 (2016).
2. Samish, I., MacDermaid, C. M., Perez-Aguilar, J. M. & Saven, J. G. Theoretical and computational protein design. Annu. Rev. Phys. Chem., 129–149 (2011).
3. Yue, K. & Dill, K. A. Inverse protein folding problem: designing polymer sequences. Proc. Natl. Acad. Sci. USA, 4163–4167 (1992).
4. Regan, L. Protein design: novel metal-binding sites. Trends Biochem. Sci., 280–285 (1995).
5. Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science, 1364–1368 (2003).
6. Rocklin, G. J. et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 168–175 (2017).
7. Chevalier, A. et al. Massively parallel de novo protein design for targeted therapeutics. Nature, 74–79 (2017).
8. Andreini, C., Cavallaro, G., Lorenzini, S. & Rosato, A. MetalPDB: a database of metal sites in biological macromolecular structures. Nucleic Acids Res., D312–D319 (2013).
9. Andreini, C., Bertini, I. & Rosato, A. Metalloproteomes: A bioinformatic approach. Acc. Chem. Res., 1471–1479 (2009).
10. Fung, H. K., Welsh, W. J. & Floudas, C. A. Computational De Novo Peptide and Protein Design: Rigid Templates versus Flexible Templates. Ind. Eng. Chem. Res., 993–1001 (2008).
11. Yang, H. et al. Evolving artificial metalloenzymes via random mutagenesis. Nat. Chem., 318–324 (2018).
12. Akcapinar, G. B. & Sezerman, O. U. Computational approaches for de novo design and redesign of metal-binding sites on proteins. Biosci. Rep. (2017).
13. Brylinski, M. & Skolnick, J. FINDSITE-metal: Integrating evolutionary information and machine learning for structure-based metal-binding site prediction at the proteome level. Proteins, 735–751 (2011).
14. Lin, H. H. et al. Prediction of the functional class of metal-binding proteins from sequence derived physicochemical properties by support vector machine approach. BMC Bioinformatics (2006).
15. Sodhi, J. S. et al. Predicting metal-binding site residues in low-resolution structural models. J. Mol. Biol., 307–320 (2004).
16. Dagliyan, O. et al. Rational design of a ligand-controlled protein conformational switch. Proc. Natl. Acad. Sci. USA, 6800–6804 (2013).
17. Taylor, W. R., Chelliah, V., Hollup, S. M., MacDonald, J. T. & Jonassen, I. Probing the "dark matter" of protein fold space. Structure, 1244–1252 (2009).
18. Goodfellow, I. J. et al. Generative Adversarial Networks. ArXiv e-prints (2014).
19. Kingma, D. P. & Welling, M. Auto-Encoding Variational Bayes. ArXiv e-prints (2013).
20. Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ArXiv e-prints (2014).
21. van den Oord, A., Kalchbrenner, N. & Kavukcuoglu, K. Pixel Recurrent Neural Networks. ArXiv e-prints (2016).
22. Jaques, N., Gu, S., Turner, R. E. & Eck, D. Tuning Recurrent Neural Networks with Reinforcement Learning. ArXiv e-prints (2016).
23. van den Oord, A. et al. Conditional Image Generation with PixelCNN Decoders. ArXiv e-prints (2016).
24. Gómez-Bombarelli, R. et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci., 268–276 (2018).
25. Müller, A. T., Hiss, J. A. & Schneider, G. Recurrent Neural Network Model for Constructive Peptide Design. J. Chem. Inf. Model., 472–479 (2018).
26. Sinai, S., Kelsic, E., Church, G. M. & Nowak, M. A. Variational auto-encoding of protein sequences. ArXiv e-prints (2017).
27. Li, Z., Yang, Y., Faraggi, E., Zhan, J. & Zhou, Y. Direct prediction of profiles of sequences compatible with a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles. Proteins, 2565–2573 (2014).
28. Wang, J., Cao, H., Zhang, J. Z. H. & Qi, Y. Computational Protein Design with Deep Learning Neural Networks. ArXiv e-prints (2018).
29. Kusner, M. J., Paige, B. & Hernández-Lobato, J. M. Grammar Variational Autoencoder. ArXiv e-prints (2017).
30. Taylor, W. R. A 'periodic table' for protein structures. Nature, 657–660 (2002).
31. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res., 235–242 (2000).
32. Sillitoe, I. et al. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res., D376–D381 (2015).
33. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 3389–3402 (1997).
34. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res., D158–D169 (2017).
35. Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput. Biol., e1002195 (2011).
36. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 536–540 (1995).
37. Taylor, W. R. Personal communication (2017).
38. Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proc. 32nd Int. Conf. on Mach. Learn., 448–456 (2015).
39. Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K. & Winther, O. Ladder Variational Autoencoders. Adv. Neural Inf. Process. Syst., 3738–3746 (2016).
40. Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. ArXiv e-prints (2014).
41. Paszke, A. et al. Automatic differentiation in PyTorch. NIPS-W (2017).
42. Tian, P. & Best, R. B. How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis. Biophys. J., 1719–1730 (2017).
43. Kuhlman, B. & Baker, D. Native protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. USA, 10383–10388 (2000).
44. Krivov, G. G., Shapovalov, M. V. & Dunbrack, R. L. Improved prediction of protein side-chain conformations with SCWRL4. Proteins, 778–795 (2009).
45. Leaver-Fay, A. et al. ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol., 545–574 (2011).
46. Cheng, H. et al. ECOD: An Evolutionary Classification of Protein Domains. PLoS Comput. Biol., e1003926 (2014).
47. Abraham, M. J. et al. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 19–25 (2015).
48. Capra, J. A. & Singh, M. Predicting functionally important residues from sequence conservation. Bioinformatics, 1875–1882 (2007).
49. Fiser, A. & Sali, A. ModLoop: automated modeling of loops in protein structures. Bioinformatics, 2500–2501 (2003).
50. Holm, L. & Rosenstrom, P. Dali server: conservation mapping in 3D. Nucleic Acids Res., W545–W549 (2010).
51. Buchan, D. W., Minneci, F., Nugent, T. C., Bryson, K. & Jones, D. T. Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids Res., W349–W357 (2013).
52. Liu, Y. & Kuhlman, B. RosettaDesign server for protein design. Nucleic Acids Res., W235–W238 (2006).
53. Robustelli, P., Piana, S. & Shaw, D. E. Developing a molecular dynamics force field for both folded and disordered protein states. Proc. Natl. Acad. Sci. USA, E4758–E4766 (2018).
54. Lindorff-Larsen, K., Piana, S., Dror, R. O. & Shaw, D. E. How fast-folding proteins fold. Science, 517–520 (2011).
55. Socolich, M. et al. Evolutionary information for specifying a protein fold. Nature, 512–518 (2005).
56. Tian, P., Louis, J. M., Baber, J. L., Aniana, A. & Best, R. B. Co-Evolutionary Fitness Landscapes for Sequence Design. Angew. Chem. Int. Ed. Engl., 5674–5678 (2018).
57. Ahnert, S. E., Marsh, J. A., Hernandez, H., Robinson, C. V. & Teichmann, S. A. Principles of assembly reveal a periodic table of protein complexes. Science, aaa2245 (2015).
58. Davey, J. A., Damry, A. M., Goto, N. K. & Chica, R. A. Rational design of proteins that exchange on functional timescales. Nat. Chem. Biol., 1280–1285 (2017).
59. Ambroggio, X. I. & Kuhlman, B. Computational design of a single amino acid sequence that can switch between two distinct protein folds. J. Am. Chem. Soc., 1154–1161 (2006).
60. Jacobs, T. M. et al. Design of structurally distinct proteins using strategies inspired by evolution. Science, 687–690 (2016).
61. Nguyen, A., Yosinski, J., Bengio, Y., Dosovitskiy, A. & Clune, J. Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space. ArXiv e-prints (2016).
62. Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K. & Hugenholtz, P. A bioinformatician's guide to metagenomics. Microbiol. Mol. Biol. Rev., 557–578 (2008).
63. Asgari, E. & Mofrad, M. R. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE, e0141287 (2015).
64. Callaway, E. The revolution will not be crystallized. Nature, 172–174 (2015).
Acknowledgements
We thank members of the group for valuable discussions and comments, and William R Taylor for providing data and assistance for the protein structure grammar. This work was supported by the European Research Council Advanced Grant 'ProCovar' (project ID 695558) and the Biotechnology & Biological Sciences Research Council (BBSRC) UK, grant BB/M011712/1. This work was supported by the Francis Crick Institute, which receives its core funding from Cancer Research UK (FC001002), the UK Medical Research Council (FC001002), and the Wellcome Trust (FC001002).
Author contributions statement
JGG, LM and DTJ conceived and designed the study and reviewed the manuscript. JGG and LM carried out the computational work, did the analysis and drafted the manuscript.
Additional information
Competing interests
The authors declare no competing interests.

Figure 1. The Taylor grammar of protein structures shown for a reductase-related bacterial protein (PDB ID 2CU6). The orientation of the main secondary structural elements is examined in order to assign a topology string. See Taylor 2002 for a full description of the grammar.

Figure 2. The inference/generator (encoder/decoder) structure of the VAE. The data $x$ and the conditioned attribute $a$ are concatenated and passed through the inference model to produce the latent code $z$. The attribute is then concatenated to the sampled $z$ (denoted by a dashed line), which going through the generator produces the reconstructed sequence $\hat{x}$.

Figure 3. Graphic showing how the latent space was explored through iterative sampling and analysis of protein sequences.

Figure 4. Example protein sequences generated by the model. (A) A sequence can be given as input, in which case similar sequences are returned by encoding and decoding the input sequence. (B) A topology string can be given as input, in which case sequences are generated that the VAE believes match the topology.

Figure 5. Automated addition of potential metal binding sites to two proteins. The sequence of the SWIB domain of human MDM2 (PDB ID 3LBL) and a sequence generated by the model that is predicted to have high copper-binding character are shown. Highlighted in purple are the residues that form the potential copper binding sites. The sites are shown on the structure of the generated sequence using 3LBL as a template, and compared to 3LBL. The same is shown for a potential site on the link module of human TSG-6 (PDB ID 1O7B) when calcium binding is requested.

Figure 6. Structures to explore a novel fold. The structure and sequence of 2CU6 are shown. By remodelling the loops at the locations of the blue circles using ModLoop, a modified structure with a novel topology is generated. This is used as a backbone template to select structures from a pool of ab initio Rosetta structures generated from sequences output by the CVAE. The closest structure to the template is shown.

Figure 7. Analysis of MD runs of the generated structure shown in Figure 6. Three runs of 200 ns are shown in blue, orange and green. (A) Backbone RMSD of trajectory structures to the energy-minimized starting structure. (B) Backbone radius of gyration (Rg) of trajectory structures.

Figure 8. Separation by structural properties in the latent space when 2 latent dimensions are used in the model. The axes are the 2 latent dimensions and each point is the encoded representation in the 2 dimensions of one input sequence. Clusters generally correspond to the homologues collected for each sequence. (A) Each sequence is coloured by CATH class according to the colours shown. (B) Sequences for one CATH architecture, 'mainly beta single sheet' (CATH ID 2.20), are highlighted in red.