Sequence-guided protein structure determination using graph convolutional and recurrent networks
Po-Nan Li
Dept. of Electrical Eng., Stanford University
Stanford, CA, [email protected]
Saulo H. P. de Oliveira
Division of Biosciences, SLAC National Accelerator Laboratory
Menlo Park, CA, [email protected]
Soichi Wakatsuki
Dept. of Structural Biology, Stanford University
Stanford, CA, [email protected]
Henry van den Bedem
Atomwise, Inc.
San Francisco, CA, [email protected]
Abstract—Single particle, cryogenic electron microscopy (cryo-EM) experiments now routinely produce high-resolution data for large proteins and their complexes. Building an atomic model into a cryo-EM density map is challenging, particularly when no structure for the target protein is known a priori. Existing protocols for this type of task often rely on significant human intervention and can take hours to many days to produce an output. Here, we present a fully automated, template-free model building approach that is based entirely on neural networks. We use a graph convolutional network (GCN) to generate an embedding from a set of rotamer-based amino acid identities and candidate 3-dimensional Cα locations. Starting from this embedding, we use a bidirectional long short-term memory (LSTM) module to order and label the candidate identities and atomic locations consistent with the input protein sequence to obtain a structural model. Our approach paves the way for determining protein structures from cryo-EM densities at a fraction of the time of existing approaches and without the need for human intervention.

Index Terms—Machine learning, Computational biology, Electron microscopy, Recurrent neural networks, Neural networks
I. INTRODUCTION
Insight into the three-dimensional (3-D) structure of proteins is fundamentally important to help us understand their cellular functions, their roles in disease mechanisms, and for structure-based development of pharmaceuticals. Recent advancements in cryogenic electron microscopy (cryo-EM), including better detector technologies and data processing techniques, have enabled high-resolution imaging of proteins and large biological complexes at the atomic scale [1].

To construct a structural, atomically detailed model for a protein, typically tens of thousands of single-particle images are collected, sorted and aligned to reconstruct a 3-D density map volume. Next, atomic coordinates are built into the density map. The latter routine is known as the map-to-model process, which typically requires a considerable amount of human intervention and inspection, notwithstanding the availability of automated tools to aid the process [2]–[4].

Despite significant progress in machine learning techniques for 2-D or 3-D object detection [6]–[9] and protein folding [10], deep learning approaches to modeling atomic coordinates into cryo-EM densities remain relatively unexplored. Multiple research groups have proposed convolutional neural networks (CNNs) for detecting amino acid residues in a cryo-EM density map, but either did not address the final map-to-model step [11]–[14], or used a conventional optimization algorithm to construct the final model (see Related Work). Conventional search algorithms have high time- and space-complexity, constituting a bottleneck for large protein complexes, and are unable to exploit rich structural information encoded in genetic information [10].

Here, we address these shortcomings by presenting an approach for protein structure determination from cryo-EM densities based entirely on neural networks. First, we use a 3-D CNN with residual blocks [17], which we call RotamerNet, to locate and predict amino acid and rotameric identities in the 3-D density map volume. Next, we apply a graph convolutional network (GCN) [18] to create a graph embedding using the nodes with 3-D structural information generated by RotamerNet. Inspired by the UniRep approach [19], we then apply a bidirectional long short-term memory (LSTM) module to select and impose an ordering consistent with the protein sequence on the candidate amino acids, effectively generating a refined version of the graph with directed edges connecting amino acids (Fig. 1). We trained our LSTM on sequence data (UniRef50) alone, taking advantage of structural information encoded in the vast amount of genetic information [10]. In this paper we focus on the protein structure generation part with a GCN and an LSTM, which together we term the Structure Generator. Our main contributions are:

• The first, to our knowledge, entirely neural network based approach to generate a protein structure from a set of candidate 3-D rotameric identities and positions.

• Exploitation of genetic information learned from UniRef50 sequences to help generate a 3-D structure from cryo-EM data using a GCN embedding and a bidirectional LSTM.
II. RELATED WORK
A. Map-to-model for cryo-EM maps
Over the last few years, cryo-EM has evolved as a major experimental technique for determining novel structures of large proteins and their complexes. Computational techniques to process and analyze the data, and to build protein structures, are challenged by this avalanche of data. For example, widely used de novo cryo-EM structure determination tools, e.g., phenix.map_to_model [3], [4] or RosettaCM [5], partially automate cryo-EM data interpretation and reconstruction, but typically take many hours to generate a preliminary model and can require significant manual intervention. The underlying algorithms are often decades old, and are difficult to adapt to faster (e.g., graphics processing unit, GPU) architectures. It will be critical to modernize these approaches, and capitalize on recent advances in deep learning and GPUs to expedite this procedure.

Fig. 1. Overview of the map-to-model pipeline. The present work focuses on the bottom panel (shaded box), determining a structural model consistent with the protein sequence from candidate amino acid positions.

Several deep-learning based approaches for protein structure determination from cryo-EM data have been proposed. Li and coworkers [11] introduced a CNN-based approach to annotate the secondary structure elements in a density map, an approach later also proposed by Subramaniya et al. and Mostosi et al. [13], [14]. The feasibility of an end-to-end map-to-model pipeline with deep learning has also been explored. Xu and colleagues trained a number of 3-D CNNs with simulated data to localize and identify amino acid residues in a density map and used a Monte-Carlo Tree Search (MCTS) algorithm to build the protein backbone [15]. Using an entirely different architecture, Si and coworkers divided the map-to-model procedure into several tasks addressed by a cascade of CNNs. However, their procedure also relied on a conventional Tabu-Search algorithm to produce the final protein model [16].
Graph neural networks [18], [20] are natural representations for molecular structures, with atoms as nodes and covalent bonds as edges. Duvenaud and coworkers pioneered this approach using a GCN to learn molecular fingerprints, which are important in drug design [21]. Numerous other applications of GCNs to predict or generate molecular properties can be found in the literature. For example, Li and colleagues demonstrated the utility of a generative GCN to construct 3-D molecules from SMILES strings, among other applications [22].
C. Long short-term memory
Recurrent neural networks (RNNs) are widely used in natural language processing tasks. Their architecture is designed to process, classify, or predict properties of sequences as input, and can output sequences with desired properties [23]. Among many RNN architectures, the LSTM model was proposed to address the vanishing gradient problem for long sequences [24], and a number of variants have since been studied to further increase its capacity, such as multi-layer and bidirectional LSTMs. LSTMs are often used in conjunction with other neural network models. An image captioning system, for example, can be realized by using a 2-D CNN that extracts high-dimensional features from an image and an LSTM that outputs a sentence describing the input image [25].
III. METHOD
A. Model
In this section we present the Structure Generator, a neural network model for protein model building consisting of a GCN followed by a bidirectional LSTM module. The input to the Structure Generator is a set of nodes labeled with 3-D coordinates and an amino acid identity. To generate a set of candidate amino acids we have previously implemented RotamerNet (unpublished), a 3-D CNN based on the ResNet architecture [17] that can identify amino acids and their rotameric identities in an EM map. This set of candidate amino acids is not constrained by the sequence, and their 3-D locations are determined entirely by their density profiles. The set can contain false positives (an amino acid rotamer is proposed at a location where there is none) or false negatives (a correct amino acid rotamer was not identified). RotamerNet outputs an amino acid and rotamer identity together with proposed coordinates for its Cα atom. In the remainder, we will only consider the amino acid identity. A node $v$ is a proposed amino acid identity with its Cα coordinates.

Next, we generate a Cα contact map for all predicted Cα coordinate locations. We connect any two proposed Cα atoms with a distance less than a given threshold with an undirected edge. We represent the input with two matrices: an $m \times d$ matrix of node features and an $m \times m$ adjacency matrix that describes the connectivity between nodes.

We generate a high-dimensional embedding for each node $v$, $H_{\mathrm{node}} = \mathrm{GCN}(A, F)$, where $H_{\mathrm{node}} = [h^{(1)\top}_{\mathrm{node}}, \ldots, h^{(m)\top}_{\mathrm{node}}]^\top$, $A$ is the adjacency matrix with $a_{i,j} = 1$ for each neighbor pair $(i, j)$ or the node itself, i.e. $i = j$, and $F = [f^{(1)\top}_{\mathrm{node}}, \ldots, f^{(m)\top}_{\mathrm{node}}]^\top$ are the input features. Features are generated with $f^{(v)}_{\mathrm{node}} = \mathrm{NN}(s^{(v)})$, where $s^{(v)}$ is the normalized softmax score vector for a node $v$ obtained from RotamerNet and $\mathrm{NN}(\cdot)$ denotes a single-layer neural network. We implemented the GCN module following [18] (Fig. 2(a)). Note that the GCN can be applied in $T$ layers to propagate messages, thereby increasing the capacity of the network [21], [22]. As depicted in Fig. 2(a), in each of the GCN layers, messages propagate through edges, sharing the embedding of a node with its neighbors. For example, when $T = 2$, $H_{\mathrm{node}} = \mathrm{GCN}^{(2)}(A, \mathrm{GCN}^{(1)}(A, F))$, and these two GCN layers can share the same set or have different sets of parameters.
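To make the graph construction and embedding concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the distance threshold (8 Å), the feature and embedding dimensions, and all class and function names are illustrative assumptions.

    import torch
    import torch.nn as nn

    def build_adjacency(ca_coords, threshold=8.0):
        # ca_coords: (m, 3) candidate C-alpha positions from the upstream model.
        # Connect any two nodes closer than `threshold` with an undirected edge;
        # the 8.0 A cutoff is an assumed value. Self-loops (i = j) come for free,
        # since each node is at zero distance from itself.
        dist = torch.cdist(ca_coords, ca_coords)          # (m, m) pairwise distances
        return (dist < threshold).float()                 # adjacency A with a_ii = 1

    class GCNLayer(nn.Module):
        # One round of message passing following Kipf & Welling [18]:
        # H' = ReLU(D^{-1/2} A D^{-1/2} H W), with self-loops already in A.
        def __init__(self, dim_in, dim_out):
            super().__init__()
            self.linear = nn.Linear(dim_in, dim_out)

        def forward(self, adj, h):
            norm = adj.sum(dim=-1).rsqrt()                # D^{-1/2} as a vector
            adj_norm = norm[:, None] * adj * norm[None, :]
            return torch.relu(self.linear(adj_norm @ h))

    class GCNEmbedder(nn.Module):
        # f_node = NN(s) maps RotamerNet score vectors to node features, then
        # T stacked GCN layers (unshared parameters here) propagate messages.
        def __init__(self, score_dim=20, embed_dim=128, T=2):
            super().__init__()
            self.feature_nn = nn.Linear(score_dim, embed_dim)
            self.layers = nn.ModuleList(GCNLayer(embed_dim, embed_dim) for _ in range(T))

        def forward(self, adj, scores):
            h = self.feature_nn(scores)                   # F, one row per candidate node
            for layer in self.layers:
                h = layer(adj, h)
            return h                                      # H_node

With $T$ stacked layers, each node's embedding aggregates information from its $T$-hop neighborhood; along a chain of Cα contacts this corresponds to roughly a 5-residue window for $T = 2$, consistent with the 5-mer interpretation in Section IV.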
The Structure Generator then uses a bidirectional LSTM module as a decoder for the refined protein chain generation. We use zero vectors for the initial hidden and cell states. At each time step $t$, an embedding of an amino acid in the sequence, $h^{(t)}_{\mathrm{seq}} = \mathrm{NN}(c^{(t)}_{\mathrm{seq}})$, where $c^{(t)}_{\mathrm{seq}}$ is a one-hot encoding of the amino acid at position $t$ in the sequence, is fed into the LSTM cell. The cell output at each time step, $h^{(t)}_P$, can be viewed as the current graph representation of $P$, the protein structure to be built. A score $z^{(v)}_{\mathrm{node}}$ of a candidate node to be selected as the next node added to $P$ is determined by $z^{(v)}_{\mathrm{node}} = \mathrm{NN}(h^{(v)}_{\mathrm{add}} + h^{(t)}_P)$, where $h^{(v)}_{\mathrm{add}} = \mathrm{NN}(h^{(v)}_{\mathrm{node}})$. At each time step $t$, the node with the highest softmax score $p^{(v)}_{\mathrm{node}} = \exp(z^{(v)}_{\mathrm{node}}) / \sum_{v'} \exp(z^{(v')}_{\mathrm{node}})$ is selected and added to $P$.

The selection process continues until the end of the sequence, $t = N$, where $N$ is the length of the sequence, is reached, at which point the cross-entropy loss is computed for the entire sequence in the training stage, or the bitwise accuracy in the inference stage. We implemented the decoder with a bidirectional LSTM, in which the outputs from one LSTM fed with the forward sequence and another fed with the backward sequence are concatenated to obtain $h^{(t)}_P$ for each time step $t$. We found that a bidirectional LSTM consistently outperformed a uni-directional LSTM. We also found that using the ensemble of inference results with a forward (from the N-terminus) and a backward (from the C-terminus) sequence further improves the accuracy. Fig. 2(b) illustrates the generation process with the sequence as the input at each time step (top) and the best corresponding node prediction as output (bottom). Importantly, the sequence information is used both in the training and inference stages to guide the protein modeling. A minimal sketch of this decoding step is shown below.

Fig. 2. Architecture of the Structure Generator. (a) A graph convolutional network allows the embedding of each node to communicate through edges for $T$ rounds of propagation. (b) The protein sequence (SE...Q) is fed to the bidirectional LSTM to guide the modeling. The outputs from the forward (f) and backward (b) LSTM state at each time step are concatenated to predict the best node to add to the protein $P$. $N$ and $M$ are the length of the sequence and the number of nodes in the graph, respectively.
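As a concrete illustration, here is a minimal PyTorch sketch of the decoder, again not the authors' code; the dimensions (20 amino acid types, 128-dimensional node embeddings, 256 hidden units per direction) and all names are assumptions. PyTorch's LSTM defaults to zero initial hidden and cell states, matching the description above.

    import torch
    import torch.nn as nn

    class StructureDecoder(nn.Module):
        # Bidirectional LSTM decoder: consumes the known sequence one residue
        # at a time and scores every candidate node at each step.
        def __init__(self, n_amino=20, embed_dim=128, hidden=256):
            super().__init__()
            self.seq_nn = nn.Linear(n_amino, embed_dim)       # h_seq = NN(c_seq)
            self.lstm = nn.LSTM(embed_dim, hidden,
                                batch_first=True, bidirectional=True)
            self.node_nn = nn.Linear(embed_dim, 2 * hidden)   # h_add = NN(h_node)
            self.score_nn = nn.Linear(2 * hidden, 1)          # z = NN(h_add + h_P)

        def forward(self, seq_onehot, node_embed):
            # seq_onehot: (N, n_amino) one-hot sequence; node_embed: (m, embed_dim).
            h_seq = self.seq_nn(seq_onehot).unsqueeze(0)      # (1, N, embed_dim)
            h_p, _ = self.lstm(h_seq)                         # (1, N, 2*hidden), fwd/bwd concatenated
            h_add = self.node_nn(node_embed)                  # (m, 2*hidden)
            # Sum every (position, node) pair and score it: (N, m) logits.
            z = self.score_nn(h_p.squeeze(0)[:, None, :] + h_add[None, :, :])
            return z.squeeze(-1)

A softmax over the node axis of the returned logits gives $p^{(v)}_{\mathrm{node}}$; at inference, an argmax along that axis selects the node added to $P$ at each position, while training applies the summed cross-entropy loss of Section III-C.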
B. Training data

To train the Structure Generator, we randomly selected 1,000,000 and 100 sequences with lengths in [50, 500] from the UniRef50 dataset [26] for the training and validation sets, respectively. The remaining sequences (approximately 30 million) were kept untouched for future use. The median and mean sequence lengths in the validation set are … and …, respectively.

RotamerNet was trained on simulated density profiles of proteins, generated as follows. We selected high-quality protein structures from the Protein Data Bank, with resolution between … and … Å. We used phenix.fmodel to generate electron scattering factors with noise to simulate the cryo-EM density maps for … protein structures, … of which were used to train RotamerNet.

The order of the amino acid residues in a given protein structure is shuffled. Because the UniRef50 dataset contains only protein sequences, we assumed perfect Cα–Cα contact maps and simulated the input features, i.e., a normalized softmax score vector $s$ with $s_i = |\epsilon_i|$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ for $i \neq j$, with a small noise scale $\sigma$, and $s_j = 1 - \sum_{i \neq j} s_i$, where $j$ is the index corresponding to the ground-truth amino acid identity. The ground-truth sequence is used in both the training and inference stages.
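A sketch of this feature simulation in PyTorch; the noise scale (0.01) is an assumed value, since the printed constant did not survive extraction, and the function names are illustrative:

    import torch

    def simulate_scores(true_idx, n_amino=20, sigma=0.01):
        # true_idx: (n,) ground-truth amino acid index j for each residue.
        # Off-target entries receive |N(0, sigma^2)| noise; the ground-truth
        # entry absorbs the remaining mass so each row sums to 1, mimicking a
        # normalized softmax score vector from RotamerNet. sigma is an assumption.
        n = true_idx.shape[0]
        s = (torch.randn(n, n_amino) * sigma).abs()
        s.scatter_(1, true_idx[:, None], 0.0)                  # clear the true class
        s.scatter_(1, true_idx[:, None], (1.0 - s.sum(1))[:, None])
        return s

    # Residue order is shuffled before training, since RotamerNet output
    # carries no inherent sequence ordering.
    seq = torch.randint(0, 20, (60,))                          # toy 60-residue protein
    scores = simulate_scores(seq)[torch.randperm(60)]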
C. Training the Structure Generator

We trained the Structure Generator with the ADAM optimizer, with a batch size of … and a learning rate of … for the first 100,000 iterations, decreased to … for the rest. The sum of the cross-entropy loss over sequence positions, with ground-truth node index $j_t$ and the vector of normalized scores $p^{(t)} \in \mathbb{R}^m$,

$\mathrm{Loss} = -\sum_{t=1}^{n} \log p^{(t)}_{j_t}$,   (1)

where $n$ is the length of the ground-truth sequence and $m$ is the number of nodes in the raw graph, is calculated and back-propagated through the entire network, i.e., the LSTM and then the GCN. In the inference stage, the average accuracy

$\mathrm{AA} = \frac{1}{K} \sum_{\mathrm{prot}} \frac{1}{N_{\mathrm{prot}}} \sum_{t=1}^{N_{\mathrm{prot}}} \mathbb{1}(\hat{j}_t = j_t)$,   (2)

i.e., the fraction of amino acids whose identity was predicted correctly, is used to evaluate the performance of the Structure Generator on a set of $K$ protein structures, where $N_{\mathrm{prot}}$ denotes the sequence length of a protein, and $j_t$ and $\hat{j}_t$ are the ground-truth and predicted node index for the $t$-th step in the LSTM, respectively.

We trained the GCN with sequence embedding dimension …, node embedding dimension …, and LSTM hidden state dimension … (… for each direction). During training, we added $(500 - n)$ dummy (false positive) nodes with random edges in each iteration to complicate the training data.
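These quantities map directly onto standard tensor operations. The sketch below assumes the (N, m) logit matrix produced by the decoder sketch above; the dummy-node edge density is an assumed value, not taken from the paper.

    import torch
    import torch.nn.functional as F

    def sequence_loss(logits, true_nodes):
        # logits: (n, m) node scores at each sequence position; true_nodes: (n,).
        # Summed (not averaged) cross entropy over positions, as in Eq. (1).
        return F.cross_entropy(logits, true_nodes, reduction="sum")

    def average_accuracy(predictions, truths):
        # Eq. (2): mean over proteins of the per-protein fraction of correctly
        # assigned positions. `predictions`/`truths` are lists of (N_prot,) tensors.
        accs = [(p == t).float().mean() for p, t in zip(predictions, truths)]
        return torch.stack(accs).mean()

    def pad_with_dummies(scores, adj, total=500, density=0.01):
        # Pad a training graph with (total - n) false-positive nodes carrying
        # random normalized score vectors and random undirected edges; the
        # edge density is an assumption.
        n, k = scores.shape
        fake = torch.rand(total - n, k)
        fake = fake / fake.sum(1, keepdim=True)            # rows sum to 1
        scores = torch.cat([scores, fake], dim=0)
        big = torch.zeros(total, total)
        big[:n, :n] = adj
        r = torch.rand(total, total) < density
        r[:n, :n] = False                                  # random edges touch dummies only
        big = ((big + (r | r.T).float() + torch.eye(total)) > 0).float()
        return scores, big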
IV. RESULTS

We first examined the effect of the GCN on structure determination. We found that the number of GCN layers can dramatically improve the average accuracy on the validation set. For example, using two rather than one GCN layer, i.e., going from $T = 1$ to $T = 2$, yields a substantial improvement (Fig. 3(a)). Encouraged by this improvement, we further trained the Structure Generator with $T = \{3, 4\}$. Fig. 3(a) shows the error rate ($1 - \mathrm{AA}$) curves on the validation data for different numbers of GCN layers $T$. While increasing to $T = 3$ gives a further increase of about 0.003 in average accuracy, $T = 4$ adds only about 0.0004 (Table I). This observation suggests that $T = 2$, which can be interpreted as learning 5-mer spatial motifs in the graph (see discussion in IV-B), is sufficient for the model to capture the implicit structural information in the graph and the sequence. In the remainder, unless stated otherwise, we fixed $T = 2$ for inference in all experiments. Fig. 3(b) shows the error counts on the 100 protein structures in the validation set as a function of sequence length, suggesting that the error counts increase only mildly with the length of the sequence.

Fig. 3. Validation results, $N = 100$. (a) Error rate, defined as $1 -$ average accuracy, vs. training iterations for GCNs with different numbers of layers. (b) Error counts, i.e., the number of incorrect amino acid assignments in a protein structure, as a function of sequence length for the $T = 2$ model.

To study the efficacy of the GCN, we also tested a GCN with $T = 0$, i.e., the node features $F$ are added directly to the LSTM outputs. This model dramatically reduced the average accuracy to 0.0347 (Table I), which is approximately the probability of randomly selecting the correct node out of the candidate nodes whose amino acid identity matches the sequence input. This result indicates that a GCN embedding of RotamerNet's output is required for the LSTM to predict an ordered graph consistent with the sequence.
A. Performance of the Structure Generator on the ProteinNet dataset

Next, we evaluated the Structure Generator on the ProteinNet data set, a standardized machine learning sequence-structure dataset with standardized splits for the protein structure prediction and design community [27]. The CASP12 ProteinNet validation set used here has 224 structures with sequence lengths ranging from 20 to 689, with median … and mean …. We selected the same parameters as those for the UniRef50 dataset to generate simulated feature vectors, and generated Cα contact maps based on the backbone atom coordinates from ProteinNet. We note that a small number of Cα coordinates are absent in ProteinNet owing to lack of experimental data. Compared to the UniRef50 validation set, the ProteinNet validation set is therefore more challenging, as edges in the input graph can be missing. Validation results on ProteinNet are given in the second row of Table I.

TABLE I. AVERAGE ACCURACY ON VARIOUS VALIDATION DATASETS WITH DIFFERENT GCN LAYERS AND INFERENCE SETTINGS.

Dataset      Inference   T = 0    T = 1    T = 2    T = 3    T = 4
UniRef50     Forward     0.0347   0.7952   0.9949   0.9981   0.9985
             Ensemble    0.0347   0.8383   0.9954   0.9986   0.9987
ProteinNet   Forward     -        0.8214   0.9899   0.9957   0.9960
             Ensemble    -        0.8577   0.9908   0.9960   0.9963
RotamerNet   Forward     -        0.6162   0.7443   0.6912   0.6853
             Ensemble    -        0.6409   0.7538   0.7060   0.6965

Nonetheless, well over … of the structures in the data set are correctly predicted without any errors (Fig. 4(a)). Remarkably, the Structure Generator can correctly predict amino acids for which the Cα records are missing. For example, atomic coordinates for the first three and last two amino acids of prosurvival protein A1 (PDB ID 2vog) are missing, but the Structure Generator can still completely reconstruct the protein model (Fig. 4(b)). Several amino acids, for example glutamine (Q), occur multiple times in the sequence. As a result, the last two rows in the Structure Generator output have repeating patterns (Fig. 4(c)), which did not prevent the Structure Generator from predicting the correct nodes for each of the positions corresponding to glutamine.

B. Performance of the Structure Generator on RotamerNet data
Finally, to demonstrate the utility of our approach with an upstream machine learning approach, we tested the Structure Generator on output from RotamerNet. The RotamerNet validation data set consists of the amino acid type classification scores for simulated cryo-EM density maps from … protein structures of various lengths, from … to …, with nominal resolutions ranging from … to 1.8 Å. The average accuracy of the RotamerNet validation set is …, meaning that a non-trivial fraction of the input features for the Structure Generator is noisy or incorrect. Fig. 5(a) shows the RotamerNet and Structure Generator accuracy for the proteins at various sequence lengths. Remarkably, while the performance of the Structure Generator is largely limited by the RotamerNet accuracy (data points beneath the dashed gray line in Fig. 5(a)), as indicated by the correlation, a number of proteins have higher Structure Generator accuracy than RotamerNet accuracy, suggesting that the Structure Generator can tolerate and recover from errors made by upstream machine learning approaches.

Fig. 4. Test on the ProteinNet CASP12 validation set. (a) Error counts, i.e., the number of incorrect amino acid assignments in a protein structure, as a function of sequence length, and the histogram, with the $T = 2$ model. (b) Contact map of prosurvival protein A1 (PDB ID 2vog). Green pixels are contacts and purple pixels indicate where Cα coordinates are unknown and thus the contacts are missing. (c) The output of the Structure Generator on 2vog. Red means higher probability whereas blue means less likely.

To understand the characteristics of the Structure Generator, we plot the confusion matrix for the C-terminal calponin homology domain of alpha-parvin (PDB ID 2vzg). Among those amino acids whose identity and position are correctly predicted by RotamerNet and the Structure Generator (Fig. 5(b), blue dots on the diagonal), there are two red dots indicating that a prediction error from RotamerNet does not necessarily prevent the Structure Generator from making correct predictions. Again, we point out that the Structure Generator has been trained only on UniRef50 sequences and simulated features, and has not been fine-tuned with the RotamerNet data. We anticipate that either doing so or training jointly with the upstream model will further enhance the performance of the Structure Generator.

Fig. 5. Results on the RotamerNet data. (a) Structure Generator accuracy vs. RotamerNet accuracy. Each data point represents a protein structure. The color code indicates the length of the structure. (b) Amino acid positions of a selected structure, PDB ID 2vzg, predicted by the Structure Generator. Red pixels are where RotamerNet made an incorrect prediction of the amino acid type. The sequence on the top is derived from RotamerNet output and the one on the right is the ground truth.

We observe in Table I that the $T = 2$ model performs best on the RotamerNet data. The Structure Generator relies on learning the correlation between the graph embedding and the motifs in the sequence. Increasing the number of GCN layers in principle allows the Structure Generator to recognize longer n-grams and spatial motifs of increased connectivity length. However, such correspondences become increasingly noisy as lengths increase. Based on this observation, we therefore suggest that $T = 2$ is a practical choice.

V. CONCLUSION
Building an atomic model into a map is a time- and labor-intensive step in single-particle cryo-EM structure determination, and mostly relies on traditional search algorithms that cannot exploit recent advancements in GPU computing and deep learning. To address these shortcomings, we have presented the Structure Generator, a fully neural-network-based pipeline that can build a protein structural model from a set of unordered candidate amino acids generated by other machine learning models. Our experiments show that a GCN can effectively encode the output from the upstream model as a graph, while a bidirectional LSTM can precisely decode and generate a directed amino acid chain, even when the input contains false or erroneous entries. Our experiments suggest that a two-layer GCN is sufficient for processing the raw graph while preventing over-fitting to the training data.

The Structure Generator exploits genetic information to guide the protein structure generation, and showed promising results on the RotamerNet data set without fine-tuning. Training on the ProteinNet dataset and fine-tuning on the RotamerNet dataset should further enhance performance. While a practical machine learning model for cryo-EM map-to-model is still a work in progress, in part because of the lack of high-resolution experimental data [28] for training, our proposed framework can complement existing approaches and ultimately pave the way toward a fully trainable, end-to-end machine learning map-to-model pipeline, making human intervention-free protein modeling in a fraction of a minute possible.
CURRENT AFFILIATION
This work was initiated when S.H.d.O. and H.v.d.B. were at SLAC National Accelerator Laboratory. S.H.d.O. is currently at Frontier Medicines, CA, USA. In addition to his position at Atomwise, H.v.d.B. is on the faculty of the Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA.
REFERENCES

[1] Ewen Callaway, "Revolutionary cryo-EM is taking over structural biology," Nature, 201 (2020).
[2] F. DiMaio and W. Chiu, "Chapter Ten – Tools for Model Building and Optimization into Near-Atomic Resolution Electron Cryo-Microscopy Density Maps," in Methods in Enzymology, 255–276, edited by R. A. Crowther (2016).
[3] Thomas C. Terwilliger, Paul D. Adams, Pavel V. Afonine, and Oleg V. Sobolev, "A fully automatic method yielding initial models from high-resolution cryo-electron microscopy maps," Nature Methods, 905–908 (2018).
[4] Thomas C. Terwilliger, Paul D. Adams, Pavel V. Afonine, and Oleg V. Sobolev, "Map segmentation, automated model-building and their application to the Cryo-EM Model Challenge," J. Structural Biology, 338–343 (2018).
[5] Yifan Song, Frank DiMaio, Ray Yu-Ruei Wang, David Kim, Chris Miles, T. J. Brunette, James Thompson, and David Baker, "High-Resolution Comparative Modeling with RosettaCM," Structure, 1735–1742 (2013).
[6] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy, "Speed/accuracy trade-offs for modern convolutional object detectors," arXiv:1611.10012 (2017).
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," arXiv:1703.06870 (2018).
[8] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," arXiv:2004.10934 (2020).
[9] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko, "End-to-End Object Detection with Transformers," arXiv:2005.12872 (2020).
[10] Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis, "Improved protein structure prediction using potentials from deep learning," Nature, 706–710 (2020).
[11] Rongjian Li, Dong Si, Tao Zeng, Shuiwang Ji, and Jing He, "Deep Convolutional Neural Networks for Detecting Secondary Structures in Protein Density Maps from Cryo-Electron Microscopy," 2016 IEEE BIBM (2016).
[12] Mark Rozanov and Haim J. Wolfson, "AAnchor: CNN guided detection of anchor amino acids in high resolution cryo-EM density maps," 2018 IEEE BIBM (2018).
[13] Sai Raghavendra Maddhuri Venkata Subramaniya, Genki Terashi, and Daisuke Kihara, "Protein secondary structure detection in intermediate-resolution cryo-EM maps using deep learning," Nature Methods, 911–917 (2019).
[14] Philipp Mostosi, Hermann Schindelin, Philip Kollmannsberger, and Andrea Thorn, "Haruspex: A Neural Network for the Automatic Identification of Oligonucleotides and Protein Secondary Structure in Cryo-Electron Microscopy Maps," Angew. Chem. Int. Ed., 2–10 (2020).
[15] Kui Xu, Zhe Wang, Jianping Shi, Hongsheng Li, and Qiangfeng Cliff Zhang, "A²-Net: molecular structure estimation from Cryo-EM density volumes," arXiv:1901.00785 (2019).
[16] Dong Si, Spencer A. Moritz, Jonas Pfab, Jie Hou, Renzhi Cao, Liguo Wang, Tianqi Wu, and Jianlin Cheng, "Deep learning to predict protein backbone structure from high-resolution cryo-EM density maps," Scientific Reports, 4282 (2020).
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE CVPR (2016).
[18] Thomas N. Kipf and Max Welling, "Semi-supervised classification with graph convolutional networks," ICLR 2017 (2017).
[19] Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M. Church, "Unified rational protein engineering with sequence-based deep representation learning," Nature Methods, 1315–1322 (2019).
[20] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, 61–80 (2009).
[21] David K. Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams, "Convolutional Networks on Graphs for Learning Molecular Fingerprints," NIPS 2015 (2015).
[22] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia, "Learning deep generative models of graphs," arXiv:1803.03324 (2018).
[23] Alex Graves, "Generating Sequences With Recurrent Neural Networks," arXiv:1308.0850 (2013).
[24] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, 1735–1780 (1997).
[25] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Trevor Darrell, "Long-term recurrent convolutional networks for visual recognition and description," CVPR 2015 (2015).
[26] Baris E. Suzek, Yuqi Wang, Hongzhan Huang, Peter B. McGarvey, Cathy H. Wu, and the UniProt Consortium, "UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches," Bioinformatics, 926–932 (2015).
[27] Mohammed AlQuraishi, "ProteinNet: a standardized data set for machine learning of protein structure," BMC Bioinformatics, 20 (2019).