Learning from Protein Structure with Geometric Vector Perceptrons
Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael J.L. Townshend, Ron Dror
Bowen Jing*
Department of Computer Science, Stanford University
[email protected]

Stephan Eismann*
Department of Applied Physics, Stanford University
[email protected]

Patricia Suriana
Department of Computer Science, Stanford University
[email protected]

Raphael J.L. Townshend
Department of Computer Science, Stanford University
[email protected]

Ron Dror
Department of Computer Science, Stanford University
[email protected]

*Equal contribution

Abstract
Learning on 3D structures of large biomolecules is emerging as a distinct area in machine learning, but there has yet to emerge a unifying network architecture that simultaneously leverages the graph-structured and geometric aspects of the problem domain. To address this gap, we introduce geometric vector perceptrons, which extend standard dense layers to operate on collections of Euclidean vectors. Graph neural networks equipped with such layers are able to perform both geometric and relational reasoning on efficient and natural representations of macromolecular structure. We demonstrate our approach on two important problems in learning from protein structure: model quality assessment and computational protein design. Our approach improves over existing classes of architectures, including state-of-the-art graph-based and voxel-based methods.
Introduction

Many efforts in structural biology aim to predict, or derive insights from, the structure of a macromolecule (such as a protein, RNA, or DNA), represented as a set of positions associated with atoms or groups of atoms in 3D Euclidean space. These problems can often be framed as functions mapping the input domain of structures to some property of interest—for example, predicting the quality of a structural model or determining whether two molecules will bind. Thanks to their importance and difficulty, such problems, which we broadly refer to as learning from structure, have recently developed into an exciting and promising application area for deep learning [12, 13, 15, 23, 27, 31].

Successful applications of deep learning are often driven by techniques that leverage the problem structure of the domain—for example, convolutions in computer vision [8] and attention in natural language processing [29]. What are the relevant considerations in the domain of learning from structure? Using proteins as the most common example, the spatial arrangement and orientation of the amino acids govern the topology, dynamics, and function of the molecule [4]. These properties are in turn mediated by key pairwise and higher-order residue-residue interactions [10, 14]. We refer to these as the geometric and relational aspects of the problem domain, respectively.

Current state-of-the-art methods for learning from structure succeed by leveraging one of these two aspects. Commonly, such methods employ either graph neural networks (GNNs), which are expressive in terms of relational reasoning [3], or convolutional neural networks (CNNs), which operate directly on spatial features. Here, we present a unifying architecture that bridges spatial and graph-based methods to leverage both aspects of the problem domain. We do so by introducing geometric vector perceptrons (GVPs), a drop-in replacement for standard multi-layer perceptrons (MLPs) in the aggregation and feed-forward layers of GNNs. GVPs operate directly on both scalar and geometric features—features that transform as vectors under a rotation of spatial coordinates. GVPs therefore allow for the embedding of geometric information at nodes and edges without reducing such information to hand-picked scalars that may not fully capture complex geometric relationships. We postulate that our approach makes it easier for a network to learn functions whose significant features are both geometric and relational.

Our method (GVP-GNN) can be applied to any problem where the input domain is a structure of a single macromolecule or of molecules bound to one another. In this work, we demonstrate our approach on two important problems connected to protein structure: model quality assessment and computational protein design. Model quality assessment (MQA) aims to select the best structural model of a protein from a large pool of candidate structures and is a crucial step in protein structure prediction [6]. Computational protein design (CPD) is the conceptual inverse of structure prediction, aiming to infer an amino acid sequence that will fold into a given structure. Our method outperforms existing methods on both tasks.

Related Work

Current methods for learning from protein structure generally use one of three classes of structural representations, which we outline below. We also describe representative and state-of-the-art examples for MQA and CPD to set the stage for our experiments later.
Sequential representations
One way of representing a protein structure is as a sequence of feature vectors, one for each amino acid, that can be input into a 1D convolutional network. These representations intrinsically encode neither geometric nor relational information and instead rely on hand-selected features to represent the 3D structural neighborhood of each amino acid. Such features can include contact-based features [21], orientations or positions collectively projected to local coordinates [16, 30], and physics-inspired energy terms [20, 28]. These methods were among the earliest developed for learning from structure, and a number of them (ProQ3D [28], VoroMQA [21], SBROD [16]) remain competitive in assessments of MQA.
Graph-based representations
Graph-based methods represent amino acids or individual atoms as nodes and draw edges based on chemical bonds or spatial proximity, and are well-suited for complex relational reasoning [3]. These methods reduce the challenging task of representing the collective structural neighborhood of an amino acid to that of representing individual edges. The CPD method ProteinSolver [26], for example, uses pairwise Euclidean and sequence distance to represent edges. However, to encode more complex geometric information, a choice of scalar representation scheme is again necessary; i.e., vector-valued features whose components are intrinsically linked need to be decomposed into individual scalars. For example, the MQA method ProteinGCN [24] projects pairwise relative positions into local coordinates. The CPD method Structured Transformer [15] additionally uses a quaternion representation of pairwise relative orientations.
Voxel-based representations
Many structural biology methods voxelize proteins or neighborhoods of amino acids into occupancy maps and then perform hierarchical 3D convolutions. These architectures circumvent the need for hand-picked representations of geometric information altogether and instead operate directly on approximate positions in 3D space. However, CNNs do not directly encode relational information within the structure. In addition, standard CNNs do not intrinsically account for the symmetries of the problem space. To address the latter, early methods, such as the MQA method 3DCNN [11], learned rotation invariance via data augmentation. More recent MQA methods such as Ornate [22] and many end-to-end CPD models [1, 32] instead define voxels based on a local coordinate system.

Voxel-based and graph-based architectures, catering to the geometric and relational aspects of the problem domain, respectively, have complementary strengths and weaknesses. To address this, some recent work on small molecules aims to leverage geometric features in GNNs through voxelized convolutions at each graph node [25], message transformation based on encoded angle information [17], or products with position vectors to account for vector features [7]. Our contribution is similar in spirit but more general and expressive in methodology. With the introduction of GVPs, we provide a conceptually simple augmentation of GNNs that allows for arbitrary scalar and vector features at all steps of graph propagation and has the desired equivariance properties with respect to rotations of atomic structures in space.
Methods

In this section, we formally introduce the geometric vector perceptron, discuss datasets and evaluation metrics for quality assessment and protein design, and describe the protein representations and model architectures used for those tasks.

Figure 1: (A) Schematic of the geometric vector perceptron illustrating the computation shown in Algorithm 1. Given a tuple of scalar and vector input features $(s, V)$, the perceptron computes an updated tuple $(s', V')$. $s'$ is a function of both $s$ and $V$. (B) Illustration of the structure-based prediction tasks. In model quality assessment (top), the goal is to predict a quality score given the 3D structure of a candidate model. Individual atoms are represented as colored spheres. The quality score measures the accuracy of a candidate structure with respect to an experimentally determined structure (shown in gray). In computational protein design (bottom), the goal is to predict an amino acid sequence that would fold into a given protein backbone structure.

Geometric vector perceptrons
Given a tuple $(s, V)$ of scalar features $s \in \mathbb{R}^n$ and vector features $V \in \mathbb{R}^{\nu \times 3}$, the geometric vector perceptron computes new features $(s', V') \in \mathbb{R}^m \times \mathbb{R}^{\mu \times 3}$. As outlined in pseudocode in Algorithm 1 and illustrated in Figure 1A, the computation is analogous to the linear combination of inputs in normal dense layers, with the addition of a norm operation and an intermediary hidden layer, which allow information in the vector channels to be extracted as scalars.
Algorithm 1: Geometric vector perceptron
Input: scalar and vector features $(s, V) \in \mathbb{R}^n \times \mathbb{R}^{\nu \times 3}$.
Output: scalar and vector features $(s', V') \in \mathbb{R}^m \times \mathbb{R}^{\mu \times 3}$.

$h \leftarrow \max(\nu, \mu)$
$V_h \leftarrow W_h V \in \mathbb{R}^{h \times 3}$
$V_\mu \leftarrow W_\mu V_h \in \mathbb{R}^{\mu \times 3}$
$s_h \leftarrow \lVert V_h \rVert$ (row-wise) $\in \mathbb{R}^h$
$v_\mu \leftarrow \lVert V_\mu \rVert$ (row-wise) $\in \mathbb{R}^\mu$
$s_{h+n} \leftarrow \mathrm{concat}(s_h, s) \in \mathbb{R}^{h+n}$
$s_m \leftarrow W_m s_{h+n} + b \in \mathbb{R}^m$
$s' \leftarrow \sigma(s_m) \in \mathbb{R}^m$
$V' \leftarrow \sigma^+(v_\mu) \odot V_\mu$ (row-wise multiplication) $\in \mathbb{R}^{\mu \times 3}$
return $(s', V')$

$W_h \in \mathbb{R}^{h \times \nu}$, $W_\mu \in \mathbb{R}^{\mu \times h}$, and $W_m \in \mathbb{R}^{m \times (n+h)}$ are weight matrices and $b \in \mathbb{R}^m$ is a bias vector. $\sigma$ and $\sigma^+$ denote activation functions. In our experiments, we choose ReLU and sigmoid for $\sigma$ and $\sigma^+$, respectively, although other nonlinearities may be used. Additionally, $h$ may also be defined independently, but we use the maximum of $\nu, \mu$ for convenience.
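As a concrete illustration, below is a minimal PyTorch sketch of Algorithm 1. This is not the authors' released implementation: the class and argument names are ours, and we store vector features as arrays whose last two axes are (channels, xyz). Vector channels are mixed only by bias-free linear maps, which is what preserves the equivariance discussed next.

```python
import torch
import torch.nn as nn

class GVP(nn.Module):
    """Minimal sketch of Algorithm 1: maps (s, V) in R^n x R^{nu x 3}
    to (s', V') in R^m x R^{mu x 3}."""
    def __init__(self, n, nu, m, mu):
        super().__init__()
        h = max(nu, mu)                            # h <- max(nu, mu)
        self.W_h = nn.Linear(nu, h, bias=False)    # mixes vector channels only
        self.W_mu = nn.Linear(h, mu, bias=False)
        self.W_m = nn.Linear(h + n, m)             # includes the bias b

    def forward(self, s, V):
        # V has shape (..., nu, 3); the linear maps act on the channel axis,
        # so we transpose the channel and xyz axes around each map.
        V_h = self.W_h(V.transpose(-1, -2)).transpose(-1, -2)      # (..., h, 3)
        V_mu = self.W_mu(V_h.transpose(-1, -2)).transpose(-1, -2)  # (..., mu, 3)
        s_h = V_h.norm(dim=-1)                     # row-wise L2 norms, (..., h)
        v_mu = V_mu.norm(dim=-1, keepdim=True)     # (..., mu, 1)
        s_out = torch.relu(self.W_m(torch.cat([s_h, s], dim=-1)))  # sigma = ReLU
        V_out = torch.sigmoid(v_mu) * V_mu         # sigma+ = sigmoid, row-wise
        return s_out, V_out
```

For instance, `GVP(n=6, nu=3, m=100, mu=16)` would map 6 scalar and 3 vector channels to 100 scalar and 16 vector channels; these sizes are illustrative, not the paper's hyperparameters.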
Despite its conceptual simplicity, the GVP offers some desirable properties. First, the vector and scalar outputs of the GVP are equivariant and invariant, respectively, with respect to an arbitrary composition of rotations and reflections in 3D Euclidean space described by $R$, i.e.,

$$\mathrm{GVP}((s, R(V))) = (s', R(V')) \qquad (1)$$

This is due to the fact that the only operations on vector-valued inputs are scalar multiplication, linear combination, and the $L_2$ norm. We include a formal proof in the Supplementary Information.

In addition, a GVP $G_s$ defined with $n, \mu = 0$—that is, the part of a GVP that transforms vector features to scalar features—is able to $\epsilon$-approximate a function $F : \mathbb{R}^{\nu \times 3} \to \mathbb{R}$ that is invariant with respect to rotations and reflections in 3D under mild assumptions.

Theorem. Let $R$ describe an arbitrary rotation or reflection in $\mathbb{R}^3$. For $\nu \ge 3$, let $\Omega_\nu \subset \mathbb{R}^{\nu \times 3}$ be the set of all $V = [v_1, \ldots, v_\nu]^T \in \mathbb{R}^{\nu \times 3}$ such that $v_1, v_2, v_3$ are linearly independent and $0 < \lVert v_i \rVert \le b$ for all $i$ and some finite $b > 0$. Then for any continuous $F : \Omega_\nu \to \mathbb{R}$ such that $F(R(V)) = F(V)$ and for any $\epsilon > 0$, there exists a form $f(V) = w^T G_s(V)$ such that $|F(V) - f(V)| < \epsilon$ for all $V \in \Omega_\nu$.

Proof. In the Supplementary Information.

As a corollary, a GVP with nonzero $n, \mu$ is also able to approximate similarly defined functions over the full input domain $\mathbb{R}^n \times \mathbb{R}^{\nu \times 3}$.

Finally, the GVP layer is nearly as fast as normal dense layers, incurring additional overhead only in reshaping and concatenation operations.

In addition to the GVP layer itself, we use a version of dropout that drops entire vector channels at random (as opposed to components within vector channels). We also introduce layer normalization for the vector features as

$$v_i \leftarrow v_i \Big/ \sqrt{\frac{1}{|\{j\}|} \sum_j \lVert v_j \rVert^2} \quad \forall i, \qquad (2)$$

where $v_i \in \mathbb{R}^3$ are the individual row vectors of the vector feature matrix $V \in \mathbb{R}^{\nu \times 3}$. This vector layer norm has no trainable parameters, but we continue to use normal layer normalization on scalar channels with trainable parameters $\gamma, \beta$.
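The vector-channel dropout and the vector layer norm of Eq. (2) reduce to a few lines under the same shape conventions; a sketch (our function names; the 1/(1 − p) rescaling is an assumption carried over from standard dropout, as the excerpt does not specify it):

```python
import torch

def vector_dropout(V, p, training=True):
    """Drop entire vector channels at random; V has shape (..., nu, 3).
    A dropped channel zeroes all three of its components together."""
    if not training or p == 0:
        return V
    keep = torch.bernoulli(torch.full(V.shape[:-1], 1 - p, device=V.device))
    return V * keep.unsqueeze(-1) / (1 - p)   # rescale as in standard dropout

def vector_layer_norm(V, eps=1e-8):
    """Eq. (2): divide every channel by the root-mean-square channel norm.
    No trainable parameters, unlike the scalar-channel LayerNorm."""
    ms = V.square().sum(dim=-1).mean(dim=-1, keepdim=True)  # mean of ||v_j||^2
    return V / torch.sqrt(ms + eps).unsqueeze(-1)
```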
Tasks and datasets

We benchmark GVP-augmented graph neural networks on two distinct challenges in structural biology: model quality assessment and protein design. Figure 1B illustrates these tasks.

Model quality assessment

Model quality assessment aims to select the best structural model of a protein from a large pool of candidate structures. The performance of different MQA methods is regularly assessed in the biennial blind prediction competition CASP, of which the 13th was the most recent [6]. For a number of recently solved but unreleased structures, called targets, structure generation programs produce a large number of candidate structures. MQA methods are ranked by how well they are able to predict the GDT-TS score of a candidate structure. GDT-TS is a measure of how similar two protein backbones are after global alignment; specifically, it is the mean of

$$\mathrm{GDT\text{-}N} = \frac{1}{\#\mathrm{residues}} \sum_{\text{all residues } i} \mathbb{1}\left[d(\mathrm{target}_i, \mathrm{candidate}_i) < N \text{ Angstroms}\right] \qquad (3)$$

for $N = 1, 2, 4, 8$, where $d$ is the Euclidean distance. In addition to accurately predicting the absolute quality of a candidate structure, a good MQA method should also be able to accurately assess the relative model qualities among a pool of candidates for a given target so that the best ones can be selected, perhaps for further refinement. Therefore, MQA methods are commonly evaluated on two metrics: a global correlation between the predicted and ground truth scores, pooled across all targets, and the average per-target correlation among only the candidate structures for a specific target [5, 11, 22].

To fulfill this aim, we train our MQA model on an absolute loss and a pairwise loss. That is, for each training step we intake pairs $i, j$ where $i, j$ are candidate structures for the same target and compute

$$\mathcal{L} = H\left(y^{(i)} - \hat{y}^{(i)}\right) + H\left(y^{(j)} - \hat{y}^{(j)}\right) + H\left(\left(y^{(i)} - y^{(j)}\right) - \left(\hat{y}^{(i)} - \hat{y}^{(j)}\right)\right) \qquad (4)$$

where $H$ is the Huber loss. When reshuffling at the beginning of each epoch, we also randomly pair up the candidate structures for each target. Interestingly, adding the pairwise term also improves global correlation, likely because the much larger number of possible pairs makes it more difficult to overfit.

Footnote: We refer to models of protein structure as "structural models" or "candidate structures" to avoid confusion with the term "model" as used in the ML community.
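To make Eqs. (3) and (4) concrete, here is a short sketch. It assumes the target and candidate Cα coordinates are already globally aligned, and uses PyTorch's `huber_loss` as a stand-in for $H$ (the excerpt does not specify the Huber delta):

```python
import torch
import torch.nn.functional as F

def gdt_ts(target, candidate):
    """Eq. (3): mean of GDT-N over N in {1, 2, 4, 8} Angstroms.
    target, candidate: (num_residues, 3) coordinates, already aligned."""
    d = (target - candidate).norm(dim=-1)   # per-residue distances
    return torch.stack([(d < N).float().mean()
                        for N in (1.0, 2.0, 4.0, 8.0)]).mean()

def mqa_loss(y_i, y_j, yhat_i, yhat_j):
    """Eq. (4): absolute Huber terms for each candidate, plus a pairwise
    term penalizing errors in the predicted quality difference."""
    return (F.huber_loss(yhat_i, y_i)
            + F.huber_loss(yhat_j, y_j)
            + F.huber_loss(yhat_i - yhat_j, y_i - y_j))
```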
We train and validate on the set of candidate structures generated in the CASP 5-10 assessments, which collectively contain 528 targets and 79200 candidate structures. For testing, we predict model quality for a total of 20880 stage 1 and stage 2 candidate structures from CASP 11 (84 targets) and CASP 12 (40 targets).
Computational protein design

Computational protein design is the conceptual inverse of structure prediction, aiming to infer an amino acid sequence that will fold into a given structure. In comparison to model quality assessment, computational protein design is more difficult to unambiguously benchmark, as some structures may correspond to a large space of sequences and others may correspond to none at all. Therefore, the proxy metric of native sequence recovery—splitting the set of all known native structures in the PDB and attempting to design the sequences corresponding to held-out structures—is typically used to benchmark CPD models [19, 20, 30]. Drawing an analogy between sequence design and language modelling, Ingraham et al. [15] also evaluate the model perplexity on held-out native sequences. Both metrics rest on the implicit assumption that native sequences are optimized for their structures [18] and should be assigned high probabilities.

To best approximate real-world applications that may require design of novel structures, the held-out evaluation set should bear minimal similarity to the training structures. We use the CATH 4.2 dataset curated by Ingraham et al. [15], in which all available structures with 40% nonredundancy are partitioned by their CATH (class, architecture, topology/fold, homologous superfamily) classification. The canonical training, validation, and test splits consist of 18204, 608, and 1120 structures, respectively.
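A minimal sketch of these two evaluation metrics as we read them (our formulation; the exact sampling protocol for recovery is described with Table 3 below):

```python
import torch
import torch.nn.functional as F

def perplexity(logits, native_seq):
    """Per-residue perplexity of the model's amino-acid distributions.
    logits: (L, 20) unnormalized scores; native_seq: (L,) integer labels."""
    return torch.exp(F.cross_entropy(logits, native_seq))

def recovery(designed_seq, native_seq):
    """Native sequence recovery: fraction of positions where a designed
    sequence matches the held-out native sequence."""
    return (designed_seq == native_seq).float().mean()
```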
Protein representations and model architectures

In a computation graph, a GVP can be placed at any node typically occupied by a dense layer. To most directly augment relational representations with geometric information, we modify GNNs by permitting all edge and node embeddings to have geometric vector and scalar features, and then use GVPs at all graph propagation and point-wise feed-forward steps.
Representations of proteins
A protein structure is a sequence of amino acids, where each amino acid consists of four backbone atoms and a set of sidechain atoms located in 3D Euclidean space. Here we represent only the backbone because our MQA benchmark corresponds to the assessment of backbone structure. In CPD, the sidechains are by definition unknown.

Let $X_i$ be the position of atom X in the $i$th amino acid (where X is a $C_\alpha$, C, N, or O atom). We represent backbone structure as a graph $G = (V, E)$ where each node $v_i \in V$ corresponds to an amino acid and has embedding $h_v^{(i)}$ with the following features:

• Scalar features $\{\sin, \cos\} \times \{\phi, \psi, \omega\}$, where $\phi, \psi, \omega$ are the dihedral angles computed from $C_{i-1}$, $N_i$, $C_{\alpha,i}$, $C_i$, and $N_{i+1}$.
• The forward and reverse unit vectors in the directions of $C_{\alpha,i+1} - C_{\alpha,i}$ and $C_{\alpha,i-1} - C_{\alpha,i}$, respectively.
• The unit vector in the imputed direction of $C_{\beta,i} - C_{\alpha,i}$. This is computed by assuming tetrahedral geometry and normalizing
$$\sqrt{\tfrac{1}{3}}\, \frac{n \times c}{\lVert n \times c \rVert} - \sqrt{\tfrac{2}{3}}\, \frac{n + c}{\lVert n + c \rVert},$$
where $n = N_i - C_{\alpha,i}$ and $c = C_i - C_{\alpha,i}$. The three unit vectors unambiguously define the orientation of each amino acid residue (see the sketch below).
• A one-hot representation of amino acid identity, when available.

The set of edges is $E = \{e_{i \to j}\}_{i \ne j}$ for all $i, j$ where $v_i$ is among the $k = 30$ nearest neighbors of $v_j$, as measured by the distance between their $C_\alpha$ atoms. This is motivated by the fact that if two entities have a relational dependency, they are likely to be close together in space. Each edge has an embedding $h_e^{(i \to j)}$ with the following features:

• The unit vector in the direction of $C_{\alpha,i} - C_{\alpha,j}$.
• The radial basis encoding of the distance, $\mathrm{RBF}(\lVert C_{\alpha,i} - C_{\alpha,j} \rVert)$.
• A sinusoidal encoding of $i - j$ as described in Vaswani et al. [29], representing distance along the backbone.

Footnote: The backbone atoms are $C_\alpha$, C, N, and O. $C_\beta$ is the second carbon from the carboxyl carbon C.
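Two of the features above reduce to short computations; a sketch (the Cβ formula is taken directly from the bullet above, while the RBF center range and count are illustrative assumptions, as the excerpt does not specify them):

```python
import torch

def imputed_cb_direction(N, CA, C):
    """Unit vector in the imputed direction of C_beta - C_alpha, assuming
    tetrahedral geometry: sqrt(1/3)(n x c)/|n x c| - sqrt(2/3)(n + c)/|n + c|."""
    n, c = N - CA, C - CA
    cross = torch.cross(n, c, dim=-1)
    term1 = (1 / 3) ** 0.5 * cross / cross.norm(dim=-1, keepdim=True)
    term2 = (2 / 3) ** 0.5 * (n + c) / (n + c).norm(dim=-1, keepdim=True)
    return term1 - term2   # unit length: the two terms are orthogonal

def rbf(d, d_min=0.0, d_max=20.0, count=16):
    """Radial basis encoding of a distance d with evenly spaced Gaussian
    centers (spacing and count here are assumptions, not the paper's)."""
    centers = torch.linspace(d_min, d_max, count, device=d.device)
    width = (d_max - d_min) / count
    return torch.exp(-((d.unsqueeze(-1) - centers) / width) ** 2)
```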
In our notation, each feature vector $h$ is a concatenation of scalar and vector features as described above. Collectively, these features are sufficient for a complete description of the protein backbone. In particular, whereas previous graph-based representations depended on scalar edge embeddings to represent relative orientations and positions, we are able to directly encode the absolute orientations of each amino acid and relative positions in the equivalent of fewer real-valued channels.

Network architecture
Our GVP-GNN takes as input the protein graph defined above and performs graph propagation steps that update the node embeddings according to:

$$h_m^{(i \to j)} := g\left(\mathrm{concat}\left(h_v^{(i)}, h_e^{(i \to j)}\right)\right) \qquad (5)$$

$$h_v^{(j)} \leftarrow \mathrm{LayerNorm}\left(h_v^{(j)} + \frac{1}{|\{i : e_{i \to j} \in E\}|}\, \mathrm{Dropout}\left(\sum_{i : e_{i \to j} \in E} h_m^{(i \to j)}\right)\right) \qquad (6)$$

where $g$ is a composition of GVPs. We do not update edge embeddings and do not use a global graph embedding. Between node update steps, we also use a residual feed-forward point-wise update layer:

$$h_v^{(i)} \leftarrow \mathrm{LayerNorm}\left(h_v^{(i)} + \mathrm{Dropout}\left(g\left(h_v^{(i)}\right)\right)\right) \qquad (7)$$

In model quality assessment, the network performs regression against the true quality score of a candidate structure, a global scalar property. To obtain a single global representation, we apply a node-wise GVP to reduce all features to scalars after all graph propagation steps, and then average the representations across all nodes. A final dense network with dropout and layer normalization then outputs the network's prediction.

Footnote: We also tried learning a weighted average of nodes, but this led to increased overfitting.

In computational protein design, the network learns a generative model over the space of protein sequences conditioned on the given backbone structure. Following Ingraham et al. [15], we frame this as an autoregressive task and use a masked encoder-decoder architecture to capture the joint distribution over all positions: for each position $i$, the network models the distribution over amino acids at $i$ based on the complete structure graph, as well as the sequence information at positions $j < i$. The encoder first performs graph propagation on the structural information only. Then, sequence information is added to the graph, and the decoder performs further graph propagation where incoming messages $h_m^{(i \to j)}$ for $i \ge j$ are computed only with the encoder embeddings. Finally, we use one last GVP with 20-way scalar output and softmax activation to output the probability of the amino acids.

The MQA network, structure encoder, and masked decoder each use 3 graph propagation layers. The feed-forward modules consist of GVPs, and the message-gather function uses 3.
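A scalar-only sketch of one propagation layer, Eqs. (5)-(7), follows. In the actual model, $g$ is a composition of GVPs and the LayerNorm and Dropout are the vector-aware variants defined earlier; here we collapse everything to scalar channels to show just the residual message-passing skeleton (all names are ours):

```python
import torch
import torch.nn as nn

class PropagationLayer(nn.Module):
    """Scalar-only sketch of Eqs. (5)-(7). edge_index holds (src, dst)
    index tensors for the directed edges e_{i -> j}: i = src, j = dst."""
    def __init__(self, node_dim, edge_dim, p_drop=0.1):
        super().__init__()
        # g in Eqs. (5) and (7); in the paper these are compositions of GVPs.
        self.g_msg = nn.Sequential(
            nn.Linear(node_dim + edge_dim, node_dim), nn.ReLU(),
            nn.Linear(node_dim, node_dim))
        self.g_ff = nn.Sequential(
            nn.Linear(node_dim, node_dim), nn.ReLU(),
            nn.Linear(node_dim, node_dim))
        self.norm1, self.norm2 = nn.LayerNorm(node_dim), nn.LayerNorm(node_dim)
        self.drop = nn.Dropout(p_drop)

    def forward(self, h_v, h_e, edge_index):
        src, dst = edge_index
        msgs = self.g_msg(torch.cat([h_v[src], h_e], dim=-1))      # Eq. (5)
        agg = torch.zeros_like(h_v).index_add_(0, dst, msgs)       # sum over i
        deg = torch.zeros(h_v.size(0), 1, device=h_v.device)
        deg.index_add_(0, dst, torch.ones(dst.size(0), 1, device=h_v.device))
        h_v = self.norm1(h_v + self.drop(agg / deg.clamp(min=1)))  # Eq. (6)
        return self.norm2(h_v + self.drop(self.g_ff(h_v)))         # Eq. (7)
```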
Results

Model quality assessment

We compare our method against state-of-the-art methods on the CASP 11-12 test set in Table 1. These include representatives of voxel-based methods (3DCNN and Ornate), a graph-based method (GraphQA), and three methods that use sequential representations. All of these methods learn solely from protein structure, with the exception of ProQ3D, which in addition uses sequence information on related proteins that is not always available. We include ProQ3D because it is an improved version of the best single-model method in CASP 11 and CASP 12 [28]. Our method outperforms all other structural methods in both global and per-target correlation, and even performs better than ProQ3D on all but one benchmark.

Footnote: There are two versions of GraphQA; we compare against the one using only structure information.

The CASP 11-12 datasets have been the most well-benchmarked in the recent MQA literature. However, for completeness, we also evaluate our method on CASP 13 (Table 2). Because of its recency, many target structures remain publicly unavailable. We use the stage 2 candidate structures of a subset of 20 targets previously used for benchmarking [2]. Our method achieves improved results over all other methods, including ProteinGCN [24], a more recent graph-based method. Because of the small sample size, we emphasize that these results, although promising, should be considered preliminary until further structures for CASP 13 are available.

Table 2: Performance on CASP 13 in terms of global and per-target correlation.

Method            Global   Per-target
ProteinGCN [24]   0.723    0.603
VoroMQA [21]      0.769    0.665
ProQ3D [28]       0.849    0.671

Computational protein design

Our method achieves state-of-the-art performance on CATH 4.2, representing a substantial improvement both in terms of perplexity and sequence recovery over Structured Transformer [15], which was trained on the same training set (Table 3). Following Ingraham et al. [15], we report evaluation on short (100 or fewer amino acid residues) and single-chain subsets of the CATH 4.2 test set, containing 94 and 103 proteins, respectively, in addition to the full test set. Although Structured Transformer leverages an attention mechanism on top of a graph-based representation of proteins, the authors note in ablation studies that removing attention appeared to increase performance. We therefore retrain and compare against a version of Structured Transformer with the attention layers replaced with standard graph propagation operations, which we call Structured GNN. Our method also improves upon this model.

We emphasize that the underlying architecture of Structured GNN is comparable to ours, with the exception of GVPs and geometric vector channels. In particular, the underlying protein graph is built in a similar manner, except with structural information encoded in solely scalar channels. For example, relative orientations are encoded in terms of quaternion coefficients (4 scalar channels per edge), whereas we represent absolute orientations with 3 vector channels per node. Therefore, our improvement over Structured GNN is the most direct indication of the benefit of our approach that combines geometric and relational information. Additionally, due to the efficiency of our representation, we are able to achieve this performance gain with a modest but notable decrease in parameter count—1.01M in our model versus 1.38M in Structured GNN and 1.53M in Structured Transformer.

Table 3: Performance on the CATH 4.2 test set and its short and single-chain subsets in terms of per-residue perplexity (lower is better) and recovery (higher is better). Recovery is reported as the median over all structures of the mean recovery of 100 sequences per structure. Our method performs better than Structured Transformer and a variant of it, Structured GNN, in which we replaced the attention mechanisms with standard graph propagation operations (see 4.2).

                                    Perplexity                      Recovery %
Method                        Short   Single-chain   All     Short   Single-chain   All
Ours
Structured GNN                8.31    8.88           6.55    28.4    28.1           37.3
Structured Transformer [15]   8.54    9.03           6.85    28.3    27.6           36.4
Conclusion
In this work, we developed the first architecture designed specifically for learning on dual relational and geometric representations of 3D macromolecular structure. At its core, our method, GVP-GNN, augments graph neural networks with computationally fast layers that perform expressive geometric reasoning over Euclidean vector features. Our method possesses desirable theoretical properties and empirically outperforms existing architectures on learning quality scores and sequence designs, respectively, from protein structure.

In further work, we hope to apply our architecture to other important structural biology problem areas, including protein complexes, RNA structure, and protein-ligand interactions. Our results more generally highlight the promise of domain-aware architectures in specialized applications of deep learning, and we hope to continue developing and refining such architectures in the domain of learning from structure.
Acknowledgements
We thank Tri Dao and all members of the Dror group for feedback and discussions.
Funding
We acknowledge support from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, and Intel Corporation. SE is supported by a Stanford Bio-X Bowes fellowship. RJLT is supported by the U.S. Department of Energy, Office of Science Graduate Student Research (SCGSR) program.
References

[1] N. Anand, R. R. Eguchi, A. Derry, R. B. Altman, and P. Huang. Protein sequence design with a learned potential. bioRxiv, 2020.
[2] F. Baldassarre, D. M. Hurtado, A. Elofsson, and H. Azizpour. GraphQA: Protein model quality assessment using graph convolutional network, 2020. URL https://openreview.net/forum?id=HyxgBerKwB.
[3] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[4] J. M. Berg, J. L. Tymoczko, and L. Stryer. Biochemistry. W.H. Freeman and Company, 2002.
[5] R. Cao and J. Cheng. Protein single-model quality assessment by feature-based probability density functions. Scientific Reports, 6:23990, 2016.
[6] J. Cheng, M.-H. Choe, A. Elofsson, K.-S. Han, J. Hou, A. H. Maghrabi, L. J. McGuffin, D. Menéndez-Hurtado, K. Olechnovič, T. Schwede, et al. Estimation of model accuracy in CASP13. Proteins: Structure, Function, and Bioinformatics, 87(12):1361–1377, 2019.
[7] H. Cho and I. S. Choi. Enhanced deep-learning prediction of molecular properties via augmentation of bond topology. ChemMedChem, 14(17):1604–1609, 2019.
[8] N. Cohen and A. Shashua. Inductive bias of deep convolutional networks through pooling geometry. arXiv preprint arXiv:1605.06743, 2016.
[9] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[10] A. del Sol, H. Fujihashi, D. Amoros, and R. Nussinov. Residue centrality, functionally important residues, and active site shape: analysis of enzyme and non-enzyme families. Protein Science, 15(9):2120–2128, 2006.
[11] G. Derevyanko, S. Grudinin, Y. Bengio, and G. Lamoureux. Deep convolutional networks for quality assessment of protein folds. Bioinformatics, 34(23):4046–4053, 2018.
[12] P. Gainza, F. Sverrisson, F. Monti, E. Rodola, D. Boscaini, M. Bronstein, and B. Correia. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods, 17(2):184–192, 2020.
[13] J. Graves, J. Byerly, E. Priego, N. Makkapati, S. V. Parish, B. Medellin, and M. Berrondo. A review of deep learning methods for antibodies. Antibodies, 9(2):12, 2020.
[14] S. Hammes-Schiffer and S. J. Benkovic. Relating protein motion to catalysis. Annual Review of Biochemistry, 75:519–541, 2006.
[15] J. Ingraham, V. Garg, R. Barzilay, and T. Jaakkola. Generative models for graph-based protein design. In Advances in Neural Information Processing Systems, pages 15794–15805, 2019.
[16] M. Karasikov, G. Pagès, and S. Grudinin. Smooth orientation-dependent scoring function for coarse-grained protein quality assessment. Bioinformatics, 35(16):2801–2808, 2019.
[17] J. Klicpera, J. Groß, and S. Günnemann. Directional message passing for molecular graphs, 2020.
[18] B. Kuhlman and D. Baker. Native protein sequences are close to optimal for their structures. Proceedings of the National Academy of Sciences, 97(19):10383–10388, 2000.
[19] Z. Li, Y. Yang, E. Faraggi, J. Zhan, and Y. Zhou. Direct prediction of profiles of sequences compatible with a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles. Proteins: Structure, Function, and Bioinformatics, 82(10):2565–2573, 2014.
[20] J. O'Connell, Z. Li, J. Hanson, R. Heffernan, J. Lyons, K. Paliwal, A. Dehzangi, Y. Yang, and Y. Zhou. SPIN2: Predicting sequence profiles from protein structures using deep neural networks. Proteins: Structure, Function, and Bioinformatics, 86(6):629–633, 2018.
[21] K. Olechnovič and Č. Venclovas. VoroMQA: Assessment of protein structure quality using interatomic contact areas. Proteins: Structure, Function, and Bioinformatics, 85(6):1131–1145, 2017.
[22] G. Pagès, B. Charmettant, and S. Grudinin. Protein model quality assessment using 3D oriented convolutional neural networks. Bioinformatics, 35(18):3313–3319, 2019.
[23] J. C. Pereira, E. R. Caffarena, and C. N. dos Santos. Boosting docking-based virtual screening with deep learning. Journal of Chemical Information and Modeling, 56(12):2495–2506, 2016.
[24] S. Sanyal, I. Anishchenko, A. Dagar, D. Baker, and P. Talukdar. ProteinGCN: Protein model quality assessment using graph convolutional networks. bioRxiv, 2020.
[25] P. Spurek, T. Danel, J. Tabor, M. Śmieja, Ł. Struski, A. Słowik, and Ł. Maziarka. Geometric graph convolutional neural networks. arXiv preprint arXiv:1909.05310, 2019.
[26] A. Strokach, D. Becerra, C. Corbi-Verge, A. Perez-Riba, and P. M. Kim. Fast and flexible design of novel proteins using graph neural networks. bioRxiv, page 868935, 2020.
[27] R. Townshend, R. Bedi, P. Suriana, and R. Dror. End-to-end learning on 3D protein structure for interface prediction. In Advances in Neural Information Processing Systems, pages 15616–15625, 2019.
[28] K. Uziela, D. Menéndez Hurtado, N. Shu, B. Wallner, and A. Elofsson. ProQ3D: improved model quality assessments using deep learning. Bioinformatics, 33(10):1578–1580, 2017.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[30] J. Wang, H. Cao, J. Z. Zhang, and Y. Qi. Computational protein design with deep learning neural networks. Scientific Reports, 8(1):1–9, 2018.
[31] J. Won, M. Baek, B. Monastyrskyy, A. Kryshtafovych, and C. Seok. Assessment of protein model structure accuracy estimation in CASP13: Challenges in the era of deep learning. Proteins: Structure, Function, and Bioinformatics, 87(12):1351–1360, 2019.
[32] Y. Zhang, Y. Chen, C. Wang, C.-C. Lo, X. Liu, W. Wu, and J. Zhang. ProDCoNN: Protein design using a convolutional neural network. Proteins: Structure, Function, and Bioinformatics, 2019.

Supplementary Information

Equivariance and invariance
The vector and scalar outputs of the GVP are equivariant and invariant, respectively, with respect to an arbitrary composition of rotations and reflections in 3D Euclidean space described by $R$, i.e.,

$$\mathrm{GVP}((s, R(V))) = (s', R(V'))$$

Proof.
We can write the transformation described by $R$ as multiplying $V$ with a unitary matrix $U \in \mathbb{R}^{3 \times 3}$ from the right. The $L_2$-norm, scalar multiplications, and nonlinearities are defined row-wise as in Algorithm 1. We consider scalar and vector outputs separately. The scalar output, as a function of the inputs, is

$$s' = \sigma\left(W_m \begin{bmatrix} \lVert W_h V \rVert \\ s \end{bmatrix} + b\right)$$

Since $\lVert W_h V U \rVert = \lVert W_h V \rVert$, we conclude $s'$ is invariant. Similarly, the vector output is

$$V' = \sigma^+\left(\lVert W_\mu W_h V \rVert\right) \odot W_\mu W_h V$$

The row-wise scaling can also be viewed as left-multiplication by a diagonal matrix $D$. Since $\lVert W_\mu W_h V \rVert = \lVert W_\mu W_h V U \rVert$, $D$ is invariant. Since $D W_\mu W_h (VU) = (D W_\mu W_h V) U$, we conclude that $V'$ is equivariant.
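The proof is easy to check numerically; a quick sketch, reusing the hypothetical GVP class from the Methods section (this snippet assumes that definition is in scope):

```python
import torch

# Numerical check of invariance/equivariance for a random GVP.
torch.manual_seed(0)
gvp = GVP(n=4, nu=5, m=6, mu=3)
s, V = torch.randn(4), torch.randn(5, 3)
U, _ = torch.linalg.qr(torch.randn(3, 3))   # random orthogonal U (rotation or reflection)

s1, V1 = gvp(s, V)
s2, V2 = gvp(s, V @ U)                      # transform the input vectors

assert torch.allclose(s1, s2, atol=1e-5)      # scalar output is invariant
assert torch.allclose(V1 @ U, V2, atol=1e-5)  # vector output is equivariant
```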
The ability to approximate rotation-invariant functions

The GVP inherits an analogue of the universal approximation property [9] of standard dense layers. If $R$ describes an arbitrary rotation or reflection in 3D Euclidean space, we show that a GVP can approximate arbitrary scalar-valued functions invariant under $R$ and defined over $\Omega_\nu \subset \mathbb{R}^{\nu \times 3}$, the bounded subset of $\mathbb{R}^{\nu \times 3}$ whose elements can be canonically oriented based on three linearly independent vector entries. Without loss of generality, we assume the first three vector entries can be used.

The machinery corresponding to such approximations is a GVP $G_s$ with only vector inputs, only scalar outputs, and a sigmoidal nonlinearity $\sigma$, followed by a dense layer. This can also be viewed as the sequence of matrix multiplication with $W_h$, taking the $L_2$-norm, and a dense network with one hidden layer. Such machinery can be extracted from any two consecutive GVPs (assuming a sigmoidal $\sigma$). The theorem is restated from the main text:

Theorem.