Hierarchical, rotation-equivariant neural networks to select structural models of protein complexes
Stephan Eismann*, Department of Applied Physics, Stanford University ([email protected])
Raphael J.L. Townshend*, Department of Computer Science, Stanford University ([email protected])
Nathaniel Thomas*, Department of Physics, Stanford University ([email protected])
Milind Jagota, Department of Electrical Engineering, Stanford University ([email protected])
Bowen Jing, Department of Computer Science, Stanford University ([email protected])
Ron Dror, Department of Computer Science, Stanford University ([email protected])

*Equal contribution

Abstract
Predicting the structure of multi-protein complexes is a grand challenge in biochemistry, with major implications for basic science and drug discovery. Computational structure prediction methods generally leverage pre-defined structural features to distinguish accurate structural models from less accurate ones. This raises the question of whether it is possible to learn characteristics of accurate models directly from atomic coordinates of protein complexes, with no prior assumptions. Here we introduce a machine learning method that learns directly from the 3D positions of all atoms to identify accurate models of protein complexes, without using any pre-computed physics-inspired or statistical terms. Our neural network architecture combines multiple ingredients that together enable end-to-end learning from molecular structures containing tens of thousands of atoms: a point-based representation of atoms, equivariance with respect to rotation and translation, local convolutions, and hierarchical subsampling operations. When used in combination with previously developed scoring functions, our network substantially improves the identification of accurate structural models among a large set of possible models. Our network can also be used to predict the accuracy of a given structural model in absolute terms. The architecture we present is readily applicable to other tasks involving learning on 3D structures of large atomic systems.
Proteins bind to one another in specific ways to form protein complexes. The resulting complexes are essential for virtually all cellular processes, and the targeted blocking of protein-protein interactions is a key strategy in modern drug design [11]. Determining structures of protein complexes experimentally is often difficult and time-consuming, placing a premium on computational structure prediction [35]. Unfortunately, computational structure prediction for protein complexes has also proven difficult, much more so than for individual proteins.

The process of predicting the structure of a protein complex given structures of the individual proteins involved, known as protein docking, generally involves two steps. First, the configurational space of the interacting proteins is sampled to produce a list of candidate models (hypothetical 3D structures of the complex). Second, each candidate model is assigned a score using a scoring function intended to assign the best scores to the most accurate models (i.e., those that would most closely match the experimentally determined structure, were it available). Despite the availability of many software packages designed for this purpose [6, 37, 8, 32, 30, 26, 17, 24, 20], identifying accurate models out of a large set of candidate models has proven difficult.

Existing scoring functions (methods for assigning scores to candidate models) generally leverage hand-designed, local features to assess the accuracy of a model. Widely used scoring functions can be classified as physics-inspired or statistical. Physics-inspired scoring functions include terms accounting for interatomic interactions and desolvation energy [36, 19, 23].
Statistical scoring functions are based on a probability distribution of defined spatial features (e.g., interatomic distances and bond orientations) computed for a set of known protein complex structures [16, 21, 10, 2]. Recently, several scoring functions have been developed using machine learning methods [12, 40, 5]. These also generally leverage pre-computed features as inputs. The use of pre-computed features can allow algorithms to focus directly on information relevant to predicting model accuracy. At the same time, all of the above methods may overlook characteristics of the atomic structure that are in fact relevant to identifying accurate models but have never been recognized as such. In this paper, we explore whether we can improve solutions to the scoring challenge by considering all atoms of a model, without the use of hand-designed features.

We present a neural network that learns directly from the 3D positions of all atoms to distinguish accurate models from less accurate ones, without making any assumptions about what characterizes a favorable protein-protein interface or what features are relevant to identifying an accurate model of a protein complex. The key challenge in designing such a machine learning method is to prevent an explosion of parameters to be fit, because the amount of data available for training (i.e., to fit these parameters) is limited by the number of experimentally determined structures. We also wish to represent each structural model precisely, without approximating atom positions.

We address these challenges by designing a neural network architecture specifically for machine learning tasks involving large molecular systems. Our network architecture is based on convolutional filters that are precisely equivariant to rotation, translation, and permutation of the atomic coordinates that serve as inputs to the network.
These physical symmetries are conserved over multiple hierarchical layers, which we use to learn features at different levels of structural coarseness and to aggregate information globally. The equivariance properties of our network significantly reduce the number of model parameters with no loss of model expressiveness. They also eliminate the need for rotational data augmentation and ensure that the orientation in which structural models are presented to the network does not matter.

This hierarchical, equivariant network architecture enables us to learn directly from a complete atomic representation of the protein complex, without making any assumptions about which features are or are not relevant to identifying accurate structural models. We demonstrate that a trained network, which we term PAUL, can pick out accurate models among a large set of candidate models for a given protein complex. When combined with existing scoring functions, PAUL substantially improves ranking performance and enriches the number of accurate models among those selected. We further demonstrate that a network can be trained to perform quality assessment, that is, to predict the accuracy of a given protein complex model in absolute terms.

We present a method that allows effective end-to-end learning from large atomic structures. "Atomic structure" here means the most direct representation of a molecular system's 3D geometry: a list of 3D coordinates and the element type of each atom. "End-to-end" refers to the concept that we do not manually define, select, or pre-compute any features that we believe are relevant, beyond the 3D coordinates and element type of each atom.

Figure 1 illustrates the architecture of our deep learning network. The goal is to predict the accuracy of a given hypothesized protein complex model. We measure this accuracy in terms of the ligand root mean square deviation (LRMSD), a single scalar.
The LRMSD of a model is calculated based on the atomic coordinates of the ligand (one specified protein among the two proteins) after the receptor (the second protein) has been aligned with the experimental structure. Figure 1B shows a protein complex in which receptor and ligand are highlighted. We use LRMSD because it is widely used in the structural biology community to measure the accuracy of a protein complex model [18]. We train networks to perform two different tasks. One, which we term the regression task, is direct prediction of a model's LRMSD. The other, which we term the classification task, is prediction of whether or not a model is "acceptable." We define an acceptable model as one with LRMSD < 10 Å, in line with the assessment criteria of the CAPRI protein docking competition [18]. We note that our neural network architecture can be trivially adapted to output a vector of any desired length (e.g., to predict multiple properties at the same time).

The protein structure is defined by a list of atoms. Each atom's entry contains the atom's 3D coordinates, its chemical element, and a Boolean flag that indicates whether the atom is part of the receptor protein or part of the ligand protein. We represent carbon, oxygen, nitrogen, sulfur, and hydrogen atoms. The chemical element is specified by a vector of length 5 in which one entry is 1 and all others are 0 (one-hot encoding). We include hydrogen atoms as they play a vital role in the physics governing protein-protein interactions. The Boolean flag allows the network to recognize the binding interface between the two proteins.

Figure 1:
Learning from entire protein complex structures without prior assumptions. (A) Proteins consist of chains of amino acid residues. Each residue has a central alpha carbon Cα. (B) Two proteins forming a protein complex. (C) Given 3D coordinates and element types of every atom as initial input, the network learns to predict either the ligand root mean squared deviation (LRMSD) of a structural model or whether or not the model is acceptable (i.e., sufficiently accurate). Internally, each of the first three network layers learns a specified number of features for each of a specified set of points. Starting with all atoms, we use alpha carbons as the intermediate set of points before we output one large feature vector for the entire complex. This feature vector is a learned fingerprint of the atomic structure and serves as the input to four standard, fully-connected neural network layers, the last of which has a single scalar output.

To enable end-to-end learning on these large molecules, we combine (i) neural networks equivariant to rotation, translation, and permutation, (ii) a novel, hierarchical learning approach, and (iii) nearest-neighbor convolutions. These three components, which we describe in the following sections, specifically cater to learning on atomic systems.

Standard three-dimensional convolutional neural networks (3D CNNs) have been applied to several problems in structural biology, including the prediction of protein structure, protein-protein interfaces, and protein-ligand binding [39, 25, 33, 14, 9, 22, 34]. Although powerful, conventional 3D CNNs are suboptimal in two ways when used to evaluate the accuracy of large atomic structural models. First, conventional CNNs operate on a discretized grid, such that representing atomic coordinates with very high precision requires a very fine grid and thus a very large set of inputs to the network.
Second, the convolutional filters of conventional 3D CNNs are not rotationally equivariant, posing a challenge to learning properties that are known to be invariant under rotations in 3D space, such as the accuracy of a structural model. In certain cases, one can solve this latter problem through a canonical alignment of the input data, but complexes of different proteins do not lend themselves naturally to such an alignment. One can also rely on rotational data augmentation (that is, rotating each training example at many different random angles), but this provides an imperfect approximation to rotation invariance, complicates the learning task, and poses challenges to capturing finer-scale structural features that might appear at different orientations relative to one another in different structures (sometimes described as 'patchwise' symmetry).

To account for the fact that the laws of physics that govern inter- and intramolecular interactions are invariant to translations and rotations, Behler and Parrinello [3] developed basic radial and angular symmetry functions in order to model the potential energy of small molecules through neural networks. These were further extended in SchNet [27] and ANI-1 [29]. A number of recent publications have proposed neural network architectures that account for translational and rotational symmetries based on tools from group representation theory [7, 31, 15, 1]. Specifically, tensor field networks [31] have two key properties that distinguish them from conventional CNNs: they operate on a set of points with coordinates in 3D space (not a discretized grid), and they are automatically equivariant to rotations, translations, and permutations of those points.

Informally, a function (such as a neural network layer) is equivariant to some transformation (such as a rotation or translation) if applying the transformation to the input is the same as applying this transformation to the output.
The classic example of this is a standard convolutional neural network: a translation of the input to a convolutional layer by an integer number of pixels is equivalent to the same translation of the output. Equivariant networks enjoy the property that when each layer is equivariant, the whole deep network is automatically equivariant. Equivariant networks eliminate the need for data augmentation with respect to the corresponding symmetry, and furthermore ensure local or 'patchwise' symmetry at every part of the space (not only a single global symmetry).

Formally, a function $L : X \to Y$ is equivariant with respect to a group $G$ and its representations $D^X$ and $D^Y$ if

$$L \circ D^X(g) = D^Y(g) \circ L \qquad (1)$$

holds for all $g \in G$. Equivariant neural networks bake in the prior that a symmetry with respect to $G$ always holds. We expect the guarantee of equivariance to be a more powerful property the larger $G$ is (where 'larger' in the cases we care about means the dimension of the continuous group $G$ considered as a manifold; for example, the group of 2D rotations has dimension 1, and the group of 3D rotations has dimension 3).

Concretely, we can represent inputs as $V_{am}$ (the coordinates and other features associated with each point), where $a$ indexes the points and $m$ is an index over the group representation (and in general will be broken up into multiple indices based on the internal structure of the representation, as described in Thomas et al. [31]). This means that $L$ is equivariant if

$$\sum_{a'm'} L^{a'm'}_{am}\Big(\sum_{m''} D^X_{m'm''}(g)\, V_{a'm''}\Big) = \sum_{a'm'} D^Y_{mm'}(g) \sum_{m''} L^{a'm''}_{am'}\big(V_{a'm''}\big) \qquad (2)$$

for all $g \in G$, where $D$ denotes the corresponding matrix representations of $G$.
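The equivariance condition of Eq. 1 can be checked numerically for a simple point-based layer. The sketch below is an illustrative toy example (not part of the PAUL implementation): it scales each point by a function of its rotation-invariant norm, so applying a rotation before or after the layer gives the same result.

```python
import numpy as np

def rotation_matrix_z(theta):
    """A rotation about the z-axis: one element g of the group SO(3)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def radial_layer(points):
    """A toy equivariant layer: scale each point by a function of its norm.
    Point norms are unchanged by rotation, so L(R x) = R L(x)."""
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    return points * np.exp(-norms)  # stand-in for a learnable radial function

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))
R = rotation_matrix_z(0.7)

lhs = radial_layer(points @ R.T)  # transform the input, then apply the layer
rhs = radial_layer(points) @ R.T  # apply the layer, then transform the output
assert np.allclose(lhs, rhs)      # Eq. (1): L(D_X(g) V) = D_Y(g) L(V)
```

The same check applied to a generic dense layer would fail, which is exactly why the filters must be built from rotation-invariant or covariant ingredients.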
Mathematical definitions of the tensor field network layers that we use here and proofs of their equivariance can be found in Thomas et al. [31].

To implement an equivariant convolution with respect to the group of proper 3D translations and rotations (the special Euclidean group $SE(3)$), we decompose convolution filters into a truncated series of spherical harmonics $Y$ with learnable radial functions $R$, which is combined with the input using a tensor product operation:

$$L_{a,c,(l,m)}\big(V_{a'',c,(l'',m'')}\big) = C^{(l,m)}_{(l',m')(l'',m'')}\, R^{(l')}_c(r_{aa''})\, Y^{(l',m')}(\hat{r}_{aa''})\, V_{a'',c,(l'',m'')}, \qquad (3)$$

where $\vec{r}_{aa''}$ is the vector between points $a$ and $a''$, $C$ are the Clebsch-Gordan coefficients for the group of 3D rotations $SO(3)$ (implementing the tensor product), $(l, m)$ are the indices corresponding to the representation of $SO(3)$, $c$ is the feature index, and repeated indices on the right-hand side are summed over. Truncation of this series in $l$ is necessary because the series is otherwise infinite.

Additionally, we use self-interaction layers (analogous to 1x1 convolutions in CNNs) and nonlinear scaling of magnitudes (where we apply a nonlinear function to the norm over the representation index). We also include a simple normalization layer, which includes a sum over the representation index in addition to a sum over the filter index (as in a typical normalization layer):

$$\frac{V_{a,c,(l,m)}}{\sqrt{\sum_{c,m} |V_{a,c,(l,m)}|^2}} \qquad (4)$$

This normalization was found to aid in training convergence and is equivariant, as the representations of $SO(3)$ that we use are unitary.
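The normalization of Eq. 4 can be sketched as follows. The array layout (points × channels × representation components) and the small epsilon guard against division by zero are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def norm_layer(V, eps=1e-12):
    """Equivariant normalization (Eq. 4): divide each point's features by the
    norm taken over both the channel index c and the representation index m.
    V has shape (points, channels, 2l+1)."""
    norm = np.sqrt(np.sum(np.abs(V) ** 2, axis=(1, 2), keepdims=True))
    return V / (norm + eps)

rng = np.random.default_rng(1)
V = rng.normal(size=(4, 8, 3))  # 4 points, 8 channels, l = 1 features
out = norm_layer(V)

# After normalization, every point's full feature vector has unit norm.
per_point_norms = np.linalg.norm(out.reshape(4, -1), axis=1)
assert np.allclose(per_point_norms, 1.0)
```

Because the norm is taken over whole representation vectors (which unitary rotations preserve), the operation commutes with rotations of the input, which is the equivariance property claimed in the text.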
Rotation-equivariant convolutions do not require the set of input points to be the same as the set of output points [31, 15]; that is, $a$ and $a''$ do not have to range over the same sets in Eq. 3. We take advantage of this fact to output at only a subset of points, a subsampling operation analogous to the stride in standard CNNs. We expect this to be useful for efficiently recognizing patterns at different scales of the protein structure.

Note that if we used such subsampling with rotation-invariant convolutions (such as those in Schütt et al. [27]), this would eliminate lower-level orientation information, making it inaccessible to later network layers. We call the largest $l$-value in the truncated spherical harmonic series the maximum rotation order; it determines the angular resolution of the geometric information that is passed to later layers.

In the case of proteins, we start by considering all atoms and aggregate the information at the level of alpha carbons, where the alpha carbon is a canonical central atom in each amino acid (Fig. 1A). This reduces the number of points by a factor of ~20 compared to the full set of atoms. We finally output a single vector assigned to the centroid of all atoms. This feature vector is the input to four standard, fully-connected layers with a single scalar output (Fig. 1C).

2.1.4 Nearest-neighbor convolutions

Naive use of the convolution layer calculates all pairwise interactions between points and therefore has $O(N^2)$ complexity in the number of points $N$. However, the laws of physics are local; that is, the effects of physically relevant forces diminish with distance. As such, we reduce the complexity to $O(N)$ by only calculating interactions for a fixed number $K$ of nearest neighbors of each point. $K$ here is analogous to the area or volume of the convolution window for standard discrete 2D or 3D CNNs. The overall computational complexity of our model thus scales linearly with the number of atoms.
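A minimal sketch of the neighbor-list construction behind nearest-neighbor convolutions is shown below. The paper does not specify an implementation; this brute-force version is for clarity only (it is itself quadratic, whereas a k-d tree would make the lookup efficient in practice):

```python
import numpy as np

def knn_neighbor_lists(coords, k):
    """Return, for each point, the indices of its k nearest neighbors.
    Restricting convolutions to these neighbors replaces the all-pairs
    O(N^2) interaction count with O(N * K) for a fixed K."""
    diff = coords[:, None, :] - coords[None, :, :]  # pairwise displacements
    dist = np.linalg.norm(diff, axis=-1)            # pairwise distances
    np.fill_diagonal(dist, np.inf)                  # exclude self-neighbors
    return np.argsort(dist, axis=1)[:, :k]

# Three collinear "atoms": the middle one is everyone's nearest neighbor.
coords = np.array([[0.0, 0, 0], [1.0, 0, 0], [5.0, 0, 0]])
print(knn_neighbor_lists(coords, k=1).ravel())  # -> [1 0 1]
```

Each row of the returned array then defines the set of points $a''$ that contribute to the convolution output at point $a$ in Eq. 3.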
For training and testing, we use protein complexes for which the Protein Data Bank [4] includes experimentally resolved structures of the two proteins in complex as well as each of the two proteins on its own.

The input data to our model are models generated by the docking software ATTRACT [26]. Given two individual protein structures as input, ATTRACT generates a user-specified number of protein complex models. Our neural networks are trained with 300 models per protein complex.

In all our experiments, we use a subset of complexes from docking benchmark 4 (DB4) [13] for training (143 complexes) and validation (21 complexes), and we test on 50 complexes that are part of docking benchmark 5 (DB5) [38] but not of DB4. DB4 and DB5 are widely used datasets specifically curated to benchmark protein docking methods as well as scoring functions used for protein docking. DB5 is a superset of DB4.
We train separate neural networks to perform classification and regression. For regression, we minimize the mean squared error with respect to the LRMSD label (i.e., the true LRMSD). The LRMSD labels are approximately Gaussian distributed. For classification of a model as either 'acceptable' or not, we minimize the binary cross entropy. As mentioned earlier, we define an acceptable model as a model with LRMSD < 10 Å. We weight misclassifications of acceptable models with a factor of 100 to account for the fact that most models do not fall into this category.

For both regression and classification, we minimize the loss functions using the Adam optimizer in TensorFlow. We use Horovod [28] to distribute training across 8 Nvidia Titan Xp GPUs.

For both regression and classification, we repeat the stochastic optimization procedure to produce multiple machine learning (ML) models (i.e., multiple sets of optimized neural network parameters). For classification, we then select the best model based on its validation set performance with respect to the metric of success rate (see Supplementary Information). For regression, we choose the best model based on the training epoch with the smallest validation loss.
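The class-weighted loss described above can be sketched as follows. This is a NumPy stand-in for the TensorFlow loss actually used; the exact form of the weighting in the original implementation is an assumption:

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=100.0, eps=1e-7):
    """Binary cross entropy with misclassified 'acceptable' models
    (the rare positive class) up-weighted by pos_weight."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(loss.mean())

# Missing an acceptable model is penalized 100x more than the symmetric
# mistake of scoring an unacceptable model highly.
miss = weighted_bce(np.array([1.0]), np.array([0.1]))         # positive scored low
false_alarm = weighted_bce(np.array([0.0]), np.array([0.9]))  # negative scored high
assert miss > false_alarm
```

Up-weighting the positive class this way counteracts the severe class imbalance: with only a handful of acceptable models per 300 generated, an unweighted loss would be minimized by predicting 'not acceptable' everywhere.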
The goal of a scoring function is to rank a set of structural models for a protein complex, with the most accurate models (i.e., those most similar to the unknown experimental structure) at the top. We use PAUL, the network that we trained to classify models as either acceptable or not, to remove suboptimal models from a list pre-ranked by a scoring function. Although PAUL is trained to produce a final binary output, it first computes a scalar value between 0 and 1, with larger values predictive of more accurate models. Given a list of 1000 models for a complex, pre-ranked by a scoring function, we assign a value to each model with PAUL. We then remove all models with a value below the median.

Here, we consider the effects of this approach on the performance of the scoring functions SOAP-PP and ZRANK, which represent two different approaches and are among the most widely used methods in the field. SOAP-PP is a statistical potential based on distributions of spatial features, which was optimized with respect to models generated for DB4 [10]. The features considered are interatomic distances, orientations between pairs of covalent bonds, and relative atomic surface accessibility. ZRANK is a physics-based scoring function that utilizes a weighted sum of seven terms accounting for electrostatic potential, van der Waals interactions, and desolvation energy at the protein-protein interface [23].

Figure 2 shows the effect PAUL has on the ranking performance of SOAP-PP and ZRANK for each individual complex in the test set. Augmentation with PAUL improves performance on 8 and 7 complexes for SOAP-PP and ZRANK, respectively, while only reducing performance on a single complex.

Figure 2:
Ranking results per protein complex.
Given 1000 structural models per complex, we rank the models using the scoring functions SOAP-PP and ZRANK, with or without application of PAUL to remove suboptimal models. The figure indicates whether or not a method ranks at least one acceptable model among the first 1, 5, 10, 50, and 100 models (T1, T5, T10, T50, and T100, respectively). Blue indicates that it does, while gray indicates that it does not. Green and orange highlight differences due to the augmentation with PAUL; green and orange correspond to blue and gray, respectively. The figure shows results for 23 complexes for which the sampling software provides at least one acceptable model among the 1000 models and at least one of these methods ranks an acceptable model among the top 100 models.

Figure 3 shows the best ranking method for each complex based on the metric of rank-weighted success. We define rank-weighted success $r$ as a cumulative metric:

$$r = \sum_{N \in \{1, 5, 10, 50, 100\}} A(N), \qquad (5)$$

where $A(N)$ is the number of acceptable models among the top-ranked $N$ models. Gains in $r$ reflect overall enrichment of acceptable models, with an emphasis on acceptable models that are ranked highly. For example, an acceptable model that is ranked 4th contributes four times as much to $r$ as an acceptable model that is ranked 51st.

We note that the combination of PAUL with either SOAP-PP or ZRANK consistently enhances ranking performance for complexes in the 'medium' and 'difficult' categories. The difficulty categories follow the classification in Vreven et al. [38] and are based on the deformation of the individual protein structures upon complex formation.

In the Supplementary Information, we show that augmentation with PAUL also increases the performance of a machine learning scoring function. In addition, we demonstrate that PAUL also increases ranking performance with respect to two other metrics, success and hit rate.
Finally, we show that removing suboptimal models using the ZRANK or SOAP-PP scoring functions instead of PAUL does not result in increased performance.

We train a second rotation-equivariant neural network to predict LRMSD for a given model. Predicting LRMSD provides an estimate of how close a given model is to the correct structure (i.e., a high-resolution, experimentally determined structure). This is useful for quality assessment, that is, for assessing the likelihood that a chosen model is accurate.

The density scatter plot in Figure 4 shows the predictions of our network and the true LRMSD for 300 structural models for each of 50 protein complexes. The Pearson correlation coefficient of network predictions and labels is 0.62.

Figure 3:
PAUL enriches the number of acceptable models among highly ranked models.
For each pair (that is, for each scoring function with and without the application of PAUL to remove suboptimal models), the better of the two ranking methods for each protein complex, as measured by the rank-weighted success metric, is shown in blue. Both are blue when the two methods perform equally well, unless neither ranks an acceptable model among the top 100, in which case both are white. PAUL particularly enhances ranking performance for complexes in the 'medium' and 'difficult' difficulty categories. The categories follow the classification in Vreven et al. [38]; 'easy' here refers to the 'rigid-body' category.

The correlation coefficients for ZRANK and SOAP-PP are 0.05 and 0.36, respectively; these lower correlation coefficients are not surprising, as these scoring functions are optimized to rank models of individual complexes rather than for quality assessment.

Figure 4:
Correlation between predicted and actual accuracy of structural models.
Prediction of our trained neural network vs. true LRMSD value for diverse structural models of 50 protein-protein complexes. The color in the density scatter plot indicates the number of models per bin. Bins with no models are shown in white. "Correlation" indicates the Pearson correlation coefficient between predicted and true LRMSD values.

While relative ranking and quality assessment are separate tasks, the ability of our classifier model PAUL to remove poor structural models without removing acceptable ones may be linked to the performance of the regression model at predicting LRMSD. Figure 4 shows good correlation over a wide range of LRMSD values, including values above 50 Å. LRMSD values in this regime generally indicate that the ligand is placed at the wrong site of the receptor. The ability to recognize such poor models likely stems from our global learning approach (learning directly from all atoms of the complex) as opposed to an approach that only locally assesses the protein-protein interface of a given structural model.
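The rank-weighted success metric used in these comparisons can be sketched as below. This is an illustrative Python sketch, not the authors' code; the cutoff set {1, 5, 10, 50, 100} follows the T1 through T100 thresholds reported for Figure 2:

```python
def rank_weighted_success(is_acceptable):
    """Rank-weighted success r: the sum of A(N) over N in {1, 5, 10, 50, 100},
    where A(N) is the number of acceptable models among the top N entries.
    `is_acceptable` lists, in ranked order, whether each model is acceptable."""
    cutoffs = (1, 5, 10, 50, 100)
    return sum(sum(is_acceptable[:n]) for n in cutoffs)

# An acceptable model ranked 4th counts toward N = 5, 10, 50, and 100 (r += 4);
# one ranked 51st counts only toward N = 100 (r += 1).
ranking = [False] * 100
ranking[3] = True   # rank 4
ranking[50] = True  # rank 51
print(rank_weighted_success(ranking))  # -> 5
```

The nested cutoffs are what give highly ranked acceptable models their larger weight: a model at rank $k$ contributes once for every cutoff $N \ge k$.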
As part of our ML model selection for the experiment shown in Figure 4, we trained 30 different networks and explored the influence of the maximum rotation order (see Section 2.1.3) on network performance. We trained 10 networks per maximum rotation order. The test losses of the best networks (in units of Å², selected based on the validation loss) for the maximum rotation orders 0, 1, and 2 are 290, 274, and 260, respectively. The data is in line with the theoretical prediction that the maximum rotation order governs the amount of orientation information preserved after we apply the equivariant subsampling operation.

For networks with equal numbers of features per point, we measured mean runtimes per training epoch of 290 s, 610 s, and 1288 s for the maximum rotation orders 0, 1, and 2, respectively. We trained each network using 42,000 structural models. This is 1-2 orders of magnitude faster than voxelization-based network architectures (which only take the atoms at the protein-protein interface into account) and highlights the efficiency of our point-based approach [34].

Correctly ranking candidate structures of protein complexes has remained an unsolved challenge. Difficulties include the large number of atoms involved, the inherent flexibility of the individual proteins, and the need to capture subtle interatomic interactions that are sensitive to small changes in structure. In this paper, we demonstrate how a novel machine learning approach can help to address this challenge.

We present a deep learning architecture that can efficiently learn geometrically precise features on large systems using a rotation-equivariant neural network. This architecture is different from other machine learning approaches that have been applied to the problem of protein-protein docking [12, 40, 5] or to other problems in structural biology. First, we do not provide any pre-computed physics-based energies or statistical features to the neural network.
Instead, our network learns solely from the raw structural data, given by the spatial coordinates and chemical element type of each atom. Second, our architecture operates directly on point coordinates. We avoid an expensive representation of the atomic structure in terms of 3D voxels and eliminate the need for rotational data augmentation. Our architecture can also learn physics-based relationships involving high-order polynomials of atomic distances and multi-body interactions. Finally, the hierarchical design of our approach, in which we aggregate information at different levels of spatial granularity, allows us to learn end-to-end from entire protein complexes, as opposed to local assessment of the protein-protein interface [12, 40] or learning from a more coarsely represented structure at the residue level [5].

We use our neural network architecture to learn a classifier, PAUL, that distinguishes acceptable from incorrect models. By removing incorrect models from a list of models ranked by a scoring function, we are able to increase ranking performance for individual complexes. PAUL particularly improves ranking performance for challenging docking targets in which the individual proteins deform substantially upon binding to one another.

Although we use PAUL to augment scoring functions in this paper, we expect a similar classifier to be useful during the model generation stage of protein docking. Specifically, because PAUL rarely classifies acceptable models as incorrect, such a classifier could be used to select among candidate models before final structural refinement.

We train a second neural network to predict LRMSD for a given structural model. The reported correlation coefficient testifies to the ability of this network to assess model quality over a wide range of LRMSD values.
This ability is likely due to our global learning approach, i.e., the fact that we learn from all atoms in the complex at once, as opposed to independently assessing local regions of the protein-protein interface. We also do not require the binding affinity of a native complex to predict the accuracy of a structural model.

Our empirical results also support the theoretical prediction that higher-order spherical harmonics are necessary to propagate orientation information over hierarchical layers. Despite the use of higher-order harmonics, our physics-inspired architecture remains highly efficient with respect to both data and computation, as demonstrated through the runtime measurements and the fact that we obtain high performance while using fewer than 200 protein complexes for training.

To the best of our knowledge, PAUL is the first application of a point-based network to an important problem in structural biology. Its hierarchical, rotation-equivariant architecture is directly applicable to other learning tasks involving atomic systems, in structural biology and beyond.
Acknowledgements
We thank Joseph Paggi and João Rodrigues for useful discussions.
Funding
We acknowledge support from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, and Intel Corporation. SE is supported by a Stanford Bio-X Bowes fellowship. RJLT is supported by the U.S. Department of Energy, Office of Science Graduate Student Research (SCGSR) program. NT is supported by the Air Force Office of Scientific Research (FA9550-16-1-0082).
References

[1] B. Anderson, T.-S. Hy, and R. Kondor. CORMORANT: Covariant Molecular Neural Networks. Advances in Neural Information Processing Systems, 32:14510–14519, 2019.
[2] J. Andreani, G. Faure, and R. Guerois. InterEvScore: a novel coarse-grained interface scoring function using a multi-body statistical potential coupled to evolution. Bioinformatics, 29(14):1742–1749, 2013. doi: 10.1093/bioinformatics/btt260.
[3] J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical Review Letters, 98(14):1–4, 2007. doi: 10.1103/PhysRevLett.98.146401.
[4] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000. doi: 10.1093/nar/28.1.235.
[5] Y. Cao and Y. Shen. Energy-based graph convolutional networks for scoring protein docking models. PROTEINS, 2020. doi: 10.1002/prot.25888.
[6] R. Chen, L. Li, and Z. Weng. ZDOCK: An initial-stage protein-docking algorithm. PROTEINS, 52(1):80–87, 2003. doi: 10.1002/prot.10389.
[7] T. Cohen and M. Welling. Group Equivariant Convolutional Networks. Proceedings of The International Conference on Machine Learning, 33:2990–2999, 2016.
[8] S. J. de Vries, M. van Dijk, and A. M. J. J. Bonvin. The HADDOCK web server for data-driven biomolecular docking. Nature Protocols, 5(5):883–897, 2010. doi: 10.1038/nprot.2010.32.
[9] G. Derevyanko, S. Grudinin, Y. Bengio, and G. Lamoureux. Deep convolutional networks for quality assessment of protein folds. Bioinformatics, 34(23):4046–4053, 2018. doi: 10.1093/bioinformatics/bty494.
[10] G. Q. Dong, H. Fan, D. Schneidman-Duhovny, B. Webb, and A. Sali. Optimized atomic statistical potentials: assessment of protein interfaces and loops. Bioinformatics, 29(24):3158–3166, 2013. doi: 10.1093/bioinformatics/btt560.
[11] R. Esmaielbeiki, K. Krawczyk, B. Knapp, J.-C. Nebel, and C. M. Deane. Progress and challenges in predicting protein interfaces. Briefings in Bioinformatics, 17(1):117–131, 2016. doi: 10.1093/bib/bbv027.
[12] C. Geng, Y. Jung, N. Renaud, V. Honavar, A. M. J. J. Bonvin, and L. C. Xue. iScore: a novel graph kernel-based function for scoring protein–protein docking models. Bioinformatics, 36(1):112–121, 2019. doi: 10.1093/bioinformatics/btz496.
[13] H. Hwang, T. Vreven, J. Janin, and Z. Weng. Protein-protein docking benchmark version 4.0. PROTEINS, 78(15):3111–3114, 2010. doi: 10.1002/prot.22830.
[14] J. Jiménez, S. Doerr, G. Martínez-Rosell, A. S. Rose, and G. De Fabritiis. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, 33(19):3036–3042, 2017. doi: 10.1093/bioinformatics/btx350.
[15] R. Kondor. N-body Networks: a Covariant Hierarchical Neural Network Architecture for Learning Atomic Potentials. arXiv:1803.01588, 2018.
[16] D. Kozakov, R. Brenke, S. R. Comeau, and S. Vajda. PIPER: An FFT-based protein docking program with pairwise potentials. PROTEINS, 65(2):392–406, 2006. doi: 10.1002/prot.21117.
[17] D. Kozakov, D. R. Hall, B. Xia, K. A. Porter, D. Padhorny, C. Yueh, D. Beglov, and S. Vajda. The ClusPro web server for protein-protein docking. Nature Protocols, 12(2):255–278, 2017. doi: 10.1038/nprot.2016.169.
[18] M. F. Lensink, R. Méndez, and S. J. Wodak. Docking and scoring protein complexes: CAPRI 3rd Edition. PROTEINS, 69(4):704–718, 2007. doi: 10.1002/prot.21804.
[19] J. G. Mandell, V. A. Roberts, M. E. Pique, V. Kotlovyi, J. C. Mitchell, E. Nelson, I. Tsigelny, and L. F. Ten Eyck. Protein docking using continuum electrostatics and geometric fit. Protein Engineering, Design and Selection, 14(2):105–113, 2001. doi: 10.1093/protein/14.2.105.
[20] N. A. Marze, S. S. Roy Burman, W. Sheffler, and J. J. Gray. Efficient flexible backbone protein–protein docking for challenging targets. Bioinformatics, 34(20):3461–3469, 2018. doi: 10.1093/bioinformatics/bty355.
[21] J. Mintseris, B. Pierce, K. Wiehe, R. Anderson, R. Chen, and Z. Weng. Integrating statistical pair potentials into protein complex prediction. PROTEINS, 69(3):511–520, 2007. doi: 10.1002/prot.21502.
[22] G. Pagès, B. Charmettant, and S. Grudinin. Protein model quality assessment using 3D oriented convolutional neural networks. Bioinformatics, 35(18):3313–3319, 2019. doi: 10.1093/bioinformatics/btz122.
[23] B. Pierce and Z. Weng. ZRANK: Reranking protein docking predictions with an optimized energy function. PROTEINS, 67(4):1078–1086, 2007. doi: 10.1002/prot.21373.
[24] C. Quignot, J. Rey, J. Yu, P. Tufféry, R. Guerois, and J. Andreani. InterEvDock2: an expanded server for protein docking using evolutionary and biological information from homology models and multimeric inputs. Nucleic Acids Research, 46(1):408–416, 2018. doi: 10.1093/nar/gky377.
[25] M. Ragoza, J. Hochuli, E. Idrobo, J. Sunseri, and D. R. Koes. Protein-Ligand Scoring with Convolutional Neural Networks. Journal of Chemical Information and Modeling, 57(4):942–957, 2017. doi: 10.1021/acs.jcim.6b00740.
[26] C. E. M. Schindler, I. Chauvot de Beauchêne, S. J. de Vries, and M. Zacharias. Protein-protein and peptide-protein docking and refinement using ATTRACT in CAPRI. PROTEINS, 85(3):391–398, 2017. doi: 10.1002/prot.25196.
[27] K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K.-R. Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. arXiv:1706.08566, 2017. doi: 10.1021/acs.jctc.7b00577.
[28] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799, 2018.
[29] J. S. Smith, O. Isayev, and A. E. Roitberg. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science, 8(4):3192–3203, 2017. doi: 10.1039/C6SC05720A.
[30] A. Szilagyi and Y. Zhang. Template-based structure modeling of protein-protein interactions. Current Opinion in Structural Biology, 2014.
[31] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv:1802.08219, 2018.
[32] M. Torchala, I. H. Moal, R. A. G. Chaleil, J. Fernandez-Recio, and P. A. Bates. SwarmDock: a server for flexible protein–protein docking. Bioinformatics, 29(6):807–809, 2013. doi: 10.1093/bioinformatics/btt038.
[33] W. Torng and R. B. Altman. 3D deep convolutional neural networks for amino acid environment similarity analysis. BMC Bioinformatics, 18(1):302, 2017. doi: 10.1186/s12859-017-1702-0.
[34] R. Townshend, R. Bedi, P. Suriana, and R. Dror. End-to-End Learning on 3D Protein Structure for Interface Prediction. Advances in Neural Information Processing Systems, 32:15616–15625, 2019.
[35] I. A. Vakser. Protein-protein docking: From interaction to interactome. Biophysical Journal, 107(8):1785–1793, 2014. doi: 10.1016/j.bpj.2014.08.033.
[36] I. A. Vakser and C. Aflalo. Hydrophobic docking: A proposed enhancement to molecular recognition techniques. PROTEINS, 20(4):320–329, 1994. doi: 10.1002/prot.340200405.
[37] V. Venkatraman, Y. D. Yang, L. Sael, and D. Kihara. Protein-protein docking using region-based 3D Zernike descriptors. BMC Bioinformatics, 10(1):407, 2009. doi: 10.1186/1471-2105-10-407.
[38] T. Vreven, I. H. Moal, A. Vangone, B. G. Pierce, P. L. Kastritis, M. Torchala, R. Chaleil, B. Jiménez-García, P. A. Bates, J. Fernandez-Recio, A. M. Bonvin, and Z. Weng. Updates to the Integrated Protein–Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. Journal of Molecular Biology, 427(19):3031–3041, 2015. doi: 10.1016/j.jmb.2015.07.016.
[39] I. Wallach, M. Dzamba, and A. Heifets. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery. arXiv:1510.02855, 2015.
[40] X. Wang, G. Terashi, C. W. Christoffer, M. Zhu, and D. Kihara. Protein docking model evaluation by 3D deep convolutional neural networks. Bioinformatics, 2019. doi: 10.1093/bioinformatics/btz870.
[41] H. Zhou and J. Skolnick. GOAP: A generalized orientation-dependent, all-atom statistical potential for protein structure prediction. Biophysical Journal, 101(8):2043–2052, 2011. doi: 10.1016/j.bpj.2011.09.012.

Supplementary Information
Augmenting the scoring functions DOVE and GOAP
In analogy to Figure 2 of the main manuscript, we use PAUL to augment the scoring functions DOVE [40] and GOAP [41]. DOVE is a machine learning scoring function that uses a voxel-based convolutional neural network to locally assess the protein-protein interface of a given model. Inputs to the neural network include atomic interaction types and their energetic contributions. GOAP is a more traditional statistical scoring function similar to SOAP-PP; it is used as a reference scoring function in the DOVE publication [40].

To augment a scoring function, we use PAUL to remove suboptimal models from a list that has been pre-ranked with the scoring function. Given a list of 1000 models for a complex, we assign a value to each model with PAUL and then remove all models with a value below the median.

Figure S5 shows the effect PAUL has on the ranking performance of DOVE and GOAP for each individual complex in the test set. Augmentation with PAUL improves performance on 5 and 9 complexes for DOVE and GOAP, respectively, without reducing performance for a single complex.

Figure S5: Ranking results per protein complex. Given 1000 structural models per complex, we rank the models using the scoring functions DOVE and GOAP, with or without application of PAUL to remove suboptimal models. The figure indicates whether or not a method ranks at least one acceptable model among the first 1, 5, 10, 50, and 100 models (T1, T5, T10, T50, and T100, respectively). Blue indicates that it does, while gray indicates that it does not. Green highlights differences due to the augmentation with PAUL; a green entry counts as a success, like a blue one. The figure shows results for the same 23 complexes as in Figure 2.
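The median-filtering step used to augment a scoring function can be stated compactly in code. The sketch below is our own illustration, not the authors' implementation; the function name and the convention that lower scoring-function values are better are assumptions.

```python
import numpy as np

def filter_then_rank(scores, paul_values):
    """Drop models whose PAUL value falls below the median, then return
    the indices of the surviving models ordered by the scoring function
    (lower score = better, an assumed convention)."""
    scores = np.asarray(scores, dtype=float)
    paul_values = np.asarray(paul_values, dtype=float)
    keep = np.flatnonzero(paul_values >= np.median(paul_values))
    return keep[np.argsort(scores[keep])]

# Toy example with 5 models: models 1 and 4 fall below the PAUL median.
print(filter_then_rank([3, 1, 2, 5, 4], [0.9, 0.1, 0.8, 0.7, 0.2]))  # [2 0 3]
```

With 1000 models per complex, as in the text, this removes roughly half of the candidates before the scoring function's ranking is read off.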
Success and hit rate
We evaluate the effect PAUL has on ranking performance with respect to two other metrics, success rate and hit rate.
Definition of metrics
Let $A(N)$ be the number of acceptable models, defined as models with LRMSD < 10 Å, in a set of $N$ models selected from the top of a ranked list of $N_\text{max}$ models.

The success rate $s(N)$ indicates whether there is at least one acceptable model among the $N$ models:

$$s(N) = \begin{cases} 1, & A(N) > 0 \\ 0, & A(N) = 0 \end{cases} \tag{6}$$

The hit rate $h(N)$ describes the relative number of acceptable models $A(N)$ among the top-ranked $N$ models for each scoring function:

$$h(N) = \frac{A(N)}{\min\left(N, A(N_\text{max})\right)} \tag{7}$$

An ideal scoring function has the property that $h(N) = 1$ for all $N$.

Results
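Both metrics are straightforward to compute from a ranked list. The sketch below is an illustrative implementation under our own conventions: `acceptable` is a boolean array over the ranked models (True where the model meets the LRMSD criterion), and at least one acceptable model is assumed to exist so the denominator of the hit rate is nonzero.

```python
import numpy as np

def success_rate(acceptable, n):
    """s(N): 1 if at least one acceptable model is among the top n, else 0 (Eq. 6)."""
    return int(np.any(acceptable[:n]))

def hit_rate(acceptable, n):
    """h(N): acceptable models among the top n, normalized so that an
    ideal ranking achieves h(N) = 1 for every N (Eq. 7)."""
    a_max = int(np.sum(acceptable))          # A(N_max): total acceptable models
    return int(np.sum(acceptable[:n])) / min(n, a_max)

# Ranked list of 5 models; the models at ranks 2 and 4 are acceptable.
acceptable = np.array([False, True, False, True, False])
print(success_rate(acceptable, 1))  # 0
print(hit_rate(acceptable, 2))      # 0.5
print(hit_rate(acceptable, 5))      # 1.0
```

The normalization by min(N, A(N_max)) is what lets a perfect ranking reach a hit rate of 1 even when N exceeds the number of acceptable models.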
We use our trained network PAUL to remove suboptimal models from a list of models pre-ranked by SOAP-PP. We describe this score combination with the notation PAUL→SOAP-PP. For a given list of 1000 models per complex, we remove all models for which PAUL assigns a score below the median. Figure S6 shows the effect of this method for our test set: we are able to increase success and hit rate consistently (Fig. S6). The values shown are computed as averages over the 27 complexes in the test set with at least one acceptable model among the first 1000 models generated by ATTRACT. We also demonstrate that other combinations of filtering, namely SOAP-PP→ZRANK and ZRANK