Graph-convolution neural network-based flexible docking utilizing coarse-grained distance matrix

Abstract

Prediction of protein-ligand complexes for flexible proteins remains a challenging problem in computational structural biology and drug design. Here we present two novel deep neural network approaches that significantly improve the efficiency and accuracy of binding mode prediction on a large and diverse set of protein systems compared to standard docking. Whereas the first graph convolutional network is used for re-ranking poses, the second approach aims to generate and rank poses independently of standard docking approaches. This novel approach relies on the prediction of distance matrices between ligand atoms and protein Cα atoms, thus incorporating side-chain flexibility implicitly.


Amr H. Mahmoud,†,‡ Jonas F. Lill,‡ and Markus A. Lill∗,‡

† Department of Medicinal Chemistry and Molecular Pharmacology, College of Pharmacy, Purdue University, 575 Stadium Mall Drive, West Lafayette, Indiana 47906, United States
‡ Department of Pharmaceutical Sciences, University of Basel, Klingelbergstrasse 50, 4056 Basel, Switzerland

E-mail: [email protected]

Phone: +41 61 2076135

arXiv preprint [q-bio.BM], Aug

Introduction

Structure-based drug design is an essential tool and an important pillar of Computer-aided Drug Design (CADD) for efficient lead discovery and optimization. CADD methods such as docking aim to identify novel binders to a target protein and to predict the structure of protein-ligand complexes. Docking is still widely applied using a rigid protein as template in CADD projects, ignoring the different conformations that the binding site can assume.

The importance of modelling protein flexibility in docking, however, was already recognized in early docking studies. In 1999, Murray et al. demonstrated the shortcomings of docking to rigid proteins by carrying out rigid cross docking for three enzymes (thrombin, thermolysin and the influenza virus neuraminidase). Each ligand was docked to the protein structures of all complexes available for each enzyme. The authors found that in 51% of the cases the program failed to dock the small molecules directly and highlighted the importance of modeling protein flexibility in computational docking. Later, Englebienne & Moitessier showed that the accuracy of many scoring functions can deteriorate due to protein flexibility and solvation. A large number of reviews discuss the importance of incorporating protein flexibility in docking algorithms, focusing on the side-chain, backbone and domain movements necessary for the protein to accommodate different ligands.

Incorporating protein flexibility into molecular docking is a difficult optimization problem involving a large number of degrees of freedom that represent the receptor flexibility. Approaches to incorporate receptor flexibility range from the use of soft-core potentials and multiple protein structures (ensemble docking) to the active sampling of protein conformations during energy optimization of the ligand (induced-fit docking).

Due to the computational complexity of the problem, flexible docking is still a challenging task for many practical uses, and full incorporation of protein flexibility is computationally not feasible. In summary, there is a significant demand for efficient algorithms to handle flexible proteins in docking.

Several limitations for improving flexible docking methodology exist. Most of the published methods are optimized or benchmarked on limited data sets with a very limited set of targets. While this practice was acceptable in the past due to limited computational resources, this is no longer the case. Validation has to be carried out using large test sets with a wide range of different targets. Only such a validation procedure makes it possible to identify the shortcomings of current and newly developed algorithms. This is essential for systematic improvement of flexible protein docking methodologies.

Additionally, flexible docking is more resource and time intensive than rigid docking. The application of flexible docking to virtual screening of large libraries is still unrealistic. The number of degrees of freedom in flexible protein docking is significantly higher than in rigid docking, leading to an increased rate of false positives and intensive usage of computational resources. Thus, there is a crucial need to develop new methods that rely on efficient algorithms and heuristics to lessen the computational requirements and allow an accurate, widespread implementation of flexible docking engines.

In rigid docking applications, deep-learning methods have proved useful in achieving unprecedented accuracy in pose prediction.

In this work, we aim to utilize deep learning to improve the quality of flexible docking approaches. We will demonstrate that our deep learning-based concepts increase the accuracy of flexible docking with concurrent improvement in sampling efficiency. Two neural-network-based methods are presented here, making initial steps towards this overarching aim (Figure 1). The first method, named re-ranking by gated attention neural network (RerankGAT), re-evaluates docking poses generated by standard docking approaches, while the second method, named pose generation by neural-network predicted distance matrix (PoseNetDiMa), is based on predicting and utilizing the distance matrix, which represents the relation between ligand and protein atoms, for pose generation and ranking.

Concepts using predictions of distance matrices have proven to be useful in many branches of bioinformatics and cheminformatics. AlphaFold, for example, enabled the ab initio prediction of 3D protein structures with a higher accuracy than any other state-of-the-art methods.

Another example used Wasserstein generative adversarial networks (GAN) to generate valid conformations of organic molecules. Our concept of using the prediction of distance matrices between proteins and ligands is the first in the domain of docking or protein-ligand interactions in general.

The RerankGAT method uses a graph representation for each possible ligand-protein pose and a distance-aware gated graph attention mechanism in order to learn to classify the ligand poses.

Details are discussed in the following Materials and Methods section. It is important to emphasize that the RerankGAT method in this work can boost the ranking of docking poses in case of docking success but cannot address sampling failure, i.e. failure to generate native-like poses independent of their subsequent ranking.

The PoseNetDiMa method predicts the distance matrix between the Cα atoms of the target protein and all heavy atoms of the ligand to be docked. Cα atoms were chosen to implicitly include side-chain flexibility without explicit sampling. The method uses coordinates of the Cα atoms and the ligand topology as input. A graph neural network with a global attention mechanism is trained to predict the pairwise distances between protein Cα and ligand heavy atoms. The model relies on the heterogeneous graph attention concept where two different types of graphs are encoded (Figure 2): One graph represents the protein using the Cα atoms as nodes colored by the type of amino acid. Edges between nodes are defined based on the Euclidean distance between Cα atoms. The other graph encodes the ligand, where nodes and edges are represented by heavy atoms and covalent bonds. No spatial information is included in the ligand graph, as information about ligand conformation has to be predicted in the docking stage based on the predicted distance matrix. In our implementation both graphs are fused into one graph with extended feature vectors as described in subsequent sections.

The PoseNetDiMa method is far more versatile since the predicted distance matrix can be used in multiple ways: First, the predicted distance matrix can provide the restraints necessary to confine the solutions within a limited space, thus enabling exploration of configurational space in a reasonable time. Second, generated poses can be filtered according to their correlation with the predicted distance matrix. Third, direct reconstruction of the ligand within the binding pocket based on the distance matrix is possible. In summary, the predicted distance matrix can be used in machine-learning assisted docking or re-scoring of poses. Thus, in contrast to RerankGAT, PoseNetDiMa was designed to increase the number of systems for which native docking solutions are identified compared to standard flexible docking.

Materials and Methods

Training Data

The general set from PDBbind was used for training and initial validation of the models. To generate poses for model training and validation, flexible docking was performed using Smina. Flexible residues were selected in an unbiased manner: any residue was considered flexible if it is located within 4 Å of the ligand in the X-ray complex structure. The search volume was defined by the centroid of the co-crystallized ligand, adding a padding of 8 Å to the box encompassing the ligand. Exhaustiveness was set to 8, with 50 modes requested and 8 threads used per docking job. For validation purposes, the data set was split into four groups and 4-fold cross-validation was performed.
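The docking settings described above could be expressed as a Smina call roughly as follows. This is a minimal sketch, not the authors' actual script: the function name and all file names are hypothetical, and the mapping of the described settings onto Smina's `--flexdist_ligand`, `--flexdist`, `--autobox_ligand` and `--autobox_add` options is an assumption.

```python
from typing import List


def build_smina_command(receptor: str, ligand: str,
                        crystal_ligand: str, out: str) -> List[str]:
    """Assemble a Smina argument list mirroring the settings in the text.

    All paths are placeholders supplied by the caller (hypothetical names).
    """
    return [
        "smina",
        "--receptor", receptor,
        "--ligand", ligand,
        # Search box around the co-crystallized ligand, padded by 8 A.
        "--autobox_ligand", crystal_ligand,
        "--autobox_add", "8",
        # Residues within 4 A of the co-crystallized ligand are flexible.
        "--flexdist_ligand", crystal_ligand,
        "--flexdist", "4",
        "--exhaustiveness", "8",
        "--num_modes", "50",
        "--cpu", "8",
        "--out", out,
    ]


if __name__ == "__main__":
    # Print the command; it could be run with subprocess.run(cmd, check=True).
    print(" ".join(build_smina_command(
        "protein.pdb", "ligand.sdf", "xtal_ligand.sdf", "poses.sdf")))
```

Building the argument list separately from executing it keeps the settings easy to inspect and to reuse across the four cross-validation groups.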

Validation and Quality Assessment using Cross Docking

The quality of the models was further assessed using cross docking. The data set used for this assessment comprises a large number of targets that vary in their difficulty for flexible docking. In total, around 4500 ligands from 95 targets were docked, with an average of 45 ligands per target (Table 1). All ligands were docked with flexible side chains with the same settings used for docking the general set from PDBbind (cf. Section Training Data). The dataset is consistent in coverage with the Disco dataset. In contrast to the study of Wierbowski et al., flexible docking on one template protein structure was performed instead of rigid cross-docking.

Table 1: Targets and number of ligands used in cross-docking experiments.
Target    PDB    Description    No.

Model Features

For both network models, basic chemical properties of atoms were used as initial node features. Those features include an atom's element type, connectivity index, aromaticity, implicit valence, partial charge estimates, number of attached hydrogen atoms, surface area contributions (Labute ASA in RDKit and TPSA), Crippen LogP, Crippen MR and electrotopological state descriptors, known as EState in RDKit. Bond featurization depends on the bond type, bond conjugation and whether the bond is a ring bond.

In the PoseNetDiMa model, however, which relies on a coarse-grained representation of the protein, the nodes in the protein graph represent the Cα atoms of the amino acids of the binding site with one-hot encoding for the twenty different amino acids. Features describing the physicochemical properties of each amino acid (i.e. polar, charged, hydrophobic, aromatic side chain) are added to the feature vector describing the nodes of the protein graph. To generate a single, heterogeneous graph containing both ligand and protein nodes, the ligand and protein node features are concatenated into one feature vector $F_i = (f_i^{(L)}, f_i^{(P)})$. Since Cα atoms are not covalently bonded, the edges were represented using virtual bonds that reflect pairwise distances between two nodes being within 7 Å. In detail, five distance bins between 2 Å and 7 Å are generated, and an edge between two Cα atoms within a maximum distance of 7 Å is colored by its association with the matching distance bin using one-hot encoding. For example, an edge between two Cα atoms with a distance of 5.7 Å will have the feature vector (0, 0, 0, 1, 0). To combine the heterogeneous bond features of ligand and protein, the feature vectors are concatenated: $F_{ij} = (f_{ij}^{(L)}, f_{ij}^{(P)})$.

Graph Neural Network Models
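As a brief aside before the individual models, the distance-bin edge encoding described under Model Features can be sketched as follows. The sketch assumes five equal-width bins of 1 Å between 2 Å and 7 Å, which is consistent with the 5.7 Å example in the text but is not stated explicitly. The function name and the handling of distances below 2 Å (assigned to the first bin) are hypothetical choices made for illustration.

```python
from typing import List, Optional


def ca_edge_feature(distance: float, d_min: float = 2.0,
                    d_max: float = 7.0, n_bins: int = 5) -> Optional[List[int]]:
    """One-hot distance-bin feature for a virtual bond between two C-alpha atoms.

    Returns None when the atoms are farther apart than d_max, i.e. no edge.
    Equal-width bins are an assumption; the text only gives the bin count
    and the 2-7 A range.
    """
    if distance > d_max:
        return None
    width = (d_max - d_min) / n_bins
    index = int((distance - d_min) // width)
    # Clamp so that distance == d_max falls into the last bin and
    # distances below d_min (assumption) fall into the first bin.
    index = min(max(index, 0), n_bins - 1)
    one_hot = [0] * n_bins
    one_hot[index] = 1
    return one_hot


if __name__ == "__main__":
    print(ca_edge_feature(5.7))  # the worked example from the text
```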

Pose re-ranking model: RerankGAT

The model for pose re-ranking utilizes the graph neural network algorithm described by Lim et al. The network was gated with an attention mechanism that takes distances into consideration.

Model for prediction of distance matrix: PoseNetDiMa

The PoseNetDiMa model is inspired by the work of Jin et al. on synthesis prediction. In our work, a network similar to that of Jin et al. is used with the main aim of predicting protein-ligand distance matrices that could be used as distance restraints during pose sampling or for pose filtering and re-scoring. The network tries to identify the correspondence between each atom of the ligand and the Cα atoms comprising the protein binding site.

In a first step, the nodes of both protein and ligand graph are encoded using a graph neural network (Figure 4 A). To encode the hidden features $h_v^{(l)}$ of a node $v$ in layer $l$, messages $m_{uv}$ from neighboring nodes $u \in N(v)$ are collected. To compute the message between $u$ and $v$, the current node feature $h_u^{(l-1)}$ and edge feature $F_{uv}$ are concatenated and used as input of a neural network (Figure 4 A) with ReLU activation function $\tau$:

$$m_{uv} = \tau\left(V\left[h_u^{(l-1)}, F_{uv}\right]\right) \tag{1}$$

After collecting all messages from neighboring nodes, the previous hidden feature $h_v^{(l-1)}$ of node $v$ is added in a skip connection and a subsequent neural network finally encodes the new hidden features

$$h_v^{(l)} = \tau\left(U_1 h_v^{(l-1)} + U_2 \sum_{u \in N(v)} \tau\left(V\left[h_u^{(l-1)}, F_{uv}\right]\right)\right) \tag{2}$$

where $h_v^{(0)} = F_v$ and $U_1$, $U_2$, $V$ are shared weights.

After $L$ steps of graph convolutions, the current hidden feature vector $h_v^{(L)}$ is transformed into a final local feature vector $c_v$ (Figure 4 B). First, $h_v^{(L)}$, neighboring feature vectors $h_u^{(L)}$ and corresponding edges undergo additional tensor multiplications by $W^{(0)}$, $W^{(1)}$ and $W^{(2)}$. Subsequent Hadamard products between the three resulting entities generate the final local feature vector $c_v$:

$$c_v = W^{(2)} h_v^{(L)} \odot \sum_{u \in N(v)} W^{(0)} h_u^{(L)} \odot W^{(1)} F_{uv} \tag{3}$$

$c_v$ is a feature vector that locally encodes the chemical environment of the atom.

To predict the likely distance between ligand atoms and protein Cα atoms, the currently distinct ligand and protein graphs need to be connected.
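Before the two graphs are connected, the local encoding of Equations 1 to 3 can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the data layout (dictionaries keyed by node and by directed edge), all function names, and the reading of Equation 3 as a sum over neighbors of the element-wise product are assumptions, and the weights in the usage check are toy values.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Edge = Tuple[int, int]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def relu(x: Vector) -> Vector:
    return [max(0.0, xi) for xi in x]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def hadamard(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def conv_step(h: Dict[int, Vector], edges: Dict[Edge, Vector],
              neighbors: Dict[int, List[int]],
              U1: Matrix, U2: Matrix, V: Matrix) -> Dict[int, Vector]:
    """One graph-convolution update (Equations 1 and 2)."""
    new_h = {}
    for v, nbrs in neighbors.items():
        agg = [0.0] * len(V)
        for u in nbrs:
            # Eq. 1: message from u to v; list "+" concatenates [h_u, F_uv].
            agg = add(agg, relu(matvec(V, h[u] + edges[(u, v)])))
        # Eq. 2: skip connection on h_v plus aggregated messages.
        new_h[v] = relu(add(matvec(U1, h[v]), matvec(U2, agg)))
    return new_h


def local_feature(h: Dict[int, Vector], edges: Dict[Edge, Vector],
                  neighbors: Dict[int, List[int]],
                  W0: Matrix, W1: Matrix, W2: Matrix) -> Dict[int, Vector]:
    """Final local feature vector c_v (Equation 3)."""
    c = {}
    for v, nbrs in neighbors.items():
        agg = [0.0] * len(W0)
        for u in nbrs:
            agg = add(agg, hadamard(matvec(W0, h[u]),
                                    matvec(W1, edges[(u, v)])))
        c[v] = hadamard(matvec(W2, h[v]), agg)
    return c


if __name__ == "__main__":
    # Toy two-node graph with 1-D node and edge features.
    h0 = {0: [1.0], 1: [2.0]}
    F = {(0, 1): [1.0], (1, 0): [1.0]}
    nb = {0: [1], 1: [0]}
    h1 = conv_step(h0, F, nb, U1=[[1.0]], U2=[[1.0]], V=[[1.0, 1.0]])
    print(h1, local_feature(h1, F, nb, [[1.0]], [[1.0]], [[1.0]]))
```

In practice such a layer would be written with a tensor library and batched over graphs; the dictionary form is used here only to keep each term of the equations visible.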
In other words, information needs to be shared between both graphs. The main idea is that the interaction strength between different residue types and ligand atom types varies (e.g. hydrogen bonds differ in distance dependency and strength compared to hydrophobic contacts). Whereas the local environment of a node within one of the graphs is captured according to each atom's connectivity with its neighbors, the global environment, i.e. protein-ligand interactions, is incorporated through a global attention mechanism which allows for weighted information exchange between nodes of the two different graphs (Figure 4 C). The attention score $\alpha_{uv}$ between nodes $u$ and $v$ is derived by

$$\alpha_{uv} = \sigma\left(\mathbf{u}^T \tau\left(P_a c_u + P_a c_v + P_b b_{uv}\right)\right) \tag{4}$$

where $\sigma(\cdot)$ is the sigmoid function, and $b_{uv}$ is a feature vector that represents information about the relationship between $u$ and $v$, i.e. whether the two nodes represent protein-protein, ligand-protein or ligand-ligand pairwise interactions or covalent bonds.

The global feature representation $g_u$ of node $u$ is then calculated as the weighted sum of all other surrounding nodes, where the weights correspond to the attention factors (Figure 4 D):

$$g_u = \sum_v \alpha_{uv} c_v \tag{5}$$

Finally, the distance between two nodes $u$ and $v$, e.g. ligand atom $u$ and protein Cα atom $v$, is computed (Figure 4 E) by

$$d_{uv} = \mathbf{u}^T \tau\left(M_a g_u + M_a g_v + M_b f_{uv} + P_a c_u + P_a c_v\right) \tag{6}$$

Training. The network is finally trained to reproduce the experimentally measured distances $y_{uv}$ between ligand atoms $u$ and protein Cα atoms $v$. Each label is predicted independently due to the quadratic complexity of the problem: the number of interaction labels is the product of $N$ ligand atoms and $M$ Cα atoms, and this quadratic complexity prevents higher-order predictions.

Docking using PoseNetDiMa

Similar to RerankGAT, PoseNetDiMa can be used for re-ranking poses obtained from standard flexible docking such as Smina. As Smina is unable to generate native-like poses for a large number of targets, we tested whether the predicted distance matrices could be used directly for both the posing and ranking phases in docking. The overall docking scheme based on PoseNetDiMa is shown in Figure 5.

First, the distance matrix for the protein-ligand system of interest is predicted using PoseNetDiMa. For every ligand atom, all possible locations are computed based on the predicted distance matrix using all possible triplets of Cα atoms in the binding site. Those points are clustered using the Quality Threshold (QT) clustering algorithm with a radius of 1 Å. Clustering is stopped either when half of all possible points are assigned to a cluster or when a maximum number of three clusters is identified for an atom. Pharmacophore models are generated from the clusters with one element per atom. For each atom the cluster center used is selected randomly, generating a maximum number of 25 pharmacophores. Docking to the pharmacophore models is performed using LSAlign. Those poses are rescored using iDock on atomic density maps. Those density maps are 3D grids where the density of a ligand atom $i$ at grid point $k$ is obtained from the product of normal distribution functions centered around the predicted distance $d_{ji}$ between ligand atom $i$ and Cα atom $j$

$$p_{ik} = \prod_j \exp\left(-[\ldots] \cdot \left(r_{jk} - d_{ji}\right)^2\right) \tag{7}$$

where $r_{jk}$ is the distance between protein atom $j$ and grid point $k$.

Results

General Set Evaluation

Four-fold cross validation was carried out using the general set from PDBbind. The cumulative results of only the test sets in the four cross-validation runs are reported. First, we tested the re-ranking performance of RerankGAT based on the poses obtained from Smina docking. Smina was only able to generate native-like poses (RMSD < [...] r > [...] could be achieved, for half of the systems a correlation even larger than 0.8. Interestingly, there is a correlation between the number of systems with native-like poses and the quality of the distance matrix prediction (Figure 7, bottom). For example, 80 % of systems with high distance-matrix quality (r > [...]) have near-native poses, compared with only 50 % of systems with poor distance-matrix quality (r < [...]). Initial analysis indicates that the distance matrices of systems with high flexibility and particular solvent exposure, which may have alternative binding poses, are difficult to predict. Those systems also show no robust prediction of binding poses in docking.

Cross-Docking Assessment

For additional validation, the same analysis was performed on cross docking of 95 targets with different levels of difficulty. Some targets, such as Cytochrome P450 3A4 and Caspase-3, are known to have a high failure rate in cross docking. Smina was used for flexible docking of the cross-docking dataset and the poses were re-scored using the graph-attention neural network model trained on the general set of PDBbind (Figure 8). The first observation is that there is a higher number of systems with native-like poses compared to the general set of PDBbind (81 % versus 66 %). The reason for this difference is that for each target system, the protein structure with the highest success rate was selected for cross-docking studies following a previous protocol. The higher number of systems with native-like poses resulted in a higher number of native-like poses identified in the top-5 ranked list, with RerankGAT (77 %) again outperforming Smina scoring (64 %). In contrast, the performance for identifying a native-like pose in the top-1 position remained unchanged in the cross-docking study compared to flexible docking to PDBbind.

Figure 9 shows similar accuracy in predicting the distance matrix for the cross-docking dataset compared to the PDBbind dataset, and the same trend of overall better prediction of the distance matrix for systems with more likely success in generating native-like poses.

Re-ranking of poses using PoseNetDiMa

Next, we explored the potential of PoseNetDiMa to re-rank poses obtained from Smina. Whereas the similarity between native and docked pose is typically measured by their RMSD value, the similarity of their corresponding protein-ligand distance matrices could be used alternatively (Figure 10). Thus, assuming the distance matrix predicted by PoseNetDiMa is similar to the experimentally known distance matrix, the docked poses could be translated into distance matrices and ranked by their similarity to the predicted distance matrix. Based on this idea, our hypothesis was that pose ranking could be improved using the predicted distance matrix from PoseNetDiMa.

The analysis was performed on those systems for which the docking engine was able to generate near-native poses. As shown in Figure 11 (left), PoseNetDiMa significantly improves pose ranking, even outperforming RerankGAT by a significant margin. Whereas Smina is only able to rank 47 % of systems with native-like poses as top-1, PoseNetDiMa increases this percentage to 82 %. Adding native poses to the pool of docked poses further increases this percentage to 89 %.
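The re-ranking idea can be illustrated with a short sketch: each docked pose is converted into a ligand-atom by Cα distance matrix and compared with the predicted matrix. The similarity measure used here, the Pearson correlation coefficient, is an assumption; the text does not state which measure was used for ranking. All function names and the toy coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_matrix(ligand: Sequence[Point], ca: Sequence[Point]) -> List[float]:
    """Flattened (row-major) ligand-atom x C-alpha distance matrix."""
    return [math.dist(atom, c) for atom in ligand for c in ca]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def rerank(poses: Sequence[Sequence[Point]], ca: Sequence[Point],
           predicted: Sequence[float]) -> List[int]:
    """Pose indices sorted by decreasing similarity to the predicted matrix."""
    scores = [pearson(distance_matrix(p, ca), predicted) for p in poses]
    return sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    ca_atoms = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
    native = [(1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
    decoy = [(3.0, 3.0, 0.0), (3.0, 2.0, 0.0)]
    # Stand-in for a network prediction: here simply the native matrix.
    predicted = distance_matrix(native, ca_atoms)
    print(rerank([decoy, native], ca_atoms, predicted))
```

Other measures, such as the root-mean-square difference between the two matrices, would fit the same interface by swapping the scoring function and the sort direction.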

Docking using PoseNetDiMa

Docking using PoseNetDiMa was performed on the cross-docking set. Despite only using the Cα atoms from the protein, PoseNetDiMa obtained the same success rate in identifying a native-like pose at the top-1 position and even slightly outperformed flexible docking using Smina when considering the top-5 poses (Figure 12). Interestingly, for almost all systems a top-5 ranked pose was identified within an RMSD of less than 4 Å. This means that the general orientation of the scaffold of a ligand could be identified for almost all systems based on a coarse-grained representation of the protein.

Furthermore, Figure 13 highlights a strong correlation between the prediction quality of the protein-ligand distance matrix and the docking quality. In particular, a native-like pose could most likely be generated among the top-5 ranked poses if the correlation coefficient r between experimental and predicted distance matrix is larger than 0.8, and such a native-like pose is top ranked if r is even larger than 0.9. Thus, in the future we will focus on improving the model for predicting the protein-ligand distance matrix, as this will directly improve docking performance beyond the quality of full-atomistic flexible docking programs.

Conclusion

We demonstrated in this study how flexible docking performance can be significantly improved using deep learning approaches. Two different models have been designed for this task. The first, RerankGAT, is a model based on graph convolutional neural networks, which was used to re-rank existing poses. Besides standard docking algorithms, those poses could also be obtained from molecular dynamics simulations or similarity-based alignment algorithms. The second model, PoseNetDiMa, which generates distance matrices between ligands and proteins based on ligand topology and Cα atoms of binding site residues, can also be used for re-ranking poses. Furthermore, PoseNetDiMa also provides the necessary information to directly guide ligand placement.

Analysis of targets used in flexible docking reveals that standard docking strategies show weak accuracy in binding pose generation for flexible proteins and proteins with large binding sites compared to ligand size. Using distance matrices that are based on Cα atoms only, explicit side-chain sampling becomes obsolete, reducing the degrees of freedom significantly. This can result in more efficient and accurate sampling of native-like poses.

References

(1) Murray, C. W.; Baxter, C. A.; Frenkel, A. D. The sensitivity of the results of molecular docking to induced fit effects: application to thrombin, thermolysin and neuraminidase. Journal of computer-aided molecular design, 547–562.
(2) Englebienne, P.; Moitessier, N. Docking ligands into flexible and solvated macromolecules. 4. Are popular scoring functions accurate for this class of proteins? Journal of chemical information and modeling, 1568–1580.
(3) Lill, M. A. Efficient Incorporation of Protein Flexibility and Dynamics into Molecular Docking Simulations. Biochemistry, 6157–6169.
(4) Totrov, M.; Abagyan, R. Flexible ligand docking to multiple receptor conformations: a practical alternative. Current opinion in structural biology, 178–184.
(5) Durrant, J. D.; McCammon, J. A. Computer-aided drug-discovery techniques that account for receptor flexibility. Current opinion in pharmacology, 770–774.
(6) Henzler, A. M.; Rarey, M. In Pursuit of Fully Flexible Protein-Ligand Docking: Modeling the Bilateral Mechanism of Binding. Molecular Informatics, 164–173.
(7) Sotriffer, C. A. Accounting for induced-fit effects in docking: what is possible and what is not? Current topics in medicinal chemistry, 179–191.
(8) Ferrari, A. M.; Wei, B. Q.; Costantino, L.; Shoichet, B. K. Soft docking and multiple receptor conformations in virtual screening. Journal of medicinal chemistry, 5076–5084.
(9) Sherman, W.; Day, T.; Jacobson, M. P.; Friesner, R. A.; Farid, R. Novel procedure for modeling ligand/receptor induced fit effects. Journal of medicinal chemistry, 534–553.
(10) Mahmoud, A. H.; Masters, M. R.; Yang, Y.; Lill, M. A. Elucidating the multiple roles of hydration for accurate protein-ligand binding prediction via deep learning. Communications Chemistry, 1–13.
(11) Ghanbarpour, A.; Mahmoud, A. H.; Lill, M. A. On-the-fly Prediction of Protein Hydration Densities and Free Energies using Deep Learning. arXiv preprint arXiv:2001.02201.
(12) Masters, M. R.; Mahmoud, A. H.; Yang, Y.; Lill, M. A. Efficient and Accurate Hydration Site Profiling for Enclosed Binding Sites. Journal of chemical information and modeling, 2183–2188, PMID: 30289252.
(13) Senior, A. W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A. W.; Bridgland, A., et al. Improved protein structure prediction using potentials from deep learning. Nature, 706–710.
(14) Senior, A. W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A. W.; Bridgland, A., et al. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins: Structure, Function, and Bioinformatics, 1141–1148.
(15) Hoffmann, M.; Noé, F. Generating valid Euclidean distance matrices. arXiv preprint arXiv:1910.03131.
(16) Lim, J.; Ryu, S.; Park, K.; Choe, Y. J.; Ham, J.; Kim, W. Y. Predicting Drug–Target Interaction Using a Novel Graph Neural Network with 3D Structure-Embedded Graph Representation. Journal of chemical information and modeling, 3981–3988.
(17) Ryu, S.; Lim, J.; Hong, S. H.; Kim, W. Y. Deeply learning molecular structure-property relationships using attention- and gate-augmented graph convolutional network. arXiv preprint arXiv:1805.10988.
(18) Sun, M.; Zhao, S.; Gilvary, C.; Elemento, O.; Zhou, J.; Wang, F. Graph convolutional networks for computational drug development and discovery. Briefings in bioinformatics.
(19) Morris, C.; Ritzert, M.; Fey, M.; Hamilton, W. L.; Lenssen, J. E.; Rattan, G.; Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. Proceedings of the AAAI Conference on Artificial Intelligence. 2019; pp 4602–4609.
(20) Lei, T.; Jin, W.; Barzilay, R.; Jaakkola, T. Deriving neural architectures from sequence and graph kernels. Proceedings of the 34th International Conference on Machine Learning-Volume 70. 2017; pp 2024–2033.
(21) Jin, W.; Coley, C.; Barzilay, R.; Jaakkola, T. Predicting organic reaction outcomes with weisfeiler-lehman network. Advances in Neural Information Processing Systems. 2017; pp 2607–2616.
(22) Coley, C. W.; Jin, W.; Rogers, L.; Jamison, T. F.; Jaakkola, T. S.; Green, W. H.; Barzilay, R.; Jensen, K. F. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical science, 370–377.
(23) Wang, X.; Ji, H.; Shi, C.; Wang, B.; Ye, Y.; Cui, P.; Yu, P. S. Heterogeneous graph attention network. The World Wide Web Conference. 2019; pp 2022–2032.
(24) Koes, D. R.; Baumgartner, M. P.; Camacho, C. J. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. Journal of chemical information and modeling, 1893–1904.
(25) Wierbowski, S. D.; Wingert, B. M.; Zheng, J.; Camacho, C. J. Cross-docking benchmark for automated pose and ranking prediction of ligand binding. Protein Science, 298–305.
(26) Hu, J.; Liu, Z.; Yu, D.-J.; Zhang, Y. LS-align: an atom-level, flexible ligand structural alignment algorithm for high-throughput virtual screening. Bioinformatics, 2209–2218.
(27) Majewski, M.; Carmona, S. R.; Barril, X. Are protein-ligand complexes robust structures? bioRxiv, 454165.

List of Figures

1 Two neural network approaches to improve flexible docking performance. A. Gated attention neural network (RerankGAT) re-ranks poses obtained from standard docking program, here Smina, aiming to improve pose scoring. B. Graph neural network that predicts distance matrix between ligand atoms and protein Cα atoms (PoseNetDiMa) using protein Cα atom coordinates and ligand topology as input. The predicted distance matrix can be directly used to generate poses with implicit inclusion of side chain flexibility.
2 Two different types of graphs are encoded using graph neural network (GNN). One graph uses the Cα atoms of the binding site residues as nodes colored by type of amino acid. Edges are colored based on distance between connecting Cα atoms. The second graph encodes the ligand topology using all heavy atoms. The atom nodes are colored by atom properties, the edges by bond character.
3 A. Scheme of a graph convolution step (left) and its attention-augmented version (right). The update of the central node is carried out using neighboring nodes, where the different widths of the arrows reflect the importance of information transfer, hence attention. B. Different versions of skip connections to conserve initial node features over multiple update steps. Skip rate z_v is determined using a neural network layer.
4 Scheme of PoseNetDiMa to predict distance matrix based on coarse-grained representation of protein and 2D representation of ligand. After initial local encoding using message passing (A), a local feature vector is determined based on combining atom encoding and bond featurization for protein and ligand separately (B). Using global attention (C), protein and ligand encodings are combined and a final global feature vector is computed (D). Local and global feature vectors are finally combined to predict the protein-ligand distance matrix (E).
5 Scheme for using PoseNetDiMa for docking. A. Distance matrix is predicted using PoseNetDiMa. B. For every ligand atom, all possible locations are computed based on the predicted distance matrix using all possible protein triplets. C. Those points are clustered using QT clustering algorithm. D. A maximum of 25 pharmacophore models are generated by random selection of combinations of cluster centers. Docking is performed to pharmacophore models using LSAlign. E. Those poses are rescored using iDock and atomic density maps obtained from predicted distance matrix.
6 Ranking performance using Smina and RerankGAT (top) on all systems and (bottom) on systems with at least one native-like pose.
7 (Top) Fraction of systems with certain distance matrix prediction accuracy measured by correlation coefficient between experimental and predicted distance matrix. (Bottom) Fraction of systems with native-like pose using Smina correlated with the distance matrix prediction accuracy.
8 Ranking performance using Smina and RerankGAT (top) on all systems and (bottom) on systems with at least one native-like pose for cross-docking data set.
9 (Top) Fraction of systems with certain distance matrix prediction accuracy measured by correlation coefficient between experimental and predicted distance matrix. (Bottom) Fraction of systems with native-like pose using Smina correlated with the distance matrix prediction accuracy. Data for cross-docking data set is shown.
10 Scheme for pose re-ranking using PoseNetDiMa. Protein-ligand distance matrix is predicted using PoseNetDiMa and compared with corresponding distance matrices measured for each docking pose. Re-ranking is performed based on similarity between predicted distance matrix and distance matrix of a given docking pose.
11 Re-ranking accuracy of docking poses obtained from Smina using PoseNetDiMa for systems with native-like poses using only docked poses (left) or when adding native poses from X-ray structure (right).
12 Cumulative probability of predicting docking pose within certain RMSD to native binding mode at top-1 or among top-5 ranked solutions using PoseNetDiMa in docking mode.
13 Probability of predicting native-like docking poses within an RMSD of less than 2 Å (blue) and 4 Å (orange) to the native binding mode at top-1 or among top-5 ranked solutions using PoseNetDiMa in docking mode. Dependency on prediction quality of distance matrix is shown.

Figure 1: Two neural network approaches to improve flexible docking performance. A. Gated attention neural network (RerankGAT) re-ranks poses obtained from standard docking program, here Smina, aiming to improve pose scoring. B. Graph neural network that predicts distance matrix between ligand atoms and protein Cα atoms (PoseNetDiMa) using protein Cα atom coordinates and ligand topology as input. The predicted distance matrix can be directly used to generate poses with implicit inclusion of side chain flexibility.

Figure 2: Two different types of graphs are encoded using graph neural network (GNN). One graph uses the Cα atoms of the binding site residues as nodes colored by type of amino acid. Edges are colored based on distance between connecting Cα atoms. The second graph encodes the ligand topology using all heavy atoms. The atom nodes are colored by atom properties, the edges by bond character.

Figure 3: A. Scheme of a graph convolution step (left) and its attention-augmented version (right). The update of the central node is carried out using neighboring nodes, where the different widths of the arrows reflect the importance of information transfer, hence attention. B. Different versions of skip connections to conserve initial node features over multiple update steps. Skip rate z vv