Improved Protein-ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference
Derek Jones, Hyojin Kim, Xiaohua Zhang, Adam Zemla, Garrett Stevenson, William D. Bennett, Dan Kirshner, Sergio Wong, Felice Lightstone, Jonathan E. Allen
Derek Jones†∗, Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, CA
Hyojin Kim†, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA
Xiaohua Zhang, Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA
Adam Zemla, Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA
Garrett Stevenson, Computational Engineering Division, Lawrence Livermore National Laboratory, Livermore, CA
William D. Bennett, Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA
Dan Kirshner, Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA
Sergio Wong, Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA
Felice Lightstone, Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA
Jonathan E. Allen, Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, CA

May 19, 2020

Abstract

Predicting accurate protein-ligand binding affinity is important in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite recent advances in deep convolutional and graph neural network based approaches, model performance depends on the input data representation and suffers from distinct limitations. It is natural to combine complementary features, and their inference, from the individual models for better predictions. We present fusion models that combine different feature representations from two neural network models to improve binding affinity prediction. We demonstrate the effectiveness of the proposed approach by performing experiments with the PDBBind 2016 dataset and its docking pose complexes. The results show that the proposed approach improves overall prediction compared to the individual neural network models, with greater computational efficiency than related biophysics-based energy scoring functions. We also discuss the benefit of the proposed fusion inference with several example complexes. The software is made available as open source at https://github.com/llnl/fast.

† denotes authors contributed equally.
Predicting accurate binding affinity between a small molecule and a target protein is one of the fundamental challenges in drug development. Recently, deep learning models have been proposed as an alternative to traditional physics-based free energy scoring functions. The benefit of the deep learning approach is in learning binding interaction rules directly from an atomic representation, without relying on hand-curated features that may not capture the mechanism of binding (Ballester and Mitchell 2010; Ain et al. 2015).
The PDBBind database, a curated subset of the Protein Data Bank (PDB) (wwPDB consortium 2019) initially developed for use in molecular dynamics pipelines, is a popular choice for the development of machine learning based scoring functions (Feinberg et al. 2018). PDBBind is divided into subsets (general and refined) based upon criteria that consider the nature of the complex (e.g., complexes whose ligands have a molecular weight above 1000 Da are excluded from refined), the quality of the binding data (e.g., complexes with an IC50 but no ki or kd measurement are not included in refined), and the quality of the complex structure (e.g., the resolution of the crystal structure must be better than 2.5 Å). From the refined set, a core set is compiled to provide a representative set for validation, using a clustering protocol.

The 2016 edition of PDBBind used in this study consists of 13,308 protein-ligand binding complexes in general, 4,057 complexes in refined, and 290 complexes in core. An example input is shown in Figure 1.

All docking complex data is generated using the in-house developed ConveyorLC toolchain (Zhang et al. 2013; 2014). A high solute dielectric constant (ε = 4) is used in the MM/GBSA rescoring, since previous studies demonstrate remarkable improvement in the pose ranking (Sun et al. 2014).

Before extracting the respective three-dimensional representations for each deep learning model, a common preprocessing protocol was applied to the binding complex structures provided in Protein Data Bank (.pdb) format by the PDBBind database. The process closely mirrors Stepniewska-Dziubinska et al. (2018); the same protocol is applied to the core set to simulate realistic conditions of evaluating new docking poses. This protocol produces a Tripos Mol2 (.mol2) file for each protein pocket.
A common atomic representation, based on that of Stepniewska-Dziubinska et al. (2018), is used for both models:

- Element type: one-hot encoding of B, C, N, O, P, S, Se, halogen, or metal
- Atom hybridization (1, 2, or 3)
- Number of heavy atom bonds (i.e., heavy valence)
- Number of bonds with other heteroatoms
- Structural properties: bit vector (1 where present) encoding of hydrophobic, aromatic, acceptor, donor, ring
- Partial charge
- Molecule type, to indicate protein atom versus ligand atom (-1 for protein, 1 for ligand)
- Van der Waals radius

The OpenBabel cheminformatics tool (version 2.4.1) (O'Boyle et al. 2011) is used in this featurization.
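The feature list above can be made concrete with a short sketch. The helper name and encoding order below are illustrative assumptions, not the authors' exact layout; the van der Waals radius is kept out of the vector here on the assumption that the 3D-CNN consumes it during voxel assignment rather than as one of its 19 channels (the listed items otherwise sum to 20).

```python
# Hypothetical sketch of the per-atom feature vector described above.
# Encoding order is an assumption, not the authors' exact layout.

ELEMENTS = ["B", "C", "N", "O", "P", "S", "Se", "halogen", "metal"]  # one-hot (9)
STRUCT_PROPS = ["hydrophobic", "aromatic", "acceptor", "donor", "ring"]  # bits (5)

def featurize_atom(element, hybridization, heavy_valence, hetero_bonds,
                   props, partial_charge, mol_type):
    """Return a 19-dimensional feature vector for one atom.

    props is the set of structural properties present on the atom;
    mol_type is -1 for a protein atom and +1 for a ligand atom.
    """
    one_hot = [1.0 if element == e else 0.0 for e in ELEMENTS]
    prop_bits = [1.0 if p in props else 0.0 for p in STRUCT_PROPS]
    return (one_hot
            + [float(hybridization), float(heavy_valence), float(hetero_bonds)]
            + prop_bits
            + [float(partial_charge), float(mol_type)])
```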
Figure 2: The proposed mid-level fusion model together with 3D-CNN and SG-CNN.
The refined set was subtracted from the general set, and the core set was subtracted from the refined set, such that there are no overlaps between the three subsets. We hold out the core set to be used as testing data, keeping the remaining general and refined complexes as training data, due to the relatively small size of our datasets compared to those in other domains such as computer vision (Deng et al. 2009).

The 3D-CNN input is a voxel grid of size N × N × N × C, where N is the voxel grid size in each axis (48 in our experiment) and C is the number of atomic features described in Subsection 2.1.2 (19 in our experiment). The volume size in each dimension, with the chosen voxel size, is sufficient to cover the entire pocket region while minimizing collisions between atoms. Each atom is assigned to at least one voxel, depending on its Van der Waals radius or the user-defined size. In the case of collisions between atoms, we apply element-wise addition to the atom features. Once all atoms are voxelized, a Gaussian blur with σ = 1 is applied in order to populate the atom features into neighboring voxels, similar to Kuzminykh et al. (2018) (Figure 2, Top). The residual shortcut connection proposed in ResNet (He et al. 2016) is also used in the 3D-CNN.
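A minimal NumPy sketch of the voxelization and blurring steps above. Grid centering, the collision rule (element-wise addition), and σ = 1 follow the text; the kernel radius and the single-voxel atom assignment are simplifying assumptions (the paper spreads atoms according to their van der Waals radii).

```python
import numpy as np

def voxelize(coords, features, grid_size=48, voxel_size=1.0):
    """Scatter per-atom feature vectors into a grid_size^3 x C grid.

    Colliding atoms are combined by element-wise addition, as in the text.
    Coordinates are assumed to be centered on the binding pocket.
    """
    n_feat = features.shape[1]
    grid = np.zeros((grid_size, grid_size, grid_size, n_feat), dtype=np.float32)
    half = grid_size * voxel_size / 2.0
    idx = np.floor((coords + half) / voxel_size).astype(int)
    for (i, j, k), f in zip(idx, features):
        if 0 <= i < grid_size and 0 <= j < grid_size and 0 <= k < grid_size:
            grid[i, j, k] += f            # element-wise addition on collision
    return grid

def gaussian_blur(grid, sigma=1.0, radius=2):
    """Separable 1-D Gaussian blur along each spatial axis (sigma = 1)."""
    x = np.arange(-radius, radius + 1, dtype=np.float32)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    for axis in range(3):                 # blur the three spatial axes only
        grid = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, grid)
    return grid
```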
Deep learning approaches for modeling chemical graphs have demonstrated viability for learning continuous vector representations of molecular data as well as for property prediction tasks (Duvenaud et al. 2015; Kearnes et al. 2016; Li et al. 2016; Gilmer et al. 2017).
Each atom is treated as a node within a graph representation of the protein-ligand binding complex; our spatial graph convolutional neural network (SG-CNN) draws on the PotentialNet architecture presented by Feinberg et al. (2018). Both covalent and non-covalent bonds are represented through the use of a square N × N adjacency matrix A and an N × M node feature matrix, where A ∈ R^(N×N) and A_ij is equal to the Euclidean distance (in Angstroms, Å) between atom i and atom j. To further expand this representation as a 3D tensor, we define two thresholds for covalent and non-covalent "neighborness", α_c and α_nc respectively, s.t. A_ij,c = 0 if A_ij ≥ α_c and A_ij,nc = 0 if A_ij ≥ α_nc. The threshold values used in this paper were those that led to the best performance on the validation set. We implement the SG-CNN using the PyTorch Geometric (PyG) python library (Fey and Lenssen 2019).

Fusion models that combine multiple input sources or different feature representations have been applied to a number of computer vision applications, especially in the presence of multi-modal images or different image sensors. These fusion models benefit from multiple feature representations that are considered complementary to each other. In addition, fusion-based approaches increase robustness by reducing the uncertainty of each feature representation or modality. Inspired by this, we propose to use a separate fusion neural network to combine feature representations from two independently trained models (3D-CNN and SG-CNN), each of which has its own strengths and weaknesses. The heterogeneous feature representations that the two models capture can enrich the proposed fusion model's features, with strong potential to improve the performance of binding affinity prediction. Among the several ways to fuse models addressed in Roitberg et al. (2019), we adopt a mid-level fusion of the two networks' intermediate features (Figure 2).
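The thresholded adjacency construction described above can be sketched as follows. The default threshold values (1.5 Å covalent, 4.5 Å non-covalent) are placeholders of plausible magnitude, since the exact tuned values are only described here as validation-selected.

```python
import numpy as np

def split_adjacency(coords, alpha_c=1.5, alpha_nc=4.5):
    """Build covalent / non-covalent adjacency matrices from 3-D coordinates.

    A_ij holds the pairwise Euclidean distance in Angstroms; entries at or
    beyond each threshold are zeroed, following A_ij = 0 if A_ij >= alpha.
    The threshold defaults are illustrative placeholders, not the paper's
    tuned values.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))          # full N x N distance matrix
    adj_c = np.where(dist < alpha_c, dist, 0.0)  # covalent "neighborness"
    adj_nc = np.where(dist < alpha_nc, dist, 0.0)  # non-covalent
    np.fill_diagonal(adj_c, 0.0)
    np.fill_diagonal(adj_nc, 0.0)
    return adj_c, adj_nc
```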
Protein-ligand complex binding pockets from the PDBBind database were compared to identify local regions surrounding ligands and perform structure-based clustering for evaluation of machine learning model performance. Clustering of the structures was performed using LGA (Zemla 2003) on a whole-protein level as well as on specific local substructures selected to represent ligand binding site regions. The implemented approach can be briefly described as follows: for each protein-ligand complex, the binding site local environment was delineated using an initial 12.0 Å radius sphere centered at the ligand atoms. The 12.0 Å radius was selected in order to capture as much conformation information around the local environment as possible, to allow detection of similarities between pockets even with different sizes of observed ligands. Previous research indicated that distances of 7.5 Å are an upper limit in capturing informative functional properties for clustering purposes (Yoon et al. 2007).
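The initial 12.0 Å sphere delineation step reduces to a simple geometric selection; the sketch below shows only that selection, under a hypothetical coordinate-array interface, and is not part of the LGA clustering itself.

```python
import numpy as np

def pocket_atoms(protein_coords, ligand_coords, radius=12.0):
    """Indices of protein atoms within `radius` Angstroms of any ligand atom.

    Both inputs are (N, 3) coordinate arrays; a sketch of the binding-site
    delineation step, not the LGA toolchain.
    """
    d = np.linalg.norm(
        protein_coords[:, None, :] - ligand_coords[None, :, :], axis=-1)
    return np.where(d.min(axis=1) <= radius)[0]
```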
Table 1: Performance of binding affinity prediction on the crystal structures of the PDBBind 2016 core set. Top: comparison of the proposed fusion approaches with individual and existing models. R: refined set, G: general set.

Model              R^2    Pearson r   Spearman r   MAE     RMSE
SG-CNN (R)         .424   .666        .647         1.321   1.650
SG-CNN (G)         .519   .747        .746         1.194   1.508
SG-CNN (R + G)     .600   .782        .766         1.084   1.375
3D-CNN (R)         .523   .723        .716         1.164   1.501
3D-CNN (G)         .420   .649        .658         1.294   1.655
3D-CNN (R + G)     .397   .677        .657         1.334   1.688
Late Fusion        .628   .808        .803         1.044   1.326
Mid-level Fusion   .638   .810        .807         1.019   1.308
Pafnucy^a          -      0.78        -            1.13    1.42

^a Stepniewska-Dziubinska et al. (2018)

Table 2: Comparison of the proposed mid-level fusion model with physics-based scoring functions on the crystal structures of the PDBBind 2016 core set. We give the results for the 243 complexes for which it was possible to compute a score across all methods. The correlation coefficients of the Vina and MM/GBSA scoring functions are given as absolute values.

Method             Pearson r   Spearman r   MAE     RMSE
Vina               .599        .605         -       -
MM/GBSA            .647        .649         -       -
Mid-level Fusion   .803        .797         1.035   1.327
The PDBBind 2016 core dataset was used to evaluate the following hypotheses:

- The two CNN models provide complementary information.
- The fusion model learns to integrate the two CNN models and improves prediction over the individual ones.
- The machine learning models retain prediction accuracy when presented with docked poses rather than crystal structures.
- The machine learning models are as accurate as the more computationally costly MM/GBSA re-scoring function.
Prediction Performance on PDBBind-2016 Crystal Structure
Table 1 summarizes the model performance on the crystal structures of the PDBBind 2016 core set. Training on both PDBBind's general and refined data was considered. While training on the larger general dataset could improve performance, it has the drawback of noisier binding affinity measurements and lower-resolution 3D structures (typically larger than 2.5 Å) (Su et al. 2019).
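The four metrics reported in Tables 1-3 can be computed with a NumPy-only helper. This is an illustrative sketch rather than the authors' evaluation code; the simple rank transform ignores ties, which scipy.stats.spearmanr would average.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Pearson r, Spearman r, MAE, and RMSE, as reported in Tables 1-3."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    pearson = np.corrcoef(y_true, y_pred)[0, 1]
    # Spearman = Pearson correlation of the ranks (no tie handling here).
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    spearman = np.corrcoef(rank(y_true), rank(y_pred))[0, 1]
    mae = np.abs(y_true - y_pred).mean()
    rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
    return pearson, spearman, mae, rmse
```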
Table 3: Performance on PDBBind 2016 Core Set - Docking Poses.

Model     Pearson r   Spearman r   MAE
SG-CNN    .656        .625         1.343
Fusion    .685        .668         1.34
Prediction Performance on PDBBind-2016 Docking Poses
Scoring the binding affinity of a crystal complex is useful for separating the scoring task from the ligand pose selection problem. In practice, however, the correct ligand pose will not be known, and the scoring function will evaluate noisier, error-prone docking poses. To address this problem, the machine learning models scored the top 10 Vina poses, and the highest binding affinity is reported for each of the 257 test complexes for which the MM/GBSA re-scoring calculations completed. Waters captured in the crystal structure were removed, since this information can artificially constrain the docking poses and inflate performance. A modest increase in performance was observed when water is retained (results not shown). Prediction performance on docking poses is summarized in Table 3 and compared with the original Vina docking score and the more expensive MM/GBSA re-scoring function. As expected, overall performance decreases relative to scoring crystal structures, which can in part be explained by fewer correct poses to re-score. Using a maximum RMSD threshold of 2 Å between a docked pose and the crystal pose, a correct pose is found among the top 10 Vina poses in only 77% of the cases when evaluating the refined dataset. Nonetheless, the fusion model's Pearson correlation coefficient remains higher than the computationally costly MM/GBSA scores and Vina scores (0.685 versus 0.629 and 0.616), motivating use of the Fusion model over the scoring functions.

Classification performance was evaluated for predicting non-binders (threshold set to pKi/pKd < 5) and binders; classifier results are given in the supplemental material.

The complexes found in the evaluation set came from 41 of the 830 structure-based clusters. These clusters were used to assess prediction performance across the different clusters. (The complete listing of clusters is provided as a supplemental file.) The MAE, shown in Figure 3, exhibits a trend of varying error, exceeding 2 logs in some cases, suggesting more accurate predictions for specific protein clusters.

SMILES strings were constructed for 269 of the 290 compounds referenced in the 2016 core set. 51 compounds were found to occur in both the holdout set and the refined training set. The Tanimoto distance between each test ligand and its most similar ligand in the training set from the same cluster was compared with MAE, but no correlation was found. These results suggest that the models are learning the structurally important features, but other chemical and physical information may be needed. There are six clusters (111, 144, 176, 20, 206 and 401) with at least two complexes within the respective cluster where the difference in error between the two models is at least 1 log, and the SG-CNN does consistently better in each case. Similarly, there are four clusters (244, 58, 59 and 64) where the reverse occurs and the 3D-CNN shows consistently lower error. Figure 4 shows the top 8 compounds with maximum prediction discrepancy
Figure 3: MAE (x-axis) with standard deviation for groups (y-axis) based on the pocket and the ligand positioning. MAE is shown for the machine learning models. The number of complexes in the refined training set is shown for each cluster (gray bars).
in the two models. It is still unclear whether there are important structural differences in these clusters that explain the advantage of one model over the other. The first two examples in Figure 4 highlight compounds that interact with the same pocket type, 58 and 111 for the 3D-CNN and SG-CNN respectively. There are many other clusters where neither model has a clear advantage. Nevertheless, the models clearly exhibit distinct performance profiles. While the Fusion model exhibits better overall performance in more clusters than its constituent models, it is not able to give the lowest MAE in every cluster.

Figure 4: Structures of the 8 compounds with the maximum difference in prediction between the two models. The top 4 cases are shown where error is lower for the 3D-CNN or the SG-CNN. Hydrogens are not shown; images are generated with PyMol (Schrödinger, LLC 2015).
The results show that the two CNN models provide complementary predictions for many test complexes. The current SG-CNN implementation does not explicitly capture bond angles, and we speculate that in some cases where the 3D shape of the molecule is important, the 3D-CNN may have an advantage. On the other hand, the SG-CNN likely benefits from a more explicit representation of pairwise interactions, which leads to fewer parameters to learn. The benefit of using both models is supported by the performance of the Fusion models, which yield improved overall performance compared to the individual models. An area for future improvement could be in exploring activity maps such as those introduced in Hochuli et al. (2018).
The machine learning model prediction error appears to be surprisingly robust when predicting on new ligands in recognized pockets. Moreover, accuracy should continue to improve as the amount of experimental data grows. We conclude that the Fusion model will become a more computationally efficient alternative to the MM/GBSA re-scoring function.
Acknowledgements
None.
Funding
This work was supported by American Heart Association Cooperative Research and Development AgreementTC02274. This work was performed under the auspices of the U.S. Department of Energy by Lawrence LivermoreNational Laboratory under Contract DE-AC52-07NA27344. LLNL-JRNL-804162.
References
Ain, Q. U. et al. (2015). Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening. Wiley Interdiscip. Rev. Comput. Mol. Sci., (6), 405–424.
Ballester, P. J. and Mitchell, J. B. O. (2010). A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics, (9), 1169–1175.
Chéron, G. et al. (2015). P-CNN: Pose-based CNN features for action recognition. In, pages 3218–3226.
Deng, J. et al. (2009). ImageNet: A large-scale hierarchical image database. In, pages 248–255.
Duvenaud, D. K. et al. (2015). Convolutional networks on graphs for learning molecular fingerprints. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2224–2232. Curran Associates, Inc.
Feinberg, E. N. et al. (2018). PotentialNet for molecular property prediction. ACS Cent. Sci., (11), 1520–1530.
Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.
Gilmer, J. et al. (2017). Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.
Gomes, J. et al. (2017). Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603.
He, K. et al. (2016). Deep residual learning for image recognition. In, pages 770–778.
Hochuli, J. et al. (2018). Visualizing convolutional neural network protein-ligand scoring. J. Mol. Graph. Model., 96–108.
Huang, Y. et al. (2019). Human action recognition based on temporal pose CNN and multi-dimensional fusion. In Computer Vision – ECCV 2018 Workshops, pages 426–440. Springer International Publishing.
Jakalian, A. et al. (2002). Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Parameterization and validation. J. Comput. Chem., (16), 1623–1641.
Jiménez, J. et al. (2017). DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, (19), 3036–3042.
Jiménez, J. et al. (2018). KDEEP: Protein–ligand absolute binding affinity prediction via 3D-convolutional neural networks. J. Chem. Inf. Model., (2), 287–296.
Kearnes, S. et al. (2016). Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des., (8), 595–608.
Keedy, D. A. et al. (2009). The other 90% of the protein: assessment beyond the C-alphas for CASP8 template-based and high-accuracy models. Proteins, 77 Suppl 9, 29–49.
Kuzminykh, D. et al. (2018). 3D molecular representations based on the wave transform for convolutional neural networks. Mol. Pharm., (10), 4378–4385.
Li, Y. et al. (2016). Gated graph sequence neural networks. In Y. Bengio and Y. LeCun, editors.
Maier, J. A. et al. (2015). ff14SB: Improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput., (8), 3696–3713.
O'Boyle, N. M. et al. (2011). Open Babel: An open chemical toolbox. J. Cheminform., 33.
Pettersen, E. F. et al. (2004). UCSF Chimera: a visualization system for exploratory research and analysis. J. Comput. Chem., (13), 1605–1612.
Ragoza, M. et al. (2017). Protein–ligand scoring with convolutional neural networks. J. Chem. Inf. Model., (4), 942–957.
Roitberg, A. et al. (2019). Analysis of deep fusion strategies for multi-modal gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Schrödinger, LLC (2015). The PyMOL molecular graphics system, version 1.8.
Stepniewska-Dziubinska, M. M. et al. (2018). Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics, (21), 3666–3674.
Su, M. et al. (2019). Comparative assessment of scoring functions: The CASF-2016 update. J. Chem. Inf. Model., (2), 895–913.
Sun, H. et al. (2014). Assessing the performance of MM/PBSA and MM/GBSA methods. 5. Improved docking performance using high solute dielectric constant MM/GBSA and MM/PBSA rescoring. Phys. Chem. Chem. Phys., (40), 22035–22045.
Trott, O. and Olson, A. J. (2010). AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem., (2), 455–461.
Wallach, I. et al. (2015). AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855.
wwPDB consortium (2019). Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res., (D1), D520–D528.
Yang, F. et al. (2018). A fusion model for road detection based on deep learning and fully connected CRF. In, pages 29–36.
Yoon, S. et al. (2007). Clustering protein environments for function prediction: finding PROSITE motifs in 3D. BMC Bioinformatics, S10.
Zemla, A. (2003). LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res., (13), 3370–3374.
Zhang, H. et al. (2019). DeepBindRG: a deep learning based method for estimating effective protein-ligand affinity. PeerJ, e7362.
Zhang, X. et al. (2013). Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines. J. Comput. Chem., (11), 915–927.
Zhang, X. et al. (2014). Toward fully automated high performance computing drug discovery: a massively parallel virtual screening pipeline for docking and molecular mechanics/generalized born surface area rescoring to improve enrichment. J. Chem. Inf. Model., (1), 324–337.
Improved Protein-ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference: Supplemental Information

Derek Jones [email protected]
Hyojin Kim [email protected]
Xiaohua Zhang [email protected]
Adam Zemla [email protected]
Garrett Stevenson [email protected]
William D. Bennett [email protected]
Dan Kirshner [email protected]
Sergio Wong [email protected]
Felice Lightstone [email protected]
Jonathan Allen [email protected]
May 19, 2020
Spatial Graph Convolutional Network Architecture
The Spatial Graph Convolutional Neural Network (SG-CNN) presented here is composed of a number of smaller neural networks and can be described in terms of a propagation layer, a gather (aggregation) step, and, finally, an output fully-connected network. The propagation layer is the portion of the network where messages are passed between the atoms and used to compute new feature values for each node, with a total of k_i rounds of message passing, where i (wlog) corresponds to the i-th propagation layer.

Using this architecture, we define two distinct propagation layers, for covalent and non-covalent interactions. In both cases, we use a single scalar value for the edge feature: the Euclidean distance (measured in Angstroms, Å) between nodes v_i and v_j. Propagation is performed for k = 2 rounds in both cases, then the attention operation is performed on the final propagation output. For covalent propagation, the input node feature size is 20 and the output size is 16 after the attention operation is applied. For non-covalent propagation, the node features resulting from the covalent attention operation are used as the initial feature set; the output node feature size of this layer is 12. The resulting features of the non-covalent propagation are then "gathered" across the ligand nodes in the graph to produce a "flattened" vector representation by taking a node-wise summation of the features. This graph summary feature is then fed through the output fully-connected network to produce the binding affinity prediction ŷ.

h_i^(0) = x_i for all i   (initialization of node features)

h_i^(k) = GRU(h_i^(k-1), Σ_{j ∈ N^(e)(v_i)} f_e(h_j)) for k ∈ {1, ..., K}   (message passing)

where x_i is the feature vector of node v_i.
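As a toy illustration of the propagation recurrence above, the sketch below runs distance-weighted message passing over a thresholded adjacency matrix. The GRU update is replaced by a tanh mix and `theta` stands in for the learned parameters, so this is a simplified analogue of the layer, not the paper's implementation.

```python
import numpy as np

def softsign(x):
    """Softsign activation, x / (1 + |x|), used as the edge nonlinearity."""
    return x / (1.0 + np.abs(x))

def propagate(h, adj, theta, k=2):
    """Toy linear analogue of one SG-CNN propagation layer.

    h: (N, F) node features; adj: (N, N) thresholded distance matrix;
    theta: (F, F) stand-in for the learned parameters. Each round sums
    distance-weighted neighbor messages (the edge network collapsed to a
    softsign of the distance) and mixes them into the node state; the
    paper uses a GRU where tanh appears here.
    """
    for _ in range(k):
        msg = softsign(adj) @ h          # aggregate neighbor messages
        h = np.tanh(h @ theta + msg)     # simplified state update
    return h
```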
f_e(h_i) = Θ h_i + Σ_{j ∈ N(i)} h_j · f_θ(e_{i,j})   (outer edge network)

where Θ is a learned set of parameters and f_θ is a neural network that computes a new edge feature for the edge e_{i,j} between v_i and v_j.

f_θ(e_{i,j}) = φ(f_{θ2}(φ(f_{θ1}(e_{i,j}))))   (inner edge network)

where f_{θ1} and f_{θ2} are neural networks and φ is the Softsign activation (Turian et al. 2009), φ(x) = x / (1 + |x|).

h^(K) = σ(p([h^(K), h^(0)])) ⊙ q(h^(K))   (attention operation)

where σ is the softmax activation function and q and p are neural networks.

h_gather = Σ_{v ∈ G_ligand} h_v   (gather step)

where G_ligand is the ligand subgraph of the binding complex graph.

ŷ = g_3(ReLU(g_2(ReLU(g_1(h_gather)))))   (output network)

Experiment Setup

The 3D-CNN, SG-CNN, and fusion network models used an Adam optimizer with learning rates of 1 × 10^-, 1 × 10^-, and 1 × 10^-, respectively. The mini-batch sizes are 50, 8, and 100, and the approximate numbers of epochs are 200, 200, and 1000, respectively. We observed that smaller mini-batch sizes drastically improved model performance when training the SG-CNN, while the effect of different mini-batch sizes was negligible for the 3D-CNN and the fusion model.

The 3D-CNN and fusion models were developed using the TensorFlow python library (Abadi et al. 2016); the SG-CNN was developed using PyTorch (Paszke et al. 2019) with PyTorch Geometric (Fey and Lenssen 2019).

Visualizing Model Bias and MAE distribution
In Figure 2, the PDBBind 2016 core set is color coded into 3 groups according to the MAE of the experimental log(ki/kd) (left panel) and the predicted log(ki/kd) (right panel): predictions that fall below one standard deviation of the mean MAE value for a given model (red), between 1-2 standard deviations (green), and exceeding 2 standard deviations (blue). Figure 2 shows that the models predict over the full range of binding affinity values; however, predictions are biased toward the mean. As the error increases (higher values on the y-axis), the left panel shows that high-error predictions have trouble predicting the tails of the distribution. The right panel shows that the models predict closer to the mean in cases of high error. The black vertical line gives the location of the ground truth mean and the purple vertical line gives the predicted mean.

Figure 2: Scatter plots for each scoring method and the experimental binding affinity measurement.

Relationship between scoring function output with experimental measurement.
Figure 2 gives scatter plots of the scores from the Mid-level Fusion, Late Fusion, MM/GBSA, and Vina scoring methods versus the experimental binding affinity for the 242 complexes of the 2016 core set for which a score across all methods was possible to obtain. Both Fusion methods show a significant improvement over the physics-based scoring methods (in terms of Pearson correlation coefficient) with the experimental log(ki/kd).

Figure 3: MAE (y-axis) with standard deviation for 57 functional groups (x-axis) defined by CASF-2016. MAE is shown for the machine learning models.

Performance on protein targets (CASF-2016)
To consider the prediction performance of the different models, mean absolute error (MAE) is shown for the 57 functional categories defined by the Comparative Assessment of Scoring Functions (CASF-2016) complexes, which consist of 285 PDBBind core complexes with 5 complexes per category. The results are shown in Figure 3, with the categories sorted on the x-axis by increasing Mid-fusion MAE. The figure shows significantly different Mid-fusion MAE between groups, ranging from approximately 0.5 to over 2 log units. The protein categories are limited by manual curation and were not reported for the complete collection of PDBBind complexes, making it difficult to assess overlap between binding pockets found in the training data and the test set. To address this, a binding pocket oriented, structure-based clustering scheme was applied to the complete collection of PDBBind complexes.

Table 1: Comparison of classifier performance on PDBBind 2016 Core Set - Docking Poses. SD = standard deviation.

Model              Bind ROC AUC   SD     No-bind ROC AUC   SD
SG-CNN             .784           .066   .829              .063
3D-CNN             .747           .081   .774              .063
Vina               .788           .071   .848              .052
MM/GBSA            .828           .064   .833              .057
Late Fusion        .82            .065   .859              .054
Mid-level Fusion   .806           .07    .853              .055
Classification of binders, performance comparison between fusion models and physics-based scoring
For the bind detection screening task, the results are summarized as ROC AUC and reflect randomly partitioning the PDBBind 2016 core set into 5 non-overlapping folds, computing the ROC AUC on each fold, repeating this procedure 100 times, and taking the average. The datasets for both tasks maintain a similar class imbalance of 25% positives and 75% negatives. Table 1 shows that MM/GBSA has slightly better performance than the Fusion model, and both methods show a small improvement (0.04) over Vina. For the no-bind task, while the Fusion model had the highest ROC AUC, the margin of difference was negligible compared to Vina (0.011).

Figure 4: Histogram counting the minimum Tanimoto distance between each compound in the test set and its nearest match in the PDBBind 2016 refined set.
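The repeated five-fold ROC AUC protocol described above can be sketched as follows; the rank-sum AUC computation and the random partitioning are generic, not the authors' code.

```python
import random

def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) identity."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_fold_auc(labels, scores, n_folds=5, n_repeats=100, seed=0):
    """Average ROC AUC over repeated random non-overlapping fold partitions."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    aucs = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        for f in range(n_folds):
            fold = idx[f::n_folds]                  # strided non-overlapping fold
            fl = [labels[i] for i in fold]
            fs = [scores[i] for i in fold]
            if 0 < sum(fl) < len(fl):               # AUC needs both classes
                aucs.append(roc_auc(fl, fs))
    return sum(aucs) / len(aucs)
```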
Structure similarity between refined and core sets of PDBBind 2016

In order to gain an understanding of how "similar" our training set (the refined set) was to our testing set (the core set), we consider structural similarity between the ligands in the refined and core sets as measured by the Tanimoto distance metric. Figure 4 illustrates the distribution of the Tanimoto distance between each ligand in the core set and its nearest neighbor in the refined set.
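The nearest-neighbor Tanimoto distance computation above can be sketched directly on fingerprints represented as sets of on-bits; in practice the fingerprints would come from a cheminformatics toolkit, but plain sets keep the sketch self-contained.

```python
def tanimoto_distance(fp_a, fp_b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return 1.0 - (inter / union if union else 1.0)

def nearest_in_training(test_fp, train_fps):
    """Minimum Tanimoto distance from a test ligand to the training set,
    as plotted in the histogram of Figure 4."""
    return min(tanimoto_distance(test_fp, fp) for fp in train_fps)
```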
Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

References

Abadi, M. et al. (2016). Tensorflow: A system for large-scale machine learning. In
Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 265–283, USA. USENIX Association.
Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric.
Paszke, A. et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Turian, J. et al. (2009). Quadratic features and deep architectures for chunking. In