DeepAtom: A Framework for Protein-Ligand Binding Affinity Prediction
Yanjun Li, Mohammad A. Rezaei, Chenglong Li, Xiaolin Li, and Dapeng Wu
NSF Center for Big Learning, University of Florida
Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), University of Florida
Department of Chemistry, University of Florida
Gainesville, FL, USA
yanjun.li@ufl.edu, [email protected]fl.edu, [email protected]fl.edu, [email protected]fl.edu, dpwu@ufl.edu
† These authors contributed equally. * To whom correspondence should be addressed.
Abstract—The cornerstone of computational drug design is the calculation of binding affinity between two biological counterparts, especially a chemical compound, i.e., a ligand, and a protein. Predicting the strength of protein-ligand binding with reasonable accuracy is critical for drug discovery. In this paper, we propose a data-driven framework named DeepAtom to accurately predict protein-ligand binding affinity. With a 3D Convolutional Neural Network (3D-CNN) architecture, DeepAtom automatically extracts binding-related atomic interaction patterns from the voxelized complex structure. Compared with other CNN-based approaches, our light-weight model design effectively improves the model's representational capacity, even with the limited available training data. In validation experiments on the PDBbind v.2016 benchmark and the independent Astex Diverse Set, we demonstrate that the less feature-engineering-dependent DeepAtom approach consistently outperforms the other state-of-the-art scoring methods. We also compile and propose a new benchmark dataset to further improve model performance; with the new dataset as training input, DeepAtom achieves its best Pearson's R and RMSE (in pK units) on the PDBbind v.2016 core set. These promising results demonstrate that DeepAtom models can potentially be adopted in computational drug development protocols such as molecular docking and virtual screening.

Index Terms—binding affinity prediction, deep learning, efficient 3D-CNN, benchmarking
I. INTRODUCTION
Binding of a chemical molecule to a protein may start a biological process; examples include the activation or inhibition of an enzyme's activity, and a drug molecule affecting its target protein. Binding is quantified by how strongly the chemical compound, a.k.a. the ligand, binds to its counterpart protein; this quantity is called binding affinity. Simulation of biological processes and computational drug design rely heavily on calculating this binding strength. For example, Virtual Screening (VS) tries to find the best chemical compounds to regulate the function of a protein; in drug discovery this is often a disease-related target protein. VS achieves this goal by assigning a score to each binding ligand, indicating how strongly it binds to the target protein. For an overall picture of how binding affinity prediction enables VS, the reader is referred to [3], [5], [35].
Current approaches for quantifying the binding affinity can be categorized as physics-based, empirical, knowledge-based and descriptor-based scoring functions [19]. In spite of their merits, the conventional techniques assume a predetermined functional form which is additive. Furthermore, they need domain knowledge to extract features and formulate the scoring functions. For example, the semi-empirical force field in AutoDock [23] and the empirical scoring function in X-Score [34] belong to this category. For instance, X-Score takes the average of three scoring functions, HPScore, HMScore, and HSScore, which differ by the terms describing the hydrophobic interactions; each of these scoring functions is a linear combination of its terms [33].

Only in the past decade have machine learning algorithms been used to score protein-ligand binding strength in a data-driven manner. Ballester and coworkers gathered data about the frequency of occurrence of each possible protein-ligand atom pair up to a distance threshold. By binning these counts, they trained their Random Forest model implicitly on short-range, middle-range, and long-range interactions. The resulting RF-Score model is still among the top performers in this field [2]. However, such models also rely heavily on biological feature engineering to extract descriptors or fingerprints; they are still based on expert knowledge and therefore biased.

Deep learning models, which belong to the descriptor-based category, aim to minimize the bias due to expert knowledge. To describe the interactions between a protein and its ligand, atom-centered and grid-based methods are the most widely used techniques. Schietgat et al. developed a 3D neighborhood kernel, which takes the spatial distances between atoms into consideration, to describe the structure of proteins and ligands [26]. More recently, Gomes and coworkers [6] represented the structures of proteins and ligands as a combination of neighbor lists and atom types, to later be used in a deep network; in their case, they described the whole protein and ligand structures using their atom-centered scheme.

By contrast, the grid-based approach usually limits the representation of protein and ligand interactions to a grid box defined around the protein's binding site, with different atom information encoded in different channels of the 3D grid. Wallach et al. [31] and Ragoza et al. [24] developed CNN scoring functions to classify compound poses as binders or non-binders. Jiménez et al. [14] and Stepniewska-Dziubinska et al. [29] designed similar deep learning approaches to predict the binding affinity, based on rasterized protein-ligand structure input.

In addition to the modeling approach, data reliability is a major issue for binding affinity prediction. Although a few thousand labeled protein-ligand complexes are available, their binding affinities are reported as different experimental measures, including Kd, Ka, Ki, and IC50, in decreasing order of reliability for the purpose of binding affinity prediction. If we indiscriminately feed all the data to a machine learning model, we potentially introduce label noise or even labels that differ from the ground truth, and the model may then suffer from this inaccurate supervision. Our goal in this paper is twofold.
First, we aim to develop an end-to-end solution for accurate binding affinity prediction which 1) takes as input the 3D structural data of the protein and ligand, 2) requires minimal feature engineering, and 3) achieves state-of-the-art prediction performance. Second, we aim to systematically analyze the publicly available protein-ligand binding affinity data and propose a new benchmark for more reliable model learning.

More recently, deep learning has exhibited powerful superiority across a broad range of bioinformatics tasks, such as predicting the impact of gene mutations [30], protein folding [17] and drug discovery [27]. By stacking many well-designed neural network layers, it is capable of extracting useful features from raw data and approximating highly complex functions [16]. Many advanced deep learning algorithms are built on convolutional neural networks (CNNs). The impressive performance of CNNs stems mainly from their ability to exploit spatially-local correlation in the data. Protein-ligand 3D structures naturally share this characteristic: biochemical interactions between atoms occur locally. CNNs can therefore hierarchically compose such local spatial interactions into abstract, high-dimensional global features contributing to the binding score.

In this paper, we propose the framework DeepAtom to accurately predict protein-ligand binding affinity. A protein-ligand complex is first rasterized into a 3D grid box centered on the ligand center. Each voxel has several input channels, embedding the raw information of the atoms located around that voxel. A light-weight 3D-CNN model is then developed to hierarchically extract useful atom interaction features, supervised by the binding affinity score. As a data-driven approach, it effectively avoids a priori functional form assumptions. More importantly, our efficient architecture design significantly improves the model's representational and generalization capacity, even when trained with the limited available complex data.

We present comprehensive experiments on the standard benchmark test set, the PDBbind v.2016 core set [32], and an additional test set, the Astex Diverse Set [7]. Randomly initialized and evaluated 5 times, DeepAtom consistently outperforms all state-of-the-art published models. To further improve model performance, we also critically study the publicly available complex data and propose a new benchmark dataset. It further improves DeepAtom's performance, with potential benefits for future research in the binding affinity scoring field.

Fig. 1: Local box featurization (3D data representation). The grid box encompasses the area around the binding site, centered on the ligand. Each channel includes only a specific feature; e.g., from left to right, the three channels shown are the excluded volume channel for the ligand and the hydrophobic and aromatic channels for the protein. Each sample is described by 24 channels in total.

II. MATERIALS AND METHODS
A. Input Featurization and Processing
1) Protein-Ligand Complex:
The standard datasets, such as PDBbind and Binding MOAD, include the structures of protein and ligand in their bound form, a.k.a. their complex, deposited in a single PDB file. The strength of ligand binding to the protein has been experimentally determined for each structure; this binding affinity data is used as the ground-truth label for each protein-ligand complex.
2) Grid Size & Resolution:
We calculate the distribution of end-to-end distances for all ligands in the PDBbind v.2016 refined and core datasets. This guides us to set the box size equal to the end-to-end distance of the longest ligand in these two datasets, so that no ligand needs to be filtered out. The distribution of ligand lengths in these two subsets is illustrated in Fig. 3a.

The van der Waals radius of each of the 9 major heavy atoms (C, N, O, P, S, F, Cl, Br, I) used in our study is greater than 1.4 Å. In a simplified view, an atom's r_vdw can be taken as a measure of its size; it is defined as half of the internuclear separation of two non-bonded atoms of the same element at their closest possible approach. A grid resolution larger than 2 × r_vdw therefore cannot differentiate two such atoms from each other, while a finer resolution brings a much higher computational cost. As a trade-off between accuracy and efficiency, we set the grid resolution to 1.0 Å.
3) Features / Atom Types:
The 11 Arpeggio atom types are based on the potential interactions each atom may be involved in [15]; they include features such as hydrogen-bond acceptor and donor, positive or negative, hydrophobic, and aromatic atom types. The protein and ligand atoms are described by the 11 Arpeggio atom types plus an excluded volume feature, with a distinction made between protein and ligand atoms. This results in (11 + 1) × 2 = 24 features.
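To make the channel layout concrete, a minimal sketch of one possible indexing scheme is shown below; the specific Arpeggio type names and their order are illustrative assumptions, not the exact labels used in DeepAtom:

```python
# Hypothetical feature names; only the counts (11 + 1 per molecule) follow the paper.
ARPEGGIO_TYPES = [
    "hbond_acceptor", "hbond_donor", "weak_hbond_acceptor", "weak_hbond_donor",
    "positive", "negative", "hydrophobic", "aromatic",
    "halogen_acceptor", "halogen_donor", "carbonyl",
]  # 11 interaction-based atom types
FEATURES = ARPEGGIO_TYPES + ["excluded_volume"]  # 12 features per molecule

def channel_index(feature: str, is_ligand: bool) -> int:
    """Map a (feature, molecule) pair to one of the 24 grid channels:
    channels 0-11 describe protein atoms, channels 12-23 describe ligand atoms."""
    offset = len(FEATURES) if is_ligand else 0
    return offset + FEATURES.index(feature)

assert channel_index("aromatic", is_ligand=True) == 19
```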
4) Occupancy Type:
This hyper-parameter defines how each atom impacts its surrounding environment. In our work, each atom can affect its neighboring voxels up to double its van der Waals radius r_vdw through a pair correlation potential. We use the atom-to-voxel PCMax algorithm described in [13], where each atom makes a continuous contribution n(r) to its neighboring voxels as defined by Eq. 1, with r the distance to the voxel center; at each voxel, only the maximum effect among the contributing atoms is kept.

n(r) = 1 − exp(−(r_vdw / r)^12)   (1)
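This computation can be sketched in a few lines of NumPy; the following is a minimal interpretation of the atom-to-voxel PCMax scheme under the definitions above, not the exact implementation:

```python
import numpy as np

def pcmax_occupancy(grid_shape, voxel_centers, atom_coords, atom_rvdw):
    """Each atom contributes n(r) = 1 - exp(-(r_vdw / r)**12) (Eq. 1) to the
    voxels within 2 * r_vdw of it; every voxel keeps only the maximum
    contribution among all atoms (PCMax)."""
    occupancy = np.zeros(int(np.prod(grid_shape)), dtype=np.float32)
    for xyz, rvdw in zip(atom_coords, atom_rvdw):
        r = np.linalg.norm(voxel_centers - xyz, axis=1)   # distance to every voxel center
        near = (r > 1e-6) & (r <= 2.0 * rvdw)             # the atom's sphere of influence
        n = 1.0 - np.exp(-((rvdw / r[near]) ** 12))       # pair correlation potential
        occupancy[near] = np.maximum(occupancy[near], n)  # keep the strongest effect
    return occupancy.reshape(grid_shape)
```

Here `voxel_centers` is an (n_voxels, 3) array of coordinates whose row order matches the flattened grid.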
B. Network Architecture

To extract atomic interaction information from the voxelized protein-ligand complex data, a straightforward approach is to extend 2D-CNNs to 3D by using 3D convolution kernels. One channel of the output feature map at location (i, j, k) is computed by the standard convolution as follows:

Conv(W, h)_(i,j,k) = Σ_{s,t,r,m}^{S,T,R,M} W_(s,t,r,m) · h_(i+s, j+t, k+r, m)   (2)

where h represents the input with M channels, W ∈ R^(S×T×R×M) represents one filter's weights, and S, T, R are the side lengths of the filter. However, 3D convolution massively inflates the number of trainable network parameters, due to the increase in the input and kernel dimensions. Specifically, if N is the number of output channels, one standard convolution layer introduces S·T·R·M·N parameters. More importantly, the general trend for improving learning ability and prediction accuracy has been to make models deeper and more complicated [28], [8], [12]. By contrast, for the affinity prediction problem, only a few thousand protein-ligand complexes with experimentally determined binding affinities are available. This discourages network architectures with too many trainable parameters, because overfitting may occur when a high-complexity network is trained on a relatively small dataset. Indeed, another 3D CNN-based affinity prediction work [24] also encountered the overfitting issue: after empirically optimizing the model depth and width, the authors ultimately reduced their network to only three convolutional layers. Similarly, Pafnucy [29] is a 3D CNN model with three convolutional and three dense layers.

Our model is inspired by light-weight network architectures, and aims to achieve the best trade-off between prediction accuracy and model complexity in terms of learnable parameters. A series of such network structures have recently been proposed, including Xception [4], MobileNet v1 [9] and v2 [25], ShuffleNet v1 [36] and v2 [20], and CondenseNet [11]. Based on the practical guidelines for efficient CNN architecture design and the corresponding ShuffleNet units described in [20], we propose a novel light-weight 3D CNN model whose deeper layers can be trained effectively with the limited training samples. It improves prediction performance by a large margin without significantly increasing model complexity.

Specifically, as shown in Fig. 2, our model consists of three building blocks: an atom information integration block, a stacked feature extraction block, and a global affinity regression block.

In the atom information integration block, a pointwise (PW, 1×1×1) convolution layer with a non-linear activation function, defined in Eq. 3, is first utilized to fuse the atom information across different channels:

PWConv(W, h)_(i,j,k) = Σ_{m}^{M} W_m · h_(i,j,k,m)   (3)

This cascaded cross-channel parametric pooling structure brings an improvement over the empirical scoring functions; for instance, the AutoDock scoring function is composed of a linear combination of interaction types, such as hydrogen bonding and electrostatic interactions [23]. The pointwise convolution layer in our model is followed by a 3D max pooling layer, which increases the translational invariance of the network and reduces the input dimension; the output of this block is a spatially down-sampled 3D grid.

The feature extraction block consists of multiple consecutive 3D shuffle units, which are categorized into three groups according to the number of channels in their outputs. At the beginning of each unit, a channel split operator equally splits the input feature channels into two branches. One branch is sequentially processed by a pointwise convolution, a 3×3×3 depthwise (DW) convolution, and an additional pointwise convolution; all three layers have the same number of input and output channels N. A depthwise convolutional layer performs the spatial convolution independently over every channel of the input:

DWConv(W, h)_(i,j,k) = Σ_{s,t,r}^{S,T,R} W_(s,t,r) ⊙ h_(i+s, j+t, k+r)   (4)

where ⊙ denotes the element-wise product. Although the depthwise convolution does not combine different input channels, the two neighboring pointwise convolutions effectively fuse the information across channels. The other branch is kept as an identity until it is concatenated with the output of the first branch. This identity branch can be regarded as an effective feature reuse design, which strengthens feature propagation and reduces the number of parameters. Within a basic unit, the depthwise and pointwise convolutions respectively introduce S·T·R·N and N·N parameters. Therefore, using a basic unit in place of the standard convolution, we obtain the following parameter reduction:

(S·T·R·N + 2·N·N) / (S·T·R·N·N) = 1/N + 2/(S·T·R)   (5)

DeepAtom uses 3×3×3 depthwise convolutions, with the number of channels growing from group to group. With this efficient model design, we easily obtain a more than 20-fold parameter reduction, which enables us to stack deeper layers to improve the model's learning capacity.

Fig. 2: Network architecture. Each Conv layer is specified by its number of channels, kernel size and stride. The 3D MaxPool layer has kernel size 3 and stride 2. For the 3D Shuffle Groups, the numbers in parentheses denote the number of output channels and the repeat count of the unit. Only the first unit in each group has a down-sampling layer, where the DWConv layer has kernel size 3 and stride 2. In the remaining units, DWConv with kernel size 3 and stride 1, and PWConv with kernel size 1 and stride 1 are utilized. Eight losses are calculated based on the shared-weights FC layer output. Two dropout layers are appended before the last 3D Pointwise Conv and FC layers, respectively.
At the end of each unit, a channel shuffle operation is applied to enable information flow across the two branches. In particular, the channel shuffle operation first divides the feature map in each branch into several subgroups, then mixes subgroups between the branches. When spatial down-sampling is applied, the channel split operator in the shuffle unit is removed and the number of output channels is doubled. In each group, only the first shuffle unit has a down-sampling layer; the remaining units keep the input dimension.
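For illustration, a basic (non-down-sampling) 3D shuffle unit of this kind can be sketched in PyTorch as follows; the exact placement of batch normalization and activations inside DeepAtom is an assumption here:

```python
import torch
import torch.nn as nn

def channel_shuffle3d(x, groups=2):
    """Mix channel subgroups across the two branches (ShuffleNet-style)."""
    b, c, d, h, w = x.shape
    return x.view(b, groups, c // groups, d, h, w).transpose(1, 2).reshape(b, c, d, h, w)

class ShuffleUnit3D(nn.Module):
    """Basic unit: channel split -> PW conv -> 3x3x3 DW conv -> PW conv ->
    concat with the identity branch -> channel shuffle."""
    def __init__(self, channels):
        super().__init__()
        n = channels // 2  # each branch carries half the channels
        self.branch = nn.Sequential(
            nn.Conv3d(n, n, kernel_size=1, bias=False),  # pointwise
            nn.BatchNorm3d(n), nn.LeakyReLU(inplace=True),
            nn.Conv3d(n, n, kernel_size=3, padding=1, groups=n, bias=False),  # depthwise
            nn.BatchNorm3d(n),
            nn.Conv3d(n, n, kernel_size=1, bias=False),  # pointwise
            nn.BatchNorm3d(n), nn.LeakyReLU(inplace=True),
        )

    def forward(self, x):
        identity, transformed = x.chunk(2, dim=1)  # channel split
        out = torch.cat([identity, self.branch(transformed)], dim=1)
        return channel_shuffle3d(out)
```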
After stacking the three shuffle groups, the original 3D input data is down-sampled to a 4D tensor with 3 grid cells along each of the x, y and z axes plus a channel dimension. The global affinity regression block first slices this tensor into per-position vectors. Thanks to the preceding shuffle groups, the receptive field of each vector covers the entire raw volume, so we set up an affinity prediction task for each vector: a shared-weights fully connected (FC) layer consumes each vector to construct a regression loss, which trains the top layers more thoroughly and further avoids overfitting. In the testing phase, the multiple hidden vectors are averaged right before the FC layer to stabilize the prediction.

In the architecture, we adopt the leaky rectified linear unit as the activation function. A batch normalization layer is appended after each convolution operation to speed up training. The mean squared error is used as the affinity regression loss for model learning.
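A sketch of this multi-head regression block is given below. Since the FC layer is linear, averaging the hidden vectors before it (as done at test time) is equivalent to averaging the per-position predictions; dropout and layer names are omitted for brevity:

```python
import torch
import torch.nn as nn

class AffinityHead(nn.Module):
    """A shared-weights FC layer scores the feature vector at every spatial
    position of the final 4D feature map (one regression loss per position
    during training)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, 1)  # weights shared across positions

    def forward(self, feat):                       # feat: (B, C, D, H, W)
        vectors = feat.flatten(2).transpose(1, 2)  # (B, D*H*W, C), one vector per position
        return self.fc(vectors).squeeze(-1)        # (B, D*H*W) per-position predictions

# training: total loss = sum of MSE(scores[:, p], target) over positions p
# testing:  prediction = scores.mean(dim=1), equivalent to averaging before the FC
```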
1) Training:
The model is updated by the Adam algorithm with its default momentum parameters (β1 = 0.9, β2 = 0.999). Training the model from scratch, we set the initial learning rate to 0.001 and apply a small weight decay. Our model is implemented in PyTorch and trained in mini-batches on Nvidia GPU cards.
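A minimal training-setup sketch consistent with this description is shown below; `DeepAtom()` and `loader` are hypothetical placeholders, and the weight-decay value is a stand-in, since the exact figure is not reproduced here:

```python
import torch

model = DeepAtom()                      # hypothetical model class
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                            # initial learning rate from the paper
    betas=(0.9, 0.999),                 # Adam's default momentum parameters
    weight_decay=1e-4,                  # placeholder value
)
criterion = torch.nn.MSELoss()          # affinity regression loss

for grids, affinities in loader:        # hypothetical DataLoader of voxelized complexes
    optimizer.zero_grad()
    loss = criterion(model(grids), affinities)
    loss.backward()
    optimizer.step()
```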
2) Data Augmentation:
The publicly available biological datasets contain only a few thousand complexes with reliable experimental binding affinity values, and directly training on such a small sample easily makes a deep learning model overfit. Data augmentation has proven to be an effective way to enhance deep learning model performance. In our experiments, each of the original samples is randomly translated and rotated; enabling such transformations significantly improves training and model capacity. To reduce variance, the predictions over the augmented samples of each protein-ligand complex are averaged during the prediction phase.
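The augmentation step can be sketched as a random rigid transform of the raw atom coordinates before voxelization; rotating about the ligand center and the 1 Å translation bound are assumptions made for illustration:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def augment_complex(coords, ligand_center, max_shift=1.0, seed=None):
    """Randomly rotate the complex about the ligand center and apply a small
    random translation; `max_shift` (in Angstroms) is an assumed bound."""
    rng = np.random.default_rng(seed)
    rotation = Rotation.random(random_state=rng).as_matrix()  # uniform random 3D rotation
    shift = rng.uniform(-max_shift, max_shift, size=3)        # random translation vector
    return (coords - ligand_center) @ rotation.T + ligand_center + shift
```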
C. Dataset Preparation
1) PDBbind Dataset:
PDBbind is the standard dataset for developing models that predict the binding affinity of protein-ligand complexes [32]; it has three subsets, namely core, refined, and general. The general subset includes complexes of relatively lower quality; it contains lower-resolution structures, and the experimental affinities of some structures are reported only as IC50 values. The refined dataset is a higher-quality subset of the general dataset; it includes complex structures with better resolution and excludes any complex with IC50 data only. In total, the PDBbind v.2016 refined dataset includes 4057 complexes. The core dataset is a subset of the refined data, clustered with a 90% sequence-similarity threshold; five representative complexes are picked from each protein family cluster in order to better cover the affinity range. This results in 290 complexes in the core subset, which serves as the standard test set to evaluate scoring approaches. We further split the remaining 3767 non-overlapping complexes between the refined and core sets into two disjoint subsets: (i) 10% of the complexes (377) are randomly selected as the validation set; (ii) the rest (3390 complexes) are used for training, which we name "training set-1".
2) Proposed Benchmark Training Set:
In order to compile an improved benchmark dataset, we use PDBbind data as well as a complementary source of protein-ligand binding affinity data, namely Binding MOAD [1], [10]. To incorporate recently updated complexes, we start from a newer PDBbind release and extract all complexes with either Kd, Ka, or Ki data from the general and refined subsets. It is worth noting that we exclude the complexes shared with the core subset to prevent data leakage. We follow the same steps for the Binding MOAD data. A few filtration steps are also necessary: first, if a complex has reported Kd/Ka data in one database and Ki in the other, we keep the Kd/Ka data only; second, complexes with a peptide as their ligand are discarded. We do not filter the complexes based on their structure resolution, nor do we cluster them by protein sequence or structure; clustering is typically done to later reduce a dataset to representative samples, and the limited availability of experimental affinity data discourages further removal of samples, although the dataset is biased towards some structures, e.g., congeneric series.

In total, the final benchmark dataset contains 10,383 complexes. Note that, in contrast to NMR structures, which contain multiple 3D models, a PDB file from X-ray crystallography contains a single 3D structure only. Our proposed benchmark dataset consists almost exclusively of X-ray structures with a single structure per PDB file; merely 63 complexes come from NMR experiments, and for these we keep only the first model.

Compared with the refined subset of PDBbind, this dataset almost doubles the number of samples with Kd/Ka/Ki data and is expected to improve the performance of binding affinity scoring techniques. The full list of the proposed benchmark training dataset is provided in the Supplementary Table, with the pKd/pKa/pKi value of each complex reported to make the dataset easier for other researchers to use. The binding scores of the complexes span a wide range in pK units; the score distribution is shown in Supplementary Fig. S1.

To train the scoring approaches, we split the proposed benchmark dataset into two subsets: (i) validation samples are randomly selected from the non-overlapping complexes between the PDBbind v.2016 refined and core sets; (ii) the remaining complexes are used for training, named "training set-2".
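The selection rules can be summarized in a short pandas sketch; the file and column names (and the core-set ID list) are hypothetical, not the authors' actual pipeline:

```python
import pandas as pd

# Hypothetical inputs: per-complex tables with pdb_id, measure (Kd/Ka/Ki/IC50),
# affinity, and an is_peptide_ligand flag; core_set_ids holds the core-set PDB IDs.
pdbbind = pd.read_csv("pdbbind_general_refined.csv")
moad = pd.read_csv("binding_moad.csv")
core_set_ids = set(pd.read_csv("pdbbind_core.csv")["pdb_id"])

data = pd.concat([pdbbind, moad], ignore_index=True)
data = data[data["measure"].isin(["Kd", "Ka", "Ki"])]   # keep Kd/Ka/Ki only
data = data[~data["pdb_id"].isin(core_set_ids)]         # no overlap with the core set
data = data[~data["is_peptide_ligand"]]                 # discard peptide ligands

# If a complex has Kd/Ka in one database and Ki in the other, keep the Kd/Ka entry.
data["rank"] = data["measure"].map({"Kd": 0, "Ka": 0, "Ki": 1})
data = data.sort_values("rank").drop_duplicates("pdb_id", keep="first")
```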
3) Astex Diverse Set:
This dataset was developed in 2007. It includes 85 protein-ligand complexes selected to be of specific interest to the pharmaceutical and agrochemical industries [7]. Among these 85 complexes, 64 include binding affinity data.
D. Other Methods for Comparison
In Section III, we compare DeepAtom with three state-of-the-art, open-source scoring approaches: the Pafnucy model [29], RF-Score [2], and X-Score [34]. For Pafnucy and RF-Score, we use the open-source code provided by the authors, with their suggested hyper-parameters, to re-train the models on the same datasets as DeepAtom. For X-Score, we take the results from [14], where the authors used the publicly available binaries to make predictions on the same PDBbind v.2016 core set.

Supplementary materials: https://github.com/YanjunLi-CS/DeepAtom

TABLE I: Results on the PDBbind v.2016 core set with "training set-1". In each table cell, the mean over five runs is reported, with the standard deviation in parentheses.

Method     RMSE            MAE             SD              R
DeepAtom
RF-Score   1.403 (0.002)   1.134 (0.003)   1.293 (0.002)   0.803 (0.001)
Pafnucy    1.553 (0.031)   1.261 (0.027)   1.521 (0.037)   0.722 (0.017)

III. RESULTS & DISCUSSION
In this section, we describe the training and benchmarking protein-ligand complex data for DeepAtom. The evaluation details are presented along with a discussion of the results.
A. Evaluation Metrics
To comprehensively evaluate model performance, we use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) to measure the prediction error, and the Pearson correlation coefficient (R) and the standard deviation in regression (SD) to measure the linear correlation between predictions and experimental values. The SD is defined in Eq. 6:

SD = sqrt( Σ_{i}^{n} [y_i − (a + b·x_i)]² / (n − 1) )   (6)

where x_i and y_i respectively represent the predicted and experimental values of the i-th complex, and a and b are the intercept and the slope of the regression line, respectively.
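These four metrics follow directly from their definitions; a straightforward NumPy rendering is:

```python
import numpy as np

def evaluate(pred, target):
    """Compute the four reported metrics; pred/target are 1D arrays in pK units."""
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    mae = np.mean(np.abs(pred - target))
    r = np.corrcoef(pred, target)[0, 1]     # Pearson correlation coefficient
    b, a = np.polyfit(pred, target, deg=1)  # regress experimental on predicted values
    sd = np.sqrt(np.sum((target - (a + b * pred)) ** 2) / (len(pred) - 1))
    return {"RMSE": rmse, "MAE": mae, "SD": sd, "R": r}
```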
B. Model Comparison with "Training Set-1"

We first train DeepAtom, RF-Score and Pafnucy on "training set-1" with the 3390 complexes described in Section II-C1, and evaluate them on the PDBbind v.2016 core set, which is unseen by the models during training and validation. Each approach is randomly initialized and evaluated five times. The mean and standard deviation (in parentheses) of the four evaluation metrics are presented in Table I for testing, and in Supplementary Table S1 for validation. Learning from these very limited samples, DeepAtom outperforms the similar 3D CNN-based Pafnucy by a large margin, which demonstrates that our light-weight architecture design enables effective training with deep layers and significantly improves learning and generalization capacity. At the same time, DeepAtom achieves performance comparable to the conventional machine learning method RF-Score, even though, as a practical guideline, training a supervised deep learning model generally requires larger datasets. This suggests that our model has greater potential to provide more accurate predictions, given enough training data.

Fig. 3: (a) Ligand length distribution in the PDBbind v.2016 refined and core sets. (b) Binding data distribution of the training set in our proposed benchmark. (c) Binding data distribution of the core set. (d) DeepAtom prediction results for the core set. (e) The distribution of MAE between DeepAtom predictions and targets for complexes in different binding ranges. (f) DeepAtom prediction results for the Astex Diverse Set.

C. Model Comparison with "Training Set-2"
Next, we use our proposed "training set-2" to re-train the DeepAtom, RF-Score and Pafnucy models, and again evaluate them on the PDBbind v.2016 core set. As before, five different runs are conducted for each scoring method to stabilize the results. Fig. 4 shows the comparison in terms of mean R and RMSE, including the X-Score prediction results, and Table II presents the detailed results of the top three methods. As expected, DeepAtom outperforms all the other approaches across all four measurements by a large margin, obtaining the best Pearson correlation coefficient and RMSE in pK units, compared with RF-Score (R = 0.801, RMSE = 1.419) and Pafnucy (R = 0.761, RMSE = 1.443). To the best of our knowledge, DeepAtom achieves the state-of-the-art performance on this well-known benchmark. The corresponding validation results are shown in Supplementary Table S2.

TABLE II: Results on the PDBbind v.2016 core set with "training set-2".

Method     RMSE            MAE             SD              R
DeepAtom
RF-Score   1.419 (0.002)   1.124 (0.001)   1.304 (0.002)   0.801 (0.000)
Pafnucy    1.443 (0.021)   1.164 (0.019)   1.424 (0.022)   0.761 (0.008)

Fig. 3d shows the correlation between the predictions of one DeepAtom model and the experimental binding affinity data: DeepAtom gives highly correlated predictions on the PDBbind v.2016 core set. To further investigate model performance on different ranges of the binding data, we visualize the binding affinity distribution of the training set in the proposed benchmark (Fig. 3b), of the PDBbind v.2016 core set (Fig. 3c), and the corresponding DeepAtom MAE within each pK unit range (Fig. 3e). From Figs. 3b and 3c, we observe that the binding scores of our training samples are concentrated in the middle of the pK range, highly similar to the core set. Fig. 3e shows that DeepAtom obtains better predictions, with lower MAE values, in this middle range than for the less frequent binding scores. For a data-driven approach, the distribution of the training data plays a crucial role in performance: because the number of training samples in the middle range is much larger than the number with marginal affinity values, DeepAtom naturally performs better for middle-range complexes at testing time. This also suggests that diversifying the training samples is a promising way for DeepAtom to provide more accurate and reliable predictions.

We also compare DeepAtom with RF-Score and Pafnucy on the independent Astex Diverse Set. Table III shows that DeepAtom again significantly outperforms the others on all measurements. The predictions of one DeepAtom model are illustrated in Fig. 3f.

TABLE III: Results on the Astex Diverse Set with "training set-2".

Method     RMSE            MAE             SD              R
DeepAtom
RF-Score   1.144 (0.006)   0.891 (0.010)   1.103 (0.004)   0.710 (0.003)
Pafnucy    1.374 (0.057)   1.110 (0.065)   1.288 (0.039)   0.569 (0.040)
D. Evaluation of the Proposed Benchmark Dataset
Comparison of Tables I and II reveals that training on our proposed benchmark dataset significantly improves model performance, especially for the deep-learning-based approaches; for example, DeepAtom's R increases and its RMSE decreases. The difference between the two tables confirms the effectiveness of our proposed benchmark: the model trained on the new dataset provides more accurate predictions. Although the dataset contains some complexes with low-resolution structures, such lower-quality data does not introduce obvious label noise. On the contrary, the extended dataset provides more reliable complex data, which effectively improves the generalization power of binding affinity prediction algorithms.

It is worth noting that our proposed benchmark dataset extends the standard refined set by including complexes from the PDBbind general subset and the Binding MOAD database only when the experimental affinity data is Kd, Ka, or Ki. While Kd and Ki, as equilibrium constants, may be compared even when derived from different binding assays, the dependence of IC50 values on experimental settings discourages their comparison across assays [18].

Fig. 4: Comparison of scoring methods on the PDBbind v.2016 core set. (a) Comparison of R. (b) Comparison of RMSE.

TABLE IV: Validation performance with various feature/atom types.

Num of Features   Resolution   Occupancy   RMSE    R
11                1.0          Binary      1.485   0.706
24                1.0          Binary      1.360   0.737
60                1.0          Binary      1.359   0.737

E. Hyper-parameter Optimization
Although DeepAtom is an end-to-end data-driven approach for binding affinity prediction, some hyper-parameters are inevitably introduced, especially in the input featurization process. To finalize the optimal data representation of the protein-ligand complex for DeepAtom, we run systematic optimization experiments over the related hyper-parameters. In all of the following comparison experiments, our deep learning models are trained on "training set-1" with 3390 complexes and evaluated on the corresponding validation set with 377 complexes; prediction performance is measured by Pearson's R and RMSE.
1) Feature/Atom Types:
We consider three different descriptors, with 11, 24, and 60 features. The first descriptor characterizes both the protein and ligand atoms with the same 11 Arpeggio atom types. The second descriptor is the one described in Section II-A3. The third descriptor includes 40 ANOLEA atom types to describe protein atoms and 9 element types to describe ligand atoms, in addition to the 11 Arpeggio atom types; the ANOLEA atom types describe each protein atom based on its bond connectivity, chemical nature, and whether it belongs to the side chain or backbone of the amino acid [21], [22]. For the remaining controlled variables, we use the simple binary scheme for the occupancy type and set the resolution of the 3D grid box to 1.0 Å. From Table IV, we observe that treating protein and ligand atoms the same (the first descriptor) yields lower performance. Training the models on the PDBbind dataset requires extracting the structures of the free protein and free ligand from the complex, under the assumption that the conformational change upon ligand binding is negligible; binding affinity prediction therefore relies on the inter-molecular interactions between protein and ligand atoms, while the intra-molecular energies cancel out.

TABLE V: Validation performance with different resolutions.

Num of Features   Resolution   Occupancy   RMSE    R
24                0.5          Binary      1.357   0.739
24                1.0          Binary      1.360   0.737

TABLE VI: Validation performance with different occupancy types.

Num of Features   Resolution   Occupancy   RMSE    R
24                1.0          Binary      1.360   0.737
24                1.0          PCMax       1.348   0.741
2) Resolution:
High-resolution rasterized data can adequately capture fine-grained features and changes in local spatial regions, but it causes excessive memory usage and heavy computational cost; there is thus a trade-off between prediction performance and computational efficiency. Based on our analysis, we evaluate resolutions of 1.0 Å and 0.5 Å, both smaller than the smallest r_vdw value of about 1.4 Å among the 9 major heavy atoms. Table V shows that DeepAtom's prediction performance improves as the resolution becomes finer. However, this slight improvement comes with a large increase in computational cost, especially when a more demanding occupancy strategy such as PCMax is later used; we therefore select 1.0 Å as the optimal resolution.
3) Occupancy Type:
Occupancy type describes how each atom impacts its surrounding environment. Several different strategies have been proposed, such as binary, Gaussian [24] and PCMax [13]. Binary occupancy discretizes the atom's impact over a voxel: if the distance between an atom and a voxel center is shorter than the atom's van der Waals radius, the corresponding voxel channel is activated as 1, and otherwise deactivated as 0. In contrast, the Gaussian and PCMax approaches represent an atom's impact with a continuous numerical value, which carries richer information, and the impact decays smoothly as the distance increases. We compare the binary and PCMax occupancy types, on the basis of the optimal 24 feature/atom types and 1.0 Å grid resolution. Table VI shows that DeepAtom with the PCMax occupancy type achieves better performance. Considering the similarity between the Gaussian and PCMax algorithms, we expect them to yield comparable results.
4) Averaging at Testing:
As an effective strategy, data augmentation is also used to improve DeepAtom's performance at test time. In addition to augmenting data for training, we run the trained model on multiple augmented versions of the test data and average the results to reduce the prediction variance. We evaluate multiple test-data versions, namely 1, 12 and 24, where the value 1 means only the original test set is used without the averaging operation. We observe that increasing the number of test-set versions favorably reduces the variance of the predictions and further improves performance.
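Prediction averaging over the augmented copies is a one-liner at inference time; a minimal sketch, assuming the model outputs one affinity per input grid:

```python
import torch

def predict_with_averaging(model, augmented_grids):
    """Average model outputs over several augmented versions of one test
    complex; `augmented_grids` has shape (n_versions, C, D, H, W), with
    n_versions in {1, 12, 24} as evaluated above."""
    model.eval()
    with torch.no_grad():
        return model(augmented_grids).mean().item()  # one stabilized estimate
```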
IV. CONCLUSION
In this paper, we proposed the framework DeepAtom to accurately predict protein-ligand binding affinity. An efficient 3D-CNN architecture is proposed to effectively improve the model's learning capacity with the limited complex data available for training. In a purely data-driven manner, without a priori functional form assumptions, DeepAtom significantly outperforms state-of-the-art deep learning, machine learning and conventional scoring techniques. We also proposed a new benchmark dataset to further improve model performance. The promising results on independent, challenging datasets demonstrate that DeepAtom can potentially be adopted in computational drug development protocols such as molecular docking and virtual screening.

V. ACKNOWLEDGMENTS
This work is partially supported by the National Science Foundation (CNS-1842407) and the National Institutes of Health (R01GM110240). We thank Yaxia Yuan for helpful discussions and comments that improved the manuscript.
REFERENCES

[1] A. Ahmed, R. D. Smith, J. J. Clark, J. B. J. Dunbar, and H. A. Carlson. Recent improvements to Binding MOAD: a resource for protein-ligand binding affinities and structures. Nucleic Acids Research, 43(D1):D465–D469, 2015.
[2] P. J. Ballester, A. Schreyer, and T. L. Blundell. Does a more precise chemical description of protein-ligand complexes lead to more accurate prediction of binding affinity? Journal of Chemical Information and Modeling, 54(3):944–955, 2014.
[3] R. C. Braga, V. M. Alves, A. C. Silva, M. N. Nascimento, F. C. Silva, L. M. Liao, and C. H. Andrade. Virtual screening strategies in medicinal chemistry: the state of the art and current challenges. Current Topics in Medicinal Chemistry, 14(16):1899–1912, 2014.
[4] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Computer Vision and Pattern Recognition, 2017.
[5] L. Evanthia, S. George, V. D. K, and C. Zoe. Structure-based virtual screening for drug discovery: principles, applications and recent advances. Current Topics in Medicinal Chemistry, 14(16):1923–1938, 2014.
[6] J. Gomes, B. Ramsundar, E. N. Feinberg, and V. S. Pande. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv:1703.10603, 2017.
[7] M. J. Hartshorn, M. L. Verdonk, G. Chessari, S. C. Brewerton, W. T. Mooij, P. N. Mortenson, and C. W. Murray. Diverse, high-quality test set for the validation of protein-ligand docking performance. Journal of Medicinal Chemistry, 50(4):726–741, 2007.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[10] L. Hu, M. L. Benson, R. D. Smith, M. G. Lerner, and H. A. Carlson. Binding MOAD (Mother of All Databases). Proteins, 60(3):333–340, 2005.
[11] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions. In Computer Vision and Pattern Recognition, 2018.
[12] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, 2017.
[13] J. Jiménez, S. Doerr, G. Martínez-Rosell, A. Rose, and G. De Fabritiis. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, 33(19):3036–3042, 2017.
[14] J. Jiménez, M. Skalic, G. Martínez-Rosell, and G. De Fabritiis. KDEEP: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. Journal of Chemical Information and Modeling, 58(2):287–296, 2018.
[15] H. C. Jubb, A. P. Higueruelo, B. Ochoa-Montaño, W. R. Pitt, D. B. Ascher, and T. L. Blundell. Arpeggio: A web server for calculating and visualising interatomic interactions in protein structures. Journal of Molecular Biology, 429(3):365–371, 2017.
[16] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
[17] Y. Li, H. Kang, K. Ye, S. Yin, and X. Li. FoldingZero: Protein folding from scratch in hydrophobic-polar model. In Workshop on Deep Reinforcement Learning at NeurIPS, 2018.
[18] Y. Li, Z. Liu, J. Li, L. Han, J. Liu, Z. Zhao, and R. Wang. Comparative assessment of scoring functions on an updated benchmark: 1. Compilation of the test set. Journal of Chemical Information and Modeling, 54(6):1700–1716, 2014.
[19] J. Liu and R. Wang. Classification of current scoring functions. Journal of Chemical Information and Modeling, 55(3):475–482, 2015.
[20] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In European Conference on Computer Vision, 2018.
[21] F. Melo and E. Feytmans. Novel knowledge-based mean force potential at atomic level. Journal of Molecular Biology, 267(1):207–222, 1997.
[22] F. Melo and E. Feytmans. Assessing protein structures with a non-local atomic interaction energy. Journal of Molecular Biology, 277(5):1141–1152, 1998.
[23] G. M. Morris, R. Huey, W. Lindstrom, M. F. Sanner, R. K. Belew, D. S. Goodsell, and A. J. Olson. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of Computational Chemistry, 30(16):2785–2791, 2009.
[24] M. Ragoza, J. Hochuli, E. Idrobo, J. Sunseri, and D. R. Koes. Protein-ligand scoring with convolutional neural networks. Journal of Chemical Information and Modeling, 57(4):942–957, 2017.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Computer Vision and Pattern Recognition, 2018.
[26] L. Schietgat, T. Fannes, and J. Ramon. Predicting protein function and protein-ligand interaction with the 3D neighborhood kernel. In International Conference on Discovery Science. Springer, 2015.
[27] M. H. Segler, M. Preuss, and M. P. Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[29] M. M. Stepniewska-Dziubinska, P. Zielenkiewicz, and P. Siedlecki. Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics, 34(21):3666–3674, 2018.
[30] L. Sundaram, H. Gao, S. R. Padigepati, J. F. McRae, Y. Li, J. A. Kosmicki, N. Fritzilas, J. Hakenberg, A. Dutta, J. Shon, et al. Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics, 50(8):1161, 2018.
[31] I. Wallach, M. Dzamba, and A. Heifets. AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv:1510.02855, 2015.
[32] R. Wang, X. Fang, Y. Lu, C.-Y. Yang, and S. Wang. The PDBbind database: Methodologies and updates. Journal of Medicinal Chemistry, 48:4111–4119, 2005.
[33] R. Wang, L. Lai, and S. Wang. Further development and validation of empirical scoring functions for structure-based binding affinity prediction. Journal of Computer-Aided Molecular Design, 16(1):11–26, 2002.
[34] R. Wang, Y. Lu, and S. Wang. Comparative evaluation of 11 scoring functions for molecular docking. Journal of Medicinal Chemistry, 46(12):2287–2303, 2003.
[35] B. M. Wingert and C. J. Camacho. Improving small molecule virtual screening strategies for the next generation of therapeutics. Current Opinion in Chemical Biology, 44:87–92, 2018.
[36] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Computer Vision and Pattern Recognition, 2018.