DeepAtom: A Framework for Protein-Ligand Binding Affinity Prediction
Yanjun Li, Mohammad A. Rezaei, Chenglong Li, Xiaolin Li, and Dapeng Wu
NSF Center for Big Learning, University of Florida
Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), University of Florida
Department of Chemistry, University of Florida
Gainesville, FL, USA
yanjun.li@ufl.edu, [email protected]fl.edu, [email protected]fl.edu, [email protected]fl.edu, dpwu@ufl.edu
† These authors contributed equally. * To whom correspondence should be addressed.
Abstract—The cornerstone of computational drug design is the calculation of binding affinity between two biological counterparts, especially a chemical compound, i.e., a ligand, and a protein. Predicting the strength of protein-ligand binding with reasonable accuracy is critical for drug discovery. In this paper, we propose a data-driven framework named DeepAtom to accurately predict protein-ligand binding affinity. With a 3D Convolutional Neural Network (3D-CNN) architecture, DeepAtom automatically extracts binding-related atomic interaction patterns from the voxelized complex structure. Compared with other CNN-based approaches, our light-weight model design effectively improves the model's representational capacity, even with the limited available training data. In validation experiments on the PDBbind v.2016 benchmark and the independent Astex Diverse Set, we demonstrate that the less feature-engineering-dependent DeepAtom approach consistently outperforms the other state-of-the-art scoring methods. We also compile and propose a new benchmark dataset to further improve model performance; with the new dataset as training input, DeepAtom achieves its best Pearson's R and RMSE (in pK units) on the PDBbind v.2016 core set. These promising results demonstrate that DeepAtom models can potentially be adopted in computational drug development protocols such as molecular docking and virtual screening.

Index Terms—binding affinity prediction, deep learning, efficient 3D-CNN, benchmarking
I. INTRODUCTION
Binding of a chemical molecule to a protein may start a biological process; examples include the activation or inhibition of an enzyme's activity, and a drug molecule affecting its target protein. Binding is quantified by how strongly the chemical compound, a.k.a. the ligand, binds to its counterpart protein; this quantity is called binding affinity. Simulation of biological processes and computational drug design rely heavily on calculating this binding strength. For example, Virtual Screening (VS) tries to find the best chemical compounds to regulate the function of a protein; in drug discovery this is often a disease-related target protein. VS achieves this goal by assigning a score to each binding ligand, indicating how strongly it binds to the target protein. For an overall picture of how binding affinity prediction enables VS, the reader is referred to [3], [5], [35].
Current approaches for quantifying the binding affinity can be categorized as physics-based, empirical, knowledge-based and descriptor-based scoring functions [19]. In spite of their merits, the conventional techniques assume a predetermined functional form which is additive. Furthermore, they need domain knowledge to extract features and formulate the scoring functions. For example, the semi-empirical force field in AutoDock [23] and the empirical scoring function in X-Score [34] belong to this category. For instance, X-Score takes the average of three scoring functions, HPScore, HMScore, and HSScore, which differ by the terms describing the hydrophobic interactions; each of these scoring functions is a linear combination of its terms [33].

Only in the past decade have machine learning algorithms been used to score protein-ligand binding strength in a data-driven manner. Ballester and coworkers gathered data about the frequency of occurrence of each possible protein-ligand atom pair up to a distance threshold. By binning these counts, they trained their Random Forest model implicitly on short-range, middle-range, and long-range interactions. The resulting RF-Score model is still among the top performers in this field [2]. However, such models also rely heavily on biological feature engineering to extract descriptors or fingerprints; they are still based on expert knowledge and therefore biased.

Deep learning models, which belong to the descriptor-based category, aim to minimize the bias due to expert knowledge. To describe the interactions between a protein and its ligand, atom-centered and grid-based methods are the most widely used techniques. Schietgat et al. developed a 3D neighborhood kernel, which takes the spatial distances between atoms into consideration, to describe the structure of proteins and ligands [26]. More recently, Gomes and coworkers [6] represented the structures of proteins and ligands as a combination of neighbor lists and atom types, to later be used in a deep network; in their case, they described the whole protein and ligand structures using their atom-centered scheme.

By contrast, the grid-based approach usually limits the representation of protein and ligand interactions to a grid box defined around the protein's binding site, with different atom information encoded in different channels of the 3D grid. Wallach et al. [31] and Ragoza et al. [24] developed CNN scoring functions to classify compound poses as binders or non-binders. Jiménez et al. [14] and Stepniewska-Dziubinska et al. [29] designed similar deep learning approaches to predict the binding affinity, based on rasterized protein-ligand structure input.

In addition to the modeling approach, data reliability is a major issue for binding affinity prediction. Although a few thousand labeled protein-ligand complexes are available, their binding affinities are reported as different experimental measures, including Kd, Ka, Ki, and IC50, in decreasing order of reliability for the purpose of binding affinity prediction. If we indiscriminately feed all the data to a machine learning model, we potentially introduce label noise or even labels that differ from the ground truth, and the model may then suffer from this inaccurate supervision. Our goal in this paper is twofold.
First, we aim to develop an end-to-end solution for accurate binding affinity prediction which 1) takes as input the 3D structural data of the protein and ligand, 2) requires minimal feature engineering, and 3) achieves state-of-the-art prediction performance. Second, we aim to systematically analyze the publicly available protein-ligand binding affinity data and propose a new benchmark for more reliable model learning.

More recently, deep learning has exhibited powerful superiority across a broad range of bioinformatics tasks, such as predicting the impact of gene mutations [30], protein folding [17] and drug discovery [27]. By stacking many well-designed neural network layers, it is capable of extracting useful features from raw data and approximating highly complex functions [16]. Many advanced deep learning algorithms are built on convolutional neural networks (CNNs). The impressive performance of CNNs stems mainly from their ability to exploit spatially-local correlation in the data. Protein-ligand 3D structures naturally share this characteristic: biochemical interactions between atoms occur locally. CNNs can therefore hierarchically compose such local spatial interactions into abstract, high-dimensional global features contributing to the binding score.

In this paper, we propose the framework DeepAtom to accurately predict protein-ligand binding affinity. A protein-ligand complex is first rasterized into a 3D grid box centered on the ligand center. Each voxel has several input channels, embedding the raw information of the atoms located around that voxel. A light-weight 3D-CNN model is then developed to hierarchically extract useful atom interaction features, supervised by the binding affinity score. As a data-driven approach, it effectively avoids a priori functional form assumptions. More importantly, our efficient architecture design significantly improves the model's representational and generalization capacity, even when trained with the limited available complex data.

We present comprehensive experiments on the standard benchmark test set, the PDBbind v.2016 core set [32], and an additional test set, the Astex Diverse Set [7]. Randomly initialized and evaluated 5 times, DeepAtom consistently outperforms all state-of-the-art published models. To further improve model performance, we also critically study the publicly available complex data and propose a new benchmark dataset. It further improves DeepAtom's performance, with potential benefits for future research in the binding affinity scoring field.

Fig. 1: Local box featurization (3D data representation). The grid box encompasses the area around the binding site, centered on the ligand. Each channel includes only a specific feature; e.g., from left to right, the three channels shown are the excluded volume channel for the ligand and the hydrophobic and aromatic channels for the protein. Each sample is described by 24 channels in total.

II. MATERIALS AND METHODS
A. Input Featurization and Processing
1) Protein-Ligand Complex:
The standard datasets, such as PDBbind and Binding MOAD, include the structures of protein and ligand in their bound form, a.k.a. their complex, deposited in a single PDB file. The strength of ligand binding to the protein has been experimentally determined for each structure; this binding affinity data is used as the ground-truth label for each protein-ligand complex.
2) Grid Size & Resolution:
We calculate the distribution of end-to-end distances for all ligands in the PDBbind v.2016 refined and core datasets. This guides us to set the box size equal to the end-to-end distance of the longest ligand in these two datasets, so that no ligand needs to be filtered out. The distribution of ligand lengths in these two subsets is illustrated in Fig. 3a.

The van der Waals radius of each of the 9 major heavy atoms (C, N, O, P, S, F, Cl, Br, I) used in our study is greater than 1.4 Å. In a simplified view, an atom's r_vdw can be taken as a measure of its size; it is defined as half of the internuclear separation of two non-bonded atoms of the same element at their closest possible approach. A grid resolution larger than 2 × r_vdw therefore cannot differentiate two such atoms from each other, while a finer resolution brings a much higher computational cost. As a trade-off between accuracy and efficiency, we set the grid resolution to 1.0 Å.
3) Features / Atom Types:
The 11 Arpeggio atom types are based on the potential interactions each atom may be involved in [15]; they include features such as hydrogen-bond acceptor and donor, positive or negative, hydrophobic, and aromatic atom types. The protein and ligand atoms are described by the 11 Arpeggio atom types plus an excluded volume feature, with a distinction made between protein and ligand atoms. This results in (11 + 1) × 2 = 24 features.
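To make the channel layout concrete, a minimal sketch of one possible indexing scheme is shown below; the specific Arpeggio type names and their order are illustrative assumptions, not the exact labels used in DeepAtom:

```python
# Hypothetical feature names; only the counts (11 + 1 per molecule) follow the paper.
ARPEGGIO_TYPES = [
    "hbond_acceptor", "hbond_donor", "weak_hbond_acceptor", "weak_hbond_donor",
    "positive", "negative", "hydrophobic", "aromatic",
    "halogen_acceptor", "halogen_donor", "carbonyl",
]  # 11 interaction-based atom types
FEATURES = ARPEGGIO_TYPES + ["excluded_volume"]  # 12 features per molecule

def channel_index(feature: str, is_ligand: bool) -> int:
    """Map a (feature, molecule) pair to one of the 24 grid channels:
    channels 0-11 describe protein atoms, channels 12-23 describe ligand atoms."""
    offset = len(FEATURES) if is_ligand else 0
    return offset + FEATURES.index(feature)

assert channel_index("aromatic", is_ligand=True) == 19
```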
4) Occupancy Type:
This hyper-parameter defines how each atom impacts its surrounding environment. In our work, each atom can affect its neighboring voxels up to double its van der Waals radius r_vdw through a pair correlation potential. We use the atom-to-voxel PCMax algorithm described in [13], where each atom makes a continuous contribution n(r) to its neighboring voxels as defined by Eq. 1, with r the distance to the voxel center; at each voxel, only the maximum effect among the contributing atoms is kept.

n(r) = 1 − exp(−(r_vdw / r)^12)   (1)
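This computation can be sketched in a few lines of NumPy; the following is a minimal interpretation of the atom-to-voxel PCMax scheme under the definitions above, not the exact implementation:

```python
import numpy as np

def pcmax_occupancy(grid_shape, voxel_centers, atom_coords, atom_rvdw):
    """Each atom contributes n(r) = 1 - exp(-(r_vdw / r)**12) (Eq. 1) to the
    voxels within 2 * r_vdw of it; every voxel keeps only the maximum
    contribution among all atoms (PCMax)."""
    occupancy = np.zeros(int(np.prod(grid_shape)), dtype=np.float32)
    for xyz, rvdw in zip(atom_coords, atom_rvdw):
        r = np.linalg.norm(voxel_centers - xyz, axis=1)   # distance to every voxel center
        near = (r > 1e-6) & (r <= 2.0 * rvdw)             # the atom's sphere of influence
        n = 1.0 - np.exp(-((rvdw / r[near]) ** 12))       # pair correlation potential
        occupancy[near] = np.maximum(occupancy[near], n)  # keep the strongest effect
    return occupancy.reshape(grid_shape)
```

Here `voxel_centers` is an (n_voxels, 3) array of coordinates whose row order matches the flattened grid.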
B. Network Architecture

To extract atomic interaction information from the voxelized protein-ligand complex data, a straightforward approach is to extend 2D-CNNs to 3D by using 3D convolution kernels. One channel of the output feature map at location (i, j, k) is computed by the standard convolution as follows:

Conv(W, h)_(i,j,k) = Σ_{s,t,r,m}^{S,T,R,M} W_(s,t,r,m) · h_(i+s, j+t, k+r, m)   (2)

where h represents the input with M channels, W ∈ R^(S×T×R×M) represents one filter's weights, and S, T, R are the side lengths of the filter. However, 3D convolution massively inflates the number of trainable network parameters, due to the increase in the input and kernel dimensions. Specifically, if N is the number of output channels, one standard convolution layer introduces S·T·R·M·N parameters. More importantly, the general trend for improving learning ability and prediction accuracy has been to make models deeper and more complicated [28], [8], [12]. By contrast, for the affinity prediction problem, only a few thousand protein-ligand complexes with experimentally determined binding affinities are available. This discourages network architectures with too many trainable parameters, because overfitting may occur when a high-complexity network is trained on a relatively small dataset. Indeed, another 3D CNN-based affinity prediction work [24] also encountered the overfitting issue: after empirically optimizing the model depth and width, the authors ultimately reduced their network to only three convolutional layers. Similarly, Pafnucy [29] is a 3D CNN model with three convolutional and three dense layers.

Our model is inspired by light-weight network architectures, and aims to achieve the best trade-off between prediction accuracy and model complexity in terms of learnable parameters. A series of such network structures have recently been proposed, including Xception [4], MobileNet v1 [9] and v2 [25], ShuffleNet v1 [36] and v2 [20], and CondenseNet [11]. Based on the practical guidelines for efficient CNN architecture design and the corresponding ShuffleNet units described in [20], we propose a novel light-weight 3D CNN model whose deeper layers can be trained effectively with the limited training samples. It improves prediction performance by a large margin without significantly increasing model complexity.

Specifically, as shown in Fig. 2, our model consists of three building blocks: an atom information integration block, a stacked feature extraction block, and a global affinity regression block.

In the atom information integration block, a pointwise (PW, 1×1×1) convolution layer with a non-linear activation function, defined in Eq. 3, is first utilized to fuse the atom information across different channels:

PWConv(W, h)_(i,j,k) = Σ_{m}^{M} W_m · h_(i,j,k,m)   (3)

This cascaded cross-channel parametric pooling structure brings an improvement over the empirical scoring functions; for instance, the AutoDock scoring function is composed of a linear combination of interaction types, such as hydrogen bonding and electrostatic interactions [23]. The pointwise convolution layer in our model is followed by a 3D max pooling layer, which increases the translational invariance of the network and reduces the input dimension; the output of this block is a spatially down-sampled 3D grid.

The feature extraction block consists of multiple consecutive 3D shuffle units, which are categorized into three groups according to the number of channels in their outputs. At the beginning of each unit, a channel split operator equally splits the input feature channels into two branches. One branch is sequentially processed by a pointwise convolution, a 3×3×3 depthwise (DW) convolution, and an additional pointwise convolution; all three layers have the same number of input and output channels N. A depthwise convolutional layer performs the spatial convolution independently over every channel of the input:

DWConv(W, h)_(i,j,k) = Σ_{s,t,r}^{S,T,R} W_(s,t,r) ⊙ h_(i+s, j+t, k+r)   (4)

where ⊙ denotes the element-wise product. Although the depthwise convolution does not combine different input channels, the two neighboring pointwise convolutions effectively fuse the information across channels. The other branch is kept as an identity until it is concatenated with the output of the first branch. This identity branch can be regarded as an effective feature reuse design, which strengthens feature propagation and reduces the number of parameters. Within a basic unit, the depthwise and pointwise convolutions respectively introduce S·T·R·N and N·N parameters. Therefore, using a basic unit in place of the standard convolution, we obtain the following parameter reduction:

(S·T·R·N + 2·N·N) / (S·T·R·N·N) = 1/N + 2/(S·T·R)   (5)

DeepAtom uses 3×3×3 depthwise convolutions, with the number of channels growing from group to group. With this efficient model design, we easily obtain a more than 20-fold parameter reduction, which enables us to stack deeper layers to improve the model's learning capacity.

Fig. 2: Network architecture. Each Conv layer is specified by its number of channels, kernel size and stride. The 3D MaxPool layer has kernel size 3 and stride 2. For the 3D Shuffle Groups, the numbers in parentheses denote the number of output channels and the repeat count of the unit. Only the first unit in each group has a down-sampling layer, where the DWConv layer has kernel size 3 and stride 2. In the remaining units, DWConv with kernel size 3 and stride 1, and PWConv with kernel size 1 and stride 1 are utilized. Eight losses are calculated based on the shared-weights FC layer output. Two dropout layers are appended before the last 3D Pointwise Conv and FC layers, respectively.
At the end of each unit, a channel shuffle operation is applied to enable information flow across the two branches. In particular, the channel shuffle operation first divides the feature map in each branch into several subgroups, then mixes subgroups between the branches. When spatial down-sampling is applied, the channel split operator in the shuffle unit is removed and the number of output channels is doubled. In each group, only the first shuffle unit has a down-sampling layer; the remaining units keep the input dimension.
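For illustration, a basic (non-down-sampling) 3D shuffle unit of this kind can be sketched in PyTorch as follows; the exact placement of batch normalization and activations inside DeepAtom is an assumption here:

```python
import torch
import torch.nn as nn

def channel_shuffle3d(x, groups=2):
    """Mix channel subgroups across the two branches (ShuffleNet-style)."""
    b, c, d, h, w = x.shape
    return x.view(b, groups, c // groups, d, h, w).transpose(1, 2).reshape(b, c, d, h, w)

class ShuffleUnit3D(nn.Module):
    """Basic unit: channel split -> PW conv -> 3x3x3 DW conv -> PW conv ->
    concat with the identity branch -> channel shuffle."""
    def __init__(self, channels):
        super().__init__()
        n = channels // 2  # each branch carries half the channels
        self.branch = nn.Sequential(
            nn.Conv3d(n, n, kernel_size=1, bias=False),  # pointwise
            nn.BatchNorm3d(n), nn.LeakyReLU(inplace=True),
            nn.Conv3d(n, n, kernel_size=3, padding=1, groups=n, bias=False),  # depthwise
            nn.BatchNorm3d(n),
            nn.Conv3d(n, n, kernel_size=1, bias=False),  # pointwise
            nn.BatchNorm3d(n), nn.LeakyReLU(inplace=True),
        )

    def forward(self, x):
        identity, transformed = x.chunk(2, dim=1)  # channel split
        out = torch.cat([identity, self.branch(transformed)], dim=1)
        return channel_shuffle3d(out)
```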
After stacking the three shuffle groups, the original 3D input data is down-sampled to a 4D tensor with 3 grid cells along each of the x, y and z axes plus a channel dimension. The global affinity regression block first slices this tensor into per-position vectors. Thanks to the preceding shuffle groups, the receptive field of each vector covers the entire raw volume, so we set up an affinity prediction task for each vector: a shared-weights fully connected (FC) layer consumes each vector to construct a regression loss, which trains the top layers more thoroughly and further avoids overfitting. In the testing phase, the multiple hidden vectors are averaged right before the FC layer to stabilize the prediction.

In the architecture, we adopt the leaky rectified linear unit as the activation function. A batch normalization layer is appended after each convolution operation to speed up training. The mean squared error is used as the affinity regression loss for model learning.
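A sketch of this multi-head regression block is given below. Since the FC layer is linear, averaging the hidden vectors before it (as done at test time) is equivalent to averaging the per-position predictions; dropout and layer names are omitted for brevity:

```python
import torch
import torch.nn as nn

class AffinityHead(nn.Module):
    """A shared-weights FC layer scores the feature vector at every spatial
    position of the final 4D feature map (one regression loss per position
    during training)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, 1)  # weights shared across positions

    def forward(self, feat):                       # feat: (B, C, D, H, W)
        vectors = feat.flatten(2).transpose(1, 2)  # (B, D*H*W, C), one vector per position
        return self.fc(vectors).squeeze(-1)        # (B, D*H*W) per-position predictions

# training: total loss = sum of MSE(scores[:, p], target) over positions p
# testing:  prediction = scores.mean(dim=1), equivalent to averaging before the FC
```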
1) Training:
The model is updated by the Adam algorithm with its default momentum parameters (β1 = 0.9, β2 = 0.999). Training the model from scratch, we set the initial learning rate to 0.001 and apply a small weight decay. Our model is implemented in PyTorch and trained in mini-batches on Nvidia GPU cards.
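A minimal training-setup sketch consistent with this description is shown below; `DeepAtom()` and `loader` are hypothetical placeholders, and the weight-decay value is a stand-in, since the exact figure is not reproduced here:

```python
import torch

model = DeepAtom()                      # hypothetical model class
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                            # initial learning rate from the paper
    betas=(0.9, 0.999),                 # Adam's default momentum parameters
    weight_decay=1e-4,                  # placeholder value
)
criterion = torch.nn.MSELoss()          # affinity regression loss

for grids, affinities in loader:        # hypothetical DataLoader of voxelized complexes
    optimizer.zero_grad()
    loss = criterion(model(grids), affinities)
    loss.backward()
    optimizer.step()
```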
2) Data Augmentation:
The publicly available biological datasets contain only a few thousand complexes with reliable experimental binding affinity values, and directly training on such a small sample easily makes a deep learning model overfit. Data augmentation has proven to be an effective way to enhance deep learning model performance. In our experiments, each of the original samples is randomly translated and rotated; enabling such transformations significantly improves training and model capacity. To reduce variance, the predictions over the augmented samples of each protein-ligand complex are averaged during the prediction phase.
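The augmentation step can be sketched as a random rigid transform of the raw atom coordinates before voxelization; rotating about the ligand center and the 1 Å translation bound are assumptions made for illustration:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def augment_complex(coords, ligand_center, max_shift=1.0, seed=None):
    """Randomly rotate the complex about the ligand center and apply a small
    random translation; `max_shift` (in Angstroms) is an assumed bound."""
    rng = np.random.default_rng(seed)
    rotation = Rotation.random(random_state=rng).as_matrix()  # uniform random 3D rotation
    shift = rng.uniform(-max_shift, max_shift, size=3)        # random translation vector
    return (coords - ligand_center) @ rotation.T + ligand_center + shift
```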
C. Dataset Preparation
1) PDBbind Dataset:
PDBbind is the standard dataset for developing models that predict the binding affinity of protein-ligand complexes [32]; it has three subsets, namely core, refined, and general. The general subset includes complexes of relatively lower quality; it contains lower-resolution structures, and the experimental affinities of some structures are reported only as IC50 values. The refined dataset is a higher-quality subset of the general dataset; it includes complex structures with better resolution and excludes any complex with IC50 data only. In total, the PDBbind v.2016 refined dataset includes 4057 complexes. The core dataset is a subset of the refined data, clustered with a 90% sequence-similarity threshold; five representative complexes are picked from each protein family cluster in order to better cover the affinity range. This results in 290 complexes in the core subset, which serves as the standard test set to evaluate scoring approaches. We further split the remaining 3767 non-overlapping complexes between the refined and core sets into two disjoint subsets: (i) 10% of the complexes (377) are randomly selected as the validation set; (ii) the rest (3390 complexes) are used for training, which we name "training set-1".
2) Proposed Benchmark Training Set:
In order to compile an improved benchmark dataset, we use PDBbind data as well as a complementary source of protein-ligand binding affinity data, namely Binding MOAD [1], [10]. To incorporate recently updated complexes, we start from a newer PDBbind release and extract all complexes with either Kd, Ka, or Ki data from the general and refined subsets. It is worth noting that we exclude the complexes shared with the core subset to prevent data leakage. We follow the same steps for the Binding MOAD data. A few filtration steps are also necessary: first, if a complex has reported Kd/Ka data in one database and Ki in the other, we keep the Kd/Ka data only; second, complexes with a peptide as their ligand are discarded. We do not filter the complexes based on their structure resolution, nor do we cluster them by protein sequence or structure; clustering is typically done to later reduce a dataset to representative samples, and the limited availability of experimental affinity data discourages further removal of samples, although the dataset is biased towards some structures, e.g., congeneric series.

In total, the final benchmark dataset contains 10,383 complexes. Note that, in contrast to NMR structures, which contain multiple 3D models, a PDB file from X-ray crystallography contains a single 3D structure only. Our proposed benchmark dataset consists almost exclusively of X-ray structures with a single structure per PDB file; merely 63 complexes come from NMR experiments, and for these we keep only the first model.

Compared with the refined subset of PDBbind, this dataset almost doubles the number of samples with Kd/Ka/Ki data and is expected to improve the performance of binding affinity scoring techniques. The full list of the proposed benchmark training dataset is provided in the Supplementary Table, with the pKd/pKa/pKi value of each complex reported to make the dataset easier for other researchers to use. The binding scores of the complexes span a wide range in pK units; the score distribution is shown in Supplementary Fig. S1.

To train the scoring approaches, we split the proposed benchmark dataset into two subsets: (i) validation samples are randomly selected from the non-overlapping complexes between the PDBbind v.2016 refined and core sets; (ii) the remaining complexes are used for training, named "training set-2".
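The selection rules can be summarized in a short pandas sketch; the file and column names (and the core-set ID list) are hypothetical, not the authors' actual pipeline:

```python
import pandas as pd

# Hypothetical inputs: per-complex tables with pdb_id, measure (Kd/Ka/Ki/IC50),
# affinity, and an is_peptide_ligand flag; core_set_ids holds the core-set PDB IDs.
pdbbind = pd.read_csv("pdbbind_general_refined.csv")
moad = pd.read_csv("binding_moad.csv")
core_set_ids = set(pd.read_csv("pdbbind_core.csv")["pdb_id"])

data = pd.concat([pdbbind, moad], ignore_index=True)
data = data[data["measure"].isin(["Kd", "Ka", "Ki"])]   # keep Kd/Ka/Ki only
data = data[~data["pdb_id"].isin(core_set_ids)]         # no overlap with the core set
data = data[~data["is_peptide_ligand"]]                 # discard peptide ligands

# If a complex has Kd/Ka in one database and Ki in the other, keep the Kd/Ka entry.
data["rank"] = data["measure"].map({"Kd": 0, "Ka": 0, "Ki": 1})
data = data.sort_values("rank").drop_duplicates("pdb_id", keep="first")
```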
3) Astex Diverse Set:
This dataset was developed in 2007. It includes 85 protein-ligand complexes selected to be of specific interest to the pharmaceutical and agrochemical industries [7]. Among these 85 complexes, 64 include binding affinity data.
D. Other Methods for Comparison
In Section III, we compare DeepAtom with three state-of-the-art, open-source scoring approaches: the Pafnucy model [29], RF-Score [2], and X-Score [34]. For Pafnucy and RF-Score, we use the open-source code provided by the authors, with their suggested hyper-parameters, to re-train the models on the same datasets as DeepAtom. For X-Score, we take the results from [14], where the authors used the publicly available binaries to make predictions on the same PDBbind v.2016 core set.

Supplementary materials: https://github.com/YanjunLi-CS/DeepAtom

TABLE I: Results on the PDBbind v.2016 core set with "training set-1". In each table cell, the mean over five runs is reported, with the standard deviation in parentheses.

Method     RMSE            MAE             SD              R
DeepAtom
RF-Score   1.403 (0.002)   1.134 (0.003)   1.293 (0.002)   0.803 (0.001)
Pafnucy    1.553 (0.031)   1.261 (0.027)   1.521 (0.037)   0.722 (0.017)

III. RESULTS & DISCUSSION
In this section, we describe the training and benchmarking protein-ligand complex data for DeepAtom. The evaluation details are presented along with a discussion of the results.
A. Evaluation Metrics
To comprehensively evaluate model performance, we use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) to measure the prediction error, and the Pearson correlation coefficient (R) and the standard deviation in regression (SD) to measure the linear correlation between predictions and experimental values. The SD is defined in Eq. 6:

SD = sqrt( Σ_{i}^{n} [y_i − (a + b·x_i)]² / (n − 1) )   (6)

where x_i and y_i respectively represent the predicted and experimental values of the i-th complex, and a and b are the intercept and the slope of the regression line, respectively.
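These four metrics follow directly from their definitions; a straightforward NumPy rendering is:

```python
import numpy as np

def evaluate(pred, target):
    """Compute the four reported metrics; pred/target are 1D arrays in pK units."""
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    mae = np.mean(np.abs(pred - target))
    r = np.corrcoef(pred, target)[0, 1]     # Pearson correlation coefficient
    b, a = np.polyfit(pred, target, deg=1)  # regress experimental on predicted values
    sd = np.sqrt(np.sum((target - (a + b * pred)) ** 2) / (len(pred) - 1))
    return {"RMSE": rmse, "MAE": mae, "SD": sd, "R": r}
```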
B. Model Comparison with "Training Set-1"

We first train DeepAtom, RF-Score and Pafnucy on "training set-1" with the 3390 complexes described in Section II-C1, and evaluate them on the PDBbind v.2016 core set, which is unseen by the models during training and validation. Each approach is randomly initialized and evaluated five times. The mean and standard deviation (in parentheses) of the four evaluation metrics are presented in Table I for testing, and in Supplementary Table S1 for validation. Learning from these very limited samples, DeepAtom outperforms the similar 3D CNN-based Pafnucy by a large margin, which demonstrates that our light-weight architecture design enables effective training with deep layers and significantly improves learning and generalization capacity. At the same time, DeepAtom achieves performance comparable to the conventional machine learning method RF-Score, even though, as a practical guideline, training a supervised deep learning model generally requires larger datasets. This suggests that our model has greater potential to provide more accurate predictions, given enough training data.

Fig. 3: (a) Ligand length distribution in the PDBbind v.2016 refined and core sets. (b) Binding data distribution of the training set in our proposed benchmark. (c) Binding data distribution of the core set. (d) DeepAtom prediction results for the core set. (e) The distribution of MAE between DeepAtom predictions and targets for complexes in different binding ranges. (f) DeepAtom prediction results for the Astex Diverse Set.

C. Model Comparison with "Training Set-2"
Next, we use our proposed "training set-2" to re-train the DeepAtom, RF-Score and Pafnucy models, and again evaluate them on the PDBbind v.2016 core set. As before, five different runs are conducted for each scoring method to stabilize the results. Fig. 4 shows the comparison in terms of mean R and RMSE, including the X-Score prediction results, and Table II presents the detailed results of the top three methods. As expected, DeepAtom outperforms all the other approaches across all four measurements by a large margin, obtaining the best Pearson correlation coefficient and RMSE in pK units, compared with RF-Score (R = 0.801, RMSE = 1.419) and Pafnucy (R = 0.761, RMSE = 1.443). To the best of our knowledge, DeepAtom achieves the state-of-the-art performance on this well-known benchmark. The corresponding validation results are shown in Supplementary Table S2.

TABLE II: Results on the PDBbind v.2016 core set with "training set-2".

Method     RMSE            MAE             SD              R
DeepAtom
RF-Score   1.419 (0.002)   1.124 (0.001)   1.304 (0.002)   0.801 (0.000)
Pafnucy    1.443 (0.021)   1.164 (0.019)   1.424 (0.022)   0.761 (0.008)

Fig. 3d shows the correlation between the predictions of one DeepAtom model and the experimental binding affinity data: DeepAtom gives highly correlated predictions on the PDBbind v.2016 core set. To further investigate model performance on different ranges of the binding data, we visualize the binding affinity distribution of the training set in the proposed benchmark (Fig. 3b), of the PDBbind v.2016 core set (Fig. 3c), and the corresponding DeepAtom MAE within each pK unit range (Fig. 3e). From Figs. 3b and 3c, we observe that the binding scores of our training samples are concentrated in the middle of the pK range, highly similar to the core set. Fig. 3e shows that DeepAtom obtains better predictions, with lower MAE values, in this middle range than for the less frequent binding scores. For a data-driven approach, the distribution of the training data plays a crucial role in performance: because the number of training samples in the middle range is much larger than the number with marginal affinity values, DeepAtom naturally performs better for middle-range complexes at testing time. This also suggests that diversifying the training samples is a promising way for DeepAtom to provide more accurate and reliable predictions.

We also compare DeepAtom with RF-Score and Pafnucy on the independent Astex Diverse Set. Table III shows that DeepAtom again significantly outperforms the others on all measurements. The predictions of one DeepAtom model are illustrated in Fig. 3f.

TABLE III: Results on the Astex Diverse Set with "training set-2".

Method     RMSE            MAE             SD              R
DeepAtom
RF-Score   1.144 (0.006)   0.891 (0.010)   1.103 (0.004)   0.710 (0.003)
Pafnucy    1.374 (0.057)   1.110 (0.065)   1.288 (0.039)   0.569 (0.040)
D. Evaluation of the Proposed Benchmark Dataset
Comparison of Tables I and II reveals that training on our proposed benchmark dataset significantly improves model performance, especially for the deep-learning-based approaches; for example, DeepAtom's R increases and its RMSE decreases. The difference between the two tables confirms the effectiveness of our proposed benchmark: the model trained on the new dataset provides more accurate predictions. Although the dataset contains some complexes with low-resolution structures, such lower-quality data does not introduce obvious label noise. On the contrary, the extended dataset provides more reliable complex data, which effectively improves the generalization power of binding affinity prediction algorithms.

It is worth noting that our proposed benchmark dataset extends the standard refined set by including complexes from the PDBbind general subset and the Binding MOAD database only when the experimental affinity data is Kd, Ka, or Ki. While Kd and Ki, as equilibrium constants, may be compared even when derived from different binding assays, the dependence of IC50 values on experimental settings discourages their comparison across assays [18].

Fig. 4: Comparison of scoring methods on the PDBbind v.2016 core set. (a) Comparison of R. (b) Comparison of RMSE.

TABLE IV: Validation performance with various feature/atom types.

Num of Features   Resolution   Occupancy   RMSE    R
11                1.0          Binary      1.485   0.706
24                1.0          Binary      1.360   0.737
60                1.0          Binary      1.359   0.737

E. Hyper-parameter Optimization
Although DeepAtom is an end-to-end data-driven approach for binding affinity prediction, some hyper-parameters are inevitably introduced, especially in the input featurization process. To finalize the optimal data representation of the protein-ligand complex for DeepAtom, we run systematic optimization experiments over the related hyper-parameters. In all of the following comparison experiments, our deep learning models are trained on "training set-1" with 3390 complexes and evaluated on the corresponding validation set with 377 complexes; prediction performance is measured by Pearson's R and RMSE.
1) Feature/Atom Types:
We consider three different descriptors, with 11, 24, and 60 features. The first descriptor characterizes both the protein and ligand atoms with the same 11 Arpeggio atom types. The second descriptor is the one described in Section II-A3. The third descriptor includes 40 ANOLEA atom types to describe protein atoms and 9 element types to describe ligand atoms, in addition to the 11 Arpeggio atom types; the ANOLEA atom types describe each protein atom based on its bond connectivity, chemical nature, and whether it belongs to the side chain or backbone of the amino acid [21], [22]. For the remaining controlled variables, we use the simple binary scheme for the occupancy type and set the resolution of the 3D grid box to 1.0 Å. From Table IV, we observe that treating protein and ligand atoms the same (the first descriptor) yields lower performance. Training the models on the PDBbind dataset requires extracting the structures of the free protein and free ligand from the complex, under the assumption that the conformational change upon ligand binding is negligible; binding affinity prediction therefore relies on the inter-molecular interactions between protein and ligand atoms, while the intra-molecular energies cancel out.

TABLE V: Validation performance with different resolutions.

Num of Features   Resolution   Occupancy   RMSE    R
24                0.5          Binary      1.357   0.739
24                1.0          Binary      1.360   0.737

TABLE VI: Validation performance with different occupancy types.

Num of Features   Resolution   Occupancy   RMSE    R
24                1.0          Binary      1.360   0.737
24                1.0          PCMax       1.348   0.741
2) Resolution:
High-resolution rasterized data can adequately capture fine-grained features and changes in local spatial regions, but it causes excessive memory usage and heavy computational cost; there is thus a trade-off between prediction performance and computational efficiency. Based on our analysis, we evaluate resolutions of 1.0 Å and 0.5 Å, both smaller than the smallest r_vdw value of about 1.4 Å among the 9 major heavy atoms. Table V shows that DeepAtom's prediction performance improves as the resolution becomes finer. However, this slight improvement comes with a large increase in computational cost, especially when a more demanding occupancy strategy such as PCMax is later used; we therefore select 1.0 Å as the optimal resolution.
3) Occupancy Type:
Occupancy type describes how each atom impacts its surrounding environment. Several different strategies have been proposed, such as binary, Gaussian [24] and PCMax [13]. Binary occupancy discretizes the atom's impact over a voxel: if the distance between an atom and a voxel center is shorter than the atom's van der Waals radius, the corresponding voxel channel is activated as 1, and otherwise deactivated as 0. In contrast, the Gaussian and PCMax approaches represent an atom's impact with a continuous numerical value, which carries richer information, and the impact decays smoothly as the distance increases. We compare the binary and PCMax occupancy types, on the basis of the optimal 24 feature/atom types and 1.0 Å grid resolution. Table VI shows that DeepAtom with the PCMax occupancy type achieves better performance. Considering the similarity between the Gaussian and PCMax algorithms, we expect them to yield comparable results.
4) Averaging at Testing:
As an effective strategy, data augmentation is also used to improve DeepAtom's performance at test time. In addition to augmenting data for training, we run the trained model on multiple augmented versions of the test data and average the results to reduce the prediction variance. We evaluate multiple test-data versions, namely 1, 12 and 24, where the value 1 means only the original test set is used without the averaging operation. We observe that increasing the number of test-set versions favorably reduces the variance of the predictions and further improves performance.
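Prediction averaging over the augmented copies is a one-liner at inference time; a minimal sketch, assuming the model outputs one affinity per input grid:

```python
import torch

def predict_with_averaging(model, augmented_grids):
    """Average model outputs over several augmented versions of one test
    complex; `augmented_grids` has shape (n_versions, C, D, H, W), with
    n_versions in {1, 12, 24} as evaluated above."""
    model.eval()
    with torch.no_grad():
        return model(augmented_grids).mean().item()  # one stabilized estimate
```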
IV. CONCLUSION
In this paper, we proposed the framework DeepAtom to accurately predict protein-ligand binding affinity. An efficient 3D-CNN architecture is proposed to effectively improve the model's learning capacity with the limited complex data available for training. In a purely data-driven manner, without a priori functional form assumptions, DeepAtom significantly outperforms state-of-the-art deep learning, machine learning and conventional scoring techniques. We also proposed a new benchmark dataset to further improve model performance. The promising results on independent, challenging datasets demonstrate that DeepAtom can potentially be adopted in computational drug development protocols such as molecular docking and virtual screening.

V. ACKNOWLEDGMENTS
This work is partially supported by the National Science Foundation (CNS-1842407) and the National Institutes of Health (R01GM110240). We thank Yaxia Yuan for helpful discussions and comments that improved the manuscript.
REFERENCES

[1] A. Ahmed, R. D. Smith, J. J. Clark, J. B. J. Dunbar, and H. A. Carlson. Recent improvements to Binding MOAD: a resource for protein-ligand binding affinities and structures. Nucleic Acids Research, 43(D1):D465–D469, 2015.
[2] P. J. Ballester, A. Schreyer, and T. L. Blundell. Does a more precise chemical description of protein-ligand complexes lead to more accurate prediction of binding affinity? Journal of Chemical Information and Modeling, 54(3):944–955, 2014.
[3] R. C. Braga, V. M. Alves, A. C. Silva, M. N. Nascimento, F. C. Silva, L. M. Liao, and C. H. Andrade. Virtual screening strategies in medicinal chemistry: the state of the art and current challenges. Current Topics in Medicinal Chemistry, 14(16):1899–1912, 2014.
[4] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Computer Vision and Pattern Recognition, 2017.
[5] L. Evanthia, S. George, V. D. K, and C. Zoe. Structure-based virtual screening for drug discovery: principles, applications and recent advances. Current Topics in Medicinal Chemistry, 14(16):1923–1938, 2014.
[6] J. Gomes, B. Ramsundar, E. N. Feinberg, and V. S. Pande. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv:1703.10603, 2017.
[7] M. J. Hartshorn, M. L. Verdonk, G. Chessari, S. C. Brewerton, W. T. Mooij, P. N. Mortenson, and C. W. Murray. Diverse, high-quality test set for the validation of protein-ligand docking performance. Journal of Medicinal Chemistry, 50(4):726–741, 2007.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
[9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[10] L. Hu, M. L. Benson, R. D. Smith, M. G. Lerner, and H. A. Carlson. Binding MOAD (Mother of All Databases). Proteins, 60(3):333–340, 2005.
[11] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions. In Computer Vision and Pattern Recognition, 2018.
[12] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, 2017.
[13] J. Jiménez, S. Doerr, G. Martínez-Rosell, A. Rose, and G. De Fabritiis. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, 33(19):3036–3042, 2017.
[14] J. Jiménez, M. Skalic, G. Martínez-Rosell, and G. De Fabritiis. KDEEP: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. Journal of Chemical Information and Modeling, 58(2):287–296, 2018.
[15] H. C. Jubb, A. P. Higueruelo, B. Ochoa-Montaño, W. R. Pitt, D. B. Ascher, and T. L. Blundell. Arpeggio: A web server for calculating and visualising interatomic interactions in protein structures. Journal of Molecular Biology, 429(3):365–371, 2017.
[16] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
[17] Y. Li, H. Kang, K. Ye, S. Yin, and X. Li. FoldingZero: Protein folding from scratch in hydrophobic-polar model. In Workshop on Deep Reinforcement Learning at NeurIPS, 2018.
[18] Y. Li, Z. Liu, J. Li, L. Han, J. Liu, Z. Zhao, and R. Wang. Comparative assessment of scoring functions on an updated benchmark: 1. Compilation of the test set. Journal of Chemical Information and Modeling, 54(6):1700–1716, 2014.
[19] J. Liu and R. Wang. Classification of current scoring functions. Journal of Chemical Information and Modeling, 55(3):475–482, 2015.
[20] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In European Conference on Computer Vision, 2018.
[21] F. Melo and E. Feytmans. Novel knowledge-based mean force potential at atomic level. Journal of Molecular Biology, 267(1):207–222, 1997.
[22] F. Melo and E. Feytmans. Assessing protein structures with a non-local atomic interaction energy. Journal of Molecular Biology, 277(5):1141–1152, 1998.
[23] G. M. Morris, R. Huey, W. Lindstrom, M. F. Sanner, R. K. Belew, D. S. Goodsell, and A. J. Olson. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of Computational Chemistry, 30(16):2785–2791, 2009.
[24] M. Ragoza, J. Hochuli, E. Idrobo, J. Sunseri, and D. R. Koes. Protein-ligand scoring with convolutional neural networks. Journal of Chemical Information and Modeling, 57(4):942–957, 2017.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Computer Vision and Pattern Recognition, 2018.
[26] L. Schietgat, T. Fannes, and J. Ramon. Predicting protein function and protein-ligand interaction with the 3D neighborhood kernel. In International Conference on Discovery Science. Springer, 2015.
[27] M. H. Segler, M. Preuss, and M. P. Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604, 2018.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[29] M. M. Stepniewska-Dziubinska, P. Zielenkiewicz, and P. Siedlecki. Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics, 34(21):3666–3674, 2018.
[30] L. Sundaram, H. Gao, S. R. Padigepati, J. F. McRae, Y. Li, J. A. Kosmicki, N. Fritzilas, J. Hakenberg, A. Dutta, J. Shon, et al. Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics, 50(8):1161, 2018.
[31] I. Wallach, M. Dzamba, and A. Heifets. AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv:1510.02855, 2015.
[32] R. Wang, X. Fang, Y. Lu, C.-Y. Yang, and S. Wang. The PDBbind database: Methodologies and updates. Journal of Medicinal Chemistry, 48:4111–4119, 2005.
[33] R. Wang, L. Lai, and S. Wang. Further development and validation of empirical scoring functions for structure-based binding affinity prediction. Journal of Computer-Aided Molecular Design, 16(1):11–26, 2002.
[34] R. Wang, Y. Lu, and S. Wang. Comparative evaluation of 11 scoring functions for molecular docking. Journal of Medicinal Chemistry, 46(12):2287–2303, 2003.
[35] B. M. Wingert and C. J. Camacho. Improving small molecule virtual screening strategies for the next generation of therapeutics. Current Opinion in Chemical Biology, 44:87–92, 2018.
[36] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Computer Vision and Pattern Recognition, 2018.