PIGNet: A physics-informed deep learning model toward generalized drug-target interaction predictions
Seokhyun Moon,‡ Wonho Zhung,‡ Soojung Yang,‡ Jaechang Lim, and Woo Youn Kim∗
Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
KI for Artificial Intelligence, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
∗E-mail: [email protected]
‡These authors contributed equally to this work.
Recently, deep neural network (DNN)-based drug-target interaction (DTI) models have been highlighted for their high accuracy and affordable computational costs. Yet, the models' insufficient generalization remains a challenging problem in the practice of in-silico drug discovery. We propose two key strategies to enhance generalization in the DTI model. The first is to integrate physical models into DNN models. Our model, PIGNet, predicts the atom-atom pairwise interactions via physics-informed equations parameterized with neural networks and provides the total binding affinity of a protein-ligand complex as their sum. We further improved the model generalization by augmenting the training data with a wider range of binding poses and ligands. PIGNet achieved a significant improvement in docking success rate, screening enhancement factor, and screening success rate by up to 2.01, 10.78, and 14.0 times, respectively, compared to the previous DNN models. The physics-informed model also enables the interpretation of predicted binding affinities by visualizing the energy contribution of ligand substructures, providing insights for ligand optimization. Finally, we devised an uncertainty estimator of our model's prediction to qualify the outcomes and reduce the false positive rates.

The accurate prediction of drug-target interaction (DTI) is one of the key steps in early-stage in-silico drug discovery. Experimental validation of all possible compounds would be ideal for finding novel drug candidates. However, this brute-force approach is practically infeasible across a huge chemical space, whose size is estimated to be astronomically large. On this account, various computational methods with affordable computational costs and reliable accuracy have been proposed for DTI predictions. Docking programs are among the most popular computational tools for this purpose and have been successfully adopted in several early-stage drug discovery projects.
The docking programs provide reasonable binding poses and binding affinities of protein-ligand complexes based on physical modeling combined with empirical parameters. Despite their practicability, the accuracy is still far from sufficient to find the small fraction of hit compounds among millions of candidates.

Recently, deep neural networks (DNNs) have been widely studied as a potential surrogate for improving the accuracy of DTI predictions. The main advantage of DNN-based DTI models is their ability to extract relevant features directly from raw data without handcrafted feature descriptors. Among various DNN-based DTI models, the structure-based approach stands out for its accuracy, since the spatial coordination of the protein and ligand is crucial in determining their interactions. Some of the promising approaches utilize a 3-dimensional convolutional neural network (3D CNN) or a graph neural network (GNN). Both approaches significantly improved the accuracy of DTI prediction compared to docking calculations.

Despite the advances of previous structure-based DNN models, their limited generalization ability remains a challenging problem. In particular, the deficiency of 3D structural data on protein-ligand complexes might drive the models to excessively memorize the features in training data. Such models, being over-fitted to the training data, might fail to generalize in a broader context. Several studies reported that DNN-based models often learn data-intrinsic bias instead of the desirable underlying physics of the protein-ligand interaction. For instance, Chen et al. reported an extremely high similarity in the performance of a receptor-ligand model and a ligand-only model, both trained with the DUD-E dataset, in terms of area under the ROC curve (AUC).
Such a similarity implies that the models might have learned to deduce the protein-ligand binding affinity by only looking at the ligand structures, whether or not the protein structures are included as inputs. Moreover, they showed that the memorization of wrong features can cause a severe degradation in performance for proteins that differ from those in the training data. In addition, it was also reported that 3D CNN and GNN models trained on the DUD-E dataset considerably underperformed on different datasets such as the ChEMBL and MUV datasets. Such insufficient generalization of DTI models can increase false positive rates in virtual screening scenarios, as the models would often fail to make correct predictions for unseen protein-ligand pairs.

In the field of physical applications of deep learning, the incorporation of the right physics as an inductive bias is a promising means to improve model generalization. Physical laws are universal, meaning that all of nature must follow them. Thus, if a model is trained to respect the underlying physics, its generalization to unseen data dictated by the same physics can be expected. Several studies have indeed shown that physics-informed models maintain their generality for unseen data.

In this regard, we propose two key strategies to enhance the generalization ability of DTI models. First, we introduce a novel physics-informed graph neural network, named PIGNet. It provides the binding affinity of a protein-ligand complex as a sum of atom-atom pairwise interactions, which are combinations of four energy components: van der Waals (VDW) interaction, hydrogen bond, metal-ligand interaction, and hydrophobic interaction. Each energy component is computed as the output of a physics model parameterized by deep neural networks, which learns the specific pattern of the interaction.
This strategy can increase the generalization ability by allowing the model to dissect an unseen protein-ligand pair into combinations of commonly observed interactions between the protein and the ligand. The detailed pattern of local interactions can lead the model to learn the universal physics underlying protein-ligand binding. Moreover, as the model provides predictions for each atom-atom pair and each energy component, it is possible to analyze the contribution of individual molecular substructures to the binding affinity. This information can be used to modify drug candidates to further strengthen the binding affinity.

Second, we introduce a data augmentation strategy that aims to further improve model generalization. The experimental data on protein-ligand binding structures is far from sufficient to cover the structural diversity of all possible binding complexes. This limitation in the dataset would make the model fall short in practice. In virtual screening scenarios, typical screening libraries include a wide range of chemical moieties, most of which the model has never seen. Also, models trained only with the optimal binding structures provided by experimental data may fail to distinguish stable from non-stable binding poses in pose prediction. Therefore, we add a large quantity of computationally generated poses and ligands to the dataset. Taking one step closer to the practice of virtual screening, we devise an uncertainty estimator for PIGNet to improve the prediction reliability in this data-deficient situation. We show that one can qualify the outcomes to filter out the most uncertain predictions. This first introduction of uncertainty quantification in structure-based DTI models would benefit practitioners by lightening their massive experimental burden.

To evaluate the generalization ability of the proposed model, we use the CASF2016 benchmark.
Previously, the evaluation of DTI models has mostly focused on the correlation between predicted and experimental binding affinities. However, a high correlation does not automatically guarantee good model generalization. A well-generalized model should successfully identify the true binding pose as the energy minimum and correctly rank the best binding molecules; the former criterion is related to docking power, and the latter is related to screening power. Examining both tasks is essential to make sure that the model is well-generalized and thus can be utilized in practice, such as in virtual high-throughput screening (vHTS). We compared the performance of PIGNet with other docking programs and previous deep learning models in terms of various metrics, including the docking power and screening power of the CASF2016 benchmark. PIGNet significantly improved the docking power and, concomitantly, the screening power compared to the previous deep learning models. In addition to the model performance evaluation, we demonstrate the interpretability of our model. Since explaining the underlying chemistry of a DTI prediction is pharmacologically important, the lack of interpretability in deep learning models has been a crucial drawback. Physics-based deep learning models can offer interpretability by providing intermediate values that carry physical meaning. As our model predicts the interaction energy for each atom-atom pair, we can assess the contribution of each ligand substructure to the total binding free energy. Taking this information into account, practitioners would be able to substitute less contributing moieties, opening up the possibility of further ligand optimization.

Methods

Related works

Summary of previous works: 3D CNN and GNN
The 3D CNN takes a 3D rectangular grid that represents the positions of atoms at the binding site as an input.
The proposed 3D CNN models outperformed docking programs on the DUD-E and PDBbind datasets in terms of the Pearson correlation coefficient and AUC. Nevertheless, the high dimensionality of 3D rectangular representations and the lack of explicit chemical interactions may limit 3D CNN models. One of the promising alternatives is a GNN, which takes structural information in the form of molecular graphs. In a molecular graph, each atom corresponds to a node and each chemical interaction to an edge. Such an efficient representation of chemical interactions contributes to the better performance of GNNs in DTI predictions compared to docking programs and 3D CNN models.

Physics-informed neural networks

Greydanus et al. proposed a Hamiltonian neural network to solve problems that follow Hamiltonian mechanics. They used deep neural networks to predict parameters in the Hamiltonian equation and showed better generalization than regular neural networks. Pun et al. proposed a physics-informed neural network potential for atomistic modeling. The model predicts the parameters of the interatomic potentials separately instead of directly predicting the total energy of the system, improving the model generalization for simulations outside the bonding region. With neural networks, we parameterize equations that are derived from the physics of chemical interactions.

Overview of our model
PIGNet predicts the binding free energy of any given protein-ligand complex. Taking structural information as input, our model generates a neural representation for each atom-atom pair interaction through a series of GNN layers. The pairwise representation is fed to neural networks to produce parameters for each of the four physics-informed parameterized equations. The outputs of the equations correspond to the VDW interaction, hydrogen bond interaction, metal-ligand interaction, and hydrophobic interaction, respectively. The binding affinity of each atom-atom pair is computed as the sum of the four energy components, and the total binding affinity of a protein-ligand complex is obtained by adding up the binding affinities of all atom-atom pairs. The overall model architecture is described in Figure 1.
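To make this pipeline concrete, the sketch below strings the pieces together numerically: AutoDock Vina-style pairwise terms (detailed in the Methods below) are summed over atom pairs and divided by the rotor penalty. All learned quantities (the corrected radii sum `d_opt`, well depths, interaction strengths) are fixed placeholder numbers here, not PIGNet's actual parameters, and the metal-ligand term is omitted for brevity.

```python
def lj_vdw(d, d_opt, c_min):
    # 12-6 Lennard-Jones term of one atom pair; d_opt plays the role of
    # the corrected vdW-radii sum and c_min the learned well depth.
    r = d_opt / d
    return c_min * (r ** 12 - 2.0 * r ** 6)

def piecewise_linear(d, d_opt, c1, c2, w):
    # Piecewise-linear term used for hydrogen-bond, metal-ligand, and
    # hydrophobic interactions (w: learnable interaction strength).
    x = d - d_opt
    if x < c1:
        return w
    if x < c2:
        return w * (x - c2) / (c1 - c2)
    return 0.0

def total_binding_affinity(pair_distances, d_opt, n_rotor, c_rotor=0.05):
    # Sum the energy components over all atom-atom pairs, then divide by
    # the rotor (entropy) penalty T_rotor = 1 + C_rotor * N_rotor.
    total = 0.0
    for d in pair_distances:
        e_vdw = lj_vdw(d, d_opt, c_min=0.2)
        e_hbond = piecewise_linear(d, d_opt, -0.7, 0.0, w=-0.5)
        e_hydrophobic = piecewise_linear(d, d_opt, 0.5, 1.5, w=-0.1)
        total += e_vdw + e_hbond + e_hydrophobic  # metal term omitted
    return total / (1.0 + c_rotor * n_rotor)
```

Note how a pair at its optimal distance contributes the full vdW well depth, while the rotor penalty scales the final affinity down for flexible ligands.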
Model Architecture
Our model takes as input a molecular graph, G, and the distances between atom pairs, d_ij, of a protein-ligand complex. Generally, a graph G can be defined as (H, A), where H is a set of node features and A is an adjacency matrix. In an attributed graph, the i-th node feature, h_i, is represented by a vector. An initial node feature encodes the atom type, the degree of the atom, the number of attached hydrogen atoms, the number of implicit valence electrons, and aromaticity as a one-hot vector. Notably, our graph representation includes two adjacency matrices to discriminate the covalent bonds within each molecule from the intermolecular interactions between protein and ligand atoms. The details of the explicit node features and the construction of the two adjacency matrices are explained in the Supplementary Information.

Figure 1: Our model architecture.
A protein-ligand complex is represented as a graph, and adjacency matrices are assigned from the binding structure of the complex. Each node feature is updated through neural networks to carry the information of covalent bonds and intermolecular interactions. Given the distance and final node features of each atom pair, four energy components are calculated from the physics-informed parameterized equations. The total binding affinity is obtained as a sum of pairwise binding affinities, each of which is a sum of the four energy components divided by an entropy term.

Our model consists of several units of a gated graph attention network (gated GAT) and an interaction network. The gated GAT and the interaction network update each node feature via the two adjacency matrices that correspond to covalent bonds and intermolecular interactions. During the node feature updates, the gated GAT and the interaction network learn to convey the information of covalent bonds and intermolecular interactions, respectively. After several node feature updates, we calculate the VDW interactions (E_vdw), hydrogen bond interactions (E_hbond), metal-ligand interactions (E_metal), and hydrophobic interactions (E_hydrophobic) by feeding the final node features into the physics-informed parameterized equations. Specifically, for each energy component, fully connected layers take a set of final node features as input and produce the parametric values of the physics-informed equation. We also consider the entropy loss from protein-ligand binding by dividing the total energy by a rotor penalty (T_rotor). The total energy can be written as follows:

$$E_{total} = \frac{E_{vdw} + E_{hbond} + E_{metal} + E_{hydrophobic}}{T_{rotor}}. \quad (1)$$

Gated graph attention network (gated GAT)
The gated GAT updates the set of node features with respect to the adjacency matrix for covalent bonds. The attention mechanism puts different weights on the neighboring nodes according to their importance. The attention coefficient, which reflects the importance of a node, is calculated from the two nodes connected by a covalent bond and then normalized across the neighboring nodes. The purpose of the gate mechanism is to effectively deliver the information from the previous node features to the next node features. The extent of the contribution from the previous nodes is determined by a coefficient obtained from the previous and new node features. We describe the details of the gated GAT in the Supplementary Information.
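As a rough illustration of the gate mechanism described above, the sketch below blends a node's previous feature with its newly aggregated one using a coefficient computed from both. The scalar features and fixed weights are toy placeholders for PIGNet's learned vectors and parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(h_prev, h_new, w_prev=0.5, w_new=0.5, b=0.0):
    # The gate coefficient z, computed from the previous and the newly
    # aggregated feature, decides how much of each to keep. The weights
    # are fixed placeholders here; in PIGNet they are learned.
    z = sigmoid(w_prev * h_prev + w_new * h_new + b)
    return z * h_prev + (1.0 - z) * h_new
```

Because z lies in (0, 1), the updated feature is always a convex combination of the old and new values, which keeps information from earlier layers from being overwritten entirely.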
Interaction network
The interaction network takes the updated set of node features from the gated GAT, along with the adjacency matrix, to generate the next set of node features. Unlike the gated GAT, the interaction network adopts the adjacency matrix featuring intermolecular interactions. The interaction network produces two different sets of embedded node features by multiplying the previous set of node features with two different learnable weights. Next, we apply max pooling to each set of embedded node features, obtaining two sets of interaction-embedded node features. The interaction-embedded node features are then added to the embedded node features to generate the new node features. The final node features are obtained as a linear combination of the new and previous node features, where the linear combination is performed with a gated recurrent unit (GRU). We describe the details of the interaction network in the Supplementary Information.
PIGNet consists of four energy components (VDW interaction, hydrophobic interaction, hydrogen bonding, and metal-ligand interaction) and a rotor penalty. The energy component of an interaction between the i-th node and the j-th node is computed from the two node features, h_i and h_j. Since the node features contain the information of the two atoms and their interaction, the model can reasonably predict DTI. The energy components and the rotor penalty are motivated by the empirical functions of AutoDock Vina. The total binding affinity is obtained as a weighted sum of energy components, where the weights are introduced to account for the difference between the calculated energies and the true binding free energies. PIGNet employs learnable parameters to find an optimal weight for each component, learning to account for the different types of protein-ligand interactions.

Each energy component is calculated from d_ij and d'_ij, which are the inter-atomic distance and the corrected sum of the VDW radii of the i-th node and the j-th node, respectively. d'_ij can be represented as follows:

$$d'_{ij} = r_i + r_j + c \cdot b_{ij}, \quad (2)$$

where r_i and r_j are the VDW radii of the i-th node and the j-th node, respectively, and b_ij is a correction term between the two nodes. b_ij is produced by a fully connected layer that accepts the two node features h_i and h_j as inputs. We used 0.2 for the constant c to scale the correction term.

van der Waals (VDW) interaction

We used the 12-6 Lennard-Jones potential to calculate the VDW interaction term, E_vdw. We considered all protein and ligand atom pairs except for metal atoms, whose VDW radii vary greatly depending on the atom type. The total VDW energy is obtained as a sum over all atom-atom pairwise VDW energy contributions.
E_vdw can be described as follows:

$$E_{vdw} = \sum_{i,j} c_{ij} \left[ \left( \frac{d'_{ij}}{d_{ij}} \right)^{12} - 2 \left( \frac{d'_{ij}}{d_{ij}} \right)^{6} \right], \quad (3)$$

where c_ij, predicted from a fully connected layer, indicates the minimum VDW interaction energy and renders each estimated energy component similar to the true energy component, in order to reflect the physical reality.

Hydrogen bond, metal-ligand interaction, and hydrophobic interaction
The pairwise energy contribution coefficients, e_ij, of the hydrogen bond (E_hbond), metal-ligand interaction (E_metal), and hydrophobic interaction (E_hydrophobic) share the same expression, shown in equation (4), with different constants c_1 and c_2 and a learnable scalar variable w:

$$e_{ij} = \begin{cases} w & \text{if } d_{ij} - d'_{ij} < c_1 \\ w \cdot \dfrac{d_{ij} - d'_{ij} - c_2}{c_1 - c_2} & \text{if } c_1 < d_{ij} - d'_{ij} < c_2 \\ 0 & \text{if } d_{ij} - d'_{ij} > c_2 \end{cases} \quad (4)$$

Here, c_1 and c_2 are set to -0.7 and 0.0 for hydrogen bonds and metal-ligand interactions, while they are set to 0.5 and 1.5 for the hydrophobic interaction. We chose the same values of c_1 and c_2 for hydrogen bonds and metal-ligand interactions, since both originate from electron donor-acceptor interactions. The total energy component is computed as a summation of all atom-atom pairwise energy contribution coefficients, as described in equation (5):

$$E = \sum_{i,j} e_{ij} \quad (5)$$

We classified atoms into hydrogen bond acceptors, hydrogen bond donors, metal atoms, and hydrophobic atoms. Since hydrogen bonds form between hydrogen bond donors and hydrogen bond acceptors, the atoms that can form hydrogen bonds are selected by substructure matching of general SMARTS patterns, which are summarized in Supplementary Table S2. Metal atoms include Zn, Mn, Co, Mg, Ni, Fe, Ca, and Cu. Lastly, halogen atoms and carbon atoms surrounded only by carbon or hydrogen atoms are classified as hydrophobic atoms.

Rotor penalty
The rotor penalty term, T_rotor, is intended to account for the loss of entropy as the binding pocket interrupts the free rotation of chemical bonds during protein-ligand binding. We assume that the entropy loss is proportional to the number of rotatable bonds of the ligand molecule. T_rotor can be described as follows:

$$T_{rotor} = 1 + C_{rotor} \times N_{rotor}, \quad (6)$$

where N_rotor is the number of rotatable bonds and C_rotor is a positive learnable scalar variable. We used the RDKit software to calculate N_rotor.

Loss functions
The loss function of PIGNet consists of three components, L_energy, L_derivative, and L_augmentation, as in equation (7). Figure 2 explains the overall training scheme of PIGNet based on the three loss functions.

$$L_{total} = L_{energy} + L_{derivative} + L_{augmentation} \quad (7)$$

L_energy is the mean squared error (MSE) loss between the predicted value from the model, y_pred, and the corresponding experimental binding free energy, y_true:

$$L_{energy} = \frac{1}{N_{train}} \sum_i (y_{pred,i} - y_{true,i})^2, \quad (8)$$

where N_train is the number of training data. Minimizing L_energy enables the model to correctly predict the binding affinity of experimental 3D structures.

Figure 2: The training scheme of PIGNet.
We use three types of data in model training: true binding complexes, true-binder ligand-protein pairs in computer-generated binding poses, and non-binding decoy complexes. PIGNet predicts the binding free energy for each input. For a true binding complex, the model learns to predict its true binding energy. The model also learns to predict the energy of a computer-generated binding pose or a non-binding decoy complex at a higher value than the true binding energy or a threshold energy, respectively. Finally, PIGNet learns the proper correlation between ligand atom positions and binding affinity by minimizing the derivative loss.

L_derivative is composed of the first and second derivatives of the energy with respect to the atomic positions. Minimizing L_derivative helps the model sensitively identify relatively stable poses. L_augmentation is the loss related to the data augmentation.

Derivative loss
The shape of the potential energy curve between protein and ligand atoms has a huge impact on distinguishing stable binding poses. The ligand atoms are located at the local minimum of the potential curve when the ligand binding is stable. Also, a potential energy curve of proper sharpness makes it easier to distinguish stable conformers from the others, as a small change in atomic positions would then induce a large energy deviation. Since a model trained with respect to L_energy alone does not control the shape of the potential energy curve, it would be hard to distinguish whether or not a ligand is at a stable position. Accordingly, we guide the model with the derivative loss, L_derivative, to learn the proper shape of the pairwise potential energy curve, in particular its width and the position of its energy minimum.

We can assume that the ligand atoms are located at a local minimum of the potential for the experimentally validated binding structures. Thus, we make the experimental structures a local minimum by forcing the first derivative of the potential energy with respect to position to become zero. The sharpness of the potential energy curve is induced by increasing the second derivative. The derivative loss, L_derivative, is given as follows:

$$L_{derivative} = \sum_i \left( \frac{\partial E_{total}}{\partial q_i} \right)^2 - \min \left( \frac{\partial^2 E_{total}}{\partial q_i^2}, \ C_{der} \right), \quad (9)$$

where q_i is the position of the i-th ligand atom. An excessively sharp potential energy curve may cause a problem in energy prediction, as a small change in ligand atom positions would induce an immense energy deviation. Therefore, we cap the second-derivative term at C_der, which is 20.0 in our model.

Data augmentation loss
Here, we constructed three different data augmentations and the corresponding loss functions: docking augmentation, random screening augmentation, and cross screening augmentation. Smina (scoring and minimization with AutoDock Vina) was used to prepare the computer-generated decoy structures.

• Docking augmentation

The purpose of the docking augmentation is to improve the model's ability to distinguish the most stable binding poses from the others. We assume the experimental binding structures from the PDBbind dataset to be the most stable binding poses. Thus, the energy of an experimental structure should be lower than the predicted energy of decoy structures whose poses differ from the true binding pose. The loss for the docking augmentation, L_docking, can be written as follows:

$$L_{docking} = \sum_i \max(y_{exp,i} - y_{decoy,i}, \ 0), \quad (10)$$

where y_exp is the energy of an experimental structure and y_decoy is the predicted energy of a decoy structure. By minimizing L_docking, the model learns to predict y_decoy to be larger than y_exp.

• Random screening augmentation

In general, only a small fraction of molecules in the huge chemical space can bind to a specific target protein. Most molecules would have a low binding affinity and a high dissociation constant, K_d, with the target. From this observation, we assume that the dissociation constant of an arbitrary protein-ligand pair from a virtual screening library would be higher than 10^-6 M, as a criterion for hit identification is conventionally on the micromolar (10^-6 M) scale. Referring to the relationship between the binding free energy, ΔG, and the binding constant, K_a, which is the reciprocal of K_d, we can set a threshold for ΔG of a protein-ligand pair as follows:

$$\Delta G \geq -1.36 \log K_a = -8.18 \ \text{kcal/mol}. \quad (11)$$

A model trained with the random screening loss, L_random screening, on non-binding random molecule-protein pairs can sufficiently learn the chemical diversity. The model would learn to predict the binding free energy of a random molecule with the target to be higher than the threshold energy, -8.18 kcal/mol. Thus, the loss for the random screening augmentation, L_random screening, can be written as follows:

$$L_{random\ screening} = \sum_i \max(-y_{random,i} - 8.18, \ 0), \quad (12)$$

where y_random is the predicted energy of synthetic compounds from the IBS molecule library. The inaccuracy of the docking program is not problematic for this augmentation, as the binding energies of wrong binding poses are typically higher than the true binding energy.

• Cross screening augmentation

Another property of protein-ligand binding is that if a ligand strongly binds to a specific target, the ligand is less likely to bind to other targets, because different types of proteins have different binding pockets. We assume that the true binders of the PDBbind dataset do not bind to the other proteins in the PDBbind dataset. As in the random screening augmentation, training with non-binding ligand-protein pairs helps the model learn the chemical diversity. The loss for the cross screening augmentation, L_cross screening, can be written as follows:

$$L_{cross\ screening} = \sum_i \max(-y_{cross,i} - 8.18, \ 0), \quad (13)$$

where y_cross is the predicted energy of the cross binder. The same threshold for the binding free energy as in the random screening augmentation is used here.

Total loss function
The total loss, L_total, is the weighted sum of all the loss terms: L_energy, L_derivative, L_docking, L_random screening, and L_cross screening. The total loss can be written as follows:

$$L_{total} = L_{energy} + c_{derivative} L_{derivative} + c_{docking} L_{docking} + c_{random\ screening} L_{random\ screening} + c_{cross\ screening} L_{cross\ screening}, \quad (14)$$

where c_derivative, c_docking, c_random screening, and c_cross screening are hyperparameters, set to 10.0, 10.0, 5.0, and 5.0, respectively.

Dataset
Our primary training set is the PDBbind v.2019 refined set, which provides experimental protein-ligand X-ray crystal binding structures and the corresponding binding affinities. We used 4,212 samples for the training set and 259 samples for the test set. For the docking augmentation, we generated 202,035 decoy structures using the PDBbind v.2016 dataset. For the random screening augmentation and the cross screening augmentation, we generated 773,623 complexes using the IBS molecules and 386,876 complexes based on random cross binding, respectively. None of the complexes in the augmented data include proteins or ligands from the test set. See our GitHub repository for the data preprocessing code.
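The training objective defined in equations (8), (10), (12), and (13) and applied to these datasets reduces to a few lines of code. A sketch with list inputs (function names are ours; the -8.18 kcal/mol threshold follows the text):

```python
def energy_loss(y_pred, y_true):
    # Eq. (8): mean squared error against experimental affinities.
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def docking_loss(y_exp, y_decoy):
    # Eq. (10): a decoy pose should not be predicted to bind more
    # strongly (more negative energy) than the experimental pose.
    return sum(max(e - d, 0.0) for e, d in zip(y_exp, y_decoy))

def screening_loss(y_pred, threshold=-8.18):
    # Eqs. (12)-(13): non-binders should sit above the affinity
    # threshold; predictions below it are penalised linearly.
    return sum(max(threshold - y, 0.0) for y in y_pred)
```

Both augmentation losses are one-sided (hinge-style): they push predictions in the correct direction but apply no penalty once the ordering or threshold constraint is satisfied.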
Benchmark method
Benchmark datasets
Our primary test set is the CASF2016 benchmark dataset, which provides four metrics to evaluate a DTI prediction model: scoring power, ranking power, docking power, and screening power. The scoring power is defined as the linear correlation between the predicted binding affinity and experimental binding data, measured in terms of the Pearson correlation coefficient, R.
The ranking power is the ability of a model to correctly rank the binding affinities of the true binders for a given target protein, given the binders' precise binding poses. The metric for the ranking power is the Spearman rank correlation coefficient, ρ. The docking power is the ability of a model to find the native ligand binding pose among decoys with computer-generated poses. We report the overall success rate of the docking benchmark.
The screening power is the ability of a model to identify the true binding ligands for a given target protein among a set of random molecules. We measure the screening power in terms of the enhancement factor (EF) and the success rate. The details of the CASF2016 benchmark metrics are further explained in the Supplementary Information. With these four metrics, the CASF2016 benchmark dataset enables a detailed evaluation of a DTI prediction model. Along with the CASF2016, we computed the Pearson correlation coefficient R for CSAR NRC-HiQ (2010) sets 1 and 2 to evaluate the scoring power of the model on external datasets.

Benchmark criteria
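For reference, the scoring correlation and a generic enhancement factor can be computed as below. This is a simplified, dataset-level sketch with our own function names; CASF2016's official EF protocol is evaluated per target and then averaged.

```python
def pearson_r(x, y):
    # Scoring power: linear correlation between predicted and
    # experimental affinities.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def enhancement_factor(scores, labels, top_frac=0.01):
    # Screening power: ratio of the true-binder rate among the
    # top-scoring fraction to the overall true-binder rate. Lower
    # (more negative) predicted energy means a stronger binder.
    n_top = max(1, int(len(scores) * top_frac))
    ranked = sorted(zip(scores, labels))   # most negative first
    hits_top = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    return (hits_top / n_top) / (total_hits / len(labels))
```

An EF of 1 means the model ranks no better than random selection; the larger the EF, the more true binders are concentrated at the top of the ranked list.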
In vHTS schemes, a model should be able to identify the true binding pose among others and correctly rank the complexes by their binding affinities. The former relates to the docking power, and the latter relates to the ranking and screening powers. Indeed, the docking, ranking, and screening powers depend on the model's ability to predict the correct binding affinity of a given complex, which is what the scoring power typically estimates. However, an experimental analysis of the CASF2016 benchmark shows that the screening and docking powers are not necessarily correlated with the scoring power. We attribute this inconsistency to a limitation of the CASF2016 scoring power benchmark: its test dataset contains only experimentally obtained protein-ligand complex structures and binding affinities, whereas the docking and screening power benchmarks include a much wider range of complexes. In this regard, the docking and screening powers can be indicators of model over-fitting, and the scoring power by itself cannot be the sole criterion for evaluating a DTI model's performance. Accordingly, in contrast to previous studies that mostly focus on the scoring power alone, we highlight the models' screening and docking powers in the discussion.

Compared models
We compare our model with some of the most popular docking programs for DTI prediction: X-Score, AutoDock Vina, ChemPLP@GOLD, and GlideScore. We also constructed two DNN models based on 3D CNN and GNN architectures to compare with PIGNet; these are the two major approaches for structure-based DTI prediction. The 3D CNN models are characterized by a voxelized 3D grid representation of the protein-ligand complex structure and 3D convolutional networks, while the GNN models take a graph representation, often including atom-atom pairwise distances, and utilize graph convolutional networks. For the 3D CNN-based model, we reimplemented the KDEEP model of Jiménez et al. Our rebuilt 3D CNN-based model is identical to KDEEP except for the atom feature construction, where we replaced the atom features with those of PIGNet. We also constructed a GNN-based model identical to PIGNet except for the physics modeling part; this model produces final outputs via fully connected layers instead of the parameterized physical equations used in PIGNet.
Results and Discussion

Assessment of the model performance and generalization ability
Table 1 summarizes the performance of our model and the benchmark models for the CASF2016 and CSAR benchmarks. In the CASF2016 benchmark, our model outperformed all the others, including deep learning models and conventional scoring functions such as X-Score, AutoDock Vina, ChemPLP@GOLD, and GlideScore. In terms of scoring R, ranking ρ, docking success rate, and screening EF and success rate, our model's performance was up to 1.67, 1.09, 2.01, 10.78, and 14.03 times that of the 3D CNN-based and GNN-based models, respectively. Our model also shows competitive performance on the CSAR NRC-HiQ 1 and 2 benchmarks.

As expected, the 3D CNN-based and GNN-based models show high accuracy in predicting binding affinities for the CASF2016 scoring power, ranking power, and CSAR 1 and 2 - the benchmark sets consisting of experimentally obtained samples. However, those models fail to achieve high docking power and screening power. We attribute such low docking power to model over-fitting on the true-binding complex structures and binding affinities: the models produced inaccurate binding affinities for the computer-generated decoy structures, which are primarily queried in the docking and screening power tests. The low docking power then leads to low screening power, as the most stable binding conformer needs to be identified in order to find the true binder. From these observations, we suspect that the performance reports of previously introduced deep DTI models have been overoptimistic. In contrast, PIGNet consistently shows high performance across the four CASF2016 metrics and the external CSAR benchmarks. These results imply that our model is properly fitted to the training data and has learned the proper features - the underlying physics of protein-ligand binding patterns. Moreover, the results remind us that the scoring power cannot be a single criterion for measuring model performance.
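To make the screening metric concrete, the top-1% enrichment factor (EF) is the hit rate among the best-scored 1% of a compound library divided by the hit rate over the whole library. A minimal sketch on synthetic data (the scores and labels below are illustrative, not benchmark values; lower score means a more favorable predicted binding energy):

```python
def enrichment_factor(scores, is_binder, top_frac=0.01):
    """Top-fraction enrichment factor: hit rate among the best-scored
    compounds divided by the hit rate over the whole library."""
    ranked = sorted(zip(scores, is_binder), key=lambda pair: pair[0])
    n_top = max(1, int(len(ranked) * top_frac))
    hit_rate_top = sum(label for _, label in ranked[:n_top]) / n_top
    hit_rate_all = sum(is_binder) / len(is_binder)
    return hit_rate_top / hit_rate_all

# toy library: 1000 compounds, 10 true binders ranked best by the score
scores = [-10.0] * 10 + [-5.0] * 990
labels = [1] * 10 + [0] * 990
print(enrichment_factor(scores, labels))  # 100.0, the maximum for a 1% binder rate
```

A random scorer would give EF close to 1, so values far above 1 in the top-ranked fraction indicate useful screening power.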
In the following section, we analyze how much each of our strategies contributed to this result through ablation studies.

Ablation study on the generalization ability of the model
We attribute the improvement of our model in generalization to the two major strategies we utilized: the physics-informed inductive bias on the model and the data augmentation strategy designed for a DTI model. In this section, we carried out an ablation study to decouple the effects of the two strategies. Table 2 summarizes the results. We did not include a physics-informed version of the 3D CNN-based model, as it is not straightforward to impose the parameterized equations on the 3D CNN architecture. Specifically, 3D CNN models do not produce explicit atom-atom pairwise representations, which are required for the parameterized equations.

Effect of the physics-informed inductive bias on the models
We can observe the effect of the physics-informed model by comparing the performances of the GNN-based models and PIGNet, as a GNN-based model is identical to PIGNet except for the parametric equations. As expected, the effect was not critical for the CASF2016 scoring and ranking powers. However, the employment of the physics-informed model resulted in a significant increase in docking and screening powers. We conclude that the incorporation of the parametric equations clearly contributed to enhanced model generalization. The incorporation of the equations may impose an excessive inductive bias on the model, increasing the possibility of under-fitting; however, the comparable scoring powers of PIGNet and the GNN-based models render this possibility unlikely. In addition, we believe advances in physical modeling and finer categorization of energy components will further improve PIGNet without under-fitting. Interestingly, PIGNet without data augmentation shows better docking power than GNN with data augmentation. While adding several hundred thousand training samples clearly improves the models, the augmentation alone cannot entirely replace the generalization effect given by the physics-informed model. Instead, the data augmentation and the physical modeling improve the model in a complementary manner, as we can see from the improvement of PIGNet through the data augmentation.
Effect of the DTI-adapted data augmentation strategy on the models
The PDBbind dataset is one of the most representative training datasets for deep DTI models, providing both the 3D binding structures and the binding affinities of protein-ligand complexes. However, the PDBbind dataset is suspected to hold an intrinsic bias; its ligands have insufficient chemical diversity, and only the binding structures in minimum-energy poses are given. To overcome this limitation of the PDBbind dataset, we additionally included a total of 1,362,534 computer-generated random protein-ligand complexes in training. Taking ligands from the IBS molecular library and generating decoys enriches the chemical diversity of the ligands and the 3D structural diversity of the binding poses. Table 2 reveals the effect of the DTI-adapted data augmentation strategy on the model generalization ability. The augmentation improved the docking and screening powers regardless of the model architecture, which shows the applicability of our data augmentation strategy to a variety of DNN-based DTI models. For the benchmarks containing only the true binding complexes - scoring power, ranking power, and CSAR 1 and 2 - it was an expected result that the data augmentation did not improve the scores, because what the model learns from the augmented data and the corresponding losses is to accurately distinguish the decoy from the true binding complexes.

Interpretation of the physically modeled outputs
One important advantage of our approach is the possibility of an atom-atom pairwise interpretation of DTI. To rationally design a drug for a specific target, knowing the dominant interactions of ligand binding is helpful. Since PIGNet computes atom-atom pairwise energy components, we can calculate the energy contribution of the substructures within a ligand. Here, we conduct a case study on two target proteins retrieved from the PDBbind dataset: protein-tyrosine phosphatase non-receptor type 1 (PTPN1) and platelet-activating factor acetylhydrolase (PAF-AH). The result is illustrated in Figure 3a, where two ligands for each protein are compared with respect to the predicted substructural energy contributions and the inhibitory constant, Ki. Note that the lower the Ki, the stronger the protein-ligand binding affinity. For PTPN1, the two ligands differ only by the red-circled substructure: a sulfonyl piperidine moiety is inserted in the second ligand instead of the benzyl moiety in the first ligand. The model predicts a greater energy contribution for the sulfonyl piperidine moiety, and this result is coherent with the experimental Ki values. For PAF-AH, the ligand with the phenyl group has a lower Ki value than the ligand with the methyl group, and the model predicts a greater energy contribution for the phenyl group than for the methyl group. For both proteins, the common substructure of the two ligands, highlighted in the figure with blue circles, is predicted to have similar energy contributions. This implies that the predicted outcomes provide physically meaningful interpretations of ligand binding, which can be further used for ligand optimization to strengthen the binding affinity with the target protein.

Most docking programs manually assign different scoring functions to atom-atom pairs according to predefined categories. This manual assignment would fall short when the binding pattern of a pair does not fit in an existing category.
Instead of the handcrafted categorization, our model exploits neural networks to automatically differentiate the atom-atom pairs; the information of each pair's interaction is updated through the graph attention networks. We illustrate the resulting deviations and their physical interpretation in Figures 3b and 3c.

Figure 3b shows a distance-energy distribution plot of the VDW component for carbon-carbon pairs within the test set. When trained with learnable parameters, the predicted VDW interactions naturally deviate within the carbon-carbon pairs, whereas without the learnable parameters the distance-energy plot follows a single solid line. With the aid of learnable parameters, our model might have learned a wider range of pairwise interactions in a data-driven manner. We also show the deviations in the hydrophobic, hydrogen bond, and metal energy components in Supplementary Figure S1.

Figure 3c shows that the naturally occurring deviations within the atom-atom pairs in our model are the consequence of learning sufficient physics information. The corrected sum of VDW radii, d'_{ij}, which contains a learnable parameter assigned to each atom-atom pair, deviated according to the carbon-carbon pair types. Since the interaction between two carbon atoms would not be significantly affected by their hybridization, we speculate that the corrected sum of VDW radii of a pair would depend on the atom radii. The result shows an increasing tendency from the C_sp-C_sp pair to the C_sp3-C_sp3 pair. Resonating with this speculation, the larger the s-character of the carbon atoms, the shorter the corrected sum of VDW radii.

Figure 3: Interpretation of the predicted outcomes. a. Substructural analysis of ligands for two target proteins, protein-tyrosine phosphatase non-receptor type 1 (PTPN1) and platelet-activating factor acetylhydrolase (PAF-AH). The blue and red circles indicate common and different substructures, respectively, and the predicted energy contribution (unit: kcal/mol) of each substructure is annotated. The inhibitory constant, Ki, indicates how potently the ligand binds to the target protein. b. A distance-energy plot of carbon-carbon pairwise van der Waals (VDW) energy components in the test set. The red solid line illustrates the original distance-energy relation without any deviation induced by learnable parameters. The closer the color of a data point to yellow, the larger the number of corresponding carbon-carbon pairs. c. The average value of the corrected sum of VDW radii, d'_{ij}, for different carbon-carbon pair types. C_sp-C_sp, C_sp2-C_sp2, and C_sp3-C_sp3 pairs are compared.

Uncertainty quantification of PIGNet
For reliable virtual screening, it is important to screen out the false positive binders and secure the true positives. Unfortunately, even though a large number of false positives incur an immense experimental burden, most positive returns of docking programs are actually false positives. DNN models can suffer from the same problem: the data-deficient nature of DTI model training might render the DNN models less fit to out-of-domain complexes, producing false positive results. One promising way of reducing false positives is quantifying the uncertainty of the predictions and sorting out the unreliable positive predictions. However, the scope of the related works has been limited to ligand-based models. In this regard, we hereby demonstrate the first uncertainty quantification on a structure-based DTI model.

We employed MC-dropout, a practical Bayesian inference method utilizing dropout regularization, to estimate epistemic uncertainties. For aleatoric uncertainty quantification, since the PIGNet architecture is not amenable to a direct implementation of the previous methods, we devised a new strategy. Kendall et al. show that a model can be trained to predict both the output and its aleatoric variance, given the convergence of the negative log-likelihood loss. Since our model adds up the predictions from each atom-atom pair to obtain an output for a protein-ligand complex, we aggregate the pairwise aleatoric uncertainties by multiplication over all pairs, as the atom-atom pairwise interactions are not mutually independent. When the aggregation was performed by summation or by taking an average, the model failed to converge. Further details of the implementation are explained in the Supplementary Information.

Figure 4: Plot of the average Pearson's correlation coefficients, R, of the 5-fold PIGNet model, with or without the uncertainty estimator, on the datasets classified according to the total uncertainty.
PIGNet with the uncertainty estimator - low: the lowest third, random: a randomly selected third, high: the highest third of the uncertainty distribution. PIGNet without the uncertainty estimator - baseline: a randomly selected third. An error bar represents one standard deviation. PIGNet with and without the uncertainty estimator were tested at the th and , th training epoch, respectively. For each model, the random selection was performed five times and the R value was obtained by a 5-fold average.

We quantified uncertainties for the samples in four datasets - CSAR NRC-HiQ 1 and 2, the CASF2016 scoring power set, and the PDBbind test set. In Figure 4, the 'low', 'random', and 'high' batches are in descending order of Pearson's correlation, R, in all four datasets. Such a result resonates with the expectation: the lower the uncertainty, the more probable that the model has correctly predicted the result. The consistent results across the four test datasets show that the uncertainties are properly measured for our model. Furthermore, a previous study shows evidence of a correlation between good generalization ability and robust uncertainty quantification. Thus, it might be possible to relate the high generalization ability of our model to the success of the uncertainty quantification.

For PIGNet without the uncertainty estimator, one third of the predictions was randomly selected, namely the 'baseline' subset. By comparing the R values of the 'random' and 'baseline' batches, we can evaluate the general performance of PIGNet with and without the uncertainty estimator.
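MC-dropout, the epistemic component described above, keeps dropout active at inference and reads the spread of repeated stochastic forward passes as the uncertainty. A minimal NumPy sketch on a toy two-layer regressor (the network, sizes, and random weights are illustrative stand-ins, not the PIGNet architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy two-layer regressor; random weights stand in for a trained model
W1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)
P_DROP = 0.2  # dropout rate, deliberately kept active at inference

def forward(x, rng):
    h = np.maximum(x @ W1 + b1, 0.0)       # ReLU hidden layer
    mask = rng.random(h.shape) >= P_DROP   # stochastic dropout mask
    h = h * mask / (1.0 - P_DROP)          # inverted-dropout scaling
    return h @ W2 + b2

def mc_dropout_predict(x, n_samples=100, seed=1):
    """Mean prediction and epistemic spread over stochastic forward passes."""
    rng = np.random.default_rng(seed)
    preds = np.stack([forward(x, rng) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)

x = rng.normal(size=(4, 16))
mean, std = mc_dropout_predict(x)
print(mean.shape, std.shape)  # (4, 1) (4, 1)
```

Sorting test samples by such a per-sample spread is what produces the 'low', 'random', and 'high' batches compared in Figure 4.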
The result shown in Figure 4 confirms that the addition of the uncertainty estimator does not harm the model performance.

These results show that PIGNet can filter out false positives by uncertainty quantification. By sorting out the predictions with high uncertainty, we obtained a high correlation between the prediction and the true affinity value - which means we excluded the complexes that are highly likely to be false positives.

Conclusion
In this work, we integrated physical models into neural networks to build a generalized drug-target interaction (DTI) prediction model, named PIGNet. By achieving high model generalization, our model has substantial pragmatic merits in virtual high-throughput screening schemes. This achievement is substantiated by the CASF2016 and CSAR benchmark results. A DTI model for virtual screening should make accurate energy predictions on true binding complexes and correctly screen out invalid, non-binding complexes. The former requirement is associated with the scoring and ranking powers, and the latter with the docking and screening powers. While the DNN-based benchmark models achieved similarly competent scoring and ranking powers, none was comparable to PIGNet regarding the docking and screening powers.

We attribute the success of our model to the two key strategies proposed in this work. The first is to employ the physics-informed parameterized equations. The physics modeling acts as a proper inductive bias for the neural model, guiding it to learn the underlying physics of the chemical interactions. We further improved the model performance by augmenting the training data with protein-ligand complexes of wider chemical and structural diversity. We analyzed the effects of the physics-informed model and the data augmentation through the ablation study and found that both contribute to the model generalization.

Our model enjoys further practical advantages such as the physical interpretation of predicted DTI values and the reduction of false positives via uncertainty quantification. Obtaining the binding free energy for every atom-atom pair opens up a possibility of further interpretation; this information can later be used to optimize drug candidates to attain better binding affinity. Also, we introduced an uncertainty estimator for DTI prediction models and evaluated the quality of estimation for PIGNet.
As predictions with high uncertainty can possibly be false positives, the uncertainty quantification has practical benefits in virtual screening scenarios.

Still, our model has room for improvement regarding the representation of solvation energy. In reality, proteins and ligands interact while surrounded by numerous water molecules, and the water molecules engender thermodynamic and structural effects on ligand binding. As our model does not explicitly include a solvent energy component, we think introducing one to PIGNet will further improve the model. Nevertheless, as our results suggest, we believe that incorporating physical models in a deep DTI prediction model can be a new practical way of improving the quality of DTI predictions.

Data availability
The datasets that we used for training are available on GitHub: https://github.com/jaechanglim/DTI_PDBbind.git.

Code availability
Our source code for data preprocessing and training is available on GitHub: https://github.com/jaechanglim/DTI_PDBbind.git.

References
1. Hopkins, A. L. Drug discovery: Predicting promiscuity.
Nature
Journal of Computer-Aided Molecular Design
Organic and Biomolecular Chemistry; Journal of Chemical Information and Modeling
Journal of Chemical Information and Modeling
Drug Discovery Today
Computational and Structural Biotechnology Journal
Journal of Computational Chemistry et al. rDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids.
PLoS Computational Biology
Journal of Computer-Aided Molecular Design
Journal of Medicinal Chemistry
J. Mol. Biol. et al.
Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy.
Journal of medicinal chemistry
Journal of Molecular Graphics and Modelling
Journal of Chemical Information and Modeling et al.
DOCK 6: Impact of new features and current docking performance.
Journal of Computational Chemistry
Bioorganic and Medicinal Chemistry Letters et al.
In vitro and in silico evaluation of stereoselective effect of ginsenoside isomers on platelet P2Y12 receptor.
Phytomedicine
WIREs Computational Molecular Science Journal of Medicinal Chemistry
Trends in Pharmacological Sciences
Multi-label classification using deep belief networks for virtual screening of multi-target drug in (2016), 102–107. 23. Öztürk, H., Özgür, A. & Ozkirimli, E. DeepDTA: Deep drug-target binding affinity prediction. Bioinformatics
Frontiers in Chemistry
782 (2019). 25. Lipinski, C. F., Maltarollo, V. G., Oliveira, P. R., da Silva, A. B. F. & Honorio, K. M. Advances and Perspectives in Applying Deep Learning for Drug Design and Discovery.
Frontiers in Robotics and AI
108 (2019). 26. Tsubaki, M., Tomii, K. & Sese, J. Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences.
Bioinformatics
PLoS Computational Biology
Nature Machine Intelligence; Molecular Informatics
Structural Bioinformatics: Applications in Preclinical Drug Discovery Process
Journal of Chemical Information and Modeling
Bioinformatics
Journal of Chemical Information and Modeling
Preprint at arXiv:1510.02855 (2015). 35. Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J. & Koes, D. R. Protein-Ligand Scoring with Convolutional Neural Networks.
Journal of Chemical Information and Modeling et al.
PotentialNet for Molecular Property Prediction.
ACS Central Science Journal of Chemical Information and Modeling et al.
Predicting Drug-Target Interaction Using a Novel Graph Neural Network with 3D Structure-Embedded Graph Representation.
Journal of Chemical Information and Modeling (2019). 39. Hawkins, D. M. The Problem of Overfitting. Journal of Chemical Information and Computer Sciences et al.
Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening.
PLoS ONE e0220113 (2019). 41. Morrone, J. A., Weber, J. K., Huynh, T., Luo, H. & Cornell, W. D. Combining Docking Pose Rank and Structure with Deep Learning Improves Protein–Ligand Binding Mode Prediction over a Baseline Docking Approach.
Journal of Chemical Information and Modeling (2020). 42. Greydanus, S., Dzamba, M. & Yosinski, J. Hamiltonian Neural Networks.
Preprint at arXiv:1906.05163 (2019). 43. Pun, G. P., Batra, R., Ramprasad, R. & Mishin, Y. Physically informed artificial neural networks for atomistic modeling of materials.
Nature Communications et al.
Comparative Assessment of Scoring Functions: The CASF-2016 Update.
Journal of Chemical Information and Modeling et al.
Toward more realistic drug-target interaction predictions.
Briefings in Bioinformatics et al. Interpretable Drug Target Prediction Using Deep Neural Representation. in IJCAI (2018), 3371–3377. 47. Zubatyuk, T. et al.
Machine Learned Hückel Theory: Interfacing Physics and Deep Neural Networks.
Preprint at arXiv:1909.12963 (2019). 48. Karlov, D. S., Sosnin, S., Fedorov, M. V. & Popov, P. GraphDelta: MPNN Scoring Function for the Affinity Prediction of Protein-Ligand Complexes.
ACS Omega et al. Graph Neural Networks: A Review of Methods and Applications.
Preprint at arXiv:1812.08434 (2018). 50. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.
Preprint at arXiv:1412.3555 (2014). 51. Wunderlich, R. E., Wenisch, T. F., Falsafi, B. & Hoe, J. C. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling.
Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA,
RDKit: Open-source cheminformatics < > . 53. Koes, D. R., Baumgartner, M. P. & Camacho, C. J. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. Journal of Chemical Information and Modeling et al.
Forging the Basis for Developing Protein-Ligand Interaction Scoring Functions.
Accounts of Chemical Research
British Journal of Pharmacology
InterBioScreen Ltd < > . 57. Dunbar, J. B., Jr. et al. CSAR benchmark exercise of 2010: Selection of the protein-ligand complexes.
Journal of Chemical Information and Modeling
Frontiers in Chemistry (2020). 59. Sink, R., Gobec, S., Pecar, S. & Zega, A. False Positives in the Early Stages of Drug Discovery. Current Medicinal Chemistry
Chemical Science
Journal of Chemical Information and Modeling (2020). 62. Hirschfeld, L., Swanson, K., Yang, K., Barzilay, R. & Coley, C. W. Uncertainty Quantification Using Neural Networks for Molecular Property Prediction.
Preprint at arXiv:2005.10036 (2020). 63. Yang, S., Lee, K. H. & Ryu, S. A comprehensive study on the prediction reliability of graph neural networks for virtual screening.
Preprint at arXiv:2003.07611 (2020). 64. Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.
Proceedings of the 33rd International Conference on Machine Learning (2016). 65. Kendall, A. & Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? Preprint at arXiv:1703.04977 (2017). 66. Geschwindner, S. & Ulander, J. The current impact of water thermodynamics for small-molecule drug discovery.
Expert Opinion on Drug Discovery
Journal of Computational Chemistry
Acknowledgements
This work was supported by Basic Science Research Programs through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017R1E1A1A01078109).
Competing Interests
There are no competing interests to declare.
Author contributions
Conceptualization: J.L. and W.Y.K.; Methodology: S.M., W.Z., S.Y., and J.L.; Software, Investigation and Formal Analysis: S.M., W.Z., S.Y., and J.L.; Writing – Original Draft: S.M., W.Z., S.Y., and J.L.; Writing – Review & Editing: S.M., W.Z., S.Y., J.L., and W.Y.K.; Supervision: W.Y.K.

Correspondence

Correspondence and requests for materials should be addressed to Woo Youn Kim.
Supplementary Information
Supplementary Information is available at doi

                                          | CASF2016 Benchmark                                           | CSAR
                                          | Scoring | Ranking | Docking      | Screening                 | NRC-HiQ set1 | NRC-HiQ set2
                                          | R       | ρ       | Success Rate | Average EF | Success Rate | R            | R
X-Score                                   | -       | -       | -            | -          | -            | 0.72         | 0.65
3D CNN based model                        | 0.652   | 0.611   | 42.5%        | 1.4%       | 3.5%         | 0.692        |
GNN based model                           | 0.723   | 0.583   | 67.7%        | 7.0%       | 26.3%        | 0.635        | 0.786
PIGNet                                    |         |         |              |            |              |              |

Table 1: Benchmark test results on CASF2016 and CSAR NRC-HiQ. R and ρ indicate the Pearson correlation coefficient and Spearman's rank correlation coefficient, respectively. The top 1% rate is used for the Average EF and the Success Rate. ∆VinaRF20 was excluded from the comparison for scoring power, as it was fine-tuned on the PDBbind v.2017 data, which in fact includes of data in the CASF2016 test set. We collected the reference scores of CASF2016 from Su et al. We have removed the overlapping PDBbind data from the CSAR NRC-HiQ datasets. The reference scores of CSAR NRC-HiQ were curated from Jiménez et al., which also has removed the overlapping data.

                                          | CASF2016 Benchmark                                           | CSAR
                                          | Scoring | Ranking | Docking      | Screening                 | NRC-HiQ set1 | NRC-HiQ set2
                                          | R       | ρ       | Success Rate | Average EF | Success Rate | R            | R
3D CNN based model w/o data augmentation  |         |         |              |            |              |              |
3D CNN based model with data augmentation | 0.652   |         |              |            |              |              |
GNN based model with data augmentation    | 0.723   | 0.583   |              |            |              |              |
PIGNet with data augmentation             |         |         |              |            |              |              |

Table 2: Benchmark test results on the CASF2016 and CSAR NRC-HiQ datasets for different models with or without data augmentation.

Electronic Supplementary Information for:
PIGNet: A physics-informed deep learning model towardgeneralized drug-target interaction predictions
Seokhyun Moon, † Wonho Zhung, † Soojung Yang, † Jaechang Lim, and Woo Youn Kim ∗ Department of Chemistry, KAIST, Daejeon 34141, Republic of Korea KI for Artificial Intelligence, KAIST, Daejeon 34141, Republic of Korea ∗ Email: [email protected] † These authors contributed equally to this work.
Contents
1. Model construction
   (a) Neural model architecture
   (b) Physics-informed parameterized functions
2. Benchmark methods
3. Interpretation of the physically modeled outputs
4. Uncertainty quantification method
List of Figures
1. Supplementary Figure 1. The plot of the pairwise interaction energy distribution for each component in the test set. The closer the color to yellow, the larger the number of pairs corresponding to the data point.
List of Tables
1. Supplementary Table 1. The list of atom features.
2. Supplementary Table 2. SMARTS descriptors for hydrogen bond acceptor and donor.

1. Model construction

(a) Neural model architecture

Atom features
Initial atom features in our model are summarized in Supplementary Table 1. The X in the atom features denotes all atoms except C, N, O, S, F, P, Cl, and Br. The final dimension of a node feature in the graph representation of a molecule is 27.
Feature                              | Values
Atom type                            | C, N, O, S, F, P, Cl, Br, X (one-hot)
Degree of atom                       | 0, 1, 2, 3, 4, 5 (one-hot)
Number of hydrogen atoms attached    | 0, 1, 2, 3, 4 (one-hot)
Number of implicit valence electrons | 0, 1, 2, 3, 4, 5 (one-hot)
Aromatic                             | 0 or 1
Supplementary Table 1: The list of atom features
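The 27 dimensions decompose as 9 (atom type) + 6 (degree) + 5 (attached hydrogens) + 6 (implicit valence) + 1 (aromatic flag). A minimal sketch of such a featurizer, taking the raw atomic attributes as plain arguments (in practice these attributes would come from a cheminformatics toolkit such as RDKit; this is an illustration, not the released code):

```python
ATOM_TYPES = ["C", "N", "O", "S", "F", "P", "Cl", "Br", "X"]  # X = any other element

def one_hot(value, choices):
    """One-hot encode `value` over `choices` (all zeros if absent)."""
    return [1 if value == choice else 0 for choice in choices]

def atom_features(symbol, degree, num_h, implicit_valence, is_aromatic):
    """27-dimensional node feature following Supplementary Table 1."""
    symbol = symbol if symbol in ATOM_TYPES[:-1] else "X"
    return (one_hot(symbol, ATOM_TYPES)                      # 9
            + one_hot(degree, [0, 1, 2, 3, 4, 5])            # 6
            + one_hot(num_h, [0, 1, 2, 3, 4])                # 5
            + one_hot(implicit_valence, [0, 1, 2, 3, 4, 5])  # 6
            + [1 if is_aromatic else 0])                     # 1

feat = atom_features("C", 3, 1, 1, True)
print(len(feat))  # 27
```

Mapping every element outside the eight listed ones to the X slot keeps the feature dimension fixed regardless of the chemistry encountered.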
Adjacency matrices
Our graph representation, G(H, A), contains two adjacency matrices, expressed as equations (1) and (2). A_1 and A_2 are constructed to account for the covalent bonds and the intermolecular interactions in a protein-ligand complex, respectively.

A_1,ij = 1 if i and j are connected by a covalent bond or i = j; 0 otherwise.  (1)

A_2,ij = 1 if 0.5 Å < d_{ij} < 5.0 Å; 0 otherwise.  (2)

For A_2, we neglect the atom-atom pair interactions whose pairwise distance is smaller than 0.5 Å or larger than 5.0 Å. By placing the upper threshold, we limit the effect of distant atoms and reduce the complexity of the graph representation. By setting the lower threshold, we avoid exceptional atom pairs at extremely short distances.

Gated graph attention network (Gated GAT)
The n-th unit of the gated GAT generates the next set of node features from the current set of node features, H^n = {h^n_1, h^n_2, ..., h^n_N}, and the adjacency matrix A_1, where h^n_i ∈ R^F. The scalars N and F are the number of atoms in the protein-ligand complex and the dimension of the node feature, respectively. The initial step of the gated GAT is the multiplication of a learnable weight, W^n_1 ∈ R^{F×F}, with the node feature h^n_i to produce an embedded node feature, m^n_i, which carries richer information about the protein-ligand complex. From the embedded node features, the attention coefficient e^n_{ij} between the i-th and the j-th nodes is computed as follows:

e^n_{ij} = (m^n_i)^T W^n_2 m^n_j + (m^n_j)^T (W^n_2)^T m^n_i,  (3)

where W^n_2 ∈ R^{F×F} is also a learnable matrix. e^n_{ij} quantifies the influence of the i-th node in updating the features of the j-th node. The summation of (m^n_i)^T W^n_2 m^n_j and (m^n_j)^T (W^n_2)^T m^n_i forces e^n_{ij} and e^n_{ji} to be equal. We adopted the softmax activation function to normalize the attention coefficients across neighboring nodes. The normalized attention coefficient, a^n_{ij}, is given by

a^n_{ij} = exp(e^n_{ij}) / Σ_{k ∈ N_i} exp(e^n_{ik}),  (4)

where N_i is the set of neighboring nodes of the i-th node. Then, the current node feature, h̃^n_i, is calculated via a linear combination of the neighboring node features weighted by the attention coefficients, with a ReLU activation function:

h̃^n_i = ReLU(Σ_j a^n_{ij} h^n_j).  (5)

We also used a gate mechanism to effectively deliver the previous node features and the current node features to the next node features. The importance of the previous node features, z_i, is computed from h^n_i and h̃^n_i as follows:

z_i = σ(W^n_3 (h^n_i ‖ h̃^n_i)),  (6)

where σ is a sigmoid activation function which constrains z_i between 0 and 1, (·‖·) is a concatenation operation, and W^n_3 ∈ R^{2F} is a learnable weight vector.
Lastly, the next node feature, h^{n+1}_i, is a linear interpolation between h^n_i and h̃^n_i:

h^{n+1}_i = z_i h^n_i + (1 − z_i) h̃^n_i.  (7)

We used three units of the gated GAT to incorporate intramolecular interactions into the node features.

Interaction network
The interaction network takes the previous set of node features, H^n = {h^n_1, h^n_2, ..., h^n_N}, and the adjacency matrix A_2 to generate the next set of node features, H^{n+1} = {h^{n+1}_1, h^{n+1}_2, ..., h^{n+1}_N}. The interaction network first multiplies h^n_i with a learnable weight, W^n_4 ∈ R^{F×F}, to obtain the set of embedded node features, M_1 = {m_{1,1}, m_{1,2}, ..., m_{1,N}}:

m_{1,i} = W^n_4 h^n_i,  (8)

where i is the node index. The interaction network also builds a set of interaction-embedded node features, M_2 = {m_{2,1}, m_{2,2}, ..., m_{2,N}}, from the previous node features, a learnable weight W^n_5 ∈ R^{F×F}, and the adjacency matrix A_2. The interaction-embedded feature of the i-th node is given by

m_{2,i} = max_{j ∈ N_i} {W^n_5 h^n_j},  (9)

where N_i is the set of nodes which have intermolecular interactions with the i-th node. Through this element-wise maximum aggregation, the interaction-embedded node feature keeps the most important feature elements among the nodes with intermolecular interactions. From M_1 and M_2, we obtain a set of total node features, H'^n = {h'^n_1, h'^n_2, ..., h'^n_N}, through summation and a ReLU activation function:

h'^n_i = ReLU(m_{1,i} + m_{2,i}).  (10)

The next set of node features, H^{n+1}, is obtained from a gated recurrent unit (GRU), using the total node features, H'^n, as the hidden state input and the previous node features, H^n, as the input:

h^{n+1}_i = GRU(h^n_i, h'^n_i).  (11)

The total node features recursively update the previous node features in the GRU cell. As a result, the next set of node features, H^{n+1}, is more likely to reflect only the important features of the given protein-ligand complex. The interaction network plays a significant role in our model by transforming the node features of the protein-ligand complex so that they contain information about the intermolecular interactions.
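The two message-passing units described in this section, the gated GAT of equations (3)-(7) and the interaction network of equations (8)-(11), can be sketched in PyTorch as below. The layer shapes, the single shared attention matrix, and the toy adjacency matrices are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGAT(nn.Module):
    """Sketch of eqs. (3)-(7): attention over covalent neighbours (A1),
    then a gated update of the node features."""
    def __init__(self, dim):
        super().__init__()
        self.w_embed = nn.Linear(dim, dim, bias=False)    # m_i = W1 h_i
        self.w_att = nn.Parameter(torch.randn(dim, dim))  # W2 in eq. (3)
        self.w_gate = nn.Linear(2 * dim, 1)               # eq. (6)

    def forward(self, h, a1):
        m = self.w_embed(h)                               # (N, F)
        e = m @ self.w_att @ m.T                          # m_i^T W2 m_j
        e = e + e.T                                       # eq. (3): symmetric
        e = e.masked_fill(a1 == 0, float("-inf"))
        a = torch.softmax(e, dim=-1)                      # eq. (4)
        h_tilde = F.relu(a @ h)                           # eq. (5)
        z = torch.sigmoid(self.w_gate(torch.cat([h, h_tilde], dim=-1)))  # eq. (6)
        return z * h + (1 - z) * h_tilde                  # eq. (7)

class InteractionNetwork(nn.Module):
    """Sketch of eqs. (8)-(11): max-aggregation over intermolecular
    neighbours (A2), then a GRU update with the aggregate as hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim, bias=False)     # eq. (8)
        self.w_inter = nn.Linear(dim, dim, bias=False)    # eq. (9)
        self.gru = nn.GRUCell(dim, dim)                   # eq. (11)

    def forward(self, h, a2):
        n = h.size(0)
        m1 = self.w_self(h)                                          # (N, F)
        msg = self.w_inter(h).unsqueeze(0).expand(n, -1, -1)         # (N, N, F)
        mask = (a2 == 0).unsqueeze(-1)
        m2 = msg.masked_fill(mask, float("-inf")).max(dim=1).values  # eq. (9)
        m2 = torch.where(torch.isinf(m2), torch.zeros_like(m2), m2)  # isolated nodes
        h_total = F.relu(m1 + m2)                                    # eq. (10)
        return self.gru(h, h_total)                                  # eq. (11)

h = torch.randn(6, 8)
a1 = torch.eye(6)                         # toy covalent adjacency (self-loops only)
a2 = (torch.rand(6, 6) < 0.4).float()     # toy intermolecular adjacency
out = InteractionNetwork(8)(GatedGAT(8)(h, a1), a2)
print(out.shape)  # torch.Size([6, 8])
```

Stacking three copies of each unit, as the text specifies, would simply chain these modules.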
To make a set of node features that sufficiently contains intermolecular interaction information, the interaction network consists of three units.

b) Physics-informed parameterized function

PIGNet consists of several physics-informed parameterized functions: four energy components and a rotor penalty. Each energy component is computed with a set of pair-wise node features, H^concat, which is represented as equation (12). Each pair-wise node feature consists of two node features, h_i and h_j:

H^concat = {h_11^concat, h_12^concat, ..., h_NN^concat} = {(h_1 || h_1), (h_1 || h_2), ..., (h_N || h_{N−1}), (h_N || h_N)}. (12)

Each energy component depends on d_ij and d'_ij, which are the distance and the corrected minimum distance between the i-th node and the j-th node. d'_ij can be represented as follows:

d'_ij = r_i + r_j + c · b_ij, (13)

where r_i and r_j are the van der Waals radii of the i-th and the j-th nodes, respectively, and b_ij is the corresponding correction between the two nodes. For the constant c that scales b_ij, we used 0.2. The correction constant, b_ij, is obtained from the set of pair-wise node features, H^concat, by using learnable weights, W_1 ∈ R^{F×F} and W_2 ∈ R^{F×1}, as follows:

b_ij = tanh(W_2(ReLU(W_1(h_ij^concat)))). (14)

van der Waals interaction

We used the 12-6 Lennard-Jones potential to calculate the van der Waals (VDW) interaction term, e_ij^vdw, between the i-th and j-th atoms. Equation (15) summarizes e_ij^vdw:

e_ij^vdw = c_ij [(d'_ij / d_ij)^12 − 2 (d'_ij / d_ij)^6], (15)

where c_ij indicates the minimum interaction energy, which is also predicted from the neural networks. We constrain its minimum and maximum values to 0.0178 and 0.0356, respectively, to render the predicted energy component similar to the true energy component. The maximum value of c_ij is taken from the parameter of AutoDock Vina for steric interactions. Equation (16) summarizes the calculation of c_ij:

c_ij = σ(W_2^vdw(ReLU(W_1^vdw h_ij^concat))) × (0.0356 − 0.0178) + 0.0178, (16)

where W_1^vdw ∈ R^{F×F} and W_2^vdw ∈ R^{F×1} are weight matrices. We consider all protein and ligand atom pairs except metal atoms, whose van der Waals radii have high variance depending on the atom type. We obtain the total van der Waals energy, E_vdw, by summing e_ij^vdw over all possible pairs, as follows:

E_vdw = Σ_{i,j} e_ij^vdw. (17)

The hydrogen bond, metal-ligand interaction, and hydrophobic interaction components, as well as the rotor penalty, can be computed as described in the main article.

Hydrogen bond, Metal-ligand interaction, Hydrophobic interaction
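Before moving on, the van der Waals pipeline of equations (13)-(17) above can be sketched numerically. The snippet below evaluates the pairwise energies for two toy atom pairs; the MLP weights, radii, and distances are hypothetical stand-ins for the trained parameters and real coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 4  # per-atom feature size (toy value)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two toy atom pairs: concatenated pair features (2F), distances, vdW radii
h_concat = rng.normal(size=(2, 2 * F))
d = np.array([3.5, 4.1])    # observed pair distances [Angstrom]
r_i = np.array([1.7, 1.7])  # vdW radius of atom i
r_j = np.array([1.55, 1.52])  # vdW radius of atom j

# Random stand-ins for the trained weights (toy shapes)
W1, W2 = rng.normal(size=(F, 2 * F)), rng.normal(size=(1, F))
W1v, W2v = rng.normal(size=(F, 2 * F)), rng.normal(size=(1, F))

# Eq. (14): learned correction b_ij in (-1, 1)
b = np.tanh(W2 @ relu(W1 @ h_concat.T)).ravel()

# Eq. (13): corrected minimum distance with c = 0.2
d_min = r_i + r_j + 0.2 * b

# Eq. (16): interaction-energy depth c_ij, constrained to (0.0178, 0.0356)
c = sigmoid(W2v @ relu(W1v @ h_concat.T)).ravel() * (0.0356 - 0.0178) + 0.0178

# Eq. (15): 12-6 Lennard-Jones pairwise energy
e_vdw = c * ((d_min / d) ** 12 - 2 * (d_min / d) ** 6)

# Eq. (17): total vdW energy as the sum over pairs
E_vdw = e_vdw.sum()
```

Because each observed distance here exceeds the corrected minimum distance, every pairwise term falls in the attractive region, so the toy E_vdw is negative and bounded below by the summed well depths.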
Supplementary Table 2 shows the SMARTS descriptors that are used to select the hydrogen bond donors and hydrogen bond acceptors.

Supplementary Table 2: SMARTS descriptors for hydrogen bond acceptor and donor.

4. Benchmark methods
CASF-2016 benchmark metrics

In this supplementary section, we explain the benchmark metrics we report as our results.

• Scoring power: This is defined as the linear correlation between the predicted binding affinity and the experimental binding data. The linear correlation is measured by the Pearson correlation coefficient, R. In fact, the scoring functions compared with PIGNet produce log affinity (log K_a) values, while PIGNet computes binding free energy values [kJ/mol]. Nevertheless, since a constant multiplication converts log K_a to the binding free energy, such a difference does not change the R values.

• Ranking power: This refers to the ability of a model to correctly rank the binding affinities of the true binders for a certain target protein, given the binders' precise binding poses. The ranking power can be measured in terms of the average Spearman rank correlation coefficient, ρ, the Kendall rank correlation coefficient, τ, and the Predictive Index (PI). We only report Spearman rank correlation coefficients for our experiments, as all three metrics are well correlated.

• Docking power: This is the ability of a model to find the native ligand binding pose among decoys with computer-generated poses. The metric for the docking power measurement is the overall success rate, which counts a complex identified by a model as a successful case if it has a high conformational similarity (RMSD < 2 Å) with the native binding pose.

• Screening power: This is the ability of a model to identify the true binding ligands for a given target protein among a set of random molecules. We measure the screening power in terms of the enhancement factor (EF) and the success rate, averaged across all 57 target proteins. The success rate is computed as the ratio of targets for which the highest-affinity binder is found among the top α (%) ranked ligands.
The enhancement factor for a target protein is defined as follows:

EF_α = N_TB,α / (N_TB,total × α), (18)

where N_TB,α is the number of the true binders among the top α (%) candidates ranked by a model, and N_TB,total is the total number of the true binders for the given target protein. We report our results as the average enhancement factor and success rate measured at α = 1%.

5. Interpretation of the physically modeled outputs

Distribution plot of atom-atom pairwise interaction in each energy component
Supplementary Figure 1: The distance-energy plot for each energy component in the test set. The closer to yellow, the larger the number of pairs corresponding to the data point.

By dissecting the predicted energies into individual energy components, we could observe that the model has learned the deviations within each energy component. Supplementary Figure 1 shows a distance-energy plot of each energy component, where the data points are the atom-atom pairs in the test set. Note that the pairwise energy plots are not a single solid line but broadly deviated distributions. For the van der Waals component, while the plot generally complies with the form of the 12-6 Lennard-Jones potential, the deviations arise from the two learnable parameters, b_ij and c_ij. b_ij also contributes to the deviations in the hydrophobic, hydrogen bond, and metal energy components, as the parameter is used to calculate the corrected sum of van der Waals radii in equation (13).

6. Uncertainty quantification method

The uncertainty of a predicted value can be quantified in two types: aleatoric uncertainty, the noise inherent in the data, and epistemic uncertainty, the uncertainty in the model parameters. While the epistemic uncertainty can be reduced by increasing the quantity of training data, the aleatoric uncertainty is related to data quality rather than quantity. The two types of uncertainty can be added up to plausibly approximate the total uncertainty. To quantify the uncertainties, we employed the following well-known methods:

• Aleatoric uncertainty: We adopt the Mean Variance Estimation (MVE) approach, which trains a model to produce both the mean and the aleatoric variance of a prediction. In the MVE scheme, the model is trained by minimizing the loss function derived as follows:

L_aleatoric(θ) = (1/N) Σ_{i=1}^{N} [ (1 / (2σ(x_i)^2)) ||y_i − f(x_i)||^2 + (1/2) log σ(x_i)^2 ]. (19)

The mean ŷ_i = f(x_i) and the variance σ(x_i)^2 of a prediction are functions of the input x_i.
In practice, the embedded feature vector of an input can be fed into two different output layers to produce the mean and the variance.

• Epistemic uncertainty: A popular solution to approximate the uncertainty of the model is Bayesian inference, which gives probabilistic model parameters. Among various Bayesian inference methods, MC-Dropout stands out as a practical method since it only requires the implementation of the widely used dropout regularization.

In fact, the PIGNet architecture is not compliant with a direct implementation of the aleatoric uncertainty inference method introduced in Kendall et al. Thus, we hereby introduce the following strategy to quantify the aleatoric uncertainty for PIGNet.

• Aggregation of the atom-atom pairwise uncertainties: Since our model adds up the predictions from each atom-atom pair to obtain an output for a protein-ligand complex, the corresponding pairwise aleatoric uncertainties should be aggregated as well. We chose to aggregate the pairwise uncertainties by multiplication over all pairs, as atom-atom pairwise interactions are not mutually independent.

• Distance-dependency of uncertainty: Our model takes atom-atom pair distances as inputs to account for the physical reality that the strength of interactions decays with distance. In an analogous manner, we conjecture that aleatoric uncertainties also decay with distance. We multiply a distance-dependent exponential decay function into the pairwise aleatoric uncertainties, as the exponential decay gave better convergence than the reciprocal decay.

The whole process is formulated as follows:

σ^2 = Π_{i,j} σ_ij^2 = Π_{i,j} | W_2^var(ReLU(W_1^var(h_ij^concat))) × a exp(−b d_ij) |, (20)

where h_ij^concat is a concatenated embedding of the i-th ligand atom and the j-th protein atom, and d_ij is the distance between the atoms. The FC layers and the learnable parameters, a and b, were trained to minimize the aleatoric uncertainty loss function (equation (19)).
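The aggregation of equation (20), together with a per-sample form of the MVE loss of equation (19), can be sketched as follows. The FC-layer weights are random stand-ins for the trained layers, and the decay parameters a and b are hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 4  # per-atom feature size (toy value)

def relu(x):
    return np.maximum(x, 0.0)

# Toy pair embeddings and distances for three ligand-protein atom pairs
h_concat = rng.normal(size=(3, 2 * F))
d = np.array([3.0, 4.5, 6.0])

# Random stand-ins for the trained FC-layer weights and decay parameters
W1, W2 = rng.normal(size=(F, 2 * F)), rng.normal(size=(1, F))
a, b = 1.0, 0.5  # learnable decay parameters (hypothetical values)

# Eq. (20): pairwise aleatoric variances with exponential distance decay,
# aggregated by multiplication over all pairs
sigma_pair = np.abs((W2 @ relu(W1 @ h_concat.T)).ravel() * a * np.exp(-b * d))
sigma_total = sigma_pair.prod()

# Eq. (19), per-sample form: MVE loss for one prediction with mean f(x)
# and variance var (a small epsilon keeps the log finite)
def mve_loss(y, f, var):
    return 0.5 * (y - f) ** 2 / var + 0.5 * np.log(var)

loss = mve_loss(y=1.2, f=1.0, var=sigma_total + 1e-8)
```

The absolute value keeps each pairwise variance non-negative, and the exponential factor suppresses the uncertainty contribution of distant pairs.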
Note that h_ij^concat and d_ij are the inputs used to produce both the predictive output and the uncertainty.

References

[1] Abdul-rahman Allouche. "Gabedit–A Graphical User Interface for Computational Chemistry Softwares". In: Journal of Computational Chemistry (2011).
[2] Junyoung Chung et al. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". Preprint at arXiv:1412.3555 (2014).
[3] Yarin Gal and Zoubin Ghahramani. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". In: Proceedings of the 33rd International Conference on Machine Learning 48 (2016).
[4] Alex Kendall and Yarin Gal. "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" Preprint at arXiv:1703.04977 (2017).
[5] Gabriele Scalia et al. "Evaluating Scalable Uncertainty Estimation Methods for Deep Learning-Based Molecular Property Prediction". In: Journal of Chemical Information and Modeling (2020).
[6] Minyi Su et al. "Comparative Assessment of Scoring Functions: The CASF-2016 Update". In: Journal of Chemical Information and Modeling (2019).