Explainable Deep Relational Networks for Predicting Compound-Protein Affinities and Contacts
Mostafa Karimi,†,‡,§ Di Wu,†,§ Zhangyang Wang,¶ and Yang Shen∗,†,‡

†Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
‡TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX 77840, United States
¶Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, United States
§Co-first authors
E-mail: [email protected]
Abstract
Predicting compound-protein affinity is critical for accelerating drug discovery. Recent progress made by machine learning focuses on accuracy but leaves much to be desired for interpretability. Through molecular contacts underlying affinities, our large-scale interpretability assessment finds commonly-used attention mechanisms inadequate. We thus formulate a hierarchical multi-objective learning problem whose predicted contacts form the basis for predicted affinities. We further design a physics-inspired deep relational network, DeepRelations, with intrinsically explainable architecture. Specifically, various atomic-level contacts or "relations" lead to molecular-level affinity prediction. And the embedded attentions are regularized with predicted structural contexts and supervised with partially available training contacts. DeepRelations shows superior interpretability to the state-of-the-art: without compromising affinity prediction, it boosts the AUPRC of contact prediction 9.5, 16.9, 19.3 and 5.7-fold for the test, compound-unique, protein-unique, and both-unique sets, respectively. Our study represents the first dedicated model development and systematic model assessment for interpretable machine learning of compound-protein affinity.

Introduction
Current drug-target interactions are predominantly represented by the interactions between small-molecule compounds as drugs and proteins as targets. The enormous chemical space to screen compounds is estimated to contain 10^60 drug-like compounds. And these compounds act in biological systems of millions or more protein species or "proteoforms" (considering genetic mutations, alternative splicing, and post-translational modifications of proteins).
Facing such a combinatorial explosion of compound-protein pairs, drug discovery calls for efficient characterization of compound efficacy and toxicity, and computational prediction of compound-protein interactions (CPI) addresses the need. Recently, computational CPI prediction has made major progress beyond predicting whether compounds and proteins interact. Indeed, thanks to increasingly abundant molecular data and advanced deep-learning techniques, compound-protein affinity prediction is reaching unprecedented accuracy, with inputs of compound-protein structures, compound identities (such as SMILES and graphs) and protein structures (see a relevant problem of binding classification), or even just compound-protein identities.
As previously summarized, structure-based affinity-prediction methods are limited in applicability due to the often-unavailable structures of compound-protein pairs or even proteins alone, whereas structure-free methods, being broadly applicable, could be limited in interpretability. Interpretability remains a major gap between the capability of current compound-protein affinity predictors and the demand of rational drug discovery. The central question about interpretability is whether and how methods (including machine learning models) could explain why they make certain predictions (the affinity level for any compound-protein pair in our context). This important topic is rarely addressed for affinity prediction. DeepAffinity has embedded joint attentions over compound-protein component pairs and uses such joint attentions to assess origins of affinities (binding sites) or specificities. Additionally, attention mechanisms have been used for predictions of CPI, chemical stability and protein secondary structures. Assessment of interpretability for all these studies was either lacking or limited to a few case studies. We note a recent work proposing a post-hoc attribution-based test to determine whether a model learns binding mechanisms.

We raise reasonable concerns on how much attention mechanisms can reproduce natural contacts in compound-protein interactions. Attention mechanisms were originally developed to boost the performance of seq2seq models for neural machine translation. And they have gained popularity for interpreting deep learning models in visual question answering, natural language processing and healthcare. However, they were also found to work differently from human attentions in visual question answering.

Representing the first effort dedicated to interpretability of compound-protein affinity predictors, our study is focused on how to define, assess, and enhance such interpretability as follows.
How to define interpretability for affinity prediction.
Interpretable machine learning is increasingly becoming a necessity for fields beyond drug discovery. Unlike interpretability in a generic case, what interpretability actually means and how it should be evaluated is much less ambiguous for compound-protein affinity prediction. So that explanations conform with scientific knowledge, human understanding, and drug-discovery needs, we define interpretability of affinity prediction as the ability to explain predicted affinity through underlying atomic interactions (or contacts). Specifically, atomic contacts of various types are known to constitute the physical basis of intermolecular interactions, modeled in force fields to estimate interaction energies, needed to explain mechanisms of action for drugs, and relied upon to guide structure-activity research in drug discovery. We emphasize that simultaneous prediction of affinity and contacts does not necessarily make the affinity predictors intrinsically interpretable unless predicted contacts form the basis for predicted affinities.
How to assess interpretability for affinity prediction.
Once interpretability of affinity predictors is defined through atomic contacts, it can be readily assessed against ground truth known in compound-protein structures, which overcomes the barrier for interpretable machine learning without ground truth. In our study, we have curated a dataset of compound-protein pairs, all of which are labeled with Kd values and some of which with contact details; and we have split them into training, test, compound-unique, protein-unique, and both-unique (or double-unique) sets. We measure the accuracy of contact prediction over various sets using the area under the precision-recall curve (AUPRC), which is suitable for binary classification (contacts/non-contacts) with imbalanced classes (far fewer contacts than non-contacts). We have performed large-scale assessments of attention mechanisms in various molecular data representations (protein amino-acid sequences and structure-property annotated sequences as well as compound SMILES and graphs) and corresponding neural network architectures (convolutional and recurrent neural networks [CNN and RNN] as well as graph convolutional and isomorphism networks [GCN and GIN]). And we have found current attention mechanisms inadequate for interpretable affinity prediction, as their AUPRCs were merely 50% more than chance.

How to enhance interpretability for affinity prediction.
We have made three main contributions to enhance interpretability.

The first contribution, found to be the most impactful, is to design intrinsic explainability into the architecture of a deep "relational" network. Inspired by physics, we explicitly model and learn various types of atomic interactions (or "relations") through deep neural networks and embed attentions at the levels of residue-atom pairs and relation types. This was motivated by relational neural networks first introduced to learn to reason in computer vision and subsequent interaction networks to learn the relations and interactions of complex objects and their dynamics.
Moreover, we combine such deep relational modules in hierarchy to progressively focus attention from putative protein surfaces, binding-site k-mers and residues, to putative residue-atom binding pairs.

The second contribution is to incorporate physical constraints into data representations, model architectures, and model training. (1) To respect the sequence nature of protein inputs and to overcome the computational bottlenecks of RNNs, inspired by protein folding principles, we represent protein sequences as hierarchical k-mers and model them with hierarchical attention networks (HANs). (2) To respect the structural contexts of proteins, we predict from protein sequences solvent exposure over residues and contact maps over residue pairs; and we introduce novel structure-aware regularizations for structured sparsity of model attentions.

The third contribution is to supervise attentions with partially available contact data and train models accordingly. For interpretable and accurate affinity prediction, we have formulated a hierarchical multi-objective optimization problem where contact predictions form the basis for affinity prediction. We utilize contact data available to a minority (around 7.5%) of training compound-protein pairs and design hierarchical training strategies accordingly.

The rest of the paper is organized as follows. The aforementioned contributions in defining, measuring, and enhancing interpretable affinity prediction will be detailed in Methods. In Results, after comparison to the state-of-the-art, the resulting framework of DeepRelations is found to drastically boost interpretability robustly over the default test, protein-unique, compound-unique, and double-unique sets, without sacrificing accuracy. Ablation studies then reveal the most contributing methodological component: the intrinsically explainable model architecture of our deep "relational" networks. Case studies provide further insights into the pattern of interpreted contacts.

Methods

Toward genome-wide prediction of compound-protein interactions (CPI), we assume that proteins are only available in 1D amino-acid sequences, whereas compounds are available in 1D SMILES or 2D chemical graphs. We start the section with the curation of a dataset of compound-protein pairs with known pKd values, a subset of which is of known intermolecular contacts. We will introduce the state-of-the-art and our newly-adopted neural networks to predict from such molecular data. These neural networks will first be adopted in our previous framework of DeepAffinity (supervised learning with joint attention) so that the interpretability of attention mechanisms can be systematically assessed in CPI prediction. We will then describe our physics-inspired, intrinsically explainable architecture of deep relational networks where the aforementioned neural networks are used as basis models. With carefully designed regularization terms, we will explain multi-stage deep relational networks that increasingly focus attention on putative binding-site k-mers, binding-site residues, and residue-atom interactions, for the prediction and interpretation of compound-protein affinity. We will also explain how the resulting model can be trained using compound-protein pairs with affinity values but not necessarily with atomic interaction details.

Curation of a CPI Relational Benchmark Set
We have previously curated affinity-labeled compound-protein pairs based on BindingDB. The compound data were in the format of canonical SMILES as provided in PubChem and the protein data are in the format of FASTA sequences. In this study, we used those data with amino-acid sequences no longer than 300 residues from the pKd-labeled set, which corresponds to 1,926 compound-protein pairs. We also converted SMILES to graphs with RDKit.

The pKd-labeled data only show the affinity strength between proteins and compounds, but lack the details on where and how the pairs interact. We have thus curated a subset of the pKd-labeled data with atomic-level intermolecular contacts (or "relations") derived from compound-protein co-crystal structures in PDB, as ground truth for the interpretability of affinity prediction. Specifically, we cross-referenced the aforementioned compound-protein pairs in PDBsum and used its LigPlot service to collect high-resolution atomic contacts or relations. These relations are given in the form of contact types (hydrogen bond or hydrophobic contact), atomic pairs, and atomic distances.

The resulting dataset of 1,926 pKd-labeled compound-protein pairs (including 144 pairs with atomic-contact data) corresponds to 137 proteins and 1,376 compounds. We randomly split them into four folds where fold 1 does not overlap with fold 2 in compounds, with fold 3 in proteins, or with fold 4 in either compounds or proteins. Folds 2, 3, and 4 are referred to as the compound-unique, protein-unique, and double-unique sets for generalization tests; and they contain 201 (19), 191 (14), and 192 (10) compound-protein pairs (including those with contact details in the parentheses). Fold 1 was randomly split into training (70%) and test (30%) sets where 10% of the training set was set aside as the validation or development set. The training (including validation) and test sets contain 974 (74) and 368 (27) compound-protein pairs (with contact details). The split of the whole dataset is illustrated in Figure 1 below.

Although monomer structures of proteins are often unavailable, their structural features can be predicted from protein sequences alone with reasonable accuracy. We have predicted the secondary structure and solvent accessibility of each residue using the latest SCRATCH, and contact maps for residue pairs using RaptorX-contact. These data provide additional structural information to regularize our machine learning models. If protein structures are available, actual rather than predicted data can be used instead.
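To make the split concrete, below is a minimal sketch of how such novelty-based folds can be constructed. The helper name and the held-out fractions are illustrative assumptions, not the benchmark's exact procedure.

```python
import random

def split_cpi_pairs(pairs, seed=0):
    """Construct folds so that fold 2 shares no compounds with fold 1,
    fold 3 no proteins, and fold 4 neither (illustrative sketch).

    `pairs` is assumed to be a list of (compound_id, protein_id, pKd) tuples.
    """
    random.seed(seed)
    compounds = sorted({c for c, _, _ in pairs})
    proteins = sorted({p for _, p, _ in pairs})
    random.shuffle(compounds)
    random.shuffle(proteins)
    # Hold out a fraction of compounds/proteins as "unseen" (fraction assumed).
    unseen_c = set(compounds[: len(compounds) // 10])
    unseen_p = set(proteins[: len(proteins) // 10])

    folds = {1: [], 2: [], 3: [], 4: []}
    for c, p, y in pairs:
        if c in unseen_c and p in unseen_p:
            folds[4].append((c, p, y))  # double-unique
        elif c in unseen_c:
            folds[2].append((c, p, y))  # compound-unique
        elif p in unseen_p:
            folds[3].append((c, p, y))  # protein-unique
        else:
            folds[1].append((c, p, y))  # training + default test
    return folds
```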
Data Representation and Corresponding Basis Neural Networks
Baseline: CNN and RNN for 1D protein and compound sequences.
When molecular data are given in 1D sequences, these inputs are often processed by convolutional neural networks (CNN) and by recurrent neural networks (RNN); the latter are more suitable for sequence data with long-term interactions.

Figure 1: The complete data set consists of training, test, compound-unique, protein-unique, and double-unique sets with compound-protein counts provided (including those with contact details in parentheses).

Challenges remain in RNN for compound strings or protein sequences. For compounds in SMILES strings, the descriptive power of such strings can be limited. In this study, we overcome the first challenge by representing compounds in chemical formulae (2D graphs) and using two types of graph neural networks (GNN). For proteins in amino-acid sequences, the often-large lengths demand deep RNNs that are hard to train effectively (gradient vanishing or exploding and non-parallel training). We previously overcame the second challenge by predicting structure properties from amino-acid sequences and representing proteins as much shorter structure-property sequences where each 4-letter tuple corresponds to a secondary structure. This treatment however limits the resolution of interpretability to the level of protein secondary structures (multiple neighboring residues) rather than individual residues. In this study, we overcome the second challenge while achieving residue-level interpretability by using a biologically-motivated hierarchical RNN (HRNN).

Proposed: GCN and GIN for 2D compound graphs.
Compared to 1D SMILES strings, chemical formulae (2D graphs) of compounds have more descriptive power and are increasingly used as inputs to predictive models.
In this study, compounds are represented as 2D graphs in which vertices are atoms and edges are covalent bonds between atoms. Suppose that n is the maximum number of atoms in our compound set (compounds with a smaller number of atoms are padded to reach size n). Consider a graph G = (V, X, E, A), where V = {v_j} (j = 1, ..., n) is the set of n vertices (each with d_g features), X ∈ R^{n×d_g} that of vertex features, E that of edges, and A ∈ {0,1}^{n×n} is the unweighted symmetric adjacency matrix. Let Â = A + I and D̂ be the degree matrix of Â (a diagonal matrix holding the row sums of Â).

We used the Graph Convolutional Network (GCN) and the Graph Isomorphism Network (GIN), which are the state of the art for graph embedding and inference. GCN consists of multiple layers and at layer l the model can be written as:

$$H^{(l)} = \mathrm{ReLU}\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l-1)} \Theta^{(l)}\right), \quad (1)$$

where H^{(l)} ∈ R^{n×d_g^{(l)}} is the output, Θ^{(l)} ∈ R^{d_g^{(l-1)}×d_g^{(l)}} the trainable parameters, and d_g^{(l)} the number of features, all at layer l. Initial conditions (when l = 0) are H^{(0)} = X and d_g^{(0)} = d_g.

GIN is the most powerful graph neural network in theory: its discriminative or representational power is equal to that of the Weisfeiler-Lehman graph isomorphism test. Similar to GCN, GIN consists of multiple layers and at layer l the model can be written with a multi-layer perceptron (MLP):

$$H^{(l)} = \mathrm{MLP}^{(l)}\left(\bar{A}^{(l)} H^{(l-1)}\right), \quad (2)$$

where $\bar{A}^{(l)} = A + (1 + \epsilon^{(l)}) I$ and ε^{(l)} can be either a trainable parameter or a fixed hyper-parameter. Each GIN layer has several nonlinear layers compared to a GCN layer with just one ReLU, which might improve predictions but hurt interpretability.

The final representation for a compound is Y = H^{(L)} if GCN or GIN has L layers. In this study, vertex features are as in our previous work, with a few additional features detailed later for the physics-inspired relational modules.
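As a concrete reference for Eq. (1), here is a minimal single-layer GCN sketch in PyTorch; dense adjacency matrices and the class interface are simplifying assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer per Eq. (1): H' = ReLU(D^{-1/2} (A+I) D^{-1/2} H Θ)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.theta = nn.Linear(d_in, d_out, bias=False)  # Θ^(l)

    def forward(self, H, A):
        # H: (n, d_in) vertex features; A: (n, n) 0/1 adjacency matrix.
        A_hat = A + torch.eye(A.size(0), device=A.device)  # add self-loops
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)            # diagonal of D̂^{-1/2}
        A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(A_norm @ self.theta(H))
```

A GIN layer per Eq. (2) would instead apply an MLP to (A + (1 + ε)I)H, trading the fixed normalized propagation for more expressive nonlinear aggregation.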
Proposed: HRNN for 1D protein sequences.

We aim to keep the use of RNN that respects the sequence nature of protein data while mitigating the difficulty of training RNNs for long sequences. To that end, inspired by the hierarchy of protein structures, we model protein sequences using hierarchical attention networks (HANs). Specifically, during protein folding, sequence segments may fold separately into secondary structures and the secondary structures can then collectively pack into a tertiary structure needed for protein functions. We exploit such hierarchical nature by representing a protein sequence of length easily in the thousands as tens or hundreds of k-mers (consecutive sequence segments) of length k (k = 15 in this study). Accordingly we process the hierarchical data with hierarchical attention networks (HANs), which have been proposed for natural language processing. We also refer to them as hierarchical RNN (HRNN).

Given a protein sequence x with maximum length m (shorter sequences are padded to reach length m) partitioned into T groups of k-mers, we use two types of RNNs (specifically, LSTMs here) in hierarchy for modeling within and across k-mers. We first use an embedding layer to represent the i-th residue in the t-th k-mer as a vector x_it. And we use a shared LSTM across all k-mers for the latent representation of each residue: h_it = LSTM(x_it) (t = 1, ..., T). We then summarize each k-mer as k_t with an intra-k-mer attention mechanism:

$$u_{it} = v^\top \tanh(\Theta h_{it} + b) \quad \forall i, t$$
$$u'_{it} = \frac{\exp(u_{it})}{\sum_{i'} \exp(u_{i't})} \quad \forall i, t$$
$$k_t = \sum_i u'_{it} h_{it} \quad \forall t \quad (3)$$

Then we use another LSTM for k_t and reach h_t = LSTM(k_t) (t = 1, ..., T). The final representation for a protein sequence is the collection of h_t.

Joint attention over protein-compound atomic pairs for interpretability.

Once the representation of protein sequences (h_t, where t = 1, ..., T is the index of the protein k-mer) and that of compound sequences or graphs (y_j, where j = 1, ..., n is the index of the compound atom) are defined, they are processed with a joint k-mer-atom attention mechanism to interpret any downstream prediction:

$$N_{tj} = \tanh(h_t^\top \Theta y_j) \quad \forall t, j$$
$$W'_{tj} = \frac{\exp(N_{tj})}{\sum_{t', j'} \exp(N_{t'j'})} \quad \forall t, j \quad (4)$$

With W'_tj, the joint attention between the t-th k-mer and the j-th atom, we can combine it with the intra-k-mer attention over each residue i in the t-th k-mer and reach W_ij, the joint attention between the i-th protein residue and the j-th compound atom:

$$W_{ij} = u'_{it} W'_{tj} \quad \forall i, j \quad (5)$$

This joint attention mechanism is an extension of our previous work where a protein sequence was represented as a single, "flat" RNN rather than multiple, hierarchical RNNs.
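The attention computation of Eqs. (3)-(5) can be sketched as follows; tensor shapes and parameter names are illustrative assumptions, and the k-mer summarization k_t is omitted for brevity.

```python
import torch
import torch.nn as nn

class JointKmerAtomAttention(nn.Module):
    """Sketch of Eqs. (3)-(5): intra-k-mer attention u', joint k-mer-atom
    attention W', and their product W over residue-atom pairs."""

    def __init__(self, d_res, d_kmer, d_atom):
        super().__init__()
        self.proj = nn.Linear(d_res, d_res)                     # Θ, b in Eq. (3)
        self.v = nn.Parameter(torch.randn(d_res))               # v in Eq. (3)
        self.theta = nn.Parameter(torch.randn(d_kmer, d_atom))  # Θ in Eq. (4)

    def forward(self, h_res, h_kmer, y_atom):
        # h_res: (T, k, d_res) residue LSTM states grouped by k-mer;
        # h_kmer: (T, d_kmer) k-mer LSTM states; y_atom: (n, d_atom).
        u = torch.tanh(self.proj(h_res)) @ self.v        # (T, k)
        u_prime = torch.softmax(u, dim=1)                # Eq. (3): within k-mer
        N = torch.tanh(h_kmer @ self.theta @ y_atom.T)   # (T, n)
        W_kmer = torch.softmax(N.flatten(), dim=0).reshape(N.shape)  # Eq. (4)
        W = u_prime.unsqueeze(-1) * W_kmer.unsqueeze(1)  # Eq. (5): (T, k, n)
        return W.flatten(0, 1)                           # (residues, atoms)
```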
DeepRelations

Overall architecture.
We have developed an end-to-end "by-design" interpretable architecture named DeepRelations for joint prediction and interpretation of compound-protein affinity. The overall architecture is shown in Figure 2.

Figure 2: Schematic illustration of DeepRelations, an intrinsically explainable neural network architecture for predicting compound-protein interactions.

There are three relational modules (Rel-CPI) corresponding to three stages. Their attentions are trained to progressively focus on putative binding k-mers, residues, and pairs; and earlier-stage attentions guide those in the next stage through regularization. In each Rel-CPI module, there are six types of atomic "relations" or interactions (including electrostatics as the non-negative linear combination of four sub-types). And each (sub)type of relation is modelled by the aforementioned neural network pairs with joint attentions. For instance, the first Rel-CPI uses HRNN-GCN (HRNN for protein sequences and GCN for compound graphs) and the next two use CNN-GCN (dilated causal CNN for proteins and GCN for compounds). And the non-negative linear combination of the six individual relations' attention matrices W_i (where i indexes the six relation types) gives the overall joint attention matrix W_final (or W in short) in each module.

Physics-inspired relational modules
The relational modules are inspired by physics. Specifically, atomic "relations" or interactions constitute the physical bases and explanations of compound-protein interaction affinities and are often explicitly modelled in force fields. We have considered the following six types of relations, with attentions paid to each and additional input features defined for each.

• Electrostatic interactions: A non-negative linear combination of four subtypes of compound-protein interactions through attentions: 1) charge-charge, 2) charge-dipole, 3) dipole-charge, and 4) dipole-dipole interactions. The input feature for the charge of a protein residue or a compound atom is the CHARMM27 parameter or the atomic formal charge, respectively. That for the dipole of a protein residue or a compound atom is the residue being polar/nonpolar or the Gasteiger atomic partial charge.

• Hydrogen bond: Non-covalent interaction between an electronegative atom as a hydrogen "acceptor" and a hydrogen atom that is covalently bonded to an electronegative atom called a hydrogen "donor". Therefore, if a protein residue or compound atom could provide a hydrogen acceptor/donor, its hydrogen-bond feature is -1/+1; otherwise the feature value is 0. A protein residue is allowed to be both hydrogen-bond donor and acceptor.

• Halogen bond: A halogen bond is very similar to a hydrogen bond except that a halogen (rather than hydrogen) atom (often found in drug compounds) is involved in such interactions. If a protein residue or a compound atom has/is a halogen atom such as iodine, bromine, chlorine or fluorine, its halogen-bond feature is assigned +4, +3, +2 and +1, respectively, for decreasing halogen-bonding strength. If it can be a halogen acceptor, the feature is -1. If it can be neither, the feature is simply set at 0. (A minimal encoding of this rule is sketched after this list.)

• Hydrophobic interactions: The interactions between hydrophobic protein residues and compound atoms contribute significantly to the binding energy between them. If a protein residue is hydrophobic, it is represented as 1 and otherwise as 0. Moreover, non-polar atoms are represented as 1 and polar ones as -1.

• Aromatic interactions: Aromatic rings in tryptophans, phenylalanines, and tyrosines participate in "stacking" interactions with aromatic moieties of a compound (π-π stacking). Therefore, if a protein residue has an aromatic ring, its aromatic feature is set at 1 and otherwise at 0. Similarly, if a compound atom is part of a ring, the feature is set at 1 and otherwise at 0.

• VdW interactions: Van der Waals interactions are weaker compared to the others, but the large number of these interactions contributes significantly to the overall binding energy between a protein and a compound. We consider the amino-acid type and the atom element as their features and use an embedding layer to derive their continuous representations.

For each (sub)type of atomic relations, corresponding protein and compound features are fed into basis neural network models such as HRNN for protein sequences and GNN for compound graphs. All features are made available to baseline methods (DeepAffinity+ variants) as well for fair comparison.
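As promised above, the halogen-bond rule translates directly into a small lookup; the element symbol and acceptor flag are assumed to come from the parsed compound graph or protein residue.

```python
# Halogen-bond feature per the rule above: halogen "donors" get +4..+1 by
# decreasing bonding strength, halogen acceptors get -1, everything else 0.
HALOGEN_STRENGTH = {"I": 4, "Br": 3, "Cl": 2, "F": 1}

def halogen_bond_feature(element: str, is_acceptor: bool = False) -> int:
    if element in HALOGEN_STRENGTH:
        return HALOGEN_STRENGTH[element]
    return -1 if is_acceptor else 0
```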
Physical constraints as regularization.
The joint attention matrices W in each Rel-CPI module, for individual relations or overall, are regularized with the following two types of physical constraints.

Focusing regularization
In the first regularization, a constraint input is given as a matrix T ∈ [0, 1]^{m×n} to penalize the attention matrices W_i for all the 10 (sub)types of relations if they focus on undesired regions of proteins. In addition, an L1 sparsity regularization is imposed on the attention matrices W_i for all relations to promote interpretability, as only a small portion of protein residues interact with compounds. Therefore, this "focusing" penalty can be formalized as:

$$R_1(W) = \lambda_{\mathrm{relation}} \sum_{i=1}^{10} \|(1 - T) \odot W_i\|_1 + \lambda_{\mathrm{L1}} \sum_{i=1}^{10} \|W_i\|_1, \quad (6)$$

where the T term, a parameter, can be considered as soft thresholding and its penalty is only incurred where T_ij = 0.

The first regularization is used for all three Rel-CPI modules or stages with increasingly focusing T. Let T^{[k]} be the constraint matrix and W^{[k]} the learned attention matrix for a given relation in the k-th stage. In the first stage, T^{[1]}_ij is one only for any residue i predicted to be solvent-exposed, in order to focus on surfaces. In the second stage, T^{[2]}_ij = max_j' W^{[1]}_ij' focuses on putative binding residues hierarchically learned for k-mers and residues in stage 1. In the last stage, T^{[3]}_ij = W^{[2]}_ij focuses on putative contacts between protein residues and compound atoms. The focusing regularization is enforced on attentions for every relation (sub)type in the current implementation and could be applied only to given (sub)types in the future.
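In code, Eq. (6) amounts to a masked L1 penalty; a minimal sketch, assuming the ten relation-wise attention maps are collected in a list:

```python
import torch

def focusing_penalty(W_list, T, lam_relation, lam_l1):
    """Eq. (6): penalize attention mass outside the allowed region T
    (entries where T is 0 incur the penalty) plus relation-wise L1 sparsity."""
    penalty = torch.zeros((), device=T.device)
    for W_i in W_list:  # the 10 relation (sub)type attention maps
        penalty = penalty + lam_relation * ((1.0 - T) * W_i).abs().sum()
        penalty = penalty + lam_l1 * W_i.abs().sum()
    return penalty
```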
Structure-aware sparsity regularization over protein contact maps

We further develop structure-aware sparsity constraints based on known or RaptorX-predicted contact maps of the unbound protein. As sequentially distant residues might be close in 3D and form binding sites for compounds, we define overlapping groups of residues where each group consists of a residue and its spatially close neighboring residues. Just in the second stage, we introduce the Group Lasso for spatial groups and the Fused Sparse Group Lasso (FSGL) for sequential groups on the overall, joint attention matrix W:

$$R_2(W) = \lambda_{\mathrm{group}} \|W\|_{\mathrm{group}} + \lambda_{\mathrm{fused}} \|W\|_{\mathrm{fused}} + \lambda_{\mathrm{L1\text{-}overall}} \|W\|_1. \quad (7)$$

The group Lasso penalty encourages a structured group-level sparsity so that few clusters of spatially close residues share similar attentions within individual clusters. The fused sparsity encourages local smoothness of the attention matrix so that sequentially close residues share similar attentions with compound atoms. The L1 term maintains the sparsity of the overall attention matrix W, since the L1 sparsity of the attention matrices W_i for individual relations does not guarantee that their linear combination remains sparse.
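A sketch of Eq. (7), assuming the spatial groups have been precomputed from the (predicted) residue contact map:

```python
import torch

def structure_aware_penalty(W, groups, lam_group, lam_fused, lam_l1):
    """Eq. (7): Group Lasso over spatially close residue groups, a fused
    term over sequential neighbors, and overall L1 sparsity.

    `groups` is a list of index lists; each holds a residue and its
    spatial neighbors taken from the protein contact map."""
    group_term = sum(torch.linalg.norm(W[idx, :]) for idx in groups)
    fused_term = (W[1:, :] - W[:-1, :]).abs().sum()  # sequential smoothness
    l1_term = W.abs().sum()
    return lam_group * group_term + lam_fused * fused_term + lam_l1 * l1_term
```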
Supervised attention.

It has been shown in visual question answering that attention mechanisms in deep learning can differ from human attentions. As will be revealed in our results, they do not necessarily focus on actual atomic interactions (relations) in compound-protein interactions either. We have thus curated a relational subset of our compound-protein pairs with affinities, for which known ground-truth atomic contacts or relations are available. We summarize the actual contacts of a pair in a matrix W_true of size m × n (m and n are the actual numbers of protein residues and compound atoms, respectively, for a given pair), which is a binary pairwise interaction matrix normalized by the total number of nonzero entries. We have accordingly introduced a third regularization term to supervise W, the non-padded submatrix of the attention matrix W, in the second stage:

$$R_3(W) = \frac{\lambda_{\mathrm{bind}}}{c} \|W - W_{\mathrm{true}}\|_F, \quad (8)$$

where c is a normalization constant across batches. Suppose that, in any batch, a given pair's actual interaction matrix is of size m × n and the smallest such size across all batches is m_min × n_min. Then c = (m_min × n_min)/(m × n) for this pair.
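Eq. (8) in code, under the assumption that W has already been cropped to the non-padded m × n submatrix and c has been computed per the definition above:

```python
import torch

def supervised_attention_penalty(W, W_true, lam_bind, c):
    """Eq. (8): Frobenius-norm supervision of the attention submatrix W
    against the normalized ground-truth contact matrix W_true;
    c = (m_min * n_min) / (m * n) rescales pairs of different sizes."""
    return (lam_bind / c) * torch.linalg.norm(W - W_true)
```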
Training strategy for hierarchical multi-objectives

Accuracy and interpretability are the two objectives we pursue at the same time. In our case, the two objectives are hierarchical: compound-protein affinity originates from atomic-level interactions (or "relations") and better interpretation of the latter potentially contributes to better prediction of the former.

Challenges remain in solving the hierarchical multi-objective optimization problem. First, optimizing for both objectives simultaneously (for instance, through a weighted sum of them) does not respect that the two objectives do not perfectly align with each other and are of different sensitivities to model parameters. Second, ground-truth data for the interpretability of affinity prediction, i.e., compound-protein contacts, are rare. In fact, merely 7.5% of our compound-protein pairs labeled with Kd affinities come with contact data.

To overcome the aforementioned challenges, we consider the problem as multi-label machine learning facing missing labels. And we design hierarchical training strategies to solve the corresponding hierarchical multi-objective optimization problem. The whole DeepRelations model, including the three Rel-CPI modules, is trained end-to-end. In the first stage, we "pre-trained" DeepRelations to minimize the mean squared error (MSE) of pKd regression alone, with physical constraints turned on; in other words, attentions were regularized (through R_1(·) and R_2(·)) but not supervised in this stage. We tuned combinations of all hyperparameters except λ_bind over a discrete set of powers of 10, with 400 epochs at the learning rate of 0.001. Over the validation set, we recorded the lowest RMSE (root mean square error) for affinity prediction and chose the hyperparameter combination with the highest AUPRC for contact prediction, subject to the corresponding affinity RMSE not deteriorating from the lowest by more than 10%.

In the second stage, with the optimal values of all hyperparameters but λ_bind fixed, we loaded the corresponding optimized model from the first stage and "fine-tuned" it to minimize the MSE additionally regularized by supervised attentions (through R_1(·), R_2(·), and R_3(·)). As only 7.5% of training examples come with known contacts, we used their average and ignored the other examples for R_3(·) in each batch. We used a slower learning rate (0.0001) and fewer training epochs (200) in the fine-tuning stage; and we tuned λ_bind over the same discrete set of powers of 10, following the same strategy as in the pre-training stage. In the end, we chose the optimal powers of 10 for λ_relation, λ_L1, λ_group, λ_fused, λ_L1-overall, and λ_bind for DeepRelations.

We did similarly for hyper-parameter tuning while constraining (and supervising) attentions to make DeepAffinity+ variants. For HRNN-GCN_cstr (modeling protein sequences with HRNN and compound graphs with GCN, regularized by physical constraints in R_2(·)), we tuned λ_group, λ_fused, and λ_L1-overall; its supervised version HRNN-GCN_cstr_sup additionally tuned λ_bind. We did the same for HRNN-GIN_cstr (modeling protein sequences with HRNN and compound graphs with GIN, regularized by physical constraints in R_2(·)) and its supervised version HRNN-GIN_cstr_sup. R_1(·) was for attentions on individual relations in DeepRelations and is not applicable to DeepAffinity+ variants, although a surface-focusing regularization on overall attentions could be introduced.
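The two-stage strategy can be summarized as below; the model interface (returning predictions plus attentions and exposing the three penalties as methods) is a hypothetical stand-in for the actual implementation.

```python
import torch

def train_two_stage(model, loader):
    """Pre-train on affinity MSE with R1/R2 regularization (lr 1e-3,
    400 epochs), then fine-tune with attention supervision R3 added
    (lr 1e-4, 200 epochs), as described above."""
    for lr, epochs, supervise in [(1e-3, 400, False), (1e-4, 200, True)]:
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, pkd, W_true in loader:  # W_true is None for ~92.5% of pairs
                pred, W = model(x)
                loss = torch.mean((pred - pkd) ** 2) + model.r1() + model.r2()
                if supervise and W_true is not None:
                    loss = loss + model.r3(W, W_true)
                opt.zero_grad()
                loss.backward()
                opt.step()
```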
Results

Attentions alone are inadequate for interpreting compound-protein affinity prediction
Our first task is to systematically assess the adequacy of attention mechanisms for interpreting model-predicted compound-protein affinities. To that end, we adopt various data representations and corresponding state-of-the-art neural network architectures in our framework of DeepAffinity. To model proteins, we have adopted RNN using protein SPS as input data as well as CNN and the newly developed HRNN using protein amino-acid sequences. To model compounds, we have adopted RNN using SMILES as input data as well as GCN and GIN using compound graphs with node features and edge adjacency. In the end, we have tested six DeepAffinity variants for protein-compound pairs, including RNN-RNN, RNN-GCN, CNN-GCN, HRNN-RNN, HRNN-GCN, and HRNN-GIN. The first two (RNN-RNN and RNN-GCN), where protein SPS sequences are modeled by RNN and compound SMILES or graphs are modeled by RNN or GCN, are essentially our previous models except that no unsupervised pretraining is used in this study. Whereas these two models' attentions on proteins are at the secondary-structure level (thus not assessed for interpretability here), the rest have joint attentions at the level of pairs of protein residues and compound atoms.

The accuracy of affinity prediction, measured by RMSE (root mean squared error) in pKd, is summarized for the DeepAffinity variants in the top panel of Figure 3. Overall, all variants have shown pKd errors between 1.1 and 1.3, a level competitively comparable even to the state-of-the-art affinity predictor using compound-protein co-crystal structures. These models have robust accuracy profiles across the default, compound-unique, protein-unique, and double-unique test sets, suggesting their generalizability beyond training compounds or proteins. Modeling compound SMILES with RNN seems to have slightly worse performance compared to modeling compound graphs with GCN or GIN, although fewer features are used for SMILES strings than node features for compound graphs.

Figure 3: Comparing accuracy and interpretability among various versions of DeepAffinity with (unsupervised) joint attention mechanisms. Separated by underscores in legends are neural network models for proteins and compounds respectively.

The interpretability of affinity prediction is assessed against the ground truth of contacts, as in the bottom panel of Figure 3. Specifically, we use joint attention scores to classify all possible residue-atom pairs into contacts or non-contacts. As contacts only represent a tiny portion (0.0061 on average) of all residue-atom pairs, AUPRC is again the appropriate measure.
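Concretely, the interpretability metric treats attention scores as contact classifiers; below is a minimal sketch of the per-pair AUPRC computation, assuming scikit-learn's average precision as the AUPRC estimator.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def contact_auprc(attention, true_contacts):
    """Score flattened joint-attention values against binary ground-truth
    contacts over all residue-atom pairs of one compound-protein pair."""
    y_score = np.asarray(attention).ravel()
    y_true = np.asarray(true_contacts).ravel().astype(int)
    return average_precision_score(y_true, y_score)
```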
Regularizing attentions with physical constraints modestly improves interpretability

Our next task is to enhance the interpretability of compound-protein affinity prediction beyond the level achieved by attention mechanisms alone. The first idea is to incorporate domain-specific physical constraints into model training. The rationale is that, by bringing in the (predicted) structural contexts of proteins and protein-compound interactions, attentions can be guided in their sparsity patterns accordingly for better interpretability.

We start with the two best-performing DeepAffinity variants so far (HRNN-GCN and HRNN-GIN) where protein amino-acid sequences are modeled by hierarchical RNN and compound graphs by various GNNs (including GCN and GIN). And we introduce the structure-aware sparsity regularization R_2(·) to the two models to make "DeepAffinity+" variants.

The resulting HRNN-GCN_cstr and HRNN-GIN_cstr models with physical constraints are assessed in Figure 4. Compared to the non-regularized counterparts in Figure 3, both models achieved similar accuracy levels across various test sets for affinity prediction, but their interpretability improved. Specifically, HRNN-GCN after constraints, compared to that before constraints, had AUPRC improvements of 5.7%, 2.9%, 19.2%, and 20.0% for the default test, protein-unique, compound-unique, and double-unique sets, respectively. However, the interpretability improvements from physical constraints were modest, especially when the absolute level of AUPRC remained around 0.01. These results suggest that incorporating physical constraints to structurally regularize the sparsity of attentions is useful for improving interpretability but may not be enough.

Figure 4: Comparing accuracy and interpretability among various versions of DeepAffinity+ (DeepAffinity with regularized and supervised attentions) and DeepRelations. "cstr" in legends indicates physical constraints imposed on attentions through regularization term R_2(·), whereas "sup" indicates supervised attentions through regularization term R_3(·).

Supervising attentions significantly improves interpretability
As regularizing attentions with physical constraints was not enough to enhance interpretability, our next idea is to additionally supervise attentions with ground-truth contact data available for some but not all training examples. Again we introduce "DeepAffinity+" models starting with HRNN-GCN and HRNN-GIN, by both regularizing and supervising attentions (using R_2(·) and R_3(·)).

The performances of the resulting HRNN-GCN_cstr_sup and HRNN-GIN_cstr_sup models are shown in Figure 4. Importantly, HRNN-GCN_cstr_sup (light blue) significantly improves the interpretability of affinity prediction without sacrificing accuracy. The average AUPRC improved to 0.0455, 0.0106, 0.0883, and 0.0175 for the default test, protein-unique, compound-unique, and double-unique test sets, representing relative improvements of 309% (645%), 92% (73%), 600% (1347%), and 46% (186%), respectively, compared to the constrained counterparts (chance). Interestingly, supervising attentions in HRNN-GIN did not see as significant an improvement in interpretability.

Building explainability into the DeepRelations architecture further drastically improves interpretability
Toward better interpretability, besides regularizing and supervising attentions, we have further developed an explainable, deep relational neural network named DeepRelations. Here atomic "relations" constituting the physical bases and explanations of compound-protein affinities are explicitly modeled in the architecture, with multi-stage gradual "zoom-in" to focus attentions. In other words, the model architecture itself is intrinsically explainable by design.

The superior performances of the resulting DeepRelations (with both regularized and supervised attentions) are shown in Figure 4 (yellow-green "DeepRelations_cstr_sup"). With equally competitive accuracy in affinity prediction as all previous models, DeepRelations achieved drastic improvements in interpretability. Strikingly, the average AUPRC further improved to 0.0996, 0.1350, 0.1754, and 0.0571 for the default test, protein-unique, compound-unique, and double-unique test sets, representing relative improvements of 121% (1532%), 1173% (2113%), 98% (2775%), and 226% (836%), respectively, compared to the previous best DeepAffinity+ variant (chance).

We further assessed the contact prediction (or interpretability) of DeepAffinity+ variants and DeepRelations using the precision, sensitivity, and odds ratio (or enrichment factor) of their top-K predictions (where K ranged from 5 to 50). Figure 5 shows that DeepRelations drastically outperforms other methods in all assessment measures considered. The precision and sensitivity levels may not appear impressive, largely due to the very strict definition of "true contacts" in our study, as will be revealed in a case study. Note that all atomic-level contact predictions were made with the inputs of protein sequences and compound graphs alone.

Figure 5: Comparing precision, sensitivity, and odds ratio (enrichment) of affinity-interpreting contacts predicted by various versions of DeepAffinity+ and DeepRelations.

For fair comparison, all DeepAffinity+ variants and DeepRelations were using the same set of features. A negative-control experiment in the subsequent ablation study further validated this. Therefore, the architecture of DeepRelations, being intrinsically explainable, is the major contributor to its superior interpretability. From the machine learning perspective, DeepAffinity+ variants have various molecular features lumped into general-purpose neural networks, which makes it very hard to learn governing physics laws from the molecular affinity data. Instead, DeepRelations directly builds the physics laws into its model architecture and carefully structures various features into corresponding atomic relations and eventually the overall binding affinity.
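The top-K assessment above can be sketched as follows; note that the reported odds ratio is approximated here by an enrichment factor (top-K precision over the background contact rate), an assumption about the exact statistic used.

```python
import numpy as np

def topk_contact_metrics(attention, true_contacts, k=50):
    """Precision and sensitivity of the K highest-attention residue-atom
    pairs, plus enrichment over the background contact rate."""
    scores = np.asarray(attention).ravel()
    truth = np.asarray(true_contacts).ravel().astype(bool)
    top = np.argsort(scores)[::-1][:k]     # indices of the top-K scores
    tp = truth[top].sum()
    precision = tp / k
    sensitivity = tp / max(truth.sum(), 1)
    enrichment = precision / max(truth.mean(), 1e-12)
    return precision, sensitivity, enrichment
```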
Ablation study for DeepRelations

To disentangle various components of DeepRelations and understand their relative contributions to DeepRelations' superior interpretability, we removed components from DeepRelations and made "DeepRelations-" variants. Besides regularized and supervised attentions, we believe that the main contributions in the architecture itself are (1) the multi-stage "zoom-in" mechanisms that progressively focus attentions from surface, binding k-mers, and binding residues to binding residue-atom pairs; and (2) the explicit modeling of atomic relations that can explain the structure feature-affinity mappings consistently with physics principles. We thus made three DeepRelations- variants: DeepRelations without multi-stage focusing, without explicit atomic relations, or without both.

We compared the three intermediate "DeepRelations-" versions with the best DeepAffinity+ (regularized and supervised HRNN-GCN) and DeepRelations in Figure 6. Consistent with our conjecture, we found that the explicit modeling of atomic relations was the main reason for DeepRelations' superior interpretability, as the removal of this component alone reduced the average AUPRC down to a level similar to the best DeepAffinity+ (except for the protein-unique case). Removing both components essentially reproduced the best DeepAffinity+ (again, except that it still outperforms the latter in the protein-unique case), which served well as a "negative control" here.

Figure 6: Comparing interpretability between DeepRelations and DeepRelations- (DeepRelations without multi-stage focusing, explicitly-modeled relations, or both).

Case Study

Now that we have established how drastically DeepRelations improves the interpretability of compound-protein affinity prediction and explained why it achieves so by design, we went on to examine the pattern in which DeepRelations' contact prediction outperforms the best DeepAffinity variant HRNN-GCN (and leaves room for further improvement). We thus randomly chose a compound-protein test pair with known contacts for a case study: carbonic anhydrase II with its inhibitor compound AL1 (PDB ID: 1BNN).

As shown in Figure 7, DeepRelations (middle) not only made more correct contact predictions than HRNN-GCN (left) but also showed a much improved contact pattern. In particular, HRNN-GCN could focus attention on residue-atom pairs that are actually as far as above 20 Å away, and the attended residues could be dispersed at two sides of a protein. In contrast, DeepRelations predictions were correctly focused in the binding site of the protein, and many of its "incorrect" predictions may correspond to residue-atom pairs within 10 Å or less, which could be partially attributed to the physical constraints introduced as regularization.

To further examine the possible benefit of explicitly modeled atomic relations, we examined the overall attention matrix and found that most contributions originate from electrostatic relations. We therefore examined the top-10 predicted electrostatic contacts according to the electrostatic attention matrix alone and found four true electrostatic interactions associated with the same protein residue (Histidine 94).

We extended the analysis of the patterns of predicted contacts over all test cases. Considering that the true contacts are defined rather strictly, we assess distance distributions of residue-atom pairs predicted by HRNN-GCN, HRNN-GCN with regularized attention, HRNN-GCN with regularized and supervised attention, and DeepRelations (also with regularized and supervised attention).
As seen in Figure 8, DeepRelations outperforms competitors in nearly all distance ranges over all test sets.

Conclusions
Toward accurate and interpretable machine learning of compound-protein affinity, we have curated an affinity-labeled dataset with partially annotated contact details, assessed the adequacy of current attention-based deep learning models for both accuracy and interpretability, and developed novel machine-learning models and training strategies to drastically enhance interpretability without sacrificing accuracy. This is the first study with dedicated model development and systematic model assessment for interpretability in affinity prediction.

Figure 8: Distributions of top-50 contacts, predicted by DeepAffinity, DeepAffinity+, and DeepRelations, in various distance ranges (unit: Å).

Our study has found that commonly-used attention mechanisms alone, although better than chance in most cases, are not satisfying in interpretability: the most attended contacts in affinity prediction do not reveal true contacts underlying affinities at a useful level. We have tackled the challenge with three innovative methodological advances. First, we introduce domain-specific physical constraints to regularize attentions (or guide their sparsity patterns), in which structural contexts such as sequence-predicted protein surfaces and protein contact maps are utilized. Second, we exploit partially available ground-truth contacts to supervise attentions. Lastly, we build an intrinsically explainable model architecture where various atomic relations, reflecting physics laws, are explicitly modeled and aggregated for affinity prediction. Joint attentions are embedded over residue-atom pairs for individual and overall relations. And a multi-stage hierarchy, trained end-to-end, progressively focuses attentions on protein surfaces, binding k-mers and residues, and residue-atom contact pairs.

Empirical results demonstrate the superiority of DeepRelations in interpretability without sacrificing accuracy. Compared to the best DeepAffinity variant with joint attention (HRNN-GCN), the AUPRC for contact prediction was boosted 9.48-, 16.86-, 19.28-, and 5.71-fold for the default test, compound-unique, protein-unique, and double-unique cases, respectively. Importantly, the interpretability of DeepRelations proves robust and generalizable, as the margins of improvement were even higher when compounds or/and proteins are not present in the training set. Ablation studies demonstrate that the explainable relational network architecture was the major contributor to such performances. Case studies suggest that DeepRelations predicts not only more correct but also better-patterned contacts. And many "incorrect" predictions due to the strict definition of contacts were within reasonable ranges; in fact, around one third of the top-50 predicted contacts correspond to residue-atom pairs within 10 Å.

An additional benefit of DeepRelations is its broad applicability toward the vast chemical and proteomic spaces. It does not rely on 3D structures of compound-protein complexes or even protein monomers, which are often unavailable. The only inputs needed are protein sequences and compound graphs. Meanwhile, it adopts the latest technology to predict structural contexts from protein sequences (such as surfaces, secondary structures, and residue-contact maps) and incorporates such structural contexts into affinity and contact predictions.
When structure data are available, DeepRelations can readily integrate such data by using actual rather than predicted structural contexts.

Our study demonstrates that it is much more effective to directly build explainability into machine learning model architectures (as DeepRelations models underlying atomic relations explicitly) than to infer explainability from general-purpose architectures (as DeepAffinity+ variants learn attentions from data alone). In other words, designing intrinsically interpretable machine learning models, although more difficult, can be much more desirable than pursuing interpretability in a post hoc manner.

Acknowledgement
This work was supported by the National Institutes of Health (R35GM124952 to Y.S.). Part of the computing time was provided by the Texas A&M High Performance Research Computing.
References

(1) Santos, R.; Ursu, O.; Gaulton, A.; Bento, A. P.; Donadi, R. S.; Bologa, C. G.; Karlsson, A.; Al-Lazikani, B.; Hersey, A.; Oprea, T. I., et al. A comprehensive map of molecular drug targets. Nature Reviews Drug Discovery, 19.
(2) Bohacek, R. S.; McMartin, C.; Guida, W. C. The art and practice of structure-based drug design: A molecular modeling perspective. Medicinal Research Reviews, 3–50.
(3) Ponomarenko, E. A.; Poverennaya, E. V.; Ilgisonis, E. V.; Pyatnitskiy, M. A.; Kopylov, A. T.; Zgoda, V. G.; Lisitsa, A. V.; Archakov, A. I. The Size of the Human Proteome: The Width and Depth. Int J Anal Chem, 7436849.
(4) Aebersold, R., et al. How many human proteoforms are there? Nat. Chem. Biol., 206–214.
(5) Wallach, I.; Dzamba, M.; Heifets, A. AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855.
(6) Gomes, J.; Ramsundar, B.; Feinberg, E. N.; Pande, V. S. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603.
(7) Jiménez, J.; Škalič, M.; Martínez-Rosell, G.; De Fabritiis, G. KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. Journal of Chemical Information and Modeling, 287–296, PMID: 29309725.
(8) Torng, W.; Altman, R. B. Graph Convolutional Neural Networks for Predicting Drug-Target Interactions. Journal of Chemical Information and Modeling, 4131–4149.
(9) Lim, J.; Ryu, S.; Park, K.; Choe, Y. J.; Ham, J.; Kim, W. Y. Predicting Drug–Target Interaction Using a Novel Graph Neural Network with 3D Structure-Embedded Graph Representation. Journal of Chemical Information and Modeling, 3981–3988.
(10) Karimi, M.; Wu, D.; Wang, Z.; Shen, Y. DeepAffinity: Interpretable Deep Learning of Compound-Protein Affinity through Unified Recurrent and Convolutional Neural Networks. arXiv preprint arXiv:1806.07537.
(11) Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics, i821–i829.
(12) Feng, Q.; Dueva, E. V.; Cherkasov, A.; Ester, M. PADME: A Deep Learning-based Framework for Drug-Target Interaction Prediction. CoRR, abs/1807.09741.
(13) Gao, K. Y.; Fokoue, A.; Luo, H.; Iyengar, A.; Dey, S.; Zhang, P. Interpretable Drug Target Prediction Using Deep Neural Representation. IJCAI, 2018; pp 3371–3377.
(14) Li, X.; Yan, X.; Gu, Q.; Zhou, H.; Wu, D.; Xu, J. DeepChemStable: Chemical Stability Prediction with an Attention-Based Graph Convolution Network. Journal of Chemical Information and Modeling, 1044–1049.
(15) Uddin, M. R.; Mahbub, S.; Rahman, M. S.; Bayzid, M. S. SAINT: Self-Attention Augmented Inception-Inside-Inception Network Improves Protein Secondary Structure Prediction. bioRxiv, 786921.
(16) McCloskey, K.; Taly, A.; Monti, F.; Brenner, M. P.; Colwell, L. J. Using attribution to decode binding mechanism in neural network models for chemistry. Proceedings of the National Academy of Sciences, 11624–11629.
(17) Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
(18) Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical question-image co-attention for visual question answering. Advances in Neural Information Processing Systems, 2016; pp 289–297.
(19) Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning, 2015; pp 2048–2057.
(20) Choi, E.; Bahadori, M. T.; Sun, J.; Kulas, J.; Schuetz, A.; Stewart, W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. Advances in Neural Information Processing Systems, 2016; pp 3504–3512.
(21) Das, A.; Agrawal, H.; Zitnick, L.; Parikh, D.; Batra, D. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 90–100.
(22) Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. 2017.
(23) Dill, K.; Bromberg, S. Molecular driving forces: statistical thermodynamics in biology, chemistry, physics, and nanoscience; Garland Science, 2012.
(24) Gilson, M. K.; Zhou, H.-X. Calculation of protein-ligand binding affinities. Annu. Rev. Biophys. Biomol. Struct., 21–42.
(25) Brzozowski, A. M.; Pike, A. C.; Dauter, Z.; Hubbard, R. E.; Bonn, T.; Engström, O.; Öhman, L.; Greene, G. L.; Gustafsson, J.-Å.; Carlquist, M. Molecular basis of agonism and antagonism in the oestrogen receptor. Nature, 753.
(26) Congreve, M.; Murray, C. W.; Blundell, T. L. Keynote review: Structural biology and drug discovery. Drug Discovery Today, 895–907.
(27) Wlodawer, A.; Vondrasek, J. Inhibitors of HIV-1 protease: a major success of structure-assisted drug design. Annual Review of Biophysics and Biomolecular Structure, 249–284.
(28) Brik, A.; Wong, C.-H. HIV-1 protease: mechanism and drug discovery. Organic & Biomolecular Chemistry, 5–14.
(29) Yang, F.; Du, M.; Hu, X. Evaluating explanation without ground truth in interpretable machine learning. arXiv preprint arXiv:1907.06831.
(30) Santoro, A.; Raposo, D.; Barrett, D. G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; Lillicrap, T. A simple neural network module for relational reasoning. Advances in Neural Information Processing Systems, 2017; pp 4967–4976.
(31) Lu, C.; Krishna, R.; Bernstein, M.; Fei-Fei, L. Visual relationship detection with language priors. European Conference on Computer Vision, 2016; pp 852–869.
(32) Battaglia, P.; Pascanu, R.; Lai, M.; Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems, 2016; pp 4502–4510.
(33) Hoshen, Y. Vain: Attentional multi-agent predictive modeling. Advances in Neural Information Processing Systems, 2017; pp 2701–2711.
(34) Liu, T.; Lin, Y.; Wen, X.; Jorissen, R. N.; Gilson, M. K. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Research, D198–D201.
(35) Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B., et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Research, D1102–D1109.
(36) RDKit: Open-source cheminformatics. [Online; accessed April 2019].
(37) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Research, 235–242.
(38) Laskowski, R. A.; Jabłońska, J.; Pravda, L.; Vařeková, R. S.; Thornton, J. M. PDBsum: Structural summaries of PDB entries. Protein Science, 129–134.
(39) Cheng, J.; Randall, A. Z.; Sweredoski, M. J.; Baldi, P. SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Research, W72–W76.
(40) Magnan, C. N.; Baldi, P. SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics, 2592–2597.
(41) Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Computational Biology, e1005324.
(42) Lee, I.; Keum, J.; Nam, H. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Computational Biology, e1007129.
(43) Trinh, T. H.; Dai, A. M.; Luong, M.-T.; Le, Q. V. Learning longer-term dependencies in RNNs with auxiliary losses. arXiv preprint arXiv:1803.00144.
(44) Cortes-Ciriano, I.; Bender, A. KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. Journal of Cheminformatics, 41.
(45) Kipf, T. N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
(46) Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
(47) Weisfeiler, B.; Lehman, A. A. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2.