DeepAffinity: Interpretable Deep Learning of Compound-Protein Affinity through Unified Recurrent and Convolutional Neural Networks
Structural Bioinformatics
Mostafa Karimi, Di Wu, Zhangyang Wang and Yang Shen
Department of Electrical and Computer Engineering, TEES–AgriLife Center for Bioinformatics and Genomic Systems Engineering, and Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, USA.
∗To whom correspondence should be addressed.
Abstract
Motivation:
Drug discovery demands rapid quantification of compound-protein interaction (CPI). However, there is a lack of methods that can predict compound-protein affinity from sequences alone with high applicability, accuracy, and interpretability.
Results:
We present a seamless integration of domain knowledge and learning-based approaches. Under novel representations of structurally-annotated protein sequences, a semi-supervised deep learning model that unifies recurrent and convolutional neural networks has been proposed to exploit both unlabeled and labeled data, for jointly encoding molecular representations and predicting affinities. Our representations and models outperform conventional options in achieving relative error in IC50 within 5-fold for test cases and 20-fold for protein classes not included for training. Performances for new protein classes with few labeled data are further improved by transfer learning. Furthermore, separate and joint attention mechanisms are developed and embedded to our model to add to its interpretability, as illustrated in case studies for predicting and explaining selective drug-target interactions. Lastly, alternative representations using protein sequences or compound graphs and a unified RNN/GCNN-CNN model using graph CNN (GCNN) are also explored to reveal algorithmic challenges ahead.
Availability:
Data and source codes are available at https://github.com/Shen-Lab/DeepAffinity
Contact: [email protected]
Supplementary information:
Supplementary data are available at http://shen-lab.github.io/deep-affinity-bioinf18-supp-rev.pdf.
Introduction

Drugs are often developed to target proteins that participate in many cellular processes. Among almost 900 FDA-approved drugs as of year 2016, over 80% are small-molecule compounds that act on proteins for drug effects (Santos et al., 2017). Clearly, it is of critical importance to characterize compound-protein interaction for drug discovery and development, whether screening compound libraries for given protein targets to achieve desired effects or testing given compounds against possible off-target proteins to avoid undesired effects. However, experimental characterization of every possible compound-protein pair can be daunting, if not impossible, considering the enormous chemical and proteomic spaces. Computational prediction of compound-protein interaction (CPI) has therefore made much progress recently, especially for repurposing and repositioning known drugs for previously unknown but desired new targets (Keiser et al., 2009; Power et al., 2014) and for anticipating compound side-effects or even toxicity due to interactions with off-targets or other drugs (Chang et al., 2010; Mayr et al., 2016).

Structure-based methods can predict compound-protein affinity, i.e., how active or tight-binding a compound is to a protein, and their results are highly interpretable. This is enabled by evaluating energy models (Gilson and Zhou, 2007) on 3D structures of protein-compound complexes. As these structures are often unavailable, they often need to be first predicted by "docking" individual structures of proteins and compounds together before their energies can be evaluated, which tends to be a bottleneck for computational speed and accuracy (Leach et al., 2006). Machine learning has been used to improve scoring accuracy based on energy features (Ain et al., 2015).

More recently, deep learning has been introduced to predict compound activity or binding affinity from 3D structures directly. Wallach et al. developed AtomNet, a deep convolutional neural network (CNN), for modeling bioactivity and chemical interactions (Wallach et al., 2015). Gomes et al. (2017) developed the atomic convolutional neural network (ACNN) for binding affinity by generating new pooling and convolutional layers specific to atoms. Jimenez et al. (2018) also used a 3D CNN with molecular representation of 3D voxels assigned to various physicochemical property channels. Besides these 3D CNN methods, Cang and Wei represented 3D structures in novel 1D topology invariants in multiple channels for CNN (Cang and Wei, 2017). These deep learning methods often improve scoring thanks to modeling long-range and multi-body atomic interactions. Nevertheless, they still rely on actual 3D structures of CPI and remain largely untested on lower-quality structures predicted from docking, which prevents large-scale applications.

Sequence-based methods overcome the limited availability of structural data and the costly need of molecular docking. Rather, they exploit rich omics-scale data of protein sequences, compound sequences (e.g., 1D binary substructure fingerprints (Wang et al., 2009)) and beyond (e.g., biological networks). However, they have been restricted to classifying CPIs (Chen et al., 2016) mainly into two types (binding or not) and occasionally more (e.g., binding, activating, or inhibiting (Wang and Zeng, 2013)). More importantly, their interpretability is rather limited due to high-level features.
Earlier sequence-based machine learning methods are based on shallow models for supervised learning, such as support vector machines, logistic regression, random forest, and shallow neural networks (Cheng et al., 2012; Yu et al., 2012; Tabei and Yamanishi, 2013; Shi et al., 2013; Cheng et al., 2016). These shallow models do not lack interpretability per se, but the sequence-based high-level features do not provide enough interpretability for mechanistic insights on why a compound-protein pair interacts or not.

Deep learning has been introduced to improve CPI identification from sequence data and shown to outperform shallow models. Wang and Zeng developed a method to predict three types of CPI based on restricted Boltzmann machines, a two-layer probabilistic graphical model and a type of building block for deep neural networks (Wang and Zeng, 2013). Tian et al. boosted the performance of traditional shallow-learning methods by a deep learning-based algorithm for CPI (Tian et al., 2016). Wan et al. exploited feature embedding algorithms such as latent semantic analysis (Deerwester et al., 1990) and word2vec (Mikolov et al., 2013) to automatically learn low-dimensional feature vectors of compounds and proteins from the corresponding large-scale unlabeled data (Wan and Zeng, 2016). Later, they trained deep learning models to predict the likelihood of their interaction by exploiting the learned low-dimensional feature space. However, these deep-learning methods inherit two limitations from sequence-based methods: the simplified task of predicting whether rather than how active CPIs occur, as well as low interpretability due to the lack of fine-resolution structures. In addition, interpretability for deep learning models remains a challenge albeit with fast progress, especially in a model-agnostic setting (Ribeiro et al., 2016; Koh and Liang, 2017).

As has been reviewed, structure-based methods predict quantitative levels of CPI in a realistic setting and are highly interpretable with structural details, but their applicability is restricted by the availability of structure data, and the molecular docking step makes the bottleneck of their efficiency. Meanwhile, sequence-based methods often only predict binary outcomes of CPI in a simplified setting and are less interpretable in lack of mechanism-revealing features or representations; but they are broadly applicable with access to large-scale omics data and generally fast with no need of molecular docking.

Our goal is to, realistically, predict quantitative levels of CPIs (compound-protein affinity measured in IC50, Ki, or Kd) from sequence data alone and to balance the trade-offs of previous structure- or sequence-based methods for broad applicability, high throughput, and more interpretability. From the perspective of machine learning, this is a much more challenging regression problem compared to the classification problem seen in previous sequence-based methods.

To tackle the problem, we have designed interpretable yet compact data representations and introduced a novel and interpretable deep learning framework that takes advantage of both unlabeled and labeled data. Specifically, we first have represented compound sequences in the Simplified Molecular-Input Line-Entry System (SMILES) format (Weininger, 1988) and protein sequences in novel alphabets of structural and physicochemical properties. These representations are much lower-dimensional and more informative compared to previously-adopted small-molecule substructure fingerprints or protein Pfam domains (Tian et al., 2016).
We then leverage the wealth of abundant unlabeled data to distill representations capturing long-term, nonlinear dependencies among residues/atoms in proteins/compounds, by pre-training bidirectional recurrent neural networks (RNNs) as part of the seq2seq auto-encoder that finds much success in modeling sequence data in natural language processing (Kalchbrenner and Blunsom, 2013). And we develop a novel deep learning model unifying RNNs and convolutional neural networks (CNNs), to be trained from end to end (Wang et al., 2016b) using labeled data for task-specific representations and predictions. Furthermore, we introduce several attention mechanisms to interpret predictions by isolating main contributors of molecular fragments or their pairs, which is further exploited for predicting binding sites and origins of binding specificity. Lastly, we explore alternative representations using protein sequences or compound graphs (structural formulae), develop graph CNN (GCNN) in our unified RNN/GCNN-CNN model, and discuss remaining challenges.

The overall pipeline of our unified RNN-CNN method for semi-supervised learning (data representation, unsupervised learning, and joint supervised learning) is illustrated in Fig. 1, with details given next.

We used molecular data from three public datasets: labeled compound-protein binding data from BindingDB (Liu et al., 2006), compound data in the SMILES format from STITCH (Kuhn et al., 2007), and protein amino-acid sequences from UniRef (Suzek et al., 2014).

From 489,280 IC50-labeled samples collected from BindingDB, we completely excluded four classes of proteins from the training set: nuclear estrogen receptors (ER; 3,374 samples), ion channels (14,599 samples), receptor tyrosine kinases (34,318 samples), and G-protein-coupled receptors (GPCR; 60,238 samples), to test the generalizability of our framework. And we randomly split the rest into the training set (263,583 samples, including 10% held out for validation) and the default test set (113,168 samples) without the aforementioned four classes of protein targets. Similarly, we split a Ki (Kd) labeled dataset into 101,134 (8,778) samples for training, 43,391 (3,811) for testing, 516 (4) for ERs, 8,101 (366) for ion channels, 3,355 (2,306) for tyrosine kinases, and 77,994 (2,554) for GPCRs. All labels are in logarithm forms: pIC50, pKi, and pKd. More details can be found in Sec. 1.1 of Supplementary Data.

For unlabeled compound data from STITCH, we randomly chose 500K samples for training and 500K samples for validation (sizes were restricted due to computing resources) and then removed those whose SMILES string lengths are above 100, resulting in 499,429 samples for training and 484,481 for validation. For unlabeled protein data from UniRef, we used all UniRef50 samples (50% sequence-identity level) less those of lengths above 1,500, resulting in 120,000 for training and 50,525 for validation.

Only 1D sequence data are assumed available. 3D structures of proteins, compounds, or their complexes are not used.
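For concreteness, the class-held-out splitting protocol above can be sketched as follows. This is a minimal illustration, not the authors' released pipeline; the dataframe schema (a `protein_class` column) and the split ratios are assumptions for the example.

```python
# A minimal sketch of the class-held-out split described above (not the
# authors' released pipeline). The "protein_class" column name and the
# 70/30 train/test ratio are illustrative assumptions.
import pandas as pd

def split_by_protein_class(df, held_out_classes, seed=0):
    """Hold out whole protein classes as generalization sets, then split the
    remainder into training (with 10% validation) and test sets."""
    generalization = {c: df[df["protein_class"] == c] for c in held_out_classes}
    rest = df[~df["protein_class"].isin(held_out_classes)]
    rest = rest.sample(frac=1.0, random_state=seed)   # shuffle

    n_train = int(0.7 * len(rest))                    # assumed ratio
    train, test = rest.iloc[:n_train], rest.iloc[n_train:]
    n_val = int(0.1 * len(train))                     # 10% held out for validation
    val, train = train.iloc[:n_val], train.iloc[n_val:]
    return train, val, test, generalization

held_out = ["ER", "ion channel", "receptor tyrosine kinase", "GPCR"]
```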
Fig. 1. The pipeline of our unified RNN-CNN method to predict and interpret compound-protein affinity. (Panels: data representation, with compounds as SMILES strings and proteins as structurally annotated SPS components; seq2seq auto-encoders with embedding layers, RNN encoders/decoders, and attention layers pre-trained on the unlabeled STITCH and UniRef datasets; and the supervised model, in which the pre-trained RNN encoders initialize attention layers, 1D CNNs, concatenation, and fully connected layers trained on labeled BindingDB data for affinity prediction and interpretation, e.g., predicting binding site/mode.)

A popular compound representation is based on 1D binary substructure fingerprints from PubChem (Wang et al., 2009). Mainly, basic substructures of compounds are used as fingerprints by creating binary vectors of 881 dimensions.
SMILES representation.
We used SMILES (Weininger, 1988), short ASCII strings that represent compound chemical structures based on bonds and rings between atoms. 64 symbols are used for SMILES strings in our data. 4 more special symbols are introduced for the beginning or the end of a sequence, padding (to align sequences in the same batch), or not-used ones. Therefore, we defined a compound "alphabet" of 68 "letters". Compared to the baseline representation, which uses k-hot encoding, canonical SMILES strings fully and uniquely determine chemical structures and are yet much more compact.

Previously, the most common protein representation for CPI classification was a 1D binary vector whose dimensions correspond to thousands of (5,523 in (Tian et al., 2016)) Pfam domains (Finn et al., 2014) (structural units), and 1's are assigned based on k-hot encoding (Tabei and Yamanishi, 2013; Cheng et al., 2016). We considered all types of Pfam entries (family, domain, motif, repeat, disorder, and coiled coil) for better coverage of structural descriptions, which leads to 16,712 entries (Pfam 31.0) as features. Protein sequences are queried in batches against Pfam using the web server HMMER (hmmscan) (Finn et al., 2015) with the default gathering threshold.
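As a concrete illustration of feeding SMILES strings to sequence models, the sketch below encodes a string character-by-character over a small stand-in vocabulary. The paper's actual 64 chemical symbols are defined in its Supplementary Data, so the symbol set here is an assumption, and multi-character atoms such as Cl are treated naively.

```python
# A minimal sketch of SMILES-to-index encoding with <go>/<eos>/<pad>/<unk>
# special symbols. The chemical symbol set below is a stand-in assumption
# for the paper's 64-symbol vocabulary (68 "letters" in total there).
SPECIAL = ["<pad>", "<go>", "<eos>", "<unk>"]
CHEM = list("CNOSPFIclnos=#()[]+-@/\\.%0123456789")
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + CHEM)}

def encode_smiles(smiles, max_len=100):
    """Character-level encoding, padded so batched sequences align."""
    ids = [VOCAB["<go>"]] + [VOCAB.get(ch, VOCAB["<unk>"]) for ch in smiles]
    ids.append(VOCAB["<eos>"])
    ids += [VOCAB["<pad>"]] * (max_len + 2 - len(ids))
    return ids

print(encode_smiles("CC(=O)Oc1ccccc1C(=O)O")[:10])   # aspirin, first few indices
```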
Structural property sequence (SPS) representation.
Although 3D structure data of proteins are often a luxury and their prediction remains a challenge without templates, much progress has been made in predicting protein structural properties from sequences (Cheng et al., 2005; Magnan and Baldi, 2014; Wang et al., 2016a). We used SSPro/ACCPro (Magnan and Baldi, 2014) to predict the secondary structure class (α-helix, β-strand, and coil) and solvent accessibility (exposed or not) for each residue, and group neighboring residues of the same secondary structure class into secondary structure elements (SSEs). The details and the pseudo-code for SSE are in Algorithm 1 (Supplementary Data).

Each SSE is further classified: solvent exposed if at least 30% of residues are and buried otherwise; polar, non-polar, basic or acidic based on the highest odds (for each type, occurrence frequency in the SSE is normalized by the background frequency seen in all protein sequences to remove the effect from group-size difference); and short, medium, or long based on length thresholds (Table S1). In this way, we defined 4 separate alphabets of 3, 2, 4 and 3 letters, respectively, to characterize SSE category, solvent accessibility, physicochemical characteristics, and length (Table S1), and combined letters from the 4 alphabets in the order above to create 72 "words" (4-tuples) to describe SSEs. Pseudo-code for the protein representation is shown as Algorithm 2 in Supplementary Data. Considering the 4 more special symbols introduced similarly for compound SMILES strings, we flattened the 4-tuples and thus defined a protein SPS "alphabet" of 76 "letters".

The SPS representation overcomes drawbacks of the Pfam-based baseline representation: it provides higher resolution of sequence and structural details for more challenging regression tasks, more distinguishability among proteins in the same family, and more interpretability on which protein segments (SSEs here) are responsible for predicted affinity. All these are achieved with a much smaller alphabet of size 76, which leads to an around 100-times more compact representation of a protein sequence than the baseline. In addition, the SPS sequences are much shorter than amino-acid sequences, which prevents convergence issues when training RNNs and LSTMs for sequences longer than 1,000 (Li et al., 2018).
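To make the 4-tuple construction concrete, the sketch below composes one SPS "word" from per-residue predictions. The single-letter codes and length thresholds are illustrative assumptions; the exact definitions are in Algorithm 2 and Table S1 of the Supplementary Data.

```python
# A minimal sketch of composing one SPS "word" (4-tuple) for an SSE,
# following the four alphabets described above. Letter codes and length
# thresholds are illustrative assumptions, not the paper's Table S1.
def sps_word(ss_class, frac_exposed, residue_odds, length,
             short_cut=7, medium_cut=15):             # assumed thresholds
    ss = {"helix": "H", "strand": "E", "coil": "C"}[ss_class]
    acc = "E" if frac_exposed >= 0.30 else "B"        # exposed if >=30% of residues are
    # physicochemical class with the highest background-normalized odds
    phys = max(residue_odds, key=residue_odds.get)    # e.g. "P","N","B","A"
    size = "S" if length <= short_cut else ("M" if length <= medium_cut else "L")
    return ss + acc + phys + size

# e.g. a buried, mostly non-polar beta strand of 12 residues -> 'EBNM'
print(sps_word("strand", 0.1, {"P": 0.8, "N": 1.6, "B": 0.9, "A": 0.7}, 12))
```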
We encode compound SMILES or protein SPS into representations, first by unsupervised deep learning from abundant unlabeled data. We used a recurrent neural network (RNN) model, seq2seq (Sutskever et al., 2014), that has seen much success in natural language processing and was recently applied to embedding compound SMILES strings into fingerprints (Xu et al., 2017). A seq2seq model is an auto-encoder that consists of two recurrent units known as the encoder and the decoder, respectively (see the corresponding box in Fig. 1). The encoder maps an input sequence (SMILES/SPS in our case) to a fixed-dimension vector known as the thought vector. Then the decoder maps the thought vector to the target sequence (again, SMILES/SPS here). We choose the gated recurrent unit (GRU) (Cho et al., 2014) as our default seq2seq model and treat the thought vectors as the representations learned from the SMILES/SPS inputs. The detailed GRU configuration and advanced variants (bucketing, bidirectional GRU, and attention mechanism, which provides a way to "focus" for encoders) can be found in Sec. 1.4 of Supplementary Data. Through unsupervised pre-training, the learned representations capture nonlinear joint dependencies among protein residues or compound atoms that are far from each other in sequence. Such "long-term" dependencies are very important to CPIs since corresponding residues or atoms can be close in 3D structures and jointly contribute to intermolecular interactions.
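A minimal PyTorch sketch of such a GRU-based seq2seq auto-encoder is given below. The paper's implementation (with bucketing, bidirectional, and attention variants) is in its repository; layer sizes and teacher forcing here are illustrative assumptions.

```python
# A minimal sketch of the GRU seq2seq auto-encoder used for unsupervised
# pre-training. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Seq2SeqAutoEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt):
        # Encode the SMILES/SPS string into a fixed-size "thought vector".
        _, thought = self.encoder(self.embed(src))        # (1, B, H)
        # Decode the same string back, conditioned on the thought vector
        # (teacher forcing: tgt is the input sequence shifted by one step).
        dec_out, _ = self.decoder(self.embed(tgt), thought)
        return self.out(dec_out), thought.squeeze(0)

model = Seq2SeqAutoEncoder(vocab_size=68)                 # compound alphabet size
src = torch.randint(0, 68, (4, 100))                      # toy batch of encoded strings
logits, thought = model(src, src)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 68), src.reshape(-1))
```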
With compound and protein representations learned from the above unsupervised learning, we solve the regression problem of compound-protein affinity prediction using supervised learning. For either proteins or compounds, we append a CNN after the RNN (encoders and attention models only) that we just trained. The CNN model consists of a one-dimensional (1D) convolution layer followed by a max-pooling layer. The outputs of the two CNNs (one for proteins and the other for compounds) are concatenated and fed into two more fully connected layers. The entire RNN-CNN pipeline is trained from end to end (Wang et al., 2016b), with the pre-trained RNNs serving as warm initializations, for improved performance over two-step training. The pre-trained RNN initializations prove to be very important for the non-convex training process (Sutskever et al., 2013). In comparison to such a "unified" model, we also include a "separate" RNN-CNN baseline, in which we fix the learned RNN part and train the CNN on top of its outputs.
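The overall supervised architecture can be sketched as below (attention omitted for brevity; see the next subsection). Layer sizes are illustrative assumptions rather than the paper's tuned hyper-parameters, except that the (300, 100) fully connected sizes echo one configuration mentioned later.

```python
# A minimal sketch of the unified RNN-CNN regressor: GRU encoders (warm-
# initialized from the seq2seq encoders) + 1D conv + max-pooling per branch,
# concatenation, and fully connected layers. Sizes are assumptions.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """RNN encoder + 1D CNN for one molecule type (protein or compound)."""
    def __init__(self, vocab, emb=128, hid=256, filters=64, kernel=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hid, batch_first=True)   # init from seq2seq encoder
        self.conv = nn.Conv1d(hid, filters, kernel)
        self.pool = nn.AdaptiveMaxPool1d(1)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))                  # (B, L, hid)
        h = torch.relu(self.conv(h.transpose(1, 2)))    # (B, filters, L')
        return self.pool(h).squeeze(-1)                 # (B, filters)

class UnifiedRNNCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.compound, self.protein = Branch(68), Branch(76)
        self.head = nn.Sequential(nn.Linear(128, 300), nn.ReLU(),
                                  nn.Linear(300, 100), nn.ReLU(),
                                  nn.Linear(100, 1))    # predicted pIC50/pKi/pKd

    def forward(self, comp, prot):
        z = torch.cat([self.compound(comp), self.protein(prot)], dim=1)
        return self.head(z).squeeze(-1)

model = UnifiedRNNCNN()
affinity = model(torch.randint(0, 68, (4, 100)), torch.randint(0, 76, (4, 152)))
```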
We have also introduced three attention mechanisms to unified RNN-CNN models. The goal is to both improve predictive performances and enable model interpretability at the level of "letters" (SSEs in proteins and atoms in compounds) and their pairs.

1) Separate attention. This default attention mechanism is applied to the compound and the protein separately, so the attention learned on each side is non-specific to a compound-protein pair. However, it has the fewest parameters among the three mechanisms.

2) Marginalized attention. To introduce pair-specific attentions, we first use a pairwise "interaction" matrix for a pair and then marginalize it based on maximization over rows or columns for separate compound or protein attention models, which is motivated by Lu et al. (2016).

3) Joint attention. We have developed this novel attention model to fully explain the pairwise interactions between components (compound atoms and protein SSEs). Specifically, we use the same pairwise interaction matrix but learn to represent the pairwise space and consider attentions on pairwise interactions rather than "interfaces" on each side. Among the three attention mechanisms, joint attention provides the best interpretability albeit with the most parameters.

These attention models (for proteins, compounds, or their pairs) are jointly trained with the RNN encoders and the CNN part. Their learned parameters include attention weights on all "letters" for a given string (or those on all letter-pairs for a given string-pair). Compared to that in unsupervised learning, each attention model here outputs a single vector as the input to its corresponding subsequent 1D-CNN model. More details on unified RNN-CNN and attention mechanisms can be found in Sec. 1.5 of Supplementary Data.
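The sketch below illustrates one simplified reading of the joint attention idea: a learnable interaction matrix scores every (protein SSE, compound atom) pair, a softmax over all pairs yields attentions α_ij, and a pooled vector feeds the downstream CNN. The exact parameterization is in Sec. 1.5 of the Supplementary Data; this is an assumption-laden simplification, not the paper's precise formulation.

```python
# A minimal sketch of joint attention over (SSE, atom) pairs.
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)  # pairwise interaction matrix

    def forward(self, prot, comp):
        # prot: (B, I, dim) SSE encodings; comp: (B, J, dim) atom encodings
        scores = torch.einsum("bid,de,bje->bij", prot, self.W, comp)
        alpha = torch.softmax(scores.flatten(1), dim=1).view_as(scores)  # (B, I, J)
        # attention-weighted summary of both sides' contributions
        context = torch.einsum("bij,bid->bd", alpha, prot) + \
                  torch.einsum("bij,bjd->bd", alpha, comp)
        return context, alpha        # alpha is later used for interpretation

attn = JointAttention(dim=256)
ctx, alpha = attn(torch.randn(2, 38, 256), torch.randn(2, 60, 256))
```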
We compared the auto-encoding performances of our vanilla seq2seq model and 4 variants: bucketing, bi-directional GRU ("fw+bw"), attention mechanism, and attention mechanism with fw+bw, respectively, in Tables S3 and S4 (Supplementary Data). We used the common assessment metric in language models, perplexity, which is related to the entropy H of the modeled probability distribution P (Perp(P) = 2^H(P) ≥ 1). First, the vanilla seq2seq model had lower test-set perplexity for compound SMILES than protein SPS (7.07 versus 41.03), which echoes the fact that, compared to protein SPS strings, compound SMILES strings are defined in an alphabet of fewer letters (68 versus 76) and are of shorter lengths (100 versus 152), thus their RNN models are easier to learn. Second, bucketing, the most ad-hoc option among all, did not improve the results much. Third, bi-directional GRUs lowered perplexity by about 2, and the best performance came from combining them with the attention mechanism (Tables S3 and S4).
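For reference, perplexity can be computed from the average per-symbol cross-entropy, as in the minimal sketch below.

```python
# A minimal sketch of the perplexity metric, Perp(P) = 2^H(P), computed from
# the average per-symbol cross-entropy (in bits) of reconstructed strings.
import math

def perplexity(log_probs_per_symbol):
    """log_probs_per_symbol: natural-log likelihoods of the true symbols."""
    nats = -sum(log_probs_per_symbol) / len(log_probs_per_symbol)  # entropy in nats
    bits = nats / math.log(2)                                      # H(P) in bits
    return 2.0 ** bits                                             # = exp(nats)

# A near-perfect auto-encoder assigns probability ~1 per symbol -> perplexity ~1.
print(perplexity([math.log(0.99)] * 50))   # ~1.01
```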
To assess how useful the learned/encoded protein and compound representations are for predicting compound-protein affinity, we compared the novel and baseline representations in affinity regression using the labeled datasets. The representations were compared under the same shallow machine learning models: ridge regression, lasso regression, and random forest (RF).

                    Baseline representations                  Novel representations
                    Ridge        Lasso        RF             Ridge        Lasso        RF
Training            1.16 (0.60)  1.16 (0.60)  0.76 (0.86)    1.23 (0.54)  1.22 (0.55)  — (0.91)
Testing             1.16 (0.60)  1.16 (0.60)  — (0.78)       1.23 (0.54)  1.22 (0.55)  — (0.78)
ER                  1.43 (0.30)  1.43 (0.30)  1.44 (0.37)    1.46 (0.18)  1.48 (0.18)  — (0.26)
Ion Channel         1.32 (0.22)  1.34 (0.20)  1.30 (0.22)    1.26 (0.23)  1.32 (0.17)  — (0.30)
GPCR                — (0.22)     1.30 (0.22)  1.32 (0.28)    1.34 (0.20)  1.37 (0.17)  1.40 (0.25)
Tyrosine Kinase     — (0.38)     1.16 (0.38)  1.18 (0.42)    1.50 (0.11)  1.51 (0.10)  1.58 (0.11)
Time (core hours)   3.5          7.4          1239.8         0.47         2.78         668.7
Memory (GB)         7.6          7.6          8.3            7.3          7.3          6.3
Table 1. Comparing the novel representations to the baseline based on RMSE (and Pearson correlation coefficient r) of pIC50 shallow regression.

From Table 1, we found that our novel representations learned from SMILES/SPS strings by seq2seq models outperform baseline representations of k-hot encoding of molecular/Pfam features. For the best-performing random forest models, using 46% less training time and 24% less memory, the novel representations achieved the same performance over the default test set as the baseline ones and lowered root mean squared errors (RMSE) for two of the four generalization sets whose target protein classes (nuclear estrogen receptors / ER and ion channels) are not included in the training set. Similar improvements were observed on pKi, pKd, and pEC50 predictions in Tables S5–7 (Supplementary Data), respectively. These results show that learning protein and compound representations from even unlabeled datasets alone could improve their context-relevance for various labels. We also note that, unlike Pfam-based protein representations that exploit curated information only available to some proteins and their homologs, our SPS representations do not assume such information and can apply to uncharacterized proteins lacking annotated homologs.

Using the novel representations, we next compared the performances of affinity regression between the best shallow model (random forest) and various deep models. For both separate and unified RNN-CNN models, we tested results from a single model with (hyper)parameters optimized over the training/validation set, averaging a "parameter ensemble" of 10 models derived in the last 10 epochs, and averaging a "parameter+NN" ensemble of models with varying numbers of neurons in the fully connected layers ((300,100), (400,200) and (600,300)) trained in the last 10 epochs. The attention mechanism used here is the default, separate attention.

From Table 2, we noticed that unified RNN-CNN models outperform both random forest and separate RNN-CNN models (the similar performances between RF and separate RNN-CNN indicated a potential to further improve RNN-CNN models with deeper models). By using a relatively small amount of labeled data (which are usually expensive and limited), protein and compound representations learned from abundant unlabeled data can be tuned to be more task-specific. We also noticed that averaging an ensemble of unified RNN-CNN models further improves the performances, especially for some generalization sets of ion channels and GPCRs. As anticipated, averaging ensembles of models reduces the variance originating from network architecture and parameter optimization and thus reduces expected generalization errors. Similar observations were made for pKi predictions as well (Table S8 in Supplementary Data), even when their hyper-parameters were not particularly optimized and simply borrowed from pIC50 models. Impressively, unified RNN-CNN models without very deep architectures could predict IC50 values with relative errors within 5-fold (or 1.0 kcal/mol) for the test set and even around 20-fold (or 1.8 kcal/mol) on average for protein classes not seen in the training set. Interestingly, GPCRs and ion channels had similar RMSE but rather different Pearson's r, which is further described by the distributions of predicted versus measured pIC50 values for various sets (Fig. S5 in Supplementary Data).
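The correspondence between RMSE in pIC50 units, fold error in IC50, and binding free energy follows from ΔG = −RT ln(10) · pIC50 (treating IC50 as a binding constant, with T = 298 K assumed); the worked conversion below reproduces the 5-fold/1.0 kcal/mol and 20-fold/1.8 kcal/mol figures.

```python
# Worked conversion between RMSE in pIC50 units, fold error in IC50, and
# binding free energy (T = 298 K assumed).
import math

RT = 0.001987 * 298            # kcal/mol at room temperature

def rmse_to_fold(rmse_pic50):
    return 10.0 ** rmse_pic50  # relative (fold) error in IC50

def rmse_to_kcal(rmse_pic50):
    return RT * math.log(10) * rmse_pic50

print(rmse_to_fold(0.7), rmse_to_kcal(0.7))   # ~5-fold, ~1.0 kcal/mol
print(rmse_to_fold(1.3), rmse_to_kcal(1.3))   # ~20-fold, ~1.8 kcal/mol
```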
To assess the predictive powers of the three attention mechanisms introduced, we compared their pIC50 predictions in Table 3 using the same dataset and the same unified RNN-CNN models as before. All attention mechanisms had similar performances on the training and test sets. However, as we anticipated, separate attention with the fewest parameters edged joint attention in generalization (especially for receptor tyrosine kinases). Meanwhile, joint attention had similar predictive performances and much better interpretability, and thus will be further examined in all interpretability studies in case studies for selective drugs.

Using the generalization sets, we proceed to explain and address our unified RNN-CNN models' relatively worse performances for new classes of protein targets without any training data. We chose to analyze separate attention models with the best generalization results and first noticed that proteins in various sets have different distributions in the SPS alphabet (4-tuples). In particular, the test set, ion channels/GPCRs/tyrosine kinases, and estrogen receptors are increasingly different from the training set (measured by Jensen-Shannon distances in SPS letter or SPS length distribution) (Fig. S3 in Supplementary Data), which correlated with increasingly deteriorating performance relative to the training set (measured by the relative difference in RMSE) with a Pearson correlation coefficient of 0.68 (SPS letter distribution) or 0.96 (SPS length distribution) (Fig. S4 in Supplementary Data).

To improve the performances for new classes of proteins, we compare two strategies: re-training shallow models (random forest) from scratch based on new training data alone, and "transferring" original deep models (unified parameter+NN ensemble with the default separate attention) to fit new data (see details in Supplementary Data). The reason is that new classes of targets often have few labeled data, which might be adequate for re-training class-specific shallow models from scratch but not for deep models with many more parameters.

As shown in Fig. 2, deep transfer learning models increasingly improved the predictive performance compared to the original deep learning models, when increasing amounts of labeled data for new protein classes are made available. The improvement was significant even at low training coverage for each new protein class. Notably, deep transfer learning models outperformed random forest models that were re-trained specifically for each new protein class.
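A minimal sketch of the transfer strategy, fine-tuning the previously defined unified model on a small labeled set with a small learning rate, is given below; the optimizer choice and hyper-parameters are illustrative assumptions, not the paper's exact protocol.

```python
# A minimal sketch of transfer learning: fine-tune a pre-trained
# UnifiedRNNCNN (defined earlier) on the few labeled samples of a new
# protein class. Hyper-parameters are illustrative assumptions.
import torch

def transfer(model, new_loader, epochs=20, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # small lr: stay near warm start
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for comp, prot, y in new_loader:               # e.g. a few dozen GPCR samples
            opt.zero_grad()
            loss = mse(model(comp, prot), y)
            loss.backward()
            opt.step()
    return model
```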
We went on to test how well our unified RNN-CNN models could predict certain drugs' target selectivity, using 3 sets of drug-target interactions of increasing prediction difficulty. Our novel representations and models successfully predicted target selectivity for 6 of 7 drugs, whereas baseline representations and shallow models (random forest) failed for most drugs.

Thrombin and factor X (Xa) are important proteins in the blood coagulation cascade. Antithrombotics, inhibitors for such proteins, have been developed to treat cardiovascular diseases. Due to thrombin's other significant roles in cellular functions and neurological processes, it is desirable to develop inhibitors specific to factor Xa. DX-9065a is such a selective inhibitor, with a pKi value of 7.39 for Xa and a much lower one for thrombin (Brandstetter et al., 1996).

            Baseline rep. + RF   Novel rep. + RF   Novel rep. + DL (sep. attn.)   Novel rep. + DL (joint attn.)
Thrombin    6.36                 —                 —                              —
Factor Xa   —                    —                 —                              —
Table 4. Predicted pKi values and target specificity for compound DX-9065a interacting with human factor Xa and thrombin.

We used the learned pKi models in this study. Both proteins (thrombin and factor Xa) were included in the Ki training set, with 2,294 and 2,331 samples respectively, but their interactions with the compound DX-9065a were not. Table 4 suggested that random forest correctly predicted the target selectivity (albeit with a smaller than 0.5-unit difference) using baseline representations but failed to do so using novel representations. In contrast, our models with separate and joint attention mechanisms both correctly predicted the compound's favoring of Xa. Moreover, our models predicted selectivity levels of 2.4 (separate attention) and 3.9 (joint attention) in pKi difference (ΔpKi), where the joint attention model produced predictions very close to the known selectivity margin.

The COX protein family represents an important class of drug targets for inflammatory diseases. These enzymes responsible for prostaglandin biosynthesis include COX-1 and COX-2 in human, both of which can be inhibited by nonsteroidal anti-inflammatory drugs (NSAIDs).
We chose three common NSAIDs known for human COX-1/2 selectivity: celecoxib (pIC50 for COX-1: 4.09; COX-2: 5.17), ibuprofen (COX-1: 4.92, COX-2: 4.10), and rofecoxib (COX-1: <4; COX-2: 4.6) (Luo et al., 2017). This is a very challenging case for selectivity prediction because the selectivity levels of all NSAIDs are close to or within 1 unit of pIC50. We used the learned pIC50 ensemble models in this study. COX-1 and COX-2 both exist in our IC50 training set with 959 and 2,006 binding examples, respectively, including 2 of the 6 compound-protein pairs (ibuprofen and celecoxib with COX-1 individually).
                                  RF            Separate RNN-CNN models                       Unified RNN-CNN models
                                                single        param. ens.   param.+NN ens.    single        param. ens.   param.+NN ens.
Training                          0.63 (0.91)   0.68 (0.88)   0.67 (0.90)   0.68 (0.89)       0.47 (0.94)   0.45 (0.95)   — (0.95)
Testing                           0.91 (0.78)   0.94 (0.76)   0.92 (0.77)   0.90 (0.79)       0.78 (0.84)   0.77 (0.84)   — (0.86)
Generalization – ER               — (0.26)      1.45 (0.24)   1.44 (0.26)   1.43 (0.28)       1.53 (0.16)   1.52 (0.19)   1.46 (0.30)
Generalization – Ion Channel      — (0.30)      1.36 (0.18)   1.33 (0.18)   1.29 (0.25)       1.34 (0.17)   1.33 (0.18)   1.30 (0.18)
Generalization – GPCR             1.40 (0.25)   1.44 (0.19)   1.41 (0.20)   1.37 (0.23)       1.40 (0.24)   1.40 (0.24)   — (0.30)
Generalization – Tyrosine Kinase  1.58 (0.11)   1.66 (0.09)   1.62 (0.10)   1.54 (0.12)       1.24 (0.39)   1.25 (0.38)   — (0.42)
Table 2. Under novel representations learned from seq2seq, comparing random forest and variants of separate RNN-CNN and unified RNN-CNN models based on RMSE (and Pearson correlation coefficient r) for pIC50 prediction.

                                  Separate attention                            Marginalized attention                        Joint attention
                                  single        param. ens.   param.+NN ens.    single        param. ens.   param.+NN ens.    single        param. ens.   param.+NN ens.
Training                          0.47 (0.94)   0.45 (0.95)   0.44 (0.95)       0.50 (0.94)   0.47 (0.95)   0.42 (0.96)       0.48 (0.94)   0.44 (0.94)   — (0.95)
Testing                           0.78 (0.84)   0.77 (0.84)   — (0.86)          0.81 (0.83)   0.79 (0.84)   — (0.86)          0.84 (0.82)   0.80 (0.83)   — (0.86)
Generalization – ER               1.53 (0.16)   1.52 (0.19)   1.46 (0.30)       1.69 (0.20)   1.67 (0.20)   1.53 (0.30)       1.78 (0.03)   1.68 (0.04)   — (0.23)
Generalization – Ion Channel      1.34 (0.17)   1.33 (0.18)   — (0.18)          1.63 (0.01)   1.64 (0.06)   1.41 (0.13)       1.54 (0.25)   1.53 (0.26)   1.42 (0.26)
Generalization – GPCR             1.40 (0.24)   1.40 (0.24)   — (0.30)          1.59 (0.17)   1.57 (0.18)   1.42 (0.24)       1.53 (0.19)   1.53 (0.19)   1.38 (0.25)
Generalization – Tyrosine Kinase  1.24 (0.39)   1.25 (0.38)   — (0.42)          1.69 (0.22)   1.62 (0.25)   1.50 (0.32)       2.22 (0.18)   2.17 (0.21)   2.04 (0.17)
Table 3. Under novel representations learned from seq2seq, comparing different attention mechanisms of unified RNN-CNN models based on RMSE (and Pearson correlation coefficient r) for pIC50 prediction.

Fig. 2. Comparing strategies to generalize predictions for four sets of new protein classes (nuclear estrogen receptor, ion channel, GPCR, and receptor tyrosine kinase): original random forest (RF), original param.+NN ensemble of unified RNN-CNN models (DL for deep learning with the default attention), and re-trained RF or transfer DL using incremental amounts of labeled data in each set.
         Baseline rep. + RF      Novel rep. + RF         Novel rep. + DL (sep. attn.)   Novel rep. + DL (joint attn.)
         CEL    IBU    ROF       CEL    IBU    ROF       CEL    IBU    ROF              CEL    IBU    ROF
COX-1    6.06   5.32   5.71      6.41   6.12   6.13      5.11   —      —                —      —      —
COX-2    —      —      —         —      —      —         —      —      —                —      —      —
Table 5. Predicted pIC50 values and target specificity for three NSAIDs (CEL: celecoxib, IBU: ibuprofen and ROF: rofecoxib) interacting with human COX-1 and COX-2.

From Table 5, we noticed that, using the baseline representations, random forest incorrectly predicted COX-1 and COX-2 to be equally favorable targets for each drug. This is because the two proteins are from the same family and their representations in Pfam domains are indistinguishable. Using the novel representations, random forest correctly predicted target selectivity for two of the three drugs (celecoxib and rofecoxib), whereas our unified RNN-CNN models (both attention mechanisms) did so for all three. Even though the selectivity levels of the NSAIDs are very challenging to predict, our models were able to predict all selectivities correctly, with the caveat that a few predicted differences might not be statistically significant (for instance, the 0.03-unit difference for rofecoxib using joint attention).
Protein-tyrosine kinases and protein-tyrosine phosphatases (PTPs) control reversible tyrosine phosphorylation reactions, which are critical for regulating metabolic and mitogenic signal transduction processes. Selective PTP inhibitors are sought for the treatment of various diseases including cancer, autoimmunity, and diabetes. Compound 1 [2-(oxalylamino)-benzoic acid or OBA] and its derivatives, compounds 2 and 3 (PubChem CID: 44359299 and 90765696), are highly selective toward PTP1B rather than other proteins in the family such as PTPRA, PTPRE, PTPRC and SHP1 (Iversen et al., 2000). Specifically, the pKi values of OBA, compound 2, and compound 3 against PTP1B are 4.63, 4.25, and 6.69, respectively; and their pKi differences to the closest PTP family protein are 0.75, 0.7, and 2.47, respectively (Iversen et al., 2000).

We used the learned pKi ensemble models in this study. PTP1B, PTPRA, PTPRC, PTPRE and SHP1 were included in the Ki training set with 343, 33, 16, 6 and 5 samples, respectively. These examples just included OBA binding to all but SHP1, and compound 2 binding to PTPRC.

         Baseline rep. + RF      Novel rep. + RF         Novel rep. + DL (sep. attn.)   Novel rep. + DL (joint attn.)
Protein  Comp1  Comp2  Comp3     Comp1  Comp2  Comp3     Comp1  Comp2  Comp3            Comp1  Comp2  Comp3
PTP1B    —      —      —         —      —      —         —      —      —                —      —      —
PTPRA    4.15   3.87   5.17      6.29   6.59   6.27      2.73   2.90   3.44             2.39   2.62   2.12
PTPRC    4.15   3.87   5.17      —      —      —         —      —      —                —      —      —
PTPRE    —      —      —         —      —      —         —      —      —                —      —      —
SHP1     —      —      —         —      —      —         —      —      —                —      —      —

Table 6. Predicted pKi values and target specificity for three PTP1B-selective compounds interacting with five proteins in the human PTP family.

Results in Table 6 showed that random forest using baseline representations cannot tell binding affinity differences within the PTP family, as the proteins' Pfam descriptions are almost indistinguishable. Using novel representations, random forest incorrectly predicted target selectivity for all 3 compounds, whereas unified RNN-CNN models with both attention mechanisms correctly did so for all but one (compound 1, OBA). We also noticed that, although the separate attention model predicted likely insignificant selectivity levels for compounds 2 and 3, the joint attention model much improved the predicted selectivity margins (e.g., ΔpKi = 0.82 for compound 3) and their statistical significances.

After successfully predicting target selectivity for some drugs, we proceed to explain, using attention scores, how our deep learning models did so and what they reveal about those compound-protein interactions.
Given that SPS and SMILES strings are interpretable and attention models between RNN encoders and 1D convolution layers can report their focus, we pinpoint SSEs in proteins and atoms in compounds with high attention scores, which are potentially responsible for CPIs. To assess the idea, we chose 3 compound-protein pairs that have 3D crystal complex structures in the Protein Data Bank, and extracted residues in direct contact with ligands (their SSEs are regarded as ground truth for the binding site) for each protein from ligplot diagrams provided through PDBsum (De Beer et al., 2013). Based on joint attention scores α_ij on pairs of protein SSE i and compound atom j from the single unified RNN-CNN model, we picked the top 10% (4) SSEs as predicted binding sites. Specifically, we first corrected joint attention scores to be β_ij = α_ij − (Σ_{k=1}^{I} α_kj)/I (for all i = 1, …, I and j = 1, …, J) to offset the contribution of any compound atom j with promiscuous attentions over all protein SSEs. We then calculated the attention score β_i for protein SSE i by max-marginalization (β_i = max_j β_ij). No negative β_i was found in this case, thus no further treatment was adopted.
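In code, the correction and max-marginalization amount to the few numpy lines below.

```python
# The attention post-processing above: offset each pair score by its column
# mean (promiscuous-atom correction), then max-marginalize over atoms to
# score each protein SSE.
import numpy as np

def sse_attention_scores(alpha):
    """alpha: (I, J) joint attention over protein SSEs x compound atoms."""
    beta = alpha - alpha.mean(axis=0, keepdims=True)  # beta_ij = alpha_ij - (sum_k alpha_kj)/I
    return beta.max(axis=1)                           # beta_i = max_j beta_ij

alpha = np.random.rand(38, 60)                        # toy scores
beta_i = sse_attention_scores(alpha)
top = np.argsort(beta_i)[::-1][:4]                    # top 10% (4) SSEs as predicted site
```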
Table 7. Interpreting deep learning models: predicting binding sites based on joint attentions. (Columns: target–drug pair, PDB ID, number of SSEs in total and at the binding site, and the top 10% (4) SSEs predicted as binding site by joint attention.)
Table 7 shows that, compared to randomly ranking the SSEs, our approach can enrich binding site prediction by 1.7-fold or more. t-tests (see details in Sec. 1.7 of Supplementary Data) suggested that binding sites enjoyed higher attention scores than non-binding sites in a statistically significant way. When the strict definition of binding sites is relaxed to residues within 5 Å of any heavy atom of the ligand, results were further improved, with all top 10% SSEs of factor Xa being at the binding site (Table S10).

We delved into the predictions for the factor Xa–DX-9065a interaction in Fig. 3 (the other 2 are in Fig. S6 of Supplementary Data). Warmer colors (higher attentions) are clearly focused near the ligand. The red loops connected through a β-strand (resi. 171–196) were correctly predicted to be at the binding site with a high rank of 2, thus a true positive (TP). The SSE ranked first, a false positive, is its immediate neighbor in sequence (resi. 162–170; red helix at the bottom) and is near the ligand. In fact, as mentioned before, when the binding site definition is relaxed, all top 10% SSEs were at the binding site. Therefore, in the current unified RNN-CNN model with attention mechanism, wrong attention could be paid to sequence neighbors of the ground truth; and additional information (for instance, 2D contact maps or 3D structures of proteins, if available) could be used as additional inputs to reduce such false positives.

We also max-marginalized β_ij over protein SSE i for β_j, the attention score on atom j of the compound. Many high attention scores were observed for compound atoms (Fig. S7), which is somewhat intuitive, as small-molecule compounds usually fit in protein pockets or grooves almost entirely. The top-ranked atom happened to be a nitrogen atom forming a hydrogen bond with an aspartate (Asp189) of factor Xa, although more cases need to be studied more thoroughly for a conclusion.
Fig. 3. Interpreting deep learning models for factor Xa binding-site prediction based on joint attention: 3D structure of factor Xa (colored cartoons including helices, sheets, and coils) in complex with DX-9065a (black sticks) (PDB ID: 1FAX), where protein SSEs are color-coded by attention scores (β_i), with warmer colors indicating higher attentions.

To predictively explain the selectivity origin of compounds, we designed an approach to compare attention scores between pairs of CPIs and tested it using factor Xa-selective DX-9065a with known specificity origin. For selective compounds that interact with factor Xa over thrombin, position 192 has been identified: it is a charge-neutral polar glutamine (Gln192) in Xa but a negatively-charged glutamate (Glu192) in thrombin (Huggins et al., 2012). DX-9065a exploits this difference with a carboxylate group forming unfavorable electrostatic repulsion with Glu192 in thrombin but a favorable hydrogen bond with Gln192 in Xa. To compare DX-9065a interacting with the two proteins, we performed amino-acid sequence alignment between the proteins and split the two sequences of mis-matched SSEs (count: 31 and 38) into those of perfectly matched segments (count: 50 and 50). In the end, segment 42, where SSE 26 of Xa and SSE 31 of thrombin align, is the ground truth containing position 192 for target selectivity.

For DX-9065a interacting with either factor Xa or thrombin, we ranked the SSEs based on the attention scores from the unified RNN-CNN single model and assigned each segment the same rank as its parent SSE. Due to the different SSE counts between thrombin and factor Xa, we normalized each rank for segment i by the corresponding SSE count for a rank ratio r_i. For each segment, we then subtracted from 1 the average of rank ratios between the factor Xa and thrombin interactions, so that segments highly attended in both proteins are scored higher. Fig. 4 shows that the ground-truth segment, in red, was ranked 2nd among 50 segments, albeit with narrow margins over the next 3 segments.
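The segment scoring can be sketched as follows, with the SSE counts (31 for factor Xa, 38 for thrombin) taken from the text and the ranks as toy inputs.

```python
# A minimal sketch of the segment scoring above: each aligned segment
# inherits its parent SSE's attention rank in each protein, ranks are
# normalized by that protein's SSE count, and the score is one minus the
# average rank ratio across the two compound-protein interactions.
import numpy as np

def segment_scores(rank_xa, rank_thr, n_sse_xa=31, n_sse_thr=38):
    """rank_*: 1-based attention rank of each segment's parent SSE."""
    r_a = np.asarray(rank_xa, float) / n_sse_xa    # rank ratios in factor Xa
    r_b = np.asarray(rank_thr, float) / n_sse_thr  # rank ratios in thrombin
    return 1.0 - (r_a + r_b) / 2.0                 # high if attended in both

scores = segment_scores([2, 10, 31], [5, 3, 38])   # toy ranks for 3 segments
```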
Fig. 4. Interpreting deep learning models for factor Xa specificity based on joint attentions. Pairwise alignment of the amino-acid sequences of factor Xa and thrombin decomposed both sequences into 50 segments (labeled by indices on the horizontal axis). These segments are scored (vertical axis: 1 − avg(r_a, r_b)) by one less the average of the corrected attention rank ratios for the two compound-protein interactions. The ground truth of specificity origin is in red.
We lastly explore alternative representations of proteins and compounds and discuss remaining challenges.
As shown earlier, our SPS representations integrate both sequence and structure information of proteins and are much more compact compared to the original amino-acid sequences. That being said, there is value in considering a protein sequence representation at the resolution of residues rather than SSEs: potentially higher precision and interpretability. We started with unsupervised learning to encode the protein sequence representation with seq2seq. More details are given in Sec. 1.8 of Supplementary Data.
Table 8. Comparing the auto-encoding performances between amino-acid and SPS sequences using the best seq2seq model (bidirectional GRUs with attention mechanism). (Columns: SPS rep. + attention + fw/bw, and seq. rep. + attention + fw/bw; row: training error in perplexity.)
Compared to SPS representations, protein sequences are 10-times longer and demanded 10-times more GRUs in seq2seq, which implies much more expensive training. Under the limited computational budget, we trained the protein-sequence seq2seq models using twice the time limit of the SPS ones. The perplexity for the test set turned out to be over 12, which is much worse than 1.001 in the SPS case (see Sec. 3.1) and deemed inadequate for subsequent (semi-)supervised learning. Learning very long sequences is challenging in general and calls for advanced architectures of sequence models.
We have chosen SMILES representations for compounds partly due to recent advancements of sequence models, especially in the field of natural language processing. Meanwhile, the descriptive power of SMILES strings can have limitations. For instance, some syntactically invalid SMILES strings can still correspond to valid chemical structures. Therefore, we also explore chemical formulae (2D graphs) for compound representation. We replaced the RNN layers for compound sequences with a graph CNN (GCNN) in our unified model (separate attention) and kept the rest of the architecture. This new architecture is named unified RNN/GCNN-CNN. The GCNN part adopts a recently-developed method (Gao et al., 2018) for compound-protein interactions. More details can be found in Sec. 1.9 of Supplementary Data.
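For intuition, a generic neighborhood-averaging graph convolution over atoms is sketched below. This is not the exact architecture of Gao et al. (2018), only an illustration of the style of layer that replaces the compound RNN.

```python
# A minimal sketch of a graph convolution on a compound graph (atoms as
# nodes, bonds as edges): average bonded neighbors' features, then apply a
# learned linear transform and nonlinearity.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (B, N, in_dim) atom features; adj: (B, N, N) adjacency with self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1)  # node degrees
        h = torch.bmm(adj, x) / deg                   # average over bonded neighbors
        return torch.relu(self.lin(h))

x = torch.randn(2, 30, 32)                            # 30 atoms, 32 features (toy)
adj = (torch.rand(2, 30, 30) > 0.8).float()
adj = ((adj + adj.transpose(1, 2) + torch.eye(30)) > 0).float()  # symmetric + self-loops
h = GraphConv(32, 64)(x, adj)
```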
                                  SMILES rep.                                   Graph rep.
                                  single        param. ens.   param.+NN ens.    single        param. ens.   param.+NN ens.
Training                          0.47 (0.94)   0.45 (0.95)   — (0.95)          0.55 (0.92)   0.54 (0.92)   0.55 (0.92)
Testing                           0.78 (0.84)   0.77 (0.84)   — (0.86)          1.50 (0.35)   1.50 (0.35)   1.34 (0.45)
Generalization – ER               1.53 (0.16)   1.52 (0.19)   — (0.30)          1.68 (0.05)   1.67 (0.03)   1.67 (0.07)
Generalization – Ion Channel      1.34 (0.17)   1.33 (0.18)   — (0.18)          1.43 (0.10)   1.41 (0.13)   1.35 (0.12)
Generalization – GPCR             1.40 (0.24)   1.40 (0.24)   — (0.30)          1.63 (0.04)   1.61 (0.04)   1.49 (0.07)
Generalization – Tyrosine Kinase  1.24 (0.39)   1.25 (0.38)   — (0.42)          1.74 (0.01)   1.71 (0.03)   1.70 (0.03)
Table 9. Comparing unified RNN-CNN (SMILES strings for compound representation) and unified RNN/GCNN-CNN (graphs for compound representation) based on RMSE (and Pearson's correlation coefficient) for pIC50 prediction.

Results in Table 9 indicate that the unified RNN/GCNN-CNN model using compound graphs did not outperform the unified RNN-CNN model using compound SMILES in RMSE and did much worse in Pearson's correlation coefficient. These results do not show the superiority of SMILES versus graphs for compound representations per se. Rather, they show that graph models need new architectures and further developments to address the challenge. We note recent advancements in deep graph models (Gilmer et al., 2017; Coley et al., 2017; Jin et al., 2018).
Conclusion

We have developed accurate and interpretable deep learning models for predicting compound-protein affinity using only compound identities and protein sequences. By taking advantage of massive unlabeled compound and protein data besides labeled data in semi-supervised learning, we have jointly trained unified RNN-CNN models for learning context- and task-specific protein/compound representations and predicting compound-protein affinity. These models outperform baseline machine-learning models. And impressively, they achieve a relative error of IC50 within 5-fold for a comprehensive test set and even within 10-fold for generalization sets of protein classes unknown to the training set. Deeper models would further improve the results. Moreover, for the generalization sets, we have devised transfer-learning strategies to significantly improve model performance using as few as 40 labeled samples.

Compared to conventional compound or protein representations using molecular descriptors or Pfam domains, the encoded representations learned from novel structurally-annotated SPS sequences and SMILES strings improve both predictive power and training efficiency for various machine learning models. Given the novel representations with better interpretability, we have included attention mechanisms in the unified RNN-CNN models to quantify how much each part of proteins or compounds is focused on while the models are making the specific prediction for each compound-protein pair.

When applied to case studies on drugs of known target-selectivity, our models have successfully predicted target selectivity in all cases, whereas conventional compound/protein representations and machine learning models have failed in some. Furthermore, our analyses of attention weights have shown promising results for predicting protein binding sites as well as the origins of binding selectivity, thus calling for further method development for better interpretability.

For protein representation, we have chosen SSEs as the resolution for interpretability due to the known sequence-size limitation of RNN models (Li et al., 2018). One can easily increase the resolution to residue-level by simply feeding our models amino-acid sequences (preferentially of length below 1,000) instead of SPS sequences, but needs to be aware of the much increased computational burden and much worse convergence when training RNNs. For compound representation, we have started with 1D SMILES strings and have also explored 2D graph representations using graph CNN (GCNN). Although the resulting unified RNN/GCNN-CNN model did not improve against unified RNN-CNN, graphs are more descriptive for compounds, and more developments in graph models are needed to address remaining challenges.

Acknowledgments
This project is in part supported by the National Institute of General Medical Sciences of the National Institutes of Health (R35GM124952 to YS) and the Defense Advanced Research Projects Agency (FA8750-18-2-0027 to ZW). Part of the computing time is provided by the Texas A&M High Performance Research Computing.
References
Ain, Q. U., Aleksandrova, A., Roessler, F. D., and Ballester, P. J. (2015). Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening. Wiley Interdiscip Rev Comput Mol Sci, (6), 405–424.
Brandstetter, H., Kühne, A., Bode, W., Huber, R., von der Saal, W., Wirthensohn, K., and Engh, R. A. (1996). X-ray structure of active site-inhibited clotting factor Xa: implications for drug design and substrate recognition. Journal of Biological Chemistry, (47), 29988–29992.
Cang, Z. and Wei, G. W. (2017). TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Comput. Biol., (7), e1005690.
Chang, R. L., Xie, L., Xie, L., Bourne, P. E., and Palsson, B. P. (2010). Drug off-target effects predicted using structural analysis in the context of a metabolic network model. PLoS Comput. Biol., (9), e1000938.
Chen, X., Yan, C. C., Zhang, X., Zhang, X., Dai, F., Yin, J., and Zhang, Y. (2016). Drug-target interaction prediction: databases, web servers and computational models. Brief. Bioinformatics, (4), 696–712.
Cheng, F., Zhou, Y., Li, J., Li, W., Liu, G., and Tang, Y. (2012). Prediction of chemical–protein interactions: multitarget-QSAR versus computational chemogenomic methods. Molecular BioSystems, (9), 2373–2384.
Cheng, J., Randall, A. Z., Sweredoski, M. J., and Baldi, P. (2005). SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Research, (suppl_2), W72–W76.
Cheng, Z., Zhou, S., Wang, Y., Liu, H., Guan, J., and Chen, Y.-P. P. (2016). Effectively identifying compound-protein interactions by learning from positive and unlabeled examples. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
Coley, C. W., Barzilay, R., Green, W. H., Jaakkola, T. S., and Jensen, K. F. (2017). Convolutional embedding of attributed molecular graphs for physical property prediction. J Chem Inf Model, (8), 1757–1772.
De Beer, T. A., Berka, K., Thornton, J. M., and Laskowski, R. A. (2013). PDBsum additions. Nucleic Acids Research, (D1), D292–D296.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, (6), 391.
Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E. L. L., Tate, J., and Punta, M. (2014). Pfam: the protein families database. Nucleic Acids Research, (D1), D222–D230.
Finn, R. D., Clements, J., Arndt, W., Miller, B. L., Wheeler, T. J., Schreiber, F., Bateman, A., and Eddy, S. R. (2015). HMMER web server: 2015 update. Nucleic Acids Research, (W1), W30–W38.
Gao, K. Y., Fokoue, A., Luo, H., Iyengar, A., Dey, S., and Zhang, P. (2018). Interpretable drug target prediction using deep neural representation. In IJCAI, pages 3371–3377.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. CoRR, abs/1704.01212.
Gilson, M. K. and Zhou, H.-X. (2007). Calculation of protein-ligand binding affinities. Annual Review of Biophysics and Biomolecular Structure.
Gomes, J., Ramsundar, B., Feinberg, E. N., and Pande, V. S. (2017). Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603.
Huggins, D. J., Sherman, W., and Tidor, B. (2012). Rational approaches to improving selectivity in drug design. Journal of Medicinal Chemistry, (4), 1424–1444.
Iversen, L. F., Andersen, H. S., Branner, S., Mortensen, S. B., Peters, G. H., Norris, K., Olsen, O. H., Jeppesen, C. B., Lundt, B. F., Ripka, W., et al. (2000). Structure-based design of a low molecular weight, nonphosphorus, nonpeptide, and highly selective inhibitor of protein-tyrosine phosphatase 1B. Journal of Biological Chemistry, (14), 10300–10307.
Jimenez, J., Skalic, M., Martinez-Rosell, G., and De Fabritiis, G. (2018). KDEEP: Protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. J Chem Inf Model, (2), 287–296.
Jin, W., Barzilay, R., and Jaakkola, T. S. (2018). Junction tree variational autoencoder for molecular graph generation. CoRR, abs/1802.04364.
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In EMNLP, volume 3, page 413.
Keiser, M. J., Setola, V., Irwin, J. J., Laggner, C., Abbas, A., Hufeisen, S. J., Jensen, N. H., Kuijer, M. B., Matos, R. C., Tran, T. B., et al. (2009). Predicting new molecular targets for known drugs. Nature, (7270), 175.
Koh, P. W. and Liang, P. (2017). Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1885–1894. PMLR.
Kuhn, M., von Mering, C., Campillos, M., Jensen, L. J., and Bork, P. (2007). STITCH: interaction networks of chemicals and proteins. Nucleic Acids Research, (suppl_1), D684–D688.
Leach, A. R., Shoichet, B. K., and Peishoff, C. E. (2006). Prediction of protein-ligand interactions. Docking and scoring: successes and gaps. J. Med. Chem., (20), 5851–5855.
Li, S., Li, W., Cook, C., Zhu, C., and Gao, Y. (2018). Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. CoRR, abs/1803.04831.
Liu, T., Lin, Y., Wen, X., Jorissen, R. N., and Gilson, M. K. (2006). BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Research, (suppl_1), D198–D201.
Lu, J., Yang, J., Batra, D., and Parikh, D. (2016). Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297.
Luo, Y., Zhao, X., Zhou, J., Yang, J., Zhang, Y., Kuang, W., Peng, J., Chen, L., and Zeng, J. (2017). A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nature Communications, (1), 573.
Magnan, C. N. and Baldi, P. (2014). SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics, (18), 2592–2597.
Mayr, A., Klambauer, G., Unterthiner, T., and Hochreiter, S. (2016). DeepTox: Toxicity prediction using deep learning. Frontiers in Environmental Science, 80.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Power, A., Berger, A. C., and Ginsburg, G. S. (2014). Genomics-enabled drug repositioning and repurposing: insights from an IOM Roundtable activity. JAMA, (20), 2063–2064.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144. ACM.
Santos, R., Ursu, O., Gaulton, A., Bento, A. P., Donadi, R. S., Bologa, C. G., Karlsson, A., Al-Lazikani, B., Hersey, A., Oprea, T. I., and Overington, J. P. (2017). A comprehensive map of molecular drug targets. Nat Rev Drug Discov, (1), 19–34.
Shi, Y., Zhang, X., Liao, X., Lin, G., and Schuurmans, D. (2013). Protein-chemical interaction prediction via kernelized sparse learning SVM. In Pacific Symposium on Biocomputing, pages 41–52.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B., Wu, C. H., and Consortium, U. (2014). UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, (6), 926–932.
Tabei, Y. and Yamanishi, Y. (2013). Scalable prediction of compound-protein interactions using minwise hashing. BMC Systems Biology, (6), S3.
Tian, K., Shao, M., Wang, Y., Guan, J., and Zhou, S. (2016). Boosting compound-protein interaction prediction by deep learning. Methods, 64–72.
Wallach, I., Dzamba, M., and Heifets, A. (2015). AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855.
Wan, F. and Zeng, J. (2016). Deep learning with feature embedding for compound-protein interaction prediction. bioRxiv, page 086033.
Wang, S., Li, W., Liu, S., and Xu, J. (2016a). RaptorX-Property: a web server for protein structure property prediction. Nucleic Acids Research, (W1), W430–W435.
Wang, Y. and Zeng, J. (2013). Predicting drug-target interactions using restricted Boltzmann machines. Bioinformatics, (13), i126–i134.
Wang, Y., Xiao, J., Suzek, T. O., Zhang, J., Wang, J., and Bryant, S. H. (2009). PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Research, (suppl_2), W623–W633.
Wang, Z., Chang, S., Yang, Y., Liu, D., and Huang, T. S. (2016b). Studying very low resolution recognition using deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4792–4800.
Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, (1), 31–36.
Xu, Z., Wang, S., Zhu, F., and Huang, J. (2017). Seq2seq fingerprint: An unsupervised deep molecular embedding for drug discovery. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 285–294. ACM.
Yu, H., Chen, J., Xu, X., Li, Y., Zhao, H., Fang, Y., Li, X., Zhou, W., Wang, W., and Wang, Y. (2012). A systematic prediction of multiple drug-target interactions from chemical, genomic, and pharmacological data. PLoS ONE, 7.