Cross-Modality Protein Embedding for Compound-Protein Affinity and Contact Prediction
Yuning You, Yang Shen
Texas A&M University {yuning.you,yshen}@tamu.edu
Abstract
Compound-protein pairs dominate FDA-approved drug-target pairs, and the prediction of compound-protein affinity and contact (CPAC) could help accelerate drug discovery. In this study we consider proteins as multi-modal data including 1D amino-acid sequences and (sequence-predicted) 2D residue-pair contact maps. We empirically evaluate the embeddings of the two single modalities in their accuracy and generalizability of CPAC prediction (i.e., structure-free interpretable compound-protein affinity prediction). And we rationalize their performances in both challenges of embedding individual modalities and learning generalizable embedding-label relationships. We further propose two models involving cross-modality protein embedding and establish that the one with cross interaction (thus capturing correlations among modalities) outperforms SOTAs and our single-modality models in affinity, contact, and binding-site predictions for proteins never seen in the training set.
Computational prediction of compound-protein interactions (CPI) has been of great interest partly due to its potential impact on accelerating drug discovery [1, 2]. Recent progress in this topic includes (1) the improved accuracy of structure-based binary classification [3, 4] and affinity regression [5, 6] for CPI; (2) the structure-free inputs that remove the demand of compound-protein co-crystal or docked structures that are experimentally or computationally expensive [7, 8, 9, 10, 11]; and (3) the recent development of interpretable structure-free predictions of both protein-ligand binding affinities and their atomic contacts [9, 12, 13].

We focus on interpretable CPI prediction without the need of compound-protein co-crystal or docked structures. Even unbound structures of proteins are not assumed here. Specifically, we aim at simultaneous prediction of compound-protein affinity and contacts in the aforementioned structure-free setting. We note that earlier works for this task represent proteins as 1D amino-acid sequences [12, 13] or 1D structurally-annotated sequences [9]. However, 1D sequences of proteins adopt 3D structures to function, including interactions with compounds; so structure-aware representations of proteins (such as sequence-predicted residue-residue 2D contact maps) can also be useful, as explored in a recent affinity predictor [11]. (Although compound data can be available in both modalities of 1D SMILES and chemical graphs, we did not pursue both modalities and only represented compounds as graphs, because SMILES strings have limited descriptive power and known worse performance in the CPAC task [9, 12].)

In this paper, we treat protein data as available in both modalities of 1D sequences and (sequence-predicted) 2D contact maps. And we ask the following questions: How do the two modalities compare with each other for the task of structure-free interpretable CPI prediction, i.e., compound-protein affinity and contact (CPAC) prediction?
Is there an advantage in exploiting both modalities? And what would be a beneficial cross-modality approach? Our contributions and findings include the following:
Machine Learning for Structural Biology Workshop, NeurIPS 2020, Vancouver, Canada.

• By embedding either modality with recurrent or graph neural networks and predicting affinities through intermolecular contact-predicting joint attentions, we empirically compared the two resulting single-modality models and found that: the 1D or 2D modality of proteins did not dominate each other for proteins seen in the training set; however, the 1D and 2D modality-based models tend to generalize better for unseen proteins in affinity prediction and contact prediction, respectively. We further provided conjectures involving the difficulty of embedding each modality and the mappings between the embeddings and affinity or contact labels.

• For the first time, we propose cross-modality learning models for the task of structure-free interpretable CPI prediction, to capture and fuse the different information from both 1D & 2D modalities of proteins. And we empirically demonstrate that the two cross-modality learning models (through concatenation or cross-interaction of sequence and graph embeddings) achieve better accuracy and generalizability compared to the state of the art (SOTA) and our single-modality models, in compound-protein affinity, contact, and binding-site prediction.
We assume that compounds are available in (1D SMILES or) 2D chemical graphs and proteins in 1D amino-acid sequences. Given a compound-protein pair (X_comp, X_prot) composed of N_comp atoms and N_prot residues, where N_comp and N_prot are predefined and fixed numbers (padding is applied to ensure the fixed sizes), a CPAC model f_CPAC : X_comp × X_prot → R_≥0 × [0, 1]^(N_comp × N_prot) is targeted at making predictions for both the intermolecular affinity z_aff and the (atom-residue) contacts Z_inter, where X_comp, X_prot are respectively the spaces for X_comp, X_prot. The SOTA pipelines for CPAC [12, 9, 13] comprise the following three major components, as shown in Figure 1.

(1) Neural-network encoders f_comp : X_comp → R^(N_comp × D), f_prot : X_prot → R^(N_prot × D) that separately extract embeddings H_comp, H_prot for the compound X_comp and protein X_prot, where D is the hidden dimension. A graph neural network (GNN, [14, 15, 16, 17, 18, 19]) is adopted for compound 2D chemical graphs and a hierarchical recurrent neural network (HRNN, [20]) is chosen for protein 1D amino-acid sequences.
Figure 1:
Pipeline overview for the compound-protein affinity and contact prediction model f_CPAC.

(2) Interaction module f_inter : R^(N_comp × D) × R^(N_prot × D) → [0, 1]^(N_comp × N_prot) × R^(L × D) taking the encoded embeddings H_comp, H_prot as inputs, employing joint attention to output the interaction matrix Z_inter and joint embedding to extract embeddings H_cp for compound-protein pairs, where L is a hidden length determined by N_comp and N_prot.

(3) Affinity module f_aff : R^(L × D) → R that predicts the affinity z_aff given the joint embedding H_cp, consisting of 1D convolutional and pooling layers and a multi-layer perceptron (MLP). Note that the contact-predicting interaction module feeds the affinity module, making affinity prediction intrinsically interpretable by the underlying contacts.

After the CPAC model f_CPAC forwardly generates the outputs (z_aff, Z_inter), true labels (y_aff, Y_inter) are compared to them to calculate the loss l_CPAC, which consists of the affinity loss l_aff, the intermolecular atom-residue contact/interaction loss l_inter, and three structure-aware sparsity regularization losses l_group, l_fused, l_L1 described in [12], expressed as:

l_CPAC = l_aff + λ_inter l_inter + λ_group l_group + λ_fused l_fused + λ_L1 l_L1.   (1)

The model is trained end to end while the training loss is minimized. More details can be found in [12].

3 Single-Modality Models and Performances
Protein 1D sequences.
We follow DeepAffinity+ [12] as described above and use HRNN to encode protein sequences. One change we made was replacing the hierarchical joint attention with naïve joint attention in the interaction module, expressed as:

Z_inter = Z′_inter / sum(Z′_inter),   z′_inter,i,j = (h_comp,i W_comp,attn)^T (h_prot,j W_prot,attn),   (2)

where z_i,j = Z[i, j], h_i = H[i, :], i = 1, ..., N_comp, j = 1, ..., N_prot, and W_comp,attn, W_prot,attn are two learnable attention matrices.
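The naïve joint attention of Eq. (2) reduces to two linear projections followed by a normalized outer product. The sketch below is a minimal NumPy reading of the equation as written; the function and argument names are ours, not from the released code.

```python
import numpy as np

def joint_attention(H_comp, H_prot, W_comp_attn, W_prot_attn):
    """Naive joint attention of Eq. (2).

    H_comp: (N_comp, D) compound-atom embeddings
    H_prot: (N_prot, D) protein-residue embeddings
    W_*_attn: (D, D_attn) learnable attention matrices
    Returns Z_inter with shape (N_comp, N_prot), normalized to sum to 1.
    """
    # z'_{i,j} = (h_comp,i W_comp,attn)^T (h_prot,j W_prot,attn)
    Z_raw = (H_comp @ W_comp_attn) @ (H_prot @ W_prot_attn).T
    # Z_inter = Z' / sum(Z'), as in Eq. (2)
    return Z_raw / Z_raw.sum()
```

With non-negative scores this yields a distribution over atom-residue pairs that doubles as the predicted contact map feeding the affinity module.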
Protein 2D contact maps.

In previous SOTAs for CPAC, proteins are often represented as 1D amino-acid sequences [12, 9, 13]. We propose to adopt the 2D modality of proteins as inputs and model them as graphs for the following reasons. Firstly, graphs are structure-aware compared with 1D sequences, potentially resulting in better generalizability. Secondly, graphs are more concise yet informative (focusing on pairwise residue interactions) compared to the data structure of 3D coordinates (which are also harder to predict than contact maps) [21]. Lastly, the recent surge of development in graph learning [14, 15, 16] provides advanced tools to facilitate graph representation learning.

Thus, given 2D residue-residue contact maps, we represent a protein input X_prot as a graph G_prot = {V_prot, E_prot}, where vertices stand for residues and edges exist between residues predicted to be in contact (Z-scores of predicted probability are above 3). When actual protein contact graphs are used for comparison, the edge criterion (for residue pairs in contact) is whether their Cβ atoms are within 8 Å. As the graphs are defined by the 2D contact maps, we may refer to them as 2D maps or 2D graphs interchangeably.

The graphs are associated with a feature matrix F_prot ∈ R^(N_prot × D) (embedded amino-acid types of residues) and an adjacency matrix A_prot ∈ {0, 1}^(N_prot × N_prot) (binary contact map).
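The two edge criteria above can be sketched as follows. Function names are ours, and standardizing the Z-score over all map entries is our reading of the predicted-map criterion.

```python
import numpy as np

def contact_graph_from_zscores(P, z_thresh=3.0):
    """Binary adjacency from predicted contact probabilities P (N_prot x N_prot):
    an edge exists where the Z-score of the predicted probability exceeds 3."""
    z = (P - P.mean()) / (P.std() + 1e-12)  # Z-score over all entries (one plausible reading)
    A = (z > z_thresh).astype(int)
    A = np.maximum(A, A.T)                  # symmetrize
    np.fill_diagonal(A, 0)                  # no self-loops
    return A

def contact_graph_from_structure(coords_cb, cutoff=8.0):
    """True-structure criterion: residues are in contact if their C-beta atoms
    are within 8 Angstroms. coords_cb: (N_prot, 3) coordinates."""
    d = np.linalg.norm(coords_cb[:, None, :] - coords_cb[None, :, :], axis=-1)
    A = (d < cutoff).astype(int)
    np.fill_diagonal(A, 0)
    return A
```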
We employ an expressive GNN model, the graph attention network (GAT, [14]), with K layers as the protein encoder f_prot to extract graph embeddings, with the forward propagation of each layer formulated as:

H^(k)_prot = MLP(S̃^(k−1) H^(k−1)_prot),   S̃^(k−1) = (D^(k−1))^(−1) (S^(k−1) ⊙ A_prot),   S^(k−1) = exp(H^(k−1)_prot W^(k−1) (H^(k−1)_prot)^T),   (3)

where H_prot = H^(K)_prot, H^(0)_prot = F_prot, the normalization matrix D^(k−1) = diag((S^(k−1) ⊙ A_prot) J_{N_prot,1}), ⊙ is the element-wise multiplication, J_{N_prot,1} is an all-ones matrix of size N_prot × 1, and W^(k−1) is a learnable weight matrix.

As (unbound or ligand-bound) structure data is not readily available for many proteins, we use sequence-predicted 2D contact maps to overcome this limitation and broaden our models' applicability. 2D contact map prediction is done by RaptorX-contact [22], which exploits both sequence and evolutionary information.
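A single layer of Eq. (3) amounts to exponential attention scores masked by the contact graph, row-normalized, then passed through an MLP. A minimal NumPy sketch, with the MLP collapsed to one ReLU layer as a simplification of ours:

```python
import numpy as np

def gat_layer(H, A, W, W_mlp):
    """One masked-attention layer following Eq. (3) (GAT-style).

    H: (N, D) residue embeddings, A: (N, N) binary contact adjacency,
    W: (D, D) attention weights, W_mlp: (D, D) stands in for the MLP.
    """
    S = np.exp(H @ W @ H.T)                     # unnormalized attention scores
    M = S * A                                   # mask to edges of the contact graph
    deg = M.sum(axis=1, keepdims=True) + 1e-12  # D^{-1} row normalization
    S_tilde = M / deg
    return np.maximum(S_tilde @ H @ W_mlp, 0)   # "MLP" reduced to a single ReLU layer
```

Because each row of the masked attention sums to one, every output embedding is a convex combination of the neighbors' (transformed) embeddings over the contact graph.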
Data set.

We use the dataset and splitting scheme of DeepAffinity+ [12], which is curated based on PDBbind [23] and BindingDB [24]. It contains protein sequences, predicted (and actual bound) protein contact maps, compound SMILES and graphs, affinity labels (pK_d/pK_i), and intermolecular atomic interactions/contacts (curated from the LigPlot service of PDBsum [25]). The updated dataset is diverse: it consists of 4,446 pairs between 3,672 compounds (of a wide range of properties such as logP, molecular weight, and affinity labels) and 1,287 proteins (including enzymes across all six classes, GPCRs, nuclear receptors, ion channels, and so on). The dataset is split into subsets of various levels of challenge in generalizability: 795 pairs involving unseen proteins (proteins not present in the training set), 521 pairs involving unseen compounds, and 205 pairs with both unseen; the rest is randomly split into training including validation (2,334) and the default test (591) sets. Note that the default test set contains compounds or proteins seen in the training set but never training compound-protein pairs.

Model training and hyperparameter tuning.
We train our models end to end with the following optimization settings as in [12]: the Adam optimizer with a learning rate of 0.001, a batch size of 64, and a maximum of 200 training epochs. The best checkpoint model is selected via validation. The hyperparameters in the loss function are optimized following a two-stage process over pre-defined grids [12]. Specifically, λ_group, λ_fused, and λ_L1 are first tuned with λ_inter = 0 (affinity regression alone), where the best affinity loss l_aff is recorded; λ_group, λ_fused, and λ_L1 are then chosen for the best AUPRC such that the corresponding affinity RMSE does not deteriorate by more than 10% of the best affinity RMSE. In the second stage, we fix the optimal λ_group, λ_fused, and λ_L1 and tune λ_inter based on the best AUPRC performance while jointly optimizing the regularized affinity and contact losses.

Numerical comparison of different modalities.
We compare the empirical results in Table 1 between taking 1D amino-acid sequences and 2D contact maps as protein inputs, using HRNN and GAT as the respective protein encoders. We make the following observations.
Table 1:
Affinity and contact prediction with different modalities of proteins as inputs.
Metrics: affinity RMSE ↓ and Pearson's r ↑, contact AUPRC ↑ and AUROC ↑, for the 1D-sequence and 2D-graph models on the Test (Seen-Protein) and Unseen-Protein sets.

(i) For affinity prediction (see RMSE & Pearson's r), 1D sequences and 2D graphs did not yield major differences, especially in Pearson's r. 1D sequences led to less deterioration in RMSE from the validation set (containing seen proteins) to unseen proteins.

One conjecture is that the information in graphs might be more difficult to learn compared to sequences (the training RMSE losses are 0.71 & 0.99 for the 1D & 2D modalities, respectively, when long enough training processes were performed). Moreover, affinity prediction for unseen-protein cases is not as challenging as intermolecular contact prediction in showing the benefit of the 2D modality (see (ii) below), as contact prediction often involves tens of thousands of values (rather than a single value) to fit for each compound-protein pair.

(ii) For contact prediction (see AUPRC & AUROC), encoding proteins as 1D sequences performed better (+3.22% in AUPRC and +1.67% in AUROC) on seen proteins (i.e., the proteins in compound-protein pairs at the inference phase are involved in the training compound-protein pairs). Meanwhile, encoding 2D protein contact maps (graphs) outperformed encoding 1D protein sequences (+4.91% in AUPRC and +2.24% in AUROC) for unseen proteins.

We conjecture that the sequential dependency information encoded in 1D amino-acid sequences is well captured, especially for seen proteins whose embeddings are well constructed after training (as they are already represented in the training set), leading to the better contact predictions for seen proteins.
However, the sequential information learned by the encoder could be more accurate toward intermolecular contact prediction for close or even distant homologs of seen proteins, but it is less general to unseen proteins.

In contrast, we conjecture that the better generalizability of the 2D-modality model might result from the quality of the encoded embedding of proteins, which is co-determined by both the inputs (2D maps) and the encoders (GAT models). The structural topology information encoded in protein 2D contact maps is more difficult for graph neural networks to capture even for seen proteins, leading to the worse contact predictions for seen proteins. However, such information can generalize well to unseen proteins toward contact prediction. In particular, even when sequence similarity of non-homologous proteins (to training ones) is too low to be detectable using RNNs, binding-pocket (subgraph) similarity could still be preserved and detected in 2D contact maps using GNNs, eventually leading to better intermolecular contact prediction.

We have shown that both sequential dependency in 1D amino-acid sequences and structural topology in 2D contact maps are important information for proteins to extract accurate and generalizable embeddings. Therefore it is natural to propose a cross-modality learning framework that captures and fuses the information from the 1D & 2D modalities for better performances. Specifically, we have designed the following two models.
Figure 2:
Cross-modality encoder for proteins to capture and fuse different modality information, with (a) naïve concatenation and (b) cross interaction introduced.
Concatenation.
A simple fusion model is to concatenate the extracted embeddings of the 1D and 2D modalities, encoded by HRNN and GAT respectively, as shown in Figure 2(a). Indeed, concatenation is commonly used in previous work [26, 27] to preserve information from different sources. The concatenated output is fed to a multi-layer perceptron (MLP) for the final protein embedding H_prot.

Cross interaction.
Although the aforementioned concatenation strategy preserves the information of individual modalities, the encoding processes for the two modalities are separate. In other words, the two types of embeddings from different modalities are independently encoded and then mixed through concatenation. However, the different modalities of proteins are intrinsically correlated with each other and could be coupled in a properly-designed representation-learning process. Therefore, we have introduced a cross interaction module to facilitate the encoder to learn protein embeddings from correlated data (1D and 2D modalities), as shown in Figure 2(b). Specifically, given the outputs of the encoders H′_prot,seq and H′_prot,graph, we calculate the sequence & graph cross-modality outputs H_prot,seq and H_prot,graph, respectively:

h_prot,seq,n = (sigmoid((h″_prot,graph,n)^T h′_prot,seq,n) + 1) h′_prot,seq,n,   (4)
h_prot,graph,n = (sigmoid((h″_prot,seq,n)^T h′_prot,graph,n) + 1) h′_prot,graph,n,   (5)

where h_{·,n} = H_·[n, :] (· can be empty, ′, or ″), H″_prot,graph = H′_prot,graph W_cross,graph, H″_prot,seq = H′_prot,seq W_cross,seq, and W_cross,seq and W_cross,graph are learnable weight matrices. Instead of independently extracting information from the protein modalities (1D sequences and 2D contact maps), the cross interaction module enforces a learned relationship between the encoded embeddings of the two protein modalities, which is expected to better capture the information from the correlated protein modalities and to benefit the affinity and contact prediction.
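The residue-wise gating of Eqs. (4)-(5) can be sketched directly: each residue's embedding in one modality is rescaled by a gate in (1, 2) computed from the other modality. A minimal NumPy sketch (names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_interaction(Hs, Hg, W_seq, W_graph):
    """Cross interaction of Eqs. (4)-(5).

    Hs, Hg: (N_prot, D) sequence / graph residue embeddings (H' terms)
    W_seq, W_graph: (D, D) learnable cross-modality weight matrices
    Returns the gated (H_prot,seq, H_prot,graph).
    """
    Hg2 = Hg @ W_graph  # H''_prot,graph
    Hs2 = Hs @ W_seq    # H''_prot,seq
    # per-residue "interaction strength" gates: sigmoid in (0, 1), shifted by +1
    g_seq = sigmoid(np.sum(Hg2 * Hs, axis=1, keepdims=True)) + 1.0
    g_graph = sigmoid(np.sum(Hs2 * Hg, axis=1, keepdims=True)) + 1.0
    return g_seq * Hs, g_graph * Hg
```

The gate never falls below 1, so each modality's embedding is amplified (never suppressed) in proportion to its learned agreement with the other modality.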
Again, H_prot,seq and H_prot,graph (now with information from each other) are concatenated and fed to an MLP for the final protein embedding H_prot.

The idea of cross interaction was previously introduced in [28] and is modified in our study as follows. First, we do not normalize cross interaction along residues (the sequence length is 1,000 here) since it would significantly change the scale of the residue embeddings. Second, we restrict the cross interaction for each residue to the range of [0, 1] with a sigmoid function to represent the cross-modality "interaction strength".

We compare our single-modality and cross-modality models with the two latest SOTAs for the CPAC problem, namely Gao et al. [8] and DeepAffinity+ [12]. Tasks involved include affinity, contact, and binding-site predictions.

Affinity and contact prediction.
As shown in Tables 2 and 3, compared to the SOTAs, our models have achieved similar performances in affinity prediction (RMSE and Pearson's r) and improved performances in contact prediction (AUPRC and AUROC), especially for proteins never seen in training (unseen-protein and unseen-both). We have made the following observations.

Table 2:
Comparison among SOTAs and our models in compound-protein affinity prediction (measured by RMSE and Pearson's correlation coefficient). ∗ denotes the cited performances. Boldfaced were the best performances for given test sets.

                                       Test (Seen-Both)  Unseen-Compound  Unseen-Protein  Unseen-Both
SOTAs
  Gao et al.∗            RMSE               1.87             1.75             1.72           1.79
                         Pearson's r          –                –                –              –
  DeepAffinity+∗         RMSE               1.49               –                –              –
                         Pearson's r          –                –                –              –
Ours
  Single Modality        RMSE               1.57             1.38             1.63           1.79
  (1D Sequences)         Pearson's r          –                –                –              –
  Single Modality        RMSE               1.49             1.37             1.75           1.93
  (Pred. 2D Graphs)      Pearson's r          –                –                –              –
  Single Modality        RMSE               1.69             1.62             1.88           1.99
  (True 2D Graphs)       Pearson's r          –                –                –              –
  Cross Modality         RMSE                 –                –                –              –
  (Concatenation)        Pearson's r          –                –                –              –
  Cross Modality         RMSE               1.55             1.43              –              –
  (Cross Interaction)    Pearson's r          –                –                –              –

Table 3:
Comparison among SOTAs and our models in contact prediction (measured by AUPRC and AUROC). ∗ denotes the cited performances. Boldfaced were the best performances for given test sets.

                                       Test (Seen-Both)  Unseen-Compound  Unseen-Protein  Unseen-Both
SOTAs
  Gao et al.∗            AUPRC (%)           0.60             0.57             0.48           0.48
                         AUROC (%)          51.57            51.50            51.65          51.55
  DeepAffinity+∗         AUPRC (%)          19.74            19.98             4.77           4.11
                         AUROC (%)          73.78            73.80            60.01          59.09
Ours
  Single Modality        AUPRC (%)          20.51            20.80             6.54           6.36
  (1D Sequences)         AUROC (%)          79.01            80.00            73.03          73.41
  Single Modality        AUPRC (%)          17.29            17.46             8.78           7.05
  (Pred. 2D Graphs)      AUROC (%)          77.34            78.70            77.94          76.59
  Single Modality        AUPRC (%)          21.41            21.33            10.52           9.40
  (True 2D Graphs)       AUROC (%)            –                –                –              –
  Cross Modality         AUPRC (%)            –                –                –              –
  (Concatenation)        AUROC (%)            –                –                –              –
  Cross Modality         AUPRC (%)          23.49            23.29              –              –
  (Cross Interaction)    AUROC (%)          81.30            82.07            80.64          79.78
First, our models use a similar backbone to DeepAffinity+ and revise the joint attention mechanism; thus DeepAffinity+ and our 1D sequence-based single-modality model, both using protein sequences, had similar performances in affinity prediction, but ours improved contact prediction.

Second, as observed in Section 3, compared to the 1D modality of protein sequences, the 2D modality of (sequence-predicted) protein contact maps improved the generalizability of compound-protein contact prediction for unseen proteins or unseen both, even though it resulted in slightly worse accuracy for seen proteins. Higher-quality actual protein contact maps, compared to sequence-predicted ones, further benefited contact prediction for both seen and unseen proteins; but they could lead to worse affinity prediction. These results echo our earlier conjecture that structural topology in the 2D graphs is more informative for the more complex task of contact prediction even though it may not be as effective as the 1D sequences for the less complex task of affinity prediction.

We have also made the following observations for our cross-modality fusion models, where only sequence-predicted protein contact maps are used.

Third, fusing the two modalities' information together, even by a simple concatenation strategy, could get the best of both modalities: the cross-modality model by concatenation had better contact prediction than single-modality models (even the true 2D map-based one) and a trade-off in affinity prediction (better than the 2D single-modality models and worse than the 1D single-modality model).
These results confirm our rationale for proposing cross-modality protein encoders for the CPAC task.

Last, by enforcing a learned correlation between the 1D and 2D embeddings rather than independently learning two individual embeddings, the cross-modality model with cross interaction further improved affinity prediction and actually had the best affinity accuracy among all methods for unseen proteins or unseen both. Moreover, it impressively achieved the best AUPRC for unseen proteins and unseen both. We note that, as intermolecular contacts only represent a minority (around 0.4%) of all compound-protein atom-residue pairs, AUPRC is a much more relevant measure than AUROC for contact prediction. These results reinforce our rationale that the learned correlation between embeddings from different modalities can better capture the correlated data and better perform CPAC predictions.
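The point about AUPRC versus AUROC under roughly 0.4% positives can be checked directly: a random scorer lands near 0.5 AUROC but only near-prevalence AUPRC, so AUPRC exposes how hard contact retrieval really is. A self-contained sketch with standard metric definitions (the implementations are ours, not from the paper's code):

```python
import numpy as np

def average_precision(y_true, scores):
    """AP: mean of precision at the rank of each positive (AUPRC summary)."""
    order = np.argsort(-scores)
    y = y_true[order]
    cum_pos = np.cumsum(y)
    ranks = np.arange(1, len(y) + 1)
    return (cum_pos[y == 1] / ranks[y == 1]).mean()

def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

rng = np.random.default_rng(0)
n, prevalence = 20000, 0.004                 # ~0.4% positives, as for contacts
y = (rng.random(n) < prevalence).astype(int)
random_scores = rng.random(n)                # an uninformative "predictor"
ap, roc = average_precision(y, random_scores), auroc(y, random_scores)
# roc is close to 0.5, while ap is close to the tiny prevalence.
```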
Protein binding-site prediction.
We also compare Gao et al., DeepAffinity+, and our models for protein binding-site prediction that is ligand-specific and structure-free. Our models again significantly improve the accuracy compared to the SOTAs. As actual protein structures (unbound or bound) are not assumed available, the single-modality model using true 2D contact maps (from compound-bound protein structures) essentially provides an estimate of the performance upper bound for unseen proteins. Impressively, using only protein sequences and sequence-predicted contact maps, both cross-modality models improved upon the single-modality model (true 2D graphs) for seen proteins and performed closely to the latter for unseen proteins. The cross-modality model with cross interaction achieved the best AUPRC for unseen proteins among all models compared. Again, as protein binding-site residues represent a minority among all residues, AUPRC is a much more relevant measure than AUROC for assessing binding-site prediction.
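Ligand-specific binding-site scores can be read off a predicted atom-residue contact map by aggregating each residue's contacts over the compound atoms. The max-aggregation below is one natural choice of ours and may differ from the exact rule of [12]:

```python
import numpy as np

def binding_site_scores(Z_inter):
    """Score each protein residue as a ligand-specific binding-site candidate.

    Z_inter: (N_comp, N_prot) predicted contact scores.
    Returns (N_prot,) per-residue scores: a residue's strongest predicted
    contact with any compound atom (max over atoms; one possible aggregation).
    """
    return Z_inter.max(axis=0)
```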
Table 4:
Comparison among SOTAs and our models in ligand-specific and structure-free protein binding-site prediction. ∗ denotes the cited numbers. Boldfaced are the best performances for individual test sets.

                                       Test (Seen-Both)  Unseen-Compound  Unseen-Protein  Unseen-Both
SOTAs
  Gao et al.∗            AUPRC (%)           5.43             5.38             4.95           4.96
                         AUROC (%)          49.79            50.51            48.21          48.74
  DeepAffinity+∗         AUPRC (%)          42.16            43.14            16.98          15.65
                         AUROC (%)          76.33            78.22            64.93          65.18
Ours
  Single Modality        AUPRC (%)          40.35            40.81            20.37          20.17
  (1D Sequences)         AUROC (%)          76.69            77.79            70.28          70.96
  Single Modality        AUPRC (%)          33.17            33.83            25.57          22.49
  (Pred. 2D Graphs)      AUROC (%)          75.11            76.53            76.15          74.87
  Single Modality        AUPRC (%)          41.73            42.58            29.44           –
  (True 2D Graphs)       AUROC (%)            –                –                –              –
  Cross Modality         AUPRC (%)            –                –                –              –
  (Concatenation)        AUROC (%)            –                –                –              –
  Cross Modality         AUPRC (%)          43.45            43.00             –              –
  (Cross Interaction)    AUROC (%)            –                –                –              –
We explore in this study various protein modalities (1D sequences and 2D residue-residue contact maps) in the context of compound-protein affinity and contact prediction. To this end, we have exploited RNNs and GNNs to encode the 1D and 2D modalities, respectively, and proposed cross-modality models (concatenation and cross interaction) on top of the single-modality models.

Our experiments show that the two different protein modalities result in different accuracy and generalizability in affinity and contact predictions. Specifically, the sequential dependency learned in the 1D protein modality can be adequate for the relatively simple task of affinity prediction. However, it does not generalize well for the relatively difficult task of contact prediction, especially when the proteins are new. In other words, the accuracy of the learned sequence-contact mapping can be restricted to seen proteins or their homologs but does not transfer to non-homologs. In contrast, the structural topology in the 2D protein modality is more difficult to capture by GNNs and its mapping to affinity can be predicted less well (not to mention that the quality of the predicted 2D modality is worse than the actual). However, once the mapping between the 2D embeddings and intermolecular contacts is learned, it generalizes well to unseen proteins, possibly due to better capturing subgraph (binding-pocket) similarity.

Our experiments also show that cross-modality models can exploit the correlation between both modalities and enjoy the benefits of both even when a simple concatenation strategy is adopted for the two embeddings. The newly proposed cross interaction model has led to better affinity prediction (RMSE and Pearson's r) and better contact prediction (AUPRC) for unseen proteins than the SOTAs, any of our single-modality models, and the simple cross-modality model with concatenation. It has also outperformed those other models in the generalizability of binding-site prediction for unseen proteins.

Acknowledgment
This project is in part supported by the National Science Foundation (CCF-1943008 to YS) and the National Institute of General Medical Sciences of the National Institutes of Health (R35GM124952 to YS). We thank Texas A&M High Performance Research Computing (HPRC) for computing allocations. We also thank anonymous reviewers for useful comments that have helped improve the manuscript.
References

[1] Ismail Kola and John Landis. Can the pharmaceutical industry reduce attrition rates? Nature Reviews Drug Discovery, 3(8):711–716, 2004.
[2] Steven M Paul, Daniel S Mytelka, Christopher T Dunwiddie, Charles C Persinger, Bernard H Munos, Stacy R Lindborg, and Aaron L Schacht. How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature Reviews Drug Discovery, 9(3):203–214, 2010.
[3] Jaechang Lim, Seongok Ryu, Kyubyong Park, Yo Joong Choe, Jiyeon Ham, and Woo Youn Kim. Predicting drug–target interaction using a novel graph neural network with 3D structure-embedded graph representation. Journal of Chemical Information and Modeling, 59(9):3981–3988, 2019.
[4] W. Torng and R. B. Altman. Graph convolutional neural networks for predicting drug-target interactions. Journal of Chemical Information and Modeling, 59(10):4131–4149, 2019.
[5] Joseph Gomes, Bharath Ramsundar, Evan N Feinberg, and Vijay S Pande. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint arXiv:1703.10603, 2017.
[6] J. Jimenez, M. Skalic, G. Martinez-Rosell, and G. De Fabritiis. KDEEP: protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. Journal of Chemical Information and Modeling, 58(2):287–296, 2018.
[7] Hakime Öztürk, Arzucan Özgür, and Elif Ozkirimli. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics, 34(17):i821–i829, 2018.
[8] Kyle Yingkai Gao, Achille Fokoue, Heng Luo, Arun Iyengar, Sanjoy Dey, and Ping Zhang. Interpretable drug target prediction using deep neural representation. In IJCAI, volume 2018, pages 3371–3377, 2018.
[9] Mostafa Karimi, Di Wu, Zhangyang Wang, and Yang Shen. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics, 35(18):3329–3338, 2019.
[10] Masashi Tsubaki, Kentaro Tomii, and Jun Sese. Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics, 35(2):309–318, 2019.
[11] Mingjian Jiang, Zhen Li, Shugang Zhang, Shuang Wang, Xiaofeng Wang, Qing Yuan, and Zhiqiang Wei. Drug–target affinity prediction using graph neural network and contact maps. RSC Advances, 10(35):20701–20712, 2020.
[12] Mostafa Karimi, Di Wu, Zhangyang Wang, and Yang Shen. Explainable deep relational networks for predicting compound-protein affinities and contacts. arXiv preprint arXiv:1912.12553, 2019.
[13] Shuya Li, Fangping Wan, Hantao Shu, Tao Jiang, Dan Zhao, and Jianyang Zeng. MONN: a multi-objective neural network for predicting compound-protein interactions and affinities. Cell Systems, 10(4):308–322, 2020.
[14] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[15] Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. L2