Predicting Biomedical Interactions with Higher-Order Graph Convolutional Networks
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015
Kishan KC, Rui Li, Feng Cui, and Anne R. Haake
Abstract—Biomedical interaction networks have incredible potential to be useful in the prediction of biologically meaningful interactions, identification of network biomarkers of disease, and the discovery of putative drug targets. Recently, graph neural networks have been proposed to effectively learn representations for biomedical entities and achieved state-of-the-art results in biomedical interaction prediction. However, these methods only consider information from immediate neighbors and cannot learn a general mixing of features from neighbors at various distances. In this paper, we present a higher-order graph convolutional network (HOGCN) to aggregate information from the higher-order neighborhood for biomedical interaction prediction. Specifically, HOGCN collects feature representations of neighbors at various distances and learns their linear mixing to obtain informative representations of biomedical entities. Experiments on four interaction networks, including protein-protein, drug-drug, drug-target, and gene-disease interactions, show that HOGCN achieves more accurate and calibrated predictions. HOGCN performs well on noisy, sparse interaction networks when feature representations of neighbors at various distances are considered. Moreover, a set of novel interaction predictions is validated by literature-based case studies.
Index Terms—Higher-Order Graph Convolution, Interaction Prediction, Biomedical Interaction Networks.
INTRODUCTION

A biological system is a complex network of various molecular entities, such as genes, proteins, and other biological molecules, linked together by the interactions between these entities. This complex interplay can be represented as interaction networks with molecular entities as nodes and their interactions as edges. Such a representation of a biological system as a network provides a conceptual and intuitive framework to investigate and understand direct or indirect interactions between different molecular entities in a biological system. The study of such networks leads to a system-level understanding of biology [1] and the discovery of novel interactions, including protein-protein interactions (PPIs) [2], drug-drug interactions (DDIs) [3], drug-target interactions (DTIs) [4], and gene-disease associations (GDIs) [5].

Recently, the generalization of deep learning to network-structured data [6] has shown great promise across various domains such as social networks [7], recommendation systems [8], chemistry [9], and citation networks [10]. These approaches fall under the umbrella of graph convolutional networks (GCNs). GCNs repeatedly aggregate feature representations of immediate neighbors to learn informative representations of nodes for link prediction. Although GCN-based methods show great success in biomedical interaction prediction [3], [11], such methods only consider information from immediate neighbors. SkipGNN [12] leverages a skip graph to aggregate feature representations from direct and

• K. KC, R. Li, and A. Haake are with the Department of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, 14623. E-mail: {kk3671, arhics, rxlics}@rit.edu
• F. Cui is with the Thomas H. Gosnell School of Life Sciences, Rochester Institute of Technology, Rochester, NY, 14623.
E-mail: [email protected]

second-order neighbors and demonstrated improvements over GCN methods in biomedical interaction prediction. However, SkipGNN cannot be applied to aggregate information from higher-order neighbors and thus fails to capture information that resides farther away from a particular interaction [13].

To address this challenge, we propose an end-to-end deep graph representation learning framework named higher-order graph convolutional networks (HOGCN) for predicting interactions between pairs of biomedical entities. HOGCN learns a representation for every biomedical entity using an interaction network structure G and/or features X. In particular, we define a higher-order graph convolution (HOGC) layer in which the feature representations of k-order neighbors are considered to obtain the representations of biomedical entities. The layer can thus learn to mix feature representations of neighbors at various distances for interaction prediction. Furthermore, we define a bilinear decoder that reconstructs the edges in the input interaction network G by relying on the feature representations produced by the HOGC layers. The encoder-decoder approach makes HOGCN an end-to-end trainable model for interaction prediction.

We compare HOGCN's performance with that of state-of-the-art network similarity-based methods [14], network embedding methods [15], [16], and graph convolution-based methods [10], [12], [17] for biomedical link prediction. We experiment with various interaction datasets and show that our method makes accurate and calibrated predictions. HOGCN outperforms alternative methods based on network embedding by up to 30%. Furthermore, HOGCN outperforms graph convolution-based methods by up to 6%, alluding to the benefits of aggregating information from higher-order neighbors. We perform a case study on the DDI network and
observe that aggregating information from the higher-order neighborhood allows HOGCN to learn meaningful representations for drugs. Moreover, literature-based case studies illustrate that the novel predictions are supported by evidence, suggesting that HOGCN can identify interactions that are highly likely to be true positives.

In summary, our study demonstrates the ability of HOGCN to identify potential interactions between biomedical entities and opens up opportunities to use the biological and physicochemical properties of biomedical entities for follow-up analysis of these interactions.
RELATED WORKS
With the increasing coverage of the interactome, various network-based approaches have been proposed to exploit already available interactions to predict missing interactions [14], [18]–[20]. These methods can be broadly classified into (1) network similarity-based methods, (2) network embedding methods, and (3) graph convolution-based methods. We next summarize these categories of methods for biomedical interaction prediction.

Given a network of known interactions, various similarity metrics are used to measure the similarity between biomedical entities [18], under the assumption that higher similarity indicates interaction. The triadic closure principle (TCP) has been explored in biomedical interaction prediction with the hypothesis that biomedical entities with common interaction partners are likely to interact with each other [19]. TCP relies on a common-neighbor algorithm that counts the number of shared neighbors between nodes and is quantified by A^2, where A is the adjacency matrix. Recently, the L3 heuristic [14] showed that the common-neighbor hypothesis fails for most protein pairs in PPI prediction and proposed to instead consider nodes that are similar to the neighbors of a node, quantified by A^3. This indicates that higher-order neighbors are important for interaction prediction.

Next, network embedding methods embed the existing network into a low-dimensional space that preserves structural proximity, so that the nodes of the original network can be represented as low-dimensional vectors. DeepWalk [15] is a popular approach that generates truncated random walks in the network and defines the neighborhood of each node as the set of nodes within a window of size k in each random walk. Similarly, node2vec performs a biased random walk by balancing breadth-first and depth-first search in the network. The random walks generated by these methods can be considered a combination of nodes from various orders of neighborhood, from the 1-hop to the k-hop neighborhood.
In other words, DeepWalk and node2vec learn node embeddings from a combination of A^1, A^2, A^3, . . . , A^k, where A^i is the i-th power of the adjacency matrix. These embeddings can then be fed into a classifier to predict the interaction probability. However, these methods are limited to the structure of the biomedical networks and cannot incorporate additional information about the biomedical entities. Also, they cannot learn the feature differences between nodes at various distances.

Furthermore, graph convolution-based methods use a message-passing mechanism to receive and aggregate information from neighbors to generate representations for the nodes in the network. Graph convolutional networks (GCNs) [17] and the variational graph autoencoder (VGAE) [10] aggregate feature representations from immediate neighbors to learn representations of biomedical entities in an end-to-end manner using a link prediction objective. These methods are limited to average pooling of the neighborhood features [13]. SkipGNN [12] therefore proposes to use skip similarity between biomedical entities to aggregate information from second-order neighbors. However, these methods cannot aggregate feature representations from higher-order neighbors and cannot learn feature differences between neighbors at various distances.

PRELIMINARIES
A biomedical network is defined as G = (V, E, X), where V denotes the set of nodes representing biomedical entities (e.g., proteins, genes, drugs, diseases) and |V| denotes the number of nodes. E ⊆ (V × V) denotes the set of interactions between biomedical entities. X ∈ R^(|V|×F) is the feature matrix of the biomedical entities, where F is the dimension of the features. Let A denote the adjacency matrix of G, where A_ij indicates an edge between nodes v_i and v_j. We assume a binary adjacency matrix A ∈ {0, 1}^(|V|×|V|), where A_ij represents the existence of an edge between nodes v_i and v_j: A_ij = 1 indicates the presence of experimental evidence for their interaction and A_ij = 0 its absence. Note that the same adjacency-matrix notation can be used to represent weighted graphs with A_ij ∈ [0, 1]. Table 1 shows the notations and their definitions used in the paper.

TABLE 1
Terms and notations
Notation            Definition
G = (V, E, X)       Graph with nodes V, edges E, and features X
E'                  Test edges
A ∈ R^(|V|×|V|)     Adjacency matrix of graph G
D ∈ R^(|V|×|V|)     Degree matrix with D_ii = Σ_j A_ij
I ∈ R^(|V|×|V|)     Identity matrix
Â ∈ R^(|V|×|V|)     Symmetrically normalized adjacency matrix
X ∈ R^(|V|×F)       F-dimensional feature matrix
A_ij ∈ {0, 1}       Ground-truth interaction between nodes i and j
p_ij ∈ [0, 1]       Probability of interaction between nodes i and j
Z ∈ R^(|V|×d*)      Final node embeddings
W_j^(l)             Weight matrix for the j-th adjacency power at layer l
L                   Number of HOGC layers
T                   Number of training epochs
k                   The order of neighborhood
P                   A set of integer adjacency powers, P = {0, 1, . . . , k}
O_j^(l)             Representation of neighbors at distance j in layer l

Problem Statement (Biomedical interaction prediction). Given a biomedical interaction network G = (V, E, X) and a set of potential biomedical interactions E', we aim to learn an interaction prediction model f : E' → [0, 1] that predicts the interaction probabilities of E'.

Given a biomedical network G, message passing algorithms learn the representation of biomedical entities in the network by aggregating information from immediate neighbors [9]. Additional information about biomedical entities
Fig. 1. Block diagram of the proposed HOGCN model with 2 HOGC layers. Given a biomedical interaction network G with initial features X for biomedical entities, the encoder mixes the feature representations of neighbors at various distances and learns the final representation Z. The decoder takes the representations z_i and z_j of nodes v_i and v_j to learn a representation e_ij for the edge (denoted by ?) and predicts the probability p_ij of its existence.

can be used to initialize the feature matrix X. These algorithms involve a message passing step in which each biomedical entity sends its current representation to, and aggregates incoming messages from, its immediate neighbors. A representation for each biomedical entity is obtained after L steps of message passing and feature aggregation. However, such a message passing operation is limited to average pooling of features from immediate neighbors and is thus unable to learn feature differences among neighbors at various distances [13].

Neighborhood nodes at various distances provide network structure information at different granularities [21]–[25]. Taking k-hop neighborhoods into consideration, we aim to aggregate information from various distances at every message passing step. Different powers of the adjacency matrix, A^1, A^2, A^3, . . . , A^k, provide information about the network structure at different scales. Higher-order message passing operations can therefore learn to mix representations using various powers of the adjacency matrix at each message passing step.

Graph convolutional networks (GCNs) are a generalization of the convolution operation from regular grids, such as images or text, to graph-structured data [6], [26]. The key idea of GCNs is to learn a function that generates a node's representation by repeatedly aggregating information from immediate neighbors.
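As a minimal NumPy sketch (our own illustration, not the paper's PyTorch implementation; the ReLU activation and function names are assumptions), one such aggregation step with the symmetrically normalized adjacency matrix looks like:

```python
import numpy as np

def normalize_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}: symmetrically normalized
    adjacency matrix with self-connections added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One GCN layer, H^(l) = sigma(A_hat H^(l-1) W^(l)),
    with ReLU as the element-wise activation sigma."""
    return np.maximum(A_hat @ H @ W, 0.0)
```

Stacking L such layers and reading out the last activation gives a plain GCN model; in HOGCN terms, this is the special case recovered by P = {1}.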
The graph convolutional layer is defined as:

H^(l) = σ(Â H^(l−1) W^(l))    (1)

where H^(l−1) and H^(l) are the input and output activations, W^(l) is a trainable weight matrix of layer l, σ is an element-wise activation function, and Â is the symmetrically normalized adjacency matrix with self-connections, Â = D^(−1/2) (A + I_|V|) D^(−1/2). A GCN model with L layers is then defined as:

H^(l) = X if l = 0, and H^(l) = σ(Â H^(l−1) W^(l)) if l ∈ [1, . . . , L],

and H^(L) can be used to predict the probability of interactions between biomedical entities.

HIGHER-ORDER GRAPH CONVOLUTION NETWORK (HOGCN)
In this work, we develop a higher-order graph convolutional network (HOGCN) that takes an interaction network G as input and reconstructs the edges in the interaction network (Fig. 1). HOGCN has two main components:

• Encoder: a higher-order graph convolution encoder that operates on an interaction graph G and produces representations for biomedical entities by aggregating features from the neighborhood at various distances, and
• Decoder: a bilinear decoder that relies on these representations to reconstruct the interactions in G.

We first describe the higher-order graph encoder, which operates on an undirected interaction graph G = (V, E, X) and learns the representations for biomedical entities. We develop an encoder with higher-order graph convolution (HOGC) layers that mix feature representations from neighbors at distances up to k. Specifically, a HOGC layer considers the neighborhood information at different granularities and is defined as:

H^(l) = ||_{j ∈ P} σ(Â^j H^(l−1) W_j^(l))    (2)

where P is a set of integer adjacency powers, Â^j denotes the adjacency matrix Â multiplied with itself j times, and || denotes column-wise concatenation [13]. The graph convolutional network [17] only considers the 1st power of the adjacency matrix and can be exactly recovered by setting P = {1} in Equation (2). Similarly, SkipGNN [12] considers direct and skip similarity and can be recovered by setting P = {1, 2} in Equation (2).

Fig. 2 shows a HOGC layer with P = {0, 1, . . . , k}, where k is the maximum order of neighborhood considered in

Fig. 2. Higher-order graph convolution (HOGC) layer with P = {0, 1, . . . , k}. The feature representation H^(l) is a linear combination of the neighbor representations Â^j H^(l−1) at multiple distances j. O_j^(l) represents the feature representation of neighbors at distance j for layer l.
each HOGC layer. If k = 0, the HOGC layer only considers the features of the biomedical entities and can capture the feature similarity between various biological entities. This is equivalent to a fully connected network with the features of biomedical entities as input; in this case, Â^0 is the identity matrix I_|V|, where |V| is the number of nodes in the network. This allows the HOGC layer to learn a transformation of the node features separately and mix it with the feature representations from neighbors.

The maximum order of neighborhood k and the number of trainable weight matrices |P|, one per adjacency power, can vary across layers. However, we set the same k for neighborhood aggregation and the same dimension d for all weight matrices across all layers.

Neighborhood features from different adjacency powers j ∈ {0, 1, . . . , k} at layer (l−1) are column-wise concatenated to obtain the feature representation H^(l−1). As shown in Fig. 2, the weight matrices W_j^(l) at layer l can learn an arbitrary linear combination of the concatenated features to obtain H^(l). Specifically, the layer can assign different coefficients to different columns of the concatenated matrix. For instance, the layer can assign positive coefficients to the columns produced by a certain power of Â and negative coefficients to other powers. This allows the model to learn feature differences among neighbors at various distances. We apply L HOGC layers to learn the latent representation Z ∈ R^(|V|×d*) for the biomedical entities in the network, where d* = d × |P| and d is the dimension of a node's representation for each adjacency power.

We have introduced the encoder based on HOGC layers, which learns the feature representation Z for biomedical entities by mixing neighborhood information at multiple distances. Next, we discuss the decoder, which reconstructs the interactions in G based on the representation Z.
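Before moving to the decoder, the HOGC layer of Equation (2) can be sketched in NumPy as follows (our own illustration, not the authors' implementation; the ReLU activation is an assumption, and the adjacency powers are built by repeated right-to-left multiplication rather than by materializing Â^j):

```python
import numpy as np

def hogc_layer(A_hat, H, Ws):
    """Higher-order graph convolution, Eq. (2):
    H^(l) = concat_j ReLU(A_hat^j H^(l-1) W_j) for j = 0..k,
    where Ws[j] is the weight matrix for the j-th adjacency power."""
    outputs = []
    B = H                                 # A_hat^0 H = H (j = 0: the node's own features)
    for W_j in Ws:
        outputs.append(np.maximum(B @ W_j, 0.0))
        B = A_hat @ B                     # right-to-left: next power A_hat^(j+1) H
    return np.concatenate(outputs, axis=1)   # column-wise concatenation
```

With Ws of length k + 1 and each W_j of shape (input_dim, d), the output has d × |P| columns, matching d* = d × |P| in the paper.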
We adopt a bilinear layer to fuse the representations of biomedical entities v_i and v_j and learn the edge representation e_ij. More precisely, we define a simple bilinear layer that takes the representations z_i and z_j as input:

e_ij = ELU(z_i^T W_b z_j + b)    (3)

where W_b ∈ R^(d × d* × d*) is the learnable fusion tensor, e_ij is the representation of the edge between nodes v_i and v_j, b denotes the bias of the bilinear layer, and ELU is an element-wise non-linearity.

The edge representation e_ij is then fed into a 2-layer fully connected neural network to predict the probability p_ij for the edge e_ij:

p_ij = sigmoid(FC(ELU(FC(e_ij))))    (4)

where FC(e_ij) = W · e_ij + b denotes a fully connected layer with weight W and bias b, and p_ij represents the probability that biomedical entities v_i and v_j interact.

So far, we have discussed the encoder and decoder of our proposed approach. Next, we describe the training procedure of our proposed HOGCN model. In particular, we explain how to optimize the trainable neural network weights in an end-to-end manner.

During HOGCN training, we employ the binary cross-entropy loss to optimize the model parameters:

L(v_i, v_j) = −A_ij log(p_ij) − (1 − A_ij) log(1 − p_ij)    (5)

which encourages the model to assign higher probability to observed interactions (v_i, v_j) than to randomly selected non-interactions. Here p_ij is the predicted interaction probability between v_i and v_j and A_ij denotes the ground-truth interaction label for these nodes. The final loss function over all interactions is

L = Σ_{(i,j) ∈ E} L(v_i, v_j)    (6)

We follow an end-to-end approach to jointly optimize over all trainable parameters and backpropagate the gradients through the encoder and decoder of HOGCN.

HOGCN leverages the biomedical network structure A along with additional information about biomedical entities as the initial feature representation X. In this paper, we initialize the features X to the one-hot encoding, i.e.,
I_|V|. The feature matrix X can also be initialized with properties of biomedical entities or with pre-trained embeddings from other network-based approaches such as DeepWalk or node2vec.

Given an adjacency matrix A and the initial node representations X, the higher-order neighborhoods indicated by higher powers of the adjacency matrix are computed iteratively, which makes the model more efficient. By adopting right-to-left multiplication, for instance, we can calculate Â^3 H^(i) as Â(Â(ÂH^(i))) (the inner loop of Algorithm 1). The representations O_j^(l) learned for the neighborhood at distance j are concatenated to obtain the representation H^(l), as shown in Fig. 2 (Line 11 in Algorithm 1).
Algorithm 1 Training of HOGCN for biomedical interaction prediction

Inputs: Â, X, k
 1: H^(0) := X
 2: for t = 1 to T do
 3:    Sample a mini-batch of training edges and their interaction labels
 4:    for l = 1 to L do
 5:       B := H^(l−1)
 6:       O_0^(l) := B W_0^(l)
 7:       for j = 1 to k do
 8:          B := Â B
 9:          O_j^(l) := B W_j^(l)
10:       end for
11:       H^(l) := ||_{j∈P} O_j^(l)
12:    end for
13:    Z := H^(L)
14:    p_ij := InteractionDecoder(Z)
15:    Compute the loss in (6)
16:    Update model parameters via gradient descent
17: end for

After passing through L HOGC layers, we obtain the final representation Z for the biomedical entities. With the final representation Z and the mini-batch of training edges, we retrieve the embeddings of the nodes in the training edges and feed them into the interaction decoder to compute their interaction probabilities. The parameters of HOGCN are optimized with the binary cross-entropy loss (Equation (6)) in an end-to-end manner. Given two biomedical entities v_i and v_j, the trained model can predict the probability of their interaction.

EXPERIMENTAL DESIGN
We view the problem of biomedical interaction predictionas solving a link prediction task on an interaction network.We consider various interaction datasets and compare ourproposed method with the state-of-the-art methods.
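Concretely, the link prediction setup can be sketched as follows (a NumPy illustration under our own naming; the 7:1:2 train/validation/test ratio is the one used in the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_edges(edges, ratios=(0.7, 0.1, 0.2)):
    """Split the known interactions into train/validation/test sets
    in the given ratio (7:1:2 by default, as in the paper)."""
    edges = np.asarray(edges)
    perm = rng.permutation(len(edges))
    n_train = int(ratios[0] * len(edges))
    n_val = int(ratios[1] * len(edges))
    train = edges[perm[:n_train]]
    val = edges[perm[n_train:n_train + n_val]]
    test = edges[perm[n_train + n_val:]]
    return train, val, test
```

The model is then fit on the training edges, tuned on the validation edges, and evaluated on the held-out test edges.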
We conduct interaction prediction experiments on four publicly available biomedical network datasets:

• BioSNAP-DTI [27]: the DTI network contains 15,139 drug-target interactions between 5,018 drugs and 2,325 proteins.
• BioSNAP-DDI [27]: the DDI network contains 48,514 drug-drug interactions between 1,514 drugs, extracted from drug labels and biomedical literature.
• HuRI-PPI [2]: the HI-III human PPI network contains 5,604 proteins and 23,322 interactions generated by multiple orthogonal high-throughput yeast two-hybrid screens.
• DisGeNET-GDI [28]: the GDI network consists of 81,746 interactions between 9,413 genes and 10,370 diseases, curated from GWAS studies, animal models, and the scientific literature.

Table 2 provides a summary of the datasets used in our experiments, including the number of interactions used for training, validation, and testing for each interaction dataset. The table also includes the average number of interactions per biomedical entity, computed as |E|/|V|.

We compare our proposed model with the following network-based baselines for interaction prediction:

• Network similarity-based methods
  – L3 [14] counts the number of paths of length 3, normalized by degree, for all node pairs.
• Network embedding methods
  – DeepWalk [15] performs truncated random walks exploring the network neighborhood of nodes and applies a skip-gram model to learn a d-dimensional embedding for each node in the network. Node features are concatenated to form edge representations used to train a logistic regression classifier.
  – node2vec [16] extends DeepWalk by running a biased random walk, based on breadth/depth-first search, to capture both local and global network structure.
• Graph convolution-based methods
  – VGAE [10] uses a graph convolutional encoder with two GCN layers to learn a representation for each node in the network and adopts an inner-product decoder to reconstruct the adjacency matrix.
  – GCN [17] uses the normalized adjacency matrix to learn node representations.
The node representations are concatenated to form feature representations for the edges, and a fully connected layer uses these edge representations to reconstruct edges, similar to HOGCN. Setting P = {1} in our proposed HOGCN is equivalent to GCN.
  – SkipGNN [12] learns node embeddings by combining direct and skip similarity between nodes. Setting P = {1, 2} in our proposed HOGCN is equivalent to SkipGNN.

We split the interaction dataset into training, validation, and testing interactions in a ratio of 7:1:2, as shown in Table 2. Since the available interactions are positive samples, negative samples are generated by randomly sampling from the complement set of the positive examples. Five independent runs of the experiments with different random splits of the dataset are conducted to report the prediction performance. We use (1) area under the precision-recall curve (AUPRC) and (2) area under the receiver operating characteristic curve (AUROC) as the evaluation metrics. Under these metrics, we expect positive interactions to have higher interaction probabilities than negative interactions, so higher AUPRC and AUROC values indicate better performance.

We implement HOGCN using PyTorch [29] and perform all experiments on a single NVIDIA GeForce RTX 2080Ti GPU. We construct a 2-layer HOGC network with k = 3 for each layer. At each HOGC layer, a node mixes the feature representations of neighbors at distances P = {0, 1, 2, 3}. The dimension of all weight matrices in the HOGC layers is set to d = 32. All weight matrices are initialized using Xavier initialization [30]. We train our model using mini-batch gradient descent with the Adam optimizer [31] for a maximum of 50 epochs with a fixed learning rate. We set the mini-batch size to 256 and the
TABLE 2
Summary of the datasets used in our experiments.
dropout probability [32] to 0.1 for all layers. Early stopping is adopted: training stops if the validation performance does not improve for 10 epochs. The edge representation e_ij from the bilinear layer is passed through linear layers that project it to an edge probability. For the baseline methods, we follow the same experimental settings discussed in [12].
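For concreteness, the bilinear edge decoder of Equations (3)–(4) and the per-pair loss of Equation (5) can be sketched in NumPy as follows (an illustration with assumed dimensions and names, not the authors' PyTorch code):

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_edge(z_i, z_j, W_b, b, W1, b1, W2, b2):
    """Bilinear fusion e_ij = ELU(z_i^T W_b z_j + b), Eq. (3),
    followed by a 2-layer fully connected head, Eq. (4).
    W_b has shape (d, d*, d*); e_ij[m] = z_i^T W_b[m] z_j."""
    e_ij = elu(np.einsum("i,mij,j->m", z_i, W_b, z_j) + b)
    h = elu(W1 @ e_ij + b1)
    return sigmoid(W2 @ h + b2)   # interaction probability p_ij

def bce_loss(p_ij, a_ij):
    """Binary cross-entropy for one node pair, Eq. (5)."""
    return -a_ij * np.log(p_ij) - (1 - a_ij) * np.log(1 - p_ij)
```

Summing `bce_loss` over all training pairs gives the total loss of Equation (6); in the real model, gradients of this loss are backpropagated through both decoder and encoder.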
In this section, we investigate the performance and flexibil-ity of HOGCN on interaction prediction using four differentdatasets. We further explore the robustness of HOGCNto sparse networks. Finally, we demonstrate the ability ofHOGCN to make novel predictions with literature-basedcase studies.
In this section, we compare HOGCN against various baselines on biomedical interaction prediction tasks using four different types of interaction datasets: protein-protein interactions (PPIs), drug-target interactions (DTIs), drug-drug interactions (DDIs), and gene-disease associations (GDIs).

We randomly mask 20% of the interactions in the network as a test set and 10% as a validation set. We train all models on the remaining 70% of the interactions and evaluate their performance on the test sets. The best set of hyperparameters is selected based on performance on the validation set. Finally, the experiment is repeated for five independent random splits of the interaction dataset, and the results with ± one standard deviation are reported in Table 3. All of our models used for the reported results have the same capacity (i.e., P = {0, 1, 2, 3} and d = 32).

Table 3 shows that HOGCN achieves a large improvement over network embedding methods such as DeepWalk and node2vec across all datasets. Specifically, HOGCN outperforms DeepWalk on AUPRC by 24.44% in DTI, 28.51% in DDI, 30.07% in PPI, and 13.79% in GDI. Although node2vec achieves better performance than DeepWalk by adopting a biased random walk, HOGCN still outperforms node2vec by a significant margin. DeepWalk and node2vec consider different orders of neighborhood, defined by the window size, and learn similar representations for the nodes in that window. In contrast, HOGCN learns feature differences between neighbors at various distances to obtain the feature representation of a node, and thus achieves superior performance.
The improved performance suggests that feature differences between different-order neighbors provide important information for interaction prediction.

TABLE 3
Average AUPRC and AUROC with ± one standard deviation on biomedical interaction prediction.

The network similarity-based method L3 [14] outperforms the network embedding methods across the four datasets but is limited to a single aspect of network similarity, i.e., the number of paths of length 3 connecting two nodes. L3 therefore cannot be applied when other similarities between nodes, such as similarity in features or common neighbors at various distances, need to be considered. HOGCN overcomes these limitations and outperforms L3 across all interaction datasets by a large margin. In particular, HOGCN gains 3.5% AUPRC and 7.09% AUROC on PPI over L3 [14], which recently outperformed 20 network science methods on the PPI prediction problem.

Graph convolution-based methods such as GCN and VGAE achieve significant improvements over the network embedding approaches but perform comparably to L3. SkipGNN improves over all of these methods by incorporating skip similarity to aggregate information from second-order neighbors. Moreover, HOGCN with k = 3 achieves an improvement over all graph convolution-based methods.

Fig. 3. Visualization of the learned representations for drugs with (a) GCN, (b) SkipGNN, and (c) HOGCN. Drugs are mapped to 2D space using the t-SNE package [33] with the learned drug representations. Drug categories DBCAT002175, DBCAT000702, and DBCAT000592 are highlighted. The number of drugs in each category is reported in the legend. Best viewed on screen.
Specifically, HOGCN achieves improvements in AUPRC over VGAE [10] by 6.3%, GCN [17] by 4.8%, and SkipGNN [12] by 3.6% on the DDI dataset.

As HOGCN can learn a linear combination of node features at multiple distances, it can extract meaningful representations from the interaction networks. The results in Table 3 demonstrate that our approach with higher-order neighborhood mixing outperforms the state-of-the-art methods on real interaction datasets.
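The evaluation protocol used throughout — sampling negatives from the complement of the positive edge set and ranking positives above negatives — can be sketched as follows (our own NumPy illustration; AUROC is computed via its pairwise definition, with ties counted as half):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_negatives(A, n):
    """Sample n node pairs without an observed interaction,
    i.e. from the complement of the positive edge set."""
    pairs = []
    while len(pairs) < n:
        i, j = rng.integers(0, A.shape[0], size=2)
        if i != j and A[i, j] == 0:
            pairs.append((i, j))
    return pairs

def auroc(y_true, y_score):
    """AUROC via its pairwise definition: the probability that a
    random positive is scored above a random negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties
```

A perfect ranking of positives over negatives yields an AUROC of 1.0, while random scores yield about 0.5; AUPRC is computed analogously from the precision-recall curve.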
Next, we evaluate whether HOGCN learns meaningful representations when the feature representations of higher-order neighbors are aggregated. To this end, we train GCN, SkipGNN, and HOGCN models on the DDI network to obtain the drug representations Z. The learned drug representations are mapped to 2D space using t-SNE [33] and visualized in Fig. 3.

DrugBank [34] provides information about drugs and their categories based on different characteristics, such as the metabolic enzymes involved, the class of drug, and side effects. For this experiment, we collect drug categories from DrugBank and limit the selection of categories such that the training set does not contain any interactions between drugs in the same category. The selected drug categories are ACE Inhibitors and Diuretics (DBCAT002175), Penicillins (DBCAT000702), and Antineoplastic Agents (DBCAT000592), with 10, 24, and 16 drugs, respectively. Although these drugs have no direct interactions in the training set, we assume that they share neighborhoods at various distances, which HOGCN can exploit accordingly.

Fig. 3 shows the clustering structure in the drugs' representations as neighborhood information at multiple distances is considered. Examining the figure, we observe that drugs in the same category are embedded close to each other in the 2D space when the model aggregates information from farther neighbors. For example, the 24 drugs in the Penicillins (DBCAT000702) category (marked with blue triangles in Fig. 3) are scattered in the representation space learned by GCN, which only aggregates features from immediate neighbors (Fig. 3a). Note that these drugs have no direct interactions among themselves in the training set. Since GCN-based models can only average the representations of immediate neighbors, these drugs are mapped relatively far from each other and closer to other interacting drugs.
SkipGNN considers skip similarity to aggregate features from second-order neighbors and shows relatively compact clusters compared to GCN (Fig. 3b). HOGCN, in contrast, considers the higher-order neighborhood and learns similar representations for drugs in the same category, as demonstrated by the compact clustering structure in Fig. 3c, even though no information about categorical similarity is provided to the model. This analysis demonstrates that HOGCN learns meaningful representations for drugs by aggregating feature representations from neighborhoods at various distances.

Next, we test whether the clustering pattern in Fig. 3 holds across many drug categories. With this aim, we consider all drug categories in DrugBank and compute the average Euclidean distance between each drug's representation and the representations of the other drugs in the same category. We then perform 2-sample Kolmogorov–Smirnov tests and find that the within-category representations learned by HOGCN are significantly more similar than expected by chance and significantly more similar than those learned by GCN and SkipGNN. This analysis again indicates that HOGCN learns meaningful representations for drugs by aggregating neighborhood information at various distances.

We next explore the robustness of network-based interaction prediction models to network sparsity. To this aim, we evaluate performance as the percentage of training edges varies from 10% to 70%, and make predictions on the remaining interactions. We further use 10% of the test edges as a validation set to select the best hyperparameter settings. For a fair comparison, we compare with
graph convolution-based methods that aggregate information from direct and/or second-order neighbors.

Fig. 4. AUPRC comparison of HOGCN's performance with that of alternative approaches with respect to network sparsity. HOGCN consistently achieves better performance across various fractions of training edges.
Fig. 4 shows the robustness of HOGCN to network sparsity. HOGCN achieves strong performance on all tasks at all levels of network sparsity, and its performance steadily improves as the number of training edges increases. The mixing of features from the higher-order neighborhood in HOGCN and SkipGNN yields an improvement over GCN and VGAE, which only consider direct neighbors. Since HOGCN can learn a linear combination of features from a 3-hop neighborhood in this experiment, it improves over SkipGNN in almost all cases. This demonstrates that features from farther distances are informative for interaction prediction in sparse networks.
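The edge-holdout protocol behind this experiment can be sketched as below; the complete graph on 20 nodes is a synthetic stand-in for a real interaction network, and the fractions mirror the 10%-70% range used above.

```python
# Sketch of the edge-holdout protocol: keep a fraction of edges for
# training and predict the rest. The edge list is synthetic.
import numpy as np

rng = np.random.default_rng(3)
edges = np.array([(i, j) for i in range(20) for j in range(i + 1, 20)])

for frac in (0.1, 0.3, 0.5, 0.7):
    idx = rng.permutation(len(edges))      # random split per sparsity level
    n_train = int(frac * len(edges))
    train, test = edges[idx[:n_train]], edges[idx[n_train:]]
    print(f"train fraction {frac}: {len(train)} train / {len(test)} test edges")
```

In the paper's setup a further 10% of the held-out edges would be reserved for hyperparameter validation.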
All graph convolution-based models proposed for biomedical link prediction output a confidence estimate p_ij for the interaction between two biomedical entities v_i and v_j. We thus test whether a predicted confidence p_ij represents the likelihood of a true interaction. In other words, we expect the confidence estimate p_ij to be calibrated, i.e., p_ij should represent the true interaction probability [35]. For example, given 100 predictions, each with a confidence of 0.9, we expect 90 interactions to be correctly classified as true interactions.

To evaluate calibration performance, we use reliability diagrams [35] and the Brier score [36]. The reliability diagram provides a visual representation of model calibration: it plots the observed fraction of positives as a function of predicted probability [35]. A model with perfectly calibrated predictions is represented by the diagonal in Fig. 5.

Fig. 5. Reliability diagrams for different graph convolution-based methods. The calibration performance is evaluated with the Brier score, reported in the legend (lower is better).

In addition to reliability diagrams, it is convenient to have a scalar summary statistic of calibration. The Brier score [36] is a proper scoring rule for measuring the accuracy of predicted probabilities; a lower Brier score indicates better calibration of a set of predictions. It is computed as the mean squared error between a predicted probability p_ij and the ground-truth interaction label A_ij:

$$\text{Brier score} = \frac{1}{|\mathcal{E}'|} \sum_{(i,j) \in \mathcal{E}'} \left( p_{ij} - A_{ij} \right)^2 \tag{7}$$

where $|\mathcal{E}'|$ denotes the number of test edges.

Fig. 5 shows the calibration plots for GCN, SkipGNN, and HOGCN (k = 3). For the DTI dataset, SkipGNN shows better calibration than GCN and HOGCN (Fig. 5a), indicating that second-order neighborhood information is sufficient there and that aggregating features from farther away makes the model overconfident.
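A minimal sketch of this calibration evaluation on synthetic predictions (not the paper's models): scikit-learn's `calibration_curve` produces the reliability-diagram points, and the Brier score of Eq. (7) is the mean squared error between p_ij and A_ij.

```python
# Reliability-diagram data and Brier score for synthetic predictions.
# Labels are drawn so that p is well calibrated by construction.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
p = rng.uniform(size=2000)                    # predicted probabilities p_ij
A = (rng.uniform(size=2000) < p).astype(int)  # ground-truth labels A_ij

brier = np.mean((p - A) ** 2)                 # Eq. (7)
frac_pos, mean_pred = calibration_curve(A, p, n_bins=5)

print(f"Brier score: {brier:.3f}")            # near 1/6 for calibrated uniform p
for m, f in zip(mean_pred, frac_pos):
    print(f"predicted {m:.2f} -> observed {f:.2f}")
```

For a calibrated model the observed fractions track the mean predictions, i.e., the points lie near the diagonal of the reliability diagram.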
For the other datasets, GCN is relatively overconfident at all confidence levels. For example, among interactions with high predicted confidence in the PPI (Fig. 5c) and GDI (Fig. 5d) datasets, the fraction of true positives falls well below the predicted confidence. In contrast, HOGCN achieves a lower Brier score than GCN and SkipGNN across the DDI, PPI, and GDI datasets, alluding to the benefits of aggregating higher-order neighborhood features for calibrated prediction. This analysis demonstrates that HOGCN with higher-order neighborhood mixing makes accurate and calibrated predictions for biomedical interactions.

In Section 6.3, we contrasted HOGCN's performance with that of alternative graph convolution-based methods for varying fractions of edges. In this experiment, we aim to observe the performance of HOGCN as the order k is increased to allow the model to aggregate neighborhood information from farther away. We follow a setup similar to that discussed in Section 6.3.

Fig. 6. AUPRC comparison for higher-order message passing with different fractions of training edges. The values of k for the different HOGCN models are reported in the legend.

Fig. 6 compares HOGCN models with higher-order neighborhood mixing for several settings of k. The prediction performance of HOGCN improves with the number of training interactions in all cases. The results also show that HOGCN's performance is not sensitive to the choice of k: the tested settings achieve comparable performance across the datasets. This analysis indicates that the 3-hop neighborhood provides sufficient information for interaction prediction across all datasets and that performance remains stable even for larger values of k.

Next, we perform literature-based validation of novel predictions.
Our goal is to evaluate the quality of HOGCN's novel predictions compared to those of GCN and SkipGNN, and to show that HOGCN predicts novel interactions with higher confidence. We consider GDIs and DDIs for this evaluation.

We first evaluate the potential of HOGCN to make novel GDI predictions. We collect 1,134,942 GDIs and their scores from DisGeNET [28]. The score reflects the number and types of sources and the number of publications supporting an association. After applying a score threshold, we obtain 17,893 new GDIs that are not in the training set. We make predictions on these 17,893 GDIs with GCN, SkipGNN, and HOGCN (k = 3). Out of the 17,893 GDIs, HOGCN predicts a higher probability than GCN for 17,356 (96.99%) GDIs and a higher probability than SkipGNN for 11,418 (63.8%) GDIs. Table 4 shows the top 5 GDIs with a significant increase in interaction probability when higher-order neighborhood mixing is considered. We also report the number of pieces of evidence from DisGeNET [28] supporting these predictions. The improvement in predicted probabilities shows that aggregating feature representations from higher-order neighbors makes HOGCN more confident about potential interactions, as discussed in Section 6.4.

TABLE 4
Novel GDI predictions with the number of pieces of supporting evidence from DisGeNET [28]. The probability columns report the confidences predicted by GCN, SkipGNN, and HOGCN, respectively.

Gene    Disease               GCN    SkipGNN  HOGCN  Evidence
PTGER1  Gastric ulcer         0.087  0.519    0.721  1
ANGPT1  Gastric ulcer         0.173  0.583    0.657  2
ABO     Pancreatic carcinoma  0.265  0.615    0.770  26
VCAM1   Endotoxemia           0.294  0.529    0.639  5
GPC3    Hepatoblastoma        0.307  0.540    0.598  17
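The model-versus-model comparison above (e.g., the 96.99% figure) amounts to counting, over the candidate interactions, how often one model assigns a higher probability than another; a sketch on synthetic scores:

```python
# Count the fraction of candidate interactions on which one model is more
# confident than another. Both score vectors are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(4)
p_gcn = rng.uniform(size=1000)                                     # baseline confidences
p_hogcn = np.clip(p_gcn + rng.normal(0.1, 0.05, size=1000), 0, 1)  # shifted-up stand-in

frac_higher = np.mean(p_hogcn > p_gcn)
print(f"HOGCN more confident on {frac_higher:.1%} of candidate interactions")
```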
We select two predicted GDIs with a large amount of supporting evidence and investigate the reason for the improvement in predicted confidence with HOGCN. Specifically, we choose the gene-disease pairs (a) ABO and Pancreatic carcinoma (26 pieces of evidence) and (b) GPC3 and Hepatoblastoma (17 pieces of evidence). To explain each prediction, we select the subnetwork containing all shortest paths between the pair. In particular, there are 49 shortest paths of length 3 between ABO and Pancreatic carcinoma, covering 6 diseases and 15 genes (Fig. 7a). Similarly, there are 20 shortest paths of length 3 between GPC3 and Hepatoblastoma, covering 6 diseases and 9 genes (Fig. 7b). Since these nodes are 3 hops away from each other and GCNs only consider immediate neighbors, GCNs assign low confidence to these interactions.

Examining the subnetwork in Fig. 7a, we find that most of the diseases are related to cancerous tumors of the pancreas and the prostate. Furthermore, pancreatic carcinoma is associated with other diseases such as pancreatic neoplasm, malignant neoplasm of the pancreas, and malignant neoplasm of the prostate [28]. Since ABO is linked with diseases related to pancreatic carcinoma, and other genes are related to these diseases as well, HOGCN captures this association (Fig. 7a) even though the nodes are far apart in the network. HOGCN predicts the association between GPC3 and Hepatoblastoma similarly (Fig. 7b).

Next, we perform a similar case study for DDIs and evaluate the predictions against DrugBank [34]. For this experiment, we make a prediction for every drug pair with GCN, SkipGNN, and HOGCN, and exclude the interactions that are already in the training set. Table 5 shows the top 5 interactions with an increase in interaction probability when higher orders of the neighborhood are considered.
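The explanation subnetworks of Fig. 7 are built from all shortest paths between a predicted pair; a stdlib-only sketch on a toy gene-disease graph (node names are illustrative, not DisGeNET entries):

```python
# Enumerate all shortest paths between a gene and a disease in a toy
# interaction graph, mimicking the subnetwork extraction behind Fig. 7.
from collections import deque

edges = [
    ("ABO", "disease_A"), ("disease_A", "gene_X"), ("gene_X", "PancreaticCarcinoma"),
    ("ABO", "disease_B"), ("disease_B", "gene_Y"), ("gene_Y", "PancreaticCarcinoma"),
]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def all_shortest_paths(adj, src, dst):
    """Breadth-first search that keeps every shortest path from src to dst."""
    queue, paths, best = deque([[src]]), [], None
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break                      # all shortest paths already found
        node = path[-1]
        if node == dst:
            best = len(path)
            paths.append(path)
            continue
        for nxt in adj[node]:
            if nxt not in path:        # simple paths only
                queue.append(path + [nxt])
    return paths

paths = all_shortest_paths(adj, "ABO", "PancreaticCarcinoma")
print(len(paths), "shortest paths of length", len(paths[0]) - 1)  # 2 paths of length 3
```

The union of the nodes on these paths forms the explanation subnetwork around the predicted pair.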
As discussed in Section 6.4, HOGCN makes predictions with higher confidence than GCN and SkipGNN for interactions that are likely to be true positives.

Moreover, we examine false positive DDI predictions of GCNs and investigate the subnetworks of the involved drugs in the DDI network to understand these predictions. Table 6 shows the top 5 interactions with a significant decrease in predicted confidence compared to GCN-based models. Since these DDIs are false positives [34], GCN-based models make overconfident predictions for them. In contrast, HOGCN
significantly reduces the predicted confidence of these DDIs being true positives, indicating that the higher-order neighborhood allows HOGCN to identify false positive predictions. In particular, HOGCN can identify the false positive DDI between Belimumab and Estazolam even though the two drugs are 3 hops away from each other.

Fig. 7. Subnetworks with predicted interactions (marked by bold dashed lines) between (a) ABO and Pancreatic carcinoma and (b) GPC3 and Hepatoblastoma, together with all shortest paths between these pairs. Known interactions are shown as gray lines. Diseases are shown as dark circles and genes as white circles.

TABLE 5
Novel DDI predictions with literature evidence supporting each interaction. The probability columns report the confidences predicted by GCN, SkipGNN, and HOGCN, respectively.

Drug 1        Drug 2         GCN    SkipGNN  HOGCN  Evidence
Nelfinavir    Acenocoumarol  0.192  0.318    0.417  [37]
Praziquantel  Itraconazole   0.609  0.721    0.811  [38]
Cisapride     Droperidol     0.618  0.725    0.823  [39]
Dapsone       Warfarin       0.632  0.720    0.885  [40]
Levofloxacin  Tobramycin     0.663  0.760    0.823  [41]
TABLE 6
Predicted probabilities for negative (false positive) DDIs. The probability columns report the confidences predicted by GCN, SkipGNN, and HOGCN, respectively.

Drug 1             Drug 2        GCN    SkipGNN  HOGCN
Tranylcypromine    Melphalan     0.925  0.478    0.065
Belimumab          Estazolam     0.912  0.477    0.178
Methotrimeprazine  Cloxacillin   0.907  0.406    0.065
Hydrocodone        Melphalan     0.905  0.193    0.012
Ibrutinib          Mecamylamine  0.899  0.398    0.353
We select the subnetworks involving these drugs to investigate the reason for such predictions. Fig. 8 shows the subnetworks with all shortest paths between the drug pairs in Table 6. Examining the figure, we observe that the drugs in these false positive DDIs share common immediate neighbors in all cases. GCN makes wrong predictions for these DDIs with high confidence. SkipGNN, by considering the skip similarity, becomes less confident that the interactions are true positives. HOGCN further reduces the predicted confidence for Tranylcypromine and Melphalan to 0.065, indicating that there is no association between these drugs.

Fig. 8. Subnetworks containing false positive predictions (marked by dark dashed lines) and all shortest paths between (a) Tranylcypromine and Melphalan, (b) Methotrimeprazine and Cloxacillin, (c) Hydrocodone and Melphalan, and (d) Ibrutinib and Mecamylamine. Other known interactions are shown as gray lines. Dark circles denote drugs.
These case studies show that HOGCN with higher-order neighborhood mixing not only provides information for the identification of novel interactions but also helps HOGCN reduce false positive predictions.
CONCLUSION
We present a novel deep graph convolutional network (HOGCN) for biomedical interaction prediction. Our proposed model adopts a higher-order graph convolutional layer to learn a mixing of the feature representations of neighbors at various scales. Experimental results on four interaction datasets demonstrate the superior and robust performance of the proposed model. Furthermore, we show that HOGCN makes accurate and calibrated predictions by considering higher-order neighborhood information.

There are several directions for future study. Our approach only considers the known interactions to flag potential interactions. Other sources of biomedical information, such as the various physicochemical and biological properties of biomedical entities, could provide additional information about interactions, and we plan to investigate the integration of such features into the model. As HOGCN aggregates neighborhood information at various distances and can flag novel interactions, it would also be interesting to provide interpretable explanations for the predictions in the form of a small subgraph of the input interaction network G that is most influential for the predictions [42].

ACKNOWLEDGMENTS
This work was supported by the NSF [NSF-1062422 to A.H.], [NSF-1850492 to R.L.] and the NIH [GM116102 to F.C.].

REFERENCES

[1] L. Cowen, T. Ideker, B. J. Raphael, and R. Sharan, "Network propagation: a universal amplifier of genetic associations," Nature Reviews Genetics, vol. 18, no. 9, p. 551, 2017.
[2] K. Luck, D.-K. Kim, L. Lambourne, K. Spirohn, B. E. Begg, W. Bian, R. Brignall, T. Cafarelli, F. J. Campos-Laborie, B. Charloteaux et al., "A reference map of the human binary protein interactome," Nature, vol. 580, no. 7803, pp. 402–408, 2020.
[3] M. Zitnik, M. Agrawal, and J. Leskovec, "Modeling polypharmacy side effects with graph convolutional networks," Bioinformatics, vol. 34, no. 13, pp. i457–i466, 2018.
[4] Y. Luo, X. Zhao, J. Zhou, J. Yang, Y. Zhang, W. Kuang, J. Peng, L. Chen, and J. Zeng, "A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information," Nature Communications, vol. 8, no. 1, pp. 1–13, 2017.
[5] M. Agrawal, M. Zitnik, J. Leskovec et al., "Large-scale analysis of disease pathways in the human interactome," in PSB. World Scientific, 2018, pp. 111–122.
[6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, "Geometric deep learning: going beyond Euclidean data," IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
[7] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang, "DeepInf: Social influence prediction with deep learning," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2110–2119.
[8] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, "Graph convolutional neural networks for web-scale recommender systems," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 974–983.
[9] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1263–1272.
[10] T. N. Kipf and M. Welling, "Variational graph auto-encoders," arXiv preprint arXiv:1611.07308, 2016.
[11] X. Yue, Z. Wang, J. Huang, S. Parthasarathy, S. Moosavinasab, Y. Huang, S. M. Lin, W. Zhang, P. Zhang, and H. Sun, "Graph embedding on biomedical networks: Methods, applications and evaluations," Bioinformatics, vol. 36, no. 4, pp. 1241–1251, 2020.
[12] K. Huang, C. Xiao, L. Glass, M. Zitnik, and J. Sun, "SkipGNN: Predicting molecular interactions with skip-graph networks," arXiv preprint arXiv:2004.14949, 2020.
[13] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. Ver Steeg, and A. Galstyan, "MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing," in International Conference on Machine Learning, 2019, pp. 21–29.
[14] I. A. Kovács, K. Luck, K. Spirohn, Y. Wang, C. Pollis, S. Schlabach, W. Bian, D.-K. Kim, N. Kishore, T. Hao et al., "Network-based prediction of protein interactions," Nature Communications, vol. 10, no. 1, pp. 1–8, 2019.
[15] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.
[16] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[17] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[18] I. Ahmad, M. U. Akhtar, S. Noor, and A. Shahnaz, "Missing link prediction using common neighbor and centrality based parameterized algorithm," Scientific Reports, vol. 10, no. 1, pp. 1–9, 2020.
[19] O. Keskin, N. Tuncbag, and A. Gursoy, "Predicting protein–protein interactions from the molecular to the proteome level," Chemical Reviews, vol. 116, no. 8, pp. 4884–4909, 2016.
[20] K. Kishan, R. Li, F. Cui, Q. Yu, and A. R. Haake, "GNE: a deep learning framework for gene network inference by aggregating biological information," BMC Systems Biology, vol. 13, no. 2, p. 38, 2019.
[21] A. R. Benson, R. Abebe, M. T. Schaub, A. Jadbabaie, and J. Kleinberg, "Simplicial closure and higher-order link prediction," Proceedings of the National Academy of Sciences, vol. 115, no. 48, pp. E11221–E11230, 2018.
[22] S. Cao, W. Lu, and Q. Xu, "GraRep: Learning graph representations with global structural information," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 891–900.
[23] D. Zhu, P. Cui, Z. Zhang, J. Pei, and W. Zhu, "High-order proximity preserved embedding for dynamic networks," IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 11, pp. 2134–2144, 2018.
[24] R. A. Rossi, N. K. Ahmed, and E. Koh, "Higher-order network representation learning," in Companion Proceedings of The Web Conference 2018, 2018, pp. 3–4. [Online]. Available: https://doi.org/10.1145/3184558.3186900
[25] R. A. Rossi, N. K. Ahmed, E. Koh, S. Kim, A. Rao, and Y. A. Yadkori, "HONE: higher-order network embeddings," arXiv preprint arXiv:1801.09303, 2018.
[26] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[27] M. Zitnik, R. Sosic, and J. Leskovec, "BioSNAP Datasets: Stanford biomedical network dataset collection," http://snap.stanford.edu/biodata, 2018.
[28] J. Piñero, J. M. Ramírez-Anguita, J. Saüch-Pitarch, F. Ronzano, E. Centeno, F. Sanz, and L. I. Furlong, "The DisGeNET knowledge platform for disease genomics: 2019 update," Nucleic Acids Research, vol. 48, no. D1, pp. D845–D855, 2020.
[29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[30] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[33] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[34] D. S. Wishart, C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang, and J. Woolsey, "DrugBank: a comprehensive resource for in silico drug discovery and exploration," Nucleic Acids Research, vol. 34, no. suppl 1, pp. D668–D672, 2006.
[35] A. Niculescu-Mizil and R. Caruana, "Predicting good probabilities with supervised learning," in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 625–632.
[36] G. W. Brier, "Verification of forecasts expressed in terms of probability," Monthly Weather Review, vol. 78, no. 1, pp. 1–3, 1950.
[37] B. Garcia, P. De Juana, T. Bermejo, J. Gómez, P. Rondon, I. Garcia Plaza et al., "Sequential interaction of ritonavir and nelfinavir with acenocoumarol (abstract 1069)," in Seventh European Conference on Clinical Aspects and Treatment of HIV Infection, Lisbon, Portugal, 1999.
[38] E. Perucca, "Clinically relevant drug interactions with antiepileptic drugs," British Journal of Clinical Pharmacology, vol. 61, no. 3, pp. 246–255, 2006.
[39] E. L. Michalets and C. R. Williams, "Drug interactions with cisapride," Clinical Pharmacokinetics, vol. 39, no. 1, pp. 49–75, 2000.
[40] T. Truong and J. Haley, "Probable warfarin and dapsone interaction," Clinical and Applied Thrombosis/Hemostasis, vol. 18, no. 1, pp. 107–109, 2012.
[41] B. Ozbek and G. Otuk, "Post-antibiotic effect of levofloxacin and tobramycin alone or in combination with cefepime against Pseudomonas aeruginosa," Chemotherapy, vol. 55, no. 6, pp. 446–450, 2009.
[42] Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec, "GNNExplainer: Generating explanations for graph neural networks," in Advances in Neural Information Processing Systems, 2019.