Beyond Observed Connections: Link Injection∗

Jie Bu
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
[email protected]
M. Maruf
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
[email protected]
Arka Daw
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
[email protected]
Abstract
In this paper, we propose link injection, a novel method that helps any differentiable graph machine learning model go beyond the observed connections in the input data in an end-to-end learning fashion. It discovers (weak) connections that favor the current task but are not present in the input data via a parametric link injection layer. We evaluate our method on both node classification and link prediction tasks using a series of state-of-the-art graph convolution networks. Results show that link injection helps a variety of models achieve better performance on both applications. Further empirical analysis shows the great potential of this method in efficiently exploiting unseen connections through the injected links.
Recently, we have seen a growing popularity of graphs in machine learning [1, 2]. Even though deep learning can capture arbitrary patterns in the data, it often ignores the plethora of additional information it could use if the data were represented as graphs. For example, in citation networks [3, 4], papers are linked to each other if one paper cites the other. This can be used to group papers into different categories. In protein-protein networks, a link between two proteins denotes the type of interaction between them. These inherent structures in the data can be leveraged to learn richer node embeddings, which in turn allows us to improve predictive performance. Another drawback of traditional machine learning methods is the assumption that instances are independent of each other. For graphs, this assumption is not valid, as the interaction between two nodes is incorporated in the edges, which may or may not be represented by a feature set.

∗ Ongoing project. Updates may be made in the future. Special thanks to Dr. Hoda Eldardiry and Dr. Anuj Karpatne.

Using graphs to represent the knowledge we have, we can categorize graph data into three categories:

1. The true graph: the actual graph that contains all the information for the target problem. We use G∗ or G∗(X, A∗) to denote it, where X is the node features and A∗ is the adjacency matrix.
2. The observed graph: the partial graph given by the data, which can be viewed as the true graph with some edges dropped out. We use G to denote it. In link prediction problems, we reasonably assume the only missing part is a subset of the edges of G∗; the node features stay the same, so we use G(X, A).
3. The predicted graph: the observed graph with predicted links.
We use Ĝ or Ĝ(X, Â) to denote it.

Most existing methods try to learn structural patterns from the observed graph in a supervised way and then use these patterns to predict links; common state-of-the-art approaches include walk-based methods and network-flow-based methods. Recently, graph neural networks (GNNs) and their convolutional variants, graph convolution networks (GCNs), have attracted a lot of attention in link prediction research. These methods all depend to some degree on message flows: for walk-based methods, the walks have starting points and transit to neighboring nodes at every step; flow-based methods can be viewed as converged walks where randomness is addressed by statistical estimates. GNNs and GCNs are no exceptions. Both families rely on message passing to capture both input features and graph structures. To avoid confusion and for convenience when describing the propagation of message flows in the network, we refer to such walks, flows, and message propagation in the graph as message passing in this paper.

Figure 1: An Example of Missing Key Links (Red Dashed) and Injected Links (Blue Solid)

We believe such message passing is crucial to estimate how strongly two nodes are connected. Previous research on graphs empirically shows that most real-world networks tend to have clusters of nodes, i.e., connected nodes are more likely to belong to the same densely connected community or cluster. Based on this observed nature of real-world graphs, a reasonable model will tend to predict the presence of a link between two nodes when there exists a large number of short paths and connections between them, in which case messages can be passed more efficiently between the two nodes.
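The effect of a missing key link on message passing can be illustrated with a small hypothetical example (the 5-node graph and the dropped edge are made up for illustration):

```python
import numpy as np

# True graph G*(X, A*): a 5-node toy graph (hypothetical example).
A_true = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Observed graph G(X, A): the true graph with a key link (2, 3) dropped.
A_obs = A_true.copy()
A_obs[2, 3] = A_obs[3, 2] = 0.0

def walks(A, s, t, max_len=3):
    """Number of walks of length <= max_len between s and t: a crude
    proxy for how much "message" can flow between the two nodes."""
    return sum(int(np.linalg.matrix_power(A, k)[s, t]) for k in range(1, max_len + 1))

print(walks(A_true, 0, 4))  # > 0: messages can reach node 4
print(walks(A_obs, 0, 4))   # 0: the missing key link cuts off message passing
```

Dropping the single bridge edge (2, 3) leaves no short walk between nodes 0 and 4, so a message-passing model sees the pair as effectively disconnected.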
This leads to a strong hypothesis that remains open for discussion and still needs future research to prove: for most state-of-the-art models, when messages cannot be effectively passed between a pair of nodes, the model will tend to predict the absence of a link between the two nodes. The message passing mechanism integrated in most state-of-the-art models can effectively capture such connectivity information, allowing them to achieve good performance in real-world link prediction problems, e.g., social networks and protein-protein interaction (PPI). However, one of the core ideas motivating this work is that missing links destroy the neighboring connectivity and impair message passing, hence leading to erroneous estimation of the connection strength between two nodes, especially when some of the key links are missing (see Figure 1).
Our node classification and link prediction tasks are related to previous node embedding approaches, general supervised approaches to learning over graphs, and convolutional neural networks over graph-structured data.
Embedding-based approaches: There are several node embedding approaches that learn low-dimensional embeddings using random walk statistics and matrix-factorization-based objectives. These embedding algorithms directly train embeddings for individual nodes, and they require expensive additional training to make predictions on new nodes. Random walk based methods like DeepWalk [5] and node2vec [6] use the skip-gram model from word2vec [7] and represent the walks as sentences. These methods are related to classical approaches like spectral clustering [8] as well as PageRank algorithms [9]. Another type of embedding-based approach comprises graph auto-encoders (GAE) [10]. GAEs are unsupervised learning frameworks that aim to learn a low-dimensional embedding via an encoder and then reconstruct the graph via a decoder. Graph auto-encoder models tend to employ a GCN as the building block of the encoder and reconstruct the structural information via a link prediction decoder.
Supervised learning over graphs: A GNN (graph neural network) is a type of neural network that operates directly on the graph structure. Suppose, for a node classification task, each node is associated with a label, and we want to predict the labels of nodes without ground truth. A GNN leverages the labeled nodes to predict the labels of the unlabeled ones, and it does so using message passing, or neighborhood aggregation. At each iteration it learns a low-dimensional vector representation for each node that contains its neighborhood information. With the right choice of loss function in the supervised setting, these GNNs can backpropagate the loss and learn a task-specific model. The methods in [11, 12, 13, 14] have used GNNs for node classification.
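A rough sketch of the neighborhood aggregation described above (a minimal mean aggregator for illustration, not any specific model from the cited papers):

```python
import numpy as np

def aggregate_mean(X, A):
    """One round of neighborhood aggregation: each node's new
    representation is the mean of its own and its neighbors' features."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # node degrees (incl. self)
    return (A_hat @ X) / deg                # mean over the neighborhood

# Toy path graph 0 - 1 - 2 with one-dimensional features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [0.0], [1.0]])

H = aggregate_mean(X, A)
print(H.ravel())  # node 1 averages over all three nodes
```

Stacking such rounds (with learned weights and nonlinearities in between) is what lets a GNN propagate label information from labeled to unlabeled nodes.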
Graph convolutional networks: Several convolutional neural network architectures for learning over graphs have been proposed in recent years [15, 16, 17, 1, 18]. Some of these methods use spectral convolution. Defferrard et al. [16] use fast localized convolutions, and Kipf et al. [1] use fast approximate convolutions in a semi-supervised setting. Both algorithms require knowing the full graph Laplacian during training. GraphSAGE (SAmple and aggreGatE) [2], an inductive node embedding approach, can be viewed as an extension of these convolutional approaches. It aggregates feature information from a node's local neighborhood and, by doing so, simultaneously captures the topological structure of each node's neighborhood.
Graph connectivity augmentation: The idea of augmenting graph connectivity for better message propagation in GCNs is not new; it has shown its effectiveness in Graph U-Nets [19]. Similar to [20], which builds links between nodes in a graph whose distances are at most k hops, we also build "weak links" initially, but for every pair of nodes. A rough analogy is that link injection can be viewed as a learnable global graph connectivity augmentation. However, the idea of learning the injected links is, to our knowledge, new to the community and substantially different from all existing graph connectivity augmentation methods.

Link Injection
To address the problems mentioned above, we propose link injection as a way to retrieve part of the missing structural information in the given/observed graph. It provides a way of using artificial connections, i.e., injected links, to augment the message passing on the observed graph in the training phase.

Figure 2 shows the pipeline of the basic link injection method. A highlight of this method is that the connection strengths of the injected links can be trained in an end-to-end fashion when combined with differentiable models, e.g., graph neural networks. In practice, the injected links J are regarded as learnable parameters.

Figure 2: Flow Chart Showing the Pipeline of the End-to-End Link Injection Architecture

A key to success when using link injection is a properly constructed loss function, since the connection strengths are updated in order to reduce the loss. This implies that there is no general way to generate one fixed set of injected links for all tasks, e.g., node classification and link prediction, even on the same graph. Since the loss function can vary across graph machine learning tasks, different problems on the same graph may have different optimal sets of injected links that help our model achieve the best performance. Therefore, it is important to discuss link injection under the different contexts of graph machine learning problems. In the following sections, we cover link injection in two particular contexts: node classification and link prediction.
First, we introduce the notation used in the paper. Denote the loss function used for evaluating classification performance, e.g., the cross-entropy loss, as L. Define a differentiable model M : (R^{N_i × D}, R^{N_i × N_i}) → R^{N_i × C}, i = 1, ..., B, where B is the batch size (in many cases the number of input graphs), D is the dimension of the node features, C is the number of output classes, and N_i is the number of nodes in the i-th graph. The forward propagation of the model can be expressed as M(X_i, A_i) given the input node features and adjacency matrix of the i-th graph. In the backpropagation phase, the gradient of the differentiable model M calculated for a model parameter p is denoted as ∇_p L. An example of simple link injection for node classification tasks is shown in Algorithm 1.

Algorithm 1:
Link Injection for Node Classification
Data: G → {X, A, Y}; X ∈ R^{N×D}; A ∈ R^{N×N}; Y ∈ Z^N, ∀y ∈ Y, y ∈ [1, C]
Result: S ∈ R^{N×C}
Parameter: J ∈ R^{N×N}

Initialize J (random or constant)
while Training do
    A′ ← A + ReLU(J)
    Clip(A′) so that A′ ∈ [0, 1]
    S ← M(X, A′)
    L ← CrossEntropyLoss(S, Y)
    L ← L + Σ_{p∈M} ‖p‖ + ‖J‖
    ∇_p L ← Backpropagate(M, L)
    ∇_J L ← Backpropagate(J, L)
    Update(M, ∇_p L)
    Update(J, ∇_J L)
    if Termination Conditions Satisfied then
        Stop training and return M
    end
end

By introducing additional connections, we hope link injection can help recover part of the missing structural information in the observed graph.
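Algorithm 1 can be sketched in PyTorch roughly as follows; the toy predictor, the dimensions, and the regularisation weight are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinkInjection(nn.Module):
    """Learnable injected links J: A' = clip(A + ReLU(J), 0, 1)."""
    def __init__(self, num_nodes, init_val=0.01):
        super().__init__()
        self.J = nn.Parameter(torch.full((num_nodes, num_nodes), init_val))

    def forward(self, A):
        return torch.clamp(A + F.relu(self.J), 0.0, 1.0)

class ToyGCN(nn.Module):
    """A toy differentiable predictor standing in for M (a GCN in the paper)."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.lin = nn.Linear(in_dim, num_classes)

    def forward(self, X, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)
        return self.lin((A @ X) / deg)  # mean aggregation + linear layer

N, D, C = 6, 4, 3                       # toy sizes, chosen for illustration
X = torch.randn(N, D)
A = (torch.rand(N, N) < 0.3).float()
Y = torch.randint(0, C, (N,))

inject, model = LinkInjection(N), ToyGCN(D, C)
opt = torch.optim.Adamax(list(model.parameters()) + list(inject.parameters()), lr=0.01)

for _ in range(50):
    opt.zero_grad()
    A_prime = inject(A)                 # A' = clip(A + ReLU(J), 0, 1)
    S = model(X, A_prime)
    loss = F.cross_entropy(S, Y)
    # Norm regularisation on predictor parameters and injections, as in Algorithm 1
    loss = loss + 1e-4 * (sum(p.norm() for p in model.parameters()) + inject.J.norm())
    loss.backward()                     # gradients flow to both M and J
    opt.step()
```

Because A′ is a differentiable function of J, the same backward pass updates both the predictor and the injected links, which is the end-to-end property highlighted in Figure 2.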
Analogous to node classification, success in link prediction also depends highly on effective message passing, which is where link injection helps. It tries to maximize the likelihood of the observed graph given certain predicted links:

max_Â P(G | Ĝ)

For the predicted links, we use a non-negative bias term to represent the difference between the predicted graph and the observed graph, Â = A + J. Then we can formulate the maximization problem as

max_J P(A | Â, X)

where J is a non-negative matrix.
To effectively obtain graph embeddings for both node features and structural information, we choose GCNs as the predictor. Nevertheless, link injection works with a variety of state-of-the-art link prediction algorithms, so we use F to denote the predictor, which maps the input space to predicted links. Then we can express supervised prediction with a simple formula:

Â = F(A, X)

Our link injection method works in a semi-supervised fashion; the prediction gives a scoring matrix representing the estimated connection strength at each layer:

S = F(Â, X)

The loss function is designed as:

L = ‖ReLU(A − S)‖_F + λ‖S‖_F

Dataset

We used the benchmark citation datasets Cora and CiteSeer [3, 4] for our experiments. In these networks, nodes correspond to documents and edges correspond to citations. The features of these networks are the bag-of-words representations of the documents. The subjects of the documents are represented as labels. The Cora dataset contains a number of machine learning papers divided into one of 7 classes, while the CiteSeer dataset has 6 class labels. Stop words and words with frequency less than 10 have been removed. The final corpus of Cora has 2708 documents, 1433 distinct words in the vocabulary, and 5429 links. For CiteSeer, there are 3312 documents, 3703 distinct words in the vocabulary, and 4732 links. In our preliminary experiments, the link injection model outperforms several baselines on the node classification tasks. Table 1 shows the details of these datasets.

Table 1: Dataset Statistics
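A minimal sketch of the link prediction loss above, L = ‖ReLU(A − S)‖_F + λ‖S‖_F, using toy matrices chosen for illustration:

```python
import numpy as np

def link_prediction_loss(A, S, lam=0.01):
    """L = ||ReLU(A - S)||_F + lam * ||S||_F.
    The first term penalises scores that fail to reproduce observed edges;
    the second keeps the scoring matrix S (and hence the injections) sparse."""
    relu = np.maximum(A - S, 0.0)
    return np.linalg.norm(relu, "fro") + lam * np.linalg.norm(S, "fro")

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# If S reproduces every observed edge, only the regulariser remains:
S_perfect = A.copy()
print(link_prediction_loss(A, S_perfect, lam=0.01))  # 0.01 * ||A||_F ≈ 0.0141
```

Note the asymmetry of the first term: scores are only penalised for being below observed edges, so the model is free to assign strength to unobserved pairs, which is exactly where injected links appear.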
We evaluate our models on the Cora and CiteSeer datasets for node classification. For splitting the dataset, we used the common split given by [3]. We use three different graph neural networks and graph convolutional networks as the differentiable predictors in Figure 2 to test the link injection method on different models. The three models are Graph Convolutional Networks (GCNConv) [1], Graph SAmple and aggreGatE (GraphSAGE) [2], and a simple graph neural network (GNN) under the same message passing and neighborhood aggregation framework as the other two models.

All models are trained for 10,000 epochs using the Adamax optimizer [21]. A sliding-window early stopping trick was used to prevent overfitting. It calculates the mean accuracy over the latest two non-overlapping windows on the validation set. If the mean accuracy of the latter window is lower than that of the previous one by a margin larger than a set tolerance value, training stops. For our experiments, we set the sliding window size to 100, the tolerance to 0.005, and the earliest stop to 5000, i.e., no early stopping before the 5000th epoch.

To evaluate our models for link prediction, we used the Cora dataset. The baselines are the same as those used for node classification. All code can be found on GitHub: https://github.com/jayroxis/Link-Injection .

Results/Evaluation
Table 2 compares the accuracy (macro) and area under the ROC curve (macro) for the different baselines on the Cora and CiteSeer datasets. Preliminary results of using link injection in graph neural networks for node classification show significant improvement in the evaluation metrics on both datasets. For Cora, an improvement of 2.6% in accuracy was observed, while for CiteSeer we observed a significant 6.7% accuracy improvement over the baseline GNN. Link injection in GCNConv improved accuracy on Cora and CiteSeer by 2.2% and 2.% respectively. Although the accuracy of GraphSAGE improved by 1.7% on CiteSeer, we observed a slight decrease of 0.8% in accuracy on the Cora dataset. These results are coherent with our original hypothesis that link injection allows efficient message passing in graphs.

Our preliminary experiments also show an all-around improvement in area under the ROC curve. It can be inferred that link injection improves the degree of separability of the model; thus it is better at distinguishing between the classes than the respective baselines. This is also reflected in the improved accuracy for node classification observed above.
Table 2: Results for node classification tasks on the Cora and CiteSeer datasets. For each dataset, accuracy (macro) and AUC-ROC (macro) are reported for GNN, GCNConv, and GraphSAGE, with and without link injection (models with link injection are marked with *). [The numeric values in this table were not recoverable from the source.]

In order to evaluate the quality of the injected links and test our hypothesis, we designed an experiment where we train the models (both with and without link injection) on graphs where no information regarding the edges is available during training. As expected, due to the unavailability of edges during training on Cora, all baseline models without link injection performed very poorly. With 7 classes in the Cora dataset, we anticipated an accuracy of 1/7 ≈ 0.143, which was reflected in our experiments as well.

Table 3: Evaluating injected links in the node classification task on Cora with no edges available in training. The columns compare GraphSAGE, GCNConv, and GNN with and without link injection (*). Accuracy is the macro-accuracy for node classification, while Hits, Hit Rate, and MR are metrics for evaluating the injected links against the observed connections in the graph. The best results among five random experiments are reported in the table. [Most numeric values in this table were not recoverable from the source; the first accuracy entry reads 0.144.]
Table 4 compares the accuracy, precision, and recall for link prediction for the different baselines on the Cora dataset. The results show significant improvement in accuracy and precision for baselines with link injection. The recall also improves for GNN, while we observe a slight decrease in recall for GCNConv and GraphSAGE when equipped with link injection. Overall, the results show that by introducing weak artificial links during the training process and learning the weights of these weak links, the overall performance of the baselines on link prediction tasks can be improved significantly.

Table 4: Results for link prediction tasks on the Cora dataset. Accuracy (%), precision (%), and recall (%) are reported for GNN, GCNConv, and GraphSAGE, with and without link injection (*). [Most numeric values in this table were not recoverable from the source.]

Table 5 demonstrates the injected links in the neighborhood from the link prediction task on the Cora dataset. To demonstrate that the injected links learned by our model are not arbitrary, we perform some analysis on the learned injections. When the top 50/10556 links were used for evaluating the injected links, it was found that for all the baselines, the injected links were either neighbors (i.e., an edge was present between them in the original graph) or completely disconnected (i.e., there was no path between the two nodes).
Models        Top 50                     Top 10556
              Neighbors   Disconnected   Neighbors   Disconnected
GCNConv*      100.00%     0.00%          60.72%      39.28%
GNN*          100.00%     0.00%          86.92%      13.08%
GraphSAGE*    10.00%      90.00%         72.95%      27.05%
Table 5: Injected links in the neighborhood, from the link prediction task on the Cora dataset.

Since the Cora dataset has 10556 edges, we used the top 10556 values of the injected links and evaluated them against the edges already present in the Cora dataset to assess the quality of the injected links. Table 6 shows the statistics of the 10556 top-scored injections.
Models               GCNConv*   GraphSAGE*   GNN*
Training Fraction    80%        80%          80%
Hits (Total)         6410       7701         9175
Hits (∉ Train)       182        101          403
Hit Rate (Total)     60.724%    72.954%      86.917%
Hit Rate (∉ Train)   8.621%     4.784%       19.089%
MR                   2567       3829         2341
MR Ratio             0.7568     0.6372       0.7782

Table 6: Statistics of the 10556 top-scored injected links in reflecting the original (observed) graph.
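The Hits and Hit Rate statistics of the kind reported in Table 6 can be computed roughly as follows; the graph and injection matrix below are synthetic stand-ins, not the paper's trained values:

```python
import numpy as np

def top_k_hits(J, A, k):
    """Rank injected links by strength and count how many of the top-k
    coincide with edges of the observed graph A ("Hits" and "Hit Rate")."""
    iu = np.triu_indices(J.shape[0], k=1)   # undirected: upper triangle only
    order = np.argsort(J[iu])[::-1][:k]     # indices of the k strongest links
    hits = int(A[iu][order].sum())
    return hits, hits / k

rng = np.random.default_rng(0)
n = 20
A = np.triu((rng.random((n, n)) < 0.2).astype(float), k=1)
A = A + A.T                                  # symmetric observed adjacency

# Synthetic injections that (noisily) mirror the observed edges:
J = A + 0.1 * rng.random((n, n))

k = int(A.sum() // 2)                        # one slot per observed edge
hits, rate = top_k_hits(J, A, k)
```

Since every observed edge here gets an injection strength above any non-edge, all top-k injections are hits; on a real trained model the rate drops to the Table 6 range.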
Apart from the results shown in the previous section, we found more interesting observations. Here we discuss several of them:

1. The link injection displayed remarkable consistency with the original graphs, especially in the link prediction task.
2. Different predictive models can result in different preferences in injections.

Figure 3 shows the neighboring subgraph of the highest-scored injected link of different models in the link prediction task on
Cora, where the red dots are the vertices of the injected link and the blue dots are the nodes in the neighboring subgraph. The black solid lines are the edges that connect these nodes. A noticeable difference between GraphSAGE and GCNConv or GNN is that GraphSAGE tended to welcome the assistance of injected links that lay between disconnected communities. This probably acts as a factor that makes the GraphSAGE model we constructed with link injection harder to train.

Figure 4a shows that the injections learned during the training process are different for the different baselines, but the sum of the injected links for the different baselines seems to converge by the end of training. Figure 4b shows the distribution of the injections learned by the model. It can be seen that the value of the injections and the number of injections are inversely proportional. Furthermore, a large number of injections become zero after the training process is finished, which is in accordance with our goals, i.e., we want the injections to be sparse.
Figure 3: Examples of the top-scored injected links of different models ((a) GCNConv, (b) GraphSAGE, (c) GNN) in the link prediction task on the Cora dataset.

Figure 4: (a) Sum of injected links per epoch (log-scale epochs); (b) distribution plot of non-zero injection values (log-scale frequency) for GNN, GCNConv, and GraphSAGE, along with the initial value.
Link injection, when paired with differentiable predictors such as GNN, GCNConv, and GraphSAGE, shows significant improvement in performance for node classification and link prediction tasks. Also, the learned injections are not arbitrary and demonstrate interesting properties, as described earlier. As an extension to this work, we plan on implementing negative link injections to deal with noisy links that might be present in the data. Since a major drawback of our proposed approach is the computational overhead of using the full adjacency matrix, a possible extension of this work would be to use diff-pool [22] to reduce the size of the graph and construct link injections on the coarsened graph. Other possible extensions include storing a different injection matrix for each graph when the problem involves multiple graphs.
References

[1] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[2] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs, 2017.
[3] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. CoRR, abs/1603.08861, 2016.
[4] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.
[5] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
[6] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.
[7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[8] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
[9] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[10] Thomas N. Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[11] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.
[12] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 729–734. IEEE, 2005.
[13] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[14] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[15] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[16] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[17] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
[18] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
[19] Hongyang Gao and Shuiwang Ji. Graph U-Nets, 2019.
[20] Sundeep Prabhakar Chepuri and Geert Leus. Subsampling for graph power spectrum estimation, 2016.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, 2018.