Learning the Implicit Semantic Representation on Graph-Structured Data
Likang Wu, Zhi Li, Hongke Zhao, Qi Liu, Jun Wang, Mengdi Zhang, Enhong Chen
Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, Hefei, China
{wulk,zhili03}@mail.ustc.edu.cn, {qiliuql,cheneh}@ustc.edu.cn
Tianjin University, Tianjin, China
[email protected]
Meituan-Dianping Group, Beijing, China
[email protected]
Abstract.
Existing representation learning methods in graph convolutional networks are mainly designed by describing the neighborhood of each node as a perceptual whole, while the implicit semantic associations behind the highly complex interactions of graphs remain largely unexploited. In this paper, we propose Semantic Graph Convolutional Networks (SGCN), which explore implicit semantics by learning latent semantic-paths in graphs. Previous work has explored graph semantics via meta-paths; however, these methods mainly rely on explicit heterogeneous information that is hard to obtain for a large amount of graph-structured data. SGCN first breaks through this restriction by leveraging semantic-paths dynamically and automatically during the node aggregating process. To evaluate our idea, we conduct extensive experiments on several standard datasets, and the empirical results show the superior performance of our model.
Keywords: Graph Neural Networks · Semantic Representation · Network Analysis.
1 Introduction

The representations of objects (nodes) in large graph-structured data, such as social or biological networks, have been proved extremely effective as feature inputs for graph analysis tasks. Recently, there have been many attempts in the literature to extend neural networks to deal with representation learning of graphs, such as Graph Convolutional Networks (GCN) [15], GraphSAGE [12] and Graph Attention Networks (GAT) [34]. In spite of enormous success, previous graph neural networks mainly propose representation learning methods that describe the neighborhood as a perceptual whole, and they have not gone deep into the exploration of semantic information in graphs. Our code is available online at https://github.com/WLiK/SGCN_SemanticGCN.
Fig. 1.
Example of implicit semantic-paths in a scholar cooperation network. There are no explicit node (relation) types. Behind the same kind of relation (black solid edges), there are implicit factors (dotted lines; A is the student of B, B is the advisor of C). So the path A-B-C expresses "Student-Advisor-Student", and A and C are "classmates"; B-C-D expresses "Advisor-Student-Advisor", and B and D are "colleagues".

Taking the movie network as an example, the paths based on the composite relations "Movie-Actor-Movie" and "Movie-Director-Movie" may reveal two different semantic patterns, i.e., that two movies share the same actor (director). Here a semantic pattern is defined as a specific piece of knowledge expressed by the corresponding path. Although several researchers [35,30] attempt to capture these graph semantics of composite relations between two objects by meta-paths, existing work relies on given heterogeneous information such as different types of objects and distinct object connections. However, in the real world, quite a lot of graph-structured data do not have such explicit characteristics. As shown in Figure 1, in a scholar cooperation network, there are usually no explicit node (relation) types, and all nodes are connected through the same relation, i.e., "Co-author". Fortunately, behind the same relation there are various implicit factors which may express different connecting reasons, such as "Classmate" and "Colleague" for the same relation "Co-author". These factors can further compose diverse semantic-paths (e.g. "Student-Advisor-Student" and "Advisor-Student-Advisor"), which reveal sophisticated semantic associations and help to generate more informative representations. Then, how to automatically exploit comprehensive semantic patterns based on the implicit factors behind a general graph is a non-trivial problem.

In general, there are several challenges in solving this problem. Firstly, it is essential to adaptively infer the latent factors behind graphs. We notice that several researchers have begun to explore the desired latent factors behind a graph via disentangled representations [20,18]. However, they mainly focus on inferring the latent factors by disentangled representation learning, while failing to discriminatively model the independent implicit factors behind the same connections. Secondly, after discovering the latent factors, how to select the most meaningful semantics and aggregate the diverse semantic information remains largely unexplored. Last but not least, it is quite difficult to further exploit the implicit semantic patterns while remaining capable of inductive learning.

To address the above challenges, in this paper we propose novel Semantic Graph Convolutional Networks (SGCN), which shed light on the exploration of implicit semantics in the node aggregating process. Specifically, we first propose a latent factor routing method with the DisenConv layer [20] to adaptively infer the probability of each latent factor that may have caused the link from a given node to one of its neighbors. Then, to further explore the diverse semantic information, we transfer the probability between every two connected nodes to the corresponding semantic adjacent matrix, which can represent the semantic-paths in a graph. Afterwards, most semantic strengthening methods, such as a semantic-level attention module, can be easily integrated into our model to aggregate the diverse semantic information from these semantic-paths.
Finally, to encourage the independence of the implicit semantic factors and to conduct inductive learning, we design an effective joint loss function that maintains independent mapping channels for different factors. This loss function is able to focus on different semantic characteristics during the training process. Specifically, the contributions of this paper can be summarized as follows:

– We first break the heterogeneous restriction of semantic representations with an end-to-end framework. It automatically infers the independent factor behind the formation of each edge and explores the semantic associations of latent factors behind a graph.
– We propose novel Semantic Graph Convolutional Networks (SGCN) to learn node representations by aggregating the implicit semantics from graph-structured data.
– We conduct extensive experiments on various real-world graph datasets to evaluate the performance of the proposed model. The results show the superiority of our proposed model in comparison with many powerful models.
2 Related Work

Graph neural networks (GNNs) [10,26], especially graph convolutional networks [13], have been proven successful in modeling structured graph data due to their theoretical elegance [5]. They have made new breakthroughs in various tasks, such as node classification [15] and graph classification [6]. In the early days, graph spectral theory [13] was used to derive a graph convolutional layer. Then, polynomial spectral filters [6] greatly reduced the computational cost. And Kipf and Welling [15] proposed the usage of a linear filter for further simplification. Along with spectral graph convolution, directly performing graph convolution in the spatial domain was also investigated by many researchers [8,12]. Among them, graph attention networks [34] have aroused considerable research interest, since they adaptively assign weights to the neighbors of a node by an attention mechanism [1,37].

For semantic learning research, there have been studies exploring a kind of semantic-path called the meta-path in heterogeneous graph embedding to preserve structural information. ESim [28] learned node representations by searching a user-defined embedding space. Based on random walks, metapath2vec [7] utilized skip-gram to perform a semantic-path. HERec [29] proposed a type-constraint strategy to filter node sequences and captured the complex semantics reflected
in heterogeneous graphs. Then, Fan et al. [9] suggested a metagraph2vec model for malware detection, where both structures and semantics are preserved. Sun et al. [30] proposed meta-graph-based network embedding models, which simultaneously consider the hidden relations of all meta information of a meta-graph. Meanwhile, there are other influential semantic learning approaches; for instance, many models [4,17,25] have been applied to various fields because of their latent semantic analysis ability.

In heterogeneous graphs, two objects can be connected via different semantic-paths, which are called meta-paths. This concept depends on the characteristic that such a graph structure has different types of nodes and relations. One meta-path $\Phi$ is defined as a path of the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$ (abbreviated as $A_1 A_2 \cdots A_{l+1}$); it describes a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$, where $\circ$ denotes the composition operator on relations. Actually, in a homogeneous graph, the relationships between nodes are also generated for different reasons (latent factors), so we can implicitly construct various types of relationships to extract various semantic-paths corresponding to different semantic patterns, and thereby improve the performance of the GCN model from the perspective of semantic discovery.

3 Semantic Graph Convolutional Networks

In this section, we introduce the Semantic Graph Convolutional Networks (SGCN). We first present the notations, then describe the overall network progressively.
3.1 Notations

We focus primarily on undirected graphs, and it is straightforward to extend our approach to directed graphs. We define $G = (V, E)$ as a graph, comprised of the node set $V$ and edge set $E$, where $|V| = N$ denotes the number of nodes. Each node $u \in V$ has a feature vector $x_u \in \mathbb{R}^{d_{in}}$. We use $(u, v) \in E$ to indicate that there is an edge between node $u$ and node $v$. Most graph convolutional networks can be regarded as an aggregation function $f(\cdot)$ that outputs the representations of nodes when given the features of each node and its neighbors:

$$\mathbf{y} = f\left(\{x_u, x_v : (u, v) \in E \mid u \in V\}\right),$$

where the output $\mathbf{y} \in \mathbb{R}^{N \times d_{out}}$ denotes the representations of nodes. This means that the neighborhood of a node contains rich information, which can be aggregated to describe the node more comprehensively. Different from previous studies [15,12,34], our proposed $f(\cdot)$ automatically learns semantic-paths from graph data to explore the corresponding semantic patterns.

3.2 Latent Factor Routing

Here we introduce the disentangled algorithm that calculates the latent factors between every two objects. We assume that each node is composed of $K$ independent components; hence there are $K$ latent factors to be disentangled. For a node $u \in V$, the hidden representation of $u$ is $h_u = [e_{u,1}, e_{u,2}, \ldots, e_{u,K}] \in \mathbb{R}^{K \times \frac{d_{out}}{K}}$, where $e_{u,k} \in \mathbb{R}^{\frac{d_{out}}{K}}$ ($k = 1, 2, \ldots, K$) denotes the aspect of node $u$ pertinent to the $k$-th disentangled factor. In the initial stage, we project the feature vector $x_u$ into $K$ different subspaces:

$$z_{u,k} = \frac{\sigma(W_k^\top x_u + b_k)}{\left\|\sigma(W_k^\top x_u + b_k)\right\|_2}, \qquad (1)$$

where $W_k \in \mathbb{R}^{d_{in} \times \frac{d_{out}}{K}}$ and $b_k \in \mathbb{R}^{\frac{d_{out}}{K}}$ are the mapping parameters and bias of the $k$-th subspace, and the nonlinear activation function $\sigma$ is ReLU [23]. To capture aspect $k$ of node $u$ comprehensively, we construct $e_{u,k}$ from both $z_{u,k}$ and $\{z_{v,k} : (u, v) \in E\}$, which can be utilized to identify the latent factors. Here we learn the probability of each factor by leveraging the neighborhood routing mechanism [20,18], i.e., a DisenConv layer:

$$e_{u,k}^t = \frac{z_{u,k} + \sum_{v:(u,v) \in E} p_{u,v}^{k,t-1} z_{v,k}}{\left\|z_{u,k} + \sum_{v:(u,v) \in E} p_{u,v}^{k,t-1} z_{v,k}\right\|_2}, \qquad (2)$$

$$p_{u,v}^{k,t} = \frac{\exp\left(z_{v,k}^\top e_{u,k}^t\right)}{\sum_{k'=1}^{K} \exp\left(z_{v,k'}^\top e_{u,k'}^t\right)}, \qquad (3)$$

where iteration $t = 1, 2, \ldots, T$, and $p_{u,v}^k$ indicates the probability that factor $k$ is the reason why node $u$ reaches neighbor $v$, satisfying $p_{u,v}^k \geq 0$ and $\sum_{k=1}^{K} p_{u,v}^k = 1$. The neighborhood routing mechanism iteratively infers $p_{u,v}^k$ and constructs $e_{u,k}$. Note that there are in total $L$ DisenConv layers, and $z_{u,k}$ is assigned the value of $e_{u,k}^T$ at the end of each layer $l \leq L - 1$; more detail can be found in Algorithm 1.
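To make the routing concrete, here is a minimal PyTorch sketch of Eqs. (1)-(3); the function name, tensor layout, and edge-list format are our own illustrative assumptions rather than the authors' released implementation (see the repository linked above for that).

```python
import torch
import torch.nn.functional as F

def latent_factor_routing(x, edge_index, W, b, T=6):
    """Neighborhood routing over K latent factors (sketch of Eqs. 1-3).

    x:          [N, d_in] node features
    edge_index: [2, E] edge list holding both directions of each undirected edge
    W:          [K, d_in, d_out // K] per-subspace projections (Eq. 1)
    b:          [K, d_out // K] per-subspace biases
    Returns:    [N, K, d_out // K] disentangled components e_{u,k}.
    """
    src, dst = edge_index                                  # u -> neighbor v
    # Eq. (1): project into K subspaces, then l2-normalize each component.
    z = F.relu(torch.einsum('nd,kdh->nkh', x, W) + b)
    z = F.normalize(z, dim=-1)
    e = z.clone()                                          # e^{t=1} initialized to z
    for _ in range(T):
        # Eq. (3): p_{u,v}^k is proportional to exp(z_{v,k}^T e_{u,k}),
        # normalized over the K factors.
        p = torch.softmax((z[dst] * e[src]).sum(-1), dim=-1)   # [E, K]
        # Eq. (2): add probability-weighted neighbor components, re-normalize.
        agg = torch.zeros_like(z)
        agg.index_add_(0, src, p.unsqueeze(-1) * z[dst])
        e = F.normalize(z + agg, dim=-1)
    return e
```

Stacking $L$ such layers, with $z_{u,k}$ reassigned to the layer output and dropout applied between layers, gives the DisenConv stack used in Algorithm 1.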
3.3 Discriminative Semantic Aggregation

For data where the various relation types between nodes and their corresponding neighbors are explicit and fixed, it is easy to construct multiple sub-semantic graphs as the input for multiple GCN models. As shown in Figure 2(a), a heterogeneous graph $G$ contains two different types of meta-paths (meta-path 1, meta-path 2). Then $G$ can be decomposed into multiple graphs $\tilde{G}$ consisting of the single-semantic graphs $G_1$ and $G_2$, where $u$ and its neighbors are connected by path-relation 1 (2) for each node $u$ in $G_1$ ($G_2$). However, we cannot simply transfer this pre-constructed multi-graph method to all network architectures. In detail, for a graph with no different types of edges, we have to judge the implicit connecting factors of these edges to find semantic-paths, and the probability of each latent factor is calculated in the iteratively running process mentioned in the last section. To solve this dilemma, we propose a novel algorithm that automatically represents semantic-paths while the model is running. After the latent factor routing process, we get the soft probability matrix of node latent factors $p \in \mathbb{R}^{N \times N \times K}$, where $0 \leq p_{i,j}^k \leq 1$ denotes the probability that
Fig. 2.
A previous meta-path representation on a heterogeneous graph (a) and our discriminative semantic aggregation method (b).

node $i$ connects to $j$ because of the factor $k$. In our model, the latent factor should identify the certain connecting cause of each connected node pair. Here we transfer the probability matrix $p$ to a semantic adjacent matrix $A$, so each element in $A$ only takes a binary value (0 or 1). In detail, for every node pair $i$ and $j$, $A_{i,j}^k = 1$ if $p_{i,j}^k$ is the biggest value in $p_{i,j}$. As shown in Figure 2(b), each node is represented by $K$ components. In this graph, every node may connect with others by one relationship from $K$ types, e.g., the relationship between node $u$ and $o$ is $R_2$ (denoted by $A_{u,o}^2 = 1$). For node $u$, we can find that it has two semantic-path-based neighbors $l$ and $v$. And the semantic-paths of $(u, l)$ and $(u, v)$ are of two different types, composed by $\Phi_{u,o,l} = (A_{u,o}, A_{o,l}) = R_2 \circ R_2$ and $\Phi_{u,o,v} = (A_{u,o}, A_{o,v}) = R_2 \circ R_1$ respectively. We define the adjacent matrix $B$ for virtual semantic-path-based edges:

$$B_{u,v} = \sum_{[(u,o),(o,v)] \in E} A_{u,o}^\top A_{o,v}, \quad \{u, v\} \subset V, \qquad (4)$$

where $A_{u,o} \in \mathbb{R}^{1 \times K}$, $A_{o,v} \in \mathbb{R}^{1 \times K}$, and $B_{u,v} \in \mathbb{R}^{K \times K}$. For instance, in Figure 2(b), $A_{u,o} = [0, 1]$, $A_{o,v} = [1, 0]$, and $A_{o,l} = [0, 1]$, so the semantic-paths of $u$ can be expressed as $B_{u,l}^{2,2} = 1$ and $B_{u,v}^{2,1} = 1$. In the semantic information aggregation process, we aggregate the latent vectors connected by the corresponding semantic-paths as:

$$h_u = [e_{u,1}, e_{u,2}, \ldots, e_{u,K}] \in \mathbb{R}^{K \times \frac{d_{out}}{K}}, \quad \tilde{h}_v = [z_{v,1}, z_{v,2}, \ldots, z_{v,K}] \in \mathbb{R}^{K \times \frac{d_{out}}{K}},$$
$$y_u = h_u + \operatorname{MeanPooling}_{v \in V,\, v \neq u}\left(B_{u,v}\, \tilde{h}_v\right), \quad u \in V, \qquad (5)$$

where we use MeanPooling instead of a $\sum_{v \in V}$ operator to avoid large values, and $h_u, \tilde{h}_v \in \mathbb{R}^{K \times \frac{d_{out}}{K}}$ are both returned from the last DisenConv layer, at which point the factor probabilities are stable since the representation of each node considers the influence of its neighbors. According to Eq. (5), the aggregation of the two latent representations (end points) of one certain semantic-path denotes the mining result of this semantic relation, e.g., $\operatorname{Pooling}(e_{u,2}, z_{v,1})$ and $\operatorname{Pooling}(e_{u,2}, z_{l,2})$ express two different kinds of semantic pattern representations in Figure 2(b), $R_2 \circ R_1$ and $R_2 \circ R_2$ respectively. And, for all types of semantic-paths starting from node $u$, the weight of each type depends on its frequency. Note that although the semantic adjacent matrix $A$ neglects some low-probability factors, our semantic-paths are integrated with the node states of DisenGCN, so we do not lose the crucial information captured by the basic GCN model. The advantage of this aggregation method is that our model can distinguish different semantic relations without adding extra parameters, instead of designing a separate graph convolution network for each kind of semantic-path. That is to say, the model does not increase the risk of overfitting after the graph semantic-path learning. Here we only consider 2-order paths in our model; however, it can be directly extended to longer path mining.
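The hard assignment and path counting of Eqs. (4)-(5) can be sketched as follows; the dense $[N, N, K]$ layout is for exposition only (the neighbor-capped variant discussed later in Section 3.6 is what makes this tractable), and the shapes and names are our assumptions.

```python
import torch

def semantic_path_aggregation(p, e, z):
    """Dense sketch of the discriminative semantic aggregation (Eqs. 4-5).

    p: [N, N, K] soft factor probabilities (all-zero where no edge exists)
    e: [N, K, h] components e_u from the last DisenConv layer
    z: [N, K, h] projected components z_v of path end points
    """
    N, _, K = p.shape
    # Hard semantic adjacency A: keep only the most likely factor per edge.
    edge_mask = (p.sum(-1) > 0).unsqueeze(-1)                 # [N, N, 1]
    A = torch.zeros_like(p)
    A.scatter_(-1, p.argmax(-1, keepdim=True), 1.0)
    A = A * edge_mask
    # Eq. (4): B[u, v] = sum_o A[u, o]^T A[o, v], a K x K count of u-o-v path types.
    B = torch.einsum('uok,ovq->uvkq', A, A)
    B = B * (1.0 - torch.eye(N, device=p.device)).view(N, N, 1, 1)  # exclude v == u
    # Eq. (5): mean-pool B_{u,v} h~_v over path-reachable neighbors v.
    msg = torch.einsum('uvkq,vqh->ukh', B, z)
    n_reachable = (B.sum((-1, -2)) > 0).sum(-1).clamp(min=1)        # [N]
    return e + msg / n_reachable.view(N, 1, 1)
```

Averaging over the number of path-reachable neighbors is one reading of the MeanPooling operator; the released code may normalize differently.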
3.4 Independence Learning of Latent Factors

In fact, one type of edge in a meta-path is meant to denote one unique meaning, so the $K$ latent factors in our work should not overlap. The assumption behind using latent factors to construct semantic-paths is that the different factors extracted by the latent factor routing module can focus on different connecting causes. In other words, we should encourage the representations of different factors to be sufficiently independent. Before the probability calculation, the focused viewpoints of the $K$ subspaces in Eq. (1) should remain different on our features. Our solution requires that the distances between independent factor representations $z_{i,k}$, $k \leq K$, be sufficiently large when they are projected into one subspace. First, we project the input values $z$ from Eq. (1) into a unified space to get vectors $Q$ and $K$ as follows:

$$Q = z w_q, \quad K = z w_k, \qquad (6)$$

where $w_q, w_k \in \mathbb{R}^{\frac{d_{out}}{K} \times \frac{d_{out}}{K}}$ are the projection parameter matrices. Then, the independence loss based on the distances between unequal factor representations is calculated as follows:

$$\mathcal{L}_i = \frac{1}{M} \sum \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_{out}/K}}\right) \odot (1 - I), \qquad (7)$$

where $I \in \mathbb{R}^{K \times K}$ denotes an identity matrix, $\odot$ is the element-wise product, and $M = K^2 - K$. Specifically, we learn a lesson from [33] in scaling the dot products by $1/\sqrt{d_{out}/K}$, to counteract the vanishing-gradient effect for large values. As long as $\mathcal{L}_i$ is minimized during training, the distances between different factors tend to become larger; that is, the $K$ subspaces capture sufficiently different information to encourage independence among the learned latent factors.
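A sketch of the independence loss of Eqs. (6)-(7) follows; averaging over nodes and the exact parameter shapes are our assumptions.

```python
import torch

def independence_loss(z, w_q, w_k):
    """Sketch of the independence loss (Eqs. 6-7).

    z:        [N, K, h] subspace features from Eq. (1), with h = d_out // K
    w_q, w_k: [h, h] projections into the unified space (Eq. 6)
    """
    N, K, h = z.shape
    Q, Kmat = z @ w_q, z @ w_k                       # [N, K, h] each
    # Scaled dot-product affinities between every pair of factor subspaces.
    att = torch.softmax(Q @ Kmat.transpose(-1, -2) / h ** 0.5, dim=-1)  # [N, K, K]
    off_diag = 1.0 - torch.eye(K, device=z.device)   # (1 - I): unequal factor pairs
    M = K * K - K
    # Probability mass on off-diagonal pairs; minimizing it pushes subspaces apart.
    return (att * off_diag).sum() / (M * N)          # mean over nodes (our choice)
```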
Next, we analyze the validity of this optimization. Latent factor routing utilizes the disentangled algorithm to calculate the latent factors between every two objects. However, this approach is a variant of the von Mises-Fisher (vMF) [2] mixture model, and such an EM algorithm cannot optimize the independence of latent factors within its iterative process. Random initialization of the mapping parameters is likewise unable to guarantee that the subspaces attend to different concerns. For this shortcoming, we give the following assumption:

Assumption 1. The features in different subspaces remain sufficiently independent when the margins of their projections in the unified space are sufficiently distinct.
This assumption is inspired by the Latent Semantic Analysis (LSA) algorithm [16], which projects multi-dimensional features of a vector space model into a semantic space with fewer dimensions while keeping the semantic features of the original space in a statistical sense. So, our optimization approach is as follows:

$$
\begin{aligned}
w &= \arg\min_w \sum \operatorname{softmax}(QK^\top) \odot (1 - I) \\
  &= \arg\min_w \sum_{u \in V} \operatorname{softmax}\left((z_u w)(z_u w)^\top\right) \odot (1 - I) \\
  &= \arg\min_w \sum_{u \in V} \frac{\sum_{k_1 \neq k_2} \exp(z_{u,k_1} w \cdot z_{u,k_2} w)}{\sum_{k_1, k_2} \exp(z_{u,k_1} w \cdot z_{u,k_2} w)} \\
  &= \arg\max_w \sum_{u \in V} \sum_{k_1 \neq k_2} \operatorname{distance}(z_{u,k_1} w,\, z_{u,k_2} w), \\
&\text{s.t.} \quad 1 \leq k_1 \leq K, \; 1 \leq k_2 \leq K. \qquad (8)
\end{aligned}
$$

In the above equation, $w$ denotes the training parameter to be optimized. We ignore the $1/M$ and $1/\sqrt{d_{out}/K}$ terms in Eq. (7), because they do not affect the optimization procedure. With the increase of the inter-distances of the $K$ subspaces, the intra-variance of the factors in each subspace does not grow beyond the original level (as with random initialization). The InterVar/IntraVar ratio thus becomes larger; in other words, we obtain sufficient independence of the mapping subspaces.

3.5 The Overall Algorithm of SGCN

In this section, we describe the overall algorithm of SGCN for performing node-related tasks. For a graph $G$, the ground-truth label of node $u$ is $y_u^\dagger \in \{0, 1\}^C$, where $C$ is the number of classes. The details of our algorithm are shown in Algorithm 1. First, we calculate the independence loss $\mathcal{L}_i$ after the factor channels capture features. Then, $L$ layers of DisenConv operations return the stable probability matrix $p$. After that, the automatic graph semantic-path representation $y$ is learned based on $p$. To apply $y$ to different tasks, we design the final layer as a fully-connected layer $y' = W_y y + b_y$, where $W_y \in \mathbb{R}^{d_{out} \times C}$ and $b_y \in \mathbb{R}^C$. For instance, for the semi-supervised node classification task, we implement

$$\mathcal{L}_s = -\sum_{u \in V_L} \frac{1}{C} \sum_{c=1}^{C} y_u^\dagger(c) \ln(\hat{y}_u(c)) + \lambda \mathcal{L}_i \qquad (9)$$
Algorithm 1: Semantic Graph Convolutional Networks
Input: the feature vector matrix $x \in \mathbb{R}^{N \times d_{in}}$, the graph $G = (V, E)$, the number of iterations $T$, and the number of disentangle layers $L$.
Output: the representation $y_u \in \mathbb{R}^{d_{out}}$ of each node $u \in V$, and $\mathcal{L}_i$.
1: for $i \in V$ do
2:   for $k = 1, 2, \ldots, K$ do
3:     $z_{i,k} \leftarrow \sigma(W_k^\top x_i + b_k) / \|\sigma(W_k^\top x_i + b_k)\|_2$
4: $Q \leftarrow z w_q$, $K \leftarrow z w_k$
5: $\mathcal{L}_i \leftarrow \frac{1}{M} \sum \operatorname{softmax}(QK^\top / \sqrt{d_{out}/K}) \odot (1 - I)$
6: for disentangle layer $l = 1, 2, \ldots, L$ do
7:   $e_{u,k}^{t=1} \leftarrow z_{u,k}$, $\forall k = 1, \ldots, K$, $\forall u \in V$
8:   for routing iteration $t = 1, 2, \ldots, T$ do
9:     get the soft probability matrix $p$ by computing $p_{u,v}^{k,t}$ via Eq. (3)
10:    update the latent representations $e_{u,k}^t$, $\forall u \in V$, via Eq. (2)
11:  $e_u \leftarrow \operatorname{dropout}(\operatorname{ReLU}(e_u))$, $z_{u,k} \leftarrow e_{u,k}^{t=T}$, $\forall k$, $\forall u \in V$  ▷ when $l \leq L - 1$
12: transfer $p$ to the hard probability matrix $A$
13: $B_{u,v} \leftarrow \sum_{[(u,o),(o,v)] \in E} A_{u,o}^\top A_{o,v}$, $\{u, v\} \subset V$
14: get each aggregation $y_u$ of the latent vectors on semantic-paths via Eq. (5)
15: return $\{y_u, \forall u \in V\}$, $\mathcal{L}_i$

as the loss function, where $\hat{y}_u = \operatorname{softmax}(y'_u)$, $V_L$ is the set of labeled nodes, and $\mathcal{L}_i$ is trained jointly by summing it with the task loss. For the multi-label classification task, since the label $y_u^\dagger$ consists of more than one positive bit, we define the multi-label loss function for node $u$ as:

$$\mathcal{L}_m = -\frac{1}{C} \sum_{c=1}^{C} \left[ y_u^\dagger(c) \cdot \operatorname{sigmoid}(y'_u(c)) + (1 - y_u^\dagger(c)) \cdot \operatorname{sigmoid}(-y'_u(c)) \right] + \lambda \mathcal{L}_i. \qquad (10)$$

Moreover, for the node clustering task, $y'$ denotes the input feature of K-Means.

3.6 Complexity Optimization

We should notice a problem in Section 3.3: the time complexity of Eq. (4)-(5) by dense matrix calculation is $O(N(N-1)(N-2)K^2 + N((N-1)K^2 \frac{d_{out}}{K} + 2K^2 \frac{d_{out}}{K})) \approx O(N^3K^2 + N^2 K d_{out})$. Such a complexity brings a heavy computing load, so we optimize the algorithm in the actual implementation. In real-world datasets, one node connects to far fewer neighbors than the total number of nodes in the graph. Therefore, when we create the semantic-path-based adjacent matrix, we define a matrix $\tilde{A} \in \mathbb{R}^{N \times C \times K}$ to denote 1-order neighbor relationships, where $C$ is the maximum number of neighbors that we keep and $\tilde{A}_u^k$ holds the id of a neighbor if it is connected to $u$ by $R_k$, else $\tilde{A}_u^k = 0$. Then the semantic-path relations of type $(R_{k_1}, R_{k_2})$ of $u \in V$ are denoted by $\tilde{B}_u^{k_1,k_2} = \tilde{A}[\tilde{A}[u, :, k_1], :, k_2] \in \mathbb{R}^{C \times C}$, and the pooling of this semantic pattern is the mean pooling of $z[\tilde{B}_u^{k_1,k_2}, k_2, :]$. According to the analysis above, the time complexity is reduced to $O(K^2(NC^2 + NC^2 \frac{d_{out}}{K}))$, which is linear in $N$.
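The neighbor-capped construction of $\tilde{A}$ and $\tilde{B}$ can be sketched as follows; because the text reserves 0 as the null marker while 0 is also a valid node id, the sketch shifts stored ids by one, which is our workaround.

```python
import numpy as np

def build_capped_neighbors(edges, factor, N, K, C):
    """Build A~ of shape N x C x K: up to C neighbor ids per (node, factor).

    edges: iterable of (u, v) pairs; factor[(u, v)] = k, the hard factor of (u, v).
    """
    A_tilde = np.zeros((N, C, K), dtype=np.int64)    # 0 marks an empty slot
    fill = np.zeros((N, K), dtype=np.int64)
    for (u, v) in edges:
        k = factor[(u, v)]
        if fill[u, k] < C:
            A_tilde[u, fill[u, k], k] = v + 1        # +1: reserve 0 for "no neighbor"
            fill[u, k] += 1
    return A_tilde

def two_hop_paths(A_tilde, u, k1, k2):
    """Nodes reached from u via an R_{k1} edge then an R_{k2} edge (B~ sketch)."""
    hop1 = A_tilde[u, :, k1]
    hop1 = hop1[hop1 > 0] - 1                        # up to C first-hop neighbors
    hop2 = A_tilde[hop1, :, k2].reshape(-1)          # up to C*C second-hop slots
    return hop2[hop2 > 0] - 1
```

Mean-pooling $z[\cdot, k_2, :]$ over the ids returned by `two_hop_paths` then yields the contribution of the $(R_{k_1}, R_{k_2})$ semantic pattern to $y_u$.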
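Finally, the task losses of Eqs. (9)-(10) above can be sketched as below; the function names are ours, and the more common log-sigmoid (binary cross-entropy) variant of Eq. (10) would replace the raw sigmoid terms with their logarithms.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits, labels, labeled_mask, loss_indep, lam=0.1):
    """Eq. (9): cross-entropy over labeled nodes V_L plus the independence term.

    logits: [N, C] outputs y' of the final layer; labels: [N] class ids.
    (F.cross_entropy omits the constant 1/C scaling of Eq. 9.)
    """
    ce = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    return ce + lam * loss_indep

def multi_label_loss(logits, label_bits, loss_indep, lam=0.1):
    """Eq. (10) read literally: reward sigmoid mass on the correct bits."""
    s = torch.sigmoid(logits)                        # sigmoid(-x) = 1 - sigmoid(x)
    per_bit = label_bits * s + (1.0 - label_bits) * (1.0 - s)
    return -per_bit.mean(dim=-1).sum() + lam * loss_indep
```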
Table 1. The statistics of datasets.
Dataset      Type                Nodes   Edges    Classes  Features  Multi-label
Pubmed       Citation Network    19,717  44,338   3        500       False
Citeseer     Citation Network    3,327   4,732    6        3,703     False
Cora         Citation Network    2,708   5,429    7        1,433     False
Blogcatalog  Social Network      10,312  333,983  39       -         True
POS          Word Co-occurrence  4,777   184,812  40       -         True
4 Experiments

In this section, we empirically assess the efficacy of SGCN on several node-related tasks, including semi-supervised node classification, node clustering and multi-label node classification. We then provide a node visualization analysis and semantic-path sampling experiments to verify the validity of our idea.
Datasets.

We conduct our experiments on 5 real-world datasets: Citeseer, Cora, Pubmed, POS and BlogCatalog [27,11,32], whose statistics are listed in Table 1. The first three citation networks are benchmark datasets for semi-supervised node classification and node clustering. In these graphs, the nodes, edges, and labels represent articles, citations, and research areas, respectively, and the node features correspond to a bag-of-words representation of a document. POS and BlogCatalog are suitable for the multi-label node classification task; their labels are part-of-speech tags and user interests, respectively. In detail, BlogCatalog is a social network of bloggers who post blogs on the BlogCatalog website, with labels representing the bloggers' interests inferred from the text information they provide. POS (Part-of-Speech) is a co-occurrence network of words appearing in the first million bytes of the Wikipedia dump, with labels denoting the part-of-speech tags inferred by the Stanford POS-Tagger. Since these two graphs do not provide node features, we use the rows of their adjacency matrices in place of node features.
Baselines.
To demonstrate the advantages of our model, we compare SGCN with representative graph neural networks, including the graph convolution network (GCN) [15] and the graph attention network (GAT) [34]. In detail, GCN [15] is a simplified spectral method of node aggregation, while GAT weights a node's neighbors by an attention mechanism. GAT achieves state of the art in many tasks, but it contains far more parameters than GCN and our model. Besides, ChebNet [6] is a spectral graph convolutional network based on a Chebyshev expansion of the graph Laplacian; MoNet [22] extends CNN architectures by learning local, stationary, and compositional task-specific features; and IPGDN [18] is the advanced version of DisenGCN. We also implement other non-graph-convolution methods, including the random-walk-based network embedding DeepWalk [24], the link-based classification method ICA [19], the inductive embedding approach Planetoid [38], the label propagation approach LP [39], and the semi-supervised embedding learning model SemiEmb [36].

In addition, we conduct ablation experiments on node classification and clustering to verify the effectiveness of the main components of SGCN: SGCN-path is our complete model without the independence loss, and SGCN-indep denotes SGCN without the semantic-path representations.

In the multi-label classification experiment, the original implementations of GCN and GAT do not support multi-label tasks; we therefore modify them to use the same multi-label loss function as ours for a fair comparison. We additionally include three node embedding algorithms, DeepWalk [24], LINE [31], and node2vec [11], because they have been demonstrated to perform strongly on multi-label classification. Besides, we remove IPGDN since it is not designed for multi-label tasks.
Implementation Details.
We train our models on one machine with 8 NVIDIA Tesla V100 GPUs. The settings of the common baselines follow [20,18], and we optimize the parameters of all models with Adam [14]. Besides, we tune the hyper-parameters of both our model and the baselines using hyperopt [3]. In detail, for semi-supervised classification and node clustering, we set the number of iterations T = 6 and tune the number of layers L, the number of components K (i.e., the number of mapping channels, so the dimension of one component in SGCN is d_out/K), the dropout rate and the trade-off λ over small grids, while the learning rate and the l2 regularization term are sampled log-uniformly. d_out is set to 128 to achieve better performance, and we likewise set the dimension of the node embeddings to 128 for the other node embedding algorithms. When tuning the hyper-parameters, we search over the number of components K in the latent factor routing process; K = 8 gives the best result in our experiments.
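For reference, hyper-parameter search with hyperopt [3] looks roughly like the sketch below; the search ranges and the `train_and_eval` routine are illustrative placeholders, not the grids actually used in the paper.

```python
import numpy as np
from hyperopt import fmin, hp, tpe

def train_and_eval(cfg):
    """Placeholder: train SGCN with cfg and return validation accuracy."""
    return 0.0  # stand-in so the sketch runs; plug real training in here

# Illustrative search space in the spirit of the protocol described above.
space = {
    'num_layers':   hp.choice('num_layers', [1, 2, 3, 4, 5]),
    'dropout':      hp.quniform('dropout', 0.05, 0.95, 0.05),
    'lambda_indep': hp.quniform('lambda_indep', 0.0, 1.0, 0.05),
    'lr':           hp.loguniform('lr', np.log(1e-4), np.log(1e-1)),
    'weight_decay': hp.loguniform('weight_decay', np.log(1e-6), np.log(1e-2)),
}

best = fmin(fn=lambda cfg: -train_and_eval(cfg),  # hyperopt minimizes
            space=space, algo=tpe.suggest, max_evals=100)
print(best)
```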
Table 2.
Semi-supervised classification.
Models       Cora   Citeseer  Pubmed
MLP          55.1   46.5      71.4
SemiEmb      59.0   59.6      71.1
LP           68.0   45.3      63.0
DeepWalk     67.2   43.2      65.3
ICA          75.1   69.1      73.9
Planetoid    75.7   64.7      77.2
ChebNet      81.2   69.8      74.4
GCN          81.5   70.3      79.0
MoNet        81.7   -         78.8
GAT          83.0   72.5      79.0
DisenGCN     83.7   73.4      80.5
IPGDN        84.1   74.0      81.2
SGCN-indep   84.2   73.7      82.0
SGCN-path    84.6
SGCN         85.4
Table 3.
Node clustering with two metrics (NMI and ARI).
Models       Cora         Citeseer     Pubmed
             NMI    ARI   NMI    ARI   NMI    ARI
SemiEmb      48.7   41.5  31.2   21.5  27.8   35.2
DeepWalk     50.3   40.8  30.5   20.6  29.6   36.6
Planetoid    52.0   40.5  41.2   22.1  32.5   33.9
ChebNet      49.8   42.4  42.6   41.5  35.6   38.6
GCN          51.7   48.9  42.8   42.8  35.0   40.9
GAT          57.0   54.1  43.1   43.6  35.0   41.4
DisenGCN     58.4   60.4  43.7   42.5  36.1   41.6
IPGDN        59.2   61.0  44.3   43.0  37.0   42.0
SGCN-indep   60.2   59.2  44.7   42.8  37.2   42.3
SGCN-path    60.5   60.7
SGCN
SGCN model is superior to both ablated variants on at least two of the three datasets. Moreover, we find that SGCN-indep and SGCN-path both perform better than previous algorithms to some degree, which reveals the effectiveness of our semantic-path mining module and the independence learning for subspaces.
Multi-label Node Classification.

In the multi-label classification experiment, every node is assigned one or more labels from a finite set L. We follow node2vec [11] and report the performance of each method while varying the number of labeled training nodes from 10%|V| to 90%|V|, where |V| is the total number of nodes; the remaining nodes are split equally to form a validation set and a test set (a scoring sketch follows below). With the best hyper-parameters found on the validation sets, we report the averaged performance of 30 runs on each multi-label test set. We summarize the results of multi-label node classification by Macro-F1 and Micro-F1 scores in Figure 3. First, it is clear that the proposed SGCN model achieves the best performance on both datasets. Compared with the DisenGCN model, SGCN combined with semantic-paths achieves its biggest improvement of 20.0% with 10% labeled nodes on the POS dataset. The reason may be that the relation type of the POS dataset is word co-occurrence, so there are many regular explicit or implicit semantics among the relationships between different words. On the other dataset, although SGCN does not show a full lead, it achieves the highest accuracy on both metrics. We find that the GCN-based algorithms are usually superior to the traditional node embedding algorithms in overall effect, although GCN produces poor Micro-F1 results on Blogcatalog. In addition, the SGCN algorithm makes both Macro-F1 and Micro-F1 achieve good results at the same time, without sacrificing either metric, because this approach does not ignore the information provided by classes with few samples but important semantic relationships.
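The scoring protocol can be sketched with scikit-learn as below; fitting a one-vs-rest logistic regression on frozen embeddings mirrors the node2vec-style evaluation and is our assumption for how the embedding baselines are scored (the GCN-based models predict labels directly).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

def multilabel_f1(emb, labels, train_frac, seed=0):
    """Macro/Micro-F1 for a given labeled fraction (protocol sketch).

    emb: [N, d] node embeddings; labels: [N, L] binary indicator matrix.
    The paper splits the held-out nodes further into validation and test sets;
    a single held-out split is used here for brevity.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        emb, labels, train_size=train_frac, random_state=seed)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return (f1_score(y_te, pred, average='macro'),
            f1_score(y_te, pred, average='micro'))
```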
Fig. 3. Results of multi-label node classification: (a) Macro-F1 on POS, (b) Macro-F1 on Blogcatalog, (c) Micro-F1 on POS, (d) Micro-F1 on Blogcatalog, with the percentage of labeled nodes varying from 10% to 90%.
Node Clustering.

To further evaluate the embeddings learned by the above algorithms, we also conduct a clustering task. Following [18], for our model and each baseline, we obtain the node embeddings via a feed-forward pass after the model is trained. Then we feed the node embeddings to the K-Means algorithm to cluster nodes. The ground truth is the same as in the node classification task, and the number of clusters is set to the number of classes. In detail, we employ two metrics, Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI), to validate the clustering results. Since the performance of K-Means is affected by the initial centroids, we repeat the process 20 times and report the average results in Table 3 (the protocol is sketched below). As can be seen in Table 3, SGCN consistently outperforms all baselines, and GNN-based algorithms usually achieve better performance. Besides, with the semantic-path representation, SGCN and SGCN-path perform significantly better than DisenGCN and IPGDN, and our proposed algorithm obtains the best results on both NMI and ARI. This shows that SGCN captures more meaningful node embeddings by learning semantic patterns from the graph.

Visualization and Semantic-Path Sampling.

We try to demonstrate the intuitive changes of node representations after incorporating semantic patterns. Therefore, we utilize t-SNE [21] to transform the feature representations (node embeddings) of SGCN and DisenGCN into a 2-dimensional space for a more intuitive visualization. Here we visualize the node embeddings of Cora (the change in representation visualization is similar on other datasets), where different colors denote different research areas.
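The clustering protocol referenced above (embeddings → K-Means → averaged NMI/ARI) can be sketched as follows; the function name is ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(emb, labels, n_classes, repeats=20):
    """Average NMI/ARI of K-Means over repeated runs (protocol sketch)."""
    nmi, ari = [], []
    for seed in range(repeats):  # K-Means depends on its initial centroids
        pred = KMeans(n_clusters=n_classes, n_init=10,
                      random_state=seed).fit_predict(emb)
        nmi.append(normalized_mutual_info_score(labels, pred))
        ari.append(adjusted_rand_score(labels, pred))
    return float(np.mean(nmi)), float(np.mean(ari))
```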
Fig. 4.
Node representation visualization of Cora.
Fig. 5.
Semantic-path sampling on Cora: classification accuracy (%) versus the cut size C.

According to Figure 4, the visualization of SGCN is more distinguishable than that of DisenGCN. It demonstrates that the embedding learned by SGCN presents high intra-class similarity and separates papers into different research areas with distinct boundaries. On the contrary, DisenGCN does not perform as well, since the inter-margins of the clusters are not distinguishable enough; in several clusters, many nodes belonging to different areas are mixed with others.

Then, to explore the influence of different scales of semantic-paths on model performance, we implement a semantic-path sampling experiment on Cora. As mentioned in Section 3.6, to capture different numbers of semantic-paths, we change the hyper-parameter cut size C to restrict the sampling size of each node's neighbors. As shown in Figure 5, the SGCN model with the path representation achieves higher performance than the first point (C = 0). Globally, the classification accuracy of SGCN improves steadily with the increase of C, reaching its highest score at C = 5. This means that a GCN model combined with semantic-paths of sufficient scale can indeed learn better node representations.

5 Conclusion

In this paper, we proposed a novel framework named Semantic Graph Convolutional Networks (SGCN), which incorporates semantic-paths automatically during the node aggregating process, thereby providing semantic learning ability to general graph algorithms. We conducted extensive experiments on various real-world datasets to evaluate the superior performance of our proposed model. Moreover, our method has good extensibility: all kinds of path-based algorithms in the graph embedding field can be directly applied in SGCN to adapt to different tasks, and we will explore this further in future work.
Acknowledgments

This research was partially supported by grants from the National Key Research and Development Program of China (No. 2018YFC0832101) and the National Natural Science Foundation of China (Nos. U20A20229 and 61922073). This research was also supported by Meituan-Dianping Group.
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S.: Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research 6(Sep), 1345–1382 (2005)
3. Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in Science Conference. pp. 13–20. Citeseer (2013)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003), http://jmlr.org/papers/v3/blei03a.html
5. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), 18–42 (2017)
6. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in Neural Information Processing Systems. pp. 3844–3852 (2016)
7. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: Scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 135–144 (2017)
8. Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A.: Convolutional networks on graphs for learning molecular fingerprints. In: Advances in Neural Information Processing Systems. pp. 2224–2232 (2015)
9. Fan, Y., Hou, S., Zhang, Y., Ye, Y., Abdulhayoglu, M.: Gotcha - sly malware! Scorpion: a metagraph2vec based malware detection system. In: Proceedings of the 24th ACM SIGKDD. pp. 253–262 (2018)
10. Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains. In: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks. vol. 2, pp. 729–734. IEEE (2005)
11. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD. pp. 855–864 (2016)
12. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: NIPS. pp. 1024–1034 (2017)
13. Henaff, M., Bruna, J., LeCun, Y.: Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
16. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2-3), 259–284 (1998)
17. Li, Z., Wu, B., Liu, Q., Wu, L., Zhao, H., Mei, T.: Learning the compositional visual coherence for complementary recommendations. In: IJCAI-20. pp. 3536–3543
18. Liu, Y., Wang, X., Wu, S., Xiao, Z.: Independence promoted graph disentangled networks. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)
19. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03). pp. 496–503 (2003)
20. Ma, J., Cui, P., Kuang, K., Wang, X., Zhu, W.: Disentangled graph convolutional networks. In: International Conference on Machine Learning. pp. 4212–4221 (2019)
21. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
22. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., Bronstein, M.M.: Geometric deep learning on graphs and manifolds using mixture model CNNs. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5115–5124 (2017)
23. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807–814 (2010)
24. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD. pp. 701–710 (2014)
25. Qiao, L., Zhao, H., Huang, X., Li, K., Chen, E.: A structure-enriched neural network for network embedding. Expert Systems with Applications, 300–311 (2019)
26. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Transactions on Neural Networks 20(1), 61–80 (2009)
27. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI Magazine 29(3), 93–106 (2008)
28. Shang, J., Qu, M., Liu, J., Kaplan, L.M., Han, J., Peng, J.: Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:1610.09769 (2016)
29. Shi, C., Hu, B., Zhao, W.X., Yu, P.S.: Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering 31(2), 357–370 (2018)
30. Sun, L., He, L., Huang, Z., Cao, B., Xia, C., Wei, X., Yu, P.S.: Joint embedding of meta-path and meta-graph for heterogeneous information networks. In: 2018 IEEE International Conference on Big Knowledge. pp. 131–138. IEEE (2018)
31. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: Large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web. pp. 1067–1077 (2015)
32. Tang, L., Liu, H.: Leveraging social media networks for classification. Data Mining and Knowledge Discovery 23(3), 447–478 (2011)