Self-Supervised Deep Graph Embedding with High-Order Information Fusion for Community Discovery
Shuliang Xu, Shenglan Liu, Member, IEEE, and Lin Feng
Abstract—Deep graph embedding is an important approach for community discovery. A deep graph neural network with a self-supervised mechanism can obtain the low-dimensional embedding vectors of nodes from unlabeled and unstructured graph data. The high-order information of a graph can provide more abundant structure information for the representation learning of nodes. However, most self-supervised graph neural networks only use the adjacency matrix as the input topology information of the graph and cannot obtain very high-order information, since the number of layers of a graph neural network is fairly limited. If there are too many layers, the phenomenon of over smoothing appears. Therefore how to obtain and fuse the high-order information of a graph with a shallow graph neural network is an important problem. In this paper, a deep graph embedding algorithm with a self-supervised mechanism for community discovery is proposed. The proposed algorithm uses the self-supervised mechanism and different high-order information of the graph to train multiple deep graph convolution neural networks. The outputs of the multiple graph convolution neural networks are fused to extract the representations of nodes, which include the attribute and structure information of a graph. In addition, data augmentation and negative sampling are introduced into the training process to facilitate the improvement of the embedding result. The proposed algorithm and the comparison algorithms are conducted on five experimental data sets. The experimental results show that the proposed algorithm outperforms the comparison algorithms on most of the experimental data sets. The experimental results demonstrate that the proposed algorithm is an effective algorithm for community discovery.
Index Terms—Graph embedding, graph convolution neural network, self-supervised learning, complex network, community discovery.
1 INTRODUCTION

Nowadays, graphs are more and more common in a wide range of applications, such as social network analysis [1], [2], citation network analysis [3], product recommendation systems [4], [5], knowledge graphs [6] and protein-protein interaction [7], etc. Different from traditional structured data, graph data is unstructured. The high dimensionality, lack of structure and sparsity of graph data bring great challenges for community discovery. Therefore it is significant to transform graph data from the high dimensional and sparse space into a low dimensional and dense subspace.

Graph embedding is to learn a linear or nonlinear map function that projects graph data into a low dimensional and dense vector space to facilitate downstream tasks such as product recommendation, community discovery, etc. [8]. However, conventional embedding algorithms, such as Locally Linear Embedding (LLE) [9], Laplacian Eigenmap (LE) [10] and Non-Negative Matrix Factorization (NMF) [11], only focus on the local structure of a graph and ignore the higher-order structure information and attribute information. Recently, with the advances in deep learning, the deep graph convolution neural network has become an effective approach for community discovery [12]; it can obtain a strong representation ability by stacking multiple hidden layers. A deep graph convolution neural network can automatically learn the low dimensional and dense vectors of nodes from unstructured graph data, and it can also combine the structure information and attribute information into the embedding result. Therefore there are many advantages for deep graph convolution neural networks dealing with graph data, and researchers have paid much attention to this field.

• S. Xu is with the Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, 116024 China. E-mail: slx [email protected]
• S. Liu and L. Feng are with the School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian, 116024 China. E-mail: [email protected]; [email protected]
Manuscript received XX XX, XXXX; revised XX XX, XXXX.

Graph embedding is a hot topic in the graph mining field. Up to now, researchers have proposed many graph embedding algorithms. These algorithms are mainly divided into three categories: matrix factorization, random walk and graph neural network.

Matrix factorization methods decompose the adjacency matrix of a graph or an interaction matrix related to the adjacency matrix. The representative algorithms are NetSMF [13], FONPE [14], AROPE [15], MNMF [16], NSP [17], etc. However, most matrix factorization algorithms have a high time complexity, although the decomposed result has a good interpretation. It means the scalability of the matrix factorization algorithms is not good. Therefore they are not suitable for community discovery tasks when the scale of the network is relatively large.

Random walk methods generate a random walk path sequence or a local subgraph as the context for each node and then use a neural language model, such as Word2Vec [18], to generate the low dimensional embedding vectors. The representative algorithms include DeepWalk [19],
LINE [20], struc2vec [21], metapath2vec [22], SEED [23], SPINE [24], MRF [25], etc. However, the path of a random walk depends on the topology information of a graph and cannot integrate the attribute information of nodes, which is common in attribute graphs. In addition, it is not possible to intervene in the walk path according to the characteristics of a graph, and random walk is biased towards nodes with large degree. Therefore the shortcomings of random walk place restrictions on its applications.

The graph neural network is an effective approach for graph embedding [26], [27] and it is proposed for dealing with unstructured graph data. A graph neural network can update the features of nodes from their neighbors. With the powerful representation ability of deep learning, the deep graph neural network achieves great success in graph mining although it may suffer from the over smoothing problem [28], [29], [30]. The representative graph neural network algorithms are GraphSAGE [31], DropEdge [32], GCKN [33], GMI [34] and SDCN [35], etc. However, how to efficiently train a graph neural network is still an open problem. Beyond that, the over smoothing problem restricts the depth of graph neural networks although DropEdge and Dropcluster [36] propose some measures to slow down over smoothing. Nevertheless, DropEdge and Dropcluster cannot eliminate over smoothing, and the performance of a graph neural network still degenerates after the number of layers becomes too large. Therefore most graph neural networks are shallow, and a shallow hidden layer restrains a graph neural network from obtaining higher level features of nodes.

The high-order structure information of a graph and the attribute information of nodes play important roles in graph embedding and can provide more information for community discovery. At present, due to the limitation of the number of hidden layers of a graph neural network, most graph neural networks cannot obtain very high level embedding features of nodes. Therefore a self-supervised deep graph embedding algorithm with high-order information fusion for community discovery (SDGE) is proposed in this paper. SDGE uses multiple graph convolution neural networks, which can integrate the attribute and structure information by convolution and aggregation operators, to learn the low dimensional embedding vectors. The deep graph convolution neural networks are trained from different high-order information and the attribute information of nodes by introducing a self-supervised learning mechanism. The final result is the fusion of the outputs of the multiple graph convolution neural networks. The main contributions of this paper are as follows:
• A deep graph embedding algorithm is proposed in this paper. Different high-order information is employed to train multiple graph convolution neural networks. The final result is the fusion of the outputs of the multiple graph convolution neural networks.
• A data augmentation approach and a negative sampling mechanism are introduced to improve the performance of the proposed algorithm. The graph convolution neural networks can be trained by the self-supervised learning mechanism.
• The proposed algorithm introduces spectral propagation to enhance the embedding result.
• The proposed algorithm can keep the structure and attribute similarity in the low dimensional embedding space.
• The experimental results demonstrate the effectiveness of the proposed algorithm.

The rest of this paper is organized as follows: Section 2 reviews some related works; Section 3 describes the theory and the detailed steps of the proposed algorithm; the experimental results and analysis are presented in Section 4; Section 5 concludes the paper and gives some future research directions.
2 RELATED WORKS
Graph embedding is an important topic in graph mining and it is closely related to community discovery. In the early days, most graph embedding algorithms came from matrix factorization and random walk. In recent years, the graph neural network has facilitated the development of this field and many deep graph embedding algorithms have been proposed.

Wang et al. propose a deep attentional embedding approach [37] called DAEGC. DAEGC is a development of the graph convolution neural network [26]. It utilizes an autoencoder with double hidden layers to learn the embedding vectors of nodes, and then uses the clustering result to adjust the parameters of the hidden layers by self-supervised learning. In the graph neural network, DAEGC considers the weights of nodes and introduces an attention mechanism to reconstruct the adjacency matrix and learn the target distribution.

Fan et al. propose a multi-view graph autoencoder for graph clustering [38] named One2Multi. One2Multi employs graph convolution neural networks as the hidden layers of the autoencoder and uses the adjacency matrices of multiple views to extract the features of nodes. Then the clustering result of the k-means algorithm is used to learn the parameters of the graph convolution neural network by self-supervised learning.

Wang et al. propose an adaptive multi-channel graph attentional convolutional network [39] called AM-GCN. AM-GCN firstly computes the similarity matrix of nodes from the attribute features of nodes by a cosine or kernel method and then constructs a feature graph by the KNN method. The feature graph and the original graph are input into graph convolution neural networks. AM-GCN considers that the feature graph and the original graph describe the same graph. Therefore there is common information between the structure features and the attribute features. AM-GCN designs three graph convolution neural networks to learn the low dimensional embedding vectors of the structure features, the attribute features and the common information, respectively. The final embedding vectors of nodes are the weighted fusion of the three groups of embedding vectors.

Chen et al. propose a robust node representation learning algorithm named CGNN [40]. CGNN introduces contrastive learning [41] to train a graph neural network in an unsupervised way. For each node, CGNN discards some neighborhood edges with a certain probability and can thus obtain two local subgraphs of each node. Then two low dimensional embedding vectors can be output after the two local subgraphs are input into the graph neural network. The two low dimensional embedding vectors of the same
node are seen as a positive sample pair since they describe the local structure of the same node. CGNN selects k different low dimensional embedding vectors of other nodes as negative samples. CGNN introduces noise contrastive estimation to train the graph neural network, therefore it has good robustness for the representation learning of nodes.

Qiu et al. propose a graph contrastive coding graph neural network called GCC [42]. GCC can be pre-trained on a large data set and then fine-tuned on a specific task. For each node, GCC samples one or multiple r-ego networks. The r-ego networks sampled from the same node are taken as the positive samples. The r-ego networks sampled from different nodes are taken as the negative samples. GCC considers that the positive sample pair should be as similar as possible and the negative samples should be far away from the positive samples. The parameters of the autoencoder, which is a graph neural network, are trained by contrastive learning. The low dimensional embedding vectors can be obtained by inputting the r-ego networks of nodes into the autoencoder.

3 THE PROPOSED ALGORITHM
For a graph $G = \langle V, A, E, X\rangle$, $V$ is the set of nodes in the graph, $A \in \mathbb{R}^{n\times n}$ is the adjacency matrix of $G$, $E$ is the edge set of $G$ and $X$ is the matrix of attribute features of the nodes. For $\forall v_i, v_j \in V$, $A_{ij} = 1$ if $e_{ij} \in E$, otherwise $A_{ij} = 0$. It is known that $A$ is high dimensional and sparse. Graph embedding is to learn a map function $f: V \mapsto \mathbb{R}^{n\times d}$ which projects each node into a $d$-dimension space ($d \ll n$) while the similarity between nodes is preserved.

It is common for the nodes of social networks to be unlabeled, and the high-order information of a graph can provide much semantic information for the community discovery task. Therefore a self-supervised deep graph embedding algorithm with high-order information fusion is proposed for community discovery. Fig. 1 is an illustration of the proposed algorithm (SDGE). SDGE firstly computes $r$ different high-order matrices of the adjacency matrix. It is known that different high-order matrices present different semantic information. Then there are $r$ graph convolution neural networks operating on the $r$ different high-order matrices, respectively. The outputs of the graph convolution neural networks are $r$ groups of features of the nodes. After obtaining the initial embedding vectors from the $r$ graph convolution neural networks, the final embedding vectors are the fusion of the initial embedding vectors. With the final embedding vectors of the nodes obtained, the community partition of the nodes can be assigned by a multilayer perceptron or a clustering algorithm. Considering the unlabeled nodes of the graph, a self-supervised learning approach is introduced to train the graph convolution neural networks. In the training process, a data augmentation approach and negative sampling [43] are employed to train the graph convolution neural networks effectively. In addition, SDGE can preserve the similarities of the structure information and the attribute information. The overview of SDGE is shown in Fig. 1.

Fig. 1. The overview of SDGE. It consists of six parts. (a): The original graph and the adjacency matrix are the initial inputs. (b): Compute $A^r$ with different $r$ values. The different multiplicative matrices represent random walks with different path lengths. (c): The different matrices $A^r$ are the inputs of the $r$ GCNs, respectively. (d): The outputs of the $r$ GCNs are fused into the embedding vectors of the nodes in the original graph. (e): For each sample, a new sample is generated by data augmentation, which is seen as a positive sample. Then SDGE samples several samples from the non-neighborhood nodes as the negative samples. The GCNs are trained by the self-supervised learning approach. (f): The community structure is output by the end-to-end approach or a clustering algorithm.

Let $A$ be the adjacency matrix of the graph $G$ and $r \in \mathbb{Z}^+$ be the order of the adjacency matrix. $A^r$ is defined as:
$$A^r = \underbrace{A \cdot A \cdots A}_{r} \quad (1)$$
The value of $r$ determines the global and local information of the graph $G$. A large $r$ means $A^r$ contains more global information. A small $r$ means $A^r$ contains more local information. In other words, $A^r$ is equivalent to the information of an $r$-step random walk in a graph. In order to make full use of local and global information, SDGE computes the $r$ high-order multiplicative matrices of $A$, which are $A^1, A^2, \cdots, A^r$. It uses $A^1, A^2, \cdots, A^r$ as the inputs of $r$ graph convolution neural networks (GCNs), respectively. Therefore the $r$ GCNs learn $r$ groups of low dimensional embedding vectors of the nodes from the $r$ groups of different high-order information of the graph $G$.

Let $A^r$ be the input of the $r$th GCN ($r = 1, 2, \cdots$) and $X$ be the attribute features of the nodes. The output of the $r$th GCN in the $(\ell+1)$th layer is as follows:
$$H^{\ell+1}_r = \delta\left(\widehat{D}_r^{-\frac{1}{2}}\, \widehat{A}_r\, \widehat{D}_r^{-\frac{1}{2}}\, H^{\ell}_r W^{\ell}_r\right) \quad (2)$$
where $H^{\ell}_r$ is the output of the $r$th GCN in the $\ell$th layer, which is also the input of the $(\ell+1)$th layer, and $H^0_r = X$; $\widehat{A}_r = A^r + I$ and $I$ is an identity matrix; $W^{\ell}_r$ is the parameter of the $\ell$th layer and $\widehat{D}_r$ is the degree matrix of $\widehat{A}_r$; $\delta(\cdot)$ is the activation function.
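To make Eqs.(1)-(2) concrete, the following minimal NumPy sketch computes the high-order matrices and one propagation step of a GCN. It is an illustrative sketch, not the SDGE implementation: the function names are assumptions and a plain ReLU stands in for the Dynamic ReLU activation discussed below.

```python
import numpy as np

def high_order_adjacency(A, r):
    """Return [A^1, A^2, ..., A^r] of Eq.(1) by repeated matrix multiplication."""
    powers, Ak = [], np.eye(A.shape[0])
    for _ in range(r):
        Ak = Ak @ A                      # A^k = A^{k-1} . A
        powers.append(Ak.copy())
    return powers

def gcn_layer(Ar, H, W, delta=lambda x: np.maximum(x, 0.0)):
    """One step of Eq.(2): H^{l+1} = delta(D^{-1/2} (A^r + I) D^{-1/2} H^l W^l)."""
    A_hat = Ar + np.eye(Ar.shape[0])     # \hat{A}_r = A^r + I (add self-loops)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return delta(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```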
In general, different GCNs use different activation functions. For SDGE, the Dynamic ReLU function is introduced; Chen et al. have proved that Dynamic ReLU can obtain better performance than ReLU [44]. In order to avoid over-fitting and improve the performance of SDGE, Batch Normalization [45] is introduced and the output of each hidden layer is normalized before it is input into the next hidden layer.

After inputting $A^1, A^2, \cdots, A^r$ into the $r$ GCNs, $r$ groups of outputs denoted as $H_1, H_2, \cdots, H_r$ can be obtained, where $H_r$ is the output of the $r$th GCN in the last hidden layer. The final output $H$ of the GCNs is fused from the outputs of the $r$ GCNs by an aggregation operator. The aggregation operator is defined as follows:
$$H = Aggregate\left(H_1, H_2, \cdots, H_r\right) \quad (3)$$
where $Aggregate(\cdot)$ is the aggregation function which can fuse the outputs of multiple GCNs. The most common aggregation functions include summation, mean, max and CONCAT, etc. In this paper, CONCAT or summation is selected as the aggregation function of SDGE because CONCAT and summation are injective [28]. It is known that $H_1$ contains more local information of the graph than $H_r$, and $H_r$ contains more global information of the graph than $H_1$. The local and global information of a graph play different roles in graph embedding. Therefore SDGE uses a weighting approach to fuse the outputs of the multiple GCNs. For each output of a GCN, a clustering algorithm can be used to discover the community structure. After the community structure of $H_i$ ($i = 1, 2, \cdots, r$) is obtained, the modularity $Q_i$ of $H_i$, which is a measure reflecting the strength of community structure [46], is computed as follows:
$$Q_i = \frac{1}{2|E|}\sum_{i',j}^{|V|}\left(A_{i'j} - \frac{k_{i'}\cdot k_j}{2|E|}\right)\cdot \sigma(v_{i'}, v_j) \quad (4)$$
where $|\cdot|$ is the cardinality of a set, $k_{i'}$ and $k_j$ are the degrees of the nodes $v_{i'}$ and $v_j$, respectively. $\sigma(v_{i'}, v_j) = 1$ if $v_{i'}$ and $v_j$ are in the same community; otherwise, $\sigma(v_{i'}, v_j) = 0$. The range of modularity is $[0, 1]$. A larger modularity means that the community structure is better, which also reflects that the output of the GCN is better. Therefore the weight $\alpha_i$ of $H_i$ is determined as follows:
$$\alpha_i = \frac{\exp(Q_i)}{\sum_{j=1}^{r}\exp(Q_j)} \quad (5)$$
If the aggregation function is summation or CONCAT, $H$ is determined as follows:
$$H = \sum_{i=1}^{r}\alpha_i H_i \quad \text{or} \quad H = \big\Vert_{i=1}^{r}\, \alpha_i H_i \quad (6)$$
where $\Vert$ is the concatenation operator which concatenates the output matrices of the GCNs. It is known that the attribute information of nodes is important information for the partition of nodes. Therefore $X$ is also concatenated to the output matrices of the GCNs if the graph is an attribute graph. The final fusion is defined as follows:
$$H \leftarrow H \,\Vert\, X \quad (7)$$
After the fusion of Eq.(7) is obtained, the fusion is mapped nonlinearly by a multilayer perceptron (MLP). Chen et al. [30] prove that the MLP is beneficial to self-supervised learning [47]. The final embedding result $Z$ is determined as $Z = g(H) \in \mathbb{R}^{n\times d}$ where $g(\cdot)$ is the nonlinear mapping function of the MLP.
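As a concrete illustration of the weighting and fusion of Eqs.(4)-(6), the sketch below estimates the modularity of each GCN output with k-means (the clustering step mentioned above), converts the modularities into softmax weights as in Eq.(5), and fuses the weighted outputs by summation or concatenation as in Eq.(6). The helper names and the use of scikit-learn's KMeans are assumptions for illustration, not the exact SDGE implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def modularity(A, labels):
    """Modularity Q of a node partition (Eq.(4)); A is the binary adjacency matrix."""
    two_m = A.sum()                                   # 2|E| for an undirected graph
    k = A.sum(axis=1)
    same_community = (labels[:, None] == labels[None, :]).astype(float)
    return ((A - np.outer(k, k) / two_m) * same_community).sum() / two_m

def fuse_outputs(H_list, A, num_communities, mode="concat"):
    """Weight each GCN output by the softmax of its modularity (Eq.(5)) and fuse (Eq.(6))."""
    Q = np.array([modularity(A, KMeans(n_clusters=num_communities).fit_predict(H))
                  for H in H_list])
    alpha = np.exp(Q) / np.exp(Q).sum()               # Eq.(5)
    weighted = [a * H for a, H in zip(alpha, H_list)]
    return np.concatenate(weighted, axis=1) if mode == "concat" else sum(weighted)
```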
Self-supervised learning is an important unsupervised approach for graph neural networks. It can train a GCN from unlabeled graph data. The parameters of SDGE can be optimized by minimizing a loss function. The diagram of the self-supervised learning of SDGE is shown in Fig. 2.

For each node, SDGE employs data augmentation to generate positive samples. Let $Z = [z_1^T, z_2^T, \cdots, z_n^T]^T$ and $z_i \in \mathbb{R}^{1\times d}$ be the embedding vector of the node $v_i$. For $\forall v_i \in V$, noise, such as Gaussian noise, is added to $z_i$ and the newly generated data is denoted as $z_i^+ \in \mathbb{R}^{1\times d}$. $z_i^+$ and $z_i$ can be seen as a positive sample pair since both of them are descriptions of the node $v_i$. For the node $v_i$, $m$ samples can also be drawn from the non-neighborhood nodes of $v_i$, which are seen as negative samples. The similarity between $z_i$ and $z_i^+$ should be as large as possible. The similarity between $z_i$ and the negative samples of $z_i$ should be as small as possible.
Therefore the loss function of the self-supervised learning is defined as follows:
$$L_s = \frac{1}{n}\sum_{i=1}^{n}\log \delta\left(z_i \cdot z_i^+/\tau\right) - \mathbb{E}_{z_j \sim P_n(v),\, v_j \notin N(v_i)}\left[\log \delta\left(z_i \cdot z_j/\tau\right)\right] \quad (8)$$
In Eq.(8), the first term is the similarity of the positive sample pair; the second term is the similarity between the node $v_i$ and the negative samples, where $P_n(v)$ is the distribution of negative samples, $N(v_i)$ is the set of neighbors of $v_i$ and $\tau$ is the temperature.
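A minimal PyTorch-style sketch of Eq.(8) is given below: positives are created by Gaussian-noise augmentation and negatives are drawn uniformly at random and masked to non-neighbors, a simple stand-in for the sampling distribution $P_n(v)$. The function name, the uniform proposal and the use of the sigmoid for $\delta(\cdot)$ are assumptions made for illustration.

```python
import torch

def self_supervised_term(Z, A, tau=1.0, m=5, noise_std=0.1):
    """L_s of Eq.(8); the total loss of Eq.(11) subtracts this term, so it is maximized."""
    n = Z.size(0)
    Z_pos = Z + noise_std * torch.randn_like(Z)            # data augmentation: z_i^+
    L_s = torch.log(torch.sigmoid((Z * Z_pos).sum(1) / tau) + 1e-12).mean()
    for _ in range(m):                                      # m negative samples per node
        idx = torch.randint(0, n, (n,))
        non_neighbor = (A[torch.arange(n), idx] == 0).float()
        neg = torch.sigmoid((Z * Z[idx]).sum(1) / tau)
        # subtract the (masked) negative similarity, averaged over the m draws
        L_s = L_s - (non_neighbor * torch.log(neg + 1e-12)).mean() / m
    return L_s
```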
Fig. 2. The self-supervised learning of SDGE. For each node, the positive sample is generated by adding noise to the original sample and the negative samples are sampled from the non-neighborhood nodes.

It is known from Eq.(2) that the output of the MLP includes the structure information and the attribute information of nodes. Therefore the low dimensional embedding $Z$ should preserve the similarities of the structure information and the attribute information after dimension reduction. The difference between $Z$ and the structure information should be as small as possible, and the difference between $Z$ and the attribute information should also be as small as possible. The loss of the structure information and the attribute information of nodes is defined as follows:
$$L_{sa} = \frac{1}{n}\left\Vert ZZ^T - A\right\Vert_F + \frac{1}{n}\left\Vert ZZ^T - XX^T\right\Vert_F \quad (9)$$
In Eq.(9), the first term is the loss of the structure information and the second term is the loss of the attribute information. SDGE preserves the minimal loss of the structure information and the attribute information.

SDGE projects the nodes of a graph into the low dimensional space. Therefore the embedding result of SDGE should also satisfy the constraint of graph regularization. It means that nodes which are similar in the high dimensional space are also similar in the low dimensional embedding space. The loss function $L_r$ is defined as:
$$L_r = \frac{1}{n}\,\mathrm{tr}\left(Z^T L Z\right) \quad (10)$$
where $L$ is the Laplacian matrix of the graph $G$, $L = D - W$, $W$ is the weight matrix of the edges in the graph $G$, $D$ is the degree matrix and $D_{ii} = \sum_{j=1}^{n} W_{ij}$.

From Eqs.(8)-(10), the optimization problem of the loss function $L$ is defined as:
$$\min_{Z}\; L = \beta \cdot L_{sa} + \gamma \cdot L_r - L_s \quad (11)$$
where $\beta \ge 0$ and $\gamma \ge 0$ are the predefined parameters. The parameters of the GCNs and the MLP are adjusted iteratively by back-propagating the loss of Eq.(11).

$Z$ can be solved from Eq.(11). Then spectral propagation [48] is introduced to enhance the embedding result. $Z$ is updated as:
$$Z \leftarrow D^{-1} A \left(I - \widetilde{L}\right) Z \quad (12)$$
where $\widetilde{L} = U g(\Lambda) U^{-1}$ is the modulated Laplacian and $g$ is the spectral modulator. $U$ and $\Lambda$ are the eigendecomposition results of the random walk normalized graph Laplacian $\bar{L} = I - D^{-1}A$, where $\bar{L} = U\Lambda U^{-1}$, $\Lambda = \mathrm{diag}([\lambda_1, \lambda_2, \cdots, \lambda_n])$, $\lambda_i$ ($i = 1, 2, \cdots, n$) is an eigenvalue of $\bar{L}$ and $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. The reference [48] presents a truncated Chebyshev expansion to compute the modulated Laplacian efficiently.

If SDGE is used as an end-to-end approach, the dimension $d$ of $Z$, which is the output of the MLP, is set to the number of communities and the community partition of the input graph $G$ is determined by $softmax(\cdot)$. If SDGE is not an end-to-end approach, the community structure of the graph $G$ can be obtained by a clustering algorithm after the MLP outputs the low dimensional embedding vectors $Z$.

The detailed steps of the SDGE algorithm are summarized as Algorithm 1.
Algorithm 1 SDGE
Input: A graph $G = \langle V, A, E, X\rangle$, the number of clusters $k$, the dimension after dimension reduction $d$, and the parameters $\beta$, $\gamma$ and $r$.
Output: The community structure $C \in \mathbb{R}^{n\times k}$.
1: Compute $A^1, A^2, \cdots, A^r$;
2: Initialize the $r$ GCNs and the MLP;
3: Input $A^1, A^2, \cdots, A^r$ into the $r$ GCNs, respectively;
4: while the loss $L$ is not convergent do
5:   Obtain the embedding vectors $Z$;
6:   Compute the loss $L$ from the MLP;
7:   Back propagate the loss $L$ and adjust the parameters of the GCNs and the MLP;
8:   Enhance the embedding vectors $Z$ by spectral propagation;
9: end while
10: Obtain the community partition $C$ of the graph $G$ from the MLP or a clustering algorithm.

In Algorithm 1, SDGE can effectively fuse the structure information and the attribute information. In addition, multiple GCNs are utilized to extract the features of nodes, which is equivalent to multi-view learning. Step 1 computes the high-order information of the graph, and the high-order information matrices include the local and global information. Steps 2-3 input the high-order information matrices into the different GCNs. Therefore the embedding result integrates the local and global information of the graph and the attribute information of nodes. Steps 4-9 train the GCNs and the MLP. Step 8 enhances the embedding result by spectral propagation. Step 10 obtains the community partition by the end-to-end approach or a clustering algorithm.
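The sketch below puts one training iteration of Algorithm 1 together: it evaluates the combined objective of Eqs.(9)-(11), reusing self_supervised_term from the earlier sketch for $L_s$, and applies a simplified version of the spectral propagation step of Eq.(12), in which a dense eigendecomposition of the symmetric normalized Laplacian and a simple modulator $g$ stand in for the truncated Chebyshev expansion of [48]. The names and the specific modulator are illustrative assumptions.

```python
import numpy as np
import torch

def total_loss(Z, A, X, W_graph, beta=1.0, gamma=1.0, tau=1.0):
    """Combined objective of Eq.(11): beta * L_sa + gamma * L_r - L_s."""
    n = Z.size(0)
    L_sa = (torch.norm(Z @ Z.t() - A, p='fro')
            + torch.norm(Z @ Z.t() - X @ X.t(), p='fro')) / n       # Eq.(9)
    L_lap = torch.diag(W_graph.sum(dim=1)) - W_graph                # graph Laplacian D - W
    L_r = torch.trace(Z.t() @ L_lap @ Z) / n                        # Eq.(10)
    return beta * L_sa + gamma * L_r - self_supervised_term(Z, A, tau)

def spectral_propagation(Z, A, modulator=lambda lam: lam / (1.0 + lam)):
    """Simplified Eq.(12): Z <- D^{-1} A (I - L_mod) Z, with the modulated Laplacian
    built from an explicit eigendecomposition instead of the Chebyshev expansion of [48]."""
    A_np, Z_np = A.numpy(), Z.detach().numpy()
    n, d = A_np.shape[0], A_np.sum(axis=1)
    D_inv, D_inv_sqrt = np.diag(1.0 / d), np.diag(1.0 / np.sqrt(d))
    lam, V = np.linalg.eigh(np.eye(n) - D_inv_sqrt @ A_np @ D_inv_sqrt)
    L_mod = (V * modulator(lam)) @ V.T                              # V g(Lambda) V^T
    return torch.from_numpy(D_inv @ A_np @ (np.eye(n) - L_mod) @ Z_np).float()
```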
For SDGE, the time complexity of Eq.(1) is $O(rn^3)$. Let $t$ be the number of iterations of the GCNs and the MLP and $l$ be the number of layers. The time complexity of the GCNs is $O(rtln^2)$. The time complexity of the MLP is $O(tnd^2)$. It costs $O(rn^2)$ to compute $\alpha$. The time complexity of the loss $L$ is $O(tnd + tn^2d + tnmd)$. It is known that $rtln^2 \ge tnd^2$ and $tnd \le tn^2d \le rtln^2$. Therefore the time complexity of SDGE is $O(rtln^2 + tnmd)$.

In Eq.(6), the outputs of the GCNs are fused by the weighting mechanism. It can use the high-order information of the graph to project the representation of nodes into the low dimensional space effectively.
Theorem 1. In Eq.(6), the fusion with the weighting mechanism is of benefit to the output of the MLP if the aggregation function is $sum(\cdot)$.

Proof. Let $f_1, f_2, \cdots, f_r$ be the outputs of the GCNs and $\alpha_1, \alpha_2, \cdots, \alpha_r$ be the weights of the outputs $f_1, f_2, \cdots, f_r$. The fusion is $f = \alpha_1 f_1 + \alpha_2 f_2 + \cdots + \alpha_r f_r = \sum_{i=1}^{r}\alpha_i f_i$. Since $\alpha_1 + \alpha_2 + \cdots + \alpha_r = 1$ and $0 \le \alpha_i \le 1$ ($i = 1, 2, \cdots, r$), $\mathbb{E}[f] = \sum_{i=1}^{r}\alpha_i f_i$ is the expectation of $f_1, f_2, \cdots, f_r$. Let $f^*$ be the real target of the MLP and $F(\cdot)$ be the map function of the MLP. From Jensen's inequality [49], it follows that:
$$\left\Vert F\left(\mathbb{E}\left[\sum_{i=1}^{r}\alpha_i f_i\right]\right) - F(f^*)\right\Vert = \left\Vert F(\mathbb{E}[f]) - F(f^*)\right\Vert \le \left\Vert \mathbb{E}[F(f)] - F(f^*)\right\Vert \quad (13)$$
Therefore the output error can be decreased by the weighting mechanism.

Theorem 2. In Eq.(6), the fitting capability of the MLP with the aggregation function $CONCAT(\cdot)$ is not weaker than that with the aggregation function $sum(\cdot)$.

Proof. Let $f_1, f_2, \cdots, f_r$ be the outputs of the GCNs. Then $\sum_{i=1}^{r} f_i \in \mathbb{R}^{n\times l}$, $[f_1, f_2, \cdots, f_r] \in \mathbb{R}^{n\times rl}$ and $l < rl$. Let $l'$ be the number of hidden nodes of the MLP. The number of parameters of the MLP with the aggregation function $sum(\cdot)$ is $l\cdot l'$ and the number of parameters of the MLP with the aggregation function $CONCAT(\cdot)$ is $rl\cdot l'$ ($r \ge 1$). Therefore the number of parameters of the MLP with the aggregation function $sum(\cdot)$ is not more than that of the MLP with the aggregation function $CONCAT(\cdot)$, and the fitting capability of the MLP with the aggregation function $CONCAT(\cdot)$ is not weaker than that with the aggregation function $sum(\cdot)$.

Following Theorem 2, the fitting capability of the MLP will be too strong and over-fitting will appear if $r$ is a large value. In addition, the time cost will also increase with the increase of the $r$ value.

4 THE EXPERIMENTAL RESULTS AND ANALYSIS
The proposed algorithm and the comparison algorithms are run on a server with the Ubuntu 18.04.2 operating system, an Intel Core i9-7900X CPU and 64 GB of RAM. The proposed algorithm is implemented in Python 3.7 and the executable codes of the comparison algorithms are the codes released with their papers.
In order to verify the performance of the proposed algorithm, the following algorithms are selected as the comparison algorithms of this paper:
• LINE [20]: a graph embedding algorithm based on the assumption of neighborhood similarity; it introduces first-order proximity and second-order proximity to define the similarity between vertices in a graph.
• struc2vec [21]: a graph embedding algorithm based on random walk; it introduces structural similarity to define the similarity of any two nodes.
• DANMF [50]: it stacks multiple non-negative matrix factorization components as the encoder and decoder layers to learn the final community assignment.
• Graph2gauss [51]: it can efficiently learn versatile node embeddings on large scale graphs and embeds nodes as Gaussian distributions to capture the uncertainty of the representation.
• Modsoft [52]: it discovers the community structure by maximizing the modularity of the node partition and uses sparse matrices to record the partition result to improve efficiency.
• ProNE [48]: a fast and scalable network representation learning algorithm; the node embedding vectors can be efficiently obtained by the randomized tSVD approach.
• DGI [53]: a self-supervised graph neural network that learns the structure information of a graph and aims at maximizing mutual information between the input data and the output data.
• GIC [54]: a self-supervised graph neural network that uses cluster-level node information to learn the low dimensional embedding vectors of nodes.
• GMI [34]: a self-supervised graph neural network that learns the embedding vectors of nodes by mutual information maximization.
• SDGE-cat: the SDGE algorithm with $CONCAT(\cdot)$ to aggregate the outputs of the GCNs.
• SDGE-sum: the SDGE algorithm with $sum(\cdot)$ to aggregate the outputs of the GCNs.

For SDGE, $k$ is set to the real community number of each data set, $\beta, \gamma \in [0, ]$, $\tau \in [0, ]$, $d = \{ , \}$ and $r = 4$. The GCNs in the SDGE structure have four layers and the neuron numbers of the different layers are $[200, , , ]$. The activation function of the MLP is sigmoid. The aggregation function is CONCAT or sum.

In order to test the performance of SDGE and the comparison algorithms, the following data sets are selected as the experimental data sets:
• ACM is an author relationship network from the ACM Digital Library. The nodes represent papers and there is an edge between two nodes if the two papers share an author.
• USA is an air-traffic network of the USA. A node represents an airport and an edge means the existence of a commercial flight between two airports.
• Image is an image segmentation data set. The images are hand-segmented and each instance is a 3 × 3 region. The attribute graph is constructed by KNN ($K = 10$).
• Hyperplane is an artificial data set from the MOA platform¹. For a data point $x = [x_1, x_2, \cdots, x_d] \in \mathbb{R}^d$ and a $d$-dimension hyperplane $\sum_{i=1}^{d} a_i x_i = a_0$, if $\sum_{i=1}^{d} a_i x_i \ge a_0$, $x$ is marked as a positive sample, otherwise $x$ is marked as a negative sample. The attribute graph is constructed by KNN ($K = 10$).
1. https://moa.cms.waikato.ac.nz/downloads/
• Waveform is a wave data set. Each sample is generated from a combination of 2 of 3 base waves. The attribute graph is constructed by KNN ($K = 10$); a sketch of this KNN construction is given after Table 1.

The detailed information of the experimental data sets is presented in Table 1.

TABLE 1
The details of the experimental data sets.
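For the Image, Hyperplane and Waveform data sets the attribute graph is constructed by KNN with $K = 10$. A minimal sketch of such a construction is shown below; the use of scikit-learn and the symmetrization step are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_attribute_graph(X, k=10):
    """Connect each sample to its k nearest neighbors and symmetrize the result
    to obtain the binary adjacency matrix of the attribute graph."""
    A = kneighbors_graph(X, n_neighbors=k, mode='connectivity', include_self=False)
    A = A.toarray()
    return np.maximum(A, A.T)     # make the KNN graph undirected
```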
In order to evaluate the performance of the algorithms, the following evaluation criteria are used:
(1) Jaccard index (J):
$$J = \frac{TP}{TP + FN + FP} \quad (14)$$
(2) Fowlkes and Mallows index (FM):
$$FM = \frac{TP}{\sqrt{(TP + FN)\cdot(TP + FP)}} \quad (15)$$
(3) F1-measure (F):
$$F = \frac{2\cdot precision\cdot recall}{precision + recall} \quad (16)$$
(4) Kulczynski index (K):
$$K = \frac{1}{2}\left(\frac{TP}{TP + FP} + \frac{TP}{TP + FN}\right) \quad (17)$$
where $precision = TP/(TP + FP)$ and $recall = TP/(TP + FN)$. $TP$, $FP$, $TN$ and $FN$ come from the confusion matrix over point pairs. $TP$ is the number of data point pairs that are in the same cluster and whose real labels are also the same. $TN$ is the number of data point pairs that are in different clusters and whose real labels are also different. $FP$ is the number of data point pairs that are in the same cluster but whose real labels are different. $FN$ is the number of data point pairs that are in different clusters but whose real labels are the same.
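All four criteria are pair-counting measures. A small sketch of how they could be computed from a predicted partition and the ground-truth labels is given below; the function names are illustrative.

```python
import numpy as np

def pair_counts(pred, truth):
    """Count TP, FP, FN, TN over all point pairs (same/different cluster vs. label)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    i, j = np.triu_indices(len(pred), k=1)
    same_cluster, same_label = pred[i] == pred[j], truth[i] == truth[j]
    TP = np.sum(same_cluster & same_label); FP = np.sum(same_cluster & ~same_label)
    FN = np.sum(~same_cluster & same_label); TN = np.sum(~same_cluster & ~same_label)
    return TP, FP, FN, TN

def pair_metrics(pred, truth):
    """Jaccard (Eq.14), Fowlkes-Mallows (Eq.15), F1 (Eq.16) and Kulczynski (Eq.17)."""
    TP, FP, FN, _ = pair_counts(pred, truth)
    precision, recall = TP / (TP + FP), TP / (TP + FN)
    J = TP / (TP + FN + FP)
    FM = TP / np.sqrt((TP + FN) * (TP + FP))
    F1 = 2 * precision * recall / (precision + recall)
    K = 0.5 * (precision + recall)
    return J, FM, F1, K
```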
In order to test the effectiveness of the proposed algorithm, SDGE and the comparison algorithms are conducted on the five experimental data sets and the test results are shown in Tables 2-6.

TABLE 2
The experimental results of the algorithms on ACM data set.
Algorithms      J        FM       F        K
Line            0.2514   0.4094   0.4018   0.4171
struc2vec       0.2568   0.4216   0.4079   0.4361
DANMF           0.1160   0.2296   0.2079   0.2535
Graph2gauss     0.2980   0.4596   0.4592   0.4600
Modsoft         0.0131   0.1000   0.0259   0.3867
ProNE           0.2752   0.4510   0.4316   0.4714
DGI             0.2268   0.3701   0.3697   0.3705
GIC             0.2998   0.4880   0.4611   0.5165
GMI             0.2485   0.3982   0.3979   0.3986
SDGE-cat        -        -        -        -
SDGE-sum        0.3158   0.5335   0.4801   0.5929
TABLE 3
The experimental results of the algorithms on USA data set.
Algorithms      J        FM       F        K
Line            0.2672   0.4239   0.4213   0.4266
struc2vec       0.1751   0.3060   0.2977   0.3148
DANMF           0.0218   0.0762   0.0427   0.1359
Graph2gauss     0.1718   0.2979   0.2931   0.3027
Modsoft         0.1057   0.1967   0.1911   0.2024
ProNE           0.1813   0.3173   0.3067   0.3284
DGI             -        -        -        -
GIC             0.3334   0.5001   0.5000   0.5002
GMI             0.2021   0.3363   0.3363   0.3363
SDGE-cat        0.1799   0.3128   0.3049   0.3208
SDGE-sum        0.2022   0.3590   0.3364   0.3832

Tables 2-6 present the results of the proposed and comparison algorithms on the experimental data sets. From the results, it can be seen that SDGE-cat obtains the best results on the ACM, Hyperplane and Waveform data sets. SDGE-sum obtains the best results on the Image data set except for the Kulczynski index; SDGE-cat obtains the best Kulczynski index on the Image data set. Therefore SDGE outperforms the comparison algorithms on most of the data sets except for the USA data set. On the USA data set, SDGE-cat is the fourth best of the eleven algorithms and DGI is the best of all. For the USA data set, there are only 1,190 nodes and the scale of the data set is not too large. However, there are r GCNs and an MLP in SDGE. Each GCN has four layers and the MLP has two hidden layers. By contrast, the structure of the neural network is too complex for the USA data set. It means that the number of parameters of the neural network is not consistent with the complexity of the data set. The overly complex neural network reduces the performance of SDGE. Therefore the neural network algorithms with shallower layers, such as DGI, GIC, etc., outperform SDGE on the USA data set.
In order to test the effects of the parameters on the performance of the proposed algorithm, SDGE is conducted on the ACM data set and the parameters τ, β and γ are set to different values. The test results are shown in Figs. 3-5.

Figs. 3-5 show the experimental results of SDGE with different parameter τ, β and γ values on the ACM data set. From the results, it is known that the parameters τ, β and γ have influences on the performance of SDGE. The parameter τ, which is the temperature of the self-supervised learning loss, determines the effect of the similarity between two samples on the loss of Eq.(11). The parameter β determines the effect of the reconstruction error on the loss of Eq.(11). The parameter γ determines the effect of graph regularization on the loss of Eq.(11). From Figs. 3-5, the changing trends of SDGE-sum with different τ, β and γ values are obvious. It means the influence of the parameters τ, β and γ on the performance of SDGE-sum is also obvious. From Fig. 4(a), it can be seen that the changing trend of SDGE-cat with different β values is obvious. In Fig. 3(a), the changing trend of SDGE-cat with different τ values is much more gradual than that of SDGE-sum although there are still some changes. It means SDGE-cat is not too sensitive to the parameter τ.

From Fig. 5(a), the changing trend of SDGE-cat with different γ values is smooth and the changes of the evaluation criteria are small. It means that SDGE-cat is not sensitive to the parameter γ. From the theory of graph neural networks, it is known that the embedding vector of each node comes from the feature vectors of its neighbors, which is in tune with manifold learning approaches such as Locally Linear Embedding, etc. It means a graph neural network can learn the intrinsic manifold structure of graph data, and the parameter γ can only further increase the effect of graph regularization which is inherently included in the learning process of the graph neural network. Therefore the changing trend of SDGE-cat with different γ values is smooth in Fig. 5(a) if the GCNs are effectively trained and have good approximation ability.

Fig. 3. The experimental results of SDGE with different τ values. (a) SDGE-cat. (b) SDGE-sum.

Fig. 4. The experimental results of SDGE with different β values. (a) SDGE-cat. (b) SDGE-sum.

Fig. 5. The experimental results of SDGE with different γ values. (a) SDGE-cat. (b) SDGE-sum.

TABLE 4
The experimental results of the algorithms on Image data set.
Algorithms      J        FM       F        K
Line            0.1396   0.2545   0.2448   0.2646
struc2vec       0.0863   0.1594   0.1594   0.1600
DANMF           0.0884   0.0884   0.1625   0.4106
Graph2gauss     0.2204   0.3614   0.3613   0.3614
Modsoft         0.0649   0.2273   0.1219   0.4238
ProNE           0.1574   0.2733   0.2715   0.2751
DGI             0.2582   0.4119   0.4104   0.4134
GIC             0.2446   0.3939   0.3931   0.3947
GMI             0.2375   0.3847   0.3838   0.3856
SDGE-cat        0.1641   0.3749   0.2820   -
SDGE-sum        -        -        -        -

TABLE 5
The experimental results of the algorithms on Hyperplane data set.
Algorithms      J        FM       F        K
Line            0.4650   0.6592   0.6348   0.6846
struc2vec       0.3701   0.5427   0.5394   0.5460
DANMF           0.0184   0.1012   0.1012   0.2839
Graph2gauss     0.3335   0.5002   0.5002   0.5002
Modsoft         0.0314   0.1306   0.0609   0.2797
ProNE           0.3369   0.5040   0.5040   0.5040
DGI             0.4108   0.5904   0.5823   0.5986
GIC             0.3352   0.5021   0.5021   0.5021
GMI             0.4190   0.6004   0.5905   0.6105
SDGE-cat        -        -        -        -
SDGE-sum        0.4963   0.7019   0.6634   0.7426

TABLE 6
The experimental results of the algorithms on Waveform data set.
Algorithms      J        FM       F        K
Line            0.3232   0.4903   0.4883   0.4924
struc2vec       0.2554   0.4183   0.4066   0.4303
DANMF           0.0372   0.1692   0.0718   0.3990
Graph2gauss     0.3521   0.5209   0.5208   0.5209
Modsoft         0.2603   0.4532   0.4130   0.4973
ProNE           0.3466   0.5148   0.5148   0.5149
DGI             0.3373   0.5045   0.5045   0.5045
GIC             0.3413   0.5089   0.5089   0.5089
GMI             0.3391   0.5065   0.5065   0.5065
SDGE-cat        -        -        -        -
SDGE-sum        0.3892   0.5723   0.5603   0.5846

It is still not clear whether the good performance of SDGE is due to the fusion of high-order information or to Dynamic ReLU [44]. In order to answer this question, the ablation of SDGE is studied in this section. A GCN with four layers which is trained as an unsupervised autoencoder (denoted as GCN-AE) serves as the ablative analysis for the fusion of high-order information. The SDGE algorithm in which the Dynamic ReLU activation function is replaced by the ReLU activation function (denoted as SDGE-ReLU) serves as the ablative analysis for Dynamic ReLU. The experimental algorithms are conducted on the ACM data set. The experimental results are shown in Table 7 and Fig. 6.
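For reference, Dynamic ReLU [44] replaces the fixed ReLU with input-dependent piecewise-linear coefficients predicted from a global context vector. The module below is a simplified sketch of that idea for node features (a two-branch, DyReLU-A style variant); the reduction ratio, the residual scaling and the pooling over nodes are illustrative assumptions rather than the exact configuration used in SDGE.

```python
import torch
import torch.nn as nn

class DynamicReLU(nn.Module):
    """Simplified Dynamic ReLU for node features: y = max_k (a_k(x) * x + b_k(x)),
    with the coefficients a_k, b_k predicted from the mean-pooled global context."""
    def __init__(self, channels, k=2, reduction=4):
        super().__init__()
        self.k = k
        self.hyper = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, 2 * k * channels), nn.Sigmoid())
        # default coefficients: slope 1 for the first branch, 0 elsewhere
        self.register_buffer("init_a", torch.tensor([1.0] + [0.0] * (k - 1)))
        self.register_buffer("init_b", torch.zeros(k))

    def forward(self, x):                         # x: (num_nodes, channels)
        context = x.mean(dim=0, keepdim=True)     # global context over all nodes
        theta = self.hyper(context).view(2, self.k, -1)      # coefficients in [0, 1]
        a = self.init_a.view(-1, 1) + (theta[0] - 0.5)       # residual around defaults
        b = self.init_b.view(-1, 1) + (theta[1] - 0.5)
        out = a.unsqueeze(1) * x.unsqueeze(0) + b.unsqueeze(1)   # (k, num_nodes, channels)
        return out.max(dim=0).values
```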
TABLE 7
The experimental results of the ablative analysis (τ = 100, β = 1, γ = 1 and epoch = 100).
Algorithms       J        FM       F        K
LLE              0.2634   0.4366   0.4156   0.4597
GCN-AE           0.2025   0.3368   0.3368   0.3368
SDGE-ReLU-cat    0.3186   0.5399   0.4832   0.6033
SDGE-cat         -        -        -        -
SDGE-ReLU-sum    0.3176   0.5382   0.4820   0.6010
SDGE-sum         0.3178   0.5382   0.4823   0.6005
Table 7 shows the experimental results of the ablative analysis. In order to further confirm the effectiveness of the proposed algorithm, the experimental results of LLE, which is a representative manifold learning algorithm, are also presented in Table 7. From Table 7, it is known that SDGE-cat is the best of all. It demonstrates that SDGE is an effective approach for community discovery. By comparing the results of SDGE-ReLU-cat, SDGE-cat, SDGE-ReLU-sum and SDGE-sum with the results of GCN-AE, it can be seen that SDGE outperforms GCN-AE and the performance advantages of SDGE over GCN-AE are obvious. The obvious advantages indicate that the high-order information fusion of SDGE is effective and that the fusion of the multiple outputs of the GCNs can better learn the local and global structure information of a graph. In SDGE, each GCN can learn one kind of structure information, therefore SDGE can learn the graph information from multiple views and obtains the best results on the experimental data set.

In Table 7, SDGE-cat outperforms SDGE-ReLU-cat. The result demonstrates that the performance of SDGE can be improved by introducing the Dynamic ReLU activation function. However, the performance difference between SDGE-cat and SDGE-ReLU-cat is small and the results of SDGE-ReLU-sum are almost equal to the results of SDGE-sum.
Fig. 6. The convergence of GCN-AE, SDGE and SDGE-ReLU. (a) GCN-AE. (b) SDGE. (c) SDGE-ReLU.

From Fig. 6(b)-(c), it can be seen that SDGE-ReLU-cat and SDGE-ReLU-sum are fully convergent when the epoch reaches 100, while the losses of SDGE-cat and SDGE-sum can be further decreased although the epoch has reached 100. The trend of the loss in Fig. 6(b) means that the performance of SDGE-cat and SDGE-sum can be further improved. In the current situation (epoch = 100), SDGE has already outperformed or equaled SDGE-ReLU. The performance of SDGE will outperform SDGE-ReLU if SDGE is fully convergent. Therefore the results prove that Dynamic ReLU can improve the performance of SDGE.

Fig. 6 shows the convergence of GCN-AE, SDGE and SDGE-ReLU. From the results, it can be seen that GCN-AE is fully convergent when the epoch is 80, SDGE-ReLU is fully convergent when the epoch is 60 and the loss of SDGE can be further decreased after the epoch is larger than 100. The results also show that SDGE requires more iterations than GCN-AE and SDGE-ReLU. SDGE needs to train multiple GCNs, which increases the complexity of the algorithm. In addition, SDGE introduces Dynamic ReLU as the activation function. From the reference [44], it is known that the appropriate parameters of Dynamic ReLU are determined according to the global context, which needs more computation. Therefore SDGE requires more iterations to reach convergence.
TABLE 8
The experimental results of the ablative analysis of spectral propagation (SP) on Hyperplane data set (τ = 50, β = 0. , γ = 0. and epoch = 50).
Algorithms          J        FM       F        K
SDGE-cat            -        -        -        -
SDGE-cat-w/o-SP     0.4514   0.6412   0.6220   0.6610
SDGE-sum            -        -        -        -
SDGE-sum-w/o-SP     0.4974   0.7034   0.6643   0.7448
Table 8 shows the ablative result of spectral propagation on the Hyperplane data set. From the results, it is known that SDGE-cat and SDGE-sum outperform SDGE-cat-w/o-SP and SDGE-sum-w/o-SP, respectively. It demonstrates that spectral propagation improves the performance of SDGE-cat. When the aggregation function is
$CONCAT(\cdot)$, the performance degradation is obvious after spectral propagation is removed from SDGE-cat. The result provides evidence that spectral propagation plays an important role in SDGE-cat. However, the performance difference between SDGE-sum and SDGE-sum-w/o-SP is small although the performance degradation of SDGE-sum also appears after spectral propagation is removed from SDGE-sum. Therefore it can be concluded that spectral propagation has a more significant influence on SDGE-cat than on SDGE-sum.

Fig. 7. The time cost of the ablative analysis of spectral propagation (SP) on Hyperplane data set.

Fig. 7 shows the time cost of the ablative analysis of spectral propagation on the Hyperplane data set. From the results, it is known that both SDGE-cat and SDGE-sum spend more time on the calculation of the embedding vectors than SDGE-cat-w/o-SP and SDGE-sum-w/o-SP. It means spectral propagation is a time consuming process. However, the results in Table 8 show the effectiveness of spectral propagation. Therefore spectral propagation is an effective approach to improve the performance of SDGE if the task is not sensitive to time cost.
In this section, the efficiency of SDGE is studied. Fig. 8 shows the time cost of the SDGE algorithm on the experimental data sets.

Fig. 8. The time cost of SDGE on the experimental data sets. (a) ACM data set. (b) USA data set. (c) Image data set. (d) Hyperplane data set. (e) Waveform data set.

From the results, it can be seen that the time cost of SDGE-sum is more than that of SDGE-cat on most of the experimental data sets. It is known that the outputs of the GCNs are added if the aggregation function is $sum(\cdot)$ and the time complexity is $O(nr)$. If the aggregation function is $CONCAT(\cdot)$, the outputs of the GCNs are concatenated together and the time complexity is $O(rd_r)$ where $d_r$ is the output dimension of a GCN. Due to the fact that $n \gg d_r$, the time complexity of $sum(\cdot)$ is higher than that of $CONCAT(\cdot)$, which is consistent with the results in Fig. 8 except for Figs. 8(b)-(c). In Figs. 8(b)-(c), the time cost of $sum(\cdot)$ is less than that of $CONCAT(\cdot)$. For the USA and Image data sets, the number of nodes $n$ is not too large, and it means the difference between $n$ and $d_r$ is smaller than on the other data sets. Therefore the time cost of $sum(\cdot)$ is closer to the time cost of $CONCAT(\cdot)$ on the USA and Image data sets than on the other data sets, and the changing ranges in Figs. 8(b)-(c) also agree with the above analysis. In addition, the initialization of the graph neural network also plays an important role in the final time cost. A good initialization can speed up the convergence. Therefore the time cost of $CONCAT(\cdot)$ is more than that of $sum(\cdot)$ on the USA and Image data sets.

5 CONCLUSIONS
In this paper, we focus on building a self-supervised deep graph neural network named SDGE for node embedding and community discovery. Through the fusion of the outputs of multiple GCNs, SDGE can effectively utilize the high-order information of a graph. The embedding result of SDGE can preserve the structure and node similarity in the low dimensional embedding space. Spectral propagation is also introduced to enhance the embedding result. The extensive experiments on the experimental data sets demonstrate the effectiveness of the SDGE algorithm.

At present, pre-training has been proved to be an effective approach to improve the performance of graph neural networks. Therefore a future work is to explore introducing a pre-training approach for SDGE. Moreover, the depth of the neural network is important for the representation learning ability of a GCN, and increasing the depth of the GCN is also a potential way of improving the proposed algorithm.

ACKNOWLEDGMENTS
This work was supported by National Natural ScienceFund of China (Nos. 61972064, 61672130), Liaoning Re-vitalization Talents Program (No. XLY-C1806006), Funda-mental Research Funds for the Central Universities (No.DUT19RC(3)012). R EFERENCES [1] J. Tang, “Computational models for social network analysis: Abrief survey,” in
[1] J. Tang, "Computational models for social network analysis: A brief survey," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 921–925.
[2] F. Liu, S. Xue, J. Wu, C. Zhou, W. Hu, C. Paris, S. Nepal, J. Yang, and P. S. Yu, "Deep learning for community detection: Progress, challenges and opportunities," in The 29th International Joint Conference on Artificial Intelligence, 2020, pp. 4981–4987.
[3] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.
[4] Z. Du, X. Wang, H. Yang, J. Zhou, and J. Tang, "Sequential scenario-specific meta learner for online recommendation," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, pp. 2895–2904.
[5] Y. Cen, J. Zhang, X. Zou, C. Zhou, H. Yang, and J. Tang, "Controllable multi-interest framework for recommendation," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 2942–2951.
[6] W. Zhang, B. Paudel, L. Wang, J. Chen, H. Zhu, W. Zhang, A. Bernstein, and H. Chen, "Iteratively learning embeddings and rules for knowledge graph reasoning," in The World Wide Web Conference, 2019, pp. 2366–2377.
[7] L. Liu, Y. Ma, X. Zhu, Y. Yang, X. Hao, L. Wang, and J. Peng, "Integrating sequence and network information to enhance protein-protein interaction prediction using graph convolutional networks," in . IEEE, 2019, pp. 1762–1768.
[8] P. Cui, X. Wang, J. Pei, and W. Zhu, "A survey on network embedding," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 5, pp. 833–852, 2018.
[9] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[10] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Advances in Neural Information Processing Systems, 2002, pp. 585–591.
[11] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[12] Y. Rong, T. Xu, J. Huang, W. Huang, H. Cheng, Y. Ma, Y. Wang, T. Derr, L. Wu, and T. Ma, "Deep graph learning: Foundations, advances and applications," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 3555–3556.
[13] J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, K. Wang, and J. Tang, "Netsmf: Large-scale network embedding as sparse matrix factorization," in The World Wide Web Conference, 2019, pp. 1509–1520.
[14] T. Pang, F. Nie, and J. Han, "Flexible orthogonal neighborhood preserving embedding," in The Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 2592–2598.
[15] Z. Zhang, P. Cui, X. Wang, J. Pei, X. Yao, and W. Zhu, "Arbitrary-order proximity preserved network embedding," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2778–2786.
[16] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, "Community preserving network embedding," in The Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 203–209.
[17] Z. Qiu, W. Hu, J. Wu, Z. Tang, and X. Jia, "Noise-resilient similarity preserving network embedding for social networks," in The Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 3282–3288.
[18] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[19] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.
[20] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "Line: Large-scale information network embedding," in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 1067–1077.
[21] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo, "struc2vec: Learning node representations from structural identity," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 385–394.
[22] Y. Dong, N. V. Chawla, and A. Swami, "metapath2vec: Scalable representation learning for heterogeneous networks," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 135–144.
[23] L. Wang, B. Zong, Q. Ma, W. Cheng, J. Ni, W. Yu, Y. Liu, D. Song, H. Chen, and Y. Fu, "Inductive and unsupervised representation learning on graph structured objects," in International Conference on Learning Representations, 2020.
[24] J. Guo, L. Xu, and J. Liu, "Spine: Structural identity preserved inductive network embedding," in The Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 2399–2405.
[25] D. Jin, X. You, W. Li, D. He, P. Cui, F. Fogelman-Soulié, and T. Chakraborty, "Incorporating network embedding into markov random field for better community detection," in The Thirty-Third AAAI Conference on Artificial Intelligence, 2019, pp. 160–167.
[26] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations, 2017.
[27] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[28] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" in International Conference on Learning Representations, 2020.
[29] A. Loukas, "What graph neural networks cannot learn: depth vs width," in International Conference on Learning Representations, 2020.
[30] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, "Simple and deep graph convolutional networks," in Proceedings of the 37th International Conference on Machine Learning, 2020.
[31] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[32] Y. Rong, W. Huang, T. Xu, and J. Huang, "Dropedge: Towards deep graph convolutional networks on node classification," in International Conference on Learning Representations, 2020.
[33] D. Chen, L. Jacob, and J. Mairal, "Convolutional kernel networks for graph-structured data," in Proceedings of the 37th International Conference on Machine Learning, 2020.
[34] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang, "Graph representation learning via graphical mutual information maximization," in Proceedings of The Web Conference 2020, 2020, pp. 259–270.
[35] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, "Structural deep clustering network," in Proceedings of The Web Conference 2020, 2020, pp. 1400–1410.
[36] X. Zhang, C. Xu, and D. Tao, "On dropping clusters to regularize graph convolutional neural networks," in The 16th European Conference on Computer Vision, 2020.
[37] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang, "Attributed graph clustering: A deep attentional embedding approach," in The Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 3670–3676.
[38] S. Fan, X. Wang, C. Shi, E. Lu, K. Lin, and B. Wang, "One2multi graph autoencoder for multi-view graph clustering," in Proceedings of The Web Conference 2020, 2020, pp. 3070–3076.
[39] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, "Am-gcn: Adaptive multi-channel graph convolutional networks," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 1243–1253.
[40] X. Chen, Y. Zhang, I. Tsang, and Y. Pan, "Learning robust node representation on graphs," arXiv preprint arXiv:2008.11416, 2020.
[41] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang, "Self-supervised learning: Generative or contrastive," arXiv preprint arXiv:2006.08218, 2020.
[42] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang, "Gcc: Graph contrastive coding for graph neural network pre-training," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 1150–1160.
[43] Z. Yang, M. Ding, C. Zhou, H. Yang, J. Zhou, and J. Tang, "Understanding negative sampling in graph representation learning," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 1666–1676.
[44] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, "Dynamic relu," in The 16th European Conference on Computer Vision, 2020.
[45] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[46] M. E. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E, vol. 69, no. 2, p. 026113, 2004.
[47] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in Proceedings of the 37th International Conference on Machine Learning, 2020.
[48] J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding, "Prone: Fast and scalable network representation learning," in The 28th International Joint Conference on Artificial Intelligence, 2019, pp. 4278–4284.
[49] Z. Zhou, W. Wang, W. Gao, and L. Zhang, Introduction to The Theory of Machine Learning. China Machine Press, 2020.
[50] F. Ye, C. Chen, and Z. Zheng, "Deep autoencoder-like nonnegative matrix factorization for community detection," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 1393–1402.
[51] A. Bojchevski and S. Günnemann, "Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking," in International Conference on Learning Representations, 2018.
[52] A. Hollocou, T. Bonald, and M. Lelarge, "Modularity-based sparse soft graph clustering," in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 323–332.
[53] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, "Deep graph infomax," in International Conference on Learning Representations, 2019.
[54] C. Mavromatis and G. Karypis, "Graph infoclust: Leveraging cluster-level node information for unsupervised graph representation learning," arXiv preprint arXiv:2009.06946, 2020.