CaEGCN: Cross-Attention Fusion based Enhanced Graph Convolutional Network for Clustering
Guangyu Huo, Yong Zhang, Junbin Gao, Boyue Wang, Yongli Hu, Baocai Yin
IEEE TRANSACTIONS ON XX, VOL. XX, NO. X, JANUARY 2020
Abstract—With the powerful learning ability of deep convolutional networks, deep clustering methods can extract the most discriminative information from individual data and produce more satisfactory clustering results. However, existing deep clustering methods usually ignore the relationship between the data. Fortunately, the graph convolutional network can handle such relationships, opening up a new research direction for deep clustering. In this paper, we propose a cross-attention based deep clustering framework, named Cross-Attention Fusion based Enhanced Graph Convolutional Network (CaEGCN), which contains four main modules: the cross-attention fusion module, which innovatively concatenates the Content Auto-encoder module (CAE) relating to the individual data and the Graph Convolutional Auto-encoder module (GAE) relating to the relationship between the data in a layer-by-layer manner, and the self-supervised module that highlights the discriminative information for clustering tasks. While the cross-attention fusion module fuses two kinds of heterogeneous representations, the CAE module supplements the content information for the GAE module, which avoids the over-smoothing problem of GCN. In the GAE module, two novel loss functions are proposed that reconstruct the content and the relationship between the data, respectively. Finally, the self-supervised module constrains the distributions of the middle layer representations of CAE and GAE to be consistent. Experimental results on different types of datasets prove the superiority and robustness of the proposed CaEGCN.
Index Terms—Cross-attention fusion mechanism, graph convolutional network, deep clustering.

SUPPLEMENTARY MATERIALS
The supplementary code is available at https://github.com/huogy/CaEGCN.

Corresponding author: Boyue Wang. Guangyu Huo, Yong Zhang, Boyue Wang, Yongli Hu and Baocai Yin are with the Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Artificial Intelligence Institute, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. E-mail: [email protected], {zhangyong2010,wby,huyongli,ybc}@bjut.edu.cn. Junbin Gao is with the Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, NSW 2006, Australia. E-mail: [email protected].

I. INTRODUCTION

Clustering is an essential topic in the data mining area, which divides a collection of objects into multiple clusters of similar objects. Inspired by the powerful feature extraction capability of deep convolutional networks, many deep learning based clustering methods have been proposed in recent years, demonstrating significant progress in clustering research [1]–[4]. The two-step spectral clustering approach is usually employed here: a 'good' data representation or similarity matrix learned by these deep learning methods can be pipelined to downstream models/algorithms such as K-means [5] or Normalized Cut [6] to obtain the final clustering result.

However, existing deep clustering methods only focus on the data content and usually ignore the relationship between the data, i.e., the structural information. With the development of data collection and analysis technologies, people not only collect the data but also obtain or build the relationship between the data in the form of graphs, such as social networks [7], biochemical structure networks [8] and railway networks [9]. These graphs can help people make better data-driven decisions. Therefore, how to embed the relationship between the data into deep clustering becomes a thorny problem. Furthermore, based on these raw graphs, one wishes to mine the latent relationship between the data effectively. As we know, the edges in a graph represent the explicit relationship, which is also regarded as the first-order structural relationship. Many graph embedding methods exploit such relationships, including DeepWalk [10], node2vec [11], and LINE [12]. But the data relationship in the real world is complicated. There still exist many implicit and complicated relationships. For example, two nodes may not be directly connected in a graph, while they have many identical neighbors.
However, it is natural to believe that these two nodes have a high-order structural relationship.

In order to improve the clustering effect of deep clustering methods, utilizing the high-order relationships is necessary. As an important approach in deep learning, the Graph Convolutional Network (GCN) [8], [13], [14] can mine such potential high-order relationships between the data. GCN transfers graph structured data to a low-dimensional, compact, and continuous feature space. While GCN has been applied very successfully in encoding and exploring graph structure and node content, it seems little attention has been given to applying GCNs to deep clustering tasks.

For the purpose of clustering, we can naturally construct an auto-encoder module based on a GCN, the so-called GAE module. GCN makes the signals smoother, which is its inherent advantage. However, such a signal smoothing operation also makes the signals more similar, losing their diversity. It has been proven that GCN is prone to over-smoothing when the number of layers becomes large [15], which results in poor performance in related tasks. So, GCN cannot be stacked as deeply as the CNN model in visual tasks.

To overcome this drawback of GCN, we introduce a common auto-encoder network to supplement the data content information to GAE, like the effect of the residual network. Multiple layers are usually stacked in the deep network, and each layer captures different latent features of the data. To combine the high-order relationship of the data (in GAE) with the potential details of the corresponding content information (in the auto-encoder) layer by layer, we propose a cross-attention fusion mechanism, which highlights the discriminative information for clustering tasks.
Different from the traditional attention mechanism, our cross-attention fusion mechanism fuses two kinds of heterogeneous representations, i.e., the regular data and the irregular graph.

In this paper, we propose a novel clustering framework, named Cross-Attention Fusion based Enhanced Graph Convolutional Network (CaEGCN). In CaEGCN, we extract the high-order relationship between the data through the Graph Convolutional Auto-encoder module (GAE). To alleviate the over-smoothing problem of GAE and supplement the content information to GAE, we build a Content Auto-encoder module (CAE) composed of a common auto-encoder, which extracts the content information of the data. Besides, we propose a cross-attention fusion mechanism to encode the above two modules to output a complete representation. In order to guide the optimal clustering direction of the entire model in an end-to-end manner, we introduce a self-supervised module.

The contributions of this paper are summarized as follows:
• We propose an end-to-end cross-attention fusion based deep clustering framework, in which the cross-attention fusion module creatively concatenates the graph convolutional auto-encoder module and the content auto-encoder module in multiple layers;
• We propose a cross-attention fusion module to assign attention weights to the fused heterogeneous representation;
• In the graph convolutional auto-encoder module, we propose simultaneously reconstructing the content and the relationship between the data, which effectively strengthens the clustering performance of CaEGCN;
• We test CaEGCN on natural language, human behavior and image datasets to prove the robustness of CaEGCN.

The rest of the paper is organized as follows. In Section II, we briefly review the graph convolutional network, deep clustering and the attention mechanism. In Section III, we detail the cross-attention fusion based enhanced graph convolutional network for clustering by presenting its four main modules.
In Section IV, the proposed method is evaluated on clustering problems with several public datasets. Finally, the conclusion and future work are discussed in Section V.

II. RELATED WORK
In this section, we review the necessary background related to this paper: graph convolutional networks, deep clustering and the attention mechanism.

A. Graph Convolutional Network (GCN)
Many research fields involve natural graph structures, such as the traffic road network [16], human skeleton points [17] and molecular structures in biology [18]. A graph is a kind of irregular structural data, which is dispersive and disorderly. To cope with such irregular data, many GCN based methods have been proposed. These methods can be divided into two main categories: spectral-based GCN methods [13]–[15] and spatial-based GCN methods [8], [19]. In a GCN, nodes can be assigned the features of the data, and the edge weight information describes the similarity between nodes, which shows that the graph has a strong information organization ability.

The spectral-based GCN methods exploit the spectral representation of a graph. Kipf et al. [15] initially proposed graph convolutional networks for prediction tasks, simulating the graph convolutional operation through a local first-order approximation of spectral convolutions. Wang et al. [20] introduced the generative adversarial mechanism into the learning of graph representations, and developed a new graph softmax function utilizing the latent structure information of the data.

The spatial-based GCN methods directly define operations on the graph and extract information from spatial neighbor groups. Velickovic et al. [8] proposed a graph attention network, which computes the corresponding hidden information for each node and uses the attention mechanism to weight the importance of each node compared with its neighbors. More comprehensive reviews of GCN can be found in [21].
B. Deep Clustering
Current deep learning research mainly concentrates on supervised learning tasks. How to extend it to a framework for unsupervised clustering is a meaningful problem. Fortunately, some researchers have conducted related work. Xie et al. [2] proposed a deep embedded clustering method, which exploits deep learning to learn the feature representations and the cluster assignments of the data. Ji et al. [3] constructed a self-expression layer between the encoder and decoder of the auto-encoder.

With the development of multi-view clustering, more and more researchers introduce the relationship between the data to enhance the clustering performance. Kipf et al. [22] used a graph convolutional encoder and an inner product decoder to build a Variational Graph Auto-encoder (VGAE), which learns the latent features of undirected graphs for clustering. Pan et al. [23] improved the VGAE framework and introduced an adversarial regularization rule to optimize the learned representation for clustering. Wang et al. [24] employed the graph attention network to weight the importance of neighboring nodes, and obtained a more accurate representation of each node. Li et al. [25] joined the advantages of K-means and spectral clustering, and embedded them into the graph auto-encoder to generate better data representations. Bo et al. [26] transferred the representation learned by the auto-encoder to the corresponding GCN, and proposed a dual self-supervision mechanism to unify these two different deep neural architectures, which is an important baseline in this paper.
Fig. 1. The conceptual framework of CaEGCN, which includes four modules: the content auto-encoder module, the graph convolutional auto-encoder module, the cross-attention fusion module and the self-supervised module. X is the original data, X_hat is the reconstructed data, and A is the original graph. H^(l) and Z^(l) represent the l-th layer outputs of CAE and GAE, respectively. R^(l) is the cross-attention fused representation of H^(l) and Z^(l). L_CAE^content is the content reconstruction loss of CAE. L_GAE^content and L_GAE^graph are the content reconstruction loss and the graph reconstruction loss of GAE, respectively. L_cae and L_gae form the self-supervised module losses.

In the above, GCN-based clustering methods update network parameters by reconstructing the adjacency matrix and sufficiently exploit the structure information, but they ignore the node information and the over-smoothing problem.
C. Attention Mechanism
Recently, in the fields of machine translation [27], [28], semantic segmentation [29] and image generation [30], the attention mechanism has become a staple module that improves the effectiveness of many models. Self-attention is a variant of the traditional attention mechanism. Vaswani et al. [27] proposed the self-attention mechanism in machine translation applications and obtained satisfactory experimental results. Besides, the self-attention mechanism is robust and easily embedded into recurrent neural networks [31], generative adversarial networks [32] and other neural networks, which also achieve excellent experimental results.

Many scholars have continued to optimize the self-attention mechanism. Wang et al. [33] introduced a dependency tree into the self-attention mechanism to represent the relationship between words. Yu et al. [34] extracted local information through a convolution model to complement the global interaction of the self-attention mechanism. Xue et al. [35] used the self-attention mechanism in image segmentation, which better achieves accurate segmentation through long-range context relations.

To simultaneously handle the heterogeneous data in the proposed model, i.e., the regular data and the irregular graph, we propose the cross-attention fusion module in this paper.

III. CROSS-ATTENTION FUSION BASED ENHANCED GRAPH CONVOLUTIONAL NETWORK
In this section, we present a novel cross-attention fusion based enhanced graph convolutional network model, which sufficiently integrates the content information and the relationship between the data in a multi-level adaptive manner to improve the clustering performance.

The overall network architecture is shown in Figure 1, consisting of four main modules: an auto-encoder module for extracting the content information; a GCN based auto-encoder module for exploiting the relationship between the data; a cross-attention module for concatenating the above two modules, where the multi-level adaptive fusion strategy supplements as much effective content information as possible during the transmission process; and a self-supervised module used to constrain the consistency of the distributions of the middle layer representations.
A. Constructing the Graph
Before presenting the proposed CaEGCN model, we first construct the necessary graph of the raw data. Given a set of data X ∈ R^{D×N} containing N samples, with the i-th sample x_i ∈ R^D, we employ the commonly-used K-nearest neighbor (KNN) method to construct the corresponding graph to exhibit its structure information.

For image data, we calculate the similarity between samples using the heat kernel method as in [36],

  S_ij = exp(-||x_i - x_j||^2 / t),  (1)

where t represents the variance scale parameter.

As for natural language data, the inner-product method is chosen to measure the similarity between samples as follows,

  S_ij = x_j^T x_i.  (2)

Then, with the above calculated similarities among all samples, we pick the K highest-correlation neighbors of each sample and connect them; thus, a graph A is obtained. In many applications, the graph information actually comes with the given dataset X.

B. Content Auto-encoder Module (CAE)
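As a concrete illustration of the graph construction in subsection A, the heat-kernel similarity (1) followed by top-K connection can be sketched as follows. This is a minimal NumPy sketch; the function names and the final symmetrization step are our own choices, not fixed by the paper.

```python
import numpy as np

def heat_kernel_similarity(X, t=1.0):
    """Eq. (1): S_ij = exp(-||x_i - x_j||^2 / t), for X with one sample per column (D x N)."""
    diff = X[:, :, None] - X[:, None, :]      # D x N x N pairwise differences
    sq_dist = (diff ** 2).sum(axis=0)         # N x N squared distances
    return np.exp(-sq_dist / t)

def knn_graph(S, K):
    """Connect each sample to its K most similar neighbors (excluding itself)."""
    N = S.shape[0]
    A = np.zeros((N, N))
    for i in range(N):
        order = np.argsort(-S[i])             # most similar first
        neighbors = [j for j in order if j != i][:K]
        A[i, neighbors] = 1
    return np.maximum(A, A.T)                 # symmetrize (an assumption of this sketch)
```

For text data, the inner-product similarity (2) would simply replace `heat_kernel_similarity` with `X.T @ X`.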
As we know, deep convolutional networks can effectively extract critical features from complex data. An auto-encoder can reconstruct the samples and reduce the information lost during the learning procedure, which makes it naturally suitable for unsupervised learning. To extract the content information in the data, we first train a deep convolutional network based auto-encoder module, named the Content Auto-encoder Module (CAE).

We represent the input of the l-th layer as H^(l-1); then its output H^(l) can be obtained by

  H^(l) = a_l(U^(l) H^(l-1) + b^(l)),  l = 1, 2, ..., L,  (3)

where the activation function of the l-th layer, a_l, can be chosen according to the practical application, such as ReLU or Sigmoid. U^(l) and b^(l) denote the weight and bias of the l-th layer of CAE, respectively. In addition, the input of the first layer of CAE is the raw data X, i.e., H^(0) = X. The output of the final layer reconstructs the raw data, i.e., X_hat = H^(L), and the loss function of CAE can be defined as

  L_CAE^content = (1/2) ||X - X_hat||_F^2,  (4)

where ||·||_F denotes the Frobenius norm.

C. Cross-Attention Fusion Module
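The CAE forward pass (3) and loss (4) of the previous subsection can be sketched as below. The layer sizes and the use of ReLU everywhere are illustrative assumptions (a real decoder would normally use a different output activation).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def cae_forward(X, weights, biases):
    """Eq. (3): H^(l) = a_l(U^(l) H^(l-1) + b^(l)), stacked over all layers."""
    H = X
    for U, b in zip(weights, biases):
        H = relu(U @ H + b)
    return H  # H^(L), the reconstruction X_hat

def cae_content_loss(X, X_hat):
    """Eq. (4): 0.5 * ||X - X_hat||_F^2."""
    return 0.5 * np.linalg.norm(X - X_hat, 'fro') ** 2
```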
As shown in Figure 1, CAE extracts the content information in the data, and GAE exploits the corresponding relationship between the data. How to fuse these two kinds of information for clustering tasks is a key problem.

The cross-attention fusion mechanism has a global learning ability and good parallelism, which can further highlight the critical information in the fused representations while suppressing useless noise. Therefore, we use the cross-attention fusion mechanism to integrate the content information learned by CAE and the data relationship learned by GAE in a multi-level adaptive manner, which is the so-called Cross-Attention Fusion Module.

We define the cross-attention fusion mechanism as

  R = F_att(Q, K, V),  (5)

where the query Q = W_q Y, the key K = W_k Y and the value V = W_v Y. The raw fusion representation Y is the input of the cross-attention fusion module, defined as

  Y = γ Z^(l) + (1 - γ) H^(l),  (6)

where H^(l) is the output of the l-th layer in CAE and Z^(l) is the output of the corresponding layer in GAE. γ is a trade-off parameter, fixed in our experiments.

To discover the latent relationship between the data and instantiate the cross-attention fusion mechanism (5), we first calculate the similarity s_ab between the fusion query q_a and the fusion key k_b,

  s_ab = q_a * k_b,  (7)

where q_a and k_b denote the a-th and b-th vectors in Q and K, respectively.

Then, we apply the softmax normalization to s_ab to obtain the relevance weight α_ab as follows,

  α_ab = softmax(s_ab) = exp(s_ab) / Σ_b exp(s_ab).  (8)

Finally, the output of the cross-attention fusion mechanism R = (r_1, r_2, r_3, ..., r_N), i.e., the fused representation of the data content information and the relationship between the data, can be written as

  r_a = Σ_b α_ab v_b.  (9)

To further perceive different aspects of the data, a multi-head mechanism is also introduced, which contains multiple parallel cross-attention fusion modules.
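Before turning to the multi-head variant, note that equations (5)–(9) amount to standard dot-product attention applied to the fused input Y of (6). A minimal sketch follows; it treats samples as rows for convenience (which transposes the paper's notation), and the default γ = 0.5 is an arbitrary choice of this sketch, not the paper's setting.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(H, Z, Wq, Wk, Wv, gamma=0.5):
    """Eqs. (5)-(9): fuse CAE output H and GAE output Z, then attend over samples."""
    Y = gamma * Z + (1.0 - gamma) * H      # Eq. (6): raw fusion representation
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv       # queries, keys, values
    S = Q @ K.T                            # Eq. (7): pairwise similarities s_ab
    alpha = softmax(S, axis=1)             # Eq. (8): relevance weights
    return alpha @ V                       # Eq. (9): fused representation R
```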
Specifically, we repeatedly project the query Q, key K and value V to obtain M parallel cross-attention modules. Each cross-attention fusion module is regarded as one head, and each head has its own weight matrices {W_q^m ∈ R^{N×D_l}, W_k^m ∈ R^{N×D_l}, W_v^m ∈ R^{N×D_l}} to linearly transform the fused features: Q_m = W_q^m Q, K_m = W_k^m K, V_m = W_v^m V, where D_l is the dimensionality of the l-th layer. The m-th head is

  R_m = F_att(Q_m, K_m, V_m),  m = 1, 2, ..., M.  (10)

We concatenate the outputs of all M heads and multiply by the weight matrix W ∈ R^{N×(M×D_l)} to get the final cross-attention fusion representation,

  R = W · Concat(R_1, ..., R_M),  (11)

where Concat(·) denotes the matrix concatenation operation. This is the so-called multi-head mechanism of the cross-attention fusion module.
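The multi-head combination (10)–(11) can be sketched as follows. As before, this sketch uses a row-per-sample, right-multiplication convention (the transpose of the paper's notation), so the output mixing matrix `W_out` has shape (M·D) × D rather than the paper's N × (M × D_l).

```python
import numpy as np

def _attend(Y, Wq, Wk, Wv):
    """One head of Eqs. (5)-(9): dot-product attention on the fused input Y."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    S = Q @ K.T
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)) @ V

def multi_head_fusion(Y, heads, W_out):
    """Eqs. (10)-(11): run M heads with separate projections, then mix the concat."""
    R_heads = [_attend(Y, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(R_heads, axis=1) @ W_out  # N x (M*D) times (M*D) x D
```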
D. Graph Convolutional Auto-Encoder Module (GAE)
As mentioned before, the relationship between the data can effectively improve the clustering performance. Most deep clustering methods only consider the content information of the data, while ignoring the important data relationships [2]. Fortunately, the Graph Convolutional Network (GCN) [19] is able to handle such relationships and the content information of the data collaboratively. To exploit GCN in unsupervised clustering tasks, we propose a GCN based Auto-Encoder module (GAE), which creatively reconstructs both the graph and the content information.

The previous cross-attention fusion module combines the content representation H^(l) from CAE with the relationship representation Z^(l) from GAE to output a fused representation R^(l) in each layer. Then, GAE executes the spectral graph convolution on R^(l) to learn the high-order discriminative information based on the adjacency matrix A. Finally, the middle layer Z_L is used for clustering.

The convolution operation in each GAE layer can be expressed as follows,

  Z^(l) = GAE(R^(l-1), A) = a_l(D_hat^{-1/2} A_hat D_hat^{-1/2} R^(l-1) U^(l)),  (12)

where D_hat^{-1/2} A_hat D_hat^{-1/2} is the approximated graph convolutional filter and D_hat is the degree matrix of A_hat, with D_hat_ii = Σ_j A_hat_ij. With the identity matrix I and the adjacency matrix A, we use A_hat = A + I to ensure a self-loop at each node. Additionally, U^(l) denotes the weight of the l-th layer, and Z^(l) is the output of the l-th GAE layer.

It should be noted that the input of the first layer in GAE is slightly different: it uses the raw data X instead of R as input,

  Z^(1) = GAE(X, A) = a_1(D_hat^{-1/2} A_hat D_hat^{-1/2} X U^(1)).  (13)

After this multi-layer learning, the GAE encoder encodes both the raw relationship A and the content X into a useful representation Z_L. In order to preserve more information, we set the graph reconstruction and content reconstruction errors as the loss functions of GAE.

i) Graph Reconstruction Loss.
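The normalized propagation D_hat^{-1/2} A_hat D_hat^{-1/2} of equations (12)–(13) can be sketched as below; ReLU stands in for the generic activation a_l, an illustrative assumption.

```python
import numpy as np

def gcn_layer(R, A, U):
    """Eq. (12): Z = ReLU(D^{-1/2} (A + I) D^{-1/2} R U), rows of R are samples."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops: A_hat = A + I
    d = A_hat.sum(axis=1)                     # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D_hat^{-1/2}
    P = D_inv_sqrt @ A_hat @ D_inv_sqrt       # symmetric normalized filter
    return np.maximum(P @ R @ U, 0.0)         # a_l = ReLU in this sketch
```

On a regular graph the filter rows sum to one, so propagating a constant signal leaves it unchanged, which is the smoothing behavior discussed in the Introduction.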
We choose a simple inner product operation to reconstruct the relationship between samples as in [22],

  A_tilde = Sigmoid(Z_L^T Z_L),  (14)

where Z_L is the output of the last GAE layer, and A_tilde is the reconstructed adjacency matrix. The graph reconstruction loss can be defined as

  L_GAE^graph = ||A - A_tilde||_F^2.  (15)

By minimizing the error between A and A_tilde, the GAE module can preserve more of the data relationship in the latent representation Z_L to improve the clustering performance.

ii) Content Reconstruction Loss. Besides the relationship between the data, we also constrain the GAE module to preserve enough content information, which differs substantially from formula (4); we therefore define its loss function as

  L_GAE^content = ||X - Z_L||_F^2,  (16)

where Z_L here is the output of the last layer in GAE, which has the same size as the raw data X. In this way, GAE encodes both the relationship and the content of the samples into a discriminative representation for clustering.

E. Self-Supervised Module
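The two GAE reconstruction losses (14)–(16) above can be sketched together as follows, with one sample per column of Z_L as in the paper's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gae_losses(X, A, Z_L):
    """Eqs. (14)-(16): graph and content reconstruction losses of GAE."""
    A_tilde = sigmoid(Z_L.T @ Z_L)                        # Eq. (14): N x N
    graph_loss = np.linalg.norm(A - A_tilde, 'fro') ** 2  # Eq. (15)
    content_loss = np.linalg.norm(X - Z_L, 'fro') ** 2    # Eq. (16)
    return graph_loss, content_loss
```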
It is difficult to judge whether the learned representation Z_L is optimal for clustering during the optimization procedure, so we need an optimization target for clustering. To solve this problem, we first obtain a set of initial cluster centers {β_c}, c = 1, ..., C, by performing K-means on H_L, where C is the number of clusters. These cluster centers approximately guide the optimization direction for Z_L, which is the so-called Self-Supervised Module.

We use the Student's t-distribution [37] to calculate the similarity between the middle layer representation H_L and the cluster centers β_c as follows,

  t_ic = (1 + ||h_i - β_c||^2)^{-1} / Σ_{c'} (1 + ||h_i - β_{c'}||^2)^{-1},  (17)

where h_i is the i-th sample representation in H_L. t_ic measures the probability that the i-th sample is assigned to the c-th cluster, so T = [t_ic] is the overall soft assignment distribution.

Furthermore, the choice of the target distribution directly determines the clustering quality. We believe that the high-confidence assignments in T are reliable and can be used as the target distribution. We square t_ic in p_ic to highlight the role of the high-confidence assignments,

  p_ic = (t_ic^2 / f_c) / Σ_{c'} (t_{ic'}^2 / f_{c'}),  (18)

where f_c = Σ_i t_ic is the soft cluster frequency. The distributions T and P should be close to each other, so

  L_cae = KL(P || T) = Σ_i Σ_c p_ic log(p_ic / t_ic).  (19)

Similarly, it is easy to construct a soft assignment distribution Z for the representation Z_L; then we can use the target distribution P to supervise the distribution Z as

  L_gae = KL(P || Z) = Σ_i Σ_c p_ic log(p_ic / z_ic).  (20)

Now, the optimization goals of GAE and CAE are unified into one distribution P, which makes the learned representation more suitable for clustering tasks.

F. Overall Loss Function
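The self-supervised quantities of equations (17)–(20) in the previous subsection — soft assignment, sharpened target distribution, and KL losses — can be sketched as:

```python
import numpy as np

def soft_assignment(H, centers):
    """Eq. (17): Student's t similarity between rows of H and the cluster centers."""
    sq = ((H[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # N x C distances
    num = 1.0 / (1.0 + sq)
    return num / num.sum(axis=1, keepdims=True)

def target_distribution(T):
    """Eq. (18): sharpen high-confidence assignments, p_ic proportional to t_ic^2 / f_c."""
    weight = T ** 2 / T.sum(axis=0, keepdims=True)   # f_c = column sums of T
    return weight / weight.sum(axis=1, keepdims=True)

def kl_divergence(P, T):
    """Eqs. (19)-(20): KL(P || T) summed over samples and clusters."""
    return float((P * np.log(P / T)).sum())
```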
The overall objective loss function of the Cross-Attention Fusion based Enhanced Graph Convolutional Network (CaEGCN) can be summarized as

  L_overall = L_GAE^graph + L_GAE^content + L_CAE^content + L_cae + L_gae.  (21)

There are five terms in the above objective function, including three reconstruction losses and two self-supervision losses, which optimize the data representations for clustering tasks from different perspectives.

After optimizing the above objective function, the locally optimal representation Z_L is obtained. Then, we perform the softmax operation on Z_L to get the final clustering results, i.e., max(softmax(Z_L)).

IV. EXPERIMENTS
In this section, CaEGCN is evaluated on various types of public datasets, including natural language, human behavior and image datasets. We present the experimental settings and analysis below.
A. Datasets
• ACM [26] contains 3,025 papers from three major categories (i.e., database, wireless communication and data mining). The keywords of each paper are chosen as its features. Different papers by the same author should have a relatively strong correlation, so we can construct the structure graph for GCN.
• DBLP [38] is an author network dataset collected from the DBLP website, which includes 4,057 authors from four categories. The research fields of each author are treated as the features.
• Citeseer is a citation network dataset composed of paper features and citation connections between papers. This dataset has 3,327 papers in six categories.
• HHAR [39] consists of 10,299 sensor records collected from smart phones and smart watches, divided into six categories: biking, sitting, standing, walking, stair up and stair down.
• USPS [40] contains 9,298 gray images of hand-written digits, and the size of each image is 16 × 16.

B. Compared Methods
To verify the effectiveness of CaEGCN, we compare it with several state-of-the-art clustering methods:
• K-means [40] is a basic clustering algorithm based on the content of the data only.
• Auto-Encoder (AE) [1] performs K-means on the low-dimensional representations learned by a deep auto-encoder network.
• Improved Deep Embedded Clustering (IDEC) [4] adds a clustering-oriented loss and a reconstruction loss to the deep auto-encoder network, which realizes one-step clustering of low-dimensional representations.
• Variational Graph Auto-Encoder (VGAE) [22] is a variational graph auto-encoder with both topology and content information, which introduces the GCN architecture and the graph reconstruction loss to build a graph convolutional auto-encoder network.
• Adversarially Regularized Graph Auto-encoder (ARGA) [23] is a GAN-architecture deep clustering model. It first constructs a graph convolutional auto-encoder network; then the adversarial training principle is applied to enforce the latent codes to match a prior Gaussian or uniform distribution.
• Deep Attentional Embedded Graph Clustering (DAEGC) [24] uses a graph attention network to build the encoder and trains an inner-product decoder to reconstruct the graph structure. In addition, soft labels are generated according to the graph embedding to supervise the self-training graph clustering process.
• Structural Deep Clustering Network (SDCN) [26] uses the structure information learned by the GCN module to strengthen the data representation learned by the auto-encoder. Furthermore, it constructs a dual self-supervised loss to combine the two networks and supervise clustering, which is an important baseline.

The Citeseer dataset is available at https://csxstatic.ist.psu.edu/downloads/data.

The parameter settings of the compared methods are listed below. For K-means, we run it repeatedly and report the best result. For AE and IDEC, the network dimensions for each dataset follow the work in [26].
VGAE is a two-layer network; the dimensions of its encoder follow [22]. For ARGA, the dimensions of the encoder and the discriminator follow the work in [23]. For DAEGC, the dimension of the graph attention encoder follows [24]. For SDCN, the dimensions of the encoder and GCN module follow [26].

To evaluate all methods from multiple aspects, we choose four popular clustering evaluation metrics: Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI) and macro F1-score (F1). For all metrics, a higher score indicates better clustering quality.

C. Parameter Settings
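Of the four metrics listed above, ACC is the least standard to compute: predicted cluster labels must be mapped to ground-truth classes by the best one-to-one matching before counting hits. A brute-force sketch for small numbers of clusters follows (real evaluations use the Hungarian algorithm instead, since this version is exponential in the number of clusters); the function name is our own.

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """ACC: best accuracy over all one-to-one relabelings of the predicted clusters."""
    labels = sorted(set(y_pred))
    best = 0.0
    for perm in permutations(sorted(set(y_true))):
        mapping = dict(zip(labels, perm))          # predicted label -> true class
        hits = sum(mapping[p] == t for p, t in zip(y_pred, y_true))
        best = max(best, hits / len(y_true))
    return best
```

NMI and ARI are available in standard libraries (e.g., scikit-learn's `normalized_mutual_info_score` and `adjusted_rand_score`).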
When a dataset does not come with graph information, we construct the initial graph of the data with the popular K-Nearest-Neighbor (KNN) algorithm, where the value of K is positively correlated with the numbers of samples and categories. Following the strategy in SDCN [26], we tune K to get the best performance. Generally, USPS and HHAR employ K = 10 and K = 5 to construct the corresponding graphs, respectively. As for ACM, DBLP and Citeseer, we directly exploit the existing graphs in the datasets.

In the proposed CaEGCN, we set the dimensions of both the CAE and GAE modules as a symmetric structure input − ... − cluster − ... − output, where input and output denote the dimension of the raw data, and cluster represents the number of cluster categories. The purpose of the last layer in the decoder is to reconstruct the raw data, so the dimension of the last layer equals that of the first layer, i.e., output = input.

In the cross-attention fusion module, the number of heads is fixed. In the self-supervised module, K-means is run for a fixed number of iterations to initialize the cluster centers. Finally, we employ the Xavier method to initialize our model parameters [41], with a small initial learning rate.
TABLE I
CLUSTERING RESULTS ON ALL FIVE DATASETS. WE MARK THE BEST-PERFORMING AND THE SECOND-BEST-PERFORMING RESULTS BY BOLD AND UNDERLINED.

Dataset   Metric  K-means  AE      IDEC    VGAE    ARGA    DAEGC   SDCN    CaEGCN
ACM       ACC     0.6820   0.8278  0.8645  0.8294  0.8327  0.8694  0.8860  -
          NMI     0.3263   0.5020  0.5824  0.5285  0.5039  0.5618  0.6326  -
          ARI     0.3119   0.5553  0.6421  0.5618  0.5646  0.5935  0.6931  -
          F1      0.6846   0.8295  0.8632  0.8286  0.8335  0.8707  0.8857  -
DBLP      ACC     0.3646   0.5435  0.6571  0.5763  0.5450  0.6205  0.6613  -
          NMI     0.0886   0.2220  0.3080  0.2189  0.2019  0.3249  0.3249  -
          ARI     0.0657   0.1651  0.3210  0.2348  0.1949  0.2103  0.3338  -
          F1      0.2637   0.5325  0.6439  0.5456  0.5343  0.6175  0.6556  -
Citeseer  ACC     0.3384   0.5909  0.6023  0.5161  0.5912  0.6454  -       -
          NMI     0.1502   0.3066  0.3074  0.2572  0.3069  0.3641  -       -
          ARI     0.0893   0.3134  0.2924  0.2405  0.3138  0.3778  -       -
          F1      0.2246   0.5483  0.5230  0.4184  0.5485  -       -       -
HHAR      ACC     -        -       -       -       -       -       -       -
          NMI     -        -       -       -       -       -       -       -
          ARI     -        -       -       -       -       -       -       -
          F1      -        -       -       -       -       -       -       -
USPS      ACC     0.6682   0.4402  0.7684  0.6381  0.7196  0.7355  0.7722  -
          NMI     0.6272   0.4850  0.7795  0.7004  0.6859  0.7112  0.7907  -
          ARI     0.5464   0.3082  0.7011  0.5636  0.6081  0.6333  -       -
          F1      0.6494   0.3665  0.7565  0.5861  0.7093  0.7245  -       -

D. Experiment Results Analysis
Table I exhibits the full experimental results compared with the other clustering methods. Obviously, the proposed CaEGCN model achieves the best performance in most cases.

We can see that the content-information based deep clustering methods (such as AE and IDEC) work better than the graph convolutional auto-encoder method VGAE. The reason is that VGAE suffers from the over-smoothing problem; in other words, the information received by the nodes has a low signal-to-noise ratio. SDCN and the proposed CaEGCN supplement the content information into the GCN module in each layer, which effectively relieves the over-smoothing problem, so SDCN and CaEGCN achieve satisfactory performance.

The experimental results show that SDCN and CaEGCN are superior to the other methods across the datasets. Compared with directly using GCN, supplementing the content information into the structure representation layer by layer helps the clustering work better, and it also illustrates the significance of the interaction between the heterogeneous information. In addition, the proposed CaEGCN performs better than SDCN in most cases, which proves that the cross-attention fusion module in CaEGCN promotes a learned data representation containing more prominent information for clustering tasks.

For the academic paper datasets, various factors interfere with the clustering or recognition tasks, such as cross-domain applications of popular algorithms, different research topics in the same field, different research fields of the same author, and so on. On the ACM dataset, CaEGCN achieves significant improvements in all four evaluation metrics, with clear gains over SDCN and much larger gains over K-means in accuracy, NMI, ARI and F1 score.
We also note that VGAE achieves good clustering results by simultaneously considering the graph topology and the node content. ARGA uses adversarial training to optimize a graph-convolution-based method and achieves some improvement. DAEGC uses a graph attention module and also achieves better experimental results. The huge gap between CaEGCN and SDCN (and the others) further proves the superiority of CaEGCN. Similar to ACM, DBLP and Citeseer are datasets related to academic papers, and their experimental results show the same pattern and trend.

Large-scale datasets are an important challenge for clustering methods: when the number of samples increases, the performance of many state-of-the-art clustering methods drops dramatically. The HHAR and USPS datasets are substantially larger than the previous datasets.

Fig. 2. 2D visualization: the comparison of the raw data and the clustering results of CaEGCN on the ACM, DBLP, Citeseer, HHAR and USPS datasets.

For the HHAR dataset, it is difficult to distinguish some human daily behaviors, e.g., walking and biking, which imposes a challenge for clustering tasks. The accuracy of CaEGCN increases compared with VGAE, which integrates the structural features into the content information, and it likewise increases compared with ARGA and with DAEGC. It is our belief that the poor performance of these methods is due to the over-smoothing problem of GCN. The experimental results prove the effectiveness of our cross-attention module. Compared with the best baseline, SDCN, the accuracy of CaEGCN still increases.
For the USPS dataset, the background is simple, so we regard it as baseline data to test the robustness of CaEGCN. The experimental results of CaEGCN, SDCN and IDEC are similar. This may be because the content information of some handwritten digit images is difficult to distinguish. We emphasize that, for efficiency, each node in the initial graph we construct connects only to its closest nodes, so the initial graph fails to capture enough of the valuable relationships between the data. Such a limited graph restricts the learning ability of the convolutional network.

In summary, the performance improvements of CaEGCN can be attributed to two aspects: first, the cross-attention fusion mechanism integrates the content information and the data relationships, and highlights the critical information in the fused representations; second, the self-supervised module further optimizes the distributions of the middle-layer representations to strengthen the performance of deep clustering.

E. Clustering Result Visualization
We visualize the clustering results of the five datasets in a two-dimensional space with the t-SNE algorithm [37]. The locations of the raw data overlap heavily, while CaEGCN clearly drives the raw data into distinct groups.
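As a minimal, dependency-free sketch of this kind of 2-D inspection (using a plain PCA projection as a stand-in for the t-SNE embedding actually used in the paper; the data, shapes, and function name here are illustrative, not from the paper):

```python
import numpy as np

def project_2d(X):
    """Project rows of X onto their top-2 principal components.
    (A simple PCA stand-in for the t-SNE embedding used in Fig. 2.)"""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
# Two synthetic, well-separated "clusters" in 10-D stand in for the learned
# middle-layer representations of two classes (illustrative only).
a = rng.normal(0.0, 0.3, size=(50, 10))
b = rng.normal(3.0, 0.3, size=(50, 10))
Y = project_2d(np.vstack([a, b]))

# If the representation separates the classes, the 2-D projection keeps
# the two groups far apart along the first principal axis.
gap = abs(Y[:50, 0].mean() - Y[50:, 0].mean())
print(Y.shape, gap > 5.0)
```

In Fig. 2 the same idea is applied per dataset: the raw features and the CaEGCN representations are each embedded into two dimensions and colored by cluster assignment, making the grouping effect visible at a glance.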
F. Ablation Experiment Analysis
To prove the effectiveness of each critical module in our model, we design a set of ablation experiments. Specifically, we repeatedly remove one module from the CaEGCN model and test the clustering performance of these incomplete models on the five datasets. The designed incomplete models are as follows:

• CaEGCN w/o attention: the proposed CaEGCN without the cross-attention fusion module.
• CaEGCN w/o graph: the proposed CaEGCN without the graph reconstruction loss L_GAE^graph in the GAE module,
L_overall = L_GAE^content + L_CAE^content + L_cae + L_gae. (22)
• CaEGCN w/o content: the proposed CaEGCN without the content reconstruction loss L_GAE^content in the GAE module,
L_overall = L_GAE^graph + L_CAE^content + L_cae + L_gae. (23)

TABLE II
THE RESULTS OF ABLATION EXPERIMENTS ON ALL FIVE DATASETS. WE MARK THE BEST-PERFORMING RESULT IN BOLD.

From Table II, we observe that the full CaEGCN still achieves the best results on all datasets. The clustering results of the three incomplete models all decline, which verifies the importance of each critical module.

Among them, the experimental results of CaEGCN w/o attention drop sharply, which proves that the lack of fusion between the content and the data relationships decreases the learning ability of the GAE module.

Without the graph reconstruction loss L_GAE^graph, the clustering performance of CaEGCN w/o graph still decreases noticeably. This reconstruction loss ensures that the learned middle-layer representations carry abundant structural information, which improves the clustering performance. Meanwhile, the results of CaEGCN w/o graph on the datasets with an original graph (e.g., ACM, DBLP) decrease significantly, which reflects that, given an accurate graph structure, the graph reconstruction loss can effectively improve the quality of the data representations.

As for CaEGCN w/o content, its experimental results are not bad. Compared with the other two modules, the impact of the content reconstruction loss L_GAE^content is relatively small. However, we believe the content reconstruction loss is still indispensable. The CAE module acts like a residual network, supplementing high-quality content information to the GAE module layer by layer; the content reconstruction loss then ensures that the middle-layer representation learned by the GAE module retains more content information from the raw data.

These three ablation experiments show that each module improves the clustering performance from a different aspect and is meaningful.

V. CONCLUSION
We propose a cross-attention fusion based enhanced graph convolutional network for clustering (CaEGCN), which connects the CAE and GAE modules layer by layer through the cross-attention fusion module and strengthens the essential information. The fused representation is used as the input of the GAE module. The novel graph reconstruction loss and content reconstruction loss in the GAE module further ensure that the middle-layer representation is appropriate for clustering. Finally, we build the self-supervised module to train the entire model end-to-end. The excellent experimental results on various datasets prove the superiority of the proposed method.

ACKNOWLEDGEMENTS
The research project is partially supported by the National Natural Science Foundation of China under Grants U19B2039, 61906011, 61632006, 61772048, 61672071, U1811463 and 61806014, Beijing Natural Science Foundation No. 4204086, Beijing Municipal Science and Technology Projects KM202010005014 and KM201910005028, Beijing Talents Project (2017A24), and the Beijing Outstanding Young Scientists Project (BJJWZYJH01201910005018).

REFERENCES
[1] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[2] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in International Conference on Machine Learning, 2016.
[3] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. D. Reid, "Deep subspace clustering networks," in Neural Information Processing Systems, 2017.
[4] X. Guo, L. Gao, X. Liu, and J. Yin, "Improved deep embedded clustering with local structure preservation," in International Joint Conference on Artificial Intelligence, 2017.
[5] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.
[6] S. S. Tabatabaei, M. Coates, and M. Rabbat, "GANC: Greedy agglomerative normalized cut for graph clustering," Pattern Recognition, vol. 45, no. 2, pp. 831–843, 2012.
[7] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, 2002.
[8] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in International Conference on Learning Representations, 2018.
[9] J. Zhang, F. Chen, Y. Guo, and X. Li, "Multi-graph convolutional network for short-term passenger flow forecasting in urban rail transit," IET Intelligent Transport Systems, vol. 14, no. 10, pp. 1210–1217, 2020.
[10] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.
[11] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[12] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "LINE: Large-scale information network embedding," in International Conference on World Wide Web, 2015, pp. 1067–1077.
[13] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Neural Information Processing Systems, 2016.
[14] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in International Conference on Learning Representations, 2014.
[15] T. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations, 2017.
[16] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, and H. Li, "T-GCN: A temporal graph convolutional network for traffic prediction," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 9, pp. 3848–3858, 2020.
[17] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in AAAI Conference on Artificial Intelligence, 2018, pp. 7444–7452.
[18] S. Sanyal, I. Anishchenko, A. Dagar, D. Baker, and P. Talukdar, "ProteinGCN: Protein model quality assessment using graph convolutional networks," bioRxiv, 2020.
[19] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Neural Information Processing Systems, 2017.
[20] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xing, and M. Guo, "GraphGAN: Graph representation learning with generative adversarial nets," in AAAI Conference on Artificial Intelligence, 2018.
[21] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[22] T. Kipf and M. Welling, "Variational graph auto-encoders," NIPS Workshop on Bayesian Deep Learning, 2016.
[23] S. Pan, R. Hu, S.-F. Fung, G. Long, J. Jiang, and C. Zhang, "Learning graph embedding with adversarial training methods," IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2475–2487, 2020.
[24] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang, "Attributed graph clustering: A deep attentional embedding approach," in International Joint Conference on Artificial Intelligence, 2019, pp. 3670–3676.
[25] X. Li, H. Zhang, and R. Zhang, "Embedding graph auto-encoder with joint clustering via adjacency sharing," arXiv:2002.08643, 2020.
[26] D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, "Structural deep clustering network," in The Web Conference, 2020.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Neural Information Processing Systems, 2017.
[28] G. Tang, M. Müller, A. R. Gonzales, and R. Sennrich, "Why self-attention? A targeted evaluation of neural machine translation architectures," in Empirical Methods in Natural Language Processing, 2018.
[29] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, "Dual attention network for scene segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[30] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015.
[31] J. Cheng, L. Dong, and M. Lapata, "Long short-term memory-networks for machine reading," in Conference on Empirical Methods in Natural Language Processing, 2016.
[32] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," in International Conference on Machine Learning, 2019.
[33] X. Wang, Z. Tu, L. Wang, and S. Shi, "Self-attention with structural position representations," in Conference on Empirical Methods in Natural Language Processing, 2019.
[34] A. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. Le, "QANet: Combining local convolution with global self-attention for reading comprehension," in International Conference on Learning Representations, 2018.
[35] H. Xue, C. Liu, F. Wan, J. Jiao, X. Ji, and Q. Ye, "DANet: Divergent activation for weakly supervised object localization," in International Conference on Computer Vision, 2019.
[36] A. Grigor'yan, Heat Kernel and Analysis on Manifolds. American Mathematical Society / International Press, 2012.
[37] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.
[38] M. Ley, "DBLP: Some lessons learned," Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1493–1500, 2009.
[39] A. Stisen, H. Blunck, S. Bhattacharya, T. Prentow, M. Kjærgaard, A. Dey, T. Sonne, and M. Jensen, "Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition," in ACM Conference on Embedded Networked Sensor Systems, 2015.
[40] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[41] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, 2010.
Guangyu Huo received the B.Sc. degree in IoT Engineering and the M.S. degree in Computer Science from Beijing University of Technology, China, in 2016 and 2019, respectively, where he is currently working toward the Ph.D. degree in Control Science and Engineering. His current research interests include intelligent transportation, computer vision, pattern recognition, and deep learning.

Yong Zhang (M'12) received the Ph.D. degree in computer science from Beijing University of Technology (BJUT) in 2010. He is currently an Associate Professor in computer science at BJUT. His research interests include intelligent transportation systems, big data analysis and visualization, and computer graphics.

Junbin Gao graduated from Huazhong University of Science and Technology (HUST), China, in 1982 with a B.Sc. in Computational Mathematics and obtained his Ph.D. from Dalian University of Technology, China, in 1991. He is Professor of Big Data Analytics in the University of Sydney Business School at the University of Sydney and was a Professor in Computer Science in the School of Computing and Mathematics at Charles Sturt University, Australia. He was a senior lecturer and a lecturer in Computer Science from 2001 to 2005 at the University of New England, Australia. From 1982 to 2001 he was an associate lecturer, lecturer, associate professor, and professor in the Department of Mathematics at HUST. His main research interests include machine learning, data analytics, Bayesian learning and inference, and image analysis.

Boyue Wang received the B.Sc. degree in Computer Science from Hebei University of Technology, China, in 2012 and obtained his Ph.D. from Beijing University of Technology, China, in 2018. He is a postdoctoral researcher at the Beijing Municipal Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Beijing. His current research interests include computer vision, pattern recognition, manifold learning, and kernel methods.

Yongli Hu received his Ph.D. degree from Beijing University of Technology in 2005. He is a professor in the Faculty of Information Technology at Beijing University of Technology and a researcher at the Beijing Municipal Key Laboratory of Multimedia and Intelligent Software Technology. His research interests include computer graphics, pattern recognition, and multimedia technology.