Deep Spectral Clustering using Dual Autoencoder Network
Xu Yang, Cheng Deng∗, Feng Zheng, Junchi Yan, Wei Liu∗

School of Electronic Engineering, Xidian University, Xi'an 710071, China
Department of Computer Science and Engineering, Southern University of Science and Technology
Department of CSE, and MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Tencent AI Lab, Shenzhen, China

{xuyang.xd, chdeng.xd}@gmail.com, [email protected], [email protected], [email protected]

Abstract
Clustering methods have recently attracted ever-increasing attention in the learning and vision communities. Deep clustering combines embedding and clustering to obtain an optimal embedding subspace for clustering, which can be more effective than conventional clustering methods. In this paper, we propose a joint learning framework for discriminative embedding and spectral clustering. We first devise a dual autoencoder network, which enforces the reconstruction constraint for the latent representations and their noisy versions, to embed the inputs into a latent space for clustering. As such, the learned latent representations can be more robust to noise. Then, mutual information estimation is utilized to extract more discriminative information from the inputs. Furthermore, a deep spectral clustering method is applied to embed the latent representations into the eigenspace and subsequently cluster them, which can fully exploit the relationships between inputs to achieve optimal clustering results. Experimental results on benchmark datasets show that our method can significantly outperform state-of-the-art clustering approaches.
1. Introduction
As an important task in unsupervised learning [43, 8, 22, 24] and the vision community [48], clustering [14] has been widely used in image segmentation [37], image categorization [45, 47], and digital media analysis [1]. The goal of clustering is to find a partition that keeps similar data points in the same cluster and dissimilar ones in different clusters. In recent years, many clustering methods have been proposed, such as K-means clustering [28], spectral clustering [31, 46, 17], and non-negative matrix factorization clustering [41], among which K-means and spectral

∗ Corresponding author.
Figure 1. Visualizing the discriminative embedding capability on MNIST-test with the t-SNE algorithm. (a): the space of raw data; (b): data points in the latent subspace of a convolutional autoencoder; (c): data points in the latent subspace of the proposed autoencoder network. Our method provides a more discriminative embedding subspace.

clustering are two well-known conventional algorithms that are applicable to a wide range of tasks. However, these shallow clustering methods depend on low-level features of the inputs, such as raw pixels, SIFT [32], or HOG [7]. Their distance metrics are only exploited to describe local relationships in the data space, and have limitations in representing the latent dependencies among the inputs [3].

This paper presents a novel deep learning based unsupervised clustering approach. Deep clustering, which integrates the embedding and clustering processes to obtain an optimal embedding subspace for clustering, can be more effective than shallow clustering methods. The main reason is that deep clustering methods can effectively model the distribution of the inputs and capture their non-linear properties, making them more suitable for real-world clustering scenarios.

Recently, many clustering methods have been promoted by deep generative approaches, such as the autoencoder network [29]. The popularity of the autoencoder network lies in its powerful ability to capture the high-dimensional probability distribution of the inputs without supervised information. The encoder model projects the inputs into the latent space, and adopts an explicit approximation of maximum likelihood to estimate the distribution diversity between the latent representations and the inputs. Simultaneously, the decoder model reconstructs the inputs from the latent representations to ensure that the output maintains all of the details of the inputs [38]. Almost all existing deep clustering methods endeavor to minimize the reconstruction loss.
The hope is to make the latent representations more discriminative, which directly determines the clustering quality. However, the discriminative ability of the latent representations in fact has no substantial connection with the reconstruction loss, causing a performance gap that this paper aims to bridge.

We propose a novel dual autoencoder network for deep spectral clustering. First, a dual autoencoder, which enforces the reconstruction constraint for the latent representations and their noisy versions, is utilized to establish the relationships between the inputs and their latent representations. Such a mechanism makes the latent representations more robust. In addition, we adopt mutual information estimation to preserve as much discriminative information from the inputs as possible. In this way, the decoder can be viewed as a discriminator that determines whether the latent representations are discriminative. Fig. 1 demonstrates the performance of our proposed autoencoder network by comparing different data representations of MNIST-test data points. Our method clearly provides a more discriminative embedding subspace than the convolutional autoencoder network. Furthermore, deep spectral clustering is harnessed to embed the latent representations into the eigenspace, followed by clustering. This procedure can exploit the relationships between the data points effectively and obtain optimal results. The proposed dual autoencoder network and deep spectral clustering network are jointly optimized.

The main contributions of this paper are three-fold:

• We propose a novel dual autoencoder network for generating discriminative and robust latent representations, which is trained with mutual information estimation and different reconstruction results.
• We present a joint learning framework that embeds the inputs into a discriminative latent space with a dual autoencoder and simultaneously assigns them to the ideal distribution with a deep spectral clustering model.

• Empirical experiments demonstrate that our method outperforms state-of-the-art methods, including both traditional and deep network-based models, on five benchmark datasets.
2. Related Work
Recently, a number of deep learning-based clustering methods have been proposed. Deep Embedded Clustering (DEC) [40] adopts a fully connected stacked autoencoder network to learn the latent representations by minimizing the reconstruction loss in a pre-training phase. The objective function applied in the clustering phase is the Kullback-Leibler (KL) divergence between the soft clustering assignments, modelled by a t-distribution, and a target distribution. Alternatively, a K-means loss can be adopted in the clustering phase to train a fully connected autoencoder network [42], which is a joint approach to dimensionality reduction and K-means clustering. In addition, the Gaussian Mixture Variational Autoencoder (GMVAE) [9] shows that a minimum information constraint can be utilized to mitigate the effect of over-regularization in VAEs, and provides unsupervised clustering within the VAE framework by considering a Gaussian mixture as the prior distribution. Discriminatively Boosted Clustering [23], a fully convolutional network with layer-wise batch normalization, adopts the same objective function as DEC and uses a boosting factor when training a stacked autoencoder.

Shah and Koltun [34] jointly solve the tasks of clustering and dimensionality reduction by efficiently optimizing a continuous global objective based on robust statistics, which allows heavily mixed clusters to be untangled. Following this method, a deep continuous clustering approach is suggested in [35], where the autoencoder parameters and a set of representatives defined for each data point are simultaneously optimized. The convex clustering approach proposed by [6] optimizes the representatives by minimizing the distances between each representative and its associated data point. Non-convex objectives are involved to penalize the pairwise distances between the representatives.

Furthermore, to improve clustering performance, some methods combine convolutional layers with fully connected layers.
Joint Unsupervised Learning (JULE) [44] jointly optimizes a convolutional neural network with the clustering parameters in a recurrent manner using an agglomerative clustering approach, where image clustering is conducted in the forward pass and representation learning is performed in the backward pass. Dizaji et al. [10] propose DEPICT, a method that trains a convolutional autoencoder with a softmax layer stacked on top of the encoder. The softmax entries represent the assignment of each data point to one cluster. VaDE [18] is a variational autoencoder method for deep embedding, and combines a Gaussian mixture model for clustering. In [16], a deep autoencoder is trained to minimize a reconstruction loss together with a self-expressive layer. This objective encourages a sparse representation of the original data. Zhou et al. [50] present a deep adversarial subspace clustering (DASC) method to learn more favorable representations and to supervise sample representation learning by adversarial deep learning [21]. However, the results of reconstruction through low-dimensional representations are often very blurry. One possible remedy is to train a discriminator with adversarial learning, but this can further increase the difficulty of training. Comparatively, our method introduces a relative reconstruction loss and mutual information estimation to obtain more discriminative representations, and jointly optimizes the autoencoder network and the deep spectral clustering network for optimal clustering.

Figure 2. Illustration of the overall architecture. We first pre-train a dual autoencoder to embed the inputs into a latent space, and reconstruction results are obtained from the latent representations and their noisy versions based on the noisy-transformer ψ. The mutual information, calculated with negative sampling estimation, is utilized to learn the discriminative information from the inputs. Then, we assign the latent representations to the ideal clusters by a deep spectral clustering model, and jointly optimize the dual autoencoder and spectral clustering network simultaneously.
3. Methodology
As aforementioned, our framework consists of two main components: a dual autoencoder and a deep spectral clustering network. The dual autoencoder, which reconstructs the inputs from the latent representations and their noisy versions, is introduced to make the latent representations more robust. In addition, mutual information estimation between the inputs and the latent representations is applied to preserve the input information as much as possible. We then utilize the deep spectral clustering network to embed the latent representations into the eigenspace, where clustering is subsequently performed. The two networks are merged into a unified framework and jointly optimized with KL divergence. The framework is shown in Fig. 2.

Let $X = \{x_1, ..., x_n\}$ denote the input samples and $Z = \{z_1, ..., z_n\}$ their corresponding latent representations, where $z_i = f(x_i; \theta_e) \in \mathbb{R}^d$ is learned by the encoder $E$. The parameters of the encoder are denoted by $\theta_e$, and $d$ is the feature dimension. $\tilde{x}_{z_i} = g(z_i; \theta_d)$ represents the reconstructed data point, which is the output of the decoder $D$, whose parameters are denoted by $\theta_d$. We adopt a deep spectral clustering network $C$ to map $z_i$ to $y_i = c(z_i; \theta_c) \in \mathbb{R}^K$, where $K$ is the number of clusters.

We first train the dual autoencoder network to embed the inputs into a latent space. In addition to the original reconstruction loss, we add a noise-disturbed reconstruction loss to learn the decoder network. We also introduce the maximization of mutual information [13] into the learning procedure of the encoder network, so that the network can obtain more robust representations.
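As a toy illustration of this setup, the following minimal numpy sketch wires up stand-ins for the encoder $f$, the decoder $g$, the noisy-transformer $\psi$, and the clustering network $c$. The linear maps, the 28 × 28 input size, the noise level, and the loss weighting are all illustrative assumptions, not the paper's convolutional architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, d, K = 784, 120, 10                     # input size, latent dim, clusters (assumed)

W_e = rng.normal(scale=0.05, size=(n_pix, d))  # encoder f: x -> z
W_d = rng.normal(scale=0.05, size=(d, n_pix))  # decoder g: z -> x~
W_c = rng.normal(scale=0.05, size=(d, K))      # clustering network c: z -> y

f = lambda x: x @ W_e
g = lambda z: z @ W_d

def c(z):                                      # softmax scores over K clusters
    s = z @ W_c
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def psi(z, sigma=0.1):                         # noisy-transformer: multiplicative noise
    return z * (1.0 + sigma * rng.normal(size=z.shape))

x = rng.normal(size=(8, n_pix))                # a minibatch of flattened inputs
z = f(x)
x_rec, x_rec_noisy = g(z), g(psi(z))           # the two reconstructions of the dual AE
y = c(z)

# dual reconstruction objective: relative term plus original term (delta assumed)
delta = 0.5
L_r = np.linalg.norm(x_rec_noisy - x_rec) + delta * np.linalg.norm(x - x_rec)
print(z.shape, y.shape)                        # (8, 120) (8, 10)
```

Training then alternates gradient updates of $\theta_e$ and $\theta_d$ against this reconstruction objective and the mutual information terms derived next.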
Encoder:
Feature extraction is a major step in clustering, and good features can effectively improve clustering performance. However, a single reconstruction loss cannot well guarantee the quality of the latent representations. We hope that the representations help us identify each sample from the inputs, i.e., that they carry the most unique information extracted from the inputs. Mutual information measures the essential correlation between two variables and can effectively estimate the similarity between the features $Z$ and the inputs $X$. Mutual information is defined as:

$$I(X,Z) = \iint p(z|x)\,p(x)\log\frac{p(z|x)}{p(z)}\,dx\,dz = KL\big(p(z|x)p(x)\,\|\,p(z)p(x)\big), \qquad (1)$$

where $p(x)$ is the distribution of the inputs, $p(z|x)$ is the distribution of the latent representations, and the distribution of the latent space $p(z)$ can be calculated by $p(z) = \int p(z|x)p(x)\,dx$. The mutual information is expected to be as large as possible when training the encoder network, hence we have:

$$p(z|x) = \max_{\theta_e} I(X,Z). \qquad (2)$$

In addition, the learned latent representations are required to obey the prior of a standard normal distribution under KL divergence, which helps make the latent space more regular. The distribution difference between $p(z)$ and its prior $q(z)$ is defined as:

$$KL\big(p(z)\,\|\,q(z)\big) = \int p(z)\log\frac{p(z)}{q(z)}\,dz. \qquad (3)$$

According to Eqs. (2) and (3), we have:

$$p(z|x) = \min_{\theta_e}\Big\{-\iint p(z|x)p(x)\log\frac{p(z|x)}{p(z)}\,dx\,dz + \alpha\int p(z)\log\frac{p(z)}{q(z)}\,dz\Big\}. \qquad (4)$$

This can be further rewritten as:

$$p(z|x) = \min_{\theta_e}\Big\{\iint p(z|x)p(x)\Big[-(\alpha+1)\log\frac{p(z|x)}{p(z)} + \alpha\log\frac{p(z|x)}{q(z)}\Big]dx\,dz\Big\}. \qquad (5)$$

According to Eq. (1), Eq. (5) can be viewed as:

$$p(z|x) = \min_{\theta_e}\big\{-\beta I(X,Z) + \gamma\,\mathbb{E}_{x\sim p(x)}\big[KL(p(z|x)\,\|\,q(z))\big]\big\}. \qquad (6)$$

Unfortunately, KL divergence is unbounded. Instead of KL divergence, JS divergence is therefore adopted for mutual information maximization:

$$p(z|x) = \min_{\theta_e}\big\{-\beta\,JS\big(p(z|x)p(x),\,p(z)p(x)\big) + \gamma\,\mathbb{E}_{x\sim p(x)}\big[KL(p(z|x)\,\|\,q(z))\big]\big\}. \qquad (7)$$

The variational estimation of JS divergence [33] is defined as:

$$JS\big(p(x)\,\|\,q(x)\big) = \max_{T}\big(\mathbb{E}_{x\sim p(x)}[\log\sigma(T(x))] + \mathbb{E}_{x\sim q(x)}[\log(1-\sigma(T(x)))]\big), \qquad (8)$$

where $T(x) = \log\frac{p(x)}{p(x)+q(x)}$ [33]. Here $p(z|x)p(x)$ and $p(z)p(x)$ are utilized to replace $p(x)$ and $q(x)$. As a result, Eq. (7) can be written as:

$$p(z|x) = \min_{\theta_e}\big\{-\beta\big(\mathbb{E}_{(x,z)\sim p(z|x)p(x)}[\log\sigma(T(x,z))] + \mathbb{E}_{(x,z)\sim p(z)p(x)}[\log(1-\sigma(T(x,z)))]\big) + \gamma\,\mathbb{E}_{x\sim p(x)}\big[KL(p(z|x)\,\|\,q(z))\big]\big\}. \qquad (9)$$

Negative sampling estimation [13], i.e., the process of using a discriminator to distinguish real samples from noisy ones in order to estimate the distribution of the real samples, is generally utilized to solve the problem in Eq. (9). $\sigma(T(x,z))$ is a discriminator, where $x$ and its latent representation $z$ together form a positive sample pair. We randomly select $z_t$ from the disturbed batch to construct a negative sample pair for $x$. Note that Eq. (9) represents the global mutual information between $X$ and $Z$.

Furthermore, we extract the feature map from the middle layer of the convolutional network and construct the relationship between the feature map and the latent representation, which constitutes the local mutual information. Its estimation plays the same role as the global mutual information. The middle-layer features are combined with the latent representation to obtain a new feature map, and a $1 \times 1$ convolution is then used as the estimation network of the local mutual information, as shown in Fig. 3. Negative samples are selected in the same way as for the global mutual information estimation. Therefore, the objective function to be optimized can be defined as:

$$\begin{aligned} L_e = & -\beta\big(\mathbb{E}_{(x,z)\sim p(z|x)p(x)}[\log\sigma(T(x,z))] + \mathbb{E}_{(x,z)\sim p(z)p(x)}[\log(1-\sigma(T(x,z)))]\big) \\ & -\frac{\beta}{hw}\sum_{i,j}\big(\mathbb{E}_{(x,z)\sim p(z|x)p(x)}[\log\sigma(T(C_{ij},z))] + \mathbb{E}_{(x,z)\sim p(z)p(x)}[\log(1-\sigma(T(C_{ij},z)))]\big) \\ & +\gamma\,\mathbb{E}_{x\sim p(x)}\big[KL(p(z|x)\,\|\,q(z))\big], \end{aligned} \qquad (10)$$

where $h$ and $w$ represent the height and width of the feature map, $C_{ij}$ represents the feature vector of the middle feature map at coordinates $(i,j)$, and $q(z)$ is the standard normal distribution.

Figure 3. Local mutual information estimation.

Decoder:
In existing decoder networks, the reconstruction loss is generally a suboptimal scheme for clustering, due to the natural trade-off between the reconstruction and clustering tasks. The reconstruction loss mainly depends on two parts: the distribution of the latent representations and the generative capacity of the decoder network. However, the generative capacity of the decoder network is not required by the clustering task. Our real goal is not to obtain the best reconstruction results, but to get more discriminative features for clustering. We directly use noise disturbance in the latent space to discard known nuisance factors from the latent representations. Models trained in this fashion become robust by exclusion rather than inclusion, and are expected to perform well on clustering tasks even when the inputs contain unseen nuisances [15]. A noisy-transformer $\psi$ is utilized to convert the latent representations $Z$ into their noisy versions $\hat{Z}$, and the decoder then reconstructs the inputs from both $\hat{Z}$ and $Z$. The reconstruction results can be defined as $\tilde{x}_{\hat{z}_i} = g(\hat{z}_i; \theta_d)$ and $\tilde{x}_{z_i} = g(z_i; \theta_d)$, and the relative reconstruction loss can be written as:

$$L_r(\tilde{x}_{\hat{z}_i}, \tilde{x}_{z_i}) = \|\tilde{x}_{\hat{z}_i} - \tilde{x}_{z_i}\|_F, \qquad (11)$$

where $\|\cdot\|_F$ stands for the Frobenius norm. We also use the original reconstruction loss to ensure the performance of the decoder network, and consider $\psi$ to be multiplicative Gaussian noise. The complete reconstruction loss can be defined as:

$$L_r = \|\tilde{x}_{\hat{z}_i} - \tilde{x}_{z_i}\|_F + \delta \|x - \tilde{x}_{z_i}\|_F, \qquad (12)$$

where $\delta$ controls the relative strength of the two reconstruction terms. Hence, considering all the items, the total loss of the autoencoder network is:

$$\min_{\theta_d, \theta_e} L_r + L_e. \qquad (13)$$

The learned autoencoder parameters $\theta_e$ and $\theta_d$ are used as the initialization for the clustering phase.
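Before moving to the clustering stage, the global and local mutual information terms of Eqs. (9) and (10) can be sketched numerically. This is a numpy toy: the bilinear global critic, the 1 × 1-convolution local critic, and the shuffle-based negatives are illustrative assumptions, not the paper's actual discriminator networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def js_mi_loss(pos_scores, neg_scores):
    # negative JS lower bound (cf. Eq. (9)): -E[log s(T+)] - E[log(1 - s(T-))]
    return -(np.log(sigmoid(pos_scores) + 1e-12).mean()
             + np.log(1.0 - sigmoid(neg_scores) + 1e-12).mean())

# --- global MI: toy bilinear critic T(x, z) = x^T M z on whole-input features ---
M = rng.normal(scale=0.1, size=(32, 120))
T_global = lambda x, z: np.einsum('bi,ij,bj->b', x, M, z)

x = rng.normal(size=(64, 32))          # per-sample input features
z = rng.normal(size=(64, 120))         # latent representations
z_neg = np.roll(z, 1, axis=0)          # latents from other samples form negatives
L_global = js_mi_loss(T_global(x, z), T_global(x, z_neg))

# --- local MI: replicate z over the middle feature map, score with a 1x1 conv ---
def T_local(C, z, w):
    h, wd, _ = C.shape
    z_map = np.broadcast_to(z, (h, wd, z.shape[0]))    # copy z to every (i, j)
    return np.concatenate([C, z_map], axis=-1) @ w     # (h, w) scores, one per location

w1 = rng.normal(size=(16 + 120,))      # 1x1-conv critic weights over c + d channels
C = rng.normal(size=(7, 7, 16))        # middle-layer feature map of one sample
L_local = js_mi_loss(T_local(C, z[0], w1), T_local(C, z_neg[0], w1))
```

Both losses are minimized with respect to the encoder parameters; in practice the critics are trained jointly with the encoder rather than fixed as here.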
Spectral clustering can effectively use the relationships between samples to reduce intra-class differences, and produces better clustering results than K-means. In this step, we first adopt the autoencoder network to learn the latent representations. Next, a spectral clustering method is used to embed the latent representations into the eigenspace of their associated graph Laplacian matrix [25]. All the samples are subsequently clustered in this space. Finally, the autoencoder parameters and the clustering objective are jointly optimized.

Specifically, we first utilize the latent representations $Z$ to construct the non-negative affinity matrix $W$:

$$W_{i,j} = e^{-\|z_i - z_j\|^2 / \sigma^2}. \qquad (14)$$

The loss function of spectral clustering is defined as:

$$L_c = \mathbb{E}\big[W_{i,j}\|y_i - y_j\|^2\big], \qquad (15)$$

where $y_i$ is the output of the network. When we adopt a general neural network to output $y$, we randomly select a minibatch of $m$ samples at each iteration, and the loss function becomes:

$$L_c = \frac{1}{m^2}\sum_{i,j=1}^{m} W_{i,j}\|y_i - y_j\|^2. \qquad (16)$$

To prevent all points from being mapped into the same cluster by the network, the output $y$ is required to be orthonormal in expectation, that is:

$$\frac{1}{m} Y^{\top} Y = I_{k \times k}, \qquad (17)$$

where $Y$ is an $m \times k$ matrix of the outputs whose $i$-th row is $y_i^{\top}$. The last layer of the network is utilized to enforce this orthogonality constraint [36]. This layer takes its input from $K$ units and acts as a linear layer with $K$ outputs, in which the weights are required to be orthogonal, producing the orthogonalized output $Y$ for a minibatch. Let $\tilde{Y}$ denote the $m \times k$ matrix containing the inputs to this layer; a linear map that orthogonalizes the columns of $\tilde{Y}$ is computed through its QR decomposition.
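The affinity matrix and minibatch spectral loss of Eqs. (14)-(16) can be sketched as follows (a numpy toy with a fixed bandwidth; the squared-distance affinity and one-hot outputs are illustrative choices):

```python
import numpy as np

def affinity(Z, sigma=1.0):
    # W_ij = exp(-||z_i - z_j||^2 / sigma^2), cf. Eq. (14); sigma is a bandwidth choice
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def spectral_loss(W, Y):
    # L_c = (1/m^2) * sum_ij W_ij ||y_i - y_j||^2, cf. Eq. (16)
    m = Y.shape[0]
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return (W * d2).sum() / m**2

# Two well-separated blobs: outputs that are consistent within each blob
# incur a lower loss than outputs that mix the blobs up.
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
W = affinity(Z)
Y_good = np.repeat(np.eye(2), 10, axis=0)   # blob-consistent outputs
Y_bad = np.tile(np.eye(2), (10, 1))         # alternating outputs
print(spectral_loss(W, Y_good) < spectral_loss(W, Y_bad))   # True
```

Because cross-blob affinities are nearly zero, only within-blob disagreements are penalized, which is exactly the behavior the loss is designed to exploit.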
Since $A^{\top}A$ is full rank for any matrix $A$ with full column rank, the QR decomposition can be obtained through the Cholesky decomposition:

$$A^{\top}A = BB^{\top}, \qquad (18)$$

where $B$ is a lower triangular matrix, and $Q = A(B^{-1})^{\top}$. Therefore, in order to orthogonalize $\tilde{Y}$, the last layer multiplies $\tilde{Y}$ from the right by $\sqrt{m}\,(\tilde{B}^{-1})^{\top}$, where $\tilde{B}$ is obtained from the Cholesky decomposition of $\tilde{Y}^{\top}\tilde{Y}$, and the $\sqrt{m}$ factor is needed to satisfy Eq. (17).

We unify the latent representation learning and the spectral clustering using KL divergence. In the clustering phase, the last term of Eq. (10) can be rewritten as:

$$\mathbb{E}_{x \sim p(x)}\big[KL\big(p((y,z)|x)\,\|\,q(y,z)\big)\big], \qquad (19)$$

where $p((y,z)|x) = p(y|z)p(z|x)$ and $q(y,z) = q(z|y)q(y)$. Note that $q(z|y)$ is a normal distribution with mean $\mu_y$ and unit variance. Therefore, the overall loss of the autoencoder and the spectral clustering network is defined as:

$$\min_{\theta_d, \theta_e, \theta_c} L_r + L_e + L_c. \qquad (20)$$

Finally, we jointly optimize the two networks until convergence to obtain the desired clustering results.
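The Cholesky-based orthogonalization layer of Eqs. (17)-(18) can be sketched directly (a numpy sketch; in the network this map is applied as the last layer, whereas here it is a standalone function on a random minibatch):

```python
import numpy as np

def orthogonalize(Y_tilde):
    """Last-layer orthogonalization via Cholesky-based QR (cf. Eqs. (17)-(18)):
    with S = Y~^T Y~ = B B^T, the map Y = sqrt(m) * Y~ (B^{-1})^T satisfies
    (1/m) Y^T Y = I."""
    m = Y_tilde.shape[0]
    B = np.linalg.cholesky(Y_tilde.T @ Y_tilde)   # lower-triangular factor
    return np.sqrt(m) * Y_tilde @ np.linalg.inv(B).T

rng = np.random.default_rng(0)
Y_tilde = rng.normal(size=(64, 5))                # minibatch of pre-orthogonal outputs
Y = orthogonalize(Y_tilde)
print(np.allclose(Y.T @ Y / 64, np.eye(5)))       # True
```

The Cholesky route is convenient here because $\tilde{Y}^{\top}\tilde{Y}$ is small ($k \times k$) and the triangular factor makes the orthogonalizing map cheap to invert.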
4. Experiments
In this section, we evaluate the effectiveness of the proposed clustering method on five benchmark datasets, and then compare its performance with several state-of-the-art methods.

Table 1. Description of Datasets

Dataset        Samples  Classes  Dimensions
MNIST-full     70,000   10       32 × 32
MNIST-test     10,000   10       32 × 32
USPS           9,298    10       16 × 16
Fashion-MNIST  -        -        -
YTF            -        -        -

Figure 4. Image samples from the benchmark datasets used in our experiments: (a) MNIST, (b) Fashion-MNIST.
In order to show that our method works well with various kinds of datasets, we choose the following image datasets. Considering that clustering tasks are fully unsupervised, we concatenate the training and testing samples when applicable. MNIST-full [20]: a dataset containing a total of 70,000 handwritten digits with 60,000 training and 10,000 testing samples, each being a 32 × 32 monochrome image. MNIST-test: a dataset consisting of only the testing part of MNIST-full. USPS: a handwritten digits dataset from the USPS postal service, containing 9,298 samples of 16 × 16 images. Some image samples are shown in Fig. 4, and brief descriptions of the datasets are given in Tab. 1.

To evaluate the clustering results, we adopt two standard evaluation metrics: Accuracy (ACC) and Normalized Mutual Information (NMI) [41]. The best mapping between cluster assignments and true labels is computed using the Hungarian algorithm to measure accuracy [19]. For completeness, we define ACC by:
$$ACC = \max_{m} \frac{\sum_{i=1}^{n} \mathbf{1}\{l_i = m(c_i)\}}{n}, \qquad (21)$$

where $l_i$ and $c_i$ are the true label and predicted cluster of data point $x_i$. NMI calculates a normalized measure of similarity between two labelings of the same data, which is defined as:

$$NMI = \frac{I(l; c)}{\max\{H(l), H(c)\}}, \qquad (22)$$

where $I(l; c)$ denotes the mutual information between the true labels $l$ and the predicted clusters $c$, and $H$ represents their entropies. NMI does not change under permutations of clusters (classes), and it is normalized to the range [0, 1], with 0 meaning no correlation and 1 perfect correlation.

In our experiments, we set $\beta = 0.$, $\gamma = 1$, and $\delta = 0.$ The channel numbers and kernel sizes of the autoencoder network are shown in Tab. 2, and the dimension of the latent space is set to 120. The deep spectral clustering network consists of four fully connected layers, and we adopt ReLU [26] as the non-linear activation. We construct the original weight matrix $W$ with probabilistic K-nearest neighbors for each dataset. The weight $W_{ij}$ is calculated as in a nearest-neighbor graph [11], and the number of neighbors is set to 3.

We compare our clustering model with several baselines, including K-means [28], spectral clustering with normalized cuts (SC-Ncut) [37], large-scale spectral clustering (SC-LS) [4], NMF [2], and graph degree linkage-based agglomerative clustering (AC-GDL) [49].
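The two metrics of Eqs. (21) and (22) can be sketched as follows. The ACC mapping is found here by brute force over permutations rather than the Hungarian algorithm, which is only practical for small numbers of clusters; both estimators work from plain label arrays:

```python
import numpy as np
from itertools import permutations

def acc(labels, clusters, K):
    """Eq. (21): best accuracy over mappings m from cluster ids to labels.
    Brute-force over permutations (Hungarian algorithm in the paper)."""
    best = 0.0
    for perm in permutations(range(K)):
        mapped = np.array([perm[c] for c in clusters])
        best = max(best, float((mapped == labels).mean()))
    return best

def nmi(labels, clusters):
    """Eq. (22): I(l; c) / max(H(l), H(c)), estimated from joint counts."""
    n = len(labels)
    joint = np.zeros((labels.max() + 1, clusters.max() + 1))
    for l, cl in zip(labels, clusters):
        joint[l, cl] += 1.0 / n
    pl, pc = joint.sum(1), joint.sum(0)
    nz = joint > 0
    I = (joint[nz] * np.log(joint[nz] / np.outer(pl, pc)[nz])).sum()
    H = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    return I / max(H(pl), H(pc))

labels = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([2, 2, 0, 0, 1, 1])    # a relabeled but perfect clustering
print(acc(labels, clusters, 3))            # 1.0
```

Note that the perfect-but-relabeled clustering scores 1.0 on both metrics, which is exactly the permutation invariance the definitions are designed to provide.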
In addition, we evaluate the performance of our method against several state-of-the-art deep learning-based clustering algorithms, including deep adversarial subspace clustering (DASC) [50], deep embedded clustering (DEC) [40], variational deep embedding (VaDE) [18], joint unsupervised learning (JULE) [44], deep embedded regularized clustering (DEPICT) [10], improved deep embedded clustering with locality preservation (IDEC) [12], deep spectral clustering with a set of nearest neighbor pairs (SpectralNet) [36], clustering with GAN (ClusterGAN) [30], and GAN with mutual information (InfoGAN) [5].

We run our method with 10 random trials and report the average performance; the error range is no more than 2%. In terms of the compared methods, if the results of a method are not reported on some datasets, we run the released

Table 2. Description of the structure of the autoencoder network
The four encoder/decoder layers use 3 × 3 and 5 × 5 convolution kernels with 16 and 32 channels per layer, mirrored between the encoder (encoder-1 to encoder-4) and the decoder (decoder-4 to decoder-1) for each dataset.

Table 3. Clustering performance of different algorithms on five datasets based on NMI and ACC
Method          MNIST-full      MNIST-test      USPS            Fashion-10      YTF
                NMI     ACC     NMI     ACC     NMI     ACC     NMI     ACC     NMI     ACC
K-means [28]    0.500   0.532   0.501   0.546   0.601   0.668   0.512   0.474   0.776   0.601
SC-Ncut [37]    0.731   0.656   0.704   0.660   0.794   0.649   0.575   0.508   0.701   0.510
SC-LS [4]       0.706   0.714   0.756   0.740   0.755   0.746   0.497   0.496   0.759   0.544
NMF [2]         0.452   0.471   0.467   0.479   0.693   0.652   0.425   0.434   -       -
AC-GDL [49]     0.017   0.113   0.864   0.933   0.825   0.725   0.010   0.112   0.622   0.430
DASC [50]       0.784*  *       *       *       *       *       *       *       *       *
VaDE [18]       0.876   0.945   -       -       0.512   0.566   0.630   0.578   -       -
JULE [44]       0.913*  *       *       *       *       *       *       *       *       *

code with the hyper-parameters mentioned in their papers, and the corresponding results are marked by (*). When the code is not publicly available, or running the released code is not practical, we put dash marks (-) instead of the corresponding results.

The clustering results are shown in Tab. 3, where the first five methods are conventional clustering methods. From the table, we can see that our proposed method outperforms the competing methods on these benchmark datasets. We observe that the proposed method improves clustering performance on both digit datasets and other product datasets. In particular, on MNIST-test the clustering accuracy is over 98%; specifically, it exceeds the second best, DEPICT, which is trained on the noisy versions of the inputs, by 1.6% and 3.1% in ACC and NMI, respectively. Moreover, our method achieves much better clustering results than several classical shallow baselines. This is because, compared with shallow methods, our method uses a multi-layer convolutional autoencoder as the feature extractor and adopts a deep clustering network to obtain optimal clustering results. The Fashion-MNIST dataset is very difficult to deal with due to the complexity of its samples, but our method still achieves good results.

We also investigate the parameter sensitivity on MNIST-test; the results are shown in Fig. 5, where Fig. 5(a) represents the ACC results for different parameters and Fig. 5(b) the NMI results. The figure intuitively demonstrates that our method maintains acceptable results for most parameter combinations and is relatively stable.

Figure 5. ACC and NMI of our method with different β and γ on the MNIST dataset.

We compare different strategies for training our model. For training a multi-layer convolutional autoencoder, we analyze the following four approaches: (1) a convolutional autoencoder with the original reconstruction loss (ConvAE), (2) a convolutional autoencoder with the original reconstruction loss and mutual information (ConvAE+MI), (3) a convolutional autoencoder with the improved reconstruction loss (ConvAE+RS), and (4) a convolutional autoencoder with the improved reconstruction loss and mutual information (ConvAE+MI+RS). The last strategy is the joint training of the convolutional autoencoder and deep spectral clustering (ConvAE+MI+RS+SN).

Table 4. Clustering performance with different strategies on five datasets based on NMI and ACC

Method            MNIST-full      MNIST-test      USPS            Fashion-10      YTF
                  NMI     ACC     NMI     ACC     NMI     ACC     NMI     ACC     NMI     ACC
ConvAE            0.745   0.776   0.751   0.781   0.652   0.698   0.556   0.546   0.642   0.476
ConvAE+MI         0.800   0.835   0.796   0.844   0.744   0.785   0.609   0.592   0.738   0.571
ConvAE+RS         0.803   0.841   0.801   0.850   0.752   0.798   0.597   0.614   0.721   0.558
ConvAE+MI+RS      0.910   0.957   0.914   0.961   0.827   0.831   0.640   0.656   0.801   0.606
ConvAE+MI+RS+SN   0.941   0.978   0.946   0.980   0.857   0.869   0.645   0.662   0.857   0.691

Tab. 4 reports the performance of the different training strategies. It clearly demonstrates that each strategy improves the clustering accuracy effectively, especially after adding mutual information and the improved reconstruction loss to the convolutional autoencoder network. Fig. 6 demonstrates the importance of the proposed strategies by comparing different data representations of MNIST-test data points using t-SNE visualization [27]: Fig. 6(a) represents the space of raw data, Fig. 6(b) the data points in the latent subspace of a convolutional autoencoder, Fig. 6(c) and 6(d) the results of DEC and SpectralNet, respectively, and the rest are our proposed model with different strategies. The results demonstrate that the latent representations obtained by our method have a clearer cluster structure.

Figure 6. Visualization of the discriminative capability of the embedding subspaces using MNIST-test data: (a) raw data, (b) ConvAE, (c) DEC, (d) SpectralNet, (e) ConvAE+RS, (f) ConvAE+MI, (g) ConvAE+RS+MI, (h) ConvAE+MI+RS+SN.
5. Conclusion
In this paper, we propose an unsupervised deep clustering method with a dual autoencoder network and a deep spectral clustering network. First, the dual autoencoder, which reconstructs the inputs from the latent representations and their noise-contaminated versions, is utilized to establish the relationships between the inputs and the latent representations, in order to obtain more robust latent representations. Furthermore, we maximize the mutual information between the inputs and the latent representations, which preserves the information of the inputs as much as possible. Hence, the latent-space features obtained by our autoencoder are robust to noise and more discriminative. Finally, the spectral clustering network is fused into a unified framework to cluster the latent-space features, so that the relationships between the samples can be effectively utilized. We evaluate our method on several benchmarks, and experimental results show that our method outperforms state-of-the-art approaches.
6. Acknowledgement
Our work was supported by the National Natural Science Foundation of China under Grants 61572388, 61703327, and 61602176, the Key R&D Program (The Key Industry Innovation Chain of Shaanxi) under Grants 2017ZDCXL-GY-05-04-02, 2017ZDCXL-GY-05-02, and 2018ZDXM-GY-176, and the National Key R&D Program of China under Grant 2017YFE0104100.

References

[1] Lingling An, Xinbo Gao, Xuelong Li, Dacheng Tao, Cheng Deng, Jie Li, et al. Robust reversible watermarking via clustering and enhanced pixel-wise masking.
IEEE Trans. ImageProcessing , 21(8):3598–3611, 2012.[2] Deng Cai, Xiaofei He, Xuanhui Wang, Hujun Bao, and Ji-awei Han. Locality preserving nonnegative matrix factoriza-tion. In
IJCAI , volume 9, pages 1010–1015, 2009.[3] Pu Chen, Xinyi Xu, and Cheng Deng. Deep view-aware met-ric learning for person re-identification. In
IJCAI , pages 620–626, 2018.[4] Xinlei Chen and Deng Cai. Large scale spectral clusteringwith landmark-based representation. In
AAAI , volume 5,page 14, 2011.[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, IlyaSutskever, and Pieter Abbeel. Infogan: Interpretable repre-sentation learning by information maximizing generative ad-versarial nets. In
Advances in neural information processingsystems , pages 2172–2180, 2016.[6] Eric C Chi and Kenneth Lange. Splitting methods for convexclustering.
Journal of Computational and Graphical Statis-tics , 24(4):994–1013, 2015.[7] Navneet Dalal and Bill Triggs. Histograms of oriented gra-dients for human detection. In
Computer Vision and Pat-tern Recognition, 2005. CVPR 2005. IEEE Computer SocietyConference on , volume 1, pages 886–893. IEEE, 2005.[8] C Deng, E Yang, T Liu, W Liu, J Li, and D Tao. Unsu-pervised semantic-preserving adversarial hashing for imagesearch.
IEEE transactions on image processing: a publica-tion of the IEEE Signal Processing Society , 2019.[9] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo,Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, andMurray Shanahan. Deep unsupervised clustering withgaussian mixture variational autoencoders. arXiv preprintarXiv:1611.02648 , 2016.[10] Kamran Ghasedi Dizaji, Amirhossein Herandi, Cheng Deng,Weidong Cai, and Heng Huang. Deep clustering via jointconvolutional autoencoder embedding and relative entropyminimization. In
Computer Vision (ICCV), 2017 IEEE In-ternational Conference on , pages 5747–5756. IEEE, 2017.[11] Quanquan Gu and Jie Zhou. Co-clustering on manifolds. In
Proceedings of the 15th ACM SIGKDD international confer-ence on Knowledge discovery and data mining , pages 359–368. ACM, 2009.[12] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Im-proved deep embedded clustering with local structure preser-vation. In
International Joint Conference on Artificial Intel-ligence (IJCAI-17) , pages 1753–1759, 2017.[13] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon,Karan Grewal, Adam Trischler, and Yoshua Bengio. Learn-ing deep representations by mutual information estimationand maximization. arXiv preprint arXiv:1808.06670 , 2018.[14] Steven CH Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning for collaborative imageretrieval and clustering.
ACM Transactions on MultimediaComputing, Communications, and Applications (TOMM) ,6(3):18, 2010. [15] Ayush Jaiswal, Rex Yue Wu, Wael Abd-Almageed, andPrem Natarajan. Unsupervised adversarial invariance. In
Advances in Neural Information Processing Systems , pages5097–5107, 2018.[16] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, andIan Reid. Deep subspace clustering networks. In
Advances inNeural Information Processing Systems , pages 24–33, 2017.[17] Wenhao Jiang and Fu-lai Chung. Transfer spectral clus-tering. In
Joint European Conference on Machine Learn-ing and Knowledge Discovery in Databases , pages 789–803.Springer, 2012.[18] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, andHanning Zhou. Variational deep embedding: An unsuper-vised and generative approach to clustering. arXiv preprintarXiv:1611.05148 , 2016.[19] Harold W Kuhn. The hungarian method for the assignmentproblem.
Naval research logistics quarterly , 2(1-2):83–97,1955.[20] Yann LeCun, L´eon Bottou, Yoshua Bengio, and PatrickHaffner. Gradient-based learning applied to document recog-nition.
Proceedings of the IEEE , 86(11):2278–2324, 1998.[21] Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, andDacheng Tao. Self-supervised adversarial hashing networksfor cross-modal retrieval. In
CVPR , pages 4242–4251, 2018.[22] Chao Li, Cheng Deng, Lei Wang, De Xie, and XianglongLiu. Coupled cyclegan: Unsupervised hashing networkfor cross-modal retrieval. arXiv preprint arXiv:1903.02149 ,2019.[23] Fengfu Li, Hong Qiao, and Bo Zhang. Discriminativelyboosted image clustering with fully convolutional auto-encoders.
Pattern Recognition , 83:161–173, 2018.[24] Yeqing Li, Junzhou Huang, and Wei Liu. Scalable sequentialspectral clustering. In
Thirtieth AAAI conference on artificialintelligence , 2016.[25] Wei Liu, Junfeng He, and Shih-Fu Chang. Large graph con-struction for scalable semi-supervised learning. In
Proceed-ings of the 27th international conference on machine learn-ing (ICML-10) , pages 679–686, 2010.[26] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Recti-fier nonlinearities improve neural network acoustic models.In
Proc. icml , volume 30, page 3, 2013.[27] Laurens van der Maaten and Geoffrey Hinton. Visualiz-ing data using t-sne.
Journal of machine learning research ,9(Nov):2579–2605, 2008.[28] James MacQueen et al. Some methods for classificationand analysis of multivariate observations. In
Proceedings ofthe fifth Berkeley symposium on mathematical statistics andprobability , volume 1, pages 281–297. Oakland, CA, USA,1967.[29] Jonathan Masci, Ueli Meier, Dan Cires¸an, and J¨urgenSchmidhuber. Stacked convolutional auto-encoders for hi-erarchical feature extraction. In
International Conference onArtificial Neural Networks , pages 52–59. Springer, 2011.[30] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, andSreeram Kannan. Clustergan: Latent space cluster-ing in generative adversarial networks. arXiv preprintarXiv:1809.03627 , 2018.31] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectralclustering: Analysis and an algorithm. In
Advances in neuralinformation processing systems , pages 849–856, 2002.[32] Pauline C Ng and Steven Henikoff. Sift: Predicting aminoacid changes that affect protein function.
Nucleic acids re-search , 31(13):3812–3814, 2003.[33] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variationaldivergence minimization. In
Advances in Neural InformationProcessing Systems , pages 271–279, 2016.[34] Sohil Atul Shah and Vladlen Koltun. Robust continuousclustering.
Proceedings of the National Academy of Sci-ences , 114(37):9814–9819, 2017.[35] Sohil Atul Shah and Vladlen Koltun. Deep continuous clus-tering. arXiv preprint arXiv:1803.01449 , 2018.[36] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, RonenBasri, and Yuval Kluger. Spectralnet: Spectral clustering us-ing deep neural networks. arXiv preprint arXiv:1801.01587 ,2018.[37] Jianbo Shi and Jitendra Malik. Normalized cuts and imagesegmentation.
IEEE Transactions on pattern analysis andmachine intelligence , 22(8):888–905, 2000.[38] Elad Tzoreff, Olga Kogan, and Yoni Choukroun. Deepdiscriminative latent space for clustering. arXiv preprintarXiv:1805.10795 , 2018.[39] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machinelearning algorithms. arXiv preprint arXiv:1708.07747 , 2017.[40] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsuperviseddeep embedding for clustering analysis. In
Internationalconference on machine learning , pages 478–487, 2016.[41] Wei Xu, Xin Liu, and Yihong Gong. Document cluster-ing based on non-negative matrix factorization. In
Proceed-ings of the 26th annual international ACM SIGIR conferenceon Research and development in informaion retrieval , pages267–273. ACM, 2003.[42] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and MingyiHong. Towards k-means-friendly spaces: Simultaneous deeplearning and clustering. arXiv preprint arXiv:1610.04794 ,2016.[43] Erkun Yang, Cheng Deng, Tongliang Liu, Wei Liu, andDacheng Tao. Semantic structure-based unsupervised deephashing. In
IJCAI , pages 1064–1070, 2018.[44] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsuper-vised learning of deep representations and image clusters.In
Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition , pages 5147–5156, 2016.[45] Muli Yang, Cheng Deng, and Feiping Nie. Adaptive-weighting discriminative regression for multi-view classifi-cation.
Pattern Recogn. , 88(4):236–245, 2019.[46] Xu Yang, Cheng Deng, Xianglong Liu, and Feiping Nie.New l2, 1-norm relaxation of multi-way graph cut for clus-tering. In
AAAI , 2018.[47] Jinfeng Yi, Lijun Zhang, Tianbao Yang, Wei Liu, and JunWang. An efficient semi-supervised clustering algorithmwith sequential constraints. In
Proceedings of the 21th ACMSIGKDD International Conference on Knowledge Discoveryand Data Mining , pages 1405–1414. ACM, 2015. [48] Tianshu Yu, Junchi Yan, Wei Liu, and Baoxin Li. Incremen-tal multi-graph matching via diversity and randomness basedgraph clustering. In
Proceedings of the European Conferenceon Computer Vision (ECCV) , pages 139–154, 2018.[49] Wei Zhang, Deli Zhao, and Xiaogang Wang. Agglomerativeclustering via maximum incremental path integral.
PatternRecognition , 46(11):3056–3065, 2013.[50] Pan Zhou, Yunqing Hou, and Jiashi Feng. Deep adversarialsubspace clustering. In