Deep Clustering and Representation Learning with Geometric Structure Preservation
Lirong Wu, Zicheng Liu, Zelin Zang, Jun Xia, Siyuan Li, Stan. Z Li
AI Lab, School of Engineering, Westlake University & Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
{wulirong, liuzicheng, zelin zang, xiajun, lisiyuan, stan.zq.li}@westlake.edu.cn

ABSTRACT
In this paper, we propose a novel framework for Deep Clustering and Representation Learning (DCRL) that preserves the geometric structure of data. In the proposed DCRL framework, manifold clustering is done in the latent space guided by a clustering loss. To overcome the problem that clustering-oriented losses may deteriorate the geometric structure of embeddings in the latent space, an isometric loss is proposed for preserving intra-manifold structure locally and a ranking loss for inter-manifold structure globally. Experimental results on various datasets show that the DCRL framework leads to performances comparable to current state-of-the-art deep clustering algorithms, yet exhibits superior performance for downstream tasks. Our results also demonstrate the importance and effectiveness of the proposed losses in preserving geometric structure in terms of visualization and performance metrics.
1 INTRODUCTION
Clustering, a fundamental tool for data analysis and visualization, has been an essential research topic in data science and machine learning. Conventional clustering algorithms such as K-Means (MacQueen, 1965), Gaussian Mixture Models (GMM) (Bishop, 2006), and spectral clustering (Shi & Malik, 2000) perform clustering based on distance or similarity. However, handcrafted distance or similarity measures are rarely reliable for large-scale high-dimensional data, making it increasingly challenging to achieve effective clustering. An intuitive solution is to transform the data from the high-dimensional input space to the low-dimensional latent space and then to cluster the data in the latent space. This can be achieved by applying dimensionality reduction techniques such as PCA (Wold et al., 1987), t-SNE (Maaten & Hinton, 2008), and UMAP (McInnes et al., 2018). However, since these methods are not specifically designed for clustering tasks, some of their properties may be contrary to our expectations, e.g., two data points from different manifolds that are close in the input space will be even closer in the latent space derived by UMAP. Therefore, the first question here is how to learn a representation that favors clustering?

The two main points of multi-manifold representation learning are (1) preserving the local geometric structure within each manifold and (2) ensuring the discriminability between different manifolds. However, it is challenging to decouple complex cross-over relations and ensure the discriminability between different manifolds, especially in unsupervised situations. One natural strategy is to perform clustering in the input space to get pseudo-labels and then perform representation learning for each manifold. However, in that case, representation learning's performance depends heavily on the clustering effect, but commonly used clustering algorithms such as K-Means do not work well on high-dimensional data. Thus, the second question here is how to cluster data in a way that favors representation learning?

To answer these two questions, some pioneering work has proposed integrating deep clustering and representation learning into a unified framework by defining a clustering-oriented loss. Though promising performance has been demonstrated on various datasets, we observe that a vital factor has been ignored by these works: the defined clustering-oriented loss may deteriorate the geometric structure of the latent space, which in turn hurts the performance of visualization, clustering, generalization, and downstream tasks. In this paper, we propose to jointly perform deep clustering and representation learning with geometric structure preservation. Inspired by Xie et al. (2016), the clustering centers are defined as a set of learnable parameters, and we use a clustering loss to simultaneously guide the separation of data points from different manifolds and the learning of the clustering centers. To prevent the clustering loss from deteriorating the latent space, an isometric loss and a ranking loss are proposed to preserve the intra-manifold structure locally and the inter-manifold structure globally.
Our experimental results show that our method exhibits far superior performance to counterparts in terms of clustering and representation learning, which demonstrates the importance and effectiveness of preserving geometric structure. The contributions of this work are summarized as below:

• Proposing to integrate deep clustering and representation learning into a unified framework with local and global structure preservation.

• Unlike conventional multi-manifold learning algorithms that deal with all point-pair relationships between different manifolds simultaneously, we set the clustering centers as a set of learnable parameters and achieve global structure preservation in a faster, more efficient, and easier-to-optimize manner by applying a ranking loss to the clustering centers.

• Analyzing the contradiction between the two optimization goals of clustering and local structure preservation, and proposing an elegant training strategy to alleviate it.

• The proposed DCRL algorithm outperforms competing algorithms in terms of clustering effect, generalizability to out-of-sample data, and performance in downstream tasks.
2 RELATED WORK
Clustering analysis. As a fundamental tool in machine learning, clustering has been widely applied in various domains. One branch of classical clustering is K-Means (MacQueen, 1965) and Gaussian Mixture Models (GMM) (Bishop, 2006), which are fast, easy to understand, and can be applied to a large number of problems. However, limited by the Euclidean measure, their performance on high-dimensional data is often unsatisfactory. Spectral clustering and its variants (such as SC-Ncut (Shi & Malik, 2000)) extend clustering to high-dimensional data by allowing more flexible distance measures. However, limited by the computational cost of the full Laplacian matrix, spectral clustering is challenging to extend to large-scale datasets.

Deep clustering. The success of deep learning has contributed to the growth of deep clustering. One branch of deep clustering performs clustering after learning a representation through existing unsupervised techniques. For example, Tian et al. (2014) use an autoencoder to learn low-dimensional features and then run K-Means to get clustering results (AE+K-Means). Considering the geometric structure of the data, N2D applies UMAP to find the best clusterable manifold of the obtained embedding and then runs K-Means to discover higher-quality clusters (McConville et al., 2019). The other category of algorithms tries to optimize clustering and representation learning jointly. The closest work to ours is Deep Embedding for Clustering (DEC) (Xie et al., 2016), which learns a mapping from the input space to a lower-dimensional latent space through iteratively optimizing a clustering objective. As a modified version of DEC, while IDEC claims to preserve the local structure of the data (Guo et al., 2017), in reality, their contribution is nothing more than adding a reconstruction loss. JULE proposes a recurrent framework for integrating clustering and representation learning into a single model with a weighted triplet loss and optimizing it end-to-end (Yang et al., 2016b). DSC devises a dual autoencoder to embed data into the latent space, and then deep spectral clustering (Shaham et al., 2018) is applied to obtain label assignments (Yang et al., 2019).

Manifold Representation Learning. Isomap, as a representative algorithm of single-manifold learning, aims to capture global nonlinear features and seek an optimal subspace that best preserves the geodesic distance between data points (Tenenbaum et al., 2000). In contrast, some algorithms, such as LLE (Roweis & Saul, 2000), are more concerned with the preservation of local neighborhood information. Combining DNNs with manifold learning, the recently proposed MLDL algorithm achieves the preservation of local and global geometries by imposing LIS prior constraints (Li et al., 2020). Furthermore, multi-manifold learning is proposed to obtain intrinsic properties of different manifolds. Yang et al. (2016a) proposed a supervised MMD-Isomap where data points are partitioned into different manifolds according to label information. Similarly, Zhang et al. (2018) proposed a semi-supervised local multi-manifold learning framework, termed SSMM-Isomap, that applies the labeled and unlabeled training samples to perform joint learning of local neighborhood-preserving features. In most previous work on multi-manifold learning, the problem is considered from the perspective that the labels are known or partially known, which significantly simplifies the problem.
For unsupervised multi-manifold learning, it is still very challenging to decouple multiple overlapping manifolds, and that is exactly what this paper aims to explore.
3 PROPOSED METHOD
Consider a dataset X with N samples, where each sample x_i ∈ R^d is sampled from one of C different manifolds {M_c}_{c=1}^C. We assume that each category in the dataset lies on a compact low-dimensional manifold and that the number of manifolds C is prior knowledge. Define two nonlinear mappings z_i = f(x_i; θ_f) and y_i = g(z_i; θ_g), where z_i ∈ R^m is the embedding of x_i in the latent space and y_i is the reconstruction of x_i. The j-th cluster center is denoted as μ_j ∈ R^m, where {μ_j}_{j=1}^C is defined as a set of learnable parameters. We aim to find optimal parameters θ_f so that the embedded features {z_i}_{i=1}^N achieve clustering with local and global structure preservation. To this end, a denoising autoencoder (Vincent et al., 2010), shown in Fig 1, is first pre-trained in an unsupervised manner to learn an initial latent space. The denoising autoencoder optimizes the self-reconstruction loss $L_{AE} = \mathrm{MSE}(\hat{x}, y)$, where $\hat{x}$ is a copy of $x$ with Gaussian noise added, that is, $\hat{x} = x + \mathcal{N}(0, \sigma)$. The autoencoder is then fine-tuned by optimizing the clustering-oriented loss $L_{cluster}(z, \mu)$ and the structure-oriented losses $L_{rank}(x, \mu)$, $L_{LIS}(x, z)$, and $L_{align}(z, \mu)$. Since clustering should be performed on features of clean data rather than the noised data $\hat{x}$ used by the denoising autoencoder, the clean data $x$ is used for fine-tuning.

Figure 1: The framework of the proposed DCRL method. The encoder, decoder, latent space, and cluster centers are marked as blue, red, green, and purple, respectively.
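A minimal PyTorch sketch of this pretraining stage is given below. The layer widths follow the d-500-500-2000-10 architecture reported in Sec 4.1, while the noise level sigma, the ReLU activations, and the number of pretraining epochs are illustrative assumptions rather than the authors' exact settings. The reconstruction target here is the clean x (the usual denoising convention); matching the noisy copy instead, as the formula L_AE = MSE(x̂, y) literally reads, is a one-line change.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """MLP autoencoder: f (encoder) maps x to z, g (decoder) maps z back to y."""
    def __init__(self, d_in, d_latent=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, 2000), nn.ReLU(),
            nn.Linear(2000, d_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, 2000), nn.ReLU(),
            nn.Linear(2000, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, d_in),
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def pretrain(model, loader, sigma=0.2, epochs=50, lr=1e-3):
    """Denoising pretraining: feed x + Gaussian noise and reconstruct with an MSE loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x in loader:                                # x: (batch, d_in)
            x_hat = x + sigma * torch.randn_like(x)     # noisy copy x_hat = x + N(0, sigma)
            _, y = model(x_hat)
            loss = mse(y, x)                            # reconstruct the clean x
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```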
3.1 CLUSTERING-ORIENTED LOSS
First, the cluster centers {μ_j}_{j=1}^C in the latent space Z are initialized (the initialization method will be introduced in Sec 4.1). Then the similarity between the embedded point z_i and cluster center μ_j is measured by Student's t-distribution:

$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2\right)^{-1}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2\right)^{-1}} \quad (1)$$

The auxiliary target distribution is designed to help manipulate the latent space, defined as:

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \quad \text{where } f_j = \sum_i q_{ij} \quad (2)$$

where f_j is the normalized cluster frequency, used to balance the size of different clusters. Then the encoder is optimized by the following objective:

$$L_{cluster} = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \quad (3)$$

The gradient of L_cluster with respect to each learnable cluster center μ_j can be computed as:

$$\frac{\partial L_{cluster}}{\partial \mu_j} = -2 \sum_i \left(1 + \|z_i - \mu_j\|^2\right)^{-1} (p_{ij} - q_{ij})(z_i - \mu_j) \quad (4)$$

L_cluster facilitates the aggregation of data points within the same manifold, while data points from different manifolds are kept away from each other. However, we find that the clustering-oriented loss may deteriorate the geometric structure of the latent space, which hurts the clustering accuracy and leads to meaningless representations. To prevent this deterioration, we introduce an isometry loss L_LIS and a ranking loss L_rank to preserve the local and global structure, respectively.
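Eqs. (1)-(3) can be written compactly in PyTorch; the sketch below is a paraphrase of the DEC-style objective described above, not the authors' released code.

```python
import torch

def soft_assignment(z, mu):
    """Eq. (1): q_ij = (1 + ||z_i - mu_j||^2)^(-1), normalized over clusters j."""
    q = 1.0 / (1.0 + torch.cdist(z, mu) ** 2)      # (N, C)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Eq. (2): p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'), with f_j = sum_i q_ij."""
    weight = q ** 2 / q.sum(dim=0, keepdim=True)
    return (weight / weight.sum(dim=1, keepdim=True)).detach()

def cluster_loss(z, mu, p):
    """Eq. (3): L_cluster = KL(P || Q); gradients reach both the encoder (via z) and mu."""
    q = soft_assignment(z, mu)
    return (p * (p.log() - q.log())).sum()
```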
3.2 STRUCTURE-ORIENTED LOSS

Intra-manifold Isometry Loss.
The intra-manifold local structure is preserved by optimizing the following objective:

$$L_{LIS} = \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^Z} \left| d_X(x_i, x_j) - d_Z(z_i, z_j) \right| \cdot \pi\left( l(x_i) = l(x_j) \right) \quad (5)$$

where $\mathcal{N}_i^Z$ represents the neighborhood of data point z_i in the feature space Z, and kNN is applied to determine the neighborhood. π(·) ∈ {0, 1} is an indicator function, and l(x_i) is a manifold determination function that returns the manifold s_i where sample x_i is located, that is, s_i = l(x_i) = argmax_j p_ij. Then we can derive C manifolds {M_c}_{c=1}^C: M_c = {x_i ; s_i = c, i = 1, 2, ..., N}. In a nutshell, the loss L_LIS constrains the isometry within each manifold.
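A sketch of Eq. (5) follows: for each point, distances to its k latent-space neighbors are matched between X and Z, restricted to neighbors assigned to the same manifold. Plain Euclidean distances are assumed for both d_X and d_Z, which is an implementation assumption.

```python
import torch

def isometry_loss(x, z, p, k=3):
    """Eq. (5): sum of |d_X(x_i, x_j) - d_Z(z_i, z_j)| over kNN pairs (in Z) from the same manifold."""
    s = p.argmax(dim=1)                                    # s_i = l(x_i) = argmax_j p_ij
    d_x = torch.cdist(x, x)                                # pairwise distances in input space X
    d_z = torch.cdist(z, z)                                # pairwise distances in latent space Z
    knn = d_z.topk(k + 1, largest=False).indices[:, 1:]    # k nearest neighbors of z_i, self excluded
    same = (s.unsqueeze(1) == s[knn]).float()              # indicator pi(l(x_i) = l(x_j))
    rows = torch.arange(x.size(0)).unsqueeze(1).expand_as(knn)
    return ((d_x[rows, knn] - d_z[rows, knn]).abs() * same).sum()
```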
Inter-manifold Ranking Loss.
The inter-manifold global structure is preserved by optimizing the following objective:

$$L_{rank} = \sum_{i=1}^{C} \sum_{j=1}^{C} \left| d_Z(\mu_i, \mu_j) - scale \cdot d_X\left(v_i^X, v_j^X\right) \right| \quad (6)$$

where {v_j^X}_{j=1}^C are defined as the centers of the different manifolds in the original input space X, with $v_j^X = \frac{1}{|M_j|} \sum_{i \in M_j} x_i$ (j = 1, 2, ..., C). The parameter scale determines the extent to which different manifolds move away from each other: the larger scale is, the further away the different manifolds are from each other. The derivation of the gradient of L_rank with respect to each learnable cluster center μ_j is placed in Appendix A.1. Additionally, contrary to us, the conventional methods for dealing with inter-manifold separation typically impose push-away constraints on all data points from different manifolds (Zhang et al., 2018; Yang et al., 2016a), defined as:

$$L_{sep} = -\sum_{i=1}^{N} \sum_{j=1}^{N} d_Z(z_i, z_j) \cdot \pi\left( l(x_i) \neq l(x_j) \right) \quad (7)$$

The main differences between L_rank and L_sep are as follows: (1) L_sep imposes constraints on the embedded points {z_i}_{i=1}^N, which in turn indirectly affect the network parameters θ_f. In contrast, L_rank imposes rank-preservation constraints directly on the learnable parameters {μ_j}_{j=1}^C in the form of a regularization term to control the separation of the clustering centers. (2) L_sep involves N × N point-to-point relationships, while L_rank involves only C × C cluster-to-cluster relationships, so L_rank is easier to optimize, faster to process, and more accurate. (3) The parameter scale introduced in L_rank allows us to control the extent of separation between manifolds for specific downstream tasks.

Alignment Loss.
Note that the global ranking loss L_rank is imposed directly on the learnable parameters {μ_j}_{j=1}^C, so optimizing L_rank will only update {μ_j}_{j=1}^C rather than the encoder's parameters θ_f. Thus, we introduce an auxiliary term L_align to align the learnable cluster centers {μ_j}_{j=1}^C with the real cluster centers {v_j^Z}_{j=1}^C:

$$L_{align} = \sum_{j=1}^{C} \left\| \mu_j - v_j^Z \right\| \quad (8)$$

where {v_j^Z}_{j=1}^C are defined as $v_j^Z = \frac{1}{|M_j|} \sum_{i \in M_j} z_i$ (j = 1, 2, ..., C). We place the derivation of the gradient of L_align with respect to each learnable cluster center μ_j in Appendix A.1.
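Both center-level losses are cheap to compute because they involve only the C learnable centers. A sketch under the same Euclidean-distance assumption follows; the helper manifold_centers, which recomputes v^X (from inputs) or v^Z (from embeddings) given the current hard assignments, is an assumed utility, not part of the paper.

```python
import torch

def manifold_centers(feats, s, C):
    """Mean of the points assigned to each manifold: v^X when feats = x, v^Z when feats = z.
    (Assumes every manifold currently has at least one assigned point.)"""
    return torch.stack([feats[s == c].mean(dim=0) for c in range(C)])

def rank_loss(mu, v_x, scale=3.0):
    """Eq. (6): match center-to-center distances in Z to (scaled) center distances in X."""
    return (torch.cdist(mu, mu) - scale * torch.cdist(v_x, v_x)).abs().sum()

def align_loss(mu, v_z):
    """Eq. (8): pull the learnable centers mu toward the actual latent-space centers v^Z."""
    return (mu - v_z).norm(dim=1).sum()
```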
3.3 TRAINING STRATEGY
3.3.1 CONTRADICTION
Figure 2: The force analysis of the contradiction between clustering and local structure preservation.

The contradiction between clustering and local structure preservation is analyzed from a force-analysis perspective. As shown in Fig 2, we assume that there exists a data point (red point) and its three nearest neighbors (blue points) around a cluster center (gray point). When clustering and local structure preservation are optimized simultaneously, it is very easy to fall into a local optimum, where the data point is in a steady state and the resultant force from its three nearest neighbors is equal in magnitude and opposite to the gravitational force of the cluster. Therefore, the following training strategy is applied to prevent such local optimal solutions.
3.3.2 ALTERNATING TRAINING AND WEIGHT GRADUALITY
Alternating Training.
To solve the above problem and integrate the goals of clustering and structure preservation into a unified framework, we take an alternating training strategy. Within each epoch, we first jointly optimize L_cluster and L_rank in a mini-batch, with the joint loss defined as

$$L = L_{AE} + \alpha L_{cluster} + L_{rank} \quad (9)$$

where α is the weighting factor that balances the effects of clustering and global rank-preservation. Then, at each epoch, we optimize the isometry loss L_LIS and L_align on the whole dataset, defined as

$$L = \beta L_{LIS} + L_{align} \quad (10)$$
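A schematic of the alternating step inside one epoch is sketched below, reusing the loss functions sketched earlier; mu is assumed to be registered as a learnable parameter in the optimizer, and the per-batch and per-epoch bookkeeping is simplified relative to Algorithm 1 in Appendix A.2.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, mu, data, loader, alpha, beta, opt, k=3, scale=3.0):
    """One DCRL epoch: a mini-batch clustering/ranking phase, then a full-dataset structure phase."""
    C = mu.size(0)
    with torch.no_grad():                                   # per-epoch targets and centers
        z_all, _ = model(data)
        p = target_distribution(soft_assignment(z_all, mu))
        s = p.argmax(dim=1)
        v_x = manifold_centers(data, s, C)
    # Phase 1 (per mini-batch): L = L_AE + alpha * L_cluster + L_rank   (Eq. 9)
    for x, idx in loader:                                   # idx: dataset indices of the batch
        z, y = model(x)
        loss = F.mse_loss(y, x) + alpha * cluster_loss(z, mu, p[idx]) + rank_loss(mu, v_x, scale)
        opt.zero_grad(); loss.backward(); opt.step()
    # Phase 2 (once per epoch, all samples): L = beta * L_LIS + L_align   (Eq. 10)
    z_all, _ = model(data)
    loss = beta * isometry_loss(data, z_all, p, k) + align_loss(mu, manifold_centers(z_all, s, C))
    opt.zero_grad(); loss.backward(); opt.step()
```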
Weight Graduality. At different stages of training, we have different expectations for clustering and structure preservation. At the beginning of training, to successfully decouple the overlapping manifolds, we hope that L_cluster will dominate and L_LIS will be auxiliary. When the margin between different manifolds is sufficiently pronounced, the weight α for L_cluster can be gradually reduced, while the weight β for L_LIS can be gradually increased, focusing on the preservation of local isometry. The whole algorithm is summarized in Algorithm 1 in Appendix A.2.
Three-stage explanation.
The entire training process can be roughly divided into three stages, as shown in Fig 3, to explain the training strategy more vividly. At first, four different manifolds overlap each other. At Stage 1, L_cluster dominates, thus data points within each manifold converge towards the clustering center to form a sphere, and the local structure of the manifolds is destroyed. At Stage 2, L_rank dominates, thus different manifolds in the latent space move away from each other to increase the manifold margin and enhance the discriminability. At Stage 3, the manifolds gradually recover their original local structure from the spherical shape, with L_LIS dominating. It is worth noting that all of the above losses coexist rather than acting independently at different stages, but the role played by different losses varies due to the alternating training and weight graduality.

Figure 3: Schematic of the training strategy. Four different colors and shapes represent four intersecting manifolds, and the three stages involve the clustering, separation, and structure recovery of manifolds.
4 EXPERIMENTS
4.1 EXPERIMENTAL SETUPS
In this section, the effectiveness of the proposed framework is evaluated on 5 benchmark datasets: USPS, MNIST-full, MNIST-test (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and REUTERS-10K (Lewis et al., 2004), on which our method is compared with 8 other methods mentioned in Sec 2 using 8 evaluation metrics, including metrics designed specifically for clustering and for representation learning. Brief descriptions of the datasets are given in Appendix A.3.

Parameters settings.
The encoder is a multilayer perceptron (MLP) with dimensions d-500-500-2000-10, where d is the dimension of the input data, and the decoder is its mirror. After pretraining, in order to initialize the learnable clustering centers, t-SNE is applied to further transform the latent space Z to 2 dimensions, and then the K-Means algorithm is run to obtain the label assignments for each data point. The centers of each category in the latent space Z are set as the initial clustering centers {μ_j}_{j=1}^C. The batch size is set to 256, the number of epochs is set to 300, the parameter k for nearest neighbors is set to 3, and the parameter scale is set to 3 for all datasets. Besides, the Adam optimizer (Kingma & Ba, 2014) with learning rate λ = 0.001 is used. As described in Sec 3.3.2, weight graduality is applied to train the model. The weight parameter α for L_cluster decreases linearly from 0.1 to 0 within epochs 0-150. In contrast, the weight parameter β for loss L_LIS increases linearly from 0 to 1.0 within epochs 0-150. The implementation is based on PyTorch running on an NVIDIA V100 GPU.
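A sketch of the center-initialization procedure described above (pretrained embedding → t-SNE to 2-D → K-Means → per-cluster means in the original latent space Z) is given below; sklearn defaults are assumed except where the text specifies otherwise, and n_init is an illustrative choice.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def init_cluster_centers(z, n_clusters):
    """z: (N, 10) numpy array of latent codes from the pretrained encoder."""
    z_2d = TSNE(n_components=2).fit_transform(z)              # reduce Z further to 2 dimensions
    labels = KMeans(n_clusters=n_clusters, n_init=20).fit_predict(z_2d)
    centers = np.stack([z[labels == c].mean(axis=0) for c in range(n_clusters)])
    mu = torch.nn.Parameter(torch.tensor(centers, dtype=torch.float32))   # learnable centers in Z
    return mu, labels
```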
Evaluation Metrics.
Two standard evaluation metrics, Accuracy (ACC) and Normalized Mutual Information (NMI) (Xu et al., 2003), are used to evaluate clustering performance. Besides, six evaluation metrics are adopted in this paper to evaluate the performance of representation learning, including Relative Rank Error (RRE), Trustworthiness (Trust), Continuity (Cont), Root Mean Reconstruction Error (RMRE), Locally Geometric Distortion (LGD), and Cluster Rank Accuracy (CRA). Limited by space, their precise definitions are available in Appendix A.4.

4.2 EVALUATION OF CLUSTERING
4.2.1 QUANTITATIVE COMPARISON
The ACC/NMI metrics of different methods on various datasets are reported in Tab 1. For those comparison methods whose results are not reported on some datasets, we run the released code using the hyperparameters provided in their papers and label the results with (*). We find that our method outperforms K-Means and SC-Ncut by a significant margin and surpasses the other six competing DNN-based algorithms on all datasets except MNIST-test. Even on the MNIST-test dataset, we still rank second, outperforming the third by 1.1%. In particular, we obtain the best performance on the Fashion-MNIST dataset and, more notably, our clustering accuracy exceeds the current best method (N2D) by 3.8%. Although L_cluster is inspired by and highly consistent with the design of DEC, our method achieves much better clustering results. On MNIST-full, for example, our clustering accuracy is 11.7% and 9.9% higher than DEC and IDEC, respectively. (USPS dataset available at https://cs.nyu.edu/roweis/data.html)

Table 1: Clustering performance (ACC/NMI) of different algorithms on five datasets

Algorithms    MNIST-full   MNIST-test    USPS          Fashion-MNIST   REUTERS-10K
k-means       0.532/0.500  0.546/0.501   0.668/0.601   0.474/0.512     0.599/0.375*
SC-Ncut       0.656/0.731  0.660/0.704   0.649/0.794   0.508/0.575     0.658/0.401*
AE+k-means    0.818/0.747  0.815/0.784*  0.662/0.693   0.566/0.585*    0.721/0.432*
DEC           0.863/0.834  0.856/0.830   0.762/0.767   0.518/0.546     0.755/0.503*
IDEC          0.881/0.867  0.846/0.802   0.761/0.785   0.529/0.557     0.778/0.527*
JULE          0.964/0.913  0.961/0.915   0.950/0.913   0.563/0.608     0.797/0.551*
DSC           0.978/0.941
4.2.2 GENERALIZABILITY EVALUATION
Table 2: Generalizability evaluated by ACC/NMI
Algorithms     training samples   testing samples
AE+k-means     0.815/0.736        0.751/0.711
DEC            0.841/0.773        0.748/0.704
IDEC           0.845/0.860        0.826/0.842
JULE           0.958/0.907        0.921/0.895
DSC            0.975/0.939        0.969/0.921
N2D            0.974/0.930        0.965/0.911
DCRL (ours)
Tab 2 demonstrates that a learned DCRL model can generalize well to unseen data with high clustering accuracy. Taking MNIST-full as an example, DCRL was trained using 50,000 training samples and then tested on the remaining 20,000 testing samples using the learned model. In terms of the metrics ACC and NMI, our method is optimal for both training and testing samples. More importantly, there is hardly any degradation in the performance of our method on the testing samples compared to the training samples, while all other methods show a significant drop in performance, e.g., DEC from 84.1% to 74.8%. This demonstrates the importance of geometric structure maintenance for good generalizability. The testing visualization available in Appendix A.5 shows that DCRL still maintains clear inter-cluster boundaries even on the testing samples, which demonstrates the great generalizability of our method.

4.2.3 CLUSTERING VISUALIZATION
The visualization of DCRL and several comparison methods is shown in Fig 4 (visualized using UMAP). From the perspective of clustering, our method is much better than the other methods. Among all methods, only DEC, IDEC, and DCRL can hold clear boundaries between different clusters, while the cluster boundaries of the other methods are indistinguishable. Although DEC and IDEC can successfully separate different clusters, they group many data points from different classes into the same cluster. Most importantly, due to the use of the clustering-oriented loss, the embeddings learned by algorithms such as DEC, IDEC, JULE, and DSC (especially DSC) tend to form spheres and disrupt the original topological structure. Instead, our method overcomes these problems and achieves almost perfect separation between different clusters while preserving the local and global structure. Additionally, the embedding of the latent space during the training process is visualized in Appendix A.6, which is highly consistent with the three-stage explanation mentioned in Sec 3.3.2.

4.3 EVALUATION OF REPRESENTATION LEARNING
4.3.1 QUANTITATIVE COMPARISON
Although numerous previous works have claimed that they bring clustering and representation learning into a unified framework, they all, unfortunately, lack an analysis of the effectiveness of the learned representations. In this paper, we compare DCRL with the other five methods using six evaluation metrics on five datasets (limited by space, only MNIST-full results are provided in Tab 3; the complete results are in Appendix A.7). The results show that DCRL outperforms all other methods, especially on the CRA metric, on which it is not only the best on all datasets but also reaches 1.0. This means that the "rank" between different manifolds in the latent space is completely preserved and undamaged, which proves the effectiveness of our global ranking loss L_rank. Moreover, a statistical analysis is performed in this paper to show the extent to which local and global structure is preserved in the latent space by each algorithm. Limited by space, it is placed in Appendix A.8.

Figure 4: Visualization of the embedding learned by different algorithms on the MNIST-full dataset: (a) AE+K-Means, (b) DEC, (c) IDEC, (d) JULE, (e) DSC, (f) N2D, (g) DCRL (Ours).

Table 3: Performance for representation learning
Methods   RRE     Trust   Cont    RMSE    LGD     CRA
DEC       0.099   0.844   0.948   44.85   4.379   0.28
IDEC      0.009   0.998   0.979   24.58   1.714   0.33
JULE      0.026   0.936   0.983   28.34   2.129   0.27
DSC       0.097   0.873   0.925   6.98    1.198   0.23
N2D       0.010   0.992   0.984   5.71    0.699   0.21
DCRL
Table 4: Performance for downstream tasks
Methods   MLP     RFC     SVM     LR
AE        0.974   0.965   0.985   0.956
DEC       0.864   0.870   0.870   0.856
IDEC      0.979   0.973   0.985   0.965
JULE      0.980   0.982   0.978   0.974
DSC       0.962   0.950   0.983   0.975
N2D       0.979   0.980   0.979   0.979
DCRL
4.3.2 DOWNSTREAM TASKS
Recently, numerous deep clustering algorithms have claimed to obtain meaningful representations; however, they do not analyze or experiment with this so-called "meaningfulness". Therefore, we are interested to see whether these proposed methods can indeed learn representations that are useful for downstream tasks. Tab 4 compares DCRL with the other six methods on five datasets (limited by space, only MNIST-full results are provided in the paper; the complete results are in Appendix A.9). Four different classifiers, including a linear classifier (Logistic Regression; LR), two nonlinear classifiers (MLP, SVM), and a tree-based classifier (Random Forest Classifier; RFC), are used as downstream tasks, all of which use the default parameters and default implementations in sklearn (Pedregosa et al., 2011) for a fair comparison. The learned representations are frozen and used as input for training. The classification accuracy evaluated on the test set serves as a metric to evaluate the effectiveness of the learned representations. On the MNIST-full dataset, our method outperforms all the other methods. Moreover, we surprisingly find that with MLP and RFC as downstream tasks, all methods except DCRL cannot even match the accuracy of AE. Significantly, the performance of DEC on downstream tasks deteriorates sharply and even shows a large gap with the simplest AE, which once again shows that the clustering-oriented loss may damage the data's geometric structure.

4.4 ABLATION STUDY
This subsection evaluates the effects of the loss terms and training strategies in DCRL with five sets of experiments: the model without (A) Structure-oriented Loss (SL); (B) Clustering-oriented Loss (CL); (C) Weight Graduality (WG); (D) Alternating Training (AT); and (E) the full model. Limited by space, only MNIST-full results are provided in the paper, and results for the other four datasets are in Appendix A.10. After analyzing the results, we can conclude: (1) CL is the most important factor for obtaining good clustering, the lack of which leads to unsuccessful clustering; hence the corresponding numbers in the table are not very meaningful and are shown in gray. (2) SL not only brings subtle improvements in clustering but also greatly improves the performance of representation learning. (3) Our training strategies (WG and AT) both improve the performance of clustering and representation learning to some extent, especially on metrics such as RRE, Trust, Cont, and CRA.

Table 5: Ablation study of loss items and training strategies on the MNIST-full dataset
Methods      ACC/NMI      RRE     Trust    Cont     RMSE    LGD     CRA
w/o SL       0.976/0.939  0.0093  0.9967   0.9816   24.589  1.6747  0.32
w/o CL       0.814/0.736  0.0004  0.9998   0.9990   7.458   0.0487  1.00
w/o WG       0.977/0.943  0.0065  0.9987   0.9860   5.576   0.6968  0.98
w/o AT       0.978/0.944  0.0069  0.9986   0.9851   5.617   0.7037  0.96
full model
5 CONCLUSION
The proposed DCRL framework imposes clustering-oriented and structure-oriented constraints to optimize the latent space, simultaneously performing clustering and representation learning with local and global structure preservation. Extensive experiments on image and text datasets demonstrate that DCRL is not only comparable to state-of-the-art deep clustering algorithms but also able to learn effective and robust representations, which is beyond the capability of clustering methods that only care about clustering accuracy. Future work will focus on the adaptive determination of the number of manifolds (clusters) and on extending our work to larger-scale datasets.

REFERENCES
Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In IJCAI, pp. 1753–1759, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

Stan Z Li, Zelin Zhang, and Lirong Wu. Markov-Lipschitz deep learning. arXiv preprint arXiv:2006.08256, 2020.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

J MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Math., Stat., and Prob., pp. 281, 1965.

Ryan McConville, Raul Santos-Rodriguez, Robert J Piechocki, and Ian Craddock. N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding. arXiv preprint arXiv:1908.05968, 2019.

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. Learning deep representations for graph clustering. In AAAI, volume 14, pp. 1293–1299. Citeseer, 2014.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010.

Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487, 2016.

Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 267–273, 2003.

Bo Yang, Ming Xiang, and Yupei Zhang. Multi-manifold discriminant isomap for visualization and classification. Pattern Recognition, 55:215–230, 2016a.

Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156, 2016b.

Xu Yang, Cheng Deng, Feng Zheng, Junchi Yan, and Wei Liu. Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4066–4075, 2019.

Yan Zhang, Zhao Zhang, Jie Qin, Li Zhang, Bing Li, and Fanzhang Li. Semi-supervised local multi-manifold isomap by linear embedding for feature extraction. Pattern Recognition, 76:662–678, 2018.

APPENDIX
A.1 GRADIENT DERIVATION
In the paper, we have emphasized that {μ_j}_{j=1}^C is a set of learnable parameters, which means that we can optimize it while optimizing the network parameters θ_f. In Eq. (4) of the paper, we presented the gradient of L_cluster with respect to μ_j. In addition to L_cluster, both L_rank and L_align involve μ_j. Hence, the detailed derivations of the gradients of L_rank and L_align with respect to μ_j are also provided. The gradient of L_rank with respect to each learnable cluster center μ_j can be computed as:

$$\frac{\partial L_{rank}}{\partial \mu_j} = \frac{\partial \sum_{i'=1}^{C} \sum_{j'=1}^{C} \left| d_Z(\mu_{i'}, \mu_{j'}) - scale \cdot d_X\left(v_{i'}^X, v_{j'}^X\right) \right|}{\partial \mu_j} = \sum_{i'=1}^{C} \sum_{j'=1}^{C} \frac{\partial \left| d_Z(\mu_{i'}, \mu_{j'}) - scale \cdot d_X\left(v_{i'}^X, v_{j'}^X\right) \right|}{\partial \mu_j} \quad (11)$$

The Euclidean metric is used for both the input space and the latent space, i.e., $d_Z(\mu_{i'}, \mu_{j'}) = \|\mu_{i'} - \mu_{j'}\|$. In addition, for a clearer derivation we slightly abuse notation and write $K$ for $scale \cdot d_X\left(v_{i'}^X, v_{j'}^X\right)$. Accordingly, Eq. (11) can be further derived as follows:

$$\begin{aligned}
\frac{\partial L_{rank}}{\partial \mu_j}
&= \sum_{i'=1}^{C} \frac{\partial \left| \|\mu_{i'} - \mu_j\| - K \right|}{\partial \mu_j} + \sum_{j'=1}^{C} \frac{\partial \left| \|\mu_j - \mu_{j'}\| - K \right|}{\partial \mu_j} \\
&= \sum_{i'=1}^{C} \frac{\partial \|\mu_{i'} - \mu_j\|}{\partial \mu_j} \cdot \frac{\|\mu_{i'} - \mu_j\| - K}{\left| \|\mu_{i'} - \mu_j\| - K \right|} + \sum_{j'=1}^{C} \frac{\partial \|\mu_j - \mu_{j'}\|}{\partial \mu_j} \cdot \frac{\|\mu_j - \mu_{j'}\| - K}{\left| \|\mu_j - \mu_{j'}\| - K \right|} \\
&= 2 \sum_{i'=1}^{C} \frac{\mu_j - \mu_{i'}}{\|\mu_j - \mu_{i'}\|} \cdot \frac{\|\mu_j - \mu_{i'}\| - K}{\left| \|\mu_j - \mu_{i'}\| - K \right|} \\
&= 2 \sum_{i'=1}^{C} \frac{\mu_j - \mu_{i'}}{d_Z(\mu_j, \mu_{i'})} \cdot \frac{d_Z(\mu_j, \mu_{i'}) - scale \cdot d_X\left(v_{i'}^X, v_j^X\right)}{\left| d_Z(\mu_j, \mu_{i'}) - scale \cdot d_X\left(v_{i'}^X, v_j^X\right) \right|}
\end{aligned} \quad (12)$$

The gradient of L_align with respect to each learnable cluster center μ_j can be computed as:

$$\frac{\partial L_{align}}{\partial \mu_j} = \frac{\partial \sum_{j'=1}^{C} \|\mu_{j'} - v_{j'}^Z\|}{\partial \mu_j} = \frac{\partial \|\mu_j - v_j^Z\|}{\partial \mu_j} = \frac{\mu_j - v_j^Z}{\left\| \mu_j - v_j^Z \right\|} \quad (13)$$
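The closed-form gradients above can be verified numerically against PyTorch autograd; a minimal check for Eq. (13) is sketched below.

```python
import torch

C, m = 4, 10
mu = torch.randn(C, m, requires_grad=True)       # learnable cluster centers
v_z = torch.randn(C, m)                          # latent-space cluster centers (fixed here)

loss = (mu - v_z).norm(dim=1).sum()              # L_align = sum_j ||mu_j - v_j^Z||
loss.backward()

manual = (mu - v_z) / (mu - v_z).norm(dim=1, keepdim=True)    # closed-form gradient, Eq. (13)
print(torch.allclose(mu.grad, manual.detach(), atol=1e-6))    # expected output: True
```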
A.2 ALGORITHM

Algorithm 1: Algorithm for Deep Clustering and Representation Learning
Input: Input samples X; number of clusters C; number of batches B; number of iterations E.
Output: Autoencoder's weights θ_f and θ_g; cluster labels {s_i}_{i=1}^N; trainable cluster centers {μ_j}_{j=1}^C.
1: Initialize the weights {μ_j}_{j=1}^C, θ_f, and θ_g, and obtain the initialized soft label assignment {s_i}_{i=1}^N.
2: for epoch ∈ {1, ..., E} do
3:   Compute embedded points {z_i}_{i=1}^N and distribution Q;
4:   Update target distribution P;
5:   Compute soft cluster centers {v_i^X}_{i=1}^C and {v_i^Z}_{i=1}^C.
6:   for batch ∈ {1, ..., B} do
7:     Pick one batch of samples X_batch from X;
8:     Compute the corresponding distribution Q_batch and its reconstruction Y_batch;
9:     Pick the target distribution batch P_batch from P;
10:    Compute the losses L_AE, L_cluster, and L_rank;
11:    Update the weights θ_f, θ_g, and {μ_j}_{j=1}^C.
12:  end for
13:  Compute L_LIS and L_align on all samples;
14:  Update the weights θ_f and {μ_j}_{j=1}^C;
15:  Assign new soft labels {s_i}_{i=1}^N.
16: end for
17: return θ_f, θ_g, {s_i}_{i=1}^N, {μ_j}_{j=1}^C.
A.3 DATASETS
To show that our method works well with various kinds of datasets, we choose the following five image and text datasets. Some example images are shown in Fig A1, and brief descriptions of the datasets are given in Tab A1.

Table A1: Description of Datasets

Dataset         Samples   Categories   Data Size
MNIST-full      70000     10           28 × 28
MNIST-test      10000     10           28 × 28
USPS            9298      10           16 × 16
Fashion-MNIST   70000     10           28 × 28
REUTERS-10K     10000     4            2000

• MNIST-full (LeCun et al., 1998): The MNIST-full dataset consists of 70,000 handwritten digits of 28 × 28 pixels. Each gray image is reshaped to a 784-dimensional vector.

• MNIST-test (LeCun et al., 1998): MNIST-test is the testing part of the MNIST dataset, which contains a total of 10,000 samples.

• USPS: The USPS dataset is composed of 9,298 gray-scale handwritten digit images with a size of 16 × 16 pixels. (https://cs.nyu.edu/roweis/data.html)

• Fashion-MNIST (Xiao et al., 2017): The Fashion-MNIST dataset has the same number of images and the same image size as MNIST-full, but it is fairly more complicated. Instead of digits, it consists of various types of fashion products.

• REUTERS-10K: REUTERS (Lewis et al., 2004) is composed of around 810,000 English news stories labeled with a category tree. Four root categories (corporate/industrial, government/social, markets, and economics) are used as labels, and all documents with multiple labels are excluded. Following DEC (Xie et al., 2016), a subset of 10,000 examples is randomly sampled, and the tf-idf features on the 2,000 most frequent words are computed. The sampled dataset is denoted REUTERS-10K.
Figure A1: Image samples from three datasets (MNIST, USPS, and Fashion-MNIST).

A.4 DEFINITIONS OF PERFORMANCE METRICS
The following notations are used for the definitions:
d_X(i, j): the pairwise distance between x_i and x_j in input space X;
d_Z(i, j): the pairwise distance between z_i and z_j in latent space Z;
N_i^{k,X}: the set of indices of the k-nearest neighbors (kNN) of x_i in input space X;
N_i^{k,Z}: the set of indices of the k-nearest neighbors (kNN) of z_i in latent space Z;
r_X(i, j): the rank of the closeness of x_j to x_i in input space X;
r_Z(i, j): the rank of the closeness of z_j to z_i in latent space Z.

The eight evaluation metrics are defined below:

(1) ACC (Accuracy) measures the accuracy of clustering:

$$\mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{N} \mathbb{1}\{l_i = m(s_i)\}}{N}$$

where l_i and s_i are the true and predicted labels for data point x_i, respectively, and m(·) ranges over all possible one-to-one mappings between clusters and label categories.

(2) NMI (Normalized Mutual Information) calculates the normalized measure of similarity between two labelings of the same data:

$$\mathrm{NMI} = \frac{I(l; s)}{\max\{H(l), H(s)\}}$$

where I(l, s) is the mutual information between the real label l and the predicted label s, and H(·) represents their entropy.

(3) RRE (Relative Rank Change) measures the average change in neighbor ranking between the two spaces X and Z:

$$\mathrm{RRE} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left\{ MR_{X \to Z}^{k} + MR_{Z \to X}^{k} \right\}$$

where k_1 and k_2 are the lower and upper bounds of the kNN, and

$$MR_{X \to Z}^{k} = \frac{1}{H_k} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,Z}} \frac{|r_X(i,j) - r_Z(i,j)|}{r_Z(i,j)}, \qquad MR_{Z \to X}^{k} = \frac{1}{H_k} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,X}} \frac{|r_X(i,j) - r_Z(i,j)|}{r_X(i,j)}$$

where H_k is the normalizing term, defined as $H_k = N \sum_{l=1}^{k} \frac{|N - 2l|}{l}$.

(4) Trust (Trustworthiness) measures to what extent the k nearest neighbors of a point are preserved when going from the input space to the latent space:

$$\mathrm{Trust} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left[ 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,Z},\, j \notin \mathcal{N}_i^{k,X}} \left( r_X(i,j) - k \right) \right]$$

where k_1 and k_2 are the bounds of the number of nearest neighbors.

(5) Cont (Continuity) is defined analogously to Trust, but checks to what extent neighbors are preserved when going from the latent space to the input space:

$$\mathrm{Cont} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left[ 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \notin \mathcal{N}_i^{k,Z},\, j \in \mathcal{N}_i^{k,X}} \left( r_Z(i,j) - k \right) \right]$$

where k_1 and k_2 are the bounds of the number of nearest neighbors.

(6) RMSE (Root Mean Square Error) measures to what extent the two distributions of distances coincide:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( d_X(i,j) - d_Z(i,j) \right)^2}$$

(7) LGD (Locally Geometric Distortion) measures how much corresponding distances between neighboring points differ in the two metric spaces and is the primary metric for isometry, defined as:

$$\mathrm{LGD} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \sqrt{ \frac{1}{M} \sum_{i=1}^{M} \frac{\sum_{j \in \mathcal{N}_i^{k}} \left( d_X(i,j) - d_Z(i,j) \right)^2}{|\mathcal{N}_i^{k}|} }$$

where k_1 and k_2 are the lower and upper bounds of the kNN.

(8) CRA (Cluster Rank Accuracy) measures the changes in the ranks of cluster centers from the input space X to the latent space Z (a small computational sketch is given after this list):

$$\mathrm{CRA} = \frac{\sum_{i=1}^{C} \sum_{j=1}^{C} \mathbb{1}\left( r_X(v_i^X, v_j^X) = r_Z(v_i^Z, v_j^Z) \right)}{C^2}$$

where C is the number of clusters, v_j^X is the cluster center of the j-th cluster in the input space X, v_j^Z is the cluster center of the j-th cluster in the latent space Z, r_X(v_i^X, v_j^X) denotes the rank of the closeness of v_i^X to v_j^X in the input space X, and r_Z(v_i^Z, v_j^Z) denotes the rank of the closeness of v_i^Z to v_j^Z in the latent space Z.
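As an illustration, a small sketch of how CRA can be computed from the two sets of cluster centers is given below; the rank matrices are obtained with a simple argsort-based ranking, which is an implementation assumption.

```python
import numpy as np

def rank_matrix(centers):
    """rank[i, j] = rank of the closeness of center j to center i (0 for the center itself)."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return d.argsort(axis=1).argsort(axis=1)

def cluster_rank_accuracy(v_x, v_z):
    """CRA: fraction of the C x C center pairs whose closeness rank is preserved from X to Z."""
    return (rank_matrix(v_x) == rank_matrix(v_z)).mean()
```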
A.5 VISUALIZATION IN GENERALIZABILITY

The visualization results on the testing samples are shown in Fig A2; even for testing samples, our method still shows distinguishable inter-cluster discriminability, while all the other methods, without exception, couple different clusters together.

Figure A2: Visualization of the obtained embeddings on the testing samples, showing the generalization performance of different algorithms on the MNIST-full dataset: (a) AE+K-Means, (b) DEC, (c) IDEC, (d) JULE, (e) DSC, (f) N2D, (g) DCRL (ours).

A.6 VISUALIZATION IN DIFFERENT STAGES
The embedding of the latent space during the training process is visualized in Fig A3 to depict how both clustering and structure preservation are achieved. We can see that the different clusters initialized by the pretrained autoencoder are closely adjacent. In the early stage of training, with the clustering loss L_cluster and the global ranking loss L_rank, different manifolds are separated from each other, each manifold loses its local structure, and all of them degenerate into spheres. As training progresses, the weight α for L_cluster gradually decreases, while the weight β for L_LIS increases, and the optimization gradually shifts its focus from global to local, with each manifold gradually recovering its original geometric structure from the sphere. Moreover, since our local isometry loss L_LIS is constrained within each manifold, the preservation of the local structure will not disrupt the global ranking. Finally, we obtain representations in which cluster boundaries are clearly distinguished, and local and global structures are perfectly preserved.

Figure A3: Clustering visualization at different stages of training on the MNIST-full dataset (epochs 0, 9, 19, 29, 69, 119, 159, 209, 249, and 299).

A.7 STATISTICAL ANALYSIS
A statistical analysis is presented to show the extent to which local and global structure is preserved from the input space to the latent space. Taking MNIST-full as an example, the statistical analysis of global rank-preservation is shown in Fig A4 (a)-(f). For the i-th cluster, if the rank between it and the j-th cluster is preserved from the input space to the latent space, then the grid cell in the i-th row and j-th column is set to blue, otherwise yellow. As shown in the figure, only our method can fully preserve the global rank between different clusters, while all other methods fail.

Finally, we perform a statistical analysis of the local isometry property of each algorithm. Each sample x_i in the dataset forms a number of point pairs with its neighborhood samples {(x_i, x_j) | i = 1, 2, ..., N; x_j ∈ N_i^X}. We compute the difference in the distance of these point pairs from the input space to the latent space, {d_Z(x_i, x_j) − d_X(x_i, x_j) | i = 1, 2, ..., N; x_j ∈ N_i}, and plot it as a histogram. As shown in Fig A4 (g), the curve of DCRL is distributed on both sides of the 0 value, with the maximum peak height and the minimum peak-bottom width, which indicates that DCRL achieves the best local isometry. Although IDEC claims that it can preserve the local structure well, there is still a big gap between its results and ours.

Figure A4: Statistical analysis of different algorithms to compare the capability of global and local structure preservation from the input space to the latent space: (a) DEC, (b) IDEC, (c) JULE, (d) DSC, (e) N2D, (f) DCRL, (g) Local Isometry.

A.8 QUANTITATIVE EVALUATION OF REPRESENTATION LEARNING
Datasets Algorithms RRE Trust Cont RMSE LGD CRA
MNIST-full DEC 0.09988 0.84499 0.94805 44.8535 4.37986 0.28IDEC 0.00984 0.99821 0.97936 24.5803 1.71484 0.33JULE 0.02657 0.93675 0.98321 28.3412 2.12955 0.27DSC 0.09785 0.87315 0.92508 6.98098 1.19886 0.23N2D 0.01002 0.99243 0.98466 5.7162 0.69946 0.21DCRL
MNIST-test DEC 0.12800 0.81841 0.91767 14.6113 2.29499 0.19IDEC 0.01505 0.99403 0.97082 7.4599 1.08350 0.38JULE 0.04122 0.92971 0.97208 9.4768 1.17176 0.42DSC 0.10728 0.85498 0.92254 7.1689 1.19239 0.26N2D 0.01565 0.98764 0.97572
USPS DEC 0.07911 0.88871 0.94628 16.4355 1.77848 0.31IDEC 0.01043 0.99726 0.97960 13.0573 1.11689 0.30JULE 0.02972 0.98763 0.98810 14.6324 1.43426 0.33DSC 0.06319 0.9151 0.93988 8.4412 1.02131 0.27N2D 0.01337 0.98769 0.98135 8.1961 0.54967 0.37DCRL
Fasion-MNIST DEC 0.04787 0.93896 0.95450 39.3274 3.87731 0.37IDEC 0.01089 0.99683 0.97797 25.4024 1.91385 0.27JULE 0.03013 0.97732 0.97923 15.2213 1.43642 0.43DSC 0.05168 0.95013 0.96121 17.2201 1.42091 0.36N2D 0.00894 0.99062 0.98054 14.49079
REUTERS-10K DEC 0.26192 0.65518 0.80477 40.4671 4.00423 0.63IDEC 0.05981 0.95840 0.90550 43.9556 2.01365 0.75JULE 0.11230 0.87628 0.93232 46.4287 2.78210 0.56DSC 0.20820 0.74312 0.83672 38.8720 1.89721 0.50N2D 0.03827 0.97385 0.93412 36.1042
UANTITATIVE E VALUATION OF D OWNSTREAM T ASKS
Tab A3 compares DCRL with the other six methods on five datasets to see whether these methodscan indeed learn representations that are useful for downstream tasks. As shown in the table, ourmethod outperforms the other methods on all five datasets with MLP, RFC, LR as downstream tasks.Table A3: Performance of different algorithms in downstream tasks
Datasets Algorithms MLP RFC SVM LR
MNIST-full AE 0.9746 0.9652 0.9859 0.9565DEC 0.8647 0.8706 0.8707 0.8566IDEC 0.9797 0.9737 0.9852 0.9650JULE 0.9802 0.9825 0.9787 0.9743DSC 0.9622 0.9501 0.9837 0.9752N2D 0.9796 0.9803 0.9799 0.9792DCRL
MNIST-test AE 0.9415 0.9420 0.9745 0.9495DEC 0.8525 0.8605 0.8725 0.8685IDEC 0.9740 0.9725 0.9845 0.9655JULE 0.9775 0.9845 0.9800 0.9825DSC 0.9535 0.9740 0.9825 0.9795N2D 0.9715 0.9760 0.9725 0.9725DCRL
USPS AE 0.9421 0.9469 0.9677 0.9073DEC 0.8289 0.8668 0.8289 0.8294IDEC 0.9482 0.9556 0.9656 0.9125JULE 0.9576 0.9617
Fasion-MNIST AE 0.8613 0.9932 0.8314 0.7588DEC 0.6268 0.9853 0.6377 0.6245IDEC 0.8367 0.9918
REUTERS-10K AE 0.9325 0.9170 0.9375 0.8205DEC 0.7985 0.7880 0.8105 0.7450IDEC 0.9225 0.8930 0.9280 0.7705JULE 0.9315 0.9035 0.9185 0.8165DSC 0.9045 0.8835 0.9175 0.8115N2D 0.9205 0.9080 0.9240 0.8335DCRL (ours)
ORE ABLATION EXPERIMENTS
The results of the ablation experiments on the MNIST-full dataset have been presented in Tab 5in Sec 4.3. Here, we provide four more sets of ablation experiments on the other four datasets.The conclusion is similar (note that for the clustering performance of the model without clustering-oriented losses is very poorly, so the “best” metric numbers are not meaningful and are shown ingray color): (1) CL is very important for obtaining good clustering. (2) SL is beneficial for bothclustering and representation learning. (3) Our training strategies (WG and AT) are very superior inimproving metrics such as ACC, RRE, Trust, Cont, and CRA.Table A4: Ablation study of loss items and training strategies used in DCRL
Datasets Methods ACC/NMI RRE Trust Cont RMSE LGD CRA w/o SL 0.976/0.939 0.0093 0.9967 0.9816 24.589 1.6747 0.32w/o CL 0.814/0.736 0.0004 0.9998 0.9990 7.458 0.0487 1.00w/o WG 0.977/0.943 0.0065 0.9987 0.9860 5.576 0.6968 0.98w/o AT 0.978/0.944 0.0069 0.9986 0.9851 5.617 0.7037 0.96MNIST-full full model w/o SL w/o AT 0.970/0.929 0.0118 0.9974 0.9747 5.567 0.9404
MNIST-test full model 0.972/0.930 w/o SL 0.958/0.902 0.0095 0.9967 0.9812 14.609 0.9847 0.29w/o CL 0.664/0.658 0.0020 0.9996 0.9952 2.934 0.0687 1.0w/o WG 0.956/0.896 0.0060 0.9991 0.9868 6.572 0.5335 w/o AT 0.947/0.885 0.0080 0.9979 0.9833
USPS full model w/o SL 0.706/0.682 0.0108 0.9964 0.9781 25.954 1.8936 0.30w/o CL 0.576/0.569 0.0004 0.9994 0.9995 7.654 0.0523 1.00w/o WG 0.702/0.695 0.0084 0.9972 0.9814 w/o AT 0.708/0.694 0.0097 0.9975 0.9798 13.354 1.3611
Fasion-MNIST full model w/o SL 0.819/0.564 0.0529 0.9610 0.9185 44.481 1.9090 0.38w/o CL 0.542/0.279 0.0277 0.9868 0.9456 37.018 2.2294 1.00w/o WG 0.830/0.583 0.0420 0.9667 0.9361 35.302 2.8286 w/o AT 0.825/0.563 0.0440 0.9650 0.9330 39.275 2.9146
REUTERS-10K full model0.836/0.590 0.0320 0.9838 0.9380 34.547 2.7209 1.00