Deep Clustering and Representation Learning with Geometric Structure Preservation
Lirong Wu, Zicheng Liu, Zelin Zang, Jun Xia, Siyuan Li, Stan. Z Li
AI Lab, School of Engineering, Westlake University & Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, Zhejiang, China
{wulirong, liuzicheng, zelin zang, xiajun, lisiyuan, stan.zq.li}@westlake.edu.cn

ABSTRACT
In this paper, we propose a novel framework for Deep Clustering and Representation Learning (DCRL) that preserves the geometric structure of data. In the proposed DCRL framework, manifold clustering is done in the latent space guided by a clustering loss. To overcome the problem that clustering-oriented losses may deteriorate the geometric structure of embeddings in the latent space, an isometric loss is proposed for preserving intra-manifold structure locally and a ranking loss for inter-manifold structure globally. Experimental results on various datasets show that the DCRL framework leads to performances comparable to current state-of-the-art deep clustering algorithms, yet exhibits superior performance for downstream tasks. Our results also demonstrate the importance and effectiveness of the proposed losses in preserving geometric structure in terms of visualization and performance metrics.
1 INTRODUCTION
Clustering, a fundamental tool for data analysis and visualization, has been an essential research topic in data science and machine learning. Conventional clustering algorithms such as K-Means (MacQueen, 1965), Gaussian Mixture Models (GMM) (Bishop, 2006), and spectral clustering (Shi & Malik, 2000) perform clustering based on distance or similarity. However, handcrafted distance or similarity measures are rarely reliable for large-scale high-dimensional data, making it increasingly challenging to achieve effective clustering. An intuitive solution is to transform the data from the high-dimensional input space to the low-dimensional latent space and then to cluster the data in the latent space. This can be achieved by applying dimensionality reduction techniques such as PCA (Wold et al., 1987), t-SNE (Maaten & Hinton, 2008), and UMAP (McInnes et al., 2018). However, since these methods are not specifically designed for clustering tasks, some of their properties may be contrary to our expectations, e.g., two data points from different manifolds that are close in the input space will be even closer in the latent space derived by UMAP. Therefore, the first question here is how to learn a representation that favors clustering?

The two main points of multi-manifold representation learning are (1) preserving the local geometric structure within each manifold and (2) ensuring the discriminability between different manifolds. However, it is challenging to decouple complex cross-over relations and ensure the discriminability between different manifolds, especially in unsupervised situations. One natural strategy is to perform clustering in the input space to get pseudo-labels and then perform representation learning for each manifold. However, in that case, representation learning's performance depends heavily on the clustering effect, but commonly used clustering algorithms such as K-Means do not work well on high-dimensional data. Thus, the second question here is how to cluster data in a way that favors representation learning?

To answer these two questions, some pioneering work has proposed integrating deep clustering and representation learning into a unified framework by defining a clustering-oriented loss. Though promising performance has been demonstrated on various datasets, we observe that a vital factor has been ignored by these works: the defined clustering-oriented loss may deteriorate the geometric structure of the latent space, which in turn hurts the performance of visualization, clustering, generalization, and downstream tasks. In this paper, we propose to jointly perform deep clustering and representation learning with geometric structure preservation. Inspired by Xie et al. (2016), the clustering centers are defined as a set of learnable parameters, and we use a clustering loss to simultaneously guide the separation of data points from different manifolds and the learning of the clustering centers. To prevent the clustering loss from deteriorating the latent space, an isometric loss and a ranking loss are proposed to preserve the intra-manifold structure locally and the inter-manifold structure globally.
Our experimental results show that our method exhibits far superior performance to counterparts in terms of clustering and representation learning, which demonstrates the importance and effectiveness of preserving geometric structure. The contributions of this work are summarized as below:

• Proposing to integrate deep clustering and representation learning into a unified framework with local and global structure preservation.

• Unlike conventional multi-manifold learning algorithms that deal with all point-pair relationships between different manifolds simultaneously, we set the clustering centers as a set of learnable parameters and achieve global structure preservation in a faster, more efficient, and easier-to-optimize manner by applying a ranking loss to the clustering centers.

• Analyzing the contradiction between the two optimization goals of clustering and local structure preservation, and proposing an elegant training strategy to alleviate it.

• The proposed DCRL algorithm outperforms competing algorithms in terms of clustering effect, generalizability to out-of-sample data, and performance in downstream tasks.
2 RELATED WORK
Clustering analysis. As a fundamental tool in machine learning, clustering has been widely applied in various domains. One branch of classical clustering is K-Means (MacQueen, 1965) and Gaussian Mixture Models (GMM) (Bishop, 2006), which are fast, easy to understand, and can be applied to a large number of problems. However, limited by the Euclidean measure, their performance on high-dimensional data is often unsatisfactory. Spectral clustering and its variants (such as SC-Ncut (Shi & Malik, 2000)) extend clustering to high-dimensional data by allowing more flexible distance measures. However, limited by the computational cost of the full Laplacian matrix, spectral clustering is challenging to extend to large-scale datasets.

Deep clustering. The success of deep learning has contributed to the growth of deep clustering. One branch of deep clustering performs clustering after learning a representation through existing unsupervised techniques. For example, Tian et al. (2014) use an autoencoder to learn low-dimensional features and then run K-Means to get clustering results (AE+K-Means). Considering the geometric structure of the data, N2D applies UMAP to find the best clusterable manifold of the obtained embedding and then runs K-Means to discover higher-quality clusters (McConville et al., 2019). The other category of algorithms tries to optimize clustering and representation learning jointly. The closest work to ours is Deep Embedding for Clustering (DEC) (Xie et al., 2016), which learns a mapping from the input space to a lower-dimensional latent space through iteratively optimizing a clustering objective. As a modified version of DEC, while IDEC claims to preserve the local structure of the data (Guo et al., 2017), in reality, their contribution is nothing more than adding a reconstruction loss. JULE proposes a recurrent framework for integrating clustering and representation learning into a single model with a weighted triplet loss and optimizing it end-to-end (Yang et al., 2016b). DSC devises a dual autoencoder to embed data into the latent space, and then deep spectral clustering (Shaham et al., 2018) is applied to obtain label assignments (Yang et al., 2019).

Manifold Representation Learning. Isomap, as a representative algorithm of single-manifold learning, aims to capture global nonlinear features and seek an optimal subspace that best preserves the geodesic distance between data points (Tenenbaum et al., 2000). In contrast, some algorithms, such as LLE (Roweis & Saul, 2000), are more concerned with the preservation of local neighborhood information. Combining DNNs with manifold learning, the recently proposed MLDL algorithm achieves the preservation of local and global geometries by imposing LIS prior constraints (Li et al., 2020). Furthermore, multi-manifold learning is proposed to obtain intrinsic properties of different manifolds. Yang et al. (2016a) proposed a supervised MMD-Isomap where data points are partitioned into different manifolds according to label information. Similarly, Zhang et al. (2018) proposed a semi-supervised local multi-manifold learning framework, termed SSMM-Isomap, that applies the labeled and unlabeled training samples to perform joint learning of local neighborhood-preserving features. In most previous work on multi-manifold learning, the problem is considered from the perspective that the labels are known or partially known, which significantly simplifies the problem.
For unsupervised multi-manifold learning, it is still very challenging to decouple multiple overlapping manifolds, and that is exactly what this paper aims to explore.
3 PROPOSED METHOD
Consider a dataset X with N samples, where each sample x_i ∈ R^d is sampled from one of C different manifolds {M_c}_{c=1}^C. We assume that each category in the dataset lies on a compact low-dimensional manifold and that the number of manifolds C is prior knowledge. Define two nonlinear mappings z_i = f(x_i; θ_f) and y_i = g(z_i; θ_g), where z_i ∈ R^m is the embedding of x_i in the latent space and y_i is the reconstruction of x_i. The j-th cluster center is denoted as μ_j ∈ R^m, where {μ_j}_{j=1}^C is defined as a set of learnable parameters. We aim to find optimal parameters θ_f so that the embedded features {z_i}_{i=1}^N achieve clustering with local and global structure preservation. To this end, a denoising autoencoder (Vincent et al., 2010), shown in Fig 1, is first pre-trained in an unsupervised manner to learn an initial latent space. The denoising autoencoder optimizes the self-reconstruction loss $L_{AE} = \mathrm{MSE}(\hat{x}, y)$, where $\hat{x}$ is a copy of $x$ with Gaussian noise added, that is, $\hat{x} = x + \mathcal{N}(0, \sigma)$. The autoencoder is then fine-tuned by optimizing the clustering-oriented loss $L_{cluster}(z, \mu)$ and the structure-oriented losses $L_{rank}(x, \mu)$, $L_{LIS}(x, z)$, and $L_{align}(z, \mu)$. Since clustering should be performed on features of clean data rather than the noised data $\hat{x}$ used by the denoising autoencoder, the clean data $x$ is used for fine-tuning.

Figure 1: The framework of the proposed DCRL method. The encoder, decoder, latent space, and cluster centers are marked as blue, red, green, and purple, respectively.
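A minimal PyTorch sketch of this pretraining stage is given below. The layer widths follow the d-500-500-2000-10 architecture reported in Sec 4.1, while the noise level sigma, the ReLU activations, and the number of pretraining epochs are illustrative assumptions rather than the authors' exact settings. The reconstruction target here is the clean x (the usual denoising convention); matching the noisy copy instead, as the formula L_AE = MSE(x̂, y) literally reads, is a one-line change.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """MLP autoencoder: f (encoder) maps x to z, g (decoder) maps z back to y."""
    def __init__(self, d_in, d_latent=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, 2000), nn.ReLU(),
            nn.Linear(2000, d_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, 2000), nn.ReLU(),
            nn.Linear(2000, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, d_in),
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def pretrain(model, loader, sigma=0.2, epochs=50, lr=1e-3):
    """Denoising pretraining: feed x + Gaussian noise and reconstruct with an MSE loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x in loader:                                # x: (batch, d_in)
            x_hat = x + sigma * torch.randn_like(x)     # noisy copy x_hat = x + N(0, sigma)
            _, y = model(x_hat)
            loss = mse(y, x)                            # reconstruct the clean x
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```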
3.1 CLUSTERING-ORIENTED LOSS
First, the cluster centers {μ_j}_{j=1}^C in the latent space Z are initialized (the initialization method will be introduced in Sec 4.1). Then the similarity between the embedded point z_i and cluster center μ_j is measured by Student's t-distribution:

$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2\right)^{-1}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2\right)^{-1}} \quad (1)$$

The auxiliary target distribution is designed to help manipulate the latent space, defined as:

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \quad \text{where } f_j = \sum_i q_{ij} \quad (2)$$

where f_j is the normalized cluster frequency, used to balance the size of different clusters. Then the encoder is optimized by the following objective:

$$L_{cluster} = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \quad (3)$$

The gradient of L_cluster with respect to each learnable cluster center μ_j can be computed as:

$$\frac{\partial L_{cluster}}{\partial \mu_j} = -2 \sum_i \left(1 + \|z_i - \mu_j\|^2\right)^{-1} (p_{ij} - q_{ij})(z_i - \mu_j) \quad (4)$$

L_cluster facilitates the aggregation of data points within the same manifold, while data points from different manifolds are kept away from each other. However, we find that the clustering-oriented loss may deteriorate the geometric structure of the latent space, which hurts the clustering accuracy and leads to meaningless representations. To prevent this deterioration, we introduce an isometry loss L_LIS and a ranking loss L_rank to preserve the local and global structure, respectively.
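Eqs. (1)-(3) can be written compactly in PyTorch; the sketch below is a paraphrase of the DEC-style objective described above, not the authors' released code.

```python
import torch

def soft_assignment(z, mu):
    """Eq. (1): q_ij = (1 + ||z_i - mu_j||^2)^(-1), normalized over clusters j."""
    q = 1.0 / (1.0 + torch.cdist(z, mu) ** 2)      # (N, C)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Eq. (2): p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'), with f_j = sum_i q_ij."""
    weight = q ** 2 / q.sum(dim=0, keepdim=True)
    return (weight / weight.sum(dim=1, keepdim=True)).detach()

def cluster_loss(z, mu, p):
    """Eq. (3): L_cluster = KL(P || Q); gradients reach both the encoder (via z) and mu."""
    q = soft_assignment(z, mu)
    return (p * (p.log() - q.log())).sum()
```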
3.2 STRUCTURE-ORIENTED LOSS

Intra-manifold Isometry Loss.
The intra-manifold local structure is preserved by optimizing the following objective:

$$L_{LIS} = \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^Z} \left| d_X(x_i, x_j) - d_Z(z_i, z_j) \right| \cdot \pi\left( l(x_i) = l(x_j) \right) \quad (5)$$

where $\mathcal{N}_i^Z$ represents the neighborhood of data point z_i in the feature space Z, and kNN is applied to determine the neighborhood. π(·) ∈ {0, 1} is an indicator function, and l(x_i) is a manifold determination function that returns the manifold s_i where sample x_i is located, that is, s_i = l(x_i) = argmax_j p_ij. Then we can derive C manifolds {M_c}_{c=1}^C: M_c = {x_i ; s_i = c, i = 1, 2, ..., N}. In a nutshell, the loss L_LIS constrains the isometry within each manifold.
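A sketch of Eq. (5) follows: for each point, distances to its k latent-space neighbors are matched between X and Z, restricted to neighbors assigned to the same manifold. Plain Euclidean distances are assumed for both d_X and d_Z, which is an implementation assumption.

```python
import torch

def isometry_loss(x, z, p, k=3):
    """Eq. (5): sum of |d_X(x_i, x_j) - d_Z(z_i, z_j)| over kNN pairs (in Z) from the same manifold."""
    s = p.argmax(dim=1)                                    # s_i = l(x_i) = argmax_j p_ij
    d_x = torch.cdist(x, x)                                # pairwise distances in input space X
    d_z = torch.cdist(z, z)                                # pairwise distances in latent space Z
    knn = d_z.topk(k + 1, largest=False).indices[:, 1:]    # k nearest neighbors of z_i, self excluded
    same = (s.unsqueeze(1) == s[knn]).float()              # indicator pi(l(x_i) = l(x_j))
    rows = torch.arange(x.size(0)).unsqueeze(1).expand_as(knn)
    return ((d_x[rows, knn] - d_z[rows, knn]).abs() * same).sum()
```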
Inter-manifold Ranking Loss.
The inter-manifold global structure is preserved by optimizing the following objective:

$$L_{rank} = \sum_{i=1}^{C} \sum_{j=1}^{C} \left| d_Z(\mu_i, \mu_j) - scale \cdot d_X\left(v_i^X, v_j^X\right) \right| \quad (6)$$

where {v_j^X}_{j=1}^C are defined as the centers of the different manifolds in the original input space X, with $v_j^X = \frac{1}{|M_j|} \sum_{i \in M_j} x_i$ (j = 1, 2, ..., C). The parameter scale determines the extent to which different manifolds move away from each other: the larger scale is, the further away the different manifolds are from each other. The derivation of the gradient of L_rank with respect to each learnable cluster center μ_j is placed in Appendix A.1. Additionally, contrary to us, the conventional methods for dealing with inter-manifold separation typically impose push-away constraints on all data points from different manifolds (Zhang et al., 2018; Yang et al., 2016a), defined as:

$$L_{sep} = -\sum_{i=1}^{N} \sum_{j=1}^{N} d_Z(z_i, z_j) \cdot \pi\left( l(x_i) \neq l(x_j) \right) \quad (7)$$

The main differences between L_rank and L_sep are as follows: (1) L_sep imposes constraints on the embedded points {z_i}_{i=1}^N, which in turn indirectly affect the network parameters θ_f. In contrast, L_rank imposes rank-preservation constraints directly on the learnable parameters {μ_j}_{j=1}^C in the form of a regularization term to control the separation of the clustering centers. (2) L_sep involves N × N point-to-point relationships, while L_rank involves only C × C cluster-to-cluster relationships, so L_rank is easier to optimize, faster to process, and more accurate. (3) The parameter scale introduced in L_rank allows us to control the extent of separation between manifolds for specific downstream tasks.

Alignment Loss.
Note that the global ranking loss L_rank is imposed directly on the learnable parameters {μ_j}_{j=1}^C, so optimizing L_rank will only update {μ_j}_{j=1}^C rather than the encoder's parameters θ_f. Thus, we introduce an auxiliary term L_align to align the learnable cluster centers {μ_j}_{j=1}^C with the real cluster centers {v_j^Z}_{j=1}^C:

$$L_{align} = \sum_{j=1}^{C} \left\| \mu_j - v_j^Z \right\| \quad (8)$$

where {v_j^Z}_{j=1}^C are defined as $v_j^Z = \frac{1}{|M_j|} \sum_{i \in M_j} z_i$ (j = 1, 2, ..., C). We place the derivation of the gradient of L_align with respect to each learnable cluster center μ_j in Appendix A.1.
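Both center-level losses are cheap to compute because they involve only the C learnable centers. A sketch under the same Euclidean-distance assumption follows; the helper manifold_centers, which recomputes v^X (from inputs) or v^Z (from embeddings) given the current hard assignments, is an assumed utility, not part of the paper.

```python
import torch

def manifold_centers(feats, s, C):
    """Mean of the points assigned to each manifold: v^X when feats = x, v^Z when feats = z.
    (Assumes every manifold currently has at least one assigned point.)"""
    return torch.stack([feats[s == c].mean(dim=0) for c in range(C)])

def rank_loss(mu, v_x, scale=3.0):
    """Eq. (6): match center-to-center distances in Z to (scaled) center distances in X."""
    return (torch.cdist(mu, mu) - scale * torch.cdist(v_x, v_x)).abs().sum()

def align_loss(mu, v_z):
    """Eq. (8): pull the learnable centers mu toward the actual latent-space centers v^Z."""
    return (mu - v_z).norm(dim=1).sum()
```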
3.3 TRAINING STRATEGY
3.3.1 CONTRADICTION
Figure 2: The force analysis of the contradiction between clustering and local structure preservation.

The contradiction between clustering and local structure preservation is analyzed from a force-analysis perspective. As shown in Fig 2, we assume that there exists a data point (red point) and its three nearest neighbors (blue points) around a cluster center (gray point). When clustering and local structure preservation are optimized simultaneously, it is very easy to fall into a local optimum, where the data point is in a steady state and the resultant force from its three nearest neighbors is equal in magnitude and opposite to the gravitational force of the cluster. Therefore, the following training strategy is applied to prevent such local optimal solutions.
3.3.2 ALTERNATING TRAINING AND WEIGHT GRADUALITY
Alternating Training.
To solve the above problem and integrate the goals of clustering and structure preservation into a unified framework, we take an alternating training strategy. Within each epoch, we first jointly optimize L_cluster and L_rank in a mini-batch, with the joint loss defined as

$$L = L_{AE} + \alpha L_{cluster} + L_{rank} \quad (9)$$

where α is the weighting factor that balances the effects of clustering and global rank-preservation. Then, at each epoch, we optimize the isometry loss L_LIS and L_align on the whole dataset, defined as

$$L = \beta L_{LIS} + L_{align} \quad (10)$$
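A schematic of the alternating step inside one epoch is sketched below, reusing the loss functions sketched earlier; mu is assumed to be registered as a learnable parameter in the optimizer, and the per-batch and per-epoch bookkeeping is simplified relative to Algorithm 1 in Appendix A.2.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, mu, data, loader, alpha, beta, opt, k=3, scale=3.0):
    """One DCRL epoch: a mini-batch clustering/ranking phase, then a full-dataset structure phase."""
    C = mu.size(0)
    with torch.no_grad():                                   # per-epoch targets and centers
        z_all, _ = model(data)
        p = target_distribution(soft_assignment(z_all, mu))
        s = p.argmax(dim=1)
        v_x = manifold_centers(data, s, C)
    # Phase 1 (per mini-batch): L = L_AE + alpha * L_cluster + L_rank   (Eq. 9)
    for x, idx in loader:                                   # idx: dataset indices of the batch
        z, y = model(x)
        loss = F.mse_loss(y, x) + alpha * cluster_loss(z, mu, p[idx]) + rank_loss(mu, v_x, scale)
        opt.zero_grad(); loss.backward(); opt.step()
    # Phase 2 (once per epoch, all samples): L = beta * L_LIS + L_align   (Eq. 10)
    z_all, _ = model(data)
    loss = beta * isometry_loss(data, z_all, p, k) + align_loss(mu, manifold_centers(z_all, s, C))
    opt.zero_grad(); loss.backward(); opt.step()
```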
Weight Graduality. At different stages of training, we have different expectations for clustering and structure preservation. At the beginning of training, to successfully decouple the overlapping manifolds, we hope that L_cluster will dominate and L_LIS will be auxiliary. When the margin between different manifolds is sufficiently pronounced, the weight α for L_cluster can be gradually reduced, while the weight β for L_LIS can be gradually increased, focusing on the preservation of local isometry. The whole algorithm is summarized in Algorithm 1 in Appendix A.2.
Three-stage explanation.
The entire training process can be roughly divided into three stages, as shown in Fig 3, to explain the training strategy more vividly. At first, four different manifolds overlap each other. At Stage 1, L_cluster dominates, thus data points within each manifold converge towards the clustering center to form a sphere, and the local structure of the manifolds is destroyed. At Stage 2, L_rank dominates, thus different manifolds in the latent space move away from each other to increase the manifold margin and enhance the discriminability. At Stage 3, the manifolds gradually recover their original local structure from the spherical shape, with L_LIS dominating. It is worth noting that all of the above losses coexist rather than acting independently at different stages, but the role played by different losses varies due to the alternating training and weight graduality.

Figure 3: Schematic of the training strategy. Four different colors and shapes represent four intersecting manifolds, and the three stages involve the clustering, separation, and structure recovery of manifolds.
4 EXPERIMENTS
4.1 EXPERIMENTAL SETUPS
In this section, the effectiveness of the proposed framework is evaluated on 5 benchmark datasets: USPS, MNIST-full, MNIST-test (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and REUTERS-10K (Lewis et al., 2004), on which our method is compared with 8 other methods mentioned in Sec 2 using 8 evaluation metrics, including metrics designed specifically for clustering and for representation learning. Brief descriptions of the datasets are given in Appendix A.3.

Parameters settings.
The encoder is a multilayer perceptron (MLP) with dimensions d-500-500-2000-10, where d is the dimension of the input data, and the decoder is its mirror. After pretraining, in order to initialize the learnable clustering centers, t-SNE is applied to further transform the latent space Z to 2 dimensions, and then the K-Means algorithm is run to obtain the label assignments for each data point. The centers of each category in the latent space Z are set as the initial clustering centers {μ_j}_{j=1}^C. The batch size is set to 256, the number of epochs is set to 300, the parameter k for nearest neighbors is set to 3, and the parameter scale is set to 3 for all datasets. Besides, the Adam optimizer (Kingma & Ba, 2014) with learning rate λ = 0.001 is used. As described in Sec 3.3.2, weight graduality is applied to train the model. The weight parameter α for L_cluster decreases linearly from 0.1 to 0 within epochs 0-150. In contrast, the weight parameter β for loss L_LIS increases linearly from 0 to 1.0 within epochs 0-150. The implementation is based on PyTorch running on an NVIDIA V100 GPU.
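A sketch of the center-initialization procedure described above (pretrained embedding → t-SNE to 2-D → K-Means → per-cluster means in the original latent space Z) is given below; sklearn defaults are assumed except where the text specifies otherwise, and n_init is an illustrative choice.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def init_cluster_centers(z, n_clusters):
    """z: (N, 10) numpy array of latent codes from the pretrained encoder."""
    z_2d = TSNE(n_components=2).fit_transform(z)              # reduce Z further to 2 dimensions
    labels = KMeans(n_clusters=n_clusters, n_init=20).fit_predict(z_2d)
    centers = np.stack([z[labels == c].mean(axis=0) for c in range(n_clusters)])
    mu = torch.nn.Parameter(torch.tensor(centers, dtype=torch.float32))   # learnable centers in Z
    return mu, labels
```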
Evaluation Metrics.
Two standard evaluation metrics, Accuracy (ACC) and Normalized Mutual Information (NMI) (Xu et al., 2003), are used to evaluate clustering performance. Besides, six evaluation metrics are adopted in this paper to evaluate the performance of representation learning, including Relative Rank Error (RRE), Trustworthiness (Trust), Continuity (Cont), Root Mean Reconstruction Error (RMRE), Locally Geometric Distortion (LGD), and Cluster Rank Accuracy (CRA). Limited by space, their precise definitions are available in Appendix A.4.

4.2 EVALUATION OF CLUSTERING
4.2.1 QUANTITATIVE COMPARISON
The ACC/NMI metrics of different methods on various datasets are reported in Tab 1. For those comparison methods whose results are not reported on some datasets, we run the released code using the hyperparameters provided in their papers and label the results with (*). We find that our method outperforms K-Means and SC-Ncut by a significant margin and surpasses the other six competing DNN-based algorithms on all datasets except MNIST-test. Even on the MNIST-test dataset, we still rank second, outperforming the third by 1.1%. In particular, we obtain the best performance on the Fashion-MNIST dataset and, more notably, our clustering accuracy exceeds the current best method (N2D) by 3.8%. Although L_cluster is inspired by and highly consistent with the design of DEC, our method achieves much better clustering results. On MNIST-full, for example, our clustering accuracy is 11.7% and 9.9% higher than DEC and IDEC, respectively. (USPS dataset available at https://cs.nyu.edu/roweis/data.html)

Table 1: Clustering performance (ACC/NMI) of different algorithms on five datasets

Algorithms    MNIST-full   MNIST-test    USPS          Fashion-MNIST   REUTERS-10K
k-means       0.532/0.500  0.546/0.501   0.668/0.601   0.474/0.512     0.599/0.375*
SC-Ncut       0.656/0.731  0.660/0.704   0.649/0.794   0.508/0.575     0.658/0.401*
AE+k-means    0.818/0.747  0.815/0.784*  0.662/0.693   0.566/0.585*    0.721/0.432*
DEC           0.863/0.834  0.856/0.830   0.762/0.767   0.518/0.546     0.755/0.503*
IDEC          0.881/0.867  0.846/0.802   0.761/0.785   0.529/0.557     0.778/0.527*
JULE          0.964/0.913  0.961/0.915   0.950/0.913   0.563/0.608     0.797/0.551*
DSC           0.978/0.941
4.2.2 GENERALIZABILITY EVALUATION
Table 2: Generalizability evaluated by ACC/NMI
Algorithms     training samples   testing samples
AE+k-means     0.815/0.736        0.751/0.711
DEC            0.841/0.773        0.748/0.704
IDEC           0.845/0.860        0.826/0.842
JULE           0.958/0.907        0.921/0.895
DSC            0.975/0.939        0.969/0.921
N2D            0.974/0.930        0.965/0.911
DCRL (ours)
Tab 2 demonstrates that a learned DCRL model can generalize well to unseen data with high clustering accuracy. Taking MNIST-full as an example, DCRL was trained using 50,000 training samples and then tested on the remaining 20,000 testing samples using the learned model. In terms of the metrics ACC and NMI, our method is optimal for both training and testing samples. More importantly, there is hardly any degradation in the performance of our method on the testing samples compared to the training samples, while all other methods show a significant drop in performance, e.g., DEC from 84.1% to 74.8%. This demonstrates the importance of geometric structure maintenance for good generalizability. The testing visualization available in Appendix A.5 shows that DCRL still maintains clear inter-cluster boundaries even on the testing samples, which demonstrates the great generalizability of our method.

4.2.3 CLUSTERING VISUALIZATION
The visualization of DCRL and several comparison methods is shown in Fig 4 (visualized using UMAP). From the perspective of clustering, our method is much better than the other methods. Among all methods, only DEC, IDEC, and DCRL can hold clear boundaries between different clusters, while the cluster boundaries of the other methods are indistinguishable. Although DEC and IDEC can successfully separate different clusters, they group many data points from different classes into the same cluster. Most importantly, due to the use of the clustering-oriented loss, the embeddings learned by algorithms such as DEC, IDEC, JULE, and DSC (especially DSC) tend to form spheres and disrupt the original topological structure. Instead, our method overcomes these problems and achieves almost perfect separation between different clusters while preserving the local and global structure. Additionally, the embedding of the latent space during the training process is visualized in Appendix A.6, which is highly consistent with the three-stage explanation mentioned in Sec 3.3.2.

4.3 EVALUATION OF REPRESENTATION LEARNING
4.3.1 QUANTITATIVE COMPARISON
Although numerous previous works have claimed that they bring clustering and representation learning into a unified framework, they all, unfortunately, lack an analysis of the effectiveness of the learned representations. In this paper, we compare DCRL with the other five methods using six evaluation metrics on five datasets (limited by space, only MNIST-full results are provided in Tab 3; the complete results are in Appendix A.7). The results show that DCRL outperforms all other methods, especially on the CRA metric, on which it is not only the best on all datasets but also reaches 1.0. This means that the "rank" between different manifolds in the latent space is completely preserved and undamaged, which proves the effectiveness of our global ranking loss L_rank. Moreover, a statistical analysis is performed in this paper to show the extent to which local and global structure is preserved in the latent space by each algorithm. Limited by space, it is placed in Appendix A.8.

Figure 4: Visualization of the embedding learned by different algorithms on the MNIST-full dataset: (a) AE+K-Means, (b) DEC, (c) IDEC, (d) JULE, (e) DSC, (f) N2D, (g) DCRL (Ours).

Table 3: Performance for representation learning
Methods   RRE     Trust   Cont    RMSE    LGD     CRA
DEC       0.099   0.844   0.948   44.85   4.379   0.28
IDEC      0.009   0.998   0.979   24.58   1.714   0.33
JULE      0.026   0.936   0.983   28.34   2.129   0.27
DSC       0.097   0.873   0.925   6.98    1.198   0.23
N2D       0.010   0.992   0.984   5.71    0.699   0.21
DCRL
Table 4: Performance for downstream tasks
Methods   MLP     RFC     SVM     LR
AE        0.974   0.965   0.985   0.956
DEC       0.864   0.870   0.870   0.856
IDEC      0.979   0.973   0.985   0.965
JULE      0.980   0.982   0.978   0.974
DSC       0.962   0.950   0.983   0.975
N2D       0.979   0.980   0.979   0.979
DCRL
4.3.2 DOWNSTREAM TASKS
Recently, numerous deep clustering algorithms have claimed to obtain meaningful representations; however, they do not analyze or experiment with this so-called "meaningfulness". Therefore, we are interested to see whether these proposed methods can indeed learn representations that are useful for downstream tasks. Tab 4 compares DCRL with the other six methods on five datasets (limited by space, only MNIST-full results are provided in the paper; the complete results are in Appendix A.9). Four different classifiers, including a linear classifier (Logistic Regression; LR), two nonlinear classifiers (MLP, SVM), and a tree-based classifier (Random Forest Classifier; RFC), are used as downstream tasks, all of which use the default parameters and default implementations in sklearn (Pedregosa et al., 2011) for a fair comparison. The learned representations are frozen and used as input for training. The classification accuracy evaluated on the test set serves as a metric to evaluate the effectiveness of the learned representations. On the MNIST-full dataset, our method outperforms all the other methods. Moreover, we surprisingly find that with MLP and RFC as downstream tasks, all methods except DCRL cannot even match the accuracy of AE. Significantly, the performance of DEC on downstream tasks deteriorates sharply and even shows a large gap with the simplest AE, which once again shows that the clustering-oriented loss may damage the data's geometric structure.

4.4 ABLATION STUDY
This subsection evaluates the effects of the loss terms and training strategies in DCRL with five sets of experiments: the model without (A) Structure-oriented Loss (SL); (B) Clustering-oriented Loss (CL); (C) Weight Graduality (WG); (D) Alternating Training (AT); and (E) the full model. Limited by space, only MNIST-full results are provided in the paper, and results for the other four datasets are in Appendix A.10. After analyzing the results, we can conclude: (1) CL is the most important factor for obtaining good clustering, the lack of which leads to unsuccessful clustering; hence the corresponding numbers in the table are not very meaningful and are shown in gray. (2) SL not only brings subtle improvements in clustering but also greatly improves the performance of representation learning. (3) Our training strategies (WG and AT) both improve the performance of clustering and representation learning to some extent, especially on metrics such as RRE, Trust, Cont, and CRA.

Table 5: Ablation study of loss items and training strategies on the MNIST-full dataset
Methods      ACC/NMI      RRE     Trust    Cont     RMSE    LGD     CRA
w/o SL       0.976/0.939  0.0093  0.9967   0.9816   24.589  1.6747  0.32
w/o CL       0.814/0.736  0.0004  0.9998   0.9990   7.458   0.0487  1.00
w/o WG       0.977/0.943  0.0065  0.9987   0.9860   5.576   0.6968  0.98
w/o AT       0.978/0.944  0.0069  0.9986   0.9851   5.617   0.7037  0.96
full model
5 CONCLUSION
The proposed DCRL framework imposes clustering-oriented and structure-oriented constraints to optimize the latent space, simultaneously performing clustering and representation learning with local and global structure preservation. Extensive experiments on image and text datasets demonstrate that DCRL is not only comparable to state-of-the-art deep clustering algorithms but also able to learn effective and robust representations, which is beyond the capability of clustering methods that only care about clustering accuracy. Future work will focus on the adaptive determination of the number of manifolds (clusters) and on extending our work to larger-scale datasets.

REFERENCES
Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. Improved deep embedded clustering with local structure preservation. In IJCAI, pp. 1753–1759, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

Stan Z Li, Zelin Zhang, and Lirong Wu. Markov-Lipschitz deep learning. arXiv preprint arXiv:2006.08256, 2020.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

J MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Math., Stat., and Prob., pp. 281, 1965.

Ryan McConville, Raul Santos-Rodriguez, Robert J Piechocki, and Ian Craddock. N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding. arXiv preprint arXiv:1908.05968, 2019.

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. Learning deep representations for graph clustering. In AAAI, volume 14, pp. 1293–1299. Citeseer, 2014.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010.

Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478–487, 2016.

Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 267–273, 2003.

Bo Yang, Ming Xiang, and Yupei Zhang. Multi-manifold discriminant isomap for visualization and classification. Pattern Recognition, 55:215–230, 2016a.

Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156, 2016b.

Xu Yang, Cheng Deng, Feng Zheng, Junchi Yan, and Wei Liu. Deep spectral clustering using dual autoencoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4066–4075, 2019.

Yan Zhang, Zhao Zhang, Jie Qin, Li Zhang, Bing Li, and Fanzhang Li. Semi-supervised local multi-manifold isomap by linear embedding for feature extraction. Pattern Recognition, 76:662–678, 2018.

APPENDIX
A.1 GRADIENT DERIVATION
In the paper, we have emphasized that {μ_j}_{j=1}^C is a set of learnable parameters, which means that we can optimize it while optimizing the network parameters θ_f. In Eq. (4) of the paper, we presented the gradient of L_cluster with respect to μ_j. In addition to L_cluster, both L_rank and L_align involve μ_j. Hence, the detailed derivations of the gradients of L_rank and L_align with respect to μ_j are also provided. The gradient of L_rank with respect to each learnable cluster center μ_j can be computed as:

$$\frac{\partial L_{rank}}{\partial \mu_j} = \frac{\partial \sum_{i'=1}^{C} \sum_{j'=1}^{C} \left| d_Z(\mu_{i'}, \mu_{j'}) - scale \cdot d_X\left(v_{i'}^X, v_{j'}^X\right) \right|}{\partial \mu_j} = \sum_{i'=1}^{C} \sum_{j'=1}^{C} \frac{\partial \left| d_Z(\mu_{i'}, \mu_{j'}) - scale \cdot d_X\left(v_{i'}^X, v_{j'}^X\right) \right|}{\partial \mu_j} \quad (11)$$

The Euclidean metric is used for both the input space and the latent space, i.e., $d_Z(\mu_{i'}, \mu_{j'}) = \|\mu_{i'} - \mu_{j'}\|$. In addition, for a clearer derivation we slightly abuse notation and write $K$ for $scale \cdot d_X\left(v_{i'}^X, v_{j'}^X\right)$. Accordingly, Eq. (11) can be further derived as follows:

$$\begin{aligned}
\frac{\partial L_{rank}}{\partial \mu_j}
&= \sum_{i'=1}^{C} \frac{\partial \left| \|\mu_{i'} - \mu_j\| - K \right|}{\partial \mu_j} + \sum_{j'=1}^{C} \frac{\partial \left| \|\mu_j - \mu_{j'}\| - K \right|}{\partial \mu_j} \\
&= \sum_{i'=1}^{C} \frac{\partial \|\mu_{i'} - \mu_j\|}{\partial \mu_j} \cdot \frac{\|\mu_{i'} - \mu_j\| - K}{\left| \|\mu_{i'} - \mu_j\| - K \right|} + \sum_{j'=1}^{C} \frac{\partial \|\mu_j - \mu_{j'}\|}{\partial \mu_j} \cdot \frac{\|\mu_j - \mu_{j'}\| - K}{\left| \|\mu_j - \mu_{j'}\| - K \right|} \\
&= 2 \sum_{i'=1}^{C} \frac{\mu_j - \mu_{i'}}{\|\mu_j - \mu_{i'}\|} \cdot \frac{\|\mu_j - \mu_{i'}\| - K}{\left| \|\mu_j - \mu_{i'}\| - K \right|} \\
&= 2 \sum_{i'=1}^{C} \frac{\mu_j - \mu_{i'}}{d_Z(\mu_j, \mu_{i'})} \cdot \frac{d_Z(\mu_j, \mu_{i'}) - scale \cdot d_X\left(v_{i'}^X, v_j^X\right)}{\left| d_Z(\mu_j, \mu_{i'}) - scale \cdot d_X\left(v_{i'}^X, v_j^X\right) \right|}
\end{aligned} \quad (12)$$

The gradient of L_align with respect to each learnable cluster center μ_j can be computed as:

$$\frac{\partial L_{align}}{\partial \mu_j} = \frac{\partial \sum_{j'=1}^{C} \|\mu_{j'} - v_{j'}^Z\|}{\partial \mu_j} = \frac{\partial \|\mu_j - v_j^Z\|}{\partial \mu_j} = \frac{\mu_j - v_j^Z}{\left\| \mu_j - v_j^Z \right\|} \quad (13)$$
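The closed-form gradients above can be verified numerically against PyTorch autograd; a minimal check for Eq. (13) is sketched below.

```python
import torch

C, m = 4, 10
mu = torch.randn(C, m, requires_grad=True)       # learnable cluster centers
v_z = torch.randn(C, m)                          # latent-space cluster centers (fixed here)

loss = (mu - v_z).norm(dim=1).sum()              # L_align = sum_j ||mu_j - v_j^Z||
loss.backward()

manual = (mu - v_z) / (mu - v_z).norm(dim=1, keepdim=True)    # closed-form gradient, Eq. (13)
print(torch.allclose(mu.grad, manual.detach(), atol=1e-6))    # expected output: True
```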
A.2 ALGORITHM

Algorithm 1: Algorithm for Deep Clustering and Representation Learning
Input: Input samples X; number of clusters C; number of batches B; number of iterations E.
Output: Autoencoder's weights θ_f and θ_g; cluster labels {s_i}_{i=1}^N; trainable cluster centers {μ_j}_{j=1}^C.
1: Initialize the weights {μ_j}_{j=1}^C, θ_f, and θ_g, and obtain the initialized soft label assignment {s_i}_{i=1}^N.
2: for epoch ∈ {1, ..., E} do
3:   Compute embedded points {z_i}_{i=1}^N and distribution Q;
4:   Update target distribution P;
5:   Compute soft cluster centers {v_i^X}_{i=1}^C and {v_i^Z}_{i=1}^C.
6:   for batch ∈ {1, ..., B} do
7:     Pick one batch of samples X_batch from X;
8:     Compute the corresponding distribution Q_batch and its reconstruction Y_batch;
9:     Pick the target distribution batch P_batch from P;
10:    Compute the losses L_AE, L_cluster, and L_rank;
11:    Update the weights θ_f, θ_g, and {μ_j}_{j=1}^C.
12:  end for
13:  Compute L_LIS and L_align on all samples;
14:  Update the weights θ_f and {μ_j}_{j=1}^C;
15:  Assign new soft labels {s_i}_{i=1}^N.
16: end for
17: return θ_f, θ_g, {s_i}_{i=1}^N, {μ_j}_{j=1}^C.
A.3 DATASETS
To show that our method works well with various kinds of datasets, we choose the following five image and text datasets. Some example images are shown in Fig A1, and brief descriptions of the datasets are given in Tab A1.

Table A1: Description of Datasets

Dataset         Samples   Categories   Data Size
MNIST-full      70000     10           28 × 28
MNIST-test      10000     10           28 × 28
USPS            9298      10           16 × 16
Fashion-MNIST   70000     10           28 × 28
REUTERS-10K     10000     4            2000

• MNIST-full (LeCun et al., 1998): The MNIST-full dataset consists of 70,000 handwritten digits of 28 × 28 pixels. Each gray image is reshaped to a 784-dimensional vector.

• MNIST-test (LeCun et al., 1998): MNIST-test is the testing part of the MNIST dataset, which contains a total of 10,000 samples.

• USPS: The USPS dataset is composed of 9,298 gray-scale handwritten digit images with a size of 16 × 16 pixels. (https://cs.nyu.edu/roweis/data.html)

• Fashion-MNIST (Xiao et al., 2017): The Fashion-MNIST dataset has the same number of images and the same image size as MNIST-full, but it is fairly more complicated. Instead of digits, it consists of various types of fashion products.

• REUTERS-10K: REUTERS (Lewis et al., 2004) is composed of around 810,000 English news stories labeled with a category tree. Four root categories (corporate/industrial, government/social, markets, and economics) are used as labels, and all documents with multiple labels are excluded. Following DEC (Xie et al., 2016), a subset of 10,000 examples is randomly sampled, and the tf-idf features on the 2,000 most frequent words are computed. The sampled dataset is denoted REUTERS-10K.
Figure A1: Image samples from three datasets (MNIST, USPS, and Fashion-MNIST).

A.4 DEFINITIONS OF PERFORMANCE METRICS
The following notations are used for the definitions:
d_X(i, j): the pairwise distance between x_i and x_j in input space X;
d_Z(i, j): the pairwise distance between z_i and z_j in latent space Z;
N_i^{k,X}: the set of indices of the k-nearest neighbors (kNN) of x_i in input space X;
N_i^{k,Z}: the set of indices of the k-nearest neighbors (kNN) of z_i in latent space Z;
r_X(i, j): the rank of the closeness of x_j to x_i in input space X;
r_Z(i, j): the rank of the closeness of z_j to z_i in latent space Z.

The eight evaluation metrics are defined below:

(1) ACC (Accuracy) measures the accuracy of clustering:

$$\mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{N} \mathbb{1}\{l_i = m(s_i)\}}{N}$$

where l_i and s_i are the true and predicted labels for data point x_i, respectively, and m(·) ranges over all possible one-to-one mappings between clusters and label categories.

(2) NMI (Normalized Mutual Information) calculates the normalized measure of similarity between two labelings of the same data:

$$\mathrm{NMI} = \frac{I(l; s)}{\max\{H(l), H(s)\}}$$

where I(l, s) is the mutual information between the real label l and the predicted label s, and H(·) represents their entropy.

(3) RRE (Relative Rank Change) measures the average change in neighbor ranking between the two spaces X and Z:

$$\mathrm{RRE} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left\{ MR_{X \to Z}^{k} + MR_{Z \to X}^{k} \right\}$$

where k_1 and k_2 are the lower and upper bounds of the kNN, and

$$MR_{X \to Z}^{k} = \frac{1}{H_k} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,Z}} \frac{|r_X(i,j) - r_Z(i,j)|}{r_Z(i,j)}, \qquad MR_{Z \to X}^{k} = \frac{1}{H_k} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,X}} \frac{|r_X(i,j) - r_Z(i,j)|}{r_X(i,j)}$$

where H_k is the normalizing term, defined as $H_k = N \sum_{l=1}^{k} \frac{|N - 2l|}{l}$.

(4) Trust (Trustworthiness) measures to what extent the k nearest neighbors of a point are preserved when going from the input space to the latent space:

$$\mathrm{Trust} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left[ 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i^{k,Z},\, j \notin \mathcal{N}_i^{k,X}} \left( r_X(i,j) - k \right) \right]$$

where k_1 and k_2 are the bounds of the number of nearest neighbors.

(5) Cont (Continuity) is defined analogously to Trust, but checks to what extent neighbors are preserved when going from the latent space to the input space:

$$\mathrm{Cont} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left[ 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \notin \mathcal{N}_i^{k,Z},\, j \in \mathcal{N}_i^{k,X}} \left( r_Z(i,j) - k \right) \right]$$

where k_1 and k_2 are the bounds of the number of nearest neighbors.

(6) RMSE (Root Mean Square Error) measures to what extent the two distributions of distances coincide:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( d_X(i,j) - d_Z(i,j) \right)^2}$$

(7) LGD (Locally Geometric Distortion) measures how much corresponding distances between neighboring points differ in the two metric spaces and is the primary metric for isometry, defined as:

$$\mathrm{LGD} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \sqrt{ \frac{1}{M} \sum_{i=1}^{M} \frac{\sum_{j \in \mathcal{N}_i^{k}} \left( d_X(i,j) - d_Z(i,j) \right)^2}{|\mathcal{N}_i^{k}|} }$$

where k_1 and k_2 are the lower and upper bounds of the kNN.

(8) CRA (Cluster Rank Accuracy) measures the changes in the ranks of cluster centers from the input space X to the latent space Z (a small computational sketch is given after this list):

$$\mathrm{CRA} = \frac{\sum_{i=1}^{C} \sum_{j=1}^{C} \mathbb{1}\left( r_X(v_i^X, v_j^X) = r_Z(v_i^Z, v_j^Z) \right)}{C^2}$$

where C is the number of clusters, v_j^X is the cluster center of the j-th cluster in the input space X, v_j^Z is the cluster center of the j-th cluster in the latent space Z, r_X(v_i^X, v_j^X) denotes the rank of the closeness of v_i^X to v_j^X in the input space X, and r_Z(v_i^Z, v_j^Z) denotes the rank of the closeness of v_i^Z to v_j^Z in the latent space Z.
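As an illustration, a small sketch of how CRA can be computed from the two sets of cluster centers is given below; the rank matrices are obtained with a simple argsort-based ranking, which is an implementation assumption.

```python
import numpy as np

def rank_matrix(centers):
    """rank[i, j] = rank of the closeness of center j to center i (0 for the center itself)."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return d.argsort(axis=1).argsort(axis=1)

def cluster_rank_accuracy(v_x, v_z):
    """CRA: fraction of the C x C center pairs whose closeness rank is preserved from X to Z."""
    return (rank_matrix(v_x) == rank_matrix(v_z)).mean()
```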
A.5 VISUALIZATION IN GENERALIZABILITY

The visualization results on the testing samples are shown in Fig A2; even for testing samples, our method still shows distinguishable inter-cluster discriminability, while all the other methods, without exception, couple different clusters together.

Figure A2: Visualization of the obtained embeddings on the testing samples, showing the generalization performance of different algorithms on the MNIST-full dataset: (a) AE+K-Means, (b) DEC, (c) IDEC, (d) JULE, (e) DSC, (f) N2D, (g) DCRL (ours).

A.6 VISUALIZATION IN DIFFERENT STAGES
The embedding of the latent space during the training process is visualized in Fig A3 to depict how both clustering and structure preservation are achieved. We can see that the different clusters initialized by the pretrained autoencoder are closely adjacent. In the early stage of training, with the clustering loss L_cluster and the global ranking loss L_rank, different manifolds are separated from each other, each manifold loses its local structure, and all of them degenerate into spheres. As training progresses, the weight α for L_cluster gradually decreases, while the weight β for L_LIS increases, and the optimization gradually shifts its focus from global to local, with each manifold gradually recovering its original geometric structure from the sphere. Moreover, since our local isometry loss L_LIS is constrained within each manifold, the preservation of the local structure will not disrupt the global ranking. Finally, we obtain representations in which cluster boundaries are clearly distinguished, and local and global structures are perfectly preserved.

Figure A3: Clustering visualization at different stages of training on the MNIST-full dataset (epochs 0, 9, 19, 29, 69, 119, 159, 209, 249, and 299).

A.7 STATISTICAL ANALYSIS
A statistical analysis is presented to show the extent to which local and global structure is preserved from the input space to the latent space. Taking MNIST-full as an example, the statistical analysis of global rank-preservation is shown in Fig A4 (a)-(f). For the i-th cluster, if the rank between it and the j-th cluster is preserved from the input space to the latent space, then the grid cell in the i-th row and j-th column is set to blue, otherwise yellow. As shown in the figure, only our method can fully preserve the global rank between different clusters, while all other methods fail.

Finally, we perform a statistical analysis of the local isometry property of each algorithm. Each sample x_i in the dataset forms a number of point pairs with its neighborhood samples {(x_i, x_j) | i = 1, 2, ..., N; x_j ∈ N_i^X}. We compute the difference in the distance of these point pairs from the input space to the latent space, {d_Z(x_i, x_j) − d_X(x_i, x_j) | i = 1, 2, ..., N; x_j ∈ N_i}, and plot it as a histogram. As shown in Fig A4 (g), the curve of DCRL is distributed on both sides of the 0 value, with the maximum peak height and the minimum peak-bottom width, which indicates that DCRL achieves the best local isometry. Although IDEC claims that it can preserve the local structure well, there is still a big gap between its results and ours.

Figure A4: Statistical analysis of different algorithms to compare the capability of global and local structure preservation from the input space to the latent space: (a) DEC, (b) IDEC, (c) JULE, (d) DSC, (e) N2D, (f) DCRL, (g) Local Isometry.

A.8 QUANTITATIVE EVALUATION OF REPRESENTATION LEARNING
Datasets Algorithms RRE Trust Cont RMSE LGD CRA
MNIST-full DEC 0.09988 0.84499 0.94805 44.8535 4.37986 0.28IDEC 0.00984 0.99821 0.97936 24.5803 1.71484 0.33JULE 0.02657 0.93675 0.98321 28.3412 2.12955 0.27DSC 0.09785 0.87315 0.92508 6.98098 1.19886 0.23N2D 0.01002 0.99243 0.98466 5.7162 0.69946 0.21DCRL
MNIST-test DEC 0.12800 0.81841 0.91767 14.6113 2.29499 0.19IDEC 0.01505 0.99403 0.97082 7.4599 1.08350 0.38JULE 0.04122 0.92971 0.97208 9.4768 1.17176 0.42DSC 0.10728 0.85498 0.92254 7.1689 1.19239 0.26N2D 0.01565 0.98764 0.97572
USPS DEC 0.07911 0.88871 0.94628 16.4355 1.77848 0.31IDEC 0.01043 0.99726 0.97960 13.0573 1.11689 0.30JULE 0.02972 0.98763 0.98810 14.6324 1.43426 0.33DSC 0.06319 0.9151 0.93988 8.4412 1.02131 0.27N2D 0.01337 0.98769 0.98135 8.1961 0.54967 0.37DCRL
Fasion-MNIST DEC 0.04787 0.93896 0.95450 39.3274 3.87731 0.37IDEC 0.01089 0.99683 0.97797 25.4024 1.91385 0.27JULE 0.03013 0.97732 0.97923 15.2213 1.43642 0.43DSC 0.05168 0.95013 0.96121 17.2201 1.42091 0.36N2D 0.00894 0.99062 0.98054 14.49079
REUTERS-10K DEC 0.26192 0.65518 0.80477 40.4671 4.00423 0.63IDEC 0.05981 0.95840 0.90550 43.9556 2.01365 0.75JULE 0.11230 0.87628 0.93232 46.4287 2.78210 0.56DSC 0.20820 0.74312 0.83672 38.8720 1.89721 0.50N2D 0.03827 0.97385 0.93412 36.1042
UANTITATIVE E VALUATION OF D OWNSTREAM T ASKS
Tab A3 compares DCRL with the other six methods on five datasets to see whether these methodscan indeed learn representations that are useful for downstream tasks. As shown in the table, ourmethod outperforms the other methods on all five datasets with MLP, RFC, LR as downstream tasks.Table A3: Performance of different algorithms in downstream tasks
Datasets Algorithms MLP RFC SVM LR
MNIST-full AE 0.9746 0.9652 0.9859 0.9565DEC 0.8647 0.8706 0.8707 0.8566IDEC 0.9797 0.9737 0.9852 0.9650JULE 0.9802 0.9825 0.9787 0.9743DSC 0.9622 0.9501 0.9837 0.9752N2D 0.9796 0.9803 0.9799 0.9792DCRL
MNIST-test AE 0.9415 0.9420 0.9745 0.9495DEC 0.8525 0.8605 0.8725 0.8685IDEC 0.9740 0.9725 0.9845 0.9655JULE 0.9775 0.9845 0.9800 0.9825DSC 0.9535 0.9740 0.9825 0.9795N2D 0.9715 0.9760 0.9725 0.9725DCRL
USPS AE 0.9421 0.9469 0.9677 0.9073DEC 0.8289 0.8668 0.8289 0.8294IDEC 0.9482 0.9556 0.9656 0.9125JULE 0.9576 0.9617
Fasion-MNIST AE 0.8613 0.9932 0.8314 0.7588DEC 0.6268 0.9853 0.6377 0.6245IDEC 0.8367 0.9918
REUTERS-10K AE 0.9325 0.9170 0.9375 0.8205DEC 0.7985 0.7880 0.8105 0.7450IDEC 0.9225 0.8930 0.9280 0.7705JULE 0.9315 0.9035 0.9185 0.8165DSC 0.9045 0.8835 0.9175 0.8115N2D 0.9205 0.9080 0.9240 0.8335DCRL (ours)
ORE ABLATION EXPERIMENTS
The results of the ablation experiments on the MNIST-full dataset have been presented in Tab 5in Sec 4.3. Here, we provide four more sets of ablation experiments on the other four datasets.The conclusion is similar (note that for the clustering performance of the model without clustering-oriented losses is very poorly, so the “best” metric numbers are not meaningful and are shown ingray color): (1) CL is very important for obtaining good clustering. (2) SL is beneficial for bothclustering and representation learning. (3) Our training strategies (WG and AT) are very superior inimproving metrics such as ACC, RRE, Trust, Cont, and CRA.Table A4: Ablation study of loss items and training strategies used in DCRL
Datasets Methods ACC/NMI RRE Trust Cont RMSE LGD CRA w/o SL 0.976/0.939 0.0093 0.9967 0.9816 24.589 1.6747 0.32w/o CL 0.814/0.736 0.0004 0.9998 0.9990 7.458 0.0487 1.00w/o WG 0.977/0.943 0.0065 0.9987 0.9860 5.576 0.6968 0.98w/o AT 0.978/0.944 0.0069 0.9986 0.9851 5.617 0.7037 0.96MNIST-full full model w/o SL w/o AT 0.970/0.929 0.0118 0.9974 0.9747 5.567 0.9404
MNIST-test full model 0.972/0.930 w/o SL 0.958/0.902 0.0095 0.9967 0.9812 14.609 0.9847 0.29w/o CL 0.664/0.658 0.0020 0.9996 0.9952 2.934 0.0687 1.0w/o WG 0.956/0.896 0.0060 0.9991 0.9868 6.572 0.5335 w/o AT 0.947/0.885 0.0080 0.9979 0.9833
USPS full model w/o SL 0.706/0.682 0.0108 0.9964 0.9781 25.954 1.8936 0.30w/o CL 0.576/0.569 0.0004 0.9994 0.9995 7.654 0.0523 1.00w/o WG 0.702/0.695 0.0084 0.9972 0.9814 w/o AT 0.708/0.694 0.0097 0.9975 0.9798 13.354 1.3611
Fasion-MNIST full model w/o SL 0.819/0.564 0.0529 0.9610 0.9185 44.481 1.9090 0.38w/o CL 0.542/0.279 0.0277 0.9868 0.9456 37.018 2.2294 1.00w/o WG 0.830/0.583 0.0420 0.9667 0.9361 35.302 2.8286 w/o AT 0.825/0.563 0.0440 0.9650 0.9330 39.275 2.9146
REUTERS-10K full model0.836/0.590 0.0320 0.9838 0.9380 34.547 2.7209 1.00