Deep Clustering Based on a Mixture of Autoencoders
Shlomo E. Chazan, Sharon Gannot and Jacob Goldberger
Bar-Ilan University, Ramat-Gan, 5290002, Israel
{Shlomi.Chazan, Sharon.Gannot, Jacob.Goldberger}@biu.ac.il

Abstract
In this paper we propose a Deep Autoencoder Mixture Clustering (DAMIC) algorithm based on a mixture of deep autoencoders where each cluster is represented by an autoencoder. A clustering network transforms the data into another space and then selects one of the clusters. Next, the autoencoder associated with this cluster is used to reconstruct the data-point. The clustering algorithm jointly learns the nonlinear data representation and the set of autoencoders. The optimal clustering is found by minimizing the reconstruction loss of the mixture of autoencoders network. Unlike other deep clustering algorithms, no regularization term is needed to avoid data collapsing to a single point. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.
1. Introduction
Effective automatic grouping of objects into clusters is one of the fundamental problems in machine learning and data analysis. In many approaches, the first step toward clustering a dataset is extracting a feature vector from each object. This reduces the problem to the aggregation of groups of vectors in a feature space. A commonly used clustering algorithm in this case is k-means. Clustering high-dimensional datasets is, however, hard since the inter-point distances become less informative in high-dimensional spaces. As a result, representation learning has been used to map the input data into a low-dimensional feature space. In recent years, motivated by the success of deep neural networks in supervised learning, there have been many attempts to apply unsupervised deep learning approaches to clustering. Most methods are focused on clustering over the low-dimensional feature space of an autoencoder [29][8][13][28], a variational autoencoder [16][7] or a Generative Adversarial Network (GAN) [11][25][6]. Recent good overviews of deep clustering methods can be found in [2] and [22].

Using deep neural networks, nonlinear mappings that can transform the data into more clustering-friendly representations can be learned. A deep version of k-means is based on learning a nonlinear data representation and applying k-means in the embedded space. A straightforward implementation of the deep k-means algorithm would lead, however, to a trivial solution where the features are collapsed to a single point in the embedded space and the centroids are collapsed into a single entity. For this reason, the objective function of most deep clustering algorithms is composed of a clustering term computed in the embedded space and a regularization term in the form of a reconstruction error to avoid data collapse. Deep Embedded Clustering (DEC) [27] is first pre-trained using an autoencoder reconstruction loss and then optimizes cluster centroids in the embedded space through a Kullback-Leibler divergence loss. The Deep Clustering Network (DCN) [28] is another autoencoder-based method that uses k-means for clustering. Similar to DEC, in the first phase, the network is pre-trained using the autoencoder reconstruction loss. However, in the second phase, in contrast to DEC, the network is jointly trained using a mathematical combination of the autoencoder reconstruction loss and the k-means clustering loss function. Thus, because strict cluster assignments were used during the training (instead of probabilities such as in DEC), the method requires an alternation process between network training and cluster updates.

In this paper we propose an algorithm to perform deep clustering within the mixture-of-experts framework [15]. Each cluster is represented by an autoencoder neural network and the clustering itself is performed in a low-dimensional embedded space by a softmax classification layer that directs the input data to the most suitable autoencoder. Unlike most deep clustering algorithms, the proposed algorithm is deep in nature and not a deep variant of a classical clustering algorithm. The proposed deep clustering approach is different from previous algorithms in three main aspects:

• It does not suffer from the clustering collapsing problem, since the trivial solution is not the global optimum of the clustering learning objective function.

• This implies that in the proposed method, unlike other methods, there is no need for regularization terms that have to be tuned separately for each dataset.
Note that parameter tuning in clustering is problematic since it is based, either explicitly or implicitly, on the data labels, which are supposedly unavailable in the clustering process.

• Another major difference between the proposed method and previously proposed approaches is the learning method of the embedded latent space, where the actual clustering takes place. In most previous methods, the embedded space is controlled by an autoencoder. Thus, in order to gain a good reconstruction, it is required to encode into the embedded space information that can be entirely irrelevant to the clustering process. In contrast, in our algorithm no decoding is applied to the clustering in the embedded space, and the only goal of the embedded space is to find a good organization of the data into separated clusters.

We validate the method using standard real datasets including document and image corpora. The results show a visible improvement over previous methods for all the datasets. The contribution of this paper is thus twofold: (i) it presents a novel deep learning clustering method that, unlike deep variants of k-means, does not require a tuned regularization term to avoid clustering collapse to a single point; and (ii) it demonstrates improved performance on standard datasets.
2. Mixture of Autoencoders
Consider the problem of clustering a set of n points $x_1, \ldots, x_n \in \mathbb{R}^d$ into k clusters. The k-means algorithm represents each cluster by a centroid. In our approach, rather than representing a cluster by a centroid, we represent each cluster by an autoencoder that is specialized in reconstructing objects belonging to that cluster. The clustering itself is carried out by directing the input object to the most suitable autoencoder.

We next formally describe the proposed clustering algorithm. The algorithm is based on a (soft) clustering network that produces a distribution over the k clusters:

$$p(c=i \mid x; \theta_c) = \frac{\exp(w_i h(x) + b_i)}{\sum_{j=1}^{k} \exp(w_j h(x) + b_j)}, \qquad i = 1, \ldots, k \qquad (1)$$

such that $\theta_c$ is the parameter set of the clustering network, $h(x)$ is a nonlinear representation of a point $x$ computed by the clustering network, and $w_1, \ldots, w_k, b_1, \ldots, b_k \in \theta_c$ are the parameters of the softmax output layer. The (hard) cluster assignment of a point $x$ is thus:

$$\hat{c} = \arg\max_{i=1,\ldots,k} p(c=i \mid x; \theta_c) = \arg\max_{i=1,\ldots,k} \big( w_i h(x) + b_i \big). \qquad (2)$$

The clustering task is, by definition, unsupervised and therefore we cannot directly train the clustering network. Instead, we use the clustering results to obtain a more accurate reconstruction of the network input. We represent each cluster by an autoencoder that is specialized in reconstructing instances of that cluster. If the dataset is properly clustered, we expect all the points assigned to the same cluster to be similar. Hence, the task of a cluster-specialized autoencoder should be relatively easy compared to using a single autoencoder for the entire data. We thus expect that good clustering should result in a small reconstruction error. Denote the autoencoder associated with cluster i by $f_i(x; \theta_i)$, where $\theta_i$ is the parameter set of the autoencoder network. We can view the reconstructed object $f_i(x; \theta_i) \in \mathbb{R}^d$ as a data-driven centroid of cluster i that is tuned to the input x. The goal of the training procedure is to find a clustering of the data such that the error of the cluster-based reconstruction is minimized.

To find the network parameters we jointly train the clustering network and the deep autoencoders. The clustering is thus computed by minimizing the following loss function:

$$L(\theta_1, \ldots, \theta_k, \theta_c) = -\sum_{t=1}^{n} \log \left( \sum_{i=1}^{k} p(c_t=i \mid x_t; \theta_c) \exp\big(-d(x_t, f_i(x_t; \theta_i))\big) \right) \qquad (3)$$

such that $d(x_t, f_i(x_t; \theta_i))$ is the reconstruction error of the i-th autoencoder. In our implementation we set $d(x_t, f_i(x_t; \theta_i)) = \| x_t - f_i(x_t; \theta_i) \|$.

In the minimization of (3) we simultaneously perform data clustering in the embedded space $h(x)$ and learn a 'centroid' representation for each cluster in the form of an autoencoder. Unlike most previously proposed deep clustering methods, there is no risk of collapsing to a trivial solution where all the data points are transformed to the same vector, even though the clustering is carried out in the embedded space. Collapsing all the data points into a single vector in the embedded space would result in directing all the points to the same autoencoder for reconstruction. As our clustering goal is to minimize the reconstruction error, this situation is, of course, worse than using k different autoencoders for reconstruction.
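To make the objective concrete, the following is a minimal NumPy sketch of the loss in (3), assuming the clustering probabilities and the per-cluster reconstructions have already been computed; the array names are illustrative and not from the paper.

```python
import numpy as np

def damic_loss(x, probs, recons):
    """Loss (3): negative log of the mixture of reconstruction likelihoods.

    x      : (n, d)    data points
    probs  : (n, k)    p(c_t = i | x_t; theta_c) from the clustering network
    recons : (n, k, d) f_i(x_t; theta_i), reconstruction of x_t by autoencoder i
    """
    # d(x_t, f_i(x_t)) = ||x_t - f_i(x_t)||, the per-cluster reconstruction error
    dist = np.linalg.norm(x[:, None, :] - recons, axis=-1)   # (n, k)
    # mixture likelihood per point: sum_i p(c=i|x_t) * exp(-d_i)
    mixture = np.sum(probs * np.exp(-dist), axis=1)          # (n,)
    return -np.sum(np.log(mixture + 1e-12))
```

In practice the log of the mixture would be computed in a numerically stable way (e.g., via log-sum-exp); the small epsilon above is only a guard for this sketch.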
There is therefore no need to add regularization terms to the loss function (which might influence the clustering accuracy) to prevent data collapse. Specifically, there is no need to add a decoder to the embedded space where the clustering is actually performed in order to prevent data collapse.

The back-propagation equation for the parameter set of the clustering network is:

$$\frac{\partial L}{\partial \theta_c} = -\sum_{t=1}^{n} \sum_{i=1}^{k} w_{ti} \cdot \frac{\partial}{\partial \theta_c} \log p(c_t=i \mid x_t; \theta_c) \qquad (4)$$

such that

$$w_{ti} = \frac{p(c_t=i \mid x_t; \theta_c) \exp\big(-d(x_t, f_i(x_t; \theta_i))\big)}{\sum_{j=1}^{k} p(c_t=j \mid x_t; \theta_c) \exp\big(-d(x_t, f_j(x_t; \theta_j))\big)} \qquad (5)$$

is a soft assignment of $x_t$ to the i-th cluster based on the current parameter set. In other words, the reconstruction error of the autoencoders is used to obtain soft labels that are employed for training the clustering network.

In recent years, network pre-training has largely been rendered obsolete for supervised tasks, given the availability of large labeled training datasets. However, for the hard optimization problems that arise in unsupervised clustering, such as minimizing (3), initialization is still crucial. To initialize the parameters of the network, we first train a single autoencoder using the layer-wise pre-training method for autoencoders described in [3]. After training the autoencoder, we carry out k-means clustering on the output of the bottleneck layer to obtain the initial clustering values. The k-means assigns a label to each data point. Note that in the pre-training procedure a single autoencoder is trained on the entire database. We use these labels as supervision to pre-train the clustering network (1). The points that were assigned by the k-means algorithm to cluster i are next used to pre-train the i-th autoencoder $f_i(x; \theta_i)$. Once all the network parameters have been initialized by this pre-training procedure, the network parameters are jointly trained to minimize the autoencoding reconstruction error defined by the loss function (3). A code sketch of this initialization procedure is given at the end of this section.

We dub the proposed algorithm Deep Autoencoder MIxture Clustering (DAMIC). The architecture of the network trained by the DAMIC algorithm and the final clustering procedure are depicted in the left and right panels of Fig. 1, respectively. The clustering algorithm is summarized in Table 1.

The DAMIC algorithm can be viewed as an extension of the k-means algorithm. Assume we replace each autoencoder in our network by a constant function $f_i(x_t; \theta_i) \equiv \mu_i \in \mathbb{R}^d$ and we replace the clustering network by a hard decision based on the reconstruction error. In so doing, we obtain exactly the classical k-means algorithm. The DAMIC algorithm replaces the constant centroid with a data-driven representation of the input computed by an autoencoder.

The probabilistic modeling used by the DAMIC clustering algorithm can also be viewed as an instance of the mixture-of-experts (MoE) model introduced in [15] and [17]. The MoE model is comprised of several expert models and a gate model. Each of the experts provides a decision and the gate is a latent variable that selects the relevant expert based on the input data. In spite of the huge success of deep learning, there are only a few studies that have explicitly utilized and analyzed MoEs as an architectural component of a neural network [9, 24]. MoE has been primarily applied to supervised tasks such as classification and regression. In our clustering algorithm the clustering network is the equivalent of the MoE gating function.
The experts here are autoencoders, where each autoencoder's expertise is to reconstruct a sample from the associated cluster. Our clustering cost function (3) follows the training strategy proposed in [15], which prefers an error function that encourages expert specialization instead of cooperation.

Table 1: The Deep Autoencoder MIxture Clustering (DAMIC) algorithm.

Goal: clustering $x_1, \ldots, x_n \in \mathbb{R}^d$ into k clusters.

Network components:
• A network that computes a soft clustering of the data point:
$$p(c=i \mid x; \theta_c) = \frac{\exp(w_i h(x) + b_i)}{\sum_{j=1}^{k} \exp(w_j h(x) + b_j)}$$
• A set of autoencoders (one for each cluster): $x \rightarrow \hat{x}_i = f_i(x; \theta_i), \; i = 1, \ldots, k$.

Pre-training:
• Train a single autoencoder for the entire dataset.
• Apply a k-means algorithm in the embedded space.
• Use the k-means clustering to initialize the network parameters.

Training: clustering is obtained by minimizing the reconstruction error:
$$L(\theta_1, \ldots, \theta_k, \theta_c) = -\sum_{t=1}^{n} \log \left( \sum_{i=1}^{k} p(c_t=i \mid x_t; \theta_c) \exp\big(-d(x_t, f_i(x_t; \theta_i))\big) \right)$$
The final (hard) clustering is:
$$\hat{c}_t = \arg\max_{i=1,\ldots,k} p(c_t=i \mid x_t; \theta_c), \qquad t = 1, \ldots, n.$$

Figure 1: A block diagram of the proposed mixture of deep autoencoders for clustering. The training procedure is described on the left side, and the final clustering on the right side.

We note that after the training process is finished, there is another way to extract the clustering from the trained network. Given a data point $x_t$, we can ignore the clustering DNN and assign each point to the cluster whose reconstruction error is minimal:

$$c_t = \arg\min_{i=1,\ldots,k} d(x_t, f_i(x_t; \theta_i)). \qquad (6)$$

We found that the performance of this clustering decision is very close to the clustering strategy we proposed in (2). Moreover, in almost all cases the hard classification decision of the clustering network (1) coincides with the cluster whose reconstruction error is minimal, i.e.,

$$\arg\max_{i=1,\ldots,k} p(c_t=i \mid x_t; \theta_c) = \arg\min_{i=1,\ldots,k} d(x_t, f_i(x_t; \theta_i)).$$

We can thus consider a variant of our clustering algorithm that completely avoids the clustering network. Instead, the training goal is to directly minimize the reconstruction error of the most suitable autoencoder using the following cost function:

$$L(\theta_1, \ldots, \theta_k) = \sum_{t=1}^{n} \min_{i=1,\ldots,k} d(x_t, f_i(x_t; \theta_i)). \qquad (7)$$

This cost function is very similar to the cost function of the k-means algorithm. The only difference is that the constant centroid is replaced here by the autoencoder reconstruction of the given input point. However, there are two drawbacks of using this alternative and simpler cost function. First, in our algorithm, in addition to the data clustering, we also obtain a nonlinear data embedding $x \rightarrow h(x)$ that can be used to visualize the clustering in a clustering-friendly space. The second issue is that we found empirically that without the clustering network, even if we use the pre-training procedure described above, we are more vulnerable to clustering collapsing issues, in the sense that at the end of the training procedure some of the autoencoders are not used by any data point. This provides another motivation for the proposed architecture that is based on an explicit clustering network.
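As a rough sketch of the initialization procedure described above (a single autoencoder trained on the whole dataset, k-means on its bottleneck codes, and the resulting labels used to pre-train the clustering network and the per-cluster autoencoders), the model objects `encoder`, `clustering_net` and `cluster_aes` are assumed to be Keras-style models and are not part of the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def pretrain_damic(x, encoder, clustering_net, cluster_aes, k):
    """Pre-training stage of DAMIC (sketch).

    encoder        : encoder part of a single DAE already trained on the full dataset
    clustering_net : softmax network p(c | x), pre-trained here with k-means labels
    cluster_aes    : list of k per-cluster autoencoders f_i
    """
    # 1) k-means clustering in the bottleneck space of the single autoencoder
    z = encoder.predict(x)
    labels = KMeans(n_clusters=k).fit_predict(z)

    # 2) use the k-means labels as pseudo-supervision for the clustering network
    #    (assumes clustering_net is compiled with sparse categorical cross-entropy)
    clustering_net.fit(x, labels, epochs=10)

    # 3) pre-train each cluster autoencoder on the points k-means assigned to it
    #    (the paper pre-trains the DAEs with a binary cross-entropy reconstruction loss)
    for i in range(k):
        xi = x[labels == i]
        cluster_aes[i].fit(xi, xi, epochs=10)
    return labels
```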
3. Experiments
In this section we evaluate the clustering results of our approach. We carried out experiments on different datasets and compared the proposed method to state-of-the-art standard and k-means-related deep clustering algorithms. We used both a synthetic dataset and real datasets. The synthetic dataset is described in Sec. 4.1. The real datasets used in the experiments are standard clustering benchmark collections. We considered both image and text datasets to demonstrate the general applicability of our approach.

The image datasets consisted of MNIST (70,000 images, 28 × 28 pixels, 10 classes), which contains hand-written digit images. We reshaped the images to one-dimensional vectors and normalized the pixel intensity levels (between 0 and 1). The Fashion dataset [26], which consists of 70,000 examples with similar dimensions as the MNIST dataset, was also used. This dataset is divided into 10 fashion classes.

The text collections we considered are the 20 Newsgroups dataset (hereafter 20NEWS) and the RCV1-v2 dataset (hereafter RCV1) [20]. For 20NEWS, the entire dataset comprising 18,846 documents labeled into 20 different news-groups was used. For RCV1, similar to [28], we used a subset of the database containing 365,968 documents, each of which pertains to only one of 20 topics. Because of the sparsity of the text datasets, and as proposed in [27] and [28], we selected the 2,000 words with the highest tf-idf values to represent each document.
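For the text corpora, a possible scikit-learn sketch of this feature extraction is shown below; note that `TfidfVectorizer(max_features=2000)` keeps the most frequent terms, so matching the paper exactly would require an extra step that ranks the vocabulary by tf-idf.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the full 20NEWS corpus (18,846 documents, 20 classes).
newsgroups = fetch_20newsgroups(subset="all")

# Keep 2,000 terms as tf-idf features.
vectorizer = TfidfVectorizer(max_features=2000)
features = vectorizer.fit_transform(newsgroups.data).toarray()  # ~ (18846, 2000)
labels = newsgroups.target  # ground-truth topics, used only for evaluation
```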
The clustering performance of the methods was evaluated with respect to the following three standard measures: normalized mutual information (NMI) [4], adjusted Rand index (ARI) [30], and clustering accuracy (ACC) [4]. NMI is an information-theoretic measure based on the mutual information of the ground-truth classes and the obtained clusters, normalized using the entropy of each. ACC measures the proportion of data points for which the obtained clusters can be correctly mapped to ground-truth classes, where the matching is based on the Hungarian algorithm [19]. Finally, ARI is a variant of the Rand index that is adjusted for the chance grouping of elements. Note that NMI and ACC lie in the range of 0 to 1, where one is the perfect clustering result and zero the worst. ARI is a value between minus one and (plus) one, where one is the best clustering performance and minus one the worst. (A short code sketch of these measures is given after the list of baselines below.)

Figure 2: Synthetic dataset with 4 clusters: (a) latent domain v, (b) NMF, (c) SVD, (d) DAE+KM, (e) DAMIC. Each true cluster label has a different color. The observable data is generated from the Gaussian distributed clusters in the first panel through (8). The 2D representations of the observed data are shown for the NMF, SVD, DAE+KM and the proposed DAMIC methods.

The proposed DAMIC algorithm was compared to the following methods:
K-means (KM): The classic k-means algorithm [21].

Spectral Clustering (SC): The classic SC algorithm [23].

Deep Autoencoder followed by k-means (DAE+KM): This algorithm is carried out in two steps. First, a DAE is applied. Next, KM is applied to the embedded layer of the DAE. This algorithm is also used as an initialization step for the proposed algorithm.

Deep Clustering Network (DCN): The algorithm performs joint reconstruction and k-means clustering at the same time. The loss comprises penalties on both the reconstruction and the clustering losses [28].

Deep Embedding Clustering (DEC): The algorithm performs joint embedding and clustering in the embedded space. The loss function only contains a clustering loss term [27].
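As mentioned above, a sketch of the three evaluation measures using scikit-learn and SciPy is given here (ACC uses the Hungarian algorithm via `linear_sum_assignment`); this is one common implementation, not necessarily the one used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between clusters and classes (Hungarian algorithm)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                        # co-occurrence of cluster p and class t
    row, col = linear_sum_assignment(count.max() - count)  # maximize matched counts
    return count[row, col].sum() / y_true.size

def evaluate(y_true, y_pred):
    return {
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
        "ACC": clustering_accuracy(y_true, y_pred),
    }
```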
The proposed method was implemented with the deep learning toolbox TensorFlow [1]. All datasets were normalized between 0 and 1. All neurons in the proposed architecture, except the output layer, used the exponential linear unit (elu) as the transfer function. The output layer in all DAEs was the sigmoid function, and the clustering network output layer was a softmax layer. Batch normalization [14] was utilized on all layers, and the ADAM optimizer [18] was used for both the pre-training and the training phase. In the pre-training phase, the DAE networks were trained with the binary cross-entropy loss function. We set the number of epochs for the training phases to 50. However, early stopping was used to prevent mis-convergence of the loss. The mini-batch size was 256.

Note that for simplicity, and to show the robustness of the proposed method, the architectures of DAMIC in all the following experiments had a similar shape; i.e., for each of the DAEs we used a 5-layer DNN with 1024, 256, k, 256, 1024 elu neurons, respectively, and for the clustering network we used 512, 512, k elu neurons, respectively, where k is the number of clusters. There was no need for hyperparameter tuning for the experiments on the different datasets.
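As an illustration only (not the authors' code), the per-cluster autoencoders and the clustering network with the layer sizes quoted above could be assembled in Keras roughly as follows; the optimizer wiring, the joint loss (3) and the training loop are omitted.

```python
from tensorflow.keras import layers, models

def make_cluster_autoencoder(input_dim, k):
    """One DAE f_i: hidden layers 1024-256-k-256-1024 (elu), sigmoid output."""
    model = models.Sequential()
    sizes = (1024, 256, k, 256, 1024)
    model.add(layers.Dense(sizes[0], activation="elu", input_shape=(input_dim,)))
    model.add(layers.BatchNormalization())
    for units in sizes[1:]:
        model.add(layers.Dense(units, activation="elu"))
        model.add(layers.BatchNormalization())
    model.add(layers.Dense(input_dim, activation="sigmoid"))
    return model

def make_clustering_network(input_dim, k):
    """Clustering network: 512-512 (elu) hidden layers, k-way softmax output."""
    model = models.Sequential()
    model.add(layers.Dense(512, activation="elu", input_shape=(input_dim,)))
    model.add(layers.BatchNormalization())
    model.add(layers.Dense(512, activation="elu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dense(k, activation="softmax"))
    return model

# Example: MNIST-sized inputs (28 x 28 = 784) with k = 10 clusters.
k = 10
autoencoders = [make_cluster_autoencoder(784, k) for _ in range(k)]
clustering_net = make_clustering_network(784, k)
```

The joint training stage would then tie these components together with the loss in (3), using the ADAM optimizer, batch size 256 and early stopping as listed above.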
4. Results
To illustrate the capabilities of the DAMIC algorithm we generated synthetic data as in [28]. The 2D latent domain contained 4000 samples from four Gaussian distributed clusters, as shown in Fig. 2a. The observed signal is

$$x_t = \sigma(W \cdot v_t), \qquad t = 1, \ldots, n \qquad (8)$$

where $\sigma$ is the sigmoid function, $W \in \mathbb{R}^{d \times 2}$ is a fixed matrix and $v_t$ is the t-th point in the latent domain.

We first applied the DAE+KM algorithm for initialization. The architecture of the DAE consisted of a 4-layer encoder with 100, 50, 10, 2 neurons, respectively. The decoder was a mirrored version of the forward network. Figs. 2b, 2c and 2d depict the 2D representations of (8) obtained by the NMF, SVD and DAE+KM methods, respectively. It is clear that they are not sufficiently separated.

The proposed DAMIC algorithm was then applied. The architecture of each autoencoder consisted of 5 layers of 1024, 256, 4, 256, 1024 neurons, as described in the previous section. The clustering network was also similar, with 512, 512, 2 neurons, respectively. Fig. 2e depicts the 2D embedded space of the clustering network $h(x_t)$. It is easy to see that the embedded space is much more separable.

Table 2 summarizes the results of the k-means, DAE+KM, SC and DAMIC algorithms on the synthetically generated data. It is easy to verify that the DAMIC algorithm outperforms the competing algorithms in both the NMI and ARI measures.

Table 2: Objective measures for the synthetic database.
Method  DAMIC  DAE+KM  SC  KM
NMI

The MNIST database has 70,000 hand-written gray-scale images of digits. Each image is 28 × 28 pixels. Note that we worked on the raw data of the dataset (without pre-processing). For simplicity, the architectures of all the DAEs were identical. Specifically, for the MNIST dataset we used a 5-layer network with 1024, 256, 10, 256, 1024 neurons, respectively. The output layer of each DAE was set to be the sigmoid function. For the clustering network we used a simpler 3-layer network with 512, 512, 10 neurons, respectively. The output transfer function of the clustering network was the softmax function. Table 3 presents the NMI, ARI and ACC results of the proposed DAMIC method and several standard baselines. It is clear that DAMIC outperforms the other methods in all measures.
DAE expertise. To test the expertise of each of the DAEs we conducted the following experiment. After the clustering algorithm converged on the MNIST dataset, we synthetically created a new image in which all the pixels were set to '1' (Fig. 3a). The image reconstructions of all 10 DAEs are shown in Fig. 3. It is evident that each DAE assumes a different pattern of input. Specifically, each DAE is responsible for a different digit. The clustering task was unsupervised, and we sorted the autoencoders in Fig. 3 by their corresponding digits from '0' to '9' merely for the purpose of visualization.

Figure 3: The outputs of the different DAEs with a vector of all-ones input. (a) Input; (b) DAE outputs.
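A minimal sketch of this kind of probe (feeding a single image through every cluster autoencoder and comparing reconstruction errors), assuming `autoencoders` is a list of trained per-cluster Keras models as in the architecture sketch above:

```python
import numpy as np

def probe_autoencoders(x, autoencoders):
    """Feed a single image x (flattened, shape (d,)) to every cluster autoencoder
    and return its reconstructions and reconstruction errors."""
    x = x.reshape(1, -1)
    recons = [ae.predict(x, verbose=0)[0] for ae in autoencoders]
    errors = [float(np.linalg.norm(x[0] - r)) for r in recons]
    return recons, errors

# Example probe, as in Fig. 3: an all-ones 28 x 28 image.
# recons, errors = probe_autoencoders(np.ones(784), autoencoders)
# best = int(np.argmin(errors))   # the DAE whose expertise matches the probe best
```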
Best reconstruction wins
To further understand the behavior of the gate we carried out a different test. An image of the digit '4' was fed to the network (Fig. 4a). The outputs of the different DAEs are depicted in Fig. 4. Since each DAE specializes in a different digit, it was expected that the respective DAE would have the lowest reconstruction error. This was also reflected in the decision of the clustering network, $p(c = 4 \mid x; \theta_c) = 0.$… Note that the other DAEs reshaped the reconstruction to be close to their digit expertise.

Figure 4: An example of the outputs of the different DAEs with the digit '4' as the input. (a) Input; (b) DAE outputs.

The Fashion dataset shares the same dimensions and structure as the MNIST dataset. The only difference is the content of the images, which are now one of ten fashion items. Therefore, the same processing described above for MNIST was carried out on this dataset. Table 4 describes the results on the Fashion dataset. It is clear that the proposed DAMIC algorithm outperforms the compared algorithms.

We note that the best reported clustering results on the MNIST data were achieved by the VaDE algorithm [16]. Here we compared to k-means-based algorithms using the same network architecture and parameter initialization, and showed improvement in the performance. The VaDE algorithm belongs to a different family of algorithms with different network architecture and parameter initialization strategies. Hence, a direct performance comparison is difficult since it is heavily dependent on the implementations. It is worth noting that on the Fashion dataset, our results even outperform those of the VaDE and the DEC with data augmentation (DEC-DA) algorithms [10].

The 20Newsgroups corpus consists of 18,846 documents from 20 news groups. As in [28], we also used the tf-idf representation of the documents and picked the 2,000 most frequently used words as the features. The architecture used in each of the DAEs for this experiment also consisted of a 5-layer DNN with 1024, 256, 20, 256, 1024 neurons, respectively. The clustering network here consisted of 512, 512, 20 neurons. Table 5 shows the results of the NMI, ARI and ACC measures. It is clear that the proposed clustering method outperformed the competing baseline algorithms.
The dataset used in this experiment is a subset of RCV1-v2 with 365,968 documents, each containing one of 20 topics. As in [28], the 2,000 most frequently used words (in tf-idf form) are used as the features for each document. In contrast to the previous databases, in the RCV1 dataset the sizes of the classes are not equal. Therefore, KM-based approaches might not be sufficient in this case. In our architecture we used 1024, 256, 20, 256, 1024 elu neurons in all DAEs, respectively, and in the clustering network we used 512, 512, 20 elu neurons.

Table 6 presents the three objective measures for the RCV1 experiment. The proposed method outperformed the competing methods in the NMI and ARI measures and had the same ACC score as the DCN.
The DAMIC algorithm comprises two steps: the initialization step, which is based on a deep autoencoder followed by k-means clustering (DAE+KM), and a joint training of the gate and the clusters' autoencoders. We next explore the necessity of each part of the algorithm. For that, we compared DAMIC with two variants of the proposed algorithm. The first is based only on the initialization phase (without the joint training), and the second is based only on the joint training phase with random initialization.

Table 7: Ablation study on the MNIST database.
Method  DAMIC  Pre-training only  Joint-training only  KM
NMI

Table 7 describes the results of our ablation study. The first conclusion is that the joint training with random initialization improves the k-means results. This result confirms the first contribution of this paper, namely that even with random initialization, the algorithm does not suffer from the clustering collapsing problem. It is also evident that the pre-training phase is important for parameter initialization, but it is not enough on its own. The proposed algorithm, which employs both the initialization step and the joint training, outperforms each of the variants separately.
5. Conclusion
In this study we presented a clustering technique which leverages the strength of deep neural networks. Our technique has two major properties. First, unlike most previous methods, the clusters are represented by an autoencoder network instead of a single centroid vector in the embedded space. This enables a much richer representation of each cluster. Second, the algorithm does not suffer from the data collapsing problem. Hence, there is no need for regularization terms that have to be tuned for each dataset separately. Experiments on a variety of real datasets showed the strong performance of the proposed algorithm over the other methods.
References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] E. Aljalbout, V. Golkov, Y. Siddiqui, and D. Cremers. Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648, 2018.
[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS), 2007.
[4] D. Cai, X. He, and J. Han. Locally consistent concept factorization for document clustering. IEEE Transactions on Knowledge and Data Engineering, 23(6):902–913, 2011.
[5] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision (ECCV), 2018.
[6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2016.
[7] N. Dilokthanakul, P. Mediano, M. Garnelo, M. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
[8] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In IEEE International Conference on Computer Vision (ICCV), pages 5747–5756, 2017.
[9] D. Eigen, M. Ranzato, and I. Sutskever. Learning factored representations in a deep mixture of experts. In International Conference on Learning Representations (ICLR), Workshop, 2014.
[10] X. Guo, E. Zhu, X. Liu, and J. Yin. Deep embedded clustering with data augmentation. In Asian Conference on Machine Learning, pages 550–565, 2018.
[11] W. Harchaoui, P. A. Mattei, and C. Bouveyron. Deep adversarial Gaussian mixture autoencoder for clustering. In ICLR, 2017.
[12] C. C. Hsu and C. W. Lin. CNN-based joint clustering and representation learning with feature drift compensation for large-scale image data. IEEE Transactions on Multimedia, 20(2):421–429, 2018.
[13] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama. Learning discrete representations via information maximizing self augmented training. arXiv preprint arXiv:1702.08720, 2017.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
[16] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.
[17] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[18] D. Kingma and J. Ba. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
[20] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.
[21] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[22] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, and J. Long. A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access, pages 1–1, 2018.
[23] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
[24] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.
[25] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
[26] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[27] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (ICML), 2016.
[28] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards K-means-friendly spaces: Simultaneous deep learning and clustering. In International Conference on Machine Learning (ICML), 2017.
[29] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2016.
[30] K. Y. Yeung and W. L. Ruzzo. Details of the adjusted Rand index and clustering algorithms, supplement to the paper "An empirical study on principal component analysis for clustering gene expression data".