Deep Clustering With Intra-class Distance Constraint for Hyperspectral Images
Jinguang Sun, Wanli Wang, Xian Wei, Li Fang, Xiaoliang Tang, Yusheng Xu, Hui Yu, Wei Yao
Abstract—The high dimensionality of hyperspectral images often results in the degradation of clustering performance. Due to the powerful capability of deep feature extraction and non-linear feature representation, clustering algorithms based on deep learning have become a hot research topic in the field of hyperspectral remote sensing. However, most deep clustering algorithms for hyperspectral images utilize deep neural networks as feature extractors without considering prior knowledge constraints that are suitable for clustering. To solve this problem, we propose an intra-class distance constrained deep clustering algorithm for high-dimensional hyperspectral images. The proposed algorithm constrains the feature mapping procedure of the auto-encoder network by the intra-class distance so that raw images are transformed from the original high-dimensional space to a low-dimensional feature space that is more conducive to clustering. Furthermore, the related learning process is treated as a joint optimization problem of deep feature extraction and clustering. Experimental results demonstrate the strong competitiveness of the proposed algorithm in comparison with state-of-the-art clustering methods for hyperspectral images.
Index Terms—Deep learning, hyperspectral image clustering, intra-class distance constraint, low-dimensional representation, remote sensing.
I. INTRODUCTION

With the development of remote sensing technology, a wide diversity of sensor characteristics is nowadays available. The sensing data range from medium and very high resolution (VHR) multispectral images to hyperspectral images that sample the electromagnetic spectrum in high detail [1]–[8]. Utilizing these myriad sensors, the Earth Observation System (EOS) generates massive practical images of various land-covering objects. Owing to their abundant spatial and spectral information, these numerous images make it possible to extend the applications of hyperspectral remote sensing to many potential fields [9]–[15].
This work was partially supported by the Young Scientists Fund of the National Natural Science Foundation of China under Grants No. 61602226 and No. 61806186, and the CAS Pioneer Hundred Talents Program (Type C) under Grant No. 2017-122. (Jinguang Sun and Wanli Wang contributed equally to this work.) (Corresponding author: Xian Wei.)
J. Sun and W. Wang are with the School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, Liaoning, China (email: [email protected]; [email protected]). X. Wei, L. Fang, X. Tang and H. Yu are with the Quanzhou Institute of Equipment Manufacturing, Haixi Institute, Chinese Academy of Sciences, Quanzhou 362216, Fujian, China (email: [email protected]). Y. Xu is with the Department of Photogrammetry and Remote Sensing, Technical University of Munich, 80333 Munich, Germany. W. Yao is with the Department of Land Surveying and Geo-Informatics, Hong Kong Polytechnic University, 181 Chatham Road South, Hung Hom, Kowloon, Hong Kong.

However, it is very arduous to annotate these massive practical earth observation images effectively. Due to the lack of labeled high-dimensional samples of hyperspectral images, learning appropriate low-dimensional (LD) representations of the data for clustering plays a critical role in hyperspectral image annotation and understanding.

Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups [16]. For remote sensing images, the task is to classify the pixels into homogeneous regions, which segments every image into different partitions [1], [7], [17]. Most traditional clustering algorithms are based on shallow linear models [18]–[21], such as algorithms based on K-means [22], ISODATA [23] and Fuzzy C-means [4], [6], which often fail when the data exhibit an irregular non-linear distribution. During the past decades, spectral-based clustering methods [5], [24]–[28] and density-based clustering methods [7] have been the state of the art. Let $X = [x_1, \cdots, x_n] \in \mathbb{R}^{m \times n}$ be the matrix containing the $n$ independent training samples arranged as its columns. Spectral-based clustering approaches perform clustering in two steps. First, an affinity matrix (i.e., a similarity graph) $C$ is built to depict the relationship of the data, where $C_{ij}$ denotes the similarity between data points $x_i$ and $x_j$. Second, the data are clustered through clustering the eigenvectors of the graph Laplacian $L = D^{-1/2} A D^{-1/2}$, where $D$ is a diagonal matrix with $D_{ii} = \sum_j A_{ij}$ and $A = |C| + |C^T|$. The main idea of the density-based clustering approaches [29] is to find high-density regions that are separated by low-density regions. The density peaks clustering algorithm (DPCA) proposed by Alex Rodriguez [30] has brought the density-based clustering approaches to a new stage. The core idea of DPCA is that the centre of a cluster is surrounded by points with lower local density and is relatively far away from any point with higher local density. DPCA separates the clustering of the non-centre points into a single process. Because the selection of the cluster centres and the classification of the non-centre points are separated, the clustering precision is increased.

To solve the clustering problem of more complexly distributed data, the sparse subspace clustering (SSC) algorithm [13], [31] was developed. The core idea of SSC is that, among the infinitely many representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. In fact, the SSC algorithm is a solution of sparse optimization in the framework of spectral clustering, and it evaluates the labels for every sample in the low-dimensional space.
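To make the two-step spectral procedure above concrete, the following is a minimal Python sketch: it symmetrizes a given affinity matrix $C$, forms a normalized Laplacian, and runs K-means on the leading eigenvectors. The normalization convention and the final K-means step are common choices assumed here for illustration, not details prescribed by the cited methods.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(C, n_clusters):
    """Two-step spectral clustering on a precomputed affinity matrix C."""
    A = np.abs(C) + np.abs(C.T)                     # A = |C| + |C^T|
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = D_inv_sqrt @ A @ D_inv_sqrt                 # normalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    V = eigvecs[:, -n_clusters:]                    # top-k eigenvectors as embedding
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(V)
```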
Sharing the homogeneous idea with SSC, many sparse representation and low-rank approximation based methods for subspace clustering [18], [32]–[35] have received a lot of attention in recent years. The key components of these methods are finding a sparse and low-rank representation of the data and then building a similarity graph on the sparse coefficient matrix for the separation of the data.

Although the spectral-based and density-based clustering algorithms can effectively cluster arbitrarily distributed data, only the shallow features of the data can be used [19]. Moreover, it is difficult to further improve the clustering effect and precision. On the other hand, deep neural networks can non-linearly map data from the original feature space to a new feature space, with promising advantages over traditional clustering algorithms in deep feature extraction and dimension reduction. Therefore, in recent years, subspace clustering algorithms based on deep learning have attracted increasing attention.

The core idea of deep neural network clustering [36]–[40] is to non-linearly map data from the original feature space to a new feature space and then to complete clustering in the new feature space. Because the mapping is non-linear, deep neural network clustering has powerful capabilities for intrinsic feature extraction and data representation. The clustering algorithm based on auto-encoder networks [38], [41]–[43] is a popular framework for deep clustering, which utilizes a symmetric network structure to encode and represent the data. Such an algorithm consists of two important steps. First, it obtains the code space of the data by reducing the data dimension and clusters the data in the obtained code space. Second, it performs the representation transformation from the obtained code space to a new generative feature space. Building upon the auto-encoder network, the idea of generative adversarial learning [44]–[48] has been introduced into the clustering field, which further improves the performance of deep clustering algorithms.

Generally speaking, current deep clustering algorithms have the form [32], [42], [49], [50]

$$J_p = J_{p1} + \underbrace{\lambda \left\| Z^{(M/2)} - Z^{(M/2)} C \right\|^2}_{J_{p2}} + J_{p3}, \qquad (1)$$

where $J_p$ is the loss function, $J_{p1}$ represents the reconstruction error, $J_{p2}$ is a sparse or low-rank constraint determined by the pre-trained matrix $C$, and $J_{p3}$ is the regularization term. The $\lambda$ in $J_{p2}$ represents the constraint coefficient, and $Z^{(M/2)}$ denotes the codes or features extracted by the auto-encoder. The data representation in (1) achieves both data reconstruction and dimensionality reduction. Therefore, the features extracted by such algorithms are appropriate for data dimensionality reduction. However, due to the lack of prior knowledge constraints, the feature extraction procedure of deep clustering algorithms tends to lose some useful guidance, which needs to be further explored. To solve this problem, we propose an embedded deep clustering algorithm with an intra-class distance constraint. The proposed algorithm embeds the global K-means model into the auto-encoder network, which constrains the procedure of data mapping and obtains a data representation that is more conducive to clustering.
The proposed algorithm has the following contributions:

1) The intra-class distance is utilized to constrain the encoding process of the auto-encoder network so that the auto-encoder can map the data from the original feature space to a feature space that is more conducive to clustering.

2) A pre-training process is not necessary, and the indicator matrix is dynamically adjusted to the image data.

3) This work treats the solution of the proposed algorithm as a joint learning optimization problem, and the entire clustering procedure is completed in one stage.

II. RELATED WORKS
In this section, we briefly discuss some existing works in unsupervised deep learning and subspace clustering.
A. Auto-encoder Network
With impressive learning and characterization capabilities, auto-encoder neural networks have achieved great success in various areas, especially in the scenario of unsupervised learning [2], [43], [51]–[53], such as natural language processing [54], image processing [55], object detection [56], biometric recognition [57], and data analysis [58]. As state-of-the-art unsupervised techniques, auto-encoders and auto-encoder based neural networks also make outstanding contributions in the field of remote sensing. In this subsection, we briefly introduce the auto-encoder network.

In general, an auto-encoder [41] is a kind of network that consists of an encoder and a decoder, and the structure of the encoder and decoder is symmetrical. If the auto-encoder contains multiple hidden layers, then the number of hidden layers of the encoder is equal to the number of hidden layers of the decoder. The structural model of the basic auto-encoder is shown in Fig. 1. The purpose of the basic auto-encoder is to reconstruct the input data at the output layer, and the perfect case is that the output signal $X_{out}$ (i.e., $Z^{(M)}$) is exactly the same as the input signal $X_{in}$ (i.e., $Z^{(0)}$). According to the structure shown in Fig. 1, the encoding process and decoding process of the basic auto-encoder can be described as

$$\begin{cases} z^{(i+1)} = F_e\left( W_i z^{(i)} + b_i \right) & \text{Encoding} \\ z^{(j+1)} = F_d\left( W_j z^{(j)} + b_j \right) & \text{Decoding} \end{cases}, \qquad (2)$$

where $W_i$, $b_i$ denote the $i$-th encoding weight and bias, respectively, $W_j$, $b_j$ represent the $j$-th decoding weight and bias, respectively, $z^{(i)}$ denotes the data vector in the $i$-th layer, and $F_e$ is the non-linear transformation. Sigmoid, Tanh and ReLU are commonly used activation functions for $F_e$. $F_d$ can be the same non-linear transformation as in the encoding process. Therefore, the loss function of the basic auto-encoder is to minimize the error between $X_{in}$ and $X_{out}$. The encoder converts the input signal into codes through some non-linear mapping, and the decoder tries to remap the codes to the input signal. The parameters of the auto-encoder, i.e., weights and biases, are learned by minimizing the total reconstruction error, which can be computed by the mean square error

$$J_m(W, b) = \sum_{i=1}^{n} L\left( z_i^{(0)}, z_i^{(M)} \right) = \sum_{i=1}^{n} \left\| z_i^{(0)} - z_i^{(M)} \right\|_2^2, \qquad (3)$$

or the cross entropy

$$J_c(W, b) = \sum_{i=1}^{n} L\left( z_i^{(0)}, z_i^{(M)} \right) = -\sum_{i=1}^{n} \left( z_i^{(0)} \log z_i^{(M)} + \left( 1 - z_i^{(0)} \right) \log\left( 1 - z_i^{(M)} \right) \right). \qquad (4)$$

For the structure shown in Fig. 1, the hidden layers of the basic auto-encoder network have three different structures: a compressed structure, a sparse structure, and an equivalent-dimensional structure. When the number of input layer neurons is greater than the number of hidden layer neurons, it is called a compressed structure [59]. Conversely, when the number of input layer neurons is smaller than the number of hidden layer neurons, it is called a sparse structure. If the input layer and the hidden layer have the same number of neurons, it is named the equivalent-dimensional structure.
Fig. 1. The structure of the auto-encoder network.
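As a concrete illustration of the encoding/decoding maps in (2) trained with the mean square error (3), below is a minimal PyTorch sketch of a compressed auto-encoder. The layer widths, optimizer settings, and toy data are illustrative assumptions, not the configuration used later in this paper.

```python
import torch
import torch.nn as nn

class BasicAutoEncoder(nn.Module):
    """Symmetric auto-encoder: z^(i+1) = F_e(W_i z^(i) + b_i), mirrored decoder."""
    def __init__(self, dim_in=200, dim_code=10):
        super().__init__()
        # compressed structure: hidden layers are narrower than the input
        self.encoder = nn.Sequential(
            nn.Linear(dim_in, 64), nn.Tanh(),
            nn.Linear(64, dim_code), nn.Tanh())
        self.decoder = nn.Sequential(
            nn.Linear(dim_code, 64), nn.Tanh(),
            nn.Linear(64, dim_in))

    def forward(self, x):
        code = self.encoder(x)            # low-dimensional code Z^(M/2)
        return self.decoder(code), code   # reconstruction Z^(M), code

model = BasicAutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 200)                  # a toy batch of 32 spectra
for _ in range(100):
    x_rec, _ = model(x)
    loss = ((x - x_rec) ** 2).sum(dim=1).mean()   # mean square error, cf. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```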
B. Deep Subspace Clustering
Many practical subspace clustering missions transform data from a high-dimensional feature space to a low-dimensional feature space. However, many subspace clustering algorithms [18], [19], [31], [35], [60], [61] are shallow models, which cannot model the non-linearity of data in practical situations. Benefiting from the powerful capabilities of non-linear modeling and data representation of deep neural networks, some clustering approaches based on deep neural networks have been proposed in recent years. Song et al. [41] integrated an auto-encoder network with K-means to cluster the latent features. However, the feature mapping and the clustering are two relatively independent processes in their work, and the K-means algorithm is not joined into the feature mapping process. Therefore, the feature mapping process may not be constrained by the K-means algorithm. With the emergence of generative adversarial networks, some deep clustering algorithms [37], [47] embedding discriminant and adversarial ideas have been proposed, which further enhance the ability of deep feature extraction and data representation. On the other hand, some deep subspace clustering algorithms with prior constraints have been developed. According to different constraining conditions, these prior constraints may be sparse [59], low rank [62], or least squares [39]. Based on the auto-encoder network, the loss functions of these algorithms share the same modality, shown in (5):

$$J_p(W_i, b_i) = \underbrace{\frac{1}{2} \left\| Z^{(0)} - Z^{(M)} \right\|_F^2}_{J_{p1}} + \underbrace{\lambda_1 \left\| Z^{(M/2)} - Z^{(M/2)} C \right\|_F^2}_{J_{p2}} + \underbrace{\lambda_2 \sum_{i=1}^{M} \left( \left\| W_i \right\|_F^2 + \left\| b_i \right\|_2^2 \right)}_{J_{p3}}, \qquad (5)$$

where $J_{p1}$ denotes the reconstruction error, $J_{p2}$ denotes the prior constraint, $J_{p3}$ denotes the regularization term, and $C$ is a pre-trained matrix. These algorithms learn a representation of the input data with minimal reconstruction error and incorporate prior information into the latent feature learning to preserve the key reconstruction relation over the data. Although the features extracted by such algorithms are appropriate for the dimensionality reduction and reconstruction of the data, the procedure of feature extraction may lose its purposefulness due to the lack of clustering-related prior knowledge. The aforementioned deep clustering algorithms have the following three weaknesses:

1) They lack prior constraints related to the clustering task.

2) The matrix $C$ needs to be pre-trained, which may not be optimal for the various data to be clustered.

3) Once given, the matrix $C$ is fixed and cannot be optimized jointly with the network.

Different from these existing works, we propose an approach that embeds the intra-class distance into an auto-encoder network. The indicative matrix and the parameters of the network are adaptively optimized simultaneously. The proposed approach utilizes the intra-class distance to constrain the feature mapping process of the auto-encoder so that the deep features extracted from the source space are more conducive to clustering. The loss modality of (5) is sketched below for concreteness; more details of the proposed approach are described in the following sections.
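The sketch below assumes codes stored with rows as samples and a fixed, pre-trained self-expression matrix $C \in \mathbb{R}^{N \times N}$; these layout choices are illustrative, not taken from the cited implementations.

```python
import torch

def prior_constrained_loss(z0, z_code, z_rec, C, weights, biases,
                           lam1=0.1, lam2=1e-3):
    """Generic prior-constrained deep clustering loss in the form of (5).

    z0: input Z^(0) of shape (N, D); z_code: codes Z^(M/2) of shape (N, d);
    z_rec: reconstruction Z^(M) of shape (N, D); C: fixed (N, N) matrix.
    """
    j_p1 = 0.5 * torch.sum((z0 - z_rec) ** 2)            # reconstruction error
    j_p2 = lam1 * torch.sum((z_code - C @ z_code) ** 2)  # self-expressive prior
    j_p3 = lam2 * (sum(W.pow(2).sum() for W in weights) +
                   sum(b.pow(2).sum() for b in biases))  # regularization
    return j_p1 + j_p2 + j_p3
```

Note that $C$ enters only as a constant here, which is exactly the third weakness listed above: it cannot be optimized jointly with the network.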
III. THE PROPOSED DEEP CLUSTERING APPROACH

In this section, we elaborate on the details of the proposed Deep Clustering with Intra-class Distance Constraint (DCIDC) algorithm. The framework of DCIDC is an auto-encoder network, and the specific structure may be varied according to different scenarios. With the constraint of the intra-class distance, DCIDC extracts deep features of the data by mapping them from the source space to a latent feature space. In the new latent feature space, objects in the same group are more similar to each other than to those in other groups. We first explain how DCIDC is specifically designed and then present the algorithm for optimizing the DCIDC model.
A. Deep Clustering with Intra-class Distance Constraint
The neural network within DCIDC consists of $M+1$ layers performing $M$ non-linear transformations, where $M$ is an even number; the first $M/2$ hidden layers form the encoder, which learns a set of compact (i.e., low-dimensional) representations, and the last $M/2$ layers form the decoder, which progressively reconstructs the input. The framework of DCIDC is shown in Fig. 2. Let $Z^{(0)} = X_{in} \in \mathbb{R}^{N \times D}$ be one input matrix to the first layer, which denotes a hyperspectral image consisting of $N$ image pixels (samples), and let $z^{(0)}$ be one row of this matrix, which denotes a sample in the $D$-dimensional feature space. For the encoder, the output of the $i$-th layer is computed by

$$z^{(i)} = F_e\left( W_i z^{(i-1)} + b_i \right) \in \mathbb{R}^{d_i}, \qquad (6)$$

where $i = 1, 2, \cdots, M/2$ indexes the layers of the encoder, $W_i$ denotes the weight matrix from the $(i-1)$-th layer to the $i$-th layer, and $b_i$ denotes the bias of the $i$-th layer. $\mathbb{R}^{d_i}$ indicates that $z^{(i)}$ belongs to a $d_i$-dimensional feature space, and $F_e(\cdot)$ is a non-linear activation function. The $M/2$-th layer $z^{(M/2)} \in \mathbb{R}^{d_{M/2}}$ is shared by the encoder and the decoder. For the purpose of reducing the dimensionality of the input data, the dimensions of the layers in the encoder are designed such that $D \geq d_{i-1} \geq d_i \geq d_{M/2}$. For the decoder, the output of the $j$-th layer is computed by

$$z^{(j)} = F_d\left( W_j z^{(j-1)} + b_j \right) \in \mathbb{R}^{d_j}, \qquad (7)$$

where $j = M/2+1, M/2+2, \cdots, M$ indexes the layers of the decoder and the non-linear activation function $F_d(\cdot)$ can be the same as $F_e(\cdot)$ or an entirely different non-linear model. For the purpose of data reconstruction, the dimensions of the layers in the decoder are designed such that $d_{M/2} \leq d_{j-1} \leq d_j \leq d_M = D$. Thus, given a sample $z^{(0)}$ (i.e., $x_{in}$) as one input of the first layer of DCIDC, $z^{(M)}$ (i.e., $x_{out}$) is the reconstruction of $z^{(0)}$, and the corresponding $z^{(M/2)}$ is the representation of $x_{in}$. Furthermore, for a data matrix $Z^{(0)} = \left[ z_1^{(0)}, z_2^{(0)}, \cdots, z_N^{(0)} \right]^T \in \mathbb{R}^{N \times D}$, which denotes a collection of $N$ given samples, the output matrix of the decoder $Z^{(M)} = \left[ z_1^{(M)}, z_2^{(M)}, \cdots, z_N^{(M)} \right]^T \in \mathbb{R}^{N \times D}$ is the corresponding reconstruction of $Z^{(0)}$, and $Z^{(M/2)} = \left[ z_1^{(M/2)}, z_2^{(M/2)}, \cdots, z_N^{(M/2)} \right]^T \in \mathbb{R}^{N \times d_{M/2}}$ is the desired low-dimensional representation of $Z^{(0)}$.

Fig. 2. Network framework of the proposed algorithm.

The objective of DCIDC is to minimize the data reconstruction error and to jointly constrain the non-linear transformation from $X_{in}$ to the corresponding representation $Z^{(M/2)}$ by the intra-class distance. These targets can be formally stated as

$$\min_{W_i, b_i} J(W_i, b_i) = \underbrace{\left\| Z^{(0)} - Z^{(M)} \right\|_F^2}_{J_1} + \underbrace{\lambda_1 \left\| Z^{(M/2)} - H S^T \right\|_F^2}_{J_2} + \underbrace{\lambda_2 \sum_{i=1}^{M} \left( \left\| W_i \right\|_F^2 + \left\| b_i \right\|_2^2 \right)}_{J_3}, \qquad (8)$$

where $\lambda_1$ and $\lambda_2$ are positive trade-off parameters. The terms $J_1$, $J_2$, and $J_3$ are designed for different goals. Intuitively, the first term $J_1$ is designed for preserving locality by minimizing the reconstruction error w.r.t. the input itself; in other words, the input acts as a supervisor for the procedure of learning a low-dimensional representation $Z^{(M/2)}$. So that objects in the same cluster have similar features, the term $J_2$ constrains the non-linear transformation from $Z^{(0)}$ to its corresponding representation $Z^{(M/2)}$ by minimizing the clustering error in each iteration. The matrix $S \in \mathbb{R}^{d_{M/2} \times K}$ in (9) denotes the clustering centres, of which each column represents one cluster centre. The matrix $H \in \mathbb{R}^{N \times K}$ in (10) is the indicative matrix, and each of its rows is a binary label: in each row of $H$, exactly one element is 1 and the rest are 0. $H$ is not fixed; it is updated in each iteration. $K$ in (9) and (10) denotes the number of clusters. Finally, $J_3$ is a regularization term to avoid over-fitting.

$$S = \begin{bmatrix} s_{11} & \cdots & s_{1K} \\ \vdots & \ddots & \vdots \\ s_{d_{M/2} 1} & \cdots & s_{d_{M/2} K} \end{bmatrix}. \qquad (9)$$

$$H = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 1 & \cdots & 0 \end{bmatrix}. \qquad (10)$$

Our neural network model uses the input as a self-supervisor to learn low-dimensional representations and jointly constrains the non-linear transformation by minimizing the clustering error, which is expected to enhance the deep intrinsic features extracted from the source data. The learned representations are fully adaptive and favorable for the clustering process. Furthermore, our model completes the clustering in one stage without an additional pre-training process, which improves efficiency.
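To make the joint objective concrete, here is a minimal PyTorch sketch of (8). The shapes follow the notation above; $\lambda_1 = 0.3$ and $\lambda_2 = 0.0003$ are the settings reported in Section IV, and the function name itself is illustrative.

```python
import torch

def dcidc_loss(z0, z_code, z_rec, H, S, params, lam1=0.3, lam2=3e-4):
    """Joint DCIDC objective of (8).

    z0: input Z^(0), shape (N, D); z_code: codes Z^(M/2), shape (N, d);
    z_rec: reconstruction Z^(M), shape (N, D); H: (N, K) binary float
    indicator with one 1 per row; S: (d, K) cluster centres, one per
    column; params: iterable of network weights and biases for J3.
    """
    j1 = torch.sum((z0 - z_rec) ** 2)                 # reconstruction, J1
    j2 = lam1 * torch.sum((z_code - H @ S.T) ** 2)    # intra-class distance, J2
    j3 = lam2 * sum(p.pow(2).sum() for p in params)   # regularization, J3
    return j1 + j2 + j3
```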
B. Optimization Procedure

In this subsection, we mainly demonstrate how the proposed DCIDC model can be optimized efficiently via gradient descent, together with the solution procedure of $H$. As the optimization of the parameters $W$ and $b$ does not share the same mechanism with that of $H$ and $S$, we present the gradient descent and the calculation of $H$ and $S$ separately. For the convenience of developing the algorithm, we rewrite (8) in the following sample-wise form:

$$J = \frac{1}{2} \sum_{i=1}^{N} \left( \left\| z_i^{(0)} - z_i^{(M)} \right\|_2^2 + \lambda_1 \left\| z_i^{(M/2)} - h_i S^T \right\|_2^2 \right) + \lambda_2 \sum_{m=1}^{M} \left( \left\| W_m \right\|_F^2 + \left\| b_m \right\|_2^2 \right). \qquad (11)$$

According to the definitions of $z^{(i)}$ in (6) and $z^{(j)}$ in (7) and the chain rule, we can express the gradients of (11) w.r.t. $W_m$ and $b_m$ as (12) and (13), respectively:

$$\frac{\partial J}{\partial W_m} = \left( \Delta_m + \lambda_1 \Lambda_m \right) \left( z_i^{(m-1)} \right)^T + \lambda_2 W_m, \qquad (12)$$

$$\frac{\partial J}{\partial b_m} = \Delta_m + \lambda_1 \Lambda_m + \lambda_2 b_m, \qquad (13)$$

where $\Delta_m$ is defined as

$$\Delta_m = \begin{cases} -\left( z_i^{(0)} - z_i^{(M)} \right) \odot G'\left( y_i^{(M)} \right) & m = M \\ \left( W_{m+1} \right)^T \Delta_{m+1} \odot G'\left( y_i^{(m)} \right) & \text{otherwise} \end{cases}, \qquad (14)$$

and $\Lambda_m$ is given as

$$\Lambda_m = \begin{cases} \left( W_{m+1} \right)^T \Lambda_{m+1} \odot G'\left( y_i^{(m)} \right) & m = 1, \cdots, \frac{M}{2} - 1 \\ \left( z_i^{(M/2)} - h_i S^T \right) \odot G'\left( y_i^{(M/2)} \right) & m = \frac{M}{2} \\ 0 & m = \frac{M}{2}+1, \cdots, M \end{cases}, \qquad (15)$$

where $\odot$ denotes element-wise multiplication, $y_i^{(m)} = W_m z_i^{(m-1)} + b_m$, and $G'(\cdot)$ is the derivative of the activation function $G(\cdot)$ defined as

$$G(\cdot) = \begin{cases} F_e(\cdot) & m = 1, \cdots, \frac{M}{2} \\ F_d(\cdot) & m = \frac{M}{2}+1, \cdots, M \end{cases}. \qquad (16)$$

Using the gradient descent algorithm, we update $\{ W_m, b_m \}_{m=1}^{M}$ as in (17) and (18) until convergence:

$$W_m \leftarrow W_m - \mu \frac{\partial J}{\partial W_m}, \qquad (17)$$

$$b_m \leftarrow b_m - \mu \frac{\partial J}{\partial b_m}, \qquad (18)$$

where $\mu > 0$ is the learning rate, which is typically set to a small value according to the specific scenario.

As the output of the DCIDC model, the indicative matrix $H$ is calculated via the least distance rule in each clustering step. In other words, $H$ is updated by solving (19) in every iteration:

$$\min_{H, S} \left\| Z^{(M/2)} - H S^T \right\|_F^2. \qquad (19)$$

Thus, once the optimization of the DCIDC model is accomplished, the final $H$ is simultaneously obtained. We now demonstrate the solution of (19). For the task of clustering, the matrix $Z^{(M/2)}$ can be redefined as

$$Z^{(M/2)} = \left\{ Z_i^{(M/2)} \right\}_{i=1}^{K}, \qquad (20)$$

where $Z_i^{(M/2)}$ is a component of $Z^{(M/2)}$ denoting one cluster of the data, with $\bigcup_{i=1}^{K} Z_i^{(M/2)} = Z^{(M/2)}$ and $Z_i^{(M/2)} \cap Z_j^{(M/2)} = \varnothing$ for $i \neq j \in \{1, 2, \cdots, K\}$. Then, for the convenience of solving (19), it can be transformed into

$$\min_{H, S} E_S = \sum_{i=1}^{K} \sum_{z^{(M/2)} \in Z_i^{(M/2)}} \left\| s_i^T - z^{(M/2)} \right\|_2^2, \qquad (21)$$

where $s_i$ is a column vector of $S$, which denotes the centre of $Z_i^{(M/2)}$.
Setting

$$\frac{\partial E_S}{\partial S} = \frac{\partial}{\partial S} \sum_{i=1}^{K} \sum_{z^{(M/2)} \in Z_i^{(M/2)}} \left\| s_i^T - z^{(M/2)} \right\|_2^2 = \sum_{i=1}^{K} \sum_{z^{(M/2)} \in Z_i^{(M/2)}} \frac{\partial}{\partial S} \left\| s_i^T - z^{(M/2)} \right\|_2^2 = 0, \qquad (22)$$

we get

$$s_i^T = \frac{1}{n_i} \sum_{z^{(M/2)} \in Z_i^{(M/2)}} z^{(M/2)}, \qquad (23)$$

where $n_i$ is the number of pixels in the cluster $Z_i^{(M/2)}$. Similarly, for the purpose of calculating $H$, (19) can be rewritten as

$$\min_{H, S} E_H = \sum_{i=1}^{K} \sum_{z^{(M/2)} \in Z_i^{(M/2)}} \left\| \left( z^{(M/2)} \right)^T - S h^T \right\|_2^2, \qquad (24)$$

where $h$ is the indicator corresponding to $z^{(M/2)}$, which is serialized as a row vector of $H$. Setting

$$\frac{\partial}{\partial h^T} \left\| \left( z^{(M/2)} \right)^T - S h^T \right\|_2^2 = \frac{\partial}{\partial h^T} \left[ S h^T - \left( z^{(M/2)} \right)^T \right]^T \left[ S h^T - \left( z^{(M/2)} \right)^T \right] = 2 S^T S h^T - 2 S^T \left( z^{(M/2)} \right)^T = 0, \qquad (25)$$

we get

$$h^T = \left( S^T S \right)^{-1} S^T \left( z^{(M/2)} \right)^T. \qquad (26)$$

Thus, we obtain

$$H = \left[ T(h_1), T(h_2), \cdots, T(h_N) \right]^T, \qquad (27)$$

where the binarization $T(\cdot)$ acts component-wise on $h = \{ h_i \}_{i=1}^{K}$:

$$T(h)_i = \begin{cases} 1 & h_i = \max\left( \{ h_j \}_{j=1}^{K} \right) \\ 0 & \text{otherwise} \end{cases}. \qquad (28)$$

So far, the detailed procedure for optimizing the proposed DCIDC model can be summarized as Algorithm 1.
Algorithm 1: Deep Clustering with Intra-class Distance Constraint

Input: the data matrix $X_{in}$ (i.e., $Z^{(0)}$) and the number of clusters $K$.
Output: the indicative matrix $H$.

Initialize a matrix $H$ according to (10) and the given $K$.
for $i = 1$ to $M$ do
  Initialize $W_i$ and $b_i$.
end for
while not converged do
  for $i = 1$ to $M/2$ do
    $z^{(i)} \leftarrow F_e\left( W_i z^{(i-1)} + b_i \right)$
  end for
  for $j = M/2 + 1$ to $M$ do
    $z^{(j)} \leftarrow F_d\left( W_j z^{(j-1)} + b_j \right)$
  end for
  Calculate $J_1$ in (8).
  Calculate $s_i$ by (23) and set $S \leftarrow [s_1, s_2, \cdots, s_K]$.
  Calculate $J_2$ in (8).
  Calculate $J_3$ in (8).
  Update $H$ by (27).
  for $m = 1$ to $M$ do
    Update $W_m$ by (17) and $b_m$ by (18).
  end for
end while
return $H$
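Inside Algorithm 1, the updates of $S$ and $H$ reduce to the closed forms (23) and (26)–(28). The PyTorch sketch below is one way to realize them, assuming codes stored with rows as samples and a centre matrix $S$ with full column rank; it is an illustration rather than the authors' implementation.

```python
import torch

def update_S(z_code, H):
    """(23): each cluster centre is the mean of the codes assigned to it."""
    counts = H.sum(dim=0).clamp(min=1.0)       # n_i, number of pixels per cluster
    return (z_code.T @ H) / counts             # S of shape (d, K)

def update_H(z_code, S):
    """(26)-(28): least-squares indicators, binarized row-wise by T(h)."""
    # h^T = (S^T S)^{-1} S^T z^T for all codes at once; assumes S^T S invertible
    H_soft = torch.linalg.solve(S.T @ S, S.T @ z_code.T).T   # shape (N, K)
    H = torch.zeros_like(H_soft)
    H[torch.arange(H_soft.shape[0]), H_soft.argmax(dim=1)] = 1.0
    return H
```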
IV. EXPERIMENTAL RESULTS

In this section, we compare the proposed DCIDC approach with popular clustering methods on four image datasets in terms of two evaluation metrics, and we discuss the influence of different values of the coefficient $\lambda_1$ and of different activation functions on DCIDC.

A. Experimental Settings

1) Datasets:
We carry out our experiments using four hyperspectral image datasets: Indian Pines, Pavia, Salinas, and Salinas-A. The Indian Pines dataset was gathered by the 224-band AVIRIS sensor over the Indian Pines test site in North-western Indiana. It consists of 145 × 145 pixels and 224 spectral reflectance bands in the wavelength range of $0.4 \sim 2.5 \times 10^{-6}$ meters. The number of bands of the Indian Pines dataset used in our experiment is reduced to 200 by removing bands covering the region of water absorption. The ROSIS sensor acquired the Pavia dataset during a flight campaign over Pavia, northern Italy; the Pavia image used in our experiment has 100 spectral bands. The AVIRIS sensor collected the Salinas dataset over Salinas Valley, California, which is characterized by a high spatial resolution. The area covered comprises 512 lines by 217 samples. As with the Indian Pines scene, 24 water absorption bands are discarded. A small sub-scene of the Salinas image, denoted Salinas-A, is adopted as well.
2) Evaluation Criteria:
We adopt two metrics to evaluate the clustering quality: accuracy and normalized mutual information (NMI). Higher values of these metrics indicate better performance. For each dataset, we repeat each algorithm five times and report the means and the standard deviations of these metrics.
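The paper does not spell out how the accuracy is computed; a common convention, assumed in the sketch below, is to match predicted cluster labels to ground-truth classes with the Hungarian algorithm, while NMI comes from a standard library.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one cluster-to-class mapping."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                          # co-occurrence counts
    rows, cols = linear_sum_assignment(-cost)    # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])            # toy ground-truth labels
y_pred = np.array([1, 1, 0, 0, 2, 2])            # toy cluster labels
acc = clustering_accuracy(y_true, y_pred)        # 1.0 after optimal matching
nmi = normalized_mutual_info_score(y_true, y_pred)
```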
3) Baseline Algorithms:
For the sake of fairness, we compare DCIDC with clustering algorithms that carried out experiments on the same four datasets: Indian Pines, Pavia, Salinas, and Salinas-A. These algorithms are the deep subspace clustering with sparsity prior (PARTY), the auto-encoder based subspace clustering (AESSC), the sparse subspace clustering (SSC), the latent space sparse subspace clustering (LS3C), the low-rank representation based clustering (LRR), the low-rank based subspace clustering (LRSC), and the smooth representation clustering (SMR). Among these methods, PARTY and AESSC are deep clustering methods.

In the experiments on the datasets Indian Pines, Salinas, and Salinas-A, the proposed DCIDC is designed as a seven-layer neural network; for the experiment on the dataset Pavia, a seven-layer network with different layer widths is used. For fair comparisons, we report the best results of all the evaluated methods achieved with their optimal parameters. For our DCIDC with the trade-off parameters $\lambda_1$ and $\lambda_2$, we fix $\lambda_2 = 0.0003$ for all the datasets and experimentally choose $\lambda_1$.

B. Comparison With The Evaluated Methods
In this subsection, we evaluate the performance of DCIDC on the four datasets by comparing it with the baseline algorithms. In Tab. I and Tab. II, the bold names denote that the corresponding approaches are state-of-the-art deep models, and the bold scores mark the best results in the tables. Tab. I and Tab. II quantitatively describe the clustering accuracies and NMIs of DCIDC on the datasets Indian Pines, Salinas, Salinas-A and Pavia. The experimental results show that the proposed DCIDC achieves the best performance: its accuracies are higher than those of all the compared methods on the Indian Pines, Salinas, Salinas-A and Pavia datasets. Fig. 3 – 6 qualitatively demonstrate the experimental results of the proposed DCIDC algorithm compared with the other algorithms in a visual way.

Fig. 3. The performance comparison of different algorithms on the dataset Indian Pines. (a) The Indian Pines image, (b) ground truth, (c) DCIDC, (d) PARTY, (e) AESSC, (f) SSC, (g) LS3C, (h) LRR, (i) LRSC, (j) SMR.

Fig. 4. The performance comparison of different algorithms on the dataset Salinas. (a) The Salinas image, (b) ground truth, (c) DCIDC, (d) PARTY, (e) AESSC, (f) SSC, (g) LS3C, (h) LRR, (i) LRSC, (j) SMR.

In most cases, these results demonstrate that deep clustering methods perform much better than the shallow ones, benefiting from non-linear transformation and deep feature representation learning.
TABLE I
PERFORMANCE COMPARISON ON INDIAN PINES AND SALINAS

            Indian Pines           Salinas
Method     Accuracy    NMI       Accuracy    NMI
DCIDC      89.22       93.78     90.56       93.42
PARTY
AESSC
TABLE II
PERFORMANCE COMPARISON ON SALINAS-A AND PAVIA

            Salinas-A              Pavia
Method     Accuracy    NMI       Accuracy    NMI
DCIDC      94.02       98.43     89.79       92.79
PARTY
AESSC
However, the experimental results of the PARTY and AESSC methods do not outperform the other shallow model-based algorithms. This may be because these auto-encoder based clustering methods mainly consider the reconstruction of the input data, while the representation of the data lacks constraints from prior knowledge. Our proposed DCIDC approach solves this issue by embedding the intra-class distance into the feature mapping process as a constraint, which achieves higher accuracy. Fig. 7 shows the variation of accuracy and NMI in DCIDC as the iteration number increases on the four datasets. It can be found that the performance improves very fast in the first ten iterations, which implies that our method is effective and efficient. After dozens of iterations, both the accuracy and the NMI become stable, which shows that the proposed DCIDC approach is convergent.

Fig. 5. The performance comparison of different algorithms on the dataset Salinas-A. (a) The Salinas-A image, (b) ground truth, (c) DCIDC, (d) PARTY, (e) AESSC, (f) SSC, (g) LS3C, (h) LRR, (i) LRSC, (j) SMR.

Fig. 6. The performance comparison of different algorithms on the dataset Pavia. (a) The Pavia image, (b) ground truth, (c) DCIDC, (d) PARTY, (e) AESSC, (f) SSC, (g) LS3C, (h) LRR, (i) LRSC, (j) SMR.
Fig. 7. The variation of accuracy and NMI with the iteration number in DCIDC. (a) Indian Pines, (b) Salinas, (c) Salinas-A, (d) Pavia.
C. Influence of Tradeoff Coefficient
The coefficient $\lambda_2$ is designed to prevent over-fitting, and its impact on DCIDC is balanced in different scenarios, so we set it to a fixed value of 0.0003. For the trade-off coefficient $\lambda_1$, we investigate the variation of accuracy and NMI with different values of $\lambda_1$ to obtain its optimal value. The variations of accuracy and NMI with different values of $\lambda_1$ are shown in Fig. 8. It can be found that, given different $\lambda_1$ values, the model achieves different clustering accuracies and NMIs. In most cases, the optimal value of $\lambda_1$ is near 0.3.

D. Influence of Activation Functions
In this subsection, we report the performance of DCIDC with four different activation functions on the four datasets. The used activation functions include
Tanh, Sigmoid, Nssigmoid, and Softplus.
Fig. 8. The variation of accuracy and NMI with $\lambda_1$ in DCIDC. (a) Accuracy, (b) NMI.
Tanh function outperforms the other three activation functions in theexperiments and the
Nssigmoid function achieves the secondbest result which is very close to the best one.
ACC NMI (a)
Salinas
ACC NMI (b)
Salinas-A
ACC NMI (c)
ACC NMI (d)Fig. 9. The performance of DCIDC With four different activation functions.(a) Indian Pines, (b) Salinas, (c) Salinas-A, (d) Pavia.
V. CONCLUSIONS
In this paper, we propose an intra-class distance constrained deep clustering approach, which constrains the feature mapping procedure by the intra-class distance. Compared with current deep clustering approaches, the feature extraction procedure of DCIDC is more purposeful, making the learned features more conducive to clustering. The proposed approach jointly optimizes the parameters of the network and the procedure of the clustering without an additional pre-training process, which is more efficient. We conducted comparative experiments on four different hyperspectral datasets, and the accuracies of DCIDC are consistently higher than those of the compared methods on the datasets Indian Pines, Salinas, Salinas-A, and Pavia. The experimental results demonstrate that our approach remarkably outperforms the state-of-the-art clustering methods in terms of accuracy and NMI. In the future, adaptive deep clustering methods will be further explored based on this work.

REFERENCES
[1] H. Li, S. Zhang, X. Ding, C. Zhang, and P. Dale, "Performance Evaluation of Cluster Validity Indices (CVIs) on Multi/Hyperspectral Remote Sensing Datasets," Remote Sensing, vol. 8, no. 4, pp. 295–317, 2016.
[2] A. Romero, C. Gatta, and G. Camps-Valls, "Unsupervised deep feature extraction for remote sensing image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 3, pp. 1349–1362, 2016.
[3] N. Yokoya, C. Grohnfeldt, and J. Chanussot, "Hyperspectral and multispectral data fusion: A comparative review of the recent literature," IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 2, pp. 29–56, 2017.
[4] J. Guo and H. Huo, "An Enhanced IT2FCM* Algorithm Integrating Spectral Indices and Spatial Information for Multi-Spectral Remote Sensing Image Clustering," Remote Sensing, vol. 9, no. 9, pp. 960–981, 2017.
[5] H. Zhai, H. Zhang, X. Xu, L. Zhang, and P. Li, "Kernel sparse subspace clustering with a spatial max pooling operation for hyperspectral remote sensing data interpretation," Remote Sensing, vol. 9, no. 4, pp. 335–350, 2017.
[6] T. Jiang, D. Hu, and X. Yu, "Enhanced IT2FCM algorithm using object-based triangular fuzzy set modeling for remote-sensing clustering," Computers & Geosciences, vol. 118, pp. 14–26, 2018.
[7] H. Xie, A. Zhao, S. Huang, J. Han, S. Liu, X. Xu, X. Luo, H. Pan, Q. Du, and X. Tong, "Unsupervised Hyperspectral Remote Sensing Image Clustering Based on Adaptive Density," IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 4, pp. 632–636, 2018.
[8] P. Ghamisi, B. Rasti, N. Yokoya, Q. Wang, B. Hofle, L. Bruzzone, F. Bovolo, M. Chi, K. Anders, R. Gloaguen et al., "Multisource and multitemporal data fusion in remote sensing: A comprehensive review of the state of the art," IEEE Geoscience and Remote Sensing Magazine, vol. 7, no. 1, pp. 6–39, 2019.
[9] S. Liu, L. Bruzzone, F. Bovolo, and P. Du, "Unsupervised multitemporal spectral unmixing for detecting multiple changes in hyperspectral images," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 5, pp. 2733–2748, 2016.
[10] L. Yu, J. Xie, S. Chen, and L. Zhu, "Generating labeled samples for hyperspectral image classification using correlation of spectral bands," Frontiers of Computer Science, vol. 10, no. 2, pp. 292–301, 2016.
[11] L. Zhang, L. Zhang, and B. Du, "Deep learning for remote sensing data: A technical tutorial on the state of the art," IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
[12] B. Zhao, Y. Zhong, A. Ma, and L. Zhang, "A spatial Gaussian mixture model for optical remote sensing image clustering," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 12, pp. 5748–5759, 2016.
[13] H. Zhang, H. Zhai, L. Zhang, and P. Li, "Spectral-spatial sparse subspace clustering for hyperspectral remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 6, pp. 3672–3684, 2016.
[14] K. V. Kale, M. M. Solankar, D. B. Nalawade, R. K. Dhumal, and H. R. Gite, "A research review on hyperspectral data processing and analysis algorithms," Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, vol. 87, no. 4, pp. 541–555, 2017.
[15] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu, "An augmented linear mixing model to address spectral variability for hyperspectral unmixing," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1923–1938, 2019.
[16] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[17] A. K. Alok, S. Saha, and A. Ekbal, "Multi-objective semi-supervised clustering for automatic pixel classification from remote sensing imagery," Soft Computing, vol. 20, no. 12, pp. 4733–4751, 2016.
[18] V. M. Patel and R. Vidal, "Kernel sparse subspace clustering," in 2014 IEEE International Conference on Image Processing, 2014, pp. 2849–2853.
[19] M. Yin, Y. Guo, J. Gao, Z. He, and S. Xie, "Kernel sparse subspace clustering on symmetric positive definite manifolds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5157–5164.
[20] R. Zhang, F. Nie, and X. Li, "Self-weighted supervised discriminative feature selection," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3913–3918, 2018.
[21] X. Wei, Learning Image and Video Representations Based on Sparsity Priors. Aachen, Germany: Shaker Verlag GmbH, 2017.
[22] W. Yang, K. Hou, B. Liu, F. Yu, and L. Lin, "Two-stage clustering technique based on the neighboring union histogram for hyperspectral remote sensing images," IEEE Access, vol. 5, pp. 5640–5647, 2017.
[23] S. Hemalatha and S. M. Anouncia, "Unsupervised segmentation of remote sensing images using FD based texture analysis model and ISODATA," International Journal of Ambient Computing and Intelligence, vol. 8, no. 3, pp. 58–75, 2017.
[24] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[25] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems, 2002, pp. 849–856.
[26] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[27] H. Zhai, H. Zhang, L. Zhang, P. Li, and A. Plaza, "A New Sparse Subspace Clustering Algorithm for Hyperspectral Remote Sensing Imagery," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 1, pp. 43–47, 2017.
[28] X. Wei, H. Shen, and M. Kleinsteuber, "Trace quotient meets sparsity: A method for learning low dimensional image representations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5268–5277.
[29] P. Ji, M. Salzmann, and H. Li, "Efficient dense subspace clustering," in 2014 IEEE Winter Conference on Applications of Computer Vision, 2014, pp. 461–468.
[30] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
[31] E. Elhamifar and R. Vidal, "Sparse subspace clustering: Algorithm, theory, and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
[32] K.-C. Lee, J. Ho, and D. J. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 5, pp. 684–698, 2005.
[33] V. M. Patel, H. Van Nguyen, and R. Vidal, "Latent space sparse subspace clustering," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 225–232.
[34] H. Zhang, Z. Lin, C. Zhang, and J. Gao, "Robust latent low rank representation for subspace clustering," Neurocomputing, vol. 145, pp. 369–373, 2014.
[35] X. Wei, H. Shen, and M. Kleinsteuber, "Trace quotient with sparsity priors for learning low dimensional image representations," arXiv preprint arXiv:1810.03523, 2018.
[36] A. Dundar, J. Jin, and E. Culurciello, "Convolutional clustering for unsupervised learning," arXiv preprint arXiv:1511.06241, 2015.
[37] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in International Conference on Machine Learning, 2016, pp. 478–487.
[38] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan, "Deep adaptive image clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5879–5887.
[39] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid, "Deep subspace clustering networks," in Advances in Neural Information Processing Systems, 2017, pp. 24–33.
[40] M. T. Law, R. Urtasun, and R. S. Zemel, "Deep spectral clustering learning," in International Conference on Machine Learning, 2017, pp. 1985–1994.
[41] C. Song, F. Liu, Y. Huang, L. Wang, and T. Tan, "Auto-encoder based data clustering," in Iberoamerican Congress on Pattern Recognition. Springer, 2013, pp. 117–124.
[42] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan, "Deep unsupervised clustering with gaussian mixture variational autoencoders," arXiv preprint arXiv:1611.02648, 2016.
[43] X. Wei, H. Shen, Y. Li, X. Tang, F. Wang, M. Kleinsteuber, and Y. L. Murphey, "Reconstructible nonlinear dimensionality reduction via joint dictionary learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 1, pp. 175–189, 2019.
[44] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[45] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 31–35.
[46] J. Liang, J. Yang, H.-Y. Lee, K. Wang, and M.-H. Yang, "Sub-GAN: An Unsupervised Generative Model via Subspaces," in Proceedings of the European Conference on Computer Vision, 2018, pp. 698–714.
[47] P. Zhou, Y. Hou, and J. Feng, "Deep Adversarial Subspace Clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1596–1604.
[48] W. Wei, B. Xi, and M. Kantarcioglu, "Adversarial Clustering: A Grid Based Clustering Algorithm Against Active Adversaries," arXiv preprint arXiv:1804.04780, 2018.
[49] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, and J. Long, "A survey of clustering with deep learning: From the perspective of network architecture," IEEE Access, vol. 6, pp. 39501–39514, 2018.
[50] W. Lin, J. Chen, C. D. Castillo, and R. Chellappa, "Deep Density Clustering of Unconstrained Faces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8128–8137.
[51] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[52] F. Zhang, B. Du, and L. Zhang, "Saliency-guided unsupervised feature learning for scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2175–2184, 2015.
[53] Y. Li, X. Huang, and H. Liu, "Unsupervised deep feature learning for urban village detection from high-resolution remote sensing images," Photogrammetric Engineering & Remote Sensing, vol. 83, no. 8, pp. 567–579, 2017.
[54] D. T. Grozdic and S. T. Jovicic, "Whispered speech recognition using deep denoising autoencoder and inverse filtering," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 12, pp. 2313–2322, 2017.
[55] Y. Dai and G. Wang, "Analyzing tongue images using a conceptual alignment deep autoencoder," IEEE Access, vol. 6, pp. 5962–5972, 2018.
[56] D. Park, Y. Hoshi, and C. C. Kemp, "A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1544–1551, 2018.
[57] J. Yu, C. Hong, Y. Rui, and D. Tao, "Multitask autoencoder model for recovering human poses," IEEE Transactions on Industrial Electronics, vol. 65, no. 6, pp. 5060–5068, 2018.
[58] M. Ma, C. Sun, and X. Chen, "Deep coupling autoencoder for fault diagnosis with multimodal sensory data," IEEE Transactions on Industrial Informatics, vol. 14, no. 3, pp. 1137–1145, 2018.
[59] X. Peng, S. Xiao, J. Feng, W. Yau, and Z. Yi, "Deep Subspace Clustering with Sparsity Prior," in International Joint Conference on Artificial Intelligence, 2016, pp. 1925–1931.
[60] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
[61] X. Peng, Z. Yi, and H. Tang, "Robust Subspace Clustering via Thresholding Ridge Regression," in AAAI Conference on Artificial Intelligence, 2015, pp. 3827–3833.
[62] Y. Chen, L. Zhang, and Z. Yi, "Subspace clustering using a low-rank constrained autoencoder,"