Self-Taught Convolutional Neural Networks for Short Text Clustering
Jiaming Xu^a, Bo Xu^{a,∗}, Peng Wang^a, Suncong Zheng^a, Guanhua Tian^a, Jun Zhao^{a,b}, Bo Xu^{a,c}

^a Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, P.R. China
^b National Laboratory of Pattern Recognition (NLPR), Beijing, P.R. China
^c Center for Excellence in Brain Science and Intelligence Technology, CAS, P.R. China
Abstract
Short text clustering is a challenging problem due to the sparseness of text representation. Here we propose a flexible Self-Taught Convolutional neural network framework for Short Text Clustering (dubbed STC²), which can flexibly and successfully incorporate more useful semantic features and learn non-biased deep text representations in an unsupervised manner. In our framework, the original raw text features are first embedded into compact binary codes by using an existing unsupervised dimensionality reduction method. Then, word embeddings are explored and fed into convolutional neural networks to learn deep feature representations, while the output units are used to fit the pre-trained binary codes during training. Finally, we obtain the optimal clusters by employing K-means to cluster the learned representations. Extensive experimental results demonstrate that the proposed framework is effective and flexible, and outperforms several popular clustering methods when tested on three public short text datasets.

Keywords:
Semantic Clustering, Neural Networks, Short Text, Unsupervised Learning

∗ Corresponding author. Email address: [email protected]
Preprint submitted to Neural Networks, January 3, 2017

1. Introduction

Short text clustering is of great importance due to its various applications, such as user profiling [1] and recommendation [2], as social media datasets grow day by day. However, short text clustering suffers from the data sparsity problem: most words occur only once in each short text [3]. As a result, the Term Frequency-Inverse Document Frequency (TF-IDF) measure cannot work well in the short text setting. In order to address this problem, some researchers work on expanding and enriching the context of the data from Wikipedia [4] or an ontology [5]. However, these methods involve solid Natural Language Processing (NLP) knowledge and still use high-dimensional representations, which may result in a waste of both memory and computation time. Another way to overcome these issues is to explore sophisticated models for clustering short texts. For example, Yin and Wang [6] proposed a Dirichlet multinomial mixture model-based approach for short text clustering. Yet how to design an effective model is an open question, and most of these methods, directly trained on Bag-of-Words (BoW) representations, are shallow structures which cannot preserve accurate semantic similarities.

Recently, with the help of word embeddings, neural networks have demonstrated great performance in constructing text representations, such as the Recursive Neural Network (RecNN) [7, 8] and the Recurrent Neural Network (RNN) [9]. However, RecNN exhibits high time complexity to construct the textual tree, and RNN, which uses the hidden layer computed at the last word to represent the text, is a biased model where later words are more dominant than earlier words [10]. In a non-biased model, by contrast, the learned representation of a text is extracted from all the words in the text with non-dominant learned weights. More recently, the Convolutional Neural Network (CNN), as the most popular non-biased model, applying convolutional filters to capture local features, has achieved better performance in many NLP applications, such as sentence modeling [11], relation classification [12], and other traditional NLP tasks [13]. Most of the previous works focus CNN on supervised NLP tasks, while in this paper we aim to explore the power of CNN on an unsupervised NLP task, short text clustering.

We systematically introduce a simple yet surprisingly powerful Self-Taught Convolutional neural network framework for Short Text Clustering, called STC². An overall architecture of our proposed approach is illustrated in Figure 1. Inspired by [14, 15], we utilize a self-taught learning framework for our task. In particular, the original raw text features are first embedded into compact binary codes B with the help of a traditional unsupervised dimensionality reduction function. Then the text matrix S projected from word embeddings is fed into the CNN model to learn the deep feature representation h, and the output units are used to fit the pre-trained binary codes B. After obtaining the learned features, the K-means algorithm is employed on them to cluster the texts into clusters C. We call our approach "self-taught" because the CNN model is learnt from the pseudo labels generated in the previous stage, which is quite different from the term "self-taught" in [16].
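Read procedurally, these three stages can be sketched as follows (a minimal, hypothetical Python sketch; the `cnn` object with `fit`/`encode` methods and the reduction function `f_dr` are stand-ins for the components detailed in Section 3, not the authors' implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def stc2_pipeline(X_bow, S_texts, f_dr, n_clusters, cnn):
    """Sketch of the STC^2 stages (not the authors' code).

    X_bow   : (n, d) raw BoW features
    S_texts : word-embedding matrices of the n texts
    f_dr    : an unsupervised dimensionality reduction function, e.g. LE or LPI
    cnn     : a hypothetical CNN object with fit/encode methods
    """
    # Stage 1: reduce raw features and binarize them with the median (Section 3.3)
    Y = f_dr(X_bow)                                        # (n, q) latent representation
    B = (Y > np.median(Y, axis=0)).astype(np.float32)      # pseudo binary codes

    # Stage 2: train the CNN to fit the binary codes, then take deep features h
    cnn.fit(S_texts, B)
    h = cnn.encode(S_texts)                                # (n, r) deep representation

    # Stage 3: cluster the learned representations
    return KMeans(n_clusters=n_clusters).fit_predict(h)
```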
Our main contributions can be summarized as follows:

• We propose a flexible short text clustering framework which explores the feasibility and effectiveness of combining CNN and traditional unsupervised dimensionality reduction methods.

• Non-biased deep feature representations can be learned through our self-taught CNN framework, which does not use any external tags/labels or complicated NLP pre-processing.

• We conduct extensive experiments on three short text datasets. The experimental results demonstrate that the proposed method achieves excellent performance in terms of both accuracy and normalized mutual information.

Figure 1: The architecture of our proposed STC² framework for short text clustering. Solid and hollow arrows represent forward and backward propagation directions of features and gradients respectively. The STC² framework consists of a deep convolutional neural network (CNN), an unsupervised dimensionality reduction function, and a K-means module on the deep feature representation from the top hidden layers of the CNN.

This work is an extension of our conference paper [17], and they differ in the following aspects. First, we put forward a general self-taught CNN framework in this paper which can flexibly couple various semantic features, whereas the conference version can be seen as a specific example of this work. Second, in this paper we use a new short text dataset, Biomedical, in the experiments to verify the effectiveness of our approach. Third, we put much effort into studying the influence of the various semantic features integrated in our self-taught CNN framework, which is not involved in the conference paper.

For the purpose of reproducibility, we make the datasets and software used in our experiments publicly available at https://github.com/jacoxu/STC2.

The remainder of this paper is organized as follows: In Section 2, we briefly survey several related works. In Section 3, we describe the proposed approach STC² and implementation details. Experimental results and analyses are presented in Section 4. Finally, conclusions are given in the last section.

2. Related Work

In this section, we review the related work from the following two perspectives: short text clustering and deep neural networks.
2.1. Short Text Clustering

There have been several studies that attempted to overcome the sparseness of short text representation. One way is to expand and enrich the context of the data. For example, Banerjee et al. [4] proposed a method of improving the accuracy of short text clustering by enriching the representation with additional features from Wikipedia, and Fodeh et al. [5] incorporate semantic knowledge from an ontology into text clustering. However, these works need solid NLP knowledge and still use high-dimensional representations, which may result in a waste of both memory and computation time. Another direction is to map the original features into a reduced space, such as Latent Semantic Analysis (LSA) [18], Laplacian Eigenmaps (LE) [19], and Locality Preserving Indexing (LPI) [20]. Some researchers have also explored sophisticated models to cluster short texts. For example, Yin and Wang [6] proposed a Dirichlet multinomial mixture model-based approach for short text clustering. Moreover, some studies combine both of the above directions. For example, Tang et al. [21] proposed a novel framework which enriches the text features by employing machine translation and simultaneously reduces the original features through matrix factorization techniques.

Although the above clustering methods can alleviate the sparseness of short text representation to some extent, most of them ignore word order in the text and belong to shallow structures which cannot fully capture accurate semantic similarities.
2.2. Deep Neural Networks

Recently, there has been a revival of interest in DNNs, and many researchers have concentrated on using deep learning to learn features. Hinton and Salakhutdinov [22] use a DAE to learn text representations. During the fine-tuning procedure, they use backpropagation to find codes that are good at reconstructing the word-count vector.

More recently, researchers have proposed to use an external corpus to learn a distributed representation for each word, called a word embedding [23], to improve DNN performance on NLP tasks. The Skip-gram and continuous bag-of-words models of Word2vec [24] propose a simple single-layer architecture based on the inner product between two word vectors, and Pennington et al. [25] introduce a new model for word representation, called GloVe, which captures the global corpus statistics.

In order to learn compact representation vectors of sentences, Le and Mikolov [26] directly extend the earlier Word2vec [24] by predicting the words in a sentence, which is named Paragraph Vector (Para2vec). Para2vec is still a shallow window-based method and needs a larger corpus to yield better performance. More neural networks utilize word embeddings to capture true meaningful syntactic and semantic regularities, such as RecNN [7, 8] and RNN [9]. However, RecNN exhibits high time complexity to construct the textual tree, and RNN, using the hidden layer computed at the last word to represent the text, is a biased model. Recently, Long Short-Term Memory (LSTM) [27] and the Gated Recurrent Unit (GRU) [28], as sophisticated recurrent hidden units of RNN, have presented their advantages in many sequence generation problems, such as machine translation [29], speech recognition [30], and text conversation [31]. In contrast, CNN is better at learning non-biased implicit features and has been successfully exploited for many supervised NLP learning tasks as described in Section 1, and various CNN-based variants have been proposed in recent works, such as the Dynamic Convolutional Neural Network (DCNN) [11], the Gated Recursive Convolutional Neural Network (grConv) [32] and the Self-Adaptive Hierarchical Sentence model (AdaSent) [33].

Recently, Visin et al. [34] attempted to replace the convolutional layer in CNN, in order to learn non-biased features for object recognition, with four RNNs, called ReNet, that sweep over lower-layer features in different directions: (1) bottom to top, (2) top to bottom, (3) left to right and (4) right to left. However, ReNet does not outperform state-of-the-art convolutional neural networks on any of the three benchmark datasets, and it is also a supervised learning model for classification. Inspired by the Skip-gram of word2vec [35, 24], the Skip-thought model [36] describes an approach for unsupervised learning of a generic, distributed sentence encoder. Similar to the Skip-gram model, the Skip-thought model trains an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded sentence, and releases an off-the-shelf encoder to extract sentence representations. Some researchers also introduce continuous Skip-gram and negative sampling to CNN for learning visual representations in an unsupervised manner [37]. This paper, from a new perspective, puts forward a general self-taught CNN framework which can flexibly couple various semantic features and achieves good performance on an unsupervised learning task, short text clustering.
3. Methodology
Assume that we are given a dataset of n training texts denoted as X = {x_i : x_i ∈ R^{d×1}}_{i=1,2,...,n}, where d is the dimensionality of the original BoW representation. Denote its tag set as T = {1, 2, ..., C} and the pre-trained word embedding set as E = {e(w_i) : e(w_i) ∈ R^{d_w×1}}_{i=1,2,...,|V|}, where d_w is the dimensionality of the word vectors and |V| is the vocabulary size. In order to learn the r-dimensional deep feature representation h from the CNN in an unsupervised manner, an unsupervised dimensionality reduction method f_dr(X) is employed to guide the learning of the CNN model. Our goal is to cluster these texts X into C clusters based on the learned deep feature representation while preserving the semantic consistency.

As depicted in Figure 1, the proposed framework consists of three components: a deep convolutional neural network (CNN), an unsupervised dimensionality reduction function, and a K-means module. In the following subsections, we first present the first two components respectively, then give the trainable parameters and the objective function used to learn the deep feature representation, and finally describe how to perform clustering on the learned features.

Figure 2: The architecture of the dynamic convolutional neural network [11]. An input text is first projected to a feature matrix by looking up word embeddings, and then goes through wide convolutional layers, folding layers and k-max pooling layers, which provide a deep feature representation before the output layer.

3.1. Deep Convolutional Neural Networks

In this section, we briefly review one popular deep convolutional neural network, the Dynamic Convolutional Neural Network (DCNN) [11], as the instance of CNN used in the following sections; as the foundation of our proposed method, it was originally proposed for a completely supervised learning task, text classification.

Taking a neural network with two convolutional layers in Figure 2 as an example, the network transforms raw input text into a powerful representation. Particularly, each raw text vector x_i is projected into a matrix representation S ∈ R^{d_w×s} by looking up the word embeddings E, where s is the length of one text. We also let ˜W = {W_i}_{i=1,2} and W_O denote the weights of the neural networks. The network defines a transformation f(·): R^{d×1} → R^{r×1} (d ≫ r) which transforms an input raw text x into an r-dimensional deep representation h. There are three basic operations, described as follows:

Wide one-dimensional convolution: A convolutional filter m ∈ R^m is applied to each individual row of the sentence matrix S ∈ R^{d_w×s}, and yields a resulting matrix C ∈ R^{d_w×(s+m−1)}, where m is the width of the convolutional filter.
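For concreteness, wide convolution of one filter over the rows of S can be written with full-mode convolution (an illustrative numpy sketch, not the authors' code):

```python
import numpy as np

def wide_conv_1d(S, m_filter):
    """Wide (full) one-dimensional convolution applied row-wise.

    S        : (d_w, s) sentence matrix of word embeddings
    m_filter : (m,) convolutional filter
    Returns a (d_w, s + m - 1) feature map, matching C in the text.
    """
    d_w, s = S.shape
    m = len(m_filter)
    C = np.zeros((d_w, s + m - 1))
    for row in range(d_w):
        # 'full' mode keeps every overlap of the filter with the row,
        # which is exactly what makes the convolution "wide"
        C[row] = np.convolve(S[row], m_filter, mode="full")
    return C
```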
Folding: In this operation, every two rows in a feature map are simply summed component-wise. For a map of d_w rows, folding returns a map of d_w/2 rows, i.e. Ĉ ∈ R^{(d_w/2)×(s+m−1)}. Note that the folding operation does not introduce any additional parameters.

Dynamic k-max pooling: Denoting the pooling parameter as k, k-max pooling selects the sub-matrix C̄ ∈ R^{(d_w/2)×k} of the k highest values in each row of the matrix Ĉ. For dynamic k-max pooling, the pooling parameter k is dynamically selected in order to allow a smooth extraction of higher-order and longer-range features [11]. Given a fixed pooling parameter k_top for the topmost convolutional layer, the parameter k of the k-max pooling in the l-th convolutional layer can be computed as follows:

k_l = max( k_top, ⌈ ((L − l) / L) · s ⌉ ),   (1)

where L is the total number of convolutional layers in the network.

3.2. Unsupervised Dimensionality Reduction

As described in Figure 1, the dimensionality reduction function is defined as follows:

Y = f_dr(X),   (2)

where Y ∈ R^{q×n} are the q-dimensional reduced latent space representations. Here, we take four popular dimensionality reduction methods as examples in our framework.

Average Embedding (AE): This method directly averages the word embeddings, weighted with TF or TF-IDF. Huang et al. [38] used this strategy as the global context in their task, and Socher et al. [8] and Lai et al. [10] used this method for text classification. The weighted average of all word vectors in one text can be computed as follows:

Y(x_i) = ( Σ_{i=1}^{k} w(w_i) · e(w_i) ) / ( Σ_{i=1}^{k} w(w_i) ),   (3)

where w(w_i) can be any weighting function that captures the importance of word w_i in the text x.

Latent Semantic Analysis (LSA): LSA [18] is the most popular global matrix factorization method, which applies a dimension-reducing linear projection, the Singular Value Decomposition (SVD), to the corresponding term/document matrix. Suppose the rank of X is r̂; LSA decomposes X into the product of three matrices:

X = U Σ V^T,   (4)

where Σ = diag(σ_1, ..., σ_r̂) and σ_1 ≥ σ_2 ≥ ... ≥ σ_r̂ are the singular values of X, U ∈ R^{d×r̂} is the set of left singular vectors and V ∈ R^{n×r̂} is the set of right singular vectors. LSA uses the top q vectors in U as the transformation matrix to embed the original text features into a q-dimensional subspace Y [18].

Laplacian Eigenmaps (LE): The top eigenvectors of the graph Laplacian, defined on the similarity matrix of texts, are used in this method, which can discover the manifold structure of the text space [19]. In order to avoid storing the dense similarity matrix, many approximation techniques have been proposed to reduce the memory usage and computational complexity of LE. There are two representative approximation methods, the sparse similarity matrix and the Nyström approximation. Following previous studies [39, 14], we select the former technique to construct the n × n local similarity matrix A by using the heat kernel as follows:

A_ij = exp( −‖x_i − x_j‖² / σ )  if x_i ∈ N_k(x_j) or vice versa; 0 otherwise,   (5)

where σ is a tuning parameter (default 1) and N_k(x) represents the set of k nearest neighbors of x. By introducing a diagonal n × n matrix D whose entries are given by D_ii = Σ_{j=1}^{n} A_ij, the graph Laplacian L can be computed as (D − A).
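The construction of A and L above can be sketched as follows (an illustrative numpy/scikit-learn sketch; the helper name and the explicit symmetrization are our assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_heat_kernel_laplacian(X, k=15, sigma=1.0):
    """Build the sparse heat-kernel similarity A (Eq. 5) and the Laplacian L = D - A.

    X : (n, d) row-wise text features; k = 15 matches the setting used in Section 4.
    """
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
    _, idx = nbrs.kneighbors(X)

    A = np.zeros((n, n))
    for i in range(n):
        for j in idx[i, 1:]:                             # skip the point itself
            w = np.exp(-np.sum((X[i] - X[j]) ** 2) / sigma)
            A[i, j] = A[j, i] = w                        # "or vice versa" keeps A symmetric

    D = np.diag(A.sum(axis=1))                           # D_ii = sum_j A_ij
    L = D - A
    return A, D, L
```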
The optimal q × n real-valued matrix Y can be obtained by solving the following objective function:

arg min_Y  tr(Y L Y^T)
subject to  Y D Y^T = I,  Y D 1 = 0,   (6)

where tr(·) is the trace function, Y D Y^T = I requires the different dimensions to be uncorrelated, and Y D 1 = 0 requires each dimension to achieve equal probability as positive or negative.

Locality Preserving Indexing (LPI): This method extends LE to deal with unseen texts by approximating the linear function Y = W^T_LPI X [14], and the subspace vectors are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the Riemannian manifold [20]. Similar to LE, we first construct the local similarity matrix A, then the graph Laplacian L can be computed as (D − A), where D_ii measures the local density around x_i and is equal to Σ_{j=1}^{n} A_ij. We then compute the eigenvectors a and eigenvalues λ of the following generalized eigen-problem:

X L X^T a = λ X D X^T a.   (7)

The mapping function W_LPI = [a_1, ..., a_q] can be obtained and applied to the unseen data [39].

All of the above methods claim better performance in capturing semantic similarity between texts in the reduced latent space representation Y than in the original representation X, while the performance of short text clustering can be further enhanced with the help of our framework, the self-taught CNN.

3.3. Learning

The last layer of the CNN is an output layer:

O = W_O h,   (8)

where h is the deep feature representation, O ∈ R^q is the output vector and W_O ∈ R^{q×r} is the weight matrix. In order to incorporate the latent semantic features Y, we first binarize the real-valued vectors Y into the binary codes B by setting the threshold to be the median vector median(Y). Then, the output vector O is used to fit the binary codes B via q logistic operations as follows:

p_i = exp(O_i) / (1 + exp(O_i)).   (9)

All parameters to be trained are defined as θ:

θ = {E, ˜W, W_O}.   (10)

Given the training text collection X and the pre-trained binary codes B, the log likelihood of the parameters can be written as follows:

J(θ) = Σ_{i=1}^{n} log p(b_i | x_i, θ).   (11)

Following the previous work [11], we train the network with mini-batches by back-propagation and perform the gradient-based optimization using the Adagrad update rule [40]. For regularization, we employ dropout with a 50% rate on the penultimate layer [11, 41].

3.4. K-means Clustering

With the given short texts, we first utilize the trained deep neural network to obtain the semantic representations h, and then employ the traditional K-means algorithm to perform clustering.
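To make the learning target concrete, the median binarization and the per-unit logistic fit of Eqs. (9)-(11) can be sketched as follows (a hypothetical numpy sketch; the CNN forward pass is abstracted behind `O`, which stands for the outputs W_O h):

```python
import numpy as np

def binarize_by_median(Y):
    """Turn the real-valued codes Y (n, q) into binary codes B using the median vector."""
    return (Y > np.median(Y, axis=0)).astype(np.float64)

def log_likelihood(O, B):
    """Log likelihood J(theta) of Eq. (11) under the q logistic units of Eq. (9).

    O : (n, q) output vectors W_O h for a mini-batch of n texts
    B : (n, q) pre-trained binary codes b_i
    """
    p = 1.0 / (1.0 + np.exp(-O))        # numerically identical to exp(O)/(1+exp(O)) in Eq. (9)
    eps = 1e-12                         # avoid log(0)
    return np.sum(B * np.log(p + eps) + (1.0 - B) * np.log(1.0 - p + eps))
```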
Dataset        | C  | Num.   | Len.     | |V|
SearchSnippets | 8  | 12,340 | 17.88/38 | 30,642
StackOverflow  | 20 | 20,000 | 8.31/34  | 22,956
Biomedical     | 20 | 20,000 | 12.88/53 | 18,888

Table 1: Statistics for the text datasets. C: the number of classes; Num.: the dataset size; Len.: the mean/max length of texts; |V|: the vocabulary size.
4. Experiments
We test our proposed approach on three public short text datasets. Thesummary statistics and semantic topics of these datasets are described in Table 1and Table 2.
4.1. Datasets

SearchSnippets. This dataset was selected from the results of web search transactions using predefined phrases of 8 different domains by Phan et al. [42] (http://jwebpro.sourceforge.net/data-web-snippets.tar.gz).

StackOverflow. We use the challenge data published on Kaggle.com. The raw dataset consists of 3,370,528 samples from July 31st, 2012 to August 14th, 2012. In our experiments, we randomly select 20,000 question titles from 20 different tags as in Table 2.

Biomedical. We use the challenge data published on BioASQ's official website (http://participants-area.bioasq.org/). In our experiments, we randomly select 20,000 paper titles from 20 different MeSH (http://en.wikipedia.org/wiki/Medical_Subject_Headings) major topics as in Table 2. As described in Table 1, the max length of the selected paper titles is 53.

Table 2: Description of semantic topics (that is, tags/labels) from the three text datasets used in our experiments.

For these datasets, we randomly select 10% of the data as the development set. Since SearchSnippets has been pre-processed by Phan et al. [42], we do not further process this dataset. In StackOverflow, texts contain lots of computer terminology, and symbols and capital letters are meaningful, thus we do not apply any pre-processing. For Biomedical, we remove the symbols and convert letters into lower case.
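As an illustration of the Biomedical normalization step just described (our own sketch; the exact symbol set removed by the authors is not specified):

```python
import re

def normalize_biomedical(title: str) -> str:
    """Lower-case a Biomedical title and strip symbols, as described above.

    The character whitelist here is an assumption; the paper does not list
    which symbols were removed.
    """
    title = title.lower()
    # e.g. strips punctuation such as '-', '(', ')' and ':' while keeping letters and digits
    return re.sub(r"[^a-z0-9\s]", " ", title)
```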
4.2. Pre-trained Word Vectors

We use the publicly available word2vec tool (https://code.google.com/p/word2vec/) to train word embeddings, and most parameters are set the same as in Mikolov et al. [24] for the Google News setting (https://groups.google.com/d/msg/word2vec-toolkit/lxbl_MB29Ic/NDLGId3KPNEJ), except that the vector dimensionality is set to 48 and the minimum count to 5. For SearchSnippets, we train word vectors on Wikipedia dumps (http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2). For StackOverflow, we train word vectors on the whole corpus of the StackOverflow dataset described above, which includes the question titles and post contents. For Biomedical, we train word vectors on all titles and abstracts of the 2014 training articles. The coverage of these learned vectors on the three datasets is listed in Table 3, and the words not present in the set of pre-trained words are initialized randomly.

Dataset        | |V|          | |T|
SearchSnippets | 23,826 (77%) | 211,575 (95%)
StackOverflow  | 19,639 (85%) | 162,998 (97%)
Biomedical     | 18,381 (97%) | 257,184 (99%)

Table 3: Coverage of word embeddings on the three datasets. |V| is the vocabulary size and |T| is the number of tokens.

4.3. Comparisons

In our experiments, some widely used text clustering methods are compared with our approach. Besides K-means, Skip-thought Vectors, Recursive Neural Network and Paragraph Vector based clustering methods, four baseline clustering methods are directly based on the popular unsupervised dimensionality reduction methods described in Section 3.2. We further compare our approach with some other non-biased neural networks, such as the bidirectional RNN. More details are listed as follows:
K-means
K-means [43] on the original keyword features, which are respectively weighted with term frequency (TF) and term frequency-inverse document frequency (TF-IDF).

Skip-thought Vectors (SkipVec)

This baseline [36] gives an off-the-shelf encoder (https://github.com/ryankiros/skip-thoughts) to produce highly generic sentence representations. The encoder is trained using a large collection of novels and provides three encoder modes: the unidirectional encoder (SkipVec (Uni)) with 2,400 dimensions, the bidirectional encoder (SkipVec (Bi)) with 2,400 dimensions, and the combined encoder (SkipVec (Combine)) with SkipVec (Uni) and SkipVec (Bi) of 2,400 dimensions each. K-means is employed on these vector representations respectively.

Recursive Neural Network (RecNN)
In [7], the tree structure is first greedily approximated via an unsupervised recursive autoencoder. Then, semi-supervised recursive autoencoders are used to capture the semantics of texts based on the predicted structure. In order to make this recursive-based method completely unsupervised, we remove the cross-entropy error in the second phase to learn vector representations, and subsequently employ K-means on the learned vectors of the top tree node and on the average of all vectors in the tree.
Paragraph Vector (Para2vec)
K-means on the fixed-size feature vectors generated by Paragraph Vector (Para2vec) [26], an unsupervised method to learn distributed representations of words and paragraphs. In our experiments, we use the open source software released by Mesnil et al. [44] (https://github.com/mesnilgr/iclr15).

Average Embedding (AE)
K-means on the weighted average vectors of the word embeddings, which are respectively weighted with TF and TF-IDF. The dimension of the average vectors is equal to, and decided by, the dimension of the word vectors used in our experiments.
Latent Semantic Analysis (LSA)
K-means on the reduced subspace vectors generated by the Singular Value Decomposition (SVD) method. The dimension of the subspace is by default set to the number of clusters; we also iterate over dimensions ranging from 10:10:200 to get the best performance, that is 10 on SearchSnippets, 20 on StackOverflow and 20 on Biomedical in our experiments.

Laplacian Eigenmaps (LE)

This baseline, using Laplacian Eigenmaps and subsequently employing the K-means algorithm, is well known as spectral clustering [45]. The dimension of the subspace is by default set to the number of clusters [19, 39]; we also iterate over dimensions ranging from 10:10:200 to get the best performance, that is 20 on SearchSnippets, 70 on StackOverflow and 30 on Biomedical in our experiments.
Locality Preserving Indexing (LPI)
This baseline, projecting the texts into a lower-dimensional semantic space, can discover both the geometric and the discriminating structures of the original feature space [39]. The dimension of the subspace is by default set to the number of clusters [39]; we also iterate over dimensions ranging from 10:10:200 to get the best performance, that is 20 on SearchSnippets, 80 on StackOverflow and 30 on Biomedical in our experiments.

Bidirectional RNN (bi-RNN)
We replace the CNN model in our framework shown in Figure 1 with bi-RNN models. In particular, LSTM and GRU units are used in the experiments. In order to generate a fixed-length document representation from the variable-length vector sequences, for both the bi-LSTM and bi-GRU based clustering methods we further utilize three pooling methods: last pooling (using the last hidden state), mean pooling, and element-wise max pooling. These pooling methods are respectively used in the previous works [46, 28], [47] and [10]. For regularization, the training gradients of all parameters are clipped by their l2 norm.

4.4. Evaluation Metrics

The clustering performance is evaluated by comparing the clustering results of texts with the tags/labels provided by the text corpus. Two metrics, the accuracy (ACC) and the normalized mutual information (NMI), are used to measure the clustering performance [39, 49]. Given a text x_i, let c_i and t_i be the obtained cluster label and the label provided by the corpus, respectively. Accuracy is defined as:

ACC = ( Σ_{i=1}^{n} δ(t_i, map(c_i)) ) / n,   (12)

where n is the total number of texts, δ(x, y) is the indicator function that equals one if x = y and zero otherwise, and map(c_i) is the permutation mapping function that maps each cluster label c_i to the equivalent label from the text data via the Hungarian algorithm [50].

Normalized mutual information [51] between the tag/label set T and the cluster set C is a popular metric used for evaluating clustering tasks. It is defined as follows:

NMI(T, C) = MI(T, C) / sqrt( H(T) H(C) ),   (13)

where MI(T, C) is the mutual information between T and C, H(·) is entropy, and the denominator sqrt(H(T) H(C)) is used for normalizing the mutual information to be in the range [0, 1].

4.5. Hyperparameter Settings

Most of the parameters are set uniformly across these datasets. Following a previous study [39], the number of nearest neighbors in Eqn. (5) is fixed to 15 when constructing the graph structures for LE and LPI. For the CNN model, the network has two convolutional layers. The widths of the convolutional filters are both 3. The value of k for the top k-max pooling in Eqn. (1) is 5. The number of feature maps at the first convolutional layer is 12, and there are 8 feature maps at the second convolutional layer. Both convolutional layers are followed by a folding layer. We further set the dimension of the word embeddings d_w to 48. Finally, the dimension of the deep feature representation r is fixed to 480. Moreover, we set the learning rate λ to 0.01 and the mini-batch training size to 200. The output size q in Eqn. (8) is set the same as the best subspace dimension of the baseline methods, as described in Section 4.3.

Since the initial centroids have a significant impact on the clustering results when utilizing the K-means algorithm, we repeat K-means multiple times with random initial centroids (specifically, 100 times for statistical significance), following Huang [49]. All subspace vectors are normalized to 1 before applying K-means, and the final results reported are the average of 5 trials with all clustering methods on the three text datasets.
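For reference, the two metrics of Section 4.4 (Eqs. (12)-(13)) can be computed as follows (an illustrative sketch using scipy and scikit-learn, not the authors' evaluation code; label arrays are assumed to be 0-indexed integers):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(labels_true, labels_pred):
    """ACC of Eq. (12): best one-to-one mapping of clusters to labels (Hungarian algorithm)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n_classes = max(labels_true.max(), labels_pred.max()) + 1
    # Contingency table: count[i, j] = number of texts in cluster i carrying gold label j
    count = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, c in zip(labels_true, labels_pred):
        count[c, t] += 1
    row_ind, col_ind = linear_sum_assignment(-count)      # negate to maximize matches
    return count[row_ind, col_ind].sum() / labels_true.size

def clustering_nmi(labels_true, labels_pred):
    """NMI of Eq. (13), normalized by sqrt(H(T) * H(C))."""
    return normalized_mutual_info_score(labels_true, labels_pred,
                                        average_method="geometric")
```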
Table 4: Comparison of ACC of our proposed methods and three clustering methods on the three datasets. For RecNN (Top), K-means is conducted on the learned vectors of the top tree node. For RecNN (Ave.), K-means is conducted on the average of all vectors in the tree. More details about the baseline settings are described in Section 4.3.

Table 5: Comparison of NMI of our proposed methods and three clustering methods on the three datasets. For RecNN (Top), K-means is conducted on the learned vectors of the top tree node. For RecNN (Ave.), K-means is conducted on the average of all vectors in the tree. More details about the baseline settings are described in Section 4.3.

4.6. Results and Analysis

In Table 4 and Table 5, we report the ACC and NMI performance of our proposed approaches and four groups of baseline methods: K-means, SkipVec, RecNN and Para2vec based clustering methods. Intuitively, we make the general observations that (1) BoW based approaches, including K-means (TF) and K-means (TF-IDF), and SkipVec based approaches do not perform well; (2) RecNN based approaches, both RecNN (Ave.) and RecNN (Top+Ave.), do better; (3) Para2vec achieves performance comparable with most baselines; and (4) the evaluation clearly demonstrates the superiority of our proposed methods STC². These are expected results. For the SkipVec based approaches, the off-the-shelf encoders are trained on the BookCorpus datasets [52] and then applied to our datasets to extract the sentence representations. The SkipVec encoders can produce generic sentence representations but may not perform well on specific datasets: in our experiments, the StackOverflow and Biomedical datasets consist of many computer terms and medical terms, such as "ASP.NET" and "XML", which makes it difficult for the off-the-shelf encoders to represent sentence-level semantics. We also make another observation: although our proposed STC²-LE and STC²-LPI outperform both the BoW based and the RecNN based approaches across all three datasets, STC²-AE and STC²-LSA only exhibit performance similar to RecNN (Ave.) and RecNN (Top+Ave.) on the StackOverflow and Biomedical datasets.

Table 6: Comparison of ACC of our proposed methods and some other non-biased models on the three datasets. For LPI, we project the text under the best dimension as described in Section 4.3. For both bi-LSTM and bi-GRU based clustering methods, the binary codes generated from LPI are used to guide the learning of the bi-LSTM/bi-GRU models.

Table 7: Comparison of NMI of our proposed methods and some other non-biased models on the three datasets. For LPI, we project the text under the best dimension as described in Section 4.3. For both bi-LSTM and bi-GRU based clustering methods, the binary codes generated from LPI are used to guide the learning of the bi-LSTM/bi-GRU models.

We further replace the CNN model in our framework shown in Figure 1 with some other non-biased models, such as bi-LSTM and bi-GRU, and report the results in Table 6 and Table 7. As an instance, the binary codes generated from LPI are used to guide the learning of the bi-LSTM/bi-GRU models. From the results, we can see that the bi-GRU and bi-LSTM based clustering methods do equally well, with no clear winner, and both achieve great enhancements compared with LPI (best). Compared with these bi-LSTM/bi-GRU based models, the evaluation results still demonstrate the superiority of our approach, the CNN based clustering model, in most cases. As reported by Visin et al. [34], although bi-directional or multi-directional RNN models perform
good non-biased feature extraction, they still do not outperform state-of-the-art CNNs on some tasks.

Figure 3: ACC results on the three short text datasets using our proposed STC² based on AE, LSA, LE and LPI.

Figure 4: NMI results on the three short text datasets using our proposed STC² based on AE, LSA, LE and LPI.

In order to make clear which factors make our proposed method work, we report the bar chart results of ACC and NMI of our proposed methods and the corresponding baseline methods in Figure 3 and Figure 4. It is clear that, although AE and LSA do as well as or even better than LE and LPI, especially on the StackOverflow and Biomedical datasets, STC²-LE and STC²-LPI achieve much larger performance enhancements than STC²-AE and STC²-LSA do. The possible reason is that it is the information in the pseudo supervision used to guide the learning of the CNN model that makes the difference. In particular, in the AE case, the input features fed into the CNN model and the pseudo supervision employed to guide its learning all come from word embeddings. There are no different semantic features to be used in our proposed method, and thus the performance enhancements of STC²-AE are limited. In the LSA case, as we know, LSA performs matrix factorization to find the best subspace approximation of the original feature space that minimizes the global reconstruction error. And as [25, 53] recently point out, word embeddings trained with word2vec or some of its variants essentially perform an operation of matrix factorization. Therefore, the information in the input and in the pseudo supervision of the CNN does not depart very much from each other, and the performance enhancement of STC²-LSA is also not quite satisfactory. In the LE and LPI cases, as we know, LE extracts the manifold structure of the original feature space, and LPI extracts both the geometric and the discriminating structure of the original feature space [39]. We conjecture that our approaches STC²-LE and STC²-LPI achieve enhancements over both LE and LPI by a large margin because both LE and LPI provide useful semantic features, and these features are also different from the word embeddings used as the input of the CNN. From this view, we say that our proposed STC² has the potential to be more effective when the pseudo supervision provides semantically meaningful features which are different enough from the input of the CNN.

Furthermore, from the results of K-means and AE in Tables 4-5 and Figures 3-4, we note that TF-IDF weighting gives a more remarkable improvement for K-means, while TF weighting works better than TF-IDF weighting for Average Embedding. Maybe the reason is that pre-trained word embeddings encode some useful information from the external corpus and are able to get even better results without TF-IDF weighting. Meanwhile, we find that LE achieves unusually good performance compared with LPI, LSA and AE on the SearchSnippets dataset, which is not found on the other two datasets. To clarify this, and also to give a better demonstration of our proposed approaches and the other baselines, we further report 2-dimensional text embeddings on SearchSnippets in Figure 5, using t-SNE [54] (http://lvdmaaten.github.io/tsne/) to get a distributed stochastic neighbor embedding of the feature representations used in the clustering methods. We can see that the results from AE and LSA seem to be fairly good or even better than the ones from LE and LPI, which is not the same as the results from ACC and NMI in Figures 3-4. Meanwhile, RecNN (Ave.)
performs better than BoW (both TF and TF-IDF) while RecNN (Top) does not, which is the same as the results from ACC and NMI in Table 4 and Table 5. Both the "same as" and the "not the same as" cases above illustrate that a visualization tool such as t-SNE captures some useful information for assessing results that differs from what ACC and NMI measure. Moreover, from this complementary view of t-SNE, we can see that our STC²-AE, STC²-LSA, STC²-LE and STC²-LPI show more clear-cut margins among different semantic topics (that is, tags/labels) compared with AE, LSA, LE and LPI, respectively, as well as compared with both the BoW based and the RecNN based baselines.

From all these results, with the three measures of ACC, NMI and t-SNE on the three datasets, we can draw a solid conclusion that our proposed approach is an effective way to obtain useful semantic features for short text clustering.
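The 2-dimensional embeddings reported in Figure 5 can be reproduced with an off-the-shelf t-SNE implementation; the following is a minimal illustrative sketch using scikit-learn and matplotlib (variable and function names are ours, not the authors' visualization script):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project clustering features (n, r) to 2-D with t-SNE and color points by tag/label."""
    emb = TSNE(n_components=2, init="random", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap="tab10")
    plt.title(title)
    plt.show()

# e.g. plot_tsne(h_stc2_lpi, gold_labels, "STC2-LPI on SearchSnippets")
```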
5. Conclusions
With the emergence of social media, short text clustering has become an increasingly important task. This paper explores a new perspective for clustering short texts based on deep feature representations learned by the proposed self-taught convolutional neural networks. Our framework can be successfully accomplished without using any external tags/labels or complicated NLP pre-processing.

Figure 5: A 2-dimensional embedding of original keyword features weighted with (a) TF and (b) TF-IDF, (c) vectors of the top tree node in RecNN, (d) average vectors of all tree nodes in RecNN, (e) average embeddings weighted with TF, subspace features based on (f) LSA, (g) LE and (h) LPI, and deep learned features from (i) STC²-AE, (j) STC²-LSA, (k) STC²-LE and (l) STC²-LPI. The above features are respectively used in K-means (TF), K-means (TF-IDF), RecNN (Top), RecNN (Ave.), AE (TF), LSA (best), LE (best), LPI (best), and our proposed STC²-AE, STC²-LSA, STC²-LE and STC²-LPI on SearchSnippets. (Best viewed in color.)

Acknowledgments
We would like to thank the reviewers for their comments, and acknowledge Kaggle and BioASQ for making the datasets available. This work is supported by the National Natural Science Foundation of China (No. 61602479, No. 61303172, No. 61403385) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB02070005).