Deep Autoencoder-based Fuzzy C-Means for Topic Detection
Hendri Murfi Department of Mathematics Universitas Indonesia, Depok 16424, Indonesia Email: [email protected] Natasha Rosaline Department of Mathematics Universitas Indonesia, Depok 16424, Indonesia Email: [email protected]
Nora Hariadi Department of Mathematics Universitas Indonesia, Depok 16424, Indonesia Email: [email protected]
Abstract. Topic detection is a process for determining topics from a collection of textual data. One of the topic detection methods is a clustering-based method, which assumes that the centroids are topics. The clustering method has the advantage that it can process data with negative representations. Therefore, the clustering method allows a combination with a broader representation learning method. In this paper, we adopt deep learning for topic detection by using a deep autoencoder and fuzzy c-means, called deep autoencoder-based fuzzy c-means (DFCM). The encoder of the autoencoder performs a lower-dimensional representation learning. Fuzzy c-means groups the lower-dimensional representation to identify the centroids. The autoencoder's decoder transforms the centroids back into the original representation to be interpreted as the topics. Our simulation shows that DFCM improves the coherence score of eigenspace-based fuzzy c-means (EFCM) and is comparable to the leading standard methods, i.e., nonnegative matrix factorization (NMF) or latent Dirichlet allocation (LDA).
Keywords:
Topic detection, clustering, deep learning, autoencoder, fuzzy c-means
1. Introduction
Topic detection is a process used to analyze words in a collection of textual data to determine the topics in the collection, how they relate to one another, and how they change over time. The topics are usually represented by sets of words, and a topic's interpretability is usually measured by the coherence of its words. The standard topic detection methods are nonnegative matrix factorization (NMF) (Lee & Seung, 1999), clustering (Allan, 2002), and latent Dirichlet allocation (LDA) (Blei et al., 2003). In the clustering method, the cluster centers or centroids are interpreted as topics. In other words, the clustering method groups the textual data based on their topic similarity. Unlike the other two methods, the clustering method can process data with negative representations. Therefore, the clustering method can be combined with a broader range of representation learning or dimension reduction methods. Nur'aini et al. combine k-means and latent semantic analysis (LSA) for topic detection (Nur'aini, Najahaty, Hidayati, Murfi, & Nurrohmah, 2015). First, the textual data are transformed into a lower-dimensional eigenspace using the singular value decomposition (SVD). Next, k-means is performed on the eigenspace to extract the topics, which are then transformed back to the nonnegative subspace of the original space. The k-means method splits the textual data into k clusters in which each document belongs to its nearest centroid. This means that k-means assumes that each document contains only one topic. This assumption is restrictive and also differs from the standard NMF and LDA, considering that a document may cover many topics. Therefore, soft clustering is examined as an alternative clustering method for topic detection. Fuzzy c-means (FCM) is one of the best-known soft clustering methods (Bezdek, Ehrlich, & Full, 1984). Using FCM, a document may belong to more than one cluster and thus may have more than one topic.
The combination of FCM and LSA, called eigenspace-based fuzzy c-means (EFCM), has been proposed for topic detection (Muliawati & Murfi, 2017). In general, simulations show that EFCM gives coherence scores between those of LDA and NMF (Murfi, 2018, 2019; Nugraha, Yusdiansyah, & Murfi, 2019). Currently, deep learning is the primary machine learning method for unstructured data such as images and text (Goodfellow, Bengio, & Courville, 2016; Zhang, Lipton, Li, & Smola, 2020). Deep learning has been extensively studied to extract good representations of data by neural networks (Bengio, Courville, & Vincent, 2013). In this paper, we adopt deep learning to improve the performance of EFCM for topic detection problems by using a deep autoencoder (DAE) for the representation learning process. We call this topic detection method deep autoencoder-based fuzzy c-means (DFCM). First, the encoder of the DAE performs lower-dimensional representation learning. Next, FCM groups the lower-dimensional representations to identify the centroids. Finally, the decoder of the DAE transforms the centroids back into the original representation to provide the topics. Our simulations show that DFCM improves on the coherence score of EFCM and is comparable to the leading standard methods, i.e., NMF and LDA. This paper's outline is as follows: In Section 2 and Section 3, we describe the related works and the methods, i.e., FCM, DAE, and DFCM. Section 4 presents the results and discussion of our simulations. Finally, a general conclusion is presented in Section 5.
2. Related Works
Topic detection methods are algorithms for discovering the topics or themes in an unstructured collection of documents. Recent publications show the growing use of topic detection by researchers in Library and Information Science to find a theme they are interested in and then examine the documents related to that theme (Battsengel, Geetha, & Jeon, 2020; Lamba & Madhusudhan, 2019; Parlina, Ramli, & Murfi, 2020). The standard topic detection methods are nonnegative matrix factorization (Cichocki & Phan, 2009; Févotte & Idier, 2011; Lee & Seung, 1999), clustering (Allan, 2002; Petkos, Papadopoulos, & Kompatsiaris, 2014), and latent Dirichlet allocation (Blei, 2012; Blei et al., 2003; Hoffman, Blei, & Bach, 2010; Hoffman, Blei, Wang, & Paisley, 2013). Unlike the other two methods, clustering is a general method of grouping data. Furthermore, this method can process both positive and negative data representations. Thus, the clustering method is more flexible to combine with representation learning or dimension reduction. Fuzzy clustering is one of the most widely used clustering methods because it is soft and flexible in assigning data to clusters (Ruspini, Bezdek, & Keller, 2019). Bezdek developed FCM by extending the fuzzifier value m to m > 1 (Bezdek et al., 1984). This extension makes FCM a generalization of k-means, which is a hard clustering method. FCM is more suitable for topic detection because it adapts to documents with one or more topics, namely by finding the optimal fuzzifier value m. In the era of big data, the existence of high-dimensional data is a big challenge for FCM (Winkler, Klawonn, & Kruse, 2011). By finding a new representation of the original data, two approaches have been used to reduce the difficulty of FCM with high-dimensional data.
The first approach uses kernel methods to implicitly obtain more expressive features by formulating the data in a feature space constructed by some kernel function (Huang, Chuang, & Chen, 2012; Shang, Zhang, Li, Jiao, & Stolkin, 2019). The second approach is an explicit transformation of the original data. In addition to specified nonlinear data transformations (Zhu, Pedrycz, & Li, 2017), random projection (Rathore, Bezdek, Erfani, Rajasegarar, & Palaniswami, 2018) is commonly used to obtain low-dimensional data. The second approach is more suitable because designing a good kernel space for clustering is complicated, and the ability of kernel methods to handle large-scale data is also a concern. In several empirical studies, the combination of FCM with the data transformation approach (Murfi, 2018; Nugraha, Yusdiansyah, & Murfi, 2019) provides better performance than the random projection approach (Yusdiansyah, Murfi, & Wibowo, 2019) for topic detection problems. Currently, deep learning is the primary machine learning method for unstructured data such as images and text (Goodfellow et al., 2016; Zhang et al., 2020). Deep learning has been extensively studied to extract good representations of data by neural networks (Bengio et al., 2013). The combination of deep neural networks and unsupervised clustering methods has also become an active research field (Song, Huang, Liu, Wang, & Wang, 2014). In general, there are several approaches to combining deep learning and clustering. The first approach combines representation learning and clustering in two steps, using a deep autoencoder for representation learning and then a clustering method in the next stage (Song et al., 2014; Song, Liu, Huang, Wang, & Tan, 2013). The second approach trains a deep autoencoder and a clustering method simultaneously (Guo, Gao, Liu, & Yin, 2017; Xie, Girshick, & Farhadi, 2016). Another approach combines clustering with a pretrained encoder, such as Bidirectional Encoder Representations from Transformers (BERT) proposed by Google (Guan et al., 2020). However, most of the clustering methods used in these approaches are hard clustering; few studies work on improving feature quality by deep learning for fuzzy clustering. This study aims to find a good deep representation for fuzzy clustering, i.e., fuzzy c-means. In this research, we use the first approach to combining representation learning and clustering: representation learning and clustering are carried out separately, not simultaneously as in the second approach. This approach still requires the decoder part to transform the data back into the original representation.
In addition, our approach does not use a pretrained model because, with a pretrained model, it is still difficult to determine the most important words to represent the resulting topics.
3. Methods
Let A be a word-by-document matrix and c be the number of topics. Given A and c, the topic detection problem is how to recover c topics from A. In the clustering-based topic detection method, the cluster centers or centroids are interpreted as topics. In this section, we describe deep autoencoder-based fuzzy c-means (DFCM) for topic detection. First, we review the core methods, i.e., fuzzy c-means (FCM) and deep autoencoder (DAE).

3.1. Fuzzy C-Means

Given a dataset in the form of a word-by-document matrix $A = [\mathbf{a}_1\ \mathbf{a}_2\ \dots\ \mathbf{a}_n]$ and the number of centroids c, the goal of fuzzy c-means (FCM) can be formulated as the following constrained optimization:

$$\min_{u_{ik},\ \mathbf{c}_i} J = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{m}\,\lVert \mathbf{a}_k - \mathbf{c}_i \rVert^2 \quad (1)$$

$$\text{s.t.}\quad \sum_{i=1}^{c} u_{ik} = 1\ \forall k, \quad 0 < \sum_{k=1}^{n} u_{ik} < n\ \forall i, \quad u_{ik} \in [0,1]\ \forall i,k$$

where $\mathbf{c}_i$ are the centroids, $u_{ik}$ is the membership of data point $\mathbf{a}_k$ in cluster i, $m > 1$ is the fuzzification constant, and $\lVert\cdot\rVert$ is any norm. The first constraint ensures that every data point has total membership one over all clusters, where each membership is in [0,1]. The second constraint guarantees that all clusters are nonempty (Bezdek et al., 1984). The problem of the constrained optimization in Equation 1 is to find the $u_{ik}$ and $\mathbf{c}_i$ that minimize the objective function J. The standard method to solve this constrained optimization is alternating optimization. First, we choose some initial values for the $\mathbf{c}_i$.
Then we minimize J with respect to the $u_{ik}$, keeping the $\mathbf{c}_i$ fixed, which gives

$$u_{ik} = \left[\sum_{j=1}^{c}\left(\frac{\lVert \mathbf{a}_k - \mathbf{c}_i \rVert}{\lVert \mathbf{a}_k - \mathbf{c}_j \rVert}\right)^{\frac{2}{m-1}}\right]^{-1},\ \forall i,k \quad (2)$$

Next, we minimize J with respect to the $\mathbf{c}_i$, keeping the $u_{ik}$ fixed, which gives

$$\mathbf{c}_i = \frac{\sum_{k=1}^{n} u_{ik}^{m}\,\mathbf{a}_k}{\sum_{k=1}^{n} u_{ik}^{m}},\ \forall i \quad (3)$$

This two-step optimization is iterated until a stopping criterion is fulfilled, e.g., reaching the maximum number of iterations or insignificant changes in the objective function J, the memberships $u_{ik}$, or the centroids $\mathbf{c}_i$ (Bezdek & Hathaway, 2003). The FCM algorithm is described in more detail in Algorithm 1. According to Equation 2, the memberships $u_{ik}$ tend to 0 or 1 as the fuzzification constant m approaches 1; the bigger the fuzzification constant, the fuzzier the memberships $u_{ik}$. Therefore, the setting of the fuzzification constant is quite intuitive: a small fuzzification constant means that each document may contain a small number of topics, whereas a bigger fuzzification constant implies that each document may have more topics.
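The alternating updates of Equations 2 and 3 can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the Euclidean norm, the centroid-change stopping rule, the function name `fcm`, and the toy data below are all assumptions:

```python
import numpy as np

def fcm(A, c, m=2.0, T=100, eps=1e-5, seed=0):
    """Minimal fuzzy c-means on an (n, d) data matrix A with c clusters
    and fuzzification constant m > 1 (Euclidean norm assumed)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    C = A[rng.choice(n, size=c, replace=False)]  # initialize centroids as data points
    p = 2.0 / (m - 1.0)
    for _ in range(T):
        # Eq. 2: memberships from inverse relative distances
        d = np.linalg.norm(A[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** p * (d ** -p).sum(axis=1, keepdims=True))
        # Eq. 3: centroids as membership-weighted means
        Um = U ** m
        C_new = (Um.T @ A) / Um.sum(axis=0)[:, None]
        # stop when the centroids barely move (Frobenius norm)
        if np.linalg.norm(C_new - C) < eps:
            return U, C_new
        C = C_new
    return U, C
```

Each row of U sums to one (the first constraint of Equation 1), and a data point may keep substantial membership in several clusters, which is what allows multi-topic documents.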
Algorithm 1. FCM
Input: A, c, m, max iterations (T), threshold (ε)
Output: $\mathbf{c}_i$
1. Set t = 0 and initialize the $\mathbf{c}_i$
2. Update t = t + 1
3. Calculate the memberships $u_{ik}$ according to Equation 2, ∀i,k
4. Calculate the centroids $\mathbf{c}_i$ according to Equation 3, ∀i
5. If a stopping criterion, i.e., t > T or $\lVert C_t - C_{t-1} \rVert_F < \varepsilon$, is fulfilled, then stop; else go back to step 2

3.2. Deep Autoencoder

Deep autoencoder (DAE) is a deep neural network for unsupervised learning problems. The unsupervised problem is solved using a supervised learning approach in which the target labels are constructed from the input features: the output layer is set to reconstruct the input layer, so standard supervised learning can be applied. The DAE architecture for representation learning consists of three parts, i.e., the encoder, the code, and the decoder (Figure 1). The encoder consists of fully connected layers that transform the input data into the code, a new data representation. The decoder transforms the new data representation back into the original representation; it consists of fully connected layers whose structure is symmetric to that of the encoder. The reason is that if an encoder requires a certain complexity (the number of layers (depth) and the number of neurons in each layer (units)) to map the data to the new representation, then a decoder of the same complexity is needed to map the new representation back to the original one. For dimension reduction, the number of neurons in the code is set smaller than the number of neurons in the input layer (Hinton & Salakhutdinov, 2006).
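The symmetric encoder-code-decoder structure can be illustrated with a small sketch. The 500-500-2000 layer sizes follow the architecture used later in the simulations; the random untrained weights, the sigmoid activations, and all names below are assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_dae(d_in, p, hidden=(500, 500, 2000), seed=0):
    """Weight matrices of a symmetric DAE: the decoder mirrors the
    encoder's layer sizes, and the code layer has p < d_in neurons."""
    rng = np.random.default_rng(seed)
    enc_sizes = [d_in, *hidden, p]
    dec_sizes = enc_sizes[::-1]  # symmetric structure
    enc = [rng.normal(0.0, 0.01, (a, b)) for a, b in zip(enc_sizes, enc_sizes[1:])]
    dec = [rng.normal(0.0, 0.01, (a, b)) for a, b in zip(dec_sizes, dec_sizes[1:])]
    return enc, dec

def forward(x, layers):
    """Apply the fully connected layers with a sigmoid activation."""
    for W in layers:
        x = sigmoid(x @ W)
    return x

enc, dec = build_dae(d_in=3000, p=10)           # e.g., a 3000-word vocabulary
x = np.random.default_rng(1).random((4, 3000))  # 4 documents
code = forward(x, enc)      # lower-dimensional representation (the code)
recon = forward(code, dec)  # mapped back to the original dimension
```

Because the decoder mirrors the encoder, the reconstruction has exactly the input's dimension, which is what later allows centroids found in the code space to be mapped back to words.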
Figure 1. Deep Autoencoders
DAE can be built layer by layer using greedy layer-wise pretraining, where each layer is built by a denoising autoencoder (Vincent et al., 2010). The denoising autoencoder is an autoencoder that reconstructs the input from a corrupted version of it to force the hidden layer to discover a more stable and robust representation. Given a textual dataset $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$ with $\mathbf{x}_i \in \mathbb{R}^{D_x}$, ∀i = 1, 2, …, n, the denoising autoencoder consists of two layers as follows:

$$\tilde{\mathbf{x}}_i \sim \mathrm{Dropout}(\mathbf{x}_i) \quad (4)$$
$$\mathbf{h}_i = f(\tilde{\mathbf{x}}_i, W) \quad (5)$$
$$\tilde{\mathbf{h}}_i \sim \mathrm{Dropout}(\mathbf{h}_i) \quad (6)$$
$$\mathbf{y}_i = g(\tilde{\mathbf{h}}_i, W') \quad (7)$$

where Dropout() randomly ignores some neuron outputs during training, f() and g() are activation functions, and W and W' are weights. The fitting minimizes the loss $\ell(\mathbf{x}_i, \mathbf{y}_i)$, i.e., the error between $\mathbf{x}_i$ and $\mathbf{y}_i$. Next, $\mathbf{h}_i$ becomes the new representation used as the input of the next layer. After each denoising autoencoder is trained, its weights become the corresponding weights of the autoencoder. Furthermore, the autoencoder is retrained to minimize the reconstruction loss over all layers. The DAE algorithm is described in more detail in Algorithm 2.

3.3. Deep Autoencoder-based Fuzzy C-Means

Deep autoencoder-based fuzzy c-means (DFCM) is the proposed topic detection method that combines DAE for representation learning and FCM for fuzzy clustering. FCM works well for low-dimensional textual data but tends to generate only one topic for high-dimensional textual data. We can set the fuzzification constant of FCM to a small value to push FCM to produce more than one topic. However, a small fuzzification constant assumes that each document contains only a few topics, and only one topic when the fuzzification constant approaches one.
Therefore, we use DAE to transform the data into a lower-dimensional representation and keep the fuzzification constant adaptable for textual data with multiple topics. Figure 2 provides an overview of the DFCM process.
Algorithm 2.
DAE
Input: X, the size of the code p
Output: encoder (w), decoder (w)
1. Initialize autoencoder(p)
2. Set $\mathbf{h}_k^{(0)} = \mathbf{x}_k$, ∀k
3. Let m be the number of layers of the autoencoder
4. FOR i = 1 TO m:
   a. Fit the denoising autoencoder of the i-th layer: deAutoencoder($w^{(i)}$), $w^{(i)} = \min_w \ell(\mathbf{h}_k^{(i-1)}, \mathbf{y}_k)$, ∀k
   b. Compute the new representation with the encoder part: $\mathbf{h}_k^{(i)} = \mathrm{encoder}^{(i)}(\mathbf{h}_k^{(i-1)})$, ∀k
   c. Initialize the weights of the autoencoder with the corresponding weights of the denoising autoencoder: autoencoder($w^{(i)}$)
5. Fit the autoencoder: autoencoder(w), $w = \min_w \ell(\mathbf{x}_i, \mathbf{y}_i)$, ∀i

Figure 2. Deep Autoencoder-based Fuzzy C-Means
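One pretraining step, i.e., a single denoising-autoencoder layer as in Equations 4-7, can be sketched as follows. The sigmoid activations, the 20% dropout rate, the squared-error loss, and the random untrained weights are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.2):
    """Randomly zero a fraction of neuron outputs (the corruption step)."""
    return x * (rng.random(x.shape) >= rate)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D_x, q, n = 50, 10, 8                # input size, hidden units, samples
W = rng.normal(0.0, 0.1, (D_x, q))   # encoder weights W (Eq. 5)
W2 = rng.normal(0.0, 0.1, (q, D_x))  # decoder weights W' (Eq. 7)

X = rng.random((n, D_x))
H = sigmoid(dropout(X) @ W)    # Eqs. 4-5: corrupt the input, then encode
Y = sigmoid(dropout(H) @ W2)   # Eqs. 6-7: corrupt the code, then decode
loss = np.mean((X - Y) ** 2)   # reconstruction loss l(x, y) to minimize
```

Fitting would adjust W and W' by gradient descent on this loss; after pretraining, H becomes the input of the next layer, and the learned weights initialize the corresponding layers of the full autoencoder.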
Given a textual dataset $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$ with $\mathbf{x}_i \in \mathbb{R}^{D_x}$, ∀i = 1, 2, …, n, the dimension p of the new data representation, and the number of topics c, the textual data are first transformed into a lower-dimensional representation using the encoder. We denote this transformation as follows:

$$\tilde{X} = \mathrm{encoder}(X, p) \quad (8)$$

where $\tilde{\mathbf{x}}_i \in \mathbb{R}^{p}$, ∀i = 1, 2, …, n. Next, we perform FCM on the dataset $\tilde{X}$ with the lower-dimensional representation. In this step, the centroids $\tilde{\mathbf{c}}_i \in \mathbb{R}^{p}$, ∀i = 1, 2, …, c, are extracted for the c given clusters as follows:

$$\tilde{\mathbf{c}}_i = \mathrm{FCM}(\tilde{X}, c, m, T, \varepsilon) \quad (9)$$

These centroids $\tilde{\mathbf{c}}_i$ are interpreted as the topics in the lower-dimensional representation. However, these topics have no direct meaning; they become meaningful only after being transformed back to the original representation. Therefore, the extracted topics are transformed back to the original representation as follows:

$$\mathbf{t}_i = \max(0, \mathrm{decoder}(\tilde{\mathbf{c}}_i)) \quad (10)$$

where $\mathbf{t}_i \in \mathbb{R}^{D_x}$, ∀i = 1, 2, …, c, and max() takes the maximum of 0 and each element of $\mathrm{decoder}(\tilde{\mathbf{c}}_i)$. The DFCM algorithm is described in more detail in Algorithm 3.

Algorithm 3.
DFCM
Input: X, the size of the code p, the number of topics c, max number of iterations T, threshold ε
Output: $\mathbf{t}_i$
1. Build the autoencoder: encoder, decoder = DAE(X, p)
2. Transform X: $\tilde{X} = \mathrm{encoder}(X)$
3. Perform FCM: $\tilde{\mathbf{c}}_i = \mathrm{FCM}(\tilde{X}, c, m, T, \varepsilon)$, i = 1, 2, …, c
4. Calculate the topics: $\mathbf{t}_i = \max(0, \mathrm{decoder}(\tilde{\mathbf{c}}_i))$, i = 1, 2, …, c
4. Results and Discussion

Terms occurring in fewer than t of the total m documents were also excluded, where the threshold t was set to max(10, m/1000). Finally, we use term frequency-inverse document frequency weighting. Given a document collection, the topic detection methods produce topics represented by their top 10 words. The standard quantitative way to measure the interpretability of the topics is topic coherence. In our simulations, we use one of the topic coherence measures, called TC-W2V (O'Callaghan, Greene, Carthy, & Cunningham, 2015). Suppose a topic t consists of n words $\{t_1, t_2, \dots, t_n\}$; the TC-W2V of topic t is

$$\mathrm{TC\text{-}W2V}(t) = \sum_{j=2}^{n}\sum_{i=1}^{j-1} \mathrm{similarity}(wv_i, wv_j) \quad (11)$$

where $wv_i$ and $wv_j$ are the vectors of words $t_i$ and $t_j$ constructed by a word2vec model. The simulations are conducted in a Windows-based Python environment. For the EFCM and DFCM parameters, we set the fuzzification constant m = 1.1, the maximum number of iterations T = 1000, and the threshold ε = 0.005. As mentioned before, the setting of the fuzzification constant is quite intuitive: a small fuzzification constant means that each document may contain a small number of topics, whereas a bigger one implies that each document may contain more topics. We initialize the centroids of FCM for both EFCM and DFCM using the best of 10 runs of k-means clustering. In DFCM, the DAE architecture consists of three symmetric hidden layers of 500, 500, and 2000 neurons, respectively. We implement this representation learning using the Python-based Keras library. On the other hand, we use the TruncatedSVD implementation of scikit-learn for the dimension reduction in EFCM (Pedregosa et al., 2011). Finally, we need to tune the remaining parameters, i.e., the lower dimension p and the number of topics c, for EFCM and DFCM.
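As an illustration of Equation 11, the following toy computation uses made-up two-dimensional "word vectors" with cosine similarity; a real evaluation would look the topic's words up in the trained word2vec model:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def tc_w2v(word_vectors):
    """Eq. 11: sum of pairwise similarities over a topic's top-n words
    (the score is often also averaged over the n(n-1)/2 pairs)."""
    n = len(word_vectors)
    return sum(cosine(word_vectors[i], word_vectors[j])
               for j in range(1, n) for i in range(j))

# toy 3-word "topic": two identical vectors and one orthogonal vector
vecs = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
score = tc_w2v(vecs)  # pairs (0,1)=1, (0,2)=0, (1,2)=0, so score = 1.0
```

A topic whose words have nearby word2vec embeddings thus receives a higher score, matching the intuition that coherent topics use semantically related words.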
This simulation also compares DFCM with two standard topic detection methods, latent Dirichlet allocation (LDA) and nonnegative matrix factorization (NMF). We use the LDA and NMF implementations provided by scikit-learn (Pedregosa et al., 2011). The LDA implementation uses the batch variational Bayes method for training. Two parameters are usually optimized for this training method, namely α and η. α controls the mixture of topics for a specific document; a smaller α means that a document is likely to contain a smaller mixture of topics. η controls the distribution of words per topic; a larger η means that a topic is likely to contain more words. To optimize both parameters, we use a hyperparameter grid over [0.01, 0.1, 0.25, 0.5, 0.75, 1] and run the algorithm for each combination. For NMF, the data vectors are normalized to unit length. The NMF implementation uses a coordinate descent algorithm, and there is no parameter we optimize for it. To reduce the instability of random initialization, NNDSVD initialization is performed.

4.1. Enron – An English Email Dataset

The first dataset is Enron, consisting of approximately 500,000 emails generated by employees of the Enron Corporation. The Federal Energy Regulatory Commission obtained it during its investigation of Enron's collapse. To calculate the TC-W2V of the extracted topics, we use a pre-trained word2vec model trained on the Google News dataset (https://code.google.com/archive/p/word2vec/); the model contains 300-dimensional vectors for 3 million words and phrases. (The Keras library used for representation learning is available at https://keras.io.)

First, we analyze the effect of the number of DAE training epochs on the coherence scores of DFCM. Using a batch size of 256, the coherence scores for several epoch sizes are given in Figure 3. First, the number of epochs is set to 100; it is then increased to 400 and 700.
The average coherence score of DFCM fluctuates but increases when the number of epochs is increased from 100 to 400. However, the average coherence score tends to decrease when the number of epochs is increased to 700. If we choose 400 epochs, then DFCM gives a mean coherence score of 0.1771, 22% better than EFCM. Figure 4 provides simulation results similar to Figure 3 but for the 10-dimensional representation. Compared to the five-dimensional representation, the mean coherence scores of DFCM fluctuate only slightly as the number of epochs is increased from 100 to 400 and 700. DFCM gives a mean coherence score of 0.1896 when the number of epochs is 400. As for the five-dimensional representation, DFCM still provides a better mean coherence score than EFCM, about 5% better. Figure 3 and Figure 4 show that the 10-dimensional representation is more suitable for both DFCM and EFCM: DFCM with the ten-dimensional representation gives a mean coherence score 7% better than DFCM with the five-dimensional representation, while EFCM with the ten-dimensional representation provides an average coherence score 26% better than EFCM with the five-dimensional representation.
Figure 3. Coherence scores in terms of TC-W2V for the Enron dataset on the number of topics 10, 20, …, 100 when the lower-dimensional representation is set to five. DFCM(100), DFCM(400), and DFCM(700) denote that the number of epochs of the deep autoencoder is set to 100, 400, and 700, respectively.
Figure 4. Coherence scores in terms of TC-W2V for the Enron dataset on the number of topics 10, 20, …, 100 when the lower-dimensional representation is set to ten. DFCM(100), DFCM(400), and DFCM(700) denote that the number of epochs of the deep autoencoder is set to 100, 400, and 700, respectively.
Furthermore, we also compare the DFCM method with the two other standard methods, NMF and LDA. Figure 5 shows the coherence scores of the four topic detection methods for the number of topics 10, 20, …, 100. First, Figure 5 confirms the previous simulation results that EFCM provides coherence scores between those of NMF and LDA. From Figure 5, we can also see that DFCM reaches a coherence score slightly better than NMF for almost all numbers of topics; only for 10 topics does NMF give a significantly better coherence score. Over all numbers of topics, DFCM provides mean coherence scores 3%, 7%, and 34% better than NMF, EFCM, and LDA, respectively.
Figure 5. The comparison of coherence scores in terms of TC-W2V for LDA, NMF, EFCM, and DFCM on the number of topics 10, 20, …, 100 for the Enron dataset.

4.2. Berita – An Indonesian News Dataset

The second dataset is Berita, a collection of Indonesian news documents. To measure the TC-W2V of the extracted topics, we train a word2vec model on a corpus of 750,000 Indonesian documents from wikis, news, and tweets. Unlike the word2vec model for the first, English, dataset, this word2vec model is trained on a corpus that includes the Berita dataset; therefore, all vocabulary of the Berita dataset exists in the word2vec model. The simulations for the Berita dataset are given in Figure 6, Figure 7, and Figure 8. Figure 6 shows the effect of the number of epochs in DAE training on the coherence scores of DFCM for the five-dimensional representation, and Figure 7 shows the same for the 10-dimensional representation. The initial number of epochs is 50 and is increased to 100 and then 400. In general, increasing the number of epochs lowers the coherence scores for most numbers of topics. For 400 epochs, DFCM even provides coherence scores below EFCM for almost all numbers of topics. The same behavior is seen for the 10-dimensional representation in Figure 7.
Figure 6. Coherence scores in terms of TC-W2V for the Berita dataset on the number of topics 10, 20, …, 100 when the lower-dimensional representation is set to five. DFCM(50), DFCM(100), and DFCM(400) denote that the number of epochs of the deep autoencoder is set to 50, 100, and 400, respectively.
Figure 7. Coherence scores in terms of TC-W2V for the Berita dataset on the number of topics 10, 20, …, 100 when the lower-dimensional representation is set to ten. DFCM(50), DFCM(100), and DFCM(400) denote that the number of epochs of the DAE is set to 50, 100, and 400, respectively.
If we use 50 epochs for DAE training, then DFCM gives an average coherence score of 0.3730 for the five-dimensional representation, while EFCM provides an average coherence score of 0.3589. This means that DFCM achieves a slightly higher average coherence score than EFCM, about 4% better. A similar result is seen for the 10-dimensional representation, where DFCM gives an average coherence score about 3% higher than EFCM. From Figure 6 and Figure 7, we can also conclude that the five-dimensional representation provides slightly better coherence scores for both DFCM and EFCM. Figure 8 compares DFCM on the Berita dataset with the two other standard topic detection methods, NMF and LDA. In this comparison, both DFCM and EFCM use the five-dimensional representation. Figure 8 shows that DFCM, EFCM, NMF, and LDA provide mean coherence scores of 0.3730, 0.3588, 0.3560, and 0.2815, respectively. These results indicate that DFCM achieves a better average coherence score than EFCM, NMF, and LDA. However, NMF still gives a better coherence score for the smallest number of topics, i.e., 10. On this Berita dataset, NMF and EFCM provide almost the same mean coherence scores.
Figure 8. The comparison of coherence scores in terms of TC-W2V for LDA, NMF, EFCM, and DFCM on the number of topics 10, 20, β¦, 100 for the Berita dataset.
DFCM also provides a higher average coherence score than NMF: 3% better for the Enron dataset and 5% better for the Berita dataset. In contrast to EFCM, DFCM and NMF provide lower-dimensional representations with non-orthogonal dimensions or features. NMF carries out the topic extraction process in the original space, which consists of words; thus, the resulting topics can be directly interpreted and their coherence scores calculated. Meanwhile, DFCM extracts the topics in the lower-dimensional space, and they must be transformed back to the original space so that the extracted topics can be interpreted and their coherence scores calculated. However, DFCM makes it possible to learn a better representation for textual data, which generally has a lot of noise and variation. Thus, the success of DFCM in achieving better coherence scores is mainly because DFCM processes textual data with better representations. The same holds for LDA, which performs the topic extraction process in the original space consisting of words. Deep learning is currently a popular supervised learning approach, especially for unstructured data such as images and text; it integrates the feature extraction process with the classification or regression process. In the context of unsupervised learning, the deep autoencoder is a popular deep learning method for representation learning. It makes it possible to perform denoising while reducing dimensions to produce a better lower-dimensional representation. However, the integration of representation learning methods with unsupervised learning problems such as topic detection remains an open opportunity for further development.
5. Conclusions
DFCM is a topic detection method that combines a deep autoencoder for representation learning and fuzzy c-means for topic extraction. The deep autoencoder makes it possible to learn a better representation for textual data, which generally has a lot of noise and variation. Unlike EFCM, DFCM extracts topics from a lower-dimensional representation whose dimensions or features are not orthogonal; this representation is more realistic for representing topics. Our simulations show that DFCM gives higher coherence scores than EFCM and the two standard methods, i.e., NMF and LDA.
Acknowledgement
This paper was supported by Universitas Indonesia under the PDUPT 2020 grant. Any opinions, findings, conclusions, and recommendations are the authors' and do not necessarily reflect those of the sponsor.
References
Allan, J. (2002).
Topic detection and tracking: event-based information organization . Kluwer. Battsengel, G., Geetha, S., & Jeon, J. (2020). Analysis of Technological Trends and Technological Portfolio of Unmanned Aerial Vehicle.
Journal of Open Innovation: Technology, Market, and Complexity , (3), 48. https://doi.org/10.3390/joitmc6030048 Bengio, Y., Courville, A., & Vincent, P. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence , (8), 1798β1828. https://doi.org/10.1109/TPAMI.2013.50 Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences , (2), 191β203. https://doi.org/https://doi.org/10.1016/0098-3004(84)90020-7 Bezdek, J. C., & Hathaway, R. J. (2003). Convergence of alternating optimization.
Neural Parallel and Scientific Computing , (4), 351β368. Blei, D. M. (2012). Probabilistic topic models. Communication of the ACM , (4), 77β84. Blei, D. M., Ng, A. Y., Jordan, M. I., Edu, B. B., Ng, A. Y., Edu, A. S., β¦ Edu, J. B. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research , , 993β1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993 Cichocki, A., & Phan, A. H. (2009). Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences , E92 - A (3), 708β721. https://doi.org/10.1587/transfun.E92.A.708 Feng, Q., Chen, L., Chen, C. L. P., & Guo, L. (2020). Deep Fuzzy ClusteringβA Representation Learning Approach. IEEE Transactions on Fuzzy Systems , (7), 1420β1433. https://doi.org/10.1109/TFUZZ.2020.2966173 FΓ©votte, C., & Idier, J. (2011). Algorithms for Nonnegative Matrix Factorization with the {$Ξ²$} -divergence. Neural Comput. , (9), 2421β2456. https://doi.org/10.1162/NECO_a_00168 Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning . MIT Press. Guan, R., Zhang, H., Liang, Y., Giunchiglia, F., Huang, L., & Feng, X. (5555). Deep Feature-Based Text Clustering and Its Explanation.
IEEE Transactions on Knowledge & Data Engineering, (01), 1. https://doi.org/10.1109/TKDE.2020.3028943 Guo, X., Gao, L., Liu, X., & Yin, J. (2017). Improved Deep Embedded Clustering with Local Structure Preservation. In
Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1753–1759). AAAI Press. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks.
Science, (5786), 504–507. https://doi.org/10.1126/science.1127647 Hoffman, M. D., Blei, D. M., & Bach, F. (2010). Online Learning for Latent Dirichlet Allocation. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1 (pp. 856–864). USA: Curran Associates Inc. Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic Variational Inference.
Journal of Machine Learning Research, (1), 1303–1347. Huang, H., Chuang, Y., & Chen, C. (2012). Multiple Kernel Fuzzy Clustering. IEEE Transactions on Fuzzy Systems, (1), 120–134. https://doi.org/10.1109/TFUZZ.2011.2170175 Lamba, M., & Madhusudhan, M. (2019). Mapping of topics in DESIDOC Journal of Library and Information Technology, India: a study. Scientometrics, (2), 477–505. https://doi.org/10.1007/s11192-019-03137-5 Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788–791. Masci, J., Meier, U., Cireşan, D., & Schmidhuber, J. (2011). Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. In Proceedings of the 21st International Conference on Artificial Neural Networks - Part I (pp. 52–59). Berlin, Heidelberg: Springer-Verlag. Muliawati, T., & Murfi, H. (2017). Eigenspace-based fuzzy c-means for sensing trending topics in Twitter. AIP Conference Proceedings, 1862. Depok. https://doi.org/10.1063/1.4991244 Murfi, H. (2018).
The accuracy of fuzzy C-means in lower-dimensional space for topic detection. Lecture Notes in Computer Science (including subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11344 LNCS). https://doi.org/10.1007/978-3-030-05755-8_32 Murfi, H. (2019). Monitoring trending topics of real-world events on Indonesian tweets using fuzzy C-means in lower dimensional space. In
ACM International Conference Proceeding Series. https://doi.org/10.1145/3369114.3369127 Nugraha, P., Rifky Yusdiansyah, M., & Murfi, H. (2019). Fuzzy C-Means in Lower Dimensional Space for Topics Detection on Indonesian Online News. In Y. Tan & Y. Shi (Eds.),
Data Mining and Big Data (pp. 269–276). Singapore: Springer Singapore. Nur'aini, K., Najahaty, I., Hidayati, L., Murfi, H., & Nurrohmah, S. (2015). Combination of singular value decomposition and K-means clustering methods for topic detection on Twitter. In 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS) (pp. 123–128). IEEE. https://doi.org/10.1109/ICACSIS.2015.7415168 O'Callaghan, D., Greene, D., Carthy, J., & Cunningham, P. (2015). An analysis of the coherence of descriptors in topic modeling.
Expert Systems with Applications, (13), 5645–5657. https://doi.org/10.1016/j.eswa.2015.02.055 Parlina, A., Ramli, K., & Murfi, H. (2020). Theme mapping and bibliometrics analysis of one decade of big data research in the Scopus database. Information (Switzerland), (2). https://doi.org/10.3390/info11020069 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. Petkos, G., Papadopoulos, S., & Kompatsiaris, Y. (2014). Two-level message clustering for topic detection in Twitter. CEUR Workshop Proceedings, 49–56. Rathore, P., Bezdek, J. C., Erfani, S. M., Rajasegarar, S., & Palaniswami, M. (2018). Ensemble Fuzzy Clustering Using Cumulative Aggregation on Random Projections. IEEE Transactions on Fuzzy Systems, (3), 1510–1524. https://doi.org/10.1109/TFUZZ.2017.2729501 Ruspini, E. H., Bezdek, J. C., & Keller, J. M. (2019). Fuzzy Clustering: A Historical Perspective. IEEE Computational Intelligence Magazine, (1), 45–55. https://doi.org/10.1109/MCI.2018.2881643 Shang, R., Zhang, W., Li, F., Jiao, L., & Stolkin, R. (2019). Multi-objective artificial immune algorithm for fuzzy clustering based on multiple kernels. Swarm and Evolutionary Computation, 100485. https://doi.org/10.1016/j.swevo.2019.01.001 Song, C., Huang, Y., Liu, F., Wang, Z., & Wang, L. (2014). Deep Auto-Encoder Based Clustering. Intelligent Data Analysis, (6S), S65–S76. Song, C., Liu, F., Huang, Y., Wang, L., & Tan, T. (2013). Auto-encoder Based Data Clustering. In J. Ruiz-Shulcloper & G. di Baja (Eds.), Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (pp. 117–124). Berlin, Heidelberg: Springer Berlin Heidelberg. Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. In
Proceedings of the 25th International Conference on Machine Learning (pp. 1096–1103). New York, NY, USA:
Association for Computing Machinery. https://doi.org/10.1145/1390156.1390294 Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion.
Journal of Machine Learning Research, 11, 3371–3408. Winkler, R., Klawonn, F., & Kruse, R. (2011). Fuzzy C-Means in high dimensional spaces. International Journal of Fuzzy System Applications, (March), 2–4. https://doi.org/10.4018/ijfsa.2011010101 Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (pp. 478–487). JMLR.org. Yang, B., Fu, X., Sidiropoulos, N. D., & Hong, M. (2017). Towards K-Means-Friendly Spaces: Simultaneous Deep Learning and Clustering. In
Proceedings of the 34th International Conference on Machine Learning - Volume 70 (pp. 3861–3870). JMLR.org. Yusdiansyah, M. R., Murfi, H., & Wibowo, A. (2019).
Randomspace-Based Fuzzy C-Means for Topic Detection on Indonesia Online News. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11909 LNAI). https://doi.org/10.1007/978-3-030-33709-4_12 Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2020).
Dive into Deep Learning. Zhu, X., Pedrycz, W., & Li, Z. (2017). Fuzzy clustering with nonlinearly transformed data.
Applied Soft Computing, 364–376. https://doi.org/10.1016/j.asoc.2017.07.026