Deep NMF Topic Modeling
Jian-Yu Wang and Xiao-Lei Zhang
Abstract—Nonnegative matrix factorization (NMF) based topic modeling methods do not rely much on model or data assumptions. However, they are usually formulated as difficult optimization problems, which may suffer from bad local minima and high computational complexity. In this paper, we propose a deep NMF (DNMF) topic modeling framework to alleviate the aforementioned problems. It first applies an unsupervised deep learning method to learn latent hierarchical structures of documents, under the assumption that if we could learn a good representation of documents by, e.g., a deep model, then the topic word discovery problem can be boosted. Then, it takes the output of the deep model to constrain a topic-document distribution for the discovery of the discriminant topic words, which not only improves the efficacy but also reduces the computational complexity over conventional unsupervised NMF methods. We constrain the topic-document distribution in three ways, which take advantage of the three major sub-categories of NMF—basic NMF, structured NMF, and constrained NMF—respectively. To overcome the weaknesses of deep neural networks in unsupervised topic modeling, we adopt a non-neural-network deep model—the multilayer bootstrap network. To our knowledge, this is the first time that a deep NMF model is used for unsupervised topic modeling. We have compared the proposed method with a number of representative references covering the major branches of topic modeling on a variety of real-world text corpora. Experimental results illustrate the effectiveness of the proposed method under various evaluation metrics.
Index Terms—nonnegative matrix factorization, topic modeling, unsupervised deep learning.
1 INTRODUCTION

TOPIC modeling extracts salient features and discovers structural information from a large collection of documents [1]. This paper focuses on nonnegative matrix factorization (NMF) based topic modeling [2], [3], [4], [5], [6], [7], [8]. NMF topic modeling usually decomposes the document-word representation of documents into a topic-document matrix and a word-topic matrix. Existing decomposition methods usually have the following two major problems. First, it is challenging to discover common patterns or topics in the documents and organize them into a hierarchy [9], [10]. Second, the topic-word distribution does not meet the human interpretation of documents [11], [12]. For example, traditional topic modeling may lose smaller subject codes, i.e. sub-topics, in the tails of large topics, which leads to the inability to describe topic dimensions in terms of the human-interpretable objects of topics, and simultaneously loses all latent sub-structure within each topic [11]. Deep learning, which learns hierarchical data representations, provides one solution to the aforementioned problems. However, existing deep learning methods for topic modeling are mostly supervised, and fall into the category of probabilistic topic models [13], [14]. To our knowledge, unsupervised deep NMF topic modeling remains unexplored, maybe due to the high computational complexity of deep unsupervised NMF [15], [16] as well as the lack of supervised information of data.

• Both authors are with the Center for Intelligent Acoustics and Immersive Communications, School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China, and the Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, China. E-mail: [email protected], [email protected]
In this paper, we aim to explore an unsupervised deep NMF (DNMF) framework to address the above challenges. Because modeling topic hierarchies of documents and discovering topic words simultaneously is a complicated optimization problem, we propose to solve the two problems in sequence, under the assumption that, if the representation of documents is good enough, then the overall performance can be boosted [17]. The proposed method contains the following novelties:

• An unsupervised deep NMF framework is proposed. It first learns the topic hierarchies of documents by an unsupervised deep model, whose output is used to constrain the topic-document matrix. Then, it produces a good solution to the topic-document matrix and the word-topic matrix by NMF under the constraint. It can have many implementations by incorporating different NMF methods and deep models. Unlike conventional NMF topic modeling methods that make predefined assumptions, DNMF alleviates the weaknesses of NMF, e.g. non-unique factorization, by deep learning. To our knowledge, this is the first work of unsupervised deep NMF for topic modeling.

• Three implementations of DNMF that reach the state-of-the-art performance are proposed. The three algorithms fall into the three major subclasses of NMF technologies [18], denoted as basic DNMF (bDNMF), structured DNMF (sDNMF), and constrained DNMF (cDNMF) respectively. Specifically, bDNMF takes the output of the deep model as the topic-document matrix directly to generate the word-topic distribution. sDNMF takes the output of the deep model as the intrinsic geometry of the topic-document distribution, which is used to mask the topic-document matrix. cDNMF takes the output of the deep model as a regularization of the topic-document distribution. The convergence of the proposed algorithms is theoretically proved.

• Because the representation of documents in topic modeling is usually sparse and high-dimensional, existing deep neural networks can easily overfit to the documents. Although some methods reduce the dimension of the documents by discarding low-frequency words, their performance suffers from the compromise [19]. To address the problem, this paper applies multilayer bootstrap networks (MBN) to learn the topic hierarchies of documents. MBN contains three simple operators—random resampling, stacking, and one-nearest-neighbor optimization. To our knowledge, this is the first time that a non-neural-network unsupervised deep model is applied to topic modeling, which outperforms conventional shallow topic modeling methods significantly.

We have compared the proposed DNMF variants with 9 representative topic modeling methods [1], [3], [4], [5], [20], [21], [22], [23], [24] covering probabilistic topic models [1], [20], [21], [22], NMF methods [3], [4], [5], [23], and deep topic models [24]. Empirical results on the 20-newsgroups, topic detection and tracking database version 2 (TDT2), and Reuters-21578 corpora illustrate the effectiveness of DNMF in terms of three evaluation metrics. Moreover, the hyperparameters of the DNMF variants have stable working ranges across all situations, which facilitates their practical use.

In this paper, we first introduce some related work and preliminaries in the following two subsections, then present the proposed DNMF framework and its three implementations in Section 2. Section 3 presents the experimental results. Finally, Section 4 concludes our findings.

1.1 Related work
Probabilistic topic modeling: Topic models were originally formulated as unsupervised probabilistic models [1], [21], [25], [26]. A seminal work of probabilistic topic models is latent Dirichlet allocation (LDA) [1]. It models a document as a multinomial distribution over latent semantic topics, and models a topic itself as a multinomial distribution over words. The document-dependent topic embedding, governed by a Dirichlet prior, is estimated in an unsupervised way and then adopted as the low-dimensional feature for document classification and indexing. Later on, hierarchical tree-structured priors such as the nested Dirichlet process [25], [27] or the nested Chinese restaurant process [27], [28] were applied to discover the hierarchy of topics and capture the nonlinearity of documents. However, the hierarchical probabilistic models suffer from conceptual and practical problems. For example, their optimization problem is NP-hard in the worst case due to the intractability of the posterior inference [29]. Existing methods have to resort to approximate inference methods, such as variational Bayes and Gibbs sampling, which are also difficult to carry out [30]. Besides, because the exact inference is intractable, the models can never make predictions for words that are sharper than the distributions predicted by any of the individual topics. As a result, the hypothesized probability distributions cannot be applied to all text corpora [31]. Moreover, there is a lack of justification of the Bayesian priors as well [32]. Recently, a geometric Dirichlet means algorithm [33], which builds upon a weighted k-means clustering procedure and is augmented with a geometric correction, overcomes the computational and statistical inefficiencies encountered by probabilistic topic models based on Gibbs sampling and variational inference. However, the learned topic polytope is largely influenced by the performance of the clustering algorithm.

Deep probabilistic topic modeling: Another solution to the optimization difficulty of the hierarchical probabilistic models is to integrate the perspectives of probabilistic models and deep neural networks. The integrated methods, named deep neural topic models, introduce neural network based priors as alternatives to Dirichlet process based priors [34], [35], [36], [37]. This integrates the power of the neural network architecture into the inference of the probabilistic graphical models, which makes the models not only interpretable but also powerful and easily extendable. However, they still fail to consider the veracity of the Bayesian hypothesis. The problem of component collapsing may also lead to bad local optima of the inference network in which all topics are identical.
NMF topic modeling: To deal with the optimization difficulty of the hierarchical probabilistic models, a large effort has been devoted to polynomial-time solvable topic modeling algorithms, many of which are formulated as separable nonnegative matrix factorization (NMF) methods [2], [3], [4], [5], [6], [7], [8]. They find the underlying parameters of topic models by decomposing the document-word data matrix into a weighted combination of a set of topic distributions [38]. A key problem in the context of NMF research is the separability issue, i.e., whether the matrix factors are unique [39]. When one applies NMF to topic modeling, the separability assumption is equivalent to an anchor-word assumption, which assumes that every topic has a characteristic anchor word that does not appear in the other topics [3], [4], [5], [6]. However, because words and terms have multiple uses, the anchor-word assumption may not always hold. How to avoid this unrealistic assumption is a key research topic. One solution explores tensor factorization models with three- or higher-order word co-occurrence statistics. However, such statistics need many more samples than lower-order statistics to obtain reliable estimates, and separability still relies on additional assumptions [23], such as consecutive words being persistently drawn from the same topic. Another recent solution is anchor-free correlated topic modeling (AnchorFree) with second-order co-occurrence statistics. However, an assumption called the sufficiently scattered condition still needs to be made, though this assumption is much milder than the anchor-word assumption. Besides the problem of making additional assumptions on the data, NMF is also formulated as a shallow learning method with no more than one nonlinear layer, which may not capture the nonlinearity of documents and the hierarchy of topics well.
Deep NMF methods: The aforementioned NMF topic models are all shallow models, which are not powerful enough to grasp the nonlinearity of documents. In the NMF research community, much effort has been devoted to multilayered NMF algorithms with applications to image processing [40], [41], speech separation [42], [43], community detection [44], etc. The basic idea is to factorize a matrix into multiple factors, where the factorization can be either linear or nonlinear. If the factorization is nonlinear, then the method is called a deep NMF. For example, deep semi-NMF [15] factorizes the basis matrix into multiple factors with the optimization criterion of minimum reconstruction error, where it does not require the factorized weight matrix to be nonnegative anymore. Deep nonnegative basis matrix factorization [16] conducts deep factorization on the coefficient matrix with different regularization constraints on the basis matrix. However, because the bag-of-words representation of documents is high-dimensional and sparse, the application of the aforementioned idea to topic modeling is computationally expensive and may also suffer from overfitting. To our knowledge, no deep NMF topic modeling methods have been proposed yet.
Unsupervised deep learning for document clustering: Document clustering and topic modeling are two closely related tasks [17]. Unsupervised topic modeling projects documents into a topic embedding space, which promotes the development of document clustering. Recently, many works focused on learning the representations and topic assignments of documents simultaneously by deep neural networks [19], [45], [46], [47]. However, current deep learning methods for document clustering do not show advances over shallow learning methods, such as NMF-based topic modeling. We conjecture that existing methods may not be good at dealing with sparse and high-dimensional representations of documents. As a compromise, they reduce the dimension of the sparse data by discarding the low-frequency words, which may lose significant useful information. To deal with the aforementioned problems, here we develop deep models that are able to outperform conventional shallow models without discarding the low-frequency words. Note that, although some deep learning based topic models apply word embeddings to deep topic models [48], [49], it may not be suitable to compare them with the conventional topic modeling methods that work with the term frequency-inverse document frequency (TF-IDF) statistics.

1.2 Preliminaries
We first introduce some notations here. Regular letters, e.g. δ, M, and t, indicate scalars. Bold lower-case letters, e.g. d, indicate vectors. Bold capital letters, e.g. D, C, and W, indicate matrices. The bold digit 0 indicates an all-zero vector or matrix. The superscript T denotes the transpose. The notation [C]_{ij} indicates the element of the matrix C at the i-th row and j-th column. The operator ⊙ is the Hadamard (element-wise) multiplication. The operator Tr(·) denotes the trace of a matrix.

In topic modeling, we are given a corpus of N documents with K topics and a vocabulary of V words, denoted as {d_n}_{n=1}^N, where d_n = [d_{n,1}, ..., d_{n,V}]^T with d_{n,v} as the frequency of the v-th word of the vocabulary in the n-th document. We aim to learn a topic-document matrix W = [w_{k,n}] ∈ R_+^{K×N} and a word-topic matrix C = [c_{v,k}] ∈ R_+^{V×K} from the document-word matrix D = [d_1, ..., d_N] ∈ R_+^{V×N}, where k (k = 1, ..., K) is the topic index, w_{k,n} is the topic label which describes the probability of the n-th document belonging to the k-th topic, and c_{v,k} is the probability that the v-th word of the vocabulary appears in the k-th topic. The task of topic modeling is to find an approximate factorization:

    D ≈ CW    (1)

NMF measures the distance between D and CW by the squared Frobenius norm, and formulates the topic modeling problem as the following optimization problem:

    (C, W) = arg min_{C ≥ 0, W ≥ 0} ‖D − CW‖_F^2    (2)

where the nonnegative constraints make the solution interpretable. Under the anchor-word assumption, the word distribution C is enforced to be a block diagonal matrix, which guarantees a consistent solution [29], [50]. However, the anchor-word assumption is fragile in practice. Recently, many methods have been proposed to overcome this assumption [23], [51].
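As a point of reference for the algorithms presented below, problem (2) can be solved with the classical multiplicative update rules. The following minimal NumPy sketch (the function name, random initialization, and iteration count are our own choices, not specified by the paper) illustrates the plain factorization D ≈ CW:

```python
import numpy as np

def nmf(D, K, n_iter=200, eps=1e-12, seed=0):
    """Minimize ||D - CW||_F^2 s.t. C, W >= 0 by multiplicative updates.

    D: nonnegative V x N document-word matrix (words x documents).
    Returns C (V x K word-topic matrix) and W (K x N topic-document matrix).
    """
    rng = np.random.default_rng(seed)
    V, N = D.shape
    C = rng.random((V, K)) + eps
    W = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # Standard Lee-Seung updates for the squared Frobenius objective;
        # eps guards against division by zero.
        W *= (C.T @ D) / (C.T @ C @ W + eps)
        C *= (D @ W.T) / (C @ W @ W.T + eps)
    return C, W
```

The multiplicative form keeps both factors nonnegative as long as they are initialized nonnegative, which is one reason such rules are preferred over projected gradient descent in this setting.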
2 DEEP NMF TOPIC MODEL
In this section, we first present the DNMF topic modeling framework in Section 2.1, then implement three DNMF topic modeling methods, named bDNMF, sDNMF, and cDNMF respectively, in Section 2.2, and finally introduce the unsupervised deep model in Section 2.3.
2.1 DNMF framework

Traditional NMF topic modeling essentially aims to learn a document representation by linear NMF. In order to capture the manifold structure or topic hierarchies of documents, a natural way is to extend NMF into a deep NMF framework. Here we propose a DNMF framework which constrains the topic-document matrix by an unsupervised deep model with multiple layers of nonlinear transforms:

    D ≈ CW, subject to g(W | f(D)) ≥ 0, C ≥ 0, W ≥ 0    (3)

where f(·) is the unsupervised deep model and g(·) is a discriminator used to constrain W by f(·). f(·) performs like a prior that constrains the solution of W and C to be interpretable and discriminant, which is the fundamental difference between DNMF and conventional NMF topic models. The framework is illustrated in Fig. 1. It minimizes the reconstruction error between D and CW in terms of the squared Frobenius norm.

A direct thought for solving problem (3) is to optimize f(D), W, and C alternately until convergence. However, it is too costly to train a deep model in every iteration. In practice, we take the following optimization algorithm to solve problem (3):

• Pretrain f(D) first by an unsupervised deep model.
• Optimize W and C alternately with f(D) fixed until convergence.
Fig. 1. The proposed DNMF framework.
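A minimal sketch of the two-step procedure above, with the deep model and the constrained NMF solver left as placeholders (both callables are hypothetical interfaces of ours; the concrete instantiations follow in Section 2.2):

```python
import numpy as np

def dnmf(D, deep_model, constrained_nmf):
    """Two-step DNMF: (i) pretrain the unsupervised deep model f,
    (ii) solve the constrained NMF with f(D) held fixed.

    deep_model:      callable mapping D to f(D), a K x N nonnegative
                     document-topic representation.
    constrained_nmf: callable mapping (D, fD) to (C, W), an NMF solver
                     whose solution of W is constrained by f(D).
    """
    fD = deep_model(D)               # step 1: pretrain f(D) once
    C, W = constrained_nmf(D, fD)    # step 2: alternate C/W, f(D) fixed
    return C, W
```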
The effectiveness of the above algorithm relies on the assumption that, if a high-quality f(D) is obtained as a prior, then the solutions of C and W are also boosted.

The difference between the proposed topic modeling method and existing deep NMF methods [15], [16] is that the proposed method takes the deep model f(·) as an additional constraint on W, while the methods in [15], [16] decompose W directly into a hierarchical network. It is easy to see that the proposed framework can employ various unsupervised deep models to bring additional information into the matrix decomposition problem for specific applications. It is also easy to constrain W flexibly, as we will do in Section 2.2, which brings advanced NMF methods into the proposed framework. It can either employ a pretrained deep model or conduct joint optimization of the deep model and the matrix decomposition. On the contrary, although [15], [16] can be applied to topic modeling, the computational complexity of their multilayered matrix decomposition is too high for topic modeling in practice. To our knowledge, they have not been applied to topic modeling yet.

2.2 Three implementations of DNMF

In this subsection, we first introduce three DNMF implementations that extend the three sub-categories of NMF [18] to their deep versions respectively, and then discuss the connection between the three implementations. Note that, besides the novelty of the DNMF framework, cDNMF and sDNMF are also fundamentally new even without the deep model f(D).

Many NMF topic modeling methods introduce a polytope to interpret the geometry of documents [52], [53], i.e. [D]_{ij} = Σ_{k=1}^K [C]_{ik} [W]_{kj}. A standard NMF topic modeling can always find infinitely many solutions of C and W that satisfy D ≈ CW. To prevent such infinite solutions, various constraints have to be added. One of the simplest constraints is to provide one of the two factors, e.g. W, beforehand. However, it has historically not been easy to find a satisfactory W beforehand. Fortunately, deep learning provides such an opportunity. We conjecture that, if a good topic-document matrix W could be learned beforehand by deep learning, then the problem of finding the other factor C can be greatly simplified, which motivates bDNMF.

Given latent document topic proportions f(D) from a deep model, bDNMF interprets the documents by

    [D]_{ij} = Σ_{k=1}^K [C]_{ik} [f(D)]_{kj},  for i = 1, ..., V; j = 1, ..., N.    (4)

It is a special case of the framework in Fig. 1 where g(·) is simply defined as W − f(D) = 0. Solving the factorization (4) in the NMF framework results in the following optimization problem:

    min_{C ≥ 0, f(·)} D_F[D ‖ C f(D)]    (5)

where D_F[D ‖ C f(D)] denotes the Frobenius-norm distance of NMF with C f(D) being an approximation of D:

    D_F[D ‖ C f(D)] = ‖D − C f(D)‖_F^2    (6)

We solve bDNMF in two steps. First, we generate the sparse representation of documents f(D) by a deep model. Then, problem (5) is formulated as a nonnegative least squares optimization problem, which can be solved by gradient descent algorithms or multiplicative update rules [54]. Here we prefer multiplicative update rules, since they do not have tunable hyperparameters. As we can see, when f(D) is given, problem (5) satisfies the following first-order Karush-Kuhn-Tucker (KKT) optimality conditions:

    C ≥ 0,  ∂D_F(D ‖ C f(D)) / ∂C ≥ 0,  C ⊙ ∂D_F(D ‖ C f(D)) / ∂C = 0    (7)

which guarantees that the solution of (5) converges to a stationary point.

The multiplicative update rules are derived as follows. Let Ψ be the Lagrange multiplier of the constraint C ≥ 0. The Lagrangian J for (5) is

    J = Tr(DD^T) − 2 Tr(D f(D)^T C^T) + Tr(C f(D) f(D)^T C^T) + Tr(CΨ)    (8)

The partial derivative of J with respect to C is

    ∂J/∂C = −2 D f(D)^T + 2 C f(D) f(D)^T + Ψ    (9)
By the KKT condition C ⊙ Ψ = 0, we derive

    C ⊙ (C f(D) f(D)^T) − C ⊙ (D f(D)^T) = 0    (10)

Therefore, the multiplicative update rule for C can be inferred as follows:

    [C]^{(t+1)}_{ij} ← [C]^{(t)}_{ij} [D f(D)^T]_{ij} / [C f(D) f(D)^T]_{ij}    (11)

where the superscript (t) denotes the t-th iteration of the multiplicative update rules.

Algorithm 1: bDNMF.
  Input: Text corpus D, the number of topics K, hyperparameters δ ≥ 0 and M ≥ 0.
  Output: C^{(t)}, W.
  Initialize: topic-word distribution C^{(0)}, t ← 0;
  Construct a document-topic distribution f(D) by deep unsupervised learning methods;
  W ← f(D);
  repeat
    Calculate C^{(t)} by (11);
    t ← t + 1;
  until convergence;

bDNMF is summarized in Algorithm 1. It implements g(W, f(D)) by simply setting W = f(D). The main merit of bDNMF is that it can easily get the globally optimal solution of C given a fixed W, which avoids the non-unique solution of NMF topic modeling in a simple way. Its effectiveness is largely affected by f(D). In practice, we implement f(D) as semantic topic labels, which are obtained by deep-learning-based document clustering.
Although bDNMF is simple, it reduces NMF to a problem with only one variable when f(D) is given, which limits the flexibility of C. To solve this problem, sDNMF modifies the regular factorization formulation (2) with a new discriminator W = f(D) ⊙ T instead of taking W = f(D), where T is a new variable. Its objective function is formulated as follows:

    min_{C ≥ 0, T ≥ 0, f(·)} D_F(D ‖ C(f(D) ⊙ T)) = min_{C ≥ 0, T ≥ 0, f(·)} ‖D − C(f(D) ⊙ T)‖_F^2    (12)

Like bDNMF, we solve sDNMF by first generating f(D) by a deep model, which formulates problem (12) as an alternating least squares optimization problem. As we can see, when f(D) is given, problem (12) satisfies the following first-order KKT optimality conditions:

    C ≥ 0, T ≥ 0,
    ∂D_F(D ‖ C(f(D) ⊙ T)) / ∂C ≥ 0,  C ⊙ ∂D_F(D ‖ C(f(D) ⊙ T)) / ∂C = 0,
    ∂D_F(D ‖ C(f(D) ⊙ T)) / ∂T ≥ 0,  T ⊙ ∂D_F(D ‖ C(f(D) ⊙ T)) / ∂T = 0    (13)

which guarantees that the solution of (12) converges to a stationary point.

Let U and V denote the Lagrange multipliers of C and T respectively. Then, minimizing (12) is equivalent to minimizing the Lagrangian J:

    J = D_F(D ‖ C(f(D) ⊙ T)) + Tr(UC^T) + Tr(VT^T)    (14)

Taking the partial derivatives of (14) derives

    ∂J/∂C = 2 C(f(D) ⊙ T)(f(D) ⊙ T)^T − 2 D(f(D) ⊙ T)^T + U    (15)

    ∂J/∂T = 2 (C^T C(f(D) ⊙ T)) ⊙ f(D) − 2 (C^T D) ⊙ f(D) + V    (16)

Combining with the KKT conditions, we obtain the update rules:

    [T]^{(t+1)}_{ij} ← [T]^{(t)}_{ij} [(C^T D) ⊙ f(D)]_{ij} / [(C^T C(f(D) ⊙ T)) ⊙ f(D)]_{ij}    (17)

    [C]^{(t+1)}_{ij} ← [C]^{(t)}_{ij} [D(f(D) ⊙ T)^T]_{ij} / [C(f(D) ⊙ T)(f(D) ⊙ T)^T]_{ij}    (18)

Algorithm 2: sDNMF.
  Input: Text corpus D, the number of topics K, hyperparameters δ ≥ 0 and M ≥ 0.
  Output: C^{(t)}, f(D).
  Initialize: topic-word distribution C^{(0)}, weight matrix T^{(0)}, t ← 0;
  Construct a document-topic distribution f(D) by deep unsupervised learning methods;
  repeat
    Calculate T^{(t)} by (17);
    Calculate C^{(t)} by (18);
    t ← t + 1;
  until convergence;

sDNMF is summarized in Algorithm 2. It promotes the effectiveness of C by introducing the internal variable T to bridge the gap between f(D) and C.
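A corresponding NumPy sketch of the alternating updates (17) and (18), under the same assumptions as the bDNMF sketch above (f(D) given as a nonnegative K x N matrix; names and initialization are ours):

```python
import numpy as np

def sdnmf(D, fD, n_iter=200, eps=1e-12, seed=0):
    """sDNMF: W = f(D) (element-wise) T, alternating updates (17)-(18)."""
    rng = np.random.default_rng(seed)
    (V, N), K = D.shape, fD.shape[0]
    C = rng.random((V, K)) + eps
    T = rng.random((K, N)) + eps
    for _ in range(n_iter):
        M = fD * T                                           # f(D) masked T
        T *= ((C.T @ D) * fD) / ((C.T @ C @ M) * fD + eps)   # update (17)
        M = fD * T
        C *= (D @ M.T) / (C @ (M @ M.T) + eps)               # update (18)
    return C, fD * T
```

Wherever f(D) is zero, the mask keeps the corresponding entry of W at zero, which is how the intrinsic geometry of the deep representation is imposed on the topic-document matrix.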
bDNMF and sDNMF intrinsically assume that each document contains only one topic, which may not be true. To overcome this weakness of bDNMF and sDNMF, we propose cDNMF, which introduces f(D) as a regularization on W instead of masking W by f(D) directly. Specifically, we implement the discriminator g(W, f(D)) as a real-valued regression response of the semantic topic labels f(D):

    min_{T ∈ R^{K×K}, W ≥ 0} ‖f(D) − TW‖_F^2    (19)

where T denotes a linear transform of W. To further constrain the word-topic matrix C toward highly meaningful topic words, we propose a word-word affinity regularization Ω(C):

    Ω(C) = ‖CC^T − DD^T‖_F^2    (20)

which encodes the word-word semantics from the shared knowledge between the documents into C. To our knowledge, this is the first time that such a regularization is introduced to NMF topic modeling.

Substituting (19) and (20) into the DNMF framework derives the objective of cDNMF:

    min_{T ≥ 0, C ≥ 0, W ≥ 0, f(·)} D_F(D ‖ CW, T, f(D))    (21)

where

    D_F(D ‖ CW, T, f(D)) = ‖D − CW‖_F^2 + λ_1 ‖f(D) − TW‖_F^2 + λ_2 Ω(C)    (22)

with λ_1 and λ_2 as two hyperparameters.

Like bDNMF, we solve cDNMF by first obtaining f(D) from a deep model, and taking f(D) as a constant of (21). Then, we optimize (21) for C, W, and T by the alternating least squares optimization algorithm. When f(D) is given, problem (21) satisfies the following first-order KKT optimality conditions:

    C ≥ 0, W ≥ 0, T ≥ 0,
    ∂D_F(D ‖ CW, T, f(D)) / ∂C ≥ 0,  C ⊙ ∂D_F(D ‖ CW, T, f(D)) / ∂C = 0,
    ∂D_F(D ‖ CW, T, f(D)) / ∂W ≥ 0,  W ⊙ ∂D_F(D ‖ CW, T, f(D)) / ∂W = 0,
    ∂D_F(D ‖ CW, T, f(D)) / ∂T ≥ 0,  T ⊙ ∂D_F(D ‖ CW, T, f(D)) / ∂T = 0    (23)

which guarantees that the optimization of (21) converges to a stationary point. Let Ψ, Q, and P be the Lagrange multipliers of the constraints C ≥ 0, W ≥ 0 and T ≥ 0, respectively. The Lagrangian J of (21) is

    J = Tr(DD^T) − 2 Tr(DW^T C^T) + Tr(CWW^T C^T) + λ_1 Tr(f(D) f(D)^T) − 2 λ_1 Tr(f(D) W^T T^T) + λ_1 Tr(TWW^T T^T) + λ_2 Tr(DD^T DD^T) − 2 λ_2 Tr(DD^T CC^T) + λ_2 Tr(CC^T CC^T) + Tr(CΨ) + Tr(WQ) + Tr(TP)    (24)

The partial derivatives of J with respect to C, W, and T are

    ∂J/∂C = −2 DW^T + 2 CWW^T − 4 λ_2 DD^T C + 4 λ_2 CC^T C + Ψ    (25)

    ∂J/∂W = −2 C^T D + 2 C^T CW − 2 λ_1 T^T f(D) + 2 λ_1 T^T TW + Q    (26)

    ∂J/∂T = −2 λ_1 f(D) W^T + 2 λ_1 TWW^T + P    (27)

Using the KKT conditions C ⊙ Ψ = 0, W ⊙ Q = 0 and T ⊙ P = 0, we get the following update rules:

    [W]^{(t+1)}_{ij} ← [W]^{(t)}_{ij} ([C^T D]_{ij} + λ_1 [T^T f(D)]_{ij}) / ([C^T CW]_{ij} + λ_1 [T^T TW]_{ij})    (28)

    [C]^{(t+1)}_{ij} ← [C]^{(t)}_{ij} ([DW^T]_{ij} + 2 λ_2 [DD^T C]_{ij}) / ([CWW^T]_{ij} + 2 λ_2 [CC^T C]_{ij})    (29)

    [T]^{(t+1)}_{ij} ← [T]^{(t)}_{ij} [f(D) W^T]_{ij} / [TWW^T]_{ij}    (30)

Algorithm 3: cDNMF.
  Input: Text corpus D, the number of topics K, hyperparameters δ ≥ 0, M ≥ 0, λ_1 ≥ 0, and λ_2 ≥ 0.
  Output: C^{(t)}, f(D).
  Initialize: topic-word distribution C^{(0)}, topic-document distribution W^{(0)}, weight matrix T^{(0)}, t ← 0;
  Construct a document-topic distribution f(D) by deep unsupervised learning methods;
  repeat
    Calculate W^{(t)} by (28);
    Calculate C^{(t)} by (29);
    Calculate T^{(t)} by (30);
    t ← t + 1;
  until convergence;

cDNMF is summarized in Algorithm 3. Its merit over bDNMF and sDNMF is that cDNMF avoids the assumption that each document contains only one topic. However, it has two tunable hyperparameters, and there is no principled way to tune hyperparameters in unsupervised topic modeling. To remedy this weakness, we take the document clustering result f(D) as pseudo labels for tuning the hyperparameters.
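A NumPy sketch of Algorithm 3, under the same assumptions as above (f(D) given; names, initialization, and iteration budget are ours). Note that the DD^T term is V x V, which the paper's motivation suggests is affordable because f(D) is computed only once; here it is simply precomputed:

```python
import numpy as np

def cdnmf(D, fD, lam1=1.0, lam2=1.0, n_iter=200, eps=1e-12, seed=0):
    """cDNMF: alternate the multiplicative updates (28)-(30)."""
    rng = np.random.default_rng(seed)
    V, N = D.shape
    K = fD.shape[0]
    C = rng.random((V, K)) + eps
    W = rng.random((K, N)) + eps
    T = rng.random((K, K)) + eps
    DDt = D @ D.T                    # V x V, precomputed for the Ω(C) term
    for _ in range(n_iter):
        W *= (C.T @ D + lam1 * T.T @ fD) / \
             (C.T @ C @ W + lam1 * T.T @ T @ W + eps)            # (28)
        C *= (D @ W.T + 2 * lam2 * DDt @ C) / \
             (C @ (W @ W.T) + 2 * lam2 * C @ (C.T @ C) + eps)    # (29)
        T *= (fD @ W.T) / (T @ (W @ W.T) + eps)                  # (30)
    return C, W, T
```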
2.3 Multilayer bootstrap network

In Section 1.1, we have summarized the recent progress of unsupervised deep learning methods for document clustering. To our knowledge, the advantage of deep learning based document clustering over conventional document clustering methods is not apparent in general. In this section, we apply a novel unsupervised deep learning based document clustering method, named the multilayer bootstrap network (MBN), to address this issue.

MBN consists of L gradually narrowed hidden layers from bottom up. Each hidden layer consists of M k-centroids clusterings (M ≫ 1), where the parameter k at the l-th layer is denoted by k_l, l = 1, ..., L. Each k_l-centroids clustering has k_l output units, each of which indicates one cluster. The output layer is linear-kernel-based spectral clustering [55]. We take the output of the spectral clustering as f(D).

MBN is trained simply by stacking. To train the l-th layer, we simply train each k_l-centroids clustering as follows:

• Random sampling of input. The first step randomly selects k_l documents from X^{(l−1)} = [x_1^{(l−1)}, ..., x_N^{(l−1)}] as the k_l centroids of the clustering. If l = 1, then X^{(0)} = D.
• One-nearest-neighbor learning. The second step assigns an input document x^{(l−1)} to one of the k_l clusters by one-nearest-neighbor learning, and outputs a k_l-dimensional indicator vector h = [h_1, ..., h_{k_l}]^T, which is a one-hot sparse vector indicating the nearest centroid to x^{(l−1)}.

The output units of all k_l-centroids clusterings are concatenated as the input of their upper layer, i.e. x^{(l)} = [h_1^T, ..., h_M^T]^T. We use the cosine similarity to evaluate the similarity between the input and the centroids in all layers. As described in [56], each layer of MBN is a histogram-based nonparametric density estimator, which does not make model assumptions on data; the hierarchical structure of MBN captures the nonlinearity of documents by implicitly building a vast number of hierarchical trees on the TF-IDF feature space.

The network structure of MBN is important to its effectiveness. First of all, we should set the hyperparameter M to a large number, which guarantees the high estimation accuracy of MBN at each layer. Then, to maintain the tree structure and discriminability of MBN, we should set {k_l}_{l=1}^L carefully by the following criteria:

    k_1 = ⌊N/2⌋,  k_l = ⌊δ k_{l−1}⌋    (31)

    k_L is set to guarantee that at least one document per class is chosen by a random sample in probability    (32)

where δ ∈ (0, 1) is a user-defined hyperparameter with 0.5 as the default. As analyzed in [56], the hyperparameter δ controls how aggressively the nonlinearity of data is reduced. If the data is highly nonlinear, then we set δ to a large number, which results in a very deep architecture; otherwise, we set δ to a small number. MBN is relatively sensitive to the selection of δ. As will be shown in the experiments, setting δ = 0.5 is safe, though tuning δ may lead to better performance.

The criterion (32) guarantees that each k_L-centroids clustering is a valid one in probability. Specifically, for any k_L-centroids clustering, if its centroids do not contain any document of a topic, then its output representation has no discriminability for that topic. To understand this point, consider an extreme case: if k_L = 1, then the top hidden layer of MBN outputs the same representation for all documents. In practice, we implement (32) by:

    k_L ≈ ⌈N/N_z⌉ if D is strongly class imbalanced; K otherwise    (33)

where N_z is the size of the smallest topic. If N_z is unknown, we simply set k_L to a number that is significantly larger than the number of topics, e.g. 300 or so.
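A simplified NumPy sketch of one MBN layer and the stacking schedule (31); the spectral clustering output layer is omitted, and the function names and the stopping threshold k_last are our own simplifications of the full method in [56]:

```python
import numpy as np

def mbn_layer(X, k, M=400, seed=0):
    """One MBN hidden layer: M independent k-centroids clusterings.

    X: N x d data matrix (rows are documents). Returns an N x (M*k)
    sparse binary representation of one-hot nearest-centroid indicators.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)  # cosine
    out = []
    for _ in range(M):
        idx = rng.choice(N, size=k, replace=False)  # random resampling
        sim = Xn @ Xn[idx].T                        # N x k similarities
        H = np.zeros((N, k))
        H[np.arange(N), sim.argmax(axis=1)] = 1.0   # one-nearest-neighbor
        out.append(H)
    return np.concatenate(out, axis=1)              # stacking input

def mbn(D, delta=0.5, k_last=300, M=400, seed=0):
    """Stack layers with k_1 = N/2 and k_l = delta * k_{l-1}."""
    X = D.T                       # documents as rows (e.g., TF-IDF)
    k, layer = X.shape[0] // 2, 0
    while k >= k_last:
        X = mbn_layer(X, k, M=M, seed=seed + layer)
        k = int(delta * k)
        layer += 1
    return X   # feed to linear-kernel spectral clustering to obtain f(D)
```

Note the output width M*k can be large; the real representation is extremely sparse (one nonzero per clustering), so a sparse matrix type would be used in practice.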
The DNMF variants are new in the NMF literature even without the deep model. First, the structured NMF component of sDNMF is different from existing structured NMF models. For example, nonsmooth NMF [57] incorporates a smoothing factor to make the basis matrix and the coefficient matrix (i.e. the word-topic matrix and the topic-document matrix, respectively, in topic modeling) sparse, and reconciles the contradiction between approximation and sparseness. Some other structured NMF methods [58], [59] adopt a global centroid for each basis vector to capture the manifold structure. However, sDNMF takes the sparse representation of documents as a mask of the topic-document matrix. Second, although it is common to add regularization terms into the objective function of NMF, we did not observe the term (20) in the NMF literature. Although a form similar to (19) has been proposed in [60] for hyperspectral unmixing, it learns the representation of data by a shallow model. Therefore, the objective function of cDNMF is fundamentally new to our knowledge.

Because sDNMF and cDNMF are non-convex optimization problems, we take the alternating iterative optimization algorithm to solve them. The convergence of the algorithm is guaranteed by the following theorem:

Theorem 1. The objective values of sDNMF and cDNMF decrease monotonically and converge to a stationary point.

Proof. See Appendix A for the proof of Theorem 1, where we take cDNMF as an example. The proof applies to sDNMF too, whose objective value is non-increasing under the update rules (17) and (18).
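Theorem 1 can also be checked numerically. The sketch below runs the cDNMF updates (28)-(30) on a small random instance and reports whether the objective (22) is non-increasing (the instance sizes and tolerance are arbitrary choices of ours, not from the paper):

```python
import numpy as np

def cdnmf_objective(D, fD, C, W, T, lam1, lam2):
    """Objective (22)."""
    return (np.linalg.norm(D - C @ W, "fro") ** 2
            + lam1 * np.linalg.norm(fD - T @ W, "fro") ** 2
            + lam2 * np.linalg.norm(C @ C.T - D @ D.T, "fro") ** 2)

rng = np.random.default_rng(0)
V, N, K, lam1, lam2, eps = 30, 20, 4, 1.0, 1.0, 1e-12
D, fD = rng.random((V, N)), rng.random((K, N))
C, W, T = rng.random((V, K)), rng.random((K, N)), rng.random((K, K))
DDt = D @ D.T
hist = []
for t in range(100):
    W *= (C.T @ D + lam1 * T.T @ fD) / (C.T @ C @ W + lam1 * T.T @ T @ W + eps)
    C *= (D @ W.T + 2 * lam2 * DDt @ C) / (C @ (W @ W.T) + 2 * lam2 * C @ (C.T @ C) + eps)
    T *= (fD @ W.T) / (T @ (W @ W.T) + eps)
    hist.append(cdnmf_objective(D, fD, C, W, T, lam1, lam2))
print("non-increasing:", all(a >= b - 1e-8 for a, b in zip(hist, hist[1:])))
```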
3 EXPERIMENTS
In this section, we compare the proposed DNMF with nine topic modeling methods on three benchmark text datasets.
The hyperparameters of DNMF in all experiments were set as follows: M = 400, δ = 0.5, λ_1 = 1, and λ_2 = 1, unless otherwise stated. We compared DNMF with four probabilistic models [1], [20], [21], [22], four NMF methods [3], [4], [5], [23], and one deep learning based topic model [24] with their optimal hyperparameter settings. They are listed as follows:

• Probabilistic latent semantic indexing (PLSI) [20].
• Latent Dirichlet allocation (LDA) [1].
• Laplacian probabilistic latent semantic indexing (LapPLSI) [21].
• Locally-consistent topic modeling (LTM) [22].
• Successive projection algorithm (SPA) [3].
• Successive nonnegative projection algorithm (SNPA) [5].
• A fast conical hull algorithm (XRAY) [4].
• Anchor-free correlated topic modeling (AnchorFree) [23].
• Deep Poisson factor modeling (DPFA) [24]: it is a deep learning based topic model built on the Dirichlet process. We set its DNN to a depth of two hidden layers, and set the numbers of hidden units of the two hidden layers to K and ⌈K/2⌉ respectively. We used the output from the first hidden layer for clustering. The above setting results in the best performance.

We evaluated the comparison results in terms of clustering accuracy (ACC), coherence, and similarity count (SimCount). Clustering accuracy applies the Hungarian algorithm to solve the permutation problem of predicted labels. Coherence evaluates the quality of a single topic by finding how many topic words belonging to the topic appear across the documents of the topic [50]:

    Coh(ν) = Σ_{v_1, v_2 ∈ ν} log ((freq(v_1, v_2) + ε) / freq(v_1))    (34)

where v_1 and v_2 denote two words in the vocabulary, freq(v_1, v_2) denotes the number of the documents where v_1 and v_2 co-appear, freq(v_1) denotes the number of the documents containing v_1, and ε is a small positive constant used to prevent the input of the logarithm from being zero. The higher the clustering accuracy or coherence score is, the better the topic model is. Because the coherence measurement does not evaluate the redundancy of a topic, we used the similarity count to measure the similarity between the topics. For each topic, the similarity count is obtained simply by counting the number of the overlapped words in the leading K words. The lower the similarity count score is, the better the topic model is.
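The three metrics are straightforward to compute. A sketch follows; the helper names, the binary document-word input, the ε value of 0.01 (the paper's exact ε is not recoverable from our copy), and the use of SciPy's Hungarian solver are all assumptions of ours:

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linear_sum_assignment

def coherence(docs_bin, topic_words, eps=1e-2):
    """Topic coherence (34). docs_bin: N x V binary presence matrix;
    topic_words: indices of the leading words of one topic."""
    score = 0.0
    for v1, v2 in combinations(topic_words, 2):
        co = np.sum(docs_bin[:, v1] * docs_bin[:, v2])   # freq(v1, v2)
        score += np.log((co + eps) / (np.sum(docs_bin[:, v1]) + 1e-12))
    return score

def similarity_count(topics):
    """Overlapped words between the leading-word lists of all topic pairs."""
    return sum(len(set(a) & set(b)) for a, b in combinations(topics, 2))

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one label matching via the Hungarian algorithm."""
    K = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((K, K))
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize matched pairs
    return cost[rows, cols].sum() / len(y_true)
```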
Fig. 2. Visualizations of 20-newsgroups produced by DNMF.

TABLE 1
Topic words discovered by bDNMF and AnchorFree on a 5-topic subset of the TDT2 corpus. The topic words in bold denote overlapped words between topics.

AnchorFree:
  Topic 1: netanyahu, israeli, israel, palestinian, peace, arafat, palestinians, albright, benjamin, west, talks, bank, prime, london, minister, yasser, ross, withdrawal, madeleine, 13
  Topic 2: asian, asia, economic, financial, percent, economy, market, stock, crisis, markets, stocks, currency, prices, dollar, investors, index, billion, bank, growth, indonesia
  Topic 3: bowl, super, broncos, denver, packers, bay, green, football, game, san, elway, diego, xxxii, nfl, quarterback, sports, play, yards, favre, pittsburgh
  Topic 4: tornadoes, florida, central, storms, ripped, victims, tornado, homes, killed, people, damage, twisters, nino, el, deadly, storm, counties, weather, funerals, toll
  Topic 5: economic, indonesia, asian, financial, imf, economy, crisis, asia, monetary, currency, billion, fund, percent, international, government, bank, korea, south, indonesian, suharto

bDNMF:
  Topic 1: netanyahu, israeli, israel, palestinian, peace, albright, arafat, palestinians, talks, west, benjamin, madeleine, london, ross, withdrawal, process, prime, yasser, secretary, 13
  Topic 2: asian, percent, indonesia, asia, economy, financial, market, stock, economic, billion, crisis, imf, japan, spkr, currency, markets, dollar, south, government, prices
  Topic 3: bowl, super, broncos, denver, packers, green, game, bay, football, elway, san, team, sports, diego, coach, play, win, teams, season, fans
  Topic 4: florida, tornadoes, tornado, storms, killed, victims, damage, homes, ripped, nino, el, weather, twisters, storm, rain, stories, deadly, struck, residents, california
  Topic 5: nigeria, abacha, military, police, nigerian, opposition, nigerias, anti, elections, arrested, lagos, democracy, sani, civilian, protest, protests, presidential, abachas, violent, nigerians

We list the top 20 topic words of a 5-topic modeling problem as an example in Table 1. From the table, we see that the topic words of the second and fifth topics produced by AnchorFree have an overlap of over 50%.
Some informative topic words discovered by bDNMF, such as the words on anti-government activities or violence in the fifth topic, were not detected by AnchorFree. The above phenomena are observed in the other experiments too. We conjecture that these advantages are caused by the fact that bDNMF not only avoids making additional assumptions but also benefits from the high clustering accuracy of the deep model. We show the latent representation of the documents in 20-newsgroups learned by MBN in Fig. 2. From the figure, we see that the latent representation has strong discriminability, which may lead to the high performance of DNMF.

TABLE 2
Performance of the comparison algorithms on 20-newsgroups.

Table 2 shows the comparison results on the 20-newsgroups corpus. From the table, we see that the DNMF variants achieve the highest clustering accuracy among the comparison methods. For example, the clustering accuracy of DNMF is higher than that of the runner-up method, i.e. LTM, by a large absolute margin when the number of topics is 20 or between 4 and 15, and is at least higher than the latter in the other cases. Particularly, DNMF is significantly better than the NMF methods. The relative improvement of DNMF over NMF tends to be enlarged when the number of topics increases, which demonstrates the effectiveness of the deep architecture of DNMF. In addition, the single-topic quality produced by sDNMF ranks third in terms of coherence, behind AnchorFree and DPFA. The similarity count scores produced by sDNMF and cDNMF rank behind only LTM and ahead of the other comparison methods, which indicates that DNMF is able to generate less overlapped topic words than the comparison methods except the probabilistic model LTM.

Table 3 shows the results on the TDT2 corpus. From the table, we see that the DNMF variants obtain the best performance in terms of clustering accuracy and similarity count, particularly when the number of topics is below 10. bDNMF and sDNMF outperform the comparison algorithms in terms of all three evaluation metrics, which further demonstrates the advantage of the DNMF framework.
TABLE 3
Performance of the comparison algorithms on TDT2.
TABLE 4
Performance of the comparison algorithms on Reuters-21578.
Although LapPLSI yields clustering accuracy competitive with DNMF, its performance in coherence and similarity count is significantly lower than that of DNMF. Although AnchorFree reaches a higher coherence rank than cDNMF, its similarity count scores are much higher than those of cDNMF.

Table 4 shows the performance of the comparison methods on the Reuters-21578 corpus. From the table, we see that the DNMF variants reach the highest clustering accuracy. Although it seems that they do not reach the top performance in terms of coherence or similarity count alone, they balance the coherence and the similarity count, which evaluate two contradicting aspects of a topic model.
TABLE 5
Average ranks of the comparison methods on all three data sets. The "Coh.+SimCount" ranking list is the average of the lists in coherence and similarity count. The "overall" ranking list is the average of the lists in the three evaluation metrics.
Method       ACC     Coherence   SimCount   Coh.+SimCount
PLSI         10.58   8.33        7.36       7.85
LDA           8.75   6.27        7.76       7.02
LapPLSI       6.42   8.76       10.70       9.73
LTM           6.43   9.65        3.76       6.71
SPA           8.62   5.60        6.06       5.83
SNPA          8.71   5.60        6.06       5.83
XRAY         10.30   5.85        8.37       7.11
AnchorFree    5.05   3.19        6.75       4.97
DPFA          6.95   8.02       10.64       9.33
bDNMF         1.06   5.27        4.42       4.85
sDNMF         1.06   3.12        2.52       2.82
cDNMF         1.06   7.34        2.61       4.98
Fig. 3. Performance of cDNMF with respect to hyperparameter λ_1 in terms of coherence and similarity count.

Fig. 4. Performance of cDNMF with respect to hyperparameter λ_2 in terms of coherence and similarity count.
Fig. 5. Clustering accuracy of DNMF with respect to hyperparameters δ and M.

For example, although the coherence of sDNMF ranks behind XRAY and SPA, its similarity count is much better than that of the latter two. Although the similarity count of sDNMF ranks behind PLSI, its coherence is higher than that of PLSI as well. If we average the coherence and similarity count ranking lists, it is clear that sDNMF performs the best.

We summarize the ranking lists of the comparison methods on the three corpora in Table 5. From the overall ranking list in the table, we see that (i) the DNMF variants perform the best generally, followed by AnchorFree and LTM, and (ii) sDNMF performs the best among the three DNMF variants. If we take a look at the average ranking list over coherence and similarity count, we find that sDNMF reaches the top performance, while bDNMF and cDNMF behave similarly to AnchorFree, a recent advanced NMF method that avoids the anchor-word assumption.

Fig. 6. Performance of the DNMF variants with respect to hyperparameter δ on 20-newsgroups in terms of coherence and similarity count.

Fig. 7. Performance of the DNMF variants with respect to hyperparameter δ on TDT2 in terms of coherence and similarity count.

Fig. 8. Performance of the DNMF variants with respect to hyperparameter δ on Reuters-21578 in terms of coherence and similarity count.

To study how the hyperparameters of DNMF affect the performance, we searched the hyperparameters in grid. To avoid an exhaustive search, when we studied one hyperparameter, we fixed the other hyperparameters to their default values. We first studied the two regularization hyperparameters of cDNMF, λ_1 and λ_2, in terms of coherence and similarity count by searching the two hyperparameters in grid from 0.1 to 0.9. The results are shown in Figs. 3 and 4. From the figures, we see that cDNMF is insensitive to the two hyperparameters.

Then, we studied the hyperparameters δ and M of the deep model in DNMF in terms of all three evaluation metrics, in which δ was searched from 0.1 to 0.9 and M was searched from 10 to 400. Figure 5 shows the clustering accuracy of DNMF with respect to the two hyperparameters. Figures 6 to 8 show the coherence and similarity count of the DNMF variants with respect to δ on the three corpora respectively. Figures 9 to 11 show the coherence and similarity count of the DNMF variants with respect to M on the three corpora respectively. From the figures, we see that although the DNMF variants are sensitive to δ and M, we can clearly find regularities. For the hyperparameter δ, we observe from Fig. 5a and Figs. 6 to 8 that, when δ is set around the default value 0.5, all DNMF variants approach the top performance in all cases. For the hyperparameter M, we see from Fig. 5b that enlarging M clearly improves the clustering accuracy of all DNMF variants.
Fig. 9. Performance of the DNMF variants with respect to hyperparameter M on 20-newsgroups in terms of coherence and similarity count.

Fig. 10. Performance of the DNMF variants with respect to hyperparameter M on TDT2 in terms of coherence and similarity count.

Fig. 11. Performance of the DNMF variants with respect to hyperparameter M on Reuters-21578 in terms of coherence and similarity count.

From Figs. 9a, 10a, and 11a, we see that the coherence of all DNMF variants generally improves along with the increase of M in all cases, except that the performance of bDNMF and sDNMF gets worse on 20-newsgroups. From Figs. 9b, 10b, and 11b, we see that the similarity count scores of all DNMF variants generally get smaller along with the increase of M on 20-newsgroups and TDT2. It is interesting to observe that the similarity count scores of the DNMF variants first get larger and then smaller along with the increase of M on Reuters-21578, with a peak at M = 100. Nonetheless, the DNMF variants approach the lowest similarity count scores at M = 400 in all cases. We can imagine that, when we set M larger than 400, the performance may be further improved at the expense of higher computational complexity. To balance the performance and the computational complexity, it is reasonable to set M = 400.

4 CONCLUSIONS
In this paper, we have proposed a deep NMF topic modeling framework and evaluated its effectiveness with three implementations. To our knowledge, this is the first deep NMF topic modeling framework. The novelty of DNMF lies in the following aspects. First, we proposed a novel unsupervised deep NMF framework that is fundamentally different from existing deep learning based topic modeling methods. It takes the unsupervised deep learning as a constraint of NMF. It is a general framework that can incorporate many types of deep models and NMF methods. To evaluate its effectiveness, we implemented three DNMF algorithms, denoted as bDNMF, sDNMF, and cDNMF. bDNMF takes the sparse output of the deep model as the topic-document matrix directly, which formulates bDNMF as a supervised regression problem with a nonnegative constraint on the word-topic matrix. sDNMF takes the output of the deep model as a mask of the topic-document matrix, and solves the NMF problem by alternating iterative optimization, which relaxes the strong constraint on the topic-document matrix in bDNMF. cDNMF takes the output of the deep model as a regularization, which further relaxes the constraint on the topic-document matrix. To our knowledge, the regularization terms in cDNMF are novel. Finally, we applied multilayer bootstrap networks for document clustering. It reaches the state-of-the-art performance given the high-dimensional sparse TF-IDF statistics of the documents, which further boosts the overall performance of the DNMF implementations. We have conducted an extensive experimental comparison with 9 representative comparison methods covering probabilistic topic models, NMF topic models, and deep topic modeling on three benchmark datasets—20-newsgroups, TDT2, and Reuters-21578. Experimental results show that the proposed DNMF variants outperform the comparison methods significantly in terms of clustering accuracy, coherence, and similarity count. Moreover, although the DNMF variants are relatively sensitive to the hyperparameter δ, we always find a robust working range across the corpora, which demonstrates the robustness of the DNMF variants in real-world applications.

REFERENCES
Journal of machine Learning research , vol. 3,no. Jan, pp. 993–1022, 2003.[2] Jaegul Choo, Changhyun Lee, Chandan K Reddy, and HaesunPark, “Utopian: User-driven topic modeling based on interactivenonnegative matrix factorization,”
IEEE transactions on visualiza-tion and computer graphics , vol. 19, no. 12, pp. 1992–2001, 2013.[3] Nicolas Gillis and Stephen A Vavasis, “Fast and robust recursivealgorithmsfor separable nonnegative matrix factorization,”
IEEEtransactions on pattern analysis and machine intelligence , vol. 36, no.4, pp. 698–714, 2013.[4] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur,“Fast conical hull algorithms for near-separable non-negative ma-trix factorization,” in
International Conference on Machine Learning ,2013, pp. 231–239.[5] Nicolas Gillis, “Successive nonnegative projection algorithm forrobust nonnegative blind source separation,”
SIAM Journal onImaging Sciences , vol. 7, no. 2, pp. 1420–1450, 2014.[6] Xiao Fu, Wing-Kin Ma, Tsung-Han Chan, and Jos´e M Bioucas-Dias, “Self-dictionary sparse regression for hyperspectral unmix-ing: Greedy pursuit and pure pixel search are related,”
IEEEJournal of Selected Topics in Signal Processing , vol. 9, no. 6, pp. 1128–1141, 2015.[7] Y. Chen, J. Wu, J. Lin, R. Liu, H. Zhang, and Z. Ye, “Affinityregularized non-negative matrix factorization for lifelong topicmodeling,”
IEEE Transactions on Knowledge and Data Engineering ,pp. 1–1, 2019.[8] Xuelong Li, Guosheng Cui, and Yongsheng Dong, “Graph reg-ularized non-negative low-rank matrix factorization for imageclustering,”
IEEE transactions on cybernetics , vol. 47, no. 11, pp.3840–3853, 2016.[9] Thomas L. Griffiths, Michael I. Jordan, Joshua B. Tenenbaum, andDavid M. Blei, “Hierarchical topic models and the nested chineserestaurant process,” in
Advances in Neural Information Processing Systems 16 , S. Thrun, L. K. Saul, and B. Sch¨olkopf, Eds., pp. 17–24.MIT Press, 2004.[10] Jen-Tzung Chien, “Hierarchical theme and topic modeling,”
IEEEtransactions on neural networks and learning systems , vol. 27, no. 3,pp. 565–578, 2015.[11] Daniel Ramage, Christopher D Manning, and Susan Dumais,“Partially labeled topic models for interpretable text mining,” in
Proceedings of the 17th ACM SIGKDD international conference onKnowledge discovery and data mining , 2011, pp. 457–465.[12] Finale Doshi-Velez and Been Kim, “Towards a rigorous science ofinterpretable machine learning,” arXiv preprint arXiv:1702.08608 ,2017.[13] Y. Zheng, Y. Zhang, and H. Larochelle, “A deep and autoregressiveapproach for topic modeling of multimodal data,”
IEEE Transac-tions on Pattern Analysis and Machine Intelligence , vol. 38, no. 6, pp.1056–1069, 2016.[14] J. Chien and C. Lee, “Deep unfolding for topic models,”
IEEETransactions on Pattern Analysis and Machine Intelligence , vol. 40, no.2, pp. 318–331, 2018.[15] George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou,and Bj¨orn W Schuller, “A deep matrix factorization method forlearning attribute representations,”
IEEE transactions on patternanalysis and machine intelligence , vol. 39, no. 3, pp. 417–429, 2016.[16] Linlin Zong, Xianchao Zhang, Long Zhao, Hong Yu, and QianliZhao, “Multi-view clustering via multi-manifold regularized non-negative matrix factorization,”
Neural Networks , vol. 88, pp. 74 –89, 2017.[17] Pengtao Xie and Eric P. Xing, “Integrating document clusteringand topic modeling,” in
Proceedings of the Twenty-Ninth Conferenceon Uncertainty in Artificial Intelligence , Arlington, Virginia, USA,2013, UAI13, p. 694C703, AUAI Press.[18] Yu-Xiong Wang and Yu-Jin Zhang, “Nonnegative matrix factor-ization: A comprehensive review,”
IEEE Transactions on Knowledgeand Data Engineering , vol. 25, no. 6, pp. 1336–1353, 2012.[19] Junyuan Xie, Ross Girshick, and Ali Farhadi, “Unsupervised deepembedding for clustering analysis,” in
International conference onmachine learning , 2016, pp. 478–487.[20] Christos H Papadimitriou, Prabhakar Raghavan, Hisao Tamaki,and Santosh Vempala, “Latent semantic indexing: A probabilisticanalysis,”
Journal of Computer and System Sciences , vol. 61, no. 2,pp. 217–235, 2000.[21] Deng Cai, Qiaozhu Mei, Jiawei Han, and Chengxiang Zhai, “Mod-eling hidden topics on document manifold,” in
Proceedings ofthe 17th ACM conference on Information and knowledge management .ACM, 2008, pp. 911–920.[22] Deng Cai, Xuanhui Wang, and Xiaofei He, “Probabilistic dyadicdata analysis with local and global consistency,” in
Proceedings ofthe 26th annual international conference on machine learning . ACM,2009, pp. 105–112.[23] Xiao Fu, Kejun Huang, Nicholas D Sidiropoulos, Qingjiang Shi,and Mingyi Hong, “Anchor-free correlated topic modeling,”
IEEEtransactions on pattern analysis and machine intelligence , vol. 41, no.5, pp. 1056–1071, 2018.[24] Ricardo Henao, Zhe Gan, James Lu, and Lawrence Carin, “Deeppoisson factor modeling,” in
Advances in Neural Information Pro-cessing Systems , 2015, pp. 2800–2808.[25] Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei,“Sharing clusters among related groups: Hierarchical dirichletprocesses,” in
Advances in neural information processing systems ,2005, pp. 1385–1392.[26] Weifeng Li, Junming Yin, and Hsinchsun Chen, “Supervisedtopic modeling using hierarchical dirichlet process-based inverseregression: Experiments on e-commerce applications,”
IEEE Trans-actions on Knowledge and Data Engineering , vol. 30, no. 6, pp. 1192–1205, 2017.[27] John Paisley, Chong Wang, David M Blei, and Michael I Jordan,“Nested hierarchical dirichlet processes,”
IEEE Transactions onPattern Analysis and Machine Intelligence , vol. 37, no. 2, pp. 256–270, 2014.[28] David M Blei, Thomas L Griffiths, and Michael I Jordan, “Thenested chinese restaurant process and bayesian nonparametricinference of topic hierarchies,”
Journal of the ACM (JACM) , vol.57, no. 2, pp. 7, 2010.[29] Sanjeev Arora, Rong Ge, and Ankur Moitra, “Learning topicmodels–going beyond svd,” in . IEEE, 2012, pp. 1–10.
[30] Jen-Tzung Chien and Chao-Hsi Lee, "Deep unfolding for topic models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 2, pp. 318–331, 2017.
[31] Geoffrey E. Hinton and Ruslan R. Salakhutdinov, "Replicated softmax: An undirected topic model," in Advances in Neural Information Processing Systems, 2009, pp. 1607–1614.
[32] Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann, "A network approach to topic models," Science Advances, vol. 4, no. 7, p. eaaq1360, 2018.
[33] Mikhail Yurochkin and XuanLong Nguyen, "Geometric Dirichlet means algorithm for topic inference," in Advances in Neural Information Processing Systems, 2016, pp. 2505–2513.
[34] Hugo Larochelle and Stanislas Lauly, "A neural autoregressive topic model," in Advances in Neural Information Processing Systems, 2012, pp. 2708–2716.
[35] Yin Zheng, Yu-Jin Zhang, and Hugo Larochelle, "A deep and autoregressive approach for topic modeling of multimodal data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1056–1069, 2015.
[36] Zhe Gan, Changyou Chen, Ricardo Henao, David Carlson, and Lawrence Carin, "Scalable deep Poisson factor analysis for topic modeling," in International Conference on Machine Learning, 2015, pp. 1823–1832.
[37] Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei, "Deep exponential families," in Artificial Intelligence and Statistics, 2015, pp. 762–771.
[38] Yong Ren, Yining Wang, and Jun Zhu, "Spectral learning for supervised topic models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 726–739, 2017.
[39] David Donoho and Victoria Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?," in Advances in Neural Information Processing Systems, 2004, pp. 1141–1148.
[40] Y. Zhao, H. Wang, and J. Pei, "Deep non-negative matrix factorization architecture based on underlying basis images learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019.
[41] Zechao Li, Jinhui Tang, and Tao Mei, "Deep collaborative embedding for social image understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2070–2083, 2018.
[42] J. Le Roux, J. R. Hershey, and F. Weninger, "Deep NMF for speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 66–70.
[43] S. Wisdom, T. Powers, J. Pitton, and L. Atlas, "Deep recurrent NMF for speech separation by unfolding iterative thresholding," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 254–258.
[44] Fanghua Ye, Chuan Chen, and Zibin Zheng, "Deep autoencoder-like nonnegative matrix factorization for community detection," in
Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 1393–1402.
[45] Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong, "Towards k-means-friendly spaces: Simultaneous deep learning and clustering," in International Conference on Machine Learning, 2017, pp. 3861–3870.
[46] Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier, "Deep k-means: Jointly clustering with k-means and learning representations," Pattern Recognition Letters, vol. 138, pp. 185–192, 2020.
[47] Mohammed Jabi, Marco Pedersoli, Amar Mitiche, and Ismail Ben Ayed, "Deep clustering: On the link between discriminative models and k-means," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[48] Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji, "A novel neural topic model and its supervised extension," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[49] He Zhao, Lan Du, Wray Buntine, and Mingyuan Zhou, "Inter and intra topic structure learning with word embeddings," in Proceedings of the 35th International Conference on Machine Learning, Jennifer Dy and Andreas Krause, Eds., Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018, vol. 80 of Proceedings of Machine Learning Research, pp. 5892–5901, PMLR.
[50] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu, "A practical algorithm for topic modeling with provable guarantees," in International Conference on Machine Learning, 2013, pp. 280–288.
[51] Byoungwook Jang and Alfred Hero, "Minimum volume topic modeling," in The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019), Naha, Okinawa, Japan, 16–18 April 2019, pp. 3013–3021.
[52] Mikhail Yurochkin, Zhiwei Fan, Aritra Guha, Paraschos Koutris, and XuanLong Nguyen, "Scalable inference of topic evolution via models for latent geometric structures," in Advances in Neural Information Processing Systems, 2019, pp. 5951–5961.
[53] Xiao Fu, Kejun Huang, Nicholas D. Sidiropoulos, and Wing-Kin Ma, "Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications," IEEE Signal Processing Magazine, vol. 36, no. 2, pp. 59–80, 2019.
[54] Daniel D. Lee and H. Sebastian Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[55] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems, 2002, pp. 849–856.
[56] Xiao-Lei Zhang, "Multilayer bootstrap networks," Neural Networks, vol. 103, pp. 29–43, 2018.
[57] Alberto Pascual-Montano, Jose Maria Carazo, Kieko Kochi, Dietrich Lehmann, and Roberto D. Pascual-Marqui, "Nonsmooth nonnegative matrix factorization (nsNMF)," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 403–415, 2006.
[58] Hongchang Gao, Feiping Nie, and Heng Huang, "Local centroids structured non-negative matrix factorization," in Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017), 2017.
[59] Zechao Li, Jinhui Tang, and Xiaofei He, "Robust structured nonnegative matrix factorization for image representation," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1947–1960, 2017.
[60] Jun Li, José M. Bioucas-Dias, Antonio Plaza, and Lin Liu, "Robust collaborative nonnegative matrix factorization for hyperspectral unmixing," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 10, pp. 6076–6090, 2016.
[61] Chris H. Q. Ding, Tao Li, and Michael I. Jordan, "Convex and semi-nonnegative matrix factorizations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 45–55, 2008.

APPENDIX A

Before we prove Theorem 1, we first give the definition of an upper bound auxiliary function.
Definition 1. $G(u, u')$ is an upper bound auxiliary function for $g(u)$ if the following conditions are satisfied:
$$G(u, u') \ge g(u), \qquad G(u, u) = g(u) \tag{35}$$

Corollary 1. If $G(\cdot, \cdot)$ is an upper bound auxiliary function for $g(u)$, then $g(u)$ is non-increasing under the update rule
$$u^{t+1} = \arg\min_{u} G(u, u^{t}) \tag{36}$$
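To make Definition 1 and Corollary 1 concrete, here is a minimal Python sketch (ours, not part of the paper) that minimizes a toy smooth function through a hand-built quadratic upper bound auxiliary function; the function g and the curvature bound L are illustrative assumptions.

```python
import numpy as np

def g(u):
    # toy objective with globally bounded curvature
    return np.log1p(np.exp(u)) + (u - 2.0) ** 2

def g_prime(u):
    return 1.0 / (1.0 + np.exp(-u)) + 2.0 * (u - 2.0)

L = 2.25  # g''(u) = sigmoid'(u) + 2 <= 0.25 + 2, so L upper-bounds the curvature

def G(u, u_t):
    # quadratic upper bound auxiliary function: G(u, u') >= g(u), G(u, u) = g(u)
    return g(u_t) + g_prime(u_t) * (u - u_t) + 0.5 * L * (u - u_t) ** 2

u = 10.0
for t in range(30):
    u_new = u - g_prime(u) / L             # closed-form argmin_u of G(u, u^t)
    assert G(u_new, u) >= g(u_new) - 1e-12  # Definition 1: G upper-bounds g
    assert g(u_new) <= g(u) + 1e-12         # Corollary 1: g never increases
    u = u_new
print(u, g(u))  # converges toward the minimizer of g
```

Lemma 1 below constructs an upper bound of exactly this kind for $O_C$, except that it retains third- and fourth-order terms instead of a single quadratic one.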
Proposition 1. For any matrices $A \in \mathbb{R}^{n \times n}_{+}$, $B \in \mathbb{R}^{k \times k}_{+}$, $E \in \mathbb{R}^{n \times k}_{+}$, $E' \in \mathbb{R}^{n \times k}_{+}$, with $A$ and $B$ being symmetric matrices, the following inequality holds [61]:
$$\sum_{i=1}^{n} \sum_{j=1}^{k} \frac{[A E' B]_{ij}\,[E]_{ij}^{2}}{[E']_{ij}} \ge \mathrm{Tr}(E^{T} A E B) \tag{37}$$
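Inequality (37) is easy to spot-check numerically. The following NumPy snippet (our sketch; the sizes and random draws are arbitrary) compares both sides for random nonnegative matrices with symmetric A and B.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4
A = rng.random((n, n)); A = A + A.T   # symmetric nonnegative
B = rng.random((k, k)); B = B + B.T
E = rng.random((n, k)) + 1e-3         # strictly positive, so ratios stay finite
E_p = rng.random((n, k)) + 1e-3       # E' in the notation of (37)

lhs = np.sum((A @ E_p @ B) * E**2 / E_p)
rhs = np.trace(E.T @ A @ E @ B)
assert lhs >= rhs - 1e-10             # (37) holds on every such draw
```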
Definition 2. A function can be represented as an infinite sum of terms that are calculated from the values of the function's derivatives at a single point, i.e. its Taylor expansion:
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^{n} \tag{38}$$

Given the above definitions, the objective function of cDNMF (21) is split into the following three univariate functions:
$$O_{C} = \| D - CW \|_{F}^{2} + \lambda \| CC^{T} - DD^{T} \|_{F}^{2} \tag{39}$$
$$O_{W} = \| D - CW \|_{F}^{2} + \lambda \| f(D) - TW \|_{F}^{2} \tag{40}$$
$$O_{T} = \lambda \| f(D) - TW \|_{F}^{2} \tag{41}$$
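Written as code, the three univariate objectives are direct translations of (39)–(41). In this sketch (ours), fD stands for the deep representation f(D) described in the main text, and lam is the trade-off weight λ of (21); both are assumptions about how the quantities would be supplied.

```python
import numpy as np

def O_C(D, C, W, lam):
    # (39): reconstruction error plus the constraint pulling CC^T toward DD^T
    return (np.linalg.norm(D - C @ W, 'fro') ** 2
            + lam * np.linalg.norm(C @ C.T - D @ D.T, 'fro') ** 2)

def O_W(D, fD, C, W, T, lam):
    # (40): W couples the raw documents and their deep representation f(D)
    return (np.linalg.norm(D - C @ W, 'fro') ** 2
            + lam * np.linalg.norm(fD - T @ W, 'fro') ** 2)

def O_T(fD, T, W, lam):
    # (41): only the deep-representation term depends on T
    return lam * np.linalg.norm(fD - T @ W, 'fro') ** 2
```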
Then, we have the following three lemmas.

Lemma 1. The auxiliary function for $O_{C}$ is as follows:
$$\begin{aligned} G([C]_{ij}, [C']_{ij}) = {}& O_{C} + [-2DW^{T} + 2CWW^{T} - 4\lambda DD^{T}C + 4\lambda CC^{T}C]_{ij}\,([C]_{ij} - [C']_{ij}) \\ & + \frac{1}{2}\,\frac{2[CWW^{T}]_{ij} + 4\lambda [CC^{T}C]_{ij}}{[C]_{ij}}\,([C]_{ij} - [C']_{ij})^{2} \\ & + \frac{1}{3!}\,4\lambda [C]_{ij}\,([C]_{ij} - [C']_{ij})^{3} + \frac{1}{4!}\,4\lambda\,([C]_{ij} - [C']_{ij})^{4} \end{aligned} \tag{42}$$
Proof. Since $G(C, C) = O_{C}(C)$ holds by construction, we only need to prove that $G(C, C') \ge O_{C}(C)$. The element-wise first-order partial derivative of (39) is
$$\frac{\partial O_{C}}{\partial [C]_{ij}} = [-2DW^{T} + 2CWW^{T} - 4\lambda DD^{T}C + 4\lambda CC^{T}C]_{ij} \tag{43}$$
The second-order partial derivative of (39) with respect to $C$ is
$$\frac{\partial^{2} O_{C}}{\partial [C]_{ij}^{2}} = 2[WW^{T}]_{jj} - 4\lambda [DD^{T}]_{ii} + 4\lambda [CC^{T}]_{ii} \tag{44}$$
The third-order partial derivative of (39) is
$$\frac{\partial^{3} O_{C}}{\partial [C]_{ij}^{3}} = 4\lambda [C]_{ij} \tag{45}$$
The fourth-order partial derivative of (39) is
$$\frac{\partial^{4} O_{C}}{\partial [C]_{ij}^{4}} = 4\lambda \tag{46}$$
According to the Taylor expansion (38) in Definition 2, we can rewrite (39) in its Taylor expansion form:
$$\begin{aligned} O_{C}(c_{ij}) = {}& O_{C} + \frac{\partial O_{C}}{\partial c_{ij}} (c_{ij} - [C]_{ij}) + \frac{1}{2} \frac{\partial^{2} O_{C}}{\partial c_{ij}^{2}} (c_{ij} - [C]_{ij})^{2} \\ & + \frac{1}{3!} \frac{\partial^{3} O_{C}}{\partial c_{ij}^{3}} (c_{ij} - [C]_{ij})^{3} + \frac{1}{4!} \frac{\partial^{4} O_{C}}{\partial c_{ij}^{4}} (c_{ij} - [C]_{ij})^{4} \end{aligned} \tag{47}$$
The upper bound auxiliary function for (39) is defined as
$$\begin{aligned} G([C]_{ij}, [C']_{ij}) = {}& O_{C} + \frac{\partial O_{C}}{\partial [C]_{ij}} ([C]_{ij} - [C']_{ij}) + \frac{1}{2}\,\frac{2[CWW^{T}]_{ij} + 4\lambda [CC^{T}C]_{ij}}{[C]_{ij}} ([C]_{ij} - [C']_{ij})^{2} \\ & + \frac{1}{3!} \frac{\partial^{3} O_{C}}{\partial [C]_{ij}^{3}} ([C]_{ij} - [C']_{ij})^{3} + \frac{1}{4!} \frac{\partial^{4} O_{C}}{\partial [C]_{ij}^{4}} ([C]_{ij} - [C']_{ij})^{4} \end{aligned} \tag{48}$$
Subtracting (47) from (48), we find that $G(C, C') \ge O_{C}(C)$ is equivalent to
$$\frac{1}{2}\,\frac{2[CWW^{T}]_{ij} + 4\lambda [CC^{T}C]_{ij}}{[C]_{ij}}\,([C]_{ij} - [C']_{ij})^{2} \ge \frac{1}{2} \left( 2[WW^{T}]_{jj} - 4\lambda [DD^{T}]_{ii} + 4\lambda [CC^{T}]_{ii} \right) ([C]_{ij} - [C']_{ij})^{2} \tag{49}$$
Because we have
$$\frac{[CWW^{T}]_{ij}}{[C]_{ij}} = \frac{\sum_{k} [C]_{ik} [WW^{T}]_{kj}}{[C]_{ij}} \ge \frac{[C]_{ij} [WW^{T}]_{jj}}{[C]_{ij}} = [WW^{T}]_{jj} \tag{50}$$
$$\frac{[CC^{T}C]_{ij}}{[C]_{ij}} = \frac{\sum_{k} [CC^{T}]_{ik} [C]_{kj}}{[C]_{ij}} \ge \frac{[CC^{T}]_{ii} [C]_{ij}}{[C]_{ij}} = [CC^{T}]_{ii} \tag{51}$$
and $\lambda [DD^{T}]_{ii} \ge 0$, we can conclude that (49) holds, so (48) is an upper bound auxiliary function for (39). Because the elements of the matrix $C$ are nonnegative, the third- and fourth-order partial derivatives are nonnegative and (48) is convex in $[C']_{ij}$; its minimum is therefore achieved at
$$[C']_{ij} = [C]_{ij} - [-2DW^{T} + 2CWW^{T} - 4\lambda DD^{T}C + 4\lambda CC^{T}C]_{ij} \cdot \frac{[C]_{ij}}{2[CWW^{T}]_{ij} + 4\lambda [CC^{T}C]_{ij}} = [C]_{ij}\,\frac{[DW^{T}]_{ij} + 2\lambda [DD^{T}C]_{ij}}{[CWW^{T}]_{ij} + 2\lambda [CC^{T}C]_{ij}} \tag{52}$$
Lemma 1 is proved.
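The two bounds (50) and (51) use only the nonnegativity of the factors, so they can be verified elementwise; the snippet below (ours) does so on random strictly positive matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 5, 7, 3
C = rng.random((n, k)) + 1e-3   # strictly positive, as in the proof
W = rng.random((k, m))

WWT, CCT = W @ W.T, C @ C.T
CWWT, CCTC = C @ WWT, CCT @ C

for i in range(n):
    for j in range(k):
        assert CWWT[i, j] / C[i, j] >= WWT[j, j] - 1e-12   # inequality (50)
        assert CCTC[i, j] / C[i, j] >= CCT[i, i] - 1e-12   # inequality (51)
```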
Lemma 2. Given Proposition 1, the auxiliary function for $O_{W}$ is as follows:
$$G(W, W') = -2\lambda\,\mathrm{Tr}(f(D) W'^{T} T^{T}) - 2\,\mathrm{Tr}(D W'^{T} C^{T}) + \sum_{ij} \frac{[C^{T}CW]_{ij}\,[W']_{ij}^{2}}{[W]_{ij}} + \lambda \sum_{ij} \frac{[T^{T}TW]_{ij}\,[W']_{ij}^{2}}{[W]_{ij}} \tag{53}$$
Lemma 3. The auxiliary function for $O_{T}$ is as follows:
$$G(T, T') = -2\lambda\,\mathrm{Tr}(f(D) W^{T} T'^{T}) + \lambda \sum_{ij} \frac{[TWW^{T}]_{ij}\,[T']_{ij}^{2}}{[T]_{ij}} \tag{54}$$
With the above lemmas, we derive the update rule for each variable by minimizing its corresponding auxiliary function:
$$\frac{\partial G(C, C')}{\partial [C']_{ij}} = -2[DW^{T}]_{ij} + 2[CWW^{T}]_{ij} \frac{[C']_{ij}}{[C]_{ij}} - 4\lambda [DD^{T}C]_{ij} + 4\lambda [CC^{T}C]_{ij} \frac{[C']_{ij}}{[C]_{ij}} = 0 \tag{55}$$
$$\frac{\partial G(W, W')}{\partial [W']_{ij}} = -2\lambda [T^{T} f(D)]_{ij} - 2[C^{T}D]_{ij} + 2[C^{T}CW]_{ij} \frac{[W']_{ij}}{[W]_{ij}} + 2\lambda [T^{T}TW]_{ij} \frac{[W']_{ij}}{[W]_{ij}} = 0 \tag{56}$$
$$\frac{\partial G(T, T')}{\partial [T']_{ij}} = -2\lambda [f(D)W^{T}]_{ij} + 2\lambda [TWW^{T}]_{ij} \frac{[T']_{ij}}{[T]_{ij}} = 0 \tag{57}$$
which yields the multiplicative update rules
$$[C']_{ij} = [C]_{ij}\,\frac{[DW^{T}]_{ij} + 2\lambda [DD^{T}C]_{ij}}{[CWW^{T}]_{ij} + 2\lambda [CC^{T}C]_{ij}} \tag{58}$$
$$[W']_{ij} = [W]_{ij}\,\frac{\lambda [T^{T} f(D)]_{ij} + [C^{T}D]_{ij}}{[C^{T}CW]_{ij} + \lambda [T^{T}TW]_{ij}} \tag{59}$$
$$[T']_{ij} = [T]_{ij}\,\frac{[f(D)W^{T}]_{ij}}{[TWW^{T}]_{ij}} \tag{60}$$
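Read together, (58)–(60) define one sweep of the cDNMF multiplicative updates. The sketch below (ours) iterates them on random placeholder data; the matrix shapes, the random stand-in for f(D), and the small eps guarding the denominators are our assumptions rather than details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k, r, lam, eps = 20, 50, 5, 8, 0.1, 1e-12
D = rng.random((n, m))    # document-word matrix (placeholder data)
fD = rng.random((r, m))   # stand-in for f(D), the deep document representation
C = rng.random((n, k))    # first factor, constrained by the CC^T vs. DD^T term
W = rng.random((k, m))    # second factor, shared by both factorizations
T = rng.random((r, k))    # factor tying f(D) to the shared W

for _ in range(200):
    # (58): elementwise multiplicative update of C
    C *= (D @ W.T + 2 * lam * D @ D.T @ C) / (C @ W @ W.T + 2 * lam * C @ C.T @ C + eps)
    # (59): elementwise multiplicative update of W
    W *= (lam * T.T @ fD + C.T @ D) / (C.T @ C @ W + lam * T.T @ T @ W + eps)
    # (60): elementwise multiplicative update of T
    T *= (fD @ W.T) / (T @ W @ W.T + eps)

print(np.linalg.norm(D - C @ W, 'fro'), np.linalg.norm(fD - T @ W, 'fro'))
```

Because each step multiplies a nonnegative iterate by a ratio of nonnegative terms, nonnegativity is preserved automatically, which is the usual appeal of multiplicative NMF updates.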