How Many Topics? Stability Analysis for Topic Models
Derek Greene, Derek O'Callaghan, Pádraig Cunningham
School of Computer Science & Informatics, University College Dublin
{derek.greene, derek.ocallaghan, padraig.cunningham}@ucd.ie

Abstract.
Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
1 Introduction

From a general text mining perspective, a topic in a text corpus can be viewed as either a probability distribution over the terms present in the corpus or a cluster that defines weights for those terms [26]. Considerable research on topic modeling has focused on the use of probabilistic methods such as variants of Latent Dirichlet Allocation (LDA) [5] and Probabilistic Latent Semantic Analysis (PLSA) [11]. Non-probabilistic algorithms, such as Non-negative Matrix Factorization (NMF) [20], have also been applied to this task [26,1]. Regardless of the choice of algorithm, a key consideration in successfully applying topic modeling is the selection of an appropriate number of topics k for the corpus under consideration. Choosing a value of k that is too low will generate topics that are overly broad, while choosing a value that is too high will result in "over-clustering" of the data. For some corpora, coherent topics will exist at several different resolutions, from coarse to fine-grained, reflected by multiple appropriate k values.

When a clustering result is generated using an algorithm that contains a stochastic element or requires the selection of one or more key parameter values, it is important to consider whether the solution represents a "definitive" solution that may easily be replicated. Cluster validation techniques based on this concept have been shown to be effective in helping to choose a suitable number of clusters in data [17,21]. The stability of a clustering model refers to its ability to consistently replicate similar solutions on data originating from the same source. In practice, this involves repeatedly clustering using different initial conditions and/or applying the algorithm to different samples of the complete data set. A high level of agreement between the resulting clusterings indicates high stability, in turn suggesting that the current model is appropriate for the data. In contrast, a low level of agreement indicates that the model is a poor fit for the data. Stability analysis has most frequently been applied in bioinformatics [7,4], where the focus has been on model selection for classical clustering approaches, such as k-means [17,3] and agglomerative hierarchical clustering [21,4].

In the literature, the output of topic modeling procedures is often presented in the form of lists of top-ranked terms suitable for human interpretation. Motivated by this, we propose a term-centric stability approach for selecting the number of topics in a corpus, based on the agreement between term rankings generated over multiple runs of the same algorithm. We employ a "top-weighted" ranking measure, where higher-ranked terms have a greater degree of influence when calculating agreement scores. To ensure that a given model is robust against perturbations, we use both sampling of documents from a corpus and random matrix initialization to produce diverse collections of topics on which stability is calculated. Unlike previous applications of the concept of stability in NMF [7] or LDA [25,8], our approach is generic in the sense that it does not rely on directly comparing probability distributions or topic-term matrices. So although we highlight the use of this method in conjunction with NMF, it could be applied in conjunction with other topic modeling and document clustering techniques.

This paper is organized as follows. Section 2 provides a brief overview of existing work in the areas of matrix factorization, stability analysis, and rank agreement.
In Section 3 we discuss the problem of measuring the similarity between sets of term rankings, and describe a solution that can be used to quantify topic stability. Using a topic modeling approach based on matrix factorization, in Section 4 we present an empirical evaluation of the proposed solution on a range of text corpora. The paper finishes with some conclusions and suggestions for future work in Section 5.

2 Related Work

2.1 Matrix Factorization

While work on topic models has largely focused on the use of LDA [5,25], Non-negative Matrix Factorization (NMF) can also be applied to textual data to reveal topical structures [26]. NMF seeks to decompose a data matrix into factors that are constrained so that they will not contain negative values. Given a document-term matrix A ∈ ℝ^(m×n) representing m unique terms present in a corpus of n documents, NMF generates a reduced rank-k approximation in the form of the product of two non-negative factors A ≈ WH, where the objective is to minimize the reconstruction error between A and the low-dimensional approximation. The columns or basis vectors of W ∈ ℝ^(m×k) can be interpreted as topics, defined with non-negative weights relative to the m terms. The entries in the matrix H ∈ ℝ^(k×n) provide document memberships with respect to the k topics. Note that, unlike LDA, which operates on raw frequency counts, NMF can be applied to a non-negative matrix A that has been previously normalized using common pre-processing procedures such as TF-IDF term weighting and document length normalization. As with LDA, document-topic assignments are not discrete, allowing a single document to be associated with multiple topics.

For NMF, the key model selection challenge is the selection of the user-defined parameter k. Although no definitive approach for choosing k has been identified, a number of heuristics exist in the literature. A simple technique is to calculate the Residual Sum of Squares (RSS) between the approximation given by a pair of NMF factors and the original matrix [12], which indicates the degree of variation in the dependent variables that the NMF model did not explain. The authors suggest that, by examining the RSS curve for a range of candidate values of k, an inflection point might be identified to provide a robust estimate of the optimal reduced rank.
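To make this concrete, the following is a minimal sketch of extracting topics as ranked term lists with scikit-learn's NMF implementation; the function name and parameters are our own illustrative choices, not code from the paper. Note that scikit-learn stores the corpus as a documents × terms matrix, so the topic-term weights appear in H (`components_`) rather than in W as in the notation above.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_topic_rankings(A, vocab, k, top_t=10, seed=None):
    """Factorize a (documents x terms) TF-IDF matrix and return each
    of the k topics as a ranked list of its top_t terms."""
    model = NMF(n_components=k, init="random", random_state=seed)
    model.fit(A)
    H = model.components_  # (k x m) topic-term weight matrix
    return [
        [vocab[j] for j in np.argsort(H[i])[::-1][:top_t]]
        for i in range(k)
    ]
```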
2.2 Stability Analysis

A range of methods based on the concept of stability analysis have been proposed for the task of model selection. The stability of a clustering algorithm refers to its ability to consistently produce similar solutions on data originating from the same source [17,3]. Since only a single set of data items will generally be available in unsupervised learning tasks, clusterings are generated on perturbations of the original data. The primary application of stability analysis has been as a robust approach for selecting key algorithm parameters [18], specifically when estimating the optimal number of clusters for a given data set. These methods are motivated by the observation that, if the number of clusters in a model is too large, repeated clusterings will lead to arbitrary partitions of the data, resulting in unstable solutions. On the other hand, if the number of clusters is too small, the clustering algorithm will be constrained to merge subsets of objects which should remain separated, also leading to unstable solutions. In contrast, repeated clusterings generated using some optimal number of clusters will generally be consistent, even when the data is perturbed or distorted.

The most common approach to stability analysis involves perturbing the data by randomly sampling the original objects to produce a collection of subsamples for clustering, using values of k from a pre-defined range [21]. The stability of the clustering model for each candidate value of k is evaluated using an agreement measure evaluated on all pairs of clusterings generated on different subsamples. One or more values of k are then recommended, selected based on the highest mean agreement scores.

Brunet et al. proposed an initial stability-based approach for NMF model selection based on discretized cluster assignments of items (rather than features) across multiple runs of the same algorithm using different random initializations [7]. Specifically, for each NMF run applied to the same data set of n items, an n × n connectivity matrix is constructed, where an entry (i, j) = 1 if items i and j are assigned to the same discrete cluster, and (i, j) = 0 otherwise. By repeating this process over τ runs, a consensus matrix can be calculated as the average of all τ connectivity matrices. Each entry in this matrix indicates the fraction of times two items were clustered together. To measure the stability of a particular value of k, a cophenetic correlation coefficient is calculated on a hierarchical clustering of the connectivity matrix. The authors suggest a heuristic for selecting one or more values of k, based on a sudden drop in the correlation score as k increases.
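A short sketch of how this consensus-matrix procedure could be implemented with NumPy and SciPy is shown below; the function name and the choice of average linkage are our own assumptions rather than details taken from [7].

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform

def consensus_cophenetic(label_runs):
    """Consensus-matrix stability in the style of Brunet et al. [7]:
    label_runs holds tau arrays of discrete cluster labels (length n),
    one per NMF run with a different random initialization."""
    n = len(label_runs[0])
    consensus = np.zeros((n, n))
    for run in label_runs:
        run = np.asarray(run)
        # connectivity matrix: 1 where two items share a cluster
        consensus += (run[:, None] == run[None, :]).astype(float)
    consensus /= len(label_runs)
    # cophenetic correlation between the consensus-derived distances
    # and the dendrogram from hierarchically clustering those distances
    dist = squareform(1.0 - consensus, checks=False)
    tree = linkage(dist, method="average")
    coph_corr, _ = cophenet(tree, dist)
    return coph_corr
```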
In their work on LDA, Steyvers and Griffiths noted the importance of identifying those topics that appear repeatedly across multiple samples of related data [25], which closely resembles the more general concept of stability analysis [21]. The authors suggested comparing two runs of LDA by examining a topic-topic matrix constructed from the symmetric Kullback-Leibler (KL) distance between topic distributions from the two runs. Alternative work on measuring the stability of LDA topic models was described in [8]. The authors proposed a document-centric approach, where topics from two different LDA runs are matched together based on correlations between rows of the two corresponding document-topic matrices. The output was represented as a document-document correlation matrix, where block diagonal structures induced by the correlation values are indicative of higher stability. In this respect, the approach is similar to the Brunet et al. approach for NMF.

Other evaluation measures used for LDA have included those based on the semantic coherence of the top terms derived from a single set of topics, with respect to term co-occurrence within the same corpus or an external background corpus. For example, Newman et al. calculated correlations between human judgements and a set of proposed measures, and found that a Pointwise Mutual Information (PMI) measure performed best or near-best out of all those considered [23]. However, such measures have not focused on model selection and do not consider the robustness of topics over multiple runs of an algorithm.

2.3 Rank Agreement

A variety of well-known simple metrics exist for measuring the distance or similarity between pairs of ranked lists of the same set of items, notably Spearman's footrule distance and Kendall's tau function [14]. However, Webber et al. [27] note that many problems involve comparing indefinite rankings, where items appear in one list but not in another, and that standard metrics do not consider such cases. For other applications, it will be desirable to employ a top-weighted ranking agreement measure, such that changing the rank of a highly-relevant item at the top of a list results in a higher penalty than changing the rank of an irrelevant item appearing at the tail of a list. This consideration is important in the case of comparing query results from different search engines, though, as we demonstrate later, it is also a key consideration when comparing rankings of terms arising in topic modeling.

Motivated by basic set overlap, Fagin et al. [9] proposed a top-weighted distance metric between indefinite rankings, also referred to as Average Overlap (AO) [27], which calculates the mean intersection size between every pair of subsets of d top-ranked items in two lists, for d = [1, t]. This naturally accords higher positional weight to items at the top of the lists. More recently, Kumar and Vassilvitskii proposed a generic framework for measuring the distance between a pair of rankings [16], supporting both positional weights and item relevance weights. Based on this framework, generalized versions of Kendall's tau and Spearman's footrule metric were derived. However, the authors did not focus on the case of indefinite rankings.

3 Methods

In this section we describe a general stability-based method for selecting the number of topics for topic modeling. Unlike previous unsupervised stability analysis methods, we focus on the use of features or terms to evaluate the suitability of a model. This is motivated by the term-centric approach generally taken in topic modeling, where precedence is generally given to the term-topic output and topics are summarized using a truncated set of top terms. Also, unlike the approach proposed in [7] for genetic data, our method does not assume that topic clusters are entirely disjoint, and does not require the calculation of a dense connectivity matrix or the application of a subsequent clustering algorithm.

Firstly, in Section 3.1 we describe a similarity metric for comparing two ranked lists of terms. Using this measure, in Section 3.2 we propose a measure of the agreement between two topic models when represented as ranked term lists. Subsequently, in Section 3.3 we propose a stability analysis method for selecting the number of topics in a text corpus.

3.1 Term Ranking Similarity
A general way to represent the output of a topic modeling algorithm is in the form of a ranking set containing k ranked lists, denoted S = {R_1, . . . , R_k}. The i-th topic produced by the algorithm is represented by the list R_i, containing the top t terms which are most characteristic of that topic according to some criterion. In the case of NMF, this will correspond to the highest ranked values in each column of the k basis vectors, while for LDA this will consist of the terms with the highest probabilities in the term distribution for each topic. For partitional or hierarchical document clustering algorithms, this might consist of the highest ranked terms in each cluster centroid.

A variety of symmetric measures could be used to assess the similarity between a pair of ranked lists (R_i, R_j). A naïve approach would be to employ a simple set overlap method, such as the Jaccard index [13]. However, such measures do not take into account positional information. Terms occurring at the top of a ranked list generated by an algorithm such as NMF will naturally be more relevant to a topic than those occurring at the tail of the list, which correspond to zero or near-zero values in the original basis vectors. Also, in practice, rather than considering all m terms in a corpus, the results of topic modeling are presented using the top t ≪ m terms. Similarly, when measuring the similarity between ranked lists, it may be preferable to consider truncated lists with only t terms, for economy of representation and to reduce the computational cost of applying multiple similarity operations. However, this will often lead to indefinite rankings, where different subsets of terms are being compared.

Therefore, following the ranking distance measure proposed by Fagin et al. [9], we propose the use of a top-weighted version of the Jaccard index, suitable for calculating the similarity between pairs of indefinite rankings. Specifically, we define the Average Jaccard (AJ) measure as follows. We calculate the average of the Jaccard scores between every pair of subsets of d top-ranked terms in two lists, for depth d ∈ [1, t]. That is:

    AJ(R_i, R_j) = \frac{1}{t} \sum_{d=1}^{t} \gamma_d(R_i, R_j)    (1)

where

    \gamma_d(R_i, R_j) = \frac{|R_{i,d} \cap R_{j,d}|}{|R_{i,d} \cup R_{j,d}|}    (2)

such that R_{i,d} is the head of list R_i up to depth d. This is a symmetric measure producing values in the range [0, 1]. In the example in Table 1, although the Jaccard score at depth d = 5 is comparatively high (0.429), the mean score is much lower (0.154), as the similarity between terms occurs towards the tails of the lists – these terms carry less weight than those at the head of the lists, such as "album" and "sport".

Table 1. Example of Average Jaccard (AJ) term ranking similarity, for two ranked lists of terms up to depth d = 5. The value Jac_d indicates the Jaccard score at depth d only, while AJ indicates the current AJ similarity at that depth. (Columns: d, R_{1,d}, R_{2,d}, Jac_d, AJ; in this example, Jac_5 = 0.429 while the final AJ score is 0.154.)
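A direct implementation of Eqns. 1–2 is straightforward; the following sketch (our own illustrative code, not from the paper) computes AJ between two ranked term lists:

```python
def average_jaccard(r1, r2):
    """Average Jaccard (AJ) similarity between two ranked term lists
    (Eqns. 1-2): the mean Jaccard score over all head sublists of
    depth d = 1..t."""
    t = min(len(r1), len(r2))
    total = 0.0
    for d in range(1, t + 1):
        head1, head2 = set(r1[:d]), set(r2[:d])
        total += len(head1 & head2) / len(head1 | head2)
    return total / t
```

For instance, applied to the rankings {sport, win, award} and {win, sport, money} from the Fig. 1 example below, this returns (0 + 1 + 0.5)/3 = 0.5.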
3.2 Topic Model Agreement

We now consider the problem of measuring the agreement between two different k-way topic models, represented as two ranking sets S^x = {R^x_1, . . . , R^x_k} and S^y = {R^y_1, . . . , R^y_k}, both containing k ranked lists. We construct a k × k similarity matrix M, such that the entry M_ij indicates the agreement between R^x_i and R^y_j (i.e. the i-th topic in the first model and the j-th topic in the second model), as calculated using the Average Jaccard score (Eqn. 1). We then find the best match between the rows and columns of M (i.e. between the ranked lists in S^x and S^y). The optimal permutation π may be found in O(k³) time by solving the minimal weight bipartite matching problem using the Hungarian method [15]. From this, we can produce an agreement score:

    agree(S^x, S^y) = \frac{1}{k} \sum_{i=1}^{k} AJ(R^x_i, \pi(R^x_i))    (3)

where π(R^x_i) denotes the ranked list in S^y matched to R^x_i by the permutation π. Values for the above take the range [0, 1], where an exact agreement between two identical k-way topic models will result in a score of 1. A simple example illustrating the agreement process is shown in Fig. 1.

Fig. 1. A simple example of measuring the agreement between two different topic models, each containing k = 3 topics, represented by a pair of ranking sets. Term ranking similarity values are calculated using Average Jaccard, up to depth d = 3.
Ranking set S_1: R_1 = {sport, win, award}, R_2 = {bank, finance, money}, R_3 = {music, album, band}
Ranking set S_2: R_1 = {finance, bank, economy}, R_2 = {music, band, award}, R_3 = {win, sport, money}
Pairwise AJ similarity matrix M (rows S_1, columns S_2):
0.00 0.07 0.50
0.50 0.00 0.07
0.00 0.61 0.00
The optimal permutation π matches the topic pairs scoring 0.50, 0.50, and 0.61, giving agree(S_1, S_2) = (0.50 + 0.50 + 0.61)/3 ≈ 0.54.
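The matching step can be delegated to an off-the-shelf solver. Below is a brief sketch using SciPy's Hungarian-method implementation (`linear_sum_assignment`), reusing the `average_jaccard` function defined above; the function name is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def agreement(sx, sy):
    """Agreement between two k-way topic models (Eqn. 3), each given
    as a list of k ranked term lists."""
    M = np.array([[average_jaccard(ri, rj) for rj in sy] for ri in sx])
    # The solver minimizes total cost, so negate M to obtain the
    # maximum-similarity matching between the two sets of topics.
    rows, cols = linear_sum_assignment(-M)
    return float(M[rows, cols].mean())
```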
3.3 Selecting the Number of Topics

Building on the agreement measure defined in Section 3.2, we now propose a model selection approach for topic modeling. To generate a diverse collection of solutions, we combine two general strategies common in the ensemble clustering and stability analysis literature. Firstly, we make use of the natural instability of topic modeling algorithms – i.e. the sensitivity of NMF to the choice of initial factors, or the stochastic element in LDA optimization. Secondly, to further increase diversity, at each run we sample a specific fraction of all documents for analysis.

For each value of k in a broad pre-defined range [k_min, k_max], we proceed as follows. We firstly generate an initial topic model on the complete data set using an appropriate algorithm (ideally this should be deterministic in nature), which provides a reference point for analyzing the stability afforded by using k topics. We represent this as a reference ranking set S_0, where each topic is represented by the ranked list of its top t terms. Subsequently, τ samples of the data set are constructed by randomly selecting a subset of β × n documents without replacement, where 0 ≤ β ≤ 1. We then generate τ k-way topic models by applying the topic modeling algorithm to each of the samples, resulting in alternative ranking sets {S_1, . . . , S_τ}, where all topics are also represented using the t top terms. To measure the overall stability at k, we calculate the mean agreement between the reference ranking set and all other ranking sets using Eqn. 3:

    stability(k) = \frac{1}{\tau} \sum_{i=1}^{\tau} agree(S_0, S_i)    (4)

1. Randomly generate τ samples of the data set, each containing β × n documents.
2. For each value of k ∈ [k_min, k_max]:
   1. Apply the topic modeling algorithm to the complete data set of n documents to generate k topics, and represent the output as the reference ranking set S_0.
   2. For each sample X_i:
      (a) Apply the topic modeling algorithm to X_i to generate k topics, and represent the output as the ranking set S_i.
      (b) Calculate the agreement score agree(S_0, S_i).
   3. Compute the mean agreement score for k over all τ samples (Eqn. 4).
3. Select one or more values for k based upon the highest mean agreement scores.
Fig. 2. Summary of the proposed stability analysis method for topic models.
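Putting the pieces together, the procedure of Fig. 2 might look as follows in Python; `build_model` stands in for any topic modeling routine that returns k ranked term lists (e.g. the NMF sketch from Section 2.1, with NNDSVD initialization for the reference run and random initialization for the samples). All names here are illustrative assumptions, not code from the paper.

```python
import numpy as np

def select_k(docs, k_range, build_model, tau=100, beta=0.8):
    """Compute stability(k) (Eqn. 4) for each candidate k, following
    the procedure summarized in Fig. 2."""
    rng = np.random.default_rng(1)
    n_sample = int(beta * len(docs))
    # Step 1: draw tau document subsamples without replacement
    samples = [rng.choice(len(docs), size=n_sample, replace=False)
               for _ in range(tau)]
    scores = {}
    for k in k_range:
        # Step 2.1: reference ranking set from the complete corpus
        s0 = build_model(docs, k)
        # Steps 2.2-2.3: mean agreement with models built on samples
        scores[k] = float(np.mean(
            [agreement(s0, build_model([docs[i] for i in idx], k))
             for idx in samples]))
    # Step 3: candidate values of k are those with peak stability
    return scores
```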
This process is repeated for each candidate k ∈ [k_min, k_max]. A summary of the entire process is given in Fig. 2. Note that the proposed approach is similar to the strategy for item stability analysis proposed in [21], in that a single reference point is used for each value of k, involving τ comparisons between solutions. This contrasts with the approach used by other authors in the literature (e.g. [18]), which involves comparing all unique pairs of results, requiring τ × (τ − 1)/2 agreement comparisons.

By examining a plot of the stability scores produced with Eqn. 4, a final value k may be identified based on peaks in the plot. The presence of more than one peak indicates that multiple appropriate topic schemes exist for the corpus under consideration, which is analogous to the existence of multiple alternative solutions in many general cluster analysis problems [2]. An example of this case is shown in Fig. 3(a) for the guardian-2013 corpus. This data set has six annotated category labels, but we also see a peak at k = 3 in the stability plots, suggesting that thematic structure exists at a more coarse level too. On the other hand, a flat curve with no peaks, combined with low stability values, strongly suggests that no coherent topics exist in the data set. This is analogous to the general problem of identifying "clustering tendency" [21]. The example in Fig. 3(b) shows plots generated for a synthetic data set of 500 randomly generated documents. As one might expect, no strong peak appears in the stability plots.

4 Evaluation

We now evaluate the stability analysis method proposed in Section 3 to assess its usefulness in guiding the selection of the number of topics for NMF. The evaluation is performed on a number of text corpora, each of which has annotated "ground truth" document labels, such that each document is assigned a single label.
When pre-processing the data, terms occurring in fewer than 20 documents were removed, along with English language stop words, but no stemming was performed. Standard log TF-IDF and L2 document length normalization procedures were then applied to the term-document matrix. Descriptions of the corpora are provided in Table 2, and pre-processed versions are made available online for further research (http://mlg.ucd.ie/howmanytopics/).

Fig. 3. Stability analysis plots generated using t = 10/20/50/100 top terms for (a) the guardian-2013 corpus of news articles, (b) a synthetic dataset of 500 documents generated randomly from 1,500 terms.

In our experiments we compare the proposed stability analysis method with a popular existing approach for selecting the reduced rank for NMF, based on the cophenetic correlation of a consensus matrix [7]. The experimental process involved applying both schemes to each corpus across a reasonable range of values for k, and comparing plots of their output. Here we use k ∈ [2, 12] and β = 0.8 (i.e.
80% of documents are randomly chosen for each run), with a total of τ = 100 runs to minimize any variance introduced by sampling. For our stability analysis method, we also generate reference ranking sets for each candidate value of k by applying NMF to the complete data set with Nonnegative Double Singular Value Decomposition (NNDSVD) initialization to ensure a deterministic solution [6].

Table 2. Details of the corpora used in our experiments, including the total number of documents n, terms m, and number of labels k̂ in the associated "ground truth".

Corpus           n       m       k̂   Description
bbc              2,225   3,121   5   General news articles from the BBC [10].
bbc-sport        737     969     5   Sports news articles from the BBC [10].
guardian-2013    6,520   10,801  6   New corpus of news articles published by The Guardian during 2013.
irishtimes-2013  3,246   4,832   7   New corpus of news articles published by The Irish Times during 2013.
nytimes-1999     9,551   12,987  4   A subset of the New York Times Annotated Corpus from 1999 [24].
nytimes-2003     11,527  15,001  7   As above, with articles from 2003.
wikipedia-high   5,738   17,311  6   Subset of a Wikipedia dump from January 2014, where articles are assigned labels based on their high-level WikiProject.
wikipedia-low    4,986   15,441  10  Another Wikipedia subset. Articles are labeled with fine-grained WikiProject sub-groups.
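For reference, the pre-processing described above could be reproduced along these lines with scikit-learn; the exact vectorizer settings are our assumptions based on the description, and `load_corpus` is a hypothetical helper returning a list of raw document strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = load_corpus()  # hypothetical helper: list of document strings

vectorizer = TfidfVectorizer(
    min_df=20,             # drop terms occurring in < 20 documents
    stop_words="english",  # remove English stop words (no stemming)
    sublinear_tf=True,     # log-scaled term frequencies ("log TF-IDF")
    norm="l2",             # L2 document length normalization
)
A = vectorizer.fit_transform(docs)

# Deterministic reference factorization via NNDSVD initialization [6]
reference_model = NMF(n_components=7, init="nndsvd").fit(A)
```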
Initially, for stability analysis we examined a range of values t = 10/20/50/100 for the number of top terms used to represent each topic. The resulting stability scores for different values of t were highly correlated across all corpora considered in our evaluation (see Table 3 for average correlations). A typical example of this behavior is shown in Fig. 3(a) for the guardian-2013 corpus, where the plots almost perfectly overlap. This behavior is perhaps unsurprising as, given the definition of the Average Jaccard measure in Eqn. 1, terms occurring further down ranked lists will naturally carry less weight. Therefore, the difference between scores generated with, say, t = 50 and t = 100 will be minimal. For the remainder of this section we report stability scores for t = 20, which provided the highest pairwise mean correlation (0.977) with the results from the other values of t examined, while also providing economy of representation for topics.

Figures 4 and 5 show plots generated on the eight corpora for k ∈ [2, 12].
Table 3. Pearson correlation coefficient scores between stability scores for different numbers of top terms t, as averaged across all corpora in our evaluations.

          t = 10  t = 20  t = 50  t = 100  Mean
t = 10    -       0.964   0.929   0.926    0.940
t = 20    0.964   -       0.985   0.982    0.977
t = 50    0.929   0.985   -       0.997    0.970
t = 100   0.926   0.982   0.997   -        0.968
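Correlations of this kind can be computed directly with SciPy; the sketch below is our own illustration, where `stability_curves` is a hypothetical dict mapping each value of t to its list of stability(k) scores over the candidate range of k.

```python
from itertools import combinations
from scipy.stats import pearsonr

def pairwise_correlations(stability_curves):
    """Pearson correlation between stability curves for each pair of
    t values, as summarized in Table 3."""
    corr = {}
    for t1, t2 in combinations(sorted(stability_curves), 2):
        r, _ = pearsonr(stability_curves[t1], stability_curves[t2])
        corr[(t1, t2)] = r
    return corr
```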
Fig. 4. Comparison of plots generated for stability analysis (t = 20) and consensus matrix analysis for values of k ∈ [2, 12], where suitable values of k are identified based on peaks in the plots: (a) bbc, (b) bbc-sport, (c) guardian-2013, (d) irishtimes-2013.

As the observed consensus scores were uniformly high, normalized values ("Norm. Consensus") are plotted alongside the stability scores. The bbc corpus contains five well-separated annotated categories for news articles, such as "business" and "entertainment". Therefore it is unsurprising that in Fig. 4(a) we find a strong peak for both methods at k = 5, with a sharp fall-off for the stability method after this point. This reflects the fact that the five categories are accurately recovered by NMF. For the bbc-sport corpus, which also has five ground truth news categories, we see a peak at k = 4, followed by a lower peak at k = 5 – see Fig. 4(b). The consensus method also exhibits a peak at this point. Examining the top terms for the reference ranking set indicates that the
two smallest categories, "athletics" and "tennis", have been assigned to a single larger topic, while the other three categories are clearly represented as topics.

Fig. 5. Comparison of plots generated for stability analysis (t = 20) and consensus matrix analysis for values of k ∈ [2, 12]: (a) nytimes-1999, (b) nytimes-2003, (c) wikipedia-high, (d) wikipedia-low.

In the ground truth for the guardian-2013 corpus, each article is labeled based upon the section in which it appeared on the guardian.co.uk website. From Fig. 4(c) we see that the stability method correctly identifies a peak at k = 6 corresponding to the six sections in the corpus, which is not found by the consensus method. However, both methods also suggest a more coarse clustering at k = 3. Inspecting the reference ranking set (see Table 4(a)) suggests an intuitive explanation – "books", "fashion" and "music" sections were merged in a single culture-related topic, documents labeled as "politics" and "business" were clustered together, while "football" remains as a distinct topic.

Articles in the irishtimes-2013 corpus also have annotated labels based on their publication section on irishtimes.com. In Fig. 4(d) we see high scores at k = 2 for both methods, and a subsequent peak identified by the stability method at k = 7, corresponding to the seven publication sections. In the former case, the top ranked reference set terms indicate a topic related to sports and a catch-all news topic – see Table 4(b).

Next we consider the two corpora of news articles coming from the New York Times Annotated Corpus. Interestingly, for nytimes-1999, in Fig. 5(a) both methods exhibit a trough for k = 4 topics, even though the ground truth for this corpus contains four news article categories. Inspecting the term rankings shown in Table 4(c) provides a potential explanation of this instability: across the 100 factorization results, the ground truth "sports" category is often but not always split into two topics relating to baseball and basketball. For the nytimes-2003 corpus, which contains seven article categories, both methods produce high scores at k = 2, with subsequent peaks at k = 4 and k = 7 – see Fig. 5(b). As with the irishtimes-2013 corpus, the highest-level structures indicate a simple separation between sports articles and other news. The reference topics at k = 4 indicate that smaller categories among the New York Times articles, such as "automobiles" and "dining & wine", do not appear as strong themes in the data.

Finally, we consider the two collections of Wikipedia pages, where pages are given labels based on their assignment to WikiProjects (see http://en.wikipedia.org/wiki/Wikipedia:WikiProject) at varying levels of granularity. For wikipedia-high, from Fig. 5(c) we see that both methods achieve high scores for k = 2 and k = 4 topics. In the case of the former, the top terms in the reference ranking set indicate a split between Wikipedia pages related to music and all other pages (Table 4(d)). At k = 4 (Table 4(e)), we see coherent topics covering "music", "sports", "space", and a combination of the "military" & "transportation" WikiProject labels. The "medicine" WikiProject is not clearly represented as a topic at this level. In the case of wikipedia-low, which contains ten low-level page categories, both methods show spikes at k = 5 and k = 10. At k = 5, NMF recovers topics related to "ice hockey", "cricket", "World War I", a topic covering a mixture of musical genres, and a seemingly incoherent group that includes all other pages. The relatively high stability score achieved at this level (0.87) suggests that this configuration regularly appeared across the 100 NMF runs.
Overall, it is interesting to observe that, for a number of data sets, both methods evaluated here exhibited peaks at k = 2, where one might expect far more fine-grained topics in these types of data sets. This results from high agreement between the term ranking sets generated at this level of granularity. A closer inspection of document membership weights for these cases shows that this phenomenon generally arises from the repeated appearance of one small "outlier" topic and one large "merged" topic encompassing the rest of the documents in the corpus (e.g. the examples shown in Table 4(b,d)). In a few cases we also see that the ground truth does not always correspond well to the actual data (e.g. for the sports-related articles in nytimes-1999). This problem arises from time to time when meta-data is used to provide a ground truth in machine learning benchmarking experiments [19].

Table 4. Examples of top 10 terms for reference ranking sets generated by NMF on a number of text corpora for different values of k: (a) guardian-2013 (k = 3), (b) irishtimes-2013 (k = 2), (c) nytimes-1999 (k = 4), (d) wikipedia-high (k = 2), (e) wikipedia-high (k = 4), (f) wikipedia-low (k = 5).

In relation to computational time, the requirement to run a complete hierarchical clustering on the document-document consensus matrix before calculating cophenetic correlations leads to substantially longer running times on all corpora, when compared to the stability analysis method using a reference ranking set. In addition, the latter can be readily parallelized, as agreement scores can be calculated independently for each of the factorization results.
5 Conclusions

A key challenge when applying topic modeling is the selection of an appropriate number of topics k. We have proposed a new method for choosing this parameter using a term-centric stability analysis strategy, where a higher level of agreement between the top-ranked terms for topics generated across different samples of the same corpus indicates a more suitable choice. Evaluations on a range of text corpora have suggested that this method can provide a useful guide for selecting one or more values for k.

While our experiments have focused on the application of the proposed method in conjunction with NMF, the use of term rankings rather than raw factor values or probabilities means that it can potentially generalize to any topic modeling approach that can represent topics as ranked lists of terms. This includes probabilistic techniques such as LDA, together with more conventional partitional algorithms for document clustering such as k-means and its variants. In further work, we plan to examine the usefulness of stability analysis in conjunction with alternative algorithms.

Acknowledgements.
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.
References
1. Arora, S., Ge, R., Moitra, A.: Learning topic models – going beyond SVD. In: Proc. 53rd Symp. Foundations of Computer Science. pp. 1–10. IEEE (2012)
2. Bae, E., Bailey, J.: COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In: Proc. 6th International Conference on Data Mining. pp. 53–62. IEEE (2006)
3. Ben-David, S., Pál, D., Simon, H.U.: Stability of k-means clustering. In: Learning Theory, pp. 20–34. Springer (2007)
4. Bertoni, A., Valentini, G.: Random projections for assessing gene expression cluster stability. In: Proc. IEEE International Joint Conference on Neural Networks (IJCNN'05). vol. 1, pp. 149–154 (2005)
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
6. Boutsidis, C., Gallopoulos, E.: SVD based initialization: A head start for non-negative matrix factorization. Pattern Recognition (2008)
7. Brunet, J.P., Tamayo, P., Golub, T.R., Mesirov, J.P.: Metagenes and molecular pattern discovery using matrix factorization. Proc. National Academy of Sciences 101(12), 4164–4169 (2004)
8. De Waal, A., Barnard, E.: Evaluating topic models with stability. In: 19th Annual Symposium of the Pattern Recognition Association of South Africa (2008)
9. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. SIAM Journal on Discrete Mathematics 17(1), 134–160 (2003)