Latent Tree Models for Hierarchical Topic Detection
Peixian Chen^a, Nevin L. Zhang^a,*, Tengfei Liu^b, Leonard K. M. Poon^c, Zhourong Chen^a, Farhan Khawar^a

a Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong
b Ant Financial Services Group, Shanghai
c Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong
Abstract
We present a novel method for hierarchical topic detection where topics are obtained by clustering documents in multiple ways. Specifically, we model document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables, with those at the lowest latent level representing word co-occurrence patterns and those at higher levels representing co-occurrence of patterns at the level below. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. Latent variables at high levels of the hierarchy capture long-range word co-occurrence patterns and hence give thematically more general topics, while those at low levels of the hierarchy capture short-range word co-occurrence patterns and give thematically more specific topics. Unlike LDA-based topic models, HLTMs do not refer to a document generation process and use word variables instead of token variables. They use a tree structure to model the relationships between topics and words, which is conducive to the discovery of meaningful topics and topic hierarchies.

* Corresponding author
Email address: [email protected] (Nevin L. Zhang)
1. Introduction

The objective of hierarchical topic detection (HTD) is, given a corpus of documents, to obtain a tree of topics with more general topics at high levels of the tree and more specific topics at low levels of the tree. It has a wide range of potential applications. For example, a topic hierarchy for posts at an online forum can provide an overview of the variety of the posts and guide readers quickly to the posts of interest. A topic hierarchy for the reviews and feedback on a business/product can help a company gauge customer sentiments and identify areas for improvement. A topic hierarchy for recent papers published at a conference or journal can give readers a global picture of recent trends in the field. A topic hierarchy for all the articles retrieved from PubMed on an area of medical research can help researchers get an overview of past studies in the area. In applications such as those mentioned here, the problem is not about search because the user does not know what to search for. Rather, the problem is about summarization of thematic contents and topic-guided browsing.

Several HTD methods have been proposed previously, including the nested Chinese restaurant process (nCRP) [1, 2], the Pachinko allocation model (PAM) [3, 4], and the nested hierarchical Dirichlet process (nHDP) [5]. Those methods are extensions of latent Dirichlet allocation (LDA) [6].
Hence we refer to them collectively as LDA-based methods.

In this paper, we present a novel HTD method called hierarchical latent tree analysis (HLTA). Like the LDA-based methods, HLTA is a probabilistic method and it involves latent variables. However, there are fundamental differences. The first difference lies in what is being modeled and the semantics of the latent variables. The LDA-based methods model the process by which documents are generated. The latent variables in the models are constructs in the hypothetical generation process, including a list of topics (usually denoted as β), a topic distribution vector for each document (usually denoted as θ_d), and a topic assignment for each token in each document (usually denoted as Z_{d,n}). In contrast, HLTA models a collection of documents without referring to a document generation process. The latent variables in the model are considered unobserved attributes of the documents. If we compare whether words occur in particular documents to whether students do well in various subjects, then the latent variables correspond to latent traits such as analytical skill, literacy skill and general intelligence.

The second difference lies in the types of observed variables used in the models. Observed variables in the LDA-based methods are token variables (usually denoted as W_{d,n}). Each token variable stands for a location in a document, and its possible values are the words in a vocabulary. Here one cannot talk about conditional independence between words because the probabilities of all words must sum to 1. In contrast, each observed variable in HLTA stands for a word. It is a binary variable and represents the presence/absence of the word in a document. The output of HLTA is a tree-structured graphical model, where the word variables are at the leaves and the latent variables are at the internal nodes. Two word variables are conditionally independent given any latent variable on the path between them. Words that frequently co-occur in documents tend to be located in the same "region" of the tree. This fact is conducive to the discovery of meaningful topics and topic hierarchies. A drawback of using binary word variables is that word counts cannot be taken into consideration.

The third difference lies in the definition and characterization of topics. Topics in the LDA-based methods are probability distributions over a vocabulary. When presented to users, a topic is characterized using a few words with the highest probabilities. In contrast, topics in HLTA are clusters of documents. More specifically, all latent variables in HLTA are assumed to be binary. Just as the concept "analytical skill" partitions a student population into two soft clusters, those with high analytical skill in one cluster and those with low analytical skill in another, a latent variable in HLTA partitions a document collection into two soft clusters of documents. The document clusters are interpreted as topics. For presentation to users, a topic is characterized using the words that not only occur with high probabilities in the topic but also occur with low probabilities outside the topic. The consideration of occurrence probabilities outside the topic is important because a word that occurs with high probability in the topic might also occur with high probability outside the topic. When that happens, it is not a good choice for the characterization of the topic.

There are other differences that are more technical in nature; their explanations are hence postponed to Section 4.

The rest of the paper is organized as follows.
We discuss related work in Section 2 and review the basics of latent tree models in Section 3. In Section 4, we introduce hierarchical latent tree models (HLTMs) and explain how they can be used for hierarchical topic detection. The HLTA algorithm for learning HLTMs is described in Sections 5-7. In Section 8, we present the results HLTA obtains on a real-world dataset and discuss some practical issues. In Section 9, we empirically compare HLTA with the LDA-based methods. Finally, we end the paper in Section 10 with some concluding remarks and discussions of future work.
2. Related Work
Topic detection has been one of the most active research areas in machine learning in the past decade. The most commonly used method is latent Dirichlet allocation (LDA) [6]. LDA assumes that documents are generated as follows: First, a list {β_1, ..., β_K} of topics is drawn from a Dirichlet distribution. Then, for each document d, a topic distribution θ_d is drawn from another Dirichlet distribution. Each word W_{d,n} in the document is generated by first picking a topic Z_{d,n} according to the topic distribution θ_d, and then selecting a word according to the word distribution β_{Z_{d,n}} of the topic. Given a document collection, the generation process is reverted via statistical inference (sampling or variational inference) to determine the topics and topic compositions of the documents. (A small sketch of this generative process is given below.)

LDA has been extended in various ways for additional modeling capabilities. Topic correlations are considered in [7, 3]; topic evolution is modeled in [8, 9, 10]; topic structures are built in [11, 3, 1, 4]; side information is exploited in [12, 13]; supervised topic models are proposed in [14, 15]; and so on. In the following, we discuss in more detail three of the extensions that are more closely related to this paper than others.

The Pachinko allocation model (PAM) [3, 4] is proposed as a method for modeling correlations among topics. It introduces multiple levels of supertopics on top of the basic topics. Each supertopic is a distribution over the topics at the next level below. Hence PAM can also be viewed as an HTD method, and the hierarchical structure needs to be predetermined. To pick a topic for a token, it first draws a top-level topic from a multinomial distribution (which in turn is drawn from a Dirichlet distribution), and then draws a topic for the next level below from the multinomial distribution associated with the top-level topic, and so on. The rest of the generation process is the same as in LDA.

The nested Chinese Restaurant Process (nCRP) [2] and the nested Hierarchical Dirichlet Process (nHDP) [5] are proposed as HTD methods. They assume that there is a true topic tree behind data. A prior distribution is placed over all possible trees using nCRP and nHDP respectively. An assumption is made as to how documents are generated from the true topic tree, which, together with data, gives a likelihood function over all possible trees. In nCRP, the topics in a document are assumed to be from one path down the tree, while in nHDP, the topics in a document can be from multiple paths, i.e., a subtree within the entire topic tree. The true topic tree is estimated by combining the prior and the likelihood in posterior inference. During inference, one in theory deals with a tree with infinitely many levels and each node having infinitely many children. In practice, the tree is truncated so that it has a predetermined number of levels. In nHDP, each node also has a predetermined number of children, and nCRP uses hyperparameters to control the number. As such, the two methods in effect require the user to provide the structure of a hierarchy as input.
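As a concrete illustration of the LDA generative process described above, the following sketch draws topics, per-document topic distributions and tokens in the way LDA assumes. It is a toy illustration only: the vocabulary, the number of topics K, the document lengths and the Dirichlet hyperparameters are made-up values, not settings used in this paper.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["space", "nasa", "orbit", "team", "season", "windows", "graphics"]
K, n_docs, doc_len = 3, 5, 20

beta = rng.dirichlet(np.ones(len(vocab)) * 0.1, size=K)    # K topics: distributions over words
for d in range(n_docs):
    theta_d = rng.dirichlet(np.ones(K) * 0.5)               # topic proportions for document d
    z = rng.choice(K, size=doc_len, p=theta_d)              # topic assignment Z_{d,n} for each token
    words = [vocab[rng.choice(len(vocab), p=beta[k])] for k in z]  # word W_{d,n} from beta_{Z_{d,n}}
    print(d, words)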
As mentioned in the introduction, HLTA models document collections without referring to a document generation process. Instead, it uses hierarchical latent tree models (HLTMs), and the latent variables in the models are regarded as unobserved attributes of the documents.

The concept of latent tree models was introduced in [16, 17], where they were referred to as hierarchical latent class models. The term "latent tree models" first appeared in [18, 19]. Latent tree models generalize two classes of models from the previous literature. The first class is latent class models [20, 21], which are used for categorical data clustering in social sciences and medicine. The second class is probabilistic phylogenetic trees [22], which are a tool for determining the evolution history of a set of species.

The reader is referred to [23] for a survey of research activities on latent tree models. The activities take place in three settings. In the first setting, data are assumed to be generated from an unknown LTM, and the task is to recover the generative model [24]. (Here data generated from a model are vectors of values for observed variables, not documents.) In this setting one tries to discover relationships between the latent structure and observed marginals that hold in LTMs, and then uses those relationships to reconstruct the true latent structure from data. One can prove theoretical results on consistency and sample complexity.

In the second setting, no assumption is made about how data are generated, and the task is to fit an LTM to data [25]. Here it does not make sense to talk about theoretical guarantees on consistency and sample complexity. Instead, algorithms are evaluated empirically using held-out likelihood. It has been shown that, on real-world datasets, better models can be obtained using methods developed in this setting than using those developed in the first setting [26]. The reason is that, although the assumption of the first setting is reasonable for data from domains such as phylogeny, it is not reasonable for other types of data such as text data and survey data.

The third setting is similar to the second setting, except that model fit is no longer the only concern. In addition, one needs to consider how useful the results are to users, and might want to, for example, obtain a hierarchy of latent variables. Liu et al. [27] are the first to use latent tree models for hierarchical topic detection. They propose an algorithm, namely HLTA, for learning HLTMs from text data and give a method for extracting topic hierarchies from the models. A method for scaling up the algorithm is proposed by Chen et al. [28]. This paper is based on [27, 28]. There are substantial extensions: the novelty of HLTA w.r.t. the LDA-based methods is now systematically discussed; the theory and algorithm are described in more detail and two practical issues are discussed; a new parameter estimation method is used for large datasets; and the empirical evaluations are more extensive.

Another method to learn a hierarchy of latent variables from data is proposed by
Ver Steeg and Galstyan [29]. The method is named correlation explanation (CorEx). Unlike HLTA, CorEx is proposed as a model-free method and hence it does not intend to provide a representation for the joint distribution of the observed variables.

HLTA produces a hierarchy with word variables at the bottom and multiple levels of latent variables on top. It is hence related to hierarchical variable clustering. One difference is that HLTA also partitions documents while variable clustering does not. There is a vast literature on document clustering [30]. In particular, co-clustering [31] can identify document clusters where each cluster is associated with a potentially different set of words. However, document clustering and topic detection are generally considered two different fields with little overlap. This paper bridges the two fields by developing a full-fledged HTD method that partitions documents in multiple ways.
3. Latent Tree Models

A latent tree model (LTM) is a tree-structured Bayesian network [32], where the leaf nodes represent observed variables and the internal nodes represent latent variables. An example is shown in Figure 1 (a). In this paper, all variables are assumed to be binary. The model parameters include a marginal distribution for the root Z_1, and a conditional distribution for each of the other nodes given its parent. The product of the distributions defines a joint distribution over all the variables.

Figure 1: The undirected latent tree model in (c) represents an equivalence class of directed latent tree models, which includes (a) and (b) as members.

In general, an LTM has n observed variables X = {X_1, ..., X_n} and m latent variables Z = {Z_1, ..., Z_m}. Denote the parent of a variable Y as pa(Y) and let pa(Y) be an empty set when Y is the root. Then the LTM defines a joint distribution over all observed and latent variables as follows:

P(X_1, \ldots, X_n, Z_1, \ldots, Z_m) = \prod_{Y \in \mathbf{X} \cup \mathbf{Z}} P(Y \mid pa(Y)).    (1)

By changing the root from Z_1 to Z_2 in Figure 1 (a), we get another model shown in (b). The two models are equivalent in the sense that they represent the same set of distributions over the observed variables [17]. It is not possible to distinguish between equivalent models based on data. This implies that the root of an LTM, and hence the orientations of the edges, are unidentifiable. It therefore makes more sense to talk about undirected LTMs, which is what we do in this paper. One example is shown in Figure 1 (c). It represents an equivalence class of directed models. A member of the class can be obtained by picking a latent node as the root and directing the edges away from the root. For example, (a) and (b) are obtained from (c) by choosing Z_1 and Z_2 to be the root respectively. In implementation, an undirected model is represented using an arbitrary directed model in the equivalence class it represents.

In the literature, there are variations of LTMs where some internal nodes are observed [24] and/or the variables are continuous [33, 34, 35]. In this paper, we focus on basic LTMs as defined in the previous two paragraphs.

We use |Z| to denote the number of possible states of a variable Z. An LTM is regular if, for any latent node Z, we have that

|Z| \le \frac{\prod_{i=1}^{k} |Z_i|}{\max_{i=1}^{k} |Z_i|},    (2)

where Z_1, ..., Z_k are the neighbors of Z, and that the inequality holds strictly when k = 2. When all variables are binary, the condition reduces to the requirement that each latent node must have at least three neighbors.

For any irregular LTM, there is a regular model that has fewer parameters and represents the same set of distributions over the observed variables [17]. Consequently, we focus only on regular models.
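To make Equation (1) concrete, the following sketch evaluates the joint distribution of a small binary LTM with two latent variables and four observed variables by multiplying the root marginal with the conditional distributions, and then sums out the latent variables to obtain the probability of one observed configuration. The structure and all conditional probability values are made-up illustrations, not parameters taken from the paper.

import itertools
import numpy as np

p_z1 = np.array([0.6, 0.4])                       # P(Z1)
p_z2_z1 = np.array([[0.8, 0.2], [0.3, 0.7]])      # P(Z2 | Z1), rows indexed by Z1
p_x_parent = {                                     # P(X_i | parent), rows indexed by parent state
    "X1": ("Z1", np.array([[0.9, 0.1], [0.2, 0.8]])),
    "X2": ("Z1", np.array([[0.7, 0.3], [0.4, 0.6]])),
    "X3": ("Z2", np.array([[0.85, 0.15], [0.1, 0.9]])),
    "X4": ("Z2", np.array([[0.6, 0.4], [0.25, 0.75]])),
}

def joint(x, z1, z2):
    """P(X1..X4, Z1, Z2) as the product of the conditionals, as in Equation (1)."""
    prob = p_z1[z1] * p_z2_z1[z1, z2]
    parents = {"Z1": z1, "Z2": z2}
    for name, (par, cpt) in p_x_parent.items():
        prob *= cpt[parents[par], x[name]]
    return prob

# marginal probability of one observed configuration, summing out the latent variables
x = {"X1": 1, "X2": 0, "X3": 1, "X4": 1}
print(sum(joint(x, z1, z2) for z1, z2 in itertools.product([0, 1], repeat=2)))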
4. Hierarchical Latent Tree Models and Topic Detection
We will later present an algorithm, called HLTA, for learning from text data models such as the one shown in Figure 2. There is a layer of observed variables at the bottom and multiple layers of latent variables on top. The model is hence called a hierarchical latent tree model (HLTM). In this section, we discuss how to interpret HLTMs and how to extract topics and topic hierarchies from them.
We use the toy model in Figure 2 as a running example. It is learned from a subset of the 20 Newsgroups data (http://qwone.com/~jason/20Newsgroups/). The variables at the bottom level, level 0, are observed binary variables that represent the presence/absence of words in a document. The latent variables at level 1 are introduced during data analysis to model word co-occurrence patterns. For example, one level-1 variable captures the probabilistic co-occurrence of the words nasa, space, shuttle and mission; another captures the probabilistic co-occurrence of the words orbit, earth, solar and satellite; and a third captures the probabilistic co-occurrence of the words lunar and moon. Latent variables at level 2 are introduced during data analysis to model the co-occurrence of the patterns at level 1. For example, Z_{21} represents the probabilistic co-occurrence of these three patterns.

Because the latent variables are introduced layer by layer, and each latent variable is introduced to explain the correlations among a group of variables at the level below, we regard, for the purpose of model interpretation, the edges between two layers as directed, and they are directed downwards. (The edges between top-level latent variables are not directed.) This allows us to talk about the subtree rooted at a latent node. For example, the subtree rooted at Z_{21} consists of the observed variables orbit, earth, ..., mission.

There are in total 14 latent variables in the toy example. Each latent variable has two states and hence partitions the document collection into two soft clusters. To figure out what the partition and the two clusters are about, we need to consider the relationship between the latent variable and the observed variables in its subtree. Take Z_{21} as an example. Denote the two document clusters it gives as Z_{21} = s0 and Z_{21} = s1.
The occurrence probabilities in the two clusters of the words in the Z_{21} subtree are given in Table 1, along with the sizes of the two clusters.

Table 1: Occurrence probabilities, in the two document clusters given by Z_{21} (of sizes 0.95 for s0 and 0.05 for s1), of the words space, nasa, orbit, earth, shuttle, moon and mission.

We see that the cluster Z_{21} = s1 consists of 5% of the documents. In this cluster, words such as space, nasa and orbit occur with relatively high probabilities. It is clearly meaningful and is interpreted as a topic. One might label the topic "NASA". The other cluster Z_{21} = s0 consists of 95% of the documents. In this cluster, the words occur with low probabilities. We interpret it as a background topic.

There are three subtle issues concerning Table 1. The first issue is how the word variables are ordered. To answer the question, we need the mutual information (MI) I(X; Y) [36] between two discrete variables X and Y, which is defined as follows:

I(X; Y) = \sum_{X, Y} P(X, Y) \log \frac{P(X, Y)}{P(X) P(Y)}.    (3)

In Table 1, the word variables are ordered according to their mutual information with Z_{21}. The words placed at the top of the table have the highest MI with Z_{21}. They are the best ones to characterize the difference between the two clusters because their occurrence probabilities in the two clusters differ the most. They occur with high probabilities in the cluster Z_{21} = s1 and with low probabilities in Z_{21} = s0. If one is to choose only the top, say 5, words to characterize the topic Z_{21} = s1, then the best words to pick are space, nasa, orbit, earth and shuttle.

The second issue is how the background topic is determined. The answer is that, among the two document clusters given by Z_{21}, the one where the words occur with lower probabilities is regarded as the background topic. In general, we consider the sum of the probabilities of the top 3 words. The cluster where the sum is lower is designated to be the background topic and labeled s0, and the other one is considered a genuine topic and labeled s1.

Finally, the creation of Table 1 requires the joint distribution of Z_{21} with each of the word variables in its subtree (e.g., P(space, Z_{21})). The distributions can be computed using Belief Propagation [32]. The computation takes linear time because the model is tree-structured.

If the background topics are ignored, each latent variable gives us exactly one topic. As such, the model in Figure 2 gives us 14 topics, which are shown in Table 2.

Table 2: Topic hierarchy given by the model in Figure 2.
[0.05] space nasa orbit earth shuttle
    [0.06] orbit earth solar satellite
    [0.05] space nasa shuttle mission
    [0.03] moon lunar
[0.14] team games players season hockey
    [0.14] team season
    [0.11] players baseball league
    [0.09] games win won
    [0.08] hockey nhl
[0.24] windows card graphics video dos
    [0.12] card video driver
    [0.15] windows dos
    [0.10] graphics display image
    [0.09] computer science

Latent variables at high levels of the hierarchy capture long-range word co-occurrence patterns and hence give thematically more general topics, while those at low levels of the hierarchy capture short-range word co-occurrence patterns and give thematically more specific topics. For example, the topic given by a level-2 latent variable (windows, card, graphics, video, dos) consists of a mixture of words about several aspects of computers. We can say that the topic is about computers. Its subtopics are each concerned with only one aspect of computers: (card, video, driver), (dos, windows), and (graphics, display, image).

In the introduction, we have discussed three differences between HLTA and the LDA-based methods. There are three other important differences. The fourth difference lies in the relationship between topics and documents. In the LDA-based methods, a document is a mixture of topics, and the probabilities of the topics within a document sum to 1. Because of this, the LDA models are sometimes called mixed-membership models. In HLTA, a topic is a soft cluster of documents, and a document might belong to multiple topics with probability 1. In this sense, HLTMs can be said to be multi-membership models.

The fifth difference between HLTA and the LDA-based methods is about the semantics of the hierarchies they produce. In the context of document analysis, a common concept of hierarchy is a rooted tree where each node represents a cluster of documents, and the cluster of documents at a node is the union of the document clusters at its children. Neither HLTA nor the LDA-based methods yield such hierarchies. nCRP and nHDP produce a tree of topics. The topics at higher levels appear more often than those at lower levels, but they are not necessarily related thematically. PAM yields a collection of topics that are organized into a directed acyclic graph. The topics at the lowest level are distributions over words, and topics at higher levels are distributions over topics at the level below and hence are called super-topics.
In contrast, the output of HLTA is a tree of latent variables. Latent variables at high levels of the tree capture long-range word co-occurrence patterns and hence give thematically more general topics, while latent variables at low levels of the tree capture short-range word co-occurrence patterns and hence give thematically more specific topics.

Finally, the LDA-based methods require the user to provide the structure of a hierarchy, including the number of latent levels and the number of nodes at each level. The number of latent levels is usually set at 3 for efficiency reasons. The contents of the nodes (distributions over the vocabulary) are learned from data. In contrast, HLTA learns both model structures and model parameters from data. The number of latent levels is not limited to 3.
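The following sketch illustrates how the words characterizing a topic could be ranked in the way described above: compute the mutual information of Equation (3) between each word variable and the latent variable from their joint distributions, and sort the words by MI. The joint distributions below are made-up stand-ins for the ones that Belief Propagation would produce on a learned model.

import numpy as np

def mutual_information(joint):
    """I(X; Z) for a 2x2 joint distribution P(X, Z), as in Equation (3)."""
    px = joint.sum(axis=1, keepdims=True)
    pz = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ pz)[mask])).sum())

word_joints = {  # hypothetical P(word, Z21): rows = word absent/present, cols = Z21 = s0/s1
    "space":   np.array([[0.93, 0.02], [0.02, 0.03]]),
    "nasa":    np.array([[0.94, 0.02], [0.01, 0.03]]),
    "bicycle": np.array([[0.90, 0.047], [0.05, 0.003]]),
}
ranked = sorted(word_joints, key=lambda w: mutual_information(word_joints[w]), reverse=True)
print(ranked)  # words with the highest MI characterize the topic best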
5. Model Structure Construction
We present the HLTA algorithm in this and the next two sections. The inputs to HLTA include a collection of documents and several algorithmic parameters. The outputs include an HLTM and a topic hierarchy extracted from the HLTM. Topic hierarchy extraction has already been explained in Section 4, and we will hence focus on how to learn the HLTM. In this section we describe the procedure for constructing the model structure. In Section 6 we discuss parameter estimation issues, and in Section 7 we discuss techniques employed to accelerate the algorithm.
The top-level control of HLTA is given in Algorithm 1 and the subroutines are given in Algorithms 2-6. In this subsection, we illustrate the top-level control using the toy dataset mentioned in Section 4, which involves 30 word variables.

Algorithm 1 HLTA(D, τ, μ, δ, κ)
Inputs: D — a collection of documents, τ — upper bound on the number of top-level topics, μ — upper bound on island size, δ — threshold used in UD-test, κ — number of EM steps on the final model.
Outputs: An HLTM and a topic hierarchy.
1: D_1 ← D, m ← null.
2: repeat
3:    m_1 ← LearnFlatModel(D_1, δ, μ);
4:    if m = null then
5:        m ← m_1;
6:    else
7:        m ← StackModels(m_1, m);
8:    end if
9:    D_1 ← HardAssignment(m_1, D_1);
10: until the number of top-level nodes in m ≤ τ.
11: Run EM on m for κ steps.
12: return m and the topic hierarchy extracted from m.

There are five steps. The first step (line 3) yields the model shown in Figure 3 (a). It is said to be a flat LTM because each latent variable is connected to at least one observed variable. In hierarchical models such as the one shown in Figure 2, on the other hand, only the latent variables at the lowest latent layer are connected to observed variables, and other latent variables are not. The learning of a flat model is the key step of HLTA. We will discuss it in detail later.

We refer to the latent variables in the flat model from the first step as level-1 latent variables. The second step (line 9) is to turn the level-1 latent variables into observed variables through data completion. To do so, the subroutine HardAssignment carries out inference to compute the posterior distribution of each latent variable for each document. The document is assigned to the state with the highest posterior probability, resulting in a dataset D_1 over the level-1 latent variables.

In the third step, line 3 is executed again to learn a flat LTM for the level-1 latent variables, resulting in the model shown in Figure 3 (b).

Figure 3: Illustration of the top-level control of HLTA: (a) a flat model over the word variables is first learned; (b) the latent variables in (a) are converted into observed variables through data completion and another flat model is learned for them; finally, the flat model in (b) is stacked on top of the flat model in (a) to obtain the hierarchical model in Figure 2.

In the fourth step (line 7), the flat model for the level-1 latent variables is stacked on top of the flat model for the observed variables, resulting in the hierarchical model in Figure 2. While doing so, the subroutine StackModels cuts off the links among the level-1 latent variables. The parameter values for the new model are copied from the two source models.

In general, the first four steps are repeated until the number of top-level latent variables falls below a user-specified upper bound τ (lines 2 to 10). In our running example, we set τ = 5. The number of nodes at the top level of our current model is 3, which is below the threshold τ. Hence the loop is exited.

In the fifth step (line 11), the EM algorithm [37] is run on the final hierarchical model for κ steps to improve its parameters, where κ is another user-specified input parameter.

The five steps can be grouped into two phases conceptually. The model construction phase consists of the first four steps. The objective is to build a hierarchical model structure.
The parameter estimation phase consists of the fifth step. The objective is to optimize the parameters of the hierarchical structure from the first phase.

Algorithm 2 LearnFlatModel(D, δ, μ)
1: L ← BuildIslands(D, δ, μ);
2: m ← BridgeIslands(L, D);
3: return m.

The objective of the flat model learning step is to find, among all flat models, the one that has the highest BIC score. The BIC score [38] of a model m given a dataset D is defined as follows:

BIC(m \mid D) = \log P(D \mid m, \theta^*) - \frac{d}{2} \log |D|,    (4)

where θ* is the maximum likelihood estimate of the model parameters, d is the number of free model parameters, and |D| is the sample size. Maximizing the BIC score intuitively means finding a model that fits the data well and that is not overly complex.

One way to solve the problem is through search. The state-of-the-art in this direction is an algorithm named EAST [25]. It has been shown [26] to find better models than alternative algorithms such as BIN [39] and CLRG [24]. However, it does not scale up. It is capable of handling data with only dozens of observed variables and is hence not suitable for text analysis.

In the following, we present an algorithm that, when combined with the parameter estimation technique to be described in the next section, is efficient enough to deal with large text data. The pseudocode is given in Algorithm 2. It calls two subroutines.

The first subroutine is BuildIslands. It partitions all word variables into clusters, such that the words in each cluster tend to co-occur and the co-occurrences can be properly modeled using a single latent variable. It then introduces a latent variable for each cluster to model the co-occurrence of the words inside it. In this way, for each cluster we obtain an LTM with a single latent variable, which is called a latent class model (LCM). In our running example, the results are shown in Figure 4. We metaphorically refer to the LCMs as islands.

Figure 4: The subroutine BuildIslands partitions word variables into uni-dimensional clusters and introduces a latent variable for each cluster to form an island (an LCM).
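A minimal sketch of the BIC score of Equation (4) is given below. The helper name bic_score and all numbers are illustrative assumptions; loglik stands for the maximized loglikelihood log P(D | m, θ*) that a parameter estimation routine would return.

import math

def bic_score(loglik: float, num_free_params: int, sample_size: int) -> float:
    """Equation (4): penalized loglikelihood."""
    return loglik - num_free_params / 2.0 * math.log(sample_size)

# e.g. an LCM with one binary latent variable over 4 binary words has
# 1 + 2 * 4 = 9 free parameters
print(bic_score(loglik=-5230.4, num_free_params=9, sample_size=2000))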
The second subroutine is BridgeIslands. It links up the islands by first estimating the mutual information between every pair of latent variables, and then finding the maximum spanning tree [40]. The result is the model in Figure 3 (a). We now set out to describe the two subroutines in detail.
Conceptually, a set of variables is said to be uni-dimensional if the correlations among them can be properly modeled using a single latent variable. Operationally, we rely on the uni-dimensionality test (UD-test) to determine whether a set of variables is uni-dimensional.

To perform the UD-test on a set S of observed variables, we first learn two latent tree models m_1 and m_2 for S and then compare their BIC scores. The model m_1 is the model with the highest BIC score among all LTMs with a single latent variable, and the model m_2 is the model with the highest BIC score among all LTMs with two latent variables. Figure 5 (b) shows what the two models might look like when S consists of the four word variables nasa, space, shuttle and mission. We conclude that S is uni-dimensional if the following inequality holds:

BIC(m_2 \mid D) - BIC(m_1 \mid D) < \delta,    (5)

where δ is a user-specified threshold. In other words, S is considered uni-dimensional if the best two-latent-variable model is not significantly better than the best one-latent-variable model.

Algorithm 3 BuildIslands(D, δ, μ)
1: V ← variables in D, L ← ∅.
2: while |V| > 0 do
3:    m ← OneIsland(D, V, δ, μ);
4:    L ← L ∪ {m};
5:    V ← variables in D but not in any m ∈ L;
6: end while
7: return L.

Note that the UD-test is related to the Bayes factor for comparing the two models [41]:

K = \frac{P(D \mid m_2)}{P(D \mid m_1)}.    (6)

The strength of evidence in favor of m_2 depends on the value of K. The following guidelines are suggested in [41]: if the quantity 2 log K is from 0 to 2, the evidence is negligible; if it is between 2 and 6, there is positive evidence in favor of m_2; if it is between 6 and 10, there is strong evidence in favor of m_2; and if it is larger than 10, then there is very strong evidence in favor of m_2. Here, "log" stands for natural logarithm.

It is well known that the BIC score BIC(m | D) is a large sample approximation of the marginal loglikelihood log P(D | m) [38]. Consequently, the difference BIC(m_2 | D) − BIC(m_1 | D) is a large sample approximation of the logarithm of the Bayes factor, log K. According to the cut-off values for the Bayes factor, we conclude that there is positive, strong, and very strong evidence favoring m_2 when the difference is larger than 1, 3 and 5 respectively. In our experiments, we always set δ = 3.
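Building on the bic_score helper sketched above, the UD-test of Equation (5) reduces to a comparison of two BIC scores. The function and the scores below are hypothetical; bic_m1 and bic_m2 stand for the scores of the best one-latent-variable and two-latent-variable models.

def ud_test_passes(bic_m1: float, bic_m2: float, delta: float = 3.0) -> bool:
    # Uni-dimensional if the two-latent-variable model is not better by more than delta,
    # i.e. the (approximate) log Bayes factor in favor of m2 stays below delta.
    return bic_m2 - bic_m1 < delta

print(ud_test_passes(bic_m1=-5230.4, bic_m2=-5229.1))  # True: keep a single latent variable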
The subroutine BuildIslands (Algorithm 3) builds islands one by one. It builds the first island by calling another subroutine, OneIsland (Algorithm 4). Then it removes the variables in the island from the dataset, and repeats the process to build other islands. It continues until all variables are grouped into islands.

Algorithm 4 OneIsland(D, V, δ, μ)
1: if |V| ≤ 3, m ← LearnLCM(D, V), return m.
2: S ← two variables in V with highest MI;
3: X ← arg max_{A ∈ V \ S} MI(A, S);
4: S ← S ∪ {X};
5: V ← V \ S;
6: D_1 ← ProjectData(D, S);
7: m ← LearnLCM(D_1, S).
8: loop
9:    X ← arg max_{A ∈ V} MI(A, S);
10:   W ← arg max_{A ∈ S} MI(A, X);
11:   D_1 ← ProjectData(D, S ∪ {X}), V ← V \ {X};
12:   m_1 ← PEM-LCM(m, S, X, D_1);
13:   if |V| = 0, return m_1.
14:   m_2 ← PEM-LTM-2L(m, S \ {W}, {W, X}, D_1);
15:   if BIC(m_2 | D_1) − BIC(m_1 | D_1) > δ then
16:      return m_2 with W, X and their parent removed.
17:   end if
18:   if |S| ≥ μ, return m_1.
19:   m ← m_1, S ← S ∪ {X}.
20: end loop

The subroutine OneIsland (Algorithm 4) requires a measure of how closely correlated each pair of variables is. In this paper, mutual information is used for the purpose. The mutual information I(X; Y) between two variables X and Y is given by (3). We will also need the mutual information (MI) between a variable X and a set of variables S. We estimate it as follows:

MI(X, S) = \max_{A \in S} MI(X, A).    (7)

The subroutine OneIsland maintains a working set S of observed variables. Initially, S consists of the pair of variables with the highest MI (line 2), which will be referred to as the seed variables for the island. Then the variable that has the highest MI with those two variables is added to S as the third variable (lines 3 and 4). Then other variables are added to S one by one. At each step, we pick the variable X that has the highest MI with the current set S (line 9), and perform the UD-test on the set S ∪ {X} (lines 12, 14, 15). If the UD-test passes, X is added to S (line 19) and the process continues. If the UD-test fails, one island is created and the subroutine returns (line 16). The subroutine also returns when the size of the island reaches a user-specified upper bound μ (line 18). In our experiments, we always set μ = 15.
Figure 5: Illustration of the OneIsland subroutine: (a) initial island; (b) UD-test passes after adding mission; (c) UD-test passes after adding moon; (d) UD-test fails after adding lunar; (e) final island.
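The following sketch mimics how OneIsland grows the working set S using Equation (7): starting from the seed pair with the highest MI, it repeatedly picks the outside variable with the highest MI to S. The pairwise MI values are made up, and the UD-test is abstracted away by a comment; the real subroutine would stop growing the island as soon as that test fails.

pairwise_mi = {  # hypothetical empirical MI values between word variables
    frozenset({"nasa", "space"}): 0.21, frozenset({"nasa", "shuttle"}): 0.15,
    frozenset({"space", "shuttle"}): 0.14, frozenset({"nasa", "moon"}): 0.05,
    frozenset({"space", "moon"}): 0.06, frozenset({"shuttle", "moon"}): 0.04,
}

def mi(a, b):
    return pairwise_mi[frozenset({a, b})]

def mi_to_set(x, S):
    """Equation (7): MI between a variable and a set of variables."""
    return max(mi(x, a) for a in S)

vocab = {"nasa", "space", "shuttle", "moon"}
seed = max(pairwise_mi, key=pairwise_mi.get)       # seed pair with the highest MI
S, rest = set(seed), vocab - set(seed)
while rest:
    x = max(rest, key=lambda a: mi_to_set(a, S))   # candidate with highest MI to S
    rest.discard(x)
    # here OneIsland would run the UD-test on S | {x} and return the island if it fails
    S.add(x)
print(S)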
The UD-test requires two models m_1 and m_2. In principle, they should be the best models with one and two latent variables respectively. For the sake of computational efficiency, we construct them heuristically in this paper. For m_1, we choose the LCM where the latent variable is binary and the parameters are optimized by a fast subroutine PEM-LCM that will be described in the next section.

Let W be the variable in S that has the highest MI with the variable X to be added to the island. For m_2, we choose the model where one latent variable is connected to the variables in S \ {W} and the second latent variable is connected to W and X. Both latent variables are binary and the model parameters are optimized by a fast subroutine PEM-LTM-2L that will be described in the next section.

Let us illustrate the OneIsland subroutine using the example in Figure 5. The pair of variables nasa and space have the highest MI among all variables, and they are hence the seed variables. The variable shuttle has the highest MI with the pair among all other variables, and hence it is chosen as the third variable to start the island (Figure 5 (a)). Among all the other variables, mission has the highest MI with the three variables in the model. To decide whether mission should be added to the group, the two models m_1 and m_2 in Figure 5 (b) are created. In m_2, shuttle is grouped with the new variable because it has the highest MI with the new variable among all the three variables in Figure 5 (a). It turns out that m_1 has a higher BIC score than m_2. Hence the UD-test passes and the variable mission is added to the group. The next variable to be considered for addition is moon, and it is added to the group because the UD-test passes again (Figure 5 (c)). After that, the variable lunar is considered. In this case, the BIC score of m_2 is significantly higher than that of m_1 and hence the UD-test fails (Figure 5 (d)). The subroutine OneIsland hence terminates. It returns an island, which is the part of the model m_2 that does not contain the last variable lunar (Figure 5 (e)). The island consists of the four words nasa, space, shuttle and mission. Intuitively, they are grouped together because they tend to co-occur in the dataset.

After the islands are created, the next step is to link them up so as to obtain a model over all the word variables.
This is carried out by the BridgeIslands subroutine, and the idea is borrowed from [42]. The subroutine first estimates the MI between each pair of latent variables in the islands, then constructs a complete undirected graph with the MI values as edge weights, and finally finds the maximum spanning tree of the graph. The parameters of the newly added edges are estimated using a fast method that will be described at the end of Section 6.3.

Let m and m′ be two islands with latent variables Y and Y′ respectively. The MI I(Y; Y′) between Y and Y′ is calculated using Equation (3) from the following joint distribution:

P(Y, Y' \mid D, m, m') = C \sum_{d \in D} P(Y \mid m, d) P(Y' \mid m', d),    (8)

where P(Y | m, d) is the posterior distribution of Y in m given the data case d, P(Y′ | m′, d) is that of Y′ in m′, and C is the normalization constant.

In our running example, applying BridgeIslands to the islands in Figure 4 results in the flat model shown in Figure 3 (a).
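A sketch of the two ingredients of BridgeIslands is given below: the joint distribution of two island-level latent variables obtained from per-document posteriors as in Equation (8), and a small Kruskal-style loop that keeps the maximum spanning tree over islands when pairwise MI values are used as edge weights. The posteriors and MI values are made-up illustrations.

import numpy as np

def latent_joint(post_y, post_y2):
    """Equation (8): normalized sum over documents of P(Y|m,d) outer P(Y'|m',d)."""
    joint = sum(np.outer(p, q) for p, q in zip(post_y, post_y2))
    return joint / joint.sum()

# made-up posteriors of two island latent variables over three documents
post_y1 = [np.array([0.9, 0.1]), np.array([0.2, 0.8]), np.array([0.7, 0.3])]
post_y2 = [np.array([0.8, 0.2]), np.array([0.3, 0.7]), np.array([0.6, 0.4])]
print(latent_joint(post_y1, post_y2))  # MI of this joint (Equation (3)) would weight the edge

# maximum spanning tree over islands, with MI(Y, Y') as edge weights
latent_mi = {("Y1", "Y2"): 0.12, ("Y1", "Y3"): 0.03, ("Y2", "Y3"): 0.08}
parent = {y: y for y in ("Y1", "Y2", "Y3")}
def find(y):
    while parent[y] != y:
        y = parent[y]
    return y
tree = []
for (a, b), w in sorted(latent_mi.items(), key=lambda kv: -kv[1]):  # heaviest edges first
    ra, rb = find(a), find(b)
    if ra != rb:                 # keep the edge only if it does not create a cycle
        parent[ra] = rb
        tree.append((a, b, w))
print(tree)  # edges of the maximum spanning tree linking the islands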
6. Parameter Estimation during Model Construction
In the model construction phase, a large number of intermediate models are generated. Whether HLTA can scale up depends on whether the parameters of those intermediate models and the final model can be estimated efficiently. In this section, we present a fast method called progressive EM for estimating the parameters of the intermediate models. In the next section, we will discuss how to estimate the parameters of the final model efficiently when the sample size is very large.
We start by briefly reviewing the EM algorithm. Let X and H be respectively the sets of observed and latent variables in an LTM m, and let V = X ∪ H. Assume one latent variable is picked as the root and all edges are directed away from the root. For any V in V that is not the root, the parent pa(V) of V is a latent variable and can take values '0' or '1'. For technical convenience, let pa(V) be a dummy variable with only one possible value when V is the root. Enumerate all the variables as V_1, V_2, ..., V_n. We denote the parameters of m as

\theta_{ijk} = P(V_i = k \mid pa(V_i) = j),    (9)

where i ∈ {1, ..., n}, k is a value of V_i and j is a value of pa(V_i). Let θ be the vector of all the parameters.

Given a dataset D, the loglikelihood function of θ is given by

l(\theta \mid D) = \sum_{d \in D} \log \sum_{H} P(d, H \mid \theta).    (10)

The maximum likelihood estimate (MLE) of θ is the value that maximizes the loglikelihood function.

Due to the presence of latent variables, it is intractable to directly maximize the loglikelihood function. An iterative method called the Expectation-Maximization (EM) algorithm [37] is usually used in practice. EM starts with an initial guess θ^{(0)} of the parameter values, and then produces a sequence of estimates θ^{(1)}, θ^{(2)}, .... Given the current estimate θ^{(t)}, the next estimate θ^{(t+1)} is obtained through an E-step and an M-step. In the context of latent tree models, the two steps are as follows:

• The E-step:

n^{(t)}_{ijk} = \sum_{d \in D} P(V_i = k, pa(V_i) = j \mid d, m, \theta^{(t)}).    (11)

• The M-step:

\theta^{(t+1)}_{ijk} = \frac{n^{(t)}_{ijk}}{\sum_{k} n^{(t)}_{ijk}}.    (12)

Note that the E-step requires the calculation of P(V_i, pa(V_i) | d, m, θ^{(t)}) for each data case d ∈ D and each variable V_i. For a given data case d, we can calculate P(V_i, pa(V_i) | d, m, θ^{(t)}) for all variables V_i in linear time using message propagation [43].

EM terminates when the improvement in loglikelihood, l(θ^{(t+1)} | D) − l(θ^{(t)} | D), falls below a predetermined threshold or when the number of iterations reaches a predetermined limit. To avoid local maxima, multiple restarts are usually used.
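As an illustration of the E-step and M-step of Equations (11) and (12), the sketch below runs batch EM for the simplest latent tree model, an LCM with one binary latent variable over binary word variables; for such a small model the posterior needed in the E-step can be computed directly without message propagation. The data and initial parameters are randomly generated stand-ins.

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 4))        # 200 documents, 4 binary word variables

p_z = np.array([0.5, 0.5])                      # P(Z)
p_x_z = rng.uniform(0.3, 0.7, size=(2, 4))      # P(X_i = 1 | Z = j), rows indexed by j

for _ in range(50):
    # E-step: posterior P(Z | d) for every document (the statistics of Equation (11))
    like = np.ones((len(data), 2))
    for j in range(2):
        like[:, j] = np.prod(np.where(data == 1, p_x_z[j], 1 - p_x_z[j]), axis=1)
    post = like * p_z
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-normalize the expected counts (Equation (12))
    p_z = post.mean(axis=0)
    p_x_z = (post.T @ data) / post.sum(axis=0)[:, None]

print(p_z, p_x_z, sep="\n")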
Being an iterative algorithm, EM can be trapped in local maxima. It is also time-consuming and does not scale up well. Progressive EM is proposed as a fast alternative to EM for the model construction phase. It estimates all the parameters in multiple steps and, in each step, it considers a small part of the model and runs EM in the submodel to maximize the local likelihood function. The idea is illustrated in Figure 6. Assume Y is selected to be the root. To estimate all the parameters of the model, we first run EM in the part of the model shaded in Figure 6 (a) to estimate P(Y), P(A | Y), P(B | Y) and P(D | Y), and then run EM in the part of the model shaded in Figure 6 (b), with P(Y), P(B | Y) and P(D | Y) fixed, to estimate P(Z | Y), P(C | Z) and P(E | Z).

Figure 6: Progressive EM: EM is first run in the submodel shaded in (a) to estimate the distributions P(Y), P(A | Y), P(B | Y) and P(D | Y); then EM is run in the submodel shaded in (b), with P(Y), P(B | Y) and P(D | Y) fixed, to estimate the distributions P(Z | Y), P(C | Z) and P(E | Z).

We use progressive EM to estimate the parameters of the intermediate models generated by HLTA, specifically those generated by the subroutine OneIsland (Algorithm 4). It is carried out by the two subroutines PEM-LCM and PEM-LTM-2L.

At lines 1 and 7, OneIsland needs to estimate the parameters of an LCM with three observed variables. This is done using EM. Next, it enters a loop. At the beginning, we have an LCM m for a set S of variables. The parameters of the LCM have been estimated earlier (at line 7 at the beginning, or at line 12 of the previous pass through the loop). At lines 9 and 10, OneIsland finds the variable X outside S that has maximum MI with S, and the variable W inside S that has maximum MI with X.

At line 12, OneIsland adds X to m to create a new LCM m_1. The parameters of m_1 are estimated using the subroutine PEM-LCM (Algorithm 5), which is an application of progressive EM.

Algorithm 5 PEM-LCM(m, S, X, D_1)
1: Y ← the latent variable of m;
2: S_1 ← {X} ∪ two seed variables in S;
3: While keeping the other parameters fixed, run EM in the part of m_1 that involves S_1 ∪ {Y} to estimate P(X | Y);
4: return m_1.

Algorithm 6 PEM-LTM-2L(m, S \ {W}, {W, X}, D_1)
1: Y ← the latent variable of m;
2: m_2 ← model obtained from m by adding X and a new latent variable Z, connecting Z to Y, connecting X to Z, and re-connecting W (connected to Y before) to Z;
3: S_1 ← {W, X} ∪ two seed variables in S;
4: While keeping the other parameters fixed, run EM in the part of m_2 that involves S_1 ∪ {Y, Z} to estimate only P(W | Z), P(X | Z) and P(Z | Y);
5: return m_2.

Let us explain PEM-LCM using the intermediate models shown in Figure 5. Let m be the model shown on the left of Figure 5 (c) and S = {nasa, space, shuttle, mission, moon}. The variable X to be added to m is lunar, and the model m_1 after adding lunar to m is shown on the left of Figure 5 (d). The only distribution to be estimated is P(lunar | Y), as the other distributions have already been estimated. PEM-LCM estimates the distribution by running EM on a part of the model m_1 shown in Figure 7 (left), where the variables involved are in rectangles. The variables nasa and space are included in the submodel, instead of other observed variables, because they were the seed variables picked at line 2 of Algorithm 4.

At line 14, OneIsland adds X to m to create a new LTM m_2 with two latent variables. The parameters of m_2 are estimated using the subroutine PEM-LTM-2L (Algorithm 6), which is also an application of progressive EM. In our running example, let moon be the variable W that has the highest MI with lunar among all variables in S. Then the model m_2 is as shown on the right-hand side of Figure 5 (d).
The distributions to be estimated are P(Z | Y), P(moon | Z) and P(lunar | Z). PEM-LTM-2L estimates the distributions by running EM on a part of the model m_2 shown in Figure 7 (right), where the variables involved are in rectangles. The variables nasa and space are included in the submodel, instead of shuttle and mission, because they were the seed variables picked at line 2 of Algorithm 4. (When one of the seed variables is W, we use the other seed variable and the variable picked at line 3 of Algorithm 4.)

Figure 7: Deciding whether lunar should be added to the island m in Figure 5 (c): two models are created. We need to estimate only P(lunar | Y) for the model on the left, and P(Z | Y), P(moon | Z) and P(lunar | Z) for the model on the right. The estimation is done by running EM in the parts of the models where the variable names are in rectangles.

There is also a parameter estimation problem inside the subroutine BridgeIslands. After linking up the islands, the parameters for the edges between latent variables must be estimated. We use progressive EM for this task as well. Consider the model in Figure 3 (a). To estimate the conditional distribution for the edge between two neighboring latent variables, we form a sub-model by picking two children of one of them, for instance nasa and space, and two children of the other, for instance orbit and earth. Then we estimate the distribution by running EM in the submodel with all other parameters fixed.

Let n be the number of observed variables and N be the sample size. HLTA requires the computation of the empirical MI between each pair of observed variables. This takes O(n^2 N) time.

When building islands for the observed variables, HLTA generates roughly n intermediate models. Progressive EM is used to estimate the parameters of the intermediate models. It is run on submodels with 3 or 4 observed variables. The projection of a dataset onto 3 or 4 binary variables consists of only 8 or 16 distinct cases no matter how large the original sample size is. Hence progressive EM takes constant time, which we denote by c_1, on each submodel. This is the key reason why HLTA can scale up. The data projection takes O(N) time for each submodel. Hence the total time for island building is O(2n(N + c_1)).

To bridge the islands, HLTA needs to estimate the MI between every pair of latent variables and runs progressive EM to estimate the parameters for the edges between the islands. A loose upper bound on the running time of this step is n^2 N + n(N + c_1). The total number of variables (observed and latent) in the resulting flat model is upper bounded by 2n. Inference on the model takes no more than 4n propagation steps for each data case. Let c_2 be the time for each propagation step. Then the hard assignment step takes O(4n c_2 N) time. So, the total time for the first pass through the loop in HLTA is O(2n^2 N + 3n(N + c_1) + 4n c_2 N) = O(2n^2 N + 3n c_1 + 4n c_2 N), where the term nN is ignored because it is dominated by the term n c_2 N.

As we move up one level, the number of "observed" variables is decreased by at least half. Hence, the total time for the model construction phase is upper bounded by O(4n^2 N + 6n c_1 + 8n c_2 N).

The total number of variables (observed and latent) in the final model is upper bounded by 2n. Hence, one EM iteration takes O(4n c_2 N) time and the final parameter optimization step takes O(4n c_2 N κ) time.

The total running time of HLTA is O(4n^2 N + 6n c_1 + 8n c_2 N) + O(4n c_2 N κ). The two terms are the times for the model construction phase and the parameter estimation phase respectively.
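The following sketch conveys the idea behind PEM-LCM: the parameters involving the seed variables are kept fixed, and EM is run on data projected onto a three-variable submodel to estimate only the conditional distribution of the newly added word. All variable names and numbers are hypothetical stand-ins.

import numpy as np

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(500, 3))        # projected data; columns: seed A, seed B, new word X

p_z = np.array([0.6, 0.4])                      # fixed P(Z)
p_a_z = np.array([0.9, 0.2])                    # fixed P(A = 1 | Z), indexed by Z
p_b_z = np.array([0.8, 0.3])                    # fixed P(B = 1 | Z)
p_x_z = np.array([0.5, 0.5])                    # the only parameters being estimated

def bern(p, x):                                 # P(value x) for a Bernoulli with P(1) = p
    return np.where(x == 1, p, 1 - p)

for _ in range(30):
    # E-step on the submodel: P(Z | a, b, x) with the fixed parameters
    like = np.stack([p_z[j] * bern(p_a_z[j], data[:, 0]) *
                     bern(p_b_z[j], data[:, 1]) * bern(p_x_z[j], data[:, 2])
                     for j in range(2)], axis=1)
    post = like / like.sum(axis=1, keepdims=True)
    # M-step: update only P(X = 1 | Z); everything else stays fixed
    p_x_z = (post * data[:, [2]]).sum(axis=0) / post.sum(axis=0)

print(p_x_z)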
7. Dealing with Large Datasets
We employ two techniques to further accelerate HLTA so that it can handle large datasets with millions of documents. The first technique is downsampling, and we use it to reduce the complexity of the model construction phase. Specifically, we use a subset of N′ randomly sampled data cases instead of the entire dataset and thereby reduce the complexity to O(4n^2 N′ + 6n c_1 + 8n c_2 N′). When N is very large, we can set N′ to be a small fraction of N and hence achieve substantial computational savings. In the meantime, we can still expect to obtain a good structure if N′ is not too small. The reason is that model construction relies on salient regularities of the data, and those regularities should be preserved in the subset when N′ is not too small.

The second technique is stepwise EM [44, 45]. We use it to accelerate the convergence of the parameter estimation process in the second phase, where the task is to improve the values of the parameters θ = {θ_{ijk}} (Equation (9)) obtained in the model construction phase. While standard EM, a.k.a. batch EM, updates the parameters once in each iteration, stepwise EM updates the parameters multiple times in each iteration.

Suppose the dataset D is randomly divided into equal-sized minibatches D_1, ..., D_B. Stepwise EM updates the parameters after processing each minibatch. It maintains a collection of auxiliary variables n_{ijk}, which are initialized before the first minibatch is processed. Suppose the parameters have been updated u − 1 times before and the current values are θ = {θ_{ijk}}. Let D_b be the next minibatch to process. Stepwise EM carries out the updating as follows:

n'_{ijk} = \sum_{d \in D_b} P(V_i = k, pa(V_i) = j \mid d, m, \theta),    (13)

n_{ijk} = (1 - \eta_u) n_{ijk} + \eta_u n'_{ijk},    (14)

\theta_{ijk} = \frac{n_{ijk}}{\sum_{k} n_{ijk}}.    (15)

Note that Equation (13) is similar to (11) except that the statistics are calculated on the minibatch D_b rather than the entire dataset D. The parameter η_u is known as the step size and is given by η_u = (u + 2)^{−α}, where the parameter α is to be chosen in the range 0.5 ≤ α ≤ 1 [46]. In all our experiments, α was fixed at a value in this range.

Stepwise EM is similar to stochastic gradient descent [47] in that it updates the parameters after processing each minibatch. It has been shown to yield estimates of the same or even better quality as batch EM, and it converges much faster than the latter [46]. As such, we can run it for many fewer iterations than batch EM and thereby substantially reduce the running time.
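The stepwise EM update of Equations (13)-(15) is sketched below for the same toy LCM used earlier: the expected counts of each minibatch are blended into running statistics with step size η_u = (u + 2)^(−α) and the parameters are re-normalized after every minibatch. The data, the number of minibatches, the initialization of the running statistics and α = 0.75 are made-up illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
data = rng.integers(0, 2, size=(5000, 4))
minibatches = np.array_split(data, 5)           # B = 5 minibatches

alpha = 0.75
p_z = np.array([0.5, 0.5])
p_x_z = rng.uniform(0.3, 0.7, size=(2, 4))
n_z = np.ones(2)                                # running statistics
n_xz = np.ones((2, 4))

for u, batch in enumerate(minibatches):
    eta = (u + 2) ** (-alpha)
    # Equation (13): expected counts on the minibatch with the current parameters
    like = np.stack([np.prod(np.where(batch == 1, p_x_z[j], 1 - p_x_z[j]), axis=1)
                     for j in range(2)], axis=1) * p_z
    post = like / like.sum(axis=1, keepdims=True)
    # Equation (14): blend minibatch statistics into the running statistics
    n_z = (1 - eta) * n_z + eta * post.sum(axis=0)
    n_xz = (1 - eta) * n_xz + eta * (post.T @ batch)
    # Equation (15): re-normalize to get the new parameter values
    p_z = n_z / n_z.sum()
    p_x_z = n_xz / n_z[:, None]

print(p_z)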
8. Illustration of Results and Practical Issues

HLTA is a novel method for hierarchical topic detection and, as discussed in the introduction, it is fundamentally different from the LDA-based methods. We will empirically compare HLTA with the LDA-based methods in the next section. In this section, we present the results HLTA obtains on a real-world dataset so that the reader can gain a clear understanding of what it has to offer. We also discuss two practical issues.

HLTA is implemented in Java. The source code is available online (~lzhang/topic/index.htm), along with the datasets used in this paper and the full details of the results obtained on them. HLTA has been tested on several datasets. One of them is the NYT dataset (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words), which consists of 300,000 articles published in the New York Times between 1987 and 2007. A vocabulary of 10,000 words was selected using average TF-IDF [48] for the analysis. The average TF-IDF of a term t in a collection of documents D is defined as follows:

tf\text{-}idf(t, D) = \frac{\sum_{d \in D} tf(t, d) \cdot idf(t, D)}{|D|},    (16)

where |·| stands for the cardinality of a set, tf(t, d) is the term frequency of t in document d, and idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|) is the inverse document frequency of t in the document collection D.

A subset of 10,000 randomly sampled data cases was used in the model construction phase. Stepwise EM was used in the parameter estimation phase, and the size of the minibatches was set at 1,000. Other parameter settings are given in the next section. The analysis took around 420 minutes on a desktop machine.
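A small sketch of the average TF-IDF score of Equation (16), which was used to select the vocabulary, is given below; the toy corpus is made up.

import math
from collections import Counter

docs = [["space", "nasa", "space", "shuttle"],
        ["team", "season", "players"],
        ["space", "orbit", "earth", "nasa"]]

def avg_tf_idf(term, docs):
    n_docs = len(docs)
    df = sum(1 for d in docs if term in d)                  # document frequency of the term
    idf = math.log(n_docs / df) if df else 0.0              # idf(t, D)
    tf_total = sum(Counter(d)[term] for d in docs)          # sum of tf(t, d) over D
    return tf_total * idf / n_docs                          # Equation (16)

vocab = sorted({w for d in docs for w in d}, key=lambda t: -avg_tf_idf(t, docs))
print([(t, round(avg_tf_idf(t, docs), 3)) for t in vocab[:5]])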
The result is an HLTM with 5 levels of latent variables and 21 latent variables at the top level. Figure 8 shows a part of the model structure. Four top-level latent variables are included in the figure. The level-4 and level-2 latent variables in the subtrees rooted at these top-level latent variables are also included.

Figure 8: Part of the hierarchical latent tree model obtained by HLTA on the NYT dataset. The nodes at the top are top-level latent variables. The subtrees rooted at different top-level latent variables are coded using different colors. Edge width represents mutual information.
Table 3 (excerpt): A part of the topic hierarchy extracted from the model in Figure 8; only the two top-level topics are reproduced here.
1. [0.20] economy stock economic market dollar
2. [0.22] companies company firm industry incentive
In Figure 8, each level-2 latent variable is connected to four word variables from its subtree, namely the word variables that have the highest MI with the latent variable among all word variables in the subtree; a sketch of this selection is given below.

The structure is interesting. We see that most words in the subtree rooted at Z are about economy and stock market; most words in the subtree rooted at Z are about companies and various industries; most words in the subtree rooted at Z are about movies and music; and most words in the subtree rooted at Z are about cooking.

Table 3 shows a part of the topic hierarchy extracted from the part of the model shown in Figure 8. The topics and the relationships among them are meaningful. For example, topic 1 is about economy and stock market. It splits into two groups of subtopics, one on economy and another on stock market. Each subtopic further splits into sub-subtopics. For example, subtopic 1.2 under economy splits into three subtopics: currency expansion, labor union and minimum wages. Topic 2 is about company-firm-industry. Its subtopics include several types of companies, such as insurance, retail stores/consumer products, natural gas/oil, drug, and so on.
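The MI-based selection of characterizing words mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes that the pairwise joint distributions P(Y, W) between a binary latent variable Y and each word variable W in its subtree have already been computed by inference in the model, and all names are illustrative.

import java.util.*;
import java.util.stream.Collectors;

// Sketch: select the word variables that have the highest mutual information (MI)
// with a binary latent variable Y.  joints.get(w) is the 2x2 joint table P(Y, W = w),
// assumed to have been computed beforehand by inference in the model.
class TopWordSelector {
    static double mutualInformation(double[][] p) {      // p[y][w], a 2x2 joint table
        double[] py = {p[0][0] + p[0][1], p[1][0] + p[1][1]};
        double[] pw = {p[0][0] + p[1][0], p[0][1] + p[1][1]};
        double mi = 0.0;
        for (int y = 0; y < 2; y++)
            for (int w = 0; w < 2; w++)
                if (p[y][w] > 0) mi += p[y][w] * Math.log(p[y][w] / (py[y] * pw[w]));
        return mi;
    }

    static List<String> topWords(Map<String, double[][]> joints, int k) {
        return joints.entrySet().stream()
                .sorted((a, b) -> Double.compare(mutualInformation(b.getValue()),
                                                 mutualInformation(a.getValue())))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}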
Next we discuss two practical issues.

In HLTA, each latent variable is introduced to model a pattern of probabilistic word co-occurrence. It also gives us a topic, which is a soft cluster of documents. The size of the topic is determined by considering not only the words in the pattern, but all the words in the vocabulary. As such, it conceptually includes two types of documents: (1) documents that contain, in a probabilistic sense, the pattern, and (2) documents that do not contain the pattern but are otherwise similar to those that do. Because of the inclusion of the second type of documents, the topic is said to be broadly defined. All the topics reported above are broadly defined.

The size of a broadly defined topic might appear unrealistically large at first glance. For example, one topic detected from the NYT dataset consists of the words affair, widely, scandal, viewed, intern, monica lewinsky, and its size is … . Although this seems too large, it is actually reasonable. Obviously, the fraction of documents that contain the seven words in the topic should be much smaller than that. However, those documents also contain many other words, such as bill and clinton, about American politics. Other documents that contain many of those other words are also included in the topic, and hence it is not surprising for the topic to cover such a large fraction of the corpus. As a matter of fact, there are several other topics about American politics that are of similar sizes. One of them is: corruption campaign political democratic presidency.

In some applications, it might be desirable to identify narrowly defined topics, i.e., topics made up of only the documents containing particular patterns. Such topics can be obtained as follows: First, pick a list of words to characterize a topic using the method described in Section 4; then, form a latent class model using those words as observed variables; and finally, use the model to partition all documents into two clusters. The cluster where the words occur with relatively higher probabilities is designated as the narrow topic. The size of a narrowly defined topic is typically much smaller than that of the broadly defined version. For example, the sizes of the narrowly defined versions of the two topics from the previous paragraph are 0.008 and 0.169 respectively.

Learning a latent class model for each latent variable from scratch is time-consuming. To accelerate the process, one can calculate the parameters for the latent class model from the global HLTM, fix the conditional distributions, and update only the marginal distribution of the latent variable a number of times, say 10 times. A sketch of this procedure is given below.
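The following is a minimal sketch of this accelerated procedure, not the actual implementation. It assumes that the characterizing words have been turned into binary document vectors, that the conditional word distributions are read off the global HLTM and kept fixed, and that only the class marginal is re-estimated; all names are illustrative.

// Sketch of extracting a narrowly defined topic with a two-class latent class model
// whose conditional word distributions are read off the global HLTM and kept fixed;
// only the class marginal is re-estimated.  docs[d][i] is 1 if the i-th characterizing
// word occurs in document d; pWordGivenClass[c][i] = P(word_i = 1 | class = c).
class NarrowTopic {
    static double[] estimateClassMarginal(int[][] docs, double[][] pWordGivenClass,
                                          double[] prior, int iterations) {
        int numClasses = prior.length, numWords = pWordGivenClass[0].length;
        for (int it = 0; it < iterations; it++) {          // e.g. 10 updates
            double[] newPrior = new double[numClasses];
            for (int[] doc : docs) {
                double[] post = new double[numClasses];
                double norm = 0.0;
                for (int c = 0; c < numClasses; c++) {
                    double p = prior[c];
                    for (int i = 0; i < numWords; i++)
                        p *= doc[i] == 1 ? pWordGivenClass[c][i] : 1.0 - pWordGivenClass[c][i];
                    post[c] = p;
                    norm += p;
                }
                // Accumulate the posterior class probabilities of this document.
                for (int c = 0; c < numClasses; c++) newPrior[c] += post[c] / norm;
            }
            for (int c = 0; c < numClasses; c++) prior[c] = newPrior[c] / docs.length;
        }
        // The marginal of the class in which the words occur with higher probability
        // corresponds to the size of the narrowly defined topic.
        return prior;
    }
}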
In HLTMs, each observed variable is connected to only one latent variable. If individual words are used as observed variables, then each word would appear in only one branch of the resulting topic hierarchy. This is not reasonable. Take the word "network" as an example. It can appear in different terms such as "neural network", "Bayesian network", "constraint network", "social network", "sensor network", and so on. Clearly, those terms should appear in different branches in a good topic hierarchy.

A method to mitigate the problem is proposed in [49]. It first treats individual words as tokens and finds the top n tokens with the highest average TF-IDF. Let PICK-TOP-TOKENS(D, n) be the subroutine which does that. The method then calculates the TF-IDF values of all 2-grams and includes the top ones as tokens, after which PICK-TOP-TOKENS(D, n) is run again to pick a new set of n tokens. The process can be repeated if one wants to consider n-grams with n > 2 as tokens; a sketch of the 2-gram step is given after Table 4 below. The method has been applied in an analysis of papers published at AAAI and IJCAI between 2000 and 2015. Table 4 shows some of the topics from the resulting topic hierarchy. They all contain the word "network" and are from different branches of the hierarchy.

Table 4: Topics detected from AAAI/IJCAI papers (2000-15) that contain the word "network".
[0.05] neural-network neural hidden-layer layer activation
[0.08] bayesian-network probabilistic-inference variable-elimination
[0.04] dynamic-bayesian-network dynamic-bayesian slice time-slice
[0.03] domingos markov-logic-network richardson-domingos
[0.06] dechter constraint-network freuder consistency-algorithm
[0.09] social-network twitter social-media social tweet
[0.03] complex-network community-structure community-detection
[0.01] semantic-network conceptnet partof
[0.09] wireless sensor-network remote radar radio beacon
[0.08] traffic transportation road driver drive road-network
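The 2-gram step of the method just described can be sketched as follows. This is a simplified illustration under assumptions, not the implementation of [49]: pickTopTokens stands for the PICK-TOP-TOKENS(D, n) subroutine, documents are given as lists of tokens, and the exact way the selected 2-grams are scored and merged back into the documents may differ in [49].

import java.util.*;

// Simplified sketch of the 2-gram step: rank 2-grams with PICK-TOP-TOKENS, then
// rewrite the documents so that the selected 2-grams become single tokens.
class NGramTokens {
    static List<List<String>> mergeTopBigrams(List<List<String>> docs, int numBigrams) {
        // Form the 2-grams of each document.
        List<List<String>> bigramDocs = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> bd = new ArrayList<>();
            for (int i = 0; i + 1 < doc.size(); i++)
                bd.add(doc.get(i) + "-" + doc.get(i + 1));
            bigramDocs.add(bd);
        }
        Set<String> topBigrams = new HashSet<>(pickTopTokens(bigramDocs, numBigrams));
        // Rewrite the documents, replacing occurrences of the selected 2-grams by
        // single tokens; PICK-TOP-TOKENS(D, n) is then run again on the result.
        List<List<String>> rewritten = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> rd = new ArrayList<>();
            for (int i = 0; i < doc.size(); i++) {
                if (i + 1 < doc.size() && topBigrams.contains(doc.get(i) + "-" + doc.get(i + 1))) {
                    rd.add(doc.get(i) + "-" + doc.get(i + 1));
                    i++;                                   // skip the merged word
                } else {
                    rd.add(doc.get(i));
                }
            }
            rewritten.add(rd);
        }
        return rewritten;
    }

    // Stand-in for the PICK-TOP-TOKENS(D, n) subroutine mentioned in the text.
    static List<String> pickTopTokens(List<List<String>> docs, int n) {
        throw new UnsupportedOperationException("stand-in for PICK-TOP-TOKENS(D, n)");
    }
}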
9. Empirical Comparisons
We now present empirical results comparing HLTA with LDA-based methods for hierarchical topic detection, including the nested Chinese restaurant process (nCRP) [2], the nested hierarchical Dirichlet process (nHDP) [5] and the hierarchical Pachinko allocation model (hPAM) [4]. Also included in the comparisons is CorEx [29]. CorEx produces a hierarchy of latent variables, but not a probability model over all the variables. For comparability, we convert its results into a hierarchical latent tree model as follows. Let Y be a latent variable and Z_1, \ldots, Z_k its children, and let Z_i^{(d)} be the value of Z_i in a data case d (obtained via hard assignment if Z_i is a latent variable). CorEx gives the distribution P(Y \mid Z_1^{(d)}, \ldots, Z_k^{(d)}) for each data case d. Let \mathbb{1}^{(d)}(Z_1, \ldots, Z_k) be the indicator function that takes value 1 when Z_i = Z_i^{(d)} for all i and 0 otherwise. Then the expression

\frac{1}{|D|} \sum_{d \in D} P(Y \mid Z_1^{(d)}, \ldots, Z_k^{(d)}) \, \mathbb{1}^{(d)}(Z_1, \ldots, Z_k)

defines a joint distribution over Y and Z_1, \ldots, Z_k. From the joint distribution, we obtain P(Z_i \mid Y) for each i, and also P(Y) if Y is the root; a sketch of this conversion is given below.

Three datasets were used in our experiments. The first one is the NYT dataset mentioned before. The second one is the 20 Newsgroups dataset (http://qwone.com/∼jason/20Newsgroups/), referred to as Newsgroup, which consists of 19,940 newsgroup posts. The third one is the NIPS dataset (∼roweis/data.html), which consists of 1,955 articles published at the NIPS conference between 1988 and 1999. Symbols, stop words and words that barely occur were removed from all the datasets. The LDA-based methods were run on both binary and bag-of-words versions of the data; the corresponding results are denoted as nCRP-bin, nHDP-bin, hPAM-bin, nCRP-bow, nHDP-bow and hPAM-bow respectively.
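The conversion of CorEx output described above amounts to a simple averaging over data cases. The following sketch illustrates it under the assumption that all variables are binary and that CorEx's per-case posteriors and hard assignments are available as arrays; the names are illustrative and this is not the exact code used in the experiments.

// Sketch of converting CorEx output for one latent variable Y with children Z_1,...,Z_k
// into latent tree model parameters, following the construction above.  posterior[d][y]
// is P(Y = y | Z_1^(d), ..., Z_k^(d)) returned by CorEx for data case d; zvals[d][i] is
// the (hard-assigned) value of Z_i in data case d.  pYOut must be a zero-initialized
// array of length 2; on return it holds P(Y), used if Y is the root.
class CorexConversion {
    static double[][][] conditionals(double[][] posterior, int[][] zvals, double[] pYOut) {
        int numCases = posterior.length, k = zvals[0].length;
        double[][][] pZY = new double[k][2][2];            // joint P(Z_i = z, Y = y)
        for (int d = 0; d < numCases; d++) {
            for (int y = 0; y < 2; y++) {
                pYOut[y] += posterior[d][y] / numCases;
                for (int i = 0; i < k; i++)
                    pZY[i][zvals[d][i]][y] += posterior[d][y] / numCases;
            }
        }
        double[][][] pZgivenY = new double[k][2][2];       // P(Z_i = z | Y = y)
        for (int i = 0; i < k; i++)
            for (int z = 0; z < 2; z++)
                for (int y = 0; y < 2; y++)
                    pZgivenY[i][z][y] = pZY[i][z][y] / pYOut[y];
        return pZgivenY;
    }
}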
HLTA was run in two modes. In the first mode, denoted as HLTA-batch, the entire dataset was used in the model construction phase and batch EM was used in the parameter estimation phase. In the second mode, denoted as HLTA-step, a subset of N′ randomly sampled data cases was used in the model construction phase and stepwise EM was used in the parameter estimation phase (see Section 7).
In all experiments, N′ was set at 10,000, the size of the minibatches at 1,000, and the parameter α in stepwise EM at 0.75. HLTA-batch was run on all the datasets, while HLTA-step was run only on the NYT and Newsgroup datasets. HLTA-step was not run on the NIPS datasets because the sample size is too small. For HLTA-batch, the number κ of iterations for batch EM was set at 50. For HLTA-step, stepwise EM was terminated after 100 updates.

The other parameters of HLTA (see Algorithm 1) were set as follows in both modes: the threshold δ used in UD-tests was set at …; the upper bound µ on island size was set at 15; and the upper bound τ on the number of top-level topics was set at … for NYT and 20 for all other datasets. When extracting topics from an HLTM (see Section 4), we ignored the level-1 latent variables because the topics they give are too fine-grained and often consist of different forms of the same word (e.g., "image, images").

The LDA-based methods nCRP, nHDP and hPAM do not learn model structures. A hierarchy needs to be supplied as input. In our experiments, the height of the hierarchy was set at 3, as is usually done in the literature. The number of nodes at each level was set in such a way that nCRP, nHDP and hPAM would yield roughly the same total number of topics as HLTA. CorEx was configured similarly. We used the program packages and default parameter values provided by the authors of these algorithms.

All experiments were conducted on the same desktop computer. Each experiment was repeated 3 times so that variances could be estimated.

For topic models, the standard way to assess model quality is to measure log-likelihood on a held-out test set [2, 5]. In our experiments, each dataset was randomly partitioned into a training set with … of the data and a test set with … of the data. Models were learned from the training set, and per-document log-likelihood was calculated on the held-out test set. The statistics are shown in Table 6. For comparability, only the results on binary data are included.

We see that the held-out likelihood values for HLTA are drastically higher than those for all the alternative methods. This implies that the models obtained by HLTA can predict unseen data much better than the other methods. In addition, the variances are significantly smaller for HLTA than for the other methods in most cases.

Table 7 shows the running times. We see that HLTA-step is significantly more efficient than HLTA-batch on large datasets, while there is virtually no decrease in model quality.

Note that all algorithms have parameters that control computational complexity. Thus, running time comparison is only meaningful when it is done with reference to model quality.
Table 6: Per-document held-out log-likelihood on binary data. Best scores are marked in bold; the sign "-" indicates non-termination after 72 hours, and "NR" stands for "not run". The columns correspond to NIPS-1k, NIPS-5k, NIPS-10k, News-1k, News-5k and NYT, and the rows to HLTA-batch, HLTA-step and the binary-data versions of the other methods. [Numeric entries are omitted here.]

Table 7: Running times in minutes. The sign "-" indicates non-termination after 72 hours, and "NR" stands for "not run". The columns are as in Table 6, and the rows cover the various methods compared, including the bag-of-words variants and CorEx. [Numeric entries are omitted here.]

It is clear from Tables 6 and 7 that HLTA achieved much better model quality than the alternative algorithms using comparable or less time.
It has been argued that, in general, better model fit does not necessarily imply better topic quality [50]. It might therefore be more meaningful to compare alternative methods in terms of topic quality directly. We measure topic quality using two metrics. The first one is the topic coherence score proposed by [51]. Suppose a topic t is characterized using a list W^{(t)} = \{w_1^{(t)}, w_2^{(t)}, \ldots, w_M^{(t)}\} of M words. The coherence score of t is given by:

Coherence(W^{(t)}) = \sum_{i=2}^{M} \sum_{j=1}^{i-1} \log \frac{D(w_i^{(t)}, w_j^{(t)}) + 1}{D(w_j^{(t)})},    (17)

where D(w_i) is the number of documents containing word w_i, and D(w_i, w_j) is the number of documents containing both w_i and w_j. The score depends on the choice of M and generally decreases with M. In our experiments, we set M = 4 because some of the topics produced by HLTA have only 4 words, and hence the choice of a larger value for M would put the other methods at a disadvantage. With M fixed, a higher coherence score indicates a better-quality topic.

The second metric we use is the topic compactness score proposed by [52]. The compactness score of a topic t is given by:

compactness(W^{(t)}) = \frac{2}{M(M-1)} \sum_{i=2}^{M} \sum_{j=1}^{i-1} S(w_i^{(t)}, w_j^{(t)}),    (18)

where S(w_i, w_j) is the similarity between the words w_i and w_j as determined by a word2vec model [53, 54]. The word2vec model was trained on a part of the Google News dataset (https://code.google.com/archive/p/word2vec/), which contains about 100 billion words; each word is mapped to a high-dimensional vector. The similarity between two words is defined as the cosine similarity of the two corresponding vectors. When calculating compactness(W^{(t)}), words that do not occur in the word2vec model are simply skipped.

Note that the coherence score is calculated on the corpus being analyzed. In this sense, it is an intrinsic metric. The intuition is that words in a good topic should tend to co-occur in the documents. The compactness score, on the other hand, is calculated on a general and very large corpus not related to the corpus being analyzed. Hence it is an extrinsic metric. The intuition here is that words in a good topic should be closely related semantically.
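For concreteness, the two metrics in Eqs. (17) and (18) can be computed as follows. The sketch assumes that document frequencies, co-document frequencies and a word2vec-based similarity function are supplied by the caller; all names are illustrative.

import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Sketch of the topic coherence (Eq. 17) and compactness (Eq. 18) scores.
// docFreq.get(w) is the number of documents containing w; coDocFreq(wi, wj) the number
// containing both; similarity(wi, wj) the cosine similarity of the word2vec vectors.
// Words missing from the word2vec model would be skipped in practice.
class TopicScores {
    static double coherence(List<String> topicWords, Map<String, Integer> docFreq,
                            BiFunction<String, String, Integer> coDocFreq) {
        double score = 0.0;
        for (int i = 1; i < topicWords.size(); i++)          // pairs (i, j) with j < i
            for (int j = 0; j < i; j++)
                score += Math.log((coDocFreq.apply(topicWords.get(i), topicWords.get(j)) + 1.0)
                                  / docFreq.get(topicWords.get(j)));
        return score;
    }

    static double compactness(List<String> topicWords,
                              BiFunction<String, String, Double> similarity) {
        int m = topicWords.size();
        double sum = 0.0;
        for (int i = 1; i < m; i++)
            for (int j = 0; j < i; j++)
                sum += similarity.apply(topicWords.get(i), topicWords.get(j));
        return 2.0 * sum / (m * (m - 1));
    }
}

With M = 4, these scores would be computed over the first four characterizing words of each topic and averaged over all topics reported by a method.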
Table 8: Average coherence scores. The columns correspond to NIPS-1k, NIPS-5k, NIPS-10k, News-1k, News-5k and NYT, and the rows to HLTA and the other methods on both the binary and the bag-of-words versions of the data. [Numeric entries are omitted here.]

Table 9: Average compactness scores. The columns and rows are as in Table 8. [Numeric entries are omitted here.]

Tables 8 and 9 show the average topic coherence and topic compactness scores of the topics produced by the various methods. For the LDA-based methods, we report scores on both the binary and the bag-of-words versions of the datasets; there is no distinct advantage to either version. We see that the scores for the topics produced by HLTA are significantly higher than those obtained by the other methods in all cases.

9.5. Quality of Topic Hierarchies
Table 10: A part of the topic hierarchy obtained by nHDP.
1. company business million companies money
[Remaining rows are not reproduced here.]
To the best of our knowledge, there is no established metric for measuring the quality of topic hierarchies, and it is difficult to come up with one. Hence, we resort to manual comparison.

The entire topic hierarchies produced by HLTA and nHDP on the NIPS and NYT datasets can be found at the URL mentioned at the beginning of the previous section. Table 10 shows the part of the hierarchy produced by nHDP that corresponds to the part of the HLTA hierarchy shown in Table 3. In the HLTA hierarchy, the topics are nicely divided into three groups: economy, stock market, and companies. In Table 10, there is no such clear division. The topics are all mixed up, and the hierarchy does not match the semantic relationships among the topics. Overall, the topics and the topic hierarchy obtained by HLTA are more meaningful than those produced by nHDP.
10. Conclusions and Future Directions
We have proposed a novel method called HLTA for hierarchical topic detection. The idea is to model patterns of word co-occurrence, and co-occurrences of those patterns, using a hierarchical latent tree model. Each latent variable in an HLTM gives a soft partition of the documents, and the document clusters in the partitions are interpreted as topics. Each topic is characterized using the words that occur with high probability in documents belonging to the topic and with low probability in documents not belonging to it. Progressive EM is used to accelerate parameter learning for the intermediate models created during model construction, and stepwise EM is used to accelerate parameter learning for the final model. Empirical results show that HLTA outperforms nHDP, the state-of-the-art LDA-based method for hierarchical topic detection, in terms of overall model fit and the quality of topics and topic hierarchies, while taking no more time than the latter.

HLTA treats words as binary variables, and each word is allowed to appear in only one branch of a hierarchy. For future work, it would be interesting to extend HLTA so that it can handle count data and so that a word can appear in multiple branches of the hierarchy. Another direction is to further scale up HLTA via distributed computing and other means.
Acknowledgments
We thank John Paisley for sharing the nHDP implementation with us, and we thank Jun Zhu for valuable discussions. Research on this article was supported by the Hong Kong Research Grants Council under grants 16202515 and 16212516, and by the Hong Kong Institute of Education under project RG90/2014-2015R.