A Multilayer Correlated Topic Model
Ye Tian, Department of Statistics, Columbia University, New York, USA

Correspondence: Ye Tian, Department of Statistics, Columbia University, New York, USA. Email: [email protected]
Summary
We proposed a novel multilayer correlated topic model (MCTM) to analyze how the main ideas are inherited and varied between a document and its different segments, which helps understand an article's structure. A variational expectation-maximization (EM) algorithm was derived to estimate the posterior and parameters in MCTM. We introduced two potential applications of MCTM, including paragraph-level document analysis and market basket data analysis. The effectiveness of MCTM in understanding document structure has been verified by its strong predictive performance on held-out documents and by intuitive visualization. We also showed that MCTM could successfully capture customers' popular shopping patterns in the market basket analysis.
KEYWORDS: latent Dirichlet allocation, correlated topic model, document structure, variational EM algorithm, market basket analysis
1 INTRODUCTION

Probabilistic topic modeling has been a popular research area due to its effectiveness in learning the underlying semantic structure of documents. Since latent Dirichlet allocation (LDA) came out (Blei, Ng, & Jordan 2003), researchers have developed a number of topic modeling methods, including the author-topic model (Rosen-Zvi, Griffiths, Steyvers, & Smyth 2012), the hierarchical LDA model (Griffiths, Jordan, Tenenbaum, & Blei 2004), the correlated topic model (CTM) (Blei, Lafferty, et al. 2007) and the dynamic topic model (Blei & Lafferty 2006). One important fact that is ignored by many topic models is that documents come with structure: a document has paragraphs, which in turn contain sentences. Modeling this local structure of an article can be very helpful for understanding the document. An article carries some main ideas, while each paragraph is organized to introduce only a part of these ideas, so the ideas of paragraphs vary around the main idea of the whole document (Du, Buntine, & Jin 2010). In the language of topic modeling, segments of an article can have different topic proportions while sharing some similarity with the document-level topic proportion. A related assumption of LDA and some other document-level topic models is the bag-of-words assumption, which imposes the exchangeability of words within each document. This assumption is hardly tenable for long texts, because different segments can express slightly different ideas even though they belong to the same article. In summary, capturing the heterogeneity among different segments of an article is quite useful for understanding the document structure.

There has been some previous work tackling this problem, which can be divided into two categories. Methods in the first category introduce two types of topics, called document-level topics (or super-topics) and word-level topics, respectively, and segments with different super-topics can have different word-level distributions. The latent Dirichlet co-clustering (LDCC) model proposed by Shafiei and Milios (2006) is a direct application of this idea. Li and McCallum (2006) developed the pachinko allocation model (PAM), using a directed acyclic graph to capture the hierarchical structure of topics in different layers. Hou et al. (2017) applied a similar idea in a multilayer multi-view topic model (mlmv_LDA) for video classification. These methods all belong to the first category. One issue with this family of methods is the difficulty of interpreting the super-topics, especially when there are multiple levels. Instead of assigning a super-topic to segments, approaches in the second category connect different layers of an article by parameter passing: within each layer of an article, the passed parameters enjoy some variety while remaining related to the same document-level parameter. Compared with the first category, this approach is more natural, and the results are easier to interpret because there is only one type of topic. Example methods in this category include the segmented topic model based on the Poisson-Dirichlet process (Du et al. 2010), the sequential LDA model (Du et al. 2010) and the LDA model based on partition (Guo, Lu, & Wei 2019). These methods are all designed based on the LDA model (Blei et al. 2003).

In this work, we proposed a new multilayer correlated topic model (MCTM), which belongs to the aforementioned second category. It is based on CTM (Blei et al. 2007), and the topic of each word is generated from a logistic normal distribution (Atchison & Shen 1980). Motivated by the dynamic topic model (Blei & Lafferty 2006), MCTM varies the topic distribution in each layer of a document by passing the mean parameter of the normal distribution. The dynamic topic model passes the mean parameter along articles published at different times to capture the changing trend, which can be seen as a series of models; in MCTM, segments belonging to the same node in the higher layer share the same mean parameter, which resembles a tree structure. This design also shares a similar idea with the well-known random effects model (Laird & Ware 1982), where observations of different people have the same fixed effect but different random effects.

We highlight our contribution in two aspects. First, a simple multilayer correlated topic model (MCTM) was proposed for analyzing the document structure and how the main ideas of an article are organized in different segments. It relaxes the bag-of-words assumption by imposing exchangeability only within each segment. A variational expectation-maximization (EM) algorithm was derived to estimate the posterior and parameters in MCTM. Second, we visualized the results by connecting the document and its paragraphs with related topics in high proportions, which intuitively shows the similarity and heterogeneity among different paragraphs of the same article. Such visualization can be beneficial in paragraph-level document analysis.

The remainder of this paper is organized as follows. Section 2 introduces MCTM, together with the document generation procedure and its graphical representation, in detail. In Section 3, we present the variational EM algorithm for the parameter and posterior estimation in MCTM. Section 4 briefly discusses two potential applications of MCTM, including paragraph-level document analysis and market basket analysis. We conduct the experiments and present the results of these two applications in Section 5. Finally, we close this paper with a summary of our work and future avenues in Section 6. Details of the derivation of the variational EM algorithm are summarized in Appendix A.
2 THE MULTILAYER CORRELATED TOPIC MODEL

We first introduce some notation and terms used in this section.

• A corpus is a collection of D documents, denoted by the words they contain as {w_d}_{d=1}^D. The corpus is the top level of our hierarchical model.
• The d-th document consists of S_d non-overlapping segments {w_ds}_{s=1}^{S_d}. The document is the second level of the hierarchical model.
• The s-th segment of the d-th document is a sequence of N_ds words {w_dsn}_{n=1}^{N_ds}. The segment is the third level of the model.
• A word w ∈ {1, ..., W} is the bottom level of the topic model.

The generative procedure of our model for document d can be described as follows (a code sketch of this process is given below).

1. Choose the number of segments S_d ∼ Poisson(υ_S);
2. Draw γ_d ∼ N(µ, Σ);
3. For each segment s = 1 : S_d:
   (a) Draw η_ds ∼ N(γ_d, Σ);
   (b) Draw the number of words N_ds ∼ Poisson(υ_N);
   (c) For each word n = 1 : N_ds:
      i. Choose a topic z_dsn ∼ Categorical(f(η_ds)), where f is the logistic normal (softmax) function;
      ii. Choose a word w_dsn ∼ Categorical(β_{z_dsn}).

The corresponding graphical representation of MCTM is shown in Figure 1. The plates represent replications of the same structure, and the shaded nodes, unshaded nodes and black dots denote observed variables, hidden variables and hyperparameters, respectively.

FIGURE 1 Graphical representations of MCTM (left) and CTM (right). The plates represent replications of the same structure; the shaded nodes, unshaded nodes and black dots denote observed variables, hidden variables and hyperparameters, respectively.
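To make the generative procedure concrete, the following sketch simulates a small synthetic corpus from the process above. It is only an illustration: the dimensions, the Poisson rates, µ, Σ and the topic-word matrix β are arbitrary toy values, not settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and hyperparameters (arbitrary values, for illustration only).
K, W, D = 4, 50, 3              # topics, vocabulary size, documents
upsilon_S, upsilon_N = 5, 30    # Poisson rates for segments and words
mu, Sigma = np.zeros(K), np.eye(K)
beta = rng.dirichlet(np.ones(W), size=K).T   # W x K; column k is the word distribution of topic k

def f(eta):
    """Logistic normal map: R^K -> topic simplex."""
    e = np.exp(eta - eta.max())
    return e / e.sum()

corpus = []
for d in range(D):
    S_d = max(1, rng.poisson(upsilon_S))            # number of segments
    gamma_d = rng.multivariate_normal(mu, Sigma)    # document-level parameter
    doc = []
    for s in range(S_d):
        eta_ds = rng.multivariate_normal(gamma_d, Sigma)   # segment-level parameter
        N_ds = max(1, rng.poisson(upsilon_N))              # number of words in the segment
        z_ds = rng.choice(K, size=N_ds, p=f(eta_ds))       # per-word topics
        w_ds = np.array([rng.choice(W, p=beta[:, k]) for k in z_ds])  # words
        doc.append(w_ds)
    corpus.append(doc)

print([len(doc) for doc in corpus])    # number of segments per document
```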
In Figure 1, we denote β_{W×K} = (β_wz)_{W×K}, where β_wz = Pr(w_dsn = w | z_dsn = z) for any d, s and n. Here γ, η and z are hidden variables, while υ_S, υ_N, µ, Σ and β are hyperparameters. In practice, we treat paragraphs as segments. As Shafiei and Milios (2006) pointed out, the variables {N_ds}_{d,s} are independent of all the other variables and are observed; therefore, without loss of generality, the randomness of {N_ds}_{d,s} can be ignored.

Given η_ds, the topic z_dsn corresponding to word w_dsn is generated from a categorical distribution with parameter f(η_ds), whose k-th component is
$$
f^{(k)}(\eta_{ds}) = \frac{\exp(\eta_{dsk})}{\sum_{k'=1}^{K} \exp(\eta_{dsk'})}, \qquad k = 1, \ldots, K.
$$
Given the hyperparameters µ, Σ and β, the joint distribution of the document-level topic mixture parameters γ = {γ_d}_{d=1}^D, the segment-level topic mixture parameters η = {η_ds}_{d,s}, the word topics z = {z_dsn}_{d,s,n} and the observed words w = {w_dsn}_{d,s,n} is
$$
p(\gamma, \eta, z, w \mid \mu, \Sigma, \beta) = \prod_{d=1}^{D} p(\gamma_d \mid \mu, \Sigma) \prod_{s=1}^{S_d} p(\eta_{ds} \mid \gamma_d, \Sigma) \prod_{n=1}^{N_{ds}} p(z_{dsn} \mid \eta_{ds}) \, p(w_{dsn} \mid \beta, z_{dsn}).
$$
The novel point of MCTM is that it considers one more layer than CTM, which adds flexibility to the topic model and adapts it to a more nuanced analysis of a document at almost no extra cost. More precisely, each segment has its own parameter η_ds describing the topic distribution within that segment, which is expected to capture the local topic distribution better than the global parameter γ_d. At the same time, MCTM does not require more hyperparameters than CTM, because each segment uses the information of its document to generate its parameter η_ds. As mentioned in the last section, MCTM shares a similar spirit with the well-known random effects model (Laird & Ware 1982): for each individual, the fixed population effect is the same, while the individual effect is random. Here each document can be seen as a population and its segments as individuals.

It is essential to point out that MCTM can be of any depth if we pass the mean parameter of the logistic normal distribution through each layer. This allows us to extend MCTM to more sophisticated structures such as a sentence-level topic model. In the following, we only consider the MCTM exhibited in Figure 1, with one more layer than CTM, and leave arbitrarily deep models to future research.

3 VARIATIONAL EM ALGORITHM

With the Bayesian model in hand, the next step is to develop the inference algorithm. The inference goal is to estimate the parameters in the model and to compute the posterior, which is very useful for prediction and model evaluation. In many cases, especially for complicated models, exact inference is impossible, and we have to rely on Bayesian approximation approaches. There are three main families of approximation methods: Markov chain Monte Carlo (MCMC) (Robert & Casella 2013), variational inference (VI) (Blei, Kucukelbir, & McAuliffe 2017) and expectation propagation (Minka 2013). With MCMC and the joint distribution, we can make inference in complicated models, and the calculation can be further simplified if conjugacy holds. Among the MCMC family, one popular approach is the Gibbs sampler (Gelfand & Smith 1990).
The Gibbs sampler is available when the complete conditional distributions are known, which
does not require knowledge of the joint distribution (Shafiei & Milios 2006). Besides, conditional conjugacy is easier to satisfy than joint conjugacy (Blei 2014).

VI is another, efficient family of Bayesian approximation methods (Jordan, Ghahramani, Jaakkola, & Saul 1999; Wainwright & Jordan 2008). It minimizes the Kullback-Leibler (KL) divergence between the exact posterior and an approximate one (Blei et al. 2017). The simplest approximating family is the mean-field family, where all hidden variables are assumed to be independent conditioned on the observed data. The parameters used to describe the independent conditional distributions of the hidden variables are called variational parameters. When conditionally conjugate priors are available, coordinate ascent VI is straightforward to conduct. Blei et al. (2017) and Hoffman, Blei, Wang, and Paisley (2013) discussed the relationship between the stepwise updates and the conditional posterior under the exponential family. A myriad of variants of VI have appeared in recent years, including stochastic VI (Hoffman et al. 2013), automatic differentiation VI (Kucukelbir, Ranganath, Gelman, & Blei 2015; Kucukelbir, Tran, Ranganath, Gelman, & Blei 2017), and black box VI (Ranganath, Gerrish, & Blei 2014). The variational expectation-maximization (EM) algorithm also belongs to this family; it estimates the variational parameters and the hyperparameters in two stages.

In this work, we follow steps similar to those of Blei et al. (2007) to develop the variational EM algorithm and update the estimates through coordinate ascent. Denote the number of topics by K and the documents by {w_d}_{d=1}^D, and suppose the d-th document has S_d segments. We aim to estimate the posterior p(γ, η, z | µ, Σ, β, w). Consider the mean-field variational family, where we approximate the posterior with
$$
q(\gamma, \eta, z \mid \lambda, v, \xi, m, \phi)
= \prod_{d=1}^{D} \prod_{k=1}^{K} q(\gamma_{dk} \mid \lambda_{dk}, v_{dk})
\cdot \prod_{d=1}^{D} \prod_{s=1}^{S_d} \prod_{k=1}^{K} q(\eta_{dsk} \mid \xi_{dsk}, m_{dsk})
\cdot \prod_{d=1}^{D} \prod_{s=1}^{S_d} \prod_{n=1}^{N_{ds}} q(z_{dsn} \mid \phi_{dsn}), \qquad (1)
$$
where, under q, γ_dk ∼ N(λ_dk, v_dk), η_dsk ∼ N(ξ_dsk, m_dsk) and z_dsn ∼ Multinomial(φ_dsn), and γ, η and z are mutually independent. The joint distribution of the hidden variables and the observed words is
$$
\begin{aligned}
p(\gamma, \eta, z, w \mid \mu, \Sigma, \beta)
&= \prod_{d=1}^{D} p(\gamma_d \mid \mu, \Sigma) \prod_{s=1}^{S_d} p(\eta_{ds} \mid \gamma_d, \Sigma) \prod_{n=1}^{N_{ds}} p(z_{dsn} \mid \eta_{ds}) \, p(w_{dsn} \mid \beta, z_{dsn}) \\
&= \prod_{d=1}^{D} (2\pi)^{-\frac{K}{2}} |\Sigma|^{-\frac{1}{2}} \exp\Big\{ -\tfrac{1}{2} (\gamma_d - \mu)^T \Sigma^{-1} (\gamma_d - \mu) \Big\}
 \prod_{s=1}^{S_d} (2\pi)^{-\frac{K}{2}} |\Sigma|^{-\frac{1}{2}} \exp\Big\{ -\tfrac{1}{2} (\eta_{ds} - \gamma_d)^T \Sigma^{-1} (\eta_{ds} - \gamma_d) \Big\} \\
&\qquad \times \prod_{n=1}^{N_{ds}} \prod_{k=1}^{K} \Big[ f^{(k)}(\eta_{ds}) \Big]^{z^{(k)}_{dsn}} \prod_{w=1}^{W} \beta_{wk}^{\, z^{(k)}_{dsn} \mathbf{1}(w_{dsn} = w)}.
\end{aligned} \qquad (2)
$$
In the E-step, we find the estimate of (λ, v, ξ, m, φ) that maximizes the ELBO
$$
\mathbb{E}_q[\log p(\gamma, \eta, z, w \mid \mu, \Sigma, \beta)] - \mathbb{E}_q[\log q(\gamma, \eta, z \mid \lambda, v, \xi, m, \phi)],
$$
which is a lower bound of the evidence $\sum_{d=1}^{D} \log p(w_d \mid \mu, \Sigma, \beta)$ (Blei et al. 2017). The exact ELBO is still challenging to compute, so we instead derive a further lower bound and optimize that (Blei et al. 2007). To do this, we need to define additional variational parameters ζ = {ζ_ds}_{d,s}.
With some straightforward calculation, it can be shown that
$$
\begin{aligned}
B(\mu, \Sigma, \beta, \lambda, v, \xi, m, \phi)
&= -\frac{D}{2}\log|\Sigma| - \frac{KD}{2}\log 2\pi
 - \frac{1}{2}\sum_{d=1}^{D} \mathrm{Tr}\big(\mathrm{diag}(v_d)\Sigma^{-1}\big)
 - \frac{1}{2}\sum_{d=1}^{D} (\lambda_d - \mu)^T \Sigma^{-1} (\lambda_d - \mu) \\
&\quad - \frac{1}{2}\big(K\log 2\pi + \log|\Sigma|\big)\sum_{d=1}^{D} S_d
 - \frac{1}{2}\sum_{d=1}^{D}\sum_{s=1}^{S_d} \mathrm{Tr}\big(\mathrm{diag}(v_d + m_{ds})\Sigma^{-1}\big)
 - \frac{1}{2}\sum_{d=1}^{D}\sum_{s=1}^{S_d} (\xi_{ds} - \lambda_d)^T \Sigma^{-1} (\xi_{ds} - \lambda_d) \\
&\quad + \sum_{d=1}^{D}\sum_{s=1}^{S_d}\sum_{n=1}^{N_{ds}} \Bigg[ \sum_{k=1}^{K} \xi_{dsk}\phi_{dsnk}
 - \zeta_{ds}^{-1}\sum_{k=1}^{K}\exp\Big(\xi_{dsk} + \frac{m_{dsk}}{2}\Big) + 1 - \log\zeta_{ds} \Bigg]
 + \sum_{d=1}^{D}\sum_{s=1}^{S_d}\sum_{n=1}^{N_{ds}}\sum_{k=1}^{K} \phi_{dsnk}\log\beta_{w_{dsn},k} \\
&\quad + \frac{1}{2}\sum_{d=1}^{D}\sum_{k=1}^{K} \big(\log v_{dk} + \log 2\pi + 1\big)
 + \frac{1}{2}\sum_{d=1}^{D}\sum_{s=1}^{S_d}\sum_{k=1}^{K} \big(\log m_{dsk} + \log 2\pi + 1\big)
 - \sum_{d=1}^{D}\sum_{s=1}^{S_d}\sum_{n=1}^{N_{ds}}\sum_{k=1}^{K} \phi_{dsnk}\log\phi_{dsnk} \\
&\leq \mathbb{E}_q[\log p(\gamma, \eta, z, w \mid \mu, \Sigma, \beta)] - \mathbb{E}_q[\log q(\gamma, \eta, z \mid \lambda, v, \xi, m, \phi)]
 \leq \sum_{d=1}^{D} \log p(w_d \mid \mu, \Sigma, \beta).
\end{aligned}
$$
The coordinate ascent algorithm can be conducted through stepwise optimization of B(µ, Σ, β, λ, v, ξ, m, φ). The updating formulas for the estimates (ζ̂_ds, φ̂_dsnk, λ̂_d, v̂_dk) are
$$
\hat\zeta_{ds} = \sum_{k=1}^{K} \exp\Big\{\hat\xi_{dsk} + \frac{\hat m_{dsk}}{2}\Big\}, \qquad
\hat\phi_{dsnk} \propto \beta_{w_{dsn},k}\exp\{\hat\xi_{dsk}\}, \quad k = 1, \ldots, K,
$$
$$
\hat\lambda_d = (S_d + 1)^{-1}\Big(\sum_{s=1}^{S_d}\hat\xi_{ds} + \mu\Big), \qquad
\hat v_{dk} = \big((S_d + 1)(\Sigma^{-1})_{kk}\big)^{-1}.
$$
For ξ and m, we find the estimates through the Newton-Raphson algorithm. The corresponding gradients and Hessians are
$$
\nabla_{\xi_{ds}} B = -\Sigma^{-1}(\xi_{ds} - \lambda_d) + \sum_{n=1}^{N_{ds}}\phi_{dsn}
 - \frac{N_{ds}}{\zeta_{ds}}\exp\Big\{\xi_{ds} + \frac{m_{ds}}{2}\Big\},
\qquad
\nabla^2_{\xi_{ds}} B = -\Sigma^{-1} - \frac{N_{ds}}{\zeta_{ds}}\,\mathrm{diag}\Big(\exp\Big\{\xi_{ds} + \frac{m_{ds}}{2}\Big\}\Big),
$$
$$
\frac{\partial B}{\partial m_{dsk}} = -\frac{1}{2}(\Sigma^{-1})_{kk}
 - \frac{N_{ds}}{2\zeta_{ds}}\exp\Big\{\xi_{dsk} + \frac{m_{dsk}}{2}\Big\} + \frac{1}{2m_{dsk}},
\qquad
\frac{\partial^2 B}{\partial m_{dsk}^2} = -\frac{N_{ds}}{4\zeta_{ds}}\exp\Big\{\xi_{dsk} + \frac{m_{dsk}}{2}\Big\} - \frac{1}{2m_{dsk}^2},
$$
where exp{·} is applied elementwise to the vector ξ_ds + m_ds/2.

In the M-step, we update the hyperparameters (µ, Σ, β) to maximize B(µ, Σ, β, λ, v, ξ, m, φ) at each step by
$$
\hat\beta_{wk} \propto \sum_{d=1}^{D}\sum_{s=1}^{S_d}\sum_{\{n:\, w_{dsn} = w\}} \hat\phi_{dsnk},
\qquad
\hat\mu = \frac{1}{D}\sum_{d=1}^{D}\hat\lambda_d,
$$
$$
\hat\Sigma = \frac{1}{D + \sum_{d=1}^{D} S_d}\Bigg[
 \sum_{d=1}^{D}\mathrm{diag}(\hat v_d)
 + \sum_{d=1}^{D}\sum_{s=1}^{S_d}\mathrm{diag}(\hat v_d + \hat m_{ds})
 + \sum_{d=1}^{D}(\hat\lambda_d - \hat\mu)(\hat\lambda_d - \hat\mu)^T
 + \sum_{d=1}^{D}\sum_{s=1}^{S_d}(\hat\xi_{ds} - \hat\lambda_d)(\hat\xi_{ds} - \hat\lambda_d)^T
\Bigg].
$$
The detailed derivation of the results above can be found in Appendix A.
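The closed-form part of these coordinate updates is easy to implement. Below is a minimal NumPy sketch of the updates for ζ, φ, λ and v for a single document, assuming the Newton-Raphson steps for ξ and m are carried out elsewhere and their current values are passed in; the variable names are ours, and this is an illustrative sketch rather than the implementation used for the experiments.

```python
import numpy as np

def closed_form_updates(words_d, xi, m, Sigma_inv, mu, beta):
    """Closed-form coordinate-ascent updates for one document.

    words_d   : list over segments of 1-D arrays of word indices
    xi, m     : (S_d, K) current variational means / variances of eta_ds
    Sigma_inv : (K, K) inverse of the current Sigma estimate
    mu        : (K,) current mean hyperparameter
    beta      : (W, K) topic-word probability matrix
    """
    S_d, K = xi.shape

    # zeta_ds = sum_k exp(xi_dsk + m_dsk / 2)
    zeta = np.exp(xi + 0.5 * m).sum(axis=1)

    # phi_dsnk proportional to beta[w_dsn, k] * exp(xi_dsk), normalized over k
    phi = []
    for s, w in enumerate(words_d):
        unnorm = beta[w, :] * np.exp(xi[s])                  # (N_ds, K)
        phi.append(unnorm / unnorm.sum(axis=1, keepdims=True))

    # lambda_d = (S_d + 1)^{-1} (sum_s xi_ds + mu)
    lam = (xi.sum(axis=0) + mu) / (S_d + 1)

    # v_dk = ((S_d + 1) (Sigma^{-1})_kk)^{-1}
    v = 1.0 / ((S_d + 1) * np.diag(Sigma_inv))

    return zeta, phi, lam, v
```

After the E-step has been run over all documents, the M-step would accumulate the φ̂_dsnk by word for β̂, average the λ̂_d for µ̂, and pool the variance and outer-product terms for Σ̂, exactly as in the formulas above.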
4 APPLICATIONS

In this section, we point out two potential applications in which MCTM can be directly applied: paragraph-level document analysis and market basket analysis. The experimental results for these two problems are presented in the next section.
4.1 Paragraph-level document analysis

As mentioned in Section 1, the original LDA model (Blei et al. 2003) and CTM (Blei et al. 2007) conduct only document-level topic analysis, which assumes the exchangeability of words within each document. MCTM relaxes this assumption by requiring exchangeability only within each segment; in practice, each paragraph naturally serves as a segment. An article has some main ideas, and each of its paragraphs is organized to carry slightly different information around these ideas (Du et al. 2010). Understanding what each paragraph of an article is talking about can be very useful in text mining.
4.2 Market basket analysis

The decision process by which consumers select items from various products in the same shopping trip is called the market basket choice problem (Russell & Petersen 2000; Russell et al. 1999). Numerous models have been developed to analyze market basket data. Hruschka (2014) fitted LDA and CTM to market basket data and obtained some interesting conclusions. Although topic models might not be the most favorable tools for market basket data (Hruschka 2019), they are helpful for exploring common shopping patterns and the preferences of individual customers, and their results could help the market design better promotion strategies and recommend appropriate products to customers.

We can adapt MCTM to market basket analysis with no extra effort. Items can be seen as different words in document analysis, and each topic indicates a specific shopping pattern over these items. Consider customers d = 1, ..., D, where the d-th customer has s = 1, ..., S_d shopping trips in total. Within the s-th trip of customer d, a list of items w_ds1, ..., w_dsN_ds is added to the cart. Each customer can be seen as a "document" and each trip as a "segment" (see the data-preparation sketch below). The data generation procedure is the same as presented in Section 2, and the corresponding variational EM algorithm presented in Section 3 can be applied immediately.
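As a concrete illustration of this mapping, the sketch below arranges raw transaction records into the nested corpus structure MCTM expects: one document per customer, one segment per trip, and item categories as the vocabulary. The column names ("customer_id", "trip_id", "category") are hypothetical placeholders, not the actual schema of the Instacart files.

```python
import pandas as pd

def baskets_to_corpus(records: pd.DataFrame, min_items: int = 4):
    """Turn (customer_id, trip_id, category) rows into MCTM input.

    Returns a list over customers (documents) of lists over trips
    (segments) of category-index lists (words), plus the vocabulary.
    Column names are hypothetical.
    """
    vocab = sorted(records["category"].unique())
    cat_index = {c: i for i, c in enumerate(vocab)}

    corpus = []
    for _, customer in records.groupby("customer_id"):
        trips = []
        for _, trip in customer.groupby("trip_id"):
            items = [cat_index[c] for c in trip["category"]]
            if len(items) >= min_items:        # keep trips with at least four items
                trips.append(items)
        if trips:
            corpus.append(trips)
    return corpus, vocab
```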
5 EXPERIMENTS

5.1 Paragraph-level document analysis

We first demonstrate the effectiveness of MCTM in capturing the local structure of articles through paragraph-level document analysis. We take the NIPS conference papers published during 2015-2017 as the dataset. We removed all words appearing fewer than three times and all paragraphs with fewer than 20 words. We also removed a list of scientific stop words (available at https://github.com/seinecle/Stopwords/blob/master/src/scientific/en/scientificstopwords.txt) and the common stop words from the ISO list provided by the R package stopwords. The resulting NIPS dataset contains 1646 documents with 1181156 words in total, of which 20667 words are unique.

We evaluate different models by their perplexity. Define the training set as w_train and let Υ denote the parameters fitted on the training set. Denote the held-out documents by w_held-out = {w_d}_{d=1}^D and suppose w_d has N_d words in total. Then the perplexity is defined as
$$
\mathrm{Perplexity} = \Bigg(\prod_{d=1}^{D} P(w_d \mid \Upsilon, w_{\mathrm{train}})\Bigg)^{-1/\left(\sum_{d=1}^{D} N_d\right)}.
$$
A lower perplexity indicates a more powerful model. In the experiments, it is estimated by the harmonic mean method, treating the approximate posteriors as the true ones (Wallach, Murray, Salakhutdinov, & Mimno 2009).

We used the 2016-2017 NIPS papers, which contain 1244 documents and 930588 words in total, as the training set; the 2015 NIPS papers, which contain 402 documents and 250568 words, are used as held-out documents. We fitted the LDA model (Blei et al. 2003), CTM (Blei et al. 2007) and MCTM with 10, 20, 30, 40, 50, 60, 70 and 80 topics. The LDA model and CTM are implemented by the R package topicmodels (Hornik & Grün 2011). The inference algorithm is run until the relative change of the ELBO, or of its lower bound, is less than 0.0001%.
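Given per-document held-out log-likelihoods (which in practice are approximated, for example by the harmonic mean method cited above), the perplexity defined here reduces to a one-line computation; a minimal sketch:

```python
import numpy as np

def perplexity(log_liks, doc_lengths):
    """Perplexity = exp( - sum_d log P(w_d) / sum_d N_d ).

    log_liks    : per-document held-out log-likelihoods log P(w_d | ·)
    doc_lengths : per-document word counts N_d
    """
    return float(np.exp(-np.sum(log_liks) / np.sum(doc_lengths)))
```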
TABLE 1
Top 10 words with the highest frequency from 9 selected topics. The results are obtained from a 40-topic MCTM.

Topic 1 "optimization": gradient, vector, optimization, loss, descent, step, iteration, functions, respect, approximation
Topic 2 "classification": loss, classifier, classifiers, examples, binary, risk, hinge, losses, decision, neural
Topic 3 "memory": size, memory, compression, bit, quantization, values, cost, datasets, dataset, hash
Topic 4 "image processing": images, image, layers, dataset, neural, networks, convolutional, architecture, features, size
Topic 5 "graph": graph, nodes, node, edges, graphs, vertices, clustering, tree, algorithms, sets
Topic 6 "bound": bound, bounds, upper, bounded, inequality, holds, tight, terms, implies, constants
Topic 7 "kernel": kernel, kernels, space, gaussian, rkhs, hilbert, functions, reproducing, matrix, features
Topic 8 "online learning": regret, algorithms, online, bandit, action, reward, learner, actions, settings, unknown
Topic 9 "acknowledgment": supported, grant, nsf, acknowledgments, reward, agent, university, tasks, policies, neural

First, we present in Table 1 nine topics with their corresponding top 10 highest-frequency words, derived from the training data by fitting a 40-topic MCTM. MCTM indeed captures the hidden structure of the documents, and these topics are interpretable. Notice that there is a special topic, which we call "acknowledgment", containing words often used in acknowledgment paragraphs. Our model can capture this topic because of its hierarchical structure and the flexible variation between documents and paragraphs. To show intuitively how MCTM captures the differences between topics of different paragraphs, we take the following example. Figure 2 shows the last paragraph of the 2017 NIPS paper "On clustering network-valued data" (Mukherjee, Sarkar, & Lin 2017). This paper is one of the training documents, and we highlight the training words in this paragraph. The approximate posterior topic distribution of this paragraph shows that, with probability 96.6%, a word is generated from the "acknowledgment" topic, while for the whole paper this probability is only 0.0027%. Capturing the difference between the document-level topic proportion and the paragraph-level proportion is what makes MCTM successful when we are interested in the local structure of an article.

Next, we present the perplexity with respect to different numbers of topics K in Figure 3. From the results it can be observed that LDA and CTM achieve comparable performance while MCTM outperforms both of them, because MCTM can capture a more sophisticated document structure that LDA and CTM cannot describe.

Finally, we visualize the results of MCTM by constructing an intuitive figure of the document "On clustering network-valued data" with several of its paragraphs and topics, exhibited as Figure 4. Each circle represents a paragraph of the document with the 5 highest-frequency words in that paragraph, while each rectangle denotes a topic with its 5 top words. We connect the document and its paragraphs with topics whose proportion exceeds a fixed threshold.
FIGURE 2 The last paragraph of the 2017 NIPS paper "On clustering network-valued data". Training words are highlighted. In this paragraph the "acknowledgment" topic proportion is 96.6%, while in the whole paper it is only 0.0027%.
FIGURE 3
Perplexity of LDA, CTM and MCTM with respect to different numbers of topics and percentages of observed words in held-out documents.

It can be observed in Figure 4 that the document has connections with topics 2, 11, 15, 27, and 34, while some paragraphs focus on only a subset of them. For instance, paragraph 4 mainly discusses "graph" and "matrix". All six paragraphs shown are connected with topic 11 "graph", which is the focus of the whole document. Besides, paragraph 21 connects only with topic 11 "graph" and topic 32 "acknowledgment"; this is the acknowledgment paragraph shown in Figure 2. Such a figure can be drawn immediately for every document after fitting MCTM, and it clearly displays the local structure.

5.2 Market basket analysis

In this section, we apply MCTM to study market basket data, following the idea described in Section 4.2. The dataset we use is the shopping data provided by Instacart, a large online grocery service operating in the United States and Canada; it consists of the records of over 200000 customers. We take the first 1000 customers, with 15145 trips and 150150 purchased items in total, as our training data. On each trip, the customer bought at least four items. The products belong to 134 different categories, and we fit MCTM on these categories instead of single items.

We fitted a 20-topic MCTM; Table 2 presents 5 topics and their corresponding top 10 categories. Observe that MCTM successfully captures some shopping needs, including household supplies, baby care, body care, frozen products, vegetables, and drinking. Besides, by looking at each customer's topic proportions, we can better understand the categories each customer often buys, which helps analyze each consumer's preference and promote the right products to them. The analysis of individual customers' behavior is beyond the scope of our discussion, and we do not cover it here.
FIGURE 4
Document "On clustering network-valued data" with several of its paragraphs and some topics. The document and its paragraphs are connected with topics whose proportion exceeds a fixed threshold.

TABLE 2 Top 10 categories with the highest frequency from 5 selected topics, obtained from a 20-topic MCTM.

Topic 1 "laundry and other household": laundry, deodorants, mint gum, cleaning products, dish detergents, trash bags liners, oral hygiene, baby accessories, more household, paper goods
Topic 2 "baby and body care": baby food formula, diapers wipes, baby bath body care, energy granola bars, fruit vegetable snacks, laundry, canned fruit applesauce, food storage, cereal, hair care
Topic 3 "frozen products": frozen meals, frozen breakfast, frozen pizza, instant foods, cereal, frozen appetizers sides, plates bowls cups flatware, frozen dessert, frozen vegan vegetarian, hair care
Topic 4 "vegetables, fruits and milk": fresh vegetables, fresh fruits, packaged vegetables fruits, yogurt, packaged cheese, chips pretzels, water seltzer sparkling water, frozen produce, ice cream, milk
Topic 5 "drinking and body care": body lotions soap, beers coolers, muscles joints pain relief, dog food care, white wines, air fresheners candles, ice cream toppings, red wines, first aid, spirits
6 CONCLUSION

In this work, we proposed an innovative multilayer correlated topic model (MCTM) to analyze the local structure of documents. It is motivated by the connection between articles published at different times in the dynamic topic model (Blei & Lafferty 2006) and by the group structure used in the random effects model (Laird & Ware 1982). A variational EM algorithm for MCTM was derived for posterior and parameter estimation. We introduced two potential applications of MCTM, including paragraph-level document analysis and market basket data analysis. In the numerical studies, we demonstrated the effectiveness of MCTM in capturing the document structure, and MCTM was confirmed to improve over LDA and CTM. We visualized the relationship between the main ideas of a document and its paragraphs. Besides, we captured some hidden shopping patterns among customers by applying MCTM to market basket data.
The main contribution of our work is a very natural and straightforward way to extend CTM (Blei et al. 2007) at almost no extra effort, making it possible to capture the heterogeneity between paragraphs of a document. Understanding the local structure of an article can help investigate how central ideas are organized inside a document.

There are a few future avenues that would be interesting to explore. The first is the use of the covariance matrix of the logistic normal distribution. As Blei et al. (2007) suggested, the covariance can help analyze the pairwise correlation between different topics, and similar results can be obtained for MCTM as well. Moreover, in our framework, topics do not have to share the same correlation structure at different levels; using separate covariance matrices at different levels might help interpret the relationship between a document and its paragraphs. Second, as mentioned before, MCTM can be of any depth, making it ready to be extended to more sophisticated modeling such as sentence-level topic modeling. Another appealing direction is to look for better ways to visualize the results of MCTM. Figure 4 provides one example, and we expect more intuitive visualization methods to display the results.
DATA ACCESSIBILITY
The NIPS dataset and the Instacart market basket dataset used in this work are publicly available.

ACKNOWLEDGMENTS
The author's research is sponsored by the Dean's Fellowship of the Graduate School of Arts and Sciences, Columbia University.
APPENDIX A: DERIVATION OF THE VARIATIONAL EM ALGORITHM IN SECTION 3
In this appendix, we treat z_dsn as a K-vector z_dsn = (z^(1)_dsn, ..., z^(K)_dsn)^T, where each coordinate is an indicator: if the n-th word in the s-th segment of the d-th document has topic k, then z^(j)_dsn = 1(j = k) for j = 1, ..., K.

By Blei et al. (2017), the ELBO equals
$$
\mathbb{E}_q[\log p(\gamma, \eta, z, w \mid \mu, \Sigma, \beta)] - \mathbb{E}_q[\log q(\gamma, \eta, z \mid \lambda, v, \xi, m, \phi)].
$$
Due to (2) and the normal density, we have
$$
\mathbb{E}_q[\log p(\gamma_d \mid \mu, \Sigma)]
= -\frac{1}{2}\log|\Sigma| - \frac{K}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_q[(\gamma_d - \mu)^T\Sigma^{-1}(\gamma_d - \mu)]
= -\frac{1}{2}\log|\Sigma| - \frac{K}{2}\log 2\pi - \frac{1}{2}\mathrm{Tr}\big(\mathrm{diag}(v_d)\Sigma^{-1}\big) - \frac{1}{2}(\lambda_d - \mu)^T\Sigma^{-1}(\lambda_d - \mu). \qquad (A1)
$$
Similarly, it holds that
$$
\mathbb{E}_q[\log p(\eta_{ds} \mid \gamma_d, \Sigma)]
= -\frac{1}{2}\log|\Sigma| - \frac{K}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_q[(\eta_{ds} - \gamma_d)^T\Sigma^{-1}(\eta_{ds} - \gamma_d)]
= -\frac{1}{2}\log|\Sigma| - \frac{K}{2}\log 2\pi - \frac{1}{2}\mathrm{Tr}\big(\mathrm{diag}(m_{ds} + v_d)\Sigma^{-1}\big) - \frac{1}{2}(\xi_{ds} - \lambda_d)^T\Sigma^{-1}(\xi_{ds} - \lambda_d). \qquad (A2)
$$
According to the categorical distribution, we have
$$
\mathbb{E}_q[\log p(z_{dsn} \mid \eta_{ds})] = \mathbb{E}_q(\eta_{ds}^T z_{dsn}) - \mathbb{E}_q\Big[\log\Big(\sum_{k=1}^{K}\exp(\eta_{dsk})\Big)\Big].
$$
Since the second term does not have a closed form, we follow the strategy of Blei et al. (2007) by introducing additional positive parameters ζ = {ζ_ds}_{d,s}. Using the first-order bound log x ≤ x/ζ − 1 + log ζ for any ζ > 0, it can be checked that
$$
\mathbb{E}_q\Big[\log\Big(\sum_{k=1}^{K}\exp(\eta_{dsk})\Big)\Big]
\le \zeta_{ds}^{-1}\Big[\sum_{k=1}^{K}\mathbb{E}_q\big(\exp(\eta_{dsk})\big)\Big] - 1 + \log\zeta_{ds}
= \zeta_{ds}^{-1}\Big[\sum_{k=1}^{K}\exp\Big(\xi_{dsk} + \frac{m_{dsk}}{2}\Big)\Big] - 1 + \log\zeta_{ds}, \qquad (A3)
$$
where the second equality comes from the moment generating function of the normally distributed η_dsk under q. In addition, we have
$$
\mathbb{E}_q[\log p(w_{dsn} \mid z_{dsn}, \beta)] = \sum_{k=1}^{K}\phi_{dsnk}\log\beta_{w_{dsn},k}. \qquad (A4)
$$
On the other hand, it holds that
$$
\mathbb{E}_q[\log q(\gamma_{dk} \mid \lambda_{dk}, v_{dk})] = -\frac{1}{2}\big(\log v_{dk} + \log 2\pi + 1\big), \qquad (A5)
$$
$$
\mathbb{E}_q[\log q(\eta_{dsk} \mid \xi_{dsk}, m_{dsk})] = -\frac{1}{2}\big(\log m_{dsk} + \log 2\pi + 1\big), \qquad (A6)
$$
$$
\mathbb{E}_q[\log q(z_{dsn} \mid \phi_{dsn})] = \sum_{k=1}^{K}\phi_{dsnk}\log\phi_{dsnk}. \qquad (A7)
$$
Combining (2), (1), (A1), (A2), (A3), (A4), (A5), (A6) and (A7), it follows that the ELBO can be bounded from below by B(µ, Σ, β, λ, v, ξ, m, φ). The derivations exhibited in Section 3 are then straightforward to obtain.

References
Atchison, J., & Shen, S. M. (1980). Logistic-normal distributions: Some properties and uses. Biometrika, (2), 261–272.
Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable models.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, (518), 859–877.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning (pp. 113–120).
Blei, D. M., Lafferty, J. D., et al. (2007). A correlated topic model of science. The Annals of Applied Statistics, (1), 17–35.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, (Jan), 993–1022.
Boztuğ, Y., & Hildebrandt, L. (2008). Modeling joint purchases with a multivariate MNL approach. Schmalenbach Business Review, (4), 400–422.
Du, L., Buntine, W., & Jin, H. (2010). A segmented topic model based on the two-parameter Poisson-Dirichlet process. Machine Learning, (1), 5–19.
Gelfand, A. E., & Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, (410), 398–409.
Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., & Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems (pp. 17–24).
Guo, C., Lu, M., & Wei, W. (2019). An improved LDA topic modeling method based on partition for medium and long texts. Annals of Data Science, 1–14.
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, (1), 1303–1347.
Hornik, K., & Grün, B. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, (13), 1–30.
Hou, S., Chen, L., Tao, D., Zhou, S., Liu, W., & Zheng, Y. (2017). Multi-layer multi-view topic model for classifying advertising video. Pattern Recognition, 66–81.
Hruschka, H. (2014). Linking multi-category purchases to latent activities of shoppers: Analysing market baskets by topic models. Marketing: ZFP–Journal of Research and Management, (4), 267–273.
Hruschka, H. (2019). Comparing unsupervised probabilistic machine learning methods for market basket analysis. Review of Managerial Science, 1–31.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, (2), 183–233.
Kraus, M., & Feuerriegel, S. (2019). Personalized purchase prediction of market baskets with Wasserstein-based sequence matching. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2643–2652).
Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (2015). Automatic variational inference in Stan. Advances in Neural Information Processing Systems, 568–576.
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. The Journal of Machine Learning Research, (1), 430–474.
Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 963–974.
Li, W., Matsukawa, T., Saigo, H., & Suzuki, E. (2020). Context-aware latent Dirichlet allocation for topic segmentation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 475–486).
Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning (pp. 577–584).
McCallum, A. K. (1999). Multi-label text classification with a mixture model trained by EM. In AAAI 99 Workshop on Text Learning.
Minka, T. P. (2013). Expectation propagation for approximate Bayesian inference. arXiv preprint arXiv:1301.2294.
Mukherjee, S. S., Sarkar, P., & Lin, L. (2017). On clustering network-valued data. In Advances in Neural Information Processing Systems (pp. 7071–7081).
Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics (pp. 814–822).
Reisenbichler, M., & Reutterer, T. (2019). Topic modeling in marketing: Recent advances and research opportunities. Journal of Business Economics, (3), 327–356.
Robert, C., & Casella, G. (2013). Monte Carlo statistical methods. Springer Science & Business Media.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2012). The author-topic model for authors and documents. arXiv preprint arXiv:1207.4169.
Russell, G. J., & Petersen, A. (2000). Analysis of cross category dependence in market basket selection. Journal of Retailing, (3), 367–392.
Russell, G. J., Ratneshwar, S., Shocker, A. D., Bell, D., Bodapati, A., Degeratu, A., ... Shankar, V. H. (1999). Multiple-category decision-making: Review and synthesis. Marketing Letters, (3), 319–332.
Shafiei, M. M., & Milios, E. E. (2006). Latent Dirichlet co-clustering. In Sixth International Conference on Data Mining (ICDM'06) (pp. 542–551).
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Now Publishers Inc.
Wallach, H. M. (2006). Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning (pp. 977–984).
Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1105–1112).
Wang, C., & Blei, D. M. (2013). Variational inference in nonconjugate models. Journal of Machine Learning Research, 14.