Restricted Collapsed Draw: Accurate Sampling for Hierarchical Chinese Restaurant Process Hidden Markov Models
Takaki Makino, Shunsuke Takei, Issei Sato, Daichi Mochihashi
Abstract
We propose a restricted collapsed draw (RCD) sampler, a general Markov chain Monte Carlo sampler for simultaneous draws from a hierarchical Chinese restaurant process (HCRP) under a restriction. Models that require simultaneous restricted draws from a hierarchical Dirichlet process, such as infinite hidden Markov models (iHMMs), have been unable to enjoy the benefits of the HCRP, because computing the distribution of coupled draws leads to a combinatorial explosion. By constructing a proposal of seating arrangements (partitionings) and stochastically accepting the proposal with the Metropolis-Hastings algorithm, the RCD sampler performs accurate sampling for complex combinations of draws while retaining the efficiency of the HCRP representation. Based on the RCD sampler, we developed a series of sophisticated sampling algorithms for iHMMs, including blocked Gibbs sampling, beam sampling, and split-merge sampling, that outperformed conventional iHMM samplers in our experiments.
Existing sampling algorithms for infinite hidden Markov models (iHMMs, also known as hierarchical Dirichlet process HMMs) [??] do not use the hierarchical Chinese restaurant process (HCRP) [?], which represents the predictive distribution of a hierarchical Dirichlet process (HDP) by collapsing, i.e., integrating out, the underlying Dirichlet process (DP) distributions. While the HCRP representation provides efficient sampling for many other HDP-based models [??] by reducing the dimension of the sampling space, it has been considered rather "awkward" [?] to use an HCRP for iHMMs, owing to the difficulty of handling the coupling between random variables.

In the simplest case, consider step-wise Gibbs sampling from an iHMM defined as $\pi_k \sim \mathrm{DP}(\alpha_0, \beta)$ and $\beta \sim \mathrm{GEM}(\gamma)$. Given $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_T$, resampling the hidden state $x_i$ at time step $i$ actually consists of two draws (Figure 1), $x'_i \sim \pi_{x_{i-1}}$ and $x'_{i+1} \sim \pi_{x'_i}$, under the restriction $(x'_i, x'_{i+1}) \in C$ that these draws are consistent with the following sequence, i.e., $C = \{(x'_i, x'_{i+1}) \mid x'_{i+1} = x_{i+1}\}$. Under the HCRP, the two draws are coupled even if $x_{i-1} = x'_i$, because the distributions $\pi_{x_{i-1}}$ and $\pi_{x'_i}$ as well as the base measure $\beta$ are integrated out in an HCRP, and this coupling complicates sampling from the restricted distribution.

[Figure 1: Step-wise Gibbs sampling in an iHMM. Since the Dirichlet process prior is placed on transitions in the iHMM, resampling $x_i$ involves taking two transitions, $x_{i-1} \to x_i$ and $x_i \to x_{i+1}$, simultaneously. In this case, we consider the distribution of two draws $(x'_i, x'_{i+1})$ under the restriction that the draws are consistent with the remaining sequence, i.e., $x'_{i+1} = x_{i+1}$.]

To generalize, the main part of the difficulty is to obtain a sample from a restricted joint distribution of simultaneous draws from collapsed distributions, which we call a restricted collapsed draw (RCD). Consider resampling $L$ draws simultaneously, $\mathbf{x} = (x_{j_1 i_1}, \ldots, x_{j_L i_L})$, from the respective restaurants $\mathbf{j} = (j_1, \ldots, j_L)$, when we have a restriction $C$ such that $\mathbf{x} \in C$. Step-wise Gibbs sampling from an iHMM fits into the RCD framework with $L = 2$ by allowing the restaurant index $j_2$ to depend on the preceding draw $x_{j_1 i_1}$.

In this paper, we point out that it is not enough to consider the distribution of the draws alone. Since the HCRP introduces an additional set of latent variables $s$ that accounts for the seating arrangements of the restaurants, we have to compute an exact distribution of $s$ as well, under the restriction. We want to sample from the following conditional distribution:
$$p(\mathbf{x}, s \mid C) = \frac{1}{Z_C}\, \mathbb{I}[\mathbf{x} \in C]\, p(\mathbf{x}, s), \qquad (1)$$
where $Z_C$ is a normalization constant and $\mathbb{I}$ is the indicator function, whose value is 1 if the condition is true and 0 otherwise. Although the non-restricted probability $p(\mathbf{x}, s)$ can be easily calculated for given $\mathbf{x}$ and $s$, calculating the normalization constant $Z_C$ leads to a combinatorial explosion in terms of $L$.
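To make the explosion concrete, the following is a minimal sketch (not the paper's algorithm) that computes $Z_C$ by brute force for coupled sequential draws from a toy Dirichlet-multinomial urn standing in for a collapsed distribution; the true HCRP case is even worse, since $Z_C$ would additionally sum over seating arrangements. All names and counts here are illustrative.

```python
# Brute-force Z_C of Eq. (1) for L coupled collapsed draws: K^L terms.
from itertools import product

K, alpha = 5, 1.0                 # instantiated states, concentration

def urn_prob(seq, base_counts):
    """Probability of drawing `seq` sequentially from one collapsed urn
    (symmetric Dirichlet-multinomial predictive, counts updated per draw)."""
    counts = dict(base_counts)
    p = 1.0
    for x in seq:
        total = sum(counts.values())
        p *= (counts.get(x, 0) + alpha / K) / (total + alpha)
        counts[x] = counts.get(x, 0) + 1
    return p

counts = {k: 3 for k in range(K)}
for L in (2, 4, 8):               # number of coupled draws
    # restriction C: the last draw must equal a fixed state (here 0)
    Z_C = sum(urn_prob(seq, counts)
              for seq in product(range(K), repeat=L) if seq[-1] == 0)
    print(L, K ** L, Z_C)         # the K**L enumerated terms grow exponentially
```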
To solve this issue, we propose the restricted collapsed draw (RCD) sampler, which provides accurate distributions of simultaneous draws and seating arrangements from an HCRP. The RCD sampler constructs a proposal of seating arrangements from a given proposal of draws, and the pair of proposals is stochastically accepted by the Metropolis-Hastings algorithm [?]. Since the RCD sampler can handle any combination of restricted collapsed draws simultaneously, we were able to develop a series of sampling methods for the HCRP-HMM, including a blocked collapsed Gibbs sampler, a collapsed beam sampler, and a split-merge sampler. Through experiments we found that our collapsed samplers outperformed their non-collapsed counterparts.

An infinite hidden Markov model (iHMM) [??] is defined over the following HDP:
$$G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_k \sim \mathrm{DP}(\alpha_0, G_0). \qquad (2)$$
To see the relation of this HDP to the transition matrix $\pi$, consider the explicit representation of the parameters:
$$G_0 = \sum_{k'=1}^{\infty} \beta_{k'} \phi_{k'}, \qquad G_k = \sum_{k'=1}^{\infty} \pi_{k'k} \phi_{k'}, \qquad (3)$$
where the transition probability $\pi_k$ is given as $\pi_k \sim \mathrm{DP}(\alpha_0, \beta)$, $\beta \sim \mathrm{GEM}(\gamma)$ is the stick-breaking construction of DPs [?], and $\phi_k \sim H$. A formal definition of the HDP based on this representation is:
$$\beta \mid \gamma \sim \mathrm{GEM}(\gamma), \qquad \pi_j \mid \alpha_0, \beta \sim \mathrm{DP}(\alpha_0, \beta), \qquad (4)$$
$$x_{ji} \mid (\pi_k)_{k=1}^{\infty} \sim \pi_j, \qquad \phi_k \sim H, \qquad y_{ji} \mid x_{ji} \sim F(\phi_{x_{ji}}). \qquad (5)$$
Given an HDP and an initial state $x_0$, we can construct an infinite HMM by extracting a sequence of draws $x_i$ as $x_i = x_{x_{i-1} i}$, with corresponding observations $y_i = y_{x_i i}$. Figure 2 shows a graphical representation of the iHMM.

As another way of representing the HDP in the iHMM (Eq. 2), we introduce the hierarchical Chinese restaurant process (HCRP, also known as the Chinese restaurant franchise), which does not need to sample the transition distributions $\pi$ and their base measure $\beta$ in Eq. (4):
$$k_{jt} \mid \gamma \sim \mathrm{CRP}(\gamma), \qquad t_{ji} \mid \alpha_0 \sim \mathrm{CRP}(\alpha_0), \qquad (6)$$
$$x_{ji} = k_{j t_{ji}}, \qquad (7)$$
$$\phi_k \sim H, \qquad y_{ji} \mid x_{ji}, \phi \sim F(\phi_{x_{ji}}). \qquad (8)$$
Using the Chinese restaurant metaphor, we say that customer $i$ of restaurant $j$ sits at table $t_{ji}$, which serves a dish of index $k_{j t_{ji}}$.

To understand the connection between the HDP and the HCRP, consider a finite model of grouped observations $x_{ji}$, in which each group $j$ chooses a subset of $M$ mixture components from a model-wide set of $K$ mixture components:
$$\beta \mid \gamma \sim \mathrm{Dir}(\gamma/K, \ldots, \gamma/K), \qquad k_{jt} \mid \beta \sim \beta, \qquad (9)$$
$$\tau_j \mid \alpha_0 \sim \mathrm{Dir}(\alpha_0/M, \ldots, \alpha_0/M), \qquad t_{ji} \mid \tau_j \sim \tau_j. \qquad (10)$$
As $K \to \infty$ and $M \to \infty$, the limit of this model is the HCRP; hence the infinite limit of this model is also the HDP. Equation (6) is derived by taking the infinite limit of $K$ and $M$ after integrating out $\beta$ and $\tau$ in Eqs. (9) and (10). The distribution $\pi_j$ in Eq. (4) can be derived from $\tau_j$ and $k_j$ as follows:
$$\pi_j = \sum_t \tau_{jt}\, \delta_{k_{jt}}. \qquad (11)$$

To consider sampling of $x_{ji}$ using the HCRP (Eqs. 7 and 8), we use the count notation $n_{jtk}$ for the number of customers in restaurant $j$ at table $t$ served the $k$-th dish, and $m_{jk}$ for the number of tables in restaurant $j$ serving the $k$-th dish. We also use dots for marginal counts (e.g., $m_{\cdot k} = \sum_j m_{jk}$). Then, we sample the table index $t_{ji}$ from the following distribution:
$$p(t_{ji} = t \mid t_{j1}, \ldots, t_{j,i-1}) = \frac{n_{jt\cdot}}{n_{j\cdot\cdot} + \alpha_0}, \qquad (12)$$
$$p(t_{ji} = t^{\mathrm{new}} \mid t_{j1}, \ldots, t_{j,i-1}) = \frac{\alpha_0}{n_{j\cdot\cdot} + \alpha_0}. \qquad (13)$$
When $t_{ji} = t^{\mathrm{new}}$ (i.e., the customer sits at a new table), we need to sample $k_{j t^{\mathrm{new}}}$, whose distribution is:
$$p(k_{jt} = k \mid k_{11}, \ldots) = \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}, \qquad (14)$$
$$p(k_{jt} = k^{\mathrm{new}} \mid k_{11}, \ldots) = \frac{\gamma}{m_{\cdot\cdot} + \gamma}. \qquad (15)$$
These variables determine the new sample $x_{ji} = k_{j t_{ji}}$.

[Figure 2: Graphical representation of the iHMM.]

Since $x_{ji}$ does not uniquely determine the state of the HCRP model, we need to keep the latent variables $t_{ji}$ and $k_{jt}$ for subsequent sampling. We will denote by $s^{(j)} = (t_{j1}, t_{j2}, \ldots)$ the seating arrangement in restaurant $j$, by $s^{(0)} = (k_{11}, k_{12}, \ldots)$ the seating arrangement in the root restaurant, and by $s$ the collection of all seating arrangements, corresponding to the sampled model state.
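The following is a minimal sketch (illustrative, not the authors' code) of one HCRP draw following Eqs. (12)-(15): sample a table in restaurant $j$, and sample a dish from the root restaurant if the table is new. The data structures are assumptions chosen for brevity.

```python
# One HCRP draw: table via Eqs. (12)-(13), dish via Eqs. (14)-(15) if new table.
import random

def draw_customer(j, tables, root, alpha0=1.0, gamma=1.0):
    """tables[j]: list of [n_customers, dish] per table in restaurant j;
    root[k]: m_{.k}, the number of tables serving dish k across restaurants."""
    n_total = sum(n for n, _ in tables[j])
    r = random.uniform(0.0, n_total + alpha0)        # Eqs. (12)-(13)
    for table in tables[j]:
        r -= table[0]
        if r < 0.0:
            table[0] += 1                            # sit at an existing table
            return table[1]
    m_total = sum(root.values())
    r = random.uniform(0.0, m_total + gamma)         # Eqs. (14)-(15)
    for k, m in root.items():
        r -= m
        if r < 0.0:
            dish = k                                 # existing dish, new table
            break
    else:
        dish = max(root, default=-1) + 1             # brand-new dish k^new
    root[dish] = root.get(dish, 0) + 1
    tables[j].append([1, dish])
    return dish

tables = {0: [[2, 0], [1, 1]]}                       # restaurant 0: two tables
root = {0: 1, 1: 1}                                  # table counts per dish
print(draw_customer(0, tables, root))
```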
In Bayesian inference based on sampling, we need a procedure that samples these latent variables given the value of a new draw $x_{ji}$ and the seating arrangements $s$ for the other draws; this procedure is called addCustomer. The construction of the HCRP-HMM is the same as that of the iHMM: we extract a sequence of draws $x_i = x_{x_{i-1} i}$ given $x_0$, with corresponding observations $y_i = y_{x_i i}$.

What we want is a sampling algorithm for the HCRP-HMM. As described in the Introduction, the problem reduces to sampling from $p(\mathbf{x}, s \mid C)$, i.e., the distribution of a restricted collapsed draw with seating arrangements (Eq. 1). Our idea is to apply the Metropolis-Hastings algorithm [?] to the seating arrangements, stochastically accepting a proposal of seating arrangements. Although it is hard to directly specify a proposal distribution $q(s)$ over seating arrangements, our method constructs it by combining $q_x(\mathbf{x})$, a proposal of the draws, with $q_s(s \mid \mathbf{x})$, a proposal of the seating arrangements given the proposed draws, which is based on the addCustomer procedure that is standardly used in Gibbs sampling of an HCRP.

The Metropolis-Hastings algorithm provides a way of constructing an MCMC sampler from an unnormalized probability value $\tilde{p}(z)$. After sampling $z^*$ from a proposal distribution $q(z^* \mid z)$, the algorithm computes the acceptance probability
$$R = \min\left(1, \frac{\tilde{p}(z^*)}{\tilde{p}(z^{\mathrm{old}})} \frac{q(z^{\mathrm{old}} \mid z^*)}{q(z^* \mid z^{\mathrm{old}})}\right). \qquad (16)$$
Then $z^{\mathrm{new}} = z^*$ with probability $R$, and $z^{\mathrm{new}} = z^{\mathrm{old}}$ otherwise. Repeating this process constitutes an MCMC sampler for the distribution $p(z) \propto \tilde{p}(z)$.

Within the context of the HCRP, the sample space $z$ consists of the draws $\mathbf{x}$ and the seating arrangement $s$. From Eq. (1), we can use the non-restricted probability $p(\mathbf{x}, s)$ as the unnormalized probability $\tilde{p}(z)$, but it is not easy to provide a proposal for the joint distribution $q(\mathbf{x}^*, s^*)$. Our idea is to factorize the proposal distribution as:
$$q(\mathbf{x}^*, s^* \mid s) = q_x(\mathbf{x}^* \mid s) \cdot q_s(s^* \mid \mathbf{x}^*, s). \qquad (17)$$
The first factor $q_x$ is the proposal distribution of the draws. The second factor $q_s$ is the proposal distribution of the seating arrangements given the proposed draws; we use the result of the addCustomer procedure, which stochastically updates the seating arrangements, as this proposal. The following describes each factor in $R$ and its computation.

True probability. $p(\mathbf{x}, s)$ in Eq. (1) is the joint probability of all draws $x_{ji}$:
$$p(\mathbf{x}, s) = \prod_j p(s^{(j)}) \cdot p(s^{(0)}), \qquad (18)$$
where
$$p(s^{(j)}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + n_{j\cdot\cdot})} \cdot \alpha_0^{m_{j\cdot}} \cdot \prod_t \Gamma(n_{jt\cdot}), \qquad (19)$$
$$p(s^{(0)}) = \frac{\Gamma(\gamma)}{\Gamma(\gamma + m_{\cdot\cdot})} \cdot \gamma^{K} \cdot \prod_k \Gamma(m_{\cdot k}), \qquad (20)$$
and $\Gamma$ is the gamma function. This is the product of the seating-arrangement probabilities (Eqs. 12 to 15) over all customers. In practice, we only need to calculate the probabilities that account for the change in seating from $s_0$, because the probability for unchanged customers cancels out in the fraction in $R$. Let $s_0$ be the seating arrangement for the unchanged customers; then
$$\frac{p(s^*)}{p(s^{\mathrm{old}})} = \frac{p(s^* \mid s_0)}{p(s^{\mathrm{old}} \mid s_0)}. \qquad (21)$$
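A minimal sketch of the seating-arrangement probabilities in Eqs. (18)-(20), computed in log space with the log-gamma function; the toy configuration at the end is an assumption for illustration.

```python
# Log-probability of seating arrangements, Eqs. (19)-(20).
from math import lgamma, log

def log_p_restaurant(table_sizes, alpha0):
    """Eq. (19): table_sizes[t] = n_{jt.}, customers at table t of restaurant j."""
    n = sum(table_sizes)
    return (lgamma(alpha0) - lgamma(alpha0 + n)
            + len(table_sizes) * log(alpha0)            # alpha_0^{m_{j.}}
            + sum(lgamma(nt) for nt in table_sizes))

def log_p_root(dish_table_counts, gamma):
    """Eq. (20): dish_table_counts[k] = m_{.k}, tables serving dish k."""
    m = sum(dish_table_counts)
    return (lgamma(gamma) - lgamma(gamma + m)
            + len(dish_table_counts) * log(gamma)       # gamma^K
            + sum(lgamma(mk) for mk in dish_table_counts))

# Eq. (18) for a toy model state with two restaurants and two dishes
print(log_p_restaurant([3, 1], alpha0=1.0)
      + log_p_restaurant([2], alpha0=1.0)
      + log_p_root([2, 1], gamma=1.0))
```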
In fact, $p(\mathbf{x}^*, s^* \mid s_0)$ is easily calculated along with the addCustomer operations:
$$p(\mathbf{x}^*, s^* \mid s_0) = p(x_1^*, s_1^* \mid s_0)\, p(x_2^*, s_2^* \mid s_1^*) \cdots p(x_L^*, s_L^* \mid s_{L-1}^*). \qquad (22)$$
Here, $p(x_\ell, s_\ell \mid s_{\ell-1})$ is the probability $p(x_{j_\ell i_\ell}, t_{j_\ell i_\ell} \mid j_\ell, s_{\ell-1})$ of obtaining the seating arrangement $s_\ell$ as the result of drawing one sample from restaurant $j_\ell$:
$$p(x_{ji} = k, t_{ji} = t \mid j, s) = \frac{1}{n_{j\cdot\cdot} + \alpha_0} \times \begin{cases} n_{jtk} & n_{jt\cdot} \geq 1 \\[1ex] \alpha_0 \cdot \dfrac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma} & n_{jt\cdot} = 0,\ m_{\cdot k} \geq 1 \\[1ex] \alpha_0 \cdot \dfrac{\gamma}{m_{\cdot\cdot} + \gamma} & n_{jt\cdot} = 0,\ m_{\cdot k} = 0 \end{cases} \qquad (23)$$
The same applies to the calculation of $p(s^{\mathrm{old}} \mid s_0)$, which can be done along with the removeCustomer operations.

Proposal distribution of draws. $q_x(\mathbf{x})$ can be anything as long as it is ergodic within the restriction $C$. To increase the acceptance probability, however, it is preferable for the proposal distribution to be close to the true distribution. We suggest that a good starting point is a joint distribution composed of the predictive distributions of each draw, as has been done in the approximated Gibbs sampler [?]:
$$q_x(\mathbf{x}) = \mathbb{I}[\mathbf{x} \in C] \prod_{\ell=1}^{L} p(x_\ell \mid s_0). \qquad (24)$$
We discuss the proposal distribution of draws for the HCRP-HMM case again in Section 4.

Proposal distribution of seating arrangements. $q_s(s^* \mid \mathbf{x}, s_0)$ is the product of the probabilities of each operation of adding a customer:
$$q_s(s^* \mid \mathbf{x}^*, s_0) = q_s(s_1^* \mid x_1^*, s_0)\, q_s(s_2^* \mid x_2^*, s_1^*) \cdots q_s(s_L^* \mid x_L^*, s_{L-1}^*). \qquad (25)$$
Here, $q_s(s_\ell \mid x_\ell, s_{\ell-1}) = p(t_{j_\ell i_\ell} \mid x_{j_\ell i_\ell}, j_\ell, s_{\ell-1})$, i.e., the probability of obtaining the seating arrangement $s_\ell$ as the result of the operation addCustomer$(x_{j_\ell i_\ell}, j_\ell, s_{\ell-1})$:
$$p(t_{ji} = t \mid x_{ji} = k, j, s) = \begin{cases} \dfrac{n_{jtk}}{n_{j\cdot k} + \alpha_0 \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}} & n_{jt\cdot} \geq 1 \wedge k_{jt} = k \\[2ex] \dfrac{\alpha_0 \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}}{n_{j\cdot k} + \alpha_0 \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}} & n_{jt\cdot} = 0 \wedge m_{\cdot k} > 0 \\[2ex] 1 & n_{jt\cdot} = 0 \wedge m_{\cdot k} = 0. \end{cases} \qquad (26,\ 27)$$

Noting that both Eqs. (23) and (27) are calculated along a series of addCustomer calls, we introduce the factors
$$r_\ell^* = \frac{p(x_\ell^*, s_\ell \mid s_{\ell-1})}{q_s(s_\ell \mid x_\ell^*, s_{\ell-1})}, \qquad r_\ell^{\mathrm{old}} = \frac{p(x_\ell^{\mathrm{old}}, s_\ell \mid s_{\ell-1})}{q_s(s_\ell \mid x_\ell^{\mathrm{old}}, s_{\ell-1})} \qquad (28)$$
to simplify the calculation of $R$:
$$R = \min\left(1, \frac{p(s^*)}{p(s^{\mathrm{old}})} \frac{q_s(s^{\mathrm{old}} \mid \mathbf{x}^{\mathrm{old}}, s_0)}{q_s(s^* \mid \mathbf{x}^*, s_0)} \frac{q_x(\mathbf{x}^{\mathrm{old}})}{q_x(\mathbf{x}^*)}\right) = \min\left(1, \frac{r(\mathbf{x}^*, s^* \mid s_0)}{r(\mathbf{x}^{\mathrm{old}}, s^{\mathrm{old}} \mid s_0)} \frac{q_x(\mathbf{x}^{\mathrm{old}})}{q_x(\mathbf{x}^*)}\right), \qquad (29)$$
where
$$r(\mathbf{x}^*, s^* \mid s_0) = \frac{p(\mathbf{x}^*, s^* \mid s_0)}{q_s(s^* \mid \mathbf{x}^*, s_0)} = \frac{p(x_1^*, s_1 \mid s_0)}{q_s(s_1 \mid x_1^*, s_0)} \cdots \frac{p(x_L^*, s_L \mid s_{L-1})}{q_s(s_L \mid x_L^*, s_{L-1})} = r_1^* \cdot r_2^* \cdots r_L^*. \qquad (30)$$
Surprisingly, substituting Eqs. (23) and (27) into Eq. (28) reveals that $r_\ell^*$ is equal to $p(x_{j_\ell i_\ell} = x_\ell^* \mid s_{\ell-1}^*)$, i.e., the predictive probability that a new customer at restaurant $j_\ell$ eats dish $x_\ell^*$:
$$p(x_{ji} = k \mid s) = \frac{n_{j\cdot k} + \alpha_0 \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}}{n_{j\cdot\cdot} + \alpha_0}, \qquad (31)$$
$$p(x_{ji} = k^{\mathrm{new}} \mid s) = \frac{\alpha_0 \frac{\gamma}{m_{\cdot\cdot} + \gamma}}{n_{j\cdot\cdot} + \alpha_0}. \qquad (32)$$
In other words, the calculation of the accept ratio does not use $t_{ji}$ (the table index of each customer), despite the fact that the values of $t_{ji}$ are being proposed; $t_{ji}$ affects the accept ratio only indirectly, by changing the subsequent draw probabilities $p(x_{\ell+1}^* \mid s_\ell^*), p(x_{\ell+2}^* \mid s_{\ell+1}^*), \ldots$ through the counts $n_{jtk}$ and $m_{jk}$ of customers and tables.
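A small sketch of the predictive dish probability of Eqs. (31)-(32), which is all the accept ratio needs; the counts are illustrative toy values.

```python
# Predictive probability of a dish under the collapsed HDP, Eqs. (31)-(32).
alpha0, gamma = 1.0, 1.0
n_jk = {0: 4, 1: 2}          # n_{j.k}: customers in restaurant j eating dish k
m_k  = {0: 3, 1: 2}          # m_{.k}: tables serving dish k across restaurants

n_total = sum(n_jk.values())
m_total = sum(m_k.values())

def p_dish(k):
    """Eq. (31) for an existing dish k; Eq. (32) if k is None (a new dish)."""
    if k is None:
        return alpha0 * (gamma / (m_total + gamma)) / (n_total + alpha0)
    beta_k = m_k.get(k, 0) / (m_total + gamma)       # collapsed base measure
    return (n_jk.get(k, 0) + alpha0 * beta_k) / (n_total + alpha0)

probs = [p_dish(k) for k in m_k] + [p_dish(None)]
print(probs, sum(probs))     # sums to 1 over existing dishes plus k^new
```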
It is now clear that, as done in some previous work [?], we can save storage space by using an alternative representation of the seating arrangements $s$ in which the table indices $t_{ji}$ of individual customers are forgotten and only the numbers of customers $n_{jt\cdot}$, the dishes $k_{jt}$, and the table counts $m_{jk}$ are retained. The only remaining reference to $t_{ji}$, in the removeCustomer procedure, can then safely be replaced by sampling a table index.

However, it should be noted that we have to revert to the original seating assignment $s^{\mathrm{old}}$ whenever the proposal is rejected. Putting the old draws $\mathbf{x}^{\mathrm{old}}$ back into $s_0$ by using the addCustomer procedure again would lead the sampler to an incorrect distribution of seating assignments and, consequently, an incorrect distribution of draws.

Algorithm 1 is our proposed procedure. It obtains new samples $\mathbf{x}^{\mathrm{new}}$, drawn simultaneously from the restaurants indexed by $\mathbf{j}$, together with the associated seating arrangement $s^{\mathrm{new}}$, given the previous samples $\mathbf{x}^{\mathrm{old}}$ and $s^{\mathrm{old}}$.

Algorithm 1 MH-RCDSampler$(\mathbf{j}, \mathbf{x}^{\mathrm{old}}, s^{\mathrm{old}})$: Metropolis-Hastings sampler for a restricted collapsed draw
  $s_L^{\mathrm{old}} = s^{\mathrm{old}}$
  for $\ell = L$ downto $1$ do
    $s_{\ell-1}^{\mathrm{old}}$ = removeCustomer$(x_\ell^{\mathrm{old}}, j_\ell, s_\ell^{\mathrm{old}})$  {remove the customers for $x_1^{\mathrm{old}}, \ldots, x_L^{\mathrm{old}}$ sequentially from $s^{\mathrm{old}}$}
    $r_\ell^{\mathrm{old}} = p(x_\ell^{\mathrm{old}} \mid s_{\ell-1}^{\mathrm{old}})$  {calculate factors for the accept ratio}
  end for
  $s_0^* = s_0 = s_0^{\mathrm{old}}$
  $\mathbf{x}^* \sim q_x(\mathbf{x}; s_0)$  {draw $\mathbf{x}^*$ from the proposal distribution $q_x$ of draws}
  for $\ell = 1$ to $L$ do
    $r_\ell^* = p(x_\ell^* \mid s_{\ell-1}^*)$  {calculate factors for the accept ratio}
    $s_\ell^*$ = addCustomer$(x_\ell^*, j_\ell, s_{\ell-1}^*)$  {add the customers for $x_1^*, \ldots, x_L^*$ sequentially}
  end for
  $s^* = s_L^*$  {obtain the proposal seating $s^*$}
  $R = \min\left(1, \dfrac{q_x(\mathbf{x}^{\mathrm{old}})}{q_x(\mathbf{x}^*)} \prod_{\ell=1}^{L} \dfrac{r_\ell^*}{r_\ell^{\mathrm{old}}}\right)$  {calculate the acceptance probability}
  return $\langle \mathbf{x}^{\mathrm{new}}, s^{\mathrm{new}} \rangle = \langle \mathbf{x}^*, s^* \rangle$ with probability $R$, and $\langle \mathbf{x}^{\mathrm{old}}, s^{\mathrm{old}} \rangle$ otherwise  {accept/reject the proposed sample}

The first half of this sampler is similar to a sampler for a single draw: it consists of removing the old customers, choosing a new sample, and adding the customers again. The main differences are that removeCustomer and addCustomer are each iterated $L$ times, and that the factors $r$ are accumulated for the acceptance probability $R$.
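The following is a sketch of Algorithm 1 in Python, under assumed interfaces: hypothetical helpers `add_customer`, `remove_customer`, and `predictive` (Eqs. 31-32) operate on a copyable seating state, and the caller-supplied proposal `q_x` returns the proposed draws together with their proposal probability and that of the old draws.

```python
# MH-RCD sampler (Algorithm 1), written against assumed helper interfaces.
import math
import random

def mh_rcd_sampler(j, x_old, s_old, q_x, predictive, add_customer, remove_customer):
    L = len(x_old)
    s = s_old.copy()
    r_old = [0.0] * L
    for l in reversed(range(L)):                 # remove the old customers
        s = remove_customer(x_old[l], j[l], s)
        r_old[l] = predictive(x_old[l], j[l], s)
    s0 = s.copy()                                # seating of unchanged customers
    x_new, q_new, q_prev = q_x(s0, x_old)        # proposal draws and both q values
    log_r = 0.0
    for l in range(L):                           # add the proposed customers
        log_r += math.log(predictive(x_new[l], j[l], s)) - math.log(r_old[l])
        s = add_customer(x_new[l], j[l], s)
    R = min(1.0, math.exp(log_r) * q_prev / q_new)
    if random.random() < R:
        return x_new, s                          # accept the proposal
    return x_old, s_old                          # reject: revert to the old seating
```

Note that rejection returns the untouched `s_old` rather than re-adding the old customers, in line with the caveat above.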
This section describes a series of samplers for the HCRP-HMM. First, we present the step-wise Gibbs sampler as the simplest example. After that, we describe a blocked Gibbs sampler using a forward-backward algorithm. We also explain the HCRP version of the beam sampler [?] as well as the split-merge sampler [?] for the iHMM.

A step-wise Gibbs sampler for the HCRP-HMM is easily constructed using an RCD sampler (Algorithm 5 in the Appendix describes one Gibbs sweep). We slightly modified the proposal distribution $q(x_t)$ from the one suggested in Section 3.2, in order to ensure that $x_{t+1}$ is proposed with non-zero probability even when no table in $s_0$ serves dish $x_{t+1}$:
$$q_x(x_t) \propto \left(p(x_t \mid s_0^{(x_{t-1})}) + \frac{\alpha_0\,\gamma}{(\alpha_0 + n_{x_{t-1}\cdot\cdot})(\gamma + m_{\cdot\cdot})}\,\delta_{x_{t+1}}\right) \cdot p(x_{t+1} \mid s_0^{(x_t)}) \cdot p(y_t \mid F_0^{(x_t)}). \qquad (33)$$

We can also construct an alternative sampling scheme within the RCD framework that resamples a block of hidden states simultaneously, based on the forward-backward sampler [?]. The idea is to run the forward-backward sampler with a predictive transition distribution from the HCRP-HMM and to use the result as a proposal for the restricted collapsed draw. For the iHMM, the forward-backward sampling algorithm [?] cannot be used directly, because forward probability values for an infinite number of states would have to be stored at each time step $t$ [?]. This is not the case for the HCRP-HMM: the predictive transition probability $\hat{\pi}$ under a given seating assignment $s$, given by Eqs. (31) and (32), contains transition probabilities only for the finite number $K$ of instantiated states plus one for $k^{\mathrm{new}}$. Thus we only need to store $K + 1$ forward probabilities at each time step $t$.

The result $\bar{\mathbf{x}}$ of the forward-backward sampler, however, cannot be used directly as the proposal: the $i$-th state of the proposal, $x_i^*$, equals $\bar{x}_i$ when $\bar{x}_i \neq k^{\mathrm{new}}$, but we need to assign new state indices to $x_i^*$ whenever $\bar{x}_i = k^{\mathrm{new}}$. In particular, when $k^{\mathrm{new}}$ has appeared $W \geq 2$ times, all appearances of $k^{\mathrm{new}}$ may refer to the same new state, or to $W$ different states, or to anything in between, where some appearances of $k^{\mathrm{new}}$ share a new state.

To handle this, we prepare a special CRP $Q^*$ that accounts for the previously unseen states marked by $k^{\mathrm{new}}$ in the result of the forward-backward sampler. Specifically, each table in $Q^*$ serves a dish with an unused state index, and each appearance of $k^{\mathrm{new}}$ is replaced with a draw from $Q^*$. This construction ensures that every state sequence is proposed with non-zero probability, and it allows the proposal probability to be calculated easily. The concentration parameter of $Q^*$ is set equal to $\gamma$. To handle the case where some of the new states are equal to $x_{t_{b+1}}$, i.e., the index of the state that succeeds the resampled block, we add to $Q^*$ an extra customer corresponding to $x_{t_{b+1}}$ when $x_{t_{b+1}}$ does not appear in $s_0$. The resulting proposal probability is:
$$q_x(\mathbf{x}^*) = \left(\prod_{\ell=0}^{L} \hat{\pi}_{\bar{x}_{\ell+1} \bar{x}_\ell} \cdot \prod_{\ell=1}^{L} F_{\bar{x}_\ell}(y_\ell)\right) \cdot \prod_{\ell : \bar{x}_\ell = k^{\mathrm{new}}} p(x_\ell^* \mid Q^*), \qquad (34)$$
where the first factor accounts for the forward probability of the sequence and the second factor accounts for the probability of the new state assignment.

Note also that, to keep the proposal distribution sensible, we cannot resample the whole state sequence simultaneously. We divide the state sequence into several blocks and resample each block given the other blocks. The block size affects efficiency: blocks that are too large have a low accept probability, while with blocks that are too small, the algorithm has little advantage over step-wise Gibbs sampling. Algorithm 8 in the Appendix describes one sweep of a blocked Gibbs sampler for an HCRP-HMM.
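A minimal sketch (with assumed toy inputs) of the forward-filtering backward-sampling pass over the $K + 1$ states used by the blocked proposal: $K$ instantiated states plus one $k^{\mathrm{new}}$ column built from Eqs. (31)-(32), conditioned on the states adjacent to the block.

```python
# Forward filtering and backward sampling for one block of hidden states.
import numpy as np

def forward_pass(pi_hat, emis, x_prev):
    """pi_hat: (K+1, K+1) predictive transitions (last index = k^new);
    emis: (T, K+1) emission likelihoods; x_prev: state preceding the block."""
    T, S = emis.shape
    fwd = np.zeros((T, S))
    fwd[0] = pi_hat[x_prev] * emis[0]
    fwd[0] /= fwd[0].sum()
    for t in range(1, T):
        fwd[t] = (fwd[t - 1] @ pi_hat) * emis[t]
        fwd[t] /= fwd[t].sum()                  # normalize to avoid underflow
    return fwd

def backward_sample(fwd, pi_hat, x_next, rng=np.random.default_rng()):
    """Sample the block backwards, given the state following the block."""
    T, S = fwd.shape
    states = np.empty(T, dtype=int)
    w = fwd[T - 1] * pi_hat[:, x_next]          # restriction: consistent with x_next
    states[T - 1] = rng.choice(S, p=w / w.sum())
    for t in range(T - 2, -1, -1):
        w = fwd[t] * pi_hat[:, states[t + 1]]
        states[t] = rng.choice(S, p=w / w.sum())
    return states
```

The appearances of $k^{\mathrm{new}}$ in the sampled block would then be relabeled by draws from the special CRP $Q^*$, as described above.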
Beam sampling for the HDP-HMM [?] uses slice sampling [?] on the transition probabilities to extract a finite subset of the state space. Although the possible states are already finite in the HCRP-HMM, the same technique may still benefit HCRP-HMM sampling by reducing the number of states considered in one sampling step. We just need to replace the call to ForwardBackwardSampling in Algorithm 8 with a call to BeamSampling in order to use beam sampling with the HCRP-HMM. A brief overview of beam sampling is:

1. Sample auxiliary variables $\mathbf{u} = (u_1, \ldots, u_L)$ as $u_\ell \sim \mathrm{Uniform}(0, \pi_{x_\ell x_{\ell-1}})$.
2. For $\ell = 1, \ldots, L$, calculate the forward probability $q(x'_\ell = k')$ using a slice of the transition probabilities: $q(x'_\ell = k') = F_{k'}(y_\ell) \sum_k \mathbb{I}(\pi_{k'k} > u_\ell)\, q(x'_{\ell-1} = k)$.
3. For $\ell = L, \ldots, 1$, sample the states backwards, i.e., $p(x'_\ell = k) \propto \mathbb{I}(\pi_{x'_{\ell+1} k} > u_{\ell+1})\, q(x'_\ell = k)$.

For details, refer to the original paper [?]. Some remarks are needed for the calculation of $q_x^*$, the proposal probability of the state sequence. Although beam sampling has a different proposal distribution from forward-backward sampling, we can use the same proposal-probability calculation in the acceptance probability as for forward-backward sampling. This is because beam sampling satisfies the detailed balance equation, which ensures that the ratio of proposal probabilities under beam sampling, $q^*_{\mathrm{slice}} / q^{\mathrm{old}}_{\mathrm{slice}}$, is always equal to the ratio $q^* / q^{\mathrm{old}}$ obtained by forward-backward sampling.
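A toy sketch of the slice idea in steps 1-2 above, with an assumed dense transition matrix `pi` (`pi[k2, k1]` = probability of the transition $k_1 \to k_2$) and assumed emission likelihoods; this is an illustration, not the paper's implementation.

```python
# Beam-sampling forward pass: slice variables restrict each step to the
# transitions whose probability exceeds u_t.
import numpy as np

rng = np.random.default_rng(0)

def beam_forward(pi, emis, path_old):
    """path_old includes the state preceding the block at index 0."""
    T = emis.shape[0]
    # step 1: slice variables drawn under the *current* path
    u = np.array([rng.uniform(0, pi[path_old[t + 1], path_old[t]])
                  for t in range(T)])
    # step 2: forward recursion restricted to transitions above the slice
    q = np.zeros((T, emis.shape[1]))
    q[0] = emis[0] * (pi[:, path_old[0]] > u[0])
    for t in range(1, T):
        q[t] = emis[t] * ((pi > u[t]) @ q[t - 1])   # sum_k I(pi_{k'k}>u_t) q_{t-1,k}
    return q, u

pi = rng.dirichlet(np.ones(4), size=4).T            # pi[:, k]: outgoing probs of k
emis = rng.random((6, 4))
q, u = beam_forward(pi, emis, path_old=[0, 1, 2, 1, 3, 0, 2])
print(q[-1])
```

The backward pass (step 3) then mirrors the forward-backward sampler, with the indicator $\mathbb{I}(\pi_{x'_{\ell+1} k} > u_{\ell+1})$ in place of the transition probability.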
We can also integrate the split-merge sampling algorithm, another sampling approach for Dirichlet process mixture models [?], into the HCRP-HMM using the RCD sampler. A split-merge sampler makes a proposal move that tries to merge two mixture components into one, or to split one mixture component into two; the sampler then uses a Metropolis-Hastings step to stochastically accept the proposal. Based on the RCD framework, we can extend the split-merge sampler to the HCRP, and hence apply it to the HCRP-HMM. In the context of an HMM, the sampler merges two state indices into one, or splits one state index into two.

Our implementation is based on an improved version of the split-merge sampler, called the sequentially-allocated merge-split sampler [?], which produces a split proposal while sequentially allocating components in random order. To deal with the temporal dependency in an HMM, we identify fragments of state sequences to be resampled within the state sequence, and perform blocked Gibbs sampling on each fragment in random order.

We added one important optimization to the split-merge sampling algorithm. Since a merge move is proposed much more frequently than a split move, and the move has a relatively low accept probability, it is beneficial to determine as early as possible whether a merge move will be rejected. We point out that, while the proposal probability of a merge move is being calculated, the running accept probability is monotonically decreasing. Consequently, we sample $R_{\mathrm{thr}}$, the threshold of the accept probability, at the beginning of the algorithm, and stop further calculation as soon as $R$ falls below $R_{\mathrm{thr}}$ (see the sketch after Algorithm 10 in the Appendix). Algorithm 9 in the Appendix gives the split-merge sampling algorithm for the HCRP-HMM. Split-merge sampling allows faster mixing when it is interleaved with other sampling strategies; we examine split-merge sampling combined with each of the samplers presented in this paper.

This section presents two series of experiments, the first with small artificial sequences and the second with a sequence of natural-language words.

Settings

We put a gamma prior $\mathrm{Gamma}(1, \cdot)$ on $\alpha_0$ and $\gamma$, and resampled them between every sweep using the auxiliary-variable method [?] in all the experiments. We introduced an HCRP as the prior of the emission distributions as well, and its hyperparameters were sampled in the same way. The initial state sequence given to each sampler is the result of a particle filter with 100 particles.

We measured the autocorrelation time (ACT) to evaluate mixing. Given a sequence of values $x = x_1, x_2, \ldots, x_T$ with mean $\mu$ and variance $\sigma^2$, $\mathrm{ACT}(x)$ is defined as follows:
$$\mathrm{ACF}_t(x) = \frac{1}{(T - t)\,\sigma^2} \sum_{i=1}^{T-t} (x_i - \mu)(x_{i+t} - \mu), \qquad (35)$$
$$\mathrm{ACT}(x) = \frac{1}{2} + \sum_{t=1}^{\infty} \mathrm{ACF}_t(x). \qquad (36)$$
Since $\mathrm{ACF}_t(x)$ is expected to converge to zero for larger $t$, we truncated the sum in Eq. (36) at a finite maximum lag.

For the artificial sequences, we evaluated the mutual information between $h_t$, the hidden states used in sequence generation, and $x_t$, the inferred states:
$$MI = \sum_h \sum_x p(x, h) \log \frac{p(x, h)}{p(x)\, p(h)}. \qquad (37)$$
For the natural-language text, the inferred model is evaluated by multiple runs of a particle filter on a given test sequence of length $T_{\mathrm{test}}$. Specifically, we construct a particle filter with $Z = 100$ particles for each sampled model state $s_z$ and evaluate the likelihood $l(y_i \mid s_z)$ of each emission. Finally, we calculate the perplexity (the reciprocal geometric mean of the emission probabilities) of the test sequence:
$$PPL = \exp\left(-\frac{1}{T_{\mathrm{test}}} \sum_i \log \hat{l}(y_i)\right), \qquad (38)$$
where
$$\hat{l}(y_i) = \frac{1}{Z} \sum_{z=1}^{Z} l(y_i \mid s_z). \qquad (39)$$
The samplers we chose for comparison are the step-wise Gibbs sampler with the direct-assignment representation [?], which uses stick-breaking for the root DP and the CRP for the other DPs; the step-wise Gibbs sampler with the stick-breaking construction; and the beam sampler with the stick-breaking construction [?]. For a fair comparison between different algorithms, we collected samples for evaluating autocorrelation time and perplexity on a CPU-time basis (excluding the time used for evaluation). All the algorithms were implemented in C++ and tested on machines with an Intel Xeon E5450 at 3 GHz.
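A small sketch of the mixing diagnostic in Eqs. (35)-(36): empirical autocorrelation and the truncated autocorrelation time of a chain of statistics. The white-noise test at the end is illustrative.

```python
# Autocorrelation time, Eqs. (35)-(36), with the sum truncated at max_lag.
import numpy as np

def autocorrelation_time(x, max_lag=1000):
    x = np.asarray(x, dtype=float)
    T = len(x)
    mu, var = x.mean(), x.var()
    act = 0.5                                    # the 1/2 term in Eq. (36)
    for t in range(1, min(max_lag, T - 1) + 1):
        acf = ((x[:T - t] - mu) * (x[t:] - mu)).sum() / ((T - t) * var)  # Eq. (35)
        act += acf
    return act

# i.i.d. noise should give ACT near 0.5; a slowly-mixing chain gives larger values
rng = np.random.default_rng(0)
print(autocorrelation_time(rng.normal(size=5000)))
```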
[Figure 3: Automaton that generates Sequence 2. Circles denote hidden states; the same alphabet emission is observed from states within the same oval group. A dashed arrow denotes a transition with probability 0.8, a bold arrow a transition with probability 0.84, and a solid arrow an emission with probability 1/3.]
The first series of experiments is performed with two small artificial sequences. Sequence 1 consists of the repeating symbol sequence A-B-C-D-B-C-D-E-... of length $T = 500$; we ran the sampler for 30 s of burn-in and then sampled a model state every 2 s until a total of 300 s was reached. Sequence 2 is generated from the simple finite-state automaton in Figure 3 with length $T = 2500$; here we used 60 s of burn-in and 600 s in total. We evaluated the mutual information between the inferred hidden states and the true hidden states.

Figure 4 shows the distribution of mutual information over 100 trials after 300 s. Some of the samplers based on the proposed method achieved better mutual information than the existing samplers; the improvement depends on the type of sequence and on the sampler.

For Sequence 1, split-merge sampling yields better results than the other samplers. Although an HMM with eight hidden states can predict the sequence perfectly, the samplers tend to be trapped in a local optimum with five states in the initial phase, because our selected prior on $\gamma$ places larger probability on a smaller number of hidden states. Detailed investigation (Figure 5) confirmed this analysis.

For Sequence 2, on the other hand, the blocked samplers worked very efficiently. Step-wise samplers generally worked poorly on this sequence, because the strong dependency between temporally adjacent states impedes mixing. Still, the step-wise Gibbs sampler for the HCRP-HMM outperformed the beam sampler with the stick-breaking process. The blocked Gibbs sampler performed worse owing to its heavy computation for a large number of states, but the beam sampler for the HCRP-HMM was efficient and performed well. Combining it with a small number of split-merge moves increases performance (more split-merge sampling lowers performance by taking computational resources away from the beam sampler). From the averaged statistics of the samplers (Table 1), we can see that (1) the increase in mutual information cannot be explained only by the increase in the number of states; (2) Gibbs proposals have a very high accept rate; and (3) split-merge proposals have a very low accept rate but still improve the mutual information.

We also tested the samplers using a sequence of natural-language words from
Alice's Adventures in Wonderland. We converted the text to lower case, removed punctuation, and placed a special word EOS after every sentence, obtaining a corpus of 28,120 words; we kept the last 1,000 words as the test corpus and trained on a sequence of length $T = 27120$. We introduced a special word UNK (unknown) to replace every word that occurred only once, resulting in $|\Sigma| = 1{,}\ldots$ unique words in the text. We used 10,000 s for burn-in and sampled a model state every 120 s, up to a total of 172,800 s. Table 2 summarizes the averaged statistics over 18 trials.

We found that step-wise sampling outperformed blocked sampling (including beam sampling). The reason may be the nature of the sequence, which has a lower temporal dependency. Blocked Gibbs sampling, in particular, consumes too much time per sweep to be of practical use. We also found that split-merge sampling had a very low accept rate and thus contributed little to the result.

Nevertheless, we can see the advantage of the HCRP representation over the stick-breaking representation. The direct-assignment (DA) algorithm showed competitively good perplexity, reflecting the fact that DA uses stick-breaking only for the root DP and the CRP representation for the other DPs. Although step-wise Gibbs sampling and its slice-sampling version seem to outperform DA slightly, we would need to collect more data to show that the difference is significant. At the least, however, many sampling algorithms are now available for inference, and a suitable one can be chosen depending on the nature of the sequence.

We have proposed a method of sampling directly from constrained distributions of simultaneous draws from a hierarchical Chinese restaurant process (HCRP). We pointed out that, to obtain a correct sample distribution, the seating arrangements (partitionings) must be correctly sampled for a restricted collapsed draw, and we therefore proposed applying the Metropolis-Hastings algorithm to the seating arrangements. Our algorithm, the restricted collapsed draw (RCD) sampler, uses a naïve sampler to provide a proposal distribution for the seating arrangements. Based on this sampler, we developed various sampling algorithms for the HDP-HMM in the HCRP representation, including blocked Gibbs sampling, beam sampling, and split-merge sampling.

The applications of the RCD sampler, which is at the heart of our algorithms, are not limited to the HCRP-HMM. The experimental results revealed that some of the proposed algorithms outperform existing sampling methods, indicating that the benefits of using a collapsed representation exceed the cost of rejected proposals.

The main contribution of this study is that it opens a way to develop more complex Bayesian models based on CRPs. Since the RCD sampler is simple, flexible, and independent of the particular structure of a hierarchy, it can be applied to any combination or hierarchical structure of CRPs. Our future work includes using this algorithm to construct new Bayesian models based on hierarchical CRPs that are hard to implement with a non-collapsed representation. Planned work includes extending the HDP-IOHMM [?
] with a three-level hierarchical DP (e.g., the second level could correspond to actions and the third level to input symbols).

[Figure 4: Average mutual information of the sampled hidden states. (a) Sequence 1; (b) Sequence 2.]

[Figure 5: Distribution of mutual information for Sequence 1; the x-axis shows mutual information and the y-axis shows frequency. Panels: (a) HDP-HMM (SB) SGibbs; (b) HDP-HMM (DA) SGibbs; (c) HDP-HMM (SB) Beam; (d) HCRP-HMM Beam; (e) HCRP-HMM Beam, Split-Merge 3/sweep; (f) HCRP-HMM Beam, Split-Merge 25/sweep. Block size ≈ … for HCRP-HMM beam sampling.]

[Table 1: Experimental results for Sequence 2. Columns: sampler name, MI, ACT. Legend: DA = direct assignment; SB = stick-breaking construction; MI = mutual information; ACT = autocorrelation time (samples collected every 0.1 s); SM = split-merge sampler (+SM=n denotes n SM trials per Gibbs sweep).]

[Table 2: Experiments on natural-language text. Columns: sampler, perplexity.
Footnote: for HCRP-HMM, the block size ≈ ….]

A Miscellaneous Algorithms
Algorithm 2 getProb$(j, k, s)$: calculate $p(x_{ji} = k \mid s)$
  if $m_{\cdot k} = 0$ then
    return $\dfrac{\alpha_0 \frac{\gamma}{m_{\cdot\cdot} + \gamma}}{n_{j\cdot\cdot} + \alpha_0}$
  else
    return $\dfrac{n_{j\cdot k} + \alpha_0 \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}}{n_{j\cdot\cdot} + \alpha_0}$
  end if

Algorithm 3 addCustomer$(j, k, s^{\mathrm{old}})$: add a new customer eating dish $k$ to restaurant $j$
  $s := s^{\mathrm{old}}$
  Choose one of the following, with probabilities proportional to:
    $n_{jtk}$ ($t = 1, \ldots, m_{j\cdot}$): increment $n_{jtk}$ (the customer sits at the $t$-th table);
    $\alpha_0 \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}$: seat the customer at a new table $t^{\mathrm{new}}$ serving dish $k$ in restaurant $j$ ($n_{j t^{\mathrm{new}} k} := 1$, $k_{j t^{\mathrm{new}}} := k$, increment $m_{jk}$)
  return the updated $s$

Algorithm 4 removeCustomer$(j, k, s^{\mathrm{old}})$: remove an existing customer eating dish $k$ from restaurant $j$
  $s := s^{\mathrm{old}}$
  Sample $t_{ji}$ with probability proportional to $n_{j t_{ji} k}$
  Decrement $n_{j t_{ji} k}$ (the customer at the $t_{ji}$-th table is removed)
  if $n_{j t_{ji} k}$ becomes zero then
    Remove the unoccupied table $t_{ji}$ from restaurant $j$ and decrement $m_{jk}$
  end if
  return the updated $s$
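A compact sketch of Algorithms 2-4 under the count-only representation discussed in Section 3 (per-customer table indices are forgotten, so removal resamples a table in proportion to $n_{jtk}$). The classes and their fields are illustrative assumptions, not the authors' code.

```python
# Count-only HCRP state with getProb / addCustomer / removeCustomer.
import random

class Root:                        # root restaurant: m[k] = tables serving dish k
    def __init__(self, gamma):
        self.gamma, self.m = gamma, {}

class Restaurant:                  # tables: list of [n_customers, dish]
    def __init__(self, alpha0):
        self.alpha0, self.tables = alpha0, []

def get_prob(rest, k, root):       # Algorithm 2 (Eqs. 31-32); k=None means k^new
    n_tot = sum(n for n, _ in rest.tables)
    m_tot = sum(root.m.values())
    if k is None or k not in root.m:
        return rest.alpha0 * root.gamma / ((m_tot + root.gamma) * (n_tot + rest.alpha0))
    n_k = sum(n for n, d in rest.tables if d == k)
    return (n_k + rest.alpha0 * root.m[k] / (m_tot + root.gamma)) / (n_tot + rest.alpha0)

def add_customer(rest, k, root):   # Algorithm 3 (table chosen as in Eq. 27)
    m_tot = sum(root.m.values())
    weights = [n if d == k else 0.0 for n, d in rest.tables]
    weights.append(rest.alpha0 * root.m.get(k, 0) / (m_tot + root.gamma))
    if sum(weights) == 0.0:        # dish unseen anywhere: must open a new table
        idx = len(rest.tables)
    else:
        idx = random.choices(range(len(weights)), weights=weights)[0]
    if idx == len(rest.tables):    # new table serving dish k
        rest.tables.append([1, k])
        root.m[k] = root.m.get(k, 0) + 1
    else:
        rest.tables[idx][0] += 1

def remove_customer(rest, k, root):  # Algorithm 4; assumes such a customer exists
    weights = [n if d == k else 0.0 for n, d in rest.tables]
    idx = random.choices(range(len(rest.tables)), weights=weights)[0]
    rest.tables[idx][0] -= 1
    if rest.tables[idx][0] == 0:     # table emptied: remove it from both levels
        del rest.tables[idx]
        root.m[k] -= 1
        if root.m[k] == 0:
            del root.m[k]
```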
B Step-wise Gibbs sampler

To handle the emission probability $F^{(x_i)}$ with a conjugate prior, we introduce notation similar to that of the HCRP, which can be understood intuitively.

Algorithm 5 Step-wise Gibbs sweep for an HCRP-HMM
Input:
  $y_1, \ldots, y_T$: observed emissions
  $x_1, \ldots, x_T$: previously inferred states
  $s^{\mathrm{old}}$: set of CRP seating arrangements
  $F^{\mathrm{old}}$: set of emission distributions
for $t = 1, \ldots, T$, in random order do
  $s_2$ = removeCustomer$(x_{t+1}, x_t, s^{\mathrm{old}})$
  $r_3^{\mathrm{old}}$ = getProb$(x_{t+1}, x_t, s_2)$
  $F_1$ = removeCustomer$(y_t, x_t, F^{\mathrm{old}})$
  $r_2^{\mathrm{old}}$ = getProb$(y_t, x_t, F_1)$
  $s_1$ = removeCustomer$(x_t, x_{t-1}, s_2)$
  $r_1^{\mathrm{old}}$ = getProb$(x_t, x_{t-1}, s_1)$
  Sample $x_t^*$ in proportion to $q(x_t)$, where
    $q(x_t) \propto \left(p(x_t \mid s_0^{(x_{t-1})}) + \frac{\alpha_0 \gamma}{(\alpha_0 + n_{x_{t-1}\cdot\cdot})(\gamma + m_{\cdot\cdot})}\,\delta_{x_{t+1}}\right) \cdot p(y_t \mid F^{(x_t)}) \cdot p(x_{t+1} \mid s_0^{(x_t)})$
  $r_1^*$ = getProb$(x_t^*, x_{t-1}, s_1)$
  $s_1'$ = addCustomer$(x_t^*, x_{t-1}, s_1)$
  $r_2^*$ = getProb$(y_t, x_t^*, F_1)$
  $F^*$ = addCustomer$(y_t, x_t^*, F_1)$
  $r_3^*$ = getProb$(x_{t+1}, x_t^*, s_1')$
  $s^*$ = addCustomer$(x_{t+1}, x_t^*, s_1')$
  $R := \min\left(1, \dfrac{r_1^* r_2^* r_3^*}{r_1^{\mathrm{old}} r_2^{\mathrm{old}} r_3^{\mathrm{old}}} \dfrac{q(x_t)}{q(x_t^*)}\right)$
  $\langle x_t, s^{\mathrm{old}}, F^{\mathrm{old}} \rangle := \langle x_t^*, s^*, F^* \rangle$ with probability $R$; otherwise keep $\langle x_t, s^{\mathrm{old}}, F^{\mathrm{old}} \rangle$ unchanged
end for

C Blocked Gibbs sampler
For details on the
ForwardBackwardSampling routine, please refer to the literature [?].

Algorithm 6 removeSeq$(i_b, i_e, x, s^{\mathrm{old}}, F^{\mathrm{old}})$: remove the customers for a part of the state sequence $(x_{i_b}, \ldots, x_{i_e})$
  $L = i_e - i_b + 1$
  $s_L$ = removeCustomer$(x_{i_b + L}, x_{i_e}, s^{\mathrm{old}})$
  $r_L^{\mathrm{old}}$ = getProb$(x_{i_b + L}, x_{i_e}, s_L)$
  $F_L = F^{\mathrm{old}}$
  for $\ell = L - 1$ downto $0$ do
    $s_\ell$ = removeCustomer$(x_{i_b + \ell}, x_{i_b + \ell - 1}, s_{\ell+1})$
    $F_\ell$ = removeCustomer$(y_{i_b + \ell}, x_{i_b + \ell}, F_{\ell+1})$
    $r_\ell^{\mathrm{old}}$ = getProb$(x_{i_b + \ell}, x_{i_b + \ell - 1}, s_\ell)$ $\cdot$ getProb$(y_{i_b + \ell}, x_{i_b + \ell}, F_\ell)$
  end for
  return $\langle s_0, F_0, \prod_{\ell=0}^{L} r_\ell^{\mathrm{old}} \rangle$

Algorithm 7 addSeq$(i_b, i_e, x, s_0, F_0)$: add the customers for a part of the state sequence $(x_{i_b}, \ldots, x_{i_e})$
  $L = i_e - i_b + 1$
  for $\ell = 0$ to $L - 1$ do
    $r_\ell^*$ = getProb$(x^*_{i_b + \ell}, x^*_{i_b + \ell - 1}, s_\ell)$ $\cdot$ getProb$(y_{i_b + \ell}, x^*_{i_b + \ell}, F_\ell)$
    $s_{\ell+1}$ = addCustomer$(x^*_{i_b + \ell}, x^*_{i_b + \ell - 1}, s_\ell)$
    $F_{\ell+1}$ = addCustomer$(y_{i_b + \ell}, x^*_{i_b + \ell}, F_\ell)$
  end for
  $r_L^*$ = getProb$(x_{i_b + L}, x^*_{i_e}, s_L)$
  $s^*$ = addCustomer$(x_{i_b + L}, x^*_{i_e}, s_L)$; $F^* = F_L$
  return $\langle s^*, F^*, \prod_{\ell=0}^{L} r_\ell^* \rangle$

Algorithm 8
Blocked Gibbs sweep for HCRP-HMM
Input:
  $y_1, \ldots, y_T$: observed emissions
  $x = x_1, \ldots, x_T$: previously inferred states
  $s$: set of CRP seating arrangements
  $F$: set of emission distributions
  $B$: number of blocks
Choose block boundaries $i_1, \ldots, i_{B-1} \in \{1, \ldots, T\}$; $i_0 := 1$, $i_B := T$
for $b = 0, \ldots, B - 1$, in random order do
  $\langle s_0, F_0, r^{\mathrm{old}} \rangle$ = removeSeq$(i_b, i_{b+1} - 1, x, s, F)$
  $x_i^* = x_i$ for all $i < i_b$ or $i \geq i_{b+1}$
  $(x_{i_b}^*, \ldots, x_{i_{b+1}-1}^*)$ = ForwardBackwardSampler$(\hat{\pi}_{\mid s_0}, F_0, y_{i_b : i_{b+1}-1}, x_{i_b - 1}, x_{i_{b+1}})$
  Calculate $q^{\mathrm{old}} = q(x_{i_b}, \ldots, x_{i_{b+1}-1})$ and $q^* = q(x_{i_b}^*, \ldots, x_{i_{b+1}-1}^*)$
  $Q^{\mathrm{old}} = \mathrm{CRP}(\gamma, H)$; $Q^* = \mathrm{CRP}(\gamma, H)$
  if $x_{i_{b+1}}$ refers to a new state in $s_0$ then
    $Q^*$ := addCustomer$(x_{i_{b+1}}, Q^*)$; $Q^{\mathrm{old}}$ := addCustomer$(x_{i_{b+1}}, Q^{\mathrm{old}})$
  end if
  for $i = i_b$ to $i_{b+1} - 1$ do
    if $x_i$ refers to a new state in $s_0$ then
      $q^{\mathrm{old}} := q^{\mathrm{old}} \cdot$ getProb$(x_i, Q^{\mathrm{old}})$; $Q^{\mathrm{old}}$ := addCustomer$(x_i, Q^{\mathrm{old}})$
    end if
    if $x_i^* = k^{\mathrm{new}}$ then
      sample $c \sim Q^*$; $x_i^* := c$
      $q^* := q^* \cdot$ getProb$(x_i^*, Q^*)$; $Q^*$ := addCustomer$(x_i^*, Q^*)$
    end if
  end for
  $\langle s^*, F^*, r^* \rangle$ = addSeq$(i_b, i_{b+1} - 1, x^*, s_0, F_0)$
  $R := \min\left(1, \dfrac{q^{\mathrm{old}}}{q^*} \cdot \dfrac{r^*}{r^{\mathrm{old}}}\right)$
  $\langle x, s, F \rangle := \langle x^*, s^*, F^* \rangle$ with probability $R$; otherwise keep $\langle x, s, F \rangle$ unchanged
end for

D Split-Merge sampler
Algorithm 9
Split-Merge Sampler for an HCRP-HMM
Input:
  $y_1, \ldots, y_T$: observed emissions
  $x_1, \ldots, x_T$: previously inferred states
  $s^{\mathrm{old}}$: set of CRP seating arrangements
  $F^{\mathrm{old}}$: set of emission distributions
$R_{\mathrm{thr}} \sim \mathrm{Uniform}(0, 1)$
Choose distinct $t_1, t_2 \in \{1, \ldots, T\}$
Identify all fragments $(b_i, e_i)$ such that for all $t \in (b_i, \ldots, e_i)$, $x_t \in \{x_{t_1}, x_{t_2}\} \wedge t \notin \{t_1, t_2\}$, and the fragment is not contained in another fragment
Permute the fragments randomly; let $U$ be the number of fragments
$s_{U+1} = s^{\mathrm{old}}$, $F_{U+1} = F^{\mathrm{old}}$
if $x_{t_1} = x_{t_2}$ then {try a split move}
  for $i = U$ downto $1$ do
    $\langle s_i, F_i, r_i^{\mathrm{old}} \rangle$ = removeSeq$(b_i, e_i, x, s_{i+1}, F_{i+1})$
    $q_i^{\mathrm{old}} = 1$
  end for
  $x_{t_1}^*$ := a new state index
else {try a merge move}
  for $i = U$ downto $1$ do
    $\langle s_i, F_i, r_i^{\mathrm{old}} \rangle$ = removeSeq$(b_i, e_i, x, s_{i+1}, F_{i+1})$
    $q_i^{\mathrm{old}}$ = SeqProb$(\hat{\pi}_{\mid s_{i+1}}, F_i, y_{b_i : e_i}, x_{b_i - 1 : e_i + 1})$ / ForwardProb$(\hat{\pi}_{\mid s_{i+1}}, F_i, y_{b_i : e_i}, x_{b_i - 1}, x_{e_i + 1}; \{x_{t_1}, x_{t_2}\})$
  end for
  $x_{t_1}^* := x_{t_2}$
end if
{Remove the customers that account for the transitions around $x_{t_1}^{\mathrm{old}}$}
$F_0$ = removeCustomer$(y_{t_1}, x_{t_1}, F_1)$
$p^{\mathrm{old}} = p(y_{t_1} \mid F_{x_{t_1}})$
$s_0 := s_1$
if $t_1 - 1$ is not in any fragment then
  $s_0$ := removeCustomer$(x_{t_1}, x_{t_1 - 1}, s_0)$
  multiply $r_0^{\mathrm{old}}$ by getProb$(x_{t_1}, x_{t_1 - 1}, s_0)$
end if
if $t_1 + 1$ is not in any fragment then
  $s_0$ := removeCustomer$(x_{t_1 + 1}, x_{t_1}, s_0)$
  multiply $r_0^{\mathrm{old}}$ by getProb$(x_{t_1 + 1}, x_{t_1}, s_0)$
end if
$q_0^{\mathrm{old}} = q_0^* = 1$
(continued in Algorithm 10)

Algorithm 10 Split-Merge sampler for an HCRP-HMM (continued)
{Add the customers that account for the transitions around $x_{t_1}^*$}
$p^* = p(y_{t_1} \mid F^*_{x_{t_1}^*})$
$F^*$ = addCustomer$(y_{t_1}, x_{t_1}^*, F_0)$
$s^* := s_0$
if $t_1 + 1$ is not in any fragment then
  multiply $r_0^*$ by getProb$(x_{t_1 + 1}, x_{t_1}^*, s^*)$
  $s^*$ := addCustomer$(x_{t_1 + 1}, x_{t_1}^*, s^*)$
end if
if $t_1 - 1$ is not in any fragment then
  multiply $r_0^*$ by getProb$(x_{t_1}^*, x_{t_1 - 1}, s^*)$
  $s^*$ := addCustomer$(x_{t_1}^*, x_{t_1 - 1}, s^*)$
end if
if $x_{t_1} = x_{t_2}$ then {try a split move}
  for $i = 1$ to $U$ do
    $(x_{b_i}^*, \ldots, x_{e_i}^*)$ = LimitedFBSampler$(\hat{\pi}_{\mid s_i^*}, F_i^*, y_{b_i : e_i}, x_{b_i - 1}^*, x_{e_i + 1}^*; \{x_{t_1}^*, x_{t_2}^*\})$
    $q_i^*$ = SeqProb$(\hat{\pi}_{\mid s_i^*}, F_i^*, y_{b_i : e_i}, x_{b_i - 1 : e_i + 1}^*)$ / ForwardProb$(\hat{\pi}_{\mid s_i^*}, F_i^*, y_{b_i : e_i}, x_{b_i - 1}^*, x_{e_i + 1}^*; \{x_{t_1}^*, x_{t_2}^*\})$
    $\langle s_{i+1}^*, F_{i+1}^*, r_i^* \rangle$ = addSeq$(b_i, e_i, x^*, s_i^*, F_i^*)$
  end for
else {try a merge move}
  for $i = 1$ to $U$ do
    $R_{\mathrm{cur}} = \prod_{i'=0}^{i-1} \dfrac{r_{i'}^*}{q_{i'}^*} \cdot \prod_{i'=0}^{U} \dfrac{q_{i'}^{\mathrm{old}}}{r_{i'}^{\mathrm{old}}}$
    if $R_{\mathrm{thr}} \geq R_{\mathrm{cur}}$ then rejection is determined; exit the loop end if
    $x_{b_i}^*, \ldots, x_{e_i}^* := x_{t_1}^*$; $q_i^* = 1$
    $\langle s_{i+1}^*, F_{i+1}^*, r_i^* \rangle$ = addSeq$(b_i, e_i, x^*, s_i^*, F_i^*)$
  end for
end if
$R = \prod_{i=0}^{U} \dfrac{r_i^*}{r_i^{\mathrm{old}}} \cdot \prod_{i=1}^{U} \dfrac{q_i^{\mathrm{old}}}{q_i^*}$
$\langle x, s, F \rangle := \langle x^*, s^*, F^* \rangle$ if $R_{\mathrm{thr}} < R$; otherwise $\langle x^{\mathrm{old}}, s^{\mathrm{old}}, F^{\mathrm{old}} \rangle$
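A small sketch of the early-rejection optimization used in the merge branch above: the Metropolis-Hastings threshold is drawn first, and the scan aborts as soon as the monotonically decreasing running ratio falls below it. This is equivalent to the standard accept-with-probability-$\min(1, R)$ rule whenever each streamed factor is at most 1, as assumed here; the function and variable names are illustrative.

```python
# Early rejection for a merge proposal whose accept ratio only decreases.
import random

def merge_with_early_rejection(factor_stream, fixed_old_ratio):
    """fixed_old_ratio collects the factors that do not change during the scan;
    factor_stream yields the successive per-fragment factors (each <= 1)."""
    R_thr = random.random()            # threshold drawn up front
    R_cur = fixed_old_ratio
    for f in factor_stream:
        R_cur *= f                     # running accept ratio only decreases
        if R_cur <= R_thr:
            return False, R_cur        # rejection already certain: stop early
    return R_cur > R_thr, R_cur        # same decision as accepting w.p. min(1, R)

accepted, R = merge_with_early_rejection(iter([0.9, 0.7, 0.4, 0.2]), 1.0)
print(accepted, R)
```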