A sticky HDP-HMM with application to speaker diarization
Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, Alan S. Willsky
The Annals of Applied Statistics
Institute of Mathematical Statistics, 2011
Duke University, Brown University, University of California, Berkeley, and Massachusetts Institute of Technology
We consider the problem of speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. (2006) 1566–1581]. Although the basic HDP-HMM tends to over-segment the audio data, creating redundant states and rapidly switching among them, we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
1. Introduction.
A recurring problem in many areas of information technology is that of segmenting a waveform into a set of time intervals that have a useful interpretation in some underlying domain. In this article we focus on a particular instance of this problem, namely, the problem of speaker diarization.
Received April 2010; revised August 2010. Supported in part by MURIs funded through AFOSR Grant FA9550-06-1-0324 and ARO Grant W911NF-06-1-0076, by AFOSR under Grant FA9559-08-1-0180 and by DARPA IPTO Contract FA8750-05-2-0249.
Key words and phrases. Bayesian nonparametrics, hierarchical Dirichlet processes, hidden Markov models, speaker diarization.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2011, Vol. 5, No. 2A, 1020–1056. This reprint differs from the original in pagination and typographic detail.
In speaker diarization, an audio recording is made of a meeting involving multiple human participants and the problem is to segment the recording into time intervals associated with individual speakers [Wooters and Huijbregts (2007)]. This segmentation is to be carried out without a priori knowledge of the number of speakers involved in the meeting; moreover, we do not assume that we have a priori knowledge of the speech patterns of particular individuals.

Our approach to the speaker diarization problem is built on the framework of hidden Markov models (HMMs), which have been a major success story not only in speech technology but also in many other fields involving complex sequential data, including genomics, structural biology, machine translation, cryptanalysis and finance. An alternative to HMMs in the speaker diarization setting would be to treat the problem as a changepoint detection problem, but a key aspect of speaker diarization is that speech data from a single individual generally recurs in multiple disjoint intervals. This suggests a Markovian framework in which the model transitions among states that are associated with the different speakers.

An apparent disadvantage of the HMM framework, however, is that classical treatments of the HMM generally require the number of states to be fixed a priori. While standard parametric model selection methods can be adapted to the HMM, there is little understanding of the strengths and weaknesses of such methods in this setting, and practical applications of HMMs generally fix the number of states using ad hoc approaches. It is not clear how to adapt HMMs to the diarization problem, where the number of speakers is unknown.

Building on the work of Beal, Ghahramani and Rasmussen (2002), Teh et al. (2006) presented a Bayesian nonparametric version of the HMM in which a stochastic process, the hierarchical Dirichlet process (HDP), defines a prior distribution on transition matrices over countably infinite state spaces.
The resulting HDP-HMM is amenable to full Bayesian posterior inference over the number of states in the model. Moreover, this posterior distribution can be integrated over when making predictions, effectively averaging over models of varying complexity. The HDP-HMM has shown promise in a variety of applied problems, including visual scene recognition [Kivinen, Sudderth and Jordan (2007)], music synthesis [Hoffman, Cook and Blei (2008)], and the modeling of genetic recombination [Xing and Sohn (2007)] and gene expression [Beal and Krishnamurthy (2006)].

While the HDP-HMM seems like a natural fit to the speaker diarization problem given its structural flexibility, as we show in Section 8, the HDP-HMM does not yield state-of-the-art performance in the speaker diarization setting. The problem is that the HDP-HMM inadequately models the temporal persistence of states. This problem arises in classical finite HMMs as well, where semi-Markovian models are often proposed as solutions. However, the problem is exacerbated in the nonparametric setting, in which the Bayesian
bias toward simpler models is insufficient to prevent the HDP-HMM from giving high posterior probability to models with unrealistically rapid switching. This is demonstrated in Figure 1, where we see that the HDP-HMM sampling algorithm creates redundant states and rapidly switches among them. (The figure also displays results from the augmented HDP-HMM, the "sticky HDP-HMM" that we describe in this paper.) The tendency to create redundant states is not necessarily a problem in settings in which model averaging is the goal. For speaker diarization, however, it is critical to infer the number of speakers as well as the transitions among speakers. Thus, one of our major goals in this paper is to provide a general solution to the problem of state persistence in HDP-HMMs. Our approach is easily stated: we simply augment the HDP-HMM to include a parameter for self-transition bias, and place a separate prior on this parameter. The challenge is to execute this idea coherently in a Bayesian nonparametric framework. Earlier papers have also proposed self-transition parameters for HMMs with infinite state spaces [Beal, Ghahramani and Rasmussen (2002); Xing and Sohn (2007)], but did not formulate general solutions that integrate fully with Bayesian nonparametric inference.

Fig. 1. (a) Multinomial observation sequence; (b) true state sequence; (c) and (d) estimated state sequence after 30,000 Gibbs iterations for the original and sticky HDP-HMM, respectively, with errors indicated in red. Without an extra self-transition bias, the HDP-HMM rapidly transitions among redundant states.
Another goal of the current paper is to develop a more fully nonparametric version of the HDP-HMM in which not only the transition distribution but also the emission distribution (the conditional distribution of observations given states) is treated nonparametrically. This is again motivated by the speaker diarization problem: in classical applications of HMMs to speech recognition problems, it is often the case that emission distributions are found to be multimodal, and high-performance HMMs generally use finite Gaussian mixtures as emission distributions [Gales and Young (2007)]. In the nonparametric setting it is natural to replace these finite mixtures with Dirichlet process mixtures. Unfortunately, this idea is not viable in practice, because of the tendency of the HDP-HMM to rapidly switch between redundant states. As we show, however, by incorporating an additional self-transition bias, it is possible to make use of Dirichlet process mixtures for the emission distributions.

An important reason for the popularity of the classical HMM is its computational tractability. In particular, marginal probabilities and samples can be obtained from the HMM via an efficient dynamic programming algorithm known as the forward–backward algorithm [Rabiner (1989)]. We show that this algorithm also plays an important role in computationally efficient inference for our generalized HDP-HMM. Using a truncated approximation to the full Bayesian nonparametric model, we develop a blocked Gibbs sampler which leverages forward–backward recursions to jointly resample the state and emission assignments for all observations.

The paper is organized as follows. In Section 2 we begin by summarizing related prior work on the speaker diarization task and analyzing the key characteristics of the data set we examine in Section 8. In Section 3 we provide some basic background on Dirichlet processes. Then, in Section 4 we overview the hierarchical Dirichlet process, and in Section 5 discuss how it applies to HMMs and can be extended to account for state persistence. An efficient Gibbs sampler is also described in this section. In Section 7 we treat the case of nonparametric emission distributions. We discuss our application to speaker diarization in Section 8. A list of notational conventions can be found in the Supplementary Material [Fox et al. (2010)].
2. The speaker diarization task.
There is a vast literature on the speaker diarization task, and in this section we simply aim to provide an overview of the most common techniques. We refer the interested reader to Tranter and Reynolds (2006) for a more thorough exposition on the subject.

Classical speaker diarization techniques typically employ a two-stage procedure that first segments the audio (or features thereof) using one of a variety of changepoint algorithms. The inferred segments are then regrouped into a set of speaker labels via a clustering algorithm. For example, Reynolds and Torres-Carrasquillo (2004) propose a changepoint detection method based
on the Bayesian Information Criterion (BIC). Specifically, a penalized likelihood ratio test is used to compare whether the data within a fixed window are better modeled via a single Gaussian or two Gaussians. The window gradually grows at each test until a changepoint is inferred, at which point the window is reinitialized at the inferred changepoint. An alternative changepoint detection technique, first proposed in Siegler et al. (1997), uses fixed length windows and computes the symmetric Kullback–Leibler (KL) divergence between a pair of Gaussians each fit by the data in their respective windows. A post-processing step then sets the changepoints equal to the peaks of the computed KL that exceed a predetermined threshold. In order to group the inferred segments into a set of speaker labels, a common approach is to use hierarchical agglomerative clustering with a BIC stopping criterion, as proposed in Chen and Gopalakrishnam (1998).

The simple two-stage approach outlined above suffers from the fact that errors made in the segmentation stage can degrade the performance of the subsequent clustering stage. A number of algorithms instead iterate between multiple stages of resegmentation (typically via Viterbi decoding) and clustering; for example, see Barras et al. (2004); Wooters et al. (2004). Iterative segmentation and clustering algorithms employing a Gaussian mixture model for each cluster (i.e., speaker), such as those proposed by Gauvain, Lamel and Adda (1998); Barras et al. (2004), have been shown to improve diarization performance. Overall, however, agglomerative clustering is extremely sensitive to the specified threshold for cluster merging, with different settings leading to either over- or under-clustering of the segments into speakers. The thresholds are typically set based on testing on an extensive training database.

A number of more recent approaches have considered the problem of joint segmentation and clustering by employing HMMs to capture the repeated returns of speakers. To handle the fact that the state space is unknown, Meignier et al. (2000) introduces the use of an evolutive-HMM which is further developed in Meignier, Bonastre and Igounet (2001). The HMM is initialized to have one state and at each iteration a segment of speech is assumed to arise from an undetected speaker who is added to the model. The revised HMM is then used to resegment the audio, and this iterative procedure continues until the speaker labels have converged. An alternative HMM formulation is presented in Wooters and Huijbregts (2007). The data are initially split into K states, with K assumed to be larger than the number of true speakers, and the HMM states are iteratively merged according to a metric based on changes in BIC. At each iteration, Viterbi decoding is performed to resegment the features of the audio, and the inferred segments are used to fit a new HMM via expectation maximization (EM). Then, the BIC criterion is applied to decide whether to merge HMM states. The algorithm also includes HMM substates to impose minimum speaker durations.
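As a concrete illustration of the window-based changepoint heuristic described above, the following sketch (our own illustrative Python, not the cited authors' code; window length and threshold choices are arbitrary) fits a Gaussian to the features in each of two adjacent fixed-length windows and computes the symmetric KL divergence between them; peaks of this statistic that exceed a threshold would be declared changepoints.

```python
import numpy as np

def gaussian_fit(window):
    """Fit a full-covariance Gaussian to the rows of `window` (frames x dims)."""
    mu = window.mean(axis=0)
    sigma = np.cov(window, rowvar=False) + 1e-6 * np.eye(window.shape[1])
    return mu, sigma

def symmetric_kl(mu1, s1, mu2, s2):
    """Symmetric KL divergence between two multivariate Gaussians."""
    def kl(mua, sa, mub, sb):
        d = mua.shape[0]
        sb_inv = np.linalg.inv(sb)
        diff = mub - mua
        return 0.5 * (np.trace(sb_inv @ sa) + diff @ sb_inv @ diff - d
                      + np.log(np.linalg.det(sb) / np.linalg.det(sa)))
    return kl(mu1, s1, mu2, s2) + kl(mu2, s2, mu1, s1)

def changepoint_statistic(features, win=100):
    """Symmetric KL between Gaussians fit to adjacent windows, at each candidate time."""
    stats = np.full(len(features), np.nan)
    for t in range(win, len(features) - win):
        mu1, s1 = gaussian_fit(features[t - win:t])
        mu2, s2 = gaussian_fit(features[t:t + win])
        stats[t] = symmetric_kl(mu1, s1, mu2, s2)
    return stats  # threshold the peaks of this curve to obtain changepoints
```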
Our approach also seeks to jointly segment and cluster the audio into speaker-homogeneous regions, as targeted by the HMM approaches of Meignier, Bonastre and Igounet (2001); Wooters and Huijbregts (2007), but within a Bayesian nonparametric framework that avoids relying on the heuristics employed by these previously proposed algorithms and allows for coherent Bayesian inference.

The data set we consider in the experiments of Section 8 is a standard benchmark data set distributed by NIST as part of the Rich Transcription 2004–2007 meeting recognition evaluations [NIST (2007)]. The data set consists of 21 recorded meetings, each of which may have different sets of speakers both in number and identity. We use the first 19 Mel Frequency Cepstral Coefficients (MFCCs), computed over a 30 ms window every 10 ms, as a feature vector. After these features are computed, a speech/nonspeech detector is run to identify and remove observations corresponding to nonspeech. (Nonspeech refers to time intervals in which nobody is speaking.) The preprocessing step of removing nonspeech observations is important in ensuring that the fitted acoustic models are not corrupted by nonspeech information. When working with this data set, we discovered that the high frequency content of these features contained little discriminative information. Since minimum speaker durations are rarely less than 500 ms, we chose to define the observations as averages over 250 ms, nonoverlapping blocks. This preprocessing stage also aids in achieving speaker dynamics at the correct granularity (as opposed to finer temporal scale features leading to inferring within-speaker dynamics in addition to global speaker changes). In Figure 2 we plot a histogram of the speaker durations of our preprocessed features based on the ground truth labels provided for each of the 21 meetings. From this plot, we see that a geometric duration distribution fits this data reasonably well. This motivates our approach of simply increasing the prior probability of self-transitions within a Markov framework rather than moving to the more complicated semi-Markov formulation of speaker transitions.

Another key feature of the speaker diarization data is the fact that the speaker-specific emissions are not well approximated by a single Gaussian; see Figure 3. This observation has led many researchers to consider a mixture-of-Gaussians speaker model, as previously described. As demonstrated in Section 8, achieving state-of-the-art performance within our framework also relies on allowing for non-Gaussian emissions.

Mel-frequency cepstral coefficients (MFCCs) comprise a representation of the short-term power spectrum of a sound on the mel scale (a nonlinear scale of frequency based on the human auditory system response). Specifically, the computation of an MFCC typically involves (i) taking the Fourier transform of a windowed excerpt of a signal, (ii) mapping the log powers of the obtained spectrum onto the mel scale and (iii) performing a discrete cosine transform of the mel log powers. The MFCCs are the amplitudes of the resulting spectrum.
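As an illustrative sketch of the feature pipeline just described (our own example using the librosa library, not the authors' original tooling; the file name and sampling rate are assumptions), the following Python extracts 19 MFCCs over 30 ms windows with a 10 ms step and then averages them over nonoverlapping 250 ms blocks:

```python
import numpy as np
import librosa

def diarization_features(wav_path, n_mfcc=19, win_ms=30, hop_ms=10, block_ms=250):
    """MFCC features averaged over nonoverlapping blocks, as described in Section 2."""
    y, sr = librosa.load(wav_path, sr=16000)                        # mono audio, assumed 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(sr * win_ms / 1000),      # 30 ms analysis window
                                hop_length=int(sr * hop_ms / 1000)) # 10 ms step
    frames_per_block = block_ms // hop_ms                           # 25 frames per 250 ms block
    n_blocks = mfcc.shape[1] // frames_per_block
    mfcc = mfcc[:, :n_blocks * frames_per_block]
    # Average each coefficient over nonoverlapping 250 ms blocks.
    return mfcc.reshape(n_mfcc, n_blocks, frames_per_block).mean(axis=2).T  # (n_blocks, 19)

# Usage (hypothetical file name):
# feats = diarization_features("meeting.wav")
```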
Fig. 2. Normalized histogram of speaker durations of the preprocessed audio features from the 21 meetings in the NIST database. A geometric density is also shown for comparison.
3. Dirichlet processes.
A Dirichlet process (DP) is a distribution on probability measures on a measurable space Θ. This stochastic process is uniquely defined by a base measure H on Θ and a concentration parameter γ; we denote it by DP(γ, H). Consider a random probability measure G ∼ DP(γ, H). The DP is formally defined by the property that, for any finite partition {A_1, ..., A_K} of Θ,

\[
(G(A_1), \ldots, G(A_K)) \mid \gamma, H \sim \mathrm{Dir}(\gamma H(A_1), \ldots, \gamma H(A_K)). \tag{3.1}
\]

That is, the measure of a random probability distribution G ∼ DP(γ, H) on every finite partition of Θ follows a finite-dimensional Dirichlet distribution [Ferguson (1973)]. A more constructive definition of the DP was given by Sethuraman (1994).
Fig. 3. Contour plots of the best fit Gaussian (top) and kernel density estimate (bottom) for the top two principal components of the audio features associated with each of the four speakers present in the AMI 20041210-1052 meeting. Without capturing the non-Gaussianity of the speaker-specific emissions, the speakers are challenging to identify.
Consider a probability mass function (p.m.f.) {β_k}_{k=1}^∞ on a countably infinite set, where the discrete probabilities are defined as follows:

\[
v_k \mid \gamma \sim \mathrm{Beta}(1, \gamma), \qquad k = 1, 2, \ldots, \tag{3.2}
\]
\[
\beta_k = v_k \prod_{\ell=1}^{k-1} (1 - v_\ell), \qquad k = 1, 2, \ldots.
\]

In effect, we have divided a unit-length stick into lengths given by the weights β_k: the kth weight is a random proportion v_k of the remaining stick after the previous (k − 1) weights have been defined.
This stick-breaking construction is generally denoted by β ∼ GEM(γ). With probability one, a random draw G ∼ DP(γ, H) can be expressed as

\[
G = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}, \qquad \theta_k \mid H \sim H, \quad k = 1, 2, \ldots, \tag{3.3}
\]

where δ_θ denotes a unit-mass measure concentrated at θ and where {θ_k} are drawn independently from H. From this definition, we see that the DP actually defines a distribution over discrete probability measures. The stick-breaking construction also gives us insight into how the concentration parameter γ controls the relative magnitude of the mixture weights β_k, and thus determines the model complexity in terms of the expected number of components with significant probability mass.

The DP has a number of properties which make inference based on this nonparametric prior computationally tractable. Consider a set of observations {θ'_i} with θ'_i ∼ G. Because probability measures drawn from a DP are discrete, there is a strictly positive probability of multiple observations θ'_i taking identical values within the set {θ_k}, with θ_k defined as in equation (3.3). For each value θ'_i, let z_i be an indicator random variable that picks out the unique value k such that θ'_i = θ_{z_i}. Blackwell and MacQueen (1973) introduced a Pólya urn representation of the θ'_i:

\[
\theta'_i \mid \theta'_1, \ldots, \theta'_{i-1} \sim \frac{\gamma}{\gamma + i - 1} H + \sum_{j=1}^{i-1} \frac{1}{\gamma + i - 1}\, \delta_{\theta'_j}
= \frac{\gamma}{\gamma + i - 1} H + \sum_{k=1}^{K} \frac{n_k}{\gamma + i - 1}\, \delta_{\theta_k}, \tag{3.4}
\]

implying the following predictive distribution for the indicator random variables:

\[
p(z_{N+1} = z \mid z_1, \ldots, z_N, \gamma) = \frac{\gamma}{N + \gamma}\, \delta(z, K + 1) + \frac{1}{N + \gamma} \sum_{k=1}^{K} n_k\, \delta(z, k). \tag{3.5}
\]

Here, n_k = Σ_{i=1}^N δ(z_i, k) is the number of indicator random variables taking the value k, and K + 1 is a previously unseen value.
Dirichlet process (left) and hierarchical Dirichlet process (right) mixture mod-els represented in two different ways as graphical models. (a)
Indicator variable repre-sentation in which β | γ ∼ GEM( γ ) , θ k | H, λ ∼ H ( λ ) , z i | β ∼ β and y i |{ θ k } ∞ k =1 , z i ∼ F ( θ z i ) . (b) Alternative representation with G | γ, H ∼ DP( γ, H ) , θ ′ i | G ∼ G , and y i | θ ′ i ∼ F ( θ ′ i ) . (c) Indicator variable representation in which β | γ ∼ GEM( γ ) , π k | α, β ∼ DP( α, β ) , θ k | H, λ ∼ H ( λ ) , z ji | π j ∼ π j , and y ji |{ θ k } ∞ k =1 , z ji ∼ F ( θ z ji ) . (d) Alternative representa-tion with G | γ, H ∼ DP( γ, H ) , G j | G ∼ DP( α, G ) , θ ′ ji | G j ∼ G j and y ji | θ ′ ji ∼ F ( θ ′ ji ) . The“plate” notation is used to compactly represent replication [Teh et al. (2006)]. tion δ ( z, k ) to indicate the discrete Kronecker delta. This representation canbe used to sample observations from a DP without explicitly constructingthe countably infinite random probability measure G ∼ DP( γ, H ).The distribution on partitions induced by the sequence of conditional dis-tributions in equation (3.5) is commonly referred to as the
Chinese restaurant process. The analogy, which is useful in developing various generalizations of the Dirichlet process we consider in this paper, is as follows. Take i to be a customer entering a restaurant with infinitely many tables, each serving a unique dish θ_k. Each arriving customer chooses a table, indicated by z_i, in proportion to how many customers are currently sitting at that table. With some positive probability proportional to γ, the customer starts a new, previously unoccupied table K + 1. The Chinese restaurant process captures the fact that the DP has a clustering property such that multiple draws from the random measure take the same value.

The DP is commonly used as a prior on the parameters of a mixture model with a random number of components. Such a model is called a Dirichlet process mixture model and is depicted as a graphical model in Figure 4(a) and (b). To generate observations, we choose θ'_i ∼ G and y_i ∼ F(θ'_i) for an indexed family of distributions F(·). This sampling process is also often described in terms of the indicator random variables z_i; in particular, we have z_i ∼ β and y_i ∼ F(θ_{z_i}). The parameter with which an observation is associated implicitly partitions or clusters the data. In addition, the Chinese restaurant process representation indicates that the DP provides a prior that makes it more likely to associate an observation with a parameter to which other observations have already been associated. This reinforcement property is essential for inferring finite, compact mixture models. It can be shown under mild conditions that if the data were generated by a finite mixture, then the DP posterior is guaranteed to converge (in distribution) to that finite set of mixture parameters [Ishwaran and Zarepour (2002b)].
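To make the stick-breaking and Chinese restaurant process views concrete, here is a small illustrative Python sketch (our own, with an arbitrary truncation for the stick-breaking weights) that draws an approximate GEM(γ) weight vector and then samples cluster assignments from the CRP predictive rule of equation (3.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def gem_weights(gamma, trunc=100):
    """Stick-breaking weights beta ~ GEM(gamma), truncated at `trunc` components."""
    v = rng.beta(1.0, gamma, size=trunc)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining  # beta_k = v_k * prod_{l<k} (1 - v_l)

def crp_assignments(gamma, n):
    """Sample n cluster indicators from the CRP predictive rule (equation (3.5))."""
    z = np.zeros(n, dtype=int)
    counts = []  # n_k for each instantiated cluster
    for i in range(n):
        probs = np.array(counts + [gamma], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)       # customer starts a new table
        else:
            counts[k] += 1         # customer joins an existing table
        z[i] = k
    return z

# Example: with gamma = 1, a handful of clusters dominate even after 500 draws.
print(len(np.unique(crp_assignments(1.0, 500))))
```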
4. Hierarchical Dirichlet processes.
In the following section we describe how ideas based on the Dirichlet process have been used to develop a Bayesian nonparametric approach to hidden Markov modeling in which the number of states is unknown a priori. To develop this nonparametric version of the HMM, the Dirichlet process does not suffice; rather, it is necessary to develop a hierarchical Bayesian model involving a tied collection of Dirichlet processes. This has been done by Teh et al. (2006), whose hierarchical Dirichlet process (HDP) we describe in this section. The HDP is applicable to general problems involving related groups of data, each of which can be modeled using a DP, and we begin by describing the HDP at this level of generality, subsequently specializing to the HMM.

To describe the HDP, suppose there are J groups of data and let {y_{j1}, ..., y_{jN_j}} denote the set of observations in group j. Assume that there is a collection of DP mixture models underlying the observations in these groups:

\[
G_j = \sum_{t=1}^{\infty} \tilde{\pi}_{jt}\, \delta_{\theta^*_{jt}}, \qquad \tilde{\pi}_j \mid \alpha \sim \mathrm{GEM}(\alpha), \quad j = 1, \ldots, J, \tag{4.1}
\]
\[
\theta^*_{jt} \mid G_0 \sim G_0, \quad t = 1, 2, \ldots,
\]
\[
\theta'_{ji} \mid G_j \sim G_j, \qquad y_{ji} \mid \theta'_{ji} \sim F(\theta'_{ji}), \quad j = 1, \ldots, J,\ i = 1, \ldots, N_j.
\]

We wish to tie the DP mixtures across the different groups such that atoms that underly the data in group j can be used in group j'. The problem is that if G_0 is absolutely continuous with respect to the Lebesgue measure (as it generally is for continuous parameters), then the atoms in G_j will be distinct from those in G_{j'} with probability one. The solution to this problem is to let G_0 itself be a draw from a DP:

\[
G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta_{\theta_k}, \qquad \beta \mid \gamma \sim \mathrm{GEM}(\gamma), \tag{4.2}
\]
\[
\theta_k \mid H, \lambda \sim H(\lambda), \quad k = 1, 2, \ldots.
\]

In this hierarchical model, G_0 is atomic and random. Letting G_0 be a base measure for the draw G_j ∼ DP(α, G_0) implies that only these atoms can appear in G_j. Thus, atoms can be shared among the collection of random measures {G_j}. The HDP model is depicted graphically in two different ways in Figure 4(c) and (d).

Teh et al. (2006) have also described the marginal probabilities obtained from integrating over the random measures G_0 and {G_j}. They show that these marginals can be described in terms of a Chinese restaurant franchise
(CRF) that is an analog of the Chinese restaurant process. The CRF is comprised of J restaurants, each corresponding to an HDP group, and an infinite buffet line of dishes common to all restaurants. The process of seating customers at tables, however, is restaurant specific. Each customer is preassigned to a given restaurant determined by that customer's group j. Upon entering the jth restaurant in the CRF, customer y_{ji} sits at currently occupied tables t_{ji} with probability proportional to the number of currently seated customers, or starts a new table T_j + 1 with probability proportional to α. The first customer to sit at a table goes to the buffet line and picks a dish k_{jt} for their table, choosing the dish with probability proportional to the number of times that dish has been picked previously, or ordering a new dish θ_{K+1} with probability proportional to γ. The intuition behind this predictive distribution is that integrating over the global dish probabilities β results in customers making decisions based on the observed popularity of the dishes throughout the entire franchise. See the Supplementary Material for further details [Fox et al. (2010)].

Recalling equations (4.1) and (4.2), since each distribution G_j is drawn from a DP with a discrete base measure G_0, multiple θ*_{jt} may take an identical value θ_k for multiple unique values of t. As we see in the Supplemental Material [Fox et al. (2010)], this corresponds to multiple tables in the same restaurant being served the same dish. We can write G_j as a function of the unique dishes:

\[
G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\theta_k}, \qquad \pi_j \mid \alpha, \beta \sim \mathrm{DP}(\alpha, \beta), \qquad \theta_k \mid H \sim H, \tag{4.3}
\]

where π_j now defines a restaurant-specific distribution over dishes served rather than over tables, with

\[
\pi_{jk} = \sum_{t :\, \theta^*_{jt} = \theta_k} \tilde{\pi}_{jt}. \tag{4.4}
\]

Let z_{ji} be the indicator random variable for the unique dish selected by observation y_{ji}. An equivalent representation for the generative model is in terms of these indicator random variables:

\[
\pi_j \mid \alpha, \beta \sim \mathrm{DP}(\alpha, \beta), \qquad z_{ji} \mid \pi_j \sim \pi_j, \qquad y_{ji} \mid \{\theta_k\}, z_{ji} \sim F(\theta_{z_{ji}}), \tag{4.5}
\]

and is shown in Figure 4(c).
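The following illustrative sketch (our own Python, using a finite weak-limit truncation of the stick-breaking construction rather than the exact infinite model; the hyperparameter values are arbitrary) shows how the HDP ties group-specific mixing weights: a global weight vector β is drawn once, and each group's weights π_j ∼ DP(α, β) concentrate around β, so all groups reuse the same shared atoms:

```python
import numpy as np

rng = np.random.default_rng(1)

def gem_weights(gamma, trunc):
    """Truncated stick-breaking approximation to GEM(gamma)."""
    v = rng.beta(1.0, gamma, size=trunc)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

def hdp_group_weights(gamma, alpha, n_groups, trunc=50):
    """Global weights beta and group weights pi_j ~ DP(alpha, beta), weak-limit version."""
    beta = gem_weights(gamma, trunc)
    beta /= beta.sum()                       # renormalize the truncated weights
    # In the weak-limit approximation, DP(alpha, beta) becomes Dir(alpha * beta).
    pi = rng.dirichlet(alpha * beta, size=n_groups)
    return beta, pi

beta, pi = hdp_group_weights(gamma=5.0, alpha=10.0, n_groups=3)
# Every group places most of its mass on the same handful of shared atoms:
print(np.argsort(beta)[::-1][:5], [np.argsort(p)[::-1][:5] for p in pi])
```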
5. The sticky HDP-HMM.
Recall that the hidden Markov model, or HMM, is a class of doubly stochastic processes based on an underlying, discrete-valued state sequence, which is modeled as Markovian [Rabiner (1989)]. Let z_t denote the state of the Markov chain at time t and π_j the state-specific transition distribution for state j. Then, the Markovian structure on the state sequence dictates that z_t ∼ π_{z_{t-1}}. The observations, y_t, are conditionally independent given this state sequence, with y_t ∼ F(θ_{z_t}) for some fixed distribution F(·).

The HDP can be used to develop an HMM with an infinite state space: the HDP-HMM [Teh et al. (2006)]. In the speaker diarization task, each state constitutes a different speaker and our goal in moving to an infinite state space is to remove upper bounds on the total number of speakers present. Conceptually, we envision a doubly-infinite transition matrix, with each row corresponding to a Chinese restaurant. That is, the groups in the HDP formalism here correspond to states, and each Chinese restaurant defines a distribution on next states. The CRF links these next-state distributions. Thus, in this application of the HDP, the group-specific distribution, π_j, is a state-specific transition distribution and, due to the infinite state space, there are infinitely many such groups. Since z_t ∼ π_{z_{t-1}}, we see that z_{t-1} indexes the group to which y_t is assigned (i.e., all observations with z_{t-1} = j are assigned to group j). Just as with the HMM, the current state z_t then indexes the parameter θ_{z_t} used to generate observation y_t [see Figure 5(a)].

By defining π_j ∼ DP(α, β), the HDP prior encourages states to have similar transition distributions (E[π_{jk} | β] = β_k). However, it does not differentiate self-transitions from moves between different states. When modeling data with state persistence, the flexible nature of the HDP-HMM prior allows for state sequences with unrealistically fast dynamics to have large posterior probability. For example, with multinomial emissions, a good explanation of the data is to divide different observation values into unique states and then rapidly switch between them (see Figure 1). In such cases, many models with redundant states may have large posterior probability, thus impeding our ability to identify a compact dynamical model which best explains the observations. The problem is compounded by the fact that once this alternating pattern has been instantiated by the sampler, its persistence is then reinforced by the properties of the Chinese restaurant franchise, thus slowing mixing rates. Furthermore, this fragmentation of data into redundant states can reduce predictive performance, as is discussed in Section 6. In many applications, one would like to be able to incorporate prior knowledge that slow, smoothly varying dynamics are more likely.

To address these issues, we propose to instead model the transition distributions π_j as follows:

\[
\beta \mid \gamma \sim \mathrm{GEM}(\gamma), \tag{5.1}
\]
\[
\pi_j \mid \alpha, \kappa, \beta \sim \mathrm{DP}\!\left(\alpha + \kappa,\ \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right).
\]

Here, (αβ + κδ_j) indicates that an amount κ > 0 is added to the jth component of αβ. Informally, what we are doing is increasing the expected probability of self-transition by an amount proportional to κ:
\[
E[\pi_{jk} \mid \beta, \kappa] = \frac{\alpha\beta_k + \kappa\, \delta(j, k)}{\alpha + \kappa}. \tag{5.2}
\]

More formally, over a finite partition (Z_1, ..., Z_K) of the positive integers Z_+, the prior on the measure π_j adds an amount κ only to the arbitrarily small partition containing j, corresponding to a self-transition. That is,

\[
(\pi_j(Z_1), \ldots, \pi_j(Z_K)) \mid \alpha, \kappa, \beta \sim \mathrm{Dir}(\alpha\beta(Z_1) + \kappa\,\delta_j(Z_1), \ldots, \alpha\beta(Z_K) + \kappa\,\delta_j(Z_K)). \tag{5.3}
\]

When κ = 0 the original HDP-HMM of Teh et al. (2006) is recovered. Because positive κ values increase the prior probability E[π_{jj} | β] of self-transitions, we refer to this extension as the sticky HDP-HMM. See Figure 5(a).

Fig. 5. (a) Graphical representation of the sticky HDP-HMM. The state evolves as z_{t+1} | {π_k}_{k=1}^∞, z_t ∼ π_{z_t}, where π_k | α, κ, β ∼ DP(α + κ, (αβ + κδ_k)/(α + κ)) and β | γ ∼ GEM(γ), and observations are generated as y_t | {θ_k}_{k=1}^∞, z_t ∼ F(θ_{z_t}). The original HDP-HMM has κ = 0. (b) Sticky HDP-HMM with DP emissions, where s_t indexes the state-specific mixture component generating observation y_t. The DP prior dictates that s_t | {ψ_k}_{k=1}^∞, z_t ∼ ψ_{z_t} for ψ_k | σ ∼ GEM(σ). The jth Gaussian component of the kth mixture density is parameterized by θ_{k,j} so y_t | {θ_{k,j}}_{k,j=1}^∞, z_t, s_t ∼ F(θ_{z_t, s_t}).

Note that this formulation assumes that the stickiness of each HMM state is the same a priori. The parameter could be made state-dependent through a hierarchical model that ties together a collection of state-specific sticky parameters. However, such state-specific stickiness is unnecessary for the speaker diarization task at hand since each speaker is assumed to have similar expected durations. Differences between speaker-specific transitions become more distinguished in the posterior.
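To see the effect of the sticky bias numerically, the following sketch (our own illustrative Python, using the weak-limit Dirichlet approximation introduced later in Section 5.2, with arbitrary hyperparameter values) draws transition rows π_j ∼ Dir(αβ + κ e_j) with and without the self-transition bias and compares the resulting self-transition mass; cf. equation (5.2):

```python
import numpy as np

rng = np.random.default_rng(2)
L, alpha, kappa, gamma = 10, 5.0, 20.0, 2.0

# Weak-limit approximation: beta ~ Dir(gamma/L, ..., gamma/L),
# pi_j ~ Dir(alpha * beta + kappa * e_j), where e_j is the jth standard basis vector.
beta = rng.dirichlet(np.full(L, gamma / L))

def transition_rows(kappa):
    return np.stack([rng.dirichlet(alpha * beta + kappa * np.eye(L)[j]) for j in range(L)])

pi_plain = transition_rows(0.0)      # original HDP-HMM prior (kappa = 0)
pi_sticky = transition_rows(kappa)   # sticky prior

# Average self-transition mass; its prior mean is (alpha * beta_j + kappa) / (alpha + kappa).
print(np.mean(np.diag(pi_plain)), np.mean(np.diag(pi_sticky)))
```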
The κ parameter is reminiscent of the self-transition bias parameter of the infinite HMM, an urn model for hidden Markov models on infinite state spaces that predated the HDP-HMM [Beal, Ghahramani and Rasmussen (2002)]. The connection between the (sticky) HDP-HMM and the infinite HMM is analogous to that between the DP and the Pólya urn; in both cases the latter is obtained by integrating out the random measures in the former. In particular, the infinite HMM employs a two-level urn model in which the top-level urn places a probability on transitions to existing states in proportion to how many times these transitions have been seen, with an added bias for a self-transition even if this has not previously occurred. With some remaining probability, an oracle is called, representing the second-level urn. This oracle chooses an existing state in proportion to how many times the oracle previously chose that state, regardless of the state transition involved, or chooses a previously unvisited state. The original HDP-HMM provides an interpretation of this urn model in terms of an underlying collection of linked random probability measures, however, without the self-transition parameter. In addition to the conceptual clarity provided by the random measure formalism, the HDP-HMM has the practical advantage that it makes it possible to use standard MCMC algorithms for posterior inference; working within the urn model formulation, Beal, Ghahramani and Rasmussen (2002) needed to resort to a heuristic approximation to a Gibbs sampler. The sticky HDP-HMM, an early version of which was presented in Fox et al. (2008), restores the self-transition parameter of the infinite HMM to this class of models, doing so in a way that integrates with a full Bayesian nonparametric specification.

As with the DP, this specification in terms of random measures yields various interesting characterizations of marginal probabilities. In particular, as described in the Supplemental Material [Fox et al. (2010)], the partitioning structure induced by the sticky HDP-HMM has an interpretation as an extension of the Chinese restaurant franchise (CRF) which we refer to as a CRF with loyal customers. Here, each restaurant in the franchise has a specialty dish with the same index as that of the restaurant. Although this dish is served elsewhere, it is more popular in the dish's namesake restaurant. Recall that while customers in the CRF of the HDP are pre-partitioned into restaurants based on the fixed group assignments, in the HDP-HMM the value of the state z_t determines the group assignment (and thus restaurant) of customer y_{t+1}. The increased popularity of the house specialty dish (determined by the sticky parameter κ) implies that children are more likely to eat in the same restaurant as their parent (z_t = z_{t-1} = j) and, in turn, more likely to eat the restaurant's specialty dish (z_{t+1} = j). This develops family loyalty to a given restaurant in the franchise. However, if the parent chooses a dish other than the house specialty (z_t = k, k ≠ j), the child will then go to the restaurant where this dish is the specialty and will in turn be more likely to eat this dish, too. One might say that for the sticky HDP-HMM, children have similar taste buds to their parents and will always go to the restaurant that prepares their parent's dish best. Often, this keeps many generations eating in the same restaurant.

Throughout the remainder of the paper, we use the following notational conventions. Given a random sequence {x_1, x_2, ..., x_T}, we use the shorthand x_{1:t} to denote the sequence {x_1, x_2, ..., x_t} and x_{\t} to denote the set {x_1, ..., x_{t-1}, x_{t+1}, ..., x_T}. Also, for random variables with double subindices, such as x_{a_1 a_2}, we will use x to denote the entire set of such random variables, {x_{a_1 a_2}, ∀a_1, ∀a_2}, and the shorthand notation x_{a_1 ·} = Σ_{a_2} x_{a_1 a_2}, x_{· a_2} = Σ_{a_1} x_{a_1 a_2} and x_{··} = Σ_{a_1} Σ_{a_2} x_{a_1 a_2}.
5.1. Sampling via direct assignments. In this section we present an inference algorithm for the sticky HDP-HMM of Section 5 and Figure 5(a) that is a modified version of the direct assignment Rao–Blackwellized Gibbs sampler of Teh et al. (2006). This sampler circumvents the complicated bookkeeping of the CRF by sampling indicator random variables directly. The resulting sticky HDP-HMM direct assignment Gibbs sampler is outlined in Algorithm 1 of the Supplementary Material [Fox et al. (2010)], which also contains the full derivations of this sampler.

The basic idea is that we marginalize over the infinite set of state-specific transition distributions π_k and parameters θ_k, and sequentially sample the state z_t given all other state assignments z_{\t}, the observations y_{1:T}, and the global transition distribution β. A variant of the Chinese restaurant process gives us the prior probability of an assignment of z_t to a value k based on how many times we have seen other transitions from the previous state value z_{t-1} to k and from k to the next state value z_{t+1}. As derived in the Supplementary Material [Fox et al. (2010)], this conditional distribution is dependent upon whether either or both of the transitions z_{t-1} to k and k to z_{t+1} correspond to a self-transition, most strongly when κ > 0. The prior probability of an assignment of z_t to state k is then weighted by the likelihood of the observation y_t given all other observations assigned to state k.

Given a sample of the state sequence z_{1:T}, we can represent the posterior distribution of the global transition distribution β via a set of auxiliary random variables m̄_{jk}, m_{jk} and w_{jt}, which correspond to the jth restaurant-specific set of table counts associated with the CRF with loyal customers described in the Supplemental Material [Fox et al. (2010)]. The Gibbs sampler iterates between sequential sampling of the state z_t for each individual value of t given β and z_{\t}; sampling of the auxiliary variables m̄_{jk}, m_{jk} and w_{jt} given z_{1:T} and β; and sampling of β given these auxiliary variables. The direct assignment sampler is initialized by sampling the hyperparameters and β from their respective priors and then sequentially sampling each z_t as if the associated y_t were the last observation. That is, we first sample z_1 given y_1, β, and the hyperparameters. We then sample z_2 given z_1, y_{1:2}, β, and the hyperparameters, and so on. Based on the resulting sample of z_{1:T}, we resample β and the hyperparameters. From then on, the sampler continues with the normal procedure of conditioning on z_{\t} when resampling z_t.
5.2. Blocked sampling of state sequences.
The HDP-HMM sequential, direct assignment sampler of Section 5.1 can exhibit slow mixing rates since global state sequence changes are forced to occur coordinate by coordinate. This phenomenon is explored in Scott (2002) for the finite HMM. Although the sticky HDP-HMM reduces the posterior uncertainty caused by fast state-switching explanations of the data, the self-transition bias can cause two continuous and temporally separated sets of observations of a given state to be grouped into two states. See Figure 6(b) for an example. If this occurs, the high probability of self-transition makes it challenging for the sequential sampler to group those two examples into a single state.

We thus propose using a variant of the HMM forward–backward procedure [Rabiner (1989)] to harness the Markovian structure and jointly sample the state sequence z_{1:T} given the observations y_{1:T}, transition probabilities π_k, and parameters θ_k. There are two main mechanisms for sampling in an uncollapsed HDP model (i.e., one that instantiates the parameters π_k and θ_k): one is to employ slice sampling while the other is to consider a truncated approximation to the HDP. For the HDP-HMM, a slice sampler, referred to as beam sampling, was recently developed [Van Gael et al. (2008)]. This sampler harnesses the efficiencies of the forward–backward algorithm without having to fix a truncation level for the HDP. However, as we elaborate upon in Section 6.1, this sampler suffers from slower mixing rates than the block sampler we propose, which utilizes a fixed-order truncation of the HDP-HMM. Although a fixed truncation reduces our model to a parametric Bayesian HMM, the specific hierarchical prior induced by a truncation of the fully nonparametric HDP significantly improves upon classical parametric Bayesian HMMs. Specifically, a fixed degree L truncation encourages each transition distribution to be sparse over the set of L possible HMM states, and simultaneously encourages transitions from different states to have similar sparsity structures. That is, the truncated HDP prior leads to a shared sparse subset of the L possible states. See Section 6.3 for a comparison with standard parametric modeling.

There are multiple methods of approximating the countably infinite transition distributions via truncations. One approach is to terminate the stick-breaking construction after some portion of the stick has already been broken and assign the remaining weight to a single component. This approximation is referred to as the truncated Dirichlet process. Another method is to consider the degree L weak limit approximation to the DP [Ishwaran and Zarepour (2002c)],

\[
\mathrm{GEM}_L(\alpha) \triangleq \mathrm{Dir}(\alpha/L, \ldots, \alpha/L), \tag{5.4}
\]

where L is a number that exceeds the total number of expected HMM states. Both of these approximations, which are presented in Ishwaran and Zarepour (2000a, 2002c), encourage the learning of models with fewer than L components while allowing the generation of new components, upper bounded by L, as new data are observed. We choose to use the second approximation because of its simplicity and computational efficiency. The two choices of approximations are compared in Kurihara, Welling and Teh (2007), and little to no practical differences are found. Using a weak limit approximation to the Dirichlet process prior on β (i.e., employing a finite Dirichlet prior) induces a finite Dirichlet prior on π_j:

\[
\beta \mid \gamma \sim \mathrm{Dir}(\gamma/L, \ldots, \gamma/L), \tag{5.5}
\]
\[
\pi_j \mid \alpha, \beta \sim \mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_L). \tag{5.6}
\]

As L → ∞, this model converges in distribution to the HDP mixture model [Teh et al. (2006)].

Fig. 6. (a) Observation sequence (blue) and true state sequence (red) for a three-state HMM with state persistence. (b) Example of the sticky HDP-HMM direct assignment Gibbs sampler splitting temporally separated examples of the same true state (red) into multiple estimated states (blue) at Gibbs iteration 1000. (c) Histogram of the inferred self-transition proportion parameter, ρ, for the sticky HDP-HMM blocked sampler. For the original HDP-HMM, the median (solid blue) and 10th and 90th quantiles (dashed red) of Hamming distance between the true and estimated state sequences over the first 1000 Gibbs samples from 200 chains are shown for the (d) direct assignment sampler, and (e) blocked sampler. (f) Hamming distance over 30,000 Gibbs samples from three chains of the original HDP-HMM blocked sampler. (g)–(i) Analogous plots to (d) and (f) for the sticky HDP-HMM. (k) and (l) Plots analogous to (e) and (f) for a nonsticky HDP-HMM using beam sampling. (j) A histogram of the effective beam sampler truncation level, L_eff, over the 30,000 Gibbs iterations from the three chains (blue) compared to the fixed truncation level, L = 20, used in the truncated sticky HDP-HMM blocked sampler results (red).

The Gibbs sampler using blocked resampling of z_{1:T} is derived in the Supplementary Material [Fox et al. (2010)]; an outline of the resulting algorithm is also presented (see Algorithm 3). A similar sampler has been used for inference in HDP hidden Markov trees [Kivinen, Sudderth and Jordan (2007)]. However, this work did not consider the complications introduced by multimodal emissions, which we explore in Section 7.

The blocked sampler is initialized by drawing L parameters θ_k from the base measure, β from its L-dimensional symmetric Dirichlet prior, and the L transition distributions π_k from the induced L-dimensional Dirichlet prior specified in equation (5.5). The hyperparameters are also drawn from the prior. Based on the sampled parameters and transition distributions, one can block sample z_{1:T} and proceed as in Algorithm 3 of the Supplementary Material [Fox et al. (2010)].
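As a rough illustration of the blocked resampling step, the following sketch (our own simplified Python, a backward-filtering/forward-sampling pass for a truncated HMM with fixed, instantiated parameters; it is not the authors' Algorithm 3 and it omits the emission-mixture indicators and hyperparameter updates) jointly samples the state sequence given per-state likelihoods and a transition matrix:

```python
import numpy as np

rng = np.random.default_rng(3)

def block_sample_states(log_lik, pi, pi0):
    """Jointly sample z_{1:T} given per-state log-likelihoods, transition matrix pi,
    and initial distribution pi0, via backward messages and forward sampling.

    log_lik : (T, L) array with log p(y_t | theta_k)
    pi      : (L, L) transition matrix, rows summing to one
    pi0     : (L,) initial state distribution
    """
    T, L = log_lik.shape
    lik = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))  # rescale for stability

    # Backward messages: m_t(k) is proportional to sum_j pi[k, j] * p(y_{t+1} | j) * m_{t+1}(j).
    msg = np.ones((T, L))
    for t in range(T - 2, -1, -1):
        msg[t] = pi @ (lik[t + 1] * msg[t + 1])
        msg[t] /= msg[t].sum()

    # Forward sampling: z_1 ~ p(z_1 | y_{1:T}), then z_t | z_{t-1}, y_{1:T}.
    z = np.zeros(T, dtype=int)
    p = pi0 * lik[0] * msg[0]
    z[0] = rng.choice(L, p=p / p.sum())
    for t in range(1, T):
        p = pi[z[t - 1]] * lik[t] * msg[t]
        z[t] = rng.choice(L, p=p / p.sum())
    return z
```

A full Gibbs sweep would also resample the transition distributions and emission parameters given the sampled state sequence; those conditionally conjugate updates are omitted from this sketch.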
5.3. Hyperparameters. We treat the hyperparameters in the sticky HDP-HMM as unknown quantities and perform full Bayesian inference over these quantities. This emphasizes the role of the data in determining the number of occupied states and the degree of self-transition bias. Our derivation of sampling updates for the hyperparameters of the sticky HDP-HMM is presented in the Supplementary Material [Fox et al. (2010)]; it roughly follows that of the original HDP-HMM [Teh et al. (2006)]. A key step which simplifies our inference procedure is to note that since we have the deterministic relationships

\[
\alpha = (1 - \rho)(\alpha + \kappa), \qquad \kappa = \rho(\alpha + \kappa), \tag{5.7}
\]

we can treat ρ and α + κ as our hyperparameters and sample these values instead of sampling α and κ directly.
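For concreteness, the reparameterization in equation (5.7) is simply a bijection between (α, κ) and (ρ, α + κ), where ρ is the self-transition proportion; a small helper (our own illustrative code) makes the mapping explicit:

```python
def to_natural(rho, alpha_plus_kappa):
    """Map (rho, alpha + kappa) back to (alpha, kappa); cf. equation (5.7)."""
    return (1.0 - rho) * alpha_plus_kappa, rho * alpha_plus_kappa

def to_reparam(alpha, kappa):
    """Map (alpha, kappa) to (rho, alpha + kappa)."""
    return kappa / (alpha + kappa), alpha + kappa
```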
6. Experiments with synthetic data.
In this section we explore the performance of the sticky HDP-HMM relative to the original model (i.e., the model with κ = 0) in a series of experiments with synthetic data. We judge performance according to two metrics: our ability to accurately segment the data according to the underlying state sequence, and the predictive likelihood of held-out data under the inferred model. We additionally assess the improvements in mixing rate achieved by using the blocked sampler of Section 5.2.

6.1. Gaussian emissions.
We begin our analysis of the sticky HDP-HMM performance by examining a set of simulated data generated from an HMM with Gaussian emissions. The first data set is generated from an HMM with a high probability of self-transition. Here, we aim to show that the original HDP-HMM inadequately captures state persistence. The second data set is from an HMM with a high probability of leaving the current state. In this scenario, our goal is to demonstrate that the sticky HDP-HMM is still able to capture rapid dynamics by inferring a small probability of self-transition.

For all of the experiments with simulated data, we used weakly informative hyperpriors. We placed a Gamma(1, 0.01) prior on the concentration parameters γ and (α + κ). The self-transition proportion parameter ρ was given a Beta(10, 1) prior. The parameters of the base measure were set from the data, as will be described for each scenario.
State persistence.
The data for the high persistence case were generated from a three-state HMM with a 0.98 probability of self-transition and equal probability of transitions to the other two states. The observation and true state sequences for the state persistence scenario are shown in Figure 6(a). We placed a normal inverse-Wishart prior on the space of mean and variance parameters and set the hyperparameters as follows: 0.01 pseudocounts, mean equal to the empirical mean, three degrees of freedom, and scale matrix equal to 0.75 times the empirical variance. We used this conjugate base measure so that we may directly compare the performance of the blocked and direct assignment samplers. For the blocked sampler, we used a truncation level of L = 20.

In Figure 6(d)–(h), we plot the 10th, 50th and 90th quantiles of the Hamming distance between the true and estimated state sequences over the 1000 Gibbs iterations using the direct assignment and blocked samplers on the sticky and original HDP-HMM models. To calculate the Hamming distance, we used the Munkres algorithm [Munkres (1957)] to map the randomly chosen indices of the estimated state sequence to the set of indices that maximize the overlap with the true sequence.

From these plots, we see that the burn-in rate of the blocked sampler using the sticky HDP-HMM is significantly faster than that of any other sampler-model combination. As expected, the sticky HDP-HMM with the sequential, direct assignment sampler gets stuck in state sequence assignments from which it is hard to move away, as conveyed by the flatness of the Hamming error versus iteration number plot in Figure 6(g). For example, the estimated state sequence of Figure 6(b) might have similar parameters associated with states 3, 7, 10 and 11 so that the likelihood is in essence the same as if these states were grouped, but this sequence has a large error in terms of Hamming distance and it would take many iterations to move away from this assignment. Incorporating the blocked sampler improves the Hamming distance performance relative to the sequential, direct assignment sampler for both the original and sticky HDP-HMM; however, for the original HDP-HMM the burn-in rate is still substantially slower than that of the blocked sampler on the sticky model.

As discussed earlier, a beam sampling algorithm [Van Gael et al. (2008)] has been proposed which adapts slice sampling methods [Robert (2007)] to the HDP-HMM. This approach uses a set of auxiliary slice variables, one for each observation, to effectively truncate the number of state transitions that must be considered at every Gibbs sampling iteration. Dynamic programming methods can then be used to jointly resample state assignments. The beam sampler was inspired by a related approach for DP mixture models [Walker (2007)], which is conceptually similar to retrospective sampling methods [Papaspiliopoulos and Roberts (2008)]. In comparison to our fixed-order, weak-limit truncation of the HDP-HMM, the beam sampler provides an asymptotically exact algorithm. However, the beam sampler can be slow to mix relative to our blocked sampler on the fixed, truncated model (see Figure 6 for an example comparison). The issue is that in order to consider a transition which has low prior probability, one needs a correspondingly rare slice variable sample at that time.
Thus, even if the likelihood cues are strong, to be able to consider state sequences with several low-prior-probability transitions, one needs to wait for several rare events to occur when drawing slice variables. By considering the full, exponentially large set of paths in the truncated state space, we avoid this problem. Of course, the trade-off between the computational cost of the blocked sampler on the fixed, truncated model (O(TL^2)) and the slower mixing rate of the beam sampler yields an application-dependent sampler choice.

The Hamming distance plots of Figure 6(k) and (l), when compared to those of Figure 6(e) and (f), depict the substantially slower mixing rate of the beam sampler compared to the blocked sampler (both using a nonsticky HDP-HMM). However, the theoretical computational benefit of the beam sampler can be seen in Figure 6(j). In this plot, we present a histogram of the effective truncation level, L_eff, used over the 30,000 Gibbs iterations on three chains. We computed this effective truncation level by summing over the number of state transitions considered during a full sweep of sampling z_{1:T}, dividing this number by the length of the data set, T, and taking the square root. Finally, on a more technical note, our fixed, truncated model allows for more vectorization of the code than the beam sampler. Thus, in practice, the difference in computation time between the samplers is significantly less than the O(L^2/L_eff^2) factor obtained by counting state transitions.

From this point onward, we present results only from blocked sampling since we have seen the clear advantages of this method over the sequential, direct assignment sampler.
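The label-matching step used for the Hamming distance comparisons above can be sketched as follows (our own illustrative Python; scipy's linear_sum_assignment implements the Hungarian/Munkres algorithm referenced in Section 6.1):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hamming_after_matching(z_true, z_est):
    """Relabel the estimated state sequence to best match the true one (Munkres algorithm),
    then return the normalized Hamming distance."""
    true_labels = np.unique(z_true)
    est_labels = np.unique(z_est)
    # overlap[i, j] = number of time steps where true label i co-occurs with estimated label j
    overlap = np.array([[np.sum((z_true == t) & (z_est == e)) for e in est_labels]
                        for t in true_labels])
    row, col = linear_sum_assignment(-overlap)      # maximize the total overlap
    mapping = {est_labels[c]: true_labels[r] for r, c in zip(row, col)}
    z_mapped = np.array([mapping.get(e, -1) for e in z_est])  # unmatched labels count as errors
    return np.mean(z_mapped != z_true)
```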
Fast state-switching. In order to warrant the general use of the sticky model, one would like to know that the sticky parameter incorporated in the model does not preclude learning models with fast dynamics. To this end, we explored the performance of the sticky HDP-HMM on data generated from a model with a high probability of switching between states. Specifically, we generated observations from a four-state HMM in which each state has a small probability of self-transition and a high probability of switching to one of the other states. We once again used a truncation level L = 20. Since we are restricting ourselves to the blocked Gibbs sampler, it is no longer necessary to use a conjugate base measure. Instead we placed an independent Gaussian prior on the mean parameter and an inverse-Wishart prior on the variance parameter. For the Gaussian prior, we set the mean and variance hyperparameters to be equal to the empirical mean and variance of the entire data set. The inverse-Wishart hyperparameters were set such that the expected variance is equal to 0.75 times that of the entire data set, with three degrees of freedom.

The results depicted in Figure 7 confirm that by inferring a small probability of self-transition, the sticky HDP-HMM is indeed able to capture fast HMM dynamics, and just as quickly as the original HDP-HMM (although with higher variability). Specifically, we see that the histogram of the self-transition proportion parameter ρ for this data set [see Figure 7(d)] is centered around a value close to the true probability of self-transition, which is substantially lower than the mean value of this parameter on the data with high persistence [Figure 6(c)].
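Synthetic sequences of the kind used in these experiments are straightforward to generate; the following sketch (our own Python, with a placeholder fast-switching transition matrix and placeholder Gaussian emission parameters, not the values used in the paper) simulates such an HMM:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_gaussian_hmm(pi, means, sigmas, T):
    """Simulate T steps of an HMM with transition matrix pi and 1-D Gaussian emissions."""
    L = pi.shape[0]
    z = np.zeros(T, dtype=int)
    y = np.zeros(T)
    z[0] = rng.integers(L)
    y[0] = rng.normal(means[z[0]], sigmas[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(L, p=pi[z[t - 1]])
        y[t] = rng.normal(means[z[t]], sigmas[z[t]])
    return z, y

# Placeholder fast-switching dynamics: each state rarely self-transitions.
pi = np.full((4, 4), 0.3)
np.fill_diagonal(pi, 0.1)
z, y = simulate_gaussian_hmm(pi, means=np.array([-6.0, -2.0, 2.0, 6.0]),
                             sigmas=np.ones(4), T=1000)
```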
6.2. Multinomial emissions. The difference in modeling power, rather than simply burn-in rate, between the sticky and original HDP-HMM is more pronounced when we consider multinomial emissions. This is because the multinomial observations are embedded in a discrete topological space in which there is no concept of similarity between nonidentical observation values. In contrast, Gaussian emissions have a continuous range of values in R^n with a clear notion of closeness between observations under the Lebesgue measure, aiding in grouping observations under a single HMM state's Gaussian emission distribution, even in the absence of a self-transition bias.
Fig. 7. (a) Observation sequence (blue) and true state sequence (red) for a four-state HMM with fast state switching. For the original HDP-HMM using a blocked Gibbs sampler: (b) the median (solid blue) and 10th and 90th quantiles (dashed red) of Hamming distance between the true and estimated state sequences over the first 1000 Gibbs samples from 200 chains, and (c) Hamming distance over 30,000 Gibbs samples from three chains. (d) Histogram of the inferred self-transition parameter, ρ, for the sticky HDP-HMM blocked sampler. (e) and (f) Analogous plots to (b) and (c) for the sticky HDP-HMM.

To demonstrate the increased posterior uncertainty with discrete observations, we generated data from a five-state HMM with multinomial emissions with a 0.98 probability of self-transition and equal probability of transitions to the other four states. The vocabulary, or range of possible observation values, was set to 20. The observation and true state sequences are shown in Figure 8(a). We placed a symmetric Dirichlet prior on the parameters of the multinomial distribution, with the Dirichlet hyperparameters equal to 2 [i.e., Dir(2, ..., 2)].
Fig. 8. (a) Observation sequence (blue) and true state sequence (red) for a five-state HMM with multinomial observations. (b) Histogram of the predictive probability of test sequences using the inferred parameters sampled every 100th iteration from Gibbs iterations 10,000–30,000 for the sticky and original HDP-HMM. The Hamming distances over 30,000 Gibbs samples from three chains are shown for the (c) sticky HDP-HMM and (d) original HDP-HMM.

Comparing the Hamming distance plots of Figure 8(c) and (d), the benefit of the sticky parameter is clear. However, it is often the case that the metric of interest is the predictive power of the fitted model, not the accuracy of the inferred state sequence. To study performance under this metric, we simulated 10 test sequences using the same parameters that generated the training sequence. We then computed the likelihood of each of the test sequences under the set of parameters inferred at every 100th Gibbs iteration from iterations 10,000–30,000. This likelihood was computed by running the forward–backward algorithm of Rabiner (1989). We plot these results as a histogram in Figure 8(b). From this plot, we see that the fragmentation of data into redundant HMM states can also degrade the predictive performance of the inferred model. Thus, the sticky parameter plays an important role in the Bayesian nonparametric learning of HMMs even in terms of model averaging.
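The predictive likelihood computation above only requires the forward pass of the forward–backward algorithm; the following sketch (our own Python, for a generic discrete-emission HMM with instantiated parameters) computes the log-likelihood of a test sequence, rescaling at each step for numerical stability:

```python
import numpy as np

def hmm_log_likelihood(y, pi0, pi, emission):
    """Log p(y_{1:T}) for an HMM with initial distribution pi0, transition matrix pi,
    and emission[k, v] = p(y_t = v | z_t = k), via the scaled forward recursion."""
    log_lik = 0.0
    alpha = pi0 * emission[:, y[0]]
    log_lik += np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(y)):
        alpha = (alpha @ pi) * emission[:, y[t]]
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik
```

Evaluating this quantity for each test sequence under each posterior parameter sample yields the values histogrammed in Figure 8(b).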
6.3. Comparison to independent sparse Dirichlet prior.
We have alluded to the fact that the shared sparsity of the HDP-HMM induced by β is essential for inferring sparse representations of the data. Although this is clear from the perspective of the prior model, or, equivalently, the generative process, it is not immediately obvious how much this hierarchical Bayesian constraint helps us in posterior inference.

Fig. 9. (a) State transition diagram for a nine-state HMM with one main state (labeled 1) and eight sub-states (labeled 2–9). All states have a significant probability of self-transition. From the main state, all other states are equally likely. From a sub-state, the most likely nonself-transition is a transition back to the main state. However, all sub-states have a small probability of transitioning to another sub-state, as indicated by the dashed arcs. (b) Observation sequence (top) and true state sequence (bottom) generated by the nine-state HMM with multinomial observations.

Once we are in the realm of considering a fixed, truncated approximation to the HDP-HMM, one might propose an alternative model in which we simply place a sparse Dirichlet prior, Dir(α/L, . . . , α/L) with α/L < 1, independently on each row of the transition matrix. This is equivalent to setting β = [1/L, . . . , 1/L] in the truncated HDP-HMM, which can also be achieved by letting the hyperparameter γ tend to infinity. Indeed, when the data do not exhibit shared sparsity or when the likelihood cues are sufficiently strong, the independent sparse Dirichlet prior model can perform as well as the truncated HDP-HMM. However, in scenarios such as the one depicted in Figure 9, we see substantial differences in performance by considering the HDP-HMM, as well as the inclusion of the sticky parameter. We explored the relative performance of the HDP-HMM and sparse Dirichlet prior model, with and without the sticky parameter, on such a Markov model with multinomial emissions on a vocabulary of size 20. We placed a Dir(0.1, . . . , 0.1) prior on the parameters of the multinomial distribution. For the sparse Dirichlet prior model, we assumed a state space of size 50, which is the same as the truncation level we chose for the HDP-HMM (i.e., L = 50). The results are presented in Figure 10. From these plots, we see that the hierarchical Bayesian approach of the HDP-HMM does, in fact, improve the fitting of a model with shared sparsity. The HDP-HMM consistently infers fewer HMM states and more representative model parameters. As a result, the HDP-HMM has higher predictive likelihood on test data, with an additional benefit gained from using the sticky parameter. Note that the results of Figure 10(f) also motivate the use of the sticky parameter in the more classical setting of a finite HMM with a standard Dirichlet sparsity prior. A motivating example of the use of sparse Dirichlet priors for finite HMMs is presented in Johnson (2007).
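To make the contrast concrete, the following sketch (Python with NumPy; the hyperparameter values are our own, and the construction follows the weak-limit approximation in which each transition row is drawn from a Dirichlet centered on the shared weights β, with κ added to the self-transition entry for the sticky variants) draws truncated transition matrices under both priors; taking β = [1/L, . . . , 1/L] reduces the hierarchical construction to the independent sparse Dirichlet prior.

import numpy as np

rng = np.random.default_rng(2)
L, alpha, gamma, kappa = 50, 3.0, 5.0, 10.0

# Truncated (weak-limit) HDP-HMM: a single global weight vector beta is
# shared by every row, inducing shared sparsity across rows.
beta = rng.dirichlet(np.full(L, gamma / L))
pi_hdp = np.array([
    rng.dirichlet(alpha * beta + kappa * np.eye(L)[k])   # kappa adds self-transition mass
    for k in range(L)
])

# Independent sparse Dirichlet prior: each row gets its own Dir(alpha/L,...,alpha/L),
# i.e., the special case beta = [1/L, ..., 1/L] (gamma -> infinity).
pi_sparse = np.array([
    rng.dirichlet(np.full(L, alpha / L) + kappa * np.eye(L)[k])
    for k in range(L)
])

# Setting kappa = 0 in either construction recovers the non-sticky variants.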
Fig. 10. (a) The true transition probability matrix (TPM) associated with the state transition diagram of Figure 9. (b) and (c) The inferred TPM at the 30,000th Gibbs iteration for the sticky HDP-HMM and sticky sparse Dirichlet model, respectively, only examining those states with more than 1% of the assignments. For the HDP-HMM and sparse Dirichlet model, with and without the sticky parameter, we plot: (d) the Hamming distance error over 10,000 Gibbs iterations, (e) the inferred number of states with more than 1% of the assignments, and (f) the predictive probability of test sequences using the inferred parameters sampled every 100th iteration from Gibbs iterations 5000–10,000.
7. Multimodal emission densities.
In many application domains, the data associated with each hidden state may have a complex, multimodal distribution. We propose to model such emission distributions nonparametrically, using a DP mixture of Gaussians. This formulation is related to the nested DP [Rodriguez, Dunson and Gelfand (2008)], which uses a Dirichlet process to partition data into groups, and then models each group via a Dirichlet process mixture. The bias toward self-transitions allows us to distinguish between the underlying HDP-HMM states. If the model were free to both rapidly switch between HDP-HMM states and associate multiple Gaussians per state, there would be considerable posterior uncertainty. Thus, it is only with the sticky HDP-HMM that we can effectively fit such models.

We augment the HDP-HMM state z_t with a term s_t indexing the mixture component of the z_t th emission density. For each HDP-HMM state, there is a unique stick-breaking measure ψ_k ∼ GEM(σ) defining the mixture weights of the kth emission density, so that s_t ∼ ψ_{z_t}. Given the augmented state (z_t, s_t), the observation y_t is generated by the Gaussian component with parameter θ_{z_t, s_t}. Note that both the HDP-HMM state index and mixture component index are allowed to take values in a countably infinite set. See Figure 5(b).
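A minimal generative sketch of this augmented model under truncated (weak-limit) approximations (Python with NumPy; the truncation levels, hyperparameters and Gaussian component parameters are illustrative assumptions): z_t evolves according to the sticky transition rows, s_t selects a mixture component within state z_t, and y_t is drawn from the corresponding Gaussian.

import numpy as np

rng = np.random.default_rng(3)
L, Lp, T = 10, 5, 500                      # truncated numbers of states / components
gamma, alpha, kappa, sigma_conc = 5.0, 3.0, 50.0, 1.0

def stick_break(conc, size):
    """Truncated stick-breaking weights, approximating GEM(conc)."""
    v = rng.beta(1.0, conc, size)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return w / w.sum()

beta = rng.dirichlet(np.full(L, gamma / L))            # global transition weights
pi = np.array([rng.dirichlet(alpha * beta + kappa * np.eye(L)[k]) for k in range(L)])
psi = np.array([stick_break(sigma_conc, Lp) for _ in range(L)])   # per-state mixture weights
theta_mean = rng.normal(0.0, 5.0, size=(L, Lp))        # Gaussian component means
theta_std = np.full((L, Lp), 1.0)                      # component standard deviations

z = np.zeros(T, dtype=int); s = np.zeros(T, dtype=int); y = np.zeros(T)
z[0] = rng.integers(L)
for t in range(T):
    if t > 0:
        z[t] = rng.choice(L, p=pi[z[t - 1]])           # sticky Markov dynamics
    s[t] = rng.choice(Lp, p=psi[z[t]])                 # s_t ~ psi_{z_t}
    y[t] = rng.normal(theta_mean[z[t], s[t]], theta_std[z[t], s[t]])   # y_t ~ N(theta_{z_t, s_t})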
7.1. Direct assignment sampler.
Many of the steps of the direct assignment sampler for the sticky HDP-HMM with DP emissions remain the same as for the regular sticky HDP-HMM. Specifically, the sampling of the global transition distribution β, the table counts m_{jk} and \bar m_{jk}, and the override variables w_{jt} are unchanged. The difference arises in how we sample the augmented state (z_t, s_t). The joint distribution on the augmented state, having marginalized the transition distributions π_k and emission mixture weights ψ_k, is given by

p(z_t = k, s_t = j | z_{\setminus t}, s_{\setminus t}, y_{1:T}, β, α, σ, κ, λ) = p(s_t = j | z_t = k, z_{\setminus t}, s_{\setminus t}, y_{1:T}, σ, λ) · p(z_t = k | z_{\setminus t}, s_{\setminus t}, y_{1:T}, β, α, κ, λ).

We then block-sample (z_t, s_t) by first sampling z_t, followed by s_t conditioned on the sampled value of z_t. The term p(s_t = j | z_t = k, z_{\setminus t}, s_{\setminus t}, y_{1:T}, σ, λ) relies on how many observations are currently assigned to the jth mixture component of state k. These conditional distributions are derived in the Supplementary Material [Fox et al. (2010)], which also contains an outline of the resulting Gibbs sampler in Algorithm 2.
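A minimal sketch of the resulting two-stage draw (Python with NumPy; the two conditional factors are assumed to be supplied, in unnormalized log form, by routines that we do not reproduce here):

import numpy as np

rng = np.random.default_rng(4)

def sample_augmented_state(log_p_z, log_p_s_given_z):
    """Draw (z_t, s_t) from the factorized conditional.

    log_p_z         : (K,) unnormalized log of p(z_t = k | ...)
    log_p_s_given_z : function k -> (L',) unnormalized log of p(s_t = j | z_t = k, ...)
    """
    pz = np.exp(log_p_z - log_p_z.max())
    z = rng.choice(len(pz), p=pz / pz.sum())           # first draw z_t
    log_ps = log_p_s_given_z(z)
    ps = np.exp(log_ps - log_ps.max())
    s = rng.choice(len(ps), p=ps / ps.sum())           # then draw s_t given z_t
    return z, s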
7.2. Blocked sampler.

To implement blocked resampling of (z_{1:T}, s_{1:T}), we use weak limit approximations to both the HDP-HMM and DP emissions, approximated to levels L and L′, respectively. The posterior distributions for β and π_k remain unchanged from the sticky HDP-HMM; that of ψ_k is given by

ψ_k | z_{1:T}, s_{1:T}, σ ∼ Dir(σ/L′ + n′_{k1}, . . . , σ/L′ + n′_{kL′}),   (7.1)

where n′_{kℓ} is the number of s_t taking value ℓ when z_t = k (i.e., the number of observations assigned to the kth state's ℓth mixture component). The procedure for sampling the augmented state (z_{1:T}, s_{1:T}) is derived in the Supplementary Material [see Algorithm 4, Fox et al. (2010)].
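As a small illustration of (7.1) (Python with NumPy; array names and shapes are ours), each state's mixture weights are resampled from a Dirichlet whose parameters add the symmetric prior mass σ/L′ to the per-component assignment counts:

import numpy as np

rng = np.random.default_rng(5)

def resample_psi(z, s, L, Lp, sigma_conc):
    """Draw psi_k | z, s, sigma for every state k, following equation (7.1)."""
    psi = np.empty((L, Lp))
    for k in range(L):
        # n'_{k,l}: number of time steps with z_t = k and s_t = l
        counts = np.bincount(s[z == k], minlength=Lp)
        psi[k] = rng.dirichlet(sigma_conc / Lp + counts)
    return psi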
7.3. Assessing the multimodal emissions model.

In this section we evaluate the ability of the sticky HDP-HMM to infer multimodal emission distributions relative to the model without the sticky parameter. We generated data from a five-state HMM with mixture-of-Gaussian emissions, where the number of mixture components for each emission distribution was chosen randomly from a uniform distribution on { , . . . , }. Each component of the mixture was equally weighted and the probability of self-transition was set to 0.98, with equal probabilities of transitions to the other states. The large probability of self-transition is what disambiguates this process from one with many more HMM states, each with a single Gaussian emission distribution. The resulting observation and true state sequences are shown in Figure 11(a).
Fig. 11. (a) Observation sequence (blue) and true state sequence (red) for a five-state HMM with mixture-of-Gaussian observations. (b) Histogram of the predictive probability of test sequences using the inferred parameters sampled every 100th iteration from Gibbs iterations 10,000–30,000 for the sticky and original HDP-HMM. The Hamming distances over 30,000 Gibbs samples from three chains are shown for the (c) sticky HDP-HMM and (d) original HDP-HMM, both with DP emissions.
We once again used a nonconjugate base measure and placed a Gaussian prior on the mean parameter and an independent inverse-Wishart prior on the variance parameter of each Gaussian mixture component. The hyperparameters for these distributions were set from the data in the same manner as in the fast-switching scenario. Consistent with the sticky HDP-HMM concentration parameters γ and (α + κ), we placed a weakly informative Gamma(1, 0.01) prior on the concentration parameter σ of the DP emissions. All results are for the blocked sampler with truncation levels L = L′ = 20.

In Figure 11 we compare the performance of the sticky HDP-HMM with DP emissions to that of the original HDP-HMM with DP emissions (i.e., DP emissions, but no bias toward self-transitions). As with the multinomial observations, when the distance between observations does not directly factor into the grouping of observations into HMM states, there is a considerable amount of posterior uncertainty in the underlying HMM state of the nonsticky model. Even after 30,000 Gibbs samples, there are still state sequence sample paths with very rapid dynamics. The result of this fragmentation into redundant states is a slight reduction in predictive performance on test sequences, as in the multinomial emission case. See Figure 11(b).
8. Speaker diarization results.
Recall the speaker diarization task from Section 2, which involves segmenting audio recordings from the NIST Rich Transcription 2004–2007 database into speaker-homogeneous regions while simultaneously identifying the number of speakers. In this section we present our results on applying the sticky HDP-HMM with DP emissions to the speaker diarization task.

A minimum speaker duration of 500 ms was set by associating two preprocessed MFCCs with each hidden state. We also tied the covariances of within-state mixture components (i.e., each speaker-specific mixture component was forced to have identical covariance structure), and used a nonconjugate prior on the mean and covariance parameters. We placed a normal prior on the mean parameter with mean equal to the empirical mean and covariance equal to 0.75 times the empirical covariance, and an inverse-Wishart prior on the covariance parameter with 1000 degrees of freedom and expected covariance equal to the empirical covariance. Our choice of a large degrees of freedom is akin to an empirical Bayes approach in that it concentrates the mass of the prior in reasonable regions based on the data. Such an approach is often helpful in high-dimensional applied problems since our sampler relies on forming new states (i.e., speakers) based on parameters drawn from the prior. Issues of exploration in this high-dimensional space increase the importance of the setting of the base measure.
For the concentration parameters, we placed a Gamma(12, 2) prior on γ, a Gamma(6, 1) prior on α + κ, and a Gamma(1, 0.5) prior on σ. The self-transition parameter ρ was given a Beta(500, 5) prior. For each of the 21 meetings, we ran 10 chains of the blocked Gibbs sampler for 10,000 iterations for both the original and sticky HDP-HMM with DP emissions. We used a sticky HDP-HMM truncation level of L = 15, where the DP-mixture-of-Gaussians emission distribution associated with each of these L HMM states was truncated to L′ = 30 components. Our choice of L significantly exceeds the typical number of speakers, which in the NIST database tends to be between 4 and 6. In practice, our sampler never approached using the full set of possible states and emission components.

In order to explore the importance of capturing the temporal dynamics, we also compare our sticky HDP-HMM performance to that of a Dirichlet process mixture of Gaussians that simply pools together the data from each meeting, ignoring the time indices associated with the observations. We considered a truncated Dirichlet process mixture model with L = 50 components and a Gamma(6, 1) prior on the concentration parameter γ. The base measure was set as in the sticky HDP-HMM.

For the NIST speaker diarization evaluations, the goal is to produce a single segmentation for each meeting. Due to the label-switching issue (i.e., under our exchangeable prior, labels are arbitrary entities that do not necessarily remain consistent over Gibbs iterations), we cannot simply integrate over multiple Gibbs-sampled state sequences. We propose two solutions to this problem. The first, which we refer to as the likelihood metric, is to simply choose from a fixed set of Gibbs samples the one that produces the largest likelihood given the estimated parameters (marginalizing over state sequences), and then produce the corresponding Viterbi state sequence. This heuristic, however, is sensitive to overfitting and will, in general, be biased toward solutions with more states.

An alternative, and more robust, metric is what we refer to as the minimum expected Hamming distance. We first choose a large reference set R of state sequences produced by the Gibbs sampler and a possibly smaller set of test sequences T. Then, for each sequence z^(i) in the test set T, we compute the empirical mean Hamming distance between the test sequence and the sequences in the reference set R; we denote this empirical mean by Ĥ_i. We then choose the test sequence z^(j*) that minimizes this expected Hamming distance. That is, z^(j*) = arg min_{z^(i) ∈ T} Ĥ_i. The empirical mean Hamming distance Ĥ_i is a label-invariant loss function since it does not rely on labels remaining consistent across samples; we simply compute

Ĥ_i = (1/|R|) Σ_{z^(j) ∈ R} Hamm(z^(i), z^(j)),

where Hamm(z^(i), z^(j)) is the Hamming distance between sequences z^(i) and z^(j) after finding the optimal permutation of the labels in test sequence z^(i) to those in reference sequence z^(j). At a high level, this method for choosing state sequence samples aims to produce segmentations of the data that are typical samples from the posterior. Jasra, Holmes and Stephens (2005) provide an overview of some related techniques to address the label-switching issue. Although we could have chosen any label-invariant loss function to minimize, we chose the Hamming distance metric because it is closely related to the official NIST diarization error rate (DER) that is calculated during the evaluations. The final metric by which the speaker diarization algorithms are judged is the overall DER, a weighted average over the set of meetings based on the length of each meeting.

In Figure 12(a) we report the DER of the chain with the largest likelihood given the parameters estimated at the 10,000th Gibbs iteration for each of the 21 meetings, comparing the sticky and original HDP-HMM with DP emissions. We see that the sticky model's temporal smoothing provides substantial performance gains.
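A minimal sketch of the minimum expected Hamming distance rule (Python with NumPy and SciPy; function names are ours, and the optimal label matching is computed with scipy.optimize.linear_sum_assignment, which solves the same bipartite assignment problem as the Hungarian algorithm of Munkres (1957)):

import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_hamming(z_test, z_ref):
    """Hamming distance after optimally mapping test labels to reference labels."""
    labels_t, zt = np.unique(z_test, return_inverse=True)
    labels_r, zr = np.unique(z_ref, return_inverse=True)
    # Contingency table: how often test label a co-occurs with reference label b.
    C = np.zeros((len(labels_t), len(labels_r)), dtype=int)
    np.add.at(C, (zt, zr), 1)
    rows, cols = linear_sum_assignment(-C)      # maximize total agreement
    return len(z_test) - C[rows, cols].sum()

def min_expected_hamming(test_seqs, ref_seqs):
    """Index of the test sequence minimizing the mean matched Hamming distance
    to the reference set, i.e., the minimum expected Hamming distance rule."""
    means = [np.mean([matched_hamming(zt, zr) for zr in ref_seqs]) for zt in test_seqs]
    return int(np.argmin(means))

Here matched_hamming counts the disagreements remaining after the best one-to-one mapping of test labels to reference labels, and min_expected_hamming returns the selected sample.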
Fig. 12. (a)–(c) For each of the 21 meetings, comparison of diarizations using sticky vs. original HDP-HMM with DP emissions. In (a) we plot the DERs corresponding to the Viterbi state sequence using the parameters inferred at Gibbs iteration 10,000 that maximize the likelihood, and in (b) the DERs using the state sequences that minimize the expected Hamming distance. Plot (c) is the same as (b), except for running the 10 chains for meeting 16 out to 50,000 iterations. (d)–(f) Comparison of the sticky HDP-HMM with DP emissions to the ICSI errors under the same conditions.

Although not depicted in this paper, the likelihoods based on the parameter estimates under the original HDP-HMM are almost always higher than those under the sticky model. This phenomenon is due to the fact that without the sticky parameter, the HDP-HMM oversegments the data and thus produces parameter estimates more finely tuned to the data, resulting in higher likelihoods. Since the original HDP-HMM is contained within the class of sticky models (i.e., when κ = 0), there is some probability that state sequences similar to those under the original model will eventually arise using the sticky model. Thus, since the parameters associated with these fast-switching sequences result in higher likelihood of the data, the likelihood metric is not very robust; one would expect the performance under the sticky model to degrade given enough Gibbs chains and/or iterations. In Figure 12(b) we instead report the DER of the chain whose state sequence estimate at Gibbs iteration 10,000 (this defines the test set T) minimizes the expected Hamming distance to the sequences estimated every 100th Gibbs iteration, discarding the first 5000 iterations (this defines the reference set R). Due to the slow mixing rate of the chains in this application, we additionally discard samples whose normalized log-likelihood is below 0.1 units of the maximum at Gibbs iteration 10,000.
Fig. 13. Qualitative results for meetings AMI 20041210-1052 (meeting 1, top), CMU 20050228-1615 (meeting 3, middle) and NIST 20051102-1323 (meeting 16, bottom). (a) True state sequence with the post-processed regions of overlapping- and non-speech time steps removed. (b) and (c) Plotted only over the time-steps as in (a), the state sequences inferred by the sticky HDP-HMM with DP emissions at Gibbs iteration 10,000 chosen using the most likely and minimum expected Hamming distance metrics, respectively. Incorrect labels are shown in red. For meeting 1, the maximum likelihood and minimum expected Hamming distance diarizations are similar, whereas in meeting 3 we clearly see the sensitivity of the maximum likelihood metric to overfitting. The minimum expected Hamming distance diarization for meeting 16 has more errors than that of the maximum likelihood due to poor mixing rates and many samples failing to identify a speaker.

From Figure 12(b), we see that the sticky model still significantly outperforms the original HDP-HMM, implying that most state sequences produced by the original model are worse, not just the one corresponding to the most likely sample. Example maximum likelihood and minimum expected Hamming distance diarizations are displayed in Figure 13. One noticeable exception to this trend is the NIST 20051102-1323 meeting (meeting 16). For the sticky model, the state sequence using the maximum likelihood metric had very low DER [see Figure 13(b)]; however, there were many chains that merged speakers and produced segmentations similar to the one in Figure 13(c), resulting in such a sequence minimizing the expected Hamming distance.
Table 1
Overall DERs for the sticky and original HDP-HMM with DP emissions using the minimum expected Hamming distance and maximum likelihood metrics for choosing state sequences at Gibbs iteration 10,000

                        Overall DERs (%)
                      Min Hamming      Max likelihood   2-Best   5-Best
Sticky HDP-HMM        19.01 (17.84)        19.37         16.97    14.61
Nonsticky HDP-HMM     23.91                25.91         23.67    21.06

Notes: For the maximum likelihood criterion, we show the best overall DER if we consider the top two or top five most likely candidates. The number in the parentheses is the performance when running meeting 16 for 50,000 Gibbs iterations. The overall ICSI DER is 18.37%, while the best achievable DER with the chosen acoustic preprocessing is 10.57%.

See Section 9 for a discussion on the issue of merged speakers. Running meeting 16 for 50,000 Gibbs iterations improved the performance, as depicted by the revised results in Figure 12(c). We summarize our overall performance in Table 1, and note that (when using the 50,000 Gibbs iterations for meeting 16 and 10,000 Gibbs iterations for all other meetings) we obtain an overall DER of 17.84% using the sticky HDP-HMM versus the 23.91% of the original HDP-HMM model. Alternatively, when constrained to single Gaussian emissions the sticky HDP-HMM and original HDP-HMM have overall DERs of 34.97% and 36.89%, respectively, which clearly demonstrates the importance of considering DP emissions. When considering the DP mixture-of-Gaussians model (ignoring the time indices associated with the observations), the overall DER is 72.67%. If one uses the ground truth labels to map multiple inferred DP mixture components to a single speaker label, the overall DER drops to 54.19%. The poor performance of the DP mixture-of-Gaussians model, even when assuming that ground truth labels are available, which would not be the case in practice, illustrates the importance of the temporal dynamics captured by the HMM.

As a further comparison, the algorithm that was by far the best performer at the 2007 NIST competition, developed by a team at the International Computer Science Institute (ICSI) [Wooters and Huijbregts (2007)], has an overall DER of 18.37%. The ICSI team's algorithm uses agglomerative clustering, and requires significant tuning of parameters on representative training data. In contrast, our hyperparameters are automatically set meeting-by-meeting, as outlined at the beginning of this section.

(On such a large data set, running 10 chains for 50,000 iterations for each of the 21 meetings would have represented a significant computational burden and, thus, we only ran the chains to 50,000 iterations for meeting 16, which clearly had not mixed after 10,000 iterations, based on an examination of trace plots of log-likelihoods; see Figure 15. In meeting 16 the differences between two of the speakers are especially subtle, and our sampler has difficulty in reliably finding parameters that separate these speakers.)
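For reference, the overall DER quoted here and in Table 1 is a weighted average of the per-meeting DERs, with weights given by meeting length; a minimal sketch (Python with NumPy; the per-meeting numbers below are placeholders, not values from the paper):

import numpy as np

# Hypothetical per-meeting diarization error rates (%) and meeting lengths (seconds).
per_meeting_der = np.array([15.2, 22.8, 31.0])
meeting_length = np.array([1800.0, 2400.0, 1200.0])

# Overall DER: average of the per-meeting DERs weighted by meeting length.
overall_der = np.sum(per_meeting_der * meeting_length) / np.sum(meeting_length)
print(f"overall DER = {overall_der:.2f}%")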
Fig. 14. (a) Chart comparing the DERs of the sticky and original HDP-HMM with DP emissions to those of ICSI for each of the 21 meetings. Here, we chose the state sequence at the 10,000th Gibbs iteration that minimizes the expected Hamming distance. For meeting 16 using the sticky HDP-HMM with DP emissions, we chose between state sequences at Gibbs iteration 50,000. (b) DERs associated with using ground truth speaker labels for the post-processed data. Here, we assign undetected nonspeech a label different than the preprocessed nonspeech.
An additional benefit of the sticky HDP-HMM over the ICSI approach is the fact that there is inherent posterior uncertainty in this task, and by taking a Bayesian approach, we are able to provide several interpretations. Indeed, when considering the best per-meeting DER for the five most likely samples, our overall DER drops to 14.61% (see Table 1). Although not helpful in the NIST evaluations, which require a single segmentation, providing multiple segmentations could be useful in practice.

To ensure a fair comparison, we use the same speech/nonspeech preprocessing and acoustic features as ICSI, so that the differences in our performance are due to changes in the identified speakers. As depicted in Figure 14, both our performance and that of ICSI depend significantly on the quality of this preprocessing step. For the periods of nonspeech that are incorrectly identified as speech during preprocessing, we are forced to produce errors on these sections since they will be assigned an HMM label (and thus a speaker label) that is separate from the label assigned to the preprocessed sections labeled as nonspeech. Another source of errors are periods of overlapping speech, which impede our ability to clearly identify a single speaker. In Figure 14(a) we compare the meeting-by-meeting DERs of the sticky HDP-HMM, the original HDP-HMM, and the ICSI algorithm. If we use the ground truth speaker labels for the post-processed data (assigning undetected nonspeech a label different than the preprocessed nonspeech), the resulting overall DER is 10.57%, with meeting-by-meeting DERs displayed in Figure 14(b). This number provides a lower bound on the achievable performance using the speech/nonspeech preprocessing, our block-averaging of features, and our assumptions of minimum duration. Beyond these forced errors, it is clear from Figure 14(a) that the sticky HDP-HMM with DP emissions provides performance comparable to that of the ICSI algorithm, while the original HDP-HMM with DP emissions performs significantly worse. Overall, the results presented in this section demonstrate that the sticky HDP-HMM with DP emissions provides an elegant and empirically effective speaker diarization method.
9. Discussion.
We have developed a Bayesian nonparametric approach to the problem of speaker diarization, building on the HDP-HMM presented in Teh et al. (2006). Although the original HDP-HMM does not yield competitive speaker diarization performance due to its inadequate modeling of the temporal persistence of states, the sticky HDP-HMM that we have presented here resolves this problem and yields a state-of-the-art solution to the speaker diarization problem.

We have also shown that this sticky HDP-HMM allows a fully Bayesian nonparametric treatment of multimodal emissions, disambiguated by its bias toward self-transitions. Accommodating multimodal emissions is essential for the speaker diarization problem and is likely to be an important ingredient in other applications of the HDP-HMM to problems in speech technology. We also presented efficient sampling techniques with mixing rates that improve on the state of the art by harnessing the Markovian structure of the HDP-HMM. Specifically, we proposed employing a truncated approximation to the HDP and block-sampling the state sequence using a variant of the forward–backward algorithm. Although the blocked samplers yield substantially improved mixing rates over the sequential, direct assignment samplers, there are still some pitfalls to these sampling methods. One issue is that for each new considered state, the parameter sampled from the prior distribution must better explain the data than the parameters associated with other states that have already been informed by the data. In high-dimensional applications, and in cases where state-specific emission distributions are not clearly distinguishable, this method for adding new states poses a significant challenge. Indeed, both issues arise in the speaker diarization task and we did have difficulties with mixing. Further evidence of this is presented in the trace plots in Figure 15, where we plot log-likelihoods, Hamming distances and speaker counts for 10,000 Gibbs sampling iterations of meeting 5 and 100,000 iterations of meeting 16. As discussed previously, meeting 16 is the most problematic meeting in our data set, and these plots provide clear evidence that our sampler is not mixing on this meeting. But even on meeting 5, which is more representative of the full set of meetings and which is segmented effectively by our procedure, we see a relatively slow evolution of the sampler, particularly as measured by the number of speakers. Our use of the minimum expected Hamming distance procedure to select samples mitigates this difficulty, but further work on sampling procedures for the sticky HDP-HMM is needed. One possibility is to consider split-merge algorithms similar to those developed in Jain and Neal (2004) for the DP mixture model.

Fig. 15. Trace plots of (a) log-likelihood, (b) Hamming distance error and (c) number of speakers for 10 chains for two meetings: CMU 20050912-0900 / meeting 5 (top) and NIST 20051102-1323 / meeting 16 (bottom). For meeting 5, which has behavior representative of the majority of the meetings, we show traces over the 10,000 Gibbs iterations used for the results in Section 8. For meeting 16, we ran the chains out to 100,000 Gibbs iterations to demonstrate the especially slow mixing rate for this meeting. The dashed blue vertical lines indicate 10,000 iterations.

A limitation of the HMM in general is that the observations are assumed conditionally i.i.d. given the state sequence. This assumption is often insufficient in capturing the complex temporal dependencies exhibited in real-world data. Another area of future work is to consider Bayesian nonparametric versions of models better suited to such applications, like the switching linear dynamical system (SLDS) and switching VAR process. A first attempt at developing such models is presented in Fox et al. (2009). An inspiration for the sticky HDP-HMM actually came from considering the original HDP-HMM as a prior for an SLDS. In such scenarios where one does not have direct observations of the underlying state sequence, the issues arising from not properly capturing state persistence are exacerbated. The sticky HDP-HMM presented in this paper provides a robust building block for developing more complex Bayesian nonparametric dynamical models.
Acknowledgments.
We thank O. Vinyals, G. Friedland and N. Morgan for helpful discussions about the NIST data set.

SUPPLEMENTARY MATERIAL
Supplement: Notational conventions, Chinese restaurant franchises and derivations of Gibbs samplers (DOI: 10.1214/10-AOAS395SUPP; .pdf). We present detailed derivations of the conditional distributions used for both the direct assignment and blocked Gibbs samplers, as well as the associated pseudo-code. The description of these derivations relies on the Chinese restaurant analogies associated with the HDP and sticky HDP-HMM, which are expounded upon in this supplementary material. We also provide a list of notational conventions used throughout the paper.

REFERENCES
Barras, C., Zhu, X., Meignier, S. and Gauvain, J.-L. (2004). Improving speaker diarization. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
Beal, M. J. and Krishnamurthy, P. (2006). Gene expression time course clustering with countably infinite hidden Markov models. In Proc. Conference on Uncertainty in Artificial Intelligence, Cambridge, MA.
Beal, M. J., Ghahramani, Z. and Rasmussen, C. E. (2002). The infinite hidden Markov model. In Advances in Neural Information Processing Systems.
Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Ann. Statist.
Chen, S. S. and Gopalakrishnam, P. S. (1998). Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In Proc. DARPA Broadcast News Transcription and Understanding Workshop.
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist.
Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2008). An HDP-HMM for systems with state persistence. In Proc. International Conference on Machine Learning, Helsinki, Finland, July 2008.
Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2009). Nonparametric Bayesian learning of switching dynamical systems. In Advances in Neural Information Processing Systems.
Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2010). Supplement to "A sticky HDP-HMM with application to speaker diarization." DOI: 10.1214/10-AOAS395SUPP.
Gales, M. and Young, S. (2007). The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing.
Gauvain, J.-L., Lamel, L. and Adda, G. (1998). Partitioning and transcription of broadcast news data. In Proc. International Conference on Spoken Language Processing, Sydney, Australia 1335–1338.
Hoffman, M., Cook, P. and Blei, D. (2008). Data-driven recomposition using the hierarchical Dirichlet process hidden Markov model. In Proc. International Computer Music Conference, Belfast, UK.
Ishwaran, H. and Zarepour, M. (2000a). Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. Biometrika.
Ishwaran, H. and Zarepour, M. (2002b). Dirichlet prior sieves in finite normal mixtures. Statist. Sinica.
Ishwaran, H. and Zarepour, M. (2002c). Exact and approximate sum representations for the Dirichlet process. Canad. J. Statist.
Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Statist.
Jasra, A., Holmes, C. C. and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci.
Johnson, M. (2007). Why doesn't EM find good HMM POS-taggers. In Proc. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic.
Kivinen, J. J., Sudderth, E. B. and Jordan, M. I. (2007). Learning multiscale representations of natural scenes using Dirichlet processes. In Proc. International Conference on Computer Vision, Rio de Janeiro, Brazil 1–8.
Kurihara, K., Welling, M. and Teh, Y. W. (2007). Collapsed variational Dirichlet process mixture models. In Proc. International Joint Conferences on Artificial Intelligence, Hyderabad, India.
Meignier, S., Bonastre, J.-F., Fredouille, C. and Merlin, T. (2000). Evolutive HMM for multi-speaker tracking system. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.
Meignier, S., Bonastre, J.-F. and Igounet, S. (2001). E-HMM approach for learning and adapting sound models for speaker indexing. In Proc. Odyssey Speaker Language Recognition Workshop, June 2001.
Munkres, J. (1957). Algorithms for the assignment and transportation problems. J. Soc. Industr. Appl. Math.
Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE.
Reynolds, D. A. and Torres-Carrasquillo, P. A. (2004). The MIT Lincoln Laboratory RT-04F diarization systems: Applications to broadcast news and telephone conversations. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
Robert, C. P. (2007). The Bayesian Choice. Springer, New York.
Rodriguez, A., Dunson, D. B. and Gelfand, A. E. (2008). The nested Dirichlet process. J. Amer. Statist. Assoc.
Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. J. Amer. Statist. Assoc.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica.
Siegler, M., Jain, U., Raj, B. and Stern, R. M. (1997). Automatic segmentation, classification and clustering of broadcast news audio. In Proc. DARPA Speech Recognition Workshop.
Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006). Hierarchical Dirichlet processes. J. Amer. Statist. Assoc.
Tranter, S. E. and Reynolds, D. A. (2006). An overview of automatic speaker diarization systems. IEEE Trans. Audio, Speech Language Process.
Van Gael, J., Saatci, Y., Teh, Y. W. and Ghahramani, Z. (2008). Beam sampling for the infinite hidden Markov model. In Proc. International Conference on Machine Learning, Helsinki, Finland, July 2008.
Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Commun. Statist. Simul. Comput.
Wooters, C. and Huijbregts, M. (2007). The ICSI RT07s speaker diarization system. Lecture Notes in Computer Science.
Wooters, C., Fung, J., Peskin, B. and Anguera, X. (2004). Towards robust speaker segmentation: The ICSI-SRI Fall 2004 diarization system. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
Xing, E. P. and Sohn, K.-A. (2007). Hidden Markov Dirichlet process: Modeling genetic inference in open ancestral space. Bayesian Anal.

E. B. Fox
Department of Statistical Science
Duke University
Box 90251
Durham, North Carolina 27701
USA
E-mail: [email protected]
E. B. Sudderth
Department of Computer Science
Brown University
115 Waterman Street, Box 1910
Providence, Rhode Island 02912
USA
E-mail: [email protected]
M. I. Jordan
Department of Statistics and Department of EECS
University of California, Berkeley
427 Evans Hall
Berkeley, California 94720
USA
E-mail: [email protected]