Dirichlet Process Hidden Markov Multiple Change-point Model
Bayesian Analysis (2015), Number 2, pp. 275–296
Stanley I. M. Ko∗, Terence T. L. Chong†, and Pulak Ghosh‡

Abstract.
This paper proposes a new Bayesian multiple change-point model which is based on the hidden Markov approach. The Dirichlet process hidden Markov model does not require the specification of the number of change-points a priori. Hence our model is robust to misspecification of the number of change-points, in contrast to the fully parametric Bayesian model. We propose a general Markov chain Monte Carlo algorithm which only needs to sample the states around change-points. Simulations for a normal mean-shift model with known and unknown variance demonstrate advantages of our approach. Two applications, namely the coal-mining disaster data and real United States Gross Domestic Product growth, are provided. We detect a single change-point in both the disaster data and US GDP growth. All the change-point locations and posterior inferences of the two applications are in line with existing methods.
Keywords:
Change-point, Dirichlet process, Hidden Markov model, Markov chain Monte Carlo, Nonparametric Bayesian.
The earliest Bayesian change-point model is explored by Chernoff and Zacks (1964), who assume a constant probability of change at each point in time. Smith (1975) investigates the single change-point model under different assumptions on the model parameters. Carlin et al. (1992) assume that the structural parameters are independent of the change-points and introduce the Markov chain Monte Carlo sampling method to derive the posterior distributions. Stephens (1994) further applies the Markov chain Monte Carlo method to the case of multiple changes. Chib (1998) allows the change-point probability to depend on the regime between two adjacent change-points. Koop and Potter (2007) propose the Poisson hierarchical prior for durations in the change-point model, which allows the number of change-points to be unknown. More recent works on the Bayesian change-point model include Wang and Zivot (2000), Giordani and Kohn (2008), Pesaran et al. (2006), Maheu and Gordon (2008) and Geweke and Yu (2011).

In this paper we follow the modeling strategy of Chib (1998), which is one of the most popular Bayesian change-point models. He introduces a discrete random variable indicating the regime from which a particular observation is drawn.

∗ Department of Finance and Business Economics, University of Macau, Macau, [email protected]
† Department of Economics and Institute of Global Economics and Finance, The Chinese University of Hong Kong, Hong Kong, and Department of International Economics and Trade, Nanjing University, China, [email protected]
‡ Department of Quantitative Methods & Information Systems, Indian Institute of Management at Bangalore, India, [email protected]

Specifically, let Y_n = (y_1, y_2, ..., y_n)' be the observed time series, such that the density of y_t conditioned on Y_{t−1} = (y_1, y_2, ..., y_{t−1})' depends on the parameter θ, whose value changes at unknown time periods 1 < τ_1 < ··· < τ_k < n and remains constant within each regime; that is,

    y_t ∼ p(y_t | Y_{t−1}, θ_1)      if t ≤ τ_1,
          p(y_t | Y_{t−1}, θ_2)      if τ_1 < t ≤ τ_2,
          ⋮
          p(y_t | Y_{t−1}, θ_k)      if τ_{k−1} < t ≤ τ_k,
          p(y_t | Y_{t−1}, θ_{k+1})  if τ_k < t ≤ n,    (1)

where θ_i ∈ R^l is an l-dimensional vector, i = 1, 2, ..., k + 1. Note that we consider in this paper the change-point problem in which the data are assumed to be generated by a parametric model whose unknown parameter θ_i changes across regimes. Let s_t be the discrete indicator variable such that

    y_t | s_t ∼ p(y_t | Y_{t−1}, θ_{s_t}),    (2)

where s_t takes values in {1, 2, ..., k, k + 1}. The indicator variable s_t is modeled as a discrete-time, discrete-state Markov process with the constrained transition probability matrix

    P = [ p_11  p_12   0    ···    0
           0    p_22  p_23  ···    0
           ⋮           ⋱     ⋱     ⋮
           0    ···    0    p_kk  p_k(k+1)
           0    ···    0     0     1 ],    (3)

where p_ij = pr(s_t = j | s_{t−1} = i) is the probability of moving to regime j at time t given that the regime at time t − 1 is i. With this parameterization, the i-th change-point occurs at τ_i if s_{τ_i} = i and s_{τ_i + 1} = i + 1.

As pointed out in Chib (1998), the above is a hidden Markov model in which the transition matrix of the hidden state s_t is restricted as in (3). Hence, Chib's multiple change-point model inherits the limitation of the hidden Markov model that the number of states has to be specified in advance. In light of this, Chib (1998) suggests selecting among alternative models (e.g. one change-point vs. multiple change-points) according to Bayes factors. In this paper, we introduce the Dirichlet process hidden Markov model (DPHMM) with left-to-right transition dynamics, without imposing restrictions on the number of hidden states. The use of the DPHMM has the following appealing features:

1. We do not have to specify the number of states a priori. The information provided by the observations determines the states endogenously. Hence, our method can be regarded as semiparametric.

2. Our modeling approach facilitates the sampling of states, since we only need to sample the states around change-points.

3. ... θ, but in a mixture model.

The rest of the paper is organized as follows. Section 2 provides a brief introduction of the Dirichlet process. Section 3 incorporates the Dirichlet process into the change-point model. The general Markov chain Monte Carlo sampler is discussed in Section 4. A Monte Carlo study of the normal mean-shift model is conducted in Section 5. Section 6 discusses learning of the DPHMM parameters. Section 7 provides applications of our model and Section 8 concludes the paper.

Our new method employs the Dirichlet process technique, which is widely used in nonparametric Bayesian models. The Dirichlet process prior is first proposed by Ferguson (1973), who derives it as a prior on the space of probability measures with respect to some measurable space (Ω, F). Hence the Dirichlet process is a distribution over probability measures. Blackwell and MacQueen (1973) show that the Dirichlet process can be represented by the Polya urn model. Sethuraman (1994) develops the constructive stick-breaking definition.

In the present study, we assign a Dirichlet process prior to each row of the transition matrix. The Dirichlet process is best defined here as the infinite limit of finite mixture models (Neal (1992), Neal (2000) and Beal et al. (2002)). To illustrate the idea, let us first consider the case with a finite number of states. With the left-to-right restriction on the transition dynamics, a particular state s_{t−1} = i will either stay at the current state i or transit to a state j > i. A left-to-right Markov chain with k states will typically have the following upper-triangular transition matrix

    P = [ p_11  p_12  p_13  ···  p_1k
           0    p_22  p_23  ···  p_2k
           0     0    p_33  ···  p_3k
           ⋮     ⋮     ⋮     ⋱    ⋮
           0     0    ···   ···  p_kk ],    (4)

where the summation of each row equals 1.
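Under the finite-k setup, each row of the matrix in (4) can be drawn from a symmetric Dirichlet prior as in (6) below. A minimal numpy sketch (the function name and seed are ours, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)

def left_to_right_matrix(k, beta=2.0):
    """Draw a k-state left-to-right transition matrix as in (4):
    row i (0-indexed) places a symmetric Dirichlet draw on its
    k - i reachable states {i, i+1, ..., k-1}, cf. the prior (6)."""
    P = np.zeros((k, k))
    for i in range(k):
        m = k - i                          # number of reachable states from i
        P[i, i:] = rng.dirichlet([beta / m] * m)
    return P

P = left_to_right_matrix(4)
# P is upper triangular and each row sums to one.
```

The draw illustrates the structure only; the paper integrates these probabilities out analytically rather than sampling them.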
Note that the left-to-right Markov chain here is different from Chib's restricted band transition matrix (3). Here, the number of states k is not necessarily the number of regimes, as is the case in Chib's model.

Let p_i = (0, ..., 0, p_ii, p_{i(i+1)}, ..., p_ik) be the transition probabilities of the i-th row of the transition matrix (4). Suppose we draw m samples {c_1, ..., c_m} of s_{t+1} given s_t = i with probability profile p_i. The joint distribution of the sample is thus

    pr(c_1, ..., c_m | p_i) = ∏_{j=i}^{k} p_ij^{m_j},    (5)

where m_j denotes the number of samples that take state j, j = i, ..., k. We assume a symmetric Dirichlet prior π(p_i | β) for p_i with positive concentration parameter β:

    p_i | β ∼ Dirichlet(β/(k − i + 1), ..., β/(k − i + 1)) = Γ(β)/Γ(β/(k − i + 1))^{k−i+1} ∏_{j=i}^{k} p_ij^{β/(k−i+1) − 1}.    (6)

With the Dirichlet prior, we can analytically integrate out p_i such that

    pr(c_1, ..., c_m | β) = ∫ pr(c_1, ..., c_m | p_i, β) dπ(p_i | β) = Γ(β)/Γ(m + β) ∏_{j=i}^{k} Γ(m_j + β/(k − i + 1))/Γ(β/(k − i + 1)).    (7)

The conditional probability of a sample c_d ∈ {c_1, ..., c_m} given all the other samples is thus

    pr(c_d = j | c_{−d}) = (m_{−d,j} + β/(k − i + 1))/(m − 1 + β),    (8)

where c_{−d} denotes the sample set with c_d deleted, and m_{−d,j} is the number of samples in c_{−d} that take state j.

Taking the limit of equation (8) as k tends to infinity, we have

    pr(c_d = j | c_{−d}) = m_{−d,j}/(m − 1 + β)   for existing j ∈ {i, i + 1, ..., k},
                           β/(m − 1 + β)          for all potential new states.    (9)

Note that the probability that c_d takes an existing state, say j, is proportional to m_{−d,j}, which implies that c_d is more likely to choose an already popular state. In addition, the probability that a new state (i.e. k + 1) takes place is proportional to β.
Hence, there are potentially many states available, with the infinite-dimensional transition matrix

    P = [ p_11  p_12  p_13  ···
           0    p_22  p_23  ···
           0     0    p_33  ···
           ⋮     ⋮     ⋮     ⋱  ].    (10)

The actual state space can be regarded as consisting of an infinite number of states, only a finite number of which are actually associated with the data. Therefore, the number of states is endogenously determined.

Let us now turn to the proposed multiple change-point model and discuss a particular state evolution. Suppose we have already generated the hidden states up to s_t = i. We impose the Dirichlet process described in Section 2 on s_{t+1}. In the change-point model, the transitions that have occurred so far from state i are only self-transitions. With the left-to-right dynamic, there has been neither a backward transition, i.e., a transition from state i to some previous state, nor a forward transition, i.e., a transition to some future state. The counts of the existing transitions from state i are used as the counts defined in equation (9). Hence, we have

    pr(s_{t+1} = j | s_t = i, s_1, ..., s_{t−1}) = n_ii/(n_ii + β)   if j = i,
                                                  β/(n_ii + β)      if s_{t+1} takes a new state,    (11)

where n_ii = Σ_{t'=1}^{t−1} δ(s_{t'}, i) δ(s_{t'+1}, i) denotes the count of transitions that have occurred so far from state i to itself. Note that in equation (11), s_{t+1} depends only on the state that s_t takes, according to the Markovian property. All other previous states merely provide the transition counts.

We introduce a self-transition prior mass α for each state. The idea here is that if s_t transits to a new state, say s_{t+1} = i + 1, then without previous transition records, the next state s_{t+2} conditioned on s_{t+1} = i + 1 would take yet another new state with probability 1. With α, this trivial case is avoided and we have

    pr(s_{t+1} = j | s_t = i, s_1, ..., s_{t−1}) = (n_ii + α)/(n_ii + β + α)   if j = i,
                                                  β/(n_ii + β + α)            if s_{t+1} takes a new state.    (12)

Therefore, the whole Markov chain is characterized by two parameters, α and β, instead of a transition probability matrix.
We can see that α controls the prior tendency to linger in a state, and β controls the tendency to explore new states. Figure 1 illustrates three left-to-right Markov chains of length n = 150 with different α and β. Figure 1a depicts a chain that explores many new states with very short lingering times. Figure 1b shows a chain with long lingering times and fewer states. Figure 1c lies in between.

Equation (12) coincides with Chib (1998)'s model when the probability p_ii is integrated out. Specifically, in Chib (1998),

    pr(s_{t+1} = j | s_t = i, s_1, ..., s_{t−1}) = ∫ p(s_{t+1} = j | s_t = i, s_1, ..., s_{t−1}, p_ii) f(p_ii) dp_ii
                                                 = (n_ii + a)/(n_ii + a + b)   if j = i,
                                                   b/(n_ii + a + b)            if j = i + 1,    (13)

where p_ii ∼ Beta(a, b) and f(p_ii) is the corresponding density. However, our model stems from a different perspective. The derivations in the previous section and equation (12) follow the nonparametric Bayesian literature (Neal (2000)) and the infinite HMM of Beal et al. (2002). Indeed, it is known that when the Dirichlet process is truncated at a finite number of states, the process reduces to the generalized Dirichlet distribution (GDD); see Connor and Mosimann (1969) and Wong (1998). For the same reason, (12) coincides with (13). We would like to point out that our modeling strategy facilitates the Gibbs sampling of s_t, which is different from Chib (1998). We will elaborate further on the Gibbs sampler in the next section. Also, learning of α and β will be discussed in Section 6.

The Kronecker-delta function δ(a, b) = 1 if and only if a = b, and 0 otherwise.
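The prior (12) can be simulated forward to produce chains like those in Figure 1. A minimal sketch, with a function name and seed of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_chain(n, alpha, beta):
    """Simulate a left-to-right chain of length n from the prior (12):
    from the current state, stay with probability
    (n_ii + alpha)/(n_ii + beta + alpha), else move to a new state."""
    s = [1]
    n_ii = 0                               # self-transition count of current state
    for _ in range(n - 1):
        p_stay = (n_ii + alpha) / (n_ii + beta + alpha)
        if rng.random() < p_stay:
            s.append(s[-1])
            n_ii += 1
        else:
            s.append(s[-1] + 1)            # innovation: a brand-new state
            n_ii = 0                       # counts restart for the new state
    return s

# Larger alpha relative to beta yields longer lingering and fewer regimes.
chain = simulate_chain(150, alpha=3.0, beta=2.0)
```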
Figure 1: Left-to-right Markov chains with different α and β.

Suppose we have observations Y_n = (y_1, y_2, ..., y_n)'. Given the states S_n = (s_1, ..., s_n)', we have

    y_t | s_t ∼ p(y_t | Y_{t−1}, θ_{s_t}),    (14)

where θ_{s_t} ∈ R^l and Y_{t−1} = (y_1, ..., y_{t−1})'. Let θ = (θ_1, ..., θ_{k+1})' and let γ denote a hyperparameter. Recall that we impose the DPHMM on the states and a hierarchical model on the parameters; we are thus interested in sampling from the posterior p(θ, S_n, γ | Y_n) given the priors p(θ | γ), p(γ) and p(S_n). The general Gibbs sampler procedure is to sample the following in turn:

Step 1. S_n | θ, γ, Y_n,
Step 2. θ | γ, S_n, Y_n,
Step 3. γ | θ, S_n, Y_n.

We discuss the three steps below.

Sampling S_n

The state prior p(S_n) can be easily derived from (12). Moreover, the full conditional is

    p(S_n | θ, γ, Y_n) ∝ p(S_n) p(Y_n | S_n, θ, γ).    (15)

Simulation of S_n from the full conditional (15) is done by the Gibbs sampler. Specifically, we draw s_t in turn for t = 1, 2, ..., n from

    p(s_t | S_{t−1}, S^{t+1}, θ, γ, Y_n) ∝ p(s_t | s_{t−1}, S_{t−2}) p(s_{t+1} | s_t, S^{t+2}) p(y_t | s_t, Y_{t−1}, θ, γ),    (16)

where S_{t−1} = (s_1, ..., s_{t−1})' and S^{t+1} = (s_{t+1}, ..., s_n)'. The most recently updated values of the conditioning variables are used in each iteration. Note that we write p(s_t | s_{t−1}, S_{t−2}) and p(s_{t+1} | s_t, S^{t+2}) to emphasize the Markov dynamic; the other conditioning states merely provide the counts.

With the left-to-right characteristic of the chain, we do not have to sample every s_t from t = 1 to n. Instead, we only need to sample the states at which a change-point takes place. To see this, let us consider a concrete example. Suppose from the last sweep we have

    t:   1  2  3  4  5  6  7  8  9  10  11  ···
    s_t: 1  1  1  2  2  2  2  2  3  4   4  ···

With the left-to-right transition restriction, s_t requires a draw from (16) if and only if s_{t−1} and s_{t+1} differ.
In all other cases, s_t is unchanged with probability one. Suppose we are at t = 2. Since s_1 and s_3 are both equal to 1, s_2 is forced to take 1. In the above chain, the first state that needs to be sampled from (16) is s_3, which will either take the value 1 or 2 (that of s_2 or s_4). If s_3 takes 2 in the sampling (i.e., joining the following regime), then the next state to sample would be s_2; otherwise (i.e., joining the preceding regime), the next state to sample is s_4, since s_3 and s_5 then differ. Now suppose we are at t = 9. We will draw a new s_9, and s_9 will either join the regime of s_8 or the regime of s_10. This may look strange because a gap then exists in the chain. However, our concern here is the consecutive grouping or clustering in the series. We can alternatively think that the state represented by s_9 is simply pruned away in the current sweep. Note that the numbers assigned to the s_t's are nothing but indicators of regimes. Therefore, we relabel the s_t's after each sweep.

In general, suppose s_{t−1} = i and s_{t+1} = i + 1. Then s_t takes either i or i + 1. Table 1 shows the corresponding probability values specified in (16). To see this, if s_t takes i, then the transition from s_{t−1} to s_t is a self-transition and that from s_t to s_{t+1} is an innovation. The corresponding probability values are in the first row of Table 1. The reasoning for s_t = i + 1 is similar. Note the changes of the counts in the different situations.

Table 1: Sampling probabilities of s_t, where n_ii = Σ_{t'=1}^{t−2} δ(s_{t'}, i) δ(s_{t'+1}, i) and n_{i+1,i+1} = Σ_{t'=t+1}^{n−1} δ(s_{t'}, i + 1) δ(s_{t'+1}, i + 1).

                p(s_t | s_{t−1} = i, S_{t−2})        p(s_{t+1} = i + 1 | s_t, S^{t+2})                    p(y_t | s_t, Y_{t−1}, θ)
    s_t = i     (n_ii + α)/(n_ii + β + α)            β/(n_ii + 1 + β + α)                                 p(y_t | Y_{t−1}, θ_i)
    s_t = i+1   β/(n_ii + β + α)                     (n_{i+1,i+1} + α)/(n_{i+1,i+1} + β + α)              p(y_t | Y_{t−1}, θ_{i+1})

For the initial point s_1, if currently s_2 − s_1 ≠ 0, then we can sample s_1 from
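A minimal sketch of the two unnormalized weights for a boundary state, following the structure of Table 1 (all names are ours; lik_i and lik_j stand for the two likelihood terms):

```python
import numpy as np

def boundary_weights(n_ii, n_jj, alpha, beta, lik_i, lik_j):
    """Normalized sampling probabilities for s_t at a boundary with
    s_{t-1} = i and s_{t+1} = i+1, as in Table 1.
    n_ii, n_jj: self-transition counts of regimes i and i+1;
    lik_i, lik_j: p(y_t | Y_{t-1}, theta_i) and p(y_t | Y_{t-1}, theta_{i+1})."""
    w_i = ((n_ii + alpha) / (n_ii + beta + alpha)      # self-transition into i
           * beta / (n_ii + 1 + beta + alpha)          # innovation i -> i+1
           * lik_i)
    w_j = (beta / (n_ii + beta + alpha)                # innovation into i+1
           * (n_jj + alpha) / (n_jj + beta + alpha)    # self-transition of i+1
           * lik_j)
    p = np.array([w_i, w_j])
    return p / p.sum()                                 # normalized over {i, i+1}
```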
    pr(s_1 | s_2, s_3, ..., s_n) = c · α/(β + α) · β/(β + α) · p(y_1 | Y_0, θ_{s_1})                                     if s_1 unchanged,
                                   c · β/(β + α) · (n_{s_2 s_2} + α)/(n_{s_2 s_2} + β + α) · p(y_1 | Y_0, θ_{s_2})       if s_1 = s_2,    (17)

where n_{s_2 s_2} = Σ_{t'=2}^{n−1} δ(s_{t'}, s_2) δ(s_{t'+1}, s_2) and c is the normalizing constant. (This excludes the case of a regime with only one data point; we do not consider this situation here.) For the end point s_n, if s_n − s_{n−1} = 1, we sample s_n from

    pr(s_n | s_{n−1}, s_{n−2}, ..., s_1) = c · (n_{s_{n−1} s_{n−1}} + α)/(n_{s_{n−1} s_{n−1}} + β + α) · p(y_n | Y_{n−1}, θ_{s_{n−1}})   if s_n = s_{n−1},
                                           c · β/(n_{s_{n−1} s_{n−1}} + β + α) · p(y_n | Y_{n−1}, θ_{s_n})                               if s_n unchanged,    (18)

where n_{s_{n−1} s_{n−1}} = Σ_{t'=1}^{n−2} δ(s_{t'}, s_{n−1}) δ(s_{t'+1}, s_{n−1}) and c is the normalizing constant.

As mentioned in Section 3, the DPHMM facilitates the Gibbs sampling of the states. In sampling s_t from (16), we simultaneously use all information on the transitions prior to t (i.e. s_1, ..., s_{t−1}) and after t (i.e. s_{t+1}, ..., s_n), which is captured in p(s_t | s_{t−1}, S_{t−2}) and p(s_{t+1} | s_t, S^{t+2}). Thus our algorithm only requires a record of transitions and draws at the points where structural changes take place, whereas all s_t have to be sampled in Chib (1998).

Sampling θ and γ

Given S_n and Y_n, the full conditionals of θ and γ are simply

    p(θ_i | γ, S_n, Y_n) ∝ p(θ_i | γ) ∏_{t: s_t = i} p(y_t | Y_{t−1}, θ_i),
    p(γ | θ, S_n, Y_n) ∝ p(γ) p(θ | γ),    (19)

which are model specific. In the following sections, we will study a simulated normal mean-shift model, a discrete-type Poisson model, and an AR(2) model.

In Section 4.2, we have discussed the simulation of the states S_n. The number of change-points is inherently estimated through the sampling of states in equation (16).
Within the burn-in period, the number of states changes after each MCMC pass. After the burn-in period, the Markov chain converges and hence the number of states becomes stable. Theoretically, it is legitimate to set any number of change-points in the beginning and let the algorithm find the convergent number of states. In practice, we find it more efficient to initialize with a large number of states and let the algorithm prune away redundant states, rather than allow the change-point number to grow from a small number. Specifically, suppose a reasonably large state number k is proposed. We initialize equidistant states, that is,

    s_t = i,  if (i − 1) · n/k < t ≤ i · n/k,    (20)

where i = 1, ..., k. The algorithm described above will then work out the change-point locations and the number of states after convergence of the Markov chain.

In this section, we first study the normal mean-shift model with known variance σ². Suppose the normal data Y_n = (y_1, ..., y_n)' is subject to k unknown changes in mean. We use the following hierarchical model

    y_t | θ_i ∼ N(θ_i, σ²)   if τ_{i−1} < t ≤ τ_i,
    θ_i | μ, υ ∼ N(μ, υ),
    p(μ, υ) ∝ Inv-Gamma(υ | a, b),    (21)

where σ² is known and τ_i (i = 1, ..., k) is the change-point. We set τ_0 = 0 and τ_{k+1} = n. Next, we apply our algorithm to the case of unknown variance. In addition to (21), we assume the Inverse-Gamma prior for σ²:

    σ² ∼ Inv-Gamma(σ² | c, d).    (22)

All derivations of the full conditionals and the Gibbs samplers are given in the Appendix. We simulate two normal sequences with the parameters specified in Table 2. Specifically, Model 1 is subject to one change-point occurring at t = 50. Model 2 is a two-change-point model with breaks at t = 50 and t = 100. Both models assume variance σ² = 3.
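The equidistant initialization in (20) can be sketched as follows (function name is ours):

```python
import math

def init_states(n, k):
    """Equidistant state initialization as in (20):
    s_t = i when (i - 1) * n/k < t <= i * n/k, for i = 1, ..., k."""
    return [math.ceil(t * k / n) for t in range(1, n + 1)]

# 150 observations split into 5 equal provisional regimes of 30 points each.
s = init_states(150, 5)
```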
Two realizations with respect to Model 1 and Model 2 are shown in Figure 2. We can see the overlap of the data ranges of the different regimes; it is hard to identify the change-points visually.

Figure 2: Random realizations of Model 1 and Model 2.
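A sequence of the Model 1 type can be simulated as below. The regime means (0 and 1.5 here) are illustrative choices of ours, since the entries of Table 2 did not survive extraction; only the break location t = 50 and σ² = 3 are stated in the text:

```python
import numpy as np

rng = np.random.default_rng(5)

# One mean shift at t = 50, known variance 3, series length 150.
# Regime means 0.0 and 1.5 are illustrative, not the paper's values.
n, tau1, sigma2 = 150, 50, 3.0
theta = np.where(np.arange(1, n + 1) <= tau1, 0.0, 1.5)
y = theta + rng.normal(0.0, np.sqrt(sigma2), n)
```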
Table 2: Normal Mean-Shift Models 1 and 2 (columns: θ_1, θ_2, θ_3, σ², τ_1, τ_2, k, n).

Model 1
Model 2
To implement our algorithm, we set the inverse-Gamma hyperparameters a = b = c = d = 1, and the DPHMM parameters α = 3 and β = 2. The two Gibbs samplers for the cases of known and unknown variance are each run for 5000 sweeps after 5000 burn-in samples. The 5000 post-burn-in sweeps are thinned by keeping every 50th draw to reduce dependence across iterations. The first column of Figure 3 shows the probabilities of the regime indicator s_t = i for the two models. Intersections of the lines s_t = i clearly demonstrate the break locations.

To compare our proposed DPHMM to Chib's method, we also report the posterior inference of Chib's model under the true change-point number and the same model specification as in (A-3) and (A-4). The posterior means and standard deviations of the parameters are summarized in Table 3. First, our method performs well in all cases: the posterior distributions concentrate on the true values. The sample first-order serial correlations demonstrate good mixing of the samplers. Second, our results are comparable to those estimated from Chib's model. The Bayes factors show that in most of the cases the models with the true number of change-points are preferred. For example, in Model 2, where two change-points exist, the Bayes factors comparing models with k = 1 versus k = 2 are close to zero, favoring the two-change-point model. Likewise, the Bayes factors comparing k = 2 versus k = 3 favor the two-change-point model. Hence we conclude that the model with two change-points is correctly specified with high probability. However, the Bayes factor fails to detect the correct number of changes in Model 1 with unknown variance; the values suggest a model with two change-points. The posterior probabilities of states estimated by Chib's model are shown in the second column of Figure 3. In summary, the simulation results demonstrate that our algorithm works well in the normal mean-shift models and is robust with respect to the change-point number compared to Chib's model.
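One reading of the thinning scheme above (keep every 50th of the 5000 stored post-burn-in draws, leaving 100) can be sketched as:

```python
# Stand-in for 5000 stored post-burn-in MCMC draws; names are ours.
draws = list(range(5000))
# Keep every 50th draw (indices 49, 99, ..., 4999) to reduce autocorrelation.
thinned = draws[49::50]
```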
In this section, we study the robustness of our algorithm in detecting the true number of change-points. Although our method does not require prespecification of the change-point number, it is still possible that our algorithm fails to estimate the correct number of change-points. Thus, we replicate the entire estimation process of the previous section 1000 times and record the estimated change-point number in each replication.

The prior of the transition probabilities in Chib (1998)'s model is assumed to be Beta(a, b). The parameters a and b are chosen to reflect equidistant durations of the states. For example, in the case of one change-point with sample size n = 150, we take b = 0.1 and a = n/2 × b = 7.5, i.e. Beta(7.5, 0.1).

Figure 3: Posterior probabilities of s_t = i: DPHMM vs. Chib's model.
Table 3: Posterior estimates of Normal Mean-Shift Models 1 and 2. Mean and SD denote, respectively, posterior mean and posterior standard deviation.

Model 1 (known variance / unknown variance; Mean, SD, true value): DPHMM (θ_1, θ_2, σ²); Chib's model with k = 1 (θ_1, θ_2, σ²). Bayes factor analysis: k = 1 vs. k = 2: 1.730, 0.548; k = 1 vs. k = 3: 1.374, 1.010; k = 2 vs. k = 3: 0.794, 1.850.

Model 2 (known variance / unknown variance; Mean, SD, true value): DPHMM (θ_1, θ_2, θ_3, σ²); Chib's model with k = 2 (θ_1, θ_2, θ_3, σ²). Bayes factor analysis: k = 1 vs. k = 2: 0.000, 0.0193; k = 1 vs. k = 3: 0.000, 0.0583; k = 2 vs. k = 3: 3.090, 3.0332.

Specifically, in each of the 1000 replications, we iterate the Gibbs sampler 5000 times and record the change-point number of the last sample. We thus obtain 1000 collections of change-point numbers. Table 4 reports the frequencies of the detected change-point numbers. We see that with high frequency (over 99%) our method detects one change-point (k = 1) in Model 1, in both the known- and unknown-variance cases. In Model 2, our results show that over 90% of the 1000 replications detect two change-points (k = 2), and over 99% detect at least one change-point. The figures demonstrate the robustness of our algorithm in detecting the true number of change-points.

Table 4: Frequencies of detected change-point numbers (known variance: k = 0, k = 1, k = 2; unknown variance: k = 0, k = 1, k = 2).

Model 1
Model 2

Learning α and β

In the previous section, we set the DPHMM parameters α = 3 and β = 2 in estimating the simulated models. In order to learn α and β, we propose to use vague Gamma priors; see Beal et al. (2002). Note that with the number of states specified in each MCMC sweep, the DPHMM reduces to the generalized Dirichlet distribution (GDD); see Connor and Mosimann (1969) and Wong (1998). Hence the posterior is

    p(α, β | S_n) ∝ Gamma(α | a_α, b_α) Gamma(β | a_β, b_β) ∏_{i=1}^{k+1} β · Γ(α + β)/Γ(α) · Γ(n_ii + α)/Γ(n_ii + 1 + α + β),    (23)

where n_ii = Σ_{t=1}^{n−1} δ(s_t, i) δ(s_{t+1}, i) denotes the count of self-transitions. We set a_α = b_α = a_β = b_β = 1 here and in the subsequent sections. Below we consider two alternative approaches for sampling: the first based on maximum-a-posteriori (MAP) estimation, and a second approach using a random walk sampler.

We first solve for the maximum-a-posteriori (MAP) estimates of α and β, obtained as the solutions to the following gradient equations using the Newton-Raphson method:

    ∂ ln p(α, β | S_n)/∂α = (a_α − 1)/α − b_α + Σ_{i=1}^{k+1} [ψ(α + β) + ψ(n_ii + α) − ψ(α) − ψ(n_ii + 1 + α + β)] = 0,
    ∂ ln p(α, β | S_n)/∂β = (a_β − 1)/β − b_β + Σ_{i=1}^{k+1} [1/β + ψ(α + β) − ψ(n_ii + 1 + α + β)] = 0,    (24)

where ψ(·) is the digamma function defined as ψ(x) = d ln Γ(x)/dx.

We implement the algorithm of the previous section together with the MAP update of α and β in each sweep. The DPHMM with MAP update correctly detects the true number of change-points in each simulated model.
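The log posterior (23), the gradients (24), and one truncated random-walk M-H update in the spirit of (25)-(26) below can be sketched as follows. All names are ours; n_self stands for the vector of self-transition counts n_ii, and we only evaluate the gradients rather than run the full Newton-Raphson iteration:

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.stats import norm

rng = np.random.default_rng(6)

def log_post(alpha, beta, n_self, a_a=1.0, b_a=1.0, a_b=1.0, b_b=1.0):
    """Log of the posterior (23), up to a constant, with Gamma priors."""
    n_self = np.asarray(n_self, dtype=float)
    return ((a_a - 1) * np.log(alpha) - b_a * alpha
            + (a_b - 1) * np.log(beta) - b_b * beta
            + np.sum(np.log(beta) + gammaln(alpha + beta) - gammaln(alpha)
                     + gammaln(n_self + alpha)
                     - gammaln(n_self + 1 + alpha + beta)))

def grad_log_post(alpha, beta, n_self, a_a=1.0, b_a=1.0, a_b=1.0, b_b=1.0):
    """The two gradients of (24)."""
    n_self = np.asarray(n_self, dtype=float)
    g_a = (a_a - 1) / alpha - b_a + np.sum(
        digamma(alpha + beta) + digamma(n_self + alpha)
        - digamma(alpha) - digamma(n_self + 1 + alpha + beta))
    g_b = (a_b - 1) / beta - b_b + np.sum(
        1 / beta + digamma(alpha + beta)
        - digamma(n_self + 1 + alpha + beta))
    return g_a, g_b

def mh_step(alpha, beta, n_self):
    """One random-walk M-H update of alpha given beta; the Phi terms
    correct for truncating the normal proposal at zero, cf. (26)."""
    prop = alpha + rng.standard_normal()
    while prop <= 0:                       # redraw until the proposal is positive
        prop = alpha + rng.standard_normal()
    log_A = (log_post(prop, beta, n_self) - log_post(alpha, beta, n_self)
             + norm.logcdf(alpha) - norm.logcdf(prop))
    return prop if np.log(rng.random()) < log_A else alpha
```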
Table 5: MAP estimates of α and β in Normal Mean-Shift Models 1 and 2. Average MAP values of α and β are reported with standard deviations within parentheses; for the other parameters, the values are posterior means and posterior standard deviations. (Model 1 and Model 2, known and unknown variance: α, β, θ_1, θ_2, θ_3, σ².)

Table 5 reports the average MAP values of α and β, and the posterior estimates of all parameters in each model. In every model, the average MAP value of α is greater than that of β, indicating that the algorithm tends to linger in existing states rather than explore a new one. Besides, all parameter estimates are in line with the results of the previous section, where α and β are prespecified.

We also consider a Metropolis-Hastings (M-H) sampler for the posterior (23). The candidate-generating density is assumed to be a random walk process with positive support,

    f(α′ | α) ∝ φ(α′ − α),  α′ > 0,    (25)

where α is the value of the previous draw and φ(·) is the standard normal density function. The acceptance ratio given β is thus

    A(α, α′) = [p(α′, β | S_n) Φ(α)] / [p(α, β | S_n) Φ(α′)],    (26)

where Φ(·) is the standard normal distribution function. The same M-H sampler is also applied to β given the updated α. We incorporate the M-H samplers of α and β into the Gibbs sampler of Section 5. The posterior estimates are shown in Table 6. The posterior means and standard deviations of the parameters θ_i and σ² are similar to those obtained in the previous analyses. The posterior mean of α is greater than the posterior mean of β in all models. This affirms the conclusion from the MAP results that the algorithm tends to linger in existing states rather than explore a new one. Both the MAP and M-H methods correctly estimate the number of change-points in each simulation.

Table 6: Posterior estimates of α and β in Normal Mean-Shift Models 1 and 2. Posterior means and posterior standard deviations within parentheses.
(Model 1 and Model 2, known and unknown variance: α, β, θ_1, θ_2, θ_3, σ².)

We can see that the estimates of α and β from the two approaches are quite different, as shown in Tables 5 and 6. The MAP, as a point estimator, may not reflect the variation of α and β, whereas the M-H is a typical Bayesian method which can be incorporated into the MCMC sampler of the other parameters in question. Moreover, the MAP approach may be limited when the posterior happens to be multi-modal. Therefore, the M-H method is preferred in practice, and the following empirical studies are conducted with the M-H sampler.

We first apply our Dirichlet process multiple change-point model to the much-analyzed data set on the number of coal-mining disasters by year in Britain over the period 1851–1962 (Jarrett (1979), Carlin et al. (1992) and Chib (1998)). Let the disaster count y be modeled by a Poisson distribution

    f(y | λ) = λ^y e^{−λ}/y!.    (27)

The observation sequence Y_n = (y_1, y_2, ..., y_n)' is subject to some unknown change-points. We plot the data y_t in Figure 4. Chib (1998) estimates models with one change-point (k = 1) and with two change-points (k = 2), respectively. He assumes the parameter λ follows the prior Gamma(2, 1) in the one-change-point case and the prior Gamma(3, 1) in the other. Hence, given the regime indicators S_n, the corresponding parameter λ_i in regime i has the following posteriors with respect to the two priors:

    Posterior 1: λ_i | S_n, Y_n ∼ Gamma(λ_i | 2 + U_i, 1 + N_i),    (28)

    Posterior 2: λ_i | S_n, Y_n ∼ Gamma(λ_i | 3 + U_i, 1 + N_i),    (29)
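The conjugate draws implied by these posteriors can be sketched as below (names are ours; U_i and N_i are the regime-i sum of counts and number of observations, defined in the text that follows; a0 = 2, b0 = 1 corresponds to Posterior 1 and a0 = 3 to Posterior 2):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_lambdas(y, s, a0=2.0, b0=1.0):
    """Draw lambda_i | S_n, Y_n ~ Gamma(a0 + U_i, b0 + N_i) for each regime.
    y: Poisson counts; s: regime indicators (1, 2, ...)."""
    y, s = np.asarray(y), np.asarray(s)
    lam = {}
    for i in np.unique(s):
        U = y[s == i].sum()                 # sum of counts in regime i
        N = (s == i).sum()                  # number of observations in regime i
        lam[i] = rng.gamma(a0 + U, 1.0 / (b0 + N))   # numpy uses scale = 1/rate
    return lam
```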
where U_i = Σ_{t=1}^{n} δ(s_t, i) y_t and N_i = Σ_{t=1}^{n} δ(s_t, i).

Figure 4: Data on coal mining disaster counts y_t.

Figure 5: Posterior probability of s_t = i.

We perform our algorithm with the following Gibbs steps:

Step 1. Sample S_n | λ, Y_n as in (16) and obtain k,
Step 2. Sample λ_i | S_n, Y_n as in (28) or (29),
Step 3. Update α and β with the Metropolis-Hastings sampler as in Section 6.2.

The above Gibbs sampler is run for 5000 sweeps with 1000 burn-in samples. To reduce sampler dependency, the 5000 sweeps are thinned by 50 draws. The sampler estimates one change-point in the data. Figure 5 shows the posterior probabilities of the regime indicator s_t = i at each time point t. The intersection of the two lines s_t = 1 and s_t = 2 shows that the break location exists at around t = 40. Figure 6 provides the distribution of the transition points τ_i. Interestingly, our model produces exactly the same figure as the one in Chib (1998). The change-point is identified as occurring at around t = 41. The corresponding posterior means of λ_1, λ_2, α, and β are similar under the two priors. To check the robustness of the estimated change-point number k, we conduct 1000 replications of the above estimation process and collect 1000 change-point numbers. When the first prior is assumed, 77.23% of the 1000 replications detect one change-point. We find a similar result for prior 2. Hence, we conclude that, without assuming the number of change-points a priori, our algorithm detects the same change-point number as the model developed by Chib (1998) with high probability.
We also apply our algorithm to estimate structural changes in real Gross Domestic Product growth. The data and model are drawn from Maheu and Gordon (2008) (see also Geweke and Yu (2011)). Let y_t = 100[log(q_t/q_{t−1}) − log(p_t/p_{t−1})], where q_t is quarterly US GDP, seasonally adjusted, and p_t is the GDP price index. The data range from the second quarter of 1947 to the third quarter of 2003, for a total of 226 observations (see Figure 7). We model the data with a Bayesian AR(2) model with structural change.
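As a small check of the growth transformation above, the series y_t can be computed as follows (the function name is ours; q and p are the GDP level and price index series):

```python
import numpy as np

def real_growth(q, p):
    """y_t = 100 * [log(q_t / q_{t-1}) - log(p_t / p_{t-1})]:
    quarterly real GDP growth from nominal GDP (q) and its price index (p)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return 100 * (np.diff(np.log(q)) - np.diff(np.log(p)))
```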
Figure 7: US real GDP growth from the second quarter of 1947 to the third quarter of 2003.

The frequentist autoregressive structural change model can be found in Chong (2001). Suppose the data are subject to $k$ change-points and follow

$y_t = \beta_{0,s_t} + \beta_{1,s_t} y_{t-1} + \beta_{2,s_t} y_{t-2} + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{N}(0, \sigma^2_{s_t}), \quad s_t = 1, 2, \ldots, k+1.$ (30)

We assume the following hierarchical priors for $\beta_{0,i}$, $\beta_{1,i}$ and $\beta_{2,i}$:

$\beta_i = (\beta_{0,i}, \beta_{1,i}, \beta_{2,i})' \sim \mathrm{N}(\mu, V), \quad i = 1, \ldots, k+1,$ (31)

where $\mu = (\mu_0, \mu_1, \mu_2)'$ and $V = \mathrm{Diag}(v_0, v_1, v_2)$, such that

$p(\mu_j, v_j) \propto \text{Inv-Gamma}(v_j \mid a, b), \quad j = 0, 1, 2.$ (32)

We assume the noninformative prior for $\sigma^2_i$ such that

$p(\sigma^2_i) \propto 1/\sigma^2_i, \quad i = 1, \ldots, k+1.$ (33)

Conditional on $\sigma^2_i$, the sampling of $\beta_i$, $\mu$ and $V$ is similar to Section 5. For $\sigma^2_i$, we can draw from the following full conditional:

$\sigma^2_i \mid \beta_i, S_n, Y_n \sim \text{Inv-}\chi^2\!\left(\sigma^2_i \,\middle|\, \tau_i - \tau_{i-1},\ \frac{\omega_i}{\tau_i - \tau_{i-1}}\right), \quad i = 1, \ldots, k+1,$ (34)

where $\omega_i = \sum_{t=\tau_{i-1}+1}^{\tau_i} (y_t - \beta_{0,i} - \beta_{1,i} y_{t-1} - \beta_{2,i} y_{t-2})^2$ is the sum of squared residuals in regime $i$.

In the appendix, we give the derivations of the full conditionals and the Gibbs samplers in Section 5. For the case of known variance, we first rewrite the hierarchical model (21) as the joint distribution

$p(Y_n, \theta, \mu, \upsilon \mid S_n, \sigma^2) \propto \prod_{i=1}^{k+1} \mathrm{N}(\tilde{y}_i \mid \theta_i, \sigma^2_i) \prod_{i=1}^{k+1} \mathrm{N}(\theta_i \mid \mu, \upsilon)\, p(\mu, \upsilon),$ (A-1)

where $p(\mu, \upsilon)$ corresponds to Inv-Gamma$(\upsilon \mid a, b)$, $\theta = (\theta_1, \ldots, \theta_{k+1})'$ and $\tilde{y}_i = \sum_{t=\tau_{i-1}+1}^{\tau_i} y_t / (\tau_i - \tau_{i-1})$ is the sample mean of the observations in regime $i$. The Gibbs sampler is:

Step 1. Sample $S_n \mid \theta, \mu, \upsilon, Y_n$ as in (16) and obtain $k$,

Step 2. Sample $\theta, \mu, \upsilon \mid S_n, Y_n$ as in (A-3).

For the case of unknown variance, the full conditional with respect to (22) is

$\sigma^2 \mid \theta, S_n, Y_n \sim \text{Inv-Gamma}\!\left(\sigma^2 \,\middle|\, c + \frac{n}{2},\ d + \frac{1}{2} \sum_{t=1}^{n} (y_t - \theta_{s_t})^2\right).$ (A-4)

For $\sigma^2$, we apply the same estimation strategy discussed above. The Gibbs sampler is thus:

Step 1. Sample $S_n \mid \theta, \mu, \upsilon, \sigma^2, Y_n$ as in (16) and obtain $k$,

Step 2. Sample $\theta, \mu, \upsilon, \sigma^2 \mid S_n, Y_n$ as in (A-3) and (A-4).

References

Beal, M. J., Ghahramani, Z., and Rasmussen, C. E.
(2002). “The Infinite Hidden Markov Model.” In Dietterich, T. G., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems, 577–584. MIT Press.

Blackwell, D. and MacQueen, J. B. (1973). “Ferguson Distributions Via Polya Urn Schemes.” The Annals of Statistics, 1(2): 353–355.

Carlin, B. P., Gelfand, A. E., and Smith, A. F. M. (1992). “Hierarchical Bayesian Analysis of Changepoint Problems.” Journal of the Royal Statistical Society. Series C (Applied Statistics), 41(2): 389–405.

Chernoff, H. and Zacks, S. (1964). “Estimating the Current Mean of a Normal Distribution which is Subjected to Changes in Time.” The Annals of Mathematical Statistics, 35(3): 999–1018.

Chib, S. (1998). “Estimation and comparison of multiple change-point models.” Journal of Econometrics, 86(2): 221–241.

Chong, T. T.-L. (2001). “Structural Change in AR(1) Models.” Econometric Theory, 17(1): 87–155.

Connor, R. J. and Mosimann, J. E. (1969). “Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution.” Journal of the American Statistical Association, 64(325): 194–206.

Ferguson, T. S. (1973). “A Bayesian analysis of some nonparametric problems.” Annals of Statistics, 1: 209–230.

Geweke, J. and Yu, J. (2011). “Inference and prediction in a multiple-structural-break model.” Journal of Econometrics, 163(2): 172–185.

Giordani, P. and Kohn, R. (2008). “Efficient Bayesian Inference for Multiple Change-Point and Mixture Innovation Models.” Journal of Business and Economic Statistics, 26(1): 66–77.

Jarrett, R. G. (1979). “A Note on the Intervals Between Coal-Mining Disasters.” Biometrika, 66(1): 191–193.

Koop, G. and Potter, S. M. (2007). “Estimation and Forecasting in Models with Multiple Breaks.” The Review of Economic Studies, 74(3): 763–789.

Kozumi, H. and Hasegawa, H. (2000). “A Bayesian analysis of structural changes with an application to the displacement effect.” The Manchester School, 68(4): 476–490.

Maheu, J. M. and Gordon, S. (2008). “Learning, forecasting and structural breaks.” Journal of Applied Econometrics, 23(5): 553–583.

Neal, R. M. (1992). “Bayesian Mixture Modeling.” In Smith, C. R., Erickson, G. J., and Neudorfer, P. O. (eds.), Proceedings of the Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, 197–211. Kluwer Academic Publishers.

— (2000). “Markov Chain Sampling Methods for Dirichlet Process Mixture Models.” Journal of Computational and Graphical Statistics, 9(2): 249–265.

Pesaran, M. H., Pettenuzzo, D., and Timmermann, A. (2006). “Forecasting Time Series Subject to Multiple Structural Breaks.” Review of Economic Studies, 73(4): 1057–1084.

Sethuraman, J. (1994). “A Constructive Definition of Dirichlet Priors.” Statistica Sinica, 4(2): 639–650.

Smith, A. F. M. (1975). “A Bayesian approach to inference about a change-point in a sequence of random variables.” Biometrika, 62: 407–416.

Stephens, D. A. (1994). “Bayesian Retrospective Multiple-Changepoint Identification.” Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1): 159–178.

Wang, J. and Zivot, E. (2000). “A Bayesian Time Series Model of Multiple Structural Changes in Level, Trend, and Variance.” Journal of Business & Economic Statistics, 18(3): 374–386.

Wong, T. (1998). “Generalized Dirichlet distribution in Bayesian analysis.” Applied Mathematics and Computation, 97(2-3): 165–181.