Statistical Inference in Hidden Markov Models using k-segment Constraints
Michalis K. Titsias∗, Christopher Yau† and Christopher C. Holmes‡

∗ Athens University of Economics and Business, 76 Patission Str., GR10434 Athens, Greece. E-mail: [email protected]
† Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, United Kingdom. E-mail: [email protected]
‡ Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, United Kingdom. E-mail: [email protected]
November 6, 2013
Abstract
Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward-backward (F-B) algorithm. In this article, we expand the amount of information we can obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming algorithms, which we collectively call k-segment algorithms, that allow us to i) find MAP sequences, ii) compute posterior probabilities and iii) simulate sample paths conditional on a user-specified number of segments, i.e. contiguous runs in a hidden state, possibly of a particular type. We illustrate the utility of these methods using simulated and real examples and highlight the application of prospective and retrospective use of these methods for fitting HMMs or exploring existing model fits.

1 Introduction

The use of the Hidden Markov Model (HMM) is ubiquitous in sequence analysis applications across a range of scientific and engineering domains, including signal processing (Juang and Rabiner, 1991; Crouse et al., 1998), genomics (Eddy, 1998; Li and Stephens, 2003) and finance (Paas et al., 2007). Fundamentally, the HMM is a mixture model whose mixing distribution is a finite state Markov chain (Rabiner, 1989; Cappé et al., 2005). Whilst the Markov assumptions rarely correspond to the true physical generative process, the HMM often adequately captures first-order properties that make it a useful approximating model for sequence data in many instances, whilst remaining tractable even for very large datasets. As a consequence, HMM-based algorithms can give highly competitive performance in many applications.

Central to the tractability of HMMs is the availability of recursive algorithms that allow fundamental quantities to be computed efficiently (Baum and Petrie, 1966; Viterbi, 1967). These include the Viterbi algorithm, which computes the most probable hidden state sequence, and the forward-backward algorithm, which computes the marginal probability of a given state at a point in the sequence. Computation for the HMM has been well summarized in the comprehensive and widely read tutorial by Rabiner (1989), with a Bayesian treatment given more recently by Scott (2002). It is a testament to the completeness of these recursive methods that there have been few generic additions to the HMM toolbox since these were first described in the 1960s. However, as HMM approaches continue to be applied in increasingly diverse scientific domains and to ever larger data sets, there is interest in expanding the generic toolbox available for HMM inference to encompass unmet needs.

The motivation for our work is to develop mechanisms that allow the exploration of the posterior sequence space. Typically, standard HMM inference limits itself to reporting a few standard quantities. For an M-state Markov chain of length N there exist M^N possible sequences, but often only the most probable sequence or the N × M marginal posterior probabilities are used to summarize the whole posterior distribution. Yet it is clear that, when the state space is large and/or the sequences long, many other sequences may be of interest.
Modifications of the Viterbi algorithm can allow an arbitrary number of the most probable sequences to be enumerated, whilst Bayesian techniques allow us to sample sequences from the posterior distribution. However, since a small change to the most likely sequence typically gives a new sequence with similar probability, these approaches do not lead to reports of qualitatively diverse sequences, by which we mean alternative sequence predictions that might lead to different decisions or scientific conclusions.

In this article we describe a set of novel recursive methods for HMM computation that incorporate segmental constraints, which we call k-segment inference algorithms. These are so called because the algorithms are constrained to consider only sequences involving no more than k − 1 specified transition events. We show that k-segment procedures provide an intuitive approach for posterior exploration of the sequence space, allowing diverse sequence predictions containing 1, 2, ..., k segments or specific transitions of interest. These methods can be applied prospectively during model fitting or retrospectively to an existing model. In the latter case, the utility of the methods described here comes at no cost (other than computational time) to the HMM user, and we provide illustrative examples to highlight novel insights that may be gained through k-segment approaches.

Figure 1: HMM depicted as a directed graphical model.
The HMM encodes two types of random sequences: the hidden state sequence or path x = (x_1, ..., x_N) and the observed data sequence y = (y_1, ..., y_N). Individual hidden states take discrete values, such that x_n ∈ {1, ..., M}, while observed variables can be of arbitrary type. The hidden state sequence x follows a Markov chain, so that

p(x | π, A) = p(x_1 | π) ∏_{n=2}^{N} p(x_n | x_{n−1}, A).   (1)

Here, the first hidden state x_1 is drawn from some initial probability vector π, so that π_m = p(x_1 = m) denotes the probability of x_1 being in state m ∈ {1, ..., M}, whereas any subsequent hidden state x_n (with n > 1) is drawn according to a transition matrix A, so that [A]_{m′m} = p(x_n = m | x_{n−1} = m′) expresses the probability of moving to a state m from m′. Given a path x following the Markov chain in (1), the observed data are generated independently according to

p(y | x) = ∏_{n=1}^{N} p(y_n | x_n, φ),   (2)

where the densities p(y_n | x_n = m, φ), m = 1, ..., M, are often referred to as the emission densities and are parametrized by φ. In what follows we collectively denote all HMM parameters, i.e. (π, A, φ), by θ.

Statistical estimation in HMMs takes advantage of the Markov dependence structure, shown in Figure 1, which allows efficient dynamic programming algorithms to be applied. For instance, maximum likelihood (ML) estimation of the parameters θ via the EM algorithm is carried out by the forward-backward (F-B) recursion (Baum and Petrie, 1966), which implements the Expectation step in O(M²N) time. A similar recursion having the same time complexity is the Viterbi algorithm (Viterbi, 1967) which, given a fixed value for the parameters, estimates the maximum a posteriori (MAP) hidden sequence. Furthermore, straightforward generalizations of the Viterbi algorithm estimate the P-best list of most probable sequences (Schwartz and Chow, 1990; Nilsson and Goldberger, 2001). In contrast to ML point estimation, a Bayesian approach assigns a prior distribution p(θ) over the parameters and seeks to estimate expectations taken under the posterior distribution p(x, θ | y). The Bayesian framework also greatly benefits from efficient recursions derived as subroutines of Monte Carlo algorithms. Specifically, the popular Gibbs sampling scheme (Scott, 2002) relies on the forward-filtering-backward-sampling (FF-BS) recursion, which simulates in O(M²N) time a hidden sequence from the conditional posterior distribution p(x | θ, y). In summary, all recursions mentioned above have linear time complexity with respect to the length of the sequence N and are instances of more general inference tools developed in the theory of probabilistic graphical models (Cowell et al., 2003; Koller and Friedman, 2009).

While the linear time efficiency of the current HMM recursions is one of the keys to the widespread adoption of HMMs in applications, the information regarding the posterior distribution obtained by these algorithms is still very limited. To this end, we define novel probabilistic inference problems for exploration of the HMM posterior distribution and we efficiently solve these problems by introducing linear time recursions.
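For reference in what follows, the standard forward (α) recursion just mentioned can be written in a few lines. The sketch below is ours, not the authors': a minimal numpy implementation in log space, with hypothetical argument names, included only to fix notation that the constrained recursions introduced later generalize.

    import numpy as np
    from scipy.special import logsumexp

    def forward_loglik(log_pi, log_A, log_emit):
        # Standard O(M^2 N) HMM forward pass in log space.
        #   log_pi:   (M,)   log initial probabilities, log p(x_1 = m)
        #   log_A:    (M, M) log transitions, log_A[i, j] = log p(x_n = j | x_{n-1} = i)
        #   log_emit: (N, M) log emissions, log_emit[n, m] = log p(y_n | x_n = m)
        N, M = log_emit.shape
        log_alpha = log_pi + log_emit[0]        # alpha(x_1) = p(x_1) p(y_1 | x_1)
        for n in range(1, N):
            # alpha(x_n) = p(y_n | x_n) * sum over x_{n-1} of alpha(x_{n-1}) p(x_n | x_{n-1})
            log_alpha = log_emit[n] + logsumexp(log_alpha[:, None] + log_A, axis=0)
        return logsumexp(log_alpha)             # log p(y), summing out x_N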
To start with a motivating example, assume that we are interested in the event

c_x = #{transitions of a certain class occurring along x},   (3)

i.e. c_x denotes the number of times a certain class of transitions occurs along the hidden path x of the HMM. Then, we may wish to compute the probability

p(c_x = k | y, θ) = Σ_x I(c_x = k) p(x | θ, y),   (4)

which is a global marginal obtained after a summation over all paths having exactly k occurrences of the certain class of transitions. This probability cannot be obtained from the current F-B recursion, which allows us to compute only local marginals such as p(x_n | y, θ) or p(x_{n−1}, x_n | y, θ). The use of Monte Carlo methods to approximate the right hand side of (4) is also unsuitable because, while fast and exact simulation from p(x | θ, y) is possible by means of the FF-BS recursion, the obtained accuracy could be insufficient when the underlying value of p(c_x = k | y, θ) is very small due to the extremeness of the event c_x = k.

We can also define several other related tasks, which throughout the paper we collectively refer to as k-segment inference problems. In such problems, we insert hard constraints into the HMM involving the number and type of segments we want to see along the path x, and then we query the model to provide us with probabilities or representative paths characterizing that constraint. Some additional representative examples of k-segment inference that we study in this article are the computation of the optimal MAP sequence associated with the event c_x = k and the simulation of paths from the conditional distribution p(x | c_x = k, y, θ).

To solve k-segment inference problems we develop efficient linear time dynamic programming recursions. The solution we provide introduces auxiliary counting variables into the original HMM, so that we obtain an extended state-space HMM which is consistent with the original model. The auxiliary counting variables used in this augmentation allow us to inject evidence or constraints into the model, so that the standard recursions applied to the extended HMM solve all k-segment inference problems of interest. This provides a simple and elegant solution and results in new HMM recursions that generalize the standard F-B, Viterbi and FF-BS algorithms.

The remainder of the article is organized as follows. Section 4 describes the main theory of k-segment inference, for applications to a posteriori model exploration, and derives the novel HMM recursions. Section 5 considers model fitting under k-segment constraints and presents suitable EM and Bayesian learning procedures. Section 6 presents extensions to the basic k-segment problems, while Section 7 discusses related work. Section 8 considers sequence analysis using two real-world examples from cancer genomics and text information retrieval. Finally, Section 9 concludes with a discussion and directions for future work.

4 k-segment inference

This section presents the theoretical foundations of k-segment inference problems, starting with section 4.1, which defines such problems. Section 4.2 reformulates these problems in terms of an extended state-space HMM having auxiliary counting variables, and section 4.3 presents efficient solutions based on linear time recursions. Finally, section 4.4 gives a graphical illustration of the proposed algorithms using a simulated sequence. All the algorithms described in this section assume a fixed setting for the parameters θ.
Therefore, to keep our expressions uncluttered, in the following we drop θ and write, for instance, p(x | y, θ) as p(x | y) and p(y | θ) as p(y).

4.1 k-segment inference problems

Any hidden path x in an HMM can have from 0 up to N − 1 transitions or, equivalently, from 1 up to N segments, where a segment is defined as a contiguous run of indices where x_{n−1} = x_n. Following the notation used in eq. (3), we define the number of all segments in x by

c_x = 1 + Σ_{n=2}^{N} I(x_{n−1} ≠ x_n),   (5)

where I(·) denotes the indicator function: c_x is the sum of the number of transitions, i.e. the locations in the hidden path where x_{n−1} ≠ x_n, plus one to account for the initial segment, which is not the result of a transition.

Subsets of hidden paths associated with different numbers of segments comprise exclusive events, which allows us to decompose the posterior distribution p(x | y) as follows. If we introduce the events c_x = k, with k = 1, ..., N, each corresponding to the subset of paths {x : c_x = k} having exactly k segments, the posterior distribution p(x | y) decomposes as

p(x | y) = Σ_{k=1}^{N} p(x, c_x = k | y) = Σ_{k=1}^{N} p(x | c_x = k, y) p(c_x = k | y),   (6)

where

p(x | c_x = k, y) = I(c_x = k) p(y | x) p(x) / Σ_{x: c_x = k} p(y | x) p(x),   (7)

is the posterior distribution conditional on having k segments, while

p(c_x = k | y) = p(c_x = k, y) / p(y) = Σ_{x: c_x = k} p(y | x) p(x) / Σ_x p(y | x) p(x),   (8)

is the posterior probability of the event c_x = k.

The mixture decomposition in eq. (6) suggests that one way to explore the posterior distribution of the HMM is to compute quantities associated with the components of this mixture. This leads to the k-segment inference problems, which can be divided into the following three types:

• Optimal decoding:
Find the MAP hidden path that has k segments, that is, the path with the maximum value of p(x | c_x = k, y).

• Probability computation:
Find the posterior probability of having k segments, i.e. p(c_x = k | y).

• Path sampling:
Draw independent samples from p(x | c_x = k, y).

To this end, in Section 4.3 we introduce efficient linear time algorithms that solve all the above tasks together with several additional related tasks associated with more general events of the form k_1 ≤ c_x ≤ k_2, where 1 ≤ k_1 < k_2 ≤ N, such as finding the MAP of p(x | c_x > k, y), sampling from p(x | c_x > k, y), etc. These algorithms are based on a reformulation of the above k-segment inference problems that uses an extended state-space HMM containing auxiliary counting variables.

4.2 An extended state-space HMM with counting variables

The basis of our algorithm is the augmentation of the Markov chain in (1) with auxiliary variables that count the number of segments. Specifically, c_x from (5) can be considered as a counter that scans the path x and increments by one any time it encounters a transition. We can represent this counting process with an N-dimensional vector of auxiliary variables s, which is an increasingly monotone sequence of non-negative integers. Conditioning on a certain path x, s is sampled deterministically according to the Markov chain

p(s | x) = p(s_1 | x_1) ∏_{n=2}^{N} p(s_n | s_{n−1}, x_{n−1}, x_n)
         = δ_{s_1,1} ∏_{n=2}^{N} [ I(x_{n−1} ≠ x_n) δ_{s_n, s_{n−1}+1} + (1 − I(x_{n−1} ≠ x_n)) δ_{s_n, s_{n−1}} ],   (9)

where δ_{i,j} is the delta mass that equals one when i = j and zero otherwise. We refer to this conditional distribution as the counting Markov chain, or counting chain, because it is the Markov chain that makes precise the concept of counting the segments. The counting chain starts at one, i.e. s_1 = 1 (which can be interpreted as sampling from the delta mass δ_{s_1,1}), and then increments by one, so that s_n = s_{n−1} + 1, every time a transition occurs in the hidden path, i.e. whenever x_{n−1} ≠ x_n, which implies the generation of a new segment. The joint density of the HMM is augmented with the counting chain so that

p(y, x, s) = p(y | x) p(x) p(s | x),   (10)

is the new joint density, having the conditional independence structure shown as a directed graphical model in Figure 2. Because this augmentation is consistent (if we marginalize out s, we recover the joint density of the initial HMM), prior-to-posterior inference in the initial HMM and in the HMM augmented with auxiliary variables are in theory equivalent. However, in practice, inference in the latter model is more flexible, since it allows us to solve the k-segment inference problems through the insertion of constraints in the counting process. More precisely, the final value of the counter s_N equals c_x, so the event c_x = k can be realized by adding the evidence s_N = k in the graphical model of Figure 2. Therefore, all types of k-segment inference problems can be reformulated as follows:

• Optimal decoding:
The MAP hidden path x* of p(x | c_x = k, y) can be found according to

(x*, s*_{∖N}) = arg max_{x, s_{∖N}} p(y | x) p(x) p(s_{∖N}, s_N = k | x).   (11)

• Probability computation:
The posterior probability p(c_x = k | y) can be expressed as p(s_N = k, y) / p(y), where p(y) is known from the forward pass of the standard F-B algorithm and

p(s_N = k, y) = Σ_{x, s_{∖N}} p(y | x) p(x) p(s_{∖N}, s_N = k | x).   (12)

• Path sampling:
An independent sample x̃ from p(x | c_x = k, y) is obtained as

(x̃, s̃_{∖N}) ∼ p(x, s_{∖N} | s_N = k, y) ∝ p(y | x) p(x) p(s_{∖N}, s_N = k | x),   (13)

where s_{∖N} denotes all counting variables apart from the final s_N, which is clamped to k.

Figure 2: Directed graphical model for the HMM augmented with the counting chain.

For more general events of the form k_1 ≤ s_N ≤ k_2, where 1 ≤ k_1 < k_2 ≤ N, the above still holds with the slight modification that we additionally need to maximize, marginalize or sample s_N, respectively for the three cases above, under the constraint k_1 ≤ s_N ≤ k_2. Simple proofs of the correctness of all the above statements can be found in Appendix A.

Furthermore, the k-segment inference problems associated with the special case of the event s_N > k can be equivalently reformulated by using a modified counting chain that absorbs when s_n = k + 1, i.e.

p(s | x) = δ_{s_1,1} ∏_{n=2}^{N} [ I(x_n ≠ x_{n−1} & s_{n−1} ≤ k) δ_{s_n, s_{n−1}+1} + (1 − I(x_n ≠ x_{n−1} & s_{n−1} ≤ k)) δ_{s_n, s_{n−1}} ],   (14)

where the indicator function I(x_n ≠ x_{n−1} & s_{n−1} ≤ k) is one only when both x_n ≠ x_{n−1} and s_{n−1} ≤ k are true. Notice that the above is an inhomogeneous chain having two modes: in the first, segment counting proceeds normally; in the second, counting stops once the absorbing state is visited. The k-segment problems for the event s_N > k are then solved by using the above chain and clamping s_N to the value k + 1.

The augmentation with counting variables results in a new HMM having the pair (s_n, x_n) as the new extended state variable. Given that s_N = k, so that any pair (s_n, x_n) can jointly take at most kM values, we can use the Viterbi algorithm to obtain the MAP of p(x | y, s_N = k), the forward pass of the F-B algorithm to obtain p(s_N = k, y), and the FF-BS algorithm to draw an independent sample from p(x | y, s_N = k). A naive implementation of these algorithms can be done in O(k²M²N) time. However, this complexity can be further reduced to O(kM²N) by taking into account the deterministic structure of the counting chain, as discussed in the following section.
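Since the counting chain is deterministic given x, it is easy to unroll directly; the short sketch below (our illustration, with hypothetical names) implements eq. (9) and, optionally, the absorbing variant of eq. (14), and can be used to sanity-check that s_N = c_x on concrete paths.

    import numpy as np

    def counting_chain(x, k_absorb=None):
        # Counting chain of eq. (9): s_1 = 1, and s_n increments whenever
        # x_n != x_{n-1}. If k_absorb is given, the chain absorbs at the
        # value k_absorb + 1 and counting stops, as in eq. (14).
        s = np.empty(len(x), dtype=int)
        s[0] = 1
        for n in range(1, len(x)):
            inc = x[n] != x[n - 1]
            if k_absorb is not None and s[n - 1] > k_absorb:
                inc = False                 # absorbing mode: counting has stopped
            s[n] = s[n - 1] + int(inc)
        return s

    # Example: counting_chain([0, 0, 1, 1, 2]) returns [1, 1, 2, 2, 3], so c_x = 3.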
We first describe the k -segment equivalent of the Viterbi algorithm for the optimal decodingproblem under k -segment constraints, i.e. for obtaining the MAP of p ( x | y , s N = k ) . This algorithm will be ableto solve at once all such problems from k = 1 up to a maximum k = k max by applying a single forward pass8or the maximum value k max which requires O ( k max M N ) operations. Then, by applying k max backtrackingoperations, each scaling as O ( N ) , we can obtain all k max optimal segmentations overall in O ( k max M N ) time.More precisely, the Viterbi algorithm applies a forward pass where recursively p ( y | x ) p ( x ) p ( s \ N , s N = k max | x ) is maximized with respect to the pair ( s n − , x n − ) for any value of the next pair ( s n , x n ) . This canbe implemented as a propagation of a message, which is a k max M dimensional vector, as follows. The messageis initialized to γ ( x , s ) = log p ( y | x ) + log p ( x ) + log p ( s | x ) , (15)which equals log p ( y | x ) + log p ( x ) when s = 1 and −∞ when s > . This message then is propagatedrecursively according to γ ( x n , s n ) = log p ( y n | x n ) + max x n − ,s n − [ γ ( x n − , s n − ) + log p ( x n | x n − ) p ( s n | s n − , x n , x n − )] . (16) δ ( x n , s n ) = ( x ∗ n − , s ∗ n − ) , (17)where the auxiliary message δ ( x n , s n ) simply stores the pair ( x n − , s n − ) that gives the maximum in (16) neededlater in backtracking. Naively, the n th recursive update can be implemented in O ( k max M ) time since each s n takes at most k max values and each x n takes M values. However, for any given configuration of ( s n , x n ) (out ofthe k max M possible), the permissible values for s n − are either s n − = s n when x n − = x n or s n − = s n − when x n − (cid:54) = x n . For all remaining configurations, log p ( s n | s n − , x n , x n − ) = −∞ , so that these configurationsneed not to be checked when maximizing over ( s n − , x n − ) for a certain pair ( s n , x n ) . Thus, the maximization in(16) can be done in M operations resulting in k max M operations for the whole n th update. Subsequently, the fullforward pass requires O ( k max M N ) operations. Once the forward pass is completed, we have the final message γ ( x N , s N ) (together with all auxiliary δ messages) from which we can obtain all k max optimal segmentationsusing backtracking as follows. For k = 1 , . . . , k max , we first compute x ∗ N = arg max x N γ ( x N , s N = k ) . (18)Then, starting from ( x ∗ N , s ∗ N = k ) we backtrack recursively according to ( x ∗ n − , s ∗ n − ) ← δ ( x ∗ n , s ∗ n ) that recov-ers the optimal hidden path x ∗ having exactly k segments. Each backtracking requires O ( N ) simple indexingoperations. Probability computation.
For the probability computation problem, we work similarly to the above Viterbialgorithm and we compute all joint densities p ( s N = k, y ) for k = 1 up to k max using the forward pass of theF-B algorithm applied to the augmented HMM. This recursively sums out each pair ( s n − , x n − ) for any valueof the next pair ( s n , x n ) , essentially passing through the so-called α message (Bishop, 2006). This message is a9 max M dimensional vector taking as initial value α ( x , s ) = p ( y | x ) p ( x ) p ( s | x ) , (19)which equals p ( y | x ) p ( x ) when s = 1 and otherwise. Then, the message is propagated according to thestandard α recursion α ( x n , s n ) = p ( y n | x n ) (cid:88) x n − ,s n − α ( x n − , s n − ) p ( x n | x n − ) p ( s n | s n − , x n , x n − ) . (20)This recursion scales as O ( k max M ) since the summation over ( x n − , s n − ) can be done in O ( M ) time by takingadvantage the structure of the counting conditional p ( s n | s n − , x n , x n − ) . As in any α recursion in a HMM, α ( x n , s n ) equals the density p ( x n , s n , y , . . . , y n ) so that the final message is α ( x N , s N ) = p ( x N , s N , y ) , fromwhich we can easily obtain p ( s N = k, y ) = (cid:88) x N α ( x N , s N = k ) , (21)for k = 1 , . . . , k max . Clearly, since the computation of a single recursion of the α message takes O ( k max M ) time, the above computations require overall O ( k max M N ) time. Given that the joint density p ( s N = k, y ) hasbeen obtained, we can compute exactly the posterior probability p ( s N = k | y ) by dividing with the normalizationconstant p ( y ) (i.e. the overall likelihood of the HMM) obtained from the standard forward pass.Similarly to the above we can also define the so called backwards or β message in the extended state-spaceHMM. Such message is useful when applying the EM algorithm for learning an HMM under k -segments con-straints and its computation will be described in section 5.2.1. Path sampling.
We now turn into the sampling problem where we wish to draw a path from the conditional p ( x | s N = k, y ) . Such a path can be obtained by sampling a pair ( x , s \ N ) from p ( x , s \ N | s N = k, y ) and thendiscarding s \ N . We apply the FF-BS algorithm that is based on the following decomposition p ( x , s \ N | s N = k, y ) = p ( x N | s N = k, y ) (cid:89) n = N − p ( x n , s n | x n +1 , s n +1 , y , . . . , y n ) , (22)where the index n in (cid:81) n = N − starts from N − and decrements down to one. Applying first the forwardpass described above we have the final message α ( x N , s N , y ) from which we can sample x N from p ( x N | s N = k, y ) ∝ α ( x N , s N = k, y ) . Then, recursively we go backwards and each time we sample ( x n , s n ) , given thealready sampled value of ( x n +1 , s n +1 ) , from p ( x n , s n | x n +1 , s n +1 , y , . . . , y n ) ∝ p ( x n +1 | x n ) p ( s n +1 | s n , x n +1 , x n ) α ( x n , s n ) , (23)10here the message α ( x n , s n ) = p ( x n , s n , y , . . . , y n ) is known from the forward pass. Each sampling steptakes O ( M ) time (again due to the deterministic nature of the conditional p ( s n +1 | s n , x n +1 , x n )) and the wholebackward sampling requires O ( M N ) time. If we wish to simultaneously sample from all conditional distributions p ( x | s N = k, y ) , with k = 1 , . . . , k max , we can do this using a single forward pass that scales as O ( k max M N ) and k max backward sampling iterations scaling as O ( k max M N ) , so the overall complexity is O ( k max M N ) .Furthermore, very simple and straightforward modifications of the above procedures can deal with the moregeneral constraint k ≤ s N ≤ k , where ≤ k < k ≤ N . For instance, if we wish to sample a pathfrom p ( x | k ≤ s N ≤ k , y ) , we need to first apply the forward pass for k max = k and then perform back-wards sampling exactly as described above with the only difference that initially we sample ( x N , s N ) from p ( x N , s N | k ≤ s N ≤ k , y ) ∝ α ( x N , s N , y ) I ( k ≤ s N ≤ k ) . Similarly, the k -segment inference prob-lems associated with the special event s N > k can be efficiently solved in O (( k + 1) M N ) time by using theabsorbing counting chain from (14) and then applying exactly the above algorithms by clamping s N = k + 1 .Finally, it is important to notice that running k -segment inference up to some k max and setting k max + 1 asthe absorbing state always gives a global summary of the posterior distribution that is guaranteed to be at least asinformative as the standard Viterbi MAP path. More precisely, the events c x = 1 , . . . , c x = k max and c x > k max comprise exclusive events that make up the whole set of paths for any value of k max . Therefore, the probabilities p ( c x = 1 | y ) , . . . , p ( c x = k max | y ) and p ( c x > k max | y ) , computed based on the forward pass in the augmentedHMM, always sum up to one, while the set of the corresponding k max + 1 optimal paths must include the standardViterbi MAP path, which will be either one of the paths from up to k max or the path with more segments than k max . We refer to the above combined sets of probabilities and optimal paths as the k max + 1 summary of theposterior distribution. Here, we give a graphical illustration of optimal decoding and path sampling under k -segment constraints. Forthis, we simulated a data sequence according to y n | x n , m , σ ∼ N ( m x n , σ ) , n = 1 , . . . 
, N = 1000, where the hidden sequence x = {x_n}_{n=1}^{N} was given by a Markov chain with M = 3 states, m = {− , − , } and σ = 0. . Using the simulated data, shown in the first row of Figure 3, we fitted a three-state HMM using the EM algorithm, which recovered parameter estimates very close to the ground-truth ones. We then computed the standard Viterbi path and obtained the optimal segmentations associated with the k_max + 1 summary using the k-segment equivalent of the Viterbi algorithm with k_max = 10. These are shown in the second row of Figure 3.

Each such path is displayed so that the three states are shown in different colors. On top, the Viterbi path is displayed, followed by the paths of the k_max + 1 summary. Notice that the paths of the latter are not nested: the (k + 1)-th segmentation might not be obtained by splitting into two segments a single segment from the k-th one; such a nested approach would be sub-optimal. Also notice that the final path, corresponding to the absorbing state (labelled with “>” in the figure), is precisely the standard Viterbi path. The third panel of Figure 3 illustrates path sampling under k-segment constraints using the FF-BS algorithm in the augmented HMM. In particular, samples are shown that are constrained to have exactly k = 7 segments.

We remark that, so far, our k-segment algorithms have been applied entirely retrospectively to an HMM fitted using a very standard and common approach in a simple but generic model set-up. The k-segment constraints are not involved in the model fitting process but are applied retrospectively to provide a rich exploration of the posterior sequence space in which qualitatively diverse segmentations are reported. For the expenditure of some computational time, the application of the k-segment generalizations of optimal decoding, probability computation and path sampling provides the HMM user with alternative summaries.

5 k-segment inference in practice

So far we have presented novel recursions for HMM inference that are applied assuming a fixed value for the parameters θ. In this section, we discuss how these recursions can be used in a general statistical estimation problem with HMMs. More precisely, we analyze the following two uses of k-segment constraints: i) the retrospective or a posteriori use, where the parameters of the HMM have been fitted beforehand and k-segment inference is used as a meta-analysis tool (section 5.1), and ii) the prospective or a priori use, where the constraints are introduced during model fitting so that they actively influence the model parameters (section 5.2).

5.1 A posteriori use of k-segment constraints

The most obvious practical use of k-segment inference is the following. Given an HMM with fixed parameters, e.g. estimated by ML training, apply the recursions of Section 4.3 to explore the posterior distribution over the hidden sequences. Some questions that arise are: i) what is the justification of this approach? ii) when is it suitable? and iii) how can it be extended to a Bayesian estimation setting?

To start with the first question, one way to formalize the a posteriori use of k-segment inference is via decision-theoretic arguments, where actions are taken a posteriori given that beliefs about a system are described by some fixed and known probability model.
For instance, the computation of the MAP hidden path x* of p(x | c_x = k, θ_ML, y), where θ_ML is some value obtained by ML training, can be considered as choosing the action z that maximizes the following expected utility:

x* = arg max_z Σ_x u(z, x; c_x = k) p(x | θ_ML, y),   (24)

where the 0-1 utility function u(z, x; c_x = k) takes the value one only when both z = x and x ∈ {x | c_x = k}. The introduction of constraints using the counting auxiliary variables allows for efficient maximization of the above expected utility using dynamic programming.

Figure 3: The panel in the first row shows the simulated data sequence. The panel in the second row displays the standard Viterbi path and optimal paths corresponding to the k_max + 1 summary (with k_max = 10) of the k-segment inference. Each path is depicted with the three states shown in different colors (the state with emission Gaussian density of ground-truth mean − is shown in red, the one with mean − in green and the third one with mean  in blue). The panel in the third row shows sample paths obtained by the FF-BS algorithm under the constraint that exactly 7 segments occur. The panel in the fourth row illustrates generalized counting (section 6.1), showing optimal paths having up to a given number of segments from the second state; for clarity, only the segments of the second state of interest are displayed with black solid lines. The final panel in the last row illustrates counting excursions (section 6.2), having as the null set the first and second states, while the third one is the single state in the abnormal set. Again for clarity, only the excursion segments (more precisely only the abnormal sub-segment, i.e. excluding the start and end points) are shown using black solid lines.

Regarding the second question, notice that the a posteriori use of k-segment constraints implies that the model parameters, which quantify the structure of the hidden states, are inferred independently of whatever k-segment constraints we may wish to consider. In practice, this is sensible when the hidden states are clearly interpretable classes, so that the emission densities model true class-conditional densities and the transition matrix represents meaningful spatial correlation between these classes.

In a Bayesian setting, a prior distribution p(θ) is placed on the HMM parameters and is then updated to a posterior distribution p(θ | y) by conditioning on the observed data. Following arguments similar to the ones above, under the a posteriori use of k-segment constraints we assume that the model parameters θ are conditionally independent of any constraint, say c_x = k, given the observed data y, i.e. p(θ | c_x = k, y) = p(θ | y). Again, such an assumption is sensible when the parameters describe the structure of true classes, which is independent of any constraint in the classification procedure. Assuming now that, computationally, the posterior distribution p(θ | y) is realized by a set of samples {θ^(t)}_{t=1}^{T}, obtained say by some MCMC algorithm, the three k-segment inference problems can be tackled as follows.
Firstly, the computation of p(c_x = k | y) can be done according to

p(c_x = k | y) = ∫ Σ_x I(c_x = k) p(x | θ, y) p(θ | y) dθ = ∫ p(c_x = k | θ, y) p(θ | y) dθ ≈ (1/T) Σ_{t=1}^{T} p(c_x = k | θ^(t), y),   (25)

where the Rao-Blackwellization when summing out x is carried out by the forward recursion of the F-B algorithm in the augmented HMM with the final counting variable clamped to the value k. More precisely, for each parameter sample θ^(t), this recursion computes the probability p(c_x = k, y | θ^(t)) in O(kM²N) time, from which we obtain p(c_x = k | θ^(t), y) by normalizing with p(y | θ^(t)), obtained in O(M²N) time using the standard forward pass in the unconstrained HMM. Similarly, to find the MAP of p(x | c_x = k, y) we rely on the expression

p(x | c_x = k, y) = ∫ p(x | θ, c_x = k, y) p(θ | c_x = k, y) dθ = ∫ p(x | θ, c_x = k, y) p(θ | y) dθ,   (26)

where we used the conditional independence assumption. Similarly to the unconstrained MAP of p(x | y) (Scott, 2002), the above cannot be maximized analytically with respect to x (notice also that applying Monte Carlo in the final integral will not help), and therefore we consider an approximate solution of the form

x̂ = arg max_x p(x | c_x = k, θ̂, y),   (27)

obtained by applying the Viterbi algorithm in the augmented HMM. Here, θ̂ is an approximation of the MAP of p(θ | y) computed from the samples, i.e. θ̂ = θ^(t*) with t* = arg max_{1≤t≤T} [ p(y | θ^(t)) p(θ^(t)) ]. Regarding path sampling, if we wish to draw a path x from p(x | c_x = k, y), then following eq. (26) we can choose a parameter sample θ^(t) uniformly from the set {θ^(t)}_{t=1}^{T} and then draw a path from p(x | θ^(t), c_x = k, y) using the FF-BS algorithm in the augmented HMM.

Finally, it is worth discussing how an HMM whose hidden states are true classes can be fitted to data in practice. It is important to note that this cannot be done in a completely unsupervised manner, since in such a case the classes are not identifiable. Clearly, we need to inject knowledge into the model about which state corresponds to which class. One extreme way to achieve this is to use completely supervised learning, so that the data consist of both the sequence y and the class-label sequence x, i.e. x is fully observed. ML training in this case simplifies significantly and the standard EM algorithm for the HMM is not needed. A similar, somewhat more realistic, scenario is to use semi-supervised learning, where the path x is partially observed. Such supervised or semi-supervised approaches can also be implemented in two phases, where in the first phase some of the class-conditional densities are estimated using labelled data and are then used as fixed emission densities for segmental classification within an HMM. We make use of this latter approach to train an HMM for text retrieval in Section 8.2. A different way to inject knowledge about the classes is via the prior p(θ), by following a fully Bayesian approach or penalized ML. For instance, the classes might have a natural ordering or pairwise proximity which could be taken into account by choosing a suitable prior p(θ).
This resolves the identifiability issues by assigning the hidden states to classes implicitly via the prior, while otherwise training can be performed in an unsupervised manner with the path x fully unobserved.

5.2 Learning with k-segment constraints

A second way to use k-segment constraints is to incorporate them in the model fitting process, so that the inferred parameters, and hence the structure of the hidden states, depend on these constraints. In contrast to the case of hidden states being classes, here the hidden states are latent variables that simply add flexibility in fitting the data, similarly, for instance, to latent components in unsupervised mixture density estimation. A suitable application is optimal compression of a data sequence by minimizing the reconstruction error, or equivalently maximizing model fit, between the observed sequence and the representation provided by the model. In such a case a k-segment constraint could represent a fixed budget in the reconstruction process, and clearly it would be more efficient to incorporate the constraint during model fitting.

To this end, we next describe the technical details of using k-segment constraints for learning an HMM, either by applying the EM algorithm (section 5.2.1) or by applying Bayesian approaches (section 5.2.2).

5.2.1 EM under k-segment constraints

Suppose that together with the data y we have an additional piece of information about the number of segments in the hidden path. Specifically, we shall assume that this number cannot be larger than k, while any other constraint, such as being exactly k, can be dealt with in a similar manner. By incorporating this constraint in the augmented HMM we obtain the following joint density:

p(y, x, s) = p(y | x) p(x) p(s_N ≤ k, s_{∖N} | x),   (28)

where the evidence s_N ≤ k reflects the information about the maximum number of segments allowed. Notice that incorporating the constraint simply amounts to constraining each counting variable s_n to take the values 1, ..., k.

We would now like to apply the EM algorithm to learn the parameters θ, for which we need to write down the auxiliary Q function and subsequently derive the E and M steps. Since the factor p(s_N ≤ k, s_{∖N} | x) does not contain learnable parameters, the auxiliary Q function can be written as

Q(θ; θ_old) = E_{p(x | s_N ≤ k, y, θ_old)}[ log p(y | x, θ) p(x | θ) ] + const,   (29)

where θ_old denotes the current parameter values. This function has exactly the same form as the auxiliary function in the unconstrained HMM, with the only difference that p(x | y, θ_old) is replaced by p(x | s_N ≤ k, y, θ_old).

The E step simplifies to computing all marginals p(x_n | s_N ≤ k, y, θ_old) and all pair-wise marginals p(x_{n−1}, x_n | s_N ≤ k, y, θ_old), which can be obtained by applying the F-B algorithm in the augmented HMM. Given the current θ_old (omitted next for brevity), this algorithm computes the α messages, as shown in section 4.3, and the backward or β messages, where the first β message is initialized to unity (i.e. β(x_N, s_N) = 1) and subsequent β messages are recursively obtained according to

β(x_n, s_n) = Σ_{x_{n+1}, s_{n+1}} β(x_{n+1}, s_{n+1}) p(y_{n+1} | x_{n+1}) p(x_{n+1} | x_n) p(s_{n+1} | s_n, x_{n+1}, x_n).   (30)

Given that each s_n takes k values, the β messages are computed in overall O(kM²N) time.
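As a concrete sketch of eq. (30) (ours, under the same conventions and caveats as the earlier code blocks, with the constraint encoded by letting each s_n take the values 1, ..., k), the β recursion might look as follows; combined with the α messages via eqs. (31)-(32) below, it yields the E-step marginals.

    import numpy as np
    from scipy.special import logsumexp

    def k_segment_backward(log_A, log_emit, k):
        # log_beta[n, s-1, m] = log beta(x_n = m, s_n = s); beta(x_N, s_N) = 1.
        N, M = log_emit.shape
        log_beta = np.zeros((N, k, M))
        for n in range(N - 2, -1, -1):
            for s in range(k):
                # x_{n+1} = x_n keeps s_{n+1} = s_n
                stay = log_beta[n + 1, s] + log_emit[n + 1] + np.diag(log_A)
                log_beta[n, s] = stay
                if s + 1 < k:
                    # x_{n+1} != x_n forces s_{n+1} = s_n + 1 (allowed only while <= k)
                    nxt = log_A + log_emit[n + 1][None, :] + log_beta[n + 1, s + 1][None, :]
                    np.fill_diagonal(nxt, -np.inf)
                    log_beta[n, s] = np.logaddexp(stay, logsumexp(nxt, axis=1))
        return log_beta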
Having stored all α and β messages, the desired marginals and pair-wise marginals are obtained from

p(x_n | s_N ≤ k, y) ∝ Σ_{s_n=1}^{k} α(x_n, s_n) β(x_n, s_n),   (31)
p(x_{n−1}, x_n | s_N ≤ k, y) ∝ Σ_{s_{n−1}, s_n=1}^{k} α(x_{n−1}, s_{n−1}) p(y_n | x_n) p(x_n | x_{n−1}) p(s_n | s_{n−1}, x_n, x_{n−1}) β(x_n, s_n),   (32)

which involve summing out the auxiliary counting variables. Given these quantities from the E step, the form of the M step remains the same as in unconstrained HMMs. Iterating between the above E and M steps leads to a local maximum of the likelihood p(c_x ≤ k, y).

Further, as mentioned earlier, EM algorithms under similar constraints can be derived in the same way. For instance, if we wish to apply the EM algorithm assuming the number of segments to be exactly equal to k, we need to clamp the final counting variable s_N to the value k.

To give an example of using the above learning algorithms, we consider six simulated sequences (including the one from Figure 3) and apply EM under a k-segment constraint of the form c_x ≤ k. The six panels in Figure 4 show the optimal k-segment paths (blue dashed lines) obtained after having optimized the HMM with the constrained EM described above. The corresponding retrospective optimal paths associated with the same constraint (red solid lines), i.e. the paths found after having optimized the HMM parameters using the standard unconstrained EM, are also displayed. Notice that each path is shown as a piece-wise constant function formed by the mean values of the Gaussian emission densities indicated by the states along the path. Each piece-wise constant function gives a reconstruction of the observed sequence. The mean squared errors (MSEs) for the retrospective and prospective reconstructions are shown inside parentheses in the legend of each panel. As expected, these MSEs indicate that incorporating a k-segment constraint during model fitting gives better reconstruction. In fact, initializing the parameters in the constrained EM from the final values obtained by the standard EM should always lead to a likelihood value which is higher than or equal to the corresponding value in the retrospective model.

Figure 4: Each of the six panels corresponds to a different simulated sequence. In each panel, the prospective (blue dashed line) and the retrospective (red solid line) optimal paths, under the same k-segment constraint, are displayed together with the data. The paths are shown as piece-wise constant functions formed by the means of the Gaussian emission densities. MSEs of the two reconstructions are given in the legends of the panels inside parentheses.

5.2.2 Bayesian learning under k-segment constraints

It is also possible to learn an HMM under k-segment constraints using Bayesian inference, and here we briefly outline how this can be done using Gibbs sampling. Consider a Bayesian HMM with a prior distribution p(θ) on the parameters and a joint density

p(y, x, s, θ) = p(y | x, θ) p(x | θ) p(θ) p(s_N ≤ k, s_{∖N} | x),   (33)

where, as in the previous section, we assume that the number of segments cannot exceed k. Notice that, while θ and s are conditionally independent given x, marginally they are dependent because of the constraint s_N ≤ k. We aim to compute the posterior distribution p(x, s, θ | s_N ≤ k, y) and, since this is too expensive, we resort to Gibbs-type sampling, where we iteratively sample the paths (x, s) from the conditional p(x, s | θ, s_N ≤ k, y) and the parameters θ from p(θ | x, y).
The first step corresponds precisely to path sampling under a k-segment constraint, as presented in section 4.3, using FF-BS in the augmented HMM. The second step requires simulating from the posterior conditional over the parameters, and clearly this will always be identical to the corresponding step when sampling in the unconstrained HMM. Also, when this step involves exact simulation from p(θ | x, y), the full algorithm is precisely Gibbs sampling; otherwise it is Metropolis-within-Gibbs, where θ is sampled from a proposal distribution and then accepted or rejected.

6 Extensions to the basic k-segment inference problems

In this section, we discuss extensions to the basic k-segment inference problems considered in section 4. Specifically, in section 6.1 we show how to solve generalized k-segment inference problems where we are interested in transitions of a particular type. In section 6.2, we extend the framework in a different direction by showing how to extract highly non-Markovian events along the HMM hidden path, which consist of excursions from null states to abnormal states.

6.1 Generalized k-segment inference

In several applications of HMMs, we may wish to solve more general k-segment inference problems associated with probability events involving certain types of segments and transitions. For example, we could have a natural sub-group of states A ⊂ {1, ..., M} and we would like to classify the observed sequence in terms of the occurrence or not of A, based on the computation of the associated posterior probability. This problem is an example of generalized k-segment inference, and in this section we show how this and related problems can be solved using auxiliary counting variables.

In a hidden path of an HMM (assuming an irreducible transition matrix) we can encounter M(M − 1) possible transitions. We can denote the set of all transitions by an M × M binary matrix C having ones everywhere and zeros on the diagonal, i.e. C(i, j) = I(i ≠ j). Such a matrix characterizes the standard k-segment inference problems described earlier, where all segments are of interest and all are counted. When we care about a subset of transitions, we can modify C so that C(i, j) = 1 if both i ≠ j and the transition i → j belongs to this subset. One way to visualise this is to think of colouring certain transitions in the HMM; we are then interested in counting segments generated from only those coloured transitions. Furthermore, in order to be flexible about the inclusion of the initial segment (which is not the result of a transition) in the probability event, we can define an M-dimensional binary vector µ indicating the subset of values of the initial state x_1 that are of interest. Then, analogously to equation (5), we can define

c_x = µ(x_1) + Σ_{n=2}^{N} C(x_{n−1}, x_n),   (34)

which denotes the number of segments along the hidden path x that are compatible with the constraints (µ, C). Subsequently, we can define probability events of the form c_x = k, k_1 ≤ c_x ≤ k_2, the special events c_x > k, etc., and formulate all associated k-segment inference problems as described in section 4.1.

To solve all these new problems, we again introduce auxiliary counting variables s and define a suitable counting Markov chain p(s | x) that generates deterministically the variables in s given the path x. This chain has the same structure as eq.
(9) but with the following modified conditionals:

p(s_1 | x_1) = µ(x_1) δ_{s_1,1} + (1 − µ(x_1)) δ_{s_1,0},   (35)
p(s_n | s_{n−1}, x_{n−1}, x_n) = C(x_{n−1}, x_n) δ_{s_n, s_{n−1}+1} + (1 − C(x_{n−1}, x_n)) δ_{s_n, s_{n−1}}.   (36)

Here, s_1 is set to one only for the subset of values of x_1 compatible with µ; otherwise it remains zero and the associated initial segments are not counted. The case of always counting the first segment corresponds to the special case where µ(x_1 = i) = 1 for each i, in which case p(s_1 | x_1) simplifies to δ_{s_1,1}. Similarly, the conditional p(s_n | s_{n−1}, x_{n−1}, x_n) is such that s_n increases only when C(x_{n−1}, x_n) = 1, so that new segments for which x_{n−1} ≠ x_n but C(x_{n−1}, x_n) = 0 are not counted. Clearly, counting every segment is obtained as the special case for which C(x_{n−1}, x_n) = I(x_{n−1} ≠ x_n).

All dynamic programming recursions of section 4.3 are applicable to the above generalized k-segment inference problems. Given that we solve these problems for k = 1 up to k = k_max, the time complexity of these algorithms is either O(k_max M²N), when the vector µ is equal to one everywhere, or O((k_max + 1)M²N), when some of the elements of µ are zero. In the latter case the term k_max + 1 appears simply because each counting variable s_n can take k_max + 1 values. Finally, the standard Viterbi, F-B and FF-BS algorithms for HMMs are obtained as special cases of the generalized k-segment recursions corresponding to setting µ and C to zero. To make this clear, notice that in such a case none of the segments along the path x is counted, so that k can take only the value zero, i.e. p(c_x = 0 | y) = 1 and p(x | c_x = 0, y) = p(x | y). Therefore, in this case the generalized k-segment recursions reduce to the standard HMM recursions having complexity O(M²N), which shows that the k-segment algorithms provide a more general inference methodology for HMMs.

Finally, to illustrate optimal decoding in a generalized k-segment setting, we consider again the simulated data of Figure 3. Suppose we would like to count segments from the second (green) state only. The constraints (µ, C) we need are µ = [0 1 0] and C = [0 1 0; 0 0 0; 0 1 0] (where ; separates the rows of C). The fourth row of Figure 3 shows several optimal paths having up to a given number of segments associated with counting the second state in the HMM.

6.2 Counting excursions

In several applications of HMMs where hidden states correspond to true states of nature, there is often a subset of states (in the simplest case just a single state) considered as normal or null states, while the remaining ones represent abnormalities. In such applications the practitioner might be interested in identifying excursions, where the hidden path moves from a null state to abnormal states and returns back to a null state. Extracting such events using a k-segment formulation is challenging because an excursion has a high-order Markov structure and therefore cannot be identified by just comparing two consecutive states. To this end, we next describe a generalization of our augmentation framework with counting variables that efficiently solves the excursion problem.

We first give a precise definition of an excursion. Suppose in an HMM the states are divided into two groups: the null set 𝒩 ⊂ {1, ..., M} and the abnormal set {1, ..., M} ∖ 𝒩. An excursion is any sub-path (x_i, x_{i+1}, ..., x_j), with j − i > 1, where x_i, x_j ∈ 𝒩 and the intermediate hidden variables (x_{i+1}, . . .
, x_{j−1}) take values from the abnormal set. In other words, an excursion is a sub-path having the start and end states clamped to null states and all intermediate variables clamped to abnormal values. Further, a special case of an excursion is a restricted excursion, where the intermediate sub-path (x_{i+1}, ..., x_{j−1}) is clamped to the same abnormal state.

To count excursions, we introduce a new sequence of auxiliary variables e = (e_1, ..., e_N) which signify the different phases of the excursion cycle. These variables unfold sequentially given the path x according to the following deterministic chain. Initially, e_1 is set to zero, so that p(e_1 | x_1) = δ_{e_1,0}, and then any subsequent e_n is drawn according to

p(e_n | e_{n−1}, x_{n−1}, x_n) =
  δ_{e_n,1}          if x_{n−1} ∈ 𝒩 and x_n ∉ 𝒩,
  δ_{e_n,0}          if x_{n−1} ∉ 𝒩 and x_n ∈ 𝒩,
  δ_{e_n,e_{n−1}}    otherwise.   (37)

Here, the first part of the conditional signals the initiation of an excursion, where e_n is set to one once a transition from a null state to an abnormal state occurs. The second part signifies the end of the excursion, where we return to a null state. The third part replicates the previous value and deals simultaneously with intermediate variables in the excursion sub-path, in which case e_n = e_{n−1} = 1, and with situations where x has started in an abnormal state and an excursion has not been initiated so far, in which case e_n = e_{n−1} = 0. The key to counting excursions is then to increment a counter any time there is a transition from one to zero in the path e, signifying the completion of an excursion. This is achieved using counting variables s generated given e, so that s_1 = 0 and any subsequent s_n is drawn from

p(s_n | s_{n−1}, e_n, e_{n−1}) = I(e_{n−1} = 1 & e_n = 0) δ_{s_n, s_{n−1}+1} + (1 − I(e_{n−1} = 1 & e_n = 0)) δ_{s_n, s_{n−1}}.   (38)

The initial HMM is augmented hierarchically with the above two layers of auxiliary variables, so that

p(y, x, e, s) = p(y | x) p(x) p(e | x) p(s | e),   (39)

is the joint density of the extended state-space HMM, and each triple (x_n, e_n, s_n) constitutes the new extended hidden state. Then, by working analogously to section 4.3, we can derive recursions for all types of k-segment inference problems associated with counting excursions. For instance, by specifying a maximum number of k_max excursions, we can introduce evidence into the final state s_N, such that s_N ≤ k_max, and then obtain all optimal paths containing k = 1 up to k = k_max excursions using the Viterbi algorithm. Since each variable e_n takes two possible values and s_n takes k_max + 1 possible values, the complexity of all dynamic programming algorithms will be O(2(k_max + 1)M²N), which is twice as slow as generalized k-segment inference.

Dealing with restricted excursions requires only a modification of the third, "otherwise", part in eq. (37). In particular, this part must now be modified so that, once an excursion cycle has been initiated, i.e. e_{n−1} = 1, we count any transition happening between abnormal states. More precisely, this part becomes

p(e_n | e_{n−1}, x_{n−1}, x_n) = I(e_{n−1} = 1 & x_{n−1} ≠ x_n) δ_{e_n, e_{n−1}+1} + (1 − I(e_{n−1} = 1 & x_{n−1} ≠ x_n)) δ_{e_n, e_{n−1}}.   (40)
Then, the problem of counting restricted excursions is solved by constraining all e_n variables to take only the two values {0, 1}, so that once an excursion cycle has been initiated we cannot transit to a different abnormal state. The time complexity of the dynamic programming recursions remains O(2(k_max + 1)M²N), as in the simple excursion case.

To illustrate the concept of extracting excursions we return to the dataset of Figure 3, where we would like to count excursions so that the first and second states comprise the null set and the remaining third state is taken as abnormal. The panel in the last row of Figure 3 shows several optimal paths found by counting excursions where, for clarity, only the excursion segments are displayed using black solid lines.
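Since the two auxiliary layers are deterministic given the path, they too can be unrolled directly; the sketch below (ours, with null_states a hypothetical argument holding the null set 𝒩) traces the phase variables e and the excursion counter s of eqs. (37)-(38) and can be used to sanity-check the counting on concrete paths.

    def excursion_counts(x, null_states):
        # e_n flags an ongoing excursion out of the null set (eq. 37);
        # s_n counts completed excursions, incrementing on 1 -> 0 drops in e (eq. 38).
        e, s = [0], [0]
        for n in range(1, len(x)):
            prev_null, cur_null = x[n - 1] in null_states, x[n] in null_states
            if prev_null and not cur_null:
                e.append(1)               # excursion starts: null -> abnormal
            elif not prev_null and cur_null:
                e.append(0)               # excursion ends: abnormal -> null
            else:
                e.append(e[-1])           # otherwise carry the phase forward
            s.append(s[-1] + int(e[-2] == 1 and e[-1] == 0))
        return e, s

    # Example: with null_states = {0, 1}, the path [0, 2, 2, 1, 2, 0] gives
    # e = [0, 1, 1, 0, 1, 0] and s = [0, 0, 0, 1, 1, 2], i.e. two excursions.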
7 Relation to other methods

Our method formalizes and generalizes the approach of Kohlmorgen (2003), who provided the first solution (as far as we are aware) for a specific form of the k-segment inference problem. Kohlmorgen (2003) recognized that an exact dynamic programming solution for the optimal decoding MAP estimation problem existed. In this article, we have placed that insightful observation within a novel counting Markov chain framework and showed that dynamic programming can also be used for marginalization and sampling of random variables, thus allowing, for instance, the computation of marginal probabilities over subsets of hidden paths using the forward recursion and the simulation of samples with exactly k segments using the FF-BS algorithm. The use of augmentation with auxiliary variables means that our framework is easily generalizable, as one can tackle different types of inference problems by constructing suitable counting chains. For instance, in Section 6.1, we took this forward by introducing and solving generalized k-segment inference problems in HMMs simply by generalizing the structure of the counting chain.

In addition, there are similarities in the way we construct counting chains with that of explicit duration HMMs (ED-HMMs) (Mitchell et al., 1995; Murphy, 2002; Yu, 2010), which consist of a modification of the original HMM where each hidden state emits not a single observation but a sequence of observations, whose number is chosen randomly from a distribution. This can be thought of as introducing duration or segment length constraints in the original HMM, so that the resulting model is a hidden semi-Markov model. From a technical point of view, the use of counting variables in ED-HMMs shares similarities with our methodology; however, the scope of our approach is very different. Specifically, in our case the counting variables are used to obtain probabilities and hidden paths in the original standard HMM, i.e. we do not alter the original HMM but instead perform exploratory inference in this model, while in the ED-HMM the counting variables define a new model (marginalizing out the s_n's from the joint prior distribution of the x_n's and s_n's gives a new semi-Markov model).

The task of k-segment inference in HMMs is strongly related to change point estimation. Traditional change point estimation algorithms, see e.g. Auger and Lawrence (1989), Fearnhead (2006) and Fearnhead and Liu (2007), allow the computation of optimal segmentations of sequential data having one up to k_max segments in O(k_max N²) time, i.e. these algorithms have quadratic complexity in the length of the data sequence. In contrast, our algorithms have linear complexity in the length of the sequence and therefore can be applied to massive datasets such as those encountered in bioinformatics. Recently, Killick et al. (2012) presented a linear complexity algorithm for change point estimation which, however, does not solve the optimal decoding problem in k-segment inference (i.e. it does not find all optimal segmentations having one up to a maximum number of segments) but instead discovers a single segmentation with an a priori unknown number of segments.

Yau and Holmes (2013) also developed a decision-theoretic approach for segmentation using hidden Markov models by defining a loss function on transitions and identifying a Viterbi-like dynamic programming algorithm to efficiently compute the hidden state sequence that minimizes the posterior expected loss.
Yau and Holmes (2013) also developed a decision theoretic approach for segmentation using hidden Markov models, by defining a loss function on transitions and identifying a Viterbi-like dynamic programming algorithm to efficiently compute the hidden state sequence that minimizes the posterior expected loss. The properties of the sequence predictions are modified through specification of the loss penalties on transitions, as opposed to altering the transition dynamics of the hidden Markov model. The k-segment algorithms developed here can also be applied to the method of Yau and Holmes (2013) to produce sequence predictions that minimize the posterior expected loss criterion subject to a desired k-segment constraint. The combination of the decision theoretic approach and the use of k-segments therefore provides a powerful tool for exploration of the complete state space of hidden Markov models.

8 Applications

Next, we demonstrate the utility of k-segment methods in two real-world applications. Specifically, in section 8.1 we consider the problem of copy number identification in cancer genomic sequences, while in section 8.2 we discuss an application to text retrieval and topic modelling.

8.1 Copy number identification in cancer genomes

In this section, we consider the problem of genome-wide classification of somatic DNA copy number alterations (SCNAs) in cancer. SCNAs are an important constituent of the mutational landscape in cancer and refer to numerical copy number changes that result in extra or lost copies of parts of the genome. In cancer, these alterations lead to the loss of tumor suppressor genes (which restrict tumorigenic activity) or the gain of oncogenes (which promote tumorigenic activity), and many copy number alterations of such genes have been identified as being associated with cancer (Beroukhim et al., 2010). Next generation sequencing and microarray technologies have allowed cancers to be probed on a genome-wide scale for SCNAs, and a number of statistical models have been developed to support the analysis of these data (Loo et al., 2010; Yau et al., 2010; Chen et al., 2011; Carter et al., 2012; Yau, 2013). A particularly popular class of these models has utilised hidden Markov models to model microarray intensities or sequencing reads as observations of a hidden (discrete) state process that corresponds to the unobserved copy number sequence.

Specifically, a single nucleotide polymorphism (SNP) microarray dataset consists of a sequence of bivariate measurements {y_i}_{i=1}^n at n SNP locations spread across the genome. The first dimension of the measurements, sometimes known as the
Log R Ratio, consists of intensity measurements whose magnitude is proportional to the total copy number at that particular genomic location. In human genome analysis, the Log R Ratio values are typically normalized such that values approximately equal to zero correspond to a DNA copy number of two, since we typically inherit one copy of every gene from each parent. The second dimension, sometimes known as the
B allele frequency, measures the relative contribution of one of the parental alleles to the overall signal, which can allow us to determine which parental allele is lost or gained.

In Yau et al. (2010), these data sequences are modelled using a Bayesian hierarchical model specified via the following relationships:

y_i | x_i, m, Σ, ν ∼ Student(m_{x_i}, Σ_{x_i}, ν), i = 1, . . . , n,  (41)
x_i | x_{i−1} ∼ Multinomial(π_{x_{i−1}}),  (42)

where x_i ∈ {1, . . . , S} denotes the copy number state at the i-th location, {m_j, Σ_j} denotes the expected signal measurements and noise covariance for the j-th copy number state, and π is a transition matrix such that π_j corresponds to the transition probabilities out of the j-th copy number state. Note, we present only an abbreviated and simplified version of the complete model by Yau et al. (2010) here; for full details, see the original reference.

Table 1 shows an example set of copy number states. Yau et al. (2010) model transitions between super-states as relatively unlikely events, leading to a "sticky" HMM that produces relatively few super-state segments. Dynamics within super-states are modelled via an embedded Markov chain that approximates the patterns of genotypes observed in real data. The primary scientific interest is in the switching between super-states, but it is necessary to fully model the complete genotypes in order to achieve this.

Copy number state   Total copy number   LOH   Genotype   Super-state
1                   0                   N/A   N/A        1
2                   1                   0     A          2
3                   1                   0     B          2
4                   2                   0     AA         3
5                   2                   0     AB         3
6                   2                   0     BB         3
7                   3                   0     AAA        4
8                   3                   0     AAB        4
9                   3                   0     ABB        4
10                  3                   0     BBB        4
11                  2                   1     AA         5
12                  2                   1     BB         5

Table 1: Example copy number states. Each copy number state is associated with a total copy number and genotype which tells us the number of each parental allele (A/B). The super-states correspond to subsets of copy number states with identical total copy number and/or loss of heterozygosity (LOH) status.

Full Bayesian posterior inference for this type of model is prohibited by the size of the datasets, where n is typically very large. Yau et al. (2010) perform model fitting using expectation-maximization to compute MAP parameter estimates and condition on these to obtain MAP segmentations using the Viterbi algorithm. The forward-backward algorithm can also be applied to obtain site-wise posterior probabilities of state occupation. Figure 5 shows an example copy number analysis of chromosome 1 of the colorectal cancer cell line SW837 from a SNP microarray dataset using the OncoSNP software from Yau et al. (2010). The chromosome exhibits a number of copy number alterations leading to changes in the pattern of the Log R Ratio and B Allele Frequency along the chromosome. Genomic regions with non-normal total copy number (different from two) can be identified from the Viterbi segmentations and the site-wise posterior probabilities.

Figure 5: Copy number analysis of the colorectal cancer cell line SW837 (Chromosome 1) using site-wise marginal posterior probabilities of a copy number aberration from the forward-backward algorithm, the Viterbi algorithm (black lines indicate detected regions of aberrant copy number), and k-segment analysis for different fixed super-state segment numbers.

The application of our k-segment methods can augment these standard analyses with additional exploratory information. Figure 5 shows segmentations conditional on different fixed super-state segment numbers obtained using k-segments. Here, we have used the ability to count certain transitions in k-segment inference to good effect, counting only transitions between super-states and excluding transitions between copy number states within super-states. This means the k-th segmentation represents the most probable copy number segmentation that involves k different super-state segments, as opposed to k segments defined on the original state space, which would include transitions between states within super-states.
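As a hedged illustration of this super-state counting (our sketch, not the OncoSNP code), one can build a 0/1 transition-flag matrix C from the super-state labels of Table 1, so that the counting chain increments only when a transition crosses a super-state boundary; the forward pass mirrors the simple sketch above but adds C[i, j] to the count instead of incrementing on every state change.

```python
import numpy as np

# Super-state labels for the 12 copy number states of Table 1 (1-based in the
# table, 0-based here); states sharing a label belong to the same super-state.
super_state = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5]) - 1

# C[i, j] = 1 when a transition i -> j crosses a super-state boundary and
# should therefore increment the segment counter; 0 otherwise.
C = (super_state[:, None] != super_state[None, :]).astype(int)

def generalized_forward(log_pi, log_A, log_emit, k_max, C, mu):
    """Forward pass where the counter increments only for flagged events (sketch).

    C:  (M, M) 0/1 matrix of transition flags.
    mu: (M,) 0/1 vector; mu[m] = 1 if starting in state m opens a counted segment.
    Returns a vector whose k-th entry is log p(c_x = k, y), k = 0, ..., k_max.
    """
    N, M = log_emit.shape
    alpha = np.full((M, k_max + 1), -np.inf)
    for m in range(M):
        alpha[m, mu[m]] = log_pi[m] + log_emit[0, m]
    for n in range(1, N):
        new = np.full((M, k_max + 1), -np.inf)
        for j in range(M):
            for c in range(k_max + 1):
                terms = [alpha[i, c - C[i, j]] + log_A[i, j]
                         for i in range(M) if c - C[i, j] >= 0]
                if terms:
                    new[j, c] = np.logaddexp.reduce(terms) + log_emit[n, j]
        alpha = new
    return np.logaddexp.reduce(alpha, axis=0)
```

For the copy number application, mu would be a vector of ones, since the first observation always opens a super-state segment.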
These segmentations allow the exploration of alternative segmentations that differ from the MAP solution and yet retain segmental constraints that cannot be observed from the site-wise marginal probabilities.

8.2 Text retrieval and topic modelling

In this section, we apply k-segment inference to an information retrieval task where the objective is to process long documents and extract segments referring to certain topics. For this purpose, we define a hidden Markov topic model, such as those proposed in (Gruber et al., 2007; Andrews and Vigliocco, 2010), that builds upon popular topic models such as probabilistic latent semantic indexing (Hofmann, 2001) and latent Dirichlet allocation (Blei et al., 2003). These latter models represent a document as a set of words y_d = (y_{d,1}, . . . , y_{d,N_d}) where each y_{d,n} ∈ {1, . . . , V} points into a vocabulary consisting of V keywords. The words are generated exchangeably from a mixture distribution where the mixing proportions are document-specific parameters while the mixture components are multinomial distributions over the vocabulary. The latter distributions aim at capturing semantic topics, such as politics, sports or food. Learning in these models is typically carried out in an unsupervised manner by using a corpus of several documents and assuming a common pool of topics, so that each document, through the document-specific mixing proportions, can contain only a subset of topics. The basic topic models ignore word ordering and do not model spatial correlation, according to which nearby words are more likely to belong to the same topic. Motivated by this limitation, HMM-based extensions have been developed in (Gruber et al., 2007; Andrews and Vigliocco, 2010) that assume that the latent topics of words in ordered text follow a Markov chain. Next, we construct a similar HMM and use it to solve an information retrieval task based on semi-supervised learning.

Assume an unknown-content (test) document d which we would like to scan through in order to retrieve segments referring to certain topics. As before, the document is represented by a set of words y_d = (y_{d,1}, . . . , y_{d,N_d}) which are ordered according to their appearance in the text and assumed to have been generated from an HMM. Specifically, we assume there is a path x_d = (x_{d,1}, . . . , x_{d,N_d}) such that each x_{d,n} ∈ {1, . . . , M} indicates the hidden topic of word y_{d,n}. Further, the set of these topics is divided into the relevant topics and the irrelevant topics, with the relevant topics being the ones from which we wish to extract text segments, estimate posterior probabilities of appearance and so on, while the irrelevant topics are unknown and document-specific topics of no interest to us. Without loss of generality, and to simplify our presentation, we shall assume M = 2 so that there is one relevant and one irrelevant topic. The relevant topic is described by multinomial parameters φ_r = (φ_{r,1}, . . . , φ_{r,V}) so that the emission distribution that generates a word y_{d,n} is such that

p(y_{d,n} | x_{d,n} = 1) = φ_{r, y_{d,n}}.  (43)

φ_r is assumed to have been estimated by supervised learning using fully labeled documents according to

φ_{r,v} = (n_v + 1) / (n + V), v = 1, . . . , V,  (44)
where n_v is the number of times the v-th word appears in the labeled data and n is the total number of words in these data. Notice that the above is simply the Bayesian mean estimate under a uniform Dirichlet prior over φ_r. Similarly, the emission distribution for the irrelevant topic, i.e. p(y_{d,n} | x_{d,n} = 2), is described by the parameter vector φ_d = (φ_{d,1}, . . . , φ_{d,V}), which is a document-specific parameter to be estimated. Furthermore, the prior distribution π_d and transition matrix A_d of the HMM are also document-specific parameters, and the full set (φ_d, π_d, A_d) can be estimated via the EM algorithm while φ_r is kept fixed. In practice, we also place a conjugate Dirichlet prior over all unknown parameters so that EM finds MAP point estimates similar to those of eq. (44).

In the remainder of this section we demonstrate the above system using a freely available text corpus taken from the University of Oxford electronic library. Specifically, we collected a set of doctoral theses on several subjects such as History, Social Sciences, Philosophy, Law, Politics, Literature and Economics. The topic of Economics was considered to be the relevant topic while all remaining topics were taken as irrelevant. Ten of the documents were classified (according to the library database system) as being about Economics, while the remaining theses were scattered across the other topics. Each document d was represented by a sequence of words from a dictionary of size V = 1260, which was defined separately by choosing all distinct words from a large set of freely accessible Wikipedia articles; following standard practice in topic modelling, very common words with no semantic content, such as 'the', 'of' and 'and', were excluded from the vocabulary. The multinomial parameters for the relevant topic of Economics were obtained by supervised learning using counts of words from a small set of Wikipedia entries such as Economics, Finance and Investment. Having preprocessed each document as above, we then considered two types of prediction task: i) classification and ii) detection, which we describe next in turn.
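Eq. (44) is the standard add-one (Laplace) smoothed multinomial estimate; a minimal sketch (the helper name and toy data are ours):

```python
import numpy as np

def estimate_phi_r(labeled_word_ids, V):
    """Posterior mean of eq. (44) under a uniform Dirichlet prior:
    phi_{r,v} = (n_v + 1) / (n + V)."""
    counts = np.bincount(labeled_word_ids, minlength=V)
    return (counts + 1) / (counts.sum() + V)

# Toy check with V = 4 and labeled words [0, 2, 2, 1]:
phi_r = estimate_phi_r(np.array([0, 2, 2, 1]), V=4)
# -> [0.25, 0.25, 0.375, 0.125]; every word keeps nonzero probability
```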
Classification.
For the classification task, the objective was to predict, in a test document, the presence or absence of at least one segment from the topic Economics. The test documents consisted of the theses originally annotated as non-Economics documents, which were randomly perturbed in order to create a ground-truth dataset of known classification, as explained in Appendix B. Given this test dataset, the objective was to construct a binary classification system and classify each of the documents as relevant, i.e. as containing at least one text segment about Economics, or as irrelevant. Each test document was processed separately by applying the EM algorithm discussed earlier. Then, to achieve probabilistic classification, we use a counting variable c_x that increments only when a segment from the relevant topic occurs. Notice that this requires the use of generalized counting, as described in section 6.1, with suitable values for the constraints µ and C: assuming that the first hidden state in the HMM corresponds to the relevant topic and the second one to the irrelevant topic, µ = [1 0], while the first row of C is [0 0] and the second row is [1 0]. The posterior probability p(c_x > 0 | y_d) is then computed using the forward pass in the augmented HMM, which subsequently provides a probabilistic classifier. Using different thresholds on the classification probability, we can obtain different decision systems of varying false positive and true positive rates, as shown by the ROC curve in Figure 7. In contrast, if we were to perform classification using the Viterbi MAP path, we could only obtain a single decision system that classifies documents as relevant or irrelevant based upon whether a segment from the relevant topic occurs or not in the Viterbi path. Such a system gives a single value for the true positive and false positive rate, as shown in Figure 7. Clearly, the ability of k-segment methods to compute non-trivial posterior probabilities allows for more flexible uses of HMMs when building decision making systems.
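The classification probability can be read off from a single forward pass; a toy end-to-end sketch, reusing the generalized_forward function from the copy number sketch above (all inputs below are illustrative, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-topic HMM over a vocabulary of 3 words (state 0 = relevant topic).
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1],
                [0.1, 0.9]])
phi = np.array([[0.7, 0.2, 0.1],   # relevant-topic word distribution
                [0.1, 0.3, 0.6]])  # irrelevant-topic word distribution
words = rng.integers(0, 3, size=50)
log_emit = np.log(phi[:, words].T)  # (N, M) log emission probabilities

# Counting set-up from the text: mu = [1 0], C = [[0 0], [1 0]], so the
# counter increments exactly when a relevant-topic segment starts.
mu = np.array([1, 0])
C = np.array([[0, 0],
              [1, 0]])

log_joint = generalized_forward(log_pi, log_A, log_emit, k_max=25, C=C, mu=mu)
log_evidence = np.logaddexp.reduce(log_joint)
p_relevant = 1.0 - np.exp(log_joint[0] - log_evidence)  # p(c_x > 0 | y_d)
```

Here k_max = 25 is the largest count attainable for a 50-word toy document, so the evidence term sums over all possible segment counts.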
Detection. We now turn to the second task, which is concerned with the detection of individual segments within a document that belong to the relevant topic. We adopt a standard information retrieval setup referred to as top-k retrieval (Büttcher et al., 2010). This is the task of retrieving the k patterns (typically full documents) that are most relevant to a given query among a large set of other possible patterns. Our specific top-k retrieval task is to extract the top-k text segments within the same large document, and to achieve this we shall use the hidden Markov topic model. Also, to account for documents that may contain fewer than k segments from the relevant topic, we relax the constraint of retrieving exactly k segments to the softer constraint of retrieving at most k segments. It is worth noticing that there is a similarity between k-segment problems in HMMs and top-k retrieval, since both involve inference under counting constraints. More precisely, k-segment methods can naturally tackle the previous top-k retrieval task by applying optimal decoding, under the constraint c_x ≤ k, that finds the optimal hidden path containing at most k text segments associated with the relevant topic. Next, in order to evaluate such a system on test documents with known ground-truth segments, we randomly perturbed the test documents as explained in Appendix B.

To measure performance, we make use of a popular evaluation measure from the visual object detection literature. More precisely, detecting segments of certain topics in documents is similar to detecting instances of object categories in natural images. There, the detection problem is to predict a bounding box that locates an instance of an object category within the image. The well-established evaluation measure, used in the PASCAL visual object recognition challenge (Everingham et al., 2010), is the overlap area ratio. Adopting this in our case, we have that for a predicted segment S_p = [i_l, i_r], where i_l and i_r are the segment start and end locations within the test document, the overlap ratio is defined by

r = |S_p ∩ S_gt| / |S_p ∪ S_gt|,  (45)

where S_gt is the ground-truth segment, S_p ∩ S_gt is the intersection of the predicted and the ground-truth segments and S_p ∪ S_gt is their union. Clearly, r ∈ [0, 1] and values close to zero indicate poor detection while values close to one indicate strong detection. We consider as correct detections all cases where r exceeds the threshold of 0.5; for an illustrative example of a correct detection see Figure 6. Also, to get a total document-specific performance that is normalized with respect to k, we average according to

per-document detection rate = (1/k) Σ_{i=1}^{k_p} I(r_i > 0.5),  (46)

where k_p ≤ k is the number of predicted segments. From this we can obtain a mean detection rate that gives the overall performance on the whole test dataset. Figure 8(a) shows mean detection rates for several top-k systems with varying values of k. Confidence intervals were obtained by repeating the experiment a number of times, so that in each repeat a random test dataset of documents was created using bootstrapping together with the standard randomization involved in the segment insertion (see Appendix B).
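To make eqs. (45) and (46) concrete, a small sketch of the two evaluation quantities, with segments given as inclusive (start, end) word-index pairs (the matching convention, scoring each prediction against its best-overlapping ground-truth segment, is our assumption for illustration):

```python
def overlap_ratio(pred, gt):
    """Eq. (45): |intersection| / |union| for inclusive (start, end) segments."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union

def per_document_detection_rate(predicted, ground_truth, k, threshold=0.5):
    """Eq. (46): the fraction, out of the k requested segments, of predictions
    whose best overlap ratio against the ground truth exceeds the threshold."""
    hits = sum(
        max((overlap_ratio(p, g) for g in ground_truth), default=0.0) > threshold
        for p in predicted)
    return hits / k

# e.g. prediction (10, 29) vs ground truth (15, 34): r = 15/25 = 0.6 -> correct
```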
Furthermore, it is interesting to compare the k-segment based method with a system constructed using the standard Viterbi MAP path in the HMM. Standard Viterbi gives a single path that will contain an a priori unknown number of segments from the relevant topic. Thus, to get top-k retrieval systems (for different values of k), we can rank all relevant-topic segments with respect to their length, so that the top-1 retrieval system simply outputs the longest segment in the list, the top-2 retrieval system outputs the two longest segments, and so forth. Using the same bootstrapped repeats we also evaluated the standard Viterbi system and for each repeat we recorded the difference in mean detection rates (k-segment rate minus standard Viterbi rate). Figure 8(b) displays the mean of these differences together with confidence intervals for several values of k. Clearly, there is a certain range of k values where the k-segment method outperforms the standard Viterbi method. Moreover, as k increases, the k-segment constraint c_x ≤ k becomes weaker and the corresponding optimal paths converge to the standard Viterbi MAP paths, which explains the fact that the performance of the two methods becomes identical for large k.

Figure 6: An example of detection of a text segment from the relevant topic of Economics. The box on top displays a piece of text from a test document before we randomly inserted segments from the topic of Economics (see Appendix B). The middle box displays the same text after having randomly inserted (replacing the original piece of text) a segment from the topic of Economics, which is shown in red. The box at the bottom shows in blue the segment predicted as belonging to the relevant topic. In this case, the predicted segment was classified as a correct detection since it overlaps more than 50% with the ground-truth segment shown in the middle box.
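For concreteness, a sketch of the Viterbi baseline described above: extract contiguous runs of the relevant state from a decoded path and keep the k longest (the function name and conventions are ours):

```python
def top_k_viterbi_segments(path, relevant_state, k):
    """Extract contiguous runs of `relevant_state` from a decoded path and
    return the k longest as inclusive (start, end) pairs."""
    segments, start = [], None
    for n, s in enumerate(path):
        if s == relevant_state and start is None:
            start = n                        # a relevant segment opens here
        elif s != relevant_state and start is not None:
            segments.append((start, n - 1))  # the segment closed at n - 1
            start = None
    if start is not None:                    # segment runs to the end of the path
        segments.append((start, len(path) - 1))
    segments.sort(key=lambda seg: seg[1] - seg[0], reverse=True)
    return segments[:k]

# e.g. top_k_viterbi_segments([1, 0, 0, 1, 0, 0, 0, 1], 0, 2) -> [(4, 6), (1, 2)]
```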
To summarise, both text retrieval tasks presented above indicate that k-segment inference allows for more flexible use of HMMs, providing us with new options when building classification and decision making systems.

9 Discussion

HMMs can allow for highly efficient analysis of large quantities of sequence data. However, existing methods for reporting posterior summaries from HMMs, such as the Viterbi MAP path and the marginal probabilities, are rather blunt. Here, we demonstrated how the use of auxiliary counting variables allows for computationally efficient exploration of the model fit using k-segment algorithms. It is important to note that the techniques we developed are generic and the augmentation scheme can be applied either a posteriori, to HMMs already fitted to data, or a priori, during model fitting. In cancer genomics, k-segment inference can be a useful exploratory tool that can help researchers to analyze genomic sequences at different resolutions or target events of particular types, thus facilitating the process of gaining novel insight into structural rearrangements in cancer genomes. For other types of application, appearing for instance in machine learning and pattern recognition, the proposed methods allow one to build more flexible HMM-based classification and decision making systems, as we have demonstrated using the text retrieval example.

Figure 7: The blue line shows the ROC (receiver operating characteristic) curve for the k-segment method that classifies documents using the topic-occurrence posterior probability p(c_x > 0 | y). The black star shows the pair of true and false positive rates corresponding to the classifier obtained by the standard Viterbi MAP path.

Figure 8: Panel (a) shows mean detection rates (solid central line) for top-k systems, obtained by varying k, together with confidence intervals computed over repeated experiments. Panel (b) shows mean differences in detection rates between the k-segment method and standard Viterbi, together with confidence intervals obtained from the repeats.

Regarding future work, an interesting research direction is to exploit the ability of k-segment inference to efficiently explore the HMM posterior distribution in order to provide input into constructing meta statistical models. For instance, the ability to obtain alternative explanations of the same data sequence, which may have high utility to the research scientist but occur with very low probability, could allow the practitioner to re-rank different explanations based on their expertise and subsequently provide feedback into the model that can be used for supervised re-training. A second future direction is to exploit the fact that training an HMM under a k-segment constraint often gives sparse transition matrices. This is not surprising, since a bound on the number of segments essentially limits the number of transitions along the hidden path, which subsequently can result in many inferred zeros in the transition matrix. Such a sparsity property could be useful when we would like to perform state selection in large state-space HMMs.

To conclude, as data sets become larger and models more complex we expect to see an increasing need for computationally efficient methods for posterior model exploration and statistical inference under constraints. In this paper, we have presented one such approach that significantly expands the statistical algorithmic toolbox of HMMs.

Acknowledgements

MT and CH were supported by a Wellcome Trust Healthcare Innovation Challenge Fund award (Ref No. HICF-1009-026) and the Lincoln College Michael Zilkha Fund. MT was also supported by "Research Funding at AUEB for Excellence and Extroversion, Action 1: 2012-2014". CY was funded by a UK Medical Research Council Specialist Training Fellowship in Biomedical Informatics (Ref No. G0701810) and a New Investigator Research Grant (Ref No. MR/L001411/1).
References
Andrews, M. and Vigliocco, G. (2010). The hidden Markov topic model: A probabilistic model of semantic representation. Topics in Cognitive Science, 2:101–113.

Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology, 51(1):39–54.

Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.

Beroukhim, R., Mermel, C. H., Porter, D., Wei, G., Raychaudhuri, S., Donovan, J., Barretina, J., Boehm, J. S., Dobson, J., Urashima, M., Henry, K. T. M., Pinchback, R. M., Ligon, A. H., Cho, Y.-J., Haery, L., Greulich, H., Reich, M., Winckler, W., Lawrence, M. S., Weir, B. A., Tanaka, K. E., Chiang, D. Y., Bass, A. J., Loo, A., Hoffman, C., Prensner, J., Liefeld, T., Gao, Q., Yecies, D., Signoretti, S., Maher, E., Kaye, F. J., Sasaki, H., Tepper, J. E., Fletcher, J. A., Tabernero, J., Baselga, J., Tsao, M.-S., Demichelis, F., Rubin, M. A., Janne, P. A., Daly, M. J., Nucera, C., Levine, R. L., Ebert, B. L., Gabriel, S., Rustgi, A. K., Antonescu, C. R., Ladanyi, M., Letai, A., Garraway, L. A., Loda, M., Beer, D. G., True, L. D., Okamoto, A., Pomeroy, S. L., Singer, S., Golub, T. R., Lander, E. S., Getz, G., Sellers, W. R., and Meyerson, M. (2010). The landscape of somatic copy-number alteration across human cancers. Nature, 463(7283):899–905.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Büttcher, S., Clarke, C. L. A., and Cormack, G. V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. MIT Press.

Cappé, O., Moulines, E., and Rydén, T. (2005). Inference in Hidden Markov Models (Springer Series in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Carter, S. L., Cibulskis, K., Helman, E., McKenna, A., Shen, H., Zack, T., Laird, P. W., Onofrio, R. C., Winckler, W., Weir, B. A., Beroukhim, R., Pellman, D., Levine, D. A., Lander, E. S., Meyerson, M., and Getz, G. (2012). Absolute quantification of somatic DNA alterations in human cancer. Nature Biotechnology, 30(5):413–421.

Chen, H., Xing, H., and Zhang, N. R. (2011). Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS Computational Biology, 7(1):e1001060.

Cowell, R. G., Dawid, P. A., Lauritzen, S. L., and Spiegelhalter, D. J. (2003). Probabilistic Networks and Expert Systems (Information Science and Statistics). Springer, New York.

Crouse, M. S., Nowak, R. D., and Baraniuk, R. G. (1998). Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing, 46(4):886–902.

Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9):755–763.

Everingham, M., Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.

Fearnhead, P. (2006). Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2):203–213.

Fearnhead, P. and Liu, Z. (2007). Online inference for multiple changepoint problems. Journal of the Royal Statistical Society, Series B, 69:589–605.

Gruber, A., Weiss, Y., and Rosen-Zvi, M. (2007). Hidden topic Markov models. Journal of Machine Learning Research - Proceedings Track, 2:163–170.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1/2):177–196.

Juang, B. H. and Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics, 33(3):251–272.

Killick, R., Fearnhead, P., and Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598.

Kohlmorgen, J. (2003). On optimal segmentation of sequential data. In Neural Networks for Signal Processing, 2003 (NNSP'03), IEEE 13th Workshop on, pages 449–458. IEEE.

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Li, N. and Stephens, M. (2003). Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213–2233.

Loo, P. V., Nordgard, S. H., Lingjærde, O. C., Russnes, H. G., Rye, I. H., Sun, W., Weigman, V. J., Marynen, P., Zetterberg, A., Naume, B., Perou, C. M., Børresen-Dale, A.-L., and Kristensen, V. N. (2010). Allele-specific copy number analysis of tumors. Proceedings of the National Academy of Sciences of the USA, 107(39):16910–16915.

Mitchell, C. D., Harper, M. P., and Jamieson, L. H. (1995). On the complexity of explicit duration HMMs. IEEE Transactions on Speech and Audio Processing, 3(3):213–217.

Murphy, K. (2002). Hidden semi-Markov models (HSMMs). Technical report, University of California, Berkeley.

Nilsson, D. and Goldberger, J. (2001). Sequentially finding the N-best list in hidden Markov models. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI) 2001.

Olshen, A. B., Venkatraman, E. S., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5(4):557–572.

Paas, L. J., Vermunt, J. K., and Bijmolt, T. H. (2007). Discrete time, discrete state latent Markov modelling for assessing and predicting household acquisitions of financial products. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(4):955–974.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, volume 77, pages 257–286.

Schwartz, R. and Chow, Y. L. (1990). The N-best algorithms: an efficient and exact procedure for finding the N most likely sentence hypotheses. In ICASSP-90, International Conference on Acoustics, Speech, and Signal Processing, 1:81–84.

Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97:337–351.

Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2):260–269.

Yau, C. (2013). OncoSNP-SEQ: a statistical approach for the identification of somatic copy number alterations from next-generation sequencing of cancer genomes. Bioinformatics, 29(19):2482–2484.

Yau, C. and Holmes, C. (2013). A decision theoretic approach for segmental classification using hidden Markov models. Annals of Applied Statistics. In press.

Yau, C., Mouradov, D., Jorissen, R. N., Colella, S., Mirza, G., Steers, G., Harris, A., Ragoussis, J., Sieber, O., and Holmes, C. C. (2010). A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biology, 11(9):R92.

Yu, S.-Z. (2010). Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243.
A Proofs for the auxiliary variable reformulation of k-segment problems

Here, we provide proofs for the correctness of the reformulation of the three k-segment inference problems presented in section 4.2. Firstly, we will show that p(x | y, s_N = k), computed via the augmented HMM, is equal to p(x | y, c_x = k) given by eq. (7). We have that p(x | y, s_N = k) is defined by

p(x | y, s_N = k) ∝ p(y | x) p(x) Σ_{s\N} p(s\N, s_N = k | x).  (47)

What we need to show is that Σ_{s\N} p(s\N, s_N = k | x) is equal to the indicator function I(c_x = k). Since p(s\N, s_N = k | x) is a deterministic distribution, given that x has k segments there will be a unique s*\N such that p(s*\N, s_N = k | x) = 1, and zero for all remaining s\N's. If x does not contain k segments, p(s\N, s_N = k | x) = 0 for any s\N. Thus, when x has k segments Σ_{s\N} p(s\N, s_N = k | x) = 1, otherwise Σ_{s\N} p(s\N, s_N = k | x) = 0. Therefore, Σ_{s\N} p(s\N, s_N = k | x) = I(c_x = k) for any x, from which we conclude that p(x | y, s_N = k) reduces to the definition of p(x | y, c_x = k) from eq. (7). From that, we can immediately obtain that the term that normalizes the right hand side of (47), i.e. the quantity p(s_N = k, y), is equal to p(c_x = k, y). This completes the proof regarding the correctness of the probability computation.

Based on the above, we can also conclude that the initial optimal decoding solution x* is the MAP of p(x | y, s_N = k), i.e.

x* = arg max_x p(y | x) p(x) Σ_{s\N} p(s\N, s_N = k | x).  (48)

Given now that p(s | x) is a deterministic distribution, the sum operation can be replaced by a max operation, so that

x* = arg max_x [ p(y | x) p(x) max_{s\N} p(s\N, s_N = k | x) ],  (49)

or

(x*, s*\N) = arg max_{x, s\N} [ p(y | x) p(x) p(s\N, s_N = k | x) ],  (50)

which shows that the reformulated optimal decoding problem is equivalent to the initial one.

Finally, regarding path sampling, the FF-BS algorithm in the augmented HMM gives a pair of paths (x̃, s̃\N) that jointly comprise an independent sample from p(x, s\N | s_N = k, y). Thus, x̃ alone is an independent sample from p(x | s_N = k, y).
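The FF-BS step above can be sketched as follows (our illustration, for the simple counting chain where every state change increments the count): run the forward pass storing all messages over (state, count) pairs, sample the final state conditional on s_N = k, then sample backwards, decrementing the count whenever a state change is drawn.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_path_with_k_segments(log_pi, log_A, log_emit, k):
    """FF-BS sketch: draw a path from p(x | y, c_x = k), for k >= 1."""
    N, M = log_emit.shape
    alpha = np.full((N, M, k + 1), -np.inf)
    alpha[0, :, 1] = log_pi + log_emit[0]
    for n in range(1, N):
        for j in range(M):
            for c in range(1, k + 1):
                terms = [alpha[n - 1, j, c] + log_A[j, j]]
                terms += [alpha[n - 1, i, c - 1] + log_A[i, j]
                          for i in range(M) if i != j]
                alpha[n, j, c] = np.logaddexp.reduce(terms) + log_emit[n, j]

    def draw(logp):  # sample an index from unnormalized log probabilities
        p = np.exp(logp - np.logaddexp.reduce(logp))
        return rng.choice(len(p), p=p / p.sum())

    path = np.empty(N, dtype=int)
    c = k
    path[N - 1] = draw(alpha[N - 1, :, k])    # condition on exactly k segments
    for n in range(N - 2, -1, -1):
        j = path[n + 1]
        # backward kernel: the same state keeps the count; a change used c - 1
        logp = np.array([alpha[n, i, c if i == j else c - 1] + log_A[i, j]
                         for i in range(M)])
        path[n] = draw(logp)
        if path[n] != j:
            c -= 1
    return path
```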
B Simulation of ground-truth datasets for the text retrieval example

For the classification task we created the ground-truth dataset as follows. For each test document sequence y_d we decided, with a fixed probability, to insert a number g_d (with g_d ∼ Pois(2)) of segments from the subject of Economics; these segments had random lengths of at least 10 words and were placed at random locations within the sequence y_d, replacing the original text, with the only constraint that they did not overlap with one another. Each such set of g_d artificially inserted segments was also randomly selected from the theses in Economics, by first picking a thesis and then selecting g_d non-overlapping segments within that thesis text sequence. The whole procedure created a new dataset of documents such that a subset of them contained segments from the relevant topic and the remaining ones did not.

For the detection task we worked similarly to the classification task discussed earlier. In particular, we again randomly perturbed the test documents and inserted a number of g_d ∼ Pois(10) segments from the subject of Economics into each of them. The insertion of segments was done exactly as described above, with the only difference being that now we insert segments into all documents and their number can be much larger since g_d ∼ Pois(10).
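A possible implementation of this perturbation, sketched under stated assumptions (documents are integer word-id arrays; the retry loop and the length range beyond the minimum of 10 words are our choices, since the exact upper bound is not restated here):

```python
import numpy as np

rng = np.random.default_rng(0)

def insert_segments(doc, source, n_segments, min_len=10, max_len=50):
    """Overwrite the test document with randomly chosen, non-overlapping source
    segments; return the perturbed document and ground-truth (start, end) pairs."""
    doc = doc.copy()
    ground_truth = []
    for _ in range(n_segments):
        length = int(rng.integers(min_len, max_len + 1))
        for _ in range(100):  # retry until a non-overlapping location is found
            start = int(rng.integers(0, len(doc) - length))
            if all(start + length <= s or start > e for s, e in ground_truth):
                break
        else:
            continue          # could not place this segment; skip it
        src = int(rng.integers(0, len(source) - length))
        doc[start:start + length] = source[src:src + length]
        ground_truth.append((start, start + length - 1))
    return doc, ground_truth

# e.g. the detection task inserts g_d ~ Pois(10) segments per document:
doc = rng.integers(0, 1260, size=5000)      # a test document as word ids
thesis = rng.integers(0, 1260, size=20000)  # an 'Economics' source thesis
perturbed, gt = insert_segments(doc, thesis, int(rng.poisson(10)))
```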