Temporal Parallelization of Inference in Hidden Markov Models
Syeda Sakira Hassan, Simo Särkkä, Senior Member, IEEE, and Ángel F. García-Fernández
Abstract—This paper presents algorithms for the parallelization of inference in hidden Markov models (HMMs). In particular, we propose a parallel backward-forward type of filtering and smoothing algorithm as well as a parallel Viterbi-type maximum-a-posteriori (MAP) algorithm. We define associative elements and operators to pose these inference problems as parallel-prefix-sum computations in sum-product and max-product algorithms and parallelize them using parallel-scan algorithms. The advantage of the proposed algorithms is that they are computationally efficient in HMM inference problems with long time horizons. We empirically compare the performance of the proposed methods to classical methods on a highly parallel graphical processing unit (GPU).
Index Terms—parallel forward-backward algorithm, parallel sum-product algorithm, parallel max-product algorithm, parallel Viterbi algorithm
I. INTRODUCTION

HIDDEN Markov models (HMMs) have gained a lot of attention in the last two decades due to their simplicity and broad range of applications [1]–[5]. Successful real-world application areas of HMMs include speech recognition, decoding of convolutional codes, target tracking and localization, facial expression recognition, gene prediction, gesture recognition, musical composition, and bioinformatics [1], [6]–[11]. An HMM is a statistical model that provides a simple and flexible framework that can express conditional independence and joint distributions using graph-like structures. An HMM can be thought of as a specific form of a probabilistic graphical model consisting of two components: a structural component that defines the edges and a parametric component that encodes the potentials associated with the edges in the graph. A graphical representation of an HMM is shown in Fig. 1. An HMM is a doubly stochastic process where the underlying stochastic process (light gray-colored nodes) can only be observed through another stochastic process (dark gray-colored nodes) [1], [12]. One of the primary tasks in graphical models is the computation of marginal and conditional probability distributions, the inference task, which is a computationally intensive procedure.

S. Hassan and S. Särkkä are with the Department of Electrical Engineering and Automation, Aalto University, 02150 Espoo, Finland (emails: syeda.s.hassan@aalto.fi, simo.sarkka@aalto.fi). Ángel F. García-Fernández is with the Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool L69 3GJ, United Kingdom (email: [email protected]). He is also with the ARIES Research Centre, Universidad Antonio de Nebrija, Madrid, Spain.

Fig. 1. A hidden Markov model (HMM). The observed nodes are shaded in dark gray whereas unobserved nodes are in light gray.

If the HMM has $D$ states and the sequence length is $T$, then there are $D^T$ possible state sequences. For the classical algorithms, such as the forward-backward and
Viterbi algorithms, the time complexity is $O(D^2 T)$ to find the marginals or, alternatively, the most likely sequence, known as the Viterbi path [1], [6], [13]. Some methods to speed up these algorithms, also using parallelization, have already appeared in the literature. For instance, Lifshits et al. [14] used a compression scheme that exploits the repetitive patterns in the observed sequences to speed up the inference task. Sand et al. [15] developed a parallel forward algorithm using SIMD instructions and multiple cores. Nielsen and Sand [16] presented a parallel reduction of the forward algorithm. Chatterjee and Russell [17] used the temporal abstraction concept from dynamic programming to speed up the Viterbi algorithm. Using accelerated hardware, a tile-based Viterbi algorithm was proposed by [18], where the matrix multiplication was done in parallel. Maleki et al. [19] proposed an optimized method to solve a particular class of dynamic programming problems using a tropical semiring. They showed that this can be used to optimize the Viterbi algorithm. Nevertheless, parallelization has not been fully investigated in HMM inference tasks. In this paper, we develop novel algorithms which achieve $O(\log T)$ span complexity in the forward-backward and Viterbi algorithms.

Over the last half a century, numerous parallel algorithms have been developed, and these algorithms can take advantage of the acceleration power of specialized hardware accelerators such as general purpose graphical processing units (GPUs) [20], neural processing units (NPUs) [21], and tensor processing units (TPUs) [22]. These accelerators allow for parallel computation on large amounts of data, making computation faster and cheaper, and therefore economically viable. Among these accelerators, GPUs are the most widely used. The GPU architecture allows for massive parallelization of general-purpose algorithms [23]. Table I summarizes a few recent works on the inference tasks of HMMs, which were implemented on GPU. However, these works were optimized particularly for speech recognition tasks and did not explore the full capabilities of parallelism. In order to utilize parallelism, the sequential algorithms need to be adapted to primitive operations that allow for parallel execution on parallel platforms. For example, the all-prefix-sums, also known as the parallel-scan algorithm [24], [25], can be used to run computations in parallel provided that they can be formulated in terms of binary associative operations.
TABLE I
PREVIOUS WORKS ON HMM INFERENCE TASKS USING GPU. THE NOTATION '-' MEANS THAT THE VALUE WAS NOT MENTIONED IN THE ORIGINAL PAPER.

Algorithm               States   Observations   Speed improvement
Forward-backward [26]   8        200            3.5x
Forward [27]            512      3-10           880x
Baum-Welch [27]         512      3-10           180x
Viterbi [28]            -        2000-3000      3x
Forward [29]            4000     1000           4x
Baum-Welch [29]         4000     1000           65x
Recently, the parallel-scan algorithm has been used in parallel Bayesian filtering and smoothing algorithms to speed up the sequential computations via parallelization [30], [31]. Although this framework is applicable to HMMs as well, the difference to our proposed framework is that here we formulate the inference problem in terms of potentials. This results in a different backward pass in the algorithm, which corresponds to two-filter smoother formulations, whereas the formulation in [30], [31] is of Rauch–Tung–Striebel smoother type [32]. In the present article, we also consider a parallel Viterbi-type MAP algorithm, which has not been explored before.

The main contribution of this paper is to present a parallel framework to perform the HMM inference tasks efficiently by using the parallel-scan algorithm. In particular, we formulate the sequential operations of sum-product and max-product algorithms as binary associative operations. This formulation allows us to construct parallel versions of the forward-backward algorithm and the Viterbi algorithm with $O(\log T)$ span complexity. For the latter algorithm, we propose two alternative parallelizations: the path-based and the forward-backward based formulations. We also empirically evaluate the computational speed advantage of these methods on a GPU platform.

The structure of the paper is the following. In Section II we define the HMM inference problem as an inference problem in a probabilistic graph. Then, in Section III we review the classical sum-product and forward-backward algorithms, introduce the elements and associative operations for parallel operations, and formulate the parallel-scan algorithm on these elements and operators. Next, in Section IV we show that we can apply the parallel-scan concept to the max-product algorithm, which results in a parallel version of the Viterbi algorithm. In Section V we discuss extensions and generalizations of the methodology, and in Section VI we present experimental results for both parallel sum-product and parallel max-product based algorithms. Section VII concludes the article, and additional proofs are provided in the Appendix.

II. PROBLEM FORMULATION
Assume that we have $T$ random variables $x_{1:T} = (x_1, \ldots, x_T)$ which correspond to nodes in a probabilistic graph and $T$ potential functions $\psi_1, \ldots, \psi_T$ which define the cliques of the graph [33]–[35]. We also assume that each variable $x_t$ takes values in the set $\mathcal{X} = \{1, \ldots, D\}$. Each potential is a function of a subset of the $x_t$'s defined by multi-indices $\alpha_1, \ldots, \alpha_T$ with elements $\alpha_t = (\alpha_{t,1}, \ldots, \alpha_{t,|\alpha_t|})$. We denote the subsets as $x_{\alpha_t}$. The joint distribution of the random variables now has the representation
$$p(x_{1:T}) = \frac{1}{Z} \prod_{t=1}^{T} \psi_t(x_{\alpha_t}), \quad (1)$$
where $Z = \sum_{x_{1:T}} \prod_{t=1}^{T} \psi_t(x_{\alpha_t})$ is the normalization constant, which is also called the partition function [33]. Typical inference tasks are now either the computation of all the marginals
$$p(x_k) = \frac{1}{Z} \sum_{x_{1:T} \setminus x_k} \prod_{t=1}^{T} \psi_t(x_{\alpha_t}), \quad (2)$$
where the summation is performed over all the variables except $x_k$, or the computation of the similar quantity for the maximum:
$$p^*(x_k) = \max_{x_{1:T} \setminus x_k} \prod_{t=1}^{T} \psi_t(x_{\alpha_t}). \quad (3)$$

Let us now consider inference problems in a hidden Markov model (HMM) of the form
$$x_k \sim p(x_k \mid x_{k-1}), \quad (4a)$$
$$y_k \sim p(y_k \mid x_k). \quad (4b)$$
Here, we assume that the sequence $x_1, \ldots, x_T$ is Markovian with transition probabilities $p(x_k \mid x_{k-1})$, and that the observations $y_k$ are conditionally independent given $x_k$ with likelihoods $p(y_k \mid x_k)$. Furthermore, we have a prior $x_1 \sim p(x_1)$. We are interested in computing the smoothing posterior distributions of the form $p(x_k \mid y_1, \ldots, y_T)$ for all $k = 1, \ldots, T$ as well as in computing the maximum (i.e., the Viterbi path) of the posterior of $x_{1:T}$. These can be computed with sequential algorithms in linear computational time [13], [32], [36]. However, our aim is to improve this computational time by using parallelization in the temporal domain.

We can express the inference problem in (1) by defining
$$\psi_1(x_1) = p(y_1 \mid x_1)\, p(x_1), \quad (5a)$$
$$\psi_k(x_{k-1}, x_k) = p(y_k \mid x_k)\, p(x_k \mid x_{k-1}), \quad \text{for } k > 1. \quad (5b)$$
With these definitions the joint distribution (1) takes the form
$$p(x_{1:T}) = \frac{1}{Z} \psi_1(x_1) \prod_{t=2}^{T} \psi_t(x_{t-1}, x_t). \quad (6)$$
Thus in this Markovian case, each potential is a function of the neighboring nodes $x_{k-1}, x_k$, that is, we have $\alpha_k = (k-1, k)$ for $k > 1$ and $\alpha_1 = (1)$.

Typically, the inference task on the HMM is either to compute the marginals $p(x_k \mid y_1, \ldots, y_T)$ or to find the most probable sequence $x^*_{1:T}$, the Viterbi path. In the potential function formulation, the smoothing distribution is given by (2) and the Viterbi (MAP) path is given by the maximum of (6), or equivalently of (3). Both of these correspond to the following kind of general operation on (6):
$$F(x_k) = \frac{1}{Z} \mathrm{OP}_k\!\left( \psi_1(x_1) \prod_{t=2}^{T} \psi_t(x_{t-1}, x_t) \right), \quad k = 1, \ldots, T, \quad (7)$$
where $\mathrm{OP}_k(\cdot)$ is a sequence of operations applied to all elements but $x_k$, such as $\sum_{x_{1:T} \setminus x_k}$ or $\max_{x_{1:T} \setminus x_k}$, resulting in a function of $x_k$. One way to compute the operations in (7) is to use sum-product or max-product algorithms [37]–[39], for the summation and maximization operations, respectively. However, in the following sections we show that since the sum and max operations are associative, we can use parallel-scan algorithms for parallelizing these computations. The same principle would also apply to any other binary associative operations, but here we specifically concentrate on these two operations.
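To make the potential formulation concrete, the following minimal NumPy sketch builds the potentials (5) for a discrete HMM with $D$ states; the names `prior`, `Pi`, `O`, and `ys` are our own illustrative choices, not notation from the paper:

```python
import numpy as np

def make_potentials(prior, Pi, O, ys):
    # prior : (D,) array, p(x_1)
    # Pi    : (D, D) array, Pi[i, j] = p(x_k = j | x_{k-1} = i)
    # O     : (D, M) array, O[j, y] = p(y_k = y | x_k = j)
    # ys    : length-T sequence of observed symbols
    psi1 = O[:, ys[0]] * prior                       # psi_1(x_1), cf. (5a)
    psis = [Pi * O[:, y][None, :] for y in ys[1:]]   # psi_k(x_{k-1}, x_k), cf. (5b)
    return psi1, psis
```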
III. PARALLEL SUM-PRODUCT ALGORITHMS FOR HMMS

In this section, we first review the classical sum-product algorithm. Then, we show how to decompose the steps of this algorithm in terms of binary associative operations. We then show how to compute these elements in parallel with $O(\log T)$ span complexity.

A. Classical sum-product algorithm
The sum-product algorithm can be used to find the marginal distributions of all variables in a graph [40], [41]. We first define the forward potential as
$$\psi_{f,k}(x_k) = \sum_{x_{1:k-1}} \psi_1(x_1) \prod_{t=2}^{k} \psi_{t-1,t}(x_{t-1}, x_t) \quad (8)$$
and the backward potential as
$$\psi^b_{k,T}(x_k) = \sum_{x_{k+1:T}} \prod_{t=k}^{T-1} \psi_{t,t+1}(x_t, x_{t+1}). \quad (9)$$
We can now express the marginal distribution $p(x_k)$, which corresponds to the summation operation in (7), as the normalized product of the forward and backward potentials:
$$p(x_k) = \frac{1}{Z_k} \psi_{f,k}(x_k)\, \psi^b_{k,T}(x_k), \quad (10)$$
where $Z_k = \sum_{x_k} \psi_{f,k}(x_k)\, \psi^b_{k,T}(x_k)$. If we want to compute all the marginals $p(x_1), p(x_2), \ldots, p(x_T)$, then we need to compute all the terms $\psi_{f,k}(x_k)$ and $\psi^b_{k,T}(x_k)$ and combine them. It turns out that we can compute these forward and backward potentials in $O(D^2 T)$ steps using the algorithm in Fig. 2, which is an instance of the sum-product algorithm.

It should be noted that the belief propagation algorithm [37], operating on a Bayesian network, corresponds to the sum-product algorithm in a factor graph with a similar factorization. The forward algorithm of the HMM is equivalent to filtering, whereas the backward algorithm corresponds to the backward pass in two-filter smoothing [32].
Require: The potentials $\psi_k(\cdot)$ for $k = 1, \ldots, T$.
Ensure: The forward and backward potentials $\psi_{f,k}(x_k)$ and $\psi^b_{k,T}(x_k)$ for $k = 1, \ldots, T$.
// Forward pass:
$\psi_{f,1}(x_1) = \psi_1(x_1)$    ▷ initialization
for $k \leftarrow 2$ to $T$ do    ▷ sequentially
    $\psi_{f,k}(x_k) = \sum_{x_{k-1}} \psi_{f,k-1}(x_{k-1})\, \psi_{k-1,k}(x_{k-1}, x_k)$
end for
// Backward pass:
$\psi^b_{T,T}(x_T) = 1$    ▷ initialization
for $k \leftarrow T-1$ to $1$ do    ▷ sequentially
    $\psi^b_{k,T}(x_k) = \sum_{x_{k+1}} \psi_{k,k+1}(x_k, x_{k+1})\, \psi^b_{k+1,T}(x_{k+1})$
end for
return $\psi_{f,k}(x_k)$ and $\psi^b_{k,T}(x_k)$ for $k = 1, \ldots, T$

Fig. 2. The classical sum-product algorithm for computing the forward and backward potentials.
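For concreteness, a direct NumPy transcription of Fig. 2 could look as follows (a sequential reference sketch; `psi1` and `psis` are the potentials in (5), represented as a $(D,)$ vector and a list of $D \times D$ tables):

```python
import numpy as np

def classical_sum_product(psi1, psis):
    T = len(psis) + 1
    fwd = [psi1]
    for k in range(1, T):                    # forward pass, O(D^2 T) in total
        fwd.append(fwd[-1] @ psis[k - 1])    # sums over x_{k-1}
    bwd = [None] * T
    bwd[-1] = np.ones_like(psi1)             # psi^b_{T,T}(x_T) = 1
    for k in range(T - 2, -1, -1):           # backward pass
        bwd[k] = psis[k] @ bwd[k + 1]        # sums over x_{k+1}
    return fwd, bwd                          # psi_{f,k} and psi^b_{k,T}
```

The marginal (10) is then `fwd[k] * bwd[k]`, normalized to sum to one.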
B. Sum-product algorithm in terms of associative operations
We can formulate the sum-product algorithm in a more abstract form by defining general elements $a_{i:k}$ and considering a binary associative operator $\otimes$ such that [25]
$$a_{i:k} = a_{i:j} \otimes a_{j:k}, \quad \text{for } i < j < k. \quad (11)$$
As is shown in the following, the computation of the forward and backward terms now reduces to computing $a_{0:k}$ and $a_{k:T+1}$.

Definition 1.
We define the element $a_{i:k}$ recursively as follows. We have
$$a_{0:1} = \psi_1(x_1), \quad a_{k-1:k} = \psi_k(x_{k-1}, x_k), \quad a_{T:T+1} = 1. \quad (12)$$
For notational convenience and to enhance readability, we also define
$$\psi_{0,1}(x_0, x_1) \triangleq \psi_1(x_1), \quad \psi_{k-1,k}(x_{k-1}, x_k) \triangleq \psi_k(x_{k-1}, x_k), \quad \psi_{T,T+1}(x_T, x_{T+1}) \triangleq 1. \quad (13)$$
Now, given two elements $a_{i:j}$ and $a_{j:k}$, the binary associative operator $\otimes$ for the forward-backward algorithm in the HMM for $0 \le i < j < k$ is
$$a_{i:j} \otimes a_{j:k} = \sum_{x_j} \psi_{i,j}(x_i, x_j)\, \psi_{j,k}(x_j, x_k), \quad (14)$$
which also implies the following representation for a general element:
$$a_{i:k} = \psi_{i,k}(x_i, x_k). \quad (15)$$
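Since a general element $a_{i:k} = \psi_{i,k}(x_i, x_k)$ is a $D \times D$ table, the operator (14) is simply a matrix product, and the associativity stated in Lemma 1 below can also be checked numerically. A small sketch (our own encoding; $\psi_1$ would be padded to a $D \times D$ table with identical rows to match (13)):

```python
import numpy as np

def otimes(a, b):
    # (a (x) b)[x_i, x_k] = sum_{x_j} psi_{i,j}(x_i, x_j) psi_{j,k}(x_j, x_k),
    # cf. (14): an ordinary matrix product of the two D x D tables.
    return a @ b

# numerical check of associativity (Lemma 1)
rng = np.random.default_rng(0)
a, b, c = rng.random((3, 4, 4))
assert np.allclose(otimes(otimes(a, b), c), otimes(a, otimes(b, c)))
```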
Lemma 1. The operator $\otimes$ is associative.

Proof. The proof of the associative property of the forward-backward operator $\otimes$ can be found in Appendix A.

Now, we state the theorems that express the forward-backward algorithm in terms of the associative operations.

Theorem 1.
The forward potential is given as $a_{0:k} = \psi_{f,k}(x_k)$ for $k > 0$.

Proof.
Since the operator is associative, it is enough to prove by induction that
$$a_{0:k} = a_{0:1} \otimes a_{1:2} \otimes \cdots \otimes a_{k-2:k-1} \otimes a_{k-1:k} = \psi_{f,k}(x_k). \quad (16)$$
Relation (16) holds for $k = 1$ by the definition of $a_{0:1}$ in (12), that is, $a_{0:1} = \psi_1(x_1)$. Then, we assume that
$$a_{0:k-1} = a_{0:1} \otimes a_{1:2} \otimes \cdots \otimes a_{k-2:k-1} = \psi_{f,k-1}(x_{k-1}) \quad (17)$$
holds. We need to prove that relation (16) holds for $k$. We apply the binary operation $\otimes$ with $a_{k-1:k}$ to the left-hand side of (17):
$$\begin{aligned}
a_{0:k-1} \otimes a_{k-1:k} &= \sum_{x_{k-1}} \psi_{f,k-1}(x_{k-1})\, \psi_{k-1,k}(x_{k-1}, x_k) \\
&= \sum_{x_{k-1}} \left[ \sum_{x_{1:k-2}} \psi_1(x_1) \prod_{t=2}^{k-1} \psi_{t-1,t}(x_{t-1}, x_t) \right] \psi_{k-1,k}(x_{k-1}, x_k) \\
&= \sum_{x_{1:k-1}} \psi_1(x_1) \prod_{t=2}^{k} \psi_{t-1,t}(x_{t-1}, x_t) \\
&= \psi_{f,k}(x_k).
\end{aligned}$$
This concludes the proof.
Theorem 2.
The backward potential is given as $a_{k:T+1} = \psi^b_{k,T}(x_k)$ for $k > 0$.

Proof.
We prove by induction that
$$a_{k:T+1} = a_{k:k+1} \otimes a_{k+1:k+2} \otimes \cdots \otimes a_{T-1:T} \otimes a_{T:T+1} = \psi^b_{k,T}(x_k) \quad (18)$$
holds. Relation (18) holds for $k = T$ by the definition of $a_{T:T+1} = 1$ in (12). Then, we assume that
$$a_{k+1:T+1} = a_{k+1:k+2} \otimes a_{k+2:k+3} \otimes \cdots \otimes a_{T:T+1} = \psi^b_{k+1,T}(x_{k+1}) \quad (19)$$
holds. We need to prove that relation (18) holds for $k$. We apply the binary operation $\otimes$ with $a_{k:k+1}$ to the left-hand side of (19) to yield
$$\begin{aligned}
a_{k:k+1} \otimes a_{k+1:T+1} &= \sum_{x_{k+1}} \psi_{k,k+1}(x_k, x_{k+1})\, \psi^b_{k+1,T}(x_{k+1}) \\
&= \sum_{x_{k+1}} \psi_{k,k+1}(x_k, x_{k+1}) \sum_{x_{k+2:T}} \prod_{t=k+1}^{T-1} \psi_{t,t+1}(x_t, x_{t+1}) \\
&= \sum_{x_{k+1:T}} \prod_{t=k}^{T-1} \psi_{t,t+1}(x_t, x_{t+1}) \\
&= \psi^b_{k,T}(x_k).
\end{aligned}$$
This concludes the proof.

Now, we can express the marginal in (10) using $a_{0:k}$ and $a_{k:T+1}$ as
$$p(x_k) = \frac{1}{Z_k}\, a_{0:k}\, a_{k:T+1}. \quad (20)$$
In the proofs above we have implicitly used the sum-product algorithm in Fig. 2 to derive the results, which can be done in $O(T D^2)$ steps. However, because the operator is associative, we can reorder the computations by recombining the operations, which is the key to parallelization. This is discussed next.

C. Parallelization of the sum-product algorithm
In this section, the aim is to turn the computation of the forward and backward potentials into a parallel operation. One of the fundamental building blocks of parallel algorithms is the prefix-sums operation, which can be computed with the parallel-scan algorithm [24], [25]. This is defined in the following.
Definition 2.
For a sequence of $T$ elements $(a_1, a_2, \ldots, a_T)$ and a binary associative operator $\circ$, the all-prefix-sums operation computes the sequence
$$(a_1,\; a_1 \circ a_2,\; \ldots,\; a_1 \circ a_2 \circ \cdots \circ a_T). \quad (21)$$

Similarly to the above, we can define reversed prefix-sums as follows.

Definition 3.
For a sequence of $T$ elements $(a_1, a_2, \ldots, a_T)$ and a binary associative operator $\circ$, the reversed all-prefix-sums operation computes the sequence
$$(a_1 \circ a_2 \circ \cdots \circ a_T,\; \ldots,\; a_{T-1} \circ a_T,\; a_T). \quad (22)$$

It turns out that the all-prefix-sums above can be computed with the parallel-scan algorithm [24], [25] in $O(\log T)$ span time. A pseudocode for the algorithm is shown in Fig. 3. The algorithm computes the non-reversed all-prefix-sums only, but by reversing the input before and after the call (and reversing the operation inside the algorithm), it can also be used to compute the reversed prefix sums.
Require: The elements $a_k$ for $k = 1, \ldots, T$ and the operator $\otimes$.
Ensure: The prefix sums in $a_k$ for $k = 1, \ldots, T$.
// Save the input:
for $i \leftarrow 1$ to $T$ do    ▷ Compute in parallel
    $b_i \leftarrow a_i$
end for
// Up sweep:
for $d \leftarrow 0$ to $\log_2 T - 1$ do
    for $i \leftarrow 0$ to $T-1$ by $2^{d+1}$ do    ▷ Compute in parallel
        $j \leftarrow i + 2^d$
        $k \leftarrow i + 2^{d+1}$
        $a_k \leftarrow a_j \otimes a_k$
    end for
end for
$a_T \leftarrow$ the identity element of $\otimes$
// Down sweep:
for $d \leftarrow \log_2 T - 1$ to $0$ do
    for $i \leftarrow 0$ to $T-1$ by $2^{d+1}$ do    ▷ Compute in parallel
        $j \leftarrow i + 2^d$
        $k \leftarrow i + 2^{d+1}$
        $t \leftarrow a_j$
        $a_j \leftarrow a_k$
        $a_k \leftarrow a_k \otimes t$
    end for
end for
// Final pass:
for $i \leftarrow 1$ to $T$ do    ▷ Compute in parallel
    $a_i \leftarrow a_i \otimes b_i$
end for
return $a_{1:T}$

Fig. 3. Parallel-scan algorithm for in-place transformation of the sequence $(a_i)$ into its all-prefix-sums in $O(\log T)$ span-complexity. Note that the algorithm in this form assumes that $T$ is a power of 2, but it can easily be generalized to an arbitrary $T$.

We can now notice that the computation of the forward potentials in fact corresponds to the computation of prefix sums for the associative operation $\otimes$ and the elements $a_{i:i+1}$:
$$\begin{aligned}
a_{0:1} &= \psi_{f,1}(x_1), \\
a_{0:2} &= a_{0:1} \otimes a_{1:2} = \psi_{f,2}(x_2), \\
&\;\;\vdots \\
a_{0:T} &= a_{0:1} \otimes \cdots \otimes a_{T-1:T} = \psi_{f,T}(x_T).
\end{aligned} \quad (23)$$
Similarly, computing the backward potentials, which correspond to the elements $a_{k:T+1}$, can be seen as reversed prefix sums. Therefore, we can use the parallel-scan algorithm to compute the forward and backward potentials along with the marginal distributions (20) in parallel using the algorithm in Fig. 4. As all the steps in the algorithm in Fig. 4 are either fully parallelizable or are parallelizable by the parallel-scan algorithm, the span and work complexities can be summarized as follows.

Proposition 1.
The parallel sum-product algorithm in Fig. 4 has span complexity $O(\log T)$ and work complexity $O(T)$.

Require: The potentials $\psi_k(\cdot)$, $k = 1, \ldots, T$ and the operator $\otimes$.
Ensure: The marginals $p(x_k)$ for $k = 1, \ldots, T$.
for $k \leftarrow 1$ to $T$ do    ▷ In parallel
    Initialize $a_{k-1:k}$
end for
Run the parallel-scan to get $a_{0:k} = \psi_{f,k}(x_k)$, $k = 1, \ldots, T$.
for $k \leftarrow 1$ to $T$ do    ▷ In parallel
    Initialize $a_{k:k+1}$
end for
Run the reversed parallel-scan to get $a_{k:T+1} = \psi^b_{k,T}(x_k)$, $k = 1, \ldots, T$.
for $k \leftarrow 1$ to $T$ do    ▷ In parallel
    Compute the marginal $p(x_k)$ using (20)
end for
return all marginals $p(x_k)$, $k = 1, \ldots, T$
Fig. 4. Parallel sum-product algorithm.
Proof.
The initializations of the elements for both the forward and backward passes as well as the marginal computations are fully parallelizable and hence have span complexities $O(1)$ and work complexities $O(T)$. The parallel-scan passes have span complexities $O(\log T)$ and work complexities $O(T)$. Therefore the total span complexity is $O(\log T)$ and the total work complexity is $O(T)$.

It is useful to remark that the computational complexity also depends on the dimensionality $D$. How strongly the span complexity depends on $D$ is determined by the extent to which we can parallelize inside the element initializations and the operator applications.
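As an illustration of the full algorithm in Fig. 4, the following sketch combines the element initialization, a generic inclusive scan, and the marginal computation (20). The scan below is a simple recursive-doubling reference implementation whose $O(\log T)$ passes consist of elementwise operator applications; the experiments in Section VI instead use the native scan primitive of TensorFlow:

```python
import numpy as np

def associative_scan(op, elems, reverse=False):
    # Inclusive scan by recursive doubling; each of the log T passes
    # applies `op` elementwise and is therefore parallelizable across i.
    if reverse:
        rev = associative_scan(lambda u, v: op(v, u), elems[::-1])
        return rev[::-1]
    res = list(elems)
    d = 1
    while d < len(res):
        res = [res[i] if i < d else op(res[i - d], res[i])
               for i in range(len(res))]
        d *= 2
    return res

def parallel_sum_product(psi1, psis):
    D = psi1.shape[0]
    # forward elements a_{k-1:k}; psi_1 padded to D x D rows, cf. (13)
    fwd = associative_scan(lambda a, b: a @ b,
                           [np.tile(psi1, (D, 1))] + list(psis))
    # backward elements a_{k:k+1} with the trailing a_{T:T+1} = 1, cf. (12)
    bwd = associative_scan(lambda a, b: a @ b,
                           list(psis) + [np.ones((D, D))], reverse=True)
    marginals = []
    for f, b in zip(fwd, bwd):               # p(x_k) from (20), parallel in k
        m = f[0] * b[:, 0]                   # psi_{f,k}(x_k) psi^b_{k,T}(x_k)
        marginals.append(m / m.sum())
    return marginals
```

Here `f[0]` and `b[:, 0]` read the potentials off the padded $D \times D$ elements, in which the dummy arguments $x_0$ and $x_{T+1}$ are constant.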
IV. PARALLEL VITERBI AND MAX-PRODUCT ALGORITHMS

In this section, we first review the classical Viterbi algorithm and its max-product version. Then, we decompose the elements of the latter algorithm in terms of binary associative operations, in a similar fashion as for the sum-product algorithm but with "max" in place of "sum". Finally, we show how to compute these elements in parallel with $O(\log T)$ span complexity.

Please note that in this paper, we assume that the most probable (MAP) path is unique for simplicity of exposition. However, the results can be easily extended to account for multiple solutions.

A. Classical Viterbi algorithm
In the HMM literature, the Viterbi algorithm is a classical algorithm based on the dynamic programming principle that computes the most likely sequence of states [2], [13], [42], that is, the maximum-a-posteriori (MAP) estimate of the hidden states. We aim to compute the estimate $x^*_{1:T}$ by maximizing the posterior distribution
$$p(x_{1:T} \mid y_{1:T}) = \frac{p(y_{1:T}, x_{1:T})}{p(y_{1:T})} \propto p(y_{1:T}, x_{1:T}), \quad (24)$$
which is equivalent to maximizing the joint probability distribution
$$p(y_{1:T}, x_{1:T}) = p(x_1)\, p(y_1 \mid x_1) \prod_{t=2}^{T} p(y_t \mid x_t)\, p(x_t \mid x_{t-1}), \quad (25)$$
that is,
$$x^*_{1:T} = \operatorname*{arg\,max}_{x_{1:T}} p(y_{1:T}, x_{1:T}). \quad (26)$$
In terms of potentials, this is also equivalent to the maximization of (6).

The classical Viterbi algorithm [2], [13] operates as follows. Assume that we have $V_{k-1}(x_{k-1})$, which is the probability of the maximum probability path $x^*_{1:k-1}$ that ends at $x_{k-1}$. Then the maximum probability $V_k(x_k)$ of a path ending at $x_k$ and the corresponding optimal state $x^*_{k-1}$ (as a function of $x_k$, hence denoted as $u_{k-1}(x_k)$) are given by
$$\begin{aligned}
V_k(x_k) &= \max_{x_{k-1}} \left[ p(y_k \mid x_k)\, p(x_k \mid x_{k-1})\, V_{k-1}(x_{k-1}) \right], \\
u_{k-1}(x_k) &= \operatorname*{arg\,max}_{x_{k-1}} \left[ p(y_k \mid x_k)\, p(x_k \mid x_{k-1})\, V_{k-1}(x_{k-1}) \right],
\end{aligned} \quad (27)$$
with the initial condition
$$V_1(x_1) = p(x_1)\, p(y_1 \mid x_1). \quad (28)$$
At the final step we then just take
$$x^*_T = \operatorname*{arg\,max}_{x_T} V_T(x_T) \quad (29)$$
and then compute backwards
$$x^*_{k-1} = u_{k-1}(x^*_k). \quad (30)$$
In terms of potentials, the pseudocode for the Viterbi algorithm can be written as in Fig. 5. As the computational complexity of the forward pass is $O(D^2 T)$ and that of the backward pass is $O(T)$, the total computational complexity of the Viterbi algorithm is $O(D^2 T)$.
Require: The potentials $\psi_k(\cdot)$ for $k = 1, \ldots, T$.
Ensure: The Viterbi path $x^*_{1:T}$.
// Forward pass:
$V_1(x_1) = \psi_1(x_1)$
for $k \leftarrow 2$ to $T$ do    ▷ sequentially
    $V_k(x_k) = \max_{x_{k-1}} \left[ \psi_k(x_{k-1}, x_k)\, V_{k-1}(x_{k-1}) \right]$
    $u_{k-1}(x_k) = \operatorname{arg\,max}_{x_{k-1}} \left[ \psi_k(x_{k-1}, x_k)\, V_{k-1}(x_{k-1}) \right]$
end for
// Backward pass:
$x^*_T = \operatorname{arg\,max}_{x_T} V_T(x_T)$
for $k \leftarrow T$ to $2$ do
    $x^*_{k-1} = u_{k-1}(x^*_k)$
end for
return $x^*_{1:T}$

Fig. 5. The classical Viterbi algorithm.
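A compact NumPy transcription of Fig. 5 (a sequential reference sketch in the potential formulation) could read:

```python
import numpy as np

def classical_viterbi(psi1, psis):
    V = psi1
    back = []                                  # stores u_{k-1}(x_k)
    for psi in psis:                           # forward pass, cf. (27)
        scores = V[:, None] * psi              # indexed (x_{k-1}, x_k)
        back.append(np.argmax(scores, axis=0))
        V = np.max(scores, axis=0)
    path = [int(np.argmax(V))]                 # x*_T, cf. (29)
    for u in reversed(back):                   # backtracking, cf. (30)
        path.append(int(u[path[-1]]))
    return path[::-1]                          # x*_{1:T}
```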
B. Path-based parallelization
In order to design a parallel version of the Viterbi algorithm, we first need to find an element $\tilde{a}$ and a binary associative operator $\vee$ for the Viterbi algorithm.

Definition 4.
For two elements
$$\tilde{a}_{i:j} = \begin{pmatrix} A_{i:j}(x_i, x_j) \\ \hat{X}_{i:j}(x_i, x_j) \end{pmatrix}, \quad \tilde{a}_{j:k} = \begin{pmatrix} A_{j:k}(x_j, x_k) \\ \hat{X}_{j:k}(x_j, x_k) \end{pmatrix}, \quad (31)$$
the binary associative operator $\vee$ for the maximum a posteriori (MAP) path in the HMM is defined as
$$\tilde{a}_{i:k} = \tilde{a}_{i:j} \vee \tilde{a}_{j:k}, \quad (32)$$
where
$$\tilde{a}_{i:k} = \begin{pmatrix} A_{i:k}(x_i, x_k) \\ \hat{X}_{i:k}(x_i, x_k) \end{pmatrix} = \begin{pmatrix} \max_{x_j} A_{i:j}(x_i, x_j)\, A_{j:k}(x_j, x_k) \\ \left( \hat{X}_{i:j}(x_i, \hat{x}_j(x_i, x_k)),\; \hat{x}_j(x_i, x_k),\; \hat{X}_{j:k}(\hat{x}_j(x_i, x_k), x_k) \right) \end{pmatrix} \quad (33)$$
and
$$\hat{x}_j(x_i, x_k) = \operatorname*{arg\,max}_{x_j} A_{i:j}(x_i, x_j)\, A_{j:k}(x_j, x_k), \quad (34)$$
with
$$A_{k-1:k}(x_{k-1}, x_k) = \psi_{k-1,k}(x_{k-1}, x_k), \quad \hat{X}_{k-1:k}(x_{k-1}, x_k) = \emptyset, \quad A_{0:1}(\cdot, x_1) = \psi_1(x_1), \quad \hat{X}_{0:1}(\cdot, x_1) = \emptyset. \quad (35)$$
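A sketch of the operator of Definition 4 in NumPy, using a dense encoding of our own choosing: $A_{i:j}$ is a $D \times D$ table and $\hat{X}_{i:j}$ a $(D, D, L)$ integer table of intermediate states, with leaf elements created as `(psi, np.zeros((D, D, 0), dtype=int))`:

```python
import numpy as np

def vee_with_paths(a, b):
    A1, X1 = a
    A2, X2 = b
    D = A1.shape[0]
    prod = A1[:, :, None] * A2[None, :, :]   # indexed (x_i, x_j, x_k)
    jhat = np.argmax(prod, axis=1)           # \hat{x}_j(x_i, x_k), cf. (34)
    A = np.max(prod, axis=1)                 # upper component of (33)
    X = np.empty((D, D, X1.shape[2] + 1 + X2.shape[2]), dtype=int)
    for xi in range(D):                      # path concatenation in (33)
        for xk in range(D):
            m = jhat[xi, xk]
            X[xi, xk] = np.concatenate((X1[xi, m], [m], X2[m, xk]))
    return A, X
```

The stored paths are what makes this element expensive: an element spanning $j - i$ steps holds $(j - i - 1) D^2$ state indices, which is the memory issue discussed after Corollary 1 below.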
Lemma 2. The operator $\vee$ is associative.

Proof. The proof of the associative property of the operator $\vee$ can be found in Appendix B.

Theorem 3.
We have
$$\begin{aligned}
A_{i:j}(x_i, x_j) &= \max_{x_{i+1:j-1}} \prod_{k=i+1}^{j} \psi_{k-1,k}(x_{k-1}, x_k), \\
\hat{X}_{i:j}(x_i, x_j) &= \operatorname*{arg\,max}_{x_{i+1:j-1}} \prod_{k=i+1}^{j} \psi_{k-1,k}(x_{k-1}, x_k),
\end{aligned} \quad (36)$$
where $A_{i:j}$ corresponds to the maximum probability of the MAP path starting from $x_i$ and ending at $x_j$, and $\hat{X}_{i:j}$ represents the sequence of the corresponding path.

Proof. The proof is provided in Appendix C.

By putting $i = 0$ and $j = T + 1$ in Theorem 3 above we now obtain the following result.

Corollary 1.
We have
$$\tilde{a}_{0:T+1} = \begin{pmatrix} \psi_1(x^*_1) \prod_{t=2}^{T} \psi_{t-1,t}(x^*_{t-1}, x^*_t) \\ x^*_{1:T} \end{pmatrix},$$
where $x^*_{1:T}$ is the MAP estimate given by (26).

It would now be possible to form a parallel algorithm for the elements $\tilde{a}_{i:j}$ with the associative operator $\vee$ by leveraging the parallel-scan algorithm to compute the quantity in Corollary 1. However, each element $\tilde{a}_{i:j}$ contains a path of length $j - i - 1$ for each state pair, and therefore the memory requirements for storing the state sequences are high. Thus, in the next section, we propose an alternative approach with lower memory requirements.

C. Max-product formulation
Until now, we have used the Viterbi algorithm in the forward direction. However, the MAP path can also be computed by using the max-product algorithm (cf. [33]). Let us denote by $\tilde{\psi}^f_k(x_k)$ the maximum probability of a path ending at $x_k$, and by $\tilde{\psi}^b_k(x_k)$ the maximum probability of a path starting at $x_k$. These are given as
$$\tilde{\psi}^f_k(x_k) = A_{0:k}(x_0, x_k), \quad \tilde{\psi}^b_k(x_k) = A_{k:T+1}(x_k, x_{T+1}), \quad (37)$$
where the dependence on $x_0$ and $x_{T+1}$ is only notational (cf. (13)). From Definition 4 we now get the following recursions for these quantities.

Lemma 3.
The maximum forward and backward probabilities admit the recursions
$$\begin{aligned}
\tilde{\psi}^f_k(x_k) &= \max_{x_{k-1}} \psi_{k-1,k}(x_{k-1}, x_k)\, \tilde{\psi}^f_{k-1}(x_{k-1}), \\
\tilde{\psi}^b_k(x_k) &= \max_{x_{k+1}} \psi_{k,k+1}(x_k, x_{k+1})\, \tilde{\psi}^b_{k+1}(x_{k+1}),
\end{aligned} \quad (38)$$
with the initial conditions $\tilde{\psi}^f_1(x_1) = \psi_1(x_1)$ and $\tilde{\psi}^b_T(x_T) = 1$.

It then turns out that the optimal states can be computed by maximizing the product $\tilde{\psi}^f_k(x_k)\, \tilde{\psi}^b_k(x_k)$. This is summarized in the following theorem.

Theorem 4.
Given the maximum forward potentials $\tilde{\psi}^f_k(x_k)$ and the maximum backward potentials $\tilde{\psi}^b_k(x_k)$, the global optimum is determined by
$$x^*_k = \operatorname*{arg\,max}_{x_k} \tilde{\psi}^f_k(x_k)\, \tilde{\psi}^b_k(x_k) \quad (39)$$
for $k = 1, \ldots, T$.

Proof. The proof can be found in Appendix D.

Theorem 4 follows from the generalized MAP estimates for an arbitrary graph discussed in [33, Ch. 13]. However, we compute the elements of these forward and backward potentials in parallel.

Let us now define the element $\bar{a}_{i:j}$ to consist of the upper part of $\tilde{a}_{i:j}$, with the path left out. These elements can be computed without actually storing the paths $x^*_{i:j}$, which provides a computational advantage.

Definition 5.
For two elements $\bar{a}_{i:j} = A_{i:j}(x_i, x_j)$ and $\bar{a}_{j:k} = A_{j:k}(x_j, x_k)$, the binary operator $\vee$ is defined as
$$\bar{a}_{i:k} = \bar{a}_{i:j} \vee \bar{a}_{j:k}, \quad (40)$$
where
$$\bar{a}_{i:k} = \max_{x_j} A_{i:j}(x_i, x_j)\, A_{j:k}(x_j, x_k). \quad (41)$$

The element $\bar{a}_{i:j}$ and the operator $\vee$ inherit the associative property of the element $\tilde{a}_{i:j}$, and therefore we also have $\bar{a}_{i:k} = A_{i:k}(x_i, x_k)$.
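The path-free operator (41) is again a matrix-product-like operation, now in the max-product semiring; a one-line NumPy sketch:

```python
import numpy as np

def vee(a, b):
    # (a v b)[x_i, x_k] = max_{x_j} A_{i:j}(x_i, x_j) A_{j:k}(x_j, x_k), cf. (41)
    return np.max(a[:, :, None] * b[None, :, :], axis=1)
```

That is, $\vee$ is the operator (14) with the sum over $x_j$ replaced by a maximum.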
We can now compute the maximum forward and backward potentials in terms of these associative operators and elements, which is summarized in the following.

Proposition 2. The maximum forward potential can also be computed as $\bar{a}_{0:k} = \tilde{\psi}^f_k(x_k)$ for $k > 0$.

Proof.
This follows from Theorem 3.
Proposition 3.
The maximum backward potential can be computed as $\bar{a}_{k:T+1} = \tilde{\psi}^b_k(x_k)$ for $k > 0$.

Proof.
This follows from Theorem 3.

Similarly to the sum-product algorithm in Section III-C, we can now parallelize the max-product algorithm. The steps are summarized in the algorithm in Fig. 6. The span and work complexities of the algorithm are summarized in the following proposition.
Require: The potentials $\psi_k(\cdot)$, $k = 1, \ldots, T$ and the operator $\vee$.
Ensure: The Viterbi path $x^*_{1:T}$.
for $k \leftarrow 1$ to $T$ do    ▷ In parallel
    Initialize $\bar{a}_{k-1:k}$
end for
Run the parallel-scan to get $\bar{a}_{0:k} = \tilde{\psi}^f_k(x_k)$, $k = 1, \ldots, T$.
for $k \leftarrow 1$ to $T$ do    ▷ In parallel
    Initialize $\bar{a}_{k:k+1}$
end for
Run the reversed parallel-scan to get $\bar{a}_{k:T+1} = \tilde{\psi}^b_k(x_k)$, $k = 1, \ldots, T$.
for $k \leftarrow 1$ to $T$ do    ▷ In parallel
    Compute the optimal state $x^*_k$ using (39)
end for
return $x^*_{1:T}$
Fig. 6. Parallel max-product (Viterbi) algorithm.
Proposition 4.
The span complexity of the parallel max-product algorithm in Fig. 6 is $O(\log T)$ and the work complexity is $O(T)$.

Proof. The initializations and the final state combinations have span complexity $O(1)$ and work complexity $O(T)$, and the scans have span complexity $O(\log T)$ and work complexity $O(T)$. Hence the result follows.
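A sketch of the algorithm in Fig. 6, reusing the `associative_scan` helper from the sketch in Section III-C and the path-free operator `vee` above:

```python
import numpy as np

def parallel_max_product(psi1, psis):
    D = psi1.shape[0]
    fwd = associative_scan(vee, [np.tile(psi1, (D, 1))] + list(psis))
    bwd = associative_scan(vee, list(psis) + [np.ones((D, D))],
                           reverse=True)
    # x*_k from (39); the argmax over x_k is elementwise and parallel in k
    return [int(np.argmax(f[0] * b[:, 0])) for f, b in zip(fwd, bwd)]
```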
V. EXTENSIONS

In this section, we discuss extensions of the parallel-scan framework.
A. Generic associative operations
We defined the parallel-scan framework in terms of sum-products and max-products for hidden Markov models (HMMs). We can easily extend the proposed framework to operations of the form (7) with arbitrary associative operators. In particular, we can also consider continuous-state Markov processes, in which case the operator becomes integration. In this case we get algorithms similar to the ones described in [30], except that the smoother will be a two-filter smoother instead of the Rauch–Tung–Striebel smoother as in [30]. In particular, for linear Gaussian systems we get a parallel version of the two-filter Kalman smoother.
B. Block-wise operations
In this article, we have restricted our consideration to pairwise Markovian elements. That is, we assign each single observation and state to a single element. However, it is also possible to perform the binary association operations in parallel on blocks of observations: we can assign a block of observations and states to each element of the parallel framework. The processing of elements in parallel with blocks of length $l$ has already been shown in [30]. The practical advantage is that we can more easily distribute the computations when the resources are limited, while still achieving the optimal speed-up from parallelization; a minimal sketch of this blocking is given below.
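A sketch of such blocking (our own helper; `elems` are the scan elements and `op` the associative operator):

```python
def blockwise_elements(elems, op, block_len):
    # Combine each run of block_len consecutive elements sequentially;
    # the parallel scan then runs over only ceil(T / block_len) elements.
    out = []
    for s in range(0, len(elems), block_len):
        e = elems[s]
        for nxt in elems[s + 1 : s + block_len]:
            e = op(e, nxt)
        out.append(e)
    return out
```

The scan then yields the prefix results at the block boundaries, and the remaining within-block results can be filled in with short sequential passes inside each block.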
C. Parameter estimation

The final task of HMM inference is learning the model parameters. One possible solution is to use the Baum-Welch algorithm (BWA) [43], which is a special case of the expectation-maximization (EM) algorithm [44]. In the expectation step, the BWA uses the forward-backward algorithm, which we have shown can be computed in parallel.
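For instance, the expected sufficient statistics of the E-step can be assembled from the forward and backward potentials that the (parallelizable) sum-product algorithm provides. A sketch using the standard Baum-Welch smoothed marginals, where `fwd[k]` and `bwd[k]` are $\psi_{f,k+1}$ and $\psi^b_{k+1,T}$ as $(D,)$ arrays:

```python
import numpy as np

def e_step(psis, fwd, bwd):
    gammas, xis = [], []
    for k in range(len(fwd)):
        g = fwd[k] * bwd[k]                 # p(x_k | y_{1:T}) up to normalization
        gammas.append(g / g.sum())
        if k > 0:                           # pairwise smoothed marginals
            x = fwd[k - 1][:, None] * psis[k - 1] * bwd[k][None, :]
            xis.append(x / x.sum())
    return gammas, xis
```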
VI. EXPERIMENTAL RESULTS
For our experiments, we consider the Gilbert–Elliott (GE) channel model [2], which is a classical model used in the transmission of signals over digital communication channels. The model describes the burst error patterns in communication channels and simulates the error performance of the communication link. This model consists of two hidden Markov states $b_k$ and $s_k$, where $b_k$ represents the binary input signal to be transmitted and $s_k$ represents the binary channel regime signal. It is assumed that the input signal $b_k$ may be flipped by an independent error, which can be modeled as $y_k = b_k \oplus v_k$. Here, $y_k$ is the measurement signal which is observable, $v_k$ is a Bernoulli sequence, and $\oplus$ is the exclusive-or operation. The exclusive-or operation ensures that $y_k = b_k$ if $v_k = 0$; otherwise $y_k \ne b_k$ and an error has occurred.

The regime signal $s_k$ is modeled as a two-state Markov chain that represents the high and low error conditions of the channel. That is, if an error occurs ($v_k = 1$), the probability of error has either a small value ($q_0$) or a larger value ($q_1$). Moreover, we represent the probability of transition from the high error state ($s_{k-1} = 1$) to the low error state ($s_k = 0$) as $p_1$. Similarly, we represent the probability of transition from the low error state ($s_{k-1} = 0$) to the high error state ($s_k = 1$) as $p_0$. We also denote the state switch probability of $b_k$ as $p_2$.

In order to recover the hidden states, we use the joint model $x_k = (s_k, b_k)$, which is a 4-state Markov chain consisting of the states $\{(0,0), (0,1), (1,0), (1,1)\}$. These states are encoded as $\{1, 2, 3, 4\}$. The transition matrix $\Pi$ and the observation model $O$, which encode the information in $p(x_k \mid x_{k-1})$ and $p(y_k \mid x_k)$, respectively, are given in (42). An example of the states and the corresponding measurements from the GE model is shown in Fig. 7.

Fig. 7. An example of states and measurements from the Gilbert–Elliott hidden Markov model with the number of time steps $T = 100$.
$$\Pi = \begin{pmatrix}
(1-p_0)(1-p_2) & (1-p_0)\,p_2 & p_0\,(1-p_2) & p_0\,p_2 \\
(1-p_0)\,p_2 & (1-p_0)(1-p_2) & p_0\,p_2 & p_0\,(1-p_2) \\
p_1\,(1-p_2) & p_1\,p_2 & (1-p_1)(1-p_2) & (1-p_1)\,p_2 \\
p_1\,p_2 & p_1\,(1-p_2) & (1-p_1)\,p_2 & (1-p_1)(1-p_2)
\end{pmatrix}, \quad
O = \begin{pmatrix}
1-q_0 & q_0 \\
q_0 & 1-q_0 \\
1-q_1 & q_1 \\
q_1 & 1-q_1
\end{pmatrix}. \quad (42)$$
Here, the rows of $\Pi$ encode $p(x_k \mid x_{k-1})$ and the rows of $O$ encode $p(y_k \mid x_k)$.
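A sketch of the simulation setup, constructing $\Pi$ and $O$ of (42) and drawing a state and measurement sequence (the state encoding and helper names are ours, and a uniform initial state is assumed for illustration):

```python
import numpy as np

def simulate_gilbert_elliott(p0, p1, p2, q0, q1, T, rng):
    # joint state x_k = (s_k, b_k), encoded as 2*s_k + b_k
    Ps = np.array([[1 - p0, p0], [p1, 1 - p1]])   # regime chain s_k
    Pb = np.array([[1 - p2, p2], [p2, 1 - p2]])   # input chain b_k
    Pi = np.kron(Ps, Pb)                          # transition matrix of (42)
    q = np.array([q0, q0, q1, q1])                # error probability per state
    b = np.array([0, 1, 0, 1])                    # input bit per state
    O = np.zeros((4, 2))
    O[np.arange(4), b] = 1 - q                    # y_k = b_k
    O[np.arange(4), 1 - b] = q                    # y_k = b_k XOR 1
    xs = [rng.integers(4)]
    for _ in range(T - 1):
        xs.append(rng.choice(4, p=Pi[xs[-1]]))
    ys = np.array([rng.choice(2, p=O[x]) for x in xs])
    return np.array(xs), ys, Pi, O
```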
To evaluate the performance of our proposed methods, we simulated states and measurements for time series of varying lengths and averaged the run times (10 repetitions for the sequential methods and 100 for the parallel ones). In the experimental setup, we used the open-source TensorFlow 2.4 software library [45]. The library natively implements the vectorization and associative-scan primitives that can be used to implement the methodology on both CPUs and GPUs. We ran the sequential and parallel Bayesian smoothers [32] (BS-Seq, BS-Par), the sequential and parallel sum-product based smoothers (SP-Seq, SP-Par) from Section III, and the sequential and parallel max-product based MAP estimators (MP-Seq, MP-Par) from Section IV on both CPU and GPU. We also ran the classical Viterbi algorithm (see Fig. 5) for comparison. The experiments were carried out on a 24-core AMD Ryzen™ Threadripper™ CPU and on an NVIDIA® Ampere® GeForce RTX 3090 (GA102) GPU with 10496 cores.

The average run times are shown in Figs. 8 and 9. It is clear that the parallel algorithms are computationally faster than the sequential versions both on CPU and GPU. Although the difference is much more pronounced on the GPU, we can see a benefit from parallelization even on the CPU, whose number of computational cores is orders of magnitude smaller (24 vs. 10496 cores). Among the compared methods, the max-product-based parallel method is the fastest, the sum-product-based parallel method is the second, and the parallel Bayesian smoother is the third, on both CPU and GPU. A similar order can be seen among the sequential methods, and the classical Viterbi algorithm places between the other sequential methods.

Fig. 8. The average computation times on CPU for sequential and parallel Bayesian smoothers (BS-Seq, BS-Par), sequential and parallel sum-product algorithms (SP-Seq, SP-Par), sequential and parallel max-product algorithms (MP-Seq, MP-Par), and the classical Viterbi algorithm (Viterbi).
Fig. 9. The average computation times on GPU for sequential and parallel Bayesian smoothers (BS-Seq, BS-Par), sequential and parallel sum-product algorithms (SP-Seq, SP-Par), sequential and parallel max-product algorithms (MP-Seq, MP-Par), and the classical Viterbi algorithm (Viterbi).
Fig. 10. The average computation times on GPU for parallel Bayesian smoothers (BS-Par-GPU), parallel sum-product algorithms (SP-Par-GPU), and parallel max-product algorithms (MP-Par-GPU).

The average run times of the parallel methods on GPU (on a linear scale) are separately shown in Fig. 10. From the figure it can be seen that the computational times initially grow logarithmically, as predicted by the theory, and then, for the Bayesian smoother and the sum-product method, return to linear growth once the time series becomes long enough. The max-product method, however, remains sub-linear even for the longest time series considered, which is likely due to the smaller number of computational operations needed.

Finally, Fig. 11 shows the ratio of the average run times of the sequential methods and the corresponding parallel methods on GPU. The speed-up is already between 2000 and 3000 at moderate time series lengths, and it grows further as the time series length increases, with the max-product method attaining the largest speed-ups.

Fig. 11. The ratio of the average run times of the sequential methods and the corresponding parallel methods (BS, SP, MP) on GPU.

VII. CONCLUSION
In this paper, we have proposed a parallel formulation of HMM inference. We have considered the computation of both the marginal smoothing distributions and the MAP (Viterbi) paths. The proposed formulation enables the efficient parallel computation of HMM inference by reducing the time complexity from linear to logarithmic in the number of time steps. The algorithms are based on reformulating the HMM inference problems in terms of associative operations, which can be parallelized using the parallel-scan algorithm. We also showed the practical advantage of our proposed methods with experiments on a multi-core CPU and a GPU.

APPENDIX
A. Proof of Lemma 1
In this section, we prove the associative property of the operator $\otimes$ stated in Lemma 1. For this, we need to prove that for three general elements $a_{i:j}, a_{j:k}, a_{k:l}$, the following relation holds:
$$(a_{i:j} \otimes a_{j:k}) \otimes a_{k:l} = a_{i:j} \otimes (a_{j:k} \otimes a_{k:l}), \quad \text{where } 0 \le i < j < k < l. \quad (43)$$
We proceed to perform the calculations on the left-hand side of (43) to check that they yield the same result as on the right-hand side:
$$\begin{aligned}
(a_{i:j} \otimes a_{j:k}) \otimes a_{k:l} &= \sum_{x_k} \sum_{x_j} \psi_{i,j}(x_i, x_j)\, \psi_{j,k}(x_j, x_k)\, \psi_{k,l}(x_k, x_l) \\
&= \sum_{x_j, x_k} \psi_{i,j}(x_i, x_j)\, \psi_{j,k}(x_j, x_k)\, \psi_{k,l}(x_k, x_l) \\
&= \sum_{x_j} \psi_{i,j}(x_i, x_j) \left( \sum_{x_k} \psi_{j,k}(x_j, x_k)\, \psi_{k,l}(x_k, x_l) \right) \\
&= a_{i:j} \otimes (a_{j:k} \otimes a_{k:l}),
\end{aligned}$$
which gives the result.

B. Proof of Lemma 2
In this section, we prove the associative property of the operator $\vee$ stated in Lemma 2. We need to prove that for three general elements $\tilde{a}_{i:j}, \tilde{a}_{j:k}, \tilde{a}_{k:l}$, the following relation holds:
$$(\tilde{a}_{i:j} \vee \tilde{a}_{j:k}) \vee \tilde{a}_{k:l} = \tilde{a}_{i:j} \vee (\tilde{a}_{j:k} \vee \tilde{a}_{k:l}), \quad \text{where } 0 \le i < j < k < l. \quad (44)$$
From Definition 4, we can write
$$\tilde{a}_{i:j} \vee \tilde{a}_{j:k} = \begin{pmatrix} A_{i:k}(x_i, x_k) \\ \hat{X}_{i:k}(x_i, x_k) \end{pmatrix}, \quad (45)$$
where
$$A_{i:k}(x_i, x_k) = \max_{x_j} A_{i:j}(x_i, x_j)\, A_{j:k}(x_j, x_k), \quad (46a)$$
$$\hat{X}_{i:k}(x_i, x_k) = \left( \hat{X}_{i:j}(x_i, \hat{x}_j(x_i, x_k)),\; \hat{x}_j(x_i, x_k),\; \hat{X}_{j:k}(\hat{x}_j(x_i, x_k), x_k) \right), \quad (46b)$$
and
$$\hat{x}_j(x_i, x_k) = \operatorname*{arg\,max}_{x_j} A_{i:j}(x_i, x_j)\, A_{j:k}(x_j, x_k). \quad (47)$$
Now, combining with the maximum probability of the MAP path of the element $\tilde{a}_{k:l}$, we can write (46a) as
$$\begin{aligned}
A_{i:l}(x_i, x_l) &= \max_{x_k} A_{i:k}(x_i, x_k)\, A_{k:l}(x_k, x_l) \\
&= \max_{x_k} \left[ \left\{ \max_{x_j} A_{i:j}(x_i, x_j)\, A_{j:k}(x_j, x_k) \right\} A_{k:l}(x_k, x_l) \right] \\
&= \max_{x_j} \left[ A_{i:j}(x_i, x_j) \left\{ \max_{x_k} A_{j:k}(x_j, x_k)\, A_{k:l}(x_k, x_l) \right\} \right] \\
&= \max_{x_j} A_{i:j}(x_i, x_j)\, A_{j:l}(x_j, x_l). \quad (48)
\end{aligned}$$
Similarly, combining with the MAP path of the element $\tilde{a}_{k:l}$, the optimizing states in (46b) satisfy
$$\begin{aligned}
\hat{X}_{i:l}(x_i, x_l) &= \operatorname*{arg\,max}_{x_k} A_{i:k}(x_i, x_k)\, A_{k:l}(x_k, x_l) \\
&= \operatorname*{arg\,max}_{x_k} \left[ \left\{ \operatorname*{arg\,max}_{x_j} A_{i:j}(x_i, x_j)\, A_{j:k}(x_j, x_k) \right\} A_{k:l}(x_k, x_l) \right] \\
&= \operatorname*{arg\,max}_{x_j} \left[ A_{i:j}(x_i, x_j) \left\{ \operatorname*{arg\,max}_{x_k} A_{j:k}(x_j, x_k)\, A_{k:l}(x_k, x_l) \right\} \right] \\
&= \operatorname*{arg\,max}_{x_j} A_{i:j}(x_i, x_j)\, A_{j:l}(x_j, x_l). \quad (49)
\end{aligned}$$
From (48) and (49), it follows that the operator $\vee$ is associative.

C. Proof of Theorem 3
In this section, we prove Theorem 3 by induction. We first note that the theorem is true for $i = k$ and $j = k + 1$. Furthermore, if the assertion is true for $i + 1 < j$, we proceed to show that it holds for $i$. By using (33) we get
$$\begin{aligned}
A_{i:j}(x_i, x_j) &= \max_{x_{i+1}} A_{i:i+1}(x_i, x_{i+1})\, A_{i+1:j}(x_{i+1}, x_j) \\
&= \max_{x_{i+1}} \psi_{i,i+1}(x_i, x_{i+1}) \max_{x_{i+2:j-1}} \prod_{t=i+2}^{j} \psi_{t-1,t}(x_{t-1}, x_t) \\
&= \max_{x_{i+1:j-1}} \prod_{t=i+1}^{j} \psi_{t-1,t}(x_{t-1}, x_t). \quad (50)
\end{aligned}$$
We similarly get
$$\begin{aligned}
\hat{X}_{i:j}(x_i, x_j) &= \left( \hat{X}_{i:i+1}(x_i, \hat{x}_{i+1}(x_i, x_j)),\; \hat{x}_{i+1}(x_i, x_j),\; \hat{X}_{i+1:j}(\hat{x}_{i+1}(x_i, x_j), x_j) \right) \\
&= \left( \hat{x}_{i+1}(x_i, x_j),\; \hat{X}_{i+1:j}(\hat{x}_{i+1}(x_i, x_j), x_j) \right) \\
&= \operatorname*{arg\,max}_{x_{i+1:j-1}} \prod_{t=i+1}^{j} \psi_{t-1,t}(x_{t-1}, x_t), \quad (51)
\end{aligned}$$
which concludes the proof.

D. Proof of Theorem 4
In this section, we prove Theorem 4. That is, given the maximum forward probability $\tilde{\psi}^f_k(x_k)$ of a path ending at $x_k$ and the maximum backward probability $\tilde{\psi}^b_k(x_k)$ of a path starting at $x_k$, the global optimum is given by (39).

Due to the associative property of the operator, we can write
$$\tilde{a}_{0:T+1} = \tilde{a}_{0:k} \vee \tilde{a}_{k:T+1} = \begin{pmatrix} \psi_1(x^*_1) \prod_{t=2}^{T} \psi_t(x^*_{t-1}, x^*_t) \\ x^*_{1:T} \end{pmatrix}, \quad (52)$$
which is given by
$$\tilde{a}_{0:k} \vee \tilde{a}_{k:T+1} = \begin{pmatrix} \max_{x_k} A_{0:k}(x_0, x_k)\, A_{k:T+1}(x_k, x_{T+1}) \\ \left( \hat{X}_{0:k}(x_0, \hat{x}_k(x_0, x_{T+1})),\; \hat{x}_k(x_0, x_{T+1}),\; \hat{X}_{k:T+1}(\hat{x}_k(x_0, x_{T+1}), x_{T+1}) \right) \end{pmatrix}, \quad (53)$$
where the dependence on $x_0$ and $x_{T+1}$ is only notational, and the term
$$\hat{x}_k(x_0, x_{T+1}) = \operatorname*{arg\,max}_{x_k} A_{0:k}(x_0, x_k)\, A_{k:T+1}(x_k, x_{T+1}) = \operatorname*{arg\,max}_{x_k} \tilde{\psi}^f_k(x_k)\, \tilde{\psi}^b_k(x_k) \quad (54)$$
has to be the $k$th element of the optimal path in order to match the right-hand side of (52).

ACKNOWLEDGMENT
The authors would like to thank the Academy of Finland for funding. The authors would also like to thank Adrien Corenflos for helping with the experiments.
REFERENCES

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[2] O. Cappé, E. Moulines, and T. Rydén, Inference in Hidden Markov Models. Springer, 2006.
[3] Z. Song and A. Dogandžić, "A max-product EM algorithm for reconstructing Markov-tree sparse signals from compressive samples," IEEE Transactions on Signal Processing, vol. 61, no. 23, pp. 5917–5931, 2013.
[4] Y. Ueng, K. Liao, H. Chou, and C. Yang, "A high-throughput trellis-based layered decoding architecture for non-binary LDPC codes using max-log-QSPA," IEEE Transactions on Signal Processing, vol. 61, no. 11, pp. 2940–2951, 2013.
[5] B. Mor, S. Garhwal, and A. Kumar, "A systematic review of hidden Markov models and their applications," Archives of Computational Methods in Engineering, 2020.
[6] B. H. Juang and L. R. Rabiner, "Hidden Markov models for speech recognition," Technometrics, vol. 33, no. 3, pp. 251–272, 1991.
[7] A. A. Salah, M. Bicego, L. Akarun, E. Grosso, and M. Tistarelli, "Hidden Markov model-based face recognition using selective attention," in Human Vision and Electronic Imaging XII, vol. 6492. International Society for Optics and Photonics, SPIE, 2007, pp. 404–412.
[8] A. Krogh, B. Larsson, G. von Heijne, and E. L. Sonnhammer, "Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes," Journal of Molecular Biology, vol. 305, no. 3, pp. 567–580, 2001.
[9] L. Pan, M. W. Marcellin, W. E. Ryan, and B. Vasic, "Viterbi detection for compressively sampled FHSS-GFSK signals," IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 5965–5975, 2015.
[10] Y. Yang, S. Chen, M. A. Maddah-Ali, P. Grover, S. Kar, and J. Kovačević, "Fast temporal path localization on graphs via multiscale Viterbi decoding," IEEE Transactions on Signal Processing, vol. 66, no. 21, pp. 5588–5603, 2018.
[11] T. Long, L. Zheng, X. Chen, Y. Li, and T. Zeng, "Improved probabilistic multi-hypothesis tracker for multiple target tracking with switching attribute states," IEEE Transactions on Signal Processing, vol. 59, no. 12, pp. 5721–5733, 2011.
[12] P. Di Viesti, G. M. Vitetta, and E. Sirignano, "Double Bayesian smoothing as message passing," IEEE Transactions on Signal Processing, vol. 67, no. 21, pp. 5495–5510, 2019.
[13] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
[14] Y. Lifshits, S. Mozes, O. Weimann, and M. Ziv-Ukelson, "Speeding up HMM decoding and training by exploiting sequence repetitions," Algorithmica, vol. 54, no. 3, pp. 379–399, 2009.
[15] A. Sand, C. N. Pedersen, T. Mailund, and A. T. Brask, "HMMlib: A C++ library for general hidden Markov models exploiting modern CPUs," IEEE, 2010, pp. 126–134.
[16] J. Nielsen and A. Sand, "Algorithms for a parallel implementation of hidden Markov models with a small state space," IEEE, 2011, pp. 452–459.
[17] S. Chatterjee and S. Russell, "A temporally abstracted Viterbi algorithm," in Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 2011, pp. 96–104.
[18] Z. Du, Z. Yin, and D. A. Bader, "A tile-based parallel Viterbi algorithm for biological sequence alignment on GPU with CUDA," 2010, pp. 1–8.
[19] S. Maleki, M. Musuvathi, and T. Mytkowicz, "Parallelizing dynamic programming through rank convergence," ACM SIGPLAN Notices, vol. 49, no. 8, pp. 219–232, 2014.
[20] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879–899, 2008.
[21] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," 2012, pp. 449–460.
[22] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1–12.
[23] D. Luebke, "CUDA: Scalable parallel programming for high-performance scientific computing," 2008, pp. 836–838.
[24] G. E. Blelloch, "Scans as primitive parallel operations," IEEE Transactions on Computers, vol. 38, no. 11, pp. 1526–1538, 1989.
[25] ——, "Prefix sums and their applications," School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CS-90-190, 1990.
[26] J. Li, S. Chen, and Y. Li, "The fast evaluation of hidden Markov models on GPU," vol. 4. IEEE, 2009, pp. 426–430.
[27] C. Liu, "cuHMM: a CUDA implementation of hidden Markov model training and classification," Johns Hopkins University, Tech. Rep., 2009.
[28] D. Zhang, R. Zhao, L. Han, T. Wang, and J. Qu, "An implementation of Viterbi algorithm on GPU," IEEE, 2009, pp. 121–124.
[29] S. Hymel, "Massively parallel hidden Markov models for wireless applications," Ph.D. dissertation, Virginia Tech, 2011.
[30] S. Särkkä and Á. F. García-Fernández, "Temporal parallelization of Bayesian smoothers," IEEE Transactions on Automatic Control, vol. 66, no. 1, pp. 299–306, 2021.
[31] F. Yaghoobi, A. Corenflos, S. Hassan, and S. Särkkä, "Parallel iterated extended and sigma-point Kalman smoothers," in To appear in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[32] S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[33] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[34] S. Rangan, A. K. Fletcher, V. K. Goyal, E. Byrne, and P. Schniter, "Hybrid approximate message passing," IEEE Transactions on Signal Processing, vol. 65, no. 17, pp. 4577–4592, 2017.
[35] Y. Weiss, "Correctness of local probability propagation in graphical models with loops," Neural Computation, vol. 12, no. 1, pp. 1–41, 2000.
[36] L. Rabiner and B. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, 1986.
[37] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[38] G. Shafer and P. Shenoy, "Probability propagation," Annals of Mathematics and Artificial Intelligence, vol. 2, pp. 327–351, 1990.
[39] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[40] R. G. Cowell, P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter, Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks. Springer, 2006.
[41] B. J. Frey, Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998.
[42] R. Larson and J. Peschon, "A dynamic programming approach to trajectory estimation," IEEE Transactions on Automatic Control, vol. 11, no. 3, pp. 537–540, July 1966.
[43] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," The Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164–171, 1970.
[44] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.
[45] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.