On Low Complexity Maximum Likelihood Decoding of Convolutional Codes
Jie Luo,
Member, IEEE
Abstract
This paper considers the average complexity of maximum likelihood (ML) decoding of convolutional codes. ML decoding can be modeled as finding the most probable path taken through a Markov graph. Integrated with the Viterbi algorithm (VA), complexity reduction methods such as the sphere decoder often use the sum log likelihood (SLL) of a Markov path as a bound to disprove the optimality of other Markov path sets and to consequently avoid exhaustive path search. In this paper, it is shown that SLL-based optimality tests are inefficient if one fixes the coding memory and takes the codeword length to infinity. Alternatively, the optimality of a source symbol at a given time index can be tested using bounds derived from the log likelihoods of the neighboring symbols. It is demonstrated that such neighboring log likelihood (NLL)-based optimality tests, whose efficiency does not depend on the codeword length, can bring significant complexity reduction to ML decoding of convolutional codes. The results are generalized to ML sequence detection in a class of discrete-time hidden Markov systems.
Index Terms: coding complexity, convolutional code, hidden Markov model, maximum likelihood decoding, Viterbi algorithm.
I. Introduction
We study algorithms that reduce the average complexity of maximum likelihood (ML) decoding of convolutional codes. By ML decoding, we mean that the decoder uses a code search to find, and to guarantee the output of, the most likely codeword.

Forney showed that ML decoding of convolutional codes is equivalent to finding the most probable path taken through a Markov graph [1]. Denote the codeword length by $N$ and the coding memory by $\nu$. For each time index, the number of Markov states in the Markov graph is exponential in $\nu$. The total number of Markov states is therefore exponential in $\nu$ but linear in $N$. Define the complexity of a decoder as the number of visited Markov states normalized by the codeword length $N$. Practical ML decoding is often achieved using the Viterbi algorithm (VA) [2][1], whose complexity does not scale in $N$ but scales exponentially in $\nu$. Well-known decoders such as the list decoders [3], the sequential decoders [4], and the iterative decoders [5] are able to achieve near-optimal error performance with low average complexity. However, these decoders do not guarantee the output of the ML codeword [6].

(The author is with the Electrical and Computer Engineering Department, Colorado State University, Fort Collins, CO 80523. E-mail: [email protected]. This work was supported by National Science Foundation grant CCF-0728826.)

If obtaining the ML codeword is strictly enforced (see Section VII for justification), to avoid exhaustive path search, the decoder must develop a criterion or bound that can be used to disprove the optimality of a Markov path set. This is equivalent to developing an optimality test criterion (OTC) [7] to test whether the ML path (or codeword) belongs to the complementary path set (or codeword set).

Two major OTCs have been used in the ML decoding of convolutional codes. The first one is the "path covering criterion" (PCC) (explained in [8] and in Appendix A) used in the VA [2][1]. The VA visits all Markov states in chronological order [1]. For each time index, the decoder maintains a set of "cover" (defined in Appendix A) Markov paths, each passing one of the Markov states [1]. According to the PCC, the "cover" Markov path passing a Markov state disproves the optimality of all other Markov paths passing the same state. The second OTC is the class of sum log likelihood (SLL)-based OTCs used extensively in the sphere decoder [10][9]. The sphere decoder models ML decoding as finding the lattice point closest to the channel output in the signal space [9]. Hence the distance between the channel output and an arbitrary lattice point upper bounds the distance from the channel output to the ML codeword. Such a distance bound is based on the SLL of the corresponding codeword, and is used in the sphere decoder [10][9], as well as in other ML decoders [7], as the key means to avoid exhaustive codeword search. In [11][12], Vikalo and Hassibi showed that PCC-based and SLL-based optimality tests can be combined to find the ML codeword without visiting all Markov states.

Assume the PCC-based optimality test is always implemented. In this paper, we first show that the additional complexity reduction brought by the SLL-based optimality test diminishes as one fixes the coding memory $\nu$ and takes the codeword length $N$ to infinity. Such inefficiency is due to the fact that the SLL-based OTC does not exploit the structure of the convolutional code. Searching for the ML codeword is equivalent to finding the ML source message, which contains a sequence of source symbols.
We show that whether the ML message contains a particular symbol at a given time index can be tested using an OTC that depends only on the log likelihoods of channel output symbols in a fixed-sized time neighborhood. (In the literature, such as [7], an OTC refers to a criterion designed to test whether a single codeword is optimum. In this paper, we extend the definition of an OTC to a general criterion that can either verify or disprove the optimality of a codeword set.) We call such a test the neighboring log likelihood (NLL)-based optimality test, and show that its efficiency does not depend on the codeword length. We theoretically demonstrate that NLL-based optimality tests can bring significant complexity reduction to ML decoding when the communication system has a high signal to noise ratio (SNR). The complexity of a decoder using SLL-based optimality tests, on the other hand, remains the same as that of the VA for all SNR values if the codeword length is taken to infinity. The results are also generalized to ML sequence detection in a class of discrete-time hidden Markov systems [13].
II. Problem Formulation
Let $C$ be an $(n, k)$ convolutional code over GF$(q)$ defined by a polynomial generator matrix $G(D)$ [14],
$$G(D) = G[0] + G[1]D + \cdots + G[\nu-1]D^{\nu-1}, \qquad (1)$$
where $D$ is the delay operator; $\nu$ is the coding memory; and $G[l]$, $l = 0, \ldots, \nu-1$, are $k \times n$ matrices over GF$(q)$. Assume $G(D)$ is a minimal encoder [14].
Denote the source message by a sequence of vector symbols,
$$x(D) = x[d]D^{d} + x[d+1]D^{d+1} + \cdots, \qquad (2)$$
where $d$ is the time index, possibly negative, and the $x[d]$, $\forall d$, are row vectors of dimension $k$ over GF$(q)$. The encoded message, or the corresponding codeword, is given by
$$y(D) = x(D)G(D) = \sum_{d} \left( \sum_{l=0}^{\nu-1} x[d-l]\,G[l] \right) D^{d}. \qquad (3)$$
To simplify the presentation, we assume the time index $d$ takes all integer values. We assume $x[d] = \mathbf{0}$ for $d < 0$ or $d \ge N$. We term $N$ the codeword length.

Define a function $g_q(y)$ that maps $y$ from GF$(q)$ to $\mathbb{R}$ (the set of real numbers) in a one-to-one sense. If $y(D)$ is a vector sequence, $g_q(y(D))$ applies the mapping to each of the elements of $y(D)$, respectively; hence the output of $g_q(y(D))$ is a vector sequence of the same length and dimension as $y(D)$. Assume the codeword is transmitted over a memoryless Gaussian channel. The channel output symbol sequence is given by
$$r(D) = g_q(y(D)) + n(D) = g_q(x(D)G(D)) + n(D), \qquad (4)$$
where $n(D) = n[d]D^{d} + n[d+1]D^{d+1} + \cdots$ is the noise sequence, with $n[d] \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$ being i.i.d. Gaussian. Without loss of generality, we define the scaled signal to noise ratio of the system as $\mathrm{SNR} = 1/(2\sigma^2)$. In Section VI, we show that the results are generalizable not only to other channel models, but also to a class of hidden Markov systems.

Given the channel output, for any source message $x(D)$ and its corresponding codeword $y(D) = x(D)G(D)$, we define the "negative SLL" as
$$S_x(x(D)) = S_y(y(D)) = \sum_{d=0}^{N+\nu-1} \| r[d] - g_q(y[d]) \|^2. \qquad (5)$$
The objective of ML decoding is to find the ML message $x_{ML}(D)$ that minimizes the negative SLL,
$$x_{ML}(D) = \operatorname*{argmin}_{x[d],\; 0 \le d < N} S_x(x(D)). \qquad (6)$$
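For illustration only (this sketch is not part of the original analysis), the following Python fragment encodes a message with a small assumed $(2,1)$ binary convolutional code, applies an assumed BPSK-style mapping $g_2$, simulates the Gaussian channel of (4), and evaluates the negative SLL of (5). The generator taps, the mapping, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# An assumed (2,1) binary convolutional code with coding memory nu = 3, per eq. (1):
# G(D) = G[0] + G[1]D + G[2]D^2, each G[l] a 1x2 matrix over GF(2).
G = [np.array([[1, 1]]), np.array([[1, 0]]), np.array([[1, 1]])]
nu = len(G)
k, n = G[0].shape

def encode(x):
    """Eq. (3): y[d] = sum_{l=0}^{nu-1} x[d-l] G[l] over GF(2); x[d] = 0 outside [0, N)."""
    N = len(x)
    y = np.zeros((N + nu - 1, n), dtype=int)
    for d in range(N + nu - 1):
        for l in range(nu):
            if 0 <= d - l < N:
                y[d] = (y[d] + x[d - l] @ G[l]) % 2
    return y

def g2(y):
    """A one-to-one map g_q from GF(2) to the reals (a BPSK-style choice, assumed)."""
    return 1.0 - 2.0 * y

def negative_sll(r, y):
    """Eq. (5): the negative SLL of the codeword y given the channel output r."""
    return float(np.sum((r - g2(y)) ** 2))

N, snr = 8, 10.0
x = rng.integers(0, 2, size=(N, k))
y = encode(x)
sigma = np.sqrt(1.0 / (2.0 * snr))                  # scaled SNR = 1/(2 sigma^2)
r = g2(y) + rng.normal(0.0, sigma, size=y.shape)    # eq. (4): memoryless Gaussian channel
print(negative_sll(r, y))
```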
III. Inefficiency of SLL-Based Optimality Tests

For ML decoders using SLL-based optimality tests, the decoder first obtains a quick guess of the source message without solving the ML decoding problem. The SLL of the obtained message is then used to help disprove the optimality of certain Markov path sets and, consequently, to avoid exhaustive path search. We make an ideal assumption that the "guessed" message equals the transmitted message. (Note that the decoder still needs to verify whether the guessed message is indeed the ML solution. If it is not, then a search for the ML message must be carried out.) We show in this section that, even under this ideal assumption, the complexity reduction brought by the SLL-based optimality tests still diminishes as we take $N$ to infinity.

Let $x_0(D)$ be the actual source message, which is also the message "guessed" by the decoder. Let $y_0(D) = x_0(D)G(D)$ be the transmitted codeword. The corresponding negative SLL is given by
$$S_x(x_0(D)) = \sum_{d=0}^{N+\nu-1} \|r[d] - g_q(y_0[d])\|^2 = \sum_{d=0}^{N+\nu-1} \|n[d]\|^2. \qquad (7)$$
Now consider a subset of time indices $\mathcal{D}_d^x \subseteq [0, N)$. Let $\{\tilde{x}[d] \,|\, d \in \mathcal{D}_d^x\}$ be a partial message defined only at time indices in $\mathcal{D}_d^x$. Denote by $\{\tilde{x}(\mathcal{D}_d^x)\}$ the set of messages satisfying
$$\{\tilde{x}(\mathcal{D}_d^x)\} = \left\{ x(D) \;\middle|\; x[d] = \tilde{x}[d],\, \forall d \in \mathcal{D}_d^x;\; x(D) \ne x_0(D) \right\}. \qquad (8)$$
Suppose the decoder wants to test whether it can disprove the optimality of $\{\tilde{x}(\mathcal{D}_d^x)\}$, i.e., whether $x_{ML}(D) \notin \{\tilde{x}(\mathcal{D}_d^x)\}$. A common practice [7][11][12] is to find a lower bound, denoted by $S_x^L(\tilde{x}(\mathcal{D}_d^x))$, of the negative SLLs of the messages in $\{\tilde{x}(\mathcal{D}_d^x)\}$,
$$S_x(x(D)) \ge S_x^L(\tilde{x}(\mathcal{D}_d^x)), \quad \forall x(D) \in \{\tilde{x}(\mathcal{D}_d^x)\}. \qquad (9)$$
If the lower bound $S_x^L(\tilde{x}(\mathcal{D}_d^x))$ is larger than $S_x(x_0(D))$ obtained in (7), then we have $S_x(x(D)) \ge S_x^L(\tilde{x}(\mathcal{D}_d^x)) > S_x(x_0(D))$ for all $x(D) \in \{\tilde{x}(\mathcal{D}_d^x)\}$, which means the ML message is not in $\{\tilde{x}(\mathcal{D}_d^x)\}$.

In Appendix B, we show that the SLL lower bounds that have appeared in the literature satisfy the following assumption.

Assumption 1: Given $\{\tilde{x}(\mathcal{D}_d^x)\}$, let $\mathcal{D}_d^y \subseteq [0, N+\nu)$ be the maximum time index set over which we can find a partial codeword $\tilde{y}(\mathcal{D}_d^y)$ such that, for all $x(D) \in \{\tilde{x}(\mathcal{D}_d^x)\}$ with $y(D) = x(D)G(D)$, we have $y[d] = \tilde{y}[d]$ for all $d \in \mathcal{D}_d^y$. Note that $\mathcal{D}_d^y$ and $\tilde{y}(\mathcal{D}_d^y)$ are uniquely determined by $\{\tilde{x}(\mathcal{D}_d^x)\}$. We also have $|\mathcal{D}_d^y| \le |\mathcal{D}_d^x| + \nu$. We assume the existence of a positive constant $\epsilon \in (0, 1)$, which does not depend on $N$, such that
$$S_x^L(\tilde{x}(\mathcal{D}_d^x)) \le \sum_{d \in \mathcal{D}_d^y} \|r[d] - g_q(\tilde{y}[d])\|^2 + (N + \nu - |\mathcal{D}_d^y|)(1-\epsilon)\,n\sigma^2. \qquad (10)$$

As demonstrated in [11][7], if we fix $N$, using $S_x^L(\tilde{x}(\mathcal{D}_d^x)) > S_x(x_0(D))$ as the OTC to disprove the optimality of the message set $\{\tilde{x}(\mathcal{D}_d^x)\}$ can bring significant complexity reduction to ML decoding, especially under high SNR. However, if we define $\mathcal{D}_e \subseteq \mathcal{D}_d^y$ as the subset of time indices corresponding to the erroneous codeword symbols, i.e.,
$$\mathcal{D}_e = \{d \,|\, d \in \mathcal{D}_d^y,\; \tilde{y}[d] \ne y_0[d]\}, \qquad (11)$$
the following lemma shows that SLL-based optimality tests become inefficient if $N - |\mathcal{D}_d^x|$ is taken to infinity while $|\mathcal{D}_e|$ is kept finite.

Lemma 1: Assume the generator matrix $G(D)$ is fixed, and therefore the coding memory $\nu$ is fixed. Consider message sets characterized by $\{\tilde{x}(\mathcal{D}_d^x)\}$ for arbitrary $\mathcal{D}_d^x$, but under the constraint of a fixed $\mathcal{D}_e$, where $\mathcal{D}_e \subseteq \mathcal{D}_d^y$ is defined in (11) and the derivation of $\mathcal{D}_d^y$ is specified in Assumption 1. If we fix the SNR and take $N - |\mathcal{D}_d^x|$ to infinity, we have
$$\lim_{N - |\mathcal{D}_d^x| \to \infty} P\left\{ S_x^L(\tilde{x}(\mathcal{D}_d^x)) > S_x(x_0(D)) \right\} = 0. \qquad (12)$$
If we first take $N - |\mathcal{D}_d^x|$ to infinity and then take the SNR to infinity, we have
$$\lim_{\mathrm{SNR} \to \infty} \lim_{N - |\mathcal{D}_d^x| \to \infty} P\left\{ S_x^L(\tilde{x}(\mathcal{D}_d^x)) > S_x(x_0(D)) \right\} = 0. \qquad (13)$$
(Note that the order in which the limits are taken in (13) is important. If we fix $N$ and take the SNR to infinity first, we can get $\lim_{N - |\mathcal{D}_d^x| \to \infty} \lim_{\mathrm{SNR} \to \infty} P\{S_x^L(\tilde{x}(\mathcal{D}_d^x)) > S_x(x_0(D))\} = 1$.)

Proof: Since $|\mathcal{D}_d^y| \le |\mathcal{D}_d^x| + \nu$, taking $N - |\mathcal{D}_d^x|$ to infinity implies taking $N - |\mathcal{D}_d^y|$ to infinity. According to Assumption 1, we have
$$\frac{S_x^L(\tilde{x}(\mathcal{D}_d^x)) - S_x(x_0(D))}{N + \nu - |\mathcal{D}_d^y|} \le \frac{\sum_{d \in \mathcal{D}_e} \|r[d] - g_q(\tilde{y}[d])\|^2}{N + \nu - |\mathcal{D}_d^y|} + (1-\epsilon)n\sigma^2 - \frac{\sum_{d \in \mathcal{D}_e} \|n[d]\|^2}{N + \nu - |\mathcal{D}_d^y|} - \frac{\sum_{d \notin \mathcal{D}_d^y} \|n[d]\|^2}{N + \nu - |\mathcal{D}_d^y|}. \qquad (14)$$
Since the $n[d]$ are i.i.d. Gaussian with covariance matrix $\sigma^2 I$, the $\|n[d]\|^2$ are i.i.d. (scaled $\chi^2$) with mean $n\sigma^2$ and variance $2n\sigma^4$. Therefore $\frac{1}{N+\nu-|\mathcal{D}_d^y|}\sum_{d \notin \mathcal{D}_d^y} \|n[d]\|^2 \to n\sigma^2$, while $\frac{1}{N+\nu-|\mathcal{D}_d^y|}\sum_{d \in \mathcal{D}_e} \|r[d] - g_q(\tilde{y}[d])\|^2 \to 0$ and $\frac{1}{N+\nu-|\mathcal{D}_d^y|}\sum_{d \in \mathcal{D}_e} \|n[d]\|^2 \to 0$, as $N - |\mathcal{D}_d^y| \to \infty$. Consequently, denoting the right hand side of (14) by $U$, we have, with probability one,
$$\lim_{N - |\mathcal{D}_d^y| \to \infty} U = -\epsilon n \sigma^2 < 0. \qquad (15)$$
This yields
$$\lim_{N - |\mathcal{D}_d^x| \to \infty} P\left\{ S_x^L(\tilde{x}(\mathcal{D}_d^x)) > S_x(x_0(D)) \right\} = \lim_{N - |\mathcal{D}_d^y| \to \infty} P\left\{ \frac{S_x^L(\tilde{x}(\mathcal{D}_d^x)) - S_x(x_0(D))}{N + \nu - |\mathcal{D}_d^y|} > 0 \right\} \le \lim_{N - |\mathcal{D}_d^y| \to \infty} P\{U > 0\} = 0. \qquad (16)$$
Since (16) holds for all SNR values, the conclusion remains true if we take the SNR to infinity after $N - |\mathcal{D}_d^x|$ is taken to infinity. $\blacksquare$
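The effect in Lemma 1 can be illustrated numerically. The sketch below (with hypothetical parameters throughout) draws the noise sequence, forms the largest lower bound permitted by (10) for a message set with $|\mathcal{D}_e| = 1$ and fixed $|\mathcal{D}_d^y| = \nu + 1$, and estimates $P\{S_x^L > S_x(x_0(D))\}$; the estimate decays toward zero as $N$ grows, as (12) predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, nu = 2, 3          # output dimension and coding memory (assumed values)
sigma2 = 0.05         # noise variance per dimension (assumed)
eps = 0.5             # the constant epsilon of Assumption 1 (assumed)
delta = 2.0           # offset of g_q at the single erroneous index (assumed)

def disprove_prob(N, trials=2000):
    """Estimate P{S_Lx > S_x(x0)} when |D_e| = 1 and |D_y| = nu + 1 is fixed."""
    dy = nu + 1
    hits = 0
    for _ in range(trials):
        noise = rng.normal(0.0, np.sqrt(sigma2), size=(N + nu, n))
        s_x0 = np.sum(noise ** 2)                       # eq. (7)
        # Largest lower bound allowed by (10): exact distances on D_y,
        # (1 - eps) * n * sigma2 per remaining index.
        err = np.sum((noise[0] + delta) ** 2) - np.sum(noise[0] ** 2)
        s_low = (s_x0 + err - np.sum(noise[dy:] ** 2)
                 + (N + nu - dy) * (1.0 - eps) * n * sigma2)
        hits += int(s_low > s_x0)
    return hits / trials

for N in (10, 100, 1000):
    print(N, disprove_prob(N))
```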
With the help of Lemma 1, the inefficiency of SLL-based optimality tests is characterized by the following lemma.

Lemma 2: Let $C_{sll}$ be the complexity of an ML decoder that only uses PCC- and SLL-based optimality tests for complexity reduction. Let $C_{va}$ be the complexity of the Viterbi decoder, in which only the PCC-based optimality test is used. For any $\delta > 0$, we have
$$\lim_{N \to \infty} P\{C_{sll} \ge (1-\delta)C_{va}\} = 1, \qquad \lim_{\mathrm{SNR} \to \infty} \lim_{N \to \infty} P\{C_{sll} \ge (1-\delta)C_{va}\} = 1. \qquad (17)$$
The proof of Lemma 2 is given in Appendix C.

IV. Neighboring Log Likelihood-Based Optimality Test

We propose in Theorem 1 a class of NLL-based optimality tests whose efficiency does not depend on the codeword length $N$. We show in Section V that these NLL-based optimality tests can significantly reduce the average complexity of ML decoding under high SNR. This is in contrast to the inefficiency of SLL-based optimality tests, which are not able to bring meaningful complexity reduction if $N$ is taken to infinity first.

Theorem 1: Define $d_{\min}$ and $d_{\max}$ by
$$d_{\min} = \min_{y_1 \ne y_2} \|g_q(y_1) - g_q(y_2)\|, \qquad d_{\max} = \max_{y_1 \ne y_2} \|g_q(y_1) - g_q(y_2)\|, \qquad (18)$$
where $y_1$, $y_2$ are $n$-dimensional row vectors over GF$(q)$. Let $\xi$ be an arbitrary constant and $M$ an arbitrary integer satisfying
$$0 < \xi < d_{\min}^2, \qquad M > \frac{\nu d_{\max}^2}{\xi}. \qquad (19)$$
Let $x(D)$ be a source message whose corresponding codeword is $y(D)$. For any time index $m$, if the following inequality is satisfied for all $d \in [m - 2M\nu, m + 2M\nu)$,
$$\|r[d] - g_q(y[d])\| < \frac{d_{\min}^2 - \xi}{2 d_{\min}}, \qquad (20)$$
and the following inequalities hold,
$$\sum_{d=m+2M\nu}^{m+(2M+1)\nu-1} \left( \|r[d] - g_q(y[d])\|^2 + 2 d_{\max} \|r[d] - g_q(y[d])\| \right) \le M\xi - \nu d_{\max}^2,$$
$$\sum_{d=m-(2M+1)\nu}^{m-2M\nu-1} \left( \|r[d] - g_q(y[d])\|^2 + 2 d_{\max} \|r[d] - g_q(y[d])\| \right) \le M\xi - \nu d_{\max}^2, \qquad (21)$$
then we must have $x[\tilde{m}] = x_{ML}[\tilde{m}]$, $\forall \tilde{m} \in [m, m+\nu)$.

We skip the proof of Theorem 1 since the result is implied by Theorem 3, presented in Section VI. Note that the values of $d_{\min}$ and $d_{\max}$ only depend on the $g_q()$ function. Hence, as long as $g_q()$ and $\nu$ are given, the values of $\xi$ and $M$ can be fixed, e.g., $\xi = d_{\min}^2/2$ and $M = \lceil 2\nu d_{\max}^2 / d_{\min}^2 \rceil + 1$. Given $M$, the optimality test presented in Theorem 1 tests the optimality of $\{x[\tilde{m}] \,|\, \tilde{m} \in [m, m+\nu)\}$ using the log likelihoods of channel output symbols within a fixed-sized time interval $[m - (2M+1)\nu, m + (2M+1)\nu)$. It is quite intuitive to see that the efficiency of the test does not depend on the codeword length if all other parameters are fixed.

The efficiency of the OTC proposed in Theorem 1 is characterized by the following lemma.

Lemma 3: Assume $\xi$ and $M$ are chosen to satisfy (19). Let $m$ be an arbitrary time index. Let $y(D)$ equal the transmitted codeword within the time interval $[m - (2M+1)\nu, m + (2M+1)\nu)$. Define $\mathrm{OPT}_m$ as the event that (21) is satisfied and (20) is satisfied for all $d \in [m - 2M\nu, m + 2M\nu)$. Fixing all other parameters and taking the SNR to infinity, we have
$$\lim_{\mathrm{SNR} \to \infty} P\{\mathrm{OPT}_m\} = 1. \qquad (22)$$
The same conclusion holds if we first take $N$ to infinity and then take the SNR to infinity:
$$\lim_{\mathrm{SNR} \to \infty} \lim_{N \to \infty} P\{\mathrm{OPT}_m\} = 1. \qquad (23)$$

Proof: If $y(D)$ equals the transmitted codeword within the time interval $[m - (2M+1)\nu, m + (2M+1)\nu)$, then for $d \in [m - (2M+1)\nu, m + (2M+1)\nu)$ we have
$$r[d] - g_q(y[d]) = n[d]. \qquad (24)$$
Consequently, (22) and (23) hold because the $\|n[d]\|^2$ are i.i.d. (scaled $\chi^2$), whose mean, $n/(2\,\mathrm{SNR})$, and variance, $n/(2\,\mathrm{SNR}^2)$, converge to 0 as the SNR goes to infinity. $\blacksquare$
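The conditions of Theorem 1 (in the form reconstructed above) are simple windowed checks on the per-symbol distances. The following sketch packages them as a predicate; the calling convention and names are our own, and $r$ and $gy$ are arrays holding the channel output and $g_q(y[d])$ for the candidate codeword.

```python
import numpy as np

def nll_optimality_test(r, gy, m, nu, M, xi, d_min, d_max):
    """Check conditions (20)-(21) for the symbols x[m], ..., x[m+nu-1].
    r and gy are (N + nu - 1) x n arrays; returns True if optimality of the
    tested symbols is confirmed."""
    lo, hi = m - (2 * M + 1) * nu, m + (2 * M + 1) * nu
    if lo < 0 or hi > len(r):
        return False                              # the window must fit in the block
    t = np.linalg.norm(r - gy, axis=1)            # ||r[d] - g_q(y[d])|| for all d
    # Condition (20) on the middle window [m - 2M*nu, m + 2M*nu):
    if not np.all(t[m - 2 * M * nu: m + 2 * M * nu] < (d_min ** 2 - xi) / (2 * d_min)):
        return False
    # Condition (21) on the two guard windows of length nu:
    for w in (t[m + 2 * M * nu: m + (2 * M + 1) * nu],
              t[m - (2 * M + 1) * nu: m - 2 * M * nu]):
        if np.sum(w ** 2 + 2 * d_max * w) > M * xi - nu * d_max ** 2:
            return False
    return True
```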
Lemma 3 implies that, if there is a suboptimal decoder whose probability of symbol detection error (as opposed to sequence detection error) is low under high SNR, then NLL-based optimality tests can help transform the suboptimal detector into an ML detector with only a marginal increase in average decoding complexity. An example of such a transformation is presented in the following section.

V. A Three-Step ML Decoding Framework

The communication system given in Section II follows a discrete-time hidden Markov model [13], where each Markov state at time index $d$ corresponds to a possible combination of source symbols in the time interval $(d-\nu, d]$. If a decoder obtains the ML codeword using the VA, all Markov states within the time interval $[\nu, N]$ have to be visited. Alternatively, if one can use a low complexity algorithm to disprove the optimality of most of the Markov states, then the VA can limit its search by visiting only a small subset of Markov states.

Following this idea, the three-step ML decoding framework is given as follows.
• Step 1: The decoder uses a suboptimal algorithm (denoted by $\Phi_{sub}$) to obtain a quick guess of the codeword $\tilde{y}(D)$ and its corresponding source message $\tilde{x}(D)$.
• Step 2: An NLL-based optimality test (specified in Theorem 1) is applied to each of the source symbols of $\tilde{x}(D)$. The decoder maintains a source symbol set sequence $\mathcal{X}(D)$, with $\mathcal{X}[d]$ being the source symbol set at time index $d$. If $\tilde{x}[d] = x_{ML}[d]$ can be confirmed by the optimality test, we let $\mathcal{X}[d] = \{\tilde{x}[d]\}$; otherwise, we let $\mathcal{X}[d]$ be the set of all possible source symbol vectors at time index $d$.
• Step 3: The decoder uses a modified VA to search for the ML source message. The only difference between the modified VA and the conventional VA is that the modified VA visits a Markov state only if all source symbols corresponding to the Markov state belong to the source symbol sets $\mathcal{X}[d]$ of the corresponding time indices.

Implementing the modified VA is quite straightforward; a brief sketch is given below. Compared to the three-step decoding algorithm studied in [7], the key advantage of using an NLL-based optimality test is that the test can be applied to an individual source symbol rather than to the whole source message.
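A minimal sketch of the framework, for the same assumed binary $k = 1$ code as before, is given below. The first-step decoder `sub_decode` and the per-index test `nll_confirm` are left abstract (any $\Phi_{sub}$ and any implementation of the Theorem 1 test can be supplied); all names are our own.

```python
import numpy as np

def three_step_decode(r, G, g2, nll_confirm, sub_decode):
    """Sketch of the three-step framework for a k = 1 code over GF(2)."""
    nu = len(G)
    N = len(r) - nu + 1
    x_guess = sub_decode(r)                      # Step 1: quick suboptimal guess
    X = [[x_guess[d]] if nll_confirm(r, x_guess, d) else [0, 1]
         for d in range(N)]                      # Step 2: per-symbol candidate sets
    return restricted_viterbi(r, G, X, g2)       # Step 3: modified VA

def restricted_viterbi(r, G, X, g2):
    """Modified VA: a Markov state is visited only if every source bit that
    defines it belongs to the candidate sets X[d]."""
    nu = len(G)
    N = len(X)
    surv = {(0,) * (nu - 1): (0.0, [])}          # state -> (metric, decided bits)
    for d in range(N + nu - 1):
        nxt = {}
        for s, (metric, path) in surv.items():
            for b in (X[d] if d < N else [0]):   # x[d] = 0 for d >= N (tail)
                w = (b,) + s                     # (x[d], x[d-1], ..., x[d-nu+1])
                y = sum(w[l] * G[l][0] for l in range(nu)) % 2
                m2 = metric + float(np.sum((r[d] - g2(y)) ** 2))
                s2 = w[:-1]
                if s2 not in nxt or m2 < nxt[s2][0]:
                    nxt[s2] = (m2, path + [b])
        surv = nxt
    best_metric, best_path = min(surv.values(), key=lambda mp: mp[0])
    return best_path[:N]
```

When every $\mathcal{X}[d]$ is a singleton, the restricted trellis degenerates to a single path, which is the source of the vanishing third-step complexity claimed in Theorem 2 below.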
Theorem 2: Let $P_e\{\Phi_{sub}\}$ be the probability of symbol detection error of $\Phi_{sub}$. Assume, while fixing all other parameters,
$$\lim_{\mathrm{SNR} \to \infty} P_e\{\Phi_{sub}\} = 0, \qquad \lim_{\mathrm{SNR} \to \infty} \lim_{N \to \infty} P_e\{\Phi_{sub}\} = 0. \qquad (25)$$
Let $C_{mva}$ be the average number of Markov states per time unit visited by the modified VA in the third step of the ML decoder. For any $\delta > 0$, we have
$$\lim_{\mathrm{SNR} \to \infty} P\{C_{mva} \le \delta\} = 1, \qquad \lim_{\mathrm{SNR} \to \infty} \lim_{N \to \infty} P\{C_{mva} \le \delta\} = 1. \qquad (26)$$

Proof: Let $x_0(D)$, $y_0(D)$ be the actual source message and the transmitted codeword, respectively. Let $\tilde{x}(D)$, $\tilde{y}(D)$ be the source message and the codeword output by $\Phi_{sub}$. According to (25), for any time index $m$, we have
$$\lim_{\mathrm{SNR} \to \infty} P\left\{ \tilde{y}[d] = y_0[d],\; \forall d \in [m - (2M+1)\nu, m + (2M+1)\nu) \right\} = 1, \qquad (27)$$
where $M$ is the parameter of the NLL-based optimality test, specified in Theorem 1. According to (27), Lemma 3, and Theorem 1, for any $m$, if $\tilde{y}[d] = y_0[d]$, $\forall d \in [m - (2M+1)\nu, m + (2M+1)\nu)$, then the probability that the NLL-based optimality test can confirm $\tilde{x}[d] = x_{ML}[d]$, $\forall d \in [m, m+\nu)$, converges to one as $\mathrm{SNR} \to \infty$. Consequently, letting $\mathcal{X}[d]$ be the source symbol set maintained by the ML decoder in the second step, we have
$$\lim_{\mathrm{SNR} \to \infty} P\{|\mathcal{X}[d]| = 1,\; \forall d \in [m, m+\nu)\} = 1, \quad \forall m. \qquad (28)$$
Since the worst case complexity of the modified VA is bounded, (28) implies that, for any $\delta > 0$, $\lim_{\mathrm{SNR} \to \infty} P\{C_{mva} \le \delta\} = 1$. Since all derivations hold if we first take $N$ to infinity, we also have $\lim_{\mathrm{SNR} \to \infty} \lim_{N \to \infty} P\{C_{mva} \le \delta\} = 1$. $\blacksquare$

By sharing computations among the optimality tests, it is easy to see that the complexity of the second step of the ML decoder is equivalent, in order, to visiting one Markov state per time unit. Therefore, if $\Phi_{sub}$ satisfies (25), as $\mathrm{SNR} \to \infty$, the complexity of the three-step ML decoder converges to the complexity of $\Phi_{sub}$, which can be significantly lower than the complexity of the VA. Moreover, the three steps of the ML decoder can be implemented in a parallelized manner, in the sense that each step can process some of the source symbols without waiting for the previous step to completely finish its work. An example of such a parallelized implementation can be found in [15, The Simple MLSD Algorithm].

VI. Maximum Likelihood Sequence Detection in a Class of Hidden Markov Systems

In this section, we generalize the results of Section IV to ML sequence detection (MLSD) in a class of first order discrete-time hidden Markov systems [13]. We demonstrate in Appendix D that the communication system presented in Section II satisfies the model and the key assumptions given in this section.

Let $u(D) = u[d]D^{d} + u[d+1]D^{d+1} + \cdots$ be a first order Markov sequence, where $d$ is the time index, possibly negative; $u[d]$ represents the Markov state (at time $d$), which is a $k\nu$-dimensional row vector defined over GF$(q)$. We assume $u[d] = \mathbf{0}$ for $d < 0$ or $d \ge N$, with $N$ being the sequence length. Define $y[d] = y(u[d])$ as the "processed state", which is a deterministic function of $u[d]$; $y[d]$ is an $n$-dimensional row vector defined over GF$(q)$. We term $y(D) = y[d]D^{d} + y[d+1]D^{d+1} + \cdots$ the processed state sequence. Let $r(D) = r[d]D^{d} + r[d+1]D^{d+1} + \cdots$ be the observation sequence, where $r[d]$ is an $n$-dimensional row vector with real-valued elements.

Denote the state transition probability of the hidden Markov system by
$$P_t(u_2 \,|\, u_1) = P\{u[d+1] = u_2 \,|\, u[d] = u_1\}. \qquad (29)$$
Define the transition probability ratio bound $p_{tr}$ by
$$p_{tr} = \min_{\substack{u_1, u_2:\; P_t(u_2|u_1) > 0 \\ u_3, u_4:\; P_t(u_4|u_3) > 0}} \frac{P_t(u_2 \,|\, u_1)}{P_t(u_4 \,|\, u_3)}. \qquad (30)$$
We assume the Markov chain is ergodic and homogeneous. Therefore, there exists a positive integer $\nu$ such that
$$P\{u[d+\nu] = u_2 \,|\, u[d] = u_1\} \ne 0, \quad \forall u_1, u_2. \qquad (31)$$
Denote the observation distribution function by
$$F_o(r \,|\, y) = P\{r[d] \le r \,|\, y[d] = y\}. \qquad (32)$$
Let the corresponding probability density function (or probability mass function) be $f_o(r \,|\, y)$.

We also make the following two key assumptions.

Assumption 2: We assume that the state processing $y[d] = y(u[d])$ does not compromise the observability of the Markov states, in the sense that there exists a positive integer $\nu$ satisfying the following property. Given two Markov state sequences $u(D)$ and $\tilde{u}(D)$, for any time index $d$, if $u[d] \ne \tilde{u}[d]$, then we can find a time index $m \in (d-\nu, d+\nu)$ such that $y(u[m]) \ne y(\tilde{u}[m])$.

Note that we used the same constant $\nu$ in (31) and in Assumption 2. This is valid because, if (31) is satisfied for $\nu = \nu_0$, then it is also satisfied for all $\nu \ge \nu_0$; a similar property applies to Assumption 2.
Consequently, if Assumption 2 holds, a common integer $\nu$ satisfying both (31) and Assumption 2 can always be found.

Assumption 3: Assume the existence of two functions, $L_l(r, y)$ and $L_u(r, y)$, both functions of the channel output symbol $r$ and the processed state $y$. Assume $L_l(r, y)$ and $L_u(r, y)$ have the following two properties. First, the following inequalities hold for all $r$ and $y$:
$$L_l(r, y) \le \min_{y_1:\, y_1 \ne y} \left[ -\log(f_o(r|y_1)) + \log(f_o(r|y)) \right],$$
$$L_u(r, y) \ge \max_{y_1 \ne y_2} \left[ -\log(f_o(r|y_1)) + \log(f_o(r|y_2)) \right]. \qquad (33)$$
Second, the complexity of evaluating $L_l(r, y)$ and $L_u(r, y)$ is low, in the sense that they do not require the search of any processed state other than $y$.

Note that the validity of the results presented in this section does not depend on the second property imposed in Assumption 3. However, we still include the property in the assumption, since the key motivation for posing Assumption 3 is to use the two functions $L_l(r, y)$ and $L_u(r, y)$ as tools to avoid exhaustive Markov state search and hence to reduce the complexity of ML decoding. Also note that the right hand side of the second inequality in (33) is not a function of $y$. However, the upper bound on the left hand side is a function of a processed state $y$, since one often needs a "reference state" in order to upper bound the right hand side of (33). Further explanation is given in Appendix D.

Given the observation sequence $r(D)$, the negative SLL of a state sequence $u(D)$ is obtained by
$$S_u(u(D)) = -\sum_{d=0}^{N} \log\left( f_o(r[d] \,|\, y[d])\, P_t(u[d] \,|\, u[d-1]) \right). \qquad (34)$$
The objective of MLSD is to find the ML sequence that minimizes the negative SLL,
$$u_{ML}(D) = \operatorname*{argmin}_{u[d],\; 0 \le d < N} S_u(u(D)). \qquad (35)$$

Theorem 3: Assume the discrete-time Markov system satisfies Assumptions 2 and 3. Let $\rho > 0$ be an arbitrary constant. Let the actual state sequence be $u_0(D)$, with corresponding processed states $y_0(D)$. Let $p_{tr}$ be defined by (30). For any time index $m$, if there is an integer $M > 0$ such that, for all $d \in [m - 2M\nu, m + 2M\nu)$,
$$L_l(r[d], y_0[d]) > 3\nu(\rho - \log p_{tr}), \qquad (36)$$
and
$$\sum_{d=m+2M\nu}^{m+(2M+1)\nu-1} L_u(r[d], y_0[d]) \le 3M\nu\rho + (\nu+1)\log p_{tr},$$
$$\sum_{d=m-(2M+1)\nu}^{m-2M\nu-1} L_u(r[d], y_0[d]) \le 3M\nu\rho + \nu\log p_{tr}, \qquad (37)$$
then $u_0[m+\nu-1] = u_{ML}[m+\nu-1]$ must be true.

The proof of Theorem 3 is given in Appendix E. Note that Theorem 3 implies Theorem 1 if we set the parameters in Theorem 1 at the corresponding values given in Appendix D.
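In code form, the test of Theorem 3 (again in the form reconstructed above) is a direct windowed evaluation of the two bounding functions of Assumption 3. In the sketch below, $L_l$ and $L_u$ are passed in as callables, and the calling convention is our own:

```python
def theorem3_test(r, y0, L_l, L_u, m, nu, M, rho, log_ptr):
    """Check conditions (36)-(37) for the Markov state u0[m+nu-1].
    r and y0 are indexable sequences of observations and processed states;
    L_l and L_u are the bounding functions of Assumption 3."""
    if m - (2 * M + 1) * nu < 0 or m + (2 * M + 1) * nu > len(r):
        return False
    if any(L_l(r[d], y0[d]) <= 3 * nu * (rho - log_ptr)
           for d in range(m - 2 * M * nu, m + 2 * M * nu)):
        return False                              # condition (36) fails
    right = sum(L_u(r[d], y0[d])
                for d in range(m + 2 * M * nu, m + (2 * M + 1) * nu))
    left = sum(L_u(r[d], y0[d])
               for d in range(m - (2 * M + 1) * nu, m - 2 * M * nu))
    return (right <= 3 * M * nu * rho + (nu + 1) * log_ptr
            and left <= 3 * M * nu * rho + nu * log_ptr)   # condition (37)
```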
For communication systems following a discrete-time hidden Markov model, $f_o(r|y)$ often belongs to an ensemble of density (or probability) functions, with the actual realization determined by the SNR. In other words, we can write the observation density (or probability) $f_o(r \,|\, y, \mathrm{SNR})$ as a function of the SNR, and both functions $L_l(r, y)$ and $L_u(r, y)$ in Assumption 3 can then also be functions of the SNR. We make the following assumption.

Assumption 4: Assume the observation density (or probability) $f_o(r \,|\, y, \mathrm{SNR})$ is a function of the SNR. Assume the discrete-time Markov system satisfies Assumption 3. Let the actual state sequence and the processed state sequence be $u_0(D)$ and $y_0(D)$, respectively. Define two positive numbers $d_l$ and $d_u$ as follows:
$$d_l = \sup\left\{ \gamma \ge 0;\; \lim_{\mathrm{SNR} \to \infty} P\{L_l(r[d], y_0[d]) \ge \gamma\, \mathrm{SNR}\} = 1 \right\},$$
$$d_u = \inf\left\{ \gamma \ge 0;\; \lim_{\mathrm{SNR} \to \infty} P\{L_u(r[d], y_0[d]) \le \gamma\, \mathrm{SNR}\} = 1 \right\}. \qquad (38)$$
We assume
$$d_l > 0, \qquad d_u < \infty. \qquad (39)$$

The following lemma characterizes the efficiency of the OTC proposed in Theorem 3.

Lemma 4: Assume the discrete-time Markov system satisfies Assumptions 2 and 4. Let the actual state sequence be $u_0(D)$. Let $\xi$ be an arbitrary constant and $M$ an arbitrary integer satisfying
$$0 < \xi < d_l, \qquad M > \frac{\nu d_u}{\xi}. \qquad (40)$$
Let $\rho = \frac{\xi\, \mathrm{SNR}}{3\nu}$. Given an arbitrary time index $m$, define $\mathrm{OPT}_m$ as the event that (37) is satisfied and (36) is satisfied for all $d \in [m - 2M\nu, m + 2M\nu)$. If we fix all other parameters except the SNR, we have
$$\lim_{\mathrm{SNR} \to \infty} P\{\mathrm{OPT}_m\} = 1. \qquad (41)$$
If we fix all other parameters except the SNR and the sequence length $N$, we have
$$\lim_{\mathrm{SNR} \to \infty} \lim_{N \to \infty} P\{\mathrm{OPT}_m\} = 1. \qquad (42)$$
We skip the proof of Lemma 4 since it is quite straightforward.

Note that in Lemma 4, when we take $N$ and the SNR to infinity, $M$ can be fixed at a constant. This indicates that, when testing the optimality of a Markov state at a given time index, the NLL-based optimality test only uses observation symbols in a fixed-sized time neighborhood. Based on Theorem 3 and Lemma 4, a three-step ML sequence detector similar to the one presented in Section V can be developed to transform a suboptimal sequence detector into a low complexity ML sequence detector. The detailed discussion is skipped since it does not essentially differ from the one presented in Section V.

VII. Further Discussions

In a practical system, suboptimal decoders such as the belief-propagation-based iterative decoders [5][6] can achieve near optimal error performance with low complexity. It is natural to ask: if suboptimal decoding only causes a negligible performance loss, why should one even bother with enforcing the ML solution? Note that this question does not suggest a default answer, since the argument can also be presented in the opposite direction, i.e., if ML decoding only causes a negligible complexity increase, why should one not use an ML decoder? Nevertheless, the purpose of our work is not to participate in the debate on whether ML decoding is practically useful. Rather, one should interpret Theorem 2 as saying that, for convolutional codes, the existence of a well-performing low complexity suboptimal algorithm implies that ML decoding can be carried out with a similar complexity under high SNR. More importantly, such a conclusion holds irrespective of the codeword length.

Although the efficiency of NLL-based optimality tests does not depend on the codeword length, SLL-based optimality tests are inefficient only when the codeword length is large. Lemma 1 and Theorem 2 suggest that the complexity reduction brought by NLL-based optimality tests can be superior to that of SLL-based optimality tests even for moderate SNR, provided the codeword length is large enough.

Appendix

A. The Path Covering Criterion

Assume the discrete-time hidden Markov model given in Section VI. (It is shown in Appendix D that the model is satisfied by the communication system given in Section II.) Given the observation sequence $r(D)$, let $\tilde{u}(D)$ and $u(D)$ be two Markov state sequences whose corresponding processed state sequences are $\tilde{y}(D)$ and $y(D)$, respectively. If we can find two time indices $d_1 < d_2$ such that $\tilde{u}[d_1] = u[d_1]$, $\tilde{u}[d_2] = u[d_2]$, and
$$\sum_{d=d_1+1}^{d_2} \log \frac{f_o(r[d] \,|\, \tilde{y}[d])\, P_t(\tilde{u}[d] \,|\, \tilde{u}[d-1])}{f_o(r[d] \,|\, y[d])\, P_t(u[d] \,|\, u[d-1])} < 0, \qquad (43)$$
we say $u(D)$ "covers" $\tilde{u}(D)$.
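The covering check (43) transcribes directly into code. In the sketch below, $f_o$ and $P_t$ are supplied as callables under our own calling convention, and all transitions on both paths are assumed to have positive probability:

```python
import math

def covers(r, u, y, u_t, y_t, f_o, P_t, d1, d2):
    """Eq. (43): return True if path u covers path u_t between the agreement
    indices d1 < d2 (u[d1] == u_t[d1] and u[d2] == u_t[d2] are assumed)."""
    total = 0.0
    for d in range(d1 + 1, d2 + 1):
        total += math.log(f_o(r[d], y_t[d]) * P_t(u_t[d], u_t[d - 1]))
        total -= math.log(f_o(r[d], y[d]) * P_t(u[d], u[d - 1]))
    return total < 0.0
```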
Path Covering Criterion: A Markov state sequence $\tilde{u}(D)$ cannot be the ML sequence if we can find another state sequence $u(D)$ that covers $\tilde{u}(D)$.

The proof of the PCC is skipped since it is quite well known [8].

We say $u(D)$ is a "cover" path with respect to Markov states $u[d_1]$ and $u[d_2]$ at time indices $d_1 < d_2$ if, among all Markov paths passing $u[d_1]$ and $u[d_2]$, $u(D)$ maximizes $\sum_{d=d_1+1}^{d_2} \log\left( f_o(r[d]|y[d])\, P_t(u[d]|u[d-1]) \right)$. We say $u(D)$ is a "cover" path with respect to a Markov state $u[d_2]$ at time index $d_2 \ge 0$ if, among all Markov paths passing $u[d_2]$ and satisfying $u[-1] = \mathbf{0}$, $u(D)$ maximizes $\sum_{d=0}^{d_2} \log\left( f_o(r[d]|y[d])\, P_t(u[d]|u[d-1]) \right)$.

B. Examples of SLL-based Optimality Tests Satisfying Assumption 1

In [12][11], when the decoder branches a Markov path at time index $m < N$, the branch is characterized by a partial message $\{\tilde{x}[0], \tilde{x}[1], \ldots, \tilde{x}[m]\}$. For any codeword $\tilde{y}(D)$ associated with the branch, we have
$$\tilde{y}[d] = \sum_{l=0}^{\nu-1} \tilde{x}[d-l]\, G[l]. \qquad (44)$$
In other words, $\mathcal{D}_d^x = [0, m]$. The negative SLL lower bound is given by
$$S_y(\tilde{y}(D)) = \sum_{d=0}^{N+\nu-1} \|r[d] - g_q(\tilde{y}[d])\|^2 \ge \sum_{d=0}^{m} \left\| r[d] - g_q\!\left( \sum_{l=0}^{\nu-1} \tilde{x}[d-l]\, G[l] \right) \right\|^2, \qquad (45)$$
which satisfies Assumption 1 with $\epsilon = 1$.

In [7], several SLL-based OTCs were presented for decoding block codes. The decoder obtains a first guess $y_0(D)$ of the codeword. A negative SLL lower bound $S_y^L \le S_y(\tilde{y}(D))$, $\tilde{y}(D) \ne y_0(D)$, is then developed for the codeword set $\{\tilde{y}(D) \ne y_0(D)\}$, which corresponds to the case of $\mathcal{D}_d^x$ being an empty set in the context of Section III. $y_0(D)$ is optimal if the optimality test $S_y^L > S_y(y_0(D))$ gives a positive answer [7].

The lower bounds $S_y^L$ presented in [7, Section III] satisfy the following inequality:
$$S_y^L \le \min_{\tilde{y}(D) \ne y_0(D)} \sum_{d=0}^{N+\nu-1} \|g_q(\tilde{y}[d]) - g_q(y_0[d])\|^2. \qquad (46)$$
Since the coding memory is $\nu$, we can always find a codeword $\tilde{y}(D) \ne y_0(D)$ with $\tilde{y}(D)$ differing from $y_0(D)$ at no more than $\nu$ codeword symbols. This implies that the right hand side of (46) can be upper bounded by a constant, denoted by $U_0$, which is not a function of $N$:
$$S_y^L \le \min_{\tilde{y}(D) \ne y_0(D)} \sum_{d=0}^{N+\nu-1} \|g_q(\tilde{y}[d]) - g_q(y_0[d])\|^2 \le U_0. \qquad (47)$$
Consequently, given $\mathrm{SNR} > 0$ and $0 < \epsilon < 1$, there exists a constant $N_0$ such that Assumption 1 is satisfied for $N > N_0$.

C. Proof of Lemma 2

Proof: Assume that, in searching for the ML codeword, the decoder successfully avoided visiting a Markov state specified by $\{x[d-\nu+1], \ldots, x[d]\}$. This implies that we can find two time index sets, $\mathcal{D}_x \subset [d-\nu+1, d]$ and $\mathcal{D}_d^x$ with $\mathcal{D}_d^x \cap [d-\nu+1, d] = \phi$, such that the optimality of all message sets $\{\tilde{x}(\mathcal{D}_x \cup \mathcal{D}_d^x)\}$ with $\tilde{x}[\tilde{d}] = x[\tilde{d}]$, $\forall \tilde{d} \in \mathcal{D}_x$, is disproved. We choose $\mathcal{D}_d^x$ with the maximum cardinality, while making sure that, in disproving the optimality of $\{x[d-\nu+1], \ldots, x[d]\}$, the detector visited all the Markov states $\{\tilde{x}[\tilde{d}-\nu+1], \ldots, \tilde{x}[\tilde{d}]\}$ satisfying $[\tilde{d}-\nu+1, \tilde{d}] \subseteq \mathcal{D}_d^x$.

According to the definitions of $\mathcal{D}_x$ and $\mathcal{D}_d^x$, the decoder needs to disprove the optimality of a special message set $\{x_1(\mathcal{D}_x \cup \mathcal{D}_d^x)\}$, defined by $x_1[\tilde{d}] = x[\tilde{d}]$, $\forall \tilde{d} \in \mathcal{D}_x$, and $x_1[\tilde{d}] = x_0[\tilde{d}]$, $\forall \tilde{d} \in \mathcal{D}_d^x$.
The definition of $\mathcal{D}_d^x$ also implies that the decoder needs to obtain a lower bound $S_x^L(\tilde{x}(\mathcal{D}_x \cup \mathcal{D}_d^x))$ of the negative SLLs of the messages in $\{\tilde{x}(\mathcal{D}_x \cup \mathcal{D}_d^x)\}$. The lower bound $S_x^L(\tilde{x}(\mathcal{D}_x \cup \mathcal{D}_d^x))$ should only be a function of the partial message $\tilde{x}(\mathcal{D}_x \cup \mathcal{D}_d^x)$, and should not depend on any source message symbol whose time index is outside $\mathcal{D}_x \cup \mathcal{D}_d^x$. However, since the corresponding $\mathcal{D}_e$ (defined in (11)) of $\{x_1(\mathcal{D}_x \cup \mathcal{D}_d^x)\}$ satisfies $|\mathcal{D}_e| \le \nu$, according to Lemma 1, the probability of disproving the optimality of $\{x_1(\mathcal{D}_x \cup \mathcal{D}_d^x)\}$ (using an SLL-based optimality test) is low if $N - |\mathcal{D}_x \cup \mathcal{D}_d^x| \gg \nu$.

To make the argument explicit, the fact that the decoder visits all Markov states $\{\tilde{x}[\tilde{d}-\nu+1], \ldots, \tilde{x}[\tilde{d}]\}$ with $[\tilde{d}-\nu+1, \tilde{d}] \subseteq \mathcal{D}_d^x$ implies
$$C_{sll} \ge \frac{|\mathcal{D}_d^x| - \nu}{N + \nu}\, C_{va}. \qquad (48)$$
According to Lemma 1, for any positive constant $\delta > 0$, if we fix all other parameters and take $N$ to infinity, we have
$$\lim_{N \to \infty} P\left\{ N - |\mathcal{D}_d^x| - |\mathcal{D}_x| < \delta N \right\} = 1. \qquad (49)$$
(An equivalent statement of (49) is: if $N - |\mathcal{D}_d^x| - |\mathcal{D}_x| \ge \delta N$, then, as $N \to \infty$, the probability of disproving the optimality of all message sets $\{\tilde{x}(\mathcal{D}_x \cup \mathcal{D}_d^x)\}$ with $\tilde{x}[\tilde{d}] = x[\tilde{d}]$, $\forall \tilde{d} \in \mathcal{D}_x$, using SLL-based optimality tests, goes to zero.) Combining (48) and (49), we get
$$\lim_{N \to \infty} P\{C_{sll} \ge (1-\delta)C_{va}\} = 1. \qquad (50)$$
Since (50) holds for any fixed SNR, it still holds if we take the SNR to infinity after taking $N$ to infinity, i.e.,
$$\lim_{\mathrm{SNR} \to \infty} \lim_{N \to \infty} P\{C_{sll} \ge (1-\delta)C_{va}\} = 1. \qquad (51)$$
$\blacksquare$

D. The Hidden Markov Model and Its Key Assumptions

In this section, we show that the communication system presented in Section II satisfies the discrete-time hidden Markov model and the key assumptions given in Section VI.

Consider a communication system as modeled in Section II. Define $u[d] = [x[d-\nu+1], \ldots, x[d]]$. It is easy to see that $u(D)$ is a Markov sequence. The processed state $y[d] = y(u[d])$ is only a function of the corresponding Markov state. If two Markov states at successive time indices take the form
$$u[d] = [\tilde{x}[d-\nu+1], \ldots, \tilde{x}[d]], \qquad u[d+1] = [\tilde{x}[d-\nu+2], \ldots, \tilde{x}[d+1]], \qquad (52)$$
for some $\tilde{x}(D)$, then we have
$$P_t(u[d+1] \,|\, u[d]) = \frac{1}{q^k}. \qquad (53)$$
Otherwise $P_t(u[d+1] \,|\, u[d]) = 0$. According to (30), we have $p_{tr} = 1$.

Since $u[d] = [x[d-\nu+1], \ldots, x[d]]$ does not depend on source symbols at time indices $m \le d - \nu$, we know
$$P_t(u[d] = u_2 \,|\, u[d-\nu] = u_1) \ne 0, \quad \forall u_1, u_2, \qquad (54)$$
so that (31) is satisfied. The observation density is given by
$$f_o(r \,|\, y) = \left( \frac{\mathrm{SNR}}{\pi} \right)^{n/2} \exp\left( -\mathrm{SNR}\, \|r - g_q(y)\|^2 \right). \qquad (55)$$
Next, we show that Assumption 2 is satisfied. Let $u(D)$ and $\tilde{u}(D)$ be two Markov state sequences. Let $x(D)$ and $y(D)$ be the source message and the codeword corresponding to $u(D)$; let $\tilde{x}(D)$ and $\tilde{y}(D)$ be the source message and the codeword corresponding to $\tilde{u}(D)$. For a time index $d$, if $u[d] \ne \tilde{u}[d]$, we can find a time index $m \in (d-\nu, d]$ such that $x[m] \ne \tilde{x}[m]$. Consequently, according to [14, Corollary 2], we can find a time index $\tilde{m} \in [m, m+\nu)$ such that $y[\tilde{m}] \ne \tilde{y}[\tilde{m}]$. Therefore, Assumption 2 holds because $\tilde{m} \in (d-\nu, d+\nu)$.

Let $d_{\min}$ and $d_{\max}$ be defined as in Theorem 1. Let $y_1 \ne y_2$ be two arbitrary codeword symbols. We have the following triangle inequalities:
$$\|r - g_q(y_1)\| \ge \left|\, \|g_q(y_1) - g_q(y_2)\| - \|r - g_q(y_2)\| \,\right|,$$
$$\|r - g_q(y_1)\| \le \|g_q(y_1) - g_q(y_2)\| + \|r - g_q(y_2)\|. \qquad (56)$$
Write $t = \|r - g_q(y_2)\|$ and $D = \|g_q(y_1) - g_q(y_2)\| \in [d_{\min}, d_{\max}]$. The first inequality in (56) implies $\|r - g_q(y_1)\|^2 - t^2 \ge (D - t)^2 - t^2 = D(D - 2t)$; since $D(D-2t)$ is convex in $D$, its minimum over $[d_{\min}, d_{\max}]$ is attained at an endpoint. Hence
$$\min_{y_1:\, y_1 \ne y_2} \left[ -\log(f_o(r|y_1)) + \log(f_o(r|y_2)) \right] = \min_{y_1:\, y_1 \ne y_2} \mathrm{SNR}\left( \|r - g_q(y_1)\|^2 - \|r - g_q(y_2)\|^2 \right) \ge \mathrm{SNR} \min\left\{ d_{\min}(d_{\min} - 2t),\; d_{\max}(d_{\max} - 2t) \right\}. \qquad (57)$$
The second inequality in (56) implies
$$\max_{y_1 \ne y_2} \left[ -\log(f_o(r|y_1)) + \log(f_o(r|y_2)) \right] = \max_{y_1 \ne y_2} \mathrm{SNR}\left( \|r - g_q(y_1)\|^2 - \|r - g_q(y_2)\|^2 \right) \le \max_{y_1} \mathrm{SNR}\, \|r - g_q(y_1)\|^2 \le \mathrm{SNR}\left( \|r - g_q(y)\| + d_{\max} \right)^2, \qquad (58)$$
where $y$ is an arbitrary reference codeword symbol. Therefore, Assumption 3 is satisfied by defining
$$L_l(r, y) = \mathrm{SNR} \min\left\{ d_{\min}\left(d_{\min} - 2\|r - g_q(y)\|\right),\; d_{\max}\left(d_{\max} - 2\|r - g_q(y)\|\right) \right\},$$
$$L_u(r, y) = \mathrm{SNR}\left( \|r - g_q(y)\| + d_{\max} \right)^2. \qquad (59)$$
Note that evaluating $L_l(r, y)$ and $L_u(r, y)$ does not involve visiting any processed state other than $y$.

If $y_0[d]$ and $r[d]$ are the actual codeword symbol and the channel output at time index $d$, then $\|r[d] - g_q(y_0[d])\|^2 = \|n[d]\|^2$ is a (scaled) $\chi^2$ random variable with mean $n/(2\,\mathrm{SNR})$ and variance $n/(2\,\mathrm{SNR}^2)$. From (59), it is easily seen that Assumption 4 is satisfied, with $d_l = d_{\min}^2 > 0$ and $d_u = d_{\max}^2 < \infty$.
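The bounds in (59) can be sanity-checked numerically. The sketch below (with an assumed QPSK-like image set for $g_q$ and assumed parameters) verifies that $L_l$ and $L_u$ bracket the log likelihood ratio terms of (33) for a random channel output:

```python
import numpy as np

rng = np.random.default_rng(2)
snr, n = 10.0, 2
pts = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])   # assumed g_q image set
pairs = [(a, b) for a in pts for b in pts if not np.array_equal(a, b)]
dists = [np.linalg.norm(a - b) for a, b in pairs]
d_min, d_max = min(dists), max(dists)

def neg_log_f(r, gy):
    """-log f_o(r | y) up to an additive constant, per eq. (55)."""
    return snr * np.sum((r - gy) ** 2)

y0 = pts[0]
r = y0 + rng.normal(0.0, np.sqrt(1.0 / (2.0 * snr)), size=n)
t = np.linalg.norm(r - y0)
L_l = snr * min(d_min * (d_min - 2 * t), d_max * (d_max - 2 * t))   # eq. (59)
L_u = snr * (t + d_max) ** 2                                        # eq. (59)
lhs_min = min(neg_log_f(r, p) - neg_log_f(r, y0)
              for p in pts if not np.array_equal(p, y0))
lhs_max = max(neg_log_f(r, a) - neg_log_f(r, b) for a, b in pairs)
assert L_l <= lhs_min and L_u >= lhs_max
print(L_l, lhs_min, lhs_max, L_u)
```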
E. Proof of Theorem 3

Proof: Let $\tilde{u}(D)$ be an arbitrary Markov state sequence with corresponding processed state sequence $\tilde{y}(D)$. Assume
$$\tilde{u}[m+\nu-1] \ne u_0[m+\nu-1]. \qquad (60)$$
Theorem 3 holds if we can prove that any $\tilde{u}(D)$ satisfying (60) cannot be the ML state sequence. Let $k$ denote a positive integer. Define two integers $K_l$ and $K_r$ as follows:
$$K_l = \operatorname*{argmin}_{k > 0} \left\{ \tilde{u}[m+\nu-1-k\nu] = u_0[m+\nu-1-k\nu] \right\},$$
$$K_r = \operatorname*{argmin}_{k > 0} \left\{ \tilde{u}[m+\nu-1+k\nu] = u_0[m+\nu-1+k\nu] \right\}. \qquad (61)$$
We consider respectively the following four cases based on the values of $K_l$ and $K_r$. In all four cases, we show that $\tilde{u}(D)$ cannot be the ML sequence.

Case 1: $K_l \le 2M+1$, $K_r \le 2M-1$. Since $\tilde{u}[m+\nu-1+k\nu] \ne u_0[m+\nu-1+k\nu]$ for all $-K_l < k < K_r$, according to Assumption 2, $\tilde{y}(D)$ and $y_0(D)$ differ at no fewer than $\lfloor (K_l+K_r)/2 \rfloor$ time indices in the time interval $[m+\nu-1-K_l\nu, m+\nu-1+K_r\nu)$, where $\lfloor x \rfloor$ denotes the maximum integer no larger than $x$. According to (33) and (36), for $d \in [m-2M\nu, m+2M\nu)$, if $\tilde{y}[d] \ne y_0[d]$, we have
$$-\log \frac{f_o(r[d]|\tilde{y}[d])}{f_o(r[d]|y_0[d])} \ge L_l(r[d], y_0[d]) > 3\nu(\rho - \log p_{tr}). \qquad (62)$$
Consequently, we get
$$-\sum_{d=m+\nu-K_l\nu}^{m+\nu-1+K_r\nu} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_0[d])\, P_t(u_0[d]|u_0[d-1])} \ge \left\lfloor \frac{K_l+K_r}{2} \right\rfloor 3\nu(\rho - \log p_{tr}) + (K_l+K_r)\nu \log p_{tr} \ge \left\lfloor \frac{K_l+K_r}{2} \right\rfloor 3\nu\rho > 0, \qquad (63)$$
which, by the PCC, implies that $u_0(D)$ covers $\tilde{u}(D)$. Hence $\tilde{u}(D)$ cannot be the ML sequence.

Case 2: $K_l \le 2M+1$, $K_r > 2M-1$. We construct a state sequence $u_c(D)$ and show that $u_c(D)$ covers $\tilde{u}(D)$ (see the definition in Appendix A). $u_c(D)$ is constructed as follows:
$$u_c[d] = u_0[d], \;\; \text{for } d < m+2M\nu; \qquad u_c[d] = \tilde{u}[d], \;\; \text{for } d \ge m+(2M+1)\nu. \qquad (64)$$
According to (31), we can always construct $u_c[d]$ for $d \in [m+2M\nu, m+(2M+1)\nu)$ so that (64) is satisfied. Let $y_c(D)$ be the processed state sequence corresponding to $u_c(D)$.

From (33) and the first inequality in (37), we get
$$-\sum_{d=m+2M\nu}^{m+(2M+1)\nu} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_c[d])\, P_t(u_c[d]|u_c[d-1])} \ge -\sum_{d=m+2M\nu}^{m+(2M+1)\nu-1} L_u(r[d], y_0[d]) + (\nu+1)\log p_{tr} \ge -3M\nu\rho. \qquad (65)$$
Since $\tilde{u}[m+\nu-1+k\nu] \ne u_c[m+\nu-1+k\nu]$ for all $-K_l < k \le 2M-1$, according to Assumption 2, $\tilde{y}(D)$ and $y_c(D)$ differ at no fewer than $\lfloor (K_l+2M)/2 \rfloor \ge M$ time indices in the time interval $[m+\nu-1-K_l\nu, m+2M\nu)$. According to (33) and (36), we have
$$-\sum_{d=m+\nu-K_l\nu}^{m+2M\nu-1} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_c[d])\, P_t(u_c[d]|u_c[d-1])} > \left\lfloor \frac{K_l+2M}{2} \right\rfloor 3\nu(\rho - \log p_{tr}) + (K_l+2M-1)\nu \log p_{tr} \ge 3M\nu\rho. \qquad (66)$$
Combining (65) and (66), we obtain
$$-\sum_{d=m+\nu-K_l\nu}^{m+(2M+1)\nu} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_c[d])\, P_t(u_c[d]|u_c[d-1])} > 0, \qquad (67)$$
which implies that $u_c(D)$ covers $\tilde{u}(D)$. Hence, according to the PCC, $\tilde{u}(D)$ cannot be the ML sequence.

Case 3: $K_l > 2M+1$, $K_r \le 2M-1$. We construct a state sequence $u_c(D)$ and show that $u_c(D)$ covers $\tilde{u}(D)$. $u_c(D)$ is constructed as follows:
$$u_c[d] = u_0[d], \;\; \text{for } d \ge m-2M\nu; \qquad u_c[d] = \tilde{u}[d], \;\; \text{for } d < m-(2M+1)\nu. \qquad (68)$$
According to (31), we can always construct $u_c[d]$ for $d \in [m-(2M+1)\nu, m-2M\nu)$ so that (68) is satisfied. Let $y_c(D)$ be the processed state sequence corresponding to $u_c(D)$.

From (33) and the second inequality in (37), we get
$$-\sum_{d=m-(2M+1)\nu}^{m-2M\nu-1} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_c[d])\, P_t(u_c[d]|u_c[d-1])} \ge -\sum_{d=m-(2M+1)\nu}^{m-2M\nu-1} L_u(r[d], y_0[d]) + \nu \log p_{tr} \ge -3M\nu\rho. \qquad (69)$$
Since $\tilde{u}[m+\nu-1+k\nu] \ne u_c[m+\nu-1+k\nu]$ for all $-2M \le k < K_r$, according to Assumption 2, $\tilde{y}(D)$ and $y_c(D)$ differ at no fewer than $\lfloor (2M+1+K_r)/2 \rfloor \ge M+1$ time indices in the time interval $[m-2M\nu, m+\nu-1+K_r\nu)$. According to (33) and (36), we have
$$-\sum_{d=m-2M\nu}^{m+\nu-1+K_r\nu} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_c[d])\, P_t(u_c[d]|u_c[d-1])} > \left\lfloor \frac{2M+1+K_r}{2} \right\rfloor 3\nu(\rho - \log p_{tr}) + (2M+1+K_r)\nu \log p_{tr} \ge 3(M+1)\nu\rho. \qquad (70)$$
Combining (69) and (70), we obtain
$$\sum_{d=m-(2M+1)\nu}^{m+\nu-1+K_r\nu} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_c[d])\, P_t(u_c[d]|u_c[d-1])} < 0, \qquad (71)$$
which implies that $u_c(D)$ covers $\tilde{u}(D)$. Hence, according to the PCC, $\tilde{u}(D)$ cannot be the ML sequence.

Case 4: $K_l > 2M+1$, $K_r > 2M-1$. We construct $u_c(D)$ as follows:
$$u_c[d] = u_0[d], \;\; \text{for } m-2M\nu \le d < m+2M\nu;$$
$$u_c[d] = \tilde{u}[d], \;\; \text{for } d \ge m+(2M+1)\nu \text{ or } d < m-(2M+1)\nu. \qquad (72)$$
Let the processed state sequence corresponding to $u_c(D)$ be $y_c(D)$.

Since $\tilde{u}[m+\nu-1+k\nu] \ne u_c[m+\nu-1+k\nu]$ for all $-2M \le k \le 2M-1$, according to Assumption 2, $\tilde{y}(D)$ and $y_c(D)$ differ at no fewer than $2M$ time indices in the time interval $[m-2M\nu, m+2M\nu)$. According to (33) and (36), we have
$$-\sum_{d=m-2M\nu}^{m+2M\nu-1} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_c[d])\, P_t(u_c[d]|u_c[d-1])} > 2M \cdot 3\nu(\rho - \log p_{tr}) + 4M\nu \log p_{tr} \ge 6M\nu\rho. \qquad (73)$$
Meanwhile, it is easily seen that (65) and (69) hold. Combining (65), (69), and (73), we obtain
$$-\sum_{d=m-(2M+1)\nu}^{m+(2M+1)\nu} \log \frac{f_o(r[d]|\tilde{y}[d])\, P_t(\tilde{u}[d]|\tilde{u}[d-1])}{f_o(r[d]|y_c[d])\, P_t(u_c[d]|u_c[d-1])} > -3M\nu\rho - 3M\nu\rho + 6M\nu\rho = 0. \qquad (74)$$
(74) implies that $u_c(D)$ covers $\tilde{u}(D)$. Hence, according to the PCC, $\tilde{u}(D)$ cannot be the ML sequence.

Overall, we have shown that $\tilde{u}(D)$ cannot be the ML sequence, irrespective of the values of $K_l$ and $K_r$. Therefore, $u_{ML}[m+\nu-1] = u_0[m+\nu-1]$ must be true. $\blacksquare$

References

[1] G. D. Forney, "The Viterbi Algorithm," Proc. IEEE, vol. 61, no. 3, pp. 268-278, Mar. 1973.
[2] A. J. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm," IEEE Trans. Inform. Theory, vol. IT-13, no. 2, pp. 260-269, Apr. 1967.
[3] K. Zigangirov and H. Osthoff, "List Decoding of Trellis Codes," Problems of Control and Information Theory, pp. 347-364, 1980.
[4] R. Fano, "A Heuristic Discussion of Probabilistic Decoding," IEEE Trans. Inform. Theory, vol. IT-9, pp. 64-74, Apr. 1963.
[5] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate," IEEE Trans. Inform. Theory, vol. IT-20, pp. 284-287, Mar. 1974.
[6] R. Johannesson and K. Zigangirov, Fundamentals of Convolutional Coding, IEEE Press, 1999.
[7] P. Swaszek and W. Jones, "How Often Is Hard-Decision Decoding Enough?" IEEE Trans. Inform. Theory, vol. 44, pp. 1187-1193, May 1998.
[8] M. Ariel and J. Snyders, "Error-Trellises for Convolutional Codes-Part II: Decoding Methods," IEEE Trans. Commun., vol. 47, pp. 1015-1024, July 1999.
[9] B. Hassibi and H. Vikalo, "On the Sphere Decoding Algorithm I. Expected Complexity," IEEE Trans. Sig. Proc., vol. 53, no. 8, pp. 2806-2818, Aug. 2005.
[10] U. Fincke and M. Pohst, "Improved Methods for Calculating Vectors of Short Length in a Lattice, Including a Complexity Analysis," Math. Comput., vol. 44, pp. 463-471, Apr. 1985.
[11] H. Vikalo and B. Hassibi, "Maximum-Likelihood Sequence Detection of Multiple Antenna Systems over Dispersive Channels via Sphere Decoding," EURASIP J. Appl. Sig. Proc., no. 1, pp. 525-531, Jan. 2002.
[12] H. Vikalo, Sphere Decoding Algorithms for Digital Communications, Ph.D. Thesis, Stanford Univ., 2003.
[13] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.
[14] G. D. Forney, "Structural Analysis of Convolutional Codes via Dual Codes," IEEE Trans. Inform. Theory, vol. IT-19, pp. 512-518, Jul. 1973.
[15] J. Luo,