A Further Note on an Innovations Approach to Viterbi Decoding of Convolutional Codes
aa r X i v : . [ c s . I T ] F e b A Further Note on an Innovations Approach toViterbi Decoding of Convolutional Codes
Masato Tajima,
Senior Member, IEEE
Abstract
In this paper, we show that the soft-decision input to the main decoder in an SST Viterbi decoder is regarded as the innovationas well from the viewpoint of mutual information and mean-square error. It is assumed that a code sequence is transmitted symbolby symbol over an AWGN channel using BPSK modulation. Then we can consider the signal model, where the signal is composedof the signal-to-noise ratio (SNR) and the equiprobable binary input. By assuming that the soft-decision input to the main decoderis the innovation, we show that the minimum mean-square error (MMSE) in estimating the binary input is expressed in terms ofthe distribution of the encoded block for the main decoder. It is shown that the obtained MMSE satisfies indirectly the knownrelation between the mutual information and the MMSE in Gaussian channels. Thus the derived MMSE is justified, which in turnimplies that the soft-decision input to the main decoder can be regarded as the innovation. Moreover, we see that the input-outputmutual information is connected with the distribution of the encoded block for the main decoder.
Index Terms
Convolutional codes, Scarce-State-Transition (SST) Viterbi decoder, innovations, filtering, smoothing, mean-square error,mutual information.
M. Tajima is with University of Toyama, 3190 Gofuku, Toyama 930-8555, Japan (e-mail: [email protected]).
OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
A Further Note on an Innovations Approach toViterbi Decoding of Convolutional Codes
I. I
NTRODUCTION
Consider an SST Viterbi decoder [17], [24] which consists of a pre-decoder and a main decoder. In [25], by comparingwith the results in the linear filtering theory [1], [9], [11], [12], [15], we showed that the hard-decision input to the maindecoder can be seen as the innovation [11], [12], [16], [18]. In the coding theory, the framework of the theory is determinedby hard-decision data. In the previous paper [25, Definition 1], we have given the definition of innovations for hard-decisiondata. We see that the definition applies also to soft-decision data. That is, if the associated hard-decision data is the innovation,then the original soft-decision data can be regarded as the innovation. Hence, we can consider the soft-decision input to themain decoder to be the innovation as well. In this paper, concerning this subject, we show that this is true from a differentviewpoint. We use the innovations approach to least-squares estimation by Kailath [11], [12] and the relationship between themutual information and the mean-square error (e.g., [6]).Consider the situation where we have observations of a signal in additive white Gaussian noise. (In this paper, it is assumedthat the signal is composed of the signal-to-noise ratio (SNR) and the equiprobable binary input, where the latter is notdependent on the SNR. In the following, we call the equiprobable binary input simply the input.) Kailath [11], [12] applied theinnovations method to linear filtering/smoothing problems. In the discrete-time case [11], [15], he showed that the covariancematrix of the innovation is expressed as the sum of two covariance matrices, where one is corresponding to the estimationerror of the signal and the other is corresponding to the observation noise. Then the mean-square error in estimating the signalis obtained from the former covariance matrix by taking its trace. Based on the result of Kailath, we thought as follows: if thesoft-decision input to the main decoder corresponds to the innovation, then the associated covariance matrix should have theabove property. That is, the covariance matrix of the soft-decision input to the main decoder can be decomposed as the sumof two covariance matrices and hence by following the above, the mean-square error in estimating the signal will be obtained.We remark that in this context, the obtained mean-square error has to be justified using some method. For the purpose, wehave noted the relation between the mutual information and the mean-square error.By the way, we derived the distribution of the input to the main decoder corresponding to a single code symbol in [25].However, we could not obtain the joint distribution of the input to the main decoder corresponding to a branch. After ourpaper [25] was published, we have noticed that the distribution of the input to the main decoder corresponding to a singlecode symbol has a reasonable interpretation (see [26]). Then using this fact, the joint distribution of the input to the maindecoder corresponding to a branch has been derived. In that case, the associated covariance matrix can be calculated. Thenour argument for obtaining the mean-square error is as follows:1) Derive the joint distribution of the soft-decision input to the main decoder.2) Calculate the associated covariance matrix using the derived joint distribution.3) The covariance matrix of the estimation error of the signal is obtained by subtracting that of the observation noise fromthe matrix in 2). 
Moreover, by removing the SNR from the obtained covariance matrix, the covariance matrix of theestimation error of the input is derived.4) The mean-square error in estimating the input is given by the trace (i.e., the sum of diagonal elements) of the lastcovariance matrix.In this way, we show that the mean-square error in estimating the input is expressed in terms of the distribution of the encodedblock for the main decoder. Since in an SST Viterbi decoder, the estimation error is decoded at the main decoder, the resultis reasonable.On the other hand, we have to show the validity of the derived mean-square error. For the purpose, we note the relationbetween the mutual information and the mean-square error. Including these notions, the mutual information, the likelihoodratio (LR), and the mean-square error are central concerns in information theory, detection theory, and estimation theory. It iswell known that the best estimate measured in mean-square sense (i.e., the least-squares estimate) is given by the conditionalexpectation [16], [18], [21], [27]. In this paper, the corresponding mean-square error is referred to as the minimum mean-squareerror (MMSE) [6]. In this case, depending on the amount of observations { z ( t ) } used for the estimation, the causal (filtering)MMSE and the noncausal (smoothing) MMSE are considered. When { z ( τ ) , τ ≤ t } is used for the estimation of the input x ( t ) , it corresponds to filtering, whereas when { z ( τ ) , τ ≤ T } is used for the estimation of x ( t ) ( t < T ) , it corresponds tosmoothing.From the late 1960’s to the early 1970’s, the relation between the LR and the mean-square error was actively discussed.Kailath [14] showed that the LR is expressed in terms of the causal least-squares estimate. Moreover [13], he showed thatthe causal least-squares estimate can be obtained from the LR. Esposito [4] also derived the relation between the LR and thenoncausal estimator, which is closely related to [13]. Subsequently, Duncan [3] and Kadota et al. [10] derived the relationbetween the mutual information and the causal mean-square error. Their works are regarded as extensions of the work of Gelfand OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2 and Yaglom (see [1], [3]), who discussed for the first time the relation between the mutual information and the filtering error.Later, Guo et al. [6] derived new formulas regarding the relation between the mutual information and the noncausal mean-square error. Furthermore, Zakai [29] showed that the relation between the mutual information and the estimation error holdsalso in the abstract Wiener Space [28], [29]. He applied not the Ito calculus [8], [21] but the Malliavin calculus [19], [29].We remark that the additive Gaussian noise channel is assumed for all above works.Among those works, we have noted the work of Guo et al. [6]. It deals with the relation between the mutual informationand the MMSE. Also, their signal model is equal to ours and hence it seems favorable for manipulating. We thought if theMMSE obtained using the above method satisfies the relations in [6, Section IV-A], then the derived MMSE is justified, whichin turn implies that the soft-decision input to the main decoder can be regarded as the innovation. The main text of the paperconsists of the arguments stated above. By the way, it is shown that the MMSE is expressed in terms of the distribution of theencoded block for the main decoder. 
Then we see that the input-output mutual information is connected with the distributionof the encoded block for the main decoder. We think this is an extension of the relation between the mutual information andthe MMSE to coding theory. The remainder of this paper is organized as follows.In Section II, based on the signal model in this paper, we derive the joint distribution of the input to the main decoder.In Section III, the associated covariance matrix is calculated using the derived joint distribution. Subsequently, by subtractingthe covariance matrix of the observation noise from it and by removing the SNR from the resulting matrix, the covariancematrix of the estimation error of the input is obtained. The MMSE in estimating the input is given by the trace of the lastmatrix. In this case, since the diagonal elements of the target matrix have a correlation, we modify the obtained MMSE. Notethat this modification is essentially important. (We remark that some results in Sections II and III have been given in [26] withor without proofs. However, in order for the paper to be self-contained, the necessary materials are provided again with proofsin those sections.)In Section IV, the validity of the derived MMSE is discussed. The argument is based on the relation between the mutualinformation and the MMSE. More precisely, we discuss using the results of Guo et al. [6, Section IV-A]. We remark that theinput in our signal model is not Gaussian. Hence the input-output mutual information cannot be obtained in a concrete form.We have only inequalities and approximate expressions. For that reason, we carry out numerical calculations using concreteconvolutional codes. In this case, in order to clarify the difference between causal estimation (filtering) and noncausal one(smoothing), we take QLI codes [20] with different constraint-lengths. Then the MMSE’s are calculated by regarding theseQLI codes as general codes on one side and by regarding them as inherent QLI codes on the other. The obtained resultsare compared and carefully examined. Moreover, through the argument, we see that the input-output mutual information isconnected with the distribution of the encoded block for the main decoder.In Section V, we give several important comments regarding the discussions in Section IV.Finally, in Section IV, we conclude with the main points of the paper and with problems to be further discussed.Let us close this section by introducing the basic notions needed for this paper. Notations in this paper are same as thosein [25] in principle. We always assume that the underlying field is GF (2) . Let G ( D ) be a generator matrix for an ( n , k ) convolutional code, where G ( D ) is assumed to be minimal [5]. A corresponding check matrix H ( D ) is also assumed tobe minimal. Hence they have the same constraint length, denoted ν . Denote by i = { i k } and y = { y k } an informationsequence and the corresponding code sequence, respectively, where i k = ( i (1) k , · · · , i ( k ) k ) is the information block at t = k and y k = ( y (1) k , · · · , y ( n ) k ) is the encoded block at t = k . In this paper, it is assumed that a code sequence y is transmittedsymbol by symbol over a memoryless AWGN channel using BPSK modulation. Let z = { z k } be a received sequence, where z k = ( z (1) k , · · · , z ( n ) k ) is the received block at t = k . Each component z j of z is modeled as z j = x j p E s /N + w j (1) = cx j + w j , (2)where c △ = p E s /N . Here, x j takes ± depending on whether the code symbol y j is or . 
That is, x j is the equiprobablebinary input. (We call it simply the input.) E s and N denote the energy per channel symbol and the single-sided noisespectral density, respectively. (Let E b be the energy per information bit. Then the relationship between E b and E s is definedby E s = RE b , where R is the code rate.) Also, w j is a zero-mean unit variance Gaussian random variable with probabilitydensity function q ( y ) = 1 √ π e − y . (3)Each w j is independent of all others. The hard-decision (denoted “ h ”) data of z j is defined by z hj △ = (cid:26) , z j ≥ , z j < . (4)In this case, the channel error probability (denoted ǫ ) is given by ǫ = 1 √ π Z ∞ √ E s /N e − y dy △ = Q (cid:0)p E s /N (cid:1) . (5) OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3
Note that the above signal model can be seen as the block at t = k . In that case, we can rewrite as z k = c x k + w k , (6)where x k = ( x (1) k , · · · , x ( n ) k ) , w k = ( w (1) k , · · · , w ( n ) k ) , and z k = ( z (1) k , · · · , z ( n ) k ) .In this paper, we consider an SST Viterbi decoder [17], [24], [25] which consists of a pre-decoder and a main decoder. Inthe case of a general convolutional code, the inverse encoder G − ( D ) is used as a pre-decoder. Let r k = ( r (1) k , · · · , r ( n ) k ) bethe soft-decision input to the main decoder. r ( l ) k (1 ≤ l ≤ n ) is given by r ( l ) k = ( | z ( l ) k | , r ( l ) hk = 0 −| z ( l ) k | , r ( l ) hk = 1 . (7)II. J OINT D ISTRIBUTION OF THE I NPUT TO THE M AIN D ECODER
The argument in this paper is based on the result of Kailath [11, Section III-B]. That is, by calculating the covariance matrixof the input to the main decoder, we derive the mean-square error in estimating the input. First we obtain the joint distributionof the input to the main decoder. In the following, P ( · ) and E [ · ] denote the probability and expectation, respectively.Let r k = ( r (1) k , · · · , r ( n ) k ) be the input to the main decoder in an SST Viterbi decoder [25]. In connection with the distributionof r ( l ) k (1 ≤ l ≤ n ) [25, Proposition 12], α (= α l ) is defined by α △ = P ( e ( l ) k = 0 , r ( l ) hk = 1) + P ( e ( l ) k = 1 , r ( l ) hk = 0) . (8)We have noticed that this value has another interpretation. In fact, we have the following. Lemma 1 ([26]): α l = P ( v ( l ) k = 1) (1 ≤ l ≤ n ) (9)holds, where v k = ( v (1) k , · · · , v ( n ) k ) is the encoded block for the main decoder. Proof:
The hard-decision input to the main decoder is expressed as r hk = u k G + e k , where u k = e k G − . Let v k = u k G . We have r ( l ) hk = v ( l ) k + e ( l ) k (1 ≤ l ≤ n ) . (10)Hence it follows that α l = P ( e ( l ) k = 0 , r ( l ) hk = 1) + P ( e ( l ) k = 1 , r ( l ) hk = 0)= P ( e ( l ) k = 0 , v ( l ) k + e ( l ) k = 1) + P ( e ( l ) k = 1 , v ( l ) k + e ( l ) k = 0)= P ( e ( l ) k = 0 , v ( l ) k = 1) + P ( e ( l ) k = 1 , v ( l ) k = 1)= P ( v ( l ) k = 1) . (11)Thus the distribution of r ( l ) k (1 ≤ l ≤ n ) is given by p r ( y ) = P ( v ( l ) k = 0) q ( y − c ) + P ( v ( l ) k = 1) q ( y + c ) . (12)This equation means that if the code symbol is , then the associated distribution obeys q ( y − c ) , whereas if the code symbol is , then the associated distribution obeys q ( y + c ) . Hence the result is quite reasonable. On the other hand, since the distributionof r ( l ) k is given by the above equation, r ( l ) k (1 ≤ l ≤ n ) are not mutually independent, because v ( l ) k (1 ≤ l ≤ n ) are notmutually independent.Next, consider a QLI code whose generator matrix is given by G ( D ) = ( g ( D ) , g ( D ) + D L ) (1 ≤ L ≤ ν − . (13)Let η k − L = ( η (1) k − L , η (2) k − L ) be the input to the main decoder in an SST Viterbi decoder [25]. We see that almost the sameargument applies to β (= β l ) in [25, Proposition 14]. We have the following. Lemma 2 ([26]): β l = P ( v ( l ) k = 1) ( l = 1 , (14)holds, where v k = ( v (1) k , v (2) k ) is the encoded block for the main decoder. OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4
Proof:
In the case of QLI codes [25, Section II-B], the hard-decision input to the main decoder is expressed as η hk − L = u k G + e k − L = v k + e k − L , where u k = e k F ( F △ = (cid:18) (cid:19) ) and v k = u k G . Let ζ k = z k H ( D ) be the syndrome. Since η hk − L = ( ζ k , ζ k ) , the above isequivalent to ( ζ k , ζ k ) = ( v (1) k , v (2) k ) + ( e (1) k − L , e (2) k − L ) . (15)Hence we have β l = P ( e ( l ) k − L = 0 , ζ k = 1) + P ( e ( l ) k − L = 1 , ζ k = 0)= P ( e ( l ) k − L = 0 , v ( l ) k + e ( l ) k − L = 1) + P ( e ( l ) k − L = 1 , v ( l ) k + e ( l ) k − L = 0)= P ( e ( l ) k − L = 0 , v ( l ) k = 1) + P ( e ( l ) k − L = 1 , v ( l ) k = 1)= P ( v ( l ) k = 1) (16)for l = 1 , .In the rest of the paper, n = 2 is assumed, because we are concerned with QLI codes in principle. Let us examinethe relationship between α l and β l . Let ˆ i ( k − L | k − L ) and ˆ i ( k − L | k ) be the filtered estimate and the smoothed estimate,respectively [25, Section II-B]. We have r hk − L = z hk − L − ˆ i ( k − L | k − L ) G ( D )= ( i k − L − ˆ i ( k − L | k − L )) G ( D ) + e k − L = u k − L G ( D ) + e k − L (17) η hk − L = z hk − L − ˆ i ( k − L | k ) G ( D )= ( i k − L − ˆ i ( k − L | k )) G ( D ) + e k − L = ˜ u k − L G ( D ) + e k − L , (18)where ˜ u k − L △ = e k F . Then it follows that v k − L = u k − L G ( D )= ( i k − L − ˆ i ( k − L | k − L )) G ( D ) (19) ˜ v k − L = ˜ u k − L G ( D )= ( i k − L − ˆ i ( k − L | k )) G ( D ) . (20)From the meaning of filtering and smoothing, it is natural to think that P ( i k − L − ˆ i ( k − L | k − L ) = 0) is smaller than P ( i k − L − ˆ i ( k − L | k ) = 0) . Hence it is expected that P ( v ( l ) k − L = 1) > P (˜ v ( l ) k − L = 1) , (21)i.e., α l > β l holds. (In the derivation, P ( v ( l ) k − L = 1) = P ( v ( l ) k = 1) , which is equivalent to a kind of stationarity, has beenused.) However, this is not always true (see Section IV).Since the meaning of α (= α l ) has been clarified, we can derive the joint distribution (denoted p r ( x, y ) ) of r (1) k and r (2) k .In fact, we have the following. Proposition 1 ([26]): p r ( x, y ) is given by p r ( x, y ) = α q ( x − c ) q ( y − c ) + α q ( x − c ) q ( y + c )+ α q ( x + c ) q ( y − c ) + α q ( x + c ) q ( y + c ) , (22)where α ij = P ( v (1) k = i, v (2) k = j ) . Proof: p r ( x, y ) ≥ is obvious. Let us show that R ∞−∞ R ∞−∞ p r ( x, y ) dxdy = 1 . Noting, for example, Z ∞−∞ Z ∞−∞ q ( x − c ) q ( y − c ) dxdy = Z ∞−∞ q ( x − c ) dx Z ∞−∞ q ( y − c ) dy = 1 × , OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 5 we have Z ∞−∞ Z ∞−∞ p r ( x, y ) dxdy = α + α + α + α = 1 . (23)( Remark: R ∞−∞ R ∞−∞ q ( x − c ) q ( y − c ) dxdy is a multiple integral on the infinite interval, i.e., an improper integral. Considera finite interval I = { a ≤ x ≤ b, a ′ ≤ y ≤ b ′ } . Since q ( x − c ) q ( y − c ) is continuous on I , a repeated integral is possible on I and we have Z Z I q ( x − c ) q ( y − c ) dxdy = Z ba q ( x − c ) dx Z b ′ a ′ q ( y − c ) dy. Taking the limit as a, a ′ → −∞ and as b, b ′ → ∞ , the above equality is obtained.)Next, let us calculate the marginal distribution of p r ( x, y ) . We have Z ∞−∞ p r ( x, y ) dy = ( α + α ) q ( x − c ) + ( α + α ) q ( x + c )= (1 − α ) q ( x − c ) + α q ( x + c ) . (24)Note that this is just the distribution of r (1) k . Similarly, we have Z ∞−∞ p r ( x, y ) dx = (1 − α ) q ( y − c ) + α q ( y + c ) , (25)where the right-hand side is the distribution of r (2) k . 
All these facts show that p r ( x, y ) is the joint distribution of r (1) k and r (2) k .III. M EAN -S QUARE E RROR IN E STIMATING THE I NPUT
Consider the signal model: Z = X + W, (26)where X represents a signal of interest and W represents random noise. We assume that X and W are mutually independent.Since we cannot know the value of X directly, we have to estimate it based on the observation Z . The error of an estimate,denoted f ( Z ) , of the input X based on the observation Z can be measured in mean-square sense E [( X − f ( Z ))( X − f ( Z )) T ] , (27)where “ T ” means transpose. It is well known [16], [18], [21], [27] that the minimum of the above value is achieved by theconditional expectation ˆ X = [ f ( Z ) = E [ X | Z ] . (28) ˆ X is the least-squares estimate and the corresponding estimation error is referred to as the minimum mean-square error (MMSE) [6]. In the following, the value of the MMSE is denoted by “mmse”. Remark:
Let Z be the σ -field generated by Z . Also, denote by L ( Z ) the set of elements in L which are Z measurable,where L is the set of square integrable random variables. Then we have E [ X | Z ] = P L ( Z ) X , where P L ( Z ) X is the orthogonalprojection of X onto the space L ( Z ) [18], [21], [27]. If X and Z are jointly Gaussian, then we have E [ X | Z ] = P H ( Z ) X [18],[21], where H ( Z ) is the Gaussian space [18], [21] generated by Z . Note that H ( Z ) is a subspace of L ( Z ) .Kailath [11], [12] applied the innovations method to linear filtering/smoothing problems. Suppose that the observations aregiven by z k = s k + w k , k = 1 , , · · · , (29)where { s k } is a zero-mean finite-variance signal process and { w k } is a zero-mean white Gaussian noise. It is assumed that w k has a covariance matrix E [ w Tk w l ] = R k δ kl , where δ kl is Kronecker’s delta. The innovation process is defined by ν k △ = z k − ˆ s ( k | k − , (30)where ˆ s ( k | k − is the linear least-squares estimate of s k given { z l , ≤ l ≤ k − } . Kailath [11, Section III-B] showed thefollowing. Proposition 2 (Kailath [11]):
The covariance matrix of ν k is given by E [ ν Tk ν l ] = ( P k + R k ) δ kl , (31)where P k is the covariance matrix of the error in the estimate ˆ s ( k | k − .We remark that this result plays an essential role in this paper. OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 6
A. Covariance Matrix Associated with the Input to the Main Decoder
In this section, by assuming that r k is the innovation, we calculate the covariance matrix of r k . Then the MMSE in estimatingthe input x k is obtained from the associated covariance matrix. In preparation for the purpose, we present a lemma. Lemma 3 ([26]):
Suppose that ≤ ǫ ≤ / . The following quantities have the same value: P ( v (1) k = 0 , v (2) k = 0) − P ( v (1) k = 0) P ( v (2) k = 0) = α − (1 − α )(1 − α ) (32) P ( v (1) k = 0) P ( v (2) k = 1) − P ( v (1) k = 0 , v (2) k = 1) = (1 − α ) α − α (33) P ( v (1) k = 1) P ( v (2) k = 0) − P ( v (1) k = 1 , v (2) k = 0) = α (1 − α ) − α (34) P ( v (1) k = 1 , v (2) k = 1) − P ( v (1) k = 1) P ( v (2) k = 1) = α − α α . (35)The common value is denoted by δ . Remark: δ = 0 implies that v (1) k and v (2) k are mutually independent. Proof:
Suppose that α and α are given. From the definition of α ij , we obtain a system of linear equations: α + α = 1 − α α + α = α α + α = 1 − α α + α = α . (36)This can be solved as α = 1 − α − α + uα = α − uα = α − uα = u, (37)where u is an arbitrary constant. We remark that the probabilities (0 ≤ ) α ∼ α ( ≤ are determined by u , which in turnrestricts the value of u . Since ≤ α ij , u must satisfy the following: u ≥ α + α − u ≤ α u ≤ α u ≥ . (38)Note that ≤ α i ≤ / i = 1 , for ≤ ǫ ≤ / [25, Lemma 13]. Hence we have α + α − ≤ . Accordingly, the valueof u is restricted to ≤ u ≤ min ( α , α ) . (39)It is shown that α ij ≤ is also satisfied for ≤ u ≤ min ( α , α ) .Now we have α − (1 − α )(1 − α )= (1 − α − α + u ) − (1 − α − α + α α )= u − α α = α − α α . (40)We see that the remaining three quantities are equal also to u − α α = α − α α .Let us show that δ ≥ . We need the following. Lemma 4: P ( e + e + · · · + e m = 1) ≤ (41)holds, where errors e j (1 ≤ j ≤ m ) are statistically independent of each other. Proof:
See Appendix A.Now we have the following.
Lemma 5:
Suppose that ≤ ǫ ≤ / . Then δ ≥ holds. Proof:
See Appendix B.
Proposition 3 ([26]):
The covariance matrix associated with p r ( x, y ) is given by Σ r △ = (cid:18) σ r σ r r σ r r σ r (cid:19) = (cid:18) c α (1 − α ) 4 c δ c δ c α (1 − α ) (cid:19) . (42) OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7
Proof:
Note the relation σ r = Z ∞−∞ Z ∞−∞ x p r ( x, y ) dxdy − m r , (43)where m r = Z ∞−∞ Z ∞−∞ x p r ( x, y ) dxdy. (44)The first term is calculated as Z ∞−∞ Z ∞−∞ x p r ( x, y ) dxdy = ( α + α + α + α )(1 + c )= 1 + c . (45)Also, we have m r = Z ∞−∞ Z ∞−∞ x p r ( x, y ) dxdy = ( α + α ) c + ( α + α )( − c )= (1 − α ) c − α c = (1 − α ) c. (46)Then it follows that σ r = (1 + c ) − (1 − α ) c = 1 + 4 c α (1 − α ) . (47)Similarly, we have σ r = Z ∞−∞ Z ∞−∞ y p r ( x, y ) dxdy − m r = 1 + 4 c α (1 − α ) . (48)Let us calculate σ r r . Note this time the relation σ r r = Z ∞−∞ Z ∞−∞ xy p r ( x, y ) dxdy − m r m r . (49)Applying Lemma 3, we have Z ∞−∞ Z ∞−∞ xy p r ( x, y ) dxdy = ( α + α ) c − ( α + α ) c = (1 − α − α + 2 u ) c − ( α + α − u ) c = (1 − α − α + 4 u ) c . (50)Since (cid:26) m r = (1 − α ) cm r = (1 − α ) c, (51)it follows that σ r r = (1 − α − α + 4 u ) c − (1 − α )(1 − α ) c = 4( u − α α ) c = 4 δc . (52) OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8
B. MMSE in Estimating the Input
Note Proposition 2. If r k = ( r (1) k , r (2) k ) corresponds to the innovation, then the associated covariance matrix (denoted Σ r )is expressed as the sum of two matrices Σ x and Σ w , i.e., Σ r = Σ x + Σ w . Here, Σ x is the covariance matrix of the estimationerror of the signal, whereas Σ w is the covariance matrix of the observation noise. From the definition of w j (see Section I),it follows that Σ w = (cid:18) (cid:19) (53)and hence we have Σ x = (cid:18) c α (1 − α ) 4 c δ c δ c α (1 − α ) (cid:19) . (54)Moreover, since the signal is expressed as c x k , the covariance matrix (denoted ˆΣ x ) of the estimation error of x k becomes ˆΣ x = (cid:18) α (1 − α ) 4 δ δ α (1 − α ) (cid:19) . (55)Hence the corresponding MMSE is given bymmse = tr ˆΣ x = 4 α (1 − α ) + 4 α (1 − α ) , (56)where tr ˆΣ x is the trace of matrix ˆΣ x . Remark:
The estimate in [11] is the “linear” least-squares estimate. On the other hand, the best estimate is given by theconditional expectation E [ x k | z l , ≤ l ≤ k − , which is in general a highly “nonlinear” functional of { z l , ≤ l ≤ k − } (cf. x k is not Gaussian). Note that if { x k , z k } are jointly Gaussian, then E [ x k | z l , ≤ l ≤ k − is a linear functional of { z l , ≤ l ≤ k − } [16, Section II-D]. Hence to be exact, we havemmse ≤ α (1 − α ) + 4 α (1 − α ) , (57)where the left-hand side corresponds to the conditional expectation, whereas the right-hand side corresponds to the linearleast-squares estimate. As stated above, the MMSE is a nonlinear functional of the past observations. Hence we regard theright-hand side as an approximation to the MMSE. C. Modification of the Derived MMSE
Recall that α = P ( v (1) k = 1) and α = P ( v (2) k = 1) . That is, the MMSE in the estimation of x k is expressed in termsof the distribution of v k = ( v (1) k , v (2) k ) for the main decoder in an SST Viterbi decoder. We remark that α and α are notmutually independent. As a result, α (1 − α ) and α (1 − α ) have a correlation. Hence the simple sum of α (1 − α ) and α (1 − α ) is not appropriate for the MMSE and some modification considering the degree of correlation is necessary. Thevariable δ (see Lemma 3) has a close connection with the independence of v (1) k and v (2) k . When they have a weak correlation, δ is small, whereas when they have a strong correlation, δ is large. This observation suggests that a modification α (1 − α ) + 4 α (1 − α ) − λδ (58)is more appropriate for the MMSE, where λ is some constant. In the following, we discuss how to determine the value of λ .We have the following. Proposition 4:
Suppose that ≤ ǫ ≤ / . We have (4 α (1 − α ))(4 α (1 − α )) ≥ (4 δ ) , (59)where δ = α − α α . Proof:
Note the relation: Σ x = Σ r + Σ w . Σ r is the covariance matrix associated with p r ( x, y ) and is positive semi-definite. Σ w (i.e., the identity matrix of size × ) is clearly positive semi-definite. Hence Σ x is positive semi-definite andhence det (Σ x ) ≥ [23], where “det ( · ) ” denotes the determinant. Since det (Σ x ) = c det ( ˆΣ x ) , we have det ( ˆΣ x ) ≥ .From Proposition 4, we have (4 α (1 − α ))(4 α (1 − α )) ≥ (4 δ ) . So far we have carried out numerical calculations for four QLI codes C ∼ C (regarded as general codes) and one general code C (these codes are defined in Sections IV and V). Then we have found that the difference between the values of α (1 − α ) and α (1 − α ) is small. Note that this fact is derived from the structure of G − G . Let G − G = (cid:18) b b b b (cid:19) . (60) OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9
Also, denote by m ( i ) col ( i = 1 , the number of terms (i.e., D j ) contained in (cid:18) b i b i (cid:19) . We see that α i ( i = 1 , is determinedby m ( i ) col . Hence if the difference between m (1) col and m (2) col is small, then α ≈ α holds. For example, it is shown (see SectionIV-B) that1) m (1) col = 5 and m (2) col = 6 for C ,2) m (1) col = 10 and m (2) col = 10 for C .Thus we have α (1 − α ) ≈ α (1 − α ) . As a consequence, we have approximately (cid:26) α (1 − α ) ≥ δ α (1 − α ) ≥ δ. (61)That is, α (1 − α ) + 4 α (1 − α ) − δ ≥ holds with high probability.Now we can show that this inequality always holds. That is, we have the following. Proposition 5:
Suppose that ≤ ǫ ≤ / . We have α (1 − α ) + 4 α (1 − α ) − δ ≥ , (62)where δ = α − α α . Proof:
See Appendix C.
Remark:
Let ˆΣ x = (cid:18) α (1 − α ) 4 δ δ α (1 − α ) (cid:19) △ = (cid:18) σ σ σ σ (cid:19) . (63)Then α (1 − α ) + 4 α (1 − α ) − δ corresponds to σ + σ − σ . Denote by µ the correlation coefficient between x (1) k and x (2) k , i.e., µ △ = σ σ σ . (64)Note that − ≤ µ ≤ . We have σ + σ − σ = σ + σ − µσ σ . (65)Now we restrict µ to ≤ µ ≤ . As the special cases, the following hold:1) µ = 0 : σ + σ − µσ σ = σ + σ .2) µ = 1 : σ + σ − µσ σ = ( σ − σ ) .Hence − δ = − µσ σ represents the correction term depending on the degree of correlation.Based on Proposition 5, we finally setmmse = 4 α (1 − α ) + 4 α (1 − α ) − δ (66) △ = ξ ( α , α , δ ) (67)as the MMSE in the estimation of x k . D. In the Case of QLI Codes
Consider a QLI code whose generator matrix is given by G ( D ) = ( g ( D ) , g ( D ) + D L ) (1 ≤ L ≤ ν − . Let η k − L = ( η (1) k − L , η (2) k − L ) be the input to the main decoder in an SST Viterbi decoder. We see that the results in the previoussections hold for η k − L as well. Therefore, we state only the results. Proposition 6 ([26]):
The joint distribution of η (1) k − L and η (2) k − L (denoted p η ( x, y ) ) is given by p η ( x, y ) = β q ( x − c ) q ( y − c ) + β q ( x − c ) q ( y + c )+ β q ( x + c ) q ( y − c ) + β q ( x + c ) q ( y + c ) , (68)where β ij = P ( v (1) k = i, v (2) k = j ) . OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10
Lemma 6 ([26]):
Assume that ≤ ǫ ≤ / . The following quantities have the same value: P ( v (1) k = 0 , v (2) k = 0) − P ( v (1) k = 0) P ( v (2) k = 0) = β − (1 − β )(1 − β ) (69) P ( v (1) k = 0) P ( v (2) k = 1) − P ( v (1) k = 0 , v (2) k = 1) = (1 − β ) β − β (70) P ( v (1) k = 1) P ( v (2) k = 0) − P ( v (1) k = 1 , v (2) k = 0) = β (1 − β ) − β (71) P ( v (1) k = 1 , v (2) k = 1) − P ( v (1) k = 1) P ( v (2) k = 1) = β − β β . (72)The common value is denoted by δ ′ . Proposition 7 ([26]):
The covariance matrix associated with p η ( x, y ) is given by Σ η △ = (cid:18) σ η σ η η σ η η σ η (cid:19) = (cid:18) c β (1 − β ) 4 c δ ′ c δ ′ c β (1 − β ) (cid:19) . (73) Proposition 8:
Suppose that ≤ ǫ ≤ / . We have (4 β (1 − β ))(4 β (1 − β )) ≥ (4 δ ′ ) , (74)where δ ′ = β − β β . Proposition 9:
Suppose that ≤ ǫ ≤ / . We have β (1 − β ) + 4 β (1 − β ) − δ ′ ≥ , (75)where δ ′ = β − β β .Finally, for QLI codes we set mmse = 4 β (1 − β ) + 4 β (1 − β ) − δ ′ (76) △ = ξ ( β , β , δ ′ ) (77)as the MMSE in the estimation of x k − L .IV. M UTUAL I NFORMATION AND
MMSE S We have shown that the MMSE in the estimation of x k is expressed in terms of the distribution of v k = ( v (1) k , v (2) k ) for themain decoder in an SST Viterbi decoder. More precisely, we have derived the following:1) As a general code: mmse = ξ ( α , α , δ ) .2) As a QLI code: mmse = ξ ( β , β , δ ′ ) .In this section, we discuss the validity of the derived MMSE from the viewpoint of mutual information and mean-square error.In this paper, we have used the signal model (see Section I): z k = c x k + w k . Since n = 2 , it follows that c = p E s /N = p E b /N . (78)Then letting ρ = c = E b /N , (79)the above equation is rewritten as z k = √ ρ x k + w k . (80)Note that this is just the signal model used in [6] (i.e., ρ = snr). Guo et al. [6] discussed the relation between the input-outputmutual information and the MMSE in estimating the input in Gaussian channels. Their relation holds for discrete-time andcontinuous-time noncausal (smoothing) MMSE estimation regardless of the input statistics (i.e., not necessarily Gaussian).Here [25] recall that the innovations approach to Viterbi decoding of QLI codes has a close connection with smoothing in thelinear estimation theory. Then we thought the results in [6] can be used to discuss the validity of the MMSE obtained in theprevious section. Hence the argument in this section is based on that of Guo et al. [6]. OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11
A. General Codes
Let z n △ = { z , z , · · · , z n } . Also, let E [ x | z n ] be the conditional expectation of x given z n . According to Guo et al. [6,Section IV-A], let us define as follows:pmmse ( i, ρ ) △ = E [( x i − E [ x i | z i − ])( x i − E [ x i | z i − ]) T ] , (81)where pmmse ( i, ρ ) represents the one-step prediction MMSE. Note that this is a function of ρ (= c ) . This is because E [ x i | z i − ] depends on z i − (cf. z i − is a function of ρ ). Remark 1:
The above definition is slightly different from that in [6], where pmmse ( i, ρ ) is defined aspmmse ( i, ρ ) △ = E [ | x i − E [ x i | z i − ] | ] , (82)where z i − △ = { z , · · · , z i − } . In order to distinguish it from ours, theirs is denoted by pmmse i ( ρ ) in this section. Set n = 2 k .We have n X i =1 pmmse i ( ρ )= k X i =1 ( pmmse i − ( ρ ) + pmmse i ( ρ )) . (83)Here note the following: pmmse i − ( ρ ) = E [ | x i − − E [ x i − | z i − ] | ] (84)pmmse i ( ρ ) = E [ | x i − E [ x i | z i − ] | ] ≤ E [ | x i − E [ x i | z i − ] | ] . (85)Hence pmmse i − ( ρ ) + pmmse i ( ρ ) ≤ E [ | x i − − E [ x i − | z i − ] | ]+ E [ | x i − E [ x i | z i − ] | ]= E [( x i − E [ x i | z i − ])( x i − E [ x i | z i − ]) T ] (86)holds. Thus we have n X i =1 pmmse i ( ρ )= k X i =1 ( pmmse i − ( ρ ) + pmmse i ( ρ )) ≤ k X i =1 E [( x i − E [ x i | z i − ])( x i − E [ x i | z i − ]) T ]= k X i =1 pmmse ( i, ρ ) . (87)It is confirmed from Remark 1 that the relation in [6, Theorem 9] still holds. In fact, we have I [ x k ; z k ] ≤ ρ k X i =1 pmmse i ( ρ ) ≤ ρ k X i =1 pmmse ( i, ρ ) . (88)Then it follows that ρ (cid:18) I [ x k ; z k ] k (cid:19) ≤ k k X i =1 pmmse ( i, ρ ) ! . (89)Our argument is based on the innovations approach proposed by Kailath [11] (cf. [25]). Also, our model is a discrete-timeone. Hence it is reasonable to think that the MMSE associated with the estimation of x i is corresponding to pmmse ( i, ρ ) (seeProposition 2). That is, it is appropriate to set pmmse ( i, ρ ) = ξ ( α , α , δ ) . (90) OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12
Moreover, we set k k X i =1 pmmse ( i, ρ ) ≈ ξ ( α , α , δ ) . (91)Note that the left-hand side is the average prediction MMSE (see [6, Section IV-A]). We finally have ρ (cid:18) I [ x k ; z k ] k (cid:19) . ξ ( α , α , δ ) . (92)On the other hand, the mutual information between x k and z k [2, Section 9.4] is evaluated as I [ x k ; z k ] ≤
12 log(1 + ρ ) k = k log(1 + ρ ) . (93)Then we have ρ (cid:18) I [ x k ; z k ] k (cid:19) ≤ log(1 + ρ ) ρ . (94) Remark 2:
The essence of our argument lies in the equalitypmmse ( i, ρ ) = ξ ( α , α , δ ) . Meanwhile, using the signal model, pmmse ( i, ρ ) is calculated as follows. Since x i and z i − are mutually independent, wehave pmmse ( i, ρ ) = E [( x i − E [ x i | z i − ])( x i − E [ x i | z i − ]) T ]= E [( x i − E [ x i ])( x i − E [ x i ]) T ]= var ( x (1) i ) + var ( x (2) i )= 1 + 1 = 2 , (95)where “var” denotes the variance. Hence ρ k X i =1 pmmse ( i, ρ ) = kρ. (96)On the other hand, using the inequality: log x ≤ x − x > , we have ρ (cid:18) I [ x k ; z k ] k (cid:19) ≤ log(1 + ρ ) ρ ≤ ρρ = 1 . (97)Hence I [ x k ; z k ] ≤ kρ = ρ k X i =1 pmmse ( i, ρ ) actually holds.From the above argument, it is expected that log(1+ ρ ) ρ and ξ ( α , α , δ ) are closely related. Then in the next section, wewill discuss the relation between them. B. Numerical Results as General Codes
In order to examine the relation between log(1+ ρ ) ρ and ξ ( α , α , δ ) , numerical calculations have been carried out. We havetaken two QLI codes:1) C ( ν = 2 , L = 1) : The generator matrix is defined by G ( D ) = (1 + D + D , D ) .2) C ( ν = 6 , L = 2) : The generator matrix is defined by G ( D ) = (1 + D + D + D + D , D + D + D + D + D ) .By regarding these QLI codes as “general” codes, we have compared the values of log(1+ ρ ) ρ and ξ ( α , α , δ ) . The results areshown in Tables I and II, where ξ ( α , α , δ ) is simply denoted mmse. The results for C and C are shown also in Fig.1and Fig.2, respectively.With respect to the behaviors of log(1+ ρ ) ρ and ξ ( α , α , δ ) , we have the following:(i) log(1+ ρ ) ρ :1) ǫ → / : Since ρ → , log(1+ ρ ) ρ → .2) ǫ → : Since ρ → ∞ , log(1+ ρ ) ρ → .Note that log(1+ ρ ) ρ approaches slowly as the SNR increases.(ii) ξ ( α , α , δ ) : OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13
TABLE I log(1+ ρ ) ρ AND MINIMUM MEAN - SQUARE ERROR ( C AS A GENERAL CODE ) E b /N ( dB ) log(1+ ρ ) ρ α α (1 − α ) α α (1 − α ) δ mmse −
10 0 . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E b /N (dB) log(1+ (cid:1) )/ (cid:0) (1/2)mmse Fig. 1. log(1+ ρ ) ρ and mmse = ξ ( α , α , δ ) ( C ) . ǫ → / : Since α i → / i = 1 , and since δ → , ξ ( α , α , δ ) = 12 (4 α (1 − α ) + 4 α (1 − α ) − δ ) → . (98)2) ǫ → : Since α i → i = 1 , and since δ → , ξ ( α , α , δ ) = 12 (4 α (1 − α ) + 4 α (1 − α ) − δ ) → . (99)From Figs. 1 and 2, we observe that the behaviors of ξ ( α , α , δ ) for C and C are considerably different. Let us examinethe cause of the difference.Let v k = ( v (1) k , v (2) k ) be the encoded block for the main decoder. We have already shown that the degree of correlationbetween v (1) k and v (2) k is expressed in terms of δ (see Lemma 3). We also see that the degree of correlation between v (1) k and v (2) k has a close connection with the number of error terms (denoted m v ) by which v (1) k and v (2) k differ. In fact, when m v is small, v (1) k and v (2) k have a strong correlation, whereas when m v is large, v (1) k and v (2) k have a weak correlation. Theseobservations show that δ is closely related to m v . Then it is expected that the behavior of ξ ( α , α , δ ) depends heavily onthe value of m v (i.e., on a code). In order to confirm this fact, let us evaluate the values of m v for C and C . Note that sincethe encoded block is expressed as v k = ( e k G − ) G , m v is obtained from G − G . OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 14
TABLE II log(1+ ρ ) ρ AND MINIMUM MEAN - SQUARE ERROR ( C AS A GENERAL CODE ) E b /N ( dB ) log(1+ ρ ) ρ α α (1 − α ) α α (1 − α ) δ mmse −
10 0 . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E b /N (dB) log(1+ ρ )/ ρ (1/2)mmse Fig. 2. log(1+ ρ ) ρ and mmse = ξ ( α , α , δ ) ( C ) . C : The inverse encoder is given by G − = (cid:18) D D (cid:19) . (100)Hence from G − G = (cid:18) D + D + D D + D D D + D + D (cid:19) , (101)it follows that m v = 3 (i.e., bold-faced terms). C : The inverse encoder is given by G − = (cid:18) D + D + D D + D + D + D (cid:19) . (102)Then it is shown that G − G = (cid:18) D + D + D D + D + D + D + D + D D + D + D + D + D + D D + D + D (cid:19) . (103)We see that m v = 8 . OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 15 E b /N (dB) log(1+ ρ )/ ρ (1/2)mmse( av) Fig. 3. log(1+ ρ ) ρ and the average of mmse’s. Now compare the values of δ in Tables I and II. We observe that they are considerably different. For example, at the SNRof E b /N = 0 dB, we have δ ( C ) = 0 . > δ ( C ) = 0 . . (104)We think the difference is due to the fact m v ( C ) = 3 < m v ( C ) = 8 . (105) m v ( C ) = 3 means that there is a certain correlation between v (1) k and v (2) k and then we have δ = 0 . (i.e., δ = 0 . ).This value is fairly large. On the other hand, m v ( C ) = 8 implies that v (1) k and v (2) k have a weak correlation, which results ina small value of δ = 0 . .We have already seen that the behavior of ξ ( α , α , δ ) is dependent on m v . Also, it has been shown that the values of m v for C and C are considerably different. We think these explain the behaviors of curves in Figs. 1 and 2.As observed above, since the behavior of ξ ( α , α , δ ) varies with the codes, it is appropriate to take the average withrespect to a code. Then including C and C , we have taken additionally two QLI codes C and C whose generator matricesare defined by G ( D ) = (1 + D + D + D , D + D ) (106)and G ( D ) = (1 + D + D , D + D + D ) , (107)respectively. It is shown that C and C have m v = 4 and m v = 5 , respectively. Then we have averaged the corresponding ξ ( α , α , δ ) ’s. The result is shown in Fig.3, where mmse(av) denotes the average value of ξ ( α , α , δ ) over four codes.We think the relation between log(1+ ρ ) ρ and ξ ( α , α , δ ) is shown more properly in Fig.3.In Figs. 1, 2, and 3, we observe that ξ ( α , α , δ ) > log(1+ ρ ) ρ holds at low-to-medium SNRs. However, the sign of inequalityis reversed at high SNRs. We think this comes from the fact that ξ ( α , α , δ ) approaches rapidly as the SNR increases,whereas log(1+ ρ ) ρ approaches slowly as the SNR increases. C. QLI Codes
In [25], we have shown that the innovations approach to Viterbi decoding of QLI codes is related to “smoothing” in thelinear estimation theory. As a result, the relationship between the mutual information and the MMSE can be discussed usingthe result in [6, Corollary 3]. For the purpose, we provide a proposition.
Proposition 10:
Let x = ( x , · · · , x n ) , where x i is the equiprobable binary input with unit variance. Each x i is assumedto be independent of all others. Also, let ˜ x = (˜ x , · · · , ˜ x n ) be a standard Gaussian vector. Moreover, let w be a standardGaussian vector which is independent of x and ˜ x . Then we have ddρ I [ x ; √ ρ x + w ] ≤ ddρ I [ ˜ x ; √ ρ ˜ x + w ] . (108) OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 16
Proof:
We follow Guo et al. [6]. Let z = √ ρ x + w , where z = ( z , · · · , z n ) . We havemmse ( ρ ) = E [( x − E [ x | z ])( x − E [ x | z ]) T ]= E [ | x − E [ x | z ] | ] + · · · + E [ | x n − E [ x n | z ] | ] ≤ E [ | x − E [ x | z ] | ] + · · · + E [ | x n − E [ x n | z n ] | ]= n X i =1 mmse i ( ρ ) , (109)where mmse i ( ρ ) △ = E [ | x i − E [ x i | z i ] | ] . Then it follows from [6, Theorems 1 and 2] that ddρ I [ x ; √ ρ x + w ] = 12 mmse ( ρ ) ≤ n X i =1 mmse i ( ρ )= n X i =1 ddρ I [ x i ; √ ρ x i + w i ] . (110)( Remark 1:
This relation holds for any random vector x satisfying E [ xx T ] < ∞ (cf. [6]).)On the other hand, since ˜ x i is Gaussian (see [2, Section 9.4]), we have ddρ I [ ˜ x ; √ ρ ˜ x + w ] = n X i =1 ddρ I [˜ x i ; √ ρ ˜ x i + w i ] . (111)Note that in the scalar case, ddρ I [ x i ; √ ρ x i + w i ] ≤ ddρ I [˜ x i ; √ ρ ˜ x i + w i ] (112)holds (see Appendix D and see [6, Fig.1]). Hence we have ddρ I [ x ; √ ρ x + w ] ≤ n X i =1 ddρ I [ x i ; √ ρ x i + w i ] ≤ n X i =1 ddρ I [˜ x i ; √ ρ ˜ x i + w i ]= ddρ I [ ˜ x ; √ ρ ˜ x + w ] . Now the following [6, Corollary 3] has been shown: ddρ I [ x k ; z k ] = 12 k X i =1 mmse ( i, ρ ) , (113)where mmse ( i, ρ ) △ = E [( x i − E [ x i | z k ])( x i − E [ x i | z k ]) T ] (1 ≤ i ≤ k ) . (114) Remark 2:
The definition of mmse ( i, ρ ) is slightly different from that in [6]. However, since the whole observations z k areused in each conditional expectation, the above equality holds also under our definition.First note the left-hand side, i.e., ddρ I [ x k ; z k ] . Let ˜ x k be Gaussian. It follows from Proposition 10 that ddρ I [ ˜ x k ; z k ] ≥ ddρ I [ x k ; z k ] . (115)Since ˜ x k is Gaussian, we have I [ ˜ x k ; z k ] = 12 log(1 + ρ ) k = k log(1 + ρ ) (116)and hence ddρ I [ ˜ x k ; z k ] = k ρ (117)holds. Therefore,
11 + ρ ≥ ddρ (cid:18) I [ x k ; z k ] k (cid:19) = 12 k k X i =1 mmse ( i, ρ ) ! (118) OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 17 is obtained.Next, note the right-hand side, i.e., P ki =1 mmse ( i, ρ ) . For L ≤ k , we have k X i =1 mmse ( i, ρ )= E [( x − E [ x | z k ])( x − E [ x | z k ]) T ] · · · + E [( x k − L − E [ x k − L | z k ])( x k − L − E [ x k − L | z k ]) T ] · · · + E [( x k − E [ x k | z k ])( x k − E [ x k | z k ]) T ] . (119)Our concern is the evaluation of mmse ( k − L, ρ )= E [( x k − L − E [ x k − L | z k ])( x k − L − E [ x k − L | z k ]) T ] . (120)For the purpose, we assume an approximation:mmse ( k − L, ρ ) ≈ k k X i =1 mmse ( i, ρ ) . (121)Hence in the relation
11 + ρ ≥ ddρ (cid:18) I [ x k ; z k ] k (cid:19) = 12 k k X i =1 mmse ( i, ρ ) ! ≈ mmse ( k − L, ρ ) , (122)we replace mmse ( k − L, ρ ) by ξ ( β , β , δ ′ ) . Then we have
11 + ρ ≥ ddρ (cid:18) I [ x k ; z k ] k (cid:19) ≈ ξ ( β , β , δ ′ ) . (123)Note that the above approximation is equivalent to set k k X i =1 mmse ( i, ρ ) ≈ ξ ( β , β , δ ′ ) , (124)where the left-hand side represents the average noncausal MMSE (see [6, Section IV-A]). D. Numerical Results as QLI Codes
In order to confirm the validity of the derived equation, numerical calculations have been carried out. We have used the sameQLI codes C and C as in the previous section. In this case, we regard these codes as inherent QLI codes and compare thevalues of ρ and ξ ( β , β , δ ′ ) . The results are shown in Tables III and IV, where ξ ( β , β , δ ′ ) is simply denoted mmse.The results for C and C are shown also in Fig.4 and Fig.5, respectively.First note the following. Lemma 7:
Let ρ > . Then
11 + ρ ≤ log(1 + ρ ) ρ (125)holds. Proof:
It is shown [7, Theorem 63] that uv ≤ u log u + e v − ( u > . (126)Letting v = 1 , we have u ≤ u log u + 1 ( u > . Furthermore, letting u = ρ + 1 , we have ρ ≤ (1 + ρ ) log(1 + ρ ) + 1 ( ρ > . OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 18
TABLE III ρ AND MINIMUM MEAN - SQUARE ERROR ( C AS A
QLI
CODE ) E b /N ( dB ) ρ β β (1 − β ) β β (1 − β ) δ ′ mmse −
10 0 . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E b /N ( dB) (cid:2) (cid:3) ) (cid:4) Fig. 4. ρ and mmse = ξ ( β , β , δ ′ ) ( C ) . This is equivalent to
11 + ρ ≤ log(1 + ρ ) ρ . The behavior of ρ is similar to that of log(1+ ρ ) ρ . We have the following:(i) ρ :1) ǫ → / : Since ρ → , ρ → .2) ǫ → : Since ρ → ∞ , ρ → .Note that ρ approaches more rapidly as the SNR increases compared with log(1+ ρ ) ρ .(ii) ξ ( β , β , δ ′ ) :1) ǫ → / : Since β i → / i = 1 , and since δ ′ → , ξ ( β , β , δ ′ ) = 12 (4 β (1 − β ) + 4 β (1 − β ) − δ ′ ) → . (127)2) ǫ → : Since β i → i = 1 , and since δ ′ → , ξ ( β , β , δ ′ ) = 12 (4 β (1 − β ) + 4 β (1 − β ) − δ ′ ) → . (128) OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 19
TABLE IV ρ AND MINIMUM MEAN - SQUARE ERROR ( C AS A
QLI
CODE ) E b /N ( dB ) ρ β β (1 − β ) β β (1 − β ) δ ′ mmse −
10 0 . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E b /N (cid:5) dB) (cid:6) (cid:7) ) (cid:8) Fig. 5. ρ and mmse = ξ ( β , β , δ ′ ) ( C ) . From Tables III and IV (or Figs. 4 and 5), we observe that the behaviors of ξ ( β , β , δ ′ ) for C and C are almost thesame. This is because both C and C are regarded as QLI codes and hence have the same m v (= 2) , which results in almostequal values of δ ′ . Also, observe that ξ ( β , β , δ ′ ) provides a good approximation to ρ (Figs. 4 and 5).V. D ISCUSSION
In the previous section, we have discussed the validity of the derived MMSE using the results of Guo et al. [6]. In reality,we have examined1) the relation between log(1+ ρ ) ρ and mmse = ξ ( α , α , δ ) ,2) the relation between ρ and mmse = ξ ( β , β , δ ′ ) .Note that the relation between the mutual information and the MMSE has been reduced to the relation between a functionof ρ and the MMSE. That is, their relation has been examined not directly but indirectly. Now we see that the followingapproximation and evaluation are used in the argument in Section IV:i) k P ki =1 pmmse ( i, ρ ) ≈ ξ ( α , α , δ ) ( k P ki =1 mmse ( i, ρ ) ≈ ξ ( β , β , δ ′ ) ).ii) I [ x k ; z k ] ≤ k log(1 + ρ ) .iii) ddρ I [ x k ; z k ] ≤ k ρ .Since ξ ( α , α , δ ) ( ξ ( β , β , δ ′ ) ) is not dependent on k , averaging in i) seems to be reasonable. (But it has to be justified.) Onthe other hand, inequalities ii) and iii) come from the fact that the input in our signal model is not Gaussian and hence the OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 20 input-output mutual information cannot be obtained in a concrete form. (In the scalar case, the mutual information of Gaussianchannel with equiprobable binary input has been given (see [6, p.1263]). In our case, however, since the input is taken as avector, we have used the conventional inequality on mutual information.) Consider the case that a QLI code is regarded as ageneral code. It follows from i) and ii) that ρ (cid:18) I [ x k ; z k ] k (cid:19) . ξ ( α , α , δ )1 ρ (cid:18) I [ x k ; z k ] k (cid:19) ≤ log(1 + ρ ) ρ . Similarly, in the case that a QLI code is regarded as the inherent QLI code, it follows from i) and iii) that ddρ (cid:18) I [ x k ; z k ] k (cid:19) ≈ ξ ( β , β , δ ′ ) ddρ (cid:18) I [ x k ; z k ] k (cid:19) ≤
11 + ρ .
Hence evaluations of I [ x k ; z k ] and ddρ I [ x k ; z k ] are important in our discussions. It is expected that inequalities ii) and iii) aretight at low SNRs, whereas they are loose at high SNRs (cf. [6, p.1263]). As a result, if approximation i) is appropriate, thencrossing of two curves in Figs. ∼ is understandable. Hence the validity of approximation i) has to be further examined.Nevertheless, in the case that a given QLI code is regarded as the inherent QLI code, it has been shown that ξ ( β , β , δ ′ ) provides a good approximation to ρ . This is quite remarkable considering the fact that ρ is a function of ρ , whereas ξ ( β , β , δ ′ ) is dependent on concrete convolutional coding. We think this fact implies the validity of our inference and therelated results.Here note the following: log(1+ ρ ) ρ ( ρ ) is a function only of ρ = c = E b /N , whereas mmse = ξ ( α , α , δ ) ( ξ ( β , β , δ ′ ) ) is dependent on the encoded block for the main decoder. Hence at first glance, the comparison seems tobe inappropriate. Although mmse actually depends on coding, it is a function of the channel error probability ǫ . Since ǫ = Q (cid:0)p E s /N (cid:1) = Q (cid:0)p E b /N (cid:1) = Q (cid:0) √ ρ (cid:1) (cf. n = 2 ), mmse is also a function of ρ . Hence the above comparison isjustified.By the way, the argument in Section IV is based on the signal model: z j = cx j + w j given in Section I. Note that x j isdetermined by the encoded symbol y j and hence this signal model is dependent clearly on convolutional coding. However,since each y j is independent of all others, x j can be seen as a random variable having values ± with equal probability.When x j is seen in this way, convolutional coding is not found explicitly in the expression z j = cx j + w j . In words, wecannot see from the signal model how { y j } is generated. On the other hand, consider the MMSE in estimating the inputbased on the observations { z j } . If the signal model is interpreted as above, then the MMSE seems to be independent ofconcrete convolutional coding. But this is not true. As we have already seen, the MMSE is replaced finally by ξ ( α , α , δ ) ( ξ ( β , β , δ ′ ) ), which is an essential step. By way of this replacement, the MMSE is connected with the convolutional coding.That is, convolutional coding is actually reflected in the MMSE.From the results in Section IV, it seems that our argument is more convincing for QLI codes. We know that an SST Viterbidecoder functions well for QLI codes compared with general codes. In fact, QLI codes are preferable from a likelihoodconcentration viewpoint [22], [25]. Then a question arises: How is a likelihood concentration in the main decoder related tothe MMSE? Suppose that the information u k = e k G − ( D ) ( u k = e k F ) for the main decoder consists of m u error terms. Weknow that a likelihood concentration in the main decoder depends on m u [25], whereas δ ( δ ′ ) is affected by m v and hencethe MMSE is dependent on m v .In the case of QLI codes, we have the following. Proposition 11:
Consider a QLI code whose generator matrix is given by G ( D ) = ( g ( D ) , g ( D ) + D L ) (1 ≤ L ≤ ν − . We have m u = m v . (129) Proof:
Let u k = e + · · · + e m , where errors e j (1 ≤ j ≤ m ) are mutually independent. We have v k = u k G ( D )= ( u k g ( D ) , u k g ( D ) + u k D L )= ( v (1) k , v (2) k ) . (130)Then it follows that v (1) k and v (2) k differ by u k D L . Since u k = e + · · · + e m , v (1) k and v (2) k differ by m error terms.We observe that the relation m u = m v actually holds for C ∼ C (see Section IV-B). The above result shows that in thecase of QLI codes, a likelihood concentration in the main decoder and the degree of correlation between v (1) k and v (2) k (i.e., OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 21
TABLE V log(1+ ρ ) ρ AND MINIMUM MEAN - SQUARE ERROR ( C ) E b /N ( dB ) log(1+ ρ ) ρ α α (1 − α ) α α (1 − α ) δ mmse −
10 0 . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . − . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . the value of δ ( δ ′ ) ) have a close connection. We remark that the former is independent of the latter in principle. The aboveresult, however, shows that the two notions are closely related to each other.Note that the equality m u = m v does not hold for general codes in general. Then in order to examine the relation betweena likelihood concentration in the main decoder and the MMSE, we can consider a code with small m u and large m v . As anexample, take the code C [22] whose generator matrix is defined by G ( D ) = (1 + D + D + D + D , D + D + D + D ) . (131)The inverse encoder is given by G − = (cid:18) D D (cid:19) (132)and we have m u = 3 . On the other hand, it is shown that G − G = (cid:18) D + D + D + D + D D + D + D + D + D D + D + D D + D + D + D + D (cid:19) . (133)Then we have m v = 8 .Since m u is small, a likelihood concentration occurs in the main decoder [22], [25]. On the other hand, m v = 8 isconsiderably large. Here recall that when C is regarded as a general code, it has m v = 8 . This value is the same as thatfor C . However, since C has m u = m v = 8 as a general code, a likelihood concentration in the main decoder is notexpected at low-to-medium SNRs. Then in order to see the effect of m u on the values of mmse, let us compare the values of mmse = ξ ( α , α , δ ) for C and C . For the purpose, we have evaluated mmse for C . The result is shown in Table V.In Table V, look at the values of mmse. We observe that they are almost the same as those for C (see Table II). Thatis, it seems that m u has little effect on the values of mmse. This observation is explained as follows. m u = 3 means thata likelihood concentration in the main decoder is notable. However, this does not necessarily affect the MMSE. A likelihoodconcentration really reduces the decoding complexity (i.e., the complexity in estimating the input). On the other hand, theMMSE is the estimation error which is finally attained after the estimation process. In other words, m u affects the complexityof estimation, whereas m v is related to the final estimation error. Moreover, we can say it as follows: the MMSE is determinedby the structure of G − G . Hence even if G − ’s are considerably different between two codes C and ˜ C , when G − G ’s havealmost equal values of m v , the MMSE’s are close to each other. In fact, the inverse encoder for C is given by G − = (cid:18) D + D + D D + D + D + D (cid:19) , whereas the inverse encoder for C is given by G − = (cid:18) D D (cid:19) . Two inverse encoders are quite different, but the values of m v calculated from G − G are equal, which results in almost thesame MMSE. OURNAL OF L A TEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 22
VI. CONCLUSION
In this paper, we have shown that the soft-decision input to the main decoder in an SST Viterbi decoder is regarded as the innovation as well. Although this fact can be obtained from the definition of innovations for hard-decision data, this time we have discussed the subject from the viewpoint of mutual information and mean-square error. By combining the present result with that in [25], it has been confirmed that the input to the main decoder in an SST Viterbi decoder is regarded as the innovation. Moreover, we have obtained an important result: the MMSE has been expressed in terms of the distribution of the encoded block for the main decoder in an SST Viterbi decoder. Through this argument, the input-output mutual information has been connected with the distribution of the encoded block for the main decoder. We think this is an extension of the relation between the mutual information and the MMSE to coding theory.

On the other hand, there remain problems to be further discussed. Since the input is not Gaussian, the discussions are based on inequalities and approximate expressions. In particular, when a given QLI code is regarded as a general code, the difference between $\log(1+\rho)/\rho$ and $\mathrm{mmse} = \xi(\alpha_2, \alpha_3, \delta)$ is not so small. We remark that this comparison has been made on the basis of two inequalities concerning the evaluation of the input-output mutual information. Moreover, the discussions rely partly on numerical calculations. Hence our argument is slightly less rigorous in some places. Nevertheless, we think the closeness of the two curves in Figs. 4 and 5 implies that the inference and the related results in this paper are reasonable.

APPENDIX A
PROOF OF LEMMA 4

The proof is by induction on $m$.
1) $P(e_1 = 1) = \epsilon \le 1/2$ is obvious.
2) Suppose that $P(e_1 + \cdots + e_m = 1) \le 1/2$, and let $q = P(e_1 + \cdots + e_m = 1)$. Then we have
$$P(e_1 + \cdots + e_m + e_{m+1} = 1) = P(e_1 + \cdots + e_m = 0)\,P(e_{m+1} = 1) + P(e_1 + \cdots + e_m = 1)\,P(e_{m+1} = 0) = (1-q)\epsilon + q(1-\epsilon) = \epsilon + q(1-2\epsilon). \qquad (A.134)$$
Since $q \le 1/2$ by the assumption, we have
$$P(e_1 + \cdots + e_m + e_{m+1} = 1) \le \epsilon + \tfrac{1}{2}(1 - 2\epsilon) = \tfrac{1}{2}. \qquad (A.135)$$

APPENDIX B
PROOF OF LEMMA 5

$\alpha_1(\epsilon)$, $\alpha_2(\epsilon)$, and $\alpha_3(\epsilon)$ have been determined given $\epsilon$. Here note that $v_k = (v_k^{(1)}, v_k^{(2)}) = (e_k G^{-1})G$. Let $e_0$ be the (error) terms common to $v_k^{(1)}$ and $v_k^{(2)}$. Then $v_k^{(1)}$ and $v_k^{(2)}$ are expressed as
$$v_k^{(1)} = e_0 + \tilde e_1 \qquad (B.136)$$
$$v_k^{(2)} = e_0 + \tilde e_2, \qquad (B.137)$$
where $e_0$, $\tilde e_1$, and $\tilde e_2$ are mutually independent. Set
$$p = P(e_0 = 1),\quad s = P(\tilde e_1 = 1),\quad t = P(\tilde e_2 = 1). \qquad (B.138)$$
We have
$$\alpha_1 = (1-p)st + p(1-s)(1-t) \qquad (B.139)$$
$$\alpha_2 = (1-p)s + p(1-s) \qquad (B.140)$$
$$\alpha_3 = (1-p)t + p(1-t). \qquad (B.141)$$
By direct calculation, it is derived that
$$\delta = \alpha_1 - \alpha_2 \alpha_3 = p(1-p)(1-2s)(1-2t). \qquad (B.142)$$
Since $0 \le s \le 1/2$ and $0 \le t \le 1/2$ (see Lemma 4), it follows that $\delta \ge 0$.

APPENDIX C
PROOF OF THE PROPOSITION

$\alpha_1(\epsilon)$, $\alpha_2(\epsilon)$, and $\alpha_3(\epsilon)$ have been determined given $\epsilon$. It suffices to show that $\alpha_2(1-\alpha_2) + \alpha_3(1-\alpha_3) - 2\delta \ge 0$. We apply the same argument as in the proof of Lemma 5; let $p$, $s$, and $t$ be as defined there. We have
$$\alpha_2(1-\alpha_2) = p(1-p) + s(1-s) - 4p(1-p)s(1-s) \qquad (C.143)$$
$$\alpha_3(1-\alpha_3) = p(1-p) + t(1-t) - 4p(1-p)t(1-t) \qquad (C.144)$$
$$\delta = p(1-p)(1-2s)(1-2t). \qquad (C.145)$$
By direct calculation, it follows that
$$\alpha_2(1-\alpha_2) + \alpha_3(1-\alpha_3) - 2\delta = s(1-s) + t(1-t) + 4p(1-p)(s-t)^2 \ge 0. \qquad (C.146)$$
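The identities (B.142) and (C.146), as well as the induction step (A.134), can be confirmed symbolically. A minimal sketch assuming sympy is available; the variable names are mine:

```python
# Symbolic checks of the identities used in Appendices A-C.
import sympy as sp

p, s, t, eps, q = sp.symbols('p s t epsilon q')

# Appendix B: joint and marginal probabilities of (v_k^(1), v_k^(2)).
alpha1 = (1 - p)*s*t + p*(1 - s)*(1 - t)   # joint, cf. (B.139)
alpha2 = (1 - p)*s + p*(1 - s)             # marginal, cf. (B.140)
alpha3 = (1 - p)*t + p*(1 - t)             # marginal, cf. (B.141)

# (B.142): delta = alpha1 - alpha2*alpha3 = p(1-p)(1-2s)(1-2t).
delta = alpha1 - alpha2*alpha3
assert sp.simplify(delta - p*(1 - p)*(1 - 2*s)*(1 - 2*t)) == 0

# (C.146): alpha2(1-alpha2) + alpha3(1-alpha3) - 2*delta
#          = s(1-s) + t(1-t) + 4p(1-p)(s-t)^2.
lhs = alpha2*(1 - alpha2) + alpha3*(1 - alpha3) - 2*delta
rhs = s*(1 - s) + t*(1 - t) + 4*p*(1 - p)*(s - t)**2
assert sp.simplify(lhs - rhs) == 0

# (A.134): one induction step of Lemma 4.
step = (1 - q)*eps + q*(1 - eps)
assert sp.simplify(step - (eps + q*(1 - 2*eps))) == 0

print("(B.142), (C.146), and (A.134) verified.")
```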
APPENDIX D
EXPLANATION FOR $\frac{d}{d\rho} I[x; \sqrt{\rho}\,x + w] \le \frac{d}{d\rho} I[\tilde x; \sqrt{\rho}\,\tilde x + w]$

The mutual information of the Gaussian channel with equiprobable binary input has been given in [6, p. 1263]. Let $x$ be the equiprobable binary input with unit variance. Using the relation [6, Theorem 1]
$$\frac{d}{d\rho} I[x; \sqrt{\rho}\,x + w] = \frac{1}{2}\,\mathrm{mmse}(\rho), \qquad (D.147)$$
we have
$$\frac{d}{d\rho} I[x; \sqrt{\rho}\,x + w] = \frac{1}{2}\,\mathrm{mmse}(\rho) = \frac{1}{2}\left(1 - \int_{-\infty}^{\infty} \frac{e^{-y^2/2}}{\sqrt{2\pi}}\,\tanh(\rho - \sqrt{\rho}\,y)\, dy\right). \qquad (D.148)$$
On the other hand, for the standard Gaussian input $\tilde x$, we have
$$\frac{d}{d\rho} I[\tilde x; \sqrt{\rho}\,\tilde x + w] = \frac{d}{d\rho}\left(\frac{1}{2}\log(1+\rho)\right) = \frac{1}{2}\cdot\frac{1}{1+\rho}. \qquad (D.149)$$
Accordingly, it suffices to show that
$$\frac{1}{1+\rho} \ge 1 - \int_{-\infty}^{\infty} \frac{e^{-y^2/2}}{\sqrt{2\pi}}\,\tanh(\rho - \sqrt{\rho}\,y)\, dy. \qquad (D.150)$$
Note that this inequality is equivalent to
$$\int_{-\infty}^{\infty} \frac{e^{-y^2/2}}{\sqrt{2\pi}}\,\tanh(\rho - \sqrt{\rho}\,y)\, dy \ge \frac{\rho}{1+\rho}. \qquad (D.151)$$
Furthermore, the above is rewritten as
$$\int_{0}^{\infty} \frac{e^{-(y-\rho)^2/2\rho}}{\sqrt{2\pi\rho}}\,\tanh(y)\, dy - \int_{0}^{\infty} \frac{e^{-(y+\rho)^2/2\rho}}{\sqrt{2\pi\rho}}\,\tanh(y)\, dy \ge \frac{\rho}{1+\rho}, \qquad (D.152)$$
where $\frac{1}{\sqrt{2\pi\rho}}\, e^{-(y-\rho)^2/2\rho}$ represents the Gaussian density with mean $\rho$ and variance $\rho$.
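Inequality (D.150) can also be checked directly by numerical integration of (D.148). A sketch assuming scipy is available; the grid of $\rho$ values is arbitrary:

```python
# Numerical check of (D.150): the binary-input MMSE from (D.148) never
# exceeds the Gaussian-input MMSE 1/(1 + rho).
import numpy as np
from scipy.integrate import quad

def binary_mmse(rho: float) -> float:
    # mmse(rho) = 1 - int phi(y) tanh(rho - sqrt(rho) y) dy, cf. (D.148)
    f = lambda y: np.exp(-y**2 / 2) / np.sqrt(2*np.pi) \
                  * np.tanh(rho - np.sqrt(rho)*y)
    val, _ = quad(f, -np.inf, np.inf)
    return 1.0 - val

for rho in (0.25, 0.5, 1.0, 2.0, 4.0):
    g = 1.0 / (1.0 + rho)          # Gaussian-input mmse
    b = binary_mmse(rho)           # binary-input mmse
    print(f"rho = {rho:4.2f}: {b:.4f} <= {g:.4f}  ({b <= g})")
```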
Here note the integrands on the left-hand side of (D.152). The variable $\rho$ appears only in the Gaussian densities. Hence, using a piecewise-linear approximation of $\tanh(y)$, numerical integration is possible. $\tanh(y)$ ($0 \le y < \infty$) is approximated as
$$\tanh(y) \approx \begin{cases} y, & 0 \le y \le 0.5 \\ \frac{1}{2}y + \frac{1}{4}, & 0.5 \le y \le 1.0 \\ \frac{3}{10}y + \frac{9}{20}, & 1.0 \le y \le 1.5 \\ \frac{1}{15}y + \frac{4}{5}, & 1.5 \le y \le 3.0 \\ 1, & 3.0 \le y < \infty. \end{cases} \qquad (D.153)$$
(cf. $\tanh(0) = 0$, $\tanh(0.5) \approx 0.46$, $\tanh(1.0) \approx 0.76$, $\tanh(1.5) \approx 0.91$, $\tanh(3.0) \approx 1.00$.)
If the above inequality holds for each $\rho$, then $\frac{d}{d\rho} I[x; \sqrt{\rho}\,x + w] \le \frac{d}{d\rho} I[\tilde x; \sqrt{\rho}\,\tilde x + w]$ will be shown. For example, let $\rho = 1$. The target inequality becomes
$$\int_{0}^{\infty} \frac{e^{-(y-1)^2/2}}{\sqrt{2\pi}}\,\tanh(y)\, dy - \int_{0}^{\infty} \frac{e^{-(y+1)^2/2}}{\sqrt{2\pi}}\,\tanh(y)\, dy \ge \frac{1}{2}. \qquad (D.154)$$
By carrying out the numerical integration, we have
$$\int_{0}^{\infty} \frac{e^{-(y-1)^2/2}}{\sqrt{2\pi}}\,\tanh(y)\, dy - \int_{0}^{\infty} \frac{e^{-(y+1)^2/2}}{\sqrt{2\pi}}\,\tanh(y)\, dy \approx 0.61 - 0.06 > \frac{1}{2}. \qquad (D.155)$$
Hence we actually have $\frac{d}{d\rho} I[x; \sqrt{\rho}\,x + w] \le \frac{d}{d\rho} I[\tilde x; \sqrt{\rho}\,\tilde x + w]$ at $\rho = 1$.
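For reference, the computation at $\rho = 1$ can be reproduced with the piecewise-linear approximation (D.153). A sketch assuming scipy is available; the printed values come from this code, not from the paper:

```python
# Evaluation of (D.154) at rho = 1 using the piecewise-linear
# approximation of tanh in (D.153).
import numpy as np
from scipy.integrate import quad

def tanh_pl(y: float) -> float:
    """Piecewise-linear approximation of tanh on [0, inf), cf. (D.153)."""
    if y <= 0.5:
        return y
    if y <= 1.0:
        return 0.5*y + 0.25
    if y <= 1.5:
        return 0.3*y + 0.45
    if y <= 3.0:
        return y/15 + 0.8
    return 1.0

phi = lambda y, m: np.exp(-(y - m)**2 / 2) / np.sqrt(2*np.pi)  # N(m, 1)

pos, _ = quad(lambda y: phi(y,  1.0) * tanh_pl(y), 0, np.inf)
neg, _ = quad(lambda y: phi(y, -1.0) * tanh_pl(y), 0, np.inf)
print(f"{pos:.4f} - {neg:.4f} = {pos - neg:.4f} > 0.5 = rho/(1+rho)")
```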
REFERENCES
[1] S. Arimoto, Kalman Filter (in Japanese). Tokyo, Japan: Sangyo Tosho Publishing, 1977.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ, USA: John Wiley & Sons, 2006.
[3] T. E. Duncan, "On the calculation of mutual information," SIAM J. Appl. Math., vol. 19, pp. 215–220, Jul. 1970.
[4] R. Esposito, "On a relation between detection and estimation in decision theory," Inform. Contr., vol. 12, pp. 116–120, 1968.
[5] G. D. Forney, Jr., "Convolutional codes I: Algebraic structure," IEEE Trans. Inf. Theory, vol. IT-16, no. 6, pp. 720–738, Nov. 1970.
[6] D. Guo, S. Shamai (Shitz), and S. Verdú, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1261–1282, Apr. 2005.
[7] G. H. Hardy, J. E. Littlewood, and G. Pólya, Inequalities, 2nd ed. Cambridge, U.K.: Cambridge University Press, 1952. (H. Hosokawa, Inequalities, Japanese transl. Tokyo, Japan: Springer-Verlag Tokyo, 2003.)
[8] K. Ito and S. Watanabe, "Introduction to stochastic differential equations," in Proc. Intern. Symp. SDE, pp. i–xxx, Jul. 1976. (Stochastic Differential Equations, K. Ito, Ed. Tokyo, Japan: Kinokuniya Book-Store, 1978.)
[9] A. H. Jazwinski, Stochastic Processes and Filtering Theory. New York, NY, USA: Academic Press, 1970.
[10] T. T. Kadota, M. Zakai, and J. Ziv, "Mutual information of the white Gaussian channel with and without feedback," IEEE Trans. Inf. Theory, vol. IT-17, no. 4, pp. 368–371, Jul. 1971.
[11] T. Kailath, "An innovations approach to least-squares estimation–Part I: Linear filtering in additive white noise," IEEE Trans. Autom. Control, vol. AC-13, no. 6, pp. 646–655, Dec. 1968.
[12] T. Kailath and P. Frost, "An innovations approach to least-squares estimation–Part II: Linear smoothing in additive white noise," IEEE Trans. Autom. Control, vol. AC-13, no. 6, pp. 655–660, Dec. 1968.
[13] T. Kailath, "A note on least-squares estimates from likelihood ratios," Inform. Contr., vol. 13, pp. 534–540, 1968.
[14] T. Kailath, "A general likelihood-ratio formula for random signals in Gaussian noise," IEEE Trans. Inf. Theory, vol. IT-15, no. 3, pp. 350–361, May 1969.
[15] T. Kailath, "A view of three decades of linear filtering theory (Invited paper)," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 146–181, Mar. 1974.
[16] T. Kailath and H. V. Poor, "Detection of stochastic processes (Invited paper)," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2230–2259, Oct. 1998.
[17] S. Kubota, S. Kato, and T. Ishitani, "Novel Viterbi decoder VLSI implementation and its performance," IEEE Trans. Commun., vol. 41, no. 8, pp. 1170–1178, Aug. 1993.
[18] H. Kunita, Estimation of Stochastic Processes (in Japanese). Tokyo, Japan: Sangyo Tosho Publishing, 1976.
[19] P. Malliavin, "Stochastic calculus of variation and hypoelliptic operators," in Proc. Intern. Symp. SDE, pp. 195–263, Jul. 1976. (Stochastic Differential Equations, K. Ito, Ed. Tokyo, Japan: Kinokuniya Book-Store, 1978.)
[20] J. L. Massey and D. J. Costello, Jr., "Nonsystematic convolutional codes for sequential decoding in space applications," IEEE Trans. Commun. Technol., vol. COM-19, no. 5, pp. 806–813, Oct. 1971.
[21] B. Øksendal, Stochastic Differential Equations: An Introduction with Applications, 5th ed. Berlin, Germany: Springer-Verlag, 1998.
[22] S. Ping, Y. Yan, and C. Feng, "An effective simplifying scheme for Viterbi decoder," IEEE Trans. Commun., vol. 39, no. 1, pp. 1–3, Jan. 1991.
[23] G. Strang, Linear Algebra and Its Applications. New York, NY, USA: Academic Press, 1976. (M. Yamaguchi and A. Inoue, Linear Algebra and Its Applications, Japanese transl. Tokyo, Japan: Sangyo Tosho Publishing, 1978.)
[24] M. Tajima, K. Shibata, and Z. Kawasaki, "On the equivalence between Scarce-State-Transition Viterbi decoding and syndrome decoding of convolutional codes," IEICE Trans. Fundamentals, vol. E86-A, no. 8, pp. 2107–2116, Aug. 2003.
[25] M. Tajima, "An innovations approach to Viterbi decoding of convolutional codes," IEEE Trans. Inf. Theory, vol. 65, no. 5, pp. 2704–2722, May 2019.
[26] M. Tajima, "Corrections to 'An innovations approach to Viterbi decoding of convolutional codes'," IEEE Trans. Inf. Theory, to be published.
[27] D. Williams, Probability with Martingales. Cambridge, U.K.: Cambridge University Press, 1991. (J. Akahori et al., Probability with Martingales, Japanese transl. Tokyo, Japan: Baifukan, 2004.)
[28] E. Wong, Stochastic Processes in Information and Dynamical Systems. New York, NY, USA: McGraw-Hill, 1971.
[29] M. Zakai, "On mutual information, likelihood ratios, and estimation error for the additive Gaussian channel," IEEE Trans. Inf. Theory, vol. 51, no. 9, pp. 3017–3024, Sep. 2005.