The Capacity of Finite-State Channels in the High-Noise Regime
Henry D. Pfister∗

Abstract
This paper considers the derivative of the entropy rate of a hidden Markov process with respect to the observation probabilities. The main result is a compact formula for the derivative that can be evaluated easily using Monte Carlo methods. It is applied to the problem of computing the capacity of a finite-state channel (FSC) and, in the high-noise regime, the formula has a simple closed-form expression that enables a series expansion of the capacity of an FSC. This expansion is evaluated for a binary-symmetric channel under a (0,1) run-length limited constraint and an intersymbol-interference channel with Gaussian noise.
A hidden Markov process (HMP) is a discrete-time finite-state Markov chain (FSMC) observed through a memoryless channel. The HMP has become ubiquitous in statistics, computer science, and electrical engineering because it approximates many processes well using a dependency structure that leads to many efficient algorithms. While the roots of the HMP lie in the "grouped Markov chains" of Harris [21] and the "functions of a finite-state Markov chain" of Blackwell [8], the HMP first appears (in full generality) as the output process of a finite-state channel (FSC) [9]. The statistical inference algorithm of Baum and Petrie [5], however, cemented the HMP's place in history and is responsible for great advances in fields such as speech recognition and biological sequence analysis [23, 25]. An exceptional survey of HMPs, by Ephraim and Merhav, gives a nice summary of what is known in this area [13].
Definition 1.1.
Let Q be the state set of an irreducible aperiodic FSMC {Q_t}_{t∈Z} with state transition matrix P and define p_{ij} ≜ [P]_{i,j} = Pr(Q_{t+1} = j | Q_t = i) for i, j ∈ Q. Let Y be a finite set of possible observations and {Y_t}_{t∈Z} be the stochastic process where Y_t ∈ Y is generated by the transition from Q_t to Q_{t+1}. The distribution of the observation conditioned on the FSMC transition is given by

h_{ij}(y) ≜ Pr(Y_t = y | Q_t = i, Q_{t+1} = j) if (i, j) ∈ V, and 0 otherwise,

for i, j ∈ Q, where V = {(i, j) ∈ Q × Q | p_{ij} > 0} is the set of valid transitions. The ergodic process {Y_t}_{t∈Z} is called a hidden Markov process. With proper initialization, the process is also stationary. Although the notation of this paper assumes that Y is a finite set, many results remain correct when Y = R if h_{ij}(y) is assumed to be a continuous p.d.f. and sums over Y are converted to integrals over R.

∗ Henry Pfister is with the Electrical and Computer Engineering Department of Texas A&M University (e-mail: hpfi[email protected]). His research was supported in part by the National Science Foundation under Grant No. 074740.

In general, HMPs are defined by noisy observations of the FSMC states (rather than the transitions). This paper uses the "transition observation" model instead because of its natural connection with finite-state channels. Moreover, any random process that can be represented by the "transition observation" HMP model with M states can also be represented by the "state observation" model with M^2 states.

1.2 The Entropy Rate

The entropy rate of a stationary stochastic process {Y_t}_{t∈Z} is defined to be

H(Y) ≜ lim_{n→∞} (1/n) H(Y_1, ..., Y_n),

where H(Y) ≜ −E[ln Pr(Y)] is the entropy of the random variable (r.v.) Y and the limit exists and is finite if H(Y_1) < ∞ [11]. Computing the exact entropy rate of an HMP in closed form appears to be difficult, however.
In [8], Blackwell states:

"In this paper we study the entropy of the {y_n} [hidden Markov] process; our result suggests that this entropy is intrinsically a complicated function of [the parameters of the hidden Markov process] M and Φ."

On the other hand, the Shannon-McMillan-Breiman Theorem shows that the empirical entropy rate −(1/n) ln Pr(y_1^n) converges almost surely to the entropy rate H(Y) (in nats) as n → ∞. Therefore, simulation-based (i.e., Monte Carlo) approaches work well in many cases [30, 17, 1, 38, 37, 3, 2]. Other early work related to the entropy rate of HMPs can be found in [7, 36, 39, 35]. Recently, interest in HMPs has surged and there have been a large number of papers discussing the entropy rate of HMPs. These range from bounds [37, 31, 32] to establishing the analyticity of the entropy rate [18] to computing series expansions of the entropy rate [44, 12, 20].

The work in this paper is largely motivated by the analysis of a class of time-varying channels known as FSCs. An FSC is a discrete-time channel where the distribution of the channel output depends on both the channel input and the underlying channel state [16]. This allows the channel output to depend implicitly on previous inputs and outputs via the channel state. In practice, there are three types of channel variation which FSCs are typically used to model. A flat fading channel is a time-varying channel whose state is independent of the channel inputs. An intersymbol-interference (ISI) channel is a time-varying channel whose state is a deterministic function of the previous channel inputs. Channels which exhibit both fading and ISI can also be modeled, and their state is a stochastic function of the previous channel inputs. An indecomposable FSC is, roughly speaking, an FSC where the effect of the initial state decays with time.
The output process of an indecomposable FSC with an ergodic Markov input is an HMP.

Consider an indecomposable FSC with state set S, finite input alphabet X, and output alphabet Y. The channel is defined by its input-output state-transition probability W(y, s′ | x, s), which is defined for all x ∈ X, y ∈ Y, and s, s′ ∈ S. Using this notation, W(y, s′ | x, s) is the conditional probability that the channel output is y and the new channel state is s′ given that the channel input was x and the initial state was s. The n-step transition probability for a sequence of n channel uses (with input x^n and output y^n) is given by

Pr(Y^n = y^n | X^n = x^n) = Σ_{s_1^{n+1} ∈ S^{n+1}} Pr(S_1 = s_1) Π_{t=1}^n W(y_t, s_{t+1} | x_t, s_t).

When Y = R, we will also use W(y, s′ | x, s) to represent a conditional probability density function for the channel outputs. The achievable information rate of an FSC with Markov inputs is intimately related to the entropy rate of an HMP [1, 38, 24, 2, 42, 22]. Computing this entropy rate exactly is usually quite difficult, and often the main obstacle in the computation of achievable rates.

The main result of this paper, given in Theorem 3.2, is a compact formula for the derivative, with respect to the observation probability h_{ij}(y), of the entropy rate of a general HMP. A Monte Carlo estimator for this derivative follows easily because the formula is an expectation over distributions that are relatively easy to sample. The formula is also amenable to analysis in some asymptotic regimes. In particular, Theorem 3.6 derives a simple formula for the first two non-trivial terms in the expansion of the entropy rate in the high-noise regime. In Section 4, this derivative formula also allows one to consider the derivative of achievable information rates for FSCs. For example, a closed-form expression for the capacity of a BSC under a (0,1) RLL constraint is derived in the high-noise limit.
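The marginalization over state sequences above is easy to check numerically. The following sketch uses a hypothetical two-state FSC with binary input and output (the array W, the initial distribution, and the function names are our own illustration, not the paper's code) and verifies that a step-by-step forward pass agrees with the brute-force sum over s_1^{n+1}:

```python
import itertools

import numpy as np

# Toy channel: W[s, x, y, s2] = Pr(Y = y, S' = s2 | X = x, S = s).
# For each (s, x), the entries over (y, s2) must sum to one.
rng = np.random.default_rng(0)
W = rng.random((2, 2, 2, 2))
W /= W.sum(axis=(2, 3), keepdims=True)

p_s1 = np.array([0.5, 0.5])          # initial state distribution Pr(S_1 = s)

def prob_brute(x_seq, y_seq):
    """Pr(Y^n = y^n | X^n = x^n) by summing over all state sequences s_1^{n+1}."""
    n = len(x_seq)
    total = 0.0
    for states in itertools.product(range(2), repeat=n + 1):
        p = p_s1[states[0]]
        for t in range(n):
            p *= W[states[t], x_seq[t], y_seq[t], states[t + 1]]
        total += p
    return total

def prob_forward(x_seq, y_seq):
    """Same probability via a forward pass that marginalizes one state at a time."""
    alpha = p_s1.copy()              # alpha[s] = Pr(S_t = s, Y^{t-1} = y^{t-1} | X^{t-1})
    for x, y in zip(x_seq, y_seq):
        alpha = alpha @ W[:, x, y, :]
    return alpha.sum()

x, y = [0, 1, 1, 0], [1, 0, 1, 1]
assert abs(prob_brute(x, y) - prob_forward(x, y)) < 1e-12
```

The forward pass costs O(n |S|^2) instead of O(|S|^{n+1}), which is the same collapse of the sum that the forward recursion of Section 2 exploits.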
Section 2 provides the mathematical background necessary for the later sections.

Calligraphic letters are used to denote sets (e.g., Q, Y, V) and 1_Y(·) is the indicator function of the set Y. Capital letters are used to denote random variables (e.g., Q_t, Y_t) and matrices (e.g., M, P). Lower-case letters are used to represent realizations of random variables (e.g., q_t, y_t), column vectors (e.g., π, α, β, u, v), and indices (e.g., i, j, k, l). The i-th element of the vector π is denoted π(i). The following sets will also be used: R_+ = {a ∈ R | a > 0}, A = R_+^{|Q|}, A_δ = {u ∈ A | u(q) > δ, q ∈ Q}, P = {u ∈ A | Σ_q u(q) = 1}, and P_δ = A_δ ∩ P. We note that the symbols π, α_t ∈ P are used interchangeably to denote distributions over Q and |Q|-dimensional column vectors (e.g., π^T P = π^T). The standard p-norm of the vector u is denoted by ||u||_p ≜ (Σ_i |u(i)|^p)^{1/p} and the induced matrix norm is ||M||_p ≜ sup_{||u||_p = 1} ||M u||_p.

One of the primary reasons for the popularity of HMPs is that the forward and backward state estimation problems have a simple recursive structure. Let us assume that the Markov chain {Q_t}_{t∈Z} is stationary and that π ∈ P is the unique stationary distribution that satisfies π^T P = π^T. For a length-n block, let the forward state probability α_t ∈ P and the backward state probability β_t ∈ A be defined by

α_t(i) ≜ Pr(Q_t = i | Y_1^{t−1} = y_1^{t−1})
β_t(j) ≜ Pr(Q_t = j | Y_t^n = y_t^n) / π(j)

for i, j ∈ Q. These definitions lead naturally to the recursions

α_{t+1}(j) = (1/ψ_{t+1}) Σ_{i∈Q} α_t(i) p_{ij} h_{ij}(y_t)
β_{t−1}(i) = (1/φ_{t−1}) Σ_{j∈Q} β_t(j) p_{ij} h_{ij}(y_{t−1})

for i, j ∈ Q, where ψ_{t+1} is chosen so that Σ_{i∈Q} α_{t+1}(i) = 1 and φ_{t−1} is chosen so that Σ_{j∈Q} π(j) β_{t−1}(j) = 1. It is worth noting that ψ_{t+1} = Pr(Y_t = y_t | Y_1^{t−1} = y_1^{t−1}) and therefore we find that

−(1/n) Σ_{t=1}^n ln ψ_{t+1} = −(1/n) ln Pr(Y_1^n = y_1^n) a.s.→ H(Y) nats.
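A minimal sketch of the resulting estimator (a toy two-state, two-symbol HMP with parameters of our own choosing; an illustration, not the paper's code): simulate the chain, run the normalized forward recursion, and average −ln ψ_{t+1}:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy HMP: p_ij = Pr(Q_{t+1} = j | Q_t = i) and h[i, j, y] = Pr(Y_t = y | Q_t = i, Q_{t+1} = j).
P = np.array([[0.9, 0.1], [0.3, 0.7]])
h = np.array([[[0.8, 0.2], [0.4, 0.6]],
              [[0.5, 0.5], [0.1, 0.9]]])

# Stationary distribution pi solving pi^T P = pi^T (Perron eigenvector of P^T).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

def entropy_rate_mc(n):
    """Estimate H(Y) in nats as -(1/n) sum_t ln(psi_{t+1}) along one sample path."""
    q = rng.choice(2, p=pi)
    alpha = pi.copy()
    log_prob = 0.0
    for _ in range(n):
        q_next = rng.choice(2, p=P[q])
        y = rng.choice(2, p=h[q, q_next])
        unnorm = alpha @ (P * h[:, :, y])   # sum_i alpha_t(i) p_ij h_ij(y_t), for each j
        psi = unnorm.sum()                  # psi_{t+1} = Pr(Y_t = y_t | Y_1^{t-1} = y_1^{t-1})
        alpha = unnorm / psi
        log_prob += np.log(psi)
        q = q_next
    return -log_prob / n

print(entropy_rate_mc(50_000))   # converges to H(Y) by Shannon-McMillan-Breiman
```

Normalizing α_t at every step is what keeps the computation stable: the unnormalized product π^T Π M(y_t) underflows after a few hundred symbols, while the running sum of ln ψ_{t+1} accumulates the same information.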
This simple connection between the forward recursion and the entropy rate implies a simple Monte Carlo approach to estimating the achievable information rates of FSCs [1, 38, 37, 3, 2].

We believe this normalization for β_{t−1}(q) is new and it appears to be the natural choice for the problem considered in this paper (and perhaps in general).

2.3 The Matrix Perspective

In this section, we review a natural connection between the product of random matrices and the forward-backward recursions. This connection is interesting in its own right, but will also be very helpful in understanding the results of later sections.
Definition 2.1.
For any y ∈ Y, the transition-observation probability matrix, M(y), is a |Q| × |Q| matrix defined by

[M(y)]_{ij} ≜ Pr(Y_t = y, Q_{t+1} = j | Q_t = i) = p_{ij} h_{ij}(y).   (2.1)

These matrices behave similarly to transition probability matrices because their sequential products compute the n-step transition observation probabilities of the form

[M(y_t) M(y_{t+1}) ⋯ M(y_{t+k})]_{ij} = Pr(Y_t^{t+k} = y_t^{t+k}, Q_{t+k+1} = j | Q_t = i).

This means that we can write
Pr(Y_1^n = y_1^n) as the matrix product

Pr(Y_1^n = y_1^n) = π^T (Π_{t=1}^n M(y_t)) 1,   (2.2)

where 1 is a |Q|-dimensional column vector of ones. When Y = R, the above expressions are understood to be probability density functions with respect to the observations and the joint probability becomes the joint density. Likewise, the forward/backward recursions can be written in matrix form as

α_{t+1}^T = α_t^T M(y_t) / (α_t^T M(y_t) 1)
β_{t−1} = M(y_{t−1}) β_t / (π^T M(y_{t−1}) β_t),

where π^T 1 = 1, α_{t+1}^T 1 = 1, and π^T β_{t−1} = 1. We will also make use of the shorthand notation

M(y_k^l) ≜ Π_{t=k}^l M(y_t).

This section summarizes some standard results on the contractive properties of positive matrices and their connections to HMPs. More details can be found in [40, 27, 26].
Definition 2.2.
For any two vectors u, v ∈ A, the Hilbert projective metric is

d(u, v) ≜ ln [max_i (u(i)/v(i)) / min_j (u(j)/v(j))] = ln max_{i,j} [u(i) v(j) / (v(i) u(j))] = −ln min_{i,j} [u(i) v(j) / (v(i) u(j))].

It is a metric on A/∼ where ∼ is the equivalence relation with u ∼ v if au = v for some a ∈ R_+.

Proposition 2.3.
For u, v, w ∈ A such that w^T u = w^T v, the Hilbert projective metric characterizes the element-wise relative distance between two vectors in the sense that, for any i ∈ Q,

d_M(u(i), v(i)) ≜ |u(i) − v(i)| / max(u(i), v(i)) ≤ 1 − e^{−d(u,v)} ≤ d(u, v)
d_m(u(i), v(i)) ≜ |u(i) − v(i)| / min(u(i), v(i)) ≤ e^{d(u,v)} − 1,

where d_M is a metric on R_+ and d_m is a semi-metric on R_+ (i.e., the triangle inequality does not hold).

Since matrix multiplication is not commutative, we use the convention that Π_{t=1}^n M(y_t) = M(y_1) M(y_2) ⋯ M(y_n).

Proof. If u(k) ≥ v(k), then we have

u(k) e^{−d(u,v)} = u(k) [min_j (v(j)/u(j))] [min_i (u(i)/v(i))] ≤ v(k) min_i (u(i)/v(i)) ≤ v(k),

where min_i (u(i)/v(i)) ≤ 1 because w^T u = w^T v. The stated results follow from u(k) − v(k) ≤ e^{d(u,v)} v(k) − v(k), u(k) − v(k) ≤ u(k) − u(k) e^{−d(u,v)}, and simple bounds on e^x. Both distances are clearly symmetric and positive definite. The triangle inequality and other properties of d_M are discussed in [43].

Lemma 2.4.
For any vectors u, v, w ∈ A such that w^T u = w^T v, we have

||u − v||_1 ≤ (1 − e^{−d(u,v)}) Σ_{i∈Q} max(u(i), v(i)) ≤ (||u||_1 + ||v||_1) d(u, v)
||u − v||_1 ≤ (e^{d(u,v)} − 1) Σ_{i∈Q} min(u(i), v(i)) ≤ (e^{d(u,v)} − 1) min(||u||_1, ||v||_1).

Proof.
The expressions follow from direct calculation of ||u − v||_1 using the bounds in Proposition 2.3.

The following theorem of Birkhoff plays an important role in the remainder of this paper.

Theorem 2.5 ([40, Ch. 3]). Consider any non-negative matrix M with at least one positive entry in every row and column. Then, for all u, v ∈ A, we have d(M u, M v) ≤ τ(M) d(u, v), where

τ(M) ≜ (1 − φ(M)^{1/2}) / (1 + φ(M)^{1/2}) = τ(M^T) ≤ 1

is the Birkhoff contraction coefficient and

φ(M) = min_{i,j,k,l} [M]_{ik} [M]_{jl} / ([M]_{jk} [M]_{il}) ≥ (min_{i,j} [M]_{ij} / max_{i,j} [M]_{ij})^2.   (2.3)

The following results connect our HMP definition with Birkhoff's contraction coefficients. An FSMC that is irreducible and aperiodic is called primitive. Since the underlying Markov chain is primitive, the matrix P must have at least one non-zero entry in each row and column.

Condition 2.6.
For some δ ≥ 0, the joint probability of every valid transition and output is greater than δ. In other words, this means that p_{ij} h_{ij}(y) > δ ≥ 0 for all (i, j) ∈ V and y ∈ Y.

Under Condition 2.6, the matrix M(y) has exactly the same pattern of zero/non-zero entries as P for all y ∈ Y. Since P is the transition matrix for an ergodic Markov chain, one finds that M(y) must also have at least one non-zero entry in each row and column for all y ∈ Y. Therefore, τ(M(y)) ≤ 1 for all y ∈ Y.

Definition 2.7.
An HMP is said to be (ǫ, k)-primitive if min_{i,j} [M(y_1^k)]_{ij} > kǫ for all y_1^k ∈ Y^k. This gives a uniform lower bound on the probability that a k-step transition of the HMP simultaneously moves between any two states and generates any output sequence y_1^k. An HMP is said to be ǫ-primitive if there exists a k < ∞ such that it is (ǫ, k)-primitive.

Lemma 2.8.
An HMP is (ǫ, k)-primitive if it satisfies Condition 2.6 with δ ≥ (kǫ)^{1/k} and P^k is a positive matrix. Moreover, this implies that π(i) ≥ kǫ (i.e., strictly positive) for all i ∈ Q.

Proof. First, we note that P^k positive implies there is a length-k path between any two states. Next, we write

[M(y_1^k)]_{q_1, q_{k+1}} = Σ_{q_2,...,q_k ∈ Q^{k−1}} Π_{t=1}^k p_{q_t, q_{t+1}} h_{q_t, q_{t+1}}(y_t) > Σ_{q_2,...,q_k ∈ Q^{k−1}} Π_{t=1}^k 1_V((q_t, q_{t+1})) δ ≥(a) δ^k,

where (a) holds because there is a length-k path between any two states. Since δ^k ≥ kǫ, we see that the HMP is (ǫ, k)-primitive according to Definition 2.7. Note that, for any u ∈ A, we have

Σ_{i∈Q} u(i) [M(y_1^k)]_{ij} ≥ (Σ_{i∈Q} u(i)) kǫ ≥ ||u||_∞ kǫ   (2.4)

for u ∈ A, which implies that π(i) ≥ kǫ for all i ∈ Q.

Lemma 2.9.
For any ǫ-primitive HMP, there exists a k_0 < ∞ such that, for all y_1^k ∈ Y^k and all k ≥ k_0,

τ(M(y_1^k)) ≤ e^{−2 k_0 ⌊k/k_0⌋ ǫ}.

Proof.
From Definition 2.7, we can assume that the HMP is (ǫ, k_0)-primitive. Using the bound (2.3), we see that

φ(M(y_1^{k_0})) ≥ (min_{i,j} [M(y_1^{k_0})]_{ij} / max_{i,j} [M(y_1^{k_0})]_{ij})^2 ≥ (k_0 ǫ)^2

and

τ(M(y_1^{k_0})) ≤ (1 − k_0 ǫ) / (1 + k_0 ǫ) ≤ e^{−2 k_0 ǫ}.

Since we can break any length-k sequence into at least ⌊k/k_0⌋ length-k_0 pieces and τ(M(y)) ≤ 1 for the remaining pieces, we have τ(M(y_1^k)) ≤ (e^{−2 k_0 ǫ})^{⌊k/k_0⌋}.

Consider any stationary stochastic process, {Y_i}_{i∈Z}, equipped with a function, M(y), that maps each y ∈ Y to a matrix. Now, consider the limit

lim_{n→∞} (1/n) ln || u^T Π_{i=1}^n M(Y_i) ||,

where u is any non-zero vector and ||·|| is any vector norm. Oseledec's multiplicative ergodic theorem says that this limit is deterministic for almost all realizations [34]. An earlier ergodic theorem of Furstenberg and Kesten [14] gives a nice proof that

lim_{n→∞} (1/n) ln || Π_{i=1}^n M(Y_i) || a.s.= γ_1,

where ||·|| is any matrix norm and γ_1 is known as the top Lyapunov exponent. The connection with entropy rate is given by the fact that, for an HMP, choosing M(y) according to (2.1) implies that H(Y) = −γ_1 [37, 22].

The forward and backward state probability vectors play a very important role in the analysis of HMPs. These vectors, α_t, β_t ∈ A, are themselves random variables which often have well-defined stationary distributions. To illustrate the mixing properties, we exploit the stationarity of the HMP and focus on time zero by defining the random variables

U_n(i) ≜ Pr(Q_0 = i | Y_{−n}^{−1} = y_{−n}^{−1})
V_n(i) ≜ Pr(Q_0 = i | Y_0^{n−1} = y_0^{n−1}) / π(i).

It is worth noting that U_n(i) is a deterministic function of y_{−n}^{−1} and V_n(i) is a deterministic function of y_0^{n−1}.
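The contraction machinery above is easy to check numerically. A small sketch (toy random strictly positive matrices and vectors; the function names are our own) computes the Hilbert projective metric of Definition 2.2 and the Birkhoff coefficient of Theorem 2.5, and verifies the contraction inequality:

```python
import numpy as np

def hilbert_metric(u, v):
    """d(u, v) = ln max_{i,j} [u(i) v(j) / (v(i) u(j))] for strictly positive vectors."""
    r = u / v
    return np.log(r.max() / r.min())

def birkhoff_tau(M):
    """tau(M) = (1 - phi^{1/2}) / (1 + phi^{1/2}), phi = min_{i,j,k,l} M_ik M_jl / (M_jk M_il)."""
    n, m = M.shape
    phi = min(M[i, k] * M[j, l] / (M[j, k] * M[i, l])
              for i in range(n) for j in range(n)
              for k in range(m) for l in range(m))
    s = np.sqrt(phi)
    return (1.0 - s) / (1.0 + s)

rng = np.random.default_rng(2)
M = rng.random((3, 3)) + 0.1          # strictly positive toy matrix
u = rng.random(3) + 0.1
v = rng.random(3) + 0.1

# The metric is projective: rescaling either argument is free.
assert abs(hilbert_metric(u, 2.5 * u)) < 1e-9
# Theorem 2.5: multiplying by a positive matrix contracts d by at least tau(M).
assert hilbert_metric(M.T @ u, M.T @ v) <= birkhoff_tau(M) * hilbert_metric(u, v) + 1e-12
```

Because τ(M) < 1 for strictly positive M, iterating the (normalized) forward recursion contracts any two initializations toward each other, which is exactly the forgetting property quantified by Lemmas 2.9 and 2.12.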
The following sufficient condition characterizes some of the HMPs that have stationary distributions.

Definition 2.10. An HMP is called almost-surely mixing if there exists a
C < ∞, γ < 1, and k < ∞ such that

Pr(d(U_m, U_n) > Cγ^n) ≤ Cγ^n
Pr(d(V_m, V_n) > Cγ^n) ≤ Cγ^n

for all m ≥ n + k ≥ k + 1. This implies that the forward and backward recursions both forget their initial conditions at an exponential rate that is uniform over all but an exponentially small set of received sequences.

Definition 2.11.
An HMP is called sample-path mixing if there exists a
C < ∞, γ < 1, and k < ∞ such that

d(U_m, U_n) ≤ Cγ^n
d(V_m, V_n) ≤ Cγ^n

for all m ≥ n + k ≥ k + 1 and all received sequences y_{−m}^{m−1} ∈ Y^{2m}. This implies that the forward and backward recursions both forget their initial conditions at an exponential rate that is uniform over all received sequences. It is easy to see that sample-path mixing implies almost-surely mixing.

Lemma 2.12. An (ǫ, k)-primitive HMP is sample-path mixing with γ = e^{−2ǫ} and C = 2 ln((kǫ)^{−1}) γ^{−k}.

Proof. For each y_{−n}^{−1}, the realization of U_n(i) is given by

u_n(i) = Pr(Q_0 = i | Y_{−n}^{−1} = y_{−n}^{−1}) = [π^T M(y_{−n}^{−1}) / (π^T M(y_{−n}^{−1}) 1)]_i.

First, we let w^T = π^T M(y_{−m}^{−n−1}) (normalized so that w^T 1 = 1) and note that (2.4) implies that

d(w^T, π^T) = ln max_{i,j} [w(i) π(j) / (π(i) w(j))] ≤ ln max_{i,j} [1 / (π(i) w(j))] ≤ ln((kǫ)^{−2})

when m ≥ n + k. Next, we use Theorem 2.5 and Lemma 2.9 to see that

d(u_m, u_n) = d(w^T M(y_{−n}^{−1}), π^T M(y_{−n}^{−1})) ≤ τ(M(y_{−n}^{−1})) d(w^T, π^T) ≤ 2 ln((kǫ)^{−1}) e^{−2⌊n/k⌋kǫ}.

This gives an exponential rate of γ = e^{−2ǫ} and C = 2 ln((kǫ)^{−1}) γ^{−k} is chosen to handle the floor function and constant. For the backward recursion, the proof is identical except that the constant C is smaller by a factor of 2 because, with w = M(y_n^{m−1}) 1,

d(M(y_n^{m−1}) 1, 1) = ln max_{i,j} [w(i) / w(j)] ≤ ln((kǫ)^{−1}).

Lemma 2.13. A (0, k)-primitive HMP is almost-surely mixing for some γ < 1 and C < ∞ if

max_{q∈Q} E[ max_{i,j} [M(Y_1^k)]_{ij} / min_{i,j} [M(Y_1^k)]_{ij} | Q_1 = q ] < ∞.

In particular, this can be applied to HMPs with continuous observations.

Proof.
This lemma follows, with slight modifications, from the arguments in [27]. Its proof is beyond the scope of this work.

Proposition 2.14.
The joint process {Q_t, α_t}_{t∈Z} forms a Markov chain. If the HMP is almost-surely mixing, then the marginal distribution converges weakly to a unique stationary measure µ_q(A).

Proof. One can see this is a Markov chain by considering the following method of generating the sequence. At each step, we first choose q_{t+1} according to p_{q_t, q_{t+1}}, then choose y_t according to h_{q_t, q_{t+1}}(y_t), and finally compute α_{t+1}(·) from α_t(·) and y_t. In most cases, this Markov chain will not have a finite state-space because α_t(·) may take uncountably many values. Of course, this process depends on the initialization of the first α_t but this dependence decays with time if the HMP is almost-surely mixing. For simplicity, one may assume the initialization α_1 = π is used.

To show that µ_q^{(t)}(A) ≜ Pr(Q_0 = q, U_t ∈ A) converges weakly to the probability measure µ_q(A) for all Borel subsets A ⊆ P, we observe that µ_q^{(t)}(A) is a Cauchy sequence with respect to the Prohorov metric. This is sufficient because the Prohorov metric metrizes weak convergence on separable spaces and P is separable [6, p. 72]. Let d(u, A) ≜ inf_{v∈A} d(u, v) and A^δ ≜ {u ∈ P | d(u, A) < δ} so that the Prohorov metric is given by

d_P(µ, µ′) = inf{δ ∈ R_+ | µ′(A) ≤ µ(A^δ) + δ for all Borel A ⊆ P}.

Since the HMP is almost-surely mixing, we can use the fact that
Pr(d(U_{t+k}, U_t) > Cγ^t) ≤ Cγ^t, for all k ≥ 0, to see that

µ_q^{(t+k)}(A) = Pr(Q_0 = q, U_{t+k} ∈ A) ≤ Pr(Q_0 = q, U_t ∈ A^{Cγ^t}) + Cγ^t = µ_q^{(t)}(A^{Cγ^t}) + Cγ^t.

This implies that d_P(µ_q^{(t)}, µ_q^{(t+k)}) ≤ Cγ^t for all k ≥ 0. Therefore, µ_q^{(t)}(A) is a Cauchy sequence with respect to d_P and it converges weakly to some probability measure. Therefore, we can define µ_q(A) to be the weak limit of µ_q^{(t)}(A).

Definition 2.15.
The (forward) Furstenberg measure is the unique stationary measure (when it exists) of the joint process {Q_t, α_t}_{t∈Z} and is given by the weak limit

Pr(Q_t = q, α_t ∈ A) w→ µ_q(A),

for any Borel measurable set A ⊆ P. While this does not depend on the initialization of α_t, one may assume the initialization α_1 = π for simplicity.

Remark 2.16. This name is chosen because the measure first appears in the work of Furstenberg and Kifer [15] and is closely related to the work that was started by Furstenberg and Kesten [14].
Consistency of the a posteriori probability (APP)
The following lemma will be used to make connections between the measures defined in this section.
Lemma 2.17.
Let X, Y be discrete r.v.s and let the APP function be E_y(x) ≜ Pr(X = x | Y = y). Then E_Y(x) = Pr(X = x | Y) is a random function (due to Y) and we have

Pr(X = x, E_Y(·) = e(·)) = Pr(E_Y(·) = e(·)) e(x).

Proof.
Applying the chain rule and the definition of E_Y(·) gives

Pr(X = x, E_Y(·) = e(·)) = Pr(E_Y(·) = e(·)) Pr(X = x | E_Y(·) = e(·)) = Pr(E_Y(·) = e(·)) e(x),

where the second step follows from the fact that E_Y(·) is a sufficient statistic for X (e.g., X can be faithfully generated from Y using the Markov chain Y → E_Y(·) → X).

Proposition 2.18.
The process {α_t}_{t∈Z} forms a Markov chain. If the HMP is almost-surely mixing, then it converges weakly to a unique stationary measure µ(A).

Proof. One can see that {α_t}_{t∈Z} is Markov by considering another method of generating the sequence. At each step, we first choose q_t according to α_t(·), then choose q_{t+1} according to p_{q_t, q_{t+1}}, then choose y_t according to h_{q_t, q_{t+1}}(y_t), and finally compute α_{t+1}(·) from α_t(·) and y_t. Of course, this process depends on the initialization of the first α_t but this dependence decays with time if the HMP is almost-surely mixing. For simplicity, one may assume the initialization α_1 = π is used.

Comparing this to Proposition 2.14, one sees that we are now using α_t(·) as a proxy distribution for Q_t. This works because Lemma 2.17 shows that

Pr(α_t ∈ A) inf_{α̃∈A} α̃(q) ≤ Pr(Q_t = q, α_t ∈ A) ≤ Pr(α_t ∈ A) sup_{α̃∈A} α̃(q),

for any open set A ⊆ P. By making A arbitrarily small, one can force the LHS and RHS to be arbitrarily close. The proof of weak convergence to a unique stationary distribution as t → ∞ is essentially identical to the corresponding proof for Proposition 2.14.

Definition 2.19.
The (forward) Blackwell measure is the unique stationary measure (when it exists) of the process {α_t}_{t∈Z} and is given by the weak limit

Pr(α_t ∈ A) w→ µ(A),

for any Borel measurable set A ⊆ P. From the definition of µ_q, we see also that µ(A) = Σ_{q∈Q} µ_q(A).

Remark 2.20. This name is chosen because this measure first appears in the work of Blackwell [8] and is now commonly called the Blackwell measure [18].
Lemma 2.21.
The Radon-Nikodym derivative dµ_q/dµ of the (forward) Furstenberg measure µ_q with respect to the (forward) Blackwell measure µ exists and satisfies

(dµ_q/dµ)(α) = Pr(Q_t = q | α_t = α)

µ-almost everywhere. This implies that µ_q(dα) = α(q) µ(dα).
First, we note that µ(A) = Σ_{q∈Q} µ_q(A) implies that µ_q is absolutely continuous w.r.t. µ. Therefore, the Radon-Nikodym derivative dµ_q/dµ exists. Since

µ_q(A) / µ(A) = Pr(Q_t = q, α_t ∈ A) / Pr(α_t ∈ A) = Pr(Q_t = q | α_t ∈ A),

the first result can be seen by choosing A to be arbitrarily small. The second result holds because α_t(·) is the APP estimate of Q_t given Y_{−∞}^{t−1} and this (e.g., see Lemma 2.17) implies that Pr(Q_t = q | α_t = α) = α(q).

Theorem 2.22 ([8]). In terms of the Blackwell measure, the entropy rate (in nats) of an HMP is

H(Y) = −∫_P µ(dα) Σ_{y∈Y} α^T M(y) 1 ln(α^T M(y) 1).   (2.5)

Proof.
Consider the sequence H(Y_t | Y_1^{t−1}) for any stationary process. This sequence is non-negative and non-increasing and therefore must have a limit. Moreover, the entropy rate

H(Y) ≜ lim_{n→∞} (1/n) H(Y_1, ..., Y_n) = lim_{n→∞} (1/n) Σ_{t=1}^n H(Y_t | Y_1^{t−1})

is the Cesàro mean of this sequence and must have the same limit. Next, we note that

α_t^T M(y) 1 = Σ_{i,j∈Q} α_t(i) p_{ij} h_{ij}(y) = Pr(Y_t = y | Y_1^{t−1} = y_1^{t−1}).

Therefore, (2.5) is simply the expression for lim_{t→∞} H(Y_t | Y_1^{t−1}).

Once again, this time in reverse...
One can also reverse time for these Markov processes so that {Q_t, β_t}_{t∈Z} forms a backward Markov chain. Starting from q_t and working backwards, one first chooses q_{t−1} according to Pr(Q_{t−1} = q_{t−1} | Q_t = q_t) = p_{q_{t−1}, q_t} π(q_{t−1}) / π(q_t). Then, one generates y_{t−1} according to h_{q_{t−1}, q_t}(y_{t−1}) and computes β_{t−1} from β_t and y_{t−1}. This process also depends on the initialization of the first β_t but this dependence decays with time if the HMP is almost-surely mixing. For simplicity, one may assume the initialization β_n = 1 is used. If the HMP is almost-surely mixing, then the joint distribution of Q_t, β_t converges weakly to a unique stationary distribution as t → −∞; the proof is very similar to the corresponding part of the proof of Proposition 2.14. This allows us to define the stationary distribution of the backwards state probability vector.

As with the forward process, we can reduce the state space to {β_t}_{t∈Z}. At each step, one chooses q_t according to Pr(Q_t = q_t) = β_t(q_t) π(q_t), then continues as described above to generate q_{t−1}, y_{t−1}, and β_{t−1}. Let B ⊆ {u ∈ A | π^T u = 1} be any open measurable set. Then, using β_t(q) π(q) as a proxy distribution for Q_t works because Lemma 2.17 shows that

Pr(β_t ∈ B) π(q) inf_{β̃∈B} β̃(q) ≤ Pr(Q_t = q, β_t ∈ B) ≤ Pr(β_t ∈ B) π(q) sup_{β̃∈B} β̃(q),

and choosing B arbitrarily small allows the LHS and RHS to be made arbitrarily close. This process also depends on the initialization of β_t, but if the HMP is almost-surely mixing, then it converges weakly to a unique stationary distribution.

Definition 2.23.
The backward Furstenberg measure is the unique stationary measure (when it exists) of the backwards process {Q_t, β_t}_{t∈Z} and is given by the weak limit

Pr(Q_t = q, β_t ∈ B) w→ ν_q(B),

for any Borel measurable set B ⊆ {u ∈ A | π^T u = 1}.

Definition 2.24.
The backward Blackwell measure is the unique stationary measure (when it exists) of the backwards process {β_t}_{t∈Z} and is given by the weak limit

Pr(β_t ∈ B) w→ ν(B),

for any Borel measurable set B ⊆ {u ∈ A | π^T u = 1}. From the definition of ν_q, we see also that ν(B) = Σ_{q∈Q} ν_q(B).

Lemma 2.25.
The Radon-Nikodym derivative dν_q/dν of the backwards Furstenberg measure ν_q with respect to the backwards Blackwell measure ν exists and satisfies

(dν_q/dν)(β) = Pr(Q_t = q | β_t = β)

ν-almost everywhere. This implies that ν_q(dβ) = π(q) β(q) ν(dβ).

Proof. First, we note that ν(B) = Σ_{q∈Q} ν_q(B) implies that ν_q is absolutely continuous w.r.t. ν. Therefore, the Radon-Nikodym derivative dν_q/dν exists. Since

ν_q(B) / ν(B) = Pr(Q_t = q, β_t ∈ B) / Pr(β_t ∈ B) = Pr(Q_t = q | β_t ∈ B),

the first result can be seen by choosing B to be arbitrarily small. The second result holds because β_t(·) is the APP estimate of Q_t given Y_t^∞ and this (e.g., see Lemma 2.17) implies that Pr(Q_t = q | β_t = β) = π(q) β(q).

In this section, we introduce a shortcut often used in the statistical physics community. It was introduced to the author by Measson et al. in [28, 29]. It has also been applied to the problem under consideration by Zuk et al. in [44, 12].

Let D ⊂ R be a compact set and g_n : D^n → R be a sequence of functions which essentially depend on a single parameter θ ∈ D in n different ways. Abusing notation, we also let g_n : D → R be the same function where this dependency is combined so that g_n(θ) = g_n(θ, ..., θ). The total derivative of g_n can be written as

(d/dθ) g_n(θ) = Σ_{i=1}^n (∂/∂θ_i) g_n(θ_1, ..., θ_n) |_{(θ_1,...,θ_n)=(θ,...,θ)}.

This motivates us to define

g′_n(θ_1, ..., θ_n) ≜ Σ_{i=1}^n (∂/∂θ_i) g_n(θ_1, ..., θ_n).

Since the abuse of notation is habit forming, we will also define g′_n(θ) ≜ g′_n(θ, ..., θ). The focus of this paper is the limit of these functions as n goes to infinity, so a few technical details are required. If g_n(θ) → f(θ) uniformly over θ ∈ D and lim_{n→∞} g′_n(θ) converges uniformly over θ ∈ D, then it follows that f′(θ) = lim_{n→∞} g′_n(θ) [4].
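The shortcut is just the multivariate chain rule along the diagonal, and it can be sanity-checked with finite differences on a toy function (our own example with n = 3; nothing here comes from the paper):

```python
import numpy as np

def g(thetas):
    """A toy multivariate function g_n(theta_1, ..., theta_n) with n = 3."""
    t1, t2, t3 = thetas
    return np.sin(t1) * np.exp(t2) + t3 ** 2

def total_derivative(f, theta, n=3, eps=1e-6):
    """Sum of partials d/dtheta_i evaluated on the diagonal (theta, ..., theta)."""
    base = np.full(n, theta)
    total = 0.0
    for i in range(n):
        bumped = base.copy()
        bumped[i] += eps
        total += (f(bumped) - f(base)) / eps
    return total

def diagonal_derivative(f, theta, n=3, eps=1e-6):
    """Ordinary derivative of the diagonal restriction g_n(theta) = g_n(theta, ..., theta)."""
    return (f(np.full(n, theta + eps)) - f(np.full(n, theta))) / eps

theta = 0.7
assert abs(total_derivative(g, theta) - diagonal_derivative(g, theta)) < 1e-4
```

The point of the shortcut is that each partial derivative ∂/∂θ_i touches only one factor of a long product, which is what makes the per-term analysis in the warmup example below tractable.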
One might assume that it is necessary to prove uniform convergence for both of these sequences, but the following standard problem in analysis shows that it suffices to consider only the sequence of derivatives.

Lemma 3.1.
Let g_n : D → R be a sequence of functions that are continuously differentiable on a compact set D ⊂ R. If g_n(θ_0) converges for some θ_0 ∈ D and g′_n(θ) converges uniformly on D, then the limits

f(θ) ≜ lim_{n→∞} g_n(θ)
f′(θ) ≜ lim_{n→∞} g′_n(θ)

both exist and are uniformly continuous on D.

Proof. First, we note that each g′_n(θ) is uniformly continuous because D is compact. Since g′_n(θ) converges uniformly, we find that f′(θ) exists and is uniformly continuous (and hence bounded) on D. Interchanging the limit and integral, based on uniform convergence, implies that

lim_{n→∞} [g_n(θ) − g_n(θ_0)] = lim_{n→∞} ∫_{θ_0}^θ g′_n(x) dx = ∫_{θ_0}^θ lim_{n→∞} g′_n(x) dx = ∫_{θ_0}^θ f′(x) dx = f(θ) − f(θ_0).

This implies that g_n(θ) converges to f(θ). Finally, we note that f(θ) is uniformly continuous on D because f′(θ) exists and is bounded on D.

3.2 Warmup Example: The Derivative of the Log Spectral Radius

The spectral radius of a real matrix M is defined to be

ρ(M) ≜ lim_{n→∞} ||M^n||^{1/n}

for any matrix norm. Likewise, the log spectral radius (LSR) of a real matrix M is given by

ln ρ(M) = lim_{n→∞} (1/n) ln ||M^n||,

for any matrix norm. Moreover, if M has non-negative entries, then

ln ρ(M) = lim_{n→∞} (1/n) ln(u^T M^n v)

for any vectors u, v ∈ A.

Let M_θ be a mapping from a compact set D ⊂ R to the set of non-negative real matrices. Assume further that M_θ has a unique real eigenvalue λ_1 of maximum modulus (i.e., the 2nd largest eigenvalue λ_2 satisfies |λ_2/λ_1| ≤ γ < 1) for all θ ∈ D. Using the shorthand notation M ≜ M_{θ*} for θ* ∈ D, we let a, b ∈ A be left/right (column) eigenvectors of M with eigenvalue ρ(M); they satisfy a^T M = ρ(M) a^T and M b = ρ(M) b.
In this case, it is known that the derivative of the LSR is given by
\[ \frac{d}{d\theta} \ln \rho(M_\theta) \bigg|_{\theta=\theta^*} = \frac{a^T M'_{\theta^*} b}{a^T M_{\theta^*} b}, \]
where $M' \triangleq M'_{\theta^*}$ is the element-wise derivative defined by $[M'_\theta]_{ij} \triangleq \frac{d}{d\theta}[M_\theta]_{ij}$. Of course, one must assume that $M'$ exists and satisfies $\|M'\| < \infty$.

One can prove this by applying the derivative shortcut to $f(\theta) = \ln \rho(M_\theta)$ using
\[ g_n(\theta_1,\ldots,\theta_n) = \frac{1}{n} \ln \left( u^T \left( \prod_{t=1}^n M_{\theta_t} \right) v \right) \]
for any vectors $u, v \in \mathcal{A}$. Based on Lemma 3.1, we focus on $g'_n(\theta)$ by writing
\begin{align*}
g'_n(\theta^*) &= \sum_{i=1}^n \frac{\partial}{\partial\theta_i} \frac{1}{n} \ln \left( u^T \left( \prod_{t=1}^n M_{\theta_t} \right) v \right) \Bigg|_{(\theta_1,\ldots,\theta_n)=(\theta^*,\ldots,\theta^*)} \\
&= \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\theta_i} \ln \left( u^T \left( \prod_{t=1}^{i-1} M_{\theta_t} \right) M_{\theta_i} \left( \prod_{t=i+1}^n M_{\theta_t} \right) v \right) \Bigg|_{\theta^n=(\theta^*,\ldots,\theta^*)} \\
&= \frac{1}{n} \sum_{i=1}^n \frac{ u^T \left( \prod_{t=1}^{i-1} M_{\theta_t} \right) M'_{\theta_i} \left( \prod_{t=i+1}^n M_{\theta_t} \right) v }{ u^T \left( \prod_{t=1}^{i-1} M_{\theta_t} \right) M_{\theta_i} \left( \prod_{t=i+1}^n M_{\theta_t} \right) v } \Bigg|_{\theta^n=(\theta^*,\ldots,\theta^*)} = \frac{1}{n} \sum_{i=1}^n \frac{ u^T M^{i-1} M' M^{n-i} v }{ u^T M^{i-1} M M^{n-i} v },
\end{align*}
where we have used that
\[ \frac{d}{d\theta}\, x^T M_\theta\, y = \sum_{k,l} x_k \frac{d}{d\theta} [M_\theta]_{k,l}\, y_l = x^T M'_\theta\, y. \]
Since $M_\theta$ satisfies $|\lambda_2/\lambda_1| \leq \gamma$ for all $\theta \in D$, it follows that
\[ \frac{u^T M^{i-1}}{\|u^T M^{i-1}\|} = a^T + O\!\left(\gamma^{i-1}\right) \qquad \frac{M^{n-i} v}{\|M^{n-i} v\|} = b + O\!\left(\gamma^{n-i}\right). \]
Neglecting the terms within $\lfloor (\ln n)^2 \rfloor$ of the block edges gives
\[ g'_n(\theta^*) = O\!\left( \frac{(\ln n)^2}{n} \cdot \frac{\|M'\|}{(a^T b)\,\rho(M)} \cdot \frac{u^T b}{\|u\|} \cdot \frac{a^T v}{\|v\|} \right) + \frac{1}{n} \sum_{i=\lfloor (\ln n)^2 \rfloor + 1}^{n - \lfloor (\ln n)^2 \rfloor} \frac{ a^T M' b + O\!\left( \gamma^{(\ln n)^2} \right) \|M'\| }{ a^T M b + O\!\left( \gamma^{(\ln n)^2} \right) \|M\| }. \]
Therefore, $g_n(\theta)$ and $g'_n(\theta)$ converge uniformly for all $\theta \in D$ and we find that
\[ f'(\theta^*) = \frac{a^T M' b}{a^T M b}. \]
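The LSR derivative formula is easy to validate numerically. The sketch below uses a hypothetical family of positive matrices (not from the paper) and compares the formula against a finite difference of the log spectral radius.

```python
import numpy as np

# Numerical check of the LSR derivative formula
#   d/dtheta ln rho(M_theta) = (a^T M' b) / (a^T M b)
# for a hypothetical family of positive matrices (not from the paper).

def M(theta):
    return np.array([[1.0 + theta, 0.5],
                     [0.3, 2.0 + theta ** 2]])

def Mprime(theta):  # element-wise derivative of M_theta
    return np.array([[1.0, 0.0],
                     [0.0, 2.0 * theta]])

theta = 0.4
Mt = M(theta)

# Right and left eigenvectors for the eigenvalue of maximum modulus
w, V = np.linalg.eig(Mt)
b = V[:, np.argmax(w.real)].real
wl, Vl = np.linalg.eig(Mt.T)
a = Vl[:, np.argmax(wl.real)].real

formula = (a @ Mprime(theta) @ b) / (a @ Mt @ b)

# Compare with a central finite difference of ln rho(M_theta)
def log_rho(th):
    return np.log(np.max(np.abs(np.linalg.eigvals(M(th)))))

h = 1e-6
fd = (log_rho(theta + h) - log_rho(theta - h)) / (2 * h)
assert abs(formula - fd) < 1e-5
```

Note that any sign ambiguity in the computed eigenvectors cancels between the numerator and denominator of the formula.
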
Let $M_\theta(y)$ be the transition observation probability matrix of an HMP, which depends on the real parameter $\theta$, and let $\pi$ be the stationary distribution of the underlying Markov chain. To compute the derivative of the entropy rate, we define
\[ g_n(\theta_1,\ldots,\theta_n) = -\frac{1}{n} \sum_{y^n \in \mathcal{Y}^n} \Pr(Y^n = y^n; \theta^n) \ln \Pr(Y^n = y^n; \theta^n) = -\frac{1}{n} \sum_{y^n \in \mathcal{Y}^n} \left( \pi^T \Big( \prod_{i=1}^n M_{\theta_i}(y_i) \Big) \mathbf{1} \right) \ln \left( \pi^T \Big( \prod_{i=1}^n M_{\theta_i}(y_i) \Big) \mathbf{1} \right). \]
This implies that $f(\theta) = \lim_{n\to\infty} g_n(\theta) = H(\mathcal{Y};\theta)$ in nats.

Theorem 3.2.
Let $D \subset \mathbb{R}$ be a compact set and assume that $\frac{d}{d\theta}\pi = 0$ and that $M'_\theta(y) \triangleq \frac{d}{d\theta} M_\theta(y)$ exists for all $\theta \in D$. Then, if the HMP is well-defined and $\epsilon$-primitive for all $\theta \in D$, then $f'(\theta^*) = \frac{d}{d\theta} H(\mathcal{Y};\theta) \big|_{\theta=\theta^*}$ equals
\[ -\int_{\mathcal{A}} \mu(d\alpha) \int_{\mathcal{A}} \nu(d\beta) \sum_{y\in\mathcal{Y}} \alpha^T M'_{\theta^*}(y)\,\beta \, \ln\!\left( \alpha^T M_{\theta^*}(y)\,\beta \right), \tag{3.1} \]
where $\mu$ and $\nu$ are the forward/backward Blackwell measures of the HMP at $\theta = \theta^*$. Moreover, $f(\theta)$ and $f'(\theta)$ are uniformly continuous on $D$.

Proof. The following shorthand is used throughout: $\pi_t(q) \triangleq \Pr(Q_t = q)$, $M(y) \triangleq M_{\theta^*}(y)$, $M'(y) \triangleq M'_{\theta^*}(y)$, and $M(y_j^k) \triangleq \prod_{t=j}^k M_{\theta^*}(y_t)$. For the HMP to be well-defined, the transition matrices must satisfy $\sum_{y\in\mathcal{Y}} M_\theta(y)\mathbf{1} = \mathbf{1}$ and $\sum_{y\in\mathcal{Y}} M'_\theta(y)\mathbf{1} = \mathbf{0}$ for all $\theta \in D$. It follows that, for any $u$ in the probability simplex, one has
\[ \sum_{y^n\in\mathcal{Y}^n} u^T \left( \prod_{t=1}^n M_{\theta_t}(y_t) \right) \mathbf{1} = 1 \qquad \frac{\partial}{\partial\theta_j} \sum_{y^n\in\mathcal{Y}^n} u^T \left( \prod_{t=1}^n M_{\theta_t}(y_t) \right) \mathbf{1} = 0. \tag{3.2} \]
Based on Lemma 3.1, we note that the entropy rate exists for all $\theta \in D$ and focus on the derivative
\begin{align*}
g'_n(\theta^*) &\overset{(a)}{=} -\frac{1}{n}\sum_{j=1}^n \frac{\partial}{\partial\theta_j} \sum_{y^n\in\mathcal{Y}^n} \left( \pi^T \Big(\prod_{t=1}^n M_{\theta_t}(y_t)\Big)\mathbf{1} \right) \left[ \ln\!\left( \frac{1}{C_j}\, \pi^T \Big(\prod_{t=1}^n M_{\theta_t}(y_t)\Big)\mathbf{1} \right) + \ln C_j \right] \Bigg|_{\theta_j=\theta^*} \\
&\overset{(b)}{=} -\frac{1}{n}\sum_{j=1}^n \frac{\partial}{\partial\theta_j} \sum_{y^n\in\mathcal{Y}^n} \left( \pi^T \Big(\prod_{t=1}^n M_{\theta_t}(y_t)\Big)\mathbf{1} \right) \ln\!\left( \frac{1}{C_j}\, \pi^T \Big(\prod_{t=1}^n M_{\theta_t}(y_t)\Big)\mathbf{1} \right) \Bigg|_{\theta_j=\theta^*} \\
&\overset{(c)}{=} -\frac{1}{n}\sum_{j=1}^n \frac{\partial}{\partial\theta_j} \sum_{y^n\in\mathcal{Y}^n} \pi^T M(y_1^{j-1})\, M_{\theta_j}(y_j)\, M(y_{j+1}^n)\mathbf{1} \cdot \ln \frac{ \pi^T M(y_1^{j-1})\, M_{\theta_j}(y_j)\, M(y_{j+1}^n)\mathbf{1} }{ \big( \pi^T M(y_1^{j-1})\mathbf{1} \big)\big( \pi_{j+1}^T M(y_{j+1}^n)\mathbf{1} \big) } \Bigg|_{\theta_j=\theta^*},
\end{align*}
where $(a)$ holds for arbitrary positive values $C_1,\ldots,C_n$, $(b)$ follows because (3.2) implies the $\ln C_j$ term gives no contribution if $\frac{\partial}{\partial\theta_j} C_j = 0$, and $(c)$ follows from choosing
\[ C_j = \Pr\!\left(Y_1^{j-1} = y_1^{j-1}\right) \Pr\!\left(Y_{j+1}^n = y_{j+1}^n\right) = \left( \pi^T M(y_1^{j-1})\mathbf{1} \right)\left( \pi_{j+1}^T M(y_{j+1}^n)\mathbf{1} \right). \]
One subtlety is that $\pi_{j+1}^T = \pi_j^T \sum_{y\in\mathcal{Y}} M_{\theta_j}(y)$ is affected by $\theta_j$. So, small changes in $\theta_j$ cause small changes in $\pi_{j+1}$ and we must add the condition $\frac{d}{d\theta}\pi = 0$ to guarantee that $\frac{\partial}{\partial\theta_j} C_j = 0$. After adding this condition, we may safely assume that $\pi_j = \pi$ for $j = 1,\ldots,n$. See Remark 3.3 for more details.

For Borel measurable sets $A \subseteq \mathcal{A}$ and $B \subseteq \{ u \in \mathcal{A} \mid \pi^T u = 1 \}$, the sets
\[ U_j(A) \triangleq \left\{ y_1^{j-1} \in \mathcal{Y}^{j-1} \;\middle|\; \alpha_j^T = \frac{\pi^T M(y_1^{j-1})}{\pi^T M(y_1^{j-1})\mathbf{1}} \in A \right\} \qquad V_j(B) \triangleq \left\{ y_j^n \in \mathcal{Y}^{n-j+1} \;\middle|\; \beta_j = \frac{M(y_j^n)\mathbf{1}}{\pi^T M(y_j^n)\mathbf{1}} \in B \right\} \]
will be used to define the measures $\mu^{(j)}(A) \triangleq \Pr\!\left( Y_1^{j-1} \in U_j(A) \right)$ and $\nu^{(j)}(B) \triangleq \Pr\!\left( Y_j^n \in V_j(B) \right)$ for the forward/backward state probabilities. In this case, $\mu^{(j)}(\cdot)$ and $\nu^{(j)}(\cdot)$ are probability measures on $\mathcal{A}$ for the random variables $\alpha_j, \beta_j$. Using these measures, we find that $g'_n(\theta^*)$ is given by
\begin{align*}
&= -\frac{1}{n} \sum_{j=1}^n \frac{\partial}{\partial\theta_j} \sum_{y^n\in\mathcal{Y}^n} \left( \pi^T M(y_1^{j-1})\mathbf{1} \right) \alpha_j^T\, M_{\theta_j}(y_j)\, \beta_{j+1} \left( \pi^T M(y_{j+1}^n)\mathbf{1} \right) \ln\!\left( \alpha_j^T\, M_{\theta_j}(y_j)\, \beta_{j+1} \right) \Bigg|_{\theta_j=\theta^*} \\
&= -\frac{1}{n} \sum_{j=1}^n \int_{\mathcal{A}} \mu^{(j)}(d\alpha) \int_{\mathcal{A}} \nu^{(j+1)}(d\beta)\, \frac{\partial}{\partial\theta_j} \sum_{y_j\in\mathcal{Y}} \alpha^T M_{\theta_j}(y_j)\,\beta \, \ln\!\left( \alpha^T M_{\theta_j}(y_j)\,\beta \right) \Bigg|_{\theta_j=\theta^*} \\
&= -\frac{1}{n} \sum_{j=1}^n \int_{\mathcal{A}} \mu^{(j)}(d\alpha) \int_{\mathcal{A}} \nu^{(j+1)}(d\beta) \sum_{y_j\in\mathcal{Y}} \left[ \alpha^T M'(y_j)\beta \, \ln\!\left( \alpha^T M(y_j)\beta \right) + \alpha^T M'(y_j)\beta \right].
\end{align*}
All that is left is to compute the sum. If the HMP is almost-surely mixing, then the results of Section 2.4 show that the measures converge weakly (i.e., $\mu^{(j)} \to \mu$ and $\nu^{(j)} \to \nu$). Moreover, Lemma A.2 in Appendix A.1 shows that the convergence rate is exponential. Therefore, most of the terms in the sum have essentially the same value.
As with the LSR, we neglect terms within $(\ln n)^2$ of the block edges because their contribution is negligible as $n \to \infty$. The exponential convergence of the stationary measures also shows that the interior terms become equal at the super-polynomial rate $\gamma^{(\ln n)^2} = n^{\ln n \cdot \ln \gamma}$. Therefore, $g_n(\theta)$ and $g'_n(\theta)$ converge uniformly for all $\theta \in D$ and
\[ \lim_{n\to\infty} -\frac{1}{n} \sum_{j=1}^n \int_{\mathcal{A}} \mu^{(j)}(d\alpha) \int_{\mathcal{A}} \nu^{(j+1)}(d\beta) \sum_{y_j\in\mathcal{Y}} \left[ \alpha^T M'(y_j)\beta \ln\!\left(\alpha^T M(y_j)\beta\right) + \alpha^T M'(y_j)\beta \right] \]
converges to
\[ \frac{d}{d\theta} H(\mathcal{Y};\theta)\bigg|_{\theta=\theta^*} = -\int_{\mathcal{A}} \mu(d\alpha) \int_{\mathcal{A}} \nu(d\beta) \sum_{y\in\mathcal{Y}} \left[ \alpha^T M'(y)\beta \ln\!\left(\alpha^T M(y)\beta\right) + \alpha^T M'(y)\beta \right]. \tag{3.3} \]
Finally, the last term in (3.3) is shown to be zero in Lemma 3.4.

Remark 3.3. The necessity of the condition $\frac{d}{d\theta}\pi = 0$ in Theorem 3.2 can be a bit subtle. This is because the $\pi$-term in many equations (e.g., $\pi^T M(y_{j+1}^n)\mathbf{1}$) actually represents the state distribution at a particular time (e.g., time $j+1$). The indices are dropped after the first few steps because the underlying Markov chain is stationary and the state distribution is independent of time. For example, the proof liberally uses the assumption that
\[ \Pr\!\left(Y_{j+1}^n = y_{j+1}^n\right) = \sum_{q,q'\in\mathcal{Q}} \Pr(Q_{j+1} = q) \Pr\!\left(Q_{n+1} = q', Y_{j+1}^n = y_{j+1}^n \mid Q_{j+1} = q\right) = \pi^T M\!\left(y_{j+1}^n\right)\mathbf{1}, \]
where the last step clearly requires that $\Pr(Q_{j+1} = q) = \pi(q)$. Moreover, this is not simply a problem with the proof. The author has applied the formula from Theorem 3.2 to a Markov chain (where the true entropy-rate derivative is well-known) and shown that the two expressions become equal only if $\frac{d}{d\theta}\pi = 0$.

Lemma 3.4.
The following properties of the forward/backward Blackwell measures will be useful:
\[ \int_{\mathcal{A}} \mu(d\alpha)\,\alpha = \pi \qquad \int_{\mathcal{A}} \nu(d\beta)\,\beta = \mathbf{1} \]
\[ \int_{\mathcal{A}} \mu(d\alpha) \sum_{y\in\mathcal{Y}} \alpha^T M(y)\,\beta = 1 \qquad \int_{\mathcal{A}} \nu(d\beta) \sum_{y\in\mathcal{Y}} \alpha^T M(y)\,\beta = 1 \]
\[ \int_{\mathcal{A}} \mu(d\alpha) \int_{\mathcal{A}} \nu(d\beta) \sum_{y\in\mathcal{Y}} \alpha^T M'(y)\,\beta = 0. \]

Proof.
The proof is deferred to the appendix.
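As claimed in the abstract, the formula (3.1) can be evaluated easily by Monte Carlo: run the usual forward/backward recursions on one long simulated output sequence to obtain samples of $(\alpha, \beta)$ that are asymptotically distributed according to the Blackwell measures, then average the integrand. The two-state HMP below (a symmetric Markov chain observed through a BSC with flip probability $\varepsilon$, with $\theta = \varepsilon$) is a hypothetical illustration, not an example from the paper.

```python
import numpy as np

# Monte Carlo sketch of the derivative formula (3.1): sample alpha (forward) and
# beta (backward) by running the recursions on a long simulated output sequence,
# then average  -sum_y alpha^T M'(y) beta ln(alpha^T M(y) beta).
# Hypothetical model: symmetric two-state chain, output observes the new state
# through a BSC(eps); the parameter is theta = eps.

def mc_derivative(p, eps, n=20000, seed=0):
    rng = np.random.default_rng(seed)
    P = np.array([[1 - p, p], [p, 1 - p]])
    pi = np.array([0.5, 0.5])                 # stationary distribution
    Ms = [P * np.array([1 - eps, eps]), P * np.array([eps, 1 - eps])]
    Mp = [P * np.array([-1.0, 1.0]), P * np.array([1.0, -1.0])]  # d/d(eps)

    # Simulate the chain; y_t is a noisy observation of the new state
    q, ys = rng.integers(2), np.empty(n, dtype=int)
    for t in range(n):
        q = rng.choice(2, p=P[q])
        ys[t] = q if rng.random() > eps else 1 - q

    # Forward recursion: alpha_t proportional to alpha_{t-1}^T M(y_t)
    alphas, a = np.empty((n, 2)), pi.copy()
    for t in range(n):
        a = a @ Ms[ys[t]]
        a /= a.sum()
        alphas[t] = a
    # Backward recursion: beta_t proportional to M(y_t) beta_{t+1}, pi^T beta = 1
    betas, b = np.empty((n, 2)), np.ones(2)
    for t in range(n - 1, -1, -1):
        b = Ms[ys[t]] @ b
        b /= pi @ b
        betas[t] = b

    # Pair alpha (past) with beta (future), skipping one "middle" slot, and
    # average the integrand away from the block edges
    burn, est, cnt = 100, 0.0, 0
    for t in range(burn, n - burn):
        al, be = alphas[t], betas[t + 2]
        for y in (0, 1):
            est -= (al @ Mp[y] @ be) * np.log(al @ Ms[y] @ be)
        cnt += 1
    return est / cnt

est = mc_derivative(0.3, 0.2)
```

When $p = \tfrac{1}{2}$ the output process is i.i.d. uniform and the estimator returns exactly zero, which is a useful sanity check.
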
Suppose the domain of $\theta$ includes a "high noise" point $\theta^*$ where the channel output provides no information about the channel state. In this case, the forward/backward Blackwell measures become point masses at $\pi$ and $\mathbf{1}$, and the entropy rate $H(\mathcal{Y};\theta)$ converges to the single-letter entropy $H_1(\mathcal{Y};\theta)$ as $\theta \to \theta^*$. In the high-noise regime, one can also evaluate the derivative from Theorem 3.2 in closed form and extend the formula to the 2nd derivative. In this section, we compare the expansions of $H(\mathcal{Y};\theta)$ and $H_1(\mathcal{Y};\theta)$.

First, we consider the single-letter entropy
\[ H_1(\mathcal{Y};\theta) = -\sum_{y\in\mathcal{Y}} \Pr(Y_t = y) \ln\!\left( \Pr(Y_t = y) \right) = -\sum_{y\in\mathcal{Y}} \pi^T M_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right), \]
where $\pi$ is the stationary distribution of the underlying Markov chain as a function of $\theta$.

Lemma 3.5.
Under the assumption that $\frac{d}{d\theta}\pi = 0$ for all $\theta \in D$, the 1st derivative w.r.t. $\theta$ of the single-letter entropy is given by
\[ \frac{d}{d\theta} H_1(\mathcal{Y};\theta) = -\sum_{y\in\mathcal{Y}} \pi^T M'_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right). \]
Under the same assumption, the 2nd derivative w.r.t. $\theta$ is given by
\[ \frac{d^2}{d\theta^2} H_1(\mathcal{Y};\theta) = -\sum_{y\in\mathcal{Y}} \frac{\left( \pi^T M'_\theta(y)\mathbf{1} \right)^2}{\pi^T M_\theta(y)\mathbf{1}} - \sum_{y\in\mathcal{Y}} \pi^T M''_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right). \tag{3.4} \]

Proof. In particular, the 1st derivative is given by
\begin{align*}
\frac{d}{d\theta} H_1(\mathcal{Y};\theta) &= -\frac{d}{d\theta} \sum_{y\in\mathcal{Y}} \pi^T M_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right) \\
&= -\sum_{y\in\mathcal{Y}} \left( \Big( \tfrac{d}{d\theta}\pi^T \Big) M_\theta(y)\mathbf{1} + \pi^T M'_\theta(y)\mathbf{1} \right) \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right) - \frac{d}{d\theta} \sum_{y\in\mathcal{Y}} \pi^T M_\theta(y)\mathbf{1} \\
&= -\sum_{y\in\mathcal{Y}} \pi^T M'_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right)
\end{align*}
because $\frac{d}{d\theta}\pi = 0$ and $\sum_{y\in\mathcal{Y}} \pi^T M_\theta(y)\mathbf{1} = 1$ for all $\theta$. Since $\frac{d}{d\theta}\pi = 0$ for all $\theta \in D$, the 2nd derivative is given by
\[ \frac{d^2}{d\theta^2} H_1(\mathcal{Y};\theta) = -\frac{d}{d\theta} \sum_{y\in\mathcal{Y}} \pi^T M'_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right) = -\sum_{y\in\mathcal{Y}} \pi^T M''_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right) - \sum_{y\in\mathcal{Y}} \frac{\left( \pi^T M'_\theta(y)\mathbf{1} \right)^2}{\pi^T M_\theta(y)\mathbf{1}}. \]

Now, we consider closed-form evaluation of Theorem 3.2. Since the first derivative is often zero at $\theta = \theta^*$, we are fortunate that a new formula for the 2nd derivative can also be evaluated in closed form.

Theorem 3.6.
If there is a function $s(y)$, a $\theta^* \in D$, and a matrix $P$ such that $\lim_{\theta\to\theta^*} M_\theta(y) = s(y)P$ for all $y \in \mathcal{Y}$, then
\[ \frac{d}{d\theta} H(\mathcal{Y};\theta)\bigg|_{\theta=\theta^*} = -\sum_{y\in\mathcal{Y}} \pi^T M'(y)\mathbf{1} \, \ln(s(y)) \]
and
\[ \frac{d^2}{d\theta^2} H(\mathcal{Y};\theta)\bigg|_{\theta=\theta^*} = -\sum_{y\in\mathcal{Y}} \pi^T M''(y)\mathbf{1} \, \ln(s(y)) - \sum_{y\in\mathcal{Y}} \frac{\left( \pi^T M'(y)\mathbf{1} \right)^2}{\pi^T M(y)\mathbf{1}}. \tag{3.5} \]

Proof.
The proof is deferred to the appendix.
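The closed forms in Theorem 3.6 are simple enough to evaluate directly from the model matrices. The helper below is a minimal sketch (the names and the two-state test case are hypothetical); it is exercised on a binary Markov chain, with $p_0 \triangleq p_{00}$ and $p_1 \triangleq p_{11}$, observed through a BSC at $\varepsilon = \tfrac{1}{2}$, where the matrices factor as $M_0(y) = \tfrac{1}{2}P$, so $s(y) = \tfrac{1}{2}$.

```python
import numpy as np

# Sketch: evaluate the high-noise derivatives of Theorem 3.6 directly from the
# model matrices. Inputs are hypothetical names: M, Mp (M'), Mpp (M'') are lists
# indexed by the output symbol y, s holds the limiting output probabilities
# s(y), and pi is the stationary distribution.

def high_noise_derivs(pi, M, Mp, Mpp, s):
    one = np.ones(len(pi))
    d1 = -sum((pi @ Mp[y] @ one) * np.log(s[y]) for y in range(len(s)))
    d2 = (-sum((pi @ Mpp[y] @ one) * np.log(s[y]) for y in range(len(s)))
          - sum((pi @ Mp[y] @ one) ** 2 / (pi @ M[y] @ one) for y in range(len(s))))
    return d1, d2

# Two-state Markov chain (p0 = p00, p1 = p11) observed through a BSC at
# eps = 1/2 (theta = 1/2 - eps = 0), where M_theta(y) = P/2 +/- theta*D
p0, p1 = 0.3, 0.6
P = np.array([[p0, 1 - p0], [1 - p1, p1]])
pi = np.array([1 - p1, 1 - p0]) / (2 - p0 - p1)
D = np.array([[p0, -(1 - p0)], [1 - p1, -p1]])
Z = np.zeros((2, 2))

d1, d2 = high_noise_derivs(pi, [P / 2, P / 2], [D, -D], [Z, Z], [0.5, 0.5])
# d1 vanishes by symmetry; d2 works out to -4 (p0 - p1)^2 / (2 - p0 - p1)^2
```
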
Consider the HMP defined by a binary Markov-1 source observed through a BSC($\varepsilon$). The two-state Markov process is defined by $\Pr(Q_{t+1} = j \mid Q_t = i) = p_{ij}$ with stationary distribution $\Pr(Q_t = i) = \pi(i)$ and, writing $p_0 \triangleq p_{00}$ and $p_1 \triangleq p_{11}$,
\[ \pi(0) = 1 - \pi(1) = \frac{1 - p_1}{2 - p_0 - p_1}. \]
The output of the HMP is simply the observation of the state through a BSC or, more specifically,
\[ h_{i,j}(y) = \begin{cases} 1-\varepsilon & \text{if } y = j \\ \varepsilon & \text{otherwise.} \end{cases} \]
The entropy rate of this process was considered earlier using a range of techniques [31, 32, 18, 44]. Now, we will consider the entropy rate of this process as $\varepsilon \to \frac{1}{2}$ (i.e., in the high-noise regime). This special case was also treated earlier and very similar results were obtained using different methods in [20, 19, 33].

Since we are interested in the high-noise regime, we start by analyzing the system using the upper bound $H(\mathcal{Y}) \leq H_1(\mathcal{Y})$. This gives
\[ H_1(\mathcal{Y}) = -\sum_{y\in\mathcal{Y}} \Pr(Y = y) \ln\!\left( \Pr(Y = y) \right), \]
where
\begin{align*}
\Pr(Y=0) &= \pi(0)p_{00}(1-\varepsilon) + \pi(0)p_{01}\varepsilon + \pi(1)p_{10}(1-\varepsilon) + \pi(1)p_{11}\varepsilon \\
\Pr(Y=1) &= \pi(0)p_{00}\varepsilon + \pi(0)p_{01}(1-\varepsilon) + \pi(1)p_{10}\varepsilon + \pi(1)p_{11}(1-\varepsilon).
\end{align*}
Expanding $H_1(\mathcal{Y};\theta)$ with $\theta = \frac{1}{2} - \varepsilon$ around $\theta = 0$, we find that
\[ H(\mathcal{Y}) \leq H_1(\mathcal{Y};\theta) = \ln 2 - \frac{2(p_0 - p_1)^2}{(2 - p_0 - p_1)^2}\,\theta^2 + O(\theta^4). \tag{3.6} \]
To calculate this expansion exactly for $H(\mathcal{Y})$, we apply Theorem 3.6. The conditions of the theorem are satisfied because
\[ M_\theta(0) = \begin{bmatrix} p_{00}(1-\varepsilon) & p_{01}\varepsilon \\ p_{10}(1-\varepsilon) & p_{11}\varepsilon \end{bmatrix} = \frac{1}{2}\begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix} + \theta D \qquad M_\theta(1) = \begin{bmatrix} p_{00}\varepsilon & p_{01}(1-\varepsilon) \\ p_{10}\varepsilon & p_{11}(1-\varepsilon) \end{bmatrix} = \frac{1}{2}\begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix} - \theta D, \]
where $D \triangleq \begin{bmatrix} p_{00} & -p_{01} \\ p_{10} & -p_{11} \end{bmatrix}$, implies $M_\theta(0) = M_\theta(1) = \frac{1}{2}P$ at $\theta = 0$ (i.e., $\varepsilon = \frac{1}{2}$), so that $s(y) = \frac{1}{2}$. Computing (3.5), which is simplified by the symmetry of $M_\theta(y)$ and the fact that $M''_\theta(y)$ is the zero matrix, gives
\[ \frac{d^2}{d\theta^2} H(\mathcal{Y};\theta)\bigg|_{\theta=0} = -\sum_{y\in\{0,1\}} \frac{\left( \pi^T M'(y)\mathbf{1} \right)^2}{\pi^T M(y)\mathbf{1}} = -2\,\frac{\left( \pi^T D\, \mathbf{1} \right)^2}{1/2} = -\frac{4(p_0 - p_1)^2}{(2 - p_0 - p_1)^2}. \]
(3.7)

Since $H_1(\mathcal{Y};0) = \ln 2$, this implies that the upper bound is tight with respect to the first non-zero term in the high-noise expansion.

Consider an HMP where the output distribution, conditioned on the state of the underlying Markov chain, is Gaussian. Suppose that the Gaussian associated with the transition from state $i$ to state $j$ has mean $\theta \cdot m_{ij}$ and variance 1; this implies that $h_{ij}(y) = \frac{1}{\sqrt{2\pi}} e^{-(y-\theta m_{ij})^2/2}$. Since the HMP loses state dependence as $\theta \to 0$, we first consider the derivatives w.r.t. $\theta$ of the single-letter entropy
\[ H_1(\mathcal{Y};\theta) = -\int_{-\infty}^{\infty} \pi^T M_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right) dy. \]
In this case, the stationary distribution does not depend on $\theta$, so translating Lemma 3.5 to the continuous-alphabet case gives
\begin{align*}
\frac{d}{d\theta} H_1(\mathcal{Y};\theta)\bigg|_{\theta=0} &= -\lim_{\theta\to 0} \int_{-\infty}^{\infty} \pi^T M'_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right) dy \\
&= -\lim_{\theta\to 0} \int_{-\infty}^{\infty} \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} \frac{1}{\sqrt{2\pi}} e^{-(y-\theta m_{ij})^2/2}\, m_{ij}(y-\theta m_{ij}) \, \ln\!\left( \sum_{k,l\in\mathcal{Q}} \pi(k)p_{kl} \frac{1}{\sqrt{2\pi}} e^{-(y-\theta m_{kl})^2/2} \right) dy \\
&= -\int_{-\infty}^{\infty} \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, m_{ij}\, y \, \ln\!\left( \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \right) dy \\
&= -\int_{-\infty}^{\infty} \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, m_{ij} \left[ y \ln\!\left( \tfrac{1}{\sqrt{2\pi}} \right) - \frac{y^3}{2} \right] dy = 0,
\end{align*}
since the integrand is an odd function of $y$. Similarly, the 2nd derivative is
\[ \frac{d^2}{d\theta^2} H_1(\mathcal{Y};\theta) = -\int_{-\infty}^{\infty} \pi^T M''_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right) dy - \int_{-\infty}^{\infty} \frac{\left( \pi^T M'_\theta(y)\mathbf{1} \right)^2}{\pi^T M_\theta(y)\mathbf{1}}\, dy. \]
The second term $T_2$ of the expression for $\frac{d^2}{d\theta^2} H_1(\mathcal{Y};\theta)\big|_{\theta=0}$ is given by
\begin{align*}
T_2 &= -\lim_{\theta\to 0} \int_{-\infty}^{\infty} \frac{ \left( \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} \frac{1}{\sqrt{2\pi}} e^{-(y-\theta m_{ij})^2/2}\, m_{ij}(y-\theta m_{ij}) \right)^2 }{ \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} \frac{1}{\sqrt{2\pi}} e^{-(y-\theta m_{ij})^2/2} }\, dy \\
&= -\int_{-\infty}^{\infty} \frac{ \left( \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, m_{ij}\, y \right)^2 }{ \frac{1}{\sqrt{2\pi}} e^{-y^2/2} }\, dy = -\left( \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} m_{ij} \right)^2 \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, y^2\, dy = -\left( \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} m_{ij} \right)^2.
\end{align*}
Using the fact that
\[ \pi^T M''_\theta(y)\mathbf{1} = \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} \frac{1}{\sqrt{2\pi}} e^{-(y-\theta m_{ij})^2/2}\, m_{ij}^2 \left[ (y-\theta m_{ij})^2 - 1 \right], \]
we can write the first term $T_1$ of the expression for $\frac{d^2}{d\theta^2} H_1(\mathcal{Y};\theta)\big|_{\theta=0}$ as
\begin{align*}
T_1 &= -\lim_{\theta\to 0} \int_{-\infty}^{\infty} \pi^T M''_\theta(y)\mathbf{1} \, \ln\!\left( \pi^T M_\theta(y)\mathbf{1} \right) dy \\
&= -\int_{-\infty}^{\infty} \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, m_{ij}^2 \left( y^2 - 1 \right) \ln\!\left( \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \right) dy \\
&\overset{(a)}{=} \frac{1}{2} \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} m_{ij}^2 \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \left( y^4 - y^2 \right) dy = \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} m_{ij}^2,
\end{align*}
where in $(a)$ the constant $\ln(1/\sqrt{2\pi})$ drops out because $\int \frac{1}{\sqrt{2\pi}} e^{-y^2/2}(y^2-1)\,dy = 0$, and the final equality follows from the fact that the 4th moment of a standard Gaussian is 3.

Comparing Lemma 3.5 with Theorem 3.6 shows that the first two terms in the expansion of $H(\mathcal{Y};\theta)$ match the first two terms in the expansion of $H_1(\mathcal{Y};\theta)$ at $\theta = 0$. Therefore, we have
\[ \frac{d^2}{d\theta^2} H(\mathcal{Y};\theta)\bigg|_{\theta=0} = \frac{d^2}{d\theta^2} H_1(\mathcal{Y};\theta)\bigg|_{\theta=0} = \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} m_{ij}^2 - \left( \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij} m_{ij} \right)^2. \tag{3.8} \]

Now, we will use the previous result to compute the derivative of the capacity. The mutual information $I(X;Y)$ between the r.v.s $X$ and $Y$ is defined by $I(X;Y) \triangleq H(Y) - H(Y|X)$, where the conditional entropy is defined by $H(Y|X) \triangleq H(X,Y) - H(X)$. Since the mutual information depends on the input distribution, the capacity is defined to be the supremum of the mutual information over all input distributions [11]. Therefore, some care must be taken when expressing the derivative of the capacity in terms of the derivative of the mutual information.

Consider a family of FSCs whose entropy rate is differentiable with respect to some parameter $\theta$. Let the input distribution be Markov with memory $m$ (e.g., defined by the vector $\vec{P}$ containing $|\mathcal{X}|^{m+1}$ values) and the optimal input distribution be $\vec{P}^*(\theta)$. In this case, we let the mutual information rate be $I(\theta,\vec{P})$ and the Markov-$m$ capacity be $C(\theta) = I(\theta, \vec{P}^*(\theta))$.

Lemma 4.1.
The derivative of the Markov-$m$ capacity is given by
\[ \frac{d}{d\theta} C(\theta) = \frac{d}{d\theta} I\!\left(\theta, \vec{P}^*(\theta)\right) = I'_\theta\!\left(\theta, \vec{P}^*(\theta)\right), \tag{4.1} \]
where $I'_\theta(\theta, \vec{P}^*(\theta))$ is the derivative (w.r.t. $\theta$) of the mutual information rate evaluated at the capacity-achieving input distribution for $\theta$.

Proof. Expanding the derivative of $C(\theta)$ in terms of $I'_\theta(\theta,\vec{P})$ and the gradient vector $I'_P(\theta,\vec{P})$ (w.r.t. the input distribution) gives
\[ dI\!\left(\theta,\vec{P}\right) = I'_\theta\!\left(\theta,\vec{P}\right) d\theta + I'_P\!\left(\theta,\vec{P}\right) \cdot d\vec{P}. \]
The optimality of $\vec{P}^*(\theta)$ implies $I'_P(\theta,\vec{P}^*(\theta)) \cdot d\vec{P} = 0$ for any $d\vec{P}$ satisfying $d\vec{P}\cdot\mathbf{1} = 0$ (i.e., the sum of $\vec{P}^*(\theta)$ is a constant). So, the derivative of the capacity is the derivative of the mutual information rate and we have (4.1).

Corollary 4.2.
If there is a "high noise" point $\theta^* \in D$ where the Markov-$m$ capacity satisfies $C(\theta^*) = 0$ and $C'(\theta^*) = 0$, then
\[ \frac{d^2}{d\theta^2} C(\theta)\bigg|_{\theta=\theta^*} = I''_\theta\!\left(\theta^*, \vec{P}^*(\theta^*)\right), \]
where $I''_\theta(\theta,\vec{P}^*(\theta))$ is the 2nd derivative (w.r.t. $\theta$) of the mutual information rate evaluated at the capacity-achieving input distribution for $\theta$.

Proof. First, we write the 2nd derivative as
\[ \frac{d^2}{d\theta^2} C(\theta)\bigg|_{\theta=\theta^*} = \lim_{\theta\to\theta^*} \frac{d}{d\theta} I'_\theta\!\left(\theta, \vec{P}^*(\theta)\right) = I''_\theta\!\left(\theta^*, \vec{P}^*(\theta^*)\right) + \lim_{\theta\to\theta^*} \left[ \frac{d}{d\vec{P}} I'_\theta\!\left(\theta,\vec{P}\right) \right]_{\vec{P}=\vec{P}^*(\theta^*)} \cdot\, \vec{P}^{*\prime}(\theta^*). \]
Now, recall that $I'_\theta(\theta^*, \vec{P}^*(\theta^*)) = 0$ and suppose that the 2nd term is positive. In this case, a small change in $\vec{P}$ in the direction $\vec{P}^{*\prime}(\theta^*)$ must give an $I'_\theta(\theta^*,\vec{P}) > 0$. But this contradicts the fact that $C'(\theta^*) \geq \max_{\vec{P}} I'_\theta(\theta^*,\vec{P})$. Therefore, the 2nd term must be zero.

If the domain of $\theta$ includes a "high noise" point $\theta^*$ where the channel output provides no information about the channel state, then Theorem 3.6 shows that the first two $\theta$-derivatives of the entropy rate $H(\mathcal{Y};\theta)$ can be calculated at $\theta = \theta^*$. In fact, one also sees that they match the first two $\theta$-derivatives of the single-letter entropy $H_1(\mathcal{Y};\theta)$ at $\theta = \theta^*$. Using Lemma 4.1 and Corollary 4.2, we see that these derivatives also equal the derivative of the Markov-$m$ capacity in this case. But this equality holds for all $m$, so we can take a limit to see that it must also hold for the true capacity [10]. Even without this, however, we can use the fact that $H(\mathcal{Y};\theta) \leq H_1(\mathcal{Y};\theta)$ to upper bound the maximum entropy rate over all input distributions.

4.2 FSC Example: A BSC with an RLL Constraint

Consider the FSC defined by the BSC($\varepsilon$) with a (0,1) run-length limited (RLL) constraint [24].
This is a standard binary symmetric channel with a constraint that the input cannot have two 1s in a row (e.g., this requires a two-state input process). The two-state input process is defined by $\Pr(X_{t+1} = j \mid X_t = i) = p_{ij}$ with $p_{11} = 0$, $\Pr(X_t = i) = \pi(i)$, and, writing $p \triangleq p_{00}$,
\[ \pi(0) = 1 - \pi(1) = \frac{1}{2-p}. \]
The mutual information rate between the input and output satisfies
\[ I(\mathcal{X};\mathcal{Y}) = H(\mathcal{Y}) - H(\mathcal{Y}|\mathcal{X}) \leq H_1(\mathcal{Y}) - h(\varepsilon), \]
where $h(\varepsilon) = -\varepsilon\ln\varepsilon - (1-\varepsilon)\ln(1-\varepsilon)$ is the binary entropy function in nats. Now, we can let $\theta = \frac{1}{2} - \varepsilon$ and combine the entropy-rate expansion from (3.6) with the fact that $h(\frac{1}{2}-\theta) = \ln 2 - 2\theta^2 + O(\theta^4)$. The resulting high-noise expansion for the upper bound is
\[ I(\mathcal{X};\mathcal{Y}) \leq \frac{8(1-p)}{(2-p)^2}\,\theta^2 + O(\theta^4). \]
Notice that the leading coefficient achieves a unique maximum value of 2 at $p = 0$. Since this upper bound only depends on the single-letter probabilities, it cannot be increased by extending the memory of the input process.

To see that this rate is achievable, we apply Theorem 3.6 to our system. Taking the result from (3.7), we find that
\[ I(\mathcal{X};\mathcal{Y}) = H(\mathcal{Y}) - H(\mathcal{Y}|\mathcal{X}) = \left( \ln 2 - \frac{2p^2}{(2-p)^2}\,\theta^2 + o(\theta^2) \right) - \left( \ln 2 - 2\theta^2 + O(\theta^4) \right) = \frac{8(1-p)}{(2-p)^2}\,\theta^2 + o(\theta^2). \]
So the leading term of the actual expansion matches the upper bound.

From a coding perspective, this result implies that we should choose our Shannon random codebook to be sequences with mostly alternating 01 patterns and an occasional 00 pattern (i.e., one that occurs with probability $p \to 0$). It is also worth mentioning that this constraint costs nothing when the noise is large because the slope of the expansion matches the slope of the unconstrained BSC as $p \to 0$.

Consider a family of finite-memory ISI channels parametrized by $\theta$. Let the time-$t$ output $Y_t$ be a Gaussian whose mean is given by $\theta$ times a deterministic function of the current input and the previous $k$ inputs.
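Two small checks of the ingredients used in the RLL example (all in nats): the expansion of the binary entropy around $\varepsilon = \tfrac{1}{2}$, and the algebraic identity $2 - 2p^2/(2-p)^2 = 8(1-p)/(2-p)^2$ behind the leading coefficient, which indeed equals 2 at $p = 0$.

```python
import numpy as np

# Check the binary-entropy expansion h(1/2 - theta) = ln 2 - 2 theta^2 + O(theta^4)
# (in nats) and the leading-coefficient identity 2 - 2 p^2/(2-p)^2 = 8(1-p)/(2-p)^2.

def h(x):  # binary entropy in nats
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

theta = 1e-3
assert abs(h(0.5 - theta) - (np.log(2) - 2 * theta ** 2)) < 1e-9  # O(theta^4) remainder

for p in [0.0, 0.3, 0.7]:
    assert abs((2 - 2 * p ** 2 / (2 - p) ** 2) - 8 * (1 - p) / (2 - p) ** 2) < 1e-12

# At p = 0 the coefficient is 8/4 = 2, matching the unconstrained BSC
assert 8 * (1 - 0.0) / (2 - 0.0) ** 2 == 2.0
```
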
Under these conditions, the output process is a conditionally Gaussian HMP, with state $Q_t = (X_{t-1},\ldots,X_{t-k})$, as defined in Section 3.6. Moreover, the conditional entropy rate $H(\mathcal{Y}|\mathcal{X})$ only depends on the noise variance, which can be taken to be 1 without loss of generality. Therefore, $\theta$-derivatives of the mutual information rate, $I(\mathcal{X};\mathcal{Y}) = H(\mathcal{Y}) - H(\mathcal{Y}|\mathcal{X})$, depend only on $\theta$-derivatives of the entropy rate $H(\mathcal{Y})$.

Let the mean of the output process induced by a state transition $Q_t = i$ to $Q_{t+1} = j$ be $m_{ij}$. One can explore the high-noise regime by keeping the noise variance fixed to 1 and letting $\theta \to 0$. In this case, one can combine (3.8) and Corollary 4.2 to see that
\[ C(\theta) = \frac{\theta^2}{2} \left( \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij}m_{ij}^2 - \Big( \sum_{i,j\in\mathcal{Q}} \pi(i)p_{ij}m_{ij} \Big)^2 \right) + o(\theta^2). \]
The first term in this expansion can be optimized over the input distribution $p_{ij}$, but there are a few caveats. Let $e_{ij} = \pi(i)p_{ij}$ be the edge occupancy probabilities that satisfy $\sum_{i,j\in\mathcal{Q}} e_{ij} = 1$; then stationarity of the underlying Markov chain implies that $\sum_j (e_{ij} - e_{ji}) = 0$. One also finds that not all state transitions are valid, but setting $e_{ij} = 0$ if $(i,j) \notin \mathcal{V}$ gives the following convex optimization problem with linear constraints:
\[ \text{maximize} \quad \sum_{i,j\in\mathcal{Q}} e_{ij}m_{ij}^2 - \Big( \sum_{i,j\in\mathcal{Q}} e_{ij}m_{ij} \Big)^2 \qquad \text{subject to} \quad \sum_{i,j\in\mathcal{Q}} e_{ij} = 1 \qquad \sum_{j\in\mathcal{Q}} (e_{ij} - e_{ji}) = 0 \;\; \forall i. \]
A similar result is given in [41] for linear ISI channels with balanced inputs (i.e., a zero-mean input). In this case, the $\sum e_{ij}m_{ij}$ term is zero and the optimization problem is reduced to finding the maximum mean-weight cycle in a directed graph with edge weights $m_{ij}^2$. The formula above generalizes the previous result to non-linear ISI channels and eliminates the zero-mean input requirement.

The results of this paper are closely related to an observation by Vontobel et al. [42] that the first part of the generalized Blahut-Arimoto algorithm for FSCs actually computes the derivative of the mutual information.
Their result is somewhat different because it considers derivatives with respect to the edge occupancy probabilities $\pi(i)p_{ij}$ rather than the observation probabilities. Their approach is also dissimilar because the answer is given exactly for finite blocks rather than focusing on asymptotically long blocks and the forward/backward stationary measures. Moreover, the result in this paper does not apply to changes in the HMP which change the stationary distribution $\pi$ of the underlying Markov chain, while the derivative result in [42] focuses exclusively on changes in the edge occupancy probabilities.

Ideally, one would have a unified treatment of the derivative, with respect to changes in both the edge occupancy probabilities $\pi(i)p_{ij}$ and the observation probabilities, of the entropy rate of a FSC. Indeed, a simple formula, in terms of forward/backward stationary measures, can be cobbled together by translating the derivative formula in [42] to stationary measures and combining this with Theorem 3.2. To clarify the connection, their result is shown first in terms of conditional density functions for $\alpha$ and $\beta$. Paraphrasing their result in terms of the derivative of the edge occupancy probabilities, $\Delta_{ij} = \frac{d}{d\theta}\pi(i)p_{ij}\big|_{\theta=0}$, gives
\[ \frac{d}{d\theta} H(\mathcal{X}|\mathcal{Y};\theta)\bigg|_{\theta=0} = -\sum_{i,j\in\mathcal{Q}} \Delta_{ij} \int_{\mathcal{A}\times\mathcal{A}} f_{\alpha|Q_t}(\alpha|i)\, f_{\beta|Q_{t+1}}(\beta|j) \sum_{y\in\mathcal{Y}} h_{ij}(y) \ln \frac{ \alpha(i)\, M_{ij}(y)\, \beta(j) }{ \sum_{k\in\mathcal{Q}} \alpha(i)\, M_{ik}(y)\, \beta(k) } \, d\alpha\, d\beta. \]
One can decompose this formula to see that the term $\Delta_{ij}$ gives the change in the edge occupancy probability, the term $f_{\alpha|Q_t}(\alpha|i)\, f_{\beta|Q_{t+1}}(\beta|j)\, f_{Y|Q_t Q_{t+1}}(y|i,j)$ is the probability of $\alpha, \beta, y$ given the transition, and the logarithmic term gives the contribution to $H(Q_{t+1} = j \mid Q_t = i, Y_{-\infty}^{\infty})$ for this $\alpha, \beta, y$.

Next, we modify this expression to use unconditional $\alpha, \beta$ distributions with
\begin{align*}
\frac{d}{d\theta} H(\mathcal{X}|\mathcal{Y};\theta)\bigg|_{\theta=0} &\overset{(a)}{=} -\sum_{i,j\in\mathcal{Q}} \Delta_{ij} \int_{\mathcal{A}\times\mathcal{A}} \frac{\mu_i(d\alpha)}{\pi(i)} \cdot \frac{\nu_j(d\beta)}{\pi(j)} \sum_{y\in\mathcal{Y}} h_{ij}(y) \ln \frac{\alpha(i)\,M_{ij}(y)\,\beta(j)}{\sum_{k\in\mathcal{Q}} \alpha(i)\,M_{ik}(y)\,\beta(k)} \\
&\overset{(b)}{=} -\sum_{i,j\in\mathcal{Q}} \Delta_{ij} \int_{\mathcal{A}\times\mathcal{A}} \frac{\mu(d\alpha)\,\alpha(i)}{\pi(i)} \cdot \frac{\nu(d\beta)\,\beta(j)\,\pi(j)}{\pi(j)} \sum_{y\in\mathcal{Y}} h_{ij}(y) \ln \frac{\alpha(i)\,M_{ij}(y)\,\beta(j)}{\sum_{k\in\mathcal{Q}} \alpha(i)\,M_{ik}(y)\,\beta(k)} \\
&\overset{(c)}{=} -\sum_{i,j\in\mathcal{Q}} \Delta_{ij} \int_{\mathcal{A}\times\mathcal{A}} \mu(d\alpha)\,\nu(d\beta) \sum_{y\in\mathcal{Y}} \frac{\alpha(i)\,M_{ij}(y)\,\beta(j)}{\pi(i)p_{ij}} \ln \frac{\alpha(i)\,M_{ij}(y)\,\beta(j)}{\sum_{k\in\mathcal{Q}} \alpha(i)\,M_{ik}(y)\,\beta(k)},
\end{align*}
where $(a)$ holds because $\mu_i(d\alpha)/\pi(i)$ is the conditional density of $\alpha$ given that the true state is $i$ and $\nu_j(d\beta)/\pi(j)$ is the conditional density of $\beta$ given that the true state is $j$, $(b)$ follows from Lemmas 2.21 and 2.25, and $(c)$ follows from $M_{ij}(y) = p_{ij}h_{ij}(y)$. Finally, using $H(\mathcal{Y};\theta) = H(\mathcal{X};\theta) - H(\mathcal{X}|\mathcal{Y};\theta) + H(\mathcal{Y}|\mathcal{X};\theta)$ and
\[ \frac{d}{d\theta} H(\mathcal{X};\theta)\bigg|_{\theta=0} = -\sum_{i,j\in\mathcal{Q}} \Delta_{ij} \ln p_{ij} \qquad \frac{d}{d\theta} H(\mathcal{Y}|\mathcal{X};\theta)\bigg|_{\theta=0} = -\sum_{i,j\in\mathcal{Q}} \Delta_{ij} \sum_{y\in\mathcal{Y}} h_{ij}(y) \ln h_{ij}(y), \]
we find that $\frac{d}{d\theta} H(\mathcal{Y};\theta)\big|_{\theta=0}$ is given by
\[ -\sum_{i,j\in\mathcal{Q}} \Delta_{ij} \int_{\mathcal{A}\times\mathcal{A}} \mu(d\alpha)\,\nu(d\beta) \sum_{y\in\mathcal{Y}} \left[ h_{ij}(y) \ln M_{ij}(y) + \frac{\alpha(i)\,M_{ij}(y)\,\beta(j)}{\pi(i)p_{ij}} \ln \frac{\alpha(i)\,M_{ij}(y)\,\beta(j)}{\sum_{k\in\mathcal{Q}} \alpha(i)\,M_{ik}(y)\,\beta(k)} \right]. \]
It is straightforward to combine this with Theorem 3.2, though the final expression is even more unwieldy.

(Footnote to the optimization problem in the ISI example: the objective function is actually concave, but one can negate the objective and minimize instead.)
This paper considers the derivative of the entropy rate for general hidden Markov processes and derives a closed-form expression for this derivative in the high-noise limit. An application is presented relating to the achievable information rates of finite-state channels. Again, a closed-form expression is derived for the high-noise limit. Two examples of interest are considered. First, transmission over a BSC under a (0,1) RLL constraint is treated and the capacity-achieving input distribution is derived in the high-noise limit. Second, an intersymbol-interference channel in AWGN is considered and the capacity is derived in the high-noise limit.
Acknowledgement.
The author would like to thank an anonymous reviewer for catching a number oferrors and inconsistencies in the paper. He is also grateful to Pascal Vontobel for his excellent commentson an earlier draft. This work also benefited from interesting discussions with Brian Marcus and is anatural extension of past work with Paul H. Siegel and Joseph B. Soriaga.
A Technical Details
A.1 Lemmas for Theorem 3.2
Lemma A.1.
Consider the function $F(\alpha,\beta) = -\alpha^T M' \beta \, \ln\!\left( \alpha^T M \beta \right)$, where $M$ is a non-negative matrix and $M'$ is a real matrix. This function is Lipschitz continuous w.r.t. $\|\cdot\|_\infty$ on $(\alpha,\beta) \in \mathcal{P}_\delta \times \mathcal{B}_\delta$, where $\mathcal{B}_\delta = \{ u \in \mathcal{A}_\delta \mid \pi^T u = 1 \}$, $\eta = \min_i \pi(i) > 0$, and $\delta > 0$. This implies that
\[ |F(\alpha,\beta) - F(\alpha',\beta)| \leq L_\alpha \|\alpha - \alpha'\|_\infty \qquad |F(\alpha,\beta) - F(\alpha,\beta')| \leq L_\beta \|\beta - \beta'\|_\infty \]
\[ |F(\alpha,\beta) - F(\alpha',\beta')| \leq L_\alpha \|\alpha - \alpha'\|_\infty + L_\beta \|\beta - \beta'\|_\infty, \]
where $c = \delta^2 \sum_{i,j} M_{ij}$ and
\[ L_\alpha = \frac{\|M'\|_1}{\eta} \ln\frac{1}{c} + \frac{\|M'\|_1 \|M\|_1}{\eta^2\, c} \qquad L_\beta = \|M'\|_\infty \ln\frac{1}{c} + \frac{\|M'\|_\infty \|M\|_\infty}{c}. \]

Proof.
Let $G : \mathbb{R}^m \to \mathbb{R}$ be any function that is differentiable on a convex set $D \subseteq \mathbb{R}^m$. Then the mean value theorem of vector calculus implies that
\[ G(y) - G(x) = G'(x + t(y-x))^T (y - x) \]
for some $t \in [0,1]$. Applying Hölder's inequality allows one to upper bound the Lipschitz constant w.r.t. $\|\cdot\|_\infty$ and gives the upper bound
\[ G(y) - G(x) \leq \sup_{t\in[0,1]} \left\| G'(x + t(y-x)) \right\|_1 \|x - y\|_\infty \leq \sup_{z\in D} \|G'(z)\|_1 \|x - y\|_\infty. \]
Since $F(\alpha,\beta)$ is differentiable w.r.t. $\alpha$, we can bound the Lipschitz constant $L_\alpha$ with
\begin{align*}
L_\alpha &= \sup_{\alpha\in\mathcal{P}_\delta} \sup_{\beta\in\mathcal{B}_\delta} \sup_{\|u\|_\infty \leq 1} \left| u^T M' \beta \, \ln\frac{1}{\alpha^T M \beta} - \alpha^T M' \beta\, \frac{u^T M \beta}{\alpha^T M \beta} \right| \\
&\overset{(a)}{\leq} \sup_{\alpha\in\mathcal{P}_\delta} \sup_{\beta\in\mathcal{B}_\delta} \sup_{\|u\|_\infty \leq 1} \left[ \left| u^T M' \beta \right| \ln\frac{1}{c} + \left| \alpha^T M' \beta \right| \frac{\left| u^T M \beta \right|}{c} \right] \\
&\overset{(b)}{\leq} \|M'\|_1 \|\beta\|_\infty \ln\frac{1}{c} + \|M'\|_1 \|\beta\|_\infty \frac{\|M\|_1 \|\beta\|_\infty}{c} \overset{(c)}{\leq} \frac{\|M'\|_1}{\eta} \ln\frac{1}{c} + \frac{\|M'\|_1 \|M\|_1}{\eta^2\, c}, \tag{A.1}
\end{align*}
where $(a)$ follows from $\alpha^T M \beta \geq c$ with $c = \delta^2 \sum_{i,j} M_{ij}$, $(b)$ follows from $|x^T M y| \leq \|x\|_\infty \|M\|_1 \|y\|_\infty$, and $(c)$ follows from $\|\beta\|_\infty \leq \eta^{-1}$, which holds because $\pi^T\beta = 1$.

Likewise, $F(\alpha,\beta)$ is differentiable w.r.t. $\beta$ and we can bound the Lipschitz constant $L_\beta$ with
\begin{align*}
L_\beta &= \sup_{\alpha\in\mathcal{P}_\delta} \sup_{\beta\in\mathcal{B}_\delta} \sup_{\|u\|_\infty \leq 1} \left| \alpha^T M' u \, \ln\frac{1}{\alpha^T M \beta} - \alpha^T M' \beta\, \frac{\alpha^T M u}{\alpha^T M \beta} \right| \\
&\overset{(a)}{\leq} \sup_{\alpha\in\mathcal{P}_\delta} \sup_{\beta\in\mathcal{B}_\delta} \sup_{\|u\|_\infty \leq 1} \left[ \left| \alpha^T M' u \right| \ln\frac{1}{c} + \left| \alpha^T M' \beta \right| \frac{\left| \alpha^T M u \right|}{c} \right] \overset{(b)}{\leq} \|M'\|_\infty \ln\frac{1}{c} + \frac{\|M'\|_\infty \|M\|_\infty}{c}, \tag{A.2}
\end{align*}
where $(a)$ is the same as above and $(b)$ follows from $|x^T M y| \leq \|x\|_1 \|M\|_\infty \|y\|_\infty$.

Lemma A.2.
If the HMP is $\epsilon$-primitive for $\epsilon > 0$, then for some $\gamma < 1$ and $C < \infty$ we have
\[ \sum_{y\in\mathcal{Y}} \Big| \mathbb{E}\big[ \alpha^T M'(y)\beta \ln\!\left( \alpha^T M(y)\beta \right) - \alpha_j^T M'(y)\beta_{j+1} \ln\!\left( \alpha_j^T M(y)\beta_{j+1} \right) \big] \Big| \leq 2 L_\alpha C \gamma^{j-1} + 2 L_\beta C \gamma^{n-j+1}, \]
where $c(y) = \delta^2 \sum_{i,j} [M(y)]_{ij}$ and
\[ L_\alpha = \sum_{y\in\mathcal{Y}} \left[ \frac{\|M'(y)\|_1}{\eta} \ln\frac{1}{c(y)} + \frac{\|M'(y)\|_1 \|M(y)\|_1}{\eta^2\, c(y)} \right] \qquad L_\beta = \sum_{y\in\mathcal{Y}} \left[ \|M'(y)\|_\infty \ln\frac{1}{c(y)} + \frac{\|M'(y)\|_\infty \|M(y)\|_\infty}{c(y)} \right]. \]
The expectation assumes that $\alpha, \beta$ are drawn from their respective stationary distributions while $\alpha_j, \beta_{j+1}$ are drawn from the distributions implied by an arbitrary initialization of $\alpha_1, \beta_{n+1}$.

Proof. Since the HMP is $\epsilon$-primitive for $\epsilon > 0$, there is a $\delta > 0$ such that $\min_i \alpha_i > \delta$ and $\min_i \beta_i > \delta$ on the entire support of $\alpha, \beta$. It also follows that $\eta = \min_i \pi(i) > 0$. Now, consider the function $F_y(\alpha,\beta) = -\alpha^T M'(y)\beta \ln\!\left( \alpha^T M(y)\beta \right)$. Under these conditions, Lemma A.1 shows that this function is Lipschitz continuous w.r.t. $\|\cdot\|_\infty$ on the support of $\alpha, \beta$ with Lipschitz constants $L_\alpha(y)$ and $L_\beta(y)$ defined by generalizing (A.1) and (A.2). Therefore, we can write
\begin{align*}
\sum_{y\in\mathcal{Y}} \Big| \mathbb{E}_{\alpha,\beta}\big[ F_y(\alpha,\beta) - F_y(\alpha_j,\beta_{j+1}) \big] \Big| &\leq \sum_{y\in\mathcal{Y}} \mathbb{E}_{\alpha,\beta}\big[ L_\alpha(y)\|\alpha - \alpha_j\|_\infty + L_\beta(y)\|\beta - \beta_{j+1}\|_\infty \big] \\
&\overset{(a)}{\leq} \sum_{y\in\mathcal{Y}} \mathbb{E}_{\alpha,\beta}\big[ L_\alpha(y)\,2\,d(\alpha,\alpha_j) + L_\beta(y)\,2\,d(\beta,\beta_{j+1}) \big] \overset{(b)}{\leq} \sum_{y\in\mathcal{Y}} L_\alpha(y)\,2C\gamma^{j-1} + \sum_{y\in\mathcal{Y}} L_\beta(y)\,2C\gamma^{n-j+1},
\end{align*}
where $(a)$ follows from Lemma 2.4 and $(b)$ follows from Lemma 2.12 because the HMP is $\epsilon$-primitive.

A.2 Proof of Lemma 3.4
Proof.
The first two results follow from Lemmas 2.21 and 2.25. Substituting and integrating gives
\[ \int_{\mathcal{A}} \mu(d\alpha)\,\alpha(q) = \int_{\mathcal{A}} \underbrace{\mu_q(d\alpha)}_{\Pr(Q=q,\,\alpha\in d\alpha)} = \Pr(Q = q) \qquad \text{and} \qquad \int_{\mathcal{A}} \nu(d\beta)\,\beta(q) = \int_{\mathcal{A}} \frac{1}{\pi(q)} \underbrace{\nu_q(d\beta)}_{\Pr(Q=q,\,\beta\in d\beta)} = 1. \]
Using the fact that $\sum_{y\in\mathcal{Y}} M(y) = P$, we can evaluate the third and fourth results with
\[ \int_{\mathcal{A}} \mu(d\alpha)\,\alpha^T \sum_{y\in\mathcal{Y}} M(y)\,\beta = \pi^T P \beta = \pi^T \beta = 1 \qquad \text{and} \qquad \alpha^T \sum_{y\in\mathcal{Y}} M(y) \int_{\mathcal{A}} \nu(d\beta)\,\beta = \alpha^T P \mathbf{1} = \alpha^T \mathbf{1} = 1. \]
Finally, the fifth result follows from
\[ \frac{d}{d\theta} \int_{\mathcal{A}} \mu(d\alpha)\,\alpha^T \sum_{y\in\mathcal{Y}} M_\theta(y) \int_{\mathcal{A}} \nu(d\beta)\,\beta = \frac{d}{d\theta}\, \pi^T P_\theta\, \mathbf{1} = \frac{d}{d\theta}\, 1 = 0. \]
Proof.
First, we point out that $\lim_{\theta\to\theta^*} M_\theta(y) = s(y)P$ implies that output symbols provide no state information at $\theta = \theta^*$, so that $H(\mathcal{Y};\theta^*) = H(Y_1;\theta^*)$. This also implies that, at $\theta = \theta^*$, the forward and backward Blackwell measures are Dirac measures, $\mu(A) = \mathbb{1}_A(\pi)$ and $\nu(B) = \mathbb{1}_B(\mathbf{1})$, concentrated on $\pi, \mathbf{1}$. By Theorem 3.2, the derivative of the entropy rate is uniformly continuous on $\mathcal{D}$ and we have
\begin{align*}
\lim_{\theta\to\theta^*} \frac{\mathrm{d}}{\mathrm{d}\theta} H(\mathcal{Y};\theta)
&= -\lim_{\theta\to\theta^*} \mathbb{E}_{\alpha,\beta} \sum_{y\in\mathcal{Y}} \alpha^T M_\theta'(y)\beta \, \ln\big(\alpha^T M_\theta(y)\beta\big) \\
&= -\sum_{y\in\mathcal{Y}} \pi^T M'(y)\mathbf{1}\, \ln\big(s(y)\big) - \pi^T \sum_{y\in\mathcal{Y}} M'(y)\mathbf{1}\, \ln\big(\pi^T P \mathbf{1}\big) \\
&\overset{(a)}{=} -\sum_{y\in\mathcal{Y}} \pi^T M'(y)\mathbf{1}\, \ln\big(s(y)\big),
\end{align*}
where $(a)$ holds because $\pi^T P \mathbf{1} = 1$.

For the 2nd derivative, we apply the derivative shortcut a second time by noting that
\[
g_n''(\theta) = \sum_{i=1}^n \sum_{j=1}^n \frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j}\, g_n(\theta_1,\ldots,\theta_n).
\]
Applying this to $g_n(\theta_1,\ldots,\theta_n)$ for the entropy rate gives
\begin{align*}
g_n''(\theta^*)
&= -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j} \sum_{y_1^n\in\mathcal{Y}^n} \pi^T \Big(\prod_{t=1}^n M_{\theta_t}(y_t)\Big)\mathbf{1} \cdot \log\Big[\pi^T \Big(\prod_{t=1}^n M_{\theta_t}(y_t)\Big)\mathbf{1}\Big] \Bigg|_{(\theta_1,\ldots,\theta_n)=(\theta^*,\ldots,\theta^*)} \\
&\overset{(a)}{=} -\frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\theta_i} \sum_{j=1}^n \sum_{y_1^n\in\mathcal{Y}^n} \pi^T M(y_1^{j-1})\, M_{\theta_j}'(y_j)\, M(y_{j+1}^n)\mathbf{1} \cdot \log\Big[\pi^T \Big(\prod_{t=1}^n M_{\theta_t}(y_t)\Big)\mathbf{1}\Big] \Bigg|_{(\theta_1,\ldots,\theta_n)=(\theta^*,\ldots,\theta^*)} \\
&\qquad - \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\theta_i} \sum_{j=1}^n \frac{\partial}{\partial\theta_j} \sum_{y_1^n\in\mathcal{Y}^n} \pi^T \Big(\prod_{t=1}^n M_{\theta_t}(y_t)\Big)\mathbf{1} \Bigg|_{(\theta_1,\ldots,\theta_n)=(\theta^*,\ldots,\theta^*)} \tag{A} \\
&= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n\in\mathcal{Y}^n} \pi^T M(y_1^{j-1})\, M''(y_j)\, M(y_{j+1}^n)\mathbf{1} \cdot \log\Big[\pi^T \Big(\prod_{t=1}^n M(y_t)\Big)\mathbf{1}\Big] \tag{T1} \\
&\qquad - \frac{1}{n} \sum_{j=1}^n \sum_{y_1^n\in\mathcal{Y}^n} \frac{\big(\pi^T M(y_1^{j-1})\, M'(y_j)\, M(y_{j+1}^n)\mathbf{1}\big)^2}{\pi^T \big(\prod_{t=1}^n M(y_t)\big)\mathbf{1}} \tag{T2} \\
&\qquad - \frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n\in\mathcal{Y}^n} \pi^T M(y_1^{i-1})\, M'(y_i)\, M(y_{i+1}^{j-1})\, M'(y_j)\, M(y_{j+1}^n)\mathbf{1} \cdot \log\Big[\pi^T \Big(\prod_{t=1}^n M(y_t)\Big)\mathbf{1}\Big] \tag{T3} \\
&\qquad - \frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n\in\mathcal{Y}^n} \frac{\big(\pi^T M(y_1^{i-1})\, M'(y_i)\, M(y_{i+1}^n)\mathbf{1}\big)\big(\pi^T M(y_1^{j-1})\, M'(y_j)\, M(y_{j+1}^n)\mathbf{1}\big)}{\pi^T \big(\prod_{t=1}^n M(y_t)\big)\mathbf{1}}, \tag{T4}
\end{align*}
where the term labeled (A) is zero because it equals $-\frac{1}{n}\frac{\mathrm{d}^2}{\mathrm{d}\theta^2} 1$. Using the term labels in the equation (i.e., T1, T2, ...), we see that $g_n''(\theta^*) = T_1 + T_2 + T_3 + T_4$, where the terms $T_1, T_2$ are associated with $i = j$, and the terms $T_3, T_4$ are associated with $i \neq j$. Using this decomposition, we can reduce each term separately.

For the first term, $M(y) = s(y)P$ implies that
\begin{align*}
T_1 &= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n\in\mathcal{Y}^n} \pi^T M(y_1^{j-1})\, M''(y_j)\, M(y_{j+1}^n)\mathbf{1} \cdot \log\Big[\pi^T \Big(\prod_{t=1}^n M(y_t)\Big)\mathbf{1}\Big] \\
&= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n\in\mathcal{Y}^n} \frac{s(y_1^n)}{s(y_j)}\, \pi^T M''(y_j)\mathbf{1} \cdot \log\big(s(y_1^n)\big) \\
&= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n\in\mathcal{Y}^n} \left[ \frac{s(y_1^n)}{s(y_j)}\, \pi^T M''(y_j)\mathbf{1}\, \log\big(s(y_j)\big) + \frac{s(y_1^n)}{s(y_j)}\, \pi^T M''(y_j)\mathbf{1} \sum_{k=1,\, k\neq j}^n \log\big(s(y_k)\big) \right] \\
&\overset{(a)}{=} -\frac{1}{n} \sum_{j=1}^n \sum_{y_j\in\mathcal{Y}} \pi^T M''(y_j)\mathbf{1}\, \log\big(s(y_j)\big) + 0 = -\sum_{y\in\mathcal{Y}} \pi^T M''(y)\mathbf{1}\, \log\big(s(y)\big),
\end{align*}
where $(a)$ follows from the fact that
\[
\sum_{y_j\in\mathcal{Y}} \frac{s(y_1^n)}{s(y_j)}\, \pi^T M''(y_j)\mathbf{1} \sum_{k=1,\, k\neq j}^n \log\big(s(y_k)\big) = \prod_{i=1,\, i\neq j}^n s(y_i) \sum_{k=1,\, k\neq j}^n \log\big(s(y_k)\big) \sum_{y_j\in\mathcal{Y}} \pi^T M''(y_j)\mathbf{1} = 0.
\]

For the second term, $M(y) = s(y)P$ implies that
\begin{align*}
T_2 &= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n\in\mathcal{Y}^n} \frac{\big(\pi^T M(y_1^{j-1})\, M'(y_j)\, M(y_{j+1}^n)\mathbf{1}\big)^2}{\pi^T \big(\prod_{t=1}^n M(y_t)\big)\mathbf{1}}
= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n\in\mathcal{Y}^n} \frac{s(y_1^n)}{s(y_j)^2}\, \big(\pi^T M'(y_j)\mathbf{1}\big)^2 \\
&= -\frac{1}{n} \sum_{j=1}^n \sum_{y_j\in\mathcal{Y}} \frac{\big(\pi^T M'(y_j)\mathbf{1}\big)^2}{s(y_j)}
= -\sum_{y\in\mathcal{Y}} \frac{\big(\pi^T M'(y)\mathbf{1}\big)^2}{s(y)}
= -\sum_{y\in\mathcal{Y}} \frac{\big(\pi^T M'(y)\mathbf{1}\big)^2}{\pi^T M(y)\mathbf{1}}.
\end{align*}

For the third term, we notice first that $\sum_{y\in\mathcal{Y}} M'(y) = 0$ implies
\[
\sum_{y_i, y_j, y_k \in\mathcal{Y}} \pi^T M'(y_i)\, P^{j-i-1}\, M'(y_j)\mathbf{1} \cdot \log\big(s(y_k)\big) = 0
\]
if either $i \neq k$ or $j \neq k$. This gives
\begin{align*}
T_3 &= -\frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n\in\mathcal{Y}^n} \pi^T M(y_1^{i-1})\, M'(y_i)\, M(y_{i+1}^{j-1})\, M'(y_j)\, M(y_{j+1}^n)\mathbf{1} \cdot \log\Big[\pi^T \Big(\prod_{t=1}^n M(y_t)\Big)\mathbf{1}\Big]
\end{align*}
\begin{align*}
&= -\frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n\in\mathcal{Y}^n} \frac{s(y_1^n)}{s(y_i)\, s(y_j)}\, \pi^T M'(y_i)\, P^{j-i-1}\, M'(y_j)\mathbf{1} \cdot \log\big(s(y_1^n)\big) \\
&= -\frac{2}{n} \sum_{k=1}^n \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_i, y_j, y_k\in\mathcal{Y}} \pi^T M'(y_i)\, P^{j-i-1}\, M'(y_j)\mathbf{1} \cdot \log\big(s(y_k)\big) = 0
\end{align*}
because $i < j$, so $i \neq k$ or $j \neq k$ always holds.

For the fourth term, we have
\begin{align*}
T_4 &= -\frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n\in\mathcal{Y}^n} \frac{\big(\pi^T M(y_1^{i-1})\, M'(y_i)\, M(y_{i+1}^n)\mathbf{1}\big)\big(\pi^T M(y_1^{j-1})\, M'(y_j)\, M(y_{j+1}^n)\mathbf{1}\big)}{\pi^T \big(\prod_{t=1}^n M(y_t)\big)\mathbf{1}} \\
&= -\frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n\in\mathcal{Y}^n} \frac{s(y_1^n)}{s(y_i)\, s(y_j)}\, \big(\pi^T M'(y_i)\mathbf{1}\big)\big(\pi^T M'(y_j)\mathbf{1}\big) \\
&= -\frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_i, y_j\in\mathcal{Y}} \big(\pi^T M'(y_i)\mathbf{1}\big)\big(\pi^T M'(y_j)\mathbf{1}\big) = 0
\end{align*}
because $\sum_{y\in\mathcal{Y}} M'(y) = 0$.

References

[1] D. Arnold and H. Loeliger. On the information rate of binary-input channels with memory. In
Proc. IEEE Int. Conf. Commun., pages 2692–2695, Helsinki, Finland, June 2001.
[2] D. Arnold, H. A. Loeliger, P. O. Vontobel, A. Kavčić, and W. Zeng. Simulation-based computation of information rates for channels with memory. IEEE Trans. Inform. Theory, 52(8):3498–3508, Aug. 2006.
[3] D. M. Arnold. Computing information rates of finite-state models with application to magnetic recording. PhD thesis, Swiss Federal Institute of Technology, Zurich, 2003.
[4] R. G. Bartle and D. R. Sherbert. Introduction to Real Analysis. Wiley, 3rd edition, 1999.
[5] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stats., 37:1554–1563, Dec. 1966.
[6] P. Billingsley. Convergence of Probability Measures. Wiley, 2nd edition, 1999.
[7] J. Birch. Approximations for the entropy for functions of Markov chains. Ann. Math. Stats., 33(3):930–938, Sept. 1962.
[8] D. Blackwell. Entropy of functions of finite-state Markov chains. Trans. First Prague Conf. on Inform. Theory, Stat. Dec. Fun., Rand. Processes, pages 13–20, 1957.
[9] D. Blackwell, L. Breiman, and A. J. Thomasian. Proof of Shannon's transmission theorem for finite-state indecomposable channels. Ann. Math. Stats., 29:1209–1220, Dec. 1958.
[10] J. Chen and P. H. Siegel. Markov processes asymptotically achieve the capacity of finite-state intersymbol interference channels. IEEE Trans. Inform. Theory, 54(3):1295–1303, 2008.
[11] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[12] E. Domany, I. Kanter, O. Zuk, and M. Aizenman. From finite-system entropy to entropy rate for a hidden Markov process. IEEE Signal Processing Letters, 13(9):517–520, Sept. 2006.
[13] Y. Ephraim and N. Merhav. Hidden Markov processes. IEEE Trans. Inform. Theory, 48(6):1518–1569, June 2002.
[14] H. Furstenberg and H. Kesten. Products of random matrices. Ann. Math. Stats., 31:457–469, June 1960.
[15] H. Furstenberg and Y. Kifer. Random matrix products and measures on projective spaces. Israel Journal of Mathematics, 46(1):12–32, 1983.
[16] R. G. Gallager. Information Theory and Reliable Communication. Wiley, New York, NY, USA, 1968.
[17] A. J. Goldsmith and P. P. Varaiya. Capacity, mutual information, and coding for finite-state Markov channels. IEEE Trans. Inform. Theory, 42(3):868–886, May 1996.
[18] G. Han and B. Marcus. Analyticity of entropy rate of hidden Markov chains. IEEE Trans. Inform. Theory, 52(12):5251–5266, Dec. 2006.
[19] G. Han and B. Marcus. Asymptotics of noisy constrained channel capacity. In Proc. IEEE Int. Symp. Information Theory, pages 991–995, Nice, France, June 2007.
[20] G. Han and B. Marcus. Derivatives of entropy rate in special families of hidden Markov chains. IEEE Trans. Inform. Theory, 53(7):2642–2652, 2007.
[21] T. E. Harris. On chains of infinite order. Pacific J. Math., 5(1):707–724, 1955.
[22] T. Holliday, A. Goldsmith, and P. Glynn. Capacity of finite state channels based on Lyapunov exponents of random matrices. IEEE Trans. Inform. Theory, 52(8):3509–3532, Aug. 2006.
[23] F. Jelinek. Continuous speech recognition by statistical methods. Proc. of the IEEE, 64(4):532–556, 1976.
[24] A. Kavčić. On the capacity of Markov sources over noisy channels. In Proc. IEEE Global Telecom. Conf., pages 2997–3001, San Antonio, Texas, USA, Nov. 2001.
[25] A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. J. Molecular Biology, 235(5):1501–1531, 1994.
[26] F. Le Gland and L. Mevel. Basic properties of the projective product with application to products of column-allowable nonnegative matrices. Math. Control Signals Systems, 13(1):41–62, July 2000.
[27] F. Le Gland and L. Mevel. Exponential forgetting and geometric ergodicity in hidden Markov models. Math. Control Signals Systems, 13(1):63–93, July 2000.
[28] C. Méasson, A. Montanari, T. J. Richardson, and R. L. Urbanke. Life above threshold: From list decoding to area theorem and MSE. arXiv preprint cs.IT/0410028, 2004.
[29] C. Méasson, A. Montanari, and R. L. Urbanke. Maxwell construction: The hidden bridge between iterative and maximum a posteriori decoding. IEEE Trans. Inform. Theory, 54(12):5277–5307, Dec. 2008.
[30] M. Mushkin and I. Bar-David. Capacity and coding for Gilbert-Elliot channels. IEEE Trans. Inform. Theory, 35(6):1277–1290, Nov. 1989.
[31] E. Ordentlich and T. Weissman. New bounds on the entropy rate of hidden Markov processes. In Proc. IEEE Inform. Theory Workshop, pages 117–122, San Antonio, TX, Oct. 2004.
[32] E. Ordentlich and T. Weissman. Approximations for the entropy rate of a hidden Markov process. In Proc. IEEE Int. Symp. Information Theory, pages 2198–2202, Adelaide, Australia, Sept. 2005.
[33] E. Ordentlich and T. Weissman. On the optimality of symbol-by-symbol filtering and denoising. IEEE Trans. Inform. Theory, 52(1):19–40, Jan. 2006.
[34] V. I. Oseledec. A multiplicative ergodic theorem. Lyapunov characteristic numbers for dynamical systems. Trans. Moscow Math. Soc., pages 197–231, 1968.
[35] Y. Peres. Domains of analytic continuation for the top Lyapunov exponent. Ann. Inst. H. Poincaré Probab. Statist., 28(1):131–148, 1992.
[36] T. Petrie. Probabilistic functions of finite state Markov chains. Ann. Math. Stats., 40(1):97–115, Feb. 1969.
[37] H. D. Pfister. On the Capacity of Finite State Channels and the Analysis of Convolutional Accumulate-m Codes. PhD thesis, University of California, San Diego, La Jolla, CA, USA, March 2003.
[38] H. D. Pfister, J. B. Soriaga, and P. H. Siegel. On the achievable information rates of finite state ISI channels. In Proc. IEEE Global Telecom. Conf., pages 2992–2996, San Antonio, Texas, USA, Nov. 2001.
[39] D. Ruelle. Analyticity properties of the characteristic exponents of random matrix products. Adv. Math., 32:68–80, 1979.
[40] E. Seneta. Non-Negative Matrices: An Introduction to Theory and Applications. Wiley, New York, NY, USA, 2nd edition, 1981.
[41] J. B. Soriaga, H. D. Pfister, and P. H. Siegel. On the low-rate Shannon limit for binary intersymbol interference channels. IEEE Trans. Commun., 51(12):1962–1964, Dec. 2003.
[42] P. O. Vontobel, A. Kavčić, D. M. Arnold, and H. A. Loeliger. A generalization of the Blahut–Arimoto algorithm to finite-state channels. IEEE Trans. Inform. Theory, 54(5):1887–1918, 2008.
[43] A. Ziv. Relative distance – an error measure in round-off error analysis. Math. Comp., 39(160):563–569, Oct. 1982.
[44] O. Zuk, I. Kanter, and E. Domany. The entropy of a binary hidden Markov process.