A Tamper-Free Semi-Universal Communication System for Deletion Channels
Shahab Asoodeh, Yi Huang, and Ishanu Chattopadhyay

Abstract
We investigate the problem of reliable communication between two legitimate parties over deletion channels under an active eavesdropping (aka jamming) adversarial model. To this goal, we develop a theoretical framework based on probabilistic finite-state automata to define novel encoding and decoding schemes that ensure small error probability in both message decoding and tamper detection. We then experimentally verify the reliability and tamper-detection property of our scheme.
I. INTRODUCTION
The deletion channel is the simplest point-to-point communication channel that models synchronization errors. In its simplest form, each input is either deleted independently with probability δ or transmitted over the channel noiselessly. As a result, the length of the channel output is a random variable depending on δ. Surprisingly, the capacity of the deletion channel has been one of the outstanding open problems in information theory [1]. A random coding argument proving a Shannon-like capacity result for the deletion channel (in general, for all channels with synchronization errors) was given by Dobrushin [2], which was recently improved by Kirsch and Drinea [3] to derive several lower bounds. Readers interested in the most recent results on deletion channels are referred to the survey by Mitzenmacher [1], which provides a useful history and known results on deletion channels.

As the problem of computing the capacity of deletion channels is infamously hard, we focus on another problem in deletion channels. In this paper, we study the behavior of the deletion channel under an active eavesdropper attack. Secrecy models in the information theory literature, initiated by Yamamoto [5], assume that there exists a passive eavesdropper who can observe the symbols being transmitted over the channel. The objective is to design a pair of (randomized) encoder and decoder such that the message is decoded with asymptotically vanishing error probability at the legitimate receiver while ensuring that the eavesdropper gains negligible information about the message. In all secrecy models (see, e.g., [6]-[12]), the crucial assumption is that the eavesdropper can neither jam the communication channel between the legitimate parties nor modify any messages exchanged between them.
However, in many practical scenarios, the eavesdropper can potentially change the channel, for instance, add stronger noise to change the crossover probability of a binary symmetric channel or the deletion probability of a deletion channel.

In our adversarial model, we assume that two parties (say Alice and Bob) wish to communicate over a public deletion channel while an eavesdropper (say Eve) can potentially tamper with the statistics of the channel. We focus on the deletion channel and assume that Eve can have possibly more bits deleted, hence increasing the deletion probability of the channel. The objective is to allow reliable communication between Alice and Bob (with vanishing error probability) regardless of the eavesdropper's action. To this goal, we design (i) a randomized encoder using probabilistic finite-state automata (PFSA) which, given a fixed message, generates a random vector as the channel input, and (ii) a decoder which outputs an estimate of the message only when the channel is not tampered with. In case the channel is indeed tampered with, the decoder can declare it with asymptotically small Type I and Type II error probabilities. It is worth mentioning that the rate of our scheme is (almost) zero, and hence we do not intend to study the capacity of deletion channels.

Unlike classical channel coding, where the set of all possible channel inputs (aka the codebook) must be available at the decoder, our scheme requires only that the set of PFSAs used in the encoder be available at the decoder. This model, which we call semi-universal, is contrasted with universal channel coding [13], where neither the channel statistics nor the codebook is known and the decoder is required to find the pattern of the message.

The rest of the paper is organized as follows. In Section II, we briefly discuss the notion of PFSA and its properties required for our scheme. Section III specifies the channel model, encoder, decoder, and different error events.
In Section IV, we discuss the effects of deletion channels on PFSA. Section V concerns the theoretical aspects of our coding scheme, and Section VI contains several experimental results.

Notation
We use calligraphic uppercase letters for sets (e.g., S), sans serif font for functions (e.g., T), uppercase letters for matrices (e.g., Γ), and bold lowercase letters for vectors (e.g., v). Throughout, we use g to denote a PFSA and s and x to denote its state and symbol, respectively. We write x^n = x_1 ... x_n for a sequence of symbols or, interchangeably, x if its size is clear from context; we also write x_i^j = x_i x_{i+1} ... x_j. Also, v_i denotes the i-th entry of the vector v, and A_{i,·} and A_{·,j} denote the i-th row and the j-th column of the matrix A, respectively. We use (a_x)_{x∈X} to denote a vector with entries indexed by x, and (A_x)_{x∈X} a matrix with columns indexed by x.

Computation Institute and Institute of Genomics and Systems Biology, The University of Chicago, Chicago, IL 60637 [email protected]; The University of Chicago, Chicago, IL [email protected]; Computation Institute, Chicago, IL [email protected]

Fig. 1. A PFSA with S = {s_1, s_2, s_3, s_4} and X = {0, 1}.

II. PROBABILISTIC FINITE-STATE AUTOMATA
In this section, we introduce a new measure of similarity between two vectors. To do this, we first need to define probabilistic finite-state automata (PFSA).
Definition 1 (PFSA). A probabilistic finite-state automaton is a quadruple (S, X, T, P), where S is a finite state space, X is a finite alphabet with K = |X|, T : S × X → S is the state transition function, and P : S × X → [0, 1] specifies the conditional distribution of generating a symbol conditioned on the state.

In fact, a PFSA is a directed graph with a finite number of vertices (i.e., states) and directed edges emanating from each vertex. An edge from state s ∈ S to state s' ∈ S is specified by two labels: (i) a symbol x ∈ X that updates the current state from s to s', that is, T(s, x) = s', and (ii) the probability of generating x when the system resides in state s, i.e., P(s, x). For instance, in the PFSA described in Fig. 1, the system residing in state s_1 evolves to state T(s_1, 1) with probability P(s_1, 1) while generating symbol 1. Clearly, Σ_{x∈X} P(s, x) = 1 for all s ∈ S. Given two symbols x_1 and x_2, one can define the transition function for the concatenation x_1 x_2 as T(s, x_1 x_2) = T(T(s, x_1), x_2). Letting X* denote the set of all possible concatenations of finitely many symbols from X, one can easily proceed to define T(s, x) as above for each x ∈ X* and s ∈ S. We say that a PFSA is strongly connected if for any pair of distinct states s_i and s_j, there exists a sequence x ∈ X* such that T(s_i, x) = s_j. Let G be the set of all strongly connected PFSAs. The significance of strongly connected PFSAs is that their corresponding Markov chains (i.e., the Markov chain with state space S and transition matrix P = [P(i, j)]_{|S|×|S|} whose entries are P(i, j) = Σ_{x∈X : T(s_i, x) = s_j} P(s_i, x)) have a unique stationary distribution (thus the initial state can be assumed to be irrelevant).

Definition 2 (Γ-expression for PFSA). We notice that a PFSA g is uniquely determined by Γ_g = (Γ_{g,x})_{x∈X} given by

(Γ_{g,x})_{i,j} = P_g(s_i, x) if T_g(s_i, x) = s_j, and 0 otherwise.

The state-to-state transition matrix P_g is defined as

P_g = Σ_{x∈X} Γ_{g,x},   (1)

and the state-to-symbol transition matrix P̃_g is given by P̃_g = (Γ_{g,x} 1_{|S|})_{x∈X}, where 1_n is the length-n all-one vector. The matrices Γ_{g,0}, Γ_{g,1}, P_g, and P̃_g for the PFSA illustrated in Fig. 1 can be read off directly from the edge labels of the figure.

Definition 3 (Generalized PFSA). A generalized PFSA is a PFSA g whose Γ_{g,x} can have more than one non-zero (positive) entry per row. In this case, we still have (Γ_{g,x} 1_{|S|})_i = P_g(s_i, x). However, T(s_i, x) might not be deterministic; instead, it is a probability distribution over states.
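As a minimal sketch of Definition 2, the code below builds the Γ-expression of a hypothetical two-state binary PFSA (not the machine of Fig. 1; the parameters 0.3 and 0.8 are assumptions for illustration) and derives P_g and P̃_g from it:

```python
import numpy as np

# Gamma-expression of a hypothetical two-state binary PFSA.
# Row i of Gamma[x] holds P(s_i, x) in the column j with T(s_i, x) = s_j.
Gamma = {
    0: np.array([[0.3, 0.0],
                 [0.8, 0.0]]),   # symbol 0 always leads to state s_1
    1: np.array([[0.0, 0.7],
                 [0.0, 0.2]]),   # symbol 1 always leads to state s_2
}

# Eq. (1): state-to-state transition matrix P_g = sum_x Gamma_{g,x}.
P = Gamma[0] + Gamma[1]

# State-to-symbol matrix: column x of P_tilde is Gamma_{g,x} @ 1.
P_tilde = np.column_stack([Gamma[x] @ np.ones(2) for x in (0, 1)])

print(P)        # row-stochastic: each row sums to one
print(P_tilde)  # row i is the symbol distribution P(s_i, .)
```

Note that P is row-stochastic by construction, since each row of P̃ is the conditional symbol distribution of a state.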
Shannon [14] appears to be the first to make use of PFSAs to describe stationary and ergodic sources. Given g ∈ G, first a state s_1 is chosen randomly according to the stationary distribution; then a symbol x_1 is generated with probability P(s_1, x_1), which takes the system from state s_1 to state s_2. A new symbol x_2 is then generated with probability P(s_2, x_2). Letting this process run for n time units, we obtain a sequence x_1, x_2, ..., x_n. In this case, we say that x_1, x_2, ..., x_n is a realization of g. According to Shannon, each state s_i captures the "residue of influence" of the preceding symbol x_{i-1} on the system. For x ∈ X*, we denote by x ← g the fact that g ∈ G generates x.

Fig. 2. A communication system with an active eavesdropper.
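The generation process described above can be sketched as follows; the two-state machine and its parameters are assumptions for illustration, and the initial state is drawn from the stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical two-state binary PFSA: P_sym[s, x] = P(s, x), T[s, x] = next state.
P_sym = np.array([[0.3, 0.7],
                  [0.8, 0.2]])
T = np.array([[0, 1],
              [0, 1]])            # symbol 0 -> s_1, symbol 1 -> s_2

# State-to-state chain of the PFSA, built from P_sym and T.
P = np.zeros((2, 2))
for s in range(2):
    for x in range(2):
        P[s, T[s, x]] += P_sym[s, x]

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

def realization(n):
    s = rng.choice(2, p=pi)            # initial state ~ stationary distribution
    out = []
    for _ in range(n):
        x = rng.choice(2, p=P_sym[s])  # emit x with probability P(s, x)
        out.append(int(x))
        s = T[s, x]                    # "residue of influence" of the last symbol
    return out

x = realization(25)
print(x)
```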
III. SYSTEM MODEL AND SETUP
Suppose Alice has a message M which takes values in a finite set M := {1, 2, ..., |M|} and seeks to transmit it reliably to Bob over a deletion channel W(δ) with deletion probability δ ∈ [0, 1). The communication channel is assumed to be public; that is, an active eavesdropper, say Eve, can access and possibly tamper with the channel. For simplicity, we assume that Eve may delete extra bits, thus changing the channel from W(δ) to W(δ') with δ' ≥ δ.

The objective is to design a pair of encoder ϕ and decoder ψ that enables Alice and Bob to communicate reliably over W(δ), with Bob accepting only when he is assured that the channel has not been tampered with. In classical information theory, the decoder must be tuned to the channel statistics. Hence, reliable communication occurs only when Bob knows the deletion probability δ. However, Eve might have tampered with the channel and increased the deletion probability to δ', and since Bob's decoding policy was tuned to δ, this might cause a decoding error, regardless of Bob's decoding algorithm. Therefore, the reliability of the decoding must always be conditioned on the fact that the channel has not been tampered with during communication.

Motivated by this observation, we propose the following coding scheme. We first propose a two-step encoder: each message M = m is first sent to a function ι : M → G_M which maps m to a PFSA g_m in G_M := {g_1, ..., g_|M|}; then another function ω : G_M → X^n generates y^n, a realization of the PFSA g_m, and sends it over the memoryless channel W(δ). Therefore, the encoder function ϕ : M → X^n is the composition ω ∘ ι (see Fig. 2). Unlike the classical setting, Bob need not know the set of all channel inputs y^n for each m ∈ M (aka the codebook). Instead, we assume Bob knows G_M (thus the name semi-universal scheme). The output of the channel, x^D, is an X-valued random vector whose length D is a binomial random variable Bin(n, 1 − δ) (corresponding to how many elements of y^n survive deletion).
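The channel W(δ) itself is simple to simulate: each input symbol survives independently with probability 1 − δ, so the output length is Bin(n, 1 − δ). A sketch (the value of δ is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def deletion_channel(y, delta):
    """Each input symbol is deleted independently with probability delta."""
    keep = rng.random(len(y)) >= delta   # survivor mask; len(output) ~ Bin(n, 1 - delta)
    return y[keep]

y = rng.integers(0, 2, size=1000)        # an arbitrary binary channel input
x = deletion_channel(y, 0.3)
print(len(y), len(x))
```

The output is always a subsequence of the input, which is exactly why synchronization is lost: Bob does not learn which positions were deleted.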
Upon receiving x^D, Bob applies ψ : X* → M × {0, 1} to generate ψ(x^D) = (M̂, T), where M̂ is an estimate of Alice's message and T specifies whether or not the channel has been tampered with. He then declares M̂ as the message only when T = 0. Therefore, the goal is to design (ϕ, ψ) such that, for sufficiently large n,

Pr(T = 0 | channel is tampered) + Pr(T = 1 | channel is not tampered) < ε,   (2)

and simultaneously

Pr(M ≠ M̂ | T = 0) ≤ ε,   (3)

for any uniformly chosen message M ∈ M. We say that reliable tamper-free communication is possible if (2) and (3) hold simultaneously for any ε > 0.

IV. PFSA THROUGH DELETION CHANNEL
In this section, we study the channel's effect on PFSAs by monitoring the change in the likelihood of x being generated by a PFSA at the channel output. To do this, we first study the likelihood when δ = 0 in Section IV-A, and then move on to the case of positive δ in Section IV-B. One of the main results of this section is to show that the output of W(δ) (i.e., x^D) can be equivalently generated by a generalized PFSA g(δ) whose Γ-expression and state-to-state transition matrix follow simple closed forms (cf. Theorem 1). In Section IV-C, we discuss some basic properties of g(δ) that will be useful for later development. We conclude this section by introducing the class M2 of PFSAs, which is closed under deletion. For notational brevity, we drop the subscript g when it is clearly understood from context.

A. PFSA over W(0): no deletion

Let a sequence of symbols x = x_1 ... x_n ∈ X* be given. We define p_g(x) (or simply p(x)) to be the probability that g generates x. Then we have

p(x^n) = p(x_1) p(x_2 | x_1) ··· p(x_n | x^{n-1}),

where p(x_i | x^{i-1}) is the conditional probability of g generating x_i given that g generated x^{i-1}. It is clear from Section II that

p_0 = p_g (the stationary distribution),
p(x_1) = (p_0^T P̃)_{x_1},   p_1^T = p_0^T Γ_{x_1} / ‖p_0^T Γ_{x_1}‖,
p(x_2 | x_1) = (p_1^T P̃)_{x_2},   p_2^T = p_1^T Γ_{x_2} / ‖p_1^T Γ_{x_2}‖,
...
p(x_{n-1} | x^{n-2}) = (p_{n-2}^T P̃)_{x_{n-1}},   p_{n-1}^T = p_{n-2}^T Γ_{x_{n-1}} / ‖p_{n-2}^T Γ_{x_{n-1}}‖,   (4)

and finally, p(x_n | x^{n-1}) = (p_{n-1}^T P̃)_{x_n}, where T denotes matrix transpose.

It is clear from the above update rule that any sequence x^n induces two probability distributions: one on the state space S, namely p_n, and the other on X. We denote the former by p_g(x) and the latter by D_g(x).
The update rules in (4) imply that D_g(x) = p_g^T(x) P̃_g and p_g^T(x̄x) ∝ p_g^T(x̄) Γ_{g,x}. More precisely, since

‖p_g^T(x̄) Γ_{g,x}‖ = p_g^T(x̄) Γ_{g,x} 1_{|S|} = p_g^T(x̄) (P̃_g)_{·,x} = (p_g^T(x̄) P̃_g)_x = p(x | x̄),

we have

p(x | x̄) p_g^T(x̄x) = p_g^T(x̄) Γ_{g,x}.   (5)

We also call D_g(x̄) = (p_g(x | x̄))_{x∈X} the symbolic derivative of g induced by x̄.

B. PFSA over W(δ): deletion with probability δ > 0

Now we move forward to investigate the effect of the deletion probability on PFSA transmission. The following result is key for our analysis.
Theorem 1. Let y ← g be a channel input and x be the channel output under positive deletion probability δ. Then x ← g(δ), where g(δ) is a generalized PFSA identified by Γ_{g,x,δ} = Q(P, δ) Γ_{g,x} for all x ∈ X, where P is the state-to-state transition matrix of g and Q is as defined in (6).

Proof. Assume Bob has observed x^{i-1}. Then we have

p(x_i | x^{i-1}) = (1 − δ)(p_{i-1}^T P̃)_{x_i} + δ(1 − δ)(p_{i-1}^T P P̃)_{x_i} + δ²(1 − δ)(p_{i-1}^T P² P̃)_{x_i} + ···
= (1 − δ)(p_{i-1}^T (Σ_{i=0}^∞ δ^i P^i) P̃)_{x_i}
= (p_{i-1}^T Q(P, δ) P̃)_{x_i},

where

Q(P, δ) = (1 − δ) Σ_{i=0}^∞ δ^i P^i = (1 − δ)(I − δP)^{−1}.   (6)

Analogous to (4), we can define the following distribution induced on S:

p_i^T = p_{i-1}^T Q(P, δ) Γ_{x_i} / ‖p_{i-1}^T Q(P, δ) Γ_{x_i}‖.   (7)

Comparing (7) with the expressions for p_i in (4), the result follows.

Remark. Notice that while the row-stochastic matrix P may not be invertible, I − δP is non-singular for all δ ∈ [0, 1), as the eigenvalues of P have modulus at most 1. Moreover, it is clear from (6) that Q(P, δ) is also a row-stochastic matrix with p_g being its left eigenvector corresponding to eigenvalue one. We will take a closer look at the eigenvalues of Q(P, δ) in the next section.
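Equation (6) is straightforward to compute numerically. The sketch below (for an assumed two-state P) checks that Q(P, δ) matches its power-series definition, is row-stochastic, and preserves the stationary distribution, as noted in the remark:

```python
import numpy as np

delta = 0.4                              # assumed deletion probability
P = np.array([[0.3, 0.7],                # assumed state-to-state matrix of g
              [0.8, 0.2]])

# Eq. (6): Q(P, delta) = (1 - delta) (I - delta P)^{-1}.
Q = (1 - delta) * np.linalg.inv(np.eye(2) - delta * P)

# Stationary distribution p_g of P.
w, v = np.linalg.eig(P.T)
p = np.real(v[:, np.argmin(np.abs(w - 1))])
p = p / p.sum()

print(Q)
print(Q.sum(axis=1))   # rows of Q sum to one
print(p @ Q)           # equals p: p is a left eigenvector of Q for eigenvalue 1
```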
Fig. 3. Left: a PFSA g_(μ,ν) in class M2 (δ = 0). Right: the corresponding g_(μ,ν)(δ) for a positive δ, with transition probabilities rounded to two decimal places. We can see that deletion only causes the transition probabilities to change, but keeps the structure of the machine.

C. Properties of the generalized PFSA
We start by analyzing the eigenspace of the state-to-state transition matrix of g(δ). Note that it follows from (1) that P_{g(δ)} = Q(P_g, δ) P_g.
Theorem 2. Let p_g be the stationary distribution of a strongly connected g. Then the generalized PFSA g(δ) is also strongly connected, with stationary distribution p_{g(δ)} = p_g.

Proof. Let λ be an eigenvalue of P_g. Then λ(1 − δ)(1 − δλ)^{−1} is an eigenvalue of P_{g(δ)}. Define f(λ, δ) = λ(1 − δ)(1 − δλ)^{−1}. Then the result follows from the following observations:
1) For λ = 1, f(λ, δ) = 1 for all δ ∈ [0, 1), and hence lim_{δ→1} f(1, δ) = 1.
2) For |λ| < 1, |f(λ, δ)| < |λ| for all δ ∈ (0, 1), and furthermore, lim_{δ→1} f(λ, δ) = 0.

The following is an immediate corollary.

Corollary 1. We have, for all x ∈ X, p_g(x) = p_{g(δ)}(x).

Proof. We have p_{g(δ)}^T P̃_{g(δ)} = p_g^T P̃_{g(δ)} = p_g^T Q(P_g, δ) P̃_g = p_g^T P̃_g.

A natural question is what happens when δ ↑ 1. Letting g(1) denote the machine corresponding to δ ↑ 1, we now show that, quite expectedly, g(1) is a single-state machine.

Theorem 3. g(1) is a single-state PFSA.

Proof. First note that the observations given in the proof of Theorem 2 imply that lim_{δ→1} Q(P_g, δ) = 1_{|S|} p_g^T, and consequently g(1) is specified by Γ_{g(1),x} = 1_{|S|} p_g^T Γ_{g,x} for x ∈ X.

Suppose x̄ = x_1 x_2 ... x_n is observed. Following the argument given in Section IV-B, we get

p_{g(1)}(x̄x) = p_0^T (1_{|S|} p_g^T Γ_{x_1})(1_{|S|} p_g^T Γ_{x_2}) ··· (1_{|S|} p_g^T Γ_{x_n})(1_{|S|} p_g^T Γ_x) 1_{|S|}
= (p_0^T 1_{|S|})(p_g^T Γ_{x_1} 1_{|S|}) ··· (p_g^T Γ_{x_n} 1_{|S|})(p_g^T Γ_x 1_{|S|}),

and hence p_{g(1)}(x | x̄) = p_g^T Γ_x 1_{|S|} for all x̄. Since an i.i.d. process corresponds to a single-state PFSA, we conclude that g(1) is in fact a single-state PFSA.

D. M2 Class of PFSA
We note that g(δ) of a PFSA g is not necessarily a PFSA. For example, for g the PFSA described in Fig. 1 and a positive deletion probability, the matrices Γ_{g(δ),0} and Γ_{g(δ),1} have more than one non-zero entry per row, so g(δ) is a generalized PFSA but not a PFSA.

Nevertheless, we introduce M2, a class of PFSAs which is closed under deletion, i.e., g ∈ M2 implies g(δ) ∈ M2 for all δ ∈ [0, 1). As this class is instrumental in our experimental results, we shall study it in more detail. M2 is the collection of two-state PFSAs on a binary alphabet: g = g_(μ,ν) ∈ M2 with (μ, ν) ∈ (0, 1) × (0, 1) is specified by the quadruple (S, X, T, P_(μ,ν)), where S = {s_1, s_2}, X = {0, 1}, and

Γ_{g_(μ,ν),0} = [ μ 0 ; ν 0 ],   Γ_{g_(μ,ν),1} = [ 0 1−μ ; 0 1−ν ].

Fig. 3 illustrates a machine g_(μ,ν) and its corresponding g_(μ,ν)(δ), which is obtained from Theorem 1. Since Γ_{g,x,δ} has exactly the same form, containing a single column of non-zero entries, for all δ, it is clear that g_(μ,ν)(δ) ∈ M2.

Since each g_(μ,ν) is specified by two numbers, we can parametrize M2 by the unit square in R². In Fig. 4, we show the effect of the deletion probability on M2 machines. The key observation is that deletion drives machines toward the line μ = ν.

Fig. 4. Each dot in (a) represents a g_(μ,ν) in M2, with μ and ν ranging over a uniform grid in (0, 1); panels (a)-(d) correspond to δ = 0 and three increasing positive values of δ. The color of a point is proportional to the KL divergence (defined in Section V-B) of g_(μ,ν)(δ) to g_(μ,ν). The images are symmetric with respect to the line μ + ν = 1 because g_(1−ν,1−μ) is exactly g_(μ,ν) with the two states swapped. We can see that as δ increases, the dots move toward the line μ = ν, which corresponds to single-state PFSAs. The asymmetry in how fast PFSAs on each side of the μ + ν = 1 line converge to a single-state PFSA is caused by a structural difference between them: machines on the upper side, with μ < ν, have strong connections between the two states, while machines on the lower side, with μ > ν, have weaker connections between the states.
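The closure of M2 under deletion can be checked numerically via Theorem 1: applying Q(P, δ) to the single-column matrices Γ_{g_(μ,ν),x} leaves the zero column intact. A sketch (μ, ν, and δ are assumed values):

```python
import numpy as np

mu, nu, delta = 0.3, 0.8, 0.4            # assumed parameters
Gamma0 = np.array([[mu, 0.0], [nu, 0.0]])
Gamma1 = np.array([[0.0, 1 - mu], [0.0, 1 - nu]])
P = Gamma0 + Gamma1

# Theorem 1: Gamma_{g,x,delta} = Q(P, delta) Gamma_{g,x}.
Q = (1 - delta) * np.linalg.inv(np.eye(2) - delta * P)
G0d, G1d = Q @ Gamma0, Q @ Gamma1

print(G0d)   # non-zero entries stay in the first column
print(G1d)   # non-zero entries stay in the second column
```

So g_(μ,ν)(δ) is again in M2, with its parameters read off from the non-zero columns of the two matrices.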
V. THE CONVERGENCE OF LIKELIHOOD

The goal of this section is to lay the theoretical groundwork for our decoding and tamper-detection algorithms based on PFSAs. In Section V-C, we employ a maximum-likelihood framework to decode the generating PFSA given the channel output. We show that the likelihood is closely related to the entropy rate and KL divergence of PFSAs (to be defined and calculated in Sections V-A and V-B).
A. Entropy rate of PFSA
Let g be a PFSA. We define H_n(g) as

H_n(g) := − Σ_{|x|=n} p_g(x) log p_g(x).

Then the entropy rate of g is defined as

H(g) := lim_{n→∞} (1/n) H_n(g).

Note that H(g) is in fact the entropy rate of the stochastic process corresponding to g [15]. In the next theorem, we show that the above limit exists and that the entropy rate has a simple closed form.

Theorem 4. We have

H(g) = Σ_{s∈S} (p_g)_s H((P̃_g)_{s,·}).

Proof. See Appendix VII-A.

It readily follows from the theorem above that the entropy rate of g_(μ,ν) is

H(g_(μ,ν)) = [ν h_b(μ) + μ̄ h_b(ν)] / (μ̄ + ν),

where ā := 1 − a and h_b(a) := −a log a − ā log ā is the binary entropy function for any a ∈ [0, 1].

Next, we show that deletion increases the entropy rate, which will be critical for tamper-detection purposes.
Theorem 5. The map δ ↦ H(g_(μ,ν)(δ)) is monotonically increasing when μ ≠ ν.

Proof. We have

μ(δ) = (μ − δ(μ − ν)) / (1 − δ(μ − ν)),   ν(δ) = ν / (1 − δ(μ − ν)),

and

H(g_(μ,ν)(δ)) = [ν / (1 − μ + ν)] h_b((μ − δ(μ − ν)) / (1 − δ(μ − ν))) + [(1 − μ) / (1 − μ + ν)] h_b(ν / (1 − δ(μ − ν))).

Writing α = μ − ν, we can then compute

d H(g_(μ,ν)(δ)) / dδ = [α μ̄ ν / ((1 − αδ)² (μ̄ + ν))] log( (μ − δα)(ν̄ − δα) / (μ̄ ν) ).

It is straightforward to check that the derivative is always positive when μ ≠ ν.
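The closed forms μ(δ) and ν(δ) in the proof of Theorem 5 can be cross-checked numerically against Theorem 1, and the monotonicity of the entropy rate verified on a grid of δ values (μ, ν, and the δ values are assumptions):

```python
import numpy as np

def hb(a):
    """Binary entropy in nats."""
    return -a * np.log(a) - (1 - a) * np.log(1 - a)

def deleted_params(mu, nu, delta):
    """mu(delta), nu(delta) from the proof of Theorem 5."""
    alpha = mu - nu
    return (mu - delta * alpha) / (1 - delta * alpha), nu / (1 - delta * alpha)

def entropy_rate(mu, nu):
    """Theorem 4 specialized to an M2 machine."""
    mub = 1 - mu
    return (nu * hb(mu) + mub * hb(nu)) / (mub + nu)

mu, nu = 0.3, 0.8
md, nd = deleted_params(mu, nu, 0.4)

# Cross-check against Theorem 1: read mu(delta), nu(delta) off Q @ Gamma_0.
Gamma0 = np.array([[mu, 0.0], [nu, 0.0]])
P = np.array([[mu, 1 - mu], [nu, 1 - nu]])
Q = 0.6 * np.linalg.inv(np.eye(2) - 0.4 * P)
print((Q @ Gamma0)[:, 0], (md, nd))      # the two computations agree

# Theorem 5: the entropy rate increases with delta (when mu != nu).
H = [entropy_rate(*deleted_params(mu, nu, d)) for d in (0.0, 0.2, 0.4, 0.6)]
print(H)
```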
Let g , g ∈ M2. The n -th order KL divergence between g and g is the KL divergence on the space of length- n sequences,i.e. D n ( g (cid:107) g ) = (cid:88) | x | = n p g ( x ) log p g ( x ) p g ( x ) . Analogous to entropy rate, we can define the KL divergence between g and g as D KL ( g (cid:107) g ) := lim n →∞ n D n ( g (cid:107) g ) . We show in Theorem 6 below shows that the limit exists and also derived a closed form for the KL divergence between twoPFSAs. But before we can state the theorem, we need to introduce a very useful construction on two PFSAs, called synchronouscomposition . Definition 4 (synchronous composition) . Let g = ( S , X , T , P ) and g = ( T , X , T , P ) be two PFSAs with the samealphabet and let g ∗ c ( g (cid:107) g ) be the probabilistic automata specified by the quadruple ( S c , X , T c , P c ) where S c = S × T = { ( s, t ) } s ∈S ,t ∈T is the Cartesian product of S and T , and T c (( s, t ) , x ) = ( T ( s, x ) , T ( t, x )) , P c (( s, t ) , x ) = P ( s, x ) , for all s ∈ S , t ∈ S , and x ∈ X . Then the synchronous composition g c ( g (cid:107) g ) is defined to be any absorbing stronglyconnected component of g ∗ c ( g (cid:107) g ) , i.e. strongly connected component without any out-going edges. It is not clear that there is only one absorbing strongly connected component in g ∗ c ( g (cid:107) g ) . However, as proved in Theorem 8in Appendix VII-B, g c ( g (cid:107) g ) is equivalent to g irrespective of the choice of absorbing strongly connected component, i.e., p g c ( x ) = p g ( x ) for x ∈ X ∗ .In Figs. 6, 7, 8, and 9, we provide examples of synchronous compositions for several g and g which shed light on thefact that the synchronous composition of two strongly connected PFSA might not be strongly connected. Theorem 6.
Theorem 6. Let g_c = g_c(g_1 ‖ g_2) and p_{g_c} be the stationary distribution of g_c. Then we have

lim_{n→∞} (1/n) D_n(g_1 ‖ g_2) = Σ_{s∈S, t∈T} (p_{g_c})_{(s,t)} D_KL( (P̃_{g_1})_{s,·} ‖ (P̃_{g_2})_{t,·} ).

Proof. See Appendix VII-B.

In light of this theorem, one can easily show that for g_1 = g_(μ_1,ν_1) and g_2 = g_(μ_2,ν_2) in M2,

D_KL(g_1 ‖ g_2) = [ν_1 D_KL(μ_1 ‖ μ_2) + μ̄_1 D_KL(ν_1 ‖ ν_2)] / (μ̄_1 + ν_1),

where D_KL(a ‖ b) denotes the binary KL divergence a log(a/b) + ā log(ā/b̄).
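The M2 closed form for the KL divergence rate is easy to evaluate. The sketch below compares it with a brute-force computation of (1/n) D_n over all length-n sequences; the two machines are assumptions for illustration:

```python
import numpy as np
from itertools import product

def kl_bin(a, b):
    """Binary KL divergence."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def kl_rate(m1, n1, m2, n2):
    """Closed form from Theorem 6 specialized to M2."""
    return (n1 * kl_bin(m1, m2) + (1 - m1) * kl_bin(n1, n2)) / (1 - m1 + n1)

def seq_prob(mu, nu, x):
    """p_g(x) for an M2 machine, started from the stationary distribution."""
    p = np.array([nu, 1 - mu]) / (1 - mu + nu)    # stationary distribution
    P_sym = np.array([[mu, 1 - mu], [nu, 1 - nu]])
    prob = 1.0
    for xi in x:
        prob *= p @ P_sym[:, xi]
        p = np.eye(2)[xi]        # symbol 0 leads to s_1, symbol 1 to s_2
    return prob

m1, n1, m2, n2 = 0.3, 0.8, 0.5, 0.6    # assumed machines g_1 and g_2
n = 12
Dn = sum(seq_prob(m1, n1, x) * np.log(seq_prob(m1, n1, x) / seq_prob(m2, n2, x))
         for x in product([0, 1], repeat=n))
print(Dn / n, kl_rate(m1, n1, m2, n2))  # close already for moderate n
```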
C. Convergence of log likelihood

According to the Shannon-McMillan-Breiman theorem [15, Theorem 16.8.1], we have −(1/n) log p_g(x) → H(g) for any sequence x ← g. A natural question is what the log-likelihood converges to if x is generated by a different machine. The following theorem states that the log-likelihood converges to the entropy rate of the generating machine plus a KL divergence which accounts for the mismatch.

Theorem 7. For any x^n ← g ∈ M2, we have, with probability one,

−(1/n) Σ_{i=1}^n log p_{g'}(x_i | x^{i-1}) → H(g) + D_KL(g ‖ g'),

for any PFSA g' ∈ M2.

Proof. First note that

−(1/n) Σ_{i=1}^n log p_{g'}(x_i | x^{i-1}) = −(1/n) log p_g(x) + (1/n) Σ_{i=1}^n log( p_g(x_i | x^{i-1}) / p_{g'}(x_i | x^{i-1}) ).   (8)

Clearly, the first term in the above sum converges to H(g). To show the convergence of the second term, let Z_i = log( p_g(x_i | x^{i-1}) / p_{g'}(x_i | x^{i-1}) ). Notice that for any PFSA g in M2 and 1 ≤ i ≤ n, the state distribution p_g(x^i) equals [1, 0] for all x^i with x_i = 0, and [0, 1] for all x^i with x_i = 1, and hence the process {Z_i}_{i=1}^n is a Markov process. Let Z_0 and Z_1 denote the sets of indices i such that x_{i-1} = 0 and x_{i-1} = 1, respectively. Then we have

(1/n) Σ_{i=1}^n Z_i = (1/n) Σ_{i∈Z_0} Z_i + (1/n) Σ_{i∈Z_1} Z_i.   (9)

It is straightforward to show that for all i ∈ Z_0

Z_i = 1{x_i=0} log(μ_g/μ_{g'}) + 1{x_i=1} log(μ̄_g/μ̄_{g'}),

and for all i ∈ Z_1

Z_i = 1{x_i=0} log(ν_g/ν_{g'}) + 1{x_i=1} log(ν̄_g/ν̄_{g'}).

It follows from (9) that

(1/n) Σ_{i=1}^n Z_i = (1/n) log(μ_g/μ_{g'}) Σ_{i=1}^n 1{x_{i-1}=0, x_i=0} + (1/n) log(μ̄_g/μ̄_{g'}) Σ_{i=1}^n 1{x_{i-1}=0, x_i=1} + (1/n) log(ν_g/ν_{g'}) Σ_{i=1}^n 1{x_{i-1}=1, x_i=0} + (1/n) log(ν̄_g/ν̄_{g'}) Σ_{i=1}^n 1{x_{i-1}=1, x_i=1}
→ p_g(0) ( μ_g log(μ_g/μ_{g'}) + μ̄_g log(μ̄_g/μ̄_{g'}) ) + p_g(1) ( ν_g log(ν_g/ν_{g'}) + ν̄_g log(ν̄_g/ν̄_{g'}) )

as n → ∞.

For ease of presentation, we define

L(g', x^n ← g) := −(1/n) Σ_{i=1}^n log p_{g'}(x_i | x^{i-1}).

When the generating machine g is not known, we use L(g', x^n) to denote the likelihood of g' generating x.

VI. ALGORITHM AND SIMULATION
A. Decoding
In this and the following section, we assume that we have a set of PFSAs G = {g_1, ..., g_|M|}, with g_i ∈ M2 for all i. We will briefly discuss heuristics on how to generate a set of PFSAs that is good for tamper detection and decoding in Section VI-C.

We saw in Theorem 7 that

L(g_j(δ), x^n ← g_i(δ)) → H(g_i(δ)) + D_KL(g_i(δ) ‖ g_j(δ)),   (10)

which motivates the following definition for the decoding function in Fig. 2:

ψ(x) = argmin_{m∈M} L(g_m(δ), x^n).

We apply this decoding strategy in Fig. 5 for a fixed deletion probability δ and two different message sets, with |M| = 10 and |M| = 20.
B. Tamper detecting

We assume that the active eavesdropper tampers with the channel in such a way that δ' − δ > η for some η ≥ 0. Following Theorems 5 and 7, we get

L(g_j(δ), x ← g_i(δ')) → H(g_i(δ')) + D_KL(g_i(δ') ‖ g_j(δ)) ≥ H(g_i(δ')) ≥ H(g_i(δ)),   (11)

where the last inequality is due to Theorem 5. Hence, tampering with the channel results in an increase in the likelihood term L. This leads to our tamper-detection procedure, detailed in Algorithm 1.

Fig. 5. (a) and (c) show the two sets of PFSAs in the parameter space, with purple dots for the g_m's and blue dots for the g_m(δ)'s. Error rates over a range of input-sequence lengths for the two message sets are shown in (b) and (d), respectively; the results are averaged over repeated runs.
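A small end-to-end sketch of the decoding rule and the statistic behind (10)-(11): we sample a long realization from one of two assumed M2 machines (already expressed at the channel output via the parameters of Theorem 1), evaluate L(g_m(δ), x) for each candidate, and decode by the minimizer. By Theorem 7, the minimizer's likelihood should also be close to its entropy rate when the channel is untampered. All machine parameters and δ are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def deleted_params(mu, nu, delta):
    alpha = mu - nu
    return (mu - delta * alpha) / (1 - delta * alpha), nu / (1 - delta * alpha)

def sample(mu, nu, n):
    """Realization of an M2 machine (symbol 0 -> s_1, symbol 1 -> s_2)."""
    p1 = (1 - mu, 1 - nu)                       # P(emit 1 | state)
    s = rng.random() < (1 - mu) / (1 - mu + nu) # stationary start
    out = np.empty(n, dtype=int)
    for i in range(n):
        out[i] = int(rng.random() < p1[s])
        s = out[i]                              # M2 machines synchronize on the symbol
    return out

def L(mu, nu, x):
    """Negative log-likelihood rate L(g, x) for an M2 machine."""
    p1 = np.array([1 - mu, 1 - nu])
    prev = np.concatenate(([0], x[:-1]))        # M2 state = previous symbol
    q = np.where(x == 1, p1[prev], 1 - p1[prev])
    return -np.mean(np.log(q))

def entropy_rate(mu, nu):
    hb = lambda a: -a * np.log(a) - (1 - a) * np.log(1 - a)
    return (nu * hb(mu) + (1 - mu) * hb(nu)) / (1 - mu + nu)

# Two assumed messages seen through an assumed channel: candidates g_m(delta).
delta = 0.2
cands = [deleted_params(0.2, 0.9, delta), deleted_params(0.8, 0.3, delta)]

x = sample(*cands[0], 50_000)   # Alice sent message 0; channel untampered
scores = [L(m, n, x) for (m, n) in cands]
m_hat = int(np.argmin(scores))
print(m_hat, scores[m_hat] - entropy_rate(*cands[m_hat]))  # decoded message, small gap
```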
Algorithm 1: Tampering detection
input: {g_m}_{m∈M}, sequences x_1, ..., x_k, δ, η, ε
output: T, with T = 0 if no tampering and T = 1 otherwise
H_0 = (H(g_m(δ)))_{m∈M};
H_1 = (H(g_m(δ + η)))_{m∈M};
D = H_1 − H_0;
v = 0;   /* the weighted vote */
for i = 1, ..., k do
    d = argmin_{m∈M} L(g_m(δ), x_i);
    e = L(g_d(δ), x_i);
    if e − H(g_d(δ)) > ε · D[d] then
        v = v + D[d];
    end
end
S = Σ_{m∈M} D[m];
if v/(S · k) > τ then   /* τ is a fixed vote cutoff */
    return T = 1;
else
    return T = 0;
end
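A sketch of Algorithm 1 for the M2 class, with entropy rates and likelihoods computed via the closed forms of Section V; the message machines, δ, η, and the vote cutoff of 0.25 are illustrative assumptions, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(3)

def deleted_params(mu, nu, delta):
    a = mu - nu
    return (mu - delta * a) / (1 - delta * a), nu / (1 - delta * a)

def entropy_rate(mu, nu):
    hb = lambda p: -p * np.log(p) - (1 - p) * np.log(1 - p)
    return (nu * hb(mu) + (1 - mu) * hb(nu)) / (1 - mu + nu)

def L(mu, nu, x):
    p1 = np.array([1 - mu, 1 - nu])              # P(emit 1 | state)
    prev = np.concatenate(([0], x[:-1]))         # M2 state = previous symbol
    q = np.where(x == 1, p1[prev], 1 - p1[prev])
    return -np.mean(np.log(q))

def sample(mu, nu, n):
    out, s = np.empty(n, dtype=int), 0
    for i in range(n):
        out[i] = int(rng.random() < (1 - mu, 1 - nu)[s])
        s = out[i]
    return out

def detect(G, xs, delta, eta, eps):
    """Algorithm 1: weighted vote over the k received sequences."""
    H0 = [entropy_rate(*deleted_params(m, n, delta)) for (m, n) in G]
    H1 = [entropy_rate(*deleted_params(m, n, delta + eta)) for (m, n) in G]
    D = [h1 - h0 for h0, h1 in zip(H0, H1)]
    v = 0.0
    for x in xs:
        scores = [L(*deleted_params(m, n, delta), x) for (m, n) in G]
        d = int(np.argmin(scores))
        if scores[d] - H0[d] > eps * D[d]:       # likelihood exceeds entropy rate
            v += D[d]                            # weighted vote for "tampered"
    return int(v / (sum(D) * len(xs)) > 0.25)    # assumed vote cutoff

G = [(0.2, 0.9), (0.8, 0.3)]                     # assumed message machines
delta, eta, eps, n = 0.1, 0.2, 0.5, 20_000

clean = [sample(*deleted_params(*G[i % 2], delta), n) for i in range(10)]
tampered = [sample(*deleted_params(*G[i % 2], delta + eta), n) for i in range(10)]
print(detect(G, clean, delta, eta, eps), detect(G, tampered, delta, eta, eps))
```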
C. Generate machines with good separation

For a fixed number of messages, we need to choose a set of M2 PFSAs with the best decoding and tamper-detection performance. It is important to note that (1) the decoding error is lowered significantly by increasing D_KL(g_i ‖ g_j), according to (10), and (2) the tamper-detection error is improved by making sure |H(g(δ)) − H(g(δ'))| is large for δ' − δ ≥ η, according to (11). However, there is a trade-off here: to increase the pairwise KL divergence, we want the machines to be spread evenly in the parameter space, while, according to Theorem 5, to increase H(g(δ')) − H(g(δ)) we need the machines to stay away from being single-state, i.e., away from the line μ = ν.

Here, we describe briefly how we design G for the experiment in Fig. 5. As a naive start, we generate |M| random values of μ in (0, 1), and for each of them we generate ν separated from μ by a fixed margin: if μ is large, we choose ν randomly below μ, and otherwise randomly above μ. Then, we use a hill-climbing algorithm to maximize the minimum pairwise averaged KL divergence, (1/2)(D_KL(g_1 ‖ g_2) + D_KL(g_2 ‖ g_1)), over all pairs of machines. Let σ be the step size; for a pair g_(μ_1,ν_1) and g_(μ_2,ν_2) with minimum averaged KL divergence, we search the eight neighboring points (μ_i ± σ, ν_i) and (μ_i, ν_i ± σ), i = 1, 2, for improvement. We exit the search when there is no improvement to be found.

VII. CONCLUSION AND FUTURE WORK
In this paper, we developed a new information-theoretic coding scheme for information transfer over a public deletionchannel, subject to an active eavesdropper (aka jammer). Our coding scheme is based on probabilistic finite-state automata(PFSA) and is proved to have (1) semi-universal property, in a sense that codebook need not be available at the decoder,(2) small error probability when decoding messages, and (3) tamper-free property, which alarms the decoder about possibletampering of the channel. To the best of our knowledge, exploiting PFSA’s in a secure and reliable information-theoretic ε = 0 ε = 0 . ε = 0 . . . . . . . . . . . . . . . . . . . . . . .
14 0 . . . .
18 0 . .
20 0 . . ε = 0 . ε = 0 . ε = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
04 0 . . Table I. The table above records the error rates of tamper detection algorithm for sending messages through a channel with deletionprobability δ = . . We generate test sets containing k = 200 sequences, with for each message. We assign randomly whether aparticular test set will be tampered or not. For simplicity, if a test set is tampered it will have a fixed deletion probability δ = . . We runthe algorithm for input sequence of length , , , and , and for ε = 0 , . , . , . , . , . . For each block, the first column isthe rate of failing to detect a tampering, and the second column is the rate of false alarm of a tampering, and the last column is the sumof two error rates. We can see that with increased cutoff value ε , we have significantly fewer false alarms without too much increase in therate of failing to detect a true tampering. communication model is very new, yet very insightful. Promising results in both theoretical and experimental aspects of thiswork lead to several research directions: • To have an analytically better analysis of error probability, the convergence rate of likelihood in Theorem7 for generalPFSA is needed. • We admit that the space of M2 is too small to have simultaneous vanishing error probability (with small n ) in messagedecoding and tamper detecting. To go beyond M2, we need to find an analytic way to compute entropy rate and KLdivergence for generalized PFSA. R EFERENCES[1] M. Mitzenmacher, “A survey of results for deletion channels and related synchronization channels,”
Probability Surveys, vol. 6, pp. 1–33, 2009.
[2] R. L. Dobrushin, "Shannon's theorems for channels with synchronization errors," Problems Inform. Transmission, vol. 3, no. 4, pp. 11–26, Oct. 1967.
[3] A. Kirsch and E. Drinea, "Directly lower bounding the information capacity for channels with i.i.d. deletions and duplications," IEEE Trans. Inf. Theory, vol. 56, no. 1, pp. 86–102, Jan. 2010.
[4] M. Mitzenmacher and E. Drinea, "A simple lower bound for the capacity of the deletion channel," IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4657–4660, Oct. 2006.
[5] H. Yamamoto, "A source coding problem for sources with additional outputs to keep secret from the receiver or wiretappers," IEEE Trans. Inf. Theory, vol. 29, no. 6, pp. 918–923, Nov. 1983.
[6] D. Gündüz, E. Erkip, and H. Poor, "Secure lossless compression with side information," in Proc. IEEE Information Theory Workshop, May 2008, pp. 169–173.
[7] E. Ekrem and S. Ulukus, "Secure lossy source coding with side information," in Proc. Annual Allerton Conference on Communication, Control, and Computing, Sept. 2011, pp. 1098–1105.
[8] Y.-H. Kim, A. Sutivong, and T. Cover, "State amplification," IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 1850–1859, May 2008.
[9] K. Kittichokechai, Y. K. Chia, T. J. Oechtering, M. Skoglund, and T. Weissman, "Secure source coding with a public helper," IEEE Trans. Inf. Theory, vol. 62, no. 7, pp. 3930–3949, July 2016.
[10] Y. Kaspi and N. Merhav, "Zero-delay and causal secure source coding," IEEE Trans. Inf. Theory, vol. 61, no. 11, pp. 6238–6250, Nov. 2015.
[11] J. Villard and P. Piantanida, "Secure multiterminal source coding with side information at the eavesdropper," IEEE Trans. Inf. Theory, vol. 59, no. 6, pp. 3668–3692, June 2013.
[12] S. Asoodeh, F. Alajaji, and T. Linder, "Lossless secure source coding: Yamamoto's setting," in Proc. Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept. 2015, pp. 1032–1037.
[13] V. Misra and T. Weissman, "Unsupervised learning and universal communication," in Proc. IEEE Int. Symp. Inf. Theory, July 2013, pp. 261–265.
[14] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 2, 1948.
[15] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.

APPENDIX
A. Proof for Theorem 4
Following the standard notation in information theory, we use $X^n$ to denote a random vector $(X_1, \ldots, X_n)$ generated from a PFSA $g$ and $H(X^n)$ to denote its entropy, that is, $H(X^n) = H_n(g)$. We can similarly define the conditional entropy $H(X_n \mid X^{n-1})$. It is shown in [15] that
$$\lim_{n\to\infty} \frac{1}{n} H(X^n) = \lim_{n\to\infty} H\big(X_n \mid X^{n-1}\big)$$
for any stationary process $\{X_n\}_{n=1}^{\infty}$. In order to compute the entropy rate, we can therefore focus on the latter limit. Let $S \sim p$ denote a random variable indicating the initial state of the PFSA. We have
\begin{align*}
H\big(X_n \mid X^{n-1}\big) &= H(X^n) - H\big(X^{n-1}\big) \\
&= \big[H(X^n, S) - H(S \mid X^n)\big] - \big[H\big(X^{n-1}, S\big) - H\big(S \mid X^{n-1}\big)\big] \\
&= \big[H(X^n, S) - H\big(X^{n-1}, S\big)\big] + \big[H\big(S \mid X^{n-1}\big) - H(S \mid X^n)\big] \\
&= \big[H(X^n \mid S) + H(S) - H\big(X^{n-1} \mid S\big) - H(S)\big] + \big[H\big(S \mid X^{n-1}\big) - H(S \mid X^n)\big] \\
&= \underbrace{H\big(X_n \mid S, X^{n-1}\big)}_{=:A_n} + \underbrace{\big[H\big(S \mid X^{n-1}\big) - H(S \mid X^n)\big]}_{=:B_n}.
\end{align*}
Note that for any $N \geq 1$,
$$\sum_{n=1}^{N} B_n = \sum_{n=1}^{N} \Big[H\big(S \mid X^{n-1}\big) - H(S \mid X^n)\Big] = H(S) - H\big(S \mid X^N\big) \leq H(S) = H(p).$$
Since $B_n$ is nonnegative for each $n$ (conditioning reduces entropy) and $\sum_{n=1}^{N} B_n$ is bounded from above, it follows that $\lim_{n\to\infty} B_n = 0$. It remains to analyze $A_n$.

[Fig. 6 state diagrams lost in extraction; panels: (a) $g_1 \in$ M2, (b) $g_2 \in$ M2, (c) $g_c^*(g_1 \| g_2)$, (d) strongly connected component of $g_c^*(g_1 \| g_2)$.] Fig. 6. The example shows that $g_c^*$ of two strongly connected PFSAs may not remain strongly connected. We can see that in this case, $g_c(g_1 \| g_2)$ is equal to $g_1$.
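Before turning to $A_n$, the limit just established can be checked numerically. The sketch below (a hypothetical two-state PFSA whose emission probabilities and transition map are invented for illustration, not taken from the paper) computes $H(X_n \mid X^{n-1}) = H(X^n) - H(X^{n-1})$ by brute-force enumeration and compares it against the closed-form entropy rate $\sum_s p_s H(\widetilde{P}_{s,\cdot})$:

```python
from itertools import product
from math import log2

# Hypothetical two-state PFSA over alphabet {0, 1} (illustrative parameters).
# From state s, symbol x is emitted with probability E[s][x] and the machine
# moves to state T[s][x]; here the next state equals the emitted symbol.
E = {0: (0.8, 0.2), 1: (0.4, 0.6)}
T = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}

# Stationary distribution p of the induced state chain. Because the next
# state equals the emitted symbol, the chain's transition rows are E[0] and
# E[1], so the 2-state formula applies with a = E[0][1], b = E[1][0].
mass1 = E[0][1] / (E[0][1] + E[1][0])
p = (1.0 - mass1, mass1)

def seq_prob(xs):
    """p(x^n) with the initial state S drawn from the stationary distribution."""
    total = 0.0
    for s0, ps in enumerate(p):
        pr, s = ps, s0
        for x in xs:
            pr *= E[s][x]
            s = T[s][x]
        total += pr
    return total

def block_entropy(n):
    """H(X^n) by enumerating all 2^n length-n sequences."""
    return -sum(q * log2(q)
                for xs in product((0, 1), repeat=n)
                for q in (seq_prob(xs),) if q > 0)

def h2(q):
    """Binary entropy of a Bernoulli(q) emission."""
    return -q * log2(q) - (1 - q) * log2(1 - q)

# Entropy rate: sum over states of p_s times the entropy of state s's emissions.
rate = sum(ps * h2(E[s][0]) for s, ps in enumerate(p))

# Conditional entropies H(X_n | X^{n-1}) for n = 1, ..., 7.
cond = [block_entropy(n) - block_entropy(n - 1) for n in range(1, 8)]
```

For this machine the observer resolves the current state after a single symbol, so the conditional entropy already equals the entropy rate at $n = 2$; for non-synchronizing machines the approach is only asymptotic, which is exactly the role played by the $B_n \to 0$ argument above.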
Notice that the state at time $n$ is a deterministic function of $S$ and $X^{n-1}$ (namely $T(S, X^{n-1})$), and hence we can write
$$H\big(X_n \mid S, X^{n-1}\big) = \sum_{s' \in \mathcal{S}} H\big(\widetilde{P}_{s',\cdot}\big) \Pr\big\{T(S, X^{n-1}) = s'\big\}.$$
By induction, we have for any $s' \in \mathcal{S}$
$$\Pr\big\{T(S, X^n) = s'\big\} = \sum_{x \in \mathcal{X}} \sum_{s'' \in \mathcal{S}} \Pr\big\{T(S, X^{n-1}) = s''\big\} (\Gamma_x)_{s'',s'} = \sum_{s'' \in \mathcal{S}} \Pr\big\{T(S, X^{n-1}) = s''\big\} P_{s'',s'},$$
and hence
$$\Big(\Pr\big\{T(S, X^{n-1}) = s\big\}\Big)_{s \in \mathcal{S}} = \Big(\Pr\big\{T(S, X^{n-2}) = s\big\}\Big)_{s \in \mathcal{S}} P = \cdots = p\,P^{n-1} = p,$$
since $p$ is the stationary distribution. Therefore $A_n = \sum_{s' \in \mathcal{S}} H\big(\widetilde{P}_{s',\cdot}\big)\, p_{s'}$ for every $n$, which together with $B_n \to 0$ establishes the claimed entropy rate.

B. Proof for Theorem 6
Before we can prove Theorem 6, we first study synchronous compositions in more detail. Specifically, we shall show that $p_{g_c}(x)$ is independent of the choice of absorbing strongly connected component in $g_c^*(g_1 \| g_2)$. Essentially, $g_c(g_1 \| g_2)$ is equivalent (to be defined later) to $g_1$, which is key to the usage of synchronous composition in the proof of Theorem 6.

[Figs. 7 and 8 lost in extraction; each shows a machine $g$ and its composition $g_c(g_1 \| g_2)$, with stationary distributions rounded to 3 decimal places.] Fig. 7. $g_c(g_1 \| g_2)$ is strongly connected; its stationary distribution is compared with that of $g_1$. Fig. 8. $g_c(g_1 \| g_2)$ is strongly connected; its stationary distribution is compared with that of $g_1$.

Definition 5.
Let $g_1 = (\mathcal{S}, \mathcal{X}, T_1, P_1)$ and $g_2 = (\mathcal{T}, \mathcal{X}, T_2, P_2)$ be two PFSAs with the same alphabet and let $g_c(g_1 \| g_2)$ be the synchronous composition of $g_1$ and $g_2$. Suppose that the state space of $g_c(g_1 \| g_2)$ is $\mathcal{U} \subset \mathcal{S} \times \mathcal{T}$. We then define $\mathcal{T}_s = \{t \in \mathcal{T} : (s,t) \in \mathcal{U}\}$.

We provided several examples of synchronous compositions in Figs. 6 to 9. We note that, in Figs. 7 and 8, the compositions $g_c^*$ are naturally strongly connected, while those in Figs. 6 and 9 are not. For the compositions $g_c(g_1 \| g_2)$ in Figs. 6 and 9, each set $\mathcal{T}_s$ is a singleton.

Proposition 1.
Let $g_c = g_c(g_1 \| g_2)$ be any absorbing strongly connected component of $g_c^*(g_1 \| g_2)$ and let $p_{g_c}$ be its stationary distribution. Then we have $\sum_{t \in \mathcal{T}_s} (p_{g_c})_{(s,t)} = (p_{g_1})_s$.

Proof. For any fixed initial state $(s,t)$ and any sequence of symbols $x^n \in \mathcal{X}^n$, consider the sequence of states of the synchronous composition
$$(s,t),\; \big(T_1(s, x^1), T_2(t, x^1)\big),\; \ldots,\; \big(T_1(s, x^n), T_2(t, x^n)\big).$$
Let $n_{s',t'}$ be the number of indices $i = 1, \ldots, n$ such that $\big(T_1(s, x^i), T_2(t, x^i)\big) = (s', t')$. Since the associated stochastic process on states induced by $g_c$ is stationary and ergodic, we have $\frac{n_{s',t'}}{n} \to (p_{g_c})_{(s',t')}$ in probability as $n \to \infty$. Consequently,
$$\sum_{t' \in \mathcal{T}_{s'}} \frac{n_{s',t'}}{n} \to \sum_{t' \in \mathcal{T}_{s'}} (p_{g_c})_{(s',t')}.$$
Noticing that the left-hand side converges to $(p_{g_1})_{s'}$, we obtain the result. Figs. 7 and 8 provide examples of the proposition above.

Theorem 8.
Let $g_c = g_c(g_1 \| g_2)$ be any absorbing strongly connected component of $g_c^*(g_1 \| g_2)$. Then $g_c$ is equivalent to $g_1$, in the sense that $p_{g_c}(x) = p_{g_1}(x)$ for all $x \in \mathcal{X}^*$.

[Fig. 9 state diagrams lost in extraction; panels: (a) $g_1$, (b) $g_c^*(g_1 \| g_2)$, (c) strongly connected component of $g_c^*(g_1 \| g_2)$.] Fig. 9. $g_c(g_1 \| g_2)$ is the strongly connected component of $g_c^*(g_1 \| g_2)$ and it is equal to $g_1$.

Proof.
We first show that
$$\sum_{t \in \mathcal{T}_s} p_{g_c}(x \mid (s,t)) \, (p_{g_c})_{(s,t)} = p_{g_1}(x \mid s) \, (p_{g_1})_s \qquad (12)$$
by induction on the length of $x$. The base case, in which $x$ is the empty sequence, is given by Proposition 1. Now assume that (12) holds for $|x| = n$. Following the notation in Definition 5, we have for the sequence $x x_0$ obtained by appending a symbol $x_0$ to $x$:
\begin{align*}
\sum_{t \in \mathcal{T}_s} p_{g_c}(x x_0 \mid (s,t)) \, (p_{g_c})_{(s,t)} &= \sum_{t \in \mathcal{T}_s} p_{g_c}(x \mid (s,t)) \, p_{g_c}(x_0 \mid x, (s,t)) \, (p_{g_c})_{(s,t)} \\
&= \sum_{t \in \mathcal{T}_s} p_{g_c}(x \mid (s,t)) \, P_1\big(T_1(s,x), x_0\big) \, (p_{g_c})_{(s,t)} \\
&= \bigg( \sum_{t \in \mathcal{T}_s} p_{g_c}(x \mid (s,t)) \, (p_{g_c})_{(s,t)} \bigg) P_1\big(T_1(s,x), x_0\big) \\
&\overset{(a)}{=} p_{g_1}(x \mid s) \, (p_{g_1})_s \, P_1\big(T_1(s,x), x_0\big) \\
&= p_{g_1}(x x_0 \mid s) \, (p_{g_1})_s,
\end{align*}
where equality $(a)$ follows from the induction hypothesis. Now we can write
$$p_{g_c}(x) = \sum_{s \in \mathcal{S}} \sum_{t \in \mathcal{T}_s} p_{g_c}(x \mid (s,t)) \, (p_{g_c})_{(s,t)} = \sum_{s \in \mathcal{S}} p_{g_1}(x \mid s) \, (p_{g_1})_s = p_{g_1}(x),$$
from which the result follows.

Proof for Theorem 6.
We use the same notation as in Appendix VII-A. We start the proof by defining two distributions on the Cartesian product $\mathcal{S} \times \mathcal{T} \times \mathcal{X}^n$. Let $\bar{g}_1 := g_c(g_1 \| g_2)$, $\bar{g}_2 := g_c(g_2 \| g_1)$, and let $\bar{p}_1$ and $\bar{p}_2$ be the stationary distributions of $\bar{g}_1$ and $\bar{g}_2$, respectively. Here we make sure that we choose the same absorbing strongly connected component for both compositions. We notice that $\bar{g}_1$ and $\bar{g}_2$ induce two distributions $p_1$ and $p_2$ on $\mathcal{S} \times \mathcal{T} \times \mathcal{X}^n$ given by
$$p_1(s, t, x^n) = p_1(s,t) \, p_1(x^n \mid s, t) \quad \text{and} \quad p_2(s, t, x^n) = p_2(s,t) \, p_2(x^n \mid s, t),$$
where $p_1(s,t) = (\bar{p}_1)_{(s,t)}$, $p_2(s,t) = (\bar{p}_2)_{(s,t)}$, and
$$p_1(x^n \mid s, t) = p_{g_1}(x^n \mid s) = \prod_{i=1}^{n} P_1\big(T_1(s, x^{i-1}), x_i\big), \qquad p_2(x^n \mid s, t) = p_{g_2}(x^n \mid t) = \prod_{i=1}^{n} P_2\big(T_2(t, x^{i-1}), x_i\big).$$
Letting $p_1(x^n)$ (resp. $p_2(x^n)$) be the marginal of $p_1$ (resp. $p_2$) over $\mathcal{X}^n$, we can write, using the chain rule of KL divergence (see, e.g., [15, Theorem 2.5.3]),
\begin{align*}
&D_{\mathrm{KL}}\big(p_1(X^n) \,\big\|\, p_2(X^n)\big) - D_{\mathrm{KL}}\big(p_1(X^{n-1}) \,\big\|\, p_2(X^{n-1})\big) \\
&\quad = \big[D_{\mathrm{KL}}\big(p_1(S,T,X^n) \,\big\|\, p_2(S,T,X^n)\big) - D_{\mathrm{KL}}\big(p_1(S,T \mid X^n) \,\big\|\, p_2(S,T \mid X^n)\big)\big] \\
&\quad\qquad - \big[D_{\mathrm{KL}}\big(p_1(S,T,X^{n-1}) \,\big\|\, p_2(S,T,X^{n-1})\big) - D_{\mathrm{KL}}\big(p_1(S,T \mid X^{n-1}) \,\big\|\, p_2(S,T \mid X^{n-1})\big)\big] \\
&\quad = \big[D_{\mathrm{KL}}\big(p_1(S,T,X^n) \,\big\|\, p_2(S,T,X^n)\big) - D_{\mathrm{KL}}\big(p_1(S,T,X^{n-1}) \,\big\|\, p_2(S,T,X^{n-1})\big)\big] \\
&\quad\qquad - \big[D_{\mathrm{KL}}\big(p_1(S,T \mid X^n) \,\big\|\, p_2(S,T \mid X^n)\big) - D_{\mathrm{KL}}\big(p_1(S,T \mid X^{n-1}) \,\big\|\, p_2(S,T \mid X^{n-1})\big)\big] \\
&\quad = \underbrace{D_{\mathrm{KL}}\big(p_1(X_n \mid S,T,X^{n-1}) \,\big\|\, p_2(X_n \mid S,T,X^{n-1})\big)}_{=:C_n} - \underbrace{D_{\mathrm{KL}}\big(p_1(S,T \mid X^n) \,\big\|\, p_2(S,T \mid X^n)\big)}_{=:D_n} + \underbrace{D_{\mathrm{KL}}\big(p_1(S,T \mid X^{n-1}) \,\big\|\, p_2(S,T \mid X^{n-1})\big)}_{=D_{n-1}}.
\end{align*}
We first show that $C_n$ is a constant that equals the desired quantity. Notice that for a fixed initial state $(s,t) \in \mathcal{S} \times \mathcal{T}$ and a fixed sequence $x^{n-1} \in \mathcal{X}^{n-1}$ we have $T_c\big((s,t), x^{n-1}\big) = \big(T_1(s, x^{n-1}), T_2(t, x^{n-1})\big)$ and hence
$$C_n = \sum_{s', t'} D_{\mathrm{KL}}\Big( \big(\widetilde{P}_{g_1}\big)_{s',\cdot} \,\Big\|\, \big(\widetilde{P}_{g_2}\big)_{t',\cdot} \Big) \cdot p_1\big\{ T_1(S, X^{n-1}) = s', \, T_2(T, X^{n-1}) = t' \big\} = \sum_{s', t'} D_{\mathrm{KL}}\Big( \big(\widetilde{P}_{g_1}\big)_{s',\cdot} \,\Big\|\, \big(\widetilde{P}_{g_2}\big)_{t',\cdot} \Big) \cdot (\bar{p}_1)_{(s',t')}.$$
We next show that $D_n$ converges in probability and, in particular, $D_n - D_{n-1} \to 0$. For a fixed initial state $(s,t)$ and a sequence $x^n$, consider the sequence of states $s, T_1(s, x^1), T_1(s, x^2), \ldots, T_1(s, x^n)$, and let $n_{s',x} = n_{s',x}(s)$ denote the number of indices $i$ such that $T_1(s, x^{i-1}) = s'$ and $x_i = x$. We have for all $t \in \mathcal{T}_s$
$$p_1(x^n \mid s, t) = \prod_{i=1}^{n} P_1\big(T_1(s, x^{i-1}), x_i\big) = \prod_{s', x} P_1(s', x)^{n_{s',x}} = 2^{\sum_{s',x} n_{s',x} \log P_1(s',x)} = 2^{\,n \sum_{s',x} \frac{n_{s',x}}{n} \log P_1(s',x)}.$$
Since the associated stochastic process on states is stationary and ergodic, we have $\frac{n_{s',x}}{n} \to (p_{g_1})_{s'} P_1(s', x)$ in probability as $n \to \infty$, and hence $p_1(x^n \mid s, t) \to 2^{-nH(g_1)}$ in probability, independently of the initial state $s$. This implies that $\big(p_1(s,t \mid x^n)\big)_{(s,t)}$ and $\big(p_2(s,t \mid x^n)\big)_{(s,t)}$ converge in probability to the stationary distributions $\bar{p}_1$ and $\bar{p}_2$, respectively, which shows that $D_n$ converges and hence $D_n - D_{n-1} \to 0$.
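The AEP-type step at the end of this proof, namely that $p_1(x^n \mid s, t) \to 2^{-nH(g_1)}$ regardless of the initial state, can be illustrated empirically. The sketch below uses a hypothetical two-state PFSA (parameters invented for illustration, not from the paper): it samples a long path and checks that the per-symbol log-loss $-\frac{1}{n}\log_2 p(x^n \mid s)$ approaches the entropy rate $H(g)$ for either choice of initial state.

```python
import random
from math import log2

random.seed(7)

# Hypothetical two-state PFSA over alphabet {0, 1} (illustrative parameters):
# from state s, emit symbol x with probability E[s][x], then move to T[s][x].
E = {0: (0.8, 0.2), 1: (0.4, 0.6)}
T = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
p = (2 / 3, 1 / 3)  # stationary distribution of the induced state chain

def h2(q):
    """Binary entropy of a Bernoulli(q) emission."""
    return -q * log2(q) - (1 - q) * log2(1 - q)

# Entropy rate H(g) = sum_s p_s * H(emission distribution of state s).
rate = sum(ps * h2(E[s][0]) for s, ps in enumerate(p))

def sample_path(s, n):
    """Sample x^n from the PFSA started in state s."""
    xs = []
    for _ in range(n):
        x = 0 if random.random() < E[s][0] else 1
        xs.append(x)
        s = T[s][x]
    return xs

def neg_loglik_rate(s, xs):
    """-(1/n) log2 p(x^n | s): per-symbol log-loss from initial state s."""
    ll = 0.0
    for x in xs:
        ll += log2(E[s][x])
        s = T[s][x]
    return -ll / len(xs)

n = 200_000
xs = sample_path(0, n)
est_from_0 = neg_loglik_rate(0, xs)   # likelihood from the true initial state
est_from_1 = neg_loglik_rate(1, xs)   # likelihood from the wrong initial state
```

The two estimates differ only by an $O(1/n)$ correction: after the first symbol the two state trajectories merge, so the choice of initial state changes the likelihood by a single bounded factor, mirroring the claim in the proof that the limit is independent of $s$.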