Rooted Trees with Probabilities Revisited
Georg Böcherer
Institute for Communications Engineering, Technische Universität München
Email: [email protected]
November 15, 2018
Abstract
Rooted trees with probabilities are convenient to represent a class of random processes with memory. They allow one to describe and analyze variable length codes for data compression and distribution matching. In this work, the Leaf-Average Node-Sum Interchange Theorem (LANSIT) and the well-known applications to path length and leaf entropy are re-stated. The LANSIT is then applied to informational divergence. Next, the differential LANSIT is derived, which allows one to write normalized functionals of leaf distributions as an average of functionals of branching distributions. Joint distributions of random variables and the corresponding conditional distributions are special cases of leaf distributions and branching distributions. Using the differential LANSIT, Pinsker's inequality is formulated for rooted trees with probabilities, with an application to the approximation of product distributions. In particular, it is shown that if the normalized informational divergence of a distribution and a product distribution approaches zero, then the entropy rate approaches the entropy rate of the product distribution.

Probability Notation

- Random variable X, taking values in \mathcal{X}.
- Distribution P_X: for each a \in \mathcal{X}: P_X(a) := \Pr(X = a).
- Support \operatorname{supp} P_X := \{a \in \mathcal{X} : P_X(a) > 0\}.

Rooted Trees with Probabilities [1, 2, 3]

- \mathcal{L}: set of leaves.
- L: random variable over \mathcal{L}.
- We identify \operatorname{supp} P_L \triangleq \mathcal{L}, i.e., a node is a leaf of a tree if it has no successors and is generated with non-zero probability.
- \mathcal{N}: set of all nodes on paths through the tree.
- Root 0 \in \mathcal{N}.
- \mathcal{B} = \mathcal{N} \setminus \mathcal{L}: set of branching nodes.
- \mathcal{L}_j: leaves below node j \in \mathcal{N}; j \in \mathcal{L} \Rightarrow \mathcal{L}_j = \{j\}.
- \mathcal{S}_j: successors of node j \in \mathcal{B}.
- S_j: random variable over the successors of node j \in \mathcal{B}. We identify \mathcal{S}_j \triangleq \operatorname{supp} P_{S_j}.

Node Probabilities

- We associate with each node j \in \mathcal{N} a probability

    Q_j = \sum_{i \in \mathcal{L}_j} P_L(i).    (1)

- The probabilities of the successors of node j \in \mathcal{B} are given by

    Q_i = Q_j P_{S_j}(i), \quad i \in \mathcal{S}_j.    (2)

Example

[Figure: a rooted tree with root 0, branching nodes 0, 1, 3, and leaves 2, 5, 6; each edge to a node i is labeled with the branching probability P_{S_j}(i) and each node with its probability Q_i, with Q_0 = 1.]

- \operatorname{supp} P_L = \mathcal{L} = \{2, 5, 6\}.
- \mathcal{N} = \{0, 1, 2, 3, 5, 6\}.
- \mathcal{B} = \mathcal{N} \setminus \mathcal{L} = \{0, 1, 3\}.
- \operatorname{supp} P_{S_1} = \mathcal{S}_1 = \{3\}, so P_{S_1}(3) = 1.
- \mathcal{L}_3 = \{5, 6\}.
- Q_3 = \sum_{i \in \mathcal{L}_3} P_L(i) = P_L(5) + P_L(6).
- Q_3 = Q_1 P_{S_1}(3). These relations are checked in the sketch below.
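As a minimal Python sketch of these definitions, the following snippet stores a tree of the same shape as the example as branching distributions and derives the node probabilities via (2); the dict layout and all numeric values are made-up assumptions for illustration.

    # Rooted tree with probabilities, stored as branching distributions
    # P_{S_j} (node -> {successor: probability}); the numeric values are
    # made up for illustration. Parents must precede children in `succ`.
    succ = {0: {1: 0.5, 2: 0.5},    # root 0 branches to nodes 1 and 2
            1: {3: 1.0},            # node 3 is the only successor of node 1
            3: {5: 0.25, 6: 0.75}}  # node 3 branches to leaves 5 and 6

    Q = {0: 1.0}                    # the root has probability Q_0 = 1
    for j, dist in succ.items():    # eq. (2): Q_i = Q_j * P_{S_j}(i)
        for i, p in dist.items():
            Q[i] = Q[j] * p

    leaves = [i for i in Q if i not in succ]     # nodes without successors
    P_L = {i: Q[i] for i in leaves}              # leaf distribution
    assert abs(sum(P_L.values()) - 1.0) < 1e-12  # complete tree: sums to 1
    assert abs(Q[3] - (Q[5] + Q[6])) < 1e-12     # eq. (1) at node 3

The later sketches reuse this dict-of-dicts layout.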
Leaf-Average Node-Sum Interchange Theorem (LANSIT)

LANSIT [1, Theorem 1]

- Let f be a function that assigns to each node j \in \mathcal{N} a real value f(j).
- For each j \in \mathcal{N} \setminus \{0\}, define \Delta f(j) := f(j) - f(\text{predecessor of } j).

Proposition 1 (LANSIT).

    E[f(L)] - f(0) = \sum_{j \in \mathcal{B}} Q_j E[\Delta f(S_j)].    (3)

Proof of LANSIT

- Consider a tree with leaves \mathcal{L}.
- Let \mathcal{S}_j \subseteq \mathcal{L} be a set of leaves with a common predecessor j. Then

    \sum_{i \in \mathcal{S}_j} P_L(i) f(i) \overset{(a)}{=} \sum_{i \in \mathcal{S}_j} Q_j P_{S_j}(i) [f(i) - f(j) + f(j)]    (4)
    = Q_j f(j) \underbrace{\sum_{i \in \mathcal{S}_j} P_{S_j}(i)}_{=1} + Q_j \sum_{i \in \mathcal{S}_j} P_{S_j}(i) \Delta f(i)    (5)
    = Q_j f(j) + Q_j E[\Delta f(S_j)]    (6)

  where (a) follows from (2).
- \mathcal{L} \leftarrow \{j\} \cup (\mathcal{L} \setminus \mathcal{S}_j) is a new tree with a reduced number of leaves and P_L(j) = Q_j.
- Repeat the procedure until j is the root node 0. Then Q_j = 1 and Q_j f(j) = f(0). □
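A quick numeric check of Prop. 1, again with made-up probabilities and an arbitrary node function; both sides of (3) are evaluated directly.

    import math

    # Check of the LANSIT (Prop. 1): E[f(L)] - f(0) = sum_j Q_j E[delta_f(S_j)].
    # Tree layout and probabilities are made up for illustration.
    succ = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 3: {5: 0.25, 6: 0.75}}
    Q = {0: 1.0}
    for j, d in succ.items():
        for i, p in d.items():
            Q[i] = Q[j] * p
    leaves = [i for i in Q if i not in succ]

    f = {j: math.sin(j) for j in Q}  # any real-valued node function works

    lhs = sum(Q[i] * f[i] for i in leaves) - f[0]          # E[f(L)] - f(0)
    rhs = sum(Q[j] * sum(p * (f[i] - f[j]) for i, p in succ[j].items())
              for j in succ)                               # eq. (3)
    assert abs(lhs - rhs) < 1e-12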
Path Length Lemma [3, Lemma 2.1]

- Function w(j) := length of the path from the root to node j.
- For each j \in \mathcal{N} \setminus \{0\}: \Delta w(j) = 1.
- w(0) = 0.

Proposition 2 (Path Length Lemma).

    E[w(L)] = \sum_{j \in \mathcal{B}} Q_j.    (7)
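A check of Prop. 2 on the same illustrative tree: the expected leaf depth equals the sum of the branching-node probabilities.

    # Check of the path length lemma (Prop. 2); made-up probabilities.
    succ = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 3: {5: 0.25, 6: 0.75}}
    Q, w = {0: 1.0}, {0: 0}
    for j, d in succ.items():
        for i, p in d.items():
            Q[i], w[i] = Q[j] * p, w[j] + 1  # each edge adds 1 to the path length
    leaves = [i for i in Q if i not in succ]

    E_w = sum(Q[i] * w[i] for i in leaves)             # E[w(L)]
    assert abs(E_w - sum(Q[j] for j in succ)) < 1e-12  # eq. (7)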
Leaf Entropy Lemma [3, Lemma 2.2]
Function f(i) = -\log Q_i.

Proposition 3 (Leaf Entropy Lemma).

    H(P_L) = \sum_{j \in \mathcal{B}} Q_j H(P_{S_j}).    (8)

Proof.

    H(P_L) = E[-\log P_L(L)] = E[f(L)] \overset{(a)}{=} \sum_{j \in \mathcal{B}} Q_j E[\Delta f(S_j)]    (9)
    = \sum_{j \in \mathcal{B}} Q_j E\left[-\log \frac{Q_{S_j}}{Q_j}\right]    (10)
    \overset{(b)}{=} \sum_{j \in \mathcal{B}} Q_j E[-\log P_{S_j}(S_j)]    (11)
    = \sum_{j \in \mathcal{B}} Q_j H(P_{S_j})    (12)

where (a) follows by the LANSIT and (b) by (2). □
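The same kind of numeric check for Prop. 3 (entropies in bits; made-up probabilities):

    import math

    # Check of the leaf entropy lemma (Prop. 3): H(P_L) = sum_j Q_j H(P_{S_j}).
    succ = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 3: {5: 0.25, 6: 0.75}}
    Q = {0: 1.0}
    for j, d in succ.items():
        for i, p in d.items():
            Q[i] = Q[j] * p
    leaves = [i for i in Q if i not in succ]

    def H(probs):  # entropy in bits, with 0 log 0 = 0
        return -sum(p * math.log2(p) for p in probs if p > 0)

    lhs = H([Q[i] for i in leaves])                           # H(P_L)
    rhs = sum(Q[j] * H(d.values()) for j, d in succ.items())  # eq. (8)
    assert abs(lhs - rhs) < 1e-12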
Informational Divergence
Function f(i) = \log \frac{Q_i}{Q'_i}.

Proposition 4.

    D(P_L \| P_{L'}) = \sum_{j \in \mathcal{B}} Q_j D(P_{S_j} \| P_{S'_j}).    (13)

Proof.

    D(P_L \| P_{L'}) = E\left[\log \frac{P_L(L)}{P_{L'}(L)}\right] \overset{(a)}{=} \sum_{j \in \mathcal{B}} Q_j E[\Delta f(S_j)]    (14)
    = \sum_{j \in \mathcal{B}} Q_j E\left[\log \frac{Q_{S_j}}{Q_j} \frac{Q'_j}{Q'_{S_j}}\right]    (15)
    \overset{(b)}{=} \sum_{j \in \mathcal{B}} Q_j E\left[\log \frac{P_{S_j}(S_j)}{P_{S'_j}(S_j)}\right]    (16)
    = \sum_{j \in \mathcal{B}} Q_j D(P_{S_j} \| P_{S'_j})    (17)

where (a) follows from the LANSIT and (b) by (2). □
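A check of Prop. 4 with two trees of the same shape but different (made-up) branching distributions:

    import math

    # Check of Prop. 4: D(P_L || P_L') decomposes over the branching nodes.
    succ  = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 3: {5: 0.25, 6: 0.75}}
    succ2 = {0: {1: 0.4, 2: 0.6}, 1: {3: 1.0}, 3: {5: 0.5,  6: 0.5}}

    def node_probs(s):
        Q = {0: 1.0}
        for j, d in s.items():
            for i, p in d.items():
                Q[i] = Q[j] * p
        return Q

    def D(P, R):  # informational divergence in bits
        return sum(p * math.log2(p / r) for p, r in zip(P, R) if p > 0)

    Q, Q2 = node_probs(succ), node_probs(succ2)
    leaves = [i for i in Q if i not in succ]
    lhs = D([Q[i] for i in leaves], [Q2[i] for i in leaves])
    rhs = sum(Q[j] * D(list(succ[j].values()), list(succ2[j].values()))
              for j in succ)                                   # eq. (13)
    assert abs(lhs - rhs) < 1e-12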
Random Vectors
Remark. If all paths in a tree have the same length n, then P_L can be thought of as a joint distribution P_{X^n} of a random vector X^n = (X_1, \ldots, X_n). In this case, Prop. 3 and Prop. 4 are the chain rules for entropy and informational divergence, respectively.
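To make this identification concrete, a sketch of a complete depth-2 binary tree (made-up joint distribution) where Prop. 3 reduces to the entropy chain rule:

    import math

    # Depth-2 tree: the root branching is P_{X1}, the second-level branchings
    # are P_{X2|X1=a}; the leaf distribution is the joint P_{X1 X2}.
    P_X1 = {0: 0.4, 1: 0.6}
    P_X2_given = {0: {0: 0.9, 1: 0.1},
                  1: {0: 0.3, 1: 0.7}}

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    joint = [P_X1[a] * P_X2_given[a][b] for a in (0, 1) for b in (0, 1)]
    chain = H(P_X1.values()) + sum(P_X1[a] * H(P_X2_given[a].values())
                                   for a in (0, 1))  # Prop. 3 = chain rule
    assert abs(H(joint) - chain) < 1e-12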
Differential LANSIT

- B: random variable over the branching nodes \mathcal{B}.
- Define

    P_B(j) = \frac{Q_j}{E[w(L)]}, \quad j \in \mathcal{B}.    (18)

- By the path length lemma,

    \sum_{j \in \mathcal{B}} P_B(j) = \frac{\sum_{j \in \mathcal{B}} Q_j}{E[w(L)]} = 1,    (19)

  so P_B defines a distribution over \mathcal{B}.

Proposition 5 (Differential LANSIT).

    \frac{E[f(L)] - f(0)}{E[w(L)]} = E[\Delta f(S_B)].    (20)

Note that the expectation on the right-hand side is taken over both B \sim P_B and S_B \sim P_{S_B}.
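A check of eq. (19) and Prop. 5 on the illustrative tree (made-up values, arbitrary node function):

    import math

    # P_B from eq. (18) is a distribution, and eq. (20) holds numerically.
    succ = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 3: {5: 0.25, 6: 0.75}}
    Q, w = {0: 1.0}, {0: 0}
    for j, d in succ.items():
        for i, p in d.items():
            Q[i], w[i] = Q[j] * p, w[j] + 1
    leaves = [i for i in Q if i not in succ]
    E_w = sum(Q[i] * w[i] for i in leaves)

    P_B = {j: Q[j] / E_w for j in succ}          # eq. (18)
    assert abs(sum(P_B.values()) - 1.0) < 1e-12  # eq. (19)

    f = {j: math.cos(j) for j in Q}              # any node function
    lhs = (sum(Q[i] * f[i] for i in leaves) - f[0]) / E_w
    rhs = sum(P_B[j] * sum(p * (f[i] - f[j]) for i, p in succ[j].items())
              for j in succ)                     # eq. (20)
    assert abs(lhs - rhs) < 1e-12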
Example
- Consider the path length function w.
- By the differential LANSIT,

    E[\Delta w(S_B)] = \frac{E[w(L)] - w(0)}{E[w(L)]} = \frac{E[w(L)]}{E[w(L)]} = 1.    (21)
Entropy Rate
Function f(i) = -\log Q_i.

Proposition 6.

    \frac{H(P_L)}{E[w(L)]} = E[H(P_{S_B})].    (22)

Proof.

    \frac{H(P_L)}{E[w(L)]} = \frac{E[-\log P_L(L)]}{E[w(L)]} \overset{(a)}{=} E[\Delta f(S_B)]    (23)
    = E\left[-\log \frac{Q_{S_B}}{Q_B}\right]    (24)
    \overset{(b)}{=} E[-\log P_{S_B}(S_B)]    (25)
    = E[H(P_{S_B})]    (26)

where (a) follows by the differential LANSIT and (b) by (2). □
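A numeric check of Prop. 6 (entropies in bits, made-up values):

    import math

    # Entropy rate: H(P_L) / E[w(L)] equals the P_B-average branching entropy.
    succ = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 3: {5: 0.25, 6: 0.75}}
    Q, w = {0: 1.0}, {0: 0}
    for j, d in succ.items():
        for i, p in d.items():
            Q[i], w[i] = Q[j] * p, w[j] + 1
    leaves = [i for i in Q if i not in succ]
    E_w = sum(Q[i] * w[i] for i in leaves)

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    lhs = H([Q[i] for i in leaves]) / E_w
    rhs = sum((Q[j] / E_w) * H(succ[j].values()) for j in succ)  # eq. (22)
    assert abs(lhs - rhs) < 1e-12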
Normalized Informational Divergence
Function f(i) = \log \frac{Q_i}{Q'_i}.

Proposition 7.

    \frac{D(P_L \| P_{L'})}{E[w(L)]} = E[D(P_{S_B} \| P_{S'_B})].    (27)

Proof.

    \frac{D(P_L \| P_{L'})}{E[w(L)]} \overset{(a)}{=} E\left[\log \frac{Q_{S_B}}{Q_B} \frac{Q'_B}{Q'_{S_B}}\right]    (28)
    \overset{(b)}{=} E\left[\log \frac{P_{S_B}(S_B)}{P_{S'_B}(S_B)}\right]    (29)
    = E[D(P_{S_B} \| P_{S'_B})]    (30)

where (a) follows by the differential LANSIT and (b) by (2). □
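Analogously, a check of Prop. 7 with two trees of the same shape (made-up values):

    import math

    # Normalized divergence equals the P_B-average branching divergence.
    succ  = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 3: {5: 0.25, 6: 0.75}}
    succ2 = {0: {1: 0.4, 2: 0.6}, 1: {3: 1.0}, 3: {5: 0.5,  6: 0.5}}

    def node_probs(s):
        Q = {0: 1.0}
        for j, d in s.items():
            for i, p in d.items():
                Q[i] = Q[j] * p
        return Q

    def D(P, R):
        return sum(p * math.log2(p / r) for p, r in zip(P, R) if p > 0)

    Q, Q2 = node_probs(succ), node_probs(succ2)
    w = {0: 0}
    for j, d in succ.items():
        for i in d:
            w[i] = w[j] + 1
    leaves = [i for i in Q if i not in succ]
    E_w = sum(Q[i] * w[i] for i in leaves)

    lhs = D([Q[i] for i in leaves], [Q2[i] for i in leaves]) / E_w
    rhs = sum((Q[j] / E_w) * D(list(succ[j].values()), list(succ2[j].values()))
              for j in succ)                                   # eq. (27)
    assert abs(lhs - rhs) < 1e-12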
Pinsker's Inequality for Trees

Variational Distance

Let P_X, P_Y be two distributions on \mathcal{X}. The variational distance d(P_X, P_Y) is

    d(P_X, P_Y) := \sum_{a \in \mathcal{X}} |P_X(a) - P_Y(a)|.    (31)

Bounds:

    d(P_X, P_Y) \geq 0, with equality iff \forall a \in \mathcal{X}: P_X(a) = P_Y(a);    (32)
    d(P_X, P_Y) \leq 2, with equality iff \operatorname{supp} P_X \cap \operatorname{supp} P_Y = \emptyset.    (33)
Approximating Distributions
Set of distributions over \mathcal{X}: \mathcal{P}_{\mathcal{X}}.

Proposition 8.

i. Pinsker's inequality:

    D(P_X \| P_Y) \geq \frac{1}{2 \ln 2} d(P_X, P_Y)^2.    (34)

ii. Let \{P_{X_k}\}_{k=1}^{\infty} be a sequence of distributions in \mathcal{P}_{\mathcal{X}}. Then

    D(P_{X_k} \| P_Y) \xrightarrow{k \to \infty} 0 \;\Rightarrow\; d(P_{X_k}, P_Y) \xrightarrow{k \to \infty} 0.    (35)

iii. Let g be a function on \mathcal{P}_{\mathcal{X}} that is continuous in P_Y. Then

    D(P_{X_k} \| P_Y) \xrightarrow{k \to \infty} 0 \;\Rightarrow\; \left| g(P_{X_k}) - g(P_Y) \right| \xrightarrow{k \to \infty} 0.    (36)
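A numeric illustration of Pinsker's inequality (34) for two made-up distributions:

    import math

    # D(P || R) >= d(P, R)^2 / (2 ln 2), with D in bits and d the variational
    # distance from eq. (31); the distributions are made up for illustration.
    P = [0.7, 0.2, 0.1]
    R = [0.5, 0.3, 0.2]

    d = sum(abs(p - r) for p, r in zip(P, R))                     # eq. (31)
    D = sum(p * math.log2(p / r) for p, r in zip(P, R) if p > 0)
    assert D >= d * d / (2 * math.log(2))                         # eq. (34)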
Example: Entropy
By [4, Lemma 2.7], entropy is continuous in any distribution P_Y \in \mathcal{P}_{\mathcal{X}}. Thus

    D(P_{X_k} \| P_Y) \xrightarrow{k \to \infty} 0 \;\Rightarrow\; |H(P_{X_k}) - H(P_Y)| \xrightarrow{k \to \infty} 0.    (37)
Product Distributions
- Consider a tree and let P_{S^*} be a branching distribution. Assign P_{S_j} = P_{S^*} for all branching nodes j \in \mathcal{B}.(*) We call the resulting node probabilities the product distribution P^+_{S^*}.
- For any complete tree with leaves \mathcal{L}, P^+_{S^*} defines a leaf distribution, i.e., \sum_{i \in \mathcal{L}} P^+_{S^*}(i) = 1.
- For any (possibly non-complete) tree with leaves \mathcal{L}, we define the informational divergence between the leaf distribution P_L and P^+_{S^*} as

    D(P_L \| P^+_{S^*}) := \sum_{i \in \mathcal{L}} P_L(i) \log \frac{P_L(i)}{P^+_{S^*}(i)}.    (38)

(*) This is a slight abuse of notation, since for j \neq i, \mathcal{S}_j \neq \mathcal{S}_i. However, we can think of P_{S_j} as a distribution over branch labels. For example, for a binary tree, P_{S_j} is then a distribution over the labels \{0, 1\} for all j \in \mathcal{B}, and the assignment P_{S_j} = P_{S^*} is meaningful.
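A sketch of the product distribution on a tree, following the branch-label convention of the footnote; the tree shape and all numeric values are made-up assumptions:

    import math

    # Every branching node reuses one branching distribution P_{S*} over the
    # branch labels {0, 1}; P^+_{S*}(i) is the product of the label
    # probabilities along the path from the root to node i.
    P_star = {0: 0.3, 1: 0.7}
    succ = {0: {0: 1, 1: 2}, 1: {0: 3, 1: 4}, 3: {0: 5, 1: 6}}  # label -> child

    P_plus = {0: 1.0}
    for j, kids in succ.items():
        for label, i in kids.items():
            P_plus[i] = P_plus[j] * P_star[label]

    # an arbitrary made-up leaf distribution on this tree (leaves 2, 4, 5, 6)
    P_L = {2: 0.5, 4: 0.1, 5: 0.1, 6: 0.3}
    div = sum(p * math.log2(p / P_plus[i]) for i, p in P_L.items())  # eq. (38)
    print(f"D(P_L || P+_{{S*}}) = {div:.4f} bits")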
Approximating Distributions on Trees
Proposition 9.

i. Pinsker's inequality for trees:

    \frac{D(P_L \| P_{L'})}{E[w(L)]} \geq \frac{1}{2 \ln 2} E[d(P_{S_B}, P_{S'_B})]^2.    (39)

ii. For any \epsilon > 0,

    \frac{D(P_L \| P_{L'})}{E[w(L)]} \xrightarrow{|\mathcal{L}| \to \infty} 0 \;\Rightarrow\; \Pr[d(P_{S_B}, P_{S'_B}) \geq \epsilon] \xrightarrow{|\mathcal{L}| \to \infty} 0.    (40)

iii. Let P_{S^*} be a branching distribution and let g be a function on \mathcal{P}_{\mathcal{S}} that is bounded and continuous in P_{S^*}. Then

    \frac{D(P_L \| P^+_{S^*})}{E[w(L)]} \xrightarrow{|\mathcal{L}| \to \infty} 0 \;\Rightarrow\; \left| E[g(P_{S_B})] - g(P_{S^*}) \right| \xrightarrow{|\mathcal{L}| \to \infty} 0.    (41)

Proof. See the proofs of parts i.–iii. below. □
Entropy Rate
By Prop. 6,

    \frac{H(P_L)}{E[w(L)]} = E[H(P_{S_B})].    (42)

H is continuous and bounded. Thus, by Prop. 9iii., we have the following proposition.

Proposition 10.

    \frac{D(P_L \| P^+_{S^*})}{E[w(L)]} \xrightarrow{|\mathcal{L}| \to \infty} 0 \;\Rightarrow\; \left| \frac{H(P_L)}{E[w(L)]} - H(P_{S^*}) \right| \xrightarrow{|\mathcal{L}| \to \infty} 0.    (43)
Random Vectors
Remark. (See also the earlier remark on random vectors.) If all paths in a tree have the same length n, then P_L can be thought of as a joint distribution P_{X^n} of a random vector X^n = (X_1, \ldots, X_n). The (tree) product distribution P^+_{S^*} is then the conventional product distribution P^n_{S^*}. Prop. 9 applies, and in particular, Prop. 10 becomes

    \frac{D(P_{X^n} \| P^n_{S^*})}{n} \xrightarrow{n \to \infty} 0 \;\Rightarrow\; \left| \frac{H(P_{X^n})}{n} - H(P_{S^*}) \right| \xrightarrow{n \to \infty} 0.    (44)
Proof of Prop. 9i.

    \frac{D(P_L \| P_{L'})}{E[w(L)]} \overset{(a)}{=} E[D(P_{S_B} \| P_{S'_B})]    (45)
    \overset{(b)}{\geq} \frac{1}{2 \ln 2} E[d(P_{S_B}, P_{S'_B})]^2    (46)

where (a) follows by Prop. 7 and where (b) follows by Pinsker's inequality (34) applied inside the expectation, together with Jensen's inequality E[d^2] \geq E[d]^2. □
Proof of Prop. 9ii.
Suppose E[d(P_{S_B}, P_{S'_B})] < \epsilon^2 for some \epsilon > 0. Then

    \Pr[d(P_{S_B}, P_{S'_B}) \geq \epsilon] \overset{(a)}{\leq} \frac{E[d(P_{S_B}, P_{S'_B})]}{\epsilon}    (47)
    \leq \frac{\epsilon^2}{\epsilon}    (48)
    = \epsilon    (49)

where (a) follows by Markov's inequality [3, Theo. A.2]. Together with statement i., statement ii. follows. □
Proof of Prop. 9iii. (1)
By assumption, g is bounded and continuous in P_{S^*}. By boundedness, there exists a value g_{\max} < \infty such that

    \forall j \in \mathcal{B}: |g(P_{S_j}) - g(P_{S^*})| \leq g_{\max}.    (50)

By continuity, we know that

    \forall \delta > 0 \; \exists \epsilon_\delta: \forall \epsilon' < \epsilon_\delta: d(P_{S_j}, P_{S^*}) < \epsilon' \Rightarrow |g(P_{S_j}) - g(P_{S^*})| < \delta.    (51)

Define

    \epsilon = \min\{\epsilon_\delta, \delta\}.    (52)
Proof of Prop. 9iii. (2)
Suppose E[d(P_{S_B}, P_{S^*})] < \epsilon^2. We write

    |E[g(P_{S_B})] - g(P_{S^*})| = \left| \sum_{j \in \mathcal{B}} P_B(j) [g(P_{S_j}) - g(P_{S^*})] \right| \leq \sum_{j \in \mathcal{B}} P_B(j) |g(P_{S_j}) - g(P_{S^*})|    (53)
    = \sum_{j: d(P_{S_j}, P_{S^*}) < \epsilon} P_B(j) |g(P_{S_j}) - g(P_{S^*})| + \sum_{j: d(P_{S_j}, P_{S^*}) \geq \epsilon} P_B(j) |g(P_{S_j}) - g(P_{S^*})|.    (54)

We next bound the two sums in (54).
Proof of Prop. 9iii. (3)
The first sum in (54) is bounded as

    \sum_{j: d(P_{S_j}, P_{S^*}) < \epsilon} P_B(j) |g(P_{S_j}) - g(P_{S^*})| \overset{(a)}{\leq} \sum_{j: d(P_{S_j}, P_{S^*}) < \epsilon} P_B(j) \, \delta \leq \delta    (55)

where (a) follows by (51) and (52).
Proof of Prop. 9iii. (4)
The second sum in (54) is bounded as

    \sum_{j: d(P_{S_j}, P_{S^*}) \geq \epsilon} P_B(j) |g(P_{S_j}) - g(P_{S^*})| \overset{(a)}{\leq} \sum_{j: d(P_{S_j}, P_{S^*}) \geq \epsilon} P_B(j) \, g_{\max} \overset{(b)}{\leq} \epsilon g_{\max} \overset{(c)}{\leq} \delta g_{\max}    (56)

where (a) follows by (50), where (b) follows by our assumption E[d(P_{S_B}, P_{S^*})] < \epsilon^2 and the Markov-inequality argument in the proof of Prop. 9ii., and where (c) follows by (52).
Proof of Prop. 9iii. (5)

Using (55) and (56) in (54), we get

    |E[g(P_{S_B})] - g(P_{S^*})| \leq \delta + \delta g_{\max} = \delta (1 + g_{\max}).    (57)

For \delta \to 0, the error bound on the right-hand side goes to zero, which proves part iii. of Prop. 9. □
References

[1] R. A. Rueppel and J. L. Massey, "Leaf-average node-sum interchanges in rooted trees with applications," in Communications and Cryptography: Two Sides of One Tapestry. Kluwer Academic Publishers, 1994.

[2] J. L. Massey, "Applied digital information theory I," lecture notes, ETH Zürich.

[3] G. Böcherer, "Capacity-achieving probabilistic shaping for noisy and noiseless channels," Ph.D. dissertation, RWTH Aachen University, 2012.

[4] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge University Press, 2011.