Is the Chen-Sbert Divergence a Metric?
Min Chen (University of Oxford, UK; [email protected]) and Mateu Sbert (University of Girona, Spain; [email protected])
Version: 1 January, 2021
1 Introduction

Consider any n-letter alphabet Z = {z_1, z_2, ..., z_n} associated with two probability mass functions, P = {p_1, p_2, ..., p_n} and Q = {q_1, q_2, ..., q_n}. Chen and Sbert proposed a general divergence measure [1] as follows:

    D_CS(P‖Q) = (1/2) ∑_{i=1}^{n} (p_i + q_i) log_2(|p_i − q_i|^k + 1)    (1)

where k > 0 is a parameter that can amplify or suppress each pairwise difference between p_i and q_i in relation to the other pairwise differences |p_j − q_j|, ∀ j ≠ i. Here we focus on the base-2 logarithm in the context of computer science and data science; the transformation to other logarithmic bases is not difficult.

The commonly-used Kullback-Leibler divergence [4] computes first the informative quantity of the individual probabilistic values (in P and Q) associated with each letter z_i ∈ Z. It then computes the pairwise difference between the informative quantities log_2(1/p_i) and log_2(1/q_i) for each letter, and finally computes the probabilistic average of such differences, at the informative scale, across all letters z_i ∈ Z. Unlike the Kullback-Leibler divergence, D_CS(P‖Q) computes first the pairwise difference between p_i and q_i for each letter z_i ∈ Z, then the informative quantity of the pairwise difference |p_i − q_i| with a monotonic transformation g(|p_i − q_i|) = log_2(|p_i − q_i|^k + 1), and finally the probabilistic average of such informative quantities across all letters z_i ∈ Z.

In comparison with the Kullback-Leibler divergence, D_CS(P‖Q) is bounded by 0 and 1 (cf. the unbounded KL divergence) and does not suffer from any singularity condition (e.g., the KL divergence is undefined when some q_i = 0 while p_i > 0). In these respects, the properties of D_CS(P‖Q) are similar to those of the Jensen-Shannon divergence, whose square root is known to be a distance metric [2, 3, 6]. In an early report [1], Chen and Sbert found some empirical similarities between the Jensen-Shannon divergence [5] and D_CS(P‖Q) (when k = 2). It is not yet known whether D_CS(P‖Q) is a distance metric. Clearly the measure satisfies most conditions of a distance metric, including:

• identity of indiscernibles: D_CS(P‖Q) = 0 ⇐⇒ P = Q;
• symmetry or commutativity: D_CS(P‖Q) = D_CS(Q‖P);
• non-negativity or separation: D_CS(P‖Q) ≥ 0.

2 Two Postulations

The remaining question is thus about the triangle inequality condition (also referred to as the subadditivity condition). In other words, if the alphabet Z is associated with three arbitrary probability mass functions P, Q, and R, do we have the following:

• triangle inequality: D_CS(P‖R) ≤ D_CS(P‖Q) + D_CS(Q‖R)?

When setting k = 1 or k = 0.5, a simple randomised test of many triples P, Q, and R indicates such a possibility. When setting k = 2, a similar randomised test finds failures of the triangle inequality. For example, for n = 3, one randomly-found triple P, Q, R yields D_CS(P‖Q) + D_CS(Q‖R) − D_CS(P‖R) < 0, and for n = 4, another triple yields D_CS(Q‖P) + D_CS(P‖R) − D_CS(Q‖R) < 0. Failures were also found with other settings, such as k = 50.
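The following minimal Python sketch evaluates Eq. 1 directly and exhibits one such failure for k = 2; the three probability mass functions below are illustrative values chosen for this sketch, not the randomly-found triples mentioned above.

```python
import math

def d_cs(p, q, k=2.0):
    """Chen-Sbert divergence (Eq. 1), using the base-2 logarithm."""
    return 0.5 * sum((pi + qi) * math.log2(abs(pi - qi) ** k + 1)
                     for pi, qi in zip(p, q))

# Illustrative 3-letter counterexample for k = 2 (values chosen for this sketch).
P = (1.0, 0.0, 0.0)
Q = (0.5, 0.5, 0.0)
R = (0.0, 1.0, 0.0)

lhs = d_cs(P, R)               # = 1.0
rhs = d_cs(P, Q) + d_cs(Q, R)  # ≈ 0.3219 + 0.3219 = 0.6439
print(lhs, rhs, lhs <= rhs)    # 1.0 0.6439 False -> triangle inequality fails
```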
One expects to find many cases of failures with other settings of k > 1. These observations lead to the following two postulations.

Postulation 1. When 0 < k ≤ 1, D_CS(P‖R) ≤ D_CS(P‖Q) + D_CS(Q‖R).

Postulation 2. When k > 1, (D_CS(P‖R))^{1/k} ≤ (D_CS(P‖Q))^{1/k} + (D_CS(Q‖R))^{1/k}, i.e., the k-th root of D_CS satisfies the triangle inequality.
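Both postulations can be probed empirically before attempting a proof. The sketch below is an illustrative test harness (not the code behind the randomised tests mentioned in this section); it samples random probability mass functions and counts violations of the plain and k-th-rooted triangle inequalities.

```python
import math
import random

def d_cs(p, q, k):
    """Chen-Sbert divergence, Eq. 1 (base-2 logarithm)."""
    return 0.5 * sum((a + b) * math.log2(abs(a - b) ** k + 1) for a, b in zip(p, q))

def random_pmf(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def count_violations(k, use_root, trials=100_000, n=4, eps=1e-12):
    """Count random triples (P, Q, R) violating the (optionally rooted) inequality."""
    f = (lambda x: x ** (1.0 / k)) if use_root else (lambda x: x)
    violations = 0
    for _ in range(trials):
        p, q, r = random_pmf(n), random_pmf(n), random_pmf(n)
        if f(d_cs(p, r, k)) > f(d_cs(p, q, k)) + f(d_cs(q, r, k)) + eps:
            violations += 1
    return violations

print(count_violations(k=0.5, use_root=False))  # Postulation 1: expect 0
print(count_violations(k=2.0, use_root=False))  # plain form: expect some failures
print(count_violations(k=2.0, use_root=True))   # Postulation 2: expect 0
```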
3 Special Case A: D_CS (k = 1) for 2-Letter Alphabets

Consider a 2-letter alphabet Z = {z_1, z_2} and three probability mass functions P = {p, 1 − p}, Q = {q, 1 − q}, and R = {r, 1 − r}.
When k = 1, the Chen-Sbert measure is simplified as (we use the subscript "bi" to denote the 2-letter, i.e., binary, case):

    D_CS-bi(P‖Q) = D_CS-bi(Q‖P)
                 = (1/2) ( (p + q) log_2(|p − q| + 1) + (2 − p − q) log_2(|p − q| + 1) )
                 = (1/2) ( 2 log_2(|p − q| + 1) )
                 = log_2(|p − q| + 1)    (2)

since |(1 − p) − (1 − q)| = |p − q|. Similarly, we have

    D_CS-bi(Q‖R) = D_CS-bi(R‖Q) = log_2(|q − r| + 1)    (3)
    D_CS-bi(R‖P) = D_CS-bi(P‖R) = log_2(|p − r| + 1)    (4)
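As a quick sanity check of Eq. 2 (a sketch for reassurance, not part of the proof), one can compare the full measure of Eq. 1 with the simplified form on random 2-letter inputs:

```python
import math
import random

def d_cs(p, q, k=1.0):
    return 0.5 * sum((a + b) * math.log2(abs(a - b) ** k + 1) for a, b in zip(p, q))

# For k = 1 and 2-letter alphabets, Eq. 2 predicts d_cs == log2(|p - q| + 1).
for _ in range(5):
    p, q = random.random(), random.random()
    lhs = d_cs((p, 1 - p), (q, 1 - q))
    rhs = math.log2(abs(p - q) + 1)
    assert math.isclose(lhs, rhs), (lhs, rhs)
print("Eq. 2 confirmed on random samples")
```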
Lemma 1. For any real values 0 ≤ p, q, r ≤ 1, the following inequality is true:

    T(p, q, r) = log_2(|p − q| + 1) + log_2(|q − r| + 1) − log_2(|p − r| + 1) ≥ 0    (5)

Proof.
The left side of Eq. 5 can be rewritten as:

    T(p, q, r) = log_2 [ (|p − q| + 1)(|q − r| + 1) / (|p − r| + 1) ]    (6)

    = log_2 [ (a + 1)(b + 1) / (a + b + 1) ]      for cases (1) and (6)
    = log_2 [ (c + d + 1)(d + 1) / (c + 1) ]      for cases (2), (3), (4), (5)    (7)

where the six cases are defined as:

1. When p ≥ q ≥ r: we set a = p − q, b = q − r;
2. When p ≥ r ≥ q: we set c = p − r, d = r − q;
3. When q ≥ p ≥ r: we set c = p − r, d = q − p;
4. When q ≥ r ≥ p: we set c = r − p, d = q − r;
5. When r ≥ p ≥ q: we set c = r − p, d = p − q;
6. When r ≥ q ≥ p: we set a = q − p, b = r − q.

Since 0 ≤ a, b, c, d ≤ 1, both parts of Eq. 7 are non-negative. For cases (1) and (6), we have (a + 1)(b + 1) = (ab + a + b + 1) ≥ (a + b + 1). For cases (2), (3), (4), (5), we have (c + d + 1)(d + 1) ≥ (c + 1). Both fractions inside the logarithmic function in Eq. 7 are thus ≥ 1, and therefore T(p, q, r) ≥ 0. According to Eqs. 2–4 and Eq. 5, we have:

    D_CS-bi(P‖Q) + D_CS-bi(Q‖R) − D_CS-bi(P‖R) = T(p, q, r) ≥ 0

D_CS-bi(P‖Q) therefore satisfies the triangle inequality condition. □
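Lemma 1 can also be checked numerically; the following brute-force sketch evaluates T(p, q, r) on a regular grid (reassurance only, not a substitute for the proof above):

```python
import itertools
import math

def T(p, q, r):
    """T(p, q, r) from Eq. 5, with base-2 logarithms."""
    return (math.log2(abs(p - q) + 1) + math.log2(abs(q - r) + 1)
            - math.log2(abs(p - r) + 1))

grid = [i / 100 for i in range(101)]
assert all(T(p, q, r) >= -1e-12
           for p, q, r in itertools.product(grid, repeat=3))
print("T >= 0 on a 101^3 grid")  # consistent with Lemma 1
```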
Theorem 1. For any 2-letter alphabet, the Chen-Sbert divergence measure (when k = 1) is a metric.

Proof. The proof can easily be derived from Lemma 1, together with the other properties of a distance metric discussed in Section 1. □

4 Special Case B: D_CS (0 < k ≤ 1) for 2-Letter Alphabets

We can extend the above special case to any 0 < k ≤ 1.
Lemma 2. For any real values 0 ≤ p, q, r ≤ 1 and 0 < k ≤ 1, the following inequality is true:

    T(p, q, r) = log_2(|p − q|^k + 1) + log_2(|q − r|^k + 1) − log_2(|p − r|^k + 1) ≥ 0    (8)

Proof.
The left side of Eq. 8 can be rewritten as:

    T(p, q, r) = log_2 [ (|p − q|^k + 1)(|q − r|^k + 1) / (|p − r|^k + 1) ]    (9)

    = log_2 [ (a^k + 1)(b^k + 1) / ((a + b)^k + 1) ]      for cases (1) and (6)
    = log_2 [ ((c + d)^k + 1)(d^k + 1) / (c^k + 1) ]      for cases (2), (3), (4), (5)    (10)

where the six cases are the same as those in Section 3. Since 0 ≤ c, d ≤ 1, it is straightforward to conclude that ((c + d)^k + 1)(d^k + 1) ≥ (c^k + 1) for cases (2), (3), (4), and (5). Meanwhile, for cases (1) and (6), we consider two terms, X = a^k b^k + a^k + b^k and Y = (a + b)^k. If we can show that X ≥ Y, we will be able to prove the following:

    X ≥ Y ⇒ X + 1 ≥ Y + 1 ⇒ (a^k + 1)(b^k + 1) = a^k b^k + a^k + b^k + 1 ≥ (a + b)^k + 1    (11)

Suppose instead that X < Y. We would then have:

    X < Y ⇒ a^k b^k + a^k + b^k < (a + b)^k
          ⇒ (a^k b^k + a^k + b^k)^{1/k} < a + b
          ⇒ ( (ab)^{1/t} + a^{1/t} + b^{1/t} )^t < a + b    (12)

where t = 1/k > 1 since k < 1 (the case k = 1 is already covered by Lemma 1). Since 0 ≤ a, b ≤ 1 and the function x^t with t > 1 is superadditive for non-negative x, we would have:

    (ab + a + b) ≤ ( (ab)^{1/t} + a^{1/t} + b^{1/t} )^t

Based on the supposition of X < Y and Eq. 12, this would lead to the following conclusion:

    (ab + a + b) ≤ ( (ab)^{1/t} + a^{1/t} + b^{1/t} )^t < (a + b)  ⇒  ab < 0

which contradicts 0 ≤ a, b ≤ 1. Hence X < Y cannot be true. Because X ≥ Y is true, Eq. 11 shows that the fraction for cases (1) and (6) in Eq. 10 is also ≥ 1. For T(p, q, r) in Eq. 8, we can now conclude T(p, q, r) ≥ 0, and D_CS-bi (0 < k ≤ 1) satisfies the triangle inequality condition. □
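The pivotal step X ≥ Y, i.e., (a^k + 1)(b^k + 1) ≥ (a + b)^k + 1, can likewise be spot-checked numerically (a sketch over sampled a, b, and k; not a substitute for the argument above):

```python
import itertools

# Spot-check the key step of Lemma 2: (a^k + 1)(b^k + 1) >= (a + b)^k + 1
# for 0 <= a, b <= 1 and 0 < k <= 1.
grid = [i / 50 for i in range(51)]
ks = [0.1, 0.25, 0.5, 0.75, 1.0]
for k, a, b in itertools.product(ks, grid, grid):
    assert (a**k + 1) * (b**k + 1) >= (a + b)**k + 1 - 1e-12
print("X >= Y holds on all sampled points")
```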
Theorem 2. For any 2-letter alphabet, the Chen-Sbert divergence measure (with 0 < k ≤ 1) is a metric.

Proof. The proof can easily be derived from Lemma 2, together with the other properties of a distance metric discussed in Section 1. □

5 D_CS (k = 1) for n-Letter Alphabets

When k = 1, the triangle inequality in question for the Chen-Sbert divergence can be expressed as:

    D_CS(P‖Q) + D_CS(Q‖R) − D_CS(P‖R)
    = (1/2) ( ∑_{i=1}^{n} (p_i + q_i) log_2(|p_i − q_i| + 1) + ∑_{i=1}^{n} (q_i + r_i) log_2(|q_i − r_i| + 1)
              − ∑_{i=1}^{n} (p_i + r_i) log_2(|p_i − r_i| + 1) )
    = (1/2) log_2 [ ∏_{i=1}^{n} (|p_i − q_i| + 1)^(p_i + q_i) (|q_i − r_i| + 1)^(q_i + r_i)
                    / ∏_{i=1}^{n} (|p_i − r_i| + 1)^(p_i + r_i) ] ≥ 0 ?
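The second equality above is simply the identity x·log_2(y) = log_2(y^x) applied term by term; the following sketch (with randomly generated inputs chosen for illustration) confirms that the sum form and the product form agree:

```python
import math
import random

random.seed(0)

def random_pmf(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

n = 4
p, q, r = random_pmf(n), random_pmf(n), random_pmf(n)

def s(u, v):  # one summation of the sum form (Eq. 1 with k = 1, without the 1/2)
    return sum((a + b) * math.log2(abs(a - b) + 1) for a, b in zip(u, v))

def prod(u, v):  # the corresponding product of the product form
    return math.prod((abs(a - b) + 1) ** (a + b) for a, b in zip(u, v))

sum_form = 0.5 * (s(p, q) + s(q, r) - s(p, r))
prod_form = 0.5 * math.log2(prod(p, q) * prod(q, r) / prod(p, r))
assert math.isclose(sum_form, prod_form)
print(sum_form, prod_form)
```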
5.1 Unsuccessful Pathway: Examining the Term of Each Letter

This pathway attempts to find a simple proof by examining whether the individual term associated with each letter in the overall summation satisfies the triangle inequality. In other words, we ask a question: Is the following inequality always true?

    (p + q) log_2(|p − q| + 1) + (q + r) log_2(|q − r| + 1) − (p + r) log_2(|p − r| + 1)
    = log_2 [ (|p − q| + 1)^(p+q) (|q − r| + 1)^(q+r) / (|p − r| + 1)^(p+r) ] ≥ 0    (13)

The answer is no. For example, there are values of p and q combined with r = 0.85 for which Eq. 13 yields a negative value. However, for n > 1, a negative term of one letter can be compensated by the terms of the other letters. In a 2-letter alphabet, if the first letter is associated with such p, q, and r = 0.85, the second letter must be associated with 1 − p, 1 − q, and 1 − r = 0.15; Eq. 13 yields 0.0899 for the second letter, and the sum of the negative first-letter term and 0.0899 remains non-negative. Hence the failure of Eq. 13 for an individual letter does not show that D_CS (k = 1) is not a metric for n-letter alphabets (n > 1); it only shows that this pathway cannot deliver a simple proof.
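The following sketch illustrates this behaviour with values chosen for this document (not the values behind the figures 0.85, 0.15, and 0.0899 above): the term of the first letter is negative, yet the 2-letter total remains non-negative, as Lemma 1 guarantees.

```python
import math

def term(p, q, r):
    """Per-letter term of Eq. 13 (k = 1, base-2 logarithm)."""
    return ((p + q) * math.log2(abs(p - q) + 1)
            + (q + r) * math.log2(abs(q - r) + 1)
            - (p + r) * math.log2(abs(p - r) + 1))

# Illustrative values: first letter (p, q, r); in a 2-letter alphabet the
# second letter is forced to (1 - p, 1 - q, 1 - r).
p, q, r = 0.05, 0.0, 0.85
t1 = term(p, q, r)              # ≈ -0.0053 (individual term fails)
t2 = term(1 - p, 1 - q, 1 - r)  # ≈ +0.2251
print(t1, t2, t1 + t2)          # total ≈ 0.2198 >= 0, as Lemma 1 requires
```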
5.2 Unsuccessful Pathway: Combining the Terms of Two Letters

This pathway attempts to find a proof using induction by examining whether there is a positivity/negativity pattern when combining the terms associated with any two letters. In other words, let

    T(p, q, r) = (p + q) log_2(|p − q| + 1) + (q + r) log_2(|q − r| + 1) − (p + r) log_2(|p − r| + 1)

and we ask a question: Is the following inequality always true?

    T(p_a + p_b, q_a + q_b, r_a + r_b) ≤ T(p_a, q_a, r_a) + T(p_b, q_b, r_b)    (14)

where 0 ≤ p_a, p_b, q_a, q_b, r_a, r_b ≤ 1 and 0 ≤ (p_a + p_b), (q_a + q_b), (r_a + r_b) ≤ 1. The answer is again no: a randomised search readily finds sextuples p_a, p_b, q_a, q_b, r_a, r_b that violate Eq. 14, so this pathway is also unsuccessful.

5.3 A Possible Pathway: Reducing the Terms of Three Letters to the Terms of Two Letters

Using the same definition of T(p, q, r) as in the previous subsection:

    T(p, q, r) = (p + q) log_2(|p − q| + 1) + (q + r) log_2(|q − r| + 1) − (p + r) log_2(|p − r| + 1)
               = log_2 [ (|p − q| + 1)^(p+q) (|q − r| + 1)^(q+r) / (|p − r| + 1)^(p+r) ]

this pathway attempts to reduce any three terms T(p_1, q_1, r_1), T(p_2, q_2, r_2), T(p_3, q_3, r_3) associated with three letters to two terms T(p_x, q_x, r_x) and T(p_y, q_y, r_y), such that

    T(p_1, q_1, r_1) + T(p_2, q_2, r_2) + T(p_3, q_3, r_3) = T(p_x, q_x, r_x) + T(p_y, q_y, r_y)    (15)

where

    0 ≤ p_1, p_2, p_3, p_x, p_y ≤ 1,    0 ≤ p_1 + p_2 + p_3 = p_x + p_y ≤ 1
    0 ≤ q_1, q_2, q_3, q_x, q_y ≤ 1,    0 ≤ q_1 + q_2 + q_3 = q_x + q_y ≤ 1
    0 ≤ r_1, r_2, r_3, r_x, r_y ≤ 1,    0 ≤ r_1 + r_2 + r_3 = r_x + r_y ≤ 1    (16)

The search for such a reduction proceeds as follows (see the sketch after this list):

1. Initiate α = β = γ = 0.
2. Initiate p_x = p_1 + p_2 + α, q_x = q_1 + q_2 + β, r_x = r_1 + r_2 + γ.
3. Initiate p_y = p_3 − α, q_y = q_3 − β, r_y = r_3 − γ.
4. Use an optimisation algorithm to adjust α, β, and γ and to obtain optimised (p_x, p_y), (q_x, q_y), and (r_x, r_y) such that the requirements in Eqs. 15 and 16 are met.

In the examples tested, a solution of α, β, and γ could always be found; in some examples, one of the three variables admits a whole interval of valid values while the other two are fixed. If such α, β, and γ can always be found, we can use induction to prove that D_CS (when k = 1) is a metric for any n-letter alphabet (n > 2).
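Step 4 above does not prescribe a particular algorithm. The following sketch is one hypothetical implementation using scipy.optimize (an assumption of this document, not necessarily the optimiser used in the original experiments); it minimises the squared residual of Eq. 15 over α, β, γ within bounds implied by Eq. 16.

```python
import math
import numpy as np
from scipy.optimize import minimize

def T(p, q, r):
    return ((p + q) * math.log2(abs(p - q) + 1)
            + (q + r) * math.log2(abs(q - r) + 1)
            - (p + r) * math.log2(abs(p - r) + 1))

def reduce_three_to_two(p, q, r):
    """Search for (alpha, beta, gamma) satisfying Eq. 15, following the
    initialisation scheme of steps 1-3; p, q, r are 3-tuples."""
    target = T(p[0], q[0], r[0]) + T(p[1], q[1], r[1]) + T(p[2], q[2], r[2])

    def residual(v):
        a, b, g = v
        x = (p[0] + p[1] + a, q[0] + q[1] + b, r[0] + r[1] + g)
        y = (p[2] - a, q[2] - b, r[2] - g)
        return (T(*x) + T(*y) - target) ** 2

    # These bounds keep p_x, p_y (etc.) within [0, 1], as Eq. 16 requires,
    # given that p_1 + p_2 + p_3 <= 1 (and likewise for q and r).
    bounds = [(-(p[0] + p[1]), p[2]),
              (-(q[0] + q[1]), q[2]),
              (-(r[0] + r[1]), r[2])]
    res = minimize(residual, x0=np.zeros(3), bounds=bounds, method="L-BFGS-B")
    return res.x, res.fun

sol, err = reduce_three_to_two((0.1, 0.2, 0.3), (0.3, 0.1, 0.2), (0.2, 0.3, 0.1))
print(sol, err)  # err close to 0 indicates that Eq. 15 has been met
```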
6 D_CS (0 < k ≤ 1) for n-Letter Alphabets

The strategy of replacing the triangle inequality terms of three letters with equivalent terms of two letters can also be applied to the general case of D_CS (0 < k ≤ 1) for n-letter alphabets (n > 2). Initial tests with several settings of k exhibit the same behaviour as that observed for D_CS (k = 1) in Section 5.3, i.e., there is always a solution of α, β, and γ that meets the requirements of Eqs. 15 and 16. It is therefore likely that such a proof can be extended to the general case. In general, it is rather hopeful that Postulation 1 can be proved.

7 Conclusions

• For Postulation 1 in Section 2, we have managed to obtain a partial proof that D_CS(P‖Q) (0 < k ≤ 1) is a metric for 2-letter alphabets.
• Our attempts to obtain a full proof for the general case of n-letter alphabets have only found a possible but unconfirmed pathway.
• We have not yet made any solid progress in proving or falsifying Postulation 2 in Section 2.

We shall appreciate any effort by colleagues in the international scientific communities to prove or falsify the two postulations.

References

[1] M. Chen and M. Sbert. On the upper bound of the Kullback-Leibler divergence and cross entropy. arXiv:1911.08334, 2019.
[2] D. M. Endres and J. E. Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860, 2003.
[3] B. Fuglede and F. Topsøe. Jensen-Shannon divergence and Hilbert space embedding. In Proc. IEEE International Symposium on Information Theory, pages 31–36, 2004.
[4] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
[5] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[6] F. Österreicher and I. Vajda. A new class of metric divergences on probability spaces and its statistical applications. Annals of the Institute of Statistical Mathematics, 55(3):639–653, 2003.