Is the Chen-Sbert Divergence a Metric?
Min Chen (University of Oxford, UK; [email protected]) and Mateu Sbert (University of Girona, Spain; [email protected])
Version: 1 January, 2021
1 Introduction

Consider any n-letter alphabet Z = {z_1, z_2, ..., z_n} associated with two probability mass functions, P = {p_1, p_2, ..., p_n} and Q = {q_1, q_2, ..., q_n}. Chen and Sbert proposed a general divergence measure [1] as follows:

    D_CS(P‖Q) = (1/2) ∑_{i=1}^{n} (p_i + q_i) log_2(|p_i − q_i|^k + 1)    (1)

where k > 0 is a parameter that can amplify or suppress each pairwise difference between p_i and q_i in relation to the other pairwise differences |p_j − q_j|, ∀ j ≠ i. Here we focus on the base-2 logarithm in the context of computer science and data science; the transformation to other logarithmic bases is not difficult.

The commonly-used Kullback-Leibler divergence [4] computes first the informative quantity of the individual probabilistic values (in P and Q) associated with each letter z_i ∈ Z. It then computes the pairwise difference between the informative quantities log_2(1/p_i) and log_2(1/q_i) for each letter, and finally computes the probabilistic average of such differences, at the informative scale, across all letters z_i ∈ Z. Unlike the Kullback-Leibler divergence, D_CS(P‖Q) computes first the pairwise difference between p_i and q_i for each letter z_i ∈ Z, then the informative quantity of the pairwise difference |p_i − q_i| with a monotonic transformation g(|p_i − q_i|) = log_2(|p_i − q_i|^k + 1), and finally the probabilistic average of such informative quantities across all letters z_i ∈ Z.

In comparison with the Kullback-Leibler divergence, D_CS(P‖Q) is bounded by 0 and 1 (cf. the unbounded KL divergence) and does not suffer from any singularity condition (e.g., the KL divergence is undefined when some q_i = 0 while p_i > 0). In these respects, the properties of D_CS(P‖Q) are similar to those of the Jensen-Shannon divergence, whose square root is known to be a distance metric [2, 3, 6]. In an early report [1], Chen and Sbert found some empirical similarities between the Jensen-Shannon divergence [5] and D_CS(P‖Q) (when k = 2). It is not yet known whether D_CS(P‖Q) is a distance metric. Clearly the measure satisfies most conditions of a distance metric, including:

• identity of indiscernibles: D_CS(P‖Q) = 0 ⇐⇒ P = Q;
• symmetry or commutativity: D_CS(P‖Q) = D_CS(Q‖P);
• non-negativity or separation: D_CS(P‖Q) ≥ 0.

2 Two Postulations

The remaining question is thus about the triangle inequality condition (also referred to as the subadditivity condition). In other words, if the alphabet Z is associated with three arbitrary probability mass functions P, Q, and R, do we have the following:

• triangle inequality: D_CS(P‖R) ≤ D_CS(P‖Q) + D_CS(Q‖R)?

When setting k = 1 or k = 0.5, a simple randomised test of many triples P, Q, and R indicates such a possibility. When setting k = 2, a similar randomised test finds failures of the triangle inequality. For example, for n = 3, one randomly-found triple P, Q, R yields D_CS(P‖Q) + D_CS(Q‖R) − D_CS(P‖R) < 0, and for n = 4, another triple yields D_CS(Q‖P) + D_CS(P‖R) − D_CS(Q‖R) < 0. Failures were also found with other settings, such as k = 50.
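The following minimal Python sketch evaluates Eq. 1 directly and exhibits one such failure for k = 2; the three probability mass functions below are illustrative values chosen for this sketch, not the randomly-found triples mentioned above.

```python
import math

def d_cs(p, q, k=2.0):
    """Chen-Sbert divergence (Eq. 1), using the base-2 logarithm."""
    return 0.5 * sum((pi + qi) * math.log2(abs(pi - qi) ** k + 1)
                     for pi, qi in zip(p, q))

# Illustrative 3-letter counterexample for k = 2 (values chosen for this sketch).
P = (1.0, 0.0, 0.0)
Q = (0.5, 0.5, 0.0)
R = (0.0, 1.0, 0.0)

lhs = d_cs(P, R)               # = 1.0
rhs = d_cs(P, Q) + d_cs(Q, R)  # ≈ 0.3219 + 0.3219 = 0.6439
print(lhs, rhs, lhs <= rhs)    # 1.0 0.6439 False -> triangle inequality fails
```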
One expects to find many cases of failures with other settings of k > 1. These observations lead to the following two postulations.

Postulation 1. When 0 < k ≤ 1, D_CS(P‖R) ≤ D_CS(P‖Q) + D_CS(Q‖R).

Postulation 2. When k > 1, (D_CS(P‖R))^{1/k} ≤ (D_CS(P‖Q))^{1/k} + (D_CS(Q‖R))^{1/k}, i.e., the k-th root of D_CS satisfies the triangle inequality.
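Both postulations can be probed empirically before attempting a proof. The sketch below is an illustrative test harness (not the code behind the randomised tests mentioned in this section); it samples random probability mass functions and counts violations of the plain and k-th-rooted triangle inequalities.

```python
import math
import random

def d_cs(p, q, k):
    """Chen-Sbert divergence, Eq. 1 (base-2 logarithm)."""
    return 0.5 * sum((a + b) * math.log2(abs(a - b) ** k + 1) for a, b in zip(p, q))

def random_pmf(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def count_violations(k, use_root, trials=100_000, n=4, eps=1e-12):
    """Count random triples (P, Q, R) violating the (optionally rooted) inequality."""
    f = (lambda x: x ** (1.0 / k)) if use_root else (lambda x: x)
    violations = 0
    for _ in range(trials):
        p, q, r = random_pmf(n), random_pmf(n), random_pmf(n)
        if f(d_cs(p, r, k)) > f(d_cs(p, q, k)) + f(d_cs(q, r, k)) + eps:
            violations += 1
    return violations

print(count_violations(k=0.5, use_root=False))  # Postulation 1: expect 0
print(count_violations(k=2.0, use_root=False))  # plain form: expect some failures
print(count_violations(k=2.0, use_root=True))   # Postulation 2: expect 0
```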
3 Special Case A: D_CS (k = 1) for 2-Letter Alphabets

Consider a 2-letter alphabet Z = {z_1, z_2} and three probability mass functions P = {p, 1 − p}, Q = {q, 1 − q}, and R = {r, 1 − r}.
When k = 1, the Chen-Sbert measure is simplified as (we use the subscript "bi" to denote the 2-letter, i.e., binary, case):

    D_CS-bi(P‖Q) = D_CS-bi(Q‖P)
                 = (1/2) ( (p + q) log_2(|p − q| + 1) + (2 − p − q) log_2(|p − q| + 1) )
                 = (1/2) ( 2 log_2(|p − q| + 1) )
                 = log_2(|p − q| + 1)    (2)

since |(1 − p) − (1 − q)| = |p − q|. Similarly, we have

    D_CS-bi(Q‖R) = D_CS-bi(R‖Q) = log_2(|q − r| + 1)    (3)
    D_CS-bi(R‖P) = D_CS-bi(P‖R) = log_2(|p − r| + 1)    (4)
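As a quick sanity check of Eq. 2 (a sketch for reassurance, not part of the proof), one can compare the full measure of Eq. 1 with the simplified form on random 2-letter inputs:

```python
import math
import random

def d_cs(p, q, k=1.0):
    return 0.5 * sum((a + b) * math.log2(abs(a - b) ** k + 1) for a, b in zip(p, q))

# For k = 1 and 2-letter alphabets, Eq. 2 predicts d_cs == log2(|p - q| + 1).
for _ in range(5):
    p, q = random.random(), random.random()
    lhs = d_cs((p, 1 - p), (q, 1 - q))
    rhs = math.log2(abs(p - q) + 1)
    assert math.isclose(lhs, rhs), (lhs, rhs)
print("Eq. 2 confirmed on random samples")
```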
Lemma 1. For any real values 0 ≤ p, q, r ≤ 1, the following inequality is true:

    T(p, q, r) = log_2(|p − q| + 1) + log_2(|q − r| + 1) − log_2(|p − r| + 1) ≥ 0    (5)

Proof.
The left side of Eq. 5 can be rewritten as:

    T(p, q, r) = log_2 [ (|p − q| + 1)(|q − r| + 1) / (|p − r| + 1) ]    (6)

    = log_2 [ (a + 1)(b + 1) / (a + b + 1) ]      for cases (1) and (6)
    = log_2 [ (c + d + 1)(d + 1) / (c + 1) ]      for cases (2), (3), (4), (5)    (7)

where the six cases are defined as:

1. When p ≥ q ≥ r: we set a = p − q, b = q − r;
2. When p ≥ r ≥ q: we set c = p − r, d = r − q;
3. When q ≥ p ≥ r: we set c = p − r, d = q − p;
4. When q ≥ r ≥ p: we set c = r − p, d = q − r;
5. When r ≥ p ≥ q: we set c = r − p, d = p − q;
6. When r ≥ q ≥ p: we set a = q − p, b = r − q.

Since 0 ≤ a, b, c, d ≤ 1, both parts of Eq. 7 are non-negative. For cases (1) and (6), we have (a + 1)(b + 1) = (ab + a + b + 1) ≥ (a + b + 1). For cases (2), (3), (4), (5), we have (c + d + 1)(d + 1) ≥ (c + 1). Both fractions inside the logarithmic function in Eq. 7 are thus ≥ 1, and therefore T(p, q, r) ≥ 0. According to Eqs. 2–4 and Eq. 5, we have:

    D_CS-bi(P‖Q) + D_CS-bi(Q‖R) − D_CS-bi(P‖R) = T(p, q, r) ≥ 0

D_CS-bi(P‖Q) therefore satisfies the triangle inequality condition. □
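Lemma 1 can also be checked numerically; the following brute-force sketch evaluates T(p, q, r) on a regular grid (reassurance only, not a substitute for the proof above):

```python
import itertools
import math

def T(p, q, r):
    """T(p, q, r) from Eq. 5, with base-2 logarithms."""
    return (math.log2(abs(p - q) + 1) + math.log2(abs(q - r) + 1)
            - math.log2(abs(p - r) + 1))

grid = [i / 100 for i in range(101)]
assert all(T(p, q, r) >= -1e-12
           for p, q, r in itertools.product(grid, repeat=3))
print("T >= 0 on a 101^3 grid")  # consistent with Lemma 1
```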
Theorem 1. For any 2-letter alphabet, the Chen-Sbert divergence measure (when k = 1) is a metric.

Proof. The proof can easily be derived from Lemma 1, together with the other properties of a distance metric discussed in Section 1. □

4 Special Case B: D_CS (0 < k ≤ 1) for 2-Letter Alphabets

We can extend the above special case to any 0 < k ≤ 1.
Lemma 2. For any real values 0 ≤ p, q, r ≤ 1 and 0 < k ≤ 1, the following inequality is true:

    T(p, q, r) = log_2(|p − q|^k + 1) + log_2(|q − r|^k + 1) − log_2(|p − r|^k + 1) ≥ 0    (8)

Proof.
The left side of Eq. 8 can be rewritten as:

    T(p, q, r) = log_2 [ (|p − q|^k + 1)(|q − r|^k + 1) / (|p − r|^k + 1) ]    (9)

    = log_2 [ (a^k + 1)(b^k + 1) / ((a + b)^k + 1) ]      for cases (1) and (6)
    = log_2 [ ((c + d)^k + 1)(d^k + 1) / (c^k + 1) ]      for cases (2), (3), (4), (5)    (10)

where the six cases are the same as those in Section 3. Since 0 ≤ c, d ≤ 1, it is straightforward to conclude that ((c + d)^k + 1)(d^k + 1) ≥ (c^k + 1) for cases (2), (3), (4), and (5). Meanwhile, for cases (1) and (6), we consider two terms, X = a^k b^k + a^k + b^k and Y = (a + b)^k. If we can show that X ≥ Y, we will be able to prove the following:

    X ≥ Y ⇒ X + 1 ≥ Y + 1 ⇒ (a^k + 1)(b^k + 1) = a^k b^k + a^k + b^k + 1 ≥ (a + b)^k + 1    (11)

Suppose instead that X < Y. We would then have:

    X < Y ⇒ a^k b^k + a^k + b^k < (a + b)^k
          ⇒ (a^k b^k + a^k + b^k)^{1/k} < a + b
          ⇒ ( (ab)^{1/t} + a^{1/t} + b^{1/t} )^t < a + b    (12)

where t = 1/k > 1 since k < 1 (the case k = 1 is already covered by Lemma 1). Since 0 ≤ a, b ≤ 1 and the function x^t with t > 1 is superadditive for non-negative x, we would have:

    (ab + a + b) ≤ ( (ab)^{1/t} + a^{1/t} + b^{1/t} )^t

Based on the supposition of X < Y and Eq. 12, this would lead to the following conclusion:

    (ab + a + b) ≤ ( (ab)^{1/t} + a^{1/t} + b^{1/t} )^t < (a + b)  ⇒  ab < 0

which contradicts 0 ≤ a, b ≤ 1. Hence X < Y cannot be true. Because X ≥ Y is true, Eq. 11 shows that the fraction for cases (1) and (6) in Eq. 10 is also ≥ 1. For T(p, q, r) in Eq. 8, we can now conclude T(p, q, r) ≥ 0, and D_CS-bi (0 < k ≤ 1) satisfies the triangle inequality condition. □
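The pivotal step X ≥ Y, i.e., (a^k + 1)(b^k + 1) ≥ (a + b)^k + 1, can likewise be spot-checked numerically (a sketch over sampled a, b, and k; not a substitute for the argument above):

```python
import itertools

# Spot-check the key step of Lemma 2: (a^k + 1)(b^k + 1) >= (a + b)^k + 1
# for 0 <= a, b <= 1 and 0 < k <= 1.
grid = [i / 50 for i in range(51)]
ks = [0.1, 0.25, 0.5, 0.75, 1.0]
for k, a, b in itertools.product(ks, grid, grid):
    assert (a**k + 1) * (b**k + 1) >= (a + b)**k + 1 - 1e-12
print("X >= Y holds on all sampled points")
```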
Theorem 2. For any 2-letter alphabet, the Chen-Sbert divergence measure (with 0 < k ≤ 1) is a metric.

Proof. The proof can easily be derived from Lemma 2, together with the other properties of a distance metric discussed in Section 1. □

5 D_CS (k = 1) for n-Letter Alphabets

When k = 1, the triangle inequality in question for the Chen-Sbert divergence can be expressed as:

    D_CS(P‖Q) + D_CS(Q‖R) − D_CS(P‖R)
    = (1/2) ( ∑_{i=1}^{n} (p_i + q_i) log_2(|p_i − q_i| + 1) + ∑_{i=1}^{n} (q_i + r_i) log_2(|q_i − r_i| + 1)
              − ∑_{i=1}^{n} (p_i + r_i) log_2(|p_i − r_i| + 1) )
    = (1/2) log_2 [ ∏_{i=1}^{n} (|p_i − q_i| + 1)^(p_i + q_i) (|q_i − r_i| + 1)^(q_i + r_i)
                    / ∏_{i=1}^{n} (|p_i − r_i| + 1)^(p_i + r_i) ] ≥ 0 ?
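The second equality above is simply the identity x·log_2(y) = log_2(y^x) applied term by term; the following sketch (with randomly generated inputs chosen for illustration) confirms that the sum form and the product form agree:

```python
import math
import random

random.seed(0)

def random_pmf(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

n = 4
p, q, r = random_pmf(n), random_pmf(n), random_pmf(n)

def s(u, v):  # one summation of the sum form (Eq. 1 with k = 1, without the 1/2)
    return sum((a + b) * math.log2(abs(a - b) + 1) for a, b in zip(u, v))

def prod(u, v):  # the corresponding product of the product form
    return math.prod((abs(a - b) + 1) ** (a + b) for a, b in zip(u, v))

sum_form = 0.5 * (s(p, q) + s(q, r) - s(p, r))
prod_form = 0.5 * math.log2(prod(p, q) * prod(q, r) / prod(p, r))
assert math.isclose(sum_form, prod_form)
print(sum_form, prod_form)
```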
5.1 Unsuccessful Pathway: Examining the Term of Each Letter

This pathway attempts to find a simple proof by examining whether the individual term associated with each letter in the overall summation satisfies the triangle inequality. In other words, we ask a question: Is the following inequality always true?

    (p + q) log_2(|p − q| + 1) + (q + r) log_2(|q − r| + 1) − (p + r) log_2(|p − r| + 1)
    = log_2 [ (|p − q| + 1)^(p+q) (|q − r| + 1)^(q+r) / (|p − r| + 1)^(p+r) ] ≥ 0    (13)

The answer is no. For example, there are values of p and q combined with r = 0.85 for which Eq. 13 yields a negative value. However, for n > 1, a negative term of one letter can be compensated by the terms of the other letters. In a 2-letter alphabet, if the first letter is associated with such p, q, and r = 0.85, the second letter must be associated with 1 − p, 1 − q, and 1 − r = 0.15; Eq. 13 yields 0.0899 for the second letter, and the sum of the negative first-letter term and 0.0899 remains non-negative. Hence the failure of Eq. 13 for an individual letter does not show that D_CS (k = 1) is not a metric for n-letter alphabets (n > 1); it only shows that this pathway cannot deliver a simple proof.
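The following sketch illustrates this behaviour with values chosen for this document (not the values behind the figures 0.85, 0.15, and 0.0899 above): the term of the first letter is negative, yet the 2-letter total remains non-negative, as Lemma 1 guarantees.

```python
import math

def term(p, q, r):
    """Per-letter term of Eq. 13 (k = 1, base-2 logarithm)."""
    return ((p + q) * math.log2(abs(p - q) + 1)
            + (q + r) * math.log2(abs(q - r) + 1)
            - (p + r) * math.log2(abs(p - r) + 1))

# Illustrative values: first letter (p, q, r); in a 2-letter alphabet the
# second letter is forced to (1 - p, 1 - q, 1 - r).
p, q, r = 0.05, 0.0, 0.85
t1 = term(p, q, r)              # ≈ -0.0053 (individual term fails)
t2 = term(1 - p, 1 - q, 1 - r)  # ≈ +0.2251
print(t1, t2, t1 + t2)          # total ≈ 0.2198 >= 0, as Lemma 1 requires
```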
5.2 Unsuccessful Pathway: Combining the Terms of Two Letters

This pathway attempts to find a proof using induction by examining whether there is a positivity/negativity pattern when combining the terms associated with any two letters. In other words, let

    T(p, q, r) = (p + q) log_2(|p − q| + 1) + (q + r) log_2(|q − r| + 1) − (p + r) log_2(|p − r| + 1)

and we ask a question: Is the following inequality always true?

    T(p_a + p_b, q_a + q_b, r_a + r_b) ≤ T(p_a, q_a, r_a) + T(p_b, q_b, r_b)    (14)

where 0 ≤ p_a, p_b, q_a, q_b, r_a, r_b ≤ 1 and 0 ≤ (p_a + p_b), (q_a + q_b), (r_a + r_b) ≤ 1. The answer is again no: a randomised search readily finds sextuples p_a, p_b, q_a, q_b, r_a, r_b that violate Eq. 14, so this pathway is also unsuccessful.

5.3 A Possible Pathway: Reducing the Terms of Three Letters to the Terms of Two Letters

Using the same definition of T(p, q, r) as in the previous subsection:

    T(p, q, r) = (p + q) log_2(|p − q| + 1) + (q + r) log_2(|q − r| + 1) − (p + r) log_2(|p − r| + 1)
               = log_2 [ (|p − q| + 1)^(p+q) (|q − r| + 1)^(q+r) / (|p − r| + 1)^(p+r) ]

this pathway attempts to reduce any three terms T(p_1, q_1, r_1), T(p_2, q_2, r_2), T(p_3, q_3, r_3) associated with three letters to two terms T(p_x, q_x, r_x) and T(p_y, q_y, r_y), such that

    T(p_1, q_1, r_1) + T(p_2, q_2, r_2) + T(p_3, q_3, r_3) = T(p_x, q_x, r_x) + T(p_y, q_y, r_y)    (15)

where

    0 ≤ p_1, p_2, p_3, p_x, p_y ≤ 1,    0 ≤ p_1 + p_2 + p_3 = p_x + p_y ≤ 1
    0 ≤ q_1, q_2, q_3, q_x, q_y ≤ 1,    0 ≤ q_1 + q_2 + q_3 = q_x + q_y ≤ 1
    0 ≤ r_1, r_2, r_3, r_x, r_y ≤ 1,    0 ≤ r_1 + r_2 + r_3 = r_x + r_y ≤ 1    (16)

The search for such a reduction proceeds as follows (see the sketch after this list):

1. Initiate α = β = γ = 0.
2. Initiate p_x = p_1 + p_2 + α, q_x = q_1 + q_2 + β, r_x = r_1 + r_2 + γ.
3. Initiate p_y = p_3 − α, q_y = q_3 − β, r_y = r_3 − γ.
4. Use an optimisation algorithm to adjust α, β, and γ and to obtain optimised (p_x, p_y), (q_x, q_y), and (r_x, r_y) such that the requirements in Eqs. 15 and 16 are met.

In the examples tested, a solution of α, β, and γ could always be found; in some examples, one of the three variables admits a whole interval of valid values while the other two are fixed. If such α, β, and γ can always be found, we can use induction to prove that D_CS (when k = 1) is a metric for any n-letter alphabet (n > 2).
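Step 4 above does not prescribe a particular algorithm. The following sketch is one hypothetical implementation using scipy.optimize (an assumption of this document, not necessarily the optimiser used in the original experiments); it minimises the squared residual of Eq. 15 over α, β, γ within bounds implied by Eq. 16.

```python
import math
import numpy as np
from scipy.optimize import minimize

def T(p, q, r):
    return ((p + q) * math.log2(abs(p - q) + 1)
            + (q + r) * math.log2(abs(q - r) + 1)
            - (p + r) * math.log2(abs(p - r) + 1))

def reduce_three_to_two(p, q, r):
    """Search for (alpha, beta, gamma) satisfying Eq. 15, following the
    initialisation scheme of steps 1-3; p, q, r are 3-tuples."""
    target = T(p[0], q[0], r[0]) + T(p[1], q[1], r[1]) + T(p[2], q[2], r[2])

    def residual(v):
        a, b, g = v
        x = (p[0] + p[1] + a, q[0] + q[1] + b, r[0] + r[1] + g)
        y = (p[2] - a, q[2] - b, r[2] - g)
        return (T(*x) + T(*y) - target) ** 2

    # These bounds keep p_x, p_y (etc.) within [0, 1], as Eq. 16 requires,
    # given that p_1 + p_2 + p_3 <= 1 (and likewise for q and r).
    bounds = [(-(p[0] + p[1]), p[2]),
              (-(q[0] + q[1]), q[2]),
              (-(r[0] + r[1]), r[2])]
    res = minimize(residual, x0=np.zeros(3), bounds=bounds, method="L-BFGS-B")
    return res.x, res.fun

sol, err = reduce_three_to_two((0.1, 0.2, 0.3), (0.3, 0.1, 0.2), (0.2, 0.3, 0.1))
print(sol, err)  # err close to 0 indicates that Eq. 15 has been met
```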
6 D_CS (0 < k ≤ 1) for n-Letter Alphabets

The strategy of replacing the triangle inequality terms of three letters with equivalent terms of two letters can also be applied to the general case of D_CS (0 < k ≤ 1) for n-letter alphabets (n > 2). Initial tests with several settings of k exhibit the same behaviour as that observed for D_CS (k = 1) in Section 5.3, i.e., there is always a solution of α, β, and γ that meets the requirements of Eqs. 15 and 16. It is therefore likely that such a proof can be extended to the general case. In general, it is rather hopeful that Postulation 1 can be proved.

7 Conclusions

• For Postulation 1 in Section 2, we have managed to obtain a partial proof that D_CS(P‖Q) (0 < k ≤ 1) is a metric for 2-letter alphabets.
• Our attempts to obtain a full proof for the general case of n-letter alphabets have only found a possible but unconfirmed pathway.
• We have not yet made any solid progress in proving or falsifying Postulation 2 in Section 2.

We shall appreciate any effort by colleagues in the international scientific communities to prove or falsify the two postulations.

References

[1] M. Chen and M. Sbert. On the upper bound of the Kullback-Leibler divergence and cross entropy. arXiv:1911.08334, 2019.
[2] D. M. Endres and J. E. Schindelin. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858–1860, 2003.
[3] B. Fuglede and F. Topsøe. Jensen-Shannon divergence and Hilbert space embedding. In Proc. IEEE International Symposium on Information Theory, pages 31–36, 2004.
[4] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.
[5] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[6] F. Österreicher and I. Vajda. A new class of metric divergences on probability spaces and its statistical applications. Annals of the Institute of Statistical Mathematics, 55(3):639–653, 2003.