A generalization of the Kullback-Leibler divergence and its properties
Takuya Yamano∗
Department of Physics, Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan
∗ Email: [email protected]
A generalized Kullback-Leibler relative entropy is introduced starting with the symmetric Jackson derivative of the generalized overlap between two probability distributions. The generalization retains much of the structure possessed by the original formulation. We present the fundamental properties, including positivity, metricity, concavity, bounds and stability. In addition, a connection to shift information and the behavior under Liouville dynamics are discussed.
PACS numbers: 05.90.+m, 89.20.-a, 89.70.-a, 89.70.Cf
Keywords: Information distances; stability; perturbation; overlaps.
I. INTRODUCTION
The relative entropy, or information divergence, is a measure of the extent to which an assumed probability distribution deviates from the true one. We need a means of comparing two different probability distributions and will define a distance as a fundamental quantity that discriminates between them. The form of the relative entropy first introduced by Kullback and Leibler [1] is the most pervasive measure in information theory [2] and statistical mechanics [3]. Its most prominent properties are its asymmetry between the two distributions (i.e., under their interchange) and the fact that it does not satisfy the triangle inequality [4]. Recently, parametrized entropies have gained a great deal of attention in the physics and information literature [5], in an effort to gain a deeper understanding of the structure of equilibrium statistical mechanics and an improved perspective on information theory. Due to the close connection between entropy and relative entropy, a generalized Kullback-Leibler (KL) relative entropy was presented, whose form is in conformity with a generalized entropy [6, 7]. In terms of the information gain, the generalization of the KL relative entropy has led to the adoption of a generalized information content.

The purpose of this paper is to introduce another extended KL divergence and to investigate its fundamental properties. Our construction of the new divergence measure is transparent, and the origin of the form of the existing generalization [6, 7] can be explained once we realize that the seed quantity is a generalized overlap of the two distributions. The fundamental properties of the newly introduced generalization of the KL divergence treated in this paper include positivity, metricity, form invariance, concavity, upper and lower bounds, and stability. In addition, the quantum no-cloning theorem [8] has recently been shown to possess a classical counterpart, where universal perfect cloning machines are incompatible with the conservation of a distance measure under the Liouville dynamics governing the evolution of the statistical ensemble. This fact was first shown through the ordinary KL divergence [9] and was afterward extended to the Csiszár f-divergence [10]. Furthermore, it was shown that this fact also applies to a non-f-divergence type [11]. Since our generalized measure is a particular instance of the f-divergence, we shall specifically show that it also exhibits constancy under this linear evolution dynamics.

The organization of the paper is as follows. In section II, we first recapitulate the basic properties of two different quantities: a distance in terms of KL, and the overlap of the distributions. We examine a specific stability property in order to clarify the difference between KL and the overlap. This property will be investigated for our measure in the subsequent section. In section III, we present our generalization by way of the symmetric Jackson derivative. Some basic properties are addressed in section IV. We summarize our results in the final section.
II. DISTANCE AND OVERLAP
The discrimination between two different probability functions is important in physics and information theory. To gain more insight into how we can measure the difference, we first consider the relationship between the KL distance and the overlap, which is also called the fidelity. Suppose that n statistically independent subsystems constitute a system, so that the joint probability distribution can be written in the factorized form P_m = P_m^{(1)} \cdots P_m^{(n)}, where m = 1, 2. The KL distance K(P_1, P_2) between P_1 and P_2, defined on continuous support with d\mathbf{x} = dx^{(1)} \cdots dx^{(n)}, is

K(P_1, P_2) = \int d\mathbf{x}\, P_1 \ln\frac{P_1}{P_2} = \int d\mathbf{x}\, P_1^{(1)} \cdots P_1^{(n)} \left[ \ln\frac{P_1^{(1)}}{P_2^{(1)}} + \cdots + \ln\frac{P_1^{(n)}}{P_2^{(n)}} \right] = \sum_{j=1}^{n} K(P_1^{(j)}, P_2^{(j)}).   (1)

On the other hand, the overlap O(P_1, P_2) between P_1 and P_2 is comprised of the overlaps of the subsystems and is expressed as

O(P_1, P_2) = \int d\mathbf{x}\, \sqrt{P_1 P_2} = \int d\mathbf{x}\, \sqrt{P_1^{(1)} \cdots P_1^{(n)} \cdot P_2^{(1)} \cdots P_2^{(n)}} = \prod_{j=1}^{n} O(P_1^{(j)}, P_2^{(j)}).   (2)

The KL distance is a sum of the distances of the independent component systems (decomposability property), while the overlap for the total system is a product of the overlaps of the subsystems. We note that, as a measure closely related to the overlap, the statistical distance introduced in [12] is based on the number of distinguishable states between two probabilities and can be given as an inverse cosine of the overlap, cos^{-1} O(P_1, P_2). We used the continuous form in the above; however, subtleties exist between the continuous and discrete forms of entropies, as clearly stated in [13]. We shall use both forms depending on the ease of presentation in the rest of this paper.
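The decomposability of K and the multiplicativity of O are easy to check numerically. The following minimal Python sketch (our own illustration, not part of the original derivation; all function names are ours) verifies Eqs. (1) and (2) for a system of two independent subsystems.

```python
# Illustrative check of Eqs. (1) and (2): for factorized (independent) joint
# distributions, the KL distance adds over subsystems, the overlap multiplies.
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return np.sum(p * np.log(p / q))

def overlap(p, q):
    return np.sum(np.sqrt(p * q))

def random_dist(n):
    w = rng.random(n) + 1e-3
    return w / w.sum()

p1, p2 = random_dist(4), random_dist(5)
q1, q2 = random_dist(4), random_dist(5)
P = np.outer(p1, p2).ravel()   # joint distribution of two independent parts
Q = np.outer(q1, q2).ravel()

assert np.isclose(kl(P, Q), kl(p1, q1) + kl(p2, q2))                  # Eq. (1)
assert np.isclose(overlap(P, Q), overlap(p1, q1) * overlap(p2, q2))   # Eq. (2)
```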
A. Stability

Stability in general has quite a broad meaning and various definitions; here it refers to the degree of response to an external perturbation. We shall consider a situation where an external environment disturbs a system described by a set of probabilities. As a consequence of the disturbance, only a specific state of the system may slightly change its probability, say, by a factor ε. Alternatively, we could describe our setup as follows. The fluctuation of the target system is so small that its influence on the probability states is limited and appears only between two states. Due to the normalization of the entire probability, if the probability of a certain state is altered, then another state must change as well. This may be a matter of time scales inherent to the system. Although the fluctuation may initially occur locally in state space, it propagates to the neighboring states, and the reconfiguration of the probability distribution of the system occurs immediately towards a (quasi-)equilibrium or static state. Instead of considering the long-term stability, we limit our concern to a very early stage of the response. The long-term dynamical stability requires the introduction of an underlying physics and is outside the scope of the present treatment.

We define {p(x_i; ε)} as the distributions after an infinitesimal change denoted by a factor ε, which is assumed to be close to unity. The evolution of the system is attributed to the change of the probabilities in time. Hence, two distributions {p(x_i)} and {p(y_i)} are assumed to be connected by p(y_i) = \sum_{k=1}^{n} p(y_i|x_k)\, p(x_k), i.e., a linear transformation of one state into another [14]. In the context of information theory, we may also regard this as the transmission of input states p(x_i) through the channel matrix p(y_i|x_k) to obtain output states p(y_i). Then the output states that received the disturbance are expressed as p(y_i; ε) = \sum_{k=1}^{n} p(y_i|x_k)\, p(x_k; ε). Without loss of generality, we assume that the fluctuation affects only two states (the l-th and the m-th) in such a way that [15]

p(x_k; \epsilon) = \begin{cases} \epsilon\, p(x_l) & \text{for } k = l \\ (1-\epsilon)\, p(x_l) + p(x_m) & \text{for } k = m \\ p(x_k) & \text{for others.} \end{cases}   (3)

This corresponds to the situation in which a certain external fluctuation boosts the visiting frequency of a particular state. From the above, we have p(y_i; ε) = p(y_i) + c_i (ε − 1) p(x_l), where we have set c_i = p(y_i|x_l) − p(y_i|x_m). Let us put ε − 1 = ξ. We then have

K(\{p(y_i;\xi)\}, \{p(y_i)\}) = \sum_{i=1}^{n} \big[p(y_i) + \xi c_i p(x_l)\big] \ln\!\left[1 + \frac{\xi c_i p(x_l)}{p(y_i)}\right].   (4)

Since we are considering the case ξ ≪ 1, expanding the logarithm and keeping terms up to second order in ξ gives

K(\{p(y_i;\xi)\}, \{p(y_i)\}) = \Big(\sum_{i=1}^{n} c_i\Big) p(x_l)\, \xi + \Big(\sum_{i=1}^{n} \frac{c_i^2}{p(y_i)}\Big) \frac{p(x_l)^2}{2}\, \xi^2 + O(\xi^3).   (5)

Therefore we always have a positive second derivative, ∂²K/∂ξ² > 0, which means that the distance is stable under this disturbance. On the other hand, the overlap between p(y_i; ε) and p(y_i) is calculated to be

O(\{p(y_i;\xi)\}, \{p(y_i)\}) = \sum_{i=1}^{n} \sqrt{p(y_i;\xi)\, p(y_i)} = \sum_{i=1}^{n} p(y_i) \sqrt{1 + \frac{\xi c_i p(x_l)}{p(y_i)}} \sim \sum_{i=1}^{n} p(y_i) \left[1 + \frac{c_i p(x_l)}{2\, p(y_i)}\, \xi - \frac{c_i^2}{8} \Big(\frac{p(x_l)}{p(y_i)}\Big)^{2} \xi^2 \right],   (6)

where we have approximated the last expression using √(1 + aξ) ∼ 1 + aξ/2 − a²ξ²/8 + ···. Therefore the second derivative ∂²O/∂ξ² = −p(x_l)² (\sum_i c_i^2/p(y_i))/4 < 0, which implies instability in this framework. Note that for two identical distributions the KL distance vanishes while the overlap becomes unity; the impact of the fluctuation on the two distance measures therefore appears in the coefficients of ξⁿ (n ≥ 2).
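This contrast can be reproduced numerically. The sketch below (a minimal illustration with a hypothetical channel matrix and input; all names are ours) builds the perturbed output p(y_i; ξ) and estimates the curvatures in Eqs. (5) and (6) by central finite differences.

```python
# Finite-difference check of the stability analysis, Eqs. (3)-(6).
import numpy as np

rng = np.random.default_rng(1)
n, l, m = 5, 0, 1
px = rng.random(n) + 0.1; px /= px.sum()          # input states p(x_k)
W = rng.random((n, n)) + 0.1; W /= W.sum(axis=0)  # channel p(y_i|x_k), columns sum to 1
py = W @ px                                       # unperturbed output p(y_i)
c = W[:, l] - W[:, m]                             # c_i = p(y_i|x_l) - p(y_i|x_m)

def perturbed(xi):                                # p(y_i; xi) = p(y_i) + xi c_i p(x_l)
    return py + xi * c * px[l]

def K(p, q): return np.sum(p * np.log(p / q))
def O(p, q): return np.sum(np.sqrt(p * q))

h = 1e-4
d2K = (K(perturbed(h), py) + K(perturbed(-h), py)) / h**2        # K vanishes at xi = 0
d2O = (O(perturbed(h), py) - 2.0 + O(perturbed(-h), py)) / h**2  # O equals 1 at xi = 0

print(d2K, px[l]**2 * np.sum(c**2 / py))          # positive: K is stable
print(d2O, -px[l]**2 * np.sum(c**2 / py) / 4)     # negative: O is unstable
```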
III. A GENERALIZED KULLBACK-LEIBLER ENTROPY
The relative entropy can be defined in various ways, and it is therefore possible to introduce alternative definitions to the conventional KL form if needed; some classes are discussed in [16]. Although extensions of the usual KL entropy have already been proposed by several authors in different forms, their presentations are somewhat heuristic and the mathematical origins are not fully clear. In this section, we consider a generalization of the KL entropy in light of the Jackson derivative [17] and illustrate some of its properties in the next section. The Jackson derivative has its root in quantum group theory and has already been used to produce the Tsallis type generalized entropy [18]. The Jackson derivative of a function f(α) is defined for s ≠ 1 by

\frac{d}{d_s \alpha} f(\alpha) := \frac{f(s\alpha) - f(\alpha)}{s\alpha - \alpha}.   (7)

The limit s → 1 corresponds to the ordinary derivative. The generalized entropy of Tsallis [7] is obtained when minus this derivative is applied to the quantity Z(α) = \sum_i p_i^{\alpha} and evaluated at α = 1 [18], i.e.,

-\left.\frac{d}{d_s\alpha} Z(\alpha)\right|_{\alpha=1} = \frac{1 - \sum_i p_i^{s}}{s - 1}.   (8)

Keep in mind that if we apply the derivative d/d_sα to another quantity and evaluate it at different values of α, we can in principle obtain different types of generalized entropies. In this scheme, we shall employ the quantity Z'(α) = \sum_i p_i^{\alpha} q_i^{1-\alpha} to obtain a new class of generalized KL entropy. This quantity was called the Rényi overlap of order α [19], since the quantity \ln Z'(\alpha)/(\alpha - 1) was defined by Rényi [20, 21].
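As a concrete illustration of Eqs. (7) and (8) (our own sketch, with illustrative names), minus the Jackson derivative of Z(α) at α = 1 indeed reproduces the Tsallis entropy:

```python
# The Jackson derivative (7) applied to Z(alpha) = sum_i p_i^alpha at alpha = 1
# generates the Tsallis entropy of Eq. (8), up to the overall sign.
import numpy as np

def jackson_derivative(f, alpha, s):
    return (f(s * alpha) - f(alpha)) / (s * alpha - alpha)

p = np.array([0.1, 0.2, 0.3, 0.4])
Z = lambda a: np.sum(p ** a)

for s in (0.5, 2.0):
    tsallis = (1 - np.sum(p ** s)) / (s - 1)
    assert np.isclose(-jackson_derivative(Z, 1.0, s), tsallis)
```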
We note that the usual KL entropy is obtained from the ordinary derivative of Z'(α) when evaluated either at α = 1,

\left.\frac{dZ'(\alpha)}{d\alpha}\right|_{\alpha=1} = K(P, Q),   (9)

or at α = 0,

\left.\frac{dZ'(\alpha)}{d\alpha}\right|_{\alpha=0} = -K(Q, P).   (10)

By the same token, the generalized KL entropy introduced previously in [6, 7] is generated by applying d/d_sα to Z'(α) and substituting α = 1,

K_s(P, Q) = \left.\frac{dZ'(\alpha)}{d_s\alpha}\right|_{\alpha=1} = \frac{\sum_i p_i^{s} q_i^{1-s} - 1}{s - 1}.   (11)

These facts indicate that we can produce various kinds of generalized KL entropies by evaluating the Jackson derivative of Z'(α) at different values of α. It is also possible to take another approach to achieve a generalization by employing the symmetric Jackson derivative, defined for a function g(α) with s ≠ 1 as

D_{s;\alpha}[g(\alpha)] := \frac{g(s\alpha) - g(s^{-1}\alpha)}{(s - s^{-1})\,\alpha}.   (12)

This derivative is symmetric under the interchange s ↔ s^{-1}. We apply the symmetric Jackson derivative to Z'(α) and evaluate it at α = 1,

L_s(P, Q) := \left. D_{s;\alpha}\Big[\sum_i p_i^{\alpha} q_i^{1-\alpha}\Big] \right|_{\alpha=1} = \frac{1}{s - s^{-1}} \sum_i p_i \left[ \Big(\frac{q_i}{p_i}\Big)^{1-s} - \Big(\frac{q_i}{p_i}\Big)^{1-s^{-1}} \right].   (13)

We note that this generalized KL entropy is asymmetric, L_s(P, Q) ≠ L_s(Q, P), and the relation L_{s^{-1}}(P, Q) = L_s(P, Q) holds. A symmetric quantity can be constructed by adding L_s(P, Q) and L_s(Q, P). In the limit s → 1, L_s(P, Q) reduces to the usual KL entropy K(P, Q), which can easily be checked by l'Hôpital's rule. This divergence is well-defined whenever the two distributions have common support (state indices i). In other words, in order to have a finite value as a distance measure in the case 0 < s < 1, the probability p_i must vanish whenever q_i vanishes, and a similar restriction also applies for s > 1.
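A direct implementation makes these properties easy to confirm. The sketch below (illustrative only, ours) evaluates Eq. (13) and checks the invariance under s ↔ s^{-1}, the generic asymmetry, and the s → 1 limit.

```python
# The generalized divergence L_s of Eq. (13) and its stated properties.
import numpy as np

def L(p, q, s):
    r = q / p
    return np.sum(p * (r ** (1 - s) - r ** (1 - 1 / s))) / (s - 1 / s)

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(2)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()

s = 0.6
assert np.isclose(L(p, q, s), L(p, q, 1 / s))              # L_{1/s} = L_s
assert not np.isclose(L(p, q, s), L(q, p, s))              # asymmetric in (P, Q)
assert np.isclose(L(p, q, 1 + 1e-6), kl(p, q), atol=1e-5)  # s -> 1 recovers KL
```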
The decomposability property in the sense mentioned in section II is not expected for L_s, since

(s - s^{-1})\, L_s(P_1^{(1)} \cdots P_1^{(n)},\; P_2^{(1)} \cdots P_2^{(n)}) = \prod_{j=1}^{n} L_s^{(j)} - \prod_{j=1}^{n} L_{s^{-1}}^{(j)},   (14)

where L_s^{(j)} = \int dx^{(j)}\, [P_1^{(j)}]^{s} [P_2^{(j)}]^{1-s}, etc. [22]. It is interesting to note that L_s can be understood as a limiting case of the weighted power mean of order λ, which is defined for x, y > 0 as

E_s^{\lambda}[x, y] := \big[s\, x^{\lambda} + (1 - s)\, y^{\lambda}\big]^{1/\lambda},   (15)

whose particular instances E_{1/2}^{0}[x, y] and E_{1/2}^{1}[x, y] correspond to the geometric mean √(xy) and to the arithmetic mean (x + y)/2, respectively. Therefore we have

L_s(P, Q) = \lim_{\lambda \to 0} \sum_i \frac{E_s^{\lambda}[p_i, q_i] - E_{s^{-1}}^{\lambda}[p_i, q_i]}{s - s^{-1}}.   (16)
IV. SOME PROPERTIES OF L_s(P, Q)

A. Positive semi-definiteness

This property corresponds to the information inequality for the standard KL entropy, i.e., K(P, Q) ≥ 0. For L_s(P, Q), the kernel function is f(x) = (x^{1-s} - x^{1-s^{-1}})/(s - s^{-1}). Its second derivative

f''(x) = \frac{1}{s - s^{-1}} \left\{ s(s-1)\, x^{-1-s} + \frac{1}{s}\Big(1 - \frac{1}{s}\Big)\, x^{-1-\frac{1}{s}} \right\}   (17)

is always positive for s > 0 and can be negative for s < 0. Therefore, by Jensen's inequality, \sum_i \alpha_i f(x_i) \gtrless f(\sum_i \alpha_i x_i) according as f''(x) \gtrless 0, with \sum_i \alpha_i = 1; putting \alpha_i = p_i and x_i = q_i/p_i, we obtain

L_s(P, Q) \gtrless \frac{1}{s - s^{-1}} \left\{ \Big[\sum_i p_i \Big(\frac{q_i}{p_i}\Big)\Big]^{1-s} - \Big[\sum_i p_i \Big(\frac{q_i}{p_i}\Big)\Big]^{1-s^{-1}} \right\} = 0.   (18)

Note that for s ≠ 0 the equality holds iff p_i = q_i for all i. When s > 0, f is a convex function. Accordingly, the positivity L_s ≥ 0 also follows from the basic inequality of the Csiszár f-divergence [23],

\int_E p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) m(dx) \;\ge\; \int_E p(x)\, m(dx) \cdot f(u),   (19)

which is proved for nonnegative measurable functions on a measure space (X, \mathcal{X}, m) for E ∈ \mathcal{X} (σ-algebra of subsets of X) and u = \int_E q(x)\, m(dx) / \int_E p(x)\, m(dx). The normalization of the probability functions in Eq. (19), together with f(1) = 0, results in positivity for our case.
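A brute-force numerical scan (our own illustrative sketch) is consistent with this positivity for s > 0:

```python
# Spot-check of positive semi-definiteness: L_s(P, Q) >= 0 for s > 0,
# vanishing for identical distributions.
import numpy as np

def L(p, q, s):
    r = q / p
    return np.sum(p * (r ** (1 - s) - r ** (1 - 1 / s))) / (s - 1 / s)

rng = np.random.default_rng(3)
for _ in range(1000):
    p = rng.random(5) + 0.01; p /= p.sum()
    q = rng.random(5) + 0.01; q /= q.sum()
    s = rng.uniform(0.05, 5.0)
    if abs(s - 1) < 1e-3:
        continue                        # skip the removable singularity at s = 1
    assert L(p, q, s) >= -1e-12
    assert abs(L(p, p, s)) < 1e-12
```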
B. Metric property

An infinitesimal shift in probability gives

L_s(P, P + dP) = \sum_i p_i\, f\Big(1 + \frac{\delta p_i}{p_i}\Big) \approx \sum_{n=2}^{\infty} \frac{f^{(n)}(1)}{n!} \sum_i \frac{(\delta p_i)^n}{p_i^{\,n-1}},   (20)

where f(x) is the same kernel as above, and we have used the facts f(1) = 0 and \sum_i \delta p_i = 0. Since the second derivative is f''(1) = [s(s-1) + s^{-1}(1 - s^{-1})]/(s - s^{-1}), we have L_s(P, P + dP) ∼ 2^{-1} f''(1) \sum_i (\delta p_i)^2/p_i. If we introduce the information metric (or Fisher-Rao metric) ds² as

ds^2 = \sum_{i,j} g_{ij}\, dp_i\, dp_j,   (21)

then the metric tensor g_{ij} is given by

g_{ij} = \frac{s(s-1) + s^{-1}(1 - s^{-1})}{2\, p_i\, (s - s^{-1})}\, \delta_{ij}.   (22)

In the limit s → 1, this metric reduces to δ_{ij}/(2 p_i).
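The quadratic approximation behind Eqs. (20)-(22) can be checked directly; note that f''(1) simplifies to s + s^{-1} − 1. A minimal sketch (ours):

```python
# For a small normalization-preserving perturbation dp, Eq. (20) predicts
# L_s(P, P + dp) ~ (1/2) f''(1) sum_i dp_i^2 / p_i with f''(1) = s + 1/s - 1.
import numpy as np

def L(p, q, s):
    r = q / p
    return np.sum(p * (r ** (1 - s) - r ** (1 - 1 / s))) / (s - 1 / s)

rng = np.random.default_rng(4)
p = rng.random(6); p /= p.sum()
dp = rng.normal(size=6) * 1e-4
dp -= dp.mean()                        # enforce sum_i dp_i = 0

s = 2.5
fpp1 = (s * (s - 1) + (1 - 1 / s) / s) / (s - 1 / s)    # equals s + 1/s - 1
print(L(p, p + dp, s), 0.5 * fpp1 * np.sum(dp**2 / p))  # agree to leading order
```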
C. Form invariance

Under a transformation of coordinates θ → η, the distribution p(θ) satisfies p(θ) dθ = φ(η) dη, where φ(η) is the transformed distribution. The resulting relation p_1/p_2 = φ_1/φ_2 shows that the distance between φ_1 and φ_2 measured by L_s remains unchanged; that is, it equals the distance between p_1 and p_2 before the transformation,

L_s(P_1, P_2) = \frac{1}{s - s^{-1}} \int d\eta\, \phi_1(\theta(\eta)) \left[ \left(\frac{\phi_2(\theta(\eta))}{\phi_1(\theta(\eta))}\right)^{1-s} - \left(\frac{\phi_2(\theta(\eta))}{\phi_1(\theta(\eta))}\right)^{1-s^{-1}} \right] = L_s(\Phi_1, \Phi_2).   (23)
D. Concavity

Setting α_j = a_j/\sum_k a_k and x_j = b_j/a_j in Jensen's inequality \sum_j \alpha_j f(x_j) \ge f(\sum_j \alpha_j x_j), with the same f(x) as in section IV A (convex for s > 0), we have

\frac{1}{\sum_k a_k} \sum_j a_j\, \frac{ \big(\frac{b_j}{a_j}\big)^{1-s} - \big(\frac{b_j}{a_j}\big)^{1-s^{-1}} }{s - s^{-1}} \;\ge\; \frac{ \big(\frac{\sum_j b_j}{\sum_k a_k}\big)^{1-s} - \big(\frac{\sum_j b_j}{\sum_k a_k}\big)^{1-s^{-1}} }{s - s^{-1}}.   (24)

For each state j we consider the two-term decompositions a_1 = λ p_j^{(1)} and a_2 = (1-λ) p_j^{(2)}, so that p_j = λ p_j^{(1)} + (1-λ) p_j^{(2)}. Similarly, we set b_1 = λ p_j'^{(1)} and b_2 = (1-λ) p_j'^{(2)} for p_j' = λ p_j'^{(1)} + (1-λ) p_j'^{(2)}, where 0 ≤ λ ≤ 1. Substituting these into Eq. (24) and summing over j yields

λ\, L_s(p^{(1)}, p'^{(1)}) + (1-λ)\, L_s(p^{(2)}, p'^{(2)}) \;\ge\; L_s\big(λ p^{(1)} + (1-λ) p^{(2)},\; λ p'^{(1)} + (1-λ) p'^{(2)}\big).   (25)

This completes the proof.
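Numerically, inequality (25) is readily checked on random inputs (illustrative sketch, ours):

```python
# Spot-check of inequality (25) for the mixing of two pairs of distributions.
import numpy as np

def L(p, q, s):
    r = q / p
    return np.sum(p * (r ** (1 - s) - r ** (1 - 1 / s))) / (s - 1 / s)

rng = np.random.default_rng(5)
def rd(n=5):
    w = rng.random(n)
    return w / w.sum()

s, lam = 1.8, 0.3
p1, p2, q1, q2 = rd(), rd(), rd(), rd()
lhs = lam * L(p1, q1, s) + (1 - lam) * L(p2, q2, s)
rhs = L(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2, s)
assert lhs >= rhs - 1e-12
```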
E. Stability

We now investigate the stability property explained in section II A. The distance between the original and the perturbed distribution is

L_s(\{p(y_i;\xi)\}, \{p(y_i)\}) = \frac{1}{s - s^{-1}} \sum_{i=1}^{n} \big\{p(y_i) + c_i p(x_l)\, \xi\big\} \left[ \left(1 + \frac{c_i p(x_l)\, \xi}{p(y_i)}\right)^{s-1} - \left(1 + \frac{c_i p(x_l)\, \xi}{p(y_i)}\right)^{s^{-1}-1} \right].   (26)

Expanding (1 + ξ c_i p(x_l)/p(y_i))^{s-1}, etc., with respect to ξ and keeping terms up to second order in ξ, we obtain

L_s(\{p(y_i;\xi)\}, \{p(y_i)\}) = \Big(\sum_{i=1}^{n} c_i\Big) p(x_l)\, \xi + \frac{f(s)}{2} \Big(\sum_{i=1}^{n} \frac{c_i^2}{p(y_i)}\Big) p(x_l)^2\, \xi^2 + O(\xi^3),   (27)

where f(s) = s + s^{-1} - 1. Considering the sign of the coefficient of ξ², we conclude that L_s is stable when s > 0 and unstable when s < 0. Note that, except for the factor f(s), the effect of the perturbation on the generalized distance is the same as on the ordinary KL distance, Eq. (5), up to second order in ξ.
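The factor f(s) can be isolated numerically with the same channel setup as in section II A (illustrative sketch, ours):

```python
# The curvature of L_s under the perturbation of Eq. (3) carries the factor
# f(s) = s + 1/s - 1 relative to the KL result of Eq. (5).
import numpy as np

rng = np.random.default_rng(6)
n, l, m = 5, 0, 1
px = rng.random(n) + 0.1; px /= px.sum()
W = rng.random((n, n)) + 0.1; W /= W.sum(axis=0)   # channel p(y_i|x_k)
py = W @ px
c = W[:, l] - W[:, m]

def L(p, q, s):
    r = q / p
    return np.sum(p * (r ** (1 - s) - r ** (1 - 1 / s))) / (s - 1 / s)

s, h = 3.0, 1e-4
d2 = (L(py + h * c * px[l], py, s) + L(py - h * c * px[l], py, s)) / h**2
print(d2, (s + 1 / s - 1) * px[l]**2 * np.sum(c**2 / py))  # positive: stable for s > 0
```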
F. Upper and lower bounds for L_s(P, Q)

The following bounds hold.

Theorem IV.1.
Let p(x), q(x), x ∈ X, be two probability distributions. Then we have the inequality

\frac{1}{2} \sum_{x \in X} \left[ \Big(\frac{q(x)}{p(x)}\Big)^{1-s} + \Big(\frac{q(x)}{p(x)}\Big)^{1-s^{-1}} \right] p(x) \log\frac{p(x)}{q(x)} \;\ge\; L_s(P, Q) \;\ge\; \sum_{x \in X} \Big(\frac{q(x)}{p(x)}\Big)^{1-\frac{s+s^{-1}}{2}} p(x) \log\frac{p(x)}{q(x)}.   (28)

Proof. For the convex function f(u) = t^{-u} (0 < t < 1), the Hermite-Hadamard inequality states

f\Big(\frac{a+b}{2}\Big) \;\le\; \frac{1}{b-a}\int_a^b f(u)\, du \;\le\; \frac{f(a) + f(b)}{2}, \qquad a, b > 0.   (29)

Putting a = s and b = s^{-1}, we have

t^{-\frac{s+s^{-1}}{2}} \;\le\; \frac{t^{-s} - t^{-s^{-1}}}{-(s - s^{-1}) \log t} \;\le\; \frac{1}{2}\big(t^{-s} + t^{-s^{-1}}\big).   (30)

Since (t^{-s} - t^{-s^{-1}})/((s - s^{-1}) \log t) takes negative values for all s when t ∈ (0, 1), we put t = q(x)/p(x) in Eq. (30), multiply by p(x)\, t \log(1/t), and sum over x ∈ X to obtain

\frac{1}{2} \sum_{x \in X} \left[ \Big(\frac{q(x)}{p(x)}\Big)^{1-s} + \Big(\frac{q(x)}{p(x)}\Big)^{1-s^{-1}} \right] p(x) \log\frac{p(x)}{q(x)} \;\ge\; \frac{1}{s - s^{-1}} \sum_{x \in X} p(x) \left[ \Big(\frac{q(x)}{p(x)}\Big)^{1-s} - \Big(\frac{q(x)}{p(x)}\Big)^{1-s^{-1}} \right] \;\ge\; \sum_{x \in X} \Big(\frac{q(x)}{p(x)}\Big)^{1-\frac{s+s^{-1}}{2}} p(x) \log\frac{p(x)}{q(x)}.   (31)

The equality holds if and only if s = s^{-1}, i.e., s = ±1, which completes the desired bounds for L_s.
As for an upper bound for L_s, the following expression also holds.

Corollary IV.1. Let p(x), q(x) be as above. For −1 < s < 0 and s > 1, we have

L_s(P, Q) \;\le\; \frac{1}{s - s^{-1}} \sum_{x \in X} \frac{\big| \exp[p^s(x)\, q^{1-s}(x)] - \exp[p^{s^{-1}}(x)\, q^{1-s^{-1}}(x)] \big|}{\sqrt{\exp[p^s(x)\, q^{1-s}(x) + p^{s^{-1}}(x)\, q^{1-s^{-1}}(x)]}}.   (32)

Proof.
For a > b > 0, the geometric mean is smaller than or equal to the logarithmic mean, i.e.,

\sqrt{ab} \;\le\; \frac{a - b}{\ln a - \ln b}.   (33)

Equivalently, the inequality |\ln b - \ln a| \le |b - a|/\sqrt{ab} holds, with equality iff a = b. Therefore, setting b = \exp[p^s(x)\, q^{1-s}(x)] and a = \exp[p^{s^{-1}}(x)\, q^{1-s^{-1}}(x)], we have

\big| p^s(x)\, q^{1-s}(x) - p^{s^{-1}}(x)\, q^{1-s^{-1}}(x) \big| \;\le\; \frac{\big| e^{p^s(x) q^{1-s}(x)} - e^{p^{s^{-1}}(x) q^{1-s^{-1}}(x)} \big|}{\sqrt{\exp[p^s(x)\, q^{1-s}(x) + p^{s^{-1}}(x)\, q^{1-s^{-1}}(x)]}}.   (34)

From the positivity property proved in section IV A, we have 0 ≤ L_s(P, Q). Then

(s - s^{-1})\, L_s(P, Q) = \Big| \sum_{x \in X} p^s(x)\, q^{1-s}(x) - p^{s^{-1}}(x)\, q^{1-s^{-1}}(x) \Big| \;\le\; \sum_{x \in X} \big| p^s(x)\, q^{1-s}(x) - p^{s^{-1}}(x)\, q^{1-s^{-1}}(x) \big|.   (35)

Summing over x in Eq. (34), we obtain the inequality (32).
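Corollary IV.1 is straightforward to check numerically for admissible values of s (illustrative sketch, ours):

```python
# Spot-check of the upper bound (32), valid when s - 1/s > 0
# (i.e., -1 < s < 0 or s > 1).
import numpy as np

rng = np.random.default_rng(7)

for s in (2.0, -0.5):
    w = rng.random(4) + 0.2; p = w / w.sum()
    w = rng.random(4) + 0.2; q = w / w.sum()
    a = p ** s * q ** (1 - s)
    b = p ** (1 / s) * q ** (1 - 1 / s)
    bound = np.sum(np.abs(np.exp(a) - np.exp(b)) / np.sqrt(np.exp(a + b))) / (s - 1 / s)
    r = q / p
    Ls = np.sum(p * (r ** (1 - s) - r ** (1 - 1 / s))) / (s - 1 / s)
    assert Ls <= bound + 1e-12
```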
Another expression for the bound, in terms of the l-norm of the exponentiated differences, is also possible.

Corollary IV.2. Let p(x), q(x) be as above. Then we have

L_s(P, Q) \;\le\; \frac{1}{s - s^{-1}} \left\| e^{p^s(x) q^{1-s}(x)} - e^{p^{s^{-1}}(x) q^{1-s^{-1}}(x)} \right\|_{\alpha} \cdot \left\| \exp\!\Big[ -\frac{p^s(x)\, q^{1-s}(x) + p^{s^{-1}}(x)\, q^{1-s^{-1}}(x)}{2} \Big] \right\|_{\beta},   (36)

where ||t||_l := \big[\sum_{x \in X} |t(x)|^l\big]^{1/l} for l ≥ 1.

Proof. From Hölder's inequality with 1/α + 1/β = 1, it immediately follows that

\sum_{x \in X} \frac{\big| e^{p^s(x) q^{1-s}(x)} - e^{p^{s^{-1}}(x) q^{1-s^{-1}}(x)} \big|}{\sqrt{\exp[p^s(x)\, q^{1-s}(x) + p^{s^{-1}}(x)\, q^{1-s^{-1}}(x)]}} \;\le\; \left[ \sum_{x \in X} \big| e^{p^s(x) q^{1-s}(x)} - e^{p^{s^{-1}}(x) q^{1-s^{-1}}(x)} \big|^{\alpha} \right]^{1/\alpha} \left[ \sum_{x \in X} \exp\!\Big[ -\frac{\beta}{2}\big(p^s(x)\, q^{1-s}(x) + p^{s^{-1}}(x)\, q^{1-s^{-1}}(x)\big) \Big] \right]^{1/\beta}.   (37)

V. SHIFT INFORMATION FOR L_s(P, Q)

The notion of shift information, introduced in Ref. [24], is an interesting means of investigating our new distance measure, in that we may relate the infinitesimal shift to known quantities. The original shift information can be expressed using the ordinary KL entropy as K(p(ζ), p(ζ + ∆)), where ∆ is sufficiently small compared to the variable ζ. As a consequence of the expansion, the Fisher information measure appears in the second-order term, both for the usual KL entropy [24] and for a generalized KL entropy [6]. We may expect that the shift information for the present generalization can also be expressed in terms of the Fisher information, with the generalization parameter s governing the degree of the shift. We examine this below. The shift information is defined as

I(∆, s) := L_s(p(ζ), p(ζ + ∆)) = \frac{1}{s - s^{-1}} \int dζ\, \Big[ p^s(ζ)\, [p(ζ + ∆)]^{1-s} - p^{s^{-1}}(ζ)\, [p(ζ + ∆)]^{1-s^{-1}} \Big].   (38)

Expanding [p(ζ + ∆)]^{1-γ} with respect to ∆,

p^{γ}(ζ)\, [p(ζ + ∆)]^{1-γ} \sim p(ζ) + (1 - γ)\, p'(ζ)\, ∆ + \frac{1 - γ}{2} \left\{ p''(ζ) - γ\, \frac{(p'(ζ))^2}{p(ζ)} \right\} ∆^2 + \cdots,   (39)

where γ denotes either s or s^{-1} and the prime denotes the derivative with respect to ζ. Then the shift information can be expressed up to second order in ∆ as

I(∆, s) = \frac{1}{s - s^{-1}} \int dζ \left\{ (s^{-1} - s)\, p'(ζ)\, ∆ + \frac{1}{2} \left[ (s^{-1} - s)\, p''(ζ) + a(s)\, \frac{[p'(ζ)]^2}{p(ζ)} \right] ∆^2 \right\},   (40)

where we have put a(s) = s(s - 1) + s^{-1}(1 - s^{-1}).
Therefore we find that the Fisher information \int dζ\, (p')^2/p is the relevant quantity at second order in the shift ∆. Moreover, the variation of I(∆, s) is given by

δI(∆, s) = \frac{1}{s - s^{-1}} \int dζ\, δp \left[ \frac{∂}{∂p} - \frac{∂}{∂ζ}\frac{∂}{∂p'} + \frac{∂^2}{∂ζ^2}\frac{∂}{∂p''} - \cdots \right] \left\{ (s^{-1} - s)\, p'\, ∆ + \Big[ (s^{-1} - s)\, p'' + a(s)\, \frac{(p')^2}{p} \Big] \frac{∆^2}{2} \right\} \simeq \frac{a(s)}{2(s - s^{-1})}\, δI_{KL}(∆),   (41)

where δI_{KL}(∆) is the corresponding variation of the shift information for the ordinary KL entropy [24],

δI_{KL}(∆) = ∆^2 \int dζ\, δp \left\{ \Big(\frac{p'}{p}\Big)^2 - \frac{2 p''}{p} \right\}.   (42)

This result indicates that the variation differs from the one obtained for the ordinary KL simply by a factor, whose magnitude is controlled by the index s. If the second derivative of the distance measure is responsible for keeping the shifted and the original distributions discernible, then the sign of ∂²I(∆, s)/∂∆² is a signature of this stability. As an example, consider a Gaussian distribution, a representative form that appears in many disciplines. Its Fisher information is σ^{-2} on the domain ζ ∈ (−∞, ∞), where σ is the standard deviation. By straightforward calculation, we obtain

\frac{∂^2 I(∆, s)}{∂∆^2} = \frac{a(s)}{σ^2\, (s - s^{-1})}.   (43)

We find that a(s)/(s - s^{-1}) ≷ 0 for s ≷ 0, indicating that the information is stable against the shift ∆ when s > 0. This is consistent with the result of section IV E, where L_s was found to be stable (unstable) for s > 0 (s < 0). For comparison, the shift information based on the Rényi overlap, i.e., on K_s of Eq. (11), can be defined as

I_R(∆, s) := \frac{1}{s - 1} \left[ \int dζ\, p(ζ) \left(\frac{p(ζ + ∆)}{p(ζ)}\right)^{1-s} - 1 \right] \simeq \frac{1}{s - 1} \left[ (1 - s)\, ∆ \int dζ\, p' + \frac{1 - s}{2}\, ∆^2 \int dζ \left( p'' - s\, \frac{(p')^2}{p} \right) \right].   (44)

For the Gaussian distribution, the second derivative is calculated as

\frac{∂^2 I_R(∆, s)}{∂∆^2} = \frac{s}{σ^2} \left[ 1 - \frac{s(1 - s)\, ∆^2}{σ^2} \right] \exp\!\left[ -\frac{s(1 - s)\, ∆^2}{2 σ^2} \right].   (45)

Therefore, when σ² > s(1 − s)∆², I_R(∆, s) is found to be stable. The remarkable difference between I(∆, s) and I_R(∆, s) is that the stability is controlled only by s for our shift information, whereas the form of the distribution (i.e., the magnitude of σ) restricts the admissible domain of s in the case of the Rényi shift information in general.
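For the Gaussian case, Eq. (43) can be confirmed by direct numerical integration (illustrative sketch, ours; a central finite difference in ∆ approximates the curvature at ∆ = 0):

```python
# Numerical check of Eq. (43): the curvature of the shift information of a
# Gaussian at Delta = 0 equals a(s)/(sigma^2 (s - 1/s)) = (s + 1/s - 1)/sigma^2.
import numpy as np

sigma, s = 1.3, 2.0
z = np.linspace(-10, 10, 20001)
dz = z[1] - z[0]
p = lambda x: np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def I(delta):                      # Eq. (38) evaluated on a grid
    a, b = p(z), p(z + delta)
    r = b / a
    return np.sum(a * (r ** (1 - s) - r ** (1 - 1 / s))) * dz / (s - 1 / s)

h = 1e-3
d2 = (I(h) + I(-h)) / h**2         # I(0) = 0
print(d2, (s + 1 / s - 1) / sigma**2)
```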
VI. BEHAVIOR UNDER LIOUVILLE DYNAMICS

We shall prove in this section that two states can become neither more nor less distinguishable in the course of a dynamical evolution when the distance between them is measured by the present quantity. In other words, the generalized KL entropy L_s does not increase with time; rather, it is shown to be constant in time under the Liouville equation

\frac{∂p}{∂t} + \nabla \cdot (\vec{v}\, p) = 0,   (46)

where p(\vec{ζ}, t) denotes a probability density describing a statistical ensemble of dynamical systems and \vec{v} = d\vec{ζ}/dt stands for the drift velocity. The time derivative of the generalized KL entropy for two arbitrary probability distributions which satisfy the Liouville equation is

\frac{d L_s(P_1, P_2)}{dt} = \frac{1}{s - s^{-1}} \int dζ\, \frac{∂}{∂t} \Big[ p_1^{s} p_2^{1-s} - p_1^{1/s} p_2^{1-1/s} \Big] = \frac{1}{s - s^{-1}} \int dζ \left[ f_1(p_1, p_2)\, \frac{∂p_1}{∂t} + f_2(p_1, p_2)\, \frac{∂p_2}{∂t} \right],   (47)

where

f_1 = s \Big(\frac{p_2}{p_1}\Big)^{1-s} - s^{-1} \Big(\frac{p_2}{p_1}\Big)^{1-s^{-1}}, \qquad f_2 = (1 - s) \Big(\frac{p_1}{p_2}\Big)^{s} - (1 - s^{-1}) \Big(\frac{p_1}{p_2}\Big)^{s^{-1}}.   (48)

Substituting Eq. (46) into Eq. (47) and using f∇g = ∇(fg) − (∇f)g, we obtain for the integral of Eq. (47)

\int dζ \left[ \left\{ s(1 - s) \Big(\frac{p_2}{p_1}\Big)^{-s} - s^{-1}(1 - s^{-1}) \Big(\frac{p_2}{p_1}\Big)^{-s^{-1}} \right\} p_1 - \left\{ s(1 - s) \Big(\frac{p_1}{p_2}\Big)^{s+1} - s^{-1}(1 - s^{-1}) \Big(\frac{p_1}{p_2}\Big)^{s^{-1}+1} \right\} p_2 \right] \vec{v} \cdot \nabla\Big(\frac{p_2}{p_1}\Big),   (49)

where we have assumed that the probability distributions vanish at the boundary, so that the total-derivative terms \int \nabla(f g)\, dζ = 0 drop out. The quantity within the brackets is identically zero; therefore L_s is an invariant measure under this dynamics for all values of s. It is worth mentioning that relative entropies of the form \int dζ\, P_2\, f(P_1/P_2) (the Csiszár f-divergence), where the function f is convex and satisfies f(1) = 0, become constant in time under Liouville-type dynamical evolution [9, 10, 11, 25]. The fact that dL_s/dt = 0 under the Liouville equation, proved above, is consistent with this observation because L_s is a particular instance of the Csiszár f-divergence class.
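The constancy of L_s under Liouville dynamics can also be illustrated numerically. For the one-dimensional drift v(ζ) = −kζ, the Liouville equation is solved exactly by p(ζ, t) = e^{kt} p_0(e^{kt} ζ); the sketch below (our own hypothetical example) evolves two Gaussian ensembles with this solution and shows that L_s between them does not change in time.

```python
# L_s is constant along the Liouville flow generated by v(z) = -k z.
import numpy as np

k, s = 0.5, 1.7
z = np.linspace(-15, 15, 20001)
dz = z[1] - z[0]

def gauss(x, mu, sig):
    return np.exp(-(x - mu)**2 / (2 * sig**2)) / (sig * np.sqrt(2 * np.pi))

def evolve(p0, t):                 # exact solution of Eq. (46) for v = -k z
    c = np.exp(k * t)
    return c * p0(c * z)

def L(p1, p2):
    r = p2 / p1
    return np.sum(p1 * (r ** (1 - s) - r ** (1 - 1 / s))) * dz / (s - 1 / s)

p0a = lambda x: gauss(x, 1.0, 2.0)
p0b = lambda x: gauss(x, -0.5, 1.5)

for t in (0.0, 1.0, 2.0):
    print(t, L(evolve(p0a, t), evolve(p0b, t)))   # the same value at every t
```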
VII. CONCLUSIONS

We have investigated the properties of a novel generalized KL divergence in the context of statistical physics and information theory. Our approach provides a unified recipe for constructing distance measures for probability distributions. In this method, the ordinary KL divergence is obtained by differentiating the generalized overlap with respect to the overlap index α and evaluating at unity. Similarly, the previously reported generalization of the KL divergence, which is consistent with the nonextensive entropy proposed in the physics literature, can be regarded as the output of the Jackson derivative of the overlap evaluated at unity. Along this line, we can define a family of distance measures by applying the symmetric Jackson derivative to the generalized overlap. We have chosen α = 1 to obtain a specific generalization, which belongs to the Csiszár f-divergence type, and have shown some fundamental properties of this divergence measure. As far as the distance between two probability distributions is concerned, the KL relative entropy admits infinitely many generalizations even within our recipe, depending on the evaluation index. The connection to an interpretation of the information gain would provide the corresponding generalized information content. We have also obtained the ratio of the variation of the shift information to that of the ordinary one, δI(∆, s)/δI_{KL}(∆). In closing, we remark on a possible application of the divergence in light of the minimum KL divergence scheme. In [26], this minimization formalism was applied to obtain approximate solutions of general N-dimensional linear Fokker-Planck equations. Following this reasoning, the newly introduced divergence could be useful for finding approximate solutions to nonlinear Fokker-Planck equations and related time evolution equations. Developing this approach would require future investigation.
Acknowledgements

This work was partially supported by the Grant-in-Aid for Scientific Research from Monbukagaku-sho No. 06225 and was presented at the DPG (Deutsche Physikalische Gesellschaft) conference as No. DY 30.7 at Regensburg University, 26-30 March 2007.

[1] S. Kullback and R.A. Leibler, Ann. Math. Stat. 22, 79 (1951); S. Kullback, Information Theory and Statistics (Wiley, New York, 1959).
[2] A.I. Khinchin, Mathematical Foundations of Information Theory (Dover, New York, 1957).
[3] A.I. Khinchin, Mathematical Foundations of Statistical Mechanics (Dover, New York, 1960).
[4] T.M. Cover and J.A. Thomas, Elements of Information Theory (Wiley, New York, 1991).
[5] C. Tsallis, Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World (Springer, New York, 2009); M. Gell-Mann and C. Tsallis, eds., Nonextensive Entropy - Interdisciplinary Applications (Oxford University Press, New York, 2004).
[6] L. Borland, A.R. Plastino, and C. Tsallis, J. Math. Phys. 39, 6490 (1998); ibid. 40, 2196 (1999).
[7] C. Tsallis, Phys. Rev. E 58, 1442 (1998).
[8] W.K. Wootters and W.H. Zurek, Nature 299, 802 (1982); D. Dieks, Phys. Lett. A 92, 271 (1982).
[9] A. Daffertshofer, A.R. Plastino, and A. Plastino, Phys. Rev. Lett. 88, 210601 (2002).
[10] A.R. Plastino and A. Daffertshofer, Phys. Rev. Lett. 93, 138701 (2004).
[11] T. Yamano and O. Iguchi, Europhys. Lett., 50007 (2008).
[12] W.K. Wootters, Phys. Rev. D 23, 357 (1981).
[13] G. Jumarie, Relative Information (Springer, 1990).
[14] V. Vedral, Rev. Mod. Phys. 74, 197 (2002), Sec. II.C.
[15] T. Yamano, Physica A, 71 (2006).
[16] J.N. Kapur and H.K. Kesavan, Entropy Optimization Principles with Applications, Ch. VII (Academic Press, 1992).
[17] F.H. Jackson, Mess. Math. 38, 57 (1909); Quart. J. Pure Appl. Math. 41, 193 (1910).
[18] S. Abe, Phys. Lett. A 224, 326 (1997).
[19] C.A. Fuchs, quant-ph/9601020, PhD thesis, University of New Mexico, 1996.
[20] A. Rényi, Probability Theory (North-Holland, Amsterdam, 1970).
[21] A. Rényi, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman, ed., Vol. 1, pp. 547-561 (1961).
[22] x_1 x_2 − y_1 y_2 is not expressible by x_1 − y_1 and x_2 − y_2 only.
[23] I. Csiszár, Studia Sci. Math. Hungar. 2, 299-318 (1967); ibid. 2, 329-339 (1967).
[24] G.V. Vstovsky, Phys. Rev. E 51, 975 (1995).
[25] M.C. Mackey, Rev. Mod. Phys. 61, 981 (1989).
[26] A.R. Plastino, H.G. Miller, and A. Plastino, Phys. Rev. E 56, 3927 (1997).