Comment on "Rényi entropy yields artificial biases not in the data and incorrect updating due to the finite-size data"
Petr Jizba∗ and Jan Korbel†
FNSPE, Czech Technical University in Prague, Břehová 7, 115 19 Prague, Czech Republic
Section for Science of Complex Systems, Medical University of Vienna, Spitalgasse 23, 1090 Vienna, Austria
Complexity Science Hub Vienna, Josefstädter Strasse 39, 1080 Vienna, Austria
∗ Electronic address: [email protected]
† Electronic address: [email protected]
In their recent paper [Phys. Rev. E 99 (2019) 032134], T. Oikonomou and B. Bagci have argued that Rényi entropy is ill-suited for inference purposes because it is not consistent with the Shore–Johnson axioms of statistical estimation theory. In this Comment we seek to clarify the latter statement by showing that there are several issues in the Oikonomou–Bagci reasoning which lead to erroneous conclusions. When all these issues are properly accounted for, no violation of the Shore–Johnson axioms is found.
PACS numbers: 05.20.-y, 02.50.Tt, 89.70.Cf
Introduction. — The maximum entropy (MaxEnt) principle belongs among the most prominent concepts of contemporary statistical physics, information theory, and statistical estimation. Its inception dates back to two seminal papers of E.T. Jaynes [2, 3], who first employed the Shannon information measure, or Shannon entropy (SE), in the framework of equilibrium statistical physics.

Over the years, Jaynes' heuristic MaxEnt prescription has become a powerful instrument, e.g., in non-equilibrium statistical physics, astronomy, geophysics, biology, medical diagnosis, and economics [4, 5]. The rationale behind this success is typically twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution (i.e., the MaxEnt distribution is maximally noncommittal with regard to missing information); second, many physical systems tend to move towards (or concentrate extremely close to) MaxEnt configurations over time [2, 4, 6, 7].

With the advent of generalized entropies [8–14], a natural question has arisen as to whether the MaxEnt principle can be extended also to non-Shannonian entropies. This clearly cannot be decided within Jaynes' heuristic framework — a sound mathematical qualification is needed. Since the MaxEnt principle is in its essence an inference method estimating probability distributions from limited information, a pertinent mathematical basis should stem from the theory of statistical estimation. Shore and Johnson (SJ) [15, 16] introduced a system of axioms which ensure that the MaxEnt estimation procedure is consistent with the desired properties of inference methods. These axioms are as follows [15, 16]:

1. uniqueness: the solution should be unique;

2. permutation invariance: the permutation of states should not matter;
3. subset independence: it should not matter whether one treats disjoint subsets of system states in terms of separate conditional distributions or in terms of the full distribution;

4. system independence: it should not matter whether one accounts for independent constraints related to independent systems separately in terms of marginal distributions or in terms of the full-system distribution.

There is often also a fifth axiom, which was not included in the original system of SJ axioms [15] but appeared in later editions [17]:

5. maximality: in the absence of any prior information, the uniform distribution should be the solution.

One can analogously define the set of axioms for continuous systems with several adjustments [15, 17].
First, for continuous variables it is necessary to use the Minimum Relative Entropy (MinRel) principle, where the maximization of the entropy is replaced by the minimization of the relative entropy with respect to some given prior distribution.
Second, the second axiom (permutation invariance) changes to the coordinate invariance axiom, which states that a change of the coordinate system should not matter.
Third, the maximality axiom is replaced by the no-information axiom: in the absence of any information, the prior distribution remains unchanged. The MaxEnt principle then represents a special case of the MinRel principle for discrete variables and a uniform prior distribution.

In recent years, there has been much debate as to whether generalized entropies can fulfill the SJ axioms and, if yes, how the permissible classes are classified (see, e.g., [18–20] and citations therein). In their latest paper [1], Oikonomou and Bagci (OB) focused on the particular case of Rényi's entropy (RE) and argued that RE is not consistent with some of the SJ axioms and hence is ill-suited for inference purposes. This finding is, however, at odds with the recently found one-parameter class of (entropic) functionals, the so-called Uffink class, which is consistent with the SJ axioms [21] and which contains RE as a particular member. In addition, if the OB statement were true, then in some important cases, such as Rényi entropy-based signal processing and pattern recognition, there would be important new corrections or inconsistencies to some existing analyses. The OB result would also be detrimental in quantum information theory [22, 23] (e.g., RE for q = 2 is related to purity).

Here we show that RE as it stands is certainly compatible with the SJ axioms. Rather than appealing to Ref. [21] for a full-fledged proof of Uffink's class, we will employ a more straightforward approach. In particular, in this Comment we directly point out several issues in the OB reasoning. We carefully go through the OB arguments and correct the respective problematic points. When all issues are properly accounted for, no violation of the SJ axioms is found. For the sake of simplicity, we focus on the discrete version of the SJ axioms and change to continuous variables only when necessary.

In the following we will denote the RE of order q as

H_q(P) = \frac{1}{1-q} \ln\Big( \sum_i p_i^q \Big), \qquad q > 0,   (1)

and the ensuing relative RE (or Rényi divergence of order q) [36] as

H_q(P\|Q) = \frac{1}{q-1} \ln\bigg\{ \int \Big[ \frac{p(x)}{q(x)} \Big]^q q(x)\, dx \bigg\}.   (2)

Critical revision of the OB paper. — Let us now go step-by-step through the key arguments presented by OB in [1]. Our discussion will be organized according to the respective SJ axiomatic points:
1. Uniqueness axiom —
OB conclude that the first axiom is fulfilled only for q ∈ (0, 1), where RE is concave. However, concavity is a sufficient, not a necessary, condition for uniqueness. A key observation in this context is that RE is a strictly Schur-concave function for arbitrary q > 0 [37]. Assume that there exist two maximizers P_1 and P_2 (which are not permutations of each other) that maximize H_q under the given constraints. Let us take a convex combination P_α = αP_1 + (1−α)P_2. Indeed, P_α belongs to the probability simplex and also fulfills the constraints. Since H_q is strictly Schur-concave, it fulfills the following inequality (see also [24]):

H_q(P_\alpha) > \alpha H_q(P_1) + (1-\alpha) H_q(P_2) = H_q(P_1) = H_q(P_2).   (3)

Thus, the result must be unique; otherwise we would obtain a contradiction with the maximality assumption [17]. This fact will also be important in connection with the subset independence axiom.
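The strict Schur-concavity of RE for arbitrary q > 0 invoked above can be probed numerically: a Robin Hood transfer (moving probability mass from a larger to a smaller component) produces a distribution majorized by the original, so H_q must not decrease under it. A minimal sketch, assuming NumPy is available; the random test distributions and the set of q values are illustrative choices, not taken from the Comment:

```python
import numpy as np

def renyi(p, q):
    # Rényi entropy H_q(P) = ln(sum_i p_i^q) / (1 - q), cf. Eq. (1)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.log(np.sum(p**q)) / (1 - q))

rng = np.random.default_rng(0)
for q in (0.3, 0.7, 2.0, 5.0):                    # includes the non-concave regime q > 1
    for _ in range(200):
        p = rng.dirichlet(np.ones(5))
        i, j = int(np.argmax(p)), int(np.argmin(p))
        eps = rng.uniform(0, (p[i] - p[j]) / 2)   # Robin Hood transfer
        pm = p.copy()
        pm[i] -= eps
        pm[j] += eps                              # pm is majorized by p (pm ≺ p)
        assert renyi(pm, q) >= renyi(p, q) - 1e-10
```

The check passes also for q > 1, where RE is no longer concave, which is precisely the point of the argument above.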
2. Invariance axiom —
For the discrete case, the permutation invariance axiom means that the entropy should be a symmetric function of the probabilities, which is indeed the case for RE since RE is Schur-concave. Let us recall that every concave and symmetric function is Schur-concave. The opposite implication is not true, but all Schur-concave functions (including RE) are symmetric (under permutation of their arguments) [25].

For continuous variables, one should use (2). The latter is manifestly invariant under the change of coordinate system x → y. Indeed, if y = ϕ(x) and ϕ is a bijective, differentiable function, then the well-known transformation rule for probability density functions [26] states that p_Y(y) = p_X(x) |det(∂ϕ^{−1}/∂y)|. By setting p(x) ≡ p_X(x) and q(x) ≡ q_X(x) and plugging this into (2), we see that the latter is invariant under the change x → y. One could even be more general and employ the Radon–Nikodym theorem [27]. With this, the Rényi divergence of order q from P to Q can be rewritten as

H_q(P\|Q) = \frac{1}{q-1} \ln\bigg[ \int \Big( \frac{dP}{dQ} \Big)^{q-1} dP \bigg],   (4)

where dP/dQ is the Radon–Nikodym derivative. In this formulation H_q(P‖Q) is manifestly coordinate-system independent.
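The coordinate invariance of (2) can also be verified numerically: transforming two densities with a bijective map y = e^x and recomputing the divergence on the new grid should leave the value unchanged up to discretization error. A minimal sketch, assuming NumPy is available; the Gaussian densities and the exponential map are illustrative choices:

```python
import numpy as np

def trapz(f, grid):
    # trapezoidal rule on a (possibly non-uniform) grid
    return float(np.sum((f[1:] + f[:-1]) * (grid[1:] - grid[:-1])) / 2)

def renyi_div(p, q_dens, grid, order):
    # H_order(P||Q) = ln( ∫ (p/q)^order q dx ) / (order - 1), cf. Eq. (2)
    return np.log(trapz((p / q_dens)**order * q_dens, grid)) / (order - 1)

x = np.linspace(-8.0, 8.0, 40001)
px = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # P = N(0, 1)
qx = np.exp(-(x - 1)**2 / 4) / np.sqrt(4 * np.pi)  # Q = N(1, 2)
y = np.exp(x)                                      # bijective map y = exp(x)
py, qy = px / y, qx / y                            # p_Y(y) = p_X(x) / |dy/dx|
for order in (0.5, 2.0):
    assert abs(renyi_div(px, qx, x, order) - renyi_div(py, qy, y, order)) < 1e-5
```

The two evaluations agree because the Jacobian factors cancel inside the ratio p/q, exactly as in the argument above.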
3. Subset independence axiom —
Here OB argue that RE does not fulfill the subset independence axiom. To support their claim they use the Livesey–Skilling criterion [28]: an inference (entropic) functional is consistent with the subset independence axiom if for j ≠ k ≠ l the following identity holds:

\frac{\partial}{\partial p_l} \Big( \frac{\partial}{\partial p_k} - \frac{\partial}{\partial p_j} \Big) \Big[ H - \alpha \sum_i p_i - \beta \sum_i E_i p_i \Big] = 0.   (5)

By using (5), OB show that RE does not fulfill this criterion and therefore does not conform with the subset independence axiom. It is, however, not difficult to see that (5) does not cover all possible configurations and, as it stands, is too restrictive. In fact, in [15] it has been shown that an entropic functional satisfying the first three SJ axioms must be of the form

S(P) = f\Big( \sum_i g(p_i) \Big),   (6)

where f is an arbitrary increasing function and g is an increasing concave function. The above entropic functionals are called sum-form entropies. The proof can be found in the Supplemental Material of [21]. Note that the explicit form of the function f in (6) does not influence the form of the distribution estimated by the MaxEnt principle. This can easily be seen by comparing two situations:

a) f(x) = ax (with a > 0), where the MaxEnt principle prescribes that we should maximize \sum_i g(p_i) subject to the constraints \sum_i p_i = 1 and \sum_i p_i E_i = E. This gives

g'(p_i) - \alpha - \beta E_i = 0,   (7)

and therefore

\alpha = \sum_i p_i g'(p_i) - \beta E,   (8)

\beta = \frac{g'(p_i) - \alpha}{E_i},   (9)

p_i = (g')^{-1}(\alpha + \beta E_i),   (10)

b) general increasing f, where the MaxEnt principle prescribes that we should maximize f(\sum_i g(p_i)) under the same constraints as above. In this case we have

C_f\, g'(p_i) - \alpha_f - \beta_f E_i = 0,   (11)

where C_f ≡ f'(\sum_i g(p_i)). This leads to

\alpha_f = C_f \sum_i p_i g'(p_i) - \beta_f E = C_f\, \alpha,   (12)

\beta_f = \frac{C_f\, g'(p_i) - \alpha_f}{E_i} = C_f\, \beta,   (13)

p_i = (g')^{-1}\Big( \frac{\alpha_f + \beta_f E_i}{C_f} \Big) = (g')^{-1}(\alpha + \beta E_i),   (14)

so the resulting MaxEnt distribution is indeed independent of f.
This defines equivalence classes of entropic functionals with the equivalence f(\sum_i g(p_i)) \sim \sum_i g(p_i).

Let us now go back to the criterion (5) and apply it to the class of entropic functionals (6). The difference of derivatives gives

\frac{\partial}{\partial p_l} \Big\{ C_f(P) \big[ g'(p_k) - g'(p_j) \big] - \beta (E_k - E_j) \Big\},   (15)

and the successive derivative with respect to p_l then yields

\frac{\partial C_f(P)}{\partial p_l} \big[ g'(p_k) - g'(p_j) \big] + C_f(P)\, \frac{\partial}{\partial p_l} \big[ g'(p_k) - g'(p_j) \big] = 0.   (16)

The second term on the LHS vanishes for l ≠ j, k. The first term is zero only when ∂C_f(P)/∂p_l = 0, which implies that f''(x) = 0, or equivalently f(x) = ax. Consequently, we see that the Livesey–Skilling criterion employed by OB can support only the trace-class entropies, i.e., entropic functionals of the form \sum_i g(p_i). On the other hand, since the function f does not change the resulting MaxEnt distribution, all sum-form entropies must be consistent with the subset independence axiom. The only quantities that change are the Lagrange parameters. The transform \sum_i g(p_i) \to f(\sum_i g(p_i)) can therefore be interpreted as a kind of gauge invariance of the MaxEnt principle.

Let us finally make two remarks regarding the aforementioned gauge invariance:

• Since the function f can be arbitrary, it is irrelevant whether the entropic functional is additive or not. Actually, by application of an appropriate function, one can impose the desired type of (generalized) additivity. For example, RE is additive, while the Tsallis entropy [9], S_q = \ln_q \exp H_q, is q-additive (here \ln_q is the q-deformed logarithm [29]) and the RE power [30], P_q = \exp H_q, is multiplicative.

• From the sum form (6) of the entropy it is also clear that concavity cannot be a necessary condition for uniqueness, since an increasing function of a concave function does not need to be concave. On the other hand, an increasing function of any Schur-concave function yields again a Schur-concave function [25, 31].
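The gauge invariance discussed above (the monotone envelope f rescales the Lagrange multipliers but leaves the MaxEnt distribution intact) can be illustrated numerically. A minimal sketch, assuming NumPy and SciPy are available; the kernel g(p) = sqrt(p), the envelope f(x) = x^3, and the energy levels are illustrative choices, not taken from the Comment:

```python
import numpy as np
from scipy.optimize import minimize

E = np.array([0.0, 1.0, 2.0, 3.0])    # illustrative energy levels
E0 = 1.2                              # prescribed mean energy
g = np.sqrt                           # sum-form kernel g(p) = p^{1/2} (increasing, concave)

cons = [{'type': 'eq', 'fun': lambda p: p.sum() - 1.0},
        {'type': 'eq', 'fun': lambda p: p @ E - E0}]
bnds = [(1e-9, 1.0)] * len(E)
p0 = np.full(len(E), 0.25)

# a) maximize sum_i g(p_i)                   (f(x) = x)
res1 = minimize(lambda p: -np.sum(g(p)), p0, bounds=bnds, constraints=cons)
# b) maximize f(sum_i g(p_i)) with f(x) = x^3 (increasing for x > 0)
res2 = minimize(lambda p: -np.sum(g(p))**3, p0, bounds=bnds, constraints=cons)

# same MaxEnt distribution; only the multipliers differ, cf. Eqs. (12)-(14)
assert np.allclose(res1.x, res2.x, atol=1e-4)
```

Both runs converge to the same four probabilities even though the objective values (and hence the Lagrange multipliers) differ.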
4. System independence axiom —
OB also conclude that RE does not fulfill the system independence axiom. Their result relies on the solution of the functional equation

g'''(p_{ij}) \frac{\partial p_{ij}}{\partial u_i} \frac{\partial p_{ij}}{\partial v_j} + g''(p_{ij}) \frac{\partial^2 p_{ij}}{\partial u_i \partial v_j} = 0,   (17)

where p_{ij} is the joint distribution of the whole system and u_i and v_j are the marginal distributions of two disjoint subsystems. At this point, OB follow the approach of Pressé et al. [18] and assume the joint distribution in the form p_{ij} = u_i v_j. From this, they end up with the equation x g'''(x) + g''(x) = 0, leading to g(x) = −x log x modulo multiplicative and additive constants, which corresponds to Shannon entropy. However, as already discussed in [17, 21], the assumption on the structure of the probability distribution goes well beyond the original idea of the consistency axioms, since the latter assume only a certain structure of the updating of information, not of the probability distribution itself. In Ref. [21] it was pointed out that this requirement can be ensured by assuming a stronger version of the system independence axiom, which can be formulated as follows: whenever two subsystems of a system are disjoint, we can treat the subsystems in terms of independent distributions. We note that this strong system independence axiom is fulfilled for many systems observed in nature, namely for systems whose state space scales exponentially. These systems typically have short-range interactions. In this case, the strong independence axiom allows one to bring Eq. (17) to the following form:

\frac{\partial p_{ij}}{\partial u_i} \frac{\partial p_{ij}}{\partial v_j} \Big/ \frac{\partial^2 p_{ij}}{\partial u_i \partial v_j} = p_{ij}.   (18)

On the other hand, this is apparently not the most general form of the relation between the joint and marginal distributions. It is beyond the scope of this Comment to investigate the general form of this relation, but let us just look at the case when the ratio is also linear but with some prefactor. In this case, we have

\frac{\partial p_{ij}}{\partial u_i} \frac{\partial p_{ij}}{\partial v_j} \Big/ \frac{\partial^2 p_{ij}}{\partial u_i \partial v_j} = a\, p_{ij},   (19)

for a close to 1.
This leads to the differential equation a x g'''(x) + g''(x) = 0, which has the solution g(x) \propto x^{2 - 1/a}. By denoting q = 2 − 1/a, i.e., a = 1/(2 − q), we end up with an entropic functional equivalent to RE. Eq. (19) corresponds to the composition rule of q-exponential distributions described in [21]:

p_{ij} \propto \Big( u_i^{q-1} + v_j^{q-1} - 1 \Big)^{1/(q-1)}.   (20)

The reader can easily check that (20) satisfies Eq. (17). As shown in [17] (see also the Supplemental Material in Ref. [21]), RE conforms with the system independence axiom.
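That the composition rule (20) reproduces the modified ratio (19) with a = 1/(2 − q) can be checked symbolically. A minimal sketch, assuming SymPy is available; the overall normalization of p_ij is dropped, which is harmless because both sides of (19) are homogeneous of degree one in p_ij:

```python
import sympy as sp

q, u, v = sp.symbols('q u v', positive=True)

# composition rule (20) for the joint distribution, normalization dropped
p = (u**(q - 1) + v**(q - 1) - 1)**(1 / (q - 1))

# left-hand side of Eq. (19): (dp/du)(dp/dv) / (d^2 p / du dv)
ratio = sp.diff(p, u) * sp.diff(p, v) / sp.diff(p, u, v)
a = 1 / (2 - q)

# spot-check the identity ratio = a * p at generic rational points
vals = {q: sp.Rational(3, 2), u: sp.Rational(1, 3), v: sp.Rational(2, 5)}
assert abs(float((ratio - a * p).subs(vals))) < 1e-12
```

The same substitution at other admissible points (such that u^{q−1} + v^{q−1} − 1 > 0) confirms the identity, consistent with the solution g(x) ∝ x^{2−1/a} quoted above.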
5. Maximality axiom — The strict Schur-concavity of RE automatically ensures that in the absence of any prior information, the uniform distribution must be the MaxEnt distribution. Indeed, H_q(P) ≥ H_q(Q) whenever P ≺ Q for any two n-dimensional distributions P and Q, and hence

H_q(1, 0, \ldots, 0) \le H_q(P) \le H_q(1/n, \ldots, 1/n),   (21)

for any P. This is a simple consequence of the observation that

\{1/n, \ldots, 1/n\} \prec P \prec \{1, 0, \ldots, 0\}.   (22)

Note that the lower limit in (21) is saturated only when P is a permutation of the pure-state distribution {1, 0, …, 0}, and the upper limit is saturated only for the maximally-mixed-state (uniform) distribution P = {1/n, …, 1/n}. So RE has a strict global maximum at the uniform distribution. This completes the proof. Relations (21)-(22) also nicely bolster the usual interpretation of entropy: the larger the entropy, the more uniform the distribution.

The issue of escort constraints. — Let us also briefly comment on the use of the escort averages \sum_i \rho_q(p_i) E_i = E_q in the MaxEnt prescription. It should first be stressed that the whole framework of the SJ consistency axioms has been invented for the case of linear constraints \sum_i p_i E_i = E. Its extension to more general types of constraints, such as escort averages, remains an open problem. Therefore, it is not possible to apply the original SJ criteria to that situation. One possible way to overcome the usage of escort means is to change the probability distribution to the escort distribution, so that the constraints become linear. In this case, one can also formulate the entropy in terms of the escort distributions p_i = \rho_i^{1/q} / \sum_j \rho_j^{1/q}, which leads to the maximization of a functional equivalent to the Landsberg (also called homogeneous) entropy functional [32, 33]

S^H_q(\rho) = \frac{\big( \sum_i \rho_i^{1/q} \big)^{-q} - 1}{1 - q}.   (23)

When the whole framework is formulated in terms of escort probabilities with linear constraints, we can employ the SJ axioms, but formulated in terms of escort distributions. According to [21], Eq. (23) then belongs to the class of Uffink's entropic functionals (i.e., the class that is consistent with the SJ axioms). This fact alone, however, does not ensure that the SJ axioms will also be valid in the original picture. In particular, a problem might arise in connection with the subset independence axiom, since the de Finetti–Kolmogorov relation p_{ij} = u_i v_{j|i} does not generally hold for the corresponding escort-distribution counterparts [34, 35].

Conclusions. — In this Comment we have analyzed the recent claim of T. Oikonomou and B. Bagci [1] that Rényi entropy is ill-suited for inference purposes because it is not consistent with the Shore–Johnson axioms. By carefully examining the Oikonomou–Bagci arguments we have, however, noticed that there are several issues in their reasoning that need to be clarified. When the latter are properly accounted for, we find that there is no contradiction with the SJ desiderata. This conclusion should also be expected on more general grounds. Namely, Rényi entropy is known to be a bona fide member of the so-called Uffink class of entropic functionals [17, 21], which is the most general class of inference functionals satisfying the SJ axioms [21].

An important ingredient in our reasoning was the strict Schur-concavity of Rényi entropy. In fact, the language of majorization and (strict) Schur-concavity is very natural in the context of entropies, since many processes in physics occur in the direction of the majorization arrow (because the passage of time tends to make things more uniform), and Schur-concave entropies grasp this behavior via their non-decreasing evolution.
Acknowledgments. — P.J. and J.K. were supported by the Czech Science Foundation (GAČR), Grant 19-16066S. J.K. was also supported by the Austrian Science Foundation (FWF) under project I3073.

[1] T. Oikonomou and G.B. Bagci, Phys. Rev. E 99 (2019) 032134.
[2] E.T. Jaynes, Phys. Rev. 106 (1957) 620.
[3] E.T. Jaynes, Phys. Rev. 108 (1957) 171.
[4] S. Thurner, P. Klimek and R. Hanel, Introduction to the Theory of Complex Systems (Oxford University Press, Oxford, 2018).
[5] see, e.g., Entropy Measures, Maximum Entropy Principle and Emerging Applications, Karmeshu (Ed.) (Springer-Verlag, Berlin, 2003).
[6] see, e.g., C. Tsallis, Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World (Springer, New York, 2009).
[7] C. Beck and F. Schlögl, Thermodynamics of Chaotic Systems: An Introduction (Cambridge University Press, Cambridge, 1993).
[8] P. Jizba and T. Arimitsu, Ann. Phys. 312 (2004) 17.
[9] C. Tsallis, J. Stat. Phys. 52 (1988) 479.
[10] J. Havrda and F. Charvát, Kybernetika 3 (1967) 30.
[11] G. Kaniadakis, Physica A 365 (2006) 17.
[12] B.D. Sharma, J. Mitter and M. Mohan, Inf. Control 39 (1978) 323.
[13] R. Hanel and S. Thurner, Europhys. Lett. 93 (2011) 20006.
[14] P. Jizba and J. Korbel, Physica A 444 (2016) 808.
[15] J.E. Shore and R.W. Johnson, IEEE Trans. Inf. Theory 26 (1980) 26.
[16] J.E. Shore and R.W. Johnson, IEEE Trans. Inf. Theory 27 (1981) 472.
[17] J. Uffink, Stud. Hist. Phil. Mod. Phys. 26 (1995) 223.
[18] S. Pressé, K. Ghosh, J. Lee and K.A. Dill, Phys. Rev. Lett. 111 (2013) 180604.
[19] C. Tsallis, Entropy 17 (2015) 2853.
[20] S. Pressé, K. Ghosh, J. Lee and K.A. Dill, Entropy 17 (2015) 5043.
[21] P. Jizba and J. Korbel, Phys. Rev. Lett. 122 (2019) 120601.
[22] Z. Puchała, Ł. Rudnicki and K. Życzkowski, J. Phys. A: Math. Theor. 46 (2013) 272002.
[23] I. Bengtsson and K. Życzkowski, Geometry of Quantum States: An Introduction to Quantum Entanglement (Cambridge University Press, Cambridge, 2008).
[24] A.W. Roberts and D.E. Varberg, Pure and Applied Mathematics 57 (1973) 12.
[25] A.W. Marshall, I. Olkin and B.C. Arnold, Inequalities: Theory of Majorization and Its Applications (Springer, London, 2011).
[26] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. II (John Wiley, London, 1966).
[27] G.E. Shilov and B.L. Gurevich, Integral, Measure, and Derivative: A Unified Approach (Dover Publications, New York, 1978).
[28] A.K. Livesey and J. Skilling, Acta Cryst. A41 (1985) 113.
[29] E.P. Borges, Physica A 340 (2004) 95.
[30] P. Jizba, Y. Ma, A. Hayes and J.A. Dunningham, Phys. Rev. E 93 (2016) 060104(R).
[31] V. Čuljak, I. Franjić, R. Ghulam and J. Pečarić, J. Inequal. Appl. (2011) 581918.
[32] J.F. Lutsko, J.P. Boon and P. Grosfils, Europhys. Lett. 86 (2009) 40005.
[33] P.T. Landsberg, Entropies Galore!, Braz. J. Phys. (1999) 46.
[34] P. Jizba and J. Korbel, Entropy 19 (2017) 605.
[35] P. Jizba and J. Korbel, Physica A 468 (2017) 238.
[36] Notice that the relative RE exists in two versions (both proposed by Rényi himself). In our reasoning here we stick to the first Rényi version, which is also employed in the OB paper.
[37] In general, when a given function H is strictly Schur-concave, this means that for two probability vectors satisfying the majorization relation p ≺ q (i.e., \sum_{i=1}^{k} p_{(i)} \le \sum_{i=1}^{k} q_{(i)} for k ∈ {1, …, n−1}, where p_{(i)} and q_{(i)} are the distribution components ordered in descending order), one has the inequality H(p) ≥ H(q), with H(p) = H(q) only if p is a permutation of q.