Divergence and Sufficiency for Convex Optimization
Peter Harremoës

April 11, 2017
Abstract
Logarithmic score and information divergence appear in information theory, statistics, statistical mechanics, and portfolio theory. We demonstrate that all these topics involve some kind of optimization that leads directly to regret functions, and such regret functions are often given by a Bregman divergence. If the regret function also fulfills a sufficiency condition, it must be proportional to information divergence. We will demonstrate that sufficiency is equivalent to the apparently weaker notion of locality and that it is also equivalent to the apparently stronger notion of monotonicity. These sufficiency conditions have quite different relevance in the different areas of application, and often they are not fulfilled. Therefore sufficiency conditions can be used to explain when results from one area can be transferred directly to another and when one will experience differences.
One of the main purposes of information theory is to compress data so that data can be recovered exactly or approximately. One of the most important quantities was called entropy because it is calculated according to a formula that mimics the calculation of entropy in statistical mechanics. Another key concept in information theory is information divergence (KL-divergence), which is defined for probability vectors P and Q as

D(P‖Q) = Σ_x P(x) · ln(P(x)/Q(x)).

It was introduced by Kullback and Leibler in 1951 in a paper entitled On Information and Sufficiency Kullback and Leibler (1951). The link from information theory back to statistical physics was developed by E. T. Jaynes via the maximum entropy principle Jaynes (1957, 1989). The link back to statistics is now well established Liese and Vajda (1987); Barron et al. (1998); Csiszár and Shields (2004); Grünwald and Dawid (2004); Grünwald (2007).

Related quantities appear in information theory, statistics, statistical mechanics, and finance, and we are interested in a theory that describes when these relations are exact and when they just work by analogy. First we introduce some general results about optimization on state spaces of finite dimensional C*-algebras. This part applies exactly to all the topics under consideration and leads to Bregman divergences. Secondly, we introduce several notions of sufficiency and show that they lead to information divergence. This second step is not always applicable, which explains when the different topics are really different.

Our knowledge about a system will be represented by a state space. In many cases the state space is given by a set of probability distributions on a sample space. In such cases the state space is a simplex, but it is well-known that the state space is not a simplex in quantum physics. For applications in quantum physics the state space is often represented by a set of density matrices, i.e.
positive semidefinite complex matrices with trace 1. In some cases the states are represented as elements of a finite dimensional C*-algebra, which is a direct sum of matrix algebras. A finite dimensional C*-algebra that is a direct sum of 1 × 1 matrix algebras contains the classical probability distributions as special cases.

The extreme points in the set of states are the pure states. The pure states of a C*-algebra can be identified with projections of rank 1. Two density matrices s₀ and s₁ are said to be orthogonal if s₀s₁ = s₁s₀ = 0. Any state s has a decomposition

s = Σ_i λ_i · s_i

where the s_i are orthogonal pure states. Such a decomposition is not unique, but for a finite dimensional C*-algebra the coefficients λ₁, λ₂, ..., λ_n are unique and are called the spectrum of the state.

Sometimes more general state spaces are of interest. In generalized probabilistic theories a state space is a convex set where mixtures are defined by randomly choosing certain states with certain probabilities Holevo (1982); Krumm et al. (2016). A convex set where all orthogonal decompositions of a state have the same spectrum is called a spectral state space. Much of the theory in this paper can be generalized to spectral sets. The most important spectral sets are sets of positive elements with trace 1 in Jordan algebras. For questions related to the foundations of quantum theory the Jordan algebras and other spectral sets give new insight Barnum et al. (2014); Harremoës (2016, 2017), but in this paper we will restrict our attention to states on finite dimensional C*-algebras. Nevertheless some of the theorems and proofs are stated in such a way that they hold for more general state spaces.

Let S denote the state space of a finite dimensional C*-algebra and let A denote a set of self-adjoint operators. Each a ∈ A is identified with a real valued measurement.
The elements of A may represent feasible actions (decisions) that lead to a payoff like the score of a statistical decision, the energy extracted by a certain interaction with the system, (minus) the length of a codeword of the next encoded input letter using a specific code book, or the revenue of using a certain portfolio. For each s ∈ S the mean value of the measurement a ∈ A is given by

⟨a, s⟩ = tr(as).

In this way the set of actions may be identified with a subset of the dual space of S. Next we define

F(s) = sup_{a∈A} ⟨a, s⟩.

We note that F is convex, but F need not be strictly convex. In principle F(s) may be infinite, but we will assume that F(s) < ∞ for all states s. We also note that F is lower semi-continuous. In this paper we will assume that the function F is continuous. The assumption that F is a real valued continuous function is fulfilled in all the applications we consider. If s is a state and a ∈ A is an action, then we say that a is optimal for s if ⟨a, s⟩ = F(s). A sequence of actions a_n ∈ A is said to be asymptotically optimal for the state s if ⟨a_n, s⟩ → F(s) for n → ∞.

If the a_i are actions and (t_i) is a probability vector, then we may define the mixed action Σ t_i · a_i as the action where we do the action a_i with probability t_i. We note that ⟨Σ t_i · a_i, s⟩ = Σ t_i · ⟨a_i, s⟩. We will assume that all such mixtures of feasible actions are also feasible. If ⟨a₂, s⟩ ≥ ⟨a₁, s⟩ for all states s we say that a₂ dominates a₁, and if ⟨a₂, s⟩ > ⟨a₁, s⟩ for all states s we say that a₂ strictly dominates a₁. All actions that are dominated may be removed from A without changing the function F. Let A_F denote the set of self-adjoint operators (observables) a such that ⟨a, s⟩ ≤ F(s) for all states s. Then F(s) = sup_{a∈A_F} ⟨a, s⟩. Therefore we may replace A by A_F without changing the optimization problem.

In the definition of regret we follow Savage Savage (1951) but with different notation.

Definition 1.
Let F denote a convex function on the state space S. If F(s) is finite, the regret of the action a is defined by

D_F(s, a) = F(s) − ⟨a, s⟩.   (1)

Proposition 1.
The regret D_F of actions has the following properties:

• D_F(s, a) ≥ 0 with equality if a is optimal for s.
• s → D_F(s, a) is a convex function.
• If ā is optimal for the state s̄ = Σ t_i · s_i, where (t₁, t₂, ..., t_ℓ) is a probability vector, then

Σ t_i · D_F(s_i, a) = Σ t_i · D_F(s_i, ā) + D_F(s̄, a).

• Σ t_i · D_F(s_i, a) is minimal if a is optimal for s̄ = Σ t_i · s_i.

If the state is s₀ but one acts as if the state were s₁, one may compare what one achieves with what could have been achieved. If the state s₁ has a unique optimal action a₁, we may simply define the regret of the pair of states by

D_F(s₀, s₁) = D_F(s₀, a₁).

The following definition leads to a regret function that is essentially equivalent to the so-called generalized Bregman divergences defined by Kiwiel Kiwiel (1997a,b).

Definition 2.
Let F denote a convex function on the state space S. If F(s₀) is finite, then we define the regret of the state pair as

D_F(s₀, s₁) = inf_{(a_n)} lim_{n→∞} D_F(s₀, a_n)

where the infimum is taken over all sequences of actions (a_n) that are asymptotically optimal for s₁.

Figure 1: The regret equals the vertical distance between curve and tangent.
With this definition the regret is always defined with values in [0, ∞]. We note that with this definition the value of the regret D_F(s₀, s₁) only depends on the restriction of the function F to the line segment from s₀ to s₁. Let f denote the function f(t) = F((1 − t)·s₁ + t·s₀) where t ∈ [0, 1]. Then

D_F(s₀, s₁) = f(1) − (f(0) + f′₊(0))   (2)

where f′₊(0) denotes the right derivative of f at t = 0. Equation (2) is even valid when the regret is infinite if we allow the right derivative to take the value −∞.

If the state s₁ has the unique optimal action a₁ ∈ A then

F(s₀) = D_F(s₀, s₁) + ⟨a₁, s₀⟩   (3)

so the function F can be reconstructed from D_F except for an affine function of s₀. The closure of the convex hull of the set of functions s → ⟨a, s⟩ is uniquely determined by the convex function F. The following proposition follows from Alexandrov's theorem. See (Rockafellar, 1970, Theorem 25.5) for details.
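Equation (2) only involves the one-dimensional restriction of F, so it is easy to check numerically. The following sketch is my own illustration (not from the paper): it takes the interval [0, 1] as a toy state space with the assumed choice F(s) = s², computes the regret from f(1) − (f(0) + f′₊(0)), and compares it with the Bregman divergence (s₀ − s₁)² that this F generates.

```python
# Toy illustration: state space [0, 1], F(s) = s^2 (an arbitrary assumed choice).
# For differentiable F, Definition 2 reduces to
# D_F(s0, s1) = f(1) - (f(0) + f'_+(0)) with f(t) = F((1 - t) s1 + t s0).

def F(s):
    return s * s

def regret(s0, s1, h=1e-7):
    f = lambda t: F((1 - t) * s1 + t * s0)
    right_derivative = (f(h) - f(0.0)) / h  # numerical right derivative at t = 0
    return f(1.0) - (f(0.0) + right_derivative)

s0, s1 = 0.8, 0.3
d = regret(s0, s1)
# For F(s) = s^2 the Bregman divergence is exactly (s0 - s1)^2.
print(d, (s0 - s1) ** 2)
```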
Proposition 2.
A convex function on a finite dimensional convex set is differentiable almost everywhere with respect to the Lebesgue measure.
A state s₁ where F is differentiable has a unique optimal action. Therefore Equation (3) holds for almost any state s₁. In particular the function F can be reconstructed from D_F except for an affine function.

Proposition 3.
The regret D_F of states has the following properties:

• D_F(s₀, s₁) ≥ 0 with equality if there exists an action a that is optimal for both s₀ and s₁.
• s₀ → D_F(s₀, s₁) is a convex function.

Further, the following two conditions are equivalent:

• D_F(s₀, s₁) = 0 implies s₀ = s₁.
• The function F is strictly convex.

We say that a regret function D_F is strict if F is strictly convex. The two last properties of Proposition 1 do not carry over to regret for states except if the regret is a Bregman divergence as defined below. The regret is called a
Bregman divergence if it can be written in the following form

D_F(s₀, s₁) = F(s₀) − (F(s₁) + ⟨s₀ − s₁, ∇F(s₁)⟩)   (4)

where ⟨·, ·⟩ denotes the (Hilbert-Schmidt) inner product. In the context of forecasting and statistical scoring rules the use of Bregman divergences dates back to Hendrickson and Buehler (1971). A similar but less general definition of regret was given by Rao and Nayak Rao and Nayak (1985), where the name cross entropy was proposed. Although Bregman divergences have been known for many years, they did not gain popularity before the paper Banerjee et al. (2005), where a systematic study of Bregman divergences was presented.

We note that if D_F is a Bregman divergence and s₁ minimizes F, then ∇F(s₁) = 0, so that the formula for the Bregman divergence reduces to

D_F(s₀, s₁) = F(s₀) − F(s₁).

Bregman divergences satisfy the
Bregman identity

Σ t_i · D_F(s_i, s₀) = Σ t_i · D_F(s_i, s̄) + D_F(s̄, s₀),   (5)

but if F is not differentiable this identity can be violated.

Example 1.
Let the state space be the interval [0, 1] with two actions given by ⟨a₁, s⟩ = 1 − s and ⟨a₂, s⟩ = 2s − 1/2. Let s₁ = 0 and s₂ = 1. Let further t₁ = 1/3 and t₂ = 2/3. Then s̄ = 2/3. If s₀ = 1/2 then

Σ t_i · D_F(s_i, s₀) = 0,

but

Σ t_i · D_F(s_i, s̄) = (1/3) · (⟨a₁, 0⟩ − ⟨a₂, 0⟩) + (2/3) · (⟨a₂, 1⟩ − ⟨a₂, 1⟩)
= (1/3) · (1 − (−1/2)) = 1/2.

Clearly the Bregman identity (5) is violated, and Σ t_i · D_F(s_i, s₀) will increase if s₀ is replaced by s̄. The following proposition is easily proved.
Proposition 4.
For a convex and continuous function F the following conditions are equivalent.

• The function F is differentiable.
• The regret D_F is a Bregman divergence.
• The Bregman identity is always satisfied.
• For any probability vector (t₁, t₂, ..., t_n) the sum Σ t_i · D_F(s_i, s₀) is always minimal when s₀ = Σ t_i · s_i.

In this section we shall see how regret functions are defined in some applications.

Information theory

We recall that a code is uniquely decodable if any finite sequence of input symbols gives a unique sequence of output symbols. It is well-known that a uniquely decodable code satisfies Kraft's inequality

Σ_{a∈A} β^(−ℓ(a)) ≤ 1

where ℓ(a) denotes the length of the codeword corresponding to the input symbol a ∈ A and β denotes the size of the output alphabet B. Here the length of a codeword is an integer. If P = (p_a)_{a∈A} is a probability vector over the input alphabet, then the mean code-length is

Σ_{a∈A} ℓ(a) · p_a.

Our goal is to minimize the expected code-length. Here the state space consists of probability distributions over the input alphabet and the actions are code-length functions. Shannon established the inequality

−Σ_{a∈A} log_β(p_a) · p_a ≤ min_ℓ Σ_{a∈A} ℓ(a) · p_a ≤ −Σ_{a∈A} log_β(p_a) · p_a + 1.

It is a combinatorial problem to find the optimal code-length function. In the simplest case with a binary output alphabet the optimal code-length function is determined by the Huffman algorithm.

A code-length function dominates another code-length function if every letter gets a shorter code-length. If a code-length function is not dominated by another code-length function, then for all a ∈ A the length is bounded by ℓ(a) ≤ |A| − 1. For fixed alphabets A and B there exist only a finite number of code-length functions ℓ that satisfy Kraft's inequality and are not dominated by other code-length functions satisfying Kraft's inequality.

Statistics

The use of scoring rules has a long history in statistics.
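Shannon's bounds on the mean code-length are easy to verify numerically. The sketch below is my own illustration, assuming a binary output alphabet (β = 2) and an arbitrary input distribution: it builds the code lengths ℓ(a) = ⌈−log₂ p_a⌉, checks Kraft's inequality, and checks that using lengths matched to a wrong distribution Q costs exactly D(P‖Q) extra bits when the integer constraint on lengths is ignored.

```python
import math

P = [0.5, 0.25, 0.125, 0.125]                     # probability vector over the input alphabet
lengths = [math.ceil(-math.log2(p)) for p in P]   # Shannon code lengths l(a) = ceil(-log2 p_a)

kraft = sum(2.0 ** -l for l in lengths)           # Kraft's inequality: must be <= 1
H = -sum(p * math.log2(p) for p in P)             # entropy in bits
mean_length = sum(p * l for p, l in zip(P, lengths))

# Code lengths matched to a wrong distribution Q cost D(P||Q) extra bits
# (ignoring the integer constraint on code lengths).
Q = [0.25, 0.25, 0.25, 0.25]
mismatched = sum(p * -math.log2(q) for p, q in zip(P, Q))
kl_bits = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(kraft, H, mean_length, mismatched - H, kl_bits)
```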
An early contribution was the idea of minimizing the sum of squared deviations, which dates back to Gauss and works perfectly for Gaussian distributions. In the 1920s Ramsey and de Finetti proved versions of the Dutch book theorem where determination of probability distributions was considered as a dual problem of maximizing a payoff function. Later it was proved that any consistent inference procedure corresponds to optimizing with respect to some payoff function. A more systematic study of scoring rules was given by McCarthy McCarthy (1956).

Consider an experiment with X = {1, 2, ..., ℓ} as sample space. A scoring rule f is defined as a function X × M¹₊(X) → R such that the score is f(x, Q) when a prediction has been given in terms of a probability distribution Q and x ∈ X has been observed. A scoring rule is proper if for any probability measure P ∈ M¹₊(X) the score Σ_{x∈X} P(x) · f(x, Q) is maximal when Q = P. Here the state space consists of probability distributions over X and the actions are predictions over X, which are also probability distributions over X.

There is a correspondence between proper scoring rules and Bregman divergences, as explained in Gneiting and Raftery (2007); Ovcharov (2015). If D_F is a Bregman divergence and g is a function with domain X, then f given by

f(x, Q) = g(x) − D_F(δ_x, Q)

defines a scoring rule. Assume that f is a proper scoring rule. Then a function F can be defined as

F(P) = Σ_{x∈X} P(x) · f(x, P).

This leads to the regret function

D_F(P, Q) = F(P) − Σ_{x∈X} P(x) · f(x, Q).   (7)

Since f is assumed to be proper, D_F(P, Q) ≥ 0. The Bregman identity (5) follows by straightforward calculations. With these two results we see that the regret function D_F is a Bregman divergence and that

D_F(δ_y, Q) = Σ_{x∈X} δ_y(x) · f(x, δ_y) − Σ_{x∈X} δ_y(x) · f(x, Q)
= f(y, δ_y) − f(y, Q).   (8)

Hence a proper scoring rule f has the form f(x, Q) = g(x) − D_F(δ_x, Q) where g(x) = f(x, δ_x). A strictly proper scoring rule can be defined as a proper scoring rule where the corresponding Bregman divergence is strict.

Example 2.
The Brier score is given by

f(x, Q) = 1 − Σ_{y∈X} (Q(y) − δ_x(y))².

The Brier score is generated by the strictly convex function F(P) = Σ_{x∈X} P(x)².

Statistical mechanics

Thermodynamics is the study of concepts like heat, temperature and energy. A major objective is to extract as much energy from a system as possible. The idea in statistical mechanics is to view the macroscopic behavior of a thermodynamic system as a statistical consequence of the interaction between a lot of microscopic components, where the interaction between the components is governed by very simple laws. Here the central limit theorem and large deviation theory play a major role. One of the main achievements is the formula for entropy as the logarithm of a probability.

Here we shall restrict the discussion to the most simple kind of thermodynamic system from which we want to extract energy. We may think of a system of non-interacting spin particles in a magnetic field. For such a system the Hamiltonian is given by

Ĥ(σ) = −μ Σ_j h_j σ_j

where σ is the spin configuration, μ is the magnetic moment, h_j is the strength of an external magnetic field, and σ_j = ±1 is the spin of the j'th particle. If the system is in thermodynamic equilibrium the configuration probability is

P_β(σ) = exp(−β Ĥ(σ)) / Z(β)

where Z(β) is the partition function

Z(β) = Σ_σ exp(−β Ĥ(σ)).

Here β is the inverse temperature (kT)⁻¹ of the spin system and k = 1.38 · 10⁻²³ J/K is Boltzmann's constant. The mean energy is given by

Σ_σ P_β(σ) Ĥ(σ),

which will be identified with the internal energy U defined in thermodynamics. The Shannon entropy can be calculated as

−Σ_σ P_β(σ) ln P_β(σ) = −Σ_σ P_β(σ) ln [exp(−β Ĥ(σ)) / Z(β)]
= −Σ_σ P_β(σ) (−β Ĥ(σ) − ln Z(β))
= β · U + ln Z(β).
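The identity S/k = β · U + ln Z(β) holds exactly for any finite spin system and can be checked by brute-force enumeration of the configurations. The sketch below is my own illustration; the field strengths h_j, the choice μ = 1, and β = 0.7 are arbitrary assumptions made for the example.

```python
import math
from itertools import product

mu = 1.0
h = [0.5, 1.0, 1.5]      # external field at three spin sites (arbitrary choice)
beta = 0.7               # inverse temperature in natural units (arbitrary choice)

def energy(sigma):
    # Hamiltonian of non-interacting spins: H(sigma) = -mu * sum_j h_j sigma_j
    return -mu * sum(hj * sj for hj, sj in zip(h, sigma))

configs = list(product([-1, 1], repeat=len(h)))
Z = sum(math.exp(-beta * energy(c)) for c in configs)       # partition function
P = [math.exp(-beta * energy(c)) / Z for c in configs]      # Gibbs distribution
U = sum(p * energy(c) for p, c in zip(P, configs))          # internal energy
S = -sum(p * math.log(p) for p in P)                        # Shannon entropy (nats)

print(S, beta * U + math.log(Z))
```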
The Shannon entropy times k will be identified with the thermodynamic entropy S.

The amount of energy that can be extracted from the system if a heat bath is available is called the exergy Gundersen (2011). We assume that the heat bath has temperature T₀ and that the internal energy and entropy of the system are U₀ and S₀ if the system has been brought in equilibrium with the heat bath. The exergy can be calculated as

Ex = U − U₀ − T₀ (S − S₀)
= U − U₀ − kT₀ (β · U + ln Z(β) − β₀ · U₀ − ln Z(β₀))
= kT₀ ((β₀ − β) · U + ln (Z(β₀)/Z(β))).

The information divergence between the actual state and the corresponding state that is in equilibrium with the environment is

D(P_β ‖ P_{β₀}) = Σ_σ P_β(σ) ln (P_β(σ)/P_{β₀}(σ))
= Σ_σ P_β(σ) ln [ (exp(−β Ĥ(σ))/Z(β)) / (exp(−β₀ Ĥ(σ))/Z(β₀)) ]
= Σ_σ P_β(σ) ((β₀ − β) Ĥ(σ) + ln (Z(β₀)/Z(β)))
= (β₀ − β) · U + ln (Z(β₀)/Z(β)).

Hence

Ex = kT₀ · D(P_β ‖ P_{β₀}).

This equation appeared already in Harremoës (1993).

Portfolio theory

The relation between information theory and gambling was established by Kelly Kelly (1956). Logarithmic terms appear because we are interested in the exponent of the exponential growth rate of our wealth. Later Kelly's approach has been generalized to trading of stocks, although the relation to information theory is weaker Cover and Thomas (1991).

Let X₁, X₂, ..., X_k denote the price relatives for a list of k assets. For instance X₅ = 1.04 means that asset no. 5 increases its value by 4%. Such price relatives are mapped into a price relative vector X⃗ = (X₁, X₂, ..., X_k).

Example 3.
A special asset is the safe asset, where the price relative is 1 for any possible price relative vector. Investing in this asset corresponds to placing the money at a safe place with interest rate equal to 0%.

A portfolio is a probability vector b⃗ = (b₁, b₂, ..., b_k), where b_i denotes the fraction of the wealth invested in asset no. i. The price relative of the portfolio b⃗ is

X₁ · b₁ + X₂ · b₂ + ... + X_k · b_k = ⟨X⃗, b⃗⟩.

The original assets may be considered as extreme points in the set of portfolios. If an asset has the property that the price relative is only positive for one of the possible price relative vectors, then we may call it a gambling asset.

We now consider a situation where the assets are traded once every day. For a sequence of price relative vectors X⃗₁, X⃗₂, ..., X⃗_n and a constant re-balancing portfolio b⃗ the wealth after n days is

S_n = ∏_{i=1}^n ⟨X⃗_i, b⃗⟩   (9)
= exp( Σ_{i=1}^n ln ⟨X⃗_i, b⃗⟩ )   (10)
= exp( n · E[ln ⟨X⃗, b⃗⟩] )   (11)

where the expectation is taken with respect to the empirical distribution of the price relative vectors. Here E[ln ⟨X⃗, b⃗⟩] is proportional to the doubling rate and is denoted W(b⃗, P), where P indicates the probability distribution of X⃗. Our goal is to maximize W(b⃗, P) by choosing an appropriate portfolio b⃗.

Definition 3.
Let b⃗₁ and b⃗₂ denote two portfolios. We say that b⃗₂ dominates b⃗₁ if ⟨X⃗_j, b⃗₂⟩ ≥ ⟨X⃗_j, b⃗₁⟩ for any possible price relative vector X⃗_j, j = 1, 2, ..., n. We say that b⃗₂ strictly dominates b⃗₁ if ⟨X⃗_j, b⃗₂⟩ > ⟨X⃗_j, b⃗₁⟩ for any possible price relative vector X⃗_j, j = 1, 2, ..., n. A set A of assets is said to dominate the set of assets B if any asset in B is dominated by a portfolio of assets in A.

The maximal doubling rate does not change if dominated assets are removed. Sometimes assets that are dominated but not strictly dominated may lead to non-uniqueness of the optimal portfolio. Let b⃗_P denote a portfolio that is optimal for P and define

G(P) = W(b⃗_P, P).   (12)

The regret of choosing a portfolio that is optimal for Q when the distribution is P is given by the regret function

D_G(P, Q) = W(b⃗_P, P) − W(b⃗_Q, P).   (13)

If b⃗_Q is not uniquely determined, we take a minimum over all b⃗ that are optimal for Q.

Figure 2: G for the price relative vectors in Example 4.

Example 4.
Assume that the price relative vector is (2, 1/2) with probability 1 − t and (1/2, 2) with probability t. Then the portfolio concentrated on the first asset is optimal for t ≤ 1/5 and the portfolio concentrated on the second asset is optimal for t ≥ 4/5. For values of t between 1/5 and 4/5 the optimal portfolio invests money in both assets, as illustrated in Figure 2.

Lemma 1.
If there are only two price relative vectors and the regret function is strict, then either one of the assets dominates all other assets, or two of the assets are orthogonal gambling assets that dominate all other assets.

Proof.
We will assume that no assets are dominated by other assets. Let

X⃗ = (X₁, X₂, ..., X_k),   Y⃗ = (Y₁, Y₂, ..., Y_k)

denote the two price relative vectors. Without loss of generality we may assume that

X₁/Y₁ ≥ X₂/Y₂ ≥ ... ≥ X_k/Y_k.

If X_i/Y_i = X_{i+1}/Y_{i+1} then X_i/X_{i+1} = Y_i/Y_{i+1}, so that if X_i ≤ X_{i+1} then Y_i ≤ Y_{i+1} and asset i is dominated by asset i + 1. Since we have assumed that no assets are dominated we may assume that

X₁/Y₁ > X₂/Y₂ > ... > X_k/Y_k.

If P = (1 − t, t) is a probability vector over the two price relative vectors then, according to Cover and Thomas (1991), the portfolio b⃗ = (b₁, b₂, ..., b_k) is optimal if and only if

(1 − t) · X_i/(b₁X₁ + ... + b_kX_k) + t · Y_i/(b₁Y₁ + ... + b_kY_k) ≤ 1

for all i ∈ {1, 2, ..., k}, with equality if b_i > 0. Assume that the portfolio b⃗ = δ_j is optimal. Then

(1 − t) · X_{j+1}/X_j + t · Y_{j+1}/Y_j ≤ 1  ⇔  t ≤ (X_j Y_j − X_{j+1} Y_j) / (X_j Y_{j+1} − X_{j+1} Y_j).   (14)

Similarly

(1 − t) · X_{j−1}/X_j + t · Y_{j−1}/Y_j ≤ 1  ⇔  t ≥ (X_{j−1} Y_j − X_j Y_j) / (X_{j−1} Y_j − X_j Y_{j−1}).   (15)

We have to check that

(X_{j−1} Y_j − X_j Y_j) / (X_{j−1} Y_j − X_j Y_{j−1}) < (X_j Y_j − X_{j+1} Y_j) / (X_j Y_{j+1} − X_{j+1} Y_j),

which is equivalent to the positivity of the determinant

| X_{j+1} − X_j   X_{j−1} − X_j |
| Y_{j+1} − Y_j   Y_{j−1} − Y_j |

and this determinant is positive because asset j is not dominated by a portfolio based on assets j − 1 and j + 1.

We see that the portfolio concentrated on asset j is optimal for t in an interval of positive length, and the regret between distributions in such an interval will be zero. In particular the regret will not be strict. Strictness of the regret function is only possible if there are only two assets and if a portfolio concentrated on one of these assets is only optimal for a single probability measure. According to the formulas for the endpoints of the intervals (14) and (15), this is only possible if the assets are gambling assets.

Theorem 1.
If the regret function is strict, it equals information divergence, i.e.

D_G(P, Q) = D(P ‖ Q).   (16)

Proof.
If the regret function is strict, then it is also strict when we restrict to two price relative vectors. Therefore any two price relative vectors are orthogonal gambling assets. If the assets are orthogonal gambling assets, we get the type of gambling described by Kelly Kelly (1956). For gambling, Equation (16) can easily be derived Cover and Thomas (1991).

Sufficiency Conditions
In this section we will introduce some conditions on a regret function. Under some mild conditions they turn out to be equivalent.
Theorem 2.
Let D_F denote a regret function based on a continuous and convex function F defined on the state space of a finite dimensional C*-algebra. If the state space has at least three orthogonal states, then the following conditions are equivalent.

• The function F equals entropy times a negative constant plus an affine function.
• The regret D_F is proportional to information divergence.
• The regret is monotone.
• The regret satisfies sufficiency.
• The regret is local.
In the rest of this section we will describe each of these equivalent conditions and prove that they are actually equivalent. The theorems and proofs will be stated so that they hold even for more general state spaces than the ones considered in this paper.
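Before going through the conditions one by one, the first two items of Theorem 2 can be illustrated numerically on a probability simplex: taking F equal to negative Shannon entropy, the Bregman divergence (4) reproduces information divergence, and applying a stochastic map never increases it. The sketch below is my own illustration; the distributions and the transition matrix are arbitrary choices.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def bregman_neg_entropy(p, q):
    # D_F(p, q) = F(p) - F(q) - <p - q, grad F(q)> with F(p) = sum_i p_i ln p_i
    F = lambda r: sum(ri * math.log(ri) for ri in r)
    grad = [math.log(qi) + 1.0 for qi in q]
    return F(p) - F(q) - sum((pi - qi) * gi for pi, qi, gi in zip(p, q, grad))

P = [0.6, 0.3, 0.1]
Q = [0.2, 0.5, 0.3]

# A stochastic map: T[i][j] = prob(output i | input j); each column sums to 1.
T = [[0.9, 0.2, 0.1],
     [0.05, 0.7, 0.3],
     [0.05, 0.1, 0.6]]
apply_map = lambda p: [sum(T[i][j] * p[j] for j in range(3)) for i in range(3)]

d_before = kl(P, Q)
d_after = kl(apply_map(P), apply_map(Q))
print(bregman_neg_entropy(P, Q), d_before, d_after)
```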
Definition 4.
Let s denote an element of a state space. The entropy of s is defined as

H(s) = inf ( −Σ_{i=1}^n λ_i ln(λ_i) )

where the infimum is taken over all decompositions s = Σ_{i=1}^n λ_i s_i of s into pure states s_i.

This definition of the entropy of a state was first given by Uhlmann Uhlmann (1970). Using that entropy is decreasing under majorization, we see that the entropy of s is attained at an orthogonal decomposition Harremoës (2016), and we obtain the familiar equation

H(s) = −tr[s ln(s)].

In general this definition of entropy does not provide a concave function on a convex set. For instance the entropy of points in the square has local maximum in four different points. A characterization of the convex sets with concave entropy functions is lacking.
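For density matrices the infimum in Definition 4 is attained at the eigendecomposition, so H(s) = −tr[s ln s] can be computed directly from the spectrum. The following small sketch is my own illustration with an arbitrary 2 × 2 density matrix; it also checks that a pure state has zero entropy.

```python
import numpy as np

def entropy(rho):
    # H(s) = -tr[s ln s] = -sum_i lambda_i ln(lambda_i) over the spectrum
    eigenvalues = np.linalg.eigvalsh(rho)
    return -sum(l * np.log(l) for l in eigenvalues if l > 1e-15)

rho = np.array([[0.7, 0.2],
                [0.2, 0.3]])        # Hermitian, trace 1, positive definite
pure = np.array([[1.0, 0.0],
                 [0.0, 0.0]])       # a pure state (rank-1 projection)

print(entropy(rho), entropy(pure))
```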
Definition 5.
If the entropy is a concave function, then the Bregman divergence D₋H is called information divergence.

The information divergence is also called
Kullback-Leibler divergence, relative entropy or quantum relative entropy. In a C*-algebra we get

D₋H(s₀, s₁) = −H(s₀) − (−H(s₁) + ⟨s₀ − s₁, −∇H(s₁)⟩)
= H(s₁) − H(s₀) + ⟨s₀ − s₁, ∇H(s₁)⟩
= tr[f(s₁)] − tr[f(s₀)] + tr[(s₀ − s₁) f′(s₁)]
= tr[f(s₁) − f(s₀) + (s₀ − s₁) f′(s₁)]

where f(x) = −x ln(x). Now f′(x) = −ln(x) − 1 and

f(s₁) − f(s₀) + (s₀ − s₁) f′(s₁) = −s₁ ln(s₁) + s₀ ln(s₀) + (s₀ − s₁)(−ln(s₁) − 1)
= s₀ (ln(s₀) − ln(s₁)) + s₁ − s₀.

Hence

D₋H(s₀, s₁) = tr[s₀ (ln(s₀) − ln(s₁)) + s₁ − s₀].

For states s₀, s₁ it reduces to the well-known formula

D₋H(s₀, s₁) = tr[s₀ ln(s₀) − s₀ ln(s₁)].

We consider a set T of maps of the state space into itself. The set T will be used to represent those transformations that we are able to perform on the state space before we choose a feasible action a ∈ A. Let Φ : S → S denote a map. Then the dual map Φ* maps actions into actions and is given by

⟨a, Φ(s)⟩ = ⟨Φ*(a), s⟩.

Proposition 5 (The principle of lost opportunities). If Φ* maps the set of feasible actions A into itself, then

F(Φ(s)) ≤ F(s).   (17)

Proof. If a ∈ A then

⟨a, Φ(s)⟩ = ⟨Φ*(a), s⟩ ≤ F(s)

because Φ*(a) ∈ A. Inequality (17) follows because F(Φ(s)) = sup_a ⟨a, Φ(s)⟩.

Corollary 1 (Semi-monotonicity). Let Φ denote a map of the state space into itself such that Φ* maps the set of feasible actions A into itself, and let s₁ denote a state that minimizes the function F. If D_F is a Bregman divergence, then

D_F(Φ(s₀), Φ(s₁)) ≤ D_F(s₀, s₁).   (18)

Proof.
Since s₁ minimizes F and F is differentiable, we have ∇F(s₁) = 0. Since s₁ minimizes F and F(Φ(s₁)) ≤ F(s₁), we also have that Φ(s₁) minimizes F and that ∇F(Φ(s₁)) = 0. Therefore

D_F(Φ(s₀), Φ(s₁)) = F(Φ(s₀)) − (F(Φ(s₁)) + ⟨Φ(s₀) − Φ(s₁), ∇F(Φ(s₁))⟩)
= F(Φ(s₀)) − F(Φ(s₁))
≤ F(s₀) − F(s₁)
= D_F(s₀, s₁),

which proves the inequality.

Next we introduce the stronger notion of monotonicity.

Definition 6.
Let D_F denote a regret function on the state space S of a finite dimensional C*-algebra. Then D_F is said to be monotone if

D_F(Φ(s₀), Φ(s₁)) ≤ D_F(s₀, s₁)

for any affine map Φ : S → S.

Figure 3: Example of a dilation that increases regret.

Proposition 6.
If a regret function D_F based on a convex and continuous function F is monotone, then it is a Bregman divergence.

Proof. Assume that D_F is monotone. We have to prove that F is differentiable. Since F is convex, it is sufficient to prove that any restriction of F to a line segment is differentiable. Let s₀ and s₁ denote states that are the end points of a line segment. The restriction of F to the line segment is given by the convex and continuous function f(t) = F((1 − t)·s₀ + t·s₁), so we have to prove that f is differentiable. If 0 < t₀ < t₁ < 1 then

D_F((1 − t₁)s₀ + t₁s₁, (1 − t₀)s₀ + t₀s₁) = f(t₁) − (f(t₀) + (t₁ − t₀) · f′₊(t₀))

where f′₊ denotes the derivative from the right. A dilation by a factor r ≤ 1 decreases the regret, so that

r → f(r·t₁) − (f(r·t₀) + r·(t₁ − t₀) · f′₊(r·t₀))   (19)

is increasing. Since f is convex, the function r → f′₊(r·t₀) is increasing. Assume that f is not differentiable, so that r → f′₊(r·t₀) has a positive jump, as illustrated in Figure 3. This contradicts that the function (19) is increasing. Therefore f′₊ is continuous and f is differentiable.

Recently it has been proved that information divergence on a complex Hilbert space is decreasing under positive trace preserving maps Müller-Hermes and Reeb (2015); Christandl and Müller-Hermes (2016). Previously this was only known to hold if some extra condition like complete positivity or 2-positivity was assumed Petz (2003).

Theorem 3.
Information divergence is monotone under any positive trace preserving map on the states of a finite dimensional C*-algebra.

Proof. Any finite dimensional C*-algebra B can be embedded in B(H), and there exists a conditional expectation E : B(H) → B. If Φ is a positive trace preserving map of the density matrices of B into itself, then Φ ∘ E is positive and trace preserving on B(H). According to Müller-Hermes and Reeb Müller-Hermes and Reeb (2015) we have

D(Φ ∘ E(s₀) ‖ Φ ∘ E(s₁)) ≤ D(s₀ ‖ s₁)

for density matrices in B(H). In particular this inequality holds for density matrices in B, and for such matrices we have E(s_i) = s_i.

The notion of sufficiency plays an important role in statistics and related fields. We shall present a definition of sufficiency that is based on Petz (1988), but there are a number of other equivalent ways of defining this concept. We refer to Jenčová and Petz (2006), where the notion of sufficiency is discussed in great detail.
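Theorem 3 can be illustrated numerically. The sketch below is my own illustration: it computes D(s₀‖s₁) = tr[s₀(ln s₀ − ln s₁)] for two arbitrary 2 × 2 density matrices and checks that a depolarizing map Φ(s) = (1 − p)·s + p·I/2, which is positive and trace preserving, does not increase the divergence.

```python
import numpy as np

def matrix_log(rho):
    # Logarithm of a positive definite Hermitian matrix via eigendecomposition.
    w, v = np.linalg.eigh(rho)
    return v @ np.diag(np.log(w)) @ v.conj().T

def quantum_kl(a, b):
    # D(a || b) = tr[a (ln a - ln b)]
    return float(np.trace(a @ (matrix_log(a) - matrix_log(b))).real)

s0 = np.array([[0.8, 0.1], [0.1, 0.2]])      # arbitrary density matrix
s1 = np.array([[0.6, -0.05], [-0.05, 0.4]])  # arbitrary density matrix

def depolarize(rho, p=0.5):
    # Positive trace preserving map: mix with the maximally mixed state.
    return (1 - p) * rho + p * np.eye(2) / 2

d_before = quantum_kl(s0, s1)
d_after = quantum_kl(depolarize(s0), depolarize(s1))
print(d_before, d_after)
```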
Definition 7.
Let (s_θ)_θ denote a family of states and let Φ denote an affine map S → T, where S and T denote state spaces. A recovery map is an affine map Ψ : T → S such that Ψ(Φ(s_θ)) = s_θ. The map Φ is said to be sufficient for (s_θ)_θ if Φ has a recovery map.

Proposition 7.
Assume that D_F is a regret function based on a convex and continuous function F, and assume that Φ is sufficient for s₀ and s₁ with recovery map Ψ. Assume that both Φ* and Ψ* map the set of feasible actions A into itself. Then

D_F(Φ(s₀), Φ(s₁)) = D_F(s₀, s₁).

Proof.
According to the principle of lost opportunities (Proposition 5) we have

F(s₁) = F(Ψ(Φ(s₁))) ≤ F(Φ(s₁)) ≤ F(s₁).

Therefore F(Φ(s₁)) = F(s₁). Let a denote an action that is optimal for s₁. Then

F(Φ(s₁)) = F(s₁) = ⟨a, s₁⟩ = ⟨a, Ψ(Φ(s₁))⟩ = ⟨Ψ*(a), Φ(s₁)⟩,

and we see that Ψ*(a) is optimal for Φ(s₁). Now

D_F(s₀, s₁) = inf_a (F(s₀) − ⟨a, s₀⟩)
= inf_a (F(s₀) − ⟨Ψ*(a), Φ(s₀)⟩),

where the infimum is taken over actions a that are optimal for s₁. Then

inf_a (F(s₀) − ⟨Ψ*(a), Φ(s₀)⟩) ≥ inf_ã (F(Φ(s₀)) − ⟨ã, Φ(s₀)⟩) = D_F(Φ(s₀), Φ(s₁)),

so we have D_F(s₀, s₁) ≥ D_F(Φ(s₀), Φ(s₁)). The reverse inequality is proved in the same way.

The notion of sufficiency as a property of divergences was introduced in Harremoës and Tishby (2007). The crucial idea of restricting the attention to maps of the state space into itself was introduced in Jiao et al. (2014). It was shown in Jiao et al. (2014) that a Bregman divergence on the simplex of distributions on an alphabet that is not binary and satisfies sufficiency equals information divergence up to a multiplicative factor. Here we extend the notion of sufficiency from Bregman divergences to regret functions.
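In the classical case sufficiency is easy to illustrate: merging two outcomes on which P and Q have the same likelihood ratio is a sufficient map (a recovery map splits the merged mass back in the common proportions), and it leaves information divergence unchanged, while merging outcomes with different ratios strictly decreases it. The sketch below is my own illustration; the distributions are arbitrary choices.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.48, 0.32]   # P/Q ratio on the last two outcomes is the same: 0.625

# Merging the last two outcomes: the likelihood ratios agree, so the map is
# sufficient for (P, Q) and the divergence is preserved.
P_suff = [0.5, 0.3 + 0.2]
Q_suff = [0.2, 0.48 + 0.32]

# Merging the first two outcomes: the ratios differ (2.5 vs 0.625), so this
# map is not sufficient and the divergence strictly decreases.
P_lossy = [0.5 + 0.3, 0.2]
Q_lossy = [0.2 + 0.48, 0.32]

print(kl(P, Q), kl(P_suff, Q_suff), kl(P_lossy, Q_lossy))
```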
Let D_F denote a regret function based on a convex and continuous function F on a state space S. We say that D_F satisfies sufficiency if

D_F(Φ(s₁), Φ(s₂)) = D_F(s₁, s₂)

for any affine map Φ : S → S that is sufficient for (s₁, s₂).

Proposition 8.
Let D_F denote a regret function based on a convex and continuous function F on a state space S. If the regret function D_F is monotone then it satisfies sufficiency.

Proof. Assume that the regret function D_F is monotone. Let s₁ and s₂ denote two states and let Φ and Ψ denote maps on the state space such that Φ(Ψ(sᵢ)) = sᵢ, i = 1, 2. Then

D_F(s₁, s₂) = D_F(Φ(Ψ(s₁)), Φ(Ψ(s₂))) ≤ D_F(Ψ(s₁), Ψ(s₂)) ≤ D_F(s₁, s₂).

Hence D_F(Ψ(s₁), Ψ(s₂)) = D_F(s₁, s₂).

Combining the previous results we get that information divergence satisfies sufficiency. Under some conditions there exists a converse of Proposition 8, stating that if monotonicity holds with equality then the map is sufficient. In statistics, where the state space is a simplex, this result is well established. For density matrices over the complex numbers it has been proved for completely positive maps in Jenčová and Petz (2006). Some new results on this topic can be found in Jenčová (2017).
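Monotonicity is a genuine restriction on a Bregman divergence. For instance, the Bregman divergence generated by the squared Euclidean norm can increase when two outcomes are merged, while information divergence cannot. A small numerical sketch (the specific vectors below are arbitrary illustrations): since P and Q assign equal masses to the two merged outcomes, the merging map is even sufficient for (P, Q), so information divergence is preserved exactly while the squared distance changes.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def sq(p, q):
    # Bregman divergence of F(x) = ||x||^2: squared Euclidean distance
    return float(np.sum((p - q) ** 2))

P = np.array([0.2, 0.2, 0.6])
Q = np.array([0.25, 0.25, 0.5])

# Stochastic map merging the first two outcomes into one.
T = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

assert sq(T @ P, T @ Q) > sq(P, Q)               # squared distance is not monotone
assert abs(kl(T @ P, T @ Q) - kl(P, Q)) < 1e-12  # D preserved: T is sufficient for (P, Q)
```

The recovery map here splits the merged outcome evenly, which recovers both P and Q, so by Proposition 7 information divergence is unchanged.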
Often it is relevant to use the following weak version of the sufficiencyproperty.
Definition 9.
Let D_F denote a regret function based on a convex and continuous function F on a state space S. The regret function D_F is said to be local if

D_F(s, t·s + (1−t)·σ) = D_F(s, t·s + (1−t)·ρ)

when the states σ and ρ are orthogonal to s and t ∈ ]0, 1[.

Example 5.
On a 1-dimensional simplex (an interval) or on the Bloch sphere any regret function D_F is local. The reason is that if σ and ρ are states that are orthogonal to s then σ = ρ.

Proposition 9.
Let D_F denote a regret function based on a convex and continuous function F on a state space S. If the regret function D_F satisfies sufficiency then D_F is local.

Proof. Let σ and ρ be states that are orthogonal to s, and let p denote the projection supporting the state s. Let the maps Φ and Ψ be defined by

Φ(x) = tr(px)·s + (1 − tr(px))·ρ,    Ψ(x) = tr(px)·s + (1 − tr(px))·σ.

Then Φ(s) = Ψ(s) = s, Φ(σ) = ρ, and Ψ(ρ) = σ. Therefore

Φ(t·s + (1−t)·σ) = t·s + (1−t)·ρ,    Ψ(t·s + (1−t)·ρ) = t·s + (1−t)·σ,

and sufficiency implies that D_F(s, t·s + (1−t)·σ) = D_F(s, t·s + (1−t)·ρ).

Theorem 4.
Let S be the state space of a C∗-algebra with at least three orthogonal states, and let D_F denote a regret function based on a convex and continuous function F on the state space S. If the regret function D_F is local then it is the Bregman divergence generated by the entropy times a negative constant.

Proof. In the following proof we will assume that the regret function is based on the convex function F : S → ℝ. First we will prove that the regret function is a Bregman divergence.

Let K denote the convex hull of a set s₀, s₁, ..., s_n of orthogonal states. For x ∈ [0, 1[ let gᵢ denote the function gᵢ(x) = D_F(sᵢ, x·sᵢ + (1−x)·sᵢ₊₁). Note that gᵢ is decreasing and continuous from the left. Let P = Σ pᵢ·sᵢ and Q = Σ qᵢ·sᵢ where pᵢ, qᵢ ∈ ]0, 1[ for all i = 0, 1, 2, ..., n. If F is differentiable in P then locality implies that

D_F(P, Q) = Σ pᵢ·D_F(sᵢ, Q) − Σ pᵢ·D_F(sᵢ, P) = Σ pᵢ·gᵢ(qᵢ) − Σ pᵢ·gᵢ(pᵢ) = Σ pᵢ·(gᵢ(qᵢ) − gᵢ(pᵢ)).

Note that P ↦ D_F(P, Q) is a convex function and thereby continuous. Let P₀ be an arbitrary element in K and let (P_n)_{n∈ℕ} denote a sequence such that P_n → P₀ for n → ∞. The sequence (P_n)_{n∈ℕ} can be chosen so that the regret is differentiable in P_n for all n ∈ ℕ. Further, the sequence can be chosen such that p_{n,i} is increasing for all i ≠ j. Then

D_F(P₀, Q) = Σ_{i≠j} p_{0,i}·(gᵢ(qᵢ) − gᵢ(p_{0,i})) + p_{0,j}·g_j(q_j) − p_{0,j}·lim_{n→∞} g_j(p_{n,j}).

Similarly, if the sequence P_n is chosen such that p_{n,i} is increasing for all i ≠ j, j+1, then

D_F(P₀, Q) = Σ_{i≠j,j+1} p_{0,i}·(gᵢ(qᵢ) − gᵢ(p_{0,i})) + p_{0,j}·g_j(q_j) − p_{0,j}·lim_{n→∞} g_j(p_{n,j}) + p_{0,j+1}·g_{j+1}(q_{j+1}) − p_{0,j+1}·lim_{n→∞} g_{j+1}(p_{n,j+1}),

which implies that

p_{0,j+1}·g_{j+1}(p_{0,j+1}) − p_{0,j+1}·lim_{n→∞} g_{j+1}(p_{n,j+1}) = 0,

and hence lim_{n→∞} g_{j+1}(p_{n,j+1}) = g_{j+1}(p_{0,j+1}) for all j. Therefore

D_F(P₀, Q) = Σ p_{0,i}·(gᵢ(qᵢ) − gᵢ(p_{0,i}))    (20)

for all P₀, Q in the interior of K. In the following calculations we will assume that the distributions lie in the interior of K. The validity of the Bregman identity (5) follows directly from Equation (20), implying that D_F is a Bregman divergence.

As a function of Q the regret is minimal when Q = P. In the following calculations we write x = pᵢ, z = p_j, y = qᵢ, and w = q_j. If p_ℓ = q_ℓ for ℓ ≠ i, j then non-negativity of the regret can be written as

x·(gᵢ(y) − gᵢ(x)) + z·(g_j(w) − g_j(z)) ≥ 0,    where x + z = y + w ≤ 1.

Permutation of i and j leads to the inequality

x·(g_j(y) − g_j(x)) + z·(gᵢ(w) − gᵢ(z)) ≥ 0.

Adding the two inequalities gives

x·(g_{ij}(y) − g_{ij}(x)) + z·(g_{ij}(w) − g_{ij}(z)) ≥ 0,    (21)

where g_{ij} = gᵢ + g_j. Assume that x = z = (y + w)/2 in Inequality (21). Then

x·(g_{ij}(y) − g_{ij}(x)) + x·(g_{ij}(w) − g_{ij}(x)) ≥ 0,
g_{ij}(y) − g_{ij}(x) + g_{ij}(w) − g_{ij}(x) ≥ 0,
(g_{ij}(y) + g_{ij}(w))/2 ≥ g_{ij}(x),

so that g_{ij} is mid-point convex, which for a measurable function implies convexity. Therefore g_{ij} is differentiable from the left and from the right.

If y = w and x = y + ε and z = y − ε then we have

(y + ε)·(g_{ij}(y) − g_{ij}(y + ε)) + (y − ε)·(g_{ij}(y) − g_{ij}(y − ε)) ≥ 0,

with equality for ε = 0. We differentiate with respect to ε from the right:

(g_{ij}(y) − g_{ij}(y + ε)) − (y + ε)·(g_{ij})′₊(y + ε) − (g_{ij}(y) − g_{ij}(y − ε)) + (y − ε)·(g_{ij})′₋(y − ε),

which is non-negative at ε = 0, so that

−y·(g_{ij})′₊(y) + y·(g_{ij})′₋(y) ≥ 0,    i.e.    y·(g_{ij})′₋(y) ≥ y·(g_{ij})′₊(y).    (23)

Since g_{ij} is convex we have (g_{ij})′₋(y) ≤ (g_{ij})′₊(y), which in combination with Inequality (23) implies that (g_{ij})′₋(y) = (g_{ij})′₊(y), so that g_{ij} is differentiable. Since gᵢ = (g_{ij} + g_{ik} − g_{jk})/2, the function gᵢ is also differentiable.

As a function of Q the Bregman divergence D_F(P, Q) has a minimum at Q = P under the condition Σ qᵢ = 1. Since the functions gᵢ are differentiable we can characterize this minimum using Lagrange multipliers. We have

∂/∂qᵢ D_F(P, Q) = pᵢ·gᵢ′(qᵢ)    and    ∂/∂qᵢ D_F(P, Q)|_{Q=P} = pᵢ·gᵢ′(pᵢ).

Further, ∂/∂qᵢ Σ qᵢ = 1, so there exists a constant c_K such that pᵢ·gᵢ′(pᵢ) = c_K. Hence gᵢ′(pᵢ) = c_K/pᵢ, so that gᵢ(pᵢ) = c_K·ln(pᵢ) + mᵢ for some constant mᵢ. Now we get

D_F(P, Q) = Σ pᵢ·(gᵢ(qᵢ) − gᵢ(pᵢ)) = Σ pᵢ·((c_K·ln(qᵢ) + mᵢ) − (c_K·ln(pᵢ) + mᵢ)) = −c_K·Σ pᵢ·ln(pᵢ/qᵢ) = D_{−c_K·H}(P, Q).

Therefore there exists an affine function g_K defined on K such that

F|_K(P) = −c_K·H|_K(P) + g_K(P)    (24)

for all P in the interior of K. Since H|_K is continuous on K, Equation (24) holds for any P ∈ K.
If each of the sets K and L is a simplex and x ∈ K ∩ L, then

−c_K·H|_K(x) + g_K(x) = −c_L·H|_L(x) + g_L(x),

so that (c_L − c_K)·H|_K(x) = g_L(x) − g_K(x). If K ∩ L has dimension greater than zero then the right-hand side is affine, so the left-hand side is affine, which is only possible when c_K = c_L. Therefore we also have g_L(x) = g_K(x) for all x ∈ K ∩ L, and the functions g_K can be extended to a single affine function on the whole of S.

If only integer values of a code-length function ℓ are allowed, then there are only finitely many actions that are not dominated. Therefore the function F given by

F(P) = −min_ℓ Σ_a ℓ(a)·p_a

is piece-wise linear. In particular F is not differentiable, so the regret is not a Bregman divergence and cannot be monotone according to Proposition 6. In information theory monotonicity of a divergence function is closely related to the data processing inequality, and since the data processing inequality is one of the most important tools for deriving inequalities in information theory, we need to modify our notion of code-length function in order to achieve a data processing inequality. We now formulate a version of Kraft's inequality that allows the code-length function to be non-integer valued.

Theorem 5.
Let ℓ : A → ℝ be a function. Then ℓ satisfies Kraft's inequality (6) if and only if for all ε > 0 there exist an integer n and a uniquely decodable fixed-to-variable length block code κ : Aⁿ → B∗ such that

|ℓ̄_κ(aⁿ) − (1/n)·Σᵢ₌₁ⁿ ℓ(aᵢ)| ≤ ε,

where ℓ̄_κ(aⁿ) denotes the length ℓ_κ(aⁿ) divided by n. The uniquely decodable block code can be chosen to be prefix free.

Proof.
Assume that ℓ satisfies Kraft's inequality. Then

Σ_{a₁a₂...a_n ∈ Aⁿ} β^(−Σᵢ ℓ(aᵢ)) = (Σ_{a∈A} β^(−ℓ(a)))ⁿ ≤ 1ⁿ = 1.

Therefore the function ℓ̃ : Aⁿ → ℕ given by

ℓ̃(a₁a₂...a_n) = ⌈Σᵢ₌₁ⁿ ℓ(aᵢ)⌉

is integer valued and satisfies Kraft's inequality (6), and there exists a prefix-free code κ : Aⁿ → B∗ such that ℓ_κ(a₁a₂...a_n) = ℓ̃(a₁a₂...a_n). Therefore

|ℓ̄_κ(a₁a₂...a_n) − (1/n)·Σᵢ ℓ(aᵢ)| = (1/n)·|⌈Σᵢ ℓ(aᵢ)⌉ − Σᵢ ℓ(aᵢ)| ≤ 1/n,

so for any ε > 0 we can choose n such that 1/n ≤ ε.

Assume conversely that for all ε > 0 there exists a uniquely decodable code κ : Aⁿ → B∗ such that

|ℓ̄_κ(a₁a₂...a_n) − (1/n)·Σᵢ ℓ(aᵢ)| ≤ ε

for all strings a₁a₂...a_n ∈ Aⁿ. Then n·ℓ̄_κ(a₁a₂...a_n) satisfies Kraft's inequality (6) and

(Σ_{a∈A} β^(−ℓ(a)))ⁿ = Σ_{a₁a₂...a_n} β^(−Σᵢ ℓ(aᵢ)) ≤ Σ_{a₁a₂...a_n} β^(−n·(ℓ̄_κ(a₁a₂...a_n) − ε)) = β^(nε)·Σ_{a₁a₂...a_n} β^(−n·ℓ̄_κ(a₁a₂...a_n)) ≤ β^(nε).

Therefore Σ_{a∈A} β^(−ℓ(a)) ≤ β^ε for all ε > 0, which proves Kraft's inequality.

In this way coding becomes a theory of finite sequences that may be extended to longer finite sequences. Contrary to frequential statistics we do not have to consider a finite sequence as a prefix of an infinite sequence.

If we minimize the mean code-length over functions that satisfy Kraft's inequality (6), but without an integer constraint, the code-length should be ℓ(a) = −log_β(p_a), and the function F is given by

F(P) = Σ_a p_a·log_β(p_a).

The function F is proportional to the Shannon entropy, and the (negative) proportionality factor is determined by the size of the output alphabet.

In lossy source coding and rate distortion theory it is important to choose a distortion function with tractable properties.
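The idealized code length ℓ(a) = −log_β(p_a) satisfies Kraft's inequality with equality, its mean equals the Shannon entropy, and the rounding step in the proof of Theorem 5 costs at most 1/n per symbol. A quick check in the binary case (the source distribution is just an example):

```python
import math

beta = 2                                    # binary output alphabet
p = [0.5, 0.25, 0.125, 0.125]               # example source distribution

# Idealized, non-integer code lengths.
ell = [-math.log(pa, beta) for pa in p]

# Kraft's inequality holds with equality for these lengths.
kraft_sum = sum(beta ** (-l) for l in ell)
assert abs(kraft_sum - 1.0) < 1e-12

# The mean code length equals the entropy in base beta.
mean_length = sum(pa * l for pa, l in zip(p, ell))
entropy = -sum(pa * math.log(pa, beta) for pa in p)
assert abs(mean_length - entropy) < 1e-12

# Rounding the total length of an n-block up to an integer, as in the
# proof of Theorem 5, adds at most 1/n to the per-symbol length.
n = 1000
assert (math.ceil(n * mean_length) / n) - mean_length <= 1 / n
```

For this dyadic distribution the idealized lengths are already integers, so the rounding cost vanishes; for general distributions it stays below 1/n.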
The notion of sufficiency for divergence functions was introduced in Harremoës and Tishby (2007) in order to characterize such tractable distortion functions. The main result of that paper was that sufficiency, together with properties related to Bregman divergences, leads directly to the information bottleneck method introduced by N. Tishby (Tishby et al. (1999)). Logarithmic loss has also been studied for lossy compression in No and Weissman (2015).

In statistics one is often interested in scoring rules that are local, which means scoring rules where the payoff only depends on the probability of the observed value and not on the predicted distribution over unobserved values. The notion of locality has recently been extended by Dawid, Lauritzen and Parry (Dawid et al. (2012)), but here we shall focus on the original definition. The basic result is that the only local strictly proper scoring rule is logarithmic score; this was proved by Bernardo under the assumption that the scoring rule is given by a smooth function (Bernardo (1978)).
Definition 10. A local strictly proper scoring rule is a scoring rule of the form f(x, Q) = g(Q(x)).
On a finite space a local strictly proper scoring rule is given by a local regret function.

Proof. The regret function of a local strictly proper scoring rule is given by

D(P, Q) = Σ_x P(x)·(g(P(x)) − g(Q(x))).

If Q = (1−t)·P + t·Q_i and P and Q_i are mutually singular, then

D(P, Q) = Σ_x P(x)·(g(P(x)) − g((1−t)·P(x) + t·Q_i(x))) = Σ_x P(x)·(g(P(x)) − g((1−t)·P(x) + 0)),

and we see that the regret does not depend on Q_i because Q_i vanishes on the support of P. Therefore the regret function is local.
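That logarithmic score is strictly proper — the expected penalty E_P[−ln Q(X)] is minimized exactly at Q = P — is easy to verify numerically on a grid (a sketch; the distribution P below is an arbitrary example on a binary space):

```python
import numpy as np

P = np.array([0.3, 0.7])

def expected_log_score(P, Q):
    # Expected penalty E_P[-ln Q(X)] for predicting Q when X ~ P
    return float(-np.sum(P * np.log(Q)))

# Scan predictions Q = (q, 1-q) over a grid that contains q = 0.3.
qs = np.linspace(0.01, 0.99, 99)
scores = [expected_log_score(P, np.array([q, 1.0 - q])) for q in qs]
best_q = qs[int(np.argmin(scores))]

assert abs(best_q - 0.3) < 1e-9   # the minimizer on the grid is Q = P
```

The same computation with any other P confirms that honest reporting uniquely minimizes the expected logarithmic penalty.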
Corollary 2.
On a finite space with at least three elements a local strictly proper scoring rule is given by a function g of the form g(x) = a·ln(x) + b for some constants a and b.

The notion of sufficiency also plays an important role in statistics. Here we will restrict the discussion to 1-dimensional exponential families. A natural exponential family is a family of probability distributions of the form

dP_β/dQ = exp(βx)/Z(β),

where Q is a reference measure on the real numbers and Z is the moment generating function given by Z(β) = ∫ exp(βx) dQ(x). Then xⁿ ↦ x₁ + x₂ + ··· + x_n is a sufficient statistic for the family (P_β^⊗n)_β.

Example 6. In a Bernoulli model a sequence xⁿ ∈ {0, 1}ⁿ is predicted with probability

∏ᵢ₌₁ⁿ p^(xᵢ)·(1−p)^(1−xᵢ) = exp((Σᵢ₌₁ⁿ xᵢ)·ln(p) + (n − Σᵢ₌₁ⁿ xᵢ)·ln(1−p)).

The function xⁿ ↦ x₁ + x₂ + ··· + x_n induces a sufficient map Φ from probability distributions on {0, 1}ⁿ to probability distributions on {0, 1, 2, ..., n}. The recovery map maps a measure concentrated in j ∈ {0, 1, 2, ..., n} into the uniform distribution over sequences xⁿ ∈ {0, 1}ⁿ that satisfy Σᵢ₌₁ⁿ xᵢ = j.

The mean value of P_β is ∫ x·exp(βx)/Z(β) dQ(x).
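The sufficiency of the count statistic in Example 6 can be checked numerically: by Proposition 7 no divergence is lost when i.i.d. Bernoulli measures on {0, 1}ⁿ are reduced to the induced binomial distributions, so D(Bin(n, p) ‖ Bin(n, q)) = n·D(Ber(p) ‖ Ber(q)). A small sketch:

```python
import math

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_binomial(n, p, q):
    """D(Bin(n,p) || Bin(n,q)), computed directly from the pmfs."""
    total = 0.0
    for k in range(n + 1):
        pk = math.comb(n, k) * p**k * (1 - p)**(n - k)
        qk = math.comb(n, k) * q**k * (1 - q)**(n - k)
        total += pk * math.log(pk / qk)
    return total

n, p, q = 10, 0.3, 0.6
# The count statistic is sufficient: reducing {0,1}^n to {0,1,...,n}
# loses no information divergence.
assert abs(kl_binomial(n, p, q) - n * kl_bernoulli(p, q)) < 1e-12
```

The factor n is the additivity of information divergence over the n i.i.d. coordinates; the sufficient map accounts for the rest.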
The set of possible mean values is called the mean value range, and it is an interval. Let P^µ denote the element in the exponential family with mean value µ. Then a Bregman divergence on the mean value range is defined by D(λ, µ) = D(P^λ ‖ P^µ). Note that the mapping µ ↦ P^µ is not affine, so the Bregman divergence D(λ, µ) will in general not be given by the formula for information divergence; the family of binomial distributions is the only exception. Nevertheless the Bregman divergence D(λ, µ) encodes important information about the exponential family. In statistics it is common to use squared Euclidean distance as distortion measure, but often it is better to use the Bregman divergence D(λ, µ) as distortion measure. Note that D(λ, µ) is only proportional to squared Euclidean distance for the Gaussian location family.

Example 7.
An exponential distribution has density

f_λ(x) = (1/λ)·exp(−x/λ) for x ≥ 0,    f_λ(x) = 0 otherwise.

This leads to a Bregman divergence on the interval [0, ∞[ given by

∫₀^∞ f_λ(x)·ln(f_λ(x)/f_µ(x)) dx = λ/µ − 1 − ln(λ/µ) = D_{−ln}(λ, µ).

This Bregman divergence is called the Itakura–Saito distance. The Itakura–Saito distance is defined on an unbounded set, so our previous results cannot be applied. Affine bijections of [0, ∞[ have the form Φ(x) = k·x for some constant k > 0. The Itakura–Saito distance obviously satisfies sufficiency for such maps, and it is a simple exercise to check that the Itakura–Saito distance is the only Bregman divergence on [0, ∞[ that satisfies sufficiency. Any affine map [0, ∞[ → [0, ∞[ is composed of a map x ↦ k·x where k ≥ 0 and a right translation x ↦ x + t where t ≥ 0. The Itakura–Saito distance decreases under right translations because

∂/∂t D_{−ln}(λ + t, µ + t) = ∂/∂t ((λ+t)/(µ+t) − 1 − ln((λ+t)/(µ+t))) = ((µ+t) − (λ+t))/(µ+t)² − 1/(λ+t) + 1/(µ+t) = −(λ − µ)²/((λ+t)·(µ+t)²) ≤ 0.

Thus the Itakura–Saito distance is monotone.

Both sufficiency and the Bregman identity are closely related to inference rules. In Csiszár (1991) I. Csiszár explained why information divergence is the only divergence function on the cone of positive measures that leads to tractable inference rules. One should observe that his inference rules are closely related to sufficiency and the Bregman identity, and the present paper may be viewed as a generalization of these results of I. Csiszár.

Statistical mechanics can be based on classical mechanics or on quantum mechanics. For our purpose this makes no difference, because our theorems are valid for both classical systems and quantum systems. As we have seen before,

Ex = kT·D(s ‖ s₀),    (25)

where s₀ denotes the state in equilibrium with the environment. Our general results for Bregman divergences imply that the Bregman divergence based on this exergy satisfies

D_Ex(s₁, s₂) = kT·D(s₁ ‖ s₂).

Therefore

D_Ex(Φ(s₁), Φ(s₂)) = D_Ex(s₁, s₂)

for any map Φ that is sufficient for {s₁, s₂}. In particular the equality holds for any map that is reversible and conserves the state that is in equilibrium with the environment.
Since a different temperature of the environment leads to a different equilibrium state, the equality holds for any reversible map that leaves some equilibrium state invariant. We see that D_Ex(s₁, s₂) is uniquely determined as long as there exists a sufficiently large set of maps that are reversible.

In this exposition we have made some short-cuts. First of all we did not derive Equation (25); in particular the notion of temperature was used without discussion. Secondly, we identified the internal energy with the mean value of the Hamiltonian and identified the thermodynamic entropy with k times the Shannon entropy. Finally, in the argument above one needs to verify in detail that the set of reversible maps is large enough to determine the regret function. For classical thermodynamics the most comprehensive exposition has been given by Lieb and Yngvason (1998, 2010). In their exposition randomness was not taken into account. With the present framework it is also possible to handle randomness, so that one can build a bridge between thermodynamics and statistical mechanics. A detailed exposition will be given in a future paper.

According to Equation (25) any bit of information can be converted into an amount of energy! One may ask how this is related to the mixing paradox (a special case of Gibbs' paradox). Consider a container divided by a wall, with a blue and a yellow gas on each side of the wall. The question is how much energy can be extracted by mixing the blue and the yellow gas?

We lose one bit of information about each molecule by mixing the blue and the yellow gas, but if the color is the only difference, no energy can be extracted. This seems to be in conflict with Equation (25), but in this case different states cannot be converted into each other by reversible processes. For instance one cannot convert the blue gas into the yellow gas. To get around this problem one can restrict the set of preparations and one can restrict the set of measurements.
For instance one may simply ignore measurements of the color of the gas. What should be taken into account and what should be ignored can only be answered by an experienced physicist. Formally this solves the mixing paradox, but from a practical point of view nothing has been solved. If, for instance, the molecules in one of the gases are much larger than the molecules in the other gas, then a semi-permeable membrane can be used to create an osmotic pressure that can be used to extract some energy. It is still an open question which differences in the properties of two gases can be used to extract energy.

We know that in general a local regret function on a state space with at least three orthogonal states is proportional to information divergence. In portfolio theory we get the stronger result that monotonicity implies that we are in the situation of gambling introduced by Kelly (1956).
Theorem 7.
Assume that none of the assets is dominated by a portfolio of the other assets. If the regret function D_G(P, Q) given by (13) is monotone, then the regret function equals information divergence, and the measures P and Q are supported by k distinct price relative vectors of the form (o₁, 0, 0, ..., 0), (0, o₂, 0, ..., 0), ..., (0, 0, ..., 0, o_k).

Proof.
If there are more than two price relative vectors, then a monotone regret function is always proportional to information divergence, which is a strict regret function. Therefore we may assume that there are only two price relative vectors. Assume that the regret function is not strict. Then the function G defined by (12) is not strictly convex. Assume that D_G(P, Q) = 0, so that G is affine between P and Q. Let Φ be a contraction around one of the end points of the intersection between the state space and the line through P and Q. Then monotonicity implies that D_G(Φ(P), Φ(Q)) = 0, so that G is affine on the line between Φ(P) and Φ(Q). This holds for contractions around any point. Therefore G is affine on the whole state space, which implies that there is a single portfolio that dominates all assets. Such a portfolio must be supported on a single asset. Therefore monotonicity implies that if two assets are not dominated then the regret function is strict, and we have already proved (Theorem 1) that a strict regret function in portfolio theory is proportional to information divergence.

Example 8.
If the regret function is monotone and one of the assets is the safe asset, then there exists a portfolio b such that bᵢ·oᵢ ≥ 1 for all i. Equivalently bᵢ ≥ oᵢ⁻¹, which is possible if and only if Σ oᵢ⁻¹ ≤ 1. One says that the gamble is fair if Σ oᵢ⁻¹ = 1. If the gamble is super-fair, i.e. Σ oᵢ⁻¹ < 1, then the portfolio bᵢ = oᵢ⁻¹/Σ oⱼ⁻¹ gives a price relative equal to (Σ oᵢ⁻¹)⁻¹ > 1 independently of what happens, which is a Dutch book.

Corollary 3.
In portfolio theory the regret function D_G(P, Q) given by (13) is monotone if and only if it is strict.

Proof. We use that in portfolio theory the regret function is monotone if and only if it is proportional to information divergence.
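The Dutch book of Example 8 is easy to exhibit numerically (a sketch with arbitrary super-fair odds):

```python
import numpy as np

# Odds o_i for three horses; the sum of 1/o_i is below 1, so the
# gamble is super-fair.
o = np.array([4.0, 4.0, 4.0])
assert (1 / o).sum() < 1

# The portfolio b_i = o_i^{-1} / sum_j o_j^{-1} from Example 8.
b = (1 / o) / (1 / o).sum()
payoffs = b * o          # price relative if horse i wins

# The same return is achieved no matter which horse wins, and it
# exceeds 1: a Dutch book.
assert np.allclose(payoffs, payoffs[0])
assert payoffs[0] > 1
```

With these odds the guaranteed price relative is (Σ oᵢ⁻¹)⁻¹ = 4/3, independently of the outcome.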
In Pitrik and Virosztek (2015) it was proved that if f is a function such that the Bregman divergence based on tr(f(ρ)) is monotone on any (simple) C*-algebra, then the Bregman divergence is jointly convex. As we have seen, monotonicity implies that the Bregman divergence must be proportional to information divergence, which is jointly convex in both arguments. We also see that in general joint convexity is not a sufficient condition for monotonicity, but in the case where the state space has only two orthogonal states it is not known whether joint convexity of a Bregman divergence is sufficient to conclude that the Bregman divergence is monotone.

One should note that the type of optimization presented in this paper is closely related to a game theoretic model developed by F. Topsøe (2008, 2011). In his game theoretic model he needed what he called the perfect match principle. Using the terminology of the present paper, the perfect match principle states that the regret function is a strict Bregman divergence. As we have seen, the perfect match principle is only fulfilled in portfolio theory if all the assets are gambling assets. Therefore the theory of F. Topsøe can only be used to describe gambling, while our optimization model can describe general portfolio theory, and our sufficiency conditions explain exactly when our general model reduces to gambling.

The original paper of Kullback and Leibler (1951) was called "On Information and Sufficiency". In the present paper we have made the relation between information divergence and the notion of sufficiency more explicit. The results presented in this paper are closely related to the result that a divergence that is both an f-divergence and a Bregman divergence is proportional to information divergence (see Harremoës and Tishby (2007) or Amari (2009) and references therein).
All f-divergences satisfy a sufficiency condition, which is the reason why this class of divergences has played such a prominent role in the study of the relation between information theory and statistics. One major question has been to find reasons for choosing between the different f-divergences. For instance f-divergences of power type (often called Tsallis divergences or Cressie-Read divergences) are popular, but there are surprisingly few papers that can point at a single value of the power α that is optimal for a certain problem, except when this value is 1. In this paper we have started with Bregman divergences because each optimization problem comes with a specific Bregman divergence. Often it is possible to specify a Bregman divergence for an optimization problem, and only in some cases is this Bregman divergence proportional to information divergence.

The idea of sufficiency has different relevance in different applications, but in all cases information divergence proves to be the quantity that converts the general notion of sufficiency into a number. In information theory information divergence appears as a consequence of Kraft's inequality. For code-length functions of integer length we get functions that are piecewise linear. Only if we are interested in extendable sequences do we get a regret function that satisfies a data processing inequality. In this sense information theory is a theory of extendable sequences. For scoring functions in statistics the notion of locality is important. These applications do not refer to sequences. Similarly the notion of sufficiency, which plays a major role in statistics, does not refer to sequences. Both sufficiency and locality imply that regret is proportional to information divergence, but these reasons are different from the reasons why information divergence is used in information theory.
Our description of statistical mechanics does not go into technical details, but the main point is that the many symmetries in terms of reversible maps form a set of maps so large that our result on invariance of regret under sufficient maps applies. In this sense statistical mechanics and statistics both apply information divergence for reasons related to sufficiency. For portfolio theory the story is different. In most cases one has to apply the general theory of Bregman divergences because we deal with an optimization problem. The general Bregman divergences only reduce to information divergence when the assets are gambling assets.

Often one talks about applications of information theory in statistics, statistical mechanics and portfolio theory. In this paper we have argued that information theory is mainly a theory of sequences, while some problems in statistics and statistical mechanics are also relevant without reference to sequences. It would be more correct to say that convex optimization has various applications, such as information theory, statistics, statistical mechanics, and portfolio theory, and that certain conditions related to sufficiency lead to the same type of quantities in all these applications.

Acknowledgment
The author wants to thank Prasad Santhanam for the invitation to the Electrical Engineering Department, University of Hawai'i at Mānoa, where many of the ideas presented in this paper were developed. I also want to thank Alexander Müller-Hermes, Frank Hansen, and Flemming Topsøe for stimulating discussions and correspondence. Finally I want to thank the reviewers for their valuable comments.
References
Kullback, S.; Leibler, R. On Information and Sufficiency. Ann. Math. Statist. 1951, 79–86.
Jaynes, E.T. Information Theory and Statistical Mechanics, I and II. Physical Review 1957, 106 and 108, 620–630 and 171–190.
Jaynes, E.T. Clearing up mysteries – The original goal. In Maximum Entropy and Bayesian Methods; Skilling, J., Ed.; Kluwer: Dordrecht, 1989.
Liese, F.; Vajda, I. Convex Statistical Distances; Teubner: Leipzig, 1987.
Barron, A.R.; Rissanen, J.; Yu, B. The Minimum Description Length Principle in Coding and Modeling. IEEE Trans. Inform. Theory 1998, 2743–2760. Commemorative issue.
Csiszár, I.; Shields, P. Information Theory and Statistics: A Tutorial; Foundations and Trends in Communications and Information Theory; Now Publishers Inc., 2004.
Grünwald, P.D.; Dawid, A.P. Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory. Annals of Statistics 2004, 1367–1433.
Grünwald, P. The Minimum Description Length Principle; MIT Press, 2007.
Holevo, A.S. Probabilistic and Statistical Aspects of Quantum Theory; Vol. 1, North-Holland Series in Statistics and Probability; North-Holland: Amsterdam, 1982.
Krumm, M.; Barnum, H.; Barrett, J.; Müller, M. Thermodynamics and the structure of quantum theory. arXiv:1608.04461.
Barnum, H.; Müller, M.P.; Ududec, C. Higher-order interference and single-system postulates characterizing quantum theory. New Journal of Physics, 123029.
Harremoës, P. Maximum Entropy and Sufficiency. Proceedings MaxEnt 2016; American Institute of Physics (AIP), 2016, [arXiv:1607.02259].
Harremoës, P. Quantum Information on Spectral Sets. arXiv:1701.06688. Accepted for presentation at ISIT 2017.
Savage, L.J. The Theory of Statistical Decision. Journal of the American Statistical Association, 55–67.
Kiwiel, K.C. Proximal Minimization Methods with Generalized Bregman Functions. SIAM Journal on Control and Optimization, 1142–1168, [http://dx.doi.org/10.1137/S0363012995281742].
Kiwiel, K.C. Free-steering Relaxation Methods for Problems with Strictly Convex Costs and Linear Constraints. Math. Oper. Res., 326–349.
Rockafellar, R.T. Convex Analysis; Princeton Univ. Press: New Jersey, 1970.
Hendrickson, A.D.; Buehler, R.J. Proper scores for probability forecasters. Ann. Math. Statist., 1916–1921.
Rao, C.R.; Nayak, T.K. Cross Entropy, Dissimilarity Measures, and Characterizations of Quadratic Entropy. IEEE Trans. Inform. Theory, 589–593.
Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. Journal of Machine Learning Research, 1705–1749.
McCarthy, J. Measures of the value of information. Proc. Nat. Acad. Sci., 654–655.
Gneiting, T.; Raftery, A.E. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association.
Harremoës, P. Time and Conditional Independence; Vol. 255, IMFUFA-tekst; IMFUFA, Roskilde University, 1993. Original in Danish entitled Tid og Betinget Uafhængighed. English translation partially available.
Kelly, J.L. A New Interpretation of Information Rate. Bell System Technical Journal 1956, 917–926.
Cover, T.; Thomas, J.A. Elements of Information Theory; Wiley, 1991.
Uhlmann, A. On the Shannon Entropy and Related Functionals on Convex Sets. Reports on Mathematical Physics, 147–159.
Müller-Hermes, A.; Reeb, D. Monotonicity of the Quantum Relative Entropy Under Positive Maps. Annales Henri Poincaré, [Sept. 2016. arXiv:1512.06117v2].
Christandl, M.; Müller-Hermes, A. Relative Entropy Bounds on Quantum, Private and Repeater Capacities. April 2016. arXiv:1604.03448.
Petz, D. Monotonicity of Quantum Relative Entropy Revisited. Reviews in Mathematical Physics.
Petz, D. Sufficiency of channels over von Neumann algebras. Quart. J. Math. Oxford 1988, 97–108.
Jenčová, A.; Petz, D. Sufficiency in quantum statistical inference. Communications in Mathematical Physics 2006, 259–276.
Harremoës, P.; Tishby, N. The Information Bottleneck Revisited or How to Choose a Good Distortion Measure. Proceedings ISIT 2007, Nice; IEEE Information Theory Society, 2007, pp. 566–571.
Jiao, J.; Courtade, T.A.; No, A.; Venkat, K.; Weissman, T. Information Measures: the Curious Case of the Binary Alphabet. IEEE Trans. Inform. Theory 2014, 7616–7626.
Jenčová, A. Preservation of a quantum Rényi relative entropy implies existence of a recovery map. Journal of Physics A: Mathematical and Theoretical 2017, 085303.
Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method. Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 1999, pp. 368–377.
No, A.; Weissman, T. Universality of logarithmic loss in lossy compression. 2015 IEEE International Symposium on Information Theory (ISIT), 2015, pp. 2166–2170.
Dawid, A.P.; Lauritzen, S.; Parry, M. Proper local scoring rules on discrete sample spaces. The Annals of Statistics 2012, 593–603.
Bernardo, J.M. Expected Information as Expected Utility. The Annals of Statistics 1978, 686–690. Institute of Mathematical Statistics.
Csiszár, I. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. 1991, 2032–2066.
Lieb, E.; Yngvason, J. A Guide to Entropy and the Second Law of Thermodynamics. Notices of the AMS 1998, 571–581.
Lieb, E.; Yngvason, J. The Mathematics of the Second Law of Thermodynamics. In Visions in Mathematics; Alon, N.; Bourgain, J.; Connes, A.; Gromov, M.; Milman, V., Eds.; Birkhäuser Basel, 2010; pp. 334–358.
Pitrik, J.; Virosztek, D. On the Joint Convexity of the Bregman Divergence of Matrices. Letters in Mathematical Physics 2015, 675–692.
Topsøe, F. Game theoretical optimization inspired by information theory. Journal of Global Optimization 2008, 553.
Topsøe, F. Cognition and Inference in an Abstract Setting. Proceedings WITMSE 2011, 2011.
Amari, S.I. α-Divergence Is Unique, Belonging to Both f-Divergence and Bregman Divergence Classes. IEEE Transactions on Information Theory 2009, 4925–4931.