From Thermodynamic Sufficiency to Information Causality
Peter Harremoës
Received: date / Accepted: date
Abstract
The principle called information causality has been used to deduce Tsirelson's bound. In this paper we derive information causality from monotonicity of divergence and relate it to more basic principles related to measurements on thermodynamic systems. This principle is more fundamental in the sense that it can be formulated for both unipartite and multipartite systems, while information causality is only defined for multipartite systems. Thermodynamic sufficiency is a strong condition that puts severe restrictions on the shape of the state space, to the extent that we conjecture that under very weak regularity conditions it can be used to deduce the complex Hilbert space formalism of quantum theory. Since the notion of sufficiency is relevant for all convex optimization problems, there are many examples where it does not apply.
Keywords
Bregman divergence · multipartite system · information causality · thermodynamic sufficiency

P. Harremoës
Copenhagen Business College
Nørre Voldgade 34
København K
Denmark
Tel.: +45-39564171
E-mail: [email protected]

1 Introduction

Entanglement is a resource that may allow agents to solve certain game problems in a more efficient way than what is possible without entanglement. Such tasks could be solved even more efficiently if the agents had access to a fictive resource called PR-boxes. Such boxes cannot be used for signaling, but they can create correlations that are stronger than the correlations that can be created using entanglement. To be more precise, all quantum mechanical correlations satisfy Tsirelson's bound, while PR-boxes can violate Tsirelson's bound. The goal is to explain Tsirelson's bound and other bounds on correlations from more basic physical principles. One such principle is called information causality, and it may be formulated as "one bit of communication cannot create more than one bit of correlation". This principle was introduced in [16], where it was proved that it can be used to derive Tsirelson's bound. In [16] information causality was formulated and derived from the existence of a conditional mutual information function that is assumed to satisfy some basic properties. In [17] two ways of defining entropy were specified, and they were used to formulate the principle of information causality.

In this paper we use properties of Bregman divergences rather than entropy or mutual information as the basic principle. These divergences have several advantages compared with entropy and mutual information. To each convex optimization problem one can associate a Bregman divergence. If the optimization problem is energy extraction in thermodynamics, the Bregman divergence is proportional to quantum relative entropy, which has some very desirable properties. These properties may be violated if one looks at different optimization problems. Therefore one may ask what is so special about energy extraction in thermodynamics, but this important problem will not be covered in the present paper. One advantage of studying divergence (and entropy) rather than conditional mutual
information is that divergence and its properties can be studied for unipartite systems, while conditional mutual information only makes sense for multipartite systems. This is important because we do not have a canonical way of forming product spaces in generalized probabilistic theories. Bregman divergences with nice properties can be defined on Jordan algebras, and the existence of a nice Bregman divergence rules out most other convex bodies as potential state spaces. Finally, both entropy and conditional mutual information may be considered as derived concepts based on divergence. This aspect will be the focus of the present paper.

The paper is organized as follows. In Section 2 we specify concepts like state space and measurement, and we fix notation. Jordan algebras and their most important properties are described in Section 3. In Section 4 it is proved that several different ways of defining entropy coincide for Jordan algebras. Bregman divergences and their relation to optimization are described in Section 5, where several conditions related to the notion of sufficiency are defined; for Jordan algebras these conditions are equivalent, and the Bregman divergence is generated by the entropy function. In Section 6 we define conditional mutual information based on a Bregman divergence, and we demonstrate that it has the properties that are needed for information causality to be satisfied. We conclude with Section 7, where we summarize our results and state some open problems.
2 States and measurements

Let $P$ denote a set of preparations of a physical experiment. A mixed preparation is a formal mixture $\sum s_i \cdot p_i$ where the $p_i$ are preparations and $(s_i)_i$ is a probability vector. The mixture $\sum s_i \cdot p_i$ is identified with the preparation where $p_i$ is chosen with probability $s_i$. A measurement $m$ maps each preparation in $P$ into a probability measure on the set of possible outcomes of the experiment. We assume that $m(\sum s_i \cdot p_i) = \sum s_i \cdot m(p_i)$.

Let $M$ denote the set of measurements that can be performed by an observer (or a group of observers). If $m(p_1) = m(p_2)$ for all measurements $m \in M$ then we say that $p_1$ and $p_2$ represent the same state. The set of states is called the state space, and with this Bayesian definition of a state the state space will depend on the set of feasible measurements. In particular, the state spaces of two different observers may be different because they may have different sets of measurements. A group of observers may have a different state space than any of the individual observers, because the set of joint measurements may be larger than the set of measurements that can be performed by any of the individual observers.

For simplicity we will assume that the state spaces are convex bodies $\Omega$, i.e. convex compact sets spanned by finitely many elements. The extreme points are called pure states. Any convex body can be embedded in the pointed cone $\Omega_+$ consisting of formal products $t \cdot \sigma$ where $\sigma$ is a state and $t$ is a positive real number called the trace of $t \cdot \sigma$. The notation is $\mathrm{tr}(t \cdot \sigma) = t$. The elements of the cone are called positive operators or un-normalized states, and the cone is called the state cone. Positive elements can be added by
$$t_1 \cdot \sigma_1 + t_2 \cdot \sigma_2 = (t_1 + t_2) \cdot \left( \frac{t_1}{t_1 + t_2} \sigma_1 + \frac{t_2}{t_1 + t_2} \sigma_2 \right).$$
The state cone spans a partially ordered vector space $V_\Omega$, and the trace extends linearly to $V_\Omega$. Thus, the states may be considered as positive elements of an ordered vector space with trace 1.

Let $m \in M$ denote a measurement with values $v$ in some set $V$. If $\sigma$ is a state then the measurement is given by a probability measure $m(\sigma)$ over $V$. Thus for each $v \in V$ we have a probability $m(\sigma)(v) \in [0,1]$. For each $v$ the measurement $m$ maps $\Omega$ into $[0,1]$; such a mapping is called a test, and it is an element of $\Omega^*_+$, i.e. the dual cone of the positive elements. In the literature on generalized probabilistic theories a test is often called an effect, but in this paper it is called a test, which is the well established term in the statistical literature. The test that maps $x \in V_\Omega$ into $\lambda\,\mathrm{tr}(x)$ will be denoted $\lambda$. In particular the test $1$ maps $\Omega$ into $1$. Since the total probability of a measurement is 1 we have $\sum_v m(\cdot)(v) = 1$. A measurement can be represented as a test valued measure. In the Hilbert space formalism the tests are given by positive operators and the measurements are given by positive operator valued measures (POVM). We say that two states $\rho$ and $\sigma$ are mutually singular if there exists a test $\phi$ such that $\phi(\rho) = 0$ and $\phi(\sigma) = 1$.

Let $m_1, m_2 \in M$ with values in $V_1$ and $V_2$. If $M : V_1 \to V_2$ is some map such that
$$m_2(\cdot)(v_2) = \sum_{v_1 : M(v_1) = v_2} m_1(\cdot)(v_1)$$
then the measurement $m_1$ is at least as informative about the state as $m_2$, and $m_1$ is called a fine-graining of $m_2$. If $m_1(\cdot)(v_1) \propto m_2(\cdot)(v_2)$ for all values $v_1$ for which $M(v_1) = v_2$, then the fine-graining is said to be trivial. A measurement is fine-grained if all its fine-grainings are trivial. Note that a measurement $m$ is fine-grained if all tests $m(\cdot)(v)$ lie on extreme rays of $\Omega^*_+$. Therefore any measurement has a fine-graining that is fine-grained when the state space $\Omega$ is a convex body.
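To make the definitions concrete, here is a small Python sketch (not from the paper; all numbers are illustrative) that represents a classical measurement as a matrix of tests and constructs the coarse-graining of which the original measurement is a fine-graining:

```python
import numpy as np

# A measurement on a 3-state classical system: row v1 holds the test
# m1(.)(v1), an affine functional on the simplex with values in [0, 1].
m1 = np.array([[0.5, 0.0, 0.1],
               [0.5, 0.2, 0.3],
               [0.0, 0.8, 0.6]])
assert np.allclose(m1.sum(axis=0), 1.0)   # the tests sum to the unit test

# A map M : V1 -> V2 merging outcomes 0 and 1; m1 is a fine-graining of m2.
M = {0: 0, 1: 0, 2: 1}
m2 = np.zeros((2, 3))
for v1, v2 in M.items():
    m2[v2] += m1[v1]                      # m2(.)(v2) = sum over M(v1) = v2

state = np.array([0.2, 0.5, 0.3])
print(m1 @ state)   # outcome distribution under the fine-graining
print(m2 @ state)   # coarse-grained outcome distribution
```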
Let $\Omega_1$ and $\Omega_2$ denote two state spaces. An affine map $\Phi : \Omega_1 \to \Omega_2$ is called an affinity. Let $S : \Omega_1 \to \Omega_2$ and $R : \Omega_2 \to \Omega_1$ denote affinities. If $R \circ S = \mathrm{id}_{\Omega_1}$ then $S$ is called a section and $R$ is called a retraction. A frame is a section $S : \Omega_1 \to \Omega_2$ where $\Omega_1$ is a simplex.

Let $\Omega$ denote the state space of a group of observers. The set of measurements $M_A$ of a single observer Alice is a subset of the set of all measurements $M$ of the whole group of observers. Therefore there is a surjective affinity $E_A : \Omega \to \Omega_A$. Assume that Alice and Bob are observers that can perform measurements independently. Further assume that the choice of measurement made by Alice does not influence the outcome of a measurement made by Bob, and that a choice of measurement made by Bob does not influence the outcome of a measurement made by Alice. This is called the no-signaling condition. If Alice performs the measurement $m_A$ and Bob performs the measurement $m_B$, then the joint measurement is denoted $m_A \otimes m_B$. Further assume that Alice and Bob can communicate. Then Alice and Bob can perform any measurement of the form $\sum s_i \cdot m_{A,i} \otimes m_{B,i}$. If Alice and Bob together can only perform measurements of the form $\sum s_i \cdot m_{A,i} \otimes m_{B,i}$, their joint state space is a subset of $V_{\Omega_A} \otimes V_{\Omega_B}$. Assume further that Alice and Bob can prepare states individually. If Alice prepares the state $\sigma_A$ and Bob prepares the state $\sigma_B$, then their joint state is $\sigma_A \otimes \sigma_B \in V_{\Omega_A} \otimes V_{\Omega_B}$. The convex hull of $\{\sigma_A \otimes \sigma_B \mid \sigma_A \in \Omega_A \text{ and } \sigma_B \in \Omega_B\}$ is denoted $\Omega_A \otimes_{\min} \Omega_B$, and its elements are called separable states. We assume that $\Omega_A \otimes_{\min} \Omega_B \subseteq \Omega$.

3 Jordan algebras

Here we recall some facts and concepts related to Jordan algebras. A more detailed exposition can be found in [14,2]. In the Hilbert space formalism of quantum physics the states are represented as density matrices on a complex Hilbert space. Classical probability distributions can be identified with density matrices that are diagonal. In the set of self adjoint matrices one may define a product $\bullet$ by
$$A \bullet B = \tfrac{1}{2}(AB + BA).$$
This product makes the set of Hermitean matrices into an algebra over the real numbers, and the product $\bullet$ satisfies
$$A \bullet (B \bullet (A \bullet A)) = (A \bullet B) \bullet (A \bullet A). \qquad (1)$$
With this equation fulfilled it is possible to define $A^n = A \bullet A \bullet \dots \bullet A$ without specifying where the parentheses have to be placed. Further we have that
$$\sum_i A_i^2 = 0 \qquad (2)$$
if and only if $A_i = 0$ for all $i$. The dimension of the algebra is defined as the dimension of the Jordan algebra as a real vector space. A finite dimensional algebra over the real numbers with a product $\bullet$ satisfying properties (1) and (2) is called a Euclidean Jordan algebra. Elements in a Euclidean Jordan algebra of the form $A \bullet A$ are called positive elements, and they form a pointed cone. Further, a Euclidean Jordan algebra has a trace $\mathrm{tr}$ that maps positive elements into positive numbers and such that $\mathrm{tr}((A \bullet B) \bullet C) = \mathrm{tr}(A \bullet (B \bullet C))$. A state in a Jordan algebra is a positive element of trace 1. The rank of a Jordan algebra is the Caratheodory rank of the state space of the algebra. A Euclidean Jordan algebra has an inner product defined by
$$\langle A, B \rangle = \mathrm{tr}(A \bullet B).$$
With this inner product the positive cone becomes self dual. An element $E$ of a Jordan algebra is idempotent if $E^2 = E$. Elements $A$ and $B$ are orthogonal if $A \bullet B = 0$.
With these definitions any element $A$ has a spectral decomposition
$$A = \sum \lambda_i E_i$$
where the $E_i$ are orthogonal idempotents. If the spectral values $\lambda_i$ are all different, the decomposition is unique. Therefore one can define
$$f(A) = \sum f(\lambda_i) E_i.$$
The associative Euclidean Jordan algebras correspond to classical probability theory, where the state space is a simplex. Any Euclidean Jordan algebra $J$ can be written as a direct sum $\bigoplus J_i$ of Jordan algebras where each of the Jordan algebras $J_i$ is simple. The simple Euclidean Jordan algebras belong to one of the following five types.

– $M_n(\mathbb{R})$: real valued Hermitean $n \times n$ matrices.
– $M_n(\mathbb{C})$: complex valued Hermitean $n \times n$ matrices.
– $M_n(\mathbb{H})$: quaternionic valued Hermitean $n \times n$ matrices.
– $M_3(\mathbb{O})$: octonionic valued Hermitean $3 \times 3$ matrices.
– $\mathrm{Jspin}(d)$: spin factors, where the state space has the shape of a $d$-dimensional solid ball.
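The spectral decomposition above is what makes functions of states computable in practice. A minimal NumPy sketch for the special Jordan algebra of complex Hermitean matrices (illustrative, not part of the formal development):

```python
import numpy as np

def spectral_function(A, f):
    # A = sum_i lambda_i E_i  ->  f(A) = sum_i f(lambda_i) E_i
    eigvals, eigvecs = np.linalg.eigh(A)
    return (eigvecs * f(eigvals)) @ eigvecs.conj().T

rho = np.array([[0.7, 0.2], [0.2, 0.3]])   # a state (density matrix)
log_rho = spectral_function(rho, np.log)
print(-np.trace(rho @ log_rho).real)       # -<rho, ln(rho)>, used in Section 4
```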
The Jordan algebra $M_3(\mathbb{O})$ is called the exceptional Jordan algebra, and Jordan algebras that do not contain such an exceptional component are called special Jordan algebras. All special Jordan algebras appear as sections of $M_n(\mathbb{C})$ for some value of $n$. In this sense all special Jordan algebras have representations as physical systems. If a section of the set of complex valued Hermitean matrices is required to be completely positive, then the section can be represented as a set of complex valued Hermitean matrices.

It is an important question why exactly the complex valued Hermitean matrices are so good at modeling quantum physics compared with the other simple Jordan algebras. Adler has attempted to model quantum theory using quaternions [1], and there have been a number of attempts to let the exceptional Jordan algebra play an active role in modeling physics [6,13]. One important property that singles out the complex valued Hermitean matrices is that there is a canonical tensor product construction within the category of complex valued Hermitean matrices with completely positive maps as morphisms [3].

Example 1
Assume that the whole state space $\Omega$ can be represented as the real non-negative definite $4 \times 4$ matrices with trace 1. The dimension of this state space is 9. Let $A$ and $B$ denote $2 \times 2$ real Hermitean matrices. Then $A \otimes B$ can be embedded in $\Omega$ as
$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \otimes \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11}b_{11} & a_{11}b_{12} & a_{12}b_{11} & a_{12}b_{12} \\ a_{11}b_{21} & a_{11}b_{22} & a_{12}b_{21} & a_{12}b_{22} \\ a_{21}b_{11} & a_{21}b_{12} & a_{22}b_{11} & a_{22}b_{12} \\ a_{21}b_{21} & a_{21}b_{22} & a_{22}b_{21} & a_{22}b_{22} \end{pmatrix}.$$
The vector space of real Hermitean $2 \times 2$ matrices has dimension 3. Therefore the tensor product has dimension 9, and the set of tensors with trace 1 has dimension 8, so it has a lower dimension than the set of states on the whole space. Therefore there are joint states on the whole space that cannot be distinguished by local measurements. Hence the tomography condition is not fulfilled.

There are a number of ways to characterize Jordan algebras. Above we have defined the Jordan algebras algebraically. A classic result is that a real vector space with a self-dual homogeneous cone can be represented as a Jordan algebra [11]. A new result is that a state space that is spectral and where any pair of frames can be mapped into each other can be represented by a Jordan algebra [4].

For Jordan algebras it is possible to define a well-behaved entropy function and an associated divergence function. In [10] it was proved that if a state space has rank 2 and it has a monotone Bregman divergence, then it can be represented as a Jordan algebra (spin factor). Similar representation theorems for state spaces of higher rank are not yet available, so in this paper we focus on other consequences of the existence of an entropy function or a Bregman divergence.

4 Entropy

In generalized probabilistic theories there are two ways of defining entropy [17]. The decomposition entropy of a state $\sigma$ is given by
$$\breve{H}(\sigma) = \inf_{\sum p_i \cdot \sigma_i = \sigma} H((p_i)_i).$$
Here the infimum is taken over all mixtures $\sum p_i \cdot \sigma_i = \sigma$ where the $\sigma_i$ are pure states, and $H((p_i)_i)$ denotes the Shannon entropy of the probability vector $(p_i)_i$. Versions of this definition can also be found in [8], but they date back to [18]. Note that the definition of spectral entropy in [12] is closely related but slightly different.

Following [17] one can define the fine-grained entropy of a state in a generalized probabilistic theory by
$$\hat{H}(\sigma) = \inf_m H(m(\sigma))$$
where the infimum is taken over all fine-grained measurements $m$ on $\Omega$. The fine-grained entropy is a strictly concave function.
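For density matrices the spectral decomposition is itself a decomposition into orthogonal pure states, and by Lemma 1 below it attains the infimum in the decomposition entropy. A numerical sketch (illustrative numbers):

```python
import numpy as np

def shannon_entropy(p):
    p = p[p > 1e-12]                 # drop zero eigenvalues
    return -np.sum(p * np.log(p))

rho = np.array([[0.7, 0.2], [0.2, 0.3]])
spectral_weights = np.linalg.eigvalsh(rho)
print(shannon_entropy(spectral_weights))   # value of the decomposition entropy

# Any other pure-state decomposition of rho has a weight vector majorized by
# the spectrum, hence at least as large a Shannon entropy.
```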
Lemma 1 If the state space $\Omega$ is spectral, then a decomposition that minimizes the decomposition entropy is spectral.

Proof This was essentially proved in [8], although the terminology regarding spectrality was slightly different. ⊓⊔
Theorem 1 If the state space $\Omega$ is spectral, then for any state $\sigma$ the following inequality holds:
$$\hat{H}(\sigma) \le \breve{H}(\sigma).$$
Proof Let $\sigma = \sum p_i \sigma_i$ be a decomposition of $\sigma$ where the states $\sigma_i$ are pure. To this decomposition there corresponds a measurement $m$ such that $m(\sigma)(i) = p_i$. Since this measurement is fine-grained we have
$$\hat{H}(\sigma) \le H(m(\sigma)) = H((p_i)_i).$$
Therefore
$$\hat{H}(\sigma) \le \inf_{\sum p_i \sigma_i = \sigma} H((p_i)_i) = \breve{H}(\sigma). $$
⊓⊔
Theorem 2 If the state space $\Omega$ is spectral and the cone $\Omega_+$ is self dual, then
$$\hat{H}(\sigma) = \breve{H}(\sigma) = -\langle \sigma, \ln(\sigma) \rangle. \qquad (3)$$
Proof Let $M$ denote a fine-grained measurement. The measurement is given by a positive test valued measure, i.e. there exist $\rho_j \ge 0$ such that $\sum \rho_j = 1$ and such that
$$M(\rho)(j) = \langle \rho_j, \rho \rangle.$$
Since the measurement is fine-grained, the $\rho_j$ must be states. Let $\sigma = \sum_i p_i \cdot \sigma_i$ denote a spectral decomposition of $\sigma$ into $r$ orthogonal pure states. Then
$$M(\sigma)(j) = \langle \rho_j, \sigma \rangle = \Big\langle \rho_j, \sum_i p_i \sigma_i \Big\rangle = \sum_i p_i \langle \rho_j, \sigma_i \rangle.$$
If $\tilde\sigma$ is the state $\sum_i \frac{1}{r} \cdot \sigma_i$ then
$$M(\tilde\sigma)(j) = \Big\langle \rho_j, \sum_i \tfrac{1}{r} \cdot \sigma_i \Big\rangle = \tfrac{1}{r} \langle \rho_j, 1 \rangle = \tfrac{1}{r},$$
where we have used that the orthogonal pure states of a spectral decomposition sum to the order unit $1$. Thus the Markov kernel $(p_i)_i \to (\sum_i p_i \langle \rho_j, \sigma_i \rangle)_j$ maps the uniform distribution $(\frac{1}{r})_i$ into the uniform distribution $(\frac{1}{r})_j$, i.e. the Markov kernel is bi-stochastic. Since bi-stochastic Markov kernels increase entropy we have
$$H(M(\sigma)) = H\big((\langle \rho_j, \sigma \rangle)_j\big) \ge H((p_i)_i) = -\langle \sigma, \ln(\sigma) \rangle.$$
Therefore
$$-\langle \sigma, \ln(\sigma) \rangle \le \hat{H}(\sigma). \qquad (4)$$
Now the result is obtained by combining Lemma 1 and Theorem 1 with inequality (4). ⊓⊔
Definition 1 The entropy $H$ of a state $\sigma$ in a Jordan algebra is given as the common value of the expressions in Equation (3).
Corollary 1 In a finite Euclidean Jordan algebra the entropy $-\langle \sigma, \ln(\sigma) \rangle$ is a concave function.

Proof Concavity of $H$ follows because $H$ equals the fine-grained entropy, and the fine-grained entropy is concave [17]. ⊓⊔

Concavity of the entropy function $H$ on Jordan algebras was proved in [8] with a more involved proof.

5 Bregman divergences

We consider an optimization problem where we want to optimize some quantity defined on the state space. In thermodynamics the goal is typically to extract energy from the system by some feasible interaction with the system. Our approach makes sense for any convex optimization problem, and in principle the objective function may represent other quantities such as the amount of money one may obtain by trading or the code length that is obtained after using a certain data compression procedure. Various examples of such optimization problems are given in [7]. In this paper the objective function will be energy.

Assume that the system is in state $\rho \in \Omega$ and that we apply some action $a$ from a set of feasible actions $\mathcal{A}$. The mean energy that we extract will be denoted $\langle a, \rho \rangle$, and it is an affine function of the state $\rho$. An action $a$ will be identified with the function $\rho \to \langle a, \rho \rangle$, so that the actions are considered as elements of the dual space of the state space. We define the free energy of the state $\rho$ as
$$F(\rho) = \sup_{a \in \mathcal{A}} \langle a, \rho \rangle.$$
In thermodynamics the Helmholtz free energy is given as $F = U - TS$, so that the free energy is an affine function minus a term that is proportional to the entropy function. The function $F$ is a convex function of $\rho$. The regret of doing action $a$ if the state is $\rho$ is defined as
$$D_F(\rho, a) = F(\rho) - \langle a, \rho \rangle.$$
The interpretation of the regret function is as follows. Assume that the system is in state $\rho$ but one uses a suboptimal action $a$. Then the regret measures the difference between the energy $F(\rho)$ that one could have extracted and the energy that one extracts using action $a$. For simplicity we will assume that $F$ is differentiable, so that for each state $\rho$ there exists a unique action $a_\rho$ such that $F(\rho) = \langle a_\rho, \rho \rangle$. For states $\rho, \sigma \in \Omega$ the Bregman divergence is defined as
$$D_F(\rho, \sigma) = D_F(\rho, a_\sigma).$$
It measures the regret of acting as if the state were $\sigma$ when it actually is $\rho$. The Bregman divergence is given by
$$D_F(\rho, \sigma) = F(\rho) - \left( F(\sigma) + \frac{\mathrm{d}}{\mathrm{d}t} F((1-t)\sigma + t\rho) \Big|_{t=0} \right).$$
The formula for the Bregman divergence is often written in terms of the gradient:
$$D_F(\rho, \sigma) = F(\rho) - \big( F(\sigma) + \langle \nabla F(\sigma) \mid \rho - \sigma \rangle \big).$$
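A Python sketch of this gradient formula (illustrative; the generator below is the negative Shannon entropy, so the divergence reproduces the Kullback-Leibler divergence of Example 2):

```python
import numpy as np

def bregman(F, grad_F, rho, sigma):
    # D_F(rho, sigma) = F(rho) - ( F(sigma) + <grad F(sigma), rho - sigma> )
    return F(rho) - (F(sigma) + np.dot(grad_F(sigma), rho - sigma))

F = lambda p: np.sum(p * np.log(p))      # negative Shannon entropy
grad_F = lambda p: np.log(p) + 1.0

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(bregman(F, grad_F, p, q))          # Bregman divergence
print(np.sum(p * np.log(p / q)))         # equals Kullback-Leibler D(p||q)
```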
Proposition 1 ([8, Lemma 17]) For Hermitean matrices $A$ and $B$ we have
$$\frac{\mathrm{d}}{\mathrm{d}t} \big( \mathrm{tr}(f(A + tB)) \big) \Big|_{t=0} = \langle f'(A), B \rangle.$$
Example 2 Assume that the state space can be represented as the state space of a Jordan algebra. Let $F(\sigma) = \langle \sigma, \ln(\sigma) \rangle$ denote the negative of the entropy. The Bregman divergence corresponding to $F$ can be computed as
$$\begin{aligned} D_F(\rho, \sigma) &= F(\rho) - \left( F(\sigma) + \frac{\mathrm{d}}{\mathrm{d}t} F((1-t)\sigma + t\rho) \Big|_{t=0} \right) \\ &= \langle \rho, \ln(\rho) \rangle - \big( \langle \sigma, \ln(\sigma) \rangle + \langle \ln(\sigma) + 1, \rho - \sigma \rangle \big) \\ &= \langle \rho, \ln(\rho) - \ln(\sigma) \rangle - \mathrm{tr}(\rho - \sigma). \end{aligned} \qquad (5)$$
We call this quantity the information divergence and denote it $D(\rho \| \sigma)$. Note that the last term vanishes if $\rho$ and $\sigma$ are states. If the Jordan algebra is associative we get the Kullback-Leibler divergence given by
$$D(P \| Q) = \sum p_i \ln \frac{p_i}{q_i}.$$
If the Jordan algebra is a C*-algebra, then $F$ is minus the von Neumann entropy and the information divergence equals the quantum information divergence (quantum relative entropy) given by
$$D(\rho \| \sigma) = \mathrm{tr}(\rho(\ln \rho - \ln \sigma)).$$
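A corresponding numerical sketch of the quantum relative entropy for complex Hermitean matrices (assuming full-rank states so the matrix logarithms exist):

```python
import numpy as np
from scipy.linalg import logm

def relative_entropy(rho, sigma):
    # D(rho || sigma) = tr( rho (ln rho - ln sigma) )
    return np.trace(rho @ (logm(rho) - logm(sigma))).real

rho = np.array([[0.7, 0.2], [0.2, 0.3]])
sigma = np.eye(2) / 2                    # maximally mixed state
print(relative_entropy(rho, sigma))
```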
There are a number of conditions that regret functions and Bregman divergences may or may not satisfy.

Definition 2
The Bregman divergence $D_F$ is monotone if
$$D_F(\Phi(\rho), \Phi(\sigma)) \le D_F(\rho, \sigma)$$
for any affinity $\Phi : \Omega \to \Omega$.

We note that monotonicity is associated with the decrease of free energy for a closed thermodynamic system. It is possible to define the regret $D_F(\rho, \sigma)$ even if the function $F$ is not differentiable, but if such a regret function is monotone then $F$ is automatically differentiable [7]. In the rest of this paper we shall focus entirely on the case where $F$ is differentiable and the regret between states is given by the Bregman divergence.
Theorem 3 Information divergence is monotone on special Jordan algebras.
Proof Let $\Omega$ denote the state space of a special Jordan algebra. Then there exists a section $S : \Omega \to M_n(\mathbb{C})$ with a corresponding retraction $R : M_n(\mathbb{C}) \to \Omega$. Let $\Phi : \Omega \to \Omega$ denote some affinity. Then $S \circ \Phi \circ R$ is an affinity $M_n(\mathbb{C}) \to M_n(\mathbb{C})$, and
$$D(\Phi(\rho) \| \Phi(\sigma)) = D(S(\Phi(\rho)) \| S(\Phi(\sigma))) = D\big( (S \circ \Phi \circ R)(S(\rho)) \,\big\|\, (S \circ \Phi \circ R)(S(\sigma)) \big) \le D(S(\rho) \| S(\sigma)) = D(\rho \| \sigma).$$
Here we have used that information divergence is monotone on $M_n(\mathbb{C})$ [15]. ⊓⊔

It is not known whether information divergence is monotone on the exceptional Jordan algebra.

Let $(\rho_\theta)$ denote a family of states and let $\Phi$ denote an affinity $\Phi : \Omega \to \Omega$. Then $\Phi$ is said to be sufficient for $(\rho_\theta)$ if there exists a recovery map $\Psi : \Omega \to \Omega$, i.e. an affinity such that $\Psi(\Phi(\rho_\theta)) = \rho_\theta$.
Definition 3 A Bregman divergence $D_F$ is said to satisfy sufficiency if $D_F(\Phi(\rho), \Phi(\sigma)) = D_F(\rho, \sigma)$ whenever $\Phi$ is sufficient for $\rho, \sigma$.

It is easy to prove that monotonicity implies sufficiency (apply monotonicity to both $\Phi$ and its recovery map). Further, it is easy to prove that sufficiency implies the property called statistical locality, defined below.
Definition 4
A Bregman divergence $D_F$ satisfies statistical locality if $\rho \perp \sigma_1$ and $\rho \perp \sigma_2$ imply
$$D_F(\rho, (1-t) \cdot \rho + t \cdot \sigma_1) = D_F(\rho, (1-t) \cdot \rho + t \cdot \sigma_2).$$
Proposition 2 In a Euclidean Jordan algebra, information divergence satisfies statistical locality.

Proof Assume that $\rho$, $\sigma_1$, and $\sigma_2$ are states and that $\rho \perp \sigma_i$. Then
$$D(\rho \| (1-t) \cdot \rho + t \cdot \sigma_i) = \langle \rho, \ln(\rho) - \ln((1-t) \cdot \rho + t \cdot \sigma_i) \rangle = \langle \rho, \ln(\rho) - \ln((1-t) \cdot \rho) \rangle = -\ln(1-t),$$
which is the same for $i = 1, 2$. ⊓⊔
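For mutually singular density matrices the states in this computation commute, so a classical (diagonal) sketch suffices to check the closed form $-\ln(1-t)$ numerically (illustrative numbers):

```python
import numpy as np

t = 0.3
rho    = np.array([1.0, 0.0, 0.0])          # diagonal density matrices:
sigma1 = np.array([0.0, 1.0, 0.0])          # rho is singular to sigma1
sigma2 = np.array([0.0, 0.5, 0.5])          # and to sigma2

def div(p, q):                              # D(p || q) on the support of p
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

print(div(rho, (1 - t) * rho + t * sigma1))  # -ln(1 - t)
print(div(rho, (1 - t) * rho + t * sigma2))  # same value
print(-np.log(1 - t))
```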
Theorem 4 If the state space $\Omega$ can be represented as the state space of a Jordan algebra of rank at least 3, then a statistically local Bregman divergence $D_F$ is proportional to the information divergence given by Equation (5). There exists a constant $c > 0$ such that the function $F$ equals $c \cdot \langle \rho, \ln \rho \rangle$ plus an affine function on $\Omega$.

Proof The theorem was proved for finite C*-algebras in [7], but the proof is the same for more general Jordan algebras. ⊓⊔

The theorem implies that under certain conditions the following conditions are equivalent:

– monotonicity,
– sufficiency,
– statistical locality,
– the Bregman divergence is proportional to information divergence,
– the objective function $F$ is proportional to entropy plus an affine function.

If the state space has rank 2 these conditions are not equivalent, and this special case was studied in great detail in [10].

6 Conditional mutual information

Consider a bipartite system with Alice and Bob as observers. We assume that the no-signaling condition and local tomography are fulfilled, so that a joint state can be described as an element of the tensor product of the local vector spaces. Let $U_A$ and $U_B$ denote the order units of Alice and Bob. Let $F$ denote some payoff function on the joint system with regret function $D_F$. We will assume that the regret function $D_F$ satisfies monotonicity. Then $F$ is differentiable and $D_F$ is a Bregman divergence. Therefore $D_F$ is given by
$$D_F(\rho, \sigma) = F(\rho) - \big( F(\sigma) + \langle \nabla F(\sigma) \mid \rho - \sigma \rangle \big).$$
The following proposition is well known when the affine combination is a convex combination.
Proposition 3 If $\sum_i t_i = 1$ and the affine combination $\bar\rho = \sum_i t_i \cdot \rho_i$ is a state, then the Bregman identity holds:
$$\sum_i t_i \cdot D_F(\rho_i, \sigma) = \sum_i t_i \cdot D_F(\rho_i, \bar\rho) + D_F(\bar\rho, \sigma). \qquad (6)$$
Proof We expand the right hand side of (6) and get
$$\sum_i t_i \cdot D_F(\rho_i, \bar\rho) + D_F(\bar\rho, \sigma) = \sum_i t_i \cdot \big( F(\rho_i) - ( F(\bar\rho) + \langle \nabla F(\bar\rho) \mid \rho_i - \bar\rho \rangle ) \big) + F(\bar\rho) - \big( F(\sigma) + \langle \nabla F(\sigma) \mid \bar\rho - \sigma \rangle \big).$$
We can re-arrange the terms and use that $\bar\rho = \sum_i t_i \cdot \rho_i$ to get
$$\begin{aligned} &\sum_i t_i \cdot F(\rho_i) - \left( \sum_i t_i \cdot F(\bar\rho) + \Big\langle \nabla F(\bar\rho) \,\Big|\, \sum_i t_i \cdot \rho_i - \bar\rho \Big\rangle \right) + F(\bar\rho) - \big( F(\sigma) + \langle \nabla F(\sigma) \mid \bar\rho - \sigma \rangle \big) \\ &\quad = \sum_i t_i \cdot F(\rho_i) - \big( F(\bar\rho) + \langle \nabla F(\bar\rho) \mid \bar\rho - \bar\rho \rangle \big) + F(\bar\rho) - \big( F(\sigma) + \langle \nabla F(\sigma) \mid \bar\rho - \sigma \rangle \big). \end{aligned}$$
Therefore the right hand side of Equation (6) reduces to
$$\sum_i t_i \cdot F(\rho_i) - \big( F(\sigma) + \langle \nabla F(\sigma) \mid \bar\rho - \sigma \rangle \big) = \sum_i t_i \cdot \big( F(\rho_i) - ( F(\sigma) + \langle \nabla F(\sigma) \mid \rho_i - \sigma \rangle ) \big) = \sum_i t_i \cdot D_F(\rho_i, \sigma),$$
which is the left hand side of Equation (6). This completes the proof. ⊓⊔
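The identity (6) holds for affine weights that need not be convex, which is easy to check numerically in the Kullback-Leibler case (a sketch with illustrative numbers; note that the second weight is negative):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

rho1 = np.array([0.5, 0.3, 0.2])
rho2 = np.array([0.1, 0.6, 0.3])
t = np.array([1.4, -0.4])                   # affine weights summing to 1
rho_bar = t[0] * rho1 + t[1] * rho2         # still a probability vector here
sigma = np.array([0.3, 0.4, 0.3])

lhs = t[0] * kl(rho1, sigma) + t[1] * kl(rho2, sigma)
rhs = t[0] * kl(rho1, rho_bar) + t[1] * kl(rho2, rho_bar) + kl(rho_bar, sigma)
print(lhs, rhs)                             # agree up to rounding
```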
Theorem 5 Assume that $\Omega \subset V_A \otimes V_B$. If $\rho_1, \rho_2 \in \Omega_A$ and $\sigma_1, \sigma_2 \in \Omega_B$ and $D_F$ satisfies sufficiency, then
$$D_F(\rho_1 \otimes \sigma_1, \rho_2 \otimes \sigma_1) = D_F(\rho_1 \otimes \sigma_2, \rho_2 \otimes \sigma_2).$$

Proof To see this, define
$$\Phi(\pi) = E_A(\pi) \otimes \sigma_1, \qquad \Psi(\pi) = E_A(\pi) \otimes \sigma_2.$$
Then
$$\Phi(\rho_i \otimes \sigma_2) = \rho_i \otimes \sigma_1, \qquad \Psi(\rho_i \otimes \sigma_1) = \rho_i \otimes \sigma_2,$$
so $\Psi$ is a recovery map for $\Phi$ on these states. The result is obtained by sufficiency of $D_F$. ⊓⊔

If $\rho_1, \rho_2 \in \Omega_A$ we may write $D_F(\rho_1, \rho_2)$ as an abbreviation for $D_F(\rho_1 \otimes \sigma, \rho_2 \otimes \sigma)$, where some arbitrary state $\sigma \in \Omega_B$ is used.
Definition 5 Let $\sigma$ denote a state on a system with a bipartite subsystem composed of subsystems labeled $A$ and $B$. Then the mutual information between subsystem $A$ and subsystem $B$ is defined as
$$I_\sigma(A; B) = D_F(\sigma_{AB}, \sigma_A \otimes \sigma_B). \qquad (7)$$
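For a classical joint distribution, Definition 5 with information divergence gives the usual mutual information; a quick numerical sketch (illustrative numbers):

```python
import numpy as np

def kl(p, q):
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

p_ab = np.array([[0.3, 0.1],       # joint distribution of (A, B)
                 [0.2, 0.4]])
p_a = p_ab.sum(axis=1)
p_b = p_ab.sum(axis=0)

# I(A;B) = D( p_AB || p_A (x) p_B )
print(kl(p_ab.ravel(), np.outer(p_a, p_b).ravel()))
```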
Theorem 6 If the Bregman divergence $D_F$ is monotone, then mutual information satisfies the following two conditions.

Consistency If the system has a bipartite subsystem consisting of two classical subsystems $A$ and $B$, then the mutual information restricted to the bipartite subsystem is proportional to classical mutual information.

Data processing inequality If $\Phi : V_B \to V_B$ is a positive trace conserving affinity, then
$$I_\sigma(A; B) \ge I_{(\mathrm{id} \otimes \Phi)(\sigma)}(A; B).$$
Proof Consistency If the subsystems defined by Alice and Bob are classical and non-trivial, then the rank of their joint state space is at least $2 \times 2 = 4$. When the rank of the state space is at least 3, the function $F$ is a linear function of the Shannon entropy, and therefore the mutual information defined by (7) is proportional to the classical mutual information.
Data processing inequality Assume that $\Phi : V_B \to V_B$ is a positive trace conserving affinity. Then $\tilde\Phi = \mathrm{id} \otimes \Phi$ is given by $\tilde\Phi(\sigma_A \otimes \sigma_B) = \sigma_A \otimes \Phi(\sigma_B)$, and
$$I_\sigma(A; B) = D_F(\sigma_{AB}, \sigma_A \otimes \sigma_B) \ge D_F\big( \tilde\Phi(\sigma_{AB}), \tilde\Phi(\sigma_A \otimes \sigma_B) \big) = D_F\big( \tilde\Phi(\sigma_{AB}), \sigma_A \otimes \Phi(\sigma_B) \big) = I_{\tilde\Phi(\sigma)}(A; B),$$
which completes the proof. ⊓⊔

In probability theory one may define entropy as self-information via
$$H(A) = I(A; A).$$
This is not possible in quantum theory because the different subsystems in a tensor product decomposition have to be distinct. In probability theory this is not a problem, and cloning is allowed, i.e. one is allowed to form identical copies of a state. In probability theory one gets
$$H(AB) = I(AB; AB) = I(A; AB) + I(B; AB \mid A) \ge I(A; AB) = I(A; A) + I(A; B \mid A) \ge I(A; A) = H(A).$$
Therefore in probability theory the entropy of a subsystem is at most the entropy of the full system.
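A classical illustration of the data processing inequality (a sketch; the channel below is an arbitrary column-stochastic matrix acting on subsystem B):

```python
import numpy as np

def kl(p, q):
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

def mutual_information(p_ab):
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    return kl(p_ab.ravel(), np.outer(p_a, p_b).ravel())

p_ab = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

Phi = np.array([[0.9, 0.3],        # column-stochastic channel on B
                [0.1, 0.7]])
p_ab_after = p_ab @ Phi.T          # (id (x) Phi)(sigma)

print(mutual_information(p_ab))        # I_sigma(A;B)
print(mutual_information(p_ab_after))  # never larger than the line above
```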
Definition 6
A Bregman divergence $D_F$ on a bipartite system is additive if
$$D_F(\rho_A \otimes \rho_B, \sigma_A \otimes \sigma_B) = D_F(\rho_A, \sigma_A) + D_F(\rho_B, \sigma_B).$$
Theorem 7 If the state spaces $\Omega_A$ and $\Omega_B$ can be represented as state spaces of Jordan algebras $J_A$ and $J_B$, and if $D_F$ satisfies sufficiency, then $D_F$ is additive.

Proof Let $c_A$ and $c_B$ denote the states that maximize the fine-grained entropy in each of the algebras. Then $D_F$ equals $D_{\tilde F}$ where $\tilde F(\sigma) = D_F(\sigma, c_A \otimes c_B)$. Let $\rho_A$ and $\rho_B$ denote states in the state spaces $\Omega_A$ and $\Omega_B$. Then $\rho_A$ and $\rho_B$ generate associative sub-algebras $A_A \subseteq J_A$ and $A_B \subseteq J_B$ with classical state spaces. Now the restriction of $D_F$ to $A_A \otimes A_B$ satisfies sufficiency, and according to Theorem 4, $D_F$ is proportional to information divergence. Therefore
$$D_F(\rho_A \otimes \rho_B, c_A \otimes c_B) = D_F(\rho_A, c_A) + D_F(\rho_B, c_B)$$
because information divergence is additive on classical state spaces. Define
$$\tilde F_A(\rho_A) = D_F(\rho_A, c_A), \qquad \tilde F_B(\rho_B) = D_F(\rho_B, c_B).$$
With this notation $\tilde F(\rho_A \otimes \rho_B) = \tilde F_A(\rho_A) + \tilde F_B(\rho_B)$. Thus
$$\begin{aligned} D_F(\rho_A \otimes \rho_B, \sigma_A \otimes \sigma_B) &= \tilde F(\rho_A \otimes \rho_B) - \big( \tilde F(\sigma_A \otimes \sigma_B) + \langle \nabla \tilde F(\sigma_A \otimes \sigma_B) \mid \rho_A \otimes \rho_B - \sigma_A \otimes \sigma_B \rangle \big) \\ &= \tilde F_A(\rho_A) + \tilde F_B(\rho_B) - \big( \tilde F_A(\sigma_A) + \tilde F_B(\sigma_B) + \langle \nabla \tilde F_A(\sigma_A) + \nabla \tilde F_B(\sigma_B) \mid \rho_A \otimes \rho_B - \sigma_A \otimes \sigma_B \rangle \big) \\ &= \tilde F_A(\rho_A) - \big( \tilde F_A(\sigma_A) + \langle \nabla \tilde F_A(\sigma_A) \mid \rho_A \otimes \rho_B - \sigma_A \otimes \sigma_B \rangle \big) \\ &\quad + \tilde F_B(\rho_B) - \big( \tilde F_B(\sigma_B) + \langle \nabla \tilde F_B(\sigma_B) \mid \rho_A \otimes \rho_B - \sigma_A \otimes \sigma_B \rangle \big) \\ &= D_F(\rho_A, \sigma_A) + D_F(\rho_B, \sigma_B). \end{aligned}$$
⊓⊔
Example 3 If tensor products of $2 \times 2$ real Hermitean matrices are embedded in real $4 \times 4$ Hermitean matrices as in Example 1, then mutual information is additive.
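Additivity can be checked numerically for the quantum relative entropy on product states (a sketch assuming full-rank states so the matrix logarithms exist):

```python
import numpy as np
from scipy.linalg import logm

def qre(rho, sigma):
    return np.trace(rho @ (logm(rho) - logm(sigma))).real

rho_a = np.array([[0.8, 0.1], [0.1, 0.2]])
rho_b = np.array([[0.6, 0.0], [0.0, 0.4]])
sig_a = np.eye(2) / 2
sig_b = np.array([[0.7, 0.2], [0.2, 0.3]])

lhs = qre(np.kron(rho_a, rho_b), np.kron(sig_a, sig_b))
rhs = qre(rho_a, sig_a) + qre(rho_b, sig_b)
print(lhs, rhs)   # equal up to numerical precision
```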
Lemma 2 An additive monotone Bregman divergence satisfies the following identity:
$$D_F(\sigma_{AB}, \rho_A \otimes \rho_B) = D_F(\sigma_{AB}, \sigma_A \otimes \rho_B) + D_F(\sigma_A, \rho_A). \qquad (8)$$

Proof Any state $\sigma_{AB}$ can be written as an affine combination of tensor products, $\sigma_{AB} = \sum t_i \cdot \pi_{A,i} \otimes \pi_{B,i}$. Then
$$D_F(\sigma_{AB}, \rho_A \otimes \rho_B) = \sum t_i \cdot D_F(\pi_{A,i} \otimes \pi_{B,i}, \rho_A \otimes \rho_B) - \sum t_i \cdot D_F(\pi_{A,i} \otimes \pi_{B,i}, \sigma_{AB}).$$
Using additivity it can be rewritten as
$$D_F(\sigma_{AB}, \rho_A \otimes \rho_B) = \sum t_i \cdot \big( D_F(\pi_{A,i}, \rho_A) + D_F(\pi_{B,i}, \rho_B) \big) - \sum t_i \cdot D_F(\pi_{A,i} \otimes \pi_{B,i}, \sigma_{AB}) = \sum t_i \cdot D_F(\pi_{A,i}, \rho_A) + \sum t_i \cdot D_F(\pi_{B,i}, \rho_B) - \sum t_i \cdot D_F(\pi_{A,i} \otimes \pi_{B,i}, \sigma_{AB}).$$
The Bregman identity (6) gives
$$D_F(\sigma_{AB}, \rho_A \otimes \rho_B) = \sum t_i \cdot D_F(\pi_{A,i}, \sigma_A) + D_F(\sigma_A, \rho_A) + \sum t_i \cdot D_F(\pi_{B,i}, \rho_B) - \sum t_i \cdot D_F(\pi_{A,i} \otimes \pi_{B,i}, \sigma_{AB}).$$
This can be re-arranged as
$$D_F(\sigma_{AB}, \rho_A \otimes \rho_B) = \sum t_i \cdot \big( D_F(\pi_{A,i}, \sigma_A) + D_F(\pi_{B,i}, \rho_B) \big) - \sum t_i \cdot D_F(\pi_{A,i} \otimes \pi_{B,i}, \sigma_{AB}) + D_F(\sigma_A, \rho_A).$$
Now additivity leads to
$$D_F(\sigma_{AB}, \rho_A \otimes \rho_B) = \sum t_i \cdot D_F(\pi_{A,i} \otimes \pi_{B,i}, \sigma_A \otimes \rho_B) - \sum t_i \cdot D_F(\pi_{A,i} \otimes \pi_{B,i}, \sigma_{AB}) + D_F(\sigma_A, \rho_A) = D_F(\sigma_{AB}, \sigma_A \otimes \rho_B) + D_F(\sigma_A, \rho_A).$$
⊓⊔
Definition 7 We define the conditional mutual information on a tripartite system as
$$I_\sigma(A; B \mid C) = D_F(\sigma_{ABC}, \sigma_A \otimes \sigma_B \otimes \sigma_C) - D_F(\sigma_{AC}, \sigma_A \otimes \sigma_C) - D_F(\sigma_{BC}, \sigma_B \otimes \sigma_C).$$

In our definition of conditional mutual information the subsystems
$A$, $B$, and $C$ should be distinct, so that the tensor products are defined. If the state space is a simplex, i.e. the system is classical, then one may allow the subsystems to overlap.
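For classical tripartite distributions Definition 7 agrees with the usual conditional mutual information; a numerical sketch (all names and numbers illustrative):

```python
import numpy as np

def kl(p, q):
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2)); p /= p.sum()      # joint distribution of (A, B, C)

pa, pb, pc = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
pac, pbc = p.sum(1), p.sum(0)

# Definition 7 for a classical state:
i_ab_c = (kl(p.ravel(), np.einsum('a,b,c->abc', pa, pb, pc).ravel())
          - kl(pac.ravel(), np.outer(pa, pc).ravel())
          - kl(pbc.ravel(), np.outer(pb, pc).ravel()))

# Standard classical conditional mutual information for comparison.
ref = np.sum(p * np.log(p * pc[None, None, :]
                        / (pac[:, None, :] * pbc[None, :, :])))
print(i_ab_c, ref)   # agree up to floating point error
```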
Definition 8 A function $I_\sigma$ on a multipartite system is called a separoid function [5,9] if it satisfies the following three properties:

Positivity $I_\sigma(A; B \mid C) \ge 0$.

Symmetry $I_\sigma(A; B \mid C) = I_\sigma(B; A \mid C)$.

Chain rule
$$I_\sigma(A; BC \mid D) = I_\sigma(A; B \mid D) + I_\sigma(A; C \mid BD). \qquad (9)$$
Theorem 8 Assume that $D_F$ is a monotone and additive Bregman divergence. Then conditional mutual information is a separoid function.

Proof Positivity Conditional mutual information can be rewritten as
$$\begin{aligned} I_\sigma(A; B \mid C) &= D_F(\sigma_{ABC}, \sigma_A \otimes \sigma_B \otimes \sigma_C) - D_F(\sigma_{AC}, \sigma_A \otimes \sigma_C) - D_F(\sigma_{BC}, \sigma_B \otimes \sigma_C) \\ &= D_F(\sigma_{ABC}, \sigma_B \otimes \sigma_{AC}) - D_F(\sigma_{BC}, \sigma_B \otimes \sigma_C) \\ &= D_F(\sigma_{ABC}, \sigma_B \otimes \sigma_{AC}) - D_F(\sigma_A \otimes \sigma_{BC}, \sigma_A \otimes \sigma_B \otimes \sigma_C). \end{aligned}$$
Let $\Phi$ denote the affinity $\Phi(\rho) = \sigma_A \otimes E_{BC}(\rho)$. Then
$$\Phi(\sigma_{ABC}) = \sigma_A \otimes \sigma_{BC}, \qquad \Phi(\sigma_B \otimes \sigma_{AC}) = \sigma_A \otimes \sigma_B \otimes \sigma_C.$$
Therefore monotonicity implies that $I_\sigma(A; B \mid C)$ cannot be negative.
Symmetry It follows directly from the definition that conditional mutual information is symmetric.
Chain rule
To prove the chain rule we expand the left hand side of Equation (9) as
$$I_\sigma(A; BC \mid D) = D_F(\sigma_{ABCD}, \sigma_A \otimes \sigma_{BC} \otimes \sigma_D) - D_F(\sigma_{AD}, \sigma_A \otimes \sigma_D) - D_F(\sigma_{BCD}, \sigma_{BC} \otimes \sigma_D).$$
Next we use Equation (8) to get
$$I_\sigma(A; BC \mid D) = \big( D_F(\sigma_{ABCD}, \sigma_A \otimes \sigma_B \otimes \sigma_C \otimes \sigma_D) - D_F(\sigma_{BC}, \sigma_B \otimes \sigma_C) \big) - D_F(\sigma_{AD}, \sigma_A \otimes \sigma_D) - \big( D_F(\sigma_{BCD}, \sigma_B \otimes \sigma_C \otimes \sigma_D) - D_F(\sigma_{BC}, \sigma_B \otimes \sigma_C) \big).$$
The left hand side reduces to
$$I_\sigma(A; BC \mid D) = D_F(\sigma_{ABCD}, \sigma_A \otimes \sigma_B \otimes \sigma_C \otimes \sigma_D) - D_F(\sigma_{AD}, \sigma_A \otimes \sigma_D) - D_F(\sigma_{BCD}, \sigma_B \otimes \sigma_C \otimes \sigma_D). \qquad (10)$$
Similarly, we expand the right hand side of Equation (9) as
$$I_\sigma(A; B \mid D) + I_\sigma(A; C \mid BD) = D_F(\sigma_{ABD}, \sigma_A \otimes \sigma_B \otimes \sigma_D) - D_F(\sigma_{AD}, \sigma_A \otimes \sigma_D) - D_F(\sigma_{BD}, \sigma_B \otimes \sigma_D) + D_F(\sigma_{ABCD}, \sigma_A \otimes \sigma_C \otimes \sigma_{BD}) - D_F(\sigma_{ABD}, \sigma_A \otimes \sigma_{BD}) - D_F(\sigma_{BCD}, \sigma_C \otimes \sigma_{BD}).$$
We use Equation (8) to rewrite the three last terms as
$$\begin{aligned} I_\sigma(A; B \mid D) + I_\sigma(A; C \mid BD) &= D_F(\sigma_{ABD}, \sigma_A \otimes \sigma_B \otimes \sigma_D) - D_F(\sigma_{AD}, \sigma_A \otimes \sigma_D) - D_F(\sigma_{BD}, \sigma_B \otimes \sigma_D) \\ &\quad + \big( D_F(\sigma_{ABCD}, \sigma_A \otimes \sigma_B \otimes \sigma_C \otimes \sigma_D) - D_F(\sigma_{BD}, \sigma_B \otimes \sigma_D) \big) \\ &\quad - \big( D_F(\sigma_{ABD}, \sigma_A \otimes \sigma_B \otimes \sigma_D) - D_F(\sigma_{BD}, \sigma_B \otimes \sigma_D) \big) \\ &\quad - \big( D_F(\sigma_{BCD}, \sigma_B \otimes \sigma_C \otimes \sigma_D) - D_F(\sigma_{BD}, \sigma_B \otimes \sigma_D) \big). \end{aligned}$$
The right hand side reduces to
$$I_\sigma(A; B \mid D) + I_\sigma(A; C \mid BD) = D_F(\sigma_{ABCD}, \sigma_A \otimes \sigma_B \otimes \sigma_C \otimes \sigma_D) - D_F(\sigma_{AD}, \sigma_A \otimes \sigma_D) - D_F(\sigma_{BCD}, \sigma_B \otimes \sigma_C \otimes \sigma_D). \qquad (11)$$
Since the left hand side (10) and the right hand side (11) are equal, we have proved the chain rule (9). ⊓⊔

7 Conclusion

We have carefully described concepts like state space and introduced state spaces of Jordan algebras as the most important example. In generalized probabilistic theories there are different ways of defining the entropy of a state, but these different definitions coincide on Jordan algebras. For any optimization problem there is an associated Bregman divergence, but with extra constraints like monotonicity, sufficiency, or statistical locality, a Bregman divergence on a Jordan algebra is proportional to the Bregman divergence generated by the uniquely defined entropy function. A monotone Bregman divergence on a Jordan algebra is automatically additive. For composite systems an additive and monotone Bregman divergence can be used to define conditional mutual information, and this quantity will satisfy consistency, the data processing inequality and the chain rule. In [16] it was proved that if conditional mutual information can be defined in a way such that consistency, the data processing inequality and the chain rule are satisfied, then the system will satisfy the condition called information causality. In [16] it was also proved that a system that satisfies information causality cannot have super-quantum correlations, i.e. correlations that violate Tsirelson's bound. The conclusion is that the existence of a monotone Bregman divergence implies that super-quantum correlations do not exist.

The results work out nicely on Jordan algebras, but maybe they will work in any generalized probabilistic theory. For instance it would be interesting if the following conjecture holds.
Conjecture 1
All monotone Bregman divergences are additive.

A careful inspection of the proofs also reveals that the results involving Jordan algebras only use that the cone is self dual and that a Euclidean Jordan algebra is strongly spectral in the sense that $f(\sigma)$ is well defined for any function $f$. Apparently monotonicity of a Bregman divergence implies spectrality, but the only solid result in this direction is the following theorem.
Theorem 9 ([10]) If a state space has rank 2 and it has a strict and monotone Bregman divergence, then the state space can be represented as a spin factor. In particular, the state space is strongly spectral.
For most convex bodies it is not possible to define a monotone Bregman divergence, and it is not known whether it is possible to define a monotone Bregman divergence on any convex body that cannot be represented by a Jordan algebra. It would be highly desirable to classify state spaces with monotone Bregman divergences in cases where the rank exceeds 2.
Conflict of interest
The corresponding author states that there is no con-flict of interest.
References
1. Adler, S.: Quaternionic Quantum Mechanics and Quantum Fields. Oxford Univ. Press, New York, Oxford (1995)
2. Baes, M.: Convexity and differentiability properties of spectral functions and spectral mappings on Euclidean Jordan algebras. Linear Algebra and its Applications 422, 664–700 (2007). DOI 10.1016/j.laa.2006.11.025
3. Barnum, H., Graydon, M., Wilce, A.: Composites and categories of Euclidean Jordan algebras (2016). ArXiv preprint arXiv:1606.09331
4. Barnum, H., Hilgert, J.: Strongly symmetric spectral convex bodies are Jordan algebra state spaces (2019)
5. Dawid, A.P.: Separoids: A mathematical framework for conditional independence and irrelevance. Ann. Math. Artif. Intell. 32, 335–372 (2001)
6. Günaydin, M., Gürsey, F.: Quark structure and octonions. J. Math. Phys. 14(11), 1651–1667 (1973)
7. Harremoës, P.: Divergence and sufficiency for convex optimization. Entropy 19(5), Article no. 206 (2017). URL https://doi.org/10.3390/e19050206
8. Harremoës, P.: Maximum entropy and sufficiency. AIP Conference Proceedings 1853(1), 040001 (2017). URL https://doi.org/10.1063/1.4985352
9. Harremoës, P.: Entropy inequalities for lattices. Entropy 20(10), 784 (2018). DOI 10.3390/e20100784
10. Harremoës, P.: Entropy on spin factors. In: N. Ay, P. Gibilisco, F. Matúš (eds.) Information Geometry and Its Applications, Springer Proceedings in Mathematics & Statistics, vol. 252, pp. 247–278. Springer (2018)
11. Jordan, P., von Neumann, J., Wigner, E.: On an algebraic generalization of the quantum mechanical formalism. Annals of Mathematics 35(1), 29–64 (1934). DOI 10.2307/1968117. JSTOR 1968117
12. Krumm, M., Barnum, H., Barrett, J., Müller, M.P.: Thermodynamics and the structure of quantum theory. New Journal of Physics 19(4), 043025 (2017). DOI 10.1088/1367-2630/aa68ef
13. Manogue, C.A., Dray, T.: Octonions, E6, and particle physics. J. Phys.: Conf. Ser. 254, 012005 (2010)
14. McCrimmon, K.: A Taste of Jordan Algebras. Springer (2004)
15. Müller-Hermes, A., Reeb, D.: Monotonicity of the quantum relative entropy under positive maps. Annales Henri Poincaré 18(5), 1777–1788 (2017). URL https://doi.org/10.1007/s00023-017-0550-9
16. Pawlowski, M., Paterek, T., Kaszlikowski, D., Scarani, V., Winter, A., Zukowski, M.: Information causality as a physical principle. Nature 461, 1101–1104 (2009). DOI 10.1038/nature08400
17. Short, A.J., Wehner, S.: Entropy in general physical theories. New J. Phys. 12, 033023 (2010)