Understanding interdependency through complex information sharing
Fernando Rosas, Vasilis Ntranos, Christopher J. Ellison, Sofie Pollin, Marian Verhelst
Departement Elektrotechniek, KU Leuven; Department of Electrical Engineering & Computer Sciences, UC Berkeley; Center for Complexity and Collective Computation, University of Wisconsin-Madison
Abstract
The interactions between three or more random variables are often nontrivial, poorly understood, and yet are paramount for future advances in fields such as network information theory, neuroscience, genetics and many others. In this work, we propose to analyze these interactions as different modes of information sharing. Towards this end, we introduce a novel axiomatic framework for decomposing the joint entropy, which characterizes the various ways in which random variables can share information. The key contribution of our framework is to distinguish between interdependencies where the information is shared redundantly, and synergistic interdependencies where the sharing structure exists in the whole but not between the parts. We show that our axioms determine unique formulas for all the terms of the proposed decomposition in a number of cases of interest. Moreover, we show how these results can be applied to several network information theory problems, providing a more intuitive understanding of their fundamental limits.
Corresponding author: Fernando Rosas, [email protected]
I. INTRODUCTION
Interdependence is a key concept for understanding the rich structures that can be exhibited by biological, economical and social systems [1], [2]. Although this phenomenon lies at the heart of our modern interconnected world, there is still no solid quantitative framework for analyzing complex interdependencies, which is crucial for future advances in a number of disciplines. In neuroscience, researchers want to identify how various neurons affect an organism's overall behavior, asking to what extent the different neurons provide redundant or synergistic signals [3]. In genetics, the interactions and roles of multiple genes with respect to phenotypic phenomena are studied, e.g. by comparing results from single and double knockout experiments [4]. In graph and network theory, researchers are looking for measures of the information encoded in node interactions in order to quantify the complexity of the network [5]. In communication theory, sensor networks usually generate strongly correlated data [6]; a haphazard design might not account for these interdependencies and, undesirably, will process and transmit redundant information across the network, degrading the efficiency of the system.

The dependencies that can exist between two variables have been extensively studied, generating a variety of techniques that range from statistical inference [7] to information theory [8]. Most of these approaches require that one differentiate the roles of the variables, e.g. between a target and a predictor. However, the extension of these approaches to three or more variables is not straightforward, as a binary splitting is, in general, not enough to characterize the rich interplay that can exist between variables. Moreover, the development of more adequate frameworks has been difficult because most of our theoretical tools are rooted in sequential reasoning, which is adept at representing linear flows of influences but not as well-suited for describing distributed systems or complex interdependencies [9].

In this work, we propose to understand interdependencies between variables as information sharing. In the case of two variables, the portion of the variability that can be predicted corresponds to information that target and predictor have in common. Following this intuition, we present a framework that decomposes the total information of a distribution according to how it is shared among its variables. Our framework is novel in combining the hierarchical decomposition of higher-order interactions, as developed in [10], with the notion of synergistic information, as proposed in [11]. In contrast to [10], we study the information that exists in the system itself, without comparing it with other related distributions. In contrast to [11], we analyze the joint entropy instead of the mutual information, looking for symmetric properties of the system.

One important contribution of this paper is to distinguish shared information from predictability. Predictability is a concept that requires a bipartite system divided into predictors and targets. As different splittings of the same system often yield different conclusions, we see predictability as a directed notion that strongly depends on one's "point of view". In contrast, we see shared information as a property of the system itself, which does not require differentiated roles between its components.
Although it is not possible in general to find a unique measure of predictability, we show that the shared information can be uniquely defined in a number of interesting scenarios. Additionally, our framework provides new insight into various problems of network information theory. Interestingly, many of the problems of network information theory that have been solved are related to systems which present a simple structure in terms of shared information and synergies, while most of the open problems possess a more complex mixture of them.

The rest of this article is structured as follows. First, Section II introduces the notions of hierarchical decomposition of dependencies and synergistic information, reviewing the state of the art and providing the necessary background for the unfamiliar reader. Section III presents our axiomatic decomposition of the joint entropy, focusing on the fundamental case of three random variables. Then, we illustrate the application of our framework to various cases of interest: pairwise independent variables in Section IV, pairwise maximum entropy distributions and Markov chains in Section V, and multivariate Gaussians in Section VI. After that, Section VII presents a first application of this framework in settings of fundamental importance for network information theory. Finally, Section VIII summarizes our main conclusions.

II. PRELIMINARIES AND STATE OF THE ART
One way of analyzing the interactions between the random variables $\mathbf{X} = (X_1, \dots, X_N)$ is to study the properties of the correlation matrix $R_\mathbf{X} = E\{\mathbf{X}\mathbf{X}^t\}$. However, this approach only captures linear relationships, and hence the picture provided by $R_\mathbf{X}$ is incomplete. Another possibility is to study the matrix $I_\mathbf{X} = [I(X_i; X_j)]_{i,j}$ of mutual information terms. This matrix captures the existence of both linear and nonlinear dependencies [12], but its scope is restricted to pairwise relationships and thus misses all higher-order structure. To see an example of how this can happen, consider two independent fair coins $X_1$ and $X_2$, and let $X_3 := X_1 \oplus X_2$ be the output of an XOR logic gate. The mutual information matrix $I_\mathbf{X}$ has all its off-diagonal elements equal to zero, making it indistinguishable from an alternative situation where $X_3$ is just another independent fair coin.

For the case of $R_\mathbf{X}$, a possible next step would be to consider higher-order moment matrices, such as co-skewness and co-kurtosis. We seek their information-theoretic analogs, which complement the description provided by $I_\mathbf{X}$. One method of doing this is to study the information contained in marginal distributions of increasingly larger sizes; this approach is presented in Section II-A. Other methods try to provide a direct representation of the information that is shared between the random variables; they are discussed in Sections II-B, II-C and II-D.
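As a quick numerical illustration of this blind spot, the following minimal Python sketch (our code; the helpers `entropy` and `mutual_info` are not from the paper) builds the joint distribution of $(X_1, X_2, X_1 \oplus X_2)$ and verifies that all pairwise mutual informations vanish:

```python
import numpy as np
from itertools import product

def entropy(p):
    """Shannon entropy in bits of a probability array."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint p.m.f. over (x1, x2, x3): two fair coins and their XOR.
p = np.zeros((2, 2, 2))
for x1, x2 in product([0, 1], repeat=2):
    p[x1, x2, x1 ^ x2] = 0.25

def mutual_info(p, i, j):
    """I(X_i; X_j) computed from the joint p.m.f. p."""
    axes = tuple(k for k in range(p.ndim) if k not in (i, j))
    pij = p.sum(axis=axes)
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    return entropy(pi) + entropy(pj) - entropy(pij.ravel())

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"I(X{i+1};X{j+1}) = {mutual_info(p, i, j):.6f} bits")  # all zero
```

Every pairwise term is exactly zero, even though the three variables are fully dependent as a whole.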
A. Negentropy and total correlation

When the random variables that compose a system are independent, their joint distribution is given by the product of their marginal distributions. In this case, the marginals contain all that is to be learned about the statistics of the entire system. For an arbitrary joint probability density function (p.d.f.), knowing the single-variable marginal distributions is not enough to capture all there is to know about the statistics of the system.

To quantify this idea, let us consider $N$ discrete random variables $\mathbf{X} = (X_1, \dots, X_N)$ with joint p.d.f. $p_\mathbf{X}$, where each $X_j$ takes values in a finite set with cardinality $\Omega_j$. The maximal amount of information that could be stored in any such system is $H^{(0)} = \sum_j \log \Omega_j$, which corresponds to the entropy of the p.d.f. $p_U := \prod_j p^U_{X_j}$, where $p^U_{X_j}(x) = 1/\Omega_j$ is the uniform distribution for each random variable $X_j$. On the other hand, the joint entropy $H(\mathbf{X})$ with respect to the true distribution $p_\mathbf{X}$ measures the actual uncertainty that the system possesses. Therefore, the difference

$$\mathcal{N}(\mathbf{X}) := H^{(0)} - H(\mathbf{X}) \qquad (1)$$

corresponds to the decrease in uncertainty about the system that occurs when one learns its p.d.f., i.e. the information about the system that is contained in its statistics. This quantity is known as negentropy [13], and can also be computed as

$$\mathcal{N}(X_1, \dots, X_N) = \sum_j \big[\log \Omega_j - H(X_j)\big] + \Big(\sum_j H(X_j) - H(\mathbf{X})\Big) \qquad (2)$$

$$= D\Big(\prod_j p_{X_j} \,\Big\|\, p_U\Big) + D\Big(p_\mathbf{X} \,\Big\|\, \prod_j p_{X_j}\Big), \qquad (3)$$

where $p_{X_j}$ is the marginal of the variable $X_j$ and $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence. In this way, (3) decomposes the negentropy into a term that corresponds to the information given by the single-variable marginals and a term that involves higher-order marginals. The second term is known as the total correlation (TC) [14] (also known as multi-information [15]), which is equal to the mutual information for the case of $N = 2$. Because of this, the TC has been suggested as an extension of the notion of mutual information to multiple variables.

An elegant framework for decomposing the TC can be found in [10] (for an equivalent formulation that does not rely on information geometry, cf. [16]). Let us call $k$-marginals the distributions that are obtained by marginalizing the joint p.d.f. over $N - k$ variables. Note that the $k$-marginals provide a more detailed description of the system than the $(k-1)$-marginals, as the latter can be directly computed from the former by marginalizing the corresponding variables. In the case where only the 1-marginals are known, the simplest guess for the joint distribution is $\tilde p^{(1)}_\mathbf{X} = \prod_j p_{X_j}$. One way of generalizing this to the case where the $k$-marginals are known is to use the maximum entropy principle [17], which suggests choosing the distribution that maximizes the joint entropy while satisfying the constraints given by the partial ($k$-marginal) knowledge. Let us denote by $\tilde p^{(k)}_\mathbf{X}$ the p.d.f. which achieves the maximum entropy while being consistent with all the $k$-marginals, and let $H^{(k)} = H(\{\tilde p^{(k)}_\mathbf{X}\})$ denote its entropy. Note that $H^{(k)} \geq H^{(k+1)}$, since the number of constraints involved in the maximization that generates $H^{(k)}$ increases with $k$.
It can then be shown that the following generalized Pythagorean relationship holds for the total correlation:

$$\mathrm{TC} = H^{(1)} - H(\mathbf{X}) = \sum_{k=2}^{N} \big[H^{(k-1)} - H^{(k)}\big] = \sum_{k=2}^{N} D\big(\tilde p^{(k)} \,\big\|\, \tilde p^{(k-1)}\big) := \sum_{k=2}^{N} \Delta H^{(k)}. \qquad (4)$$

Above, $\Delta H^{(k)} \geq 0$ measures the additional information that is provided by the $k$-marginals and was not contained in the description of the system given by the $(k-1)$-marginals. In general, the information that is located in terms with higher values of $k$ is due to dependencies between groups of variables that cannot be reduced to combinations of dependencies between smaller groups.

It has been observed that in many practical scenarios most of the TC of the measured data is provided by the lower marginals. It can be shown that the fraction of the TC that is lost by considering only the $k_0$-order marginals is given by

$$\frac{\mathrm{TC} - \sum_{k=2}^{k_0} \Delta H^{(k)}}{\mathrm{TC}} = \frac{1}{\mathrm{TC}} \sum_{k=k_0+1}^{N} \Delta H^{(k)} = \frac{1}{\mathrm{TC}} D\big(p_\mathbf{X} \,\big\|\, \tilde p^{(k_0)}_\mathbf{X}\big). \qquad (5)$$

This quantity is small if there exists a value of $k_0$ such that $\tilde p^{(k_0)}_\mathbf{X}$ provides an accurate approximation of the joint p.d.f. of the system. Interestingly, it has been shown that pairwise maximum entropy models (i.e. $k_0 = 2$) can provide an accurate description of the statistics of many biological systems [18]–[21] and also of some social organizations [22], [23].
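The quantities $\tilde p^{(k)}$ and $\Delta H^{(k)}$ can be computed numerically. The sketch below assumes iterative proportional fitting (IPF) started from the uniform distribution, which converges to the pairwise maximum entropy distribution $\tilde p^{(2)}$; the function names are our own, not the paper's. For the XOR triple of Section II, all 2-marginals are uniform, so $H^{(2)} = 3$ bits while $H(\mathbf{X}) = 2$ bits, i.e. the entire TC of 1 bit lives in $\Delta H^{(3)}$:

```python
import numpy as np
from itertools import product, combinations

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def pairwise_maxent(p, sweeps=500):
    """IPF from uniform: matches all 2-marginals of p, maximizing entropy."""
    q = np.full(p.shape, 1.0 / p.size)
    for _ in range(sweeps):
        for i, j in combinations(range(p.ndim), 2):
            axes = tuple(k for k in range(p.ndim) if k not in (i, j))
            target = p.sum(axis=axes, keepdims=True)
            current = q.sum(axis=axes, keepdims=True)
            ratio = np.divide(target, current,
                              out=np.zeros_like(target), where=current > 0)
            q = q * ratio
    return q

# XOR triple: two fair coins and their XOR.
p = np.zeros((2, 2, 2))
for x1, x2 in product([0, 1], repeat=2):
    p[x1, x2, x1 ^ x2] = 0.25

q2 = pairwise_maxent(p)
print(entropy(q2.ravel()), entropy(p.ravel()))       # H^(2) = 3, H(X) = 2
print(entropy(q2.ravel()) - entropy(p.ravel()))      # Delta H^(3) = 1 bit
```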
B. Internal and external decompositions

An alternative approach to studying the interdependencies between many random variables is to analyze the ways in which they share information. This can be done by decomposing the joint entropy of the system. For the case of two variables, the joint entropy can be decomposed as

$$H(X_1, X_2) = I(X_1; X_2) + H(X_1|X_2) + H(X_2|X_1), \qquad (6)$$

suggesting that it can be divided into shared information, $I(X_1; X_2)$, and terms which represent information that is exclusively located in a single variable, i.e., $H(X_1|X_2)$ for $X_1$ and $H(X_2|X_1)$ for $X_2$.

In systems with more than two variables, one can compute the total information that is exclusively located in one variable as $H_{(1)} := \sum_j H(X_j|\mathbf{X}^c_j)$, where $\mathbf{X}^c_j$ denotes all the system's variables except $X_j$. The difference between the joint entropy and the sum of all exclusive information terms, $H_{(1)}$, defines a quantity known [24] as the dual total correlation (DTC):

$$\mathrm{DTC} = H(\mathbf{X}) - H_{(1)}, \qquad (7)$$

which measures the portion of the joint entropy that is shared between two or more variables of the system. (The superscripts and subscripts are used to reflect that $H^{(1)} \geq H(\mathbf{X}) \geq H_{(1)}$. The DTC is also known as excess entropy in [25], a definition that differs from the typical use of that term in the context of time series, e.g. [26].) When $N = 2$, then $\mathrm{DTC} = I(X_1; X_2)$, and hence the DTC has also been suggested in the literature as a measure of multivariate mutual information.

By comparing (4) and (7), it would be appealing to look for a decomposition of the DTC of the form $\mathrm{DTC} = \sum_{k=2}^{N} \Delta H_{(k)}$, where $\Delta H_{(k)} \geq 0$ would measure the information that is shared by exactly $k$ variables [27]. With this, one could define an internal entropy $H_{(j)} = H_{(1)} + \sum_{i=2}^{j} \Delta H_{(i)}$ as the information that is shared between at most $j$ variables, in contrast to the external entropy $H^{(j)} = H^{(1)} - \sum_{i=2}^{j} \Delta H^{(i)}$, which describes the information provided by the $j$-marginals. These entropies form a non-decreasing sequence:

$$H_{(1)} \leq \dots \leq H_{(N-1)} \leq H(\mathbf{X}) \leq H^{(N-1)} \leq \dots \leq H^{(1)}. \qquad (8)$$

This layered structure, and its relationship with the TC and the DTC, is graphically represented in Figure 1.
[Fig. 1: Layers of internal and external entropies that decompose the DTC and the TC. Each $\Delta H^{(j)}$ shows how much information is contained in the $j$-marginals, while each $\Delta H_{(j)}$ measures the information that is shared between exactly $j$ variables.]

It is interesting to note that even though the TC and DTC coincide for the case of $N = 2$, these quantities are in general different for larger system sizes. Therefore, in general $\Delta H^{(k)} \neq \Delta H_{(k)}$, although it is appealing to believe that there should exist a relationship between them. One of the goals of this paper is to explore the difference between these quantities.
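A small example (our scaffolding, not the paper's code) makes the difference concrete: for three copies of a single fair bit, the TC equals 2 bits while the DTC equals 1 bit:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Three perfectly correlated bits ("giant bit"): X1 = X2 = X3 ~ Bernoulli(1/2).
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5

H_joint = entropy(p.ravel())
H_marg = sum(entropy(p.sum(axis=tuple(k for k in range(3) if k != i)))
             for i in range(3))
# Exclusive information: H(X_i | X_i^c) = H(X) - H(X_i^c).
H_excl = sum(H_joint - entropy(p.sum(axis=i).ravel()) for i in range(3))

print("TC  =", H_marg - H_joint)   # 2 bits
print("DTC =", H_joint - H_excl)   # 1 bit
```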
C. Inclusion-exclusion decompositions

Perhaps the most natural approach to decomposing the DTC and the joint entropy is to apply the inclusion-exclusion principle, using the simplifying analogy that entropies and areas have similar properties. A refined version of this approach can be found in the I-measures [28] and in the multi-scale complexity [29]. For the case of three variables, this approach gives

$$\mathrm{DTC}_{N=3} = I(X_1; X_2|X_3) + I(X_1; X_3|X_2) + I(X_2; X_3|X_1) + I(X_1; X_2; X_3). \qquad (9)$$

The last term is known as the co-information [30] (being closely related to the interaction information [31]), and can be defined using the inclusion-exclusion principle as

$$I(X_1; X_2; X_3) := H(X_1) + H(X_2) + H(X_3) - H(X_1, X_2) - H(X_1, X_3) - H(X_2, X_3) + H(X_1, X_2, X_3) \qquad (10)$$

$$= I(X_1; X_2) - I(X_1; X_2|X_3). \qquad (11)$$

As the co-information reduces to $I(X_1; X_2)$ when $N = 2$, it has also been proposed as a candidate for extending the mutual information to multiple variables. For a summary of the various possible extensions of the mutual information, see Table I and the additional discussion in Ref. [32].
TABLE I
SUMMARY OF THE CANDIDATES FOR EXTENDING THE MUTUAL INFORMATION FOR N ≥ 3.

Name                     Formula
Total correlation        $\mathrm{TC} = \sum_j H(X_j) - H(\mathbf{X})$
Dual total correlation   $\mathrm{DTC} = H(\mathbf{X}) - \sum_j H(X_j|\mathbf{X}^c_j)$
Co-information           $I(X_1; X_2; X_3) = I(X_1; X_2) - I(X_1; X_2|X_3)$

It is tempting to coarsen the decomposition provided by this approach in order to build a decomposition of the DTC. In this decomposition, the co-information is associated with $\Delta H_{(3)}$, and the remaining terms of (9) are associated with $\Delta H_{(2)}$. With this, one can build a Venn diagram for the information sharing between three variables, as in Figure 2. However, the resulting decomposition and diagram are not very intuitive, since the co-information can be negative.

As part of this temptation, it is appealing to consider the conditional mutual information $I(X_1; X_2|X_3)$ as the information contained in $X_1$ and $X_2$ that is not contained in $X_3$, just as the conditional entropy $H(X_1|X_2)$ is the information that is in $X_1$ and not in $X_2$.
[Fig. 2: An approach based on the I-measures decomposes the total entropy of three variables $H(X, Y, Z)$ into signed areas.]

However, the latter interpretation works because conditioning always reduces entropy (i.e., $H(X_1) \geq H(X_1|X_2)$), while this is not true for mutual information; that is, in some cases the conditional mutual information $I(X_1; X_2|X_3)$ can be greater than $I(X_1; X_2)$. This suggests that the conditional mutual information can capture information that extends beyond $X_1$ and $X_2$, incorporating higher-order effects with respect to $X_3$. Therefore, a better understanding of the conditional mutual information is required in order to refine the decomposition suggested by (9).
D. Synergistic information

An extended treatment of the conditional mutual information and its relationship with the mutual information decomposition can be found in [33], [34]. To present these ideas, let us consider two random variables $X_1$ and $X_2$ which are used to predict $Y$. The total predictability, i.e., the part of the randomness of $Y$ that can be predicted by $X_1$ and $X_2$, can be expressed using the chain rule of the mutual information as

$$I(X_1 X_2; Y) = I(X_1; Y) + I(X_2; Y|X_1). \qquad (12)$$

(Note that the term total predictability has also been used in [26] with a definition that differs from our current usage. For simplicity, throughout the paper we use the shorthand notation $XY = (X, Y)$.)

It is natural to think that the predictability provided by $X_1$, which is given by the term $I(X_1; Y)$, can be either unique or redundant with respect to the information provided by $X_2$. On the other hand, due to (12) it is clear that the unique predictability contributed by $X_2$ must be contained in $I(X_2; Y|X_1)$. However, the fact that $I(X_2; Y|X_1)$ can be larger than $I(X_2; Y)$ (while the latter contains both the unique and redundant contributions of $X_2$) suggests that there can be an additional predictability that is accounted for only by the conditional mutual information. Following this rationale, we denote as synergistic predictability the part of the conditional mutual information that corresponds to evidence about the target that is not contained in any single predictor, but is only revealed when both are known. As an example of this, consider again the case in which $X_1$ and $X_2$ are independent random bits and $Y = X_1 \oplus X_2$. Then, it can be seen that $I(X_1; Y) = I(X_2; Y) = 0$ but $I(X_1 X_2; Y) = I(X_2; Y|X_1) = 1$. Hence, neither $X_1$ nor $X_2$ individually provides information about $Y$, although together they fully determine it. Further discussions about the notion of information synergy can be found in [11], [35]–[37].
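The XOR numbers quoted above are easy to reproduce (a minimal sketch; the entropy helpers are ours):

```python
import numpy as np
from itertools import product

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Axes: (x1, x2, y) with y = x1 XOR x2.
p = np.zeros((2, 2, 2))
for x1, x2 in product([0, 1], repeat=2):
    p[x1, x2, x1 ^ x2] = 0.25

def H(keep):
    """Joint entropy of the variables on the given axes (0=X1, 1=X2, 2=Y)."""
    drop = tuple(k for k in range(3) if k not in keep)
    return entropy((p.sum(axis=drop) if drop else p).ravel())

I_x1_y   = H((0,)) + H((2,)) - H((0, 2))                   # I(X1;Y)    = 0
I_x2_y   = H((1,)) + H((2,)) - H((1, 2))                   # I(X2;Y)    = 0
I_x12_y  = H((0, 1)) + H((2,)) - H((0, 1, 2))              # I(X1X2;Y)  = 1
I_x2_y_1 = H((0, 1)) + H((0, 2)) - H((0,)) - H((0, 1, 2))  # I(X2;Y|X1) = 1
print(I_x1_y, I_x2_y, I_x12_y, I_x2_y_1)
```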
III. A NON-NEGATIVE JOINT ENTROPY DECOMPOSITION

Following the discussion presented in Section II-B, we search for a decomposition of the joint entropy that reflects the private, common and synergistic modes of information sharing. In this way, we want the decomposition to distinguish information that is shared by only a few variables from information that is accessible from the entire system.

Our framework is based on distinguishing the directed notion of predictability from the undirected one of information. It is to be noted that there is an ongoing debate about the best way of characterizing and computing the predictability in arbitrary systems, as the commonly used axioms are not enough to specify a unique formula that satisfies them [35]. Nevertheless, our approach is to explore how far one can go based on an axiomatic approach. In this way, our results will be consistent with any choice of formula that is consistent with the discussed axioms.

In the following, Sections III-A, III-B and III-C discuss the basic features of predictability and information. After these necessary preliminaries, Section III-D presents our joint entropy decomposition for discrete and continuous variables.

A. Predictability axioms
Let us consider two variables $X_1$ and $X_2$ that are used to predict a target variable $Y := X_3$. Intuitively, $I(X_1; Y)$ quantifies the predictability of $Y$ that is provided by $X_1$. In the following, we want to find a function $R(X_1 X_2 \rightarrow Y)$ that measures the redundant predictability provided by $X_1$ with respect to the predictability provided by $X_2$, and a function $U(X_1 \rightarrow Y|X_2)$ that measures the unique predictability that is provided by $X_1$ but not by $X_2$. Following [33], we first determine a number of desired properties that these functions should have.

Definition: A predictability decomposition is defined by the real-valued functions $R(X_1 X_2 \rightarrow Y)$ and $U(X_1 \rightarrow Y|X_2)$ over the distributions of $(X_1, Y)$ and $(X_2, Y)$, which satisfy the following axioms:
(1) Non-negativity: $R(X_1 X_2 \rightarrow Y), U(X_1 \rightarrow Y|X_2) \geq 0$.
(2) $I(X_1; Y) = R(X_1 X_2 \rightarrow Y) + U(X_1 \rightarrow Y|X_2)$.
(3) $I(X_1 X_2; Y) \geq R(X_1 X_2 \rightarrow Y) + U(X_1 \rightarrow Y|X_2) + U(X_2 \rightarrow Y|X_1)$.
(4) Weak symmetry I: $R(X_1 X_2 \rightarrow Y) = R(X_2 X_1 \rightarrow Y)$.

Above, Axiom (3) states that the sum of the redundant and the corresponding unique predictabilities given by each variable cannot be larger than the total predictability. Axiom (4) states that the redundancy is independent of the ordering of the predictors. The following lemma determines the bounds for the redundant predictability (the proof is given in Appendix A).

Lemma 1:
The functions $R(X_1 X_2 \rightarrow Y)$ and $U(X_1 \rightarrow Y|X_2) = I(X_1; Y) - R(X_1 X_2 \rightarrow Y)$ satisfy Axioms (1)–(3) if and only if

$$\min\{I(X_1; Y), I(X_2; Y)\} \geq R(X_1 X_2 \rightarrow Y) \geq [I(X_1; X_2; Y)]^+, \qquad (13)$$

where $[a]^+ = \max\{a, 0\}$.

Corollary 2:
There always exists at least one predictability decomposition that satisfies Axioms (1)–(4), which is given by

$$R(X_1 X_2 \rightarrow Y) := \min\{I(X_1; Y), I(X_2; Y)\}. \qquad (14)$$

(In fact, the difference between the right- and left-hand terms of Axiom (3) gives the synergistic predictability, whose analysis will not be included in this work.)

Proof:
Being a symmetric function of $X_1$ and $X_2$, (14) satisfies Axiom (4). Also, as (14) is equal to the upper bound given in Lemma 1, Axioms (1)–(3) are satisfied due to Lemma 1.

In principle, the notion of redundant predictability takes the point of view of the target variable and measures the parts that can be predicted by both $X_1$ and $X_2$ when they are used by themselves, i.e., without combining them with each other. It is appealing to think that there should exist a unique function that provides such a measure. Nevertheless, these axioms define only very basic properties that a measure of redundant predictability should satisfy, and hence in general they are not enough to define a unique function. In fact, a number of different predictability decompositions have been proposed in the literature [35], [36], [38], [39].

It is to be noted that, of all the candidates compatible with the axioms, the decomposition given in Corollary 2 yields the largest possible redundant predictability measure. It is clear that in some cases this measure over-estimates the redundant predictability given by $X_1$ and $X_2$; for an example of this, consider $X_1$ and $X_2$ to be independent variables and $Y = (X_1, X_2)$. Nevertheless, (14) has been proposed as an adequate measure for the redundant predictability of multivariate Gaussians [39] (for a corresponding discussion see Section VI).

B. Shared, private and synergistic information
Let us now introduce an additional axiom, which will form the basis for our proposed information decomposition.

Definition: A symmetrical information decomposition is given by the real-valued functions $I_\cap(X_1; X_2; X_3)$ and $I_{\mathrm{priv}}(X_1; X_3|X_2)$ over the marginal distributions of $(X_1, X_2)$, $(X_1, X_3)$ and $(X_2, X_3)$, which satisfy Axioms (1)–(4) for $I_\cap(X_1; X_2; X_3) := R(X_1 X_2 \rightarrow X_3)$ and $I_{\mathrm{priv}}(X_1; X_3|X_2) := U(X_1 \rightarrow X_3|X_2)$, while also satisfying the following property:
(5) Weak symmetry II: $I_{\mathrm{priv}}(X_1; X_3|X_2) = I_{\mathrm{priv}}(X_3; X_1|X_2)$.
Finally, $I_S(X_1; X_2; X_3)$ is defined as $I_S(X_1; X_2; X_3) := I(X_1; X_2|X_3) - I_{\mathrm{priv}}(X_1; X_2|X_3)$.

The role of Axiom (5) can be related to the role of the fifth of Euclid's postulates: while seeming innocuous, its addition has strong consequences for the corresponding theory. The following lemma explains why this decomposition is called symmetrical, and also gives fundamental bounds for these information functions (the proof is presented in Appendix C).

Lemma 3:
The functions that compose a symmetrical information decomposition satisfy the following properties:
(a) Strong symmetry: $I_\cap(X_1; X_2; X_3)$ and $I_S(X_1; X_2; X_3)$ are symmetric in their three arguments.
(b) Bounds: these quantities satisfy the following inequalities:

$$\min\{I(X_1; X_2), I(X_1; X_3), I(X_2; X_3)\} \geq I_\cap(X_1; X_2; X_3) \geq [I(X_1; X_2; X_3)]^+ \qquad (15)$$

$$\min\{I(X_1; X_2), I(X_1; X_2|X_3)\} \geq I_{\mathrm{priv}}(X_1; X_2|X_3) \geq 0 \qquad (16)$$

$$\min\{I(X_1; X_2|X_3), I(X_1; X_3|X_2), I(X_2; X_3|X_1)\} \geq I_S(X_1; X_2; X_3) \geq [-I(X_1; X_2; X_3)]^+ \qquad (17)$$

Note that the defined functions can be used to decompose the mutual information as follows:

$$I(X_1 X_2; X_3) = I(X_1; X_3) + I(X_2; X_3|X_1) \qquad (18)$$

$$I(X_1; X_2) = I_\cap(X_1; X_2; X_3) + I_{\mathrm{priv}}(X_1; X_2|X_3) \qquad (19)$$

$$I(X_1; X_2|X_3) = I_{\mathrm{priv}}(X_1; X_2|X_3) + I_S(X_1; X_2; X_3) \qquad (20)$$

In contrast to a decomposition based on predictability, these measures address properties of the system $(X_1, X_2, X_3)$ as a whole, without depending on how it is divided into target and predictor variables (for a parallel with the corresponding predictability measures, see Table II). Intuitively, $I_\cap(X_1; X_2; X_3)$ measures the shared information that is common to $X_1$, $X_2$ and $X_3$; $I_{\mathrm{priv}}(X_1; X_2|X_3)$ quantifies the private information that is shared by $X_1$ and $X_2$ but not $X_3$; and $I_S(X_1; X_2; X_3)$ captures the synergistic information that exists between $(X_1, X_2, X_3)$. The latter is a non-intuitive mode of information sharing, whose nature we hope to clarify through the analysis of the particular cases presented in Sections IV and VI.
TABLE II
PARALLELISM BETWEEN PREDICTABILITY AND INFORMATION MEASURES.

Directed measures                                        Symmetrical measures
Redundant predictability $R(X_1 X_2 \rightarrow X_3)$    Shared information $I_\cap(X_1; X_2; X_3)$
Unique predictability $U(X_1 \rightarrow X_3|X_2)$       Private information $I_{\mathrm{priv}}(X_1; X_3|X_2)$
Synergistic predictability                               Synergistic information $I_S(X_1; X_2; X_3)$

Note also that the co-information can be expressed as

$$I(X_1; X_2; X_3) = I_\cap(X_1; X_2; X_3) - I_S(X_1; X_2; X_3). \qquad (21)$$

Hence, a strictly positive (resp. negative) co-information is a sufficient, although not necessary, condition for the system to have non-zero shared (resp. synergistic) information.

C. Further properties of the symmetrical decomposition
At this point, it is important to clarify a fundamental distinction that we make between the notions of predictability and information. Predictability is intrinsically a directed notion, which is based on a distinction between the predictors and the target variable. In contrast, we use the term information to refer exclusively to intrinsic statistical properties of the whole system, which do not rely on such a distinction. The main difference between the two notions is that, in principle, the predictability only considers the predictable parts of the target, while the shared information also considers the joint statistics of the predictors. Although this distinction will be further developed when we address the case of Gaussian variables (cf. Section VI-C), let us for now present a simple example to help develop intuition about this issue.
Example: Define the following functions:

$$I_\cap(X_1; X_2; X_3) = \min\{I(X_1; X_2), I(X_1; X_3), I(X_2; X_3)\} \qquad (22)$$

$$I_{\mathrm{priv}}(X_1; X_2|X_3) = I(X_1; X_2) - I_\cap(X_1; X_2; X_3) \qquad (23)$$

It is straightforward to check that these functions satisfy Axioms (1)–(5), and therefore constitute a symmetrical information decomposition. In contrast to the decomposition given in Corollary 2, this one can be seen to be strongly symmetric and also dependent on all three pairwise marginals $(X_1, X_2)$, $(X_1, X_3)$ and $(X_2, X_3)$.
In the following lemma we generalize the previous construction; its simple proof is omitted.

Lemma 4: For a given predictability decomposition with functions $R(X_1 X_2 \rightarrow X_3)$ and $U(X_1 \rightarrow X_3|X_2)$, the functions

$$I_\cap(X_1; X_2; X_3) = \min\{R(X_1 X_2 \rightarrow X_3), R(X_1 X_3 \rightarrow X_2), R(X_2 X_3 \rightarrow X_1)\} \qquad (24)$$

$$I_{\mathrm{priv}}(X_1; X_2|X_3) = I(X_1; X_2) - I_\cap(X_1; X_2; X_3) \qquad (25)$$

provide a symmetrical information decomposition, which is called the canonical symmetrization of the predictability.

Corollary 5:
There always exists at least one symmetric information decomposition.
Proof:
This is a direct consequence of the previous lemma and Corollary 2.

Perhaps the most remarkable property of symmetrized information decompositions is that, in contrast to directed ones, they are uniquely determined by Axioms (1)–(5) in a number of interesting cases.
Theorem 6:
The symmetric information decomposition is unique if the variables form a Markov chain or if two of them are pairwise independent.
Proof:
Let us consider the upper and lower bounds for $I_\cap$ given in (15), denoting them by $c_{\mathrm{low}} := [I(X_1; X_2; X_3)]^+$ and $c_{\mathrm{up}} := \min\{I(X_1; X_2), I(X_1; X_3), I(X_2; X_3)\}$. These bounds restrict the possible $I_\cap$ functions to lie in the interval $[c_{\mathrm{low}}, c_{\mathrm{up}}]$ of length

$$|c_{\mathrm{up}} - c_{\mathrm{low}}| = \min\{I(X_1; X_2), I(X_1; X_3), I(X_2; X_3), \qquad (26)$$

$$\qquad\qquad I(X_1; X_2|X_3), I(X_1; X_3|X_2), I(X_2; X_3|X_1)\}. \qquad (27)$$

Therefore, the framework provides a unique expression for the shared information if (at least) one of the above six terms is zero. These scenarios correspond either to Markov chains, where one conditional mutual information term is zero, or to pairwise independent variables, where one mutual information term vanishes.

Pairwise independent variables and Markov chains are analyzed in Sections IV and V-A, respectively.

D. Decomposition for the joint entropy of three variables
Now we use the notions of shared, private and synergistic information functions to develop a non-negative decomposition of the joint entropy, which is based on a non-negative decomposition of the DTC. For the case of three discrete variables, applying (20) and (21) to (9) yields

$$\mathrm{DTC} = I_{\mathrm{priv}}(X_1; X_2|X_3) + I_{\mathrm{priv}}(X_1; X_3|X_2) + I_{\mathrm{priv}}(X_2; X_3|X_1) + I_\cap(X_1; X_2; X_3) + 2 I_S(X_1; X_2; X_3). \qquad (28)$$

From (7) and (28), one can propose the following decomposition of the joint entropy:

$$H(X_1, X_2, X_3) = H_{(1)} + \Delta H_{(2)} + \Delta H_{(3)}, \qquad (29)$$

where

$$H_{(1)} = H(X_1|X_2, X_3) + H(X_2|X_1, X_3) + H(X_3|X_1, X_2) \qquad (30)$$

$$\Delta H_{(2)} = I_{\mathrm{priv}}(X_1; X_2|X_3) + I_{\mathrm{priv}}(X_1; X_3|X_2) + I_{\mathrm{priv}}(X_2; X_3|X_1) \qquad (31)$$

$$\Delta H_{(3)} = I_\cap(X_1; X_2; X_3) + 2 I_S(X_1; X_2; X_3) \qquad (32)$$

In contrast to (9), here each term is non-negative because of Lemma 3. Therefore, (29) yields a non-negative decomposition of the joint entropy, where each of the corresponding terms captures the information that is shared by one, two or three variables. Interestingly, $H_{(1)}$ and $\Delta H_{(2)}$ are homogeneous (being the sum of all the exclusive information or private information of the system), while $\Delta H_{(3)}$ is composed of a mixture of two different information sharing modes. (From (20), it can be seen that the co-information is sometimes negative in order to compensate for the triple counting of the synergy due to the sum of the three conditional mutual information terms.)

An analogous decomposition can be developed for the case of continuous random variables. Nevertheless, as the differential entropy can be negative, not all the terms of the decomposition can be non-negative. In effect, following the same rationale that led to (29), the following decomposition can be found:

$$h(X_1, X_2, X_3) = h_{(1)} + \Delta H_{(2)} + \Delta H_{(3)}. \qquad (33)$$

Above, $h(\mathbf{X})$ denotes the differential entropy of $\mathbf{X}$, $\Delta H_{(2)}$ and $\Delta H_{(3)}$ are as defined in (31) and (32), and

$$h_{(1)} = h(X_1|X_2 X_3) + h(X_2|X_1 X_3) + h(X_3|X_1 X_2). \qquad (34)$$

Hence, although both the joint entropy $h(X_1, X_2, X_3)$ and $h_{(1)}$ can be negative, the remaining terms conserve their non-negativity.

It can be seen that the lowest layer of the decomposition is always trivial to compute, and hence the challenge is to find expressions for $\Delta H_{(2)}$ and $\Delta H_{(3)}$. In the rest of the paper, we explore scenarios where these quantities can be characterized.
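As an illustration, the following sketch (our code) computes the layers (30)–(32) using the minimum-MI shared information of the Example in Section III-C; for the XOR triple, $H_{(1)} = \Delta H_{(2)} = 0$ and all the entropy is synergistic, $\Delta H_{(3)} = 2$ bits:

```python
import numpy as np
from itertools import product

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def decompose(p):
    """Layers H_(1), Delta H_(2), Delta H_(3) of eqs. (29)-(32), with the
    min-MI shared information of eq. (22)."""
    def H(keep):
        drop = tuple(k for k in range(3) if k not in keep)
        return entropy((p.sum(axis=drop) if drop else p).ravel())
    def I(i, j):
        return H((i,)) + H((j,)) - H((i, j))
    def Ic(i, j, k):                                      # I(Xi;Xj|Xk)
        return (H(tuple(sorted((i, k)))) + H(tuple(sorted((j, k))))
                - H((k,)) - H((0, 1, 2)))
    I_cap = min(I(0, 1), I(0, 2), I(1, 2))                # eq. (22)
    I_priv = {f: I(*f) - I_cap for f in [(0, 1), (0, 2), (1, 2)]}  # eq. (23)
    I_syn = Ic(0, 1, 2) - I_priv[(0, 1)]                  # definition of I_S
    H1 = sum(H((0, 1, 2)) - H(tuple(k for k in range(3) if k != i))
             for i in range(3))                           # eq. (30)
    return H1, sum(I_priv.values()), I_cap + 2 * I_syn    # eqs. (30)-(32)

p = np.zeros((2, 2, 2))
for x1, x2 in product([0, 1], repeat=2):
    p[x1, x2, x1 ^ x2] = 0.25
H1, dH2, dH3 = decompose(p)
print(H1, dH2, dH3, "sum =", H1 + dH2 + dH3)  # 0.0 0.0 2.0, sum = H(X) = 2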
IV. PAIRWISE INDEPENDENT VARIABLES
In this section we focus on the case where two variables are pairwise independent while being globally connected by a third variable. The fact that pairwise independent variables can become correlated when additional information becomes available is known in the statistics literature as Berkson's paradox or selection bias [40], and as the explaining-away effect in the context of artificial intelligence [41]. As an example of this phenomenon, let $X_1$ and $X_2$ be two pairwise independent standard Gaussian variables, and let $X_3$ be a binary variable that is equal to 1 if $X_1 + X_2 > 0$ and zero otherwise. Then, knowing that $X_3 = 1$ implies that $X_1 > -X_2$, and hence knowing the value of $X_2$ effectively reduces the uncertainty about $X_1$.

In our framework, Berkson's paradox can be understood as synergistic information that is introduced by the third component of the system. In fact, we will show that in this case the synergistic information function is unique and given by

$$I_S(X_1; X_2; X_3) = \sum_{x_3} p_{X_3}(x_3)\, I(X_1; X_2|X_3 = x_3) = I(X_1; X_2|X_3), \qquad (35)$$

which is, in effect, a measure of the dependencies between $X_1$ and $X_2$ that are created by $X_3$.

In the following, Section IV-A presents the unique symmetrized information decomposition for this case. Then, Section IV-B focuses on the particular case where $X_3$ is a function of the other two variables.

A. Uniqueness of the entropy decomposition
Let us assume that $X_1$ and $X_2$ are pairwise independent, and hence the joint p.d.f. of $X_1$, $X_2$ and $X_3$ has the following structure:

$$p_{X_1 X_2 X_3}(x_1, x_2, x_3) = p_{X_1}(x_1)\, p_{X_2}(x_2)\, p_{X_3|X_1 X_2}(x_3|x_1, x_2). \qquad (36)$$

It is direct to see that in this case $p_{X_1 X_2} = \sum_{x_3} p_{X_1 X_2 X_3} = p_{X_1} p_{X_2}$, but $p_{X_1 X_2|X_3} \neq p_{X_1|X_3}\, p_{X_2|X_3}$. Therefore, as $I(X_1; X_2) = 0$, it follows directly from Axiom (1) that any redundant predictability function satisfies $R(X_1 X_3 \rightarrow X_2) = R(X_2 X_3 \rightarrow X_1) = 0$. However, the axioms are not enough to uniquely determine $R(X_1 X_2 \rightarrow X_3)$. (Note that as in this case $I(X_1; X_2; X_3) = -I(X_1; X_2|X_3) \leq 0$, the only restriction that the bound presented in Lemma 3 provides is $\min\{I(X_1; X_3), I(X_2; X_3)\} \geq R(X_1 X_2 \rightarrow X_3) \geq 0$.) Nevertheless, the symmetrized decomposition is uniquely determined, as shown in the next corollary, which is a consequence of Theorem 6.

Corollary 7: If $X_1$, $X_2$ and $X_3$ follow a p.d.f. as in (36), then the shared, private and synergistic information functions are unique. They are given by

$$I_\cap(X_1; X_2; X_3) = I_{\mathrm{priv}}(X_1; X_2|X_3) = 0 \qquad (37)$$

$$I_{\mathrm{priv}}(X_1; X_3|X_2) = I(X_1; X_3) \qquad (38)$$

$$I_{\mathrm{priv}}(X_2; X_3|X_1) = I(X_2; X_3) \qquad (39)$$

$$I_S(X_1; X_2; X_3) = I(X_1; X_2|X_3) = -I(X_1; X_2; X_3). \qquad (40)$$

Proof:
The fact that there is no shared information follows directly from the upper bound presented in Lemma 3. Using this, the expressions for the private information can be found using Axiom (2). Finally, the synergistic information can be computed as

$$I_S(X_1; X_2; X_3) = I(X_1; X_2|X_3) - I_{\mathrm{priv}}(X_1; X_2|X_3) = I(X_1; X_2|X_3). \qquad (41)$$

The second formula for the synergistic information then follows from the fact that $I(X_1; X_2) = 0$.

With this corollary, the unique decomposition $\mathrm{DTC} = \Delta H_{(2)} + \Delta H_{(3)}$ is found to be

$$\Delta H_{(2)} = I(X_1; X_3) + I(X_2; X_3) \qquad (42)$$

$$\Delta H_{(3)} = 2 I(X_1; X_2|X_3). \qquad (43)$$

Note that the terms $\Delta H_{(2)}$ and $\Delta H_{(3)}$ can be bounded as follows:

$$\Delta H_{(2)} \leq \min\{H(X_1), H(X_3)\} + \min\{H(X_2), H(X_3)\}, \qquad (44)$$

$$\Delta H_{(3)} \leq 2 \min\{H(X_1|X_3), H(X_2|X_3)\}. \qquad (45)$$

The bound for $\Delta H_{(2)}$ follows from the basic fact that $I(X; Y) \leq \min\{H(X), H(Y)\}$. The second bound follows from

$$I(X; Y|Z) = \sum_z p_Z(z)\, I(X; Y|Z = z) \qquad (46)$$

$$\leq \sum_z p_Z(z) \min\{H(X|Z = z), H(Y|Z = z)\} \qquad (47)$$

$$\leq \min\Big\{\sum_z p_Z(z)\, H(X|Z = z),\ \sum_z p_Z(z)\, H(Y|Z = z)\Big\} \qquad (48)$$

$$= \min\{H(X|Z), H(Y|Z)\}. \qquad (49)$$
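For instance (a hedged numeric check with our helper code), for two independent fair bits and $X_3 = X_1$ AND $X_2$, equations (42) and (43) give $\Delta H_{(2)} \approx 0.62$ bits and $\Delta H_{(3)} = 2 I(X_1; X_2|X_3) \approx 0.38$ bits:

```python
import numpy as np
from itertools import product

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p = np.zeros((2, 2, 2))
for x1, x2 in product([0, 1], repeat=2):
    p[x1, x2, x1 & x2] = 0.25          # X3 = X1 AND X2

def H(keep):
    drop = tuple(k for k in range(3) if k not in keep)
    return entropy((p.sum(axis=drop) if drop else p).ravel())

I_13 = H((0,)) + H((2,)) - H((0, 2))
I_23 = H((1,)) + H((2,)) - H((1, 2))
I_12_g3 = H((0, 2)) + H((1, 2)) - H((2,)) - H((0, 1, 2))   # I(X1;X2|X3)

print("Delta H_(2) =", I_13 + I_23)    # eq. (42), ~0.623 bits
print("Delta H_(3) =", 2 * I_12_g3)    # eq. (43), ~0.377 bits
```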
B. Functions of independent arguments

Let us focus in this section on the special case where $X_3 = F(X_1, X_2)$ is a function of two independent random inputs, and study its corresponding entropy decomposition. We will consider $X_1$ and $X_2$ as inputs and $F(X_1, X_2)$ as the output. Although this scenario fits nicely into the predictability framework, it can also be studied from the shared information framework's perspective. Our goal is to understand how $F$ affects the information sharing structure.

As $H(X_3|X_1, X_2) = 0$, we have

$$H_{(1)} = H(X_1|X_2 X_3) + H(X_2|X_1 X_3). \qquad (50)$$

The term $H_{(1)}$ hence measures the information of the inputs that is not reflected in the output. An extreme case is given by a constant function $F(X_1, X_2) = k$, for which $\Delta H_{(2)} = \Delta H_{(3)} = 0$.

The term $\Delta H_{(2)}$ measures how much of $F$ can be predicted with knowledge that comes from one of the inputs but not from the other. If $\Delta H_{(2)}$ is large then $F$ is not "mixing" the inputs too much, in the sense that each of them is by itself able to provide relevant information that is not also given by the other. In fact, a maximal value of $\Delta H_{(2)}$ is attained by $F(X_1, X_2) = (X_1, X_2)$, for which $H_{(1)} = \Delta H_{(3)} = 0$ and the bound provided in (44) is attained.

Finally, due to (43), there is no shared information and hence $\Delta H_{(3)}$ is just proportional to the synergy of the system. By considering (45), one finds that $F$ needs to leave some ambiguity about the exact values of the inputs in order for the system to possess synergy. For example, consider a 1-1 function $F$, for which for every output value $F(x_1, x_2) = x_3$ one can find the unique values $x_1$ and $x_2$ that generate it. Under this condition $H(X_1|X_3) = H(X_2|X_3) = 0$ and hence, because of (45), it is clear that a 1-1 function does not induce synergy. At the other extreme, we showed already that constant functions have $\Delta H_{(3)} = 0$, and hence the case where the output of the system gives no information about the inputs also leads to no synergy. Therefore, synergistic functions are those whose output values generate a balanced ambiguity about the generating inputs. To develop this idea further, the next lemma studies the functions that generate a maximal amount of synergy by inducing, for each output value, a different 1-1 mapping between their arguments.

Lemma 8:
Let us assume that both $X_1$ and $X_2$ take values in $\mathcal{K} = \{0, \dots, K-1\}$ and are independent. Then, the maximal possible amount of information synergy is created by the function

$$F^*(n, m) = n + m \pmod{K} \qquad (51)$$

when both input variables are uniformly distributed.

Proof:
Using the same rationale as in (49), it can be shown that if $F$ is an arbitrary function then

$$I_S(X_1; X_2; F(X_1, X_2)) = I(X_1; X_2|F) \qquad (52)$$

$$\leq \min\{H(X_1|F), H(X_2|F)\} \qquad (53)$$

$$\leq \min\{H(X_1), H(X_2)\} \qquad (54)$$

$$\leq \log K, \qquad (55)$$

where the last inequality follows from the fact that both inputs are restricted to alphabets of size $K$.

Now, consider $F^*$ to be the function given in (51) and assume that $X_1$ and $X_2$ are uniformly distributed. It can be seen that for each $z \in \mathcal{K}$ there exist exactly $K$ ordered pairs of inputs $(x_1, x_2)$ such that $F^*(x_1, x_2) = z$, which define a bijection from $\mathcal{K}$ to $\mathcal{K}$. Therefore,

$$I(X_1; X_2|F^* = z) = H(X_1|F^* = z) - H(X_1|X_2, F^* = z) = H(X_1) = \log K \qquad (56)$$

and hence

$$I_S(X_1; X_2; F^*) = I(X_1; X_2|F^*) = \sum_z \mathbb{P}\{F^* = z\} \cdot I(X_1; X_2|F^* = z) = \log K, \qquad (57)$$

showing that the upper bound presented in (55) is attained.
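The lemma can be checked numerically (a sketch; the `synergy` helper is our own): for $K = 3$, addition mod 3 attains $\log_2 3 \approx 1.585$ bits, whereas, for example, the function $\min(n, m)$ falls short:

```python
import numpy as np
from itertools import product

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def synergy(K, f):
    """I(X1;X2|F) for independent uniform inputs; equals I_S by eq. (52)."""
    outs = sorted({f(n, m) for n, m in product(range(K), repeat=2)})
    idx = {z: i for i, z in enumerate(outs)}
    p = np.zeros((K, K, len(outs)))
    for n, m in product(range(K), repeat=2):
        p[n, m, idx[f(n, m)]] = 1.0 / K**2
    def H(keep):
        drop = tuple(k for k in range(3) if k not in keep)
        return entropy((p.sum(axis=drop) if drop else p).ravel())
    return H((0, 2)) + H((1, 2)) - H((2,)) - H((0, 1, 2))

K = 3
print(synergy(K, lambda n, m: (n + m) % K), np.log2(K))  # both ~1.585 bits
print(synergy(K, lambda n, m: min(n, m)))                # strictly smaller
```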
Corollary 9: The XOR logic gate generates the largest amount of synergistic information possible for the case of binary inputs.

The synergistic nature of addition over finite fields helps to explain the central role it plays in various fields. In cryptography, the one-time pad [42] is an encryption technique that uses finite-field additions to create a synergistic interdependency between a private message, a public signal and a secret key. This interdependency is completely destroyed when the key is not known, ensuring no information leakage to unintended receivers [43]. Also, in network coding [44], [45], nodes in the network use linear combinations of their received data packets to create and transmit synergistic combinations of the corresponding information messages. This technique has been shown to achieve the multicast capacity in wired communication networks [45] and has also been used to increase the throughput of wireless systems [46].

V. DISCRETE PAIRWISE MAXIMUM ENTROPY DISTRIBUTIONS AND MARKOV CHAINS
This section studies the case where the system's variables follow a pairwise maximum entropy (PME) distribution. These distributions are of great importance in the statistical physics and machine learning communities, where they are studied under the names of Gibbs distributions [47] or Markov random fields [48].

Concretely, let us consider three pairwise marginal distributions $p_{X_1 X_2}$, $p_{X_1 X_3}$ and $p_{X_2 X_3}$ for the discrete variables $X_1$, $X_2$ and $X_3$. Let us denote by $\mathcal{Q}$ the set of all joint p.d.f.s over $(X_1, X_2, X_3)$ that have those as their pairwise marginal distributions. Then, the corresponding PME distribution is given by the joint p.d.f. $\tilde p_\mathbf{X}(x_1, x_2, x_3)$ that satisfies

$$\tilde p_\mathbf{X} = \operatorname*{argmax}_{p \in \mathcal{Q}} H(\{p\}). \qquad (58)$$

For the case of binary variables (i.e. $X_j \in \{0, 1\}$), the PME distribution is given by an Ising distribution [49]:

$$\tilde p_\mathbf{X}(\mathbf{X}) = \frac{e^{-\mathcal{E}(\mathbf{X})}}{Z}, \qquad (59)$$

where $Z$ is a normalization constant and $\mathcal{E}(\mathbf{X})$ is an energy function given by $\mathcal{E}(\mathbf{X}) = \sum_i J_i X_i + \sum_j \sum_{k \neq j} J_{j,k} X_j X_k$, with $J_{j,k}$ the coupling terms. In effect, if $J_{j,k} = 0$ for all $j$ and $k$, then $\tilde p_\mathbf{X}(\mathbf{X})$ factorizes as the product of the single-variable marginal p.d.f.s.

In the context of the framework discussed in Section II-A, a PME system has $\mathrm{TC} = \Delta H^{(2)}$ while $\Delta H^{(3)} = 0$. In contrast, Section V-A studies these systems in the light of the decomposition of the DTC presented in Section III-D. Then, Section V-B specializes the analysis to the particular case of Markov chains.

A. Synergy minimization
It is tempting to associate the synergistic information with what is present only in the joint p.d.f. but not in the pairwise marginals, i.e. with $\Delta H^{(3)}$. However, the following result states that there can exist some synergy determined by the pairwise marginals themselves.

Theorem 10:
PME distributions have the minimum amount of synergistic information that is allowed by their pairwise marginals.

Proof:
Note that

$$\max_{p \in \mathcal{Q}} H(X_1 X_2 X_3) = H(X_1 X_2) + H(X_3) - \min_{p \in \mathcal{Q}} I(X_1 X_2; X_3) \qquad (60)$$

$$= H(X_1 X_2) + H(X_3) - I(X_1; X_3) - \min_{p \in \mathcal{Q}} I(X_2; X_3|X_1) \qquad (61)$$

$$= H(X_1 X_2) + H(X_3) - I(X_1; X_3) - I_{\mathrm{priv}}(X_2; X_3|X_1) - \min_{p \in \mathcal{Q}} I_S(X_1; X_2; X_3). \qquad (62)$$

Therefore, maximizing the joint entropy for fixed pairwise marginals is equivalent to minimizing the synergistic information. Note that the last equality follows from the fact that $I_{\mathrm{priv}}(X_2; X_3|X_1)$ by definition only depends on the pairwise marginals.

Corollary 11:
For an arbitrary system $(X_1, X_2, X_3)$, the synergistic information can be decomposed as

$$I_S(X_1; X_2; X_3) = I_S^{\mathrm{PME}} + \Delta H^{(3)}, \qquad (63)$$

where $\Delta H^{(3)}$ is as defined in (4) and $I_S^{\mathrm{PME}} = \min_{p \in \mathcal{Q}} I_S(X_1; X_2; X_3)$ is the synergistic information of the corresponding PME distribution.

Proof:
This can be proven by noting that, for an arbitrary p.d.f. $p_{X_1 X_2 X_3}$,

$$\Delta H^{(3)} = \max_{p \in \mathcal{Q}} H(X_1 X_2 X_3) - H(\{p_{X_1 X_2 X_3}\}) \qquad (64)$$

$$= I_S(\{p_{X_1 X_2 X_3}\}) - \min_{p \in \mathcal{Q}} I_S(X_1; X_2; X_3). \qquad (65)$$

Above, the first equality corresponds to the definition of $\Delta H^{(3)}$ and the second comes from using (62) on each joint entropy term and noting that only the synergistic information depends on more than the pairwise marginals.

The previous corollary shows that $\Delta H^{(3)}$ measures only one part of the information synergy of a system: the part that can be removed without altering the pairwise marginals. Note that PME systems with non-zero synergy are easy to find. For an example, consider $X_1$ and $X_2$ to be two independent equiprobable bits, and $X_3 = X_1$ AND $X_2$. It can be shown that in this case $\Delta H^{(3)} = 0$ [16]. On the other hand, as the inputs are independent, the synergy can be computed using (40), and a direct calculation shows that

$$I_S(X_1; X_2; X_3) = I(X_1; X_2|X_3) = H(X_1|X_3) - H(X_1|X_2 X_3) = \tfrac{3}{4}\log_2 3 - 1 \approx 0.19 \text{ bits}. \qquad (66)$$

From the previous discussion, one can conclude that only a special class of pairwise distributions $p_{X_1 X_2}$, $p_{X_1 X_3}$ and $p_{X_2 X_3}$ is compatible with having null synergistic information in the system. This is a remarkable result, as the synergistic information is usually considered to be an effect purely related to higher-order marginals. It would be interesting to have an expression for the minimal information synergy that a set of pairwise distributions requires, or equivalently, a symmetrized information decomposition for PME distributions. A particular case that allows a unique solution is discussed in the next section.

B. Markov chains
Markov chains maximize the joint entropy subject to constraints on only two of the three pairwise distributions. In effect, following the same rationale as in the proof of Theorem 10, it can be shown that

$$H(X_1, X_2, X_3) = H(X_1 X_2) + H(X_3) - I(X_2; X_3) - I(X_1; X_3|X_2). \qquad (67)$$

Then, for fixed pairwise distributions $p_{X_1 X_2}$ and $p_{X_2 X_3}$, maximizing the joint entropy is equivalent to minimizing the conditional mutual information. Moreover, the maximal entropy is attained by the p.d.f. that makes $I(X_1; X_3|X_2) = 0$, which is precisely the Markov chain $X_1 - X_2 - X_3$ with joint distribution

$$p_{X_1 X_2 X_3} = \frac{p_{X_1 X_2}\, p_{X_2 X_3}}{p_{X_2}}. \qquad (68)$$

For the binary case, it can be shown that a Markov chain corresponds to an Ising distribution like (59) where the interaction term $J_{1,3}$ is equal to zero.

Theorem 6 showed that the symmetric information decomposition for Markov chains is unique. We develop this decomposition in the following corollary.

Corollary 12: If $X_1 - X_2 - X_3$ is a Markov chain, then the unique shared, private and synergistic information functions are given by

$$I_\cap(X_1; X_2; X_3) = I(X_1; X_3) \qquad (69)$$

$$I_{\mathrm{priv}}(X_1; X_2|X_3) = I(X_1; X_2) - I(X_1; X_3) \qquad (70)$$

$$I_{\mathrm{priv}}(X_2; X_3|X_1) = I(X_2; X_3) - I(X_1; X_3) \qquad (71)$$

$$I_S(X_1; X_2; X_3) = I_{\mathrm{priv}}(X_1; X_3|X_2) = 0. \qquad (72)$$

In particular, Markov chains have no synergistic information.

Proof:
For this case one can show that

$$\min_{\substack{i,j \in \{1,2,3\},\ i \neq j}} \{I(X_i; X_j)\} = I(X_1; X_3) = I(X_1; X_2; X_3), \qquad (73)$$

where the first equality is a consequence of the data processing inequality, and the second of the fact that $I(X_1; X_3|X_2) = 0$. The above equality shows that the bounds for the shared information presented in Lemma 3 give the unique solution $I_\cap(X_1; X_2; X_3) = I(X_1; X_3)$. All the other equalities follow from this fact and the corresponding definitions.

Using this corollary, the unique decomposition $\mathrm{DTC} = \Delta H_{(2)} + \Delta H_{(3)}$ for Markov chains is given by

$$\Delta H_{(2)} = I(X_1; X_2) + I(X_2; X_3) - 2 I(X_1; X_3), \qquad (74)$$

$$\Delta H_{(3)} = I(X_1; X_3). \qquad (75)$$

Hence, Corollary 12 states that a sufficient condition for three pairwise marginals to be compatible with zero information synergy is that they satisfy the Markov condition $p_{X_3|X_1} = \sum_{x_2} p_{X_3|X_2}\, p_{X_2|X_1}$. The question of finding a necessary condition is an open problem, intrinsically linked with the problem of finding a good definition of the shared information for arbitrary PME distributions.

To conclude, let us note an interesting duality that exists between Markov chains and the case where two variables are pairwise independent, which is illustrated in Table III.
TABLE III
DUALITY BETWEEN MARKOV CHAINS AND PAIRWISE INDEPENDENT VARIABLES

Markov chains                                        Pairwise independent variables
Conditional pairwise independence                    Pairwise independence
$I(X_1; X_3|X_2) = 0$                                $I(X_1; X_2) = 0$
No $I_{\mathrm{priv}}$ between $X_1$ and $X_3$       No $I_{\mathrm{priv}}$ between $X_1$ and $X_2$
No synergistic information                           No shared information
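Corollary 12 is easy to verify on a concrete chain (our code below, assuming a binary symmetric chain with flip probabilities $\epsilon$ and $\delta$): the shared information equals $I(X_1; X_3)$ and the synergy vanishes up to numerical precision:

```python
import numpy as np
from itertools import product

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

eps, delta = 0.1, 0.2          # flip probabilities of the two channels
p = np.zeros((2, 2, 2))
for x1, x2, x3 in product([0, 1], repeat=3):
    p[x1, x2, x3] = (0.5 * (eps if x2 != x1 else 1 - eps)
                         * (delta if x3 != x2 else 1 - delta))

def H(keep):
    drop = tuple(k for k in range(3) if k not in keep)
    return entropy((p.sum(axis=drop) if drop else p).ravel())

def I(i, j):
    return H((i,)) + H((j,)) - H((i, j))

I_cap = I(0, 2)                                             # eq. (69)
I_priv_12 = I(0, 1) - I_cap                                 # eq. (70)
I_12_g3 = H((0, 2)) + H((1, 2)) - H((2,)) - H((0, 1, 2))    # I(X1;X2|X3)
I_syn = I_12_g3 - I_priv_12                                 # eq. (72): ~0
print(I_cap, I_priv_12, I_syn)
```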
VI. ENTROPY DECOMPOSITION FOR THE GAUSSIAN CASE
In this section we study the entropy decomposition for the case where $(X_1, X_2, X_3)$ follows a multivariate Gaussian distribution. As the entropy is not affected by translations, we assume, without loss of generality, that all the variables have zero mean. The covariance matrix is denoted by

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \alpha \sigma_1 \sigma_2 & \beta \sigma_1 \sigma_3 \\ \alpha \sigma_1 \sigma_2 & \sigma_2^2 & \gamma \sigma_2 \sigma_3 \\ \beta \sigma_1 \sigma_3 & \gamma \sigma_2 \sigma_3 & \sigma_3^2 \end{pmatrix}, \qquad (76)$$

where $\sigma_i^2$ is the variance of $X_i$, $\alpha$ is the correlation between $X_1$ and $X_2$, $\beta$ is the correlation between $X_1$ and $X_3$, and $\gamma$ is the correlation between $X_2$ and $X_3$. The condition that the matrix $\Sigma$ be positive semi-definite yields the following requirement:

$$1 + 2\alpha\beta\gamma - \alpha^2 - \beta^2 - \gamma^2 \geq 0. \qquad (77)$$

Unfortunately, Theorem 6 implicitly states that Axioms (1)–(5) do not define a unique symmetrical information decomposition for Gaussian variables with an arbitrary covariance matrix. Nevertheless, there are some interesting properties of their shared and synergistic information, which are discussed in Sections VI-A and VI-B. Then, Section VI-C presents one symmetrical information decomposition that is consistent with these properties.

A. Understanding the synergistic information between Gaussians
The simple structure of the joint p.d.f. of multivariate Gaussians, which is fully determined by mere second-order statistics, could lead one to think that these systems have no synergistic information sharing. However, it can be shown that a multivariate Gaussian is the maximum entropy distribution for a given covariance matrix $\Sigma$. Hence, the discussion provided in Section V-A suggests that these distributions can indeed have non-zero information synergy, depending on the structure of the pairwise distributions, or equivalently, on the properties of $\Sigma$.

Moreover, it has been reported that synergistic phenomena are rather common among multivariate Gaussian variables [39]. As a simple example, consider

$$X_1 = A + B, \qquad X_2 = B, \qquad X_3 = A, \qquad (78)$$

where $A$ and $B$ are independent Gaussians. Intuitively, it can be seen that although $X_2$ is useless by itself for predicting $X_3$, it can be used jointly with $X_1$ to remove the noise term $B$ and provide a perfect prediction. To refine this observation, let us consider a more general example where the variables have equal variances and $X_2$ and $X_3$ are independent (i.e. $\gamma = 0$). Then, the optimal predictor of $X_3$ given $X_1$ is $\hat X_3(X_1) = \beta X_1$, the optimal predictor given $X_2$ is $\hat X_3(X_2) = 0$, and the optimal predictor given both $X_1$ and $X_2$ is [50]

$$\hat X_3(X_1, X_2) = \frac{\beta}{1 - \alpha^2}\, (X_1 - \alpha X_2). \qquad (79)$$

Therefore, although $X_2$ is useless for predicting $X_3$ by itself, it can be used to further improve the prediction given by $X_1$. Hence, all the information provided by $X_2$ is synergistic, as it is useful only when combined with the information provided by $X_1$. Note that all these examples fall into the category of systems considered in Section IV.

B. Understanding the shared information
B. Understanding the shared information

Let us start by studying the information shared between two Gaussians. For this, let us consider a pair of zero-mean variables (X₁, X₂) with unit variance and correlation α. A suggestive way of expressing these variables is given by

X₁ = W₁ ± W₀,  X₂ = W₂ ± W₀, (80)

where W₀, W₁ and W₂ are independent centered Gaussian variables with variances s₀² = |α| and s₁² = s₂² = 1 − |α|, respectively. Note that the signs in (80) can be set in order to achieve any desired sign for the covariance (as E{X₁X₂} = ±E{W₀²} = ±s₀²). The mutual information is given by (see Appendix D)

I(X₁;X₂) = −(1/2) log(1 − α²) = −(1/2) log(1 − s₀⁴), (81)

showing that it is directly related to the variance of the common term W₀.

For studying the shared information between three Gaussian variables, let us start by considering a case where σ₁ = σ₂ = σ₃ = 1, α = β := ρ and γ = 0. It can be seen that (c.f. Appendix D)

I(X₁;X₂;X₃) = (1/2) log [ (1 − 2ρ²) / (1 − ρ²)² ]. (82)

A direct evaluation shows that (82) is non-positive for all ρ with |ρ| < 1/√2 (note that |ρ| cannot be larger than 1/√2 because of condition (77)). Therefore, following the discussion related to (21), this system has no shared information for all ρ and has zero synergistic information only for ρ = 0. This is consistent with the fact that X₂ and X₃ are pairwise independent, and hence due to (40) one has that 0 ≤ I_S(X₁;X₂;X₃) = −I(X₁;X₂;X₃).

In contrast, let us now consider a case where α = β = γ := ρ > 0, for which

I(X₁;X₂;X₃) = (1/2) log [ (1 + 2ρ³ − 3ρ²) / (1 − ρ²)³ ]. (83)

A direct evaluation shows that, in contrast to (82), the co-information in this case is non-negative, showing that the system is dominated by shared information for all ρ ≠ 0.

The previous discussion suggests that the shared information depends on the smallest of the correlation coefficients. An interesting approach to understanding this fact can be found in [39], where the predictability among Gaussians is discussed. In that work, the authors note that, from the point of view of X₁, both X₂ and X₃ are able to decompose the target into a predictable and an unpredictable portion: X₁ = X̂₁ + E. In this sense, both predictors achieve the same effect although with a different efficiency, which is determined by their correlation coefficient. As a consequence, the predictor that is less correlated with the target does not provide unique predictability, and hence its contribution is entirely redundant. This motivates the following redundant predictability measure:

R(X₂, X₃ → X₁) := min{ I(X₂;X₁), I(X₃;X₁) }. (84)
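The sign patterns of (82) and (83), as well as the measure (84), can be evaluated directly from the closed forms of Appendix D. A minimal sketch (in bits; the helper names are our own):

```python
import numpy as np

def mi(r):
    """Pairwise Gaussian mutual information of eq. (81), in bits."""
    return -0.5 * np.log2(1 - r * r)

def coinfo(a, b, g):
    """Co-information I(X1;X2;X3) of unit-variance Gaussians, eq. (124)."""
    det = 1 + 2 * a * b * g - a * a - b * b - g * g
    return 0.5 * np.log2(det / ((1 - a * a) * (1 - b * b) * (1 - g * g)))

rho = 0.5
print(coinfo(rho, rho, 0.0))    # eq. (82): negative, synergy dominates
print(coinfo(rho, rho, rho))    # eq. (83): positive, redundancy dominates

# Redundant predictability of eq. (84): the weaker predictor is all-redundant
alpha, beta = 0.8, 0.3
print(min(mi(alpha), mi(beta)))
```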
C. Shared, private and synergistic information for Gaussian variables

Let us use the intuitions developed in the previous section to build a symmetrical information decomposition. For this, we use the decomposition given by the following lemma (whose proof is presented in Appendix E).
Lemma 13:
Let (X₁, X₂, X₃) follow a multivariate Gaussian distribution with zero mean and covariance matrix Σ with α ≥ β ≥ γ ≥ 0. Then

X₁/σ₁ = s₀W₀ + s₁W₁ + s₂W₂ + s₃W₃, (85)
X₂/σ₂ = s₀W₀ + s₁W₁ + s₄W₄, (86)
X₃/σ₃ = s₀W₀ + s₂W₂ + s₅W₅, (87)

where W₀, W₁, W₂, W₃, W₄ and W₅ are independent standard Gaussians and s₀, s₁, s₂, s₃, s₄ and s₅ are given by

s₀ = √γ,  s₁ = √(α − γ),  s₂ = √(β − γ),
s₃ = √(1 − α − β + γ),  s₄ = √(1 − α),  s₅ = √(1 − β). (88)

It is natural to relate s₀ with the shared information, s₁ and s₂ with the private information, and s₃, s₄ and s₅ with the exclusive terms. Note that the decomposition presented in Lemma 13 is unique in not requiring a private component between the two less correlated variables, i.e. a term Wⱼ appearing only in X₂ and X₃. Hence, based on Lemma 13 and (81), we propose the following symmetric information decomposition for Gaussians:

I∩(X₁;X₂;X₃) = −(1/2) log(1 − min{α², β², γ²}), (89)
I_priv(X₁;X₂|X₃) = I(X₁;X₂) − I∩(X₁;X₂;X₃) (90)
= (1/2) log [ (1 − min{α², β², γ²}) / (1 − α²) ], (91)
I_S(X₁;X₂;X₃) = I(X₁;X₂|X₃) − I_priv(X₁;X₂|X₃) (92)
= (1/2) log [ (1 − α²)(1 − β²)(1 − γ²) / ( (1 + 2αβγ − α² − β² − γ²)(1 − min{α², β², γ²}) ) ]. (93)

First, note that the above shared information coincides with what was expected from Lemma 13, as for the general case s₀² = min{|α|, |β|, |γ|}. Also, (91) is consistent with the fact that the two less correlated Gaussians share no private information. Moreover, by comparing (93) and (122), it can be seen that if X₁ and X₂ are the less correlated variables then the synergistic information can be expressed as I_S(X₁;X₂;X₃) = I(X₁;X₂|X₃), which for the particular case of α = 0 confirms (40). This in turn also shows that, for the particular case of Gaussian variables, forming a Markov chain is a necessary and sufficient condition for having zero information synergy. (For the case of α ≥ β ≥ γ, a direct calculation shows that I(X₂;X₃|X₁) = 0 is equivalent to γ = αβ.) Finally, by noting that (89) can also be expressed as

I∩(X₁;X₂;X₃) = min{ I(X₁;X₂), I(X₁;X₃), I(X₂;X₃) }, (94)

it can be seen that our definition of shared information corresponds to the canonical symmetrization of (84) as discussed in Lemma 4. In contrast with (84), (94) states that there cannot be information shared by the three components of the system if two of them are pairwise independent. Therefore, the magnitude of the shared information is governed by the lowest correlation coefficient of the whole system, being upper-bounded by any of the redundant predictability terms.

To close this section, let us note that (94) corresponds to the upper bound provided by (15), which means that multivariate Gaussians have maximal shared information. This is complementary to the fact that, being maximum entropy distributions, they also have the smallest amount of synergy that is compatible with the corresponding second order statistics.
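The decomposition (89)-(93) can be implemented in a few lines; the sketch below (our own code) also checks the min-form (94) and that the synergy vanishes exactly at the Markov point γ = αβ:

```python
import numpy as np

def gaussian_decomposition(a, b, g):
    """Shared, private (1,2) and synergistic terms of eqs. (89)-(93), in bits."""
    m = min(a * a, b * b, g * g)
    det = 1 + 2 * a * b * g - a * a - b * b - g * g   # must be >= 0, eq. (77)
    shared = -0.5 * np.log2(1 - m)                                  # eq. (89)
    priv12 = 0.5 * np.log2((1 - m) / (1 - a * a))                   # eq. (91)
    syn = 0.5 * np.log2((1 - a*a) * (1 - b*b) * (1 - g*g)
                        / (det * (1 - m)))                          # eq. (93)
    return shared, priv12, syn

mi = lambda r: -0.5 * np.log2(1 - r * r)   # pairwise Gaussian MI, eq. (81)

a, b = 0.8, 0.6
sh, pr, syn = gaussian_decomposition(a, b, g=a * b)   # gamma = alpha * beta
assert np.isclose(syn, 0.0)                           # Markov chain: no synergy
assert np.isclose(sh, min(mi(a), mi(b), mi(a * b)))   # eq. (94)
print(sh, pr, syn)
```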
VII. APPLICATIONS TO NETWORK INFORMATION THEORY

In this section we use the framework presented in Section III to analyze four fundamental scenarios in network information theory [51]. Our goal is to illustrate how the framework can be used to build new intuitions about these well-known optimal information-theoretic strategies. The application of the framework to scenarios with open problems is left for future work.

In the following, Section VII-A uses the general framework to analyze the Slepian-Wolf coding for three sources, which is a fundamental result in the literature of distributed source compression. Then, Section VII-B applies the results of Section IV to the multiple access channel, which is one of the fundamental settings in multiuser information theory. Section VII-C applies the results related to Markov chains from Section V to the wiretap channel, which constitutes one of the main models of information-theoretic secrecy. Finally, Section VII-D uses results from Section VI to study fundamental limits of public and private broadcast transmissions over Gaussian channels.
A. Slepian-Wolf coding
The Slepian-Wolf coding gives lower bounds for the data rates that are required to transfer the information contained in various data sources. Let us denote as Rₖ the data rate of the k-th source and define R̃ₖ = Rₖ − H(Xₖ|X_{kᶜ}) as the extra data rate that each source has above its own exclusive information (c.f. Section II-B). Then, in the case of two sources X₁ and X₂, the well-known Slepian-Wolf bounds can be re-written as R̃₁ ≥ 0, R̃₂ ≥ 0, and R̃₁ + R̃₂ ≥ I(X₁;X₂) [51, Section 10.3]. The last inequality states that I(X₁;X₂) corresponds to shared information that can be transmitted by any of the two sources.

Let us consider now the case of three sources, and denote R_S = I_S(X₁;X₂;X₃). The Slepian-Wolf bounds provide seven inequalities [51, Section 10.5], which can be re-written as

R̃ᵢ ≥ 0,  i ∈ {1,2,3}, (95)
R̃ᵢ + R̃ⱼ ≥ I_priv(Xᵢ;Xⱼ|Xₖ) + R_S  for i, j, k ∈ {1,2,3}, i < j, k ≠ i, j, (96)
R̃₁ + R̃₂ + R̃₃ ≥ ΔH(2) + ΔH(3). (97)

Above, (97) states that the DTC needs to be accounted for by the extra rate of the sources, and (96) that every pair needs to take care of its private information. Interestingly, due to (32) the shared information needs to be included in only one of the rates, while the synergistic information needs to be included in at least two. For example, one possible solution that is consistent with these bounds (illustrated numerically in the sketch below) is R̃₁ = I∩(X₁;X₂;X₃) + I_priv(X₁;X₂|X₃) + I_priv(X₁;X₃|X₂) + I_S(X₁;X₂;X₃), R̃₂ = I_priv(X₂;X₃|X₁) + I_S(X₁;X₂;X₃) and R̃₃ = 0.
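As a sanity check of the corner point above, the following sketch evaluates the bounds for the fully synergistic XOR triple of Section IV (our own toy example; all sources then have zero exclusive information and zero private information):

```python
import numpy as np
from itertools import product

def H(q):
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

# X1, X2 i.i.d. fair bits, X3 = X1 XOR X2: pairwise independent sources
p = np.zeros((2, 2, 2))
for x1, x2 in product(range(2), repeat=2):
    p[x1, x2, x1 ^ x2] = 0.25

H123 = H(p)
H12, H13, H23 = H(p.sum(2)), H(p.sum(1)), H(p.sum(0))
I12 = 2 - H12                        # I(X1;X2) = 0 bits
R_S = (H13 + H23 - 1 - H123) - I12   # I(X1;X2|X3) - I(X1;X2) = 1 bit; H(X3) = 1
DTC = H12 + H13 + H23 - 2 * H123     # right-hand side of eq. (97): 2 bits

# Corner point from the text: sources 1 and 2 both carry the synergy
Rt = (R_S, R_S, 0.0)                 # (R~1, R~2, R~3) = (1, 1, 0)
assert all(r >= 0 for r in Rt)                                   # eq. (95)
assert min(Rt[0]+Rt[1], Rt[0]+Rt[2], Rt[1]+Rt[2]) >= R_S         # eq. (96)
assert Rt[0] + Rt[1] + Rt[2] >= DTC                              # eq. (97)
```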
B. Multiple Access Channel

Let us consider a multiple access channel, where two pairwise independent transmitters send X₁ and X₂ and a receiver gets X₃, as shown in Fig. 3. It is well-known that, for a given distribution (X₁,X₂) ∼ p(x₁)p(x₂), the achievable transmission rates R₁ and R₂ satisfy the constraints [51, Section 4.5]

R₁ ≤ I(X₁;X₃|X₂),  R₂ ≤ I(X₂;X₃|X₁),  R₁ + R₂ ≤ I(X₁,X₂;X₃). (98)

As the transmitted random variables are pairwise independent, one can apply the results of Section IV. Therefore, there is no shared information and I_S(X₁;X₂;X₃) = I(X₁;X₃|X₂) − I(X₁;X₃). Let us introduce a shorthand notation for the remaining terms: C₁ = I_priv(X₁;X₃|X₂) = I(X₁;X₃), C₂ = I_priv(X₂;X₃|X₁) = I(X₂;X₃) and C_S = I_S(X₁;X₂;X₃). Then, one can re-write the bounds for the transmission rates as

R₁ ≤ C₁ + C_S,  R₂ ≤ C₂ + C_S  and  R₁ + R₂ ≤ C₁ + C₂ + C_S. (99)

From this, it is clear that while each transmitter has a private portion of the channel with capacity C₁ or C₂, their interaction synergistically creates extra capacity C_S that corresponds to what can actually be shared.
Fig. 3. Capacity region of the Multiple Access Channel, which represents the possible data rates that two transmitters can use for transferring information to one receiver.
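For a concrete instance of the region in (99), consider the noiseless binary adder MAC, X₃ = X₁ + X₂; a short sketch (our own example) recovers the familiar capacity numbers:

```python
import numpy as np
from itertools import product

def H(q):
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

# Noiseless binary adder MAC: X3 = X1 + X2, with X1, X2 i.i.d. fair bits
p = np.zeros((2, 2, 3))
for x1, x2 in product(range(2), repeat=2):
    p[x1, x2, x1 + x2] = 0.25

p12, p13, p23, p3 = p.sum(2), p.sum(1), p.sum(0), p.sum((0, 1))
C1 = 1 + H(p3) - H(p13)                  # I(X1;X3) = I_priv(X1;X3|X2) = 0.5 bit
C2 = 1 + H(p3) - H(p23)                  # I(X2;X3) = 0.5 bit
C_S = (H(p12) + H(p23) - 1 - H(p)) - C1  # I(X1;X3|X2) - I(X1;X3) = 0.5 bit
print(C1, C2, C1 + C2 + C_S)             # sum-rate bound of (99): 1.5 bits
```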
C. Degraded Wiretap Channel
Consider a communication system with an eavesdropper (shown in Fig. 4), where the transmitter sends X₁, the intended receiver gets X₂ and the eavesdropper receives X₃. For simplicity of the exposition, let us consider the case where the eavesdropper gets only a degraded copy of the signal received by the intended receiver, i.e. that X₁ − X₂ − X₃ form a Markov chain. Using the results of Section V-B, one can see that in this case there is no synergistic but only shared and private information between X₁, X₂ and X₃.
Fig. 4. The rate of secure information transfer, C_sec, is the portion of the mutual information that can be used while providing perfect confidentiality with respect to the eavesdropper.

In this scenario, it is known that for a given input distribution p_{X₁} the rate of secure communication that can be achieved is upper bounded by [42, Section 3.4]

C_sec = I(X₁;X₂) − I(X₁;X₃) = I_priv(X₁;X₂|X₃), (100)

which is precisely the private information sharing between X₁ and X₂. Also, as intuition would suggest, the eavesdropping capacity is equal to the shared information between the three variables:

C_eav = I(X₁;X₂) − C_sec = I(X₁;X₃) = I∩(X₁;X₂;X₃). (101)
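A binary instance of (100)-(101): a uniform input over a cascade of two binary symmetric channels (a degraded wiretap setup of our own choosing):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

p, q = 0.1, 0.2                 # X1 -BSC(p)-> X2 -BSC(q)-> X3: Markov chain
pq = p * (1 - q) + (1 - p) * q  # end-to-end flip probability of the cascade

I12 = 1 - h2(p)                 # I(X1;X2) for a uniform input
I13 = 1 - h2(pq)                # I(X1;X3) = shared information, eq. (101)
C_sec = I12 - I13               # eq. (100): the private part of I(X1;X2)
print(C_sec, I13)
```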
D. Gaussian Broadcast Channel

Let us consider a Gaussian broadcast channel, where a transmitter sends a Gaussian signal X₁ that is received as X₂ and X₃ by two receivers. Assuming that all these variables are jointly Gaussian with zero mean and covariance matrix as given by (76), the transmitter can broadcast a public message, intended for both users, at a maximum rate C_pub given by [42, Section 5.1]

C_pub = min{ I(X₁;X₂), I(X₁;X₃) } = R(X₂, X₃ → X₁), (102)

where the redundant predictability R(X₂, X₃ → X₁) between Gaussian variables is as defined in (84). On the other hand, if the transmitter wants to send a private (confidential) message to receiver 1, the corresponding maximum rate C_priv that can be achieved in this case is given by

C_priv = [ I(X₁;X₂) − I(X₁;X₃) ]⁺ = I(X₁;X₂) − R(X₂, X₃ → X₁) = U(X₂ → X₁ | X₃), (103)

where the last equality follows from Axiom (2).

Interestingly, the predictability measures prove to be better suited than their symmetrical counterparts to describe the communication limits in the above scenario. In effect, using the shared information would have underestimated the public capacity (c.f. Section VI-C). This opens the question of whether directed measures could be better suited than their symmetrized counterparts for studying certain communication systems. Even though a definite answer to this question might not be straightforward, we hope that future research will provide more evidence and a better understanding of this issue.
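A numerical sketch of (102)-(103) follows (the correlation values are arbitrary choices of ours, with receiver 2 the weaker one):

```python
import numpy as np

mi = lambda r: -0.5 * np.log2(1 - r * r)   # pairwise Gaussian MI, eq. (81)

alpha, beta = 0.9, 0.7    # corr(X1,X2) and corr(X1,X3); receiver 2 is weaker
C_pub = min(mi(alpha), mi(beta))          # eq. (102): redundant predictability
C_priv = max(mi(alpha) - mi(beta), 0.0)   # eq. (103): unique predictability
print(C_pub, C_priv)
# The symmetric measure (94) would also involve corr(X2,X3) and can only be
# smaller, which is why it underestimates the public capacity.
```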
VIII. CONCLUSIONS

In this work we propose an axiomatic framework for studying the interdependencies that can exist between multiple random variables as different modes of information sharing. The framework is based on a symmetric notion of information that refers to properties of the system as a whole. We showed that, in contrast to predictability-based decompositions, all the information terms of the proposed decomposition have unique expressions for Markov chains and for the case where two variables are pairwise independent. We also analyzed the cases of pairwise maximum entropy (PME) distributions and multivariate Gaussian variables. Finally, we illustrated the application of the framework by using it to develop a more intuitive understanding of the optimal information-theoretic strategies in several fundamental communication scenarios.

The key insight that this framework provides is that although there is only one way in which information can be shared between two random variables, there are two essentially different ways of sharing between three. One of these ways is a simple extension of the pairwise dependency, where information is shared redundantly and hence any of the variables can be used to predict any other. The second way leads to the counter-intuitive notion of synergistic information sharing, where the information is shared in such a way that the statistical dependency is destroyed if any of the variables is removed; hence, the structure exists in the whole but not in any of the parts. Information synergy has therefore been commonly related to statistical structures that exist only in the joint p.d.f. and not in low-order marginals. Interestingly, although we showed that PME distributions indeed possess the minimal information synergy that is allowed by their pairwise marginals, this minimum can be strictly positive.

Therefore, there exists a connection between pairwise marginals and synergistic information sharing that is still to be further clarified. In fact, this phenomenon is related to the difference between the TC and the DTC, which is rooted in the fact that the information sharing modes and the marginal structure of the p.d.f. are, although somehow related, intrinsically different. This important distinction has been represented in our framework by the sequence of internal and external entropies. This new unifying picture for the entropy, negentropy, TC and DTC has shed new light on the understanding of high-order interdependencies, whose consequences have only begun to be explored.

APPENDIX A
PROOF OF LEMMA 3

Proof:
Let us assume that R(X₁, X₂ → Y) and U(Xᵢ → Y | Xⱼ) = I(Xᵢ;Y) − R(X₁, X₂ → Y) satisfy Axioms (1)–(3). Then,

I(X₁;Y) ≥ I(X₁;Y) − U(X₁ → Y | X₂) (104)
= R(X₁, X₂ → Y) (105)
= I(X₂;Y) − U(X₂ → Y | X₁) ≤ I(X₂;Y), (106)

where the inequalities are a consequence of the non-negativity of U(Xᵢ → Y | Xⱼ), and the equality in (106) is due to the weak symmetry of the redundant predictability. For proving the lower bound, first notice that Axiom (2) can be re-written as

I(X₁, X₂; Y) ≥ I(X₁;Y) + I(X₂;Y) − R(X₁, X₂ → Y). (107)

The lower bound then follows from the non-negativity of R(X₁, X₂ → Y) and by noting that I(X₁;Y) + I(X₂;Y) − I(X₁, X₂; Y) = I(X₁;X₂;Y).

The proof of the converse is direct, and left as an exercise to the reader.

APPENDIX B
PROOF OF THE CONSISTENCY OF AXIOM (3)

Let us show that min{ I(X₁;X₂), I(X₁;X₃) } ≥ I(X₁;X₂;X₃), so that the bounds defined by Axiom (3) can always be satisfied. For this, let us assume that the variables are ordered in a way such that I(X₂;X₃) = min{ I(X₁;X₂), I(X₁;X₃), I(X₂;X₃) } holds. Then, as one can express I(X₁;X₂;X₃) = I(X₂;X₃) − I(X₂;X₃|X₁), it is direct to show that

min{ I(X₁;X₂), I(X₁;X₃) } − I(X₁;X₂;X₃) ≥ I(X₂;X₃) − I(X₁;X₂;X₃) (108)
= I(X₂;X₃|X₁) (109)
≥ 0, (110)

from where the desired result follows.

APPENDIX C
PROOF OF LEMMA 4

Proof:
The symmetry of I∩(X₁;X₂;X₃) can be directly verified from its definition. The weak symmetry of I_priv(X₁;X₂|X₃) can be shown as follows:

I_priv(X₁;X₂|X₃) = I(X₁;X₂) − I∩(X₁;X₂;X₃) (111)
= I(X₂;X₁) − I∩(X₂;X₁;X₃) (112)
= I_priv(X₂;X₁|X₃). (113)

The symmetry of I_S(X₁;X₂;X₃) with respect to X₁ and X₂ follows directly from its definition, the weak symmetry of I(X₁;X₂|X₃) and the strong symmetry of I∩(X₁;X₂;X₃). The symmetry with respect to X₂ and X₃ can be shown using the definition of I_S(X₁;X₂;X₃) and the strong symmetry of I∩(X₁;X₂;X₃) and the co-information I(X₁;X₂;X₃), as follows:

I_S(X₁;X₂;X₃) = I(X₁;X₂|X₃) − [ I(X₁;X₂) − I∩(X₁;X₂;X₃) ] (114)
= −I(X₁;X₂;X₃) + I∩(X₁;X₂;X₃) (115)
= I(X₁;X₃|X₂) − I(X₁;X₃) + I∩(X₁;X₂;X₃) (116)
= I_S(X₁;X₃;X₂). (117)

The bounds for I∩(X₁;X₂;X₃), I_priv(X₁;X₂|X₃) and I_S(X₁;X₂;X₃) follow directly from the definitions of these quantities and Axiom (3). Finally, d) is proven directly using those definitions, together with the fact that the mutual information depends only on the pairwise marginals, while the conditional mutual information depends on the full p.d.f.

APPENDIX D
USEFUL FACTS ABOUT GAUSSIANS
Here we list some useful expressions for Gaussian variables:

I(X₁;X₂) = (1/2) log [ 1 / (1 − α²) ] (118)
= (1/2) log [ σ₁²σ₂² / |Σ₁₂| ], (119)

I(X₁;X₂,X₃) = (1/2) log [ (1 − γ²) / (1 + 2αβγ − α² − β² − γ²) ] (120)
= (1/2) log [ σ₁²|Σ₂₃| / |Σ| ], (121)

I(X₁;X₂|X₃) = (1/2) log [ (1 − β²)(1 − γ²) / (1 + 2αβγ − α² − β² − γ²) ] (122)
= (1/2) log [ |Σ₁₃||Σ₂₃| / (σ₃²|Σ|) ], (123)

I(X₁;X₂;X₃) = (1/2) log [ (1 + 2αβγ − α² − β² − γ²) / ( (1 − α²)(1 − β²)(1 − γ²) ) ] (124)
= (1/2) log [ σ₁²σ₂²σ₃²|Σ| / ( |Σ₁₂||Σ₁₃||Σ₂₃| ) ], (125)

where |·| denotes the matrix determinant, and

Σ₁₂ = [ σ₁²    ασ₁σ₂
        ασ₁σ₂  σ₂²  ],  Σ₁₃ = [ σ₁²    βσ₁σ₃
                                βσ₁σ₃  σ₃²  ],  Σ₂₃ = [ σ₂²    γσ₂σ₃
                                                        γσ₂σ₃  σ₃²  ]. (126)
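The equivalence between the correlation-based and determinant-based forms above is easy to confirm numerically (our own sketch, with unit variances):

```python
import numpy as np

a, b, g = 0.5, 0.3, 0.2
S = np.array([[1.0, a, b],
              [a, 1.0, g],
              [b, g, 1.0]])          # unit-variance version of eq. (76)
det = np.linalg.det
S12 = S[np.ix_([0, 1], [0, 1])]
S13 = S[np.ix_([0, 2], [0, 2])]
S23 = S[np.ix_([1, 2], [1, 2])]
D = 1 + 2*a*b*g - a*a - b*b - g*g

# I(X1;X2|X3): eqs. (122) and (123)
assert np.isclose(0.5 * np.log((1 - b*b) * (1 - g*g) / D),
                  0.5 * np.log(det(S13) * det(S23) / det(S)))
# Co-information: eqs. (124) and (125)
assert np.isclose(0.5 * np.log(D / ((1 - a*a) * (1 - b*b) * (1 - g*g))),
                  0.5 * np.log(det(S) / (det(S12) * det(S13) * det(S23))))
```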
APPENDIX E
PROOF OF LEMMA 13

Proof: Consider the following random variables:

Y₁ = σ₁( s₀W₀ + s₁W₁ + s₂W₂ + s₃W₃ ), (127)
Y₂ = σ₂( s₀W₀ + s₁W₁ + s₄W₄ ), (128)
Y₃ = σ₃( s₀W₀ + s₂W₂ + s₅W₅ ), (129)

where W₀, W₁, W₂, W₃, W₄ and W₅ are independent standard Gaussians and the parameters s₀, s₁, s₂, s₃, s₄ and s₅ are as defined in (88). Then, it is direct to check that Y = (Y₁, Y₂, Y₃) is a multivariate Gaussian variable with zero mean and covariance matrix Σ_Y equal to (76). Therefore, (Y₁, Y₂, Y₃) and (X₁, X₂, X₃) have the same statistics, which proves the desired result.
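The construction can be checked mechanically: writing (127)-(129) as a mixing matrix over (W₀, ..., W₅) must reproduce (76). A minimal sketch (unit variances; the triple must satisfy α ≥ β ≥ γ ≥ 0 and, for s₃ to be real, 1 − α − β + γ ≥ 0):

```python
import numpy as np

a, b, g = 0.7, 0.5, 0.3
s = np.sqrt([g, a - g, b - g, 1 - a - b + g, 1 - a, 1 - b])   # eq. (88)

# Rows give X1, X2, X3 as mixtures of the independent W0, ..., W5
M = np.array([[s[0], s[1], s[2], s[3], 0.0,  0.0 ],
              [s[0], s[1], 0.0,  0.0,  s[4], 0.0 ],
              [s[0], 0.0,  s[2], 0.0,  0.0,  s[5]]])
Sigma = M @ M.T               # covariance of (X1, X2, X3), since Cov(W) = I
target = np.array([[1.0, a, b],
                   [a, 1.0, g],
                   [b, g, 1.0]])
assert np.allclose(Sigma, target)
```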
ACKNOWLEDGMENTS

We want to thank David Krakauer and Jessica Flack for providing the inspiration for this research. We also thank Bryan Daniels, Michael Gastpar, Bernhard Geiger, Virgil Griffith and Martin Ugarte for helpful discussions. This work was partially supported by a grant to the Santa Fe Institute for the study of complexity and by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract number W911NF-13-1-0340. FR would also like to acknowledge the support of the F+ fellowship from KU Leuven and the SBO project SINS, funded by the Agency for Innovation by Science and Technology IWT, Belgium.

REFERENCES
[1] K. Kaneko, Life: An Introduction to Complex Systems Biology. Springer, 2006.
[2] C. Perrings, Economy and Environment: A Theoretical Essay on the Interdependence of Economic and Environmental Systems. Cambridge University Press, 2005.
[3] L. Martignon, G. Deco, K. Laskey, M. Diamond, W. Freiwald, and E. Vaadia, "Neural coding: higher-order temporal patterns in the neurostatistics of cell assemblies," Neural Computation, vol. 12, no. 11, pp. 2621–2653, 2000.
[4] D. Deutscher, I. Meilijson, S. Schuster, and E. Ruppin, "Can single knockouts accurately single out gene functions?" BMC Systems Biology, vol. 2, no. 1, p. 50, 2008.
[5] K. Anand and G. Bianconi, "Entropy measures for networks: Toward an information theory of complex topologies," Physical Review E, vol. 80, no. 4, p. 045102, 2009.
[6] M. Gastpar, M. Vetterli, and P. L. Dragotti, "Sensing reality and communicating bits: A dangerous liaison," IEEE Signal Processing Magazine, vol. 23, pp. 70–83, 2006.
[7] G. Casella and R. L. Berger, Statistical Inference. Duxbury Press, 2002.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley, 1991.
[9] P. M. Senge, B. Smith, N. Kruschwitz, J. Laur, and S. Schley, The Necessary Revolution: How Individuals and Organizations Are Working Together to Create a Sustainable World. Crown Business, 2008.
[10] S.-I. Amari, "Information geometry on hierarchy of probability distributions," IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1701–1711, 2001.
[11] P. L. Williams and R. D. Beer, "Nonnegative decomposition of multivariate information," arXiv preprint arXiv:1004.2515, 2010.
[12] W. Li, "Mutual information functions versus correlation functions," Journal of Statistical Physics, vol. 60, no. 5-6, pp. 823–837, 1990.
[13] L. Brillouin, "The negentropy principle of information," Journal of Applied Physics, vol. 24, no. 9, pp. 1152–1163, 1953.
[14] S. Watanabe, "Information theoretical analysis of multivariate correlation," IBM Journal of Research and Development, vol. 4, no. 1, pp. 66–82, 1960.
[15] M. Studený and J. Vejnarová, "The multiinformation function as a tool for measuring stochastic dependence," in Learning in Graphical Models. Springer, 1998, pp. 261–297.
[16] E. Schneidman, S. Still, M. J. Berry, W. Bialek et al., "Network information and connected correlations," Physical Review Letters, vol. 91, no. 23, p. 238701, 2003.
[17] E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[18] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek, "Weak pairwise correlations imply strongly correlated network states in a neural population," Nature, vol. 440, no. 7087, pp. 1007–1012, 2006.
[19] Y. Roudi, S. Nirenberg, and P. E. Latham, "Pairwise maximum entropy models for studying large biological systems: When they can work and when they can't," PLoS Computational Biology, vol. 5, no. 5, p. e1000380, 2009. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pcbi.1000380
[20] W. Bialek, A. Cavagna, I. Giardina, T. Mora, E. Silvestri, M. Viale, and A. M. Walczak, "Statistical mechanics for natural flocks of birds," Proceedings of the National Academy of Sciences, vol. 109, no. 13, pp. 4786–4791, 2012.
[21] L. Merchan and I. Nemenman, "On the sufficiency of pairwise interactions in maximum entropy models of biological networks," arXiv preprint arXiv:1505.02831, 2015.
[22] B. C. Daniels, D. C. Krakauer, and J. C. Flack, "Sparse code of conflict in a primate society," Proceedings of the National Academy of Sciences.
[23] Journal of Statistical Physics, pp. 1–27, 2013.
[24] T. S. Han, "Nonnegative entropy measures of multivariate symmetric correlations," Information and Control, vol. 36, pp. 133–156, 1978.
[25] E. Olbrich, N. Bertschinger, N. Ay, and J. Jost, "How should complexity scale with system size?" The European Physical Journal B, vol. 63, no. 3, pp. 407–415, 2008.
[26] J. P. Crutchfield and D. P. Feldman, "Regularities unseen, randomness observed: Levels of entropy convergence," Chaos, vol. 13, no. 1, pp. 25–54, 2003. [Online]. Available: http://scitation.aip.org/content/aip/journal/chaos/13/1/10.1063/1.1530990
[27] F. Rosas, V. Ntranos, C. J. Ellison, M. Verhelst, and S. Pollin, "Understanding high-order correlations using a synergy-based decomposition of the total entropy," in Proceedings of the 5th Joint WIC/IEEE Symposium on Information Theory and Signal Processing in the Benelux, April 2015, pp. 146–153.
[28] R. W. Yeung, "A new outlook on Shannon's information measures," IEEE Transactions on Information Theory, vol. 37, no. 3, pp. 466–474, 1991.
[29] Y. Bar-Yam, "Multiscale complexity/entropy," Advances in Complex Systems, vol. 7, no. 1, pp. 47–63, 2004.
[30] A. J. Bell, "The co-information lattice," in Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation (ICA 2003), 2003.
[31] W. J. McGill, "Multivariate information transmission," Psychometrika, vol. 19, no. 2, pp. 97–116, 1954.
[32] R. G. James, C. J. Ellison, and J. P. Crutchfield, "Anatomy of a bit: Information in a time series observation," Chaos, vol. 21, no. 3, 2011. [Online]. Available: http://scitation.aip.org/content/aip/journal/chaos/21/3/10.1063/1.3637494
[33] V. Griffith and C. Koch, "Quantifying synergistic mutual information," in Guided Self-Organization: Inception, ser. Emergence, Complexity and Computation, M. Prokopenko, Ed. Springer Berlin Heidelberg, 2014, vol. 9, pp. 159–190.
[34] V. Griffith, "Quantifying synergistic information," Ph.D. dissertation, California Institute of Technology, 2014.
[35] V. Griffith, E. K. Chong, R. G. James, C. J. Ellison, and J. P. Crutchfield, "Intersection information based on common randomness," Entropy, vol. 16, no. 4, pp. 1985–2000, 2014.
[36] N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay, "Quantifying unique information," Entropy, vol. 16, no. 4, pp. 2161–2183, 2014.
[37] E. Olbrich, N. Bertschinger, and J. Rauh, "Information decomposition and synergy," Entropy, vol. 17, no. 5, pp. 3501–3517, 2015.
[38] M. Harder, C. Salge, and D. Polani, "Bivariate measure of redundant information," Physical Review E, vol. 87, no. 1, p. 012130, 2013.
[39] A. B. Barrett, "Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems," Physical Review E, vol. 91, no. 5, p. 052802, 2015.
[40] J. Berkson, "Limitations of the application of fourfold table analysis to hospital data," Biometrics Bulletin, vol. 2, pp. 47–53, 1946.
[41] J. Kim and J. Pearl, "A computational model for causal and diagnostic reasoning in inference systems," in Proceedings IJCAI-83 (Karlsruhe, Germany). San Mateo, CA: Morgan Kaufmann, 1983, pp. 190–193.
[42] M. Bloch and J. Barros, Physical-Layer Security: From Information Theory to Security Engineering. Cambridge University Press, 2011.
[43] C. E. Shannon, "Communication theory of secrecy systems," Bell System Technical Journal, vol. 28, no. 4, pp. 656–715, 1949.
[44] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, "Network information flow," IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1204–1216, 2000.
[45] S.-Y. R. Li, R. W. Yeung, and N. Cai, "Linear network coding," IEEE Transactions on Information Theory, vol. 49, no. 2, pp. 371–381, 2003.
[46] S. Katti, H. Rahul, W. Hu, D. Katabi, M. Médard, and J. Crowcroft, "XORs in the air: practical wireless network coding," IEEE/ACM Transactions on Networking, vol. 16, no. 3, pp. 497–510, 2008.
[47] L. Landau and E. Lifshitz, Statistical Physics, vol. 5, 2nd ed. Pergamon Press, 1970.
[48] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1–305, 2008.
[49] B. A. Cipra, "An introduction to the Ising model," American Mathematical Monthly, vol. 94, no. 10, pp. 937–959, 1987.
[50] A. H. Sayed, Adaptive Filters. John Wiley & Sons, 2011.
[51] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.