Topological Information Data Analysis
Pierre Baudot, Monica Tapia, Daniel Bennequin, Jean-Marc Goaillard

Median Technologies, Sophia Antipolis, France; Inserm UNIS UMR 1072, Université Aix-Marseille AMU, Marseille, France; Université Paris Diderot, Institut Mathématique de Jussieu, Paris, France
[email protected]
July 2019
Abstract
This paper presents methods that quantify the structure of statistical interactions within a given data set, first used in [58]. It establishes new results on the $k$-multivariate mutual informations ($I_k$), inspired by the topological formulation of information introduced in [4, 63]. In particular, we show that the vanishing of all $I_k$ for $2 \le k \le n$ of $n$ random variables is equivalent to their statistical independence. Pursuing the work of Hu Kuo Ting and Te Sun Han [23, 21, 22], we show that information functions provide coordinates for binary variables, and that they are analytically independent on the probability simplex for any set of finite variables. The maximal positive $I_k$ identifies the variables that co-vary the most in the population, whereas the minimal negative $I_k$ identifies synergistic clusters and the variables that most differentiate and segregate the population. Finite data size effects and estimation biases severely constrain the effective computation of the information topology on data, and we provide simple statistical tests for the undersampling bias and the $k$-dependences, following [43]. We give an example of application of these methods to genetic expression and unsupervised cell-type classification. The methods unravel biologically relevant subtypes, with a sample size of 41 genes and with few errors. This establishes generic basic methods to quantify epigenetic information storage and a unified epigenetic unsupervised-learning formalism. We propose that higher-order statistical interactions and non-identically distributed variables are constitutive characteristics of biological systems that should be estimated in order to unravel their significant statistical structure and diversity. The topological information data analysis presented here allows one to precisely estimate this higher-order structure characteristic of biological systems.

"When you use the word information, you should rather use the word form." René Thom
This note presents a method of statistical analysis of a set of collected characters in a population, describing a kind of topology of the distribution of information in the data. New theoretical results are developed to justify the method. The data that concern us are represented by certain (observed or computed) parameters $s_1, \ldots, s_n$ belonging to certain finite sets $E_1, \ldots, E_n$ of respective cardinalities $N_1, \ldots, N_n$, which depend on an element $z$ of a certain set $Z$, representing the tested population, of cardinality $N_Z$. In other terms, we are looking at $n$ "experimental" functions $X_i : Z \to E_i$, $i = 1, \ldots, n$; we will then refer to the data by the letters $(Z, X)$, where $X$ is the product function of the $X_i$, going from $Z$ to the product $E$ of all the sets $E_i$, $i = 1, \ldots, n$.
For instance, as in [..], each $E_i$, $i = 1, \ldots, n$, has cardinality 8 and is identified with the subset of integers $[8] = \{1, \ldots, 8\}$; each $s_i = X_i(z)$, $i = 1, \ldots, n$, measures the level of expression of a gene $g_i$ in a neuron $z$ belonging to a set $Z$ of classified dopaminergic neurons (DA); another system to be compared to this one is made of the analogous measurements for neurons $z'$ belonging to a set $Z'$ of classified non-dopaminergic neurons (NDA). To be precise, in this example, $n = 41$, $N_Z = 111$, $N_{Z'} = 37$.
The most usual hypothesis for interpreting the data is the existence of an objective and immutable joint probability $P_X$ on the set $E$, coming from a hypothetical set $\Omega$ enclosing $Z$ and governing the results of the sample $(Z, X)$. Then, without any other knowledge, assuming that the different experimental $z$ are independently chosen, the strong law of large numbers tells us that the best possible approximation of $P_X(s)$ is given by the number of $z$ such that $s = X(z)$ divided by the cardinality of $Z$. However, the standard inequalities of probability theory (Bienaymé-Tchebychev, Markov, Chernoff, Kolmogorov, ...) show that the confidence we can have in this approximation depends heavily on $P_X$ itself. This generates a risk of circularity.
Several approaches can be followed to escape circularity; some of them, maintaining the frequentist point of view, use the Fisher information metric (cf. [34]); others, using a Bayesian approach, put probability laws on the set of probabilities themselves, and study the evolution of these choices with the introduction of new data. Once the total joint probability is found, it is theoretically possible to verify its agreement with the marginal laws of the $X_i$, $i = 1, \ldots, n$, or of any partial joint variable $X_I = (X_{i_1}, \ldots, X_{i_k})$ for $I = \{i_1, \ldots, i_k\} \subset [n]$, but practically it is not so easy, this being one of the principal problems in Statistical Mechanics or in Bayesian Analysis.
We will follow here a different approach, which consists in describing the manner in which the variables $X_i$, $i = 1, \ldots, n$, distribute the information on $(Z, X)$. The sampled population $Z$ has its own characteristics that the data explore, and the frequency of every value $s_I$ of each one of the variables $X_I$, $I \subset [n]$, is a piece of information important by itself, without considering a hypothetical law on the whole set $E$. The information quantities, derived from the Shannon entropy, offer a natural way of describing all these frequencies. In fact they define the form of the distribution of information contained in the raw data.
For instance, the individual entropies $H(X_i)$, $i = 1, \ldots, n$, tell us about the shape of the individual variables: if $H(X_i)$ is small (with respect to its capacity $\log N_i$), then $X_i$ corresponds to a well-defined characteristic of $Z$; on the contrary, if $H(X_j)$ is close to the capacity, that is the value of the entropy of the uniform distribution, the function $X_j$ corresponds to a non-trivial partition of $Z$, and does not correspond to a well-defined invariant. At the second degree, we can consider the entropies $H(X_i, X_j)$ for every pair $(i, j)$, giving the same kind of structures as before, but for pairs of variables. To get a better description of this second degree with respect to the first one, we can look at the mutual information as defined by Shannon, $I(X_i; X_j) = H(X_i) + H(X_j) - H(X_i, X_j)$. If it is small, near zero, the variables are not far from being independent; if it is maximal, i.e. not far from the minimum of $H(X_i)$ and $H(X_j)$, this means that one of the variables is almost determined by the other. In fact $I(X_i; X_j)$ can be taken as a measure of dependence, due to its universality and its invariance. Consider the graph with vertices $X_i$, $i = 1, \ldots, n$, and edges $(X_i, X_j)$, $i \neq j$, $i, j = 1, \ldots, n$; by labeling each vertex with the entropy and each edge with the mutual information, we get a sort of one-dimensional skeleton of the data $(Z, X)$. The informations of higher degrees define in an analogous manner the higher dimensional skeletons of the data (
Z, X ) (see figure 3 for example).Works of Clausius, Boltzmann, Gibbs and Helmholtz underlined the importance of entropy and free energyin Statistical Physics. In particular, Gibbs gave the general definition of the entropy for the distribution ofmicrostates, cf. [18]. Later Shannon recognized in this entropy the basis of Information theory in his celebratedwork on the mathematical theory of communication [52] (equation 11), and then further developed their structurein the lattice of variables [51]. Note that this kind of lattice takes its roots in the work of Boole on Logic andProbability [8]. Defining the communication channel, information transmission and its capacity, Shannon also2ntroduced to degree two (pairwise) mutual information functions [52].The expression and study of multivariate higher degree mutual-informations (equation 12) was achieved intwo seemingly independent works: 1) McGill (1954) [38] (see also Fano (1961) [16]) with a statistical approach,who called these functions ”interaction information”, and 2) Hu Kuo Ting (1962) [23] with an algebraic approachwho also first proved the possible negativity of mutual-informations for degrees higher than 2. The study of thesefunctions was then pursued by Te Sun Han [21, 22].Higher-order mutual-informations were then rediscovered in several different contexts, notably by Matsudain 2001 in the context of spin glasses, who showed that negativity is the signature of frustrated states [36] andby Bell in the context of Neuroscience, Dependent Component Analysis and Generalised Belief Propagation onhypergraphs [6]. Brenner and colleagues have observed and quantified an equivalent definition of negativity ofthe 3 variables mutual information, noted I , in the spiking activity of neurons and called it synergy [11]. Anas-tassiou and colleagues unraveled I negativity within gene expression, corresponding in that case to cooperativityin gene regulation [66, 30].Another important family of information functions, named ”total correlation”, which corresponds to the dif-ference between the sum of the entropies and the entropy of the joint, was introduced by Watanabe in 1960[65]. These functions were also rediscovered several times, notably by Tononi and Edelman who called them”integrated information” [61] in the context of consciousness quantification, and by Studen´y and Vejnarova [56]who called them ”multi-information” in the context of graphs and conditional independences. Closely related towhat we present here with respect to both the kind of data analyzed and the conclusions, Margolin and colleagues[35] used these functions of Watanabe to quantify higher order statistical dependences within genetic expression.In our approach, for any data ( Z, S ), the full picture can be represented by a collection of numerical functionson the faces of a simplex ∆([ n ]) having vertices corresponding to the random variables X , ..., X n . We decidedto focus on two subfamilies of Information functions: the first is the collection of entropies of the joint variables,denoted H k , k = 1 , ..., n , giving the numbers H k ( X i ; ... ; X i k ), and the degree k information of the joint variables,denoted I k , k = 1 , ..., n , and giving the numbers I k ( X i ; ... ; X i k ); see the following section for their definition andtheir elementary properties. 
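To make the one-dimensional skeleton described above concrete, here is a minimal sketch (not the authors' code) that estimates the vertex labels $H(X_i)$ and the edge labels $I(X_i; X_j)$ from a discrete data matrix whose rows are the individuals $z \in Z$ and whose columns are the variables $X_i$; the matrix `data`, the graining into 8 levels and the function names are illustrative assumptions.

```python
import itertools
import numpy as np

def entropy_of_columns(data, cols):
    """Empirical Shannon entropy (bits) of the joint variable indexed by cols,
    estimated from the frequencies of its observed configurations."""
    _, counts = np.unique(data[:, list(cols)], axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def pairwise_skeleton(data):
    """Vertices labeled by H(X_i), edges by I(X_i; X_j) = H_i + H_j - H_ij."""
    n = data.shape[1]
    H = {i: entropy_of_columns(data, [i]) for i in range(n)}
    I = {}
    for i, j in itertools.combinations(range(n), 2):
        I[(i, j)] = H[i] + H[j] - entropy_of_columns(data, [i, j])
    return H, I

# Illustrative data: 100 individuals, 5 variables grained into 8 levels each.
rng = np.random.default_rng(0)
data = rng.integers(0, 8, size=(100, 5))
data[:, 1] = data[:, 0]      # make X_1 a copy of X_0, so I(X_0;X_1) = H(X_0)
H, I = pairwise_skeleton(data)
# I[(0, 2)] is near 0 but slightly positive: a finite-size bias of the kind
# discussed later in the paper (undersampling).
print(H[0], I[(0, 1)], I[(0, 2)])
```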
In particular, the value on each face of a given dimension of these functions givesinteresting curves (histograms, see section on statistics 2.2) for testing the departure from independence, andtheir means over all dimensions for testing the departure from uniformity of the variables. These functions areinformation co-chains of degree k (in the sense of ref [4]) and have nice probabilistic interpretations. By varyingin all possible manners the ordering of the variables, i.e. by applying all the permutations σ of [ n ] = { , ..., n } , weobtain n ! paths H k ( σ ), I k ( σ ), k = 1 , ..., n . They constitute respectively the H k -landscape and the I k -landscape of the data.When the data correspond to uniform and independent variables, that is the uninteresting null hypothesis,each path is monotonic, the H k growing linearly and the I k being equal to zero for k between 2 and n . Anydeparture from this behavior (estimated for instance in Bayesian probability on the allowed parameters) gives ahint of the form of information in the particular data.Especially interesting are the maximal paths, where I k ( σ ) decreases, being strictly positive, or strictly negativeafter k = 3. Other kinds of paths could also be interesting, for instance the paths with the maximal total variationas they can be oscillatory. In the examples provided here and in [58], we proposed to stop the empirical explo-ration of the information paths to their first minima, a condition of vanishing of conditional mutual-informational(conditional independence).As a preliminary illustration of the potential interest of such functions for general Topological Data Analy-sis, we quantify the information structures for the empirical measures of the expression of several genes in twopre-identified populations of cells presented in [58], and we consider here both cases where genes or cells areconsidered as variables for gene or cell unsupervised classification tasks respectively.In practice, the cardinality N Z of Z is rather small with respect to the number of free parameters of the possibleprobability laws on E , that is N − N ...N n −
1, then the quantities H k , I k for k larger than a certain k u have ingeneral no meaning, a phenomenon commonly called undersampling or curse of dimensionality. In the example, n is 20, but k u is 11. Moreover, the permutations σ of the variables values can be applied to test the estimation ofthe dependences quantified by the I k against the null hypothesis of randomly generated statistical dependences.In this approach describing the raw data for themselves, undersampling is not a serious limitation. However, itis better to test the stability of the shape of the landscapes by studying random subsets of Z . Moreover, theanalytic properties of H k and I k considered as functions of P in a given face of the simplex of probabilities ∆([ n ])ensure that, if P X tends to P in this face, the shape is preserved.The originality of our method is the systematic consideration of the entropy and the information landscapesand paths that can be associated to all possible permutations of the basic variables, and the extraction of excep-tional paths from them, in order to define the overall form of the distribution of information among the set ofvariables. This new perspective has its origin in the local (topos) homological theory introduced in [4]. Moreover3his method was successfully applied to a concrete problem of gene expression in [5, 58].In the present article, we first remind the definitions and basic properties of the entropy and informationchains and functions. We give equivalent formulations of the Hu Kuo Ting theorem [23], which allows to expressevery partial mutual conditioned higher information of collections of joint variables from elementary higher en-tropies H k ( X I ) or by elementary higher mutual information functions I k ( X I ), i.e. the functions that form theentropy landscape and information landscape, respectively.Second we establish that these ”pure” functions are analytically independent as functions of the probabilitylaws, in the interior of the large simplex ∆([ n ]). This follows from the fact we also prove here, that these functionsconstitute coordinates (up to a finite ambiguity) on ∆([ n ]) in the special case of binary variables X i , i = 1 , ..., n .In addition, we demonstrate that, for every set of numbers N i , i = 1 , ..., n , the cancellation of the functions I k ( X I ) , k ≥ , I ⊂ [ n ] = { , ..., n } is a necessary and sufficient condition of the set of variables X , ..., X n to bestatistically independent. We were not able to find these results in the literature. They generalize results of TeSun Han [21, 22].Then this article not only presents a method of analysis but it gives proofs of basic results on informationquantities that, to our knowledge, were not available until now in the literature.Third we study the statistical properties of the entropy and information landscapes and paths, and present thecomputational aspects. The mentioned examples of genetic expression are developed. Finally in an appendix, weshow how these functions appear in the theory of Free energies, in Statistical Physics and in Bayesian VariationalAnalysis. Given a probability law P X on a finite set E = E X , Shannon defined the information content of this law by theBoltzmann-Gibbs entropy [52]: H ( P X ) = − (cid:88) x ∈ E P X ( x ) log P X ( x ) . 
(1)Shannon himself gave an axiomatic justification of this choice, that was developed further by Khinchin, Kendalland other mathematicians, see [29].The article [4] (Baudot and Bennequin) presented such a set of axioms inspired by algebraic topology, see also[63] (Juan-Pablo Vigneaux). In all these approaches, the fundamental ingredient is the decomposition of theentropy for the joint variable of two variables. To better formulate this decomposition, we have proposed toconsider the entropy as a function of three variables: first a finite set E X , second a probability law P on E X and third, a random variable on E X , i.e. a surjective map Y : E X → E Y , considered only through the partitionof E X that it gives, indexed by the elements y of E Y . In this case we say that Y is less fine than X , and write Y ≤ X , or X → Y . Then we define the entropy of Y for P at X : H X ( Y ; P ) = H ( Y ∗ ( P )) , (2)where Y ∗ ( P ) is the image law, also named the marginal of P by Y : Y ∗ ( P )( y ) = (cid:88) x | Y ( x )= y P ( x ) . (3) Remark.
Frequently, when the context is clear, we simply write H X ( Y ; P ) = H ( Y ; P ) or even H ( Y ) , as every-body does, however the ”homological nature” of H can only be understood with the index X , because it is herethat the topos theory appears, see [4, 63]. The second fundamental operation on probabilities (after marginalization) is the conditioning : given y ∈ E Y ,such that Y ∗ ( P )( y ) (cid:54) = 0, the conditional probability P | ( Y = y ) on E X is defined by the following rules: ∀ x | Y ( x ) = y, P | ( Y = y )( x ) = P ( x ) /Y ∗ ( P )( y ) , ∀ x | Y ( x ) (cid:54) = y, P | ( Y = y )( x ) = 0This allows to define the conditional entropy, as Shannon has done, for any Z and Y both less fine that X , Y.H ( Z ; P ) = (cid:88) y ∈ E Y H ( Z ; P | ( Y = y )) Y ∗ ( P )( y ) . (4)4ote that if P | ( Y = y ) is not well defined, we can simply forget it in the formula, because it appears multipliedby zero.This operation is associative (see [4, 63]), i.e. for any triple W, Y, Z of variables less fine than X ,( W, Y ) .H ( Z ; P ) = W. ( Y.H )( Z ; P ) . (5)With these notations, the fundamental functional equation of Information Theory, or its first axiom, accordingto Shannon, is H (( Y, Z ); P ) = H ( Y ; P ) + Y.H ( Z ; P ) . (6) Remark.
In [4, 63] it is shown that this equation can be understood as a co-cycle equation of degree one of a module in a topos, in the sense of Grothendieck and Verdier [1], and why the entropy is generically the only universal generator of the first co-homology functor.
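A minimal numerical check of the fundamental equation (6), assuming a small made-up joint law on $E_Y \times E_Z$ (the table `P` below is purely illustrative):

```python
import numpy as np

# Illustrative joint law P(y, z) on E_Y x E_Z with |E_Y| = 2, |E_Z| = 3.
P = np.array([[0.10, 0.25, 0.05],
              [0.20, 0.10, 0.30]])

def H(p):
    """Shannon entropy (bits) of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_joint = H(P.ravel())                   # H((Y, Z); P)
P_Y = P.sum(axis=1)                      # marginal law Y_*(P)
H_Y = H(P_Y)                             # H(Y; P)
# Conditional entropy Y.H(Z; P) = sum_y Y_*(P)(y) H(Z; P|Y=y), equation (4).
Y_H_Z = sum(P_Y[y] * H(P[y] / P_Y[y]) for y in range(P.shape[0]))
assert np.isclose(H_joint, H_Y + Y_H_Z)  # equation (6)
```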
More generally, we consider a collection of sets E X , X ∈ C , such that each time Y, Z are less fine than X andbelong to C , then ( Y, Z ) also belongs to C ; in this case we name C an information category . An example is givenby the joint variables X = ( X i , ..., X i m ) of n basic variables X , ..., X n with values in finite sets E , ..., E n ; theset E X being the product E i × ... × E i m .Then for every natural integer k ≥
1, we can consider families indexed by X of (measurable) functions of theprobability P X that are indexed by several variables Y , ..., Y k less fine than XP X (cid:55)→ F X ( Y ; ... ; Y k ; P X ) (7)satisfying the compatibility equations; ∀ X (cid:48) , X ≤ X (cid:48) , ∀ P X (cid:48) , F X ( Y ; ... ; Y k ; X ∗ ( P X (cid:48) )) = F X (cid:48) ( Y ; ... ; Y k ; P X (cid:48) ) . (8)We call these functions the co-chains of degree k of C for the probability laws. An equivalent axiom is that F X ( Y ; ... ; Y k ; P X ) only depends on the image of P X by the joint variable ( Y , ..., Y k ). We call this property locality of the family F = ( F X , X ∈ C ).The action by conditioning extends verbally to the co-chains of any degree:if Y is less fine than X , Y.F X ( Y ; ... ; Y k ; P ) = (cid:88) y ∈ E Y F X ( Y ; ... ; Y k ; P | ( Y = y )) Y ∗ ( P )( y ) . (9)It satisfies again the associativity condition.Higher mutual information quantities were defined by Hu Kuo Ting [23] and McGill [38], generalizing the Shannonmutual information [4, 58]:in our terms, for k random variables X , ..., X k less fine than X and one probability law P on the set E X , H k ( X ; ... ; X k ; P ) = H (( X , ..., X k ); P ) . (10)And more generally, for j ≤ k , we define H j ( X ; ... ; X k ; P ) = (cid:88) I ⊂ [ k ]; card ( I )= j H ( X I ; P ) , (11)where X I denotes the joint variable of the X i such that i ∈ I .We name these functions of P the joint entropies .Then the higher information functions are defined by I k ( X ; ... ; X k ; P ) = j = k (cid:88) j =1 ( − j − H j ( X ; ... ; X k ; P ) , (12)In particular we have I = H , the usual entropy.Reciprocally the functions I k decompose the entropy of the finest joint partition: H ( X ; X ; ... ; X n ; P ) = k = n (cid:88) k =1 ( − k − (cid:88) I ⊂ [ n ]; card ( I )= k I k ( X i ; X i ; ... ; X i k ; P ) (13)The following result is immediate from the definitions, and the fact that H X , X ∈ C is local:5 roposition 1. The joint entropies H k and the higher information quantities I k are information co-chains, i.e.they are local functions of P . Remark.
From the computational point of view, locality is important, because it means that only the less fine marginal probability has to be taken into account.
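Locality means in practice that the pure quantities $H_k(X_I)$ and $I_k(X_I)$ of equations (11) and (12) can be estimated from the empirical frequencies of the subset $I$ alone. The sketch below (illustrative, not the authors' implementation) computes $I_k$ as the alternating sum of subset entropies; the XOR triple used as test data is a standard example of pairwise-independent but globally dependent binary variables and anticipates the negativity of $I_3$ discussed later.

```python
import itertools
import numpy as np

def H_emp(data, cols):
    """Empirical joint entropy (bits) of the variables indexed by cols."""
    _, counts = np.unique(data[:, list(cols)], axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def I_k(data, cols):
    """Pure mutual information I_k(X_cols), equation (12): alternating sum of
    the joint entropies of all non-empty subsets of cols."""
    total = 0.0
    for j in range(1, len(cols) + 1):
        sign = (-1) ** (j - 1)
        total += sign * sum(H_emp(data, sub)
                            for sub in itertools.combinations(cols, j))
    return total

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(500, 3))        # 3 illustrative binary variables
data[:, 2] = data[:, 0] ^ data[:, 1]            # XOR: pairwise independent triple
print(I_k(data, (0, 1)), I_k(data, (0, 1, 2)))  # I_2 near 0, I_3 near -1 (synergy)
```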
The definition of H j , j ≤ k and I k makes evident that they are symmetric functions, i.e. they are invariantby every permutation of the letters X , ..., X k .The particular case I ( S ; T ) = H ( S ) + H ( T ) − H ( S, T ) is the usual mutual information defined by Shannon.Using the concavity of the logarithm, it is easy to show that I and I have only positive values, but this ceasesto be true for I k as soon as k ≥ I k,l ( Y ; ... ; Y k ; P X | Z , ..., Z l ) = ( Z , ..., Z l ) .I k ( Y ; ... ; Y k ; P X ) . (14)For instance, considering a family of basic variables X i , i = 1 , ..., n , I k,l ( X I ; ... ; X I k ; ( P | X J )) = X J .I k ( X I ; ... ; X I k ; P ) , (15)for the joint variables X I , ..., X I k , X J , where I , ..., I k , J ⊂ [ n ].The following remarkable result is due to Hu Kuo Ting [23]: Theorem 2.1.
Let X , ..., X n be any set of random variables and P a given probability on the product E X ofthe respective images E , ..., E n , then there exist finite sets Σ , ..., Σ n and a numerical function ϕ from the union Σ of these sets to R , such that for any collection of subsets I m ; m = 1 , ..., k of { , ..., n } , and any subset J of { , ..., n } of cardinality l , the following identity holds true I k,l ( X I ; ... ; X I k ; ( P | X J )) = ϕ (Σ I ∩ ... ∩ Σ I k \ Σ J ) , (16) where we have denoted X I = ( X i , ..., X i l ) and Σ I = Σ i ∪ ... ∪ Σ i l for I = { i , ..., i l } , and where Ω \ Σ J denotesthe set of points in Ω that do not belong to Σ J , i.e. the set Ω ∩ (Σ \ Σ J ) , named subtraction of Y = Σ J from Ω . The Hu Kuo Ting theorem says that for a given joint probability law P , and from the point of view of theinformation quantities I k,l , the joint operation of variables corresponds to the union of sets, the graduation k corresponds to the intersection, and the conditioning by a variable corresponds to the difference of sets. Thiscan be precisely formulated as follows: Corollary 1.1.
Let X , ..., X n be any set of random variables on the product E X of the respective goals E , ..., E n ,then for any probability P on E X , every universal identity between disjoint sums of subsets of a finite set that areobtained, starting with n subsets Σ , ..., Σ n , by 1) forming collections of reunions, 2) taking successive intersectionsof these unions, and 3) subtracting by one of them, gives an identity between sums of information quantities, byreplacing the union by the joint variables ( ., . ) , the intersections by the juxtaposition ( . ; . ; . ) and the subtractionby the conditioning. Remark.
Conversely the corollary implies the Theorem.
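As a numerical illustration of this dictionary, the set identity $A \cap B = (A \cap B \cap C) \sqcup ((A \cap B) \setminus C)$ translates into $I_2(X;Y) = I_3(X;Y;Z) + I_2(X;Y|Z)$, which the following sketch verifies up to floating-point error on the empirical law of a made-up data matrix (the helper names are hypothetical):

```python
import itertools
import numpy as np

def H(data, cols):
    """Empirical joint entropy (bits) of the variables indexed by cols."""
    _, c = np.unique(data[:, list(cols)], axis=0, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def I(data, cols):
    """Pure I_k via the alternating sum of subset entropies (equation (12))."""
    return sum((-1) ** (len(s) - 1) * H(data, s)
               for j in range(1, len(cols) + 1)
               for s in itertools.combinations(cols, j))

def I_cond(data, cols, z_col):
    """Conditional information Z.I_k = sum_z P(z) I_k( . ; P|Z=z)."""
    vals, counts = np.unique(data[:, z_col], return_counts=True)
    return sum(c / len(data) * I(data[data[:, z_col] == v], cols)
               for v, c in zip(vals, counts))

rng = np.random.default_rng(2)
data = rng.integers(0, 3, size=(2000, 3))
data[:, 1] = (data[:, 0] + data[:, 2]) % 3   # introduce a three-way dependence
lhs = I(data, (0, 1))                        # I_2(X_1; X_2)
rhs = I(data, (0, 1, 2)) + I_cond(data, (0, 1), 2)
assert np.isclose(lhs, rhs)                  # identity given by the dictionary
```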
This corollary is the source of many identities between the information quantities.For instance, the fundamental equation (6) corresponds to the fact that the union of two sets
A, B is the disjointunion of one of them, say A and of the difference of the other with this one, say B \ A .The following formula follows from (6): H k +1 ( X ; X ; ... ; X k ; P ) = H k (( X , X ); X ; ... ; X k ; P ) = H k ( X ; ... ; X k ; P ) + X .H k ( X ; ... ; X k ; P ) . (17)The two following identities are also easy consequences of the Corollary 1; they are important for the method ofdata analysis presented in this article: Proposition 2.
Let k be any integer I k (( X , X ); X ; ... ; X k ; P ) = I k ( X ; X ; ... ; X k ; P ) + X .I k ( X ; X ; ... ; X k ; P ) (18)6 roposition 3. Let k be any integer I k +1 ( X ; X ; ... ; X k ; P ) = I k ( X ; X ; ... ; X k ; P ) − X .I k ( X ; X ; ... ; X k ; P ) . (19) Remark.
Be careful that some universal formulas between sets do not give identities between information func-tions; for instance A ∩ ( B ∪ C ) = ( A ∩ B ) ∪ ( A ∩ C ) but in general we have I ( X ; ( Y, Z )) (cid:54) = I ( X ; Y ) + I ( X ; Z ) , (20) What is true is the following identity: I ( X ; ( Y, Z )) + I ( Y ; Z ) = I (( X, Y ); Z ) + I ( X ; Y ) , (21) This corresponds to the following universal formula between sets ( A ∩ ( B ∪ C )) ∪ ( B ∩ C ) = ( A ∩ B ) ∪ (( A ∪ B ) ∩ C ) . (22) The formula (21) follows directly from the definition of I , by developing the four terms of the equation. Itexpresses the fact that I is a simplicial co-cycle, being the simplicial co-boundary of H itself.However, although this formula between sets is true, it is not of the form authorized by the corollary .Consequently, some identities of sets that are not contained in the Theorem 2.1 correspond to information iden-tities, but, as we saw just before with the false formula (20) , not all identities of sets correspond to informationidentities. As we already said, the set of joint variables X I , for all the subsets I of [ n ] = { , ..., n } , is an informationcategory, the set C being the n − n ]) of vertices X , ..., X n . In what follows we do not consider moregeneral information categories.We can paraphrase the Theorem 2.1, by a combinatorial Theorem on the simplex ∆([ n ]): Definition 4.
Let X , ..., X n be a set of random variables with respective goals E , ..., E n , and let X I = { X i , ..., X i k } be a face of ∆([ n ]) , we define, for a probability P on the product E of all the E i , i = 1 , ..., n , η I ( P ) = η ( X i ; ... ; X i k ; P ) = X [ n ] \ I .I k ( X i ; ... ; X i k ; P ) , (23) Remark.
With the exception J = [ n ] , the function η J is not an information co-chain of degree k . But it isuseful in the demonstrations of some of the following results. Embed ∆([ n ]) in the hyperplane x + ... + x n = 1 as the standard simplex in R n (the intersection of theabove hyperplane with the positive cone, where ∀ i = 1 , ..., n, x i ≥ , ..., Σ n of radius R strictly larger than (cid:112) ( n − /n that are centered on the vertices X j ; j = 1 , ..., n ; they have all possible non-emptyintersections convex. The subsets Σ (cid:48) I = Σ I \ Σ [ n ] \ I are the connected components of complementary set of theunions of the boundary spheres ∂ Σ , ..., ∂ Σ n in the total union Σ of the balls Σ , ..., Σ n . Proposition 5.
For every k + 1 subsets I , .., I k , K of [ n ] , if l denotes the cardinality of K , the informationfunction I k,l ( X I ; ... ; X I k ; P | X K ) is equal to the sum of the functions η J ( P ) , where J describes all the faces suchthat Σ (cid:48) J is one of the connected components of the set (Σ I ∩ ... ∩ Σ I k ) \ Σ K .Proof. Every subset that is obtained from the Σ J ; J ⊂ [ n ] by union, intersection and difference, repeated indefi-nitely (i.e. every element of the Boolean algebra generated by the Σ i ; i = 1 , ..., n ), is a disjoint union of some ofthe sets Σ (cid:48) J . This is true in particular for the sets obtained by the succession of operations 1 , , elementary (or pure ) joint entropies H k ( X I ) and the elementary (or pure) higher informationfunctions I k ( X I ) as H k ( X i ; ... ; X i k ; P ) and I k ( X i ; ... ; X i k ; P ) respectively, where I = { i , ..., i k } ⊂ [ n ] describesthe subsets of [ n ]. In the following of the text, we will consider only these pure quantities. We will frequentlydenote them simply by H k (resp. I k ). The other information quantities use joint variables and conditioning, butthe preceding result tells that they can be computed from the pure quantities.For the pure functions, the decompositions in the basis η I are simple: Proposition 6. If I = { i , ..., i k } , we have H k ( X i ; ... ; X i k ; P ) = (cid:88) J ⊂ [ n ] |∃ m,i m ∈ J η J ( P ) , (24) and I k ( X i ; ... ; X i k ; P ) = (cid:88) J ⊃ I η J ( P ) . (25)7n other terms, the function H k evaluated on a face X I of dimension k is given by the sum of the functions η J over all the faces X J connected to X I . And the function I k evaluated on X I is the sum of the functions η J over all the faces X J that contain X I . Proposition 7.
For any face J of ∆([ n ]) , of dimension l , and any probability P on E X , we have η J ( P ) = (cid:88) k ≥ l (cid:88) I ⊇ J | dimI = k ( − k − l H k ( X I ; P ) . (26) Proof.
This follows from the Moebius inversion formula (Rota 1964 [47]).
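A sketch (same illustrative conventions as above) estimating $\eta_J$ directly from Definition 4, i.e. as $I_{|J|}(X_J)$ conditioned on the joint of the remaining variables, and checking Proposition 6 on the empirical law of a toy data set:

```python
import itertools
import numpy as np

def H(data, cols):
    """Empirical joint entropy (bits) of the variables indexed by cols."""
    _, c = np.unique(data[:, list(cols)], axis=0, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def I(data, cols):
    """Pure I_k via the alternating sum of subset entropies (equation (12))."""
    return sum((-1) ** (len(s) - 1) * H(data, s)
               for j in range(1, len(cols) + 1)
               for s in itertools.combinations(cols, j))

def eta(data, J, n):
    """eta_J of Definition 4: I_|J|(X_J) conditioned on the joint of the
    remaining variables X_([n] minus J), averaged over their observed values."""
    rest = [i for i in range(n) if i not in J]
    if not rest:
        return I(data, J)
    keys, counts = np.unique(data[:, rest], axis=0, return_counts=True)
    out = 0.0
    for key, c in zip(keys, counts):
        mask = np.all(data[:, rest] == key, axis=1)
        out += c / len(data) * I(data[mask], J)
    return out

n = 3
rng = np.random.default_rng(4)
data = rng.integers(0, 2, size=(400, n))
data[:, 2] = (data[:, 0] + data[:, 1]) % 2
faces = [J for r in range(1, n + 1) for J in itertools.combinations(range(n), r)]
# Proposition 6: I_k(X_I) is the sum of eta_J over the faces J containing I,
# and H_k(X_I) is the sum of eta_J over the faces J meeting I.
I_set = (0, 1)
assert np.isclose(I(data, I_set),
                  sum(eta(data, J, n) for J in faces if set(I_set) <= set(J)))
assert np.isclose(H(data, I_set),
                  sum(eta(data, J, n) for J in faces if set(I_set) & set(J)))
```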
Corollary 7.1. (Te Sun Han): Any Shannon information quantity is a linear combination of the pure functions I k , k ≥ (resp. H k k ≥ ), with coefficients in Z , the ring of relative integers.Proof. This follows from the proposition 5.Hu Kuo Ting [23] also proved a remarkable property of the information functions associated to a Markov process:
Proposition 8.
The variables X , ..., X n can be arranged in a Markov process ( X i , ..., X i n ) if and only if, forevery subset J = { j , ..., j k − } of { i , ..., i n − } of cardinality k − , we have I k ( X i ; X j , ... ; X j k − ; X i n ) = I ( X i ; X i n ) . (27)This implies that, for a Markov process between ( X i , ..., X i n ), all the functions I k ( X I ) involving i and i n , arepositive. The total correlations were defined by Watanabe as the difference of the sum of entropies and the joint entropy,noted G k [65] (see also [61, 56, 35]): G k ( X ; ... ; X k ; P ) = k (cid:88) i =1 H ( X i ) − H ( X ; ... ; X k ) . (28)Total correlations are Kullback-Leibler divergences, cf. appendix A on bayes free energy; and I = G . It is wellknown (cf. the above references or [13]) that for n ≥
2, the variables $X_1, \ldots, X_n$ are statistically independent for the probability $P$ if and only if $G_n(X_1; \ldots; X_n) = 0$, i.e. $H(X_1, \ldots, X_n; P) = H(X_1; P) + \ldots + H(X_n; P)$. (29)
Remark. The result is proved by induction, using repeatedly the case $n = 2$, which comes from the strict concavity of the function $H(P)$ on the simplex $\Delta([n])$.
Theorem 2.2.
For every n and every set E , ..., E n of respective cardinalities N , ..., N n , the probability P renders the n variables X i , i = 1 , ..., n statistically independent if and only if the n − n − quantities I k for k ≥ are equal to zero.Proof. For n = 2 this results immediately from the above criterion and the definition of I . Then we proceed byrecurrence on n , and assuming that the result is true for n − n .The definition of I n is I n ( X ; ... ; X n ; P ) = H ( X ; P ) + ... + H ( X n ; P ) − H ( X , X ; P ) − ... + ( − n +1 H ( X , ..., X n ; P ) . (30)By recurrence, the quantities I k for 2 ≤ k ≤ n − I = { i , ..., i k } ⊂ [ n ] of cardinality k between 2 and n −
1, the variables X i , ..., X i k are independent. Suppose this isthe case. In the above formula (30), we can replace all the intermediary higher entropies H ( X I ; P ) for I between2 and n − H ( X i ) + ... + H ( X i k ). By symmetry each term H ( X i ) appears the same number of time, with the same sign each time. The total sum of signs is obtained byreplacing each H ( X i ) by 1; it is Σ = n − C n + 3 C n − ... + ( − n ( n − C n − n . (31)8owever, as a polynomial in x , we have(1 − x ) n = 1 − nx + C n x − ... + ( − n x n , (32)thus ddx (1 − x ) n = − n + 2 C n x − ... + ( − n nx n − , (33)therefore n − C n + ... + ( − n ( n − C n − n = ( − n n − ddx (1 − x ) n | x =1 = ( − n n, (34)because n ≥ I n ( X ; ... ; X n ; P ) = ( − n − H ( X , ..., X n ; P ) + ( − n ( H ( X ; P ) + ... + H ( X n ; P )) . (35)Therefore, if the variables X i ; i = 1 , ..., n are all independent, the quantity I n is equal to 0. And conversely, if I n = 0, the variables X i ; i = 1 , ..., n are all independent.Te Sun Han established that, for any subset I of [ n ] of cardinality k ≥
2, there exist probability laws suchthat all the I k ( X I ) , k ≥ I k ( X I ) [21, 22]. Consequently, in the equations of thetheorem 2.2, no one can be forgotten.The unique equation (29) also characterizes the statistical independence, but its gradient with respect to P isstrongly degenerate along the variety of independent laws. As shown by Te Sun Han [21, 22], this is not the casefor the I k . The number of different functions η I , resp. pure I k , resp. pure H k , is 2 n − P X are analytically independent; we will prove here that this is true. Thebasis of the proof is the fact that each family gives finitely ambiguous coordinates in the case of binary variables,i.e. when all the numbers N i , i = 1 , ..., n are equal to 2. Then we begin by considering n binary variables withvalues 0 or 1.Let us look first at the cases n = 1 and n = 2. And consider only the family H k , the other families being easilydeduced by linear isomorphisms.In the first case the only function to consider is the entropy H (( p , p )) = − p log ( p ) − p log ( p ) = − x ln x − (1 − x ) ln(1 − x )) = h ( x ) , (36)if we put x = p . To get a probability ( p , p ) we impose that x belongs to [0 , x , h is strictlyconcave, attaining all values between 0 and 1, but it is not injective, due to the symmetry x (cid:55)→ − x , whichcorresponds to the exchange of the values 0 and 1.For n = 2, we have two variables X , X and three functions H ( X ; P ), H ( X ; P ), H ( X , X ; P ). These functionsare all concave and real analytic in the interior of the simplex of dimension 3.Let us describe the probability law by four positive numbers p , p , p , p of sum 1. The marginal laws for X and X are described respectively by the following coordinates: p = p + p , p = p + p , (37) q = p + p , q = p + p . (38)For the values of H ( X ; P ) and H ( X ; P ) we can take independently two arbitrary real numbers between 0 and1. Moreover, from the case n = 1, if two laws P and P (cid:48) give the same values H and H of H ( X ; P ) and H ( X ; P ) respectively, we can reorder 0 and 1 independently on each variable in such a manner that the imagesof P and P (cid:48) by X and X coincide, i.e. we can suppose that p = p (cid:48) and q = q (cid:48) , which implies p = p (cid:48) and q = q (cid:48) , due to the condition of sum 1. It is easy to show that the third function H ( X , X ; P ) can take anyvalue between the maximum of H , H and the sum H + H .9 emma 9. There exist at most two probability laws that have the same marginal laws under X and X and thesame value H of H ( X , X ) ; moreover, depending on the given values H , H , H in the allowed range, both casescan happen in open sets of the simplex of dimension seven.Proof. When we fix the values of the marginals, all the coordinates p ij can be expressed linearly in one of them,for instance x = p : p = p − x, p = q − x, p = p − q + x. (39)Note that x belongs to the interval I defined by the positivity of all the p ij : x ≥ , x ≤ p , x ≤ q , x ≥ q − p = q + p − . (40)The fundamental formula gives the two following equations: H ( X , X ; P ) − H ( X ) = X .H ( X ; P ) = p h ( xp ) + p h ( q − xp ) , (41) H ( X , X ; P ) − H ( X ) = X .H ( X ; P ) = q h ( xq ) + q h ( p − xq ) . (42)We define the functions f ( x ) and f ( x ) by the two above formulas respectively. 
As a function of x , each one isstrictly concave, being a sum of strictly concave functions, thus it cannot take the same value for more than twovalues of x .This proves the first sentence of the lemma; to prove the second one, it is sufficient to give examples for bothsituations.Remark that the functions f , f have the same derivative: f (cid:48) ( x ) = f (cid:48) ( x ) = log ( p p p p ) . (43)This results from the formula h (cid:48) ( u ) = − log ( u/ − u ) of the derivative of the entropy.Then the maximum of f or f on [0 , p p = p p , that is when x ( x + 1 − p − q ) = ( x − p )( x − q ) ⇔ x = p q , (44)which we could have written without computation, because it corresponds to the independence of the variables X , X .Then the possibility of two different laws P, P (cid:48) in the lemma is equivalent to the condition that p q belongs tothe interior of I . This happens for instance for 1 > p > q > q > p >
0, where I = [ q − p , q ], because inthis case p q < q and p > p q i.e. p q = q − p q > q − p . In fact, to get P (cid:54) = P (cid:48) with the same H , it issufficient to take x different from p q but sufficiently close to it, and H = f ( x ) + H .However, even in the above case, the values of f (or f ) at the extremities of I do not coincide in general. Letus prove this fact. We have f ( q ) = q h ( p − q q ) = q h (1 − p q ) = F ( p ) , f ( q − p ) = q h (1 − p q ) = G ( p ) . (45)When p = 0, the interval I is reduced to the point q , and F (0) = G (0) = 0. Now fix q , q , and consider thederivatives of F, G with respect to p at every value p > F (cid:48) ( p ) = log p − q p , G (cid:48) ( p ) = log q − p p . (46)Therefore F (cid:48) ( p ) < G (cid:48) ( p ) if and only if p − q < q − p , i.e. q > /
2. Then, when q > / p > F ( p ) < G ( p ).Consequently, any value f ( x ) that is a little larger than F ( p ) determines a unique value of x . It is in thevicinity of q . Which ends the proof of the lemma.From this lemma, we see that there exist open sets where 8 or 4 different laws give the same values of the threefunctions H ( X ) , H ( X ) , H ( X , X ). In degenerate cases, we can have 4, 2 or 1 laws giving the same three values. Theorem 2.3.
For n binary variables X , ..., X n , the functions η I , resp. pure I k , resp. pure H k , characterizethe probability law on E X up to a finite ambiguity. roof. From the preceding section, it is sufficient to establish the theorem for the functions H k ( X I ), where k goes from 1 to n , and I describes all the subsets of cardinality k in [ n ].The proof is made by recurrence on n . We just have established the cases n = 1 and n = 2.For n > H ( X , ..., X n ) = H ( X , ..., X n − ) + ( X , ..., X n − ) .H ( X n ) . (47)By the Marginal Theorem of H.G. Kellerer [28] (see also F. Matus [37]), knowing the 2 n − P , there is only one resting dimension, thus one of the coordinates p i only is free, that we denote x .Suppose that all the values of the H k are known, the hypothesis of recurrence tells that all the non-trivial marginallaws are known from the values of the entropy, up to a finite ambiguity. We fix a choice for these marginals. Theabove fundamental formula expresses H ( X , ..., X n ) as a function f ( x ) of x , which is a linear combination withpositive coefficients of the entropy function h applied to various affine expressions of x ; therefore f is a strictlyconcave function of one variable, then only two values at most are possible for x when the value f ( x ) is given.The group {± } n of order 2 n that exchanges in all possible manners the values of the binary variables X i , i =1 , ..., X n gives a part of the finite ambiguity. However, even for n = 2, the ambiguity is not associated to theaction of a finite group, contrarily to what was asserted in [4] section 1 .
4. What replaces the elements of a groupare partially defined operations of permutations that deserve to be better understood.
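The $\{\pm 1\}^n$ part of this finite ambiguity can be checked numerically: relabelling the values 0 and 1 of any subset of binary variables leaves every joint entropy, hence every $H_k$ and $I_k$, unchanged. A short illustrative sketch (made-up data):

```python
import itertools
import numpy as np

def H(data, cols):
    """Empirical joint entropy (bits) of the variables indexed by cols."""
    _, c = np.unique(data[:, list(cols)], axis=0, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(5)
data = rng.integers(0, 2, size=(200, 3))
flipped = data.copy()
flipped[:, [0, 2]] ^= 1          # relabel 0 <-> 1 on the first and third variables
faces = [s for r in range(1, 4) for s in itertools.combinations(range(3), r)]
# Every joint entropy is invariant under these relabelings, which is the
# {+-1}^n part of the ambiguity discussed above.
assert all(np.isclose(H(data, s), H(flipped, s)) for s in faces)
```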
Theorem 2.4.
The functions η I , resp. the pure I k ( X I ) , resp. the pure H k ( X I ) , have linearly independentgradients in an open dense set of the simplex ∆([ n ]) of probabilities on E X .Proof. Again, it is sufficient to treat the case of the higher pure entropies.We write N = N ...N n . The elements of the simplex ∆( N ) are described by vectors ( p , ..., p N ) of real numbersthat are positive or zero, with a sum equal to 1. The expressions H k ( X J ) are real analytic functions in theinterior of this simplex. The number of these functions is 2 n −
1. The dimension N − n − p i , i = 1 , ..., N − j between 1 and n choose two different values of the set E j . Thenapply the theorem 2. Remark.
This proves the fact mentioned in . of [3] . Te Sun Han established that the quantities I k ( X I ) for k ≥ Remark.
The formulas of H k ( X I ) , then of I k ( X i ) and η I , extend analytically to the open cone Γ([ n ]) of vectorswith positive coordinates. On this cone we pose H ( P ) = I ( P ) = η ( P ) = n (cid:88) i =1 p i . (48) This is the natural function to consider to account for the empty subset of [ n ] .Be careful that the functions K k for k > are no more positive in the cone Γ([ n ]) , because the function − x ln x becomes negative for x > . In fact we have, for λ ∈ ]0 , ∞ [ , and P = ( p , ..., p n ) ∈ Γ([ n ]) , H k ( λP ) = λH k ( P ) − λ log λH ( P ) . (49) The above theorems extend to the prolonged functions to the cone, by taking into account H . Notice further properties of information quantities:For I k , due to the constraints on I and I , see Matsuda [36], we have for any pair of variables0 ≤ I ( X , X ) ≤ min { H ( X ) , H ( X ) } , (50)and any triple X , X , X : − min { H ( X ) , H ( X ) , H ( X ) } ≤ I ( X , X , X ) ≤ min { H ( X ) , H ( X ) , H ( X ) } . (51)It could be that interesting inequalities also exist for k ≥
4, but it seems that they are unknown.Contrarily to H k , the behavior of the function I k is not the same for k even and k odd. In particular, asfunctions of the probability P X , the odd functions I m +1 , for instance I = H = H , or I (ordinary synergy),11ave properties of the type of pseudo-concave functions (in the sense of [4]), and the even functions I m , like I (usual mutual information) have properties of the type of convex functions (see [4] for a more precise statement).Note that this accords well with the fact that the total entropy H ( X ), which is concave, is the alternate sum ofthe I k ( X I ) over the subsets I of [ n ], with the sign ( − k − (cf. appendix A).Another difference is that each odd function I m +1 is an information co-cycle, in fact a co-boundary if m ≥ I m +1 is a simplicial co-boundary in theordinary sense, and not an information co-cycle. Remark.
From the quantitative point of view, we have also considered and quantified on data the pseudo-concavefunction ( − k − I k (in the sense of [4]) as a measure of available information in the total system and consideredtotal variation along paths. Although such functions are sounding and appealing, we have chosen to illustrate hereonly the results using the function I k as they respect and generalize the usual multivariate statistical correlationstructures of the data and provide meaningful data interpretation of positivity and negativity, as it will becomeobvious in the following application to data. However, what really matters is the full landscape of informationsequences, showing that information is not well described by a unique number, but rather by a collection ofnumbers indexed by collections of joint variables. The developments and tests of the estimation of simplicial information topology on data is made on a geneticexpression dataset of two cell types obtained as described in the section material and methods 4.1. The result ofthis quantification of gene expression is represented in ”Heat maps” and allows two kinds of analysis: • The analysis with genes as variables: in this case the ”Heat maps” correspond to ( m, n ) matrices D (presented in the section 4.2) together with the labels (population A or population B) of the cells. Thedata analysis consists in the detection of gene modules. • The analysis with cells (neurons) as variables: in this case the ”Heat maps” correspond to the transposedmatrices D T (presented in the Figure 3) together with the labels (population A or population B) of thecells. The data analysis consists in the detection of cell types. As information negativity posed problems of interpretation as recalled in introduction and conclusion, we nowillustrate what the negative and positive information values quantify on data with theoretical (taking the caseof the binary variable case previously exposed) and empirical N-ary case examples of gene expression. Let usconsider three ordinary biased coins X , X , X , we will denote by 0 and 1 their individual states and by a, b, c, ... the probabilities of their possible configurations three by three; more precisely: a = p , b = p , c = p , d = p , (52) e = p , f = p , g = p , h = p . (53)We have a + b + c + d + e + f + g + h = 1 . (54)The following identity is easily deduced from the definition of I (cf. 18): I ( X ; X ; X ) = I ( X ; X ) − I ( X ; X | X ) . (55)Of course the identities obtained by changing the indices are also true. This identity interprets the informationshared by three variables as a measure of the lack of information in conditioning. We notice a kind of intricationof I : conditioning can increase the information, which interprets rightly the negativity of I . Another usefulinterpretation of I is given by I ( X ; X ; X ) = I ( X ; X ) + I ( X ; X ) − I (( X , X ); X ) . (56)In this case negativity is interpreted as a synergy, i.e. the fact that two variables can give more information ona third variable than the sum of the two separate information.Several inequalities are easy consequences of the above formulas and of the positivity of mutual information oftwo variables (conditional or not), as shown in [36]. I ( X ; X ; X ) ≤ I ( X ; X ) , (57) I ( X ; X ; X ) ≥ − I ( X ; X | X ) , (58)12igure 1: Example of the 4 maxima (left panel) and of the 2 minima of I for 3 binary variablesa, informal representation of the 7-simplex of probability associated with 3 binary variables. 
The values of theatomic probabilities that achieve the extremal configurations are noted in each vertex. b, Representation of theassociated probabilities in the data space of the 3-variables for these extremal configurations. c, Information I k landscapes of these configurations (top). Representation of these extremal configurations on the probabilitycube. The colors represents the non-nul atomic probability of each extremal configuration (bottom).,, and the analogs that are obtained by permuting the indices.Let us remark that this immediately implies the following assertions:1) when two variables are independent the information of the three is negative or zero;2) when two variables are conditionally independent with respect to the third, the information of the three ispositive or zero.By using the positivity of the entropy (conditional or not), we also have: I ( X ; X ) ≤ min ( H ( X ) , H ( X )) , (59) I ( X ; X | X ) ≥ − min ( H ( X | X ) , H ( X | X )) ≥ − min ( H ( X ) , H ( X )) . (60)We deduce from here I ( X ; X ; S ) ≤ min ( H ( X ) , H ( X ) , H ( X )) , (61) I ( X ; X ; X ) ≥ − min ( H ( X ) , H ( X ) , H ( X )) . (62)In the particular case of three binary variables, this gives1 ≥ I ( X ; X ; X ) ≥ − . (63)13 roposition 10. The absolute maximum of I , equal to , is attained only in the four cases of three identicalor opposite unbiased variables. That is H ( X ) = H ( X ) = H ( X ) = 1 , and X = X or X = 1 − X , and X = X or X = 1 − X , that is a = h = 1 / or b = g = 1 / or c = f = 1 / or d = e = 1 / and in each caseall the other variables are equal to (cf. Figure 1 a-c).Proof. First it is evident that the example gives I = 1. Second, consider three variables such that I ( X ; X ; X ) =1. We must have H ( X ) = H ( X ) = H ( X ) = 1, and also I ( X i ; X j ) = 1 for any pair ( i, j ), thus H ( X i , X j ) = 1, H ( X i | X j ) = 0, and the variable X i is a deterministic function of the variable X j , which gives X i = X j or X i = 1 − X j . Proposition 11.
The absolute minimum of I , equal to − , is attained only in the two cases of three two bytwo independent unbiased variables satisfying a = 1 / , b = 0 , c = 1 / , d = 0 , e = 1 / , f = 0 , g = 1 / , h = 0 , or a = 0 , b = 1 / , c = 0 , d = 1 / , e = 0 , f = 1 / , g = 0 , h = 1 / . These cases correspond to the two borromean links,the right one and the left one (cf. Figure 1 d-f ).Proof. First it is easy to verify that the examples give I = −
1. Second consider three variables such that I ( X ; X ; X ) = −
1. The inequality (62) implies H ( X ) = H ( X ) = H ( X ) = 1, and the inequality (60) showsthat H ( X i | X j ) = 1 for every pair of different indices, so H ( X , X ) = H ( X , X ) = H ( X , X ) = 2, and thethree variables are two by two independent. Consequently the total entropy H of ( X , X , X ), given by I minus the sum of individual entropies plus the sum of two by two entropies is equal to 2. Thus8 = − a lg a − b lg b − c lg c − d lg d − e lg e − f lg f − g lg g − h lg h. (64)But we also have 8 = 8 a + 8 b + 8 c + 8 d + 8 e + 8 f + 8 g + 8 h, (65)that is 8 = 4 a lg 4 + 4 b lg 4 + 4 c lg 4 + 4 d lg 4 + 4 e lg 4 + 4 f lg 4 + 4 g lg 4 + 4 h lg 4 . (66)Now we subtract (66) from (64), we obtain8 = − a lg 4 a − b lg 4 b − c lg 4 c − d lg 4 d − e lg 4 e − f lg 4 f − g lg 4 g − h lg 4 h. (67)However each of the four quantities − a lg 4 a − b lg 4 b, − c lg 4 c − d lg 4 d, − e lg 4 e − f lg 4 f, − g lg 4 g − h lg 4 h is ≥ a + 4 b, c + 4 d, e + 4 f, g + 4 h isequal to 1, so each of these quantities is equal to zero, which happens only if ab = cd = ef = gh = 0. But we canrepeat the argument with any permutation of the three variables X , X , X . We obtain nothing new from thetransposition of X and X . From the transposition of X and X we obtain ae = bf = cg = dh = 0. From thetransposition of X and X , we obtain ac = bd = eg = f h = 0. So from the cyclic permutation (1 , ,
3) (resp.(1 , , ae = bf = cg = dh = 0 (resp. ac = bd = eg = f h = 0).) If a = 0 this gives necessarily b, e, c nonzero, thus d = f = g = 0, and h (cid:54) = 0, and if a (cid:54) = 0 this gives b = e = c = 0, thus d, f, g nonzero and h = 0.Figure 1 illustrates the probability configurations giving rise to the maxima and minima of I for 3 binary vari-ables.In the much more complex case of gene expressions, the statistical analysis shown in [58] exhibited also acombination of positivity and negativity of the information quantities I k ; k ≥
3. In this analysis, the minimalnegative information configurations provide a clear example of purely emergent and collective interactions analogto Borromean links in topology, since it cannot be detected from any pairwise investigation or 2-dimensionalobservations. In these Borromean links the variables are pairwise independent but dependent at 3. In general I k negativity detects such effects of their projection on lower dimensions, this illustrates the main difficulty whengoing from dimension 2 to 3 in information theory. The example given in Figure 1 provides a simple exampleof this dimensional effect in the data space: the alternated clustering of the data corresponding to I negativitycannot be detected by the projections onto whichever subspace of pair of variables, since the variables are pairwiseindependent. For N-ary variables the negativity becomes much more complicated, with more degeneracy of theminima and maxima of I k .In order to illustrate the theoretical examples of Figure 1 on real data, considering the data set of geneexpression (matrix D ), we plotted some quadruplets of genes sharing some of the highest (positive) and lowest(negative) I values in the data space of the variables (Figure 2). Figure 2 shows that in the data space, I k negativity identifies the clustering of the data points, or in other words, the modules (k-tuples) for which thedata points are segregated into condensate clusters. As expected theoretically, I k positivity identifies co-variationsof the variables, even in cases of non-linear relations, as shown by Reshef and colleagues [46] in the pairwise case.It can be easily shown in the pairwise case that I k positivity generalizes the usual correlation coefficient tonon-linear relations. As a result, the interpretation of the negativity of I k is that it provides a signature andquantification of the variables that segregate or differentiate the measured population.14igure 2: Examples of some of 4-modules (quaduplets) with the highest (positive) and lowest(negative) I of gene expression represented in the data space. a, Two 4-modules of genes sharingamong the highest positive I of the gene expression data set (cf. 4.1). The data are represented in the dataspace of the measured expression of the 4 variables-genes. The fourth dimension-variable is color coded. b, Two4-modules of genes sharing among the lowest negative I . All the modules were found to be significant accordingto the dependence test introduced in section 4.6, except the module { , , , } . The identified extremalmodules (different) give similar patterns of dependences [57, 58]. Example of cell type recognition with a low sample size m = 41 , dimension n = 20 , and graining N = 9 . As introduced in previous section 2.5, the k-tuples presenting the highest and lowest information ( I k )values are the most biologically relevant modules and identify the variables that are the most dependent orsynergistic (respectively ”entangled”). We call information landscape the representation of the estimation of all I k values for the whole simplicial lattice of k-subfaces of the n -simplex of variables ranked by their I k valuesin ordinate. In general the null hypothesis against whom are tested the data is the maximal uniformity andindependence of the variables X i , i = 1 , ..., n . Below the undersampling dimension k u presented in methods 4.5,this predicts the following standard sequence for any permutation of the variables X i , ..., X i n ; H = log r, ..., H k = k log r, ... 
(68)that is linearity (with N = ... = N n = r ).What we observed in the case where independence is confirmed, for instance with the chosen genes of thepopulation B (NDA neurons) in [58], is linearity up to the maximal significant k , then stationarity. But whereindependency is violated, for example with the chosen genes of the population A (DA neurons) in [58], somepermutations of X , ..., X n give sequences showing strong departures from the linear prediction.This departure and the rest of the structure can also be observed on the sequence I k as shown in Figure 3and 4, which present the case where cells are considered as variables. In the trivial case, i.e. uniformity andindependence, for any permutation, we have I = log r, I = I = ... = I n = 0 . (69)As detailed in materials and methods 4.3, we further compute the longest information paths (starting at 0 andthat go from vertex to vertex following the edges of the simplicial lattice) with maximal or minimal slope (withminimal or maximal conditional mutual-information) that end at the first minimum, a conditional-independencecriterion (a change of sign of conditional mutual-information). Such paths select the biologically relevant vari-ables that progressively add more and more dependences. The paths I k ( σ ) that stay strictly positive for along time are especially interesting, being interpreted as the succession of variables X σ , ..., X σ k that share the15igure 3: Example of a I k landscape and path analysis. a, heatmap (transpose of matrix D ) of n = 20neurons with m = 41 genes. b, the corresponding H k landscape. c, the corresponding I k landscape d, maximum(in red) and minimum (in blue) I k information paths. e, histograms of the distributions of I k for k = 1 , .., I k ( σ ) that become negative for k ≥ I ≈ n = 20 cells with m = 41 genes is represented in Figure 3a. We took n = 20 neurons among the 148within which 10 were pre-identified as population A neurons (in green) and 10 were pre-identified as populationB neurons (in dark red), and ran the analysis on the 41 gene expression with a graining of N = 9 values (cf.section 4.1). The dimension above which the estimation of information becomes too biased due to the finitesample size is given by the undersampling dimension k u = 11 (p value 0.05, cf. section 4.5). The landscapesturn out to be very different from the extremal (totally disordered and totally ordered) homogeneous (identically16istributed) theoretical cases. The I k landscape shown in Figure 3c exhibits two clearly separated components.The scaffold below represents the tuple corresponding to the maximum of I : it corresponds exactly to the 10neurons pre-identified as being population A neurons.The maximum (in red) and minimum (in blue) I k information paths identified by the algorithm are repre-sented in Figure 3d. The scaffold below represents the two tuples corresponding to the two longest maximumpaths in each component: the longest (noted Max IP in green) IP contains the 10 neurons pre-identified aspopulation A and 1 ”error” neuron pre-identified as population B. We restricted the longest maximum path tothe undersampling dimension k u = 11, but this path reached k = 14 with erroneous classifications. The secondlongest maximum path (noted Max IP in red) IP contains the 10 neurons pre-identified as population Band 1 neuron pre-identified as population A that is hence erroneously classified by the algorithm. 
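The path-selection procedure itself is specified in section 4.3 and is not reproduced here; the following is only a simplified greedy sketch of the idea, under the assumption that at each step the variable maximizing $I_{k+1}$ (equivalently, minimizing the conditional mutual information $X.I_k$ of Proposition 3) is appended, and that the path stops at the first change of sign of this conditional mutual information or at the undersampling dimension $k_u$; the data matrix and function names are made up.

```python
import itertools
import numpy as np

def H(data, cols):
    """Empirical joint entropy (bits) of the variables indexed by cols."""
    _, c = np.unique(data[:, list(cols)], axis=0, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def I(data, cols):
    """Pure I_k via the alternating sum of subset entropies (equation (12))."""
    return sum((-1) ** (len(s) - 1) * H(data, s)
               for j in range(1, len(cols) + 1)
               for s in itertools.combinations(cols, j))

def max_information_path(data, k_u):
    """Greedy sketch of a maximal I_k path: start from the pair with largest I_2,
    extend with the variable giving the largest I_{k+1} (smallest conditional
    mutual information X.I_k = I_k - I_{k+1}, Proposition 3), and stop at the
    first change of sign of this conditional information or at dimension k_u."""
    n = data.shape[1]
    path = list(max(itertools.combinations(range(n), 2),
                    key=lambda p: I(data, p)))
    while len(path) < k_u:
        rest = [i for i in range(n) if i not in path]
        best = max(rest, key=lambda i: I(data, tuple(path) + (i,)))
        cond = I(data, tuple(path)) - I(data, tuple(path) + (best,))
        if cond < 0:      # conditional-independence criterion: first minimum
            break
        path.append(best)
    return path

rng = np.random.default_rng(6)
data = rng.integers(0, 2, size=(200, 8))
data[:, 1] = data[:, 0]; data[:, 2] = data[:, 0]   # a correlated module
# The returned path begins with the correlated module {0, 1, 2} (in some order).
print(max_information_path(data, k_u=5))
```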
Altogether, the information landscape shows that population A neurons constitute a quite homogeneous population, whereas the population B neurons correspond to a more heterogeneous population of cells, a fact that was already known and reported in the biological studies of these populations. The histograms of the distributions of I_k show that the observed I_k significantly differ from randomly generated I_k, a test of the specificity of the k-dependence. The shuffled distributions and the significance values for p = 0.05 are presented in Figure 8 (section 4.6).

Figure 4: I_k, H_k and G_k (Total Free Energy, TFE) landscapes. a: entropy H_k and b: mutual information I_k (free energy components) landscapes (same representation as Figure 3, k_u = 11, p value 0.05). c: G_k landscape (total correlation, or multi-information, or Integrated Information, or total free energy). d: the landscape of the G_k per body (G_k / k).

As illustrated in Figure 4 and expected from the positivity of relative entropy, the total correlation G_k (see appendix A on Bayes free energy) is monotonically increasing with the order k, and quite linearly in appearance (G_k ≈ k asymptotically); panel d quantifies this departure from linearity. However, the G_k landscape fails to distinguish the population A as clearly as the I_k landscape does.

During the last decades, there have been important efforts to evaluate the pairwise and higher-order interactions in neuronal and biological measurements, notably to extract the underlying collective dynamics. Applying the Maximum Entropy principle on Ising spin models to neural data [48, 59], a first series of studies concluded that pairwise interactions are mostly sufficient to capture the global collective dynamics, leading to the "pairwise sufficiency" paradigm (see Merchan and Nemenman for a presentation [39]). However, as shown by the Ising model itself, near a second-order phase transition elementary pairwise interactions are compatible with non-trivial higher-order dependences and very large correlations at long distances. From the mathematical and physical point of view, this fact is nicely encoded in the normalization factor of the Boltzmann probability, namely the partition function Z(β). As shown by the Ising model, the probability law can be factorized (up to the normalization number Z) on the edges and vertices of a graph, but the statistical clusters can have unbounded volumes. Moreover, subsequent studies, notably of Tkačik et al. [60] (see also [25]), have shown that for sufficiently large populations of recorded neurons, pairwise models are insufficient to explain the data, as proposed in [2, 3] for example. Thus the dimension of the interactions to be taken into account in the models must be larger than two.

Note that most interactions in Biology are nowadays described in terms of networks, such that the concepts of protein networks, genetic networks or neural networks became familiar. However, from the physical as well as the biological point of view, none of these systems are really 1-dimensional graphs, and it is now clear for most researchers in the domain that higher-order structures are needed for describing collective dynamics, cf. for instance [69] and [45]. Our Figure 5 clearly shows this point. William Bialek and his collaborators have well explained the interest of a systematic study of joint entropies and general multi-modal mutual information quantities as an efficient way of understanding neuronal activities, networks of neurons, and gene expression [12, 49, 54].
They also developed approximate computational methods for estimating the information quantities. Mutual information analysis was applied for linking adaptation to the preservation of the information flow [10, 33]. Closely related to the present study, Margolin, Wang, Califano and Nemenman have investigated multivariate dependences of higher order [35] with MaxEnt methods, by using the total correlation G_k (cf. equation 28) as a function of the integer k ≥ 2. The apparent advantage is the positivity of G_k.

In this respect, the originality of our method relies first on the systematic consideration of the entropy paths and the information paths that can be associated to all possible permutations of the basic variables, in arbitrary dimension, and on the extraction of exceptional paths from them, in order to define the overall form of the distribution of information among the set of variables. Secondly, we used and proved the relevance of peculiar functions, the multivariate mutual informations, where the previously cited works focused on total correlations, which fail to uncover the data structure as exemplified in Figure 4, or only explored the pairwise case or I_3. We named these tools the information landscapes of the data. This new perspective and the mathematical justification of these functions has its origin in the local (topos) homological theory introduced in [4], developed and extended in several ways by Vigneaux [64]. In the present article, we also proved new theoretical results along this line, about the concrete structure of higher-order information functions. Moreover, the method was successfully applied to a concrete problem of gene expression in [58]. Since their introduction, the possible negativity of the I_k functions for k ≥ 3 has raised questions of interpretation; the I_k-landscapes provide some new insights with respect to their meaning in terms of data point clusters.

The precise contribution of the higher orders is indeed directly quantified by the I_k values in the landscapes and paths. Figure 5 further illustrates the gain and the importance of considering higher statistical interactions, using the previous example of cells pre-identified as 10 population A and 10 population B cells (n = 20, m = 47, N = 9). The plots are the finite and discrete analogs of Gibbs's original representation of entropy vs. energy [17]. Whereas pairwise interactions (k = 2) cannot (or can very hardly) distinguish the population A and population B cell types, the maximum of I_10 unambiguously identifies the population A.

As illustrated in Figure 3, the present analysis shows that in the expression of the 41 genes of interest in population A neurons, the higher-order statistical interactions are non-negligible and have the simple functional meaning of a collective module, a cell type. We believe such a conclusion to be generic in biology. More precisely, we believe that, even if related to physics, biological structures have higher-order statistical interactions defined by higher-order information, and that these interactions provide the signature of their memory engramming. In fact "information is physical", as stated by Bennett following Landauer [32], in the sense of memory capacity and of the necessity of forgetting. The quantification of the information storage applied here to genes can be considered as a generic epigenetic memory characterization, resulting from a developmental-learning process. The consideration of higher-dimensional statistical dependences increases combinatorially the number of possible information modules engrammed by the system. It hence provides an appreciable capacity reservoir for information storage and for differentiation, for diversity.

Figure 5: H_k − I_k landscape: Gibbs-Maxwell's entropy vs. energy representation. H_k and I_k are plotted in abscissa and ordinate respectively for dimensions k = 1, ..., 12, for the same data and setting as in Figure 3 (n = 20 cells, m = 47 genes, N = 9, k_u = 11). Compare the difficulty in identifying the 2 cell types from the pairwise k = 2 landscape to the k = 10 landscape.

The critical points of the Ising model in dimension 2 and 3 show the difficulty of relating factorization (up to Z(β)), which describes the manner in which energy interactions localize, with the dependence structure, or in other words the manner in which information distributes itself, i.e. the form of information. Only few theoretical results relate the two notions. However, on the basis of several recent studies that we mentioned, particularly the studies of adaptive functions, and comforted by the analysis presented in this article, we can suggest that for biological systems, during development or evolution, the distribution of the information flow, as described in particular by higher-order information quantities, participates in the generators of the dynamics, on the side of the energy quantities coming from Physics.

The quantification of genetic expression was performed using the microfluidic qPCR technique on single dopaminergic (DA) and non-dopaminergic (NDA) neurons isolated from two midbrain structures, the Substantia Nigra pars compacta (SNc) and the neighboring Ventral Tegmental Area (VTA), extracted from adult TH-GFP mice (transgenic mice expressing the Green Fluorescent Protein under the control of the Tyrosine Hydroxylase promoter). The precise protocols of extraction, quantification, and identification are detailed in [57, 58]. This technique allowed us to quantify in a single cell the levels of expression of 41 genes chosen for their implication in neuronal activity and in the identity of dopaminergic (DA) neurons. The SNc DA neurons were identified based on GFP fluorescence (TH expression). This identification was further confirmed based on the expression levels of the Th and Slc6a3 genes, which are established markers of DA metabolism. The quantification of the expression of the 41 genes (n = 41) was achieved in 111 neurons (m = 111) identified as DA and in 37 neurons (m = 37) identified as nDA. In this article, for readability purposes, we replaced the names of the genes by gene numbers, the cell type DA by population A, and the cell type nDA by population B. The dataset is available in the supplementary material of [57, 58].

The presentation of the probability estimation procedure is given for the matrix D (genes as variables), and it is the same in the case of the analysis of the matrix D^T (cells as variables). It is illustrated in Figure 6 for the simple case of 2 random variables taken from the dataset of gene expression presented in section 4.1, namely the expression of two genes, Gene5 and Gene21, in m = 111 population A cells. Our probability estimation corresponds to a step of the Riemann integral estimation procedure.

Figure 6: Principles of probability estimation for 2 random variables. a, illustration of the basic procedure used in practice to estimate the probability density for the two genes (n = 2) Gene5 and Gene21 in 111 population A neurons (m = 111), using a graining of 9 (N_1 = N_2 = 9).
The data points corresponding to the 111 observations are represented as red dots, and the graining is depicted by the 81-box grid (N_1 · N_2). The borders of the graining interval are obtained by considering the maximum and minimum measured values for each variable, and the data are then sampled regularly within this interval with N_i values. Projections of the data points on lower-dimensional variable subspaces (the X_1 and X_2 axes here) are obtained by marginalization, giving the marginal probability laws for the 2 variables X_1 and X_2 (P_{X_i, N_i, m}), represented as histograms above the X_1-axis and on the right of the X_2-axis. b, heatmaps representing the levels of expression of the 21 genes of interest on a log Ex scale (top, raw heatmap) and after resampling with a graining of 9 (bottom, N_1 = N_2 = ... = N_21 = 9).

We write the heatmap as an (m, n) matrix D with real coefficients x_{ij} ∈ R, i ∈ {1, ..., m}, j ∈ {1, ..., n}: the columns of D span the m repetitions-trials (here the m neurons) and the rows of D span the n variables (here the n genes). We also note, for each variable X_j, the minimum and maximum measured values, min x_j = min_{1 ≤ i ≤ m} x_{ij} and max x_j = max_{1 ≤ i ≤ m} x_{ij}. We consider the space of the intervals [min x_j, max x_j] for each variable X_j and divide it into N_1 · N_2 ··· N_n boxes, on which it is possible to estimate the atomic probabilities by elementary counting. We label each n-dimensional box by an n-tuple of integers {a_1, ..., a_n}, where ∀ i ∈ {1, ..., n}, a_i ∈ {1, ..., N_i}, and writing the min and the max of a box on each variable X_j (the j-th coordinate of the vertices of the box) as bmin_j = min x_j + (a_j − 1)(max x_j − min x_j)/N_j and bmax_j = min x_j + a_j (max x_j − min x_j)/N_j, the atomic probabilities can then be defined using the indicator (Dirac) function δ as:

P(bmin_1 ≤ X_1 ≤ bmax_1, bmin_2 ≤ X_2 ≤ bmax_2, ..., bmin_n ≤ X_n ≤ bmax_n) = \sum_{i=1}^{m} \frac{\delta_i}{m},
with δ_i = 0 if bmin_1 > x_{i1} or x_{i1} > bmax_1, ..., or bmin_n > x_{in} or x_{in} > bmax_n, and δ_i = 1 if bmin_1 ≤ x_{i1} ≤ bmax_1 and ... and bmin_n ≤ x_{in} ≤ bmax_n.    (70)

For two variables, using the definition of conditioning P_X(Y) = P(X.Y)/P(X), and in the general case using the theorem of total probability [31] (P(X) = \sum_{i=0}^{N} P(A_i.X) = \sum_{i=0}^{N} P(A_i).P_{A_i}(X)), we can marginalize, or geometrically project on lower dimensions, to obtain all the probabilities corresponding to subsets of variables, as illustrated in Figure 6. For example, with short notation, the probability associated to the marginal variable X_i being in the interval [bmin_i, bmax_i] is obtained by direct summation:

P(bmin_i ≤ X_i ≤ bmax_i) = \sum P(bmin_1 ≤ X_1 ≤ bmax_1, bmin_2 ≤ X_2 ≤ bmax_2, ..., bmin_n ≤ X_n ≤ bmax_n),    (71)

the sum running over the N_1 ... \widehat{N_i} ... N_n boxes of the remaining variables. In the example of Figure 6, the probability of the level of Th being in the 4th box is:

P(8 ≤ Th ≤ bmax_4) = \sum_i P(8 ≤ Th ≤ bmax_4, bmin_i ≤ Calb1 ≤ bmax_i) = 2/111 + 2/111.    (72)

In geometrical terms, the set of total probability laws is an (N_1 · N_2 ··· N_n − 1)-dimensional simplex (the −1 coming from the constraint \sum P_i = 1, which embeds the simplex in an affine space). In the example of Figure 6, we have an 80-dimensional probability simplex ∆_80. The set of sub-simplices over the k-faces of the simplex ∆_n, for every k between 0 and n, represents the boolean algebra of the joint probabilities, which is equivalent in the finite case to their sigma-algebra. In our analysis, we have chosen N_1 = N_2 = ... = N_n = 9, and this choice is justified in section 4.6.1 using the criterion of Reshef and colleagues [46] and the undersampling constraints.

In summary, our probability estimation and data analysis depend on n (the number of random variables), on m (the number of observations), and on N_1, ..., N_n (the graining). The merit of this method is its simplicity (few assumptions, no priors on the distributions) and its low computational cost. There exist different methods that can significantly improve this basic probability estimation, but we leave this for future investigation. The graining given by the numbers N = N_1 · N_2 ··· N_n and the sample size m are important parameters of the analysis and are explored in this section.
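A minimal NumPy sketch of the estimation procedure of equations (70) and (71) is given below, assuming only a real-valued data matrix: regular graining between the measured minimum and maximum of each variable, atomic probabilities by elementary counting, and marginalization by summation over the discarded coordinates. Function names and the toy data are illustrative, not those of the Infotopo software.

```python
import numpy as np

def grain(D, n_bins=9):
    """Regular graining of a real-valued (m, n) data matrix D into integer bins 0..n_bins-1,
    the discrete analogue of the box decomposition of equation (70)."""
    lo, hi = D.min(axis=0), D.max(axis=0)
    width = (hi - lo) / n_bins
    bins = np.floor((D - lo) / np.where(width > 0, width, 1.0)).astype(int)
    return np.clip(bins, 0, n_bins - 1)          # points at the maximum fall in the last box

def atomic_probabilities(bins):
    """Atomic probabilities P(box) obtained by elementary counting of the m data points."""
    boxes, counts = np.unique(bins, axis=0, return_counts=True)
    return {tuple(b): c / bins.shape[0] for b, c in zip(boxes, counts)}

def marginal(p_joint, keep):
    """Marginalization of equation (71): sum the atomic probabilities over the discarded coordinates."""
    out = {}
    for box, p in p_joint.items():
        key = tuple(box[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# toy example: two 'genes' measured in m = 111 cells (random numbers standing in for expression levels)
rng = np.random.default_rng(0)
D = rng.normal(size=(111, 2))
P = atomic_probabilities(grain(D, n_bins=9))
print(sum(P.values()))                             # sums to 1: a point of the probability simplex
print(sorted(marginal(P, keep=(0,)).items())[:3])  # marginal law of the first variable
```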
The computational exploration of the simplicial sublattice has a complexity in O(2^n) (2^n = \sum_{k=0}^{n} \binom{n}{k}). In this simplicial setting we can exhaustively estimate the information functions on the simplicial information structure, that is, the joint entropies H_k and the mutual informations I_k at all dimensions k ≤ n and for every k-tuple, with a standard commercial personal computer (a laptop with an Intel Core i7-4910MQ CPU @ 2.90GHz × 8, even though the program currently uses only one CPU) up to k = n = 21 in a reasonable time. The program also computes the I_k paths described in the next sections and the finite entropy rate H_k / k. The input is an excel table containing the data values, e.g. the matrix D with the first row and column containing the labels. Here, we limited our analysis to n = 21 genes of specific biological interest.
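The exhaustive estimation on the simplicial lattice can be sketched as follows; this is a toy illustration of the O(2^n) exploration with n kept small, not the Infotopo implementation. The joint entropy of every subset of variables is estimated once by counting, and the I_k are then obtained from the stored entropies by the alternating sum.

```python
import itertools
import numpy as np

def landscape(bins):
    """Exhaustively estimate H_k and I_k for every k-tuple of the n variables (columns of `bins`),
    i.e. the whole simplicial lattice explored in O(2^n); `bins` is an (m, n) integer matrix."""
    m, n = bins.shape
    H = {}
    # one pass per subset: entropy from the empirical counts of the joint configurations
    for r in range(1, n + 1):
        for subset in itertools.combinations(range(n), r):
            _, counts = np.unique(bins[:, list(subset)], axis=0, return_counts=True)
            p = counts / m
            H[subset] = float(-(p * np.log2(p)).sum())
    # I_k from the stored entropies by the alternating (inclusion-exclusion) formula
    I = {s: sum((-1) ** (r + 1) * H[c]
                for r in range(1, len(s) + 1)
                for c in itertools.combinations(s, r))
         for s in H}
    return H, I

rng = np.random.default_rng(0)
toy = rng.integers(0, 9, size=(111, 6))   # only 6 variables here; 2^21 tuples would be the real cost
H, I = landscape(toy)
for k in range(1, 7):
    vals = [I[s] for s in I if len(s) == k]
    print(k, round(min(vals), 3), round(max(vals), 3))   # extrema of the I_k landscape per degree
```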
The information data analysis presented here depends on the two parameters N and m. The finite size m of the sample is known to impose an important bias on the estimation of information quantities: in high-dimensional data analysis it is quoted as the Hughes phenomenon [24], and in entropy estimation it has been called the sampling problem since the seminal work of Strong and colleagues [55, 41, 39]. For the method we suggest, it is important to notice that the size m of the population Z is in general much smaller than the dimension of the probability simplex, N = N_1 ··· N_n − 1. For instance, in the mentioned study of genes as variables [58], we had m = 111 for DA neurons (resp. m' = 37 for NDA neurons) as respective numbers of neurons, for 41 genes as variables, and a dimension of the probability simplex of N = 9^{41} − 1. Hence the entropies H_k, k = 1, ..., n, must satisfy the following inequality:

∀ J ⊂ [n], \; k = |J| = \mathrm{card}\, J, \quad H_k(X_J; P) ≤ \log m,    (73)

where equality is an extreme signature of undersampling. However, suppose that all the numbers N_i, i = 1, ..., n, are equal to r ≥ 2; the maximum value of H_k is equal to k \log r, for instance 2k \log 3 in the example (r = 9).

Lemma 12. Take the uniform probability on the simplex ∆([n]) with affine coordinates, and take ε such that 0 < ε ≤ 1/e ≈ 0.37; then the probability that H_k(X_J) is greater than ε k \log r is larger than 1 − ε.

Proof. Concerning H_k, the simplex ∆([n]) is replaced by ∆([k]); then consider the set ∆_ε of probabilities such that p_j ≥ ε r^{−k} for every coordinate j between 1 and r^k; this set is the complement of the union of the sets X_j(ε), j = 1, ..., r^k, where p_j < ε r^{−k}. From the properties of volumes in affine geometry, the measure of each set X_j(ε) is less than ε r^{−k}, thus the probability of ∆_ε is larger than 1 − ε. And for any index j, the monotony of −x \ln x between 0 and 1/e implies

−p_j \log p_j > ε r^{−k} k \log r;    (74)

then by summation over all the indices we obtain the result.

For example, for r = 9 and ε = 1/e, this gives that H_k ≥ k \log(3)/e is two times more probable than the opposite. Consequently, in the above experiment, the quantities H_k, and then the I_k, are not significant, except if they appear to be significantly smaller than \log m. In counterpart, as soon as the measured H_k is smaller than the value predicted for m samples, this is significant. Note that Lemma 12, with n replaced by m, gives estimations for the entropies of the raw data. In the next section, we propose a computational method to estimate the dimension k_u above which information estimation ceases to be significant.

Following the original presentation of the sampling problem by Strong and colleagues [55], the extreme cases of sampling are given by:

• When N_1 = N_2 = ... = N_n = 1, there is a single box Ω and P(Ω) = m/m = 1, and we have H_k = I_k = 0, ∀ k ∈ 1, ..., n. The case where m = 1 is identical. This fixes the lower bound of our analysis in order not to be trivial; we need m ≥ N_1 = N_2 = ... = N_n ≥ 2.

• When N_1 · N_2 ··· N_n is such that at most one data point falls into each box, m of the values of the atomic probabilities are 1/m and N_1 · N_2 ··· N_n − m are null as a consequence of equation 71, and hence we have H_n = \log m.

Whenever this happens for a given k-tuple, all the HP_k paths passing through this k-tuple stay on the same information values, since conditional entropy is non-negative: we have H_k = H_{k+1}, or equivalently H_{(X_1,...,X_k)}(X_{k+1}) = 0, and all (k+l)-tuples are deterministic (a function of) with respect to the k-tuple. This is typically the case illustrated in Figure 3: adding a new variable to an undersampled k-tuple is equivalent to adding the deterministic variable "0", since the probability remains unchanged (1/m).

Considering the analysis of cells as variables (matrix D^T), the signature of this undersampling is the saturation at H_k = \log 41 observed in the H_k landscape in Figure 3b, starting at k = 5 for some 5-tuples of neurons. Considering the analysis of genes as variables (matrix D [58]), the mean computed entropy also shows this saturation, at H_k = \log 111 for population A neurons and H_k = \log 37 for population B neurons. We propose to define a dimension k_u as the dimension for which the probability p_u of having H_k at the biased value H_k = \log m is above 5 percent (p_u = 0.05). This gives k_u = 6 for population A neurons and k_u = 4 for population B neurons. The information structures identified by our methods beyond these values can be considered as unlikely to have a biological or physical meaning and shall not be interpreted. Since undersampling mainly affects the distribution of the I_k values close to the 0 value, the maxima and minima of I_k and the maximal and minimal information paths below k_u are the least affected by the sampling problem and the low sample size. This will be illustrated in the next sections.
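The determination of k_u can be sketched as follows, under our reading of the criterion just stated: for each dimension k, count the fraction of k-tuples whose estimated joint entropy saturates at the undersampling bound log m, and return the first k at which this fraction exceeds p_u = 0.05. Names and the toy data are illustrative.

```python
import itertools
import numpy as np

def undersampling_dimension(bins, p_u=0.05):
    """First dimension k at which more than a fraction p_u of the k-tuples have their joint entropy
    saturated at the undersampling bound H_k = log m (every data point alone in its box)."""
    m, n = bins.shape
    h_max = np.log2(m)
    for k in range(1, n + 1):
        tuples = list(itertools.combinations(range(n), k))
        saturated = 0
        for subset in tuples:
            _, counts = np.unique(bins[:, list(subset)], axis=0, return_counts=True)
            h = -np.sum(counts / m * np.log2(counts / m))
            if np.isclose(h, h_max):
                saturated += 1
        if saturated / len(tuples) > p_u:
            return k
    return n

rng = np.random.default_rng(0)
toy = rng.integers(0, 9, size=(111, 8))   # m = 111 observations, 8 variables, 9 bins
print(undersampling_dimension(toy))       # beyond this k, H_k and I_k estimates should not be interpreted
```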
Figure 7: Determination of the undersampling dimension k_u. a, distributions of H_k for m = 111 population A neurons (green) and m = 37 population B neurons (dark red), for k = 1, ..., 6. The horizontal red line represents the threshold we have fixed at 5 percent of the total number of k-tuples. b, plot of the percentage of maximum-entropy (H_k = \ln m) biased values as a function of the dimension k. The horizontal red line represents the threshold fixed at 5 percent, giving k_u = 6 for population A and k_u = 4 for population B neurons. c, the mean ⟨HP⟩(k) paths for these two populations of neurons; the maximum entropy H_k = \ln m is represented by plain horizontal lines.

Pethel and Hahs [43] have constructed an exact test of 2-dependence for any pair of variables, not necessarily binary or iid. Indeed, the iid condition usually assumed for the χ² test does not seem relevant for biological observations, and the examples given here and in [57, 58] with genetic expression support such a general statement. It allows to test the significance of the estimated I_2 values given a finite sample size m, the null hypothesis being that I_2 = 0 (2-independence according to Pethel and Hahs). We follow here their presentation of the problem, and provide an extension of their test to arbitrary k (higher dimensions), with the null hypothesis being the k-independence I_k = 0. Even in the lowest dimensions, and below the undersampling bound, the values of I_k estimated from a finite sample of size m are considered as biased [43]. If one considers an infinite sample (m → ∞) of n independent variables, we then have I_k = 0 for all k ≥ 2. However, if we randomly shuffle the values such that the marginal distributions of each variable X_i are preserved, the estimated I_k can be very different from 0, with distributions of I_k values not centered on 0. Figure 8 illustrates an example of such a bias with m = 111 for the analysis with genes as variables.

Reproducing the method of Pethel and Hahs [43], we designed a shuffling procedure of the n variables, which consists in randomly permuting the measured values (coordinates) of each variable, one by one, in the matrix D or D^T (geometrically, a "random" permutation of the coordinates of each data point, point by point). Such a shuffle leaves the marginal probabilities invariant. Figure 8 gives an example of the joint and marginal distributions before and after shuffling for two genes. Extending the 2-test of [43] to k ≥ 2, the I_k values obtained after shuffling provide the distribution of the null hypothesis, k-independence (I_k = 0) according to [43]. The task is hence to compute many shuffles, 10,000 in [43], in order to obtain these "null" distributions. The exact procedure of Pethel and Hahs [43] would require obtaining such a "null" distribution for each of the 2^n tuples, which would require a number of shuffled trials impossible to obtain computationally. We hence propose a global test that consists in computing 17 different shuffles of the 21 genes, giving, for each degree k, a "null" distribution of shuffled I_k values composed of 17 × \binom{n}{k} values. For example, the tests of 2-dependence and 3-dependence are made against null distributions with 17 × 210 = 3570 I_2 values and 17 × 1330 I_3 values respectively. We fix a p value above which we reject the null hypothesis (a significance level, fixed at p = 0.05 in [43]), allowing to determine the statistical significance thresholds.

Figure 8: Probability and information landscapes of shuffled data. The figure corresponds to the case of the analysis with genes as variables. a, joint and marginal distributions of two genes (genes 4 and 12) for m = 111 population A neurons. b, joint and marginal distributions after a shuffling of the values of expression of each gene. c, the estimated I_k landscape for the expression of 21 genes after shuffling. d, histograms representing the distribution of I_k values for all the degrees until k = 5 for population B. The total number of combinations C(n,k) for each degree (number of pairs for I_2, number of triplets for I_3, etc.) is given in gray. The averaged shuffled values of information obtained with 17 shuffles are represented on each histogram as a black line, together with the statistical significance threshold values for p = 0.05.

This holds for k = 2, as described in [43], but since for k ≥ 3 the I_k can be negative, the test becomes symmetric on the distribution, and hence for k ≥ 3 the p = 0.05 threshold is applied on both sides of the shuffled distribution: whenever the measured I_k are above or below these threshold values, we reject the null hypothesis. In practice, a random generator is used to generate the random permutations (here the NumPy generator [62]), and the present method is not exempt from the possibility that it generates statistical dependences in the higher degrees.

Interpretation of the dependence test. The original interpretation of the test by Pethel and Hahs was that the null hypothesis corresponded to independent distributions, motivated by the statement that "permutation destroys any dependence that may have existed between the datasets but preserves symbol frequencies". However, considering simple analytical examples did not allow us to confirm their statement. We propose that, for a given finite m, random permutations express all the possible statistical dependences that preserve the symbol frequencies (cf. the discussion of E. Borel in [9]). This statement basically corresponds to what we observe in Figure 8. Hence we propose that in a finite context the null hypothesis corresponds to a random k-dependence. The meaning of the presented test is hence that of a selectivity or specificity test: a test of the I_k of a given k-tuple against a null hypothesis of "randomly" selected k-statistical dependences that preserve the marginals and m.
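A sketch of the shuffling test described above is given below (illustrative; the names are ours and not those of [43] or of the Infotopo software). Each column of the discretized matrix is permuted independently, which preserves every marginal law; the I_k of all k-tuples over a small number of shuffles form the "null" distribution, and an observed I_k is declared k-dependent when it falls outside the two-sided p = 0.05 quantiles of that distribution (for k = 2 the original test of [43] is one-sided).

```python
import itertools
import numpy as np

def joint_entropy(bins, subset):
    _, counts = np.unique(bins[:, list(subset)], axis=0, return_counts=True)
    p = counts / bins.shape[0]
    return float(-(p * np.log2(p)).sum())

def I_k(bins, subset):
    return sum((-1) ** (r + 1) * joint_entropy(bins, c)
               for r in range(1, len(subset) + 1)
               for c in itertools.combinations(subset, r))

def shuffled_null(bins, k, n_shuffles=17, rng=None):
    """Null distribution of I_k under column-wise shuffling: each variable is permuted independently,
    which preserves every marginal law but scrambles the joint dependences."""
    rng = rng or np.random.default_rng()
    null = []
    for _ in range(n_shuffles):
        shuffled = np.column_stack([rng.permutation(bins[:, j]) for j in range(bins.shape[1])])
        null.extend(I_k(shuffled, s) for s in itertools.combinations(range(bins.shape[1]), k))
    return np.array(null)

def k_dependence_test(bins, subset, alpha=0.05, rng=None):
    """Two-sided test: reject the null ('random' k-dependence preserving the marginals) when the
    observed I_k falls outside the alpha quantiles of the shuffled distribution."""
    null = shuffled_null(bins, len(subset), rng=rng)
    lo, hi = np.quantile(null, [alpha / 2, 1 - alpha / 2])
    value = I_k(bins, subset)
    return value, not (lo <= value <= hi)

rng = np.random.default_rng(0)
toy = rng.integers(0, 9, size=(111, 6))
print(k_dependence_test(toy, (0, 1, 2), rng=rng))
```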
Figure 9 gives a first simple study of how robust the paths of maximum length are with respect to variations of m and N, in the case of the analysis of genes as variables. The limit N → ∞ recovers the Riemann integration theory and gives the differential entropy with the correcting additive factor log N (theorem 8.3.1 of [13]). The information paths of maximal length identified by our algorithm are relatively stable in the range N = 5, 7, 9, 11 and m = 34, 56, 89, 111, where the m cells were taken among the 111 neurons of population A. If we consider that the paths that only differ by the ordering of the variables are equivalent, then the stability of the two first paths is further and largely improved. The undersampling dimensions obtained in these conditions are k_u(m = 34) = 5, k_u(m = 56) = 6, k_u(m = 89) = 6, k_u(m = 111) = 6, and k_u(N = 5) = 8, k_u(N = 7) = 7, k_u(N = 9) = 6, k_u(N = 11) = 5.

Figure 9: Effect of changing sample size and graining on the identification of gene modules. The figure corresponds to the case of the analysis with genes as variables for the population A neurons. The positive I_k paths of maximum length were computed for a variable number of cells (m, left column) and a variable graining (N, right column). For clarity, only the two positive paths of maximum length are represented (the first in red, the second in black) for each parameter setting, and the direction of each path is indicated by arrowheads. The two positive paths of maximum length for the original setting (N = 9, m = 111) are represented on the scaffold at the top of the figure for comparison. Smaller samples of cells (one random pick of 34, 56 and 89 cells) and larger (N = 11) or smaller (N = 5, N = 7) grainings than the original (N = 9) were tested. Although slight differences in the paths can be seen (especially for N = 11), most of the parameter combinations identify gene modules that strongly overlap with the module identified using the original setting.

In general, the information landscapes can be investigated with the additional dimensions of N and m together with n. This allows to define our landscapes as iso-graining landscapes and to study the appearance of critical points in a way similar to what is done in thermodynamics. In practice, to study more precisely the variations of information depending on N and m and to obtain a 2-dimensional representation, we plot the mean information as a function of N and m together with n, as presented in Figure 10a. We call the obtained landscapes the iso-graining I_k landscapes. The choice of a specific graining N can be done using this representation: a "pertinent" graining should be at a critical point of the landscape (a first minimum of an information path), consistently with the proposition of the work of Reshef and colleagues [46], who used a maximal information coefficient (MIC) depending on the graining (with a more elaborate graining procedure) to detect pairwise associations. We have chosen to illustrate the landscapes with N = 9 according to this criterion and to the undersampling criterion, because the I_k values are close to their maximal values and the sampling size is not too limiting, with k_u = 6 (see Figure 10a). Moreover, this choice of graining size N = 9 is sufficiently far from the critical point to ensure that we are in the condensed phase where interactions are expected. It is well below the analog of the critical temperature (the critical graining size), which according to Figure 10a happens at N_c = 3 (the N for which the critical points cease to be trivial). In general, there is no reason why there should be only one "pertinent" graining.
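A sketch of the graining scan underlying the iso-graining landscapes is given below (illustrative; the actual analysis uses the full I_k paths rather than the plain mean shown here). The data are re-binned for several values of N, the mean I_k per dimension is recomputed, and a "pertinent" graining is read off near the first non-trivial minimum along k.

```python
import itertools
import numpy as np

def grain(D, n_bins):
    lo, hi = D.min(axis=0), D.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / n_bins, 1.0)
    return np.clip(np.floor((D - lo) / width).astype(int), 0, n_bins - 1)

def joint_entropy(bins, subset):
    _, counts = np.unique(bins[:, list(subset)], axis=0, return_counts=True)
    p = counts / bins.shape[0]
    return float(-(p * np.log2(p)).sum())

def mean_I(bins, k):
    """Mean I_k over all k-tuples, one point of the iso-graining <I>(k) curve."""
    n = bins.shape[1]
    vals = []
    for s in itertools.combinations(range(n), k):
        vals.append(sum((-1) ** (r + 1) * joint_entropy(bins, c)
                        for r in range(1, k + 1)
                        for c in itertools.combinations(s, r)))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
D = rng.normal(size=(111, 6))                 # stands in for the (m, n) expression matrix
for n_bins in (2, 5, 9, 11):                  # scan the graining as in the iso-graining landscapes
    bins = grain(D, n_bins)
    curve = [round(mean_I(bins, k), 3) for k in range(1, 5)]
    print(n_bins, curve)                      # look for the first non-trivial minimum along k
```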
The graining algorithm could be improved by applying direct methods of probability density estimation [50] or, more promisingly, persistent homology [15]. Finer methods of estimation (graining) have been developed by Reshef and colleagues [46] in order to estimate pairwise mutual information, with interesting results. Their algorithm presents a lower computational complexity than the estimation on the lattice of partitions, but a higher complexity than the simple one applied here.

Figure 10: Iso-sample-size (m) and iso-graining mean ⟨IP⟩(k) landscapes. The figure corresponds to the case of the analysis with genes as variables for the population A neurons. a, the mean ⟨IP⟩(k) paths are presented for N = 2, ..., 18 and n = 21 genes for the m = 111 population A neurons. The "undersampling" region beyond the k_u is shaded in white and delimited by a black dotted line (the k_u was undetermined for N = 2); for N = 2 the mean ⟨IP⟩(k) path has no non-trivial minimum (it is monotonically decreasing). This N = 2 iso-graining is analog to the non-condensed, disordered phase of non-interacting bodies, ∀ k > 1, ⟨IP⟩(k) ≈ 0. All the other mean ⟨IP⟩(k) paths have non-trivial critical dimensions. The condition N = 9, m = 111 used for the analysis is surrounded by dotted red lines. It was chosen to be in the condensed phase above the critical graining, here N_c = 3, close to the criterion of the maximal mutual information coefficient MIC proposed by Reshef and colleagues (bin surrounded by a green dotted line), and with a not too low undersampling dimension. b, the mean ⟨IP⟩(k) paths are presented for m = 111, ..., 12 population A neurons and n = 21 genes with a number of bins N = 9.

What we call the iso-sample-size I_k landscapes are presented in Figure 10b for the mean I_k. Such an investigation is also important since it monitors what is usually considered as the convergence (or divergence) in probability of the informations. For the estimations below the k_u represented here, the information estimations are quite constant as a function of m, indicating the stability of the estimation with respect to the sample size.

Acknowledgements
Acknowledgments:
This work was funded by the European Research Council (ERC consolidator grant 616827
CanaloHmics to J.M.G.; supporting M.T. and P.B.) and Median Technologies, developed at Median Technologies and UNIS Inserm 1072 - Université Aix-Marseille, and at the Institut de Mathématiques de Jussieu - Paris Rive Gauche (IMJ-PRG), with thanks for the previous support and hosting, since 2007, of the Max Planck Institute for Mathematics in the Sciences (MPI-MIS) and the Complex Systems Institute Paris-Île-de-France (ISC-PIF). This work addresses a deep and warm acknowledgement to the researchers who helped its realization: G. Marrelec and J.P. Vigneaux; or supported and encouraged it: H. Atlan, F. Barbaresco, H. Bénali, P. Bourgine, F. Chavane, J. Jost, A. Mohammad-Djafari, J.P. Nadal, J. Petitot, A. Sarti, J. Touboul.
Supplementary material, contributions and previous versions
Previous version: a partial version of this work was deposited in the methods section of the BioRxiv preprint 168740 in July 2017 and in the preprint [5].
Author contributions:
D.B. and P.B. wrote the paper; P.B. analysed the data; M.T. performed the experiments; M.T. and J.M.G. conceived and designed the experiments; D.B., P.B., M.T. and J.M.G. participated in the conception of the analysis.
Supplementary material:
The software Infotopo is available at https://github.com/pierrebaudot/INFOTOPO
Abbreviations
The following abbreviations are used in this manuscript:
iid   independent identically distributed
DA    Dopaminergic neuron
nDA   non-Dopaminergic neuron
H_k   Multivariate k-joint Entropy
I_k   Multivariate k-Mutual-Information
G_k   Multivariate k-total-correlation or k-multi-information
MIC   Maximal 2-mutual-Information Coefficient
A Appendix: Bayes free energy and Information quantities
A.1 Parametric modelling
As we mentioned in the introduction, the statistical analysis of the data X is confronted with a serious risk of circularity, because the confidence in the model depends on the probability law that the model assumes and in part reconstructs. Several approaches were followed to escape from this circularity; all of them rely on the choice of a set Θ of probability laws in which P_X is searched for. For instance, maintaining the frequentist point of view, the Fisher information metric on Θ (cf. [34]) determines bounds on the confidence. Another popular approach is to choose an a priori probability P_Θ on Θ, and to revise this choice after all the experiments X(z), z ∈ Z, by computing the probability on E × Θ which best explains the results (the new probability on Θ is its marginal, and for each θ in Θ, the probability P_θ on E is its conditional probability). Here a more precise principle is necessary, which expresses a trade-off between the maximization of the marginal probability of the results and the constraint of not being too far from the prior. A popular example is the minimization of the Bayes free energy F_V(P), which appears as the maximum of entropy of the new a posteriori probability under the constraints of predicting the data on average and of departing as little as possible from the a priori probability on the probabilities. This function is given by a Kullback-Leibler distance D_KL. In the finite setting, with a uniform a priori, this consists in maximizing the entropy among the laws that predict the observed distribution. Remark that the two methods, Bayes and Fisher, are related, because in most cases the chosen a priori probability laws (and the data estimation) used in the function F_V are given by frequencies, and because the distance D_KL(P, Q) is approximated by the Fisher metric at P when Q approaches P.

A.2 Bethe approximation
Let us recall that, for two probability laws P, Q on the same finite set Ω, the Kullback-Leibler divergence from P to Q is defined by

D_{KL}(P, Q) = \sum_{x \in \Omega} P_x \ln \frac{P_x}{Q_x} = E_P(-\ln Q) - H(P).    (75)

Contrarily to its name, it is not a true distance, because it is not symmetric; however it is always non-negative and it is equal to zero if and only if P = Q. Another drawback is that it can be +∞: this is the case when there exists x such that Q_x = 0 but P_x > 0, i.e. when P is not absolutely continuous with respect to Q.

The Kullback-Leibler divergence permits to define the Bayes free energy functional as follows. The unknown is the probability law P_b on E × Θ:

F_V(P_b) = D_{KL}(P_b, P_L \otimes P_a) = \sum_{x_L, \theta} \left( \ln \frac{P_b(x_L, \theta)}{P_L(x_L) P_a(\theta)} \right) P_b(x_L, \theta),    (76)

where P_a(θ) is the a priori on the probability laws and where P_L(s) represents the new partial data, collected by a collection of variables X_L and expressed by a probability law. Equivalently,

F_V(P_b) = E_{P_b}\big(-\ln P_a + D_{KL}((X_L)_* P_\theta, P_L)\big) - H(P_b).    (77)

This function looks like a free energy in Statistical Physics, that is, the sum of the negentropy and of the mean of an energy function. Here we assume that Ω = E_S for a family of variables S_i, i = 1, ..., N, and the states are the possible values of the joint variable S. Due to the strict convexity of the negentropy, F_V has a unique minimum, which defines the equilibrium state. Practically, the full entropy is difficult to estimate, thus approximations were introduced, following Bethe and Kikuchi (cf. Mori [40]), generalizing the Mean Field Theory. These approximations are no longer convex in the unknown P_b; they are obtained by replacing the full entropy H by a convenient linear combination of entropies of more accessible variables (observable quantities). It is here that the information functions H_k and I_k appear in the Bayesian variational calculus (cf. Mori [40]).

Consider a simplicial complex K in the simplex ∆([N]), i.e. a collection of faces that contains every face inside each face it contains, and assume that K is a combinatorial (PL) manifold of dimension d, with possibly a boundary that is a combinatorial (PL) manifold ∂K; then the Bethe function associated to K is given by the two equivalent following formulas:

F_B(Q) = E_Q(-\ln f) - \sum_{I \in K^*} (-1)^{d - |I|} H(S_I),    (78)

where the sum is taken over the set K^* of the faces not contained in ∂K, and |I| denotes the dimension of the face I;

F_B(Q) = E_Q(-\ln f) - \sum_{J \in K} (-1)^{|J| + 1} I_{|J|}(S_J; Q),    (79)

where the sum is taken over all the faces of K, including the boundary, and I_{|J|}(S_J; Q) is the higher mutual information considered everywhere above in the text.

References

[1]
Artin, M., Grothendieck, A., and Verdier, J.
Théorie des topos et cohomologie étale des schémas (SGA 4), Vol. I, II, III. Séminaire de Géométrie Algébrique du Bois Marie. Berlin; New York: Springer-Verlag, Lecture Notes in Mathematics, 1972, 1964. [2]
Atick, J.
Could information theory provide an ecological theory of sensory processing.
Network: Compu-tation in Neural Systems. 3 (1992), 213–251.[3]
Baudot, P.
Natural computation: much ado about nothing? an intracellular study of visual coding innatural condition. Master’s thesis, Paris 6 university, 2006.[4]
Baudot, P., and Bennequin, D.
The homological nature of entropy.
Entropy 17(5) (2015), 3253–3318.[5]
Baudot, P., Tapia, M., and Goaillard, J.
Topological information data analysis: Poincaré-Shannon machine and statistical physics of finite heterogeneous systems. (2018). [6]
Bell, A.
The co-information lattice. in ICA 2003, Nara, Japan, (2003).[7]
Bertschinger, N., Rauh, J., Olbrich, E., Jost, J., and Ay, N.
Quantifying unique information.
Entropy 16 (2014), 2161–2183. [8]
Boole, G.
An Investigation Of The Laws Of Thought On Which Are Founded The Mathematical TheoriesOf Logic And Probabilities.
McMillan and Co., 1854.[9]
Borel, E.
La mécanique statistique et l'irréversibilité.
J. Phys. Theor. Appl. 3 , 1 (1913), 189–196.[10]
Brenner, N., Bialek, W., and de Ruyter van Steveninck, R.
Adaptive rescaling maximizes infor-mation transmission.
Neuron 26 (2000), 695–702. [11]
Brenner, N., Strong, S., Koberle, R., and Bialek, W.
Synergy in a neural code.
Neural Computa-tion. 12 (2000), 1531–1552.[12]
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., and de Ruyter van Steveninck, R. R.
Synergy in a neural code.
Neural Comput 12 , 7 (2000), 1531–52.[13]
Cover, T., and Thomas, J.
Elements of Information Theory.
Wiley Series in Telecommunication, 1991.[14]
Dawkins, R.
The Selfish Gene.
Oxford University Press. 1st ed., 1976.[15]
Epstein, C., Carlsson, G., and Edelsbrunner, H.
Topological data analysis.
Inverse Probl. 27 (2011),120201.[16]
Fano, R.
Transmission of Information: A Statistical Theory of Communication . MIT Press, 1961.[17]
Gibbs, J.
A method of geometrical representation of the thermodynamic properties of substances by meansof surfaces.
Trans. of the connecticut Acad. 2 (1873), 382–404.[18]
Gibbs, J.
Elementary Principles in Statistical Mechanics. Dover edition (1960 reprint), 1902. [19]
Griffith, V., and Koch, C.
Quantifying synergistic mutual information.
In Guided Self-Organization:Inception; Prokopenko, M., Ed.; Springer: Berlin/Heidelberg, Germany (2014), 159–190.[20]
Hagberg, A., Schult, D., and Swart, P.
Exploring network structure, dynamics, and function usingnetworkx.
Proceedings of the 7th Python in Science Conference (SciPy2008). Gel Varoquaux, Travis Vaught,and Jarrod Millman (Eds), (Pasadena, CA USA) (2008), 11–15.[21]
Han, T. S.
Linear dependence structure of the entropy space.
Information and Control. vol. 29 (1975), p.337–368.[22]
Han, T. S.
Nonnegative entropy measures of multivariate symmetric correlations.
IEEE Information andControl 36 (1978), 133–156.[23]
Hu, K. T.
On the amount of information.
Theory Probab. Appl. 7(4) (1962), 439–447.[24]
Hughes, G.
On the mean accuracy of statistical pattern recognizers.
IEEE Transactions on InformationTheory 14 , 1 (1968), 55–63.[25]
Humplik, J., and Tkacik, G.
Probabilistic models for neural populations that naturally capture globalcoupling and criticality.
PLoS Comput Biol. 13 , 9 (2017).[26]
Hunter, J.
Matplotlib: a 2d graphics environment.
Comput. Sci. Eng. 9 (2007), 22–30.[27]
Kay, J., Ince, R., Dering, B., and Phillips, W.
Partial and entropic information decompositions of aneuronal modulatory interaction.
Entropy 19 , 11 (2017), 560.[28]
Kellerer, H.
Masstheoretische marginalprobleme.
Math. Ann. 153 (1964), 168–198.[29]
Khinchin, A.
Mathematical Foundations of Information Theory. New York: Dover. Translated by R. A. Silverman and M. D. Friedman from two Russian articles in Uspekhi Matematicheskikh Nauk, 7 (1953): 3–20 and 9 (1956): 17–75, 1957. [30]
Kim, H., Watkinson, J., Varadan, V., and Anastassiou, D.
Multi-cancer computational analysisreveals invasion-associated variant of desmoplastic reaction involving inhba, thbs2 and col11a1.
BMC MedicalGenomics 3:51 (2010).[31]
Kolmogorov, A. N.
Grundbegriffe der Wahrscheinlichkeitsrechnung.(English translation (1950): Foun-dations of the theory of probability.) . Springer, Berlin (Chelsea, New York)., 1933.[32]
Landauer, R.
Irreversibility and heat generation in the computing process.
IBM Journal of Research andDevelopment 5 (3) (1961), 183–191.[33]
Laughlin, S.
A simple coding procedure enhances the neuron’s information capacity.
Z. Naturforsch 36 ,c (1981), 910–912.[34]
Ly, A., Marsman, M., Verhagen, J., Grasman, R., and Wagenmakers, E.-J.
A tutorial on fisherinformation. eprint arXiv:1705.01064 (2017), 1–59.[35]
Margolin, A., Wang, K., Califano, A., and Nemenman, I.
Multivariate dependence and geneticnetworks inference.
IET Syst Biol. 4 , 6 (2010), 428–40.[36]
Matsuda, H.
Information theoretic characterization of frustrated systems.
Physica A: Statistical Mechanicsand its Applications. 294 (1-2) (2001), 180–190.[37]
Matus, F. Discrete marginal problem for complex measures.
Kybernetika 24 , 1 (1988).[38]
McGill, W.
Multivariate information transmission.
Psychometrika 19 (1954), p. 97–116.[39]
Merchan, L., and Nemenman, I.
On the sufficiency of pairwise interactions in maximum entropy modelsof networks.
J Stat Phys 162 (2016), 1294–1308.[40]
Mori, R.
New Understanding of the Bethe Approximation and the Replica Method . PhD thesis, KyotoUniversity, 2013.[41]
Nemenman, I., Bialek, W., and de Ruyter van Steveninck, R.
Entropy and information in neuralspike trains: Progress on the sampling problem.
Physical review E 69 (2004), 056111–.[42]
Olbrich, E., Bertschinger, N., and Rauh, J.
Information decomposition and synergy. Entropy 17, 5 (2015), 3501–3517. [43]
Pethel, S., and Hahs, D.
Exact test of independence using mutual information.
Entropy 16 (2014),2839–2849.[44]
Rauh, J., Bertschinger, N., Olbrich, E., and Jost, J.
Reconsidering unique information: To-wards a multivariate information decomposition.
IEEE International Symposium on Information Theory -arXiv:1404.3146 (2014).[45]
Reimann, M., Nolte, M., Scolamiero, M., Turner, K., Perin, R., Chindemi, G., Dotko, P.,Levi, R., Hess, K., and Markram, H.
Cliques of neurons bound into cavities provide a missing linkbetween structure and function.
Front Comput Neurosci. 12 , 11 (2017).[46]
Reshef, D., Reshef, Y., Finucane, H., Grossman, S., McVean, G., Turnbaugh, P., Lander, E.,Mitzenmacher, M., and Sabeti, P.
Detecting novel associations in large data sets.
Science 334 (2011),1518.[47]
Rota, G.
On the foundations of combinatorial theory i. theory of moebius functions.
Z. Wahrseheinlichkeit-stheorie 2 (1964), 340–368.[48]
Schneidman, E., Berry 2nd, M., Segev, R., and Bialek, W.
Weak pairwise correlations implystrongly correlated network states in a neural population.
Nature 440 (2006), 1007–1012.[49]
Schneidman, E., Bialek, W., and Berry, M. n.
Synergy, redundancy, and independence in populationcodes.
J Neurosci. 23 , 37 (2003).[50]
Scott, D.
Multivariate Density Estimation. Theory, Practice and Visualization.
New York: Wiley, 1992.[51]
Shannon, C.
A lattice theory of information.
Trans. IRE Prof. Group Inform. Theory 1 (1953), 105–107.[52]
Shannon, C. E.
A mathematical theory of communication.
The Bell System Technical Journal 27 (1948),379–423.[53]
Shipman, J.
Tkinter reference: a GUI for Python.
New Mexico Tech Computer Center, Socorro, NewMexico (2010).[54]
Slonim, N., Atwal, G., Tkacik, G., and Bialek, W.
Information-based clustering.
Proc Natl AcadSci U S A. 102 , 51 (2005).[55]
Strong, S., de Ruyter van Steveninck, R., Bialek, W., and Koberle, R.
On the application ofinformation theory to neural spike trains.
Pac Symp Biocomput (1998), 621–32.[56]
Studeny, M., and Vejnarova, J.
The multiinformation function as a tool for measuring stochasticdependence. in M I Jordan, ed., Learning in Graphical Models, MIT Press, Cambridge (1999), 261–296.[57]
Tapia, M., Baudot, P., Dufour, M., Formizano-Treziny, C., Temporal, S., Lasserre, M.,Kobayashi, K., and J.M., G.
Information topology of gene expression profile in dopaminergic neurons.
BioArXiv168740 (2017).[58]
Tapia, M., Baudot, P., Formizano-Treziny, C., Dufour, M., Temporal, S., Lasserre, M.,Marqueze-Pouey, B., Gabert, J., Kobayashi, K., and J.M., G.
Neurotransmitter identity andelectrophysiological phenotype are genetically coupled in midbrain dopaminergic neurons.
Sci. Rep. 8 , 1(2018), 13637.[59]
Tkacik, G., Marre, O., Amodei, D., Schneidman, E., Bialek, W., and Berry, M. n.
Searchingfor collective behavior in a large network of sensory neurons.
PLoS Comput Biol. 10, 1 (2014). [60]
Tkacik, G., Marre, O., Amodei, D., Schneidman, E., Bialek, W., and Berry, M. n.
Searchingfor collective behavior in a large network of sensory neurons.
PLoS Comput Biol. 10 , 1 (2014).[61]
Tononi, G., and Edelman, G.
Consciousness and complexity.
Science 282 (1998), 1846–1851.[62]
Van Der Walt, S., Colbert, C., and Varoquaux, G.
The numpy array: a structure for efficientnumerical computation.
Comput. Sci. Eng. 13 (2011), 22–30.[63]
Vigneaux, J.
The structure of information: from probability to homology. arXiv:1709.07807 (2017).[64]
Vigneaux, J.
Topology of statistical systems. A cohomological approach to information theory. PhD thesis, Paris 7 Diderot University, 2019. [65]
Watanabe, S.
Information theoretical analysis of multivariate correlation.
IBM Journal of Research andDevelopment 4 (1960), 66–81.[66]
Watkinson, J., Liang, K., Wang, X., Zheng, T., and Anastassiou, D.
Inference of regulatory geneinteractions from expression data using three-way mutual information.
The Challenges of Systems Biology:Ann. N.Y. Acad. Sci. 1158 (2009), 302–313.[67]
Wibral, M., Finn, C., Wollstadt, P., Lizier, J., and Priesemann, V.
Quantifying informationmodification in developing neural networks via partial information decomposition.
Entropy 19 , 9 (2017),494.[68]
Williams, P., and Beer, R.
Nonnegative decomposition of multivariate information. arXiv:1004.2515v1 (2010).[69]
Yedidia, J., Freeman, W., and Weiss, Y.
Understanding belief propagation and its generalizations.