Estimating the Mutual Information between two Discrete, Asymmetric Variables with Limited Samples
Damián G. Hernández
Department of Medical Physics, Centro Atómico Bariloche and Instituto Balseiro, San Carlos de Bariloche, Argentina
Inés Samengo
Department of Medical Physics, Centro Atómico Bariloche and Instituto Balseiro, San Carlos de Bariloche, Argentina

May 7, 2019

ABSTRACT
Determining the strength of non-linear statistical dependencies between two variables is a crucial matter in many research fields. The established measure for quantifying such relations is the mutual information. However, estimating mutual information from limited samples is a challenging task. Since the mutual information is the difference of two entropies, the existing Bayesian estimators of entropy may be used to estimate information. This procedure, however, is still biased in the severely under-sampled regime. Here we propose an alternative estimator that is applicable to those cases in which the marginal distribution of one of the two variables (the one with minimal entropy) is well sampled. The other variable, as well as the joint and conditional distributions, can be severely undersampled. We obtain an estimator that presents very low bias, outperforming previous methods even when the sampled data contain few coincidences. As with other Bayesian estimators, our proposal focuses on the strength of the interaction between two discrete variables, without seeking to model the specific way in which the variables are related. A distinctive property of our method is that the main data statistic determining the amount of mutual information is the inhomogeneity of the conditional distribution of the low-entropy variable in those states (typically few) in which the large-entropy variable registers coincidences.

Keywords: Bayesian estimation, mutual information, bias, sampling

Inferring the statistical dependencies between two variables from a few measured samples is a ubiquitous task in many areas of study. Variables are often linked through non-linear relations, which contain stochastic components. The standard measure employed to quantify the amount of dependency is the mutual information, defined as the reduction in entropy of one of the variables when conditioning on the other variable [1, 2]. If the states of the joint distribution are well sampled, the joint probabilities can be estimated by the observed frequencies, yielding the maximum-likelihood estimator of mutual information. However, this procedure on average over-estimates the mutual information [3, 4, 5], so that independent variables may appear to be correlated, especially when the number of samples is small.

The search for an estimator of mutual information that remains approximately unbiased even with small data samples is an open field of research [6, 7, 8, 9, 10, 11]. Here we focus on discrete variables, and assume it is not possible to overcome the scarceness of samples by grouping elements that are close according to some metric. In addition to corrections that only work in the limit of large samples [12], the state of the art for this problem corresponds to quasi-Bayesian methods that estimate mutual information indirectly through measures of the entropies of the involved variables [8, 13, 14]. These approaches have the drawback of not being strictly Bayesian, since the linear combination of two or more Bayesian estimates of entropies does not, in general, yield a Bayesian estimator of the combination of entropies [8]. The concern is not so much to remain within theoretical Bayesian purity, but rather, to avoid frameworks that may be unnecessarily biased, or where negative estimates of information may arise.
Here we propose a new method for estimating mutual information that is valid in the specific case in which there is an asymmetry between the two variables: one of them has a large number of effective states, and the other only a few. No hypotheses are made about the probability distribution of the large-entropy variable, but the marginal distribution of the low-entropy variable is assumed to be well sampled. The prior is chosen so as to accurately represent the amount of dispersion of the conditional distribution of the low-entropy variable around its marginal distribution. The main finding is that our estimator has very low bias, even in the severely under-sampled regime where there are few coincidences, that is, when a given state of the large-entropy variable is only seldom sampled more than once. The key data statistic that determines the estimated information is the inhomogeneity of the distribution of the low-entropy variable in those states of the high-entropy variable where two or more samples are observed. In addition to providing a practical algorithm to estimate mutual information, our approach sheds light on the way in which just a few samples reveal those specific properties of the underlying joint probability distribution that determine the amount of mutual information.
We seek a low-bias estimate of the mutual information between two discrete variables. Let X be a random variable with a large number k_x of effective states {x_1, ..., x_{k_x}} with probabilities q_x, and Y be a variable that varies in a small set y ∈ {y_1, ..., y_{k_y}}, with k_y ≪ k_x. Given the conditional probabilities q_{y|x}, the marginal and joint probabilities are q_y = Σ_x q_x q_{y|x} and q_{xy} = q_x q_{y|x}, respectively. The entropy H(Y) is

H(Y) = − Σ_y q_y log q_y,   (1)

and can be interpreted as the average number of well-chosen yes/no questions required to guess the sampled value of Y (when using a logarithm of base two). The conditional entropy H(Y|X) is the average uncertainty of the variable Y once X is known,

H(Y|X) = Σ_x q_x [ − Σ_y q_{y|x} log q_{y|x} ] = Σ_x q_x H(Y|x).   (2)

The mutual information is the reduction in uncertainty of one variable once we know the other [2],

I(X, Y) = H(X) + H(Y) − H(X, Y) = H(Y) − H(Y|X).   (3)

Our aim is to estimate I(X, Y) when Y is well sampled, but X is severely undersampled, in particular, when the sampled data contain few coincidences in X. Hence, for most values x, the number of samples n_x is too small to estimate the conditional probability q_{y|x} from the frequencies n_{xy}/n_x. In fact, when n_x ∼ O(1), the maximum-likelihood estimator typically underestimates H(Y|x) severely [5], and consequently leads to an overestimation of I(X, Y).

One possibility is to estimate H(X), H(Y) and H(X, Y) using a Bayesian estimator, and then plug the obtained values into Eq. 3 to estimate the mutual information. We now discuss previous approaches to Bayesian estimators of entropy, to later analyze the case of information. For definiteness, we focus on H(X), but the same logic applies to H(Y) or H(X, Y).

The Bayesian estimator is the expected value ⟨H|n⟩ of H(q), where q denotes the unknown probabilities q_{x_1}, ..., q_{x_k}, and n = (n_1, ..., n_k) represents the number of samples obtained in each state. That is,

⟨H|n⟩ = ∫ dq H(q) p(q|n) = [p(n)]^{-1} ∫ dq H(q) p(n|q) p(q).   (4)

Since p(n|q) is the multinomial distribution

p(n|q) = N! ∏_x q_x^{n_x} / n_x!,   (5)

and since the normalization constant p(n) can be calculated from the integral

p(n) = ∫ dq p(n|q) p(q),   (6)

the entire gist of the Bayesian approach is to find an adequate prior p(q) to plug into Eqs. 4, 5 and 6. For the sake of analytical tractability, p(q) is often decomposed into a weighted combination of distributions p(q|β) that can be easily integrated, each tagged by one or a few parameters, here generically called β, that vary within a certain domain,

p(q) = ∫ dβ p(β) p(q|β).   (7)
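As a point of reference for Eqs. 1–3, the following minimal sketch (Python, assuming NumPy is available; the function and variable names are ours, not taken from the paper) computes the maximum-likelihood, or "plug-in", estimate of H(Y), H(Y|X) and I(X, Y) from a table of joint counts n_xy. This is the estimator that, as discussed above, tends to over-estimate the information when X is undersampled.

import numpy as np

def plugin_mutual_information(counts):
    """Maximum-likelihood (plug-in) estimate of I(X;Y) in nats.

    counts: array of shape (k_x, k_y) with the joint counts n_xy.
    Returns (H_Y, H_Y_given_X, I) as in Eqs. 1-3, with observed
    frequencies used in place of the true probabilities.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    q_xy = counts / N                      # hat{q}_xy = n_xy / N
    q_x = q_xy.sum(axis=1)                 # hat{q}_x
    q_y = q_xy.sum(axis=0)                 # hat{q}_y

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    H_Y = entropy(q_y)                                         # Eq. 1
    H_Y_given_X = sum(q_x[i] * entropy(q_xy[i] / q_x[i])
                      for i in range(len(q_x)) if q_x[i] > 0)  # Eq. 2
    return H_Y, H_Y_given_X, H_Y - H_Y_given_X                 # Eq. 3

When many states x are sampled only once, each conditional entropy entering Eq. 2 is estimated as zero, which is the source of the upward bias of the plug-in information.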
The decomposition requires introducing a prior p(β). Hence, the former search for an adequate prior p(q) is now replaced by the search for an adequate prior p(β). The replacement implies an assumption and also a simplification. The family of priors that can be generated by Eq. 7 does not encompass the entire space of possible priors. The decomposition relies on the assumption that the remaining family is still rich enough to make good inferences about the quantity of interest, in this case, the entropy. The simplification stems from the fact that the search for p(β) is more restricted than the search for p(q), because the space of possible alternatives is smaller (the dimensionality of q is typically high, whereas the one of β is low). Two popular proposals of Bayesian estimators for entropies are NSB [13] and PYM [14]. In NSB, the functions p(q|β) are Dirichlet distributions, in which β takes the role of a concentration parameter. In PYM, these functions are Pitman-Yor processes, and β stands for two parameters: one accounting for the concentration, and the other for the so-called discount. In both cases, the Bayesian machinery implies

⟨H|n⟩ = [1/p(n)] ∫ dβ p(β) W(β|n),   (8)

where W(β|n) is the weight of each β in the estimation of the expected entropy,

W(β|n) = ∫ dq H(q) p(n|q) p(q|β).   (9)

When choosing the family of functions p(q|β), it is convenient to select them in such a way that the weight W(β|n) can be solved analytically. However, this is not the only requirement. In order to calculate the integral in β, the prior p(β) also plays a role. The decomposition of Eq. 7 becomes most useful when the arbitrariness in the choice of p(β) is less serious than the arbitrariness in the choice of p(q). This assumption is justified when W(β|n) is peaked around a specific β value, so that in practice, the shape of p(β) hardly has an effect. In these cases, a narrow range of relevant β values is selected by the sampled data, and all assumptions about the prior probability outside this range play a minor role. For the choices of the families p(q|β) proposed by NSB and PYM, W(β|n) can be calculated analytically, and one can verify that indeed, a few coincidences in the data suffice for a peak to develop. In both cases, the selected β is one for which p(q|β) favours a range of q values that are compatible with the measured data (as assessed by p(n|q)), and also produce non-negligible entropies (Eq. 9).

When the chosen Bayesian estimates of the entropies are plugged into Eq. 3 to obtain an estimate of the information, each term is dominated by its own preferred β. Since the different entropies are estimated independently, the β values selected by the data to dominate the priors p(q_x) and p(q_y) need not be compatible with the ones dominating the priors of the joint or the conditional distributions. As a consequence, the estimation of the mutual information is no longer Bayesian, and can suffer from theoretical issues, for example, yielding a negative estimate [8].

A first alternative would be to consider an integrable prior containing a single β for the joint probability distribution q_{xy}, and then replace H by I in the equations above, to calculate ⟨I⟩. This procedure was tested by Archer et al.
[8], and the results were only good when the collection of q_{xy} values governing the data were well described by a distribution that was contained in the family of proposed priors p(q|β). The authors concluded that mixtures of Dirichlet priors do not provide a flexible enough family of priors for highly-structured joint distributions, at least for the purpose of estimating mutual information.

To make progress, we note that I(X, Y) can be written as

I(X; Y) = Σ_x q_x Σ_y q_{y|x} log( q_{y|x} / q_y ) = Σ_x q_x D_KL( q_{y|x} || q_y ),   (10)

where q_{y|x} and q_y stand for the k_y-dimensional vectors (q_{y_1|x}, ..., q_{y_{k_y}|x}) and (q_{y_1}, ..., q_{y_{k_y}}), and D_KL represents the Kullback-Leibler divergence. The average divergence between q_{y|x} and q_y captures a notion of spread. Therefore, the mutual information is sensitive not so much to the value of the probabilities q_{y|x}, but rather, to their degree of scatter around the marginal q_y. The parameters controlling the prior should hence be selected in order to match the width of the distribution of q_{y|x} values, and not so much each individual probability. With this intuition in mind, in this paper we put forward a new prior for the whole ensemble of conditional probabilities q_{y|x} obtained for different x values. In this prior, the parameter β controls the spread of the conditionals q_{y|x} around the marginal q_y.

Our approach is valid when the total number of samples N is at least of the order of magnitude of √(e^{H(X)}), since in this regime, some of the x states are expected to be sampled more than once [15, 16]. In addition, the marginal distribution q_y must be well sampled. This regime is typically achieved when X has a much larger set of available states than Y. In this case, the maximum-likelihood estimators q̂_y of the marginal probabilities q_y can be assumed to be accurate, that is,

q̂_y = n_y / N ≈ q_y,   ∀ y.   (11)

In this paper, we put forward a Dirichlet prior distribution centered at q̂_y, that is,

p({q_{y|x}} | β) = Γ(β)^{k_x} ∏_{xy} q_{y|x}^{β q̂_y − 1} / Γ(β q̂_y) ∝ exp[ − β Σ_x D_KL( q̂_y || q_{y|x} ) ] / ∏_{xy} q_{y|x},   (12)

where {q_{y|x}} contains the k_x conditional probability vectors q_{y|x} corresponding to the different x values. Large β values select conditional probabilities close to q̂_y, while small values imply a large spread, which pushes the selection towards the border of the k_y-simplex.

For the moment, for simplicity we work with a prior p({q_{y|x}}) defined on the conditional probabilities q_{y|x}, and make no effort to model the prior probability of the vector q_x. In practice, we estimate the values of q_x with the maximum-likelihood estimator q̂_x = n_x/N. Since X is assumed to be severely undersampled, this is a poor procedure to estimate q_x. Still, the effect on the mutual information turns out to be negligible, since the only role of q_x in Eq. 10 is to weigh each of the Kullback-Leibler divergences appearing in the average. If k_x is large, each D_KL value will appear in several terms of the sum, rendering the individual value of the accompanying q_x irrelevant; only the sum of them matters. In Sect. 6, we tackle the full problem of making Bayesian inference both on q_x and on {q_{y|x}}.

The choice of prior of Eq. 12 is inspired by three facts.
First, β captures the spread of q_{y|x} around q_y, as implied by the Kullback-Leibler divergence in Eq. 12. Admittedly, this divergence is not exactly the one governing the mutual information (Eq. 10), since q_{y|x} and q_y are interchanged. Yet, it is still a measure of spread. The exchange, as well as the denominator in Eq. 12, were introduced for the sake of the second fact, namely, analytical tractability. The third fact regards the emergence of a single relevant β when the sampled data begin to register coincidences. If we follow the Bayesian rationale of the previous section, now replacing the entropy by the mutual information, we can again define a weight W(β|n) for the parameter β,

W(β|n) = ∫ {dq_{y|x}} I(q̂_x, {q_{y|x}}) p(n | q̂_x, {q_{y|x}}) p({q_{y|x}} | β) = p(β|n) F(β, n),

where F(β, n) can be obtained analytically, and is a well-behaved function of its arguments, whereas

p(β|n) = p(β) p(n|β) / p(n) = [p(β)/p(n)] ∫ {dq_{y|x}} p(n | q̂_x, {q_{y|x}}) p({q_{y|x}} | β)
       = [p(β)/p(n)] ∏_x [ Γ(β)/Γ(n_x + β) ] ∏_{y=1}^{k_y} Γ(n_{xy} + β q̂_y) / Γ(β q̂_y).   (13)

For each x, the vector q_{y|x} varies in a k_y-dimensional simplex. For p(n | q̂_x, {q_{y|x}}) we take the multinomial

p(n | q̂_x, {q_{y|x}}) = N! ∏_{xy} [ q̂_x q_{y|x} ]^{n_{xy}} / n_{xy}!.   (14)

The important point here is that the ratio of Gamma functions in Eq. 13 develops a peak in β as soon as the collected data register a few coincidences in x. Hence, with few samples, the prior proposed in Eq. 12 renders the choice of p(β) inconsequential.

Assuming that the marginal probability of Y is well sampled, the entropy H(Y) is well approximated by the maximum-likelihood estimator Ĥ(Y) = − Σ_y (n_y/N) log(n_y/N). For each β, the expected posterior information can be calculated analytically,

⟨I|n⟩(β) = Ĥ(Y) − Σ_x (n_x/N) [ ψ(β + n_x + 1) − Σ_y ((β q̂_y + n_{xy})/(β + n_x)) ψ(β q̂_y + n_{xy} + 1) ],   (15)

where ψ is the digamma function. When the system is well sampled, n_{xy} ≫ 1, so the effect of β becomes negligible, the digamma functions tend to logarithms, and the frequencies match the probabilities. In this limit, Eq. 15 coincides with the maximum-likelihood estimator, which is consistent. The rest of the paper focuses on the case in which the marginal probability of X is severely undersampled.

Figure 1: A scheme of our method to estimate the mutual information between two variables X and Y. a: We collect a few samples of a variable x with a large number of effective states x_1, x_2, ..., each sample characterized by a binary variable y (the two values represented in white and gray). We consider different hypotheses about the strength with which the probability of each y value varies with x. b: One possibility is that the conditional probability of each of the two y values hardly varies with x. This situation is modeled by assuming that the different q_{y|x} are random variables governed by a Beta distribution with a large hyper-parameter β. c: On the other hand, the conditional probability q_{y|x} could vary strongly with x. This situation is modeled by a Beta distribution with a small hyper-parameter β. d: As β varies, so does the prior mutual information (Eq. 17).
If the distribution p(q|β) is sampled repeatedly for a fixed β, the prior information ⟨I(q)⟩ may fluctuate from sample to sample. The shaded area around the solid line illustrates such fluctuations when k_x = 50.

In this section, for simplicity we take q_{y=0} = q_{y=1} = 1/2, such that H(Y) = log 2 nats. In this case, the Dirichlet prior of Eq. 12 becomes a Beta distribution,

p(q_{1|x} | β) = [ Γ(β) / Γ(β/2)² ] [ q_{1|x} (1 − q_{1|x}) ]^{β/2 − 1}.   (16)

Large values of β mostly select conditional probabilities q_{y|x} close to 1/2. If all conditional probabilities are similar, and similar to the marginal, the mutual information is low, since the probability of sampling a specific y value hardly depends on x. Instead, small values of β produce conditional probabilities q_{y|x} near the borders (q_{1|x} ≈ 0 or q_{1|x} ≈ 1). In this case, q_{y|x} is strongly dependent on x (see Fig. 1b), so the mutual information is large. The expected prior mutual information ⟨I⟩(β) can be calculated using the analytical approach developed by [17, 14],

⟨I⟩(β) = log 2 − ψ(β + 1) + ψ(β/2 + 1).   (17)

The prior information is a slowly-varying function of the order of magnitude of β, namely of log β. Therefore, if a uniform prior in information is desired, it suffices to choose a prior on log β such that p(log β) ∝ |∂_{log β} ⟨I⟩(β)|, that is,

p(log β) ∝ (β/2) | 2 ψ₁(β + 1) − ψ₁(β/2 + 1) |,   (18)

where ψ₁ is the first polygamma function. When k_y = 2, the expected posterior information (Eq. 15) becomes

⟨I|n⟩(β) = Ĥ(Y) − Σ_x (n_x/N) [ ψ(β + n_x + 1) − Σ_{y∈{0,1}} ((β/2 + n_{xy})/(β + n_x)) ψ(β/2 + n_{xy} + 1) ].   (19)
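For the symmetric binary case, Eqs. 17 and 19 are straightforward to evaluate. The sketch below (Python with NumPy/SciPy; the function names are ours) computes the prior information ⟨I⟩(β) and the posterior mean ⟨I|n⟩(β) from the per-state counts (n_x0, n_x1) for a given β; how β itself is selected from the data is the subject of the next paragraphs.

import numpy as np
from scipy.special import digamma

def prior_information(beta):
    """Expected prior information <I>(beta) of Eq. 17, in nats."""
    return np.log(2.0) - digamma(beta + 1.0) + digamma(beta / 2.0 + 1.0)

def posterior_information(counts, beta):
    """Posterior mean information <I|n>(beta) of Eq. 19, in nats.

    counts: array of shape (k_x, 2) holding the counts (n_x0, n_x1) of the
    two y values in each sampled state x (unsampled states may be omitted).
    Written for the symmetric case q_y = 1/2 discussed in the text.
    """
    counts = np.asarray(counts, dtype=float)
    n_x = counts.sum(axis=1)
    N = n_x.sum()
    p_y = counts.sum(axis=0) / N
    H_Y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))        # hat{H}(Y)
    # conditional-entropy term of Eq. 19, one contribution per state x
    w = (beta / 2.0 + counts) / (beta + n_x)[:, None]
    H_Y_given_X = np.sum((n_x / N) *
                         (digamma(beta + n_x + 1.0) -
                          np.sum(w * digamma(beta / 2.0 + counts + 1.0), axis=1)))
    return H_Y - H_Y_given_X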
The marginal likelihood of the data given β is also analytically tractable. The likelihood is binomial for each x, so

p(n|β) = ∏_x ∫ dq_{1|x} p(n_{x0}, n_{x1} | q_{1|x}) p(q_{1|x} | β) ∝ ∏_x Γ(n_{x0} + β/2) Γ(n_{x1} + β/2) Γ(β) / [ Γ(n_x + β) Γ(β/2)² ].   (20)

The posterior for β can be obtained by adding a prior p(β), as p(β|n) ∝ p(n|β) p(β). The role of the prior becomes relevant when the number of coincidences is too low for the posterior to develop a peak (see below).

In order to gain intuition about the statistical dependence between variables with few samples, we here highlight the specific aspects of the data that influence the estimator of Eq. 19. Grouping together the terms of Eq. 20 that are equal, the marginal likelihood can be rewritten in terms of the multiplicities m_{nn'}, that is, the number of states x with specific occurrences {n_{x0} = n, n_{x1} = n'} or {n_{x0} = n', n_{x1} = n},

log p(n|β) = Σ_{n ≥ n'} m_{nn'} log[ Γ(n + β/2) Γ(n' + β/2) Γ(β) / ( Γ(n + n' + β) Γ(β/2)² ) ] = Σ_{n ≥ n'} m_{nn'} log p_{nn'}(β),   (21)

where

p_{10}(β) = (β/2)/β = 1/2,
p_{11}(β) = (β/2)(β/2) / [ β(β + 1) ] = β / [ 4(β + 1) ],
p_{20}(β) = (β/2)(β/2 + 1) / [ β(β + 1) ] = (β/2 + 1) / [ 2(β + 1) ],
...
p_{nn'}(β) = (β/2)(β/2 + 1) ⋯ (β/2 + n − 1) (β/2)(β/2 + 1) ⋯ (β/2 + n' − 1) / [ β(β + 1) ⋯ (β + n + n' − 1) ].   (22)

The posterior for β is independent of states x with just a single count, since p_{10}(β) is constant. Only states x with coincidences matter. In order to see how the sampled data favor a particular β, we search for the β value that maximizes log p(n|β) in the particular case where at most two samples coincide on the same x, obtaining

∂/∂β log p(n|β) = m_{11}/β + m_{20}/(β + 2) − (m_{11} + m_{20})/(β + 1) = 0.   (23)

Denoting the fraction of 2-count states that have one count for each y value as f₂ = m_{11}/(m_{20} + m_{11}), Eq. 23 implies that β → ∞ if f₂ ≥ 1/2, and β = f₂/(1/2 − f₂) otherwise. If the y values are independent of x, we expect f₂ ∼ 1/2. This case corresponds to a large β and, consequently, to a low information. On the other hand, for small f₂, the parameter β is also small and the information grows.

In Eq. 23, the data only intervene through m_{20} and m_{11}, which characterize the degree of asymmetry of the y values throughout the different x states. This asymmetry, hence, constitutes a sufficient statistic for β. If a prior p(β) is included, the β that maximizes the posterior p(β|n) may shift, but the effect becomes negligible as the number of coincidences grows.

We now discuss the role of the selected β in the estimation of information, Eq. 19, focusing on the conditional entropy ⟨H_{Y|X}⟩(β). First, in terms of the multiplicities, the conditional entropy can be rewritten as

⟨H_{Y|X}⟩(β) = Σ_k f_k Σ_{n+n'=k} f_{nn'} H_{nn'}(β),   (24)

where f_k is the fraction of the N samples that fall in states x with k counts, and f_{nn'} is the fraction of all states x with n + n' counts that have n counts for one y value (whichever) and n' for the other. Finally, H_{nn'}(β) is the estimate of the entropy of a binary variable after {n, n'} samples,

H_{nn'}(β) = ψ(n + n' + β + 1) − [ (n + β/2) ψ(n + β/2 + 1) + (n' + β/2) ψ(n' + β/2 + 1) ] / (n + n' + β).   (25)
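The quantities entering Eqs. 21–25 are simple functions of the multiplicities. A minimal sketch (Python; function names are ours) extracts m_{nn'} from the per-state counts and, in the case where at most two samples coincide on the same x, returns the closed-form maximizer of Eq. 23.

import numpy as np
from collections import Counter

def multiplicities(counts):
    """m_{nn'}: number of states x with counts {n, n'} (unordered), n >= n'."""
    pairs = (tuple(sorted((int(a), int(b)), reverse=True)) for a, b in counts)
    return Counter(pairs)

def beta_from_double_coincidences(counts):
    """Maximizer of Eq. 23, valid when no state x holds more than two samples.

    Returns np.inf when f2 >= 1/2 (data compatible with independence).
    """
    m = multiplicities(counts)
    m20, m11 = m[(2, 0)], m[(1, 1)]
    if m20 + m11 == 0:
        raise ValueError("no coincidences: beta is not determined by the data")
    f2 = m11 / (m20 + m11)
    return np.inf if f2 >= 0.5 else f2 / (0.5 - f2)

For data containing higher-order coincidences, the same multiplicities enter the full log marginal likelihood of Eq. 21, which can be maximized numerically, for instance over a grid of log β.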
A priori, ⟨I⟩(β) = log 2 − H₀₀(β), as in Fig. 1d. Surprisingly, from the property ψ(z + 1) = ψ(z) + 1/z, it turns out that H₁₀ = H₀₀ (in fact, H_{nn} = H_{(n+1)n}). Hence, if only a single count breaks the symmetry between the two y values, there is no effect on the conditional entropy. This is a reasonable result, since a single extra count is no evidence of an imbalance between the underlying conditional probabilities; it is just the natural consequence of distributing an odd number of samples over an even number (2) of states. Expanding the first terms of the conditional entropy,

⟨H_{Y|X}⟩ = f₁ H₁₀(β) + f₂ f₂₀ H₂₀(β) + f₂ f₁₁ H₁₁(β) + ...   (26)

In the severely under-sampled regime, these first terms are the most important ones. Typically, f₁ takes most of the weight, and Eq. 26 implies that the estimate is close to the prior H₀₀ evaluated at the value of β that maximizes the marginal likelihood (or the posterior).

Finally, we mention that when dealing with few samples, it is important to have not just a good estimate of the mutual information, but also a confidence interval. Even a small information may be relevant, if the evidence attests that it is strictly above zero. The theory developed here also allows us to estimate the posterior variance of the mutual information, as shown in the Appendix. The variance (Eq. 33) is shown to be inversely proportional to the number of states k_x, thereby implying that our method benefits from a large number of available states of X, even if undersampled.

We now analyze the performance of our estimator in three examples where the number of samples N is below or of the order of the effective size of the system, exp(H_{XY}). In this regime, most observed x states have very few samples. In each example, we define the probabilities q_x and q_{1|x} with three different criteria, giving rise to collections of probabilities that can be described with varying success by the prior proposed in this paper, Eq. 16. Once the probabilities are defined, the true value I_{XY} of the mutual information can be calculated, and compared to the one estimated by our method, as well as by three other estimators employed in the literature, in 50 different sets of samples n of the measured data. As our estimator we use ⟨I|n⟩ from Eq. 19, evaluated at the β that maximizes the marginal likelihood p(n|β). We did not observe any improvement when integrating over the whole posterior p(β|n) with the prior p(β) of Eq. 18, except when m₂₀ or m₁₁ were of order 1. This fact implies the existence of a well-defined peak in the marginal likelihood.

In the first example (Fig. 2a, d), the probabilities q_x are obtained by sampling a Pitman-Yor distribution with concentration parameter α = 50 and a non-zero tail parameter d. These values correspond to a PYM prior with a heavy tail. The conditional probabilities q_{y|x} are defined by sampling a symmetric Beta distribution, q_{y|x} ∼ Beta(β/2, β/2), as in Eq. 16 (in Fig. 2a, β is of order unity). Once the joint probability q_{xy} is defined, 50 sets of samples n are generated. We compare our estimator to maximum likelihood (ML), NSB and PYM when applied to H_X and H_{XY} (all methods coincide in the estimation of H_Y). Our estimator has a low bias, even when the number of samples per effective state, N/e^{H_{XY}}, is below one. The variance is larger than that of ML, comparable to NSB, and smaller than PYM.
All the other methods (ML, NSB and, to a lesser extent, PYM) overestimate the mutual information. In Fig. 2d, the performance of the estimators is also tested for different values of the exact mutual information I_{XY}, which we explore by varying β in the conditional distribution. For each β, the conditional probabilities q_{1|x} are sampled once. Each vector n contains N = 500 samples, and n is sampled 50 times. Our estimates have very low bias, even as the mutual information goes to zero, namely, for independent variables.

Secondly, we analyze an example where the statistical relation between X and Y is remarkably intricate (example inspired by [19]), which underscores the fact that making inference about the mutual information does not require inferences about the joint probability distribution. The variable x is a binary vector; each component represents the presence or absence of one of a fixed number of delta functions equally spaced on the surface of a sphere, and all possible x vectors are governed by a uniform prior probability. The conditional probabilities are generated in such a way that they are invariant under rotations of the sphere, that is, q_{y|x} = q_{y|R(x)}, where R is a rotation. Using a spherical harmonic representation [18], the frequency components π_ℓ(f(x)) of the spherical spectrum are obtained, where f(x) is the combination of deltas. The conditional probabilities q_{y|x} are defined as a sigmoid function of a fixed combination of the first few frequency components; the offset of the sigmoid sets the value of q_{y=1}, and the gain sets I_{XY}, which is a fraction of a nat. In this example, and unlike the Dirichlet prior implied by our estimator, p(q_{y|x}) has some level of roughness (inset in Fig. 2b), due to peaks coming from the invariant classes in {x_1, ..., x_{k_x}}. Hence, the example does not truly fit the hypotheses of our method. Our estimator has little bias (Figs. 2b, e), even with less than one sample per effective state. In this regime, most of the samples fall on x states that occur only once, a smaller fraction on states that occur twice, and only a few on states with three or more counts. As mentioned above, in such cases, the value of I_{XY} is very similar to the one that would be obtained by evaluating the prior information ⟨I | n = 0, β⟩ of Eq. 17 at the β that maximizes the marginal likelihood p(n|β), which in turn is mainly determined by f₂.
Figure 2: Comparison of the performance of four different estimators of I_{XY}: the maximum-likelihood estimator (ML), the NSB estimator used in the limit of infinite states, the PYM estimator, and our estimator ⟨I|n⟩(β) (Eq. 19) calculated with the β that maximizes the marginal likelihood p(n|β) (Eq. 20). The curves represent the average over 50 different data sets n, with the standard deviation displayed as a colored area around the mean. a: Estimates of mutual information as a function of the total number of samples N, when the values of q_{1|x} are generated under the hypothesis of our method (Eq. 16). We sample once the marginal probabilities q_x from a Pitman-Yor process with α = 50 and a non-zero tail parameter, as well as the conditionals q_{y|x} ∼ Beta(β/2, β/2). The exact value of I_{XY} is shown as a horizontal dashed line. b: Estimates of mutual information, for data sets where the conditional probabilities have spherical symmetry. X, a binary vector, encodes the presence or absence of delta functions equally spaced on a sphere, with a uniform marginal distribution over all x. We generate the conditional probabilities such that they are invariant under rotations of the sphere, namely q_{y|x} = q_{y|R(x)}, where R is a rotation. To this aim, we set q_{y|x} as a sigmoid function of a combination of frequency components of the spherical spectrum [18]. c: Estimates of mutual information, for a conditional distribution far away from our hypotheses. The x states are generated as binary vectors of dimension D = 40 with independent Bernoulli components, while the conditional probabilities depend on the parity of the sum of the components of the vector. When the sum is even we set q_{1|x} = 1/2, and when it is odd, q_{1|x} is generated by sampling a mixture of two deltas of equal weight, placed at q and 1 − q. The resulting distribution of q_{1|x} values contains 3 peaks, and therefore cannot be described with a Dirichlet distribution. d: Bias of the estimates as a function of the value of the mutual information. Settings remain the same as in a, but fixing N = 500 and varying β in the conditional. e: Bias of the estimates as a function of the value of the mutual information. Settings as in b, but fixing N = 2000 and changing the gain of the sigmoid in the conditional. f: Bias of the estimates as a function of the value of the mutual information. Settings as in c, but fixing N = 2000 and varying q in the conditional.
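The first benchmark (Fig. 2a) can be reproduced in outline. The sketch below (Python; our own simplified stand-in, not the code used for the figure) draws q_x from a symmetric Dirichlet distribution instead of a Pitman-Yor process, draws q_{1|x} ∼ Beta(β/2, β/2), computes the exact I_{XY} from Eq. 10, draws N samples, selects β by maximizing the marginal likelihood of Eq. 20 on a grid of log β, and evaluates the estimator of Eq. 19, reusing the plugin_mutual_information and posterior_information functions sketched earlier.

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
k_x, beta_true, N = 1000, 2.0, 500

# ground truth: q_x (a symmetric Dirichlet, simplifying the Pitman-Yor
# marginal used in the paper) and q_{1|x} ~ Beta(beta/2, beta/2)
q_x = rng.dirichlet(np.full(k_x, 0.5))
q1_x = rng.beta(beta_true / 2.0, beta_true / 2.0, size=k_x)

def h2(p):
    """Binary entropy in nats."""
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

I_true = h2(np.sum(q_x * q1_x)) - np.sum(q_x * h2(q1_x))     # Eq. 10

# draw N samples and build the per-state counts (n_x0, n_x1)
xs = rng.choice(k_x, size=N, p=q_x)
ys = (rng.random(N) < q1_x[xs]).astype(int)
counts = np.zeros((k_x, 2), dtype=int)
np.add.at(counts, (xs, ys), 1)
counts = counts[counts.sum(axis=1) > 0]

def log_marginal_likelihood(counts, beta):
    """log p(n|beta) of Eq. 20, up to a beta-independent constant."""
    n0, n1 = counts[:, 0], counts[:, 1]
    return np.sum(gammaln(n0 + beta / 2) + gammaln(n1 + beta / 2)
                  + gammaln(beta) - gammaln(n0 + n1 + beta) - 2 * gammaln(beta / 2))

betas = np.exp(np.linspace(-3, 6, 200))                      # grid over log(beta)
beta_hat = betas[np.argmax([log_marginal_likelihood(counts, b) for b in betas])]
I_ml = plugin_mutual_information(counts)[2]                  # plug-in, Eqs. 1-3
I_ours = posterior_information(counts, beta_hat)             # Eq. 19
print(I_true, I_ml, I_ours)

Under these (simplified) settings, the plug-in estimate typically exceeds the true value, while the β-selected estimate stays close to it, in line with the behaviour reported in Fig. 2.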
Figure 3: Verification of the accuracy of the analytically predicted mean posterior information (Eq. 19) and variance (Eq. 33) in the severely under-sampled regime. A collection of 13,500 distributions q_{xy} is constructed by sampling q_x ∼ DP(α) and q_{y|x} ∼ Beta(β/2, β/2), with α taking three fixed values and log β sampled from the prior of Eq. 18. Each distribution q_{xy} has an associated I_{XY}(q_{xy}). From each q_{xy}, we take five (5) sets of just N = 40 samples. a: The values of I(q_{xy}) are grouped according to the multiplicities {m_{nn'}} produced by the samples, averaged together, and depicted as the y component of each data point. The x component is the analytical result of Eq. 19, based on the sampled multiplicities. b: Same analysis for the standard deviation of the information (the square root of the variance calculated in Eq. 33).

In Fig. 2e, the estimator is tested with a fixed number of samples N = 2000 for different values of the mutual information, which we explore by varying the gain of the sigmoid. The bias of the estimate is small in the entire range of mutual informations.

In the third place, we consider an example where the conditional probabilities are generated from a distribution that is poorly approximated by a Dirichlet prior. The conditional probabilities take one of three values, 1/2, q, or 1 − q, so that p(q_{1|x}) is a mixture of three Dirac deltas. The delta placed at 1/2 could be approximated by a Dirichlet prior with a large β, while the other two deltas could be approximated by a small β, but there is no single value of β that can approximate all three deltas at the same time. The x states are generated as binary vectors of dimension D = 40 with independent Bernoulli components, while the conditional probabilities q_{1|x} depend on the parity of the sum of the components of the vector x. When the sum is even, we assign q_{1|x} = 1/2, and when it is odd, we assign q_{1|x} = q or q_{1|x} = 1 − q, both options with equal probability. Although in this case our method has some degree of bias, it still preserves a good performance in relation to the other approaches (see Fig. 2c, f). The marginal likelihood p(n|β) contains a single peak at an intermediate value of β, coinciding with none of the deltas in p(q_{1|x}), but still capturing the right value of the mutual information. As in the previous examples, we also test the performance of the estimator for different values of the mutual information, varying in this case the value of q (with N = 2000). Our method performs acceptably for all values of mutual information. The other methods, instead, are challenged more severely, probably because a large fraction of the x states have a very low probability, and are therefore difficult to sample. Those states, however, provide a crucial contribution to the relative weight of each of the three values of q_{1|x}. PYM, in particular, sometimes produces a negative estimate of I_{XY}.

Finally, we check numerically the accuracy of the analytically predicted mean posterior information (Eq. 19) and variance (Eq. 33) in the severely under-sampled regime. The test is performed in a different spirit than the numerical evaluations of Fig. 2. There, averages were taken over multiple samples of the vector n, from a fixed choice of the probabilities q_x and q_{y|x}. The averages of Eqs. 19 and 33, however, must be interpreted in the Bayesian sense.
The angular brackets in ⟨I|n⟩ and ⟨H_{Y|X}⟩ represent averages taken for a fixed data sample n, and unknown underlying probability distributions q_x and q_{y|x}. We generate many such distributions with q_x ∼ DP(α) (a Dirichlet process with concentration parameter α) and q_{y|x} ∼ Beta(β/2, β/2). A total of 13,500 distributions q_{xy} are produced, with log β sampled from Eq. 18, and three equiprobable values of α. For each of these distributions we generate five (5) sets of just N = 40 samples, thereby constructing a list of 5 × 13,500 cases, each case characterized by specific values of α, β, q_x, {q_{y|x}}, I(q_x, {q_{y|x}}), n, ⟨I|n⟩ and σ(I|n). Following the Bayesian rationale, we partition this list into classes, each class containing all the cases that end up in the same set of multiplicities {m_{nn'}} (for example, 36 states sampled once and 2 states sampled twice). For each of the most frequently occurring sets of multiplicities (which together cover most of the cases), we calculate the mean and the standard deviation of the mutual information I(q_x, {q_{y|x}}) of the corresponding class, and compare them with our predicted estimates ⟨I | {m_{nn'}}⟩ and ⟨σ_I | {m_{nn'}}⟩, using the prior p(log β) from Eq. 18. Figure 3 shows a good match between the numerical (y-axis) and analytical (x-axis) averages that define the mean information (panel a) and the standard deviation (panel b). The small departures from the diagonal stem from the fact that the analytical average contains all the possible q_x and {q_{y|x}}, even if some of them are highly improbable for one given set of multiplicities. The numerical average, instead, includes the subset of the 13,500 explored cases that produced the tested multiplicity. All the depicted subsets contained many cases, but still, they remained unavoidably below the infinity covered by the theoretical result.

We have also tested cases where Y takes more than two values, and where the marginal distribution q_y is not uniform, observing a similar performance of our estimator.

The prior considered so far did not model the probability q_x of the large-entropy variable X. Throughout the calculation, the probabilities q_x were approximated by the maximum-likelihood estimator q̂_x = n_x/N. Here we justify this procedure by demonstrating that proper Bayesian inference on q_x hardly modifies the estimation of the mutual information. To that end, we replace the prior of Eq. 12 by another prior that depends on both q_x and {q_{y|x}}. The simplest hypothesis is to assume that the prior p(q_x, {q_{y|x}}) factorizes as p(q_x) p({q_{y|x}}), implying that the marginal probabilities q_x are independent of the conditional probabilities q_{y|x}. We propose q_x ∼ DP(α), so that the marginal probabilities q_x are drawn from a Dirichlet process with concentration parameter α, associated with the total number of pseudo-counts.
After integrating over q_x and over q_{y|x}, the mean posterior mutual information for fixed hyper-parameters β and α is

⟨I|n⟩(β, α) = [N/(N + α)] { Ĥ(Y) − Σ_{x: n_x > 0} (n_x/N) [ ψ(β + n_x + 1) − Σ_y ((β q̂_y + n_{xy})/(β + n_x)) ψ(β q̂_y + n_{xy} + 1) ] }
            + [α/(N + α)] [ Ĥ(Y) − ψ(β + 1) + Σ_y q̂_y ψ(β q̂_y + 1) ].   (27)

Before including the prior p(q_x), in the severely undersampled regime the mean posterior information was approximately equal to the prior information evaluated at the best β (Eq. 15). The new calculation (Eq. 27) contains the prior information explicitly, weighted by α/(N + α), that is, the ratio between the number of pseudo-counts from the prior and the total number of counts. Thereby, the role of the non-observed (but still inferred) states is established.

The independence assumed between q_x and {q_{y|x}} implies that

p(n | α, β) = p(n_x | α) p(n | β).   (28)

The inference over α coincides with the one of PYM with the tail parameter set to d = 0 [14], since

p(n_x | α) ∝ [ Γ(1 + α) / Γ(N + α) ] α^{k₁ − 1},   (29)

where k₁ = Σ_{x: n_x > 0} 1 is the number of states x with at least one sample. With few coincidences in x, p(n_x | α) develops a peak around a single α value that represents the number of effective states. Compared to the present Bayesian approach, maximum likelihood underestimates the number of effective states (or the entropy) of x. Since the expected variance of the mutual information decreases with the square root of the number of effective states, the Bayesian variance is reduced with respect to the one of ML.

In this work we propose a novel estimator of the mutual information between two discrete variables X and Y, which is adequate when X has a much larger number of effective states than Y. If this condition does not hold, the performance of the estimator breaks down. Our proposal is inspired by the Bayesian framework, in which the core issue boils down to finding an adequate prior. The more the prior is dictated by the data, the less we need to assume from outside. Equation 10 implies that the mutual information I(X, Y) is the spread of the conditional probabilities of one of the variables (for example, q_{y|x}, but the same holds for q_{x|y}) around the corresponding marginal (q_y or q_x, respectively). This observation inspires the choice of our prior (Eq. 12), which is designed to capture the same idea, and in addition, to be analytically tractable. We choose to work with a hyper-parameter β that regulates the scatter of q_{y|x} around q_y, and not the scatter of q_{x|y} around q_x, because the asymmetry in the number of available states of the two variables makes the β of the first option (and not the second) strongly modulated by the data, through the emergence of a peak in p(n|β).

Although our proposal is inspired by previous Bayesian studies, the procedure described here is not strictly Bayesian, since our prior (Eq. 12) requires the knowledge of q̂_y, which depends on the sampled data. However, in the limit in which q_y is well sampled, this is a pardonable crime, since q̂_y is defined by a negligible fraction of the measured data. Still, Bayesian purists should employ a two-step procedure to define their priors. First, they should perform Bayesian inference on the center of the Dirichlet distribution of Eq. 12 by maximizing p(q_y|n), and then replace q̂_y in Eq. 12 by the inferred q_y.
For all practical purposes, however, if the conditions of validity of our method hold, both procedures lead to the same result.

By confining the set of possible priors p({q_{y|x}}) to those generated by Eq. 12, we relinquish all aspiration to model the prior of, say, q_{y|x=3}, in terms of the observed frequencies at x = 3. In fact, the preferred β value is totally blind to the specific x value of each sampled datum. Only the number of x values containing different counts of each y value matters. Hence, the estimation of mutual information is performed without attempting to infer the specific way the variables X and Y are related, a property named equitability [20], which is shared also by other methods [13, 8, 14]. Although this fact may be seen as a disadvantage, deriving a functional relation between the variables can actually bias the inference on mutual information [20]. Moreover, fitting a relation is unreasonable in the severely under-sampled regime, in which not all x states are observed, most sampled x states contain a single count, and few x states contain more than two counts, at least in the absence of a strong assumption about the probability space. In fact, if the space of probabilities of the involved variables has some known structure or smoothness condition, other approaches that estimate information by fitting the relation first may perform well [9, 10, 11]. Part of the approach developed here could be extended to continuous variables or to spaces with a given metric. This extension is left for future work.

The main result of the paper is that our estimator has small bias, even in the severely under-sampled regime. It outperforms other estimators discussed in the literature (at least, when the conditions of validity hold), and by construction, it never produces negative values. More importantly, it even works in cases where the collection of true conditional probabilities q_{y|x} is not contained in the family of priors generated by p(q|β), as demonstrated by the second and third examples of Sect. 5. In these cases, the success of the method relies on the peaked nature of the posterior distribution for β. Even if the selected p(q|β) provides a poor description of the actual collection of probabilities, the dominant β captures the right value of mutual information. This is a direct manifestation of the equitability property discussed above.

Our method also provides a transparent way to identify the statistics that matter, out of all the measured data. Quite naturally, the x states that have not been sampled provide no evidence in shaping p(β|n), as indicated by Eq. 13, and only shift the posterior information towards the prior (Eq. 27). More interestingly, the x states with just a single count are also irrelevant, both in shaping p(β|n) and in modifying the posterior information away from the prior. These states are unable to provide evidence about the existence of either flat or skewed conditional probabilities q_{y|x}. Only the states x that have been sampled at least twice contribute to the formation of a peak in p(β|n), and to deviating the posterior information away from the prior.

Several fields can benefit from the application of our estimator of mutual information.
Examples can be found in neuroscience, when studying whether neural activity (a variable with many possible states) correlates with a few selected stimuli or behavioral responses [12, 21, 22], or in genomics, to understand associations between genes (the large-entropy variable) and a few specific phenotypes [23]. The method can also shed light on the development of rate-distortion methods to be employed in situations in which only a few samples are available [24, 25]. The possibility of detecting statistical dependencies with only few samples is of key importance, not just for analyzing data sets, but also for understanding how living organisms quickly infer dependencies in their environments and adapt accordingly [26].

Funding
This research was funded by CONICET, CNEA, ANPCyT Raíces 2016 grant number 1004.
Acknowledgments
We thank Ilya Nemenman for his fruitful comments and discussions.
A Expected variance for a symmetric, binary Y-variable

The posterior variance of the mutual information is

σ²(I|n) = ⟨(I|n)²⟩ − ⟨I|n⟩².   (30)

In the first place, we demonstrate that this quantity is proportional to k_x^{-1}, implying that our estimator becomes increasingly accurate as the number of states of the X variable increases. Given that

⟨I|n⟩ ≈ Ĥ(Y) − ⟨H_{Y|X}⟩(n),   (31)

with

H_{Y|X}({q_{1|x}}) = Σ_x q̂_x H_{Y|x}(q_{1|x}),   (32)
H_{Y|x}(q_{1|x}) = − [ q_{1|x} log(q_{1|x}) + (1 − q_{1|x}) log(1 − q_{1|x}) ],

it is easy to show that

σ²(I|n) ≈ σ²(H_{Y|X}|n) = ⟨H_{Y|X}({q_{1|x}})²⟩ − ⟨H_{Y|X}({q_{1|x}})⟩².   (33)

In other words, if the marginal entropy is well sampled, the variance of the information is mainly due to the variance of the conditional entropy. In turn, H_{Y|X} is defined as an average of k_x terms (Eq. 32). The independence hypothesis implied by the prior of Eq. 12, and by the way the different q_{1|x} and q_x factor out in p(n | q_x, {q_{y|x}}) (Eq. 5), implies that the different terms of Eq. 32 are all independent of each other. The average of k_x independent terms has a variance proportional to 1/k_x, so the estimator proposed here becomes increasingly accurate as k_x grows.

We now derive the detailed dependence of ⟨H_{Y|X}²⟩ − ⟨H_{Y|X}⟩² on the sampled data n. The mean conditional entropy ⟨H_{Y|X}⟩ can be written in terms of ⟨H_{Y|x}⟩(n_{x0}, n_{x1}, β), that is, of the entropy of the variable Y for a particular state x with n_x = n_{x0} + n_{x1} counts at fixed β,

⟨H_{Y|X}⟩(n) = ∫ p(β|n) dβ Σ_x (n_x/N) ⟨H_{Y|x}⟩(n_{x0}, n_{x1}, β),
⟨H_{Y|x}⟩(n_{x0}, n_{x1}, β) = ψ(β + n_x + 1) − Σ_{y∈{0,1}} ((β/2 + n_{xy})/(β + n_x)) ψ(β/2 + n_{xy} + 1).   (34)

Similarly, for the second moment,

⟨H_{Y|X}²⟩(n) = ∫ dβ p(β|n) ∏_x ∫ dq_{1|x} p(q_{1|x} | n_{x0}, n_{x1}, β) [ Σ_{x'} q̂_{x'} H_{Y|x'}(q_{1|x'}) ]²
             = ∫ dβ p(β|n) { Σ_{x ≠ x'} q̂_x q̂_{x'} ⟨H_{Y|x}⟩(n_{x0}, n_{x1}, β) ⟨H_{Y|x'}⟩(n_{x'0}, n_{x'1}, β) + Σ_x q̂_x² ⟨H_{Y|x}²⟩(n_{x0}, n_{x1}, β) }
             = ∫ dβ p(β|n) [ ⟨H_{Y|X}⟩²(n, β) + Σ_x q̂_x² Var[H_{Y|x}](n_{x0}, n_{x1}, β) ].   (35)

In turn, Var[H_{Y|x}](n_{x0}, n_{x1}, β) = ⟨H_{Y|x}²⟩(n_{x0}, n_{x1}, β) − ⟨H_{Y|x}⟩(n_{x0}, n_{x1}, β)², where the first moment ⟨H_{Y|x}⟩(n_{x0}, n_{x1}, β) is given in Eq. 34. The second moment is [14]

⟨H_{Y|x}²⟩(n_{x0}, n_{x1}, β) = ∫ dq_{1|x} p(q_{1|x} | n_{x0}, n_{x1}, β) H_{Y|x}(q_{1|x})²
  = [ 2 (β/2 + n_{x0}) (β/2 + n_{x1}) / ( (β + n_x + 1)(β + n_x) ) ] F(β/2 + n_{x0}, β/2 + n_{x1})
  + Σ_{y∈{0,1}} [ (β/2 + n_{xy}) (β/2 + n_{xy} + 1) / ( (β + n_x + 1)(β + n_x) ) ] G(β/2 + n_{xy}, β + n_x).   (36)
In this equation,

F(z₀, z₁) = − ψ₁(z₀ + z₁ + 2) + ∏_{i∈{0,1}} [ ψ(z_i + 1) − ψ(z₀ + z₁ + 2) ],
G(z_i, z) = [ ψ(z_i + 2) − ψ(z + 2) ]² + ψ₁(z_i + 2) − ψ₁(z + 2),   (37)

and ψ₁(z) is the first polygamma function.

Replacing the obtained expressions in Eq. 33, the variance of the estimated information is obtained. The two terms of Eq. 35 represent the two sources of uncertainty of the conditional entropy: the uncertainty in β (first term), manifested in the width of p(β|n), and the uncertainty of the conditional entropies for a fixed β (second term), manifested in the width of p({q_{1|x}}|β). As the number of samples decreases, the uncertainty in β becomes the dominant term.

Finally, we need to mention that the approximate symbol in Eq. 31 stems from the fact that we are assuming that H(Y) is well approximated by its maximum-likelihood estimator. We are therefore neglecting the error in the marginal entropy H_Y, and assuming that the error in the mutual information only stems from the uncertainty in the conditional entropy H_{Y|X} (Eq. 33). This assumption is well justified in the context explored in this paper, that is, when H(X) ≫ H(Y).

References
[1] Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
[2] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[3] Stefano Panzeri and Alessandro Treves. Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7(1):87–107, 1996.
[4] Inés Samengo. Estimating probabilities from experimental frequencies. Physical Review E, 65(4):046124, 2002.
[5] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
[6] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
[7] Marcelo A Montemurro, Riccardo Senatore, and Stefano Panzeri. Tight data-robust bounds to mutual information combining shuffling and model selection techniques. Neural Computation, 19(11):2913–2957, 2007.
[8] Evan Archer, Il Memming Park, and Jonathan W Pillow. Bayesian and quasi-Bayesian estimators for mutual information from discrete data. Entropy, 15(5):1738–1755, 2013.
[9] Artemy Kolchinsky and Brendan D Tracey. Estimating mixture entropy with pairwise distances. Entropy, 19(7):361, 2017.
[10] Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R Devon Hjelm, and Aaron Courville. MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[11] Houman Safaai, Arno Onken, Christopher D Harvey, and Stefano Panzeri. Information estimation using nonparametric copulas. Physical Review E, 98(5):053302, 2018.
[12] Steven P Strong, Roland Koberle, Rob R de Ruyter van Steveninck, and William Bialek. Entropy and information in neural spike trains. Physical Review Letters, 80(1):197, 1998.
[13] Ilya Nemenman, William Bialek, and Rob de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111, 2004.
[14] Evan Archer, Il Memming Park, and Jonathan W Pillow. Bayesian entropy estimation for countable discrete distributions. The Journal of Machine Learning Research, 15(1):2833–2868, 2014.
[15] Shang-keng Ma. Calculation of entropy from data of motion. Journal of Statistical Physics, 26(2):221–240, 1981.
[16] Ilya Nemenman. Coincidences and estimation of entropies of random variables with large cardinalities. Entropy, 13(12):2013–2023, 2011.
[17] David H Wolpert and David R Wolf. Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52(6):6841, 1995.
[18] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. In Symposium on Geometry Processing, volume 6, pages 156–164, 2003.
[19] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[20] Justin B Kinney and Gurinder S Atwal. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, page 201309933, 2014.
[21] Claire Tang, Diala Chehayeb, Kyle Srivastava, Ilya Nemenman, and Samuel J Sober. Millisecond-scale motor encoding in a cortical vocal area. PLoS Biology, 12(12):e1002018, 2014.
[22] Melisa Maidana Capitán, Emilio Kropff, and Inés Samengo. Information-theoretical analysis of the neural code in the rodent temporal lobe. Entropy, 20(8):571, 2018.
[23] Atul J Butte, Pablo Tamayo, Donna Slonim, Todd R Golub, and Isaac S Kohane. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proceedings of the National Academy of Sciences, 97(22):12182–12186, 2000.
[24] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
[25] Susanne Still and William Bialek. How many clusters? An information-theoretic perspective. Neural Computation, 16(12):2483–2506, 2004.
[26] Adrienne L Fairhall, Geoffrey D Lewen, William Bialek, and Robert R de Ruyter van Steveninck. Efficiency and ambiguity in an adaptive neural code.