Bounds on mutual information of mixture data for classification tasks
Yijun Ding
James C. Wyant College of Optical Sciences, University of Arizona
Tucson, AZ, [email protected]
Amit Ashok
James C. Wyant College of Optical Sciences and Department of Electrical Engineering, University of Arizona
Tucson, AZ, [email protected]
Abstract—The data for many classification problems, such as pattern and speech recognition, follow mixture distributions. To quantify the optimum performance for classification tasks, the Shannon mutual information is a natural information-theoretic metric, as it is directly related to the probability of error. The mutual information between mixture data and the class label does not have an analytical expression, nor any efficient computational algorithms. We introduce a variational upper bound, a lower bound, and three estimators, all employing pair-wise divergences between mixture components. We compare the new bounds and estimators with Monte Carlo stochastic sampling and bounds derived from entropy bounds. To conclude, we evaluate the performance of the bounds and estimators through numerical simulations.
Index Terms—Mixture distribution, classification, Shannon mutual information, bounds, estimation, mixed-pair
I. INTRODUCTION
A. Motivation
We study the performance of classification tasks, where the goal is to infer the class label C from sample data x. The Shannon mutual information I(x; C) characterizes the reduction in the uncertainty of the class label C given knowledge of the data x, and provides a way to quantify the relevance of the data x with respect to the class label C. As the mutual information is related to the probability of classification error (P_e) through Fano's inequality and other bounds [1]–[3], it has been widely used for feature selection [4], [5], learning [6], and quantifying task-specific information [7] for classification.

Statistical mixture distributions such as Poisson, Wishart or Gaussian mixtures are frequently used in the fields of speech recognition [8], image retrieval [9], system evaluation [10], compressive sensing [11], distributed state estimation [12], hierarchical clustering [13], etc. As a practical example, consider a scenario in which the data x is measured with a noisy system, e.g., Poisson noise in a photon-starved imaging system or Gaussian noise in a thermometer reading. If the actual scene (or temperature) has a class label, e.g., target present or not (the temperature is below freezing or not), then the mutual information I(x; C) describes with what confidence one can assign a class label C to the noisy measurement data x. The goal of this paper is to develop efficient methods to quantify the optimum performance of classification tasks when the distribution of the data x for each given class label, pr(x | C), follows a known mixture distribution. As the mutual information I(x; C), which is commonly used to quantify task-specific information, does not admit an analytical expression for mixture data, we provide analytical expressions for bounds and estimators of I(x; C).

B. Problem Statement and Contributions
We consider the data as a continuous random variable x and the class label as a discrete random variable C, where C can be any integer in [1, Π] and Π is the number of classes. The bold symbol x emphasizes that x is a vector, which can be high-dimensional. We assume that, when restricted to any of the classes, the conditional differential entropy of x is well-defined, or in other words, (x, C) is a good mixed-pair vector [14]. The mutual information between the data x and the class label C can be defined as [15]

$$\mathrm{I}(\mathbf{x}; C) = \mathrm{KL}\big(\mathrm{pr}(\mathbf{x}, C)\,\|\,\Pr(C)\cdot\mathrm{pr}(\mathbf{x})\big) = \sum_{C} \int d\mathbf{x}\, \mathrm{pr}(\mathbf{x}, C) \ln \frac{\mathrm{pr}(\mathbf{x}, C)}{\Pr(C)\cdot\mathrm{pr}(\mathbf{x})}. \quad (1)$$

When pr(x) is a mixture distribution with N components,

$$\mathrm{pr}(\mathbf{x}) = \sum_{i=1}^{N} w_i\, \mathrm{pr}_i(\mathbf{x}), \quad (2)$$

where w_i is the weight of component i (w_i ≥ 0 and Σ_i w_i = 1), and pr_i is the probability density of component i. The conditional distribution of the data, when the class label is c, also follows a mixture distribution.

In this work, we propose new bounds and estimators of the Shannon mutual information between mixture data x and its class label C. We provide a lower bound, a variational upper bound and three estimators of I(x; C), all based on pair-wise distances. We present closed-form expressions for the bounds and estimators. Furthermore, we use numerical simulations to compare the bounds and estimators to Monte Carlo (MC) simulations and a set of bounds derived from entropy bounds.

C. Related works

Although estimation of conditional entropy and mutual information has been extensively studied [16]–[19], research has focused on purely discrete or continuous data. Nair et al. [14] extended the definition of the joint entropy to mixed-pairs, which consist of one discrete variable and one continuous variable. Ross [20], Moon et al. [21] and Beknazaryan et al. [15] provided methods for estimating mutual information from samples of mixed-pairs based on nearest-neighbor or kernel estimators. Gao et al. [22] extended the definition of mutual information to the case where each random variable can have both discrete and continuous components through the Radon-Nikodym derivative. Here our goal is to study mutual information for mixed-pairs, where the data x is continuous and the class label C is discrete.

When the underlying distribution of the data is unknown, the mutual information can be approximated from samples with a number of density or likelihood-ratio estimators based on binning [23], [24], kernel methods [25]–[27], k-nearest-neighbor (kNN) distances [28], [29], or approximated Gaussianity (Edgeworth expansion [30]). To accommodate high-dimensional data (such as images and text) or large datasets, Gao et al. [31] improved the kNN estimator with a local non-uniformity correction term; Jiao et al. [32] proposed a minimax estimator of entropy that achieves the optimal sample complexity; Belghazi et al. [33] presented a general-purpose neural-network estimator; and Poole et al. [34] provided a thorough review and several new bounds on mutual information that can trade off bias for variance.

However, when the underlying data distribution is known, the exact computation of mutual information is tractable only for a limited family of distributions [35], [36]. The mutual information for mixture distributions has no known closed-form expression [37]–[39]; hence MC sampling and numerical integration are often employed as unbiased estimators. MC sampling of sufficient accuracy is computationally intensive [40].
Numerical integration is limited to low-dimensional problems [41]. To reduce the computational requirement, deterministic approximations have been developed using merged Gaussians [8], [42], component-wise Taylor-series expansion [43], the unscented transform [44] and pair-wise KL divergence between matched components [9]. The merged-Gaussian and unscented-transform estimators are biased, while the Taylor-expansion method provides a trade-off between computational demands and accuracy.

Two papers that have deeply inspired our work are [8] and [45]. Hershey et al. [8] proposed a variational upper bound and an estimator of the KL divergence between two Gaussian mixtures based on pair-wise KL divergences, and showed empirically that the variational upper bound and estimator perform better than other deterministic approximations, such as the merged Gaussian, unscented transform and matched components. Kolchinsky et al. [45] bounded the entropy of mixture distributions with pair-wise KL and Chernoff-α (C_α) divergences and demonstrated through numerical simulations that these bounds are tighter than other well-known existing bounds, such as the kernel-density estimator [41], [46] and the expected-likelihood-kernel estimator [47]–[49]. Our results are not obvious from either paper, as the calculation of I(x; C) involves a summation of multiple entropies or KL divergences. Instead of providing bounds for each term (entropy) in the summation, we directly bound and estimate the mutual information.

II. MAIN RESULTS
In this section, we provide three estimators of I(x; C) and a pair of lower and upper bounds. All bounds and estimators are based on pair-wise KL and C_α divergences. Furthermore, we provide proofs of the lower and upper bounds. Before presenting our main results, we start with a few definitions. The marginal distribution of the class label C is

$$\Pr(C = c) = P_c = \sum_{i \in \{c\}} w_i. \quad (3)$$

Note that $\sum_{c=1}^{\Pi} P_c = 1$ and {c} is the set of components that have class label C = c. The conditional distribution of the data, when the class label is c, is given by

$$\mathrm{pr}(\mathbf{x} \mid c) = \sum_{i \in \{c\}} \frac{w_i}{P_c}\, \mathrm{pr}_i(\mathbf{x}). \quad (4)$$

Expressing the marginal distribution in terms of the conditional distribution, we have

$$\mathrm{pr}(\mathbf{x}) = \sum_{c} P_c \cdot \mathrm{pr}(\mathbf{x} \mid c). \quad (5)$$

The joint distribution of the data and class label is

$$\mathrm{pr}(\mathbf{x}, c) = P_c \cdot \mathrm{pr}(\mathbf{x} \mid c) = \sum_{i \in \{c\}} w_i\, \mathrm{pr}_i(\mathbf{x}). \quad (6)$$
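As an illustration, the short Python sketch below evaluates the quantities in Eqs. (3)–(5) for an assumed labeled Gaussian mixture; the weights, class labels and component parameters are placeholders rather than the settings used in our simulations.

```python
# Sketch of a labeled mixture and the distributions in Eqs. (3)-(5).
# All numerical values are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

# N components: weights w_i, class label of each component, and densities pr_i.
w = np.array([0.3, 0.2, 0.25, 0.25])           # sums to 1
labels = np.array([1, 1, 2, 2])                # class of each component
comps = [multivariate_normal(mean=m, cov=np.eye(2))
         for m in ([0, 0], [1, 0], [3, 0], [4, 1])]

def class_prior(c):
    """P_c = sum of the weights of the components in class c, Eq. (3)."""
    return w[labels == c].sum()

def pr_x_given_c(x, c):
    """pr(x | c), Eq. (4): renormalized mixture over the class-c components."""
    idx = np.where(labels == c)[0]
    return sum(w[i] / class_prior(c) * comps[i].pdf(x) for i in idx)

def pr_x(x):
    """pr(x), Eq. (5): mixture over all components."""
    return sum(w[i] * comps[i].pdf(x) for i in range(len(w)))

x0 = np.array([0.5, 0.0])
print(class_prior(1), pr_x_given_c(x0, 1), pr_x(x0))
```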
A. Pair-wise distances

The Kullback-Leibler (KL) divergence is defined as
$$\mathrm{KL}(\mathrm{pr}_i \| \mathrm{pr}_j) = \int d\mathbf{x}\, \mathrm{pr}_i(\mathbf{x}) \ln \frac{\mathrm{pr}_i(\mathbf{x})}{\mathrm{pr}_j(\mathbf{x})}. \quad (7)$$

The C_α divergence [50] between two distributions pr_i(x) and pr_j(x) is defined as

$$C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_j) = -\ln \int d\mathbf{x}\, \mathrm{pr}_i^{\alpha}(\mathbf{x})\, \mathrm{pr}_j^{1-\alpha}(\mathbf{x}), \quad (8)$$

for real-valued α ∈ [0, 1]. More specifically, when α = 1/2, the Chernoff divergence is the Bhattacharyya distance.

B. Bounds and estimates of the mutual information
We adopt the convention that ln 0 = 0 and ln(0/0) = 0. An exact expression of the mutual information is

$$\mathrm{I}(\mathbf{x}; C) = \mathrm{H}(C) - \sum_{i=1}^{N} w_i\, \mathbb{E}_{\mathrm{pr}_i}\!\left[\ln \frac{\sum_{j=1}^{N} w_j\, \mathrm{pr}_j}{\sum_{k \in \{C_i\}} w_k\, \mathrm{pr}_k}\right], \quad (9)$$

where {C_i} is the set of component indices that are in the same class as component i, and $\mathbb{E}_{\mathrm{pr}_i}[f] = \int d\mathbf{x}\, \mathrm{pr}_i(\mathbf{x}) f(\mathbf{x})$ is the expectation of f with respect to the probability density function pr_i.

Two approximations of I(x; C) are

$$\hat{\mathrm{I}}_{C_\alpha}(\mathbf{x}; C) = \mathrm{H}(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j\, e^{-C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_j)}}{\sum_{k \in \{C_i\}} w_k\, e^{-C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_k)}}, \qquad \hat{\mathrm{I}}_{\mathrm{KL}}(\mathbf{x}; C) = \mathrm{H}(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j\, e^{-\mathrm{KL}(\mathrm{pr}_i \| \mathrm{pr}_j)}}{\sum_{k \in \{C_i\}} w_k\, e^{-\mathrm{KL}(\mathrm{pr}_i \| \mathrm{pr}_k)}}. \quad (10)$$

Another approximation of I(x; C) is

$$\hat{\mathrm{I}}_{\mathrm{KL}\&C_\alpha}(\mathbf{x}; C) = \mathrm{H}(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j\, e^{-D_{ij}}}{\sum_{k \in \{C_i\}} w_k\, e^{-D_{ik}}}, \quad \text{where } D_{ij} = \frac{1}{2}\Big(\mathrm{KL}(\mathrm{pr}_i \| \mathrm{pr}_j) + C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_j)\Big). \quad (11)$$

As D is a function of both the KL and C_α divergences, we denote this estimator with the subscript 'KL&C_α'.

A lower bound on I(x; C) based on pair-wise C_α divergences is

$$\mathrm{I}^{\mathrm{lb}}_{C_\alpha} = -\sum_{c=1}^{\Pi} P_c \ln\left[\sum_{c'=1}^{\Pi} P_{c'} \cdot \min(1, Q_{cc'})\right], \quad \text{where} \quad Q_{cc'} = \sum_{i \in \{c\}} \sum_{j \in \{c'\}} \left(\frac{w_i}{P_c}\right)^{\alpha_c} \left(\frac{w_j}{P_{c'}}\right)^{1-\alpha_c} e^{-C_{\alpha_c}(\mathrm{pr}_i \| \mathrm{pr}_j)}, \quad (12)$$

and min(·) is the minimum-value function.

A variational upper bound on I(x; C) based on pair-wise KL divergences is

$$\mathrm{I}^{\mathrm{ub}}_{\mathrm{KL}} = \mathrm{H}(C) - \sum_{m} \sum_{c=1}^{\Pi} \phi_{m,c} \ln \frac{\sum_{c'=1}^{\Pi} \phi_{m,c'}\, e^{-\mathrm{KL}(\mathrm{pr}_{m,c} \| \mathrm{pr}_{m,c'})}}{\phi_{m,c}}, \quad (13)$$

where φ_{m,c} are the variational parameters and a detailed explanation of the notation is given in Section II-D.
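The estimators in Eqs. (10)–(11) depend only on the pair-wise divergences. The following sketch assumes the N×N divergence matrices have already been computed (in closed form or otherwise); the function and variable names are our own.

```python
# Sketch of the pair-wise estimators of Eqs. (10)-(11).
# Inputs: weights w, component class labels, and N x N matrices
# KL[i, j] = KL(pr_i || pr_j) and Ca[i, j] = C_alpha(pr_i || pr_j).
import numpy as np

def entropy_of_labels(w, labels):
    """H(C) (in nats) from component weights and class labels."""
    P = np.array([w[labels == c].sum() for c in np.unique(labels)])
    return -np.sum(P * np.log(P))

def pairwise_mi_estimate(w, labels, D):
    """H(C) - sum_i w_i ln [ sum_j w_j e^{-D_ij} / sum_{k in class(i)} w_k e^{-D_ik} ]."""
    est = entropy_of_labels(w, labels)
    for i in range(len(w)):
        same = labels == labels[i]
        num = np.sum(w * np.exp(-D[i]))                 # all components
        den = np.sum(w[same] * np.exp(-D[i][same]))     # same-class components
        est -= w[i] * np.log(num / den)
    return est

# The three estimators differ only in the divergence matrix they use:
# I_hat_KL     = pairwise_mi_estimate(w, labels, KL)
# I_hat_Calpha = pairwise_mi_estimate(w, labels, Ca)
# I_hat_KL_Ca  = pairwise_mi_estimate(w, labels, 0.5 * (KL + Ca))   # D of Eq. (11)
```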
C. Proof of the lower bound

We propose a lower bound on I(x; C) based on pair-wise C_α divergences. For ease of notation, we denote the conditional distribution pr(x | C = c) as pr_c. We first make use of a derivation from [45] and [51] to bound I(x; C) with the class-wise divergence C_α(pr_c, pr_c'):

$$\begin{aligned}
\mathrm{I}(\mathbf{x}; C) &= \sum_c P_c \int d\mathbf{x}\, \mathrm{pr}_c \ln \frac{\mathrm{pr}_c}{\mathrm{pr}(\mathbf{x})} \\
&= -\sum_c P_c \int d\mathbf{x}\, \mathrm{pr}_c \ln \frac{\sum_{c'} P_{c'}\, \mathrm{pr}_{c'}^{1-\alpha_c}}{\mathrm{pr}_c^{1-\alpha_c}} - \sum_c P_c \int d\mathbf{x}\, \mathrm{pr}_c \ln \frac{\mathrm{pr}(\mathbf{x})}{\mathrm{pr}_c^{\alpha_c} \sum_{c'} P_{c'}\, \mathrm{pr}_{c'}^{1-\alpha_c}} \\
&\geq -\sum_c P_c \ln \sum_{c'} P_{c'} \int d\mathbf{x}\, \mathrm{pr}_c^{\alpha_c}\, \mathrm{pr}_{c'}^{1-\alpha_c} - \ln \int d\mathbf{x} \sum_c P_c\, \frac{\mathrm{pr}_c^{1-\alpha_c}\cdot \mathrm{pr}(\mathbf{x})}{\sum_{c'} P_{c'}\, \mathrm{pr}_{c'}^{1-\alpha_c}} \\
&= -\sum_c P_c \ln \left[\sum_{c'} P_{c'}\, e^{-C_{\alpha_c}(\mathrm{pr}_c \| \mathrm{pr}_{c'})}\right]. \quad (14)
\end{aligned}$$

This inequality follows from Jensen's inequality and the convexity of the function −ln(x). The parameter α_c, which is specific to class c, can take any value in [0, 1].

The class-wise C_α divergence has a minimum value of zero, and the minimum is achieved when the two classes have the same distribution. Furthermore, we can bound the C_α divergence through the subadditivity of the function f(x) = x^α when 0 ≤ α ≤ 1. In other words, as f(a + b) ≤ f(a) + f(b) for a ≥ 0 and b ≥ 0, the C_α divergence between the conditional distributions pr_c and pr_c' can be bounded by:

$$e^{-C_\alpha(\mathrm{pr}_c \| \mathrm{pr}_{c'})} = \int d\mathbf{x} \left(\sum_{i \in \{c\}} \frac{w_i}{P_c}\, \mathrm{pr}_i\right)^{\!\alpha} \left(\sum_{j \in \{c'\}} \frac{w_j}{P_{c'}}\, \mathrm{pr}_j\right)^{\!1-\alpha} \leq \min\left(1,\; \sum_{i \in \{c\}} \sum_{j \in \{c'\}} \left(\frac{w_i}{P_c}\right)^{\alpha} \left(\frac{w_j}{P_{c'}}\right)^{1-\alpha} e^{-C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_j)}\right) = \min(1, Q_{cc'}). \quad (15)$$

Therefore, Equation (12) is a lower bound on I(x; C).

The best possible lower bound can be obtained by finding the parameters α_c that maximize I^lb_Cα, which is equivalent to minimizing Σ_{c'} P_{c'} min(1, Q_{cc'}) for each class c. In the special case when all components are symmetric and identical except for their center locations, e.g., a homoscedastic Gaussian mixture, e^{-C_α(pr_i || pr_j)} achieves its minimum value at α = 1/2 [45].
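A minimal sketch of the lower bound in Eq. (12), assuming a single α shared by all classes and a precomputed pair-wise C_α matrix; the names and defaults are illustrative.

```python
# Sketch of the lower bound of Eq. (12) with a common alpha for all classes.
# Inputs: weights w, component class labels, and Ca[i, j] = C_alpha(pr_i || pr_j).
import numpy as np

def lower_bound_Calpha(w, labels, Ca, alpha=0.5):
    classes = np.unique(labels)
    P = np.array([w[labels == c].sum() for c in classes])
    lb = 0.0
    for ci, c in enumerate(classes):
        i_idx = np.where(labels == c)[0]
        inner = 0.0
        for cj, cp in enumerate(classes):
            j_idx = np.where(labels == cp)[0]
            # Q_cc' of Eq. (12): pairwise terms between the two classes.
            Q = sum((w[i] / P[ci]) ** alpha * (w[j] / P[cj]) ** (1 - alpha)
                    * np.exp(-Ca[i, j]) for i in i_idx for j in j_idx)
            inner += P[cj] * min(1.0, Q)
        lb -= P[ci] * np.log(inner)
    return lb
```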
D. Proof of the variational upper bound

Here we propose a direct upper bound on the mutual information I(x; C) using a variational approach. The underlying idea is to match components from different classes. To pick one component from each class, there are N_1 × N_2 × ... × N_Π combinations, where N_c is the number of components in class c. Denote an integer M = Π_{c=1}^{Π} N_c. A component i in class c can be split into M/N_c components, with each corresponding to a component combination in the other Π − 1 classes. Mathematically speaking, we introduce the variational parameters φ_ij ≥ 0 satisfying the constraints Σ_{j=1}^{M/N_c} φ_ij = w_i. Using the variational parameters, we can write the joint distribution as

$$\mathrm{pr}(\mathbf{x}, c) = \sum_{i \in \{c\}} w_i\, \mathrm{pr}_i = \sum_{i \in \{c\}} \sum_{j=1}^{M/N_c} \phi_{ij}\, \mathrm{pr}_i. \quad (16)$$

Note that the set {c} has N_c components. By rearranging the indices (i, j) into a vector m of length M, we can simplify the joint distribution to

$$\mathrm{pr}(\mathbf{x}, c) = \sum_{m=1}^{M} \phi_{m,c}\, \mathrm{pr}_{m,c}(\mathbf{x}), \quad (17)$$

where the subscript c emphasizes that each class has a unique mapping from (i, j) to m, and pr_{m,c}(x) equals the corresponding pr_i(x).

With this notation, the marginal distribution of the data x is

$$\mathrm{pr}(\mathbf{x}) = \sum_{c=1}^{\Pi} \mathrm{pr}(\mathbf{x}, c) = \sum_{c=1}^{\Pi} \sum_{m=1}^{M} \phi_{m,c}\, \mathrm{pr}_{m,c}(\mathbf{x}). \quad (18)$$

We further define a mini-batch m as

$$b_m(\mathbf{x}) = \sum_{c=1}^{\Pi} \phi_{m,c}\, \mathrm{pr}_{m,c}(\mathbf{x}). \quad (19)$$

Each mini-batch contains Π components, with one component from each class. With this definition, the marginal distribution of the data can be written as pr(x) = Σ_{m=1}^{M} b_m(x). The probability of a component being in the m-th batch is P_m = Σ_c φ_{m,c}. The probability density function of the m-th batch is pr(x | m) = b_m(x)/P_m.

Now we use Jensen's inequality, or more specifically the log-sum inequality [52], to bound I(x; C) by the batch-conditional entropy,

$$\begin{aligned}
\mathrm{I}(\mathbf{x}; C) &= \mathrm{H}(C) + \sum_c \int d\mathbf{x}\, \mathrm{pr}(\mathbf{x}, c) \ln \frac{\mathrm{pr}(\mathbf{x}, c)}{\mathrm{pr}(\mathbf{x})} = \mathrm{H}(C) + \sum_c \int d\mathbf{x} \left(\sum_m \phi_{m,c}\, \mathrm{pr}_{m,c}\right) \ln \frac{\sum_m \phi_{m,c}\, \mathrm{pr}_{m,c}}{\sum_m b_m} \\
&\leq \mathrm{H}(C) + \sum_c \int d\mathbf{x} \sum_m \left(\phi_{m,c}\, \mathrm{pr}_{m,c} \ln \frac{\phi_{m,c}\, \mathrm{pr}_{m,c}}{b_m}\right) = \mathrm{H}(C) + \mathrm{H}(\mathbf{x} \mid m) - \mathrm{H}(C \mid m) - \sum_{i=1}^{N} w_i\, \mathrm{H}_i(\mathbf{x}), \quad (20)
\end{aligned}$$

where H(C) = −Σ_c P_c ln P_c is the entropy of the class label; H(x | m) = Σ_m P_m H(pr(x | m)) is the batch-conditional entropy of the data; H(C | m) = Σ_m P_m H_m(C) is the batch-conditional entropy of the label, where H_m(C) is the entropy of the class label for batch m; and H_i(x) = H(pr_i(x)) is the entropy of the i-th component.

We can further bound the batch-conditional entropy with pair-wise KL divergences as

$$\mathrm{I}(\mathbf{x}; C) \leq \mathrm{H}(C) + \hat{\mathrm{H}}_{\mathrm{KL}}(\mathbf{x} \mid m) - \mathrm{H}(C \mid m) - \sum_{i=1}^{N} w_i\, \mathrm{H}_i(\mathbf{x}) = \mathrm{H}(C) - \sum_m \sum_c \phi_{m,c} \ln \frac{\sum_{c'} \phi_{m,c'}\, e^{-\mathrm{KL}(\mathrm{pr}_{m,c} \| \mathrm{pr}_{m,c'})}}{\phi_{m,c}} := \mathrm{I}^{\mathrm{ub}}_{\mathrm{KL}}, \quad (21)$$

where Ĥ_KL(x | m) is an upper bound on the batch-conditional entropy; the inequality has been proved in [45].

The tightest upper bound attainable through this method can be found by varying the parameters φ_{m,c} to minimize I^ub_KL. The minimization problem has been proved to be convex (see Appendix A). The upper bound I^ub_KL can be minimized iteratively by fixing the parameters φ_{m,c'} (where c' ≠ c) and optimizing the parameters φ_{m,c} under linear constraints.
At each iteration step I^ub_KL is lowered, and the converged value is the tightest variational upper bound on the mutual information. Non-optimum variational parameters still provide upper bounds on I(x; C). There are M × Π variational parameters, M × Π inequality constraints and N equality constraints. When the number of classes or components is large, the minimization problem becomes computationally intensive. A non-optimum solution that is similar to the matched bound [8], [53] can be obtained by dividing all components into max(N_c) mini-batches by matching each component i to one component in each class. Mathematically speaking, φ_ij = w_i for one pair of matched (i, j) and φ_ij = 0 otherwise. To find the mini-batches, the Hungarian method [54], [55] for assignment problems can be applied.
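One possible realization of this matched, non-optimum assignment for a binary problem is sketched below, using SciPy's implementation of the Hungarian method; it assumes two classes with equal component counts and a precomputed cross-class KL matrix, which simplifies the general construction described above.

```python
# Sketch: matched (non-optimum) variational parameters for two classes via the
# Hungarian method. Assumes equal numbers of components per class and a matrix
# KL_ab[i, j] = KL(pr_i^(a) || pr_j^(b)) between class-a and class-b components.
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_variational_params(w_a, w_b, KL_ab):
    """Pair each class-a component with one class-b component (one pair per mini-batch)."""
    rows, cols = linear_sum_assignment(KL_ab)      # minimize the total pairwise KL
    # phi[m, 0]: weight of the class-a component in batch m; phi[m, 1]: class-b weight.
    phi = np.stack([w_a[rows], w_b[cols]], axis=1)
    return rows, cols, phi
```

The resulting φ can then be inserted into Eq. (13), or used as a starting point for the iterative minimization.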
III. NUMERICAL SIMULATIONS

In this section, we run numerical simulations and compare estimators of the mutual information between mixture data and class labels. We consider a simple example of binary classification of mixture data, where the mixture components are two-dimensional homoscedastic Gaussians. The component centers are close to the class boundary and uniformly distributed along the boundary. The locations of the component centers are plotted in Figure 1(a), where the component centers are represented by a red star (class 1) or a yellow circle (class 2). Each class consists of 100 two-dimensional Gaussian components with equal weights. The components have the same covariance matrix σ²I, where I is the identity matrix and σ represents the size of the Gaussian components. The conditional distribution pr(x | c = 2) is plotted in the inset of Figure 1(a). When σ is larger, the components of the mixture distribution are more connected; when σ is smaller, the components are more isolated. Estimates of I(x; C) are calculated for varying σ.

Fig. 1: (a) The locations of the component centers and the mixture distribution pr(x | c = 2) (inset). (b) Estimates of I(x; C).

A pair of obvious bounds on I(x; C) is [0, H(C)], where H(C) is the entropy of the class label Pr(C). Another pair of upper and lower bounds on I(x; C) can be derived from bounds on the mixture entropy as

$$\mathrm{I}^{\mathrm{lb}}_{2\mathrm{H}} = \mathrm{H}^{\mathrm{lb}}(\mathbf{x}) - \mathrm{H}^{\mathrm{ub}}(\mathbf{x} \mid C), \qquad \mathrm{I}^{\mathrm{ub}}_{2\mathrm{H}} = \mathrm{H}^{\mathrm{ub}}(\mathbf{x}) - \mathrm{H}^{\mathrm{lb}}(\mathbf{x} \mid C), \quad (22)$$

where upper and lower bounds on the mixture entropy based on pair-wise KL and C_α divergences have been provided by [45]. These bounds on I(x; C) are based on two entropy bounds, hence the subscript '2H'.

We evaluate the following estimates of I(x; C):
1) The new variational upper bound and the new lower bound, I^ub_KL and I^lb_Cα, plotted as dark red and blue solid lines, respectively.
2) The estimates based on the pair-wise KL, C_α or D (a function of both KL and C_α divergences), plotted as yellow, light blue and black dashed lines, respectively.
3) The true mutual information, I(x; C), as estimated by MC sampling of the mixture model (grey solid line).
4) The lower and upper bounds I^lb_2H and I^ub_2H, plotted as orange and green dot-dashed lines, respectively.

The obvious bounds on I(x; C), [0, H(C)], are also presented as a grey shaded area. The Monte Carlo simulation results, which serve as the benchmark, are calculated with a large number of samples. We use α = 1/2 in the calculation of the C_α divergences, as it provides the optimum bounds for our example. We also present details of our implementation and results for two other scenarios in Appendix B.

Our new upper bound and lower bound appear to be tighter than the bounds derived from entropy bounds over the range of σ considered in our simulation. In Figure 1(b), where the estimates of I(x; C) are plotted, the blue and dark red solid lines are almost always within the area covered by the green and orange dot-dashed lines. The three estimates, Î_KL, Î_Cα and Î_KL&Cα, all follow the trend of I(x; C). More specifically, Î_KL (yellow dashed line) follows the new variational upper bound I^ub_KL (dark red solid line) closely; Î_Cα (blue dashed line) is a good estimator of I(x; C) (grey solid line); and Î_KL&Cα (black dashed line) is another good estimator of I(x; C), as the black dashed line tracks the grey solid line closely. The differences between the three estimates and the I(x; C) calculated from MC simulation are plotted in Appendix B.
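As a rough illustration of the MC benchmark, the sketch below samples from a binary homoscedastic Gaussian mixture and averages the log-likelihood ratio in Eq. (1); the component centers, σ and sample count are illustrative assumptions, not the exact settings of Figure 1.

```python
# Sketch of an MC estimate of I(x; C) for a binary 2-D homoscedastic Gaussian
# mixture with equal class priors and equal component weights (illustrative setup).
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
sigma = 0.3
centers = {1: rng.uniform(-1, 1, size=(100, 2)) * [1.0, 0.05] + [0.0, 0.1],
           2: rng.uniform(-1, 1, size=(100, 2)) * [1.0, 0.05] - [0.0, 0.1]}

def mc_mutual_information(n_samples=50_000):
    xs, cs = [], []
    for c, mu in centers.items():                      # sample (x, c) from the joint
        k = rng.integers(0, len(mu), size=n_samples // 2)
        xs.append(mu[k] + sigma * rng.standard_normal((n_samples // 2, 2)))
        cs.append(np.full(n_samples // 2, c))
    x, c = np.vstack(xs), np.concatenate(cs)
    logp = {}
    for ci, mu in centers.items():
        # log pr(x | ci): equally weighted isotropic Gaussian components.
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        logp[ci] = (logsumexp(-0.5 * d2 / sigma**2, axis=1)
                    - np.log(len(mu)) - np.log(2 * np.pi * sigma**2))
    log_cond = np.where(c == 1, logp[1], logp[2])                       # ln pr(x | c)
    log_marg = logsumexp(np.stack([logp[1], logp[2]]), axis=0) - np.log(2)  # ln pr(x)
    return np.mean(log_cond - log_marg)   # in nats; divide by np.log(2) for bits

print(mc_mutual_information())
```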
IV. CONCLUSION
We provide closed-form bounds and approximations of the mutual information between mixture data and class labels. The closed-form expressions are based on pair-wise distances, which are feasible to compute even for high-dimensional data. Based on numerical results, the new bounds we propose are tighter than the bounds derived from bounds on entropy, and the approximations serve as good surrogates for the true mutual information.
APPENDIX A
THE MINIMIZATION OF I^ub_KL

The minimization problem of I^ub_KL with respect to the φ_{m,c} is convex, which we prove in this section. The convexity of the minimization problem can be checked through the first- and second-order derivatives. For ease of notation, we define S_{m,c} = Σ_{c'} φ_{m,c'} e^{-KL(pr_{m,c} || pr_{m,c'})} and E_{m,cc'} = exp(−KL(pr_{m,c} || pr_{m,c'})). The first derivative of I^ub_KL is:

$$\frac{\partial \mathrm{I}^{\mathrm{ub}}_{\mathrm{KL}}}{\partial \phi_{m,c}} = -\ln\left(\frac{S_{m,c}}{\phi_{m,c}}\right) - \frac{\phi_{m,c}}{S_{m,c}} - \sum_{c' \neq c} \frac{\phi_{m,c'}\, E_{m,c'c}}{S_{m,c'}} + 1. \quad (23)$$

The second derivative is:

$$H_{cc} = \frac{\partial^2 \mathrm{I}^{\mathrm{ub}}_{\mathrm{KL}}}{(\partial \phi_{m,c})^2} = \frac{(S_{m,c} - \phi_{m,c})^2}{(S_{m,c})^2\, \phi_{m,c}} + \sum_{c' \neq c} \frac{\phi_{m,c'}\,(E_{m,c'c})^2}{(S_{m,c'})^2} \quad (24)$$

for the diagonal terms and

$$H_{cc'} = \frac{\partial^2 \mathrm{I}^{\mathrm{ub}}_{\mathrm{KL}}}{\partial \phi_{m,c}\, \partial \phi_{m,c'}} = \frac{\phi_{m,c} - S_{m,c}}{(S_{m,c})^2}\, E_{m,cc'} + \frac{\phi_{m,c'} - S_{m,c'}}{(S_{m,c'})^2}\, E_{m,c'c}, \quad (25)$$

for c' ≠ c. For any given vector θ of length Π,

$$\begin{aligned}
\boldsymbol{\theta}^T H \boldsymbol{\theta} &= \sum_c \theta_c^2 H_{cc} + \sum_{c' \neq c} \theta_c \theta_{c'} H_{cc'} \\
&= \sum_c \left[ \frac{\theta_c^2\,(S_{m,c} - \phi_{m,c})^2}{(S_{m,c})^2\, \phi_{m,c}} + \sum_{c' \neq c} \frac{\theta_{c'}^2\, \phi_{m,c}\,(E_{m,cc'})^2}{(S_{m,c})^2} + 2\sum_{c' \neq c} \theta_c \theta_{c'}\, \frac{\phi_{m,c} - S_{m,c}}{(S_{m,c})^2}\, E_{m,cc'} \right] \\
&= \sum_c \left( \frac{(S_{m,c} - \phi_{m,c})\,\theta_c}{S_{m,c}\sqrt{\phi_{m,c}}} - \sum_{c' \neq c} \frac{\sqrt{\phi_{m,c}}\, E_{m,cc'}\, \theta_{c'}}{S_{m,c}} \right)^{\!2} \geq 0. \quad (26)
\end{aligned}$$

Therefore, I^ub_KL is convex when the φ_{m,c} are considered as the variables.

APPENDIX B
ON THE NUMERICAL SIMULATIONS
This appendix provides more simulation results and the detailed expressions used in the numerical simulations. The additional results consider two different distributions of the component-center locations. We further present the difference between the estimated and the true mutual information. Last but not least, the closed-form expressions include the KL and C_α divergences between Gaussian components, the bounds on I(x; C) derived from entropy bounds, and the relation between the Shannon mutual information and bounds on the binary classification error (P_e).

A. Numerical simulation results
We consider three scenarios: (1) the component centers are uniformly distributed along the class boundary, (2) the component centers are bunched into one group, and (3) the component centers are bunched into several groups. Results for the first scenario have been presented in Section III. In this section, we report on Scenarios 2 and 3. Illustrations of the two scenarios are shown in Figures 2(a) and 3(a), respectively. The results for these two scenarios are similar to those of Scenario 1. To further demonstrate that our estimators are good surrogates for the true mutual information, we plot the differences between the three estimates and the true mutual information calculated from MC sampling in Figure 4.
B. Closed-form expressions for Gaussian mixtures
Gaussian functions are often used as components in mixture distributions and have closed-form expressions for the pair-wise KL and C_α divergences. Denoting the difference in the means of two components as µ_ij = µ_i − µ_j and Σ_{α,ij} = (1 − α)Σ_i + αΣ_j, the C_α divergence between two Gaussian components is

$$C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_j) = \frac{\alpha(1-\alpha)}{2}\, \boldsymbol{\mu}_{ij}^T\, \Sigma_{\alpha,ij}^{-1}\, \boldsymbol{\mu}_{ij} + \frac{1}{2} \ln \frac{|\Sigma_{\alpha,ij}|}{|\Sigma_i|^{1-\alpha}\, |\Sigma_j|^{\alpha}}, \quad (27)$$

where | · | denotes the determinant. The KL divergence between the same two Gaussian components is

$$\mathrm{KL}(\mathrm{pr}_i \| \mathrm{pr}_j) = \frac{1}{2}\left[\boldsymbol{\mu}_{ij}^T\, \Sigma_j^{-1}\, \boldsymbol{\mu}_{ij} + \ln \frac{|\Sigma_j|}{|\Sigma_i|} + \mathrm{tr}(\Sigma_j^{-1}\Sigma_i) - d\right], \quad (28)$$

where tr(·) is the trace of the matrix in parentheses and d is the dimension of the data.

When all mixture components have equal covariance matrices Σ_i = Σ_j = Σ, we can denote λ_ij = µ_ij^T Σ^{-1} µ_ij and have

$$C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_j) = \alpha(1-\alpha)\,\lambda_{ij}/2, \qquad \mathrm{KL}(\mathrm{pr}_i \| \mathrm{pr}_j) = \lambda_{ij}/2. \quad (29)$$

With these expressions, the bounds and estimates of I(x; C) have simple forms.
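A minimal sketch of these closed-form Gaussian divergences (Eqs. (27)–(28)); the example means and covariance at the bottom are illustrative.

```python
# Sketch of the closed-form Gaussian divergences of Eqs. (27)-(28).
import numpy as np

def gaussian_chernoff(mu_i, Sigma_i, mu_j, Sigma_j, alpha=0.5):
    """C_alpha(pr_i || pr_j) for Gaussian components, Eq. (27)."""
    d_mu = mu_i - mu_j
    Sigma_a = (1 - alpha) * Sigma_i + alpha * Sigma_j
    quad = 0.5 * alpha * (1 - alpha) * d_mu @ np.linalg.solve(Sigma_a, d_mu)
    logdet = 0.5 * (np.linalg.slogdet(Sigma_a)[1]
                    - (1 - alpha) * np.linalg.slogdet(Sigma_i)[1]
                    - alpha * np.linalg.slogdet(Sigma_j)[1])
    return quad + logdet

def gaussian_kl(mu_i, Sigma_i, mu_j, Sigma_j):
    """KL(pr_i || pr_j) for Gaussian components, Eq. (28)."""
    d = len(mu_i)
    d_mu = mu_i - mu_j
    Sj_inv = np.linalg.inv(Sigma_j)
    return 0.5 * (d_mu @ Sj_inv @ d_mu
                  + np.linalg.slogdet(Sigma_j)[1] - np.linalg.slogdet(Sigma_i)[1]
                  + np.trace(Sj_inv @ Sigma_i) - d)

mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
S = 0.09 * np.eye(2)   # homoscedastic case: sigma^2 I with sigma = 0.3
print(gaussian_chernoff(mu1, S, mu2, S), gaussian_kl(mu1, S, mu2, S))
```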
C. Expressions of I^lb_2H and I^ub_2H

The detailed expressions for the mutual information bounds derived from entropy bounds are:

$$\mathrm{I}^{\mathrm{lb}}_{2\mathrm{H}} = \mathrm{H}(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j\, e^{-C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_j)}}{\sum_{k \in \{C_i\}} w_k\, e^{-\mathrm{KL}(\mathrm{pr}_i \| \mathrm{pr}_k)}}, \qquad \mathrm{I}^{\mathrm{ub}}_{2\mathrm{H}} = \mathrm{H}(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j\, e^{-\mathrm{KL}(\mathrm{pr}_i \| \mathrm{pr}_j)}}{\sum_{k \in \{C_i\}} w_k\, e^{-C_\alpha(\mathrm{pr}_i \| \mathrm{pr}_k)}}. \quad (30)$$

D. Bounds on P_e for binary classification

Bounds on I(x; C) can be used to calculate bounds on P_e. Fano's inequality [1] provides a lower bound on P_e for binary classification, as follows:

$$P_e \geq h_b^{-1}\big[\mathrm{H}(C) - \mathrm{I}(\mathbf{x}; C)\big], \quad (31)$$

where h_b(x) = −x log(x) − (1 − x) log(1 − x) is the binary entropy function and h_b^{-1}(·) is the inverse function of h_b(·). More specifically, one can calculate P_e by placing the value H(C) − I(x; C) on the left side of the binary entropy function and solving for x.

Fig. 2: Scenario 2, where the component centers are bunched into one group: illustration (a), the mixture distribution pr(x | c = 2) (inset), and estimates of I(x; C) (b).

Fig. 3: Scenario 3, where the component centers are bunched into multiple groups: illustration (a), the mixture distribution pr(x | c = 2) (inset), and estimates of I(x; C) (b).

Fig. 4: Î(x; C) − I(x; C) for (a) Scenario 1, (b) Scenario 2 and (c) Scenario 3. The I(x; C) is calculated from MC simulations.

A tight upper bound on the binary classification error P_e has been reported recently [3],

$$P_e \leq \min\big\{P_{\min},\; f^{-1}\big[\mathrm{H}(C) - \mathrm{I}(\mathbf{x}; C)\big]\big\} := \hat{P}_e^{\mathrm{ub}}, \quad (32)$$

where P_min is min{P_1, P_2}, and f(x) is a function defined by

$$f(x) = -P_{\min} \log \frac{P_{\min}}{x + P_{\min}} - x \log \frac{x}{x + P_{\min}}, \quad (33)$$

and f^{-1}(·) is the inverse function of f(·).

When P_e ≪ 1, −P_e(log P_e − log P_min) ≲ H(C) − I(x; C) ≲ −P_e log P_e. Therefore, P_e is of the same order of magnitude as H(C) − I(x; C) when P_e ≪ 1.

E. P_e estimates

When σ is small, all estimates of I(x; C) converge to 1 bit. To compare the estimates for intermediate values of σ, we calculate and present an estimate of P_e in this section. This estimate of P_e is the upper bound on P_e presented in the previous subsection. When a lower bound on I(x; C) is used in the calculation (blue solid line and green dashed line), the estimates of P_e are upper bounds on P_e. The other lines in the P_e plots are neither upper bounds nor lower bounds. The black lines in the plots, which give the estimate of P_e calculated from I_MC, are also not the true P_e but can serve as a benchmark for an upper bound on P_e.

Fig. 5: Estimates of P_e calculated from a number of estimators of I(x; C).

In Figure 5, the blue and dark red solid lines are significantly closer to the grey line than the green and orange dashed lines. The results demonstrate that our new bounds are tighter than the bounds derived from entropy bounds. Furthermore, the three estimators (blue, black and yellow dashed lines) are good surrogates for the true mutual information.
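A sketch of how an estimate of I(x; C) can be converted into the P_e bounds of Eqs. (31)–(32) by numerically inverting h_b and f; it assumes a binary task with priors P_1 + P_2 = 1 and works in bits, and the helper names are our own.

```python
# Sketch: converting a mutual-information value (in bits) into the P_e bounds of
# Eqs. (31)-(32) for a binary task, by numerical inversion (bisection/Brent).
import numpy as np
from scipy.optimize import brentq

def h_b(x):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * np.log2(p) for p in (x, 1 - x) if p > 0)

def fano_lower_bound(I, P1=0.5):
    """P_e >= h_b^{-1}[H(C) - I], Eq. (31); inverse taken on [0, 1/2]."""
    gap = h_b(P1) - I
    if gap <= 0:
        return 0.0
    return brentq(lambda p: h_b(p) - gap, 1e-12, 0.5)

def pe_upper_bound(I, P1=0.5):
    """P_e <= min{P_min, f^{-1}[H(C) - I]}, Eqs. (32)-(33)."""
    Pmin = min(P1, 1 - P1)
    f = lambda x: (-Pmin * np.log2(Pmin / (x + Pmin))
                   - x * np.log2(x / (x + Pmin))) if x > 0 else 0.0
    gap = h_b(P1) - I
    if gap <= 0:
        return 0.0
    return min(Pmin, brentq(lambda t: f(t) - gap, 1e-12, 1.0))

print(fano_lower_bound(0.9), pe_upper_bound(0.9))
```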
REFERENCES

[1] R. M. Fano and D. Hawkins, "Transmission of information: A statistical theory of communications," American Journal of Physics, vol. 29, pp. 793–794, 1961.
[2] V. Kovalevskij, "The problem of character recognition from the point of view of mathematical statistics," 1967.
[3] B.-G. Hu and H.-J. Xing, "An optimization approach of deriving bounds between entropy and error from joint distribution: Case study for binary classifications," Entropy, vol. 18, no. 2, p. 59, 2016.
[4] J. R. Vergara and P. A. Estévez, "A review of feature selection methods based on mutual information," Neural Computing and Applications, vol. 24, no. 1, pp. 175–186, 2014.
[5] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.
[6] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint physics/0004057, 2000.
[7] M. A. Neifeld, A. Ashok, and P. K. Baheti, "Task-specific information for imaging system analysis," JOSA A, vol. 24, no. 12, pp. B25–B41, 2007.
[8] J. R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler divergence between Gaussian mixture models," vol. 4, IEEE, 2007, pp. IV–317.
[9] J. Goldberger, S. Gordon, and H. Greenspan, "An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures," IEEE, 2003, p. 487.
[10] Y. Ding and A. Ashok, "X-ray measurement model incorporating energy-correlated material variability and its application in information-theoretic system analysis," arXiv preprint arXiv:2002.11046, 2020.
[11] J. M. Duarte-Carvajalino, G. Yu, L. Carin, and G. Sapiro, "Task-driven adaptive statistical compressive sensing of Gaussian mixture models," IEEE Transactions on Signal Processing, vol. 61, no. 3, pp. 585–600, 2012.
[12] B. Noack, M. Reinhardt, and U. D. Hanebeck, "On nonlinear track-to-track fusion with Gaussian mixtures," IEEE, 2014, pp. 1–8.
[13] J. Goldberger and S. T. Roweis, "Hierarchical clustering of a mixture model," in Advances in Neural Information Processing Systems, 2005, pp. 505–512.
[14] C. Nair, B. Prabhakar, and D. Shah, "On entropy for mixtures of discrete and continuous variables," arXiv preprint cs/0607075, 2006.
[15] A. Beknazaryan, X. Dang, and H. Sang, "On mutual information estimation for mixed-pair random variables," Statistics & Probability Letters, vol. 148, pp. 9–16, 2019.
[16] L. Kozachenko and N. N. Leonenko, "Sample estimate of the entropy of a random vector," Problemy Peredachi Informatsii, vol. 23, no. 2, pp. 9–16, 1987.
[17] I. Ahmad and P.-E. Lin, "A nonparametric estimation of the entropy for absolutely continuous distributions (corresp.)," IEEE Transactions on Information Theory, vol. 22, no. 3, pp. 372–375, 1976.
[18] B. Laurent et al., "Efficient estimation of integral functionals of a density," The Annals of Statistics, vol. 24, no. 2, pp. 659–681, 1996.
[19] G. P. Basharin, "On a statistical estimate for the entropy of a sequence of independent random variables," Theory of Probability & Its Applications, vol. 4, no. 3, pp. 333–336, 1959.
[20] B. C. Ross, "Mutual information between discrete and continuous data sets," PLoS ONE, vol. 9, no. 2, p. e87357, 2014.
[21] K. R. Moon, K. Sricharan, and A. O. Hero, "Ensemble estimation of mutual information," IEEE, 2017, pp. 3030–3034.
[22] W. Gao, S. Kannan, S. Oh, and P. Viswanath, "Estimating mutual information for discrete-continuous mixtures," in Advances in Neural Information Processing Systems, 2017, pp. 5986–5997.
[23] G. A. Darbellay and I. Vajda, "Estimation of the information by an adaptive partitioning of the observation space," IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
[24] R. Moddemeijer, "On estimation of entropy and mutual information of continuous distributions," Signal Processing, vol. 16, no. 3, pp. 233–248, 1989.
[25] A. M. Fraser and H. L. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, no. 2, p. 1134, 1986.
[26] Y.-I. Moon, B. Rajagopalan, and U. Lall, "Estimation of mutual information using kernel density estimators," Physical Review E, vol. 52, no. 3, p. 2318, 1995.
[27] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman et al., "Nonparametric von Mises estimators for entropies, divergences and mutual informations," Advances in Neural Information Processing Systems, vol. 28, pp. 397–405, 2015.
[28] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[29] S. Singh and B. Póczos, "Finite-sample analysis of fixed-k nearest neighbor density functional estimators," Advances in Neural Information Processing Systems, vol. 29, pp. 1217–1225, 2016.
[30] M. M. V. Hulle, "Edgeworth approximation of multivariate differential entropy," Neural Computation, vol. 17, no. 9, pp. 1903–1910, 2005.
[31] S. Gao, G. Ver Steeg, and A. Galstyan, "Efficient estimation of mutual information for strongly dependent variables," in Artificial Intelligence and Statistics, 2015, pp. 277–286.
[32] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Minimax estimation of functionals of discrete distributions," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015.
[33] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, "Mutual information neural estimation," in International Conference on Machine Learning, 2018, pp. 531–540.
[34] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker, "On variational bounds of mutual information," arXiv preprint arXiv:1905.06922, 2019.
[35] J. V. Michalowicz, J. M. Nichols, and F. Bucholtz, Handbook of Differential Entropy. CRC Press, 2013.
[36] F. Nielsen and R. Nock, "Entropies and cross-entropies of exponential families," IEEE, 2010, pp. 3621–3624.
[37] M. A. Carreira-Perpinan, "Mode-finding for mixtures of Gaussian distributions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1318–1323, 2000.
[38] J. V. Michalowicz, J. M. Nichols, and F. Bucholtz, "Calculation of differential entropy for a mixed Gaussian distribution," Entropy, vol. 10, no. 3, pp. 200–206, 2008.
[39] O. Zobay et al., "Variational Bayesian inference with Gaussian-mixture approximations," Electronic Journal of Statistics, vol. 8, no. 1, pp. 355–389, 2014.
[40] J.-Y. Chen, J. R. Hershey, P. A. Olsen, and E. Yashchin, "Accelerated Monte Carlo for Kullback-Leibler divergence between Gaussian mixture models," IEEE, 2008, pp. 4553–4556.
[41] H. Joe, "Estimation of entropy and other functionals of a multivariate density," Annals of the Institute of Statistical Mathematics, vol. 41, no. 4, pp. 683–697, 1989.
[42] F. Nielsen and R. Nock, "MaxEnt upper bounds for the differential entropy of univariate continuous distributions," IEEE Signal Processing Letters, vol. 24, no. 4, pp. 402–406, 2017.
[43] M. F. Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck, "On entropy approximation for Gaussian mixture random vectors," IEEE, 2008, pp. 181–188.
[44] S. J. Julier and J. K. Uhlmann, "A general method for approximating nonlinear transformations of probability distributions," Robotics Research Group, Department of Engineering Science, Tech. Rep., 1996.
[45] A. Kolchinsky and B. D. Tracey, "Estimating mixture entropy with pairwise distances," Entropy, vol. 19, no. 7, p. 361, 2017.
[46] P. Hall and S. C. Morton, "On the estimation of entropy," Annals of the Institute of Statistical Mathematics, vol. 45, no. 1, pp. 69–88, 1993.
[47] T. Jebara and R. Kondor, "Bhattacharyya and expected likelihood kernels," in Learning Theory and Kernel Machines. Springer, 2003, pp. 57–71.
[48] T. Jebara, R. Kondor, and A. Howard, "Probability product kernels," Journal of Machine Learning Research, vol. 5, no. Jul, pp. 819–844, 2004.
[49] J. E. Contreras-Reyes and D. D. Cortés, "Bounds on Rényi and Shannon entropies for finite mixtures of multivariate skew-normal distributions: Application to swordfish (Xiphias gladius Linnaeus)," Entropy, vol. 18, no. 11, p. 382, 2016.
[50] H. Chernoff et al., "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations," The Annals of Mathematical Statistics, vol. 23, no. 4, pp. 493–507, 1952.
[51] D. Haussler, M. Opper et al., "Mutual information, metric entropy and cumulative relative entropy risk," The Annals of Statistics, vol. 25, no. 6, pp. 2451–2492, 1997.
[52] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[53] M. N. Do, "Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models," IEEE Signal Processing Letters, vol. 10, no. 4, pp. 115–118, 2003.
[54] J. Munkres, "Algorithms for the assignment and transportation problems," Journal of the Society for Industrial and Applied Mathematics, vol. 5, no. 1, pp. 32–38, 1957.
[55] H. W. Kuhn, "The Hungarian method for the assignment problem,"