Measuring dependence powerfully and equitably
Yakir A. Reshef∗†, David N. Reshef∗, Hilary K. Finucane, Pardis C. Sabeti∗∗, Michael M. Mitzenmacher∗∗

Abstract
Given a high-dimensional data set we often wish to find the strongest relationships within it. A common strategy is to evaluate a measure of dependence on every variable pair and retain the highest-scoring pairs for follow-up. This strategy works well if the statistic used is equitable [1], i.e., if, for some measure of noise, it assigns similar scores to equally noisy relationships regardless of relationship type (e.g., linear, exponential, periodic).

In this paper, we introduce and characterize a population measure of dependence called MIC∗. We show three ways that MIC∗ can be viewed: as the population value of MIC, a highly equitable statistic from [2]; as a canonical "smoothing" of mutual information; and as the supremum of an infinite sequence defined in terms of optimal one-dimensional partitions of the marginals of the joint distribution. Based on this theory, we introduce an efficient approach for computing MIC∗ from the density of a pair of random variables, and we define a new consistent estimator MICe for MIC∗ that is efficiently computable. In contrast, there is no known polynomial-time algorithm for computing the original equitable statistic MIC. We show through simulations that MICe has better bias-variance properties than MIC.

We then introduce and prove the consistency of a second statistic, TICe, that is a trivial side-product of the computation of MICe and whose goal is powerful independence testing rather than equitability. We show in simulations that MICe and TICe have good equitability and power against independence respectively. The analyses here complement a more in-depth empirical evaluation of several leading measures of dependence [3] that shows state-of-the-art performance for MICe and TICe.

∗ Co-first author. ∗∗ Co-last author. † To whom correspondence should be addressed. Email: [email protected]. Author affiliations: School of Engineering and Applied Sciences, Harvard University; Department of Computer Science, Massachusetts Institute of Technology; Department of Mathematics, Massachusetts Institute of Technology; Department of Organismic and Evolutionary Biology, Harvard University; Broad Institute of MIT and Harvard.

1 Introduction

The growing dimensionality of today's data sets has popularized the idea of hypothesis-generating science, whereby a data set is used not to test existing hypotheses but rather to help a researcher formulate new ones. A common approach among practitioners is to evaluate some statistic on many candidate variable pairs in a data set, sort the variable pairs from highest-scoring to lowest, and manually examine all the pairs above a threshold score [4, 5].

A popular class of statistics used for such analyses is measures of dependence, i.e., statistics whose population value is 0 in cases of statistical independence and non-zero otherwise. Measures of dependence are attractive because they guarantee that asymptotically no non-trivial relationship will erroneously be declared trivial. In the setting of continuous-valued data, which is our focus, there is a long line of fruitful research on such statistics including, e.g., [2, 6–15].

The utility of a measure of dependence φ̂ can be assessed in two ways. The first is power against independence, i.e., the power of independence testing based on φ̂ to detect various types of non-trivial relationships. This is an important goal for datasets that have very few non-trivial relationships, or only very weak relationships that are difficult to detect.
Often, however, the number of relationships declared statistically significant by a measure of dependence greatly exceeds the number of relationships that can then be explored further. For example, biological datasets often contain many non-trivial relationships, but testing a preliminary finding for further corroboration may take extensive manual lab work, or a study on human or animal subjects. In this case, it is tempting to restrict follow-up to relationships with high values of φ̂, but this can skew the direction of follow-up work: if φ̂ systematically assigns higher scores to, say, linear relationships than to non-linear ones, relatively noisy linear relationships might crowd out strong non-linear relationships from the top-scoring set.

Motivated by this problem, in a companion paper [1] we define a second way of assessing a measure of dependence called equitability. Informally, an equitable statistic is one that, for some measure of relationship strength, assigns similar scores to equally strong relationships regardless of relationship type. For instance, we may want our measure of dependence to also have the property that on noisy functional relationships it assigns similar scores to relationships with the same R², i.e., the squared Pearson correlation between the observed y-values and the x-values passed through the underlying function in question [2]. Or, alternatively, we may want the value of our statistic to tell us about the proportion of points coming from the deterministic component of a mixture containing part signal and part uniform noise [16]. Defining measures of dependence that achieve good equitability with respect to interesting measures of relationship strength is a new and challenging problem, with a number of different formalizations. (See, e.g., [1] and [16] cited above, as well as [17] along with associated technical comments [18] and [19].)

In this paper, we introduce and theoretically characterize two new measures of dependence that we empirically show to have good equitability with respect to R² and power against independence, respectively. We begin by introducing a new population measure of dependence called MIC∗. Given a pair of jointly distributed random variables (X, Y), MIC∗(X, Y) is the supremum, over all finite grids G imposed on the support of (X, Y), of the mutual information of the discrete distribution induced by (X, Y) on the cells of G, subject to a regularization based on the resolution of G. We prove three results, each of which gives a different way that this population quantity can be viewed.

1. MIC∗ is the population value of the maximal information coefficient (MIC), a statistic introduced in [2] that is highly equitable with respect to R² on a large class of noisy functional relationships. Simple corollaries of this result simplify and strengthen many of the theoretical results proven in [2] about MIC.

2. MIC∗ is a minimal "smoothing" of mutual information, in the sense that the regularization in the definition of MIC∗ renders it uniformly continuous as a function of random variables, and no smaller regularization achieves continuity. A corollary of this is that MIC∗ is uniformly continuous while mutual information is not continuous.

3. MIC∗ is the supremum of an infinite sequence defined in terms of optimal partitions of the marginal distributions of (X, Y) rather than optimal (two-dimensional) grids imposed on the joint distribution.
This characterization greatly simplifies the computation of MIC∗ and associated quantities.

After proving these three results, we leverage them to introduce efficient algorithms both for approximately computing MIC∗ and for estimating it from finite samples. We first provide an efficient algorithm that in many cases allows for computation to arbitrary precision of the MIC∗ of a pair of random variables whose joint density is known. We then introduce a statistic, called MICe, that we prove is a consistent estimator of MIC∗. In contrast to the MIC statistic from [2], for which no efficient algorithm is known and a heuristic algorithm is used in practice, MICe is efficiently computable. It has a better runtime complexity than the heuristic algorithm currently in use for computing the original MIC statistic, and is orders of magnitude faster in practice.

With a consistent and fast estimator for MIC∗ in hand, we turn to empirical analysis of its performance. Specifically, we show through simulation that MICe has better bias/variance properties than the heuristic algorithm used in [2] for computing MIC, which has no theoretical convergence guarantees. Our analysis also reveals that the main parameter of MICe can be used to tune statistical performance toward either stronger or weaker relationships in general. After studying the bias/variance properties of MICe, we then demonstrate via simulation that it outperforms currently available methods in terms of equitability with respect to R². Notably, we show this performance advantage both on the set of functional relationships analyzed in [2] as well as on a large set of randomly chosen noisy functional relationships.

We choose in this paper to analyze equitability specifically with respect to R², rather than some other notion of relationship strength, because R² on noisy functional relationships is a simple measure with broad familiarity and intuitive interpretation among practitioners. Of course, it is also important to develop measures of dependence that are equitable with respect to notions of relationship strength besides R², or on families of relationships besides noisy functional relationships; however, our focus here remains on the "simple" case of R² on noisy functional relationships.

Importantly, we note that although there are methods for directly estimating the R² of a noisy functional relationship via nonparametric regression (see, e.g., [20, 21]), those methods are not applicable in the context of equitability because they are not measures of dependence. That is, because non-parametric regression methods assume a functional form for the relationship in question, they can give trivial scores to non-functional relationships, even in the large-sample limit. A simple example of this is when a distribution is supported on a circle, so that the regression function is constant. In contrast, a measure of dependence is guaranteed never to make this "mistake". A measure of dependence that is equitable with respect to R² can therefore be viewed either as an "upgraded" measure of dependence that also comes with some of the interpretability properties of non-parametric regression, or as an "upgraded" approximate non-parametric regression method that also has the robustness properties of a measure of dependence.

The main strength of MICe is equitability rather than power to reject a null hypothesis of independence. In some settings, though, it may be important to have good power against independence.
We therefore introduce here a statistic closely related to MICe called the total information coefficient TICe. We prove the consistency of testing for independence using TICe, and show via simulations that it achieves excellent power in practice, performing comparably to or better than current methods. Because TICe arises naturally as a side-product of the computation of MICe, it is available "for free" once MICe has been computed. This leads us to propose a data analysis strategy consisting of first using TICe to filter out non-significant relationships, and then ranking the remaining ones using the simultaneously computed values of MICe.

In addition to the companion paper [1], which focuses on the theory behind equitability, this paper is accompanied by a second companion work [3] that explores in detail the empirical performance of the methods introduced here. That paper shows, by comparing MICe and TICe to several leading measures of dependence under many different sampling and noise models, that the equitability of MICe on noisy functional relationships and the power of independence testing using TICe are both state-of-the-art. It also shows that these methods can be computed very fast in practice.

Taken together, our results shed significant light on the theory behind the maximal information coefficient, and suggest that TICe and MICe are a useful pair of methods for data exploration. Specifically, they point to joint use of these two statistics to filter and then rank relationships as a fast and practical way to explore large data sets by measuring dependence both powerfully and equitably.

2 Preliminaries

We work extensively in this paper with grids and discrete distributions over their cells. Given a grid G and a point (x, y), we define the function row_G(y) to be the row of G containing y, and we define col_G(x) analogously. For a pair (X, Y) of jointly distributed random variables, we write (X, Y)|_G to denote (col_G(X), row_G(Y)), and we use I((X, Y)|_G) to denote the discrete mutual information [22–24] between col_G(X) and row_G(Y). Given a finite sample D from the distribution of (X, Y), we sometimes use D to refer both to the set of points in the sample as well as to a point chosen uniformly at random from D. In the latter case, it will then make sense to talk about, e.g., D|_G and I(D|_G).

For natural numbers k and ℓ, we use G(k, ℓ) to denote the set of all k-by-ℓ grids (possibly with empty rows/columns). A grid G is an equipartition of (X, Y) if all the rows of (X, Y)|_G have the same probability mass, and all the columns do as well. We also use the term equipartition in the analogous way for one-dimensional partitions into just rows or columns. For a one-dimensional partition P into rows and a one-dimensional partition Q into columns, we write (P, Q) to refer to the grid constructed from these two partitions. When a partition P can be obtained from a partition P′ by addition of separators alone, we write P′ ⊂ P.

Finally, let us establish some notation for infinite matrices. We use m∞ to denote the space of infinite matrices equipped with the supremum norm. Given a matrix A ∈ m∞, we often examine only the (k, ℓ)-th entries of A for which kℓ ≤ i for some i. Thus, for i ∈ Z+, we define the projection r_i : m∞ → m∞ via

r_i(A)_{k,ℓ} = A_{k,ℓ} if kℓ ≤ i, and 0 if kℓ > i.
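As a concrete illustration of this notation, the following sketch (ours, not from the paper) computes I(D|_G) for a sample D and a grid G specified by hypothetical cut points; mutual information is computed in bits, i.e., with log base 2.

```python
# A minimal sketch of the notation above: imposing a grid G on a sample D
# and computing the discrete mutual information I(D|_G). The cut points
# defining G are hypothetical inputs chosen by the caller.
import numpy as np

def mutual_information_on_grid(x, y, x_cuts, y_cuts):
    """I(D|_G): discrete MI of the sample after binning by the grid G.

    x_cuts/y_cuts are the interior cut points of the column/row partitions,
    so col_G and row_G are implemented by np.searchsorted.
    """
    cols = np.searchsorted(np.sort(x_cuts), x)   # col_G(x) for each point
    rows = np.searchsorted(np.sort(y_cuts), y)   # row_G(y) for each point
    k, l = len(x_cuts) + 1, len(y_cuts) + 1
    joint = np.zeros((k, l))
    for c, r in zip(cols, rows):
        joint[c, r] += 1
    joint /= joint.sum()                          # empirical distribution on cells
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Example: a noiseless parabola seen through a 2-by-2 grid.
x = np.linspace(-1, 1, 1000)
print(mutual_information_on_grid(x, x**2, x_cuts=[0.0], y_cuts=[0.5]))
```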
3 The population maximal information coefficient MIC∗

In this section, we define and characterize the population maximal information coefficient MIC∗. We begin by defining the population quantity MIC∗(X, Y) for a pair of jointly distributed random variables (X, Y). We then show three different ways to characterize this population quantity: first, as the large-sample limit of the statistic MIC from [2]; second, as a minimally smoothed version of mutual information; and third, as the supremum of an infinite sequence defined in terms of optimal one-dimensional partitions of the marginals of the joint distribution of (X, Y). We conclude the section by showing how the third characterization leads to an efficient approach for computing MIC∗ from the density of (X, Y).

3.1 Definition of MIC∗

The population maximal information coefficient can be defined in several equivalent ways, as we will see later. For now, we begin with the simplest definition.
Definition 3.1.
Let (X, Y) be jointly distributed random variables. The population maximal information coefficient (MIC∗) of (X, Y) is defined by

MIC∗(X, Y) = sup_G I((X, Y)|_G) / log ‖G‖

where ‖G‖ denotes the minimum of the number of rows of G and the number of columns of G.

Given that I(X, Y) = sup_G I((X, Y)|_G) (see, e.g., Chapter 8 of [22]), this can be viewed as a regularized version of mutual information that penalizes complicated grids and ensures that the result falls between 0 and 1.

Before we continue, we state one simple equivalent definition of MIC∗ that is useful for the results in this section. This definition views MIC∗ as the supremum of a matrix called the population characteristic matrix, defined below.

Definition 3.2.
Let (X, Y) be jointly distributed random variables. Let

I∗((X, Y), k, ℓ) = max_{G ∈ G(k,ℓ)} I((X, Y)|_G).

The population characteristic matrix of (X, Y), denoted by M(X, Y), is defined by

M(X, Y)_{k,ℓ} = I∗((X, Y), k, ℓ) / log min{k, ℓ}

for k, ℓ > 1.

It is easy to see the following:

Proposition 1.
Let (X, Y) be jointly distributed random variables. We have

MIC∗(X, Y) = sup M(X, Y)

where M(X, Y) is the population characteristic matrix of (X, Y).

The population characteristic matrix is so named because just as MIC∗, the supremum of this matrix, captures a sense of relationship strength, other properties of this matrix correspond to different properties of relationships. For instance, later in this paper we introduce an additional property of the characteristic matrix, the total information coefficient, that is useful for testing for the presence or absence of a relationship rather than for quantifying relationship strength.

3.2 First alternate characterization: MIC∗ is the population value of MIC

With MIC∗ defined, we now state our first alternate characterization of it, as the large-sample limit of the statistic MIC introduced in [2]. We begin by first reproducing a description of MIC from [2], via the two definitions below.

Definition 3.3 ([2]). Let D ⊂ ℝ² be a set of ordered pairs. The sample characteristic matrix M̂(D) of D is defined by

M̂(D)_{k,ℓ} = I∗(D, k, ℓ) / log min{k, ℓ}.

Definition 3.4 ([2]). Let D ⊂ ℝ² be a set of n ordered pairs, and let B : Z+ → Z+. We define

MIC_B(D) = max_{kℓ ≤ B(n)} M̂(D)_{k,ℓ}

where the function B(n) is specified by the user. In [2], it was suggested that B(n) be chosen to be n^α for some constant α in the range of 0.55 to 0.6. (The statistics we introduce later will have an analogous parameter. See Section 4.2.1.)

We have shown the following result about convergence of functions of the sample characteristic matrix to their population counterparts, a consequence of which is the convergence of MIC to MIC∗.
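To make Definitions 3.3 and 3.4 concrete, here is a small brute-force sketch (ours; the paper's actual algorithms come later) that computes I∗(D, k, ℓ) by exhaustively searching over grids whose cut points fall between consecutive data values, then takes the normalized maximum over grid sizes with kℓ ≤ B. It reuses mutual_information_on_grid from Section 2 and is exponential in the grid size, so it is only feasible for toy inputs.

```python
# Brute-force illustration of I*(D, k, l) and MIC_B on tiny samples.
from itertools import combinations
import numpy as np

def I_star(x, y, k, l):
    xc, yc = np.unique(x), np.unique(y)
    x_sites = (xc[:-1] + xc[1:]) / 2   # candidate column cut points
    y_sites = (yc[:-1] + yc[1:]) / 2   # candidate row cut points
    best = 0.0
    for cuts_x in combinations(x_sites, k - 1):
        for cuts_y in combinations(y_sites, l - 1):
            best = max(best, mutual_information_on_grid(
                x, y, list(cuts_x), list(cuts_y)))
    return best

def mic_naive(x, y, B):
    # Normalized maximum over all grid sizes with k*l <= B (Definition 3.4).
    return max(I_star(x, y, k, l) / np.log2(min(k, l))
               for k in range(2, B + 1) for l in range(2, B + 1) if k * l <= B)

rng = np.random.default_rng(0)
x = rng.uniform(size=12)
print(mic_naive(x, x + 0.1 * rng.normal(size=12), B=8))
```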
(In the theorem statement below, recall that m∞ is the space of infinite matrices equipped with the supremum norm, and given a matrix A the projection r_i zeros out all the entries A_{k,ℓ} for which kℓ > i.)

Theorem 1. Let f : m∞ → ℝ be uniformly continuous, and assume that f ∘ r_i → f pointwise. Then for every random variable (X, Y), we have

(f ∘ r_{B(n)})(M̂(D_n)) → f(M(X, Y)) in probability

where D_n is a sample of size n from the distribution of (X, Y), provided B(n) = ω(1).

Since the supremum of a matrix is uniformly continuous as a function on m∞ and can be realized as the limit of maxima of larger and larger segments of the matrix, this theorem yields our claim about MIC∗ as a corollary.

Corollary.
MIC_B is a consistent estimator of MIC∗ provided B(n) = ω(1) and B(n) ≤ O(n^{1−ε}) for some ε > 0.

We prove Theorem 1 in Appendix A and provide here some intuition for why it should hold as well as a description of the obstacles that must be overcome in the proof.

To see why the theorem should hold, fix a random variable (X, Y) and let D be a sample of size n from its distribution. It is known that, for a fixed grid G, I(D|_G) is a consistent estimator of I((X, Y)|_G) [9, 25]. We might therefore expect I∗(D, k, ℓ) to be a consistent estimator of I∗((X, Y), k, ℓ) as well. And if I∗(D, k, ℓ) is a consistent estimator of I∗((X, Y), k, ℓ), then we might expect the maximum of the sample characteristic matrix (which just consists of normalized I∗ terms) to be a consistent estimator of the supremum of the true characteristic matrix.

These intuitions turn out to be true, but there are two reasons they are non-trivial to prove. First, consistency for I∗ does not follow from abstract considerations, since the maximum of an infinite set of estimators is not necessarily a consistent estimator of the supremum of the estimands.¹ Second, consistency of I∗ alone does not suffice to show that the maximum of the sample characteristic matrix converges to MIC∗. In particular, if B(n) grows too quickly, and the convergence of I∗(D, k, ℓ) to I∗((X, Y), k, ℓ) is slow, inflated values of MIC can result. To see this, notice that if B(n) = ∞ then MIC = 1 always, even though each individual entry of the sample characteristic matrix converges to its true value eventually.

The technical heart of the proof is overcoming these obstacles by using the dependencies between the quantities I(D|_G) for different grids G to not only show the consistency of I∗(D, k, ℓ) but then to quantify how quickly I∗(D, k, ℓ) actually converges to I∗((X, Y), k, ℓ).

3.3 Second alternate characterization: MIC∗ is a minimally smoothed mutual information

We now describe a second equivalent view of MIC∗. Recall that for a pair of jointly distributed random variables (X, Y), we defined MIC∗(X, Y) as

MIC∗(X, Y) = sup_G I((X, Y)|_G) / log ‖G‖

where ‖G‖ denotes the minimum of the number of rows of G and the number of columns of G. As we discussed in Section 3.1, the mutual information I(X, Y) is also a supremum, namely

I(X, Y) = sup_G I((X, Y)|_G),

and so MIC∗ can be viewed as a regularized version of I. It is natural to ask whether the regularization in the definition of MIC∗ has any smoothing effect on I. In this sub-section we show first that it does, in the sense that MIC∗ is uniformly continuous as a function of random variables with respect to the metric of statistical distance², and second that the regularization by log ‖G‖ is in fact the minimal one necessary for achieving any sort of continuity. As a corollary, we obtain that I by itself is not continuous as a function of random variables with respect to the metric of statistical distance. This yields a view of MIC∗ as a canonical smoothing of I that yields continuity.

Formally, let P(ℝ²) denote the space of pairs of jointly distributed random variables supported on ℝ², equipped with the metric of statistical distance. Our first claim is that as a function defined on P(ℝ²), MIC∗ is uniformly continuous. We prove this claim by establishing a stronger result: the uniform continuity of the characteristic matrix M(X, Y).
Specifically, by showing that the family of maps corresponding to each individual entry of the characteristic matrix is uniformly equicontinuous, we establish the following result.

Theorem 2.
The map from P(ℝ²) to m∞ defined by (X, Y) ↦ M(X, Y) is uniformly continuous.

Proof. See Appendix B.

¹ If θ̂₁, ..., θ̂_k is a finite set of estimators, then a union bound shows that the random variable (θ̂₁(D), ..., θ̂_k(D)) converges in probability to (θ₁, ..., θ_k) with respect to the supremum metric. The continuous mapping theorem then gives the desired result. However, if the set of estimators is infinite, the union bound cannot be employed. And indeed, if we let θ₁ = θ₂ = ··· = 0, and let θ̂_i(D_n) = i/n deterministically, then each θ̂_i is a consistent estimator of θ_i, but since the set {θ̂₁(D_n), θ̂₂(D_n), ...} = {1/n, 2/n, ...} is unbounded, sup_i θ̂_i(D_n) = ∞ for every n.

² Recall that the statistical distance between random variables A and B is defined as sup_T |P(A ∈ T) − P(B ∈ T)|. When A and B have probability density functions or probability mass functions, this equals one-half of the L¹ distance between those functions.

Since the supremum is a continuous function on m∞, Theorem 2 yields the following corollary.

Corollary.
The map (X, Y) ↦ MIC∗(X, Y) is uniformly continuous.

Similar corollaries exist for any continuous function of the characteristic matrix.

Interestingly, Theorem 2 relies crucially on the normalization in the definition of the characteristic matrix. This is not a coincidence: as the following proposition shows, any normalization that is meaningfully smaller than the one in the definition of the characteristic matrix will cause the matrix to contain an infinite discontinuity as a function on P(ℝ²).

Proposition 2.
For some function N(k, ℓ), let M_N be the characteristic matrix with normalization N, i.e.,

M_N(X, Y)_{k,ℓ} = I∗((X, Y), k, ℓ) / N(k, ℓ).

If N(k, ℓ) = o(log min{k, ℓ}) along some infinite path in ℕ × ℕ, then M_N and sup M_N are not continuous as functions on P([0, 1] × [0, 1]) ⊂ P(ℝ²).

Proof. See Appendix C.

The above proposition implies that the "smoothing" that MIC∗ applies to mutual information is necessary in some sense. In particular, one corollary of the proposition is that mutual information with no smoothing will contain a discontinuity.

Corollary.
Mutual information is not continuous on P([0, 1] × [0, 1]) ⊂ P(ℝ²).

Proof. Mutual information is the supremum of M_N with N ≡ 1.

The same result can also be shown for the squared Linfoot correlation [26, 27], which equals 1 − 2^{−2I} where I represents mutual information. Thus, though the Linfoot correlation smooths the mutual information enough to cause it to lie in the unit interval, it does not smooth the mutual information sufficiently to cause it to be continuous.

As we remarked previously, these results, when contrasted with the uniform continuity of MIC∗, allow us to view the latter as a canonical "minimally smoothed" version of mutual information that is uniformly continuous. This view gives a meaningful interpretation to the normalization used in MIC∗. Understanding MIC∗ as having smoothness properties not shared by mutual information also suggests that estimators of MIC∗ may have better statistical properties than estimators of ordinary mutual information. This is consistent with the hardness-of-estimation result in [16] and is also borne out empirically in [3].

3.4 Third alternate characterization: MIC∗ is the supremum of the boundary of the characteristic matrix

We now show the third alternate view of MIC∗: that it can be equivalently defined as the supremum over a boundary of the characteristic matrix rather than as a supremum over all of the entries of the matrix. This characterization of MIC∗ will serve as the foundation both for our approach to computing MIC∗(X, Y) and for the new estimator of MIC∗ that we introduce later in this paper.

We begin by defining what we mean by the boundary of the characteristic matrix. Our definition rests on the following observation.

Proposition 3.
Let M be a population characteristic matrix. Then for ℓ ≥ k, M_{k,ℓ} ≤ M_{k,ℓ+1}.

Proof. Let (X, Y) be the random variable in question. Since we can always let a row/column be empty, we know that I∗((X, Y), k, ℓ) ≤ I∗((X, Y), k, ℓ+1). And since ℓ, ℓ+1 ≥ k, we know that M_{k,ℓ} = I∗((X, Y), k, ℓ) / log k ≤ I∗((X, Y), k, ℓ+1) / log k = M_{k,ℓ+1}.

Since the entries of the characteristic matrix are bounded, the monotone convergence theorem then gives the following corollary. In the corollary and henceforth, we let M_{k,↑} = lim_{ℓ→∞} M_{k,ℓ} and define M_{↑,ℓ} similarly.

Corollary.
Let M be a population characteristic matrix. Then M_{k,↑} exists, is finite, and equals sup_{ℓ≥k} M_{k,ℓ}. The same is true for M_{↑,ℓ}.
Definition 3.5.
Let M be a population characteristic matrix. The boundary of M is the set ∂M = { M k, ↑ : 1 < k < ∞} (cid:91) { M ↑ ,(cid:96) : 1 < (cid:96) < ∞} . The theorem below then gives a relationship between the boundary of the characteristicmatrix and MIC ∗ . Theorem 3.
Let ( X, Y ) be a random variable. We haveMIC ∗ ( X, Y ) = sup ∂M ( X, Y ) where M ( X, Y ) is the population characteristic matrix of ( X, Y ) .Proof. The following argument shows that every entry of M is at most sup ∂M : fix a pair ( k, (cid:96) ) and notice that either k ≤ (cid:96) , in which case M k,(cid:96) ≤ M k, ↑ , or (cid:96) ≤ k , in which case M k,(cid:96) ≤ M ↑ ,(cid:96) .Thus, MIC ∗ ≤ sup { M ↑ ,(cid:96) } ∪ { M k, ↑ } = sup ∂M .On the other hand, Corollary 3.4 shows that each element of ∂M is a supremum over someelements of M . Therefore, sup ∂M , being a supremum over suprema of elements of M , cannotexceed sup M = MIC ∗ . ∗ efficiently The importance of the characterization in Theorem 3 from the previous sub-section is compu-tational. Specifically, elements of the boundary of the characteristic matrix can be expressed interms of a maximization over (one-dimensional) partitions rather than (two-dimensional) grids,the former being much quicker to compute exactly. This is stated in the theorem below.
Theorem 4.
Let M be a population characteristic matrix. Then M_{k,↑} equals

max_{P ∈ P(k)} I(X, Y|_P) / log k

where P(k) denotes the set of all one-dimensional partitions of size at most k.

Proof. See Appendix D.

To formally state how this will help us compute MIC∗, we note that Theorems 3 and 4 above together give the following corollary.

Corollary.
Let (X, Y) be a random variable, and let P be the set of finite-size partitions. Then

MIC∗(X, Y) = sup { I(X, Y|_P) / log |P| : P ∈ P } ∪ { I(X|_P, Y) / log |P| : P ∈ P }

where |P| is the number of bins in the partition P.

The expressions in the above corollary involve maximization only over one-dimensional partitions rather than two-dimensional grids. We can exploit this fact to give an algorithm for computing elements of the boundary of the characteristic matrix to arbitrary precision. To do so, we utilize as a subroutine a dynamic programming algorithm from [2] called OptimizeXAxis. Before continuing, we therefore give a brief overview of that algorithm.
Overview of the OptimizeXAxis algorithm from [2]. The OptimizeXAxis algorithm takes as input a set D of n data points, a fixed partition into columns Q of size ℓ, a "master" partition into rows Π, and a number k.³ The algorithm returns, for 2 ≤ i ≤ k, the partition into rows P_i ⊂ Π that maximizes the mutual information of D|_{(P_i,Q)} among all sub-partitions of Π of size at most i. The algorithm works by exploiting the fact that, conditioned on the location y of the top-most line of P_i, the optimization of the rest of P_i can be formulated as a sub-problem that depends only on the data points below y. The algorithm uses dynamic programming to store and reuse solutions to these subproblems, resulting in a runtime of O(|Π|²kℓ). If a black-box algorithm is used to compute each required mutual information in time at most T, then the runtime of the algorithm can be shown to be O(Tk|Π|²).

³ Despite its name, the OptimizeXAxis algorithm can be used to optimize a partition of either axis. In our description of the algorithm here, we choose to describe the algorithm as it would work for optimizing a partition of the y-axis rather than the x-axis. This is for notational coherence of this paper only.
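The following sketch (our reconstruction from the description above, not the authors' code) shows a dynamic program in the spirit of OptimizeXAxis. It relies on the decomposition I(row; col) = H(col) − Σ_r p_r H(col | row = r): since the column partition Q is fixed, H(col) is constant, so maximizing mutual information over row partitions P ⊂ Π amounts to minimizing the additive conditional-entropy term, which is exactly the structure the dynamic program exploits.

```python
# Sketch of the OptimizeXAxis-style dynamic program (our reconstruction).
import numpy as np

def entropy(counts):
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())

def optimize_rows(col_labels_by_y, n_cols, clump_edges, k):
    """Best I(D|_(P,Q)) over row partitions P within Pi of size at most k.

    col_labels_by_y: column index of each point, listed in increasing y order.
    clump_edges: positions 0 = b_0 < b_1 < ... < b_m = n of the master
    partition Pi in that same order; rows may only break at these edges.
    """
    n = len(col_labels_by_y)
    m = len(clump_edges) - 1
    # prefix[t, c] = number of points in column c among the first t clumps
    prefix = np.zeros((m + 1, n_cols))
    for t in range(1, m + 1):
        prefix[t] = prefix[t - 1].copy()
        for c in col_labels_by_y[clump_edges[t - 1]:clump_edges[t]]:
            prefix[t, int(c)] += 1

    def cost(s, t):  # p_row * H(col | row) for the row covering clumps s+1..t
        counts = prefix[t] - prefix[s]
        return counts.sum() / n * entropy(counts)

    # F[t, i] = minimal sum of costs when the first t clumps form i rows
    F = np.full((m + 1, k + 1), np.inf)
    F[0, 0] = 0.0
    for t in range(1, m + 1):
        for i in range(1, min(k, t) + 1):
            F[t, i] = min(F[s, i - 1] + cost(s, t) for s in range(i - 1, t))
    return entropy(prefix[m]) - min(F[m, 1:k + 1])  # = max MI over |P| <= k
```

Under the assumptions of this sketch, the nested loops perform O(|Π|²k) transitions of cost O(ℓ) each, matching the O(|Π|²kℓ) bound quoted above.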
The following theorem shows that the theory developed about the boundary of the characteristic matrix, together with OptimizeXAxis, yields an efficient algorithm for computing entries of the boundary to arbitrary precision.

Theorem 5.
Given a random variable (X, Y), M_{k,↑} (resp. M_{↑,ℓ}) is computable to within an additive error of O(kε log(1/(kε))) + E (resp. O(ℓε log(1/(ℓε))) + E) in time O(kT(E)/ε) (resp. O(ℓT(E)/ε)), where T(E) is the time required to numerically compute the mutual information of a continuous distribution to within an additive error of E.

Proof. See Appendix E.

The algorithm proposed in Theorem 5 gives us a polynomial-time method for computing any finite subset of the boundary ∂M of the population characteristic matrix M(X, Y) of a random variable (X, Y). Thus, if we have some k₀, ℓ₀ such that the maximum of the finite subset {M_{k,↑}, M_{↑,ℓ} : k ≤ k₀, ℓ ≤ ℓ₀} of ∂M will be ε-close to the supremum of the entire set ∂M, we can compute MIC∗(X, Y) to within an error of ε. Though we usually do not have precise knowledge of k₀ and ℓ₀, for simple distributions it is often easy to make very conservative
OptimizeXAxis algorithm can be used to optimize a partition of either axis. In ourdescription of the algorithm here, we choose to describe the algorithm as it would work for optimizing a partitionof the y-axis rather than the x-axis. This is for notational coherence of this paper only. ∗ ( X, Y ) verywell in practice.Being able to compute MIC ∗ ( X, Y ) has two main advantages. The first is that it allowsus to assess in simulations the large-sample properties of MIC ∗ independent of any estimator.This is done in the companion paper [3], which shows that MIC ∗ achieves high equitability withrespect to R on a set of noisy functional relationships thereby confirming that statisticallyefficient estimation of MIC ∗ is a worthwhile goal.The second advantage of being able to compute MIC ∗ ( X, Y ) is that we can empirically assessthe bias, variance, and expected squared error of estimators of MIC ∗ by taking a distribution,computing MIC ∗ , and then comparing the result to estimates of it based on finite samples. Inthe next section, we introduce a new estimator MIC e of MIC ∗ and carry out such an analysisto compare its statistical properties to those of the statistic MIC from [2]. ∗ with MIC e As we have shown, MIC ∗ is actually the population value of the statistic MIC introduced in [2].However, though consistent, the statistic MIC is not known to be efficiently computable andin [2] a heuristic approximation algorithm called Approx-MIC was computed instead. In thissection, we leverage the theory we have developed here to introduce a new estimator of MIC ∗ that is both consistent and efficiently computable. The new estimator, called MIC e , actuallyhas better runtime complexity even than the heuristic Approx-MIC algorithm, and runs ordersof magnitude faster in practice.The estimator MIC e is based on one of the alternate characterizations of MIC ∗ proven inthe previous section. Namely, if MIC ∗ can be viewed as the supremum of the boundary of thecharacteristic matrix rather than of the entire matrix, then only the boundary of the matrixmust be accurately estimated in order to estimate MIC ∗ . This has the advantage that, whereascomputing individual entries of the sample characteristic matrix involves finding optimal (two-dimensional) grids, estimating entries of the boundary requires us only to find optimal (one-dimensional) partitions. While the former problem is computationally difficult, the latter can besolved using the dynamic programming algorithm from [2] that we also employed in Section 3.5to compute MIC ∗ in the large-sample limit.We formalize this idea via a new object called the equicharacteristic matrix , which we denoteby [ M ] . The difference between [ M ] and the characteristic matrix M is as follows: whilethe k, (cid:96) -th entry of M is computed from the maximal achievable mutual information usingany k -by- (cid:96) grid, the k, (cid:96) -th entry of [ M ] is computed from the maximal achievable mutualinformation using any k -by- (cid:96) grid that equipartitions the dimension with more rows/columns.(See Figure 1.) Despite this difference, as the equipartition in question gets finer and finer itbecomes indistinguishable from an optimal partition of the same size. This intuition can beformalized to show that the boundary of [ M ] equals the boundary of M , and therefore that sup[ M ] = sup M = MIC ∗ . It will then follow that estimating [ M ] and taking the supremum –as we did with M in the case of MIC – yields a consistent estimate of MIC ∗ . 
We now define the equicharacteristic matrix and show that its supremum is indeed MIC ∗ . Todo so, we first define a version of I ∗ that equipartitions the dimension with more rows/columns.11 ( X, Y ) , [ M ]( X, Y ) , I ∗ = 0 . I [ ∗ ] = 0 . M ( X, Y ) , [ M ]( X, Y ) , I ∗ = 0 . I [ ∗ ] = 0 . Figure 1: A schematic illustrating the difference between the characteristic matrix M and theequicharacteristic matrix [ M ] . (Top) When restricted to 2 rows and 3 columns, the character-istic matrix M is computed from the optimal 2-by-3 grid. In contrast, the equicharacteristicmatrix [ M ] still optimizes the smaller partition of size 2 but is restricted to have the largerpartition be an equipartition of size 3. This results in a lower mutual information of . . (Bottom) When 9 columns are allowed instead of 3, the grid found by the equicharacteristicmatrix does not change, since the grid with 3 columns was already optimal. However, now theequicharacteristic matrix uses an equipartition into columns of size 9, whose resolution is ableto fully capture the dependence between X and Y .Note that in the definition, brackets are used to indicate the presence of an equipartition. Definition 4.1.
Let (X, Y) be jointly distributed random variables. Define

I∗((X, Y), k, [ℓ]) = max_{G ∈ G(k,[ℓ])} I((X, Y)|_G)

where G(k, [ℓ]) is the set of k-by-ℓ grids whose y-axis partition is an equipartition of size ℓ. Define I∗((X, Y), [k], ℓ) analogously.

Define I[∗]((X, Y), k, ℓ) to equal I∗((X, Y), k, [ℓ]) if k < ℓ and I∗((X, Y), [k], ℓ) otherwise.

We now define the equicharacteristic matrix in terms of I[∗]. In the definition below, we continue our convention of using brackets around a quantity to denote the presence of equipartitions.

Definition 4.2.
Let (X, Y) be jointly distributed random variables. The population equicharacteristic matrix of (X, Y), denoted by [M](X, Y), is defined by

[M](X, Y)_{k,ℓ} = I[∗]((X, Y), k, ℓ) / log min{k, ℓ}

for k, ℓ > 1.

The boundary of the equicharacteristic matrix can be defined via a limit in the same way as for the characteristic matrix. We then have the following theorem.

Theorem 6.
Let (X, Y) be jointly distributed random variables. Then ∂[M] = ∂M.

Proof. See Appendix F.

Since every entry of the equicharacteristic matrix is dominated by some entry on its boundary, the equivalence of ∂[M] and ∂M yields the following corollary as a simple consequence.

Corollary.
Let (X, Y) be jointly distributed random variables. Then sup [M](X, Y) = MIC∗(X, Y).

4.2 The estimator MICe

With the equicharacteristic matrix defined, we can now define our new estimator MICe in terms of the sample equicharacteristic matrix, analogously to the way we defined MIC in terms of the sample characteristic matrix.

Definition 4.3.
Let D ⊂ ℝ² be a set of ordered pairs. The sample equicharacteristic matrix [M̂](D) of D is defined by

[M̂](D)_{k,ℓ} = I[∗](D, k, ℓ) / log min{k, ℓ}.

Definition 4.4.
Let D ⊂ ℝ² be a set of n ordered pairs, and let B : Z+ → Z+. We define

MICe,B(D) = max_{kℓ ≤ B(n)} [M̂](D)_{k,ℓ}.

With the equivalence between the boundaries of the characteristic matrix and the equicharacteristic matrix established, it is straightforward to show that MICe is a consistent estimator of MIC∗ via arguments similar to those we applied in the case of MIC. (See Appendix G.)
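Putting Definitions 4.3 and 4.4 together with the dynamic program sketched in Section 3.5 gives a direct, deliberately simple sketch of MICe (ours, not the reference implementation): for each ℓ, equipartition the x-axis into ℓ columns and optimize the y-axis partition over at most k ≤ min(ℓ, B/ℓ) rows, using the full equipartition of size n as the master partition, as in Theorem 8 below; the EquicharClump variant of Theorem 9 would instead pass a coarser master partition of c·k clumps.

```python
# Sketch of MIC_e,B for the entries with k <= l; reuses optimize_rows above.
import numpy as np

def equipartition_edges(n, bins):
    return [round(j * n / bins) for j in range(bins + 1)]

def mic_e(x, y, B):
    n = len(x)
    order_x, order_y = np.argsort(x), np.argsort(y)
    best = 0.0
    for l in range(2, B // 2 + 1):        # x-axis: equipartition into l columns
        col = np.empty(n, dtype=int)
        edges = equipartition_edges(n, l)
        for c in range(l):
            col[order_x[edges[c]:edges[c + 1]]] = c
        for k in range(2, min(l, B // l) + 1):   # y-axis: optimized rows
            mi = optimize_rows(col[order_y], l, equipartition_edges(n, n), k)
            best = max(best, mi / np.log2(k))    # normalize by log min{k, l}
    # by symmetry, one would repeat with the roles of the two axes exchanged
    return best
```

Because this naive version uses a master partition of size n, each optimization costs O(n²) or more, in line with Theorem 8; the clumped master partition of Theorem 9 is what makes the computation fast in practice.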
Specifically, we show the following theorem, an analogue of Theorem 1.

Theorem 7. Let f : m∞ → ℝ be uniformly continuous, and assume that f ∘ r_i → f pointwise. Then for every random variable (X, Y), we have

(f ∘ r_{B(n)})([M̂](D_n)) → f([M](X, Y)) in probability

where D_n is a sample of size n from the distribution of (X, Y), provided B(n) = ω(1).

By setting f([M]) = sup [M], we then obtain as a corollary the consistency of MICe.

Corollary.
MICe,B is a consistent estimator of MIC∗ provided B(n) = ω(1) and B(n) ≤ O(n^{1−ε}) for some ε > 0.

4.2.1 Choosing B(n)

As with the statistic MIC, the statistic MICe requires the user to specify a function B(n) to use. While the theory suggests that any function of the form B(n) = n^α suffices provided 0 < α < 1, different values of α may yield different finite-sample properties. We study the empirical performance of MICe for different choices of B(n) in Section 4.4.

[3] provides simple, empirical recommendations about appropriate values of α for different settings. Those recommendations are formulated by choosing a set of representative relationships (e.g., a set of noisy functional relationships), as well as a "ground truth" population quantity Φ (e.g., R²) that can be used to quantify the strength of each of those relationships, and then assessing which values of α maximize the equitability of MICe with respect to Φ at a given sample size. This approach is applied to an analysis of real data from the World Health Organization in [3], and the parameters chosen for that analysis are the ones used for all analyses in this paper.

We remark that if the goal of the user is only detection of non-trivial relationships rather than discovery of the strongest such relationships, α can also be chosen in a more straightforward manner: the user can sub-sample a small random set of relationships on which to compare the power of MICe for different values of α. Those relationships can then be discarded and the rest of the relationships analyzed with the optimal value of α. However, if the user's primary goal is power against independence, the statistic TICe introduced in Section 5 of this paper should be used with this strategy rather than MICe.

4.3 Computing MICe

Both MIC and MICe are consistent estimators of MIC∗. The difference between them is that while MIC can currently be computed efficiently only via a heuristic approximation, MICe can be computed exactly very efficiently via an approach similar to the one used for computing MIC∗ involving the OptimizeXAxis subroutine. We now describe the details of this approach.

Recall that, given a fixed x-axis partition Q into ℓ columns, a set of n data points, a "master" y-axis partition Π, and a number k, the OptimizeXAxis subroutine finds, for every 2 ≤ i ≤ k, a y-axis partition P_i ⊂ Π of size at most i that maximizes the mutual information induced by the grid (P_i, Q). The algorithm does this in time O(|Π|²kℓ). (For more discussion of OptimizeXAxis, see Section 3.5, where it is also used to give an algorithm for computing MIC∗.)

In the pair of theorems below, we show two ways that OptimizeXAxis can be used to compute MICe efficiently. In the proofs of both theorems, we neglect issues of divisibility, i.e., we often write B/2 rather than ⌊B/2⌋. This does not affect the results.

Theorem 8.
There exists an algorithm Equichar that, given a sample D of size n and some B ∈ Z+, computes the portion r_B([M̂](D)) of the sample equicharacteristic matrix in time O(n²B²), which equals O(n^{4−2ε}) for B(n) = O(n^{1−ε}) with ε > 0.

Proof. We describe the algorithm and simultaneously bound its runtime. We do so only for the (k, ℓ)-th entries of [M̂](D) satisfying k ≤ ℓ, kℓ ≤ B. This suffices, since by symmetry computing the rest of the required entries at most doubles the runtime.

To compute [M̂](D)_{k,ℓ} with k ≤ ℓ, we must fix an equipartition into ℓ columns on the x-axis and then find the optimal partition of the y-axis of size at most k. If we set the master partition Π of the OptimizeXAxis algorithm to be an equipartition into rows of size n, then it performs precisely the required optimization. Moreover, for fixed ℓ it can carry out the optimization simultaneously for all of the pairs {(2, ℓ), ..., (B/ℓ, ℓ)} in time O(|Π|²(B/ℓ)ℓ) = O(n²B). For fixed ℓ, this set contains all the pairs (k, ℓ) satisfying k ≤ ℓ, kℓ ≤ B. Therefore, to compute all the required entries of [M̂](D) we need only apply this algorithm for each ℓ = 2, ..., B/2. Doing so gives a runtime of O(n²B²).

The algorithm above, while polynomial-time, is nonetheless not efficient enough for use in practice. However, a simple modification solves this problem without affecting the consistency of the resulting estimates. The modification hinges on the fact that OptimizeXAxis can use master partitions Π besides the equipartition of size n that we used above. Specifically, setting Π in the above algorithm to be an equipartition into ck "clumps", where k is the size of the largest optimal partition being sought, speeds up the computation significantly. This modification does give a slightly different statistic. However, the result is still a consistent estimator of MIC∗ because the size of the master partition Π grows as k grows, and so the optimal sub-partition of Π approaches the true optimal partition eventually. These ideas, first about improved runtime and second about preserved consistency, are formalized in the following theorem.

Theorem 9.
Let (X, Y) be a pair of jointly distributed random variables, and let D_n be a sample of size n from the distribution of (X, Y). For every c ≥ 1, there exists a matrix {M̂}_c(D_n) such that

1. The function M̃ICe,B(·) = max_{kℓ ≤ B(n)} {M̂}_c(·)_{k,ℓ} is a consistent estimator of MIC∗ provided B(n) = ω(1) and B(n) ≤ O(n^{1−ε}) for some ε > 0.

2. There exists an algorithm EquicharClump for computing r_B({M̂}_c(D_n)) in time O(n + B^{5/2}), which equals O(n + n^{5(1−ε)/2}) when B(n) = O(n^{1−ε}).

Proof. See Appendix H.

For an analysis of the effect of the parameter c in the above theorem on the results of the EquicharClump algorithm, see Appendix H.3.

Setting ε = 0.6 in the above theorem yields the following corollary.

Corollary.
MIC∗ can be estimated consistently in linear time.

Of course, at low sample sizes, setting ε = 0.6 would be undesirable. However, our companion paper [3] shows empirically that at large sample sizes this strategy works very well on typical relationships.
We remark that the EquicharClump algorithm given above is asymptotically faster even than the heuristic Approx-MIC algorithm used to calculate MIC in practice, which runs in time O(B(n)⁴). As demonstrated in our companion paper [3], this difference translates into a substantial difference in runtimes for similar performance at a range of realistic sample sizes, with the speedup over the heuristic already large at n = 500 and growing considerably larger by n = 10,000.
For readability, in the rest of this paper we do not distinguish between the two versions of MICe computed by the Equichar and EquicharClump algorithms described above. Wherever we present simulation data about MICe, though, we use the version of the statistic computed by EquicharClump.

Figure 2: Bias/variance characterization of Approx-MIC and MICe. Each plot shows expected squared error, bias, or variance across the set of noisy functional relationships described in Section 4.4 as a function of the R² of the relationships. The results are aggregated across the 16 function types analyzed by either the average, median, or worst result at every value of R². (a) A comparison between MICe (light purple) and MIC (black) as computed via the heuristic Approx-MIC algorithm, at a typical usage parameter. (b) Performance of MICe with B(n) = n^α for various values of α.

4.4 Bias/variance characterization of MICe

The algorithm we presented in Section 3.5 for computing MIC∗ in the large-sample limit allows us to examine the bias/variance properties of estimators of MIC∗. Here, we use it to examine the bias and variance of both MIC as computed by the heuristic Approx-MIC algorithm from [2], and MICe as computed by the EquicharClump algorithm given above. To do this, we performed a simulation analysis on the following set of relationships

Q = { (x + ε_σ, f(x) + ε′_σ) : x ∈ X_f, ε_σ, ε′_σ ~ N(0, σ²), f ∈ F, σ ∈ ℝ≥0 }

where ε_σ and ε′_σ are i.i.d., F is the set of 16 functions analyzed in [2], and X_f is the set of n x-values that result in the points (x_i, f(x_i)) being equally spaced along the graph of f.

For each relationship Z ∈ Q that we examined, we used the algorithm from Theorem 5 to compute MIC∗. We then simulated 500 independent samples from Z, each of size n = 500, and computed both Approx-MIC and MICe on each one to obtain estimates of the sampling distributions of the two statistics. From each of the two sampling distributions, we estimated the bias and variance of either statistic on Z. We then analysed the bias, variance, and expected squared error of the two statistics as a function of relationship strength, which we quantified using the coefficient of determination (R²) with respect to the generating function.

The results, presented in Figure 2, are interesting for two reasons. First, they demonstrate that for a typical usage parameter, MICe performs substantially better than Approx-MIC overall. Specifically, the median of the expected squared error of MICe across the set F of functions is uniformly lower across R² values than that of Approx-MIC. When average expected squared error is used instead of median, MICe still performs better on all but the strongest of relationships (those with the highest values of R²). This is consistent with the fact that we have theoretical guarantees about the statistical properties of MICe whereas Approx-MIC is a heuristic.

Second, the results show that different values of the exponent in B(n) = n^α give good performance in different signal-to-noise regimes due to a bias-variance trade-off represented by this parameter. Large values of α lead to increased expected error in lower-signal regimes (low R²) through both a positive bias in those regimes and a general increase in variance that predominantly affects those regimes. On the other hand, small values of α lead to an increased expected error in higher-signal regimes (high R²) by leading to a negative bias in those regimes and by shifting the variance of the estimator toward those regimes. In other words, lower values of α are better-suited for detecting weaker signals, while higher values of α are better suited for distinguishing among stronger signals. This is consistent with the results seen in our companion paper [3], which show that low values of α cause MICe to yield better powered independence tests while high values of α cause MICe to have better equitability. For a detailed discussion of this trade-off and of specific recommendations for how to set α in practice, see [3].
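Here is a sketch (ours) of this protocol for a single relationship, with mic_e from the earlier sketches standing in for EquicharClump and with the true value mic_star_true assumed to come from the Theorem 5 computation; for simplicity the x-values are drawn uniformly rather than spaced equally along the graph of f, and the exponent default is an assumption.

```python
# Estimate bias, variance, and MSE of the MIC_e sketch on one noisy relationship.
import numpy as np

def bias_variance(f, sigma, mic_star_true, n=500, trials=500, alpha=0.6, seed=0):
    rng = np.random.default_rng(seed)
    B = int(round(n ** alpha))      # B(n) = n^alpha; this exponent is our assumption
    estimates = []
    for _ in range(trials):
        x = rng.uniform(size=n)
        x_obs = x + rng.normal(scale=sigma, size=n)   # i.i.d. noise on both axes
        y_obs = f(x) + rng.normal(scale=sigma, size=n)
        estimates.append(mic_e(x_obs, y_obs, B))
    est = np.array(estimates)
    bias = est.mean() - mic_star_true
    mse = ((est - mic_star_true) ** 2).mean()
    return bias, est.var(), mse
```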
4.5 Equitability of MICe

As mentioned previously, one of the main motivations for the introduction of MIC was equitability, the extent to which a measure of dependence usefully captures some notion of relationship strength on some set of standard relationships. We therefore carried out an empirical analysis of the equitability of MICe with respect to R² and compared its performance to distance correlation [10, 28], mutual information estimation [29], and maximal correlation estimation [8].

We began by assessing equitability on the set of relationships Q defined above, a set that has been analyzed in previous work [2, 3, 17]. The results, shown in Figure 3, confirm the superior equitability of the new estimator MICe on this set of relationships.

Figure 3: Equitability with respect to R² on a set of noisy functional relationships of (a) the Pearson correlation coefficient, (b) a hypothetical measure of dependence ϕ with perfect equitability, (c) distance correlation, (d) MICe, (e) maximal correlation estimation, and (f) mutual information estimation. For each relationship, a shaded region denotes 5th and 95th percentile values of the sampling distribution of the statistic in question on that relationship at every R². The resulting plot shows which values of R² correspond to a given value of each statistic. The red interval on each plot indicates the widest range of R² values corresponding to any one value of the statistic; the narrower the red interval, the higher the equitability. A red interval with width 0, as in (b), means that the statistic reflects only R² with no dependence on relationship type, as demonstrated by the pairs of thumbnails of relationships of different types with identical R² values.

To assess equitability more objectively without relying on a manually curated set of functions, we then analyzed 160 random functions drawn from a Gaussian process distribution with a radial basis function kernel with one of eight possible bandwidths, chosen to represent a range of possible relationship complexities. The results, shown in Figure 4, show that MICe outperforms currently existing methods in terms of equitability with respect to R² on these functions as well. Appendix Figure J1 shows a version of this analysis under a different noise model that yields the same conclusion. We also examined the effect of outlier relationships on our results by repeatedly subsampling random subsets of 20 functions from this large set of relationships and measuring the equitability of each method on average over the subsets; results were similar.

Figure 4: Equitability of methods examined on functions randomly drawn from a Gaussian process distribution. Each method is assessed as in Figure 3, with a red interval indicating the widest range of R² values corresponding to any one value of the statistic; the narrower the red interval, the higher the equitability. Each shaded region corresponds to one relationship, and the regions are colored by the bandwidth of the Gaussian process from which they were sampled. Sample relationships for each bandwidth are shown in the top right with matching colors.

One feature of the performance of MICe on these randomly chosen relationships that is demonstrated in Figure 4 is that it appears minimally sensitive to the bandwidth of the Gaussian process from which a given relationship is drawn. This puts it in contrast to, e.g., mutual information estimation, which shows a pronounced sensitivity to this parameter that prevents it from being highly equitable when relationships with different bandwidths are present in the same dataset.

In our companion paper [3], we perform more in-depth analyses of the equitability with respect to R² of MICe, MIC, and the four measures of dependence described above, as well as the Hilbert-Schmidt independence criterion (HSIC) [11, 30], the Heller-Heller-Gorfine (HHG) test [14], the data-derived partitions (DDP) test [31], and the randomized dependence coefficient (RDC) [13]. These analyses consider a range of sample sizes, noise models, marginal distributions, and parameter settings. They conclude that, in terms of equitability with respect to R² on the sets of noisy functional relationships studied, (a) MICe uniformly outperforms MIC, and (b) MICe outperforms all the methods tested in the vast majority of settings examined. Appendix Figure I1 contains a reproduction of a representative equitability analysis from that paper for the reader's reference.

5 The total information coefficient

So far we have presented results about estimators of the population maximal information coefficient, a quantity for which equitability is the primary motivation. We now introduce and analyze a new measure of dependence, the total information coefficient (TIC). In contrast to the maximal information coefficient, the total information coefficient is designed not for equitability but rather as a test statistic for testing a null hypothesis of independence.

We begin by giving some intuition. Recall that the maximal information coefficient is the supremum of the characteristic matrix. While estimating the supremum of this matrix has many advantages, this estimation involves taking a maximum over many estimates of individual entries of the characteristic matrix.
Since maxima of sets of random variables tend to become large as the number of variables grows, one can imagine that this procedure will lead to an undesirable positive bias in the case of statistical independence, when the population characteristic matrix equals 0. This is detrimental for independence testing, where the sampling distribution of a statistic under a null hypothesis of independence is crucial.

The intuition behind the total information coefficient is that if we instead consider a more stable property, such as the sum of the entries in the characteristic matrix, we might expect to obtain a statistic with a smaller bias in the case of independence and therefore better power. Stated differently, if our only goal is to distinguish any dependence at all from complete noise, then disregarding all of the sample characteristic matrix except for its maximal value may throw away useful signal, and the total information coefficient avoids this by summing all the entries.

We remark that in [2] it is suggested that other properties of the characteristic matrix may allow us to measure other aspects of a given relationship besides its strength, and several such properties were defined. The total information coefficient fits within this conceptual framework.

In this section we define the total information coefficient in the case of both the characteristic matrix (TIC) and the equicharacteristic matrix (TICe). We then prove that both TIC and TICe yield independence tests that are consistent against all dependent alternatives. Finally, we present a simulation study of the power of independence testing based on TICe, showing that it outperforms other common measures of dependence.

5.1 Definitions of TIC and TICe

We begin by defining the two versions of the total information coefficient. In the definition below, recall that M̂ denotes a sample characteristic matrix whereas [M̂] denotes a sample equicharacteristic matrix.

Definition 5.1.
Definition 5.1. Let $D \subset \mathbb{R}^2$ be a set of $n$ ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define
$$\mathrm{TIC}_B(D) = \sum_{k\ell \leq B(n)} \widehat{M}(D)_{k,\ell} \qquad\text{and}\qquad \mathrm{TIC}_{e,B}(D) = \sum_{k\ell \leq B(n)} \widehat{[M]}(D)_{k,\ell}.$$
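A literal (if inefficient) rendering of these definitions: given any routine that returns the $k,\ell$-th entry of a sample (equi)characteristic matrix, TIC sums the entries with $k\ell \leq B$, while the MIC-style statistics maximize over them. The `entry` callback below is a hypothetical stand-in for the actual entry computation described in Section 4.

```python
def tic(entry, B):
    """TIC_B: sum of sample (equi)characteristic-matrix entries with k*l <= B.

    `entry(k, l)` is a hypothetical callback returning the k,l-th entry
    of the sample matrix for k, l >= 2.
    """
    return sum(entry(k, l)
               for k in range(2, B // 2 + 1)
               for l in range(2, B // k + 1))

def mic(entry, B):
    """The analogous max-based statistic: maximize rather than sum."""
    return max(entry(k, l)
               for k in range(2, B // 2 + 1)
               for l in range(2, B // k + 1))
```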
To show that these two statistics lead to consistent independence tests, we must take a step back and analyze the behavior of the analogous population quantities. We take some care with the limits involved in defining these quantities, since the fine-grained behavior of these limits is a key part of our proof of consistency.

Definition 5.2.
For a matrix $A$ and a positive number $B$, the $B$-partial sum of $A$, denoted by $S_B(A)$, is
$$S_B(A) = \sum_{k\ell \leq B} A_{k,\ell}.$$

When $A$ is an (equi)characteristic matrix, $S_B(A)$ is the sum over all entries corresponding to grids with at most $B$ total cells. (For example, since the indices begin at $k = \ell = 2$, we have $S_6(A) = A_{2,2} + A_{2,3} + A_{3,2}$.) Thus, if $\widehat{M}(D)$ is a sample characteristic matrix of a sample $D$, then $S_{B(n)}(\widehat{M}(D)) = \mathrm{TIC}_B(D)$, and the same holds for $\widehat{[M]}(D)$ and $\mathrm{TIC}_{e,B}(D)$.

It is clear that if $X$ and $Y$ are statistically independent random variables, then both the characteristic matrix $M(X,Y)$ and the equicharacteristic matrix $[M](X,Y)$ are identically 0, so that $S_B(M(X,Y)) = S_B([M](X,Y)) = 0$ for all $B$. However, we are also interested in how these quantities behave when $X$ and $Y$ are dependent. The following pair of propositions helps us understand this. The first proposition gives a lower bound on the values of entries in both $M(X,Y)$ and $[M](X,Y)$. The second proposition translates this into an asymptotic characterization of how quickly $S_B(M)$ and $S_B([M])$ grow as functions of $B$. These two propositions are the technical heart of why the total information coefficient yields a consistent independence test.

Proposition 4.
Let $(X, Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $M(X,Y) \equiv [M](X,Y) \equiv 0$. If not, then there exist some $a > 0$ and some integer $\ell_0 \geq 2$ such that
$$M(X,Y)_{k,\ell},\ [M](X,Y)_{k,\ell} \geq \frac{a}{\log\min\{k,\ell\}}$$
either for all $k \geq \ell \geq \ell_0$ or for all $\ell \geq k \geq \ell_0$.

Proof. See Appendix K.1.
Proposition 5.
Let $(X, Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $S_B(M(X,Y)) = S_B([M](X,Y)) = 0$ for all $B$. If not, then $S_B(M(X,Y))$ and $S_B([M](X,Y))$ are both $\Omega(B\log\log B)$.

Proof. See Appendix K.2.

The propositions above, together with reasoning analogous to the convergence arguments presented earlier, can be used to show the main result of this section, namely that the statistics TIC and TIC_e yield consistent independence tests.
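The growth rate in Proposition 5 can be anticipated from Proposition 4: summing the lower bound $a/\log\min\{k,\ell\}$ over the roughly $B\log B$ pairs with $k\ell \leq B$ already yields $\Omega(B\log\log B)$. The snippet below checks this numerically with $a = 1$ and all pairs $k, \ell \geq 2$ included (the cutoff $\ell_0$ is ignored for simplicity).

```python
import math

def lower_bound_sum(B):
    """Sum 1/log(min(k, l)) over all pairs with k, l >= 2 and k*l <= B."""
    return sum(1.0 / math.log(min(k, l))
               for k in range(2, B // 2 + 1)
               for l in range(2, B // k + 1))

for B in [10**3, 10**4, 10**5]:
    print(B, lower_bound_sum(B) / (B * math.log(math.log(B))))
    # The ratio stays bounded away from 0 and infinity,
    # consistent with Theta(B log log B) growth.
```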
Theorem 10. The statistics $\mathrm{TIC}_B$ and $\mathrm{TIC}_{e,B}$ yield consistent right-tailed tests of independence, provided $\omega(1) < B(n) \leq O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Proof. See Appendix K.3.
In practice, we often use the EquicharClump algorithm (see Section 4.3) to compute the equicharacteristic matrix from which we calculate TIC_e. This algorithm does not compute the sample equicharacteristic matrix exactly. However, as in the case of MIC_e, the use of the algorithm does not affect the consistency of our approach to independence testing. This is proven in Appendix H.

With the consistency of independence tests based on TIC and TIC_e established, we turn now to empirical evaluation of the power of independence testing based on TIC_e as computed using the EquicharClump algorithm.

To evaluate the power of TIC_e-based tests, we reproduced the analysis performed in [32]. Namely, for the set of functions $F$ chosen by Simon and Tibshirani, we considered the set of relationships they analyzed:
$$Q = \left\{ (X, f(X) + \varepsilon') : X \sim \mathrm{Unif}[0,1],\ f \in F,\ \varepsilon' \sim \mathcal{N}(0, \sigma^2),\ \sigma \in \mathbb{R}_{\geq 0} \right\}.$$
For each relationship $Z$ in this set that we examined, we simulated a null hypothesis of independence with the same marginal distributions, and generated a large number of independent samples, each with a sample size of $n = 500$, from both $Z$ and from the null distribution. These were used to estimate the power of the size-$\alpha$ right-tailed independence test based on each statistic being evaluated. Following Simon and Tibshirani, we compared TIC_e to distance correlation [10, 28], to the original maximal information coefficient [2] as approximated using Approx-MIC, and to the Pearson correlation. (Though it is not a measure of dependence, the Pearson correlation was presumably included by Simon and Tibshirani as an intuitive benchmark for what is achievable under a linear model.) We also compared to MIC_e, using parameters identical to those of TIC_e, to examine whether the summation performed by TIC_e is better than maximization when all other things are equal. Note that we do not compare to methods of analyzing contingency tables, such as Pearson's chi-squared test. This is because our data are real-valued rather than discrete, so contingency-based methods are not applicable. However, when data are discrete, those methods can be very well powered.

The results of our analysis are presented in Figure 5. First, the figure shows that TIC_e compares quite favorably with distance correlation, a method considered to have state-of-the-art power [32]. Specifically, TIC_e uniformly outperforms distance correlation on 5 of the 8 relationship types examined, and performs comparably to it on the other three. Distance correlation has many advantages over TIC_e, including the fact that it easily generalizes to higher-dimensional relationships. However, in the bivariate setting TIC_e appears to perform as well as distance correlation, if not better, in terms of statistical power against independence.

The analysis also shows that TIC_e outperforms the original maximal information coefficient by a very large margin, and outperforms MIC_e as well, supporting the intuition that the summation performed by the former can indeed lead to substantial gains in power against independence over the maximization performed by the latter. (We note that in both Simon and Tibshirani's analysis and in this one, the original maximal information coefficient was run with default parameters that were optimized for equitability rather than power against independence. When run with different parameters, its power improves substantially, though it still does not match the power of MIC_e. See Appendix Figure I2 and the discussion in [3].)
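The power-estimation protocol described above can be sketched as follows. The statistic passed in is a placeholder (here, the absolute Pearson correlation), and `n`, `trials`, and `alpha` are illustrative values rather than the exact settings of the study.

```python
import numpy as np

def estimate_power(stat, f, sigma, n=500, trials=500, alpha=0.05, seed=0):
    """Monte Carlo power of the size-alpha right-tailed test based on `stat`.

    `f` is the noiseless functional relationship; `sigma` is the noise
    standard deviation. The null sample preserves both marginals.
    """
    rng = np.random.default_rng(seed)
    null_stats, alt_stats = [], []
    for _ in range(trials):
        x = rng.uniform(size=n)
        alt_stats.append(stat(x, f(x) + rng.normal(scale=sigma, size=n)))
        # Null: y has the same marginal distribution but is independent of x.
        y_null = f(rng.uniform(size=n)) + rng.normal(scale=sigma, size=n)
        null_stats.append(stat(x, y_null))
    threshold = np.quantile(null_stats, 1.0 - alpha)  # critical value under H0
    return float(np.mean(np.array(alt_stats) > threshold))

pearson = lambda x, y: abs(np.corrcoef(x, y)[0, 1])
print(estimate_power(pearson, f=np.sin, sigma=0.5))
```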
Our companion paper [3] expands on this analysis, conducting an in-depth evaluation of the power against independence of the tests described above, as well as tests based on mutual information estimation [29], maximal correlation estimation [8], HSIC [11, 30], HHG [14], DDP [31], and RDC [13]. These analyses consider a range of sample sizes and parameter settings, as well as a variety of ways of quantifying power across different alternative-hypothesis relationship types and noise levels. They conclude that in most settings TIC_e either outperforms all the methods tested or performs comparably to the best ones. Appendix Figure I2 contains a reproduction of one detailed set of power curves from the main analysis in that paper for the reader's reference.

Figure 5: Comparison of the power of independence testing based on TIC_e (blue) to MIC with default parameters (gray), MIC_e with the same parameters as TIC_e (black), distance correlation (purple), and the Pearson correlation coefficient (green) across several alternative-hypothesis relationship types chosen by [32]. The relationships analyzed are described in Section 5.2.

As high-dimensional data sets become increasingly common, data exploration requires not only statistics that can accurately detect a large number of non-trivial relationships in a data set, but also ones that can identify a smaller number of strongest relationships. The former property is achieved by measures of dependence that yield independence tests with high power; the latter is achieved by measures of dependence that are equitable with respect to some measure of relationship strength. In this paper, we introduced two related measures of dependence that achieve these two goals, respectively, through the following theoretical contributions.

• A new population measure of dependence, MIC*, that we proved can be viewed in three different ways: as the population value of the maximal information coefficient (MIC) from [2], as a "minimal smoothing" of mutual information that makes it uniformly continuous, or as the supremum of an infinite sequence defined in terms of optimal partitions of one marginal at a time of a given joint distribution.

• An efficient algorithm for approximating the MIC* of a given joint distribution.

• A statistic MIC_e that is a consistent estimator of MIC*, is efficiently computable, and has good equitability with respect to $R^2$ both on a manually chosen set of noisy functional relationships and on randomly chosen noisy functional relationships.

• The total information coefficient (TIC_e), a statistic that arises as a trivial side-product of the computation of MIC_e and yields a consistent and powerful independence test.

Though we presented here some empirical results for MIC*, MIC_e, and TIC_e, our focus was on theoretical considerations; the performance of these methods is analyzed in detail in our companion paper [3]. That paper shows that, on a large set of noisy functional relationships with varying noise and sampling properties, the asymptotic equitability with respect to $R^2$ of MIC* is quite high and the equitability with respect to $R^2$ of MIC_e is state-of-the-art. It also shows that the power of the independence test based on TIC_e is state-of-the-art across a wide variety of dependent alternative hypotheses.
Finally, it demonstrates that the algorithms presented here allow MIC_e and TIC_e to be computed simultaneously very quickly, enabling analysis of extremely large data sets using both statistics together.

Our contributions are of both theoretical and practical importance for several reasons. First, our characterization of MIC* as the large-sample limit of MIC sheds light on the latter statistic. For example, while MIC is parametrized, MIC* is not. Knowing that MIC converges in probability to MIC* tells us that this parametrization is statistical only: it controls the bias/variance properties of the statistic, but not its asymptotic behavior.

Second, the normalization in the definition of MIC, while empirically seen to yield good performance, had previously not been theoretically understood. Our result that this normalization is the minimal smoothing necessary to make mutual information uniformly continuous provides for the first time a lens through which the normalization is canonical. In doing so, it constitutes an initial step toward understanding the role of the normalization in the performance of MIC* and MIC. The uniform continuity of MIC* and the lack of continuity of ordinary mutual information also suggest that estimation of the former may be easier in some sense than estimation of the latter. This is consonant with the result concerning the difficulty of estimating mutual information shown in [16]. It is also borne out empirically by the substantial finite-sample bias and variance observed in [3] of the Kraskov mutual information estimator [29] compared to MIC_e.

Third, our alternate characterization of MIC* in terms of one-dimensional optimization over partitions, rather than two-dimensional optimization over grids, enhances our understanding of how to efficiently compute it in the large-sample limit and estimate it from finite samples using MIC_e. This is a significant improvement over the previous state of affairs, in which the statistic MIC could only be approximated heuristically, and orders of magnitude more slowly than the results in this paper now allow.

Finally, the introduction of the total information coefficient provides evidence that the basic approach of considering the set of normalized mutual information values achievable by applying different grids to a joint distribution is of fundamental value in characterizing dependence. Interestingly, a statistic introduced in [31] follows a similar approach by considering the (non-normalized) sum of the mutual information values achieved by all possible finite grids. Consistent with our demonstration here that an aggregative grid-based approach works well, that statistic also achieves excellent power. (TIC_e is compared to the statistic from [31] in our companion paper [3].)

Taken together, our results point to joint use of the statistics MIC_e and TIC_e as a theoretically grounded, computationally efficient, and highly practical approach to data exploration. Specifically, since the two statistics can be computed simultaneously with little extra cost beyond that of computing either individually, we propose computing both of them on all variable pairs in a data set, using TIC_e to filter out non-significant associations, and then using MIC_e to rank the remaining variable pairs. Such a strategy would have the advantage of leveraging the state-of-the-art power of TIC_e to substantially reduce the multiple-testing burden on MIC_e, while utilizing the latter statistic's state-of-the-art equitability to effectively rank relationships for follow-up by the practitioner.
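A minimal sketch of this two-stage strategy, assuming hypothetical `tic_e` and `mic_e` implementations, a permutation null for TIC_e, and a simple Bonferroni correction (one choice among many):

```python
import itertools
import numpy as np

def screen_pairs(data, tic_e, mic_e, n_perm=1000, alpha=0.05, seed=0):
    """Filter variable pairs with TIC_e, then rank the survivors by MIC_e.

    `data` is an (n_samples, n_vars) array; `tic_e` and `mic_e` are
    hypothetical implementations of the two statistics.
    """
    rng = np.random.default_rng(seed)
    pairs = list(itertools.combinations(range(data.shape[1]), 2))
    survivors = []
    for i, j in pairs:
        x, y = data[:, i], data[:, j]
        observed = tic_e(x, y)
        # Right-tailed permutation p-value for the TIC_e independence test.
        null = [tic_e(x, rng.permutation(y)) for _ in range(n_perm)]
        p = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
        if p <= alpha / len(pairs):          # Bonferroni over all tested pairs
            survivors.append((mic_e(x, y), i, j))
    return sorted(survivors, reverse=True)   # strongest relationships first
```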
Of course our results, while useful, nevertheless have limitations that warrant exploration in future work. First, for a sample $D$ from the distribution of some random $(X, Y)$, all of the sample quantities we define here use the naive estimate $I(D|_G)$ of the quantity $I((X,Y)|_G)$ for various grids $G$. There is a long and fruitful line of work on more sophisticated estimators of discrete mutual information [9] whose use instead of $I(D|_G)$ could improve the statistics introduced here. Second, our approach to approximating the MIC* of a given joint density consists of computing a finite subset of an infinite set whose supremum we seek to calculate. However, the choice of how large a finite set we should compute in order to approximate the supremum to a given precision remains heuristic. Finally, though empirical characterization of the equitability of MIC_e on representative sets of relationships is important and promising, we are still missing a theoretical characterization of its equitability in the large-sample limit. A clear theoretical demarcation of the set of relationships on which MIC* achieves good equitability with respect to $R^2$, and an understanding of why it does so, would greatly advance our understanding of both MIC* and equitability.

We would like to acknowledge R Adams, T Broderick, E Airoldi, A Gelman, M Gorfine, R Heller, J Huggins, J Mueller, and R Tibshirani for constructive conversations and useful feedback.
A Proof of Theorem 1
This appendix is devoted to proving Theorem 1, restated below.
Theorem 1.
Let $f : m_\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every pair of jointly distributed random variables $(X, Y)$, we have
$$\left(f \circ r_{B(n)}\right)\left(\widehat{M}(D_n)\right) \to f(M(X,Y))$$
in probability, where $D_n$ is a sample of size $n$ from the distribution of $(X,Y)$, provided $\omega(1) < B(n) \leq O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

We prove the theorem by a sequence of lemmas that build on each other to bound the bias of $I^*(D, k, \ell)$. The general strategy is to capture the dependencies between different $k$-by-$\ell$ grids $G$ by considering a "master grid" $\Gamma$ that contains many more than $k\ell$ cells. Given this master grid, we first bound the difference between $I(D|_G)$ and $I((X,Y)|_G)$ only for sub-grids $G$ of $\Gamma$. The bound is in terms of the difference between $D|_\Gamma$ and $(X,Y)|_\Gamma$. We then show that this bound can be extended without too much loss to all $k$-by-$\ell$ grids. This gives what we seek, because then the difference between $I(D|_G)$ and $I((X,Y)|_G)$ is uniformly bounded for all grids $G$ in terms of the same random variable: $D|_\Gamma$. Once this is done, standard arguments give the consistency we seek.

In our argument we occasionally require technical facts about entropy and mutual information that are self-contained and unrelated to the central ideas. These lemmas are consolidated in Appendix L.

We begin by using one of these technical lemmas to prove a bound on the difference between $I(D|_G)$ and $I((X,Y)|_G)$ that is uniform over all grids $G$ that are sub-grids of a much denser grid $\Gamma$. The common structure imposed by $\Gamma$ will allow us to capture the dependence between the quantities $|I(D|_G) - I((X,Y)|_G)|$ for different grids $G$.
Lemma A.1.
Let $\Pi = (\Pi_X, \Pi_Y)$ and $\Psi = (\Psi_X, \Psi_Y)$ be random variables distributed over the cells of a grid $\Gamma$, and let $(\pi_{i,j})$ and $(\psi_{i,j})$ be their respective distributions. Define
$$\varepsilon_{i,j} = \frac{\psi_{i,j} - \pi_{i,j}}{\pi_{i,j}}.$$
Let $G$ be a sub-grid of $\Gamma$ with $B$ cells. Then for every fixed $0 < a < 1$ we have
$$|I(\Psi|_G) - I(\Pi|_G)| \leq O(\log B)\sum_{i,j} |\varepsilon_{i,j}|$$
when $|\varepsilon_{i,j}| \leq 1 - a$ for all $i$ and $j$.

Proof.
Let $P = \Pi|_G$ and $Q = \Psi|_G$ be the random variables induced by $\Pi$ and $\Psi$ respectively on the cells of $G$. Using the fact that $I(X, Y) = H(X) + H(Y) - H(X, Y)$, we write
$$|I(Q) - I(P)| \leq |H(Q_X) - H(P_X)| + |H(Q_Y) - H(P_Y)| + |H(Q) - H(P)|$$
where $Q_X$ and $P_X$ denote the marginal distributions on the columns of $G$, and $Q_Y$ and $P_Y$ denote the marginal distributions on the rows. We can bound each of the terms on the right-hand side using the Taylor expansion argument given in Lemma L.1, whose proof is found in the appendix. Doing so gives
$$|I(Q) - I(P)| \leq (\ln B)\left( \sum_i O(|\varepsilon_{i,*}|) + \sum_j O(|\varepsilon_{*,j}|) + \sum_{i,j} O(|\varepsilon_{i,j}|) \right)$$
where
$$\varepsilon_{i,*} = \frac{\sum_j (\psi_{i,j} - \pi_{i,j})}{\sum_j \pi_{i,j}}$$
and $\varepsilon_{*,j}$ is defined analogously.

To obtain the result, we observe that
$$|\varepsilon_{i,*}| = \left| \frac{\sum_j \pi_{i,j}\varepsilon_{i,j}}{\sum_j \pi_{i,j}} \right| \leq \frac{\sum_j \pi_{i,j}|\varepsilon_{i,j}|}{\sum_j \pi_{i,j}} \leq \sum_j |\varepsilon_{i,j}|$$
since $\pi_{i,j}/\sum_j \pi_{i,j} \leq 1$, and the analogous bound holds for $|\varepsilon_{*,j}|$.

We now extend Lemma A.1 to all grids with $B$ cells, rather than just those that are sub-grids of the master grid $\Gamma$. The proof of this lemma relies on an information-theoretic result, proven in Appendix B, that bounds the difference in mutual information between two distributions that can be obtained from each other by moving a small amount of probability mass.
Lemma A.2.
Let $\Pi = (\Pi_X, \Pi_Y)$ and $\Psi = (\Psi_X, \Psi_Y)$ be random variables, and let $\Gamma$ be a grid. Define $\varepsilon_{i,j}$ on $\Pi|_\Gamma$ and $\Psi|_\Gamma$ as in Lemma A.1. Let $G$ be any grid with $B$ cells, and let $\delta$ (resp. $d$) denote the total probability mass of $\Pi|_\Gamma$ (resp. $\Psi|_\Gamma$) falling in cells of $\Gamma$ that are not contained in individual cells of $G$. We have that
$$|I(\Psi|_G) - I(\Pi|_G)| \leq O\left( \left( \sum_{i,j}|\varepsilon_{i,j}| + \delta + d \right)\log B + \delta\log(1/\delta) + d\log(1/d) \right)$$
provided that the $|\varepsilon_{i,j}|$ are bounded away from 1 and that $d, \delta \leq 1/2$.

Proof.
In the proof below, we use the convention that for any two grids $G$ and $G'$ and any random variable $Z$, the expression $\Delta_Z(G, G')$ denotes $|I(Z|_G) - I(Z|_{G'})|$.

Consider the grid $G'$ obtained by replacing every horizontal or vertical line of $G$ that is not in $\Gamma$ with a closest line in $\Gamma$. The grid $G'$ is clearly a sub-grid of $\Gamma$. Moreover, $\Pi|_{G'}$ (resp. $\Psi|_{G'}$) can be obtained from $\Pi|_G$ (resp. $\Psi|_G$) by moving at most $\delta$ (resp. $d$) probability mass. This can be shown to imply that
$$\Delta_\Pi(G, G') \leq O(\delta\log(1/\delta) + \delta\log B) \qquad\text{and}\qquad \Delta_\Psi(G', G) \leq O(d\log(1/d) + d\log B).$$
The proof of this information-theoretic fact is self-contained, and so we defer it to Proposition 7 in Appendix B, as it is more central to the arguments presented there.

With $\Delta_\Pi(G, G')$ and $\Delta_\Psi(G', G)$ bounded in terms of $\delta$ and $d$, we can bound $|I(\Psi|_G) - I(\Pi|_G)|$ using the triangle inequality by comparing it with
$$\Delta_\Pi(G, G') + |I(\Pi|_{G'}) - I(\Psi|_{G'})| + \Delta_\Psi(G', G)$$
and bounding the middle term using Lemma A.1, since $G' \subset \Gamma$.

We now use the fact that the variables $\varepsilon_{i,j}$ defined in Lemma A.1 are small with high probability to give a concrete bound on the bias of $I(D|_G)$ that is uniform over all $k$-by-$\ell$ grids $G$ and that holds with high probability. It is useful at this point to recall that, given a distribution $(X, Y)$, an equipartition of $(X, Y)$ is a grid $G$ such that all the rows of $(X,Y)|_G$ have the same probability mass, and all the columns do as well.

Lemma A.3. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly distributed random variables. For any $\alpha > 0$, any $\varepsilon > 0$, and any integers $k, \ell > 1$, we have that for all $n$
$$|I(D_n|_G) - I((X,Y)|_G)| \leq O\left( \frac{\log(k\ell)}{C(n)^{\alpha}} + \frac{\log(k\ell n)}{n^{\varepsilon/4}} \right)$$
for every $k$-by-$\ell$ grid $G$, with probability at least $1 - 2C(n)e^{-\Omega(n/C(n)^{2\alpha})}$, where $C(n) = k\ell n^{\varepsilon/2}$.

Proof.
Fix $n$, and let $\Gamma$ be an equipartition of $(X, Y)$ into $kn^{\varepsilon/4}$ rows and $\ell n^{\varepsilon/4}$ columns, so that $C(n) = k\ell n^{\varepsilon/2}$ is the number of cells in $\Gamma$. Lemma A.2, with $\Pi = (X, Y)$ and $\Psi = D$, shows that
the difference $|I(D|_G) - I((X,Y)|_G)|$ is at most
$$O\left( \left( \sum_{i,j}|\varepsilon_{i,j}| + \delta + d \right)\log(k\ell) + \delta\log(1/\delta) + d\log(1/d) \right)$$
provided the $\varepsilon_{i,j}$ have absolute value bounded away from 1, and provided that $d, \delta \leq 1/2$.

The remainder of the proof proceeds as follows. We first show that the $\varepsilon_{i,j}$ are small with high probability. This will both show that the lemma's requirement on the $\varepsilon_{i,j}$ holds and allow us to bound the sum in the inequality above. We will then use our bound on the $\varepsilon_{i,j}$ to bound $d$ in terms of $\delta$. Finally, we will bound $\delta$ using the fact that the number of rows and columns in $\Gamma$ increases with $n$. This will give us that $d, \delta \leq 1/2$ and allow us to bound the rest of the terms in the expression above.

Bounding the $\varepsilon_{i,j}$: We bound the $\varepsilon_{i,j}$ using a multiplicative Chernoff bound. Let $\pi_{i,j}$ and $\psi_{i,j}$ denote the probability mass functions of $(X,Y)|_\Gamma$ and $D|_\Gamma$ respectively. We write
$$P(|\varepsilon_{i,j}| \geq \delta) = P\big(\psi_{i,j} \notin [\pi_{i,j}(1-\delta),\ \pi_{i,j}(1+\delta)]\big) \leq e^{-\Omega(n\pi_{i,j}\delta^2)}$$
since $n\psi_{i,j}$ is a sum of $n$ i.i.d. Bernoulli random variables with $E(n\psi_{i,j}) = n\pi_{i,j}$. (See, e.g., [33].) Setting $\delta = \sqrt{\pi_{i,j}}/C(n)^{1/2+\alpha}$ yields
$$P\left( |\varepsilon_{i,j}| \geq \frac{\sqrt{\pi_{i,j}}}{C(n)^{1/2+\alpha}} \right) \leq e^{-\Omega(n/C(n)^{2\alpha})}.$$
A union bound over the pairs $(i, j)$ then gives that, with the desired probability, the above bound on $|\varepsilon_{i,j}|$ holds for all $i, j$.

Bounding $\sum|\varepsilon_{i,j}|$: The bound on the $\varepsilon_{i,j}$ implies that
$$\sum_{i,j} |\varepsilon_{i,j}| \leq \frac{1}{C(n)^{1/2+\alpha}}\sum_{i,j}\sqrt{\pi_{i,j}} \leq \frac{\sqrt{C(n)}}{C(n)^{1/2+\alpha}} = \frac{1}{C(n)^{\alpha}}$$
where the second inequality follows from the fact that the function $\sum\sqrt{\pi_{i,j}}$ is symmetric and concave and therefore, when restricted to the hyperplane $\sum\pi_{i,j} = 1$, must achieve its maximum when $\pi_{i,j} = 1/C(n)$ for all $i, j$.

Bounding $d$ in terms of $\delta$: We use our bound on the $\varepsilon_{i,j}$ to bound $d$, observing that it implies
$$\psi_{i,j} \leq \pi_{i,j}\left(1 + \frac{\sqrt{\pi_{i,j}}}{C(n)^{1/2+\alpha}}\right) = \pi_{i,j} + \frac{\pi_{i,j}^{3/2}}{C(n)^{1/2+\alpha}} \leq \pi_{i,j} + \pi_{i,j} = 2\pi_{i,j}$$
since $\pi_{i,j} \leq 1$ and $C(n) \geq 1$.

The connection to $d$ comes from the fact that, for any column $j$ of $\Gamma$, this means that
$$\psi_{*,j} = \sum_i \psi_{i,j} \leq 2\sum_i \pi_{i,j} = 2\pi_{*,j},$$
and the same applies to sums across rows. Since $d$ is a sum of terms of the form $\psi_{*,j}$ and $\psi_{i,*}$ for $j$ in some index set $J$ and $i$ in some index set $I$, and $\delta$ is a sum of terms of the form $\pi_{*,j}$ and $\pi_{i,*}$ with the same index sets, we therefore get that $d \leq 2\delta$.

Bounding $\delta$ and obtaining the result: To bound $\delta$, we observe that because $G$ has at most $\ell - 1$ vertical lines and $k - 1$ horizontal lines, we have
$$\delta \leq \frac{\ell}{\ell n^{\varepsilon/4}} + \frac{k}{k n^{\varepsilon/4}} = \frac{2}{n^{\varepsilon/4}}.$$
This bound on $\delta$ allows us to bound the terms involving $d$ and $\delta$ by
$$\delta + d \leq O\left(\frac{1}{n^{\varepsilon/4}}\right), \qquad \delta\log\left(\frac{1}{\delta}\right) + d\log\left(\frac{1}{d}\right) \leq O\left(\frac{\log n}{n^{\varepsilon/4}}\right).$$
Combining all of the bounds gives the desired result.

Our final lemma shows that as long as $B(n)$ does not grow too fast, the bound from the previous lemma yields a uniform bound on the entire sample characteristic matrix. This is done by specifying an error threshold for which Lemma A.3 yields a bound that holds with high probability, and then invoking a union bound.
Lemma A.4. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly distributed random variables. For every $B(n) = O(n^{1-\varepsilon})$, there exists an $a > 0$ such that for sufficiently large $n$,
$$\left| \widehat{M}(D_n)_{k,\ell} - M(X,Y)_{k,\ell} \right| \leq O\left(\frac{1}{n^a}\right)$$
holds for all $k\ell \leq B(n)$ with probability $P(n) = 1 - o(1)$, where $\widehat{M}(D_n)_{k,\ell}$ is the $k,\ell$-th entry of the sample characteristic matrix and $M(X,Y)_{k,\ell}$ is the $k,\ell$-th entry of the population characteristic matrix of $(X,Y)$.

Proof.
Fix $k, \ell$, and any $\alpha$ satisfying $0 < \alpha < \varepsilon/(4 - \varepsilon)$. Lemma A.3 implies that with high probability the difference $|\widehat{M}(D_n)_{k,\ell} - M_{k,\ell}|$ is at most
$$O\left( \frac{\log(k\ell)}{C(n)^{\alpha}} + \frac{\log(k\ell n)}{n^{\varepsilon/4}} \right) \leq O\left( \frac{\log n}{C(n)^{\alpha}} + \frac{\log n}{n^{\varepsilon/4}} \right) \leq O\left( \frac{\log n}{n^{\alpha\varepsilon/2}} + \frac{\log n}{n^{\varepsilon/4}} \right)$$
where the first inequality is because $k\ell \leq B(n) \leq n$ and the second is because $C(n) = k\ell n^{\varepsilon/2} \geq n^{\varepsilon/2}$. This bound is at most $O(1/n^a)$ for every $a < \min\{\alpha\varepsilon/2, \varepsilon/4\}$, as desired. It remains only to show that the bound holds with high probability across all $k\ell \leq B(n)$.

Lemma A.3 states that the probability that our bound holds for one fixed pair $(k, \ell)$ is at least
$$1 - 2C(n)e^{-\Omega(n/C(n)^{2\alpha})} \geq 1 - O(n)e^{-\Omega(n^u)}$$
for some positive $u$. This is because $C(n) \leq B(n)n^{\varepsilon/2} \leq O(n^{1-\varepsilon/2})$ for large $n$, and so our choice of $\alpha$ ensures that $C(n)^{2\alpha} = O(n^{1-u})$ for some $u > 0$.

We can then perform a union bound over all pairs with $k\ell \leq B(n)$: since the number of such pairs can be bounded by a polynomial in $n$, we have that the desired condition is satisfied for all $k\ell \leq B(n)$ with probability approaching 1.

We are now ready to prove the main result.
Theorem 1. Let $f : m_\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every pair of jointly distributed random variables $(X, Y)$, we have
$$\left(f \circ r_{B(n)}\right)\left(\widehat{M}(D_n)\right) \to f(M(X,Y))$$
in probability, where $D_n$ is a sample of size $n$ from the distribution of $(X, Y)$, provided $\omega(1) < B(n) \leq O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Proof. Let $N$ denote $B(n)$, let $M_N = r_N(M)$, and let $\widehat{M}_N(D_n) = r_N(\widehat{M}(D_n))$. We begin by writing
$$\left| f\big(\widehat{M}_N(D_n)\big) - f(M) \right| \leq \left| f\big(\widehat{M}_N(D_n)\big) - f(M_N) \right| + |f(M_N) - f(M)| = \left| f\big(\widehat{M}_N(D_n)\big) - f(M_N) \right| + |(f \circ r_N)(M) - f(M)|$$
and observing that as $n \to \infty$, the second term vanishes by the pointwise convergence of $f \circ r_i$ and the fact that $B(n) > \omega(1)$. It therefore suffices to show that the first term converges to 0 in probability. Since $f$ is uniformly continuous, we can establish this via a simple adaptation of the continuous mapping theorem, which says that if a sequence of random variables $R_n \to R$ in probability and $g$ is continuous, then $g(R_n) \to g(R)$ in probability. We replace $R$ with a second sequence, and replace continuity with uniform continuity.

Let $\|\cdot\|$ denote the supremum norm on $m_\infty$, and fix any $z > 0$. Then, for any $\delta > 0$, define
$$C_\delta = \left\{ A \in m_\infty : \exists A' \in m_\infty \text{ s.t. } \|A - A'\| < \delta,\ |f(A) - f(A')| > z \right\}.$$
This is the set of matrices $A \in m_\infty$ for which it is possible to find, within a $\delta$-neighborhood of $A$, a second matrix that $f$ maps to more than $z$ away from $f(A)$. Because $f$ is uniformly continuous, there exists a $\delta^*$ sufficiently small that $C_{\delta^*} = \emptyset$.

Suppose that $|f(\widehat{M}_N(D_n)) - f(M_N)| > z$. This means that either $\|\widehat{M}_N(D_n) - M_N\| > \delta^*$, or $M_N \in C_{\delta^*}$. The latter option is impossible since $C_{\delta^*} = \emptyset$, and Lemma A.4 tells us that
$$P\left( \|\widehat{M}_N(D_n) - M_N\| > \delta^* \right) \to 0$$
as $n$ grows. We therefore have that $|f(\widehat{M}_N(D_n)) - f(M_N)| \to 0$ in probability, as desired.

B Proof of Theorem 2
In this appendix we prove Theorem 2, reproduced below.
Theorem 2.
Let $\mathcal{P}(\mathbb{R}^2)$ denote the space of pairs of random variables supported on $\mathbb{R}^2$, equipped with the metric of statistical distance. The map from $\mathcal{P}(\mathbb{R}^2)$ to $m_\infty$ defined by $(X, Y) \mapsto M(X,Y)$ is uniformly continuous.

The proposition below begins our argument with the simple observation that the family of maps consisting of applying any finite grid to some $(X, Y) \in \mathcal{P}(\mathbb{R}^2)$ is uniformly equicontinuous. The reason this holds is that $(X,Y)|_G$ is a deterministic function of $(X, Y)$, and deterministic functions cannot increase statistical distance.

Proposition 6.
Let $\mathcal{G}$ be the set of all finite grids. The family $\{ (X, Y) \mapsto (X,Y)|_G : G \in \mathcal{G} \}$ is uniformly equicontinuous on $\mathcal{P}(\mathbb{R}^2)$.

Proof. To establish uniform equicontinuity, we need to show that, given some $(X, Y) \in \mathcal{P}(\mathbb{R}^2)$ and some $\varepsilon > 0$, we can choose $\delta$ to satisfy the continuity condition in a way that does not depend on $G$ or on $(X, Y)$. But because deterministic functions cannot increase statistical distance, we have that if $(X, Y), (X', Y') \in \mathcal{P}(\mathbb{R}^2)$ are at most $\varepsilon$ apart, then
$$\Delta\big( (X,Y)|_G,\ (X',Y')|_G \big) \leq \Delta\big( (X,Y),\ (X',Y') \big) = \varepsilon$$
where $\Delta$ denotes statistical distance. Choosing $\delta = \varepsilon$ therefore gives the result.

At this point it is tempting to try to use continuity properties of discrete mutual information to obtain uniform continuity of the characteristic matrix. And indeed, this strategy does yield that each individual entry of the characteristic matrix is a uniformly continuous function. However, to obtain continuity of the entire (infinite) characteristic matrix, we need to make a statement about all grid resolutions simultaneously. This is not straightforward because mutual information is only uniformly continuous for a fixed grid resolution, and the family $\{ (X,Y) \mapsto I((X,Y)|_G) : G \in \mathcal{G} \}$ is in fact not even equicontinuous.

The normalization in the definition of MIC* is what allows us to establish the uniform continuity of the characteristic matrix despite this problem. To see why, suppose we have a distribution over a $k$-by-$\ell$ grid and we are allowed to move at most $\delta$ away in statistical distance for some small $\delta$. The largest change in discrete mutual information that this can cause indeed increases as we increase $k$ and $\ell$. However, it turns out that we can bound the extent of this "non-uniformity": the proposition below shows that as we move away from a distribution, the discrete mutual information can change only proportionally to the amount of mass we move, with the proportionality constant bounded by $\log\min\{k,\ell\}$. Because $\log\min\{k,\ell\}$ is the quantity by which we regularize the entries of the characteristic matrix, this is exactly enough to make the normalized matrix continuous. This proposition is the technical heart of our continuity result. And as we show in Corollary 3.3, where we demonstrate the non-continuity of the non-normalized characteristic matrix, our bound is tight.

Proposition 7.
Let $I_{k,\ell} : \mathcal{P}(\{1,\ldots,k\} \times \{1,\ldots,\ell\}) \to \mathbb{R}$ denote the discrete mutual information function on $k$-by-$\ell$ grids. For $0 < \delta \leq 1/2$, the maximal change in $I_{k,\ell}$ over any subset of $\mathcal{P}(\{1,\ldots,k\} \times \{1,\ldots,\ell\})$ of diameter $\delta$ (in statistical distance) is
$$O\left( \delta\log\left(\frac{1}{\delta}\right) + \delta\log\min\{k,\ell\} \right).$$

Proof. Without loss of generality, assume $k \leq \ell$, so that $\log\min\{k,\ell\} = \log k$. Let $(X, Y)$ and $(X', Y')$ be two pairs of random variables distributed over $\{1,\ldots,k\} \times \{1,\ldots,\ell\}$ that are at most $\delta$ apart in statistical distance. Using $I(X, Y) = H(Y) - H(Y|X)$, we can express the difference between the mutual informations of these two pairs as
$$|I(X,Y) - I(X',Y')| \leq |H(Y) - H(Y')| + |H(Y|X) - H(Y'|X')|.$$
We now use Lemma L.5, which relates movement of probability mass to changes in entropy and is proven in Appendix L, to separately bound each of the terms on the right-hand side. Straightforward application of the lemma to $|H(Y) - H(Y')|$ shows that it is at most $H_b(2\delta) + 3\delta\log k$, where $H_b(\cdot)$ is the binary entropy function. Since $H_b(x) \leq O(x\log(1/x))$ for $x$ small, this is $O(\delta\log(1/\delta) + \delta\log k)$.

Bounding the term with the conditional entropies is more involved. Let $p_x = P(X = x)$, and let $p'_x = P(X' = x)$. We have
$$\begin{aligned} |H(Y|X) - H(Y'|X')| &= \Big| \sum_x \big( p_x H(Y|X=x) - p'_x H(Y'|X'=x) \big) \Big| \\ &\leq \sum_x \big( p_x \left| H(Y|X=x) - H(Y'|X'=x) \right| + |p'_x - p_x|\, H(Y'|X'=x) \big) \\ &\leq \sum_x p_x \left| H(Y|X=x) - H(Y'|X'=x) \right| + \delta\log k \end{aligned} \tag{1}$$
where the last line is because $\sum_x |p_x - p'_x| \leq \delta$ and $H(Y'|X'=x) \leq \log k$.

Now let $\delta_{x+}$ be the magnitude of all the probability mass entering any cell in column $x$, let $\delta_{x-}$ be the magnitude of all the probability mass leaving any cell in column $x$, and let $\delta_x = \delta_{x+} + \delta_{x-}$. Using this notation, we can again apply Lemma L.5 to obtain
$$\begin{aligned} \sum_x p_x \left| H(Y|X=x) - H(Y'|X'=x) \right| &\leq \sum_x p_x \left( H_b\!\left(\frac{2\delta_x}{p_x}\right) + \frac{3\delta_x}{p_x}\log k \right) \\ &= \sum_x p_x H_b\!\left(\frac{2\delta_x}{p_x}\right) + 3\sum_x \delta_x \log k \\ &\leq \sum_x p_x H_b\!\left(\frac{2\delta_x}{p_x}\right) + 3\delta\log k \\ &\leq H_b(2\delta) + 3\delta\log k \end{aligned}$$
where the last line is by application of Lemma L.2 from the appendix, which bounds weighted sums of binary entropies.

Combining this with Line (1) gives that
$$|H(Y|X) - H(Y'|X')| \leq H_b(2\delta) + 4\delta\log k$$
which, together with the bound on $|H(Y) - H(Y')|$ and the fact that $H_b(x) \leq O(x\log(1/x))$ for $x$ small, gives the result.
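As an illustrative numerical check of the proposition (not part of the proof), one can move a small amount of mass in a random grid distribution and compare the resulting change in mutual information to the bound; the grid shape and the amount of mass moved below are arbitrary choices.

```python
import numpy as np

def mutual_info(p):
    """Discrete mutual information of a joint pmf given as a 2-D array."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
k, l = 8, 64
p = rng.random((k, l))
p /= p.sum()
q = p.copy()
src = np.unravel_index(np.argmax(q), q.shape)  # move mass out of one cell
delta = min(0.01, q[src])
q[src] -= delta
q[0, 0] += delta
change = abs(mutual_info(q) - mutual_info(p))
bound = delta * np.log(1 / delta) + delta * np.log(min(k, l))
print(change, bound)  # the change is small, on the order of the O(.) bound
```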
Having bounded the extent to which the variation in mutual information depends on grid resolution, we are now ready to show the uniform continuity of the characteristic matrix.

Theorem 2.
Let $\mathcal{P}(\mathbb{R}^2)$ denote the space of pairs of random variables supported on $\mathbb{R}^2$, equipped with the metric of statistical distance. The map from $\mathcal{P}(\mathbb{R}^2)$ to $m_\infty$ defined by $(X, Y) \mapsto M(X,Y)$ is uniformly continuous.

Proof. We complete the proof in three steps. First, we show that a certain family of functions $F$ is uniformly equicontinuous. Second, we use this to show that a different family $F'$, consisting of functions of the form $\sup_{g \in A} g$ with $A \subset F$, is uniformly equicontinuous. Finally, we argue that since the entries of $M(X,Y)$ consist of the functions in $F'$, this is sufficient to establish the result.

Define
$$F = \left\{ (X, Y) \mapsto \frac{I_{k,\ell}((X,Y)|_G)}{\log\min\{k,\ell\}} : k, \ell \in \mathbb{Z}_{>1},\ G \in \mathcal{G}(k,\ell) \right\}.$$
$F$ is uniformly equicontinuous by the following argument. Given some $\varepsilon > 0$, we know (Proposition 6) that for any $(X', Y')$ in an $\varepsilon$-ball around $(X, Y)$, $(X',Y')|_G$ will remain within $\varepsilon$ of $(X,Y)|_G$ for any $G$. Proposition 7 then tells us that if $\varepsilon$ is sufficiently small then the distance between $I_{k,\ell}((X',Y')|_G)$ and $I_{k,\ell}((X,Y)|_G)$ will be at most
$$O\left( \varepsilon\log(1/\varepsilon) + \varepsilon\log\min\{k,\ell\} \right).$$
After the normalization, this becomes at most $O(\varepsilon(\log(1/\varepsilon) + 1))$, which goes to 0 (uniformly with respect to $(X, Y)$) as $\varepsilon$ approaches 0, as desired.

Next, define
$$F' = \left\{ (X, Y) \mapsto M(X,Y)_{k,\ell} : k, \ell \in \mathbb{Z}_{>1} \right\}.$$
Each map in $F'$ is of the form $\sup_{g \in A} g$ for some $A \subset F$. Therefore, for a given $\varepsilon > 0$, whatever $\delta$ establishes the uniform equicontinuity of $F$ can be used to establish continuity of all the functions in $F'$. (To see this: $\sup_{g \in A} g$ cannot increase by more than $\varepsilon$ if no $g$ increases by more than $\varepsilon$, and $\sup_{g \in A} g$ is also lower-bounded by each of the $g$'s, so it cannot decrease by more than $\varepsilon$ either.) Since we can use the same $\delta$ for all of the maps in $F'$, they form a uniformly equicontinuous family.

Finally, the $\delta$ provided by the uniform equicontinuity of $F'$ also ensures that $M(X', Y')$ is within $\varepsilon$ of $M(X,Y)$ in the supremum norm, thus giving the uniform continuity of $(X, Y) \mapsto M(X,Y)$.

C Proof of Proposition 2
Proposition 2.
For some function $N(k,\ell)$, let $M^N$ be the characteristic matrix with normalization $N$, i.e.,
$$M^N(X,Y)_{k,\ell} = \frac{I^*((X,Y), k, \ell)}{N(k,\ell)}.$$
If $N(k,\ell) = o(\log\min\{k,\ell\})$ along some infinite path $P$ in $\mathbb{N} \times \mathbb{N}$, then $M^N$ and $\sup M^N$ are not continuous as functions on $\mathcal{P}([0,1] \times [0,1]) \subset \mathcal{P}(\mathbb{R}^2)$.

Proof. Consider a random variable $Z$ uniformly distributed on $[0,1]^2$. Because $Z$ exhibits statistical independence, $I^*(Z, k, \ell)$ is zero for all $k, \ell$. Now define $Z_\varepsilon$ to be uniformly distributed on $[0,1]^2$ with probability $1 - \varepsilon$, and uniformly distributed on the line from $(1/2, 1/2)$ to $(1, 1)$ with probability $\varepsilon$.

We lower-bound $I^*(Z_\varepsilon, k, \ell)$. Without loss of generality suppose that $k \leq \ell$, and consider a grid that places all of $[0, 1/2]^2$ into one cell and uniformly partitions the set $[1/2, 1]^2$ into $k - 1$ rows and $k - 1$ columns. By considering just the rows/columns in the set $[1/2, 1]$, we see that this grid yields a mutual information of at least $\varepsilon\log(k - 1)$. Thus, we have that for all $k, \ell$,
$$I^*(Z_\varepsilon, k, \ell) \geq \varepsilon\log\min\{k-1, \ell-1\}.$$
This implies that the limit of $M^N(Z_\varepsilon)$ along $P$ is $\infty$, and so the distance between $M^N(Z)$ and $M^N(Z_\varepsilon)$ in the supremum norm is infinite.

D Proof of Theorem 4
Theorem 4.
Let $M$ be a population characteristic matrix. Then $M_{k,\uparrow}$ equals
$$\max_{P \in \mathbf{P}(k)} \frac{I(X, Y|_P)}{\log k}$$
where $\mathbf{P}(k)$ denotes the set of all partitions of size at most $k$.

Proof. Define
$$M^*_{k,\uparrow} = \max_{P \in \mathbf{P}(k)} \frac{I(X, Y|_P)}{\log k}.$$
We wish to show that $M^*_{k,\uparrow}$ is in fact equal to $M_{k,\uparrow}$. To show that $M_{k,\uparrow} \leq M^*_{k,\uparrow}$, we observe that for every $k$-by-$\ell$ grid $G = (P, Q)$, where $P$ is a partition into rows and $Q$ is a partition into columns, the data processing inequality gives $I((X,Y)|_G) \leq I(X, Y|_P)$. Thus $M_{k,\ell} \leq M^*_{k,\uparrow}$ for $\ell \geq k$, implying that
$$M_{k,\uparrow} = \lim_{\ell\to\infty} M_{k,\ell} \leq M^*_{k,\uparrow}.$$

It remains to show that $M^*_{k,\uparrow} \leq M_{k,\uparrow}$. To do this, we let $P$ be any partition into $k$ rows, and we define $Q_\ell$ to be an equipartition into $\ell$ columns. We let
$$M^*_{k,\ell,P} = \frac{I(X|_{Q_\ell}, Y|_P)}{\log k}.$$
Since $M^*_{k,\ell,P} \leq M_{k,\ell}$ when $\ell \geq k$, we have that for all $P$
$$\frac{I(X, Y|_P)}{\log k} = \lim_{\ell\to\infty} M^*_{k,\ell,P} \leq \lim_{\ell\to\infty} M_{k,\ell} = M_{k,\uparrow}$$
which gives that
$$M^*_{k,\uparrow} = \sup_P \frac{I(X, Y|_P)}{\log k} \leq M_{k,\uparrow}$$
as desired.

E Proof of Theorem 5
Theorem 5.
Given a pair of jointly distributed random variables $(X, Y)$, the quantity $M_{k,\uparrow}$ (resp. $M_{\uparrow,\ell}$) is computable to within an additive error of $O(k\varepsilon\log(1/(k\varepsilon))) + E$ (resp. $O(\ell\varepsilon\log(1/(\ell\varepsilon))) + E$) in time $O(k\,T(E)/\varepsilon)$ (resp. $O(\ell\,T(E)/\varepsilon)$), where $T(E)$ is the time required to numerically compute the mutual information of a continuous distribution to within an additive error of $E$.

Proof. Without loss of generality we prove the claim only for $M_{k,\uparrow}$. Given $0 < \varepsilon < 1$, we would like a partition $P$ into rows of size at most $k$ such that $I(X, Y|_P)$ is maximized. We would like to use OptimizeXAxis for this purpose, but while our search problem is continuous,
OptimizeXAxis can only perform a discrete search over sub-partitions of some master partition $\Pi$. We therefore set $\Pi$ to be an equipartition into $1/\varepsilon$ rows and show that this gets us close enough to achieve the desired result.

With $\Pi$ as described, OptimizeXAxis provides, in time $O(k\,T(E)/\varepsilon)$, a partition $P_1$ into at most $k$ rows such that $I(X, Y|_{P_1})$ is maximized, subject to $P_1 \subset \Pi$, to within an additive error of $E$. To prove the claim, then, we must show that the loss we incur by restricting to sub-partitions of $\Pi$ costs us at most $O(k\varepsilon\log(1/(k\varepsilon)))$. In other words, we must show that
$$I(X, Y|_{P_0}) - I(X, Y|_{P_1}) \leq O(k\varepsilon\log(1/(k\varepsilon)))$$
where $P_0$ is an optimal partition into rows. Note that we have omitted the absolute value above since, by the optimality of $P_0$, $I(X, Y|_{P_0}) \geq I(X, Y|_{P_1})$ always.

We prove the desired bound by showing that there exists some $P' \subset \Pi$ such that the mutual information of $(X, Y|_{P'})$ is $O(k\varepsilon\log(1/(k\varepsilon)))$-close to that achieved with $(X, Y|_{P_0})$. Since $P' \subset \Pi$ gives us that $I(X, Y|_{P_1}) \geq I(X, Y|_{P'})$, we may then conclude that $I(X, Y|_{P_0}) - I(X, Y|_{P_1})$ is at most $O(k\varepsilon\log(1/(k\varepsilon)))$.

We construct $P'$ by simply replacing every horizontal line in $P_0$ with the horizontal line in $\Pi$ closest to it. Since there are at most $k - 1$ horizontal lines in $P_0$, and each such line is contained in a row of $\Pi$ containing $\varepsilon$ probability mass, performing this operation moves at most $(k-1)\varepsilon$ probability mass. In other words, the statistical distance between $(X, Y|_{P'})$ and $(X, Y|_{P_0})$ is at most $(k-1)\varepsilon \leq k\varepsilon$. Thus, for sufficiently small $\varepsilon$, Proposition 7, proven in Appendix B, can be used to show that
$$\left| I(X, Y|_{P'}) - I(X, Y|_{P_0}) \right| \leq O\left( k\varepsilon\log\left(\frac{1}{k\varepsilon}\right) + k\varepsilon\log\left(\frac{1}{\varepsilon}\right) \right)$$
which yields the desired result.

Remark.
We do not explore here the details of the numerical integration associated with the above theorem, since the error introduced by the numerical integration is independent of the algorithm being proposed. However, standard numerical integration methods can be used to make this error arbitrarily small, with an understood complexity tradeoff (see, e.g., [34]).
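To make the restricted search concrete, the following toy sketch brute-forces the best row partition drawn from a master equipartition of a discretized joint density. It is illustrative only: real implementations use the dynamic programming of OptimizeXAxis rather than enumeration, and the discretization of the density is our assumption.

```python
import itertools
import numpy as np

def best_partition_mi(joint, k, master_rows):
    """Search sub-partitions of a master equipartition of the rows.

    `joint` is a (master_rows, n_cols) discretized joint pmf; we merge
    its rows into at most k groups along every possible set of cut
    points and return the highest resulting mutual information.
    """
    def mi(p):
        px = p.sum(axis=1, keepdims=True)
        py = p.sum(axis=0, keepdims=True)
        mask = p > 0
        return (p[mask] * np.log(p[mask] / (px @ py)[mask])).sum()

    best = 0.0
    for r in range(1, k):  # r cut points give r + 1 <= k row groups
        for cuts in itertools.combinations(range(1, master_rows), r):
            edges = [0, *cuts, master_rows]
            merged = np.array([joint[a:b].sum(axis=0)
                               for a, b in zip(edges, edges[1:])])
            best = max(best, mi(merged))
    return best
```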
F Proof of Theorem 6
Theorem 6.
Let $(X, Y)$ be jointly distributed random variables. Then $\partial[M] = \partial M$.

Proof. Without loss of generality, we show that $[M]_{k,\uparrow} = M_{k,\uparrow}$. Fix any partition $P$ into rows. If $Q_\ell$ is an equipartition into $\ell$ columns, then
$$\lim_{\ell\to\infty} I(X|_{Q_\ell}, Y|_P) = I(X, Y|_P),$$
because the continuous mutual information equals the limit of the discrete mutual information over increasingly fine partitions. (See, e.g., Chapter 8 of [22] for a proof of this.) This means that, letting $\mathbf{P}(k)$ denote the set of all partitions of size at most $k$, we have
$$[M]_{k,\uparrow} = \max_{P \in \mathbf{P}(k)} \frac{I(X, Y|_P)}{\log k} = M_{k,\uparrow}$$
where the second equality follows from Proposition 4.

G Consistency of MIC_e in estimating MIC*

The consistency of MIC_e for estimating MIC* can be established using the same technical lemmas that we used to show that MIC → MIC*. Specifically, we can use Lemma A.3, which bounds the difference, for all $k$-by-$\ell$ grids $G$, between the sample quantity $I(D_n|_G)$ and the population quantity $I((X,Y)|_G)$ with high probability, where $D_n$ is a sample of size $n$ from $(X,Y)$. That lemma yields the following fact about the sample equicharacteristic matrix, whose proof is similar to that of Lemma A.4.
Lemma G.1. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly distributed random variables. For every $B(n) = O(n^{1-\varepsilon})$, there exists an $a > 0$ such that for sufficiently large $n$,
$$\left| \widehat{[M]}(D_n)_{k,\ell} - [M](X,Y)_{k,\ell} \right| \leq O\left(\frac{1}{n^a}\right)$$
holds for all $k\ell \leq B(n)$ with probability $P(n) = 1 - o(1)$, where $\widehat{[M]}(D_n)_{k,\ell}$ is the $k,\ell$-th entry of the sample equicharacteristic matrix and $[M](X,Y)_{k,\ell}$ is the $k,\ell$-th entry of the population equicharacteristic matrix of $(X,Y)$.

In the case of MIC, we proceeded to apply abstract continuity considerations to obtain our consistency theorem (Theorem 1) from a result analogous to the above lemma. A similar argument shows that, in the case of the equicharacteristic matrix as well, we can estimate a large class of functions of the matrix in the same way. This is stated formally in the theorem below. As before, we let $m_\infty$ be the space of infinite matrices equipped with the supremum norm, and given a matrix $A$ the projection $r_i$ zeros out all the entries $A_{k,\ell}$ for which $k\ell > i$.
Theorem. Let $f : m_\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every pair of jointly distributed random variables $(X, Y)$, we have
$$\left(f \circ r_{B(n)}\right)\left(\widehat{[M]}(D_n)\right) \to f([M](X,Y))$$
in probability, where $D_n$ is a sample of size $n$ from the distribution of $(X,Y)$, provided $\omega(1) < B(n) \leq O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.
H The EquicharClump algorithm
In Theorem 9, we sketched an algorithm called
EquicharClump for approximating the sample equicharacteristic matrix that is more efficient than the naive computation. In this appendix, we describe the algorithm in detail, bound its runtime, and show that it indeed yields a consistent estimator of MIC* from finite samples, as well as a consistent independence test when used to compute the total information coefficient. We then present some empirical results characterizing the sensitivity of the algorithm to its speed-versus-optimality parameter $c$.

The results in this section can be summarized as follows: let $(X, Y)$ be a pair of jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X, Y)$. For every $c \geq 1$, there exists a matrix $\{\widehat{M}\}^c(D_n)$ such that

1. There exists an algorithm EquicharClump for computing $r_B(\{\widehat{M}\}^c(D_n))$ in time $O(n + B^{5/2})$, which equals $O(n + n^{5(1-\varepsilon)/2})$ when $B(n) = O(n^{1-\varepsilon})$.

2. The function
$$\widetilde{\mathrm{MIC}}_{e,B}(\cdot) = \max_{k\ell \leq B(n)} \{\widehat{M}\}^c(\cdot)_{k,\ell}$$
is a consistent estimator of MIC* provided $\omega(1) < B(n) \leq O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

3. The function
$$\widetilde{\mathrm{TIC}}_{e,B}(\cdot) = \sum_{k\ell \leq B(n)} \{\widehat{M}\}^c(\cdot)_{k,\ell}$$
yields a consistent right-tailed test of independence provided $\omega(1) < B(n) \leq O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

We will prove these results in order.
H.1 Algorithm description and analysis of runtime
We begin by describing the algorithm and bounding its runtime simultaneously. As in the proof of Theorem 8, we bound the runtime required to approximately compute only the $k,\ell$-th entries of $\{\widehat{M}\}^c(D_n)$ satisfying $k \leq \ell$ and $k\ell \leq B$. To do this, we analyze two portions of $\{\widehat{M}\}^c(D_n)$ separately: we first consider the case $\ell \geq \sqrt{B}$, in which we must compute the entries corresponding to all the pairs $\{(2,\ell), \ldots, (B/\ell, \ell)\}$. We then consider $\ell < \sqrt{B}$, in which case we need only compute the entries $\{(2,\ell), \ldots, (\ell,\ell)\}$, since the additional pairs would all have $k > \ell$.

For the case of $\ell \geq \sqrt{B}$, as in the previous theorem we can simultaneously compute using OptimizeXAxis the entries corresponding to all the pairs $\{(2,\ell), \ldots, (B/\ell, \ell)\}$ in time $O(|\Pi|^2(B/\ell)\ell) = O(|\Pi|^2 B)$, which equals $O(c^2B^3/\ell^2)$ when we set $\Pi$ to be an equipartition of size $cB/\ell$. Doing this for $\ell = \sqrt{B}, \ldots, B/2$ gives a contribution of the following order to the runtime:
$$O(c^2B^3)\sum_{\ell=\sqrt{B}}^{B/2}\frac{1}{\ell^2} = O(c^2B^3) \cdot O\left(\frac{1}{\sqrt{B}}\right) = O(c^2B^{5/2}).$$

For the case of $\ell < \sqrt{B}$, we can simultaneously compute using OptimizeXAxis the entries corresponding to all the pairs $\{(2,\ell), \ldots, (\ell,\ell)\}$ in time $O(|\Pi|^2\ell^2)$, which equals $O(c^2\ell^4) \leq O(c^2B^2)$ when we set $\Pi$ to be an equipartition of size $c\ell$. Summing over the $O(\sqrt{B})$ possible values of $\ell$ with $\ell < \sqrt{B}$ gives an upper bound of $O(c^2B^{5/2})$.
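As a quick numerical sanity check on the first case, one can sum the per-$\ell$ costs $B^3/\ell^2$ directly and compare them to $B^{5/2}$ (taking $c = 1$):

```python
import math

def case_one_cost(B):
    """Total cost of the l >= sqrt(B) case: sum of B**3 / l**2 terms."""
    return sum(B**3 / l**2 for l in range(int(math.isqrt(B)), B // 2 + 1))

for B in [10**2, 10**3, 10**4]:
    print(B, case_one_cost(B) / B**2.5)  # ratio stays bounded as B grows
```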
H.2 Consistency

Let $(X, Y)$ be a pair of jointly distributed random variables. For a sample $D_n$ of size $n$ from the distribution of $(X, Y)$ and a speed-versus-optimality parameter $c \geq 1$, let $\{\widehat{M}\}^c(D_n)$ denote the matrix computed by EquicharClump. (Notice the use of curly braces to differentiate this from the sample equicharacteristic matrix $\widehat{[M]}$.) We show here that $\max_{k\ell \leq B(n)} \{\widehat{M}\}^c(D_n)_{k,\ell}$ is a consistent estimator of $\mathrm{MIC}^*(X,Y)$, and correspondingly that $\sum_{k\ell \leq B(n)} \{\widehat{M}\}^c(D_n)_{k,\ell}$ yields a consistent independence test.

The key to both consistency results is that, though in calculating the $k,\ell$-th entry of $\{\widehat{M}\}^c(D_n)$ the algorithm only searches for optimal partitions that are sub-partitions of some equipartition, the size of the equipartition used always grows as $n$, $k$, and $\ell$ grow large. Therefore, in the limit this additional restriction does not hinder the optimization. We present this argument by introducing a population object called the clumped equicharacteristic matrix. We observe that this matrix is the limit of the EquicharClump procedure as the sample size grows, and then show that the supremum and partial sums of this matrix have the necessary properties.
Definition H.1.
Let $(X, Y)$ be jointly distributed random variables and fix some $c \geq 1$. Let
$$I^{\{c*\}}((X,Y), k, \ell) = \max_G I((X,Y)|_G)$$
where the maximum is over $k$-by-$\ell$ grids whose larger partition is an equipartition and whose smaller partition is contained in an equipartition of size $c\cdot\max\{k,\ell\}$. The clumped equicharacteristic matrix of $(X, Y)$, denoted by $\{M\}^c(X,Y)$, is defined by
$$\{M\}^c(X,Y)_{k,\ell} = \frac{I^{\{c*\}}((X,Y), k, \ell)}{\log\min\{k,\ell\}}.$$

Notice that the curly braces differentiate the quantities $I^{\{c*\}}$ and $\{M\}^c$ defined above from the corresponding equicharacteristic matrix quantities $I^{[*]}$ and $[M]$.

The following two results, which we state without proof, characterize the convergence of the output of EquicharClump to the clumped equicharacteristic matrix. These lemmas can be shown using Lemma A.3, which simultaneously bounds the difference, for all $k$-by-$\ell$ grids $G$, between the sample quantity $I(D_n|_G)$ and the population quantity $I((X,Y)|_G)$ with high probability over the sample $D_n$ of size $n$ from $(X,Y)$.

Lemma H.2.
Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly distributed random variables. For every $B(n) = O(n^{1-\varepsilon})$, there exists an $a > 0$ such that for sufficiently large $n$,
$$\left| \{\widehat{M}\}^c(D_n)_{k,\ell} - \{M\}^c(X,Y)_{k,\ell} \right| \leq O\left(\frac{1}{n^a}\right)$$
holds for all $k, \ell \leq \sqrt{B(n)}$ with probability $P(n) = 1 - o(1)$, where $\{\widehat{M}\}^c(D_n)$ denotes the matrix computed by the EquicharClump algorithm with parameter $c$ on the sample $D_n$.

Notice that the error bound provided by the above lemma holds not for $k\ell \leq B(n)$, as in the analogous Lemma A.4 and Lemma G.1, but rather for the smaller region defined by $k, \ell \leq \sqrt{B(n)}$. However, though we do not have uniform convergence outside the region $k, \ell \leq \sqrt{B(n)}$, we do nevertheless have pointwise convergence there, as stated below.

Lemma H.3. Fix $k, \ell \geq 2$. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly distributed random variables. For every $B(n) > \omega(1)$, we have that
$$\{\widehat{M}\}^c(D_n)_{k,\ell} \to \{M\}^c(X,Y)_{k,\ell}$$
in probability as $n$ grows, where $\{\widehat{M}\}^c(D_n)$ denotes the matrix computed by the EquicharClump algorithm with parameter $c$ on the sample $D_n$.

H.2.1 Consistency for estimating MIC*

The consistency of $\{\widehat{M}\}^c(D_n)$ for estimating MIC* follows from the following property of the clumped equicharacteristic matrix $\{M\}^c$, for which we give a proof sketch.

Proposition 8.
Let $(X, Y)$ be a pair of jointly distributed random variables. Then we have
$$\sup \{M\}^c(X,Y) = \mathrm{MIC}^*(X,Y).$$
Proof. (Sketch) Let $\{M\}^c = \{M\}^c(X,Y)$, and let $M = M(X,Y)$ be the characteristic matrix. Fix $k$, and consider the limit of $\{M\}^c_{k,\ell}$ as $\ell$ grows. The grid chosen for the $k,\ell$-th entry when $\ell > k$ will contain an equipartition $P_\ell$ of size $\ell$ on the x-axis, and a partition $Q_\ell$ of size $k$ on the y-axis that is optimal subject to the restriction that $Q_\ell$ be contained in an equipartition of size $c\ell$. As $\ell$ grows large, the equipartition $P_\ell$ on the first axis becomes finer and finer until, in the limit, $X|_{P_\ell} \to X$. And the partition $Q_\ell$ is chosen from a finer and finer equipartition, so that in the limit it approaches an unconditionally optimal partition $Q$ of size $k$. The convergence of $Q_\ell$ to the optimal partition $Q$ of size $k$ can be shown to be uniform using Proposition 7. This implies that
$$\{M\}^c_{k,\uparrow} = \lim_{\ell\to\infty} \{M\}^c_{k,\ell} = \max_{P \in \mathbf{P}(k)} \frac{I(X, Y|_P)}{\log k}$$
where $\mathbf{P}(k)$ denotes the set of all partitions of size at most $k$. Therefore, the boundary $\partial\{M\}^c$ of $\{M\}^c$ equals the boundary $\partial M$ of $M$. Since $\mathrm{MIC}^*(X,Y) = \sup \partial M$ (Theorem 3), this implies that
$$\sup \{M\}^c \geq \sup \partial\{M\}^c = \sup \partial M = \mathrm{MIC}^*(X,Y).$$
On the other hand, $\{M\}^c \leq M$ element-wise, since the optimization for the $k,\ell$-th entry of $\{M\}^c$ is performed over a subset of the grids searched for the $k,\ell$-th entry of $M$. This means that $\sup \{M\}^c \leq \sup M = \mathrm{MIC}^*(X,Y)$.

This fact, together with the pointwise convergence of $\{\widehat{M}\}^c(D_n)$ to $\{M\}^c$, suffices to establish the consistency we seek via standard continuity arguments, which we give in the abstract lemma below. The lemma applies to a double-indexed sequence indexed by $i$ and $j$; in our argument, the index $i$ corresponds to position in the equicharacteristic matrix, and the index $j$ corresponds to sample size. The sequence $A$ corresponds to the output of the EquicharClump algorithm, the sequence $a$ corresponds to the clumped equicharacteristic matrix, and the sequence $B$ corresponds to the sample equicharacteristic matrix.

Lemma H.4.
Let $\{A_{ij}\}_{i,j=1}^{\infty}$ and $\{B_{ij}\}_{i,j=1}^{\infty}$ be sequences of random variables, and let $\{a_i\}_{i=1}^{\infty}$ be a non-stochastic sequence. Assume that the following conditions hold:

1. $A_{ij} \leq B_{ij}$ almost surely;

2. for every $i$, $A_{ij} \to a_i$ in probability;

3. $B'_j = \max_{i \leq j} B_{ij}$ satisfies $B'_j \to \sup\{a_i\}$ in probability.

Then $A'_j = \max_{i \leq j} A_{ij}$ converges in probability to $\sup\{a_i\}$ as well.

Proof.
Let $a = \sup\{a_i\}$. We give the proof for the case $a < \infty$; it is easily adapted to the infinite case. We must show that for every $\varepsilon > 0$ and every $0 < p \leq 1$, there exists some $N$ such that $P(|A'_j - a| < \varepsilon) > 1 - 2p$ for all $j \geq N$. By the definition of $a$, we know that there exists some $k$ such that $|a_k - a| < \varepsilon/2$. Also, by the convergence of $A_{kj}$ to $a_k$, there exists some $m$ such that $P(|A_{kj} - a_k| < \varepsilon/2) > 1 - p$ for all $j \geq m$. Thus, with probability at least $1 - p$, we have
$$|A_{kj} - a| \leq |A_{kj} - a_k| + |a_k - a| \leq \varepsilon$$
for all $j \geq m$.

Next, we observe that since $A'_j \geq A_{kj}$ for $j \geq k$, the above inequality implies that for $j \geq \max\{m, k\}$ we have $P(A'_j > a - \varepsilon) > 1 - p$. It remains only to show that $A'_j$ does not get too large, but this follows from the fact that $A'_j \leq B'_j$ and $B'_j \to a$ in probability. Specifically, we are guaranteed some $N \geq \max\{m, k\}$ such that $P(B'_j < a + \varepsilon) > 1 - p$ for $j \geq N$. Since $B'_j < a + \varepsilon$ implies $A'_j < a + \varepsilon$, we have that $P(|A'_j - a| < \varepsilon) > 1 - 2p$ for $j \geq N$, as desired.

Proposition 9.
The function
$$\widetilde{\mathrm{MIC}}_{e,B}(\cdot) = \max_{k\ell \leq B(n)} \{\widehat{M}\}^c(\cdot)_{k,\ell}$$
is a consistent estimator of MIC* provided $\omega(1) < B(n) \leq O(n^{1-\varepsilon})$ for some $\varepsilon > 0$, where $\{\widehat{M}\}^c(\cdot)$ is the output of the EquicharClump algorithm.

Proof.
Let $(X,Y)$ be a pair of jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X,Y)$. Let $\{(k_i, \ell_i)\}_{i=1}^\infty \subset \mathbb{Z}^+ \times \mathbb{Z}^+$ be a sequence of coordinates with the property that for every number $B$ there exists an index $q(B)$ such that $\{(k_i, \ell_i) : i \le q(B)\} = \{(k, \ell) : k\ell \le B\}$.

We define $B_{ij} = \widehat{[M]}(D_j)_{k_i,\ell_i}$, i.e., $B_{ij}$ is the $(k_i,\ell_i)$-th entry of the sample characteristic matrix evaluated on a sample of size $j$. We analogously define $A_{ij} = \{\widehat{M}\}_c(D_j)_{k_i,\ell_i}$, and we define $a_i = \{M\}_c(X,Y)_{k_i,\ell_i}$. We observe that by Proposition 8, $\sup_i a_i = \sup\{M\}_c(X,Y) = \mathrm{MIC}^*$.
It is straightforward to see that $A_{ij} \le B_{ij}$. Additionally, Lemma H.3 shows that $A_{ij} \to a_i$ in probability, and Corollary 4.2, which states that $\mathrm{MIC}_e$ is a consistent estimator of $\mathrm{MIC}^*$, shows that $B'_j = \max_{i \le j} B_{ij} \to \mathrm{MIC}^*(X,Y)$. In the notation of the lemma, it therefore follows that $A'_j = \max_{i \le j} A_{ij}$ converges in probability to $\mathrm{MIC}^*(X,Y)$ as well. But this means that the sub-sequence

$$A'_{q(B(n))} = \max_{i \le q(B(n))} \{\widehat{M}\}_c(D_{q(B(n))})_{k_i,\ell_i} = \max_{k\ell \le B(n)} \{\widehat{M}\}_c(D_{q(B(n))})_{k,\ell}$$

converges in probability to $\mathrm{MIC}^*(X,Y)$, which implies the result since the sequence $A'_j$ is monotone.
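To make the estimator concrete, here is a minimal sketch of the outer maximization in Proposition 9. It is ours and does not reimplement EquicharClump itself; the input array is assumed to hold the precomputed clumped sample equicharacteristic matrix, with the $(k,\ell)$-th entry at `mat[k, l]`.

```python
import numpy as np

def mic_e_from_matrix(mat: np.ndarray, B: float) -> float:
    """Maximize mat[k, l] over all grid sizes with k*l <= B.

    `mat` stands in for the clumped sample equicharacteristic matrix
    {M-hat}_c(D_n); rows/columns 0 and 1 are unused padding since k, l >= 2.
    """
    best = 0.0
    for k in range(2, mat.shape[0]):
        if 2 * k > B:                       # even l = 2 would exceed the budget
            break
        l_max = min(mat.shape[1] - 1, int(B // k))
        if l_max >= 2:
            best = max(best, mat[k, 2:l_max + 1].max())
    return best

# Example budget: with n = 1000 and, say, B(n) = n**0.6, only entries
# with k*l <= 63 or so enter the maximum.
```

Since the budget $k\ell \le B(n)$ grows sublinearly in $n$, the maximization touches only a small upper-left block of the matrix.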
H.2.2 Consistency for total information coefficient

Similarly to the consistency argument for $\mathrm{MIC}^*$, we begin by exhibiting the relevant property of the population clumped equicharacteristic matrix.

Proposition 10. Let $(X,Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $\{M\}_c(X,Y) \equiv 0$. If not, then there exists some $a > 0$ and some integer $\ell_0 \ge 2$ such that $\{M\}_c(X,Y)_{k,\ell} \ge a/\log\min\{k,\ell\}$ either for all $k \ge \ell \ge \ell_0$, or for all $\ell \ge k \ge \ell_0$.

Proof. (Sketch) Let $\{M\}_c = \{M\}_c(X,Y)$. Under independence, every entry of $\{M\}_c$ is zero since $I((X,Y)|_G) = 0$ for any grid $G$. For the case of dependence, the argument is identical to that given in the proof of Proposition 4. Specifically, it can be shown that there exists some index $\ell_0$, taken without loss of generality to be a column index, and some $r > 0$ such that all but finitely many of the entries in the $\ell_0$-th column are at least $r$. It can then be shown that for large $k$, the entries $(k, \ell_0), (k, \ell_0 + 1), \ldots, (k, k)$ have non-decreasing values of $I^{[c*]}$. This establishes the claim for $a = r\log\ell_0$.

We now show that the above result, together with the uniform convergence of $\{\widehat{M}\}_c(D_n)$ to $\{M\}_c(X,Y)$, implies the consistency we seek.
Proposition 11. The function

$$\widetilde{\mathrm{TIC}}_{e,B}(\cdot) = \sum_{k\ell \le B(n)} \{\widehat{M}\}_c(\cdot)_{k,\ell}$$

yields a consistent right-tailed test of independence provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$, where $\{\widehat{M}\}_c(\cdot)$ is the output of the EquicharClump algorithm.

Proof.
Let $(X,Y)$ be a pair of jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X,Y)$. It suffices to show consistency for any deterministic monotonic function of the statistic in question. We therefore choose to analyze $\widetilde{\mathrm{TIC}}_{e,B}(D_n)\log(B(n))/B(n)$.

For the null hypothesis in which $X$ and $Y$ are independent, we observe that since $\{\widehat{M}\}_c(D_n) \le \widehat{[M]}(D_n)$ element-wise, $0 \le \widetilde{\mathrm{TIC}}_{e,B}(D_n) \le \mathrm{TIC}_{e,B}(D_n)$ as well. Moreover, the argument given in Appendix K, which shows that $\mathrm{TIC}_{e,B}(D_n)/B(n)$ converges to 0 in probability under the null hypothesis, can be adapted to show that $\mathrm{TIC}_{e,B}(D_n)\log(B(n))/B(n) \to 0$ as well. Thus, $\widetilde{\mathrm{TIC}}_{e,B}(D_n)\log(B(n))/B(n)$ converges to 0 in probability, as required.

For the case that $X$ and $Y$ are dependent, the proof is analogous to the argument given in Appendix K for $\mathrm{TIC}_e$. The only difference is that Lemma H.2, which guarantees the uniform convergence of $\{\widehat{M}\}_c(D_n)$ to $\{M\}_c(X,Y)$, applies only to the $(k,\ell)$-th entries for which $k, \ell \le \sqrt{B(n)}$, rather than to all the entries over which we are summing, which are those for which $k\ell \le B(n)$. However, since we require only a lower bound on $\widetilde{\mathrm{TIC}}_{e,B}(D_n)$, we may neglect these entries because

$$\widetilde{\mathrm{TIC}}_{e,B}(D_n) = \sum_{k\ell \le B(n)} \{\widehat{M}\}_c(D_n)_{k,\ell} \ge \sum_{k,\ell \le \sqrt{B(n)}} \{\widehat{M}\}_c(D_n)_{k,\ell}.$$
It can then be shown, following the argument from Appendix K, that there exists some $a > 0$ depending only on $B$ such that, with probability $1 - o(1)$,

$$\frac{\log B(n)}{B(n)} \left( \sum_{k,\ell \le \sqrt{B(n)}} \{M\}_c(X,Y)_{k,\ell} - \widetilde{\mathrm{TIC}}_{e,B}(D_n) \right) \le O\!\left(\frac{N \log B(n)}{B(n)\, n^a}\right) = O\!\left(\frac{\log B(n)}{n^a}\right)$$

where $N = B(n)$ is the number of pairs $(k,\ell)$ such that $k, \ell \le \sqrt{B(n)}$. To obtain the result, we note that this means that

$$\frac{\log B(n)}{B(n)}\, \widetilde{\mathrm{TIC}}_{e,B}(D_n) \ge \frac{\log B(n)}{B(n)} \sum_{k,\ell \le \sqrt{B(n)}} \{M\}_c(X,Y)_{k,\ell} - O\!\left(\frac{\log B(n)}{n^a}\right)$$

and then invoke Proposition 10, which implies that for large $n$

$$\sum_{k,\ell \le \sqrt{B(n)}} \{M\}_c(X,Y)_{k,\ell} \ge \Omega\!\left(\frac{B(n)}{\log B(n)}\right).$$
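Computationally, the statistic is the cheapest possible byproduct of the $\widetilde{\mathrm{MIC}}_{e,B}$ computation: the same matrix entries are summed instead of maximized. The sketch below is ours; `equichar_matrix` is a hypothetical stand-in for the EquicharClump output, and the permutation wrapper illustrates one standard way to calibrate the right-tailed test.

```python
import numpy as np

def tic_e_from_matrix(mat: np.ndarray, B: float) -> float:
    """Sum mat[k, l] over all grid sizes with k*l <= B (cf. MIC_e,
    which takes the max over the same entries)."""
    total = 0.0
    for k in range(2, mat.shape[0]):
        if 2 * k > B:
            break
        l_max = min(mat.shape[1] - 1, int(B // k))
        if l_max >= 2:
            total += mat[k, 2:l_max + 1].sum()
    return total

def tic_e_pvalue(x, y, equichar_matrix, B, n_perm=500, seed=0):
    """Right-tailed permutation test: permuting y destroys any dependence,
    so a large observed TIC_e relative to the permuted statistics is
    evidence against independence."""
    rng = np.random.default_rng(seed)
    observed = tic_e_from_matrix(equichar_matrix(x, y), B)
    exceed = sum(
        tic_e_from_matrix(equichar_matrix(x, rng.permutation(y)), B) >= observed
        for _ in range(n_perm)
    )
    return (1 + exceed) / (n_perm + 1)
```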
H.3 Empirical characterization of the performance of EquicharClump
The EquicharClump algorithm has a parameter $c$ that controls the fineness of the equipartition whose sub-partitions are searched over by the algorithm. To gain an empirical understanding of the effect of $c$ on performance, we computed $\mathrm{MIC}_e$ on the set of relationships described in Section 4.4 using EquicharClump with different values of $c$. For each relationship, we compared the average $\mathrm{MIC}_e$ across all 500 independent samples from that relationship under the different values of $c$. We performed this analysis at sample sizes of $n = 250$ (Figure H1), $n = 500$ (Figure H2), and $n = 5{,}000$ (Figure H3). A code sketch of this comparison appears after the summary below.

We summarize our findings as follows.

• At low ($n = 250$) and medium ($n = 500$) sample sizes, using $c = 1$ introduces a downward bias for more complex relationships when $B(n) = n^{0.6}$ is used but not when $B(n) = n^{0.8}$ is used. This makes sense, since the low sample size and low setting of $B(n)$ mean that the algorithm is searching over grids with relatively few cells, and so setting $c = 1$ hinders its ability to find good grids in this limited search space. This bias is almost entirely alleviated by setting $c$ to a larger value.

• At the high sample size ($n = 5{,}000$), this effect is still observable but much reduced. This makes sense, since when $n$ is large, $B(n)$ is large as well, and so the number of cells allowed in the grids being searched over is already large regardless of the exponent $\alpha$ used in $B(n) = n^\alpha$. Thus, there is less need for the robustness provided by searching for an optimal grid.
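One way to reproduce a small version of this experiment is with the minepy package, which, to our knowledge, exposes an implementation of $\mathrm{MIC}_e$ via its `est` argument in versions 1.1 and later; the `alpha` and `c` arguments below are assumed to correspond to the exponent in $B(n) = n^\alpha$ and the clump factor $c$. This is a sketch under those assumptions, not the code used to produce Figures H1–H3.

```python
import numpy as np
from minepy import MINE  # assumes minepy >= 1.1, which supports est="mic_e"

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 1, n)
y = np.sin(10 * np.pi * x) + rng.normal(0, 1, n)  # a complex, noisy relationship

for c in (1, 2, 5, 10):
    for alpha in (0.6, 0.8):
        mine = MINE(alpha=alpha, c=c, est="mic_e")
        mine.compute_score(x, y)
        print(f"c = {c:2d}, alpha = {alpha}: MIC_e = {mine.mic():.3f}")
```

Averaging such scores over many independent samples, as in the figures, makes any downward bias at $c = 1$ visible.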
I Equitability and power analyses from [3]

Figure I1 contains a representative equitability analysis from [3]. Figure I2 contains power curves from [3] for a large set of leading methods.

Figure H1: The effect of the parameter $c$ on the performance of EquicharClump, at $n = 250$. See Section H.3 for details.

Figure H2: The effect of the parameter $c$ on the performance of EquicharClump, at $n = 500$. See Section H.3 for details.

Figure H3: The effect of the parameter $c$ on the performance of EquicharClump, at $n = 5{,}000$. See Section H.3 for details.

Figure I1: The equitability of measures of dependence on a set of noisy functional relationships, reproduced from [3]. [Narrower is more equitable.] The plots were constructed as in Figure 3. Mutual information, estimated using the Kraskov estimator, is represented using the squared Linfoot correlation.

Figure I2: Power of independence testing using several leading measures of dependence, on the relationships chosen by [32], at noise levels with linearly increasing magnitude for each relationship and $n = 500$. To enable comparison of power regimes across relationships, the x-axis of each plot lists $R^2$ rather than noise magnitude.

J Equitability analysis of randomly chosen functions with additional noise model
Figure J1 contains a version of the main text Figure 4, but where noise has been added only to the dependent variable in each functional relationship, rather than to both the independent and dependent variables.

Figure J1: [...] $R^2$ values corresponding to any one value of the statistic; the narrower the red interval, the higher the equitability. Sample relationships for each Gaussian process bandwidth are shown in the top right with matching colors.
K Consistency of independence testing based on TIC$_e$

Here we prove Propositions 4 and 5 and then use those propositions to prove Theorem 10, which shows that $\mathrm{TIC}_e$ can be used for independence testing.

K.1 Proof of Proposition 4
Proposition 4. Let $(X,Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $M(X,Y) \equiv [M](X,Y) \equiv 0$. If not, then there exists some $a > 0$ and some integer $\ell_0 \ge 2$ such that $M(X,Y)_{k,\ell},\, [M](X,Y)_{k,\ell} \ge a/\log\min\{k,\ell\}$ either for all $k \ge \ell \ge \ell_0$, or for all $\ell \ge k \ge \ell_0$.

Proof. We give the proof only for $[M] = [M](X,Y)$, with the understanding that all parts of the argument are either identical or similar for $M(X,Y)$. When $X$ and $Y$ are independent, then for any grid $G$, $(X,Y)|_G$ exhibits independence as well. Therefore $I((X,Y)|_G) = 0$ for all grids $G$, and so every entry of $[M]$, being a supremum over such quantities, is 0.

For the case that $X$ and $Y$ are dependent, our strategy is to first find, without loss of generality, a column of $[M]$ almost all of whose values are bounded away from zero, and then argue that this suffices.

The dependence of $X$ and $Y$ implies that $\mathrm{MIC}^*(X,Y) > 0$. By Corollary 4.1, which states that $\sup \partial[M] = \mathrm{MIC}^*(X,Y)$, we therefore know that there is at least one non-zero element of the boundary of $[M]$, as defined in Definition 3.5. Without loss of generality, suppose that this element is $[M]_{\uparrow,\ell_0} = \lim_{k\to\infty} [M]_{k,\ell_0}$. The fact that this limit is strictly positive implies that there exists some $k_0 \ge \ell_0$ and some $r > 0$ such that $[M]_{k,\ell_0} \ge r$ for all $k \ge k_0$. That is, all but finitely many of the entries in the $\ell_0$-th column of $[M]$ are at least $r$.

We now show that the existence of such a column suffices to prove the claim. Fix some $k > k_0$ and note that this implies that $k > \ell_0$. We argue that for all $\ell$ in $\{\ell_0, \ldots, k\}$, the desired condition holds. Since $k > \ell_0$, the term $I^{[*]}((X,Y), k, \ell_0)$ in the definition of $[M]_{k,\ell_0}$ is a maximization over grids that have an equipartition of size $k$ on one axis and an optimal partition of size $\ell_0$ on the other. Since we allow empty rows/columns in the maximization, substituting any $\ell$ satisfying $\ell_0 \le \ell \le k$ therefore does not constrain the maximization in any way and so it cannot decrease $I^{[*]}$. In other words, for $\ell$ satisfying $\ell_0 \le \ell \le k$, we have $I^{[*]}((X,Y), k, \ell) \ge I^{[*]}((X,Y), k, \ell_0)$.

Since $k \ge \ell, \ell_0$, the normalizations in the definitions of $[M]_{k,\ell}$ and $[M]_{k,\ell_0}$ are $\log\ell$ and $\log\ell_0$ respectively. Therefore, we have that

$$[M]_{k,\ell} \ge [M]_{k,\ell_0}\,\frac{\log\ell_0}{\log\ell} \ge r\,\frac{\log\ell_0}{\log\ell}$$

where the last inequality is because $k > k_0$. Setting $a = r\log\ell_0$ then gives the result.

K.2 Proof of Proposition 5
Proposition 5. Let $(X,Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $S_B(M(X,Y)) = S_B([M](X,Y)) = 0$ for all $B > 0$. If not, then $S_B(M(X,Y))$ and $S_B([M](X,Y))$ are both $\Omega(B\log\log B)$.

Proof. We give the argument for $M = M(X,Y)$ only, but the argument holds as stated for $[M](X,Y)$ as well.

The result follows from the guarantee given by Proposition 4 above. In the case of independence, the proposition tells us that $M \equiv 0$, which immediately gives that $S_B(M) = 0$ for all $B > 0$. For the case of dependence, the proposition implies that there is some $a > 0$ and some integer $\ell_0 \ge 2$ such that, without loss of generality, $M_{k,\ell} \ge a/\log\ell$ for all $k \ge \ell \ge \ell_0$. We convert this into a lower bound on $S_B(M)$.

The key is to write the sum one column at a time, counting how many entries in each column both satisfy $k \ge \ell \ge \ell_0$ and $k\ell \le B$. For any $\ell$ satisfying $\ell_0 \le \ell \le \sqrt{B}$, the entries $(\ell, \ell), \ldots, (B/\ell, \ell)$ meet this criterion, and there are $B/\ell - (\ell - 1)$ of them. Moreover, since the guarantee of Proposition 4 tells us that all of these entries are at least $a/\log\ell$, we can lower-bound $S_B(M)$ as follows.

$$S_B(M) \ge \sum_{\ell=\ell_0}^{\sqrt{B}} \frac{a}{\log\ell}\left(\frac{B}{\ell} - (\ell - 1)\right) = aB \sum_{\ell=\ell_0}^{\sqrt{B}} \frac{1}{\ell\log\ell} - a \sum_{\ell=\ell_0}^{\sqrt{B}} \frac{\ell - 1}{\log\ell} = aB \sum_{\ell=\ell_0}^{\sqrt{B}} \frac{1}{\ell\log\ell} - O(B) = \Omega(B\log\log B)$$

where the second-to-last equality is because $(\ell - 1)/\log\ell \le \ell$, and the last equality is because $\sum_{i=2}^{n} 1/(i\log i)$ grows like $\log\log n$.
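Since $S_B$ is just a truncated sum of matrix entries, the growth rate in Proposition 5 can be checked numerically; the sketch below is ours and evaluates $S_B$ for the worst-case entry profile $a/\log\min\{k,\ell\}$ from the proof, comparing it to $B\log\log B$.

```python
import math

def S_B(entry, B: int) -> float:
    """Sum of entry(k, l) over all k, l >= 2 with k*l <= B."""
    return sum(
        entry(k, l)
        for k in range(2, B // 2 + 1)
        for l in range(2, B // k + 1)
    )

a = 1.0
profile = lambda k, l: a / math.log(min(k, l))  # lower-bound profile from Prop. 4

for B in (10**2, 10**3, 10**4, 10**5):
    ratio = S_B(profile, B) / (B * math.log(math.log(B)))
    print(f"B = {B:>6}:  S_B / (B log log B) = {ratio:.3f}")
```

The printed ratio stays bounded away from zero as $B$ grows, consistent with the $\Omega(B\log\log B)$ bound.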
K.3 Proof of Theorem 10

Theorem 10. The statistics $\mathrm{TIC}_B$ and $\mathrm{TIC}_{e,B}$ yield consistent right-tailed tests of independence, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Proof. We give the proof for $\mathrm{TIC}$ only; however, the argument holds as stated for $\mathrm{TIC}_e$ as well. Let $(X,Y)$ be jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X,Y)$. Let $M = M(X,Y)$ be the characteristic matrix of $(X,Y)$ and let $\widehat{M}(D_n)$ be the sample characteristic matrix. It suffices to establish the result for a deterministic monotonic function of $\mathrm{TIC}_B(D_n)$. We therefore show convergence of $\mathrm{TIC}_B(D_n)/B(n)$ to zero under the null hypothesis of independence and to $\infty$ under any alternative. Our general strategy for doing so is to translate known bounds on our error at estimating entries of $M$ into bounds on the difference between $\mathrm{TIC}_B(D_n)/B(n) = S_{B(n)}(\widehat{M}(D_n))/B(n)$ and $S_{B(n)}(M)/B(n)$. We then obtain the result by invoking Proposition 5, which implies that $S_{B(n)}(M)/B(n)$ is zero under the null hypothesis but grows without bound under the alternative.

We know from Lemma A.4 (Lemma G.1 for the equicharacteristic matrix) that there exists some $a > 0$ depending only on $B$ such that

$$\left|\widehat{M}(D_n)_{k,\ell} - M_{k,\ell}\right| \le O\!\left(\frac{1}{n^a}\right)$$

for all $k\ell \le B(n)$ with probability $1 - o(1)$. This means that with probability $1 - o(1)$ we have

$$\frac{1}{B(n)}\left|\mathrm{TIC}_B(D_n) - S_{B(n)}(M)\right| \le O\!\left(\frac{N}{B(n)\, n^a}\right)$$

where $N$ is the number of pairs $(k,\ell)$ such that $k\ell \le B(n)$. It can be shown by taking the integral of $B/x$ with respect to $x$ that $N = O(B(n)\log B(n))$. Therefore, the error in the above bound is at most $O(\log B(n)/n^a) = O(1/\mathrm{poly}(n))$ for our choice of $B(n)$.

We now use Proposition 5 to show that this bound gives the desired result. Under the null hypothesis of independence, the proposition says that $S_{B(n)}(M) = 0$ always, and so since $B$ is a growing function the bound implies that $\mathrm{TIC}_B(D_n)/B(n) \to 0$ in probability. Under the alternative hypothesis in which $(X,Y)$ exhibit a dependence, the proposition implies that $S_{B(n)}(M)/B(n) = \omega(1)$. Since $B$ is a growing function of $n$, this means that for any $r > 0$, the probability that $\mathrm{TIC}_B(D_n)/B(n) > r$ goes to 1 as $n$ grows. In other words, $\mathrm{TIC}_B(D_n)/B(n) \to \infty$ in probability.

L Information-theoretic lemmas
Lemma L.1. Let $\Pi$ and $\Psi$ be random variables distributed over a discrete set of states $\Gamma$, and let $(\pi_i)$ and $(\psi_i)$ be their respective distributions. Let $P = f(\Pi)$ and $Q = f(\Psi)$ for some function $f$ whose image is of size $B$. Define $\varepsilon_i = \frac{\psi_i - \pi_i}{\pi_i}$. Then for every $0 < a < 1$ there exists some $A > 0$ such that

$$|H(Q) - H(P)| \le (\log B)\, A \sum_i |\varepsilon_i|$$

when $|\varepsilon_i| \le 1 - a$ for all $i$.

Proof.
We prove the claim with entropy measured in nats; a rescaling then gives the general result.

Let $(p_i)$ and $(q_i)$ be the distributions of $P$ and $Q$ respectively, and define $e_i = \frac{q_i - p_i}{p_i}$ analogously to $\varepsilon_i$. Before proceeding, we observe that

$$e_i = \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\,\varepsilon_j.$$

We now proceed with the argument. We have from [25] that

$$|H(Q) - H(P)| \le \left|\sum_i \left(e_i p_i (1 + \ln p_i) + \frac{1}{2} e_i^2 p_i + O(e_i^3)\right)\right| \quad (2)$$

$$\le \left|\sum_i e_i p_i\right| + \left|\sum_i e_i p_i \ln p_i\right| + \frac{1}{2}\left|\sum_i e_i^2 p_i\right| + \left|\sum_i O(e_i^3)\right| \quad (3)$$

$$= \left|\sum_i e_i p_i \ln p_i\right| + \frac{1}{2}\sum_i e_i^2 p_i + \left|\sum_i O(e_i^3)\right| \quad (4)$$

where the final equality is because $\sum_i e_i p_i = \sum_i q_i - \sum_i p_i = 0$. We proceed by bounding each of the terms in Equation 4 separately.

To bound the first term, we write

$$\left|\sum_i e_i p_i \ln p_i\right| \le -\sum_i |e_i|\, p_i \ln p_i.$$

We then note that $-\sum_i p_i \ln p_i \le \ln B$, and since each of the summands has the same sign this means that $-p_i \ln p_i \le \ln B$. We also observe that

$$|e_i| \le \left|\sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\,\varepsilon_j\right| \le \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\,|\varepsilon_j| \le \sum_{j \in f^{-1}(i)} |\varepsilon_j|$$

since $\pi_j/p_i \le 1$. Together, these two facts give

$$-\sum_i |e_i|\, p_i \ln p_i \le (\ln B) \sum_i |e_i| \le (\ln B) \sum_i |\varepsilon_i|.$$

The second inequality is because each $e_i$ is a weighted average of a set of $\varepsilon_j$, and each $\varepsilon_j$ enters into the expression of exactly one $e_i$.

To bound the second term, we use the fact that $p_i \le 1$ for all $i$, and so

$$\sum_i e_i^2 p_i \le \sum_i e_i^2.$$
We then write

$$\sum_i e_i^2 = \sum_i \left(\sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\,\varepsilon_j\right)^2 \le \sum_i \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\,\varepsilon_j^2 \le \sum_j \varepsilon_j^2 = \sum_j O(|\varepsilon_j|)$$

where the second step is a consequence of the convexity of $x \mapsto x^2$ and the third step is because the sets $f^{-1}(i)$ partition $\Gamma$.

To bound the third term, we write

$$\left|\sum_i O(e_i^3)\right| \le \sum_i O(|e_i|^3)$$

and then proceed as we did with the second term, using the fact that $x \mapsto x^3$ is convex for $x \ge 0$. This gives

$$\sum_i O(|e_i|^3) \le \sum_i O(|\varepsilon_i|^3) = \sum_i O(|\varepsilon_i|)$$

completing the proof.
Lemma L.2. Let $\{w_i\} \subset [0,1]$ be a set of size $n$ with $\sum_i w_i \le 1$, and let $\{u_i\}$ be a set of $n$ non-negative numbers satisfying $\sum_i u_i = a$ and $u_i \le w_i$. Then

$$\sum_{i=1}^{n} w_i H_b\!\left(\frac{u_i}{w_i}\right) \le H_b(a)$$

where $H_b$ is the binary entropy function.

Proof.
Consider the random variable $X$ taking values in $\{0, \ldots, n\}$ that equals 0 with probability $1 - \sum_i w_i$ and equals $i$ with probability $w_i$ for $0 < i \le n$. Define the random variable $Y$ taking values in $\{0, 1\}$ by

$$P(Y = 0 \mid X = i) = \begin{cases} 0 & i = 0 \\ u_i/w_i & 0 < i \le n. \end{cases}$$

The function we wish to bound equals $H(Y \mid X) \le H(Y)$. We therefore observe that

$$\sum_{i=1}^{n} w_i H_b\!\left(\frac{u_i}{w_i}\right) \le H(Y).$$

The result follows from the observation that

$$P(Y = 0) = \sum_i P(X = i)\,\frac{u_i}{w_i} = \sum_i u_i = a.$$
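This bound is easy to sanity-check numerically; the sketch below is ours and draws random weights satisfying the hypotheses, then compares the two sides of the inequality.

```python
import numpy as np

def H_b(p: float) -> float:
    """Binary entropy in bits, with H_b(0) = H_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

rng = np.random.default_rng(0)
for _ in range(5):
    w = rng.uniform(0.01, 1.0, 10)
    w /= 2 * w.sum()                    # enforce sum(w_i) <= 1
    u = w * rng.uniform(0.0, 1.0, 10)   # enforce 0 <= u_i <= w_i
    a = u.sum()
    lhs = sum(wi * H_b(ui / wi) for wi, ui in zip(w, u))
    print(f"sum w_i H_b(u_i/w_i) = {lhs:.4f}  <=  H_b(a) = {H_b(a):.4f}")
```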
Lemma L.3. Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$. Let $\alpha_x \ge 0$ be such that $\sum_x \alpha_x = \delta$, and define the random variable $X'$ by $P(X' = x) = (p_x + \alpha_x)/(1 + \delta)$. We have

$$\left|H(X') - H(X)\right| \le H_b(\delta) + \delta\log k$$

where $H_b$ is the binary entropy function.

Proof. Define a new random variable $Z$ by

$$P(Z = 0 \mid X' = x) = \frac{p_x}{p_x + \alpha_x}, \qquad P(Z = 1 \mid X' = x) = \frac{\alpha_x}{p_x + \alpha_x}.$$

We will use the fact that $H(X' \mid Z = 0) = H(X)$ to obtain the required bound.

To upper bound $H(X') - H(X)$, we write

$$H(X') - H(X) \le H(X', Z) - H(X) = H(Z) + P(Z = 0)H(X' \mid Z = 0) + P(Z = 1)H(X' \mid Z = 1) - H(X) \le H_b(\delta) + (1 - \delta)H(X) + \delta H(X' \mid Z = 1) - H(X) \le H_b(\delta) - \delta H(X) + \delta\log k \le H_b(\delta) + \delta\log k$$

where in the fourth step we have used that $H(X' \mid Z = 1) \le \log k$.

To upper bound $H(X) - H(X')$, we write

$$H(X') + H(Z) \ge H(X', Z) \ge P(Z = 0)H(X' \mid Z = 0) = (1 - \delta)H(X)$$

which yields $H(X') \ge (1 - \delta)H(X) - H_b(\delta)$ since $H(Z) = H_b(\delta)$. Thus, we have

$$H(X) - H(X') \le \delta H(X) + H_b(\delta) \le \delta\log k + H_b(\delta).$$
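As with Lemma L.2, the bound can be probed numerically; the following sketch is ours and perturbs a random distribution over $k$ states by adding total mass $\delta$, comparing the resulting entropy change to $H_b(\delta) + \delta\log k$.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy in bits, ignoring zero-probability states."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def H_b(p: float) -> float:
    return entropy(np.array([p, 1.0 - p])) if 0.0 < p < 1.0 else 0.0

rng = np.random.default_rng(1)
k, delta = 20, 0.1
for _ in range(5):
    p = rng.dirichlet(np.ones(k))               # original distribution of X
    alpha = delta * rng.dirichlet(np.ones(k))   # added mass, summing to delta
    p_prime = (p + alpha) / (1.0 + delta)       # perturbed distribution of X'
    gap = abs(entropy(p_prime) - entropy(p))
    print(f"|H(X') - H(X)| = {gap:.4f}  <=  "
          f"H_b(delta) + delta*log2(k) = {H_b(delta) + delta * np.log2(k):.4f}")
```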
Lemma L.4. Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$. Let $\alpha_x \le 0$ be such that $\sum_x |\alpha_x| = \delta$, and define the random variable $X'$ by $P(X' = x) = (p_x + \alpha_x)/(1 - \delta)$. We have

$$\left|H(X') - H(X)\right| \le H_b\!\left(\frac{\delta}{1 - \delta}\right) + \frac{\delta}{1 - \delta}\log k$$

where $H_b$ is the binary entropy function. In particular, when $\delta \le 1/2$ we have

$$\left|H(X') - H(X)\right| \le H_b(2\delta) + 2\delta\log k.$$

Proof.
We observe that we can get from $X'$ to $X$ by adding $\delta/(1 - \delta)$ probability mass and rescaling. The previous lemma then gives the result.

Lemma L.5. Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$. Let $\alpha_x$ be such that $\sum_x |\alpha_x| = \delta$, and define the random variable $X'$ by $P(X' = x) = (p_x + \alpha_x)/(1 + \sum_x \alpha_x)$. That is, $X'$ is the result of changing the probability of state $x$ by $\alpha_x$ and then re-normalizing to obtain a valid distribution. If $\delta \le 1/4$, we have

$$\left|H(X') - H(X)\right| \le 2H_b(2\delta) + 3\delta\log k$$

where $H_b$ is the binary entropy function.

Proof.
Let $\delta_+$ be the total magnitude of all the positive $\alpha_x$, and let $\delta_-$ be the total magnitude of all the negative $\alpha_x$. We first add all the mass we are going to add, and apply the first of the previous two lemmas. Then we remove all the mass we are going to remove, and apply the second of the two previous lemmas. This yields a bound of

$$H_b(\delta_+) + \delta_+\log k + H_b\!\left(\frac{\delta_-}{1 - \delta_+}\right) + \frac{\delta_-}{1 - \delta_+}\log k \le H_b(\delta_+) + \delta_+\log k + H_b(2\delta_-) + 2\delta_-\log k \le H_b(2\delta) + \delta\log k + H_b(2\delta) + 2\delta\log k \le 2H_b(2\delta) + 3\delta\log k$$

where the first inequality is because $\delta_+ \le \delta < 1/2$ and $\delta_- \le \delta \le 1/4$, and the second inequality is because $\delta_+ \le \delta < 2\delta \le 1/2$.

References

[1] Y. A. Reshef, D. N. Reshef, P. C. Sabeti, and M. Mitzenmacher, “Equitability, interval estimation, and statistical power,” arXiv preprint arXiv:1505.02212, 2015.
[2] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[3] D. N. Reshef, Y. A. Reshef, P. C. Sabeti, and M. Mitzenmacher, “An empirical study of leading measures of dependence,” arXiv preprint arXiv:1505.02214, 2015.
[4] J. D. Storey and R. Tibshirani, “Statistical significance for genomewide studies,” Proceedings of the National Academy of Sciences, vol. 100, no. 16, pp. 9440–9445, 2003.
[5] V. Emilsson, G. Thorleifsson, B. Zhang, A. S. Leonardson, F. Zink, J. Zhu, S. Carlson, A. Helgason, G. B. Walters, S. Gunnarsdottir, et al., “Genetics of gene expression and its effect on disease,” Nature, vol. 452, no. 7186, pp. 423–428, 2008.
[6] W. Hoeffding, “A non-parametric test of independence,” The Annals of Mathematical Statistics, pp. 546–557, 1948.
[7] A. Rényi, “On measures of dependence,” Acta Mathematica Hungarica, vol. 10, no. 3, pp. 441–451, 1959.
[8] L. Breiman and J. H. Friedman, “Estimating optimal transformations for multiple regression and correlation,” Journal of the American Statistical Association, vol. 80, no. 391, pp. 580–598, 1985.
[9] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[10] G. J. Székely, M. L. Rizzo, N. K. Bakirov, et al., “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
[11] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Algorithmic Learning Theory, pp. 63–77, Springer, 2005.
[12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
[13] D. Lopez-Paz, P. Hennig, and B. Schölkopf, “The randomized dependence coefficient,” in Advances in Neural Information Processing Systems, pp. 1–9, 2013.
[14] R. Heller, Y. Heller, and M. Gorfine, “A consistent multivariate test of association based on ranks of distances,” Biometrika, vol. 100, no. 2, pp. 503–510, 2013.
[15] B. Jiang, C. Ye, and J. S. Liu, “Nonparametric k-sample tests via dynamic slicing,” Journal of the American Statistical Association, vol. 110, no. 510, pp. 642–653, 2015.
[16] A. A. Ding and Y. Li, “Copula correlation: An equitable dependence measure and extension of Pearson's correlation,” arXiv preprint arXiv:1312.7214, 2013.
[17] J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, 2014.
[18] D. N. Reshef, Y. A. Reshef, M. Mitzenmacher, and P. C. Sabeti, “Cleaning up the record on the maximal information coefficient and equitability,” Proceedings of the National Academy of Sciences, 2014.
[19] B. Murrell, D. Murrell, and H. Murrell, “R2-equitability is satisfiable,” Proceedings of the National Academy of Sciences, 2014.
[20] W. S. Cleveland and S. J. Devlin, “Locally weighted regression: an approach to regression analysis by local fitting,” Journal of the American Statistical Association, vol. 83, no. 403, pp. 596–610, 1988.
[21] C. J. Stone, “Consistent nonparametric regression,” The Annals of Statistics, pp. 595–620, 1977.
[22] T. Cover and J. Thomas, Elements of Information Theory. New York: John Wiley & Sons, Inc., 2006.
[23] I. Csiszár and P. C. Shields, “Information theory and statistics: A tutorial,” Foundations and Trends in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004.
[24] I. Csiszár, “Axiomatic characterizations of information measures,” Entropy, vol. 10, no. 3, pp. 261–273, 2008.
[25] M. S. Roulston, “Estimating the errors on measured entropy and mutual information,” Physica D: Nonlinear Phenomena, vol. 125, no. 3, pp. 285–294, 1999.
[26] T. Speed, “A correlation for the 21st century,” Science, vol. 334, no. 6062, pp. 1502–1503, 2011.
[27] E. Linfoot, “An informational measure of correlation,” Information and Control, vol. 1, no. 1, pp. 85–89, 1957.
[28] G. Székely and M. Rizzo, “Brownian distance covariance,” The Annals of Applied Statistics, vol. 3, no. 4, pp. 1236–1265, 2009.
[29] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical Review E, vol. 69, 2004.
[30] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, “A kernel statistical test of independence,” in Advances in Neural Information Processing Systems, pp. 585–592, 2007.
[31] R. Heller, Y. Heller, S. Kaufman, B. Brill, and M. Gorfine, “Consistent distribution-free k-sample and independence tests for univariate random variables,” arXiv preprint arXiv:1410.6758, 2014.
[32] N. Simon and R. Tibshirani, “Comment on ‘Detecting novel associations in large data sets’,” (retrieved from ∼tibs/reshef/comment.pdf on 11 Nov. 2012), 2012.
[33] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[34] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis. Springer.