An Empirical Study of Leading Measures of Dependence
David N. Reshef ∗†, Yakir A. Reshef ∗, Pardis C. Sabeti ∗∗, Michael M. Mitzenmacher ∗∗

Abstract
In exploratory data analysis, we are often interested in identifying promising pairwise associations for further analysis while filtering out weaker, less interesting ones. This can be accomplished by computing a measure of dependence on all possible variable pairs and examining the highest-scoring pairs, provided the measure of dependence used assigns similar scores to equally noisy relationships of different types. This property, called equitability, is formalized in Reshef et al. [2015b]. In addition to equitability, measures of dependence can also be assessed by the power of their corresponding independence tests as well as their runtime.

Here we present extensive empirical evaluation of the equitability, power against independence, and runtime of several leading measures of dependence. These include two statistics newly introduced in Reshef et al. [2015a]: MIC$_e$, which has equitability as its primary goal, and TIC$_e$, which has power against independence as its primary goal.

Regarding equitability, our analysis finds that MIC$_e$ is the most equitable method on functional relationships in most of the settings we considered, although mutual information estimation proves the most equitable at large sample sizes in some specific settings. Regarding power against independence, we find that TIC$_e$, along with Heller and Gorfine's $S^{DDP}$, is the state of the art on the relationships we tested. Our analyses also show evidence for a trade-off between power against independence and equitability consistent with the theory in Reshef et al. [2015b]. In terms of runtime, MIC$_e$ and TIC$_e$ are significantly faster than many other measures of dependence tested. Moreover, computing either one makes computing the other trivial.
This suggests that a fast and useful strategy for achieving a combination of power against independence and equitability may be to filter relationships by TIC$_e$ and then to examine the MIC$_e$ of only the significant ones.

We conclude with a discussion of the settings in which MIC$_e$ and TIC$_e$ are (and are not) appropriate tools. It is our hope that this work provides a practical guide for the use of MIC$_e$, TIC$_e$, and related statistics, and for the role of equitability more generally.

Affiliations: Department of Computer Science, Massachusetts Institute of Technology; School of Engineering and Applied Sciences, Harvard University; Department of Organismic and Evolutionary Biology, Harvard University; Broad Institute of MIT and Harvard.
∗ Co-first author. ∗∗ Co-last author.
† To whom correspondence should be addressed. Email: [email protected]

Suppose we have a high-dimensional data set with hundreds or thousands of dimensions and we wish to find interesting associations within it to analyze further. Even if we only search for pairwise associations among the variables, the number of potential relationships to examine is unmanageably large, necessitating automation to assist in the search. In this context, a common, simple approach is to compute some statistic on each combination of variables, rank the variable pairs from highest- to lowest-scoring, and then examine a small number of the top-scoring variable pairs in the resulting list.

The success of this strategy depends heavily on the statistic used. One natural approach is to use a measure of dependence, that is, a statistic whose population value is zero when the variables in question are statistically independent and non-zero otherwise. However, this is not sufficient to guarantee success. To see this, imagine using such a statistic $\hat{\varphi}$ on a data set containing many noisy linear relationships as well as a smaller number of strong sinusoidal relationships.
The fact that $\hat{\varphi}$ is a measure of dependence guarantees us that, given sufficient sample size, all of these relationships will receive non-trivial scores. Unfortunately, it tells us nothing about how those non-trivial scores will compare to each other. For example, $\hat{\varphi}$ could systematically assign higher scores to linear relationships than to sinusoidal relationships. If that is the case, then when we rank relationships by $\hat{\varphi}$ the noisy linear relationships may crowd out the sinusoidal relationships from the top of the list. Since we can only manually examine a relatively small number of relationships from the top of the list, we may therefore miss the sinusoidal relationships even though they are strong.

If our goal were simply to detect as many relationships as possible, then the measure of dependence $\hat{\varphi}$ would perform well to the extent that its associated independence test has good power. But a high-dimensional data set may contain a very large number of non-trivial relationships, some strong and others weak, and a list of all of them may be too large to allow for manual follow-up of each identified relationship [Reshef et al., 2015b; Emilsson et al., 2008]. Thus, in the exploration of large data sets, our goal is often not only to detect as many of the non-trivial associations in the data set as possible, but also to rank them by some notion of strength. For this task, deviation from independence can be too weak a search criterion.

One framework to address this challenge utilizes a property called equitability. Loosely, an equitable measure of dependence is one that gives similar scores to equally noisy relationships of different types [Reshef et al., 2011]. This definition is formalized in Reshef et al. [2015b] and shown there to be equivalent to power against a range of null hypotheses corresponding to different relationship strengths rather than the single null hypothesis of statistical independence (i.e., zero relationship strength).
While the general concept of equitability is quite broad, one intuitive and natural instantiation is that, when used on functional relationships, the value of an equitable measure of dependence should reflect the coefficient of determination ($R^2$) with respect to the generating function with as weak a dependence as possible on the particular function in question.

Equitability is a difficult property to achieve, and most measures of dependence do not have high equitability on functional relationships. (This is understandable, as they are not designed with that goal in mind.) One statistic that has shown good equitability on functional relationships is the maximal information coefficient (MIC) [Reshef et al., 2011]. In Reshef et al. [2015a] a new, efficiently computable, consistent estimator of the population MIC, called MIC$_e$, is introduced, along with a related measure of dependence called the total information coefficient TIC$_e$, which is essentially free to compute when MIC$_e$ is computed.

In this paper, we demonstrate how the theoretical advances of Reshef et al. [2015b,a] translate into practical benefits via extensive empirical analyses, under a wide range of settings, of the equitability, power, and runtime of MIC$_e$, TIC$_e$, and several leading measures of dependence: MIC [Reshef et al., 2011], distance correlation [Székely and Rizzo, 2009; Székely et al., 2007], mutual information estimation [Kraskov et al., 2004], maximal correlation [Rényi, 1959; Breiman and Friedman, 1985], the randomized dependence coefficient (RDC) [Lopez-Paz et al., 2013], the Heller-Heller-Gorfine distance (HHG) [Heller et al., 2013], $S^{DDP}$ [Heller et al., 2014], and the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005, 2008, 2012]. Throughout our analyses, we show how the theoretical framework of Reshef et al. [2015b] can be used to rigorously quantify equitability in practice.

Our analyses yield four main conclusions.
First, with regard to equitability, they show that estimation of the population MIC via MIC$_e$ is more equitable than other methods across the majority (32 out of 36) of the settings of noise/marginal distributions and sample size that we tested. (In the remaining four settings, the Kraskov mutual information estimator outperforms MIC$_e$.)

The second conclusion we draw is that the total information coefficient TIC$_e$ achieves overall statistical power against independence that is state-of-the-art. State-of-the-art power against independence is also achieved by Heller and Gorfine's $S^{DDP}$, which outperforms TIC$_e$ by some metrics and is outperformed by TIC$_e$ in others. The power of TIC$_e$ is high not just overall, but also on each individual alternative hypothesis relationship type we examined, meaning that we did not identify any one relationship type that TIC$_e$ is especially poorly suited for detecting.

The third conclusion is that the power against independence of MIC$_e$, the new estimator of the population MIC, is competitive with other state-of-the-art techniques, albeit with a different setting of its parameter $\alpha$ than the one that confers good equitability. This observation leads us to characterize a power-equitability trade-off that is captured by this parameter and appears consistent with the theory of equitability developed in Reshef et al. [2015b] together with “no free lunch” considerations.

Our final conclusion concerns runtime. We find that MIC$_e$ and TIC$_e$ are as fast as or faster than most other methods tested. Even at a sample size of $n = 5,$[…], […] MIC$_e$/TIC$_e$ on all variable pairs in a 1,[…] […]. Once either MIC$_e$ or TIC$_e$ is computed, the other can be computed trivially.

Taken together, our results suggest that MIC$_e$ can be efficiently used in conjunction with TIC$_e$ to achieve a useful mix of power against independence (by filtering results using TIC$_e$) and equitability (by using MIC$_e$ on the remaining variable pairs) when exploring a data set.

Together, this paper, Reshef et al. [2015b], and Reshef et al. [2015a] have three primary objectives. The first is to formalize the theory behind both equitability and the maximal information coefficient. The second is to introduce and analyze a new estimator of the population MIC as well as a new measure of dependence called the total information coefficient. The third is to provide an extensive comparison of the performance of a set of state-of-the-art measures of dependence in a wide range of settings in terms of equitability, power against independence, and runtime. While this paper is focused primarily on the performance comparison, providing direct and in-depth comparisons to existing methods, we hope these papers together expand the use of both this framework for data analysis and the existing algorithms.

The rest of this paper is organized as follows. In Section 2 we cover preliminaries, in Section 3 we give a brief review of equitability, in Section 4 we analyze the equitability of the methods in question, in Section 5 we analyze their power against independence, in Section 6 we characterize the trade-off between power against independence and equitability, in Section 7 we analyze runtime, and in Section 8 we offer a concluding discussion.

As we extensively analyze several statistics introduced in Reshef et al. [2011] and Reshef et al. [2015a], we start by reviewing the definitions of those statistics and related objects. The informed reader may skip this section and refer to it as needed.
The statistics we present here are two estimators of the maximal information coefficient, as well as the total information coefficient. For all of these statistics, we have a sample from the distribution of some two-dimensional random variable $(X, Y)$. The goal in estimating the maximal information coefficient is to provide a score in the form of a number between 0 and 1 that quantifies the strength of the relationship between $X$ and $Y$ in an equitable way (see Section 3 for a review of equitability). The goal in computing the total information coefficient is to obtain a statistic for testing for the presence or absence of statistical independence between $X$ and $Y$.

For all statistics, we use the following notational conventions. Let $G$ be a finite grid drawn on the Euclidean plane. Given a point $(x, y) \in \mathbb{R}^2$, we define the function $\mathrm{row}_G(y)$ to be the row of $G$ containing $y$ and we define $\mathrm{col}_G(x)$ analogously. For a pair $(X, Y)$ of jointly distributed random variables, we write $(X, Y)|_G$ to denote the discrete random variable $(\mathrm{col}_G(X), \mathrm{row}_G(Y))$. For natural numbers $k$ and $\ell$, we use $\mathcal{G}(k, \ell)$ to denote the set of all $k$-by-$\ell$ grids (possibly with empty rows/columns). Given a finite sample $D$ from the distribution of $(X, Y)$, we use $D$ to refer both to the set of points in the sample and to a point chosen uniformly at random from $D$. In the latter case, it then makes sense to talk about, e.g., $D|_G$ and $I(D|_G)$.

The maximal information coefficient (MIC) is a statistic introduced in Reshef et al. [2011] as a way to achieve good equitability on a wide range of relationship types. In Reshef et al. [2015a], the population value of this statistic is computed and a new estimator of that population value is given. Here we define all three of these objects.

2.2.1 The population MIC

We begin by defining the population value of MIC, which we denote by MIC$^*$. To define this quantity, we must first define an object called the population characteristic matrix. The population MIC will then be the supremum of this matrix.

Definition 2.1 (Reshef et al. [2015a]). Let $(X, Y)$ be jointly distributed random variables. Let
\[ I^*((X, Y), k, \ell) = \max_{G \in \mathcal{G}(k, \ell)} I((X, Y)|_G) \]
where $I$ represents the mutual information. The population characteristic matrix of $(X, Y)$, denoted by $M(X, Y)$, is defined by
\[ M(X, Y)_{k,\ell} = \frac{I^*((X, Y), k, \ell)}{\log \min\{k, \ell\}} \]
for $k, \ell > 1$.

Definition 2.2 (Reshef et al. [2015a]). Let $(X, Y)$ be jointly distributed random variables. The population maximal information coefficient (MIC$^*$) of $(X, Y)$ is defined by
\[ \mathrm{MIC}^*(X, Y) = \sup M(X, Y). \]

The population MIC has several alternate characterizations, both as a canonical smoothing of mutual information and as the supremum of the boundary of the characteristic matrix. For more, see Reshef et al. [2015a].

In this work we study two different estimators of the population MIC.
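As a concrete illustration of the notation above, the sketch below evaluates $I(D|_G)$ for a sample $D$ and a single fixed grid $G$, and then applies the normalization by $\log \min\{k, \ell\}$ used in Definition 2.1. It is a minimal sketch rather than the authors' implementation: the function names, the representation of a grid by its cut points (`x_cuts`, `y_cuts`), and the convention that a point lying exactly on a cut falls in the upper cell are all assumptions made here. Because it scores one grid instead of maximizing over every $k$-by-$\ell$ grid, the value it returns is only a lower bound on the corresponding characteristic matrix entry.

```python
import math
from bisect import bisect_right
from collections import Counter


def grid_mutual_information(points, x_cuts, y_cuts):
    """Plug-in mutual information I(D|G), in nats, for one fixed grid G.

    x_cuts / y_cuts are sorted lists of column / row boundaries
    (a hypothetical grid representation chosen for this sketch).
    """
    n = len(points)
    # Map every point to its (column, row) cell of the grid.
    cells = Counter(
        (bisect_right(x_cuts, x), bisect_right(y_cuts, y)) for x, y in points
    )
    cols, rows = Counter(), Counter()
    for (c, r), count in cells.items():
        cols[c] += count
        rows[r] += count
    # Sum p(c, r) * log(p(c, r) / (p(c) * p(r))) over occupied cells.
    return sum(
        (count / n) * math.log((count / n) * n * n / (cols[c] * rows[r]))
        for (c, r), count in cells.items()
    )


def normalized_grid_score(points, x_cuts, y_cuts):
    """I(D|G) divided by log min{k, l}, where G has k columns and l rows."""
    k, l = len(x_cuts) + 1, len(y_cuts) + 1
    return grid_mutual_information(points, x_cuts, y_cuts) / math.log(min(k, l))


# Two clusters on the diagonal: the 2-by-2 grid separates them perfectly.
diagonal = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.8), (0.9, 0.9)]
print(normalized_grid_score(diagonal, [0.5], [0.5]))  # 1.0

# One point per cell: column and row labels are independent.
corners = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]
print(normalized_grid_score(corners, [0.5], [0.5]))  # 0.0
```

The normalization makes the score independent of the logarithm base, which is why natural logarithms are used throughout this sketch.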
The first estimator: MIC

The first statistic we analyze is the original statistic introduced in Reshef et al. [2011], which estimates MIC$^*$ by first estimating each entry of the characteristic matrix up to a sample size-dependent maximal grid resolution. This estimated characteristic matrix is called the sample characteristic matrix and is defined below.

Definition 2.3 (Reshef et al. [2011]). Let $D \subset \mathbb{R}^2$ be a set of ordered pairs. The sample characteristic matrix $\widehat{M}(D)$ of $D$ is defined by
\[ \widehat{M}(D)_{k,\ell} = \frac{I^*(D, k, \ell)}{\log \min\{k, \ell\}}. \]

MIC is then the maximum of the sample characteristic matrix, subject to a sample size-dependent limit on the maximal allowed grid resolution.
Definition 2.4 (Reshef et al. [2011]). Let $D \subset \mathbb{R}^2$ be a set of $n$ ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define
\[ \mathrm{MIC}_B(D) = \max_{k\ell \le B(n)} \widehat{M}(D)_{k,\ell}. \]

The statistic MIC is proven in Reshef et al. [2015a] to be a consistent estimator of the population MIC, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for $\varepsilon > 0$. However, it is not known how to efficiently compute the exact value of MIC, and so in practice a heuristic dynamic-programming approximation algorithm is used.

The second estimator: MIC$_e$

The second statistic we analyze is MIC$_e$, a statistic introduced in Reshef et al. [2015a] and proven there to be a consistent estimator of MIC$^*$. In contrast to MIC, it is known how to compute MIC$_e$ exactly in polynomial time (although in practice other, still more efficient statistics may nevertheless be used; see below). Rather than attempting to estimate any entries of the characteristic matrix, MIC$_e$ estimates a different matrix, the equicharacteristic matrix, whose supremum is the same as that of the characteristic matrix. Estimates of entries of this other matrix turn out to be both much easier to compute and sufficient for estimating MIC$^*$.

We first define the sample equicharacteristic matrix, along with a prerequisite definition.

Definition 2.5 (Reshef et al. [2015a]). Let $(X, Y)$ be a pair of jointly distributed random variables. Define
\[ I^*((X, Y), k, [\ell]) = \max_{G \in \mathcal{G}(k, [\ell])} I((X, Y)|_G) \]
where $\mathcal{G}(k, [\ell])$ is the set of $k$-by-$\ell$ grids whose y-axis partition is an equipartition of size $\ell$. Define $I^*((X, Y), [k], \ell)$ analogously. Define $I^{[*]}((X, Y), k, \ell)$ to equal $I^*((X, Y), k, [\ell])$ if $k \le \ell$ and $I^*((X, Y), [k], \ell)$ otherwise.

Definition 2.6 (Reshef et al. [2015a]). Let $D \subset \mathbb{R}^2$ be a set of ordered pairs. The sample equicharacteristic matrix $\widehat{[M]}(D)$ of $D$ is defined by
\[ \widehat{[M]}(D)_{k,\ell} = \frac{I^{[*]}(D, k, \ell)}{\log \min\{k, \ell\}}. \]

We can now define the second estimator, MIC$_e$.

Definition 2.7 (Reshef et al. [2015a]). Let $D \subset \mathbb{R}^2$ be a set of $n$ ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define
\[ \mathrm{MIC}_{e,B}(D) = \max_{k\ell \le B(n)} \widehat{[M]}(D)_{k,\ell}. \]

MIC$_e$ can be computed using dynamic programming, resulting in a search procedure that takes time $O(n^{[\ldots]} B(n)^{[\ldots]})$, which equals $O(n^{[\ldots]\alpha})$ when $B(n) = n^{\alpha}$. In practice, however, this algorithm can be modified to include a parameter $c$ that controls the coarseness of the discretization of the grid-maximization search. The modified statistic remains a consistent estimator of MIC$^*$ and runs in time $O(c^{[\ldots]} B(n)^{[\ldots]/[\ldots]}) = O(c^{[\ldots]} n^{[\ldots]\alpha/[\ldots]})$ [Reshef et al., 2015a]. In this work we use MIC$_e$ to refer both to the statistic as defined above and to the result of this modified algorithm. For more, see Reshef et al. [2015a].

While the maximal information coefficient aims to measure the strength of a relationship equitably, the total information coefficient (TIC), introduced in Reshef et al. [2015a], provides a way of testing for the presence or absence of statistical independence with good power and is a trivial side-product of the computation of the maximal information coefficient.

The intuition behind the total information coefficient is that while estimating MIC$^*$ has many advantages, this estimation involves taking a maximum over many estimates of entries of the characteristic or equicharacteristic matrix. Since the maximum of a set of random variables tends to become large as the number of variables grows, one can imagine that this procedure can lead to an unwanted positive bias in the case of statistical independence, when the population characteristic matrix equals 0, and a consequent reduction in power against independence.

To circumvent this problem, the total information coefficient is not the maximum but the sum of the entries of the matrix. Since the sum is better behaved statistically than the maximum, we might expect it to have a smaller bias in the case of statistical independence and therefore better power. Stated alternatively, if our only goal is to distinguish any dependence at all from complete noise, then disregarding all of the sample characteristic/equicharacteristic matrix except for its maximal value throws away useful signal, and the total information coefficient avoids this by summing all the entries.

2.3.1 The statistic TIC$_e$

The version of the total information coefficient studied in this work is analogous to the statistic MIC$_e$ presented above in that it proceeds via summation not of the sample characteristic matrix $\widehat{M}$ but rather of the sample equicharacteristic matrix $\widehat{[M]}$.

Definition 2.8. Let $D \subset \mathbb{R}^2$ be a set of $n$ ordered pairs. Given a function $B : \mathbb{Z}^+ \to \mathbb{Z}^+$, we define $\mathrm{TIC}_{e,B}(D)$ to be
\[ \mathrm{TIC}_{e,B}(D) = \sum_{k\ell \le B(n)} \widehat{[M]}(D)_{k,\ell} \]
where $\widehat{[M]}(D)$ is the sample equicharacteristic matrix.

In Reshef et al. [2015a] it is proven that TIC$_e$ yields a consistent right-tailed independence test, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for $\varepsilon > 0$. As with MIC$_e$, there is an additional parameter $c$ that controls the coarseness of the discretization of the grid search when TIC$_e$ is computed. However, this does not affect the consistency of the corresponding independence test. See Reshef et al. [2015a] for more detail.

Table 1 lists the objects discussed in this section.
Object     Description                                          Defined in
MIC        Statistic for quantifying relationship strength      Reshef et al. [2011]
MIC$^*$    Population value of MIC                              Reshef et al. [2015a]
MIC$_e$    Estimator of MIC$^*$ via equicharacteristic matrix   Reshef et al. [2015a]
TIC$_e$    Statistic for testing for independence               Reshef et al. [2015a]

Table 1: Statistics and estimands related to the maximal and total information coefficients.
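To make the relationship between the sample equicharacteristic matrix, MIC$_e$, and TIC$_e$ concrete, the following sketch computes all three by brute force on a small sample. It is an illustration under stated assumptions, not the dynamic-programming algorithm of Reshef et al. [2015a]: it enumerates every placement of cut points (so it is only practical for small $n$), it has no coarseness parameter $c$, it builds equipartitions from ranks and assumes no tied values, and the default $\alpha = 0.6$ in $B(n) = n^{\alpha}$ is an arbitrary illustrative choice. All function names are hypothetical.

```python
import math
from bisect import bisect_right
from itertools import combinations


def plug_in_mi(col_labels, row_labels):
    """Plug-in mutual information (in nats) between two label sequences."""
    n = len(col_labels)
    joint, cols, rows = {}, {}, {}
    for c, r in zip(col_labels, row_labels):
        joint[(c, r)] = joint.get((c, r), 0) + 1
        cols[c] = cols.get(c, 0) + 1
        rows[r] = rows.get(r, 0) + 1
    return sum(m / n * math.log(m * n / (cols[c] * rows[r]))
               for (c, r), m in joint.items())


def equipartition(values, parts):
    """Assign each value to one of `parts` rank-based bins of near-equal size."""
    n = len(values)
    labels = [0] * n
    for rank, i in enumerate(sorted(range(n), key=values.__getitem__)):
        labels[i] = rank * parts // n
    return labels


def best_mi(free_values, fixed_labels, bins):
    """Maximize MI over partitions of the free axis into `bins` intervals.

    Merging intervals cannot increase plug-in MI, so searching partitions
    with exactly `bins` non-empty intervals also covers grids with fewer.
    """
    n = len(free_values)
    order = sorted(range(n), key=free_values.__getitem__)
    best = 0.0
    for cuts in combinations(range(1, n), bins - 1):
        labels = [0] * n
        for rank, i in enumerate(order):
            labels[i] = bisect_right(cuts, rank)
        best = max(best, plug_in_mi(labels, fixed_labels))
    return best


def equichar_entry(xs, ys, k, l):
    """Entry (k, l) of the sample equicharacteristic matrix (Definition 2.6)."""
    if k <= l:  # equipartition the y-axis into l rows, optimize k columns
        raw = best_mi(xs, equipartition(ys, l), k)
    else:       # equipartition the x-axis into k columns, optimize l rows
        raw = best_mi(ys, equipartition(xs, k), l)
    return raw / math.log(min(k, l))


def mic_e_and_tic_e(xs, ys, alpha=0.6):
    """Return (MIC_e, TIC_e) with B(n) = n**alpha; requires B(n) >= 4."""
    b = len(xs) ** alpha
    entries = [equichar_entry(xs, ys, k, l)
               for k in range(2, int(b // 2) + 1)
               for l in range(2, int(b // k) + 1)]
    return max(entries), sum(entries)


xs = [i / 20 for i in range(20)]
mic_e, tic_e = mic_e_and_tic_e(xs, xs)  # noiseless linear relationship
print(mic_e, tic_e)
```

The last function shows why either statistic is essentially free once the other has been computed: both are reductions (a maximum and a sum) over the same list of matrix entries.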
Equitability is a property of measures of dependence introduced in Reshef et al. [2011] and formalized in Reshef et al. [2015b] that is particularly useful in the context of data exploration. Because this paper analyzes the equitability of several leading measures of dependence, we first present here a review of the basic definitions of, and results about, equitability from Reshef et al. [2015b].

There are two different ways to view equitability, each with its corresponding intuition. The first states roughly that an equitable measure of dependence “give[s] similar scores to equally noisy relationships of different types” [Reshef et al., 2011]. In this viewpoint, a highly equitable measure of dependence allows us notionally to find the “strongest $K$” relationships in our data set for any $K$.

The second view of equitability is based on statistical power: an equitable measure of dependence provides good tests for distinguishing between relationships with different, potentially non-zero amounts of noise. In other words, instead of yielding tests that only reject a null hypothesis of independence (i.e., “relationship strength = 0”), an equitable measure of dependence yields tests for rejecting null hypotheses of the form “relationship strength $\le x_0$” for all possible $x_0$. That is, a highly equitable measure of dependence allows us to find with high power all the relationships in our data set with “strength at least $x_0$” for any $x_0$.

These two viewpoints are formalized and shown to be equivalent in Reshef et al. [2015b]. We now summarize those formalizations as well as their equivalence, together with some examples and intuition.

Let $\hat{\varphi}$ be some statistic. To be able to talk rigorously about the equitability of $\hat{\varphi}$, we must specify two things: a set $Q$ of distributions on which we can state what we mean by relationship strength, and a corresponding function $\Phi : Q \to [0, 1]$ that computes that strength. The set $Q$ is called the set of standard relationships and the function $\Phi$ is called the property of interest.

A natural setting to keep in mind is that $Q$ is some diverse set of functional relationships with noise added and $\Phi$ is $R^2$, i.e., the coefficient of determination with respect to the generating function. We return to this example often as a way to build intuition.

We can now define equitability in terms of power against a broad class of null hypotheses.

Definition 3.1. Let $\hat{\varphi}$ be a statistic, let $Q$ be a set of standard relationships, let $\Phi : Q \to [0, 1]$, and let $0 \le \alpha \le 1/2$. The statistic $\hat{\varphi}$ is $1/d$-equitable with respect to $\Phi$ with confidence $1 - \alpha$ if and only if for every $x_0, x_1 \in [0, 1]$ satisfying $x_1 - x_0 > d$, there exists a right-tailed level-$\alpha$ test based on $\hat{\varphi}$ that can distinguish between $H_0 : \Phi(Z) \le x_0$ and $H_1 : \Phi(Z) \ge x_1$ with power at least $1 - \alpha$.

The smaller $d$ is the better, and consequently the best equitability that can be achieved is when $d = 0$, and the statistic in question is $\infty$-equitable. This is called perfect equitability, and is generally discussed as a property of the population value of a statistic.

This definition of equitability is illustrated schematically in Figure 1. It implies that when $\Phi$ is 0 precisely in cases of statistical independence, equitability can be viewed as a generalization of power against statistical independence on $Q$. Specifically, when we set $x_0 = 0$, a statistic being $1/d$-equitable means that the statistic yields a test that has good power against independence on any alternative hypothesis as extreme as or more extreme than $H_1 : \Phi = d$. In general, the definition says that a $1/d$-equitable statistic allows us to, given some threshold $x_0$ of relationship strength as measured by $\Phi$, successfully identify all the relationships in a data set with strength greater than $x_0 + d$. This may be important if our data set has many weak relationships and a smaller number of strong relationships that we would like to find.

As the formalization just presented makes clear, an analysis of equitability must differ from conventional analyses of power against independence in two ways. First, statistical independence represents only one null hypothesis, in contrast to the many null hypotheses against which equitability requires good power. Second, since in the setting of equitability the model $Q$ will contain multiple distinct classes of relationship types (e.g., linear, exponential, etc.), the null and alternative hypotheses that must be analyzed are composite.
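The composite nature of these hypotheses can be explored numerically. The sketch below estimates, by Monte Carlo, the worst-case power of a right-tailed test that separates $\Phi \le x_0$ from $\Phi \ge x_1$ in the sense of Definition 3.1. Everything specific in it is an assumption chosen for illustration: the set of standard relationships (a linear and a parabolic function of a uniform $X$ with Gaussian noise), the property of interest $\Phi = R^2$, the statistic (the squared sample Pearson correlation), and the shortcut of calibrating the critical value only at the boundary $\Phi = x_0$, which presumes that the boundary is the least favorable null.

```python
import math
import random

# name -> (f, Var f(X)) for X ~ Uniform[0, 1]; an illustrative choice of Q.
RELATIONSHIPS = {
    "linear": (lambda x: x, 1 / 12),
    "parabolic": (lambda x: 4 * (x - 0.5) ** 2, 4 / 45),
}


def squared_pearson(xs, ys):
    """Squared sample Pearson correlation, standing in for the statistic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def simulate_statistic(kind, r2, n, trials, rng):
    """Sorted simulated values of the statistic at a given R^2 (0 < r2 <= 1)."""
    f, var_f = RELATIONSHIPS[kind]
    # Noise level for which Var f / (Var f + sigma^2) equals the requested R^2.
    sigma = math.sqrt(var_f * (1 - r2) / r2)
    values = []
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        values.append(squared_pearson(xs, ys))
    return sorted(values)


def worst_case_power(kinds, x0, x1, alpha=0.05, n=100, trials=200, seed=0):
    """Estimated power of a right-tailed test of Phi <= x0 against Phi >= x1."""
    rng = random.Random(seed)
    index = math.ceil((1 - alpha) * trials) - 1
    # Critical value: largest empirical (1 - alpha) quantile over all types.
    critical = max(simulate_statistic(k, x0, n, trials, rng)[index]
                   for k in kinds)
    # Power: smallest rejection rate over all types at Phi = x1.
    return min(
        sum(v > critical for v in simulate_statistic(k, x1, n, trials, rng))
        / trials
        for k in kinds
    )


print(worst_case_power(["linear"], 0.3, 0.6))
print(worst_case_power(["linear", "parabolic"], 0.3, 0.6))
```

Under these assumptions, the test should be powerful when $Q$ contains only linear relationships and nearly powerless once the parabolic type is added, because the correlation of a symmetric parabola stays near zero whatever its $R^2$. This is the sense in which equitability asks for power against composite hypotheses rather than against a single null.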
In addition to the view that defines equitability in terms of power, we can take an alternative approach that directly formalizes the intuition that an equitable statistic assigns similar scores to equally noisy relationships of different types. To do so, we must define two concepts, reliability and interpretability, which invoke acceptance regions and interval estimates, respectively. For clarity of exposition, we avoid using the term “equitability” in the following, since we have already defined it. However, what we describe here as “worst-case interpretability” will turn out to be equivalent to equitability.

We begin with the definition of reliability.
Definition 3.2 (Reshef et al. [2015b]). Let $\hat{\varphi} : \mathbb{R}^{2n} \to [0, 1]$ be a statistic, and let $x, \alpha \in [0, 1]$. The $\alpha$-reliable interval of $\hat{\varphi}$ at $x$, denoted by $R^{\hat{\varphi}}_{\alpha}(x)$, is the smallest closed interval $A$ with the property that, for all $Z \in Q$ with $\Phi(Z) = x$,
\[ P(\hat{\varphi}(D) < \min A) < \alpha/2 \quad \text{and} \quad P(\hat{\varphi}(D) > \max A) < \alpha/2, \]
where $D$ is a sample of size $n$ from $Z$.

The statistic $\hat{\varphi}$ is $1/d$-reliable with respect to $\Phi$ on $Q$ at $x$ with probability $1 - \alpha$ if and only if the diameter of $R^{\hat{\varphi}}_{\alpha}(x)$ is at most $d$.

The reliable interval at $x$ is an acceptance region for a size-$\alpha$ test of the null hypothesis $H_0 : \Phi = x$. This is a convex hull of central intervals of the sampling distributions of $\hat{\varphi}$ over all distributions $Z \in \Phi^{-1}(\{x\})$. If there is only one $Z$ such that $\Phi(Z) = x$, then the reliable interval is simply a central interval of the sampling distribution of $\hat{\varphi}$ on $Z$.

[Figure 1: three panels (a), (b), (c), with linear and parabolic example relationships; axes $H_0 : \Phi = x_0$ and $H_1 : \Phi = x_1$, each in $[0, 1]$; power shown as a curve in (a) and as a heat map in (b) and (c) (red = powerful); $\alpha = 0.05$.]

Figure 1: Equitability as a generalization of power against independence. (a) The power function of a size-$\alpha$ right-tailed test based on a statistic $\hat{\varphi}$ with null hypothesis $H_0 : \Phi = 0.3$. The curve shows the power of the test as a function of $x_1$, the value of $\Phi$ in the alternative hypothesis. (b) The power function can be depicted instead as a heat map. (c) Instead of considering just one null hypothesis/critical value, we can consider a set of null hypotheses (with corresponding critical values) of the form $H_0 : \Phi = x_0$ and plot each of the resulting power curves as a heat map. The result is a plot in which the intensity of the color in the coordinate $(x_1, x_0)$ corresponds to the power of a size-$\alpha$ right-tailed test based on $\hat{\varphi}$ at distinguishing $H_0 : \Phi = x_0$ from $H_1 : \Phi = x_1$. A $1/d$-equitable statistic is one for which this power surface attains the value $1 - \alpha$ within distance $d$ of the diagonal along each row.

Footnote: We deviate here from Reshef et al. [2015b] in that we use the term “equitability” for arbitrary properties of interest $\Phi$, rather than using “interpretability” in general and reserving “equitability” for cases in which $\Phi$ specifically reflects some notion of relationship strength. We do this because in this paper $\Phi$ always reflects a notion of relationship strength. However, we note that the concepts and tools here can be readily applied even if this is not the case.

Figures 2a and 2b show schematic illustrations of reliable intervals in the case where $Q$ is a set of noisy functional relationships, $\Phi = R^2$, and $\hat{\varphi}$ is the sample Pearson correlation coefficient. In Figure 2a, the set $Q$ contains only one relationship type: linear. Consequently, each possible value of $R^2$ has only one distribution $Z \in Q$ with that $R^2$. In this case, the reliable interval at that $R^2$ value is simply a central interval of the sampling distribution of the sample correlation. In Figure 2b, the set $Q$ contains not one but three relationship types: linear, exponential, and parabolic. This means that at every $R^2$ value there are three different distributions $Z \in Q$ with that $R^2$, and consequently three different sampling distributions of the sample correlation.
In this setting, the reliable interval at that $R^2$ value is the smallest interval that contains the union of the central intervals of those three sampling distributions.

Having defined the reliable interval as an acceptance region, we can now define the interpretable interval as an interval estimate of $\Phi$.

Definition 3.3 (Reshef et al. [2015b]). Let $\hat{\varphi} : \mathbb{R}^{2n} \to [0, 1]$ be a statistic, and let $y, \alpha \in [0, 1]$. The $\alpha$-interpretable interval of $\hat{\varphi}$ at $y$, denoted by $I^{\hat{\varphi}}_{\alpha}(y)$, is the smallest closed interval containing the set
\[ \left\{ x \in [0, 1] : y \in R^{\hat{\varphi}}_{\alpha}(x) \right\}. \]
The statistic $\hat{\varphi}$ is $1/d$-interpretable with respect to $\Phi$ on $Q$ at $y$ with confidence $1 - \alpha$ if and only if the diameter of $I^{\hat{\varphi}}_{\alpha}(y)$ is at most $d$.

Figure 2c shows schematic illustrations of two different interpretable intervals in the setting discussed above, in which $Q$ is a set of noisy functional relationships with three different function types (linear, exponential, parabolic), $\Phi = R^2$, and $\hat{\varphi}$ is the sample Pearson correlation coefficient.

[Figure 2: three panels (a), (b), (c), each plotting $\hat{\varphi}$ against $\Phi$ (e.g., $R^2$).]

Figure 2: A schematic illustration of interpretability/equitability with three relationship types: linear (blue), exponential (red), and parabolic (yellow). Here the property of interest ($\Phi$) is $R^2$ and the statistic in question ($\hat{\varphi}$) is the sample Pearson correlation coefficient $\hat{\rho}$. (a) A plot of central intervals of the sampling distributions of $\hat{\varphi} = \hat{\rho}$ against $R^2(Z)$ for $Z \in Q$, when $Q$ consists only of linear relationships with varying amounts of added noise; one reliable interval is pictured. Since there is exactly one relationship in $Q$ corresponding to each $R^2$ value, the reliable interval is simply a central interval of the relevant sampling distribution. (b) The analogous plot in the case where $Q$ contains noisy functional relationships ranging over three different functions: linear (blue), exponential (red), and parabolic (yellow). Now the reliable interval is the smallest interval containing all three of the relevant central intervals. (c) The same plot, with interpretable intervals pictured. The interpretable interval at each value of $\hat{\rho}$ is composed of the $R^2$ values whose reliable intervals contain that value of $\hat{\rho}$. The shorter the interpretable intervals, the more interpretable/equitable the statistic. The worst-case interpretable interval is denoted by a solid red line; an additional interpretable interval is shown with a dashed red line. The thumbnails to the right of each interval show representative relationships from the endpoints of that interval, both of which have the same $\hat{\rho}$ but dramatically different values of $R^2$.

When we are discussing the interpretability or reliability of a statistic, we need to speak about more than one $x$ or $y$ value at a time. There are many potential ways to do this. Here we limit ourselves to two basic ones.

Definition 3.4 (Reshef et al. [2015b]). A measure of dependence is worst-case $1/d$-reliable (resp. interpretable) if it is $1/d$-reliable (resp. interpretable) at all $x$ (resp. $y$) $\in [0, 1]$. A measure of dependence is average-case $1/d$-reliable (resp. interpretable) if its reliability (resp. interpretability), averaged over all $x$ (resp. $y$) $\in [0, 1]$, is at least $1/d$.

Here and throughout, we use “worst-case” to refer to the worst-seen performance, as opposed to a proven bound, and we use “interpretability” with no qualifier to refer to worst-case interpretability.

To gain some intuition for the definition of interpretability, let us consider what values $d$ can take. The lowest possible interpretability happens when one of the interpretable intervals has size 1. In this case, the (worst-case) interpretability of the statistic is 1 as well. In the best case, when all interpretable intervals of a statistic are of size 0, the interpretability is $\infty$, and the statistic is said to be perfectly interpretable. (As before, the perfect case is only expected to arise, if at all, as a property of the population value of the statistic.)

To complete our example, let us find the worst-case interpretability of the sample correlation coefficient in the example of noisy functional relationships depicted in Figure 2c. To do this, we locate the widest interpretable interval in the figure; this happens to be the lower of the two intervals pictured. If the length of this interval is $d$, the sample Pearson correlation coefficient is worst-case $1/d$-interpretable with respect to $R^2$ on our set $Q$. Thus, the shorter the interpretable intervals, the more interpretable the statistic.

It turns out that equitability and worst-case interpretability as defined above are equivalent under modest assumptions [Reshef et al., 2015b]. We state this result below.
Theorem 3.5 (Reshef et al. [2015b]). Let Q be a set of standard relationships, let Φ : Q → [0, 1], and let 0 < α < 1/2. Let $\hat{\varphi}$ be a statistic with the property that $\max R^{\hat{\varphi}}_{\alpha}(x)$ is a strictly increasing function of x. Then for all d > 0, the following are equivalent.
1. $\hat{\varphi}$ is $1/d$-equitable with respect to Φ with confidence 1 − α.
2. $\hat{\varphi}$ is worst-case $1/d$-interpretable with respect to Φ with confidence 1 − α.

This result can be interpreted in two ways. One interpretation is that a statistic that allows us to approximately rank the relationships in a data set by strength as measured by Φ will also allow us, for any x, to find all the relationships in the data set that have strength at least x as measured by Φ, and vice versa. Another interpretation arises if Φ reflects relationship strength, in particular if Φ = 0 corresponds to the relationships in Q exhibiting statistical independence. If this is the case, then the above theorem tells us that equitability is a generalization of power against statistical independence on Q.

This is good news and bad news. On the one hand, it provides a link between equitability and power and clarifies the relationship between the two. On the other hand, it shows that equitability, by virtue of being stronger than power against independence, will also be more difficult to achieve, as it requires simultaneously attaining power against a much larger set of null hypotheses. This hints at a trade-off between equitability and power against independence for which we provide empirical evidence in Section 6.

So far we have discussed equitability in general, conceptual terms, and it has many different concrete interpretations depending on the choice of Φ and Q. We define here a concrete instantiation of equitability on functional relationships that is used throughout this paper. To do this, we first must state what we mean by "functional relationship".

Definition 3.6 (Reshef et al. [2015b]). A random variable distributed over $\mathbb{R}^2$ is called a noisy functional relationship if and only if it can be written in the form $(X + \varepsilon, f(X) + \varepsilon')$ where $f : [0, 1] \to \mathbb{R}$, X is a random variable distributed over [0, 1], and $\varepsilon$ and $\varepsilon'$ are (possibly trivial) random variables. We denote the set of all noisy functional relationships by F.

Equitability on functional relationships in the sense of Reshef et al. [2011] and Reshef et al. [2015b] now just amounts to the use of $R^2$ as the property of interest.

Definition 3.7 (Reshef et al. [2015b]). Let Q ⊂ F be a set of noisy functional relationships. A measure of dependence is worst-case (resp. average-case) $1/d$-equitable on Q if it is worst-case (resp. average-case) $1/d$-equitable with respect to $R^2$ on Q.

In this paper we often abuse terminology by simply writing "equitability" to mean equitability with respect to $R^2$ on various sets of functional relationships as defined above. Alternative definitions of this concept with other sets Q and functions Φ have been proposed. These are discussed in detail in Reshef et al. [2015b].

Using the framework reviewed here, Figure 3a demonstrates how one might analyze the equitability of a statistic in practice from the standpoint of interpretable intervals. We take as an example the sample Pearson correlation coefficient ($\hat{\rho}$). This statistic is not a measure of dependence in the sense that its population value can be zero even in cases of non-trivial dependence. However, we analyze it here due to its widespread familiarity and the intuitiveness of its scores.

In this example, as before, our property of interest will be Φ = $R^2$.
The set of standard relationships Q will be a set of noisy functional relationships of the form $(X + \varepsilon, f(X) + \varepsilon'_\sigma)$ with $\varepsilon = 0$, $\varepsilon'_\sigma \sim \mathcal{N}(0, \sigma^2)$, and f ranging over the functions in Table A.1.

To analyze the equitability of $\hat{\rho}$, we generate, for 41 different noise levels σ and for every function f in our set, 500 samples from the relationship $Z = (X, f(X) + \varepsilon'_\sigma)$ with a sample size of n = 500. Using these, we estimate the 5th and 95th percentiles of the sampling distribution of $\hat{\rho}$ on Z. These allow us to estimate the reliable interval at the value of $R^2$ corresponding to each noise level. The reliable intervals then enable us to construct interpretable intervals, and our estimate of the equitability is then the reciprocal of the length of the longest interpretable interval.

Figure 3: Examples of equitable and non-equitable behavior on a set of noisy functional relationships. (Reproduced from Reshef et al. [2015b].) (a) The equitability with respect to $R^2$ of the sample Pearson correlation coefficient $\hat{\rho}$ over the set Q of relationships described in Section 3.5, with n = 500. Each shaded region is an estimated 90% central interval of the sampling distribution of $\hat{\rho}$ for a given relationship at a given noise level. The fact that the interpretable intervals of $\hat{\rho}$ are large indicates that a given $\hat{\rho}$ value could correspond to relationships with very different $R^2$ values. This is illustrated by the pairs of thumbnails corresponding to relationships with the same $\hat{\rho}$ but different $R^2$ values. The largest interpretable interval is indicated by a red line. Because it has width 1, the worst-case equitability with respect to $R^2$ in this case is 1, the lowest possible. (b) An illustration of a hypothetical measure of dependence that achieves perfect equitability in the large-sample limit. Here, the population quantity $\varphi$ depends only on the $R^2$ of the relationships and increases monotonically with $R^2$. Thus, $\varphi$ can be used as a proxy for $R^2$ on Q with no loss. Thumbnails are shown for sample relationships that receive the same $\varphi$ score, which corresponds to the fact that they have equal $R^2$ scores.

The fact that the interpretable intervals at many values of $\hat{\rho}$ are large indicates that a given value of $\hat{\rho}$ could correspond to samples from relationships of different types that have very different $R^2$ values. This is illustrated by the pairs of thumbnails corresponding to relationships that received the same $\hat{\rho}$ but have different amounts of noise. This means that $\hat{\rho}$ is not very interpretable with respect to $R^2$ on this set Q and is thus said to have poor equitability with respect to $R^2$ on Q.
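The interval-based procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the two example functions, the reduced grid of noise levels, and the choice of noise standard deviation (picked so that the population squared correlation between f(X) and Y equals the target $R^2$, with noise only in the dependent variable) are all assumptions made for the sketch.

```python
import numpy as np

def simulate_pearson(funcs, r2_levels, n=200, reps=100, seed=0):
    """Sample the Pearson correlation for each (R^2 level, function) pair.

    Assumption: X is evenly spaced on [0, 1] and Gaussian noise is added to Y
    only, with sigma chosen so that Var(f) / (Var(f) + sigma^2) = R^2.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    out = np.empty((len(r2_levels), len(funcs), reps))
    for i, r2 in enumerate(r2_levels):
        for j, f in enumerate(funcs):
            fx = f(x)
            sigma = np.sqrt(np.var(fx) * (1.0 / r2 - 1.0))
            for k in range(reps):
                y = fx + rng.normal(0.0, sigma, size=n)
                out[i, j, k] = np.corrcoef(x, y)[0, 1]
    return out

def reliable_intervals(stat_samples, alpha=0.05):
    """Smallest interval containing every function's central interval.

    stat_samples has shape (n_levels, n_functions, n_reps).
    """
    lo = np.percentile(stat_samples, 100 * alpha, axis=2).min(axis=1)
    hi = np.percentile(stat_samples, 100 * (1 - alpha), axis=2).max(axis=1)
    return lo, hi

def worst_case_interpretable_width(r2_levels, lo, hi, y_grid):
    """Width of the widest interpretable interval over a grid of statistic values."""
    widths = []
    for y in y_grid:
        inside = (lo <= y) & (y <= hi)  # R^2 levels whose reliable interval contains y
        if inside.any():
            widths.append(r2_levels[inside].max() - r2_levels[inside].min())
    return max(widths, default=0.0)

# Hypothetical two-function set: a line and a parabola.
funcs = [lambda t: t, lambda t: 4.0 * (t - 0.5) ** 2]
levels = np.linspace(0.05, 1.0, 20)
samples = simulate_pearson(funcs, levels)
lo, hi = reliable_intervals(samples)
width = worst_case_interpretable_width(levels, lo, hi, np.linspace(-1.0, 1.0, 201))
equitability = np.inf if width == 0 else 1.0 / width
```

Because the parabola yields a sample correlation near zero at every noise level while the line does not, the widest interpretable interval spans almost the whole range of tested $R^2$ values, so the estimated worst-case equitability is close to the minimum value of 1.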
As a contrast, Figure 3b contains a hypothetical illustration of the notion of perfect equitability, which would require that all the interpretable intervals be of size 0.

Of course, equitability is a function not only of the method in question but also of the standard relationships and the property of interest. For instance, while $\hat{\rho}$ has poor equitability with respect to $R^2$ on the Q above, it is (trivially) asymptotically perfectly equitable with respect to the correlation on the set Q of bivariate normals.

Having reviewed equitability and how to quantify it, we turn to evaluating the equitability of MIC$_e$ and several other leading measures of dependence. We begin by quantifying the equitability of each measure of dependence using interpretable intervals. This is followed by an alternate visualization of the equitability of each measure of dependence using conventional power analysis via the connection described in the previous section.

4.1 Setting up the analysis

The set of existing measures of dependence is too large for us to analyze exhaustively, even in a paper that aims to be comprehensive. We therefore strive to include in our analysis a collection of methods that is representative of the broad approaches prevalent in the field today.
Grid-based methods
The methods based on the maximal information coefficient and the total information coefficient can be viewed as exploring the space of possible grids that can be drawn on the sampled data, assigning a score to each grid via some metric, and then aggregating the scores. For MIC [Reshef et al., 2011], the metric is a normalized mutual information score and the aggregation is a supremum. MIC$_e$ [Reshef et al., 2015a] is similar except it explores a more restricted set of grids. TIC$_e$ [Reshef et al., 2015a] is like MIC$_e$ except it aggregates by summation.

We also include other recent grid-based methods introduced since the maximal information coefficient [Reshef et al., 2011]. HHG [Heller et al., 2013] uses Pearson's $\chi^2$ test statistic as its score, explores a set of two-by-two grids defined by individual data points, and aggregates by summation. Though similar to Hoeffding's D [Hoeffding, 1948] in that it considers only two-by-two grids, it differs in the use of the $\chi^2$ test statistic. $S^{DDP}$ [Heller et al., 2014] explores a larger set of grids defined by subsets of the data points, uses non-normalized mutual information as its score, and also aggregates by summation. Another notable grid-based method introduced recently is dynamic slicing [Jiang et al., 2014], which like MIC explores all possible grids and aggregates by maximization, but uses as its score a version of mutual information that is regularized according to a prior on the space of possible grids. We did not include dynamic slicing in our comparison, however, because it is formulated only for performing a k-sample test whereas our focus here is on measuring dependence between two continuous random variables.

Mutual information estimation
Since many of the grid-based methods we consider either use some form of mutual information as their score or have variants that do, we also included a standard mutual information estimator introduced by Kraskov [Kraskov et al., 2004]. This estimator was compared against MIC in previous work [Reshef et al., 2011, 2013; Kinney and Atwal, 2014; Reshef et al., 2014], but those comparisons were more limited in scope and did not include MIC$_e$. (For convenience, in this work we represent the estimated mutual information values in terms of the squared Linfoot correlation [Speed, 2011; Linfoot, 1957], defined by $L^2(X, Y) = 1 - 2^{-2I(X,Y)}$, which takes values in [0, 1].)

Distance/kernel-based statistics
We include distance correlation (dCor) [Szekely and Rizzo, 2009], an analogue of the Pearson correlation coefficient that is defined using a different notion of covariance that uses pairwise distances between points. In addition, we include the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005, 2008], a more general statistic defined on reproducing kernel Hilbert spaces of which dCor is a special case [Sejdinovic et al., 2013].
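The squared Linfoot correlation used above to report mutual information values can be illustrated with a short sketch. It assumes mutual information measured in bits, so that $L^2 = 1 - 2^{-2I}$; the function names are hypothetical. For a bivariate normal with correlation ρ, the mutual information is $-\frac{1}{2}\log_2(1 - \rho^2)$ bits, so the transform recovers $\rho^2$, which is what makes it a convenient [0, 1] scale.

```python
import math

def squared_linfoot(mi_bits):
    """Map a mutual information value in bits to the squared Linfoot correlation."""
    return 1.0 - 2.0 ** (-2.0 * mi_bits)

def gaussian_mi_bits(rho):
    """Mutual information (bits) of a bivariate normal with correlation rho."""
    return -0.5 * math.log2(1.0 - rho ** 2)

rho = 0.8
l2 = squared_linfoot(gaussian_mi_bits(rho))  # equals rho**2 up to rounding
```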
Correlation-based methods
As an intuitive benchmark for the reader, we include the Pearson correlation coefficient ($\rho$). However, there are many successful tools that use $\rho$ after computing a non-linear transformation of the data. We include perhaps the best-known one, maximal correlation [Rényi, 1959], which given random variables X and Y searches for arbitrary measurable functions f and g such that $\rho(f(X), g(Y))$ is maximized. There is no known algorithm for finding the optimal f and g in general, but the (approximate) method of alternating conditional expectations [Breiman and Friedman, 1985] is widely used and we use it here as well. We also include a more recent related method, the randomized dependence coefficient [Lopez-Paz et al., 2013], which applies many random transformations to X and Y and then searches for the linear combinations of the transformed features that maximize the correlation.

Footnote: There are other variations on the HHG and $S^{DDP}$ statistics presented in Heller et al. [2013, 2014]. However, we omit those results as they were generally similar to or worse than the ones we display.

4.1.2 Choice of Q, Φ, and sample sizes

In an ideal world, when assessing equitability in a specific instance, we would know the true underlying model Q governing the relationships in our data set. Knowledge of Q would, for example, include information about the types of relationships present and the noise distribution (e.g., Gaussian, zero-mean, heteroscedastic, etc.). Of course, in reality we generally do not have this information and, to make matters worse, the results of an equitability analysis may depend strongly on the choice of Q.
Thus, in evaluating the equitability of measures of dependence, it is important to aim for robustness: we would like to have a measure of dependence with good equitability over as many different relationship types as possible.

However, there is a central tension between the need to use as large a set Q as possible in order to assess robustness and the need to use a Q that is sufficiently small that a reasonable property of interest Φ can be defined for the relationships in Q. To take an extreme example, setting Q to be the set of all bivariate relationships would certainly ensure that we do not leave any stone unturned, but at the same time it begs the original question of how one can measure relationship strength in such a general context.

For this reason, following Reshef et al. [2011], we choose to focus on noisy functional relationships since these represent a broad, easily definable class of relationships commonly found in practical applications that comes with an intuitive and natural measure of relationship strength: $R^2$, the coefficient of determination with respect to the generating function. To ensure robustness, we vary the relationships tested along as many dimensions as possible including relationship type, the type of noise added, marginal distributions, and sample size.

We note here that our goal in this analysis is not to establish the equitability of any method across the entire set of noisy functional relationships. In fact, under some of the sampling/noise models we considered, there are functions whose inclusion leads to poor equitability across all methods.
We therefore attempted to characterize as broad a set of functions as possible that still allowed for non-trivial equitability.

To that end, our analyses include some 16-21 different functional relationships (depending on noise model; see Appendix A.1), each with increasing levels of additive Gaussian noise, considered under twelve different sampling/noise models, at four sample size regimes (n = 250, 500, 5000, and the infinite data limit). Each of the 12 sampling/noise models Q is defined using a combination of an independent variable marginal distribution from the set

    points sampled evenly along the curve described by f(X) ($E_{f(X)}$)
    points sampled evenly along the X range ($E_X$)
    points sampled uniformly along the curve described by f(X) ($U_{f(X)}$)
    points sampled uniformly along the X range ($U_X$)

and a noise distribution from the set

    normally distributed noise added to the dependent variable ($N_y$)
    normally distributed noise added to both variables ($N_x$, $N_y$)
    normally distributed noise added to the independent variable ($N_x$).

We refer to these noise models using abbreviations of the form $E_{f(X)}[N_y]$, which would correspond to a model in which the independent variable is sampled evenly along the curve described by f(X) and Gaussian noise is added only to the dependent coordinate. Appendix A.1 contains definitions of the functions used.

For each Q, for each sample size n, we examine 41 different $R^2$ values evenly spaced in the unit interval. At each of these $R^2$ values, we generate 500 independent realizations of a sample of size n from each relationship in Q with the given $R^2$ value. These are used to estimate sampling distributions for $\hat{\varphi}$. (See Appendix A.2 for details regarding data generation.)

Several of the methods tested are parametrized, including MIC$_e$, HSIC, the Kraskov mutual information estimator, RDC, and $S^{DDP}$.
For each of these methods, we performed a parameter sweep to assess the effect of parameter settings on equitability. In most cases, we found that parameter settings did not significantly affect equitability and so we present here results obtained with default parameters. For MIC$_e$ and the Kraskov mutual information estimator, however, parameter settings did affect equitability. Therefore, for these methods, we present for each sample size the best results across parameter values tested. Results for all parameter values tested can be found in the online supplement at . In Section 6.2 we discuss guidelines for how to set parameters for MIC$_e$ more generally.

The equitability of each measure of dependence is quantified using interpretable intervals, as discussed in Section 3.5. In the equitability plots presented, shaded regions denote central intervals containing 90% probability mass of the sampling distribution of each measure of dependence at each $R^2$ value; these reliable intervals correspond to 0 . − interpretable intervals. In general, we report both average-case and worst-case equitability in our analyses, and the interval plotted in red on each plot represents the worst-case 0 . − interpretable interval for that plot. (The shorter the interval, the more equitable the statistic.)

Figures 4 and B.1 demonstrate the equitability of MIC$_e$, distance correlation, maximal correlation, HSIC, the Kraskov mutual information estimator, RDC, and $S^{DDP}$ for noise models $E_{f(X)}[N_x, N_y]$ and $E_{f(X)}[N_y]$ at a range of sample sizes. Results for all other noise models are presented in the supplemental materials, along with results for TIC$_e$, HHG, and $\rho$. Tables B.1 and B.2 summarize the worst-case and average-case equitability, respectively, for all measures of dependence across all models and sample sizes, as measured by 0 . − interpretability intervals.

We offer here some discussion of the salient questions answered by these analyses.
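The four independent-variable marginal distributions listed earlier in this section differ in whether points are spread along the X range or along the curve itself. The sketch below shows one way this distinction could be implemented; it is an illustration under stated assumptions, not the procedure of Appendix A.2. In particular, the function name `sample_x`, the scheme labels, and the use of unscaled Euclidean arc length on a fine grid are all assumptions.

```python
import numpy as np

def sample_x(f, n, scheme, rng, grid_size=10_000):
    """Draw n values of the independent variable on [0, 1].

    scheme: "E_X"  evenly spaced in X          "U_X"  uniform in X
            "E_fX" evenly spaced along curve   "U_fX" uniform along curve
    Arc length is approximated by a polyline on a fine grid (an assumption).
    """
    if scheme == "E_X":
        return np.linspace(0.0, 1.0, n)
    if scheme == "U_X":
        return np.sort(rng.uniform(0.0, 1.0, n))
    t = np.linspace(0.0, 1.0, grid_size)
    seg = np.hypot(np.diff(t), np.diff(f(t)))       # polyline segment lengths
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    arc /= arc[-1]                                   # normalized arc length in [0, 1]
    if scheme == "E_fX":
        u = np.linspace(0.0, 1.0, n)
    elif scheme == "U_fX":
        u = np.sort(rng.uniform(0.0, 1.0, n))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return np.interp(u, arc, t)                      # invert arc length back to X

rng = np.random.default_rng(0)
x_lin = sample_x(lambda t: t, 11, "E_fX", rng)              # line: same as even spacing in X
x_steep = sample_x(lambda t: 10.0 * t ** 2, 101, "E_fX", rng)  # points concentrate where f is steep
```

For a straight line the two "even" schemes coincide, whereas for a steep curve sampling along the curve places more points where the function changes quickly, which is one reason results can differ across the marginal distributions tested.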
MIC$_e$ and mutual information

Given the connections between MIC$_e$ and mutual information, which are discussed in depth in Reshef et al. [2015a], it is natural to ask whether the direct estimation of mutual information achieves a similar level of equitability to that of MIC$_e$. In general, among the variety of models and sample sizes tested, the answer appears to be 'no', but we present a more detailed breakdown of the results below.

Effect of model choice on equitability
Figure 5, as well as Tables B.1 and B.2, demonstrate the relative robustness of the equitability of MIC$_e$ to the choice of model Q compared to that of the Kraskov mutual information estimator. At each sample size, the equitability of MIC$_e$ is fairly stable with respect to the variations in noise models and independent variable marginal distributions tested. On the other hand, while mutual information estimation sometimes has good equitability, it more often has poor equitability under the models tested. More specifically, mutual information estimation can be equitable in models that only contain noise added in the dependent coordinate, while MIC$_e$ performs equitably even outside this domain, such as in the case of models that include noise added to either or both the dependent and independent coordinates. The performance of mutual information estimation is also improved when the independent variable is stochastic rather than fixed, though this distinction never affects whether it outperforms MIC$_e$ or not.

Effect of sample size on equitability
Estimating mutual information from finite samples is a challenging problem that has inspired many non-trivial methods [Paninski, 2003; Moon et al., 1995; Kraskov et al., 2004], and Tables B.1 and B.2, as well as Figures 4, B.1, and 5, demonstrate the strong influence of finite-sample effects on the equitability of mutual information estimation. Consistent with the fact that MIC$_*$ is uniformly continuous while mutual information is not [Reshef et al., 2015a], estimation of MIC$_*$ suffers less from this problem: for n = 250 and n = 500, MIC$_e$ has both superior worst-case and average-case equitability over mutual information estimation (using k = 1, 6, 10, and 20 in the Kraskov estimator) in every model Q tested, and in most cases by substantial margins. For n = 5000, mutual information estimation has better equitability than MIC$_e$ in settings where there is only noise in the dependent variable, while MIC$_e$ has superior equitability in all other models tested. Aspects of this phenomenon have previously been noted in Reshef et al. [2013], and subsequently in Kinney and Atwal [2014], and Reshef et al. [2014].

Figure 4: The equitability of measures of dependence on a set Q of noisy functional relationships. [Narrower is more equitable.] The relationships take the form $(X + \varepsilon, f(X) + \varepsilon')$ where $\varepsilon$ and $\varepsilon'$ are i.i.d. normals of varying amplitude, and relationship strength is quantified by Φ = $R^2$. The plots were constructed as described in Figure 2. In each plot, the worst-case interpretable interval is indicated by a red line, and both the worst- and average-case equitability are listed. The fact that the worst-case interpretable intervals of MIC$_e$ are small indicates that a given MIC$_e$ score reflects the coefficient of determination ($R^2$) with respect to the generating function f with a relatively weak dependence on the function f in question. That is, MIC$_e$ has high equitability with respect to Φ = $R^2$ for this choice of Q. Mutual information, estimated using the Kraskov estimator, is represented using the squared Linfoot correlation. For every parametrized statistic whose parameter meaningfully affects equitability, results are presented at each sample size using parameter settings that maximize equitability across all twelve of the noise/marginal distributions tested at that sample size.

Figure 5: A comparison of the equitability of MIC$_e$ and mutual information estimation under three noise models including the one in Figure 4. [Narrower is more equitable.] Plots were constructed as in Figure 2. In each plot, the worst-case interpretable interval is indicated by a red line, and both the worst- and average-case equitability are listed. As in Figure 4, results for both statistics are presented for each sample size using parameter settings that maximize equitability across all twelve of the noise/marginal distributions tested at that sample size. Mutual information, estimated using the Kraskov estimator, is represented using the squared Linfoot correlation. While mutual information estimation using the Kraskov estimator is equitable at high sample size on some of the sets Q that were tested, the equitability of MIC$_e$ is more robust to noise model, independent variable marginal distribution, and limited sample size. For versions of this analysis using additional independent variable marginal distributions, see the supplemental materials.

Equitability in the large-sample limit
Departures from perfect equitability can occur either as a result of finite sample effects, or because of the lack of interpretability of the population value of the statistic. To disentangle these two potential effects, we compare the equitability of MIC$_*$ and the Kraskov mutual information estimator in the large-sample limit (Figure B.2). This analysis yields two important insights. First, it demonstrates that when finite sample effects are minimal, MIC$_*$ has both superior worst-case and average-case equitability in the four models Q that contain noise added in the independent variable or in both the independent and dependent variables, while mutual information is more equitable than MIC$_*$ in the two remaining settings, where noise is added only in the dependent variable. Second, more generally, it shows that neither MIC$_*$ nor mutual information is worst-case perfectly interpretable with respect to Φ = $R^2$ over the sets Q examined. This is not surprising given the broad range of relationships, noise models, and independent variable marginal distributions tested.

Relationship to equitability analysis from Kinney and Atwal [2014]
A more limited analysis of the equitability of MIC and mutual information estimation was presented in Kinney and Atwal [2014]. There, the authors examined the equitability of MIC and mutual information estimation specifically at a large sample size (n = 5000) and under one choice of Q ($E_{f(X)}[N_y]$). From this, they concluded that mutual information estimation was more equitable than MIC. As our analysis here shows, though that is true for this specific choice of Q and sample size, it is not true in general. To the contrary, the general picture seems to be that the equitability of estimators of MIC$_*$ is more robust than that of estimators of mutual information due to a combination of finite-sample effects and differences between the population values themselves.

For more on this discussion, see the technical comment [Reshef et al., 2014] published by the authors of this paper about Kinney and Atwal [2014]. For a discussion of the theoretical results of Kinney and Atwal [2014], see Reshef et al. [2015b] and Murrell et al. [2014].

$\rho$, dCor, maximal correlation, HSIC, RDC, TIC$_e$, HHG, and $S^{DDP}$
Figures 4 and B.1, as well as Tables B.1 and B.2, demonstrate that $\rho$, distance correlation, maximal correlation, HSIC, RDC, TIC$_e$, HHG, and $S^{DDP}$ all display relatively poor equitability over the models Q tested. (We note that these methods were not designed with equitability in mind and so do not make claims about equitability.) Of these methods, maximal correlation displays the highest degree of equitability. Additionally, the equitability profiles of both dCor and RDC are similar to that of the correlation $\rho$.

Figures 6 and B.3 quantify the equitability of the set of measures of dependence examined above via a power analysis. This is achieved as demonstrated in Figure 1. Analyses are presented for the same range of models and sample sizes examined in the equitability analysis performed using interpretable intervals, and results for all other models are presented in the supplemental materials.

Assessing equitability using statistical power analysis confirms the conclusions that are reached by the quantification of equitability using interpretable intervals above. That is, in this analysis, MIC$_e$ is the only measure of dependence that is able to distinguish any null hypothesis of the form $H_0 : R^2 = x_0$ from any alternative hypothesis of the form $H_1 : R^2 = x_1$ with high power across the full range of models Q and sample sizes examined, even when $x_1 - x_0$ is relatively small. As in the equitability analysis using interpretable intervals, the Kraskov mutual information estimator is not able to achieve this task for sample sizes tested lower than n = 5,000, and at n = 5,000 it is only able to do so for models that contain noise only in the dependent variable. This is true regardless of the choice of parameter used in the Kraskov estimator. (See Appendix B for results achieved using additional parameters.) Finally, as before we see that $\rho$, distance correlation, HSIC, RDC, HHG, and $S^{DDP}$ are highly non-equitable, with maximal correlation being the only other measure of dependence tested that displayed any degree of equitability.

In this analysis, in which Q is a set of noisy functional relationships and the property of interest is $R^2$, methods such as distance correlation and HSIC, which are traditionally considered to be well powered for detecting deviations from independence, do not yield tests that achieve high power, even in the case where the null hypothesis is statistical independence. This is due to the fact that even when we consider a null of independence, we have a composite alternative hypothesis due to the multiple different functional forms present in Q. This requires methods to yield tests that are highly powered at simultaneously detecting deviations from independence in all of the relationship types present in Q. The poor power displayed by tests based on distance correlation, HSIC, and RDC is due to the fact that, while they may be highly powered at detecting deviations from independence in, say, linear relationships, they are worse at simultaneously doing so for the more nonlinear relationships. Of course, when both the null and alternative hypotheses are allowed to take on non-zero values of $R^2$, the task of differentiating between the null and alternative becomes even harder as both the null and alternative are now composite, and correspondingly the performance of these methods suffers further.

In this section we analyzed the equitability with respect to $R^2$ of MIC$_e$ alongside several leading measures of dependence, on many different sets of relationships with varying sample sizes, noise types, and marginal distributions.
Our main finding is that in most (32 out of 36) of the settings we considered, MIC$_e$ is substantially more equitable than the other methods. In the remaining four settings, all of which had a sample size of n = 5,000 and no noise added in the independent variable, mutual information estimation using the Kraskov estimator outperformed MIC$_e$ by a small margin; however, the equitability of the Kraskov estimator at lower sample sizes or on other noise models is otherwise poor.

Figure 6: The equitability of measures of dependence on noisy functional relationships, visualized in terms of power. [Redder is more equitable.]
The set of noisy functional relationships analyzed is the same as in Figure 4, and relationship strength is again quantified by Φ = $R^2$. Plots were generated as in Figure 1. The intensity of the pixel in each heat map corresponding to the pair of values $x_0$ and $x_1$ shows the power of a right-tailed test based on the statistic in question at distinguishing the (composite) alternative hypothesis $H_1 : R^2 = x_1$ from the (composite) null hypothesis $H_0 : \Phi = x_0$ with type I error at most α = 0.05. An optimal statistic would yield tests with 100% power for every $x_1 > x_0$. MIC$_e$ comes closest to achieving this ideal, and performs particularly well relative to other methods at lower sample sizes. For each plot, the average area under the power curve across the entire set of null hypotheses is listed. (The maximum achievable such area is 0.5.) Mutual information, estimated using the Kraskov estimator, is represented using the squared Linfoot correlation. For every parametrized statistic whose parameter meaningfully affects equitability, results are presented at each sample size using parameter settings that maximize equitability across all twelve of the noise/marginal distributions tested at that sample size.
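A single heat-map entry of the kind described in this caption could be computed along the following lines. This is a hedged sketch, not the authors' implementation: the function name is hypothetical, the statistic values are synthetic Gaussian draws standing in for real sampling distributions, and the power of the composite test is taken to be the worst case over the relationships in the alternative, which is an assumption about how the composite alternative is handled.

```python
import numpy as np

def composite_power(null_stats, alt_stats, alpha=0.05):
    """Power of a right-tailed test with composite null and alternative.

    null_stats, alt_stats: arrays of shape (n_functions, n_reps) holding the
    statistic on samples with Phi = x0 (null) and Phi = x1 (alternative).
    The critical value is the largest (1 - alpha) quantile over null
    relationships, so type I error is at most alpha for each of them.
    Power is reported as the minimum over alternative relationships (assumed).
    """
    crit = np.quantile(null_stats, 1.0 - alpha, axis=1).max()
    return float((alt_stats > crit).mean(axis=1).min())

# Synthetic stand-in: three function types whose statistic is shifted slightly.
rng = np.random.default_rng(0)
shift = np.array([[0.0], [0.05], [0.1]])
null_stats = 0.2 + shift + rng.normal(0.0, 0.05, size=(3, 1000))
alt_far = 0.7 + shift + rng.normal(0.0, 0.05, size=(3, 1000))
alt_same = 0.2 + shift + rng.normal(0.0, 0.05, size=(3, 1000))

power_far = composite_power(null_stats, alt_far)    # well-separated hypotheses
power_same = composite_power(null_stats, alt_same)  # alternative equals the null
```

The example shows why dependence on function type hurts: when the statistic's distribution shifts with the function, the critical value is driven by the highest-scoring null relationship, and power against nearby alternatives collapses for the lowest-scoring one.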
As we show later, the equitability of MIC$_e$ does seem to come at a price. Specifically, though MIC$_e$ does, with certain parameter settings, yield tests with good power against independence (see Section 5), the settings that confer the equitability demonstrated above do not have this property. This suggests that there is an inherent trade-off in the statistic between power against independence and equitability, and in Section 6 we establish that this is indeed the case.

Interestingly, besides MIC$_e$ and the Kraskov estimator, the other method with non-trivial equitability with respect to $R^2$ in our experiments is maximal correlation as computed using the method of alternating conditional expectations (ACE). This is interesting because, on the one hand, one can show from its definition that the squared maximal correlation is bounded from below by $R^2$, and on the other hand the lack of equitability of maximal correlation in our experiments seems to stem from the ACE method returning results below this lower bound. We therefore wonder whether maximal correlation, were it computable exactly, would be highly equitable with respect to $R^2$.

The analyses presented in this section demonstrate that equitability with respect to $R^2$ is achievable to a significant extent, at least on the relationships tested here. However, while the noise models, marginal distributions, and functions used were chosen to be representative of real-world relationships, they by no means form a large enough set to allow us to make claims about the performance of these methods in general. Given this state of affairs, a better theoretical understanding of MIC$_e$ and also of equitability (with respect to $R^2$ and otherwise) is crucial for allowing us to determine when and to what extent equitability can be achieved. Though this is an ambitious goal, we feel it is important for guiding the development of methods for coping with the growing complexity of today's data sets.
It is our hope that the empirical insights presented here, together with the theory presented in Reshef et al. [2015b,a], will inform and enable further investigation of both equitability and MIC_e.

There are many settings that call simply for testing for any deviation from independence rather than relationship ranking, or in which relationship ranking is simply not feasible. These settings require a measure of dependence that yields tests with high power against a null hypothesis of statistical independence.

Here, we turn to assessing the power against independence of the set of measures of dependence examined. This has been done previously, most notably by Simon and Tibshirani [2012]. Our analysis expands upon the power analysis performed by Simon and Tibshirani in three key ways. First, we examine power not as a function of absolute amount of noise in the alternative hypothesis but rather as a function of the R^2 of the alternative hypothesis, allowing us to aggregate across relationship types to gain a more global view of the power of each method. Second, for each of the statistics we analyze that has a free parameter, we perform a parameter sweep to understand the power of the corresponding tests as a function of that parameter, and to determine what the optimal value of the parameter is. Last, we analyze a larger set of methods, with a greater variety of sample sizes. The result is an in-depth portrait of statistical power, assembled using the best achievable performance of a large number of leading methods.

The methods analyzed were the same as those examined in the equitability analysis. See Section 4.1 for more details.
For all of the power analyses performed, we use both the set of relationships and noise model ( U X [ N y ]) chosen by Simon and Tibshirani [2012]. For consistency with the sample sizes used throughout this work, we show results for n = 500, but results for all analyses using n = 100 are similar and are provided in the supplemental materials.

5.1.3 Parameters of the analysis

In order to make power results for different relationships comparable, we sought to compute power as a function of R^2, in a manner similar to the equitability analyses above, rather than as a function of absolute magnitude of added noise. To do this, we determined, for each of the eight relationship types chosen by Simon and Tibshirani, 100 noise levels evenly distributed over the range of noise levels yielding R^2 = 1 . R^2 = 10 − . (substantial noise). (See Appendix A.2.) We then drew 1000 independent samples, each of size n = 500, from the corresponding distribution. This was our alternative hypothesis. We also drew 1000 independent samples from a corresponding null hypothesis chosen to have the same marginals. All analyses were performed at a significance level of 0.05.

To understand how choice of parameter affects statistical power in the case of each measure of dependence, we performed a parameter sweep for each method that has a parameter. To do this, we needed a way of quantitatively summarizing power across eight relationship types, so that we could then graph performance as a function of parameter value and then choose the optimal parameter value. We did this in two different ways. For both ways, having power computed as a function of R^2, so that power on different relationships could be directly compared, was crucial.

The first way that we summarized power was by computing the area under the power curve for each relationship type, integrating with respect to absolute noise level.
That is, we computed the power curve for a given relationship type (e.g., linear) as a function of amount of noise added, and then computed the area under that curve up to a pre-specified limit on the amount of noise (as measured by R^2). The resulting number measures the expected power of tests based on the statistic in question when the amount of noise added in the alternative hypothesis is chosen uniformly at random.

The second way that we summarized power was by computing the minimum alternative hypothesis R^2 necessary to achieve a certain level of power [Kinney and Atwal, 2014]. Another way of thinking of this is “what is the maximum amount of noise that can be added to a relationship before power for differentiating that relationship from independence drops below a pre-set threshold?” The results presented here use a threshold of 50% power; results for other thresholds (95%, 75%, 25%, and 10%) are similar and can be found in the online supplement.

Figure 7 contains quantitative rankings of the measures of dependence by the power of their corresponding tests for independence, using optimal parameter values determined by each of the two methods described above. The parameter sweeps themselves, which characterize power against independence as a function of statistic parameters, are presented in Figures C.1 and C.2.

This analysis yields several insights, which we discuss below.
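The two power summaries described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analyses: the test statistic (absolute Pearson correlation), the linear relationship with uniform x and Gaussian noise, the use of a permuted y to obtain a null with the same marginals, the small trial counts, and all function names are choices made only for this example.

```python
import numpy as np

def abs_pearson(x, y):
    """Stand-in test statistic (an assumption; any measure of dependence fits here)."""
    return abs(np.corrcoef(x, y)[0, 1])

def simulated_power(noise, n=100, trials=200, level=0.05, rng=None):
    """Estimate power against independence for y = x + noise * eps by simulation."""
    rng = np.random.default_rng(0) if rng is None else rng
    alt = np.empty(trials)
    null = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(size=n)
        y = x + noise * rng.standard_normal(n)
        alt[t] = abs_pearson(x, y)
        # Null with the same marginals: permute y to break the dependence.
        null[t] = abs_pearson(x, rng.permutation(y))
    # Reject when the statistic exceeds the (1 - level) quantile of the null.
    cutoff = np.quantile(null, 1.0 - level)
    return float(np.mean(alt > cutoff))

def area_under_power_curve(noise_levels, power):
    """Trapezoidal area under the power curve, integrating over noise level."""
    noise_levels = np.asarray(noise_levels, dtype=float)
    power = np.asarray(power, dtype=float)
    return float(np.sum(0.5 * (power[1:] + power[:-1]) * np.diff(noise_levels)))

def r_squared_linear(noise):
    """R^2 of y = x + noise * eps with x ~ U(0, 1) and eps ~ N(0, 1)."""
    signal_var = 1.0 / 12.0  # variance of a U(0, 1) variable
    return signal_var / (signal_var + noise ** 2)

def min_r2_at_power(noise_levels, power, threshold=0.5):
    """R^2 at the largest noise level on the grid whose power meets the threshold."""
    ok = [s for s, p in zip(noise_levels, power) if p >= threshold]
    return r_squared_linear(max(ok)) if ok else None

# Power curve over a small noise grid, followed by the two summaries.
noise_grid = np.linspace(0.05, 3.0, 12)
rng = np.random.default_rng(1)
power_curve = [simulated_power(s, rng=rng) for s in noise_grid]
print("area under power curve:", area_under_power_curve(noise_grid, power_curve))
print("minimal R^2 with at least 50% power:", min_r2_at_power(noise_grid, power_curve))
```

Repeating the same computation for each relationship type and each parameter value, and then averaging across relationship types, would give the quantity over which a parameter sweep can be optimized.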
Footnote: Note that one of the relationship types chosen by Simon and Tibshirani was a circle. Since this relationship is not a noisy functional relationship, one cannot truly discuss its R^2. Therefore, as a heuristic workaround, we defined the R^2 of a noisy circle to be the average of the R^2 values, computed separately, of the top and bottom halves.

Footnote: Some methods, such as RDC, will in the future automatically select optimal parameters in a relationship-type-dependent way [Lopez-Paz, 2015].

Footnote: Though this quantification of power computes the area under the power curve integrating with respect to absolute noise level, one could integrate with respect to R^2 instead. Doing so would measure the expected power of each statistic on an alternative hypothesis with a randomly chosen R^2. When this is done and optimal parameters are chosen for each method, the resulting ranking is
TIC_e > MIC_e > S^DDP > MIC > HHG > max. corr. > I > MIC (Orig. param) > RDC > dCor > HSIC > ρ̂
This ranking makes sense because integrating with respect to R^2 rather than absolute noise level emphasizes performance on stronger relationships, which is more similar to the type of performance quantified by equitability. Correspondingly, the optimal parameters determined for the methods in this analysis were more similar to the parameters yielding optimal equitability. For this reason, we did not focus on this method for quantifying power against independence.

Let us first use the average power across relationship types to rank the measures of dependence from most to least powerful over this set of relationships. Doing so using the quantification of power in Figure 7a yields
S^DDP > TIC_e > dCor > HHG > max. corr. > HSIC > MIC_e > RDC > ρ̂ > MIC > I > MIC (Orig. param)
Doing the same using the quantification of power in Figure 7b yields (from most to least powerful):
TIC_e > MIC_e > S^DDP > MIC > HHG > max. corr. > I > MIC (Orig. param) > RDC > dCor > HSIC > ρ̂
When the largest outlier, a high-frequency sinusoid, is removed from the analysis in Figure 7b (and optimal parameters are re-chosen accordingly), the ordering is as follows:
S^DDP > TIC_e > max. corr. > HHG > MIC_e > RDC > HSIC > I > dCor > MIC > MIC (Orig. param) > ρ̂
(See online supplement.) Finally, when the two largest outliers, a high-frequency sinusoid and a circle, are removed from the analysis, the ordering is as follows:
S^DDP > TIC_e > dCor > max. corr. > MIC_e > HHG > HSIC > RDC > MIC > I > MIC (Orig. param) > ρ̂
(See online supplement.) The orderings produced by these analyses are relatively robust to sample size and power threshold used, with TIC_e or S^DDP generally performing the best and occasionally swapping with each other as power threshold is varied. Results obtained with n = 100 and using 95%, 75%, 25%, and 10% power thresholds are provided in the supplemental materials.

Footnote: We chose to remove outliers rather than use median power for two reasons: (a) the power values for different function types often rank in the same order across methods, and (b) there are only eight such numbers and they each vary considerably among methods. As a result, the median is very sensitive to the performance of each method on only one or two particular function types.

Figure 7: Measures of dependence ranked by the power of their corresponding independence tests. For each measure of dependence and each relationship type, power was quantified using (a) the area under the power curve [higher is more powerful], or (b) the minimal R^2 at which at least 50% power is achieved [lower is more powerful]. The collection of these scores across relationship types is then plotted for each method along with quartiles, and both average- and worst-case performance across relationship types are listed. Optimal parameter values for each test statistic were chosen to maximize average-case performance; see (a) Figure C.1, or (b) Figure C.2. The MIC statistic from Reshef et al. [2011] with the parameters used in Simon and Tibshirani [2012] is labeled in red; there is a substantial improvement in power when an optimal parameter is chosen. A further improvement in power is attained by MIC_e, and the performance of TIC_e is state-of-the-art. The sample size was n = 500; results are similar with n = 100 and, for (b), with power thresholds besides 50%. (See supplementary materials.)

Several aspects of these rankings merit mention. First, state-of-the-art performance is shared between TIC_e and Heller and Gorfine's S^DDP. This is interesting because the latter statistic is in fact closely related to the theory behind the maximal and total information coefficients in that it too is an aggregation via summation of mutual information scores taken over many different grids. Thus, these results provide evidence that the basic approach of aggregating mutual information scores over a large set of grids, whether via the characteristic matrix or other statistics, is a fundamentally promising avenue for thinking about dependence.

Second, the average power of independence testing using MIC_e, when parameters are optimized for the task of relationship detection rather than ranking, is competitive with the state of the art. In particular it is higher than the power of its predecessor MIC [Reshef et al., 2011], which estimates the same population quantity (MIC_*). This demonstrates that the improved bias/variance properties of MIC_e relative to MIC [Reshef et al., 2015a] indeed translate into an improvement in power.

We note parenthetically that the power of the MIC statistic from Reshef et al. [2011] is substantially higher than has been previously reported. This discrepancy is due to the fact that previous analyses that examined the power of MIC used the default parameter setting (α = 0.6), whereas a lower value of α should be used for testing for independence. As we show in Section 6, the same statement holds for MIC_e, and both statements follow from a more general power-equitability trade-off.

Our final, and perhaps most important, observation about our results is that the differences in power between most of the best-performing methods appear rather small. And indeed, an analysis using many of these methods on a real gene expression data set [Heller et al., 2014] shows that this observation is true in practice. For example, of the 3312 significant relationships found in the data set using a statistic related to S^DDP, 3199 (97%) were also detected by HHG, and the latter found only 84 other relationships; 2845 (86%) were also detected by dCor on ranks, and the latter found only 44 other relationships; and 2445 (74%) were also detected even simply by computing the Pearson correlation coefficient on ranks. MIC_e and TIC_e were not run in this analysis, but the simulation results presented above lead us to believe that they would also have recovered a very similar set of relationships had their corresponding independence tests been used on this data set.

Footnote: MIC was run for this analysis, but with the default value of α = 0.6, which yields very poor power against independence.

5.2.2 Worst-case power against independence across relationship types

In addition to considering average-case power across relationship types, it is also important to examine worst-case performance. To measure this, we consider the lowest relationship strength x at which each independence test is guaranteed to detect, with a given amount of power, all relationships with strength at least x regardless of relationship type. When a small such x exists, the statistic in question is said to have a low detection threshold [Reshef et al., 2015b]. This implies that the corresponding independence test will not overlook important relationships because of the test statistic's systematically assigning them lower scores. As described in Reshef et al. [2015b], low detection threshold is related to equitability: an equitable statistic provably has a low detection threshold on its set of standard relationships, whereas the converse is not true.

The detection threshold of the independence tests we consider can be read from Figure 7b: if x is the maximum, across relationship types, of the R^2 required to achieve 50% power on each relationship type, then x is also the minimal R^2 such that we can guarantee at least 50% power on any relationship with an R^2 of x regardless of type.

As the figure shows, the detection threshold of TIC_e and MIC_e on the set of relationships examined is an order of magnitude lower than the detection thresholds of the other statistics we evaluated. This phenomenon is robust to power thresholds besides 50%; see the online supplementary materials. It implies that TIC_e is a good candidate for a “first-pass” filtering of the relationships in a data set before other, more fine-grained analyses are conducted.
In contrast, the high detection thresholds of the other statistics imply that, for a fixed relationship strength, their power against independence may be more sensitive to relationship type. Using such statistics for pre-processing may therefore result in certain relationship types being missed in downstream analyses.

Finally, to obtain a more fine-grained picture of the power of the methods we consider on specific relationship types, we also re-created the specific power analysis from Simon and Tibshirani [2012] with optimal parameter choices for each method, as above. The results are shown in Figure 8. Note that, in order to maximize our ability to discern between power curves generated by different tests within each relationship type, in this analysis we followed Simon and Tibshirani [2012] by plotting power as a function of absolute noise level rather than the population R^2. This differs from the analyses above, and means that power levels are not directly comparable across relationship types.

Similarly to our other results, the optimal parameter choices used here cause the power of tests based on several of the statistics included in this analysis to be better than previously reported [Simon and Tibshirani, 2012; Gorfine et al., 2012; Lopez-Paz et al., 2013; Kinney and Atwal, 2014; Jiang et al., 2014]. For instance, we again see here that the power of MIC is substantially improved. We additionally see that the power of MIC_e and TIC_e is quite good across this set of relationships. This analysis also illustrates that each measure of dependence tested indeed has its own strengths and weaknesses. For example, distance correlation and HSIC are relatively better powered to detect linear dependence than MIC_e and TIC_e, but are relatively worse at simultaneously detecting most of the other forms of dependence tested.
In contrast, S^DDP appears to have a similar profile to that of TIC_e, which again makes sense given the fact that S^DDP, like TIC_e, is also a grid-based method with a mutual information-based score that aggregates by summation.

Figure 8: A re-creation of the power analysis performed by Simon and Tibshirani [2012], with optimal parameter choices for each statistic. Power against a null hypothesis of statistical independence for the relationships examined in Simon and Tibshirani [2012], at 50 noise levels for each relationship and n = 500. For each statistic that has a parameter, an optimal value for the parameter was chosen as described in Figure C.1. (For a version with n = 100 see supplementary materials.)

In this section we analyzed the power of independence tests based on several leading measures of dependence, including TIC_e and MIC_e, on the set of relationships chosen by Simon and Tibshirani [2012]. Our analysis differs from previous ones in that we have aggregated results across relationship types, performed parameter sweeps for all the methods that have parameters, and examined a large set of methods and sample sizes.

Our main finding is that TIC_e, along with Heller and Gorfine's S^DDP, provides state-of-the-art performance on average over the relationship types examined. This is significant because TIC_e is trivial to compute once MIC_e has been computed: TIC_e is the sum of the entries of a matrix whose maximal entry is MIC_e. The power of TIC_e on individual relationship types remained high across relationship types; there was no one relationship type that testing for independence using TIC_e would cause us to overlook with high probability. Our results therefore point to a promising and computationally efficient strategy for exploratory data analysis: first, simultaneously compute both MIC_e and TIC_e on all variable pairs in a data set. Then discard pairs declared insignificant by TIC_e and examine the MIC_e scores of the remaining pairs. This way, the multiple-testing burden is borne by the state-of-the-art power of TIC_e, but the significant relationships can still be ranked equitably using MIC_e. We remark that using S^DDP together with MIC_e in an analogous strategy would not be optimal for two reasons. First, such a strategy would be slower, both because S^DDP must be computed independently of MIC_e whereas TIC_e need not be, and because S^DDP itself is slower to compute than MIC_e/TIC_e. (See Section 7 for more on running times.) Second, since the power against independence of S^DDP appears more sensitive to alternative hypothesis relationship type, it seems that filtering relationships by S^DDP is more likely to result in important relationships being eliminated prematurely because of their relationship type.

Footnote: The parameter α of TIC_e that leads to optimal power against independence may not equal the parameter α used for the computation of MIC_e if, for instance, the latter is being computed with equitability as a goal. In this case, the total runtime will equal the runtime of the method with the greater value of α, since increasing α just grows the portion of the equicharacteristic matrix that is computed. In most situations, we expect that the value of α desired for MIC_e will be greater than that desired for TIC_e since the former will be run with equitability in mind, and so TIC_e will be a trivial side-product of the computation of MIC_e.

Our analysis also showed that the power against independence of tests based on MIC_e is greater than that of tests based on its predecessor, MIC, and in particular that MIC_e yields tests with power close to the state of the art. However, these results require a setting of the parameter α of MIC_e that differs from that used for optimal equitability, suggesting a trade-off between power against independence and equitability that we study in the following section. Additionally, we found that the power against independence of most of the methods tested varies considerably across different alternative hypothesis function types, whereas this sensitivity is substantially weaker for MIC_e and TIC_e.

Finally, we observed that, at least in the bivariate setting, the performance of many of the leading methods appears quite similar, even on real data. This last observation leads us to question whether the magnitude of a method's power against independence ought to be the only measure of that method's utility. There are cases in which the answer is ‘yes’, such as when we wish to perform an independence test between two high-dimensional variables whose result is the end-goal of our analysis. However, in data exploration scenarios in which existing measures of dependence already reliably identify thousands of relationships, it may be more important to be able to prioritize those relationships for follow-up, rather than to discover a small number of additional relationships whose strength, and therefore scientific promise, is uncertain. Solving the data exploration problem well requires us not just to maximize the number of relationships we detect, but also to think about how the statistic we choose to use will influence which relationships we find. Indeed, this issue is what inspired the original work on MIC and equitability [Reshef et al., 2011], but we believe the questions regarding the right frameworks for understanding data exploration problems continue to pose numerous interesting challenges.

The above analyses establish that MIC_e can be both highly equitable and provide high-powered tests for detecting deviations from independence. However, in each analysis the parameter α of MIC_e was chosen to optimize the objective in question, and the parameter value that yields optimal equitability is different from the value that yields optimal power against independence. This suggests that there may be a trade-off between these two objectives that is being captured by the choice of this parameter [Reshef et al., 2013].

Such a trade-off also seems plausible given the equivalence proven in Reshef et al. [2015b] between equitability and power against a range of null hypotheses corresponding to different relationship strengths. After all, if equitability is about simultaneously achieving high power against many null hypotheses, then “no free lunch”-type considerations imply that to attain this objective we may have to give up some of the power we previously had against the specific null hypothesis of independence.

Here we establish that such a trade-off does indeed appear to exist within each of the parametrized methods we consider. We then discuss the implications of this trade-off for how one should choose parameters when using MIC_e in practice.

Figure 9: The trade-off between equitability and power against statistical independence across methods. For each method, average power as quantified in Figure 7a is plotted as well as the worst-case equitability under the same model, with n = 500. For every parametrized method, a point is plotted for each value of the parameter in question. The points corresponding to MIC_e are emphasized. Since each coordinate is strictly preferable to all coordinates below and to the left of it, there is a Pareto “power-equitability” front. The methods with points along this front are MIC_e, maximal correlation, TIC_e, and S^DDP.

We examined the equitability and power against independence of MIC_e for values of α ranging from 0.25 to 0.9, at a sample size of 500.
By plotting worst-case equitability against average power for each value of α, we sought to understand whether there is a Pareto front of equitability/power beyond which we cannot seem to advance. The existence of such a boundary would support the existence of a power-equitability trade-off. We performed a similar analysis for all of the statistics whose power against independence and equitability we assessed.

Figure 9 shows that every parametrized method with a non-trivial level of equitability does indeed exhibit such a trade-off. In the case of MIC_e, the trade-off is captured by the parameter α, which controls the maximal grid resolution used by the statistic. This is consistent with the bias-variance analysis in Reshef et al. [2015a], which showed that low values of α lead to better performance in the low-signal regime while larger values of α lead to better performance in mid-to-high-signal regimes. It is also consistent with the intuition that disallowing high-resolution grids may increase power against independence but will allow only coarse-grained distinguishability among distributions, while allowing high-resolution grids might enable distinguishing between distributions that may be more similar to each other.

Figure 9 is also a useful summary of how the different methods we considered compare to each other along these two dimensions (for this sample size and set of relationships). Specifically, if one point is both above and to the right of another then it is strictly preferable. Thus, the figure shows a Pareto front of methods that offer optimal performance with respect to power against independence and equitability. This front includes MIC_e, maximal correlation, TIC_e, and S^DDP.

MIC_e/TIC_e: a practical guide

We now give some guidelines for setting parameters for MIC_e/TIC_e more generally.
The two parameters required by these statistics are the parameter α discussed above, which governs the maximal grid resolution B(n) of the estimator according to B(n) = n^α, and c, an optional parameter that controls a speed-versus-optimality trade-off in the algorithm. We discuss each of these in turn.

α

There are two main considerations involved in choosing α. The first, which is suggested by the analysis above, is how much we care about power against independence relative to equitability (i.e., power at distinguishing cleaner relationships from noisier relationships). The second consideration is whether we expect to see complex relationships in our data. These considerations can be reframed in terms of hypothesis testing as follows:

1. Is our null hypothesis statistical independence or presence of a weak dependence?
It follows from the above analysis that when using MIC_e (or, more likely, TIC_e) to generate tests for statistical dependence one should use a lower value of α, while if one is interested in equitability, a larger α is required.

2. What is our most complex alternative hypothesis?
Since α places an upper bound on the resolution of grids that can be explored by the estimators, it restricts the complexity of structure that can be detected. Thus, as the relationship class of interest grows to include more complex structure relative to sample size, the value of α should be increased accordingly.

Balancing the two considerations. For the specific values of α that maximized power against independence of TIC_e and equitability of MIC_e, respectively, in our analyses, see Appendix E. The tables generally show that a) when optimizing for statistical power against independence in the sample-size regimes analyzed here, one should use an α that leads to B(n) being approximately between 4 (for less complex alternative hypotheses) and 12 (for more complex alternative hypotheses), and b) when optimizing for equitability, one should use an α approximately between 0 . (when n is larger) and 0.75 (when n is smaller).

Footnote: Of course, for even more complex alternative hypotheses, a larger B(n) will lead to better performance, provided the sample size allows for detection of the level of complexity in question. In particular, we suspect that B(n) > ω(1) is necessary for consistency against all alternatives of the resulting independence test. Note however that this hypothesis applies only to MIC_e/TIC_e and not to MIC/TIC, because even just estimating the first entry M(X, Y) , of the population characteristic matrix yields a statistic that is consistent against all alternatives. (See, e.g., Lemma 6.7 in the supplemental online materials of Reshef et al. [2011].)

Equitability and computational efficiency. For large n, the parameters suggested above for equitability are likely needlessly computationally expensive. This is because as n grows, the maximal allowed grid resolution of the statistic B(n) = n^α will outstrip the complexity of most alternative hypotheses that we are liable to encounter in practice.

For example, at n = 5 , B(n) = 70 provides good equitability on the set of functions and noise models tested in this paper. If this level of equitability is acceptable to us, we may set α = log_n 70 for n ≥ , B(n) = 70 always. Given that the runtime of the search procedure in MIC_e is O(n^(5α/2)), which is O(n) for α = 0.4, a less extreme version of this strategy that maintains consistency and gives asymptotically linear runtime is to allow α to decrease for large n until α = 0 . . 4. In the example above, this happens around n = 40 , e at this sample size with α = 0 . α and n are varied, as well as Table E.4, which suggests values of α at several sample sizes that yield 80% of the best observed equitability for MIC_e at each sample size, and the discussion in the next section, where we examine the runtime of MIC_e compared to other statistics.

c

The parameter c determines the coarseness of the discretization of the grid search in the algorithm that computes MIC_e, with larger values of c corresponding to finer discretization [Reshef et al., 2015a]. Characterizing the effect of c on the bias and variance of MIC_e is an important avenue of future work. However, using c = 5 seems to provide good performance in most settings, and in more computationally constrained settings even c = 1 appears to result in only moderate performance loss [Reshef et al., 2015a].

Sample Size   ρ̂        Max. Corr.   RDC      dCor      HSIC      HHG
50            0.0001   0.0004       0.0015   0.0010    0.0016    0.0017
100           0.0001   0.0005       0.0014   0.0014    0.0032    0.0063
500           0.0001   0.0014       0.0023   0.0504    0.0847    0.2185
1,000         0.0002   0.0025       0.0035   0.3518    0.4886    1.0956
5,000         0.0002   0.0119       0.0129   6.1402    6.5975    34.0171
10,000        0.0002   0.0239       0.0251   25.9859   25.7333   465.3222

Sample Size   MIC_e [P]   MIC_e [FE]   MIC_e [E]   MIC       I (Kraskov)   S^DDP (m = 3)
50            0.0004      0.0009       0.0021      0.0015    0.0096        0.0010
100           0.0005      0.0012       0.0052      0.0061    0.0100        0.0023
500           0.0018      0.0079       0.1630      0.2187    0.0122        0.0529
1,000         0.0037      0.0172       0.1992      0.9628    0.0150        0.2122
5,000         0.0195      0.0974       0.3398      18.7627   0.0427        5.7464
10,000        0.0398      0.1819       0.6835      66.2238   0.0927        23.4473

Table 2: Average runtimes, in seconds, of algorithms for computing measures of dependence over 100 trials of uniformly distributed, independent samples at a range of sample sizes. Results for MIC_e are presented for three sample-size-dependent parameter settings that optimize for maximal power against independence ([P]), 99% of optimal equitability ([E]), and 80% of optimal equitability (fast equitability, [FE]). For a list of the parameters used in each of these settings, see Table E.4. TIC_e is omitted because its runtime is very similar to MIC_e [P]. In this analysis, the Kraskov mutual information estimator was run using a pre-compiled C binary, MIC was computed approximately using the APPROX-MIC algorithm [Reshef et al., 2011] in Java, and MIC_e was run in Java. The other statistics were run using their respective R functions/packages. Note that dCor was run with the standard R package, which is O(n^2); as of this writing there is a faster estimator of the same population quantity that is computable in time O(n log n) [Huo and Szekely, 2014].

Computational efficiency is often desirable when evaluating dependence, and here we assess the runtimes associated with the set of measures of dependence examined.
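Because runtime depends on α through the grid resolution, the arithmetic behind the guidance on α above can be made concrete with a short sketch. It assumes B(n) = n^α as stated in the text, and it takes the search cost to scale as n^(5α/2), an exponent inferred here from the statement that the search is O(n) at α = 0.4. The function names, the target resolution of 70, and the floor of 0.4 are used only for illustration.

```python
import math

def grid_resolution(n, alpha):
    """Maximal grid resolution B(n) = n**alpha."""
    return n ** alpha

def alpha_for_resolution(n, target_b):
    """Solve n**alpha = target_b for alpha, i.e. alpha = log_n(target_b)."""
    return math.log(target_b) / math.log(n)

def search_cost_exponent(alpha):
    """Exponent of n in the assumed O(n**(5*alpha/2)) search cost."""
    return 2.5 * alpha

def capped_alpha(n, target_b, floor=0.4):
    """Let alpha shrink with n to hold B(n) at target_b, but never below `floor`."""
    return max(alpha_for_resolution(n, target_b), floor)

# Illustrative sweep: alpha, resulting B(n), and the runtime exponent at each n.
for n in (1_000, 10_000, 100_000, 1_000_000):
    a = capped_alpha(n, target_b=70)
    print(n, round(a, 3), round(grid_resolution(n, a), 1), round(search_cost_exponent(a), 3))
```

Under these assumptions the floor takes over once n exceeds 70^2.5, roughly 41,000; beyond that point B(n) grows again with n while the search cost stays linear in n.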
Since the runtime of MIC_e/TIC_e depends on parameter choice, results for MIC_e are presented for parameter settings recommended for maximizing equitability, maximizing power against independence, and attaining reasonable equitability on a limited computational budget. The third set of parameters was computed by searching at each sample size for the parameters that resulted in the fastest runtime while still yielding 80% of the best observed equitability at that sample size. All the parameters used for MIC_e/TIC_e in this analysis are detailed in Table E.4.

The only other method whose runtime is affected by its parameter was S^DDP. Since S^DDP did not achieve non-trivial levels of equitability, we set its parameter to the value that maximized power against independence. For statistics whose runtimes did not depend on parameter choice, defaults were used (see Appendix E).

Footnote: Since the runtime of S^DDP as a function of its integer-valued parameter m is O ( n m − ) for m = 2, 3, 4, the choice of m heavily affects the runtime. This is significant because the parameter setting that maximizes power against independence can be computed in different ways that lead to different values of m: when power is measured by average area under the power curve, m = 3 performed the best with m = 2 a close second; in contrast, when power is measured via the minimum R^2 necessary to achieve a certain level of power, m = 4 was the best with m = 3 and m = 2 performing significantly worse. (See Section 5.1.4 for a description of these methods of quantifying power.) We therefore have chosen m = 3 here. However, the correct choice of parameter for this statistic will likely depend on the use-case and the available computational budget. For the performance of S^DDP with other values of m, see the online supplement.

7.2 Results

The results of our runtime analysis, found in Table 2, show several things. First, MIC_e with all three of the parameter settings given is substantially faster than the previously introduced MIC statistic from Reshef et al. [2011] run using default parameters. This matches the theoretical analysis in Reshef et al. [2015a], which shows that the complexity of the search procedure in MIC_e is O(n^(5α/2)) whereas the complexity of the search procedure in the APPROX-MIC algorithm used to compute MIC is O ( n α ). Second, even when equitability is prioritized, the runtime of MIC_e is comparable with or faster than that of most of the other leading measures of dependence. The two exceptions to this are RDC and maximal correlation, which are both quite fast even at very large sample sizes.

We note one interesting feature of the runtime of MIC_e. Since estimating MIC_* involves a search procedure, runtimes for estimating it are substantially faster when data contain less noise; as such, the runtimes on statistically independent data presented in Table 2 represent worst-case performance.
When run on data drawn from a noiseless linear relationship at the same sample sizes, MIC e ran 10%-75% faster across the range of sample sizes tested when using settings that optimize for equitability, 5%-50% faster across the sample sizes tested when using settings intended to achieve equitability on a limited computational budget, and 10%-30% faster across the sample sizes tested when using settings that optimize for power against independence. The runtime of S DDP exhibited a similar phenomenon, but the runtimes of the other methods were insensitive to the level of structure present and did not exhibit this effect.
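As a rough illustration of the timing protocol behind Table 2 (average wall-clock time over repeated trials on independent, uniformly distributed samples), the sketch below times an arbitrary statistic in pure Python. The names `average_runtime`, `independent_uniform`, and `noiseless_line` are hypothetical, and `pearson_r` is only a stand-in statistic; this is not the benchmarking code used in the study, which ran Java, C, and R implementations.

```python
import random
import time


def pearson_r(xs, ys):
    """Stand-in statistic (plain Pearson correlation); not one of the paper's methods."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def independent_uniform(n, rng):
    """Statistically independent X and Y, each uniform on [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    ys = [rng.random() for _ in range(n)]
    return xs, ys


def noiseless_line(n, rng):
    """Noiseless linear relationship y = x, for comparison with the independent case."""
    xs = [rng.random() for _ in range(n)]
    return xs, list(xs)


def average_runtime(statistic, sampler, sample_sizes, trials=100, seed=0):
    """Return {n: mean seconds per call of statistic} over `trials` fresh samples."""
    rng = random.Random(seed)
    results = {}
    for n in sample_sizes:
        total = 0.0
        for _ in range(trials):
            xs, ys = sampler(n, rng)          # data generation is not timed
            start = time.perf_counter()
            statistic(xs, ys)
            total += time.perf_counter() - start
        results[n] = total / trials
    return results
```

Calling `average_runtime` once with `independent_uniform` and once with `noiseless_line` gives the kind of comparison discussed above. A statistic built on a search procedure would be expected to show a gap between the two, whereas this stand-in will not.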
In this section we analyzed the runtimes of MIC e /TIC e alongside other leading measures of dependence at sample sizes ranging from 50 to 10 , e /TIC e is faster than or comparable to most of the other methods tested, and is much faster than its predecessor MIC. Specifically, with parameters chosen to yield state-of-the-art power for TIC e and approximately 80% of the best achievable equitability for MIC e, both statistics can be computed on a sample size of 5,000 in 97 milliseconds. For a data set with n = 5,000 consisting of 1,000 variables, this translates into a total runtime of 8 . e and TIC e.

We emphasize that our results represent a snapshot based on currently available implementations. Just as MIC e has provided an improvement over APPROX-MIC, and just as recent advances are providing ways for estimating distance correlation in time O(n log n) rather than O(n^2), we expect that with time algorithmic improvements will allow for more efficient computation of some of the newer methods analyzed here.

In this paper, we presented an in-depth empirical evaluation of the equitability, power against independence, and runtime of several leading measures of dependence, including two new statistics introduced in Reshef et al. [2015a]. Our aims were to give an accessible exposition of equitability and its relationship to power against independence, provide the community with a comprehensive and rigorous side-by-side comparison of existing methods, and evaluate the new statistics against the existing state of the art. Our main findings were as follows.

1. Equitability. MIC e, the estimator of the population MIC introduced in Reshef et al. [2015a], generally has superior and more robust equitability with respect to R^2 than other measures of dependence. In some specific settings (models with no noise in the independent variable and n = 5 ,

2. Power against independence. TIC e, a statistic introduced in Reshef et al. [2015a], shares state-of-the-art power against independence with Heller and Gorfine's S DDP, with both methods generally performing very well and alternately outperforming each other in different settings. MIC e also has power against independence that is competitive with the state of the art, albeit under parameter settings that differ from those that confer good equitability. Moreover, the power of independence testing using TIC e and MIC e is much less sensitive than that of the other methods examined to alternative hypothesis relationship type. The original statistic MIC has substantially higher power against independence than has been reported in previous analyses when a different parameter setting is used. Finally, distance correlation, maximal correlation, HHG, HSIC, and RDC also had good power against independence.

3. Power/equitability tradeoff. The parameter α in the estimator MIC e corresponds to a trade-off between power against independence and equitability that is consistent with the characterization of equitability given in Reshef et al. [2015b]. Lower values of α lead to higher power against a null of independence at the expense of power against null hypotheses representing weak relationship strength (i.e., equitability), while higher values of α lead to better equitability at the expense of power against independence.

4. Runtime. MIC e and TIC e, each of which can be trivially computed once the other has been obtained, have runtimes that allow them to be run together even on large samples in reasonable time. This runtime compares favorably with that of other complex measures of dependence such as S DDP, dCor, HSIC, and HHG. The fastest measures of dependence were maximal correlation and the randomized dependence coefficient. There is a large variety of runtimes across the measures of dependence examined.

There are several important takeaways from our results. First, they suggest that using MIC e and TIC e in tandem to filter relationships and rank them by strength is a statistically sound and computationally efficient strategy for exploratory data analysis. In particular, one can imagine a system in which first TIC e is computed for all relationships and only the significant ones are kept, and then MIC e with equitability-optimized parameters is examined only for the latter set. Since TIC e enjoys high power against independence on a wide range of alternative hypothesis relationship types, pre-filtering with TIC e in this way will not result in important relationships being overlooked due to their relationship type.
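The two-stage strategy described above (test every pair for independence, then rank only the surviving pairs by an equitable statistic) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `screen_and_rank` and `permutation_p_value` are hypothetical names, `squared_pearson` is a placeholder standing in for both TIC e and MIC e, significance is assessed by a simple permutation test, and no multiple-testing correction is applied.

```python
import random
from itertools import combinations


def squared_pearson(xs, ys):
    """Placeholder statistic; a real pipeline would call TIC_e / MIC_e code here."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def permutation_p_value(stat, xs, ys, n_perm, rng):
    """One-sided permutation p-value for the null hypothesis of independence."""
    observed = stat(xs, ys)
    shuffled = list(ys)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(xs, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


def screen_and_rank(columns, test_stat, rank_stat, alpha=0.05, n_perm=199, seed=0):
    """Stage 1: keep pairs significant under test_stat. Stage 2: rank them by rank_stat."""
    rng = random.Random(seed)
    kept = []
    for a, b in combinations(sorted(columns), 2):
        p = permutation_p_value(test_stat, columns[a], columns[b], n_perm, rng)
        if p <= alpha:
            # The ranking statistic is only computed for pairs that pass the filter.
            kept.append((rank_stat(columns[a], columns[b]), a, b))
    kept.sort(reverse=True)
    return [(a, b, score) for score, a, b in kept]
```

Passing two different callables for `test_stat` and `rank_stat` mirrors the division of labor in the text: the first is chosen for power against independence, the second for equitability.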
Any measure of dependence deemed to have sufficient power on a broad range of alternative hypotheses can be substituted for TIC e. However, since TIC e and MIC e can be computed simultaneously, and since TIC e offers state-of-the-art power against independence, using TIC e appears to be a preferable choice in such a scenario.

Second, the fact that many measures of dependence performed similarly in our analysis of power against independence, as well as in analyses of real data sets that others have performed (see, e.g., Heller et al. [2014]), suggests that power against independence may not be where the true challenge lies for bivariate relationships, and that we ought to demand more of the measures of dependence that we use in this setting. Equitability is one attempt to formulate a more ambitious goal, as is the concept of low detection threshold introduced in Reshef et al. [2015b], but there may well be other possibilities. Of course, for higher-dimensional relationships, even just power against independence is very difficult to achieve, and many of the methods evaluated here are quite useful in that setting.

Finally, the comprehensiveness of our results provides significant understanding of the comparative performance of various measures. To our knowledge, our analyses are the most exhaustive to date in that they evaluate a large swath of measures of dependence side-by-side along a number of dimensions (equitability, power against independence, and runtime); over a wide range of models, relationship types, and sample sizes; and with parameters that are optimized for each individual statistic in each analysis.
Our hope is that the full set of results, which can be downloaded in bulk at will be a resource to the community that enables more consistent, direct comparisons between different measures of dependence, and facilitates a precise discussion of the trade-offs and assumptions associated with each one in various settings.

While the results presented here make a compelling case for the use of MIC e and TIC e and provide insight into the trade-offs between different measures of dependence, there are some important limitations for both the new statistics and the comparisons we performed. First, in this paper we evaluated only equitability with respect to R^2 on noisy functional relationships, whereas the definition we give of equitability explicitly acknowledges the possibility of using other properties of interest besides R^2 and standard relationships that are not noisy functional relationships. We feel that R^2 is an important property of interest that is intuitive and familiar to many practitioners, but equitability with respect to other properties of interest merits study as well, and the methods tested here may perform much better or worse when their equitability is evaluated with respect to other properties of interest.

Additionally, though an attempt at comprehensiveness was made, we did limit our scope to the set of noisy functional relationships in Reshef et al. [2011] for equitability and the relationships introduced in Simon and Tibshirani [2012] for power against independence. While we feel each of these suites of relationships provides reasonable insight into the performance of the methods in question, there are relationships that, when added to these suites, result in extremely poor performance for all the methods tested. Characterizing those relationships theoretically in both the setting of equitability and that of power against independence is important if we are to fully understand the strengths and weaknesses of each of these methods.
This is an important direction for future work.

Measures of dependence are useful in a variety of settings, and identifying which measures of dependence provide superior performance in the face of different objectives, assumptions, and constraints is critical. For each separate goal, we must understand both which measure of dependence is most appropriate and also which parameter settings lead to the best performance. Such an understanding provides insight into the inherent trade-offs of different methods, allowing us to navigate the landscape of measures of dependence more effectively and, ultimately, to better understand our data.

The authors would like to acknowledge E Airoldi, K Arnold, H Finucane, A Gelman, M Gorfine, A Gretton, T Hashimoto, R Heller, J Hernández-Lobato, J Huggins, T Jaakkola, A Miller, J Mueller, A Narayanan, G Szekely, J Tenenbaum, and R Tibshirani for constructive conversations and useful feedback.
References
Leo Breiman and Jerome H Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580–598, 1985.
T. Cover and J. Thomas. Elements of Information Theory. New York: John Wiley & Sons, Inc, 2006.
Imre Csiszár and Paul C Shields. Information theory and statistics: A tutorial. Communications and Information Theory, 1(4):417–528, 2004.
Valur Emilsson, Gudmar Thorleifsson, Bin Zhang, Amy S Leonardson, Florian Zink, Jun Zhu, Sonia Carlson, Agnar Helgason, G Bragi Walters, Steinunn Gunnarsdottir, et al. Genetics of gene expression and its effect on disease. Nature, 452(7186):423–428, 2008.
M. Gorfine, R. Heller, and Y. Heller. Comment on “Detecting novel associations in large data sets”. Unpublished (available at http://emotion.technion.ac.il/~gorfinm/files/science6.pdf on 11 Nov. 2012), 2012.
Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic learning theory, pages 63–77. Springer, 2005.
Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alex J Smola. A kernel statistical test of independence. 2008.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
Ruth Heller, Yair Heller, and Malka Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.
Ruth Heller, Yair Heller, Shachar Kaufman, Barak Brill, and Malka Gorfine. Consistent distribution-free k-sample and independence tests for univariate random variables. arXiv preprint arXiv:1410.6758, 2014.
Wassily Hoeffding. A non-parametric test of independence. The Annals of Mathematical Statistics, pages 546–557, 1948.
Xiaoming Huo and Gabor J Szekely. Fast computing for distance covariance. arXiv preprint arXiv:1410.1503, 2014.
Bo Jiang, Chao Ye, and Jun S Liu. Non-parametric k-sample tests via dynamic slicing. Journal of the American Statistical Association, (just-accepted):00–00, 2014.
Justin B. Kinney and Gurinder S. Atwal. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 2014. doi: 10.1073/pnas.1309933111.
A. Kraskov, H. Stogbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69, 2004.
E.H. Linfoot. An informational measure of correlation. Information and Control, 1(1):85–89, 1957.
David Lopez-Paz. Personal communication, April 2015.
David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. The randomized dependence coefficient. In Advances in Neural Information Processing Systems, pages 1–9, 2013.
Young-Il Moon, Balaji Rajagopalan, and Upmanu Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318–2321, 1995.
Ben Murrell, Daniel Murrell, and Hugh Murrell. R2-equitability is satisfiable. Proceedings of the National Academy of Sciences, 2014. doi: 10.1073/pnas.1403623111.
Liam Paninski. Estimation of entropy and mutual information. Neural computation, 15(6):1191–1253, 2003.
Alfred Rényi. On measures of dependence. Acta mathematica hungarica, 10(3):441–451, 1959.
David Reshef, Yakir Reshef, Michael Mitzenmacher, and Pardis Sabeti. Equitability analysis of the maximal information coefficient, with comparisons. arXiv preprint arXiv:1301.6314, 2013.
David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.
David N. Reshef, Yakir A. Reshef, Michael Mitzenmacher, and Pardis C. Sabeti. Cleaning up the record on the maximal information coefficient and equitability. Proceedings of the National Academy of Sciences, 2014. doi: 10.1073/pnas.1408920111.
Yakir A Reshef, David N Reshef, Hilary K Finucane, Pardis C Sabeti, and Michael Mitzenmacher. Measuring dependence powerfully and equitably. arXiv preprint arXiv:1505.02213, 2015a.
Yakir A Reshef, David N Reshef, Pardis C Sabeti, and Michael Mitzenmacher. Equitability, interval estimation, and statistical power. arXiv preprint arXiv:1505.02212, 2015b.
Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, Kenji Fukumizu, et al. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.
N. Simon and R. Tibshirani. Comment on “Detecting novel associations in large data sets”. ∼tibs/reshef/comment.pdf on 11 Nov. 2012), 2012.
T. Speed. A correlation for the 21st century. Science, 334(6062):1502–1503, 2011.
Gábor J Székely, Maria L Rizzo, Nail K Bakirov, et al. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.
G.J. Szekely and M.L. Rizzo. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.
A Data generation
A.1 Definitions of functions used
Tables A.1 and A.2 contain the definitions of the functions used to assess the equitability and statistical power against independence, respectively, of measures of dependence throughout this paper. The functions used for all analyses of power against independence (Table A.2) are taken from Simon and Tibshirani [2012].
y = cos(14 πx ) x ∈ [0 , y = cos(7 πx ) x ∈ [0 , y = sin(5 πx (1 + x )) x ∈ [0 , y = 4 x + x − x x ∈ [ − . , . y = 41(4 x + x − x ) x ∈ [ − . , . x ] y = 10 x x ∈ [0 , x ] y = 2 x x ∈ [0 , y = (cid:40) x/
99 if x ≤ x > x ∈ [0 , y = x x ∈ [0 , y = sin(10 . x − (2 x − x ∈ [0 , y = sin(10 . x − (2 x − x ∈ [0 , y = sin(4(2 x − (2 x − x ∈ [0 , y = sin(10 πx ) + x x ∈ [0 , y = x if x < − x + if ≤ x < − x + if x ≥ x ∈ [0 , y = 4 x x ∈ [ − , ]16 Sigmoid y = x ≤ x − ) + if ≤ x ≤ x > x ∈ [0 , y = sin(16 πx ) x ∈ [0 , y = sin(8 πx ) x ∈ [0 , y = sin(9 πx ) x ∈ [0 , y = sin(6 πx (1 + x )) x ∈ [0 , y =
20 if x < − x + if ≤ x < − x + if x ≥ x ∈ [0 , Table A.1:
Definitions of the functions used to analyze equitability. Under noise/sampling models containing noise in the independent variable or independent-variable marginal distributions other than E f ( X ) or U f ( X ), functions 6, 8, 14, 16, and 21 were excluded due to poor performance across all methods tested. This is presumably due to the fact that a) horizontally perturbing points in a very steep portion of a function drastically changes the distribution in question, and b) sampling uniformly along the x-axis drastically under-samples a large part of the graph of a function if that graph contains very steep portions.
Function Name Definition
Line y = x x ∈ [0 , y = 4 x x ∈ [ − , ]Cubic y = 128( x − ) − x − ) − x − ) x ∈ [0 , y = sin(16 πx ) x ∈ [0 , y = sin(4 πx ) x ∈ [0 , x / y = x / x ∈ [0 , y = ± (cid:112) − (2 x − x ∈ [0 , y = (cid:40) x ≤ x > x ∈ [0 , Table A.2:
Definitions of the functions from Simon and Tibshirani [2012] used to analyze statistical power against independence.

A.2 Generating a sample from a distribution with a specified R^2

Given a noisy functional relationship of the form (X + ε, f(X) + ε′), the R^2 of the relationship is the squared correlation between f(X + ε) and f(X) + ε′. Many of the equitability and power analyses performed in this work require the ability to set ε and ε′ such that the resulting distribution has a given population R^2.

In the case that ε = 0 and the variance of ε′ is known, the R^2 of the distribution has a closed-form expression given in Reshef et al. [2011]. If we specialize that expression to the case we consider in this paper, wherein ε′ ∼ N(0, σ^2), and then solve for σ, we obtain the following expression.

$$\sigma(R^2) = \sqrt{\operatorname{var}(f(X)) \left( \frac{1}{R^2} - 1 \right)}$$

In cases that include noise in the independent variable, we set ε (and ε′, if the noise model requires) by binary search, using the sample R^2 of a very large sample as an estimate of the population R^2.

B Additional equitability results
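Returning briefly to the closed-form calibration of Section A.2 (the case ε = 0 with Gaussian noise in the dependent variable only), the following sketch shows one way it could be applied. The function names are hypothetical, X is assumed to be uniform on [0, 1], and var(f(X)) is estimated from the sample rather than computed analytically; the binary-search calibration used when the independent variable is also noisy is not covered.

```python
import math
import random
import statistics


def noise_sd_for_r2(var_f, r2):
    """sigma(R^2) = sqrt(var(f(X)) * (1/R^2 - 1)), noise in the dependent variable only."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("R^2 must lie in (0, 1]")
    return math.sqrt(var_f * (1.0 / r2 - 1.0))


def sample_with_r2(f, r2, n, seed=0):
    """Draw (X, f(X), f(X) + eps') with eps' ~ N(0, sigma^2) chosen to target r2."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]                  # assumption: X ~ Uniform[0, 1]
    fx = [f(x) for x in xs]
    sigma = noise_sd_for_r2(statistics.pvariance(fx), r2)  # sample estimate of var(f(X))
    ys = [v + rng.gauss(0.0, sigma) for v in fx]
    return xs, fx, ys


def squared_correlation(us, vs):
    """Squared Pearson correlation, used here to check the achieved R^2."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv * suv / (suu * svv)
```

For a large sample, `squared_correlation(fx, ys)` should land close to the requested value, which is the same check the text describes using a very large sample as a proxy for the population R^2.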
Noise (Gaussian)Even Along -Noise 0.58 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.88 0.39 0.45Even Along -Noise 0.52 1.00 1.00 1.00 0.99 0.92 1.00 0.98 0.99 0.87 0.35 0.42Even Along -Noise 0.63 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.94 0.50 0.55Even Along -Noise 0.66 1.00 1.00 1.00 1.00 0.88 1.00 1.00 0.99 0.98 0.63 0.72Even Along -Noise 0.64 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.94 0.50 0.55Even Along -Noise 0.66 1.00 1.00 1.00 1.00 0.90 1.00 1.00 0.99 0.98 0.63 0.68Uniform Along -Noise 0.56 1.00 1.00 1.00 0.65 1.00 1.00 0.98 0.98 0.83 0.43 0.45Uniform Along -Noise 0.51 1.00 1.00 1.00 0.59 0.87 1.00 0.98 0.98 0.83 0.42 0.42Uniform Along -Noise 0.61 1.00 1.00 1.00 0.69 1.00 1.00 0.98 0.98 0.94 0.50 0.53Uniform Along -Noise 0.68 1.00 1.00 1.00 0.76 0.98 1.00 0.98 0.98 0.98 0.65 0.71Uniform Along -Noise 0.61 1.00 1.00 1.00 0.70 1.00 1.00 0.98 0.98 0.94 0.50 0.54Uniform Along -Noise 0.68 1.00 1.00 1.00 0.81 0.98 1.00 0.98 0.98 0.98 0.64 0.72
Worst Case 0.68 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.65 0.72
Even Along -Noise 0.53 1.00 1.00 1.00 0.45 0.75 1.00 0.98 0.98 0.70 0.24 0.27Even Along -Noise 0.48 1.00 1.00 1.00 0.43 0.48 1.00 0.97 0.98 0.70 0.21 0.24Even Along -Noise 0.59 1.00 1.00 1.00 0.97 0.96 1.00 0.98 0.98 0.77 0.34 0.38Even Along -Noise 0.63 1.00 1.00 1.00 0.95 0.84 1.00 0.98 0.98 0.95 0.48 0.51Even Along -Noise 0.58 1.00 1.00 1.00 0.97 0.95 1.00 0.98 0.98 0.80 0.35 0.38Even Along -Noise 0.63 1.00 1.00 1.00 0.96 0.87 1.00 0.98 0.98 0.95 0.48 0.51Uniform Along -Noise 0.52 1.00 1.00 1.00 0.45 0.69 1.00 0.96 0.98 0.68 0.25 0.28Uniform Along -Noise 0.48 1.00 1.00 1.00 0.43 0.48 1.00 0.95 0.98 0.68 0.22 0.25Uniform Along -Noise 0.59 1.00 1.00 1.00 0.57 0.83 1.00 0.98 0.98 0.74 0.35 0.38Uniform Along -Noise 0.65 1.00 1.00 1.00 0.67 0.75 1.00 0.98 0.98 0.93 0.51 0.56Uniform Along -Noise 0.59 1.00 1.00 1.00 0.59 0.84 1.00 0.98 0.98 0.75 0.35 0.39Uniform Along -Noise 0.65 1.00 1.00 1.00 0.73 0.80 1.00 0.98 0.98 0.94 0.51 0.56
Worst Case 0.65 1.00 1.00 1.00 0.97 0.96 1.00 0.98 0.98 0.95 0.51 0.56
Even Along -Noise 0.44 1.00 0.98 1.00 0.18 0.08 1.00 -- 0.96 0.44 0.15 0.16Even Along -Noise 0.40 1.00 0.98 1.00 0.18 0.07 1.00 -- 0.96 0.43 0.12 0.15Even Along -Noise 0.48 1.00 0.98 1.00 0.41 0.32 1.00 -- 0.98 0.50 0.23 0.24Even Along -Noise 0.53 1.00 0.98 1.00 0.56 0.49 1.00 -- 0.98 0.69 0.36 0.41Even Along -Noise 0.48 1.00 0.98 1.00 0.77 0.37 1.00 -- 0.98 0.51 0.24 0.24Even Along -Noise 0.53 1.00 0.98 1.00 0.83 0.58 1.00 -- 0.98 0.69 0.37 0.40Uniform Along -Noise 0.44 1.00 0.98 1.00 0.17 0.09 1.00 -- 0.97 0.44 0.16 0.16Uniform Along -Noise 0.41 1.00 0.98 1.00 0.18 0.07 1.00 -- 0.97 0.43 0.13 0.15Uniform Along -Noise 0.49 1.00 0.98 1.00 0.37 0.30 1.00 -- 0.98 0.50 0.24 0.25Uniform Along -Noise 0.54 1.00 0.98 1.00 0.53 0.47 1.00 -- 0.98 0.69 0.39 0.43Uniform Along -Noise 0.49 1.00 0.98 1.00 0.41 0.34 1.00 -- 0.98 0.51 0.25 0.25Uniform Along -Noise 0.54 1.00 0.98 1.00 0.61 0.55 1.00 -- 0.98 0.70 0.39 0.44
Worst Case 0.54 1.00 0.98 1.00 0.83 0.58 1.00 -- 0.98 0.70 0.39 0.44
TIC e MICSample Size Model MaximalCorr. (ACE) dCor HSIC I [L ] (Kraskov, k=1)Pearson I [L ] (Kraskov, k=6) MIC e S DDP
HHGRDC
Table B.1:
A summary of the worst-case equitability of measures of dependence for a variety of noise models, independent-variable marginal distributions, and sample sizes. [Smaller values correspond to better equitability.]
Each number is a worst-case interpretable interval length for a given statistic in a given setting. Therefore, smaller numbers indicate shorter interpretable intervals and more equitable behavior. Table cells are colored proportionally (red = interval of length 0; white = interval of length 1). The equitability of MIC e is relatively robust to factors like noise models, independent-variable marginal distributions, and sample size. Figures analogous to Figures 4 and B.1 for all the settings presented in this table are included in the online supplementary materials. For statistics whose performance was dependent on parameter settings, we present for each sample size the best results across parameter values tested. Results are not presented for HHG for n = 5 ,
000 as it was prohibitively computationally expensive toanalyze at this sample size. oise (Gaussian)Even Along -Noise 0.38 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.45 0.28 0.32Even Along -Noise 0.31 0.50 0.50 0.50 0.50 0.49 0.50 0.50 0.50 0.45 0.26 0.30Even Along -Noise 0.41 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.49 0.34 0.37Even Along -Noise 0.41 0.50 0.50 0.50 0.50 0.48 0.50 0.50 0.50 0.49 0.40 0.43Even Along -Noise 0.41 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.49 0.34 0.37Even Along -Noise 0.41 0.50 0.50 0.50 0.50 0.49 0.50 0.50 0.50 0.49 0.40 0.43Uniform Along -Noise 0.37 0.50 0.50 0.50 0.36 0.50 0.50 0.50 0.50 0.45 0.29 0.32Uniform Along -Noise 0.32 0.50 0.50 0.50 0.33 0.48 0.50 0.50 0.50 0.44 0.28 0.30Uniform Along -Noise 0.40 0.50 0.50 0.50 0.40 0.50 0.50 0.50 0.50 0.48 0.34 0.37Uniform Along -Noise 0.42 0.50 0.50 0.50 0.43 0.49 0.50 0.50 0.50 0.48 0.41 0.43Uniform Along -Noise 0.41 0.50 0.50 0.50 0.41 0.50 0.50 0.50 0.50 0.48 0.35 0.37Uniform Along -Noise 0.42 0.50 0.50 0.50 0.44 0.49 0.50 0.50 0.50 0.48 0.42 0.43 Average Case 0.39 0.50 0.50 0.50 0.45 0.49 0.50 0.50 0.50 0.47 0.34 0.37
Even Along -Noise 0.36 0.50 0.50 0.50 0.26 0.46 0.50 0.50 0.49 0.38 0.19 0.21Even Along -Noise 0.29 0.50 0.50 0.50 0.24 0.35 0.50 0.49 0.50 0.37 0.17 0.19Even Along -Noise 0.39 0.50 0.50 0.50 0.49 0.49 0.50 0.50 0.50 0.45 0.24 0.27Even Along -Noise 0.40 0.50 0.50 0.50 0.49 0.47 0.50 0.49 0.50 0.48 0.30 0.32Even Along -Noise 0.40 0.50 0.50 0.50 0.49 0.49 0.50 0.50 0.50 0.45 0.24 0.27Even Along -Noise 0.40 0.50 0.50 0.50 0.49 0.48 0.50 0.49 0.50 0.48 0.30 0.33Uniform Along -Noise 0.36 0.50 0.50 0.50 0.26 0.44 0.50 0.49 0.49 0.38 0.19 0.21Uniform Along -Noise 0.30 0.50 0.50 0.50 0.24 0.35 0.50 0.49 0.49 0.37 0.18 0.19Uniform Along -Noise 0.39 0.50 0.50 0.50 0.33 0.47 0.50 0.49 0.49 0.44 0.25 0.27Uniform Along -Noise 0.41 0.50 0.50 0.50 0.38 0.44 0.50 0.49 0.49 0.48 0.31 0.33Uniform Along -Noise 0.40 0.50 0.50 0.50 0.33 0.47 0.50 0.49 0.49 0.44 0.25 0.27Uniform Along -Noise 0.41 0.50 0.50 0.50 0.39 0.46 0.50 0.49 0.49 0.48 0.31 0.33
Average Case 0.38 0.50 0.50 0.50 0.37 0.45 0.50 0.49 0.49 0.43 0.24 0.26
Even Along -Noise 0.30 0.50 0.48 0.50 0.11 0.07 0.491 -- 0.478 0.254 0.106 0.11Even Along -Noise 0.23 0.50 0.48 0.50 0.10 0.06 0.491 -- 0.478 0.247 0.0937 0.10Even Along -Noise 0.33 0.50 0.48 0.50 0.25 0.22 0.49 -- 0.478 0.344 0.164 0.17Even Along -Noise 0.34 0.50 0.48 0.50 0.32 0.29 0.488 -- 0.478 0.424 0.228 0.24Even Along -Noise 0.33 0.50 0.48 0.50 0.42 0.24 0.491 -- 0.479 0.352 0.166 0.17Even Along -Noise 0.35 0.50 0.48 0.50 0.43 0.32 0.488 -- 0.478 0.427 0.228 0.24Uniform Along -Noise 0.31 0.50 0.48 0.50 0.11 0.07 0.497 -- 0.478 0.255 0.109 0.11Uniform Along -Noise 0.23 0.50 0.48 0.50 0.10 0.06 0.497 -- 0.478 0.247 0.0962 0.11Uniform Along -Noise 0.34 0.50 0.48 0.50 0.23 0.20 0.498 -- 0.478 0.345 0.168 0.17Uniform Along -Noise 0.35 0.50 0.48 0.50 0.30 0.28 0.497 -- 0.477 0.425 0.236 0.25Uniform Along -Noise 0.34 0.50 0.48 0.50 0.24 0.21 0.498 -- 0.478 0.352 0.171 0.18Uniform Along -Noise 0.35 0.50 0.48 0.50 0.33 0.30 0.495 -- 0.477 0.427 0.237 0.25
Average Case 0.32 0.50 0.48 0.50 0.25 0.19 0.49 -- 0.48 0.34 0.17 0.17
Sample Size Model MaximalCorr. (ACE) dCor HSIC I [L ] (Kraskov, k=6)Pearson TIC e MIC e MICS
DDP
RDC HHG I [L ] (Kraskov, k=1) Table B.2:
A summary of the average-case equitability of measures of dependence for a variety of noise models, independent-variable marginal distributions, and sample sizes. [Smaller values correspond to better equitability.] Each number is an average interpretable interval length for a given statistic in a given setting. Therefore, smaller numbers indicate shorter interpretable intervals on average and more equitable behavior. Table cells are colored proportionally (red = interval of length 0; white = interval of length 1). The equitability of MIC e is relatively robust to factors like noise models, independent-variable marginal distributions, and sample size. Figures analogous to Figures 4 and B.1 for all the settings presented in this table are included in the online supplementary materials. For statistics whose performance was dependent on parameter settings, we present for each sample size the best results across parameter values tested. Results are not presented for HHG for n = 5,000 as it was prohibitively computationally expensive to analyze at this sample size.

Figure B.1: The equitability of measures of dependence on a set Q of noisy functional relationships with alternative noise model and marginal distribution. [Narrower is more equitable.] The relationships take the form (X, f(X) + ε′) where ε′ is normally distributed with varying amplitude, and relationship strength is quantified by Φ = R^2. The plots were constructed as described in Figure 2. In contrast to its poor equitability under the noise model used in Figure 4, the Kraskov mutual information estimator, represented using the squared Linfoot correlation, is quite equitable under this noise model at large sample sizes. At the low and mid-range sample sizes, MIC e remains more equitable. For every parametrized statistic whose parameter meaningfully affects equitability, results are presented at each sample size using parameter settings that maximize equitability across all twelve of the noise/marginal distributions tested at that sample size. See Tables B.1 and B.2 for a summary of the equitability of these measures of dependence under those additional models, as well as the supplemental materials for the corresponding figures.

Figure B.2: The equitability of MIC ∗ and mutual information in the infinite data limit. [Narrower is more equitable.] Six combinations of noise models and independent-variable marginal distributions were analyzed. The values of MIC ∗ were computed using the newly introduced algorithm from Reshef et al. [2015a]. In each plot, the worst-case interpretable interval is indicated by a red line, and both the worst- and average-case equitability are listed. Mutual information values are represented in terms of the squared Linfoot correlation. In the large-sample limit, mutual information is more equitable than MIC ∗ in settings where there is noise only in the dependent variable, while MIC ∗ has superior equitability otherwise.

Figure B.3: The equitability of measures of dependence on noisy functional relationships with noise in the dependent variable only, visualized in terms of power. [Redder is more equitable.] The set Q of noisy functional relationships analyzed is the same as in Figure B.1, and relationship strength is again quantified by Φ = R^2. Plots were generated as in Figure 6. In contrast to its performance under the noise model used in Figure 4, the Kraskov mutual information estimator yields powerful tests under this noise model at large sample sizes. At the low and mid-range sample sizes, tests based on MIC e remain more powerful. For every parametrized statistic whose parameter meaningfully affects equitability, results are presented at each sample size using parameter settings that maximize equitability across all twelve of the noise/marginal distributions tested at that sample size. See Tables B.1 and B.2 for a summary of the equitability of these measures of dependence under those additional models, as well as the supplemental materials for the corresponding figures.

C Parameter sweeps for power against independence
Figure C.1:
Power against independence as a function of the parameter of each measure of dependence. [Higher ismore powerful.]
For each measure of dependence, we computed power curves over a range of parameters using therelationships from Simon and Tibshirani [2012]. In order to aggregate the power of a given test across relationshiptypes, all power curves were computed as functions of the R of the noisy relationship comprising the alternativehypothesis, and the area under each power curve was computed. Here, we show for each statistic the area underthe power curve for each relationship type as a function of that statistic’s parameter. The black line represents theaverage area under the power curves across all relationship types, and the vertical dotted line represents the optimalparameter setting. Both the average and worst-case performance across relationship types are listed for the optimalparameter setting of each statistic. For the MIC statistic from Reshef et al. [2011], the red line represents the defaultparameter setting, which was used by Simon and Tibshirani. This parameter setting turns out to be poor for testingfor independence; it is better suited for achieving equitability. For testing for independence, lower values of theparameter are better suited, though these incur a cost in terms of equitability. (See Figure 9.) igure C.2: Power against independence as a function of the parameter of each measure of dependence, with overallpower quantified differently than in Figure C.1. [Lower is more powerful.]
As in Figure C.1, we computed power curves for a range of parameters of each measure of dependence using the relationships from Simon and Tibshirani [2012]. Here, in order to aggregate the power of a given test across relationship types, the power curve of each test was computed as a function of the R^2 of the noisy relationship being tested, and the R^2 at which 50% power is achieved for each relationship type was determined. This number is graphed for each relationship type and statistic as a function of that statistic's parameter. The black line represents the average R^2 at which 50% power is achieved across all relationships tested, and the vertical dotted line represents the optimal parameter setting. Both the average and the worst-case performance across relationship types are listed for the optimal parameter setting of each statistic. For the MIC statistic from Reshef et al. [2011], the red line represents the default parameter setting, which was used by Simon and Tibshirani. This parameter setting turns out to be poor for testing for independence; it is better suited for achieving equitability. For testing for independence, lower values of the parameter are better suited, though these incur a cost in terms of equitability. (See Figure 9.)
D The equitability-runtime trade-off
Figure D.1: The relationship between equitability and runtime of MIC_e. Sample sizes are n = 250 (left), 500 (middle), and 5,000 (right). Each plot shows, as α varies, the worst-case equitability of MIC_e with the given value of α on the model used in Figure 9, graphed against the runtime of MIC_e with the same value of α. The multiple series in every plot correspond to different values of c, with marker size indicating the size of c. The values of c used are 1, 2, 3, 5, 10, and 15. (c = 10 and c = 15 are omitted from the analysis for n = 5,000.) As α increases, we generally see a rise in equitability but also in runtime.
E Parameter values used in analyses
Parameter sweeps were performed for all methods in evaluating their equitability and statistical power against independence.
Parameter values used in equitability analyses
For each method, results are presented for the parameter values tested that maximized worst-case equitability across all models Q examined, at each sample size (see Table E.1). Results for all parameter values tested, including for some methods not included in the figures here due to space constraints, can be found in the online supplement.
In the case of RDC and HSIC, the parameter values tested did not have a strong effect on equitability, so we present performance for the default (rule-of-thumb) parameter values. That is, the random sampling parameters (S_x, S_y) of RDC and the RBF kernel bandwidth parameters (σ_x, σ_y) used for HSIC were set independently for each of the two samples being tested to the empirical median of the pairwise Euclidean distances (other percentiles of the pairwise distances were also tested for these parameters). For RDC, the number of random features was set to k = 10. For the Kraskov mutual information estimator, k = 1, k = 6, k = 10, and k = 20 were tested. In the case of S^DDP, values of m > […]. For MIC_e, at n = 250, 500, and 5,000, the values of α tested were {[…]}, {[…]}, and {[…]}, respectively.

Sample size  MIC_e       TIC_e       S^DDP  I (Kraskov)  RDC                       HSIC
             α      c    α      c    m      k            S_x, S_y             k    σ_x, σ_y
250          0.75   15   0.80   3    2      6            Median pair. dist.   10   Median pair. dist.
500          0.80   5    0.80   3    2      6            Median pair. dist.   10   Median pair. dist.
5,000        0.65   3    0.70   3    2      6            Median pair. dist.   10   Median pair. dist.
Table E.1: Parameters used in the equitability analyses.
Parameter values used in statistical power analyses
Tables E.2 and E.3 summarize the optimal parameters identified for tests for independence based on the methods examined, using area under the power curves and a 50% power threshold, respectively, as the optimization criterion. The parameters in Table E.2 were used to generate the power curves in Figure 8. The parameter ranges tested for each statistic can be observed from Figures C.1 and C.2.
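The two optimization criteria can be illustrated with a minimal sketch. It assumes that power has already been estimated on a shared grid of R^2 values for every parameter setting and relationship type. The function names, the array layout, the trapezoidal rule, and the linear interpolation at the 50% crossing are all illustrative assumptions rather than the procedure actually used to produce Tables E.2 and E.3.

```python
import numpy as np


def area_under_power_curve(r2, power):
    """Trapezoidal area under one power curve sampled at increasing R^2 values."""
    r2 = np.asarray(r2, dtype=float)
    power = np.asarray(power, dtype=float)
    return float(np.sum(np.diff(r2) * 0.5 * (power[1:] + power[:-1])))


def r2_at_power_threshold(r2, power, threshold=0.5):
    """Smallest R^2 beyond which power stays at or above `threshold`.

    The crossing is located by linear interpolation between grid points
    (an assumption). Returns inf if power is still below the threshold
    at the largest R^2 on the grid.
    """
    r2 = np.asarray(r2, dtype=float)
    power = np.asarray(power, dtype=float)
    below = np.flatnonzero(power < threshold)
    if below.size == 0:
        return float(r2[0])
    i = int(below[-1])
    if i == len(r2) - 1:
        return float("inf")
    frac = (threshold - power[i]) / (power[i + 1] - power[i])
    return float(r2[i] + frac * (r2[i + 1] - r2[i]))


def select_parameters(params, r2, power):
    """Choose the best parameter under each criterion.

    power[p, t, :] is the power curve for parameter params[p] on
    relationship type t, evaluated on the shared grid r2 (hypothetical layout).
    """
    power = np.asarray(power, dtype=float)
    auc = np.array([[area_under_power_curve(r2, c) for c in per_type]
                    for per_type in power])
    half = np.array([[r2_at_power_threshold(r2, c) for c in per_type]
                     for per_type in power])
    i_auc = int(np.argmax(auc.mean(axis=1)))    # higher average area is better
    i_half = int(np.argmin(half.mean(axis=1)))  # lower average R^2 is better
    return {
        "auc": {"param": params[i_auc],
                "average": float(auc[i_auc].mean()),
                "worst": float(auc[i_auc].min())},
        "half_power": {"param": params[i_half],
                       "average": float(half[i_half].mean()),
                       "worst": float(half[i_half].max())},
    }
```

Each entry of the returned dictionary holds the selected parameter together with its average and worst-case score across relationship types, mirroring the two summaries reported in Figures C.1 and C.2.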

Sample size  MIC_e       TIC_e       MIC         S^DDP  I (Kraskov)  RDC                       HSIC
             α      c    α      c    α      c    m      k            S_x, S_y             k    σ_x, σ_y
100          0.48   5    0.50   5    0.40   5    3      13           5%-ile pair. dist.   10   45%-ile pair. dist.
500          0.35   5    0.38   5    0.30   5    3      50           5%-ile pair. dist.   10   60%-ile pair. dist.
Table E.2: Best parameters for testing for independence, identified by maximizing the average area under the power curves generated by a given test for the set of relationships examined.
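The RDC and HSIC entries in these tables are percentiles of the pairwise Euclidean distances within each sample, with the median as the default. The sketch below shows one way such a value could be computed. The function names are hypothetical, and details such as excluding self-distances and the interpolation rule of `np.percentile` are assumptions, since the text does not specify them.

```python
import numpy as np


def pairwise_distance_percentile(sample, percentile=50.0):
    """Percentile of the pairwise Euclidean distances within one sample.

    `sample` has shape (n,) or (n, d). Each unordered pair is counted once
    and self-distances are excluded (an assumption). Memory use is O(n^2 d),
    so this is only suitable for moderate n.
    """
    x = np.asarray(sample, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(x), k=1)
    return float(np.percentile(dists[upper], percentile))


def per_sample_bandwidths(x, y, percentile=50.0):
    """Set the parameter independently for each of the two samples."""
    return (pairwise_distance_percentile(x, percentile),
            pairwise_distance_percentile(y, percentile))
```

With `percentile=50.0` this corresponds to the "Median pair. dist." entries of Table E.1; passing, for example, `percentile=5.0` corresponds to the "5%-ile pair. dist." entries.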

Sample size  MIC_e       TIC_e       MIC         S^DDP  I (Kraskov)  RDC                       HSIC
             α      c    α      c    α      c    m      k            S_x, S_y             k    σ_x, σ_y
100          0.74   5    0.96   5    0.48   5    5      12           5%-ile pair. dist.   10   30%-ile pair. dist.
500          0.56   5    0.68   5    0.36   5    4      41           5%-ile pair. dist.   10   5%-ile pair. dist.
Table E.3: Best parameters for testing for independence, identified by minimizing the average across relationship types of the minimal R^2 for which the power of a given test remained above 50%.
Parameter values used in runtime analyses
For methods whose runtime did not strongly depend on parameter settings, default parameter values were used. That is, the Kraskov mutual information estimator was run using k = 6, and the random sampling parameters (S_x, S_y) of RDC and the RBF kernel bandwidth parameters (σ_x, σ_y) used for HSIC were set independently for each of the two samples being tested to the empirical median of the pairwise Euclidean distances. In the case of RDC, the number of random features was set to k = 10, as in the runtime analysis in Lopez-Paz et al. [2013]. The parameters used for MIC_e are presented in Table E.4.

Sample size  Power       Fast equitability  Equitability
             α      c    α      c           α      c
50           0.54   5    0.75   3           0.85   5
100          0.48   5    0.70   2           0.80   5
500          0.36   5    0.65   1           0.80   5
1,000        0.32   5    0.60   1           0.75   4
5,000        0.26   5    0.50   1           0.65   1
10,000       0.24   5    0.45   1           0.60   1
Table E.4: Parameters used in the runtime analysis of MIC_e presented in Table 2.
For MIC_e, the three sample-size-dependent parameter settings optimize for maximal power against independence, 80% of optimal equitability (fast equitability), and 99% of optimal equitability. For sample sizes for which results were not available, parameter values were estimated via interpolation/extrapolation using a power curve. As pointed out in Section 6.2, these parameter settings depend on the set of relationships being examined, and, for example, for relationship suites with less complex relationships than the ones examined in the analyses here, lower values of α