Equitability, interval estimation, and statistical power
Yakir A. Reshef, David N. Reshef, Pardis C. Sabeti, Michael M. Mitzenmacher
Abstract
As data sets grow in dimensionality, non-parametric measures of dependence have seen increasing use in data exploration due to their ability to identify non-trivial relationships of all kinds. One common use of these tools is to test a null hypothesis of statistical independence on all variable pairs in a data set. However, because this approach attempts to identify any non-trivial relationship no matter how weak, it is prone to identifying so many relationships (even after correction for multiple hypothesis testing) that meaningful follow-up of each one is impossible. What is needed is a way of identifying a smaller set of “strongest” relationships of all kinds that merit detailed further analysis.

Here we formally present and characterize equitability, a property of measures of dependence that aims to overcome this challenge. Notionally, an equitable statistic is a statistic that, given some measure of noise, assigns similar scores to equally noisy relationships of different types (e.g., linear, exponential, etc.) [1]. We begin by formalizing this idea via a new object called the interpretable interval, which functions as an interval estimate of the amount of noise in a relationship of unknown type. We define an equitable statistic as one with small interpretable intervals.

We then draw on the equivalence of interval estimation and hypothesis testing to show that under moderate assumptions an equitable statistic is one that yields well-powered tests for distinguishing not only between trivial and non-trivial relationships of all kinds but also between non-trivial relationships of different strengths, regardless of relationship type. This means that equitability allows us to specify a threshold relationship strength $x$ below which we are uninterested, and to search a data set for relationships of all kinds with strength greater than $x$.
Thus, equitability can be thought of as a strengthening of power against independence that enables fruitful analysis of data sets with a small number of strong, interesting relationships and a large number of weaker, less interesting ones. We conclude with a demonstration of how our two equivalent characterizations of equitability can be used to evaluate the equitability of a statistic in practice.

Affiliations: School of Engineering and Applied Sciences, Harvard University; Department of Computer Science, Massachusetts Institute of Technology; Department of Organismic and Evolutionary Biology, Harvard University; Broad Institute of MIT and Harvard. Y. A. Reshef and D. N. Reshef are co-first authors; P. C. Sabeti and M. M. Mitzenmacher are co-last authors. Correspondence should be addressed to Y. A. Reshef (email: [email protected]). arXiv [math.ST], May.

Suppose we have a data set that we would like to explore to find pairwise associations of interest. A commonly taken approach that makes minimal assumptions about the structure in the data is to compute a measure of dependence, i.e., a statistic whose population value is non-zero exactly in cases of statistical dependence, on many candidate pairs of variables. The score of each variable pair can be evaluated against a null hypothesis of statistical independence, and variable pairs with significant scores can be kept for follow-up [2, 3]. When faced with this task, there is a wealth of measures of dependence from which to choose, each with a different set of properties [4–13].

While this approach works well in some settings, it is unsuitable in many others due to the size of modern data sets. In particular, as data sets grow in dimensionality, the above approach often results in lists of significant relationships that are too large to allow for meaningful follow-up of every identified relationship.
For example, in the gene expression data set analyzed in [14], several measures of dependence reliably identified thousands of significant relationships amounting to between 65 and 75 percent of the variable pairs in the data set. Given the extensive manual effort that is usually necessary to better understand each of these “hits”, further characterizing all of them is impractical.

A tempting way to deal with this challenge is to rank all the variable pairs in a data set according to the test statistic used (or according to p-value) and to examine only a small number of pairs with the most extreme values. However, this is a poor idea because, while a measure of dependence guarantees non-zero scores to dependent variable pairs, the magnitude of these non-zero scores can depend heavily on the type of dependence in question, thereby skewing the top of the list toward certain types of relationships over others. For example, if some measure of dependence $\varphi$ systematically assigns higher scores to, say, linear relationships than to sinusoidal relationships, then using $\varphi$ to rank variable pairs in a large data set could cause noisy linear relationships in the data set to crowd out strong sinusoidal relationships from the top of the list. The natural result would be that the human examining the top-ranked relationships would never see the sinusoidal relationships, and they would not be discovered.

The consistency guarantee of measures of dependence is therefore not strong enough to solve the data exploration problem posed here.
What is needed is a way not just to identify as many relationships of different kinds as possible in a data set, but also to identify a small number of strongest relationships of different kinds.

Here we formally present and characterize equitability, a framework for meeting this goal. In previous work, equitability was informally introduced as follows: an equitable measure of dependence is one that, given some measure of noise, assigns similar scores to equally noisy relationships, regardless of relationship type [1]. In this paper, we formalize this notion in the language of estimation theory and tie it to the theory of hypothesis testing.

Specifically, we define an object called the interpretable interval that functions as an interval estimate of the strength of a relationship of unknown type. That is, given a set $Q$ of standard relationships on which we have defined a measure $\Phi$ of relationship strength, the interpretable interval is a range of values that act as good estimates of the true relationship strength $\Phi$ of a distribution, assuming it belongs to $Q$. In the same way that a good estimator has narrow confidence intervals, an equitable statistic is one that has narrow interpretable intervals. As we explain, this property can be viewed as a natural generalization of one of the “fundamental properties” described by Rényi in his framework for measures of dependence [15].

We then draw a connection between equitability and statistical power using the equivalence between interval estimation and hypothesis testing. This connection shows that whereas typical measures of dependence are analyzed in terms of power to distinguish non-trivial associations from statistical independence, under moderate assumptions an equitable statistic is one that can distinguish finely between relationships of two different strengths that may both be non-trivial, regardless of the types of the two relationships in question.
This result gives us a new way to understand equitability as a natural strengthening of the requirement of power against independence in which we ask that our statistic be useful not just for detecting deviations of different types from independence but also for distinguishing strong relationships from weak relationships regardless of relationship type.

Finally, motivated by the connection between equitability and power, we define a new property, detection threshold, which, at some fixed sample size, is the minimal relationship strength $x$ such that a statistic’s corresponding independence test has a certain minimal power on relationships of all kinds with strength at least $x$. We show that low detection threshold is strictly weaker than high equitability in that high equitability implies it but the converse does not hold. Therefore, when equitability is too much to ask, low detection threshold on a broad set of relationships with respect to an interesting measure of relationship strength may be a reasonable surrogate goal.

Throughout this paper, we give concrete examples of how our formalism relates to the analysis of equitability in practice. Indeed, the purpose of the theoretical framework provided here is to allow for such practical analyses, and so we close with a demonstration of an empirical analysis of the equitability of several popular measures of dependence.

This paper is accompanied by two companion papers. The first [4] introduces two new statistics that aim for good equitability on functional relationships and good power against statistical independence, respectively. The second [16] conducts a comprehensive empirical analysis of the equitability and power against independence of both of these new methods as well as several other leading measures of dependence.

The results we present here, in addition to contributing to a better understanding of equitability, also provide an organizing framework in which to consolidate some of the recent discussion around equitability.
For instance, our formalization of equitability is sufficiently general to accommodate several of the variants that have arisen in the literature. This allows us to precisely discuss the definition given by Kinney and Atwal [17, 18] of what, in our theoretical framework, corresponds to perfect equitability. In particular, our framework allows us to explain the limitations of an impossibility result presented by Kinney and Atwal about perfect equitability. Additionally, our framework and the connection it provides to statistical power also allow us to crystallize and address the concerns about the power against independence of equitable methods raised by Simon and Tibshirani [19]. (However, empirical questions concerning the performance of the maximal information coefficient and related statistics are deferred to the companion papers [4, 16].)

We conclude with a discussion of what situations benefit from using equitability as a desideratum for data analysis. It is our hope that the theoretical results in this paper will provide a foundation for further work not only on equitability and methods for achieving equitability, but also on other possible expansions of our goals for measures of dependence in the setting of data exploration or other related settings.

Equitability has been described informally by the authors as the ability of a statistic to “give similar scores to equally noisy relationships of different types” [1]. Though useful, this informal definition is imprecise in that it does not specify what is meant by “noisy” or “similar”, and does not specify for which relationships the stated property should hold. In this section we provide the formalism necessary to discuss equitability more rigorously.

To do this, we fix a statistic $\hat{\varphi}$ (presumed to be a measure of dependence), a measure of relationship strength $\Phi$ called the property of interest, and a set $Q$ of standard relationships on which $\Phi$ is defined.
The idea is that $Q$ contains relationships of many different types, and for any distribution $\mathcal{Z} \in Q$, $\Phi(\mathcal{Z})$ is the way we would ideally quantify the strength of $\mathcal{Z}$ if we had knowledge of the distribution $\mathcal{Z}$. Our goal is then, given a sample $Z$ of size $n$ from $\mathcal{Z}$, to use $\hat{\varphi}(Z)$ to draw inferences about $\Phi(\mathcal{Z})$.

Our general approach is to construct a set of intervals, the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$, by inverting a certain set of hypothesis tests. We show that these intervals can be used to turn $\hat{\varphi}(Z)$ into an interval estimate of $\Phi(\mathcal{Z})$, and we call the statistic $\hat{\varphi}$ equitable if its interpretable intervals are small, i.e., if it yields narrow interval estimates of $\Phi(\mathcal{Z})$.

After constructing the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$, we demonstrate how our vocabulary can be used to define a few different concrete instantiations of the concept of equitability. We do this by using our framework to state several of the notions of, and results about, equitability that have appeared in the literature, and discussing the relationships among them. Following this, we provide a short schematic illustration of how the definitions we provide would be used to quantitatively evaluate the equitability of a statistic in practice, and a discussion of how equitability is related to measurement of effect size more generally.

In what follows, we keep our exposition generic in order to accommodate variations, both existing and potential, on the concepts defined here. However, as a motivating example, we often return to the setting of [1], in which $\hat{\varphi}$ is a statistic like the maximal information coefficient $\mathrm{MIC}_e$, $Q$ is a set of noisy functional relationships, and $\Phi$ is the coefficient of determination ($R^2$) with respect to the generating function. In this setting, the equitability of $\mathrm{MIC}_e$ corresponds to its utility for constructing narrow interval estimates of the $R^2$ of a relationship that is in $Q$ but whose specific functional form is unknown.
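The test-inversion idea outlined above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `invert_tests` and the table of acceptance regions are hypothetical, and the grid of candidate strengths is far coarser than one would use in practice. For each candidate strength $x$ we assume an acceptance region for the statistic is already available; the interval estimate at an observed score $y$ is then the smallest interval containing every $x$ whose region contains $y$.

```python
def invert_tests(acceptance, y):
    """Invert a family of tests indexed by candidate strength x.

    `acceptance` maps each candidate x to an acceptance region (lo, hi)
    for the statistic. Returns the smallest closed interval containing
    every x whose region contains the observed score y, or None if no
    candidate is accepted.
    """
    accepted = [x for x, (lo, hi) in acceptance.items() if lo <= y <= hi]
    if not accepted:
        return None
    return (min(accepted), max(accepted))


# Hypothetical acceptance regions on a coarse grid of strengths x.
acceptance = {
    0.0: (0.00, 0.15),
    0.25: (0.10, 0.40),
    0.5: (0.30, 0.65),
    0.75: (0.55, 0.90),
    1.0: (0.85, 1.00),
}

# An observed score of 0.35 is compatible with x = 0.25 and x = 0.5 only.
print(invert_tests(acceptance, y=0.35))
```

The narrower the acceptance regions and the less they overlap across values of $x$, the narrower the resulting interval estimate, which is the behavior the formal definitions below are designed to capture.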
Let $\hat{\varphi}$ be a statistic taking values in $[0, 1]$, let $Q$ be a set of distributions, and let $\Phi : Q \to [0, 1]$. We refer to $Q$ as the set of standard relationships and to $\Phi$ as the property of interest. To construct the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$, we must first ask how much $\hat{\varphi}$ can vary when evaluated on a sample from some $\mathcal{Z} \in Q$ with $\Phi(\mathcal{Z}) = x$. The definition below gives us a way to measure this. (In this definition and in definitions in the rest of this paper, we implicitly assume a fixed sample size of $n$.)

Definition 2.1 (Reliability of a statistic). Let $\hat{\varphi}$ be a statistic taking values in $[0, 1]$, and let $x, \alpha \in [0, 1]$. The $\alpha$-reliable interval of $\hat{\varphi}$ at $x$, denoted by $R^{\hat{\varphi}}_{\alpha}(x)$, is the smallest closed interval $A$ with the property that, for all $\mathcal{Z} \in Q$ with $\Phi(\mathcal{Z}) = x$, we have
$$P\left(\hat{\varphi}(Z) < \min A\right) < \alpha/2 \quad \text{and} \quad P\left(\hat{\varphi}(Z) > \max A\right) < \alpha/2,$$
where $Z$ is a sample of size $n$ from $\mathcal{Z}$.

The statistic $\hat{\varphi}$ is $1/d$-reliable with respect to $\Phi$ on $Q$ at $x$ with probability $1 - \alpha$ if and only if the diameter of $R^{\hat{\varphi}}_{\alpha}(x)$ is at most $d$.

See Figure 1a for an illustration. The reliable interval at $x$ is an acceptance region of a size-$\alpha$ test of the null hypothesis $H_0 : \Phi(\mathcal{Z}) = x$. If there is only one $\mathcal{Z}$ satisfying $\Phi(\mathcal{Z}) = x$, this amounts to a central interval of the sampling distribution of $\hat{\varphi}$ on $\mathcal{Z}$. If there is more than one such $\mathcal{Z}$, the reliable interval expands to include the relevant central intervals of the sampling distributions of $\hat{\varphi}$ on all the distributions $\mathcal{Z}$ in question.
For example, when $Q$ is a set of noisy functional relationships with several different function types and $\Phi$ is $R^2$, the reliable interval at $x$ is the smallest interval $A$ such that for any functional relationship $\mathcal{Z} \in Q$ with $R^2(\mathcal{Z}) = x$, $\hat{\varphi}(Z)$ falls in $A$ with high probability over the sample $Z$ of size $n$ from $\mathcal{Z}$.

Because the reliable interval $R^{\hat{\varphi}}_{\alpha}(x)$ can be viewed as the acceptance region of a level-$\alpha$ test of $H_0 : \Phi(\mathcal{Z}) = x$, the equivalence between hypothesis tests and confidence intervals yields interval estimates of $\Phi$ in terms of $R^{\hat{\varphi}}_{\alpha}(x)$. These intervals are the interpretable intervals, defined below.

Definition 2.2 (Interpretability of a statistic). Let $\hat{\varphi}$ be a statistic taking values in $[0, 1]$, and let $y, \alpha \in [0, 1]$. The $\alpha$-interpretable interval of $\hat{\varphi}$ at $y$, denoted by $I^{\hat{\varphi}}_{\alpha}(y)$, is the smallest closed interval containing the set
$$\left\{ x \in [0, 1] : y \in R^{\hat{\varphi}}_{\alpha}(x) \right\}.$$

The statistic $\hat{\varphi}$ is $1/d$-interpretable with respect to $\Phi$ on $Q$ at $y$ with confidence $1 - \alpha$ if and only if the diameter of $I^{\hat{\varphi}}_{\alpha}(y)$ is at most $d$.

See Figure 1a for an illustration. The correspondence between hypothesis tests and interval estimates [20] gives us the following guarantee about the coverage probability of the interpretable interval, whose proof we omit.

Proposition 2.3.
Let $\hat{\varphi}$ be a statistic taking values in $[0, 1]$, and let $\alpha \in [0, 1]$. For all $x \in [0, 1]$ and for all $\mathcal{Z} \in Q$,
$$P\left(\Phi(\mathcal{Z}) \in I^{\hat{\varphi}}_{\alpha}\big(\hat{\varphi}(Z)\big)\right) \ge 1 - \alpha,$$
where $Z$ is a sample of size $n$ from $\mathcal{Z}$.

The definitions just presented have natural non-stochastic counterparts in the large-sample limit that we summarize below.
Definition 2.4 (Reliability and interpretability in the large-sample limit). Let $\varphi : Q \to [0, 1]$. For $x \in [0, 1]$, the smallest closed interval containing $\varphi(\Phi^{-1}(\{x\}))$ is called the reliable interval of $\varphi$ at $x$ and is denoted by $R^{\varphi}(x)$. For $y \in [0, 1]$, the smallest closed interval containing $\{x : y \in R^{\varphi}(x)\}$ is called the interpretable interval of $\varphi$ at $y$ and is denoted by $I^{\varphi}(y)$.

See Figure 1b for an illustration.

Proposition 2.3 implies that if the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$ are small then $\hat{\varphi}$ will give good interval estimates of $\Phi$. There are many ways to summarize whether the interpretable intervals of $\hat{\varphi}$ are small; we focus here on two simple ones.

Figure 1:
A schematic illustration of reliable and interpretable intervals. In both figure parts, $Q$ consists of noisy relationships of three different types depicted in the three different colors. (a) The relationship between a statistic $\hat{\varphi}$ and $\Phi$ on $Q$ at a finite sample size. The bottom and top boundaries of each shaded region indicate the $(\alpha/2) \cdot 100\%$ and $(1 - \alpha/2) \cdot 100\%$ percentiles of the sampling distribution of $\hat{\varphi}$ for each relationship type at various values of $\Phi$. The vertical interval (in black) is the reliable interval $R^{\hat{\varphi}}_{\alpha}(x)$, and the horizontal interval (in red) is the interpretable interval $I^{\hat{\varphi}}_{\alpha}(y)$. (b) In the large-sample limit, we replace $\hat{\varphi}$ with a population quantity $\varphi$. The vertical interval (in black) is the reliable interval $R^{\varphi}(x)$, and the horizontal interval (in red) is the interpretable interval $I^{\varphi}(y)$.

Definition 2.5.
The worst-case $\alpha$-reliability (resp. $\alpha$-interpretability) of $\hat{\varphi}$ is $1/d$ if it is $1/d$-reliable (resp. interpretable) at all $x$ (resp. $y$) $\in [0, 1]$. $\hat{\varphi}$ is said to be worst-case $1/d$-reliable (resp. $1/d$-interpretable) with probability (resp. confidence) $1 - \alpha$.

The average-case $\alpha$-reliability (resp. $\alpha$-interpretability) of $\hat{\varphi}$ is $1/d$ if its reliability (resp. interpretability), averaged over all $x$ (resp. $y$) $\in [0, 1]$, is at least $1/d$. $\hat{\varphi}$ is said to be average-case $1/d$-reliable (resp. $1/d$-interpretable) with probability (resp. confidence) $1 - \alpha$.

(One could imagine more fine-grained ways to summarize reliability/interpretability according to, for example, some prior over the distributions in $Q$ that reflects a belief about the importance or prevalence of various types of relationships; for simplicity, we do not pursue this here.)

With this vocabulary, we can now define equitability: average/worst-case equitability is simply average/worst-case interpretability with respect to some $\Phi$ that reflects relationship strength. In this paper, we distinguish between interpretability in general and equitability specifically by using “interpretability” in general statements and “equitability” in contexts in which $\Phi$ is specifically considered as a measure of relationship strength. Also, we often use “interpretability” and “equitability” with no qualifier to mean worst-case interpretability/equitability.

The corresponding definitions of average/worst-case interpretability/reliability can be made for $\varphi$ in the large-sample limit as well. In that setting, it is possible that all the interpretable intervals of $\varphi$ with respect to $\Phi$ have size 0; that is, the value of $\varphi(\mathcal{Z})$ uniquely determines the value of $\Phi(\mathcal{Z})$.
In this case, the worst-case reliability/interpretability of $\varphi$ is $\infty$, and $\varphi$ is said to be perfectly reliable/interpretable, or perfectly equitable depending on context.

Before continuing, let us build intuition by giving two examples of statistics that are perfectly interpretable in the large-sample limit. First, the mutual information [21, 22] is perfectly interpretable with respect to the squared correlation $\rho^2$ on the set $Q$ of bivariate normal random variables. This is because for bivariate normals we have that $1 - 2^{-2I} = \rho^2$ [23], where $I$ denotes the mutual information. Additionally, Theorem 6 of [24] shows that for bivariate normals distance correlation is a deterministic function of $\rho$ as well. Therefore, distance correlation is also perfectly interpretable and perfectly reliable with respect to $\rho^2$ on the set of bivariate normals $Q$.

The perfect interpretability with respect to $\rho^2$ on bivariate normals exhibited in both of these examples is in fact equivalent to one of the “fundamental properties” introduced by Rényi in his framework for thinking about ideal properties of measures of dependence [15]. This property contains a compromise: it guarantees interpretability that on the one hand is perfect, but on the other hand applies only on a relatively small set of standard relationships. One goal of equitability is to give us the tools to relax the “perfect” requirement in exchange for the ability to make $Q$ a much larger set, e.g., a set of noisy functional relationships. Thus, equitability can be viewed as a generalization of Rényi’s requirement that allows for a tradeoff between the precision with which our statistic tells us about $\Phi$ and the set $Q$ on which it does so.

We now give examples, using the vocabulary developed here, of some concrete instantiations of, and results about, equitability. Our focus here is on functional relationships, as defined below.
Definition 2.6.
A random variable distributed over $\mathbb{R}^2$ is called a noisy functional relationship if and only if it can be written in the form $(X + \varepsilon, f(X) + \varepsilon')$ where $f : [0, 1] \to \mathbb{R}$, $X$ is a random variable distributed over $[0, 1]$, and $\varepsilon$ and $\varepsilon'$ are (possibly trivial) random variables. We denote the set of all noisy functional relationships by $\mathbf{F}$.

We can now state one specific type of equitability on functional relationships: equitability with respect to $R^2$.

Definition 2.7 (Equitability on functional relationships with respect to $R^2$). Let $Q \subset \mathbf{F}$ be a set of noisy functional relationships. A measure of dependence is $1/d$-equitable on $Q$ with respect to $R^2$ if it is $1/d$-interpretable with respect to $R^2$ on $Q$.

We observe that this definition still depends on the set $Q$ in question. The general approach taken in the literature thus far has been to fix some set $F$ of functions that on the one hand is large enough to be representative of relationships encountered in real data sets, but on the other hand is small enough to enable empirical analysis, and to make equitability a realistic goal.

As important as the choice of functions to include in $F$ is the choice of marginal distributions and noise model, both of which are left unspecified in our definition of noisy functional relationships. In past work, we have examined several possibilities. The simplest is $X \sim \mathrm{Unif}$, $\varepsilon' \sim \mathcal{N}(0, \sigma^2)$ with $\sigma$ varying, and $\varepsilon = 0$. Slightly more complex noise models include having $\varepsilon$ and $\varepsilon'$ be i.i.d. Gaussians, or having $\varepsilon$ be Gaussian and $\varepsilon' = 0$. More complex marginal distributions include having $X$ be distributed in a way that depends on the graph of $f$, or having it be non-stochastic [1, 16]. Given that we often lack a neat description of the noise in real data sets, we would ideally like a statistic to be highly equitable on as many different such models as possible.
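To make the simplest noise model above concrete, the following sketch draws samples from $(X, f(X) + \varepsilon')$ with $X \sim \mathrm{Unif}[0, 1]$ and Gaussian $\varepsilon'$, choosing $\sigma$ to hit a target $R^2$. It rests on an assumption that the text does not spell out: that $R^2$ with respect to the generating function equals $\mathrm{Var}(f(X)) / (\mathrm{Var}(f(X)) + \sigma^2)$ when the noise is independent of $X$. The helper names `noise_sd_for_r2` and `sample_relationship` are hypothetical.

```python
import math
import random


def noise_sd_for_r2(f, target_r2, grid_size=10_000):
    """Noise s.d. sigma such that Y = f(X) + N(0, sigma^2), X ~ Unif[0, 1],
    has population R^2 equal to target_r2 (assumed to be
    Var(f(X)) / (Var(f(X)) + sigma^2))."""
    if not 0.0 < target_r2 <= 1.0:
        raise ValueError("target_r2 must lie in (0, 1]")
    # Approximate Var(f(X)) with a midpoint rule on [0, 1].
    values = [f((i + 0.5) / grid_size) for i in range(grid_size)]
    mean = sum(values) / grid_size
    var_f = sum((v - mean) ** 2 for v in values) / grid_size
    return math.sqrt(var_f * (1.0 - target_r2) / target_r2)


def sample_relationship(f, sigma, n, rng):
    """Draw n points (x, f(x) + noise) with noise in the second coordinate only."""
    xs = [rng.random() for _ in range(n)]
    return [(x, f(x) + rng.gauss(0.0, sigma)) for x in xs]


rng = random.Random(0)
sigma = noise_sd_for_r2(lambda x: x ** 2, target_r2=0.8)
sample = sample_relationship(lambda x: x ** 2, sigma, n=500, rng=rng)
```

Calibrating the noise separately for each function in this way is what allows relationships of different types to be compared at equal $R^2$; the other noise models mentioned above would need their own calibration.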
We can also easily imagine models besides the ones described above: for instance, we might define $\varepsilon$ and $\varepsilon'$ to be non-Gaussian, we might allow them to depend on each other, or we might allow their variance to depend on $f(X)$. The importance of such modifications depends on the context, but our formalism is designed to be flexible enough to handle general models that include such variations.

One version of equitability on functional relationships for which perfect equitability has been shown to be impossible was introduced by Kinney and Atwal [17]. This version of equitability uses as standard relationships the set
$$Q_K = \left\{ (X, f(X) + \eta) \;\middle|\; f : [0, 1] \to [0, 1],\ (\eta \perp X) \mid f(X) \right\}$$
with $\eta$ representing a random variable that is conditionally independent of $X$ given $f(X)$. This model describes functional relationships with noise in the second coordinate only, where that noise can depend arbitrarily on the value of $f(X)$ but must be otherwise independent of $X$.

Kinney and Atwal prove that no non-trivial measure of dependence can be perfectly worst-case interpretable with respect to $R^2$ on the set $Q_K$. However, we note here that this result, while interesting, has two serious limitations. The first limitation, pointed out by Murrell et al. in the technical comment [25], is that $Q_K$ is extremely large: in particular, the fact that the noise term $\eta$ can depend arbitrarily on the value of $f(X)$ leads to identifiability issues such as obtaining the noiseless relationship $f(X) = X^2$ as a noisy version of $f(X) = X$. The more permissive (i.e., large) a model is, the easier it is to prove an impossibility result for it.
Since $Q_K$ is not contained in the other major models considered in, e.g., [1] and [16], it follows that this impossibility result does not imply impossibility for any of those models.

The second limitation of Kinney and Atwal’s result is that it only addresses perfect equitability rather than the more general, approximate notion with which we are primarily concerned. While a statistic that is perfectly equitable with respect to $R^2$ may indeed be difficult or even impossible to achieve for many large models $Q$ including some of the models in [1] and [16], such impossibility would make approximate equitability no less desirable a property. The question thus remains how equitable various measures are, both provably and empirically. To borrow an analogy from computer science, the fact that a problem is proven to be NP-complete does not mean that we do not want efficient algorithms for the problem; we simply may have to settle for approximate solutions. Similarly, there is merit in searching for measures of dependence that appear to be highly equitable with respect to $R^2$ in practice.

For more on this discussion, see the technical comment [18].

As a matter of record, we wish to clarify a confusion in Kinney and Atwal’s work. They write “The key claim made by Reshef et al. in arguing for the use of MIC as a dependence measure has two parts. First, MIC is said to satisfy not just the heuristic notion of equitability, but also the mathematical criterion of $R^2$-equitability...”, with the latter term referring to what we here define as perfect equitability [17]. However, such a claim was never made in our previous work [1]. Rather, that paper [1] informally defined equitability as an approximate notion and compared the equitability of MIC, mutual information estimation, and other schemes empirically, concluding not that MIC is perfectly equitable but rather that it is the most equitable statistic available in a variety of settings.
One method can be more equitable than another, even if neither method is perfectly equitable.

2.4 Quantifying equitability via interpretable intervals

Let us give a simple demonstration of how the formalism above can be used to empirically quantify equitability with respect to $R^2$ on a specific set of noisy functional relationships. We take as our statistic the sample correlation $\hat{\rho}$. Since this statistic is meant to detect linear dependencies, we do not expect it to be equitable on a broad class of relationships. In fact it is not even a measure of dependence, since its population value can be zero for relationships with non-trivial dependence. However, we analyze it here as an instructional example since it is widely used and gives intuitive scores. We analyze the equitability of other statistics in Section 5.

Figure 2a shows an analysis of the equitability with respect to $R^2$ of $\hat{\rho}$ at a sample size of $n = 500$ on the set
$$Q = \left\{ (X, f(X) + \varepsilon'_{\sigma}) : X \sim \mathrm{Unif},\ \varepsilon'_{\sigma} \sim \mathcal{N}(0, \sigma^2),\ f \in F,\ \sigma \in \mathbb{R}_{\ge 0} \right\}$$
where $F$ is a set of 16 functions analyzed in [16]. (See Appendix A.)

To evaluate the equitability of $\hat{\rho}$ in this context, we generate, for each function $f \in F$ and for 41 noise levels chosen for each function to correspond to $R^2$ values uniformly spaced in $[0, 1]$, independent samples of size $n = 500$ from the relationship $\mathcal{Z}_{f,\sigma} = (X, f(X) + \varepsilon'_{\sigma})$. We then evaluate $\hat{\rho}$ on each sample to estimate the 5th and 95th percentiles of the sampling distribution of $\hat{\rho}$ on $\mathcal{Z}_{f,\sigma}$. By taking, for each noise level, the maximal 95th percentile value and the minimal 5th percentile value across all $f \in F$, we obtain estimates of the $0.1$-reliable interval at each $R^2$ value. The interpretable intervals then follow from the reliable intervals as in Definition 2.2, and the worst-case equitability of $\hat{\rho}$ is the reciprocal of the length of the largest interpretable interval.

As expected, the interpretable intervals at many values of $\hat{\rho}$ are large. This is because our set of functions $F$ contains many non-linear functions, and so a given value of $\hat{\rho}$ can be assigned to relationships of different types with very different $R^2$ values.
This is shown by thepairs of thumbnails in the figure, each of which depicts two relationships with the same ˆ ρ butdifferent values of R . Thus, ˆ ρ has poor equitability with respect to R on this set Q . Incontrast, Figure 2b depicts the way this analysis would look if ρ were perfectly equitable: allthe interpretable intervals would have size 0. In this section we formalized the notion of equitability via the concepts of reliability and inter-pretability. Given a statistic ˆ ϕ and a measure of relationship strength Φ defined on some set Q of standard relationships, we constructed a set of intervals called the interpretable intervals ofˆ ϕ with respect to Φ. We constructed the interpretable intervals so they yield interval estimatesof Φ, and we then defined the (worst-case) equitability of ˆ ϕ to be the inverse of the size of thelargest interpretable interval.Strictly speaking, equitability simply requires that a natural set of confidence intervalsobtained from analyzing ˆ ϕ as an estimator of Φ be small. However, there is a subtlety here: sincein our setting Q typically contains several different relationship types, there are usually multiplerelationships in Q with a given value of Φ. This is different from the conventional frameworkof estimation of a parameter θ , in which we assume that there is exactly one distribution withany given value of θ , and we must account for this difference in our definitions.When Q is so small that this subtlety does not arise, equitability becomes a less richproperty. To see this, notice that if there is only one relationship in Q for every value of Φ,9 φ Ф ( e.g. R )Ф ( e.g. R ) Ф ( e.g. R ) Ф ( e.g. R ) Ф ( e.g. R ) φ ^ φ Ф ( e.g. R )Ф ( e.g. R ) Ф ( e.g. R ) Ф ( e.g. R ) Ф ( e.g. R ) φ (a) (b) Figure 2:
Examples of equitable and non-equitable behavior on a set of noisy functional relationships. (a)
The equitability with respect to R of the Pearson correlation coefficient ˆ ρ over the set Q ofrelationships described in Section 2.4, with n = 500. Each shaded region is an estimated 90% centralinterval of the sampling distribution of ˆ ρ for a given relationship at a given R . The fact that theinterpretable intervals of ˆ ρ are large indicates that a given ˆ ρ value could correspond to relationshipswith very different R values. This is illustrated by the pairs of thumbnails showing relationships withthe same ˆ ρ but different R values. The largest interpretable interval is indicated by a red line. Becauseit has width 1, the worst-case equitability with respect to R in this case is 1, the lowest possible. (b) A hypothetical population quantity ϕ that achieves perfect equitability in the large-sample limit.Here, the value of ϕ for each relationship type depends only on the R of the relationship and increasesmonotonically with R . Thus, ϕ can be used as a proxy for R on Q with no loss. Thumbnails areshown for sample relationships that have the same ϕ , which corresponds to the fact that they have equal R scores. See Appendix A for a legend of the function types used. then asymptotic monotonicity of ˆ ϕ with respect to Φ is sufficient for perfect equitability in thelarge-sample limit. In this scenario, the main obstacle to the equitability of ˆ ϕ is finite-sampleeffects, as with parameter estimation. For example, on the set Q of bivariate Gaussians, manymeasures of dependence are asymptotically perfectly equitable with respect to the correlation.However, this differs from the motivating data exploration scenario we consider, in which Q contains many different relationship types and there are multiple different relationshipscorresponding to a given value of Φ. Here, equitability can be hindered either by finite-sampleeffects, or by the differences in the asymptotic behavior of ˆ ϕ on different relationship types in Q . 
This is illustrated in Figure 3.

Regardless of the size of $\mathcal{Q}$ though, equitability is fundamentally meant for a situation in which we cannot simply estimate $\Phi$ directly. (In fact, if $\hat{\varphi}$ is a consistent estimator of $\Phi$ on $\mathcal{Q}$, it is trivially perfectly equitable in the large-sample limit.) This is because in data exploration we typically require that $\hat{\varphi}$ be a measure of dependence in order to obtain a minimal robustness guarantee, and this requirement makes it very difficult to make $\hat{\varphi}$ a consistent estimator of $\Phi$ on a large set $\mathcal{Q}$. For instance, suppose $\mathcal{Q}$ is a set of noisy functional relationships and $\Phi = R^2$. Here, computing the sample $R^2$ relative to a non-parametric estimate of the generating function will be asymptotically perfectly equitable. However, this approach is undesirable for data exploration because of its lack of robustness, as exemplified by the fact that it would assign a score of zero to, e.g., a circular relationship. Therefore, we are left with the problem of finding the next-best thing: a measure of dependence $\hat{\varphi}$ whose values have a clear, if approximate, interpretation in terms of $\Phi$. Equitability supplies us with a way of talking about how well $\hat{\varphi}$ does in this regard.

[Figure 3 panels: left column "Estimators" ($\hat{\theta}$ against $\theta$), right column "Equitability" ($\hat{\varphi}$ against $\Phi$); top row small $n$, bottom row $n \to \infty$.]

Figure 3: Equitability versus parameter estimation. The left-hand column depicts a scenario in which $\hat{\theta}$ estimates a parameter $\theta$, each value of which specifies a unique distribution. If the population value of $\hat{\theta}$ is monotonic in $\theta$, then the confidence intervals shown can be large only due to finite-sample effects. The right-hand column depicts a scenario in which $\hat{\varphi}$ is being used as an estimate of $\Phi$, but a given value of $\Phi$ does not uniquely determine the population value of $\hat{\varphi}$: the blue, red, and yellow each represent distinct sets of distributions in $\mathcal{Q}$ whose members can have identical values of $\Phi$. For instance, they might correspond to different function types. This is the setting in which we are operating, and the red intervals on the right are called interpretable intervals. Interpretable intervals can be large either because of finite-sample effects (as in the conventional estimation case) or because of the lack of interpretability of the population value of the statistic (shown in the bottom-right picture).

We close this section with the observation that, though we largely focused here on setting $\mathcal{Q}$ to be some set of noisy functional relationships, the appropriate definitions of $\mathcal{Q}$ and $\Phi$ may change from application to application. For instance, instead of functional relationships one may be interested in relationships supported on one-manifolds, with added noise. Or perhaps instead of $R^2$ one may decide to focus on the mutual information between the sampled y-values and the corresponding de-noised y-values [17], or on the fraction of deterministic signal in a mixture [26].
In each case the overarching goal should be to have $\mathcal{Q}$ be as large as possible without making it impossible to define an interesting $\Phi$ or making it impossible to find a measure of dependence that achieves good equitability on $\mathcal{Q}$ with respect to this $\Phi$. Finding such families $\mathcal{Q}$ and properties $\Phi$ is an important avenue of future work.

Equitability and statistical power
In the previous section we defined equitability in terms of interval estimation, and observed that the interpretable intervals of a statistic $\hat{\varphi}$ with respect to a property of interest $\Phi$ yield interval estimates of $\Phi$ on a set of distributions $\mathcal{Q}$. Given our construction of interpretable intervals via inversion of a set of hypothesis tests, it becomes natural to ask whether there is any connection between equitability and the power of those tests with respect to specific alternatives.

In this section we answer this question by showing that equitability can be equivalently formulated in terms of power with respect to a family of null hypotheses corresponding to different relationship strengths. This result re-casts equitability as a strengthening of power against statistical independence on $\mathcal{Q}$ and gives a second formal definition of equitability that is easily quantifiable using standard power analysis.

Henceforth, we fix the statistic $\hat{\varphi}$ and then use $R_\alpha(x)$ to denote the $\alpha$-reliable interval of $\hat{\varphi}$ at $x \in [0, 1]$ and $I_\alpha(y)$ to denote the $\alpha$-interpretable interval of $\hat{\varphi}$ at $y \in [0, 1]$.

Before stating and proving the relationship between equitability and power, let us first build some intuition for why it should hold. We begin by recalling that the reliable interval $R_\alpha(x)$ is an acceptance region of a two-sided level-$\alpha$ test of $H_0 : \Phi(\mathcal{Z}) = x$. Since the interval estimates obtained by inverting this test are the interpretable intervals of $\hat{\varphi}$, it makes sense to ask whether there is any property of these hypothesis tests that improves as the interpretability of the statistic $\hat{\varphi}$ increases. To see why the relevant property is power, let us consider the following illustrative question: what is the minimal $x > 0$ such that a right-tailed level-$\alpha$ test of $H_0 : \Phi = 0$ will have power at least $1 - \beta$ on $H_1 : \Phi = x$? As shown graphically in Figure 4, the answer can be stated in terms of the reliable and interpretable intervals of $\hat{\varphi}$.

Specifically, if $t_\alpha$ is the maximal element of $R_\alpha(0)$, then the minimal value of $\Phi$ at which a right-tailed test based on $\hat{\varphi}$ will achieve power $1 - \beta$ is $\Phi = \max I_\beta(t_\alpha)$, i.e., the maximal element of the $\beta$-interpretable interval at $t_\alpha$. So if the statistic is highly interpretable at $t_\alpha$, then we will be able to achieve high power against very small departures from the null hypothesis of independence. That is, good interpretability on $\mathcal{Q}$ implies good power against independence on $\mathcal{Q}$. It turns out that this reasoning holds in general and in both directions, as we establish below.

To be able to state our main result, we need to formally describe how equitability would be formulated in terms of power. This requires two definitions. The first is a definition of a power function that parametrizes the space of possible alternative hypotheses specifically by the property of interest. The second is a definition of a property of this power function called its uncertain set. It will turn out later that uncertain sets are interpretable intervals and vice versa.
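The illustrative question above (how small a strength $x$ can be detected with power $1 - \beta$ by a right-tailed level-$\alpha$ test of $\Phi = 0$) can also be answered directly by simulation. The sketch below is only a toy: the function names are hypothetical, and `toy_draws` assumes a made-up sampling model in which the statistic is Normal with mean `slope * x` and standard deviation 0.05 for two relationship types. It computes the critical value as the largest null quantile over types, then scans a grid for the first strength whose worst-case power reaches $1 - \beta$, rather than going through reliable and interpretable intervals.

```python
import random


def quantile(values, q):
    """Empirical q-quantile using a simple order-statistic rule."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def critical_value(null_draws_by_type, alpha):
    """Largest (1 - alpha) null quantile over relationship types, so the
    right-tailed test has level about alpha for every type."""
    return max(quantile(draws, 1 - alpha) for draws in null_draws_by_type)


def worst_case_power(alt_draws_by_type, cutoff):
    """Power against a composite alternative: the minimum rejection rate
    over relationship types."""
    return min(sum(v >= cutoff for v in draws) / len(draws)
               for draws in alt_draws_by_type)


def minimal_detectable_strength(draws, alpha, beta):
    """draws maps each strength x to a list (one entry per relationship
    type) of simulated statistic values. Returns the cutoff and the first
    grid strength x > 0 with worst-case power >= 1 - beta (or None)."""
    cutoff = critical_value(draws[0.0], alpha)
    for x in sorted(draws):
        if x > 0 and worst_case_power(draws[x], cutoff) >= 1 - beta:
            return cutoff, x
    return cutoff, None


def toy_draws(seed=0, trials=2000):
    """Hypothetical sampling model (an assumption, not from the paper):
    statistic ~ Normal(slope * x, 0.05) with slopes 1.0 and 0.5."""
    rng = random.Random(seed)
    grid = [i / 10 for i in range(11)]
    return {x: [[rng.gauss(slope * x, 0.05) for _ in range(trials)]
                for slope in (1.0, 0.5)]
            for x in grid}


if __name__ == "__main__":
    cutoff, x_min = minimal_detectable_strength(toy_draws(), 0.05, 0.05)
    print(f"critical value ~ {cutoff:.3f}, minimal detectable strength = {x_min}")
```

In this toy model the relationship type with the smaller slope determines the answer, which mirrors the point of the text: the detectable strength is governed by the worst-behaved relationship type at a given $\Phi$.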
We consider a one-sided test here, and henceforth in this section. The reason is that in practice, when $\Phi$ corresponds to relationship strength, we are interested in rejecting a null hypothesis representing weaker relationships. In such a situation, it is more common to perform a one-sided test. Nevertheless, results similar to those shown in this section can be derived for two-sided tests as well.

[Figure 4: plot with axes $\hat{\varphi}$ and $\Phi$, marking the reliable interval $R_\alpha(0)$, the critical value $t_\alpha = \max R_\alpha(0)$, and the point $\max I_\beta(t_\alpha)$.]

Figure 4: An illustration of the connection between equitability and power. In this example, we ask for the minimal $x > 0$ that allows a right-tailed test based on $\hat{\varphi}$ to achieve power $1 - \beta$ in distinguishing between $H_0 : \Phi = 0$ and $H_1 : \Phi = x$. The optimal critical value of such a test, denoted by $t_\alpha$, can be shown to be the maximal element of the reliable interval $R_\alpha(0)$, and the required $x$ can be shown to be the maximal element of the interpretable interval $I_\beta(t_\alpha)$, provided $\max R_\alpha(\cdot)$ is an increasing function. (The reliable and interpretable intervals pictured are for the case that $\alpha = \beta$.)

As before, let $\hat{\varphi}$ be a statistic, let $\mathcal{Q}$ be a set of standard relationships, and let $\Phi : \mathcal{Q} \to [0, 1]$ be a property of interest on $\mathcal{Q}$. Given a set of right-tailed tests based on the same test statistic, we refer to the one with the smallest critical value as the most permissive test.

Definition 3.1.
Fix $\alpha, x_0 \in [0, 1]$, and let $T^{x_0}_\alpha$ be the most permissive level-$\alpha$ right-tailed test based on $\hat{\varphi}$ of the (possibly composite) null hypothesis $H_0 : \Phi(\mathcal{Z}) = x_0$. For $x_1 \in [0, 1]$, define

$$K^{x_0}_\alpha(x_1) = \inf_{\mathcal{Z} \,:\, \Phi(\mathcal{Z}) = x_1} P\left(T^{x_0}_\alpha(Z) \text{ rejects}\right)$$

where $Z$ is a sample of size $n$ from $\mathcal{Z}$. That is, $K^{x_0}_\alpha(x_1)$ is the power of $T^{x_0}_\alpha$ with respect to the composite alternative hypothesis $H_1 : \Phi = x_1$. We call the function $K^{x_0}_\alpha : [0, 1] \to [0, 1]$ the level-$\alpha$ power function associated to $\hat{\varphi}$ at $x_0$ with respect to $\Phi$.

Note that in the above definition our null and alternative hypotheses may be composite since they are based on $\Phi$ and not on a complete parametrization of $\mathcal{Q}$. That is, $\mathcal{Z}$ can be one of several distributions with $\Phi(\mathcal{Z}) = x_0$ or $\Phi(\mathcal{Z}) = x_1$ respectively.

Under the assumption that $\Phi(\mathcal{Z}) = 0$ if and only if $\mathcal{Z}$ represents statistical independence, the power function $K^{0}_\alpha$ gives the power of optimal level-$\alpha$ right-tailed tests based on $\hat{\varphi}$ at distinguishing various non-zero values of $\Phi$ from statistical independence across the different relationship types in $\mathcal{Q}$. One way to view the main result of this section is that the set of power functions at values of $x_0$ besides 0 contains more information than the power of right-tailed tests based on $\hat{\varphi}$ against the null hypothesis of $\Phi = 0$, and that this information can be equivalently viewed in terms of interpretable intervals. Specifically, we can recover the interpretability of $\hat{\varphi}$ at every $y \in [0, 1]$ by considering its power functions at values of $x_0$ beyond 0. Let us now define the precise aspect of the power functions associated to $\hat{\varphi}$ that will allow us to do this.

Definition 3.2. The uncertain set of a power function $K^{x_0}_\alpha$ is the set $\{x_1 \ge x_0 : K^{x_0}_\alpha(x_1) < 1 - \alpha\}$.

The main result of this section will be that uncertain sets are interpretable intervals and vice versa.

Our proof of the alternate characterization of equitability in terms of power requires two short lemmas. The first shows a connection between the maximum element of a reliable interval and the minimal element of an interpretable interval, namely that these two operations are inverses of each other.
Lemma 3.3.
Given a statistic $\hat{\varphi}$, a property of interest $\Phi$, and some $\alpha \in [0, 1]$, define $f(x) = \max R_\alpha(x)$ and $g(y) = \min I_\alpha(y)$. If $f$ is strictly increasing, then $f$ and $g$ are inverses of each other.

Proof. Let $y = f(x) = \max R_\alpha(x)$. We know that $\min I_\alpha(y) \le x$, for if it were greater than $x$ then we would have that $x \notin I_\alpha(y)$, which would imply that $y \notin R_\alpha(x)$, contradicting the definition of $y$. On the other hand, we cannot have $\min I_\alpha(y) < x$, because this would imply that there is some $x' < x$ such that $y \in R_\alpha(x')$, meaning that $\max R_\alpha(x') \ge y = \max R_\alpha(x)$, which contradicts the fact that $f$ is strictly increasing.

The second lemma gives the connection between reliable intervals and hypothesis testing that we will exploit in our proof.

Lemma 3.4. Fix a statistic $\hat{\varphi}$, a property of interest $\Phi$, and some $\alpha, x_0 \in [0, 1]$. The most permissive level-$(\alpha/2)$ right-tailed test based on $\hat{\varphi}$ of the null hypothesis $H_0 : \Phi(\mathcal{Z}) = x_0$ has critical value $\max R_\alpha(x_0)$.

Proof. We seek the smallest critical value that yields a level-$(\alpha/2)$ test. This would be the supremum, over all $\mathcal{Z}$ with $\Phi(\mathcal{Z}) = x_0$, of the $(1 - \alpha/2) \cdot 100\%$ percentile of $\hat{\varphi}$ when applied to a sample $Z$ from $\mathcal{Z}$. By definition this is $\max R_\alpha(x_0)$.

We are now ready to prove our main result, which is the following equivalent characterization of equitability in terms of statistical power.
Theorem 3.5.
Fix a set $\mathcal{Q} \subset \mathcal{P}$, a function $\Phi : \mathcal{Q} \to [0, 1]$, and $0 < \alpha < 1/2$. Let $\hat{\varphi}$ be a statistic with the property that $\max R_{2\alpha}(x)$ is a strictly increasing function of $x$. Then for all $d > 0$, the following are equivalent.

1. $\hat{\varphi}$ is worst-case $1/d$-interpretable with respect to $\Phi$ with confidence $1 - 2\alpha$.
2. For every $x_0, x_1 \in [0, 1]$ satisfying $x_1 - x_0 > d$, there exists a level-$\alpha$ right-tailed test based on $\hat{\varphi}$ that can distinguish between $H_0 : \Phi(\mathcal{Z}) \le x_0$ and $H_1 : \Phi(\mathcal{Z}) \ge x_1$ with power at least $1 - \alpha$.

Theorem 3.5 can be seen to follow from the proposition below, applied with $2\alpha$ in place of $\alpha$.
Proposition 3.6.
Fix $0 < \alpha < 1$ and $d > 0$, and suppose $\hat{\varphi}$ is a statistic with the property that $\max R_\alpha(x)$ is a strictly increasing function of $x$. Then for $y \in [0, 1]$, the interval $I_\alpha(y)$ equals the closure of the uncertain set of $K^{x_0}_{\alpha/2}$ for $x_0 = \min I_\alpha(y)$. Equivalently, for $x_0 \in [0, 1]$, the closure of the uncertain set of $K^{x_0}_{\alpha/2}$ equals $I_\alpha(y)$ for $y = \max R_\alpha(x_0)$.

[Figure 5: top panel plots $\hat{\varphi}$ against $\Phi$ with $y$, $x_0$, and the interval $I_\alpha(y)$ marked; bottom panel plots the power function $K^{x_0}_{\alpha/2}(x_1)$ against $\Phi$ with the level $1 - \alpha/2$ and the point $x_0 + |I_\alpha(y)|$ marked.]

Figure 5: The relationship between equitability and power, as in Proposition 3.6. The top plot is the same as the one in Figure 1a, with the indicated interval denoting the interpretable interval $I_\alpha(y)$. The bottom plot is a plot of the power function $K^{x_0}_{\alpha/2}(x_1)$, with the y-axis indicating statistical power. The key to the proof of the proposition is to notice that the width of the interpretable interval describes the distance from $x_0$ to the point at which the power function reaches $1 - \alpha/2$, and this is exactly the width of the uncertain set of the power function. (Notice that because the null and alternative hypotheses are composite, $K^{x_0}_{\alpha/2}(x_0)$ need not equal $\alpha/2$; in general it may be lower.)
An illustration of this proposition and its proof is shown in Figure 5.
Proof.
The equivalence of the two statements follows from Lemma 3.3, which states that $y = \max R_\alpha(x_0)$ if and only if $x_0 = \min I_\alpha(y)$. We therefore prove only the first statement, namely that $I_\alpha(y)$ is the closure of the uncertain set of $K^{x_0}_{\alpha/2}$ for $x_0 = \min I_\alpha(y)$.

Let $U$ be the uncertain set of $K^{x_0}_{\alpha/2}$. We prove the claim by showing first that $\inf U = \min I_\alpha(y)$, and then that $\sup U = \max I_\alpha(y)$.

To see that $\inf U = \min I_\alpha(y)$, we simply observe that because $\alpha/2 < 1/2$, we have $K^{x_0}_{\alpha/2}(x_0) \le \alpha/2 < 1 - \alpha/2$, which means that $U$ is non-empty, and so by construction its infimum is $x_0$, which we have assumed equals $\min I_\alpha(y)$.

Let us now show that $\sup U \ge \max I_\alpha(y)$: by the definition of the interpretable interval, we can find $x_1$ arbitrarily close to $\max I_\alpha(y)$ from below such that $y \in R_\alpha(x_1)$. But this means that there exists some $\mathcal{Z}$ with $\Phi(\mathcal{Z}) = x_1$ such that if $Z$ is a sample of size $n$ from $\mathcal{Z}$ then

$$P(\hat{\varphi}(Z) < y) \ge \frac{\alpha}{2}, \qquad P(\hat{\varphi}(Z) \ge y) < 1 - \frac{\alpha}{2}.$$

But since as we already noted $y = \max R_\alpha(x_0)$, Lemma 3.4 tells us that it is the critical value of the most permissive level-$(\alpha/2)$ right-tailed test of $H_0 : \Phi(\mathcal{Z}) = x_0$. Therefore, $K^{x_0}_{\alpha/2}(x_1) < 1 - \alpha/2$, meaning that $x_1 \in U$.

It remains only to show that $\sup U \le \max I_\alpha(y)$. To do so, we note that $y \notin R_\alpha(x_1)$ for all $x_1 > \max I_\alpha(y)$. This implies that either $y > \max R_\alpha(x_1)$ or $y < \min R_\alpha(x_1)$. However, since $y \in R_\alpha(x_0)$ and $\max R_\alpha(\cdot)$ is an increasing function, no $x_1 > x_0$ can have $y > \max R_\alpha(x_1)$. Thus the only option remaining is that $y < \min R_\alpha(x_1)$. This means that if $Z$ is a sample of size $n$ from any $\mathcal{Z}$ with $\Phi(\mathcal{Z}) = x_1 > \max I_\alpha(y)$, then

$$P(\hat{\varphi}(Z) < y) < \frac{\alpha}{2}, \qquad P(\hat{\varphi}(Z) \ge y) \ge 1 - \frac{\alpha}{2}.$$

As above, this implies that $K^{x_0}_{\alpha/2}(x_1) \ge 1 - \alpha/2$, which means that $x_1 \notin U$, as desired.

Theorem 3.5 gives us an alternative to measuring equitability via lengths of interpretable intervals. Instead, for every $x_0 \in [0, 1)$ and for every $x_1 > x_0$, we can use many samples of size $n$ to estimate the power of right-tailed tests based on $\hat{\varphi}$ at distinguishing $H_0 : \Phi = x_0$ from $H_1 : \Phi = x_1$. This process is illustrated schematically in Figure 6. In that figure, good equitability corresponds to high power on pairs $(x_0, x_1)$ even when $x_1 - x_0$ is small.

In this section, we gave a characterization of equitability in terms of statistical power with respect to a family of null hypotheses corresponding to different relationship strengths. (See Theorem 3.5.) This characterization shows what the concept of equitability/interpretability is fundamentally about: being able to distinguish not just signal ($\Phi > 0$) from no signal ($\Phi = 0$) but also stronger signal ($\Phi = x_1$) from weaker signal ($\Phi = x_0$), and being able to do so across relationships of different types. This indeed makes sense when a data set contains an overwhelming number of heterogeneous relationships that exhibit, say, $\Phi(\mathcal{Z}) = 0.$[…] $\Phi(\mathcal{Z}) = 0.$[…]

[…] $\mathcal{Q}$ is a set of noisy functional relationships and the property of interest is $R^2$. In this setting, the conventional way to assess a measure of dependence would be through analysis of its power with respect to a null hypothesis of independence and with a simple alternative hypothesis. Such an analysis would consider, say, right-tailed tests based on the statistic $\hat{\varphi}$ and evaluate their power at rejecting the null hypothesis of $R^2 = 0$, i.e. statistical independence, first on linear relationships with varying noise levels, then separately on exponential relationships with varying noise levels, and so on.

[Figure 6 panels: (top) sampling distributions of $\hat{\varphi}$ for linear and parabolic relationships under $H_0 : \Phi = 0.3$ and $H_1 : \Phi = 0.6$, with the critical value $c$ at $\alpha = 0.05$; (bottom) power against $x_1 \in [0, 1]$ for $H_0 : \Phi = 0.3$, shown as a curve and as heat maps over $x_0, x_1 \in [0, 1]$ (red = powerful).]

Figure 6:
A schematic illustration of the visualization of equitability via statistical power. (Top) A depiction of the sampling distributions of a test statistic $\hat{\varphi}$ when a data set contains only four relationships: a parabolic and a linear relationship with $\Phi = 0.3$, and a parabolic and a linear relationship with $\Phi = 0.6$. The dashed line represents the critical value of the most permissive level-$\alpha$ right-tailed test of $H_0 : \Phi = 0.3$. (Bottom left) The power function of the most permissive level-$\alpha$ right-tailed test based on a statistic $\hat{\varphi}$ of the null hypothesis $H_0 : \Phi = 0.3$. The curve shows the power of the test as a function of $x_1$, the value of $\Phi$ that defines the alternative hypothesis. (Bottom middle) The power function can be depicted instead as a heat map. (Bottom right) Instead of considering just one null hypothesis, we can consider a set of null hypotheses (with corresponding critical values) of the form $H_0 : \Phi = x_0$ and plot each of the resulting power curves as a heat map. The result is a plot in which the intensity of the color in the coordinate $(x_0, x_1)$ corresponds to the power of the size-$\alpha$ right-tailed test based on $\hat{\varphi}$ at distinguishing $H_0 : \Phi = x_0$ from $H_1 : \Phi = x_1$. A statistic is $1/d$-equitable with confidence $1 - \alpha$ if this power surface attains the value $1 - \alpha$ within distance $d$ of the diagonal along each row. In other words, the redder the triangle appears, the higher the equitability of $\hat{\varphi}$.

In contrast, our result shows that for $\hat{\varphi}$ to be $1/d$-equitable, it must yield right-tailed tests with high power at distinguishing null hypotheses of the form $R^2 \le x_0$ from alternative hypotheses of the form $R^2 \ge x_1$ for any $x_1 > x_0 + d$. This is more stringent than the conventional analysis described above for the following three reasons.

1. Instead of just one null hypothesis (i.e., $x_0 = 0$), there are many possible values of $x_0$ corresponding to different $R^2$ values.
2. Each of the new null hypotheses can be composite since $\mathcal{Q}$ can contain relationships of many different types (e.g. noisy linear, noisy sinusoidal, and noisy parabolic). Whereas for many measures of dependence all of these relationships may have reduced to a single null hypothesis of statistical independence in the case of $R^2 = 0$, they yield composite null hypotheses once we allow $R^2$ to be non-zero.
3. The alternative hypotheses here are also composite, since each one similarly consists of several different relationship types with the same $R^2$. Whereas conventional analysis of power against independence considers only one alternative at a time, here we require that tests simultaneously have good power on sets of alternatives with the same $R^2$.

This understanding of equitability is both good news and bad news.
On the one hand, it provides us with a concrete sense of the relationship of equitability to power against independence, which has been the more traditional way of evaluating measures of dependence. In so doing, it also makes clear the motivation behind equitability and the cases in which it is useful. On the other hand, however, the understanding that equitability corresponds to power against a much larger set of null hypotheses suggests, via “no free lunch”-type considerations, that if we want to achieve higher power against this larger set of null hypotheses, we may need to give up some power against independence. And indeed, in [16] we demonstrate empirically that such a trade-off does seem to exist for several measures of dependence.

However, there are situations in which it may be desirable to give up some power against independence in exchange for a degree of equitability. For instance, recall the analysis [14] of the gene expression data set discussed earlier in this paper. In that analysis, not only did several measures of dependence each detect thousands of significant relationships after correction for multiple hypothesis testing, but there was also an overlap of over 85% among the relationships detected by the five best-performing methods. In data exploration scenarios such as this one, in which existing measures of dependence reliably identify so many relationships, focusing on additional gains in power against independence appears to be less of a priority than deciding how to choose among the large number of relationships already detected.

The primary motivation given for equitability is that often data sets contain so many relationships that we are not interested in all deviations from independence but rather only in the strongest few relationships. However, there are also many data sets in which, due to low sample size, multiple-testing considerations, or relative lack of structure in the data, very few relationships pass significance.
Alternatively, there are also settings in which equitability is too ambitious even at large sample sizes. In such settings, we may indeed be interested in simply detecting deviations from independence rather than ranking them by strength.

In this situation, there is still cause for concern about the effect on our results of our choice of test statistic $\hat{\varphi}$. For instance, it is easy to imagine that, despite asymptotic guarantees, an independence test will suffer from low power even on strong relationships of a certain type at a finite sample size $n$ because the test statistic systematically assigns lower scores to relationships of that type. To avoid this, we might want a guarantee that, at a sample size of $n$, the test has a given amount of power in detecting relationships whose strength as measured by $\Phi$ is above a certain threshold, across a broad range of relationship types. This would ensure that, even if we cannot rank relationships by strength, we at least will not miss important relationships as a result of the statistic we use.

In this section we show a straightforward connection between equitability as defined above and this desideratum, which we call low detection threshold. In particular, we show via the alternate characterization of equitability proven in the previous section that low detection threshold is a straightforward consequence of high equitability. Since the converse does not hold, low detection threshold may be a reasonable criterion to use in situations in which equitability is too much to ask.

Given a set $\mathcal{Q}$ of standard relationships, and a property of interest $\Phi$, we define low detection threshold as follows.

Definition 4.1. A statistic $\hat{\varphi}$ has a $(1 - \beta)$-detection threshold of $d$ at level $\alpha$ with respect to $\Phi$ on $\mathcal{Q}$ if there exists a level-$\alpha$ right-tailed test based on $\hat{\varphi}$ of the null hypothesis $H_0 : \Phi(\mathcal{Z}) = 0$ whose power on $H_1 : \mathcal{Z}$ at a sample size of $n$ is at least $1 - \beta$ for all $\mathcal{Z} \in \mathcal{Q}$ with $\Phi(\mathcal{Z}) > d$.

The connection between equitability and low detection threshold is then a straightforward corollary of Theorem 3.5.

Corollary 4.2. Fix some $0 < \alpha < 1/2$, let $\hat{\varphi}$ be worst-case $1/d$-interpretable with respect to $\Phi$ on $\mathcal{Q}$ with confidence $1 - 2\alpha$, and assume that $\max R_{2\alpha}(\cdot)$ is a strictly increasing function. Then $\hat{\varphi}$ has a $(1 - \alpha)$-detection threshold of $d$ at level $\alpha$ with respect to $\Phi$ on $\mathcal{Q}$.

Assume that $\Phi$ has the property that it is zero precisely in cases of statistical independence. Then the above corollary says that equitability and interpretability, to the extent they can be achieved, make strong guarantees about power against independence on $\mathcal{Q}$. On the other hand, it is easy to see that low detection threshold need not imply equitability. Therefore, minimal power against independence is a strictly weaker criterion than equitability.

The connection between equitability and detection threshold with respect to $\Phi$ is important because there exist situations in which equitability may be difficult to achieve but in which we still want some sort of guarantee about the robustness of our power against independence to changes in relationship type. This general theme of not missing relationships because of their type is the intuitive heart of equitability, and the above corollary shows how this conception might be utilized in other ways.

Another way that low detection threshold arises naturally is if we pre-filter our data set using some independence test before conducting a more fine-grained analysis with a second statistic. In that case, low detection threshold ensures that we will not “throw out” important relationships prematurely just because of their relationship type. In our companion paper [16], we propose precisely such a scheme, and we analyze the detection threshold of the preliminary test in question to argue that the scheme will perform well.
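As a rough illustration of Definition 4.1, a detection threshold can be read off a table of estimated power values. The sketch below is hypothetical: the function name, the grid of strengths, and the power numbers are all made up for illustration and are not results from this paper. It assumes that, for each relationship type, the power of a fixed level-$\alpha$ test of $\Phi = 0$ has already been estimated on a grid of $\Phi$ values.

```python
def detection_threshold(strengths, power_by_type, beta):
    """Smallest grid value d such that every relationship type has power
    >= 1 - beta at all grid strengths strictly greater than d.

    strengths: increasing grid of Phi values.
    power_by_type: dict mapping a type name to a list of estimated power
        values, aligned with `strengths`.
    """
    failing = [
        x for i, x in enumerate(strengths)
        if min(curve[i] for curve in power_by_type.values()) < 1 - beta
    ]
    # Power need not be monotone in Phi, so the threshold is the largest
    # failing strength, not the first strength at which the test succeeds.
    return max(failing) if failing else 0.0


if __name__ == "__main__":
    # Made-up power estimates for three relationship types.
    strengths = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    power = {
        "linear":     [0.05, 0.60, 0.97, 1.00, 1.00, 1.00],
        "parabolic":  [0.05, 0.30, 0.70, 0.96, 1.00, 1.00],
        "sinusoidal": [0.05, 0.96, 0.90, 0.97, 1.00, 1.00],
    }
    print(detection_threshold(strengths, power, beta=0.05))
```

Taking the minimum over relationship types reflects the requirement that the power guarantee hold for all $\mathcal{Z} \in \mathcal{Q}$ with $\Phi(\mathcal{Z}) > d$; on a finite grid the result is only an approximation to the threshold in the definition.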
Having defined equitability and seen how it can be interpreted in terms of power, we now consider the equitability on a set of noisy functional relationships of some commonly used methods: the maximal information coefficient as estimated by MIC$_e$ [4], distance correlation [5, 24, 27], and mutual information [21, 22] as estimated using the Kraskov estimator [6]. In this analysis, we use $\Phi = R^2$ as our property of interest, $n = 500$ as our sample size, and

$$\mathcal{Q} = \left\{ (x + \varepsilon_\sigma, f(x) + \varepsilon'_\sigma) : x \in X_f,\ \varepsilon_\sigma, \varepsilon'_\sigma \sim \mathcal{N}(0, \sigma^2),\ f \in F,\ \sigma \in \mathbb{R}_{\ge 0} \right\}$$

where $\varepsilon_\sigma$ and $\varepsilon'_\sigma$ are i.i.d., $F$ is the set of functions in Appendix A, and $X_f$ is the set of $n$ x-values that result in the points $(x_i, f(x_i))$ being equally spaced along the graph of $f$.

The results of the analysis are shown in Figure 7. The figure visualizes the analysis via both interpretable intervals and statistical power. By Theorem 3.5, these two viewpoints are equivalent, and they are both shown here in order to help the reader build intuition for this equivalence. For instance, the worst-case 0.1-equitability of MIC$_e$ here is 2.92, because the widest interpretable interval is of size $1/2.92 \approx 0.34$. And indeed, MIC$_e$ yields right-tailed tests with power $1 - 0.1/2 = 0.95$ at distinguishing any null hypothesis of the form $H_0 : R^2(\mathcal{Z}) = x_0$ from any alternative hypothesis of the form $H_1 : R^2(\mathcal{Z}) = x_1$ provided $x_1 - x_0 > 1/2.92 \approx 0.34$. The worst-case equitability of 2.92 achieved by MIC$_e$ on this $\mathcal{Q}$ is the highest among the methods examined. In contrast, the equitabilities with respect to $R^2$ of distance correlation and mutual information estimation on this $\mathcal{Q}$ are 1 and 1.04, respectively. For a more extensive analysis that varies the sample size as well as noise model and marginal distributions, and compares many more methods, see [16].

Figure 7: An analysis of the equitability with respect to $R^2$ of three measures of dependence on a set of functional relationships. The set of relationships used is described in Section 5. Each column contains results for the indicated measure of dependence. (Top) The analysis visualized via interpretable intervals as in Figure 2. [Narrower is more equitable.] The worst-case and average-case widths of the 0.1-interpretable intervals are indicated. (Bottom) The same analysis visualized via statistical power as in Figure 6. [Redder is more equitable.] The average power across all pairs of null and alternative hypotheses is computed for each plot. For a legend describing which functional relationships were analyzed and which parameters were used for each method, see Appendix A.
Informally, given some measure $\Phi$ of relationship strength, the equitability of a measure of dependence $\hat{\varphi}$ with respect to $\Phi$ is the degree to which $\hat{\varphi}$ allows us to draw inferences about relationship strength across a broad set of relationship types. We give here a conceptual framework to motivate equitability and then discuss the contributions of this work.

There are two different ways to motivate equitability. The first is to begin with a measure of dependence $\hat{\varphi}$ and to observe that, though $\hat{\varphi}$ will asymptotically allow us to detect all deviations from independence in a data set, it need not tell us anything about the strength of those relationships. Since it often happens that we detect many more relationships than can be realistically followed up, it would be desirable to have $\hat{\varphi}$ tell us something not just about the presence or absence of a relationship, but also about relationship strength as defined by $\Phi$ on at least a partial set of “standard relationships” $\mathcal{Q}$.

The second way is to suppose that $\hat{\varphi}$ is a consistent estimator of $\Phi$ on $\mathcal{Q}$ and to ask “what is the minimal requirement we can add to ensure that $\hat{\varphi}$ is robust to detecting relationships outside of $\mathcal{Q}$?” Perhaps the weakest stipulation we can impose is that the population value $\varphi$ of our statistic be non-zero in cases of non-trivial dependence of any sort. That is, we want $\hat{\varphi}$ to be a measure of dependence as well.

Both of these scenarios would be resolved by a measure of dependence that is also a consistent estimator of $\Phi$.
However, in many interesting cases there is no known statistic satisfying both properties: for instance, if $\mathcal{Q}$ is a set of noisy functional relationships and $\Phi$ is $R^2$, then on the one hand computing the sample $R^2$ with respect to a non-parametric estimate of the generating function will be a consistent estimator of $\Phi$, but will give a score of 0 to a circle. And on the other hand, no measure of dependence is known also to be a consistent estimator of $R^2$ on noisy functional relationships.

This naturally leads us to wonder whether, despite the difficulty of simultaneously estimating $\Phi$ consistently and retaining the properties of a measure of dependence, we can at least seek an approximate version of this ideal. Doing so, however, requires a weaker requirement than consistent estimation. This is what leads us to equitability. Equitability allows us to seek statistics that have the robustness of measures of dependence but that also, via their relationship to a property of interest $\Phi$, give values that have a clear, if approximate, interpretation and can therefore be used to rank relationships.

In this paper, we formalized and developed the theory of equitability in three ways. We first defined the equitability of a statistic $\hat{\varphi}$ on $\mathcal{Q}$ with respect to $\Phi$ as the extent to which $\hat{\varphi}$ gives us good interval estimates of $\Phi$ on $\mathcal{Q}$. Our definition rests on an object called the interpretable interval, which has coverage guarantees with respect to $\Phi$. We define $\hat{\varphi}$ to be equitable if all of its interpretable intervals are small.

Second, we showed that this formalization of equitability can be equivalently stated in terms of power against a specific set of null hypotheses corresponding to different relationship strengths.
That is, while measures of dependence have conventionally been judged by their power at distinguishing non-trivial signal from statistical independence, equitability is equivalent to the stronger property of being able to distinguish different degrees of possibly non-trivial signal strength from each other.

Third, we defined a concept called low detection threshold, which stipulates that, at a fixed sample size, a statistic yield independence tests with a guaranteed minimal power to detect relationships whose strength passes a certain threshold, across a range of relationship types. We showed that low detection threshold is a straightforward consequence of equitability. Since the converse does not hold, low detection threshold is a natural weaker criterion that one could aim for when equitability proves difficult to achieve.

Our formalization and its results serve three primary purposes. The first is to provide a framework for rigorous discussion and exploration of equitability and related concepts. The second is to situate equitability in the context of interval estimation and hypothesis testing and to clarify its relationship to central concepts in those areas such as confidence and statistical power. The third is to show that equitability and the language developed around it can help us to both formulate and achieve other useful desiderata for measures of dependence.

These connections provide a framework for thinking about the utility of both current and future measures of dependence for exploratory data analysis. Power against independence, the lens through which measures of dependence are currently evaluated, is appropriate in many settings in which very few significant relationships are expected, or in which we want to know whether one specific relationship is non-trivial or not.
However, in situations in which most measures of dependence already identify a large number of relationships, a rigorous theory of equitability will allow us to begin to assess when we can glean more information from a given measure of dependence than just the binary result of an independence test.

Of course, there is much left to understand about equitability. For instance, to what extent is it achievable for different properties of interest? What are natural and useful properties of interest for sets Q besides noisy functional relationships? For common statistics such as MIC [1] or MIC_e [4], can we obtain a theoretical characterization of the sets Q for which good equitability with respect to R² is achieved? Are there systematic ways of obtaining equitable behavior via a learning framework as was done for causation in [28]? These questions all deserve attention.

Equitability as framed here is certainly not the only goal toward which we should strive in developing new measures of dependence. As data sets not only grow in size but also become more varied, new and interesting use cases for measures of dependence will undoubtedly develop, each with its own way of assessing success. Regardless of which particular modes of assessment are used, it is important that we formulate and explore concepts that move beyond power against independence, at least in the bivariate setting. Equitability provides one approach to coping with the changing nature of data exploration, but more generally, we can and should ask more of measures of dependence.

The authors would like to acknowledge R Adams, E Airoldi, H Finucane, A Gelman, M Gorfine, R Heller, J Huggins, J Mueller, and R Tibshirani for constructive conversations and useful feedback.
References

[1] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[2] J. D. Storey and R. Tibshirani, “Statistical significance for genomewide studies,” Proceedings of the National Academy of Sciences, vol. 100, no. 16, pp. 9440–9445, 2003.
[3] V. Emilsson, G. Thorleifsson, B. Zhang, A. S. Leonardson, F. Zink, J. Zhu, S. Carlson, A. Helgason, G. B. Walters, S. Gunnarsdottir, et al., “Genetics of gene expression and its effect on disease,” Nature, vol. 452, no. 7186, pp. 423–428, 2008.
[4] Y. A. Reshef, D. N. Reshef, H. K. Finucane, P. C. Sabeti, and M. Mitzenmacher, “Measuring dependence powerfully and equitably,” arXiv preprint arXiv:1505.02213, 2015.
[5] G. J. Székely, M. L. Rizzo, N. K. Bakirov, et al., “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
[6] A. Kraskov, H. Stogbauer, and P. Grassberger, “Estimating mutual information,” Physical Review E, vol. 69, 2004.
[7] L. Breiman and J. H. Friedman, “Estimating optimal transformations for multiple regression and correlation,” Journal of the American Statistical Association, vol. 80, no. 391, pp. 580–598, 1985.
[8] W. Hoeffding, “A non-parametric test of independence,” The Annals of Mathematical Statistics, pp. 546–557, 1948.
[9] R. Heller, Y. Heller, and M. Gorfine, “A consistent multivariate test of association based on ranks of distances,” Biometrika, vol. 100, no. 2, pp. 503–510, 2013.
[10] B. Jiang, C. Ye, and J. S. Liu, “Non-parametric k-sample tests via dynamic slicing,” Journal of the American Statistical Association, no. just-accepted, pp. 00–00, 2014.
[11] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Algorithmic Learning Theory, pp. 63–77, Springer, 2005.
[12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
[13] D. Lopez-Paz, P. Hennig, and B. Schölkopf, “The randomized dependence coefficient,” in Advances in Neural Information Processing Systems, pp. 1–9, 2013.
[14] R. Heller, Y. Heller, S. Kaufman, B. Brill, and M. Gorfine, “Consistent distribution-free k-sample and independence tests for univariate random variables,” arXiv preprint arXiv:1410.6758, 2014.
[15] A. Rényi, “On measures of dependence,” Acta Mathematica Hungarica, vol. 10, no. 3, pp. 441–451, 1959.
[16] D. N. Reshef, Y. A. Reshef, P. C. Sabeti, and M. Mitzenmacher, “An empirical study of leading measures of dependence,” arXiv preprint arXiv:1505.02214, 2015.
[17] J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, 2014.
[18] D. N. Reshef, Y. A. Reshef, M. Mitzenmacher, and P. C. Sabeti, “Cleaning up the record on the maximal information coefficient and equitability,” Proceedings of the National Academy of Sciences, 2014.
[19] N. Simon and R. Tibshirani, “Comment on “Detecting novel associations in large data sets”,” ∼tibs/reshef/comment.pdf on 11 Nov. 2012), 2012.
[20] G. Casella and R. L. Berger, Statistical Inference, vol. 2. Duxbury Pacific Grove, CA, 2002.
[21] T. Cover and J. Thomas, Elements of Information Theory. New York: John Wiley & Sons, Inc, 2006.
[22] I. Csiszár, “Axiomatic characterizations of information measures,” Entropy, vol. 10, no. 3, pp. 261–273, 2008.
[23] E. Linfoot, “An informational measure of correlation,” Information and Control, vol. 1, no. 1, pp. 85–89, 1957.
[24] G. Szekely and M. Rizzo, “Brownian distance covariance,” The Annals of Applied Statistics, vol. 3, no. 4, pp. 1236–1265, 2009.
[25] B. Murrell, D. Murrell, and H. Murrell, “R2-equitability is satisfiable,” Proceedings of the National Academy of Sciences, 2014.
[26] A. A. Ding and Y. Li, “Copula correlation: An equitable dependence measure and extension of Pearson’s correlation,” arXiv preprint arXiv:1312.7214, 2013.
[27] X. Huo and G. J. Szekely, “Fast computing for distance covariance,” arXiv preprint arXiv:1410.1503, 2014.
[28] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin, “Towards a learning theory of causation,” in International Conference on Machine Learning (ICML), 2015.
A Details of analyses
A.1 Functions analysed in Figures 2 and 7
Below is the legend showing which function types correspond to the colors in each of Figures 2 and 7. The functions used are the same as the ones in the equitability analyses of [16].
The legend for Figures 2 and 7.
A.2 Parameters used in Figure 7
In the analysis of the equitability of MIC_e, distance correlation, and mutual information, the following parameter choices were made: for MIC_e, α = 0 . c = 5 were used; for distance correlation no parameter is required; and for mutual information estimation via the Kraskov estimator, k = 6 was used. The parameters chosen were the ones that maximize overall equitability in the detailed analyses performed in [16]. For mutual information, the choice of k = 6 (out of the parameters tested: k = 1 , , ,