Estimating Conditional Mutual Information for Discrete-Continuous Mixtures using Multi-Dimensional Adaptive Histograms
Alexander Marx∗† Lincen Yang∗‡ Matthijs van Leeuwen‡

∗ Equal contribution (ordered alphabetically).
† CISPA Helmholtz Center for Information Security and Max Planck Institute for Informatics, Saarland University. [email protected]
‡ Leiden Institute of Advanced Computer Science, Leiden University. {l.yang,m.van.leeuwen}@liacs.leidenuniv.nl

Abstract
Estimating conditional mutual information (CMI) is an essential yet challenging step in many machine learning and data mining tasks. Estimating CMI from data that contains both discrete and continuous variables, or even discrete-continuous mixture variables, is a particularly hard problem. In this paper, we show that CMI for such mixture variables, defined based on the Radon-Nikodym derivative, can be written as a sum of entropies, just like CMI for purely discrete or continuous data. Further, we show that CMI can be consistently estimated for discrete-continuous mixture variables by learning an adaptive histogram model. In practice, we estimate such a model by iteratively discretizing the continuous data points in the mixture variables. To evaluate the performance of our estimator, we benchmark it against state-of-the-art CMI estimators and also evaluate it in a causal discovery setting.
1 Introduction

In many research areas, such as classification [16], feature selection [34], and causal discovery [29], estimating the strength of a dependence plays a key role. A theoretically appealing way to measure dependencies is through mutual information (MI), since it has several important properties, such as the chain rule, the data processing inequality, and—last but not least—it is zero if (and only if) two random variables are independent of each other [4]. For structure identification, such as causal discovery, conditional mutual information (CMI) is even more interesting, since it can help to distinguish between different graph structures. For instance, in a simple Markov chain X → Z → Y, X and Y may be dependent, but are rendered independent given Z. Vice versa, a collider structure such as X → Z ← Y may introduce a dependence between two marginally independent variables X and Y when conditioned on Z.

While estimating (conditional) mutual information for purely discrete or continuous data is a well-studied problem [4, 5, 8, 10, 22], many real-world settings concern a mix of discrete and continuous random variables, such as age (in years) and height, or even random variables that can individually consist of a mixture of discrete and continuous components. Although several discretization-based approaches that can estimate MI for a mix of discrete and continuous random variables have recently emerged [2, 17, 31], so far only methods based on k-nearest neighbour (kNN) estimation were shown to work on mixed variables, which may consist of discrete-continuous mixture variables [7, 19, 24].

Regardless of the success of kNN-based estimators, discretization-based approaches have attractive properties, e.g., with regard to global interpretation. That is, a natural and understandable way to discretize a continuous random variable is via creating a histogram model, where we cut the sample space of the continuous variable into multiple non-overlapping parts called bins [27], or (hyper)rectangles for multi-dimensional variables. Within a bin, we consider the distribution to be constant, which allows us to estimate the density function via Riemann integration by making the bins smaller and smaller [4]. This definition, however, is less straightforward when mixed variables are involved.

In this paper, we approach the problem as follows: we first extend the definition of entropy for a univariate discrete-continuous mixture variable given by Politis [23] to multivariate variables. Using this definition, we show that CMI for mixed random variables can be written as a sum of entropies that are well-defined through the Radon-Nikodym derivative (see Sec. 2). Exploiting this property, we propose a consistent CMI estimator for such data that is based on adaptive histogram models in Sec. 3. To efficiently learn adaptive histograms from data, in Sec. 4 we define a model selection criterion based on the minimum description length (MDL) principle [9]. Subsequently, we propose an iterative greedy algorithm that aims to obtain the histogram model that minimizes the proposed MDL score in Sec. 5. We discuss related work in Sec. 6 and in Sec.
7, we empirically show that our method performs favourably compared to state-of-the-art estimators for mixed data and can be used in a causal discovery setting.

2 Entropy for Mixed Random Variables
We consider multi-dimensional mixed random variables, of which any individual dimension can be discrete, continuous, or a discrete-continuous mixture. Further, we call a vector of such mixed random variables a mixed random vector. For a mixed random vector (X, Y), where X and Y are possibly multivariate, we need to adopt the most general definition of mutual information (MI), i.e., the measure-theoretic definition:

I(X; Y) = \int_{\mathcal{X} \times \mathcal{Y}} \log \frac{dP_{XY}}{d(P_X P_Y)} \, dP_{XY},

where dP_{XY}/d(P_X P_Y) is the Radon-Nikodym derivative, P_{XY} the joint measure, and P_X P_Y the product measure. It has been proven that P_{XY} is absolutely continuous with respect to P_X P_Y [7], i.e., P_{XY}(A) = 0 whenever P_X P_Y(A) = 0; therefore, such a Radon-Nikodym derivative always exists and I(X; Y) is well-defined. This measure-theoretic definition can be extended to CMI using the chain rule: I(X; Y | Z) = I(X; {Y, Z}) − I(X; Z).

It is common knowledge that for purely discrete or continuous random variables, CMI can be written as a sum of entropies, i.e., I(X; Y | Z) = H(X, Z) + H(Y, Z) − H(X, Y, Z) − H(Z). What is not clear, however, is if this formula also holds when (X, Y, Z) contains discrete-continuous mixture random variables. We investigate this problem in two steps. We first define the measure-theoretic entropy for a (possibly multi-dimensional) discrete-continuous mixture random variable and prove it to be well-defined, though previous work claimed the opposite [7]. Second, using this definition, we prove that (conditional) MI for a mixed random vector can be written as a sum of measure-theoretic entropies, just like for purely continuous or discrete random vectors.
So far, the measure-theoretic entropy has been defined only for one-dimensional random variables [23]. Building upon this definition, we give an explicit proof that such a one-dimensional measure-theoretic entropy is well-defined, and then extend this definition to the multi-dimensional case, which we prove to be well-defined as well.
We start off by reviewing the existing definition for the one-dimensional case [23]. Given a one-dimensional random variable X, entropy H is defined as

H(X) = -\int_{\mathbb{R}} \frac{dP_X(x)}{dv(x)} \log \frac{dP_X(x)}{dv(x)} \, dv(x),

where v(·) is a measure defined on all one-dimensional Borel sets [23]. If v(·) is the Lebesgue measure, which we denote as u(·), H(X) becomes the differential entropy. Alternatively, if v(·) is a counting measure, H(X) becomes the common (discrete) entropy.

If, however, X is a discrete-continuous mixture variable, v is defined as follows. We split R into three disjoint subsets s.t. R = S_d ∪ S_c ∪ S_o. First, S_o is the subset of R on which X has zero probability measure, i.e., P_X(S_o) = 0. Second, the set S_d contains all discrete points, i.e., S_d is countable and for all x ∈ S_d, P_X({x}) > 0. Third, S_c covers the continuous points, hence P_X(S_c) + P_X(S_d) = 1 and for any Borel set A ⊆ S_c satisfying u(A) = 0, we have P_X(A) = 0. Based on these three subsets S_d, S_c, and S_o, we can define v as

(2.1)  v(A) = u(A ∩ S_c) + |A ∩ S_d|,

where |A ∩ S_d| is the cardinality of this intersection.
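For concreteness, the following sketch (our own illustration, not part of the paper's implementation) evaluates the mixed measure v of Eq. (2.1) for a set given as a union of disjoint intervals and isolated points, assuming the discrete support S_d is finite and known:

```python
from typing import List, Set, Tuple

def mixed_measure(intervals: List[Tuple[float, float]],
                  points: Set[float],
                  discrete_support: Set[float]) -> float:
    """Evaluate v(A) = u(A ∩ S_c) + |A ∩ S_d| from Eq. (2.1), where the
    set A is given as a union of disjoint intervals plus isolated points,
    and the discrete support S_d is assumed finite and known."""
    # Lebesgue part u(A ∩ S_c): removing countably many points from the
    # intervals does not change their total length.
    lebesgue = sum(b - a for a, b in intervals)
    # Counting part |A ∩ S_d|: discrete support points covered by A.
    covered = {x for x in discrete_support
               if x in points or any(a <= x <= b for a, b in intervals)}
    return lebesgue + len(covered)

# A = [0, 1] ∪ {2.5} with S_d = {0.5, 2.5}: v(A) = 1 + |{0.5, 2.5}| = 3.
print(mixed_measure([(0.0, 1.0)], {2.5}, {0.5, 2.5}))
```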
To show that the generalized one-dimensional entropy is well-defined, we need to prove that the Radon-Nikodym derivative dP_X/dv always exists. This we show in the following lemma.

Lemma 2.1. Given a one-dimensional discrete-continuous random variable X with probability measure P_X, P_X is absolutely continuous w.r.t. v, i.e., P_X(A) = 0 whenever v(A) = 0, and hence dP_X/dv always exists.

We provide the proof of Lemma 2.1, as well as the proofs for Lemmas 2.2 and 2.3, in Supplementary Material S.1.
In the following, we extend the measure-theoretic entropy definition to a mixed k-dimensional random vector W = (W_1, ..., W_k). For each W_i, we define S_d^i, S_c^i, S_o^i and a measure v_i as above, and also define the product measure v for the k-dimensional random vector as v = v_1 × ... × v_k. Then, we define the entropy for W as

(2.2)  H(W) = -\int_{\mathbb{R}^k} \frac{dP_W(w)}{dv(w)} \log \frac{dP_W(w)}{dv(w)} \, dv(w).

To prove that this entropy is well-defined, we show that dP_W/dv always exists.
Lemma 2.2. Given a mixed k-dimensional random vector W = (W_1, ..., W_k) with probability measure P_W, dP_W/dv always exists.

Last, based on Lemmas 2.1 and 2.2, we can prove that, just like for a purely continuous or discrete random vector, conditional mutual information for a mixed random vector can be written as a sum of entropies.
Lemma 2.3.
Given a mixed random vector (X, Y, Z) with joint probability measure P_XYZ, we can write

I(X; Y | Z) = H(X, Z) + H(Y, Z) − H(Z) − H(X, Y, Z),

where each entropy is defined as in Eq. (2.2).

As a direct implication of the above result, it follows that mutual information can also be written as a sum of entropies, since it is a special case of CMI with Z = ∅. With this generalized definition, we can now show how to estimate CMI using adaptive histogram models.

3 Estimating CMI with Adaptive Histograms

Adaptive histogram models have been thoroughly studied for continuous random variables [27]; however, to the best of our knowledge, there exists no rigorous definition of histograms for mixed random variables. Thus, to use histogram models as a foundation to estimate the measure-theoretic (conditional) MI, we need to rigorously define histograms for mixed random variables. We start with the one-dimensional case.
A histogram model is typically defined based on a set of consecutive intervals called bins [27]. However, to deal with discrete-continuous mixture random variables, we define the set of bins, denoted as B, such that each bin is either an interval or a set containing only a single point. That is, B = B′ ∪ B′′, where B′ and B′′ are sets of subsets of R, with B′ consisting of countably many consecutive intervals and B′′ consisting of countably many single-point sets. Last, we define the "width" of a bin using the measure v as defined in Eq. (2.1), i.e., for a bin B_j ∈ B we have v(B_j) = u(B_j ∩ B′) + |B_j ∩ B′′|. As any B_j ∈ B′′ contains only a single discrete point, v(B_j) = 1 for all B_j ∈ B′′.

Further, we define a histogram model M as a set of bins equipped with a parameter vector of length K, where K = |B| is the number of bins. That is, a histogram model M is a family of probability distributions P_{X,θ}, parametrized by the vector θ = (θ_1, ..., θ_K). Each element of θ represents the Radon-Nikodym derivative (or density) of each bin. Note that this definition generalizes to purely continuous random variables when B′′ = ∅ and also to discrete random variables if B′ = ∅. For the latter case, the histogram model degenerates to a multinomial model.

Next, we turn to the multi-dimensional case. First, we define the set of multi-dimensional bins. For a mixed k-dimensional random vector W = (W_1, ..., W_k), we define the set of bins for each W_i as above, denoted as B_i. Consequently, we can define a set of k-dimensional bins, denoted B, by the Cartesian product B = B_1 × ... × B_k. Since each B_i is countable, B is also countable, and we can hence assume B is indexed by j. Then, we split B in a similar way as in the one-dimensional case, i.e., B = B′ ∪ B′′, where B′′ contains only discrete values. That is, for any k-dimensional bin B_j ∈ B′′, each dimension of B_j is a set that contains a single one-dimensional point. Note, however, that for any B_j ∈ B′, each dimension of B_j can either be a one-dimensional interval or a one-dimensional single-point set. Further, we define the volume of a multi-dimensional bin B_j ∈ B using the product measure v(B_j) (see Sec. 2). Similar to one-dimensional histograms, a multi-dimensional histogram model M can be described by a probability distribution P_{W,θ} parametrized by the vector θ = (θ_1, ..., θ_K), where K is the number of bins and θ_j is the Radon-Nikodym derivative for each bin.

Given a possibly multi-dimensional histogram with K bins, we denote the Radon-Nikodym derivative dP_{W,θ}/dv as f^h_θ and its maximum likelihood estimator as f^h_θ̂. Observe that for any parameter θ_j ∈ θ, the product θ_j v(B_j) follows a multinomial distribution. Thus, given a dataset D = {D_i}_{i=1,...,n}, with D_i representing a row, the maximum log-likelihood is

(3.3)  l_M(D) = \log f^h_{\hat{\theta}(D)}(D) = \log \prod_{j=1}^{K} \left( \frac{c_j}{n \cdot v(B_j)} \right)^{c_j},

where c_j and v(B_j) are respectively the number of data points in and the volume of bin j ∈ {1, ..., K}. Notice that this maximum likelihood generalizes to the purely discrete case (i.e., the multinomial distribution), where all v(B_j) = 1, and to the purely continuous case [27], where v becomes the Lebesgue measure.
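As a minimal illustration of Eq. (3.3), assuming the bins and their v-volumes are already fixed, the maximum log-likelihood can be computed from the bin counts alone (the helper name is ours; logarithms are base 2, matching the code-length convention used in Sec. 4):

```python
import numpy as np

def histogram_log_likelihood(counts: np.ndarray, volumes: np.ndarray) -> float:
    """Maximum log-likelihood of Eq. (3.3): sum_j c_j * log2(c_j / (n * v(B_j))).
    counts[j] is the number of points in bin j; volumes[j] is v(B_j), which
    equals 1 for single-point (discrete) bins. Empty bins contribute 0."""
    n = counts.sum()
    nonzero = counts > 0
    c, v = counts[nonzero], volumes[nonzero]
    return float(np.sum(c * (np.log2(c) - np.log2(n * v))))

# Two interval bins of width 0.5 and one discrete single-point bin (volume 1):
print(histogram_log_likelihood(np.array([3, 4, 3]), np.array([0.5, 0.5, 1.0])))
```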
Combining all previous theoretical results, we can now estimate conditional mutual information for three (possibly multivariate) random variables X, Y, and Z by

I^h(X; Y | Z) = H^h(X, Z) + H^h(Y, Z) − H^h(X, Y, Z) − H^h(Z).

The corresponding measure-theoretic entropies are estimated from (k_X + k_Y + k_Z)-dimensional data over (X, Y, Z), where k_X, k_Y, and k_Z are the corresponding numbers of dimensions of X, Y, and Z. We estimate the entropies as

H^h(X, Y, Z) = -\int_{\mathbb{R}^{k_X + k_Y + k_Z}} f^h_{\hat{\theta}}(x, y, z) \log f^h_{\hat{\theta}}(x, y, z) \, dv,
H^h(X, Z) = -\int_{\mathbb{R}^{k_X + k_Z}} f^h_{\hat{\theta}}(x, z) \log f^h_{\hat{\theta}}(x, z) \, dv,
H^h(Y, Z) = -\int_{\mathbb{R}^{k_Y + k_Z}} f^h_{\hat{\theta}}(y, z) \log f^h_{\hat{\theta}}(y, z) \, dv,
H^h(Z) = -\int_{\mathbb{R}^{k_Z}} f^h_{\hat{\theta}}(z) \log f^h_{\hat{\theta}}(z) \, dv,

in which f^h_θ̂(x, y, z) is the maximum likelihood estimator given the data, while we obtain f^h_θ̂(x, z), f^h_θ̂(y, z), and f^h_θ̂(z) via marginalization from f^h_θ̂(x, y, z). Next, we prove that I^h is a strongly consistent estimator for conditional mutual information on mixed data.
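Per Lemma S.3 in the supplementary material, the bin-volume terms cancel in this combination, so I^h reduces to plug-in entropies over bin indices. A minimal sketch, assuming the joint discretization is already given (function names are ours, not the authors' code):

```python
from collections import Counter
import math

def plugin_entropy(labels) -> float:
    """Plug-in (empirical) entropy, in nats, of a sequence of discrete symbols."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cmi_from_discretized(xd, yd, zd) -> float:
    """I^h(X;Y|Z) = H_n(Xd,Zd) + H_n(Yd,Zd) - H_n(Xd,Yd,Zd) - H_n(Zd).
    The bin-volume terms cancel (Lemma S.3), so only discrete plug-in
    entropies of the bin indices are needed. xd, yd, zd are lists of
    (tuples of) bin indices produced by the learned joint histogram."""
    xz = list(zip(xd, zd))
    yz = list(zip(yd, zd))
    xyz = list(zip(xd, yd, zd))
    return (plugin_entropy(xz) + plugin_entropy(yz)
            - plugin_entropy(xyz) - plugin_entropy(list(zd)))
```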
Theorem 3.1. Given a mixed random vector (X, Y, Z) with probability measure P_XYZ,

\lim_{v' \to 0} \lim_{n \to \infty} I^h(X; Y | Z) = I(X; Y | Z)

almost surely, where n refers to the sample size and v′ refers to the maximum of the histogram volumes over the bins in B′ (defined in Sec. 3).

The proof is provided in Supplementary Material S.1. Informally, our proof is based on the following key aspects: 1) all volume-related terms in I^h cancel out, 2) the discrete empirical entropy converges to the true entropy almost surely [1], and 3) in the limit, differential entropy can be obtained by discretizing a continuous random variable into "infinitely" small bins [4, Theorem 8.3.1]. Notably, the order of the double limit in Theorem 3.1 inherently indicates that n should grow faster than the number of bins [26], which is also required for histograms on purely continuous data to converge [27].

4 MDL-based Model Selection

To efficiently estimate a histogram model that inherits the consistency guarantees from Theorem 3.1, we need to consider the following requirements. First of all, we need to ensure that we learn a joint histogram model over
(X, Y, Z). This is due to the fact that we obtain the lower-dimensional entropies, such as H^h(X, Z), by marginalization over the likelihood estimator f^h_θ̂(x, y, z). If we did not learn a joint model, the volume-related terms in H^h(X, Y, Z), H^h(X, Z), H^h(Y, Z), and H^h(Z) would not cancel out. In addition, we need to make sure that the number of bins is in o(n) and increases if we were to increase the number of samples n, while at the same time the size of the bins decreases.

One way to achieve those properties would be to fix the bin width or the number of bins depending on the number of samples. However, such an approach is not very flexible and does not allow for variable bin widths. To allow for a more flexible model, we formally consider the problem of constructing an adaptive multi-dimensional histogram as a model selection problem and employ a selection criterion based on the minimum description length (MDL) principle [25]. MDL-based model selection has been successfully used for learning one-dimensional [13] and two-dimensional histograms [11, 36], demonstrating adaptivity to both local density changes and sample size.

We now briefly introduce the MDL principle and define the MDL-optimal histogram model. Specifically, while previous work [11, 13, 36] only considers purely continuous data (or, more precisely, data with arbitrarily small precision), we apply the MDL principle to mixed-type data, based on our rigorous definition of histogram models for mixed random variables. On top of that, we empirically show that our score fulfils the desired properties, i.e., that the number of bins grows as o(n).

The minimum description length principle is arguably one of the best off-the-shelf model selection criteria [9] and has been successfully applied to many machine learning and data mining tasks. The general idea is to assign a code length to data D compressed by a model M, e.g., a histogram model. (The code length L denotes the number of bits needed to describe the given object; hence, all logarithms are to base 2 and 0 log 0 = 0.) Given a collection of candidate models, denoted as M, MDL selects the model M* that minimizes the joint code length of the model and the data. Formally, our goal is to find

M* = argmin_{M ∈ M} L(D | M) + L(M),

where L(D | M) denotes the code length of the data given the model, while L(M) denotes the code length needed to encode the model.

The optimal way of encoding data D given M, in the sense that it results in minimax regret, is to use the normalized maximum likelihood (NML) code [9]. Accordingly, the code length of the data is called stochastic complexity (SC), which is defined as the sum of the negative log-likelihood −l_M(D), defined in Eq. (3.3), and the parametric complexity (also called regret) log R(n, K) [9]. The parametric complexity of a histogram model with K bins is given by [13, 36]

R(n, K) = \sum_{c_1 + \cdots + c_K = n} \frac{n!}{c_1! \cdots c_K!} \prod_{i=1}^{K} \left( \frac{c_i}{n} \right)^{c_i},

and can be computed in sub-linear time [20].
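For illustration, R(n, K) can also be computed exactly with the classic linear-time recurrence for the multinomial stochastic complexity due to Kontkanen and Myllymäki; note that this is a simpler alternative we substitute for the sub-linear algorithm of [20], which we do not reproduce here. A sketch:

```python
from math import comb

def parametric_complexity(n: int, K: int) -> float:
    """Multinomial parametric complexity R(n, K) via the linear-time
    recurrence R(n, K) = R(n, K-1) + (n / (K-2)) * R(n, K-2) for K > 2,
    with R(n, 1) = 1. For large n, compute in log-space to avoid overflow."""
    r1 = 1.0
    # Base case K = 2: sum over all ways to split n points into two bins.
    r2 = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
             for k in range(n + 1)) if n > 0 else 1.0
    if K == 1:
        return r1
    for j in range(3, K + 1):
        r1, r2 = r2, r2 + n * r1 / (j - 2)
    return r2

print(parametric_complexity(100, 10))  # R(n=100, K=10)
```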
Given a dataset D with n rows and k individual columns D_j, we now define the model class M. First, we create fixed bins according to B′′ (as defined above), one per discrete value that occurs in D_j. Next, we enumerate all possible bins for B′ with fixed precision ε. To this end, denote the remaining non-discrete data points in D_j as D^c_j. If D^c_j is empty, D_j corresponds to a discrete variable and we can stop here. Otherwise, we create all possible cut points for D^c_j as C_j = {min(D^c_j), min(D^c_j) + ε, ..., max(D^c_j)}. By selecting a subset of cut points C̃_j ⊆ C_j, we get a valid solution for B′. We can enumerate all possible segmentations by enumerating each C̃_j ⊆ C_j. By repeating this process for each dimension, we obtain our model class M. Further, we get the code length for a model M ∈ M by encoding all combinations of cut points for each dimension [13], i.e.,

L(M) = \sum_{j \in \{1, \ldots, k\}} L(\tilde{C}_j) = \sum_{j \in \{1, \ldots, k\}} \log \binom{|C_j|}{|\tilde{C}_j|}.

This completes the definition of our final optimization score L(D | M) + L(M), of which the model cost is sketched below.
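A minimal sketch of the model cost L(M), assuming the numbers of candidate and selected cut points per dimension are known (the helper is our own illustration):

```python
from math import comb, log2

def model_cost(num_candidates: list, num_selected: list) -> float:
    """L(M) = sum_j log2 C(|C_j|, |selected_j|): the bits needed to identify
    which subset of the candidate cut points was chosen in each dimension."""
    return sum(log2(comb(c, s)) for c, s in zip(num_candidates, num_selected))

# Three dimensions with 100 candidate cut points each, selecting 4, 2, and 0:
print(model_cost([100, 100, 100], [4, 2, 0]))
```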
To prove consistency for this score, we would need to show that the number of selected bins grows at rate o(n). Since the theoretical analysis is rather difficult, we instead empirically demonstrate this property for Gaussian distributed data in Supplementary Material S.3. In the next section, we present an iterative greedy algorithm that optimizes our MDL score.

5 Iterative Greedy Algorithm

In this section, we describe our algorithm to estimate the joint entropy H(X_1, ..., X_k) for a k-dimensional discrete-continuous mixture random vector.

To discretize a one-dimensional random variable X, we first create bins for the discrete values of X and then discretize the continuous values. We detect discrete data points by checking whether a single value x in the domain of X occurs multiple times; if a user-defined threshold t, e.g., 5, is reached, we create a bin for this point. To discretize the remaining continuous values, we start by splitting X into K_init equal-width bins, where we can safely choose K_init from the complexity class o(√n) (see Supplementary Material S.3). Using dynamic programming, we compute the variable-width histogram model M that minimizes L(D, M) in quadratic time w.r.t. K_init [13].

Since the runtime complexity of computing the optimal variable-width histogram over a multi-dimensional random variable would grow exponentially w.r.t. k, we opt for an iterative greedy algorithm (we provide the pseudocode in Supplementary Material S.2). We start by initializing the optimization: for every dimension, we fix bins for the discrete values and put the remaining continuous values into a single bin. Then, in each iteration, we compute a candidate discretization for each dimension and keep the discretization of the dimension that provides the highest gain in compression. To compute a candidate discretization for a dimension X_j, we extend the one-dimensional algorithm described above. That is, we determine those cut points for X_j that provide the highest gain in L(D, M), while keeping the bins for the remaining dimensions fixed. We repeat this until the maximum number of iterations i_max is reached or we cannot further decrease L(D, M).

The complexity of discretizing a univariate random variable is in O(K_max · (K_init)²) and depends on the number of initial bins K_init and the maximum number of bins K_max, which we typically choose as a fraction of K_init (both in o(√n)). In a multi-dimensional setting, we have to multiply this complexity by the current domain size of the remaining variables, since we have to update each bin conditioned on those. In the worst case, this number is equal to (K_max)^{k−1}. Overall, we apply this procedure—if all variables are continuous—i_max · k times.

6 Related Work

We now discuss related methods for adaptive histograms and (conditional) mutual information estimation.

Both theoretical properties and practical issues of density estimation using histograms have been studied for decades [27]. Various algorithms have been proposed for the challenging task of constructing an adaptive one-dimensional histogram, among which the MDL-based histogram [13] is considered to be the state of the art, as it is self-adaptive to both local density structure and sample size and does not have any hyperparameters. Learning adaptive multivariate histograms is even harder due to the combinatorial explosion of the search space. One approach is to resort to the dyadic CART algorithm [12]; various methods designed for specific tasks also exist [11, 35]. Our algorithm is similar to that of Kameya [11], but they only consider the two-dimensional case.

For discrete data, (conditional) mutual information estimation is a well-studied problem [4, 10, 18, 21, 33] and it has been shown that mutual information can be estimated using the 3H principle [10]. An important observation is that for discrete data, the empirical estimator for entropy is sub-optimal [21], which encouraged the design of more efficient entropy estimators with sub-linear sample complexity [10, 33].

For estimating (conditional) mutual information on continuous data or a mix of discrete and continuous data, three classes of approaches exist. The first class concerns kernel density estimation (KDE) methods [8, 22], which perform well on continuous data; however, no KDE-based MI and CMI estimation methods exist that are designed for discrete-continuous mixture random variables. Moreover, bandwidth tuning for KDE can be computationally expensive, which becomes even worse for mixed data, as different bandwidths may be needed for discrete random variables. The second class of methods relies on k-nearest neighbour (kNN) estimates [6, 14, 15], which have been established as the state of the art [7, 24]. kNN approaches can not only be applied to a mix of discrete and continuous variables, but can also be used as consistent MI [7] and CMI [19, 24] estimators for discrete-continuous mixtures. The third class of methods first discretizes the continuous random variables and then calculates mutual information from the discretized variables [4, 5, 31]. Two recent approaches based on adaptive partitioning for mixed random variables have been proposed [2, 17]. While Mandros et al. [17] focus on mutual information and its application to functional dependency discovery, Cabeli et al. [2], similar to us, build upon an MDL-based score to estimate MI and CMI, to which we compare in Sec. 7. One key difference is that Cabeli et al. [2] compute I(X; Y | Z) as (I(X; {Y, Z}) − I(X; Z) + I(Y; {X, Z}) − I(Y; Z)) / 2. Moreover, we prefer histograms over kNN estimation, since histograms are more interpretable [27] and do not require tuning of the parameter k, which can have a large impact on the outcome.

7 Experiments

In this section, we empirically evaluate the performance of our approach. First, we benchmark our estimator against state-of-the-art CMI estimators on different data types. After that, we evaluate how well our estimator is suited to test for conditional independence in a causal discovery setup. For reproducibility, we make our code available online at https://github.com/ylincen/CMI-adaptive-hist.git.

On the mutual information estimation task, we compare our approach to state-of-the-art MI estimators. In particular, we compare against FP [6], RAVK [24], and MS [19], which all rely on kNN estimates, and MIIC [2], which is a discretization-based method. All of those can be applied to our setup, but only the authors of RAVK and MS specifically consider discrete-continuous mixture variables. We apply MIIC using the default parameters and use k = 10 for all kNN-based approaches. (We evaluated all kNN-based approaches with k = 5, 10, and 20; since k = 10 had the best trade-off and is close to the k = 7 used by Mesner and Shalizi [19], we report those results.) For our algorithm, we set the maximum number of iterations and the threshold to detect discrete points in a mixture variable to 5, and set K_init = 20 log n and K_max = 5 log n. To comply with the literature, we compute all entropies in this section using the natural logarithm.

As a sanity check, we start with an experiment on purely continuous data.
That is, for Experiment I, let X and Y be Gaussian random variables with mean 0, variance 1, and covariance 0.6. Consequently, the correlation ρ between X and Y is 0.6 and I(X; Y) = −(1/2) log(1 − ρ²). In Experiment II, X is discrete and drawn uniformly from {0, ..., m − 1} with m = 5, and Y is continuous with Y ∼ Unif(x, x + 2) for X = x. Therefore, I(X; Y) = log(m) − (m − 1) log(2)/m [7]. Next, for Experiment III, X is exponentially distributed with rate 1 and Y is a zero-inflated Poissonization of X, i.e., Y = 0 with probability p = 0.15 and Y ∼ Pois(x) for X = x with probability 1 − p. The ground truth is I(X; Y) = (1 − p)(2 log 2 − γ − \sum_{k=1}^{\infty} \log k \cdot 2^{-k}) ≈ (1 − p) · 0.301, where γ is the Euler-Mascheroni constant [7]. Last, in Experiment IV, we generate the data according to the Markov chain X → Z → Y (see Mesner and Shalizi [19]). In particular, X is exponentially distributed, Z ∼ Pois(x) for X = x, and Y is binomially distributed with size z for Z = z and a fixed success probability. Due to the Markov chain structure, the ground truth is I(X; Y | Z) = 0.

For each of the above experiments, we sample data with sample sizes ranging from 100 to 1 000 and generate 100 data sets per sample size. We run each of the estimators on the generated data and show the mean squared error (MSE) of each estimator in Fig. 1.

Figure 1: Synthetic data with known ground truth. Ordered from top-left to bottom-right, we show the MSE for Experiments I–IV, for our estimator and the competing algorithms MS, RAVK, FP, and MIIC.

Overall, our estimator performs best or very close to the best throughout the experiments and reaches an MSE lower than 0.001 with at most 1 000 samples. The best competitors are MS and MIIC; however, both are biased when we consider discrete-continuous mixture variables, as we show in Experiment V.
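For concreteness, the following sketch (our own illustration, not the paper's benchmark code) generates data for Experiments I and II and evaluates their closed-form ground truths in nats; the sample size and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Experiment I: bivariate Gaussian with unit variances and correlation rho.
rho = 0.6
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
mi_true_1 = -0.5 * np.log(1.0 - rho ** 2)  # I(X;Y) for a bivariate Gaussian

# Experiment II: discrete X uniform on {0, ..., m-1}, Y | X=x ~ Unif(x, x+2).
m = 5
x = rng.integers(0, m, size=n)
y = rng.uniform(x, x + 2)
mi_true_2 = np.log(m) - (m - 1) * np.log(2) / m  # closed form from [7]

print(mi_true_1, mi_true_2)  # approx. 0.223 and 1.055 nats
```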
In Experiment V, we generate data according to a discrete-continuous mixture [7]. Half of the data points are continuous, with X and Y being standard Gaussian with correlation ρ = 0.8, while the other half follows a discrete distribution on {−1, 1}² with P(1, 1) = P(−1, −1) and P(1, −1) = P(−1, 1). In addition, we generate Z independently with Z ∼ Binomial(3, 0.2). Hence, the ground truth I(X; Y) = I(X; Y | Z) combines the discrete point-mass terms of the form p log(p/q) with the Gaussian term −(1/2) log(1 − ρ²) and can be computed in closed form [7]. Fig. 2 (top) shows the results: FP and MIIC, which were not designed for this setup, have a clear bias even for 1 000 data points. The same trend can be observed for the MSE.
Last, in Experiment VI, we test how sensitive our method is to dimensionality. We generate X and Y as in Experiment II, but fix n to 2 000 and add k independent random variables Z_k ∼ Binomial(3, 0.5). Fig. 2 (bottom) shows the mean and MSE. Our estimator recovers the true CMI up to a negligible error up to k = 2. After that, it starts to slowly underestimate the true CMI. This can be explained by the fact that the model costs increase linearly with the domain size and, hence, we fit fewer bins to the continuous variable for large k. We validated this conjecture by repeating the experiment for n = 10 000. At this larger sample size, the MSE for our estimator remained below 0.001 even for k = 4. While MIIC is slightly more stable for k ≥ 3, the competing kNN-based estimators deviate quite a bit from the true estimate for higher dimensions.

Overall, we are on par with or outperform the best competitor throughout Experiments I–VI. Especially on mixture data, which is our main focus, our method is the only one that converges to the true estimate.
Figure 2: Top row: Experiment V, where we show the mean of the estimators (left), with the true CMI as a dashed gray line, and the MSE (right). Bottom row: Experiment VI, where the sample size is constant at 2 000 and the x-axis refers to the number of dimensions of Z. We show the mean (left) and MSE (right). The color coding is chosen as in Fig. 1.

In theory, two random variables X and Y are conditionally independent given
0. In practice, wecannot simply rely on our estimator to conclude inde-pendence: due to the monotonicity of mutual informa-tion, i.e., I ( X ; Y ) ≤ I ( X ; Y ∪ Z ), estimates will rarelybe exactly zero in the limited sample regime, but only close to zero [18, 34]. To address this problem, we useour algorithm to discretize X, Y and Z , and compute I C ( X ; Y | Z ) := max { , I n ( X d ; Y d | Z d ) + C n ( X d ; Y d | Z d ) } ,where C n is a correction term calculated from the dis-cretized variables, which is negative. In the following,we evaluate our estimator with two different correctioncriteria. The first one is a correction for mutual infor-mation based on the Chi-squared distribution, with C n equal to −X α,l / n [34], where X α,l refers to the crit-ical value of the Chi-squared distribution with signifi-cance level α and degrees of freedom l . We can com-pute the degrees of freedom l from the domain sizesof the discretized variables for the conditional case as l = ( |X | − |Y| − |Z| [32], and for the unconditionalcase as l = ( |X | − |Y| − I n ( X d ; Y d | Z d ) E FC DA B
Figure 3: Synthetic network with continuous (white),discrete (gray) and mixed (shaded) random variablesconsisting of different causal structures, such as collid-ers, a chain ( C → E → G ), and a fork ( C ← B → D ). . . . . .
81 number of samples · . P r ec i s i o n I X I SC JICRCIT RCoT MIIC . . . . .
81 number of samples · . R ec a ll Figure 4: Precision (left) and recall (right) on undi-rected graphs inferred using the PC-stable algorithmequipped with the corresponding independence test.The data is generated from the graph shown in Fig. 3.with its corresponding stochastic complexity term as de-fined in Sec. 4.1. If we subtract the regret terms for H n ( X d , Y d , Z d ) and H n ( Z d ) from those for H n ( X d , Z d )and H n ( Y d , Z d ), we are guaranteed to get a negativevalue, thus a valid regret term [18]. In the following, werefer to the test using the Chi-squared correction as I X and to the one based on stochastic complexity as I SC .To test how well I X and I SC perform on mixed-type and continuous data, we benchmark both againststate-of-the-art kernel-based tests RCIT and RCoT [30],as well as JIC [31], and MIIC [2], which are bothdiscretization-based methods using a correction basedon stochastic complexity. To apply RCIT and RCoTon mixed data, we treat the discrete data points asintegers. In the following, we evaluate the performanceof each test in a causal discovery setup. In addition,we provide a more detailed description of the datageneration and experiments on individual collider andnon-collider structures in Supplementary Material S.3.
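To illustrate the I_X decision rule, the sketch below applies the Chi-squared correction to a plug-in CMI estimate of already-discretized variables; the helper name and interface are ours, and the 1/(2n) scaling assumes the estimate is in nats:

```python
from scipy.stats import chi2

def chi2_corrected_cmi(I_n: float, dom_x: int, dom_y: int, dom_z: int,
                       n: int, alpha: float = 0.01) -> float:
    """I_C = max(0, I_n + C_n) with C_n = -chi2(alpha, l) / (2n) and
    l = (|X|-1)(|Y|-1)|Z| degrees of freedom [32, 34]. I_n must be a
    plug-in CMI estimate in nats over the discretized variables."""
    l = (dom_x - 1) * (dom_y - 1) * dom_z
    critical = chi2.ppf(1.0 - alpha, df=l)  # critical value at level alpha
    return max(0.0, I_n - critical / (2.0 * n))

# Declare independence iff the corrected estimate equals zero:
independent = chi2_corrected_cmi(0.002, dom_x=3, dom_y=3, dom_z=4, n=1000) == 0.0
print(independent)
```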
To test how well I_X and I_SC perform on mixed-type and continuous data, we benchmark both against the state-of-the-art kernel-based tests RCIT and RCoT [30], as well as JIC [31] and MIIC [2], which are both discretization-based methods using a correction based on stochastic complexity. (Note that MIIC calculates stochastic complexity based on factorized NML and JIC uses an asymptotic approximation of stochastic complexity, while we use quotient NML for I_SC [18, 28].) To apply RCIT and RCoT to mixed data, we treat the discrete data points as integers. In the following, we evaluate the performance of each test in a causal discovery setup. In addition, we provide a more detailed description of the data generation and experiments on individual collider and non-collider structures in Supplementary Material S.3.

Figure 3: Synthetic network with continuous (white), discrete (gray), and mixed (shaded) random variables, consisting of different causal structures, such as colliders, a chain (C → E → G), and a fork (C ← B → D).

To evaluate our test in a causal discovery setting, we generate data according to a small synthetic network—shown in Fig. 3—that consists of a mixture of the generating mechanisms that we used in Experiments I–IV and includes continuous and discrete (ordinal) random variables and one mixture variable, which is partially Gaussian and partially Poisson distributed (for details see Supplementary Material S.3). To evaluate how well the ground truth graph can be recovered, we apply the PC-stable algorithm [3, 29] equipped with the different independence tests, where we use α = 0.01 for I_X, RCIT, and RCoT.
Figure 4: Precision (left) and recall (right) on undirected graphs inferred using the PC-stable algorithm equipped with the corresponding independence test. The data is generated from the graph shown in Fig. 3.

Fig. 4 shows recovery precision and recall for the undirected graph, averaged over 20 draws per sample size, for sample sizes up to 10 000. We see that, overall, I_X performs best and is the only method that reaches both a perfect precision and recall. While JIC also reaches a perfect recall, it finds too many edges, leading to a precision of only 80%. Although MIIC, RCIT, and RCoT also have a perfect precision, their recall is worse than that of I_X. Neither of the kernel-based tests manages to recall all the edges, even for 10 000 samples. Upon closer inspection, this is due to the edge E → G, which involves the discrete-continuous variable G. If we compare I_X to I_SC, we clearly see that the latter is too conservative, which leads to a bad recall.

8 Conclusion

We proposed a novel approach for the estimation of conditional mutual information from data that may contain discrete, continuous, and mixture variables. To be able to deal with discrete-continuous mixture variables, we defined a class of generalized adaptive histogram models. Based on our observation that CMI for mixture variables can be written as a sum of entropies, we presented a CMI estimator based on such histograms, for which we proved that it is consistent. Further, we used the minimum description length principle to formally define optimal histograms, and proposed a greedy algorithm to practically learn good histograms from data. Finally, we demonstrated that our algorithm outperforms state-of-the-art (conditional) mutual information estimation methods, and that it can be successfully used as a conditional independence test in causal graph structure learning. Notably, for both setups, we observe that our approach performs especially well when mixture variables are present.
Acknowledgements
This work is part of the research programme 'Human-Guided Data Science by Interactive Model Selection' with project number 612.001.804, which is (partly) financed by the Dutch Research Council (NWO).

References

[1] A. Antos and I. Kontoyiannis, "Convergence properties of functional estimates for discrete distributions," Random Structures & Algorithms, vol. 19, no. 3-4, pp. 163–193, 2001.
[2] V. Cabeli, L. Verny, N. Sella, G. Uguzzoni, M. Verny, and H. Isambert, "Learning clinical networks from medical records based on information estimates in mixed-type data," PLOS Comput. Biol., vol. 16, no. 5, p. e1007866, 2020.
[3] D. Colombo and M. H. Maathuis, "A modification of the PC algorithm yielding order-independent skeletons," CoRR, abs/1211.3295, 2012.
[4] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[5] G. A. Darbellay and I. Vajda, "Estimation of the information by an adaptive partitioning of the observation space," IEEE Trans. Inf. Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
[6] S. Frenzel and B. Pompe, "Partial mutual information for coupling analysis of multivariate time series," Physical Review Letters, vol. 99, no. 20, p. 204101, 2007.
[7] W. Gao, S. Kannan, S. Oh, and P. Viswanath, "Estimating mutual information for discrete-continuous mixtures," in NIPS, 2017, pp. 5986–5997.
[8] W. Gao, S. Oh, and P. Viswanath, "Breaking the bandwidth barrier: Geometrical adaptive entropy estimation," in NIPS. Curran Associates, Inc., 2016, pp. 2460–2468.
[9] P. Grünwald, The Minimum Description Length Principle. MIT Press, 2007.
[10] Y. Han, J. Jiao, and T. Weissman, "Adaptive estimation of Shannon entropy," in ISIT. IEEE, 2015, pp. 1372–1376.
[11] Y. Kameya, "Time series discretization via MDL-based histogram density estimation," in ICTAI. IEEE, 2011, pp. 732–739.
[12] J. Klemelä, "Multivariate histograms with data-dependent partitions," Statistica Sinica, pp. 159–176, 2009.
[13] P. Kontkanen and P. Myllymäki, "MDL histogram density estimation," in AISTATS, 2007.
[14] L. F. Kozachenko and N. N. Leonenko, "Sample estimate of the entropy of a random vector," Probl. Peredachi Inf., vol. 23, pp. 9–16, 1987.
[15] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Phys. Rev. E, vol. 69, no. 6, p. 066138, 2004.
[16] J. Lee and D.-W. Kim, "Feature selection for multi-label classification using multivariate mutual information," Pattern Recognition Letters, vol. 34, no. 3, pp. 349–357, 2013.
[17] P. Mandros, D. Kaltenpoth, M. Boley, and J. Vreeken, "Discovering functional dependencies from mixed-type data," in KDD, 2020, pp. 1404–1414.
[18] A. Marx and J. Vreeken, "Testing conditional independence on discrete data using stochastic complexity," in AISTATS, 2019, pp. 496–505.
[19] O. C. Mesner and C. R. Shalizi, "Conditional mutual information estimation for mixed, discrete and continuous, data," IEEE Transactions on Information Theory, 2020.
[20] T. Mononen and P. Myllymäki, "Computing the multinomial stochastic complexity in sub-linear time," in PGM, 2008, pp. 209–216.
[21] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[22] L. Paninski and M. Yajima, "Undersmoothed kernel entropy estimators," IEEE Trans. Inf. Theory, vol. 54, no. 9, pp. 4384–4388, 2008.
[23] D. N. Politis, On the Entropy of a Mixture Distribution. Purdue University, Department of Statistics, 1991.
[24] A. Rahimzamani, H. Asnani, P. Viswanath, and S. Kannan, "Estimators for multivariate information measures in general probability spaces," in NIPS, 2018, pp. 8664–8675.
[25] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 1, pp. 465–471, 1978.
[26] W. Rudin et al., Principles of Mathematical Analysis. McGraw-Hill, New York, 1964, vol. 3.
[27] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.
[28] T. Silander, J. Leppä-aho, E. Jääsaari, and T. Roos, "Quotient normalized maximum likelihood criterion for learning Bayesian network structures," in AISTATS, vol. 84. PMLR, 2018, pp. 948–957.
[29] P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman, Causation, Prediction, and Search. MIT Press, 2000.
[30] E. V. Strobl, K. Zhang, and S. Visweswaran, "Approximate kernel-based conditional independence tests for fast non-parametric causal discovery," JCI, vol. 7, no. 1, 2019.
[31] J. Suzuki, "An estimator of mutual information and its application to independence testing," Entropy, vol. 18, no. 4, p. 109, 2016.
[32] ——, "Mutual information estimation: Independence detection and consistency," in ISIT. IEEE, 2019, pp. 2514–2518.
[33] G. Valiant and P. Valiant, "Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs," in STOC, 2011, pp. 685–694.
[34] N. X. Vinh, J. Chan, and J. Bailey, "Reconsidering mutual information based feature selection: A statistical significance view," in AAAI, 2014.
[35] D. Weiler and J. Eggert, "Multi-dimensional histogram-based image segmentation," in ICONIP. Springer, 2007, pp. 963–972.
[36] L. Yang, M. Baratchi, and M. van Leeuwen, "Unsupervised discretization by two-dimensional MDL-based histogram," arXiv preprint arXiv:2006.01893, 2020.

Supplementary Material
The supplementary material is structured as follows. First, we provide proofs for all lemmas and theorems. After that, we provide the pseudocode for our algorithm. Last, we provide additional experiments and details of the data generation for the causal discovery experiment.
S.1 Proofs

S.1.1 Proof of Lemma 2.1
Proof.
Given a Borel set A ⊆ R such that v(A) = u(A ∩ S_c) + |A ∩ S_d| = 0, we have u(A ∩ S_c) = 0 as well as |A ∩ S_d| = 0, due to the non-negativity of any measure. Since A ∩ S_c ⊆ S_c, by the definition of S_c we have P_X(A ∩ S_c) = 0. It remains to show that A ∩ S_d = ∅, which we do by contradiction. Assume that A ∩ S_d ≠ ∅; then there exists x ∈ A ∩ S_d s.t. for a set containing only x, |{x}| = 1. Then |A ∩ S_d| ≥ |{x}| = 1, which contradicts |A ∩ S_d| = 0. Thus, we must have A ∩ S_d = ∅ and therefore P_X(A) = 0.

S.1.2 Proof of Lemma 2.2
Proof.
Given a k-dimensional Borel set A, there exist one-dimensional Borel sets A_1, ..., A_k such that A = A_1 × ... × A_k. If v(A) = 0, then there exists at least one v_i, i ∈ {1, ..., k}, such that v_i(A_i) = 0. Thus, by Lemma 2.1, P_{W_i}(A_i) = 0, which implies P_W(R × ... × R × A_i × R × ... × R) = 0 and hence P_W(A) = 0, as A = A_1 × ... × A_k ⊆ R × ... × R × A_i × R × ... × R.

S.1.3 Proof of Lemma 2.3
Proof.
We first prove the statement for Z ≠ ∅, for which we can write I(X; Y | Z) = I(X; {Y, Z}) − I(X; Z) by the chain rule for mutual information. Thus, it suffices to prove that I(X; Z) = H(X) + H(Z) − H(X, Z) and I(X; {Y, Z}) = H(X) + H(Y, Z) − H(X, Y, Z). Next, denote by v the product measure defined based on (X, Z), where v = v_1 × ... × v_{k_XZ} and k_XZ is the number of dimensions of X plus that of Z; then, by Lemma 2.2, we have P_XZ ≪ v. Next, we show that P_X P_Z ≪ v. For any k_XZ-dimensional Borel set A = A_1 × ... × A_{k_XZ} satisfying v(A) = 0, there exists v_i ∈ {v_1, ..., v_{k_XZ}} such that v_i(A_i) = 0. Hence, P_X P_Z(A) = 0, because 0 ≤ P_X P_Z(A) = P_X P_Z(A_1 × ... × A_{k_XZ}) ≤ P_X P_Z(R × ... × R × A_i × R × ... × R) = P_i(A_i) = 0, where P_i is the marginalization of the product measure P_X P_Z to the i-th dimension and P_i(A_i) = 0 holds because v_i(A_i) = 0 by the definition of v.

Finally, by the chain rule of the Radon-Nikodym derivative, we have that

I(X; Z) = \int \log \frac{dP_{XZ}}{d(P_X P_Z)} \, dP_{XZ} = \int \log \frac{dP_{XZ}/dv}{d(P_X P_Z)/dv} \, \frac{dP_{XZ}}{dv} \, dv = H(X) + H(Z) − H(X, Z).

The proof for I(X; {Y, Z}) is analogous. If Z = ∅, CMI reduces to I(X; Y), for which we can prove the statement in the same manner.

S.1.4 Proof of Theorem 3.1
To prove Theorem 3.1, we need several intermediate results. Lemma S.3 shows that a histogram results in a valid discretization, as all terms corresponding to volumes in I^h cancel out, and hence I^h can be written as a sum of plug-in estimators of discrete entropies. Then, Lemma S.1 states the classic result that the plug-in estimator of discrete entropy converges to the true entropy almost surely. Further, we show in Lemma S.2 that, as the volumes of the histogram bins containing continuous values go to 0, the true entropy of the discretized variables (which are discretized by the histogram) converges to the true entropy of the original variables.
Definition S.1. Given discrete random variables X_d, Y_d, Z_d (possibly multi-dimensional) with supports S_{X_d}, S_{Y_d}, S_{Z_d}, and given a dataset D = (x_i, y_i, z_i)_{i∈{1,...,n}} with sample size n, the plug-in estimator of the discrete entropy H is denoted and defined as

H_n(X_d, Y_d, Z_d) = -\sum_{j \in S_{X_d} \times S_{Y_d} \times S_{Z_d}} \hat{p}(j) \log \hat{p}(j)

with probability estimates

\hat{p}(j) = \frac{|\{(x_i, y_i, z_i)_{i \in \{1,...,n\}} : (x_i, y_i, z_i) = q_j\}|}{n},

where |·| represents the cardinality of a set and q_j is the j-th element of S_{X_d} × S_{Y_d} × S_{Z_d}.

Lemma S.1. Given a discrete random vector (X_d, Y_d, Z_d), lim_{n→∞} H_n(X_d, Y_d, Z_d) = H(X_d, Y_d, Z_d) almost surely [1].
Lemma S.2. Given a random vector (X, Y, Z) that contains discrete-continuous mixture random variables, with bins B = B′ ∪ B′′ and the resulting discretized random vector (X_d, Y_d, Z_d), where B′′ contains the discrete data points (of which every dimension has a discrete value) and B′ = B \ B′′, we have

lim_{v′→0} H(X_d, Y_d, Z_d) = H(X, Y, Z),

where v′ = max_{B_j ∈ B′} v(B_j).

Proof. Firstly, it is well-known that this result holds if (X, Y, Z) is a continuous random vector [4]; then, if (X, Y, Z) contains mixture variables,

H(X, Y, Z) = -\lim_{v' \to 0} \left( \sum_{B_j \in B'} P_{X_d Y_d Z_d}(B_j) \log \frac{P_{X_d Y_d Z_d}(B_j)}{v(B_j)} + \sum_{B_j \in B''} P_{X_d Y_d Z_d}(B_j) \log \frac{P_{X_d Y_d Z_d}(B_j)}{v(B_j)} \right) = \lim_{v' \to 0} H(X_d, Y_d, Z_d),

where v(B_j) = 1 for all B_j ∈ B′′, which concludes the proof.
Definition S.2. Given a random vector (X, Y, Z) that contains mixture variables, and an adaptive grid B, we define the discretized random variables X_d, Y_d, Z_d with probability measure (probability mass function)

P_{X_d, Y_d, Z_d}((j_1, j_2, j_3)) = \int_{B_j} \frac{dP_{XYZ}}{dv} \, dv,

where B_j denotes the j-th bin of B.
Given a k -dimensional random vector ( X, Y, Z ) that contains mixture variables with an un-known probability measure P XY Z , a dataset D =( x i , y i , z i ) i ∈{ ,...,n } generated by P XY Z , a histogrammodel M , and corresponding discretized random vector ( X d , Y d , Z d ) , we have I h ( X, Y | Z ) is equal to H n ( X d , Z d ) + H n ( Y d , Z d ) − H n ( X d , Y d , Z d ) − H n ( Z d ) . That is, the terms corresponding to volumes in I h cancel out and our histogram model results a validdiscretization.Proof. Denote the adaptive grid of histogram model M as B XY Z , which is the Cartesian product of bins definedon
X, Y, Z —i.e. B XY Z = B X × B Y × B Z , and denotethe corresponding MLE of histogram density function as f h ˆ θ XY Z . Further, define a function v X , such that for each x i in D , v X ( x i ) = v ( B Xj ) if x i ∈ B Xj , where B Xj is a binof B X and v is defined based on the random variable X . Then, define v Y , v Z , v XZ , v Y Z , v
XY Z similarly.By the definition I h ( X, Y | Z ) is equal to H h ( X, Z ) + H h ( Y, Z ) − H h ( X, Y, Z ) − H h ( Z ) . First consider H h ( X, Z ). We write B XZ = B X × B Z ,with marginal density function f h ˆ θ XZ . W.l.o.g. supposethat B XZ consists of K bins, denoted as B XZj , j ∈ { , . . . , K } . Then, H h ( X, Z ) = − (cid:90) R kX + kZ f h ˆ θ XZ log f h ˆ θ XZ dv = − K (cid:88) j =1 (cid:90) B XZj f h ˆ θ XZ log f h ˆ θ XZ dv = − K (cid:88) j =1 c j log (cid:18) c j nv ( B j ) (cid:19) = − K (cid:88) j =1 c j log (cid:16) c j n (cid:17) + n (cid:88) i =1 log( v XZ ( x i , z i ))= H n ( X d , Z d ) + n (cid:88) i =1 log( v XZ ( x i , z i )) , where c j is the number of data points in B j and v XZ ( x i , z i ) = v X ( x i ) v Z ( z i ). The remaining entropiescan be calculated similarly. Hence, I h ( X, Y | Z ) = H n ( X d , Z d ) + H n ( Y d , Z d ) − H n ( X d , Y d , Z d ) − H n ( Z d ),as the sum of the volume related terms n (cid:88) i =1 log( v XZ ( x i , z i )) + n (cid:88) i =1 log( v Y Z ( y i , z i )) − n (cid:88) i =1 log( v XY Z ( x i , y i , z i )) − n (cid:88) i =1 log( v Z ( z i ))is equal to zero.To proof Theorem 3.1, we link the above results:lim v (cid:48) → lim n →∞ I h ( X ; Y | Z )= lim v (cid:48) → lim n →∞ ( H h ( X, Z )+ H h ( Y, Z ) − H h ( X, Y, Z ) − H h ( Z ))= lim v (cid:48) → lim n →∞ ( H n ( X d , Z d )+ H n ( Y d , Z d ) − H n ( X d , Y d , Z d ) − H n ( Z d ))= lim v (cid:48) → ( H ( X d , Z d )+ H ( Y d , Z d ) − H ( X d , Y d , Z d ) − H ( Z d ))= I ( X ; Y | Z ) . S.2 Implementation Details
As discussed in Sec. 5, our goal is to find the discretization that minimizes the joint entropy over a set of k random variables via an iterative greedy algorithm. We provide the pseudocode in Algorithm 1. As input, we are given a dataset D = {D_1, ..., D_k} consisting of n rows and k columns, representing a sample of size n from a k-dimensional random vector X, and a user-specified parameter i_max specifying the maximum number of iterations.

Algorithm 1: Iterative greedy discretization
  input: Data D = {D_1, ..., D_k} ∼ X; maximum number of iterations i_max
  output: Discretization X_d
  1: X_d ← init(D); i ← 1
  2: while X_d changes ∧ i ≤ i_max do
  3:   X^i_d ← X_d
  4:   foreach j ∈ {1, ..., k} do
  5:     X^{ij}_d ← refine(D_j | X_d)
  6:     if SC(X^{ij}_d) < SC(X^i_d) then X^i_d ← X^{ij}_d
  7:   X_d ← X^i_d; i ← i + 1
  8: return X_d

First, we initialize the discretization X_d (line 1) by creating single-bin histograms for the continuous points in each D_j and a bin with bin-width 1 per discrete point. To detect the latter, we check whether a value x occurs at least t times in D_j, where t is a user-defined threshold. After that, we iteratively update the discretization of the variable X_j that provides the highest gain in stochastic complexity, until either the score cannot be improved or the maximum number of iterations has been reached (lines 2–8). To update the discretization of a variable X_j, we call the function refine (line 5), which receives as input the data D_j and the discretization after iteration i. It then re-discretizes X_j using an extension of the dynamic programming algorithm by Kontkanen and Myllymäki [13]. In essence, instead of simply discretizing X_j independently of the remaining variables, we keep the discretizations of all X_i ≠ X_j fixed and find the optimal histogram model M* over X_j s.t. the overall score L(D, M) is minimized.
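For illustration, a simplified, runnable Python skeleton of the greedy loop in Algorithm 1 is given below. The refine step here merely splits the widest bin at its midpoint instead of running the dynamic program of [13], discrete bins are not handled, and the MDL score function is left to the caller; all of these simplifications are ours:

```python
from typing import Callable, Dict, List

def greedy_discretize(data: List[List[float]],
                      score: Callable[[Dict[int, List[float]]], float],
                      i_max: int = 5) -> Dict[int, List[float]]:
    """Simplified skeleton of Algorithm 1. data[j] is column j and cuts[j]
    the current cut points of dimension j. score(cuts) must return the full
    MDL score L(D|M) + L(M) of the induced histogram (supplied by the
    caller). Per iteration, the refinement of the single dimension that
    improves the score most is kept."""
    cuts: Dict[int, List[float]] = {j: [] for j in range(len(data))}
    best = score(cuts)  # init: a single bin per dimension
    for _ in range(i_max):
        best_cand, best_score = None, best
        for j, col in enumerate(data):
            # Stand-in refinement: split the widest current bin at its
            # midpoint (the paper instead runs the dynamic program of [13]).
            grid = sorted([min(col)] + cuts[j] + [max(col)])
            _, mid = max((b - a, (a + b) / 2) for a, b in zip(grid, grid[1:]))
            cand = {**cuts, j: sorted(cuts[j] + [mid])}
            s = score(cand)
            if s < best_score:
                best_cand, best_score = cand, s
        if best_cand is None:  # no dimension improves the score any more
            break
        cuts, best = best_cand, best_score
    return cuts
```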
S.3 Data Generation and Additional Experiments

In the following, we first provide an empirical analysis of how the number of bins depends on the number of samples; then we give the details of the data generation for the experiments carried out on the synthetic causal network; and last, we provide additional experiments to evaluate I_X and I_SC.

S.3.1 Sample Size and Number of Bins
As discussed in Sec. 3, an important requirement to ensure consistency is that the number of bins grows as a sub-linear function w.r.t. the number of samples. We demonstrate that MDL-optimal histograms have this desirable property when learned on one-dimensional Gaussian distributions in Figure 5: the number of bins K grows with n, but slower than √n. In addition, for multi-dimensional data, for which we can only approximate the histogram model that minimizes L(D, M), we observe that if the number of dimensions increases, the average number of bins per dimension decreases if we keep n fixed.

Figure 5: Left: Average number of bins k needed to discretize X ∼ N(0, 1) for increasing sample sizes (20 repetitions). Right: Average number of bins per dimension of a multivariate Gaussian with X_i ⊥⊥ X_j and X_i ∼ N(0, 1), for n = 2 000 (20 repetitions).

S.3.2 Synthetic Network
Here, we describe the data generation for the synthetic network shown in Fig. 3. The source nodes of the network are A and B. A is generated as A ∼ Exp(1) and B is uniformly distributed. For B → C, we generate C as C ∼ Binom(b, 0.5) for B = b; for B → D, we sample D as a Gaussian with variance 1 and a mean shifted by b for B = b; and E is exponentially distributed with rate c + 1 for C = c. F is generated as a function of C and D: first, we generate C′ by rounding the values of C, and then we write F as F = D C′ plus Gaussian noise. Last, we generate G as the zero-inflated Poissonization of A. Let E′ indicate whether E exceeds a fixed threshold, which ensures that E′ is either zero or one depending on the value of E. Then G ∼ N(a, 1) if E′ = 0 and A = a, and G ∼ Poisson(a) for A = a if E′ = 1.
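To illustrate the zero-inflated Poissonization that makes G a discrete-continuous mixture, consider the following sketch (our own illustration); the binarization of E and its upstream dependence on C are simplified placeholders here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

a = rng.exponential(scale=1.0, size=n)   # A ~ Exp(1)
e = rng.exponential(scale=1.0, size=n)   # placeholder for E (depends on C in the network)
e_bin = (e > np.median(e)).astype(int)   # placeholder binarization of E into {0, 1}

# G is Gaussian around A where e_bin = 0 and Poisson with mean A otherwise,
# which makes G a discrete-continuous mixture variable.
g = np.where(e_bin == 0, rng.normal(loc=a, scale=1.0), rng.poisson(lam=a))
```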
S.3.3 Detecting Collider and Non-Collider Structures

To evaluate how well I_X and I_SC can identify conditional (in)dependencies, we evaluate both variants on various generating mechanisms that involve collider and non-collider structures. Those structures are at the core of causal discovery, since collider structures can be inferred by detecting conditional dependencies, while non-collider structures impose conditional independencies. As in the causal discovery experiment, we set α = 0.01 for I_X, RCIT, and RCoT.

Collider Structures
We generate data according to a collider structure, which can be represented by a directed acyclic graph as, e.g., X → Z ← Y. According to this structure, we model X and Y by some distribution and write Z as a non-deterministic function of X and Y. We generate data for six generating mechanisms, including two continuous and four mixed settings:

1. X ⊥⊥ Y and X, Y are either drawn from a standard Gaussian or from a uniform distribution on a symmetric interval. Z is an additive function of polynomials up to degree three or the tangent function, plus additive Gaussian noise N, e.g., Z = X + tan(Y) + N. We pick the type of the distribution of X, Y, as well as the function type, uniformly at random.
2. X, Y are drawn from a standard Gaussian distribution, with X ⊥⊥ Y and Z = sign(X · Y) · E, where E is exponentially distributed with a fixed rate.
3. X, Y ∼ N(0, 1) with X ⊥⊥ Y and Z = sign(X · Y), where we randomly assign a z ∈ Z to 10% of the values in Z to make the function non-deterministic.
4. X ∼ N(0, 1) and Y ∼ Poisson(λ), with parameter λ selected uniformly at random from a fixed set of three values. We generate Z as X modulo Y and assign 10% of the data points randomly.
5. X, Y are unbiased coins. Z′ = X ⊕ Y, where ⊕ denotes the xor operator. From Z′ we calculate Z as N(0, 0.1) if Z′ = 0 and as Poisson(5) · N(0, 0.1) if Z′ = 1.
6. We generate X, Y, and Z′ as above, but this time we generate Z as Poisson(5) + N(0, 0.1) if Z′ = 1 and as N(0, 0.1) if Z′ = 0.

Figure 6: Accuracy for detecting continuous (left) and mixed-type (right) dependencies in collider structures (top) and independencies in non-collider structures (bottom) for different sample sizes.

For each generating mechanism, including two purely continuous and four mixed mechanisms, we generate 100 data sets and report the averaged results, separately for the continuous and mixed data, in Fig. 6 (top). On the continuous data, both of our approaches perform on par with RCIT and JIC for more than 400 data points, whereas MIIC has a slightly better performance and RCoT is not able to detect the dependence for the sign function and hence has an accuracy of about 50%. Since the functions for mixed data include an xor and the modulo operator, it is difficult to treat all discrete variables as ordinal and hence RCIT only reaches up to 80% accuracy—which is mostly due to an xor determining the scaling of a Gaussian distributed variable. On the other hand, both of our tests perform very well and only need 400 samples to obtain an accuracy close to 100%. JIC and MIIC perform on par with our tests.
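As a concrete example of the mixed mechanisms, this sketch (our illustration) generates data for mechanism 3, where the collider Z = sign(X · Y) is made non-deterministic by re-assigning 10% of its values at random:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Mechanism 3: X, Y independent standard Gaussians, Z = sign(X * Y),
# with 10% of the Z values re-assigned at random to break determinism.
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z = np.sign(x * y)
noise_idx = rng.choice(n, size=n // 10, replace=False)
z[noise_idx] = rng.choice([-1.0, 1.0], size=noise_idx.size)

# A conditional independence test on (X, Y | Z) should report dependence,
# since conditioning on the collider Z couples X and Y.
```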
Non-Collider Structures

Similar to collider structures, there also exist non-collider structures of the form X → Z → Y or X ← Z → Y. In both cases, the ground truth is that X ⊥⊥ Y | Z. To simulate data according to these graphs, we consider two continuous mechanisms based on polynomial functions and two mixed generating mechanisms:

1. X ∼ N(0, 1), Z is an additive noise function of X, and Y is an additive noise function of Z. The functions can be polynomials up to degree three or the tangent function.
2. Z ∼ N(0, 1), and X and Y are independent additive noise functions of Z, as defined above.
3. X, Y, and Z are generated as in Experiment IV.
4. X and Y are generated according to Experiment II, and Z ∼ N(µ, x) for X = x, with µ drawn from a fixed symmetric interval.