arXiv [cs.LG]
Maximum Entropy Based Significance of Itemsets
Nikolaj Tatti
HIIT Basic Research Unit, Department of Computer Science, Helsinki University of Technology, Helsinki. [email protected].fi
Abstract.
We consider the problem of defining the significance of an itemset. We say that the itemset is significant if we are surprised by its frequency when compared to the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the real value and the estimate. For the estimation we use Maximum Entropy and for measuring the deviation we use Kullback-Leibler divergence. A major advantage compared to the previous methods is that we are able to use richer models, whereas the previous approaches only measure the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that for our real datasets the independence assumption is too strong, but applying more flexible models leads to good results.
1. Introduction
How significant is a given itemset? Itemsets are popular and well-studied patterns in binary data mining. The major drawback is that, given a dataset, there is an exponential number of itemsets. Hence, we need to rank itemsets in order to prune the uninteresting ones.
Traditionally, the frequency of an itemset is used as a rank measure: the higher the frequency, the more significant the itemset. Frequency has many virtues: it is easy to interpret, and because of its anti-monotonicity there exist efficient algorithms for finding all frequent itemsets (Agrawal, Imielinski and Swami, 1993; Agrawal, Mannila, Srikant, Toivonen and Verkamo, 1996). There are, however, major drawbacks. First, a frequent itemset may be insignificant: an itemset AB may be frequent just because the itemsets A and B are frequent. Second, an infrequent itemset may be significant: if the itemsets A and B are frequent, the infrequency of AB is interesting information.
Alternative methods for ranking itemsets are suggested in (Aggarwal and Yu, 1998; Brin, Motwani and Silverstein, 1997; DuMouchel and Pregibon, 2001). These methods are discussed in more detail in Section 4. A common feature of these methods is that they compare the frequency of an itemset to an estimate obtained from the independence model. That is, the more the itemset deviates from the independence model, the more surprising, and thus the more significant, the itemset is.
Our proposal for ranking itemsets resembles the aforementioned approaches. We estimate the frequency of a given itemset from the frequencies of some selected sub-itemsets. Namely, we use Maximum Entropy for the estimation. This approach is more flexible than the independence model, since the independence model uses only the margins (the frequencies of itemsets of size 1) for prediction, whereas our approach allows us to use the information available from the itemsets of larger size.
While our ranking method is based on well-known tools, no similar framework has been suggested previously.
Unlike the frequency, our measure is not decreasing with respect to set inclusion. Hence we cannot mine significant itemsets in a level-wise fashion. However, it turns out that in some cases we can prune a large set of itemsets that are uninteresting with respect to the measure: namely, if the itemset is derivable (Calders and Goethals, 2002), then the measure is equal to 0. We also point out that the measure can be used as a statistical test, thus providing a clear interpretation for its values.
The rest of the paper is organised as follows: Preliminaries are given in Section 2. The definition and the properties of the measure are given in Section 3. We present related work in Section 4. Section 5 is devoted to experiments, and finally we provide conclusions in Section 6.
2. Preliminaries and Notation
In this section we briefly review the theory of itemsets and also introduce some notation that will be used later on.
A binary dataset D is a collection of M binary vectors, transactions, having length K. Such a dataset can be naturally represented as a matrix of size M × K. We denote the number of transactions by |D| = M. To each column of the matrix we assign an attribute a_i. Let A = {a_1, . . . , a_K} be the collection of all attributes. An itemset X ⊆ A is a set of attributes.
We say that a transaction (binary vector) ω covers an itemset X if a_i ∈ X implies ω_i = 1. Given a dataset D, the frequency of an itemset X is the proportion of the transactions in D covering X. Note that if an itemset Y is a subset of X, then the frequency of Y is larger than or equal to the frequency of X. In other words, frequency is decreasing with respect to set inclusion.
A sample space Ω is the set of all binary vectors of length K. We take a simplistic approach in defining distributions: a distribution p : Ω → [0, 1] is a function from the sample space Ω to a real number between 0 and 1 such that Σ_{ω∈Ω} p(ω) = 1. Given an itemset X, the frequency of X calculated from a distribution p is the probability of a binary vector covering X. We denote this by

p(X = 1) = p(ω covers X).

A family of itemsets F is called anti-monotonic or downward closed if every subset of each member of F is also a member of F. Note that a collection of σ-frequent itemsets, that is, itemsets having frequency larger than some given threshold σ, is downward closed. We are interested in three particular families:
– I, the family containing only itemsets of size 1.
– C, the family containing itemsets of size 1 and 2.
– A, the family containing all itemsets.
The negative border negbord(F) of a downward closed family F is the set of itemsets just above F. In other words, X ∉ F is a member of negbord(F) if there is no proper subset Y ⊂ X such that Y ∉ F.
Given a dataset D, we say that an itemset X is derivable if by knowing the frequencies (calculated from D) of each proper subset of X we can deduce the frequency of X. For example, if some subset of X has frequency 0, then we know that X must also have frequency 0. Thus, in this case, X is derivable. An itemset that is not derivable is called non-derivable. The family of all non-derivable itemsets is downward closed (Calders and Goethals, 2002).
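The basic notions above are easy to state in code. The sketch below is our own illustration (not code from the paper; the function names are ours): it computes itemset frequencies from a 0/1 matrix, where an itemset is a set of column indices, and the negative border of a downward closed family.

```python
from itertools import combinations

def frequency(data, itemset):
    """Fraction of transactions (rows of a 0/1 matrix) covering the itemset,
    where the itemset is given as a set of column indices."""
    return sum(all(row[i] == 1 for i in itemset) for row in data) / len(data)

def negative_border(family, attributes):
    """Itemsets just above a downward closed family: sets X outside the family
    all of whose proper subsets belong to it (the empty set always does)."""
    fam = {frozenset(x) for x in family} | {frozenset()}
    border = set()
    for size in range(1, len(attributes) + 1):
        for x in combinations(sorted(attributes), size):
            if frozenset(x) in fam:
                continue
            if all(frozenset(s) in fam for s in combinations(x, size - 1)):
                border.add(frozenset(x))
    return border

# Four transactions over attributes 0, 1, 2.
data = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 1, 1)]
print(frequency(data, {0, 1}))             # 0.5: two of four rows cover {0, 1}
family = [{0}, {1}, {2}, {0, 1}]
print(negative_border(family, [0, 1, 2]))  # {frozenset({0, 2}), frozenset({1, 2})}
```

Note how the border contains exactly the itemsets whose subsets are all known; these are the candidates a level-wise miner would test next.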
3. Maximum Entropy Ranking
In this section we introduce our ranking method and discuss its theoretical properties. The fundamental idea behind our approach is to measure how surprising an itemset is compared to its subsets. In other words, we estimate the itemset frequency by using the frequencies of its subsets and compare how close our estimate is to the actual value. The estimation is done using the Maximum Entropy method and the comparison is done using Kullback-Leibler divergence.
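For the simplest family, the margins alone, the maximum-entropy model is just the independence model, and the whole pipeline of projecting the data, fitting the model and measuring the Kullback-Leibler divergence fits in a few lines. This is an illustration of ours, not code from the paper; `rank_independence` is a hypothetical name.

```python
from collections import Counter
from math import log

def rank_independence(rows):
    """KL divergence between the empirical distribution of the projected data
    and the product (independence) model fitted to its margins."""
    n, k = len(rows), len(rows[0])
    q = Counter(tuple(r) for r in rows)                       # empirical distribution
    margins = [sum(r[i] for r in rows) / n for i in range(k)]
    r = 0.0
    for omega, cnt in q.items():
        q_w = cnt / n
        p_w = 1.0
        for i, bit in enumerate(omega):                       # independence model
            p_w *= margins[i] if bit else 1.0 - margins[i]
        r += q_w * log(q_w / p_w)
    return r

rows = [(1, 1), (0, 0), (1, 1), (0, 0)]   # two perfectly correlated attributes
print(rank_independence(rows))            # log 2: maximal deviation from independence
```

For data that actually obeys the independence model the value tends to 0; the normalisation of this raw value into a statistical test is discussed below.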
Let D be a binary dataset and let {a_1, . . . , a_K} be its attributes. The number of columns in D is K. Assume that we are given G, an itemset we wish to rank. We define a projected dataset D_G by keeping only the attributes included in G. Let Ω_G = {0, 1}^{|G|} be the space of binary vectors of length |G|. We define an empirical distribution q_G : Ω_G → [0, 1] to be

q_G(ω) = (number of samples in D_G equal to ω) / |D_G|.

Our goal is to compare the distribution q_G to a distribution obtained by using Maximum Entropy (Kullback, 1968), a method that we will describe next.
Assume now that we are given a family of itemsets F ⊆ A and let θ_X be the frequency of X ∈ F calculated from D. Our next step is to define an approximative distribution using only the itemsets in F. In defining q_G we projected out the attributes outside G. Similarly, we are only interested in subsets of G. Hence we define a projected family F_G to be

F_G = {X ∈ F | X ⊂ G, X ≠ G, X ≠ ∅}.

Note that F_G may contain at most 2^{|G|} − 2 itemsets; this bound is reached when F = A. We say that a distribution p : Ω_G → [0, 1] satisfies the itemsets F_G if for each itemset X ∈ F_G and its frequency θ_X we have p(X = 1) = θ_X. Let P be the set of all distributions satisfying the itemsets F_G. This set is not empty since q_G ∈ P. We select the distribution from P maximising the entropy

H(p) = − Σ_{ω∈Ω_G} p(ω) log p(ω).

We denote this distribution by p∗. Note that p∗ depends on G, F, and θ, but we have omitted these variables from the notation for the sake of clarity.
We define the rank measure r(G; F, D) to be the divergence between q_G and p∗, that is,

r(G; F, D) = Σ_{ω∈Ω_G} q_G(ω) log ( q_G(ω) / p∗(ω) ).

We omit D from the notation when the dataset is clear from the context. Example 1.
Assume the simplest case where G = {a} is an itemset of size 1. Let θ_G be the frequency of G. Note that F_G = ∅, hence there are no constraints on selecting p∗. This means that p∗ is the uniform distribution, that is, p∗(0) = p∗(1) = 1/2. In this case the measure is

r(a; F) = (1 − θ_G) log(2(1 − θ_G)) + θ_G log(2 θ_G),

obtaining its minimum when θ_G = 1/2 and its maximum when θ_G = 0 or θ_G = 1.
We are mainly interested in three kinds of measures: The first is r(G; I), in which I is the family of itemsets of size 1. In this case the Maximum Entropy distribution is equal to the independence model.
The second case is r(G; C), where C contains the itemsets of size 1 and 2. We can show that there exists a matrix B such that for the non-zero entries of p∗ we have

p∗(ω) ∝ exp(ω^T B ω).

Hence, r(G; C) can be seen as a measure of the deviation from the discrete Gaussian model.
Our third type of measure is r(G; A), in which p∗ is predicted from all the proper sub-itemsets of G. In this case we can prove that for a certain set of real numbers r_i we have, for the non-zero entries of p∗,

p∗(ω) ∝ Π_{X_i ∈ A_G} exp( r_i I(ω covers X_i) ),

where I is the indicator function. We discuss the evaluation of our approach in Section 3.4.
We now discuss various properties of r(G). We will first point out the connection between r(G) and derivable itemsets and then discuss the use of r(G) as a statistical test. Theorem 2.
Let G be a derivable itemset. Then r(G; A) = 0.
Proof. We can argue that if we know the frequencies of all sub-itemsets of G, we can derive the distribution q_G and vice versa. This implies that there is a one-to-one correspondence between a distribution p ∈ P satisfying the itemsets A_G and the frequency p(G = 1). Since we can derive the frequency of G from A_G, it follows that P = {q_G}, and hence p∗ = q_G.
We can reformulate the previous theorem in a stronger form by pointing out that we need to know only the non-derivable itemsets. Theorem 3.
Let F be the family of all non-derivable itemsets. Let G be outside of F. Then r(G; F) = 0.
Proof. Since all unknown sub-itemsets of G are derivable from F_G, the argument of Theorem 2 holds.
The following theorem provides an interpretation of the value of r(G) and points out that we can use r(G) as a statistical test. Theorem 4.
Let G be a non-derivable itemset. Under the null hypothesis that G is distributed according to p∗, the quantity 2|D| r(G; A) is distributed asymptotically as χ² with 1 degree of freedom.
Theorem 4 is a special case of the following more general statement. Theorem 5.
Let G be a non-derivable itemset and let F be an itemset family. Define H to be

H = {X ∈ A | X ⊆ G, X ≠ ∅, X ∉ F_G},

that is, H is the family of sub-itemsets of G not belonging to F_G. Under the null hypothesis that the itemsets in H are distributed according to p∗, the quantity 2|D| r(G; F) is distributed asymptotically as χ² with |H| = 2^{|G|} − 1 − |F_G| degrees of freedom.
Theorem 5 is stated (but not proven) in a more general form in (Kullback, 1968). A rather technical proof is provided in Appendix A.
Theorem 5 motivates us to define the normalised rank measure to be the one-sided χ² test, that is,

nr(G; F, D) = cdf( 2|D| r(G; F, D) ),

where cdf(a) = P(χ² < a) is the cumulative distribution function of χ² with 2^{|G|} − 1 − |F_G| degrees of freedom. The numbers of degrees of freedom for the different rank measures are provided in Table 1.
The following well-known result and its corollaries will play an important role in solving the measures. Lemma 1.
Let p∗ be the Maximum Entropy distribution for the itemsets F and the corresponding frequencies θ. Let q be a distribution satisfying the itemsets F. Then we have

− Σ_ω q(ω) log p∗(ω) = H(p∗). Corollary 6.
Let F be a family of itemsets. We have that

r(G; F) = H(p∗) − H(q_G),

where p∗ is the Maximum Entropy distribution and q_G is the empirical distribution.

Fig. 1. A toy tree model on the attributes a, b, c, d, e. The related itemsets {a, b, c, d, e, ab, ac, ad, de} correspond to the attributes and the edges of the tree.

Corollary 7.
Let F, H be families of itemsets such that H ⊆ F. Let p∗_F be the Maximum Entropy distribution for F and let p∗_H be the Maximum Entropy distribution for H. We have that

r(G; F) = KL(q_G ‖ p∗_H) − KL(p∗_F ‖ p∗_H),

where q_G is the empirical distribution. Corollary 8.
Let F, H be families of itemsets such that H ⊆ F. We have that r(G; F) ≤ r(G; H).
So far we have considered ranks with fixed families of itemsets. Next we introduce two additional models. In these models the itemsets are selected such that they minimise the rank.
Our first rank measure is the optimal tree model. A tree model can be described as a tree defined on the attributes of G. The corresponding family T of itemsets contains the attributes of G and the itemsets of size 2 corresponding to the edges of the tree. Example 9.
Consider G = {a, b, c, d, e} and consider the tree given in Figure 1. The corresponding family of itemsets is T = {a, b, c, d, e, ab, ac, ad, de}.
We can show that the Maximum Entropy distribution for T has the form

p∗(ω) = Π_{ab∈T} p∗(a, b) / Π_{a∈G} p∗(a)^{d(a)−1},

where d(a) is the degree of the attribute a in the tree. This is, of course, the Chow-Liu tree model (Chow and Liu, 1968). We define the optimal tree to be the one that minimises the rank, that is,

T∗ = arg min_{T is a tree} r(G; T, D).

To solve this tree, let p_ind be the independence distribution. Corollary 6 allows us to rewrite the rank measure as

r(G; T) = KL(q_G ‖ p∗) = KL(q_G ‖ p_ind) − KL(p∗ ‖ p_ind).

Note that the first term KL(q_G ‖ p_ind) does not depend on T. Hence we need to maximise the second term KL(p∗ ‖ p_ind). This is the mutual information of the tree, and maximising this term is equivalent to finding the maximum spanning tree in the mutual information graph. This can be done in polynomial time (Chow and Liu, 1968).
There is a deep connection between the rank r(G; T) and the rank for trees suggested in (Heikinheimo, Hinkkanen, Mannila, Mielikäinen and Seppänen, 2007). We can rewrite, by applying Corollary 6, the rank as

r(G; T) = KL(q_G ‖ p∗) = H(p∗) − H(q_G).

The first term H(p∗) is the rank that is used in (Heikinheimo et al., 2007). The authors in (Heikinheimo et al., 2007) seek patterns that have small H(p∗), that is, trees that have strong dependencies between the attributes, whereas we are interested in patterns that produce large r(G; T∗), sets of attributes whose joint distribution cannot be explained even by the best tree model.
Our second model involves finding a downward closed family F of itemsets that produces the smallest normalised rank. Note that Corollary 8 implies that the rank decreases when we increase the number of known itemsets.
However, this does not hold for the normalised rank, and we will see that, contrary to expectations, the best model can be different from A_G, the set of all sub-itemsets of G. In other words, knowing all sub-itemsets does not guarantee the best model; in fact, itemsets of higher order may mislead the prediction.
Unlike with the tree models, to our knowledge there is no polynomial algorithm for finding the optimal downward closed family. Hence, we suggest a simple greedy approach. We start from the itemsets of size 1 and select the itemset from the negative border that minimises the rank. The itemset is added into the family and the procedure is repeated until there is no itemset that can decrease the rank. The algorithm is stated in Algorithm 1. We use F∗ to denote the resulting family. Algorithm 1
Greedy algorithm for finding the optimal downward closed family of itemsets. The input is the dataset D and the query itemset G. The output is F∗, a family of itemsets that produces a low rank for the itemset.
F∗ ⇐ I_G. {Initialise F∗ with itemsets of size 1.}
repeat
  Y ⇐ arg min_{X ∈ negbord(F∗)} nr(G; F∗ ∪ X).
  if nr(G; F∗ ∪ Y) < nr(G; F∗) then
    F∗ ⇐ F∗ ∪ Y.
  end if
until no more changes in F∗.

Measure    Description           Degrees of freedom             Evaluation time
r(G; I)    Independence model    2^{|G|} − 1 − |G|              O(|G|)
r(G; C)    Gaussian model        2^{|G|} − 1 − |G|(|G| + 1)/2   O(2^{|G|}) per iter.
r(G; A)    All subsets model     1                              O(2^{|G|}) per iter.
r(G; T∗)   Optimal tree model    2^{|G|} − 2|G|                 O(2^{|G|})
r(G; F∗)   Optimal family model  2^{|G|} − 1 − |F∗_G|           O(2^{|G|}) per iter.
Table 1. Summary of the rank measures. The number of degrees of freedom, the third column, is used as a parameter for the χ² distribution when computing the normalised rank. The fourth column represents the evaluation time for the entropy of p∗.

Corollary 6 allows us to rewrite the rank as a difference of two entropies,

r(G) = KL(q_G ‖ p∗) = H(p∗) − H(q_G).

Both distributions have |Ω_G| = 2^{|G|} entries. However, the distribution q_G can have at most |D| positive entries, hence the term H(q_G) can be computed efficiently. The challenge in calculating the measure is to solve the Maximum Entropy distribution p∗ and calculate its entropy. This can be done in polynomial time for the independence model and for the tree models. However, in the general case solving p∗ is an NP-complete problem (Tatti, 2006a; Cooper, 1990); in such cases the distribution is solved using the Iterative Scaling algorithm (Darroch and Ratcliff, 1972; Jiroušek and Přeučil, 1995). The algorithm consists of consecutive steps; one such step requires O(|Ω_G|) = O(2^{|G|}) time.
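The evaluation machinery just described can be sketched end to end: Iterative Scaling over the full space Ω_G for the general maximum-entropy model, a maximum spanning tree of the mutual-information graph for T∗, and a χ² cdf for the normalisation. This is our own illustration, not the authors' code; the constraint format, the fixed sweep count, and the hand-rolled cdf are our assumptions.

```python
from itertools import product
from math import exp, lgamma, log

def iterative_scaling(k, constraints, sweeps=200):
    """Fit the maximum-entropy distribution on {0,1}^k subject to
    p(X = 1) = theta for each (itemset, theta) pair, by repeatedly rescaling
    the covering and non-covering parts of the distribution."""
    omega = list(product((0, 1), repeat=k))
    p = {w: 1.0 / 2 ** k for w in omega}          # start from the uniform distribution
    for _ in range(sweeps):
        for itemset, theta in constraints:
            cur = sum(p[w] for w in omega if all(w[i] for i in itemset))
            if 0.0 < cur < 1.0:
                for w in omega:
                    covers = all(w[i] for i in itemset)
                    p[w] *= theta / cur if covers else (1.0 - theta) / (1.0 - cur)
    return p

def mutual_information(rows, i, j):
    """Empirical mutual information between binary attributes i and j."""
    n, mi = len(rows), 0.0
    for x, y in product((0, 1), repeat=2):
        pxy = sum(1 for r in rows if r[i] == x and r[j] == y) / n
        px = sum(1 for r in rows if r[i] == x) / n
        py = sum(1 for r in rows if r[j] == y) / n
        if pxy > 0.0:
            mi += pxy * log(pxy / (px * py))
    return mi

def chow_liu_tree(rows):
    """Edges of the maximum spanning tree of the mutual-information graph
    (Prim's algorithm); they form the optimal tree family T*."""
    k, in_tree, edges = len(rows[0]), {0}, []
    while len(in_tree) < k:
        mi, i, j = max((mutual_information(rows, i, j), i, j)
                       for i in in_tree for j in range(k) if j not in in_tree)
        edges.append((i, j))
        in_tree.add(j)
    return edges

def chi2_cdf(x, dof):
    """Chi-squared CDF via the series for the regularised lower incomplete gamma."""
    if x <= 0.0:
        return 0.0
    s, z = dof / 2.0, x / 2.0
    term = total = 1.0 / s
    n = 1
    while term > 1e-16 * total:
        term *= z / (s + n)
        total += term
        n += 1
    return exp(s * log(z) - z - lgamma(s)) * total

def normalised_rank(rank, n_rows, g_size, n_known):
    """nr(G; F) = cdf(2 |D| r(G; F)) with 2^{|G|} - 1 - |F_G| degrees of freedom."""
    return chi2_cdf(2 * n_rows * rank, 2 ** g_size - 1 - n_known)

# Margins of 1/2 are already satisfied by the uniform distribution.
p = iterative_scaling(2, [((0,), 0.5), ((1,), 0.5)])
print(p[(1, 1)])                                  # 0.25
# Attribute 1 copies attribute 0, so the tree picks the edge (0, 1) first.
print(chow_liu_tree([(1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0)]))
print(normalised_rank(0.002, 5000, 3, 3))         # close to 1: significant
```

The exponential cost discussed above is visible directly: every Iterative Scaling sweep touches all 2^{|G|} entries of Ω_G, which is why pruning the attributes outside G matters.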
Hence computing the measure requires exponential time, but it is doable for itemsets of reasonable size. A summary of the evaluation times is provided in Table 1.

The effect of pruning itemsets.
Note that in defining the measure we only use itemsets that are subsets of the query itemset G. This pruning guarantees that the number of entries in the distributions is 2^{|G|} and not, at worst, 2^K, where K is the number of columns in the dataset. Pruning attributes is essential since solving p∗ is exponential in the number of attributes. The downside is that pruning may change the prediction, as the following example demonstrates. Example 10.
Assume that we have 3 attributes, a, b, and c. Our known itemsets are F = {a, b, c, ac, bc} and their frequencies are θ_a = θ_b = θ_c = θ_ac = θ_bc = 1/2. Assume that the query itemset is G = ab. In this case the pruned family of itemsets is F_G = {a, b} and the Maximum Entropy distribution is the uniform distribution. The empirical distribution is

q_G(a = 0, b = 0) = q_G(a = 1, b = 1) = 1/2, q_G(a = 1, b = 0) = q_G(a = 0, b = 1) = 0.

The rank is then r(ab; F) = 0.69. However, if we had used the frequencies of ac and bc, we would have concluded that a = b and that the Maximum Entropy distribution is equal to the empirical distribution; hence the rank would have been 0.
In (Tatti, 2006b) we investigate the effect of pruning attributes and conclude that in some cases we can remove a large portion of the attributes outside G. However, in those cases the family of known itemsets has many restrictions and, for instance, we cannot safely remove any attribute from the Gaussian model.
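The value 0.69 in Example 10 is just log 2; a quick numerical check of the example using the rank definition (our own snippet):

```python
from math import log

# Example 10: F_G = {a, b} with theta_a = theta_b = 1/2, so the maximum-entropy
# distribution on {0,1}^2 is uniform, while the empirical distribution puts
# probability 1/2 on both 00 and 11.
p_star = {w: 0.25 for w in [(0, 0), (0, 1), (1, 0), (1, 1)]}
q = {(0, 0): 0.5, (1, 1): 0.5}
rank = sum(q_w * log(q_w / p_star[w]) for w, q_w in q.items())
print(round(rank, 2))   # 0.69, i.e. log 2
```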
4. Related Work
Traditionally, the support (frequency) of the itemset is used for ranking itemsets. Alternative measures that resemble the support are studied in (Omiecinski, 2003).
Our work resembles the approach of (Brin, Motwani and Silverstein, 1997), in which the authors defined the significance of an itemset by comparing the distribution q_G against the independence model. The authors used the χ² statistical test as a measure; that is, if p is the distribution related to the independence model, the rank measure is

r_b(G) = Σ_{ω∈Ω_G} (q_G(ω) − p(ω))² / p(ω).   (1)

In (DuMouchel and Pregibon, 2001) the authors also compare the frequency of an itemset against the independence model, but in addition they use Bayes screening to smooth the values. Also, in (Aggarwal and Yu, 1998) the authors proposed the collective strength as a measure of significance. To be more specific, we say that a transaction ω ∈ Ω_G is good if it contains only 0s or only 1s. Let p be the distribution related to the independence model. Then the measure is

r_cs(G) = ( q_G(ω is good) / p(ω is good) ) × ( p(ω is bad) / q_G(ω is bad) ).   (2)

This measure obtains values close to 1 when the data obeys the independence model. In a related work presented in (Dong and Li, 1999) the authors define an itemset to be interesting if its frequency increases significantly from one dataset to another. In (Gallo, De Bie and Cristianini, 2007) the authors order itemsets based on their p-values. In (Heikinheimo et al., 2007) the authors used the entropy of tree models for ranking itemsets. In addition, many measures have been suggested for ranking association rules (Piatetsky-Shapiro, 1991; Brin, Motwani, Ullman and Tsur, 1997; Agrawal et al., 1993; Jaroszewicz and Simovici, 2002).
The authors in (Pavlov, Mannila and Smyth, 2003) showed empirically that the Maximum Entropy model provides excellent estimates for itemsets. The rank can be used for pruning a large family of itemsets by picking the itemsets having the largest rank.
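The two baseline measures of Eqs. (1) and (2) can be computed directly from the projected distributions. The sketch below is ours, with a made-up two-attribute example; the function names are hypothetical.

```python
from itertools import product

def chi_square_rank(q, p):
    """r_b(G) of Eq. (1): chi-squared deviation of the empirical distribution q
    from the independence model p (both dicts over binary vectors)."""
    return sum((q.get(w, 0.0) - p_w) ** 2 / p_w for w, p_w in p.items())

def collective_strength(q, p):
    """r_cs(G) of Eq. (2): a transaction is 'good' if all its entries agree."""
    good = lambda d: sum(v for w, v in d.items() if len(set(w)) == 1)
    qg, pg = good(q), good(p)
    return (qg / pg) * ((1.0 - pg) / (1.0 - qg))

p = {w: 0.25 for w in product((0, 1), repeat=2)}        # independence, margins 1/2
q = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}
print(chi_square_rank(q, p))      # about 0.36: the data leans away from independence
print(collective_strength(q, p))  # about 4.0: far more 'good' transactions than expected
```

When q equals p, Eq. (1) gives 0 and Eq. (2) gives 1, matching the independence baselines of both measures.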
Other pruning methods are proposed in (Boulicaut, Bykowski and Rigotti, 2000; Calders and Goethals, 2002; Pasquier, Bastide, Taouil and Lakhal, 1999). The authors in (Webb, 2006) suggest a generic framework for discovering significant rules. In addition, a relevant framework is described in (Mannila and Mielikäinen, 2003); the authors define a pattern ordering given an estimation algorithm and a loss function. In (Norén, Bate and Edwards, 2007) the authors use information component analysis to find patterns in a drug safety database.

5. Experiments

In this section we present our empirical results. In the first three subsections we explain the datasets and the setup. In our experiments we investigate the significance of itemsets, how different measures are related to each other, and the monotonicity of the ranks.
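In the experiments below, an itemset is declared significant at risk level 0.05 whenever its normalised rank exceeds 0.95, since nr(G; F) is the one-sided χ² cdf value. A two-line sketch of this thresholding (the rank values are made up):

```python
# nr(G; F) is the chi-squared cdf of 2|D| r(G; F): reject the null hypothesis
# (the itemset is explained by its sub-itemsets) when nr exceeds 1 - risk.
ranks = {"AB": 0.999, "AC": 0.40, "BCD": 0.97}   # hypothetical normalised ranks
significant = {g for g, nr in ranks.items() if nr > 1.0 - 0.05}
print(sorted(significant))   # ['AB', 'BCD']
```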
For testing purposes we created two synthetic datasets. Each dataset contained 100 attributes and 5000 rows. The first dataset, gen-ind, was generated such that the attributes were independent; the margins were sampled uniformly from [0, 1]. In the second dataset, gen-copy, each column was a copy of the previous column corrupted by symmetric white noise. The amount of noise, that is, the probability

p(a_i = 1 | a_{i−1} = 0) = p(a_i = 0 | a_{i−1} = 1),

was selected uniformly from [0, 1] for each column a_i individually. The first column was generated by a coin flip. Our expectations are that in gen-ind the itemsets of size 1 are significant and that in gen-copy the itemsets of size 2 are significant.
In our experiments we used the following real-world datasets. Data in
Accidents were obtained from the Belgian “Analysis Form for Traffic Accidents” forms that are filled out by a police officer for each traffic accident that occurs with injured or deadly wounded casualties on a public road in Belgium. In total, 340 183 traffic accident records are included in the dataset (Geurts, Wets, Brijs and Vanhoof, 2003). The datasets POS, WebView-1 and WebView-2 were contributed by Blue Martini Software as the KDD Cup 2000 data (Kohavi, Brodley, Frasca, Mason and Zheng, 2000). POS contains several years' worth of point-of-sale data from a large electronics retailer.
WebView-1 and
WebView-2 contain several months worth of click-stream data from two e-commerce websites.
Kosarak consists of (anonymised) click-stream data of a Hungarian on-line news portal. Retail is a retail market-basket dataset supplied by an anonymous Belgian retail supermarket store (Brijs, Swinnen, Vanhoof and Wets, 1999). The dataset Paleo contains information of species fossils found in specific paleontological sites in Europe (Fortelius, 2005), preprocessed as in (Fortelius, Gionis, Jernvall and Mannila, 2006).

Data sources: http://fimi.cs.helsinki.fi/data/accidents.dat.gz, http://fimi.cs.helsinki.fi/data/kosarak.dat.gz, http://fimi.cs.helsinki.fi/data/retail.dat.gz; Paleo is the NOW public release 030717 available from (Fortelius, 2005).

Table 2. The evaluation times and the sizes of the query families. The second column is the threshold n used in mining almost non-derivable itemsets, the third column is the number of query itemsets, and the fourth column is the maximal size of a query itemset; the remaining columns give the evaluation times of the measures nr(G; I), nr(G; C), nr(G; A), nr(G; T∗) and nr(G; F∗). The evaluation time does not include the time spent mining itemsets. [Most numeric entries of this table did not survive extraction; among the recoverable values, gen-copy has n = 100, 111 487 queries and max |G| = 4; Accidents has n = 100, 354 399 queries and max |G| = 6; POS has n = 10, 246 640 queries and max |G| = 6. Evaluation times range from seconds to minutes.]
In this section we will describe how we conducted our experiments. We reduced the largest datasets by selecting the first 10000 rows and the 200 most frequent attributes. From each dataset we computed all almost non-derivable itemsets. By almost non-derivable we mean that the difference between the upper bound and the lower bound of a given itemset, say G, is at least n transactions. In other words, if we know the frequencies of all sub-itemsets of G, then we cannot predict the frequency of G within n transactions. If n = 0, then an itemset is non-derivable. It is known that the family of almost non-derivable itemsets is anti-monotonic (Calders and Goethals, 2002, Lemma 3.1). A reason to use almost non-derivable itemsets instead of frequent itemsets is the statement of Theorem 3, that is, r(G; A) = 0 if the itemset is derivable. The other reason is that we want to study how the measure behaves for infrequent itemsets.
To keep the sizes of the obtained families within reasonable bounds we used different thresholds for different datasets: For gen-ind, Retail and WebView-2 we set n = 5. For POS the threshold n was set to 10, and for gen-copy and Accidents n was set to 100. For the rest of the datasets we set n = 0, that is, we mined all non-derivable itemsets from these datasets.
For each itemset from the obtained itemsets we queried the following measures:
– Frequency.
– Normalised rank measures nr(G; I), nr(G; C), nr(G; A), nr(G; T∗), nr(G; F∗).
– Measures discussed in Section 4: the χ² test r_b(G) defined in Eq. 1 and the collective strength r_cs(G) defined in Eq. 2.
The evaluation times and the sizes of the query families are given in Table 2.
Our first experiment is to study how many of the itemsets are significant. We did this by comparing our rank measures with the risk level 0.05. The results are given in Tables 3–5. We also provide a typical example of box plots in Figure 2.

Table 3. The percentages of significant itemsets according to nr(G; I). Each entry is the fraction of itemsets of a specific size (1–6, plus a column for all sizes) calculated from a specific dataset. Significance is measured using the χ² distribution with 0.05 risk level. [Most numeric entries of Tables 3–5 did not survive extraction.]

Table 4. The percentages of significant itemsets according to nr(G; C) and nr(G; A). Each entry is the fraction of itemsets of a specific size calculated from a specific dataset. Significance is measured using the χ² distribution with 0.05 risk level.

Table 5. The percentages of significant itemsets according to nr(G; T∗) and nr(G; F∗). Each entry is the fraction of itemsets of a specific size calculated from a specific dataset. Significance is measured using the χ² distribution with 0.05 risk level.

Fig. 2. Box plots of the rank measures, as functions of the itemset size, computed from Paleo.

Let us first study gen-ind, a synthetic dataset with independent columns. We see from Table 3 that according to nr(G; I) a large portion of the itemsets of size 1 are significant, but only a small portion of the itemsets having size larger than 1 are significant. This is an expected result since the frequencies obey the independence model. In Table 4 we have similar results for nr(G; C) and for nr(G; A). However, the values of nr(G; C) and of nr(G; A) tend to be larger than the values of nr(G; I). The reason for this is a type of overlearning: since the frequencies of the itemsets are calculated from the datasets, they are imprecise. Hence, the itemsets of larger size mislead us during prediction, because the resulting Maximum Entropy distribution is not an independence model (although close to one).
Let us continue by studying gen-copy, a synthetic dataset in which an attribute is a noisy copy of the previous attribute. We see that nr(G; T∗) tends to have smaller ranks than nr(G; I) when G has size 3. The reason for this is that, unlike with gen-ind, the independence model cannot explain the dataset; however, when we predict using also the itemsets of size 2, the prediction becomes more accurate. The measures nr(G; C) and nr(G; A) also produce small ranks, although these ranks tend to be slightly larger than the ranks of nr(G; T∗).
We turn our attention to the real datasets. We see that for these datasets the independence model is too strict: according to nr(G; I) almost all itemsets are significant. The results change drastically when we use richer models: according to nr(G; C) or nr(G; A) only 5%–50% of the itemsets are significant, depending on the dataset. Overfitting similar to that in gen-ind also occurs in some, but not all, real datasets (see Figure 2). For instance, in Retail nr(G; A) tends to produce higher values than nr(G; C), but not in POS.
We continued our experiments by comparing the measures nr(G; I), nr(G; C), nr(G; A), nr(G; T∗), and nr(G; F∗) against each other. This was done by calculating the correlations between the rank measures. The results are given in Tables 6 and 7.
From the results we see that all correlations are positive. For the real datasets the correlations between nr(G; C) and nr(G; A) are systematically higher than the correlations between nr(G; I) and nr(G; A) or between nr(G; I) and nr(G; C). This suggests that nr(G; I) produces different ranks, whereas nr(G; C) and nr(G; A) are more similar. This supports the behaviour we have seen in Section 5.4. The measure nr(G; F∗) correlates more with nr(G; A) and nr(G; C) than with nr(G; I). The correlation between nr(G; F∗) and nr(G; T∗) is somewhat weaker, but it is stronger than the correlation between nr(G; F∗) and nr(G; I).

Table 6. Correlations between the measures nr(G; I), nr(G; C), and nr(G; A). [The numeric entries did not survive extraction.]

Table 7. Correlations between the flexible measures nr(G; T∗) and nr(G; F∗) and the measures nr(G; I), nr(G; C), and nr(G; A). [The numeric entries did not survive extraction.]

Our next goal is to compare the flexible measures nr(G; T∗) and nr(G; F∗) against the rest of the measures.
From Table 5 we see that nr(G; F*) tends to produce the smallest number of significant itemsets, whereas nr(G; T*) produces large ranks, especially for queries with many attributes.

We calculated the number of queries in which nr(G; T*) and nr(G; F*) produce a smaller rank than the rest of the measures. Since the measures are equivalent for queries of size 1 and 2, these queries were ignored. From the results given in Table 8 we see that the flexible models outperform nr(G; I); however, the performance against the other measures depends on the dataset. For instance, nr(G; F*) outperforms nr(G; C) and nr(G; A) in Retail but produces larger ranks in Kosarak. This suggests that the greedy algorithm sometimes fails to find the optimal family F*.

Table 8. Percentages of queries in which the flexible measures nr(G; T*) and nr(G; F*) outperform the other rank measures. Only queries of size 3 or larger were considered.

We studied the sizes of the itemsets occurring in F*, the family of known itemsets in nr(G; F*). To be more precise, let F*_G be the family of known itemsets for the query G, and let L be the size of the itemsets we are interested in. We define the ratio r_L to be

r_L = ( Σ_G |{ X ∈ F*_G : |X| = L }| ) / ( Σ_G (|G| choose L) ),

that is, the number of itemsets of size L occurring in F* divided by the maximum number of occurrences. The ratios r_L are given in Table 9. We see that the itemsets of size 2 and 3 are used frequently, whereas the itemsets of larger size are used rarely.

Table 9. Number of itemsets occurring in F*, the family of known itemsets in nr(G; F*), normalised by the maximum number of possible occurrences. Each column represents itemsets of a specific size.

We compared our measures against the other ranking methods described in Section 5.3. Namely, we calculated the correlations of nr(G; I), nr(G; C), nr(G; A), nr(G; T*), and nr(G; F*) against the frequency of G, r_b(G), the χ² test for independence, and r_cs(G), the collective strength of the itemset G. The results are presented in Tables 10 and 11. We also studied the relationships by plotting our measures as functions of the aforementioned approaches; examples of such plots are given in Figure 3.

Table 10. Correlations between the rank measures nr(G; I), nr(G; C), and nr(G; A) and the base measures: the frequency of G, r_b(G), the χ² test for independence, and r_cs(G), the collective strength of the itemset G.

Table 11. Correlations between the rank measures nr(G; T*) and nr(G; F*) and the base measures: the frequency of G, r_b(G), the χ² test for independence, and r_cs(G), the collective strength of the itemset G.

Our first observation is that nr(G; I) correlates strongly with r_b(G). This is an expected result, since both test the independence of the attributes inside the itemsets, and also because nr(G; I) is asymptotically a χ² test (see Theorem 5). There is some correlation between r_b(G) and the rest of the measures, although this correlation is much weaker compared to nr(G; I).

Apart from WebView-2, there is little correlation between the measures and the frequency.

The correlation between the measures and the collective strength r_cs(G) exists but varies depending on the method and the dataset. The strongest correlations are obtained when r_cs(G) is compared against nr(G; I) or nr(G; T*). The dependency between nr(G; I) and r_cs(G) is a natural result, since r_cs(G) produces small values when the attributes are independent.

In this section we investigate the relationship between the rank of an itemset and the ranks of its sub-itemsets. Namely, we tested whether the measures are monotonic, that is, whether nr(G; F) ≥ nr(H; F) for all H ⊂ G. We deliberately ignored sub-itemsets of size 1, since they all have a very high rank. We also tested whether the measures are anti-monotonic, that is, decreasing with respect to set inclusion.
Table 12. Percentages of itemsets satisfying the property of monotonicity. The itemset G satisfies the property if nr(G; F) ≥ nr(H; F) for all H ⊂ G such that |H| ≥ 2.
Table 13. Percentages of itemsets satisfying the property of monotonicity for the flexible measures nr(G; T*) and nr(G; F*). The itemset G satisfies the property if nr(G; F) ≥ nr(H; F) for all H ⊂ G such that |H| ≥ 2.
Table 14. Percentages of itemsets satisfying the property of anti-monotonicity. The itemset G satisfies the property if nr(G; F) ≤ nr(H; F) for all H ⊂ G such that |H| ≥ 2.

Fig. 3. Ranks as functions of the base measures. The plots are calculated from the Paleo dataset.
Table 15. Percentages of itemsets satisfying the property of anti-monotonicity for the flexible measures nr(G; T*) and nr(G; F*). The itemset G satisfies the property if nr(G; F) ≤ nr(H; F) for all H ⊂ G such that |H| ≥ 2.

Tables 12 and 13 show that the ranks nr(G; I) are increasing for the real datasets but not for the synthetic datasets. The raw values of nr(G; I) are indeed increasing, but this does not hold for the P-values, since the number of degrees of freedom varies. The measure nr(G; T*) also tends to be monotonic, but not as much as nr(G; I). On the contrary, nr(G; C), nr(G; A), and nr(G; F*) are increasing for extremely few itemsets.

Table 14 suggests that nr(G; C), nr(G; A), and nr(G; F*) satisfy the property of anti-monotonicity to some degree. The measures nr(G; C) and nr(G; A) are anti-monotonic for a relatively high percentage of the itemsets of size 3. Among itemsets of size 4, nr(G; F*) satisfies the property of anti-monotonicity for a slightly larger portion of itemsets than nr(G; A), which, in turn, is anti-monotonic in more queries than nr(G; C).
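The monotonicity and anti-monotonicity tests described above can be sketched as follows. Here rank stands for any of the measures nr(G; F), the rank values are hypothetical, and sub-itemsets of size 1 are skipped, as in the experiments:

```python
from itertools import combinations

def proper_subsets(itemset, min_size=2):
    """Proper sub-itemsets with at least min_size items."""
    items = sorted(itemset)
    for k in range(min_size, len(items)):
        for sub in combinations(items, k):
            yield frozenset(sub)

def is_monotonic(itemset, rank):
    """True if the itemset's rank dominates its sub-itemsets' ranks."""
    r = rank[frozenset(itemset)]
    return all(r >= rank[h] for h in proper_subsets(itemset))

def is_anti_monotonic(itemset, rank):
    """True if the itemset's rank is dominated by its sub-itemsets' ranks."""
    r = rank[frozenset(itemset)]
    return all(r <= rank[h] for h in proper_subsets(itemset))

# Hypothetical ranks for the itemset ABC and its sub-itemsets of size 2.
rank = {frozenset('abc'): 0.7, frozenset('ab'): 0.5,
        frozenset('ac'): 0.6, frozenset('bc'): 0.2}
mono = is_monotonic('abc', rank)        # 0.7 dominates 0.5, 0.6, 0.2
anti = is_anti_monotonic('abc', rank)   # fails: 0.7 exceeds every pair's rank
```

The tables report, for each measure, the fraction of itemsets for which such a check succeeds.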
6. Conclusions
We have given a definition of a measure for ranking itemsets. The idea is to predict the frequency of an itemset from the frequencies of its sub-itemsets and to measure the deviation between the actual frequency and the prediction. The more the itemset deviates from the prediction, the more significant it is. We estimated the frequencies using Maximum Entropy and we used Kullback-Leibler divergence to measure the deviation. In the general case, the measure can be computed in O(2^|G|) time, where |G| is the size of the itemset to be ranked; however, the measures r(G; T*) and r(G; I) can be computed in polynomial time.

We introduced two flexible rank measures, r(G; T*) and r(G; F*). The measure r(G; T*) can be solved by finding the optimal spanning tree in the mutual information matrix. For solving r(G; F*) we proposed a simple greedy approach.

A clear advantage of our approach over the previous methods is that the previous solutions calculate the deviation from the independence model, whereas we are able to use the information available from the itemsets of larger size, and thus use more flexible models.

Our empirical results for real data show that the independence assumption is too strict: almost all itemsets were significant according to r(G; I). The results changed when we applied the more flexible models, r(G; C) and r(G; A). We also observed an interesting type of overfitting: in some cases we obtain a better prediction if we do not use all the available information.

We showed that there is little correlation between our measures and the other approaches. For instance, an infrequent itemset may be significant and a frequent itemset may be insignificant. We also observed that r(G; I) is monotonic for a large portion of itemsets, whereas r(G; C) and r(G; A) are anti-monotonic for a significant portion of itemsets.

Acknowledgments
The author wishes to thank Gemma Garriga, Heikki Mannila, and Robert Gwaderafor their comments.
References
Aggarwal, C. C. and Yu, P. S. (1998), A new framework for itemset generation, in 'PODS '98: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems', ACM Press, pp. 18–24.
Agrawal, R., Imielinski, T. and Swami, A. N. (1993), Mining association rules between sets of items in large databases, in P. Buneman and S. Jajodia, eds, 'Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data', Washington, D.C., pp. 207–216.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I. (1996), Fast discovery of association rules, in U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, eds, 'Advances in Knowledge Discovery and Data Mining', AAAI Press/The MIT Press, pp. 307–328.
Boulicaut, J.-F., Bykowski, A. and Rigotti, C. (2000), Approximation of frequency queries by means of free-sets, in 'Principles of Data Mining and Knowledge Discovery', pp. 75–85.
Brijs, T., Swinnen, G., Vanhoof, K. and Wets, G. (1999), Using association rules for product assortment decisions: A case study, in 'Knowledge Discovery and Data Mining', ACM, pp. 254–260.
Brin, S., Motwani, R. and Silverstein, C. (1997), Beyond market baskets: Generalizing association rules to correlations, in J. Peckham, ed., 'SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data', ACM Press, pp. 265–276.
Brin, S., Motwani, R., Ullman, J. D. and Tsur, S. (1997), Dynamic itemset counting and implication rules for market basket data, in 'SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data', pp. 255–264.
Calders, T. and Goethals, B. (2002), Mining all non-derivable frequent itemsets, in 'Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases'.
Chow, C. and Liu, C. (1968), 'Approximating discrete probability distributions with dependence trees', IEEE Transactions on Information Theory (3), 462–467.
Cooper, G. (1990), 'The computational complexity of probabilistic inference using Bayesian belief networks', Artificial Intelligence (2–3), 393–405.
Csiszár, I. (1975), 'I-divergence geometry of probability distributions and minimization problems', The Annals of Probability (1), 146–158.
Darroch, J. and Ratcliff, D. (1972), 'Generalized iterative scaling for log-linear models', The Annals of Mathematical Statistics (5), 1470–1480.
Dong, G. and Li, J. (1999), Efficient mining of emerging patterns: Discovering trends and differences, in 'Knowledge Discovery and Data Mining', pp. 43–52.
DuMouchel, W. and Pregibon, D. (2001), Empirical Bayes screening for multi-item associations, in 'Knowledge Discovery and Data Mining', pp. 67–76.
Fortelius, M. (2005), 'Neogene of the Old World database of fossil mammals (NOW)', University of Helsinki.
Fortelius, M., Gionis, A., Jernvall, J. and Mannila, H. (2006), 'Spectral ordering and biochronology of European fossil mammals', Paleobiology (2), 206–214.
Gallo, A., Bie, T. D. and Cristianini, N. (2007), MINI: Mining informative non-redundant itemsets, in '11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)', pp. 438–445.
Geurts, K., Wets, G., Brijs, T. and Vanhoof, K. (2003), Profiling high frequency accident locations using association rules, in 'Proceedings of the 82nd Annual Transportation Research Board, Washington DC (USA), January 12–16'.
Heikinheimo, H., Hinkkanen, E., Mannila, H., Mielikäinen, T. and Seppänen, J. K. (2007), Finding low-entropy sets and trees from binary data, in 'Knowledge Discovery and Data Mining'.
Jaroszewicz, S. and Simovici, D. A. (2002), Pruning redundant association rules using maximum entropy principle, in 'Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD'02', pp. 135–147.
Jiroušek, R. and Přeučil, S. (1995), 'On the effective implementation of the iterative proportional fitting procedure', Computational Statistics and Data Analysis, 177–189.
Kohavi, R., Brodley, C., Frasca, B., Mason, L. and Zheng, Z. (2000), 'KDD-Cup 2000 organizers' report: Peeling the onion', SIGKDD Explorations (2), 86–98.
Kullback, S. (1968), Information Theory and Statistics, Dover Publications, Inc.
Mannila, H. and Mielikäinen, T. (2003), The pattern ordering problem, in 'Principles of Data Mining and Knowledge Discovery', pp. 327–338.
Norén, G. N., Bate, A. and Edwards, I. R. (2007), 'Extending the methods used to screen the WHO drug safety database towards analysis of complex associations and improved accuracy for rare events', Statistics in Medicine, 3740–3757.
Omiecinski, E. R. (2003), 'Alternative interest measures for mining associations in databases', IEEE Transactions on Knowledge and Data Engineering (1), 57–69.
Pasquier, N., Bastide, Y., Taouil, R. and Lakhal, L. (1999), 'Discovering frequent closed itemsets for association rules', Lecture Notes in Computer Science, 398–416.
Pavlov, D., Mannila, H. and Smyth, P. (2003), 'Beyond independence: Probabilistic models for query approximation on binary transaction data', IEEE Transactions on Knowledge and Data Engineering (6), 1409–1421.
Piatetsky-Shapiro, G. (1991), Discovery, analysis, and presentation of strong rules, in 'Knowledge Discovery in Databases', AAAI/MIT Press, pp. 229–248.
Tatti, N. (2006a), 'Computational complexity of queries based on itemsets', Information Processing Letters, pp. 183–187.
Tatti, N. (2006b), 'Safe projections of binary data sets', Acta Informatica (8–9), 617–638.
van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
Webb, G. I. (2006), Discovering significant rules, in 'Knowledge Discovery and Data Mining', pp. 434–443.

A. Asymptotic Behaviour of the Divergence
By asymptotic behaviour we mean the following: we assume that we have an ensemble of datasets D_i such that |D_i| → ∞. We assume that G is non-derivable in each D_i and that the frequencies of F_G are the same in each D_i.

Define N = |D| and M = |H|. Let P be the set of distributions satisfying the frequencies of F_G. It is easy to see that we can parameterise P with the frequencies of H. In other words, let H = {H_1, ..., H_M}. Then for each p ∈ P there is a unique frequency vector θ ∈ R^M such that θ_i = p(H_i = 1). Let Θ be the set of all possible frequency vectors. The set Θ is a closed polytope; the vectors located on the boundary of Θ correspond to the distributions in which at least one entry is 0.

Let θ* be the frequency vector corresponding to the Maximum Entropy distribution p*. We need to show that θ* is not a boundary vector. Assume the converse; then p* must have p*(ω) = 0 for some ω. We know that this implies that p(ω) = 0 for all p ∈ P (Csiszár, 1975, Theorem 3.1). Let Y be the itemset containing the elements for which ω has positive entries. This in turn (see (Calders and Goethals, 2002)) implies that for each p ∈ P

p(G = 1) = Σ_{Y ⊆ Z ⊊ G} (−1)^{|G| − |Z| + 1} p(Z = 1),

making G derivable and contradicting the assumption.

Since θ* is an inner point of Θ, let B ⊂ Θ be an open ball around θ*. Assume that θ ∈ B. By taking the expectation of the second-degree Taylor expansion of log( p(ω; θ*) / p(ω; θ) ) around θ, we arrive at

−KL(θ ‖ θ*) = (1/2) Δθ^T E_θ[H(ω; η)] Δθ,

where Δθ = θ* − θ, η is a vector lying between θ and θ*, and H is the Hessian matrix of log p(ω; η).

Let θ_N be the frequencies of H obtained from a dataset containing N points. Under the null hypothesis we have θ_N → θ* in probability and √N (θ_N − θ*) → N(0, Σ) in distribution, where Σ is the covariance matrix

Σ_ij = p*(H_i = 1, H_j = 1) − p*(H_i = 1) p*(H_j = 1).

If θ_N ∈ B, we let η_N correspond to η in the Taylor expansion; otherwise we set η_N = 0. We can show that η_N → θ* in probability (van der Vaart, 1998, Theorem 2.7). Consider the function

g(a, b, c, d) = { −a^T E_c[H(ω; b)] a,   if c ∈ B;
                 (2/d) KL(c ‖ θ*),      if c ∉ B.