Boosting Frequent Itemset Mining via Early Stopping Intersections
Huu Hiep Nguyen {[email protected]}
Institute of Research and Development, Duy Tan University
P809, 7/25 Quang Trung, Danang 550000, Vietnam
Abstract: Mining frequent itemsets from a transaction database is a fundamental problem in data mining and a building block for many pattern mining tasks. In this paper, we present a general technique to reduce support checking time in existing depth-first search generate-and-test schemes such as Eclat/dEclat and PrePost+. The technique is based on an early-stopping criterion that allows infrequent candidate itemsets to be detected early, and it is general enough to be applicable to many frequent itemset mining algorithms. We have applied it to two TID-list based schemes (Eclat/dEclat) and one N-list based scheme (PrePost+). Experiments over a variety of datasets confirm its effectiveness in reducing runtime.
I. INTRODUCTION
First proposed by Agrawal et al. [2], frequent itemset mining has become a popular data mining technique and has been studied extensively. It plays an essential role in many important data mining tasks such as mining association rules [14], sequential patterns [6], [13], correlations [9], episodes [16], classification [10], clustering [1] and so on. Although many algorithms have been proposed, improving the efficiency of itemset mining remains a key research problem.

Recently, Deng et al. [3] proposed PrePost and its enhanced version PrePost+ [5] for mining frequent itemsets. Both employ a novel data structure named N-list to represent itemsets and exploit the single path property of N-lists to discover frequent itemsets directly, without generating candidate itemsets in some cases. The experiments in [3], [5] show that PrePost/PrePost+ run faster than some state-of-the-art mining algorithms including FP-growth [8] and FP-growth* [7]. By investigating PrePost+, we found that the support checking time for candidate itemsets can be reduced considerably if the N-list intersection is stopped early for infrequent candidate itemsets. The same idea holds for other schemes that propose and test potential children itemsets by intersecting lists held in parent itemsets. Two such schemes are Eclat [20], which uses transaction ID lists (TID-lists), and dEclat [17], which uses Diffsets.

In this work, we further improve Eclat/dEclat and PrePost+ by proposing a simple yet effective technique to stop early the support checking of infrequent candidate itemsets in depth-first search. Given an infrequent candidate itemset, the Early-Stopping technique accumulates evidence of infrequency and decides early that the candidate's support is certainly less than the minimum support, so the remaining checking steps are redundant and are dropped. The number of comparisons never increases, and the runtime is reduced in most cases, especially on datasets with a high ratio between the number of candidates and the number of frequent itemsets. In the next subsection, we review the main lines of work in frequent itemset mining.
A. Related Work
Itemset mining is an important problem of data mining with many variations such as frequent itemset mining [2], [8], [20], frequent closed/maximal itemset mining [7], [18], [19], frequent weighted itemset mining [15], erasable itemset mining [4] and so on. However, frequent itemset mining is still the most popular as it plays an important role in association rule mining [2], sequential pattern mining [6] and classification [10]. A large number of algorithms mine frequent itemsets effectively. We may divide them into three main categories:
• Candidate generate-and-test strategy: Methods in this category use a level-wise (breadth-first-search) approach for mining frequent itemsets. First, they enumerate frequent 1-itemsets, which are then used to propose candidate 2-itemsets, and so on until no more candidates can be generated. Apriori [2] is a seminal work in this line of research.
• Divide-and-conquer strategy: Methods using this strategy compress the dataset into a summary structure (e.g., FP-Tree, H-struct) and mine frequent itemsets from this structure by divide and conquer. They do not propose any candidate itemsets. Instead, frequent itemsets are discovered recursively in sub-databases according to the patterns found. FP-Growth [8], FP-Growth* [7] and H-Mine [11] are representative algorithms in this category. All of them run depth-first search.
• Hybrid strategy: Methods in this category use vertical data formats to summarize the database and mine frequent itemsets by using the generate-and-test strategy. However, the generate-and-test strategy is realized in a depth-first manner. The TID-list based methods Eclat [20] and dEclat [17], and the N-list based methods PrePost/PrePost+ [3], [5] are typical examples.
B. Contributions and Paper Structure
In this study, we make the following contributions:
• We point out a common characteristic of depth-first search mining schemes that generate and test candidate itemsets by list intersection.
• We propose a general and effective Early-Stopping technique for improving list intersection in Eclat, dEclat and PrePost+. The technique guarantees that the number of comparisons does not increase, leading to a runtime reduction in most cases.
• We have tested the technique over a wide range of datasets and identified the cases in which Early-Stopping improves the existing schemes most.

The paper is structured as follows. We review the key concepts of frequent itemset mining and the depth-first-search technique in the next section. Our proposed technique is presented in Sections III (for Eclat/dEclat) and IV (for PrePost+), followed by the evaluation in Section V. Finally, we conclude the paper and propose future work in Section VI.

II. BACKGROUND
In this section, we review basic concepts of frequent pattern mining and describe a transaction database used as a running example. The Early-Stopping technique is presented in the next two sections.
A. Frequent Itemsets
We assume a dataset $DB$ consists of $n$ transactions such that each transaction contains a number of items belonging to $I$, where $I = \{i_1, i_2, ..., i_m\}$ is the set of all items in $DB$. The support of an itemset $X \subseteq I$, denoted by $\rho(X)$, is the number of transactions in $DB$ which contain all the items in $X$. An itemset $X$ is a frequent itemset if $\rho(X) \ge minSup$, where $minSup$ is a given threshold. A frequent itemset with $k$ elements is called a frequent k-itemset, and $F_1$ is the set of frequent 1-itemsets sorted in frequency ascending or descending order.

Table I shows a $DB$ of 10 transactions with $I = \{a, b, c, d, e\}$. The minSup is fixed to 3, i.e., itemsets with frequency at least 3 will be output, e.g., $\{a, c\}$ with frequency 4 as it appears in the transactions 3, 4, 6 and 8. In PrePost+ [5], the items are sorted in decreasing frequency as $\{a, c, e, d, b\}$ for the PPC-tree because their frequencies are 7, 7, 7, 6 and 3 respectively (see the third column of Table I). In the search tree of Eclat/dEclat, the items are sorted in increasing frequency as $\{b, d, a, c, e\}$. These choices of sorting order make the number of candidates as small as possible.

Table I: An example transaction dataset
Transaction   Items         Reordering in PrePost+ [5]
1             a, d, e       a, e, d
2             b, c, d       c, d, b
3             a, c, e       a, c, e
4             a, c, d, e    a, c, e, d
5             a, e          a, e
6             a, c, d       a, c, d
7             b, c          c, b
8             a, c, d, e    a, c, e, d
9             b, c, e       c, e, b
10            a, d, e       a, e, d

B. Downward Closure Property and Depth-First-Search

Downward closure (or anti-monotone) property [2]:

$$\forall X : \forall Y \supseteq X,\ \rho(Y) \le \rho(X) \tag{1}$$

That means if an itemset is extended, its support cannot increase. In other words, no superset of an infrequent itemset can be frequent. This fact suggests that we can start the search from small itemsets and move to larger ones. In the search process, if we know that an itemset $X$ is infrequent, we will no longer extend its branch [2]. In the search tree (Fig. 1), the path from the root to a node represents an itemset under consideration with its support, e.g., itemset $dac$ has support 3.

In depth-first-search schemes like Eclat [20], dEclat [17] and PrePost+ [5], the search tree is expanded and visited in a depth-first manner. For instance, the order of the 15 frequent itemsets found in Fig. 1 is: b, bc, d, da, dac, dae, dc, de, a, ac, ace, ae, c, ce, e.

In the next sections, we present the Early-Stopping technique for Eclat/dEclat and PrePost+ respectively.

III. ECLAT/DECLAT WITH EARLY STOPPING
A. Eclat
In [20], Zaki et al. proposed Eclat, a depth-first-search technique for frequent itemset mining. Like Apriori, it is based on downward closure, but the search is depth-first, not level-wise.

Eclat uses a vertical format to represent the database, in which each itemset has its own list of transaction ids (TID-list). In Table II, the TID-list of each item (1-itemset) is the sorted list of transactions containing the item. The TID-list of an itemset $X$ is denoted $T(X)$. We need to read the transaction database once to build the TID-lists of all items.

We then explore frequent 2-itemsets by intersecting the TID-lists of 1-itemsets. For example, $T(ac) = T(a) \cap T(c) = \{3, 4, 6, 8\}$, so $\rho(ac) = 4$. In general, a k-itemset $Pxy$ is proposed and tested by intersecting the TID-lists of two (k-1)-itemsets $Px$ and $Py$ (which are both frequent, of course). For example, we have $T(da) = \{1, 4, 6, 8, 10\}$ and $T(dc) = \{2, 4, 6, 8\}$, so $T(dac) = T(da) \cap T(dc) = \{4, 6, 8\}$ and $\rho(dac) = 3$ (frequent).

Table II: Vertical format
b     d     a     c     e
2     1     1     2     1
7     2     3     3     3
9     4     4     4     4
      6     5     6     5
      8     6     7     8
      10    8     8     9
            10    9     10
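The vertical format and the TID-list intersection described above can be sketched in a few lines of Python. This is only an illustrative sketch on the running example of Table I; the names `DB`, `build_tid_lists` and `intersect` are hypothetical and are not taken from [20].

```python
from collections import defaultdict

# Running example of Table I (transaction id -> items).
DB = {
    1: "ade", 2: "bcd", 3: "ace", 4: "acde", 5: "ae",
    6: "acd", 7: "bc", 8: "acde", 9: "bce", 10: "ade",
}

def build_tid_lists(db):
    """One scan of the database yields a sorted TID-list per item."""
    tid_lists = defaultdict(list)
    for tid in sorted(db):
        for item in db[tid]:
            tid_lists[item].append(tid)
    return dict(tid_lists)

def intersect(u, v):
    """Merge-style intersection of two sorted TID-lists."""
    z, i, j = [], 0, 0
    while i < len(u) and j < len(v):
        if u[i] == v[j]:
            z.append(u[i]); i += 1; j += 1
        elif u[i] < v[j]:
            i += 1
        else:
            j += 1
    return z

T = build_tid_lists(DB)
t_da = intersect(T["d"], T["a"])      # T(da)
t_dc = intersect(T["d"], T["c"])      # T(dc)
t_dac = intersect(t_da, t_dc)         # T(dac) = T(da) ∩ T(dc)
print(t_da, t_dc, t_dac)              # [1, 4, 6, 8, 10] [2, 4, 6, 8] [4, 6, 8]
```

The support of a candidate is simply the length of the resulting list, so `len(t_dac) == 3` meets minSup = 3.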
1) Early Stopping for Eclat:
The main steps of Eclat are depicted in Algorithm 1. It starts with the creation of the TID-list $T(x)$ for each frequent 1-itemset $x$ (Line 2). The depth-first search is delegated to the recursive function TRAVERSE (Lines 8-17). The main step in TRAVERSE is to propose a candidate $Pxy$ (Line 11) and to check its support against minSup (Line 12).

Figure 1: Depth-first-search in Eclat/dEclat. The support is shown after each node's name.

Looking at the INTERSECT function (Lines 18-29), we find that its runtime is $O(|U| + |V|)$. If $Pxy$ is frequent, we stop only when the condition in Line 20 is violated. However, if $Pxy$ is infrequent, we can stop the intersection early.

The basic idea is to keep track of the number of skipped TIDs in $U$ (called $s_U$) and in $V$ (called $s_V$) (see Lines 37 and 41 in the function INTERSECT_ES). If the number of TIDs that can still be matched in $U$ (i.e., $|U| - s_U$) or in $V$ (i.e., $|V| - s_V$) is less than minSup, we know for sure that the size of the intersection between $U$ and $V$ is less than minSup, resulting in an infrequent candidate itemset. Simply replacing INTERSECT with INTERSECT_ES reduces the number of comparisons, hence Eclat takes less time to run.

Example 3.1: With $T(b) = \{2, 7, 9\}$ and $T(d) = \{1, 2, 4, 6, 8, 10\}$, INTERSECT($T(b)$, $T(d)$) stops at $i = 4, j = 6$ and returns $\{2\}$, while INTERSECT_ES($T(b)$, $T(d)$) stops at $i = 3, j = 5$ with $s_U = 1, s_V = 3$, telling us that $|U| - s_U = 3 - 1 = 2 < minSup$. □

B. dEclat
To reduce memory consumption, Zaki et al. [17] proposed a novel vertical data representation called Diffset, which only stores the differences between the TID-list of a candidate itemset and those of its generating frequent parents.

From a pair of nodes $Px$, $Py$ having the same prefix $P$ in the search tree, the authors of [17] show that the diffset $D(Pxy) = D(Py) - D(Px)$ and $\rho(Pxy) = \rho(Px) - |D(Pxy)|$, i.e., we can compute the support of an itemset using its parent's support and its own diffset. The diffsets are usually smaller than TID-lists, so the memory consumption is reduced.

Fig. 2 illustrates such operations on our running example. At the first level, we store $T(x)$ instead of $D(x)$ for all 1-itemsets $x$, especially on sparse databases. At the second level, $D(xy) = T(x) - T(y)$ [17]. For example, $D(bd) = T(b) - T(d) = \{2, 7, 9\} - \{1, 2, 4, 6, 8, 10\} = \{7, 9\}$, hence $\rho(bd) = \rho(b) - |D(bd)| = 3 - 2 = 1$ (infrequent). From the third level, the diffsets are computed directly from the parents' diffsets, $D(Pxy) = D(Py) - D(Px)$. For example, $D(dac) = D(dc) - D(da) = \{1, 10\} - \{2\} = \{1, 10\}$, so $\rho(dac) = \rho(da) - |D(dac)| = 5 - 2 = 3$ (frequent).

Algorithm 1 Eclat [20]
Input: DB: database with n transactions. minSup.
Output: F, the set of all frequent itemsets
 1: procedure ECLAT
 2:     Scan DB to get T(x) for each frequent item x.
 3:     F_1 = F_1 ∪ {T(x)}
 4:     F = F ∪ {x | T(x) ∈ F_1}
 5:     TRAVERSE(F_1)
 6:     return F.
 7: end procedure
 8: function TRAVERSE(F_k)                        ▷ depth-first-search
 9:     F_{k+1} = ∅
10:     for T(Px), T(Py) ∈ F_k, x < y do
11:         T(Pxy) = INTERSECT(T(Px), T(Py))
12:         if |T(Pxy)| ≥ minSup then
13:             F_{k+1} = F_{k+1} ∪ {T(Pxy)}
14:             F = F ∪ {Pxy}
15:     if F_{k+1} != ∅ then
16:         TRAVERSE(F_{k+1})
17: end function
18: function INTERSECT(U, V)
19:     Z = ∅, i = 1, j = 1
20:     while i ≤ |U| AND j ≤ |V| do
21:         if U[i] == V[j] then
22:             Z = Z ∪ {U[i]}
23:             i++; j++
24:         else if U[i] < V[j] then
25:             i++
26:         else
27:             j++
28:     return Z.
29: end function
30: function INTERSECT_ES(U, V)                   ▷ early-stopping
31:     Z = ∅, i = 1, j = 1, s_U = 0, s_V = 0
32:     while i ≤ |U| AND j ≤ |V| do
33:         if U[i] == V[j] then
34:             Z = Z ∪ {U[i]}
35:             i++; j++
36:         else if U[i] < V[j] then
37:             i++, s_U++
38:             if |U| − s_U < minSup then
39:                 break
40:         else
41:             j++, s_V++
42:             if |V| − s_V < minSup then
43:                 break
44:     return Z.
45: end function

Figure 2: dEclat with diffsets at each node.
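The INTERSECT_ES function of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the C++ implementation used in the experiments: the function name, the `min_sup` parameter and the returned stop indices are illustrative assumptions. Indices are reported 1-based so that they can be compared with Example 3.1.

```python
def intersect_es(u, v, min_sup):
    """Sorted-list intersection that stops once the result cannot reach min_sup.

    Returns (z, i, j): the (possibly partial) intersection and the 1-based
    indices at which the scan stopped, following the indexing of Algorithm 1.
    """
    z, i, j = [], 0, 0
    skipped_u = skipped_v = 0              # s_U and s_V in Algorithm 1
    while i < len(u) and j < len(v):
        if u[i] == v[j]:
            z.append(u[i]); i += 1; j += 1
        elif u[i] < v[j]:
            i += 1; skipped_u += 1
            if len(u) - skipped_u < min_sup:   # too few TIDs left in U
                break
        else:
            j += 1; skipped_v += 1
            if len(v) - skipped_v < min_sup:   # too few TIDs left in V
                break
    return z, i + 1, j + 1

t_b, t_d = [2, 7, 9], [1, 2, 4, 6, 8, 10]
print(intersect_es(t_b, t_d, min_sup=3))   # ([2], 3, 5): stopped early
print(intersect_es(t_b, t_d, min_sup=0))   # ([2], 4, 6): test never fires
```

Calling the function with `min_sup=0` disables the early-stopping test, so the same code also reproduces the behaviour of the plain INTERSECT. When the scan is cut short the returned list is partial, but its length is then guaranteed to be below `min_sup`, which is all the caller needs to reject the candidate.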
1) Early Stopping for dEclat:
The main steps of dEclat are shown in Algorithm 2. Similar to Eclat, it starts with the creation of the TID-list $T(x)$ for each frequent 1-itemset $x$ (Line 2). The depth-first search is delegated to the recursive function TRAVERSE (Lines 8-17). The main step in TRAVERSE is to propose a candidate $Pxy$ (Line 11) and to check its support against minSup (Line 12) using the formula $\rho(Pxy) = \rho(Px) - |D(Pxy)|$.

Looking at the DIFFERENCE function (Lines 18-31), we find that it runs in time $O(|U| + |V|)$. If $Pxy$ is frequent, we stop only when the condition in Line 20 is violated. However, if $Pxy$ is infrequent, we can stop the difference early.

The basic idea is to check whether the support of $Pxy$ has dropped below minSup each time a TID is added to $Z$ (see Lines 40 and 41 in the function DIFFERENCE_ES). Here $\rho_U$ denotes the parent support $\rho(Px)$. If $\rho_U - |Z| < minSup$, we know for sure that $Pxy$ is an infrequent candidate itemset. Simply replacing DIFFERENCE with DIFFERENCE_ES reduces the number of comparisons, hence the runtime of dEclat. Note that, unlike the intersection operation in Eclat, which is symmetric, the difference operation is asymmetric.
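The early-stopping difference just described can be sketched in Python as below. This is an illustrative sketch only: the function name and the parameter names `rho_parent` and `min_sup` are assumptions, with `rho_parent` standing for $\rho(Px)$, `u` for $D(Py)$ and `v` for $D(Px)$.

```python
def difference_es(u, v, rho_parent, min_sup):
    """Sorted-list difference u - v, stopped once rho_parent - |Z| < min_sup."""
    z, i, j = [], 0, 0
    while i < len(u) and j < len(v):
        if u[i] == v[j]:
            i += 1; j += 1
        elif u[i] < v[j]:
            z.append(u[i]); i += 1
            if rho_parent - len(z) < min_sup:
                return z               # the candidate is already infrequent
        else:
            j += 1
    z.extend(u[i:])                    # the rest of u has no partner in v
    return z

all_tids = set(range(1, 11))
d_b = sorted(all_tids - {2, 7, 9})              # D(b) = T - T(b)
d_d = sorted(all_tids - {1, 2, 4, 6, 8, 10})    # D(d) = T - T(d)

z = difference_es(d_d, d_b, rho_parent=3, min_sup=3)
print(z, 3 - len(z))                   # [7] 2, i.e. support bound 2 < minSup
print(difference_es(d_d, d_b, rho_parent=3, min_sup=0))   # [7, 9], full diffset
```

As with the intersection, the early exit returns a partial diffset; the caller only uses it to conclude that $\rho(Px) - |Z|$ is below minSup and then discards the candidate.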
Example 3.2: Let $T = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$ be the set of all TIDs. We have $D(b) = T - T(b) = \{1, 3, 4, 5, 6, 8, 10\}$ and $D(d) = T - T(d) = \{3, 5, 7, 9\}$. DIFFERENCE($D(d)$, $D(b)$) $= D(d) - D(b) = \{7, 9\}$ stops at $i = 5, j = 7$, while DIFFERENCE_ES($D(d)$, $D(b)$, $\rho(b)$) stops at $i = 3, j = 6$ (right after $U[3] = 7$ is added to $Z$) with $|Z| = 1$, making $\rho(b) - |Z| = 2 < minSup$. □

IV. PREPOST+ WITH EARLY STOPPING
In this section, we summarize the main concepts of PrePost+ [5] such as PPC-Tree, PP-code and N-list. Then we show how to apply Early Stopping to PrePost+.
A. PPC-tree and N-list
Given a reordered $DB$, the PPC-Tree [3] is a tree structure defined as follows:
• It consists of one root labeled as null ({}), and a set of item prefix subtrees as children of the root.
• Each node in the item prefix subtree contains five fields: name, frequency, childnodes, pre, and post. The field name registers the item this node represents. The field frequency stores the number of transactions containing a path reaching this node. The field childnodes registers all children of the node. The field pre is the pre-order rank of the node and the field post is its post-order rank, i.e., the sequence numbers of the node when scanning the tree by pre-order traversal and by post-order traversal, respectively. In the examples below, both ranks are counted from 0.

Algorithm 2 dEclat [17]
Input: DB: database with n transactions. minSup.
Output: F, the set of all frequent itemsets
 1: procedure DECLAT
 2:     Scan DB to get T(x) for each frequent item x.
 3:     F_1 = F_1 ∪ {T(x)}
 4:     F = F ∪ {x | T(x) ∈ F_1}
 5:     TRAVERSE(F_1)
 6:     return F.
 7: end procedure
 8: function TRAVERSE(F_k)                        ▷ depth-first-search
 9:     F_{k+1} = ∅
10:     for D(Px), D(Py) ∈ F_k, x < y do
11:         D(Pxy) = DIFFERENCE(D(Py), D(Px))
12:         if ρ(Px) − |D(Pxy)| ≥ minSup then
13:             F_{k+1} = F_{k+1} ∪ {D(Pxy)}
14:             F = F ∪ {Pxy}
15:     if F_{k+1} != ∅ then
16:         TRAVERSE(F_{k+1})
17: end function
18: function DIFFERENCE(U, V)
19:     Z = ∅, i = 1, j = 1
20:     while i ≤ |U| AND j ≤ |V| do
21:         if U[i] == V[j] then
22:             i++; j++
23:         else if U[i] < V[j] then
24:             Z = Z ∪ {U[i]}
25:             i++
26:         else
27:             j++
28:     if i ≤ |U| then
29:         Z = Z ∪ {U[k] | k = i → |U|}
30:     return Z.
31: end function
32: function DIFFERENCE_ES(U, V, ρ_U)             ▷ early-stopping
33:     Z = ∅, i = 1, j = 1
34:     while i ≤ |U| AND j ≤ |V| do
35:         if U[i] == V[j] then
36:             i++; j++
37:         else if U[i] < V[j] then
38:             Z = Z ∪ {U[i]}
39:             i++
40:             if ρ_U − |Z| < minSup then
41:                 return Z
42:         else
43:             j++
44:     if i ≤ |U| then
45:         Z = Z ∪ {U[k] | k = i → |U|}
46:     return Z.
47: end function

Figure 3: PPC-Tree after inserting first four transactions
Figure 4: Full PPC-Tree with PP-code of each node

Fig. 3 demonstrates how the PPC-Tree is built from the reordered transactions in Table I. We start with a null root. Then the first transaction $\{a, e, d\}$ is inserted in the PPC-Tree by creating nodes named a, e and d with frequency 1. Similarly, for the second transaction $\{c, d, b\}$, a new child node of the root and two descendant nodes are added. The third and fourth subfigures show the tree after the insertion of $\{a, c, e\}$ and $\{a, c, e, d\}$. The full PPC-Tree is shown in Fig. 4.

The pre-order and post-order ranks are tagged as a pair of numbers next to each node in Fig. 4. The PP-code [3] of each node $N$ in the PPC-Tree is a triple $\langle N.pre, N.post, N.frequency \rangle$. The N-list $NL(x)$ of an item $x$ is the list of PP-codes of all nodes named $x$, sorted by pre-order rank.
Example 4.1: We have $e < c$, $NL(e) = \{\langle 2,1,3 \rangle, \langle 5,3,3 \rangle, \langle 12,11,1 \rangle\}$ and $NL(c) = \{\langle 4,5,4 \rangle, \langle 8,12,3 \rangle\}$, therefore $NL(ec) = \{\langle 4,5,3 \rangle, \langle 8,12,1 \rangle\}$ and the support of $ec$ is $\rho(ec) = 3 + 1 = 4$ (see Fig. 6). □

B. PrePost+ Algorithm
In this section, we briefly recall the PrePost+ algorithm [5] (see Algorithm 3). PrePost+ starts with the construction of the PPC-Tree (Line 2) and the computation of the N-lists of frequent 1-itemsets (Line 3). Again, the idea of depth-first search in Eclat/dEclat repeats here. Note that PrePost+ combines itemsets sharing the same suffix (not prefix as in Eclat/dEclat). The recursive function TRAVERSE (Lines 9-18) proposes a candidate $xyS$ (Line 11), computes the intersection between $NL(xS)$ and $NL(yS)$ (Line 12), and checks the support of $xyS$ against minSup (Line 13).

The main steps of NL_intersect are depicted in Lines 19-33 (Algorithm 3). Similar to the function INTERSECT in Eclat, we maintain two indexes $i$ and $j$ and carry out the intersection from left to right. The criteria for the merge (Lines 23, 24) follow Section IV-A: the $i$-th triple $x_i$ in $U$ is mergeable with the $j$-th triple $y_j$ in $V$ if and only if $x_i.pre > y_j.pre$ and $x_i.post < y_j.post$, i.e., the node of $y_j$ is an ancestor of the node of $x_i$ in the PPC-Tree.

Again with Example 4.1, the step-by-step intersection between $U = NL(e)$ and $V = NL(c)$ is as follows. $\langle 2,1,3 \rangle$ is not mergeable with $\langle 4,5,4 \rangle$ (its pre-order rank is smaller), so we consider the next $i$, i.e., $\langle 5,3,3 \rangle$, which is mergeable and returns $\langle 4,5,3 \rangle$. Then $\langle 12,11,1 \rangle$, when compared to $\langle 4,5,4 \rangle$, fails the post-order test at Line 24, so we consider the next $j$, i.e., $\langle 8,12,3 \rangle$. Clearly, $\langle 12,11,1 \rangle$ is mergeable with $\langle 8,12,3 \rangle$, returning $\langle 8,12,1 \rangle$. The intersection stops and we get $NL(ec) = \{\langle 4,5,3 \rangle, \langle 8,12,1 \rangle\}$.

Figure 6: Search tree in PrePost+

Note that NL_intersect is fixed by the item order, i.e., we only intersect $NL(xS)$ with $NL(yS)$ if $x < y$ in frequency ordering.

C. Early Stopping for PrePost+
In PrePost+, NL_intersect runs in $O(|U| + |V|)$. To apply the Early-Stopping technique, we again integrate the size test into the function NL_intersect so that, if the test fails early, we can stop the computation and return an empty $Z$.

We present this idea in the function NL_intersect_ES (Lines 34-52). At any triple $j$ of $V$, if it is non-mergeable with the triple $i$ of $U$, we increase skip by $y_j.freq$ (Line 44). If the sum of the remaining frequencies $\rho_V - skip$ is less than minSup, we stop and return an empty set (Lines 45, 46). We demonstrate the effectiveness of NL_intersect_ES in the next example.
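The following Python sketch builds the PPC-Tree of the running example, derives the N-lists of single items and applies an early-stopping N-list intersection to them. It is a sketch under stated assumptions rather than the reference implementation of [5]: all names are hypothetical, pre-order and post-order ranks are counted from 0, and children are visited in insertion order. It also departs from Line 44 of Algorithm 3 in one respect: when a triple of $V$ is passed, only the part of its frequency that has not already been matched is added to skip, which is an assumption made here so that $\rho_V - skip$ stays an upper bound on the support even when one triple of $V$ has matched several triples of $U$.

```python
# Reordered transactions of Table I (items in decreasing frequency).
DB = ["aed", "cdb", "ace", "aced", "ae", "acd", "cb", "aced", "ceb", "aed"]

class Node:
    def __init__(self, name):
        self.name, self.freq, self.children = name, 0, {}
        self.pre = self.post = -1

def build_ppc_tree(db):
    root = Node(None)
    for txn in db:
        node = root
        for item in txn:
            node = node.children.setdefault(item, Node(item))
            node.freq += 1
    counters = {"pre": 0, "post": 0}
    def rank(node):                       # assumption: both ranks start at 0
        node.pre = counters["pre"]; counters["pre"] += 1
        for child in node.children.values():   # dicts keep insertion order
            rank(child)
        node.post = counters["post"]; counters["post"] += 1
    rank(root)
    return root

def n_lists(root):
    """N-list of every item: (pre, post, freq) triples sorted by pre."""
    out = {}
    def walk(node):
        if node.name is not None:
            out.setdefault(node.name, []).append((node.pre, node.post, node.freq))
        for child in node.children.values():
            walk(child)
    walk(root)
    return out

def nl_intersect_es(u, v, min_sup):
    """Intersect N-lists u and v; return [] as soon as the result is infrequent."""
    rho_v = sum(freq for _, _, freq in v)
    z, i, j, skip = {}, 0, 0, 0
    while i < len(u) and j < len(v):
        (xpre, xpost, xfreq), (ypre, ypost, yfreq) = u[i], v[j]
        if xpre > ypre:
            if xpost < ypost:             # v[j] is an ancestor of u[i]: merge
                z[(ypre, ypost)] = z.get((ypre, ypost), 0) + xfreq
                i += 1
            else:                         # the subtree of v[j] is exhausted
                skip += yfreq - z.get((ypre, ypost), 0)   # unmatched part only
                if rho_v - skip < min_sup:
                    return []
                j += 1
        else:
            i += 1
    return [(pre, post, freq) for (pre, post), freq in z.items()]

nl = n_lists(build_ppc_tree(DB))
print(nl_intersect_es(nl["e"], nl["c"], min_sup=3))   # [(4, 5, 3), (8, 12, 1)]
print(nl_intersect_es(nl["b"], nl["d"], min_sup=3))   # [] after two checks
```

Keying the result by the ancestor's (pre, post) pair performs the "merge elements in Z" step on the fly, and passing `min_sup=0` turns the function into a plain intersection.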
Example 4.2: Given $NL(b) = \{\langle 10,7,1 \rangle, \langle 11,9,1 \rangle, \langle 13,10,1 \rangle\}$ and $NL(d) = \{\langle 3,0,2 \rangle, \langle 6,2,2 \rangle, \langle 7,4,1 \rangle, \langle 9,8,1 \rangle\}$, if we call NL_intersect(NL(b), NL(d)), we need to run 5 checks for $(i, j) = (1,1), (1,2), (1,3), (1,4), (2,4)$, in which only the pair $(1,4)$ matches, so we get $NL(bd) = \{\langle 9,8,1 \rangle\}$. With support 1 (less than $minSup = 3$), $bd$ is infrequent.

In calling NL_intersect_ES(NL(b), NL(d)), we know that $\rho_V = 6$, $minSup = 3$. After the two (failed) checks $(i, j) = (1,1), (1,2)$, we increase skip to 4, making $\rho_V - skip = 2 < minSup$, so we safely conclude that $bd$ is not frequent, omitting the three remaining checks. □

D. Remarks on Apriori, FP-Growth, and Bit-Vector Based Algorithms
Apriori [2] is a level-wise (breadth-first search) mining scheme which uses the horizontal format to count the support of candidate $k$-itemsets (i.e., itemsets at level $k$). No list intersection is required in Apriori, so our technique does not apply.

Instead of generating and testing candidate itemsets, FP-Growth [8] and its derivatives FP-Growth* [7] and H-Mine [11] recursively project the database into sub-databases using prefix itemsets. Then local frequent patterns are searched to assemble longer global ones. No list intersection is required in FP-Growth/FP-Growth* or H-Mine, so our technique does not apply either.

Algorithm 3 PrePost+ [5]
Input: DB: database with n transactions. minSup: minimum support.
Output: F, the set of all frequent itemsets
 1: procedure PREPOST+
 2:     Scan DB to obtain F_1 and build the PPC-Tree
 3:     Scan PPC-tree to generate NL(x)
 4:     F_1 = F_1 ∪ {NL(x)}
 5:     F = F ∪ {x | NL(x) ∈ F_1}
 6:     TRAVERSE(F_1)
 7:     return F.
 8: end procedure
 9: function TRAVERSE(F_k)                        ▷ depth-first-search
10:     F_{k+1} = ∅
11:     for NL(xS), NL(yS) ∈ F_k, x < y do
12:         NL(xyS) = NL_intersect(NL(xS), NL(yS))
13:         if ρ(xyS) ≥ minSup then
14:             F_{k+1} = F_{k+1} ∪ {NL(xyS)}
15:             F = F ∪ {xyS}
16:     if F_{k+1} != ∅ then
17:         TRAVERSE(F_{k+1})
18: end function
19: function NL_INTERSECT(U, V)
20:     i = 1, j = 1
21:     Z = ∅
22:     while x_i ∈ U, i ≤ |U| AND y_j ∈ V, j ≤ |V| do
23:         if x_i.pre > y_j.pre then
24:             if x_i.post < y_j.post then
25:                 add <y_j.pre, y_j.post, x_i.freq> to Z
26:                 i++
27:             else
28:                 j++
29:         else
30:             i++
31:     merge elements in Z
32:     return Z.
33: end function
34: function NL_INTERSECT_ES(U, V, ρ_V)           ▷ early-stopping
35:     i = 1, j = 1
36:     skip = 0
37:     Z = ∅
38:     while x_i ∈ U, i ≤ |U| AND y_j ∈ V, j ≤ |V| do
39:         if x_i.pre > y_j.pre then
40:             if x_i.post < y_j.post then
41:                 add <y_j.pre, y_j.post, x_i.freq> to Z
42:                 i++
43:             else
44:                 skip = skip + y_j.freq
45:                 if ρ_V − skip < minSup then
46:                     return ∅
47:                 j++
48:         else
49:             i++
50:     merge elements in Z
51:     return Z.
52: end function

Bit-vector based algorithms such as VIPER [12] apply depth-first search in the same manner as Eclat but use a compressed bit-vector structure instead. The intersection of decompressed bit-vectors in memory is performed by the AND operator. Our technique can be plugged into such algorithms to determine early whether the size of the intersection would be less than minSup.

V. EXPERIMENTS
In this section, we evaluate the performance of the proposed Early-Stopping technique applied to Eclat/dEclat and PrePost+ in terms of runtime and number of comparisons. The datasets are described in Section V-A. We show the comparison between the standard versions and the early-stopping versions in Section V-B. The algorithms are implemented in C++ and run on a desktop PC with an Intel Core i7-6700 @ 3.4 GHz and 16 GB of memory.
A. Experiment Setup
We use nine datasets as shown in Table III. The datasets were downloaded from the FIMI repository (http://fimi.ua.ac.be) and the KONECT repository (http://konect.uni-koblenz.de/networks/). The columns of Table III give, for each dataset, the number of items, the number of transactions, the average transaction length and the range of minSup values, respectively.
T40I10D100K is a synthetic market-basket dataset from [2]. It contains 100,000 transactions and 942 items.
MovieLens-1M is a bipartite network containing one million movie ratings from http://movielens.umn.edu/. Movies play the role of items and the ratings of each user form a transaction.
Github is a membership network of the hosting site GitHub. The network is bipartite and contains users (transactions) and projects (items).
Retail is anonymized market-basket data from a Belgian retail store.
Kosarak contains sequences of click-stream data from a Hungarian news portal.
Accidents contains anonymous traffic accident data.
Chess is converted from the UCI chess dataset. Each transaction is an instance of the chess game and items describe the board and the outcome of the game.
Connect is converted from the UCI connect-4 dataset. Each transaction is an instance of the game and items describe the board and the outcome of the game.
Pumsb contains census data for population and housing.

We name the Early-Stopping versions Eclat-ES, dEclat-ES and PrePost+ES. Recall that our technique is easily plugged into any frequent pattern mining scheme that relies on an intersection operation for itemset support checking.
B. Effectiveness of Early Stopping Technique
In this section, we evaluate the effectiveness of the Early-Stopping schemes Eclat-ES, dEclat-ES and PrePost+ES. Because the schemes run deterministically, all the reported values except runtime do not change between runs. The runtime is the average of ten runs.

Table III: Dataset properties
Dataset         #Items    #Transactions   Avg. length   minSup range
T40I10D100K        942          100,000          39.6   0.002 .. 0.02
MovieLens-1M     3,706            6,040         165.6   0.07 .. 0.1
Github          56,519          120,867           3.6   0.00007 .. 0.0001
Retail          16,470           88,162          10.3   0.00003 .. 0.00006
Kosarak         41,270          990,002           8.1   0.001 .. 0.004
Accidents          468          340,183          33.8   0.1 .. 0.4
Chess               75            3,196          37.0   0.1 .. 0.4
Connect            129           67,557          43.0   0.1 .. 0.4
Pumsb            2,088           49,046          50.5   0.1 .. 0.4
1) Number of Proposed Candidates and Expanded Nodes:
Table IV displays the number of proposed candidates and the number of expanded nodes for each dataset and each value of minSup. Here $minSup_1$ means the smallest value of minSup for the dataset, $minSup_2$ means the next value and so on (see Table III).

Because Eclat/dEclat and PrePost+ traverse the search tree based on items sorted in increasing frequency, the numbers of proposed candidates and expanded nodes are the same for all the schemes on a given dataset and minSup. As minSup increases, there are fewer frequent 1-itemsets, so the numbers of proposed candidates and expanded nodes get smaller too.

We can roughly divide the datasets into two groups by the ratio between the number of candidates and the number of expanded nodes. The first four datasets have a ratio larger than 2 while the remaining ones have a ratio less than 1.5. As we will see in the following subsections, the ratio suggests different behaviours of the mining schemes in both the number of comparisons and the runtime.
2) Number of Comparisons:
Figures 7 to 15 compare six schemes over nine datasets for different values of minSup. In each figure, we report the number of comparisons performed in the intersection functions on the left and the total runtime (in seconds) on the right.

First, the Early-Stopping schemes effectively reduce the number of comparisons between pairs of TID-lists (Eclat), Diffsets (dEclat) or N-lists (PrePost+) in all cases. The reduction varies among datasets and mining schemes.

For Eclat-ES, the number of comparisons is cut down considerably in the first three datasets T40I10D100K, MovieLens-1M and Github. The reduction is clear-cut for small values of minSup and decreases slightly when minSup becomes larger. dEclat-ES and PrePost+ES show a similarly effective reduction on T40I10D100K, MovieLens-1M, Github, Retail and Kosarak.

Finally, we observe that the reduction of comparison operations in Accidents, Chess, Connect and Pumsb is almost negligible. This result can be explained by the ratio column of Table IV. These datasets also exhibit large discrepancies between Eclat and dEclat/PrePost+, confirming that the TID-list is much less efficient on these kinds of transaction data.
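How such comparison counts could be collected is sketched below on the running example of Table I. This is an illustrative Python sketch, not the C++ code used for the experiments: it assumes that one comparison is counted per iteration of the intersection loop, and the names `eclat`, `intersect` and `stats` are hypothetical.

```python
DB = {1: "ade", 2: "bcd", 3: "ace", 4: "acde", 5: "ae",
      6: "acd", 7: "bc", 8: "acde", 9: "bce", 10: "ade"}

def intersect(u, v, min_sup, early_stop, stats):
    """TID-list intersection; counts one comparison per loop iteration."""
    z, i, j, su, sv = [], 0, 0, 0, 0
    while i < len(u) and j < len(v):
        stats["comparisons"] += 1
        if u[i] == v[j]:
            z.append(u[i]); i += 1; j += 1
        elif u[i] < v[j]:
            i += 1; su += 1
            if early_stop and len(u) - su < min_sup:
                break
        else:
            j += 1; sv += 1
            if early_stop and len(v) - sv < min_sup:
                break
    return z

def eclat(db, min_sup, early_stop):
    """Depth-first Eclat; returns frequent itemsets and the comparison count."""
    tids = {}
    for tid in sorted(db):
        for item in db[tid]:
            tids.setdefault(item, []).append(tid)
    # Frequent items in increasing frequency (ties broken alphabetically).
    level = sorted(((x, t) for x, t in tids.items() if len(t) >= min_sup),
                   key=lambda p: (len(p[1]), p[0]))
    stats, frequent = {"comparisons": 0}, []
    def traverse(nodes):
        for k, (px, tx) in enumerate(nodes):
            frequent.append(px)
            children = []
            for py, ty in nodes[k + 1:]:
                tz = intersect(tx, ty, min_sup, early_stop, stats)
                if len(tz) >= min_sup:
                    children.append((px + py[-1], tz))
            traverse(children)
    traverse(level)
    return frequent, stats["comparisons"]

plain, n_plain = eclat(DB, 3, early_stop=False)
fast, n_fast = eclat(DB, 3, early_stop=True)
print(plain == fast, n_fast < n_plain)   # True True
```

Both runs return the same 15 frequent itemsets in the depth-first order of Fig. 1; only the number of loop iterations differs.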
3) Runtime:
The reduction in the number of comparisons naturally translates into a reduction of runtime (see the right plots of Figures 7 to 15). A clear effect is observed in the first four datasets and, for Eclat-ES, in the remaining five datasets.

Figure 7: Number of comparisons and runtime for T40I10D100K
Figure 8: Number of comparisons and runtime for MovieLens-1M
Figure 9: Number of comparisons and runtime for Github
Figure 10: Number of comparisons and runtime for Retail
Figure 11: Number of comparisons and runtime for Kosarak
Figure 12: Number of comparisons and runtime for Accidents
Figure 13: Number of comparisons and runtime for Chess
Figure 14: Number of comparisons and runtime for Connect
Figure 15: Number of comparisons and runtime for Pumsb

Table IV: Number of proposed candidates and expanded nodes

Note that the reduction in runtime must take into account the overhead caused by the Early-Stopping checks (i.e., Lines 38 and 42 in Algorithm 1, Line 40 in Algorithm 2 and Line 45 in Algorithm 3). If the candidate itemset is frequent, such checks make the Early-Stopping intersection functions incur a small overhead compared to their standard counterparts. For datasets on which few comparisons are saved, the runtime reduction is not guaranteed. This is clearly observed in several cases, especially for dEclat-ES and PrePost+ES on Kosarak.
4) Other Remarks:
As TID-lists (in Eclat) and Diffsets (in dEclat) are two complementary structures, we observe an interesting tendency: a high number of comparisons or a high runtime in one scheme implies low corresponding values in the other.

All the enhanced versions have the same memory consumption as the original schemes, because the memory required to maintain the support structures (TID-lists, Diffsets and N-lists) is unchanged. The number of proposed candidates does not change either.

VI. CONCLUSION
We have presented a simple yet effective Early-Stopping technique to accelerate some existing depth-first search itemset mining algorithms that use the generate-and-test strategy. Our technique is based on an early-stopping criterion for list intersection. We have applied the technique to TID-lists in Eclat, diffsets in dEclat and N-lists in PrePost+. The number of comparisons in the enhanced versions is never larger than that in the original algorithms, leading to a runtime reduction in most cases. We have evaluated the Early-Stopping schemes over nine datasets. The results confirm the effectiveness of our improvement and suggest what kind of transaction data will benefit most.

CONFLICTS OF INTEREST
The author declares that there is no conflict of interest regarding the publication of this paper.

REFERENCES
[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD'98. ACM, 1998.
[2] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pages 487–499, 1994.
[3] Z. Deng, Z. Wang, and J. Jiang. A new algorithm for fast mining frequent itemsets using n-lists. Science China Information Sciences, 55(9):2008–2030, 2012.
[4] Z.-H. Deng, G.-D. Fang, Z.-H. Wang, and X.-R. Xu. Mining erasable itemsets. In Machine Learning and Cybernetics, 2009 International Conference on, volume 1, pages 67–73. IEEE, 2009.
[5] Z.-H. Deng and S.-L. Lv. Prepost+: An efficient n-lists-based algorithm for mining frequent itemsets via children–parent equivalence pruning. Expert Systems with Applications, 42(13):5424–5432, 2015.
[6] P. Fournier-Viger, J. C.-W. Lin, R. U. Kiran, Y. S. Koh, and R. Thomas. A survey of sequential pattern mining. Data Science and Pattern Recognition, 1(1):54–77, 2017.
[7] G. Grahne and J. Zhu. Fast algorithms for frequent itemset mining using fp-trees. IEEE transactions on knowledge and data engineering, 17(10):1347–1362, 2005.
[8] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM sigmod record, volume 29, pages 1–12. ACM, 2000.
[9] Y.-K. Lee, W.-Y. Kim, Y. D. Cai, and J. Han. Comine: Efficient mining of correlated patterns. In ICDM, volume 3, pages 581–584, 2003.
[10] L. T. Nguyen, B. Vo, T.-P. Hong, and H. C. Thanh. Classification based on association rules: A lattice-based approach. Expert Systems with Applications, 39(13):11357–11366, 2012.
[11] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. In Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, pages 441–448. IEEE, 2001.
[12] P. Shenoy, J. R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In ACM Sigmod Record, volume 29, pages 22–33. ACM, 2000.
[13] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In International Conference on Extending Database Technology, pages 1–17. Springer, 1996.
[14] H. Toivonen et al. Sampling large databases for association rules. In VLDB, volume 96, pages 134–145, 1996.
[15] B. Vo, F. Coenen, and B. Le. A new method for mining frequent weighted itemsets based on wit-trees. Expert Systems with Applications, 40(4):1256–1264, 2013.
[16] C.-W. Wu, Y.-F. Lin, P. S. Yu, and V. S. Tseng. Mining high utility episodes in complex event sequences. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 536–544. ACM, 2013.
[17] M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 326–335. ACM, 2003.
[18] M. J. Zaki and C.-J. Hsiao. Charm: An efficient algorithm for closed itemset mining. In Proceedings of the 2002 SIAM international conference on data mining, pages 457–473. SIAM, 2002.
[19] M. J. Zaki and C.-J. Hsiao. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge & Data Engineering, (4):462–478, 2005.
[20] M. J. Zaki, S. Parthasarathy, M. Ogihara, W. Li, et al. New algorithms for fast discovery of association rules. In