CICLAD: A Fast and Memory-efficient Closed Itemset Miner for Streams
Tomas Martin, Guy Francoeur, Petko Valtchev
Centre de recherche en intelligence artificielle (CRIA), UQÀM, Montréal, Québec, Canada
Abstract
Mining association rules from data streams is a challenging task due to the (typically) limited resources available vs. the large size of the result. Frequent closed itemsets (FCI) enable an efficient first step, yet current FCI stream miners are not optimal on resource consumption, e.g. they store a large number of extra itemsets at an additional cost. In a search for a better storage-efficiency trade-off, we designed Ciclad, an intersection-based sliding-window FCI miner. Leveraging in-depth insights into FCI evolution, it combines minimal storage with quick access. Experimental results indicate Ciclad's memory imprint is much lower and its performance globally better than competitor methods.
1 Introduction

Association rule (AR) and the component frequent itemset (FI) mining from data streams have found a wide range of practical applications, e.g. in network traffic analysis, click stream mining, online transaction analysis [9, 13, 21]. Both tasks are challenging due to the specific conditions of the stream environment: dynamic data inflow, potential concept drift and, typically, limited resources [12, 26]. Therefore, the design of efficient stream miners has to reflect concerns such as single access to stream data, compact storage of results, resource limit-awareness, etc. Moreover, various processing models have been investigated to reflect the decay of data in the stream, such as landmark, sliding or damped window. In a different vein, while a large number of stream miners target interesting itemsets (frequent [16], rare [10], closed [4], maximal [13], self-sufficient [23], etc.), only few methods would maintain the associations, arguably due to the huge number of rules to process on each window shift.

Our ultimate goal is the design of an efficient AR stream miner, whereby we propose to tackle the inherent complexity of the task by reducing the output to some concise representation of strong rules, e.g. as described in [14]. Such representations are typically constructed on top of distinguished itemsets such as closed itemsets (CIs), generators (a.k.a. free itemsets), or maximal FIs [2], and might exploit additional structure on those, e.g. lattice links [29]. A variety of batch methods exist for these representations (e.g. see [2, 14]), yet to the best of our knowledge they have not been studied in stream settings. Stream miners have been designed for (frequent) CIs [4, 11, 15, 24, 25, 27] as well as for frequent generators [6]. The former split into two categories: methods [4, 15] that adapt batch pattern enumeration [28], as opposed to those relying on CI intersection [11, 24, 25, 27].
Intersection-based closure computing is rooted in formal concept analysis (FCA) [5], where a variety of incremental, i.e. landmark, miners target closed itemsets (CIs) (dating back to [7]). Moreover, various derived problems of interest, e.g. mining CIs with links and generators, have been studied [20]. However, resource-awareness is not a prime concern in FCA, and neither is data decay, hence no decremental methods have been designed (to achieve a sliding-window mode).

In this paper, we focus on the groundwork task to support the design of a fully-blown concise AR stream miner, i.e. the online mining of FCIs. Our analysis of the literature indicates that all existing methods have increased memory consumption as they need to store extra itemsets on top of the target FCIs: the first group requires a few infrequent and a large number of non-closed itemsets, whereas the second one maintains all infrequent CIs. However, pattern enumeration miners additionally maintain tidsets, which is costly, especially on large or highly dynamic streams, whereas intersection-based ones skip tids altogether.

We chose to follow an intersection-based approach which, despite the aforementioned overhead due to infrequent CI maintenance, offers distinct advantages such as higher flexibility (e.g. upon support threshold decreases) and versatility (both frequent and rare patterns targeted). Moreover, the CI indexing approach used in both [25, 27] seems particularly appealing, yet both methods suffer on superfluous storage and/or processing.

As a remedy, we propose Ciclad, a two-fold CI stream miner whose incremental part Ciclad+ streamlines quick access to CIs and item-wise intersection growth from [25] while skipping non-essential storage of itemsets. The decremental Ciclad−, in turn, implements some novel insights into CI evolution that allow it to fit the same overall computing schema as Ciclad+. The resulting homogeneous compound method achieves both high efficiency and low memory usage. This has been confirmed by a validation study over both real and synthetic datasets (retail, network security, etc.) whose outcome shows that Ciclad outperforms competing methods by a comfortable margin on all but the most dense datasets.

Our paper's contributions are as follows: (1) a formalization of transaction removal, (2) a unified intersection-based sliding-window miner Ciclad, (3) a performance study on a variety of CI mining methods for streams (our code and datasets are publicly available at https://github.com/guyfrancoeur/ciclad). Additionally, we provide correctness proofs enhancing [25] (see Appendix).

In what follows, section 2 provides background on CIs and online mining while section 3 summarizes previous work. Our mathematical approach and the algorithmic details of Ciclad are presented in sections 4 and 5, respectively. Section 6 summarizes the performance study. Concluding remarks are given in section 7.
2 Background

Below, we recall basics of pattern mining, closed patterns and stream mining.
Assume a transaction database D (as in Table 1), defined on top of a set of items I = {a_1, a_2, . . . , a_n} (here {a, . . . , h}). A set X ⊆ I is called an itemset while a transaction is a pair (tid, itemset) where tid is a transaction identifier. Similarly, a set of tids is called a tidset.

(¯1, abcdefgh)   (¯5, g)      (¯9, d)
(¯2, abcef)      (¯6, efh)    (¯10, bcgh)
(¯3, cdfgh)      (¯7, abcd)
(¯4, efgh)       (¯8, bcd)

Table 1. Sample data, further referred to as D

Given a tidset Y from D, ι_D(Y) = ⋂ {Z | (j, Z) ∈ D, j ∈ Y} denotes the itemset shared by the respective transactions. Conversely, the support set of an itemset X comprises the tids whose itemsets cover X: τ_D(X) = {j | (j, Z) ∈ D, X ⊆ Z}. For instance, τ_D(ab) = {¯1, ¯2, ¯7} and ι_D({¯2, ¯7}) = abc. For a tid j from D, ι(j) denotes its itemset: ι(j) = Z iff (j, Z) ∈ D. To shorten notations, we omit the subscript D (if no confusion is possible) and let a tid denote its own itemset where convenient.

The quality of an itemset X follows the size of its support set, a.k.a. its support σ(X) = |τ(X)|. A binary frequency criterion is based on a pre-defined minimum support threshold, or min_supp, denoted ς. The ensuing family of frequent itemsets in D will be F(D, ς). Support sets induce an equivalence relation on ℘(I): X ≅ Z iff τ(X) = τ(Z) (e.g. ab ≅ ac). The equivalence class of X, [X]_D, admits a unique maximum, a.k.a. closed itemset (CI), which, following the anti-monotonicity of σ, can be defined as follows:

Definition 1. X ⊆ I is closed if no proper superset thereof has the same support.

In D, abc is closed while b is not (τ(bc) = τ(b)). The CIs of D will be denoted C(D) and the frequent ones
FC(D, ς). In formal concept analysis (FCA) [5], CIs, termed concept intents, are defined via a closure operator κ induced by C(D): κ : X ↦ max([X]_D), hence X = κ(X) iff X is closed. Here, κ is nothing more than the composition of ι and τ:

Property 1. Given X ⊆ I, κ(X) = ι(τ(X)).

Here, κ(ab) = abc (as ι(τ(ab)) = ι({¯1, ¯2, ¯7}) = abc and abc = max([ab]_D)). Next, CIs are exactly the intersections of arbitrary sets of transactions [5].
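The operators above can be replayed on the toy database D of Table 1. The following Python sketch is ours and purely illustrative: the function names tau, iota and kappa mirror the paper's τ, ι and κ, and itemsets are modelled as sets of single-character items.

```python
# Sketch of the operators tau, iota and kappa on the sample database D
# of Table 1 (illustrative only; itemsets are sets of one-char items).
D = {1: "abcdefgh", 2: "abcef", 3: "cdfgh", 4: "efgh", 5: "g",
     6: "efh", 7: "abcd", 8: "bcd", 9: "d", 10: "bcgh"}

def tau(X):
    """Support set: tids whose itemsets cover X."""
    return {j for j, Z in D.items() if set(X) <= set(Z)}

def iota(Y):
    """Itemset shared by all transactions of the tidset Y (intersection)."""
    items = set("abcdefgh")          # the item universe I
    for j in Y:
        items &= set(D[j])
    return items

def kappa(X):
    """Closure operator: kappa = iota o tau (Property 1)."""
    return iota(tau(X))
```

Running it confirms the running examples: τ(ab) = {¯1, ¯2, ¯7}, κ(ab) = abc, and τ(bc) = τ(b), i.e. b is not closed while abc is.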
Property 2. C(D) = D^∩.

For example, in C(D) as given in Table 2, ef (CI 6) can be generated as 6 = ¯2 ∩ ¯6, whereas intersecting several transactions, e.g. ¯2 ∩ ¯8, yields bc (CI 16). As a corollary, C(D) is closed under ∩ (and so is FC(D, ς)).

1 (abcdefgh : 1)   9 (g : 5)       16 (bc : 5)
2 (abcef : 2)      10 (fh : 4)     17 (bcd : 3)
3 (cf : 3)         11 (efh : 3)    18 (d : 5)
4 (cdfgh : 2)      12 (c : 6)      19 (h : 5)
5 (f : 5)          13 (cd : 4)     20 (gh : 4)
6 (ef : 4)         14 (abc : 3)    21 (cgh : 3)
7 (fgh : 3)        15 (abcd : 2)   22 (bcgh : 2)

Table 2. The family C(D) (σ values behind ':')

Stream pattern mining amounts to updating the pattern family of a window upon adding or removing a transaction (called increment and decrement, respectively). For instance, assume the transactions in Table 1 are acquired in the order of their tids. Let t_n be a new transaction not in D and let D+ = D ∪ {t_n}. Simply put, the incremental update results in new CIs being added in C(D+) and σ_D+ values computed for them as well as for some existing CIs (support sets extended by t_n). Indeed, following Property 2, no CI from D can vanish in D+ as C(D+) = (C(D) ∪ {t_n})^∩ entails C(D) ⊆ C(D+).

As an illustration, assume D¯1,¯9 = {¯1, . . . , ¯9} (see Table 1) with its CI family C(D¯1,¯9) as given in Table 3 and t_n = ¯10. Out of the latter two, the incremental method would output C(D) (Table 2).

1 (abcdefgh : 1)   7 (fgh : 3)     13 (cd : 4)
2 (abcef : 2)      8 (efgh : 2)    14 (abc : 3)
3 (cf : 3)         9 (g : 4)       15 (abcd : 2)
4 (cdfgh : 2)      10 (fh : 4)     16 (bc : 4)
5 (f : 5)          11 (efh : 3)    17 (bcd : 3)
6 (ef : 4)         12 (c : 5)      18 (d : 5)

Table 3. The CI family C(D¯1,¯9)
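The incremental picture above can be replayed mechanically. The Python sketch below is our illustration, not the paper's algorithm: it brute-forces C(D¯1,¯9) as the set of all transaction intersections (Property 2), then derives the intersection set generated by ¯10, the promoted CIs, and the genitor of the new CI gh.

```python
from itertools import combinations

# Brute-force replay of the incremental update (illustrative sketch only):
# D(1..9) from Table 1, incremented by transaction 10 = bcgh.
D9 = {1: "abcdefgh", 2: "abcef", 3: "cdfgh", 4: "efgh", 5: "g",
      6: "efh", 7: "abcd", 8: "bcd", 9: "d"}
t_n = frozenset("bcgh")

def closures(db):
    """Property 2: the CIs are exactly the intersections of transaction sets."""
    cis, tids = set(), list(db)
    for r in range(1, len(tids) + 1):
        for comb in combinations(tids, r):
            inter = frozenset("abcdefgh")
            for j in comb:
                inter &= frozenset(db[j])
            cis.add(inter)
    cis.discard(frozenset())
    return cis

C = closures(D9)                                  # C(D(1..9)), i.e. Table 3
supp = {c: sum(1 for Z in D9.values() if c <= frozenset(Z)) for c in C}

delta = {c & t_n for c in C} - {frozenset()}      # the intersection set
promoted = {x for x in delta if x in C}           # already closed in D(1..9)
new = delta - promoted                            # brand-new CIs in D+

# Genitor of the new CI gh: the (unique) minimum of its equivalence class
cls = [c for c in C if c & t_n == frozenset("gh")]
genitor = min(cls, key=len)
```

On this toy data the 18 CIs of Table 3 are recovered; bc, c and g come out promoted; bcgh, cgh, gh and h are new; and the genitor fgh of gh yields the support 3 + 1 = 4 of CI 20 in Table 2.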
Algorithm-wise, as shown in [25], all new CIs from C(D+) − C(D) are generated as intersections of t_n with a CI from C(D), e.g. CI 20 (gh) arises as ¯10 ∩ 8. Let ∆(D, t_n) = {t_n ∩ c | c ∈ C(D)} denote the intersection set generated by t_n. Here, ∆(D¯1,¯9, ¯10) = {bcgh, cgh, bc, gh, c, g, h}. Observe that some itemsets correspond to CIs from C(D¯1,¯9), e.g. bc is the CI 16 in D¯1,¯9. We shall call these CIs promoted, as opposed to new ones, and denote them C_P(D). It is readily shown that promoted CIs are exactly those included in (the itemset of) t_n. Correspondingly, new intersections in ∆(D, t_n), denoted C_N(D), can only involve CIs that are not included in t_n. Observe that, whether for a new or a promoted CI, there may be multiple ways to generate an itemset X ∈ ∆(D, t_n), e.g. bc is also ¯10 ∩ 2. In fact, t_n induces an equivalence relation over C(D) in which a class is defined as [c]_tn = {c̄ | c̄ ∩ t_n = c ∩ t_n} for any CI c (e.g. 8 ∈ [7]_¯10). Clearly, each class is associated with some X from ∆(D, t_n), save the one gathering CIs that are disjoint with t_n (of no interest here). In Figure 1 (section 4.2), the grey-filled table presents the equivalence classes associated to ∆(D¯1,¯9, ¯10).

Crucially, each class [c]_tn associated to some intersection X has a distinguished member CI that canonically generates X (the CI in bold in Figure 1). This is the CI corresponding to κ_D(X), the closure of X in D, which is, provably, the minimum of the class (see [25]). Now, if X is a promoted CI, then it is closed in D (X = κ_D(X)), hence it equals the canonical member (X = min_⊆([X]_tn)). Otherwise, X is a new CI w.r.t. D, hence non-closed and strictly smaller than its closure (X ⊂ κ_D(X)), which is further called the genitor of X. In our example, gh is a new CI in D¯1,¯9 whose genitor is 7. The set of all genitors in D will be denoted C_G(D). To sum up, promoted, new and genitor CIs are defined as follows:
• C_P(D) = {c | c ∈ C(D), c ⊆ ι(t_n)},
• C_N(D) = {c̄ | ∃ c ∈ C(D), c̄ = c ∩ t_n, κ_D(c̄) ≠ c̄},
• C_G(D) = {c ∈ C(D) | c = κ_D(c ∩ t_n), c ⊈ t_n}.

Finally, the support in D+ for any X in ∆(D, t_n) is merely a unit more than the support of its closure in D (the increase being due to t_n). Indeed, since σ_D(X) = σ_D(κ_D(X)), we have σ_D+(X) = σ_D(κ_D(X)) + 1. For instance, in Table 2, the support of the new CI 20 is 4 while that of CI 7, its genitor, is 3.

Dually, let t_o be an obsolete transaction from D and D− = D − {t_o}. The impact of removing t_o is some CIs, called obsolete, vanishing in D− while others, the demoted, get their support decreased by 1. The corresponding sets of CIs are investigated in section 4.3.

3 Related Work

Historically, methods for incrementally listing the closures of a cross-table date back at least to [19]. In the 1990s and 2000s, a variety of intersection-based incremental concept lattice builders were published, starting with [7], which introduced genitors. They compute CIs jointly with the respective tidsets and precedence links. Later methods, e.g.
Galicia-T and Galicia-M [24], reflect FCI mining concerns, thus they forgo precedence and tidsets. However, they are bound to maintain all CIs as some genitors might well be infrequent. CIs are stored compactly, e.g. in prefix trees, and accessed through inverted lists to avoid spending effort on empty intersections.

CFI-Stream [11] is a sliding-window CI miner that heavily relies on intersections between CIs (as opposed to CI-to-t_n ones). Its decrement lists all subsets of t_o and finds their closures as intersections of all encompassing CIs to test obsolescence; increments are more focused. Overall, genitors are targeted as such, yet key properties thereof are ignored. (In [7], genitors are called generators and promoted CIs, modified.)

Galicia-P [25] is a landmark CI miner using inverted lists to selectively access CIs and trie storage. The intersection trie is grown item-wise with each CI pointing to its current prefix. Intersections are split into new/promoted by tracking the minimum generating CI.
CloStream is another intersection-based method, introduced as landmark and later on completed to a sliding-window mode [27]. It uses inverted lists to filter the CIs to get intersected with t_n / t_o. Genitors appear in both the add and removal processing, yet as two unrelated, and informally defined, notions.

A landmark intersection-based approach is adapted to batch FCI mining in [1]. It uses a two-pass scheme and stores nearly all CIs (stripped of infrequent items) and no tidsets. The approach reportedly outperformed modern batch FCI miners on specific types of data.

Moment [4] is a sliding-window FCI stream miner adapting pattern enumeration (as in Eclat [28]) supported by a CE-tree. Its increment proceeds by sibling node joins which exploit the non-closed promising itemsets plus some infrequent ones. Support is yielded by tidset intersections. Its decrement relies on direct closure computing to spot obsolete CIs. NewMoment [15] enhances Moment with bit-vector encoding of tidsets. Yet it forgoes node categories: a node is joined with all its siblings.
GC-Tree [3] adapts the batch FCI miner DCI-Closed [17] to mining CIs from streams. It uses a hybrid scheme: new CIs are generated either by intersection of existing CIs or by closure climbing [17]. Genitors are absent: support is computed tidset-wise while the decrement exploits direct closure computing. Unlike Moment, at most one non-closed itemset is kept for each class [·]_D.

In summary, all methods incur overhead due to extra itemsets stored and inefficient new-closure computing. Moreover, while CloStream and Galicia-P substantially limit the computation effort, they still suffer on sub-optimal memory usage, e.g. redundancy between CI storage and inverted lists. Our Ciclad method aims at further minimizing both overheads.
4 The Ciclad Approach

Ciclad combines an incremental part that builds upon basic ideas from Galicia-P with a novel decremental method based on original mathematical results. Both share an overall computing schema, thus achieving high homogeneity and compactness. Below, we outline that schema and illustrate it (in incremental mode), then provide the formal background for decrementing and expand on high-level algorithmics.
The generic online processing comprises three steps: (1) intersection computing; (2) splitting the total set into promoted (demoted) and new (obsolete) CIs; (3) update of the indexing structures on CIs. Below, we expand on each step while using generic notations, e.g. t_x to mean t_n or t_o.

At step (1), intersection itemsets are grown from prefixes, along an iteration over all items in t_x. At an item a_k, each CI comprising a_k: (i) has its current prefix extended with a_k, and (ii) has its support checked against the current maximal support of a CI sharing that prefix. Eventually, prefixes grow into complete intersections whereby each intersection X has netted the minimum generating CI min_⊆([X]_tn). Step (2) categorizes each X by checking for a genitor CI within [X]_tn: in Ciclad+, the equality X = min_⊆([X]_tn) means X is promoted, otherwise new. In Ciclad−, the test is more subtle, as explained in section 4.3. Step (3) updates the CI storage and the inverted lists for items in t_x.

On the algorithmic side, Ciclad uses item-wise inverted lists for quick access to CIs and trie-based storage of evolving intersections (which allows a unique copy per prefix). Thus, during step (1), a CI only keeps a pointer to the trie node with the last item of its current intersection prefix. At step (2), end nodes of full intersections in the trie are identified by the non-zero count of referring CIs.

Figure 1 is a snapshot of the working memory of Ciclad+ at the very end of the increment of D¯1,¯9 (Table 3) by (¯10, bcgh) (Table 1), i.e. after all four items have been processed. On the left, the CI storage structure features fields for ID, support, and a reference to an end node (the last field) in the intersection trie. Observe that itemsets are only stored in the inverted lists (on the right of Figure 1), e.g. the ID of 7 (valued fgh) appears in the lists of f (not in the figure), g, and h.

The trie, which is built anew on each window shift, has nodes with fields for ID (underlined), the item, a pointer to the minimal CI (min) and a counter (cpt) for referring CIs. The counters discriminate end nodes of full intersections (shaded in the figure) against the rest: due to a simple bookkeeping mechanism for shifting last pointers, end nodes have at least one such pointer directed at them (hence a non-zero counter value). For instance, in 16, whose intersection is bc, last points to 3, whereas 10 and 11 point to 7. Conversely, the min field of trie node 7 points back to 10, the minimal CI yielding h. Overall, no white node is pointed at by a CI, hence they are ignored at step (2).

The trie grows along an iteration over the items in t_x. For each item a, trie nodes with a are appended to existing paths. To that end, the CIs in the list of a are scanned: for any such c, the node in c.last is replaced with a successor carrying a (if missing, such a node is created). The cpt fields of both the former and the new last nodes are updated accordingly. Then, c is tested for minimality w.r.t. the new last. Details of how Ciclad updates the above fields are given in section 5.1.

Finally, categorization tests the equality of a node's intersection to its minimal CI.
Figure 1. A snapshot of the working memory of Ciclad+

In our example, equality holds for node 3 vs. CI 16 (16 is thus promoted), but not for 7 vs. 10; hence the intersection h becomes a new CI (19 in Table 2). Overall, {bcgh}, {cgh}, {gh} and {h} are new CIs while {bc}, {c}, and {g} are promoted. To sum up, the above structures jointly enable rapid computation of intersections and genitors while keeping a low memory footprint w.r.t. competitor methods.

Dually to the incremental case, our decremental method computes C(D−) from C(D) and t_o, an obsolete transaction in D. Simply put, this amounts to yielding Table 3 from Table 2 and ¯10 (though in a stream, ¯1 would vanish first). By duality, C(D−) ⊆ C(D), i.e. no new closures appear. The relevant CI families are (1) the obsolete CIs (to be removed from C(D)), denoted C_O(D), and (2) the demoted CIs 'surviving' in D− but with a decreased support, C_D(D). The situation is easily illustrated by mirroring our previous example: thus, in Table 2, 20 becomes an obsolete CI (genitor 7) while 16 is a demoted CI with support decreased from 5 to 4.

Notwithstanding the apparent symmetry, an issue with decrements is that t_o does not discriminate obsolete vs. demoted CIs since both are subsets thereof:

Property 3. ∆(D, t_o) corresponds to the obsolete and demoted CIs: ∆(D, t_o) = {t_o ∩ c | c ∈ C(D)} = C_O(D) ∪ C_D(D).

The above follows from Property 2 and t_o ∈ C(D). Now, while C_O(D) ∪ C_D(D) stands out within C(D) via inclusion in t_o, further tests are needed to split it into its components, e.g. the presence of a genitor as a test for obsolescence. Indeed, if D is seen as the 'increment' of D− by t_o, then an obsolete c_o becomes a new CI, hence there should be a genitor c_g in D s.t. c_o = c_g ∩ t_o. Thus, c_g is the closure of c_o in D−, i.e. c_g = κ_D−(c_o). However, the minimality of genitors, c_g = min([c_g]_to), that proved crucial for incrementing, only holds in D−, but not in D, where the minimum is c_o (c_o = c_o ∩ t_o hence c_o = min([c_o]_to)). The genitor only becomes the minimum if non-inclusion in t_o is also required: c_g = min([c_g]_to − ℘(t_o)). Albeit appealing (CloStream went this way), extra non-inclusion tests may prove costly. Instead, we leverage the support difference of c_o and c_g in D: for a potential c_o we look for c_g ∈ [c_o]_to s.t. σ_D(c_o) = σ_D(c_g) + 1. Such a c_g will nullify Definition 1 for c_o in D−. Formally:

Property 4. A CI is obsolete iff it has a genitor: C_O(D) = {c ∈ C(D) | ∃ c_g ∈ C(D) : c = c_g ∩ t_o, σ_D(c_g) = σ_D(c) − 1}.

The reasoning behind the if part has been exposed in the previous paragraph. For the only if part, given a CI c ∈ C(D), the above three-fold condition on c_g must be shown to entail c ∉ C(D−). First, σ_D(c_g) ≠ σ_D(c) means c_g ≠ c, hence c_g ⊈ t_o. Consequently, σ_D−(c_g) = σ_D(c_g), as c_g is not impacted by t_o, while for c the support decreases by one, σ_D−(c) = σ_D(c) − 1, whence σ_D−(c_g) = σ_D−(c). This and the second condition, c ⊆ c_g, imply c ∉ C(D−), hence its obsolescence.

To sum up, while a new CI is easy to spot among other intersections since it is smaller than its genitor CI, here an obsolete CI equals its intersection with t_o. Therefore, categorization goes support-wise: the genitor of c ∈ C_O(D) is the CI in [c]_to whose support set is τ_D(c) with only t_o missing.

As explained above, for an intersection, its potential genitor, if any, has support one less than the minimal generating CI. To make genitors emerge at step (1), a gen field is added to trie nodes to store candidate CIs. Thus, whenever the intersection prefix of a CI c is extended, c is confronted to the current minimum of the node in its updated last field. If non-minimal, c is then compared to the content of gen. Observe that, due to the way intersections are grown, at some intermediate step more than one CI could satisfy the above criterion. Indeed, since inverted lists are not sorted, the support values of CIs may come in arbitrary order. Next, minimal CIs being unique, there can be only one legit candidate in a min field at any point. However, the interplay between the computation of the gen and min fields requires a set of candidate CIs to be kept in both. Notwithstanding, ultimately a gen field can hold at most one CI satisfying Property 4.
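Properties 3 and 4 can be checked mechanically on the running example. The Python sketch below is ours, for illustration only: a brute-force closures helper stands in for the maintained CI family, and the support test of Property 4 splits the subsets of t_o into obsolete and demoted CIs.

```python
from itertools import combinations

# Brute-force check of Properties 3 and 4 (illustrative sketch): removing
# t_o = 10 from D of Table 1 must turn Table 2 back into Table 3.
D = {1: "abcdefgh", 2: "abcef", 3: "cdfgh", 4: "efgh", 5: "g",
     6: "efh", 7: "abcd", 8: "bcd", 9: "d", 10: "bcgh"}
t_o = frozenset(D[10])

def closures(db):
    """Property 2: the CIs are exactly the intersections of transaction sets."""
    cis, tids = set(), list(db)
    for r in range(1, len(tids) + 1):
        for comb in combinations(tids, r):
            inter = frozenset("abcdefgh")
            for j in comb:
                inter &= frozenset(db[j])
            cis.add(inter)
    cis.discard(frozenset())
    return cis

C = closures(D)                                   # C(D), i.e. Table 2
supp = {c: sum(1 for Z in D.values() if c <= frozenset(Z)) for c in C}

candidates = {c for c in C if c <= t_o}           # Property 3: subsets of t_o
obsolete = {c for c in candidates                 # Property 4: genitor test
            if any(cg & t_o == c and supp[cg] == supp[c] - 1 for cg in C)}
demoted = candidates - obsolete
```

As expected, bcgh, cgh, gh and h vanish (each has a genitor of support one less), while bc, c and g are merely demoted.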
The details of the resulting field updates are provided in section 5.3. The rest of step (1) mirrors the incremental case. At step (2), the discrimination should be as follows: the CI in min is demoted if the gen field is empty, obsolete otherwise. However, we dropped removals from gen enforcing this condition to gain efficiency and instead check whether CIs are in [c]_to (see Algorithm 5 for details). Step (3) is the removal of obsolete CIs from the global CI storage, as well as from all the inverted lists they appear in.

5 Algorithms

The common parts (superscripted by ∗) of the overall computing schema are presented below, followed by the case-specific ones. Notice that, to achieve homogeneity in step (1), we initialize the global structures with a special CI with id 0, itemset I and support 0.

Algorithm 1: Ciclad∗
1  trie ← init()
2  foreach a ∈ t_x.items do
3    foreach c ∈ a.list do
4      if c.last = null then
5        c.last ← trie.root
6      ExpandPath∗(c, a)
7  UpdateCIs∗()

Ciclad∗ (Algorithm 1) is the high-level generic method to add/remove a transaction. At step (1) it iterates over t_x to yield its intersections with the existing CIs (lines 2 to 6). For an item a, the current prefixes of all CIs (in c.last) of its inverted list (a.list) are extended by appending a (line 6). Prefixes are initialized to the root node (line 5).
Algorithm 2: ExpandPath∗(c, a)
1  n ← lookup_succ(c.last, a)
2  if n = null then
3    n ← create_succ(a)
4  c.last.cpt--; n.cpt++
5  c.last ← n
6  UpdateGen∗(c)

ExpandPath∗ (Algorithm 2) is the unit intersection step. From the c.last node, it looks up the successor with a (if none, it creates one). The relevant fields are updated to reflect the extended prefix, before calling UpdateGen∗ for a case-specific bookkeeping of the top-support CI(s). Basically, each a shared by a CI c and t_x pushes c.last downwards in the trie by one node, up till the complete intersection is built.

UpdateCIs∗ covers the above steps (2) and (3). Within the final trie, it filters the complete intersections and categorizes them. The CI storage update is case-specific (see UpdateCIs+ and UpdateCIs−).

Ciclad+

Updating the minimal CIs (
UpdateGen∗) is the first step to differentiate the two cases. UpdateGen+ (Algorithm 2, line 6) merely compares the current CI c and the CI in n.min support-wise and updates the latter. We skip it here since it is straightforward.

UpdateCIs+ (Algorithm 3) looks up the final trie for intersection end nodes and discriminates them with a cardinality test. A new intersection is recognized by its size (the end node's depth) being smaller than the size of the CI in n.min (stored for all CIs). Then, createCI() implements step (3): it pushes the new CI into the inverted lists of all its items (retrieved from the trie path by path(n)). In the case of a promoted CI, step (3) is a mere support increase.

Algorithm 3: UpdateCIs+
  foreach n ∈ nodes(trie) do
    if n.cpt ≠ 0 then
      if n.depth = n.min.size() then
        n.min.supp++
      else
        createCI(path(n), n.min.supp + 1)

Figure 2. Trie evolution upon adding ¯3 to D¯1,¯2 (m, c, and lst stand for the fields min, cpt, and last, respectively)

Figure 2 traces the evolution of the trie upon inserting ¯3 into D¯1,¯2, whereby C(D¯1,¯2) = {(¯1, abcdefgh), (¯2, abcef)}. Each section of the figure shows the impact of a single item: the inverted list is on top, the changes in the last column of the CI table on the left, and the current trie on the right. In the trie, nodes are decorated by the current minimal CI and the count of all referring CIs.
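For concreteness, the whole Ciclad+ pass (Algorithms 1–3) can be mimicked in a few lines of Python. The sketch below is our illustration, not the authors' C++ implementation: the node fields mirror min and cpt from the pseudocode, inverted lists are simulated by scanning the CI dictionary, and the class of t_n itself is covered here by the CI abcdefgh (the special CI 0 with itemset I plays that role in general).

```python
# Illustrative Python rendering of the Ciclad+ pass (Algorithms 1-3):
# item-wise prefix growth in a trie, then a scan of the end nodes.
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.succ = {}        # child nodes, keyed by item
        self.min = None       # top-support CI seen here (the class minimum)
        self.cpt = 0          # count of CIs whose prefix currently ends here

def ciclad_plus(cis, t_n):
    """cis: dict CI (frozenset) -> support; returns the updated CI family."""
    root = Node(None, None)
    last = {}                                  # CI -> end node of its prefix
    for a in t_n:                              # step (1): item-wise growth
        for c in [x for x in cis if a in x]:   # inverted list of item a
            n = last.get(c, root)
            if a not in n.succ:                # ExpandPath*: extend the prefix
                n.succ[a] = Node(a, n)
            n.cpt -= 1
            n = n.succ[a]
            n.cpt += 1
            last[c] = n
            # UpdateGen+: the class minimum is the visitor of maximal support
            if n.min is None or cis[c] > cis[n.min]:
                n.min = c
    out = dict(cis)
    stack = list(root.succ.values())
    while stack:                               # UpdateCIs+: scan the trie
        n = stack.pop()
        stack.extend(n.succ.values())
        if n.cpt > 0:                          # end node of a full intersection
            x, p = set(), n
            while p.item is not None:          # rebuild the itemset from path
                x.add(p.item)
                p = p.parent
            out[frozenset(x)] = cis[n.min] + 1 # promoted or new: support + 1
    return out
```

Fed with the CI family of Table 3 and the transaction ¯10 = bcgh, it returns exactly the 22 CIs of Table 2 with their supports.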
Algorithm 4: UpdateGen−(c)
  switch (c.supp − c.last.max_supp)
    case ≥ 2 : c.last.min ← {c}; c.last.gen ← ∅
    case 1 : c.last.gen ← c.last.min; c.last.min ← {c}
    case 0 : c.last.min ← c.last.min ∪ {c}
    case −1 : c.last.gen ← c.last.gen ∪ {c}
  c.last.max_supp ← max(c.supp, c.last.max_supp)

Ciclad−

UpdateGen− (Algorithm 4) covers the joint maintenance of the min and gen fields of a trie node, i.e. the minimal CI and the 'less-by-one' candidates, respectively. As indicated above, both store sets of CIs. Given a CI c whose current intersection prefix end has been freshly redirected to a trie node n, the following reasoning applies to the support of c and the maximal support stored at n (the max_supp field): if the support of c is way higher (2+), then both fields are flushed and c becomes the new minimum, whereas with a difference of one, the current min is merely shifted to gen. With 0 or −1, c is added to min or gen, respectively. Other values trigger no action. At the end, the maximal support for n is updated correspondingly.
Algorithm 5: UpdateCIs−
  foreach n ∈ nodes(trie) do
    if n.cpt ≠ 0 then
      demoted ← true
      foreach c ∈ n.gen do
        if c.last = n then
          removeCI(n, n.min)
          demoted ← false
          break
      if demoted then n.min.supp--

Within the final trie, UpdateCIs− covers steps (2) and (3), hence it first categorizes the full intersections and then updates the CI family based on the respective min and gen fields of their end nodes. In actuality, while a proper maintenance of gen would ensure that eventually at most one CI is stored at each node, we decided to drop gen updates upon moving the last pointer of a CI away from the node. Thus, a CI may belong to the gen lists of end nodes corresponding to strict prefixes of its intersection. This is illustrated by part e of Figure 3: the CI 2 (abcef) remains in the gen set of the end node of abc, although its last field eventually points to the end of abcef (see part f of Figure 3). Property 3 (see section 8) guarantees that only CIs from ∆(D, t_o) can be left behind in prefixes' gen fields in the way described above. To detect them within gen, we check for last fields pointing to a different trie node.

Conversely, the genitor test spots CIs in n.gen whose last points to n. If positive, the test triggers removeCI (step (3)), which removes the CI in n.min from the inverted lists of all its items.

As an illustration, consider the removal of transactions ¯1 and ¯2 from the CI family in Table 2. The first one is trivial as the only obsolete CI is 1: all CIs being subsets of ¯1, only the special CI 0 can be the genitor. Thus, the resulting CI family is readily derived from Table 2 by decreasing the supports by 1.

Figure 3. Trie evolution upon removing ¯2 from D¯2,¯10 (m, g, c and lst stand for the fields min, gen, cpt and last, respectively)

The item iterations in the removal of ¯2, or abcef (see Algorithm 1), are shown in Figure 3. In the final trie (item f), the intersections {abc}, {cf} and {ef} correspond to obsolete CIs. Indeed, each gen field in the respective end node refers to a CI whose last field, in turn, points to that node. For instance, the gen field for 10 ({cf}) contains 4, or {cdfgh}, of support 1, while the min field value is 3 ({cf}) of support 2. The demoted CIs are {bc} and {c}, which both have an empty gen field.

6 Experimental Study

We compared experimentally
Ciclad to Moment, NewMoment, CloStream and CFI-Stream. The original version of Moment was used, as provided by its authors. To level the playing field, we implemented Ciclad in C++ and in single-threaded mode, as well as NewMoment, CloStream and CFI-Stream (while the original versions would have been preferable, none of these is currently made available by the respective authors). Our code is available at https://github.com/guyfrancoeur/ciclad. To investigate their relative efficiency, we put the methods in identical conditions, i.e. we made them compute all CIs from varying datasets and sliding-window lengths.

We used seven datasets of varying nature (see the metrics in Table 4). Mushroom and Retail are popular datasets (http://fimi.ua.ac.be/data/): Retail is a sparse market basket dataset, while Mushroom, describing mushroom samples, is a dense and correlated dataset made of same-size transactions. Synth and Synth2 are synthetic transactional datasets, generated with SPMF, of medium and small size, respectively. Three other real-world datasets were used: click streams for BMS-View (from KDD 2000), online purchases for Chainstore (from the Nu-MineBench project), and network logs (Net-Log) adapted from activity data from the DARPA Transparent Computing program (https://gitlab.com/adaptdata/e2).

Table 4. Description of the datasets (|D|, avg(|t|), |I|, stdev(|t|), density): Retail, Mushroom, Synth, Synth2, BMS-View, Chainstore, Net-Log
All experiments were run on Windows 10 Professional 64-bit with an Intel i7-8700 CPU and 32 GB of RAM. We measured the total CPU time over the entire stream and set a time limit of 10K secs: methods that ran longer on a dataset were aborted, while recording memory usage, and withdrawn from experiments on larger windows over the same dataset. Moreover, we recorded peak memory usage rather than the average across all windows.

Noteworthily, finer measures, like the evolution of time/memory values as well as the number of CIs along the stream, albeit potentially helpful, could not be provided here for space reasons (e.g. see section 10).
Results are summarized in Figure 4 (CPU time and memory usage of Ciclad, Moment, NewMoment, CloStream, and CFI-Stream), where w = W indicates the window size and the dashed line at the top marks the time limit. As a general trend, CFI-Stream exceeded the time limit on the smallest window of every dataset except Net-Log, followed closely by NewMoment, which could process all window sizes only on Mushroom. Next came Moment: it went off-limit on the first window size of Synth and Chainstore, and on the second one for BMS-View, Synth2 and Retail (yet, exceptionally, we let it run on all others and measured memory usage). On Mushroom, it performed slightly better time-wise, yet its memory usage was substantially worse (among successful methods). Finally, CloStream and Ciclad finished within the limit overall, with Ciclad showing invariably better time and memory figures.

On the dataset side, it is noteworthy how on
Retail, with increasing window sizes, Moment gradually approaches the limit of 32 GB of RAM, while Ciclad remains very competitive (39 MB) and CloStream less so. The data in BMS-View is similar to Retail, with fewer items and smaller transactions, which enabled larger windows, e.g. of size 20K. It showed a similar overall pattern, yet with a smaller gap between Ciclad and CloStream on RAM. On Chainstore the same trend was observed, yet with larger gaps on memory. This is only half a surprise, as both represent real-world streams. Synth and Synth2 complete the picture of the dominance of Ciclad and, to a lesser degree, CloStream.

With
Mushroom, a very dense dataset, the trend is different: CPU-wise, there seems to be a tie between Ciclad and Moment, far ahead of CloStream, followed by NewMoment. On memory usage the picture is less clear, yet Ciclad is somewhat ahead of the rest. This is matched almost perfectly by the pattern on Net-Log, despite the windows being much larger. The only difference here is NewMoment not being among the contenders after its timeout on the second window.
From the above observations, we conclude that on storage, intersection-based methods (CFI-Stream excluded) perform invariably better than pattern-enumeration ones. We believe this is the impact of storing non-closed itemsets in each [·]_D class. This impact deepens with sparse data, as classes tend to grow larger due to the FCI/FI ratio.

On sparse data, intersection-based methods are also faster than their competitors. Again, it is the overhead of non-closed itemsets: upon each CI lookup, Moment traverses its [·]_D class from the smallest member up to the largest one (the CI) by walking along a chain of intermediate itemsets. As a result, on sparse datasets (e.g. Synth2) with limited-size windows, Ciclad can be up to 40 times faster than Moment.

With dense datasets, pattern-enumeration methods are favored as the FCI/FI ratio is higher, hence the smaller [·]_D classes. Conversely, the intersection-based equivalence classes [·]_{t_x} tend to grow larger, which increases the intersection effort per CI in intersection-based methods. However, it is also worth recalling that with such data, the benefit of mining FCIs, as opposed to plain FIs, is limited.

Finally, CloStream lags behind Ciclad because of its fully-blown intersection operations and the recurrent lookups for an intersection X each time it is generated. Ciclad streamlines both operations by its item-wise trie-based intersection-growing technique.
Our novel sliding-window miner Ciclad implements an efficient intersection-based computing schema. It exploits the mathematically-grounded notion of genitor, the CI that is the closure of a non-closed itemset whose status w.r.t. closedness changes upon a window shift. Design pillars in our intersection-based scheme include per-item inverted-list storage of CIs, item-wise intersection growth, and support-based genitor detection. The outcome of our experimental study confirmed that Ciclad outperforms its competitors significantly, both on storage and on processing.

As a basis for further research, Ciclad lays the groundwork for additional challenges to be taken up. For instance, to tackle the mining of strong ARs, or rather condensed representations thereof, over a sliding window, we are designing extensions covering generator itemsets and/or precedence links as proposed in [18]. As a separate track, we investigate the mining of rare yet confident ARs [22] from the stream. Next, we plan to leverage Ciclad's homogeneity by merging t_n and t_o processing in a single-pass method. Finally, as a way to focus strictly on FCIs, we will look at the evolution of the FI border [8, 13].

Acknowledgments
Thanks go to Y. Chi for the code of MomentFP and S. Benabderrahmane for the pre-formatted DARPA data.
References

[1] C. Borgelt et al. Finding closed frequent item sets by intersecting transactions. In , pages 367–376. ACM, 2011.
[2] T. Calders et al. A Survey on Condensed Representations for Frequent Sets. In Constraint-Based Mining and Inductive Databases, volume 3848 of LNCS, pages 64–80. Springer, 2004.
[3] J. Chen and S. Li. GC-tree: a fast online algorithm for mining frequent closed itemsets. In , pages 457–468. Springer, 2007.
[4] Y. Chi et al. Moment: Maintaining closed frequent itemsets over a stream sliding window. In , pages 59–66. IEEE, 2004.
[5] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, 1999.
[6] C. Gao and J. Wang. Efficient itemset generator discovery over a stream sliding window. In , pages 355–364, 2009.
[7] R. Godin et al. Incremental Concept Formation Algorithms Based on Galois (Concept) Lattices. Computational Intelligence, 11(2):246–267, 1995.
[8] D. Gunopulos et al. Data mining, hypergraph transversals, and machine learning (extended abstract). In , pages 209–216. ACM, 1997.
[9] S. Hamadi et al. Compiling packet forwarding rules for switch pipelined architecture. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9, San Francisco, CA, USA, April 2016. IEEE.
[10] D. Huang et al. Rare pattern mining on data streams. In Intl. Conf. DaWaK, pages 303–314. Springer, 2012.
[11] N. Jiang and L. Gruenwald. CFI-Stream: mining closed frequent itemsets in data streams. In , pages 592–597. ACM, 2006.
[12] N. Jiang and L. Gruenwald. Research issues in data stream association rule mining. ACM SIGMOD Record, 35(1):14–19, 2006.
[13] R. Karim et al. Mining maximal frequent patterns in transactional databases and dynamic data streams: A Spark-based approach. Information Sciences, 432:278–300, 2018.
[14] M. Kryszkiewicz. Concise Representations of Association Rules. In ESF Exploratory WS on Pattern Detection and Discovery, pages 92–109, 2002.
[15] H.-F. Li et al. A new algorithm for maintaining closed frequent itemsets in data streams by incremental updates. In ICDM Workshops 2006, pages 672–676. IEEE, 2006.
[16] H.-F. Li and S.-Y. Lee. Mining frequent itemsets over data streams using efficient window sliding techniques. Expert Systems with Applications, 36(2):1466–1477, 2009.
[17] C. Lucchese et al. DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets. In FIMI, 2004.
[18] K. Nehme et al. On Computing the Minimal Generator Family for Concept Lattices and Icebergs. In , pages 192–207, Lens (FR), 2005. Springer.
[19] E. M. Norris. An algorithm for computing the maximal rectangles in a binary relation. Revue Roumaine de Maths Pures et Appliquées, 23(2):243–250, 1978.
[20] J. L. Pfaltz. Incremental Transformation of Lattices: A Key to Effective Knowledge Discovery. In Proc. of the 1st ICGT, pages 351–362, 2002.
[21] M. Rashid et al. Mining associated sensor patterns for data stream of wireless sensor networks. In , pages 91–98. ACM, 2013.
[22] L. Szathmary et al. Generating rare association rules using the minimal rare itemsets family. Intl. J-l of Software and Informatics, 4(3), 2010.
[23] F. Tang et al. Adaptive self-sufficient itemset miner for transactional data streams. In Pacific Rim Intl. Conf. on Artificial Intelligence, pages 419–430. Springer, 2019.
[24] P. Valtchev et al. Generating Frequent Itemsets Incrementally: Two Novel Approaches Based On Galois Lattice Theory. Journal of Experimental & Theoretical Artificial Intelligence, 14(2-3):115–142, 2002.
[25] P. Valtchev et al. A framework for incremental generation of closed itemsets. Discrete Appl. Math., 156:924–949, March 2008.
[26] Y. Yamamoto et al. Parasol: a hybrid approximation approach for scalable frequent itemset mining in streaming data. Journal of Intelligent Information Systems, pages 1–29, 2019.
[27] S. Yen et al. A fast algorithm for mining frequent closed itemsets over stream sliding window. In IEEE Intl. Conf. on Fuzzy Systems, pages 996–1002. IEEE, 2011.
[28] M. Zaki et al. New Algorithms for Fast Discovery of Association Rules. In , pages 283–286, 1997.
[29] M. Zaki and C.-J. Hsiao. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4):462–478, 2005.
Appendix
Below, we provide additional results about Ciclad that clarify aspects such as its correctness, its computational cost, and the way it compares to GC-Tree, a method that was excluded from the final validation study.
To show Ciclad is correct, we first examine the decision about where intersections end. Let T denote the final trie and n a node in T. Now, n is the end of a full intersection, i.e. the itemset items(n) made of the items along the root-bound path from n is in ∆(D, t_x), iff its counter n.cpt is strictly positive:

Property 5. Given a node n in the trie T, items(n) ∈ ∆(D, t_x) iff n.cpt > 0.

Consider the evolution of the minimal CIs for a trie node. Understandably, we only focus on trie states at the end of a particular iteration, i.e. with the respective item inverted list fully parsed. Assume after the k-th iteration, the list of item a_k has been processed and has yielded a (still partially completed) trie T_k. Let T_k[a_k] denote the set of nodes labelled by a_k. These nodes represent the increment w.r.t. T_{k-1}.

Property 6. Given a node n ∈ T_k[a_k] and a CI c ∈ C(D), n.min = c iff c = κ_D(items(n)).

Noteworthily, Property 6 ensures that in the final trie, all the minimal CIs are correctly positioned. For the decrement case, we need to further show that within any gen field, at most one CI is not a subset of t_o.

Property 7. Let n be a node in the final trie T^{|t_o|} of an obsolete transaction t_o; then |n.gen − ∆(D, t_o)| ≤ 1.

This follows from Property 4 and the observation that CIs outside ∆(D, t_o) keep their supports from C(D) in C(D−). Indeed, assuming more than a single CI satisfies the conditions entails there are two different CIs in C(D−) with support equal to the support of the obsolete/demoted itemset in n.min. This, regardless of the exact status of that itemset in ∆(D, t_o), is a contradiction. To sum up, the n.gen field of a node in the final trie can contain at most one CI outside ∆(D, t_o) plus a number of CIs from that set. Then, only the former belongs to the class of n (with minimum CI n.min). Finally, the status of a CI c in n.gen depends on its membership in ∆(D, t_o). To avoid costly tests of inclusion into t_o, we rely on the intersection class of c: since all CIs from ∆(D, t_o) are minimal in their own classes, each has a unique value in its last field. Thus, none of the CIs c in an n.gen field that are also in ∆(D, t_o) could refer to n via their c.last field:

Property 8. Let n be a node in the final trie T^{|t_o|}; then for c_g ∈ n.gen, c_g ∉ ∆(D, t_o) iff c_g.last = n.

The window-shift complexity of Ciclad is O(k_m · l_m) in time and O(k_m · l_m) in space, where k_m is the maximal transaction (and CI) size and l_m the maximal number of CIs in a window. The intersection computing (Ciclad*, up till the end of ExpandPath*) is in O(k_m · l_m): for each item i from t_x and CI c comprising i, Ciclad* pushes the intersection of c down its path in the trie, which involves few operations (half a dozen), all of constant time cost.

Categorizing intersections and creating new CIs (UpdateCIs, up till createCI()) also has a cost in O(k_m · l_m). First, detecting end nodes is in O(k_m · l_m) since there are at most l_m intersections, each of size at most k_m, hence O(k_m · l_m) nodes in the trie. Next, creating all new CIs is also in O(k_m · l_m): the same bound l_m multiplied by the unit cost of creation (linear in k_m). Inverted lists can be updated in O(k_m · l_m) time, as each combination of a new CI and an incident item amounts to one list insertion.

Ciclad* has a memory footprint in O(k_m · l_m). Indeed, the intersection trie will have at most k_m · l_m nodes (see above), whose successor structures might need up to k_m memory cells each. Comparatively, the total footprint of the inverted lists is in O(k_m · l_m). Now, k_m · l_m is, in fact, a gross overestimation of the total number of items in all CIs, which is the key cost factor in both time and memory: the real figure, especially with sparse data, will be way lower. Noteworthily, the size of the window is not a factor in the above functions: this is the effect of skipping tidsets altogether (yet it has an indirect impact through l_m).

Finally, Ciclad is a listing algorithm, hence to be assessed not by total time but rather by the cost per output element, i.e. per CI. Thus, assuming the entire stream processing cost is in O(n_s · k_m · l_m), where n_s is the total number of transactions, the per-CI cost is, grossly, in O(n_s · k_m), i.e. polynomial in the size of the dataset. In contrast, the cost of producing a particular new CI c in Moment might go beyond that limit, as the number of unpromising nodes traversed while generating c can grow up to exponential in its size.
10 Additional performance tests

We made Ciclad compete on Moment's terms, i.e. with higher support thresholds. Thus, we compared both over various min_supp values (1, 2, 3 and 5), this time using only two datasets, one dense (Mushroom) and one sparse (Synth2), each with two different window sizes. The results are summarized in Figure 5 (CPU time) and in Figure 6 (memory usage).

Figure 5. Runtime comparison varying Moment's min_supp

On Mushroom, variations in min_supp modestly impact the computing time of Moment; this arguably fits the intuition that CIs are more regularly scattered over the pattern space (thus higher values are needed for a palpable drop in the cost). Noteworthily, Ciclad and Moment offer comparable performances. On Synth2, Moment's runtime efficiency improves much faster, and it outperforms Ciclad for thresholds of 5 and above.
Figure 6. Memory comparison varying Moment's min_supp

Memory-wise, Ciclad is still somewhat ahead, yet the trend of rapidly decreasing consumption in Moment is visible. Again, for the sparse dataset, with thresholds of 5 and up, Moment catches up with Ciclad, whereas with the dense data the break-even point is still somewhere above.
Figure 7. Evolution of C(D) in Mushroom (w = 1K) and Synth2 (w = 1K)

As a possible hint at the reasons behind the observed performances, we track the proportions of new, promoted, demoted and obsolete CIs in windows. Results for the Mushroom and Synth2 datasets with windows of size 1K are shown in Figure 7. In summary, the higher ratio of new/obsolete CIs to promoted/demoted ones in sparse data would explain the superior performances of Ciclad, given the costly tree restructuring in Moment as opposed to the inexpensive updates of existing nodes in Ciclad. Conversely, it hints at the detection of promoted/demoted CIs in Ciclad as a possible improvement point for speeding up dense data processing.
Table 5. Average number of CIs and CET nodes in Moment

Dataset               Avg. nodes   Avg. CIs   Ratio
Mushr. (w=1k, s=1)
Mushr. (w=1k, s=5)
Mushr. (w=.5k, s=1)
Mushr. (w=1.5k, s=1)
Synth2 (w=.3k, s=1)
Synth2 (w=1k, s=1)
Synth2 (w=2k, s=1)
We also examined the storage overhead in Moment, i.e. the overhead due to the storage of promising and intermediate itemsets. Table 5 shows the average number of nodes within Moment's CET (Avg. nodes) against the average number of CIs (Avg. CIs), both taken over the entire stream, for a number of (dataset, window size, min_supp) combinations. The wide variation, from 3.24 to an extreme 232.21, is intriguing. Yet the trend correlates with the observations on computing time and memory usage, i.e., the higher the value, the less competitive the method vs Ciclad.

GC-Tree vs Ciclad vs Moment
We also studied GC-Tree in order to assess its hybrid approach. However, its decremental part was impossible to implement due to inconsistencies in the description of the method. Therefore, GC-Tree was compared to Ciclad and Moment in landmark mode only.

Figure 8 gives an extract of the performance tests: it presents the CPU time on three of the seven datasets in landmark mode. We used a prefix of the stream large enough to let a stable trend appear.

Figure 8. GC-Tree vs. Ciclad vs. Moment in landmark mode

An immediate observation is that the hybrid approach, even if appealing, does not perform well with a large number of items. The clear gap between