ONCE and ONCE+: Counting the Frequency of Time-constrained Serial Episodes in a Streaming Sequence
Hui Li, Sizhe Peng, Jian Li, Jingjing Li, Jiangtao Cui, Jianfeng Ma
arXiv preprint [cs.DS]
Hui Li, Sizhe Peng, Jian Li∗, Jingjing Li†, Jiangtao Cui∗, Jianfeng Ma
School of Cyber Engineering, Xidian University, China
∗ School of Computer Science and Technology, Xidian University, China
† Department of Computer Science and Engineering, The Chinese University of Hong Kong, China

Abstract: As a representative sequential pattern mining problem, counting the frequency of serial episodes in a streaming sequence has drawn continuous attention in academia due to its wide application in practice, e.g., telecommunication alarms, stock markets, transaction logs and bioinformatics. Although a number of serial episode mining algorithms have been developed recently, most of them are neither stream-oriented, as they require multiple passes over the dataset, nor time-aware, as they fail to take into account the time constraint of serial episodes. In this paper, we propose two novel one-pass algorithms, ONCE and ONCE+, which respectively compute two popular frequencies of given episodes satisfying a predefined time constraint as the signals of a stream arrive one after another. ONCE targets the non-overlapped frequency, where the occurrences of a serial episode in the sequence do not intersect. ONCE+ is designed for the distinct frequency, where the occurrences of a serial episode do not share any event. Theoretical analysis proves that our algorithms correctly count the frequency of the target time-constrained serial episodes in a given stream. Experimental study over both real-world and synthetic datasets demonstrates that the proposed algorithms work, with little time and space, in signal-intensive streams where millions of signals arrive within a single second. Moreover, the algorithms have been applied in a real stream processing system, where the efficacy and efficiency of this work are tested in practical applications.
Index Terms: event sequences, frequent episodes, sequence analysis

1 INTRODUCTION
With the development of cloud computing, the Internet of Things, biocomputing and related technologies, numerous ordered sequences have become accessible from everyday applications. Mining serial episodes from long sequences has many potential applications and has thereby drawn much research attention, especially in the fields of telecommunication [1], finance [2], neuroscience [3] and information security. A serial episode is an ordered collection of specific signals, e.g., a sequential alarm pattern in a telecommunication alarm sequence, and the number of times it appears in the sequence is referred to as its frequency. Generally, the frequency of serial episode patterns can be used to analyze or summarize the whole sequence and to predict future signals in the sequence.

For instance, a long telecommunication alarm sequence can be summarized using a limited number of representative serial episodes. We may also be interested in the frequency of specific alarm episode patterns within the long sequence so that responses to these alarm episodes can be optimized; the frequency can be used directly to mine frequent serial episodes; and we may be interested in predicting future alarms within the sequence. All of these examples require counting the frequencies of a given set of serial episodes.

The need to count the frequency of a finite set of given serial episodes arises in many real applications in different fields. For instance, in the securities market, detecting securities fraud is a challenging task given the massive amount of trading data produced every day. Insider trading, one category of deceptive practice, can be generalized as a serial pattern over a group of actions including offers and sales of securities [2]. Given a set of patterns/trends known to be fraudulent, fraudulent activities can be detected automatically by focusing on the deceptive patterns in the streaming trading sequence. Besides, in the field of bioinformatics, analyzing the frequency and distribution of a gene set of interest across whole-genome datasets can help find out in which tissues or cells the genes are co-expressed [4].
In a message-intensive system, millions of data items are generated within several minutes. Such data are referred to as streams [5]. Formally, a streaming sequence is composed of several types of events, and its length grows dynamically as new events occur, often at a high rate. Conventional methods for counting the frequency of serial episodes generally store the entire dataset and then process it through multiple passes. Hence, traditional algorithms are not applicable to streams, as it is impossible to store the entire unbounded data before processing. Any method for data streams must thus operate under the constraints of limited memory and time, which means that data streams must be processed faster than they are generated. To this end, in this paper we propose an efficient one-pass solution that counts the frequencies of a given set of serial episodes in a data stream without loading the whole sequence beforehand.

In addition, although there exist some efforts that mine the frequency of serial episodes from a long sequence, among which [6], [7] even work on streams in a one-pass manner, they suffer from a key limitation: the serial episodes mined are not associated with any time constraint. That is, they do not care whether the serial episodes fall into a limited time span (e.g., an hour, a day, etc.). For instance, a common scenario in telecommunication alarm sequence analysis is to learn the typical serial alarm episodes in order to discover the sequential association rules between alarms, so that we do not need to respond to each of them separately, because responding to the earliest alarm can always automatically address the following ones incurred by it. Obviously, alarms that form a sequential association rule should not span too long a time (e.g., an hour). As another example, we may be interested in a particular person's daily mobility pattern (e.g., Office → Gym → Bar) to help quantify his daily movement. In both of these scenarios, we have to limit the time span of the serial episodes.

As the state-of-the-art single-pass serial episode mining algorithms, [6]–[8] employ automata to count the occurrences of each target episode. Unfortunately, the automata they employ cannot easily incorporate a time constraint; this is discussed in detail in Section 3. To address the problem, we propose a new model that avoids the problem of [6]–[8] when counting serial episodes that satisfy a given time span within a streaming sequence. In summary, our contributions in this work are as follows.

• We formally define the non-overlapped frequency counting problem of time-constrained episodes. To address the problem, we present a carefully designed data structure, namely OccMap, as well as a group of operations over it. An OccMap corresponds to a particular serial episode and stores the timestamps of valid signals.
• Based on OccMap, we propose two efficient algorithms, ONCE and ONCE+ (OccurreNce Count of serial Episode), to compute two popular frequencies of given time-constrained serial episodes in a dynamic event stream, over which only a one-pass process is required. In particular, ONCE computes the non-overlapped frequency while ONCE+ works on the distinct frequency.
• ONCE (ONCE+) does not require any user-specified parameter other than the time constraint $\tau$. Our algorithms do not put any restriction on the streaming sequence, which can arrive either in batches (groups of sequentially ordered signals) [9] or as single signals.
• We theoretically prove that ONCE and ONCE+ correctly count their respective target frequencies. Besides, processing an event in the stream only requires $O(k \log \tau)$ time, where $k$ is the length of the episode.
• Empirical studies conducted over both real-world and synthetic datasets show that ONCE and ONCE+ efficiently and correctly find the frequencies of the serial episodes and outperform the baseline method in both space and time cost.

The rest of this paper is organized as follows. In the next section, we briefly discuss related work in serial episode mining. In Section 3 we introduce the preliminary definitions and the problem statement. Afterwards, we present the details of our solution in Section 4 and Section 5, with a theoretical study of its complexity and correctness. In Section 6, we conduct an empirical study over real-world and synthetic datasets. We show a practical application of the proposed algorithm and discuss the corresponding observations in Section 7. Lastly, we conclude our work in Section 8.
2 RELATED WORK
Several types of sequential patterns have been extensively studied so far, including frequent (closed) sequential pattern mining [10]–[14], serial episode discovery [6], [15], and periodic (ordered) pattern mining [16], [17]. Within these works, various frequency definitions of episodes have been proposed, which have given rise to different types of frequent episodes. Recently, Achar et al. [18] reviewed 7 different frequency definitions in the literature. Three of them, window-based frequency [1], head frequency [19], and total frequency [19], consider the number of windows containing at least one occurrence of an episode, where each window has the same specified width. The remaining definitions, minimal occurrence-based frequency [1], non-overlapped frequency [20], non-interleaved frequency [21] and distinct frequency [22], directly take into account the different occurrences of an episode in the sequence.

However, these efforts cannot be deployed in some real-world applications, such as fraudulent trading detection or telecommunication alarm response, as they ignore the practical significance of the time constraint in episodes. Notably, the authors of [6] suggested that, by attaching a time constraint to each automaton, their method can address the time-constrained serial episode mining problem. However, they did not empirically test the suggested method on the time-constrained problem. Unfortunately, as we will illustrate in detail in Section 3.2, the suggested method is unable to generate the correct answer.

In the field of serial episode mining over sequential streams, related algorithm studies have become increasingly prevalent in recent years [9], [23]–[25]. Patnaik et al. [9] considered serial episode mining over dynamic data streams. The main contribution of their work is to define batches of events and apply their algorithms over each batch. However, the performance of their method depends heavily on the size of the batches over which the frequency is computed. A large batch leads to a high response time, while a small one fails to count the frequency of long episodes. In particular, once each batch contains only one event, i.e., events arrive one after another, their algorithm no longer works. In addition, when an occurrence of a serial episode stretches over two consecutive batches, this occurrence is missed. Xiang et al. [26] presented the MESELO algorithm, which requires a complete view of the whole sequence. This strictly limits its application in streams, where the number of events is potentially unlimited. For instance, if we want to learn the frequency of an episode in the past 48 hours, the window size $\Delta$ in their method has to be set to a large time span so that all records from those hours are stored, which consumes enormous memory. They also presented another work [27] that aims to mine serial episodes over precise-positioning sequences, where the elapsed time between any two consecutive events is a constant. SASE [7] has been proposed to record the appearance of a target serial episode within a stream. The proposed structure has to spend O(2 kτ) time to process a single signal in the stream, while our algorithm takes only $O(k \log \tau)$. Besides, none of these works takes into account the time span of the target serial episode. In contrast, we present in this paper a novel one-pass algorithm that works on streaming sequences without any requirement to store the whole sequence beforehand or any limitation on the batch size, while taking into account the time span of the episodes.

TABLE 1: Notations

| Symbol | Description |
|---|---|
| $(s_i, t_i)$ | temporal event $s_i$ happens at $t_i$ |
| $S = \langle (s_1, t_1), \ldots \rangle$ | streaming sequence |
| $e = \langle \phi_1, \ldots, \phi_k \rangle$ | serial episode |
| $Occ(e, S)$ | occurrence of $e$ in $S$ |
| $Occ_{OPT}(e, S)$ | minimal occurrence of $e$ in $S$ |
| $e^{\tau}$ | time-constrained serial episode |
| $\tau$ | time constraint of $e^{\tau}$ |
| $k$ | length of serial episode |
| $OM(e^{\tau}) = [L_1, \ldots, L_k]$ | OccMap of $e^{\tau}$ |
| $L_i$ | timestamp list of the $i$-th layer in $OM(e^{\tau})$ |

3 PROBLEM FORMULATION
In this section, we first present a series of preliminary definitions. For ease of understanding, Table 1 summarizes the key notations used in this paper.
We first define streaming sequences, serial episodes [28] andnon-overlapped frequency.
Definition 1 (Streaming sequence). A streaming sequence is a long (potentially infinite) sequence of events. Let $\Sigma$ be a finite alphabet set and $S$ be a sequential list of events, denoted by $S = \langle (s_1, t_1), \ldots, (s_n, t_n), \ldots \rangle$, where $s_i \in \Sigma$ and the pair $(s_i, t_i)$ ($1 \le i \le n$, $t_{i-1} < t_i$) means that event $s_i$ happens at timestamp $t_i$. We denote by $S(n) = \langle (s_1, t_1), \ldots, (s_n, t_n) \rangle$ the subsequence formed by the first $n$ events of $S$. Let $S[i]$ be the $i$-th element of $S$ (i.e., $(s_i, t_i)$), and let $S[i].e$ and $S[i].t$ be the $i$-th event and the corresponding timestamp of $S$, respectively.

Footnote: To avoid duplicate word usage, we use the words event and signal interchangeably in the rest of this paper.

For instance, a daily trajectory of a person can be denoted as $S = \langle (\text{Home},\ ), (\text{Office}, 10{:}00), (\text{Gym}, 15{:}00), (\text{Bar}, 20{:}00) \rangle$, where $|S| = 4$, $\Sigma = \{\text{Home}, \text{Office}, \text{Gym}, \text{Bar}\}$, $S[2] = (\text{Office}, 10{:}00)$, $S[2].e = \text{Office}$ and $S[2].t = 10{:}00$. In particular, if $s_i$ ($1 \le i \le n$) is a set of events that happen simultaneously (i.e., $s_i \subset \Sigma$), the sequence is referred to as a complex streaming sequence. Otherwise, if every $s_i$ ($\forall i$, $1 \le i \le n$) is an individual event, it is a simple streaming sequence.

Definition 2 (Serial episode). A serial episode is a set of totally ordered events, denoted by $e = \langle \phi_1, \ldots, \phi_k \rangle$, where $\phi_i$ appears before $\phi_j$ if and only if $1 \le i < j \le k$. In particular, we denote by $|e| = k$ the length of $e$.

For instance, in the above sequence example, where $S = \langle (\text{Home},\ ), (\text{Office}, 10{:}00), (\text{Gym}, 15{:}00), (\text{Bar}, 20{:}00) \rangle$, $e_1 = \langle \text{Home}, \text{Office} \rangle$ and $e_2 = \langle \text{Home}, \text{Gym}, \text{Bar} \rangle$ are both serial episodes; the length of $e_1$ (i.e., $|e_1|$) is 2 and that of $e_2$ (i.e., $|e_2|$) is 3.

Definition 3 (Occurrence).
Given a serial episode $e = \langle \phi_1, \ldots, \phi_k \rangle$, the timestamp sequence $\langle t_1, \ldots, t_k \rangle$ is defined as an occurrence of $e$ if $\phi_i$ happens at timestamp $t_i$. We denote by $Occ(e, S)$ an occurrence of serial episode $e$ in $S$.

Definition 4 (Minimal occurrence). Given a serial episode $e = \langle \phi_1, \ldots, \phi_k \rangle$ and its occurrence $Occ(e, S)$, namely $\langle t_1, \ldots, t_k \rangle$, if there is no other occurrence of $e$, say $\langle t'_1, \ldots, t'_k \rangle$, such that $t'_1 \ge t_1$ and $t'_k \le t_k$, then $Occ(e, S)$ is called a minimal occurrence of $e$ in $S$, denoted as $Occ_{OPT}(e, S)$.

Definition 5 (Time-constrained serial episode). A serial episode with time constraint $\tau$ is denoted as $e^{\tau} = \langle \phi_1, \phi_2, \ldots, \phi_k \rangle$, where each occurrence of $e^{\tau}$ falls in a specified time period $\tau$ (e.g., daily/weekly/monthly), that is, $|[t_1, t_k]| \le \tau$ (i.e., $t_k - t_1 \le \tau$). Usually, $e$ is used to represent a serial episode without time constraint.

Example 1.
Given the following sequences,
$S_1 = \langle (A,\ ), (B,\ ), (A,\ ), (B,\ ), (C,\ ), (B,\ ) \rangle$,
$S_2 = \langle (B,\ ), (B,\ ), (A,\ ), (B,\ ), (A,\ ), (C,\ ) \rangle$,
$S_3 = \langle (B,\ ), (A,\ ), (B,\ ), (A,\ ), (C,\ ) \rangle$,
serial episode $e = \langle B, A, B \rangle$ and time-constrained serial episode $e^{\tau} = \langle B, A, B \rangle$, we illustrate the occurrences of both $e$ and $e^{\tau}$ in all of $S_1$, $S_2$ and $S_3$.

According to Definition 3, we can easily obtain $Occ(e, S) = \langle\ ,\ ,\ \rangle$, $Occ(e, S) = \langle\ ,\ ,\ \rangle$, where $Occ_{OPT}(e, S) = Occ(e, S)$ is a minimal occurrence. Similarly, according to Definition 3, it is easy to find that $Occ(e, S) = \langle\ ,\ ,\ \rangle$, $Occ(e, S) = \langle\ ,\ ,\ \rangle$, $Occ(e, S) = \langle\ ,\ ,\ \rangle$. Moreover, the occurrences of $e^{\tau}$ in the above sequences are as follows: $Occ(e^{\tau}, S) = \langle\ ,\ ,\ \rangle$, $Occ(e^{\tau}, S) = \langle\ ,\ ,\ \rangle$, $Occ(e^{\tau}, S) = \emptyset$.

Note that the events constituting an occurrence of a serial episode are not required to be contiguous in the stream.

As reviewed in Section 2, a number of different frequency definitions have been proposed to capture how often an episode occurs in an event sequence. We observe that existing frequency definitions can be grouped into two categories: definitions incurring dependent occurrences (e.g., two occurrences of an episode may share common events) and definitions incurring independent occurrences. Due to space constraints, this paper focuses only on the frequency definitions incurring independent occurrences, of which there are two: the non-overlapped frequency [6], [20] and the distinct frequency [22]. We review the definitions of the two frequency measures as follows.

Definition 6 (Non-overlapped frequency).
In an event stream $S$, two occurrences of $e$ (resp., $e^{\tau}$), i.e., $\langle t_1, \ldots, t_k \rangle$ and $\langle t'_1, \ldots, t'_k \rangle$, are non-overlapped if either $t'_1 > t_k$ or $t_1 > t'_k$. The non-overlapped frequency of $e$ (resp., $e^{\tau}$) in $S$ is denoted as $freq(e, S)$ (resp., $freq(e^{\tau}, S)$).

Definition 7 (Distinct frequency). In an event stream $S$, two occurrences of $e = \langle \phi_1, \ldots, \phi_k \rangle$ (resp., $e^{\tau}$), i.e., $\langle t_1, \ldots, t_k \rangle$ and $\langle t'_1, \ldots, t'_k \rangle$, are distinct if they do not share any event, that is, $\forall\, 1 \le i < j \le k$, if $\phi_i = \phi_j$ then $t_i \ne t'_j$. The distinct frequency of $e$ (resp., $e^{\tau}$) in $S$ is denoted as $freq^{+}(e, S)$ (resp., $freq^{+}(e^{\tau}, S)$).

Example 2. Given the sequence $S = \langle (A,\ ), (A,\ ), (A,\ ), (A,\ ), (B,\ ), (B,\ ) \rangle$ and the time-constrained serial episode $e^{\tau} = \langle A, A, B \rangle$, we can easily obtain that $Occ(e^{\tau}, S) = \langle\ ,\ ,\ \rangle$, $Occ(e^{\tau}, S) = \langle\ ,\ ,\ \rangle$, $\ldots$, $Occ(e^{\tau}, S) = \langle\ ,\ ,\ \rangle$, $\ldots$, $Occ(e^{\tau}, S) = \langle\ ,\ ,\ \rangle$ are occurrences of $e^{\tau}$ in $S$. $\langle\ ,\ ,\ \rangle$ is a minimal occurrence of $e^{\tau} = \langle A, A, B \rangle$ in $S$, and $\langle\ ,\ ,\ \rangle$ is obviously another minimal occurrence. However, they overlap with each other. Thus, $freq(e^{\tau}, S)$ is  . On the other hand, $\langle\ ,\ ,\ \rangle$ and $\langle\ ,\ ,\ \rangle$ are distinct occurrences because they do not have the same timestamp $t_i$ and $\ <\ <\ <\ $. Thus, $freq^{+}(e^{\tau}, S)$ is 2.

Given an event stream and the serial episode whose frequency is to be extracted, we aim to identify the frequency of the time-constrained serial episode in the long stream.
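The definitions above can be made concrete with a small brute-force sketch. It is meant only as an executable reading of Definitions 3 and 5 to 7, not as the counting algorithm of this paper: it enumerates every occurrence and is exponential in the episode length. The function names and the representation of a stream as a list of (event, timestamp) pairs are assumptions introduced for illustration, and a simple streaming sequence is assumed, so that a timestamp identifies a single event.

```python
from itertools import combinations


def occurrences(episode, stream, tau=None):
    """Enumerate all occurrences of `episode` in `stream` as timestamp tuples.

    stream: list of (event, timestamp) pairs with strictly increasing timestamps.
    tau:    optional time constraint; if given, only occurrences with
            t_k - t_1 <= tau are kept (Definition 5).
    """
    k = len(episode)
    found = []
    # combinations() yields index tuples in increasing order, so the
    # selected events automatically respect the total order of the episode.
    for idx in combinations(range(len(stream)), k):
        if all(stream[i][0] == phi for i, phi in zip(idx, episode)):
            ts = tuple(stream[i][1] for i in idx)
            if tau is None or ts[-1] - ts[0] <= tau:
                found.append(ts)
    return found


def non_overlapped(o1, o2):
    """Definition 6: one occurrence starts strictly after the other ends."""
    return o2[0] > o1[-1] or o1[0] > o2[-1]


def distinct(o1, o2):
    """Definition 7, simplified for simple sequences: no shared timestamp."""
    return not set(o1) & set(o2)
```

As a usage sketch with made-up timestamps 1 to 6 (the timestamps are hypothetical), take the stream A, A, A, A, B, B and the episode ⟨A, A, B⟩. The occurrences (1, 2, 5) and (3, 4, 6) share no event, so `distinct` holds for them, but `non_overlapped` does not because the second one starts before the first one ends.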
Definition 8 (Time-constrained frequency counting problem).
Given that the events $S[i]$ of a stream $S$ arrive one after another and a time-constrained serial episode $e^{\tau}$, the time-constrained frequency counting problem aims to evaluate $freq(e^{\tau}, S(i))$ whenever a new event $S[i]$ arrives.

Footnote: The stream can be a simple streaming sequence or a complex one; our model works on both of them.

The work most related to the aforementioned problem is serial episode frequency mining in long sequences, among which the most representative is [6], an effective solution for mining frequent serial episodes from an arbitrarily long event sequence. This approach utilizes a group of automata, each of which corresponds to a particular candidate serial episode. The approach works by sequentially scanning every event within the target sequence $S$. Each time an event is observed, every automaton that is waiting for this event (i.e., whose next state matches the event) is updated. Whenever an automaton reaches its end state, its count increases by 1 and the automaton is reset to the start state.

Fig. 1: State-of-the-art automaton-based serial episode mining scheme [best viewed in color] (numbers in brackets are the corresponding counts).

For instance, Figure 1 shows an example where the target sequence is $S = \langle (A,\ ), (B,\ ), (A,\ ), (A,\ ), (B,\ ) \rangle$. Suppose we are interested in the frequency of the following serial episodes: $e_1 = \langle A, A \rangle$, $e_2 = \langle A, B \rangle$, $e_3 = \langle B, A \rangle$, $e_4 = \langle B, B \rangle$. Given the four candidates, four finite state automata $M_1, M_2, M_3, M_4$ are built (Figure 1(1)), where $\Sigma = \{A, B\}$ and the states of each automaton sequentially correspond to the events in its episode. Then $S$ is scanned sequentially. Event $A$ is observed first; both $M_1$ and $M_2$, whose next states are $A$, move to the next state (Figure 1(2)), while the other two automata remain unchanged. Afterwards, $B$ is observed; this time $M_2$, $M_3$ and $M_4$ all move to the next state (Figure 1(3)). As $M_2$ reaches its final state, its count increases and it is reset to the initial state.

The aforementioned scheme [6] has been shown to be effective and efficient in mining the non-overlapped frequency of given serial episodes from a long sequence. However, it does not take into account the time constraint for each episode, hence it cannot be applied in our scenario described in Section 3. Can we adjust it with limited variation to address our problem? The answer is no.
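The automaton scheme described above can be sketched in a few lines. This is a minimal illustration of the prose description (one automaton per candidate episode, advance on a matching event, count and reset at the end state), not the implementation of [6]; the function name and the dictionary-based bookkeeping are assumptions made for this example, and timestamps are ignored because the scheme has no time constraint.

```python
def count_non_overlapped(episodes, events):
    """Count non-overlapped occurrences with one automaton per episode.

    episodes: list of tuples of events, e.g. ("A", "B")
    events:   iterable of events in arrival order (timestamps are not used)
    """
    # state[e] is the index of the event that the automaton of e waits for.
    state = {e: 0 for e in episodes}
    count = {e: 0 for e in episodes}
    for ev in events:
        for e in episodes:
            if e[state[e]] == ev:
                state[e] += 1
                if state[e] == len(e):
                    # End state reached: count the occurrence and reset.
                    count[e] += 1
                    state[e] = 0
    return count
```

For the event order A, B, A, A, B and the four candidates ⟨A, A⟩, ⟨A, B⟩, ⟨B, A⟩, ⟨B, B⟩, this sketch returns the counts 1, 2, 1 and 1, respectively. Each event advances an automaton by at most one state, which is what keeps the counted occurrences non-overlapped.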
Although the authors of [6], [20] suggested that simply attaching the time constraint to each automaton can solve the problem of mining episodes with a given time constraint, they did not put it into practice, even when they mentioned the same method again several years later [18]. In fact, the suggested method may lead to inaccurate results in the time-constrained case. The following example shows that simply adding a time constraint to each automaton, as suggested by [6], [18], [20], may miscount the frequency of target serial episodes.

Suppose we add a time constraint to each automaton in [6], [7] by introducing another column, namely start_time, to store the timestamp of the first valid state change. The count of an automaton increases only when the state changes to the final one and the time span between the final state and start_time is within $\tau$.

In this way, only those instances satisfying the time constraint are counted. It seems to be a valid solution to our problem. However, this solution may miss many valid occurrences, hence it cannot satisfy Definition 8. For instance, if we follow the above adjusted solution, the count of $e$ will be   (as $M$ will be activated by $\langle (A,\ ), (A,\ ) \rangle$, which finally fails to satisfy the time-constraint check). However, there in fact exists an instance of $e$, namely $\langle\ ,\ \rangle$. Similarly, the above solution can find   minimum occurrence of $e$ (i.e., $\langle\ ,\ \rangle$). However, there exist two minimum occurrences of $e$, namely $\langle\ ,\ \rangle$ and $\langle\ ,\ \rangle$. Such problems become much more complex and difficult to address with automata, especially when $e^{\tau}$ contains many repeated events, e.g., $e^{\tau} = \langle A, A, A \rangle$.

Therefore, it is not a trivial task to design a model that counts the non-overlapped frequency of serial episodes within a long streaming sequence while taking into account the time span of the serial episode. To address this challenge, we develop a novel approach in the next section.

4 ALGORITHM
In this section, we present in detail the algorithm for counting serial episodes in a streaming sequence under an arbitrary time constraint. Given a streaming event sequence and a target serial episode with an arbitrary time constraint, the ONCE algorithm generally works as follows. As each event in the stream passes by, we first find the latest minimum occurrence of the target serial episode, regardless of whether it satisfies the given time constraint. To achieve that, we present a carefully constructed data structure, namely OccMap, which stores the timestamps of the events that constitute the target serial episode. Whenever all events have been found in OccMap, we validate the candidate minimum occurrence by testing whether it satisfies the time constraint. If the test succeeds, we increase the count by 1. Afterwards, the tested occurrence and the unused timestamps of events are removed from OccMap. As a result, ONCE can output the frequency of the target time-constrained serial episode whenever requested. Notably, although the following discussion focuses on counting the frequency of a single time-constrained serial episode, ONCE can in fact count the frequencies of a group of target time-constrained serial episodes simultaneously, because the proposed structure and the corresponding operations are bound to each target episode independently. Moreover, ONCE is a one-pass algorithm and is therefore applicable to streaming sequences.
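Before turning to the data structure, the pitfall that motivates it (the automaton of Section 3 extended with a start_time column) can be reproduced with a short sketch. The code below is one possible reading of that adjusted scheme, under the assumption that the automaton is reset whenever it reaches its final state, whether or not the time check passes; the function name and the timestamps used afterwards are hypothetical.

```python
def naive_timed_count(episode, stream, tau):
    """Single automaton with a start_time column (the adjusted baseline).

    episode: tuple of events, e.g. ("A", "A")
    stream:  list of (event, timestamp) pairs in arrival order
    tau:     time constraint on t_k - t_1
    """
    state, start_time, count = 0, None, 0
    for ev, t in stream:
        if ev != episode[state]:
            continue  # the automaton keeps waiting for its next event
        if state == 0:
            start_time = t  # timestamp of the first valid state change
        state += 1
        if state == len(episode):
            if t - start_time <= tau:
                count += 1
            # Assumption: reset even if the time check failed.
            state, start_time = 0, None
    return count
```

With the made-up stream (A, 1), (B, 2), (A, 5), (A, 6), (B, 7), the episode ⟨A, A⟩ and τ = 2, the automaton pairs the events at timestamps 1 and 5, fails the time check and resets, so it returns 0. The pair at timestamps 5 and 6 spans only 1 time unit and is therefore a valid occurrence that this scheme misses.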
First of all, we present the data structure, namely OccMap, which stores the timestamps of the events in the target serial episode. It is further used to extract the candidate minimum occurrence of the serial episode.

An OccMap for a time-constrained serial episode $e^{\tau}$ is defined as a group of hierarchical lists. In particular, given $e^{\tau} = \langle \phi_1, \ldots, \phi_k \rangle$, the OccMap for $e^{\tau}$ contains $k$ lists, which are organized hierarchically into $k$ layers. The $k$ layers correspond to the $k$ signals of $e^{\tau}$. Each layer is an individual list that records the timestamps of the corresponding signal in the stream. For instance, Figure 2(15+) shows an OccMap that corresponds to $e^{\tau} = \langle A, A, B \rangle$. It consists of three hierarchically organized lists, each of which is associated with a single signal in the serial episode. Notably, if the same signal appears several times in a serial episode, e.g., $A$ in $e^{\tau}$, we assign a list to each appearance independently.

To facilitate the following discussion, we denote by $OM(e^{\tau}) = [L_1, \ldots, L_k]$ the OccMap for the time-constrained serial episode $e^{\tau} = \langle \phi_1, \ldots, \phi_k \rangle$. In particular, $OM(e^{\tau})[i] = L_i$, where $L_i = \langle i_1, \ldots, i_j \rangle$ and $L_i.e = \phi_i$; $i_j$ is a timestamp of $\phi_i$ in stream $S$, and $\phi_i$ is the signal corresponding to $L_i$.

When processing the stream, OccMap has to perform the following operations: list update, occurrence validation and invalid entries elimination. In the following, we describe in detail how each of the operations is performed in OccMap. To illustrate each operation clearly, we use the following sequence $S$ as the running example.

Example 3.
Consider the following event stream $S$: $\langle (A,\ ), (A,\ ), (B,\ ), (A,\ ), (C,\ ), (A,\ ), (A,\ ), (B,\ ), (A,\ ), (C,\ ), (C,\ ), (A,\ ), (B,\ ), (A, B,\ ) \rangle$.

List update.
Given an OccMap $OM(e^{\tau})$ that corresponds to $e^{\tau}$, we perform list update by scanning every signal in the streaming sequence $S$ as it passes by.

At the very beginning, $OM(e^{\tau})$ is initialized as $k$ layered empty lists. Besides the layered lists, we denote by $OM(e^{\tau}).\ell$ the most recent active layer number, which is initialized as 1. For each signal $S[j]$ passing by, $OM(e^{\tau})$ checks whether it matches any signal whose corresponding list is active. Suppose $S[j].e = \phi_i$ and $i \le OM(e^{\tau}).\ell$; then we append the timestamp $S[j].t$ to the end of list $L_i$. During the update, only the layers $L_i$ (i.e., $OM(e^{\tau})[i]$) with $i \le OM(e^{\tau}).\ell$ can be updated. In other words, a new layer can be updated (i.e., a timestamp can be appended to the corresponding list) only when the layers before it are not empty. This guarantees the following straightforward property of OccMap.

Property 1 (Minimum monotonicity). If an OccMap is updated according to the list update strategy, it satisfies $\min L_1 < \min L_2 < \cdots < \min L_{OM(e^{\tau}).\ell - 1}$.

For example, given the target episode $e = \langle A, A, B \rangle$ and the sequence $S$ shown in Example 3, an OccMap $OM(e)$ is built, which contains three empty lists $L_1$ ($L_1.e = A$), $L_2$ ($L_2.e = A$) and $L_3$ ($L_3.e = B$). Therefore, whenever a new event of $S$ arrives, only two types of events, $\langle A, t_i \rangle$ and $\langle B, t_j \rangle$, are accepted and stored in the lists.
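The list update strategy can be sketched as a small class. This is a minimal reading of the description above and covers list update only; occurrence validation and invalid entries elimination are deliberately left out. The class and attribute names are assumptions made for illustration. One detail is made explicit in the sketch: the set of active layers is fixed before an event is processed, so a layer that becomes active because of an event does not receive that same event.

```python
class OccMap:
    """Layered timestamp lists for one serial episode (list update only)."""

    def __init__(self, episode):
        self.episode = tuple(episode)
        # layers[i - 1] plays the role of L_i in the text.
        self.layers = [[] for _ in self.episode]
        # Most recent active layer number (1-based), initialized as 1.
        self.ell = 1

    def update(self, event, t):
        """Append t to every active layer whose signal equals `event`.

        Returns True when the bottom layer L_k is non-empty afterwards,
        i.e. when ell == k + 1.
        """
        k = len(self.episode)
        active = min(self.ell, k)  # frozen before this event is applied
        for i in range(1, active + 1):
            if self.episode[i - 1] == event:
                self.layers[i - 1].append(t)
        # Activate the next layer once the current one is non-empty.
        while self.ell <= k and self.layers[self.ell - 1]:
            self.ell += 1
        return self.ell == k + 1
```

Feeding the first three events of Example 3 (A, A, B) with made-up timestamps 1, 2, 3 to `OccMap("AAB")` leaves the layers as [1, 2], [2] and [3] with `ell` equal to 4, so the layer minima 1 < 2 < 3 agree with Property 1.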
Notably, aswe have mentioned above, if a signal appears many times (cid:1)(cid:2)(cid:1) (cid:1)(cid:2)(cid:1)(cid:3) (cid:3) (cid:4)(cid:4) (cid:1)(cid:2)(cid:1) (cid:4) (cid:1)(cid:2)(cid:3)(cid:1)(cid:4)(cid:3) (cid:1)(cid:5)(cid:3) (cid:5)(cid:6)(cid:7) (cid:1)(cid:2)(cid:1) (cid:8) (cid:1)(cid:6)(cid:3) (cid:1)(cid:2)(cid:1) (cid:8) (cid:1)(cid:7)(cid:3) (cid:9)(cid:9) (cid:1)(cid:2)(cid:1) (cid:8) (cid:1)(cid:8)(cid:3) (cid:9)(cid:9) (cid:10)(cid:10) (cid:1)(cid:2)(cid:1) (cid:8) (cid:1)(cid:9)(cid:3) (cid:11)(cid:9) (cid:10)(cid:12)(cid:13)(cid:1)(cid:2)(cid:1) (cid:14) (cid:1)(cid:10)(cid:3) (cid:1)(cid:2)(cid:1) (cid:14) (cid:1)(cid:4)(cid:5)(cid:3) (cid:3)(cid:4)(cid:3)(cid:4) (cid:1)(cid:2)(cid:1) (cid:15) (cid:1)(cid:4)(cid:2)(cid:3) (cid:3)(cid:4)(cid:16)(cid:17) (cid:1)(cid:2)(cid:1) (cid:16) (cid:1)(cid:4)(cid:11)(cid:3) (cid:3)(cid:8)(cid:18)(cid:19)(cid:1)(cid:2)(cid:1) (cid:3)(cid:4) (cid:1)(cid:4)(cid:6)(cid:3) (cid:3)(cid:8)(cid:3)(cid:8)(cid:1)(cid:2)(cid:1) (cid:1)(cid:9) (cid:1) (cid:3) (cid:1)(cid:2)(cid:1) (cid:1)(cid:4)(cid:11) (cid:1) (cid:3) (cid:1)(cid:2)(cid:1) (cid:1)(cid:2)(cid:12)(cid:3) (cid:1)(cid:2)(cid:1) (cid:3)(cid:4) (cid:1)(cid:4)(cid:2)(cid:12)(cid:3) Fig. 2: Mining the non-overlapped frequency of e = h A, A, B i in the sequence of Example 3 [best viewed in color]. (circlednumber in green: occurrences that satisfy the time constraint; circled number in red: occurrences that fail to satisfy timeconstraint; circled and underlined numbers: the timestamps to be removed)within a target serial episode, we assign an independent listfor each appearance, i.e., L and L . Suppose we are startingfrom the very beginning of S , each time a new event arrivesand an old one leaves. At the very beginning, all the lists L , L , L are initialized as empty. Besides, OM ( e τ ) .ℓ is setto . Now the first event of S ( i.e., ( A, ) comes, OccMapfinds that A is a valid event that should be taken intoaccount. 
Then it tests whether the event matches any signal in the activated layers, which now contain only $L_1$. It matches the first layer, i.e., $S[1].e = L_1.e$, and $L_1$ is of course activated, i.e., $1 \le OM(e_\tau).\ell$. As the test succeeds, we update $OM(e)$ by appending $S[1].t$ to the end of $L_1$, which results in Figure 2(1). Moreover, as the first layer is no longer empty, the second layer becomes activated, i.e., $OM(e).\ell = 2$.

When the second event in $S$ arrives, we perform the same test as above. As $S[2].e = A = L_1.e = L_2.e$ and $OM(e).\ell \ge 2$, both $L_1$ and $L_2$ (they are both activated) should be updated. Therefore, $S[2].t$ is appended to the end of both $L_1$ and $L_2$, which results in the state shown in Figure 2(2). Similarly, as the second layer is no longer empty, the third layer becomes activated (i.e., $OM(e).\ell = 3$).

Afterward, when the third event, a $B$, arrives, we append its timestamp to $L_3$ and update $OM(e).\ell$ to 4. Now that the last layer in OccMap is not empty, we have to perform another action, occurrence validation, which is described in the following part.

Occurrence validation.
Whenever the bottom layer (i.e., $L_k$) is updated (i.e., appended with a timestamp), there exist groups of timestamps, one from each layered list, that construct a candidate minimum occurrence for the target serial episode $e$. Here, a candidate occurrence is a minimum occurrence of the general serial episode without taking the time constraint into account. In fact, according to the list update strategy, there is at least one candidate occurrence for the target serial episode once the last layer is updated.

Theorem 1. Given that an OccMap $OM(e)$, where $|e| = k$, is updated according to the list update strategy, once $OM(e).\ell = k + 1$ (i.e., the last layered list is not empty), there are $\ge k$ entries in $OM(e)$, no two of which belong to the same layered list, that constitute an occurrence of $e$.

Proof 4.1.
As the last list is not empty, $OM(e).\ell = k + 1$. According to Property 1, $\min L_1 < \min L_2 < \dots < \min L_k$. If we select the entries $\min L_1, \min L_2, \dots, \min L_k$ from $L_1, \dots, L_k$, respectively, these entries constitute an occurrence of $e$, as they satisfy the total order.

In fact, there may exist many groups of entries that can constitute an occurrence of $e$. Recall that we perform occurrence validation as soon as the last layered list becomes non-empty (i.e., is appended with a timestamp). In other words, $L_k$ contains only one entry, namely $L_k[1]$, when occurrence validation is performed. That is, all the occurrences share the same end timestamp $t_k$, so at most one of them can affect the frequency according to Definition 6. Therefore, we have to find the optimal occurrence, the one most likely to satisfy the time constraint, to validate. Intuitively, the optimal occurrence of $e$ should have the minimum time span (i.e., $t_k - t_1$), which is in fact the minimum occurrence of Definition 4. As all the occurrences of $e$ share the same $t_k$, the minimum occurrence $Occ_{OPT}(e, S)$ that is most likely to satisfy $\tau$ should have the latest $t_1$. Therefore, to find $Occ_{OPT}(e, S)$, we have to find the latest $t_1$ in $L_1$.

In order to find $Occ_{OPT}(e, S)$ (i.e., $[t_s, t_e]$), we traverse the OccMap in a bottom-up way. In particular, given the end stamp $t_e = L_k[1]$, we greedily find in $L_{k-1}$ the latest entry that appears before $t_e$, that is, $\max L_{k-1}[j]$ subject to $L_{k-1}[j] < t_e$. Let $t_{k-1}$ be the selected entry in $L_{k-1}$; we then greedily select from $L_{k-2}$ the latest entry that appears before $t_{k-1}$, say $t_{k-2}$. Afterwards, we iteratively perform the same selection in the upper layers until $L_1$. In the end, we obtain $t_1, \dots, t_{k-1}$, which are greedily selected from $L_1, \dots, L_{k-1}$, respectively, and $t_1, \dots, t_{k-1}, t_e$ constitute an occurrence of $e$. Obviously, $t_1$ is the latest $t_s$. (Any other $t'_1 > t_1$ in $L_1$ cannot be $t_s$, as it is no smaller than $t_2$ according to our greedy selection strategy.)

So far, we have the minimum occurrence that is most likely to satisfy the time constraint $\tau$. Hence, we check whether its time span satisfies the time constraint by testing the inequality $t_e - t_s \le \tau$. If the test succeeds, we increase the frequency of the target time-constrained serial episode $e_\tau$ by 1. Notably, if the test fails, any other occurrence will also fail, as it has a smaller $t_s$ (i.e., a larger $t_e - t_s$).

For instance, given $e = \langle A, A, B \rangle$ and the sequence $S$ in Example 3, we have updated $OM(e)$ as $S[1]$, $S[2]$ and $S[3]$ passed by. When $S[3]$, a $B$, arrives, we append its timestamp to the end of $L_3$. Once the bottom layer $L_3$ is not empty, we perform occurrence validation as described before. In particular, we greedily find $Occ_{OPT}(e, S) = \langle t_1, t_2, t_3 \rangle$ as shown in Figure 2. Afterwards, we test whether its time span (i.e., $t_3 - t_1$) satisfies $\tau$. As the test succeeds (the corresponding entries in $OM(e)$ are circled and marked in green in Figure 2), we increase the frequency of $e$ by 1.

Invalid entries elimination.
Once occurrence validation is performed, we need to immediately eliminate invalid entries from OccMap. The elimination process depends on whether the time constraint test in occurrence validation succeeds or not.

If the minimum occurrence found by occurrence validation, say $Occ_{OPT}(e, S)$, satisfies the time constraint, i.e., $t_e - t_s \le \tau$, all the other entries left in $OM(e)$ become useless. The reason is that any other occurrence that includes any of these entries, which are smaller than $t_e$, definitely overlaps with $Occ_{OPT}(e, S)$, which violates Definition 6. Therefore, once occurrence validation succeeds, all the other entries in $OM(e)$ are invalid and are immediately removed from $OM(e)$. Besides, the active layer is reset to $L_1$.

As shown in Figure 2(3) and (3+), except for the occurrence of the episode tested (i.e., the circled green entries in Figure 2(3)), all the other entries (i.e., the underlined blue entry in Figure 2(3)) should be eliminated, which results in Figure 2(3+). The same operations can be found in Figure 2(8) and Figure 2(8+), as well as Figure 2(15) and Figure 2(15+). All the lists now become empty, thus $OM(e).\ell = 1$.

Otherwise, when the minimum occurrence $Occ_{OPT}(e, S)$ fails to satisfy the time constraint $\tau$, we also need to find the invalid entries to eliminate from $OM(e)$. In this case, however, the invalid entries are no longer all the remaining ones in $OM(e)$. Although the entries that constitute the minimum occurrence $Occ_{OPT}(e, S)$ fail to pass the time constraint test, the other entries may still be used to constitute another minimum occurrence. To show which of the remaining entries are useless for constituting further minimum occurrences, we present the following theorem.

Theorem 2.
Given that $OM(e_\tau)$ ($|e_\tau| = k$) is updated according to the list update strategy and $Occ_{OPT}(e, S)$ is found and validated by the occurrence validation process, if $Occ_{OPT}(e, S)$, which consists of $\langle t_1, \dots, t_k \rangle$ ($t_1 < \dots < t_k$), fails to satisfy the given time constraint $\tau$ (i.e., $t_k - t_1 > \tau$), then no other minimum occurrence $Occ'_{OPT}(e, S)$ with $\langle t'_1, \dots, t'_k \rangle$, where $t'_k > t_k$ and $\exists i < k$ such that $t'_i \le t_i$, can satisfy the time constraint either.

Proof 4.2.
Suppose $Occ'_{OPT}(e, S)$, which consists of $\langle t'_1, \dots, t'_k \rangle$, where $t'_k > t_k$ and $\exists i < k$ such that $t'_i \le t_i$, satisfies the time constraint $\tau$; then

$t'_k - t'_1 \le \tau$.    (1)

As we iteratively select the largest entry $L_{i-1}[j]$ subject to $L_{i-1}[j] < t_i$ according to the bottom-up minimum occurrence finding strategy presented above, we can find $L_{i-1}[j] = t_{i-1}$ and $L_{i-1}[j'] = t'_{i-1}$. If $t'_i \le t_i$, it is straightforward that $j' \le j$, thus $t'_{i-1} \le t_{i-1}$. Similarly, we can prove that $\forall j \le i$, $t'_j \le t_j$. Therefore, $t'_1 \le t_1$, which means $t_k - t_1 < t'_k - t'_1$ as $t'_k > t_k$. As $t_k - t_1 > \tau$, we have $t'_k - t'_1 > \tau$, which contradicts Equation 1. Hence, $Occ'_{OPT}(e, S)$ cannot satisfy the time constraint $\tau$.

Suppose $Occ_{OPT}(e, S)$, which consists of $[t_1, \dots, t_k]$ ($t_1 < \dots < t_k$), fails to satisfy the time constraint $\tau$. According to the above theorem, any entry $t'_i \le t_i$ in list $L_i$ cannot be used to generate an occurrence that passes the time constraint test. Therefore, if the minimum occurrence $Occ_{OPT}(e, S)$ fails to satisfy the time constraint $\tau$, we need to eliminate the entries $t'_i \le t_i$ in each layered list $L_i$ for all $1 \le i \le k$. Moreover, we need to eliminate some other entries in $L_i$ to guarantee that $\min L_1 < \min L_2 < \dots < \min L_k$, as the entries that violate this property are also useless for minimum occurrence extraction. Besides, the active layer is updated accordingly after the elimination.

As shown in Figure 2(13) and Figure 2(13+), the occurrence of the episode (i.e., the circled red entries in Figure 2(13)), say $\langle t_1, t_2, t_3 \rangle$, fails to pass the time constraint test as $t_3 - t_1 > \tau$. Then, in each $L_i$, we eliminate all the entries $t'_i \le t_i$. In $L_1$, we eliminate $t_1$ and any other entries before it. Similarly, in $L_2$ and $L_3$, we eliminate $t_2$ and $t_3$, respectively. The remaining entry in $L_1$ is left in $OM(e)$, and the other lists are all empty again. Therefore, we reset $OM(e).\ell = 2$.

With the help of all the operations above, an OccMap has the following interesting features.
• An OccMap corresponds to exactly one time-constrained serial episode.
• The last (bottom) layered list contains no more than one entry; there is at most one occurrence for the target time-constrained serial episode.
• The entries in all the layered lists satisfy $\min L_1 < \dots < \min L_k$. On the other hand, as the timestamps appended to the same list strictly follow the time sequence, $\min L_i = L_i[1]$. Therefore, the above property can be rewritten as $L_1[1] < \dots < L_k[1]$.
• Notably, to save memory, after each list update operation we additionally check the inserted timestamp, say $S[j].t$, against the first entry in $L_i$. If $S[j].t - L_i[1] > \tau$, $L_i[1]$ cannot be used to generate any minimum occurrence that satisfies the time constraint $\tau$. Therefore, in this case $L_i[1]$ is also eliminated during list update. In this way, the size of each list $L_i$ is in fact upper bounded by $\tau\rho$ if the events in the stream arrive with a constant speed $\rho$.
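The elimination performed after a failed time constraint test can be sketched as follows. The function name `eliminate_after_failure` and the example state are hypothetical. The sketch also makes one assumption that the text does not state explicitly: when a layer becomes empty, every layer below it is emptied as well, so that a layer is never populated while a layer before it is empty.

```python
from typing import List, Tuple


def eliminate_after_failure(
    layers: List[List[int]], chosen: List[int]
) -> Tuple[List[List[int]], int]:
    """Return the pruned layers and the new active layer (1-based).

    layers[i] is the sorted list L_{i+1}; chosen[i] is t_{i+1} of the
    occurrence that just failed the time constraint test.
    """
    # Step 1: in every layer L_i, drop the entries t' <= t_i.
    kept = [[x for x in layer if x > t_i] for layer, t_i in zip(layers, chosen)]
    # Step 2: restore min L_1 < min L_2 < ... among the remaining entries.
    for i in range(1, len(kept)):
        if not kept[i - 1]:
            kept[i] = []   # assumption: nothing may stay below an empty layer
        else:
            head = kept[i - 1][0]
            kept[i] = [x for x in kept[i] if x > head]
    # The active layer is the first empty list.
    active = next((i + 1 for i, l in enumerate(kept) if not l), len(kept) + 1)
    return kept, active


# Hypothetical state for e = <A, A, B>: the greedy occurrence is <5, 9, 20>,
# which fails for, e.g., tau = 10.
result = eliminate_after_failure([[1, 5, 9], [5, 9], [20]], [5, 9, 20])
print(result)   # ([[9], [], []], 2)
```

In this made-up state a single entry survives in $L_1$ and the active layer becomes 2, which mirrors the situation described for Figure 2(13+).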
Figure 2 shows the complete one-pass process of mining the non-overlapped frequency of the time-constrained serial episode $e = \langle A, A, B \rangle$ within $S$ shown in Example 3. Notably, all the other events that do not appear in $e$ (i.e., $C$) are ignored in the process. Each successful time constraint test over the minimum occurrences is marked in green, and the unsuccessful ones are marked in red. It is easy to see from the figure that the non-overlapped frequency of $e$ in $S$ is 3, i.e., the number of successful time constraint tests. The detailed algorithms are shown in Algorithms 1, 2 and 3.

Algorithm 1: ONCE algorithm
Require: Streaming sequence $S = \langle (s_1, t_1), (s_2, t_2), \dots, (s_n, t_n), \dots \rangle$, target time-constrained serial episode $e_\tau = \langle \phi_1, \phi_2, \dots, \phi_k \rangle$
Ensure: $freq(e_\tau, S)$: the non-overlapped frequency of $e_\tau$ in $S$
 1: Initialize $OM(e_\tau)$ for $e$ with $k$ empty lists $L_1, \dots, L_k$ that are hierarchically organized
 2: for each event $S[i]$ in $S$ arrives do
 3:   ListUpdate($OM(e_\tau)$, $S[i]$)   /* Algorithm 2 */
 4:   if the bottom layer is not empty then
 5:     flag ← Validate&Eliminate($OM(e_\tau)$)   /* Algorithm 3 */
 6:     if flag is TRUE then
 7:       $freq(e_\tau, S)$++
 8:     end if
 9:   end if
10: end for
11: return $freq(e_\tau, S)$

In Algorithm 1, given the streaming sequence $S$ and the target time-constrained serial episode $e_\tau$ as input, ONCE first initializes an OccMap for $e_\tau$ (Line 1). As each event in the stream arrives, ONCE first performs list update (i.e., Algorithm 2) based on the event (Lines 2-3). When the last layer in OccMap is not empty, we perform occurrence validation to find the minimum occurrence and test whether it satisfies $\tau$. Invalid entries elimination is performed immediately after that (Lines 4-5 and Algorithm 3). If the time constraint test succeeds, the frequency of $e_\tau$ is increased by 1 (Lines 6-7). Finally, the frequency is returned (Line 11).

Algorithm 2 works as follows. Given the OccMap $OM(e_\tau)$ to be updated and the event $S[i]$, we check every activated list $L_j$ that waits for update. If $S[i].e$ matches the event corresponding to $L_j$, we append $S[i].t$ to the end of $L_j$ (Lines 2-3). Afterwards, we perform a local check in $L_j$ in order to eliminate out-of-date entries (i.e., old entries that cannot constitute a minimum occurrence that satisfies $\tau$) from $L_j$ (Lines 4-6). Finally, we update the active layer to the next empty list (Line 9).
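The local check in Lines 4-6 of Algorithm 2 can be illustrated with a short Python sketch. The helper name `append_with_eviction`, the use of `collections.deque` and the unit-rate timestamps are assumptions made for illustration only.

```python
from collections import deque
from typing import Deque, List


def append_with_eviction(layer: Deque[int], t: int, tau: int) -> None:
    """Append t to one layered list, then drop an out-of-date head entry."""
    layer.append(t)
    # Mirrors Lines 4-5 of Algorithm 2: the new timestamp is compared with
    # the first entry only, and at most one entry is removed per append.
    if t - layer[0] > tau:
        layer.popleft()


# Hypothetical layer that receives one matching event per time unit.
layer: Deque[int] = deque()
sizes: List[int] = []
for ts in range(10):
    append_with_eviction(layer, ts, tau=3)
    sizes.append(len(layer))

print(list(layer), max(sizes))   # [6, 7, 8, 9] 4
```

With one arrival per time unit the list never holds more than a window of about $\tau$ time units, which is the intuition behind the size bound used in the complexity analysis.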
Obviously, the time complexity of Algorithm 2 is $O(k)$, where $k$ is the length of $e_\tau$.

Algorithm 2: ListUpdate algorithm
Require: OccMap $OM(e_\tau)$ and the next event in the stream sequence $S[i]$
Ensure: updated OccMap $OM(e_\tau)$
 1: for all activated lists $L_j$ ($j \le OM(e_\tau).\ell$) and $S[i].e \in e_\tau$ do
 2:   if $S[i].e = L_j.e$ then
 3:     append $S[i].t$ to the end of $L_j$
 4:     if $S[i].t - L_j[1] > \tau$ then
 5:       remove $L_j[1]$ from $L_j$
 6:     end if
 7:   end if
 8: end for
 9: Update $OM(e_\tau).\ell$ to the next empty list
10: return $OM(e_\tau)$

In Algorithm 3, we first extract the minimum occurrence from $OM(e_\tau)$ (Lines 1-5) and then eliminate invalid entries (Lines 6-19). To extract the minimum occurrence, we first set the right bound of the occurrence interval, $t_k$, as $L_k[1]$, which is the only entry in $L_k$ (Line 1). Afterwards, we iteratively find in each upper layer $t_i$ as the latest timestamp that appears before $t_{i+1}$. This process continues until $t_1$ (Lines 2-5). Therefore, $[t_1, \dots, t_k]$ constitutes a minimum occurrence for $e_\tau$. The time complexity of this process is $O(k \log |L|)$. (In the implementation, as each $t_i$ is always located near the end of $L_i$, we alternatively use linear search from the end of $L_i$ in Line 3, whose average time is better than that of binary search in practice. A similar strategy also applies to Line 15.)

The length of each list in $OM(e_\tau)$ is in fact upper bounded by $\tau\rho$ if the events in the stream arrive with a constant speed $\rho$. Let $m$ denote the number of entries in $L_j$. According to Lines 4 and 5 of Algorithm 2, we know that $L_j[m].t - L_j[1].t < \tau$. Considering that $\rho$ is a positive number, we have $\rho (L_j[m].t - L_j[1].t) < \tau\rho$. Moreover, considering the ListUpdate process, we have $m \le \rho (L_j[m].t - L_j[1].t)$. From the two inequalities above, we can conclude that $m < \tau\rho$, which proves that the length of $L_j$ in $OM(e_\tau)$ is upper bounded by $\tau\rho$. The time complexity of the extraction is therefore in fact $O(k \log \tau)$ if $\rho$ is fixed.

Afterwards, we test whether the extracted minimum occurrence satisfies the time constraint. If the test succeeds, we eliminate all entries from the OccMap and return (Lines 6-9). Otherwise, we eliminate all entries in each list $L_i$ subject to $L_i[j] \le t_i$ (Lines 11-13). Besides, we also need to make sure that $L_1[1] < \dots < L_{k-1}[1]$ by eliminating some other entries, as those entries are also useless in constituting further minimum occurrences (Lines 14-16). Finally, we update the active layer as the first empty layered list (Lines 17-18). It is easy to see that the complexity of elimination is also $O(k \log \tau)$, as searching a list for an entry (Lines 3 and 15) takes $O(\log \tau)$ using binary search. Therefore, the complexity of Algorithm 3 is $O(k \log \tau)$.

In all, the time complexity for ONCE (Algorithm 1) to process a single event is $O(k \log \tau)$. Taking into account the number of events in the stream $S$, the time complexity of processing $n$ events is altogether $O(nk \log \tau)$.
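The bottom-up extraction and the time constraint test (Lines 1-6 of Algorithm 3) can be sketched with Python's standard `bisect` module, which gives the $O(\log |L_i|)$ lookup per layer. The function names `find_min_occurrence` and `passes` and the sample layers are hypothetical, and returning `None` when a layer holds no earlier entry is a defensive assumption added here rather than a step of Algorithm 3.

```python
from bisect import bisect_left
from typing import List, Optional


def find_min_occurrence(layers: List[List[int]]) -> Optional[List[int]]:
    """Greedy bottom-up search over sorted layers; the bottom layer is non-empty."""
    bound = layers[-1][0]                 # t_k = L_k[1], the only bottom entry
    chosen = [bound]
    for layer in reversed(layers[:-1]):   # L_{k-1}, ..., L_1
        j = bisect_left(layer, bound)     # number of entries strictly before bound
        if j == 0:
            return None                   # defensive: no earlier entry in this layer
        bound = layer[j - 1]              # latest entry that appears before bound
        chosen.append(bound)
    chosen.reverse()
    return chosen                         # [t_1, ..., t_k]


def passes(occurrence: List[int], tau: int) -> bool:
    """Time constraint test t_k - t_1 <= tau."""
    return occurrence[-1] - occurrence[0] <= tau


occ = find_min_occurrence([[1, 5, 9], [5, 9], [20]])
print(occ, passes(occ, 10), passes(occ, 15))   # [5, 9, 20] False True
```

Replacing `bisect_left` by a backward linear scan from the end of each list would correspond to the implementation note above about linear search.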
Algorithm 3: Validate&Eliminate algorithm
Require: OccMap $OM(e_\tau)$
Ensure: flag indicating whether the occurrence passes the time constraint test
 1: $t_k \leftarrow L_k[1]$
 2: for $i$ in $k - 1$ to 1 do
 3:   find the maximum $j$ subject to $L_i[j] < t_{i+1}$
 4:   $t_i \leftarrow L_i[j]$
 5: end for
 6: if $t_k - t_1 \le \tau$ then
 7:   remove all entries from $OM(e_\tau)$
 8:   reset $OM(e_\tau).\ell$ as 1
 9:   return TRUE
10: else
11:   for all lists $L_i$ in $OM(e_\tau)$ do
12:     remove from $L_i$ all entries $L_i[j] \le t_i$
13:   end for
14:   for $i$ in 2 to $k - 1$ do
15:     remove from $L_i$ all entries $L_i[j] \le L_{i-1}[1]$
16:   end for
17:   reset $OM(e_\tau).\ell$ to $\max_{|L_i| > 0} i + 1$
18:   return FALSE
19: end if

In this part, we discuss the correctness of the ONCE algorithm. In particular, we need to show that ONCE can correctly answer the problem in Definition 8, namely counting the non-overlapped frequency $freq(e_\tau, S)$ of a given time-constrained serial episode $e_\tau$ as the events in the streaming sequence $S$ pass by. To this end, we present a pair of lemmas below, based on which we can finally prove the correctness of ONCE.

LEMMA 1. Suppose the frequency of $e_\tau$ in $S$ returned by ONCE is denoted by $freq'(e_\tau, S)$ and the ground truth is denoted by $freq(e_\tau, S)$; then $freq'(e_\tau, S) \le freq(e_\tau, S)$.

Proof 4.3. Given in Appendix A.

LEMMA 2. Suppose the frequency of $e_\tau$ in $S$ returned by ONCE is denoted by $freq'(e_\tau, S)$ and the ground truth is denoted by $freq(e_\tau, S)$; then $freq'(e_\tau, S) \ge freq(e_\tau, S)$.

Proof 4.4. Given in Appendix B.

Theorem 3. The ONCE algorithm (Algorithms 1, 2 and 3) can correctly answer the time-constrained frequency counting problem in Definition 8.

Proof 4.5. Directly follows from Lemmas 1 and 2.
In the previous section we presented the ONCE algorithm, which computes the non-overlapped frequency of a target time-constrained serial episode. In fact, it can also be adapted to compute the distinct frequency with a series of modifications. In this part, we propose the modified version, namely ONCE+, to compute the distinct frequency of time-constrained serial episodes.

ONCE+ also utilizes OccMap to store the timestamps of events in the target serial episode; the only difference is in the Validate&Eliminate algorithm. In order to distinguish ONCE and ONCE+, we name the modified algorithm Validate&Eliminate+. The detailed algorithm is shown in Algorithm 4.

Algorithm 4: Validate&Eliminate+ algorithm
Require: OccMap $OM(e_\tau)$
Ensure: flag indicating whether the occurrence passes the time constraint test
 1: if find the minimum $i$ subject to $L_k[1] - L_1[i] < \tau$ then
 2:   $t_1 \leftarrow L_1[i]$
 3:   $p = 0$
 4:   for $i$ in 2 to $k$ do
 5:     if find the minimum $j$ subject to $L_i[j] > t_{i-1}$ then
 6:       $t_i \leftarrow L_i[j]$
 7:       $p$++
 8:     end if
 9:   end for
10:   if $p = k - 1$ then
11:     for all lists $L_i$ in $OM(e_\tau)$ do
12:       remove from $L_i$ all entries $L_i[j] \le t_i$
13:       for $y$ in 1 to $k$ do
14:         if find $y$ subject to $L_i[j] = t_y$ then
15:           remove $L_i[j]$ from $L_i$
16:         end if
17:       end for
18:     end for
19:     reset $OM(e_\tau).\ell$ to $\max_{|L_i| > 0} i + 1$
20:     return TRUE
21:   else
22:     remove all entries from $OM(e_\tau)$
23:     reset $OM(e_\tau).\ell$ as 1
24:     return FALSE
25:   end if
26: else
27:   remove all entries from $OM(e_\tau)$
28:   reset $OM(e_\tau).\ell$ as 1
29:   return FALSE
30: end if

In Algorithm 4, we need to find in $OM(e_\tau)$ the occurrence that meets the time constraint. Firstly, we compare all the entries of $L_1$ with $L_k[1]$. If we fail to find the minimum entry $L_1[i]$ that meets $L_k[1] - L_1[i] < \tau$, we eliminate all entries from the OccMap and return (Lines 26-29). If we find it, we set the first-layer timestamp $t_1$ as $L_1[i]$. Afterwards, we iteratively find in each lower layer $t_i$ as the earliest timestamp that appears after $t_{i-1}$. This process continues until $t_k$ (Lines 1-9). If we fail to find $t_i$ ($1 < i < k$) in any layer, we also have to eliminate all entries from the OccMap and return (Lines 21-25). Otherwise, $[t_1, \dots, t_k]$ constitutes a distinct occurrence for $e_\tau$. Then, we eliminate from OccMap all the entries that are no later than $t_i$ (Lines 10-20). Finally, we update the active layer as the first empty layered list.

Notably, in ONCE we find $t_i$ from the bottom to the top of the OccMap and then test whether the extracted minimal occurrence satisfies the time constraint. In ONCE+, however, we first need to find in $OM(e_\tau)$ the entry that meets the time constraint, and then find $t_i$ from the top to the bottom of the OccMap. The only difference between ONCE and ONCE+ is that the direction of seeking $t_i$ in OccMap is opposite, so the complexity of Algorithm 4 is the same as that of Algorithm 3.

To illustrate the characteristics of ONCE+, we use another example. In Figure 3, there is an OccMap that corresponds to $e_\tau = \langle A, A, B \rangle$, and the sequence $S$ consists, in order, of events of types $A, C, A, D, A, C, A, C, B, B$. When $\tau = 9$, following Algorithm 4, there are two distinct occurrences of $e$, so $freq^+(e, S) = 2$. If we set $\tau = 7$, the distinct occurrences of $e$ are those shown (in green) in Figure 4, that is, $freq^+(e, S) = 1$.

Fig. 3: Mining the non-overlapped frequency with intersected occurrence of $e = \langle A, A, B \rangle$ in the sequence $S$ [best viewed in color].

Fig. 4: Mining the non-overlapped frequency with intersected occurrence of $e = \langle A, A, B \rangle$ in the sequence $S$ [best viewed in color].

However, if we use ONCE to count the non-overlapped frequency of $e$ under either time constraint, there is only one non-overlapped minimal occurrence, that is, $freq(e, S) = 1$ in both cases.

The correctness of ONCE+ is easy to justify in the same way as in Section 4.4.

EXPERIMENTAL STUDY
In this part, we conduct an experimental study over both synthetic and real-world data. Through the experimental results, we justify that ONCE (resp., ONCE+) can answer the non-overlapped (resp., distinct) frequency counting problem for time-constrained serial episodes efficiently in a one-pass way. The real-world data is the streaming sequence of telecommunication alarms within 4 cities in Guizhou Province of China during 2014. Besides, we also generate a synthetic dataset by uniformly randomly sampling a particular event from $\Sigma$ at each timestamp, one step after another. The statistics of both datasets are shown in Table 2. All the experiments are run on a workstation with a Xeon E5-2603v3 1.6GHz CPU and 16GB RAM running Ubuntu 12.04 LTS. We compare ONCE and ONCE+ with two baselines, namely SASE+ and SASE++ [29]. All the parameters in SASE+ and SASE++ are optimized according to their suggested settings.

TABLE 2: Dataset statistics
datasets          |S|          |Σ|    duration
Telecom. alarms   8,821,220    252    2014-05-01 0h to 2014-05-31 24h
Synthetic         91,021       622    -

Notably, in the following experiments, we report the average throughput, which is defined as the number of signals processed by an algorithm per second. In line with [29], we report how the throughput is affected by different factors. Through all these experiments, we find that ONCE spends very little time on each $S[i]$, especially for the real-world dataset. That is, our model can work in event-intensive streams even if millions of events arrive in a single second.

Selectivity θ. It is defined as Matches/Events, and is controlled by changing the target episodes in the stream $S$. Similar to [29], it is varied up to 1.6, which is a very heavy workload to test our algorithms. We simulate the stream $S$ by sequentially inputting a new signal after some time interval. In particular, for the real-world data, each signal in the experiment arrives after exactly its original time interval; for the synthetic dataset, signals arrive at a constant speed, one every 1 ms. Firstly, we test the throughput of processing all signals in $S$ and report the average over all 10 episodes. Figure 5 shows the throughput on the real-world and synthetic datasets while varying θ. We see that the throughput of SASE+ drops very fast as θ increases, and that of SASE++ is worse than that of ONCE and ONCE+. The throughput of ONCE and ONCE+ is similar. SASE++, ONCE and ONCE+ are not sensitive to the selectivity. The throughput of ONCE and ONCE+ is nearly an order of magnitude better than that of SASE++.

Effect of τ. Secondly, we test the throughput of ONCE and ONCE+ by varying the time constraint for the given serial episodes. In particular, at each τ we randomly select 10 different episodes of a fixed length. We test the throughput of Algorithm 1 processing all signals in $S$ and report the average over all 10 episodes. Notably, as each signal in the synthetic data is associated with a discrete step, τ is defined there as the maximal number of steps an episode should cover. In the real-world dataset, the time constraint is varied over a range of hours. The results are shown in Figure 6.

The code and dataset will be released once this work is published.
6. Once a timestamp is inserted into OccMap, we identify it as a Match.

Fig. 5: The throughput by varying selectivity (matches/event): (a) Synthetic; (b) Telecom. alarms. [Plots of throughput (events/second) against selectivity (matches/event) for ONCE, ONCE+, SASE+ and SASE++.]

Fig. 6: The throughput by varying τ: (a) Synthetic; (b) Telecom. alarms. [Plots of throughput (events/second) against τ (h) for ONCE, ONCE+, SASE+ and SASE++.]

Notably, the response time for all the cases remains almost constant. This seems to differ from our time complexity study. The reason is as follows. As τ increases, the probability of performing Lines 7-9 in Algorithm 3, whose complexity is $O(1)$, becomes much larger than that of Lines 11-18, whose complexity is $O(k \log \tau)$. Therefore, as τ increases, the curve tends to be constant (as Lines 7-9 contribute more to the response time) rather than sublinear.

Implicit factors.
During the experimental study, we find that, besides τ and $k$, the response time also varies for different $e_\tau$ even if they share the same length $k$ and the same τ. The reason is that, according to Algorithm 1, each time a new event $S[i]$ arrives, Lines 4-9 in Algorithm 1, which are the most time-consuming, may not always be performed. Intuitively, this part is performed each time $freq(e_\tau, S)$ is updated. Therefore, the frequency of $e_\tau$ implicitly affects the eventual response time. Hence, we conduct another experiment to test the effect of frequency by fixing both $k$ and τ at particular levels. The results are shown in Figure 7. We randomly select 10 episodes at each frequency level (up to 2000) and report the average time for processing $S[i]$. We repeat the same setting for episodes with lengths 3, 5 and 7, respectively. As the maximum frequency of the episodes at some lengths is less than 1500, they do not appear at frequency levels 1500 and 2000. Notably, the frequencies of the episodes selected at each level may deviate slightly from the nominal level. Figure 7 only reports the results on the synthetic data, as we cannot find enough episodes at each frequency level in the real-world one. Obviously, the response time increases along with the frequency level, which agrees with our analysis above.

Fig. 7: (a) Effect of frequency; (b) Scalability. [Panel (a): time cost (microseconds) against frequency for episode lengths 3, 5 and 7. Panel (b): throughput (events/second) against $|S|$ for ONCE, ONCE+, SASE+ and SASE++.]
Scalability and memory consumption. We now test the scalability of the model by varying the length of the input sequence $S$. The response time is shown in Figure 7(b). It increases almost linearly, which is consistent with our analysis at the end of Section 4.3. Notably, the average response time for processing a single signal is very small. That is, ONCE and ONCE+ can work on signal-intensive streams where millions of events happen in a second.

We also demonstrate how the memory usage of the core structure $OM(e_\tau)$ in the ONCE algorithm scales with the size of the target episodes. As described in the ListUpdate operation, each $OM(e_\tau)$ is initialized as $k$ layered empty lists. As massive data flow in, the corresponding signals are stored in this structure. To evaluate the memory consumption of the proposed algorithm during this dynamic process, we measure the maximum cost (the structure reaches its largest size at the occurrence validation step) while $k$ is steadily increased from 3 to 11. Notably, at each particular length level (3, 5, 7, 9, 11), we randomly select 10 target episodes, count their occurrences and evaluate the corresponding memory consumption. As is evident in Table 3, the memory consumption grows with $k$, because the number of lists in $OM(e_\tau)$ increases with the length of the target episode. Therefore, with more layered lists to generate candidate occurrences, $OM(e_\tau)$ has to store more signals before occurrence validation. Notably, SASE++ exhibits the same memory consumption as SASE+, and the memory consumption of ONCE+ is the same as that of ONCE, so we only list the comparative results between ONCE and SASE+.

TABLE 3: Memory consumption by varying k (in KB)
        Synthetic             Telecom. Alarms
k       ONCE      SASE+       ONCE       SASE+
3       0.1089    1.1194      0.1338     1.2403
5       0.4393    2.282       0.8174     4.3964
7       0.7067    4.2985      1.1613     10.1032
9       1.7303    14.3811     7.4654     66.4988
11      2.0011    20.198      14.3659    162.8722
APPLICATION AND OBSERVATIONS

As a direct implementation of the algorithm, it has been applied in practice within a telecom alarm management system in China. The system is built to monitor, in real time, all the alarms sent from the network equipment of a particular service provider within Guizhou province. It is expected that, through the system, operators and officers can respond to system faults and emergencies as soon as possible.

For instance, the following is a standard rule listed in the Service Manual of the system operator: "If more than 10% of the cells in a district are out-of-service, local officers have to be dispatched to handle it". There exist hundreds of other similar rules within the manual. Applying these rules in practice requires counting the specific incidents in real time while the alarms keep arriving in a streaming manner. Unfortunately, this is not a trivial task, as the majority of the incidents that appear in the listed rules cannot be simply interpreted as a single alarm. Instead, they can only be interpreted as a sequential combination of a series of particular alarms. For example, according to the empirical study of the company, a cell is "out-of-service" only if there are three alarms, namely 'low voltage', 'base station disconnect' and 'carrier wave alarm', appearing sequentially within 3 minutes. Therefore, in order to respond to the incidents effectively and efficiently, the system has to be able to count the frequency of specific serial combinations of alarms in real time and respond as soon as possible whenever the frequency is beyond a predefined threshold. Table 4 lists the incidents, their corresponding sequential alarms, as well as the response thresholds.

To address the problem, we were invited to apply the ONCE and ONCE+ algorithms in this system over the streaming alarms received in real time. Within this practical application, there are over 250,000 alarms received every day, that is, about 3 alarms every second. In other words, the system has to observe an arbitrary incident listed in the manual within 1/3 second; otherwise the system will fail to respond properly. The ONCE and ONCE+ algorithms deployed in this system can successfully output the frequency of the required incidents (resp., time-constrained serial alarms) within a millisecond, which has also been demonstrated in our experimental study. Notably, most of the serial alarms (shown in Table 4) are restricted to happen in the same district (resp., NE, base station), which, in fact, makes each individual alarm signal a two-dimensional sample to which ONCE and ONCE+ cannot be directly applied. To address that, the same alarm happening in different places (i.e., NE, base station) is treated as a different symbol in $\Sigma$; that is, such alarms are viewed as completely different symbols in ONCE and ONCE+. In this way, the frequency of all the required incidents can be counted in the system.

CONCLUSION
In this work, we present the ONCE and ONCE+ algorithms, which answer the non-overlapped and distinct frequency counting problems, respectively, for given time-constrained serial episodes within a given streaming sequence. Both algorithms work in a one-pass manner with the help of a carefully designed data structure, OccMap. For each arriving event, ONCE and ONCE+ take only $O(k \log \tau)$ time to process it. In fact, the problem addressed in this work degenerates to the traditional serial episode frequency mining problem if the time constraint for the target episode is set to infinity. Moreover, we theoretically prove that ONCE (resp., ONCE+) correctly answers the non-overlapped (resp., distinct) frequency counting problem. Experimental studies conducted over both synthetic and real-world datasets show that ONCE and ONCE+ work efficiently on stream data, even if millions of events arrive within a single second.

In some complex applications, there exist more complex serial episodes that consist of multi-dimensional signals. Extending ONCE and ONCE+ to address the same problem in high-dimensional streams is part of our future work.

APPENDIX A
PROOF OF LEMMA

Suppose $freq'(e_\tau, S) > freq(e_\tau, S)$; that is, one of the following cases happens.

• ONCE finds a minimum occurrence $Occ_{OPT}(e_\tau, S)$ that does not pass the time constraint test, but we incorrectly increase $freq'(e_\tau, S)$ by 1.
• ONCE finds two minimum occurrences $Occ^{1}_{OPT}(e_\tau, S)$ and $Occ^{2}_{OPT}(e_\tau, S)$ that overlap with each other and both pass the time constraint test, and we increase the frequency of $e_\tau$ by 2.

According to Algorithms 1 and 3, we increase the frequency only when $Occ_{OPT}(e_\tau, S)$ passes the time constraint test. Therefore, the first case cannot happen in ONCE.

Next, we prove that the second case contradicts the conditions in ONCE. Without loss of generality, suppose $Occ^{1}_{OPT}(e_\tau, S)$ consists of $t_1, \dots, t_k$ and $Occ^{2}_{OPT}(e_\tau, S)$ consists of $t'_1, \dots, t'_k$, respectively, and $t_k \le t'_k$. If $t_1 \ge t'_1$, then according to Definition 4, $Occ^{2}_{OPT}(e_\tau, S)$ cannot be a minimum occurrence. Hence, $t_1 < t'_1$. As $Occ^{1}_{OPT}(e_\tau, S)$ and $Occ^{2}_{OPT}(e_\tau, S)$ overlap, there exists $i \in [2, k]$ such that $t_i \ge t'_1$.

According to Algorithm 3, when $Occ^{1}_{OPT}(e_\tau, S)$ is found by ONCE and passes the time constraint test, all the entries in $OM(e_\tau)$ are eliminated, including $t'_1$. After that, any other entry $t$ that is inserted satisfies $t > t_k$.

Obviously, once $Occ^{1}_{OPT}(e_\tau, S)$ is found and tested, $Occ^{2}_{OPT}(e_\tau, S)$ can never be found by ONCE, as $t'_1$ has already been eliminated. Therefore, the second case can never happen in ONCE.

In all, none of the cases that lead to $freq'(e_\tau, S) > freq(e_\tau, S)$ can happen in ONCE; that is, $freq'(e_\tau, S) \le freq(e_\tau, S)$.

APPENDIX B
PROOF OF LEMMA

Suppose $freq'(e_\tau, S) < freq(e_\tau, S)$; that is, there exists a minimum occurrence $Occ_{OPT}(e_\tau, S)$ (i.e., $[t_1, \dots, t_k]$) that satisfies the time constraint ($t_k - t_1 \le \tau$) and does not overlap with any other minimum occurrence that successfully increases the frequency, but ONCE fails to find it. In the following, we show in two steps that this premise cannot hold in ONCE. First, we prove that when $t_k$ is appended to $L_k$, $\forall i < k$, $t_i \in OM(e_\tau)$. Second, we prove that as long as $t_1, \dots, t_k$ are present in $OM(e_\tau)$, ONCE can never miss them.

TABLE 4: Incidents and the interpreted serial alarms in the Service Manual (part).

| Incident | Corresponding serial alarms ($e$) | Time threshold ($\tau$) | Frequency threshold |
|---|---|---|---|
| cell out of service | low voltage, base station disconnect, carrier wave alarm | 3 minutes | 10% of all cells in a district |
| AP out of service | ap reboot, ap fault | 5 minutes | 5% of all cells in a district |
| APG process failure 1 | APG process reinitiated, statistics and traffic measurement collection timeout fault, CPT fault | 5 minutes | 5% of all cells in a district |
| APG process failure 2 | APG process reinitiated, CPT fault, statistics and traffic measurement collection timeout fault | 5 minutes | 5% of all cells in a district |
| NE out of service | SNT fault, MSC fault, MSC fault | 5 minutes | 10 in the same NE |
| base station out of service | carrier wave alarm, DTS fault, DS fault | 15 minutes | 5 in the same base station |
| NE communication blocked | Digital path unavailable state fault, Digital path fault supervision | 5 minutes | 5 in the same NE |
| NE synchronous failure | Synchronous digital path fault supervision, Synchronous fault | 5 minutes | 5 in the same NE |
| gateway articulate failure 1 | GARP fault, MGW fault, gateway channel fault, gateway block alarm | 5 minutes | 5% of all gateways in the same district |
| gateway articulate failure 2 | SER reinitiated, MGW fault, gateway channel fault, gateway block alarm | 5 minutes | 5% of all gateways in the same district |
| SAE monitor failure | NE_e alarm, audit monitor alarm, NE threshold alarm | 5 minutes | 10 in the same NE |
| NE signal failure | Signalling link alarm | 3 minutes | 10 in the same NE |
| Billing failure | Billing equipment fault, abnormal storage of billing data, billing file fault | 3 minutes | 10 in a district |
| Fiber link failure | IP signal alarm, optical fiber fault | 3 minutes | 5 in a district |
| AC failure | DHCP resource alarm, AC power alarm, Portal service fault, Radius billing server fault, Radius authentication fault | 5 minutes | 5 in a district |

Step 1.
Without loss of generality, suppose $t_i \notin OM(e_\tau)$ and $t_j \in OM(e_\tau)$ for all $j < i$ (i.e., $t_i$ is the first one that is not present in $OM(e_\tau)$). According to Algorithm 3, $t_i$ may either be eliminated from $OM(e_\tau)$ or never inserted into $OM(e_\tau)$.

[if eliminated] According to Algorithms 1, 2 and 3, elimination only happens in Lines 7, 12 and 15 of Algorithm 3 and in Line 5 of Algorithm 2. We discuss each case in sequence.

• If it is eliminated from $OM(e_\tau)$ in Line 7 of Algorithm 3, then $[t_1, \dots, t_k]$ overlaps with another minimum occurrence that successfully increases the frequency, which contradicts the premise.
• If it is eliminated from $OM(e_\tau)$ in Line 12 of Algorithm 3, all $t_j$ ($j < i$) will also be eliminated, which contradicts the fact that the $t_j$ are present in $OM(e_\tau)$; otherwise $i = 1$ and, as it has been eliminated in Line 12, $t_k - t_i = t_k - t_1 > L_k[1] - t_1 > \tau$, which contradicts the fact that $t_k - t_1 \le \tau$.
• If it is eliminated from $OM(e_\tau)$ in Line 15 of Algorithm 3, then according to Line 14 we can derive $i \ge 2$. All $t_j$ ($j < i$) will also be eliminated, as $t_i \le L_{i-1}[1] \le \dots \le L_1[1]$, which contradicts the fact that the $t_j$ are present in $OM(e_\tau)$.
• If it is eliminated from $OM(e_\tau)$ in Line 5 of Algorithm 2, it is easy to see that $t_k - t_i > \tau$ and therefore $t_k - t_1 > \tau$, which contradicts the fact that $t_k - t_1 \le \tau$.

Therefore, $t_i$ can never be eliminated from $OM(e_\tau)$. The only remaining possibility is that it is never inserted into $OM(e_\tau)$.

[if not inserted] If $t_i$ is never inserted into $OM(e_\tau)$, i.e., $t_i$ is never appended to $L_i$, then $i > 1$ and $L_{i-1}$ must be empty when the event $S[j]$ ($S[j].t = t_i$) arrives, according to Algorithm 2. As $L_{i-1}$ is empty, $t_{i-1}$ is not present in $OM(e_\tau)$, which contradicts the premise that $t_i$ is the first one that is not present in $OM(e_\tau)$.

Taking both cases together, $\forall i \in [1, k]$, $t_i$ is present in $OM(e_\tau)$ when $t_k$ is appended to $L_k$.
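The two elimination behaviours that Step 1 relies on (dropping entries once they are older than $\tau$, and clearing all entries after an occurrence is counted) can be illustrated with a small sketch. The Python below is not the OccMap-based procedure of Algorithms 1 to 3; it is a simplified counter that keeps only the latest start time per episode prefix, written under the assumption that timestamps in the stream are strictly increasing. The names `count_non_overlapped`, `latest_start`, and `Event`, as well as the sample stream, are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Iterable, List, Optional, Sequence, Tuple

# (symbol, timestamp); a hypothetical event representation for this sketch.
Event = Tuple[Hashable, float]


def count_non_overlapped(stream: Iterable[Event],
                         episode: Sequence[Hashable],
                         tau: float) -> int:
    """Count non-overlapped occurrences of `episode` with span <= tau.

    Simplified illustration; assumes strictly increasing timestamps.
    """
    k = len(episode)
    if k == 0:
        return 0
    # latest_start[i]: latest start time of an occurrence of episode[:i + 1]
    # seen since the last counted occurrence, or None if there is none.
    latest_start: List[Optional[float]] = [None] * k
    freq = 0
    for symbol, t in stream:
        # Drop starts that can no longer satisfy t_k - t_1 <= tau.
        for i in range(k):
            s = latest_start[i]
            if s is not None and t - s > tau:
                latest_start[i] = None
        # Go right to left so a single event never fills two positions
        # (this matters for episodes with repeated symbols).
        for i in range(k - 1, -1, -1):
            if episode[i] != symbol:
                continue
            if i == 0:
                latest_start[0] = t
            elif latest_start[i - 1] is not None:
                latest_start[i] = latest_start[i - 1]
        # A complete occurrence within tau was found: count it and clear
        # every entry so that later occurrences cannot overlap with it.
        if latest_start[k - 1] is not None:
            freq += 1
            latest_start = [None] * k
    return freq


if __name__ == "__main__":
    sample = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5), ("C", 6),
              ("A", 7), ("B", 8), ("C", 20), ("A", 21), ("B", 22), ("C", 23)]
    print(count_non_overlapped(sample, ("A", "B", "C"), tau=5))
```

Because symbols only need to be hashable, an alarm can be paired with its place, for example `("ap fault", "district-1")`, which matches the case study's approach of treating the same alarm in different places as different symbols.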
Step 2.
As all $t_1, \dots, t_k$ are present in $OM(e_\tau)$ when $t_k$ is appended to $L_k$, we further show that ONCE can never miss them.

Suppose the ONCE algorithm finds a minimum occurrence $[t'_1, \dots, t'_{k-1}, t_k]$ when $t_k$ is appended. If $t'_1 > t_1$, then $[t_1, \dots, t_k]$ is not a minimum occurrence, which contradicts the premise of the lemma. Otherwise, if $t'_1 < t_1$, then $[t'_1, \dots, t'_{k-1}, t_k]$ cannot be a minimum occurrence. Therefore, $t'_1 = t_1$. As $t_k - t'_1 = t_k - t_1 \le \tau$, ONCE will definitely increase $freq'(e_\tau, S)$ by 1 and remove all entries in $OM(e_\tau)$.

Taking Steps 1 and 2 together, the premise cannot hold; that is, $freq'(e_\tau, S) \ge freq(e_\tau, S)$.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (No. 61672408, 61472298), CCF-VenustechRP (No. 2017005), the Huawei Innovative Research Program (No. HIRPO20160606), and the China 111 Project (No. B16037).