Summarizing Event Sequences with Serial Episodes: A Statistical Model and an Application
Soumyajit Mitra and P. S. Sastry

Soumyajit Mitra was at the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India. He is currently with Samsung R&D, Bangalore, India. E-mail: [email protected]
P. S. Sastry is with the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India. E-mail: [email protected]
Abstract—In this paper we address the problem of discovering a small set of frequent serial episodes from sequential data so as to adequately characterize or summarize the data. We discuss an algorithm based on the Minimum Description Length (MDL) principle; the algorithm is a slight modification of an earlier method called CSC-2. We present a novel generative model for sequence data containing prominent pairs of serial episodes and, using this, provide some statistical justification for the algorithm. We believe this is the first instance of such a statistical justification for an MDL-based algorithm for summarizing event sequence data. We then present a novel application of this data mining algorithm in text classification. By considering text documents as temporal sequences of words, the data mining algorithm can find a set of characteristic episodes for all the training data as a whole. The words that are part of these characteristic episodes can then be taken as the only relevant words for the dictionary, resulting in a considerably reduced feature vector dimension. We show, through simulation experiments using benchmark data sets, that the discovered frequent episodes can be used to achieve more than a four-fold reduction in dictionary size without losing any classification accuracy.
Index Terms—Frequent episodes, MDL principle, compressing frequent patterns, HMM models for episodes, dictionary learning, text classification.

INTRODUCTION

Frequent pattern mining is an important problem in data mining with applications in diverse domains [1]. Frequently occurring local patterns can capture useful aspects of the semantics of the data. However, in practice, the mined frequent patterns are often large in number and quite redundant in nature, which makes it difficult to use them effectively. Isolating a small set of non-redundant, informative frequent patterns that best describes the data is an interesting current research problem [2], [3], [4], [5], [6], [7], [8]. In this paper we are concerned with mining of sequential data in the framework of frequent episodes [9]. We address the problem of isolating a small set of non-redundant serial episodes that best characterize the data.

There have been many recent efforts for extracting a small subset of non-redundant characteristic patterns. There are mainly two families of methods. One family of methods retains only those patterns which are, in some sense, statistically significant. The statistical significance is assessed using either a suitable null model in a hypothesis testing framework or by fitting a generative model for the data source [4], [10], [11], [12], [13], [14]. While this can reduce the number of frequent patterns to some extent, this approach cannot tackle redundancy in the discovered patterns. Another prominent family of methods for deciding which subset of patterns best explains the data is based on an information theoretic approach called the
Minimum Description Length (MDL) principle [15]. In the context of the problem of isolating a 'best' subset of frequent patterns, the use of the MDL principle can be explained as follows. We formulate a mechanism so that, given any subset of frequent patterns, we can use them as a 'model' to encode the data. Then, the subset that results in the overall best level of data compression is considered to be the subset that best characterizes the data. Such a view, motivated by the MDL principle, has been found effective for many frequent pattern mining algorithms [16].

The MDL principle views learning as data compression. If we are able to discover all the important regularities in the data then we should be able to use these to compress the data well. In this view, the coding mechanism used should be lossless; that is, the original data should be exactly recoverable given the encoded compressed representation. The Krimp algorithm [2] is one of the first methods that used the MDL principle to identify a small set of relevant patterns in the context of frequent itemset mining. As mentioned earlier, in this paper we are concerned with sequential data. For sequential data, unlike in the case of transaction data, the temporal ordering of data tuples is important, and our encoding mechanism should be such that we are able to recover the original data sequence in the correct order along with all time stamps. This presents additional complications while encoding sequential data using frequent patterns. (See [6] for more discussion on this.) There are many MDL-motivated algorithms proposed for characterizing sequence data through a subset of frequent patterns [3], [5], [6], [7]. Different algorithms use different strategies for coding data using frequent patterns. While the methods are motivated by the MDL principle, the coding strategies, and hence the computation of the compression achieved by a given subset of frequent patterns, are essentially arbitrary.

In this paper we consider a recently proposed algorithm called CSC-2 [6], which is an efficient method to discover a subset of serial episodes that best characterizes data of event sequences. It uses a novel pattern class consisting of injective serial episodes with fixed inter-event times. A similar pattern class was also used recently for learning association rules from temporal data [17]. The CSC-2 algorithm uses the number of distinct occurrences of an episode as its frequency. Here, we extend it to the case of non-overlapped occurrences as episode frequency and then provide some statistical justification for the algorithm based on a generative model.

The main contribution of this paper is an HMM-based generative model which provides some statistical justification for the CSC-2 algorithm. In all MDL-based approaches, a subset of patterns is selected based on the data compression it can achieve. This depends on the (arbitrary) coding scheme used by the algorithm, which is selected heuristically. In this paper we provide a justification for the coding scheme and the algorithm used in CSC-2 based on our proposed statistical generative model.
This is the first time, to our knowledge, that such a formal connection is established between mining of episodes using the MDL principle and a generative model for the data source. Since this generative model is Markovian and hence can handle only the episode frequency based on non-overlapped occurrences, we extend CSC-2 to use non-overlapped occurrences as episode frequency.

Another major contribution of this paper is a novel application of this method of discovering a set of characteristic episodes from sequential data. The application is in text classification. Most text classification methods represent each document as a vector over a dictionary of words, which is often called the bag-of-words representation. The dictionary for this is taken to be all the words in the corpus (after appropriate stemming and dropping of stop words). Often, the dictionary sizes are large, resulting in high dimensionality of the feature vectors representing individual documents. A text document can be viewed as a sequence of events with event types being words. Hence, using our method, we can discover a subset of characteristic episodes that best represents the full corpus of document data. We can then use the words (event-types) in the subset of discovered episodes to form our dictionary. Since our method does not even need a frequency threshold, this constitutes a parameter-less unsupervised method of feature selection for this problem. We show, through empirical experiments, that this method results in a very significant reduction of dictionary size without any loss of classification accuracy.

The rest of the paper is organized as follows. Section II describes the episode mining algorithm. Section III presents our proposed generative model. Section IV explains our method of finding a smaller-sized dictionary in text classification problems and reports results obtained on different text datasets. Conclusions are presented in Section V.
DISCOVERING BEST SUBSET OF SERIAL EPISODES
We begin with a brief informal description of the episodes framework. (See [9], [18] for more details.) Here the data is (abstractly) viewed as an event sequence denoted as D = <(E_1, t_1), (E_2, t_2), ..., (E_n, t_n)> where, in each tuple or event, (E_i, t_i), E_i is the event-type and t_i is the time of occurrence of that event. We have E_i ∈ E, a finite alphabet set, and t_i ≤ t_{i+1}, ∀i. An example event sequence is

D = <(D, t_1), (A, t_2), (C, t_3), (E, t_4), (A, t_5), (B, t_6), (C, t_7), (D, t_8), (B, t_9), (C, t_10), (E, t_11), (A, t_12), (C, t_13), (B, t_14), (C, t_15)>.   (1)

The patterns of interest here are called episodes. In this paper we are concerned only with serial episodes. We represent an N-node serial episode, α, as α[1] → ··· → α[N], where α[i] is the event-type of the i-th event of the episode. An episode is said to be injective if all event types in the episode are distinct. For example, A → B → C is a three-node injective serial episode. An occurrence of a serial episode is constituted by events in the data sequence that have the appropriate event types and whose times of occurrence are in the correct order. In (1), the events ((A, t_2), (B, t_6), (C, t_7)) constitute an occurrence of A → B → C, while ((A, t_12), (B, t_9), (C, t_10)) does not because B does not occur after A. (Note that the events constituting an occurrence need not be contiguous in the data.)

The data mining problem is to discover all frequently occurring episodes. In the frequent episodes framework, many different frequency measures are defined based on counting different subsets of occurrences. We mention two such frequencies below which are relevant for this paper. There are efficient algorithms for discovering serial episodes under many frequency measures. (See [18] for more details on different frequencies and algorithms for discovering serial episodes.)

Two occurrences of a serial episode are said to be non-overlapped if no event of one occurrence is in between events of the other. In (1), ((A, t_2), (B, t_6), (C, t_7)) and ((A, t_12), (B, t_14), (C, t_15)) are non-overlapped occurrences of A → B → C, while ((A, t_5), (B, t_9), (C, t_13)) is another occurrence of this episode which overlaps with both the earlier ones. The non-overlapped frequency of an episode is defined as the maximum number of non-overlapped occurrences of the episode in the event sequence [14]. Two occurrences are said to be distinct if they do not share any event. All three occurrences above are distinct. The maximum number of distinct occurrences is another frequency of interest.

In our method here we use a special class of serial episodes called fixed-interval serial episodes [6]. A fixed-interval serial episode can be denoted as α = (α[1] --Δ_1--> α[2] --Δ_2--> ··· --Δ_{N−1}--> α[N]), where Δ_i is the prescribed gap between the times of the i-th and (i+1)-st events of any occurrence of α. For example, A --Δ_1--> B --Δ_2--> C, with specified gaps Δ_1 and Δ_2, is a fixed-interval injective serial episode. An occurrence of A → B → C in (1) is an occurrence of this fixed-interval episode only if the times of its successive events differ by exactly Δ_1 and Δ_2; events of the right types with other inter-event gaps do not constitute an occurrence. As is easy to see, all events constituting an occurrence of a fixed-interval serial episode are completely specified by giving only the time of occurrence of the first event. Also, two occurrences starting at different times would be distinct if the episode is injective (that is, all event types in the episode are different). Here our interest is in discovering a small set of fixed-interval serial episodes that best explains the data.
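To make the fixed-interval notion concrete, the following minimal Python sketch (not part of the algorithms in this paper; the episode, gaps and data below are hypothetical) lists all occurrence start times of a fixed-interval injective serial episode, illustrating that each occurrence is completely determined by its start time.

```python
from collections import defaultdict

def fixed_interval_occurrences(events, types, gaps):
    """Return the start times of all occurrences of the fixed-interval serial
    episode types[0] -gaps[0]-> types[1] -gaps[1]-> ... in an event sequence
    given as a list of (event_type, time) pairs."""
    # Index the data: for each event type, the set of times at which it occurs.
    times_of = defaultdict(set)
    for e, t in events:
        times_of[e].add(t)
    starts = []
    for t0 in sorted(times_of[types[0]]):
        t, ok = t0, True
        # Each subsequent event must have the prescribed type and occur exactly
        # the prescribed gap after the previous one.
        for typ, gap in zip(types[1:], gaps):
            t += gap
            if t not in times_of[typ]:
                ok = False
                break
        if ok:
            starts.append(t0)
    return starts

# Example (hypothetical data and gaps):
data = [('A', 2), ('B', 3), ('C', 4), ('A', 7), ('B', 8), ('C', 9)]
print(fixed_interval_occurrences(data, ['A', 'B', 'C'], [1, 1]))  # [2, 7]
```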
We use the Minimum Description Length (MDL) principle for this. Hence we rank different subsets of episodes by the total encoding length that results when we use them as models to encode the data. Under MDL, the encoding should be such that we are able to recover the original data completely. Since we are considering sequential data, this means we should be able to recover the data in the original sequence with all time stamps.

We first explain the strategy of coding the data sequence using our episodes. The basic idea is that we can encode all events constituting an occurrence of a fixed-interval serial episode by just giving the start time of the occurrence. The encoding strategy is the same as that used in [6]. For obtaining the best subset of episodes we essentially use the CSC-2 algorithm from [6], with the main difference being that we use the non-overlapped frequency while that algorithm uses distinct occurrences as frequency. Below we first explain the encoding scheme through an example and then briefly explain the CSC-2 algorithm. (For more details on the encoding scheme and the CSC-2 algorithm, please see [6].)

Table 1 illustrates the coding scheme by encoding the event sequence in (1) using three arbitrarily selected episodes. Each row specifies the size and description of an episode, the number of occurrences of the episode and the start times of these occurrences. Thus, the first row of Table 1 specifies a three-node episode, namely (A → B → C), which has two occurrences starting at time instants 2 and 4. Thus, this row codes for six events in the data constituting the two occurrences of this 3-node episode. Similarly, the second row codes for six events by specifying two occurrences of a 3-node episode, and the third row codes for two events by specifying one occurrence of a 2-node episode. Suppose we are interested in asking how good this subset of three episodes is. These three episodes together, as specified through Table 1, account for all but two events in the data. But coding under MDL should be lossless. Hence, in the last row of Table 1 we have used two occurrences of a 1-node episode to make sure that all events in the data sequence are covered. It is easy to see that, given this table, we can recreate the entire data sequence exactly. In this table we can think of the first two columns as coding the model, that is, the subset of episodes, and the last two columns as coding the data using this model. Thus the length or size of this table can be taken as the total encoded length for the subset of episodes. Given any subset of episodes (such as the three episodes in the first three rows of the table) we can find an encoding like this for the whole data by adding occurrences of a few 1-node episodes as needed (which is what is done in the fourth row of the table).

In this table, one can see that the event (C, ·) is coded for by both the first and the second episode in the table. Intuitively, we get better data compression if such overlaps among the parts of data encoded by different episodes in the selected set are minimized. Thus, we should get better compression of data if we can choose episodes with high frequency (so that they cover a large number of events) which are non-redundant (so that the overlaps mentioned above are reduced). This is the intuitive reason for using this coding scheme and looking for a subset of episodes that achieves the best compression of data.

Table 1: Encoding of the event sequence in (1)
Size of episode   Episode name   No. of occurrences   List of occurrence start times
3                 A → B → C      2                    <2, 4>
3                 D → E → C      2                    <·, ·>
2                 A → B          1                    <·>
1                 C              2                    <·, ·>

Our objective is to find a subset of episodes to encode the data like this so as to get the best data compression. For purposes of counting length/memory we assume that event types as well as times of occurrence are integers and that each integer takes one unit of memory. Let α be an N-node episode of frequency f_α used for encoding. Its row in the table would need 2N + 1 + f_α units (1 unit to represent the size of the episode, N units to represent the event-types of the episode, N − 1 units for representing the inter-event gaps, 1 unit for the frequency and f_α units to represent the start times of the occurrences). Since non-overlapped or distinct occurrences do not share events, this episode encodes f_α N events in the data, and hence we need at least f_α N units of memory if we want to encode these events in the data using 1-node episodes. Define

score(α, D) = f_α N − (2N + 1 + f_α).   (2)

If score(α, D) > 0, then we can conclude that α is a useful candidate, since selecting it can improve the encoding length (in comparison to the trivial encoding using only 1-node episodes). However, the true utility of α is to be assessed with respect to what it would add to the compression given the other selected episodes.

Let F_s be a set of episodes of size greater than one. Given any such F_s, let L(F_s, D) denote the total encoded length of D when we encode all the events which are part of the occurrences of episodes in F_s by using episodes in F_s, and encode the remaining events in D, if any, by episodes of size one. Given any two episodes α, β, let OM(α, β) denote the number of events in the data that are covered by occurrences of both α and β in the data sequence D. Define

overlap-score(α, D, F_s) = score(α, D) − Σ_{β ∈ F_s} OM(α, β).   (3)

Overlap-score gives an estimate of how much extra encoding efficiency can be achieved by selecting α given the set F_s. It can be proved [6, Prop. 1] that if overlap-score(α, D, F_s) > 0, then L(F_s ∪ {α}, D) < L(F_s, D). This means that, given a current set of episodes F_s, adding to F_s an episode α with positive overlap-score can only reduce the total encoded length. The CSC-2 algorithm in [6] is essentially a greedy algorithm that keeps adding episodes with the highest overlap-score. This greedy selection of the best episode (based on overlap-score) is done from a set of candidate episodes, generated through a depth-first search of the lattice of all serial episodes. Each candidate episode is the 'best' episode in one of the paths of the depth-first search tree. For the sake of completeness we give the pseudocode of this algorithm as Algorithm 1 (for more details see [6]).
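Before the full pseudocode, a minimal numeric sketch of the selection criterion in (2)-(3) may help; the function names and the example numbers below are ours (illustrative only), not from [6].

```python
def score(freq, n_nodes):
    """Equation (2): events covered by freq non-overlapped occurrences of an
    N-node fixed-interval episode, minus the cost of the episode's table row
    (size + event types + gaps + frequency + start times)."""
    covered = freq * n_nodes
    row_cost = 2 * n_nodes + 1 + freq
    return covered - row_cost

def overlap_score(freq, n_nodes, overlaps):
    """Equation (3): score minus the events already covered by the episodes in
    the currently selected set F_s; `overlaps` lists OM(alpha, beta) for each
    already selected beta."""
    return score(freq, n_nodes) - sum(overlaps)

# A 3-node episode with 10 occurrences that shares 4 events with selected episodes:
print(overlap_score(10, 3, [4]))  # 30 - 17 - 4 = 9 > 0, so adding it helps compression
```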
Algorithm 1 CSC-2(D, T_g, K)
Input: D: event sequence; T_g: maximum inter-event gap; K: maximum number of selected episodes.
Output: The set of selected frequent episodes F.

  Initialize the final set of selected episodes F ← ∅
  while data compression achievable and |F| < K do
    F_s ← ∅
    C ← Generate Candidate Episodes
    Calculate events shared by occurrences for every pair of candidate episodes
    repeat
      α ← argmax_{γ ∈ C} overlap-score(γ, D, F_s)
      if overlap-score(α, D, F_s) > 0 then
        F_s ← F_s ∪ {α}
        C ← C \ {α}
      end if
    until overlap-score(α, D, F_s) ≤ 0 or |F ∪ F_s| = K
    Delete the events from D corresponding to the occurrences of the selected episodes F_s
    F ← F ∪ F_s
  end while
  A ← Size-1 episodes to encode remaining events in D
  F ← F ∪ A
  return F

We can run this algorithm to find the 'top-K' best episodes. If we give a very large value of K, the algorithm exits when it cannot find any more episodes (of size greater than 1) which improve coding efficiency. The algorithm needs no frequency threshold from the user. Our overlap-score naturally prefers episodes with higher frequency, and we need no threshold because we pick episodes based on what they add to coding efficiency. Thus, the algorithm does not really have any hyperparameters (except for T_g, the maximum allowable inter-event gap, which is not a critical one).

While calculating overlap-score, we need to decide what type of occurrences we count toward frequency. As mentioned earlier, CSC-2 uses distinct occurrences. In this paper we use non-overlapped occurrences for the frequency of episodes. The reason for this is that the generative model we present in the next section is for non-overlapped occurrences. Also, in our application to text classification, non-overlapped occurrences are a more natural choice of frequency.

We obtain the sequence of non-overlapped occurrences from the distinct occurrences returned by CSC-2 using a simple algorithm. We take the first occurrence from the sequence of distinct occurrences as the first one in the sequence of non-overlapped occurrences. From then onwards we take the first distinct occurrence starting after the last non-overlapped occurrence we have as the next one in our sequence of non-overlapped occurrences. The pseudocode for this procedure is listed below as Algorithm 2. We then prove its correctness; that is, we show that the sequence of non-overlapped occurrences we get is a maximal one and hence we get the correct non-overlapped frequency.
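A minimal Python sketch of this greedy selection (the formal pseudocode follows as Algorithm 2) is given below; the function name and the example values are illustrative, not part of the original method.

```python
def non_overlapped_starts(start_times, gaps):
    """Greedily select non-overlapped occurrences of a fixed-interval episode
    from its distinct occurrences, given as start times in increasing order.
    Since the inter-event gaps are fixed, an occurrence starting at t ends at
    t + sum(gaps); the next selected occurrence must start strictly later."""
    span = sum(gaps)
    selected = []
    last_end = None
    for t in start_times:
        if last_end is None or t > last_end:
            selected.append(t)
            last_end = t + span
    return selected

# Distinct occurrences of a 3-node episode with gaps (1, 2) starting at these times:
print(non_overlapped_starts([2, 3, 6, 10], (1, 2)))  # [2, 6, 10]
```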
Algorithm 2 Find-NO-occurrences(α)
Input: OccStarttime-List(α): list of start times of distinct occurrences of α in increasing order of start times; Δ_i, ∀i ∈ {1, 2, ..., N−1}: inter-event gaps of episode α.
Output: NO-occurrences(α): list of start times of a maximal set of non-overlapping occurrences of α.

  NO-occurrences(α) ← ∅
  itr ← pointer to first element of OccStarttime-List(α)
  if OccStarttime-List is empty then
    return NO-occurrences(α)
  end if
  t_s ← itr.starttime
  NO-occurrences(α) ← NO-occurrences(α) ∪ {t_s}
  itr ← itr.next
  while itr ≠ NULL (end of list) do
    t'_s ← itr.starttime
    if t'_s > t_s + Σ_{i=1}^{N−1} Δ_i then
      NO-occurrences(α) ← NO-occurrences(α) ∪ {t'_s}
      t_s ← t'_s
    end if
    itr ← itr.next
  end while
  return NO-occurrences(α)

Let H = {h_1, h_2, ..., h_l} denote the set of non-overlapped occurrences returned by Algorithm 2. Each occurrence h_i can be thought of as a tuple of indices in the data sequence which give the positions of the events in the data that constitute this occurrence. For example, in data sequence (1), the occurrence of the episode A → B → C constituted by the events <(A, t_2), (B, t_6), (C, t_7)> would be represented by the tuple (2 6 7). Hence, as notation, we use t_{h_i}(k) to denote the time of the k-th event of the episode in the occurrence h_i. On the set of occurrences H there is a natural order: occurrence h_i is earlier than h_j if t_{h_i}(1) < t_{h_j}(1). Because of the way the occurrences in H are selected by our algorithm, the following property is easily seen to hold.

Property 1: h_1 is the earliest distinct occurrence of the episode α. For any i, h_i is the first distinct occurrence starting after t_{h_{i−1}}(N), and there is no distinct occurrence which starts after t_{h_l}(N).

Proposition 1: H is a maximal set of non-overlapped occurrences of α.

Proof:
Note that for fixed-interval injective serial episodes, occurrences having different start times are distinct. Consider any other set of non-overlapped occurrences of the episode, H' = {h'_1, h'_2, ..., h'_m}. Let p = min{m, l}. We first show that

t_{h_i}(N) ≤ t_{h'_i}(N), ∀i ∈ {1, 2, ..., p}.   (4)

We use induction on i to prove this. Let us show this first for i = 1. Suppose t_{h_1}(N) > t_{h'_1}(N). Since the inter-event gaps are fixed, we have t_{h_1}(1) > t_{h'_1}(1). This means we have found a distinct occurrence of the episode which starts before h_1. This contradicts the first statement of Property 1, that h_1 is the earliest distinct occurrence. Hence t_{h_1}(N) ≤ t_{h'_1}(N).

Suppose t_{h_i}(N) ≤ t_{h'_i}(N) is true for some i < p. We show that t_{h_{i+1}}(N) ≤ t_{h'_{i+1}}(N). Suppose t_{h_{i+1}}(N) > t_{h'_{i+1}}(N). This implies t_{h_{i+1}}(1) > t_{h'_{i+1}}(1). Again, since H' is a set of non-overlapped occurrences, we have t_{h'_i}(N) < t_{h'_{i+1}}(1). Hence we have

t_{h_i}(N) ≤ t_{h'_i}(N) < t_{h'_{i+1}}(1) < t_{h_{i+1}}(1).

But this contradicts the statement of Property 1 that h_{i+1} is the earliest distinct occurrence after t_{h_i}(N). Hence t_{h_{i+1}}(N) ≤ t_{h'_{i+1}}(N).

Now we prove the maximality of the set H. Suppose |H'| > |H|, i.e., m > l. From inequality (4), h'_{l+1} is an occurrence starting beyond t_{h_l}(N). But this contradicts the last statement of Property 1, that there is no distinct occurrence beyond t_{h_l}(N). Hence |H| ≥ |H'| for every set of non-overlapped occurrences H'. This proves the maximality of the set H.

We can now sum up our method of finding a subset of serial episodes that best characterizes the data sequence. We use the coding scheme as described here and use a greedy heuristic to find the subset that achieves the best compression. This is essentially the same as the CSC-2 algorithm of [6]. However, we use Algorithm 2 to get non-overlapped occurrences of episodes from the distinct occurrences and then use that frequency in selecting episodes with the best overlap-score. In the next section we present an interesting generative model that provides some statistical justification for our algorithm based on selecting an episode with the best overlap-score.

GENERATIVE MODEL FOR PAIRS OF EPISODES
In this section we present a class of generative models which is a specialized class of HMMs. (This model is motivated by an HMM-based model for single episodes proposed in [14].) An HMM contains a Markov chain over some state space, but the states are unobservable. In each state, a symbol is emitted from a finite symbol set according to a symbol probability distribution associated with that state. The stream of symbols is the observable output sequence of the model. In our case, the symbol set would be the set of event-types and thus the observed output sequence would be a sequence of event-types. We think of this as an event sequence where the event times are not explicitly specified. For occurrences, and hence for frequencies, of general serial episodes (without any inter-event times specified), only the time-ordering of the event-types in the data sequence is important; actual event times play no role. Hence in this section we consider serial episodes without any fixed inter-event times.

In our generative model, the state transition probability matrix of the Markov chain is parameterized by a single parameter, which is called the noise parameter. For every pair of serial episodes, we have one such generative model. For a small enough value of the noise parameter, the model is such that the output from the model would be an event sequence containing many non-overlapped occurrences of the two corresponding episodes. While occurrences of any one episode would be non-overlapped in the output event sequence, an occurrence of one episode may be arbitrarily interleaved with occurrences of the other episode. Thus this is a good class of generative models for a data source where a pair of episodes forms the most prominent frequent patterns (under the frequency based on non-overlapped occurrences). This is the first instance of such a statistical generative model for multiple episodes.

Consider the family of such models containing a model for every possible pair of episodes. Let Λ_αβ denote the model for the pair of episodes α and β. Given an event sequence, we can now ask which model from this class is the maximum likelihood estimate. This would essentially tell us which pair of episodes best 'explains' the data sequence in the sense of maximizing the likelihood. We show that such a pair of episodes is not necessarily the two most frequent episodes. The data likelihood depends both on the frequencies of the episodes as well as on the number of events in the data that the occurrences of the two episodes share. Thus, we show, for example, that Λ_αβ may have better likelihood than Λ_αγ even when β has lower frequency than γ, if the overlap between α and β is much less than that between α and γ. The results we present here provide some statistical justification for the coding scheme and the algorithm that we presented in the previous section.

An HMM is specified as
Λ = (P, π, b), where P = [p_ij] is the state transition probability matrix of the Markov chain with state space, say, S; π is the vector of initial state probabilities; and b = (b_q, q ∈ S), where b_q denotes the symbol probability distribution in state q. Let o = (o_1, o_2, ..., o_T) be an observed symbol (or output) sequence. The joint probability of the output sequence o and a state sequence q = (q_1, q_2, ..., q_T), given an HMM Λ, is

P(o, q | Λ) = π_{q_1} b_{q_1}(o_1) ∏_{t=2}^{T} p_{q_{t−1} q_t} b_{q_t}(o_t).   (5)

To determine the model with maximum likelihood, we need to find P(o | Λ). This data likelihood is often assessed by evaluating the joint probability in (5) along a most likely state sequence, q*, where

q* = argmax_q P(o, q | Λ).   (6)

We also follow this simplification, often employed by methods using HMM models. Thus we assume P(o | Λ_1) > P(o | Λ_2) if P(o, q*_1 | Λ_1) > P(o, q*_2 | Λ_2), where q*_1 and q*_2 are the respective most likely state sequences. (This will be referred to as assumption A1.)

Let Λ_αβ denote the model corresponding to the pair of episodes α and β. We give a full description of this model below. For the sake of simplicity, we consider that both are N-node episodes. The model depends on whether or not the two episodes share any event types, and hence we consider two separate cases (wherever necessary):
• Case-I: α and β have no common event-types, i.e., α[i] ≠ β[j], ∀i, j ∈ {1, 2, ..., N}.
• Case-II: α and β have some common event-types.
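For concreteness, the quantity in (5) can be computed along any given state sequence as in the following sketch. This is generic HMM code, not specific to the model Λ_αβ, and all the numbers in the toy example are arbitrary illustrations.

```python
import numpy as np

def log_joint_prob(obs, states, log_pi, log_P, log_B):
    """log P(o, q | Lambda) as in equation (5), for an observation sequence
    `obs` and a state sequence `states`, given log initial probabilities
    `log_pi`, log transition matrix `log_P` and log emission matrix `log_B`
    (log_B[q, o] = log b_q(o))."""
    lp = log_pi[states[0]] + log_B[states[0], obs[0]]
    for t in range(1, len(obs)):
        lp += log_P[states[t - 1], states[t]] + log_B[states[t], obs[t]]
    return lp

# Toy 2-state, 2-symbol example:
pi = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.5, 0.5]])
print(log_joint_prob([0, 1, 1], [0, 0, 1], np.log(pi), np.log(P), np.log(B)))
```

Under assumption A1, two candidate models are compared simply by evaluating this quantity along their respective most likely state sequences.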
The State Space

The number of states in the HMM is 4N² + 1. The state space can be partitioned into two parts: episode states, S_e, comprising 2N² states, and noise states, S_n, comprising 2N² + 1 states. Episode states are denoted by S^k_{i,j}, k ∈ {1, 2}, i, j ∈ {1, 2, ..., N}. The noise states are given by N^k_{i,j}, k ∈ {1, 2}, i, j ∈ {1, 2, ..., N}, and the state N_{0,0}.
Emission structure

The symbol probability distribution for the episode states is a delta function. The episode state S^1_{i,j} emits the symbol α[i] with probability 1, whereas S^2_{i,j} emits the symbol β[j] with probability 1. For each noise state, the symbol probability distribution is uniform over the alphabet set E. (We denote |E| = M.)

Transition structure
Under Case-I (where α and β do not share any event types), the state transition probabilities out of episode states are given in Fig. 1. Under Case-II also (where α and β may share some event types), the transition probabilities out of episode states are as given by the state transition structure of Fig. 1, except for the states S^1_{i,(j mod N)+1} and S^2_{(i mod N)+1,j}, where i, j are such that α[(i mod N) + 1] = β[(j mod N) + 1]. For such i, j, the transition probabilities are as given in Fig. 2.

Figure 1: Episode state transition structure. From S^1_{i,j} the chain moves to S^1_{(i mod N)+1,j} or to S^2_{(i mod N)+1,j} with probability (1−η)/2 each, or to the noise state N^1_{i,j} with probability η; transitions out of S^2_{i,j} are analogous (to S^2_{i,(j mod N)+1}, S^1_{i,(j mod N)+1} or N^2_{i,j}).

Figure 2: Transition structure when α[(i mod N) + 1] = β[(j mod N) + 1]. From such a state the chain moves with probability 1 − η to the episode state in which both episode indices are advanced, and with probability η to the corresponding noise state.

For all the noise states, N^k_{i,j}, k ∈ {1, 2}, the transition structure is as shown in Fig. 3. The noise state N_{0,0} can transit with probability (1−η)/2 to each of the episode states S^1_{1,1} and S^2_{1,1}, or remain in N_{0,0} with probability η.

Figure 3: Noise state transition structure. A noise state N^k_{i,j} returns to itself with probability η and moves to each of the two episode states reachable from it with probability (1−η)/2.
It may be noted that all transition probabilities are determined by a single parameter, η, which is called the noise parameter. The values of the individual transition probabilities are fixed in an intuitively simple manner. From any state, the transition into a noise state has probability η. The remaining probability is equally divided between all reachable episode states.

One can intuitively see the logic of the state transition structure also. Recall that in state S^1_{i,j} we emit symbol α[i]. So, after this, we can either go to S^1_{(i mod N)+1,j} to emit the next event type of α or go to S^2_{(i mod N)+1,j} to now emit an event type from β. This allows for arbitrary interleaving of occurrences of α and β. Similarly, from S^2_{i,j} (after emitting β[j]) we can either go to S^2_{i,(j mod N)+1} or S^1_{i,(j mod N)+1}. Since the event types constituting an occurrence of an episode need not be contiguous, from the episode states we can go to the noise states and cycle there zero or more times before coming back to episode states. After emitting the last event type of, say, α, the next event type of α that can be emitted is α[1]. Hence, from S^1_{N,j} we should go to either S^1_{1,j} or S^2_{1,j} (or a noise state). That is why, in the transition structure as given, whenever an index is incremented it is always modulo N.

All of the above is fine when α and β do not share event types. Suppose they share an event type. When that event type appears in the data it could be part of an occurrence of only α, or of only β, or of neither. These possibilities are all accounted for by the above transition structure. However, there is one more possibility, namely, that it is part of an occurrence of both α as well as β; that is, the two occurrences share an event. The transition structure given in Fig. 2 ensures that our generative model includes this possibility too.

Initial states

If α[1] ≠ β[1], the initial state is N_{0,0} with probability η, S^1_{1,1} with probability (1−η)/2 and S^2_{1,1} with probability (1−η)/2. If α[1] = β[1], the initial state is N_{0,0} with probability η and S^1_{1,1} with probability 1 − η.
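The way all transition probabilities follow from the single parameter η can be sketched as follows; the state labels used here are only illustrative stand-ins for the notation above.

```python
def out_transitions(episode_successors, noise_state, eta):
    """Transition distribution out of a state in the pair-of-episodes model:
    probability eta of moving to the associated noise state, with the remaining
    1 - eta split equally among the reachable episode states. States are
    represented by hashable labels of our choosing."""
    dist = {noise_state: eta}
    share = (1.0 - eta) / len(episode_successors)
    for s in episode_successors:
        dist[s] = share
    return dist

# From an episode state S1(i, j): continue alpha, switch to beta, or go to noise.
print(out_transitions([("S1", 2, 1), ("S2", 2, 1)], ("N1", 1, 1), eta=0.1))
# {('N1', 1, 1): 0.1, ('S1', 2, 1): 0.45, ('S2', 2, 1): 0.45}
```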
An Example

Consider a model Λ_αβ, where α = A → B → C and β = D → B → E. Let the alphabet set be E = {A, B, C, D, E, F, G}. We show a few example output sequences of length 10 that can be emitted by Λ_αβ in Fig. 4. As can be seen from the figure, the output sequence contains occurrences of α and β that may be arbitrarily interleaved. Here we have α[2] = β[2]. Hence transitions out of the two episode states identified in Fig. 2 (with i = j = 1 here) are as given in Fig. 2, and for all other episode states they are as given in Fig. 1. This special transition structure allows some occurrences of α and β to share an event (of event type B), as can be seen in row 3 of Fig. 4.

Figure 4: First 10 events of four sample output sequences:
B A D F B G B E C C
D B G A E B B F A C
F A C D B B C G E F
D G E A B B B E C G

In this section we derive expressions for the likelihood (of a joint state and output sequence) of our HMM model and use this to compare the likelihoods of models corresponding to different pairs of episodes. The expressions depend on whether or not the pair of episodes share event types, and hence the two cases are dealt with separately.
In all our analysis we assume η < M/(M + 8), where M = |E|.

Case-I: Here, α[i] ≠ β[j], ∀i, j ∈ {1, 2, ..., N}. Hence, all episode states have only (1−η)/2 transitions into them. Decomposing any state sequence into two sub-sequences q_e and q_n, corresponding to the episode and noise states, we have the following observation: in equation (5), whenever the transition probability p_{q_{t−1} q_t} is (1−η)/2, the state q_t has to be an episode state, and hence b_{q_t}(o_t) is either 0 or 1. Similarly, whenever p_{q_{t−1} q_t} is η, the corresponding b_{q_t}(o_t) is 1/M. Thus, for any state sequence with non-zero probability, we can write the joint probability as

P(o, q | Λ) = (η/M)^{|q_n|} ((1−η)/2)^{|q_e|}.   (7)

Here, |q_n| and |q_e| denote the lengths of the respective sub-sequences. Since |q_e| + |q_n| = |q| = T (the length of the output or event sequence), (7) can be written as

P(o, q | Λ) = (η/M)^T ((1−η)M / (2η))^{|q_e|}.   (8)

Under our assumption we have η < M/(M + 8), and hence (1−η)M/(2η) > 1. Then P(o, q | Λ) is monotonically increasing with |q_e|. Thus, the most likely state sequence is the one that spends the longest time in episode states. Due to the constraints imposed on the state transition structure, in any state sequence of Λ_αβ having non-zero probability, the episode states corresponding to a particular episode have to occur in sequence. Moreover, when a particular episode state S^1_{i,j} or S^2_{i,j} is revisited, it implies that one cycle of all the event-types corresponding to that episode has been emitted. Suppose f*_α and f*_β are the maximum possible numbers of non-overlapped occurrences of α and β, respectively, in o. Since α and β do not share any event type and at each episode state we emit one symbol, the most likely state sequence has at least N(f*_α + f*_β) episode states in it, i.e., |q*_e| ≥ N f*_α + N f*_β.

For the sake of simplicity, we make the assumption (referred to as A2)¹ that there is no state sequence with non-zero probability that includes any incomplete occurrence of either of the episodes. Under assumption A2, we have |q*_e| = N f*_α + N f*_β.

¹ We can ensure that assumption A2 always holds by modifying our model: we add an extra symbol (an "end-of-sequence" marker) at the end of the output sequence o and modify the symbol probability distributions of the noise states N^1_{N,N} and N^2_{N,N}, following the trick used with HMMs for single episodes as in [19].

Consider two models Λ_αβ and Λ_αγ. We have

P(o, q* | Λ_αβ) = (η/M)^T ((1−η)M/(2η))^{N f*_α + N f*_β},
P(o, q* | Λ_αγ) = (η/M)^T ((1−η)M/(2η))^{N f*_α + N f*_γ},

⟹ P(o, q* | Λ_αβ) / P(o, q* | Λ_αγ) = ((1−η)M/(2η))^{N(f*_β − f*_γ)}.

Hence, under assumption A1, if f*_β > f*_γ, we have P(o | Λ_αβ) > P(o | Λ_αγ). This essentially implies that, given an already selected episode (the episode α here), if we want to select the next one from the set of episodes that do not share any event type with the already selected one, we should choose the most frequent one from that set.

Case-II: In this case, we have some episode states with a (1−η)/2 transition into them while some have a (1−η) transition into them. It should be noted that, because of the transition structure, a symbol emitted from an episode state with a (1−η) transition into it is part of an occurrence of both the episodes. It means that the event is shared across occurrences of the two episodes. On the other hand, a symbol emitted from an episode state with a (1−η)/2 transition into it is part of an occurrence of only one episode and hence is not shared. Now, we further decompose the episode-states part of any state sequence, q_e, into two parts, q_1 and q_2. The episode states corresponding to event types that are not shared form q_1, while those corresponding to shared ones form q_2.
Since every state emits one symbol, we have |q_n| + |q_1| + |q_2| = T. Now, the joint probability of an output and state sequence is given by

P(o, q | Λ) = (η/M)^{|q_n|} ((1−η)/2)^{|q_1|} (1−η)^{|q_2|}
            = (η/M)^T ((1−η)M/(2η))^{|q_1|} ((1−η)M/η)^{|q_2|}.

Let us consider a state sequence q (having non-zero probability) that contains f_α and f_β occurrences of the episodes α and β, respectively. Let the number of events shared between these occurrences be O_αβ. Then the
number of events covered by the occurrences of the episodes in the output sequence is (N f_α + N f_β − O_αβ), out of which (N f_α + N f_β − 2O_αβ) events are not shared and O_αβ events are shared. Under assumption A2, we have |q_2| = O_αβ and |q_1| = N f_α + N f_β − 2O_αβ. So, for this state sequence,

P(o, q | Λ_αβ) = (η/M)^T ((1−η)M/(2η))^{N f_α + N f_β − 2O_αβ} ((1−η)M/η)^{O_αβ}
             = (η/M)^T ((1−η)M/(2η))^{N f_α + N f_β} ((1−η)M/(4η))^{−O_αβ}.   (9)

For η < M/(M + 8), we have (1−η)M/(4η) > 1. So we see that the joint probability is an increasing function of the number of occurrences of the episodes and, for fixed f_α and f_β, a decreasing function of the number of events shared between the occurrences. Let f*_α and f*_β be the maximum possible numbers of non-overlapped occurrences of α and β, respectively, in o. So the most likely state sequence (q*) is the one which emits all the f*_α + f*_β occurrences from the episode states and, among all such state sequences, is the one which shares the minimum number of events between these occurrences. Let O*_αβ be the number of shared events corresponding to q*. Then, from (9),

P(o, q* | Λ_αβ) = (η/M)^T ((1−η)M/(2η))^{N f*_α + N f*_β} ((1−η)M/(4η))^{−O*_αβ}.   (10)

We will have a similar expression for the model Λ_αγ and hence

P(o, q* | Λ_αβ) / P(o, q* | Λ_αγ) = ((1−η)M/(2η))^{N(f*_β − f*_γ)} ((1−η)M/(4η))^{O*_αγ − O*_αβ}.   (11)

Thus, under assumption A1, we see that if f*_β = f*_γ, the likelihood is higher for the pair of episodes that share the lesser number of events. In general, the relative likelihood of Λ_αβ and Λ_αγ depends both on the frequencies of β and γ as well as on the difference in their overlaps with α. To better understand this, let us define two metrics to rate any other episode with respect to episode α:

Overlap-score_1(β, α) = N f*_β − O*_αβ,
Overlap-score_2(β, α) = 2N f*_β − O*_αβ.

We will show that, given an episode α, if the values of both metrics for an episode β are higher than those for an episode γ, then Λ_αβ has higher data likelihood compared to Λ_αγ (under our assumption on η and under A1 and A2).

Case-a: f*_β > f*_γ, O*_αβ < O*_αγ.
Under assumption A1, from (11), it is easily seen that P(o | Λ_αβ) > P(o | Λ_αγ). Also, it is easy to check that f*_β > f*_γ and −O*_αβ > −O*_αγ imply Overlap-score_1(β, α) > Overlap-score_1(γ, α) and Overlap-score_2(β, α) > Overlap-score_2(γ, α).

Case-b: f*_β > f*_γ, O*_αβ > O*_αγ.
In this scenario, depending on the values of the overlaps, the two metrics for β may be greater or smaller than those of γ. Hence we consider two sub-cases.

Case-b1:
Overlap-score_1(β, α) > Overlap-score_1(γ, α) and Overlap-score_2(β, α) > Overlap-score_2(γ, α).
Since Overlap-score_1(β, α) > Overlap-score_1(γ, α)
⟹ N f*_β − O*_αβ > N f*_γ − O*_αγ
⟹ N f*_β − N f*_γ > O*_αβ − O*_αγ,
we have, from (11),

P(o, q* | Λ_αβ) / P(o, q* | Λ_αγ) = ((1−η)M/(2η))^{N(f*_β − f*_γ)} ((1−η)M/(4η))^{−(O*_αβ − O*_αγ)}
 = 2^{N(f*_β − f*_γ)} ((1−η)M/(4η))^{N(f*_β − f*_γ) − (O*_αβ − O*_αγ)} > 1
⟹ P(o | Λ_αβ) > P(o | Λ_αγ) (under assumption A1).

Case-b2:
Overlap-score_1(γ, α) > Overlap-score_1(β, α) and Overlap-score_2(γ, α) > Overlap-score_2(β, α).
Since
Overlap-score_2(γ, α) > Overlap-score_2(β, α)
⟹ 2N f*_γ − O*_αγ > 2N f*_β − O*_αβ
⟹ 2(N f*_β − N f*_γ) < O*_αβ − O*_αγ.

Let N f*_β − N f*_γ = x. Then we can write O*_αβ − O*_αγ = 2x + ξ, where ξ > 0. Since we assume η < M/(M + 8), we have (1−η)M/(8η) > 1. Now, from (11),

P(o, q* | Λ_αβ) / P(o, q* | Λ_αγ) = ((1−η)M/(2η))^x ((1−η)M/(4η))^{−(2x + ξ)}
 = 2^x ((1−η)M/(4η))^{−(x + ξ)}
 = 1 / ( ((1−η)M/(8η))^x ((1−η)M/(4η))^ξ ) < 1
⟹ P(o | Λ_αγ) > P(o | Λ_αβ) (under A1).

The results presented here provide statistical justification for the algorithm presented in the previous section, where we select episodes based on their overlap-score as given by (3). Suppose we have selected only α and want to choose either β or γ as our second episode. Based on (3), this choice depends on the sign of (N − 1)(f*_β − f*_γ) − (O*_αβ − O*_αγ), which is a figure of merit motivated by considerations of coding efficiency. This is essentially the same as the difference of Overlap-score_1 between β and γ, which is a figure of merit that determines which pair of episodes maximizes the data likelihood.
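A small numeric sketch of the comparison in (11) is given below. All values (alphabet size, noise parameter, frequencies and overlaps) are illustrative only; it merely demonstrates the claim above that β can beat γ despite a lower frequency when its overlap with α is much smaller.

```python
import math

def log_likelihood_ratio(n, eta, M, f_beta, f_gamma, O_ab, O_ag):
    """log of equation (11): log[ P(o,q*|Lambda_ab) / P(o,q*|Lambda_ag) ] for two
    candidate second episodes beta and gamma of the same size n, given their
    maximal non-overlapped frequencies and their overlaps with alpha."""
    a = math.log((1 - eta) * M / (2 * eta))  # gain per unshared episode event
    c = math.log((1 - eta) * M / (4 * eta))  # penalty factor for shared events
    return n * (f_beta - f_gamma) * a + (O_ag - O_ab) * c

# beta is less frequent than gamma but overlaps alpha far less (M = 50, eta = 0.05):
print(log_likelihood_ratio(n=3, eta=0.05, M=50,
                           f_beta=8, f_gamma=9, O_ab=1, O_ag=12) > 0)  # True
```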
APPLICATION TO TEXT CLASSIFICATION

In this section we present a novel application of our method of finding a 'good' subset of frequent episodes to characterize data. The application is in the domain of text classification. Most text classification techniques use a bag-of-words approach, where each document (or data sample) is represented as a collection of words that belong to a dictionary. The dictionary is usually taken as the set of all unique words present in the training corpus after preliminary preprocessing. This makes the size of the dictionary large, leading to high dimensionality of the feature vector representation of each document. Other vector space representations of documents (e.g., word-averaging in [20]) also depend largely on the dictionary of words used.

One can think of a text document as a sequence of events with event types being the words. Then, using all training data in an unsupervised fashion, we can use our method to find the 'best' subset of serial episodes that represents the data well. These episodes are likely to contain all specific words that are important for this document collection. Thus, a dictionary built using only the words (event-types) found in the subset of discovered episodes is likely to be useful. This is what we explore here.

Let the dictionary obtained by using all unique words (after usual preprocessing) from the training data collection be termed
Dictionary-I. We run our algorithm for discovering the 'best' subset of serial episodes (those achieving the best data compression) on the entire training corpus. We form a new dictionary as the set of all the unique words (event-types) that are present in the non-singleton episodes (i.e., episodes of size 2 or more) discovered by our algorithm. We call this smaller-sized dictionary Dictionary-II. In each case we represent documents as vectors over one of these dictionaries and investigate standard classifiers such as Naive Bayes and SVM. Using simulations on some standard benchmark datasets, we show that we get a large dimensionality reduction without any loss of accuracy by the classifier.

Typically, in training data for text classification, we have many documents but each document is short. Mining for episodes that can achieve significant compression individually for each document does not give any interesting episodes, mainly because each sequence is short. We string together all training data (of all classes) to make one long document, and we mine for a set of frequent serial episodes that achieves the best compression (using the algorithm discussed in this paper). We employ special symbols to denote the end of each training document and modify our mining algorithm so that no occurrence of an episode spans two different documents.
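A schematic sketch of this pipeline is given below. The function mine_compressing_episodes is only a stand-in name for the episode miner of the previous section (it is not an existing library call), and the sentinel symbol is our own choice.

```python
END_DOC = "<eod>"  # sentinel symbol; no episode occurrence may span it

def corpus_to_sequence(tokenized_docs):
    """String all (already preprocessed) training documents into one long event
    sequence of words, separated by an end-of-document sentinel."""
    seq = []
    for doc in tokenized_docs:
        seq.extend(doc)
        seq.append(END_DOC)
    return seq

def dictionary_II(episodes):
    """Dictionary-II: all unique words appearing in the discovered non-singleton
    episodes (episodes with two or more nodes)."""
    return {w for ep in episodes if len(ep) >= 2 for w in ep}

# `mine_compressing_episodes` stands for a CSC-2-style miner returning episodes
# as tuples of words, e.g. [("neural", "network"), ("the",)]:
# episodes = mine_compressing_episodes(corpus_to_sequence(train_docs), max_gap=T_g)
# vocab = dictionary_II(episodes)
```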
We compared the classification accuracies on three standard benchmarks, 20-Newsgroups, Reuters-21578 and WebKB, downloaded from a publicly available repository of datasets for single-label text categorization. We used the preprocessed stemmed versions of these datasets. For Reuters-21578, we use the 8-class stemmed version of the dataset; WebKB is a 4-class dataset, while 20-Newsgroups is a 20-class dataset. For these, Dictionary-I is the set of all unique words present in the stemmed training data. Apart from these, we also used the movie-review dataset prepared by Pang and Lee (2004); we used the polarity dataset v2.0. This sentiment analysis dataset consists of 2000 movie reviews. As preprocessing steps, we converted all letters to lower case and removed all words less than 3 characters long. No stop words except 'and' and 'the' were removed. Dictionary-I was created from this preprocessed training data.
We compared the text classification accuracies using two different representations:
• Bag-of-words (BoW): Each data sample is converted into a feature vector of dimension equal to the size of the corresponding dictionary used. Each feature represents the frequency of that word in that data sample, except for the Movie Review dataset, where, as in [21], we used binary features denoting presence or absence of the word in the corresponding document. Further, tf-idf weighting along with cosine normalization was applied to these feature vectors, as explained in the next subsection.
• Average Embedding (VecAvg) [20]: Word2vec is used to produce the word embeddings, and each text is then represented as the average of the embeddings of all the words present in that text. In case of Dictionary-II, averaging of word embeddings was done only for words which were part of the dictionary and the rest were ignored. In case of Movie Reviews and 20-Newsgroups, the pretrained GoogleNews vectors were used, whereas for the other two datasets (since these were stemmed), the model was trained with the gensim library with parameters vector size = 200 and window = 5.

Term frequency-inverse document frequency (tf-idf) is a numerical statistic which is good at quantifying the importance of a word to a document in a collection. Let wf(w, d) denote the frequency (that is, the number of occurrences) of a word w in a document d. Instead of using this raw frequency as the feature value, we use a modified word frequency defined by

Modified-wf(w, d) = wf(w, d) * idf(w),

where the inverse document frequency, idf(w), is given as

idf(w) = log((1 + n_d) / (df(w) + 1)).

Here, n_d is the total number of documents and df(w) is the number of documents that contain the word w. We use this modified frequency of each word, Modified-wf(w, d), as the feature value. The feature vectors were further cosine normalized.
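The weighting just described can be sketched as follows; this is a minimal illustration of the formulas above, not the exact preprocessing code used for the experiments.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Cosine-normalized tf-idf features as described above: each document is a
    list of words; idf(w) = log((1 + n_d) / (df(w) + 1))."""
    n_d = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log((1 + n_d) / (df[w] + 1)) for w in df}
    vectors = []
    for doc in docs:
        wf = Counter(doc)
        vec = {w: wf[w] * idf[w] for w in wf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors

print(tfidf_vectors([["good", "movie"], ["bad", "movie"], ["good", "plot"]]))
```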
We compare the classification accuracies obtained using our proposed Dictionary-II with those obtained with Dictionary-I. For the BoW and VecAvg representations, we present results using linear SVM. For BoW, Naive Bayes (NB) results are also presented for comparison with accuracies reported in the literature. For the Movie Review dataset, we present the mean value corresponding to ten-fold cross validation on the original folds introduced in [21].
Table 2: Number of discovered (non-singleton) episodes and sizes of the two dictionaries

Dataset          Number of discovered episodes   Size of Dict-I   Size of Dict-II
Reuters-21578    ·                                ·                ·
WebKB            ·                                ·                ·
20-Newsgroups    ·                                ·                ·
Movie Review     ·                                ·                ·
Table 2 shows the sizes of the two dictionaries for the different datasets. The number of episodes reported in Table 2 is the number of non-singleton episodes. As can be seen from the table, the size of Dictionary-II is almost a fourth of that of Dictionary-I in the case of WebKB; for the other datasets it is about one eighth to one tenth. Thus, this method results in a very significant reduction in dictionary size (and hence in feature vector dimension).

The classification accuracies obtained with the different dictionaries are shown in Tables 3-4. Table 3 shows accuracies and F-measures with a linear SVM classifier under the VecAvg representation, while Table 4 shows these for Naive Bayes and linear SVM classifiers under the BoW representation. We did not try any nonlinear SVM because all other studies on these benchmark data sets reported only accuracies with linear SVM.
Table 3: Accuracy and F-measure (%) with linear SVM under the VecAvg representation

Dataset          Accuracy (Dict-I)   Accuracy (Dict-II)   F-measure (Dict-I)   F-measure (Dict-II)
Reuters-21578    ·                   ·                    ·                    ·
WebKB            ·                   ·                    ·                    ·
20-Newsgroups    ·                   ·                    ·                    ·
Movie Review     ·                   ·                    ·                    ·
Table 4: Accuracy (%) with Naive Bayes (NB) and linear SVM under the BoW representation

Dataset          Classifier   Dict-I   Dict-II
Reuters-21578    NB           ·        ·
                 SVM          97.03    ·
WebKB            NB           83.52    ·
                 SVM          ·        ·
20-Newsgroups    NB           ·        ·
                 SVM          ·        ·
Movie Review     NB           ·        ·
                 SVM          ·        ·
Table 5: Mean (and standard deviation) of classification accuracy with Naive Bayes using different dictionaries (BoW representation)

Dataset          Dictionary-I   Dictionary-II
Reuters-21578    · (± ·)        · (± ·)
WebKB            · (± ·)        · (± ·)
20-Newsgroups    · (± ·)        80.36 (± ·)

Table 6: Mean (and standard deviation) of classification accuracy with linear SVM using different dictionaries (BoW representation)

Dataset          Dictionary-I   Dictionary-II
Reuters-21578    · (± ·)        · (± ·)
WebKB            · (± ·)        · (± ·)
20-Newsgroups    · (± ·)        82.67 (± ·)

As is easy to see, the accuracy and F-measure scores (under both the BoW and the VecAvg representations) achieved by either classifier with the different dictionaries are mostly very close. Thus we can conclude that our frequent-episodes-based method allows us to get a very large reduction in dictionary size without any significant change in the classification accuracy. (We also note that the accuracy of our Dictionary-I in Table 4 is consistent with the bag-of-words accuracy reported in [22] and [23].)

The above results are with the train-test split as given in the original datasets. For the BoW representation, we also generated 3 random splits for the datasets
Reuters-21578, WebKB and 20-Newsgroups, having the same train-test distribution of each class as in the original split. The results (showing averages and standard deviations) are presented in Tables 5-6. Once again, the results clearly show that there are no significant differences between the accuracies achieved with the two dictionaries.

For the BoW representation in this document classification application, our method of learning a dictionary results in a significant decrease in feature vector dimension. But this method is quite different from generic dimensionality reduction techniques such as PCA. With PCA we may get dimensionality reduction by choosing certain linear combinations of earlier features. With the original feature vector dimension being in the tens of thousands, the new features obtained as such linear combinations would not be semantically interpretable. However, our data mining method essentially decides which words of
Dictionary-I are to be retained (and which are to be rejected). Thus this method is essentially a feature selection method rather than a dimensionality reduction method. Hence, the dimensionality reduction achieved here is semantically interpretable. To get a feel for what the data mining does, we present in Table 7 a sample of words that are retained and rejected by our method in the case of the Movie Review and WebKB datasets. The words shown are hand-picked, but only from a set of 1000 randomly selected words. It is easy to see that this makes good semantic sense. For example, in Movie Review we reject many movie-related words like 'stunts', 'theatre', 'performances' etc. which, while they may appear in the reviews, may not carry any information regarding the sentiment of the review. On the other hand, we retain words like 'hilarious', 'boring', 'surprised' etc. that can carry sentiment information. Similar comments apply to the WebKB dataset (e.g., selected words like 'prerequisite', 'introductory', 'project' can be commonly found on a project or course web-page and hence they may carry discriminative information).
Dataset: Movie Review (labels = 'positive sentiment', 'negative sentiment')
  Rejected words: stunts, theatre, cinematographer, moviestar, directorship, producers, storyteller, scripts, spotlight, audition, auditorium, backstage, torrent, reviewer, performances, entertainment.
  Selected words: enjoyable, funny, hilarious, entertaining, superb, boring, sleepy, disappointed, twists, clever, impressed, surprised, liked, interested, awful, pleasing, miserably, dumber, interesting, impressive, intelligent, fantastic.

Dataset: WebKB (labels = 'project', 'student', 'faculty', 'course')
  Rejected words: chemistry, cryptography, probabilistic, lagrangian, arithmetic, scholarship, bibtex, manuscript, newsletter, computer, interdisciplinary, mathematician, biotechnology, accuracy, baseline, neurocomputing, gaussian.
  Selected words: syllabus, internet, introductory, prerequisite, research, bibliography, professor, student, quiz, exercise, credit, query, tutor, project, phd, fellowship, conference, curriculum, scientist, magazine, instructor, theorem, homework, examination, semester, journal, homepage.
Table 7: Sample words from the set of rejected and selected words for Dictionary-II

Thus, the data mining method (based on finding episodes for compressing data) seems to be effective in picking a dictionary that is relevant to the text corpus.
CONCLUSIONS
In this paper we considered the problem of discovering a small set of serial episodes to characterize sequential data. We extended the existing CSC-2 algorithm of [6] to work with non-overlapped frequency.

Our main contribution is a novel HMM-based generative model for pairs of episodes. The model generates very general output sequences in which the two episodes are the most prominent frequent episodes (under non-overlapped frequency). The model is very intuitive. The symbols emitted from episode states constitute the 'model-based' occurrences of the episodes. The noise states can emit any symbol, and hence symbols emitted from the noise states can be thought of as the distracting signal that may mask real episodes and contribute spurious frequent episodes. The transition structure is also intuitively motivated: from any state, the transition into a noise state has probability η, and the remaining probability is equally divided between all reachable episode states. For this model class we showed that the episode-pair model with the best likelihood for the data sequence is determined both by the frequencies of the episodes and by the overlaps between their occurrences. The analytical expressions we derived for the data likelihoods provide statistical justification for our algorithm for selecting a subset of episodes.

The CSC-2 algorithm is motivated by the MDL principle. Using an intuitively appealing coding scheme to encode data using episodes, the algorithm finds a subset of episodes that maximizes the data compression achieved. It essentially picks episodes incrementally based on the so-called overlap score, which depends both on the frequency of the episode and on the extent of overlap of its occurrences with those of already selected episodes. Our HMM-based model provides some statistical justification for this strategy used by the algorithm.

A generative model for sequential data that captures interactions of two episodes, as well as using it to justify an MDL-based algorithm for frequent episodes, are both novel contributions of this paper. As mentioned in Section 1, there have been many algorithms, motivated by the MDL philosophy, for succinctly characterizing data using a small set of frequent patterns. However, all such algorithms for sequential data are heuristic in nature. We believe that the HMM model we presented here is a good first step in developing a statistical theory for MDL-based algorithms that find a good subset of frequent episodes.

Another important contribution of this paper is a novel application of frequent episode mining to text classification. We view a text document as a sequence of events with event types being the words. Then we find the subset of episodes that best characterizes the entire text corpus in terms of data compression. The words appearing in this subset of frequent episodes are likely to give us the most informative words for the corpus, and hence we use only these words to form the dictionary over which the documents are represented as vectors. Thus the method amounts to learning a context-sensitive dictionary using the idea of frequent pattern mining. Also, since our data mining method does not need any user-specified hyperparameters, the same is true for this method of dimensionality reduction. To the best of our knowledge this is the first instance of an application of frequent pattern methods to dictionary learning. As we showed through extensive simulations, the method results in a many-fold decrease in the size of the dictionary without compromising the classification accuracy.
Also, as can beseen from the examples of retained and rejected words, themethod seems to be quite effective in learning a good subsetof words.The HMM model we presented is for pairs of episodes.While it is, in principle, extendable to any number ofepisodes, notationally it would be very complex. A goodextension of the work presented here is in the direction ofextending these analytical techniques to arbitrary number ofepisodes. Generative models can, in general, be used for as-sessing statistical significance of the frequency of an episode(e.g., [14]). Since the model introduced here also accountsfor interactions among episodes, it should be usable forquestions such as whether or not the observed frequenciesof two episodes would make both of them significant giventhe extent of overlap between their occurrences. This is alsoa useful direction in which the work presented here can beextended. R EFERENCES [1] C. C. Aggarwal and J. Han,
[1] C. C. Aggarwal and J. Han, Frequent Pattern Mining. Springer, 2014.
[2] J. Vreeken, M. van Leeuwen, and A. Siebes, "Krimp: mining itemsets that compress," Data Mining and Knowledge Discovery, vol. 23, no. 1, pp. 169-214, 2011.
[3] N. Tatti and J. Vreeken, "The long and the short of it: summarising event sequences with serial episodes," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 462-470.
[4] M. Mampaey, N. Tatti, and J. Vreeken, "Tell me what I need to know: succinctly summarizing data with itemsets," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 573-581.
[5] H. T. Lam, F. Mörchen, D. Fradkin, and T. Calders, "Mining compressing sequential patterns," Statistical Analysis and Data Mining, vol. 7, no. 1, pp. 34-52, 2014.
[6] A. Ibrahim, S. Sastry, and P. S. Sastry, "Discovering compressing serial episodes from event sequences," Knowledge and Information Systems, vol. 47, no. 2, pp. 405-432, 2016.
[7] A. Bhattacharyya and J. Vreeken, "Efficiently summarizing event sequences with rich interleaving patterns," in Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017.
[8] Q. Fan, Y. Li, D. Zhang, and K.-L. Tan, "Discovering newsworthy themes from sequenced data: A step towards computational journalism," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 7, pp. 1398-1411, 2017.
[9] H. Mannila, H. Toivonen, and A. Inkeri Verkamo, "Discovery of frequent episodes in event sequences," Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 259-289, 1997.
[10] R. Gwadera, M. J. Atallah, and W. Szpankowski, "Reliable detection of episodes in event sequences," Knowledge and Information Systems, vol. 7, no. 4, pp. 415-437, 2005.
[11] N. Tatti, "Significance of episodes based on minimal windows," in Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on. IEEE, 2009, pp. 513-522.
[12] R. Gwadera and F. Crestani, "Ranking sequential patterns with respect to significance," Advances in Knowledge Discovery and Data Mining, pp. 286-299, 2010.
[13] C. Low-Kam, C. Raïssi, M. Kaytoue, and J. Pei, "Mining statistically significant sequential patterns," in Data Mining (ICDM), 2013 IEEE 13th International Conference on. IEEE, 2013, pp. 488-497.
[14] S. Laxman, P. S. Sastry, and K. P. Unnikrishnan, "Discovering frequent episodes and learning hidden Markov models: A formal connection," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1505-1517, 2005.
[15] P. D. Grünwald, The Minimum Description Length Principle, Vol. 1. Cambridge, MA: MIT Press, 2007.
[16] M. van Leeuwen and J. Vreeken, "Mining and using sets of patterns through compression," in Frequent Pattern Mining, C. C. Aggarwal and J. Han, Eds. Springer, 2014, ch. 8.
[17] X. Ao, P. Luo, J. Wang, F. Zhuang, and Q. He, "Mining precise-positioning episode rules from event sequences," IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 3, pp. 530-543, 2018.
[18] A. Achar, S. Laxman, and P. S. Sastry, "A unified view of the apriori-based algorithms for frequent episode discovery," Knowledge and Information Systems, vol. 31, no. 2, pp. 223-250, 2012.
[19] S. Laxman, V. Tankasali, and R. W. White, "Stream prediction using a generative model based on frequent episodes in event sequences," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 453-461.
[20] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1631-1642.
[21] B. Pang and L. Lee, "A sentimental education: Sentiment analysis using subjectivity," in Proceedings of ACL, 2004, pp. 271-278.
[22] A. Cardoso-Cachopo, "Improving Methods for Single-label Text Categorization," PhD Thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, 2007.
[23] S. Wang and C. D. Manning, "Baselines and bigrams: Simple, good sentiment and topic classification," in