On mining complex sequential data by means of FCA and pattern structures

Abstract

Nowadays data sets are available in very complex and heterogeneous ways. Mining of such data collections is essential to support many real-world applications ranging from healthcare to marketing. In this work, we focus on the analysis of "complex" sequential data by means of interesting sequential patterns. We approach the problem using the elegant mathematical framework of Formal Concept Analysis (FCA) and its extension based on "pattern structures". Pattern structures are used for mining complex data (such as sequences or graphs) and are based on a subsumption operation, which in our case is defined with respect to the partial order on sequences. We show how pattern structures along with projections (i.e., a data reduction of sequential structures), are able to enumerate more meaningful patterns and increase the computing efficiency of the approach. Finally, we show the applicability of the presented method for discovering and analyzing interesting patient patterns from a French healthcare data set on cancer. The quantitative and qualitative results (with annotations and analysis from a physician) are reported in this use case which is the main motivation for this work. Keywords: data mining; formal concept analysis; pattern structures; projections; sequences; sequential data.


Aleksey Buzmakov (a, c, ∗∗), Elias Egho (b, †), Nicolas Jay (a), Sergei O. Kuznetsov (c), Amedeo Napoli (a), Chedy Raïssi (a)

(a) Orpailleur, LORIA (CNRS – Inria NGE – U. de Lorraine), Vandoeuvre-lès-Nancy, France; (b) Orange Labs, Lannion, France; (c) National Research University Higher School of Economics, Moscow, Russia


∗∗ Corresponding author.
† Elias Egho was in LORIA (Vandoeuvre-lès-Nancy, France) when this work was done.

Sequence data is present and used in many applications. Mining sequential patterns from sequence data has become an important data mining task. In the last two decades, the main emphasis has been on developing efficient mining algorithms and effective pattern representations [Han et al., 2000, Pei et al., 2001a, Yan et al., 2003, Ding et al., 2009, Raïssi et al., 2008]. However, one problem with traditional sequential pattern mining algorithms (and generally with all pattern enumeration algorithms) is that they generate a large number of frequent sequences, of which only a few are truly relevant. To tackle this challenge, recent studies try to enumerate patterns using alternative interestingness measures or by sampling representative patterns. A general idea in finding statistically significant patterns is to extract patterns whose characteristics for a given measure, such as frequency, strongly deviate from the expected value under a null model, i.e. the value expected from the distribution of all data. In this work, we focus on complementing the statistical approaches with a sound algebraic approach, trying to answer the following question: can we develop a framework for enumerating only relevant patterns based on data lattices and their associated measures?

The above question can be answered by addressing the problem of analyzing sequential data using the framework of Formal Concept Analysis (FCA), a mathematical approach to data analysis [Ganter and Wille, 1999], and pattern structures, an extension of FCA that handles complex data [Ganter and Kuznetsov, 2001]. To analyze a dataset of "complex" sequences while avoiding the classical efficiency bottlenecks, we introduce and explain the usage of projections, which are mathematical mappings for defining approximations. Projections for sequences allow one to reduce the computational costs and the volume of enumerated patterns, avoiding the infamous "pattern flooding". In addition, we provide and discuss several measures, such as stability, to rank patterns with respect to their "interestingness", giving an expert an order in which the patterns may be efficiently analyzed.

In this paper, we develop a novel, rigorous and efficient approach for working with sequential pattern structures in formal concept analysis. The main contributions of this work can be summarized as follows:

• Pattern structure specification and analysis. We propose a novel way of dealing with sequences based on complex alphabets by mapping them to pattern structures. The genericity of pattern structures allows our approach to be directly instantiated with state-of-the-art FCA algorithms, making the final implementation flexible, accurate and scalable.

• "Projections" for sequential pattern structures. Projections significantly decrease the number of patterns, while preserving the most interesting ones for an expert. Projections are built to answer questions that an expert may have. Moreover, combinations of projections and the concept stability index provide an efficient tool for the analysis of complex sequential datasets. The second advantage of projections is their ability to significantly decrease the complexity of a problem, thus saving computational time.

• Experimental evaluations.

We evaluate our approach on a real sequence dataset of a regional healthcare system. The data set contains ordered sets of hospitalizations for cancer patients with information about the hospitals they visited, causes for the hospitalizations and medical procedures. These ordered sets are considered as sequences. The experiments reveal interesting (from a medical point of view) and useful patterns, and show the feasibility and the efficiency of our approach.

This paper is an extension of the work presented at the CLA'14 conference [Buzmakov et al., 2013]. The main differences w.r.t. the CLA'14 paper are a more complete explanation of the mathematical framework and a new experimental part evaluating different aspects of the introduced framework.

The paper is organized as follows. Section 2 introduces formal concept analysis and pattern structures. The specification of pattern structures for the case of sequences is presented in Section 3. Section 4 describes projections of sequential pattern structures, followed in Section 5 by the evaluation and experiments. Finally, related works are discussed before concluding the paper.

FCA is a formalism that can be used for guiding data analysis and knowledge discovery [Ganter and Wille, 1999]. FCA starts with a formal context and builds a set of formal concepts organized within a concept lattice. A formal context is a triple $(G, M, I)$, where $G$ is a set of objects, $M$ is a set of attributes and $I$ is a relation between $G$ and $M$, $I \subseteq G \times M$. In Table 1, a cross table for a formal context is shown.

Table 1: A toy FCA context (objects $g_1, \dots, g_4$, attributes $m_1, \dots, m_4$).

A Galois connection between $G$ and $M$ is defined as follows:

$$A' = \{ m \in M \mid \forall g \in A,\ (g, m) \in I \}, \quad A \subseteq G$$

$$B' = \{ g \in G \mid \forall m \in B,\ (g, m) \in I \}, \quad B \subseteq M$$

The Galois connection maps a set of objects to the maximal set of attributes shared by all objects and reciprocally. For example, in Table 1 the description of a certain pair of objects is a single attribute, while the set of objects having this attribute contains three objects, i.e. the pair is not maximal. Given a set of objects $A$, we say that $A'$ is the description of $A$.

Figure 1: Concept lattice for the toy context.

Definition 1. A formal concept is a pair $(A, B)$, where $A \subseteq G$ is a subset of objects, $B \subseteq M$ is a subset of attributes, such that $A' = B$ and $A = B'$, where $A$ is called the extent of the concept, and $B$ is called the intent of the concept.

A formal concept corresponds to a pair of maximal sets of objects and attributes, i.e. it is not possible to add an object or an attribute to the concept without violating the maximality property. For example, the three objects of the previous example together with their single shared attribute form a formal concept. Formal concepts can be partially ordered w.r.t. the extent inclusion (dually, intent inclusion). For example, in Figure 1 a concept with a single object and two attributes lies below this three-object concept. This partial order of concepts is shown in Figure 1. The number of formal concepts for a given context can be exponential w.r.t. the cardinality of the set of objects or the set of attributes. It is easy to see that for the context $(G, G, I_G)$, where $I_G = \{ (x, y) \mid x \in G, y \in G, x \neq y \}$, the number of concepts is equal to $2^{|G|}$.

The number of concepts in a lattice for real-world tasks can be large. To find the most interesting subset of concepts, different measures can be used, such as the stability of the concept [Kuznetsov, 2007] or the concept probability and separation [Klimushkin et al., 2010]. These measures help to extract the most interesting concepts. However, the latter two are less reliable on noisy data.
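The derivation operators and Definition 1 can be illustrated with a minimal Python sketch. The cross table `TOY` below is a hypothetical context chosen for illustration (it is not claimed to be Table 1), and the function names are our own. The brute-force enumeration closes every subset of objects, so it is only meant for very small contexts.

```python
from itertools import combinations


def common_attributes(context, objects):
    """A': attributes shared by all objects in `objects` (all of M if empty)."""
    shared = set().union(*context.values())
    for g in objects:
        shared &= context[g]
    return frozenset(shared)


def objects_having(context, attributes):
    """B': objects whose rows contain every attribute in `attributes`."""
    return frozenset(g for g, row in context.items() if attributes <= row)


def formal_concepts(context):
    """All (extent, intent) pairs, found by closing every subset of objects."""
    concepts = set()
    for size in range(len(context) + 1):
        for subset in combinations(sorted(context), size):
            intent = common_attributes(context, subset)
            concepts.add((objects_having(context, intent), intent))
    return concepts


# Hypothetical cross table (an assumption, not Table 1 of the paper).
TOY = {"g1": {"m1", "m4"}, "g2": {"m2", "m4"}, "g3": {"m3"}, "g4": {"m3", "m4"}}

# The context (G, G, "not equal") from the text: every subset of G is an extent.
G = ["a", "b", "c"]
NOT_EQUAL = {x: {y for y in G if y != x} for x in G}

if __name__ == "__main__":
    print(sorted(common_attributes(TOY, {"g1", "g2"})))
    print(len(formal_concepts(TOY)))
    print(len(formal_concepts(NOT_EQUAL)))
```

On the `NOT_EQUAL` context with three objects the enumeration returns $2^3 = 8$ concepts, in line with the $2^{|G|}$ count stated above.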

Definition 2. Given a concept $c$, the concept stability $\mathrm{Stab}(c)$ of $c$ is the relative number of subsets of the concept extent (denoted $\mathrm{Ext}(c)$), whose description, i.e. the result of $(\cdot)'$, is equal to the concept intent (denoted $\mathrm{Int}(c)$).

$$\mathrm{Stab}(c) := \frac{|\{ s \in \wp(\mathrm{Ext}(c)) \mid s' = \mathrm{Int}(c) \}|}{|\wp(\mathrm{Ext}(c))|} \tag{1}$$

Here $\wp(P)$ is the powerset of $P$. Stability measures how a concept depends on the objects in its extent. The larger the stability, the more combinations of objects can be deleted from the context without affecting the intent of the concept, i.e. the intent of the most stable concepts is likely to be a characteristic pattern of a given phenomenon and not an artifact of a dataset. Of course, stable concepts still depend on the dataset, and, consequently, some important information can be contained in the unstable concepts. However, stability can be considered a good heuristic for selecting concepts, because the more stable a concept is, the less it depends on the given dataset w.r.t. object removal.

Table 2: A toy formal context.

$$\begin{array}{c|cccccc}
 & m_1 & m_2 & m_3 & m_4 & m_5 & m_6 \\ \hline
g_1 & \times & & & & & \times \\
g_2 & & \times & & & & \times \\
g_3 & & & \times & & & \times \\
g_4 & & & & \times & & \times \\
g_5 & & & & & \times &
\end{array}$$

Figure 2: Concept lattice for the context in Table 2 with corresponding stability indexes (intents that are not given are written $*$): the five single-object concepts $(\{g_i\}; *)$ have stability 0.5, the concept $(\{g_1, g_2, g_3, g_4\}; \{m_6\})$ has stability 0.69, the top concept $(\{g_1, g_2, g_3, g_4, g_5\}; *)$ has stability 0.47 and the bottom concept $(\emptyset; *)$ has stability 1.0.
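Definition 2 can be checked by brute force on a small context. The sketch below assumes one reading of Table 2, namely that each object $g_i$ has attribute $m_i$ and that $g_1, \dots, g_4$ additionally share $m_6$; this reading and the function names are assumptions made for illustration. Under it, the computed values agree with the stability indexes reported in Figure 2.

```python
from itertools import combinations

# Assumed reading of Table 2: g_i has m_i, and g1..g4 also have m6.
CONTEXT = {
    "g1": {"m1", "m6"},
    "g2": {"m2", "m6"},
    "g3": {"m3", "m6"},
    "g4": {"m4", "m6"},
    "g5": {"m5"},
}
ALL_ATTRIBUTES = frozenset().union(*CONTEXT.values())


def description(objects):
    """Attributes shared by all objects in `objects` (all attributes if empty)."""
    shared = set(ALL_ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return frozenset(shared)


def stability(extent):
    """Equation (1): share of subsets of the extent whose description is the intent.

    `extent` is expected to be the extent of a concept; the loop visits all
    2^|extent| subsets, so this is only usable for small extents.
    """
    extent = sorted(extent)
    intent = description(extent)
    hits = 0
    for size in range(len(extent) + 1):
        for subset in combinations(extent, size):
            if description(subset) == intent:
                hits += 1
    return hits / 2 ** len(extent)


if __name__ == "__main__":
    print(stability({"g1", "g2", "g3", "g4"}))  # 11 of 16 subsets
    print(stability(set(CONTEXT)))              # 15 of 32 subsets
    print(stability({"g1"}))                    # 1 of 2 subsets
```

The exhaustive loop makes the exponential cost of exact stability visible, which is the motivation for estimating it instead.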

Example 1. Figure 2 shows a lattice for the context in Table 2; for simplicity some intents are not given. The extent of the outlined concept $c$ is $\mathrm{Ext}(c) = \{g_1, g_2, g_3, g_4\}$, thus its powerset contains $2^4$ elements. The descriptions of 5 subsets of $\mathrm{Ext}(c)$ ($\{g_1\}, \dots, \{g_4\}$ and $\emptyset$) are different from $\mathrm{Int}(c) = \{m_6\}$, while all other subsets of $\mathrm{Ext}(c)$ have a common description equal to $\{m_6\}$. So, $\mathrm{Stab}(c) = \frac{2^4 - 5}{2^4} \approx 0.69$.

One of the fastest algorithms for computing stability from a concept lattice $L$ is proposed in [Roth et al., 2008], with a worst-case complexity of $O(|L|^2)$, where $|L|$ is the size of the concept lattice. The experimental section shows that for a big lattice, the stability computation can take much more time than the construction of the concept lattice. Thus, the estimation of concept stability is an important question. Here we present an efficient way for such an estimation. It should be noticed that in a lattice the extent of any ancestor of a concept $c$ is a superset of the extent of $c$, while the extent of any descendant is a subset. Given a concept $c$ and an immediate descendant $d$, we have $\forall s \subseteq \mathrm{Ext}(d)$, $s'' \subseteq \mathrm{Ext}(d)$, which means that $s' \supseteq \mathrm{Int}(d) \supset \mathrm{Int}(c)$, i.e. $s' \neq \mathrm{Int}(c)$. Thus, in the computation of the numerator of stability in (1) we can exclude all subsets of the extent of a direct descendant of $c$, and the following bound holds:

$$\mathrm{Stab}(c) \le 1 - \max_{d \in DD(c)} \frac{1}{2^{\Delta(c, d)}}, \tag{2}$$

where $DD(c)$ is the set of all direct descendants of $c$ and $\Delta(c, d)$ is the size of the set-difference between the extent of $c$ and the extent of $d$, $\Delta(c, d) = |\mathrm{Ext}(c) \setminus \mathrm{Ext}(d)|$.

Example 2. With the help of (2) we can find all stable concepts (and some unstable ones), i.e. the concepts with a high stability w.r.t. a threshold $\theta$. If $\theta = 0.97$, we should compute for each concept $c$ in the lattice the value $md(c) = \min_{d \in DD(c)} \Delta(c, d)$ and then select the concepts verifying $md(c) \ge -\log_2(1 - 0.97) \approx 5.06$.

Although FCA applies to binary contexts, more complex data such as sequences or graphs can be directly processed as well. For that, pattern structures were introduced in Ganter and Kuznetsov [2001].

Definition 3. A pattern structure is a triple $(G, (D, \sqcap), \delta)$, where $G$ is a set of objects, $(D, \sqcap)$ is a complete meet-semilattice of descriptions and $\delta : G \to D$ maps an object to a description.

The lattice operation in the semilattice ($\sqcap$) corresponds to the similarity between two descriptions. Standard FCA can be presented in terms of a pattern structure. In this case, $G$ is the set of objects, the semilattice of descriptions is $(\wp(M), \sqcap)$ and a description is a set of attributes, with the $\sqcap$ operation corresponding to the set intersection ($\wp(M)$ denotes the powerset of $M$). If $x = \{a, b, c\}$ and $y = \{a, c, d\}$ then $x \sqcap y = x \cap y = \{a, c\}$. The mapping $\delta : G \to \wp(M)$ is given by $\delta(g) = \{ m \in M \mid (g, m) \in I \}$, and returns the description for a given object as a set of attributes.

The Galois connection for a pattern structure $(G, (D, \sqcap), \delta)$ is defined as follows:

$$A^\diamond := \bigsqcap_{g \in A} \delta(g), \quad \text{for } A \subseteq G$$

$$d^\diamond := \{ g \in G \mid d \sqsubseteq \delta(g) \}, \quad \text{for } d \in D$$

The Galois connection makes a correspondence between sets of objects and descriptions. Given a subset of objects $A$, $A^\diamond$ returns the description which is common to all objects in $A$. Given a description $d$, $d^\diamond$ is the set of all objects whose description subsumes $d$. More precisely, the partial order (or the subsumption order) on $D$ ($\sqsubseteq$) is defined w.r.t. the similarity operation $\sqcap$: $c \sqsubseteq d \Leftrightarrow c \sqcap d = c$, and $c$ is subsumed by $d$.

Definition 4. A pattern concept of a pattern structure $(G, (D, \sqcap), \delta)$ is a pair $(A, d)$ where $A \subseteq G$ and $d \in D$ such that $A^\diamond = d$ and $d^\diamond = A$; $A$ is called the concept extent and $d$ is called the concept intent.

Table 3 (medical trajectories of three patients):

$$\begin{array}{c|l}
p^1 & \langle [H_1, \{a\}]; [H_1, \{c, d\}]; [H_1, \{a, b\}]; [H_1, \{d\}] \rangle \\
p^2 & \langle [H_2, \{c, d\}]; [H_3, \{b, d\}]; [H_3, \{a, d\}] \rangle \\
p^3 & \langle [H_4, \{c, d\}]; [H_4, \{b\}]; [H_4, \{a\}]; [H_4, \{a, d\}] \rangle
\end{array}$$

As in standard FCA, a pattern concept corresponds to the maximal set of objects $A$ whose description subsumes the description $d$, where $d$ is the maximal common description for objects in $A$. The set of all concepts can be partially ordered w.r.t. the partial order on extents (dually, on intent patterns, i.e. $\sqsubseteq$), within a concept lattice.

An example of pattern structures is given in Table 3, while the corresponding lattice is depicted in Figure 3.

As stability of concepts only depends on extents, it can be defined by the same procedure for both formal contexts and pattern structures.

Certain phenomena, such as a patient trajectory (clinical history), can be considered as a sequence of events. This section describes how FCA and pattern structures can process sequential data.
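Before moving to sequences, the Galois connection of a pattern structure defined above can be sketched generically, with the similarity operation passed in as a parameter. The sketch instantiates it with set intersection, i.e. standard FCA seen as a pattern structure; the objects in `DELTA` and all function names are hypothetical and serve only as an illustration.

```python
from functools import reduce


def description_of(objects, delta, meet):
    """A^diamond: meet of the descriptions of all objects in `objects` (non-empty)."""
    return reduce(meet, (delta[g] for g in objects))


def subsumed(c, d, meet):
    """c is subsumed by d exactly when meet(c, d) == c."""
    return meet(c, d) == c


def extent_of(d, delta, meet):
    """d^diamond: all objects whose description subsumes d."""
    return frozenset(g for g in delta if subsumed(d, delta[g], meet))


def set_meet(x, y):
    """Similarity of two attribute sets: their intersection."""
    return x & y


# Hypothetical objects; g1 and g2 carry the descriptions x and y from the text.
DELTA = {
    "g1": frozenset("abc"),
    "g2": frozenset("acd"),
    "g3": frozenset("bd"),
}

if __name__ == "__main__":
    common = description_of(["g1", "g2"], DELTA, set_meet)
    print(sorted(common))                              # shared description
    print(sorted(extent_of(common, DELTA, set_meet)))  # objects subsuming it
```

Only `meet` and `delta` change when another kind of description is used, which is the genericity that pattern structures are meant to provide.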

Imagine that we have medical trajectories of patients, i.e. sequences of hospitalizations, where every hospitalization is described by a hospital name and a set of procedures. An example of sequential data on medical trajectories with three patients is given in Table 3. We have a set of procedures $P = \{a, b, c, d\}$ and a set of hospital names $T_H = \{H_1, H_2, H_3, H_4, CL, CH, *\}$, where hospital names are hierarchically organized (by level of generality). $H_1$ and $H_2$ are central hospitals ($CH$), $H_3$ and $H_4$ are clinics ($CL$), and $*$ denotes the root of this hierarchy. The least common ancestor in this hierarchy is denoted by $h_1 \sqcap h_2$, for any $h_1, h_2 \in T_H$, e.g. $H_1 \sqcap H_2 = CH$. Every hospitalization is described by one hospital name and may contain several procedures. The procedure order in each hospitalization is not important in our case. For example, the first hospitalization $[H_2, \{c, d\}]$ of the second patient ($p^2$) was a stay in hospital $H_2$, and during this hospitalization the patient underwent procedures $c$ and $d$. An important task is to find the "characteristic" sequences of procedures and associated hospitals in order to improve hospitalization planning, optimize clinical processes or detect anomalies.

We approach the search for characteristic sequences by finding the most stable concepts in the lattice corresponding to a sequential pattern structure. To simplify calculations, subsequences are considered without "gaps", i.e. the order of non-consecutive elements is not taken into account. This is reasonable in this task because experts are interested in regular consecutive events in healthcare trajectories. A sequential pattern structure is a set of sequences and is based on the set of maximal common subsequences (without gaps) between two sequences. The next subsections define the partial order on sequences and the corresponding pattern structures.

A sequence is constituted of elements from an alphabet. The classical subsequence matching task requires no special properties of the alphabet. Several generalizations of the classical case were made by introducing a subsequence relation based on an itemset alphabet [Agrawal and Srikant, 1995] or on a multidimensional and multilevel alphabet [Plantevit et al., 2010]. Here, we generalize the previous cases by requiring the alphabet to form a semilattice $(E, \sqcap_E)$. (We should note that in this paper we consider two semilattices: the first one is related to the characters of the alphabet, $(E, \sqcap_E)$, and the second one is related to pattern structures, $(D, \sqcap)$.) Thanks to the formalism of pattern structures we are able to process in a unified way all types of sequential datasets with a poset-shaped alphabet (it is mentioned above that any partial order can be transformed into a semilattice). However, some sequential data can have connections between elements, e.g. [Adda et al., 2010], and thus cannot be straightforwardly processed by our approach.

Definition 5. Given a semilattice $(E, \sqcap_E)$, also called an alphabet, a sequence is an ordered list of elements from $E$. We denote it by $\langle e_1; e_2; \cdots; e_n \rangle$ where $e_i \in E$.

In this alphabet semilattice $(E, \sqcap_E)$ there is a bottom element $\bot_E$ that can be matched with any other element. Formally, $\forall e \in E$, $\bot_E = \bot_E \sqcap_E e$. This element is required by the lattice structure, but provides no useful information. Thus, it should be excluded from sequences. The bottom element of $E$ corresponds to the empty set in sequential mining [Agrawal and Srikant, 1995], and the empty set is always ignored in this domain.

Definition 6. A valid sequence $\langle e_1; \cdots; e_n \rangle$ is a sequence where $e_i \neq \bot_E$ for all $i \in \{1, \cdots, n\}$.

Definition 7. Given an alphabet $(E, \sqcap_E)$ and two sequences $t = \langle t_1; \dots; t_k \rangle$ and $s = \langle s_1; \dots; s_n \rangle$ based on $E$ ($t_q, s_p \in E$), the sequence $t$ is a subsequence of $s$, denoted $t \le s$, iff $k \le n$ and there exist $j_1, \dots, j_k$ such that $1 \le j_1 < j_2 < \dots$

In the running example (Section 3.1), the alphabet is $E = T_H \times \wp(P)$ with the similarity operation $(h_1, P_1) \sqcap (h_2, P_2) = (h_1 \sqcap h_2, P_1 \cap P_2)$, where $h_1, h_2 \in T_H$ are hospitals and $P_1, P_2 \in \wp(P)$ are sets of procedures. Thus, the sequence $ss^1 = \langle [CH, \{c, d\}]; [H_1, \{b\}]; [*, \{d\}] \rangle$ is a subsequence of $p^1 = \langle [H_1, \{a\}]; [H_1, \{c, d\}]; [H_1, \{a, b\}]; [H_1, \{d\}] \rangle$, because if we set $j_i = i + 1$ (Definition 7) then $ss^1_1 \sqsubseteq p^1_{j_1}$ ('CH' is more general than $H_1$ and $\{c, d\} \subseteq \{c, d\}$), $ss^1_2 \sqsubseteq p^1_{j_2}$ (the same hospital and $\{b\} \subseteq \{b, a\}$) and $ss^1_3 \sqsubseteq p^1_{j_3}$ ('*' is more general than $H_1$ and $\{d\} \subseteq \{d\}$).

With complex sequences and this kind of subsequence relation the computation can be hard. Thus, for the sake of simplification, only "contiguous" subsequences are considered, where only the order of consecutive elements is taken into account, i.e. given $j_i$ in Definition 7, $j_i = j_{i-1} + 1$ for all $i \in \{2, 3, \dots, k\}$. Since experts are interested in regular consecutive events in healthcare trajectories, such a restriction does make sense for our data. It helps to connect only related hospitalizations.

The next section introduces pattern structures that are based on complex sequences with a general subsequence relation, while the experiments are provided for a "contiguous" subsequence relation.

Based on the previous definitions, we can define the sequential pattern structure used for representing and managing sequences. For that, we make an analogy with the pattern structures for graphs [Kuznetsov, 1999], where the meet-semilattice operation $\sqcap$ respects subgraph isomorphism. Thus, we introduce a sequential meet-semilattice respecting the subsequence relation. Given an alphabet lattice $(E, \sqcap_E)$, $S$ is the set of all valid sequences based on $(E, \sqcap_E)$. $S$ is partially ordered w.r.t. Definition 7. $(D, \sqcap)$ is a semilattice on $S$, where $D \subseteq \wp(S)$ such that, if $d \in D$ contains a sequence $s$, then all subsequences of $s$ should be included into $d$, $\forall s \in d, \nexists \tilde{s} \le s : \tilde{s} \notin d$, and the similarity operation is the set intersection for two sets of sequences. Given two patterns $d_1, d_2 \in D$, the set intersection operation ensures that if a sequence $s$ belongs to $d_1 \sqcap d_2$ then any subsequence of $s$ belongs to $d_1 \sqcap d_2$, and thus $d_1 \sqcap d_2 \in D$. As the set intersection operation is idempotent, commutative and associative, $(D, \sqcap)$ is a semilattice.

Example 4. If a pattern $d_1 \in D$ includes the sequence $\langle [*, \{c, d\}]; [*, \{b\}] \rangle$ (see Table 4), then it should also include $\langle [*, \{d\}]; [*, \{b\}] \rangle$, $\langle [*, \{c, d\}] \rangle$, $\langle [*, \{d\}] \rangle$ and others. If a pattern $d_2 \in D$ includes $\langle [*, \{a\}]; [*, \{d\}] \rangle$, then it should include $\langle [*, \{a\}] \rangle$, $\langle [*, \{d\}] \rangle$ and $\langle \rangle$. Thus the intersection of the two sets $d_1$ and $d_2$ is equal to the set $\{ \langle [*, \{d\}] \rangle, \langle \rangle \}$.

The next proposition follows from the above and will be used in the proofs in the next section.
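Before the proposition is stated, the alphabet semilattice of the running example and the subsequence test can be made concrete with a short Python sketch. Several points are assumptions made for illustration: the hospital indices and their grouping in `PARENT`, the element-wise condition (each element of $t$ is subsumed by the matched element of $s$, as used in the running example), the greedy left-to-right matching for the general relation, and all function names.

```python
# Hypothetical taxonomy: hospital indices and grouping are illustrative assumptions.
PARENT = {"H1": "CH", "H2": "CH", "H3": "CL", "H4": "CL", "CH": "*", "CL": "*"}


def ancestors(h):
    """Path from h up to the root '*', h included."""
    path = [h]
    while path[-1] != "*":
        path.append(PARENT[path[-1]])
    return path


def hospital_meet(h1, h2):
    """Least common ancestor of two hospital names in the taxonomy."""
    seen = set(ancestors(h1))
    return next(h for h in ancestors(h2) if h in seen)


def element_meet(e1, e2):
    """(h1, P1) meet (h2, P2) = (lca(h1, h2), P1 & P2)."""
    return (hospital_meet(e1[0], e2[0]), e1[1] & e2[1])


def element_leq(e1, e2):
    """e1 is subsumed by e2 (e1 is more general) iff their meet equals e1."""
    return element_meet(e1, e2) == e1


def is_subsequence(t, s, contiguous=False):
    """Check t <= s; with contiguous=True the matched positions must be adjacent."""
    if contiguous:
        return any(
            all(element_leq(a, b) for a, b in zip(t, s[start:start + len(t)]))
            for start in range(len(s) - len(t) + 1)
        )
    position = 0
    for element in t:
        # Greedy matching: take the earliest position that subsumes `element`.
        while position < len(s) and not element_leq(element, s[position]):
            position += 1
        if position == len(s):
            return False
        position += 1
    return True


def seq(*items):
    """Build a sequence from (hospital, 'procedures') pairs."""
    return [(h, frozenset(procs)) for h, procs in items]


SS1 = seq(("CH", "cd"), ("H1", "b"), ("*", "d"))
P1 = seq(("H1", "a"), ("H1", "cd"), ("H1", "ab"), ("H1", "d"))
GAPPED = seq(("*", "cd"), ("*", "d"))

if __name__ == "__main__":
    print(is_subsequence(SS1, P1), is_subsequence(SS1, P1, contiguous=True))
    print(is_subsequence(GAPPED, P1), is_subsequence(GAPPED, P1, contiguous=True))
```

The `GAPPED` sequence matches the second and fourth hospitalizations of `P1` under the general relation but has no match under the contiguous one, which is the restriction adopted for the experiments.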

Proposition 1. Given $(G, (D, \sqcap), \delta)$ and $x, y \in D$, $x \sqsubseteq y$ if and only if $\forall s^x \in x$ there is a sequence $s^y \in y$, such that $s^x \le s^y$.

The set of all possible subsequences for a given sequence can be large. Thus, it is more efficient to consider a pattern $d \in D$ as a set of only maximal sequences $\tilde{d}$, $\tilde{d} = \{ s \in d \mid \nexists s^* \in d : s^* > s \}$. Furthermore, every pattern will be given only by the set of all maximal sequences. For example, $\{p^2\} \sqcap \{p^3\} = \{ss^6, ss^7, ss^8\}$ (see Tables 3 and 4), i.e. $\{ss^6, ss^7, ss^8\}$ is the set of all maximal sequences specifying the intersection of $p^2$ and $p^3$. Similarly we have $\{ss^6, ss^7, ss^8\} \sqcap \{p^1\} = \{ss^4, ss^5\}$.

Figure 3: The concept lattice for the pattern structure given by Table 3. Concept intents refer to sequences in Tables 3 and 4.

Table 4: Subsequences of patient sequences in Table 3.

$$\begin{array}{c|l}
ss^1 & \langle [CH, \{c, d\}]; [H_1, \{b\}]; [*, \{d\}] \rangle \\
ss^2 & \langle [CH, \{c, d\}]; [*, \{b\}]; [*, \{d\}] \rangle \\
ss^3 & \langle [CH, \{\}]; [*, \{d\}]; [*, \{a\}] \rangle \\
ss^4 & \langle [*, \{c, d\}]; [*, \{b\}] \rangle \\
ss^5 & \langle [*, \{a\}] \rangle \\
ss^6 & \langle [*, \{c, d\}]; [CL, \{b\}]; [CL, \{a\}] \rangle \\
ss^7 & \langle [CL, \{d\}]; [CL, \{\}] \rangle \\
ss^8 & \langle [CL, \{\}]; [CL, \{a, d\}] \rangle \\
ss^9 & \langle [CH, \{c, d\}] \rangle \\
ss^{10} & \langle [CL, \{b\}]; [CL, \{a\}] \rangle \\
ss^{11} & \langle [*, \{c, d\}]; [*, \{b\}] \rangle \\
ss^{12} & \langle [*, \{a\}]; [*, \{d\}] \rangle
\end{array}$$

Note that representing a pattern by the set of all maximal sequences allows for an efficient implementation of the intersection "$\sqcap$" of two patterns (in Section 5.1 we give more details on the similarity operation w.r.t. a contiguous subsequence relation).

Example 5. The sequential pattern structure for our example (Subsection 3.1) is $(G, (D, \sqcap), \delta)$, where $G = \{p^1, p^2, p^3\}$, $(D, \sqcap)$ is the semilattice of sequential descriptions, and $\delta$ is the mapping associating an object in $G$ to a description in $D$ shown in Table 3. Figure 3 shows the resulting lattice of sequential pattern concepts for this particular pattern structure $(G, (D, \sqcap), \delta)$.

Projections of sequential pattern structures

Pattern structures are hard to process due to the large number of concepts in the concept lattice, the complexity of the involved descriptions and the similarity operation. Moreover, a given pattern structure can produce a lattice with a lot of patterns which are not interesting for an expert. Can we save computational time by avoiding the computation of "useless" patterns? Projections of pattern structures "simplify" to some degree the computation and allow one to work with a reduced description. In fact, projections can be considered as filters on patterns respecting mathematical properties. These properties ensure that the projection of a semilattice is a semilattice and that projected concepts are related to the original ones [Ganter and Kuznetsov, 2001]. Moreover, the stability measure of projected concepts never decreases w.r.t. the original concepts. We introduce projections on sequential patterns, revising Ganter and Kuznetsov [2001]. It is necessary to provide an extended definition of projection in order to deal with interesting projections for real-world sequential datasets.

Definition 8 (Ganter and Kuznetsov [2001]). A projection $\psi : D \to D$ is an interior operator, i.e. it is (1) monotone ($x \sqsubseteq y \Rightarrow \psi(x) \sqsubseteq \psi(y)$), (2) contractive ($\psi(x) \sqsubseteq x$) and (3) idempotent ($\psi(\psi(x)) = \psi(x)$).

Definition 9. A projected pattern structure $\psi((G, (D, \sqcap), \delta))$ is a pattern structure $(G, (D_\psi, \sqcap_\psi), \psi \circ \delta)$, where $D_\psi = \psi(D) = \{ d \in D \mid \exists d^* \in D : \psi(d^*) = d \}$ and $\forall x, y \in D$, $x \sqcap_\psi y := \psi(x \sqcap y)$.

Note that in [Ganter and Kuznetsov, 2001] $\psi((G, (D, \sqcap), \delta)) = (G, (D, \sqcap), \psi \circ \delta)$. Our definition allows one to use a wider set of projections. In fact, all projections that we describe for sequential pattern structures below require Definition 9. Now we should show that $(D_\psi, \sqcap_\psi)$ is a semilattice.

Proposition 2. Given a semilattice $(D, \sqcap)$ and a projection $\psi$, for all $x, y \in D$, $\psi(x \sqcap y) = \psi(\psi(x) \sqcap y)$.

Proof.
1. $\psi(x) \sqsubseteq x$, thus $x, y \sqsupseteq (x \sqcap y) \sqsupseteq (\psi(x) \sqcap y) \sqsupseteq \psi(\psi(x) \sqcap y)$.
2. $x \sqsubseteq y \Rightarrow \psi(x) \sqsubseteq \psi(y)$, thus $\psi(x \sqcap y) \sqsupseteq \psi(\psi(x) \sqcap y)$.
3. $\psi(x \sqcap y) \sqcap \psi(x) \sqcap y = \psi(x \sqcap y) \sqcap y$ (since $\psi(x \sqcap y) \sqsubseteq \psi(x)$) $= \psi(x \sqcap y)$ (since $\psi(x \sqcap y) \sqsubseteq y$); then $(\psi(x) \sqcap y) \sqsupseteq \psi(x \sqcap y)$ and $\psi(\psi(x) \sqcap y) \sqsupseteq \psi(\psi(x \sqcap y)) = \psi(x \sqcap y)$.
4. From (2) and (3) it follows that $\psi(x \sqcap y) = \psi(\psi(x) \sqcap y)$.

Corollary 1. $X_1 \sqcap_\psi X_2 \sqcap_\psi \cdots \sqcap_\psi X_N = \psi(X_1 \sqcap X_2 \sqcap \cdots \sqcap X_N)$

Proof. It can be proven by induction.
1. $X_1 \sqcap_\psi X_2 = \psi(X_1 \sqcap X_2)$ by Definition 9.
2. If $X_1 \sqcap_\psi \cdots \sqcap_\psi X_K = \psi(X_1 \sqcap \cdots \sqcap X_K)$, then
$$X_1 \sqcap_\psi \cdots \sqcap_\psi X_K \sqcap_\psi X_{K+1} = \psi(X_1 \sqcap \cdots \sqcap X_K) \sqcap_\psi X_{K+1} = \psi(\psi(X_1 \sqcap \cdots \sqcap X_K) \sqcap X_{K+1}) = \psi(X_1 \sqcap \cdots \sqcap X_{K+1}),$$
where the last equality holds by Proposition 2.

Corollary 2. Given a semilattice $(D, \sqcap)$ and a projection $\psi$, $(D_\psi, \sqcap_\psi)$ is a semilattice, i.e. $\sqcap_\psi$ is commutative, associative and idempotent.

The concepts of a pattern structure and a projected pattern structure are connected through Proposition 3. This proposition can be found in Ganter and Kuznetsov [2001], but thanks to Corollary 1, it is valid in our case.
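Before that proposition is stated, Definition 8, Proposition 2 and Corollary 1 can be checked exhaustively on a small semilattice. The sketch below uses the powerset of four attributes with set intersection as $\sqcap$, and a hypothetical "minimal size" projection `psi` that sends every description with fewer than two attributes to the bottom element. Both the projection and the function names are our own illustration, not projections taken from the paper.

```python
from itertools import combinations, product

ATTRIBUTES = "abcd"
# D: all subsets of the attribute set, ordered by inclusion.
D = [frozenset(c) for n in range(len(ATTRIBUTES) + 1)
     for c in combinations(ATTRIBUTES, n)]


def psi(d, min_size=2):
    """Hypothetical projection: keep d if it has at least `min_size` attributes."""
    return d if len(d) >= min_size else frozenset()


def meet(x, y):
    """Similarity operation of the original semilattice: set intersection."""
    return x & y


def projected_meet(x, y):
    """Definition 9: x meet_psi y := psi(x meet y)."""
    return psi(meet(x, y))


def is_interior_operator():
    """Definition 8: psi is monotone, contractive and idempotent on D."""
    monotone = all(psi(x) <= psi(y) for x, y in product(D, D) if x <= y)
    contractive = all(psi(x) <= x for x in D)
    idempotent = all(psi(psi(x)) == psi(x) for x in D)
    return monotone and contractive and idempotent


def proposition_2_holds():
    """psi(x meet y) == psi(psi(x) meet y) for all x, y in D."""
    return all(psi(meet(x, y)) == psi(meet(psi(x), y)) for x, y in product(D, D))


def corollary_1_holds():
    """Chained projected meets of three descriptions equal psi of the plain meet."""
    return all(
        projected_meet(projected_meet(x, y), z) == psi(x & y & z)
        for x, y, z in product(D, D, D)
    )


if __name__ == "__main__":
    print(is_interior_operator(), proposition_2_holds(), corollary_1_holds())
    x, y = frozenset("ab"), frozenset("bc")
    print(sorted(meet(x, y)), sorted(projected_meet(x, y)))
```

For `x = {a, b}` and `y = {b, c}` the plain meet `{b}` is not a fixed point of this `psi`, while the projected meet is the bottom element; this toy case shows why the projected structure carries its own operation $\sqcap_\psi$.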

Proposition 3. Given a concept $(A, d)$ in $\psi((G, (D, \sqcap), \delta))$, the extent $A$ is an extent in $(G, (D, \sqcap), \delta)$. Given a concept $(A, d_\psi)$ in $\psi((G, (D, \sqcap), \delta))$, the intent $d_\psi$ is of the form $d_\psi = \psi(d)$, where $(A, d)$ is a concept in $(G, (D, \sqcap), \delta)$.

Moreover, while preserving the extents of some concepts, projections cannot decrease the stability of the projected concepts, i.e. if the projection preserves a stable concept, then its stability (Definition 2) can only increase.

Proposition 4.

Given a pattern structure ( G, ( D, (cid:117) ) , δ ) , its concept c and aprojected pattern structure ( G, ( D ψ , (cid:117) ψ ) , ψ ◦ δ ) , and the projected concept ˜ c , ifthe concept extents are equal ( Ext ( c ) = Ext (˜ c ) ) then Stab ( c ) ≤ Stab (˜ c ) .Proof. Concepts c and ˜ c have the same extent. Thus, according to Deﬁnition 2,in order to prove the proposition, it is enough to prove that for any subset A ⊆ Ext ( c ), if A (cid:5) = Int ( c ) in the original pattern structure, then A (cid:5) = Int (˜ c )in the projected one.Suppose that ∃ A ⊂ Ext ( c ) such that A (cid:5) = Int ( c ) in the original patternstructure and A (cid:5) (cid:54) = Int (˜ c ) in the projected one. Then there is a descendantconcept ˜ d of ˜ c in the projected pattern structure such that A (cid:5) = Int ( ˜ d ) in theprojected lattice. Then there is an original concept d for the projected concept˜ d with the same extent Ext ( d ). Then A (cid:5) (cid:119) Int ( d ) (cid:65) Int ( c ) and, so, A (cid:5) cannotbe equal to Int ( c ) in the original lattice. Contradiction.Now we are going to present two projections of sequential pattern structures.The ﬁrst projection comes from the following observation. In many cases it maybe more interesting to analyze quite long subsequences rather than short ones.This kind of projections is called Minimal Length Projection (MLP) and itdepends on the minimal length parameter (cid:96) for the sequences in a pattern. Thecorresponding function ψ maps a pattern without short sequences to itself, anda sequence with short sequences to the pattern containing only long sequencesw.r.t. a given length threshold. Later, propositions 1 and 5 state that MLP iscoherent with Deﬁnition 8. Deﬁnition 10.

The function $\psi_{MLP} : D \to D$ of minimal length $\ell$ is defined as

$$\psi_{MLP}(d) = \{ s \in d \mid length(s) \geq \ell \}.$$

Example 6. If we prefer common subsequences of length $\ell \geq 3$, then between p and p in Table 3 there is only one maximal common subsequence, ss in Table 4, while ss and ss are too short to be considered. Figure 4a shows the lattice of the projected pattern structure (Table 3) with patterns of length greater than or equal to 3.

Proposition 5.

The function $\psi_{MLP}$ is a monotone, contractive and idempotent function on the semilattice $(D, \sqcap)$.

Proof. The contractivity and idempotency are quite clear from the definition. It remains to prove the monotonicity.

If $X \sqsubseteq Y$, where $X$ and $Y$ are sets of sequences, then for every sequence $x \in X$ there is a sequence $y \in Y$ such that $x \leq y$ (Proposition 1). We should show that $\psi(X) \sqsubseteq \psi(Y)$, or in other words that for every sequence $x \in \psi(X)$ there is a sequence $y \in \psi(Y)$ such that $x \leq y$. Given $x \in \psi(X)$, since $\psi(X)$ is a subset of $X$ and $X \sqsubseteq Y$, there is a sequence $y \in Y$ such that $x \leq y$, with $|y| \geq |x| \geq \ell$ ($\ell$ is a parameter of MLP), and thus $y \in \psi(Y)$.

Another important type of projection is related to a variation of the lattice alphabet $(E, \sqcap_E)$. One possible variation of the alphabet is to ignore certain fields in the elements. For example, if a hospitalization is described by a hospital name and a set of procedures, then either hospital or procedures can be ignored in similarity computation. For that, in any element the set of procedures should be substituted by $\emptyset$, or the hospital by $*$ ("arbitrary hospital"), which is the most general element of the taxonomy of hospitals.

Another variation of the alphabet is to require that some field(s) should not be empty. For example, we may want to find patterns with a non-empty set of procedures, or to forbid the element $*$ of the hospital taxonomy in elements of a sequence. Such variations are easy to realize within our approach. For this, when computing the similarity operation between elements of the alphabet, one should check if the result contains empty fields and, if yes, should substitute the result by $\bot$. This variation is useful, as shown in the experimental section, but is rather difficult to define within more classical frequent sequence mining approaches, which will be discussed later.

Example 7.

An expert is interested in finding sequential patterns describing how a patient changes hospitals, but with little interest in procedures. Thus, any element of the alphabet lattice containing a hospital and a non-empty set of procedures can be projected to an element with the same hospital, but with an empty set of procedures.
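The two projections introduced so far (the minimal length projection of Definition 10 and the field-ignoring alphabet projection of Example 7) can be illustrated with a minimal Python sketch. The encoding is an assumption made only for illustration: an element is a pair (hospital, set of procedures), a sequence is a tuple of elements, and a pattern is a set of sequences. The function names `mlp`, `drop_procedures` and `project_sequence` are hypothetical and do not come from the authors' implementation.

```python
from typing import FrozenSet, Tuple

# Illustrative encoding (an assumption): element = (hospital, procedures),
# sequence = tuple of elements, pattern = frozenset of sequences.
Element = Tuple[str, FrozenSet[str]]
Sequence = Tuple[Element, ...]
Pattern = FrozenSet[Sequence]


def mlp(pattern: Pattern, min_len: int) -> Pattern:
    """Minimal length projection: keep only sequences of length >= min_len."""
    return frozenset(s for s in pattern if len(s) >= min_len)


def drop_procedures(element: Element) -> Element:
    """Alphabet projection of Example 7: same hospital, empty procedure set."""
    hospital, _ = element
    return (hospital, frozenset())


def project_sequence(seq: Sequence) -> Sequence:
    """Apply the alphabet projection element by element."""
    return tuple(drop_procedures(e) for e in seq)


# Toy data (hypothetical, not taken from Tables 3 and 4).
long_seq: Sequence = (
    ("CH", frozenset({"c", "d"})),
    ("CL", frozenset({"b"})),
    ("CL", frozenset({"a"})),
)
short_seq: Sequence = (("CL", frozenset({"d"})),)
pattern: Pattern = frozenset({long_seq, short_seq})

projected = mlp(pattern, 3)        # only long_seq survives
hospitals_only = project_sequence(long_seq)
```

On this toy pattern, `mlp` returns a subset of its input and applying it twice changes nothing, which is the contractive and idempotent behaviour claimed in Proposition 5; monotonicity would additionally require the subsequence order and is not checked here.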

Example 8.

An expert is interested in finding sequential patterns containing some information about the hospital in every hospitalization, and the corresponding procedures, i.e. the hospital field in the patterns cannot be equal to $*$, e.g., ss is an invalid pattern, while ss is a valid pattern in Table 4. Thus, any element of the alphabet semilattice with $*$ in the hospital field can be projected to $\bot_E$. Figure 4b shows the lattice corresponding to the projected pattern structure (Table 3) defined by a projection of the alphabet semilattice.

$\bot_E$ cannot belong to a valid sequence. Thus, sequences in a pattern should be "developed" w.r.t. $\bot_E$, as explained below.

Definition 11.

Given an alphabet $(E, \sqcap_E)$, a projection of the alphabet $\psi$ and a sequence $s = \langle s_1, \cdots, s_n \rangle$ based on $E$, the projection $\psi(s)$ is the sequence $\tilde{s} = \langle \tilde{s}_1, \cdots, \tilde{s}_n \rangle$, such that $\tilde{s}_i = \psi(s_i)$.

Here, it should be noticed that $\tilde{s}$ is not necessarily a valid sequence (see Definition 6), since it can include $\bot_E$ as an element. However, in sequential pattern structures, elements should include only valid sequences (see Section 3.3).

Definition 12.

Given an alphabet $(E, \sqcap_E)$ and a projection of the alphabet $\psi_E$, an alphabet projection for the sequential pattern structure $\psi(d)$ is the set of valid sequences smaller than the projected sequences from $d$:

$$\psi(d) = \{ s \in S \mid (\exists t \in d)\ s \leq \psi_E(t) \},$$

where $S$ is the set of all valid sequences based on $(E, \sqcap_E)$.

Example 9. $\{\langle [CL, \{b\}]; [CL, \{a\}] \rangle\}$ is an alphabet-projected pattern for the pattern $\{\langle [*, \{c, d\}]; [CL, \{b\}]; [CL, \{a\}] \rangle\}$, where the alphabet lattice projection is given in Example 8: the element $[*, \{c, d\}]$ is projected to $\bot_E$ and therefore cannot remain in a valid sequence.

In the case of contiguous subsequences, $\{\langle [CH, \{c, d\}] \rangle\}$ is an alphabet-projected pattern for the pattern {ss} $= \{\langle [CH, \{c, d\}]; [*, \{b\}]; [*, \{d\}] \rangle\}$, where the alphabet lattice projection is given by projecting every element with medical procedure $b$ to the element with the same hospital and with the same set of procedures excluding $b$. The projection of sequence ss is $\langle [CH, \{c, d\}]; [*, \{\}]; [*, \{d\}] \rangle$, but $[*, \{\}] = \bot_E$, and, thus, in order to project the pattern {ss} the projected sequence is substituted by its maximal subsequences, i.e. $\psi(\{\langle [CH, \{c, d\}]; [*, \{b\}]; [*, \{d\}] \rangle\}) = \{\langle [CH, \{c, d\}] \rangle\}$.

Proposition 6.

Considering an alphabet $(E, \sqcap_E)$, a projection of the alphabet $\psi$, and a sequential pattern structure $(G, (D, \sqcap), \delta)$, the alphabet projection (see Definition 12) is monotone, contractive and idempotent.

Proof. This projection is idempotent, since the projection of the alphabet is idempotent and only the projection of the alphabet can change the elements appearing in sequences.

It is contractive because for any pattern $d \in D$ and any sequence $s \in d$, a projection of the sequence $\tilde{s} = \psi(s)$ is a subsequence of $s$. In Definition 12 the projected sequences should be substituted by their subsequences in order to avoid $\bot_E$, building the sets $\{\tilde{s}_i\}$. Thus, $s$ is a supersequence for any $\tilde{s}_i$, and, so, the projected pattern $\tilde{d} = \psi(d)$ is subsumed by the pattern $d$.

Finally, we should show monotonicity. Given two patterns $x, y \in D$ such that $x \sqsubseteq y$, i.e. $\forall s^x \in x, \exists s^y \in y : s^x \leq s^y$, consider the projected sequence of $s^x$, $\psi(s^x)$. As $s^x \leq s^y$ for some $s^y$, then for some $j_1 < \cdots < j_{|s^x|}$ (see Definition 7) $s^x_i \sqsubseteq_E s^y_{j_i}$ ($i \in \{1, 2, \ldots, |s^x|\}$), then $\psi(s^x_i) \sqsubseteq_E \psi(s^y_{j_i})$ (by the monotonicity of the alphabet projection), i.e. the projected sequence preserves the subsequence relation. Thus, the set of allowed subsequences of $s^x$ is a subset of the set of allowed subsequences of $s^y$. Hence, the alphabet projection of the pattern preserves the pattern subsumption relation, $\psi(x) \sqsubseteq \psi(y)$ (Proposition 1), i.e. the alphabet projection is monotone.

[Figure 4: The projected concept lattices for the pattern structure given by Table 3. Concept intents refer to the sequences in Tables 3 and 4. (a) MLP projection, $\ell = 3$. (b) Projection removing '*' in the hospital field.]

Nearly any state-of-the-art FCA algorithm can be adapted to process pattern structures. We adapted the AddIntent algorithm [Merwe et al., 2004], as the lattice structure is important for us to calculate stability (see an algorithm for calculating stability in [Roth et al., 2008]). To adapt the algorithm to our needs, every set intersection operation on attributes is substituted with the semilattice operation $\sqcap$ on the corresponding patterns, while every subset checking operation is substituted with the semilattice order checking $\sqsubseteq$; in particular all $(\cdot)'$ are substituted with $(\cdot)^\diamond$.

The next question is how the semilattice operation $\sqcap$ and the subsumption relation $\sqsubseteq$ can be implemented for contiguous sequences. Given two sets of sequences $S = \{s^1, \ldots, s^n\}$ and $T = \{t^1, \ldots, t^m\}$, the similarity of these sets, $S \sqcap T$, is calculated according to Section 3.3, i.e. as the maximal sequences among all common subsequences for any pair of sequences $s^i$ and $t^j$.

To find all common subsequences of two sequences, the following observation can be useful. If $ss = \langle ss_1; \ldots; ss_l \rangle$ is a subsequence of $s = \langle s_1; \ldots; s_n \rangle$ with $j^s_i = k^s + i$, i.e. $ss_i \sqsubseteq_E s_{k^s + i}$ (Definition 7: $k^s$ is the index difference from which $ss$ is a contiguous subsequence of $s$) and a subsequence of $t = \langle t_1; \ldots; t_m \rangle$ with $j^t_i = k^t + i$, i.e. $ss_i \sqsubseteq_E t_{k^t + i}$, then for any index $i \in \{1, 2, \ldots, l\}$, $ss_i \sqsubseteq_E (s_{j^s_i} \sqcap t_{j^t_i})$.

Thus, to find all maximal common subsequences of $s$ and $t$, we first align $s$ and $t$ in all possible ways. For each alignment of $s$ and $t$ we compute the resulting intersection.
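The alignment procedure for two contiguous sequences, including the final step of keeping only the maximal intersections, can be sketched as follows. This is a minimal illustration and not the authors' C++ implementation: it assumes the simplest alphabet, where every element is a set of items, $\sqcap_E$ is set intersection and $\sqsubseteq_E$ is set inclusion, and it treats the empty set as an ordinary element. The names `alignments`, `is_contiguous_subseq` and `similarity` are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Seq = Tuple[FrozenSet[str], ...]


def alignments(s: Seq, t: Seq) -> Set[Seq]:
    """Intersect s and t element-wise for every relative shift of t against s."""
    results: Set[Seq] = set()
    for shift in range(-(len(t) - 1), len(s)):
        # Position i of s faces position i - shift of t on the overlap [lo, hi).
        lo, hi = max(0, shift), min(len(s), shift + len(t))
        results.add(tuple(s[i] & t[i - shift] for i in range(lo, hi)))
    return results


def is_contiguous_subseq(x: Seq, y: Seq) -> bool:
    """x <= y: some window of y contains x element by element."""
    return any(
        all(x[i] <= y[k + i] for i in range(len(x)))
        for k in range(len(y) - len(x) + 1)
    )


def similarity(s: Seq, t: Seq) -> Set[Seq]:
    """Maximal common contiguous subsequences of s and t."""
    cands = alignments(s, t)
    return {
        x for x in cands
        if not any(x != y and is_contiguous_subseq(x, y) for y in cands)
    }


def fs(*items: str) -> FrozenSet[str]:
    return frozenset(items)


# The two sequences used in the alignment example of this section.
s = (fs("a"), fs("c", "d"), fs("b", "a"), fs("d"))
t = (fs("c", "d"), fs("b", "d"), fs("a", "d"))
result = similarity(s, t)
```

The outer loop visits $|s| + |t| - 1$ shifts and each shift intersects at most $\min(|s|, |t|)$ pairs of elements, which matches the $O(|s| \cdot |t| \cdot \gamma)$ bound when one element intersection costs $\gamma$. With an alphabet that has a bottom element, each intersected sequence would additionally have to be cut at $\bot_E$ before the maximality filter.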
Finally, we keep only the maximal intersected subsequences.

For example, let us consider two possible alignments of the sequences $\langle \{a\}; \{c, d\}; \{b, a\}; \{d\} \rangle$ and $\langle \{c, d\}; \{b, d\}; \{a, d\} \rangle$. One alignment yields the intersection $ss^l = \langle \emptyset; \{d\} \rangle$. Another one, where the first element of the second sequence faces the second element of the first sequence, yields $ss^r = \langle \{c, d\}; \{b\}; \{d\} \rangle$. The left intersection $ss^l$ is not retained, as it is not maximal ($ss^l < ss^r$), while the right intersection $ss^r$ is kept.

The complexity of the alignment for two sequences $s$ and $t$ is $O(|s| \cdot |t| \cdot \gamma)$, where $\gamma$ is the complexity of computing a common ancestor in the alphabet lattice $(E, \sqcap_E)$.

The experiments are carried out on a MacBook Pro with a 2.5 GHz Intel Core i5 and 8 GB of RAM, running OS X 10.6.8. The algorithms are not parallelized and are coded in C++.

Our use-case dataset comes from a French healthcare system called PMSI (Programme de Médicalisation des Systèmes d'Information) [Fetter et al., 1980]. Each element of a sequence has a "complex" nature. The dataset contains 500 patients suffering from lung cancer, who live in the Lorraine region (Eastern France). Every patient is described as a sequence of hospitalizations without any time-stamp. A hospitalization is a tuple with three elements: (i) healthcare institution (e.g. university hospital of Nancy (CHU Nancy)), (ii) reason for the hospitalization (e.g. a cancer disease), and (iii) set of medical procedures that the patient undergoes. An example of a medical trajectory is given below:

$$\langle [\text{CHU Nancy}, \text{Cancer}, \{mp, mp\}]; [\text{CH Paris}, \text{Chemo}, \{\}]; [\text{CH Paris}, \text{Chemo}, \{\}] \rangle.$$

This sequence represents a patient trajectory with three hospitalizations. It expresses that the patient was first admitted to the university hospital of Nancy (CHU Nancy) with a cancer problem as the reason, and underwent procedures mp and mp. Then he had two consecutive hospitalizations in the general hospital of Paris (CH Paris) for chemotherapy with no additional procedure. Substituting identical consecutive hospitalizations by the number of repetitions, we obtain a shorter and more understandable trajectory. For example, the above pattern is transformed into two hospitalizations where the first hospitalization repeats once and the second twice:

$$\langle [\text{CHU Nancy}, \text{Cancer}, \{mp, mp\}] \times [1]; [\text{CH Paris}, \text{Chemo}, \{\}] \times [2] \rangle.$$
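The substitution of identical consecutive hospitalizations by a repetition count is a run-length encoding of the trajectory. A minimal sketch, assuming a hospitalization is encoded as a tuple (institution, reason, set of procedures), is given below. The function name `compress` and the procedure labels `mp_a` and `mp_b` are hypothetical placeholders.

```python
from itertools import groupby
from typing import FrozenSet, List, Tuple

# (institution, reason, procedures): an illustrative encoding, not the PMSI format.
Hospitalization = Tuple[str, str, FrozenSet[str]]


def compress(trajectory: List[Hospitalization]) -> List[Tuple[Hospitalization, int]]:
    """Replace each run of identical consecutive hospitalizations by (element, count)."""
    return [(h, sum(1 for _ in run)) for h, run in groupby(trajectory)]


# Placeholder procedure names; the real procedure codes are not reproduced here.
first = ("CHU Nancy", "Cancer", frozenset({"mp_a", "mp_b"}))
chemo = ("CH Paris", "Chemo", frozenset())

compressed = compress([first, chemo, chemo])
```

Only adjacent duplicates are merged, so a patient who returns to the same hospital for the same reason after a different hospitalization keeps two separate elements in the compressed trajectory.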

[Figure 6: The length distribution of sequences in the dataset. Axes: trajectory length; number of patients.]

Diagnoses are coded according to the 10th International Classification of Diseases (ICD10). Based on this coding, diagnoses can be described at 5 levels of granularity: root, chapter, block, 3-character and 4-character (terminal nodes). This taxonomy has 1544 nodes. The healthcare institution is associated with a geographical taxonomy of 4 levels, where the first level refers to the root (France) and the second, the third and the fourth levels correspond to administrative region, administrative department and hospital respectively. Figure 5 presents the University Hospital of Nancy (code: 540002078) as a hospital in Meurthe et Moselle, which is a department in Lorraine, a region of France. This taxonomy has 304 nodes. The medical procedures are coded according to the French nomenclature "Classification Commune des Actes Médicaux (CCAM)". The distribution of sequence lengths is shown in Figure 6.

With 500 patient trajectories, the computation of the whole lattice is infeasible. We are not interested in all possible frequent trajectories, but rather in trajectories which answer medical analysis questions. An expert may know the minimal size of the trajectories that he is interested in, which amounts to setting the MLP projection. We use the MLP projection of length 2 and 3 and take into account that most of the patients have at least 2 hospitalizations in the trajectory (see Figure 6).

[Figure 7: Computational time for different projections. Axes: database size; computation time (s). Series: GR, GRI, RP, RPI, GRP, GRPI. (a) MLP projection, $\ell = 2$. (b) MLP projection, $\ell = 3$.]

Figure 7 shows computational times for different projections as a function of dataset size. Figure 7a shows different alphabet projections for the MLP projection with $\ell = 2$, while Figure 7b does so for MLP with $\ell = 3$. Every alphabet projection is given by the names of the fields that are considered within the projection: G corresponds to hospital geo-location, R is the reason for a hospitalization, P is medical procedures and I is the repetition interval, i.e. the number of consecutive hospitalizations with the same reason. We can see from these figures that MLP allows one to save some computational resources as $\ell$ increases. The difference in computational time between the $\ell = 2$ and $\ell = 3$ projections is significant, especially for time-consuming cases. An even bigger variation can be noticed for the alphabet projections. For example, computation of the RPI projection takes 100 times more resources than any of GRP, RP, GR, GRP.

The same dependency can be seen in Figure 8, where the number of concepts for every projection is shown. Consequently, it is important for an expert to provide a strict projection that allows him to answer his questions in order to save computational time and memory.

Table 5 shows some interesting concept intents with the corresponding support and ranking w.r.t. concept stability. For example, consider the concept obtained under the GR projection (i.e., we consider only hospital and reason) with the intent $\langle [\text{Lorraine}, \text{C341 Lung Cancer}] \rangle$, where C341 Lung Cancer is a special kind of lung cancer (malignant neoplasm in upper lobe, bronchus or lung). This concept is the most stable concept in the lattice for the given projection, and the size of the concept extent is 287 patients.
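Since concepts are ranked by stability, a brute-force sketch may help to fix ideas. It assumes the usual definition of stability [Kuznetsov, 2007], namely the share of subsets of the extent whose common description equals the concept intent, and it uses a tiny binary context as a stand-in for a pattern structure (set intersection plays the role of $\sqcap$). It is not the algorithm of [Roth et al., 2008]: enumeration is exponential in the extent size and only suitable for toy examples. All names and data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

Context = Dict[str, FrozenSet[str]]  # object -> set of attributes


def common_attributes(objs: Iterable[str], ctx: Context,
                      all_attrs: FrozenSet[str]) -> FrozenSet[str]:
    """Description shared by objs; the empty set of objects yields all attributes."""
    result = all_attrs
    for g in objs:
        result = result & ctx[g]
    return result


def stability(extent: FrozenSet[str], ctx: Context) -> float:
    """Fraction of subsets of the extent whose description equals the intent."""
    all_attrs = frozenset().union(*ctx.values())
    intent = common_attributes(extent, ctx, all_attrs)
    members = sorted(extent)
    hits = 0
    for r in range(len(members) + 1):
        for subset in combinations(members, r):
            if common_attributes(subset, ctx, all_attrs) == intent:
                hits += 1
    return hits / 2 ** len(members)


# Toy context (hypothetical data).
ctx: Context = {
    "g1": frozenset({"a", "b"}),
    "g2": frozenset({"a", "c"}),
    "g3": frozenset({"a", "b", "c"}),
}
top_stability = stability(frozenset({"g1", "g2", "g3"}), ctx)
ab_stability = stability(frozenset({"g1", "g3"}), ctx)
```

In this toy context the concept with extent {g1, g3} is more stable than the top concept, because half of the subsets of its extent already determine its intent.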

[Figure 8: Lattice size for different projections. Axes: database size; lattice size. Series: GR, GRI, RP, RPI, GRP, GRPI. (a) MLP projection, $\ell = 2$. (b) MLP projection, $\ell = 3$.]

Table 5: Interesting concepts, for different projections.

GR ⟨[Lorraine, C341 Lung Cancer]⟩
GR ⟨[Lorraine, Respiratory Disease]; [CHU Nancy, Lung Cancer]⟩
26 223 GR ⟨[Lorraine, Chemotherapy] × ⟩
RPI ⟨[Preparation for Chemotherapy, {Lung Radiography}]; [Chemotherapy] × [3, 4]⟩

A first question is "Where do patients stay (i.e. hospital location) during their treatment, and for which reason?". To answer this question, we consider only the healthcare institution and reason fields, requiring both to "hold" some information, and we use the MLP projection of length 2 and 3 (i.e. projections GR2 and GR3). A pattern obtained under the GR projection states that "22 patients were first admitted in some healthcare institution in Lorraine region for a problem related to the respiratory system and then they were treated for a lung cancer in University Hospital of Nancy."

Another interesting question is "What are the sequential relations between hospitalization reasons and the corresponding procedures?". To answer this question, we are not interested in healthcare institutions. Thus, any alphabet element is projected by substituting the healthcare institution field with '*'. As the hospitalization reason is important in each hospitalization, any alphabet element without the hospitalization reason is of no use and is projected to the bottom element $\bot_E$ of the alphabet. Such projections are called RPI2 and RPI3, meaning that we consider the fields "Reason" and "Procedures", while the reason should not be empty and the MLP parameter is 2 or 3.

The corresponding pattern in Table 5 trivially states that "36 patients with lung cancer are hospitalized once for the preparation of chemotherapy and during this hospitalization they undergo lung radiography. Afterwards, they are hospitalized between 3 and 4 times for chemotherapy."
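The projections used for these questions send some alphabet elements to $\bot_E$, after which the projected sequences have to be cut into valid pieces (Definition 12). A minimal sketch of that mechanism for contiguous subsequences is shown below, replayed on the second part of Example 9 rather than on the PMSI data. Several assumptions are made for illustration only: the hospital taxonomy is flattened to concrete hospitals plus '*', a projected pattern is represented by its maximal sequences, and `None` stands for $\bot_E$. All function names are hypothetical.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Element = Tuple[str, FrozenSet[str]]   # (hospital, procedures); "*" = any hospital
Seq = Tuple[Element, ...]
BOTTOM: Element = ("*", frozenset())   # the element written [*, {}] in Example 9


def remove_b(e: Element) -> Optional[Element]:
    """Alphabet projection of Example 9: drop procedure 'b'; None stands for bottom."""
    projected = (e[0], e[1] - {"b"})
    return None if projected == BOTTOM else projected


def elem_leq(x: Element, y: Element) -> bool:
    """x is more general than y: hospital is '*' or equal, procedures included."""
    return (x[0] == "*" or x[0] == y[0]) and x[1] <= y[1]


def seq_leq(x: Seq, y: Seq) -> bool:
    """x is a contiguous subsequence of y w.r.t. elem_leq."""
    return any(
        all(elem_leq(x[i], y[k + i]) for i in range(len(x)))
        for k in range(len(y) - len(x) + 1)
    )


def project(seq: Seq) -> Set[Seq]:
    """Project element-wise, cut at bottom, keep the maximal remaining pieces."""
    pieces: Set[Seq] = set()
    current: List[Element] = []
    for e in map(remove_b, seq):
        if e is None:
            if current:
                pieces.add(tuple(current))
            current = []
        else:
            current.append(e)
    if current:
        pieces.add(tuple(current))
    return {p for p in pieces if not any(p != q and seq_leq(p, q) for q in pieces)}


example = (
    ("CH", frozenset({"c", "d"})),
    ("*", frozenset({"b"})),
    ("*", frozenset({"d"})),
)
projected = project(example)
```

The piece $\langle [*, \{d\}] \rangle$ produced by the cut is discarded because it is subsumed by $\langle [CH, \{c, d\}] \rangle$, which reproduces the result stated in Example 9. The RP projections of the experiments would follow the same scheme with a different element-level projection.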

Variability is high in healthcare processes and affects many aspects of healthcare trajectories: patients, medical habits and protocols, healthcare organisation, availability of treatments and settings. Mining sequential pattern structures is an interesting approach for finding regularities across one or several dimensions of medical trajectories in a population of patients. It is flexible enough to help healthcare managers to answer specific questions regarding the natural organisation of care processes and to further compare them with expected or desirable processes. The use of taxonomies also plays a key role in finding the right level of description of sequential patterns and reducing the interpretation overhead.

Agrawal and Srikant [1995] introduced the problem of mining sequential patterns over large sequential databases. Formally, given a set of sequences, where each sequence is a list of transactions ordered by time and each transaction is a set of items, the problem amounts to finding all frequent subsequences, i.e. those that appear a sufficient number of times w.r.t. a user-specified minimum support threshold (minsup). Following the work of Agrawal and Srikant many studies have contributed to the efficient mining of sequential patterns [Mooney and Roddick, 2013]. Most of them are based on the antimonotonicity property (used in Apriori), which states that any super pattern of a non-frequent pattern cannot be frequent. The main algorithms are PrefixSpan [Pei et al., 2001b], SPADE [Zaki, 2001], SPAM [Ayres et al., 2002], PSP [Masseglia et al., 1998], DISC [Chiu et al., 2004], PAID [Yang et al., 2006] and FAST [Salvemini et al., 2011]. All these algorithms aim at discovering sequential patterns from a set of sequences of itemsets, such as customers who frequently buy DVDs of episodes I, II and III of Star Wars, then buy within 6 months episodes IV, V, VI of the same famous epic space opera.

Many studies about sequential pattern discovery focus on single-dimensional sequences. However, in many situations, the database is multidimensional in the sense that items can be of different nature. For example, a consumer database can hold information such as article price, gender of the customer, location of the store and so on. Pinto et al. [2001] proposed the first work for mining multidimensional sequential patterns. In this work, a multidimensional sequential database is defined as a schema $(ID, D_1, \ldots, D_m, S)$, where $ID$ is a unique customer identifier, $D_1, \ldots, D_m$ are dimensions describing the data and $S$ is the sequence of itemsets. A multidimensional sequence is defined as a vector $\langle \{d_1, d_2, \ldots, d_m\}, S_1, S_2, \ldots, S_l \rangle$ where $d_i \in D_i$ for $i \leq m$ and $S_1, S_2, \ldots, S_l$ are the itemsets of sequence $S$. For instance, $\langle \{Metz, Male\}, \{mp, mp\}, \{mp\} \rangle$ describes a male patient who underwent procedures mp and mp in Metz and then underwent mp also in Metz. Here, dimensions, such as the location of the treatment, remain constant over time. This means that it is not possible to have a pattern indicating that when the patient underwent procedures mp and mp in Metz then he underwent mp in Nancy. Among other proposals, Yu and Chen [2005] proposed two methods, AprioriMD and PrefixMDSpan, for mining multidimensional sequential patterns in the web domain. This study considers pages, sessions and days as dimensions. Actually, these three different dimensions can be projected into a single dimension corresponding to web pages, gathering web pages visited during the same session and ordering sessions by day.

In real world applications, each dimension can be represented at different levels of granularity, by using a poset. For example, apples in a market basket analysis can be described as fruits, fresh food or food. The interest lies in the capacity of extracting more or less general/specific multidimensional sequential patterns and overcoming problems of excessive granularity and low support. Srikant and Agrawal [1996] proposed GSP, which uses posets for extracting sequential patterns. The basic approach is based on replacing every item with all its ancestors in the poset, after which the frequent sequences are generated. This approach is not scalable in a multidimensional context because the size of the database becomes the product of the maximum height of the posets and the number of dimensions.

Plantevit et al. [2010] defined a multidimensional sequence as an ordered list of multidimensional items, where a multidimensional item is a tuple $(d_1, \ldots, d_m)$ and $d_i$ is an item associated with the $i$-th dimension. They proposed M3SP, an approach taking both aspects into account where each dimension is represented at different levels of granularity, by using a poset. M3SP is able to search for sequential patterns with the most appropriate level of granularity. Their approach is based on the extraction of the most specific frequent multidimensional items, which are then used as an alphabet to rephrase the original database. M3SP is not adapted to mine sequential databases where sequences are defined over a combination of sets of items and items lying in a poset. Thus it is not possible to have a pattern indicating that when the patient went to $uh_p$ for a problem of cancer $ca$, where he underwent procedures mp and mp, then he went to $gh_l$ for the same medical problem $ca$, where he underwent mp (i.e., $\langle (uh_p, ca, \{mp, mp\}), (gh_l, ca, \{mp\}) \rangle$). Our approach allows us to process such patterns and, in addition, the elements of sequences are even more general. For example, besides multidimensional and multilevel sequences, sequences of graphs fall under our definition. Moreover, frequent subsequence mining gives rise to a lot of subsequences which can hardly be analyzed by an expert. Since our approach is based on Formal Concept Analysis (FCA) [Ganter and Wille, 1999], we can use efficient relevance indexes defined in FCA.

This paper is not the first attempt to use FCA for the analysis of sequential data. Ferré [2007] processes sequential datasets based on a "simple" alphabet without involving any partial order. In Casas-Garriga [2005] only sequences of itemsets are considered. All closed subsequences are first mined and then regrouped by a specialized algorithm in order to obtain a lattice similar to the FCA lattice. This approach was not verified experimentally. Moreover, compared with both approaches, i.e. Ferré [2007] and Casas-Garriga [2005], our approach suggests a more general definition of sequences and, thanks to pattern structures, there is no 'pre-mining' step to find frequent (or maximal) subsequences. This allows us to apply different "projections" specializing the request of an expert and simplifying the computations. In addition, in our approach nearly all state-of-the-art FCA algorithms can be used in order to efficiently process a dataset.

There are a number of approaches that help to analyze medical treatment data. However, a direct comparison between them is hardly possible, because every approach is designed for its own problem. For example, [Tsumoto et al., 2014] analyze data of one hospital and provide a different view on the processes within the hospital w.r.t. our approach. Finally and naturally, the most similar approach to our work can be found in [Egho et al., 2014a,b], as some authors of the present paper are involved in this alternative work. In [Egho et al., 2014a,b], the authors mine frequent sequences in a dataset similar to the sequences studied here. However, they approach the complexity of the analysis of such data in a different way. They use a support threshold in order to specify the outcome of the algorithm and do not provide any order in which one can analyze the result. In our case we rely on projections, through which expert knowledge is usually simpler to incorporate than through a support threshold, and we give an order (w.r.t. stability of a concept) which can be used to simplify the analysis of the treatment data.

Conclusion

In this paper, we have presented a novel approach for analyzing sequential data within the framework of pattern structures, an extension of Formal Concept Analysis dealing with complex data. It is based on the formalism of sequential pattern structures and projections. Our work complements the general orientations towards statistically significant patterns by presenting strong formal results on the notion of interestingness from a concept lattice viewpoint. The framework of pattern structures is very flexible and shows some important properties, for example in allowing the reuse of state-of-the-art and efficient FCA algorithms. Using pattern structures leads to the construction of a pattern concept lattice, which does not require the setting of a support threshold, as usually needed in classical sequential pattern mining. Moreover, the use of projections gives a lot of flexibility, especially for mining and interpreting special kinds of patterns (patterns can be proposed at several levels of complexity w.r.t. extraction and interpretation).

Our framework was tested on a real-world dataset with patient hospitalization trajectories. Interesting patterns answering questions of an expert are extracted and interpreted, showing the feasibility and usefulness of the approach, and the importance of stability as a pattern-selection procedure. In particular, projections play an important role here: mainly, they provide means to select patterns of special interest and they help to save computational time (which could otherwise be very large).

For future work, we are planning to investigate projections more deeply, in particular their potential w.r.t. the types of patterns. It can be interesting to introduce and evaluate the stability measure directly on sequences. Another research direction is the mining of association rules or the building of a Horn approximation [Balcázar and Casas-Garriga, 2005] from the stable part of the pattern lattice or from stable sequences. Finally, as discussed above, a precise study combining frequent subsequence mining and FCA-based approaches should be carried out.

Acknowledgments

The fourth co-author was supported within the framework of the Basic Research Program at National Research University Higher School of Economics (Moscow).

Notes on contributors

Aleksey Buzmakov is a PhD student in Informatics at Université de Lorraine (Vandoeuvre-lès-Nancy, France). He holds master's and bachelor's degrees in applied mathematics and physics from Moscow Institute of Physics and Technology. His research interests include data mining and artificial intelligence. In particular he works with Formal Concept Analysis and Pattern Structures in order to mine complex data such as sequences or graphs.

Elias Egho is a Post Doctoral Researcher in Orange Labs (France Telecom Research and Development) with the Profiling & Data Mining team. In 2014, he received a PhD degree in Computer Science from University of Lorraine, Nancy, France, in the LORIA-INRIA Nancy Grand Est laboratory. His main research interest is mining sequential patterns for detection and classification of sequential data.

Nicolas Jay is a professor of biostatistics and medical informatics at the Université de Lorraine. His research interests include medical knowledge representation and knowledge discovery in medical databases, with applications to patient trajectory analysis. He works as a public health physician at the University Hospital of Nancy.

Sergei O. Kuznetsov is a professor of the National Research University Higher School of Economics (HSE), Moscow, where he is the head of the department of data analysis and artificial intelligence. He defended his habilitation thesis ("Doctor of Science") at the Computer Center of the Russian Academy of Sciences (Moscow, Russia) in 2002. He has held the "Candidate of Science" degree (PhD equivalent) from VINITI (Moscow, Russia) since 1990. His research interests include mathematical models, algorithms and algorithmic problems of machine learning, formal concept analysis, data mining, and knowledge discovery.

Amedeo Napoli is a CNRS senior scientist (DR CNRS) and the scientific leader of the Orpailleur research team at LORIA/Inria Laboratory in Nancy. His scientific interests are knowledge discovery (pattern mining and Formal Concept Analysis) and knowledge representation (ontology engineering). He is involved in many national and international research projects with applications in agronomy, biology, chemistry, and medicine.

Chedy Raïssi received his PhD in Computer Science from the University of Montpellier and the Ecole des Mines d'Alès in July 2008. He is currently a research scientist ("Chargé de recherche 1") at the Institut National de Recherche en Informatique et en Automatique (INRIA) in France. His research interests include pattern mining and privacy-preserving data analysis.

References

Mehdi Adda, Petko Valtchev, Rokia Missaoui, and Chabane Djeraba. A framework for mining meaningful usage patterns within a semantically enhanced web portal. In Proceedings of the 3rd C* Conference on Computer Science and Software Engineering, C3S2E '10, pages 138–147, New York, NY, USA, 2010. ACM.

Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, ICDE '95, pages 3–14, Washington, DC, USA, 1995. IEEE Computer Society.

Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu. Sequential pattern mining using a bitmap representation. In KDD, pages 429–435, 2002.

José L. Balcázar and Gemma Casas-Garriga. On Horn Axiomatizations for Sequential Data. In ICDT, pages 215–229, 2005.

Aleksey Buzmakov, Elias Egho, Nicolas Jay, Sergei O. Kuznetsov, Amedeo Napoli, and Chedy Raïssi. On Projections of Sequential Pattern Structures (with an application on care trajectories). In Proc. 10th International Conference on Concept Lattices and Their Applications, pages 199–208, 2013.

Gemma Casas-Garriga. Summarizing Sequential Data with Closed Partial Orders. In Proc. of the 5th SIAM Int'l Conf. on Data Mining (SDM'05), 2005.

Ding-Ying Chiu, Yi-Hung Wu, and Arbee L. P. Chen. An efficient algorithm for mining frequent sequences by a new strategy without support counting. In ICDE, pages 375–386, 2004.

Bolin Ding, David Lo, Jiawei Han, and Siau-Cheng Khoo. Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database. In Proc. of IEEE 25th International Conference on Data Engineering, pages 1024–1035. IEEE, March 2009.

Elias Egho, Nicolas Jay, Chedy Raïssi, Dino Ienco, Pascal Poncelet, Maguelonne Teisseire, and Amedeo Napoli. A contribution to the discovery of multidimensional patterns in healthcare trajectories. Journal of Intelligent Information Systems, 42(2):283–305, 2014a.

Elias Egho, Chedy Raïssi, Nicolas Jay, and Amedeo Napoli. Mining Heterogeneous Multidimensional Sequential Patterns. In ECAI 2014 - 21st European Conference on Artificial Intelligence, pages 279–284, 2014b.

Sébastien Ferré. The Efficient Computation of Complete and Concise Substring Scales with Suffix Trees. In Sergei O. Kuznetsov and Stefan Schmidt, editors, Formal Concept Analysis SE - 7, volume 4390 of Lecture Notes in Computer Science, pages 98–113. Springer, 2007.

Robert B. Fetter, Youngsoo Shin, Jean L. Freeman, Richard F. Averill, and John D. Thompson. Case mix definition by diagnosis-related groups. Med Care, 18(2):1–53, February 1980.

Bernhard Ganter and Sergei O. Kuznetsov. Pattern Structures and Their Projections. In Harry S. Delugach and Gerd Stumme, editors, Conceptual Structures: Broadening the Base, volume 2120 of Lecture Notes in Computer Science, pages 129–142. Springer Berlin Heidelberg, 2001.

Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Springer, 1st edition, 1999.

Jiawei Han, Jian Pei, Behzad Mortazavi-Asl, Qiming Chen, Umeshwar Dayal, and Meichun Hsu. FreeSpan: frequent pattern-projected sequential pattern mining. In Proc. of the 6th ACM SIGKDD Int'l Conf. on Knowledge discovery and data mining, pages 355–359, 2000.

Mikhail Klimushkin, Sergei A. Obiedkov, and Camille Roth. Approaches to the Selection of Relevant Concepts in the Case of Noisy Data. In Proc. of the 8th International Conference on Formal Concept Analysis, ICFCA'10, pages 255–266. Springer, 2010.

Sergei O. Kuznetsov. Learning of Simple Conceptual Graphs from Positive and Negative Examples. In Jan M. Żytkow and Jan Rauch, editors, Principles of Data Mining and Knowledge Discovery SE - 47, volume 1704 of Lecture Notes in Computer Science, pages 384–391. Springer Berlin Heidelberg, 1999.

Sergei O. Kuznetsov. On stability of a formal concept. Annals of Mathematics and Artificial Intelligence, 49(1-4):101–115, 2007.

Florent Masseglia, Fabienne Cathala, and Pascal Poncelet. The PSP approach for mining sequential patterns. In PKDD, pages 176–184, 1998.

Dean Van Der Merwe, Sergei Obiedkov, and Derrick Kourie. AddIntent: A new incremental algorithm for constructing concept lattices. In Gerhard Goos, Juris Hartmanis, Jan Leeuwen, and Peter Eklund, editors,

Concept Lattices ,volume 2961, pages 372–385. Springer, 2004.Carl H. Mooney and John F. Roddick. Sequential pattern mining – approachesand algorithms.

ACM Computing Surveys , 45(2):1–39, February 2013.Jian Pei, Jiawei Han, B. Mortazavi-Asl, H. Pinto, Qiming Chen, U. Dayal,and Mei-Chun Hsu. PreﬁxSpan Mining Sequential Patterns Eﬃciently byPreﬁx Projected Pattern Growth. In , pages 215–226, 2001a.Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umesh-war Dayal, and Meichun Hsu. Preﬁxspan: Mining sequential patterns bypreﬁx-projected growth. In

ICDE , pages 215–224, 2001b.Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, and UmeshwarDayal. Multi-dimensional sequential pattern mining. In

CIKM , pages 81–88,2001.Marc Plantevit, Anne Laurent, Dominique Laurent, Maguelonne Teisseire, andYeow Wei Choong. Mining multidimensional and multilevel sequential pat-terns.

ACM Transactions on Knowledge Discovery from Data , 4(1):1–37,January 2010.Chedy Ra¨ıssi, Toon Calders, and Pascal Poncelet. Mining conjunctive sequentialpatterns.

Data Min. Knowl. Discov. , 17(1):77–93, 2008.Camille Roth, Sergei Obiedkov, and Derrick G Kourie. On succinct represen-tation of knowledge community taxonomies with formal concept analysis AFormal Concept Analysis Approach in Applied Epistemology.

InternationalJournal of Foundations of Computer Science , 19(02):383–404, April 2008.26liana Salvemini, Fabio Fumarola, Donato Malerba, and Jiawei Han. Fast se-quence mining based on sparse id-lists. In

Proceedings of the 19th internationalconference on Foundations of intelligent systems , ISMIS’11, pages 316–325,Berlin, Heidelberg, 2011. Springer-Verlag.Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Gener-alizations and performance improvements. In

Proceedings of the 5th Interna-tional Conference on Extending Database Technology: Advances in DatabaseTechnology , EDBT ’96, pages 3–17, London, UK, UK, 1996. Springer-Verlag.Shusaku Tsumoto, Haruko Iwata, Shoji Hirano, and Yuko Tsumoto. Similarity-based behavior and process mining of medical practices.

Future GenerationComputer Systems , 33(0):21–31, April 2014.Xifeng Yan, Jiawei Han, and Ramin Afshar. CloSpan: Mining Closed SequentialPatterns in Large Databases. In

Proc. of SIAM Int’l Conf. Data Mining(SDM’03) , pages 166–177, 2003.Zhenglu Yang, Masaru Kitsuregawa, and Yitong Wang. Paid: Mining sequentialpatterns by passed item deduction in large databases. In

IDEAS , pages 113–120, 2006.Chung-Ching Yu and Yen-Liang Chen. Mining sequential patterns from multi-dimensional sequence data.

IEEE Trans. Knowl. Data Eng. , 17(1):136–140,2005.Mohammed J. Zaki. Spade: An eﬃcient algorithm for mining frequent se-quences.