From Low-Level Events to Activities -- A Session-Based Approach (Extended Version)
Massimiliano de Leoni and Safa Dundar
Department of Mathematics, University of Padua, Via Trieste, 63 - 35123 Padua (Italy)
Eindhoven University of Technology, Eindhoven, The Netherlands
[email protected]
Abstract.
Process-mining techniques aim to use event data about past executions to gain insight into how processes are executed. While these techniques have proven to be very valuable, they are less successful in reaching their goal if the process is flexible and, hence, exhibits an extremely large number of variants. Furthermore, information systems can record events at a very low level, which does not match the high-level concepts known at the business level. Without abstracting sequences of events into high-level concepts, the results of applying process mining (e.g., discovered models) easily become very complex and difficult to interpret, which ultimately means that they are of little use. A large body of research exists on event abstraction, but typically a large amount of domain knowledge is required, which is often not readily available. Other abstraction techniques are unsupervised, which ultimately return less accurate results and/or rely on stronger assumptions. This paper puts forward a technique that requires limited domain knowledge that can be easily provided. Traces are divided into sessions, and each session is abstracted as one single high-level activity execution. The abstraction is based on a combination of automatic clustering and visualization methods. The technique was assessed on two case studies about processes characterized by high variability. The results clearly illustrate the benefits of the abstraction to convey accurate knowledge to stakeholders.
Keywords:
Event-Log Abstraction · Clustering · Process Mining · Visualization
1 Introduction

Nowadays, large, complex organizations leverage well-defined processes to carry on their business more effectively and efficiently than their competitors. In a highly competitive world, organizations aim to continuously improve their business performance, which ultimately boils down to improving their processes.

The first step towards improvement is to understand how processes are actually being executed. The understanding of the actual process enactment is the goal of process mining. This research field focuses on providing insights by reasoning on the actual process executions, which are recorded in so-called event logs [1]. Event logs group process events in traces, each of which contains the events related to a specific process-instance execution. An event refers to the execution of an activity (e.g., Apply for a loan) for a specific process instance (e.g., customer Mr. Bean) at a specific moment in time (e.g., on January 1st, 2018 at 3:30pm).

Fig. 1: A model for a very flexible process, which shows an ocean of variability.

While process mining has proven to be effective in a wide range of application fields, it has shown its limitations when the process intrinsically allows for a high degree of flexibility [1], or when information systems record executions into logs where events are at a lower level of granularity than the concepts that are relevant from a business viewpoint. Both problems lead to an "ocean" of observed process behavior. This means that, e.g., if one tries to discover a process model, one obtains a model that is very complex and/or low-level, and thus difficult to interpret. As a matter of fact, if the granularity is too low, even event-log visualization through dotted charts [1] is less useful: users are confronted with a chart with too many dots to draw insightful conclusions. Extreme complexity and difficulty of interpretation contrast with the initial purpose of process mining: conveying interpretable insights and knowledge to process stakeholders and owners. Typical examples are in health-care [16], customer journeys [20], online retailer shops, supermarkets, hospitals, home automation, and IoT systems.

Similarly to existing related work (see Section 5), here we advocate the need to abstract low-level events into high-level activities. However, differently from existing related work, we do not want to rely on the provision of an extensive amount of domain knowledge, as existing approaches require: this can be hard to obtain in several domains. On the other hand, we want to avoid completely unsupervised approaches, which naturally show lower accuracy and/or rely on strong assumptions. To balance accuracy and practical feasibility, we aim at a technique that requires process analysts to feed in only knowledge that is limited in quantity and easy to provide.
In a nutshell, the idea is that the events of the same trace can be clustered into sessions such that the time distance between the last event of a session and the first event of the subsequent session is larger than a user-defined threshold. Each trace is seen as a sequence of sessions of events. These sessions are encoded into data points to be clustered; this way, each session is assigned to one cluster. The abstract event log is created such that the entire session is replaced by a high-level event that indicates to which cluster the session belongs. The high-level events need to be named: the centroids of the clusters provide meaningful information for a process stakeholder to identify the high-level activity that corresponds to each cluster. To support stakeholders in this identification, visualization techniques are foreseen, based on heat maps. However, the latter is optional: e.g., without domain knowledge, each cluster may be given a name that coincides with that of the most frequent activity in the sessions of the cluster, or with a concatenation of the names of the most frequent activities, if more than one clearly stands out.

The benefit and feasibility of the proposed technique were assessed on two real-life case studies. The first refers to the werk.nl web site. Results show that over-complex, low-level process models can be converted into high-level counterparts that are accurate according to the process-mining metrics, and that are simple enough to convey information that has business value. However, the idea of session-based clustering goes beyond analysing web sites; it certainly applies to other domains, including online retailer shops, supermarkets, hospitals, home automation, and IoT systems. In general, one can apply the proposed technique to any domain in which events happen in batches/sessions.
A second case study showcases the wider applicability of the technique and focuses on the executions of a process to manage building-permit requests.
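The core replacement step described above, substituting each session with a pair of high-level start/complete events named after the session's cluster, can be illustrated with a small sketch. The Event structure, the cluster_name callable, and the toy session values are hypothetical, not part of the approach's implementation:

```python
from dataclasses import dataclass

@dataclass
class Event:
    activity: str
    timestamp: float  # e.g. minutes since some epoch

def abstract_trace(sessions, cluster_name):
    """Replace each session (a list of Events) by a start/complete
    event pair named after the cluster the session belongs to.
    `cluster_name` is a hypothetical callable deciding the cluster."""
    abstract = []
    for session in sessions:
        name = cluster_name(session)
        abstract.append(Event(name + "_start", session[0].timestamp))
        abstract.append(Event(name + "_complete", session[-1].timestamp))
    return abstract

# Toy example: two sessions, both mapped to a cluster called "Search".
sessions = [[Event("home", 0), Event("vacatures", 2)],
            [Event("vacatures", 60), Event("mijn cv", 63)]]
out = abstract_trace(sessions, lambda s: "Search")
print([(e.activity, e.timestamp) for e in out])
# → [('Search_start', 0), ('Search_complete', 2),
#    ('Search_start', 60), ('Search_complete', 63)]
```

Note how the abstract trace keeps only the session boundaries, which is exactly the information loss the abstraction trades for readability.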
The technique is beneficial not only when discovering a model, but also in a wider range of applications of diverse process-mining techniques.
In support of this statement, we showcase an example: for the second case study, the abstract log is used to compare the management of building permits when different city-hall employees are responsible.

Section 2 introduces the initial motivating example of the werk.nl web site. Section 3 introduces the abstraction technique, while Section 4 reports on the evaluation on the two case studies. Section 5 compares with the related work, while Section 6 concludes the paper, delineating avenues of future work.
2 Motivating Example: The werk.nl Web Site

The werk.nl web site is a very significant example of a customer journey, intended as the product of the interaction between an organization and a customer throughout the duration of their relationship. Gartner highlights the importance of managing the customer's experience, which is seen as "the new marketing battlefront". The werk.nl web site is run by UWV, the social security institute that implements employee insurances and provides labour-market services to residents in the Netherlands. Specifically, the web site supports unemployed Netherlands residents in the process of job reintegration. Once logged in to the web site, people can upload their own CVs, search for suitable jobs and, more in general, interact with UWV via messages, as well as ask questions, file complaints, etc. The web site is structured into sections of pages, and logged-in users can arbitrarily switch from one to another. However, to improve the experience, it would be worthwhile to introduce supporting wizards. The starting point for designing such wizards is to gain insights into the typical ways in which the web site is actually used.

(Here, a process is intended as a set of activities that are executed while complying with given ordering constraints. The activities can be of any nature, ranging from the more traditional ones performed, e.g., by a bank or city-hall employee, to web-page visits or activities executed by domotics or IoT systems, such as by/with TVs, ovens, bulbs, bathtubs, or heaters.)

(Key Findings From the Gartner Customer Experience Survey.)
Fig. 2: The steps of the abstraction technique based on sessions.

Publicly available is an event log that collects the browsing behavior of the logged-in visitors in the period from July 2015 to February 2016. The event log consists of 335,655 events divided into 2,624 traces. We tried to discover a model of the web-site interaction without abstracting the event log. Figure 1 shows the result obtained through the new Heuristic Miner [15]. Similar results are also obtained through other miners, and all show the problems mentioned above: the model is overly complex, with an "ocean" of activity dependencies. While this is certainly not surprising, given the freedom in visiting the web site, one still wants to discover a model that provides insights for the stakeholders.
3 The Abstraction Technique

This section introduces the technique for clustering low-level events into high-level activities. The procedure consists of four main steps, as visualized in Figure 2. The starting point is an event log. All the traces of the event log are split into sessions, which are then clustered; the centroids of the found clusters are visualized on a heat map to support assigning a name to each cluster. Finally, the abstract event log is created: each session is replaced by two events (e.g. C_st and C_co in the figure) with the same name as that given to the cluster to which the session belongs. The two events refer to the start and the completion of the session and take on the timestamps of, respectively, the first and the last event of the session.

The starting point of our technique is an event log, which consists of a set of traces, each of which is a sequence of unique events:
Definition 1 (Event, Trace, Log).
Let E be the universe of events. A trace σ ∈ E* is a sequence of events. An event log L is a set of traces, i.e. L ⊆ E*.

(The dataset is available at https://doi.org/10.4121/uuid:01345ac4-7d1d-426e-92b8-24933a079412)
Events carry information: given an event e ∈ E, λ_A(e) and λ_T(e) respectively return the activity associated with e and the timestamp at which e occurred. In the remainder, e ∈ L indicates that there is a trace σ ∈ L s.t. e ∈ σ. Given a trace σ′ = ⟨e_1, …, e_n⟩, σ′(i) returns the i-th event of the trace, namely σ′(i) = e_i; also, |σ′| returns the number of events of σ′, namely n. Furthermore, given a second trace σ″ = ⟨f_1, …, f_m⟩, σ′ ⊕ σ″ indicates the trace obtained by concatenating σ″ at the end of σ′, i.e. σ′ ⊕ σ″ = ⟨e_1, …, e_n, f_1, …, f_m⟩.

As mentioned in Section 1, we leverage clustering techniques. In a nutshell, these take a multiset of N-tuples, elements of a domain D_1 × … × D_N, and split it into a number of disjoint smaller multisets:

Definition 2 (Clustering).
Let 𝓜 be the set of all multisets of data points defined over the cartesian product D_1 × … × D_N. A clustering technique can be abstracted as a function CLUSTER : 𝓜 → ℘(𝓜) that, given a multiset M ∈ 𝓜, returns a clustering of M into a set {M′_1, …, M′_n} of multisets such that M′_1 ⊎ … ⊎ M′_n = M and, for any 1 ≤ i ≤ j ≤ n, m ∈ M′_i ∧ m ∈ M′_j ⇒ M′_i = M′_j.

The first step of the technique is to identify the sessions. We introduce a session threshold ∆, a time range. For each trace σ = ⟨e_1, …, e_n⟩ in an event log, we iterate over its events and create a sequence of sessions ⟨s_1, …, s_m⟩. We create a session s_k = ⟨e_i, …, e_j⟩, a subsequence of σ, if (1) the timestamp difference between e_{i-1} and e_i, and between e_j and e_{j+1}, is larger than or equal to ∆, and (2) the timestamp difference between any two consecutive events in ⟨e_i, …, e_j⟩ is smaller than ∆:

Definition 3 (Sessions of a Trace).
Let σ = ⟨e_1, …, e_n⟩ ∈ E* be a log trace and let ∆ be a time interval. sessions_∆(σ) = ⟨s_1, …, s_m⟩ ∈ (E*)* denotes the session sequence of σ: (1) for any 1 ≤ i < m, λ_T(s_{i+1}(1)) − λ_T(s_i(|s_i|)) ≥ ∆; (2) for any 1 ≤ i ≤ m and 1 ≤ j < |s_i|, λ_T(s_i(j+1)) − λ_T(s_i(j)) < ∆; (3) σ = s_1 ⊕ … ⊕ s_m.

The third condition states that, if we concatenate the sessions into which σ was split, we obtain σ back. The following example further clarifies:

Example 1.
Consider a trace σ = ⟨a_0, b_2, c_4, a_10, d_13⟩ of an event log L. The letter indicates the activity name, and the subscript is the timestamp of the event's occurrence (e.g. d occurred at time 13). Assume that the time interval ∆ = 5. One can easily see that the time difference between the second occurrence of a and the occurrence of c is greater than the given time interval ∆ (λ_T(a_10) − λ_T(c_4) = 6 > ∆ = 5), thus resulting in two sessions: sessions_∆(σ) = ⟨s_1, s_2⟩ where s_1 = ⟨a_0, b_2, c_4⟩ and s_2 = ⟨a_10, d_13⟩. Note that the concatenation results in σ: σ = s_1 ⊕ s_2.

Once the sessions are identified, the next step is to cluster them. To apply clustering techniques, each session needs to be encoded as a vector p, a point of a cartesian space D_1 × … × D_N. This encoding can be made using different policies. As an example, a session s can be encoded into a vector that contains one integer dimension for each activity a, which takes as value the number of occurrences of events referring to activity a in session s. The encoding is abstracted as a function ENCODE(s, σ, L) that returns a tuple encoding a session s of a trace σ of an event log L. Given an event log L, we create the multiset of data points as follows:

M_L = ⊎_{σ ∈ L} ( ⊎_{s ∈ sessions_∆(σ)} ENCODE(s, σ, L) )    (1)

which is then clustered into {M′_1, …, M′_n} = CLUSTER(M_L). The remainder illustrates two encodings that are of more general applicability. However, it is possible to seamlessly plug in new encodings.

(Given a (multi)set M, ℘(M) denotes the powerset, namely the set of all sub(multi)sets of M. The operator ⊎ denotes the union of multisets, i.e. the cardinality of an element in the union is the sum of its cardinalities in the joined multisets.)

Frequency-based Encoding.
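The session splitting of Definition 3, together with the frequency-based encoding, can be sketched in a few lines. This is a minimal illustration, assuming traces are given as (activity, timestamp) pairs sorted by time; the function names and toy timestamps are ours, not part of the approach's implementation:

```python
def split_sessions(trace, delta):
    """Split a trace (a time-sorted list of (activity, timestamp)
    pairs) into sessions: a gap of at least `delta` between two
    consecutive events starts a new session (Definition 3)."""
    sessions = [[trace[0]]]
    for prev, cur in zip(trace, trace[1:]):
        if cur[1] - prev[1] >= delta:
            sessions.append([])
        sessions[-1].append(cur)
    return sessions

def encode_freq(session, activities):
    """Frequency-based encoding: one dimension per activity,
    counting its occurrences in the session."""
    return tuple(sum(1 for a, _ in session if a == act)
                 for act in activities)

# Toy trace in the spirit of Example 1, with delta = 5.
trace = [("a", 0), ("b", 2), ("c", 4), ("a", 10), ("d", 13)]
s1, s2 = split_sessions(trace, delta=5)
print(encode_freq(s1, "abcd"))  # → (1, 1, 1, 0)
print(encode_freq(s2, "abcd"))  # → (1, 0, 0, 1)
```

The gap between c and the second a (6 time units) is the only one reaching the threshold, so exactly two sessions result.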
Let L be an event log defined over the activity set A_L = ∪_{e ∈ L} λ_A(e). Given a session s of a trace σ ∈ L, the frequency-based encoding returns a tuple where each element is associated with a different activity of L and takes as value the number of occurrences of the respective activity in the session. For instance, the sessions s_1 and s_2 of Example 1 are encoded as quadruples where the elements from the first to the fourth dimension take on values equal to the number of occurrences of, respectively, a, b, c, d: namely, ENCODE_freq(s_1, σ, L) = (1, 1, 1, 0) and ENCODE_freq(s_2, σ, L) = (1, 0, 0, 1). This encoding is useful when one wants to cluster on the basis of the frequency of occurrence of activities in sessions. Consider, for instance, an online retail shop where each log trace contains one event for each item that is added to the basket. Each web-site visit corresponds to a session. The frequency-based encoding makes a vector out of each session with as many dimensions as the products that can potentially be added to a basket: the value of a certain dimension coincides with the quantity bought of the product associated with that dimension. More formally:

Definition 4 (Frequency-based Encoding).
Let L be an event log and let A = {a_1, …, a_n} be the activities of L, namely A = ∪_{σ ∈ L} ∪_{e ∈ σ} λ_A(e). Given a trace σ ∈ L and a time interval ∆, let s ∈ sessions_∆(σ) be a session of σ. The frequency-based encoding of s is ENCODE_freq(s, σ, L) = (c_{a_1}, …, c_{a_n}) such that, for all 1 ≤ i ≤ n, c_{a_i} is the number of events e ∈ s for activity a_i: c_{a_i} = |{e ∈ s . λ_A(e) = a_i}|.

Duration-Based Encoding.
Given a session s = ⟨es_1, …, es_m⟩ of a trace σ of an event log L, the duration-based encoding ENCODE_dur(s, σ, L) returns a tuple (d_{a_1}, …, d_{a_n}) where, for each a_i in the log activity set A_L, d_{a_i} is the average duration of the executions of activity a_i in s. The average duration of a_i can be computed as the average of

λ_T(es_{j+1}) − λ_T(es_j)    (2)

over all es_j s.t. j < m and λ_A(es_j) = a_i. For the last event es_m, we take the average duration of all executions of λ_A(es_m) associated with events in L that were not the last in their respective sessions, i.e. the events in L for which Equation 2 can be computed. For further clarification, let us again consider the sessions s_1 and s_2 of Example 1: they are encoded as quadruples where the elements from the first to the fourth dimension take on values equal to the average durations of activities a, b, c, d. Let avg(c, L) and avg(d, L) be the average durations of c and d in the event log L of Example 1. For this example, we have ENCODE_dur(s_1, σ, L) = (2, 2, avg(c, L), 0) and ENCODE_dur(s_2, σ, L) = (3, 0, 0, avg(d, L)). Note that this way of computing is based on the idea that events record the start of an activity execution, and none records the completion. This specific choice was driven by the analysis of the werk.nl web site: events are associated with starting to visit a web-site page, and users remain on that page until they start visiting the next. However, new encodings can be put forward which consider events as the execution's completion, which cross information about resource utilization and activity executions [17], or which are based on the exact duration, if derivable from or present in the event.

Fig. 3: An example of a heat map of the cluster centroids (part a) and of the names that can be given to the clusters (part b).

The step of Section 3.3 produces a set of clusters {M′_1, …, M′_n} (cf. Equation 1), which is the input to build the abstract event log. As mentioned, clusters need to be given names. Here, we advocate the use of heat maps to visualize the cluster centroids and, hence, to facilitate the assignment of names to clusters. An example is in Figure 3(a), which refers to the application to the werk.nl web site. Each row and each column respectively refer to a different low-level event (a dimension of the clustering space) and to a different cluster. In particular, the centroid of each cluster is normalized between 0 and 1 and shown on the heat map through different red-color intensities, with 0 being white and 1 being the most intense red. The color for a column X and a row Y is proportional to the value of the dimension for low-level event Y in the centroid of cluster X. The normalization of a given centroid (c_1, …, c_n) is achieved by dividing by the sum of the centroid's values: (c_1/sum, …, c_n/sum) where sum = c_1 + … + c_n.
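The sum-based normalization can be sketched as follows; the centroid values here are made up for illustration:

```python
def normalize_centroid(centroid):
    """Normalize a cluster centroid by the sum of its values, so each
    centroid sums to 1 regardless of the cluster's absolute magnitude."""
    total = sum(centroid)
    return tuple(c / total for c in centroid) if total else centroid

print(normalize_centroid((40, 2, 0, 0, 0, 0)))
# roughly (0.95, 0.05, 0, 0, 0, 0): the dominant dimension stands out

# Dividing by the global maximum instead would flatten small centroids:
print(tuple(c / 42 for c in (1, 1, 0, 0, 0, 0)))
# near-zero values in every dimension, hiding the cluster's profile
```

This is why normalizing per centroid (by its own sum) preserves the relative profile of small clusters, whereas a single global divisor would wash them out.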
The following example illustrates this normalization:
Algorithm 1: Creation of an Abstract Event Log
Input: an event log L ⊆ E*, a set M = {M_1, …, M_n} of clusters with names NAME(M_1), …, NAME(M_n)
Result: an abstract event log L′

 1  L′ ← ∅
 2  foreach σ ∈ L do
 3      σ′ ← ⟨⟩
 4      foreach session ⟨e_1, …, e_m⟩ ∈ sessions_∆(σ) do
 5          c ← ENCODE(⟨e_1, …, e_m⟩, σ, L)
 6          pick M_i ∈ M s.t. c ∈ M_i
 7          create events e_start and e_complete s.t.
 8              λ_A(e_start) = λ_A(e_complete) = NAME(M_i)
 9              λ_T(e_start) = λ_T(e_1)
10              λ_T(e_complete) = λ_T(e_m)
11          σ′ ← σ′ ⊕ ⟨e_start, e_complete⟩
12      end
13      L′ ← L′ ∪ {σ′}
14  end
15  return L′

Example 2.
Let us assume the following centroids: (1, 1, 0, 0, 0, 0), (40, 2, 0, 0, 0, 0), (0, 42, 2, 0, 0, 0), (1, 0, 1, 0, 0, 0), (0, 0, 1, 1, 0, 0). The normalization produces (0.5, 0.5, 0, 0, 0, 0), (≈0.95, ≈0.05, 0, 0, 0, 0), (0, ≈0.95, ≈0.05, 0, 0, 0), (0.5, 0, 0.5, 0, 0, 0), (0, 0, 0.5, 0.5, 0, 0).

Note that we do not normalize by simply dividing by the largest value, such as 42 in Example 2. If we did so, the first, fourth and fifth centroids would be normalized to vectors with almost-zero values in all dimensions.

If one obtains a heat map such as that in Figure 3(a), the stakeholder is largely facilitated in assigning names to clusters, because almost every cluster is characterized by a centroid with predominant values for one or two dimensions, each associated with a different activity. This stakeholder involvement is optional: if this domain knowledge is absent, each cluster can be given a name that simply coincides with the predominant dimension, or with the concatenation of the predominant ones. In sum, each cluster M_i is given a name NAME(M_i), which makes it possible to synthesize the abstract event log as follows. Algorithm 1 illustrates the procedure. For each log trace σ, the algorithm builds a new trace σ′ to be added to the abstract log L′ as follows: for each session s = ⟨e_1, …, e_m⟩ ∈ sessions_∆(σ), the algorithm determines the cluster M_i to which the session belongs (lines 5 and 6) and adds two events e_start and e_complete to the tail of σ′ (lines 7 and 11). Events e_start and e_complete respectively represent the start and the end of session s with the corresponding timestamps (see lines 9 and 10), and they refer to the high-level activity NAME(M_i) (line 8).

4 Evaluation

The abstraction technique introduced in this paper has been implemented as a plug-in named Session-based Log Abstraction in the TimeBasedAbstraction package of the nightly-build version of ProM. To date, the implementation features the K-means and DBSCAN clustering algorithms. As discussed in Section 3.3, the cluster centroids are visualized on a heat map to provide users with the necessary help to determine the high-level activity names: the heat-map visualization is provided via the JHeatChart library. The rest of this section illustrates the application to two case studies, for process discovery and behavior comparison.

4.1 The werk.nl Website
This section focuses on illustrating the successful application to the case study of the werk.nl website. To abstract the event log, we used the duration-based encoding (cf. Section 3.3): it is certainly more important to consider how long visitors stay on a web page than to just count the number of visits to the different pages. For instance, three 1-minute visits to any page should not have more weight than one 30-minute visit to the same page. The session threshold ∆ was set to 15 minutes, because it coincides with the session timeout of the werk.nl web site.

Initially, the data points that encode the sessions of the log traces were clustered via DBSCAN. The generation of the clusters with DBSCAN took nearly 2 hours on a low-profile laptop with 8 GB of RAM. The clusters' centroids were visualized through the heat map in Figure 3(a). To help stakeholders, the plug-in removes the rows referring to low-level events that, after normalization, are associated with nearly-zero values in all centroids. The results in Figure 3(a) are certainly very interesting: the sessions of a certain cluster are characterized by a few particular pages, visited long and often. Note that DBSCAN does not always return clusters: it would have failed if it had not been possible to cluster the data points. Without using additional domain knowledge, each cluster was named after the low-level event (i.e. web page) that refers to the dimension with the highest value in the centroid (the most intense red color). This led to the names in Figure 3(b).

Once the names were assigned to clusters, we generated the corresponding abstract event log. To validate the quality of the abstract event log, it was randomly split into a 70% part, which was used for discovery, and a 30% part, used for testing. The DBSCAN algorithm naturally computes outliers, namely points that are not assigned to any cluster.
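As an illustration of this clustering step, the following sketch uses scikit-learn's DBSCAN; the eps/min_samples values and the toy points are hypothetical, not those of the case study. Points labeled -1 are the outliers just mentioned:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical encoded sessions: two dense groups and one isolated point.
points = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                   [5.0, 5.0], [5.2, 5.0], [5.0, 5.2],
                   [20.0, 20.0]])

# eps and min_samples are illustrative values only.
labels = DBSCAN(eps=1.0, min_samples=2).fit(points).labels_
print(labels)  # the isolated last point receives label -1 (outlier)
```

DBSCAN's label -1 marks exactly the sessions that the post-processing discussed next has to deal with.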
Results show that, if those outliers are simply filtered out, the quality of the discovered model drops significantly (see the discussion below, summarized in Table 1). Therefore, we performed a post-processing step where each outlier session is manually inserted into the cluster with the closest centroid. The abstract event log with the manual cluster assignment of outliers was used as input for the new Heuristic Miner [15], thus discovering the model in Figure 4, using the Causal-Net notation [1].

The same procedure was employed to discover a high-level model with K-means, using the same 70% of the traces for discovery, and the same temporal threshold and encoding as for DBSCAN. Note that, compared with DBSCAN, K-means requires one to explicitly set the number of clusters to create. Our implementation features the Elbow Method to facilitate this setup [12]: when applied to the case study, creating ten clusters seemed to provide a good balance between minimizing the error and not scattering the sessions among too many clusters (i.e. high-level activities). The resulting model is in Figure 5.

Fig. 4: Process model produced by the Heuristic Miner [15] on 70% of the abstract event log of the werk.nl dataset, clustering via DBSCAN.

The quality of these models was assessed through the classical process-mining metrics of fitness, precision, generalization and simplicity [1]. Fitness was computed on the 30% of the abstract log that was not used for discovery. This is in accordance with typical machine-learning practice of verifying process-model "recall" on traces that were not used for discovery. Conversely, precision and generalization were computed on the entire abstract log. Finally, simplicity was measured as the sum of the activities, arcs and bindings of the causal nets. Since fitness, precision and generalization are traditionally defined on Petri nets [1], the causal nets were converted to Petri nets using the implementation in [15].
The resulting Petri nets were manually adjusted to ensure soundness whilenot adding extra behavior. Of course, to keep the comparison fair, all models were dis-covered by the Heuristic Miner [15], using the same configuration of parameters.
This includes the model in Figure 1.

Table 1 illustrates the results of the comparison of the models discovered through the abstract event logs obtained via K-means and DBSCAN. They generalize equally well and are of similar complexity (the variation in simplicity is around 10-12%). The abstract model obtained when applying DBSCAN without post-processing shows very poor fitness, which is conversely satisfactory when applying K-means or DBSCAN with post-processing. Focusing on precision, the model of DBSCAN with post-processing is characterized by a precision that is 2.25 times that of the K-means model. This leads to the conclusion that DBSCAN with post-processing has produced a better model in terms of fitness, simplicity, precision and generalization. Intuitively, this is not surprising:
Fig. 5: Process model produced by the Heuristic Miner [15] on 70% of the abstract eventlog of the werk.nl dataset, clustering via K-Means.
                K-Means   DBSCAN with       DBSCAN without
                          post-processing   post-processing
Fitness         0.6637    0.6270            0.2785
Precision       0.33192   0.74779           0.68247
Generalization  0.99962   0.99996           0.99998
Simplicity      81        91                79
Table 1: Measures of the quality of the models discovered on the logs abstracted through K-means and DBSCAN. For DBSCAN, we report the values with and without the post-processing that manually inserts outliers.

DBSCAN is based on maximizing the cluster density, ensuring that "similar" sessions are put in the same cluster.

In conclusion, the model in Figure 4 is the most preferable, and unarguably more understandable when compared with the non-abstract model in Figure 1. From a business viewpoint, it illustrates that typical users navigate the werk.nl web site as follows. During the first session, users visit the home page and also the page taken (Dutch for tasks), where they can see the tasks assigned by UWV (e.g. to upload certain documents). If no tasks are assigned via the web site, the interaction with the web site completes. If there are tasks, users look for jobs to apply for (page vacatures zoeken) and/or amend the information that they previously provided (page wijziging doorgeven). If information is amended, usually an updated curriculum is uploaded (cf. the branch of the model starting with page mijn cv) and/or the visitor looks for and possibly applies to jobs (cf. the branches of the model starting with pages vacature and vacature bij mijn cv, which are either both executed or both skipped). Looking at the statistics, the mean and median duration of the web-site interaction (i.e. of the log traces) is around 20 weeks (more than 4 months) and, hence, the visiting sessions are certainly temporally spread. One can also observe that every session type is usually repeated multiple times, likely because the corresponding tasks are carried out through similar sessions on consecutive days. It is, however, remarkable that the model does not contain larger loops involving different session types.
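The outlier post-processing discussed above, inserting each outlier session into the cluster with the closest centroid, can be sketched as follows; the points, labels and centroids are toy values for illustration:

```python
import math

def reassign_outliers(points, labels, centroids):
    """Assign each outlier (label -1) to the cluster whose centroid
    is closest in Euclidean distance; other labels are kept."""
    fixed = list(labels)
    for i, p in enumerate(points):
        if fixed[i] == -1:
            fixed[i] = min(centroids,
                           key=lambda c: math.dist(p, centroids[c]))
    return fixed

points = [(0.0, 0.0), (5.0, 5.0), (4.0, 4.5)]
labels = [0, 1, -1]                      # the last point is an outlier
centroids = {0: (0.0, 0.0), 1: (5.0, 5.0)}
print(reassign_outliers(points, labels, centroids))  # → [0, 1, 1]
```

The outlier at (4.0, 4.5) is far closer to centroid 1 than to centroid 0, so it joins cluster 1 instead of being discarded.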
This means that the web site is visited in conceptual sections: once users start accessing the pages of a given section, the pages of previous sections are no longer visited. Note that the web site neither defines such sections nor restricts the order in which pages can be visited. In fact, this testifies to the benefits of introducing wizards. We acknowledge that information is lost in the abstraction. However, this loss is justified by the gain in comprehensible business knowledge. As a matter of fact, this model was shown to a UWV stakeholder, who literally said "this is the most understandable analysis of the web-site behavior that I have seen, certainly beyond the results seen for the BPI Challenge".

4.2 The Building-Permit Process

This section discusses a second case study to illustrate the applicability of the technique beyond werk.nl. The case study refers to the execution of a process to manage building-permit applications in a Dutch municipality. There are 304 different activities, denoted by their respective English names as recorded in the attribute taskNameEN. The event log spans a period of approximately four years and consists of 44,354 events divided into 832 cases. Figure 6 shows the model discovered with the
Inductive Miner - Infrequent Behavior [14], using the default configuration. The model shows exactly the same problems as that in Figure 1: the large variability has made the miner discover an overly complex model. See, e.g., the large OR-split around the area highlighted by a red circle in the picture. We applied the abstraction technique to the event log, using the frequency-based encoding (cf. Section 3.3) and the DBSCAN clustering algorithm with post-processing, which proved to perform better in the first case study reported in Section 4.1. A session threshold of 8 hours was employed, so that the events of the same day were put in the same (work) session.

(Indeed, the BPI challenge in 2016 was based on the same event data.)

(The event log is available at http://dx.doi.org/10.4121/uuid:63a8435a-077d-4ece-97cd-2c76d394d99c)

The clustering step resulted in the heat map in Figure 7(a) where, similarly to the previous case study, infrequent activities are filtered out, and each cluster centroid has
Fig. 6: Building-permit process model produced by the Inductive Miner without ab-straction: overly complex to be insightful.significantly non-zero values for the dimensions for one or few low-level activities.Analogously to the first case study, clusters were given the same name as the low-levelactivity with the most intense red colour in the heat map, possibly concatenated withthe names of the additional activities with a significantly red color (see Table 7b).The abstract event log was then generated and used as input for the
Inductive Miner - Infrequent with default parameter values, namely the same as for the non-abstracted model in Figure 6. This yielded the model in Figure 8, which is unarguably simpler and emphasises the most salient behavioral aspects. This model is a good representation of the actual behavior: its fitness is 0.79. Unfortunately, it was not possible to compute precision and generalization because the reference ProM implementation (see [1]) got stuck and never terminated the computation.

We previously claimed that event abstraction is not only about model discovery: it enables a fruitful application of a large repertoire of process-mining techniques. The remainder of this section provides support for this claim. In particular, we show that abstraction makes it possible to highlight that the executions under the responsibility of certain resources are statistically different from those of other resources. To achieve this, we leveraged the technique proposed in [5]. The technique allowed us (1) to find out that the executions under the responsibility of resource 560458 are remarkably different, and (2) to pinpoint what these differences are. The latter piece of knowledge can be gained by looking at the transition system obtained by the technique in [5], which is shown in Figure 9. In the transition system, nodes are the events' activities, and an arc between two activity nodes indicates that, in the event log, the source activity is sometimes followed by the destination activity. Nodes and arcs are coloured with different shades of blue and orange to indicate that the activity or transition is statistically more or less frequent for 560458, respectively. The thickness of arcs and node borders signifies the frequency of occurrence, and the darkness of the colour is proportional to the average difference. In Figure 9, e.g., the high-level activity enter send date procedure confirmation stands out: it occurs in 67% of the cases of resource 560458 versus 13.7% of the cases where others are responsible.
Similar proportions are also observed for the high-level activity enter date publication decision environmental permit. Conversely, the high-level activity register deadline is coloured orange, showing that it is statistically more frequent for the cases in which resources other than 560458 are responsible. It follows quite naturally that, without abstraction, the behavioral complexity represented in Figure 6 would generate such a complex transition system that no fruitful insights could be derived.

Fig. 7: The heat map of the cluster centroids for the building-permit process (part a) and the names given to the clusters (part b).
A large body of research has been conducted on log abstraction. It can be grouped into two categories: supervised and unsupervised abstraction. The difference is that supervised abstraction techniques require process analysts to provide domain knowledge, while unsupervised ones do not rely on additional information.
Supervised Abstraction Methods.
Baier et al. provide a number of approaches that, based on some process documentation, map events to higher-level activities [2,3,4], using log-replay techniques and solving constraint-satisfaction problems. The idea of replaying logs onto partial models is also present in [16]: the input is a set of models of the life cycles of the high-level activities, where each life-cycle step is manually mapped to
low-level events. Ferreira et al. [10] rely on the provision of one Markov model, where each Markov-model transition is a different high-level activity. In turn, each transition is broken down into a new Markov model where the low-level events are modelled. Fazzinga et al. [9] assume that process analysts provide a probabilistic process model with the high-level activities, along with a probabilistic mapping between low-level events and high-level activities. The technique returns an enumeration of all potential interpretations of each log trace in terms of high-level activities, ranked by their respective likelihood. In [18], the authors propose a supervised abstraction technique that is applicable in those cases in which annotations with the high-level interpretations of the low-level events are available for a subset of traces.

Fig. 8: Building-permit process model produced by the Inductive Miner with abstraction, clustering via DBSCAN. The fitness value is 0.79; the precision and generalization values are not reported because the reference software implementation never terminated.

Fig. 9: Comparison of the building-permit process behavior between executions when resource 560458 is responsible and when others are.
Unsupervised Abstraction Methods.
Log abstraction is related to episode mining and its application to process mining (a.k.a. the discovery of local process models) [13,19]. In fact, Mannhardt and Tax propose a method that combines local process model discovery with the supervised abstraction technique in [16]. However, the technique relies on the ability to discover a limited number of local process models that are accurate and cover most of the low-level event activities. In [11], Günther et al. cluster events by looking at their correlation, which is based on the vicinity of occurrences of events for the same low-level activity in the entire log. Clustering is also the basic idea of [7], which clusters events through a fuzzy k-medoids algorithm. Both [7] and [11] share the drawback that the time aspects are not considered and, thus, they can cluster events that are temporally distant (e.g. web-site visits that are weeks apart from each other). Also, [7] only aims to discover a fuzzy high-level model, instead of abstracting event logs to enable a broader process-mining application, whereas [11] assumes a transitive nature of the property of activity correlation, which does not always hold. See Figure 3(a): cluster 3 shows a correlation between Visit page werkmap and Visit page vacature bij mijn CV, and cluster 4 shows a correlation between Visit page werkmap and Visit page taken, while no correlation exists between Visit page vacature bij mijn CV and Visit page taken. Finally, van Eck et al. [8] illustrate a technique to gather observations from sensor data, and to encode and cluster them in a similar way as our approach does. However, they assume that events (in fact, sensor observations) are generated at a constant rate.
Log Clustering vs Log Abstraction.
This paper has discussed a log-abstraction technique that builds on machine-learning clustering techniques. Event-log clustering also leverages the same techniques [6]. However, event-log clustering has a different purpose, because it is based on the idea of splitting the traces into homogeneous groups, without altering the contents of the traces themselves.
Abstracting and grouping low-level events into high-level activities is a problem that is receiving a lot of attention. Often, event logs are not immediately ready to be used because they model concepts that are not at the right business level and/or they exhibit too broad a variety of behavior to be summarized into one simple model, map, diagram, etc.

Section 5 illustrates how, on the one hand, supervised methods often require vast domain knowledge (e.g. through process models, Markov chains or mapping ontologies), which is not always possible to provide. On the other hand, unsupervised methods show limitations related to the absence of any external knowledge. This paper reports on a third way, where very limited domain knowledge is necessary. The technique is based on the idea that a trace can be regarded as a sequence of sessions, each of which terminates when no additional events occur within a user-defined time interval. The sessions are later clustered; finally, a heatmap visualization of the clusters is provided to domain experts, so that they can assign meaningful high-level concepts to the sessions, i.e. sequences of low-level events. Admittedly, the concept of sessions and the use of clustering techniques and heat-map visualizations are not novel in process mining, if each is taken in isolation. The innovation here is that we assemble them to provide a solution to the problem of abstracting low-level events to high-level concepts, with the advantages mentioned above.
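The session-based pipeline summarized above can be sketched in a few lines. The following is a minimal illustration, not the actual implementation: events are assumed to be (timestamp-in-hours, activity) pairs, the 8-hour threshold mirrors the case studies, and all activity names, timestamps and DBSCAN parameters are invented.

```python
# Illustrative sketch of session splitting, frequency-based encoding and
# clustering. All events and parameter values below are hypothetical.
from collections import Counter
from sklearn.cluster import DBSCAN

def split_sessions(trace, threshold=8.0):
    """Cut a trace into sessions wherever the inactivity gap between two
    consecutive events exceeds the threshold (here, in hours)."""
    sessions, current = [], [trace[0]]
    for prev, ev in zip(trace, trace[1:]):
        if ev[0] - prev[0] > threshold:
            sessions.append(current)
            current = []
        current.append(ev)
    sessions.append(current)
    return sessions

def frequency_encode(session, activities):
    """Frequency-based encoding: one dimension per low-level activity,
    holding the number of occurrences in the session."""
    counts = Counter(activity for _, activity in session)
    return [counts[a] for a in activities]

trace = [(0, "view taken"), (1, "view taken"), (2, "view werkmap"),
         (30, "send letter"), (31, "send letter")]
activities = sorted({activity for _, activity in trace})

sessions = split_sessions(trace)   # the gap between hours 2 and 30 cuts here
vectors = [frequency_encode(s, activities) for s in sessions]
labels = DBSCAN(eps=1.5, min_samples=1).fit_predict(vectors)
# Each session is abstracted to one execution of the high-level activity
# associated with its cluster (named by a domain expert via the heatmap).
print(sessions, list(labels))
```

In the actual technique the cluster centroids are then visualized as a heatmap so that domain experts can name the clusters; min_samples=1 is only a toy setting to prevent DBSCAN from labelling the two example sessions as noise.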
Section 4 reports on the successful evaluation of the proposed technique on two case studies, discussing both a quantitative and a qualitative analysis. The qualitative analysis shows that our log-abstraction technique (1) allows overly complex models to be simplified, by focusing on the higher-level concepts, and (2) is applicable beyond process discovery (see Figure 9). Quantitatively, the evaluation showed that the discovered models can reasonably balance the typical process-mining metrics: fitness, precision, generalization and simplicity [1].

The technique does not depend on any specific clustering algorithm, and this explains why concrete algorithms are only mentioned in Section 4. While we acknowledge that a more thorough assessment is necessary, Section 4 shows that the best performance is obtained with DBSCAN, which has the advantage of automatically computing the best number of clusters. This further motivates that our technique requires little knowledge to be supplied: considering that the provision of the cluster names is optional (cf. Section 3.4), the only input for our technique is the session threshold.

The technique is applicable to all those domains where customers perform activities in batches/sessions, including retail shopping (e.g. Amazon or supermarkets) and health care (e.g. hospitals), as well as scenarios of home automation and/or IoT. All of these domains loosely constrain the order in which activities are executed, which ultimately leads to an "ocean" of alternative behavior.

In spite of the assessment reported in this paper, we acknowledge that further validation is needed, especially in such domains as those mentioned in the previous paragraph. In parallel, the technique can be further extended towards achieving better clustering. Firstly, the technique needs to be extended to consider the entire event payload, instead of being limited to the sole activity names.
For instance, for the werk.nl case study, one could add clustering dimensions related to the customers' age, gender, geographic location, etc., providing extra information towards a more accurate clustering. This is also very relevant for such domains as domotics, where, e.g., the use of an oven at 180 °C may be conceptually different from using it at 240 °C. Secondly, we plan to explore hierarchical clustering, because it would allow one to tune the level of aggregation that is achieved through the log abstraction. Thirdly, the number of low-level activities is generally large; therefore, it is worth investigating the benefits, if any, of reducing the set of low-level activities to consider when applying clustering. Last but not least, we aim to specialize the general technique for the discovery of hierarchical processes and to analyze the structure of the sessions within each separate cluster. The sessions within each cluster can be seen as the traces of a sub-log, which can be used as input to discover small (fragments of) models, to be later combined with the model discovered via the abstract event log.

References
1. van der Aalst, W.M.P.: Process Mining - Data Science in Action. Springer (2016)
2. Baier, T.: Matching Events and Activities. PhD dissertation, University of Potsdam (2015)
3. Baier, T., Mendling, J.: Bridging abstraction layers in process mining by automated matching of events and activities. In: Proceedings of the 11th International Conference on Business Process Management. pp. 17–32. Springer Berlin Heidelberg, Berlin, Heidelberg (2013)
4. Baier, T., Rogge-Solti, A., Mendling, J., Weske, M.: Matching of events and activities: an approach based on behavioral constraint satisfaction. In: Proceedings of the 30th Annual Symposium on Applied Computing. pp. 1225–1230. ACM (2015)
5. Bolt, A., de Leoni, M., van der Aalst, W.M.P.: A visual approach to spot statistically-significant differences in event logs based on process metrics. In: International Conference on Advanced Information Systems Engineering. LNCS, vol. 9694, pp. 151–166. Springer (2016)
6. De Weerdt, J.: Trace Clustering. Springer International Publishing, Cham (2018)
7. van Dongen, B.F., Adriansyah, A.: Process mining: fuzzy clustering and performance visualization. In: Proceedings of the 7th International Conference on Business Process Management. pp. 158–169. Springer (2009)
8. van Eck, M.L., Sidorova, N., van der Aalst, W.M.P.: Enabling process mining on sensor data from smart products. In: Proceedings of the Tenth IEEE International Conference on Research Challenges in Information Science (RCIS) (June 2016)
9. Fazzinga, B., Flesca, S., Furfaro, F., Masciari, E., Pontieri, L.: A probabilistic unified framework for event abstraction and process detection from log data. In: Proceedings of the 23rd OTM Confederated International Conference on Cooperative Information Systems. LNCS, vol. 9415, pp. 320–328. Springer (2015)
10. Ferreira, D.R., Szimanski, F., Ralha, C.G.: Mining the low-level behaviour of agents in high-level business processes. International Journal of Business Process Integration and Management 8(2), 146–166 (2013)
11. Günther, C.W., Rozinat, A., van der Aalst, W.M.P.: Activity mining by global trace segmentation. In: Proceedings of the 7th International Conference on Business Process Management. pp. 128–139. Springer (2009)
12. Ketchen, D.J., Shook, C.L.: The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal (6), 441–458 (1996)
13. Leemans, M., van der Aalst, W.M.P.: Discovery of frequent episodes in event logs. In: Data-Driven Process Discovery and Analysis - 4th International Symposium, SIMPDA 2014, Milan, Italy, November 19-21, 2014, Revised Selected Papers. LNBIP, vol. 237, pp. 1–31. Springer (2015)
14. Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering block-structured process models from event logs - A constructive approach. In: Proceedings of the 34th International Conference on Application and Theory of Petri Nets and Concurrency (Petri Nets 2013). LNCS, vol. 7927, pp. 311–329. Springer (2013)
15. Mannhardt, F., de Leoni, M., Reijers, H.A.: Heuristic mining revamped: An interactive data-aware and conformance-aware miner. In: Proceedings of the BPM Demo Track and BPM Dissertation Award at the 15th International Conference on Business Process Management. CEUR Workshop Proceedings, vol. 1920. CEUR-WS.org (2017)
16. Mannhardt, F., de Leoni, M., Reijers, H.A., van der Aalst, W.M.P., Toussaint, P.J.: From low-level events to activities - A pattern-based approach. In: Proceedings of the 14th International Conference on Business Process Management. LNCS, vol. 9850, pp. 125–141. Springer (2016)
17. Nakatumba, J.: Resource-aware Business Process Management: Analysis and Support. Ph.D. thesis, Eindhoven University of Technology (2013)
18. Tax, N., Sidorova, N., Haakma, R., van der Aalst, W.M.P.: Event abstraction for process mining using supervised learning techniques. In: Bi, Y., Kapoor, S., Bhatia, R. (eds.) Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016. pp. 251–269. Springer International Publishing, Cham (2018)
19. Tax, N., Sidorova, N., Haakma, R., van der Aalst, W.M.P.: Mining local process models. Journal of Innovation in Digital Ecosystems 3