Native Directly Follows Operator
NNative Directly Follows Operator
Alifah Syamsiyah ( (cid:0) ) , Boudewijn F. van Dongen, Remco M. Dijkman Eindhoven University of Technology
[email protected], [email protected], [email protected]
Abstract.
Typical legacy information systems store data in relationaldatabases. Process mining is a research discipline that analyzes this datato obtain insights into processes. Many different process mining tech-niques can be applied to data. In current techniques, an XES event logserves as a basis for analysis. However, because of the static characteristicof an XES event log, we need to create one XES file for each process min-ing question, which leads to overhead and inflexibility. As an alternative,people attempt to perform process mining directly on the data sourceusing so-called intermediate structures. In previous work, we investigatedmethods to build intermediate structures on source data by executing abasic SQL query on the database. However, the nested form in the SQLquery can cause performance issues on the database side. Therefore, inthis paper, we propose a native SQL operator for direct process discov-ery on relational databases. We define a native operator for the simplestform of the intermediate structure, called the “directly follows relation”.This approach has been evaluated with big event data and the experi-mental results show that it performs faster than the state-of-the-art ofdatabase approaches.
Keywords:
SQL operator · Relational database · Process discovery
Process mining is a research discipline that turns event data into process models,checks the model with reality, and enhances the model with statistics derivedfrom event data. There has been an extensive research in process mining, includ-ing process discovery, conformance checking , and enhancement .Typical legacy information systems store data in relational databases. In thecontext of big data, the stored data possesses the characteristic of 4V (Volume,Variety, Veracity, and Velocity). When applying process mining, a process ana-lyst aims to find process structures in the event data. In some cases, however,the process analyst simply does not know what kind of process view can be ob-tained from the data. In this scenario, many process mining questions are to beanswered on the underlying data in order to get insights into the business pro-cesses. These questions may involve various classifiers, span on multiple times,and use heterogeneous case notions.Let us take a process in a commercial bank as an example. The financialmanager may want to figure out the core process of the bank, such as “the a r X i v : . [ c s . D B ] J un Alifah Syamsiyah, Boudewijn F. van Dongen, Remco M. Dijkman process involving successful loans during last month”, “the progress of ongo-ing investment projects with a property this year”, “how is the performance ofcollaboration between department X and Y”, and many other process miningrelated questions.In the current state-of-the-art, process mining tools need (XES) event logs asinput. One of the characteristics of event logs is that they are static, i.e. one fileonly contains one case notion and contains data for one specific period of time.Therefore, we need to create one log file for each process mining question. Thisproduces unnecessary overhead caused by the exporting from the data source,conversion to a tool’s input format, and importing into the process mining tool.Moreover, the traditional techniques lack of flexibility, i.e. once we change theperspective from which we look at the data, we need to change the log file.Given the fact that many legacy information systems use a database as theback end, we study the question how to perform a process mining directly onthe database [1, 3–5]. In this paper, we focus on process discovery, i.e. we focuson deriving a process model from event data. In a process discovery algorithm,there is typically a so called intermediate structure , which is a first abstractionof the event data. In [3], we showed how to compute this intermediate structureinside a database and how to only import the structure (not the event data) intoa process mining tool. The mining tool then discovers the process model usingan existing algorithm.In order to compute the intermediate structure, the existing approach usesstandardized SQL queries, which, unfortunately, are not designed towards pro-cess mining purposes. Let us take a simple intermediate structure called
DirectlyFollows Relation (DFR). To compute DFR, the standard SQL query has a nestedform which results in a bad performance. In this paper we present two ways toovercome nested queries and we compare the performance of all database vari-ants against the traditional approach.A way to overcome the nested queries is to create an interface between adatabase and a process mining tool. As computing the DFR is based on a sortedevent data, through this interface, we execute a standard SQL query to sortevent data in a database. Then, we import the sorted event data into a processmining tool and compute the frequency of the DFR in the tool. We later show inthis paper that even though this approach does not contain a nested structure,it still has a performance issue as the whole log needs to be transferred to theprocess mining tool.In [2] it has been proven that executing a nested query to compute the DFRleads to third order polynomial time complexity if the intermediate results ofthe queries do not fit into memory anymore. To overcome this problem, a nativeSQL operator is proposed.In this paper, we propose a native SQL operator for direct process discoveryin relational databases. The native operator is designed for a specific process dis-covery purpose. As a starting point, this paper investigates the directly followsrelation. However, it does not restrict the possibility to extend the operator toother kinds of intermediate structures. Using this native operator, the database ative Directly Follows Operator 3 has more flexibility to query process mining related questions. Moreover, it har-nesses the database technology to speed up the computation time.The reminder of this paper is structured as follows. Section 2 introducessome important terminologies that are used in the paper. Then we introduce theproposed idea, which is the native SQL operator for a direct process discovery, inSection 3. The implementation of this operator is given in Section 4 and Section 5demonstrates the experimental results. Finally the paper is concluded in Section6.
This paper deals with process mining. We refer to Process Mining Manifesto [8]and Process Mining book in [6] for more detail explanation about process mining.Process mining needs event data as inputs. Events are collections of attributes,referring to the executions of activities in the context of some cases in a process.
Definition 1 (Event Attribute).
Let A be the universe of event attributes, C be the universe of cases, E be the universe of events, and E ⊆ E be a collectionof events.For any event e ∈ E and name a ∈ A : a ( e ) is the value of attribute a forevent e , a ( e ) = ⊥ if there is no value. { caseid, act, time } ∈ A are standardevent attributes, such that caseid ( e ) is the case of event e , act ( e ) is the activityname of event e , and time ( e ) is the timestamp when event e is executed. Furthermore, an event log is a collection of events captured within a partic-ular time period.
Definition 2 (Event Log).
Let E ⊆ E be a collection of events and t s , t e ∈ R two timestamps with t s < t e relating to the start and the end of the collectionperiod.A case σ ∈ E ∗ is a sequence of events such that the same event occurs onlyonce in σ , i.e. | σ | = |{ e ∈ σ }| . Furthermore, each event in a case refers to thesame case c ∈ C , i.e. ∀ e ∈ σ case ( e ) = c and we assume all events within the giventime period are included, i.e. ∀ e ∈E ( case ( e ) = c ∧ t s ≤ time ( e ) ≤ t e ) = ⇒ e ∈ σ .An event log L ⊆ E ∗ is a set of cases. Finally, we define Directly Follows Relation (DFR), which holds for two ac-tivities A and B if and only if somewhere in the event log L , there are twosuccessive events in a trace corresponding to these activities. Note that we alsoconsider cases with partial ordered events. Definition 3 (Directly Follows Relation (DFR)).
Let A be the universe of event attributes, let E be a universe of events and L bean event log over E ⊆ E . Let M = { a ∈ A | e ∈ E ∧ act ( e ) = a } be the set ofactivities in the log. The DFR > L : M × M → N counts the number of timesactivity a is directly followed by activity b in some cases in L as follows: Alifah Syamsiyah, Boudewijn F. van Dongen, Remco M. Dijkman
Table
Log
SELECT * FROM DIRECTLYFOLLOWS (SELECT * FROM Log)
Case Activity Time
Parse “DIRECTLYFOLLOWS”Call function directlyFollows()
Predecessor Successor Frequency
A B 1A C 1
Directly
Follows
Relation
A BC
Directly
Follows
GraphInductive
Miner
A BC Process
Model r e t u r n Database Process
Mining
Tool
Fig. 1: Using native operator “directlyfollows” and Inductive Miner to discovera process model (a)
Case Activity Time1 Send request 2017/10/011 Check application 2017/10/021 Check document 2017/10/021 Accept 2017/10/052 Send request 2017/10/032 Check application 2017/10/072 Reject 2017/10/10 (b)
Event Label P Event Label S FrequencySend request Check application 2Send request Check document 1Check application Accept 1Check document Accept 1Check application Reject 1
Table 1: (a) Example of event data stored in Table
Log , (b) The DFR of Table
Log> L ( a, b ) = Σ σ ∈ L Σ | σ |− i =1 , if i < j ∧ act ( σ ( i )) = a ∧ act ( σ ( j )) = b ∧ time ( σ ( i )) < time ( σ ( j )) ∧¬∃ k time ( σ ( i )) < time ( σ ( k )) < time ( σ ( j ))0 , otherwise . In this section, we introduce a new native directly follows operator for computingdirectly follows relation. In the following, we first explain the input and outputof this operator, an example of how to use the operator, and how to process theresult in the context of discovery.A native operator directlyfollows requires a table object representing theevent log, which consists of three columns: the case, the activity, and the times-tamp. As a result, it returns a table object representing the DFR, which consistsof three columns: the first and the second part of pair of the DFR, and thefrequency of each pair. The schema in Figure 1 illustrates how to utilize theoperator and what happens inside a database engine and a process mining toolassuming that we use Inductive Miner for the discovery. ative Directly Follows Operator 5
Assume that we have a Table
Log (see Table 1a), we compute the DFR usingthe native operator directlyfollows as follows. SELECT * FROM DIRECTLYFOLLOWS ( SELECT * FROM Log );
Algorithm 1: directlyFollows
Input: L Output: > L foreach σ ∈ L do sort( σ ) sa ← ea ← aT ime ← time ( σ ( sa )) while ea + 1 < | σ | and aT ime = time ( σ ( ea + 1)) do ea ← ea + 1 sc ← ea + 1 ec ← sc while ec < | σ | do cT ime ← time ( σ ( sc )) while ec + 1 < | σ | and cT ime = time ( σ ( ec + 1)) do ec ← ec + 1 for i ← sa ; i ≤ ea ; i + + do for j ← sc ; j ≤ ec ; j + + do a ← act ( σ ( i )) b ← act ( σ ( j )) freq ← > L ( a, b ) freq ← freq + 1 > L ( a, b ) ← freq sa ← sc ea ← ec sc ← ea + 1 ec ← sc return > L After the database engine parses directlyfollows , it calls the directlyFollows() function (see Algorithm 1). First of all, for each case σ in log L , thefunction sorts σ . Then it assigns initial value for four variables, namely (1) sa (start antecedent), (2) ea (end antecedent), (3) sc (start consequent), and (4) ec (end consequent). These four variables point to indexes in σ such that all eventsbetween the start and end indexes have the same timestamp, i.e. events between sa and ea have the same timestamp, events between sc and ec have the sametimestamp. Based on these indexes, the function creates combinations betweenantecedent events and consequent events to construct a pair of directly follows Alifah Syamsiyah, Boudewijn F. van Dongen, Remco M. Dijkman relation. Finally, it counts the frequency of such pair and returns the directlyfollows (as denoted in Table 1b).In Java, we use hash table to store the relations. Pairs are the key andfrequencies are the value. The complexity is worst case | E | log | E | (with E isthe total number of events) due to sorting events.In the next step, the DFR is retrieved by a process mining tool. Since we usethe Inductive Miner for the discovery, such relation is converted into a directlyfollows graph before the algorithm constructs a process model. Note that theresult from the directlyfollows operator can be used directly in the existingprocess discovery technique without modifying the algorithm or reinvent a newdiscovery method.Process discovery using the native operator has several advantages. First ,the query can be expressed in a straightforward way, hence it is more convenientfor novice users to express various process mining related questions.
Second , thequery does not contain a nested form which leads to bad performance.
Third , thecomputation (i.e. the abstraction phase) can be done inside relational databases,thus leveraging the power of database technology and saving the memory usageof process mining tool.
Fourth , there is no need to extract and load a log fileinto a process mining tool, thus saving time and reducing the complexity.
Fifth ,the DFR can be computed upon insertion of data using the standard triggeringmechanism of any database system, hence eliminating the need for the processanalyst to wait for the computation to finish.
We implemented the directlyfollows as a native operator in H2 Database .Moreover, process discovery using this operator and the Inductive Miner wereimplemented as a plug-in in an open source process mining toolkit called ProM.The plug-in name is H2 Inductive Miner and it is distributed within the
Databa-seInductiveMiner package .Figure 2 illustrates the excerpt of the log in the server.Fig. 2: The excerpt of Table BPI2017 in H2 database https://github.com/alifahsyamsiyah/h2processmining-master https://svn.win.tue.nl/repos/prom/Packages/DatabaseInductiveMiner/Trunk/ ative Directly Follows Operator 7 As an example, we used the offer event log of BPI Challenge 2017 [7], whichwas imported into an H2 database server. To connect with the server, we set therequired database configuration in the
H2 Inductive Miner plug-in. We filled inthe username and the password of the database, the JDBC URL, and the queryto extract the DFR as denoted in Figure 3. After we clicked the “Finish” button,the plug-in automatically executes the query, retrieved the DFR, converts thevalues into a directly follows graph, and finally discovers a process model basedon the Inductive Miner algorithm. The result is displayed in Figure 4.
SELECT * FROM DIRECTLYFOLLOWS (SELECT caseid, activity, completetimestamp FROM BPI2017)
Fig. 3: A query to extract theDFR from Table BPI2017 Fig. 4: The discovered model of Table BPI2017
We implemented our work in ProM and H2 and we compared the native operatorwith the other three approaches (Figure 5): (1) the traditional technique, (2) thediscovery with Nested SQL, and (3) the SQL interface. The starting point for thetraditional approach is an XES event log file already loaded into memory. Forthe database approaches, the starting point is a database in which events havebeen inserted. In the initial experiments, we did not use any type of indices forthe events so we can represent general scenarios where table logs are constructedwithout any indices. However, in the subsequent experiments, we also investigatescenarios where indices are utilized. We used an H2 database server which has64GB of RAM and 8 cores of CPU @2.40Ghz. Furthermore, the discovery wasexecuted in a personal computer which has 8GB of RAM and 2 cores of [email protected] created synthetic logs with the number of events ranging from 1K to42M with between 30 to 3840 activities. More specifically, we created two kindsof synthetic logs: (a) logs with an increased number of activities, and (b) logswith an increased number of events. For (a), we relabeled activities to extendthe number of activities while keeping the same number of events. For (b), wemerged one case with another case to extend the number of events while keepingthe same number of activities. This way, we preserve the control flow for ourextended logs in the same way as the control flow of the original log.
Alifah Syamsiyah, Boudewijn F. van Dongen, Remco M. Dijkman
RelationalDatabase Process ModelProcess ModelProcess Model (1) Traditional Process Discovery
Process ModelEvent Log DFRRelationalDatabaseRelationalDatabase
Relational
Database DFR DFRDFRSQL InterfaceTransformation from database memory to process mining tool’s memory (2)
Discovery with Nested SQL(3) Discovery with SQL Interface(4) Discovery with Native Operator
Native operator
Fig. 5: Four different approaches in process discovery: (a) traditional processdiscovery, (b) discovery with Nested SQL, (c) discovery with SQL interface, and(d) discovery with native operator · ,
000 Number of Events C P U T i m e ( s e c ) Traditional, linear, R = 0 . R = 1SQL Interface, linear, R = 0 . R = 0 .
996 1 ,
000 2 ,
000 3 , C P U T i m e ( s e c ) SQL Interface, constantNative, constant
Fig. 6: The comparison of abstraction phaseFigure 6 shows the time for the abstraction phase of the four approaches. Onthe left, the time is shown as a function of the number of events in the log andon the right as a function of the number of activities.As expected, the time complexity of the native, SQL interface, and traditionalapproaches is linear because events in the log are already sorted. However, wecannot see the last dot of the traditional approach. This is because the tradi-tional approach cannot handle the biggest log containing 42M events due to outof memory exception. Furthermore, the execution of the nested query leads tothird order polynomial time complexity. It is considerably higher than the otherapproaches presented.As shown on the right hand side of Figure 6, the number of activities does notaffect the performance of the database approaches . However, for the traditionalapproach, there is an influence. This is due to the fact that, when scanning thelog, at some points the internal data structures need to grow to accommodate for due to the fact that the nested query is so time-consuming, we did not include it insome of the tests.ative Directly Follows Operator 9 · C P U T i m e ( s e c ) SQL Interface, linear, R = 0 . ,
000 2 ,
000 3 , C P U T i m e ( s e c ) SQL Interface, quadratic, R = 0 . R = 0 . Fig. 7: The comparison of retrieval phase ,
000 2 ,
000 3 , C P U T i m e ( s e c ) Traditional, cubic, R = 1Nested SQL, cubic, R = 1SQL Interface, cubic, R = 1Native, cubic, R = 1 Fig. 8: The comparison of miningphase . . . . · ,
000 Number of Events C P U T i m e ( s e c ) Nested without index, cubic, R = 1Nested with index, x · log x , R = 0 . R = 0 . Fig. 9: The comparison of abstractionphase with index for native vs nestedSQLpreviously unseen activities. For < , , C P U T i m e ( s e c ) NativeTraditionalSQL Interface 1 2 3 4 5 6 7 802004006008001 , , Fig. 10: The comparison of all phases for native vs traditional and SQL interfaceFigure 10 denotes the total time of Figure 6-8, i.e. the total time of abstrac-tion, retrieval, and mining phases. In Figure 10, we do not include the nestedquery because it is so time-consuming (the experiments with nested query werestopped until the number of events reached 14K). However, we still tried to im-prove the performance of nested query by adding indices in case and timestampcolumns. Even though the cubic complexity in the nested query is reduced to x · log x (with x is the total number of events), the native approach still outper-forms the nested query because the former has linear complexity as shown inFigure 9. Note that the linearithmic comes from the fact that H2 database usesB-tree index, hence finding an element is O (log x ). There are x rows for whichwe need to perform this look up, therefore the complexity is O ( x · log x ).From Figure 9 and 10 it becomes clear that for large numbers of events and/orlarge numbers of activities, the approach using native operator outperforms theother database approaches. Both native and traditional approaches show rela-tively similar performance, except for the log with the biggest number of events(the 6 th log in the left chart of Figure 10). The 6 th bar of the traditional ap-proach does not exist because the approach cannot handle the log anymore. Thisis to be expected as this approach works in memory. As data grows, the eventlogs no longer fit into memory and then the database approaches are required.There is however another important motivation for in-database processing.In most information systems, data is stored in relational databases. To doprocess mining, this data needs to be exported in the form of an XES file, whichrequired sorting of the events on timestamps, i.e. this preparatory phase has thesame time complexity as in-database abstraction. In the next experiment, wetake into account two additional steps in the traditional approach, namely theexporting phase (i.e. exporting event data from database to a Comma SeparatedValue (CSV) form) and the conversion phase (i.e. conversion from the CSV file toan XES event log file). These two phases are not relevant to database approachessince they directly access the data source.In Table 3 we show the complete chain of discovery using traditional tech-nique, starting from exporting until mining. We can see from this table that, for ative Directly Follows Operator 11 Traces Events Activities Exporting Conversion Abstraction Mining Exporting+Conversion100000 667 361 30 2 .
532 7 .
784 738 43 .
360 468 0 .
097 992 955 10 .
316 738100000 1 352 645 50 5 .
505 16 .
717 697 87 .
035 568 0 .
028 197 078 22 .
222 697100000 2 718 451 200 30 .
871 31 .
705 399 185 .
524 574 0 .
140 519 505 62 .
576 399100000 4 039 403 400 55 .
275 74 .
437 021 260 .
539 442 0 .
076 435 603 129 .
712 021100000 4 453 320 100 68 .
872 78 .
903 779 275 .
585 310 0 .
195 748 059 147 .
775 779500000 3 346 890 30 39 .
525 53 .
646 275 250 .
444 424 0 .
005 964 106 93 .
171 275500000 6 761 572 50 90 .
275 133 .
006 643 462 .
305 078 0 .
128 949 184 223 .
281 643500000 13 629 325 200 182 .
267 284 .
626 551 950 .
972 794 0 .
136 748 262 466 .
893 551
Table 3: Computation time in different phases in traditional process discovery(measured in seconds)example, there is additional 10 seconds to export and convert event data with100K traces and 660K events. This additional time increases up to 466 secondsfor event data with 500K traces and 13M events.From Table 3, we deduce that the export (11.3 %) and conversion (16.9 %)phases take up to 28.2% of the whole process, while the rest is dominated bythe abstraction phase (71.8 %). The actual mining only takes 0.1 % of the time.This shows that there is 28.2 % overhead in the traditional approach. A directprocess mining approach to database, as proposed by the native operator, is anexcellent solution to remove this overhead.
In this paper we focus on direct process discovery on relational databases. Wepropose a native SQL operator for computing one of the intermediate structures,namely the Directly Follows Relation (DFR). Using this operator, users can eas-ily express their query and get DFR values from the underlying data. Moreover,the operator does not contain any nested form which causes bad performance.The native operator has been implemented in H2 database and we provide aspecialized implementation of the state-of-the-art process mining technology toshow applicability of the work.
References
1. D. Calvanese, M. Montali, A. Syamsiyah, and W.M.P. van der Aalst. Ontology-Driven Extraction of Event Logs from Relational Databases. In
BPI 2015 , pages140–153, 2015.2. R. Dijkman, J. Gao, P. Grefen, and A. ter Hofstede. Relational Algebra for In-Database Process Mining. 2017.3. A. Syamsiyah, B.F. van Dongen, and W.M.P. van der Aalst. DB-XES: EnablingProcess Mining in the Large. In
SIMPDA 2016 - Extended Versions , pages 63–77,2016.4. A. Syamsiyah, B.F. van Dongen, and W.M.P. van der Aalst. Discovering SocialNetworks Instantly: Moving Process Mining Computations to the Database andData Entry Time. In
BPMDS 2017 , 2017.2 Alifah Syamsiyah, Boudewijn F. van Dongen, Remco M. Dijkman5. A. Syamsiyah, B.F. van Dongen, and W.M.P. van der Aalst. Recurrent ProcessMining on Procedural and Declarative Approaches.
BPM Center Report BPM-17-03 , 2017.6. W.M.P. van der Aalst.
Process Mining: Data Science in Action . Springer, 2016.7. B.F. van Dongen. BPI Challenge 2017. https://doi.org/10.4121/uuid:5f3067df-f10b-45da-b98b-86ae4c7a310b , 2017.8. et al. W.M.P van der Aalst. Process Mining Manifesto. In