Efficient Time and Space Representation of Uncertain Event Data
Marco Pegoraro, Merih Seran Uysal, and Wil M.P. van der Aalst
Process and Data Science Group (PADS), Department of Computer Science, RWTH Aachen University, Aachen, Germany
{pegoraro, uysal, wvdaalst}

Abstract.
Process mining is a discipline which concerns the analysis of execution data of operational processes, the extraction of models from event data, the measurement of the conformance between event data and normative models, and the enhancement of all aspects of processes. Most approaches assume that event data accurately capture behavior. However, this is not realistic in many applications: data can contain uncertainty, generated from errors in recording, imprecise measurements, and other factors. Recently, new methods have been developed to analyze event data containing uncertainty; these techniques prominently rely on representing uncertain event data by means of graph-based models explicitly capturing uncertainty. In this paper, we introduce a new approach to efficiently calculate a graph representation of the behavior contained in an uncertain process trace. We present our novel algorithm, prove its asymptotic time complexity, and show experimental results that highlight order-of-magnitude performance improvements for the behavior graph construction.
Keywords:
Process Mining · Uncertain Data · Partial Order.
The pervasive diffusion of digitization, which gained momentum thanks to advancements in electronics and computing at the end of the last century, brought a wave of innovation in the tools supporting businesses and companies. The past decades have seen the rise of Process-Aware Information Systems (PAISs), useful to structurally support processes in a business, as well as research disciplines such as Business Process Management (BPM) and process mining.

Process mining [2] is a field of research that enables process analysis in a data-driven manner. Process mining analyses are based on recordings of tasks and events in a process, stored in an ensemble of information systems that support business operations. These recordings are exported and systematically collected in databases called event logs. Using an event log as a starting point, process mining techniques can automatically obtain a process model illustrating the behavior of the real-life process (process discovery) and identify anomalies and deviations between the execution data of a process and a normative model (conformance checking). Process mining is a subfield of data science which is quickly growing in interest both in academia and industry. Over 30 commercial software tools are available on the market for analyzing processes and their execution data. Process mining tools are used by process experts to analyze processes in tens of thousands of organizations; for example, within Siemens, over 6000 employees actively use process mining to improve internal procedures.

We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research interactions. (arXiv [cs.DS])

Commercial process mining software is able to discover and build a process model from an event log. Most of the process discovery algorithms implemented in these tools are based on tallying the number of directly-follows relationships between activities in the execution data of the process. The more frequently a specific activity immediately follows another one in the execution log of a process, the stronger a causality and/or precedence implication between the two activities is understood to be. Such directly-follows relationships are also the basis for the identification of more complex and abstract constructs in the workflow of a process, such as interleaving or parallelism of activities. These relationships between activities are often represented in a labeled directed graph called the
Directly-Follows Graph (DFG).

In recent times, a new type of event logs has gained research interest: uncertain event logs [25]. Such execution logs contain, rather than precise values, an indication of the possible values that event attributes can acquire. In this paper, we will consider the setting where uncertainty is represented by either an interval or a set of possible values for an event attribute. Moreover, we will consider the case in which an event has been recorded in the event log although it did not happen in reality.

Uncertainty in event logs is best illustrated with a real-life example of a process that can generate uncertain data in an information system. Let us consider the following process instance, a simplified version of anomalies that actually occur in processes of the healthcare domain. An elderly patient enrolls in a clinical trial for an experimental treatment against myeloproliferative neoplasms, a class of blood cancers. The enrollment in this trial includes a lab exam and a visit with a specialist; then, the treatment can begin. The lab exam, performed on the 8th of July, finds a low level of platelets in the blood of the patient, a condition known as thrombocytopenia (TP). At the visit, on the 10th of July, the patient self-reports an episode of night sweats on the night of the 5th of July, prior to the lab exam: the medic notes this, but also hypothesizes that it might not be a symptom, since it can be caused not by the condition but by external factors (such as very warm weather). The medic also reads the medical records of the patient and sees that, shortly prior to the lab exam, the patient was undergoing a heparin treatment (a blood-thinning medication) to prevent blood clots. The thrombocytopenia found with the lab exam can then be primary (caused by the blood cancer) or secondary (caused by other factors, such as a drug). Finally, the medic finds an enlargement of the spleen in the patient (splenomegaly). It is unclear when this condition has developed: it might have appeared at any moment prior to that point. The medic decides to admit the patient to the clinical trial, starting on the 12th of July.

These events generate the trace of Table 1 in the information system of the hospital. For clarity, the timestamp field only reports the day of the month.
Table 1: The uncertain trace of an instance of healthcare process used as running example. The “Case ID” is a unique identifier for all events in a single process case; the “Event ID” is a unique identifier for the events in the trace. The “Timestamp” field indicates either the moment in time in which the event has happened, or the interval of time in which the event may have happened. The “Activity” field indicates the possible choices for the activity instantiated by the event. Lastly, the “Indeterminate event” field contains a “!” if the corresponding event has surely occurred, and a “?” if it might have been recorded despite not occurring in reality. For the sake of readability, the timestamp column only reports the day of the month.

Case ID   Event ID   Timestamp   Activity          Indet. event
ID327     $e_1$      5           NightSweats       ?
ID327     $e_2$      8           {PrTP, SecTP}     !
ID327     $e_3$      [4, 10]     Splenomeg         !
ID327     $e_4$      12          Adm               !

Event $e_2$ has been recorded with two possible activity labels (PrTP or SecTP). This is an example of uncertainty on activities. Some events, e.g. $e_3$, do not have a precise timestamp; instead, a time interval in which the event could have happened has been recorded. In some cases, this causes the loss of a precise ordering of events (e.g. $e_1$ and $e_3$). This is an instance of uncertainty on the time dimension, i.e., on timestamps. As shown by the “?” symbol, $e_1$ is an indeterminate event: it has been recorded, but it is not guaranteed to have actually happened. Conversely, the “!” symbol indicates that the event has been recorded while certainly occurring in reality, i.e., it has been recorded correctly in the information system (e.g., the event $e_4$).

Quality problems and imprecision in data recording such as the ones described in the running example as sources of uncertainty are not uncommon; in some settings, they are a frequent occurrence. Healthcare processes are specifically known to be afflicted by these sorts of data anomalies, especially if parts of the process rely on recording information on paper [32, ?]. Existing process mining software cannot manage such uncertain event data. When mining processes where uncertainty in execution data is prominent, a natural first approach is to filter the event log, eliminating cases where uncertainty appears. Unfortunately, in processes where a large portion of cases is affected by such data anomalies, filtering without losing essential information about the process is not feasible.

As a consequence, new process mining methods to inspect and analyze such data have to be developed. Uncertain timestamps are the most prominent and critical source of uncertain behavior in a process trace.
For example, if all $n$ events in a case have timestamps defined by mutually overlapping intervals, their order is unknown and the control-flow of the trace can assume any of the $n!$ permutations of the events. This is the worst possible scenario in terms of the amount of uncertain behavior introduced by uncertainty on the timestamps of the events in a trace. Thus, it is important to capture the time relationships between events in a compact and effective way. This is accomplished by the construction of a behavior graph, a directed acyclic graph that expresses precedence between events. Figure 1 shows the behavior graph of the process trace in Table 1; every known precedence relationship between events is represented by the edges of the graph, while the pairs of events for which the order is unknown remain unconnected. Effectively, this creates a representation of the partial order where the arcs are defined by the possible values of the timestamps contained in the trace, and where the nodes may refer to sets of possible activities. As we will see, this construct is central to effectively implement both process discovery and conformance checking applied to uncertain event data.

[Figure 1: graph with nodes $e_1$ (NightSweats), $e_2$ ({PrTP, SecTP}), $e_3$ (Splenomeg), and $e_4$ (Adm).]
Fig. 1: The behavior graph of the trace in Table 1. Every node represents an event; the labels in the nodes represent the activity, or set of activities, associated with the event. The arcs represent the partial order relationship between events as defined by their timestamps. The indeterminate event, which might not have occurred, is represented by a dashed node.

In a previous paper [27], we presented a time-effective algorithm for the construction of the behavior graph of an uncertain process trace, attaining quadratic time complexity on the number of events in the trace.

This paper elaborates on this previous result, by providing the proof of the correctness of the new algorithm. Additionally, we will show the improvement in performance both theoretically, via asymptotic complexity analysis, and in practice, with experiments on various uncertain event logs comparing computation times of the baseline method against the novel construction algorithm. Furthermore, the version of the algorithms presented in this paper is refined so as to preprocess uncertain traces in linear time, identifying the variants (traces that share the same behavior graph), and to construct the behavior graph only once per variant. This slightly improves performance, and more importantly, enables the representation of an uncertain event log as a multiset of behavior graphs, greatly reducing the memory requirements to store the log. This enables a streamlined application of process mining techniques on event data where uncertainty is present.

The algorithms have been implemented within the PROVED (PRocess mining OVer uncErtain Data) library, based on the PM4Py process mining framework [9].

The remainder of the paper is structured as follows: Section 2 motivates the study of uncertainty in process mining by illustrating an example of conformance checking over uncertain event data. Section 3 strengthens the motivation by showing the discovery of process models of uncertain event logs. Section 4 provides formal definitions, describes the baseline technique for our research, and shows a new and more efficient method to obtain a behavior graph of an uncertain trace. Section 5 presents the analysis of asymptotic complexity for both the baseline and the novel method. Section 6 shows results of experiments on both synthetic and real-life uncertain event logs comparing the efficiency of both methods to compute behavior graphs. Section 7 explores recent related works in the context of uncertain event data and the management of alterations of data in process mining. Finally, Section 8 discusses the output of the experiments and concludes the paper.

Conformance checking is one of the main tasks in process mining, and consists in measuring the deviation between process execution data (usually in the form of a trace) and a reference model. This is particularly useful for organizations, since it enables them to compare historical process data against a normative model created by process experts to identify anomalies and deviations in their operations.

Let us assume that we have access to a normative model for the disease of the patient in the running example, shown in Figure 2.

This model essentially states that the disease is characterised by the occurrence of night sweats and splenomegaly on the patient, which can be verified concurrently, and then should be followed by primary thrombocytopenia. We would like to measure the conformance between the trace in Table 1 and this normative model. A very popular conformance checking technique works via the computation of alignments [3].
Through this technique, we are able to identify the deviations in the execution of a process, in the form of behavior happening in the model but not in the trace, and behavior happening in the trace but not in the model. These deviations are used as a basis to compute a conformance score between the trace and the process model.

PROVED library: https://github.com/proved-py/proved-core/tree/Efficient_Time_and_Memory_Representation_for_Uncertain_Event_Data

[Figure 2: Petri net with transitions labeled NightSweats, Splenomeg, PrTP, and Adm.]
Fig. 2: A normative model for the healthcare process case in the running example. The initial marking is displayed; the gray “token slot” represents the final marking.

The formulation of alignments in [3] is not applicable to an uncertain trace. In fact, depending on the instantiation of the uncertain attributes of events (like the timestamp of $e_3$ in the trace) the order of events may differ, and so may the conformance score. However, we can look at the best- and worst-case scenarios: the instantiations of attributes of the trace that entail the minimum and maximum number of deviations with respect to the reference model. In our example, two possible outcomes for the sample trace are $\langle \text{NightSweats}, \text{Splenomeg}, \text{PrTP}, \text{Adm} \rangle$ and $\langle \text{SecTP}, \text{Splenomeg}, \text{Adm} \rangle$; both represent a sequence of events that might have happened in reality, but their conformance scores are very different. The alignment of the first trace against the reference model can be seen in Table 2, while the alignment of the second trace can be seen in Table 3. These two outcomes of the uncertain trace in Table 1 represent, respectively, the minimum and maximum amount of deviation possible with respect to the reference model, and thus define a lower and upper bound for the conformance score.
Table 2: An optimal alignment for $\langle \text{NightSweats}, \text{Splenomeg}, \text{PrTP}, \text{Adm} \rangle$, one of the possible instantiations of the trace in Table 1, against the model in Figure 2. This alignment has a deviation cost equal to 0, and corresponds to the best case scenario for conformance between the process model and the uncertain trace.

Trace   ≫   NightSweats   Splenomeg   ≫   PrTP   Adm
Model   τ   NightSweats   Splenomeg   τ   PrTP   Adm

The minimum and maximum bounds for the conformance score of an uncertain trace and a reference process model can be found with the uncertain version of the alignment technique that we first described in [25]. In order to find such bounds, it is necessary to build a Petri net able to simulate all possible behaviors in the uncertain trace, called the behavior net. Obtaining a behavior net is possible through a construction that uses behavior graphs as a starting point, using the structural information therein contained to connect places and transitions in the net. The behavior net of the trace in Table 1 is shown in Figure 3.

Table 3: An optimal alignment for $\langle \text{SecTP}, \text{Splenomeg}, \text{Adm} \rangle$, one of the possible instantiations of the trace in Table 1, against the model in Figure 2. This alignment has a deviation cost equal to 3, caused by 2 moves on model and 1 move on log, and corresponds to the worst case scenario for conformance between the process model and the uncertain trace.

Trace   ≫   SecTP   ≫             Splenomeg   ≫   ≫      Adm
Model   τ   ≫       NightSweats   Splenomeg   τ   PrTP   Adm

[Figure 3: Petri net with transitions labeled NightSweats, τ, PrTP, SecTP, Splenomeg, and Adm.]
Fig. 3: The behavior net representing the behavior of the uncertain trace in Table 1 and obtained thanks to its behavior graph. The initial marking is displayed; the gray “token slot” represents the final marking. This artifact is necessary to perform conformance checking between uncertain traces and a reference model.

The alignments in Tables 2 and 3 show how we can get actionable insights from process mining over uncertain data. In some applications it is reasonable and appropriate to remove uncertain data from an event log via filtering, and then compute log-level aggregate information, such as the total number of deviations or the average deviations per trace, using the remaining certain data. Even in processes where this is possible, doing so prevents the important process mining task of case diagnostics. Conversely, uncertain alignments not only provide best- and worst-case scenarios for a trace, but also identify the specific deviations affecting both scenarios. For instance, the alignments of the running example can be implemented in a system that warns the medics that the patient might have been affected by a secondary thrombocytopenia not explained by the model of the disease. Since the model indicates that the disease should develop primary thrombocytopenia as a symptom, this patient is at risk of both types of platelet deficit simultaneously, which is a serious situation. The medics can then intervene to avoid this complication, and perform more exams to ascertain the cause of the patient’s thrombocytopenia.
Process discovery is another main objective in process mining, and involves automatically creating a process model from event data. Many process discovery algorithms rely on the concept of directly-follows relationships between activities to gather clues on how to structure the process model. Uncertain Directly-Follows Graphs (UDFGs) enable the representation of directly-follows relationships in an event log under conditions of uncertainty in the event data; they consist of directed graphs where the activity labels appearing in an event log constitute the nodes, and the edges are decorated with information on the minimum and maximum frequency observable for the directly-follows relation between pairs of activities.

Let us examine an example of UDFG. In order to build a significant example, we need to introduce an entire uncertain event log; since the full table notation for uncertain traces becomes cumbersome for entire logs, let us utilize a simplified shorthand notation. In a trace, we represent an uncertain event with multiple possible activity labels by listing all the associated labels between curly braces. When two events have mutually overlapping timestamps, we write their activity labels between square brackets, and we indicate indeterminate events by overlining them. For instance, the trace $\langle \overline{a}, \{b, c\}, [d, e] \rangle$ is a trace containing 4 events, of which the first is an indeterminate event with activity label $a$, the second is an uncertain event that can have either $b$ or $c$ as activity label, and the last two events have an interval as timestamp (and the two ranges overlap). Let us consider the following event log: $\langle a, b, e, f, g, h \rangle$, $\langle a, \{b, c\}, [e, f], g, i \rangle$, $\langle a, \{b, c, d\}, [e, f], g, j \rangle$.

For each pair of activities, we can count the minimum and maximum occurrences of a directly-follows relationship that can be observed in the log. The resulting UDFG is shown in Figure 4.

This graph can then be used to discover process models of uncertain logs via process discovery methods based on directly-follows relationships.
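The counting step can be illustrated by brute force on a single trace. The following is only a sketch, not the behavior-graph search used to build UDFGs: it assumes that the minimum and maximum for a pair of activities are taken over all possible realizations of the trace, and the encoding used here (the helper `ev` and the representation of a trace as blocks of mutually overlapping events) is hypothetical and chosen to mirror the shorthand notation.

```python
from collections import Counter
from itertools import permutations, product


def ev(labels, indeterminate=False):
    """Hypothetical event encoding: (possible labels, indeterminate flag)."""
    return (frozenset(labels), indeterminate)


def realizations(trace):
    """Yield every certain activity sequence the uncertain trace may stand for.

    A trace is a list of blocks; a block is a list of events whose
    timestamps mutually overlap, so its events may occur in any order.
    """
    per_block = []
    for block in trace:
        options = set()
        for order in permutations(block):
            # None marks an indeterminate event that did not happen
            choices = [sorted(labels) + ([None] if indet else [])
                       for labels, indet in order]
            for pick in product(*choices):
                options.add(tuple(a for a in pick if a is not None))
        per_block.append(options)
    for combo in product(*per_block):
        yield tuple(a for part in combo for a in part)


def udfg_bounds(trace):
    """Map each activity pair to (min, max) directly-follows counts."""
    counts = [Counter(zip(r, r[1:])) for r in set(realizations(trace))]
    pairs = set().union(*counts)
    return {p: (min(c[p] for c in counts), max(c[p] for c in counts))
            for p in pairs}


# The shorthand example: indeterminate a, then {b, c}, then overlapping [d, e]
trace = [[ev(["a"], True)], [ev(["b", "c"])], [ev(["d"]), ev(["e"])]]
for pair, bounds in sorted(udfg_bounds(trace).items()):
    print(pair, bounds)
```

Enumeration grows exponentially with the amount of uncertainty, so this form is only suitable for checking small examples by hand.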
In a previous work [26] we illustrated this principle by applying it to the Inductive Miner, a popular discovery algorithm [19]; the edges of the UDFG can be filtered via the information on the labels, in such a way that the final model can represent all possible behavior in the uncertain log, or only a part. Figure 5 shows some process models obtained through inductive mining of the UDFG, as well as a description of how the models relate to the original uncertain log.

Note on the shorthand trace notation: it does not allow for the representation of every possible uncertain trace. In the case of timestamp uncertainty, it can only express mutual overlapping of time intervals. However, this notation is adequate to illustrate an example for process discovery under uncertainty.

[Figure 4: directed graph over the activities a to j; every arc carries a [minimum, maximum] frequency label.]
Fig. 4: The Uncertain Directly-Follows Graph (UDFG) computed based on the uncertain event log $\langle a, b, e, f, g, h \rangle$, $\langle a, \{b, c\}, [e, f], g, i \rangle$, $\langle a, \{b, c, d\}, [e, f], g, j \rangle$. The arcs are labeled with the minimum and maximum number of directly-follows relationships observable between activities in the corresponding trace. Notice the large amount of connections extracted from a single and rather short trace. Uncertain directly-follows relationships are inferred from the behavior graphs of the traces in the log. The construction of this object is necessary to perform automatic process discovery over uncertain event data.

UDFGs of uncertain event data are obtained on the basis of the behavior graphs of the traces in an uncertain event log, making their construction a necessary step to perform uncertain process discovery. In fact, the frequency information labeling the edges of UDFGs is obtained through a search among the possible connections within the behavior graphs of all the traces in an uncertain log.

Thus, the construction of behavior graphs for uncertain traces is the basis of both process discovery and conformance checking on uncertain event data, since the behavior graph is a necessary processing step to mine information from uncertain traces. It is then important to be able to quickly and efficiently build the behavior graph of any given uncertain trace, in order to enable performant process discovery and conformance checking.

Let us illustrate some basic concepts and notations, partially from [2]:
Definition 1 (Power set). The power set of a set $A$ is the set of all possible subsets of $A$, and is denoted with $\mathcal{P}(A)$. $\mathcal{P}_{NE}(A)$ denotes the set of all the non-empty subsets of $A$: $\mathcal{P}_{NE}(A) = \mathcal{P}(A) \setminus \{\emptyset\}$.

[Figure 5, panels (a) to (c): three process models over the activities of the uncertain event log.]
(a) A process model that can only replay the relationships appearing in the certain parts of the traces in the uncertain log. Here, information from uncertainty has been excluded completely.
(b) A process model that can replay some, but not all, of the relationships appearing in the uncertain parts of the traces in the uncertain log. This process model mediates between representing only certain observations and representing all the possible behavior in the process.
(c) A process model that can replay all possible configurations of certain and uncertain traces in the uncertain log. This process model has the highest possible replay fitness, but is also very likely to contain some noisy or otherwise unwanted behavior.
Fig. 5: Three different process models for the uncertain event log $\langle a, b, e, f, g, h \rangle$, $\langle a, \{b, c\}, [e, f], g, i \rangle$, $\langle a, \{b, c, d\}, [e, f], g, j \rangle$ obtained through inductive mining over an uncertain directly-follows graph. The different filtering parameters for the UDFG yield models with distinct features.

Definition 2 (Multiset). A multiset is an extension of the concept of set that keeps track of the cardinality of each element. $\mathcal{B}(A)$ is the set of all multisets over some set $A$. Multisets are denoted with square brackets, e.g. $b = [x, x, y]$, or with the cardinality of the elements as superscript, e.g. $b = [x^2, y]$. We denote the empty multiset with $[\,]$. The operator $b(\cdot)$ retrieves the cardinality of an element of the multiset, e.g. $b(x) = 2$, $b(y) = 1$, $b(z) = 0$. Over multisets we define $x \in b \Leftrightarrow b(x) \ge 1$, and $set(b) = \{x \mid x \in b\}$. The multiset union $b = b_1 \uplus b_2$ is the multiset $b$ such that for all $x$ we have $b(x) = b_1(x) + b_2(x)$.

Definition 3 (Sequence and permutation).
Given a set $X$, a finite sequence over $X$ of length $n$ is a function $s \in X^* \colon \{1, \dots, n\} \to X$, and is written as $s = \langle s_1, s_2, \dots, s_n \rangle$. For any sequence $s$ we define $|s| = n$, $s[i] = s_i$, $x \in s \Leftrightarrow x \in \{s_1, s_2, \dots, s_n\}$ and $s \oplus s_{n+1} = \langle s_1, s_2, \dots, s_n, s_{n+1} \rangle$. A permutation of the set $X$ is a sequence $x_S$ that contains all elements of $X$ without duplicates: $x_S \in X^*$, every $x \in X$ satisfies $x \in x_S$, and for all $1 \le i \le |x_S|$ and for all $1 \le j \le |x_S|$, $x_S[i] = x_S[j] \rightarrow i = j$. We denote with $\mathcal{S}_X$ all such permutations of set $X$. We overload the notation for sequences: given a sequence $s = \langle s_1, s_2, \dots, s_n \rangle$, we will write $\mathcal{S}_s$ in place of $\mathcal{S}_{\{s_1, s_2, \dots, s_n\}}$.

Definition 4 (Transitive relation and correct evaluation order). Let $X$ be a set of objects and $R$ be a binary relation $R \subseteq X \times X$. $R$ is transitive if and only if for all $x, x', x'' \in X$ we have that $(x, x') \in R \wedge (x', x'') \in R \rightarrow (x, x'') \in R$. A correct evaluation order is a permutation $s \in \mathcal{S}_X$ of the elements of the set $X$ such that for all $1 \le i < j \le |s|$ we have that $(s[i], s[j]) \in R$.

Definition 5 (Strict partial order). Let $S$ be a set of objects. Let $s, s' \in S$. A strict partial order $(\prec, S)$ is a binary relation that has the following properties:
– Irreflexivity: $s \prec s$ is false.
– Transitivity: $s \prec s'$ and $s' \prec s''$ imply $s \prec s''$.

Formally, the third property of strict partial orders is antisymmetry: $s \prec s'$ implies that $s' \prec s$ is false. It is implied by irreflexivity and transitivity [13].

Definition 6 (Directed graph). A directed graph $G \in \mathcal{U}_G$ is a tuple $(V, E)$ where $V$ is the set of vertices and $E \subseteq V \times V$ is the set of directed edges. The set $\mathcal{U}_G$ is the graph universe. A path in a directed graph $G = (V, E)$ is a sequence of vertices $p$ such that for all $1 \le i \le |p| - 1$ we have that $(p_i, p_{i+1}) \in E$. We denote with $P_G$ the set of all such possible paths over the graph $G$. Given two vertices $v, v' \in V$, we denote with $p_G(v, v')$ the set of all paths beginning in $v$ and ending in $v'$: $p_G(v, v') = \{p \in P_G \mid p[1] = v \wedge p[|p|] = v'\}$. $v$ and $v'$ are connected (and $v'$ is reachable from $v$), denoted by $v \overset{G}{\mapsto} v'$, if and only if there exists a path between them in $G$: $p_G(v, v') \neq \emptyset$. Conversely, $v \overset{G}{\not\mapsto} v' \Leftrightarrow p_G(v, v') = \emptyset$. We drop the superscript $G$ if it is clear from the context. A directed graph $G$ is acyclic if there exists no path $p \in P_G$ satisfying $p[1] = p[|p|]$.

Definition 7 (Topological sorting). Let $G = (V, E)$ be an acyclic directed graph. A topological sorting [15] $o_G = \langle v_1, v_2, \dots, v_{|V|} \rangle \in \mathcal{S}_V$ is a permutation of the vertices of $G$ such that for all $1 \le i < j \le |V|$ we have that $v_j \not\mapsto v_i$. We denote with $\mathcal{O}_G \subseteq \mathcal{S}_V$ all such possible topological sortings over $G$.

Definition 8 (Transitive reduction). A transitive reduction [6] $\rho \colon \mathcal{U}_G \to \mathcal{U}_G$ of a graph $G = (V, E)$ is a graph $\rho(G) = (V, E_r)$ with $E_r \subseteq E$ where every pair of vertices connected in $\rho(G)$ is not connected by any other path: for all $(v, v') \in E_r$, $p_G(v, v') = \{\langle v, v' \rangle\}$. $\rho(G)$ is the graph with the minimal number of edges that maintains the reachability between vertices of $G$. The transitive reduction of a directed acyclic graph always exists and is unique [6].

This paper proposes an analysis technique on uncertain event logs. These execution logs contain information about uncertainty explicitly associated with event data. A taxonomy of different types of uncertain event logs and attribute uncertainty has been described in [25]; we will refer to the notion of simple uncertainty, which includes uncertainty without probabilistic information on the control-flow perspective: activities, timestamps, and indeterminate events.
Definition 9 (Universes). Let $\mathcal{U}_I$ be the set of all the event identifiers. Let $\mathcal{U}_C$ be the set of all case identifiers. Let $\mathcal{U}_A$ be the set of all the activity identifiers. Let $\mathcal{U}_T$ be the totally ordered set of all the timestamp identifiers. Let $\mathcal{U}_O = \{!, ?\}$, where the “!” symbol denotes determinate events, and the “?” symbol denotes indeterminate events.

Definition 10 (Simple uncertain events). $e = (e_i, A, t_{min}, t_{max}, o)$ is a simple uncertain event, where $e_i \in \mathcal{U}_I$ is its event identifier, $A \in \mathcal{P}_{NE}(\mathcal{U}_A)$ is the set of possible activity labels for $e$, $t_{min}$ and $t_{max}$ are the lower and upper bounds for the value of its timestamp, and $o$ indicates if it is an indeterminate event. Let $\mathcal{U}_E = (\mathcal{U}_I \times \mathcal{P}_{NE}(\mathcal{U}_A) \times \mathcal{U}_T \times \mathcal{U}_T \times \mathcal{U}_O)$ be the set of all simple uncertain events. Over the uncertain event $e = (e_i, A, t_{min}, t_{max}, o)$ we define the projection functions $\pi_a(e) = A$, $\pi_{t_{min}}(e) = t_{min}$, $\pi_{t_{max}}(e) = t_{max}$ and $\pi_o(e) = o$.

Definition 11 (Simple uncertain traces and logs). $\sigma \subseteq \mathcal{U}_E$ is a simple uncertain trace if for any $(e_i, A, t_{min}, t_{max}, o) \in \sigma$, $t_{min} \le t_{max}$ and all the event identifiers are unique. $\mathcal{T}_U$ denotes the universe of simple uncertain traces. $L \subseteq \mathcal{T}_U$ is a simple uncertain log if all the event identifiers in the log are unique.

Definition 12 (Strict partial order over simple uncertain events). Let $e, e' \in \mathcal{U}_E$ be two simple uncertain events. $(\prec, \mathcal{U}_E)$ is an order defined on the universe of simple uncertain events $\mathcal{U}_E$ as:
$e \prec e' \Leftrightarrow \pi_{t_{max}}(e) < \pi_{t_{min}}(e')$

Definition 13 (Order-realizations of simple uncertain traces). Let $\sigma \in \mathcal{T}_U$ be a simple uncertain trace. An order-realization $\sigma_O = \langle e_1, e_2, \dots, e_{|\sigma|} \rangle \in \mathcal{S}_\sigma$ is a permutation of the events in $\sigma$ such that for all $1 \le i < j \le |\sigma|$ we have that $e_j \nprec e_i$, i.e. $\sigma_O$ is a correct evaluation order for $\sigma$ over $(\prec, \mathcal{U}_E)$, and the (total) order in which events are sorted in $\sigma_O$ is a linear extension of the strict partial order $(\prec, \mathcal{U}_E)$. We denote with $\mathcal{R}_O(\sigma)$ the set of all such order-realizations of the trace $\sigma$.

A necessary step to allow for analysis of simple uncertain traces is to obtain their behavior graph. A behavior graph is a directed acyclic graph that synthesizes the information regarding the uncertainty on timestamps contained in the trace.
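Definitions 10 to 13 can be made concrete on the running example. The sketch below is an illustration only: the class `UncertainEvent` and the function names are hypothetical, the days of the month are those given in the narrative of the running example, and order-realizations are found by filtering all permutations, which takes factorial time and is only viable for very short traces.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class UncertainEvent:
    """A simple uncertain event (e_i, A, t_min, t_max, o), as in Definition 10."""
    eid: str
    activities: frozenset
    t_min: int
    t_max: int
    indeterminate: bool = False


def precedes(e, f):
    """Strict partial order of Definition 12: e certainly happens before f."""
    return e.t_max < f.t_min


def order_realizations(trace):
    """All permutations in which no later event precedes an earlier one."""
    result = []
    for perm in permutations(trace):
        if all(not precedes(perm[j], perm[i])
               for i in range(len(perm))
               for j in range(i + 1, len(perm))):
            result.append(tuple(e.eid for e in perm))
    return result


# Running example; a certain timestamp is encoded as t_min == t_max
trace = [
    UncertainEvent("e1", frozenset({"NightSweats"}), 5, 5, True),
    UncertainEvent("e2", frozenset({"PrTP", "SecTP"}), 8, 8),
    UncertainEvent("e3", frozenset({"Splenomeg"}), 4, 10),
    UncertainEvent("e4", frozenset({"Adm"}), 12, 12),
]
for realization in order_realizations(trace):
    print(realization)
```

Only the position of the event with the interval timestamp varies among the results, since every other pair of events is ordered by the relation of Definition 12.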
Definition 14 (Behavior graph). Let $\sigma \in \mathcal{T}_U$ be a simple uncertain trace. Let the identification function $id: \sigma \to \{1, 2, \dots, |\sigma|\}$ be a bijection between the events in $\sigma$ and the first $|\sigma|$ natural numbers. A behavior graph $\beta: \mathcal{T}_U \to \mathcal{U}_G$ is the transitive reduction $\rho(G)$ of a directed graph $G = (V, E) \in \mathcal{U}_G$, defined as:

– $V = \{(id(e), \pi_a(e), \pi_o(e)) \mid e \in \sigma\}$
– $E = \{(v, w) \mid v, w \in V \wedge \pi_{t_{max}}(v) < \pi_{t_{min}}(w)\}$

The set of topological sortings of a behavior graph $\beta(\sigma)$ corresponds to the set of all the order-realizations of the trace $\sigma$.

Figures 6 and 7 show the transitive reduction operation on the running example.
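The construction in Definition 14 can be sketched directly in Python. This is a simple illustration under stated assumptions, not the implementation used by the authors: the toy trace and the function names are hypothetical, and the reduction step relies on the fact that the initial graph is already transitively closed, so an arc is redundant exactly when a two-step path exists between its endpoints.

```python
# Hypothetical toy trace: event id -> (t_min, t_max); labels omitted for brevity.
trace = {1: (1, 1), 2: (2, 5), 3: (3, 3), 4: (6, 7)}

def initial_graph(trace):
    """Arcs of G: v -> w whenever v certainly precedes w."""
    return {
        (v, w)
        for v in trace
        for w in trace
        if trace[v][1] < trace[w][0]
    }

def transitive_reduction(nodes, edges):
    """Drop every arc (v, w) implied by a two-step path v -> u -> w.

    Checking two-step paths is enough only because G is transitively
    closed (precedence between intervals is transitive).
    """
    return {
        (v, w)
        for (v, w) in edges
        if not any((v, u) in edges and (u, w) in edges for u in nodes)
    }

g = initial_graph(trace)
bg = transitive_reduction(trace.keys(), g)
```

In this example the arc from event 1 to event 4 is present in `g` and disappears in `bg`, since event 4 remains reachable through events 2 and 3.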
[Figure 6: graph over the events with activity labels NightSweats, {PrTP, SecTP}, Splenomeg and Adm.]
Fig. 6: The behavior graph of the trace in Table 1 before applying the transitive reduction. All the nodes in the graph are pairwise connected based on precedence relationships; pairs of nodes for which the order is unknown are not connected.

[Figure 7: graph over the events with activity labels NightSweats, {PrTP, SecTP}, Splenomeg and Adm.]
Fig. 7: The same behavior graph after the transitive reduction. One arc is removed, since its target is still reachable from its source through an intermediate event. This graph has a minimal number of arcs while conserving the same reachability relationship between nodes.

The semantics of a behavior graph convey, in a compact manner, time and order information about the events in the corresponding uncertain trace. For a behavior graph $\beta(\sigma) = (V, E)$ and two events $e_1 \in \sigma$, $e_2 \in \sigma$, $(e_1, e_2) \in E$ holds if and only if $e_1$ is immediately followed by $e_2$ for some possible values of the timestamps of the events in the trace. A consequence of this fact is that if two events in the graph are mutually unreachable, they might have occurred in any order.

A technical note: this definition for the nodes of the behavior graph is slightly different from the one in [25], to simplify the notation in algorithms. The two definitions are functionally identical.
Definition 14 is meaningful and clear from a theoretical point of view. It rigorously defines a behavior graph and the semantics of its parts. While helpful to understand the function of behavior graphs, obtaining them from process traces by following this definition (that is, by computing the transitive reduction) is inefficient and slow. This hinders the analysis of logs with a large number of events, and with longer traces. It is nonetheless possible to build behavior graphs from process traces in a faster and more efficient way.
The set of steps to efficiently create a behavior graph from an uncertain trace is separated into two distinct phases, described by Algorithms 1 and 2. An uncertain event $e$ is associated with a time interval determined by two values: the minimum and maximum timestamps of that event, $\pi_{t_{min}}(e)$ and $\pi_{t_{max}}(e)$. If an event $e$ has a certain timestamp, we have that $\pi_{t_{min}}(e) = \pi_{t_{max}}(e)$.

Algorithm 1: TimestampList(σ)
Input: An uncertain trace σ.
Output: The list of timestamps L of σ.
 1  L* ← ⟨⟩                                  // Support list
 2  L ← ⟨⟩                                   // List of event attributes
 3  E ← Sort(σ)                              // Sorts uncertain events by minimum timestamp
 4  i ← 1
 5  while i ≤ |E| do
 6      L* ← L* ⊕ (π_tmin(E[i]), i, E[i], 'MIN')
 7      L* ← L* ⊕ (π_tmax(E[i]), i, E[i], 'MAX')
 8      i ← i + 1
 9  Sort(L*)                                 // Sorts the list based on timestamp value
10  i ← 1
11  while i ≤ |L*| do
12      (t, id, e, type) ← L*[i]
13      L ← L ⊕ (id, π_a(e), π_o(e), type)
14      i ← i + 1
15  return L

Algorithm 2: BehaviorGraph(TimestampList(σ))
Input: The list L = TimestampList(σ) of an uncertain trace σ.
Output: The behavior graph β(σ) = (V, E).
 1  V ← {(id, π_a(e), π_o(e)) | (id, π_a(e), π_o(e), type) ∈ L}
 2  E ← ∅
 3  i ← 1
 4  while i < |L| do
 5      (id, a, o, type) ← L[i]
 6      if type = 'MAX' then
 7          j ← i + 1
 8          while j ≤ |L| do
 9              (id*, a*, o*, type*) ← L[j]
10              if type* = 'MIN' then
11                  E ← E ∪ {((id, a, o), (id*, a*, o*))}
12              else if ((id, a, o), (id*, a*, o*)) ∈ E then
13                  break
14              j ← j + 1
15      i ← i + 1
16  return (V, E)

We will examine here the effect of Algorithms 1 and 2 on a running example, the process trace shown in Table 4. Notice that, in this running example, neither uncertainty on activity labels nor indeterminate events are present: this is because the topology of a behavior graph only depends on the (uncertain) timestamps of the events belonging to the corresponding trace.

Table 4: Running example for the creation of the behavior graph.

Case ID   Event ID   Activity   Timestamp                   Event Type
872       e1         a          05-12-2011                  !
872       e2         b          [06-12-2011, 10-12-2011]    !
872       e3         c          07-12-2011                  !
872       e4         d          [08-12-2011, 11-12-2011]    !
872       e5         e          09-12-2011                  !
872       e6         f          [12-12-2011, 13-12-2011]    !

The construction of the graph relies on a preprocessing step shown in Algorithm 1, where a support list $\mathcal{L}^*$ is created (lines 4-8). Every entry in this list is a tuple of four elements. For each event $e$ in the trace, we insert two entries in the list, one for each timestamp $\pi_{t_{min}}$ and $\pi_{t_{max}}$ appearing in the trace. The four elements in each tuple of the resulting list $\mathcal{L}$ are:

– an identifier, which in the list construction is an integer representing the rank of the uncertain event by minimum timestamp (computed in line 3);
– the activity labels associated with the event, $\pi_a(e)$;
– the attribute $\pi_o(e)$, which carries the information regarding indeterminate events;
– the type of timestamp that generated this entry, i.e. whether it is the minimum or the maximum of an interval.

As we can see, the list is designed to contain all information about an uncertain event except the values of the minimum and maximum timestamps, which we use to sort the list (line 9) and then discard prior to returning the list (lines 10-15).

Table 5: Entries for the list $\mathcal{L}$ generated by each event in the uncertain trace. Every event $e$ has two associated entries, one marked as 'MIN' and the other as 'MAX'. Each entry is shown together with the timestamp used for sorting, followed by the four elements kept in $\mathcal{L}$: an integer that acts as event identifier, the set of possible activity labels $\pi_a(e)$ of the uncertain event, the indeterminate event attribute $\pi_o(e)$, and the type of timestamp ('MIN' or 'MAX').

Event   List L* entry (minimum timestamp)    List L* entry (maximum timestamp)
e1      (05-12-2011, 1, {a}, !, 'MIN')       (05-12-2011, 1, {a}, !, 'MAX')
e2      (06-12-2011, 2, {b}, !, 'MIN')       (10-12-2011, 2, {b}, !, 'MAX')
e3      (07-12-2011, 3, {c}, !, 'MIN')       (07-12-2011, 3, {c}, !, 'MAX')
e4      (08-12-2011, 4, {d}, !, 'MIN')       (11-12-2011, 4, {d}, !, 'MAX')
e5      (09-12-2011, 5, {e}, !, 'MIN')       (09-12-2011, 5, {e}, !, 'MAX')
e6      (12-12-2011, 6, {f}, !, 'MIN')       (13-12-2011, 6, {f}, !, 'MAX')

The events of the trace in Table 4 are represented in the list $\mathcal{L}^*$ by the entries shown in Table 5. These entries are then sorted by Algorithm 1, yielding the following list $\mathcal{L}$:

$\mathcal{L} = \langle (1, \{a\}, !, \text{'MIN'}), (1, \{a\}, !, \text{'MAX'}), (2, \{b\}, !, \text{'MIN'}), (3, \{c\}, !, \text{'MIN'}), (3, \{c\}, !, \text{'MAX'}), (4, \{d\}, !, \text{'MIN'}), (5, \{e\}, !, \text{'MIN'}), (5, \{e\}, !, \text{'MAX'}), (2, \{b\}, !, \text{'MAX'}), (4, \{d\}, !, \text{'MAX'}), (6, \{f\}, !, \text{'MIN'}), (6, \{f\}, !, \text{'MAX'}) \rangle$

One of the purposes of the list $\mathcal{L}$ is to gather the structural information needed to create the behavior graph; in fact, visiting the list in order is equivalent to sweeping the events of the trace along the time dimension, encountering each timestamp (minimum or maximum) in chronological order. We can visualize this on the Gantt diagram representation of the trace of Table 4, visible in Figure 8. Every segment representing an uncertain event in the diagram is translated by TimestampList into two entries in a sorted list, representing the two extremes of the segment.
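The preprocessing step of Algorithm 1 could look as follows in Python. This is a hedged sketch rather than the authors' code: the tuple layout `(activities, t_min, t_max, o)`, the function name `timestamp_list` and the helper `d` are assumptions made for the example. The tie-breaking rule (on equal timestamps, 'MIN' entries are placed before 'MAX' entries) is also an assumption, chosen so that two events sharing a boundary timestamp are treated as overlapping, in line with the strict inequality of Definition 12.

```python
from datetime import date

def timestamp_list(trace):
    """Sketch of Algorithm 1 for (activities, t_min, t_max, o) tuples."""
    # Line 3: rank the events by minimum timestamp.
    events = sorted(trace, key=lambda e: e[1])
    support = []
    # Lines 4-8: two entries per event, one per interval boundary.
    for rank, (acts, t_min, t_max, o) in enumerate(events, start=1):
        support.append((t_min, 0, rank, acts, o, "MIN"))
        support.append((t_max, 1, rank, acts, o, "MAX"))
    # Line 9: sort by timestamp; the 0/1 flag is the assumed tie-break
    # that puts 'MIN' entries first when timestamps coincide.
    support.sort(key=lambda entry: (entry[0], entry[1]))
    # Lines 10-15: discard the timestamps, keep the structural information.
    return [(rank, acts, o, kind) for (_, _, rank, acts, o, kind) in support]

def d(day):
    """Hypothetical helper: a day of December 2011."""
    return date(2011, 12, day)

# The running example of Table 4.
running_example = [
    (frozenset("a"), d(5), d(5), "!"),
    (frozenset("b"), d(6), d(10), "!"),
    (frozenset("c"), d(7), d(7), "!"),
    (frozenset("d"), d(8), d(11), "!"),
    (frozenset("e"), d(9), d(9), "!"),
    (frozenset("f"), d(12), d(13), "!"),
]
L = timestamp_list(running_example)
```

On the running example, the sequence of identifiers and entry types in `L` matches the sorted list given in the text.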
Events without an uncertain timestamp collapse into a single point in the diagram, and their corresponding two entries in the list are characterized by the same timestamp.

[Figure 8: Gantt diagram with one bar per event (a, b, c, d, e, f) along a time axis.]
Fig. 8: A Gantt diagram visualizing the time perspective of the events in Table 4. The horizontal blue bars represent the interval of possible timestamps of each event: the interval is wide for the events with an uncertain timestamp (activity labels "b", "d" and "f" in Table 4), and narrow, indicating a precise point in time, for the other events. This diagram shows the order relationship between events in a trace, as well as the width of their intervals of possible timestamps, to scale.

Now, let us examine Algorithm 2. The idea behind the algorithm is to analyze the time relationship among uncertain events in a more precise manner, as opposed to adding a large number of edges to the graph and then removing them via transitive reduction. This is attained by searching for all the viable successors of each event in the sorted timestamp list $\mathcal{L}$. We scan the list $\mathcal{L}$ with two nested loops, and we use the inner loop to look for successors of the entry selected by the outer loop. According to the semantics of behavior graphs, events with overlapping intervals as timestamps must not be connected by a path; thus, we draw outgoing edges from an event only when, reading the list, we arrive at a point in time in which the event has certainly occurred. This is the reason why outgoing edges are not drawn when inspecting minimum timestamps (line 6) and incoming edges are not drawn when inspecting maximum timestamps (line 10).

First, we initialize the set of nodes with all the triples $(id, \pi_a(e), \pi_o(e))$ in the entries of $\mathcal{L}$, and we initialize the edges with an empty set (lines 1-2). For each maximum timestamp that we encounter in the list, we start searching for successors in the following entries (lines 3-9), so we proceed in looking for the successors of $(id, a, o, type)$ only if $type$ = 'MAX'.
If, while searching for successors of the entry $(id, a, o, \text{'MAX'})$, we encounter the entry $(id^*, a^*, o^*, type^*)$ corresponding to a minimum timestamp ($type^*$ = 'MIN'), we connect $(id, a, o)$ and $(id^*, a^*, o^*)$ in the graph, since their timestamps do not have any possible value in common. The search for successors must continue, since it is possible that other events took place before the maximum timestamp of the event corresponding to $(id^*, a^*, o^*, type^*)$. This configuration occurs for events $e_1$ and $e_2$ in Table 4. As can be seen in Figure 8, $e_2$ can indeed follow $e_1$, but the still undiscovered event $e_3$ is another possible successor for $e_1$.

If the entry $(id^*, a^*, o^*, type^*)$ corresponds to a maximum timestamp (line 12), so $type^*$ = 'MAX', there are two separate situations to consider.

Case 1: $(id, a, o)$ was not already connected to $(id^*, a^*, o^*)$. Then, the timestamps of the events corresponding to $(id, a, o)$ and $(id^*, a^*, o^*)$ overlap with each other: if they did not, the two nodes would have already been connected, since we would have encountered $(id^*, a^*, o^*, \text{'MIN'})$ from $(id, a, o, \text{'MAX'})$ before encountering $(id^*, a^*, o^*, \text{'MAX'})$. Thus, $(id, a, o)$ must not be connected to $(id^*, a^*, o^*)$ and the search must continue. Events $e_2$ and $e_4$ are an example: when the maximum timestamp of $e_4$ is encountered during the search for the successors of $e_2$, the two are not connected, so the search for a viable successor of $e_2$ has to continue.

Case 2: $(id, a, o)$ and $(id^*, a^*, o^*)$ are already connected. This means that we had already encountered $(id^*, a^*, o^*, \text{'MIN'})$ during the search for the successors of $(id, a, o)$. Since the entire time interval representing the possible timestamp of the event associated with $(id^*, a^*, o^*)$ lies after the occurrence of $(id, a, o)$, there are no further events to consider as successors of $(id, a, o)$ and the search stops (line 13). In the running example, this happens between $e_1$ and $e_3$: when searching for the successors of $e_1$, we first connect it with $e_3$ when we encounter its minimum timestamp; we then encounter its maximum timestamp, so no later event can be a successor for $e_1$.

This concludes the walkthrough of the procedure, which shows why Algorithms 1 and 2 can be used to correctly compute the behavior graph of a trace. The behavior graph of the trace in Table 4 obtained through this procedure is shown in Figure 9.

[Figure 9: graph with nodes a ($e_1$), b ($e_2$), c ($e_3$), d ($e_4$), e ($e_5$) and f ($e_6$).]
Fig. 9: The behavior graph of the trace in Table 4.
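The successor search of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name `behavior_graph` and the use of `frozenset` for activity labels are assumptions, and the input is the sorted list $\mathcal{L}$ of the running example, written out by hand.

```python
def behavior_graph(tl):
    """Sketch of Algorithm 2 on a list of (id, activities, o, type) entries."""
    nodes = {entry[:3] for entry in tl}
    edges = set()
    for i, entry in enumerate(tl):
        if entry[3] != "MAX":
            continue  # outgoing arcs start only once the event has certainly occurred
        for other in tl[i + 1:]:
            arc = (entry[:3], other[:3])
            if other[3] == "MIN":
                edges.add(arc)   # the other event starts after this one has ended
            elif arc in edges:
                break            # the other event lies entirely after this one: stop
            # otherwise the two intervals overlap: skip and keep searching
    return nodes, edges

# The sorted list L of the running example (identifier, entry type).
order = [(1, "MIN"), (1, "MAX"), (2, "MIN"), (3, "MIN"), (3, "MAX"), (4, "MIN"),
         (5, "MIN"), (5, "MAX"), (2, "MAX"), (4, "MAX"), (6, "MIN"), (6, "MAX")]
labels = "abcdef"
L = [(i, frozenset(labels[i - 1]), "!", kind) for i, kind in order]

nodes, edges = behavior_graph(L)
arcs = sorted((source[0], target[0]) for source, target in edges)
```

For the running example, `arcs` contains seven arcs written with event identifiers: 1 to 2, 1 to 3, 2 to 6, 3 to 4, 3 to 5, 4 to 6 and 5 to 6. The membership test on `edges` is constant-time on average with a hash set, which keeps the nested scan quadratic in the length of the list.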
Let us now prove, in more formal terms, the correctness of these algorithms. We will show that the procedures BehaviorGraph and TimestampList are able to construct a behavior graph with the semantics illustrated in Definition 14.

Theorem 1 (Correctness of the behavior graph construction). Let $\sigma \in \mathcal{T}_U$ be an uncertain trace. Let $bg = (V, E) = \text{BehaviorGraph}(\text{TimestampList}(\sigma))$ be the behavior graph of $\sigma$ obtained through Algorithms 1 and 2. The graph $bg$ follows the behavior graph semantics: for all pairs of events $e \in \sigma$ and $e' \in \sigma$ such that $id(e) = e_{id}$, $\pi_a(e) = e_a$, $\pi_o(e) = e_o$, $id(e') = e'_{id}$, $\pi_a(e') = e'_a$, $\pi_o(e') = e'_o$, we have that the node $(e_{id}, e_a, e_o)$ is connected to the node $(e'_{id}, e'_a, e'_o)$ if and only if $\pi_{t_{max}}(e) < \pi_{t_{min}}(e')$ and there exists no event $e'' \in \sigma$ such that $\pi_{t_{max}}(e) < \pi_{t_{min}}(e'') \leq \pi_{t_{max}}(e'') < \pi_{t_{min}}(e')$. Thus, $bg = \beta(\sigma)$.

Proof. Let us first define a suitable $id$ function for the behavior graph utilizing the list $E$ created in TimestampList($\sigma$). For all events $e^* \in \sigma$ and for $i \in \mathbb{N}$ such that $E[i] = e^*$, we define $id(e^*) = i$. Since $id$ is just an enumeration of the events in $\sigma$, it is trivially bijective.

($\Leftarrow$) Assume $\pi_{t_{max}}(e) < \pi_{t_{min}}(e')$. By construction, we have that $\mathcal{L} = \langle \dots, (e_{id}, e_a, e_o, \text{'MAX'}), \dots, (e'_{id}, e'_a, e'_o, \text{'MIN'}), \dots \rangle$. The checks in line 6 and line 10 only allow edges to be drawn from entries of type 'MAX' to entries of type 'MIN' that appear in a later position in the list $\mathcal{L}$. Thus, the configuration $\pi_{t_{max}}(e) < \pi_{t_{min}}(e')$ is a strict prerequisite for $(e_{id}, e_a, e_o)$ and $(e'_{id}, e'_a, e'_o)$ to be connected: $((e_{id}, e_a, e_o), (e'_{id}, e'_a, e'_o)) \in E \Rightarrow \pi_{t_{max}}(e) < \pi_{t_{min}}(e')$.

($\Rightarrow$) Assume $\pi_{t_{max}}(e) < \pi_{t_{min}}(e')$, and that the algorithm is currently searching the successors for the entry $(e_{id}, e_a, e_o, \text{'MAX'})$. Eventually, the inner loop will consider as a successor the entry $(e'_{id}, e'_a, e'_o, \text{'MIN'})$, and since it is of type 'MIN', $(e_{id}, e_a, e_o)$ and $(e'_{id}, e'_a, e'_o)$ will necessarily be connected unless the algorithm executes the break at line 13. To execute it, the algorithm needs to find a list entry $(e''_{id}, e''_a, e''_o, \text{'MAX'})$ such that there already exists an arc between $(e_{id}, e_a, e_o)$ and $(e''_{id}, e''_a, e''_o)$, and this is only possible if $(e''_{id}, e''_a, e''_o, \text{'MIN'})$ has been encountered while searching for successors of $(e_{id}, e_a, e_o)$. This implies that

$\mathcal{L} = \langle \dots, (e_{id}, e_a, e_o, \text{'MAX'}), \dots, (e''_{id}, e''_a, e''_o, \text{'MIN'}), \dots, (e''_{id}, e''_a, e''_o, \text{'MAX'}), \dots, (e'_{id}, e'_a, e'_o, \text{'MIN'}), \dots \rangle$

which, by construction of $\mathcal{L}$, is only possible if there exists some $e'' \in \sigma$ such that $\pi_{t_{max}}(e) < \pi_{t_{min}}(e'') \leq \pi_{t_{max}}(e'') < \pi_{t_{min}}(e')$. $\square$

As mentioned earlier, the procedure of constructing a behavior graph has been structured in two different algorithms specifically to enable further optimization in processing uncertain process traces. This becomes evident once we consider the problem of converting all the traces in an event log into behavior graphs, as opposed to a single uncertain trace.

Firstly, it is important to notice that different uncertain traces can have the same list $\mathcal{L}$.
Similarly to directly-follows relationships in more classical process mining, which can ignore the absolute amount of time elapsed between two consecutive events, the specific values of the timestamps in an uncertain trace are not necessarily meaningful with respect to the connections in the behavior graph; their order, conversely, is crucial.

This fact enables further optimization at the log level. The construction of the list $\mathcal{L}$ in TimestampList($\sigma$) is engineered in a way that allows for computing the behavior graph without direct lookup of the events in the trace. This implies that it is possible to extract a multiset of lists $\mathcal{L}$ from the event log, and to compute the conversion to behavior graph only for the set of lists induced by this multiset. This saves computation time when converting an entire event log to behavior graphs; furthermore, it enables a more compact representation of the log in memory, since we only need to store a smaller number of graphs to represent the whole log.

The procedure to efficiently convert an event log into graphs is detailed in Algorithm 3.

Algorithm 3: ProcessUncertainLog
Input: An uncertain log L.
Output: The multiset of behavior graphs V_L.
    ML ← [ ]
    V_L ← [ ]
    for σ ∈ L do
        ML ← ML ⊎ [TimestampList(σ)]
    for 𝓛 ∈ ML do
        V_L ← V_L ⊎ [BehaviorGraph(𝓛)^ML(𝓛)]
    return V_L

These considerations allow us to extend to the uncertain scenario some concepts that are essential in classical process mining. Firstly, we can now extend the definition of variant, highly important for preexisting process mining techniques, to uncertain event data.
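The log-level procedure of Algorithm 3 can be sketched with `collections.Counter` acting as a multiset. This is a hedged illustration under assumptions: traces are lists of `(activities, t_min, t_max, o)` tuples with integer timestamps, the helper functions are compact restatements of Algorithms 1 and 2 written for this example, and all names are hypothetical.

```python
from collections import Counter

def timestamp_list(trace):
    """Algorithm 1, returned as a hashable tuple of (id, activities, o, type)."""
    events = sorted(trace, key=lambda e: e[1])
    support = []
    for rank, (acts, t_min, t_max, o) in enumerate(events, start=1):
        support.append((t_min, 0, (rank, acts, o, "MIN")))
        support.append((t_max, 1, (rank, acts, o, "MAX")))
    support.sort(key=lambda s: s[:2])  # assumed tie-break: 'MIN' before 'MAX'
    return tuple(entry for _, _, entry in support)

def behavior_graph(tl):
    """Algorithm 2, returned as a hashable (nodes, edges) pair."""
    edges = set()
    for i, entry in enumerate(tl):
        if entry[3] != "MAX":
            continue
        for other in tl[i + 1:]:
            arc = (entry[:3], other[:3])
            if other[3] == "MIN":
                edges.add(arc)
            elif arc in edges:
                break
    return frozenset(e[:3] for e in tl), frozenset(edges)

def process_uncertain_log(log):
    """Algorithm 3: build one behavior graph per distinct timestamp list."""
    lists = Counter(timestamp_list(trace) for trace in log)  # multiset ML
    variants = Counter()                                      # multiset V_L
    for tl, multiplicity in lists.items():
        variants[behavior_graph(tl)] += multiplicity
    return variants

A, B, C = frozenset("a"), frozenset("b"), frozenset("c")
log = [
    [(A, 1, 1, "!"), (B, 2, 4, "!"), (C, 3, 3, "!")],
    [(A, 10, 10, "!"), (B, 20, 40, "!"), (C, 30, 30, "!")],  # same timestamp order
    [(A, 1, 1, "!"), (B, 2, 2, "!"), (C, 3, 3, "!")],        # no uncertainty
]
variants = process_uncertain_log(log)
```

The first two traces differ in their absolute timestamps but share the same list, so `behavior_graph` is called only twice for the three traces and the resulting multiset holds two graphs with multiplicities 2 and 1.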
Definition 15 (Uncertain variants). Let $L \subseteq \mathcal{T}_U$ be a simple uncertain event log. The variants of $L$, denoted by $\mathcal{V}_L$, are the multiset of behavior graphs for the uncertain traces in $L$, and are computed with ProcessUncertainLog($L$).

The computational advantage in representing a log through a multiset of behavior graphs is evident in the procedure described in Algorithm 2. We see that all the data necessary for the creation of a behavior graph is contained in the list $\mathcal{L}$, a fact that justifies the log representation method illustrated in Algorithm 3.

Lemma 1. Two uncertain traces $\sigma_1 \in L$ and $\sigma_2 \in L$ belong to the same variant, and share the same behavior graph, if and only if they result in the same timestamp list $\mathcal{L}$: TimestampList($\sigma_1$) = TimestampList($\sigma_2$).

Another central concept in process mining is the so-called control-flow perspective of event data. In certain process traces, where timestamps have a total order, events have a single activity label and no event is indeterminate, the control-flow information is represented by a sequence of activity labels sorted by timestamp. Although there are many analysis approaches that also account for other perspectives (e.g. the performance perspective, which considers the duration of events and their distance in time, or the resource perspective, which accounts for the agents that execute the activities), a vast number of process mining techniques, including the most popular algorithms for process discovery and conformance checking, rely only on the control-flow perspective of a process. Analogously, behavior graphs carry the control-flow information of an uncertain trace: instead of describing a single flow of events like their certain counterpart, the behavior graph describes all possible flows of events in the uncertain trace.
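To make "all possible flows of events" tangible, the sketch below enumerates the topological sortings of a behavior graph. It is an illustration under assumptions: the arcs are those obtained by applying Algorithm 2 to the list $\mathcal{L}$ of the running example (written with event identifiers only), and the function name is hypothetical. The enumeration is exponential in general and is meant for small traces.

```python
def topological_sortings(nodes, edges):
    """Yield every ordering of nodes that respects all arcs in edges."""
    def extend(prefix, remaining):
        if not remaining:
            yield list(prefix)
            return
        for n in sorted(remaining):
            # n may come next only if all of its predecessors are placed
            if all(u in prefix for (u, v) in edges if v == n):
                yield from extend(prefix + [n], remaining - {n})
    yield from extend([], frozenset(nodes))

# Arcs of the behavior graph of the running example, by event identifier.
edges = {(1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (2, 6), (4, 6)}
flows = list(topological_sortings(range(1, 7), edges))
```

Each element of `flows` is one admissible total order of the six events; by the property stated with Definition 14, these correspond to the order-realizations of the trace.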
In this section, we will provide some values for the asymptotic complexity of the algorithms seen in this paper.

In a previous paper [25] we introduced the concept of behavior graph for the representation of uncertain event data, together with a method to obtain such graphs. Definition 14 describes this baseline method for the creation of the behavior graph, consisting of two main parts: the construction of the starting graph and the computation of its transitive reduction. Let us consider an uncertain process trace $\sigma \in \mathcal{T}_U$ with $|\sigma| = n$ events, and the graph $G = (V, E)$ generated in Definition 14 before the transitive reduction.

The starting graph is created by inspecting the time relationship between every pair of events; this corresponds to checking if an edge exists between each pair of vertices in $G$, which needs $O(n^2)$ time.

The transitive reduction of graphs can be obtained through many methods. A simple and efficient method to compute the transitive reduction on sparse graphs is to test reachability through a search (either breadth-first or depth-first) from each edge. This method costs $O(V \cdot E)$ time. However, in the initial graph each event $e \in V$ has an inbound arc from each event certainly preceding $e$ and an outbound arc to each event certainly following $e$. Only events with overlapping timestamp intervals are left unconnected, so more overlap implies fewer arcs in $G$; at the other extreme, the initial graph $G$ of a trace with no uncertainty has $|E| = \frac{n(n-1)}{2} = O(V^2)$ edges. Thus, except for rare, very uncertain cases, the graph $G$ is dense.

Footnote: Here, for simplicity, we resort to a widely adopted abuse of notation in asymptotic complexity analysis: we indicate a set instead of its cardinality (e.g., we use $O(V)$ in place of $O(|V|)$).

Aho et al. [6] presented a technique to compute the transitive reduction in $O(n^3)$ time, more appropriate in the case of dense graphs, and proved that the transitive reduction has the same computational complexity as the matrix multiplication problem. The problem of matrix multiplication was generally regarded as having an optimal time complexity of $O(n^3)$, until Volker Strassen presented an algorithm [30] able to multiply matrices in $O(n^{2.807})$ time. Subsequent improvements have followed, by Coppersmith and Winograd [11], Stothers [29] and Williams [33]. The asymptotically fastest algorithm known to date has been illustrated by Le Gall [17] and has an execution time of $O(n^{2.3729})$. However, these faster algorithms are very seldom used in practice, due to the large constant factors in their computation time that are hidden by the asymptotic notation. Moreover, they have vast memory requirements. The Strassen algorithm is helpful in real-life applications only when applied to very large matrices [12], and the Coppersmith-Winograd algorithm and subsequent improvements are more efficient only with inputs so large that they are effectively classified as galactic algorithms [16].

Bearing in mind these considerations, for the vast majority of event logs, the most efficient way to implement the creation of the behavior graph via transitive reduction runs in $O(n^2) + O(n^3) = O(n^3)$ time in the worst-case scenario.

It is straightforward to find upper bounds for the complexity of Algorithms 1 and 2.

Line 3 of TimestampList requires $O(n \log n)$ time to be executed. Lines 5-8 require $O(n)$ time. Line 9 requires $O(2n \log(2n)) = O(n \log n)$ time to be run. Lines 11-14 require $2n = O(n)$ time to be run. Lines 1, 2, 4 and 10 have a constant cost $O(1)$. Thus, TimestampList has a total asymptotic cost of $O(1) + 2 \cdot O(n \log n) + 2 \cdot O(n) = O(n \log n)$ in the worst-case scenario.

Let us now examine BehaviorGraph. Lines 1-3 and line 16 run in $O(1)$ time. Lines 4-15 consist of two nested loops over the list $\mathcal{L}$, and we have $|\mathcal{L}| = 2n$, resulting in an asymptotic cost of $O((2n)^2) = O(n^2)$. The total running time for the novel construction method is then $O(1) + O(n^2) = O(n^2)$ in the worst-case scenario.

We can also obtain a lower bound for the complexity in the worst-case scenario by analyzing the possible size of the output. The complete directed bipartite graph with $n$ vertices, usually indicated with $K_{\frac{n}{2}, \frac{n}{2}}$, is a DAG that has $(\frac{n}{2})^2 = O(n^2)$ edges. It is easy to see that the complete bipartite graph fulfills the requirements to be a behavior graph: it is in fact acyclic, and no edge can be removed without changing the reachability of the graph, so it is equal to its own transitive reduction. We can show that a behavior graph with such a shape exists by employing a simple construction: a trace composed of $n$ events with timestamps such that the first $\frac{n}{2}$ events all have overlapping timestamps, the last $\frac{n}{2}$ also all have overlapping timestamps, and the maximum timestamp of each of the first $\frac{n}{2}$ events is smaller than the minimum timestamp of each of the last $\frac{n}{2}$ events. The construction, together with an example, is illustrated in Figure 10. Since lines 4-15 of the algorithm build this graph with $O(n^2)$ edges, the algorithm runs in $\Omega(n^2)$ time, and thus also in $\Theta(n^2)$ time. This also proves the asymptotic optimality of the algorithm: no algorithm to build behavior graphs can run asymptotically faster than $\Theta(n^2)$ in the worst-case scenario.

[Figure 10: two groups of events with timestamp intervals [1, k], [2, k+1], [3, k+2], ..., [k, 2k-1] and [2k, 3k], [2k+1, 3k+1], [2k+2, 3k+2], ..., [3k-1, 4k-1]; instantiated example with intervals [1, 4], [2, 5], [3, 6], [4, 7] and [8, 12], [9, 13], [10, 14], [11, 15].]
Fig. 10: Construction of the class of behavior graphs isomorphic to a complete bipartite graph and an instantiated example. For any $n = 2k$, it is possible to have a behavior graph isomorphic to the graph $K_{k,k}$, which thus has a number of edges quadratic in the number of vertices.

The formal definition of our novel construction method for the behavior graph was used to show its asymptotic speedup with respect to the construction utilizing the transitive reduction. To empirically confirm this improvement, we built a set of experiments measuring the gain in speed and memory usage.
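The quadratic lower bound discussed with Figure 10 can be checked on small instances. The sketch below builds the two groups of intervals of the construction for a given $k$ and counts the arcs of the resulting behavior graph, computed directly from the semantics stated in Theorem 1 rather than through Algorithms 1 and 2. Function names are hypothetical, and the brute-force check is cubic, intended only as a sanity test.

```python
def bipartite_trace(k):
    """Intervals of Figure 10: [i, i+k-1] for i = 1..k, then [2k+i, 3k+i] for i = 0..k-1."""
    first = [(i, i + k - 1) for i in range(1, k + 1)]
    second = [(2 * k + i, 3 * k + i) for i in range(k)]
    return first + second

def behavior_graph_edges(intervals):
    """Arc e -> f iff e ends before f starts and no event lies entirely in between."""
    edges = set()
    for i, (_, e_max) in enumerate(intervals):
        for j, (f_min, _) in enumerate(intervals):
            if e_max < f_min and not any(
                e_max < g_min and g_max < f_min for (g_min, g_max) in intervals
            ):
                edges.add((i, j))
    return edges

# Number of arcs for a few values of k (the trace has n = 2k events).
edge_counts = {k: len(behavior_graph_edges(bipartite_trace(k))) for k in range(1, 6)}
```

For each tested `k`, the trace of `2k` events yields `k * k` arcs, i.e. one arc from every event of the first group to every event of the second.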
In this section, we will show a comparison between the running time of the naïve behavior graph construction, which employs the transitive reduction, and the improved method detailed throughout the paper. The experiments are set up to investigate the difference in performance between the two algorithms, and most importantly how this difference scales when the size of the event log increases, as well as the number of events in the log that have uncertain timestamps. In designing the experiments, we took into consideration the following research questions:

– Q1: how does the computation time of the two methods compare when run on logs having an increasing number of traces?
– Q2: how does the computation time of the two methods compare when run on logs with increasing trace lengths?
– Q3: how does the computation time of the two methods compare when run on logs with increasing percentages of events with uncertain timestamps?
– Q4: what degree of reduction in memory consumption for the representation of an uncertain log can we attain with the novel method?
– Q5: do the answers obtained for Q3 hold when simulating uncertainty on real-life event data?

Both the baseline algorithm based on transitive reduction [25] and the new algorithm for the construction of the behavior graph are implemented in Python, within the PROVED project. The implementation of both methods is available online, as well as the full code for the experiments presented here (see the reference in Section 1).

For each series of experiments exploring Q1 through Q4, we generate a synthetic event log with a number $n$ of traces of length $l$ (in number of events belonging to the trace). Uncertainty on timestamps is then artificially added to the events in the log.
A specific percentage $p$ of the events in the event log will have an uncertain timestamp, causing it to overlap with an adjacent event. Finally, behavior graphs are built from all the traces in the event log with either algorithm, while the execution time is measured. All results in this section are presented as the mean of the measurements for 10 runs of the corresponding experiment. In the diagrams, we label with "TrRed" the naïve method using the transitive reduction, and with "Improved" the faster algorithm illustrated in this paper. Additionally, the data series for the novel method are labeled with the relative variation in running time for each specific data point in the experiment, expressed as a percentage.

To answer Q1, the first experiment inspects how the efficiency of the two algorithms scales with the log dimension in number of traces. We generate logs with a fixed uncertainty percentage of $p = 0.5$ and a trace length of $l = 20$. The number of traces in the uncertain log progressively scales from $n = 1000$ to $n = 10000$. As shown in Figure 11, our proposed algorithm outperforms the baseline algorithm, showing a much smaller slope in computation time. As anticipated by the theoretical analysis, the computing time to build behavior graphs increases linearly with the number of traces in the event log for both methods; in the novel method, the constant factors are much smaller, thus producing the speedup that we can observe in the graph. Note that in this experiment the novel method requires between 18% and 26% of the time needed by the baseline method.

[Figure 11: plot of behavior graph building time (seconds) for the TrRed and Improved methods.]
Fig. 11: Time in seconds for the creation of the behavior graphs for synthetic logs with traces of length $l = 20$ events and timestamp uncertainty of $p = 0.5$, with an increasing number of traces $n$. The solid blue line indicates the time needed for the naïve construction; the dashed red line shows the building time of the improved algorithm, and is labeled with the relative time variation (in percentage).

The second experiment is designed to answer Q2. We analyze the effect of the trace length on the total time needed for behavior graph creation. To this end, we created logs with $n = 100$ traces of increasing lengths in number of events, and added uncertain timestamps to events with $p = 0.5$. The results, illustrated by Figure 12, meet our expectations: the computation time of the baseline method scales much worse than the computation time required by our new technique, due to its cubic asymptotic time complexity. This confirms the results of the asymptotic time complexity analysis detailed in Section 5. We can notice an order-of-magnitude increase in speed. At trace length $l = 600$, the new algorithm computes the graphs in only 0.35% of the time required by the baseline algorithm.
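The synthetic setup used throughout these experiments (logs of $n$ traces of length $l$, with a fraction $p$ of events given an uncertain timestamp that overlaps an adjacent event) might be reproduced along the following lines. This is a hypothetical sketch and not the experiment code of the authors: the spacing between events, the widening rule, the label alphabet and all function names are assumptions made for illustration.

```python
import random

def synthetic_trace(length, p, rng):
    """Hypothetical generator: events 10 time units apart, random labels.

    A fraction p of the events gets an interval wide enough to overlap the
    following event; this widening rule is an assumption of the sketch.
    """
    uncertain = set(rng.sample(range(length), round(p * length)))
    trace = []
    for i in range(length):
        t_min = 10 * i
        t_max = t_min + 15 if i in uncertain else t_min
        trace.append((frozenset(rng.choice("abcdef")), t_min, t_max, "!"))
    return trace

def synthetic_log(n, length, p, seed=0):
    """A log of n traces; the seed makes the generation repeatable."""
    rng = random.Random(seed)
    return [synthetic_trace(length, p, rng) for _ in range(n)]

log = synthetic_log(n=100, length=20, p=0.5)
```

A log generated this way can be fed to either construction method, with the running time of each measured, for instance, via `time.perf_counter` and averaged over repeated runs.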
100 200 300 400 500 600
Trace length (number of events) − B e h a v i o r g r a phbu il d i n g t i m e ( s ec o nd s ) Fig. 12:
Time in seconds for the creation of the behavior graphs for syntheticlogs with n = 100 traces and p = 0 . l .The next experiment tackles Q3 , by inspecting the difference in executiontime for the two algorithms in function of the percentage of events with an un-certain timestamp in the event log. Keeping constant the values n = 100 and l = 100, we progressively increased the percentage p of events with an uncertaintimestamp and measured computation time. As presented in Figure 13, the timerequired for behavior graph construction remains almost constant for our pro-posed algorithm, while it is very slightly decreasing for the baseline algorithm.This behavior is expected, and is justified by the fact that the worst-case scenariofor the baseline algorithm is a trace that has no uncertainty on the timestamp:in that case, the behavior graph is simply a chain of nodes representing the to-tal order in a sequence of events with certain timestamps, thus the transitivereduction needs to find and remove a higher number of edges from the directedgraph. This worst-case scenario occurs at p = 0, explaining why the computationtime needed by the transitive reduction is at its highest. It is important to note,however, that for all values of p our new algorithm runs is significantly moreefficient than the baseline algorithm: with p = 0, the new algorithm takes 0 . p = 1 this figure growsto 4 . fficient Time and Space Representation of Uncertain Event Data 27 . . . . . . Uncertainty (%) B e h a v i o r g r a phbu il d i n g t i m e ( s ec o nd s ) TrRedImproved
Fig. 13:
Time in seconds for the creation of the behavior graphs for synthetic logs with n = 100 traces of length l = 100 events, with increasing percentages of timestamp uncertainty p.

An additional experiment provides an answer to Q4. Similarly to the first experiment, we increase the number of traces n in the uncertain log while keeping the other parameters fixed: l = 10 and p = 0.5. We then perform the behavior graph construction with both methods and measure the memory consumption of the transitive reduction method (which keeps in memory one behavior graph for each uncertain trace) against that of the improved method (which generates a multiset of behavior graphs, one for each variant in the uncertain log).

[Figure 14 plot. y-axis: Memory occupation (bytes); series: TrRed, Improved.]
Fig. 14:
Memory consumption in bytes needed to store the behavior graphs for synthetic uncertain event logs with traces of length l = 10 events and timestamp uncertainty of p = 0.5, with an increasing number of traces n.

The results are summarized in Figure 14. Note that as n increases, more and more uncertain traces are characterized by the same behavior graph and can thus be grouped in the same variant. This allows the improved algorithm to store the uncertain log more efficiently. At n = 15000, the space needed by the multiset of behavior graphs is 59 . Q5 we compared the computation time for behavior graph creation on real-life event logs, where we artificially inserted timestamp uncertainty in progressively higher percentages p of uncertain events, as described for the experiments above. We considered three event logs: an event log tracking the activities of the help desk process of an Italian software company, a log related to the management of road traffic fines in an Italian municipality, and a log from the BPI Challenge 2012 related to a loan application process. The results, presented in Figure 15, closely adhere to the findings of the experiments with synthetically generated uncertain event data: the novel method provides a substantial speedup that remains rather stable with respect to the percentage p of uncertain events added to the log.

[Figure 15 plots. Panels: BPIC 2012, HelpDesk, RTFM; x-axis: Uncertainty (%); y-axis: Behavior graph building time (seconds).]
Fig. 15:
Execution times in seconds for real-life event logs with increasing percentages p of timestamp uncertainty.

In Section 1 we saw how building the behavior graph is a fundamental preprocessing step for both process discovery and conformance checking when dealing with uncertain event logs. In the previous section, we showed in practice how the novel algorithm presented in this paper impacts the computation time for the construction of behavior graphs. Now, let us look at the effect of this speedup when applied to process mining techniques.

In this additional experiment we consider the conformance checking problem. In [25] we proposed an approach to compute upper and lower bounds for the conformance score of a trace against a reference Petri net through the alignment technique, which yields alignments for the best- and worst-case scenarios of an uncertain trace, as illustrated in Section 1. The experiment is set up to assess the effect of the new behavior graph construction on the overall performance of conformance checking over uncertain data. We first generate a Petri net with t transitions, simulate a log by playing out n = 500 traces, and add timestamp uncertainty with p = 0.1. We then compute the lower bound for conformance between the uncertain event log and the Petri net used as a source, and compare the overall execution time for conformance using the two different methods for the creation of the behavior graph. In this specific experiment, we also considered the other types of uncertainty in process mining illustrated in the taxonomy of [25], as well as all types of uncertainty simulated together on the same log.
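
The notion of best- and worst-case scenarios over an uncertain trace can be illustrated with a brute-force sketch. This is not the alignment-based technique of [25]. It assumes, purely for illustration, that the behavior graph is given as activity labels plus precedence edges, that the reference model is a finite set of allowed activity sequences rather than a Petri net, and that edit distance stands in for alignment cost. All names are hypothetical.

```python
from itertools import permutations


def realizations(labels, edges):
    """Yield every activity sequence compatible with the partial order."""
    for order in permutations(range(len(labels))):
        position = {node: i for i, node in enumerate(order)}
        if all(position[u] < position[v] for u, v in edges):
            yield tuple(labels[node] for node in order)


def edit_distance(a, b):
    """Levenshtein distance between two sequences (stand-in for alignment cost)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


def conformance_bounds(labels, edges, model_traces):
    """Best-case and worst-case deviation cost over all realizations."""
    costs = [
        min(edit_distance(r, m) for m in model_traces)
        for r in realizations(labels, edges)
    ]
    return min(costs), max(costs)


# Behavior graph in which "b" and "c" are unordered with respect to each other.
labels = ["a", "b", "c", "d"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
model_traces = [("a", "b", "c", "d")]

print(conformance_bounds(labels, edges, model_traces))
```

Enumerating permutations is exponential in the trace length, so a sketch of this kind is only usable on very short traces.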
[Figure 16 plots. Panels: Activities, Timestamps, Indeterminate events, All; x-axis: Number of transitions; y-axis: Time variation (%).]
Fig. 16:
Relative variation in computation time obtained through the improved behavior graph construction when applied to the computation of conformance bounds between a synthetic uncertain log and a Petri net with an increasing number of transitions. The synthetic uncertain logs have n = 500 traces and timestamp uncertainty has been introduced with p = 0 . t = 5), the alignment algorithm takes a short time to execute, so the speedup provided by the improved behavior graph construction has a larger impact on the total computation time (taking as little as 30.71% of the time to calculate alignments). As t increases, the computation time for conformance checking using the fast construction of the behavior graph appears to stabilize around 65% of the time needed with the naïve construction when considering only one type of uncertainty in isolation. This translates into a reduction of roughly 35% in computation time for the very common problem of calculating the conformance score between event data and a reference model, a significant impact on the performance of concrete applications of process mining over uncertain data. When compounding all types of uncertainty we see a similar effect, although for t = 5 the improved method takes 52.22% of the time required by the baseline construction, a less dramatic effect than in the other uncertainty settings. This is because, even at such small scales, the high number of realizations of the traces slows down the alignment phase of the computation.
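
One way to read these ratios is through Amdahl's law. The sketch below assumes that total time splits into behavior graph construction plus everything else, and that only the construction is accelerated. The function names and the idealized infinite speedup are assumptions introduced for illustration, not quantities measured in the experiments.

```python
def relative_time(fraction, speedup):
    """Total time after the speedup, relative to the baseline total.

    `fraction` is the share of baseline time spent in the accelerated step,
    `speedup` is how many times faster that step becomes (Amdahl's law).
    """
    return (1.0 - fraction) + fraction / speedup


def implied_fraction(ratio, speedup=float("inf")):
    """Share of baseline time in the accelerated step, given the observed ratio.

    With the default (idealized, infinite) speedup this is a lower bound on
    the true share, since a finite speedup requires a larger share to reach
    the same ratio.
    """
    if speedup <= 1.0:
        raise ValueError("speedup must be greater than 1")
    return (1.0 - ratio) / (1.0 - 1.0 / speedup)


# Ratios reported in the text: 30.71% at t = 5, about 65% for larger t.
for ratio in (0.3071, 0.65):
    print(f"ratio {ratio:.4f} -> construction share >= {implied_fraction(ratio):.4f}")
```

Under these assumptions, a ratio of 30.71% means construction accounted for at least about 69% of the baseline time at t = 5, while a ratio of 65% puts that share at 35% or more. This is consistent with the alignment phase dominating as the model grows.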
In evaluating this result, it is important to consider that alignments are a notoriously time-intensive technique [18], since they are based on an A* search on a state space consisting of pairs of the activities in the trace combined with the possible actions in the model. As a consequence, the impact of the algorithm presented in this paper is limited by the characteristics of the implementation of the alignment technique; combining it with more refined alignment algorithms would further improve the gain in speed.

In summary, the outcomes of the experiments show that our new algorithm outperforms the previous method for creating the behavior graph on all the parameters along which the problem instance can scale, in both time and space. The experiment designed to answer Q3 shows that, like the naïve algorithm, our novel method is essentially insensitive to the percentage of events with uncertain timestamps contained in a trace. This fact is also verified by the experiment associated with Q5 on real-life data with added time uncertainty. For every combination of parameters we benchmarked, the novel algorithm runs in a fraction of the time required by the baseline method; the experiments also confirm the improvements in asymptotic time complexity demonstrated through theoretical complexity analysis.

The topic of process mining analysis over uncertain event data is relatively new, and little research has been carried out. The work introducing the concept of uncertainty in process mining, together with a taxonomy of the various types of uncertainty, specifically illustrated that if a trace displays uncertain attributes, it contains behavior, which can be effectively represented through graphical models, namely behavior graphs and behavior nets [25].
Unlike in classic process mining, where there is a clearly defined separation between data and model and between the static behavior of data and the dynamic behavior of models, the distinction between data and models becomes blurred in the presence of uncertainty, because of the variety in behavior that affects the data. Representing traces through process models is utilized in [25] for the computation of upper and lower bounds for conformance scores of uncertain process traces against classic reference models. Another practical application of behavior graphs in the field of process mining over uncertain event data is presented in [26]: behavior graphs of uncertain traces are employed to determine the number of possible directly-follows relationships between uncertain events, with the end goal of automatically discovering process models from uncertain event data.

Although, as mentioned, the application of the concept of uncertainty in data to process mining is recent, the same idea has precedents in the older field of data mining. Aggarwal and Philip [4] offer an overview of the topic of uncertain data and its analysis, with a strong focus on querying. Such data is modeled on the basis of probabilistic databases [31], a foundational concept in the setting of uncertain data mining. A branch of data mining particularly close to process mining is frequent itemset mining: an efficient algorithm to search for frequent itemsets over uncertain data, U-Apriori, has been described by Chui et al. [10].

Behavior graphs are Directed Acyclic Graphs (DAGs), which are widely used throughout many areas of science to represent dependencies, precedence relationships, time information, or partial orders with a graph-like model. They are effectively utilized in circular dependency analysis in software [7], probabilistic graphical models [8], dynamic graph analytics [23], and compiler design [5]. In process mining, Conditional Partial Order Graphs (CPOGs), which consist of collections of DAGs, have been exploited by Mokhov et al. [24] to aid the task of process discovery.

We have seen throughout the paper that uncertainty on the timestamp dimension (namely, representing the time at which an event occurred with an interval of possible timestamps) generates a partial order on the precedence relationships of events. Although uncertainty research in process mining provides a novel justification of partial ordering that stems from specific attribute values, the idea of having a partial order instead of a total order among events in a trace has precedents in process mining research. Lu et al. [21][22] examined the problem of conformance checking through alignments in the case of partially ordered traces, and developed a construct to represent conformance called a p-alignment. Genga et al. [14] devised a method to identify highly frequent anomalous patterns in partially ordered process traces. More recently, van der Aa et al. [1] developed a probabilistic infrastructure that allows inferring the most likely linear extension of a partial order between events in a trace, with the goal of “resolving” the partial order.

An important aspect to notice is that conformance checking over uncertain event data is not to be confused with stochastic conformance checking, which concerns measuring conformance of certain event data against models enriched with probabilistic information. The probabilities decorating a stochastic model do not derive from uncertainties in event data, but rather from the frequency of activities [20] or from performance indicators [28].

A review of related work on the asymptotic complexity of the transitive reduction and the equivalent problem of matrix multiplication is provided with the complexity analysis of the algorithms examined in this paper, in Section 5.
The creation of behavior graphs, a graphical structure of paramount importance for the analysis of uncertain data in the domain of process mining, plays a key role as an initial processing step for both conformance checking and process discovery on process traces containing events with timestamp uncertainty, the most critical type of uncertain behavior. It makes it possible to represent the time relationship between uncertain events, which can be in a partial order. The behavior graph also carries the information regarding other types of uncertainty, such as uncertain activity labels and indeterminate events. Such a representation is vital to establish which possible sequence of events in an uncertain trace most adheres to the behavior prescribed by a reference model, thereby enabling conformance checking, and to measure the number of possible occurrences of the directly-follows relationship between activities in an event log, making process discovery over uncertainty possible. Extracting behavior graphs from uncertain event data is thus both crucial and time-consuming. In this paper, we improve the performance of uncertainty analysis by proposing a new algorithm that enables the creation of behavior graphs in quadratic time in the number of events in the trace. This novel method additionally allows for the representation of an uncertain log as a multiset of behavior graphs, whose relevance is twofold: it represents the control-flow information of an uncertain event log more compactly, using less memory, and it naturally extends the concept of variant, central throughout the discipline of process mining, to the uncertain domain. We proved the correctness of this novel algorithm, showed asymptotic upper and lower bounds for its time complexity, and carried out performance experiments that show the gain in computing speed it entails in real-world scenarios.
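
The idea of storing an uncertain log as a multiset of behavior graphs can be illustrated with a short sketch. It assumes events are hypothetical `(label, t_min, t_max)` triples and uses a simplified hashable key (labels sorted by timestamp plus the reduced edge set) in place of a proper graph isomorphism check, so it may keep apart graphs that are in fact equivalent. It is not the data structure used in the experiments.

```python
from collections import Counter


def behavior_graph_key(trace):
    """Hashable (labels, edges) description of a trace's behavior graph."""
    # Sort events so that node indices do not depend on the input order.
    events = sorted(trace, key=lambda e: (e[1], e[2], e[0]))
    labels = tuple(e[0] for e in events)
    n = len(events)
    # precedes[i][j]: event i certainly ends before event j can start.
    precedes = [[events[i][2] < events[j][1] for j in range(n)] for i in range(n)]
    # Keep only edges that are not implied by an intermediate event.
    edges = frozenset(
        (i, j)
        for i in range(n)
        for j in range(n)
        if precedes[i][j]
        and not any(precedes[i][k] and precedes[k][j] for k in range(n))
    )
    return labels, edges


def variants(log):
    """Multiset of behavior graphs: one entry per distinct key, with its count."""
    return Counter(behavior_graph_key(trace) for trace in log)


log = [
    [("a", 0, 0), ("b", 1, 3), ("c", 2, 4), ("d", 5, 5)],
    # Same ordering structure as the first trace, shifted in time.
    [("a", 10, 10), ("b", 11, 13), ("c", 12, 14), ("d", 15, 15)],
    # No uncertainty: a plain chain.
    [("a", 0, 0), ("b", 1, 1), ("c", 2, 2), ("d", 3, 3)],
]

for (labels, edges), count in variants(log).items():
    print(count, labels, sorted(edges))
```

Because absolute timestamps are discarded, traces that differ only in when they happened collapse into the same entry, which is what lets the multiset grow more slowly than the log.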
References
1. Van der Aa, H., Leopold, H., Weidlich, M.: Partial order resolution of event logs for process conformance checking. Decision Support Systems p. 113347 (2020)
2. Van der Aalst, W.M.P.: Process mining: data science in action. Springer (2016)
3. Adriansyah, A., van Dongen, B.F., van der Aalst, W.M.P.: Towards robust conformance checking. In: International Conference on Business Process Management. pp. 122–133. Springer (2010)
4. Aggarwal, C.C., Philip, S.Y.: A survey of uncertain data algorithms and applications. IEEE Transactions on knowledge and data engineering (5), 609–623 (2008)
5. Aho, A., Lam, M., Sethi, R., Ullman, J., Cooper, K., Torczon, L., Muchnick, S.: Compilers: Principles, Techniques and Tools (2007)
6. Aho, A.V., Garey, M.R., Ullman, J.D.: The transitive reduction of a directed graph. SIAM Journal on Computing (2), 131–137 (1972)
7. Al-Mutawa, H.A., Dietrich, J., Marsland, S., McCartin, C.: On the shape of circular dependencies in Java programs. In: 2014 23rd Australian Software Engineering Conference. pp. 48–57. IEEE (2014)
8. Bayes, T.: LII. An Essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR S. Philosophical transactions of the Royal Society of London (53), 370–418 (1763)
9. Berti, A., van Zelst, S.J., van der Aalst, W.M.P.: Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science. In: ICPM Demo Track (CEUR 2374). pp. 13–16 (2019)
10. Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. In: Pacific-Asia Conference on knowledge discovery and data mining. pp. 47–58. Springer (2007)
11. Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. Journal of symbolic computation (3), 251–280 (1990)
12. D’Alberto, P., Nicolau, A.: Using recursion to boost ATLAS’s performance. In: High-Performance Computing. pp. 142–151. Springer (2005)
13. Flaška, V., Ježek, J., Kepka, T., Kortelainen, J.: Transitive closures of binary relations. I. Acta Universitatis Carolinae. Mathematica et Physica (1), 55–69 (2007)
14. Genga, L., Alizadeh, M., Potena, D., Diamantini, C., Zannone, N.: Discovering anomalous frequent patterns from partially ordered event logs. Journal of Intelligent Information Systems (2), 257–300 (2018)
15. Kalvin, A.D., Varol, Y.L.: On the generation of all topological sortings. Journal of Algorithms (2), 150–162 (1983)
16. Le Gall, F.: Faster algorithms for rectangular matrix multiplication. In: 2012 IEEE 53rd annual symposium on foundations of computer science. pp. 514–523. IEEE (2012)
17. Le Gall, F.: Powers of tensors and fast matrix multiplication. In: Proceedings of the 39th international symposium on symbolic and algebraic computation. pp. 296–303. ACM (2014)
18. Lee, W.L.J., Verbeek, H.M.W., Munoz-Gama, J., van der Aalst, W.M.P., Sepúlveda, M.: Replay using recomposition: alignment-based conformance checking in the large. In: BPM Demo Track and BPM Dissertation Award co-located with 15th International Conference on Business Process Management (BPM 2017). CEUR-WS (2017)
19. Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering block-structured process models from event logs - a constructive approach. In: International conference on applications and theory of Petri nets and concurrency. pp. 311–329. Springer (2013)
20. Leemans, S.J.J., Polyvyanyy, A.: Stochastic-aware conformance checking: An entropy-based approach. In: International Conference on Advanced Information Systems Engineering. pp. 217–233. Springer (2020)
21. Lu, X., Fahland, D., van der Aalst, W.M.P.: Conformance checking based on partially ordered event data. In: International conference on business process management. pp. 75–88. Springer (2014)
22. Lu, X., Mans, R.S., Fahland, D., van der Aalst, W.M.P.: Conformance checking in healthcare based on partially ordered event data. In: Proceedings of the 2014 IEEE Emerging Technology and Factory Automation (ETFA). pp. 1–8. IEEE (2014)
23. Mariappan, M., Vora, K.: Graphbolt: Dependency-driven synchronous processing of streaming graphs. In: Proceedings of the Fourteenth EuroSys Conference 2019. p. 25. ACM (2019)
24. Mokhov, A., Carmona, J., Beaumont, J.: Mining conditional partial order graphs from event logs. In: Transactions on Petri Nets and Other Models of Concurrency XI, pp. 114–136. Springer (2016)
25. Pegoraro, M., van der Aalst, W.M.P.: Mining uncertain event data in process mining. In: 2019 International Conference on Process Mining (ICPM). pp. 89–96. IEEE (2019)
26. Pegoraro, M., Uysal, M.S., van der Aalst, W.M.P.: Discovering process models from uncertain event data. In: International Conference on Business Process Management. pp. 238–249. Springer (2019)
27. Pegoraro, M., Uysal, M.S., van der Aalst, W.M.P.: Efficient construction of behavior graphs for uncertain event data. In: International Conference on Business Information Systems (BIS). Springer (2020)
28. Rogge-Solti, A., van der Aalst, W.M.P., Weske, M.: Discovering stochastic petri nets with arbitrary delay distributions from event logs. In: International Conference on Business Process Management. pp. 15–27. Springer (2013)
29. Stothers, A.J.: On the complexity of matrix multiplication. Ph.D. thesis, University of Edinburgh (2010)
30. Strassen, V.: Gaussian elimination is not optimal. Numerische mathematik (4), 354–356 (1969)
31. Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic databases. Synthesis lectures on data management 3