Conformance Checking over Uncertain Event Data
Marco Pegoraro, Merih Seran Uysal, and Wil M.P. van der Aalst
Process and Data Science Group (PADS), Department of Computer Science, RWTH Aachen University, Aachen, Germany
{pegoraro,uysal,wvdaalst}
Abstract.
The strong impulse to digitize processes and operations in companies and enterprises has resulted in the creation and automatic recording of an increasingly large amount of process data in information systems. These data are made available in the form of event logs. Process mining techniques enable the process-centric analysis of data, including automatically discovering process models and checking whether event data conform to a given model. In this paper, we analyze the previously unexplored setting of uncertain event logs. In such event logs, uncertainty is recorded explicitly, i.e., the time, activity and case of an event may be unclear or imprecise. In this work, we define a taxonomy of uncertain event logs and models, and we examine the challenges that uncertainty poses for process discovery and conformance checking. Finally, we show how upper and lower bounds for conformance can be obtained by aligning an uncertain trace onto a regular process model.
Keywords:
Process Mining · Uncertain Data · Partial Order.
Over the last decades, the concept of process has become more and more central in formally describing the activities of businesses, companies and other similar entities, structured in specific steps and phases. A process is thus defined as a well-structured set of activities, potentially performed by multiple actors (resources), which contribute to the completion of a specific task or to the achievement of a specific goal. In this context, a very important notion is the concept of case, that is, a single instance of a process. For example, in a healthcare process, a case may be a single hospitalization of a patient, or the patient themself; if the process belongs to a credit institution, a case may be a loan application from a customer, and so on. The case notion allows us to define a process as a procedure that defines the steps needed to handle cases from inception to completion.
We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research interactions. We acknowledge Elisabetta Benevento for her valuable input. Please do not print this document unless strictly necessary.
arXiv: [cs.AI], Nov
A process model defines such a procedure, and can be expressed in a number of different formalisms (transition systems, Petri nets, BPMN and UML diagrams, and many more). Consequently, the study and adoption of analysis techniques specifically customized to deal with process data and process models has enabled the bridging of business administration and data science and the development of dedicated disciplines like business intelligence and Business Process Management (BPM).
The processes that govern the innards of business companies are increasingly supported by software tools. Performing specific activities is both aided and recorded by
Process-Aware Information Systems (PAISs), which support the definition and management of processes. The information regarding the execution of processes can then be extracted from PAISs in the form of an event log, a database or file containing the digital footprint of the operations carried out in the context of the execution of a process and recorded as events. Event logs can vary in form, and contain differently structured information depending on the information system that enacted data collection in the organization. Although many different event attributes can be recorded, it is typically assumed that three basic features of an event are available in the log: the time at which the event occurred, the activity that has been performed, and the case identifier to which the event belongs. This last attribute allows us to group events in clusters belonging to the same case, and the resulting clusters (usually organized in sequences sorted by timestamp) are called process traces. The discipline of process mining is concerned with the automatic analysis of event logs, with the goal of extracting knowledge regarding, e.g., the structure of the process, the conformity of events to a specific normative process model, the performance of the agents executing the process, and the relationships between groups of actors in the process.
In this paper, we will consider the analysis of a specific class of event logs: logs that contain uncertain event data. Uncertain events are recordings of executions of specific activities in a process which are accompanied by an indication of uncertainty in the event attributes. Specifically, we consider the case where the attributes of an event are not recorded as a precise value but as a range or a set of alternatives.
Uncertain event data are common in practice, but often uncertainty is not made explicit. The
Process Mining Manifesto [5] describes a fundamental property of event data as trustworthiness, the assumption that the recorded data can be considered correct and accurate. In a general sense, uncertainty (as defined here) is an explicit absence of trustworthiness, with an indication of uncertainty recorded together with the event data. In the taxonomy of event data proposed in the Manifesto, the logs at the two lower levels of quality frequently lack trustworthiness, and thus can be uncertain. This encompasses a wide range of processes, such as event logs of document and product management systems, error logs of embedded systems, worksheets of service engineers, and any process recorded totally or partially on paper. There are many possible causes of uncertainty:
– Incorrectness. In some instances, the uncertainty is simply given by errors that occurred while recording the data themselves. Faults of the information system, or human mistakes in a data entry phase, can all lead to missing or altered event data that can be subsequently modeled as uncertain event data.
– Coarseness. Some information systems have limitations in their way of recording data, often tied to factors like the precision of the data format, such that the event data can be considered uncertain. A typical example is an information system that only records the date, but not the time, of the occurrence of an event: if two events are recorded on the same day, the order of occurrence is lost. This is an especially common circumstance in processes that are, partially or completely, recorded on paper and then digitized. Another factor that can lead to uncertainty in the time of recording is the information system being overloaded and, thus, delaying the recording of data. This type of uncertainty can also be generated by the limited sensitivity of a sensor.
– Ambiguity.
In some cases, the data recorded is not an identifier of a certain event attribute; in these instances, the data needs to be interpreted, either automatically or manually, in order to obtain a value for the event attribute. Uncertainty can arise if the meaning of the data is ambiguous and cannot be interpreted with precision. Examples include data in the form of images, text, or video.
These factors cause the presence of implicit uncertainty in the event log. It is important to note that, in order to be analyzed, these indications of imprecision or incorrectness have to be translated into explicit uncertainty. Explicit uncertainty is contained directly in the event log in the form of event attributes. It is possible to think of explicit uncertainty as metadata complementing the information regarding events. This metadata describes the type and magnitude of the imprecision affecting some event attributes, which might be part of the control-flow perspective or an additional data perspective present in the event log.
Aside from the possible causes, we can identify other types of uncertain event logs based on the frequency of uncertain data. Uncertainty can be infrequent, when a specific attribute is only seldom recorded together with explicit uncertainty; the uncertainty is rare enough that uncertain events can be considered outliers. Conversely, frequent uncertain behavior of the attribute is systematic, pervasive in a high number of traces, and thus not to be considered an outlier. The uncertainty can be considered part of the process itself. These concepts are not meant to be formal, and are laid out to distinguish between logs that are still processable regardless of the uncertainty, and logs where the uncertainty is too invasive to analyze them with existing process mining techniques.
The diagram in Figure 1 shows an overview of the main elements of process mining over uncertainty.
The schema shows some additional elements with respect to classical process mining: raw process data from information systems (containing implicit uncertainty) can be combined with domain knowledge provided by a process expert to obtain an uncertain event log, which contains explicit uncertainty. The data in an uncertain event log can be abstracted in a graph representation, which enables the inspection of its causes. Lastly, the graph representation also makes it possible to perform the tasks of process discovery and conformance checking on uncertain event data.
[Figure 1 diagram. Elements: Agents, Process, Raw data, Domain knowledge, Uncertain event log, Graph representation, Process model. Arrow labels: Records, Abstracts, Process discovery, Conformance checking.]
Fig. 1.
The overall schema for process mining over uncertain event data.
In this paper, we propose a taxonomy of the different types of explicit uncertainty in process mining, together with a formal, mathematical formulation. As an example of practical application, we will consider the case of conformance checking [14], and we will apply it to uncertain data by assessing the upper and lower bounds on the conformance score for possible values of the attributes in an uncertain trace.
The main driver behind this work is to provide the means to treat uncertainty as a relevant part of a process; thus, we aim not to filter it out but to model and explain it. In summary, there are two novel aspects regarding uncertain data that we intend to address in this work. The first novelty is the explicitness of uncertainty: we work with the underlying assumption that the actual value of the uncertain attribute, while not directly provided, is described formally. This is the case when meta-information about the uncertainty in the attribute is available, either deduced from the features of the information system(s) that record the logs or included in the event log itself. Note that, as opposed to all previous work on the topic, the fact that uncertainty is explicit in the data means that the concept of uncertain behavior is completely separated from the concept of infrequent behavior. The second novelty is the explicit modeling of uncertainty: we consider uncertainty part of the process. Instead of filtering or cleaning the log, we introduce the uncertainty perspective in process mining by extending the currently available techniques to incorporate it.
The rest of this paper is organized as follows. Section 2 proposes a taxonomy of the different possible types of uncertain process data. Section 3 contains the formal definitions needed to manage uncertainty. Section 4 presents the main contribution of this paper, a framework able to describe an array of types and classifications of uncertain behavior.
Section 5 describes a practical application of process mining over uncertain event data, the case of conformance checking through alignments. Section 6 shows experimental results on computing conformance checking scores for synthetic uncertain data, as well as a case of application on real-life data. Section 7 discusses previous and related work on the management of uncertain data and on the topic of conformance checking. Finally, Section 8 concludes the paper and discusses future work.
The goal of this section of the paper is to propose a categorization of the different types of uncertainty that can appear in process mining. In process management, a central concept is the distinction between the data perspective (the event log) and the behavioral perspective (the process model). The first one is a static representation of process instances, the second summarizes the behavior of a process. Both can be extended with a concept of explicit uncertainty: this concept also implies an extension of the process mining techniques that have currently been implemented.
In this paper, we will focus on uncertainty in event data, rather than applying the concept of uncertainty to models. Specifically, we will consider computing the conformance score of uncertain process data on classical models, extending the approach shown in [25]. An application of process discovery in the setting of uncertain event data has been presented in [26].
We can distinguish two different notions of uncertainty:
– Strong uncertainty: the possible values for the attributes are known, but the probability that the attribute will assume a certain instantiation is unknown or unobservable.
– Weak uncertainty: both the possible values of an attribute and their respective probabilities are known.
In the case of a discrete attribute, the strong notion of uncertainty consists of a set of possible values assumed by the attribute. In this case, the probability for each possible value is unknown.
Vice versa, in the weak uncertainty scenario we also have a discrete probability distribution defined on that set of values. In the case of a continuous attribute, the strong notion of uncertainty can be represented with an interval for the variable. Notice that an interval does not indicate a uniform distribution; there is no information on the likelihood of values in it. Vice versa, in the weak uncertainty scenario we also have a probability density function defined on a certain interval. Figure 2 summarizes these concepts. This leads to very simple representations of explicit uncertainty.
Fig. 2.
The four different types of uncertainty.
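The four combinations summarized in Figure 2 can be sketched as plain data containers. This is a minimal illustration under our own assumptions, not a representation prescribed by the paper: the class names `StrongDiscrete`, `WeakDiscrete`, `StrongContinuous` and `WeakContinuous` are hypothetical, and timestamps are simplified to floating-point numbers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class StrongDiscrete:
    """Strong uncertainty, discrete attribute: only the possible values are known."""
    values: FrozenSet[str]


@dataclass(frozen=True)
class WeakDiscrete:
    """Weak uncertainty, discrete attribute: possible values with probabilities."""
    distribution: Dict[str, float]

    def __post_init__(self):
        # A discrete probability distribution must sum to one.
        if abs(sum(self.distribution.values()) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")


@dataclass(frozen=True)
class StrongContinuous:
    """Strong uncertainty, continuous attribute: an interval, with no
    information on the likelihood of the values inside it."""
    low: float
    high: float

    def __post_init__(self):
        if self.low > self.high:
            raise ValueError("empty interval")


@dataclass(frozen=True)
class WeakContinuous:
    """Weak uncertainty, continuous attribute: a density over an interval."""
    low: float
    high: float
    density: Callable[[float], float]


# Hypothetical examples (values chosen for illustration only).
activity = StrongDiscrete(frozenset({"B", "C", "D"}))
weighted_activity = WeakDiscrete({"B": 0.7, "C": 0.3})
timestamp = StrongContinuous(6.0, 10.0)
weighted_timestamp = WeakContinuous(6.0, 10.0, lambda x: 0.25)  # uniform density
```

Keeping the strong and weak variants as separate types mirrors the distinction in the text: a `StrongContinuous` interval deliberately carries no density, so it cannot be mistaken for a uniform distribution.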
In this paper, we consider only the control flow and time perspective of a process, namely the attributes of the events that allow us to discover a process model. These are the unique identifier of a process instance (case ID), the timestamp (often represented by the distance from a fixed origin point, e.g. the Unix Epoch), and the activity identifier of an event. Case IDs and activities are values chosen from a finite set of possible values; they are discrete variables. Timestamps, instead, are represented by numbers and thus are continuous variables.
We will also describe an additional type of uncertainty, which lies at the event level rather than the attribute level:
– Indeterminate event: the event may not have taken place even though it was recorded in the event log. Indeterminate events are indicated with a ? symbol, while determinate (regular) events are marked with a ! symbol.
Examples of strongly and weakly uncertain traces are shown in Tables 1 and 2 respectively. Additionally, we present a time diagram of the trace in Table 1: this representation shows the time relationship between events in the trace on an absolute scale. This diagram is shown in Figure 3.
Lastly, we need to put forward an additional assumption which is important for our formalization and analysis. The probabilities related to uncertain events can in principle be dependent, since they are part of the same process where agents interact with the subject of a specific case, as well as with one another.
Table 1.
An example of a strongly uncertain trace. For the sake of clarity, the timestamp field only reports dates.
Case ID    Timestamp    Activity    Indet. event
{ID327, ID412}    {B, C, D}    !
ID327    [2011-12-06, 2011-12-10]    D    ?
ID327    2011-12-09    {A, C}    !
{ID327, ID412, ID573}
Table 2.
An example of a weakly uncertain trace. For the sake of clarity, the timestamp field only reports dates.
Case ID    Timestamp    Activity    Indet. event
{ID313:0.9, ID370:0.1}    {B:0.7, C:0.3}    !
ID313    N(2011-12-08, 2)    D    ?:0.5
ID313    2011-12-09    {A:0.2, C:0.8}    !
{ID313:0.4, ID370:0.6}
[Figure 3 diagram. Event labels along the time axis: A; B, C, D; D; A, C; E.]
Fig. 3.
Time diagram of the trace in Table 1. This diagram shows the time informationof an uncertain trace in an absolute scale. Note that some types of uncertainty (namely,indeterminate events and uncertainty on case IDs) are not depicted.
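The information conveyed by a time diagram of this kind can be sketched in code: when timestamps are intervals, two events have a known order only if their intervals do not overlap. The snippet below is an illustrative sketch under our own assumptions; the event identifiers and day numbers are hypothetical and are not the values of Table 1.

```python
# Hypothetical events: identifier -> (earliest possible day, latest possible day).
# A certain timestamp is an interval whose two ends coincide.
events = {
    "e1": (5, 5),
    "e2": (6, 10),
    "e3": (9, 9),
    "e4": (11, 11),
}


def certainly_before(a, b):
    """True if every possible timestamp of a precedes every possible timestamp of b."""
    return a[1] < b[0]


def order_relation(evts):
    """All pairs (x, y) such that x certainly happened before y."""
    return {
        (x, y)
        for x in evts
        for y in evts
        if x != y and certainly_before(evts[x], evts[y])
    }


relation = order_relation(events)
# e2 and e3 overlap in time, so neither (e2, e3) nor (e3, e2) is in the relation:
# their mutual order is unknown.
print(sorted(relation))
```

The resulting relation is irreflexive and transitive, which is why the formal treatment later in the paper relies on strict partial orders rather than on sequences.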
For the analysis presented in this paper, our assumption is that probabilities are independent. This enables a clearer and simpler formalization; we leave the case where there is indeed dependency between probabilities to future work.
The taxonomy presented in this section is summarized in Table 3. This table encodes all types of uncertainty illustrated here. Through this taxonomy we can indicate the types of uncertainty that might affect an uncertain event log.
Table 3.
Summary of the types of uncertainty that can affect a log over the attributes of its events. The last column provides an encoding for each type of uncertainty.
Attribute                Attribute type    Uncertainty type    Encoding
Event (indeterminacy)    Discrete          Weak                $[E]_W$
                                           Strong              $[E]_S$
Case                     Discrete          Weak                $[C]_W$
                                           Strong              $[C]_S$
Activity                 Discrete          Weak                $[A]_W$
                                           Strong              $[A]_S$
Timestamp                Continuous        Weak                $[T]_W$
                                           Strong              $[T]_S$
Other attribute          Discrete          Weak                $[ATD]_W$
                                           Strong              $[ATD]_S$
                         Continuous        Weak                $[ATC]_W$
                                           Strong              $[ATC]_S$
Several types of uncertainty can be combined to describe an uncertain event log. For example, an event log with strong uncertainty on events, activities and timestamps would be an $[E, A, T]_S$-type log. An uncertain log can also be characterized by different types of uncertainty on different attributes: a log with strong uncertainty on events and weak uncertainty on activities is an $[E]_S[A]_W$-type log.
In the next section, we will describe these different types of uncertainty in a mathematical framework that will, in turn, enable process mining analyses on uncertain event data.
Let us introduce some preliminary definitions in order to describe uncertainty in process mining in a formal way. These definitions will provide the means to represent the behavior contained in uncertain data, and enable process mining tasks such as process discovery and conformance checking.
Firstly, we will define some basic mathematical structures.
Definition 1 (Power Set).
The power set of a set $A$ is the set of all possible subsets of $A$, and is denoted with $\mathcal{P}(A)$. $\mathcal{P}_{NE}(A)$ denotes the set of all the non-empty subsets of $A$: $\mathcal{P}_{NE}(A) = \mathcal{P}(A) \setminus \{\emptyset\}$.
Definition 2 (Multiset). A multiset is an extension of the concept of set that keeps track of the cardinality of each element. $\mathcal{B}(A)$ is the set of all multisets over some set $A$. Multisets are denoted with square brackets, e.g. $b_1 = [\,]$ (the empty multiset), $b_2 = [a, a, b]$, $b_3 = [a, b, c]$, $b_4 = [a, b, c, a, a, b]$ are all multisets over $A = \{a, b, c\}$. In a multiset the order of representation of the elements is irrelevant, and multisets can also be denoted with the cardinality of their elements, e.g. $b_4 = [a, b, c, a, a, b] = [a^3, b^2, c]$. We denote with $b(x)$ the cardinality of element $x \in A$ in $b$, e.g. $b_4(a) = 3$, $b_4(c) = 1$, and $b_4(d) = 0$.
We can extend to multisets standard set operators such as membership (e.g. $a \in b_2$ and $c \notin b_2$), union (e.g. $b_2 \uplus b_3 = b_4$), difference (e.g. $b_4 \setminus b_3 = b_2$) and total cardinality (e.g. $|b_4| = 6$).
Definition 3 (Sequence, Subsequence and Permutation).
Given a set $X$, a finite sequence over $X$ of length $n$ is a function $s \in X^* : \{1, \ldots, n\} \to X$, and it is written as $s = \langle s_1, s_2, \ldots, s_n \rangle$. We denote with $\langle\,\rangle$ the empty sequence, the sequence with no elements and of length 0. Over the sequence $s$ we define $|s| = n$, $s[i] = s_i$ and $x \in s \Leftrightarrow \exists_{1 \le i \le n}\; x = s_i$. The concatenation between two sequences is denoted with $\langle s_1, s_2, \ldots, s_n \rangle \cdot \langle s'_1, s'_2, \ldots, s'_m \rangle = \langle s_1, s_2, \ldots, s_n, s'_1, s'_2, \ldots, s'_m \rangle$.
Given two sequences $s = \langle s_1, s_2, \ldots, s_n \rangle$ and $s' = \langle s'_1, s'_2, \ldots, s'_m \rangle$, $s'$ is a subsequence of $s$ if and only if there exists a sequence of strictly increasing natural numbers $\langle i_1, i_2, \ldots, i_m \rangle$ such that $\forall_{1 \le j \le m}\; s_{i_j} = s'_j$. We indicate this with $s' \subseteq s$.
A permutation of the set $X$ is a sequence $x_S$ that contains all elements of $X$ without duplicates: $x_S \in X^*$, every element of $X$ occurs in $x_S$, and for all $1 \le i \le |x_S|$ and for all $1 \le j \le |x_S|$, $x_S[i] = x_S[j] \Rightarrow i = j$. We denote with $S_X$ all such permutations of set $X$.
Definition 4 (Sequence Projection).
Let $X$ be a set and $Q \subseteq X$ one of its subsets. $\upharpoonright_Q : X^* \to Q^*$ is the sequence projection function and is defined recursively: $\langle\,\rangle\upharpoonright_Q = \langle\,\rangle$ and for $\sigma \in X^*$ and $x \in X$:
$$(\langle x \rangle \cdot \sigma)\upharpoonright_Q = \begin{cases} \sigma\upharpoonright_Q & \text{if } x \notin Q \\ \langle x \rangle \cdot \sigma\upharpoonright_Q & \text{if } x \in Q \end{cases}$$
For example, $\langle y, z, y \rangle\upharpoonright_{\{x, y\}} = \langle y, y \rangle$.
Definition 5 (Applying Functions to Sequences).
Let $f : X \not\to Y$ be a partial function. $f$ can be applied to sequences of $X$ using the following recursive definition: $f(\langle\,\rangle) = \langle\,\rangle$ and for $\sigma \in X^*$ and $x \in X$:
$$f(\langle x \rangle \cdot \sigma) = \begin{cases} f(\sigma) & \text{if } x \notin dom(f) \\ \langle f(x) \rangle \cdot f(\sigma) & \text{if } x \in dom(f) \end{cases}$$
Next, so as to manage the possible different orders between events in a trace with uncertain timestamps, we introduce formalisms to denote strict partial orders.
Definition 6 (Transitive Relation and Correct Evaluation Order).
Let $X$ be a set of objects and $R$ be a binary relation $R \subseteq X \times X$. $R$ is transitive if and only if for all $x, x', x'' \in X$ we have that $(x, x') \in R \wedge (x', x'') \in R \Rightarrow (x, x'') \in R$. A correct evaluation order is a permutation $s \in S_X$ of the elements of the set $X$ such that for all $1 \le i < j \le |s|$ we have that $(s[i], s[j]) \in R$.
Definition 7 (Strict Partial Order).
Let $S$ be a set of objects. Let $s, s' \in S$. A strict partial order $\prec$ over $S$ is a binary relation that satisfies the following properties:
– Irreflexivity: $s \prec s$ is false.
– Transitivity: see Definition 6.
– Asymmetry: $s \prec s'$ implies that $s' \prec s$ is false. Implied by irreflexivity and transitivity [19].
Definition 8 (Directed Graph). A directed graph $G$ is a tuple $(V, E)$ where $V$ is the set of vertices and $E \subseteq V \times V$ is the set of directed edges. The set $\mathcal{U}_G$ is the graph universe. A path in a directed graph $G = (V, E)$ is a sequence of vertices $p \in V^*$ such that for all $1 \le i \le |p| - 1$ we have that $(p_i, p_{i+1}) \in E$. We denote with $P_G$ the set of all such possible paths over the graph $G$. Given two vertices $v, v' \in V$, we denote with $p_G(v, v')$ the set of all paths beginning in $v$ and ending in $v'$: $p_G(v, v') = \{p \in P_G \mid p[1] = v \wedge p[|p|] = v'\}$. $v$ and $v'$ are connected (and $v'$ is reachable from $v$), denoted by $v \overset{G}{\mapsto} v'$, if and only if there exists a path between them in $G$: $p_G(v, v') \neq \emptyset$. Conversely, $v \overset{G}{\not\mapsto} v' \Leftrightarrow p_G(v, v') = \emptyset$. We omit the superscript $G$ if it is clear from the context. A directed graph $G$ is acyclic if there exists no path $p \in P_G$ satisfying $p[1] = p[|p|]$.
Definition 9 (Topological Sorting).
Let $G = (V, E) \in \mathcal{U}_G$ be an acyclic directed graph. A topological sorting [22] $o_G \in S_V$ is a permutation of the vertices of $G$ such that for all $1 \le i < j \le |o_G|$ we have that $o_G[j] \not\mapsto o_G[i]$. We denote with $O_G \subseteq S_V$ all such possible topological sortings over $G$.
Definition 10 (Transitive Reduction). A transitive reduction of a graph $G = (V, E) \in \mathcal{U}_G$ [9] is the function $\rho : \mathcal{U}_G \to \mathcal{U}_G$ such that for the graph $\rho(G) = (V, E_r)$ we have $E_r \subseteq E$ and every pair of vertices connected in $\rho(G)$ is not connected by any other path: for all $(v, v') \in E_r$, $p_G(v, v') = \{\langle v, v' \rangle\}$. $\rho(G)$ is the graph with the minimal number of edges that maintains the reachability between vertices of $G$. The transitive reduction of a directed acyclic graph always exists and is unique [9].
Let us now define the basic artifacts needed to perform process mining.
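Before turning to those artifacts, Definitions 9 and 10 can be illustrated on a small graph. The following is a brute-force sketch meant only for tiny examples, under our own assumptions: the vertex names and helper functions (`reachable`, `topological_sortings`, `transitive_reduction`) are hypothetical, and the edge-removal shortcut used for the reduction is valid for acyclic graphs only.

```python
from itertools import permutations


def reachable(edges, v, w):
    """True if a path with at least one edge leads from v to w."""
    stack, seen = [v], set()
    while stack:
        current = stack.pop()
        for (a, b) in edges:
            if a == current and b not in seen:
                if b == w:
                    return True
                seen.add(b)
                stack.append(b)
    return False


def topological_sortings(vertices, edges):
    """All permutations o such that o[j] never reaches o[i] for i < j."""
    return [
        o
        for o in permutations(sorted(vertices))
        if not any(
            reachable(edges, o[j], o[i])
            for i in range(len(o))
            for j in range(i + 1, len(o))
        )
    ]


def transitive_reduction(edges):
    """Keep an edge (v, w) only if no other path leads from v to w.
    This shortcut is correct for directed acyclic graphs."""
    return {(v, w) for (v, w) in edges if not reachable(edges - {(v, w)}, v, w)}


# Hypothetical acyclic graph over four vertices.
V = {"e1", "e2", "e3", "e4"}
E = {("e1", "e2"), ("e1", "e3"), ("e1", "e4"), ("e2", "e4"), ("e3", "e4")}

print(transitive_reduction(E))     # the edge (e1, e4) is implied by other paths
print(topological_sortings(V, E))  # e2 and e3 may appear in either order
```

Enumerating all permutations is exponential in the number of vertices; the sketch trades efficiency for a direct correspondence with the definitions.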
Definition 11 (Universes).
Let $\mathcal{U}_I$ be the set of all the event identifiers. Let $\mathcal{U}_C$ be the set of all the case ID identifiers. Let $\mathcal{U}_A$ be the set of all the activity identifiers. Let $\mathcal{U}_T$ be the totally ordered set of all the timestamp identifiers.
Definition 12 (Events and event logs).
Let us denote with $\mathcal{E}_C = \mathcal{U}_I \times \mathcal{U}_C \times \mathcal{U}_A \times \mathcal{U}_T$ the universe of certain events. A certain event log is a set of events $L_C \subseteq \mathcal{E}_C$ such that every event identifier in $L_C$ is unique.
Definition 13 (Simple certain traces and logs).
Let $\{(e_1, c_1, a_1, t_1), (e_2, c_2, a_2, t_2), \ldots, (e_n, c_n, a_n, t_n)\} \subseteq L_C$ be a set of certain events such that $c_1 = c_2 = \cdots = c_n$ and $t_1 < t_2 < \cdots < t_n$. A simple certain trace is the sequence of activities $\langle a_1, a_2, \ldots, a_n \rangle \in \mathcal{U}_A^*$ induced by such a set of events. $\mathcal{T} = \mathcal{U}_A^*$ denotes the universe of certain traces. $L \in \mathcal{B}(\mathcal{T})$ is a simple certain log. We will drop the qualifier “simple” if it is clear from the context.
As a preliminary application of process mining over uncertain event data, we will consider conformance checking. Starting from an event log and a process model, conformance checking verifies if the event data in the log conforms to the model, providing a diagnostic of the deviations. Conformance checking serves many purposes, such as checking if process instances follow a specific normative model, assessing if a certain execution log has been generated from a specific model, or verifying the quality of a process discovery technique.
The conformance checking algorithm that we are applying in this paper is based on alignments. Introduced by Adriansyah [6], conformance checking through alignments finds deviations between a trace and a Petri net model of a process by creating a correspondence between the sequence of activities executed in the trace and the firing of the transitions in the Petri net. The following definitions are partially from [2].
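Before the Petri net definitions, the construction of Definition 13 can be made concrete with a short sketch: events are grouped by case, sorted by timestamp, and projected onto their activities, and the resulting traces are collected in a multiset. The event tuples below are hypothetical, and `simple_log` is an illustrative helper name of our own.

```python
from collections import Counter, defaultdict

# Hypothetical certain events: (event id, case id, activity, timestamp).
certain_log = [
    ("e1", "c1", "a", 1), ("e2", "c1", "b", 2),
    ("e3", "c2", "a", 1), ("e4", "c2", "b", 5),
    ("e5", "c3", "b", 3), ("e6", "c3", "a", 4),
]


def simple_log(events):
    """Multiset of simple certain traces induced by a certain event log."""
    by_case = defaultdict(list)
    for _event_id, case, activity, timestamp in events:
        by_case[case].append((timestamp, activity))
    # Timestamps within a case are assumed distinct, as in Definition 13,
    # so sorting by timestamp yields a unique order.
    traces = (
        tuple(activity for _, activity in sorted(case_events))
        for case_events in by_case.values()
    )
    return Counter(traces)  # a Counter plays the role of a multiset


log = simple_log(certain_log)
print(log)
```

Two of the three hypothetical cases produce the same activity sequence, which is why a multiset of traces, rather than a set, is the natural container.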
Definition 14 (Petri Net).
A Petri net is a tuple $N = (P, T, F)$ with $P$ the set of places, $T$ the set of transitions, $P \cap T = \emptyset$, and $F \subseteq (P \times T) \cup (T \times P)$ the flow relation. A Petri net $N = (P, T, F)$ defines a directed graph $(V, E)$ with vertices $V = P \cup T$ and edges $E = F$. A marking $M \in \mathcal{B}(P)$ is a multiset of places. A marking defines the state of a Petri net, and indicates how many tokens each place contains. For any $x \in P \cup T$, $\overset{N}{\bullet}x = \{x' \mid (x', x) \in F\}$ denotes the set of input nodes and $x\overset{N}{\bullet} = \{x' \mid (x, x') \in F\}$ denotes the set of output nodes. We omit the superscript $N$ if it is clear from the context.
A transition $t \in T$ is enabled in marking $M$ of net $N$, denoted as $(N, M)[t\rangle$, if each of its input places $\bullet t$ contains at least one token. An enabled transition $t$ may fire, i.e., one token is removed from each of the input places $\bullet t$ and one token is produced for each of the output places $t\bullet$. Formally: $M' = (M \setminus \bullet t) \uplus t\bullet$ is the marking resulting from firing enabled transition $t$ in marking $M$ of Petri net $N$. $(N, M)[t\rangle(N, M')$ denotes that $t$ is enabled in $M$ and firing $t$ results in marking $M'$.
Let $\sigma_T = \langle t_1, t_2, \ldots, t_n \rangle \in T^*$ be a sequence of transitions. $(N, M)[\sigma_T\rangle(N, M')$ denotes that there is a sequence of markings $M_0, M_1, \ldots, M_n$ such that $M_0 = M$, $M_n = M'$, and $(N, M_i)[t_{i+1}\rangle(N, M_{i+1})$ for $0 \le i < n$. A marking $M'$ is reachable from $M$ if there exists a $\sigma_T$ such that $(N, M)[\sigma_T\rangle(N, M')$.
Definition 15 (Labeled Petri Net).
A labeled Petri net $N = (P, T, F, l)$ is a Petri net $(P, T, F)$ with labeling function $l : T \not\to \mathcal{U}_A$ where $\mathcal{U}_A$ is some universe of activity labels. Let $\sigma = \langle a_1, a_2, \ldots, a_n \rangle \in \mathcal{U}_A^*$ be a sequence of activities. $(N, M)[\sigma \triangleright (N, M')$ if and only if there is a sequence $\sigma_T \in T^*$ such that $(N, M)[\sigma_T\rangle(N, M')$ and $l(\sigma_T) = \sigma$.
If $t \notin dom(l)$, it is called invisible. To indicate invisible transitions, we use the placeholder symbol $\tau \notin \mathcal{U}_A$; for any invisible transition $t$ we define $l(t) = \tau$. An occurrence of visible transition $t \in dom(l)$ corresponds to observable activity $l(t)$.
Definition 16 (System Net).
A system net is a triplet $SN = (N, M_{init}, M_{final})$ where $N = (P, T, F, l)$ is a labeled Petri net, $M_{init} \in \mathcal{B}(P)$ is the initial marking, and $M_{final} \in \mathcal{B}(P)$ is the final marking. $\mathcal{U}_{SN}$ is the universe of system nets. Over a system net we define the following:
– $T_v(SN) = dom(l)$ is the set of visible transitions in $SN$,
– $A_v(SN) = rng(l)$ is the set of corresponding observable activities in $SN$,
– $T_v^u(SN) = \{t \in T_v(SN) \mid \forall_{t' \in T_v(SN)}\; l(t) = l(t') \Rightarrow t = t'\}$ is the set of unique visible transitions in $SN$ (i.e., there are no other transitions having the same visible label),
– $A_v^u(SN) = \{l(t) \mid t \in T_v^u(SN)\}$ is the set of corresponding unique observable activities in $SN$,
– $\phi(SN) = \{\sigma \mid (N, M_{init})[\sigma \triangleright (N, M_{final})\}$ is the set of visible traces starting in $M_{init}$ and ending in $M_{final}$, and
– $\phi_f(SN) = \{\sigma_T \mid (N, M_{init})[\sigma_T\rangle(N, M_{final})\}$ is the corresponding set of complete firing sequences.
Figure 4 shows a system net with initial and final markings $M_{init} = [start]$ and $M_{final} = [end]$. Given a system net, $\phi(SN)$ is the set of all possible visible activity sequences, i.e., the labels of complete firing sequences starting in $M_{init}$ and ending in $M_{final}$ projected onto the set of observable activities. Given the set of activity sequences $\phi(SN)$ obtainable via complete firing sequences on a certain system net, we can define a perfectly fitting event log as a set of traces whose activity projection is contained in $\phi(SN)$.
The task of conformance checking consists in comparing an event log and a model, in order to assess the deviations of event data with respect to the expected behavior of the process. This is usually done to verify if the process conforms to a de iure model designed by process experts, which describes how the process should ideally run. We will now describe a conformance checking technique, in order to extend it to the uncertain setting.
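As groundwork for that technique, the firing rule of Definition 14 and the notion of a complete firing sequence from Definition 16 can be sketched with markings stored as multisets. The two-transition net below is a hypothetical toy example, not the net of Figure 4, and the names `enabled`, `fire` and `replay` are our own.

```python
from collections import Counter

# Hypothetical system net: start -> t1 -> p1 -> t2 -> end.
pre = {"t1": ["start"], "t2": ["p1"]}   # input places of each transition
post = {"t1": ["p1"], "t2": ["end"]}    # output places of each transition
labels = {"t1": "a", "t2": "b"}         # labeling function; both transitions are visible
m_init, m_final = Counter(["start"]), Counter(["end"])


def enabled(marking, t):
    """A transition is enabled if each input place holds at least one token."""
    return all(marking[p] >= 1 for p in pre[t])


def fire(marking, t):
    """Return the marking (M minus the input places) plus the output places."""
    if not enabled(marking, t):
        raise ValueError(f"transition {t} is not enabled")
    new_marking = Counter(marking)
    new_marking.subtract(pre[t])   # remove one token per input place
    new_marking.update(post[t])    # add one token per output place
    return +new_marking            # unary plus drops places with zero tokens


def replay(marking, firing_sequence):
    """Fire the transitions in order and return the resulting marking."""
    for t in firing_sequence:
        marking = fire(marking, t)
    return marking


sequence = ["t1", "t2"]
is_complete = replay(m_init, sequence) == m_final
visible_trace = [labels[t] for t in sequence if t in labels]
print(is_complete, visible_trace)
```

A firing sequence is complete when replaying it from the initial marking ends exactly in the final marking; its visible trace is obtained by applying the labeling function and skipping unlabeled (invisible) transitions.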
Definition 17 (Perfectly Fitting Log).
Let $L \in \mathcal{B}(\mathcal{T})$ be a certain event log and let $SN = (N, M_{init}, M_{final}) \in \mathcal{U}_{SN}$ be a system net. $L$ is perfectly fitting $SN$ if and only if $\{\sigma \in L\} \subseteq \phi(SN)$.
The definitions described so far allow us to build alignments in order to compute the fitness of a trace on a given model. An alignment is a correspondence between a sequence of activities (extracted from the trace) and a sequence of transitions with the relative labels (fired in the model while replaying the trace). The first sequence indicates the “moves in the log” and the second indicates the “moves in the model”. If a move in the model cannot be mimicked by a move in the log, then a “$\gg$” (“no move”) appears in the top row; conversely, if a move in the log cannot be mimicked by a move in the model, then a “$\gg$” (“no move”) appears in the bottom row. “No moves” not corresponding to invisible transitions point to deviations between the model and the log. A move is a pair $(x, (y, t))$ where the first element refers to the log and the second element to the model. A “$\gg$” in the first element of the pair indicates a move on the model; in the second element, it indicates a move on the log.
Definition 18 (Legal Moves).
Let $L \in \mathcal{B}(\mathcal{T})$ be a certain event log, let $A \subseteq \mathcal{U}_A$ be the set of activity labels appearing in the event log, and let $SN = (N, M_{init}, M_{final}) \in \mathcal{U}_{SN}$ be a system net with $N = (P, T, F, l)$. $A_{LM} = \{(x, (x, t)) \mid x \in A \wedge t \in T \wedge l(t) = x\} \cup \{(\gg, (x, t)) \mid t \in T \wedge l(t) = x\} \cup \{(x, \gg) \mid x \in A\}$ is the set of legal moves.
An alignment is a sequence of legal moves such that, after removing all "≫" symbols, the top row corresponds to a trace in the log and the bottom row corresponds to a firing sequence starting in $M_{init}$ and ending in $M_{final}$. Notice that if $t \notin dom(l)$ is an invisible transition, the activation of t is indicated by a "≫" on the log in correspondence of t and the placeholder label τ. Hence, the middle row corresponds to a visible path when ignoring the τ steps. Figure 4 shows a system net with two examples of alignments, $\gamma_1$ of a fitting trace and $\gamma_2$ of a non-fitting trace.
Definition 19 (Alignment).
Let $\sigma \in L$ be a certain trace and $\sigma_T \in \phi_f(SN)$ a complete firing sequence of system net SN. An alignment of σ and $\sigma_T$ is a sequence $\gamma \in A_{LM}^*$ such that the projection on the first element (ignoring "≫") yields σ and the projection on the last element (ignoring "≫" and transition labels) yields $\sigma_T$.
A trace and a model can have several possible alignments. In order to select the most appropriate one, we introduce a function that associates a cost to undesired moves, i.e., the ones associated with deviations.
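As a rough illustration of Definitions 18 and 19, the following sketch encodes an alignment as a list of moves and recovers its two projections. The marker string `">>"`, the transition names `t1` to `t3`, and their labels are hypothetical choices made for this example; they are not taken from the model in Figure 4.

```python
NO_MOVE = ">>"  # stands in for the "no move" symbol

# Hypothetical labeling function l of a tiny model (an assumption, not Figure 4).
LABELS = {"t1": "a", "t2": "b", "t3": "c"}


def is_legal(move, labels=LABELS):
    """Check that a move has one of the three shapes of Definition 18."""
    x, model_part = move
    if model_part == NO_MOVE:      # move on log only: (x, >>)
        return x != NO_MOVE
    y, t = model_part
    if labels.get(t) != y:         # the label must match the transition
        return False
    return x == NO_MOVE or x == y  # move on model only, or synchronous move


def log_projection(alignment):
    """First elements of the moves, ignoring 'no move' symbols."""
    return [x for x, _ in alignment if x != NO_MOVE]


def model_projection(alignment):
    """Transitions of the moves, ignoring 'no move' symbols and labels."""
    return [m[1] for _, m in alignment if m != NO_MOVE]


# Trace <a, b, c> aligned with the (assumed) firing sequence <t1, t3>:
# activity b is a move on the log only.
gamma = [("a", ("a", "t1")), ("b", NO_MOVE), ("c", ("c", "t3"))]
```

Here `log_projection(gamma)` gives back the trace and `model_projection(gamma)` gives back the firing sequence, which is exactly the condition required of an alignment.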
Definition 20 (Cost of Alignment).
Cost function $\delta : A_{LM} \to \mathbb{N}$ assigns costs to legal moves. The cost of an alignment $\gamma \in A_{LM}^*$ is the sum of all costs: $\delta(\gamma) = \sum_{(x,y) \in \gamma} \delta(x, y)$.
Moves where log and model agree have no costs, i.e., $\delta(x, (x, t)) = 0$ for all $x \in A$. Moves on model only have no costs if the transition is invisible, i.e., $\delta(\gg, (\tau, t)) = 0$ if $l(t) = \tau$. $\delta(\gg, (x, t)) > 0$ is the cost when the model makes an "x move" without a corresponding move of the log (assuming $l(t) = x \neq \tau$). $\delta(x, \gg) > 0$ is the cost for an "x move" only on the log. In this paper, we often use a standard cost function $\delta_S$ that assigns unit costs: $\delta_S(x, (x, t)) = 0$, $\delta_S(\gg, (\tau, t)) = 0$, and $\delta_S(\gg, (x, t)) = \delta_S(x, \gg) = 1$ for all $x \in A$.
Definition 21 (Optimal Alignment).
Let $L \in \mathcal{B}(\mathcal{T})$ be a certain event log and let $SN \in \mathcal{U}_{SN}$ be a system net with $\phi(SN) \neq \emptyset$.
– For $\sigma \in L$, we define: $\Gamma_{\sigma,SN} = \{\gamma \in A_{LM}^* \mid \exists \sigma_T \in \phi_f(SN) : \gamma \text{ is an alignment of } \sigma \text{ and } \sigma_T\}$.
– An alignment $\gamma \in \Gamma_{\sigma,SN}$ is optimal for trace $\sigma \in L$ and system net SN if for any $\gamma' \in \Gamma_{\sigma,SN}$: $\delta(\gamma') \geq \delta(\gamma)$.
– $\lambda_{SN} : \mathcal{T} \to A_{LM}^*$ is a deterministic mapping that assigns any trace σ to an optimal alignment, i.e., $\lambda_{SN}(\sigma) \in \Gamma_{\sigma,SN}$ and $\lambda_{SN}(\sigma)$ is optimal.
– $costs(L, SN, \delta) = \sum_{\sigma \in L} \delta(\lambda_{SN}(\sigma))$ are the misalignment costs of the whole event log.
$\sigma \in L$ is a (perfectly) fitting trace for the system net SN if and only if $\delta(\lambda_{SN}(\sigma)) = 0$. L is a (perfectly) fitting event log for the system net SN if and only if $costs(L, SN, \delta) = 0$.
Fig. 4.
Example of alignments on a system net. The alignment $\gamma_1$ shows that the trace ⟨a, d, b, e, h⟩ is perfectly fitting the net. The alignment $\gamma_2$ shows that the trace ⟨a, b, d, b, e, h⟩ is misaligned with the net in one point, indicated by "≫". Partially from [3].
The technique to compute the optimal alignment [6] is as follows. Firstly, an event net is created: a sequence-structured system net able to replay only the trace to align. The transitions in the event net have labels corresponding to the activities in the trace. Then, a product net is computed. A product net is the union of the event net and the model, with synchronous transitions added. Each of these additional transitions pairs a transition in the event net with a transition in the process model that has the same label, and is connected with arcs from the input places and to the output places of both paired transitions. The product net is able to represent moves on log, moves on model and synchronous moves by means of firing transitions: the transitions of the event net correspond to moves on log, the transitions of the process model correspond to moves on model, and the added synchronous transitions correspond to synchronous moves. The unions of the initial and of the final markings of the event net and the process model constitute, respectively, the initial and final marking of the product net, while every complete firing sequence on the product net corresponds to a possible alignment. Lastly, the product net is translated to a state space, and a state space exploration via the A* algorithm is performed in order to find the complete firing sequence that yields the lowest cost.
Let us formally define the construction of the event net and the product net:
Definition 22 (Event Net).
Let $\sigma \in \mathcal{T}$ be a certain trace. The event net $en : \mathcal{T} \to \mathcal{U}_{SN}$ of σ is a system net $en(\sigma) = (P, T, F, l, M_{init}, M_{final})$ such that:
– $P = \{p_i \mid 1 \leq i \leq |\sigma| + 1\}$,
– $T = \{t_i \mid 1 \leq i \leq |\sigma|\}$,
– $F = \bigcup_{1 \leq i \leq |\sigma|} \{(p_i, t_i), (t_i, p_{i+1})\}$,
– $l : T \to \mathcal{U}_A$ such that for all $1 \leq i \leq |\sigma|$, $l(t_i) = \sigma[i]$,
– $M_{init} = [p_1]$,
– $M_{final} = [p_{|P|}]$.
Note that the labeling function l of an event net is a total function: no invisible transitions are contained in an event net, since for each event we generate a transition labeled with the corresponding activity label.
Definition 23 (Product of two Petri Nets [36]).
Let $S_1 = (P_1, T_1, F_1, l_1, M_{init_1}, M_{final_1})$ and $S_2 = (P_2, T_2, F_2, l_2, M_{init_2}, M_{final_2})$ be two system nets. The product net of $S_1$ and $S_2$ is the system net $S = S_1 \otimes S_2 = (P, T, F, l, M_{init}, M_{final})$ such that:
– $P = P_1 \cup P_2$,
– $T \subseteq (T_1 \cup \{\gg\}) \times (T_2 \cup \{\gg\})$ such that $T = \{(t_1, \gg) \mid t_1 \in T_1\} \cup \{(\gg, t_2) \mid t_2 \in T_2\} \cup \{(t_1, t_2) \in (T_1 \times T_2) \mid l_1(t_1) = l_2(t_2) \neq \tau\}$,
– $F \subseteq (P \times T) \cup (T \times P)$ such that
  $F = \{(p_1, (t_1, \gg)) \mid p_1 \in P_1 \wedge t_1 \in T_1 \wedge (p_1, t_1) \in F_1\}$
  $\cup\ \{((t_1, \gg), p_1) \mid t_1 \in T_1 \wedge p_1 \in P_1 \wedge (t_1, p_1) \in F_1\}$
  $\cup\ \{(p_2, (\gg, t_2)) \mid p_2 \in P_2 \wedge t_2 \in T_2 \wedge (p_2, t_2) \in F_2\}$
  $\cup\ \{((\gg, t_2), p_2) \mid t_2 \in T_2 \wedge p_2 \in P_2 \wedge (t_2, p_2) \in F_2\}$
  $\cup\ \{(p_1, (t_1, t_2)) \mid p_1 \in P_1 \wedge (t_1, t_2) \in T \cap (T_1 \times T_2) \wedge (p_1, t_1) \in F_1\}$
  $\cup\ \{(p_2, (t_1, t_2)) \mid p_2 \in P_2 \wedge (t_1, t_2) \in T \cap (T_1 \times T_2) \wedge (p_2, t_2) \in F_2\}$
  $\cup\ \{((t_1, t_2), p_1) \mid p_1 \in P_1 \wedge (t_1, t_2) \in T \cap (T_1 \times T_2) \wedge (t_1, p_1) \in F_1\}$
  $\cup\ \{((t_1, t_2), p_2) \mid p_2 \in P_2 \wedge (t_1, t_2) \in T \cap (T_1 \times T_2) \wedge (t_2, p_2) \in F_2\}$,
– $l : T \to \mathcal{U}_A$ such that for all $(t_1, t_2) \in T$, $l((t_1, t_2)) = l_1(t_1)$ if $t_2 = \gg$, $l((t_1, t_2)) = l_2(t_2)$ if $t_1 = \gg$, and $l((t_1, t_2)) = l_1(t_1)$ otherwise,
– $M_{init} = M_{init_1} \uplus M_{init_2}$,
– $M_{final} = M_{final_1} \uplus M_{final_2}$.
In this section, we will extend the definitions of event, trace, and event log to the uncertain case. Let us first define the identifiers necessary to express event indeterminacy.
Definition 24 (Determinate and indeterminate event qualifiers).
Let $\mathcal{U}_O = \{!, ?\}$, where the "!" symbol denotes determinate events, and the "?" symbol denotes indeterminate events.
For strong uncertainty, attribute values are replaced by a set of possible values. In the case of weak uncertainty, a continuous function f provides the probability density for the combinations of attribute values in the uncertain event. Notice that the total mass of probabilities described by f might be lower than 1: this allows us to represent the case of an indeterminate event.
Definition 25 (Uncertain events).
Let $\mathcal{E}_S = \mathcal{U}_I \times \mathcal{P}_{NE}(\mathcal{U}_C) \times \mathcal{P}_{NE}(\mathcal{U}_A) \times \mathcal{P}_{NE}(\mathcal{U}_T) \times \mathcal{U}_O$ denote the universe of strongly uncertain events. $\mathcal{E}_W = \{(e_i, f) \in \mathcal{U}_I \times ((\mathcal{U}_C \times \mathcal{U}_A \times \mathcal{U}_T) \not\to [0, 1]) \mid \sum_{(c,a,t) \in dom(f)} f(c, a, t) \leq 1\}$ is the universe of weakly uncertain events.
The probability of a weakly uncertain event having been recorded but not having happened in reality is equal to $1 - \sum_{(c,a,t) \in dom(f)} f(c, a, t)$. We assume here that $dom(f)$ is finite. It is easy to generalize to the infinite case by employing an integral.
Now that the definitions of strongly and weakly uncertain events are given, let us aggregate them in uncertain event logs.
Definition 26 (Uncertain event logs). A strongly uncertain event log is a set of events $L_S \subseteq \mathcal{E}_S$ such that every event identifier in $L_S$ is unique. A weakly uncertain event log is a set of events $L_W \subseteq \mathcal{E}_W$ such that every event identifier in $L_W$ is unique.
For a strongly uncertain event $e = (e_i, c_s, a_s, t_s, o) \in L_S$ we define the following projection functions: $\pi_c^{L_S}(e) = c_s \in \mathcal{P}_{NE}(\mathcal{U}_C)$, $\pi_a^{L_S}(e) = a_s \in \mathcal{P}_{NE}(\mathcal{U}_A)$, $\pi_t^{L_S}(e) = t_s \in \mathcal{P}_{NE}(\mathcal{U}_T)$ and $\pi_o^{L_S}(e) = o \in \mathcal{U}_O$.
A weakly uncertain event log $L_W \subseteq \mathcal{E}_W$ has a corresponding strongly uncertain event log $\overline{L_W} = L_S \subseteq \mathcal{E}_S$ such that
$L_S = \{(e_i, c_s, a_s, t_s, o) \in \mathcal{E}_S \mid \exists (e_i', f) \in L_W : e_i = e_i'$
  $\wedge\ c_s = \{c \in \mathcal{U}_C \mid \exists a, t : (c, a, t) \in dom(f) \wedge f(c, a, t) > 0\}$
  $\wedge\ a_s = \{a \in \mathcal{U}_A \mid \exists c, t : (c, a, t) \in dom(f) \wedge f(c, a, t) > 0\}$
  $\wedge\ t_s = \{t \in \mathcal{U}_T \mid \exists c, a : (c, a, t) \in dom(f) \wedge f(c, a, t) > 0\}$
  $\wedge\ (o = {!} \Leftrightarrow \sum_{(c,a,t) \in dom(f)} f(c, a, t) = 1)$
  $\wedge\ (o = {?} \Leftrightarrow \sum_{(c,a,t) \in dom(f)} f(c, a, t) < 1)\}$.
Notice that representing the density of probability for combinations of values of case ID, time and activity with a single function f is an approximation that assumes probabilistic independence between event attributes.
Definition 27 (Realization of an event log). $L_C \subseteq \mathcal{E}_C$ is a realization of $L_S \subseteq \mathcal{E}_S$ if and only if:
– For all $(e_i, c, a, t) \in L_C$ there is a distinct $(e_i', c_s, a_s, t_s, o) \in L_S$ such that $e_i = e_i'$, $c \in c_s$, $a \in a_s$ and $t \in t_s$;
– For all $(e_i, c_s, a_s, t_s, o) \in L_S$ with $o = {!}$ there is a distinct $(e_i', c, a, t) \in L_C$ such that $e_i = e_i'$, $c \in c_s$, $a \in a_s$ and $t \in t_s$.
$\mathcal{R}_L(L_S)$ is the set of all such realizations of the log $L_S$.
Note that these definitions allow us to transform a weakly uncertain log into a strongly uncertain one, and a strongly uncertain one into a set of certain logs.
In this paper we focus on three types of uncertainty:
– Strong uncertainty on the activity;
– Strong uncertainty on the timestamp;
– Strong uncertainty on indeterminate events.
All three can happen concurrently. Following the taxonomy presented in Section 2, this setting corresponds to a $[E, A, T]_S$-type log. It is worth noting that the specific case of uncertainty on the case ID causes a problem: since an event can have many possible case IDs, it can belong to different traces. In data formats where the events are already aggregated into traces, such as the very common XES standard, this means that the information related to a trace can be non-local to the trace itself, and can be stored in some other points of the log. We will focus on the problem of uncertainty on the case ID attribute in future work.
Firstly, we will lay down some simplified notation in order to model the problem at hand in a more compact way.
Definition 28 (Simple uncertain events, traces and logs).
Let $e_i \in \mathcal{U}_I$, $a_s \in \mathcal{P}_{NE}(\mathcal{U}_A)$, $t_{min} \in \mathcal{U}_T$, $t_{max} \in \mathcal{U}_T$ and $o \in \mathcal{U}_O$ such that $t_{min} \leq t_{max}$. $e_{SU} = (e_i, a_s, t_{min}, t_{max}, o)$ is a simple uncertain event. Let us denote with $\mathcal{E}_{SU} \subseteq \mathcal{U}_I \times \mathcal{P}_{NE}(\mathcal{U}_A) \times \mathcal{U}_T \times \mathcal{U}_T \times \mathcal{U}_O$ the universe of all simple uncertain events. $\sigma_U \subseteq \mathcal{E}_{SU}$ is a simple uncertain trace if all the event identifiers in $\sigma_U$ are unique. $\mathcal{T}_U$ denotes the universe of simple uncertain traces. $L_U \in \mathcal{P}(\mathcal{T}_U)$ is a simple uncertain log if all the event identifiers in $L_U$ are unique. For $\sigma_U \in L_U$ and $e_{SU} = (e_i, a_s, t_{min}, t_{max}, o) \in \sigma_U$ we define the following projection functions: $\pi_a^{L_U}(e_{SU}) = a_s \in \mathcal{P}_{NE}(\mathcal{U}_A)$, $\pi_{t_{min}}^{L_U}(e_{SU}) = t_{min} \in \mathcal{U}_T$, $\pi_{t_{max}}^{L_U}(e_{SU}) = t_{max} \in \mathcal{U}_T$ and $\pi_o^{L_U}(e_{SU}) = o \in \mathcal{U}_O$.
In a simple uncertain event $e_{SU} = (e_i, a_s, t_{min}, t_{max}, o)$, the true activity label of the event is one of the labels contained in the set $a_s$, the true timestamp is one of the values contained in the closed interval $[t_{min}, t_{max}]$, while the indeterminacy symbol o indicates whether the event has certainly occurred, or whether it is possible that it did not occur even though it has been recorded in an event log.
Simple uncertain events are best illustrated with a running example. Let us consider the following process instance, a simplified version of anomalies that actually occur in processes of the healthcare domain. An elderly patient enrolls in a clinical trial for an experimental treatment against myeloproliferative neoplasms, a class of blood cancers. The enrollment in this trial includes a lab exam and a visit with a specialist; then, the treatment can begin. The lab exam, performed on the 8th of July, finds a low level of platelets in the blood of the patient, a condition known as thrombocytopenia (TP).
At the visit, on the 10th of July, the patient self-reports an episode of night sweats on the night of the 5th of July, prior to the lab exam: the medic notes this, but also hypothesizes that it might not be a symptom, since it can be caused not by the condition but by external factors (such as very warm weather). The medic also reads the medical records of the patient and sees that, shortly prior to the lab exam, the patient was undergoing a heparin treatment (a blood-thinning medication) to prevent blood clots. The thrombocytopenia found with the lab exam can then be primary (caused by the blood cancer) or secondary (caused by other factors, such as a drug). Finally, the medic finds an enlargement of the spleen in the patient (splenomegaly). It is unclear when this condition developed: it might have appeared at any moment prior to that point. The medic decides to admit the patient to the clinical trial, starting on the 12th of July. These events are collected and recorded in the information system of the hospital in the trace shown in Table 4. For readability, the timestamp field only indicates the day of the month. This trace includes all types of uncertainty contained in a $[E, A, T]_S$-type log, the setting we are considering for the application of conformance checking.
Table 4.
The uncertain trace of an instance of a healthcare process used as a running example. For the sake of clarity, we have further simplified the notation in the timestamp column by showing only the day of the month.

Case ID   Event ID   Timestamp   Activity         Indet. event
ID192     e1         5           NightSweats      ?
ID192     e2         8           {PrTP, SecTP}    !
ID192     e3         [4, 10]     Splenomeg        !
ID192     e4         12          Adm              !

In the notation of Definition 28, the trace $\sigma_U$ in Table 4 is denoted as:
$\sigma_U = \{(e_1, \{NightSweats\}, 5, 5, ?), (e_2, \{PrTP, SecTP\}, 8, 8, !), (e_3, \{Splenomeg\}, 4, 10, !), (e_4, \{Adm\}, 12, 12, !)\}$.
We can also draw the time diagram of this example of uncertain trace, which can be seen in Figure 5.

Fig. 5. Time diagram of the trace in Table 4.
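The simple uncertain trace of the running example can be written down as a small data structure. The sketch below is one possible encoding, not the one used by the authors: the class name, the field names and the helper `make_trace` are hypothetical, and timestamps are reduced to the day of the month as in Table 4, with certain timestamps stored as degenerate intervals.

```python
from typing import FrozenSet, NamedTuple


class SimpleUncertainEvent(NamedTuple):
    event_id: str
    activities: FrozenSet[str]  # a_s: non-empty set of possible labels
    t_min: int                  # lower end of the timestamp interval
    t_max: int                  # upper end of the timestamp interval
    indeterminate: bool         # True encodes "?", False encodes "!"


def make_trace(events):
    """Build a simple uncertain trace, checking the side conditions."""
    ids = [e.event_id for e in events]
    if len(ids) != len(set(ids)):
        raise ValueError("event identifiers must be unique")
    for e in events:
        if not e.activities or e.t_min > e.t_max:
            raise ValueError(f"malformed event {e.event_id}")
    return frozenset(events)


# The trace of Table 4 (days of the month taken from the narrative).
SIGMA_U = make_trace([
    SimpleUncertainEvent("e1", frozenset({"NightSweats"}), 5, 5, True),
    SimpleUncertainEvent("e2", frozenset({"PrTP", "SecTP"}), 8, 8, False),
    SimpleUncertainEvent("e3", frozenset({"Splenomeg"}), 4, 10, False),
    SimpleUncertainEvent("e4", frozenset({"Adm"}), 12, 12, False),
])
```

A trace is stored as a set rather than a list because, at this stage, the events carry no total order; the order is only partially determined by the timestamp intervals.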
In the remainder of the paper, when defining simple uncertain traces and events, we always assume that these belong to a corresponding simple uncertain log. Thus, for simplicity, we will omit the qualifier "$L_U$" when denoting the corresponding projection functions.
These simplified traces and logs can be related to the more general framework described in the previous section through the following transformation: let $L_S \subseteq \mathcal{E}_S$ be a strongly uncertain log and let $g : \mathcal{U}_I \not\to \mathcal{U}_C$ be a function mapping event identifiers onto cases such that $dom(g) = \{e_i \mid (e_i, c_s, a_s, t_s, o) \in L_S\}$ and for all $(e_i, c_s, a_s, t_s, o) \in L_S$, $g(e_i) \in c_s$. Thus, for $c \in rng(g)$, $g^{-1}(c) = \{e_i \in \mathcal{U}_I \mid g(e_i) = c\}$. The simple uncertain event log defined by g on $L_S$ is given as $L_U = \{\{(e_i, \pi_a^{L_S}(e), \min(\pi_t^{L_S}(e)), \max(\pi_t^{L_S}(e)), \pi_o^{L_S}(e)) \mid e_i \in g^{-1}(c) \wedge \pi_i^{L_S}(e) = e_i\} \mid c \in rng(g)\}$.
In order to work more easily with timestamps in simple uncertain events, let us frame their time relationship as a strict partial order.
Definition 29 (Strict partial order over simple uncertain events).
Let $e, e' \in \mathcal{E}_{SU}$ be two simple uncertain events. $\prec_E$ is a strict partial order defined on the universe of simple uncertain events $\mathcal{E}_{SU}$ as: $e \prec_E e' \Leftrightarrow \pi_{t_{max}}(e) < \pi_{t_{min}}(e')$.
Proposition 1 ($\prec_E$ is a strict partial order).
Proof. All properties characterizing strict partial orders are fulfilled by $\prec_E$. For all $e, e', e'' \in \mathcal{E}_{SU}$ we have:
– Irreflexivity: this property is always verified, since $\pi_{t_{max}}(e) < \pi_{t_{min}}(e)$ is false (see Definition 28).
– Transitivity: if $e \prec_E e'$ and $e' \prec_E e''$, then $\pi_{t_{max}}(e) < \pi_{t_{min}}(e') \leq \pi_{t_{max}}(e') < \pi_{t_{min}}(e'')$; since $\mathcal{U}_T$ is totally ordered, we have that $\pi_{t_{max}}(e) < \pi_{t_{min}}(e'')$ and this property is always verified. □
Lemma 1 (Incomparable events share possible timestamp values).
Let $e, e' \in \mathcal{E}_{SU}$ be two simple uncertain events. e and e' are incomparable with respect to the strict partial order $\prec_E$ (i.e., neither $e \prec_E e'$ nor $e' \prec_E e$ is true) if and only if e and e' share some possible values of their timestamp.
Proof. (⇒) From Definition 29, it follows that two events $e, e' \in \mathcal{E}_{SU}$ are comparable if and only if either $\pi_{t_{max}}(e) < \pi_{t_{min}}(e')$ or $\pi_{t_{max}}(e') < \pi_{t_{min}}(e)$. If both are false, then $\pi_{t_{min}}(e') \leq \pi_{t_{max}}(e)$ and $\pi_{t_{min}}(e) \leq \pi_{t_{max}}(e')$. If we assume that $\pi_{t_{min}}(e) \leq \pi_{t_{min}}(e')$ then $\pi_{t_{min}}(e) \leq \pi_{t_{min}}(e') \leq \pi_{t_{max}}(e)$, while if $\pi_{t_{min}}(e) > \pi_{t_{min}}(e')$ then $\pi_{t_{min}}(e') < \pi_{t_{min}}(e) \leq \pi_{t_{max}}(e')$. In both cases, there are values common to both uncertain timestamps.
(⇐) If the two events share timestamp values, it follows that at least one of the extremes of one event is encompassed by the extremes of the other. Assume that e encompasses at least one of the extremes of e' (the other case is symmetric): then either $\pi_{t_{min}}(e) \leq \pi_{t_{min}}(e') \leq \pi_{t_{max}}(e)$ or $\pi_{t_{min}}(e) \leq \pi_{t_{max}}(e') \leq \pi_{t_{max}}(e)$. In the first case, considering that $\mathcal{U}_T$ is totally ordered and that $\pi_{t_{min}}(e') \leq \pi_{t_{max}}(e')$, we have that both $\pi_{t_{min}}(e') \leq \pi_{t_{max}}(e)$ and $\pi_{t_{min}}(e) \leq \pi_{t_{max}}(e')$ are true, and e and e' are incomparable. The second case is proved analogously. □
Definition 30 (Realizations of simple uncertain traces).
Let $\sigma_U \in \mathcal{T}_U$ be a simple uncertain trace. An order-realization $\sigma_O \in \mathcal{S}_{\sigma_U}$ is a permutation of the events in $\sigma_U$ such that for all $1 \leq i < j \leq |\sigma_O|$ we have that $\sigma_O[j] \nprec_E \sigma_O[i]$, i.e., $\sigma_O$ is a correct evaluation order for $\sigma_U$ over $\prec_E$, and the (total) order in which events are sorted in $\sigma_O$ is a linear extension of the strict partial order $\prec_E$. We denote with $\mathcal{R}_O(\sigma_U)$ the set of all such order-realizations of the trace $\sigma_U$.
Given an order-realization $\sigma_O \in \mathcal{R}_O(\sigma_U)$, the sequence $\sigma = \langle a_1, a_2, \dots, a_n \rangle \in \mathcal{U}_A^*$ is a realization of $\sigma_O$ if there exists a total function $f : \{1, 2, \dots, n\} \to \sigma_O$ such that:
– For all $1 \leq i \leq n$, $a_i \in \pi_a(f(i))$,
– $\langle f(1), f(2), \dots, f(n) \rangle$ is a subsequence of $\sigma_O$,
– For all $e \in \sigma_O$ with $\pi_o(e) = {!}$ there exists $1 \leq i \leq n$ such that $f(i) = e$.
We denote with $\mathcal{R}'(\sigma_O) \subseteq \mathcal{U}_A^*$ the set of all such realizations of the order-realization $\sigma_O$. We denote with $\mathcal{R}(\sigma_U) \subseteq \mathcal{U}_A^*$ the union of the realizations obtainable from all the order-realizations of $\sigma_U$: $\mathcal{R}(\sigma_U) = \bigcup_{\sigma_O \in \mathcal{R}_O(\sigma_U)} \mathcal{R}'(\sigma_O)$.
Let us see some examples of realizations of uncertain traces. Let $\sigma_U$ be the uncertain trace shown in Table 4. We then have that $\sigma_U$ has three order-realizations:
$\mathcal{R}_O(\sigma_U) = \{\langle e_1, e_2, e_3, e_4 \rangle, \langle e_1, e_3, e_2, e_4 \rangle, \langle e_3, e_1, e_2, e_4 \rangle\}$
We can then compute the realizations of one of the order-realizations of $\sigma_U$:
$\mathcal{R}'(\langle e_1, e_2, e_3, e_4 \rangle) = \{\langle NightSweats, PrTP, Splenomeg, Adm \rangle, \langle NightSweats, SecTP, Splenomeg, Adm \rangle, \langle PrTP, Splenomeg, Adm \rangle, \langle SecTP, Splenomeg, Adm \rangle\}$
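The enumeration in this example can be reproduced with a few lines of code. The following is a minimal sketch of Definition 30 by exhaustive search; the tuple layout and the function names are assumptions made for the illustration, and the timestamps are the days of the month from the running example.

```python
from itertools import permutations, product

# (event id, possible activities, t_min, t_max, indeterminate?)
TRACE = [
    ("e1", {"NightSweats"}, 5, 5, True),
    ("e2", {"PrTP", "SecTP"}, 8, 8, False),
    ("e3", {"Splenomeg"}, 4, 10, False),
    ("e4", {"Adm"}, 12, 12, False),
]


def precedes(e, f):
    """e precedes f in the strict partial order: t_max(e) < t_min(f)."""
    return e[3] < f[2]


def order_realizations(trace):
    """All permutations in which no later event precedes an earlier one."""
    n = len(trace)
    return [
        perm for perm in permutations(trace)
        if not any(precedes(perm[j], perm[i])
                   for i in range(n) for j in range(i + 1, n))
    ]


def realizations_of_order(order):
    """Pick one label per event; indeterminate events may also be skipped."""
    options = [sorted(acts) + ([None] if indet else [])
               for _, acts, _, _, indet in order]
    return {tuple(a for a in combo if a is not None)
            for combo in product(*options)}


def all_realizations(trace):
    """Union of the realizations of every order-realization."""
    result = set()
    for order in order_realizations(trace):
        result |= realizations_of_order(order)
    return result
```

Enumerating all permutations is factorial in the trace length, so this only serves to check small examples by hand; the same order-realization can also produce realizations that coincide with those of another one once an indeterminate event is skipped.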
Simple uncertain traces and logs carry less information than their certaincounterparts. Nevertheless, it is possible to extend existing process mining algo-rithms to extract the information in a simple uncertain log to design a processmodel that describes its possible behavior, or verify that it conforms to a givennormative model.
Depending on the possible values for $a_s$, $t_{min}$, $t_{max}$, and o, there are multiple possible realizations of a trace. This means that, given a model, a simple uncertain trace could be fitting for certain realizations, but non-fitting for others. The question we are interested in answering is: given a simple uncertain trace and a Petri net process model, is it possible to find an upper and lower bound for the conformance score? Usually we are interested in the optimal alignments (the ones with the minimal cost). However, we are now interested in the minimum and maximum cost of optimal alignments over the realization set of a simple uncertain trace.
Definition 31 (Upper and Lower Bound on Alignment Cost for a Trace).
Let $\sigma_U \in \mathcal{T}_U$ be a simple uncertain trace, and let $SN \in \mathcal{U}_{SN}$ be a system net. The upper bound for the alignment cost is a function $\delta_{max} : \mathcal{T}_U \to \mathbb{N}$ such that $\delta_{max}(\sigma_U) = \max_{\sigma \in \mathcal{R}(\sigma_U)} \delta(\lambda_{SN}(\sigma))$. The lower bound for the alignment cost is a function $\delta_{min} : \mathcal{T}_U \to \mathbb{N}$ such that $\delta_{min}(\sigma_U) = \min_{\sigma \in \mathcal{R}(\sigma_U)} \delta(\lambda_{SN}(\sigma))$.
A simple way to compute the upper and lower bounds for the cost of any uncertain trace is a brute-force approach: enumerating the possible realizations of the trace, then computing the cost of an optimal alignment for each realization, and picking the minimum and maximum as bounds. We now present a technique which improves the performance of calculating the lower bound for conformance cost with respect to the brute-force method.
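The brute-force approach can be sketched as follows. Two simplifications are assumed here and are not part of the paper: the model is given directly as a finite set of visible traces (a hypothetical `MODEL_LANGUAGE` standing in for $\phi(SN)$), and the optimal alignment cost under the standard cost function $\delta_S$ is computed as an insertion/deletion edit distance instead of an A* search on a product net. For a finite language the two coincide, since synchronous and invisible moves cost 0 and every other move costs 1.

```python
from itertools import permutations, product

# (event id, possible activities, t_min, t_max, indeterminate?)
TRACE = [
    ("e1", {"NightSweats"}, 5, 5, True),
    ("e2", {"PrTP", "SecTP"}, 8, 8, False),
    ("e3", {"Splenomeg"}, 4, 10, False),
    ("e4", {"Adm"}, 12, 12, False),
]

# Hypothetical model language (assumption): the model accepts one trace only.
MODEL_LANGUAGE = {("NightSweats", "PrTP", "Splenomeg", "Adm")}


def realizations(trace):
    """All realizations: linear extensions combined with label choices."""
    n, result = len(trace), set()
    for perm in permutations(trace):
        # discard orders where a later event strictly precedes an earlier one
        if any(perm[j][3] < perm[i][2]
               for i in range(n) for j in range(i + 1, n)):
            continue
        options = [sorted(acts) + ([None] if indet else [])
                   for _, acts, _, _, indet in perm]
        for combo in product(*options):
            result.add(tuple(a for a in combo if a is not None))
    return result


def lcs_length(s, m):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(m) + 1) for _ in range(len(s) + 1)]
    for i, x in enumerate(s, 1):
        for j, y in enumerate(m, 1):
            table[i][j] = (table[i - 1][j - 1] + 1 if x == y
                           else max(table[i - 1][j], table[i][j - 1]))
    return table[len(s)][len(m)]


def alignment_cost(sigma, language):
    """Unit-cost optimal alignment: unmatched log moves plus model moves."""
    return min(len(sigma) + len(m) - 2 * lcs_length(sigma, m)
               for m in language)


def cost_bounds(trace, language):
    """Lower and upper bound on the alignment cost over all realizations."""
    costs = [alignment_cost(s, language) for s in realizations(trace)]
    return min(costs), max(costs)
```

The number of realizations grows quickly with the number of overlapping timestamps and alternative labels, which is what motivates replacing the enumeration with a single search when only the lower bound is needed.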
We will produce a version of the event net that embeds the possible behaviorsof the uncertain trace. We define a behavior net , a Petri net that can replay alland only the realizations of an uncertain trace. As an intermediate step in orderto obtain such a Petri net, we first build the behavior graph , a dependency graphrepresenting the uncertain trace. This graph contains a vertex for each uncertainevent in the trace and contains an edge between two vertices if the correspondinguncertain events happen one directly after the other in at least one realizationof the uncertain trace.
Definition 32 (Behavior Graph).
Let $\sigma_U \in \mathcal{T}_U$ be a simple uncertain trace. A behavior graph $\beta : \mathcal{T}_U \to \mathcal{U}_G$ is the transitive reduction $\rho(G)$ of a directed graph $G = (V, E) \in \mathcal{U}_G$, which is defined as:
– $V = \{e \in \sigma_U\}$,
– $E = \{(v, w) \mid v, w \in V \wedge v \prec_E w\}$.
The behavior graph provides a structured representation of the uncertainty on the timestamp: when a specific vertex has two or more outbound edges, the events corresponding to the destination vertices can occur in any order, concurrently with each other. We can see the result on the example trace in Figures 6 and 7.
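A minimal sketch of this construction on the running example is given below. The dictionary layout and function names are assumptions made for the illustration. The transitive reduction is computed by dropping every edge that can be bypassed through an intermediate vertex; this shortcut is only valid because the edge set built from $\prec_E$ is already transitively closed.

```python
# event id -> (t_min, t_max), days of the month from the running example
TRACE = {"e1": (5, 5), "e2": (8, 8), "e3": (4, 10), "e4": (12, 12)}


def precedence_edges(trace):
    """Edge (v, w) whenever v strictly precedes w: t_max(v) < t_min(w)."""
    return {
        (v, w)
        for v, (_, v_max) in trace.items()
        for w, (w_min, _) in trace.items()
        if v_max < w_min
    }


def transitive_reduction(vertices, edges):
    """Drop (v, w) if some u gives v -> u -> w (edges must be transitive)."""
    return {
        (v, w) for (v, w) in edges
        if not any((v, u) in edges and (u, w) in edges for u in vertices)
    }


def behavior_graph(trace):
    """Vertices and reduced edges of the behavior graph of a trace."""
    vertices = set(trace)
    return vertices, transitive_reduction(vertices, precedence_edges(trace))
```

On the example, the only edge removed by the reduction is the one from `e1` to `e4`, because `e4` is already reachable from `e1` through `e2`.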
Fig. 6. The graph of the trace in Table 4 before applying the transitive reduction. All the nodes in the graph are pairwise connected based on precedence relationships; pairs of nodes for which the order is unknown are not connected. The dashed node represents an indeterminate event.
Fig. 7. The behavior graph of the trace in Table 4. The transitive reduction removed the arc between $e_1$ and $e_4$, since they are reachable through $e_2$. This graph has a minimal number of arcs while conserving the same reachability relationship between nodes.
Theorem 1 (Correctness of behavior graphs).
Let $\sigma_U \in \mathcal{T}_U$ be a simple uncertain trace and $bg(\sigma_U) = (V, E)$ be its behavior graph. The behavior graph $bg(\sigma_U)$ is acyclic; additionally, the set of all topological sortings of the behavior graph corresponds to the set of order-realizations of $\sigma_U$: $\mathcal{O}_{bg(\sigma_U)} = \mathcal{R}_O(\sigma_U)$.
Proof. From Proposition 1 we know that $\prec_E$ is a strict partial order. Let $p = \langle p_1, p_2, \dots, p_m \rangle \in P_{bg}$ be a path in the behavior graph: if p was a cycle, that means that according to Definition 32 we have $p_1 \prec_E p_2 \prec_E \dots \prec_E p_m \prec_E p_1$. Since $\prec_E$ is transitive, we have that $p_1 \prec_E p_m$ and $p_m \prec_E p_1$, which would violate the antisymmetry property in Definition 7 and would contradict Proposition 1. Thus the behavior graph is necessarily acyclic.
The result $\mathcal{O}_{bg(\sigma_U)} = \mathcal{R}_O(\sigma_U)$ immediately follows from Definitions 9, 30 and 32, and from Proposition 1. □
Lemma 2 (Semantics of behavior graphs).
Events connected by paths in agiven behavior graph have a precedence relationship; events not connected by anypaths share possible values for their timestamps and thus might have happenedin any order.Proof.
Immediately follows from Proposition 1, Theorem 1, and from Lemma 1. □
We then obtain a behavior net by replacing every vertex in the behaviorgraph with one or more transitions in an XOR configuration, each representingan activity contained in the π a set of the corresponding uncertain event. Everyedge of the behavior graph becomes a place in the behavior net, connected fromand to the transitions corresponding to, respectively, its source and target nodesin the graph. Definition 33 (Behavior Net).
Let $\sigma_U \in \mathcal{T}_U$ be a simple uncertain trace, and let $bg(\sigma_U) = (V, E)$ be the corresponding behavior graph. A behavior net $bn : \mathcal{T}_U \to \mathcal{U}_{SN}$ is a system net $bn(\sigma_U) = (P, T, F, l, M_{init}, M_{final})$ such that:
– $P = E \cup \{(start, v) \mid v \in V \wedge \nexists v' \in V : (v', v) \in E\} \cup \{(v, end) \mid v \in V \wedge \nexists v' \in V : (v, v') \in E\}$,
– $T = \{(v, a) \mid v \in V \wedge a \in \pi_a(v)\} \cup \{(v, \tau) \mid v \in V \wedge \pi_o(v) = {?}\}$,
– $F = \{((start, v_1), (v_2, a)) \in P \times T \mid v_1 = v_2\} \cup \{((v_1, a), (v_2, w)) \in T \times P \mid v_1 = v_2\} \cup \{((v, w_1), (w_2, a)) \in P \times T \mid w_1 = w_2\} \cup \{((v_1, a), (v_2, end)) \in T \times P \mid v_1 = v_2\}$,
– $l = \{((v, a), a) \mid (v, a) \in T \wedge a \neq \tau\}$,
– $M_{init} = [(start, v) \in P \mid v \in V]$,
– $M_{final} = [(v, end) \in P \mid v \in V]$.
In Figure 8, we can see the behavior net corresponding to the uncertain trace in Table 4. It is important to note that every set of edges in the behavior graph with the same source vertex generates an AND split in the behavior net, and a set of edges with the same destination vertex generates an AND join. At the same time, the transitions whose labels correspond to different possible activities in an uncertain event will appear in an XOR construct inside the behavior net. Thus, in the behavior net, every set of events whose timestamps share some possible values will be represented by transitions inside an AND construct, and will then be able to execute in any order allowed by their uncertain timestamp attributes. In the same fashion, an event with uncertainty on the activity will be represented by a number of transitions in an XOR construct. This allows the net to replay any possible choice for the activity attribute. It follows that, by construction, for any simple uncertain trace $\sigma_U$ we have that $\phi(bn(\sigma_U)) = \mathcal{R}(\sigma_U)$.

Fig. 8. The behavior net corresponding to the uncertain trace in Table 4. The labels show the objects involved in the construction of Definition 33. The initial marking is displayed; the gray "token slot" represents the final marking.

We can use the behavior net of an uncertain trace $\sigma_U$ in lieu of the event net to compute alignments with a model $SN \in \mathcal{U}_{SN}$; the search algorithm returns an optimal alignment, a sequence of moves $(x, (y, t))$ with $x \in \mathcal{U}_A$, $y \in \mathcal{U}_A$ and t a transition of the model SN. After removing all "≫" symbols, the sequence of first elements of the moves will describe a complete firing sequence $\sigma_{bn}$ of the behavior net. Since $\sigma_{bn}$ is complete, $\sigma_{bn} \in \phi(bn(\sigma_U))$ and, thus, $\sigma_{bn} \in \mathcal{R}(\sigma_U)$. It follows that $\sigma_{bn}$ is a realization of $\sigma_U$, and the search algorithm ensures that $\sigma_{bn}$ is a realization with optimal conformance cost for the model SN: $\delta(\lambda_{SN}(\sigma_{bn})) = \min_{\sigma \in \mathcal{R}(\sigma_U)} \delta(\lambda_{SN}(\sigma)) = \delta_{min}(\sigma_U)$.
Theorem 2 (Correctness of behavior nets).
Let $\sigma_U \in \mathcal{T}_U$ be a simple uncertain trace and let $bg(\sigma_U) = (V, E)$ be its behavior graph. The corresponding behavior net $bn(\sigma_U) = (P, T, F, l, M_{init}, M_{final})$ can replay all and only the realizations of $\sigma_U$: $\phi(bn(\sigma_U)) = \mathcal{R}(\sigma_U)$.
Proof. Let $(v, v') \in E$ be an edge of the behavior graph, which also defines a place in the behavior net: $(v, v') = p_{v,v'} \in P$. Let us denote with $T_v$ the set of transitions in the behavior net generated from the vertex v: $T_v = \{(v', a) \in T \mid v' = v\}$.
(⊆) Let $\sigma = \langle a_1, a_2, \dots, a_n \rangle \in \phi(bn(\sigma_U))$ be any certain trace accepted by $bn(\sigma_U)$. Let $\sigma_T = \langle t_1, t_2, \dots, t_n \rangle \in \phi_f(bn(\sigma_U))$ be a complete firing sequence of $bn(\sigma_U)$ yielding σ, i.e., $l(\sigma_T)\upharpoonright_{\mathcal{U}_A} = \sigma$. Let $\langle v_1, v_2, \dots, v_n \rangle$ be a sequence of vertices in $bg(\sigma_U)$ such that $t_1 = (v_1, a_1), t_2 = (v_2, a_2), \dots, t_n = (v_n, a_n)$ and $t_1 \in T_{v_1}, t_2 \in T_{v_2}, \dots, t_n \in T_{v_n}$. Let $\mathcal{V}$ be the set of all such sequences; by the flow relation in Definition 33 there must exist a sequence $\sigma_O = \langle v_1, v_2, \dots, v_n \rangle \in \mathcal{V}$ such that $((v_1, a_1), (v_1, v_2)) \in F$, $((v_1, v_2), (v_2, a_2)) \in F$, $((v_2, a_2), (v_2, v_3)) \in F$, $((v_2, v_3), (v_3, a_3)) \in F$, $\dots$, $((v_{n-1}, a_{n-1}), (v_{n-1}, v_n)) \in F$, $((v_{n-1}, v_n), (v_n, a_n)) \in F$. This implies that $(v_1, v_2) \in E, (v_2, v_3) \in E, \dots, (v_{n-1}, v_n) \in E$. From Definition 32 we then have that $v_1 \nsucc_E v_2 \nsucc_E \dots \nsucc_E v_n$. Furthermore, since there exists a $T_v$ for all $v \in V$ and for all $1 \leq i \leq n$ exactly one transition $t_i \in T_{v_i}$ has to fire to complete the firing sequence, we have that for all $v \in V$, $v \in \sigma_O$ and is unique. Thus, $\sigma_O \in \mathcal{S}_V$ is a permutation of the vertices in $bg(\sigma_U)$.
Because all vertices in $\sigma_O$ are sorted by a linear extension of $\prec_E$, we also have that $\sigma_O \in \mathcal{O}_{bg(\sigma_U)}$ is a topological sorting of the vertices in $bg(\sigma_U)$. By Definition 32, we then have that $\sigma_O$ is an order-realization of $\sigma_U$: $\sigma_O \in \mathcal{R}_O(\sigma_U)$. Since, by construction, $l(t_i) \in \pi_a(v_i)$ if $\pi_o(v_i) = {!}$ and $l(t_i) \in \pi_a(v_i) \cup \{\tau\}$ if $\pi_o(v_i) = {?}$, we have that $\sigma = l(\sigma_T)\upharpoonright_{\mathcal{U}_A} \in \mathcal{R}(\sigma_U)$. Since this construction is valid for any $\sigma \in \phi(bn(\sigma_U))$, every complete firing sequence of the behavior net is a realization of $\sigma_U$: $\phi(bn(\sigma_U)) \subseteq \mathcal{R}(\sigma_U)$.
(⊇) Let $\sigma_O \in \mathcal{R}_O(\sigma_U)$ be any order-realization of $\sigma_U$, and let $n = |\sigma_U|$. Since $\sigma_O[1] \prec_E \sigma_O[2] \prec_E \dots \prec_E \sigma_O[n]$ (by Definition 30), there exists a path $p \in P_{bg(\sigma_U)}$ such that $p = \langle v_1, v_2, \dots, v_n \rangle = \langle \sigma_O[1], \sigma_O[2], \dots, \sigma_O[n] \rangle$ (by Theorem 1). Let $p_{1,2} = (v_1, v_2)$, $p_{2,3} = (v_2, v_3)$, and so on. Let $t_1 \in T_{v_1}, t_2 \in T_{v_2}, \dots, t_n \in T_{v_n}$ and let $\sigma_T = \langle t_1, t_2, \dots, t_n \rangle$. By the construction in Definition 33, in $bn(\sigma_U) = N$ we have that
$(N, M_{init})[t_1\rangle (N, M_{1,2})[t_2\rangle (N, M_{2,3})[t_3\rangle \dots [t_{n-1}\rangle (N, M_{n-1,n})[t_n\rangle (N, M_{final})$
where:
$M_{1,2} = (M_{init} \setminus [(start, v_1)]) \uplus [p_{1,2}]$
$M_{2,3} = (M_{1,2} \setminus [p_{1,2}]) \uplus [p_{2,3}]$
$\dots$
$M_{n-1,n} = (M_{n-2,n-1} \setminus [p_{n-2,n-1}]) \uplus [p_{n-1,n}]$
$M_{final} = (M_{n-1,n} \setminus [p_{n-1,n}]) \uplus [(v_n, end)]$
This construction implies that $(N, M_{init})[\sigma_T\rangle (N, M_{final})$ and therefore $\sigma_T \in \phi_f(bn(\sigma_U))$.
The definition of the labeling function in the behavior net is such that, for all $1 \leq i \leq n$, we have that $(v_i, a) \in T_{v_i} \Leftrightarrow a \in \pi_a(v_i)$. By Definition 30, the labeling of the sequence $\langle t_1, t_2, \dots, t_n \rangle$ projected on the universe of activities is then a realization of the uncertain trace $\sigma_U$ obtained from the possible activity labels of $\sigma_O$: $l(\sigma_T)\upharpoonright_{\mathcal{U}_A} \in \mathcal{R}(\sigma_U)$. Since this construction is valid for any $\sigma_O \in \mathcal{R}_O(\sigma_U)$, the behavior net can replay any realization of $\sigma_U$: $\mathcal{R}(\sigma_U) \subseteq \phi(bn(\sigma_U))$. □
Theorem 3 (Correctness of uncertain alignments).
Let $\sigma_U \in \mathcal{T}_U$ be a simple uncertain trace and let $SN \in \mathcal{U}_{SN}$ be a system net. Computing an alignment using the product net between $SN$ and the behavior net $bn(\sigma_U)$ yields the alignment with the lowest cost among all realizations of $\sigma_U$: $\delta(\lambda_{SN}(\sigma_{bn})) = \min_{\sigma \in \mathcal{R}(\sigma_U)} \delta(\lambda_{SN}(\sigma)) = \delta_{min}(\sigma_U)$.

Proof.
Recall from Definition 21 that $\lambda_{SN} : \mathcal{T} \to \mathcal{A}_{LM}^{*}$ is a deterministic mapping that assigns an optimal alignment to any trace $\sigma$. Adriansyah [6] details how to compute such a function $\lambda_{SN}$ through a state-based A* search over a state space defined by the reachable markings of the product net $SN \otimes en(\sigma)$ between a reference system net $SN$ and the event net of a certain trace $\sigma \in \mathcal{T}$. As per Definition 19, this search retrieves an alignment which is optimal with respect to a certain cost function $\delta$ and, ignoring “$\gg$”, is composed of a complete firing sequence of the system net $\sigma_T \in \phi_f(SN)$ and the only complete firing sequence of the event net $en(\sigma)$, which corresponds to $\sigma$ by construction. Given a system net $SN \in \mathcal{U}_{SN}$, an uncertain trace $\sigma_U \in \mathcal{T}_U$ and its respective behavior net $bn(\sigma_U)$, the same search algorithm for $\lambda_{SN}$ over $SN \otimes bn(\sigma_U)$ yields an optimal alignment containing a complete firing sequence for the reference system net $\sigma_T \in \phi_f(SN)$ and a complete firing sequence for the behavior net of the uncertain trace, which yields $\sigma \in \phi(bn(\sigma_U))$. Since $\lambda_{SN}$ minimizes the cost and $\sigma \in \mathcal{R}(\sigma_U)$ is a valid realization of $\sigma_U$ due to Theorem 2, the resulting alignment has the minimal cost possible over all the possible realizations of the uncertain trace. $\square$

The framework for computing conformance bounds for uncertain event data illustrated in this paper raises some research questions that need to be addressed in a practical and empirical manner. The questions that we aim to answer are:
– Q1: how do conformance bounds behave when computed on uncertain data?
– Q2: what is the impact of different deviating behavior and different types of uncertain behavior on the conformance score of uncertain event logs?
– Q3: what is the impact on efficiency of computing uncertain alignments utilizing the behavior net, as opposed to the baseline method of enumerating and aligning all realizations?
– Q4: what is the impact of computing uncertain alignments utilizing the behavior net on different types of uncertain behavior?
– Q5: is it possible to apply uncertain alignments to real-life data to obtain a best- and worst-case scenario for the execution of process instances?

The technique described here to compute conformance for strongly uncertain traces and to create the behavior net has been implemented in the Python programming language, using the facilities for log importing, model creation and manipulation, and alignments provided by the library PM4Py [12]. Uncertainty has been represented in the XES standard through meta-attributes and constructs such as lists, so that any XES importer can read an uncertain log file. The algorithm was designed to be fully compatible with any event log in the XES format (with or without uncertainty). The meta-attributes for uncertainty were designed to be backward compatible with other process mining algorithms: meta-attributes describing the possible values for an uncertain activity or the interval of an uncertain timestamp can also specify a “fallback value”, which other process mining software will read as a (certain) activity or timestamp value.

The first four research questions listed above have been addressed by tests on synthetic uncertain event logs. To this end, we implemented the following software components necessary for the experiments:
– a noise generator, to introduce deviations in an event log in a controlled way. This component can alter the activity label, swap the order of events or add redundant events to an event log with a given probability or frequency.
– an uncertainty generator, to alter the XES attributes present in the log by appending additional meta-information which is then interpreted as uncertainty. The component introduces uncertainty information in an event log, with the possibility to add any of the strongly uncertain attributes described in the taxonomy of Section 2. It also allows exporting the generated uncertain event log through the XES exporter of the PM4Py library.
– a number of smaller extensions to PM4Py functionalities, also useful for other process mining applications. Examples are the generation of all possible process variants (the language) of a PM4Py Petri net, and a memoized version of alignments, which trades memory for speed when computing the conformance between an event log and a model.

To answer Q1 and Q2, we set up an experiment to inspect the quality of the bounds for conformance scores as increasingly more uncertainty is added to an event log. We ran the tests on synthetic event logs to which we added simulated uncertainty. In this way, we can control the amounts and types of uncertainty in event data.

Every iteration of this experiment is as follows:
1. We generate a random Petri net with a fixed dimension ($n = 10$ transitions) through the ProM plugin “Generate block-structured stochastic Petri nets”.
2. We play out an event log consisting of 100 traces generated from the Petri net.
3. We randomly alter the activity label of a specific percentage $d_a$ of events, swapping it with another label sampled from the universe of activities.
4. We randomly swap a specific percentage $d_s$ of events with a neighboring event. For each event sampled for the swap, we randomly select either the predecessor or the successor (with 50% probability each), and we swap the timestamps of the two events, effectively inverting their order. We skip the selection of the swap direction if we select the first event in a trace (which is swapped with the second) or the last event in a trace (which is swapped with the second to last).
5. We randomly duplicate a specific percentage $d_d$ of events.
For each event selected for duplication, we create a new event in the trace with identical case ID and activity label, and with timestamp equal to the average between the timestamp of the selected event and the timestamp of the following event. If we select the last event in a trace for duplication, we simply add a fixed delta to the timestamp of the duplicate.
6. We randomly introduce uncertainty in activity labels for a specific percentage $u_a$ of events. Each event selected for uncertainty on activity labels receives one additional activity label, different from the one it already has, sampled from the universe of activity labels.
7. We randomly introduce uncertainty in timestamps for a specific percentage $u_t$ of events. For each event sampled for timestamp uncertainty we randomly choose either the predecessor or the successor (with 50% probability each); the timestamp of the sampled event becomes an interval whose extremes are the original timestamp and the timestamp of the predecessor or successor, effectively causing them to mutually overlap. If the sampled event is the first (resp., last) event in a trace, we skip the selection of the predecessor or successor and we directly consider the successor (resp., predecessor) for the extremes of the uncertain timestamp.
8. We randomly transform a specific percentage $u_i$ of events into indeterminate events. To these sampled events, we add the “?” attribute, in order to mark them as indeterminate.
9. We measure upper and lower bounds for the conformance score with increasing percentage $p$ of uncertainty.

All sampling operations mentioned in the previous list are performed over a uniform probability distribution over the possible values.

In terms of the amount of deviation to be considered in each configuration, we aimed at recreating a situation where there is significant deviating behavior with respect to the normative model; for each kind of deviation considered, we introduced anomalous behavior in 30% of events.
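Steps 6 to 8 of the list above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' uncertainty generator: the function name `add_uncertainty`, the representation of events as tuples and dictionaries, and the reading of "a percentage of events" as exactly round(u · n) events per trace are all assumptions made for this example.

```python
import random

def add_uncertainty(trace, universe, u_a=0.0, u_t=0.0, u_i=0.0, rng=None):
    """Simulate uncertainty on one trace (hypothetical helper).

    trace:    list of (activity, timestamp) pairs, sorted by timestamp
    universe: iterable containing all activity labels
    u_a, u_t, u_i: fractions of events that receive uncertainty on the
                   activity label, on the timestamp, or the "?" marker
    """
    rng = rng or random.Random()
    n = len(trace)
    events = [{"activities": {a}, "t_min": t, "t_max": t, "indeterminate": False}
              for a, t in trace]

    def pick(fraction):
        # Assumption: exactly round(fraction * n) events are sampled uniformly.
        return rng.sample(range(n), round(fraction * n))

    # Step 6: add one extra label, different from the one already present.
    for i in pick(u_a):
        others = sorted(set(universe) - events[i]["activities"])
        if others:
            events[i]["activities"].add(rng.choice(others))

    # Step 7: stretch the timestamp up to the original timestamp of a neighbor.
    for i in pick(u_t):
        if n < 2:
            break  # a single event has no neighbor to overlap with
        if i == 0:
            j = 1                        # first event: only a successor exists
        elif i == n - 1:
            j = n - 2                    # last event: only a predecessor exists
        else:
            j = i + rng.choice((-1, 1))  # 50% predecessor, 50% successor
        t_i, t_j = trace[i][1], trace[j][1]
        events[i]["t_min"], events[i]["t_max"] = min(t_i, t_j), max(t_i, t_j)

    # Step 8: mark events as indeterminate ("?").
    for i in pick(u_i):
        events[i]["indeterminate"] = True
    return events

example = add_uncertainty([("a", 1), ("b", 2), ("c", 3), ("d", 4)], "abcde",
                          u_a=0.5, u_t=0.5, u_i=0.5, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when averaging over repeated runs.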
We consider four different settings for the addition of deviating behavior to event logs: Activity labels = $\{d_a = 30\%, d_s = 0\%, d_d = 0\%\}$, Swaps = $\{d_a = 0\%, d_s = 30\%, d_d = 0\%\}$, Extra events = $\{d_a = 0\%, d_s = 0\%, d_d = 30\%\}$ and All = $\{d_a = 30\%, d_s = 30\%, d_d = 30\%\}$.

Similarly, we consider four different settings for the addition of uncertain behavior to event logs: Activities = $\{u_a = p, u_t = 0\%, u_i = 0\%\}$, Timestamps = $\{u_a = 0\%, u_t = p, u_i = 0\%\}$, Indeterminate events = $\{u_a = 0\%, u_t = 0\%, u_i = p\}$ and All = $\{u_a = p, u_t = p, u_i = p\}$. We test all four configurations of deviation against each of the four configurations of uncertainty, with increasing values of $p$, for a total of 16 separate experiments.

Figure 9 summarizes our findings. The plots in this figure represent the average of 10 runs as described above.

[Figure 9: 16 plots, one for each combination of deviation setting (Activity labels, Swaps, Extra events, All) and uncertainty setting (Activities, Timestamps, Indeterminate events, All). Axes: Uncertainty (type and percentage); Conformance cost.]

Fig. 9. Upper (red, dashed) and lower (blue, dotted) bound for conformance cost for synthetic event logs with increasing uncertainty. Every plot shows a different configuration of deviation added to the log and types of uncertainty simulated in the event data. The x-axis shows the percentage of uncertainty $p$ added to the logs; the y-axis shows the amount of deviations, computed with alignments. The labels inside the graph indicate the relative change in deviation score with respect to $p = 0$, in percentage. The gray continuous line indicates the number of deviations at $p = 0$ as a reference.

We can observe that, in general, all plots show the expected behavior: the upper and lower bounds for conformance coincide at a percentage of uncertain events $p = 0$ for all experiments, and then diverge as $p$ increases. A number of additional observations can be made by looking at individual configurations for deviation or uncertainty, or at specific plots. When only uncertainty on activity labels is added to the event log, we see a deterioration of the upper bound for conformance cost, but the lower bound does not improve; in fact, it is essentially constant. This can be attributed to the fact that, since the additional activity labels are sampled at random from the set of labels, the chances of observing a realization of a trace where an uncertain activity label matches the alteration introduced by the deviations are small. Uncertainty on timestamps makes the lower bound decrease only when the introduced deviations are swaps: as expected, the possibility of changing the order of pairs of events does not appreciably improve the lower bound for deviation when extra events are added or activity labels of existing events are altered.

Conversely, the possibility to “skip” some critical events has a positive effect on the lower bound of all possible configurations for deviations: in fact, when marking some events as indeterminate in a log where extra events were added as deviations, the average conformance cost drops by 30.61% at $p = 16\%$, the largest drop among all the experiments.
The experiment with all three types of uncertainty and extra events as deviations essentially displays the same effect (the improvement in the lower bound is slightly, but not significantly, smaller, with a decrease in deviation of 29.38% at $p = 16\%$). For the experiments where all types of deviations were added at once, we can see that, as could be anticipated, the differences in deviation scores between the two bounds become smaller in relative terms (because of the very high amount of deviations at $p = 0\%$), but larger in absolute terms. As in the previous experiments, the largest contributor to decreasing the conformance cost of the lower bound is the addition of indeterminate events, which by itself decreases the deviation cost by 13.92% at $p = 16\%$. In general, the vast variability in measuring the conformance of an uncertain log shows that, if all types of uncertainty can occur with high frequency in a process, the business owner should act on the sources of uncertainty, since they will be a major obstacle to obtaining accurate measurements of process conformance. Vice versa, when uncertainty occurs only to a limited extent in event data, the algorithm proposed here is able to provide actionable bounds for the conformance score, together with descriptions of best- and worst-case scenarios of process conformance for a given trace.

The second experiment we set up aims to answer questions Q3 and Q4, and is concerned with the performance of calculating the lower bound of the cost via the behavior net versus the brute-force method of listing all the realizations of an uncertain trace, evaluating all of them through alignments, then picking the best value. We used a constant percentage of uncertain events of $p = 5\%$ and logs of 100 traces for this test, with progressively increasing values of $n$. We ran 4 different experiments, one for each of the four configurations for uncertain behavior illustrated above: Activities, Timestamps, Indeterminate events and All.

Figure 10 summarizes the results. As the diagram shows, the two methods diverge quickly in running time, even on a logarithmic scale. The largest model we could test was $n = 20$, a Petri net with 20 transitions, which is comparatively tiny in practical terms; however, even at these small scales the brute-force method takes roughly 3 orders of magnitude more time than the behavior net when all the types of uncertainty are added with $p = 5\%$.

This shows a very large improvement in the computing time for the lower bound; thus, the best-case scenario for the conformance cost of an uncertain trace can be obtained efficiently thanks to the structural properties of the behavior net. The graph also shows the dramatic impact on the number of realizations of a behavior net (and thus on the time needed to perform a brute-force computation of alignments) when the effects of different kinds of uncertainty are compounded.
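The compounding effect can be made explicit with a simple counting argument. The following bound is a sketch written in the notation of the proofs above, assuming that $\pi_a(e)$ is the set of possible activity labels of an event $e$, that $\pi_o(e) = {?}$ marks an indeterminate event, and that $\mathcal{O}_{bg(\sigma_U)}$ is the set of topological sortings of the behavior graph; the bracket $[\cdot]$ equals 1 when the condition holds and 0 otherwise.

```latex
\[
  |\mathcal{R}(\sigma_U)| \;\le\;
  |\mathcal{O}_{bg(\sigma_U)}| \cdot
  \prod_{e \in \sigma_U} \bigl( |\pi_a(e)| + [\pi_o(e) = {?}] \bigr)
\]
```

Every realization arises from one admissible ordering and one choice of label (or omission) per event, so each kind of uncertainty contributes its own factor: timestamp uncertainty enlarges the set of orderings, while activity uncertainty and indeterminate events enlarge the per-event terms. The bound is not tight in general, because different choices can produce the same certain trace.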
[Figure 10: four plots, one per uncertainty setting (Activities, Timestamps, Indeterminate events, All). Axes: Uncertainty (type and percentage); Mean time (seconds).]

Fig. 10. Effect on the time performance of calculating the lower bound for conformance cost with the brute-force method (blue) vs. the behavior net (red) on four different configurations for uncertain events.
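For intuition, the brute-force baseline compared in Figure 10 can be sketched on a toy example. The code below is an illustration under simplifying assumptions, not the PM4Py-based implementation used in the experiments: the class `UEvent` and the functions `realizations`, `cost` and `bounds` are hypothetical names, the process model is given directly as a finite set of accepted traces, and the alignment cost (unit cost for every move on log and move on model) is obtained through a longest common subsequence instead of an A* search over a product net.

```python
from dataclasses import dataclass
from itertools import permutations, product

@dataclass(frozen=True)
class UEvent:
    activities: tuple            # possible activity labels
    t_min: int                   # earliest possible timestamp
    t_max: int                   # latest possible timestamp
    indeterminate: bool = False  # True if the event is marked "?"

def realizations(trace):
    """Enumerate all certain traces compatible with an uncertain trace."""
    found = set()
    for order in permutations(trace):
        # Reject orders that place an event before one that certainly precedes it.
        if any(later.t_max < earlier.t_min
               for k, earlier in enumerate(order) for later in order[k + 1:]):
            continue
        # None stands for skipping an indeterminate event.
        options = [e.activities + ((None,) if e.indeterminate else ())
                   for e in order]
        for labels in product(*options):
            found.add(tuple(a for a in labels if a is not None))
    return found

def cost(trace, model_trace):
    """Unit-cost alignment of two sequences: moves on log plus moves on model."""
    prev = [0] * (len(model_trace) + 1)
    for x in trace:
        cur = [0]
        for j, y in enumerate(model_trace, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    # Everything outside the longest common subsequence is a deviation.
    return len(trace) + len(model_trace) - 2 * prev[-1]

def bounds(trace, language):
    """Lower and upper bound of the conformance cost over all realizations."""
    costs = [min(cost(r, m) for m in language) for r in realizations(trace)]
    return min(costs), max(costs)

sigma_u = [UEvent(("a",), 1, 1),
           UEvent(("b", "c"), 2, 4),
           UEvent(("d",), 3, 3, indeterminate=True)]
model = {("a", "b", "d")}
print(len(realizations(sigma_u)), bounds(sigma_u, model))  # prints: 6 (0, 3)
```

Even this three-event trace has six realizations, each of which must be aligned separately; the behavior net avoids this enumeration for the lower bound by encoding all realizations in a single net.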
As illustrated in Section 1, uncertainty in event data can originate from a number of different causes in real-world applications. One prominent source of uncertainty is missing data: attribute values not recorded in an event log can on occasion be described by uncertainty, through domain knowledge provided by process owners or experts. Then, as described in this paper, it is possible to obtain a detailed analysis of the deviations of a best- and worst-case scenario for the conformance to a process model.

To answer research question Q5 through a direct application of conformance checking over uncertainty, let us consider a process related to the medical procedures performed in the Intensive Care Unit (ICU) of a hospital. Figure 11 shows a ground truth model for the process.

[Figure 11: Petri net with transitions labeled Access, Triage, R1, R2, R3, R4, Laboratory, Visit, Consultancy - Begin, Consultancy - End, Consultancy, Laboratory - Begin, Laboratory - End, Laboratory, Dismissal, Exit.]

Fig. 11. The Petri net that models the process related to the treatment of patients in the ICU ward of an Italian hospital. The activities R1 through R4 are abbreviations for the four phases of a radiology exam: respectively, Radiology - Submitted Request, Radiology - Accepted Request, Radiology - Exam, Radiology - Results.

An execution log containing events that concern this ICU process is available. Throughout the process, some anomalies in attribute values can be spotted, namely a number of anomalies affecting the timestamp attributes. This is a $[E]_S$-type uncertain log.

The alterations of the timestamps in this event log happen for a number of reasons. The domain experts reported that human error is a frequent source of anomalies, which is worsened by the fact that operators often do not input data in real time: the information is recorded after a certain delay (e.g., at the end of a shift). Moreover, the information systems of the ICU ward and other wards (such as radiology) do not allow for automatic transmission of data between one another, so in some cases the timestamp of visits by specialists is not recorded in the ICU information system.

Tables 5 and 6 show two examples of traces with anomalous timestamp behavior. We can see that in the trace of Table 5 the event Triage has an imprecise timestamp: only the day has been recorded. This can be modeled with an uncertain timestamp encompassing a range of 24 hours. The column Preprocessed Timestamp shows the results of this preprocessing step.
Table 5. Events relative to one case of the ICU process. The timestamp of the “Triage” event is imprecise: through domain knowledge, we are able to represent this uncertainty in an explicit way within the event attributes in the log.

Event ID  Raw Timestamp  Preprocessed Timestamp  Activity
e  Access
e  Visit
e  Consultancy - Begin
e  R1
e  R2
e  R3
e  R4
e  Consultancy - End
e  Dismissal
e  Exit
e  Triage
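The preprocessing of a day-only timestamp such as the one of the Triage event in Table 5 can be sketched as follows. This is a minimal illustration rather than the procedure used by the authors: the helper name `day_to_interval`, the one-second resolution and the example date are assumptions.

```python
from datetime import date, datetime, time, timedelta

def day_to_interval(day, resolution=timedelta(seconds=1)):
    """Map a timestamp known only up to the day to the interval of
    possible values within that day (hypothetical helper)."""
    start = datetime.combine(day, time.min)          # 00:00:00 of that day
    return start, start + timedelta(days=1) - resolution

lower, upper = day_to_interval(date(2017, 8, 27))
print(lower, upper)  # prints: 2017-08-27 00:00:00 2017-08-27 23:59:59
```

The two values can then be stored as the bounds of the uncertain timestamp of the event.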
Some of the events in the trace of Table 6 are missing the timestamp value entirely. In this case, we can resort to domain knowledge provided by the process owners: it is known that events related to the Radiology exams happen after the Triage event, and before the Dismissal event. This allows us to represent the timestamp with ranges of possible values. Notice that such a small interval of time, obtainable from the domain knowledge available, is preferable to larger possible intervals (e.g., 2017-08-27 00:00:00 to 2017-08-27 23:59:59), since it minimizes the amount of possible overlaps in time with other events in the trace. In turn, this means that the number of possible realizations of the uncertain trace is smaller, allowing faster conformance checking. As before, the results of modeling timestamp uncertainty are shown in the column Preprocessed Timestamp.

Table 6. Events relative to one case of the ICU process. Some of the timestamp attributes are missing: through domain knowledge, we are able to represent them with uncertainty within a small interval of time. The timestamps in bold and italic of the “Raw Timestamp” column are used to set the interval boundaries for uncertain timestamps.

Event ID  Raw Timestamp  Preprocessed Timestamp  Activity
e  Access
e  Triage
e  Visit
e  R1
e  Consultancy - Begin
e  Dismissal
e  Exit
e  NULL  [2017-08-27 , 2017-08-27 ]  Consultancy - End
e  NULL  [2017-08-27 , 2017-08-27 ]  R2
e  NULL  [2017-08-27 , 2017-08-27 ]  R3
e  NULL  [2017-08-27 , 2017-08-27 ]  R4

Once uncertainty is made explicit in an event log as formally defined in this paper, it is possible to apply conformance checking over uncertainty. The technique of alignments illustrated here provides two results, corresponding to the lower and upper bound for the conformance score. The traces shown in Tables 5 and 6 have a best-case scenario alignment in common, which is shown in Table 7; aligning through the behavior net of these traces has allowed the algorithm to select a value for the uncertain timestamps of the traces (translated into a specific ordering) such that the deviation between data and model is the smallest possible. For both traces, the best-case scenario has a cost equal to 0; thus, no deviations occur in that case.

Table 7.
A valid alignment for both traces of Tables 5 and 6. This alignment has a deviation cost equal to 0, and corresponds to a best-case scenario for conformance between the process model and both uncertain traces.

Log:    Access | Triage | Visit | Consultancy - Begin | R1 | R2 | R3 | R4 | Consultancy - End | ≫ | Dismissal | Exit
Model:  Access | Triage | Visit | Consultancy - Begin | R1 | R2 | R3 | R4 | Consultancy - End | τ | Dismissal | Exit

Let us now look at the worst-case scenarios. One of the alignments with the worst possible score for the trace in Table 5 is shown in Table 8. In this scenario, the deviations are one move on model (the Triage activity should have occurred after the Access but did not), and one move on log (the activity Triage occurs in the data at an unexpected moment in the process).

Table 8. A valid alignment for the trace of Table 5. This alignment has a deviation cost equal to 2 (1 move on log and 1 move on model), and corresponds to a worst-case scenario for conformance between the process model and the uncertain trace.

Log:    Access | ≫ | Visit | Consultancy - Begin | R1 | R2 | R3 | R4 | Consultancy - End | ≫ | Dismissal | Exit | Triage
Model:  Access | Triage | Visit | Consultancy - Begin | R1 | R2 | R3 | R4 | Consultancy - End | τ | Dismissal | Exit | ≫

A worst-case scenario for the trace in Table 6 is illustrated in Table 9. In this case, the deviation cost is equal to 6, caused by the wrong order of the events related to the Radiology exam.

Table 9. A valid alignment for the trace of Table 6. This alignment has a cost equal to 6 (3 moves on log and 3 moves on model), and corresponds to the worst-case scenario for conformance between the process model and the uncertain trace.

Log:    Access | Triage | Visit | Consultancy - Begin | ≫ | ≫ | ≫ | R4 | R3 | R2 | R1 | Consultancy - End | ≫ | Dismissal | Exit
Model:  Access | Triage | Visit | Consultancy - Begin | R1 | R2 | R3 | R4 | τ | τ | τ | Consultancy - End | τ | Dismissal | Exit

Note that, for simplicity, we assume in this example that every deviation has a unit cost, but the alignment technique allows defining different costs for different types of deviations based on their impact on the process. For example, a patient that exits the hospital without official dismissal might have a worse impact than an unauthorized laboratory exam.

Uncertain alignments provide novel insights, not obtainable through existing conformance techniques. The process owner can utilize these results to gain insights and decide on actions regarding the process. The potential violation shown in the worst-case scenario for the trace in Table 5 can be investigated, as well as the source of said uncertainty; the process owner can, furthermore, decide whether the consequences and the likelihood of the worst-case scenario are indicative of a need for process restructuring, or whether the risk of such a potential violation of the normative process model is not critical for the process execution.

The discipline of conformance checking, a subfield of process mining, is concerned with defining metrics to measure how well an event log matches a given process model. The input for this task consists of an execution log and a process model (most commonly a labeled Petri net) and the output is a measurement of the distance (that is, the deviation) between the model and the log, or the traces that compose the log. The two main goals of conformance checking are measuring the quality of a process discovery algorithm by comparing the discovered process model with the source event log, to verify the extent to which the model fits the log; and comparing an execution log with a normative process model (often defined partially or completely by hand) in order to verify the deviations between the rules governing the process and the tasks carried out in reality. Often, the conformance measure defined between logs (or traces) and models includes not only a distance in absolute terms, but also an indication of where and what deviated from the norm in the process.
Conformance checking was introduced by Rozinat and van der Aalst [29], who obtained a conformance measure by tracking counts of tokens during replay of traces in a Petri net. Despite the elevated computational complexity, state-of-the-art approaches are mostly based on alignments, introduced by Adriansyah et al. [7].
As mentioned, the occurrence of data containing uncertainty, in a broad sense, is common both in more classic disciplines like statistics and data mining [21] and in process mining [5]; logs that show an explicit uncertainty in the control-flow perspective can be classified in the lower levels of the quality ranking proposed in the process mining manifesto.

To historically position the topic of uncertain data, let us mention some previous work in the domain of data mining. A survey offering a panoramic view of mining uncertain data is the one by Aggarwal and Philip [8], which pays particular attention to the problem of uncertain data querying. Such data is represented on the basis of probabilistic databases [33], a foundational notion in the setting of uncertain data mining. A branch of data mining particularly related to process mining is frequent itemset mining: an efficient algorithm to search for frequent itemsets over uncertain data, U-Apriori, has been presented by Chui et al. [15].

Within process mining there exist various techniques to deal with a kind of uncertainty that is different from, albeit closely related to, the one that we analyze here: missing or incorrect data. This can be considered a form of non-explicit uncertainty: no measure or indication of the nature of the uncertainty is given in the event log. The work of Suriadi et al. [34] provides a taxonomy of this type of issue in event logs, laying out a series of data patterns that model errors in process data. In these cases, and if this behavior is infrequent enough to allow the event log to remain meaningful, the most common way for existing process mining techniques to deal with missing data is by filtering out the affected traces and performing discovery and conformance checking on the resulting filtered event log. A case study illustrating such a situation is, e.g., the work of Benevento et al. [11].
While filtering out missing values is straightforward, various methodologies of event log filtering have been proposed in the past to solve the problem of incorrect event attributes: the filtering can rely on a reference model, which can be given as a process specification [35], or on information discovered from the frequent and well-formed traces of the same event log; for example, extracting an automaton from the frequent traces [17], computing conditional probabilities of frequent sequences of activities [30], or discovering a probabilistic automaton [37]. In the latter cases, the noise is identified as infrequent behavior.

Some previous works attempt to repair the incorrect values in an event log. Conforti et al. [16] propose an approach for the restoration of incorrect timestamps based on a log automaton, which repairs the total ordering of events in a trace based on correct frequent behavior. Fani Sani et al. [31] define outlier behavior as the unexpected occurrence of an event, the absence of an event that is supposed to happen, and the incorrect order of events in the trace; then, they propose a repairing method based on probabilistic analysis of the context of an outlier (events preceding or following the anomalous event). Again, both of these methods define anomalous or incorrect behavior on the basis of the frequency of occurrence.

Uncertainty on activity labels as defined in the taxonomy of Section 2 has not been, to the best of our knowledge, previously employed in the field of process mining. There are, however, related examples of anomalies or uncertainties on activity labels of events: for instance, the problem of matching event identifiers to normative activity labels [10]. In this case, an event is associated with only one activity label, but this association is not known.
There are a number of techniques to estimate the correct association, including some that consider the data perspective together with the control-flow perspective [32]. In this setting, van der Aa et al. [18] proposed a technique to estimate bounds of conformance scores for event logs with unknown or partially known event-to-activity mapping. Another related domain is the many-to-one abstraction from low-level events to higher-level activity labels, which can be performed by clustering events in minimal conflict groups [20], or by representing low-level patterns with data Petri nets and then discovering high-level activities by matching these patterns through alignments [24].

A kind of anomaly in event data which is even more closely related to uncertainty as discussed in this paper is incompleteness in the order of events in a trace. This occurs when the total ordering among events is lost or not available, and only a partial order is known. In the field of concurrent and distributed systems, the absence of a total order among logged activities has historically been relevant by virtue of being both caused by, and a necessary condition for, the presence of concurrency in a system (see, e.g., Beschastnikh et al. [13]). An important concept at the base of this paper is the representation of uncertainties in the timestamp dimension through directed acyclic graphs, which express these partial orders. This intuition was first presented by Lu et al. [23], also in the context of conformance checking, in order to produce partially ordered alignments. More recently, van der Aa et al. [1] proposed a technique to resolve such order uncertainty through estimates based on probabilistic inference aided by a normative process model.

In process mining, it has long been known that in many cases the definition of the case is not part of the normative information immediately accessible to the process analyst, so a decision is needed on which attribute or attributes constitute the case of the process. In some cases, multiple definitions of cases are possible, and analysis on a subset of them is desirable. This specific setting, which can be interpreted as uncertainty on the case notion, has a long history both in terms of mathematical formalization and in terms of implementation and definition of data standards. For an introduction to this subfield of process mining we refer to [4].

This paper presents an extended version of the preliminary analysis on uncertain event data in process mining shown in [25], in which we presented a preliminary description of uncertain event data and their taxonomy, as well as an approach to find upper and lower bounds for the conformance score of an uncertain process trace through alignments. We elaborate on this previous work by adding an extended formalization, proving theorems on uncertainty in process mining, and reporting on new experiments. The framework for uncertain data proposed in this paper has also been expanded by providing an algorithm capable of process discovery on uncertain event data, through the definition of the directly-follows relationship in uncertain settings and the computation of an uncertain directly-follows graph, which enables process discovery techniques [26]. On the topic of efficient uncertain data management, we presented an improved algorithm that preprocesses uncertain traces into behavior graphs in quadratic time, enabling fast uncertainty analysis [27]. The exploration of uncertain event data can also be facilitated by a memory-efficient representation method and the definition of the concept of uncertain process variants [28].
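
To illustrate the behavior graphs mentioned above, the following minimal sketch builds the partial order induced by uncertain timestamps and reduces it to a directed acyclic graph. It rests on assumptions that are not taken from this section: each event carries a hypothetical interval (t_min, t_max), an event is taken to precede another for certain when its latest possible timestamp lies before the other's earliest one, and the graph is the transitive reduction of that relation. The function name behavior_graph is illustrative, and the naive procedure shown here runs in cubic time; it is not the quadratic-time algorithm of [27].

```python
def behavior_graph(events):
    """Return the edges of a behavior-graph-like DAG.

    events: dict mapping an event id to an assumed (t_min, t_max) interval.
    """
    # a certainly precedes b when a's latest time is before b's earliest time
    order = {
        (a, b)
        for a in events
        for b in events
        if a != b and events[a][1] < events[b][0]
    }
    # This relation is already transitive, so its transitive reduction keeps
    # exactly the pairs (a, c) that have no intermediate event b in between.
    return {
        (a, c)
        for (a, c) in order
        if not any((a, b) in order and (b, c) in order for b in events)
    }


# Hypothetical trace: e1 and e2 overlap in time, so their order is unknown.
trace = {"e1": (0, 2), "e2": (1, 3), "e3": (4, 5), "e4": (6, 6)}
edges = behavior_graph(trace)
```

In this toy trace, no edge connects e1 and e2, which reflects that both orderings of the two events remain possible.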
As the need to quickly and effectively analyze process data has arisen in the recent past and is growing to this day, many new types of information regarding events are recorded; this calls for new techniques able to provide an adequate interpretation of the new data. Not only are more and more event data available to the analyst, but these data are accessible in association with a wealth of information and meta-information about the process, the resources that executed activities, data about the outcome of those actions, and many other types of knowledge about the nature of events, activities, and the process as a whole. In this paper, we presented a new paradigm for process mining applied to event data: explicit uncertainty. We described the possible forms it can assume, building a taxonomy of different types of uncertainty, and we provided examples of how uncertainty can originate in a process, and how uncertainty information can be inferred from the available data and from domain knowledge provided by process experts. We then designed a framework to define the various flavors of uncertainty shown in the taxonomy. Then, in order to assess the practical applications of the uncertainty framework, we applied it to a well-established technique for conformance checking: aligning data to a reference Petri net. This application of uncertainty analysis is complemented by theorems that prove the correctness of the techniques developed and illustrated here within the framework previously described. The results can provide insights on the possible violations of process instances recorded with uncertainty against a normative model. The behavior net provides an efficient way to compute the lower bound for the conformance cost (i.e., the best-case scenario for conformity of uncertain process data), with a large improvement in time performance with respect to a brute-force procedure.

The approaches shown here can be extended in a number of ways. From a performance perspective, to improve the usability of alignments over uncertainty, the computation of the upper bound of the conformance cost should either be optimized or replaced by an approximate algorithm. Another direction for future work is extending the conformance checking technique to logs with weak uncertainty, weighting the deviation by means of the probability distributions attached to activities, timestamps and indeterminate events. Additionally, investigation on real-life data is an important milestone for this line of research, and it is vital to analyze in depth a complete real-life use case of process mining in the presence of uncertain event data.
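
The brute-force baseline for the lower and upper bounds of the conformance cost can be illustrated with a small sketch. It is a toy under stated assumptions, not the procedure evaluated in this paper: only timestamp uncertainty is handled, each event is a hypothetical (activity, t_min, t_max) triple, the reference model is replaced by a finite set of accepted traces instead of a Petri net, and an alignment is costed with unit log and model moves and free synchronous moves. All function names are illustrative.

```python
from itertools import permutations


def realizations(events):
    """Return all activity sequences compatible with the timestamp intervals."""
    result = set()
    for perm in permutations(events):
        # Reject an ordering that places an event before one certainly preceding it.
        admissible = all(
            not (perm[j][2] < perm[i][1])
            for i in range(len(perm))
            for j in range(i + 1, len(perm))
        )
        if admissible:
            result.add(tuple(activity for activity, _, _ in perm))
    return result


def alignment_cost(trace, model_trace):
    """Unit-cost log/model moves, free synchronous moves (via the LCS length)."""
    m, n = len(trace), len(model_trace)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if trace[i] == model_trace[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return m + n - 2 * lcs[m][n]


def conformance_bounds(events, model_language):
    """Best-case and worst-case cost over all realizations of the trace."""
    costs = [
        min(alignment_cost(r, accepted) for accepted in model_language)
        for r in realizations(events)
    ]
    return min(costs), max(costs)


# Hypothetical uncertain trace: the intervals of b and c overlap.
uncertain_trace = [("a", 0, 0), ("b", 1, 3), ("c", 2, 4), ("d", 5, 5)]
lower, upper = conformance_bounds(uncertain_trace, {("a", "b", "c", "d")})
```

Enumerating permutations grows factorially with the number of events whose intervals overlap, which is the kind of cost that motivates a more efficient computation of the bounds.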
References
1. van der Aa, H., Leopold, H., Weidlich, M.: Partial order resolution of event logs for process conformance checking. Decision Support Systems p. 113347 (2020)
2. van der Aalst, W.M.P.: Decomposing Petri nets for process mining: A generic approach. Distributed and Parallel Databases (4), 471–507 (2013)
3. van der Aalst, W.M.P.: Process mining: data science in action. Springer (2016)
4. van der Aalst, W.M.P.: Object-centric process mining: Dealing with divergence and convergence in event data. In: International Conference on Software Engineering and Formal Methods. pp. 3–25. Springer (2019)
5. van der Aalst, W.M.P., Adriansyah, A., De Medeiros, A.K.A., Arcieri, F., Baier, T., Blickle, T., Bose, J.C., van Den Brand, P., Brandtjen, R., Buijs, J., et al.: Process mining manifesto. In: International Conference on Business Process Management. pp. 169–194. Springer (2011)
6. Adriansyah, A.: Aligning observed and modeled behavior. Ph.D. thesis, Eindhoven University of Technology (2014)
7. Adriansyah, A., van Dongen, B.F., van der Aalst, W.M.P.: Towards robust conformance checking. In: International Conference on Business Process Management. pp. 122–133. Springer (2010)
8. Aggarwal, C.C., Philip, S.Y.: A survey of uncertain data algorithms and applications. IEEE Transactions on Knowledge and Data Engineering (5), 609–623 (2008)
9. Aho, A.V., Garey, M.R., Ullman, J.D.: The transitive reduction of a directed graph. SIAM Journal on Computing (2), 131–137 (1972)
10. Baier, T., Mendling, J.: Bridging abstraction layers in process mining by automated matching of events and activities. In: Business Process Management, pp. 17–32. Springer (2013)
11. Benevento, E., Dixit, P.M., Sani, M.F., Aloini, D., van der Aalst, W.M.P.: Evaluating the effectiveness of interactive process discovery in healthcare: A case study. In: International Conference on Business Process Management. pp. 508–519. Springer (2019)
12. Berti, A., van Zelst, S.J., van der Aalst, W.M.P.: Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science. In: ICPM Demo Track (CEUR 2374). pp. 13–16 (2019)
13. Beschastnikh, I., Brun, Y., Ernst, M.D., Krishnamurthy, A., Anderson, T.E.: Mining temporal invariants from partially ordered logs. In: Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, pp. 1–10 (2011)
14. Carmona, J., van Dongen, B., Solti, A., Weidlich, M.: Conformance Checking: Relating Processes and Models. Springer (2018)
15. Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 47–58. Springer (2007)
16. Conforti, R., La Rosa, M., ter Hofstede, A.: Timestamp repair for business process event logs (2018), http://hdl.handle.net/11343/209011, [preprint]
17. Conforti, R., La Rosa, M., ter Hofstede, A.H.: Filtering out infrequent behavior from business process event logs. IEEE Transactions on Knowledge and Data Engineering (2), 300–314 (2017)
18. van der Aa, H., Leopold, H., Reijers, H.A.: Efficient process conformance checking on the basis of uncertain event-to-activity mappings. IEEE Transactions on Knowledge and Data Engineering (5), 927–940 (2019)
19. Flaška, V., Ježek, J., Kepka, T., Kortelainen, J.: Transitive closures of binary relations. I. Acta Universitatis Carolinae. Mathematica et Physica (1), 55–69 (2007)
20. Günther, C.W., van der Aalst, W.M.P.: Mining activity clusters from low-level event logs. Beta, Research School for Operations Management and Logistics (2006)
21. Han, J., Pei, J., Kamber, M.: Data mining: concepts and techniques. Elsevier (2011)
22. Kalvin, A.D., Varol, Y.L.: On the generation of all topological sortings. Journal of Algorithms (2), 150–162 (1983)
23. Lu, X., Fahland, D., van der Aalst, W.M.P.: Conformance checking based on partially ordered event data. In: International Conference on Business Process Management. pp. 75–88. Springer (2014)
24. Mannhardt, F., De Leoni, M., Reijers, H.A., van der Aalst, W.M.P., Toussaint, P.J.: From low-level events to activities - a pattern-based approach. In: International Conference on Business Process Management. pp. 125–141. Springer (2016)
25. Pegoraro, M., van der Aalst, W.M.P.: Mining uncertain event data in process mining. In: 2019 International Conference on Process Mining (ICPM). pp. 89–96. IEEE (2019)
26. Pegoraro, M., Uysal, M.S., van der Aalst, W.M.P.: Discovering process models from uncertain event data. In: International Conference on Business Process Management. pp. 238–249. Springer (2019)
27. Pegoraro, M., Uysal, M.S., van der Aalst, W.M.P.: Efficient construction of behavior graphs for uncertain event data. In: International Conference on Business Information Systems. Springer (2020)
28. Pegoraro, M., Uysal, M.S., van der Aalst, W.M.P.: Efficient time and space representation of uncertain event data. Algorithms (11), 285–312 (2020)
29. Rozinat, A., van der Aalst, W.M.P.: Conformance checking of processes based on monitoring real behavior. Information Systems (1), 64–95 (2008)
30. Sani, M.F., van Zelst, S.J., van der Aalst, W.M.P.: Improving process discovery results by filtering outliers using conditional behavioural probabilities. In: International Conference on Business Process Management. pp. 216–229. Springer (2017)
31. Sani, M.F., van Zelst, S.J., van der Aalst, W.M.P.: Repairing outlier behaviour in event logs. In: International Conference on Business Information Systems. pp. 115–131. Springer (2018)
32. Senderovich, A., Rogge-Solti, A., Gal, A., Mendling, J., Mandelbaum, A.: The road from sensor data to process instances via interaction mining. In: International Conference on Advanced Information Systems Engineering. pp. 257–273. Springer (2016)
33. Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic databases. Synthesis Lectures on Data Management (2), 1–180 (2011)
34. Suriadi, S., Andrews, R., ter Hofstede, A.H., Wynn, M.T.: Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs. Information Systems, 132–150 (2017)
35. Wang, J., Song, S., Lin, X., Zhu, X., Pei, J.: Cleaning structured event logs: A graph repair approach. In: Data Engineering (ICDE), 2015 IEEE 31st International Conference on. pp. 30–41. IEEE (2015)
36. Winskel, G.: Petri nets, algebras, morphisms, and compositionality. Information and Computation 72