Discovering Business Area Effects to Process Mining Analysis Using Clustering and Influence Analysis

Abstract

A common challenge for improving business processes in large organizations is that business people in charge of the operations lack a fact-based understanding of the execution details, process variants, and exceptions taking place in business operations. While existing process mining methodologies can discover these details based on event logs, it is challenging to communicate the process mining findings to business people. In this paper, we present a novel methodology for discovering business areas that have a significant effect on the process execution details. Our method uses clustering to group similar cases based on process flow characteristics and then influence analysis for detecting those business areas that correlate most with the discovered clusters. Our analysis serves as a bridge between BPM people and business people, facilitating the knowledge sharing between these groups. We also present an example analysis based on publicly available real-life purchase order process data.


Teemu Lehto and Markku Hinkka
QPR Software Plc, Finland
Aalto University, School of Science, Department of Computer Science, Finland


Keywords: process mining, clustering, influence analysis, contribution, business area, classification rule mining, data mining

Process mining helps organizations to improve their operations by providing valuable information about the business processes in an easy-to-understand visual flowchart format based on transactional data in ERP systems. However, the data extracted from ERP systems often contains different kinds of objects, like 'apples and oranges', that should be analyzed separately for the results to be meaningful. Using the procurement process as an example: the purchase order database tables may contain several different kinds of purchase order items like services, equipment, raw materials, software licenses, high-cost items, free items, headquarter purchases, plant maintenance costs, manually approved items, and automatic replenishment purchases. Without appropriate tools the process analyst needs to either a) analyze all items separately, leading to a potentially massive amount of work, b) analyze all items at the same time, leading to potentially meaningless results, or c) rely on subjective information, such as asking business people which of the items should be analyzed separately or relying on the analyst's own intuition. The techniques presented in this paper help the analyst to discover those business areas (classification rules) that seem to have a major effect on the business process flow. These business areas are based on case attribute characteristics of the cases and are thus easy to understand for business people. Discovered business areas can be used to effectively guide the process mining analysis further in a divide-and-conquer manner.

In this paper, we present methods to answer these three questions:

– How can a business process be analyzed based on the process flow of individual process instances in order to discover business-relevant clusters in such a way that a business analyst can easily understand the clustering results and use them for further analysis?
– How to find business areas that have a major effect on process flow behavior?
– How to further consolidate business area results to discover case attributes that have a significant effect on process flow behavior?

The rest of this paper is structured as follows: Section 2 is a summary of the latest developments. Section 3 presents our methodology for discovering business area effects. Section 4 is a case study with real-life purchase order process data. Section 5 shows limitations and Section 6 draws the final conclusions.

Process mining is an active research area that analyzes business processes based on the event log data from IT systems in order to discover, monitor, and improve processes [16]. Process mining typically focuses on discovering the process flowchart as a control flow diagram, Petri net, or BPMN diagram. Other process mining types include conformance checking and enhancement. Root cause analysis as part of process mining has been studied in [14] as well as in our previous works [8] and [9].

One key challenge in process mining is that a single event log may often contain many different processes, in which case trying to discover a single process diagram for the whole log file is not a working solution. In the process mining context, clustering has been studied extensively with excellent results [5], [12], [3] and [15]. These previous works cover the usage of several distance measures like Euclidean, Hamming, Jaccard, Cosine, Markov chain, and Edit-Distance, as well as several clustering approaches like partitioning, hierarchical, density-based, and neural network. However, most of the previous research related to clustering within the process mining field has been directly focused on the process flowchart discovery, with the prime objectives categorized as Process, Variant or Outlier Identification, Understandability of Complexity, Decomposition, or Hierarchization. In practice, this means that clustering has been used as a tool for making other process mining methods, like control flow discovery, work better, i.e., clustering has divided the event log into smaller sub logs that have been directly used for further analysis. In this paper, we show how to use clustering for discovering those business areas that have a significant effect on process behavior.

Yet another use case for clustering in the process mining field has been to perform structural feature selection in order to improve the prediction accuracy and performance [6].

Some recent research has started to address the challenge of how to explain the clustering results to business analysts [2]. It has been shown that when explaining the characteristics of clusters to business analysts, the role of case attributes becomes more important [11]. We show an easy-to-understand representation of cluster characteristics based on the difference of densities and case attribute information.

Substantial effort has also been spent in the process mining community to discover branching conditions from business process execution logs [4]. This has also led to the introduction of decision models and decision mining [1] as well as a standard Decision Model and Notation (DMN) [10]. While the objective of decision modeling is to provide additional details on individual branching conditions, our approach is to analyze the effect of any business area on the whole structure of the process flow, not just one decision branch at a time.

In this section, we present our methodology for Discovering Business Area Effects to Process Mining Analysis Using Clustering and Influence Analysis. Our approach is to do the clustering using process flow features and then use influence analysis to find those business areas that have the highest contribution to certain kinds of cases ending up in distinct clusters. If all process instance-specific business area values derived using any given case attribute are distributed randomly, then the contribution measure for each business area is zero, and the information for the analyst is that the particular case attribute does not correlate with the way the clusters are formed. According to our methodology, it then means that the particular case attribute has no influence on the process flow behavior. In summary, our method finds those business areas and case attributes that have the highest contribution to the process flow behavior.

To identify those business areas that have the strongest effect on the process execution, we first run clustering using relevant features representing the process execution characteristics. These features have been widely studied in trace clustering papers [12], [15], [6] and [7]. Clustering is a trade-off between quality and performance: as the number of features is increased, the quality of the results potentially improves while the clustering gets slower.

– Activity profile: This profile contains one feature for each Event Type label in the data. The value of this feature is related to the number of occurrences of that particular event type within the case. If the number of occurrences is used as an exact value, then the clustering algorithm needs to take the continuous values into account, i.e., repeating activity A seven times is much more similar to repeating it six or eight times than to repeating it only twice. One approach is to use the value zero if the event log contains no occurrences of the Event Type for the given case and one if the log contains one or more occurrences. While this approach often works well, it cannot detect that a given Event Type is repeated multiple times within a case. For this reason, we recommend using the value zero for no occurrences of the Event Type, one for exactly one occurrence, and two for two or more occurrences.
– Transition profile: The transition profile captures all process flows from every activity to the next activity. In effect, it contains the process control flow information. The transition profile potentially provides a large number of features, up to the square of the number of Event Types plus one for start and end transitions. For example, in the sample analysis presented in Section 4, we have 42 distinct event types, giving potentially $43^2 = 1849$ distinct transitions. Fortunately, the control flow for 251,734 cases only contains 676 distinct transitions. Because the number of transition features is high, we recommend using the coding zero if the transition does not occur in the case and one if it occurs once or more.

Clustering Algorithms

A comparative analysis of process instance clustering techniques has been presented in [15]. It shows how various clustering techniques have been used to separate different process variants from a large set of cases as well as to reduce complexity by grouping similar cases into the same clusters. Considering our method, the main functional requirement for the clustering algorithm is that it needs to put cases with similar process flow behavior into the same clusters, and all 20 approaches listed in [15] meet this requirement. If a particular clustering algorithm produces meaningful results and if there indeed is a correlation with a particular business area, then our method gives very high contribution values for that business area. If the clustering algorithm does not work perfectly but is still to some extent capable of grouping similar cases together, then the contribution values are still likely to show the most significant business areas among the top contributors. The essential non-functional requirement for the clustering algorithm is performance, i.e., the ability to produce results fast with a small amount of memory. With these considerations, we have received good results with the algorithms and parameters below:

– One-hot encoding. Since our Activity and Transition feature profiles only include the categorical values zero, one, and two, it is possible to use efficient one-hot encoding. This results in a maximum of $(n(\mathit{EventTypes}) + 1)^2 + 2 \cdot n(\mathit{EventTypes})$ features.
– Hamming distance is the natural choice as the distance function with binary data like one-hot encoded features, because it completely avoids the floating-point distance calculations needed for the common Euclidean distance measure.
– K-modes clustering algorithm is suitable for categorical data. In our tests, k-modes produced well-balanced clusters and was fast to execute. The result of k-modes depends on the initialization of the cluster centers. We also tested agglomerative clustering algorithms, but they produced highly unbalanced clusters.
– Number of clusters has a significant effect on the clustering. To discover the business areas, clustering should be done several times with different numbers of clusters. We found that clustering four times, into 2, 3, 5, and 10 clusters, gave enough variation to provide meaningful results. When the number of clusters is less than five, the large business areas correlate more with the clustering. When clustering to 10 or more clusters, the smaller business areas from dimensions with more distinct values, like Vendor, Customer, or Product, correlate more with the clusters. Running the clustering several times is also an easy way to mitigate the random behavior of k-modes coming from initialization.
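The feature profiles and the one-hot plus Hamming setup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the `<start>`/`<end>` markers, and the toy traces are hypothetical, and the sketch spends three binary columns per event type on the activity profile for simplicity.

```python
START, END = "<start>", "<end>"  # hypothetical markers for start and end transitions


def activity_profile(trace, event_types):
    """0 = no occurrence, 1 = exactly one occurrence, 2 = two or more."""
    return [min(trace.count(e), 2) for e in event_types]


def transition_profile(trace, transitions):
    """1 if the direct transition occurs in the case at least once, else 0."""
    steps = [START] + list(trace) + [END]
    seen = set(zip(steps, steps[1:]))
    return [1 if t in seen else 0 for t in transitions]


def one_hot(values, levels=(0, 1, 2)):
    """Expand each categorical value into one binary column per level."""
    return [1 if v == level else 0 for v in values for level in levels]


def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def encode(trace, event_types, transitions):
    """Binary feature vector: one-hot activity profile + transition profile."""
    return (one_hot(activity_profile(trace, event_types))
            + transition_profile(trace, transitions))


# Toy event log (assumed data): case id -> ordered list of event types.
event_types = ["A", "B", "C"]
traces = {
    "case1": ["A", "B", "C"],
    "case2": ["A", "B", "B", "C"],
    "case3": ["A", "C"],
}

# Only transitions that actually occur in the log become features.
transitions = sorted({t for tr in traces.values()
                      for t in zip([START] + tr, tr + [END])})

vectors = {cid: encode(tr, event_types, transitions) for cid, tr in traces.items()}
d12 = hamming(vectors["case1"], vectors["case2"])  # repetition of B
d13 = hamming(vectors["case1"], vectors["case3"])  # B skipped entirely

# Upper bound on transition features for 42 event types, as quoted in the text.
max_transitions = (42 + 1) ** 2
```

In this toy log, the case that repeats B ends up closer to the reference case than the case that skips B, which is the kind of process flow similarity the clustering step relies on.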

Examples of business area dimensions include: company code, product line, sales unit, delivery team, geographical location, customer group, product group, branch offices, request category and diagnosis code. All the case attributes that are relevant to business can be used as business area dimensions as such, for example, product code. However, a large organization may easily have thousands of low-level product codes in their ERP system, so it is beneficial to have access to the product hierarchy and use each level as a separate business area dimension. Another example of a derived business area dimension is when a case attribute like Logistics Manager Name can be used to identify the Delivery Team. We again suggest having both the Logistics Manager and Delivery Team as business area dimensions; if one particular Logistics Manager has many cases and a major effect on process flow behavior, then our method will show that person as the most significant business area in the Logistics Manager dimension. The third example of derived business areas is to utilize the event attributes. For example, the Logistics Manager Name may be stored as an attribute value for the Delivery Planning Done activity. If there is always at most one Delivery Planning Done activity, then the attribute value can be used as such on the case level. If there are multiple Delivery Planning Done activities, then typical options include: use the first occurrence, use the last occurrence, or use a concatenated comma-separated list of all distinct values from activities as the value on the case level. The outcome of forming business area dimensions is a list of case-level attributes that contain a specific (possibly empty) business area value for each case. To continue with our formal methodology, we now consider these business area dimensions as case attributes and the case attribute values as the corresponding business areas.
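The three options for lifting an event attribute to the case level (first occurrence, last occurrence, concatenated distinct values) can be sketched as below. The event layout (a list of dictionaries with an `activity` key) and the function name `case_level_value` are assumptions made for this illustration only.

```python
def case_level_value(events, activity, attribute, mode="first"):
    """Derive one case-level business area value from event attributes.

    events: events of a single case in execution order (hypothetical layout:
            dicts with an "activity" key plus optional attribute keys).
    mode:   "first", "last", or "concat" (comma-separated distinct values).
    Returns "" when the activity never occurs, i.e. an empty business area.
    """
    values = [e[attribute] for e in events
              if e["activity"] == activity and e.get(attribute) is not None]
    if not values:
        return ""
    if mode == "first":
        return values[0]
    if mode == "last":
        return values[-1]
    if mode == "concat":
        # dict.fromkeys keeps distinct values in first-seen order
        return ",".join(dict.fromkeys(values))
    raise ValueError(f"unknown mode: {mode}")


# Toy case with three Delivery Planning Done events (assumed data).
events = [
    {"activity": "Delivery Planning Done", "Logistics Manager Name": "Alice"},
    {"activity": "Goods Shipped"},
    {"activity": "Delivery Planning Done", "Logistics Manager Name": "Bob"},
    {"activity": "Delivery Planning Done", "Logistics Manager Name": "Alice"},
]

first = case_level_value(events, "Delivery Planning Done", "Logistics Manager Name", "first")
last = case_level_value(events, "Delivery Planning Done", "Logistics Manager Name", "last")
combined = case_level_value(events, "Delivery Planning Done", "Logistics Manager Name", "concat")
missing = case_level_value(events, "Invoice Sent", "Logistics Manager Name")
```

Which option fits best depends on the business question; the concatenated form keeps all information but produces more distinct business areas.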

Interestingness Measures

We now present the definitions for interestingness measures used for finding the business areas that correlate with the clustering results. Let $C = \{c_1, \dots, c_N\}$ be the set of cases in the process analysis. Each case represents a single business process execution instance. Let $P = \{p_1, \dots, p_N\}$ be a set of clusters, each formed by clustering the cases in $C$. $C_p = \{c_{p_1}, \dots, c_{p_N}\}$ is the set of cases belonging to cluster $p$, with $C_p \subseteq C$. Similarly, $C_a = \{c_{a_1}, \dots, c_{a_N}\}$ is the set of cases belonging to the same business area $a$, i.e., they have the same value for the underlying case attribute.

Definition 1. Let Density $\rho(a, C) = \frac{n(C_a)}{n(C)}$, where $n(C_a)$ is the total number of cases belonging to the business area $a$ and $n(C)$ is the total number of all cases in the whole process analysis. Similarly, the Density $\rho(a, C_p) = \frac{n(C_p \cap C_a)}{n(C_p)}$ is the density of cases belonging to the business area $a$ within the cluster $p$.

Definition 2. Let $\mathit{Contribution\%}(a \to p) = \rho(a, C_p) - \rho(a, C) = \frac{n(C_p \cap C_a)}{n(C_p)} - \frac{n(C_a)}{n(C)}$ be the extra density of cases belonging to the business area $a$ in the cluster $p$ compared to the average density.

If business area $a$ is equally distributed across all clusters, then $\mathit{Contribution\%}(a \to p)$ is close to zero in each cluster. If the business area $a$ is a typical property in a particular cluster $p_i$ and a rare property in other clusters, then $\mathit{Contribution\%}(a \to p_i)$ is positive and the other values $\mathit{Contribution\%}(a \to p_j)$, $j \neq i$, are negative. Within one clustering, the sum of the Contribution% values over all clusters, each weighted by its cluster size $\frac{n(C_p)}{n(C)}$, is always zero, so the extra density in some clusters is always balanced by the smaller than average density in other clusters.

We now want to find the business areas that have a high contribution in many clusterings. We define:

Definition 3. Let $\mathit{BusinessAreaContribution}(a) = \sum_{p_i \in P} \frac{n(C_{p_i})}{n(C)} \left(\max\{\mathit{Contribution\%}(a \to p_i), 0\}\right)^2$.

Here we sum the weighted squares of all positive contributions the business area $a$ has with any cluster $p_i$. Positive values of $\mathit{Contribution\%}(a \to p_i)$ indicate a positive correlation between the business area $a$ and the particular cluster $p_i$, while negative values indicate that the business area $a$ has a smaller than average density in the cluster $p_i$. We found that using only the positive correlations gives more meaningful results when consolidating to the business area level. Since a few high contributions are relatively more important than many small contributions, we use the variance of the density differences, i.e., we take the square of $\mathit{Contribution\%}(a \to p_i)$. Since a contribution within a small cluster is less important than a contribution in a large cluster, we also use the cluster size based weight $\frac{n(C_{p_i})}{n(C)}$.

Any particular business area $a$ may have a substantial contribution in some clusters and a small contribution in others, so the sum over all clusters gives the overall correlation between business area $a$ and all clusters $p_i \in P$.

We use the term business area in this paper for any combination of a process mining case attribute and a distinct value for that particular case attribute. BusinessAreaContribution thus identifies the individual case attribute-value combinations that have the highest effect on clustering results. The results can then be consolidated further to the case attribute level (Definitions 4 and 5).
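Definitions 1 to 3 translate almost directly into set operations. The sketch below is an illustration under the assumption that cases are represented as Python sets of case identifiers; the function names and the ten-case toy data are hypothetical, and `clusters` may hold the clusters of several clustering runs.

```python
def density(area_cases, cases):
    """rho(a, C): share of `cases` that belong to business area a."""
    return len(area_cases & cases) / len(cases)


def contribution(area_cases, cluster_cases, all_cases):
    """Contribution%(a -> p) = rho(a, C_p) - rho(a, C)."""
    return density(area_cases, cluster_cases) - density(area_cases, all_cases)


def business_area_contribution(area_cases, clusters, all_cases):
    """Sum over clusters of n(C_p)/n(C) * max(Contribution%(a -> p), 0)^2."""
    total = 0.0
    for cluster_cases in clusters:
        weight = len(cluster_cases) / len(all_cases)
        positive = max(contribution(area_cases, cluster_cases, all_cases), 0.0)
        total += weight * positive ** 2
    return total


# Toy data (assumed): ten cases, one clustering into two clusters.
all_cases = set(range(10))
clusters = [set(range(0, 6)), set(range(6, 10))]
area = {0, 6, 7, 8}  # cases that share one business area value

contrib_small_cluster = contribution(area, clusters[1], all_cases)  # 0.75 - 0.40
score = business_area_contribution(area, clusters, all_cases)       # 0.4 * 0.35^2

# Size-weighted contributions cancel out within one clustering.
weighted_sum = sum(len(c) / len(all_cases) * contribution(area, c, all_cases)
                   for c in clusters)
```

Only the second cluster adds to the score here, because the business area is under-represented in the first cluster and negative contributions are clipped to zero.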

Definition 4. Let $AT = \{at_1, \dots, at_N\}$ be a set of case attributes in the process analysis. Each case $c_i \in C$ has a value $at_{j,c_i}$ for each case attribute $at_j \in AT$, and $V_{at_j} = \{v_{at_j,1}, \dots, v_{at_j,N}\}$ is the set of distinct values that the case attribute $at_j$ has in the process analysis.

Definition 5. Let $\mathit{CaseAttributeContribution}(at_j)$ be the sum of the BusinessAreaContributions of all the business areas corresponding to the given case attribute $at_j$: $\mathit{CaseAttributeContribution}(at_j) = \sum_{v_{at_j,i} \in V_{at_j}} \mathit{BusinessAreaContribution}(at_{j,v_{at_j,i}})$, where $at_{j,v_{at_j,i}}$ denotes the business area in which case attribute $at_j$ has the value $v_{at_j,i}$.

In this section, we apply our method to the real-life purchase order process data from a large multinational company from the Netherlands operating in the area of coatings and paints. The data is publicly available as the BPI Challenge 2019 [17] dataset. We made the following choices:

– Source data: We imported the data from the XES file as such without any modifications. To keep the execution times short, we experimented with the effect of running the analysis with a sample of the full dataset. Our experiments showed that the results remained consistent for sample sizes of 10,000 cases and more. With a sample size of 1,000 cases, the results of the individual analysis runs started to change, so we decided to keep the sample size at 10,000 cases.
– Clustering algorithm: We used the k-modes clustering as implemented in the Accord.NET Machine Learning Framework [13] with one-hot encoding and the Hamming distance function. To take into account the different clustering sizes, we performed clustering four times, fixed to two, three, five, and ten clusters.
– Activity profile features for clustering: We used our default three-valued activity profile, which creates one feature dimension for each activity; the value is zero if the activity does not occur in the case, one if the activity occurs once, and two if it is repeated multiple times. There were 37 different activities in the sample, and the Top 20 activity profile is shown in Table 1.
– Transition profile features for clustering: Using a typical process mining analysis to discover the process flow diagram, we discovered 376 different direct transitions, including 13 starting activities, 22 ending activities, and 341 direct transitions between two unique activities. All of these 376 features were used as dimensions for clustering in a similar way as the activity profile, i.e., boolean value zero if the transition did not occur in the case and one if it occurred once or multiple times.
– Business area dimensions: Since we did not have any additional information or hierarchy tables concerning possible business areas, we are using all 15 available distinct case attributes listed in Table 4 as business area dimensions. These case attributes have a total of 9901 distinct values, giving us 9901 business areas.


Table 1. Activity profile: Top 20 activities ordered by unique occurrence count

Name                                  Unique Count    Count
Create Purchase Order Item                  10,000   10,000
Record Goods Receipt                         9,333   13,264
Record Invoice Receipt                       8,370    9,214
Vendor creates invoice                       8,310    8,901
Clear invoice                                7,245    7,704
Remove Payment Block                         2,223    2,272
Create Purchase Requisition Item             1,901    1,901
Receive Order Confirmation                   1,321    1,321
Change Quantity                                707      853
Change Price                                   443      498
Delete Purchase Order Item                     338      339
Cancel Invoice Receipt                         251      271
Vendor creates debit memo                      244      253
Record Service Entry Sheet                     232   10,326
Change Approval for Purchase Order             194      319
Change Delivery Indicator                      112      128
Cancel Goods Receipt                           109      136
SRM: In Transfer to Execution Syst.             42       57
SRM: Awaiting Approval                          42       50
SRM: Complete                                   42       50
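The clustering configuration used in this case study (k-modes with Hamming distance, repeated for several cluster counts) can be illustrated with a small pure-Python sketch. It is a simplified stand-in for the Accord.NET implementation cited in the paper, not a port of it: the function names, the optional `init` parameter, the tie-breaking rule, and the toy rows are all assumptions made for this example.

```python
import random


def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def column_modes(rows):
    """Most frequent value per column (ties broken by the smallest value)."""
    modes = []
    for column in zip(*rows):
        counts = {}
        for value in column:
            counts[value] = counts.get(value, 0) + 1
        modes.append(min(counts, key=lambda v: (-counts[v], v)))
    return modes


def k_modes(rows, k, init=None, seed=0, max_iter=20):
    """Assign rows to the nearest mode, recompute modes, repeat until stable."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(rows, k)
    centers = [list(c) for c in start]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: hamming(row, centers[j]))
                      for row in rows]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for j in range(k):
            members = [row for row, label in zip(rows, labels) if label == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = column_modes(members)
    return labels, centers


# Toy binary feature vectors (assumed data): two visibly different groups.
rows = [
    [1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0],
]

# Fixed initial centers make this run reproducible.
labels, centers = k_modes(rows, k=2, init=[rows[0], rows[3]])

# The paper clusters with k = 2, 3, 5 and 10; this toy log only supports small k.
runs = {k: k_modes(rows, k, seed=k)[0] for k in (2, 3)}
```

Because the random initialization affects the outcome, repeating the run with several values of k, as the paper does, also smooths out the effect of an unlucky start.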

Table 2 shows the results of clustering to a fixed five clusters. We see that the first cluster contains 48% of cases, the second cluster 33%, the third 17%, and the fourth and fifth one percent each. For each cluster we show the five most important business areas based on Contribution%, which is calculated as the difference between the cluster-specific density of that business area and the total density. These results already give hints about the meaningful characteristics in the whole dataset: Cluster 1 contains many Standard cases from spend areas related to Sales, Products for Resale and NPR. Cluster 2, on the other hand, contains a higher than average share of cases from spend area Packaging, related to Labels and PR. VendorID 0120 seems to be highly associated with the process flow characteristics of cluster 2. Cluster 3 is dominated by Consignment cases. Cluster 4 contains many Metal Containers & Lids cases as well as cases from VendorIDs 0404 and 0104. Further analysis of the top five business areas listed as characteristics for each cluster confirms that these business areas indeed give a good overall idea of the cases allocated into each cluster.

Table 2. Clustering results based on Contribution

Cluster          Business Area a                                 Cluster Density  Total Density  Contribution
Cluster 1 (48%)  Spend area text = Sales                                    0.36           0.26          0.11
Cluster 1 (48%)  Sub spend area text = Products for Resale                  0.34           0.24          0.11
Cluster 1 (48%)  Spend classification text = NPR                            0.41           0.32          0.10
Cluster 1 (48%)  Item Type = Standard                                       0.96           0.87          0.09
Cluster 1 (48%)  Item Category = 3-way match, invoice before GR             0.95           0.88          0.07
Cluster 2 (33%)  Spend area text = Packaging                                0.65           0.44          0.21
Cluster 2 (33%)  Sub spend area text = Labels                               0.39           0.24          0.16
Cluster 2 (33%)  Spend classification text = PR                             0.79           0.66          0.13
Cluster 2 (33%)  Name = vendor 0119                                         0.14           0.05          0.08
Cluster 2 (33%)  Vendor = vendorID 0120                                     0.14           0.05          0.08
Cluster 3 (17%)  Item Category = Consignment                                0.33           0.06          0.27
Cluster 3 (17%)  Item Type = Consignment                                    0.33           0.06          0.27
Cluster 3 (17%)  Name = vendor 0185                                         0.09           0.02          0.08
Cluster 3 (17%)  Vendor = vendorID 0188                                     0.09           0.02          0.08
Cluster 3 (17%)  Item = 10                                                  0.33           0.26          0.07
Cluster 4 (1%)   Sub spend area text = Metal Containers & Lids              0.19           0.08          0.11
Cluster 4 (1%)   Name = vendor 0393                                         0.09           0.01          0.08
Cluster 4 (1%)   Vendor = vendorID 0404                                     0.09           0.01          0.08
Cluster 4 (1%)   Name = vendor 0104                                         0.11           0.04          0.07
Cluster 4 (1%)   Vendor = vendorID 0104                                     0.11           0.04          0.07
Cluster 5 (1%)   Spend classification text = NPR                            0.59           0.32          0.27
Cluster 5 (1%)   Spend area text = Sales                                    0.41           0.26          0.15
Cluster 5 (1%)   GR-Based Inv. Verif. = TRUE                                0.21           0.06          0.15
Cluster 5 (1%)   Item Category = 3-way match, invoice after GR              0.21           0.06          0.15
Cluster 5 (1%)   Sub spend area text = Products for Resale                  0.38           0.24          0.14

We clustered four times with fixed cluster counts of 2, 3, 5 and 10, yielding a total of 20 clusters, and then consolidated the results to the business area level using Definition 3. The top 20 of all these 9901 business areas, ordered by their respective BusinessAreaContribution, are shown in Table 3. Clearly, the business areas Item Category = Consignment and Item Type = Consignment have the most significant effect on the process flow. Looking at the actual process model, we see that Consignment cases completely avoid three of the five most common activities in the process, namely Record Invoice Receipt, Vendor creates invoice and Clear Invoice. Similarly, the business area Spend area text = Packaging also has a high correlation with process flow characteristics. Analysis of the process model shows that, for example, 23% of Packaging cases contain the activity Receive Order Confirmation compared to only 5% of the other cases. Further analysis of all the business areas listed in Table 3 shows that each of these areas has some distinctive process flow behavior that is more common in that area compared to the other business areas.
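As a worked check of Definitions 2 and 3 against the rounded values reported in Table 2, consider the business area Item Category = Consignment in cluster 3 of the five-cluster run. The weight 0.17 is the share of cases reported for that cluster. The second line is only one of the 20 terms summed in Definition 3, so it is not expected to match the Table 3 value on its own, and all figures inherit the rounding of the published tables.

```latex
\begin{align*}
\mathit{Contribution\%}(a \to p_3)
  &= \rho(a, C_{p_3}) - \rho(a, C) = 0.33 - 0.06 = 0.27 \\
\frac{n(C_{p_3})}{n(C)} \left(\max\{\mathit{Contribution\%}(a \to p_3),\, 0\}\right)^2
  &\approx 0.17 \times 0.27^2 \approx 0.012
\end{align*}
```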

Table 3. Top 20 business areas with major effect on process flow

Business Area a                                 Contribution   nCases n(C_a)
Item Category = Consignment                            0.051             576
Item Type = Consignment                                0.051             576
Spend area text = Packaging                            0.040            4382
Spend classification text = NPR                        0.024            3175
Sub spend area text = Labels                           0.022            2351
Spend area text = Sales                                0.021            2574
Item Type = Standard                                   0.021            8740
Sub spend area text = Products for Resale              0.021            2390
Spend classification text = PR                         0.019            6574
Item Category = 3-way match, invoice before GR         0.017            8760
Spend area text = Logistics                            0.013             210
Item Type = Service                                    0.013             244
Item = 1                                               0.012             342
GR-Based Inv. Verif. = TRUE                            0.012             623
Item Category = 3-way match, invoice after GR          0.012             625
Name = vendor 0119                                     0.007             549
Vendor = vendorID 0120                                 0.007             549
Sub spend area text = Road Packed                      0.006             145
Name = vendor 0185                                     0.004             163
Vendor = vendorID 0188                                 0.004             163

Finally, Table 4 consolidates individual business areas to the case attribute level. Item Type with six distinct values and Item Category with four distinct values have the most significant effects on process flow characteristics. To confirm the validity of these results, we further analyzed the materials provided on the BPI Challenge 2019 website, including the background information and submission reports [17]. It is clear that Item Type and Item Category can indeed be regarded as the most important factors explaining the process flow behavior, as they are specifically mentioned to roughly divide the cases into four types of flows in the data. It is also interesting to see that Spend area text and Sub spend area text have a significant effect on the process flow even though they have a much higher number of distinct values (19 and 115) compared to Spend classification text, which only has four distinct values.
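The consolidation step of Definition 5 is a plain group-and-sum over business areas. The sketch below applies it to the Item Type and Item Category rows that appear in Table 3; the function name and the dictionary layout are assumptions for illustration. Since only top-20 business areas are listed there, and the published numbers are rounded, the sums can only approximate the Table 4 values.

```python
from collections import defaultdict


def case_attribute_contribution(business_area_scores):
    """Sum BusinessAreaContribution over all values of each case attribute.

    business_area_scores maps (attribute, value) -> BusinessAreaContribution.
    """
    totals = defaultdict(float)
    for (attribute, _value), score in business_area_scores.items():
        totals[attribute] += score
    return dict(totals)


# Rows copied from Table 3 (only business areas that made the top 20).
table3_rows = {
    ("Item Type", "Consignment"): 0.051,
    ("Item Type", "Standard"): 0.021,
    ("Item Type", "Service"): 0.013,
    ("Item Category", "Consignment"): 0.051,
    ("Item Category", "3-way match, invoice before GR"): 0.017,
    ("Item Category", "3-way match, invoice after GR"): 0.012,
}

totals = case_attribute_contribution(table3_rows)
item_type_total = totals["Item Type"]          # three of six values listed
item_category_total = totals["Item Category"]  # three of four values listed
```

The partial sums come out at about 0.085 for Item Type and 0.080 for Item Category, close to the 0.086 and 0.080 reported for the full attributes, which suggests that the values missing from the top 20 add very little.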

Table 4. Case attributes ordered by effect on process flow

Case Attribute at           Contribution  Distinct Values n(V_at)
Item Type                          0.086                        6
Item Category                      0.080                        4
Spend area text                    0.077                       19
Sub spend area text                0.056                      115
Spend classification text          0.043                        4
Name                               0.025                      798
Vendor                             0.025                      840
Item                               0.016                      167
GR-Based Inv. Verif.               0.012                        2
Purchasing Document                0.002                     7937
Document Type                      0.000                        3
Goods Receipt                      0.000                        2
Company                            0.000                        2
Source                             0.000                        1
Purch. Doc. Category name          0.000                        1

Forming business area dimensions is an essential step in our method. However, some relevant business areas may consist of several dimensions, for example, the process flow behavior could be very distinctive in a particular combination of business areas SalesOffice=Spain and ProductGroup=Computers. Automatically detecting this kind of significant combined business areas would be a useful feature. Another limitation is that the process flow behavior does not take into account the performance profile, i.e., the lead times between individual activities and the total case duration. Although the usage of this kind of numerical information would require a more advanced clustering technique, the influence analysis part of the method presented in this paper would already handle the discovery of related business areas.

In this paper, we have presented a method for discovering those business areas that have a significant effect on process flow behavior based on clustering and influence analysis. As a summary of our findings:

– Our presented method is capable of discovering those business areas that have the most significant effect on the process execution. Our method provides valuable information to business people who are very familiar with case attributes and attribute values but not so familiar with the often technical event type names extracted from transactional system log files.
– Our method supports any available trace clustering method. Our case study shows that using the k-modes clustering algorithm with activity and transition profiles provides good results.
– Clustering makes the analysts realize that not all the cases in the process model are similar. The Contribution% measure works well for explaining the clustering results to business people.
– The case study presented in this paper confirms that the identified business areas indeed have distinctive process flow behavior, for example missing activities, a higher than average number of some special activities, or a distinctive execution sequence for activities. Using our method, the business analyst may now divide the process model into smaller subsets and analyze them separately. It is a good idea to start the analysis of any process subset again by running the clustering to see if the cases are similar enough from the process flow point of view.
– Clustering reduces the need for external subject matter business experts. Naturally, it would be nice to have a person who can explain everything, but in real life, those persons are very busy, and some important details are always likely to be forgotten by busy business people.

Acknowledgements.

We thank QPR Software Plc for the practical experiences from a wide variety of customer cases and for funding our research. The algorithms presented in this paper have been implemented in a commercial process mining tool, QPR ProcessAnalyzer.

References

1. Bazhenova, E., & Weske, M. (2016, September). Deriving decision models from process models by enhanced decision mining. In International Conference on Business Process Management (pp. 444-457). Springer, Cham.
2. De Koninck, P., De Weerdt, J., & Vanden Broucke, S. K. (2017). Explaining clusterings of process instances. Data Mining and Knowledge Discovery, 31(3), 774-808.
3. De Leoni, M., Van Der Aalst, W. M., & Dees, M. (2016). A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs. Information Systems, 56, 235-257.
4. De Leoni, M., Dumas, M., & García-Bañuelos, L. (2013, March). Discovering branching conditions from business process execution logs. In International Conference on Fundamental Approaches to Software Engineering (pp. 114-129). Springer, Berlin, Heidelberg.
5. De Medeiros, A. K. A., Guzzo, A., Greco, G., Van Der Aalst, W. M., Weijters, A. J. M. M., Van Dongen, B. F., & Saccà, D. (2007, September). Process mining based on clustering: A quest for precision. In International Conference on Business Process Management (pp. 17-29). Springer, Berlin, Heidelberg.
6. Hinkka, M., Lehto, T., Heljanko, K., & Jung, A. (2017, September). Structural feature selection for event logs. In International Conference on Business Process Management (pp. 20-35). Springer, Cham.
7. Hinkka, M., Lehto, T., Heljanko, K., & Jung, A. (2018, September). Classifying process instances using recurrent neural networks. In International Conference on Business Process Management (pp. 313-324). Springer, Cham.
8. Lehto, T., Hinkka, M., & Hollmén, J. (2016, September). Focusing business improvements using process mining based influence analysis. In International Conference on Business Process Management (pp. 177-192). Springer, Cham.
9. Lehto, T., Hinkka, M., & Hollmén, J. (2017). Focusing business process lead time improvements using influence analysis. In International Symposium on Data-Driven Process Discovery and Analysis (SIMPDA) (pp. 54-67). Rheinisch-Westfaelische Technische Hochschule Aachen.
10. OMG: Decision Model and Notation (DMN) v.1.2, 2019.
11. Seeliger, A., Nolle, T., & Mühlhäuser, M. (2018, September). Finding structure in the unstructured: hybrid feature set clustering for process discovery. In International Conference on Business Process Management (pp. 288-304). Springer, Cham.
12. Song, M., Günther, C. W., & Van Der Aalst, W. M. (2008, September). Trace clustering in process mining. In International Conference on Business Process Management (pp. 109-120). Springer, Berlin, Heidelberg.
13. Souza, C. R. (2014). The Accord.NET Framework. São Carlos, Brazil. http://accord-framework.net