19, 20, 23] also allows for concurrency. It uses stochastic task graphs as an intermediate representation and generates a workflow model described in the ADONIS modeling language. In the induction step, task nodes are merged and split in order to discover the underlying process. A notable difference from other approaches is that the same task can appear multiple times in the workflow model, i.e., the approach allows for duplicate tasks. The graph generation technique is similar to the approach of [7, 33]. The nature of splits and joins (i.e., AND or OR) is discovered in the transformation step, where the stochastic task graph is transformed into an ADONIS workflow model with block-structured splits and joins. In contrast to the previous papers, our work [31, 32, 48–50] is characterized by its focus on workflow processes with concurrent behavior (rather than adding ad-hoc mechanisms to capture parallelism). In [48–50] a heuristic approach using rather simple metrics is used to construct so-called "dependency/frequency tables" and "dependency/frequency graphs". Another variant of this technique has been presented using examples from the health-care domain. The preliminary results presented in [31, 48–50] only provide heuristics and focus on issues such as noise. The approach described in this paper differs from these approaches in the sense that for the α algorithm it is proven that, for certain subclasses, it is possible to find the right workflow model. The EMiT tool uses an extended version of the α algorithm to incorporate timing information; note, however, that the corresponding paper contains neither a detailed description of the α algorithm nor a proof of its correctness.
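The dependency/frequency metrics of [48–50] are not spelled out in this excerpt. A commonly cited form of the heuristic dependency measure, shown here as a sketch (the exact definition in those papers may differ), counts how often activity a is directly followed by b, written |a>b|, and compares the two directions:

```python
from collections import Counter

def direct_successions(log):
    """Build the frequency table |a>b| of direct successions in the log."""
    succ = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            succ[(a, b)] += 1
    return succ

def dependency(succ, a, b):
    """Heuristic dependency measure: close to +1 when a almost always
    directly precedes b and rarely the other way around."""
    ab, ba = succ[(a, b)], succ[(b, a)]
    return (ab - ba) / (ab + ba + 1)

log = [["a", "b", "c", "d"], ["a", "c", "b", "d"], ["a", "b", "c", "d"]]
succ = direct_successions(log)
print(round(dependency(succ, "a", "b"), 2))  # 0.67: a tends to cause b
print(round(dependency(succ, "b", "c"), 2))  # 0.25: b and c look concurrent
```

A dependency/frequency graph is then obtained by keeping only the arcs whose measure exceeds a chosen threshold.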
Abstract. Workflow mining aims to find graph-based process models based on activities, emails, and various event logs recorded in computer systems. Current workflow mining techniques mainly deal with well-structured and well-symbolized event logs. In most real applications, where workflow management software tools are not installed, such structured and symbolized logs are not available. Instead, the artifacts of daily computer operations may be readily available. In this paper, we propose a method to map these artifacts and content-based logs to structured logs, so as to bridge the gap between the unstructured logs of real-life situations and the status quo of workflow mining techniques. Our method consists of two tasks: discovering workflow instances and discovering activity types. We use a clustering method to tackle the first task and a classification method to tackle the second. We propose a way to combine these two tasks so as to improve the performance of the two as a whole. Experimental results on simulated data show the effectiveness of our method.
2 Related Work
The research in the area of software process mining started with new approaches to the grammar inference problem for event logs [CW98]. Other work from the software domain is in the area of mining software repositories [MSR05]; like our approach, these works use SCM systems, and especially CVS, as a source of input information, but for measuring project activity, detecting and predicting changes in code, advising newcomers to an open-source project, and detecting social dependencies between developers. The first application of "process mining" to the workflow domain was presented by Agrawal in 1998 [AGL98]. This approach models business processes as annotated activity graphs and is restricted to sequential patterns. The approach of Herbst and Karagiannis [HK99] uses machine learning techniques for the acquisition and adaptation of workflow models. The foundational approach to workflow mining was presented by van der Aalst et al. [WvdA01]. Within this approach, formal causality relations between events in logs and the α-mining algorithm for discovering workflow models, together with its improvements, are presented. In addition to the software process and business process domains, research on discovering sequential patterns in the area of data mining is also relevant here [AS95]. In comparison to the classical approaches, we do not have logs of activities and so must discover the activities first. We make use of our document-oriented view on the activities, i.e., the process is derived from the inputs and outputs of the activities. We suggest coming up with the model early and refining it when additional information is available.
applying the genetic algorithm. Their approach was based on the discovery of Petri nets, a formalism for representing process models. Bratosin et al. in 2010 improved this approach and tried to decrease the time taken in the model evaluation stage by sampling the event log. In the same year, in another article, they tried to reduce the running time of the algorithm by using a distributed approach. Tsai et al. in 2010 added a time perspective to the genetic algorithm by using the available data on event times in the event log and incorporated
Compared to these approaches, we emphasize application areas where the workflow process model is known. Nevertheless, the proposed warehouse model can also store log data of ad-hoc workflows and may thus serve as a basis for the process mining techniques mentioned above. The focus of our work is to exploit the workflow log and build a data warehouse to obtain aggregated information, e.g., to detect critical process situations or quality degradations under different circumstances, rather than to re-engineer workflow specifications from the log. However, these process mining techniques can deliver important data for discovering typical execution scenarios, dependencies between decisions, and probabilities of workflow instance types. Business process re-engineering and workflow improvement will benefit from a combination of the approaches.
– Second, of the millions of potential constraints, many may be trivially true. For example, the response constraint in Fig. 2 holds for any event log that does not contain events relating to activity a. Moreover, one constraint may dominate another constraint. If the stronger constraint holds (e.g., □(a → ♦b)), then the weaker constraint (e.g., ♦a → ♦b) automatically holds as well. Showing all constraints that hold typically results in unreadable models. This paper addresses these two problems using a two-phase approach. In the first phase, we generate the list of candidate constraints using an Apriori algorithm. This algorithm is inspired by the seminal Apriori algorithm developed by Agrawal and Srikant for mining association rules. The Apriori algorithm uses the monotonicity property that all subsets of a frequent itemset are also frequent. In the context of this paper, this means that sets of activities can only be frequent if all of their subsets are frequent. This observation can be used to dramatically reduce the number of interesting candidate constraints. In the second phase, we further prune the list of candidate constraints by considering only the ones that are relevant (based on the event log) according to (a combination of) simple metrics, such as Confidence and Support, and more sophisticated metrics, such as Interest Factor (IF) and Conditional-Probability Increment Ratio (CPIR), as explained in Section 4. Moreover, discovered constraints with high CPIR values are emphasized like highways on a roadmap, whereas constraints with low CPIR values are greyed out. This further improves the readability of discovered Declare models.
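The monotonicity-based candidate generation can be sketched as a generic level-wise Apriori over sets of activities (the names and thresholds below are illustrative, not those of the paper's implementation):

```python
from itertools import combinations

def frequent_activity_sets(log, min_support):
    """Level-wise Apriori: a set of activities can only be frequent
    if all of its subsets are frequent (monotonicity property)."""
    traces = [set(t) for t in log]
    n = len(traces)
    support = lambda s: sum(1 for t in traces if s <= t) / n
    items = sorted({a for t in traces for a in t})
    level = {frozenset([a]) for a in items if support(frozenset([a])) >= min_support}
    result = set(level)
    k = 1
    while level:
        # candidates of size k+1 are unions of frequent k-sets whose
        # every k-subset is frequent -- this prunes the search space
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = {c for c in candidates if support(c) >= min_support}
        result |= level
        k += 1
    return result

log = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
print(len(frequent_activity_sets(log, min_support=0.5)))  # 6 frequent sets
```

Only the surviving activity sets then need to be instantiated as candidate Declare constraints and scored with the metrics mentioned above.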
Therefore, in YAWL new constructs are introduced, such as OR-joins, removal of tokens, and multiple-instance activities, that make the language easier to use and more expressive. In particular, the OR is one of the most problematic patterns, and other notations very often struggle with its semantics (Rozinat, 2010). Unlike other languages, in YAWL the OR split/merge is designed to guarantee the desired synchronization. On the one hand, the OR-split triggers some, but not necessarily all, of the outgoing flows; it is appropriate in those situations where it is unknown until runtime which concurrent work can result from the completion of activities. On the other hand, the OR-join ensures that an activity waits until all the incoming flows have finished, but only if there is something it is necessary to wait for. Moreover, the formalism offers several new syntactical elements that intuitively describe other workflow patterns. For instance, the notation enables the description of simple choice (graphically depicted via an XOR-split) and simple merge (indicated as an XOR-join). Clearly, common situations such as parallelism of activities can still be modeled with the present notation (via the AND-split). Furthermore, in YAWL, transitions are assumed to be non-atomic: they do not fire immediately, and performing the task may take some time. For this reason, one transition here corresponds to two transitions in a Petri net, with one place between them.
M) is greater than 1, five event types were removed, as their frequencies were higher than the SD. After that, the application created the object frequency distribution (OFD) of the remaining entries and conducted a two-sample Kolmogorov–Smirnov normality test. The OFD was determined to be normally distributed, and the support range (SR) was calculated as 5%–33%. Using the SR and a 100% confidence value, the Apriori algorithm generated 64,931 object-based association rules. These rules produced 9 chains of events, which resulted in 10 temporal-association rules using a 50% threshold on the temporal-association-accuracy value. The entire set of these rules formed a Directed Acyclic Graph (DAG), as no cycles, conflicts, or redundancies were found. After that, a causal rank was calculated and assigned to each rule in the DAG using the Fast Causal Inference algorithm. This produced the final set of TAC rules, which was stored in the MySQL database and also represented as a PDDL domain action model. Every step of this application is performed in a fully automated manner, and the application is also capable of error management and exception handling. It is worth mentioning here that the processing of this application is batch-based. It can process a batch of event logs, specified by a directory path, and generate an individual domain action model for each dataset. Every time an event log dataset is processed, the resulting TAC rules, alongside other relevant information, are stored in the database. Figure 5.3 presents the second part of the same application. It shows the process of generating a PDDL domain action model file, named 'database-to-domain.pddl', directly from the database, which contains 347 TAC rules. After creating a DAG of the rules, only 134 of them remained and were encoded into a domain action model. The rest of the rules were eliminated by the application due to cycles, conflicts, and redundancies.
1 Eindhoven University of Technology, the Netherlands; 2 Technion, Haifa, Israel
Abstract. Detecting and measuring resource queues is central to business process optimization. Queue mining techniques allow for the identification of bottlenecks and other process inefficiencies, based on event data. This work focuses on the discovery of resource queues. In particular, we investigate the impact of the information available in an event log on the ability to accurately discover queue lengths, i.e., the number of cases waiting for an activity. Full queueing information, i.e., timestamps of enqueueing and exiting the queue, makes queue discovery trivial. However, often we see only the completions of activities. Therefore, we focus our analysis on logs with partial information, such as missing enqueueing times or missing both enqueueing and service start times. The proposed discovery algorithms handle concurrency and make use of statistical methods for discovering queues under this uncertainty. We evaluate the techniques using real-life event logs. A thorough analysis of the empirical results provides insights into the influence of information levels in the log on the accuracy of the measurements.
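To illustrate why full queueing information makes discovery trivial: the queue length at time t is simply the number of cases that have enqueued but not yet exited. A minimal sketch, where the tuple-based event record is an assumed format for illustration:

```python
def queue_length(events, t):
    """Queue length at time t, given full information: each event is a
    (case_id, enqueue_time, exit_time) record for a single activity."""
    return sum(1 for _, enq, ext in events if enq <= t < ext)

events = [("c1", 0, 5), ("c2", 2, 7), ("c3", 4, 6)]
print(queue_length(events, 4.5))  # 3: all three cases are waiting
print(queue_length(events, 5.5))  # 2: c1 has already left the queue
```

With partial information (e.g., only completion timestamps), the enqueue and exit times in such a record must be estimated, which is exactly where the statistical methods mentioned above come in.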
FLOSS repositories store sheer volumes of data about participants' activities. A number of these repositories have been mined using some of the techniques and tools we have discussed in this paper. However, to date, there has not been any concrete investigation into how logs from FLOSS repositories can be process mined for analysis. This may be attributed partly to two apparent factors. Firstly, researchers interested in mining software repositories have not come across process mining, and thus its value remains unexploited; secondly, the format of the data recorded in FLOSS repositories poses a challenge in constructing event logs. Nevertheless, after reviewing existing mining techniques and the analysis they provide on the data, one can infer the type of input data and the expected output, and thus construct logs that can be used for analysis through any of the recognised process mining tools, such as the ProM framework or Disco. The example presented previously was carried out using Disco as the visualisation tool. This approach can bring an additional flair and extensively enrich data analysis and visualization in the realm of FLOSS data. In our future work, we plan to produce tangible examples of process models as reconstructed from FLOSS members' daily activities. These logs can be built from mailing archives, CVS data, as well as bug reports. With a clearly defined objective and the type of data needed, process mining promises to be a powerful technique for providing empirical evidence about software repositories.
Another type of behavior that frequently occurs in process models is parallelism. For example, in a process with two parallel branches ab and cd, their concurrent execution may lead to (sub-)sequences such as abcd, acbd, cabd, acdb, cadb, etc. However, activities often have time constraints or other ordering restrictions, meaning that only a small fraction of all possible interleavings is actually observed in practice. In any case, the interleaving of parallel branches fits well into the nature of our problem and can be handled without adaptation, provided that, in a process with parallel branches, these branches are captured as separate patterns. As a consequence, the presence of parallel behavior may increase the number of patterns required to find a minimal solution.
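The interleavings mentioned above form the shuffle product of the two branches; a small sketch enumerating them (not part of the approach itself, just to make the combinatorics concrete):

```python
def interleavings(x, y):
    """All interleavings of two sequences that preserve the internal
    order of each branch (the shuffle product)."""
    if not x:
        return [y]
    if not y:
        return [x]
    return ([x[:1] + rest for rest in interleavings(x[1:], y)] +
            [y[:1] + rest for rest in interleavings(x, y[1:])])

runs = interleavings(list("ab"), list("cd"))
print(len(runs))  # 6 = C(4, 2) orderings of ab in parallel with cd
```

For two branches of lengths m and n there are C(m+n, m) interleavings, which is why observing only a few of them in practice is the common case.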
Abstract. The organizational perspective of process mining supports the discovery of social networks within organizations by analyzing event logs recorded during process execution. However, applying these social network mining techniques to real data generates very complex models that are hard to analyze and understand. In this work we present an approach to overcome these difficulties by focusing on the discovery of communities from such event logs. The clustering of users into communities allows the analysis and visualization of the social network at different levels of abstraction. The proposed approach also makes use of the concept of modularity, which provides an indication of the best division of the social network into community clusters. The approach was implemented in the ProM framework and was successfully applied in the analysis of the emergency service of a medium-sized hospital. Key words: Process Mining, Social Network Analysis, Hierarchical Clustering, Community Structure, Modularity
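For reference, the modularity Q of a division compares the fraction of edges falling within communities to the fraction expected under a random degree-preserving rewiring. A generic sketch of Newman's formulation (not the paper's implementation):

```python
def modularity(edges, community):
    """Newman modularity: fraction of edges inside communities minus
    the fraction expected from the node degrees alone."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    total = {}  # sum of node degrees per community
    for node, d in degree.items():
        total[community[node]] = total.get(community[node], 0) + d
    expected = sum((d / (2 * m)) ** 2 for d in total.values())
    return inside - expected

# two triangles joined by a single bridge edge
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
community = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(round(modularity(edges, community), 3))  # 0.357
```

The division with the highest Q is taken as the best clustering of users into communities.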
Aligning Data-Aware Declarative Process Models and Event Logs

Abstract
Conformance checking, a branch of process mining, allows analysts to determine whether the execution of a business process matches the modeled behavior. Process models can be procedural or declarative. Procedural models dictate the exact behavior that is allowed when executing a specific process, whilst declarative models implicitly specify the allowed behavior through rules that must be followed during execution. The execution of a business process is represented by event logs. Conformance checking approaches check various perspectives of a process execution, including control-flow, data, and resources. Approaches that check not only the control-flow perspective but also data and resources are called multi-perspective or data-aware approaches. These approaches provide more deviation information than control-flow-based techniques. Alignment-based techniques for conformance checking have proved to be advantageous in both control-flow-based and data-aware approaches. While there exist several data-aware approaches for procedural process models that are based on the principle of finding alignments, there is none so far for declarative process models.
5.2 Concurrency-aware CPOG mining
This section presents an algorithm for extracting concurrency from a given event log and using this information to simplify the result of the CPOG mining. Classic process mining techniques based on Petri nets generally rely on the α-algorithm for concurrency extraction. We introduce a new concurrency extraction algorithm, which differs from the classic α-algorithm in two aspects. On the one hand, it is more conservative when declaring two given events concurrent, which may lead to the discovery of more precise process models. On the other hand, it considers not only adjacent events in a trace as candidates for the concurrency relation but all event pairs, and can therefore find concurrent events even when the distance between them in traces is always greater than one, as we demonstrate below by an example. This method works particularly well in combination with CPOGs due to their compactness; however, we believe that it can also be useful in combination with other formalisms.
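The idea of considering all event pairs instead of only adjacent ones can be sketched as follows (an illustrative reconstruction, not the paper's exact algorithm, which additionally applies a more conservative condition):

```python
def concurrent_pairs(log):
    """Declare activities a and b concurrent if each is observed
    (somewhere, not necessarily adjacently) before the other."""
    before = set()  # (a, b): a occurs before b in some trace
    for trace in log:
        for i, a in enumerate(trace):
            for b in trace[i + 1:]:
                before.add((a, b))
    return {(a, b) for (a, b) in before if a < b and (b, a) in before}

# branches ab and cd in parallel, observed only as abcd and cdab;
# a and c are never adjacent, so the alpha-algorithm's directly-follows
# test finds no concurrency in this log at all
log = [["a", "b", "c", "d"], ["c", "d", "a", "b"]]
print(sorted(concurrent_pairs(log)))
# [('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')]
```

In this example the pairwise test recovers exactly the four cross-branch concurrencies, while the ordered pairs (a, b) and (c, d) within each branch are correctly kept sequential.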
2 Process Mining and ProM
Process-aware information systems, such as WfMS, ERP, CRM and B2B systems, need to be configured based on process models specifying the order in which process steps are to be executed. Creating such models is a complex and time-consuming task for which different approaches exist. The most traditional approach is to analyze and design the processes explicitly, making use of a business process modeling tool. However, this approach has often resulted in discrepancies between the actual business processes and the ones as perceived by designers; therefore, very often, the initial design of a process model is incomplete, subjective, and at too high a level. Instead of starting with an explicit process design, process mining aims at extracting process knowledge from "process execution logs".
Attempts at eliciting non-Markovian stochastic Petri nets also exist. Leclercq et al. investigate how to extract models of normally distributed data. Their work is based on an expectation-maximization algorithm that they run until convergence. In comparison to our approach, they are not able to deal with missing data and do not consider different execution policies. Reconstructing model parameters for stochastic systems has also been investigated by Buchholz et al. They address the problem of finding fixed model parameters of a partially observable underlying stochastic process. In contrast to our work, the underlying process's transition distributions need to be specified beforehand, while our aim is to also infer the transition distributions of a GDT_SPN model. In a similar setting, i.e., with incomplete information, Wombacher and Iacob estimate distributions of activities and missing starting times of processes.
Traditional Workflow Management Systems (WFMSs) are based on the idea that processes are described by procedural languages where the completion of a task may enable the execution of other tasks. While such a high degree of support and guidance is certainly an advantage when processes are repeatedly executed in the same way, in dynamic settings (e.g., healthcare) a WFMS is considered to be too restrictive. Users often need to react to exceptional situations and execute the process in the most appropriate manner. Therefore, in these environments, systems tend to provide more freedom and do not restrict users in their actions. Comparing such dynamic process executions with procedural models may reveal many deviations that are, however, not harmful. In fact, people may exploit the flexibility offered to better handle cases. In such situations we advocate the use of declarative models. Instead of providing a procedural model that enumerates all process behaviors that are allowed, a declarative model simply lists the constraints that specify the forbidden behavior, i.e., "everything is allowed unless explicitly forbidden".
In contrast, Fig. 9 shows that the error caused by Query 2 on the Road Traffic Fine log is small. It is noteworthy that by using Query 2 with a privacy parameter value of 0.1 we often obtain the exact same result as when using the unprotected event log. In this case, the F1 score is consistently 1.0, indicating that our approach can be used to protect the privacy of individual participants while still discovering the correct main process behavior for very structured processes with a small number of variants. When lowering the parameter further to 0.01, as shown in Fig. 9, differences appear due to the noise added by our protection approach. In particular, some of the less frequent activities connected to the appeals part of the Road Traffic Fines process, for example Notify Result Appeal to Offender and Receive Result Appeal From Prefecture, appear in the discovered process model. Some of the noise added by our privacy protection method can no longer be distinguished from the regular process behavior. Still, other parts of the frequent process behavior are left intact. For example, the process model starts with Create Fine and may end with either Payment or Send for Credit Collection, as in the model discovered on the unprotected log.
More from a theoretical point of view, the process discovery problem is related to the work discussed in [12, 37, 38, 49]. In these papers the limits of inductive inference are explored. For example, it has been shown that the computational problem of finding a minimum finite-state acceptor compatible with given data is NP-hard. Several of the more generic concepts discussed in these papers can be translated to the domain of process mining. It is possible to interpret the problem described in this article as an inductive inference problem specified in terms of rules, a hypothesis space, examples, and criteria for successful inference. The comparison with literature in this domain raises interesting questions for process mining, e.g., how to deal with negative examples (i.e., suppose that besides log L there is a log L′ of traces that are not possible, e.g., added by a domain expert). However, despite the relations with the work described in [12, 37, 38, 49], there are also many differences: e.g., we mine at the net level rather than with sequential or lower-level representations (e.g., Markov chains, finite state machines, or regular expressions), we tackle concurrency, and we do not assume negative examples or complete logs.