UNIVERSITY OF TARTU
Institute of Computer Science
Software Engineering Curriculum
Volodymyr Leno
Incremental Discovery of Process Maps
Master’s Thesis (30 ECTS)
Supervisor(s): Marlon Dumas,
Michal Rosik
2
Incremental Discovery of Process Maps
Abstract:Process mining is a body of methods to analyze event logs produced during the execution of business processes in order to extract insights for their improvement. A family of process mining methods, known as automated process discovery, allows analysts to extract business process models from event logs. Traditional automated process discovery methods are intended to be used in an offline setting, meaning that the process model is extracted from a snapshot of an event log stored in its entirety. In some scenarios however, events keep coming with a high arrival rate to the extent that it is impractical to store the entire event log and to continuously re-discover a process model from scratch. Such scenarios require online automated process discovery approaches. Given an event stream produced by the execution of a business process, the goal of an online automated process discovery method is to maintain a continuously updated model of the process with a bounded amount of memory while at the same time achieving similar accuracy as offline methods. Existing automated discovery approaches require relatively large amounts of memory to achieve levels of accuracy comparable to that of offline methods. This thesis proposes a online process discovery framework that addresses this limitation by mapping the problem of online process discovery to that of cache memory management, and applying well-known cache replacement policies to the problem of online process discovery. The proposed framework has been implemented in .NET, integrated with the Minit process mining tool and comparatively evaluated against an existing baseline, using real-life datasets.
Keywords:
Process Mining, Process Discovery, Event Stream, Process Map, Cache Replacement Policy CERCS: P170 – Computer science, numerical analysis, systems, control
3
Kuhjuv protsessikaartide avastamine
Lühikokkuvõte:Protsessikaeve on meetodite kogu, analüüsimaks, protsesside teostuse jooksul loodud, sündmuste logisid, et saada teavet nende parandamiseks. Protsessikaeve meetodite kogu, mida nimetatakse automatiseeritud protsessi avastuseks, lubab analüütikutel leida infor-matsiooni äriprotsesside mudelite kohta sündmuste logidest. automatiseeritud protsessi avastus meetodeid kasutatakse tavaliselt ühenduseta keskkonnas, mis tähendab, et protsessi mudel avastatakse hetketõmmisena tervest sündmuste logist. Samas on olukordi, kus uued juhtumid tulevad peale sellise suure kiirusega, et ei ole mõtet salvestada tervet sündmuste logi ja pidevalt nullist taasavastada mudelit. Selliste olukordade jaoks oleks vaja võrgus olevaid protsessi avastus meetmeid. Andes sisendiks protsessi teostuse käigus loodud sündmuste voo, võrgus oleva protsessi avastus meetodi eesmärk on järjepidevalt uuendada protsessi mudelit, tehes seda piiratud hulga mäluga ja säilitades sama täpsust, mida suuda-vad meetodid ühenduseta keskkondades. Olemas olesuuda-vad meetodid vajasuuda-vad palju mälu, et saavutada tulemusi, mis oleks võrreldavad ühenduseta keskkonnas saadud tulemustega. Käesolev lõputöö pakub välja võrgus oleva protsessi avastusraamistiku, ühtlustades protsessi avastus probleemi vähemälu haldusega ja kasutades vähemälu asenduspoliitikaid lahendamaks antud probleemi. Loodud raamistik on kirjutatud kasutades .NET-i, integreeri-tud Minit protsessikaeve tööriistaga ja analüüsiintegreeri-tud kasutades elulisi ärijuhte.
Võtmesõnad:
Protsessi Kaevandamine, Protsess Avastus, Juhtum Vool, Protsessi Kaardid CERCS: P170 – Computer science, numerical analysis, systems, control
4
Table of Contents
1 Introduction ... 6 1.1 Problem Statement ... 6 1.2 Contribution ... 7 1.3 Structure Description ... 7 2 Background ... 82.1 Introduction to process discovery ... 8
2.2 Process Maps ... 9
2.3 Data Streams ... 10
3 Related Work ... 11
3.1 Research Method ... 11
3.2 Process discovery challenges ... 11
3.3 Process discovery techniques ... 12
3.4 Data stream mining approaches ... 14
3.5 Process change ... 14
3.6 Online process discovery ... 15
3.7 Model update ... 18
3.8 Cache Replacement Problem ... 18
3.9 Summary ... 21
4 Contribution ... 22
4.1 Requirements to solution ... 22
4.2 Architecture of the solution ... 23
4.3 Implementation ... 29
4.3.1. Azure Event Hub ... 30
4.3.2. Visualization ... 33 4.3.3. Summary ... 35 5 Validation ... 37 5.1 System description ... 37 5.2 Experimental Evaluation ... 38 5.3 Qualitative Results ... 44
6 Conclusion and Future Work ... 47
7 References ... 48
Appendix ... 49
5
List of Abbreviations
The table below describes the meaning of the various abbreviations and acronyms used in the thesis.
Abbreviation
Meaning
AMQP Advanced Message Queuing Protocol
BPMN Business Process Model and Notation
CRM Customer Relationship Management
EPC Event-driven Process Chain
ERP Enterprise Resource Planning
HTTPS HyperText Transfer Protocol Secure
SAS Shared Access Signature
SHA Secure Hash Algorithm
SSL Secure Sockets Layer
TLS Transport Level Security
WFM Workforce Management
YAWL Yet Another Workflow Language
6
1 Introduction
The following chapter aims to give a brief overview of the problem and identify research questions with the thesis main goals. In addition, it provides the structure description of the paper.
1.1 Problem Statement
Business processes take an important part in modern organizations, describing the list of activities that have to be done, their execution order and performers. Nowadays, many pro-cesses are either supported or monitored by information systems (e.g., ERP, CRM, WFM systems). The information about the execution of these processes is stored in files called event logs. An event log consists of a set of traces, each representing one possible process execution. A trace consists of a set of events, where an event contains a timestamp and an event type, which refers to an activity of the process (e.g. “Assess Claim”, “Notify Customer”, etc.). Process discovery aims to construct a process model from an event log. Different process discovery techniques support different notations (e.g., Petri Net, BPMN, YAWL and EPC). In this thesis, we are specifically interested in process discovery techniques that produce a process map as output. In this context, process map is a directed graph where each node corresponds to an event type and each edge represents a direct-follows relation, which exists between two event types (e.g. between “Receive Order” and “Check Order”) if there is at least one trace where the first of these events immediately preceded the second one. By representing the process in a graphical way, process discovery helps to understand and analyze the process in order to identify problems and bottlenecks. Based on the results of analysis process improvement can be done.
However, the amount of data is continuously increasing, systems produce large and complex datasets that are hard to manage and store. This causes the computational and memory is-sues. One of the solutions to this problem is to process events immediately when they occur in order to enable real-time analysis. This approach is known as online process discovery or incremental process discovery.
There are many process discovery algorithms, but all of them designed for offline pro-cessing, taking as input complete event logs. However, it may be impossible to store all events when working with event streams. Thus, these algorithms fail in this case and tend to generate unreliable and outdated results.
Moreover, some processes are changing in time. They might change due to periodic changes (e.g., in the different time of the year, there are different sales rates or market demands), or thanks to changing conditions (e.g., changes in the strategy of the company, market changes). These changes affect the processes and it is important to detect them and update the model correspondingly.
In order to be able to work with event streams, the process discovery algorithms have to meet the following requirements [1]:
- It is impossible to store the complete stream
- It is impossible to process data more than one time (backtracking is not feasible) - Data model has to be quickly updated
- Algorithm has to cope with variable system conditions
Currently, only a few algorithms can deal with the event streams. All of them are modifica-tions and extensions of existing offline algorithms.
In addition, Process Discovery itself faces challenges such as dealing with noise, incom-pleteness, balancing overfitting and underfitting, representational bias limitations [2].
7
The core problem to be addressed in the thesis can be formulated as:
“Given a process map PM computed from an event log at time T and given the set of events ES that has occurred between time T and T′ we need to compute the process map PM′ at time T′.”
The main research question is:
How to perform process discovery on the event streams?
The other research questions are following:
1) How to mine data immediately when it occurs?
2) How to continuously update the model in a computationally efficient way? 3) How to deal with concept drifts?
4) How to deal with process discovery problems? 1.2 Contribution
The main goal of this work is to introduce an efficient process discovery algorithm in order to cope with event streams, which will be built in the client application with an intuitive and user-friendly interface. The thesis aims at extending the process map construction technique implemented in the Minit tool to handle event streams, therefore the main focus will be done on process maps discovery.
In the thesis, the current state of incremental process discovery problem will be studied and existing solutions will be analyzed. Based on the results of analysis, an efficient algorithm will be presented and developed.
1.3 Structure Description
The thesis is organized as follows: Section 2 presents the basic concepts of process mining and Section 3 provides the overview of state of the art of the problem with analysis of ex-isting solutions; in Section 4 the new solution based on the Lossy Counting with Budget algorithm is introduced and the implementation details are reported; Section 5 presents the results of empirical evaluation and Section 6 concludes the paper.
8
2 Background
This chapter aims to give basics about process mining and process discovery. In addition, it contains the overview of process maps and data/event streams.
2.1 Introduction to process discovery
Process Mining is relatively young research discipline, which takes place between machine learning and data mining from the one hand and process modeling and analysis from the other hand [2]. It aims to discover, monitor and improve the processes, extracting knowledge from event logs, produced by information systems (Figure 1).
Information System
Event Log
Process Model
Process Mining
Figure 1. Process Mining overview
There are three types of process mining [2]: discovery, conformance, and enhancement. Discovery technique takes the event log as input and produces the model, as shown in Figure 1. The main goal of conformance is to compare a model with existing event log, which describes the same process, in order to check if reality conforms to the model and vice versa. Enhancement aims to extend or improve an existing process model using information about the actual process recorded in some event log [2]. In the thesis, we will focus on process discovery only.
A business process is a set of inter-related events, activities, and decisions, performed by multiple actors, which collectively allow an organization to deliver a product or service or achieve a business goal. An example of a business process is the process performed by a bank to approve loan applications, or the sales process of manufacturing company, which starts when a purchase order is received and ends when the corresponding invoice has been paid. This latter process is also called an order-to-cash process. An execution of a business process is called a case. For example, the concrete set of activities and decisions performed to fulfill a given purchase order constitutes one case of the order-to-cash process.
Every case consists of a sequence of events with the following attributes: timestamp, case identifier, and the name of the activity. By using timestamps, we can analyze the process from the time perspective, measuring the average execution time for the process overall or for each of the activities, estimating the waiting time between the activities, etc. Then, some process improvements can be done based on this information. Another purpose of timestamps is that they enable to order the events. However, it can be assumed that events arrive in a stream in chronological order, so the event sequence is already ordered. There-fore, to be able to discover control flow from event log every event should at least have the case identifier and activity name.
The quality of a process model discovered from an event log can be evaluated by four cri-teria [2; 3] (Figure 2):
- Fitness (measures the extent to which the traces in a log can be parsed by the discovered model [3]. A model with perfect fitness can replay all traces in the log)
9
- Precision (measure of additional behavior, which is not described in the log, allowed by a discovered model. A model with high precision can parse only the behavior spec-ified in the event log)
- Generalization (measures how well the discovered model generalizes the behavior de-scribed in the log. For example, if a model can be discovered using 90% of the traces from the log and it can parse the rest 10% it has a good level of generalization)
- Simplicity (the model should be as simple as possible. In general, the complexity of a model can be defined as the total number of edges and nodes in the underlying graph. Therefore, a model with high simplicity has small amount of edges and nodes)
Process Discovery
Fitness Simplicity
Precision Generalization
Figure 2. Four quality dimensions 2.2 Process Maps
A process map is directed weighted graph representation of a control-flow of the process, where:
1) nodes of the graph represent activities 2) edges represent causality relation
3) weights of edges represent the frequency of occurrence of the direct-follows rela-tions (e.g. edge from activity A to activity B with weight 100 means that AB occurs 100 times)
The process map example is given in Figure 3:
10
Unlike Petri net or BPMN model, it does not describe typical process design patterns, such as AND or XOR splits/joints, etc. Instead, it gives general knowledge about the process in an intuitive and understandable way, which is highly efficient in case of using in industry and business. Process maps help to identify bottlenecks, duplication, delays or gaps, to clar-ify process boundaries, process ownership, responsibilities and effectiveness measures or process metrics [4].
2.3 Data Streams
According to the definition, a data stream is an unbounded sequence of data items with a very high throughput [5]. Additionally, there are some assumptions, such as:
- Data have a small and fixed number of attributes - The amount of data is infinite
- The memory is limited
- The time allowed to process an item is small
One type of data streams called event streams, which occur while working with business processes. The elements of event stream are event records, which have the same structure as events in the event log. In other words, event stream is the sequence of events before being recorded into an event log. Therefore, the main focus of this paper is done on event streams.
There are three types of data stream processing models [1]: - Insert only model
Once an item is seen, it cannot be changed - Insert-delete model
Items can be seen, deleted or updated - Additive model
Each seen item refers to a numerical variable which is incremented
In the case of event streams, we are working with first type processing model because it does not make sense to delete events after they were observed.
11
3 Related Work
This chapter presents the state of the art of the problem. It gives an overview of process discovery techniques and challenges, data stream mining approaches and currently existing solutions for event streams process discovery. It also covers the process change problem and presents a possible way to deal with it. Finally, it describes cache replacement problem and cache replacement policies as a possible way to solve the event stream process discovery problem.
3.1 Research Method
Searching for relevant literature for selected topic is one of the most important stages of any research. The search was performed using following phrases:
“event stream” AND “discovery” OR “event stream” AND “mining” “data stream” AND “discovery” OR “data stream” AND “mining” “process discovery”
“process map”
Traditional search gives a high number of irrelevant results and on the other hand, it misses some relevant ones. Therefore, a more targeted search was completed based on papers that have been previously identified as relevant. Google Scholar was used to get the relevant papers and then the search of the papers that were cited in the relevant ones was performed. The relevance of the paper can be estimated by following criteria:
Is the paper about event streams or process maps discovery?
Does the paper include a comparative analysis of the process discovery algorithms? Does the paper include real-life examples?
Does the paper provide algorithm implementation details (e.g. detailed description of the algorithm, pseudocode examples)?
If the paper satisfies any of first two requirements it is considered to be relevant. Those that, in addition, include real-life examples and algorithm implementation details are highly pre-ferred.
Before trying to implement the solution of the problem it was important to understand the current challenges and approaches. In order to identify process discovery challenges, the book [2] of W. M. P. Van der Aalst was used. In [6,7] and [6,8] the review of Heuristics and Fuzzy Miner respectively was performed. The research of M. M. Gaber and A. Zaslavsky [9] gives the overview of data stream mining techniques. G. Widmer and M. Kubat in their work [10] describe the problem of process change in time and introduce the possible ways how to deal with it. In the papers [1], [5] and [9] the review of currently existing event streams discovery approaches was done. Finally, in [12] we can find the principles of model update in a scalable and dynamic way.
3.2 Process discovery challenges
As it was specified earlier, Process Discovery faces many challenges. The biggest ones are following:
- Noise
Noise is a term used to refer to rare (i.e. infrequent) behavior, which can be observed in the event log. The example of noise can be some exceptions in the execution of the system. The Pareto Rule can be applied to show the concept of the noise. It says that the 20% of the log may easily account for 80% of the variability in the process [2]. It
12
causes the difficulties in producing efficient models because the noise leads to over-crowding of the model since we focus not only on the actual behavior but also on the exceptional cases. This results in complicated, unreadable process models. Therefore, to be efficient, process discovery techniques should be able to deal with noise.
- Incompleteness of event log
Another major concern for automated process discovery methods is incompleteness. It is not realistic to assume that log is complete and includes all possible behavior. Since event log is not complete, it is hard to build a good model based on it. Incompleteness leads either to excluding the missing behavior or to including unseen behavior, trying to fulfill the model. This can reduce the quality of the model. Therefore, the process model can be considered as just a view on reality [2].
- Balance between underfitting and overfitting
There are two extreme cases of models, which can be created: the model, where every-thing is possible (“flower model” [2]) and the model, which allows only the behavior, specified in the event log (“enumerating model”). These models are examples of under-fitting and overunder-fitting respectively. It is crucial for process discovery techniques to find the balance between overfitting and underfitting. The final model should be general enough, but at the same time, it should not allow too much of behavior not observed in the event log.
- Representational bias limitations
Any process discovery algorithm is effective within its representational bias – the class of models, which can be discovered. Therefore, in different cases, different algorithms should be used. All algorithms have their own representational limitations.
- Support of different abstraction levels
Most of the algorithms assume that all events in the event log are on the same level of abstraction. Although, this simplify the discovery process, in real life actions belong to different abstractions levels. Therefore, to be more efficient on practice, process dis-covery algorithms should be able to deal with this problem.
3.3 Process discovery techniques
Considering the process maps specifics, the best process discovery algorithm, in this case, is Fuzzy Miner. In addition, in [7] A. Buratin and A. Sperduti developed the online version of Heuristics Miner for event streams process discovery, therefore we will present this algorithm as well.
Heuristics Miner algorithm takes into consideration the frequencies of the events and relations, thereby dealing with noise. By statistical measures and user defined thresholds, it discovers the dependencies from the event log and then builds a control-flow model on the basis of the observed directly-follows relations [5]. The dependency measure is calculated by frequency, e.g. the causal dependency between activities a and b is calculated as follows:
𝑎 → 𝑏 = |𝑎>𝑏|−|𝑏>𝑎|
|𝑎>𝑏|+|𝑏>𝑎|+1 ∈ [−1,1] (1)
Where |a > b| is the frequency of occurrence of event b directly after the event a, i.e. the number of times that a directly-follows relation from a to b is observed in the log.
When the value is close to 1, there is a strong dependency between a and b and if the value is close to -1 there is a strong dependency between b and a. Using the dependency measure and a set of user-defined thresholds, Heuristics Miner builds a control-flow model in the form of Heuristics Net, directed graph in which nodes represent activities and edges
13
represent direct-follows dependencies. This graph is hereby called a Directly-Follow Graph (DFG). Moreover, outgoing/ingoing arcs of a node are annotated to XOR or AND splits/joins in case they constitutes exclusive choices or parallel alternatives respectively [5].
Given a log, the Heuristics Miner first calculates the dependency matrix wich contains the direct-follows relations observed between each pair of events. Next, based on the matrix, the algorithm constructs the dependency graph. The node corresponding to the activity for which none of the column/row entries in the matrix are positive is marked as start/end node. Finally, splits and joins types are identified using the AND measure. Given a dependency graph with activity a and two successor nodes b and c, their AND measure is following:
𝑎 → (𝑏 ∧ 𝑐) = |𝑏>𝑐|+|𝑐>𝑏|
|𝑎>𝑏|+|𝑎>𝑐|+1∈ [0,1] (2)
When the value of AND measure is close to 1, the event a is an AND-split and events b and
c can be executed in parallel. If the value of this measure is close to 0, then a is XOR-split
and events b and c are in mutual exclusion.
Traditional process discovery algorithms have problems dealing with unstructured or semi-structured processes, producing complicated and unreadable models. These models are often „spaghetti-like“, showing all details without providing a suitable abstraction.
Fuzzy Miner algorithm is able to deal with unstructured processes, reducing complexity and improving comprehensibility [6]. It produces high-level view on the process, generalizes the process behavior when it is possible by removing of unimportant arcs in the directly-follows graph, clustering highly correlated nodes into a single node and removing isolated node clusters [8]. The result of the Fuzzy Miner discovery algorithm is process map – the general description of the control-flow of the process in a form of directed weighted graph.
For simplification of the process, Fuzzy Miner uses two metrics: significance and correlation. Significance measures the relative importance of behavior, either for individual events or for binary ordering relations. The simplest way to calculate the significance measure is by frequency, e.g. events with a higher frequency have more significance.
Correlation metric shows how closely related two events following one another are. The
measuring of correlation includes determining the overlap of data attributes associated with two events following one another or comparing the similarity of their event names. Highly significant behavior is preserved, less significant but highly correlated behavior is aggregated into clusters, while less significant and less correlated behavior is abstracted. The first stage of the algorithm is to build the directly-follows graph. Then by significance and correlation metrics, it transforms the graph by edge filtering, aggregation, and abstraction. After this, it builds the model. The advantage of this algorithm is that depending on the specified constraints, such as minimum fuzzy support and minimum fuzzy confidence, it is able to provide the different views on the process with a different level of details. Because of its zoom-in and zoom-out capability, the fuzzy miner is a convenient method to deal with very large and highly unstructured event log. It is hence not surprising that commercial process mining tools such as Fluxicon Disco, Minit, myInvenio and Celonis relay mainly on variants of the fuzzy miner.
We note that both the heuristics miner and the fuzzy miner take as a starting point the directly-follows graph. The same is true of the alpha miner [2] and variants thereof, as well as the inductive miner [13]. In other words, the DFG is one of the fundamental primitives in the field of automated process discovery. Accordingly, in this thesis we focus on the
14
problem of online automated discovery of DFGs, which we call herein "process maps", in line with the term in commercial tools.
3.4 Data stream mining approaches
In general, there are two data stream mining approaches: data-based and task-based [9]. The main concept of data-based solutions is to analyze the only fragment of a stream or to summarize the data. The examples of data-based techniques are sampling, aggregation, load shedding and sketching.
The idea of sampling technique is that by studying examples from a data stream, it is possi-ble to generalize the results back to the whole data stream. It is a quite old technique that has been used for a long time. The problem with using sampling when working with data streams is the unknown dataset size. Therefore, it is a question how much examples are needed to achieve relevant results. Moreover, some of the items that are ignored can be interesting and meaningful.
The load shedding technique aims to drop a sequence of data streams, decreasing the amount of data, that need to be processed. However, it is difficult to use with mining algorithms because it drops chunks of data streams that can be used in the structuring of the generated models or it might represent a pattern of interest in time series analysis [9].
According to [9], sketching is the process of vertically sampling the incoming stream, with a random projection of a subset of features. The disadvantage of sketching is problems with accuracy.
The main goal of aggregation is to summarize the incoming stream using statistical measures such as mean, median, etc. This aggregated data then can be used as input for mining algorithms, but it has problems dealing with data with highly fluctuating distribu-tions.
Task-based approaches designed to address the computational challenges of data stream processing. The solutions of this type are represented by approximation algorithms, sliding window, and algorithm output granularity.
The approximation algorithms aim to extract approximate solution, with a possibility to define error bounds on the procedure.
The main idea of sliding window is to concentrate only on “fresh” events. The analysis is performed giving more importance to recent data, considering only summarization of the old ones.
The algorithm data granularity is resource-aware data analysis approach that deals with fluctuating high data rates according to the available memory and processing speed [9]. In addition, the following data mining and machine learning techniques can be applied to analyze data streams: clustering, classification, frequency counting, change diagnosis and time series analysis. Taking into consideration the characteristics of process maps, the main focus should be done on frequency counting.
3.5 Process change
As it was mentioned earlier, processes can change in time. There are following types of changes that can be observed: the drift of the process model, the shift of the process model and cases distribution change [9]. The drift of the model refers to a gradual change of the underlying process. Model shifts happen when a change between two process models is sharper. Although the original process can stay the same during the time, the cases distribu-tion may change, therefore causing changes. For example, in a producdistribu-tion process of a com-pany selling clothing, the items involved in incoming orders during winter will follow a
15
completely different distribution than in the summer. Such distribution change may signifi-cantly affect the relevance of specific paths in the control-flow of the involved process. The general approach to deal with concept drift has three main characteristics [10]: 1) keep-ing only a window of currently trusted examples and hypothesis; 2) storkeep-ing concept descrip-tions and re-using them if the previous context re-appears; 3) controlling both of these func-tions by the heuristic that constantly monitors the system’s behavior.
Effective process mining algorithm has to be able to detect context changes without being explicitly informed about them and quickly adapt the model accordingly to these changes. One possible approach is to trust only the most recent examples since the processes tend to change in time. The set of these examples called a “window”. Newly arrived examples are added to the window and the oldest ones have to be deleted from it. The model has to allow the behavior from the window. In the most simple case, the window has fixed size and the oldest example will be deleted when a new one is observed. This approach is called “time-based forgetting” [10].
3.6 Online process discovery
The general idea of event streams process discovery is described in Figure 4:
Stream Miner Instance
...
A C B D A C B D E Time Events...
Figure 4. General idea of event stream process mining
Currently, there are available several algorithms to deal with data streams such as Sliding Window, Sticky Sampling, Lossy Counting and Lossy Counting with Budget.
The main idea of Sliding Window algorithm (Figure 5) is to collect events for a certain span of time and then apply a standard offline process discovery approach. This approach has several advantages such as the possibility to mine the log using any process mining algorithm. However, the notion of history is not very accurate: only the more recent events
16
are considered, and equal importance is assigned to all of them. In addition, the model is not constantly updated since each newly received events triggers only the update of the memory and not necessarily an update of the model. Model updating with every event occurrence will cause computational problems. The time required by this approach is completely un-balanced. When the new event arrives, inexpensive operations are performed (shift and in-sert), instead of when the model needs to be updated, the log, retained in memory is mined from scratch. Therefore, each event is handled at least twice: the first time to store it into a log and subsequently any time the mining phase takes place on it. In mining event streams, it is more desirable to process each event only once.
A A A A
... B D F C B B D C A B A E C E B F C B F E F ...
Even Stream
Log used for mining
Figure 5. Sliding Window algorithm
Sticky Sampling (SS) (Figure 6) is a probabilistic algorithm, an adaptation of an existing technique, used for approximate frequency count. It fails to provide correct answers with a miniscule probability of failure [11]. The algorithm takes three parameters such as support 𝑆, error 𝜖, and probability of failure 𝛿. 𝑆 is a tuple of the form (𝑒, 𝑓), where 𝑓 estimates the frequency of an element 𝑒 in the stream. Initially, it is empty and the sampling rate 𝑟 is set to 1. The probability of selection of the element is 1
𝑟. For each incoming element 𝑒, if an entry for 𝑒 already exists in 𝑆, the corresponding frequency has to be incremented. In another case, we sample the element with rate 𝑟. If this element is selected by sampling, we add an entry (𝑒, 1) to 𝑆 [11]. Otherwise, we ignore the element and move on to the next one in the stream. Sampling rate 𝑟 increases logarithmically proportional to the size of the stream [11]. Whenever the sampling rate 𝑟 changes, for each entry (𝑒, 𝑓) we will toss an unbiased coin and in the case toss is not successful, the frequency 𝑓 of the element 𝑒 will be decreased by 1. When frequency will reach 0, the corresponding entry will be deleted. The space com-plexity of the algorithm is independent of the current length of the stream.
A A A A
... B D F C B B D C A B A E C E B F C B F E F ...
Even Stream
A = 5 B = 6 C = 2 D = 1 E = 3 F = 3
Figure 6. Sticky Sampling algorithm
Lossy Counting (LC) (Figure 7) algorithm is an alternative to Sticky Sampling, but it is a deterministic approach. The main idea is to consider aggregated representations of the latest observations instead of storing repetitions of the same event, in order to save space, they store a counter keeping trace of the number of instances of the same observation. The stream
17 is divided into buckets of width 𝑤 = [1
𝜖], where 𝜖 ∈ (0,1) represents the maximum approx-imation error allowed. The current bucket is identified with 𝑏𝑐𝑢𝑟= [
𝑁
𝑤], where N is the pro-gressive events counter. The basic data structure used by Lossy Counting is a set of tuples of form (𝑒, 𝑓, ∆) where: 𝑒 is an element of the stream, 𝑓 is the estimated frequency of item 𝑒 and ∆ is the maximum number of times event 𝑒 already occurred. Every time a new event is observed, the algorithm looks whether the data structure contains an entry for the corre-sponding element. If such entry exists then its frequency value is incremented by one, oth-erwise, a new tuple is added: (𝑒, 1, 𝑏𝑐𝑢𝑟− 1). Every time 𝑁 𝑚𝑜𝑑 𝑤 = 0, the algorithm cleans the data structure by removing the entries that satisfy the following inequality: 𝑓 + ∆≤ 𝑏𝑐𝑢𝑟. Such inequality ensures that every time the cleanup procedure is executed, 𝑏𝑐𝑢𝑟 ≤ 𝑁. The main drawback of this algorithm is that the only parameter of Lossy Counting is the maximum allowed error in the frequency approximation and no information on space usage can be provided. Lossy Counting drops low frequent items fast and only elements with high frequency survive. Therefore, it can cope with noise problem of process discovery. Sticky Sampling performs worse because of its tendency to remember every unique element that gets sampled [11]. A A A A ... B D F C B B D C A B A E C E B F C B F E ... Bucket 1 (bcur = 1) Bucket 2 (bcur = 2) Bucket 3 (bcur = 3) Bucket 4 (bcur = 4) B D F A C B 0 0 0 0 0 B B B B D C A A 0 1 1 1 B B B B B A A A E E C F 0 1 2 2 2 B B B B B A A A E E A B A E C F 0 1 2 3 3
Figure 7. Lossy Counting Algorithm
Lossy Counting with Budget (LCB) is the modification of Lossy Counting algorithm. It uses the same data structure to store the incoming events, but, unlike Lossy Counting, it allows to control the maximum space used to store these data structures. This algorithm divides the stream into buckets of variable sizes. Each bucket will contain all occurrences that can be accommodated into the available budget. The approximation error for a given bucket 𝐵𝑖 will depend on the bucket size and is defined as 𝜖𝑖 =
1
𝐵𝑖. Buckets of variable sizes
imply that the cleanup operations do not occur always at the same frequency, but only when the current bucket is not able to store the newly observed event. The remove condition is the same as in Lossy Counting. The current bucket, however, is updated only when no more items can be stored. Moreover, the deletion condition may not be satisfied by any item, thus not creating space for the new observation. In this case, the algorithm has to increase the current bucket size arbitrarily, until at least one item is removed. The basic structure of the procedure has several differences comparing to Lossy Counting. The cleanup operations must be performed before inserting an item (not after as in Lossy Counting) and the cleanup has to update the current bucket accordingly. Currently Lossy Counting with Budget it the most efficient technique
18 3.7 Model update
In order to update the model in an efficient way, the online discovery algorithm should be divided into two separate parts: run-time footprint calculation and scheduled footprint inter-pretation. The footprint in this context is the abstract representation of the entire process taken in a certain time. Footprint consists of the set of activities and relations together with information about their frequency. The main responsibility of run-time footprint calculation component is to update the footprint when a new event occurs. Taking into consideration the high frequency of events occurrence, the algorithmic execution time needs to be mini-mized. Since the process tends to change in time, the old events should have a small influ-ence on the process model. The footprint has to be dynamically updated, giving more im-portance to the recent events, and using the summarization of the old ones. In order to meet the scalability requirement, the footprint update should take a fixed amount of time, inde-pendent of the total number of previously occurred events and traces [12]. To deal with noise, the events that have not been seen for a long time should be deleted. In the schedule footprint interpretation component, the model is discovered from the current footprint in a scheduled, reoccurring way. Therefore, it has less execution time constraints. The model can be updated after some fixed amounts of traces occurred or after every n minutes. 3.8 Cache Replacement Problem
As it was mentioned earlier, it is not possible to keep the whole event stream in memory. Therefore, we need a mechanism to “free-up” space once a certain limit is reached. The problem of clearing the memory for upcoming data can be seen as a cache replacement problem.
Nowadays, the web is the most important source of information and communication in the world. The majority of web objects are static, therefore caching them at HTTP proxies can reduce network traffic and response time [14]. However, since the size of the cache is limited, there have to be wise strategies that identify which objects have to stored in the cache and which have to be thrown away in order to clear the space for new ones. These strategies are called cache replacement policies.
Web objects have many characteristics. For the purpose of caching, recency, frequency and size are the most commonly used properties. Accordingly, the cache replacement policies can be classified into three categories: recency oriented, frequency oriented and size oriented policies. In some cases, policies of one type can use methods from another type in order to choose the objects for removal (this can happen when the main characteristics of objects are equal).
Recency-Based Strategies
This set of strategies is derived from a property known as temporal locality, the measure of how likely an object is to appear again in a request stream after being requested within a time span [15]. These strategies use recency information as a significant part of their victim selection process. Recency based strategies are typically straight forward to implement taking advantage of queues and linked lists.
1) Least Recently Used (LRU)
One of the most known and commonly used strategies. Replaces the block in the cache that has not been used for the longest period of time. From the basics of temporal locality, the blocks that have been referenced in recent past will likely be referenced in the near future. This policy works well when there is a high temporal locality of references in the workload.
19 2) LRU-Threshold
The modification of LRU policy. An object is not permitted into the cache when its size exceeds given threshold.
3) Pitkow/Reckers Strategy
Objects that are most recently referenced within the same day are differentiated by size, choosing the largest first. Objects referenced not in the same period are sorted by LRU. 4) Value Aging
Defines a characteristic function based on the time of a new request to object and removes an object with the smallest value of the function. The function is the following:
𝑅𝑖 = 𝐶𝑖 ∗ √𝐶𝑖−𝑇𝑖
2 (3)
where Ci is the current time and Ti – last time the object was referenced. 5) LRU*
Combines an LRU list and a request counter. When an object enters the cache, its request counter is set to 1 and it is added to the front of the list. On a cache hit, its request counter is incremented by 1 and also moved to the front of the list. During victim selection, the request counter of the least recently used object (the tail of the list) is checked. If it is zero, the object is removed from the list; if it is not zero, its request counter is decremented by 1 and moved to the front of the list.
6) Segmented LRU
Divides the cache into two parts. The first part is known as the unprotected segment and the second - the protected segment. The strategy requires space set aside for the protected segment, known as A. Objects that belong to this segment cannot be removed from the cache once added. Both segments are managed by LRU replacement strategy. When an object is added to the cache, it is added to the unprotected segment, removing only objects from the unprotected space to make room for it. There is an implicit size threshold for objects, where the minimum object size allowed to be cached is min(A, M – A). Upon a cache hit of an object, it is moved to the front of the protected segment. If the object is in the unprotected segment and there is not enough space in the protected segment, the LRU strategy is applied to the protected segment, moving objects into the unprotected segment. [15]
7) Most Recently Used (MRU)
In contrast to LRU, it removes the most recent objects.
Frequency-Based Strategies
Frequency based strategies use a property of request streams known as spatial locality, the likelihood that an object will appear again based on how often it’s been seen before [15].
1) Least Frequently Used (LFU)
Removes the object with the smallest frequency of occurrence. 2) LFU-Aging
The LFU policy can suffer from cache pollution (an effect of temporal locality): if a formerly popular object becomes unpopular, it will remain in the cache for a long time, preventing other newly or slightly less popular objects from replacing it. This strategy tries to solve this problem by introducing an aging factor. When the average of all the frequency counters in the cache exceeds a given average frequency threshold, then all
20
frequency counts are divided by 2. There is also a maximum threshold set that no frequency counter is allowed to exceed.
3) LFU with Dynamic Aging (LFU-DA)
A parameterless policy which is less complex and easier to manage. For each object, the key value Ki is calculated such as:
𝐾𝑖 = 𝐹𝑖 + 𝐿 (4)
where Fi is the frequency of object i and L is the dynamic aging factor. Initially, it is set to 0, but with a removal of object i, its value is set to Ki.
4) α-Aging
Periodic aging method, which can use varying periods and a range [0,1] for its aging factor α. Each object uses a value K which is equal to the frequency of the object. At the end of every period an aging factor is applied to each object:
𝐾𝑛𝑒𝑤 = 𝑎 ∗ 𝐾, 0 ≤ 𝑎 ≤ 1 (5) The object with the smallest value of K will be removed. 5) HYPER-G
First, removes the least frequent objects. Then if the objects have the same frequency, the LRU is applied. Finally, if it is not enough to choose one object for deletion, the largest object will be chosen.
Size-Based Strategies
These strategies aim to minimize the memory consumption, by removal of the largest ob-jects.
1) SIZE
This strategy removes the largest objects first, to minimize the amount of memory used. In the case when objects have the same size, LRU is applied to choose the object for deletion.
2) LOG2-SIZE
Sort objects by their floor[log(size)], differentiating objects with the same value by LRU. This strategy tends to apply LRU more often comparing to SIZE strategy. 3) Greedy-Dual Size (GDS)
This strategy replaces the object with the smallest key value for a certain utility (cost) function. When an object i is requested its key value is calculated as follows:
𝐾𝑖 = 𝐶𝑖
𝑆𝑖+ 𝐿 (6)
where Ci is the cost associated with bringing object i into the cache, Si – size of object and L is a running age factor that starts at 0 and is updated for each replaced object j to the priority key of this object: L = Kj. In the case when C = 0, the removal procedure depends only on the size of objects.
4) Greedy-Dual Size with Frequency (GDSF)
The modification of Greedy-Dual Size strategy. Take into account the frequencies of objects. The key values will be calculated in the following way:
𝐾𝑖 = 𝐹𝑖∗ 𝐶𝑖
21
In the case of process discovery, all events have equal size, and hence size replacement policies are not very relevant when it comes to determine which events should be kept and which ones can be replaced. Accordingly, in this thesis we will consider recency and fre-quency-based policies.
3.9 Summary
Currently, all of the presented approaches for online automated process discovery are based on Heuristics Miner algorithm and designed for discovery of BPMN process models. How-ever, as it was mentioned earlier, in order to be able to cope with unstructured processes, we will focus on the online discovery of process maps. The approach to be followed is to adapt the Fuzzy Miner algorithm by using a combination of Lossy Counting with Budget tech-nique together with cache replacement policies in order to enable real-time discovery. We retain from the overview provided here that the LRU and LFU policies are among the most relevance cache replacement policies to manage the requirement of a fixed-size budget for storing events of completed traces. In addition, we will consider the LFU-DA policy in order to cope with concept drifts.
22
4 Contribution
In the following section, the solution will be presented in the form of the process map dis-covery algorithm, which is able to deal with event streams. It presents the algorithm includ-ing pseudocode and architecture of the solution.
4.1 Requirements to solution
As it was discussed earlier the following requirements to discovery algorithm were identi-fied:
- The algorithm has to minimize the amount of memory used - The algorithm execution time need to be minimized
- The update of the current state of the process has to be done after observing every new element in event stream
- The model update has to meet scalability requirements
- Algorithm has to produce the actual current state of the process, dealing with concept drifts
- Algorithm has to deal with noise
- The result of the execution of the algorithm is a process map
The most efficient algorithm to produce process maps is Fuzzy Miner. It uses two major metrics to build the model: significance and correlation. Significance can be calculated by frequencies of activities and direct-follows relations. In order to estimate the correlation, we have to know the performers of the activities, data they are working with, etc. In most of the cases, an incoming event has a structure (c, a, t), where c represents case identifier, a is an activity which is done and t is a timestamp. Therefore, we will focus on significance metric only. In this case, the produced model can be filtered by the minimal support of activity or path.
In section 3.6. currently existing frequency counting approaches were presented. Sliding Window algorithm takes events for a certain period of time, produces temporal event log and then apply an offline version of the process discovery algorithm. This approach does not update the model constantly since each new event triggers only the update of the memory but not necessarily the model. In addition, the time required by this approach is completely unbalanced, and the algorithm performs a lot of inexpensive operations. Therefore, it does not meet the requirements specified above. Lossy Counting and Sticky sampling are two alternative approaches, but based on the results of analysis in [11], Lossy Counting performs better. This algorithm deals with noise removing low frequent behavior. In addition, it con-stantly updates the abstract footprint of the model with minimum time. One disadvantage of this algorithm is that it does not take into consideration the memory limitation. The Lossy Counting with Budget is a modification of Lossy Counting algorithm, which deals with this problem. Based on retrieved relations and their frequencies the control-flow of the process can be built.
However, the fact that appearance of a new case can trigger the deletion of event or relation, therefore unnecessary losing some information does not satisfy the specified requirements. Therefore, this algorithm has to be rewritten. Cache replacement policies will take respon-sibility of objects deletion. We will take three of them (LFU, LRU, LFU-DA) and compare the results.
The model update will be done in scheduled way after the occurrence of n events or after n seconds/minutes. In order to identify the criteria for scheduling, we will need to consider the frequency of occurrence of a new event. On the one hand, in large enterprises, where
23
information systems produce a huge amount of data with high frequency, the model update has to be done after every n second. On the other hand, in the case of low frequency of new event occurrence, the model can be updated after every n received events. Moreover, the update of the model can even be done once a day in case it used for monitoring purposes. Therefore, the scheduling type strongly depends on the business case. The update can be done automatically or manually.
4.2 Architecture of the solution
The high-level architecture of the solution is described by Figure 8:
Figure 8. The architecture of solution
It consists of two main modules which are working in parallel but with different frequencies. The first one updates the footprint of the current state of the process with an occurrence of every event, while the second updates the dependency graph and visualize the current model in scheduled recurring way.
According to this chart, there are four main components:
1) Preprocessing component
This component is responsible for extraction of event attributes. In an offline scenario, every event is represented by a row in Event Log table (CSV file for example). In online scenario event which is recorded by the information system is a string of the form of: "case_id<separator>activity<separator>timestamp",
where separator is a character delimiter such as comma (','), semicolon (';') or dot ('.') However, the attributes can have a different order (e.g. "1, Sent to Customer, 02-10-16 17:40" and "02-10-16, 1, Sent to Customer") or different separators. In addition, some attributes can be empty or even miss. The main problem is related to timestamps. There many notations of timestamps (e.g. "07.11.2016 17:40" and "02-10-16 17:40") and sometimes they are not recorded by information systems. A lot of discovery techniques rely on timestamps, but, in the case of working with event streams, to discover control-flow of the process case identifier and activity name are only needed. Therefore, many data quality problems can be eliminated. In order to do further analysis, find bottlenecks and other possible problems we need timestamps. The main idea of this component is to use regular expressions to extract the necessary attributes values and to create an object of the class Event based on it (Figure 9):
24
Event
Timestamp Case_id Activity1 , Check Claim , 02-12-2010 17:45
Figure 9. Attributes extraction
2) Frequency Counting Component
This component takes an important part in the solution because it has to work online, taking into consideration time and memory limitations. Time needed to process an event has to be minimized as well as used memory. As it was discussed earlier the Lossy Counting with Budget frequency algorithm is chosen but it has to be modified. Let us put all changes into the list:
- The budget will specify the maximum amount of activities and relations that can be stored in memory. Since activities and relations represent nodes and edges of the graph model, budget now will have more practical interpretation
- Deletion procedure will be triggered only when the total amount of activities and relations exceeds the budget (appearance of new case will not trigger deletion of activity or relation anymore)
- All running cases will be kept in memory and finished cases will be deleted. In order to understand whether the case is finished, the notion of activity type is introduced. - Cache replacement policies will be responsible for deletion of objects
- The concept of buckets is eliminated. Instead of it the last seen index metric will be used for every object to define its „age“
- There is no need for case object to store more attributes than case_id and last activity
In order to apply modified Lossy Counting with Budget method to Fuzzy Miner algo-rithm we need to define following variables:
e – event, tuple (c, a, t, type), where c – case_id, a – activity, t – timestamp, type – type of event
ES – event stream
B – available budget (used in order to specify memory constraints)
Sa – the set of activities, each element of the set is tuple (a, f, idx), where a – activity, f – estimated frequency, idx - last seen index
Sc – the set of cases, each element is a tuple (c, a), where c – case_id, a – the last activity in the case
Sr – the set of direct succession relations, each element is a tuple of (r, f, idx), where r – relation (start_activity, end_activity), f – frequency, idx – last seen index
Structures Sa and Sr store the frequency of activities and relations respectively. Additional structure Sc is used to hold the running cases. Here is the last activity of case has to be updated every new event of the current case arrives. Every time a new event e = (ci, ai, ti) is observed, Sa is updated with the new observation of the activity a. After that, the procedure checks if there is an entry in Sc associated to the case id of the current event. If not, a new entry is added to Sc. Otherwise, the case is already running and the observed event is not the start event. Therefore we are observing the direct-follows relation. The
25
algorithm builds the relation r (last activity of the case, ai) and updates its corresponding frequency in Sr. Finally, if the case c is finished, it will be removed from Sc. Sa and Sr will share the memory, thus the appearance of new activity can trigger the deletion of relation. When an entry is removed from Sc,the corresponding activity can not appear in any relations in Sr.
Three well-known cache replacement policies such as LFU, LRU, and LFU-DA will be used as the deletion mechanisms. Therefore, we will have three modifications of Lossy Counting with Budget algorithm:
1) LCB with LRU policy
Here the pointer in the stream will be continuously updated with an occurrence of every event. Objects of Sa and Sr will have attribute last seen index, which will tell about the recency of an element. When the total amount of activities and relations will exceed the budget, the LRU policy will be triggered. When the least recent object is relation it will be deleted from Sr. However, when the least recent object is activity, in addition to removal of this activity, all relations associated to activity have to be deleted as well.
Algorithm 1: LCB with LRU policy Input: ES: event stream; B: available budget Sa ← null /* Initially empty set */ Sc ← null /* Initially empty set */ Sr ← null /* Initially empty set */ idx ← 1 /* Initial pointer in the stream */ 1 forever do
2 e ← observe(ES) 3 if analyse(e) then
4 if activity a is already in the Sa then
5 Increment the frequency of a in Sa
6 Update lastSeenIndex of a in Sa
7 else
8 if |𝑆𝑎| + |𝑆𝑟| = 𝐵 then
9 Apply deletion procedure 10 end
11 Insert (𝑎, 1, 𝑖𝑑𝑥) in Sa 12 end
13 if case c is already in the Sc then
14 if relation r(a, ai) exists in the Sr then 15 Increment the frequency of r in the Sr 16 Update lastSeenIndex of r in Sr 17 else
18 if |𝑆𝑎| + |𝑆𝑟| = 𝐵 then
19 Apply deletion procedure 20 end
21 Insert (𝑟, 1, 𝑖𝑑𝑥) in Sr 22 end
23 update a in (𝑐, 𝑎) 24 if case c is finished then 25 Delete c from Sc 26 end
26 27 else 28 Insert (𝑐, 𝑎) in Sc 29 end 30 idx ← idx + 1 31 end 32 end Deletion Procedure: Input: Sa, Sc, Sr, B 1 mina ← min(idx in (𝑎, 𝑓, 𝑖𝑑𝑥) ∈ 𝑆𝑎) 2 minr ← min(idx in (𝑟, 𝑓, 𝑖𝑑𝑥) ∈ 𝑆𝑟)
3 if (mina > minr) then
4 Remove (𝑟, 𝑓, 𝑖𝑑𝑥) from Sr where idx = minr
5 else
6 Remove (𝑎, 𝑓, 𝑖𝑑𝑥) from Sa where idx = mina
7 for each (𝑟, 𝑓, 𝑖𝑑𝑥) ∈ 𝑆𝑟 where 𝑠𝑡𝑎𝑟𝑡 𝑜𝑟 𝑒𝑛𝑑 𝑎𝑐𝑡𝑖𝑣𝑖𝑡𝑦 = 𝑎 do
8 Remove (𝑟, 𝑓, 𝑖𝑑𝑥) from Sr 9 end
10 end
2) LCB with LFU policy
In contrast to LRU, there is no need to keep the pointer on the event stream and store the last seen index for every object. Therefore, the size of objects can be shortened and the memory consumption will decrease. The deletion procedure will be similar to the first case, but we will drop the object with the smallest frequency.
Algorithm 2: LCB with LFU policy Input: ES: event stream; B: available budget Sa ← null /* Initially empty set */ Sc ← null /* Initially empty set */ Sr ← null /* Initially empty set */
1 forever do
2 e ← observe(ES) 3 if analyse(e) then
4 if activity a is already in the Sa then
5 Increment the frequency of a in Sa
6 else
7 if |𝑆𝑎| + |𝑆𝑟| = 𝐵 then
8 Apply deletion procedure 9 end
10 Insert (𝑎, 1) in Sa 11 end
12 if case c is already in the Sc then 13 if relation (a, ai) exists in the Sr then 14 Increment the frequency of r in the Sr 15 else
16 if |𝑆𝑎| + |𝑆𝑟| = 𝐵 then
17 Apply deletion procedure 18 end
27
19 Insert (𝑟, 1) in Sr 20 end
21 update a in (𝑐, 𝑎) 22 if case c is finished then 23 Delete c from Sc 24 end 25 else 26 Insert (𝑐, 𝑎) in Sc 27 end 28 end 29 end Deletion Procedure: Input: Sa, Sc, Sr, B 1 mina ← min(f in (𝑎, 𝑓) ∈ 𝑆𝑎) 2 minr ← min(f in (𝑟, 𝑓) ∈ 𝑆𝑟)
3 if (mina > minr) then
4 Remove (𝑟, 𝑓) from Sr where f = minr
5 else
6 Remove (𝑎, 𝑓) from Sa where f = mina
7 for each (𝑟, 𝑓) ∈ 𝑆𝑟 where 𝑠𝑡𝑎𝑟𝑡 𝑜𝑟 𝑒𝑛𝑑 𝑎𝑐𝑡𝑖𝑣𝑖𝑡𝑦 = 𝑎 do
8 Remove (𝑟, 𝑓) from Sr
9 end 10 end
3) LCB with LFU-DA policy
This approach will be the closest to original Lossy Counting with Budget method because it will consider frequency and age of elements (unlike first two approaches where only one metric was considered, either frequency or age). As it was stated in section 3.8, LFU-DA policy uses a calculation of key value K in order to drop the elements. K is calculated as Ki = Fi + L, where Fi is frequency and L is a dynamic aging factor. Initially, L is set to 0, thus before the first deletion, we are calculating just frequencies as in the second approach (LFU). However, after deletion of object j, L is updated and set to Kj.
Algorithm 3: LCB with LFU-DA policy Input: ES: event stream; B: available budget Sa ← null /* Initially empty set */ Sc ← null /* Initially empty set */ Sr ← null /* Initially empty set */ L ← 0 /* Initial dynamic aging factor */ 1 forever do
2 e ← observe(ES) 3 if analyse(e) then
4 if activity a is already in the Sa then
5 Increment the frequency of a in Sa
6 else
7 if |𝑆𝑎| + |𝑆𝑟| = 𝐵 then
28 9 Update L
10 end
11 Insert (𝑎, 1, 𝐿) in Sa 12 end
13 if case c is already in the Sc then 14 if relation (a, ai) exists in the Sr then 15 Increment the frequency of r in the Sr 16 else
17 if |𝑆𝑎| + |𝑆𝑟| = 𝐵 then
18 Apply deletion procedure 19 Update L
20 end
21 Insert (𝑟, 1, 𝐿) in Sr 22 end
23 update a in (𝑐, 𝑎) 24 if case c is finished then 25 Delete c from Sc 26 end 27 else 28 Insert (𝑐, 𝑎) in Sc 29 end 30 end 31 end Deletion Procedure: Input: Sa, Sc, Sr, B 1 mina ← min(𝑓 + ∆ in (𝑎, 𝑓, ∆) ∈ 𝑆𝑎) 2 minr ← min(𝑓 + ∆ in (𝑟, 𝑓, ∆) ∈ 𝑆𝑟)
3 if (mina > minr) then
4 Remove (𝑟, 𝑓, ∆) from Sr where 𝑓 + ∆ = minr
5 else
6 Remove (𝑎, 𝑓, ∆) from Sa where 𝑓 + ∆ = mina
7 for each (𝑟, 𝑓, ∆) ∈ 𝑆𝑟 where 𝑠𝑡𝑎𝑟𝑡 𝑜𝑟 𝑒𝑛𝑑 𝑎𝑐𝑡𝑖𝑣𝑖𝑡𝑦 = 𝑎 do
8 Remove (𝑟, 𝑓, ∆) from Sr
9 end 10 end
The execution of this module results in the list of frequencies for activities and direct-follows relations. Based on this data it is possible to identify the control-flow structure of the process. The proposed approaches allow us to control the loss. Since the model creation now depends only on the activities and relations, it is possible to calculate which budget is needed to get the lossless model by using the following formula:
𝐵 =𝑁(𝑁+1)
2 + 𝑁 (8) where N is the total number of distinct activities.
Let us take as an example Cloud Order Human Anonymized log, provided by company Minit. The model of the process described in the log (Figure 10):
29
Figure 10. Model extracted from Cloud Order Human Anonymized log The budget to get lossless model is the following:
𝐵 =4 ∗ 5
2 + 4 = 14
The model obtained with this budget is described by Figure 11:
Figure 11. Model discovered by the system It can be easily seen that the obtained model is equal to the original.
3) Process Discovery Component
It is an intermediate component between frequency counter and visualization component that applies process mining techniques in order to discover the control-flow of the process, to identify parallelism, loops, etc. In the case of process map it is impossible to say whether it is parallelism or choice (there is no such notion as gate), thus both of them can be discovered in the same way: if there are exist relations (a,b) and (b,a) in the same time then a and b are parallel. In the case of discovering BPMN model, the technique, mentioned in Heuristics Miner can be applied (see Section 3.3).
4) Visualisation Component
Based on the results of process discovery it builds the graphical view of the process. 4.3 Implementation
The solution is built using Cortana Intelligence. It is a cloud-based solution for work with big data, which provides advanced analytics tools (Figure 12).
30
Figure 12. Role of Cortana Intelligence (Source: http://bit.ly/1RTOHRg)
Cortana Intelligence Suite includes Machine Learning and Analytics tools as well as Information Management and Visualization solutions. The full stack of tools is given in Figure 13:
Figure 13. Cortana Intelligence Suite (Source: http://bit.ly/1r7dBXz) While building our solution, we are mostly interested in Event Hub.
4.3.1. Azure Event Hub
Azure Event Hub is a highly scalable publish-subscribe service that can ingest millions of events per second and stream them into multiple applications. This allows to process and analyze the massive amounts of data produced by connected devices and applications. Event Hub is a managed service that ingests the events with an elastic scale to accommodate various load profiles, accordingly to use case. It meets the challenge of connecting disparate data sources while handling the scale of the aggregate stream. It provides the capacity to process events from millions of devices while preserving the order of the events on a per-device basis. Event Hub supports Advanced Message Queuing Protocol (AMQP) and HTTP, allowing many platforms to work with it. In our solution Event Hub will play a role of reliable pipeline between customer company and process mining tool (Figure 14).