CHAPTER 1: INTRODUCTION
1.2.3 Process Mining Techniques for Knowledge Discovery
Process Mining is used as a method of reconstructing processes as executed from Event Logs [38]. These logs are generated from process-aware information systems such as Enterprise Resource Planning (ERP), Workflow Management (WFM), Customer Relationship Management (CRM), Supply Chain Management (SCM), and Product Data Management (PDM) [14]. The logs contain records of events such as activities being executed or messages being exchanged on which Process Mining techniques can be applied in order to discover, analyze, diagnose and improve processes, organizational, social and data structures [37]. This can also be understood as the automated discovery of processes from Event Logs resulting in the generation of a process model (e.g., a Petri net) that describes the causal dependencies between activities [14].
More specifically, Van der Aalst et al. [14] describe the goal of Process Mining to be the extraction of information on the process from Event Logs using a family of a-posteriori analysis techniques. These techniques enable the identification of sequentially recorded events where each event refers to an activity and is related to a particular case (i.e., a process instance). They can also help identify the performer or originator of the event (i.e., the person/resource executing or initiating the activity), the timestamp of the event, or data elements recorded with the event. Such information is critical in our endeavor, as we attempt to study the generation and originators of learning patterns from Event Logs we built from FLOSS repositories.
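As a rough illustration of the event attributes listed above (activity, case, originator, timestamp), the following Python sketch models a single event and groups events into cases. The `Event` class, its field names, and the sample records are hypothetical and invented for illustration; they are not taken from the FLOSS Event Logs studied here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One recorded event; all field names are illustrative assumptions."""
    case_id: str         # the process instance the event belongs to
    activity: str        # the activity the event refers to
    originator: str      # the person/resource executing the activity
    timestamp: datetime  # when the event was recorded


def group_by_case(events):
    """Return {case_id: [activities ordered by timestamp]}."""
    cases = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        cases[event.case_id].append(event.activity)
    return dict(cases)


# Hypothetical records, invented purely for illustration.
sample_events = [
    Event("case 1", "post question", "alice", datetime(2020, 1, 1, 9, 0)),
    Event("case 2", "post question", "carol", datetime(2020, 1, 1, 9, 5)),
    Event("case 1", "reply", "bob", datetime(2020, 1, 1, 9, 30)),
]
traces = group_by_case(sample_events)
```

Grouping by case and ordering by timestamp yields one activity sequence per process instance, which is the form of input that discovery techniques operate on.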
Process Mining comprises three types of technique: discovery, conformance and extension [14]. In Discovery, there is no a-priori model; the “discovered” model is instead derived from an Event Log. An example is shown in Figure 10: a process model generated using the α-algorithm (described later). In Conformance, an a-priori model is used to check whether the events recorded in the log conform to it; this makes it possible to detect, locate and explain deviations in order to take appropriate action. The last technique is Extension, where an a-priori model is extended or enriched with new aspects, for example the extension of a process model with performance data [14, 37, 39].
Current Process Mining techniques evolved from the work done by Weijters and Van der Aalst [38], where the purpose was to generate a workflow design from recorded information on workflow processes as they take place. This rests on the assumption that, in Event Logs, each event refers to a task (a well-defined step in the workflow), each event refers to a case (a workflow instance), and these events are recorded in a certain order. Combining techniques from machine learning and Workflow nets, the authors construct Petri nets, which provide a graphical but formal language for modeling concurrency [38]. Figure 10 depicts an example of a workflow process modeled as a Petri net.
Process Models for Learning patterns in FLOSS repositories Page 19
Figure 10. Example of a workflow process modeled as a Petri net
In workflows such as the one shown in Figure 10, cases (workflow instances) represent, for example, an insurance claim, a tax declaration, or a request for information [38]. These cases are handled by executing tasks in a specific order. The workflow definition specifies which tasks need to be executed and in which order, as exemplified by the Petri net in Figure 10, which thereby helps in routing the execution of cases. Transitions model tasks, while places and arcs model causal dependencies. In Figure 10, the transitions T1, T2, …, T13 represent tasks, and the places Sb, P1, …, P10, Se represent the causal dependencies. Places can also be understood as representing pre- or post-conditions for tasks. An AND-split corresponds to a transition with two or more output places (from T2 to P2 and P3), and an AND-join corresponds to a transition with two or more input places (from P8 and P9 to T11). OR-splits/OR-joins correspond to places with multiple outgoing/incoming arcs (from P5 to T6 and T7, and from T7 and T10 to P8). At any time a place contains zero or more tokens, drawn as black dots.
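The split and join patterns described above can be recognized mechanically from the arcs of a net. The sketch below is a minimal illustration that uses only the arcs of Figure 10 named in the text; the remainder of the net is not reproduced, and the helper `is_transition` relies on the assumption that transition names start with "T".

```python
from collections import defaultdict

# Only the arcs of Figure 10 that the text names explicitly.
arcs = [
    ("T2", "P2"), ("T2", "P3"),    # T2 has two output places
    ("P8", "T11"), ("P9", "T11"),  # T11 has two input places
    ("P5", "T6"), ("P5", "T7"),    # P5 has two outgoing arcs
    ("T7", "P8"), ("T10", "P8"),   # P8 has two incoming arcs
]

outputs = defaultdict(set)  # node -> nodes reached by an outgoing arc
inputs = defaultdict(set)   # node -> nodes with an arc into it
for source, target in arcs:
    outputs[source].add(target)
    inputs[target].add(source)


def is_transition(node):
    # Assumed naming convention: transitions are T1, ..., T13.
    return node.startswith("T")


nodes = set(outputs) | set(inputs)
and_splits = sorted(n for n in nodes if is_transition(n) and len(outputs[n]) >= 2)
and_joins = sorted(n for n in nodes if is_transition(n) and len(inputs[n]) >= 2)
or_splits = sorted(n for n in nodes if not is_transition(n) and len(outputs[n]) >= 2)
or_joins = sorted(n for n in nodes if not is_transition(n) and len(inputs[n]) >= 2)
```

On this fragment the four lists single out T2, T11, P5 and P8 respectively, matching the examples given in the paragraph.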
We consider a workflow log as a set of event sequences where each event sequence is simply a sequence of task identifiers [38], formally expressed as WL ⊆ T* where WL is a workflow log and T is the set of tasks. An example event sequence of the Petri net of Figure 10 is:
T1, T2, T4, T3, T5, T9, T6, T3, T5, T10, T8, T11, T12, T2, T4, T7, T3, T5, T8, T11, T13.
Hence, given a workflow log WL, Process Mining aims to discover a WF-net that (i) potentially generates all event sequences appearing in WL, (ii) generates as few event sequences of T*\WL as possible, (iii) captures concurrent behavior, and (iv) is as simple and compact as possible. The Process Mining technique used to generate the Petri net consists of three distinct steps: Step (i) the construction of a dependency/frequency table (D/F-table), Step (ii) the induction of a D/F-graph out of the D/F-table, and Step (iii) the reconstruction of the WF-net out of the D/F-table and the D/F-graph [38].
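To give a feel for Step (i), the sketch below counts two simple quantities over the example event sequence quoted above: how often each task occurs, and how often one task is directly followed by another. This is a simplified assumption about what a dependency/frequency table records; the full D/F-table of [38] is not reproduced here, and the variable names are illustrative.

```python
from collections import Counter

# The example event sequence quoted above for the Petri net of Figure 10.
sequence = [
    "T1", "T2", "T4", "T3", "T5", "T9", "T6", "T3", "T5", "T10", "T8",
    "T11", "T12", "T2", "T4", "T7", "T3", "T5", "T8", "T11", "T13",
]
workflow_log = [sequence]  # a workflow log is a collection of event sequences

task_frequency = Counter()     # task -> number of occurrences
direct_succession = Counter()  # (a, b) -> number of times b directly follows a
for trace in workflow_log:
    task_frequency.update(trace)
    direct_succession.update(zip(trace, trace[1:]))

# List the most frequent direct successions first.
for (a, b), count in direct_succession.most_common(5):
    print(f"{a} directly followed by {b}: {count}")
```

Counts of this kind are the raw material from which dependency strengths between tasks can be estimated before a graph is induced.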
Significant advances have been made in Process Mining with regard to algorithm implementation and tool development [14, 37, 39-42]. Due to space constraints, we cannot report on all of these, nor on every step Process Mining takes to generate a workflow model such as the one depicted in Figure 10. We find the work by Van der Aalst et al. [39] to be an important resource in this regard. Nevertheless, we note that the discovery process in Process Mining is carried out using the α-algorithm, which reads a given Event Log, identifies the set of tasks, infers the ordering relations, builds the net based on the inferred relations, and outputs the net. Another algorithm, called the Multi-phase approach, can also be used for Process Mining, as detailed in [47]. However, we believe that the preliminaries of Process Mining are most easily understood starting with the α-algorithm, whose formalization is given below.
Let W be a workflow log over T. α(W) is defined as follows:
1. TW = { t ∈ T | ∃σ ∈ W : t ∈ σ },
2. TI = { t ∈ T | ∃σ ∈ W : t = first(σ) },
3. TO = { t ∈ T | ∃σ ∈ W : t = last(σ) },
4. XW = { (A,B) | A ⊆ TW ∧ B ⊆ TW ∧ ∀a ∈ A ∀b ∈ B : a →W b ∧ ∀a1,a2 ∈ A : a1 #W a2 ∧ ∀b1,b2 ∈ B : b1 #W b2 },
5. YW = { (A,B) ∈ XW | ∀(A′,B′) ∈ XW : A ⊆ A′ ∧ B ⊆ B′ ⟹ (A,B) = (A′,B′) },
6. PW = { p(A,B) | (A,B) ∈ YW } ∪ {iW, oW},
7. FW = { (a, p(A,B)) | (A,B) ∈ YW ∧ a ∈ A } ∪ { (p(A,B), b) | (A,B) ∈ YW ∧ b ∈ B } ∪ { (iW, t) | t ∈ TI } ∪ { (t, oW) | t ∈ TO }, and
8. α(W) = (PW, TW, FW).
Understanding the basic notations in the algorithm requires a review of some fundamentals and preliminaries on Petri nets and Workflow nets [39, 44]. First, we review some classical definitions of P/T-nets.
Definition 1 [39] (P/T-nets): A Place/Transition net, or simply P/T-net, is a tuple (P, T, F) where:
1. P is a finite set of places,
2. T is a finite set of transitions such that P ∩ T = ∅, and
3. F ⊆ (P × T) ∪ (T × P) is a set of directed arcs, called the flow relation.
Definition 2 [44]: A Petri net is a tuple PN = (P, T, F, W, M0) where:
P = {p1, p2, ..., pm} is a finite set of places,
T = {t1, t2, ..., tn} is a finite set of transitions,
F ⊆ (P × T) ∪ (T × P) is a set of arcs,
W is a weight function on the arcs (default = 1),
M0 : P → {0, 1, 2, ...} is the initial marking,
with P ∩ T = ∅ and P ∪ T ≠ ∅. In addition, k : P → {1, 2, 3, ...} ∪ {∞} is a partial capacity restriction (default = ∞).
Definition 3 [44]: Let N = (P, T, F, W, M0) be a PN and let X = P ∪ T. Then:
1. •x = {y ∈ X | (y, x) ∈ F} is the pre-set (input set) of x,
2. x• = {y ∈ X | (x, y) ∈ F} is the post-set (output set) of x,
3. nbh(x) = •x ∪ x• is called the neighborhood of x,
4. If Y ⊆ X then •Y = ⋃_{y ∈ Y} •y and Y• = ⋃_{y ∈ Y} y•.
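The pre-set, post-set and neighborhood of a node translate directly into set comprehensions over the flow relation: the pre-set collects the sources of arcs entering x, and the post-set collects the targets of arcs leaving x. The sketch below is a minimal illustration; the function names and the small example net are assumptions made for this example only.

```python
def preset(x, flow):
    """Pre-set of x: every node y with an arc (y, x) in the flow relation."""
    return {source for (source, target) in flow if target == x}


def postset(x, flow):
    """Post-set of x: every node y with an arc (x, y) in the flow relation."""
    return {target for (source, target) in flow if source == x}


def neighborhood(x, flow):
    """nbh(x): union of the pre-set and the post-set of x."""
    return preset(x, flow) | postset(x, flow)


def preset_of_set(nodes, flow):
    """Pre-set of a set Y of nodes: union of the pre-sets of its members."""
    return set().union(*(preset(x, flow) for x in nodes))


# Hypothetical net: t1 splits into p2 and p3, and t2 joins them again.
flow = {
    ("p1", "t1"), ("t1", "p2"), ("t1", "p3"),
    ("p2", "t2"), ("p3", "t2"), ("t2", "p4"),
}
print(preset("t2", flow))        # input places of t2
print(postset("t1", flow))       # output places of t1
print(neighborhood("p2", flow))  # transitions adjacent to p2
```

Representing the flow relation as a set of (source, target) pairs keeps the code close to the definition F ⊆ (P × T) ∪ (T × P).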
Definition 4 [44]: Let N = (P, T, F, W, M0) be a PN. Then N:
1. is P-simple iff ∀ x, y ∈ P, (•x = •y ∧ x• = y• ⟹ x = y),
2. is T-simple iff ∀ s, t ∈ T, (•s = •t ∧ s• = t• ⟹ s = t),
3. has no isolated places iff ∀ x ∈ X, nbh(x) ≠ ∅.
Definition 5 [44]: A PN is:
1. pure iff ∀ x ∈ X, •x ∩ x• = ∅.
Additional notions crucial for understanding of Petri nets include the firing rule, reachable markings, firing sequence, connectedness, boundedness, safeness, dead transitions and liveness [39].
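The firing rule is not spelled out in this section. As a hedged illustration, the sketch below assumes the standard rule for weighted nets: a transition is enabled when every input place holds at least as many tokens as the weight of the connecting arc, and firing removes those tokens and adds tokens to the output places. Capacities are assumed to be unbounded (the default), and the function names and example net are hypothetical.

```python
def is_enabled(marking, transition, arcs):
    """arcs maps (source, target) -> weight; marking maps place -> tokens."""
    return all(
        marking.get(source, 0) >= weight
        for (source, target), weight in arcs.items()
        if target == transition
    )


def fire(marking, transition, arcs):
    """Return the marking reached by firing an enabled transition."""
    if not is_enabled(marking, transition, arcs):
        raise ValueError(f"{transition} is not enabled")
    new_marking = dict(marking)
    for (source, target), weight in arcs.items():
        if target == transition:    # consume tokens from input places
            new_marking[source] = new_marking.get(source, 0) - weight
        elif source == transition:  # produce tokens in output places
            new_marking[target] = new_marking.get(target, 0) + weight
    return new_marking


# Hypothetical net with an AND-split at t1 and an AND-join at t2.
arcs = {
    ("p1", "t1"): 1, ("t1", "p2"): 1, ("t1", "p3"): 1,
    ("p2", "t2"): 1, ("p3", "t2"): 1, ("t2", "p4"): 1,
}
m0 = {"p1": 1}             # initial marking: one token in p1
m1 = fire(m0, "t1", arcs)  # tokens move to p2 and p3
m2 = fire(m1, "t2", arcs)  # both tokens are consumed, one lands in p4
```

The sequence t1, t2 is a firing sequence from m0, and m1 and m2 are markings reachable from it.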
When a Petri net is constructed to model a workflow, it is called a Workflow net (WF-net) [39]. Let N = (P, T, F) be a P/T-net and t̄ be a fresh identifier not in P ∪ T. N is a workflow net (WF-net) iff:
1. Object creation: P contains an input place i such that •i = ∅,
2. Object completion: P contains an output place o such that o• = ∅,
3. Connectedness: N̄ = (P, T ∪ {t̄}, F ∪ {(o, t̄), (t̄, i)}) is strongly connected.
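The three WF-net conditions can be checked mechanically. The sketch below is one possible check, not a reference implementation: it looks for a place without incoming arcs and a place without outgoing arcs, adds the fresh transition t̄ connecting o back to i, and tests strong connectedness by reachability in both directions. The function names and the example net are assumptions.

```python
def reachable(start, edges):
    """All nodes reachable from start by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for source, target in edges:
            if source == node and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def is_workflow_net(places, transitions, flow):
    sources = [p for p in places if all(target != p for _, target in flow)]
    sinks = [p for p in places if all(source != p for source, _ in flow)]
    # Strong connectedness of the extended net forces i and o to be unique.
    if len(sources) != 1 or len(sinks) != 1:
        return False
    i, o = sources[0], sinks[0]
    t_bar = object()  # fresh identifier, guaranteed not to be in P or T
    extended = set(flow) | {(o, t_bar), (t_bar, i)}
    nodes = set(places) | set(transitions) | {t_bar}
    reversed_edges = {(target, source) for source, target in extended}
    # Strongly connected iff every node is reachable from i and reaches i.
    return (reachable(i, extended) == nodes
            and reachable(i, reversed_edges) == nodes)


# Hypothetical net: A splits into two branches (B and C) that D joins.
places = {"i", "p1", "p2", "p3", "p4", "o"}
transitions = {"A", "B", "C", "D"}
flow = {
    ("i", "A"), ("A", "p1"), ("A", "p2"), ("p1", "B"), ("p2", "C"),
    ("B", "p3"), ("C", "p4"), ("p3", "D"), ("p4", "D"), ("D", "o"),
}
print(is_workflow_net(places, transitions, flow))
```

Adding a transition with no arcs to the same net makes the check fail, because that transition lies on no path from i to o.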
Furthermore, workflow mining aims to produce a workflow net on the basis of a workflow log such as the one given in Figure 11. It can be understood by examining how workflow logs are analyzed to generate Workflow nets, as briefly described below. Table 2 shows an example of a workflow log.
Table 2. A workflow log
case identifier    task identifier
case 1             task A
case 2             task A
case 3             task A
case 3             task B
case 1             task B
case 1             task C
case 2             task C
case 4             task A
case 2             task B
case 2             task D
case 5             task A
case 4             task C
case 1             task D
case 3             task C
case 3             task D
case 4             task B
case 5             task E
case 5             task D
case 4             task D
From the log in Table 2, we can define a Workflow trace as follows: Let T be a set of tasks. σ ∈ T∗ is a workflow trace and W ∈ P(T∗) is a workflow log.
The workflow trace of case 1 in Table 2 is ABCD. The workflow log corresponding to Table 2 is {ABCD, ACBD, AED}: because a workflow log is a set, the trace ABCD, which occurs twice (cases 1 and 3), and the trace ACBD, which also occurs twice (cases 2 and 4), each appear only once, while AED is the trace of case 5.
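The step from the rows of Table 2 to the workflow log can be reproduced in a few lines. The sketch below groups the rows by case identifier in their recorded order and then collects the distinct traces; the variable names are illustrative.

```python
# Rows of Table 2, in the order in which they were recorded.
rows = [
    ("case 1", "A"), ("case 2", "A"), ("case 3", "A"), ("case 3", "B"),
    ("case 1", "B"), ("case 1", "C"), ("case 2", "C"), ("case 4", "A"),
    ("case 2", "B"), ("case 2", "D"), ("case 5", "A"), ("case 4", "C"),
    ("case 1", "D"), ("case 3", "C"), ("case 3", "D"), ("case 4", "B"),
    ("case 5", "E"), ("case 5", "D"), ("case 4", "D"),
]

# Collect the tasks of each case, preserving the recorded order.
tasks_by_case = {}
for case, task in rows:
    tasks_by_case.setdefault(case, []).append(task)

# One workflow trace per case, written as a string of task identifiers.
trace_of = {case: "".join(tasks) for case, tasks in tasks_by_case.items()}

# The workflow log is the set of distinct traces.
workflow_log = set(trace_of.values())
```

Using a set discards how often each trace occurred, which is all the α-algorithm needs; a frequency-aware technique would keep the per-case traces instead.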
With such information available, the α-algorithm executes as follows [45]. The log traces are examined and the algorithm creates the set of transitions (TW) in the workflow (Step 1), the set of output transitions (TI) of the source place (Step 2), and the set of input transitions (TO) of the sink place (Step 3). In Steps 4 and 5, the α-algorithm creates the sets XW and YW, respectively, which are used to define the places of the mined workflow net. In Step 4, the algorithm discovers which transitions are causally related. Thus, for each tuple (A, B) in XW, each transition in set A causally relates to all transitions in set B, and no transitions within A (or B) follow each other in some firing sequence. Note that the OR-split/join requires the fusion of places. In Step 5, the α-algorithm refines set XW by taking only the largest elements with respect to set inclusion. In fact, Step 5 establishes the exact number of places the mined net has (excluding the source place iW and the sink place oW). The places are created in Step 6 and connected to their respective input/output transitions in Step 7. The mined workflow net is returned in Step 8 [45].
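The eight steps described above can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the implementation used in [45]: traces are non-empty strings of single-character task names, the sets A and B are taken to be non-empty, candidate sets are enumerated by brute force (adequate only for small logs), and the place names produced by the hypothetical helper `place` are arbitrary labels.

```python
from itertools import chain, combinations


def alpha(log):
    """Mine (places, transitions, flow) from a set of trace strings."""
    # Direct succession: (a, b) is present if b directly follows a somewhere.
    follows = {(s[k], s[k + 1]) for s in log for k in range(len(s) - 1)}

    def causal(a, b):       # a causally precedes b
        return (a, b) in follows and (b, a) not in follows

    def unrelated(a, b):    # a and b never directly follow each other
        return (a, b) not in follows and (b, a) not in follows

    # Steps 1-3: all tasks, first tasks, last tasks.
    t_w = {t for s in log for t in s}
    t_i = {s[0] for s in log}
    t_o = {s[-1] for s in log}

    def nonempty_subsets(items):
        items = sorted(items)
        return chain.from_iterable(
            combinations(items, r) for r in range(1, len(items) + 1))

    # Step 4: pairs (A, B) whose members are mutually unrelated within each
    # set and causally related from every a in A to every b in B.
    candidates = [frozenset(c) for c in nonempty_subsets(t_w)
                  if all(unrelated(x, y) for x in c for y in c)]
    x_w = {(a, b) for a in candidates for b in candidates
           if all(causal(x, y) for x in a for y in b)}

    # Step 5: keep only the pairs that are maximal under set inclusion.
    y_w = {(a, b) for (a, b) in x_w
           if not any((a, b) != (a2, b2) and a <= a2 and b <= b2
                      for (a2, b2) in x_w)}

    def place(a, b):
        return "p({%s},{%s})" % (",".join(sorted(a)), ",".join(sorted(b)))

    # Step 6: one place per maximal pair, plus the source and sink places.
    p_w = {place(a, b) for (a, b) in y_w} | {"i_W", "o_W"}

    # Step 7: connect each place to its input and output transitions.
    f_w = ({(x, place(a, b)) for (a, b) in y_w for x in a}
           | {(place(a, b), y) for (a, b) in y_w for y in b}
           | {("i_W", t) for t in t_i}
           | {(t, "o_W") for t in t_o})

    # Step 8: the mined net.
    return p_w, t_w, f_w


places, transitions, flow = alpha({"ABCD", "ACBD", "AED"})
for name in sorted(places):
    print(name)
```

For the log {ABCD, ACBD, AED}, Step 5 retains four pairs, so the mined net has four internal places in addition to the source and sink places.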
From a workflow log, four important relations are derived upon which the algorithm is based. These are >W, →W, #W, and ||W [45]. In order to construct a model such as the one in Figure 10 on the basis of a workflow log, the latter has to be analyzed for causal dependencies [39]. For this purpose, the Log-based ordering relations notation is introduced:
Let W be a workflow log over T, i.e., W ∈ P(T∗). Let a, b ∈ T:
— a >W b if and only if there is a trace σ = t1t2t3...tn−1 and i ∈ {1, ..., n−2} such that σ ∈ W and ti = a and ti+1 = b,
— a →W b if and only if a >W b and not b >W a,
— a #W b if and only if not a >W b and not b >W a,
— a ||W b if and only if a >W b and b >W a.
Considering the workflow log W = {ABCD, ACBD, AED}, relation >W describes which tasks appeared in sequence (one directly following the other). Clearly, A >W B, A >W C, A >W E, B >W C, B >W D, C >W B, C >W D, and E >W D. Relation →W can be computed from >W and is referred to as the (direct) causal relation derived from workflow log W. In our example we have: A →W B, A →W C, A →W E, B →W D, C →W D, and E →W D. Note that B →W C does not hold, because C >W B. Relation ||W suggests potential parallelism; in our example, B ||W C and C ||W B, since both B >W C and C >W B hold. More details about the algorithm, its potential and limitations, as well as the current status of its implementation, are described further in the work by Van der Aalst and others [14, 39, 43, 45].
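The relations discussed above can be tabulated for every pair of tasks. The sketch below derives such a table for the example log; the function name `footprint` and the symbols chosen for the cells ("->", "<-", "||", "#") are assumptions made for this illustration.

```python
def footprint(log):
    """Return (tasks, table) where table[(a, b)] names the relation of a to b."""
    tasks = sorted({t for trace in log for t in trace})
    # Direct succession: b directly follows a in at least one trace.
    follows = {(trace[k], trace[k + 1])
               for trace in log for k in range(len(trace) - 1)}
    table = {}
    for a in tasks:
        for b in tasks:
            forward, backward = (a, b) in follows, (b, a) in follows
            if forward and backward:
                table[a, b] = "||"  # potential parallelism
            elif forward:
                table[a, b] = "->"  # a causally precedes b
            elif backward:
                table[a, b] = "<-"  # b causally precedes a
            else:
                table[a, b] = "#"   # never directly follow each other
    return tasks, table


tasks, table = footprint({"ABCD", "ACBD", "AED"})

# Print the table as a grid, one row per task.
print("   " + "  ".join(f"{b:>2}" for b in tasks))
for a in tasks:
    print(f"{a:>2} " + "  ".join(f"{table[a, b]:>2}" for b in tasks))
```

Reading a row of the grid gives, for one task, its relation to every other task, which makes pairs such as B and C (potentially parallel) easy to spot.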