Matching-based Algorithm for Pattern Discovery

6.1 Motivation to a Pattern Discovery Solution

Among the pattern-based techniques in the proposed framework for SOA design and integration (Chapter 3), pattern discovery can be exploited to guide the definition of new services based on unknown and frequent process sections which can be documented and reused in the form of process patterns.

Previous sections focused on process pattern matching, whose aim is to identify known process patterns in process models. The motivation is to support automatic process pattern matching as an analysis instrument during the definition of process-centric services based on proven designs documented as process patterns. Process patterns can provide guidelines to design new (software) services that can be used during enterprise process and application integration projects. However, in a number of organisations process steps are already supported by existing software components. Identifying recurring connected process steps can provide an opportunity to define reusable services that can be implemented by encapsulating existing software components. This idea is aligned with the basic principle of software reuse in SOA [Erl 2004] and it can support software component rationalisation [Albani 2006]. Considering this opportunity, this section shifts the attention from process pattern matching to the discovery of frequently occurring substructures – that can be captured as patterns – in large-scale business process models. The pattern matching algorithms introduced in the previous chapter are used as the basis of the pattern discovery technique proposed in this chapter. Also, semantic variations and generalisation can potentially be used in this new scenario.

Pattern discovery scenarios. The problem of identifying recurring connected process steps in process models is addressed as a problem of frequent pattern discovery in graphs [Kuramochi 2005]. Two distinct scenarios can be considered in this regard: the graph-transaction scenario and the single-graph scenario. The former refers to the discovery of subgraphs that occur frequently across a set of input graphs (graph transactions repository). The result of an algorithm for discovering patterns in the graph-transaction scenario is a set of graphs containing the frequent subgraph (pattern) across graphs in the repository. A graph is considered part of the result irrespective of how many times the pattern occurs in a particular transaction. Instead, for the single-graph scenario, an algorithm would discover the subgraphs that occur multiple times in a single, large input graph. The problem formulation and the input data used by algorithms in these two scenarios have inherent differences. According to [Kuramochi 2005], the algorithms developed for the graph-transaction scenario cannot be used to solve the problem defined for the single-graph scenario, whereas algorithms for the latter scenario can be easily adapted to work in the graph-transaction scenario. The problem and solution presented in this section are defined on the basis of a single-graph setting scenario. Such a single graph represents an – often large and complex – process model. Discovered patterns from this process model graph would indicate potential reusable process-centric services. Trace links to (lower) application architecture levels - see traceability model in Section 3.4 - would link processes to software components that can be analysed with the aim of being rationalised.

6.1.1 Matching versus Discovering Patterns in Graphs

Although there are similarities between the pattern discovery problem and the pattern matching problem described in previous sections, the formal relationship between a pattern and its host graph is different for both problems. Pattern matching on graphs involves the identification of a homomorphic relation between a given pattern graph P and a given graph model M. Instead, pattern discovery only takes the graph model M as input and, in order to discover unknown (frequent) patterns in M, M is compared to itself during the search for reoccurring subgraphs. Just as the homomorphic mapping from P to M formalises the structural preservation relation between P and its pattern instances in M, an endomorphic relation from M to subgraphs of itself formalises the structural relation involved in the frequent pattern discovery problem.
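
The structure-preservation condition behind pattern matching can be illustrated with a minimal sketch that checks whether a vertex mapping from a pattern graph P to a model graph M sends every edge of P onto an edge of M. The edge-list representation, the function name is_homomorphism and the sample vertex names are assumptions made for illustration only; typing and attributes of process graphs are ignored.

```python
def is_homomorphism(p_edges, m_edges, mapping):
    """Return True if `mapping` sends every directed edge of P onto an edge of M."""
    m_edge_set = set(m_edges)
    return all((mapping[u], mapping[v]) in m_edge_set for u, v in p_edges)


# Hypothetical model M: two consecutive "receive -> act" steps.
M_EDGES = [("r1", "a1"), ("a1", "r2"), ("r2", "a2")]
# Hypothetical pattern P: a single "receive -> act" step.
P_EDGES = [("r", "a")]

# Two structure-preserving mappings (two pattern instances) and one that is not.
first = is_homomorphism(P_EDGES, M_EDGES, {"r": "r1", "a": "a1"})
second = is_homomorphism(P_EDGES, M_EDGES, {"r": "r2", "a": "a2"})
broken = is_homomorphism(P_EDGES, M_EDGES, {"r": "a2", "a": "r1"})
```

In the discovery setting no pattern P is given, so candidate subgraphs would have to be taken from M itself before a check of this kind could be applied.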

Pattern Size and Occurrence Frequency. Discovering a frequent pattern in M involves the identification of a subgraph U that appears in M a number of times that is considered frequent. The occurrence frequency of U in M and also the size of U are defined by end users. Users can be interested only in subgraphs occurring in M at least a specific number of times fU. On the other hand, if a subgraph U is considered a subgraph (pattern) of interest, then so can be subgraphs of U. Determining what size of U is adequate depends on the final goal that triggered the discovery of patterns in a graph. The goal in this work is to define boundaries on process descriptions as guidelines to define new services. Those boundaries are identified with the purpose of benefiting service reuse across the process. Service reuse is strongly influenced by the service application scenario and therefore the input from end users (designers) is important. During the design of new services based on discovered patterns, end users would deal with a tradeoff between size and frequency of the discovered patterns. A greater occurrence frequency would benefit service reuse. On the other hand, a greater pattern size would lead to a coarser grained service and, indirectly, it could benefit the performance of a service composition created to automate or integrate a business process. In terms of performance, coarser grained services (as building blocks of service compositions) can be seen as beneficial compared to finer grained services. By composing finer grained services addressing the same integration problem, a larger number of services – therefore, more service requests and responses – would be involved. That increased number of requests and responses could degrade performance due to the overhead added by the new and more complex service composition. Thus, pattern size and occurrence frequency affect the decisions involved in the design of new pattern-based services. Methodological guidelines for how to adjust the parameters of an algorithm for pattern discovery, and the implementation of the algorithm in a tool that end users can interact with, would provide semi-automated support for the design of the new services. The affected design guidelines are those related to service reuse and service granularity.

Counting occurrence frequency of a subgraph. There are different approaches to count the occurrence frequency of a subgraph U in M. If the subgraph is frequent enough, then it is considered a pattern of interest. Counting approaches vary according to how overlaps among occurrences of a subgraph U are considered – see Section 4.5.1 for more details on overlaps in graphs. One alternative would count all occurrences of a subgraph U, including those belonging to an overlap. Another alternative would count occurrences of U in M only if they are edge-disjoint (i.e., they do not share edges in M). Counting occurrences from overlaps could mean that the occurrence frequency of a subgraph does not decrease monotonically as a function of its size, causing a pattern discovery solution to become intractable [Vanetik 2002].

6.1.2 Frequent Pattern Discovery in Process Graphs

Consider a process model M, where M is a typed attributed graph M = ⟨AM, am⟩ over ATM. The problem of frequent pattern discovery in process graphs concentrates on finding connected edge-disjoint subgraphs occurring in M. A subgraph U of M is considered a frequent pattern if it appears in M at least a number f of times, where f is the so-called occurrence frequency threshold.

In the case of an overlap containing recurrent edge-disjoint subgraphs, counting occurrences of the subgraph involves the calculation of an independent set of vertices from the overlap [Kuramochi 2005]. For a graph H = (V, E), a set of vertices I ⊂ V(H) is called independent if no pair of vertices in I is connected by an edge in E(H). An independent set I is called maximal if every vertex in V(H) that is not in I is connected by an edge in E(H) to some vertex in I, and maximum if no independent set of H contains more vertices. Exact counting of the occurrence frequency of a pattern involves calculating the exact maximum independent set of an overlap containing the pattern. Because the calculation of the exact maximum independent set of a graph is NP-hard [Lawler 1980], an approximate pattern discovery would try to find as many subgraphs as possible with an occurrence frequency of at least f. This approximate solution is used in many practical cases such as in [Kuramochi 2005], [Inokuchi 2005] and also in this work.
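
The relation between edge-disjoint counting and independent sets can be sketched as follows. The sketch assumes that the occurrences of a candidate subgraph are already available as sets of edges of M; it builds an overlap graph in which two occurrences are adjacent when they share an edge, and then applies a simple greedy heuristic (repeatedly keeping a vertex of minimum degree). The heuristic yields a maximal, not necessarily maximum, independent set, so the resulting count is a lower bound on the exact edge-disjoint frequency. The function names and the sample path are hypothetical and are not the counting procedure used in this work.

```python
from itertools import combinations


def overlap_graph(occurrences):
    """Adjacency sets: two occurrences are adjacent if they share an edge of M."""
    adj = {i: set() for i in range(len(occurrences))}
    for i, j in combinations(range(len(occurrences)), 2):
        if occurrences[i] & occurrences[j]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_independent_set(adj):
    """Greedy maximal independent set: repeatedly keep a minimum-degree vertex."""
    remaining = {v: set(ns) for v, ns in adj.items()}
    chosen = []
    while remaining:
        v = min(remaining, key=lambda x: (len(remaining[x]), x))
        chosen.append(v)
        removed = remaining[v] | {v}
        for w in removed:
            remaining.pop(w, None)
        for ns in remaining.values():
            ns -= removed  # in-place update of the surviving adjacency sets
    return chosen


# Hypothetical path a-b-c-d-e in M; the candidate subgraph is a path of two edges.
occurrences = [
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("b", "c"), ("c", "d")}),
    frozenset({("c", "d"), ("d", "e")}),
]
adj = overlap_graph(occurrences)
disjoint = greedy_independent_set(adj)
all_count = len(occurrences)    # counts every occurrence, overlaps included
disjoint_count = len(disjoint)  # counts edge-disjoint occurrences only
```

In this toy case the middle occurrence shares an edge with both of its neighbours, so only the first and the last occurrence are counted under the edge-disjoint approach.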

Motivating Example. Figure 6.1 shows an example of a business-level process-centric service composition extracted from [Rabhi 2007]. The process-centric service implements a trading strategy simulation process, and instances of frequent patterns are highlighted in it with borders coloured in red and blue. Examples of these patterns are shown in Figure 6.2. P1 is the largest frequent pattern, occurring exactly two times in the process from Figure 6.1. P2 is a smaller pattern that occurs more frequently in the process, and it corresponds to a subgraph of the graph representing P1. On the other hand, P3, P4 and P5 are associated with subgraphs of the graph representing P2, and therefore of P1. P3 and P4 consist of single elements that can be abstracted by the Action on message element, which in turn is a more concrete element that refines the Action element. The same situation can be considered for the Process Interrupted? element from P5, which refines the more abstract Decision element. If an algorithm for frequent pattern discovery allows inexact matching by relaxing the vertex matching so that two elements can be matched if they are semantically similar but not exactly the same, the model from Figure 6.1 could be considered to have five instances of P1 (two exact instances and three inexact instances that include semantically similar elements). Intuitively and considering the semantics of labels, the elements with borders coloured in red are more similar to the pattern elements in Figure 6.2 than the elements with borders coloured in blue. For instance, for the trading strategy and simulation service (middle of Figure 6.1), the Generate and Submit orders activity is semantically similar to Send message. Also, the Monitor Market Events activity can be (semantically) associated with the abstract Action element in Figure 6.2. A threshold defining how similar two elements must be to be considered instances of the same pattern role has to be defined by an end user. A flexible algorithm for pattern discovery could allow for these inexact matches. Also, the algorithm should allow end users to identify partial matches of a pattern of interest. For instance, after discovering that P1 occurs exactly two times in the model, end users may be interested in knowing whether there are partial instances of P1. In the case of the example, there are indeed: they are associated with the frequent patterns P2, P3, P4 and P5.

Note that the example here was chosen because it can fit in one page. Real processes can be larger and more complex, requiring a means to automate some of the analysis tasks that can be difficult and expensive to do without support.

6.2 Matching-based Algorithm for Pattern Discovery

This section describes a technique to find frequent patterns in an (often large) process model M. The technique is based on an algorithm that uses the pattern matching algorithm families from previous sections. The inputs to the pattern discovery algorithm - named λ-PD algorithm, where λ is any of the pattern matching algorithms from the CP-E-PM and CP-I-PM families described in the previous chapter - are:

• the recorded graph Mt of the target process model M,

• the maximum expected size of a subgraph U representing a pattern (|V(U) + E(U)|), and

• the occurrence frequency threshold fmin indicating the minimum number of times that U has to be contained in M. Since there is no interest in finding the trivial automorphism of M and a (frequent) pattern is expected to occur at least two times in M, |V(U) + E(U)| is trivially bounded by |V(Mt)| and fmin ≥ 2.

The idea behind the proposed λ-PD algorithm is to create temporal patterns originated from each vertex in V(Mt), subsequently expand them and then check if they occur at least the number of times defined by an occurrence frequency threshold fmin. Before any expansion step is performed, temporal patterns formed by single vertices are discarded early if they do not reach the occurrence frequency fmin. Expansion steps are performed until the maximum desired size of the pattern is reached or overlaps extending the target model graph are found. The pseudo code of λ-PD is described in Table 6.1 and explained in the rest of the section.

Figure 6.1: Example of an abstract process-centric service with frequent pattern instances.


Figure 6.2: Example of an abstract process-centric service with frequent pattern instances.

Table 6.1: λ-PD Algorithm - Pattern Discovery Algorithm based on a λ pattern matching algorithm, with λ among the CP-E-PM and CP-I-PM families.

λ-PD Algorithm

Input: Target Graph (M), number of expansion steps (k), minimum occurrence frequency fmin and a selected pattern matching algorithm λ = alg-PM

Output: score, FreqM

 1 : For each vertex u in V(Mt) do
 2 :   P_pivot(u,1) ← u
 3 :   score_(u,1) ← alg-PM(M, P_pivot(u,1))
 4 :   f_(u,1) ← countFrequency(score_(u,1), P_pivot(u,1))
 5 :   If f_(u,1) > fmin do
 6 :     If k >= 1 do
 7 :       For j: 2 → k
 8 :         P_pivot(u,j) ← expand(P_pivot(u,j−1))
 9 :         score_(u,j) ← alg-PM(M, P_pivot(u,j))
10 :         f_(u,j) ← countFrequency(score_(u,j), P_pivot(u,j))
11 :       end for
12 :     end if
13 :   Else printf(insufficient frequency of temporal match centred on u)
14 :   end if
15 : end for

Table 6.2: countFrequency function for counting the frequency of a subgraph P in M.

countFrequency

Input: score resulting from matching subgraph Pt in Mt, subgraph Pt, approximate match ratio (Rt)
Output: f

1 : cnt ← 0
2 : For each index i in score do
3 :   If score(i)/|V(Pt)| >= Rt then
4 :     cnt ← cnt + 1
5 :   end if
6 : end for
7 : f ← cnt
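
The counting step of Table 6.2 can be sketched in a few lines. The sketch assumes that score is a plain list of numbers, one per vertex of Mt, and that each entry states how many vertices of Pt were matched at that position; this reading of the score vector, the function name count_frequency and the sample values are assumptions made for illustration.

```python
def count_frequency(score, pattern_size, match_ratio=1.0):
    """Count the entries of `score` whose ratio to |V(Pt)| reaches the ratio Rt."""
    if pattern_size <= 0:
        raise ValueError("pattern_size must be positive")
    return sum(1 for s in score if s / pattern_size >= match_ratio)


# Hypothetical score vector for a pattern Pt with 4 vertices.
score = [4, 0, 3, 4, 1, 2]
exact = count_frequency(score, pattern_size=4, match_ratio=1.0)
partial = count_frequency(score, pattern_size=4, match_ratio=0.75)
```

With these sample values, lowering the match ratio from 1.0 to 0.75 admits one additional, partial occurrence.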

Pseudo-code of λ-PD. The λ-PD algorithm takes as inputs the recorded graph Mt of a target process model in which frequent patterns are searched for, a parameter k defining the maximum number of expansion steps for initial temporal matches centred on each vertex of Mt, a parameter fmin indicating the minimum occurrence frequency that a subgraph must have to be considered a frequent pattern, and a selected pattern matching algorithm λ = alg-PM - presented in the previous chapter - that is used to identify occurrences of subgraphs in Mt. The outputs of the algorithm are two matrices (score and FreqM). score is an m×k matrix of vectors. Each entry (i, j) in score is a vector of length m, where m is the number of vertices in the recorded graph Mt. The vector contains the results of matching a subgraph P_pivot(i,j) originated in the vertex i and created by means of j expansion steps that included its neighbours. k is the maximum number of expansion steps considered by the algorithm. FreqM is an m×k matrix whose entries indicate the frequency of subgraphs representing potential frequent patterns. These subgraphs are created by the expansion steps indicated in Line 8 of Table 6.1. Subgraphs in each step continue their expansion if their frequency in Mt remains greater than fmin and the number of performed steps is less than k. The occurrence frequency of a subgraph is calculated with the function countFrequency in Table 6.2. For countFrequency, the ratio between the result of matching a subgraph (pattern) Pt ⊂ Mt and the number of vertices in Pt is compared to an approximate match ratio Rt. If Rt = 1, the match between a non-identical occurrence of Pt and Pt is exact and it does not consider surjection. If Rt > 1, the counted occurrence involves an overlap. If Rt < 1, it indicates a partial match - for details of exact, partial and overlapped pattern instances, please refer to Section 4.4. The last column in FreqM indicates the frequency of subgraphs of Mt originated from each vertex in V(Mt) and expanded k times with its neighbours.
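
The control flow of Table 6.1 can be sketched as follows. Several parts of the sketch are assumptions and not the algorithm of this work: the function toy_match is only a placeholder for alg-PM (it compares labelled unfoldings of fixed depth instead of running a CP-E-PM or CP-I-PM matcher), expand follows outgoing edges only, the graph is a plain dictionary, and score and FreqM are stored as dictionaries keyed by (pivot, step) rather than as m×k matrices. The strict comparison with fmin follows Line 5 of Table 6.1, and, as in the table, no frequency check is made inside the expansion loop.

```python
def expand(graph, pattern):
    """One expansion step: add the successors of every vertex in the pattern."""
    pivot, depth, vertices = pattern
    grown = set(vertices)
    for v in vertices:
        grown.update(graph["succ"][v])
    return (pivot, depth + 1, frozenset(grown))


def signature(graph, v, depth):
    """Labelled unfolding of the graph from v down to the given depth."""
    label = graph["labels"][v]
    if depth == 0:
        return (label, ())
    children = sorted(signature(graph, s, depth - 1) for s in graph["succ"][v])
    return (label, tuple(children))


def toy_match(graph, pattern):
    """Placeholder for alg-PM (hypothetical): score |V(P)| where unfoldings agree."""
    pivot, depth, vertices = pattern
    target = signature(graph, pivot, depth)
    return [
        len(vertices) if signature(graph, v, depth) == target else 0
        for v in sorted(graph["labels"])
    ]


def count_frequency(score, pattern, match_ratio=1.0):
    """Counting step in the spirit of Table 6.2."""
    size = len(pattern[2])
    return sum(1 for s in score if s / size >= match_ratio)


def lambda_pd(graph, k, f_min, match=toy_match):
    """Control flow of Table 6.1; returns score and frequency per (pivot, step)."""
    score, freq = {}, {}
    for u in sorted(graph["labels"]):
        pattern = (u, 0, frozenset({u}))  # temporal pattern formed by the pivot
        score[u, 1] = match(graph, pattern)
        freq[u, 1] = count_frequency(score[u, 1], pattern)
        if freq[u, 1] > f_min:  # strict comparison, as in Line 5 of Table 6.1
            for j in range(2, k + 1):
                pattern = expand(graph, pattern)
                score[u, j] = match(graph, pattern)
                freq[u, j] = count_frequency(score[u, j], pattern)
        # otherwise the pivot is discarded before any expansion step
    return score, freq


# Hypothetical process: the sequence R -> A -> S three times, then a final X.
LABELS = dict(enumerate(["R", "A", "S", "R", "A", "S", "R", "A", "S", "X"]))
GRAPH = {
    "labels": LABELS,
    "succ": {v: ([v + 1] if v + 1 in LABELS else []) for v in LABELS},
}
score, freq = lambda_pd(GRAPH, k=3, f_min=2)
frequent_pivots = [u for (u, j), f in freq.items() if j == 3 and f > 2]
```

In this toy process the three-vertex sequence starting at each R vertex is the only pattern of that size exceeding the threshold, and the vertex labelled X is discarded before any expansion step because its label occurs only once.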

6.3 Summary

The contribution of this chapter is to relate the well-known problem of frequent subgraph discovery to the finding of potential services based on frequent process substructures. This idea promotes process-centric service reuse and can provide a mechanism to discern redundant software components supporting similar or equivalent process steps in complex and often de-integrated processes across organisations. A solution for the frequent subgraph discovery problem is proposed. The proposal relies on the pattern matching solution provided in previous chapters and focuses on a single-graph scenario. It processes graphs one at a time and the solution provided is restricted to counting overlapped subgraphs (patterns) as a single occurrence – i.e., it identifies overlapped instances but it cannot differentiate them. Other proposals in the literature can be more adequate in situations where overlaps are common or parallel processing is required. However, the inherent complexity of process descriptions, which can involve elaborate typing and multiple attribute values in graphs – including descriptions in natural language – defines a more complex scenario than the simpler graphs considered in the literature. The complexity of the frequent pattern discovery problem in graph-based process descriptions goes beyond the structural problem addressed when identifying frequent subgraphs; it also includes the matching of complex types and attributes describing process elements.

Before explaining in detail the results of an evaluation of the matching and discovery techniques (Chapters 5 and 6) proposed in the context of the LABAS framework for process and application integration (Chapter 3), the general limitations of the pattern discovery technique are outlined. The proposed technique may perform poorly when the occurrences of the frequent subgraph are highly overlapped. Also, as the pattern discovery technique is based on a pattern matching technique, the latter could present limitations during matching activities when dealing with unlabeled and highly connected graphs [Messmer 2000]. To the best of the author’s knowledge, there are no empirical studies indicating whether real process descriptions frequently have highly overlapped frequent pattern instances. Intuitively, if frequent subgraphs (patterns) are used to guide the design of reusable services, they should not overlap, since another important design guideline for services is that service implementations should be – ideally – decoupled from each other. Thus, highly overlapped instances can be discarded from the results since they are not helpful for end users. Unlabeled and highly connected graphs are not representative of process graphs. Process graphs often have labels that identify the different process elements [Aguilar-Saven 2004], and connections between process elements are frequently upper-bounded to six or eight outgoing/incoming edges per process element [Golani 2003], which cannot be considered a property of a highly connected graph for realistic medium-size process graphs whose size is in the order of hundreds of vertices.

A comparative study covering other techniques for frequent pattern discovery based on different paradigms, for example the approximate solutions in [Kuramochi 2005] or cluster-based approaches such as in [Jung 2006], is required to recommend an adequate solution for different process scenarios, especially in the presence of overlapped frequent patterns. A number of proposals for sequential and parallel solutions to the frequent subgraph mining problem have attracted attention in diverse application scenarios such as analysis of social networks, molecular compounds and document-based information retrieval [Han 2007], [Wang 1995], [Kuramochi 2005], [Bringmann 2008]. Solutions in those scenarios seem to require major efforts to be translated to the process and workflow settings [Greco 2005]; hence, dedicated solutions such as the one presented here are required.

Given the limitations in distinguishing non-identical overlapped frequent subgraphs in this particular proposal for discovering frequent patterns on graphs, investigating how other graph-based techniques proposed in the literature can be adopted in the single-graph setting scenario for large and complex process models is defined as a line of future work. Another envisioned and promising line of future work
