Realizing a Process Cube Allowing for the Comparison of Event Data


Department of Mathematics and Computer Science

Architecture of Information Systems Research Group


Master Thesis

Tatiana Mamaliga

Supervisors:

prof. dr. ir. W.M.P. van der Aalst

MSc J.C.A.M. Buijs

dr. G.H.L. Fletcher

Final version


Abstract

Continuous efforts to improve processes require a deep understanding of their inner workings. In this context, the process mining discipline aims at discovering process behavior from historical records, i.e., event logs. Process mining results can be used for analysis of process dynamics. However, mining on realistic event logs is difficult due to complex interdependencies within a process. Therefore, to gain more in-depth knowledge about a certain process, it can be split into subprocesses, which can then be separately analysed and compared. Typical tools for process mining, e.g., ProM, are designed to handle a single event log at a time, which does not particularly facilitate the comparison of multiple processes. To tackle this issue, Van der Aalst proposed in [4] to organize the event log in a cubic data structure, called process cube, with a selection of the event attributes forming the dimensions of the cube.

Although multidimensional data structures are already employed in various business intelligence tools, the data used has a static character. This is in stark contrast to process mining, since event data characterizes a dynamic process that evolves in time. The aim of this thesis is to develop a framework that supports the construction of the process cube and permits multidimensional filtering on it, in order to separate subcubes for further processing. We start with the OLAP foundation and reformulate its corresponding operations for event logs. Moreover, the semantics of a traditional OLAP aggregate are changed: numerical aggregates are substituted by sublog data. With these adjustments, a tool is developed and integrated as a plugin in ProM to support the aforementioned operations on event logs. The user can unload sublogs from the process cube, give them as parameters to other plugins in ProM and visualize different results simultaneously.

During the development of the tool, we had to deal with a shortcoming of multidimensional database technologies when storing event logs, i.e., the sparsity of the resulting process cube. Sparsity in multidimensional data structures occurs when a large number of cells in a cube are empty, i.e., there are missing data values at the intersection of dimensions. Taking a single attribute of an event log as a dimension in the process cube results in a very sparse multidimensional data structure. As a result, the computational time required to unload a sublog for processing increases dramatically. This shortcoming was addressed by designing a hybrid database structure that combines a high-speed in-memory multidimensional database with a sparsity-immune relational database. Within this solution, only a subset of event attributes actually contributes to the construction of the process cube, whereas the rest are stored in the relational database and used further only for event log reconstruction. The hybrid database solution proved to provide the flexibility needed for real-life logs, while keeping response times acceptable for efficient user interaction. The applicability of the tool was demonstrated using two event log examples, a synthetic event log and a real-life event log from the CoSeLog project. The thesis concludes with a detailed loading and unloading performance analysis of the developed hybrid structure, for different database configurations.

Keywords: event log, relational database, in-memory database, OLAP, process mining, visualization, performance analysis


Chapter 1

Introduction

The greatest challenge to any thinker is stating the problem in a way that will allow a solution.

Bertrand Russell, British author, mathematician, & philosopher (1872 - 1970)

This thesis completes my graduation project for the Computer Science and Engineering master at Eindhoven University of Technology (TU/e). The project was conducted in the Architecture of Information Systems (AIS) group. The AIS group has a distinct research reputation and is specialized in process modeling and analysis, process mining and Process-Aware Information Systems (PAIS).

The process mining field, detailed further in this chapter, provides valuable analysis techniques and tools, but also faces a series of challenges. The main issues are large data streams and rapid changes over time. This project creates a proof-of-concept prototype, which considers the so-called process cube concept as a starting point for possible solutions to the above-mentioned challenges. The outcome is further used for visual comparison of event data.

This chapter describes the assignment within its scientific context. Section 1.1 provides the research background. Section 1.2 enumerates the most important advances in process mining and identifies the current issues in the field. Section 1.3 specifies the problem and the project objectives. Section 1.4 continues with a short summary on the problem solution. Finally, Section 1.5 provides an overview on the remaining chapters of the thesis.

1.1 Context

Technology has become an integral part of any organization. For example, current systems and installations are heavily controlled and monitored remotely by integrated internet technologies [23]. Moreover, employing automated solutions in any line-of-business has become a trend. As a result, Enterprise Systems software, offering a seamless integration of all the information flowing through a company [22], is used in any modern organization.

Enterprise Information Systems (EIS) keep businesses running, improve service times and thus attract more clients. Still, as in every complex system, there are multiple points where things can go wrong. System errors, fraud, security issues and inefficient distribution of tasks are just a few to mention. To cope with these issues, EIS had to extend their function-oriented enterprise applications with Business Intelligence (BI) techniques. That is, BI applications have been installed to support management in measuring the company’s performance and deriving appropriate decisions [39]. Among the most important functions of BI are online analytical processing (OLAP), data mining, business performance management and predictive analytics.

Being aware of the existing problems in an organization and applying standardized solutions to solve them is usually not enough. Consider a doctor who always prescribes pain killers independent of the patient's complaints. Of course, this kind of pill will temporarily relieve the pain, but it will not treat the real disease. A good doctor should run tests, identify the root causes of the health problem and only then give an adequate treatment. This is what the process mining field tries to accomplish. It goes beyond analyzing merely individual data records and focuses instead on the underlying process which glues event data together. A deep understanding of the inside of a process can point to notorious deviations, persistent bottlenecks and unnecessary rework.

All in all, technology has a major impact on organizations and it proved to be an enabler for business process improvement. Therefore, by means of business intelligence, and process mining, in particular, new opportunities are constantly exploited to keep pace with challenges such as change.

1.2 Challenges - Then & Now

In the context of today’s rapidly changing environment, organizations are looking for new solutions to keep their businesses running efficiently. Slogans such as “Driving the Change” (Renault), “Changes for the Better” (Mitsubishi Semiconductor), “Empowering Change” (Credit Suisse First Boston), “New Thinking. New Possibilities” (Hyundai) are used more and more often. Furthermore, different areas of business research are trying to keep up with the change and process mining is not an exception.

In 2011, the Process Mining Manifesto [7] was released to describe the state of the art in process mining on the one hand, and its current challenges on the other hand. A year later, the project proposal “Mining Process Cubes from Event Data (PROCUBE)” in [4] suggested the so-called process cube as a solution direction for some of these challenges. In the context of currently employed process mining solutions and using the Process Mining Manifesto as a reference, the PROCUBE project proposal presents several challenges that process mining is currently facing:

From “small” event data to “big” event data.

Due to increased storage capacity and advanced technologies, the vast amount of available event data has become difficult to control and analyse. Most of the traditional process mining techniques operate with event logs whose size does not exceed several thousand cases and a few hundred thousand events (for example, in BPI Challenge [2] files). However, nowadays corporations work on a different scale of event logs. Giants like Royal Dutch Shell, Walmart and IBM would rather consider millions of events (a day or even a second), and this number will continue to grow. Ways to ensure that event data growth will not affect the importance of process mining techniques are constantly sought.

From homogeneous to heterogeneous processes.

With the increasing complexity of an event log, chances are that the variability in its corresponding process increases as well. For example, events in an event log can present different levels of abstraction. However, many mining techniques assume that all events in an event log are logged at the same level of abstraction. In that sense, the diverse event log characteristics have to be properly considered.

From one to many processes.

Many companies have their agencies spread across the globe. Take SAP AG as an example. Its research and development units alone are located on four continents, and it has regional offices all around the world. That is, SAP units are executing basically the same set of processes. Still, this does not exclude possible variations. For instance, there might be various influences due to the characteristics of a certain SAP distribution region (Germany, India, Brazil, Israel, Canada, China, and others). Traditional process mining is oriented towards stand-alone business processes. However, it is of great importance to be able to compare business processes of different organizations (or of units of an organization). For example, efficient and less efficient paths in different processes can be identified. Inefficient paths can be substituted and efficient paths can be applied to the rest of the processes to improve performance.


From steady-state to transient behavior.

Change has a major impact not only on the size of event logs and on the necessity of dealing with many processes together, but also on the state of a business process. For example, companies should be able to quickly adjust to different business requirements. As a result, their corresponding processes undergo different modifications. Current process mining techniques assume business processes to be in a steady state [5]. However, it is important to understand the changing nature of a process and to react appropriately. The notion of concept drift was introduced in process mining [33] to capture these second-order dynamics. Its target is to discover and analyze the dynamics of a process by detecting and adapting to change patterns in the ongoing work.

From offline to online.

As previously mentioned, systems produce an overwhelming amount of information. The idea of storing it as historical event data for later analysis, as it is currently done, may not seem as appealing any more. Instead, the emphasis should be more on the present and the future of an event. That is, an event should be analysed on-the-fly and predictions on the contingency of its occurrence should be made based on existing historical data. As such, online analysis of event data is yet another process mining challenge.

Each of the issues discussed above is extremely challenging. Analysing large-scale event logs is difficult with the current process mining techniques. Solutions to mitigate some of the issues that appear when dealing with large-scale event logs are proposed in [14], e.g., event log simplification and dealing with less-structured processes. A framework for time-based operational support is described in [8]. In [16], an approach is offered to compare collections of process models corresponding to different Dutch municipalities. Nevertheless, there is still the need for more elaborate solutions and a unified way of approaching them.

1.3 Assignment Description

Stand-alone process analysis is the common way of analysing processes in today’s process mining approaches. However, inspecting a process as a single entity impedes observing differences and similarities with other processes. Let’s take a simple example from the airline industry. There is a constant discussion about which of the low-cost airlines, Ryanair or Wizzair, offers better services. There are both advantages and disadvantages of traveling with either of these two. Generally, Ryanair is considered more punctual than Wizzair¹. To determine why Ryanair is more on-time with flights than Wizzair, we compare their processes. We notice that while at Wizzair the luggage is checked only once, Ryanair is very strict with the luggage procedure and checks it twice before embarking. As a result, passengers and crew are not busy with “fitting” luggage that does not fit, and the aisle of the aircraft is kept free for new passengers who come on board. By minimizing the turnaround time, the airline's punctuality improves. The procedure of checking the luggage may not be the only factor that improves the punctuality of Ryanair, but it is clear from the comparison of the two airline processes that it contributes to reducing flight delays. In conclusion, the comparison of the two processes helped in answering a specific question and identifying parts of these processes that can be further improved.

When it comes to comparison of large processes, it is difficult to inspect processes entirely at a glance. Splitting and merging different parts of a process can offer more insightful details. Let’s consider the following scenario. In the car manufacturing process, there is a final polishing inspection step. Several resources check whether there is a scratch on a car that needs to be polished. During the last two weeks, it was noticed that one polishing crew worked slower than the others. To identify the cause of this issue, the car manufacturing process is analysed. First, the process is split by department type and the polishing department is selected. Then, only the process corresponding to the resources of this specific crew is isolated. The following aspects are inspected: the car type, the engine type and the color type. When filtering by car type and engine type, it seems that there are no patterns indicating a potential delay. However, when inspecting the subprocesses corresponding to different car colors, a pattern emerges. The average working time for polishing a red car is much higher than that for polishing cars of a different color. Since red cars take, in general, more time to be polished than other cars, this indicates that there is a problem in the painting department. The red-colored cars are not painted properly and therefore need constant polishing. While at the beginning it seemed that the crew was responsible for the delays, in fact, the crew members were just polishing more red-colored cars. Since red-colored cars required more polishing due to a painting issue, the crew worked slower compared to the other crews. Without filtering the initial process, it would have been difficult to identify such detailed problems.

Taking into consideration the discussion above, the goal of this master project can be defined as follows:

GOAL: Create a proof-of-concept tool to allow comparison of multiple processes.

In other words, the aim is to support integrated analysis on multiple processes, while examining different views of a process. Together with the main goal, there are some other targets: filtering processes by preserving the initial dataset, merging different parts of a process, visualizing process mining results simultaneously and placing them next to each other to facilitate comparison. In the following, we present the approach we propose to reach the enumerated objectives.

1.4 Approach

Figure 1.1: The process cube. Concept proposed in the PROCUBE project.

To accomplish the goal, we base our approach on the process cube concept, introduced in [4] and shown in Figure 1.1. A process cube is a structure composed of process cells. Each process cell (or collection of cells) can be used to generate an event log and derive process mining results [4]. Note that traditional process mining algorithms are always applied to a specific event log without systematically considering the multidimensional nature of event data.

In this project, the process cube is materialized as an online analytical processing (OLAP) hypercube structure. Besides the built-in multidimensional structure, one can benefit from the functionality of the OLAP operations and, hopefully, from the good performance of OLAP implementations. Transactional databases are designed to store and clean data, but are not tailored towards analysis. OLAP, on the other hand, is chosen here to harbor complex event data for further process analysis, in view of its analysis-optimized databases and its specialized “drilling” operations. Organizing event data in OLAP multidimensional structures makes it easy to retrieve event data and to choose the perspective from which to look at it. There are also many ways to divide event data, e.g., one can always drill down and up in the multidimensional structure and inspect event data at different granularity levels. Finally, the retrieved event data can be used to obtain different process-related characteristics, e.g., process models, that can be further analysed and compared.

There are, however, some challenges with respect to this approach, mainly due to the fact that OLAP does not handle event data, but enterprise data:

• Only the aggregation of large collections of numerical data is supported by the OLAP tools.

• Process-related aspects are entirely missing in the OLAP framework.

• Overlapping of cell (event) classes is not possible in OLAP cubes.

Figure 1.2: Master Project Scope.

Nevertheless, adjustments can be made to OLAP tools to accommodate process cube requirements. The approach considers several steps, shown also in Figure 1.2. First, event logs are introduced among OLAP data sources. Hence, it becomes possible to load XES event logs in the OLAP database. Second, the process cube is created to support the materialization of an event log. Moreover, the process cube is designed to allow the visualization of cells with overlapping event data. Finally, different process mining results can be produced for any section of the cube and further exported as images.

The materialization of the process cube as an OLAP cube allows us to define our objective more precisely: the goal is to create a proof-of-concept tool that exploits OLAP features to accommodate process mining solutions such that the comparison of multiple processes is possible.
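The central adjustment, cells that hold sublogs instead of numerical aggregates, can be illustrated with a minimal Python sketch. This is not the implementation developed in this thesis; the class `ProcessCube`, its methods `slice` and `dice`, and the sample event attributes are hypothetical names chosen only for illustration.

```python
from collections import defaultdict


class ProcessCube:
    """Toy process cube: each cell holds events (a sublog), not a number."""

    def __init__(self, events, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = defaultdict(list)
        for event in events:
            # The cell coordinate is the tuple of the event's values
            # for the attributes chosen as dimensions.
            key = tuple(event.get(d) for d in self.dimensions)
            self.cells[key].append(event)

    def dice(self, **allowed):
        """Collect the events of all cells whose coordinates fall within
        the allowed value sets (one set per restricted dimension)."""
        selected = []
        for key, cell_events in self.cells.items():
            coords = dict(zip(self.dimensions, key))
            if all(coords[d] in values for d, values in allowed.items()):
                selected.extend(cell_events)
        return selected

    def slice(self, dimension, value):
        """Fix a single dimension to a single value."""
        return self.dice(**{dimension: {value}})


# Hypothetical events, loosely following the polishing example of Section 1.3.
events = [
    {"case": "c1", "activity": "polish", "color": "red", "crew": "A"},
    {"case": "c2", "activity": "polish", "color": "blue", "crew": "A"},
    {"case": "c3", "activity": "polish", "color": "red", "crew": "B"},
    {"case": "c3", "activity": "inspect", "color": "red", "crew": "B"},
]
cube = ProcessCube(events, dimensions=["color", "crew"])
red_sublog = cube.slice("color", "red")
red_crew_b = cube.dice(color={"red"}, crew={"B"})
print(len(red_sublog), len(red_crew_b))
```

Each selection returns a list of events that could be handed to a process mining technique as a separate sublog, which is the property that distinguishes a process cube from a numerical OLAP cube.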

1.5 Thesis Structure

To describe the approach, the master thesis is structured as follows:

Present a literature study on employed concepts and technologies (Chapter 2)

Concepts from process mining and business intelligence fields will be introduced. Then, a discussion on the implemented OLAP and database technologies will follow.

Elaborate on process cube functionality (Chapter 3)

The process cube notion will be clearly defined together with its structure. The requirements needed to attain the envisioned process cube functionality will be listed.

Explain Palo software choice (Chapter 4)

Based on the requirements from Chapter 3, a collection of technological solutions that could support the process cube structure is generated. After analyzing the pros and the cons of each solution, the choice to use Palo OLAP server is described and motivated.


Recall the most relevant implementation steps (Chapter 5)

After presenting the architecture of the project, the implementation steps are described. The main functionality consists of: loading/unloading an XES file into/from the in-memory database, enabling the adjusted OLAP operations on event logs and visualizing process mining results.

Report on the testing process and on the system test results (Chapter 6)

The functionality of the software is tested and its performance is evaluated for different event logs and process cubes.

Conclude with general remarks on the project (Chapter 7)

The thesis concludes with a series of comments and observations on both the implemented solution and further research possibilities.


Chapter 2

Preliminaries

2.1 Business Intelligence

Business Intelligence (BI) incorporates all technologies and methods that aim at providing actionable information that can be used to support decision making. An alternative definition states that BI systems combine data gathering, data storage, and knowledge management with analytical tools to present complex internal and competitive information to planners and decision makers [41]. All in all, BI represents a mixture of multiple disciplines (e.g., data warehousing, data mining, OLAP, process mining, etc.), as shown in Figure 2.1, all with the same main goal of turning raw data into useful and reliable information for further business improvements. Even though herein presented as totally separate disciplines, there are various attempts to interconnect some of them for obtaining more powerful analysis results. For example, data mining is integrated with OLAP techniques [31, 45]. Data warehousing and OLAP technologies are more and more used in conjunction [13, 18]. From the above-mentioned BI disciplines, process mining and OLAP are detailed in Section 2.2 and in Section 2.3, as being particularly relevant for this project.

Figure 2.1: BI - a confluence of multiple disciplines.


2.2 Process Mining

2.2.1 Concepts and Definitions

The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today's systems [3]. The content and the level of detail of a process description depend on the goal of the conducted process mining project and the employed process mining techniques. The set of real executions is fixed and is given by the event data from an existing event log.

There are basically three types of process mining projects [3]. The goal of the first, the data-driven process mining project, is to conclude with a process description, which should be as detailed as possible, without necessarily having a specific question in mind. This can be accomplished in two ways: by a superficial analysis, covering multiple process perspectives, or by an in-depth analysis, on a limited number of aspects. The second, the question-driven process mining project, aims at obtaining a process description from which an answer to a concrete question can be derived. A possible question can be: “How does the decision to increase the duration of handling an invoice influence the process?” The third type, the goal-driven process mining project, consists of looking for weaker parts in the resulting process description that can be considered for improving a specific aspect, e.g., better response times.

Figure 2.2: Process mining: discovery, conformance, enhancement.

Establishing the type of the process mining project to conduct is followed by choosing the relevant process mining techniques to apply on the event log. Process mining comes in three flavors: discovery, conformance and enhancement. Figure 2.2¹ shows these three main process mining categories. Discovery techniques take the event log as input and return the real process as output. Conformance checking techniques check if reality, as recorded in the log, conforms to the model and vice versa [7]. Enhancement techniques produce an extended process model which gives additional insights into the process, e.g., existing bottlenecks.
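As a rough illustration of the discovery flavor, the sketch below counts the directly-follows relation between activities, a building block used by several discovery algorithms. It is a simplified stand-in rather than any specific algorithm discussed in this thesis; the function name `directly_follows` and the sample traces are assumptions made for the example.

```python
from collections import Counter


def directly_follows(traces):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in traces:
        # Pair each activity with its successor within the same trace.
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


# Hypothetical traces, each a sequence of activity names.
traces = [
    ["register", "check", "pay"],
    ["register", "check", "reject"],
    ["register", "check", "pay"],
]
dfg = directly_follows(traces)
for (a, b), n in sorted(dfg.items()):
    print(f"{a} -> {b}: {n}")
```

The resulting counts can be read as a weighted graph over activities: frequent edges hint at the main flow, while rare edges point to exceptional behavior.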

Regardless of the process mining technique, an event log is always given as input, as also shown in Figure 2.2. The content of an event log can vary greatly from process to process. Nevertheless, there is a fixed skeleton expected to be found in any event log. Figure 2.3, from [3], presents the structure of an event log. Generally, event data from an event log correspond to a process. A process is composed of cases, or completed process instances. In turn, a case consists of events. Events should be ordered within a case. Preserving the order is important as it influences the control flow of the process. An event corresponds to an activity, e.g., register request, pay compensation. A trace represents a sequence of activities. Both events and cases are characterized by attributes, e.g., activity, time, resource, costs.

Figure 2.3: Structure of event logs.

The data source used for process mining is an event log. Event data of different information systems are stored in event logs. Since event logs can also be recorded for purposes other than process mining (e.g., for debugging errors), there is no unique format used at creation. Handling various event log formats for process analysis is time consuming. Therefore, event logs need to be standardized by converting raw event data to a single event log format. One such format is MXML, which emerged in 2003. Recently, the popularity of the XES event log standard has grown. Further, we present an overview of the XES event log structure, with relevant details for this master thesis. A more in-depth discussion of the XES format can be found in [15] and more up-to-date information on XES can be found on http://www.xes-standard.org/.

Figure 2.4, taken from [29], shows the XES meta-model. Apart from traces and events, with their corresponding attributes, the log object contains a series of other elements. The global attributes for traces and events are usually used to quickly find the existing attributes in the XES log. The purpose of event classifiers is to assign each event to a pre-defined category. Events within the same category can be compared with the ones from another category. XES logs are also characterized by extensions. Extensions are used to resolve the ambiguity in the log by introducing a set of commonly understood attributes and attaching semantics to them. Attributes have assigned values which correspond to a specific type of data. Based on the type of data, attributes can be classified in five categories: String attributes, Date attributes, Int attributes, Float attributes, and Boolean attributes. These attribute types correspond to the standard XML types: xs:string, xs:dateTime, xs:long, xs:double and xs:boolean.

Figure 2.4: The XES Meta-model.
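To illustrate the five attribute types, the following Python sketch reads typed attributes from a small XES-like fragment using only the standard library. It is an assumption-laden simplification: the fragment omits the XML namespace, extensions, globals and classifiers that real XES files carry, the keys `attempt`, `cost` and `urgent` are invented, and `read_attributes` is a hypothetical helper, not part of any XES library.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Simplified XES-like fragment (no namespace, extensions or classifiers).
XES_SNIPPET = """
<log>
  <trace>
    <string key="concept:name" value="case-1"/>
    <event>
      <string key="concept:name" value="register request"/>
      <date key="time:timestamp" value="2012-03-01T10:15:00"/>
      <int key="attempt" value="1"/>
      <float key="cost" value="50.0"/>
      <boolean key="urgent" value="true"/>
    </event>
  </trace>
</log>
""".strip()

# One converter per attribute tag, mirroring the five attribute types.
CONVERTERS = {
    "string": str,
    "date": datetime.fromisoformat,
    "int": int,
    "float": float,
    "boolean": lambda text: text == "true",
}


def read_attributes(element):
    """Return the typed attributes attached directly to a trace or event."""
    return {
        child.get("key"): CONVERTERS[child.tag](child.get("value"))
        for child in element
        if child.tag in CONVERTERS
    }


log = ET.fromstring(XES_SNIPPET)
for trace in log.findall("trace"):
    print("trace:", read_attributes(trace))
    for event in trace.findall("event"):
        print("event:", read_attributes(event))
```

Nested `event` elements are skipped when reading trace attributes because their tag is not one of the five attribute tags, so trace-level and event-level attributes stay separate.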

To understand the separation between required and flexible event log aspects, a formalization of the above-highlighted concepts is given. The process mining book [3] is used as reference.

Definition 1 (Event, attribute [3]). Let $E$ be the event universe, i.e., the set of all possible event identifiers. Events may be characterized by various attributes, e.g., an event may have a timestamp, correspond to an activity, be executed by a particular person, have associated costs, etc. Let $AN$ be a set of attribute names. For any event $e \in E$ and name $n \in AN$: $\#_n(e)$ is the value of attribute $n$ for event $e$. If event $e$ does not have an attribute named $n$, then $\#_n(e) = \bot$ (null value).


Definition 2 (Case, trace, event log [3]). Let $C$ be the case universe, i.e., the set of all possible case identifiers. Cases, like events, have attributes. For any case $c \in C$ and name $n \in AN$: $\#_n(c)$ is the value of attribute $n$ for case $c$ ($\#_n(c) = \bot$ if case $c$ has no attribute named $n$). Each case has a special mandatory attribute trace: $\#_{trace}(c) \in E^*$.² $\hat{c} = \#_{trace}(c)$ is a shorthand for referring to the trace of a case.

A trace is a finite sequence of events $\sigma \in E^*$ such that each event appears only once, i.e., for $1 \le i < j \le |\sigma|$: $\sigma(i) \ne \sigma(j)$.

For any sequence $\delta = \langle a_1, a_2, \ldots, a_n \rangle$ over $A$, $\delta_{set}(\delta) = \{a_1, a_2, \ldots, a_n\}$. $\delta_{set}$ converts a sequence into a set, e.g., $\delta_{set}(\langle d, a, a, a, a, a, a, d \rangle) = \{a, d\}$. $a$ is an element of $\delta$, denoted as $a \in \delta$, if and only if $a \in \delta_{set}(\delta)$.

An event log is a set of cases $L \subseteq C$ such that each event appears at most once in the entire log, i.e., for any $c_1, c_2 \in L$ such that $c_1 \ne c_2$: $\delta_{set}(\hat{c}_1) \cap \delta_{set}(\hat{c}_2) = \emptyset$.
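The two definitions can be mirrored in a few lines of Python. The sketch below is only one possible encoding, chosen for illustration: events and cases are dictionaries keyed by identifier, `None` plays the role of the null value, and the names `attr`, `is_trace` and `is_event_log` as well as the sample data are hypothetical.

```python
def attr(obj, name):
    """Attribute lookup: the value of attribute `name`, or None (null value)."""
    return obj.get(name)


# Hypothetical event universe: event identifier -> attributes.
events = {
    "e1": {"activity": "register request", "resource": "Pete"},
    "e2": {"activity": "check ticket", "resource": "Mike"},
    "e3": {"activity": "register request", "resource": "Sue"},
}

# Each case carries the mandatory attribute "trace": a sequence of event ids.
cases = {
    "c1": {"trace": ["e1", "e2"]},
    "c2": {"trace": ["e3"]},
}


def is_trace(sigma):
    """A trace contains each event at most once."""
    return len(set(sigma)) == len(sigma)


def is_event_log(log):
    """Every trace is valid and no event is shared between two cases."""
    seen = set()
    for case in log.values():
        sigma = attr(case, "trace")
        if not is_trace(sigma) or seen & set(sigma):
            return False
        seen |= set(sigma)
    return True


print(attr(events["e1"], "activity"))  # attribute that exists
print(attr(events["e1"], "cost"))      # missing attribute yields None
print(is_event_log(cases))
```

Note that two different events may share the same activity name (here `e1` and `e3`); uniqueness is required of event identifiers, not of activities.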

2.2.2 ProM Framework

A large number of algorithms are produced as a result of process mining research. Ranging from algorithms that provide just a helicopter view on the process (Dotted Chart) to ones that give an in-depth analysis (LTL Checker), many of them are implemented in the ProM Framework in the form of plugins.

Figure 2.5: ProM Framework Overview.

Figure 2.5, based on [24], shows an overview of the ProM Framework. It includes the main types of ProM plugins and the relations between them. Before applying any mining technique, an event log can be filtered using a Log filter. Further, the filtered event log can be mined using the Mining plugin and then stored as a Frame result. The Visualization engine ensures that frame results can be visualized. A (filtered) event log, but also different models, e.g., Petri nets, LTL formulas, can be loaded into ProM using an Import plugin. Both the Conversion plugin and the Analysis plugin use mining results as input. While the first plugin is specialized in converting the result to a different format, the second plugin is focused on the analysis of the result.

² In the remainder, we assume #

Figure 2.6: Examples of process mining plugins: Log Dialog and Dotted Chart (helicopter view), Fuzzy Miner (discovery), Social Networks based on Working Together (organizational perspective).

The ProM framework includes five types of process mining plugins, as shown in Figure 2.5:

• Mining plugins - mine models from event logs.

• Analysis plugins - implement property analysis on a mining result.

• Import plugins - allow import of objects from Petri net, LTL formula, etc.

• Export plugins - allow export of objects to various formats, e.g., EPC, Petri net, DOT, etc.

• Conversion plugins - make conversions between different data formats, e.g., from EPC to Petri net.

Figure 2.6 presents some examples of plugins in ProM: the Log Dialog, the Dotted Chart, the Fuzzy Miner [30] and the Working Together Social Network [9]. There are, however, more than 400 plugins available in ProM 6.2, covering a wide spectrum. Plugin objectives can vary from providing process information at a glance, e.g., Log Data, Dotted Chart, to providing automated process discovery, e.g., Heuristics Miner [53] and Fuzzy Miner, and offering detailed analysis for verification of process models, e.g., Woflan analysis, for performance aspects, e.g., Performance Analysis with Petri net, and for the organizational perspective, e.g., Social Network miner.

2.3 OLAP

2.3.1 Concepts and Definitions

On-Line Analytical Processing (OLAP) is a method to support decision making in situations where raw data on measures such as sales or profit needs to be analysed at different levels of statistical aggregation [42]. Introduced in 1993 by Codd [20] as a more generic name for “multidimensional data analysis”, OLAP embraces the multidimensionality paradigm as a means to provide fast access to data when analysing it from different views.

Figure 2.7: Traditional OLAP cube. At the intersection of the three dimensions (regions, time and sales information), an aggregate (e.g., profit margin %) can be derived. Both the time and regions dimensions contain a hierarchy (e.g., 2012Jan, 2012Feb, 2012Mar are months of 2012).

In comparison with its On-Line Transactional Processing (OLTP) counterpart, OLAP is optimized for analysing data, rather than storing data originating from multiple sources to avoid redundancy. Therefore, OLAP is mostly based on historical data, e.g., data that can be aggregated, and not on instantaneous data which is quite challenging to analyse, sort, group or compare “on-the-fly”.

Multidimensional data analysis is possible due to a multidimensional fact-based structure, called an OLAP cube. An OLAP cube is a specialized data structure to store data in an optimized way for analysis.

Figure 2.7 presents the traditional OLAP cube structure. Designed to support enterprise data analysis, an OLAP cube is usually built around a business fact. A fact describes an occurrence of a business operation (e.g., sale), which can be quantified by one or more measures of interest (e.g., the total amount of the sale, sales cost, profit margin %). Generally, the measure of interest is a real number. A business operation can be characterized by multiple dimensions of analysis (e.g., time, region, etc.). Let $DA_i$, $1 \le i \le n$, be the sets of elements of the dimensions of analysis. Then, the measure of interest $MI$ can be defined as a function $MI : \prod_{i=1}^{n} DA_i \to \mathbb{R}$. For example, if region, time and sales are the dimensions of analysis, as in Figure 2.7, then

$MI(\mathit{Germany}, \mathit{2012Mar}, \mathit{ProfitMargin}) = 11$.

Moreover, elements of a dimension of analysis can be organized in a hierarchy, e.g., the Europe region is herein represented by countries like Netherlands, Germany and Belgium. A natural hierarchical organization can be observed among time elements. Consider the tree structure in Figure 2.8. The root of the tree is the year 2012. This element has three children: 2012Jan, 2012Feb and 2012Mar, corresponding to months. Finally, each month element has days of week as children elements. Let $H_i$ be the set of hierarchy elements, i.e., $H_i = \{\mathit{2012}, \mathit{2012Jan}, \mathit{2012Feb}, \mathit{2012Mar}, \mathit{2012JanMon}, \mathit{2012JanThu}, \ldots\}$. The children function, $\mathit{children} : H_i \to \mathcal{P}(H_i)$, returns the children elements of its argument. For example, $\mathit{children}(\mathit{2012}) = \{\mathit{2012Jan}, \mathit{2012Feb}, \mathit{2012Mar}\}$.

Figure 2.8: Example of hierarchy tree structure on the time dimension.

The allLeaves function, $\mathit{allLeaves} : H_i \to \mathcal{P}(H_i)$, returns all leaf elements of the subtree that has the function argument as its root node. For example, $\mathit{allLeaves}(\mathit{2012}) = \{\mathit{2012JanMon}, \mathit{2012JanThu}, \mathit{2012FebWed}, \mathit{2012MarTue}, \mathit{2012MarFri}\}$. Note that a hierarchy is an undirected graph in which any two nodes are connected by exactly one simple path (i.e., a tree), with the following property: for any node $h \in H_i$ and any two distinct children $h_1, h_2 \in \mathit{children}(h)$, $\mathit{allLeaves}(h_1) \cap \mathit{allLeaves}(h_2) = \emptyset$.
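The two hierarchy functions can be illustrated with a small sketch. The dictionary encoding of the tree and the function names below are assumptions made for illustration only; the assignment of days to months follows the element names of Figure 2.8, and a leaf is treated as its own only leaf.

```python
# Hypothetical encoding of the time hierarchy of Figure 2.8 (parent -> children).
CHILDREN = {
    "2012": ["2012Jan", "2012Feb", "2012Mar"],
    "2012Jan": ["2012JanMon", "2012JanThu"],
    "2012Feb": ["2012FebWed"],
    "2012Mar": ["2012MarTue", "2012MarFri"],
}

def children(h):
    """Return the direct children of hierarchy element h (empty for a leaf)."""
    return set(CHILDREN.get(h, []))

def all_leaves(h):
    """Return all leaves of the subtree rooted at h; a leaf returns itself."""
    kids = children(h)
    if not kids:
        return {h}
    leaves = set()
    for k in kids:
        leaves |= all_leaves(k)
    return leaves

def siblings_disjoint(h):
    """Check that two distinct children of h never share a leaf."""
    kids = sorted(children(h))
    return all(
        all_leaves(a).isdisjoint(all_leaves(b))
        for i, a in enumerate(kids)
        for b in kids[i + 1:]
    )

print(sorted(all_leaves("2012")))
print(siblings_disjoint("2012"))  # True
```

The disjointness check mirrors the property stated above: it holds for every tree, so it is mainly useful for validating a hierarchy that was entered by hand.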

Dimensions of analysis, hierarchies and measures of interest can be used to construct an OLAP cube, like the one in Figure 2.7. Dimensions of an OLAP cube are defined by $CD = D_1 \times D_2 \times \ldots \times D_n$. For any $1 \le i \le n$, $D_i \subseteq H_i$ is the set of dimension elements. Hierarchies are defined by $CH = H_1 \times H_2 \times \ldots \times H_n$. For example, the time dimension contains elements from the hierarchy shown in Figure 2.8. Let $D_1$ be the cube dimension corresponding to time, then a possible content of $D_1$ is $\{\mathit{2012Jan}, \mathit{2012Feb}, \mathit{2012Mar}\}$. It is not necessary for a dimension to contain all the hierarchy elements. Together with dimensions, hierarchies are elements of an OLAP cube structure $CS = (CD, CH)$. Measures of interest are functions specific for the dimensions of analysis. For the dimensions of the cube, the aggregate function $CA : \prod_{i=1}^{n} H_i \to \mathbb{R}$ is used as an equivalent of a measure of interest. The only difference is that aggregates can be computed from multiple measure of interest results or from other aggregates. For example, the aggregate sales cost for the entire month 2012Jan is a sum of the measure of interest results corresponding to 2012JanMon and 2012JanThu.
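As a minimal sketch of how an aggregate can be derived from measure of interest results, the following example sums leaf values along the hierarchy. The sales cost numbers are invented for illustration, and the dictionary layout and function name are assumptions, not part of the formalization.

```python
# Hypothetical leaf-level sales costs (measure of interest results); values are made up.
LEAF_SALES_COST = {"2012JanMon": 40.0, "2012JanThu": 60.0, "2012FebWed": 25.0}

# Hypothetical fragment of the time hierarchy of Figure 2.8 (parent -> children).
CHILDREN = {
    "2012Jan": ["2012JanMon", "2012JanThu"],
    "2012Feb": ["2012FebWed"],
}

def aggregate_sales_cost(h):
    """Sum the sales cost over all leaves below hierarchy element h."""
    kids = CHILDREN.get(h)
    if not kids:
        # h is a leaf: its aggregate is its own measure of interest result.
        return LEAF_SALES_COST.get(h, 0.0)
    # An inner node is aggregated from the aggregates of its children.
    return sum(aggregate_sales_cost(k) for k in kids)

print(aggregate_sales_cost("2012Jan"))  # 100.0
```

The recursion reflects the remark that aggregates can be computed either from measure of interest results or from other aggregates.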

To make the reasoning in terms of OLAP more precise and to strengthen the understanding of various cube-related concepts, we provide a formalization of the core OLAP notions.

An OLAP cube presents a multidimensional view on data from different sides (dimensions). Each dimension consists of a number of dimension attributes or values, which can also be called dimension elements or members. Members in a dimension can be organized into a hierarchy and correspond, as such, to a hierarchical level. These concepts are further formalized in Definition 3.

Definition 3 (OLAP cube). Let

$D_i$, $1 \le i \le n$, be a set of dimension elements, where $n$ is the number of dimensions,

$H_i$, $1 \le i \le n$, be a set of hierarchy elements,

$CD = D_1 \times D_2 \times \ldots \times D_n$ be the cube dimensions,

$CH = H_1 \times H_2 \times \ldots \times H_n$ be the cube hierarchies,

$\mathit{children} : H_i \to \mathcal{P}(H_i)$, where $\mathit{children}(h)$ is the function returning the children of $h \in H_i$,

$\mathit{allLeaves} : H_i \to \mathcal{P}(H_i)$, where $\mathit{allLeaves}(h)$ is the function returning all leaves of $h \in H_i$,

for $h \in H_i$ and $h_1, h_2 \in \mathit{children}(h)$, $\mathit{allLeaves}(h_1) \cap \mathit{allLeaves}(h_2) = \emptyset$,

$CS = (CD, CH)$ be the cube structure,

$CA : CH \to \mathbb{R}$ be the cube aggregate function.

An OLAP cube is defined as $OC = (CS, CA)$.

Given the multidimensional structure of an OLAP cube, there is a risk of it being populated with sparse data. Sparsity appears when, at many intersections of dimensions, there is no corresponding measure of interest and the cell is therefore empty. Such behavior occurs in multidimensional cubes with a large number of sparse dimensions. A dimension is considered a sparse dimension when it has a large number of members that, in most cases, appear only once in the original data source, so that data values are missing for the majority of member combinations. On the contrary, in a dense dimension, a data value exists for almost every dimension member.


So far, we have focused on the multidimensional structure of the OLAP cube. However, learning how to employ it is particularly interesting, as it gives a feeling of OLAP’s usefulness and applicability. Therefore, we further discuss one of the main features of OLAP, the OLAP operations. In [18], Chaudhuri and Dayal enumerate among the typical OLAP operations: slice and dice for selection and projection, drill-up (or roll-up) and drill-down for data grouping and ungrouping, and pivoting (or rotation) for re-orienting the multidimensional view of data. There are also other OLAP operations, e.g., ranking and drill-across [44]. However, the operations mentioned in [18] are considered sufficient for a meaningful exploration of the data.

The dice operation returns a subcube by selecting a subset of members on certain dimensions.

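Definition 4 (Dice operation). Let $OC = (CS, CA)$ be an OLAP cube and $D'_i \subseteq D_i$ for all $1 \le i \le n$. The dice operation is $\mathit{dice}_{CD'}(OC) = OC'$, where

$OC' = (CS', CA')$, $CS' = (CD', CH')$, $CH' = H'_1 \times H'_2 \times \ldots \times H'_n$,

$H'_i = \{h \in H_i \mid \exists v \in D'_i,\ \mathit{allLeaves}(v) \cap \mathit{allLeaves}(h) \neq \emptyset\}$,

$\mathit{children}' : H'_i \to \mathcal{P}(H'_i)$, $\mathit{children}'(h) = \mathit{children}(h) \cap H'_i$,

$\mathit{allLeaves}' : H'_i \to \mathcal{P}(H'_i)$, $\mathit{allLeaves}'(h) = \mathit{allLeaves}(h) \cap H'_i$,

for $h \in H'_i$ and $h_1, h_2 \in \mathit{children}'(h)$, $\mathit{allLeaves}'(h_1) \cap \mathit{allLeaves}'(h_2) = \emptyset$,

$CA' : CH' \to \mathbb{R}$, $CA'(h_1, \ldots, h_n) = CA(h_1, \ldots, h_n)$, for $(h_1, \ldots, h_n) \in CH'$.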

The slice operation is a special case of dice operation. It produces a subcube by selecting a single member for one of its dimensions.

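Definition 5 (Slice operation). Let $OC = (CS, CA)$ be an OLAP cube. The slice operation is $\mathit{slice}_{k,v}(OC) = OC'$, where $1 \le k \le n$, $v \in D_k$, and $OC' = \mathit{dice}_{CD'}(OC)$ with $CD' = D_1 \times \ldots \times D_{k-1} \times \{v\} \times D_{k+1} \times \ldots \times D_n$.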
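The relation between the two selection operations can be sketched on a deliberately simplified cube. The sketch assumes a flat cube without hierarchies, stored as a dictionary from member tuples to values; the helper names `dice` and `slice_` and the sample values are hypothetical.

```python
def dice(cells, selected):
    """Keep the cells whose member on every dimension i lies in selected[i]."""
    return {
        coords: value
        for coords, value in cells.items()
        if all(member in allowed for member, allowed in zip(coords, selected))
    }

def slice_(cells, dims, k, v):
    """Slice on the k-th dimension (1-based): dice with D_k replaced by {v}."""
    selected = [set(d) for d in dims]
    selected[k - 1] = {v}
    return dice(cells, selected)

# Hypothetical two-dimensional cube: (region, month) -> profit margin %.
dims = [{"Germany", "Belgium"}, {"2012Jan", "2012Feb"}]
cells = {
    ("Germany", "2012Jan"): 9.0,
    ("Germany", "2012Feb"): 11.0,
    ("Belgium", "2012Jan"): 7.5,
}

print(slice_(cells, dims, 1, "Germany"))
```

Implementing `slice_` purely in terms of `dice` mirrors Definition 5, where slicing is a dice with a single selected member on one dimension.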

Note that an OLAP cell can be defined as an OLAP subcube obtained by slicing each of the OLAP cube dimensions. Let $OC = (CS, CA)$. The OLAP cell is $\mathit{slice}_{1,v_1}(\mathit{slice}_{2,v_2}(\ldots(\mathit{slice}_{n-1,v_{n-1}}(\mathit{slice}_{n,v_n}(OC)))\ldots)) = OC'$.

By slice and dice operations, various OLAP subcubes are isolated. To make them useful for analysis purposes, the data from the cube should be visualized. Although the cube is a multidimensional structure, only two dimensions can be visualized at a time.

The pivoting (or rotation) operation changes the visualization perspective of the OLAP cube by swapping two dimensions $D_i$ and $D_j$.

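Definition 6 (Pivoting operation). Let $OC = (CS, CA)$ with $CD = D_1 \times D_2 \times \ldots \times D_i \times \ldots \times D_j \times \ldots \times D_n$ and $CH = H_1 \times H_2 \times \ldots \times H_i \times \ldots \times H_j \times \ldots \times H_n$. The pivoting operation is $\mathit{pivot}_{i,j}(OC) = OC'$, where $1 \le i, j \le n$,

$OC' = (CS', CA')$, $CS' = (CD', CH')$,

$CD' = D_1 \times D_2 \times \ldots \times D_j \times \ldots \times D_i \times \ldots \times D_n$,

$CH' = H_1 \times H_2 \times \ldots \times H_j \times \ldots \times H_i \times \ldots \times H_n$,

$\mathit{children}' : H'_i \to \mathcal{P}(H'_i)$, $\mathit{children}'(h) = \mathit{children}(h)$,

$\mathit{allLeaves}' : H'_i \to \mathcal{P}(H'_i)$, $\mathit{allLeaves}'(h) = \mathit{allLeaves}(h)$,

for $h \in H'_i$ and $h_1, h_2 \in \mathit{children}'(h)$, $\mathit{allLeaves}'(h_1) \cap \mathit{allLeaves}'(h_2) = \emptyset$,

$CA' : CH' \to \mathbb{R}$, $CA'(h_1, \ldots, h_j, \ldots, h_i, \ldots, h_n) = CA(h_1, \ldots, h_i, \ldots, h_j, \ldots, h_n)$, for $(h_1, \ldots, h_j, \ldots, h_i, \ldots, h_n) \in CH'$.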
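The effect of pivoting on the aggregate function can be sketched as a wrapper that swaps two argument positions. The function names and the lookup table are illustrative assumptions; the single value reuses the example $MI(\mathit{Germany}, \mathit{2012Mar}, \mathit{ProfitMargin}) = 11$ given earlier.

```python
def pivot(aggregate, i, j):
    """Return CA' with CA'(.., h_j, .., h_i, ..) = CA(.., h_i, .., h_j, ..)."""
    def pivoted(*coords):
        swapped = list(coords)
        # Undo the swap of positions i and j (1-based) before calling CA.
        swapped[i - 1], swapped[j - 1] = swapped[j - 1], swapped[i - 1]
        return aggregate(*swapped)
    return pivoted

# Hypothetical aggregate function backed by a lookup table.
VALUES = {("Germany", "2012Mar", "ProfitMargin"): 11}

def ca(region, month, measure):
    return VALUES[(region, month, measure)]

ca_pivoted = pivot(ca, 1, 2)
print(ca_pivoted("2012Mar", "Germany", "ProfitMargin"))  # 11
```

Pivoting leaves the stored values untouched; only the order in which the dimensions are addressed changes, which is why applying the same pivot twice yields the original function again.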

The roll-up operation consolidates some of the elements of a dimension into one element, which corresponds to a hierarchically superior level.

Definition 7 (Roll-up operation). Let $OC = (CS, CA)$ and $v \in H_k$, where $1 \le k \le n$. The roll-up operation is $\mathit{rollup}_{k,v}(OC) = OC'$, where $OC' = (CS', CA)$ with $CS' = (CD', CH)$, and

CD0₌_D


The drill-down operation refines a member of a dimension into a set of members, corresponding to a hierarchically inferior level.

Definition 8 (Drill-down operation). Let $OC = (CS, CA)$ and $v \in D_k$, where $1 \le k \le n$. The drill-down operation is $\mathit{drilldown}_{k,v}(OC) = OC'$, where $OC' = (CS', CA)$ with $CS' = (CD', CH)$, and $CD' = D_1 \times \ldots \times D_{k-1} \times ((D_k \setminus \{v\}) \cup \mathit{children}(v)) \times \ldots \times D_n$.
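A minimal sketch of the drill-down on the cube dimensions, assuming dimensions are stored as a list of sets and the hierarchy as a parent-to-children dictionary (both representations and the function name are hypothetical):

```python
def drill_down(dims, k, v, children):
    """Replace member v of the k-th dimension (1-based) by its children."""
    if v not in dims[k - 1]:
        raise ValueError(f"{v!r} is not a member of dimension {k}")
    new_dims = [set(d) for d in dims]
    # CD' uses (D_k \ {v}) united with children(v) on position k.
    new_dims[k - 1] = (new_dims[k - 1] - {v}) | set(children.get(v, []))
    return new_dims

# Hypothetical fragment of the time hierarchy of Figure 2.8.
children = {"2012Mar": ["2012MarTue", "2012MarFri"]}
dims = [{"Germany"}, {"2012Jan", "2012Mar"}]

result = drill_down(dims, 2, "2012Mar", children)
print(sorted(result[1]))  # ['2012Jan', '2012MarFri', '2012MarTue']
```

The original list of dimensions is copied rather than modified, so the cube before the drill-down remains available, for example for a later roll-up.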

2.3.2 The Many Flavors of OLAP

Before the introduction of the OLAP principle, relational databases were the most widely used technology for enterprise databases. Relational databases are stable and trustworthy and can be used for storing, updating and retrieving data. However, they provide limited functionality to support user views of data. Most notably lacking was the ability to consolidate, view, and analyze data according to multiple dimensions, in ways that make sense to one or more specific enterprise analysts at any given point in time [20]. Consequently, OLAP facilities were designed to compensate for the limitations of conventional relational databases.

The OLAP Server functionality had to be implemented on top of an existing database technology. Relational databases were considered to be amongst the most reliable and popular types of databases [21]. Naturally, one of the proposed solutions was to add OLAP characteristics on top of a relational model. This is how the ROLAP (Relational OLAP) category came into existence. The OLAP layer provides a multidimensional view, calculation of derived data, slice, dice and drill-down intelligence, and the relational database gives an acceptable performance by employing a Star-schema or Snowflake data model [21, 43].

Being the most appropriate database type for OLTP, due to its design, the relational database is not as good an option for OLAP [20, 25]. Even though presenting close to real-time data loading and having advantages in terms of capacity, ROLAP presents slow query performance and is not always efficient when aggregating large amounts of data.

Instead, a multidimensional database approach was deemed to be more suitable [11, 54]. Known under the name of MOLAP (Multi-dimensional OLAP), this type of OLAP is created to achieve the highest possible query performance. Still, MOLAP has its own deficiencies. MOLAP works best for cubes with a limited number of sparse dimensions. Sparse data within large cubes often causes performance problems.

Hence, the advantages of ROLAP are the disadvantages of MOLAP and vice versa. Therefore, the HOLAP (Hybrid OLAP) version was introduced as the combination of the two, to compensate for the deficiencies of each technology [46]. HOLAP is one of the OLAP types that is becoming mainstream among next-generation OLAP systems. Additional technologies, such as in-memory OLAP, are considered for speed-oriented systems. Nonetheless, depending on data characteristics (e.g., summarized, detailed), one or a combination of these technologies can be considered. Even though multi-hybrid models (e.g., MOLAP and real-time in-memory for analysis and HOLAP for drill through) are designed to incorporate most of the OLAP benefits, there is still no generic OLAP architecture or standard procedure to guarantee optimal performance independent of the requirements.

With the growth of available memory capacity and because memory prices are decreasing with time, the feasibility of storing large databases in memory increases. As a consequence, the disk-based databases are replaced more and more often with in-memory database technology. While conventional disk-based database systems (DRDB) store data on disk, main memory database systems (MMDB) [26] store and access data directly from the main physical memory. Therefore, the response times and transaction throughputs of a MMDB are considerably better than for a disk-based database system. Obviously, a DRDB still has advantages in terms of capacity. There are very large databases that simply cannot fit in memory, e.g., database containing NASA space data (with images). However, it is difficult for DRDB to compete with the speed of MMDB. That is, a database of a reasonable size stored in-memory outperforms a database stored on disk.


Chapter 3

Process Cube

In Section 1.3, the goal of this master project was described as to create a proof-of-concept tool to allow comparison of multiple processes. In Section 1.4, the process cube was introduced as a means to satisfy the goal. Both process mining and OLAP aspects were described in Chapter 2. Being the central component of the system, the process cube links the process mining framework to the existing OLAP technology. By storing event logs in OLAP multidimensional structures, event data can be used to obtain and compare process mining results. In this chapter, the concept of the process cube is explained in detail, together with an example that shows its functionality and a comparison with other hypercube structures. Before proceeding with the process cube materialization in Chapter 4, a set of requirements is established and enumerated at the end of the chapter.

3.1 Process Cube Concept

In Section 2.2.1, the definitions of an event with attributes (Definition 1) and of a case with attributes (Definition 2) were given. Section 2.3.1 includes the definition of an OLAP cube (Definition 3) with its corresponding operations (Definitions 4, 5, 6, 7, 8). In this section, the process cube and process cell notions are introduced by adding event log aspects into the OLAP cube definition. For a further elaboration and formalization of the process cube concept see the paper [6], which was published towards the end of this project.

Figure 3.1: Process Cube Concept.

Figure 3.1, taken from [4], shows relevant process cube characteristics and is therefore representative of the definitions of the different process cube concepts given below (e.g., process cube, process cell). A detailed discussion of the elements of Figure 3.1 is presented in [6].


A process cube is a multidimensional structure built from event log data in a way that facilitates further meaningful process mining analysis. A process cube is composed of a set of process cells [4] and the main difference between a process cube and an OLAP cube lies in its cell characteristics. In contrast to the OLAP cube, there is no real measure of interest quantifying a business operation. While OLAP structures are designed for business operations analysis, the process cube aims at analyzing processes. Therefore, each dimension of analysis is composed of event attributes. Consequently, the content of a cell in the process cube changes from real numbers to events. While in OLAP, dimensions of analysis are used to populate the cube, in the case of process cubes the events of an event log are used to create the dimensions of analysis. Hence, instead of the $MI$ function, the event members function is defined as $EM : E \to DA_1 \times \ldots \times DA_n$. Note that to differentiate between two events with the same attributes, the event id is added as a dimension of analysis. Consequently, for each event there will be a unique combination of dimension of analysis members.

Definition 9 (Process cube). Let

$D_i$, $1 \le i \le n$, be a set of dimension elements, where $n$ is the number of dimensions,

$H_i$, $1 \le i \le n$, be a set of hierarchy elements,

$CD = D_1 \times D_2 \times \ldots \times D_n$ be the cube dimensions,

$CH = H_1 \times H_2 \times \ldots \times H_n$ be the cube hierarchies,

$\mathit{children} : H_i \to \mathcal{P}(H_i)$, where $\mathit{children}(h)$ is the function returning the children of $h \in H_i$,

$\mathit{allLeaves} : H_i \to \mathcal{P}(H_i)$, where $\mathit{allLeaves}(h)$ is the function returning all leaves of $h \in H_i$,

for $h \in H_i$ and $h_1, h_2 \in \mathit{children}(h)$, $\mathit{allLeaves}(h_1) \cap \mathit{allLeaves}(h_2) = \emptyset$,

$CS = (CD, CH)$ be the process cube structure,

$CE : CH \to \mathcal{P}(E)$ be the cell event function, $CE(h_1, h_2, \ldots, h_n) = \{e \in E \mid (d_1, d_2, \ldots, d_n) = CC(e),\ d_i \in \mathit{allLeaves}(h_i),\ 1 \le i \le n\}$, for $(h_1, h_2, \ldots, h_n) \in CH$.

A process cube is defined as $PC = (CS, CE)$.

Note that a process cell can be defined as a subcube obtained by slicing each of the process cube dimensions. Let $PC = (CS, CE)$. The process cell is $\mathit{slice}_{1,v_1}(\mathit{slice}_{2,v_2}(\ldots(\mathit{slice}_{n-1,v_{n-1}}(\mathit{slice}_{n,v_n}(PC)))\ldots)) = PC'$. Each cell in the process cube corresponds to a set of events [4], returned by the cell event function $CE$.

The process cube, as defined above, is a structure that does not allow overlapping of events in its cells. To allow the comparison of different processes using the process cube, a table of visualization is created. The table of visualization is used to visualize only two dimensions at a time. Multiple slice and dice operations can be performed by selecting different elements of the two dimensions. Each slice, dice, roll-up or drill-down is considered to be a filtering operation. Hence, a new filter is created with each OLAP operation. Filters are added as rows/columns in the table of visualization. Note that unlike the cells of the process cube, the cells of the table of visualization may contain overlapping events. That is because there is no restriction in selecting the same dimension members for two filtering operations.
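The difference between process cells and the cells of the table of visualization can be sketched with two filters that select a common dimension member. The event-to-resource mapping below follows the first four events of Table 3.1; the function and variable names are hypothetical.

```python
# Event id -> member on the resource dimension (first four events of Table 3.1).
EVENT_RESOURCE = {1: "560464", 2: "560464", 3: "560925", 4: "560464"}

def apply_filter(selected_resources):
    """Return the ids of all events whose resource is among the selected members."""
    return {e for e, r in EVENT_RESOURCE.items() if r in selected_resources}

# Each filter becomes a row (or column) in the table of visualization.
filter_a = apply_filter({"560464"})
filter_b = apply_filter({"560464", "560925"})

# Both filters select member 560464, so their cells share events.
print(sorted(filter_a & filter_b))  # [1, 2, 4]
```

Because nothing prevents two filters from selecting the same member, the intersection is non-empty here, whereas two distinct process cells of the cube itself never share an event.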

Given a process cube $PC$, a process model $M_{PC}$ is the result of a process discovery algorithm, such as Alpha Miner, Heuristics Miner or other related algorithms, used on $PC$. However, there are various process mining algorithms whose results are not necessarily process models. Instead, they can offer some insightful process-related information. For example, Dotted Chart Analysis provides metrics (e.g., average interval between events) related to events and their distribution over time. Process cubes are not limited to process models either. Therefore, we refer to process mining results simply as models.

So far, we described the process cube as being a hypercube structure, with a finite number of dimensions. In [4], a special process cube is presented, with three dimensions: case type (ct), event class (ec) and time window (tw).

Figure 3.2, taken from [4], contains a table corresponding to a fragment of an event log. Let the event data from the event log be used to construct a process cube $PC$. Then, the ct, ec and tw dimensions are established as follows. The case type dimension is based on the properties of a case. For example, the case type dimension can be represented by the type of the customer, in which case the members of ct are gold and silver, i.e., $D_1 = \{\mathit{gold}, \mathit{silver}\}$, $H_1 = D_1$. The event class dimension is based on the properties of an event. For example, ec can be represented by the resource and include, as such, the following members: $D_2 = \{\mathit{John}\}$, $H_2 = D_2$. The time window dimension is based on timestamps. A time window can refer to years, months, days of week, quarters or any other relevant period of time. Due to its natural hierarchical structure, the tw dimension can be organized as a hierarchy, e.g., $\mathit{2012} \to \mathit{2012Dec} \to \mathit{2012DecSun}$. We consider $D_3 = \{\mathit{2012DecSun}\}$ and $H_3 = \{\mathit{2012}, \mathit{2012Dec}, \mathit{2012DecSun}\}$.

Figure 3.2: Event log excerpt.

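Let $D_1 = \{\mathit{gold}, \mathit{silver}\}$, $D_2 = \{\mathit{John}\}$ and $D_3 = \{\mathit{2012DecSun}\}$,

$H_1 = \{\mathit{gold}, \mathit{silver}\}$, $H_2 = \{\mathit{John}\}$ and $H_3 = \{\mathit{2012}, \mathit{2012Dec}, \mathit{2012DecSun}\}$,

$CD = D_1 \times D_2 \times D_3$ be the cube dimensions,

$CH = H_1 \times H_2 \times H_3$ be the cube hierarchies,

$h_1, h_2 \in H_3$, $h_1 = \mathit{2012}$, $\mathit{children}(h_1) = \{\mathit{2012Dec}\}$, $h_2 = \mathit{2012Dec}$, $\mathit{children}(h_2) = \{\mathit{2012DecSun}\}$,

$h_1, h_2 \in H_3$, $h_1 = \mathit{2012}$, $\mathit{allLeaves}(h_1) = \{\mathit{2012DecSun}\}$, $h_2 = \mathit{2012Dec}$, $\mathit{allLeaves}(h_2) = \{\mathit{2012DecSun}\}$,

$CS = (CD, CH)$ be the process cube structure,

$h_1 \in H_1$, $h_1 = \mathit{gold}$, $\mathit{allLeaves}(h_1) = \{\mathit{gold}\}$, $h_2 \in H_2$, $h_2 = \mathit{John}$, $\mathit{allLeaves}(h_2) = \{\mathit{John}\}$, $h_3 \in H_3$, $h_3 = \mathit{2012}$, $\mathit{allLeaves}(h_3) = \{\mathit{2012DecSun}\}$.

$CE(h_1, h_2, h_3) = \{35654423\}$, $CC(35654423) = (\mathit{gold}, \mathit{John}, \mathit{2012DecSun})$. For the rest of the elements of $CH$, $CE$ is defined in the same way.

The process cube is defined as $PC = (CS, CE)$.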

Figure 3.3: A process model discovered from an extended version of the event log in Figure 3.2 using the Alpha Mining algorithm.

Each process cell $l$ can be used to discover a process model, $M_l$. However, a process model can also be discovered from a group of cells $Q$, $M_Q$, or from the entire process cube $PC$, $M_{PC}$. Figure 3.3 shows a process model discovered from all the event data from the process cube $PC$. $M_{PC}$ is the process model discovered using the Alpha Miner algorithm from the set of events returned by $CE$. This is possible when considering the process cube as corresponding to a single cell in the table of visualization.


3.2 Process Cube by Example

In the previous section, the process cube was introduced together with a formalization of its relevant concepts. In this section, we continue with describing its functionality by means of an example.

Figure 3.4: Functionality in three steps: 1. From XES data to process cube structure. 2. Applying OLAP operations to the process cube. 3. Materialization of process cells.

We propose a functionality in three steps approach, as depicted in Figure 3.4. In the first step, the event data for this example is presented in an XES-like format. The event data is then used to construct a process cube prototype. While building the process cube, its various characteristics are clearly specified by referring to definitions from Section 3.1. The aim of the second step is to show ways of exploring the process cube. To that end, a range of OLAP operations (e.g., slice, dice, roll-up, drill-down, pivoting) are applied to it. As such, the process cube is prepared for the last step, the process cube analysis. More precisely, the third step describes how parts of the process cube are materialized in event logs and then used to obtain process models. These models can then be compared to discover similarities and dissimilarities between their underlying processes.

3.2.1 From XES Data to Process Cube Structure

Table 3.1 contains the event data used in this example to illustrate the process cube functionality. This data is needed to build the process cube structure. In practice, explicit case ids and/or the event ids may be missing. From Definition 1 and Definition 2, both events and cases are represented by unique identifiers. Therefore, when these identifiers do not exist in the original data source, they can be automatically generated when extracting the data.

The definition of the process cube (Definition 9) describes the process cube as an $n$-dimensional structure. Thus, establishing the dimensions is an important step in the creation of a process cube. There is no unique way of deciding on the dimensions of a process cube. One possibility is to select each case attribute and event attribute as a dimension. When applied to our example, this choice leads to a process cube with 5 dimensions. Should the case id and the event id also be considered, the final structure is a 7-dimensional process cube. By considering each different attribute value as a dimension member, the resulting process cube has $4 \times 2 \times 2 \times 43 \times 43 \times 14 \times 2 = 828{,}352$ process cells. It is easy to notice that the case id, event id and timestamp are sparse dimensions, causing the entire process cube to be sparse. Sparsity was discussed in Section 2.3.1.
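The number of process cells is simply the product of the dimension sizes, which can be checked with a short computation. The list below only restates the factors given in the text; the variable name is arbitrary.

```python
from math import prod

# Number of distinct members per dimension, in the order of the product in the text.
member_counts = [4, 2, 2, 43, 43, 14, 2]

print(prod(member_counts))  # 828352
```

Each additional dimension multiplies the cell count by its number of members, which is why dimensions with one member per event quickly make the cube sparse.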

Another possibility is to limit the number of dimensions to three, as suggested in [4]. Based on the case properties, the case type dimension can contain members created from both the parts and the sum leges attributes. The parts attribute specifies for which building parts a building permit can be requested, e.g., Bouw, Milieu. The sum leges attribute gives the total cost of a building permit application, e.g., 138.55, 179.8. At this point, it is important to establish a representative dimension member, as it can influence further analysis. This can be achieved, for instance, by employing data mining techniques. For this example, we describe a simple two-step approach. First, cases are grouped in clusters, based on their properties. It is obvious that cases 1, 2 and 4 belong to one cluster, as they all have the same case properties, and case 3 belongs to another cluster. Secondly, a classification (decision tree learning algorithm) is used on the clustering results. In this example, we expect to identify, after classification, a representative number, e.g., 150, for the sum leges attribute that would differentiate between the two clusters. Consequently, the following two case type dimension members can be considered representative: parts = Bouw, sum leges < 150 and parts = Milieu, sum leges >= 150. The difficulty of this approach is that it requires data mining knowledge to store the event data in the process cube.

| case id | parts | sum leges | event id | timestamp | activity | resource |
|---|---|---|---|---|---|---|
| 1 | Bouw | 138.55 | 1 | 2012-02-21T11:52:13 | 01 HOOFD 010 | 560464 |
| | | | 2 | 2012-02-21T11:56:31 | 01 HOOFD 020 | 560464 |
| | | | 3 | 2012-02-21T12:15:07 | 01 HOOFD 040 | 560925 |
| | | | 4 | 2012-02-21T12:19:22 | 01 HOOFD 050 | 560464 |
| | | | 5 | 2012-02-21T12:50:18 | 01 HOOFD 055 | 560464 |
| | | | 6 | 2012-02-21T14:09:49 | 01 HOOFD 060 | 560925 |
| 2 | Bouw | 138.55 | 7 | 2012-03-08T12:03:11 | 01 HOOFD 010 | 560464 |
| | | | 8 | 2012-03-08T12:07:53 | 01 HOOFD 020 | 560464 |
| | | | 9 | 2012-03-08T12:31:15 | 01 HOOFD 040 | 560925 |
| | | | 10 | 2012-03-08T13:22:08 | 01 HOOFD 060 | 560925 |
| | | | 11 | 2012-03-08T13:35:47 | 01 HOOFD 065 | 560925 |
| | | | 12 | 2012-03-08T14:53:34 | 01 HOOFD 120 | 560925 |
| | | | 13 | 2012-03-08T15:20:55 | 01 HOOFD 260 | 560464 |
| | | | 14 | 2012-03-08T15:36:19 | 09 AH I 010 | 560925 |
| | | | 15 | 2012-03-08T15:56:41 | 01 HOOFD 430 | 560925 |
| 3 | Milieu | 179.8 | 16 | 2012-03-12T09:03:52 | 01 HOOFD 010 | 560464 |
| | | | 17 | 2012-03-12T09:08:21 | 01 HOOFD 020 | 560464 |
| | | | 18 | 2012-03-12T09:17:39 | 01 HOOFD 040 | 560925 |
| | | | 19 | 2012-03-12T09:42:48 | 01 HOOFD 050 | 560925 |
| | | | 20 | 2012-03-12T10:15:07 | 06 VD 010 | 560925 |
| | | | 21 | 2012-03-12T10:24:56 | 01 HOOFD 120 | 560925 |
| | | | 22 | 2012-03-12T10:49:01 | 01 HOOFD 180 | 560925 |
| | | | 23 | 2012-03-12T11:18:19 | 01 HOOFD 260 | 560925 |
| 4 | Bouw | 138.55 | 24 | 2012-03-15T13:11:06 | 01 HOOFD 010 | 560464 |
| | | | 25 | 2012-03-15T13:15:27 | 01 HOOFD 020 | 560464 |
| | | | 26 | 2012-03-15T13:37:42 | 01 HOOFD 040 | 560925 |
| | | | 27 | 2012-03-15T14:02:18 | 01 HOOFD 050 | 560925 |
| | | | 28 | 2012-03-15T14:19:32 | 01 HOOFD 065 | 560925 |
| | | | 29 | 2012-03-15T15:06:11 | 01 HOOFD 120 | 560464 |
| | | | 30 | 2012-03-15T15:46:37 | 01 HOOFD 180 | 560464 |
| | | | 31 | 2012-03-15T16:10:44 | 01 HOOFD 260 | 560464 |
| | | | 32 | 2012-03-15T16:42:01 | 01 HOOFD 380 | 560464 |
| | | | 33 | 2012-03-15T16:53:26 | 01 HOOFD 430 | 560925 |

Table 3.1: Event Log Example. The parts and sum leges columns are case properties; the timestamp, activity and resource columns are event properties.

There is also a middle-ground approach. For instance, the number of dimensions can still be kept small, but not necessarily limited to three. Moreover, one dimension can contain a single property instead of a combination of properties. In this case, the attributes that do not end up as dimensions can still be stored in a cell. For this example, we consider 4 dimensions: parts, activity, resource and timestamp. The parts dimension has two elements, $D_1 = \{\mathit{Bouw}, \mathit{Milieu}\}$. The resource dimension also has two elements, $D_2 = \{560464, 560925\}$. The activity dimension consists of 14 elements, e.g., 01 HOOFD 010, 09 AH I 010 and others. While the first three dimensions have a relatively small number of members, the last dimension consists of 43 different members. To reduce this number, only the year, the month and the day of the week are considered for the timestamp dimension and the rest is stored in the cell. Consequently, the size of the timestamp dimension is reduced to three: 2012FebTue, 2012MarMon and 2012MarThu. As a result, the process cube $PC$ consists of $2 \times 14 \times 3 \times 2 = 168$ process cells.
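The reduction of the timestamp dimension can be sketched as a mapping from a full timestamp to a year-month-weekday member. The function name and the explicit name lists are assumptions made to keep the sketch independent of locale settings; the timestamps are the first event of each case in Table 3.1.

```python
from datetime import datetime

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def time_member(timestamp):
    """Map an ISO timestamp to a member such as '2012FebTue'."""
    t = datetime.fromisoformat(timestamp)
    # weekday() counts from Monday = 0, matching the WEEKDAYS list.
    return f"{t.year}{MONTHS[t.month - 1]}{WEEKDAYS[t.weekday()]}"

# First timestamp of each of the four cases in Table 3.1.
stamps = ["2012-02-21T11:52:13", "2012-03-08T12:03:11",
          "2012-03-12T09:03:52", "2012-03-15T13:11:06"]

members = sorted({time_member(s) for s in stamps})
print(members)  # ['2012FebTue', '2012MarMon', '2012MarThu']
```

Cases 2 and 4 both fall on a Thursday in March 2012, which is why four case dates collapse into three dimension members.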

To show the content of a process cell of the process cube $PC$, we use the $CE$ function on a set of selected hierarchy elements. For $h_1 \in H_1$, $h_1 = \mathit{Bouw}$, $\mathit{allLeaves}(h_1) = \{\mathit{Bouw}\}$, $h_2 \in H_2$, $h_2 = 560925$, $\mathit{allLeaves}(h_2) = \{560925\}$, $h_3 \in H_3$, $h_3 = \text{01 HOOFD 040}$, $\mathit{allLeaves}(h_3) = \{\text{01 HOOFD 040}\}$, $h_4 \in H_4$, $h_4 = \mathit{2012MarThu}$, $\mathit{allLeaves}(h_4) = \{\mathit{2012MarThu}\}$, the $CE$ function returns $CE(h_1, h_2, h_3, h_4) = \{9, 26\}$. Both

$CC(9) = (\mathit{Bouw}, 560925, \text{01 HOOFD 040}, \mathit{2012MarThu})$ and

$CC(26) = (\mathit{Bouw}, 560925, \text{01 HOOFD 040}, \mathit{2012MarThu})$

return the same tuple of hierarchy elements. Event data that is not yet stored as dimension values can still be stored in the process cell containing events 9 and 26, as shown in Table 3.2.

| case id | sum leges | event id | timestamp |
|---|---|---|---|
| 2 | 138.55 | 9 | 2012-03-08T12:31:15 |
| 4 | 138.55 | 26 | 2012-03-15T13:37:42 |

Table 3.2: Event data corresponding to the process cell defined by $CE(h_1, h_2, h_3, h_4) = \{9, 26\}$.
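The cell lookup can be reproduced with a minimal sketch of the cell event function. It assumes an excerpt of six events from Table 3.1, with coordinates ordered as (parts, resource, activity, timestamp member); the dictionary name `CC` echoes the notation of Definition 9, while the function name is hypothetical.

```python
# Excerpt of Table 3.1: event id -> (parts, resource, activity, time member).
# The tuple order follows the dimensions D1..D4 used in the text.
CC = {
    3:  ("Bouw",   "560925", "01 HOOFD 040", "2012FebTue"),
    9:  ("Bouw",   "560925", "01 HOOFD 040", "2012MarThu"),
    10: ("Bouw",   "560925", "01 HOOFD 060", "2012MarThu"),
    18: ("Milieu", "560925", "01 HOOFD 040", "2012MarMon"),
    24: ("Bouw",   "560464", "01 HOOFD 010", "2012MarThu"),
    26: ("Bouw",   "560925", "01 HOOFD 040", "2012MarThu"),
}

def cell_events(leaf_sets):
    """CE: ids of events whose i-th coordinate lies in the i-th set of leaves."""
    return {
        e for e, coords in CC.items()
        if all(d in leaves for d, leaves in zip(coords, leaf_sets))
    }

# allLeaves(h_i) for the four selected hierarchy elements.
selection = [{"Bouw"}, {"560925"}, {"01 HOOFD 040"}, {"2012MarThu"}]
print(sorted(cell_events(selection)))  # [9, 26]
```

Passing a larger set of leaves on one dimension, for example both March members on the timestamp dimension, corresponds to evaluating $CE$ on a hierarchically superior element such as 2012Mar.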

3.2.2 Applying OLAP Operations to the Process Cube

In Section 2.3.1, the following OLAP operations were described: slice, dice, pivoting, roll-up and drill-down. In this section, we show, by means of an example, how these operations can be applied on a process cube.

Figure 3.5: Process cube by example. With orange, 2012FebTue and 2012MarThu are selected for the timestamp dimension and are used for dicing the process cube. With green, a subcube is illustrated, which is the result of slicing the previous subcube on the 560464 member of the resource dimension. With red, a subcube is illustrated, which is the result of slicing the previous subcube on the 560925 member of the resource dimension.

Figure 3.5 illustrates the 4-dimensional process cube $PC$, constructed in the previous step. To represent the 4D structure in a 2D plane, first the members of the timestamp hierarchy are displayed on the left. The root element of the hierarchy is the year 2012, followed by the month elements, 2012Feb and 2012Mar, with the days of week as the leaf nodes, 2012FebTue, 2012MarMon and 2012MarThu. To each leaf member of the timestamp dimension corresponds a 3D subcube like the one on the right.

For the process cube $PC$, we choose to first do a dice, by selecting the 2012FebTue and the 2012MarThu members on the timestamp dimension. Let $PC = (CS, CE)$ and $D'_i = D_i$ for all $1 \le i \le 3$, $D'_4 = \{\mathit{2012FebTue}, \mathit{2012MarThu}\}$. The dice operation is $\mathit{dice}_{CD'}(PC) = PC'$, where

$PC' = (CS', CE')$, $CS' = (CD', CH')$, $CH' = H_1 \times H_2 \times H_3 \times H'_4$,

$\mathit{allLeaves}(\mathit{2012}) = \{\mathit{2012FebTue}, \mathit{2012MarMon}, \mathit{2012MarThu}\}$,

$\mathit{allLeaves}(\mathit{2012FebTue}) = \{\mathit{2012FebTue}\}$.

Then, $\mathit{allLeaves}(\mathit{2012}) \cap \mathit{allLeaves}(\mathit{2012FebTue}) = \{\mathit{2012FebTue}\}$, $\ldots$

$H'_4 = \{\mathit{2012}, \mathit{2012Feb}, \mathit{2012Mar}, \mathit{2012FebTue}, \mathit{2012MarThu}\}$,

$h \in H_4$, $h = \mathit{2012Mar}$, $\mathit{children}(h) = \{\mathit{2012MarMon}, \mathit{2012MarThu}\}$, $\mathit{children}'(h) = \mathit{children}(h) \cap H'_4$, $\mathit{children}'(h) = \{\mathit{2012MarThu}\}$, $\ldots$

$h \in H_4$, $h = \mathit{2012Mar}$, $\mathit{allLeaves}(h) = \{\mathit{2012MarMon}, \mathit{2012MarThu}\}$, $\mathit{allLeaves}'(h) = \mathit{allLeaves}(h) \cap H'_4$, $\mathit{allLeaves}'(h) = \{\mathit{2012MarThu}\}$, $\ldots$

$CE'(h_1, \ldots, h_4) = CE(h_1, \ldots, h_4)$, for $(h_1, \ldots, h_4) \in CH'$.
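The diced timestamp hierarchy can be recomputed with a short sketch that applies the condition of Definition 4 literally. The dictionary encoding of the hierarchy and the function names are assumptions; the hierarchy elements are those described for Figure 3.5.

```python
# Timestamp hierarchy of the example cube (parent -> children).
CHILDREN = {
    "2012": ["2012Feb", "2012Mar"],
    "2012Feb": ["2012FebTue"],
    "2012Mar": ["2012MarMon", "2012MarThu"],
}
H4 = {"2012", "2012Feb", "2012Mar", "2012FebTue", "2012MarMon", "2012MarThu"}

def all_leaves(h):
    """Return all leaves below h; a leaf returns itself."""
    kids = CHILDREN.get(h, [])
    if not kids:
        return {h}
    return set().union(*(all_leaves(k) for k in kids))

def dice_hierarchy(hierarchy, selected):
    """H'_i: hierarchy elements sharing at least one leaf with a selected member."""
    return {
        h for h in hierarchy
        if any(all_leaves(v) & all_leaves(h) for v in selected)
    }

H4_diced = dice_hierarchy(H4, {"2012FebTue", "2012MarThu"})
children_diced = set(CHILDREN["2012Mar"]) & H4_diced   # children'(2012Mar)
leaves_diced = all_leaves("2012Mar") & H4_diced        # allLeaves'(2012Mar)

print(sorted(H4_diced))
print(children_diced, leaves_diced)
```

Only 2012MarMon is dropped from the hierarchy, because it is the only element that shares no leaf with either of the two selected members.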

Further, two slice operations are performed on the diced subcube $PC'$, by selecting first the 560464 and then the 560925 member of the resource dimension. The resulting subcubes $PC'_1$ and $PC'_2$ are still 4D structures, although they have only one member on the resource dimension. The corresponding 3D subcubes, with the timestamp dimension left aside due to representation issues, are depicted in Figure 3.5. The $PC_{560464}$ subcube is represented with green and the $PC_{560925}$ subcube is represented with red.

The slice operation that selects resource 560464 is $\mathit{slice}_{2,560464}(PC') = PC_{560464}$, where $PC_{560464} = \mathit{dice}_{CD_{560464}}(PC')$ with $CD_{560464} = D'_1 \times \{560464\} \times D'_3 \times D'_4$. The slice operation that selects resource 560925 is $\mathit{slice}_{2,560925}(PC') = PC_{560925}$, where $PC_{560925} = \mathit{dice}_{CD_{560925}}(PC')$ with $CD_{560925} = D'_1 \times \{560925\} \times D'_3 \times D'_4$.
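The relation between the two operations, a slice being a dice that keeps a single member on one dimension, can be illustrated with a small Python sketch. The cell representation (a dictionary from coordinates to sublogs), the function names and the activity names "a1" and "a2" are assumptions made for illustration. Dimension indices are 0-based here, whereas the formulas count from 1.

```python
from itertools import product


def dice(cells, kept_members):
    """Keep the cells whose coordinate lies in kept_members[i] on every dimension i."""
    return {
        coord: sublog
        for coord, sublog in cells.items()
        if all(member in kept for member, kept in zip(coord, kept_members))
    }


def slice_member(cells, dims, index, member):
    """Slice as a special dice: keep one member on dimension `index` (0-based)."""
    kept_members = list(dims)
    kept_members[index] = {member}
    return dice(cells, kept_members)


# Dimensions of the diced cube PC': parts, resource, activity, timestamp.
# The activity names "a1" and "a2" are placeholders, not taken from the event log.
dims = [
    {"Bouw", "Milieu"},
    {"560464", "560925"},
    {"a1", "a2"},
    {"2012FebTue", "2012MarThu"},
]
# Every cell would hold a sublog; empty lists stand in for the event data.
cells = {coord: [] for coord in product(*dims)}

pc_560464 = slice_member(cells, dims, 1, "560464")
pc_560925 = slice_member(cells, dims, 1, "560925")
```

Both results keep four-element coordinates, which reflects that the sliced subcubes remain 4D structures with a single member on the resource dimension.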

While slice and dice operations select parts of a process cube, pivoting, roll-up and drill-down operations help in visualizing the selections. As mentioned in Section 2.3.1, only two of the process cube dimensions can be visualized at a time. For example, in Figure 3.5, the dimensions parts and resource can be easily visualized. This part of the cube indicates which resources are responsible for handling cases for Bouw and which for Milieu. The activity dimension can also be visualized, but not all of its elements can be clearly distinguished.

The pivoting (or rotation) operation changes the perspective from which the process cube is visualized. For example, by placing the activity dimension on the x axis instead of parts, and the parts dimension on the y axis instead of activity, the cube is rotated and a new side of it becomes visible. Such a change makes it easy to distinguish the activities corresponding to the Bouw and Milieu parts, together with their corresponding cells. The pivoting operation is $\mathit{pivot}_{1,3}(PC') = PC'_p$.

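$$PC'_p = (CS'_p, CE'_p), \qquad CS'_p = (CD'_p, CH'_p),$$

$$CD'_p = D'_3 \times D'_2 \times D'_1 \times D'_4, \qquad CH'_p = H'_3 \times H'_2 \times H'_1 \times H'_4,$$

$$\mathit{children}'(h) = \mathit{children}(h), \qquad \mathit{allLeaves}'(h) = \mathit{allLeaves}(h),$$

$$CE'_p(h_3, h_2, h_1, h_4) = CE(h_1, h_2, h_3, h_4).$$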
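As a rough illustration of the pivot, the following Python sketch swaps two dimensions both in the dimension list and in every cell coordinate, so that the cell contents stay attached to the same members. The data layout, the function name and the sample cells (activity names "a1", "a2" and event identifiers "e1" to "e3") are assumptions. Indices are 0-based, so the pivot on dimensions 1 and 3 of the formulas becomes a swap of positions 0 and 2.

```python
def pivot(dims, cells, i, j):
    """Swap dimensions i and j (0-based) in the dimension tuple and in all coordinates."""

    def swap(seq):
        items = list(seq)
        items[i], items[j] = items[j], items[i]
        return tuple(items)

    return swap(dims), {swap(coord): sublog for coord, sublog in cells.items()}


dims = ("parts", "resource", "activity", "timestamp")
# Hypothetical cells: the activities and event identifiers are placeholders.
cells = {
    ("Bouw", "560464", "a1", "2012FebTue"): ["e1", "e2"],
    ("Milieu", "560925", "a2", "2012MarThu"): ["e3"],
}

dims_p, cells_p = pivot(dims, cells, 0, 2)
```

Applying the same pivot a second time restores the original order, since only the viewing perspective changes and no cell content is altered.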

The roll-up and drill-down operations have an effect only when applied to a dimension with a hierarchical structure. A roll-up operation replaces members of a lower hierarchical level with a member of a higher level. For this example, we consider the timestamp dimension with its elements 2012FebTue, 2012MarMon and 2012MarThu. A roll-up operation on the children of 2012Mar replaces the current timestamp elements with 2012FebTue and 2012Mar.

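The roll-up operation is then $\mathit{rollup}_{4,\text{2012Mar}}(PC') = PC'_r$, where $PC'_r = (CS'_r, CE)$ with $CS'_r = (CD'_r, CH)$ and

$$CD'_r = D'_1 \times D'_2 \times D'_3 \times \big((D'_4 \setminus \mathit{children}(\text{2012Mar})) \cup \{\text{2012Mar}\}\big).$$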
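At the dimension level, the roll-up amounts to a simple set operation. The Python sketch below is illustrative only; the `CHILDREN` dictionary and the function name are assumptions, and the member set is the one used in the running text.

```python
# Children of the month members of the timestamp hierarchy (assumed layout).
CHILDREN = {
    "2012Feb": {"2012FebTue"},
    "2012Mar": {"2012MarMon", "2012MarThu"},
}


def roll_up(members, parent, children=CHILDREN):
    """Replace the displayed children of `parent` by `parent` itself."""
    return (members - children[parent]) | {parent}


d4 = {"2012FebTue", "2012MarMon", "2012MarThu"}
d4_rolled = roll_up(d4, "2012Mar")
```

After the call, `d4_rolled` holds 2012FebTue and 2012Mar, as described for the rolled-up timestamp dimension.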

While the roll-up operation folds elements of a lower hierarchical level into elements of a higher one, the drill-down operation expands members of a higher level. We consider again the timestamp dimension. For the previous $PC'_r$ subcube, a drill-down operation on the 2012Mar element replaces the current dimension elements with 2012FebTue, 2012MarMon and 2012MarThu.

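The drill-down operation is then $\mathit{drilldown}_{4,\text{2012Mar}}(PC'_r) = PC'_d$, where $PC'_d = (CS'_d, CE)$ with $CS'_d = (CD'_d, CH)$ and

$$CD'_d = D'_1 \times D'_2 \times D'_3 \times \big((D'_4 \setminus \{\text{2012Mar}\}) \cup \mathit{children}(\text{2012Mar})\big).$$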
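The drill-down can be sketched in the same way as a set operation on the displayed members. Again, the `CHILDREN` dictionary and the function name are assumptions made for this illustration.

```python
# Children of the month members of the timestamp hierarchy (assumed layout).
CHILDREN = {
    "2012Feb": {"2012FebTue"},
    "2012Mar": {"2012MarMon", "2012MarThu"},
}


def drill_down(members, parent, children=CHILDREN):
    """Replace `parent` by its children in the set of displayed members."""
    return (members - {parent}) | children[parent]


d4_rolled = {"2012FebTue", "2012Mar"}
d4_drilled = drill_down(d4_rolled, "2012Mar")
```

Starting from the rolled-up members, the call yields 2012FebTue, 2012MarMon and 2012MarThu again, so in this example the drill-down undoes the preceding roll-up.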

3.2.3 Materialization of Process Cells

In the previous step, the applicability of the OLAP operations was shown by means of an example. The main emphasis was on the changes that occur at the dimension level. Naturally, the question arises as to what happens at the cell level. The last step of our approach answers this question. Our explanation relies on Figure 3.6, which is presented in more detail in [6].

Figure 3.6: Partitioning of the process cube. The split operation is realized by drill-down. The functionality of the merge operation is given by roll-up.

The left part of Figure 3.6 shows the process cube created from an extended version of the event log in Figure 3.2. Within the cube, the top part depicts a simplified event log corresponding to the process cube. The step of extracting an event log from the event data of the process cube, or of parts of it (process cells or groups of cells), is known as the materialization step. The resulting event logs are then given as input to different process mining algorithms. The outcome is a set of process models that can be visualized. Returning to our example, the event log shown at the top of the process cube is used to obtain the process model shown at the bottom, by applying the Alpha Miner algorithm to it.
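The materialization step can be pictured with a minimal Python sketch. It assumes that each cell stores its events as (case id, activity, timestamp) tuples and that a materialized event log is a mapping from case id to a time-ordered trace; these representations, the function name, and the member names "gold", "silver", "AC" and "B" are all hypothetical and not taken from the thesis.

```python
from collections import defaultdict


def materialize(cells, coordinates):
    """Collect the events of the chosen cells into an event log: case id -> trace."""
    events_by_case = defaultdict(list)
    for coord in coordinates:
        for case_id, activity, timestamp in cells.get(coord, []):
            events_by_case[case_id].append((timestamp, activity))
    # Order the events of every case by timestamp and keep only the activities.
    return {
        case_id: [activity for _, activity in sorted(events)]
        for case_id, events in events_by_case.items()
    }


# Hypothetical two-dimensional cube: (case type, event class) -> events.
cells = {
    ("gold", "AC"): [(1, "A", 1), (1, "C", 3)],
    ("gold", "B"): [(1, "B", 2)],
    ("silver", "AC"): [(2, "A", 4), (2, "C", 5)],
}

log_gold = materialize(cells, [("gold", "AC"), ("gold", "B")])
log_gold_ac = materialize(cells, [("gold", "AC")])
```

Materializing a group of cells yields whole traces (`log_gold`), whereas materializing a single cell yields only the part of each trace that falls into that cell (`log_gold_ac`). Either result could then be handed to a discovery algorithm.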

The right part of Figure 3.6 shows the result of splitting the process cube from the left on its case type and event class dimensions. In the figure, two types of splitting can be identified. Vertical splits separate entire cases. For example, by splitting on the case type dimension, cases 1, 4, 5, 6 are separated from cases 2, 3, 7, 8. The results of a horizontal split are no longer whole cases, but rather parts of cases corresponding to subsets of activities. For example, by splitting on the event class dimension, activities A, C are representative for the cell given by
