Before implementing the process cube, we perform a literature study to identify the cubes whose functionality and requirements are closest to those of the process cube. The reason for doing this is threefold. First, similarities with other hypercube structures may allow some of their functionality to be reused. Second, identifying the limitations of current multidimensional structures helps to clarify what is still to be done. Finally, previous work on similar OLAP cubes can indicate where difficulties are to be expected.
Data loaded into traditional OLAP cubes come from different sources, e.g., multiple data warehouses. Due to the considerable growth of stored data, simple ways of data representation are sought to conveniently keep data outside local databases. OLAP cubes have also been adapted to handle data in different formats. For example, OLAP cubes can be specified on XML data [34]. Still, OLAP cubes cannot support data in XES format, typical for event logs, because of the specific characteristics of event data.
OLAP cubes are designed to work with numerical measures, and various ways of computing numerical aggregates have been explored, from the traditional sum, count and average to sorting-based algorithms [10] and agglomerative hierarchical clustering [40]. In [45], several measures are proposed to summarize process behavior in a multidimensional process model. Among those, instance-based measures (e.g., average throughput time of all process instances), event-based measures (e.g., average execution time of process events) and flow-based measures (e.g., average waiting time between process events) are the most relevant.
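To make the three kinds of measures concrete, the following minimal sketch computes one example of each over a toy event log. The log, its tuple layout (case id, activity, start, end) and the function names are illustrative assumptions and are not taken from [45].

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy event log: (case id, activity, start, end), times in minutes.
EVENTS = [
    ("c1", "register", 0, 5),
    ("c1", "check", 15, 25),
    ("c1", "pay", 30, 40),
    ("c2", "register", 0, 10),
    ("c2", "pay", 30, 50),
]


def group_by_case(events):
    """Return one trace per case, ordered by start time."""
    cases = defaultdict(list)
    for case_id, activity, start, end in events:
        cases[case_id].append((activity, start, end))
    for trace in cases.values():
        trace.sort(key=lambda e: e[1])
    return dict(cases)


def avg_throughput_time(events):
    """Instance-based: mean time from first start to last end of a case."""
    cases = group_by_case(events)
    return mean(
        max(e[2] for e in trace) - min(e[1] for e in trace)
        for trace in cases.values()
    )


def avg_execution_time(events):
    """Event-based: mean duration of a single event."""
    return mean(end - start for _, _, start, end in events)


def avg_waiting_time(events):
    """Flow-based: mean gap between consecutive events of the same case."""
    waits = []
    for trace in group_by_case(events).values():
        for prev, nxt in zip(trace, trace[1:]):
            waits.append(nxt[1] - prev[2])
    return mean(waits)
```

In an OLAP setting, such functions would be evaluated per cell, i.e., over the subset of events selected by the chosen dimension values, rather than over the whole log.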
In recent years, non-numerical data have also been considered in an OLAP setting. OLAP cubes have been extended to graphs [52], sequences [37, 38] and text data [36]. Creating a Text Cube became possible by employing information retrieval techniques and selecting term frequency and inverted index as measures.
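As an illustration of the two text measures mentioned above, the sketch below computes a term frequency and an inverted index for the documents of a single cell. The documents, the whitespace tokenization and the function names are assumptions made for this example; the construction in [36] may differ in its details.

```python
from collections import Counter, defaultdict

# Hypothetical documents that fall into one cell of a text cube.
DOCS = {
    "d1": "late delivery complaint",
    "d2": "delivery confirmed",
    "d3": "complaint about late delivery",
}


def term_frequency(doc_ids, docs):
    """Count how often each term occurs in the given documents."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(docs[doc_id].split())
    return counts


def inverted_index(doc_ids, docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id in doc_ids:
        for term in docs[doc_id].split():
            index[term].add(doc_id)
    return dict(index)
```

Both measures can be merged when cells are aggregated: term frequencies add up, and the document sets of an inverted index are united.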
In [45], the Event Cube is presented. Unlike other OLAP cubes, this multidimensional structure is constructed for the inspection of different perspectives of a business process, which coincides with the purpose of the process cube. To accomplish this, event information is summarized by means of different measures of interest. For instance, the control-flow measure is used to directly apply the Multidimensional Heuristics Miner process discovery algorithm. The difficulty with this approach is that traditional process mining techniques have to be extended with multidimensional capabilities, as was done for the Flexible Heuristics Miner: the Multidimensional Heuristics Miner was introduced as a generalization of the Flexible Heuristics Miner that handles multidimensional process information. Extending every existing process mining technique in this way requires considerable effort. Therefore, we propose a conceptually clearer and more generic approach. Instead of adjusting all process mining techniques to multidimensionality, the OLAP multidimensional structure is adjusted so that existing process mining techniques can be employed without changing them.
All in all, the process cube is unique in that it stores event data in its multidimensional structure, which is then used for process analysis by employing existing process mining techniques. This approach creates a bridge between process mining and OLAP, as methods from both fields are applied together. The advantage is that quick discovery and analysis of business processes and of their corresponding sub-processes is facilitated in an integrated way. Moreover, no changes to the applied traditional process mining techniques are needed.
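The idea can be sketched as follows: events are grouped into cells by their dimension values, a cell is materialized as an ordinary event log, and an unmodified technique is applied to that log. This is only an illustration under stated assumptions. The event attributes, the function names and the use of directly-follows counting as a stand-in for a process mining technique are all hypothetical and do not describe the prototype.

```python
from collections import Counter, defaultdict

# Hypothetical events with a case id, an activity, a timestamp and one dimension.
EVENTS = [
    {"case": "c1", "activity": "register", "time": 1, "region": "EU"},
    {"case": "c1", "activity": "check", "time": 2, "region": "EU"},
    {"case": "c1", "activity": "pay", "time": 3, "region": "EU"},
    {"case": "c2", "activity": "register", "time": 1, "region": "EU"},
    {"case": "c2", "activity": "pay", "time": 2, "region": "EU"},
    {"case": "c3", "activity": "register", "time": 1, "region": "US"},
    {"case": "c3", "activity": "check", "time": 2, "region": "US"},
]


def build_cube(events, dimensions):
    """Group events into cells keyed by their values for the given dimensions."""
    cells = defaultdict(list)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        cells[key].append(event)
    return dict(cells)


def cell_to_log(cell_events):
    """Materialize a cell as an event log: one time-ordered trace per case."""
    traces = defaultdict(list)
    for event in sorted(cell_events, key=lambda e: e["time"]):
        traces[event["case"]].append(event["activity"])
    return list(traces.values())


def directly_follows(log):
    """Stand-in for an existing technique: count directly-follows pairs."""
    counts = Counter()
    for trace in log:
        counts.update(zip(trace, trace[1:]))
    return counts


# Slice on region and analyze the EU cell with the unchanged technique.
CUBE = build_cube(EVENTS, ["region"])
EU_RELATIONS = directly_follows(cell_to_log(CUBE[("EU",)]))
```

Note that `directly_follows` knows nothing about dimensions or cells. All multidimensional logic stays in `build_cube` and `cell_to_log`, which mirrors the separation argued for above.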
Chapter 4
OLAP Open Source Choice
Building on the conceptual aspects introduced previously, the following chapters describe the prototype solution. Before going into the details of the implementation, this chapter motivates our technology choices.
The process cube formalization from Chapter 3 indicates the need for process mining and OLAP support. For process mining, the selected framework is ProM, introduced in Section 2.2.1, as it is the leading open source tool for process mining. Commercial process mining systems also exist, e.g., Futura Reflect, Fluxicon, Comprehend, ARIS Process Performance Manager [12], but ProM contains many plugins that allow effective process discovery and analysis. A subset of these plugins is used in this project. In addition to the OLAP database, we also use a classical relational database to store event data that is used only for event log reconstruction. There is a vast array of relational database systems available, e.g., Oracle Database, Microsoft SQL Server, MySQL, IBM DB2, SAP Sybase, to name just a few. As using one relational database over another brings no special benefit for our purposes, we choose MySQL, as it is one of the most widely used database systems in the world.
For OLAP, on the other hand, it is difficult to make an immediate decision with respect to the tool selection. Multiple technologies are available, which vary in terms of the database type used, e.g., classical relational, multidimensional or hybrid; the storage location, e.g., in-memory or on-disk; the storage method, e.g., column-based or row-based; the way data relationships are kept, e.g., matrix or non-matrix (polynomial) databases; and so on. Therefore, in this chapter, the different OLAP tools and their characteristics are detailed, together with their advantages and disadvantages. Finally, a single OLAP system is selected for our application.
4.1 Existing OLAP Open Source Tools
For an OLAP tool to be used in this project, supporting conventional OLAP functionality is not sufficient. Several requirements were listed in Section 3.3. Of those, two are particularly important when choosing an external OLAP tool. The tool has to be open source, to allow changes to its functionality, and it should support further Java development, to enable the integration of ProM (which is written in Java) and OLAP capabilities on a single platform. OLAP tools can be split into OLAP servers and OLAP clients. OLAP clients are the user interfaces to the OLAP servers.
Even though open source OLAP servers and clients are not as powerful as commercial solutions [49], they encourage community-based development by being free to use and modify. In our case, when integrating process mining solutions into OLAP technology, we expect to need functionality that differs from what existing tools offer. Therefore, in this project, an open source tool that allows new solutions to be added is preferred over a more “powerful” but non-extensible commercial tool.
Open source OLAP tools are described in several sources [1, 27, 28, 48, 49, 50]. Of those, [1, 49, 50] contain the work of Thomsen and Pedersen, and include a periodic survey of open source tools for business intelligence. The first survey [49], published in 2005, refers to three OLAP servers, Bee, Lemur and Mondrian, and two OLAP clients, Bee and JPivot, which were the only ones implemented at the time. In the survey from 2011 [1], only two OLAP servers are presented, Mondrian and Palo. That is because the Bee and Lemur servers were discontinued and a new OLAP server, Palo, was created. In [28], the same Mondrian and Palo OLAP servers are mentioned again. By 2011, several OLAP clients were already available, e.g., JPalo, JPivot, JRubik, FreeAnalysis, JMagallanes OLAP & Reports. There are also several integrated BI suites. Both [27] and [50] refer to JasperSoft BI Suite, Pentaho and SpagoBI. All these BI suites use the Mondrian OLAP engine and the JPivot OLAP client graphical interface. Recently, the Palo BI Suite was released, which works with the Palo multidimensional OLAP server and the Palo for Excel client.
As every OLAP client uses a specific OLAP server, selecting an OLAP server automatically narrows the client choice. In the following, we summarize the two previously introduced OLAP servers, Mondrian and Palo. These servers are quite different from each other, mainly because they use different types of databases to store the data. The first, Mondrian, stores data in relational databases and is therefore called a ROLAP server. The other, Palo, stores data in multidimensional databases and is therefore considered a MOLAP server.
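The difference between the two storage styles can be illustrated with a small sketch. It uses neither Mondrian nor Palo. The fact rows, table name and function names are assumptions made for this example, with an in-memory SQLite table standing in for a relational store and a plain dictionary standing in for a multidimensional store.

```python
import sqlite3

# Hypothetical facts: (region, year, amount).
FACTS = [
    ("EU", 2011, 10),
    ("EU", 2012, 15),
    ("US", 2011, 7),
    ("US", 2012, 9),
    ("EU", 2012, 5),
]


def rolap_total(facts, region, year):
    """ROLAP style: keep facts in a relational table, aggregate with SQL on demand."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", facts)
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE region = ? AND year = ?",
        (region, year),
    ).fetchone()
    conn.close()
    return row[0]


def build_molap_cells(facts):
    """MOLAP style: pre-aggregate facts into cells addressed by coordinates."""
    cells = {}
    for region, year, amount in facts:
        cells[(region, year)] = cells.get((region, year), 0) + amount
    return cells


def molap_total(cells, region, year):
    """Answer a query by looking up the cell at the given coordinates."""
    return cells.get((region, year), 0)
```

Both functions return the same totals. In this sketch they differ in when the aggregation happens: at query time in the relational variant, at load time in the multidimensional one.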