Observation 1. The loading time of an event log is practically independent of the number of analysis dimensions. This is illustrated in Figure 6.7 and follows from the loading algorithm: the event log is loaded into the database event by event, and a constant number of operations is performed for each event. Hence, the loading time depends only on the number of events.
Observation 2. The sparsity of the process cube heavily impacts the unloading performance. For the cell selected in the visualization table, all combinations of analysis-dimension members that correspond to this cell are computed during the unload. For each combination, it is verified whether the associated process cube cell contains any events. Hence, a fixed amount of time is spent checking whether the cell is empty: the cell id is retrieved from the multidimensional database, and if the cell id is NULL, the cell is empty and no further action is taken for it. If the cell contains events, additional time is spent unloading the event data from the relational database. Checking empty cells therefore adds to the unloading time without contributing any events. This is illustrated by the results of the second test, where, with 191 times more cells to verify and the same number of events to unload as in a normally sparse cube, the unloading time is 10 times larger.
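The unloading procedure described in Observation 2 can be sketched as follows. This is a minimal illustration and not the code of the developed tool: the function `unload`, the dictionaries `cell_index` and `event_store` (stand-ins for the multidimensional and the relational database, respectively) and the example members are all assumptions made for this sketch.

```python
from itertools import product


def unload(selected_members, cell_index, event_store):
    """Return (events, cells_checked) for the selected dimension members.

    selected_members: one list of members per analysis dimension.
    cell_index: dict from a member combination to a cell id; a missing
        key plays the role of a NULL cell id (hypothetical stand-in for
        the multidimensional database).
    event_store: dict from a cell id to its events (hypothetical
        stand-in for the relational database).
    """
    events = []
    cells_checked = 0
    for combination in product(*selected_members):
        cells_checked += 1                     # fixed cost paid for every cell
        cell_id = cell_index.get(combination)  # None stands for NULL
        if cell_id is None:
            continue                           # empty cell: nothing to unload
        events.extend(event_store[cell_id])    # extra cost only for non-empty cells
    return events, cells_checked


# Two non-empty cells holding three events in total (made-up data).
cell_index = {("Amsterdam", "approve"): 1, ("Utrecht", "reject"): 2}
event_store = {1: ["e1", "e2"], 2: ["e3"]}

dense = [["Amsterdam", "Utrecht"], ["approve", "reject"]]
sparse = [["Amsterdam", "Utrecht", "Delft", "Breda"],
          ["approve", "reject", "review", "archive", "reopen"]]

events_dense, checks_dense = unload(dense, cell_index, event_store)
events_sparse, checks_sparse = unload(sparse, cell_index, event_store)
```

Both calls return the same three events, but the sparser selection has to check 20 cells instead of 4. The number of checks grows with the product of the member counts, while the number of events to unload stays the same.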
Observation 3. Manual splitting and analysing of sparse dimensions, e.g., with several hundred dimension members, would be very time-consuming and would probably overload the user. Realistically, only dimensions with at most 20 members are fit to be included in the process cube structure. Selecting such dimensions ensures low sparsity of the resulting process cube and results in good responsiveness of the developed tool. Test 1 was based on a typical selection of analysis dimensions, and its results therefore characterize the operation speed of the tool under regular sparsity. Moreover, it was observed that the developed tool, together with a processing step, e.g., Log Dialog, delivers the result within 10 s for event logs smaller than 2000 events and process cubes with about 3 to 4 normally sparse analysis dimensions. This performance is respectable and makes the tool applicable to different processes. In addition, the main focus of the tool is to compare selected parts of the event log; thus, only small sections of the process cube will be unloaded for comparison in typical situations. Test 3 shows that the unloading time decreases when only a part of the cube is unloaded, which means that for 2000 events and 4 analysis dimensions, the average time of an operation will be far lower than 10 s. Furthermore, even if the entire cube is split into subcubes and all these subcubes are unloaded simultaneously, no performance penalty will occur, i.e., all subcubes will be processed within 10 s.
Chapter 7
Conclusions & Future Work
7.1 Summary of Contributions
This master thesis builds on the ideas presented in the PROCUBE project proposal [4]. The proposal suggests organizing event data from logs in process cubes in such a way that discovery, analysis and comparison of multiple processes become possible. The main goal of this master project was to build a framework to support process cube exploration. The goal was achieved by following a series of steps, which the thesis describes in detail.
We started by identifying the problem context. The role of business intelligence, and of process mining in particular, in the functionality and performance of enterprise information systems was investigated. Further, the reader was introduced to the business intelligence area, with emphasis on process mining and OLAP technologies. As concepts from both process mining and OLAP were repeatedly employed throughout the thesis, a formalization was given for all the related notions. The formalization of OLAP and of process-cube-related notions is one of the contributions of this thesis. Further elaboration and formalization of the process cube concept can be found in [6].
The next step in the project was to describe its central element, the process cube. Process cubes realize the link between the process mining framework and the existing OLAP technology. While process mining focuses on process analysis, OLAP technology is used for its built-in hypercube structures, which allow for operations like slice, dice, roll-up, drill-down and pivoting. As such, process cubes are defined by introducing event-related aspects into the formalization of OLAP cubes. Along with the process cube formalization, an example was presented to illustrate the process cube capabilities. This stage of the project was an important one, as it helped in establishing and clarifying the process cube functionality before its actual implementation.
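To make the OLAP operations named above concrete, the following sketch applies slice, dice and roll-up to a tiny cube whose cells hold events. It is an illustration only: the dimension names, members, event identifiers and function names are assumptions, and the dictionary-based cube does not reflect the formalization or the implementation used in the thesis.

```python
from collections import defaultdict

# Hypothetical two-dimensional cube: each cell maps to a list of events.
DIMS = ("city", "month")
cube = {
    ("Amsterdam", "2012-01"): ["e1", "e2"],
    ("Amsterdam", "2012-02"): ["e3"],
    ("Utrecht", "2012-01"): ["e4"],
    ("Utrecht", "2013-05"): ["e5"],
}


def slice_cube(cube, dim, member):
    """Fix one dimension to a single member and drop that dimension."""
    i = DIMS.index(dim)
    return {key[:i] + key[i + 1:]: events
            for key, events in cube.items() if key[i] == member}


def dice_cube(cube, allowed):
    """Keep cells whose members lie in the allowed sets; dimensions stay."""
    return {key: events for key, events in cube.items()
            if all(key[DIMS.index(d)] in members
                   for d, members in allowed.items())}


def roll_up(cube, dim, to_parent):
    """Merge cells by mapping each member of `dim` to its parent member."""
    i = DIMS.index(dim)
    result = defaultdict(list)
    for key, events in cube.items():
        parent_key = key[:i] + (to_parent(key[i]),) + key[i + 1:]
        result[parent_key].extend(events)
    return dict(result)


sliced = slice_cube(cube, "city", "Amsterdam")
diced = dice_cube(cube, {"city": {"Utrecht"}, "month": {"2012-01", "2013-05"}})
by_year = roll_up(cube, "month", lambda month: month[:4])
```

Here the slice removes the city dimension, the dice keeps both dimensions but fewer members, and the roll-up merges the monthly cells into yearly ones. Drill-down would be the inverse of the roll-up and requires keeping the finer-grained cells available.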
Since databases, OLAP and process mining tools already exist, we decided to reuse current technologies to save time. Choosing a framework for process mining was easy, as ProM is clearly the leading open source framework and expertise is readily available at TU/e. Selecting a suitable OLAP technology was not as straightforward, because the applied methods and principles vary considerably from one OLAP tool to another. In the end, we selected the Palo in-memory multidimensional OLAP tool. In-memory tools are known for their increased speed. Moreover, unlike relational databases, multidimensional databases already have the built-in multidimensional structure that is natural for OLAP cubes and therefore facilitates OLAP analysis. This technology is relatively new and is still undergoing many changes and improvements. Nevertheless, it is deemed to have a bright future, especially because of its current and envisioned performance benefits.
The main contribution of the thesis is a basic prototype supporting the notion of process cube in a process mining context, with the following functionality: XES event logs are introduced as data sources for OLAP applications; the OLAP process cube is created from event data; the cube can be visualized from different perspectives; one can “play” with the cube before starting the analysis, by applying different OLAP operations. One of the challenges we encountered after finishing the application was that MOLAP performance worsened with increasing sparsity of the loaded data. We were aware of the sparsity problem from the very beginning; however, we did not expect such poor performance results. One potential explanation is that we used an open source version of Palo from 2011, which might not include the latest performance improvements that can be found in the commercial tool. Moreover, sparsity is still an open issue for many multidimensional tools. Only Essbase is known to provide a solution to this problem at the moment, but it is not open source. We hope that Palo will also release a new version with the sparsity problem solved. In the meantime, we offered an interim solution to improve the performance for sparse data.
The solution we provided for dealing with sparsity was to replace the in-memory database with a hybrid structure that stores part of the event data in memory and the other part in a relational database. The advantage of such a strategy is that it reduces the number of dimensions in the cube and thus makes it less sparse. The limitation is that only a part of the event data can be used for filtering purposes. Furthermore, we reduced the number of elements per dimension by implementing the hierarchical feature for time data. By allowing time data to be stored in a hierarchical structure, the sparsity of some very sparse dimensions, such as the timestamp, is reduced considerably.
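The effect of a time hierarchy on the number of dimension members can be illustrated with a small sketch. The levels used here (year, month, day), the example timestamps and the function names are assumptions for illustration. The sketch compares only member counts and does not reproduce how the tool stores the hierarchy.

```python
from datetime import datetime

# Made-up event timestamps; every event has a distinct timestamp.
timestamps = [
    datetime(2012, 1, 3, 9, 15, 2),
    datetime(2012, 1, 3, 9, 15, 40),
    datetime(2012, 1, 17, 14, 0, 0),
    datetime(2012, 2, 3, 8, 30, 0),
    datetime(2012, 2, 17, 16, 45, 10),
]


def flat_members(stamps):
    """Members of a flat timestamp dimension: one per distinct timestamp."""
    return {s.isoformat() for s in stamps}


def hierarchy_members(stamps):
    """Members per level when the timestamp is split into assumed levels."""
    return {
        "year": {s.year for s in stamps},
        "month": {s.month for s in stamps},
        "day": {s.day for s in stamps},
    }


flat_count = len(flat_members(timestamps))
level_counts = {level: len(members)
                for level, members in hierarchy_members(timestamps).items()}
```

In this example the flat dimension has five members, one per event, whereas no level of the hierarchy has more than two. A flat timestamp dimension grows with the number of events, while the month and day levels are bounded by 12 and 31 members, respectively.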
Finally, we tested the PROCUBE system to determine its capabilities. The information stored in event logs is inherently multidimensional, and as such, efficient application of process mining tools requires multidimensional filtering of the event database. Multidimensional database technology, and in-memory database technology as a particular case, is developed for exactly that purpose. However, the performed tests show that event logs generally result in sparse multidimensional database structures, which incur severe performance penalties when unloading parts of the event log for further processing. The proposed hybridization of the database structure, i.e., keeping only strictly necessary dimensions in memory and the rest in a relational database, makes an efficient trade-off between the flexibility of the complete process cube and the responsiveness of the user interaction. Nevertheless, a complete understanding of the sparsity concept is required for efficient use of the developed tool, as only a limited number of dimensions, e.g., up to 4 for the WABO1 event log, can be used for on-line analysis.