5.5. Data Mining Process Patterns for Data Mining based Analysis Processes

5.5.2. Reuse with Data Mining Process Patterns

Data mining process patterns are designed to support the reuse of data mining processes. Figure 5.8 gives an example of how this reuse is supported. User A holds a workflow that solves a certain analysis problem. To support reuse, he creates a data mining process pattern from his workflow. This is done by abstracting the tasks that are not directly reusable according to the task hierarchy and by modelling the assumptions and prerequisites. As he is the only one who knows all assumptions and details of his workflow, he is the right person to perform the abstraction. Other users lack this detailed knowledge, so it is harder for them to collect the correct assumptions and to abstract the tasks. User B, who wants to reuse the solution of user A, takes the pattern, checks the prerequisites and assumptions, and creates a workflow by specializing the abstract tasks according to his specific needs.

Figure 5.8.: Procedure of reuse with data mining process patterns.

In the following, we give details on creating process patterns (the responsibility of user A) and on applying process patterns (the responsibility of user B).

Creating Process Patterns

Data mining process patterns are created by abstracting parts of an existing data mining workflow. Similar to the specialization presented in Section 5.5.1, the abstraction is done according to the task hierarchy presented there. Executable tasks are abstracted to configurable tasks by explicitly modelling the parameters of the underlying component, script or service, e.g. the number of clusters for a clustering component. This means that the task is reusable, but the parametrization is not. Abstraction to a structural task is done by defining the order of tasks in the process while leaving out information about the details of the tasks. For example, it can be modelled that a quality control task is necessary before a clustering task, or that a data normalization task has to be performed at a certain step of the process, but the actual tasks are not bound to any components, scripts or services. This means that the connection and order of the tasks are reusable, but the components, scripts or services are not. Abstraction to a conceptual task is done by textually describing what needs to be done at a certain step in the process; here, the components, scripts or services, including their connections, are not reusable. For example, a component could be usable just for a certain data type.

| Task              | User Level                 | Integration Level |
|-------------------|----------------------------|-------------------|
| use component     | configurable or executable | -                 |
| develop component | conceptual                 | configurable      |
| use script        | configurable or executable | -                 |
| develop script    | conceptual or structural   | configurable      |
| use workflow      | configurable or executable | -                 |
| develop workflow  | structural                 | structural        |

Table 5.1.: Task levels from user’s and technical point of view.
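The abstraction steps described above can be sketched in code. The following is a minimal illustration only, not the implementation used in this work: the class names `Level` and `Task`, the function `abstract` and the component name `kmeans` are hypothetical and chosen for the example.

```python
from dataclasses import dataclass, field, replace
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    """Levels of the task hierarchy, ordered from most to least abstract."""
    CONCEPTUAL = 0
    STRUCTURAL = 1
    CONFIGURABLE = 2
    EXECUTABLE = 3


@dataclass
class Task:
    """Hypothetical task record; the field names are assumptions."""
    name: str
    level: Level
    description: str = ""
    component: Optional[str] = None  # bound component, script or service
    parameters: dict = field(default_factory=dict)


def abstract(task: Task) -> Task:
    """Lift a task one level up, dropping what is not reusable at that level."""
    if task.level == Level.EXECUTABLE:
        # The component stays bound; its parameters are modelled but left open.
        open_params = {key: None for key in task.parameters}
        return replace(task, level=Level.CONFIGURABLE, parameters=open_params)
    if task.level == Level.CONFIGURABLE:
        # Only the task's place in the process is kept; the binding is removed.
        return replace(task, level=Level.STRUCTURAL, component=None, parameters={})
    if task.level == Level.STRUCTURAL:
        # Only the textual description of what has to be done remains.
        return replace(task, level=Level.CONCEPTUAL)
    return task  # conceptual tasks cannot be abstracted further


clustering = Task("clustering", Level.EXECUTABLE,
                  description="group samples into clusters",
                  component="kmeans", parameters={"n_clusters": 5})
configurable = abstract(clustering)   # component kept, n_clusters left open
structural = abstract(configurable)   # no component bound any more
conceptual = abstract(structural)     # only the description is left
```

In this sketch the order of tasks, which is what a structural task preserves, would be held by the surrounding pattern (for example a list of `Task` objects) rather than by the task itself.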

The choice of levels in the task hierarchy presented in Section 5.5.1 can be justified by mapping them to the capabilities of users described in Section 5.2. We distinguish between the user’s point of view and the technical point of view of integrating newly developed components, scripts and workflows. Tasks that involve using existing components, scripts and workflows are always configurable tasks if their parameters still have to be set, or executable tasks if the parameters are already specified. For the development of components, scripts or workflows, this is different. Although the task of creating a component is at the conceptual level, from the technical point of view the integration of the component is a configurable task, as the component needs to be described by metadata passed as a parameter to a service that grid-enables the component (see our contribution from Section 3.3). This is similar for scripts. The task of creating a script is at the conceptual or structural level, but from the technical point of view of integrating the script it is a configurable task, as the script can be passed as a parameter to a single service (see our contribution from Section 4.4). Thus, with the solutions presented in Chapters 3 and 4, the complexity of the integration is kept low and requires less knowledge from the users. The development of workflows remains at the structural level. Table 5.1 presents the task levels from the user’s and from the technical point of view.
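For tool support, the mapping of Table 5.1 could be held as a simple lookup. The sketch below is only one possible encoding; the names `TASK_LEVELS`, `user_level_for_use` and `integration_level` are hypothetical, and `None` stands for the "-" entries, i.e. no separate integration step.

```python
# (activity, artifact) -> (possible user levels, integration level)
# Values follow Table 5.1; None means that no integration step is needed.
TASK_LEVELS = {
    ("use", "component"): (("configurable", "executable"), None),
    ("develop", "component"): (("conceptual",), "configurable"),
    ("use", "script"): (("configurable", "executable"), None),
    ("develop", "script"): (("conceptual", "structural"), "configurable"),
    ("use", "workflow"): (("configurable", "executable"), None),
    ("develop", "workflow"): (("structural",), "structural"),
}


def user_level_for_use(parameters_specified: bool) -> str:
    """Using an existing artifact is executable once all parameters are set."""
    return "executable" if parameters_specified else "configurable"


def integration_level(activity: str, artifact: str):
    """Return the technical integration level, or None if there is none."""
    return TASK_LEVELS[(activity, artifact)][1]


script_integration = integration_level("develop", "script")
reuse_level = user_level_for_use(parameters_specified=False)
```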

Process patterns can also be created based on information from a data mining paper [148]. A data mining solution is often worked through in detail when a publication about it is written. This work can be reused for generating a data mining process pattern. Papers on data mining solutions contain a lot of information on requirements, approaches, related work, literature, examples, configurations, pseudo-code, results, summaries, etc. Much of this information can be used for creating a data mining process pattern. Information that is not useful for the creation of a process pattern has to be ignored. This includes, e.g., information on related work, literature and examples, as this information is related to other patterns. In addition, summaries and other redundancies have to be left out, as process patterns follow a more formal, structured approach. Experiments and results are also not important for creating a process pattern, as the process pattern focuses only on the process of the data mining solution. The remaining parts of the paper, which hold the information useful for the pattern, can be transformed into tasks of the data mining process pattern depending on their relation to the CRISP phases as well as on how precisely they are described with respect to the process pattern. Hence, the structure of a data mining process pattern is based on the information contained in the paper and on the level of abstraction at which it is presented. Figure 5.9 visualizes the approach.

Figure 5.9.: Mapping the information of a paper to the generic CRISP pattern.

The approach of creating a process pattern out of information contained in a data mining paper can be summarized as follows:

• Remove information on related work, literature, examples, results, summaries and other redundancies.

• Transform content-level descriptions of tasks and requirements into conceptual tasks.

• Transform detailed descriptions of tasks and requirements, instructions and configurations into structural or configurable tasks.

• Transform code and pseudo code into structural, configurable or executable tasks.

• Use figures and use-case diagrams for the arrangement of structural tasks, e.g. lanes, pools and groups of tasks.
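The summarized rules can be illustrated with a small filter-and-map routine. This is a hedged sketch under several assumptions: the paper is taken to be already split by hand into fragments of the form (kind, CRISP phase, text), and the kind labels, the names `DISCARDED`, `CANDIDATE_LEVELS` and `paper_to_pattern_tasks`, and the example fragments are all hypothetical. The use of figures for arranging structural tasks is left out.

```python
# Kinds of paper content that do not contribute to a process pattern.
DISCARDED = {"related_work", "literature", "example", "result", "summary"}

# Candidate task levels per kind of content, following the rules above.
CANDIDATE_LEVELS = {
    "description": ("conceptual",),
    "detailed_description": ("structural", "configurable"),
    "instruction": ("structural", "configurable"),
    "configuration": ("structural", "configurable"),
    "code": ("structural", "configurable", "executable"),
    "pseudo_code": ("structural", "configurable", "executable"),
}


def paper_to_pattern_tasks(fragments):
    """Map (kind, crisp_phase, text) fragments to candidate pattern tasks."""
    tasks = []
    for kind, phase, text in fragments:
        if kind in DISCARDED:
            continue  # not part of the process itself
        if kind not in CANDIDATE_LEVELS:
            raise ValueError(f"unknown fragment kind: {kind!r}")
        tasks.append({
            "phase": phase,
            "candidate_levels": CANDIDATE_LEVELS[kind],
            "text": text,
        })
    return tasks


fragments = [
    ("related_work", "modeling", "Comparison with earlier approaches."),
    ("description", "data preparation", "Normalize the input data."),
    ("pseudo_code", "modeling", "for each sample: assign nearest cluster"),
]
pattern_tasks = paper_to_pattern_tasks(fragments)
```

The routine returns only candidate levels; which level a task finally gets still depends on how precisely the paper describes it, which remains a manual decision.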

Examples of process patterns will be given in Sections 5.7, 5.8 and 5.9.

Applying Process Patterns

In the previous paragraph, we presented details on the abstraction of data mining based analysis processes for a given problem to data mining process patterns. Now, we focus on the specialization of data mining process patterns to new analysis processes.

We describe the steps needed to use a pattern for a given analysis process in a meta-process for applying process patterns. Figure 5.10 visualizes the meta-process and its steps. In detail, the process consists of the following steps:

Figure 5.10.: The meta-process for applying a process pattern.

• Determine Business Objectives: The first task is to define the business objectives of the application.

• Select Pattern with matching Data Mining Goal: After that, a data mining pattern with a data mining goal is selected that addresses these business objectives.

• Specify pattern tasks to the lowest level: The tasks of the selected pattern are specified to the executable level according to the task hierarchy. The concretion of a process pattern is not unique, and it is not guaranteed that the concretion of a pattern to the executable level is always possible. As the process pattern becomes more specialized with each concretion, it is possible either to specify all tasks to the executable level or to detect, in finitely many steps, that not all tasks can be specified to this level.

• is pattern specified as executable?: A process pattern is specified as executable if all of its tasks are described at the executable level. If it is observed that the process pattern cannot be specified as executable, the meta-process steps back to the task of choosing a new process pattern. If the process pattern is executable, the meta-process proceeds.

• Deploy process pattern into a process: Assuming that an adequate process environment exists, the process pattern is deployed into an executable process. As the pattern is described as executable, it includes the necessary information for creating an executable process or workflow. However, it is not guaranteed that the process or workflow is really executable in the execution environment, as checks for correctness of components, connections of components, types etc. are not possible beforehand.

• Run process: The process is executed in the process environment and performs the analysis. The result is either the result of the analysis or an error.

• is the result ok?: If the user judges the result to be satisfactory, the meta-process is finished. If it is not satisfactory or if an error occurred, the meta-process steps back to the task of finding a new specification.

The meta-process of applying a process pattern has a set of data mining process patterns as input and an executable process as output. Although each individual step of the meta-process can be completed in finitely many steps, it is not guaranteed that the meta-process terminates, as the concretion of pattern tasks can result in various solutions.
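The control flow of the meta-process can be sketched as a nested loop over patterns and their specializations. This is an illustrative sketch, not a tool from this work: all function and parameter names are hypothetical, the user-driven steps are passed in as callables, and the `max_attempts` bound is an added assumption that forces termination, which the meta-process itself does not guarantee.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def apply_process_pattern(
    patterns: Iterable[Any],
    matches_goal: Callable[[Any], bool],
    specializations: Callable[[Any], Iterable[Any]],
    deploy_and_run: Callable[[Any], Any],
    result_ok: Callable[[Any], bool],
    max_attempts: int = 100,
) -> Optional[Tuple[Any, Any]]:
    """Return (executable process, result), or None if nothing suitable is found."""
    attempts = 0
    for pattern in patterns:
        # Select a pattern whose data mining goal addresses the objectives.
        if not matches_goal(pattern):
            continue
        # Each item is assumed to be one concretion down to the executable
        # level; an empty iterable means the pattern cannot be made executable.
        for process in specializations(pattern):
            attempts += 1
            if attempts > max_attempts:
                return None  # assumed bound, not part of the meta-process
            try:
                result = deploy_and_run(process)
            except Exception:
                continue  # an error occurred: try a new specification
            if result_ok(result):
                return process, result
        # No specification was satisfactory: choose a new pattern.
    return None


def run(process):
    """Toy execution step: one specification fails, the other succeeds."""
    if process["k"] == 2:
        raise RuntimeError("component type mismatch")
    return process["k"] * 10


outcome = apply_process_pattern(
    patterns=["classification", "clustering"],
    matches_goal=lambda pattern: pattern == "clustering",
    specializations=lambda pattern: [{"k": 2}, {"k": 3}],
    deploy_and_run=run,
    result_ok=lambda result: result == 30,
)
```

In the toy usage, the first specification fails at run time and the second yields a satisfactory result, so the loop stops there. In practice, `matches_goal`, `specializations` and `result_ok` stand for manual decisions by the user.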

The steps of the meta-process still include a lot of manual work. However, tool-based support for applying process patterns is possible and might be implemented in the future. Such tools would guide the user through the steps of selecting patterns and specifying their tasks, helping to handle the different abstraction levels of the task hierarchy and to interface with data mining tools.

In document Integration of Data Mining into Scientific Data Analysis Processes (Page 157-162)