The existing KDDM process models present a fragmented view of the KDDM process. In other words, the process models do not capture or highlight the important dependencies that exist in a typical KDDM process. By dependencies we mean the interrelationships between the various steps, or between the various phases and tasks (of the same and different phases), of a KDDM project. For instance, process models that structure a KDDM process as a sequence of steps do not clearly discuss how the steps are related to each other.
That a particular step is recommended for the beginning of a project and another towards the end suggests that the later step may depend on the earlier one; specifically, it may use the output of the earlier step directly or indirectly (by using the output of an intermediate step which in turn uses the output of the earlier step). However, these dependencies are not discussed in the process models. We discuss the same issue with respect to CRISP-DM, which presents a KDDM project in terms of phases and tasks (instead of steps, like the models discussed above), before proceeding to discuss the serious repercussions of not identifying the dependencies in a KDDM process.
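The distinction between direct and indirect dependencies can be made concrete with a small sketch. The step names and the edges between them are hypothetical and chosen only for illustration; they are not taken from any particular process model.

```python
# Hypothetical steps; each maps to the steps whose output it uses directly.
# These edges are illustrative assumptions, not part of any published model.
DIRECT_INPUTS = {
    "step_a": set(),
    "step_b": {"step_a"},
    "step_c": {"step_b"},
}


def all_dependencies(step, direct_inputs):
    """Return every step whose output `step` uses, directly or indirectly."""
    found = set()
    stack = list(direct_inputs.get(step, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            # Follow the chain: whatever `current` depends on is also needed.
            stack.extend(direct_inputs.get(current, ()))
    return found


direct = DIRECT_INPUTS["step_c"]
indirect = all_dependencies("step_c", DIRECT_INPUTS) - direct
print(sorted(direct))    # ['step_b']
print(sorted(indirect))  # ['step_a']
```

Here the final step uses the output of the first step only through the intermediate one, which is exactly the kind of relationship a purely sequential listing of steps leaves implicit.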
CRISP-DM structures a KDDM process in the form of phases and their corresponding tasks. The CRISP-DM process model is shown in Figure 1-2. The various phases described by the model include business understanding, data understanding, data preparation, modeling, evaluation and deployment.
Figure 2-3 CRISP Process Model (CRISP-DM, 2003)
The most obvious dependency in this model is the phase-phase dependency that results from the ordering of phases the model proposes. That the model recommends executing the business understanding phase ahead of the data understanding phase suggests that the data understanding phase uses the output of the business understanding phase. These dependencies are critical because the ordering cannot be reversed without detrimental effects, or without making a particular phase impossible to execute. Further, it is important to consider that a phase comprises various tasks. Therefore, the output of a phase comprises the outputs of the diverse array of tasks that lie within it.
Clearly, a task-level view of a process model should explicate and highlight these dependencies. They are not obvious from the phase-level view of the process model shown in Figure 2-3. It is relevant to note that task-task dependencies arise from interrelationships both between tasks of the same phase and between tasks of different phases of the model. Therefore, the task-level view of the process model should explicate both; in other words, it should represent a complete view of the KDDM process.
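The two kinds of task-task dependency described above can be sketched as follows. The task names are abbreviated from CRISP-DM task titles, and the dependency pairs are assumptions made for illustration; CRISP-DM itself does not state them in this form.

```python
# Phase membership for a few CRISP-DM tasks (names abbreviated).
PHASE_OF = {
    "determine_business_objectives": "business_understanding",
    "determine_data_mining_goals": "business_understanding",
    "collect_initial_data": "data_understanding",
    "describe_data": "data_understanding",
}

# (prerequisite, dependent) pairs. Illustrative assumptions only.
DEPENDENCIES = [
    ("determine_business_objectives", "determine_data_mining_goals"),
    ("determine_data_mining_goals", "collect_initial_data"),
    ("collect_initial_data", "describe_data"),
]


def classify(dependencies, phase_of):
    """Split task-task dependencies into same-phase and cross-phase groups."""
    same_phase, cross_phase = [], []
    for prerequisite, dependent in dependencies:
        if phase_of[prerequisite] == phase_of[dependent]:
            same_phase.append((prerequisite, dependent))
        else:
            cross_phase.append((prerequisite, dependent))
    return same_phase, cross_phase


same_phase, cross_phase = classify(DEPENDENCIES, PHASE_OF)
print(len(same_phase), "same-phase;", len(cross_phase), "cross-phase")
```

A complete task-level view, in the sense used here, would record both groups in one structure rather than describing each phase in isolation.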
[Figure 2-4 labels: phases Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation; tasks Determination of Business Objectives (Background, Business Objectives, Business Success Criteria) and Determination of Data Mining Goals (Data Mining Goals, Data Mining Success Criteria)]
Figure 2-4 CRISP-DM - partial view of Business Understanding Phase
In Figure 2-4 and Figure 2-5 we present the task-level view of the CRISP-DM process model for a subset of tasks belonging to the business understanding and data understanding phases. For the purpose of discussion, we present only a partial view of each phase in these figures. It can be seen that neither of the two types of dependencies highlighted above is obvious from these figures. The dependencies between the tasks of different phases are not captured at all, as each phase is described in a standalone manner. The dependencies between the tasks of the same phase are also not obvious from these figures.
Figure 2-5 CRISP-DM - partial view of Data Understanding Phase
CRISP-DM presents the remaining four phases in a similar manner and does not present an integrated process model that shows all the dependencies. It can be argued that this is only a presentation issue, as the documentation also describes the various tasks in detail. However, careful analysis of the documentation reveals that while some dependencies can be inferred from the (brief) descriptions of tasks, for example that a business objective can be translated into a data mining objective, the model makes no effort to explicate the large number of dependencies that exist in the KDDM process or to present them in the form of an integrated model.
Failing to explicate the various dependencies that exist in a KDDM project could lead to inefficient implementation of projects. For instance, an organization may embark on a particular task and realize that it cannot be completed; this could translate into unnecessary costs and overhead. Worse still, the task may be executed while disregarding the output of a relevant task that should have been carried out before it. An example of this situation is the selection of a modeling algorithm without first clearly setting the business objective. This is an important task-task dependency which, if neglected, can lead the project in a completely different direction than intended.
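Once dependencies are made explicit, the situation described above can be detected before a task is started. The following is a minimal sketch under stated assumptions: the prerequisite sets are hypothetical, and the helper names `missing_prerequisites` and `execution_order` are introduced here for illustration only. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# task -> set of tasks that must be completed first.
# These prerequisite sets are illustrative assumptions.
PREREQUISITES = {
    "determine_business_objectives": set(),
    "determine_data_mining_goals": {"determine_business_objectives"},
    "select_modeling_technique": {
        "determine_business_objectives",
        "determine_data_mining_goals",
    },
}


def missing_prerequisites(task, completed, prerequisites):
    """Return the prerequisites of `task` that have not been completed yet."""
    return prerequisites.get(task, set()) - set(completed)


def execution_order(prerequisites):
    """Return one task ordering in which every prerequisite comes first."""
    return list(TopologicalSorter(prerequisites).static_order())


# Attempting to select a modeling technique before anything else is done.
missing = missing_prerequisites("select_modeling_technique", [], PREREQUISITES)
print(sorted(missing))
print(execution_order(PREREQUISITES))
```

A non-empty result from `missing_prerequisites` signals that the task would run without the output it relies on, while `execution_order` gives a sequence that respects every recorded dependency.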
3. Fragmentation: a Hindrance to Building an Integrated Process Model and