Survey of Existing KDDM Process Models - An Integrated Knowledge Discovery and Data Mining Process Model

In this section we discuss five leading KDDM process models that have been proposed in the extant literature. These include a nine-step model proposed by Fayyad, Piatetsky-Shapiro et al. (1996a); a five-step model proposed by Cabena et al. (1998); a six-step model proposed by Cios et al. (2000); and a multi-step model in the form of CRISP-DM (2003). We also discuss the model proposed by Berry and Linoff (1997), authors of the book ‘Data Mining Techniques for Marketing, Sales and Customer Relationship Management’, who did some early work in this area. Of these models, CRISP-DM was proposed in the practitioner literature, while all the other models were proposed in the academic literature.

Fayyad, Piatetsky-Shapiro et al. (1996a)

The KDDM model of Fayyad et al. (1996a) consists of nine steps, which are outlined below.

1. Developing an understanding of the application domain: This step includes learning the relevant prior knowledge and the goals of the end user of the discovered knowledge.

2. Creating a target data set: Here the data miner selects a subset of variables (attributes) and data points (examples) that will be used to perform discovery tasks. This step usually includes querying the existing data to select the desired subset.

3. Data cleaning and preprocessing: This step consists of removing outliers, dealing with noise and missing values in the data, and accounting for time sequence information and known changes.

4. Data reduction and projection: This step consists of finding useful attributes by applying dimension reduction and transformation methods, and finding an invariant representation of the data.

5. Choosing the data mining task: Here the data miner matches the goals defined in Step 1 with a particular DM method, such as classification, regression, clustering, etc.

6. Choosing the data mining algorithm: The data miner selects methods to search for patterns in the data and decides which models and parameters of the methods used may be appropriate.

7. Data mining: This step generates patterns in a particular representational form, such as classification rules, decision trees, regression models, trends, etc.

8. Interpreting mined patterns: Here the analyst performs visualization of the extracted patterns and models, and visualization of the data based on the extracted models.

9. Using the discovered knowledge: This step consists of incorporating the discovered knowledge into the performance system, and documenting and reporting it to the interested parties. This step may also include checking and resolving potential conflicts with previously believed knowledge.
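The selection, cleaning, reduction and mining steps above can be illustrated with a minimal sketch. It is not taken from Fayyad et al.; the toy records, the attribute names, the median-absolute-deviation outlier test and the single-threshold rule are all assumptions chosen only to keep the example short.

```python
from statistics import median

# Hypothetical toy records; attribute names and values are illustrative only.
# None marks a missing value.
RAW = [
    {"customer_id": 1, "age": 22, "income": 20, "region": "N", "buys": 0},
    {"customer_id": 2, "age": 25, "income": 24, "region": "N", "buys": 0},
    {"customer_id": 3, "age": 31, "income": None, "region": "N", "buys": 0},
    {"customer_id": 4, "age": 38, "income": 41, "region": "N", "buys": 1},
    {"customer_id": 5, "age": 45, "income": 52, "region": "N", "buys": 1},
    {"customer_id": 6, "age": 52, "income": 58, "region": "N", "buys": 1},
    {"customer_id": 7, "age": 29, "income": 900, "region": "N", "buys": 0},
    {"customer_id": 8, "age": 35, "income": 30, "region": "N", "buys": 0},
]


def select_target_data(records, attributes):
    """Target data selection: keep only the chosen attributes of each record."""
    return [{a: r[a] for a in attributes} for r in records]


def clean(records, numeric_attr, k=3.0):
    """Step 3: drop records with missing values, then drop outliers.

    A value is treated as an outlier when it lies more than k median
    absolute deviations from the median (an assumed, simple criterion).
    """
    complete = [r for r in records if all(v is not None for v in r.values())]
    values = [r[numeric_attr] for r in complete]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [r for r in complete if abs(r[numeric_attr] - med) <= k * mad]


def reduce_attributes(records, label):
    """Step 4: drop attributes that take a single value in the data."""
    keep = [a for a in records[0]
            if a == label or len({r[a] for r in records}) > 1]
    return [{a: r[a] for a in keep} for r in records]


def mine_threshold_rule(records, attr, label):
    """Step 7: find the threshold t that makes the rule
    'label = 1 if attr > t' most accurate on the given records."""
    values = sorted({r[attr] for r in records})
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]

    def accuracy(t):
        hits = sum((r[attr] > t) == bool(r[label]) for r in records)
        return hits / len(records)

    best = max(candidates, key=accuracy)
    return best, accuracy(best)


target = select_target_data(RAW, ["age", "income", "region", "buys"])
cleaned = clean(target, "income")
reduced = reduce_attributes(cleaned, "buys")
threshold, acc = mine_threshold_rule(reduced, "income", "buys")
print(f"rule: buys = 1 if income > {threshold} (accuracy {acc:.2f})")
```

On this toy data the sketch drops one incomplete record and one outlier, removes the constant attribute, and produces a single classification rule; interpreting that rule and acting on it correspond to the final two steps and are left to the analyst.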

Berry and Linoff (1997)

Berry and Linoff (1997) presented a four-step methodology consisting of the following steps: Identifying the Problem, Analyzing the Problem, Taking Action, and Measuring the Outcome. They also specify the following 11 steps to further describe their proposed approach.

1. Translate the business problem into a data mining problem.

2. Select appropriate data.

3. Get to know the data.

4. Create a model set.

5. Fix problems with the data.

6. Transform data to bring information to the surface.

7. Build models.

8. Assess models.

9. Deploy models.

10. Assess results.

Cabena et al. (1998)

Step 1: Business Objectives Determination: This step involves clearly defining the business problem or challenge. The minimum requirements are a perceived business problem or opportunity and some level of executive sponsorship. This step in the process is also the time at which to start setting expectations.

Step 2: Data Preparation: Cabena et al. (1998) note that data preparation is the most resource-consuming step in the process, typically requiring up to 60% of the effort of the entire project. This step comprises three sub-tasks:

1. Data Selection: Identify all internal or external sources of information and select which subset of the data is needed for the data mining application.

2. Data Preprocessing: Study the quality of the data to pave the way for further analysis and to determine the kind of mining operation that will be possible and worth performing.

3. Data Transformation: During data transformation, the preprocessed data is transformed to produce the analytical data model. The analytical data model is an informational data model, and it represents a consolidated, integrated, and time-dependent restructuring of the data selected and preprocessed from the various operational and external sources. This is a crucial phase as the accuracy [...] structure and present the input.

Step 3: Data Mining: This is the step in which the actual data mining takes place. The objective is to apply the selected data mining algorithm or algorithms to the preprocessed data. The actual details of the data mining step will vary with the kind of application that is under development. The authors give the example that, while in the case of database segmentation one or two runs of the algorithm may be sufficient, the development of a predictive model will be a cyclical process in which the models are repeatedly trained and retrained on sample data before being tested against the real database.

Step 4: Analysis of Results: According to this process model, the analysis of results is inseparable from the data mining step, in that the two are typically linked in an interactive process. The specific activities in this step depend very much on the kind of application that is being developed. However, the main objective remains the same, that is, to interpret and evaluate the output from the data mining step.

Step 5: Assimilation of Knowledge: This step closes the loop, which was opened when the business objectives were set at the beginning of the process. The objective now is to put into action the commitments made in the opening step, according to the new, valid and actionable information from the previous process steps. The two main challenges in this step are: to present the new findings in a convincing, [...] best exploited.

CRISP-DM (2003)

CRISP-DM (an acronym for CRoss-Industry Standard Process for Data Mining) is an industry-neutral, tool-neutral data mining process model that was conceived in late 1996 by three leaders of the then immature data mining market: Daimler (then Daimler-Benz), SPSS (then ISL) and NCR. At the time, Daimler was ahead of other industrial and commercial organizations, as it had already gained experience in data mining by applying it to its business operations. SPSS too had data mining experience, owing to the data mining services it had been providing since the 1990s. It was also the first vendor to launch a commercial data mining workbench, called ‘Clementine’, in 1994. NCR likewise brought data mining expertise, gained by offering data mining services through its teams of consultants and technology specialists in order to deliver added value to its Teradata data warehouse customers.

In 1997, a consortium was formed with the goal of formalizing, in the form of a process model, the experience of the various real-world organizations that had been practicing data mining. One of the prime characteristics of this project was the focus on creating a non-proprietary and freely available model that would assist in the execution of data mining projects.

The CRISP-DM process model comprises six different phases, namely, business understanding, data understanding, data preparation, modeling, evaluation and deployment (Figure 2-1). It also describes the tasks and activities that need to be carried out in each of these phases (Table 2-1).

Different phases of the CRISP-DM process model

Phase 1 - Business understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Phase 2 - Data understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.

Phase 3 - Data preparation: The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modeling tools.

Phase 4 - Modeling: In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. The CRISP-DM documentation points out that typically, there are several techniques for the same data mining problem type, and that because some techniques have specific requirements on the form of the data, stepping back to the data preparation phase is often necessary.

Phase 5 - Evaluation: This phase of the project consists of thoroughly evaluating the model and reviewing the steps executed to construct it, to be certain that it properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Phase 6 - Deployment: Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. According to the CRISP-DM process model, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

Feedback Loops Described in the CRISP-DM Process Model

The CRISP-DM process model also describes various feedback loops to emphasize how certain phases should be revisited to leverage the new information or knowledge gained in the phases succeeding them. These have also been highlighted in Figure 2-1. For instance, while Data Preparation typically precedes Modeling, there may be a need to revisit Data Preparation, as a chosen Modeling technique may require data to be prepared in a certain way.
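One way to make the feedback loops concrete is to encode the phases and the permitted moves between them as a small transition table. The sketch below is illustrative only: the forward order follows the six phases named in the text and the Modeling to Data Preparation loop is the one described above, while the other two backward moves are assumptions based on the arrows commonly shown in reproductions of the CRISP-DM diagram.

```python
# The six CRISP-DM phases in their nominal order (from the text).
PHASES = [
    "business understanding",
    "data understanding",
    "data preparation",
    "modeling",
    "evaluation",
    "deployment",
]

# Permitted moves between phases. Only "modeling" -> "data preparation"
# is stated in the text as a feedback loop; the other backward moves
# are assumptions about the arrows in the diagram.
ALLOWED = {
    "business understanding": {"data understanding"},
    "data understanding": {"business understanding", "data preparation"},
    "data preparation": {"modeling"},
    "modeling": {"data preparation", "evaluation"},
    "evaluation": {"business understanding", "deployment"},
    "deployment": set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is a permitted move."""
    return all(b in ALLOWED[a] for a, b in zip(path, path[1:]))


def feedback_loops():
    """Return the permitted moves that go back to an earlier phase."""
    order = {phase: i for i, phase in enumerate(PHASES)}
    return sorted(
        (a, b)
        for a, targets in ALLOWED.items()
        for b in targets
        if order[b] < order[a]
    )


# A project that revisits data preparation once after a first modeling pass.
project = [
    "business understanding",
    "data understanding",
    "data preparation",
    "modeling",
    "data preparation",
    "modeling",
    "evaluation",
    "deployment",
]
print(is_valid_path(project))
print(feedback_loops())
```

Representing the model this way separates the nominal phase order from the permitted revisits, so a project plan can be checked against the process model without treating the phases as a strict waterfall.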

Figure 2-1 CRISP Process Model (CRISP-DM, 2003)

CRISP-DM is the most detailed of the existing KDDM models. The documentation associated with CRISP-DM v1.0 is divided into two parts. The first part provides a description of the reference model, its phases, general tasks and outputs. The second part, called the user guide, aims to provide detailed guidance about how to perform the activities associated with each task.

The user guide portion of the CRISP-DM methodology (CRISP-DM 2003) aims to provide detailed advice about “how” to execute DM activities. That is, the user guide is expected to provide tools for implementing the vast number of activities suggested in the process model. However, analysis of the user guide reveals that it does not meet its [...] accomplish the tasks associated with each phase. Tool support is provided for only two of the twenty-four tasks mentioned in the model, and it appears that even these are not sufficient for efficiently executing the corresponding tasks. These are described below:

1. Tool support for the task of selecting modeling techniques (modeling phase)

The CRISP-DM v1.0 documentation (CRISP-DM 2003) includes some support for the modeling phase by providing a list of modeling techniques relevant to various types of data mining problems. However, it does not provide any support for the selection of appropriate techniques. Clearly, the list of techniques enumerated in the process model could be narrowed down further using output from previous tasks, such as business objectives and data mining objectives, but this is not considered by the process model.

2. Tool support for the task of identifying divisions and managers’ names and responsibilities (business understanding phase)

Analysis of the foundational business understanding phase reveals the use of just one tool - an organizational chart, to “identify divisions, manager’s names and responsibilities etc”. Clearly, organizations also need support for the diverse array of other activities associated with this important phase. Besides, the usefulness of

Table 2-1: Phases, Tasks and Outputs - CRISP-DM process model

Business understanding
1. Determine Business Objectives: Background; Business Objectives; Business Success Criteria
2. Assess Situation: Inventory of Resources; Requirements, Assumptions and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
3. Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria
4. Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data understanding
1. Collect Initial Data: Initial Data Collection Report
2. Describe Data: Data Description Report
3. Explore Data: Data Exploration Report
4. Verify Data Quality: Data Quality Report

Data preparation
1. Select Data: Rationale for Inclusion/Exclusion
2. Clean Data: Data Cleaning Report
3. Construct Data: Derived Attributes; Generated Records
4. Integrate Data: Merged Data
5. Format Data: Reformatted Data

Modeling
1. Select Modeling Technique: Modeling Technique; Modeling Assumptions
2. Generate Test Design: Test Design
3. Build Model: Parameter Settings; Model; Model Description
4. Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
1. Evaluate Results: Assessment of Data Mining Results with respect to Business Success Criteria; Approved Models
2. Review Process: Review of Process
3. Determine Next Steps: List of Possible Actions; Decision

Deployment
1. Plan Deployment: Deployment Plan
2. Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
3. Produce Final Report: Final Report; Final Presentation
4. Review Project: Experience Documentation
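The phase and task names in Table 2-1 can be held in a simple mapping, which makes the task counts easy to check. This is only an illustrative encoding: the dictionary layout and variable names are assumptions, and the figure of two tool-supported tasks is the one stated in the text rather than something derived from the table.

```python
# Phases and tasks as listed in Table 2-1 (outputs omitted for brevity).
CRISP_DM_TASKS = {
    "Business understanding": [
        "Determine Business Objectives",
        "Assess Situation",
        "Determine Data Mining Goals",
        "Produce Project Plan",
    ],
    "Data understanding": [
        "Collect Initial Data",
        "Describe Data",
        "Explore Data",
        "Verify Data Quality",
    ],
    "Data preparation": [
        "Select Data",
        "Clean Data",
        "Construct Data",
        "Integrate Data",
        "Format Data",
    ],
    "Modeling": [
        "Select Modeling Technique",
        "Generate Test Design",
        "Build Model",
        "Assess Model",
    ],
    "Evaluation": [
        "Evaluate Results",
        "Review Process",
        "Determine Next Steps",
    ],
    "Deployment": [
        "Plan Deployment",
        "Plan Monitoring and Maintenance",
        "Produce Final Report",
        "Review Project",
    ],
}

# Number of tasks per phase and in total.
tasks_per_phase = {phase: len(tasks) for phase, tasks in CRISP_DM_TASKS.items()}
total_tasks = sum(tasks_per_phase.values())

# Number of tasks with tool support, as stated in the text (not derived here).
TOOL_SUPPORTED_TASKS = 2
share = TOOL_SUPPORTED_TASKS / total_tasks

print(tasks_per_phase)
print(f"{total_tasks} tasks in total; tool support for {share:.1%} of them")
```

The total agrees with the twenty-four tasks mentioned above, so tool support for two tasks covers roughly one task in twelve.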

Cios and Kurgan (2005)

The process model proposed by Cios and Kurgan (2005) is shown in Figure 2-2.
