The term business refers to any activity carried out in a company in the most general sense, whatever the nature and aim of that activity (commercial, governmental, educational, ...). Data mining is one of the technologies that make it possible to implement Business Intelligence solutions [6] (“a fairly new term that incorporates a broad variety of processes and technologies to harvest and analyze specific information to help a business make sound decisions”). In fact, any business intelligence solution should include a data mining project to extract “the intelligence” of the business, which will then be deployed accordingly. However, the truth is that data mining projects are currently developed more as an art than as an engineering process, and this approach does not properly meet real business needs. Companies need to manage projects in the most controlled way, always trying to reduce risks without increasing costs. As there is no proper methodology for data mining projects, several different practices from different areas are applied. This leads to projects that fail or that produce poor results, or at least results that are not as good as they could be.
The need for a proper method to manage data mining projects is thus clear. This method should allow managers to identify tasks and subtasks, roles, risks and milestones, and to estimate costs and benefits, taking into account the cross-dependent nature of all these elements. The efforts towards a methodology presented in Sect. 2 have produced a “good manners” guide as well as a definition of the technical activities to be carried out in the process of knowledge discovery. However, no clear result regarding the management of the global process has been obtained to date. A poor understanding of the project development process, due mainly to the lack of a methodology, results in higher risk levels, greater effort and higher costs for each task. Methodologies make the development team concentrate their efforts on the tasks to be developed and clearly define the roles and assignments of each participant, making organization and project development a lot easier. Consequently, prior to defining the methodology, the terms we are referring to have to be clearly understood and defined.
A project has been defined as any piece of work that is undertaken or attempted. Consequently, project management involves “the application of knowledge, skills, tools and techniques to a broad range of activities to meet the requirements of the particular project” [3]. Project management is needed to organize the development process and to produce a project plan. It establishes the way the process is going to be developed (life cycle) and how it will be split into phases and tasks (process model). The project definition [23] describes the common understanding of the project, its extent and nature, among the key people involved. Thus, any data mining project needs to be defined so as to state the parties, goals, data and human resources, tasks, schedules and expected results that constitute the foundation upon which a successful project will be built. In general, any engineering project iterates through the following stages between inception and implementation: Justification and Motivation of the project, Planning and Business Analysis, Design, Construction, and Deployment. This approach has been applied successfully in software engineering. Although a data mining project has components similar to those found in IT, its nature is different, and some concepts need to be modified in order to be integrated.
In fact, the proposal that we make here is inspired by concepts from RUP, taking as technical tasks the ones defined in the CRISP-DM process model.
For a proper definition of the methodology, the phases that will guide the project have to be defined. Phases will have iterations with intermediate products, and the end of a phase will lead to a deliverable. The set of activities (in our case from CRISP-DM) and the effort dedicated to them in each project phase will have to be defined. Depending on the activities involved in each phase, the roles of the team and the associated effort will also have to be defined.
Figure 1 depicts the development phases in the proposed methodology for data mining projects. The X axis represents the phases while the Y axis represents the processes involved. For each phase, the effort dedicated to each activity of the process model is represented. More than one iteration can occur in each phase, and each iteration may lead to an intermediate product. A phase ends with deliverables as its result. Intermediate products and deliverables help to establish milestones in the project development plan.
Fig. 1. Phases of a data mining methodology
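The relationship between phases, iterations, intermediate products, deliverables and milestones can be sketched as a small data structure. This is only an illustrative sketch, not part of the proposed methodology: the class names `Iteration` and `Phase`, the helper `project_milestones`, and the example products ("draft business goals", "project plan") are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Iteration:
    """One pass through a phase; it may yield an intermediate product."""
    name: str
    intermediate_product: Optional[str] = None


@dataclass
class Phase:
    """A project phase: zero or more iterations, closed by deliverables."""
    name: str
    iterations: List[Iteration] = field(default_factory=list)
    deliverables: List[str] = field(default_factory=list)

    def milestones(self) -> List[str]:
        # Intermediate products come first, then the deliverables
        # that mark the end of the phase.
        products = [it.intermediate_product for it in self.iterations
                    if it.intermediate_product is not None]
        return products + self.deliverables


def project_milestones(phases: List[Phase]) -> List[str]:
    """Collect the milestones of all phases, in phase order."""
    return [f"{p.name}: {m}" for p in phases for m in p.milestones()]


# Hypothetical content, invented for illustration only.
conception = Phase(
    name="Project Conception",
    iterations=[Iteration("iteration 1", "draft business goals"),
                Iteration("iteration 2")],  # no intermediate product here
    deliverables=["project plan"],
)

print(project_milestones([conception]))
```

Treating both intermediate products and deliverables as milestone candidates follows the description above; how a real plan would weight or schedule them is left open.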
In defining the proposed phases, the different stages of the RUP methodology have been taken into account. One of the main goals of the proposed methodology is to establish the activities to be carried out, as well as their timing, so that the project ends successfully while flexibility in the process is preserved.
The proposed phases are briefly described below:
• Project Conception establishes the main topics of the project. In order to develop a proper project plan, information about business goals, data sources, risks and contingency plans, costs and benefits, estimations and schedules, resources and results needs to be gathered. A complete project plan is fundamental to a successful data mining project because the information it contains will be used throughout the project’s life cycle. Figure 1 depicts the main activities in this phase, Business Understanding, Data Understanding and Data Preparation, which help data miners define business goals and data sources. This definition is the basis for developing a data mining project and will be used in later phases.
• Data And Tasks Conception. The Data Model defines all the data sources and the extraction, transformation, loading and integration processes involved in a data mining project. In order to define them formally, a metamodel must be defined and used. The Task Model defines all the data mining tasks to be done in the project. The approach here is that a task model is first defined in terms of types of problems (e.g. clustering instead of K-means, association instead of Apriori, ...) and then refined over several iterations by a data mining expert. The Task Model uses the Data Model to establish the data involved in each data mining task. Considering these models, the main activities involved are Data Understanding and Preprocessing and Modelling.
• Data Mining Modelling applies data mining techniques to the tasks defined in the task model and chooses data sources from the data model. At this point, the task model is refined from problem types to algorithms. Some procedure to measure how well the candidate algorithms satisfy each task is needed.
• Results establishes visualization and evaluation in terms of business goals and customer satisfaction. Thus, some mechanisms must be defined in order to adapt the solution to customer needs. Sometimes this phase entails a software project to deploy the knowledge obtained. This is a very important phase because the customer must validate the results of the project.
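The interplay between the Data Model and the Task Model described in the phases above can be illustrated with a minimal sketch. It assumes a task is first stated as a problem type, references data sources by name, and is later refined to a concrete algorithm by some scoring procedure. The names `Task`, `missing_sources` and `refine`, the data sources, the extra candidate DBSCAN and the scores are all hypothetical and are not taken from the methodology itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """A data mining task, first stated as a problem type only."""
    name: str
    problem_type: str                # e.g. "clustering", "association"
    data_sources: List[str]          # names expected in the data model
    algorithm: Optional[str] = None  # set later, in Data Mining Modelling


def missing_sources(tasks: List[Task],
                    data_model: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each task name to the sources it uses that the data model lacks."""
    missing = {}
    for task in tasks:
        absent = [s for s in task.data_sources if s not in data_model]
        if absent:
            missing[task.name] = absent
    return missing


def refine(task: Task,
           candidates: Dict[str, List[str]],
           score: Callable[[Task, str], float]) -> Task:
    """Refine a task from its problem type to the best-scoring algorithm."""
    options = candidates[task.problem_type]
    task.algorithm = max(options, key=lambda alg: score(task, alg))
    return task


# Hypothetical data model and task model, invented for illustration.
data_model = {"sales": "ETL process from a hypothetical sales database"}
tasks = [
    Task("segment customers", "clustering", ["sales"]),
    Task("find co-purchased items", "association", ["sales", "baskets"]),
]
candidates = {"clustering": ["K-means", "DBSCAN"], "association": ["Apriori"]}
made_up_scores = {"K-means": 0.7, "DBSCAN": 0.6, "Apriori": 0.5}

print(missing_sources(tasks, data_model))
for task in tasks:
    refine(task, candidates, lambda t, alg: made_up_scores[alg])
    print(task.name, "->", task.algorithm)
```

The scoring function stands in for the procedure that measures how well an algorithm satisfies a task; in practice it would be defined by the data mining expert rather than read from a fixed table.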