2 Related Work - Data Mining Foundations And Practice, Tsau Young Lin (2008)

In the information age, when the data generated and stored by modern organizations increase at an extraordinary rate, data mining tasks [9] become a necessary and fundamental technology. Much data mining research has focused on the development of algorithms for performing different tasks, such as clustering, association and classification [1, 2, 5, 13, 15, 16, 19, 20, 24, 28, 30], and on their applications to diverse domains. One major challenge in data mining, according to [12], is getting researchers to agree on a common standard for preprocessing tasks and on standards for applying the data mining process to operational processes and systems. In this sense, the Predictive Model Markup Language (PMML) [8] provides several components (Data Dictionary, Mining Schema, Transformation Dictionary, Models) useful for producing data mining models. The Data Dictionary, however, includes only information about the type of the data and the range of values; semantic information is not taken into account.
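The limitation of a dictionary that records only types and ranges can be illustrated with a minimal sketch. The `DataField` class and its attributes below are hypothetical stand-ins written for this illustration; they are not the PMML schema or any library's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataField:
    """Hypothetical dictionary entry: a name, a type and a value range only."""
    name: str
    data_type: type
    value_range: Optional[Tuple[float, float]] = None  # inclusive bounds (assumption)

    def accepts(self, value) -> bool:
        # Check the type first, then the range if one is declared.
        if not isinstance(value, self.data_type):
            return False
        if self.value_range is None:
            return True
        low, high = self.value_range
        return low <= value <= high


# Nothing in the entry says whether the values are Celsius or Fahrenheit.
temperature = DataField("temperature", float, (-50.0, 60.0))

in_range = temperature.accepts(37.0)      # passes the type and range checks
out_of_range = temperature.accepts(98.6)  # rejected: outside the declared range
wrong_type = temperature.accepts("37.0")  # rejected: not a float
```

The value 37.0 is accepted whether it was recorded as a body temperature in Celsius or an air temperature in Fahrenheit, because the entry carries no meaning beyond type and range.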

Several proposals have been developed to guide the implementation of data mining projects [7, 22, 27].

The Common Warehouse Metamodel for Data Mining (CWM DM) [22], proposed by the Object Management Group, introduces a CWM Data Mining metamodel made up of the following conceptual areas: a core Mining metamodel and metamodels representing the data mining subdomains of Clustering, Association Rules, Supervised, Classification, Approximation, and Attribute Importance.

The Cross-Industry Standard Process for Data Mining (CRISP-DM) was proposed in 1997 [7] in order to establish a standard data mining process. CRISP-DM comprises the following phases:

• Business Understanding focuses on understanding the project objectives and requirements from the business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan to achieve the objectives.

• Data Understanding starts with an initial data collection and proceeds with activities to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets from which to form hypotheses about hidden information.

• Data Preparation constructs the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools.

• Modelling selects and applies modelling techniques and calibrates their parameters to optimal values. Several techniques typically exist for the same data mining problem type, and they may have different data requirements.

• Evaluation evaluates the model and reviews the steps executed to construct it, to be certain that it properly achieves the business objectives.

• Deployment presents the knowledge in a way that the customer can use. It often involves applying models within an organization’s decision-making processes.
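The phases listed above can be sketched as an ordered enumeration with one handler per phase. This is only an illustration of the forward order given in the text; `Phase`, `run_project` and the handler signature are hypothetical names, and the sketch does not model returning to an earlier phase.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELLING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def run_project(handlers, state):
    """Call one handler per phase in definition order.

    Each handler takes the current project state and returns the new state.
    """
    trace = []
    for phase in Phase:  # Enum members iterate in definition order
        state = handlers[phase](state)
        trace.append(phase.name)
    return state, trace


# Toy handlers (assumption): each one just records the phase it ran in.
handlers = {phase: (lambda s, p=phase: s + [p.value]) for phase in Phase}
final_state, trace = run_project(handlers, [])
```

In a real project each handler would wrap the activities described in the corresponding bullet, and the state would hold the business objectives, datasets and models produced so far.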

In 1999 SAS Institute proposed the SEMMA [27] methodology, which consists of five phases: Sample, Explore, Modify, Model and Assess. The data mining process starts by taking a representative sample of the target population, with which a confidence level is associated. This sample is then explored and analyzed using visualization and statistical tools in order to obtain a set of significant variables that will become the input for a selected model. The selected model is then analyzed in order to determine relationships among variables. In this phase, both statistical methods (e.g. discriminant analysis, clustering, and regression analysis) and data-oriented methods (e.g. neural networks, decision trees, association rules) can be used. The final phase consists of evaluating the model and comparing it with different statistical methods and samples. Clementine, for its part, proposes CATs [29] (Clementine Application Templates): application-specific libraries that follow the CRISP-DM standard, with each CAT stream assigned to a CRISP-DM phase.
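The five SEMMA phases can be sketched as five small functions applied in sequence. This is a toy illustration using only the Python standard library, not SAS tooling: the function names, the synthetic data, the choice of mean-centring for the Modify step and of a least-squares line for the Model step are all assumptions made for the example.

```python
import random
import statistics


def sample(population, k, seed=0):
    """Sample: draw k records without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(population, k)


def explore(records):
    """Explore: summary statistics of the input and target variables."""
    return {
        "x_mean": statistics.mean([x for x, _ in records]),
        "y_mean": statistics.mean([y for _, y in records]),
    }


def modify(records, stats):
    """Modify: centre the input variable on the explored mean."""
    return [(x - stats["x_mean"], y) for x, y in records]


def model(records):
    """Model: fit a least-squares line y = a + b * x."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in records)
    b = sxy / sxx
    return my - b * mx, b


def assess(params, records):
    """Assess: mean squared error of the fitted line on the given records."""
    a, b = params
    return statistics.mean([(y - (a + b * x)) ** 2 for x, y in records])


# Synthetic population (assumption): an exactly linear relation y = 2x + 1.
population = [(float(x), 2.0 * x + 1.0) for x in range(100)]

train = sample(population, 20, seed=0)
stats = explore(train)
params = model(modify(train, stats))

# Assess on a different sample, prepared with the same transformation.
holdout = modify(sample(population, 20, seed=1), stats)
error = assess(params, holdout)
```

Because the synthetic relation is exactly linear, the fitted slope should be close to 2 and the error on the second sample close to zero; with real data the Assess step is where competing models and samples would be compared.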

All of the above models depend heavily on the knowledge of the analysts involved (business experts, domain experts, data miners). There seems to be a need for an intermediate level of conceptualization that can provide an interface between the experts and the clients.

According to Grossman et al. [12] “although efforts have been done to homogenize terminology and concepts among standards more work is required”. A framework to develop a unified model for data mining is proposed in [10]. The goal of the model is to provide a uniform data structure for all data mining patterns, together with operators to manipulate them. The model is designed as a three-view architecture consisting of a process view, a model view and a data view. The model view contains a set of mining models with information about mining results. None of these approaches and standards takes the semantics of the data into account.

A thorough account of the advantages and disadvantages of traditional software development methodologies when applied to Business Intelligence solutions is given in [21]. The authors explain that the old practices worked well when every system had a beginning and an end, and every system was designed to solve only one isolated problem for one set of business people from one line of business. However, these practices fail when different departments must be integrated, because they do not include any of the cross-organizational activities necessary to sustain an enterprise-wide decision support environment. For nonintegrated system development, the conventional waterfall methodology is sufficient, but these traditional methodologies do not cover strategic planning or cross-organizational business analysis. Software methodologies such as the iterative model had to be improved to deal with risks.

RUP [17] presents an architecture-centered model in which an iterative and incremental approach makes it possible to develop a software product of any scale or size. The outputs of each iteration can be components, modules or other software parts that are integrated into the next iteration, so that the final product is complete at the end. These features make it appropriate for Data Mining projects, in which requirements change as a consequence of already obtained patterns and where the outputs (patterns) of each step become part of the global solution.

3 Basis of a Data Mining Project Development
