Data Mining as an Automated Service

P. S. Bradley

Apollo Data Technologies, LLC [email protected]

February 16, 2003

Abstract

An automated data mining service offers an outsourced, cost-effective analysis option for clients desiring to leverage their data resources for decision support and operational improvement. In the context of the service model, the client typically provides the service with data and other information likely to aid in the analysis process (e.g., domain knowledge). In return, the service provides analysis results to the client. We describe the required processes, issues, and challenges in automating the data mining and analysis process when the high-level goals are: (1) to provide the client with a high-quality, pertinent analysis result; and (2) to automate the data mining service, minimizing the amount of human analyst effort required and the cost of delivering the service. We argue that by focusing on client problems within market sectors, both of these goals may be realized.

1 Introduction

The amount spent by organizations in implementing and supporting database technology is considerable. Global 3500 enterprises typically spend $664,000 on databases annually, primarily focusing on transactional systems and web-based access [25]. Unfortunately, the ability of organizations to effectively utilize this information for decision support typically lags behind their ability to collect and store it. However, organizations that can leverage their data for decision support are more likely to have a competitive edge in their sector of the market [19].

An organization may choose from a number of options when implementing the data analysis and mining technology needed to support decision-making processes or to optimize operations. One option is to perform the analysis within the organization (i.e., “in-house”). It is costly to obtain the ingredients required for a successful project: analytical and data mining experience on the part of the team performing the work, software tools, and hardware. Hence the “in-house” data mining investment is substantial.

Another option that an organization may pursue is to outsource the data analysis and mining project to a third party. Data mining consultants have gone into business to address this market need. Consultants typically have data analysis and mining experience within one or more market sectors (e.g., banking or health care), have analysis and mining software at their disposal, have the required hardware to process large datasets, and are able to produce customized analysis solutions. The cost of an outsourced data mining project depends on a number of factors, including project complexity, data availability, and dataset size. Outsourcing such projects to a third-party consulting firm has proven to be successful; however, the cost is on the order of tens to hundreds of thousands of dollars.

Unfortunately, a number of organizations simply do not have the resources available to perform data mining and analysis “in-house” or to outsource these tasks to large consulting firms. But these organizations often realize that they may be able to gain a competitive advantage by utilizing their data for decision support and/or operational improvement. Hence, there is an opportunity to deliver data mining and analysis services at a reduced cost. We argue that an automated data mining service can provide such analysis at a reduced cost by targeting organizations (or data mining clients) within a given market sector (or market vertical) and automating the knowledge discovery and data mining process [12, 9] to the extent possible.

We note that there are similarities between the design of an automated data mining service and data mining software packages focusing on specific business needs within given vertical markets. By focusing on specific problems in a vertical market, both the software and service designer are able to address data format and preparation issues and choose appropriate modeling, evaluation and deployment strategies.

An organization may find a software solution appealing, if it addresses their specific analysis needs, has the proper interfaces to the data and end-user (who may not be an analyst), and the analysis results are easily deployable. Organizations that have problems not specifically addressed by software solutions, or are attracted to a low-cost, outsourced alternative, may find an automated data mining service to be their best option.

We next present the steps of the knowledge discovery and data mining process to provide context for the discussion on automating various steps. Data mining project engagements typically proceed through the following steps (the sequence is not strict, and moving back and forth between steps is often required) [9]:

1. Problem understanding: This initial step primarily focuses on data mining problem definition and specification of the project objectives.

2. Data understanding: This step includes data extraction, data quality measurement, and detection of “interesting” data subsets.

3. Data preparation: This step includes all activities required to construct the final dataset for modeling from the “raw” data including: data transformation, merging of multiple data sources, data cleaning, record and attribute selection.

4. Modeling: In this step, different data mining algorithms and/or tools are selected and applied to the table constructed in the data preparation step. Algorithm parameters are tuned to optimal values.

5. Evaluation: The goal of this step is to verify that the data mining or analysis models generated in the modeling stage are robust and achieve the objectives specified in the problem understanding step.

6. Deployment: In this step, the results are put in a form to be utilized for decision support (e.g., a report), or the data mining models may be integrated with one or more of the organization’s IT systems (e.g., scoring prospective customers in real time).

Fig. 1 shows these analysis steps and the party that is primarily responsible for each individual step.

We next specifically define the data resources and parties that form the basis of the relationship between the automated data mining service and the client.

1.1 Definitions

Definition 1.1 (Raw Data) The base electronic data sources that contain or are believed to contain information relevant to the data mining problem of interest.

Definition 1.2 (Metadata) Additional information that is necessary to properly clean, join and aggregate the raw data into the form of a dataset that is suitable for analysis.

Metadata may consist of data schemas, data dictionaries (including the number of records in each file, and the names, types and range of values for each field), and aggregation rules.

Definition 1.3 (Domain-specific Information) Additional information on specific rules, conventions, practices, etc. that are limited to the problem domain.

Figure 1: Data Mining Project Steps: Problem understanding, data understanding, data preparation, modeling, evaluation and deployment. Dashed lines indicate that primary responsibility is on the part of the data mining client. Dotted lines indicate that primary responsibility is on the part of the service. Notice that both the client and service play central roles for the data understanding step.

For example, a retail chain may indicate product returns in their transactional record with negative sales and quantity values. This information may be extremely helpful in the data preparation, evaluation and possibly the modeling steps.
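
As a minimal sketch of how such a convention might be applied during data preparation, the following Python fragment separates returns from purchases by the sign of the sales and quantity fields. The record layout (dictionaries with `product_id`, `quantity` and `sales` keys) and the function names are hypothetical and are not taken from any particular retail system. Records with mismatched signs are set aside for review rather than guessed at.

```python
def split_records(items):
    """Split line items into purchases, returns, and suspicious records.

    Assumed convention (from the retail example): a return carries a
    negative quantity and a non-positive sales amount.
    """
    purchases, returns, suspicious = [], [], []
    for item in items:
        if item["quantity"] > 0 and item["sales"] >= 0:
            purchases.append(item)
        elif item["quantity"] < 0 and item["sales"] <= 0:
            returns.append(item)
        else:
            # Zero quantities or mismatched signs need human triage.
            suspicious.append(item)
    return purchases, returns, suspicious


def net_quantity(items):
    """Net units sold per product after subtracting returned units."""
    totals = {}
    for item in items:
        pid = item["product_id"]
        totals[pid] = totals.get(pid, 0) + item["quantity"]
    return totals
```

Knowing the convention in advance lets the service compute net sales directly instead of treating negative values as data errors.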

Definition 1.4 (Third-party data sources) Additional raw data that will likely improve the analysis result, but is not collected or owned by the data mining client.

Examples of third-party data sources include address information from the postal service, demographic information from third party data collection companies, etc.

Definition 1.5 (Automated Data Mining Service) An organization that has implemented or obtained processes, tools, algorithms and infrastructure to perform the work of data understanding, data preparation, modeling, and evaluation with minimal input from a human analyst.

The high-level goals of the automated data mining service are the following.

1. Provide the data mining client with a high quality, pertinent result. Achieving this goal ensures that the organization receiving the data mining/analysis results is satisfied (and hopefully a return customer!).

2. Remove or minimize the amount of human intervention required to produce a high-quality, pertinent result. Achieving the second goal allows the data mining/analysis service to “scale” to a large number of concurrent customers, allowing it to amortize the cost of offering the service across a large number of customers. In addition to automating as many of the operational aspects of the analysis process as possible, this goal is achievable by focusing on problems common in a few market verticals or problem domains.

For this discussion, we will assume that the set of data mining analysis problems that the automated service addresses is fixed. For example, the service may be able to offer data clustering analysis services, but is not able to offer association-rule services. We will also drop the word “automated” for the remaining discussion and it is assumed that when referring to the “data mining service”, we are referring to an “automated data mining service”.

Definition 1.6 (Data Mining Client) An organization that possesses two items: (1) a data mining analysis problem; (2) the raw data, metadata, and domain-specific information that, possibly in combination with third-party data sources, are needed to solve the data mining analysis problem.

We next specify the context of the relationship between the data mining client and the data mining service.

1.2 Relationship between Consumer and Service

The data mining service receives raw data, metadata, and possibly domain-specific information from the client. The service then performs data understanding, data preparation, modeling, and evaluation for the analyses specified by the client, for an agreed-upon fee. When these steps are completed, the results and possibly the provided data and other information (reports on analyses tried, intermediate aggregations, etc.) are returned to the client.

The combination of offering a fixed set of data mining and analysis solutions and focusing on clients from similar domains or market verticals enables the data mining service to perform the data understanding and deployment tasks with minimal intervention from the human data analyst working for the service. Offering a fixed set of data mining and analysis solutions allows the service to templatize problem definition documents and related information. Solution deployment is also similarly constrained by the problem domain and the focused vertical market to again allow the service to templatize deliverables. Additionally, by focusing on a particular problem domain, the data mining service analyst gains invaluable domain knowledge, increasing the likelihood of successful solution delivery in future engagements.

Example:

Consider the e-commerce domain. A data mining service may offer the following solutions: determining the most common paths followed by website visitors, determining the most common products purchased together and ranking products that are most likely to be purchased together. For the e-commerce domain, problem definition and specification document templates can help make the problem understanding phase clear and efficient. Additionally, the deployment steps may be nearly automated. Depending on the analysis performed, the results may take the form of an automatically generated report, a file, an executable, etc. Analysis of common paths and common products purchased together are typically best delivered in a standard report form. Product rankings may best be delivered to the client in the form of a file or an executable that takes a given product ID and returns the ranked list of product IDs.
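
The product-ranking deliverable described in this example could be sketched as follows. This is only an illustration, assuming that each order is available as a collection of product IDs and that ranking by raw co-purchase counts is acceptable; the function names are hypothetical, and a production service would likely use a more refined score.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_copurchase_counts(orders):
    """Count how often each pair of products appears in the same order."""
    counts = defaultdict(Counter)
    for order in orders:
        # De-duplicate within an order so repeated line items count once.
        for a, b in combinations(sorted(set(order)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def rank_copurchased(counts, product_id, top_n=5):
    """Return product IDs most often bought with product_id, best first.

    Ties are broken alphabetically so the output is deterministic.
    """
    ranked = sorted(counts.get(product_id, {}).items(),
                    key=lambda pair: (-pair[1], pair[0]))
    return [pid for pid, _ in ranked[:top_n]]
```

A deliverable of this shape takes a product ID and returns a ranked list of product IDs, matching the file or executable form mentioned above.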

There are often legal, security, and privacy concerns regarding data extraction on the part of the data mining client that should be addressed by the service. For more detail on these issues, see [22].

The remainder of the paper is organized as follows. Sections 2, 3 and 4 focus on issues involved in automating and scaling the data preparation, modeling, and evaluation steps in the general KDD process. Section 5 concludes the paper.

2 Data Understanding and Preparation

Tasks involved in the data understanding step are data extraction, data quality measurement and the detection of “interesting” data subsets. The data preparation step consists of all activities required to construct the final dataset for modeling.

Responsibility for tasks in the data understanding step is typically split between the client and the service. The data mining client is responsible for extracting and providing the required raw data sources to the service for analysis. The data mining service may also augment the raw data provided by the client with third-party data sources. Data quality measurement and detection of “interesting” data subsets are performed by the service.

To efficiently address data understanding and preparation tasks, the data mining service needs to rely upon the fact that its clients come from certain specific domain areas. Ideally, these domains are chosen so that the organizations within a particular domain have similar data schemas and data dictionaries characterizing the data sources that are to be analyzed. Given a priori knowledge about the data schemas and data dictionaries for a given domain, the data mining service can automate the operational steps of joining the appropriate raw data sources and possibly integrating third-party data sources.

Similar data formats within a market vertical or domain also justify the building of automated domain-specific data cleaning and data quality measurement tools. In the perfect setting, domain-specific rules for data cleaning can be completely automated. A useful solution, however, need only capture and fix the majority of data cleaning issues automatically, flagging a small fraction of violations for human intervention and triage. The goal of data quality measurement is to assess whether the data have the potential to provide the required solution to the given problem [22]. Since the solutions offered by the service and the data sources themselves tend to be domain-specific, data quality measurement may be automated with minimal effort, requiring only configuration information.

Example:

Suppose a data mining service offers market-basket analysis for e-commerce companies. The typical data sources of interest include the order header, order line item information, and product catalog data. Since these sources tend to have similar schemas, it is possible to automate some data cleaning processes (ensuring that product IDs in the order information match product IDs in the catalog, that rules governing the line item price information with respect to the catalog prices are respected, etc.). Additionally, initial data quality measurements may include the number of 1-itemsets that have sufficient support.
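
Two of the checks in this example can be sketched in a few lines of Python. The sketch assumes line items arrive as `(order_id, product_id)` pairs and that support is measured as the fraction of orders containing a product; both the layout and the function names are illustrative assumptions.

```python
from collections import Counter


def find_unknown_product_ids(line_items, catalog_ids):
    """Product IDs that appear in order line items but not in the catalog."""
    seen = {product_id for _, product_id in line_items}
    return sorted(seen - set(catalog_ids))


def frequent_single_items(line_items, min_support):
    """Support of each 1-itemset, keeping those at or above min_support.

    Support is the fraction of distinct orders that contain the product.
    """
    orders = {}
    for order_id, product_id in line_items:
        orders.setdefault(order_id, set()).add(product_id)
    n_orders = len(orders)
    counts = Counter(pid for items in orders.values() for pid in items)
    return {pid: count / n_orders
            for pid, count in counts.items()
            if count / n_orders >= min_support}
```

Unknown product IDs would be flagged for triage, while a very small number of frequent 1-itemsets suggests that market-basket analysis is unlikely to yield useful rules on this data.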

Similarly, when the service is restricted to a small number of domains and a small set of data mining problems, it is possible, given knowledge and expertise, to automatically apply a number of data transformations and feature selection techniques that have been shown to be useful for the given domain (possibly consisting of a combination of domain knowledge and automated feature selection methods). These can be automatically executed and the resulting models automatically scored, yielding a system that constructs a number of models with little intervention from the human analyst. For a more detailed discussion on automating the feature selection/variable transformation task, see [2].

3 Modeling

Modeling is the step in the data mining process that includes the application of one or more data mining algorithms to the prepared dataset, including selection of algorithm parameter values. The result of the modeling step is a series of either predictive or descriptive models or both. Descriptive models are useful when the data mining client is attempting to get a better understanding of the dataset at hand. Predictive models summarize trends in the data so that, if future data has the same or similar distribution to past data, the model will predict trends with some degree of accuracy.

In this section we assume that the data understanding and data preparation phases have produced a dataset from which the desired data mining solution can be derived. In practice, however, results of the modeling step often motivate revisiting the data understanding and data preparation steps. For example, after building a series of models it may become apparent that a different data transformation would greatly aid in the modeling step.

When evaluating the utility of a given data mining algorithm for possible use in the service, the following considerations should be taken into account:

1. Assuming the prepared dataset is informative, is it possible to obtain high-quality models consistently using the given algorithm?

2. Is the algorithm capable of optimizing objectives specific to the client’s organization (e.g. total monetary cost/return)?

3. Is the algorithm efficient?

There are two factors influencing the likelihood of obtaining a high quality, useful model using a given algorithm. The first factor relates to the robustness of the computed solution with respect to small changes in the input parameters or slight changes in the data. Ideally, for a majority of datasets that are encountered in a given domain, the data mining service prefers robust algorithms since model quality with respect to small parameter and data change is then predictable (and hence, the algorithm is amenable to automation).

The second factor is the ease with which the insight gained by analyzing the model can be communicated to the data mining client. Typically, prior to deployment of a model or utilizing the model in organizational decision processes, the data mining client desires to understand the insights gleaned from the model. This process is typically difficult to automate, but developing intuitive, easy-to-understand user interfaces aids greatly. Additionally, a process that identifies a fraction of interesting rules or correlations and reports these to a data mining service analyst is a very useful quality assurance tool. These are primarily concerns during the deployment step, but the choice of modeling technique does affect this later phase.

We note that there are some data mining applications in which the client often does not analyze the model, but analyzes the computed results (e.g. results produced by product recommender systems are often analyzed, rather than attempting to understand the underlying model).

Industry standards for data mining model storage such as PMML [13] and OLE DB for Data Mining [10] enable consultants and third party vendors to build effective model browsers for specific industry problems and domains. These standards provide a basis for data mining platforms that enable data mining clients to more easily deploy and understand the models built by the service.

From the viewpoint of the data mining client, model maintenance tends to be an important issue. The data mining client may not have or may not want to invest resources to ensure that the data mining models they have received from the service are maintained and accurately model their underlying organizational processes, which may be changing over time. Techniques for incrementally maintaining data mining models are discussed in [4]. Additionally, work on identifying the fit of data mining models to data changing over time includes [7]. It may be possible for the service to incorporate these techniques into the client deliverable so that the model may maintain itself or notify the client that it is not sufficiently modeling recently collected data.
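
The techniques of [4] and [7] are not reproduced here. As a much simpler stand-in, the sketch below shows how a delivered predictive model might notify the client when it no longer fits recent data, by comparing its error rate on recently labeled records against the error rate measured at delivery. The function names, the `tolerance` parameter, and the assumption that labeled recent records are available are all illustrative.

```python
def error_rate(model, records):
    """Fraction of (features, label) records the model predicts incorrectly."""
    if not records:
        raise ValueError("records must be non-empty")
    wrong = sum(1 for features, label in records if model(features) != label)
    return wrong / len(records)


def needs_refresh(model, recent_records, baseline_error, tolerance=0.05):
    """True when recent error exceeds the delivery-time error by more than
    the allowed tolerance, signalling that the model may be out of date."""
    return error_rate(model, recent_records) > baseline_error + tolerance
```

A check of this kind can run on the client side without analyst involvement and trigger a request for a rebuilt model.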

We briefly discuss some popular algorithms used in developing data mining solutions. Note that this list is not exhaustive.

3.1 Decision Trees

Decision tree construction algorithms are appealing from the perspective of the data mining service for a number of reasons. The tree’s hierarchical structure enables non-technical clients to understand and effectively explore the model after a short learning period. Decision tree construction typically requires the service to provide few, if any, parameters, hence tuning the algorithm is relatively easy. Typically, small data changes result in small, if any, changes to the resulting model (although this is not true for all datasets and possible changes in data) [8]. For excellent discussions on decision tree algorithms, please see [5, 18]. For a discussion on techniques used to scale decision tree algorithms to large datasets, see [4].

3.2 Association Rules

Association rule algorithms identify implications of the form X ⇒ Y where X and Y are sets of items. The association rule model consists of a listing of all such implications existing in the given dataset. These implications are useful for data exploration and may be very useful in predictive applications (e.g., see [17]). Association rules are often presented to the user in priority order with the most “interesting” rules occurring first. Alternatively, or in addition to the list of interesting rules, a browser allowing the data mining client to filter rules with specified items occurring in the set X or Y is typically useful. For an overview of association rule algorithms, see [1, 16]. Approaches used to scale association rule algorithms to large databases are discussed in [4].
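
To make the form X ⇒ Y concrete, the following brute-force sketch enumerates rules on a tiny dataset using the standard support and confidence measures. It is not one of the scalable algorithms of [1, 16]: it examines every candidate itemset and is suitable only for illustration. The thresholds and function names are assumptions, and rules are ordered by confidence as one possible notion of “interesting”.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def association_rules(transactions, min_support, min_confidence):
    """Enumerate rules (X, Y, support, confidence) by exhaustive search."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s == 0 or s < min_support:
                continue
            # Split the frequent itemset into every X => Y with X, Y non-empty.
            for k in range(1, size):
                for lhs in combinations(candidate, k):
                    rhs = tuple(i for i in candidate if i not in lhs)
                    confidence = s / support(lhs, transactions)
                    if confidence >= min_confidence:
                        rules.append((lhs, rhs, s, confidence))
    # Most confident rules first; remaining keys make the order deterministic.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules
```

A rule browser of the kind described above would then filter this list by the items appearing in X or Y.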

3.3 Clustering

Clustering algorithms aim to partition the given dataset into several groups such that records in the same group are “similar” to each other, identifying subpopulations in the data. Typically, the data mining client is not interested in the particular clustering strategy employed by the service, but is interested in a concise, effective summary of the groups that occur in their data. Although there are numerous clustering approaches available, we will focus the discussion on two methods: iterative and hierarchical methods.

Iterative clustering methods are well-known and typically straightforward to implement. But from the perspective of the data mining service, there are two challenges to automating them: obtaining a robust model that accurately characterizes the underlying data, and determining the “correct” number of clusters existing in the underlying data. Iterative clustering methods require the specification of initial clusters, and the computed solution is dependent upon the quality of this initial partition. Hence, to ensure a quality solution, the data mining service must implement a search for good initial clusters [3]. This is typically done by re-running the iterative clustering algorithm from multiple random initial clusters and taking the “best” model, or by utilizing sampling strategies. Additionally, determining the “correct” number of clusters is challenging, but strategies such as those discussed in [23] are useful. For a general overview of iterative clustering methods, see [15, 11].
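
The restart strategy can be sketched with a plain K-Means implementation. This is a minimal illustration of re-running from random initial centers and keeping the lowest-error model; it is not the refinement method of [3], and the choice of the sum of squared distances as the “best” criterion, along with all function names, is an assumption.

```python
import random


def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, rng, max_iter=100):
    """One K-Means run from k randomly chosen data points.

    points is a list of equal-length tuples. Returns (centers, sse).
    """
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [_mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(_sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-Means several times and keep the model with the lowest SSE."""
    rng = random.Random(seed)
    runs = (kmeans(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda result: result[1])
```

Because each run is independent, the restarts are easy to execute in parallel, which suits an automated service.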

Hierarchical clustering methods build a tree-based hierarchical taxonomy (dendrogram) summarizing similarity relationships between subsets of the data at different levels of granularity. The hierarchical nature of the resulting model is a benefit for the data mining service since this structure is typically easily browsed and understood by the client. Additionally, these clustering methods are very flexible when it comes to the distance metric employed to group together similar items, making hierarchical methods applicable to a number of problems that require the use of non-standard distance metrics. The main caveat to standard hierarchical clustering implementations is their computational complexity, requiring either O(m²) memory or O(m²) time for m data points, but automating these standard implementations is straightforward. Work on scaling these methods to large datasets includes [14, 20]. For a detailed discussion of hierarchical clustering methods, see [15].
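
The flexibility with respect to the distance metric can be illustrated with a naive single-linkage procedure that accepts any distance function. The sketch below is deliberately unoptimized (it recomputes pairwise distances at every merge and is therefore slower than the standard implementations discussed above), and the Jaccard distance on sets is chosen only as an example of a non-standard metric. All names are illustrative.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage(items, distance):
    """Agglomerative clustering with single linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of indices into items. The history is
    enough to draw a dendrogram.
    """
    clusters = [(i,) for i in range(len(items))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return merges
```

Swapping in a different metric requires no change to the clustering routine, only a different `distance` argument.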

3.4 Support Vector Machines

Support Vector Machines (SVMs) are powerful and popular solutions to predictive modeling problems. SVM algorithms are typically stable and robust with respect to small changes in the underlying data. The algorithms require the specification of a parameter that effectively balances the predictive performance on the available training data with the complexity of the predictive model computed. Tuning set strategies are typically used to automate the selection of optimal values of this parameter. The SVM predictive model is a function in the space of the input or predictor attributes of the underlying data. Since this space tends to be high-dimensional, presenting the SVM model to the data mining client for the purpose of gaining insight is often a difficult proposition. The SVM is a very good predictive model, but is somewhat of a “black-box” with respect to understanding and extracting information from its functional form. For an overview of SVMs, see [6]. For strategies on scaling SVM computation to large datasets, see [4].

4 Evaluation

Prior to delivering a data mining model or solution to the client, the service will evaluate the model or solution to ensure that it has sufficient predictive power or provides adequate insight into trends existing in the underlying data. The primary focus in the evaluation phase is ensuring that the client is being handed a high-quality result from the service.

Depending upon the project, model evaluation may involve one or two components. The first is an objective, empirical measurement of the performance of the model or of the ability of the model to accurately characterize the underlying data. The second, which may not be needed for some projects, involves a discussion or presentation of the model with the client to collect feedback and ensure that the delivered model satisfies the client’s needs. This second component is typical for projects or models in which the goal is data understanding or data exploration.

We discuss the empirical measurement component of the evaluation phase in more detail for two high-level data mining tasks: predictive applications and data exploration. By exposing intuitive, well-designed model browsers to the client, the service may automate the process of presenting the model and collecting client feedback to the extent possible. As the service focuses on clients in particular domains or verticals, model browsers or report templates may be created that raise attention to “important” or “interesting” results for the specific domain or market.

4.1 Evaluating Predictive Models

The primary focus of predictive model evaluation is to estimate the predictive performance of a given model when it is applied to future or unseen data instances. Algorithms discussed that produce predictive models include decision trees (Section 3.1), association rules (Section 3.2) and support vector machines (Section 3.4).

The basic assumption underlying different predictive performance estimation strategies is that the distribution of future or unseen data is the same as (or similar to) the distribution of the training data used to construct the model.

Popular methods for estimating the performance of predictive models include cross-validation [24] and ROC curves [21]. Cross-validation provides an overall average predictive performance value, given that the data distribution assumption above is satisfied. ROC curves provide a more detailed analysis of the frequency of false positives and false negatives for predictors constructed from a given classification algorithm.
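
As an illustration of the false-positive and false-negative trade-off that an ROC curve captures, the sketch below computes the curve's points from classifier scores and true labels by sweeping the decision threshold from high to low. It follows the usual textbook construction rather than any specific procedure from [21], and the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) points.

    scores are classifier outputs where higher means "more likely
    positive"; labels are truthy for positive instances.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

The area under the curve gives a single summary number, while the individual points let the client choose an operating threshold that reflects the relative cost of the two error types.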

From the viewpoint of the data mining service, automating cross-validation and ROC computations is straightforward. Running these evaluation techniques requires little (if any) input from a human analyst on the part of the service. But computation of these values may be time consuming, especially when the predictive modeling algorithm used has a lengthy running time on the client’s data.
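
A minimal k-fold cross-validation loop shows why this step needs no analyst input and why its cost grows with the running time of the modeling algorithm: the model is rebuilt once per fold. The `fit` and `score` callables are placeholders for whatever algorithm and metric the service uses, and the function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(records, k, fit, score, seed=0):
    """Average held-out score over k folds.

    fit(train_records) returns a model; score(model, test_records)
    returns a number such as accuracy. The model is rebuilt k times.
    """
    scores = []
    for held_out in k_fold_indices(len(records), k, seed):
        held = set(held_out)
        train = [r for i, r in enumerate(records) if i not in held]
        test = [records[i] for i in held_out]
        scores.append(score(fit(train), test))
    return sum(scores) / k
```

Since the folds are independent, a service could distribute them across machines to reduce the elapsed time.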

In addition to evaluating a given model (or set of models) with respect to predictive performance, the data mining service may implement evaluation metrics that are more informative for clients within a given domain or vertical.

Example:

Consider again the data mining service that caters to e-commerce clients. The service may provide product recommendations for e-commerce companies (e.g., when a customer is viewing sheets at the e-commerce site, recommend pillows as well). In this case, although the recommender system is a predictive modeling system, the data mining service may evaluate different predictive models with respect to the amount of revenue expected to be generated when the recommender is placed in production.
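
One simple way to approximate such a revenue-oriented metric is to replay held-out orders and credit a recommender with the price of each recommended product that the customer actually bought. The sketch below makes strong simplifying assumptions: historical purchases stand in for future behavior, each hit is credited at list price, and all names (`recommend`, `held_out_orders`, `prices`) are hypothetical.

```python
def expected_revenue(recommend, held_out_orders, prices, top_n=1):
    """Total price of recommended products that were actually purchased.

    recommend(product_id) returns a ranked list of product IDs.
    held_out_orders is a list of (viewed_product, purchased_set) pairs.
    """
    total = 0.0
    for viewed, purchased in held_out_orders:
        for product_id in recommend(viewed)[:top_n]:
            if product_id in purchased:
                total += prices[product_id]
    return total
```

Two recommenders with the same hit rate can differ substantially under this metric when they favor products at different price points, which is why a domain-specific measure can be more informative to the client than raw predictive accuracy.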

4.2 Evaluating Data Exploration Models

The primary goal in evaluation of models that support data exploration tasks is ensuring that the model is accurately summarizing and characterizing patterns and trends in the underlying dataset. Algorithms discussed that address data exploration tasks are the clustering methods mentioned in Section 3.3. To some extent association rules (Section 3.2) are also used as data exploration tools.

Objective measures for evaluating clustering models to ensure that the model accurately captures data characteristics include Monte Carlo cross-validation [23]. This method is straightforward to automate on the part of the data mining service.

Given the nature of association rule discovery algorithms, the set of association rules found is, by definition, accurately derived from the data. So there is no need to empirically measure the “fit” of the set of association rules to the underlying dataset.

The quality of data exploration models is related to the utility of the extracted patterns and trends with respect to the client’s organization. When the data mining service focuses on a particular client domain or vertical market, effective model browsers and templates can be constructed that focus the client’s attention on information that is frequently deemed useful in the domain or vertical. Hence the quality of the model with respect to the particular domain is then easily evaluated by the client. Additionally, the service can use these browsers and templates to evaluate model quality prior to exposing the model to the client.

5 Conclusion

The goal of the data mining service is to effectively and efficiently produce high-quality data mining results for the data mining client for a reasonable (low) cost. We argued that a quality, cost-effective data mining result may be delivered by automating the operational aspects of the data mining process and focusing on specific client domains. Upon successful execution of these tasks, the service is then an attractive option for small, medium and large organizations to capitalize on and leverage their data investment to improve the delivery of their products and services to their customers.

References

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207–216, Washington, D.C., May 1993.

[2] J. D. Becher, P. Berkhin, and E. Freeman. Automating exploratory data analysis for efficient mining. In Proc. of the Sixth ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD-2000), pages 424–429, Boston, MA, 2000.

[3] P. S. Bradley and U. M. Fayyad. Refining initial points for K-Means clustering. In Proc. 15th International Conf. on Machine Learning, pages 91–99. Morgan Kaufmann, San Francisco, CA, 1998.

[4] P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms to large databases. Comm. of the ACM, 45(8):38–43, 2002.

[5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.

[6] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[7] I. V. Cadez and P. S. Bradley. Model based population tracking and automatic detection of distribution changes. In Proc. Neural Information Processing Systems 2001, 2001.

[8] D. M. Chickering. Personal communication, January 2003.

[9] CRISP-DM Consortium. Cross industry standard process for data mining (CRISP-DM). http://www.crisp-dm.org/.

[10] Microsoft Corp. Introduction to OLE DB for data mining. http://www.microsoft.com/data/oledb/dm.htm.

[11] R. Duda, P. Hart, and D. Stork. Pattern classification. John Wiley & Sons, New York, 2000.

[12] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA, 1996.

[13] Data Mining Group. PMML version 2.0. http://www.dmg.org/index.htm.

[14] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. ACM SIGMOD Intl. Conf. on Management of Data, pages 73–84, New York, 1998. ACM Press.

[15] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[16] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, AAAI Workshop on Knowledge Discovery in Databases (KDD-94), pages 181–192, Seattle, Washington, 1994. AAAI Press.

[17] Nimrod Megiddo and Ramakrishnan Srikant. Discovering predictive association rules. In Knowledge Discovery and Data Mining, pages 274–278, 1998.

[18] Sreerama K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4):345–389, 1998.

[19] M. T. Oguz. Strategic intelligence: Business intelligence in competitive strategy. DM Review, August 2002.

[20] Clark F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8):1313–1325, 1995.

[21] Foster J. Provost and Tom Fawcett. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Knowledge Discovery and Data Mining, pages 43–48, 1997.

[22] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, San Francisco, CA, 1999.

[23] Padhraic Smyth. Clustering using Monte Carlo cross-validation. In Knowledge Discovery and Data Mining, pages 126–133, 1996.

[24] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36:111–147, 1974.

[25] D. E. Weisman and C. Buss. Database functionality high, analytics lags, September 28, 2001. Forrester Brief: Business Technographics North America.