White Paper

Data Mining and Knowledge Discovery

Knowledge Discovery

Organizations collect a vast amount of data in the process of carrying out their day-to-day business. The opening up of information systems through the World Wide Web has helped organizations amass huge amounts of data, ranging from interactions to transactions, that was previously impossible or very costly to collect.

Organizations have perfected various techniques to churn this vast amount of data into reports that provide various facts and figures. This in turn has created an "information overload" rather than helping these organizations glean any knowledge from the data.

Data mining is the process of discovering and extracting patterns from data. Data mining applies a set of algorithms for the pattern extraction.

When these patterns are analyzed with the help of prior knowledge and proper interpretation, the process is called Knowledge Discovery.

The field of Knowledge Discovery in Databases (KDD) deals with data mining and interpretation of the data mining results to create knowledge from databases. In this paper we will discuss the KDD process, data mining algorithms, and the benefits of practicing KDD to businesses.

Generating knowledge from data

Data Mining

Data mining involves determining patterns from, or fitting models to, observed data. A typical data mining system may perform one or more of the following tasks:

Association: Association is the discovery of correlations among a set of items. The output is often expressed in the form of a rule showing attribute value conditions that frequently occur together. This type of analysis is widely used in analyzing data for direct marketing campaigns, sales catalog design and many other business decision-making processes.

Example - An association model might discover that, of all electronics customers under study, those in the 20-29 age group (10% of the set) with an income of 40-50K buy DVD players with 80% probability.
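In a rule like the one above, the 80% figure is usually called the confidence of the rule, and the share of all customers that satisfy the whole rule is called its support. The following is a minimal sketch of how the two measures might be computed. The customer records and the `rule_stats` helper are hypothetical, chosen only for illustration, and are not taken from any particular data mining tool.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Records that satisfy the "if" part of the rule.
    matches_if = [r for r in records if antecedent <= r]
    # Of those, the records that also satisfy the "then" part.
    matches_both = [r for r in matches_if if consequent <= r]
    support = len(matches_both) / len(records) if records else 0.0
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence


# Hypothetical data: 10 customers, each described by a set of attribute values.
young_mid = {"age:20-29", "income:40-50K"}
records = (
    [young_mid | {"buys:DVD player"}] * 4    # satisfy the whole rule
    + [young_mid | {"buys:TV"}]              # satisfy the condition only
    + [{"age:30-39", "income:50-60K", "buys:TV"}] * 5
)

support, confidence = rule_stats(records, young_mid, {"buys:DVD player"})
print(f"support={support:.0%} confidence={confidence:.0%}")
```

Real association mining tools also search for candidate rules automatically; this sketch only scores one rule that has already been proposed.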

Classification: Classification analyzes a set of training data and constructs a model for each class based on the features of the data. A decision tree or a set of classification rules is generated. This can further be used for better understanding of each class and for classification of future data. Classification draws on a rich set of methods from machine learning, neural networks, statistics and other fields.

Classification has proved useful for customer segmentation and credit analysis.

Example - A customer can be rated a good or bad risk depending on the income range, the number of years in the job and the amount of debt he or she is carrying.
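As an illustration of how a classification rule can be derived from training data, the sketch below learns a one-level decision tree (a single cut-off) on one attribute. It is a simplified stand-in for a full decision tree learner. The attribute (debt as a fraction of income), the training values and the function names are all assumptions made for this example.

```python
def fit_threshold(values, labels):
    """Pick the cut-off t that gives the fewest training errors for the rule
    'value <= t -> good, otherwise bad' (a one-level decision tree)."""
    points = sorted(set(values))
    # Candidate cut-offs lie halfway between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best_t, best_errors = None, None
    for t in candidates:
        errors = sum(
            1 for v, y in zip(values, labels)
            if ("good" if v <= t else "bad") != y
        )
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors


def classify(debt_ratio, threshold):
    """Apply the learned rule to a new customer."""
    return "good" if debt_ratio <= threshold else "bad"


# Hypothetical training data: debt as a fraction of income, with known outcomes.
debt_ratios = [0.10, 0.20, 0.25, 0.50, 0.60, 0.80]
outcomes = ["good", "good", "good", "bad", "bad", "bad"]

threshold, training_errors = fit_threshold(debt_ratios, outcomes)
print(threshold, training_errors, classify(0.30, threshold))
```

A practical classifier would combine several attributes, such as income range and years in the job, and would be checked against data that was not used for training.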

Prediction: Classification can be used for predicting the class label of data objects. Prediction can also be used to estimate missing data values. The classification step produces the business rule or decision tree from which the prediction is made.

Example - A customer's potential expenditure using a credit card can be predicted based on the expenditure distribution of similar customers using that credit card. Genetic algorithms, regression analysis and neural networks are commonly used for this purpose.
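Of the techniques named above, regression analysis is the simplest to show in a few lines. The sketch below fits a straight line by ordinary least squares and uses it to predict spending for a new customer. The income and spending figures are invented for illustration, and a single predictor is assumed only to keep the example short.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data for similar customers:
# annual income (thousands) and monthly card spending.
incomes = [30, 40, 50, 60]
spending = [350, 450, 550, 650]

slope, intercept = fit_line(incomes, spending)
predicted = slope * 45 + intercept   # prediction for an income of 45K
print(slope, intercept, predicted)
```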

Clustering: A cluster is a collection of objects that are similar to one another. Clustering analysis refers to identifying clusters embedded in the data. A good clustering method produces high-quality clusters, meaning that inter-cluster similarity is low and intra-cluster similarity is high. Clustering is very commonly used for customer segmentation and for deriving marketing strategies.

Example - The customer base may be clustered around certain sets of attributes that uniquely determine cluster membership, for example the location, income group and age group of the customers.
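One widely used way to find such clusters is the k-means algorithm, which alternates between assigning each object to its nearest cluster center and moving each center to the mean of its members. The sketch below is a minimal version under stated assumptions: the customer points (age, income in thousands), the starting centers and the fixed iteration count are all hypothetical, and production tools usually add smarter initialization and a convergence check.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups


# Hypothetical customers described by (age, income in thousands).
customers = [(22, 30), (25, 32), (24, 28), (45, 80), (50, 85), (48, 90)]
centers, groups = kmeans(customers, centers=[(22, 30), (45, 80)])
print(centers)
print(groups)
```

In practice the attributes would normally be scaled to comparable ranges first, since otherwise the attribute with the largest numeric range dominates the distance.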

Figure: Cluster centers ("+") for Group 1, Group 2 and Group 3

Time-Series Analysis: This analysis is used to find regularities and interesting characteristics of data varying over time. It looks for sequential patterns, periodicity, trends and deviations.

Example - Time-series analysis may be used for predicting sales quantities for different SKUs, based on demand pattern, market condition and competitors' performance.
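Two of the simplest time-series operations are smoothing a series to expose its trend and flagging values that deviate sharply from recent history. The sketch below shows both. The weekly sales figures, the window length and the tolerance are assumptions chosen for illustration.

```python
def moving_average(series, window):
    """Trailing moving average; the first value covers series[0:window]."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def find_deviations(series, window, tolerance):
    """Indexes whose value differs from the mean of the preceding
    `window` values by more than `tolerance`."""
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window : i]) / window
        if abs(series[i] - baseline) > tolerance:
            flagged.append(i)
    return flagged


# Hypothetical weekly sales quantities for one SKU.
weekly_sales = [100, 102, 101, 103, 150, 104]

smoothed = moving_average(weekly_sales, window=3)
spikes = find_deviations(weekly_sales, window=3, tolerance=20)
print(smoothed)
print(spikes)
```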

Example Applications

Retail/Marketing

• Identify buying patterns of customers

• Find associations among customer demographic characteristics

• Predict response to mailing campaigns

Banking

• Detect patterns of fraudulent credit card use

• Classify customers for target marketing

• Predict customers likely to change their credit card affiliation

• Determine credit card spending by customer groups

Insurance

• Claims analysis

• Predict which customers will buy new policies

• Identify behavior patterns of risky customers

• Identify fraudulent behavior

Telecommunication

• Call Behavior Analysis

• Churn Analysis

• Fraud Detection

• Call Center Performance

e-Commerce

• Recommendation System

• Website Access Profiling

• Personalization

• Clickstream analysis for Web Insurance

Process of Data Mining

A systematic approach is essential to successful data mining. Quintegra Solutions has effectively used the process model described in this section.

It should be noted that the data mining process is not linear. The loops in the process model indicate that one or more previous steps may be revisited, depending on the result at a given step.

For example, the results of the data exploration phase may require you to add new data to the database. Usually a number of initial models are built in order to arrive at a satisfactory model.

The Business Definition Phase and the Results Deployment Phase govern the effectiveness of the entire process.

The following is a brief description of the data mining process phases adopted by TranSys in providing the data mining solution:

1. Business Definition Phase

A prerequisite to knowledge discovery is a clear understanding of the business environment. This is required in order to appreciate opportunities for improvement, to prepare the data for mining and to interpret the results correctly. A clear statement of business objectives will make the best use of the data mining effort. TranSys will work with clients to clearly define the business objective.

This definition stage will include a way of measuring the results of the data mining project and a cost justification.

2. Data Building Phase

In this phase the data to be mined is collected in a database. Depending on the amount and complexity of the data, even a flat file or a spreadsheet may often be adequate.

The required components of data may be sourced from a data warehouse, which helps ensure the required cleanliness of the data. Other data from external sources may have to be integrated.

TranSys will perform the following tasks in order to achieve the objectives of this phase:

• Collect the required data

• Select the subset of data to be mined

• Assess data quality and if required, cleanse the data

• Consolidate and integrate the data

• Load the data mining database

3. Data Exploration Phase

Understanding the data is very important.

Graphing and visualization tools are a vital aid in understanding data and preparing data. Data visualization most often provides the "Aha!" leading to new insights and success. Some of the common and very useful graphical displays of data are histograms or box plots that display distributions of values.

TranSys will work closely with the functional team members to identify the most important attributes and fields in predicting an outcome and determine which derived values may be useful. They will use visualization, link analysis and other means of exploring data.

Figure: The Data Mining Process (Business Definition, Data Building, Data Exploration, Data Preparation, Model Building, Model Evaluation and Results Deployment phases)

4. Data Preparation Phase

This is the final phase before building models. It is often a good idea to sample the data when the database is large; if done carefully, this loses little information. Data that is clearly extraneous needs to be identified and discarded. It is often necessary to construct new variables derived from the raw data.

For example, forecasting credit risk using a debt-to-income ratio rather than just debt and income as predictor variables may yield more accurate results that are also easier to understand. Data may also need to be discretized. For example, decision trees used for classification may require continuous data such as income to be grouped into ranges or bins, such as High, Medium and Low. The cut-off points for the bins may change the outcome of a model.
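The two preparation steps just described, deriving a new variable and binning a continuous one, can be sketched as follows. The function names and the cut-off values are hypothetical. The last two lines show how moving a cut-off places the same customer in a different bin, which is why the choice of cut-offs can change a model's outcome.

```python
def debt_to_income(debt, income):
    """Derived variable: total debt as a fraction of annual income."""
    if income <= 0:
        raise ValueError("income must be positive")
    return debt / income


def bin_income(income, cutoffs=(30_000, 70_000)):
    """Discretize income into Low / Medium / High using two cut-off points."""
    low_cut, high_cut = cutoffs
    if income < low_cut:
        return "Low"
    if income < high_cut:
        return "Medium"
    return "High"


ratio = debt_to_income(debt=20_000, income=50_000)
default_bin = bin_income(65_000)                            # default cut-offs
shifted_bin = bin_income(65_000, cutoffs=(30_000, 60_000))  # lower upper cut-off
print(ratio, default_bin, shifted_bin)
```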

TranSys will perform the following tasks involved in this phase:

• Selection of variables

• Selection of rows

• Construction of new variables

• Transformation of variables

5. Model Building Phase

Model building is an iterative process. One needs to explore alternative models to find the most useful in solving the business problem.

Once you have decided on the type of prediction you want to make, you must choose a model type for making the prediction. This could be a decision tree, a neural net, a proprietary method, or logistic regression.

Based upon the results of building initial models, you may want to build another model using the same technique but different parameters. No tool or technique is perfect for all data, and it is difficult, if not impossible, to be sure before you start which technique will work the best.

Quintegra's strategy is to build numerous models until a satisfactory one is found that provides accurate results for the purpose.

6. Model Evaluation Phase

After building a model, you must evaluate its results. It is important to test the model in the real world. There is no guarantee that an accurate model reflects the real world.

A valid model is not necessarily a correct model. In addition, the data used to build the model may fail to match the real world in some unknown way, leading to an incorrect model.

For example, if a model is used to select a subset of a mailing list, do a test mailing to verify the model. If a model is used to predict credit risk, try the model on a small set of applicants before full deployment.
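Before a real-world trial of this kind, a model is commonly checked against held-out records whose outcomes are already known. The sketch below computes the simplest such measure, accuracy. The credit rule, its 0.375 cut-off and the applicant records are hypothetical and serve only to show the mechanics.

```python
def accuracy(model, records):
    """Fraction of (features, actual_label) records the model labels correctly."""
    if not records:
        raise ValueError("need at least one held-out record")
    hits = sum(1 for features, actual in records if model(features) == actual)
    return hits / len(records)


# Hypothetical model: a single rule on the debt-to-income ratio.
def credit_model(applicant):
    return "good" if applicant["debt_to_income"] <= 0.375 else "bad"


# Hypothetical held-out applicants with observed outcomes.
holdout = [
    ({"debt_to_income": 0.10}, "good"),
    ({"debt_to_income": 0.30}, "good"),
    ({"debt_to_income": 0.35}, "bad"),   # the rule gets this one wrong
    ({"debt_to_income": 0.60}, "bad"),
]

holdout_accuracy = accuracy(credit_model, holdout)
print(f"hold-out accuracy: {holdout_accuracy:.0%}")
```

A good hold-out score does not replace the test mailing or pilot described above, because the held-out data may share the same blind spots as the training data.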

TranSys analyzes the risk associated with an incorrect model first. The higher the risk associated with an incorrect model, the more importance is given to constructing an experiment to check the model results.

7. Results Deployment Phase

Once a data mining model is built and validated, its results can be deployed to applications within the enterprise. For example, the clusters identified by the model can be used to extract the rules that define the model and make recommendations for new observations.

TranSys may aid in developing an application in which the model is embedded. For example, the business rules component (derived from the model) can be integrated with a loan application system to facilitate evaluation of an applicant.

Knowledge Discovery in Databases (KDD)

The KDD process extends data mining by consolidating the discovered knowledge and then incorporating it into the operational systems.

Knowledge integration has been achieved to a fairly large extent in e-business compared to other domains. For example, discovered knowledge about customer behavior (essentially from click-stream data) has been used effectively to improve sites, personalize pages, improve promotional and other features, and enhance the buying experience.

The value of the discovered knowledge lies in its appropriate use. The focus should be on using data and knowledge strategically to provide a competitive edge.

TranSys, with its domain experience and practice excellence, can help clients to minimize the risks of running the KDD processes.
