White Paper
Data Mining and Knowledge Discovery
Knowledge Discovery
Organizations collect a vast amount of data in the course of their day-to-day business. Opening up information systems through the World Wide Web has helped organizations amass huge amounts of data, ranging from interactions to transactions, that previously could not be collected or could be collected only at high cost.
Organizations have perfected various techniques for turning this vast amount of data into reports that provide various facts and figures. This in turn has created an "information overload" rather than helping these organizations glean any knowledge from the data.
Data mining is the process of discovering and extracting patterns from data. Data mining applies a set of algorithms for the pattern extraction.
When these patterns are analyzed with the help of prior knowledge and proper interpretation, the process is called Knowledge Discovery.
The field of Knowledge Discovery in Databases (KDD) deals with data mining and the interpretation of data mining results to create knowledge from databases. In this paper we will discuss the KDD process, data mining algorithms, and the benefits of practicing KDD to businesses.
Generating knowledge from data
Data Mining
Data mining involves determining patterns from, or fitting models to, observed data. A typical data mining system may perform one or more of the following tasks:
Association: Association is the discovery of correlations among a set of items. The output is often expressed in the form of a rule showing attribute-value conditions that frequently occur together. This type of analysis is widely used in analyzing data for direct marketing campaigns, sales catalog design and many other business decision-making processes.
Example - An association model might discover that, of all the electronics customers under study, those aged 20-29 (10% of the set) with an income of 40-50K buy DVD players with 80% probability.
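An association rule of this kind is usually summarized by two numbers: support (the share of all records that satisfy the whole rule) and confidence (the share of records matching the conditions that also show the outcome). The following is a minimal sketch in Python, assuming a small hypothetical customer list; the field names, the records and the `rule_stats` helper are illustrative and not taken from any real data set.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    def matches(record, conditions):
        return all(record.get(key) == value for key, value in conditions.items())

    with_antecedent = [r for r in records if matches(r, antecedent)]
    with_both = [r for r in with_antecedent if matches(r, consequent)]
    # Support: fraction of all records that satisfy both sides of the rule.
    support = len(with_both) / len(records) if records else 0.0
    # Confidence: fraction of antecedent matches that also satisfy the consequent.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical records: 10 customers, 5 of them aged 20-29 earning 40-50K.
customers = (
    [{"age": "20-29", "income": "40-50K", "buys_dvd_player": True}] * 4
    + [{"age": "20-29", "income": "40-50K", "buys_dvd_player": False}] * 1
    + [{"age": "30-39", "income": "50-60K", "buys_dvd_player": False}] * 5
)

support, confidence = rule_stats(
    customers,
    antecedent={"age": "20-29", "income": "40-50K"},
    consequent={"buys_dvd_player": True},
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

With these made-up records, 4 of the 5 matching customers buy a DVD player, so the rule has a confidence of 0.80, which corresponds to the "80% probability" wording used in the example.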
Classification: Classification analyzes a set of training data and constructs a model for each class based on the features of the data. A decision tree or a set of classification rules is generated. This can then be used to understand each class better and to classify future data. Classification draws on a rich set of methods from machine learning, neural networks, statistics and other fields.
Classification has been quite useful in customer segmentation and credit analysis.
Example - A customer can be evaluated as a good or bad risk depending on income range, number of years in the job and the amount of debt he or she is carrying.
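To make the idea of constructing a model from training data concrete, the sketch below learns a one-level decision tree (a "decision stump") in Python: it searches every attribute for the single threshold that misclassifies the fewest training records. This is a deliberately simplified stand-in for a full decision tree learner, and the applicant records, attribute names and function names are all hypothetical.

```python
def train_stump(rows, labels, classes=("good", "bad")):
    """Find the one-attribute threshold rule with the fewest training errors."""
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds lie midway between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for above, below in (classes, classes[::-1]):
                errors = sum(
                    1
                    for row, label in zip(rows, labels)
                    if (above if row[feature] > threshold else below) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above, below)
    if best is None:
        raise ValueError("no attribute has two distinct values to split on")
    return best[1:]


def predict(stump, row):
    feature, threshold, above, below = stump
    return above if row[feature] > threshold else below


# Hypothetical training data (income and debt in thousands).
applicants = [
    {"income": 30, "years_in_job": 1, "debt": 20},
    {"income": 45, "years_in_job": 2, "debt": 25},
    {"income": 50, "years_in_job": 6, "debt": 10},
    {"income": 80, "years_in_job": 10, "debt": 15},
    {"income": 35, "years_in_job": 8, "debt": 5},
    {"income": 60, "years_in_job": 1, "debt": 40},
]
risk = ["bad", "bad", "good", "good", "good", "bad"]

stump = train_stump(applicants, risk)
print(stump)
print(predict(stump, {"income": 55, "years_in_job": 7, "debt": 12}))
```

On this toy data the learned rule splits on years in the job. A production classifier would grow a deeper tree and would be checked on data held back from training.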
Prediction: Classification can be used for predicting the class label of data objects.
Prediction can also be used to estimate missing data values. The classification produces the appropriate business rule or decision tree from which the prediction can be made.
Example - A customer's potential expenditure using a credit card can be predicted based on the expenditure distribution of similar customers using that credit card. Genetic algorithms, regression analysis and neural networks are the techniques most commonly used for this purpose.
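Of the techniques named above, regression analysis is the simplest to illustrate. The sketch below fits an ordinary least-squares line in Python, assuming (purely for illustration) that card expenditure is predicted from customer income; the figures and the function names `fit_line` and `predict_spend` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_spend(income, slope, intercept):
    return slope * income + intercept


# Hypothetical similar customers: annual income and card spend, both in thousands.
incomes = [30, 40, 50, 60]
spend = [3.5, 4.5, 5.5, 6.5]

slope, intercept = fit_line(incomes, spend)
estimate = predict_spend(45, slope, intercept)
print(f"predicted spend for income 45K: {estimate:.2f}K")
```

The same fitted line can also fill in a missing expenditure value for a customer whose income is known.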
Clustering: A cluster is a collection of objects that are similar to one another. Clustering analysis refers to identifying clusters embedded in the data. A good clustering method produces high-quality clusters, that is, clusters for which inter-cluster similarity is low and intra-cluster similarity is high. Clustering is very commonly used for customer segmentation and for deriving marketing strategies.
Example - The customer base may be clustered around certain sets of attributes that uniquely determine cluster membership, for example the location, income group and age group of the customers.
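One widely used way to find such clusters is the k-means procedure, which alternates between assigning each object to its nearest cluster center and moving each center to the mean of its members. The Python sketch below is a minimal version under stated assumptions: customers are described by two hypothetical numeric attributes (age, and income in thousands), the starting centers are supplied by hand, and the loop runs for a fixed number of passes instead of testing for convergence.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centers, iterations=10):
    """Return (centers, clusters) after a fixed number of assign/update passes."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: squared_distance(point, centers[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its members.
        centers = [
            mean_point(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customers as (age, income in thousands).
customers = [(25, 30), (27, 32), (26, 31), (45, 70), (47, 72), (46, 71)]
centers, clusters = kmeans(customers, centers=[(25, 30), (45, 70)])
print(centers)
```

In practice the attributes would normally be rescaled first, so that an attribute with a large numeric range does not dominate the distance calculation.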
Time-Series Analysis: This analysis is used to find regularities and interesting characteristics of data varying over time. It looks for sequential patterns, periodicity, trends and deviations.
[Figure: cluster centers for Group 1, Group 2 and Group 3]
Example - The time-series analysis may be used for predicting sales quantities for different SKUs, based on demand pattern, market condition and competitor's performance.
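A simple way to expose the trend in such a series is a moving average, which smooths out short-term fluctuations. The sketch below, in Python, uses made-up monthly sales quantities for a single SKU and treats the latest smoothed value as a naive forecast for the next period; it ignores seasonality, market conditions and competitor activity, which a real forecasting model would need to take into account.

```python
def moving_average(series, window):
    """Trailing moving average; the result has len(series) - window + 1 points."""
    if not 1 <= window <= len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical monthly sales quantities for one SKU.
sales = [100, 120, 110, 130, 150, 140]

smoothed = moving_average(sales, window=3)
naive_forecast = smoothed[-1]  # carry the latest smoothed level forward
print(smoothed, naive_forecast)
```

Here the smoothed values rise steadily even though the raw series moves up and down, which is the kind of trend the analysis is meant to reveal.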
Example Applications
Retail/Marketing
• Identify buying patterns from customers
• Associations among customer demographic characteristics
• Predict response to mailing campaigns
Banking
• Detect patterns of fraudulent credit card use
• Classify customers for target marketing
• Predict customers likely to change their credit card affiliation
• Determine credit card spending by customer groups
Insurance
• Claims analysis
• Predict which customers will buy new policies
• Identify behavior patterns of risky customers
• Identify fraudulent behavior
Telecommunication
• Call Behavior Analysis
• Churn Analysis
• Fraud Detection
• Call Center Performance
e-Commerce
• Recommendation System
• Website Access Profiling
• Personalization
• Clickstream analysis for Web Insurance
Process of Data Mining
A systematic approach is essential to successful data mining. Quintegra Solutions has effectively used the process model described in this section.
It should be noted that the data mining process is not linear. The loops in the process model indicate that one or more previous steps may be revisited, depending on the result at a given step.
For example, the results of the data exploration phase may require you to add new data to the database. Usually a number of initial models are built in order to arrive at a satisfactory model.
The Business Definition Phase and the Results Deployment Phase govern the effectiveness of the entire process.
The following is a brief description of the data mining process phases adopted by TranSys in providing the data mining solution:
1. Business Definition Phase
A prerequisite to knowledge discovery is a clear understanding of the business environment. This is required in order to appreciate opportunities for improvement, to prepare the data for mining, and to interpret the results correctly. A clear statement of business objectives will make the best use of the data mining effort. TranSys will work with clients to clearly define the business objective.
This definition stage will include a way of measuring the results of the data mining project and a cost justification.
2. Data Building Phase
In this phase the data to be mined is collected in a database. Depending on the amount and complexity of the data, even a flat file or a spreadsheet is often adequate.
The required components of the data may be sourced from a data warehouse, since data warehouses ensure the required cleanliness of the data. Other data from external sources may have to be integrated.
TranSys will perform the following tasks in order to achieve the objectives of this phase:
• Collect the required data
• Select the subset of data to be mined
• Assess data quality and if required, cleanse the data
• Consolidate and integrate the data
• Load the data mining database
3. Data Exploration Phase
Understanding the data is very important.
Graphing and visualization tools are a vital aid in understanding and preparing data. Data visualization most often provides the "Aha!" leading to new insights and success. Some of the common and very useful graphical displays of data are histograms or box plots that display distributions of values.
TranSys will work closely with the functional team members to identify the most important attributes and fields in predicting an outcome and determine which derived values may be useful. They will use visualization, link analysis and other means of exploring data.
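As a small illustration of the histogram display mentioned above, the Python sketch below counts values into fixed-width bins and prints a text bar for each bin. The income figures and the bin width are hypothetical; a real exploration would typically use a plotting tool.

```python
from collections import Counter


def histogram(values, bin_width):
    """Map each bin's lower bound to the number of values falling in that bin."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


# Hypothetical customer incomes in thousands.
incomes = [22, 35, 38, 41, 47, 52, 58, 63, 88]

bins = histogram(incomes, bin_width=20)
for start, count in bins.items():
    print(f"{start}-{start + 19}: {'#' * count}")
```

Even this crude display shows where most values lie and flags the isolated high value, which may deserve a closer look during data preparation.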
[Figure: The Data Mining Process. Phases shown: Business Definition, Data Building, Data Exploration, Data Preparation, Model Building, Model Evaluation, Results Deployment]
4. Data Preparation Phase
This is the final phase before building models. It is often a good idea to sample the data when the database is large. If done carefully, sampling loses little useful information. Data that is clearly extraneous needs to be identified and discarded. It is often necessary to construct new variables derived from the raw data.
For example, forecasting credit risk using a debt-to-income ratio, rather than just debt and income, as the predictor variable may yield more accurate results that are also easier to understand. Data may also need to be discretized. For example, some decision tree methods used for classification require continuous data such as income to be grouped into ranges or bins (High, Medium and Low, or explicit numeric ranges). The cut-off points for the bins may change the outcome of a model.
TranSys will perform the following tasks involved in this phase:
• Selection of variables
• Selection of rows
• Construction of new variables
• Transformation of variables
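The construction and transformation of variables can be sketched in a few lines of Python. The example below derives a debt-to-income ratio and then discretizes it into Low, Medium and High bins. The cut-off points, labels and function names are assumptions chosen for illustration only, and the last line shows how moving the cut-offs changes the bin a customer falls into.

```python
from bisect import bisect_right


def debt_to_income(debt, income):
    """Derived variable: debt expressed as a fraction of income."""
    return debt / income


def discretize(value, cut_points, labels):
    """Return the label of the bin that value falls into.

    cut_points must be sorted, and labels needs one more entry than cut_points.
    """
    return labels[bisect_right(cut_points, value)]


LABELS = ("Low", "Medium", "High")

ratio = debt_to_income(debt=15, income=50)             # 0.3
first_binning = discretize(ratio, (0.2, 0.4), LABELS)   # hypothetical cut-offs
second_binning = discretize(ratio, (0.35, 0.5), LABELS)  # different cut-offs
print(ratio, first_binning, second_binning)
```

The same customer is binned as Medium under the first set of cut-offs and as Low under the second, so the choice of cut-off points should be reviewed together with the business users.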
5. Model Building Phase
Model building is an iterative process. One needs to explore alternative models to find the most useful in solving the business problem.
Once you have decided on the type of prediction you want to make, you must choose a model type for making the prediction. This could be a decision tree, a neural net, a proprietary method, or logistic regression.
Based upon the results of building initial models, you may want to build another model using the same technique but different parameters. No tool or technique is perfect for all data, and it is difficult, if not impossible, to be sure before you start which technique will work the best.
Quintegra chooses the strategy of building numerous models before finding a satisfactory one that provides accurate results for the purpose.
6. Model Evaluation Phase
After building a model, you must evaluate its results. It is important to test the model in the real world. There is no guarantee that a model that appears accurate reflects the real world.
A valid model is not necessarily a correct model. In addition, the data used to build the model may fail to match the real world in some unknown way, leading to an incorrect model.
For example, if a model is used to select a subset of a mailing list, do a test mailing to verify the model. If a model is used to predict credit risk, try the model on a small set of applicants before full deployment.
TranSys analyzes the risk associated with an incorrect model first. The higher the risk associated with an incorrect model, the more important it is to construct an experiment to check the model results.
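One way to read the result of a test mailing is to compare the response rate of the model-selected group with that of a randomly selected control group. The Python sketch below computes this ratio, often called lift. The mailing counts are hypothetical, and the use of a random control group is an assumption of this sketch, not a step prescribed by the process above.

```python
def response_rate(responses, mailed):
    return responses / mailed


def lift(model_responses, model_mailed, control_responses, control_mailed):
    """Ratio of the model group's response rate to the control group's rate."""
    control_rate = response_rate(control_responses, control_mailed)
    if control_rate == 0:
        raise ValueError("control group has no responses; lift is undefined")
    return response_rate(model_responses, model_mailed) / control_rate


# Hypothetical test mailing: 200 pieces to each group.
observed_lift = lift(
    model_responses=50, model_mailed=200,
    control_responses=25, control_mailed=200,
)
print(f"lift = {observed_lift:.1f}")
```

A lift clearly above 1 suggests that the model selects better prospects than random choice would. With small test mailings the difference should also be checked for statistical significance before full deployment.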
7. Results Deployment Phase
Once a data mining model is built and validated, its results can be deployed to applications within the enterprise. For example, the clusters identified by the model can be used to extract the rules that define the model and to make recommendations for new observations.
TranSys may aid in developing an application in which the model is embedded. For example, the business rules component (derived from the model) can be integrated with a loan application system to facilitate evaluation of an applicant.
Knowledge Discovery in Databases (KDD)
The KDD process extends data mining by consolidating the discovered knowledge and then incorporating it into the operational systems.
Knowledge integration has been achieved to a fairly large extent in the e-business domain compared to other domains. For example, discovered knowledge about customer behavior (essentially from click-stream data) has been used effectively to improve sites, personalize pages, improve promotional and other features, and enhance the buying experience.
The value of the discovered knowledge lies in its appropriate use. The focus should be on the strategic use of data and knowledge that can provide a competitive edge.
TranSys, with its domain experience and practice excellence, can help clients to minimize the risks of running the KDD processes.