
Academic year: 2021


White Paper

Data Mining and Knowledge Discovery

Knowledge Discovery

Organizations collect a vast amount of data in the process of carrying out their day-to-day business. The opening up of information systems through the World Wide Web has helped organizations amass huge amounts of data, ranging from interactions to transactions, that previously could not be collected at all or could be collected only at a high cost.

Organizations have perfected various techniques to churn this vast amount of data into reports that provide various facts and figures. This in turn has created an "information overload" rather than helping these organizations glean any knowledge from the data.

Data mining is the process of discovering and extracting patterns from data. Data mining applies a set of algorithms for the pattern extraction.

When these patterns are analyzed with the help of prior knowledge and proper interpretation, the process is called Knowledge Discovery.

The field of Knowledge Discovery in Databases (KDD) deals with data mining and interpretation of the data mining results to create knowledge from databases. In this paper we will discuss the KDD process, data mining algorithms, and the benefits of practicing KDD to businesses.

Generating knowledge from data

Data Mining

Data mining involves determining patterns from, or fitting models to, observed data. A typical data mining system may perform one or more of the following tasks:

Association: Association is the discovery of correlations among a set of items. The output is often expressed as a rule showing attribute-value conditions that frequently occur together. This type of analysis is widely used in analyzing data for direct marketing campaigns, sales catalog design and many other business decision-making processes.

Example - An association model might discover that, of all electronics customers under study, the 20-29 age group (10% of the set) with an income of 40-50K buys DVD players with 80% probability.
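The two numbers in a rule like this are usually called support and confidence. The sketch below shows how they could be computed from raw records. The toy data, the item names and the helper functions are hypothetical, and the 10% figure is read here as the share of customers matching the rule's conditions, which is only one interpretation of the example.

```python
from typing import FrozenSet, List

def count(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> int:
    """Number of records that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of all records that contain `itemset`."""
    if not transactions:
        return 0.0
    return count(transactions, itemset) / len(transactions)

def confidence(transactions: List[FrozenSet[str]],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Estimated probability of `consequent` given `antecedent`."""
    matching = count(transactions, antecedent)
    if matching == 0:
        return 0.0
    return count(transactions, antecedent | consequent) / matching

# Hypothetical toy data: 50 customers, 5 of them in the target group,
# 4 of those 5 bought a DVD player.
customers = (
    [frozenset({"age_20_29", "income_40_50k", "dvd_player"})] * 4
    + [frozenset({"age_20_29", "income_40_50k"})] * 1
    + [frozenset({"age_30_39", "income_50_60k"})] * 45
)
group = frozenset({"age_20_29", "income_40_50k"})
buys_dvd = frozenset({"dvd_player"})

print(f"support    = {support(customers, group):.0%}")
print(f"confidence = {confidence(customers, group, buys_dvd):.0%}")
```

A production association miner would also have to search for candidate rules; this sketch only scores one rule that is already given.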

Classification: Classification analyzes a set of training data and constructs a model for each class based on the features of the data. A decision tree or a set of classification rules is generated, which can then be used to better understand each class and to classify future data. Rich classification methods are inherited from machine learning, neural networks, statistics and other fields.

Classification has been quite useful in customer segmentation and credit analysis.

Example - A customer can be evaluated as a good or bad risk depending on income range, number of years in the job and the amount of debt he or she is carrying.
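As a rough illustration of learning a classification rule from training data, the sketch below fits a one-level decision tree (a "decision stump") to a handful of labelled customers. This is a deliberate simplification and not the method used in the paper; the feature order, the data and the function names are all assumptions.

```python
from typing import List, Tuple

# Hypothetical feature order: (income in thousands, years in job, debt in thousands)
Row = Tuple[float, float, float]
Stump = Tuple[int, float, bool]  # (feature index, threshold, high value means "good")

def predict_one(stump: Stump, row: Row) -> str:
    """Apply a single threshold rule to one customer."""
    feature, threshold, high_is_good = stump
    is_high = row[feature] >= threshold
    return "good" if is_high == high_is_good else "bad"

def fit_stump(rows: List[Row], labels: List[str]) -> Stump:
    """Try every feature/threshold/direction and keep the rule with fewest training errors."""
    best_errors = None
    best_stump = None
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            for high_is_good in (True, False):
                stump = (feature, threshold, high_is_good)
                errors = sum(predict_one(stump, r) != y for r, y in zip(rows, labels))
                if best_errors is None or errors < best_errors:
                    best_errors, best_stump = errors, stump
    return best_stump

# Hypothetical training data
training_rows = [(30, 1, 25), (45, 5, 10), (80, 10, 60), (60, 8, 5), (25, 2, 20), (90, 12, 15)]
training_labels = ["bad", "good", "bad", "good", "bad", "good"]

rule = fit_stump(training_rows, training_labels)
print("learned rule:", rule)
print("new applicant:", predict_one(rule, (50, 3, 12)))
```

On this toy data the learned rule splits on the debt column. A real decision tree would keep splitting each branch and would be checked on data it was not trained on.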

Prediction: Classification can be used to predict the class label of data objects. Prediction can also be used to estimate missing data values. The classification produces the appropriate business rule or decision tree from which the prediction can be made.

Example - A customer's potential expenditure using a credit card can be predicted based on the expenditure distribution of similar customers using that credit card. Genetic algorithms, regression analysis and neural networks are commonly used techniques for this purpose.
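The idea of predicting from "similar customers" can be sketched with a nearest-neighbour average. This is a stand-in chosen for brevity rather than one of the techniques the paper names (genetic algorithms, regression, neural networks). The features, figures and function name are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def predict_spend(known: List[Tuple[Sequence[float], float]],
                  query: Sequence[float],
                  k: int = 3) -> float:
    """Average the spend of the k known customers closest to `query`."""
    if not known:
        raise ValueError("need at least one known customer")
    ranked = sorted(known, key=lambda item: math.dist(item[0], query))
    nearest = ranked[:k]
    return sum(spend for _, spend in nearest) / len(nearest)

# Hypothetical data: ((income in thousands, age), monthly card spend)
known_customers = [
    ((40, 25), 300.0),
    ((42, 27), 340.0),
    ((45, 30), 320.0),
    ((90, 50), 1200.0),
    ((95, 55), 1500.0),
]

print(predict_spend(known_customers, (43, 28), k=3))
```

In practice the features would be scaled to comparable ranges first, since raw distances are otherwise dominated by whichever attribute has the largest numbers.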

Clustering: A cluster is a collection of objects that are similar to one another. Clustering analysis refers to identifying clusters embedded in the data. A good clustering method produces high-quality clusters, meaning that inter-cluster similarity is low and intra-cluster similarity is high. It is very commonly used for customer segmentation and deriving marketing strategies.

Example - The customer base may be clustered around certain sets of attributes that uniquely determine cluster membership, for example the location, income group and age group of the customers.
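One common way to find such clusters is k-means, sketched below in plain Python. The paper does not say which clustering algorithm is used, so this is only an illustrative choice; the points, the starting centers and the function name are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]

def kmeans(points: List[Point], centers: List[Point], iterations: int = 10):
    """Alternate between assigning points to the nearest center and re-averaging centers."""
    groups: List[List[Point]] = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

# Hypothetical customers described by two already-scaled attributes
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
start = [(0, 0), (10, 10), (0, 10)]

centers, groups = kmeans(points, start)
for center, members in zip(centers, groups):
    print(center, members)
```

The result depends on the starting centers and on the number of clusters chosen, so several runs are normally compared.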

Time-Series Analysis: This analysis is used to find regularities and interesting characteristics of data varying over time. It looks for sequential patterns, periodicity, trends and deviations.

[Figure: three customer clusters (Group 1, Group 2, Group 3) with their cluster centers marked "+"]

Example - Time-series analysis may be used for predicting sales quantities for different SKUs, based on demand patterns, market conditions and competitors' performance.
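A very small sketch of two of the ideas above, a trend-following forecast and deviation detection, using a moving average. It ignores market conditions and competitor data entirely. The sales figures, the window size and the tolerance are assumptions chosen for illustration.

```python
from typing import List, Sequence

def moving_average_forecast(history: Sequence[float], window: int = 3) -> float:
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` observations")
    return sum(history[-window:]) / window

def deviations(history: Sequence[float], window: int = 3, tolerance: float = 0.2) -> List[int]:
    """Indices whose value differs from the mean of the preceding window by more than `tolerance`."""
    flagged = []
    for i in range(window, len(history)):
        baseline = sum(history[i - window:i]) / window
        if baseline and abs(history[i] - baseline) / baseline > tolerance:
            flagged.append(i)
    return flagged

# Hypothetical weekly unit sales for one SKU
weekly_units = [120, 130, 125, 210, 135, 140, 145]

print("next week forecast:", moving_average_forecast(weekly_units))
print("unusual weeks:", deviations(weekly_units))
```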


Example Applications

Retail/Marketing

Identify buying patterns from customers

Associations among customer demographic characteristics

Predict response to mailing campaigns

Banking

Detect patterns of fraudulent credit card use

Classify customers for target marketing

Predict customers likely to change their credit card affiliation

Determine credit card spending by customer groups


Insurance

Claims analysis

Predict which customers will buy new policies

Identify behavior patterns of risky customers

Identify fraudulent behavior

Telecommunication

Call Behavior Analysis

Churn Analysis

Fraud Detection

Call Center Performance

e-Commerce

Recommendation System

Website Access Profiling

Personalization

Clickstream analysis for Web Insurance

Process of Data Mining

A systematic approach is essential to successful data mining. Quintegra Solutions has effectively used the process model described in this section.

It should be noted that the data mining process is not linear. The loops in the process model indicate that one or more previous steps may be revisited depending on the result at a given step.

For example, the results of the data exploration phase may require you to add new data to the database. Usually a number of initial models are built in order to arrive at a satisfactory model.

The Business Definition Phase and the Results Deployment Phase govern the effectiveness of the entire process.

The following is a brief description of the data mining process phases adopted by TranSys in providing the data mining solution:

1. Business Definition Phase

A prerequisite to knowledge discovery is a clear understanding of the business environment. This is required in order to appreciate opportunities for improvement, to prepare the data for mining, and to interpret the results correctly. A clear statement of business objectives will make the best use of the data mining effort. TranSys will work with clients to clearly define the business objective.

This definition stage will include a way of measuring the results of the data mining project and a cost justification.

2. Data Building Phase

In this phase the data to be mined is collected in a database. Depending on the amount and complexity of the data, even a flat file or a spreadsheet may often be adequate.

The required components of data may be sourced from a data warehouse, since a warehouse ensures the required cleanliness of the data. Other data from external sources may have to be integrated.

TranSys will perform the following tasks in order to achieve the objectives of this phase:

Collect the required data

Select the subset of data to be mined

Assess data quality and if required, cleanse the data

Consolidate and integrate the data

Load the data mining database


3. Data Exploration Phase

Understanding the data is very important. Graphing and visualization tools are a vital aid in understanding and preparing data. Data visualization often provides the "Aha!" moment that leads to new insights and success. Some of the most common and useful graphical displays are histograms and box plots, which show distributions of values.
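As a minimal illustration of a histogram, the sketch below counts values into fixed-width bins and prints one row of marks per bin. The customer ages, the bin width and the function names are hypothetical; a real project would normally use a plotting tool instead.

```python
from collections import Counter
from typing import Dict, Sequence

def bin_counts(values: Sequence[int], bin_width: int) -> Dict[int, int]:
    """Map the lower edge of each bin to the number of values falling in it."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    return dict(Counter((v // bin_width) * bin_width for v in values))

def render(counts: Dict[int, int], bin_width: int) -> str:
    """Draw one text row per non-empty bin, lowest bin first."""
    return "\n".join(
        f"{start}-{start + bin_width - 1}: {'#' * counts[start]}"
        for start in sorted(counts)
    )

# Hypothetical customer ages
ages = [21, 25, 34, 38, 39, 52]
age_counts = bin_counts(ages, 10)
print(render(age_counts, 10))
```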

TranSys will work closely with the functional team members to identify the most important attributes and fields in predicting an outcome and determine which derived values may be useful. They will use visualization, link analysis and other means of exploring data.

[Figure: The Data Mining Process, a loop through the Business Definition, Data Building, Data Exploration, Data Preparation, Model Building, Model Evaluation and Results Deployment phases]

4. Data Preparation Phase

This is the final phase before building models. It is often a good idea to sample the data when the database is large; if done carefully, sampling causes no significant loss of information. Data that is clearly extraneous needs to be identified and discarded. It is often necessary to construct new variables derived from the raw data.

For example, forecasting credit risk using a debt-to-income ratio rather than just debt and income as predictor variables may yield more accurate results that are also easier to understand. Data may also need to be discretized. For example, decision trees used for classification may require continuous data such as income to be grouped into ranges or bins, such as High, Medium and Low, or into given numeric ranges. The cut-off points for the bins may change the outcome of a model.
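The two preparation steps mentioned above, deriving a ratio and binning a continuous value, can be sketched as follows. The cut-off values and function names are hypothetical and chosen only to show that moving a cut-off changes the bin a customer lands in.

```python
def debt_to_income(debt: float, income: float) -> float:
    """Derived variable: debt as a fraction of income."""
    if income <= 0:
        raise ValueError("income must be positive")
    return debt / income

def income_bin(income: float, low_cutoff: float = 30_000, high_cutoff: float = 70_000) -> str:
    """Discretize income into Low / Medium / High using the given cut-off points."""
    if income < low_cutoff:
        return "Low"
    if income < high_cutoff:
        return "Medium"
    return "High"

print(debt_to_income(20_000, 50_000))
print(income_bin(50_000))                      # default cut-offs
print(income_bin(50_000, high_cutoff=45_000))  # same customer, different cut-off
```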

TranSys will perform the following tasks involved in this phase:

Selection of variables

Selection of rows

Construction of new variables

Transformation of variables

5. Model Building Phase

Model building is an iterative process. One needs to explore alternative models to find the one most useful in solving the business problem.

Once you have decided on the type of prediction you want to make, you must choose a model type for making the prediction. This could be a decision tree, a neural net, a proprietary method, or logistic regression.

Based upon the results of building initial models, you may want to build another model using the same technique but different parameters. No tool or technique is perfect for all data, and it is difficult, if not impossible, to be sure before you start which technique will work best.

Quintegra chooses the strategy of building numerous models before settling on a satisfactory one that provides accurate results for the purpose.

6. Model Evaluation Phase

After building a model, you must evaluate its results. It is important to test the model in the real world. There is no guarantee that an accurate model reflects the real world.

A valid model is not necessarily a correct model. In addition, the data used to build the model may fail to match the real world in some unknown way, leading to an incorrect model.

For example, if a model is used to select a subset of a mailing list, do a test mailing to verify the model. If a model is used to predict credit risk, try the model on a small set of applicants before full deployment.
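Before a real-world trial like this, a model is commonly checked on records held back from training. The sketch below shows a simple hold-out split and an accuracy measure. The data, the threshold rule and the function names are hypothetical, and a hold-out check does not replace the real-world test described above.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (features, known label)

def holdout_split(examples: List[Example], test_fraction: float = 0.25, seed: int = 0):
    """Shuffle reproducibly, then hold back a fraction of the examples for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict: Callable[[Sequence[float]], str], examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the known label."""
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)

# Hypothetical applicants: (debt-to-income ratio,) and the observed outcome
applicants = [
    ((0.10,), "good"), ((0.20,), "good"), ((0.50,), "bad"), ((0.60,), "bad"),
    ((0.30,), "good"), ((0.70,), "bad"), ((0.45,), "good"), ((0.35,), "bad"),
]

def simple_rule(features: Sequence[float]) -> str:
    """Stand-in for a trained model: flag ratios above 0.4 as bad risks."""
    return "bad" if features[0] > 0.4 else "good"

train, test = holdout_split(applicants)
print("held-out examples:", len(test))
print("accuracy on all applicants:", accuracy(simple_rule, applicants))
```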

TranSys analyzes the risk associated with an incorrect model first. The higher the risk associated with an incorrect model, the more importance is given to constructing an experiment to check the model results.

7. Results Deployment Phase

Once a data mining model is built and validated, its results may be deployed to applications within the enterprise. For example, the clusters identified by the model can be used to extract the rules that define the model and to make recommendations for new observations.

TranSys may aid in developing an application in which the model is embedded. For example, the business rules component derived from the model can be integrated with a loan application system to facilitate evaluation of an applicant.

Knowledge Discovery in Databases (KDD)

The KDD process extends data mining to consolidate the discovered knowledge and then incorporate this knowledge into the operational systems.

Knowledge integration has been achieved to a fairly large extent in the e-business domain compared to other domains. For example, discovered knowledge about customer behavior (essentially from click-stream data) has been used effectively to improve sites, personalize pages, improve promotional and other features, and enhance the buying experience.

The value of the discovered knowledge lies in its appropriate use. The focus should be on using data and knowledge strategically to provide a competitive edge.

TranSys, with its domain experience and practice excellence, can help clients to minimize the risks of running the KDD processes.
