Development of Data Mining System to Analyze Cars using TkNN Clustering Algorithm

(1)

Abstract— Now a day’s customers are required comfort and their loving brand & color. With the advent of the Internet and Data Mining Algorithms has undoubtedly contributed to the shift of marketing focus. Conventional way of business is a challenging in car market due to many competitors are there around the world for providing competitive products. The car manufacturers categorizes the car users and have to invent a suitable car; the seller correctly groups the buyers and he sells a right car; and the customers selects best car by analyzing more brands of cars with ‘N’ number of sellers. These three cases they spent too much of time for analyzing old or statistical data for choosing a right product.

In this paper, we proposed TkNN algorithm for best car market analysis. We have executed the same in Weka Tool with Java code. We analyzed the graphical performance analysis between KNN and our novel TkNN clustering algorithms with Classes to Clusters evaluation purchase, safety, luggage booting, persons (seating capacity), doors, maintenance and buying attributes of customer’s requirements for unacceptable/acceptable/good/very good ratings of a car to purchase.

Index Terms—Clustering, Unsupervised Learning, Weka, KNN, TkNN.

I. INTRODUCTION

It may seem that customer relation is applicable only for managing relationships between businesses and consumers. A closer examination of consumer, it reveals that it is even more crucial for business customers. In “business-to-business” environments, a tremendous amount of information is exchanged on a regular basis. For example, transactions are more numerous, custom contracts are more diverse, and pricing schemes are more complicated. Consumer relations are facilitating to smooth the process when various representatives of seller and buyer companies communicate and collaborate [11]. Customized catalogues, personalized business portals, and targeted product discounts can simplify the procurement process and improve efficiencies for both companies. E-mail alerts and new product information tailored to different roles in the buyer company can help increase the effectiveness of the sales pitch. Trust and authority are enhanced if targeted academic reports or industry news are delivered to the relevant individuals. All of these can be considered among the benefits

M.Jayakameswaraiah, Dept.of Computer Science,Sri Venkateswara University,Tirupati,Andhrapradesh.

Prof.S.Ramakrishna,Dept.of Computer Science,Sri Venkateswara University,Tirupati,Andhrapradesh.

of customer interactions.

A new business culture is developing today [4, 7, 24, 19]. Within it, the economics of customer relationships are changing in fundamental ways, and companies are facing the need to implement new solutions and strategies that address these changes. The concepts of mass production and mass marketing, first created during the Industrial Revolution, are being supplanted by new ideas in which customer relationships are the central business issue [1, 5]. Firms today are concerned with increasing customer value through analysis of the customer lifecycle. The tools and technologies of data warehousing, data mining, and other customer relationship techniques afford new opportunities for businesses to act on the concepts of relationship marketing. The old model of “design-build-sell” (a product-oriented view) is being replaced by “sell-build-redesign” (a customer-oriented view). It is a spiral model of software engineering [26]. The traditional process of mass marketing is being challenged by the new approach of one-to-one marketing. In the traditional process, the marketing goal is to reach more customers and expand the customer base [31, 11, 5]. But given the high cost of acquiring new customers, it makes better sense to conduct business with current customers. In so doing, the marketing focus shifts away from the breadth of customer base to the depth of each customer’s needs.

Economic growth of a country depends on transportation as one constraint. Like many economic activities that are intensive in the use of infrastructures, the transport sector is an important component of the economy impacting on development and the welfare of population. A relation between the quantity and quality of transport infrastructure and the level of economic development is apparent. When transport systems are efficient, they provide economic and social opportunities and benefits that result in positive multipliers effects such as better accessibility to markets, employment and additional investments.

The lifecycle of a modern car comprises a multitude of complex and interdependent tasks that start early during the development phase, guide and advice the production process and keep track of issues related to operating vehicles. We will present examples of data mining applications from all these three stages: development, production planning and fault analysis. All contributions share the property that we use (or extract) rule patterns to explain the domain under analysis to the user. Rules (in form of association rules) are a

well-Development of Data Mining System to

Analyze Cars using T

k

NN Clustering Algorithm

(2)

dependencies. The inherent interdisciplinary character of the automobile development and manufacturing process requires models that are easily understood across application area boundaries. The understanding of patterns can be greatly enhanced by providing powerful visualization methods alongside with the analysis tools.

The next section will briefly sketch the underlying theoretical frameworks, after which we will present and discuss successfully applied fault analysis, planning and development methods, all of which have been rolled out to production sites of two large automobile manufacturers.

Section 2 provides some notations needed for the rest of the article. Section 3 discusses an approach based on graphical models to assess and reveal potential fault patterns inside vehicle data. Section 4 deals with the handling of production planning. Section 5 discovers rules from time series that are created by prototype simulations. Finally, section 6 concludes the findings.

II.BACKGROUND

Some people preferred only high cost cars, some are low price with all features and others are in between these classes. One more category of people are only knowing information about different brands but they never buys. Selecting the suitable car is extremely tricky job if parameters (color, comfort, seating capacity, maintenance, price, and so on) are known otherwise it is difficult task. If the customer knows these all things then also sometimes it is hard to choose the right car.

The problem is unmanageable in the perspective of manufacturer and seller, because they must work with different categories of people [31]. The need to increase the productivity of manufacturer, raises the seller transactions and customer satisfy of the selected car comforts. Comfort transportation is encourages frequency of vehicle usage, it increases Economic growth as well as it decreases wastage of time in journey. Car is the symbol of comfortable.

If this system is available then there are no capabilities required to assist with telecom expense management, i.e., the administrator can find out the number of calls and text messages used as well as cellular and WiFi data usage, both for home and roaming networks [19].

We will now briefly discuss the notational underpinning that is needed to present the ideas and results from the industrial applications.

A.Unsupervised learning

In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.

Unsupervised learning is closely related to the problem of density estimation in statistics. However unsupervised learning also encompasses many other techniques that seek to

summarize and explain key features of the data. Many methods employed in unsupervised learning are based on data mining methods used to preprocess data.

Approaches to unsupervised learning include:

 clustering (e.g., k-means, mixture models, hierarchical clustering)

 hidden Markov models,

 blind signal separation using feature extraction techniques for dimensionality reduction (e.g., principal component analysis, independent component analysis, non-negative matrix factorization, singular value decomposition)

Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties. The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same clusters by means of a user-defined constant called the vigilance parameter. ART networks are also used for many pattern recognition tasks, such as automatic target recognition and seismic signal processing. The first version of ART was "ART1", developed by Carpenter and Grossberg (1988).

B.Clustering

A common need in analyzing large quantities of data is to divide the dataset into logically well distinct groups, such that the objects in each group share some property which does not hold (or holds much less) for other objects. As such, clustering searches a global model of data, usually with the main focus on associating each object with a group (i.e., a cluster), even though in some cases we are interested (also) in understanding where clusters are located in the data space.

1)Importance of Clustering

Data clustering is one of the main tasks of data mining and pattern recognition. Moreover, it can be used in many applications such as:

1 Data compression. 2 Image analysis. 3 Bioinformatics.

4 Academics.

5 Search engines.

6 Wireless sensor networks. 7 Intrusion detection. 8 Business planning.

C.About Weka tool

(3)

1)Main Features

Some of WEKAs main features are the following:

Data preprocessing - WEKA supports a couple of popular text file formats such as CSV, JSON and Matlab ASCII files to import data along with their own file format ARFF. They also have support to import data from databases through JDBC. Besides importing data, they have a wide collection of supervised as well as unsupervised filters to apply on your data to facilitate further analysis.

Data classification - A huge collection of algorithms have been implemented to perform classification on data sets. These include Bayesian algorithms, mathematical functions such as support vector machines, lazy classifiers implementing nearest-neighbor calculations; Meta based algorithms as well as rule and tree-based classifiers.

Data clustering - A couple of algorithms for clustering exist such as variations of the k-mean method as well as density and hierarchical based clustering algorithms.

Attribute association - Methods to analyze data using association rule learners. Association rules can be seen as rules describing relations between attributes in a data set.

Attribute selection - Methods to evaluate which attribute contribute the most when predicting an outcome.

Data visualization - Depending on the methods used to analyze the data, this view can to plot data against suitable variables as well as give tools to analyze specific points further.

2)Preprocessing data

A majority of the classifiers are built to analyze numeric and nominal values. This was a problem since the data in the data sets consist of string attributes only, so the method used had to transform this into something useful. Lucky enough, WEKA had a filter for doing just this. Using the StringToWordVector-filter, it was possible to convert each string entry in our data into a set of attributes describing the string as well as information about its occurrence in the data set.

3)The selected WEKA algorithms

These are the selected WEKA algorithms I chose to analyze since whey where implemented in the WEKA suite and ready to use directly. The decision to use the following algorithms was based on the efficiencies seen in the reports I read about data classification. We tried to pick at least one type of classifier from each of the major classifier groups and ended up with the following:

 Based on the Bayes Theorem - Naive Bayes

 Function oriented - SMO

 Lazy learners - IBk

 Rules - DecisionTable & JRip

 Trees - J48 & REPTree

Here’s a small description of them as well as references for further reading.

Naive Bayes - A classic implementation of the Naive Bayes algorithm. For more information on this Naive Bayes implementation, see George H, John and Pat Langley (1995) [10].

SMO - An implementation of sequential minimal

optimization algorithm that uses support vectors. It has built in support to handle multiple classes using pair wise classification.

IBk - A classic implementation of the k-nearest neighbours classifier. Note that we settled with using k=1 for my analysis since the model didn’t get any additional performance by raising this. Note that this algorithm is of the lazy type which does all the calculations from memory while classifying, so by having a low k the running time decreased to a somewhat reasonable level.

J48 - An open source implementation of the C4.5 algorithm that builds a decision tree using information entropy. That means that when building the decision tree, C4.5 will at each node select the attribute that most successfully splits its set of samples seen to the difference in entropy that the selected sub tree generates.

REPTree - A quick decision tree classifier that use information gain as well as pruning to classify the data.

DecisionTable - A classifier that uses a simple decision table.

JRip - This is a propositional rule learner that will start with a big rule set and then keep simplifying the rule by removing conditions that creates the largest reduction of error.

4)Procedure for executing Algorithms in Weka

In this section, we are taken ID3 algorithm is used for explanation.

The appropriate buttons are used to start respective applications. To execute ID3 algorithms, we have to click explorer link in application menu drop down list. The following screen is Explorer window.

[image:3.612.322.569.443.631.2]

By default this screen having disabled tabs of implemented algorithms of weka tool. To execute any algorithm we require data set. In this weka tool the data set is able to pass to our algorithms via .arff files.

Figure 1: Weka Screen

(4)

Figure 2: Sample Algorithm (ID3) Simulation

III. KNN

The k-nearest neighbor (KNN) algorithm is a simple and one of the most intuitive machine learning algorithms that belongs to the category of instance-based learners. Instance-based learners are also called lazy learner because the actual generalization process is delayed until classification is performed, i. e., there is no model building process. Unlike most other classification algorithms, instance-based learners do not abstract any information from the training data during the learning (or training) phase. Learning (training) is merely a question of encapsulating the training data, the process of generalization beyond the training data is postponed until the classification process.

KNN is based on the principle that instances within a dataset will generally exist in close proximity to other instances that have similar properties. If the objects are tagged with a classification label, objects are classified by a majority vote of their neighbors and are assigned to the class most common amongst its k-nearest neighbors. k is a usually rather small odd (to avoid tied votes) positive number and the correct classification of the neighbors is known a priori. The objects can be considered n-dimensional points within an n-n-dimensional instance space where each point corresponds to one of the n features describing the objects. The distance or “closeness” to the neighbors of an unclassified object is determined by using a distance metric (also called similarity function), for example the Euclidean distance or the Manhattan distance. A survey of different distance metrics for kNN classification can be found, for example, n.

The high degree of local sensitivity makes kNN highly susceptible to noise in the training data – thus, the value of k strongly influences the performance of the kNN algorithm. The optimal choice of k is a problem dependent issue, but techniques like cross-validation can be used to reveal the optimal value of k for objects within the training set. Figure 4.1 is an example of how the choice of k affects the outcome of the kNN algorithm. In Figure 4.1 (a), if k is set to 7 or smaller, a majority count of the k-nearest neighbors of the object under investigation (qi) returns the (obviously correct) class crosses. Increasing k to 9 (all objects inside the big circle) would result in predicting qi to class circles. On the contrary, setting k to 1 or 9 in Figure 4.1 (b) returns class circles, if k is set to 3, 5, or 7, the kNN algorithm returns

[image:4.612.55.528.47.190.2]

class crosses.

Figure 3: KNN classifier.

A.Algorithm for Clustering KNN

Data set has n attributes which we combine to form an n-dimensional vector:

x = (x1, x2, . . . , xn )

These n attributes are considered to be the independent variables. Each sample also has another attribute, denoted by y (the dependent variable), whose value depends on the other n attributes x. We assume that y is a category variable, and there is a scalar function, f, which assigns a class, y = f (x) to every such vectors.

We do not know anything about f (otherwise there is no need for data mining) except that we assume that it is smooth in some sense.

We suppose that a set of T such vectors are given together with their corresponding classes:

x(i) , y(i) for i = 1, 2, . . . , T . This set is referred to as the training set.

The problem we want to solve is the following. Supposed we are given a new sample where x = u. We want to find the class that this sample belongs. If we knew the function f, we would simply compute v = f (u) to know how to classify this new sample, but of course we do not know anything about f except that it is sufficiently smooth.

The idea in k-Nearest Neighbor methods is to identify k samples in the training set whose independent variables x are similar to u, and to use these k samples to classify this new sample into a class, v. If all we are prepared to assume is that f is a smooth function, a reasonable idea is to look for samples in our training data that are near it (in terms of the independent variables) and then to compute v from the values of y for these samples.

When we talk about neighbors we are implying that there is a distance or dissimilarity measure that we can compute between samples based on the independent variables. For the moment we will concern ourselves to the most popular measure of distance: Euclidean distance.

The Euclidean distance between the points x and u is

d x, u = xi− ui 2 n

i = 1

We will examine other ways to measure distance between points in the space of independent predictor variables when we discuss clustering methods.

(5)

using a single nearest neighbor to classify samples can be very powerful when we have a large number of samples in our training set. It is possible to prove that if we have a large amount of data and used an arbitrarily sophisticated classification rule, we would be able to reduce the misclassification error at best to half that of the simple 1-NN rule.

For k-NN we extend the idea of 1-NN as follows. Find the nearest k neighbors of u and then use a majority decision rule to classify the new sample. The advantage is that higher values of k provide smoothing that reduces the risk of over-fitting due to noise in the training data. In typical applications k is in units or tens rather than in hundreds or thousands. Notice that if k = n, the number of samples in the training data set, we are merely predicting the class that has the majority in the training data for all samples irrespective of u. This is clearly a case of over-smoothing unless there is no information at all in the independent variables about the dependent variable.

Input: D, T=The set of k training objects, and test object z = (x I, y I ) Output: 𝑦′ =𝑎𝑟𝑔𝑚𝑎𝑥_𝑣 𝑥𝑖,𝑦𝑖 ∈ 𝐷𝑧 𝐼 𝑣 = 𝑦𝑖

Algorithm

kNN Algorithm(Set startAndEndPoint, real � , int MinC) Begin

Compute , the distance between z and every object, . select , the set of k closet training objects to z. End

IV. ALGORITHM FOR TKNN Input: Training data set

Output: Decision Tree Output: Clusters Algorithm

TkNN (Attributesvalues V, Y𝓛) Begin

1. Construct TkNN graph among instances.

2. Initialize the similarities on each edge as Wiz= exp 𝑥𝑖 − 𝑥𝑧

2

2𝜎2 and normalize to 𝑊𝑧 𝑖𝑧= 1.

3. Determine the 𝛼𝑢𝑗 values for all unlabeled data.

4. Compute the label set prediction matrix P.

5. Predict label set for each unbalanced instance by 𝑦𝑖= 𝑆𝑖𝑔𝑛 𝑃𝛼𝑖 ∀𝑖 ∈ 𝑢

End

V.RESULT ANALYSIS

A.Test Results for Clustering KNN and TkNN Algorithm for Car Market Dataset

We are considered car data set from UCI repository to compare KNN and TkNN Algorithms.

In this section we are KNN and TkNN algorithms. The following parameters are used to compare and contrast these two algorithms.

 Number of iterations

 Within cluster sum of squared errors

The following testing scenarios are used to identifying accuracy of KNN and TkNN algorithms,

 Percentage split

 Training set

 Classes to Clusters evaluation

KNN

S.

N

o

Test Option Number of iterations

Sum of within cluster

distances

1 Percentage split 33% 3 2168

4 Training Set 2 6872

5 Classes to Clusters evaluation

purchase attribute 2 6354

safety attribute 2 5720

luggage booting attribute 2 5900 8 Classes to Clusters evaluation

persons attribute 2 5720

doors attribute 2 5702

maintenance attribute 2 5702 11 Classes to Clusters evaluation

[image:5.612.39.544.98.587.2]

buying attribute 2 5702

Table 1: Test Results of KNN Algorithm with two parameters

TkNN

S.

N

o

Test Option

Number of iterations

Sum of within cluster

distances

4 Training Set 2 6722

purchase attribute 2 6204

safety attribute 2 5720

luggage booting attribute 2 5570 8 Classes to Clusters evaluation

persons attribute 2 5720

doors attribute 2 5534

maintenance attribute 2 5534

buying attribute 2 5534

Table 2: Test Results of TkNN Algorithm with two parameters

1)Percentage split

(6)

[image:6.612.58.290.45.238.2]

Figure.4: Test Result on Car Data Set with Percentage Split of 66% using KNN Algorithm

[image:6.612.316.563.79.286.2]

TkNN Algorithm using Percentage Split 66%

Figure 5: Test Result on Car Data Set with Percentage Split of 66% using TkNN Algorithm

2)Training set

KNN Algorithm Using Training Set

Figure 6: Test Result on Car Data Set with Training Set using KNN Algorithm

[image:6.612.55.297.256.461.2]

TkNN Algorithm Using Training Set

Figure 7: Test Result on Car Data Set with Training Set using TkNN Algorithm

3)Classes to Clusters evaluation

KNN Algorithm using Classes to Clusters evaluation purchase attribute

Figure 8: Test Result on Car Data Set with Classes to Clusters evaluation purchase attribute

[image:6.612.315.563.357.545.2] [image:6.612.55.289.514.713.2]

(7)

[image:7.612.55.292.43.234.2]

Figure 9: Test Result on Car Data Set with Classes to Clusters evaluation purchase attribute

4) Visualize Curve

[image:7.612.312.569.139.392.2]

KNN Algorithm using Training Set

Figure 10: Visualize cluster assignments for Car Data Set using KNN Algorithm

[image:7.612.49.299.270.477.2]

TkNN Algorithm using Training Set

Figure 11: Visualize cluster assignments for Car Data Set using TkNN Algorithm

B. Compare and contrast Test Results with Graphical performance Analysis of Clustering KNN and TkNN Algorithms

In this section we are compared KNN and TkNN algorithms. The following parameters are used to compare and contrast these two algorithms.

 Number of iterations

 Sum of within cluster distance

1)Number of Iterations

Figure 12: Comparison of KNN and TkNN with Number of Iterations

2)Sum of within cluster distances

In Weka normalize all attribute values to lie between zero and one, so that each attribute has the same influence on the overall distance. The error/distance reported is simply the sum of the intra-cluster distance of each instance to its respective cluster centroid.

Figure 13: Comparison of KNN and TkNN with Sum of within cluster distances

0 1 2 3 4 5 6

Number of iterations

KNN Algorithm

TkNN Algorithm

0 1000 2000 3000 4000 5000 6000 7000 8000

Sum of within cluster distances

KNN Algorithm

[image:7.612.312.569.489.700.2]

(8)

In this chapter we have shown KNN and our novel TkNN clustering algorithms. We have also executed the same in Weka Tool with Java code and compared the performance of two algorithms based on different Percentage Splits to help the car seller/manufacturer for analyzing their customer views in purchasing a car.

We analyzed the graphical performance analysis between KNN and our novel TkNN clustering algorithms with Classes to Clusters evaluation purchase, safety, luggage booting, persons (seating capacity), doors, maintenance and buying

attributes of customer’s requirements for

unacceptable/acceptable/good/very good ratings of a car to purchase.

REFERENCES

[1] Alex Berson, Kurt Thearling, “Stephen J.Smith, Building Data Mining Applications for CRM”, eBook, 488 pages

[2] Cabibbo and R. Torlone, “An architecture for data warehousing supporting data independence and interoperability: an architecture for data warehousing”, International Journal of Cooperative Information Systems, vol. 10, no. 3, 2001.

[3] Calvanese, G. D. Giacomo, M. Lenzerini, D. Nardi, and R. Rosati, “Data integration in data warehousing”, International Journal Of Cooperative Information Systems, vol. 10, no. 3, pp. 237, 2001.

[4] Cao, L.; Yu, P.S.; Zhang, C.; Zhang, H., Data Mining for Business Applications,

[5] Christoph F. Eick, Nidal Zeidat, and Zhenghong Zhao, "Supervised Clustering – Algorithms and Benefits", 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), 2004 [6] D. Mukherjee and D. D'souza, “Think phased implementation for

successful data warehousing”, Information Systems Management, vol. 20, no. 2, pp. 82, 2003.

[7] D. Olson and S. Yong, “Introduction to Business Data Mining”, McGraw Hill International Edition,2006.

[8] Ester, H. P. Kriegel, J. Sander, M. Wimmer, and X. Xu, “Incremental clustering for mining in a data warehousing environment”, Proceedings Of The 24th Vldb Conference, New York, USA, 1998.

[9] Fong, Q. Li, and S. Huang, “Universal data warehousing based on a meta-data modeling approach”, International Journal Of Cooperative Information Systems, vol. 12, no. 3, pp. 325, 2003.

[10] IDC & Cap Gemini, “Four elements of customer relationship management:, Cap Gemini White Paper.ISBN: 978-0-387-79419-8,2009 [11] J. A. Berry and G. S. Linoff, “Data Mining Ttehniques: For Marketing, Sales, and Customer Support”, New York: John Wiley & Sons, Inc, 2004.

[12] J. Hand and W. E. Henley,” Statistical classification methods in consumer credit scoring: A review”, Journal of the Royal Statistical Society: Series A (Statistics in Society), 160(3), 523–541,1997. [13] J. Parzinger, “Creating competitive advantage through data

warehousing”, Information Strategy: The Executive's Journal, vol. 17, no. 4, pp. 10-15, 2001.

[14] James A.O. Brien, “Management Information System,Tata Mc Graw Hill publication company Limited” , New Delhi,2009

[15] Jorgensen, F. and Wentzel-Larsen, T, ”Forecasting car holding,scrappage and new car purchase in Norway”, Journal of Transport Economics and Policy 24(2), 139-156,1990.

[16] K. Hian and K.L. Chan, “Going concern prediction using data mining techniques”, Managerial Auditing Journal, Vol 19, No 3, 462-476, 2004. [17] K.J. Cios and L.A. Kurgan, “Trends in data mining and knowledge discovery”, Advanced Information and Knowledge Processing, 1-26,2005.

[18] Kalton, K. Wagstaff, and J. Yoo, “Generalized Clustering,Supervised Learning, and Data Assignment,” Proceedings of theSeventh International Conference on Knowledge Discovery and DataMining, ACM Press, 2001.

[19] Kerdprasop, and K. Kerdpraso, “Moving data mining tools toward a business intelligence system”, Enformatika;, vol. 19, pp. 117-122, 2007.

[20] Lapersonne, G. Laurent and J-J Le Goff, Consideration sets of size one: An empirical investigation of automobile purchases. International Journal of Research in Marketing 12, 55-66,1995.

[21] Liqiang and J. Howard, “Interestingness measures for data mining: a survey”, ACM Computing Surveys, vol. 38, no. 3, pp. 1-32, 2006. [22] Mannila, “Methods and problems in data mining”, Proceedings Of

International Conference On Database Theory, Delphi, Greece, 1997. [23] Moutinho, L., Davies, F. and Curry, B, ”The impact of gender on car

buyer satisfaction and loyalty”, Journal of Retailing and Consumer Services 3(3), 135-144,1996.

[24] Paolo, “Applied Data Mining for Business and Industry”, John Wiley & Sons, 2003.

[25] Petrissans A. “Customer relationship management: the changing economics of customer relationships”, White Paper prepared by Cap Gemini and International Data Corporation, 1999.

[26] R. Nayak and T. Qiu, “A data mining application: analysis of problems occurring during a software project development process”, International Journal Of Software Engineering & Knowledge Engineering, vol.15, no. 4, pp. 647-663, 2005.

[27] S. Bongsik, “An exploratory investigation of system success factors in data warehousing”, Journal Of The Association For Information Systems, vol. 4, pp. 141-168, 2003.

[28] S. Fortune and J.E. Hopcroft. "A note on Rabin's nearest-neighbor algorithm", Information Processing Letters,8(1), pp. 20-23, 1979 [29] S. Lee, S. Hong, and P. Katerattanakul, “Impact of data warehousing on

organizational performance of retailing firms”, International Journal of Information Technology & Decision Making, vol. 3, no. 1, pp. 61-79, 2004.

[30] Sen and A. P. Sinha, “A comparison of data warehousing methodologies”, Communications of The ACM, vol. 48 Issue 3, pp. 79- 84, 2005.

[31] Sun and Morwitz, V.G.,” Stated intentions and purchase behavior: A unified model”, International Journal of Research in Marketing,Volume 27( 4), 356-366,2010.

[32] Thanh N. Tran, RonWehrens, Lutgarde M.C. Buydens , KNN-kernel density-based clustering for high-dimensional multivariate data , ELSEVIER 2005.

[33] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery in databases”, American Association For Artificial Intelligence. AI Magazine, pp. 37-54, 1996.

[34] Usama et al. “On the Handling of Continuous-Values Attributes in Decision Tree Generation”. University of Michigan, Ann Arbor. [35] Vaduva and T. Vetterli, “Metadata management for data warehousing:

an overview”, International Journal of Cooperative Information Systems, vol. 10, no. 3, pp. 273, 2001.

[36] W. Hugh, A. Thilini, Jr. Matyska, and J. Robert, “Data warehousing stages of growth”, Information Systems Management; vol. 18, no. 3, pp.42-51, 2001.

[37] W. Smith, “Applying data mining to scheduling courses at a university“, Communications Of AIs; vol. 2005, no. 16, pp. 463-474, 2005. [38] Y. Freund and R. Schapire, “A decisiontheoretic generalization of

on-line learning and an application to boosting”, in Proceedings of the Second European Conference on Computational Learning Theory, 23-37 (Springer Verlag, 1995).

[39] Yang Song, Jian Huang, Ding Zhou, Hongyuan Zha, and C. Lee Giles , “IKNN: Informative K-Nearest Neighbor Pattern Classification”, PKDD 2007, LNAI 4702, pp. 248–264, 2007.

[40] Chablo E, Marketing Director,“The importance of marketing data intelligence in delivering successful CRM”, smartFOCUS Limited. 1999. [http://www.crm-forum.com/crm-forum-white-papers/mdi/sld01.htm]

[41] Data mining for Business Intelligence. [www.dataminingbook.com] [42] M. Matteucci” A Tutorial on Clustering Algorithms”,2008.

[http://home.dei.polimi.it/matteucc/Clustering/tutorial_html]

[43] Transportation and Economic Development. [https://people.hofstra.edu/geotrans/eng/ch7en/conc7en/ch7c1en.html]

[44] UCI Machine Learning Repository –

[http://mlearn.ics.uci.edu/databases]

(9)

M.Jayakameswaraiah,Received his Master of

Computer Applications Degree from Sri

Venkateswara University,Tirupati, Andhrapradesh, INDIA in 2009 and Currently Pursuing Ph.D in Computer Science from Sri Venkateswara University ,Tirupati, Andhrapradesh, INDIA.The Research fields of interest are Data Mining, Cloud Computing, Software Engineering and Data Base Management System etc.

Prof.S.Ramakrishna, Professor in Department of

Computer Science, Sri Venkateswara