Evaluation of Clustering Algorithms for Credit Card Data Set Using WEKA


Jismy Joseph*, Dr. G. Kesavaraj

Abstract

The volume of data generated in the financial sector grows every day. In this scenario, data mining is an emerging research field that can be applied to find hidden information in a data set, which can in turn be used to analyze credit risk. Clustering is a challenging field of data mining and plays a vital role in analyzing credit data sets. This paper analyses the efficiency of four clustering algorithms (Farthest First, Filtered Cluster, Density Based Cluster and Simple K-Means) on credit card data sets.

Keywords: Clustering, K-Means, Filtered Clustering, Density Based Clustering, WEKA

1. Introduction

Clustering is an unsupervised data analysis technique. In clustering, a large data set is partitioned into groups, or clusters, according to the similarity of its objects; objects in different clusters should be dissimilar. Predefined class labels are not used: the objects are grouped by maximizing intra-cluster similarity and minimizing inter-cluster similarity. Applications of cluster analysis include pattern recognition, credit card fraud detection, biological studies, outlier detection, web document clustering and many more.

This article evaluates the efficiency of different clustering algorithms using the WEKA tool. The algorithms are executed on two different data sets through the WEKA interface.

WEKA is open source machine learning software, developed at the University of Waikato, New Zealand, that contains a set of machine learning algorithms for data analysis.

WEKA is used for data preprocessing, classification, clustering, association rule mining, and visualization [1]. It includes a collection of clustering algorithms such as Farthest First, Filtered Cluster, Hierarchical Cluster, Density-Based Cluster, K-Means, and Cobweb.

The rest of this article is organized as follows. Section 2 presents the literature survey, and Section 3 describes the data sets and the clustering algorithms. Sections 4, 5 and 6 cover the results, the comparison of the algorithms, and the conclusion, respectively.

2. Literature Survey

Singh and Surya [3] compared nine clustering algorithms in WEKA with respect to time, number of iterations and similar measures. Kapil and Chawla [4] studied K-Means clustering with two distance functions, Euclidean distance and Manhattan distance, and their results illustrate the influence of the distance function on the K-Means algorithm.

Khan, Nisha and Sathik [5] evaluated Expectation Maximization (EM), K-Means and Farthest First on the mushroom data set using WEKA. They suggested that K-Means performed better than EM and Farthest First. For performance evaluation they considered the time needed to build a model and the number of correctly clustered instances. Upadhyay and Singh [6] provided simulation results that report the processor size, efficiency, complexity and workload of input instructions. They also compared optimized and tuned algorithms based on a parallel methodology.

Pallavi and Godara [9] evaluated different algorithms on the Iris, Haberman and Wine data sets and suggested that K-Means is better than the Hierarchical and Farthest First algorithms. Roselin [10] used four classification algorithms and two clustering algorithms to study the behaviour of credit card customers, and pointed out that Farthest First is more accurate than K-Means.

Suman Kumar [11] surveyed clustering methods for credit card data sets and proposed a new method that combines the Hidden Markov Model (HMM) and K-Means. Frei [12] used machine learning techniques to detect credit card fraud; the article describes how to perform data analysis, data preparation, creation of the training set and classification in machine learning.

3. Data Sets and Classifiers

We used two credit card data sets from Kaggle to compare four clustering algorithms. The first is ‘Credit card data set for clustering (CCDC)’ [13], which contains about 9000 records and 18 attributes (the clustered instances reported in Section 4 total 8950). It describes eighteen behavioral variables of the credit card holders. The second is ‘Credit card fraud detection (CCFD)’ [14], which consists of 284,807 instances and 28 attributes. It summarizes 284,807 transactions made by European credit card holders. The four clustering algorithms used in this comparative study are Farthest First (FF), Make Density Based Cluster (MDBC), Simple K-Means (SK) and Filtered Clustering (FC).

A. Farthest First (FF)

The Farthest First algorithm was proposed by Hochbaum and Shmoys in 1985 and is appropriate for creating clusters in large data sets [7]. It is a greedy method: the first centroid is selected at random from the objects, and each subsequent centroid is greedily chosen as the object farthest from the centroids already selected. Every object is then grouped with its nearest centroid. The clusters produced by FF are not uniform in size. The pseudocode of FF is given below.
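As a hedged illustration of the greedy selection described above, the following minimal Python sketch picks a random first center, repeatedly adds the point farthest from the centers chosen so far, and then assigns every point to its nearest center. The function name `farthest_first`, the `seed` parameter and the sample points are assumptions made for illustration; this is not the WEKA implementation, and it assumes `k` does not exceed the number of distinct points.

```python
import math
import random

def farthest_first(points, k, seed=0):
    """Greedy farthest-first traversal followed by nearest-center assignment."""
    rng = random.Random(seed)
    # The first center is chosen at random (stored as an index into points).
    centers = [rng.randrange(len(points))]
    # nearest[i] holds the distance from point i to its closest chosen center.
    nearest = [math.dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        # Greedy step: the next center is the point farthest from all centers.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    # Each point joins the cluster of its nearest center.
    labels = [
        min(range(k), key=lambda c: math.dist(p, points[centers[c]]))
        for p in points
    ]
    return centers, labels

sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
centers, labels = farthest_first(sample, k=2)
print(centers, labels)
```

Because only the first center is random and no centers are recomputed, the method needs a single pass per center, which is consistent with the short build times reported for FF later in the paper.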


B. Make Density Based Cluster (MDBC)

Density-based clustering refers to unsupervised learning methods that form distinct clusters from regions of high object density. It is mainly used to find clusters with non-linear shapes.

DBSCAN is an example of a density-based method. It uses two parameters, ‘E’ and ‘Minpts’: ‘E’ specifies the radius of the neighborhood around a data point, and ‘Minpts’ specifies the minimum number of points required within that radius. Fig. 1 illustrates the DBSCAN method.

Fig. 1 DBSCAN Method

The pseudocode of MDBC is:
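To make the roles of ‘E’ and ‘Minpts’ concrete, here is a minimal Python sketch of the DBSCAN procedure named above. It is a sketch under stated assumptions, not WEKA's Make Density Based Cluster code: the function name `dbscan`, the parameter names `eps` (for ‘E’) and `min_pts` (for ‘Minpts’), the convention that a point counts as its own neighbor, and the sample points are all chosen for illustration.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)

    def neighbors(i):
        # All points within radius eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # not dense enough to start a cluster
            continue
        cluster += 1                   # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # former noise becomes a border point
            if labels[j] is not None:
                continue               # already assigned, do not expand again
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)   # j is also a core point: keep growing
    return labels

sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (50, 50) has no neighbor within the radius, so it is reported as noise rather than being forced into a cluster.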

C. Simple K-Means (SK)

SK is a widely used unsupervised clustering approach that is easy to understand and quick to train. Initially, k centroids are selected; these centroids are the centers of the different clusters. Every object is then assigned to its closest centroid using Euclidean distance, as shown in Fig. 2.


For each cluster, the mean of its objects is used as the value of the new centroid, and this process is repeated until the centroid values remain constant. The pseudocode of SK is given below.
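The assign-then-recompute loop described above can be sketched in a few lines of Python. This is a minimal illustration, not WEKA's Simple K-Means: the function name `k_means`, the random choice of initial centroids, the `seed` and `max_iter` parameters, and the sample points are assumptions.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Plain K-Means with Euclidean distance; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k distinct points picked at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                                   # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = k_means(sample, k=2)
print(sorted(centroids), labels)
```

On this small, well-separated sample the loop settles on the two group means, (0.5, 0.5) and (10.5, 10.5), whichever points are drawn as initial centroids.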

D. Filtered Clustering (FC)

The filtering algorithm uses an index structure on the data set to improve the efficiency of K-Means: it reduces the number of centroids searched when finding the nearest centroid. The FC algorithm can also filter irrelevant data from the input [8].

The pseudocode of FC is given below.
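As a hedged illustration of how an index can reduce the number of centroids examined, the sketch below assumes that the filtering algorithm meant here is the kd-tree style pruning scheme for the K-Means assignment step: points are split recursively into cells, and a candidate centroid is discarded for a whole cell when even the cell corner most favorable to it is closer to another candidate. That reading is an assumption, as are the names `sq_dist`, `assign_with_filtering` and `leaf_size`; this is not WEKA's Filtered Cluster code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _filter(points, idx, centers, cand, labels, leaf_size):
    dims = range(len(points[0]))
    # Bounding box of the points in this cell.
    lo = [min(points[i][d] for i in idx) for d in dims]
    hi = [max(points[i][d] for i in idx) for d in dims]
    mid = [(l + h) / 2 for l, h in zip(lo, hi)]
    # z_star: the candidate centroid closest to the middle of the cell.
    z_star = min(cand, key=lambda c: sq_dist(centers[c], mid))
    kept = [z_star]
    for c in cand:
        if c == z_star:
            continue
        # Corner of the box lying farthest in the direction z_star -> c.
        corner = [hi[d] if centers[c][d] > centers[z_star][d] else lo[d]
                  for d in dims]
        # Keep c only if it beats z_star at that corner; otherwise no point
        # in the cell can be closer to c than to z_star.
        if sq_dist(centers[c], corner) < sq_dist(centers[z_star], corner):
            kept.append(c)
    if len(kept) == 1:
        for i in idx:                  # whole cell assigned without a search
            labels[i] = z_star
        return
    if len(idx) <= leaf_size:
        for i in idx:                  # small cell: search surviving candidates
            labels[i] = min(kept, key=lambda c: sq_dist(points[i], centers[c]))
        return
    # Otherwise split at the median of the widest dimension and recurse.
    d = max(dims, key=lambda k: hi[k] - lo[k])
    order = sorted(idx, key=lambda i: points[i][d])
    half = len(order) // 2
    _filter(points, order[:half], centers, kept, labels, leaf_size)
    _filter(points, order[half:], centers, kept, labels, leaf_size)

def assign_with_filtering(points, centers, leaf_size=4):
    """Nearest-centroid label for every point, using cell-level pruning."""
    labels = [None] * len(points)
    _filter(points, list(range(len(points))), centers,
            list(range(len(centers))), labels, leaf_size)
    return labels

sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(assign_with_filtering(sample, [(0.3, 0.3), (10.3, 10.3)]))
```

The labels match a brute-force nearest-centroid search; the saving comes from cells where a single candidate survives, since all of their points are assigned at once.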

4. Result and Observation

This section describes the experiments carried out on the two data sets and the performance evaluation, both performed with the WEKA tool. Fig. 1 shows the result obtained after executing the FF algorithm on ‘CCDC’. The algorithm takes 0.07 seconds to build the model. Two clusters are created: one contains 8890 instances and the other contains 60 instances.


Fig. 1 Result of FF Algorithm on the 1st Data Set

Fig. 2 shows the result obtained from the FC algorithm on ‘CCDC’. The algorithm takes 0.26 seconds to build the model. Two clusters are created: one contains 4911 instances and the other contains 4039 instances.

Fig. 2 Result of FC Algorithm on the 1st Data Set

The MDBC algorithm needs 0.31 seconds to create the clustering model for the ‘CCDC’ data set. One cluster contains 4900 instances and the other contains 4050. Fig. 3 shows this result.


The result of SK on the ‘CCDC’ data set is shown in Fig. 4. It requires 0.12 seconds to execute. One cluster contains 4911 instances and the other contains 4039.

Fig. 4 Result of SK on the 1st Data Set

Fig. 5 shows the result obtained after executing the FF algorithm on ‘CCFD’. The algorithm takes 0.31 seconds to build the model. Two clusters are created: one contains 284806 instances and the other contains only 1 instance.

Fig. 5 Result of FF Algorithm on the 2nd Data Set

The MDBC algorithm needs 2.35 seconds to create the clustering model for the ‘CCFD’ data set. One cluster contains 129280 instances and the other contains 155527. Fig. 6 shows this result.


Fig. 6 Result of MDBC Algorithm on the 2nd Data Set

Fig. 7 shows the result obtained from the FC algorithm on ‘CCFD’. The algorithm takes 1.67 seconds to build the model. Two clusters are created: one contains 131821 instances and the other contains 152986 instances.

Fig. 7 Result of FC Algorithm on the 2nd Data Set

The result of SK on the ‘CCFD’ data set is shown in Fig. 8. It needs 1.58 seconds to execute. One cluster contains 131821 instances and the other contains 152986.

Fig. 8 Result of SK Algorithm on the 2nd Data Set

5. Comparison of Result

This section compares the four clustering algorithms in terms of the time taken to build the model and the number of instances assigned to each cluster. The following tables show the results.

TABLE I. TIME TAKEN FOR CLUSTERING (IN SECONDS)

Algorithm   Dataset 1   Dataset 2
FF          0.07        0.31
MDBC        0.31        2.35
SKM         0.12        1.5
FC          0.26        1.61

Fig. 9 Time taken for clustering

Fig. 9 is a line chart of the time taken by the four algorithms on the two data sets.

TABLE II. CLUSTERED INSTANCES

Algorithm   Dataset 1               Dataset 2
            Cluster 0   Cluster 1   Cluster 0   Cluster 1
FF          8890        60          284806      1
MDBC        4900        4050        129280      155527
SKM         4911        4039        131821      152986
FC          4911        4039        131821      152986

TABLE III. CLUSTERED INSTANCES (IN PERCENTAGE)

Algorithm   Dataset 1               Dataset 2
            Cluster 0   Cluster 1   Cluster 0   Cluster 1
FF          99          1           100         0
MDBC        55          45          45          55
SKM         55          45          46          54
FC          55          45          46          54
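The percentages can be checked directly from the instance counts reported in Section 4. The short Python sketch below does that arithmetic; the dictionary layout and the helper name `shares` are illustrative assumptions, and the counts are the ones stated in the text for each algorithm and data set.

```python
# Instance counts per cluster (cluster 0, cluster 1), as reported in Section 4.
counts = {
    "FF":   {"CCDC": (8890, 60),   "CCFD": (284806, 1)},
    "MDBC": {"CCDC": (4900, 4050), "CCFD": (129280, 155527)},
    "SK":   {"CCDC": (4911, 4039), "CCFD": (131821, 152986)},
    "FC":   {"CCDC": (4911, 4039), "CCFD": (131821, 152986)},
}

def shares(pair):
    """Percentage of instances in each cluster, rounded to one decimal."""
    total = sum(pair)
    return tuple(round(100 * n / total, 1) for n in pair)

for algorithm, per_dataset in counts.items():
    for dataset, pair in per_dataset.items():
        print(algorithm, dataset, shares(pair))
```

For example, `shares((8890, 60))` gives (99.3, 0.7) and `shares((4911, 4039))` gives (54.9, 45.1), so FF places almost every instance in one cluster while the other three algorithms split both data sets roughly evenly.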

6. Conclusion

Clustering plays a decisive role in analyzing the behaviour of credit card holders. Proper analysis of customer behaviour using clustering algorithms can help an organization improve its business. This paper set out to find the best clustering algorithm for credit card data sets. The study observed that the FF algorithm was the most efficient, in terms of time taken, on both data sets: it needed 0.07 seconds to cluster data set 1 and 0.31 seconds to cluster data set 2. All other algorithms needed more time than FF.


7. References

1. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, and Witten I H, “The WEKA Data Mining Software: An Update”, SIGKDD Explorations, Vol. 11, No. 1, 2009, Pp. 1-17.

2. DeFreitas K, M. B., “A comparative evaluation of clustering methods in Educational Data Mining”, IADIS IJCSIS, Vol. 10, No. 2, 2016, Pp. 65-78.

3. Singh, P., & Surya, “Performance of clustering algorithms in WEKA”, IJAET, Vol. 7, No. 6, Jan 2015, Pp. 1866-1873.

4. Kapil, S., & Chawla, M., “Performance of K-Means with various distance metrics”, IEEE Conference, Vol. 7, No. 6, July 2016, Pp. 1-4.

5. Khan, A. R., Nisha, S. S. & Sathik, M. M, “Clustering Techniques For Mushroom Dataset”, IRJET, Vol. 5, No. 6, June 2018, Pp. 1121-1125.

6. Upadhyay, N. M., & Singh, R. S, “Performance Evaluation of Classification Algorithm in WEKA using Parallel Performance Profiling and Computing Technique”, IEEE, Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC), Vol. 7, No. 6, December 2018, Pp. 522-527.

7. Sharma N, Bajpai A and Litoriya R, “Comparison of Various Clustering Algorithms of WEKA Tool”, IJETAE, Vol. 02, No. 05, May 2012, Pp. 73-80.

8. Khoury R, “Sentence Clustering Using Parts-of-Speech”, International Journal of Information Engineering and Electronic Business, Vol. 4, No. 6, Feb 2012, Pp. 1-9.

9. Pallavi, S. Godara, “A Comparative Performance Analysis of Clustering Algorithm”, IJERA, Vol. 1, No. 3, Nov 2012, Pp. 441-445.

10. R. Roselin, C, “Customer Behavior Analysis for Credit Card Proposers.”, IJIRAE, Vol. 1, No. 11, Nov 2014, Pp. 62-66.

11. Suman, D. C., “Credit Card Fraud Detection Using HMM and K-Means Clustering Algorithm”, IIJSRET, Vol. 6, No. 6, June 2017, Pp. 614-619.

12. Frei, L., “Detecting credit card fraud using machine learning”, Retrieved November 2019, from towards data science: www.towardsdatascience.com., January 2016.

13. Bhasin, A. (n.d.). www.kaggle.com/mlg-ulb/creditcardfraud. Retrieved November 2019, from www.kaggle.com: https://www.kaggle.com/mlg-ulb/creditcardfraud

14. Group, M. L. (n.d.). www.kaggle.com/mlg-ulb/creditcardfraud, https://www.kaggle.com/mlg-ulb/creditcardfraud.