A Survey on Data Classification using Machine Learning Techniques

Dr. Chandra.E

Research Supervisor and Director, Department of Computer Applications, DJ Academy for Managerial Excellence, Coimbatore

Rajeswari.J

Research Scholar, Manonmaniam Sundaranar University, Tirunelveli

Abstract---Data classification is an essential step in safeguarding one of an organization's chief assets: its information. Data classification involves categorizing information into predefined levels. For each level, the organization must assign appropriate security controls and restrict the number of employees or other persons who can access the information. Data classification is difficult in large organizations, which have a significant amount of content to assess and classify. With a data classification approach, organizations can identify and implement proper access control settings for private data and can make decisions based on the classification results. Machine learning techniques are among the best classification approaches to emerge from recent developments. Many techniques are available under the machine learning umbrella, and the most important objective of machine learning research is to learn automatically to identify complex patterns and make intelligent decisions based on data. This paper discusses data classification using a range of Machine Learning Techniques and focuses on the techniques available in the literature that support data classification well.

Keywords---Data Classification, Machine Learning, Self-Organizing Map (SOM), Artificial Neural Networks (ANN), Extreme Learning Machine (ELM)

I. INTRODUCTION

Data classification is increasingly important to the security of an organization, since it is very difficult to recognize and control the risk to sensitive data. Data classification helps organizations determine the appropriate type of controls to safeguard data from theft and misuse.

Data can be categorized on any criterion, not only relative importance or frequency of use. For instance, data can be classified by its content, file type, operating platform, typical file size in megabytes or gigabytes, when it was created, when it was last accessed or modified, who last accessed or modified it, and which departments use it the most. A strong data classification system makes essential data easy to find. Without automated data classification, it is practically impossible to keep pace with growing volumes of data.

Machine Learning is an approach that improves performance by automating the acquisition of knowledge from experience. Expert performance requires a large amount of domain-specific knowledge, and knowledge engineering has produced hundreds of AI expert systems that are now used regularly in industry. The main objective of Machine Learning is to provide increasing levels of automation in the knowledge engineering process, replacing a large amount of time-consuming human activity with automatic techniques that improve accuracy or efficiency by discovering and exploiting regularities in training data. The ultimate test of machine learning is its ability to produce systems that are used regularly in industry, education, and elsewhere. Most evaluation in machine learning is experimental in nature, aimed at showing that the learning technique leads to performance on a separate test set, in one or more realistic domains, that is better than performance on that test set without learning.

build a common description of the concept.

In general, several machine learning algorithms have direct counterparts in data classification. Machine learning algorithms are data analysis techniques that search data sets for patterns and characteristic structures. Machine learning emerged primarily from computer science and artificial intelligence, and draws on methods from a range of related subjects, including statistics and more specific fields such as pattern recognition.

Performing data classification manually is very complicated. With the available machine learning techniques, the classification process can be carried out automatically. Several Machine Learning Techniques that have been used to classify different data sets are described in the following section.

II. LITERATURE SURVEY

Yao Yu et al., [1] proposed a novel classifier ensemble learning approach based on decision trees. Ensemble learning is among the techniques that give the best results across many classification algorithms. A decision tree algorithm is a greedy approach that builds the tree structure in a top-down, recursive way. The proposed approach improves classification accuracy by combining the benefits of the Boosting algorithm with those of decision trees, so as to make full use of both. The authors show that choosing the attribute with the smallest classification error rate is equivalent to the branching criterion of a traditional decision tree. The approach exploits the fast classification capability of decision trees, and the authors also consider the accuracy of the combined classifier. Experimental results on UCI machine learning data sets showed that the proposed approach is very effective.
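The combination of boosting with decision trees described above can be illustrated with a generic AdaBoost over one-level decision trees (stumps). This is a minimal sketch of the general idea, not the algorithm of [1]; the function names, the toy data and the choice of three rounds are assumptions made for illustration.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best one-level tree."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                # The stump predicts `polarity` when the feature is <= t, else the opposite.
                preds = [polarity if row[j] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def stump_predict(stump, row):
    _, j, t, polarity = stump
    return polarity if row[j] <= t else -polarity

def adaboost(X, y, rounds):
    """Train up to `rounds` stumps, re-weighting the samples after each one."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)  # avoid division by zero for a perfect stump
        if err >= 0.5:              # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps; labels are +1 and -1."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data (hypothetical): the negative class sits between two positive groups,
# so no single threshold separates it.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, row) for row in X])
```

On this toy set a single stump misclassifies at least two of the six points, while the weighted vote of three stumps reproduces all six labels, which is the kind of gain the ensemble is meant to provide.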

P. Gunther et al., [2] examine several recent treatises on hybridised Self-Organizing Map (SOM) theory. Each article presents a solution that expedites the SOM mapping process and provides more accurate results within a shorter response time through hybridization, including the use of Bayesian classification approaches, an interactive associative search and exploration tool, and a hierarchical organization of tiered SOMs with input derived via auto-associative feed-forward neural network technology. Gunther et al. argue that an amalgamation of SOM and association rule theory is the key to a more generic solution, one less dependent on initial supervision and redundant user interaction. The results of clustering stem words from text documents can be used to obtain association rules that indicate the applicability of documents to the user. A four-stage process is thus detailed, representing a generic instance of how a graphical derivation of associations may be obtained from a repository of text documents, or even from a set of synopses of several such repositories.

A. Nurnberger et al., [3] improved naive Bayes classifiers using neuro-fuzzy learning. Naive Bayes classifiers are a well-known and powerful type of classifier that can easily be induced from a dataset of sample cases. However, the strong conditional independence and distribution assumptions underlying them can sometimes lead to poor classification performance. Another prominent type of classifier is the neuro-fuzzy classification system, which derives (fuzzy) classifiers from data using learning techniques inspired by neural networks. As there are certain structural similarities between a neuro-fuzzy classifier and a naive Bayes classifier, the authors map the latter to the former in order to improve its capabilities.
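The conditional independence assumption mentioned above is easiest to see in code: the score of a class is its prior multiplied by one independent factor per feature. The following is a minimal sketch of a plain categorical naive Bayes classifier with add-one (Laplace) smoothing; it does not include the neuro-fuzzy mapping of [3], and the feature values and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            feature_counts[(j, label)][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values

def predict_naive_bayes(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for j, v in enumerate(row):
            # Independence assumption: each feature contributes its own factor.
            num = feature_counts[(j, label)][v] + 1   # add-one smoothing
            den = count + len(values[j])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical records: (file type, owning department) -> sensitivity level.
rows = [("spreadsheet", "finance"), ("spreadsheet", "finance"),
        ("document", "marketing"), ("image", "marketing"),
        ("document", "finance"), ("image", "marketing")]
labels = ["confidential", "confidential", "public",
          "public", "confidential", "public"]
model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("document", "finance")))
```

When two features are strongly correlated, the product counts the same evidence twice, which is one way the independence assumption can degrade performance.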

F. Kahraman et al., [4] compared the performance of SVM and ANN for handwritten character classification. The study concerns the choice of classifier in handwritten character recognition, and its objective is to find the most suitable classifier type for a given handwritten character feature vector. PCA (Principal Component Analysis) based features were classified by both multilayer artificial neural networks (ANN) and support vector machines (SVM), and the recognition outputs were then compared. The authors chose error back propagation, resilient back propagation and scaled conjugate gradients as ANN training techniques, and linear, RBF and polynomial as SVM kernel types. The experiments show that the SVM has better training and test performance than the ANN.

Liang Sun et al., [5] proposed a novel support vector and K-Means based hybrid algorithm for data clustering, a major problem that has been investigated widely. First, outliers and overlapping data points are identified with the support vector technique. Next, these points are removed and K-Means is run on the remaining data points to obtain a clustered data set. Finally, a support vector description is constructed for every cluster, and each removed data point is assigned to the cluster at minimum distance, which labels the entire data set. Simulation results demonstrate that the proposed technique is very efficient and exploits the advantages of both support vector clustering and K-Means.

vector machines (SVM), standard SVM is not suitable for classifying large data sets, because the training complexity of SVM is very high. In this paper, the authors developed a novel SVM classification technique for large data sets that takes into account the models of classes distribution (MCD). A first stage uses SVM classification to obtain a sketch of the class distribution. The algorithm then finds the closest support vectors (SVs) between classes and builds a minimum enclosing ball from every pair of SVs with different labels. The data points that fall inside the balls constitute the MCD, which captures the structure at the boundary of each class and represents the most significant data points; these points are used as training data for a subsequent SVM classification. Experimental results show that this approach achieves better classification accuracy, and considerably faster training, than other SVM classifiers.

Ganglong Duan et al., [7] suggested the Extreme Learning Machine (ELM) for bank client classification. In this paper, the authors propose a classification method for commercial bank customers based on the ELM algorithm, in order to study the loss of VIP customers at commercial banks. First, the ELM model is trained on the available bank data sets; then the client classification approach and its parameters are chosen. Finally, a comparison with the conventional gradient algorithm and other classification algorithms shows that the ELM algorithm not only overcomes their drawbacks but also learns faster, is more accurate, and generalizes better.

Xiaoou Li et al., [8] proposed a two-stage SVM classification for large data sets that randomly reduces and then recovers the training data. Despite the sound theoretical foundations and high classification accuracy of support vector machines (SVM), standard SVM is not suitable for classifying large data sets, because its training complexity is very high. The technique first trains on randomly selected data: this initial SVM classification yields a sketch of the support vector distribution. The neighbors of these support vectors in the original data set are then used as training data for the second-stage SVM classification. Experimental results show that this technique achieves better classification accuracy, and appreciably faster training, than other SVM classifiers.

G. Pradhan et al., [9] proposed the Minimal ANN (MANN) model for data classification. Data classification is one of the most important tasks in data mining, and accurate, uncomplicated classification can help cluster huge datasets properly. In this paper, the authors investigate and recommend a simple ANN-based classification scheme, named Minimal ANN (MANN), for different classification problems. A genetic algorithm (GA) is used to find the optimal number of neurons in the single-hidden-layer model. In addition, the model is trained with both the back propagation (BP) technique and the GA, and the classification accuracies are compared. The simulations indicate that the proposed model is a good choice for many applications, since it is very simple and performs excellently.

Hai-Jun Rong et al., [10] applied the Extreme Learning Machine (ELM) to multi-category classification. In this approach, multi-class pattern classification is performed with either a series of binary ELM classifiers or a single ELM classifier. With binary ELM classifiers, the multi-class problem is decomposed into two-class problems using the one-against-all (OAA) or one-against-one (OAO) method; the resulting schemes are called ELM-OAA and ELM-OAO respectively. With a single ELM classifier, the multi-class problem is handled by an architecture with multiple output nodes, equal in number to the pattern classes. The method is evaluated on a number of multi-class benchmark problems, and the simulation results show that ELM-OAA and ELM-OAO need fewer hidden nodes than the single ELM classifier. Additionally, ELM-OAO usually has a comparable or lower computational burden than the single ELM classifier when the number of pattern classes is not greater than 10.
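The two decomposition schemes named above can be sketched independently of the binary classifier that is eventually trained on each sub-problem. The snippet below only builds the binary label sets; the ELM itself is not shown, and the function names and sample labels are hypothetical. For k classes, one-against-all yields k binary tasks and one-against-one yields k(k-1)/2.

```python
from itertools import combinations

def one_against_all(labels):
    """One binary task per class: that class (+1) versus all the others (-1)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}

def one_against_one(labels):
    """One binary task per pair of classes, using only the samples of that pair.

    Each task is (sample indices, binary labels), with +1 for the first class
    of the pair and -1 for the second.
    """
    classes = sorted(set(labels))
    tasks = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        tasks[(a, b)] = (idx, [1 if labels[i] == a else -1 for i in idx])
    return tasks

# Hypothetical class labels for seven samples, four classes.
labels = ["a", "b", "c", "a", "c", "b", "d"]
oaa = one_against_all(labels)
oao = one_against_one(labels)
print(len(oaa), len(oao))  # 4 one-against-all tasks, 6 one-against-one tasks
```

Each one-against-one task sees only the samples of its two classes, so the individual problems are smaller even though there are more of them.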

Wu Bing et al., [11] proposed a GP-based kernel construction and optimization method for the relevance vector machine (RVM). Choosing an appropriate kernel is one of the hard aspects of using this learning tool effectively, so automating the search for such a kernel is desirable. The paper presents a data-driven kernel function construction and optimization technique that integrates genetic programming (GP) and relevance vector regression to evolve an optimal or near-optimal kernel function, called GP-Kernel. The evolved kernel is compared with a number of widely used kernels on several standard regression datasets. The experiments show that RVM with the GP-Kernel can match or surpass the best performance of the standard kernels.

at first level to predict the less confident classified examples and in the next level it uses SVM to learn and categorize the tougher examples. The advantage of the hierarchical approach on a text classification task is that the two-level model gives better results than either learning machine alone.

Yen-Jen Oyang et al., [13] proposed data classification with radial basis function networks based on a novel kernel density estimation algorithm. The paper presents a novel learning approach for the efficient construction of radial basis function (RBF) networks that can deliver the same level of accuracy as support vector machines (SVMs) in data classification applications. The algorithm builds one RBF sub-network to approximate the probability density function of each class of objects in the training data set. In terms of algorithm design, the main distinction of the proposed algorithm is its kernel density estimation algorithm, which has an average time complexity of O(n log n), where n is the number of samples in the training data set. The main advantage of the proposed algorithm over SVM is that it normally takes less time to build a data classifier with an optimized parameter setting. This characteristic is very important for many current applications, in particular those in which new objects are continually added to an existing large database. Another attractive characteristic is that the RBF networks thus built can classify data with more than two classes of objects in a single run. As the proposed learning algorithm is instance-based, the paper also addresses the data reduction issue. The performance of the RBF networks built with the proposed algorithm is compared with that of networks built with a traditional cluster-based algorithm.
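The idea of approximating one probability density function per class and classifying by the largest density can be sketched with a plain Gaussian kernel density estimate. This is an illustration only: it evaluates every kernel naively rather than using the O(n log n) estimation algorithm of [13], and the fixed bandwidth, function names and toy data are assumptions.

```python
import math

def gaussian_kernel(x, center, bandwidth):
    """Gaussian radial basis function centred on one training sample."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def class_density(x, samples, bandwidth):
    """Unnormalised density estimate: mean of the kernels of one class.

    The Gaussian normalising constant is omitted because it is identical for
    every class when all classes share the same bandwidth and dimension.
    """
    return sum(gaussian_kernel(x, s, bandwidth) for s in samples) / len(samples)

def classify(x, samples_by_class, bandwidth=1.0):
    """Pick the class with the largest prior-weighted density at x."""
    n_total = sum(len(s) for s in samples_by_class.values())
    scores = {
        label: (len(samples) / n_total) * class_density(x, samples, bandwidth)
        for label, samples in samples_by_class.items()
    }
    return max(scores, key=scores.get)

# Hypothetical 2-D training data with three classes, handled in a single pass.
samples_by_class = {
    "low": [(0, 0), (1, 0), (0, 1)],
    "medium": [(5, 5), (6, 5), (5, 6)],
    "high": [(10, 0), (11, 0), (10, 1)],
}
print(classify((5.5, 5.2), samples_by_class))
```

Because each class has its own density estimate, adding a class only adds one more entry to the comparison, which is why more than two classes need no special decomposition.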

C. Silva et al., [14] proposed scaling Text Classification (TC) with Relevance Vector Machines (RVMs). TC is a difficult task that involves an enormous amount of data. Recent research has confirmed that kernel-based learning approaches are reasonably successful on this problem. In contrast to the support vector machine (SVM), the relevance vector machine (RVM) yields a probabilistic output while preserving accuracy. However, few research efforts have dealt with the scalability problem that arises when applying RVM to large-scale problems such as TC. In this paper, the authors propose a novel approach consisting of a two-step RVM classifier that is able to (a) be competitive in processing time, (b) use all available training elements and (c) improve RVM classification performance. The paper also explains that a suitable similarity measure between documents can be defined over the whole data set, which makes the process not only faster but also parallelizable. Using REUTERS-21578, the results show that successful real-time applications become achievable through the reduced computational complexity and improved overall performance of the proposed method.

Rong Liu et al., [15] proposed an RBF (radial basis function) classifier with supervised center selection and weighted norm. A variant of the RBF classifier based on supervised center selection and a weighted norm is constructed and then tested on labeling overlapping data sets generated from two binomial distributions. In addition to its simplicity, the experimental results and the comparisons with other classifiers suggest that the technique is capable, offering reasonable classifications at moderate computational cost. Furthermore, this variant outperforms the RBF classifier based on exact interpolation in both accuracy and efficiency.

H. Osman [16] proposed a novel technique for constructing multiclass SVM-based binary decision tree classifiers. The basic idea is to set the target values for the training patterns such that linear separability is always achieved, so that a linear SVM can be generated at each non-leaf node. The author argues that replacing complex, nonlinear SVMs with a larger number of linear SVMs may significantly decrease training and classification times, as well as classifier size, without compromising classification performance. The performance of the proposed approach is demonstrated experimentally through a comparative analysis with the most efficient existing multiclass SVM classifiers, such as one-against-rest and one-against-one.

III. PROBLEMS AND DIRECTIONS

Many Machine Learning techniques have been discussed in the previous section. The following directions may guide researchers in developing better data classification approaches using Machine Learning techniques.

• The performance of classification with Machine Learning Techniques depends strongly on the quality of the data used in learning. Transformation techniques are used to improve the effectiveness of classification, because each type of data suits different classification techniques.

IV. CONCLUSION

Classification is a data mining approach that categorizes items into groups of related objects, or classes. Decisions are easier to make once classification results are available. However, data classification is a very complicated process when carried out manually, so techniques that perform it automatically are needed. Machine Learning techniques can be used to carry out data classification. This survey should help researchers develop promising techniques for data classification using Machine Learning.

REFERENCES

[1] Yao Yu, Fu Zhong-liang, Zhao Xiang-hui, and Cheng Wen-fang, “Combining Classifier Based on Decision Tree,” WASE International Conference on Information Engineering (ICIE '09), Vol. 2, Pp. 37 – 40, 2009.

[2] P. Gunther, and P. Chen, “A new approach to hybrid SOM implementations for text classification,” The 10th IEEE International Conference on Fuzzy Systems, Vol. 2, Pp. 968 – 971, 2001.

[3] A. Nurnberger, C. Borgelt, and A. Klose, “Improving naive Bayes classifiers using neuro-fuzzy learning,” 6th International Conference on Neural Information Processing (ICONIP '99), 1999.

[4] F. Kahraman, A. Capar, A. Ayvaci, H. Demirel, and M. Gokmen, “Comparison of SVM and ANN performance for handwritten character classification,” Proceedings of the IEEE on Signal Processing and Communications Applications Conference, Pp. 615 – 618, 2004.

[5] Liang Sun, S. Yoshida, and Yanchun Liang, “A novel support vector and K-Means based hybrid clustering algorithm,” IEEE International Conference on Information and Automation (ICIA), Pp. 126 – 130, 2010.

[6] J. Cervantes, Xiaoou Li, and Wen Yu, “SVM Classification for Large Data Sets by Considering Models of Classes Distribution,” Sixth Mexican International Conference on Artificial Intelligence - Special Session (MICAI 2007), Pp. 51 – 60, 2007.

[7] Ganglong Duan, Zhiwen Huang, and Jianren Wang, “Extreme Learning Machine for Bank Clients Classification,” International Conference on Information Management, Innovation Management and Industrial Engineering, Vol. 2, Pp. 496 – 499, 2009.

[8] Xiaoou Li, J. Cervantes, and Wen Yu, “Two-stage SVM classification for large data sets via randomly reducing and recovering training data,” IEEE International Conference on Systems, Man and Cybernetics, Pp. 3633 – 3638, 2007.

[9] G. Pradhan, G.V. Kalyan, S.C. Satapathy, B. Mitra, and S. Pattnaik, “Minimal ANN (MANN) model for data classification,” World Congress on Nature & Biologically Inspired Computing (NaBIC), Pp. 1059 – 1064, 2009.

[10] Hai-Jun Rong, Guang-Bin Huang, and Yew-Soon Ong, “Extreme learning machine for multi-categories classification applications,” IEEE International Joint Conference on Neural Networks (IJCNN 2008), Pp. 1709 – 1713, 2008.

[11] Wu Bing, Zhang Wen-qiong, Chen Ling, and Liang Jia-hong, “A GP-based kernel construction and optimization method for RVM,” The 2nd International Conference on Computer and Automation Engineering (ICCAE), Vol. 4, Pp. 419 – 423, 2010.

[12] Catarina Silva, and Bernardete Ribeiro, “Two-Level Hierarchical Hybrid SVM-RVM Classification Model,” 5th International Conference on Machine Learning and Applications (ICMLA '06), Pp. 89 – 94, 2006.

[13] Yen-Jen Oyang, Shien-Ching Hwang, Yu-Yen Ou, Chien-Yu Chen, and Zhi-Wei Chen, “Data classification with radial basis function networks based on a novel kernel density estimation algorithm,” IEEE Transactions on Neural Networks, Vol. 16, No. 1, Pp. 225 – 236, 2005.

[14] C. Silva, and B. Ribeiro, “Scaling Text Classification with Relevance Vector Machines,” IEEE International Conference on Systems, Man and Cybernetics (SMC '06), Vol. 5, Pp. 4186 – 4191, 2006.

[15] Rong Liu, and Yong Shi, “A RBF classifier with supervised center selection and weighted norm,” Sixth IEEE International Conference on Data Mining Workshops (ICDM Workshops), Pp. 868 – 872, 2006.

[16] H. Osman, “Novel Multiclass SVM-Based Binary Decision Tree Classifier,” IEEE International Symposium on Signal Processing and Information Technology, Pp. 880 – 883, 2007.

AUTHORS PROFILE

Dr. E. Chandra received her B.Sc. from Bharathiar University, Coimbatore, in 1992 and her M.Sc. from Avinashilingam University, Coimbatore, in 1994. She obtained her M.Phil. in the area of Neural Networks from Bharathiar University in 1999 and her Ph.D. in the area of speech recognition systems from Alagappa University, Karaikudi, in 2007. She has a total of 15 years of experience in teaching, including 6 months in industry. She is presently Director, Department of Computer Applications, at D. J. Academy for Managerial Excellence, Coimbatore. She has published more than 30 research papers in national and international journals and conferences, in India and abroad. She has guided more than 20 M.Phil. research scholars; currently 3 M.Phil. scholars and 8 Ph.D. scholars are working under her guidance. She has delivered lectures at various colleges and is a Board of Studies member of various institutions. Her research interests lie in Data Mining, Artificial Intelligence, Neural Networks, Speech Recognition Systems, Fuzzy Logic and Machine Learning Techniques. She is an active life member of CSI and of the Society of Statistics and Computer Applications, and is currently a Management Committee member of the CSI Coimbatore Chapter.