Enhanced Multistage Approach for Supervised Learning of Medical Data
K. Sasirekha¹, Dr. V. Kathiresa²
¹Department of IT, ²Department of MCA
Dr.SNS.Rajalakshmi College of Arts & Science, Chinnavedampatti, Coimbatore.
ABSTRACT
Data mining is an interdisciplinary subfield of computer science and statistics whose general objective is to extract information from a dataset (by intelligent methods) and transform it into a comprehensible structure for future use. Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process. Machine learning techniques are being applied to key data mining problems in real-world domains such as business, security, toxicity prediction, biology, chemistry, and computing. From a machine learning perspective, clustering is a form of unsupervised learning that attempts to group objects that are related to one another, whereas classification is a supervised process in which objects are assigned to predefined classes. This article focuses on enhancing a multistage method that combines machine learning algorithms for medical data.
Keywords: data mining, machine learning, clustering, classification
INTRODUCTION
Data mining techniques are used to analyze and review collected data in a variety of ways and to uncover critical links in it that were previously unknown. Data mining technology can be applied throughout the data collection process and serves various purposes, such as defense, marketing, and information gathering. It is also used to verify sampled data, analyze its patterns, and test and validate models for general and business use. Data mining is the process of discovering patterns in large data sets using methods at the intersection of computer science, statistics, and database systems. The term "data mining" is actually a misnomer, because the goal is to extract patterns and knowledge from large quantities of data, not to extract (mine) the data itself. It is also a buzzword that is often applied to any kind of large-scale data or information processing (collection, storage, warehousing, analysis, and statistics) and to any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence.
Machine learning methods such as instance-based learning, decision trees, genetic algorithms, rule induction, artificial neural networks, and fuzzy logic have been applied to a number of open data sets.
Data mining applications gather the information, and machine learning techniques are then used to make decisions based on the data collected. Clustering and classification are two major methods in data mining.
Classification is often confused with clustering, but the two approaches differ.
From a machine learning viewpoint, clustering is unsupervised learning that attempts to group sets of objects that are related to one another, whereas classification is supervised learning in which objects are assigned to a set of predefined classes.
Clustering is the process of grouping a set of abstract objects into classes of similar objects. A cluster of data objects can be treated as one group. In cluster analysis, we first partition the data set into groups based on data similarity and then assign labels to the groups. The main advantage of clustering is that it is adaptable to changes and helps users single out useful features that distinguish different groups.
Cluster analysis is widely used in applications such as market research, pattern recognition, data analysis, and image processing. Clustering can help marketers discover distinct groups in their customer base and characterize those groups by their purchasing patterns.
In biology, clustering can be used to derive plant and animal taxonomies, to categorize genes with similar functionality, and to examine structures inherent in populations. Clustering of earth observation data helps to identify regions with similar land use. It also helps to identify groups of houses in a city by house type, value, and geographic location. Clustering documents on the web helps with information discovery. Clustering is also used in outlier detection applications such as the detection of credit card fraud. As a data mining tool, cluster analysis provides insight into the distribution of the data and reveals the characteristics of each cluster.
LITERATURE REVIEW
Data mining extracts significant, previously unknown information from large databases. Depending on the analyst, the data mining process may take many iterations. The two main goals of data mining are prediction and description: a predictive model derives information about unknown values from the existing data, while a descriptive model uncovers new, noteworthy patterns in the data. Data mining can be carried out with various methods, such as statistical methods and machine learning.
Because of rapid technological growth, vast amounts of data are now available. Decision-making in medicine depends on the information hidden in these massive data.
Data mining and machine learning techniques therefore provide powerful tools for discovering the information within the data. Two primary methods, clustering and classification, are often used interchangeably.
Clustering is an unsupervised machine learning technique, whereas classification is a supervised form of learning. Both methods can extract useful information and patterns that support data analysis and clinical decision-making. This research reviews studies from the last five years that apply these methods in the medical field. In addition, it proposes a hybrid, multi-stage clustering method for the classification of medical data. In the proposed system, membership values are first determined by two fuzzy clustering algorithms, namely Fuzzy C-Means (FCM) and Gustafson-Kessel (GK). In the second stage of the method, these membership values are used as weights that provide additional information to improve classification with the SVM algorithm.
Fig. 1 Multi-Stage Clustering Method
A hybrid multi-stage method is proposed in this research to address the classification of the Wisconsin Breast Cancer (WBC) dataset.
The proposed system comprises two main stages. In the first stage, two fuzzy techniques are tested on real cancer data, namely the Wisconsin Breast Cancer dataset.
The dataset is a well-known two-class real-world problem from the UCI repository. First, the two fuzzy clustering algorithms, Fuzzy C-Means (FCM) and Gustafson-Kessel (GK), are used to obtain, for each data instance, the weights with which it belongs to each of the two clusters. This output is significant and informative, and it is used in the second stage.
The procedure begins by using the fuzzy clustering algorithms to obtain the weights of each instance in the dataset. These weights are then used as additional features for training the SVM classifier, which yields the trained model, its support vectors, and the alpha values. The data are split randomly using three different ratios, and the algorithm is trained and tested on each split.
A major disadvantage of this approach is its dependence on the initialization of the cluster centers; depending on the initial values, the algorithm may converge to a local minimum. It is also sensitive to noise, whereas low (or even no) membership would be expected for outliers (noisy points). When a cluster center lies in the vicinity of a larger cluster, FCM tends to miss small, well-separated clusters. The resulting clusters are smaller, and the regions of high membership are narrower.
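The first stage described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the reviewed work: it covers only standard Fuzzy C-Means (not Gustafson-Kessel), relies on the standard library alone, and the function names `fcm_memberships` and `augment_with_memberships`, the fuzzifier m = 2, the fixed iteration count, and the toy data are all chosen for the example. The augmented rows would then be passed to an SVM trainer from a machine learning library.

```python
import math
import random


def fcm_memberships(points, n_clusters=2, m=2.0, n_iter=100, seed=0):
    """Run standard Fuzzy C-Means and return (memberships, centers).

    memberships[k][i] is the degree to which point k belongs to cluster i;
    each row sums to 1.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Initial centers are randomly chosen data points; FCM is sensitive to this choice.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ki = 1 / sum_j (d_ki / d_kj)^(2 / (m - 1))
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            row = [
                1.0 / sum((d_i / d_j) ** (2.0 / (m - 1.0)) for d_j in dists)
                for d_i in dists
            ]
            memberships.append(row)
        # Center update: mean of the points weighted by u_ki^m
        for i in range(n_clusters):
            weights = [u[i] ** m for u in memberships]
            total = sum(weights)
            centers[i] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ]
    return memberships, centers


def augment_with_memberships(points, memberships):
    """Append each point's membership degrees to its feature vector."""
    return [list(p) + list(u) for p, u in zip(points, memberships)]


# Toy two-cluster data (illustrative only, not the WBC dataset).
data = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
u, centers = fcm_memberships(data)
augmented = augment_with_memberships(data, u)
print(len(augmented[0]))  # 4: two original features plus two membership weights
```

Each augmented row keeps the original features and adds one membership weight per cluster, which is the extra information the second stage uses.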
PROPOSED METHODOLOGY
Supervised learning is one of the most common tasks in intelligent systems, and a large number of supervised data mining (classification) techniques have been developed. In this form of learning, the class labels must be predefined so that each instance can be matched to one of the classes. A supervised learning algorithm analyzes the training data and produces a classifier that predicts the output for new valid inputs. The learning algorithm must therefore generalize from the training data to new, unseen objects. Supervised learning techniques (classification algorithms) differ in the way they learn.
They include perceptron-based learning such as Neural Networks, instance-based learning such as K Nearest Neighbor, logic-based learning such as Decision Trees, and statistical learning such as the Support Vector Machine. Researchers may also combine several of these algorithms to form a category of systems called ensemble systems.
Fig. 2 Proposed Method
The proposed methodology consists of the following workflow stages:
Stage 1: Dimensionality Reduction.
Stage 2: Supervised Clustering Algorithm.
Stage 3: Consensus Functions.
Stage 4: Supervised Classification Algorithm.
Medical data are taken as the input to the workflow. First, a dimensionality reduction algorithm is applied to the medical data, which consist of a large number of data points. The output of the first stage is then fed into a clustering algorithm to generate independent clusterings. To merge the independent clusterings into one final consensus clustering, we use four separate consensus functions.
Finally, fast supervised classification algorithms are trained on the final consensus clustering of the randomized sample. This training makes it possible to use fast classification algorithms to cluster or profile large, high-dimensional medical data sets.
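The four consensus functions are not named here. As one hedged illustration of how several independent clusterings can be merged into a single consensus clustering, the sketch below uses a co-association (evidence accumulation) matrix: two points end up in the same consensus cluster when they share a cluster in more than a given fraction of the base clusterings. The function names and the 0.5 threshold are assumptions made for this example and are not taken from the proposed method.

```python
def co_association(clusterings):
    """Fraction of clusterings in which each pair of points shares a cluster."""
    n = len(clusterings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / len(clusterings)
    return matrix


def consensus_labels(clusterings, threshold=0.5):
    """Link points whose co-association exceeds the threshold and return
    the connected components as consensus cluster labels."""
    matrix = co_association(clusterings)
    n = len(matrix)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first search over the thresholded co-association graph.
        stack = [start]
        labels[start] = next_label
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three base clusterings of six points. Label names differ between runs,
# and the third run disagrees with the others about point 2.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
print(consensus_labels(runs))  # [0, 0, 0, 1, 1, 1]
```

Because the matrix only records whether two points share a cluster, the consensus does not depend on how each base clustering happens to name its clusters.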
Classification is another data analysis task: it builds a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying, on the basis of a training set of observations whose category membership is known, which of a set of categories a new observation belongs to.
Example: Before starting any project, we need to check its viability. In this case, a classifier should predict a class label such as 'Safe' or 'Risky' so that the project can be approved or rejected. Classification is a two-phase process:
1. Training Step: Creating the classification model
Different algorithms can be used to build a classifier by making the model learn from a training set.
The model must be trained in order to predict accurate results.
2. Classification Step: The model is used to predict class labels. The constructed model is tested on test data to estimate the accuracy of the classification rules.
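The two phases can be sketched in Python as follows. The example uses K Nearest Neighbor as the classifier, for which the training step simply stores the labelled examples; the function names, the choice k = 3, and the project features are hypothetical and serve only to illustrate the 'Safe' and 'Risky' example.

```python
import math
from collections import Counter


def train(features, labels):
    """Training step: for k-NN the model is simply the stored labelled examples."""
    return list(zip(features, labels))


def predict(model, x, k=3):
    """Classification step: majority label among the k nearest training examples."""
    nearest = sorted(model, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(model, test_features, test_labels, k=3):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(
        predict(model, x, k) == y for x, y in zip(test_features, test_labels)
    )
    return correct / len(test_labels)


# Hypothetical project data: [cost overrun ratio, schedule slip ratio].
train_x = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.8, 0.9], [0.9, 0.7], [0.85, 0.8]]
train_y = ["Safe", "Safe", "Safe", "Risky", "Risky", "Risky"]
test_x = [[0.12, 0.18], [0.88, 0.75]]
test_y = ["Safe", "Risky"]

model = train(train_x, train_y)
print(predict(model, [0.2, 0.2]))       # Safe
print(accuracy(model, test_x, test_y))  # 1.0
```

The accuracy is computed on examples that were not used for training, which matches the role of the test data in the classification step.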
The combined consensus clustering technique was applied to the training set, and the data were prepared for testing. The supervised classification algorithms were then trained on the resulting training set.
After training on the consensus clustering data, our experiments compare the performance of the classification algorithms. Performance is evaluated using the weighted average precision, recall, and F-measure of each classification algorithm with respect to the classes given by the combined consensus clustering obtained earlier.
PERFORMANCE EVALUATION
1. Precision: Precision can be considered a measure of a classifier's exactness. Low precision indicates a large number of false positives.
2. Recall: Recall can be viewed as a measure of a classifier's completeness. Low recall indicates a large number of false negatives.
3. F-Measure: The F-measure is the harmonic mean of precision and recall, and thus combines the two into a single score.
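With TP, FP, and FN denoting the true positives, false positives, and false negatives of a class, precision is TP / (TP + FP), recall is TP / (TP + FN), and the F-measure is 2 · precision · recall / (precision + recall). The sketch below computes these per class and then the weighted averages mentioned earlier, with each class weighted by its number of true instances. The function names and the small label lists are illustrative assumptions, not experimental data.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f_measure, support)} for each true label."""
    metrics = {}
    support = Counter(y_true)
    for label in support:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[label] = (precision, recall, f_measure, support[label])
    return metrics


def weighted_average(metrics):
    """Average precision, recall, and F-measure over classes, weighted by support."""
    total = sum(m[3] for m in metrics.values())
    return tuple(
        sum(m[i] * m[3] for m in metrics.values()) / total for i in range(3)
    )


# Illustrative labels: class A has 3 instances, class B has 5.
y_true = ["A", "A", "A", "B", "B", "B", "B", "B"]
y_pred = ["A", "A", "B", "B", "B", "B", "B", "A"]

scores = weighted_average(per_class_metrics(y_true, y_pred))
print([round(v, 4) for v in scores])  # [0.75, 0.75, 0.75]
```

In this example class A scores 2/3 and class B scores 4/5 on every metric, so the support-weighted average is (3 · 2/3 + 5 · 4/5) / 8 = 0.75.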
CONCLUSION
This paper described a precise and computationally efficient means of classifying medical data sets, in the hope of providing valuable information for various physiological and biomedical analyses. The multi-stage approach combines dimensionality reduction algorithms, several unsupervised clustering algorithms, and several supervised classification algorithms so that very large, high-dimensional medical data sets can be profiled effectively and accurately. The use of volume prototypes reduces the influence of differences in the distribution and density of the data on the clustering result, while similarity-driven merging helps to determine the appropriate number of clusters, starting from an overestimated number of clusters. The extended versions of the fuzzy c-means and Gustafson-Kessel clustering algorithms are able to determine an appropriate partition of the data automatically, without additional user input. An adaptive similarity threshold was proposed for this purpose.