A non-linear Kernel feature subset selection based semi-supervised framework for medical disease prediction



¹G. Kranthi Kumar, ²Dr. R. Satya Prasad

¹Research Scholar, ²Professor

¹,²Department of Computer Science & Engineering, Acharya Nagarjuna University, Guntur, Andhra Pradesh (India)

Abstract— As the size of medical datasets increases, the task of predicting disease patterns from high dimensional features also grows in biomedical applications. Micro-array data play a significant role in understanding the pathogenesis of various diseases. Many existing methods cannot predict diseases that have no identified genes, and do not support semi-supervised learning. Meanwhile, several other approaches struggle to prioritize associations for all diseases at the same time.

Therefore, an algorithm that can identify reliable disease candidates using existing gene-disease associations verified by biological experiments is essential. To address these problems, a semi-supervised learning model based on non-linear feature selection is proposed for disease prediction. A wrapper-based hybrid correlation approach is used to partition the feature space into k related feature subsets. Finally, a deep neural network architecture is designed and implemented to enhance disease prediction on high dimensional datasets. Experimental results show that the proposed model is more reliable than existing models in terms of true positive rate and receiver operating characteristics (ROC).

Keywords—High dimensional Datasets, Kernel functions, classification, neural network.

I. INTRODUCTION

Feature selection is an important task when dealing with high dimensional data. It can be used to reduce the dimension of the data, remove irrelevant and redundant features, reduce training time and improve learning performance[5–8]. Feature selection techniques can be divided into three types: filter, wrapper and embedded[1][2]. Filter techniques choose the feature subset in a pre-processing step, independently of the chosen classifier.

Wrapper techniques use a single learner as a black box to evaluate feature subsets by their predictive performance. Embedded techniques select features as part of the training process and are usually specific to an individual learner. In addition, feature selection techniques can be grouped into three classes according to their use of class label information: supervised feature selection, unsupervised feature selection and semi-supervised feature selection.

Supervised feature selection techniques use labeled data to select features and evaluate feature importance by measuring the correlation between each feature and the class label. Unsupervised techniques measure the value of features by their ability to preserve specific properties of the data, such as variance and locality[3]. Because they lack label information, unsupervised selection techniques generally perform worse than supervised ones. However, supervised selection techniques require sufficient labeled data, which demands extensive expertise and is costly to acquire. Semi-supervised techniques evaluate feature relevance using the label information of the labeled data together with the data distribution or local structure of both labeled and unlabeled data[4]. Recently, research has focused on semi-supervised feature selection, and various semi-supervised feature selection strategies have been suggested in the literature. However, there is no comprehensive study of semi-supervised selection techniques.

Many existing works study semi-supervised feature selection methods, categorize the strategies from two different angles, summarize their specific properties and explain their advantages and disadvantages. To the best of the authors' knowledge, this is the first extensive study of semi-supervised selection techniques that classifies them from two separate perspectives. Semi-supervised learning learns from a small amount of labeled data and a large amount of unlabeled data[56–58]. It relies on smoothness-related assumptions, such as the cluster assumption[5] and the manifold assumption[6]. The cluster assumption states that samples in the same cluster are likely to belong to the same class.

The manifold assumption states that high dimensional data lie on a low dimensional manifold.

There are a variety of semi-supervised learning methods in the literature. These techniques can be divided into generative models, self-training, co-training, semi-supervised SVM (S3VM) and graph-based techniques. In the last decade, machine learning techniques have been examined for high dimensional data analytics. All these studies aim to produce biologically relevant interpretations of complex datasets, which are necessary to guide follow-up experiments. Micro-array technology allows all g genes to be measured on a single chip, so that each data point supplied by an experimenter lies in a g-dimensional space. Furthermore, the sample size in these experiments is often very small. For example, the popular high dimensional leukemia dataset contains only 72 samples, each with expression levels for 7,130 genes. Feature selection methods are essential for helping researchers understand their data in such cases, particularly when the purpose of the study is to identify genes with significant biological associations to the classes. The filter method is typically used to select a subset of the input features.

By applying the correct feature selection strategy, the overall performance of the classification system can be significantly increased. This process uses various statistical tests to select a feature subset with adequate predictive performance. Initially, a statistical analysis is applied and a score is then calculated for each feature. Feature selection is considered a critical pre-processing phase of classification, and the most important step in overcoming the curse of dimensionality.

The extreme learning machine is a learning algorithm for feed-forward networks with a single hidden layer. Several variants of extreme learning machines are available and are used in various biomedical applications. Due to the cost and complex nature of micro-array datasets, it is very difficult to predict disease status. Gene expression is an important mechanism in which the DNA sequence of a gene is transcribed into the corresponding mRNA sequence. Gene expression profiling determines the expression levels of a very large number of genes within a given cell. Cluster analysis is a well-suited method for understanding complex diseases and can also improve prognosis. All conventional clustering methods require certain parameters to be chosen.

Classification can be described as the method used to assign the items in question to separate, previously specified classes.

The contribution of this paper is a novel non-linear feature selection based semi-supervised learning model for high dimensional datasets that improves the true positive rate and accuracy.

II. RELATED WORKS

In supervised learning, classifiers use only the labeled data from a training set to build a classification model. The model is then used to predict the label or class membership of unlabeled test samples. In real-life scenarios, however, labeled data is often difficult and costly to obtain. Obtaining labels also takes a lot of time in many realistic situations and, most importantly, human specialists are needed to label a collection of data. On the other hand, unlabeled data is often abundant, comparatively easy and inexpensive to obtain. However, perhaps because the proportion of unlabeled data is so large relative to the labeled data, it is often not clear how unlabeled data can be used to improve prediction. Consequently, the majority of classification methods ignore unlabeled data and rely solely on labeled datasets. Learning from the labeled data alone is referred to as "supervised or inductive learning". Using unlabeled data offers new potential for improving classification. There are, however, problems: the data-generating process for the unlabeled data or the test set may not be exactly the same as for the labeled data or the training set. Semi-supervised learning techniques have been introduced to build improved classifiers that perform well on out-of-sample data (i.e. the test set). The EM approach treats the labels of the unlabeled data as missing and estimates them from the complete dataset. The EM algorithm requires an E-step and an M-step in each iteration. Given the current parameter estimates, the E-step computes the expected value of the unknown quantities. The M-step re-estimates the distribution parameters so as to maximize the expected posterior probability. Unfortunately, these steps are sometimes analytically intractable because they require computing a high dimensional integral.

In order to overcome this problem, a new algorithm, the Monte Carlo EM (MCEM)[9], has been proposed. Building on previous studies of a Bayesian kernel machine for cancer classification[10], a new semi-supervised Bayesian SVM binary classification method (Semi-BSVM) has been developed. The model learns from both labeled and unlabeled data. The rate of learning from the unlabeled data is controlled by a learning parameter and its prior. An empirical Bayes approach is also provided to estimate the learning parameter by maximizing the marginal likelihood. Several classifiers have been chosen for comparison because they belong to different classifier families and are known to give good results in terms of precision. Because different classification methods are used, one can be expected to be more reliable than the others for a particular type of training dataset. J48 (decision tree classification) is one of the most common tree classifiers. IB1 and IBk are "lazy" classifiers, which base their classification on the k nearest neighbours. SMO belongs to the function group and uses the concept of support vector machines in its classification algorithm.

NaiveBayes belongs to the group of Bayes classifiers; it uses estimator classes. NNge comes from the group of rules and shares similarities with the lazy classifiers, because it uses a nearest-neighbour-like algorithm together with if-then rules to classify.
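The E-step and M-step iteration described earlier in this section can be illustrated with a small sketch. It assumes a two-class, one-dimensional Gaussian mixture and uses maximum-likelihood updates rather than the posterior-maximizing updates of the cited methods; the function name `semi_supervised_em` and the toy data are hypothetical.

```python
import numpy as np

def semi_supervised_em(x_lab, y_lab, x_unl, n_iter=50):
    """EM for a two-class 1-D Gaussian mixture with partly missing labels."""
    x_lab, y_lab, x_unl = map(np.asarray, (x_lab, y_lab, x_unl))
    x_all = np.concatenate([x_lab, x_unl])
    # Class-1 responsibilities of labeled points are fixed to their labels.
    r_lab = y_lab.astype(float)
    # Initialize the parameters from the labeled data only.
    mu = np.array([x_lab[y_lab == 0].mean(), x_lab[y_lab == 1].mean()])
    sigma = np.array([1.0, 1.0])
    pi1 = 0.5
    for _ in range(n_iter):
        # E-step: expected class membership of each unlabeled point.
        lik0 = (1 - pi1) * np.exp(-0.5 * ((x_unl - mu[0]) / sigma[0]) ** 2) / sigma[0]
        lik1 = pi1 * np.exp(-0.5 * ((x_unl - mu[1]) / sigma[1]) ** 2) / sigma[1]
        r_unl = lik1 / (lik0 + lik1)
        # M-step: re-estimate parameters from labeled and unlabeled data together.
        r = np.concatenate([r_lab, r_unl])
        w = np.stack([1 - r, r])  # shape (2, n): weight of each point per class
        mu = (w * x_all).sum(axis=1) / w.sum(axis=1)
        sigma = np.sqrt((w * (x_all - mu[:, None]) ** 2).sum(axis=1) / w.sum(axis=1))
        pi1 = r.mean()
    return mu, sigma, pi1

rng = np.random.default_rng(0)
x_unl = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
mu, sigma, pi1 = semi_supervised_em([-3.2, -2.8, 2.9, 3.1], [0, 0, 1, 1], x_unl)
```

With only four labeled points, the unlabeled samples still pull the class means towards the centres of the two groups, which is the benefit that semi-supervised EM is meant to provide.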

III. PROPOSED MODEL

The overall architecture of the proposed model is shown in Fig. 1. Initially, each high dimensional disease dataset is processed to find the synonyms of each gene feature for efficient gene-symbol to gene-name mapping. Each high dimensional training dataset is pre-processed using a data transformation function to remove the variation in the data distribution. In the proposed work, a novel mathematical data transformation function is used to normalize the input training data for the clustering and wrapper feature ranking process in the mapper phase.


Figure 1: Proposed mathematical normalization based semi-supervised deep neural network. [Diagram blocks: High dimensional datasets; Non-linear normalization; Data partitions; Semi-supervised feature selection; Features subset-list; Semi-supervised learning model (Partition-1, Partition-2, ..., Partition-m); Deep neural network framework; Merge all majority voting patterns of each mapper.]

Non-linear Data filtering

1. Let the input dataset be represented as D.

2. For each feature in the feature set F

3. Perform

4. Apply the non-linear data transformation as

5. Th =

6. Where

7. If ( > 0.75)

8. Then

9. Normalize the data to the range 0 to Th using min-max normalization.

10. Else

Normalize the data to the range 0 to 1 using min-max normalization.

11. End if

Here, min-max normalization is used to scale the data to the range 0 to Th or 0 to 1, depending on the condition in step 7. This approach is used to clean the data in high dimensional datasets.
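The min-max scaling used in steps 9 and 10 can be sketched as follows. This is a minimal illustration: the function name `min_max_scale` and the parameter `upper` are hypothetical, and because the expressions for the transformation and for Th are not recoverable from the text, the upper bound is simply passed in by the caller.

```python
import numpy as np

def min_max_scale(column, upper=1.0):
    """Scale a 1-D feature column linearly into the range [0, upper]."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:
        # Constant feature: map everything to 0 to avoid division by zero.
        return np.zeros_like(column)
    return (column - lo) / (hi - lo) * upper

# Scaling to [0, 1] (step 10) and to [0, Th] with an assumed Th = 0.8 (step 9).
scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
scaled_th = min_max_scale([2.0, 4.0, 6.0, 10.0], upper=0.8)
```

The smallest value always maps to 0 and the largest to the chosen upper bound, so features measured on different scales become directly comparable.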

Algorithm: Ranked Semi-supervised PCA (RPCA)

Input: Product training data.

Output: Ranked features.

Step 1: Take the normalized data ND as input.

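Step 2: Compute the covariance matrix between the features F:

$$CV(NF) = \frac{\sum_{i=1}^{n} \left( NF[x[i]] - \mu_{NF} \right)^{2}}{n - 1}$$

where $\mu_{NF}$ is the mean of the normalized feature NF over the n samples.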

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to find the optimal eigen sum as given below:

Step 4: Select the top-ranked key features using the optimal eigen sum values for feature selection and classification.
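Steps 2 to 4 can be illustrated with a short sketch. The expression for the "optimal eigen sum" is not given in the text, so the score used here is an assumption: each feature is scored by the sum of its absolute loadings on the leading eigenvectors, weighted by the corresponding eigenvalues. The function name `rank_features_by_pca` and the toy data are hypothetical.

```python
import numpy as np

def rank_features_by_pca(X, n_components=2):
    """Return feature indices sorted from most to least important, plus scores.

    Assumed score (not the paper's formula): eigenvalue-weighted sum of the
    absolute loadings of each feature on the leading eigenvectors.
    """
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)            # columns are features; divides by n - 1
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = np.abs(eigvecs[:, order]) @ eigvals[order]
    return np.argsort(scores)[::-1], scores

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 5.0, 200),   # high-variance feature
    rng.normal(0, 1.0, 200),   # medium-variance feature
    rng.normal(0, 0.1, 200),   # low-variance feature
])
ranking, scores = rank_features_by_pca(X, n_components=2)
```

In this toy example the features are ranked in order of the variance they contribute, which is the behaviour expected of a PCA-based ranking on unscaled inputs.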

Wrapper feature ranking based deep neural network classification

The cluster features of k-means are evaluated on high dimensional datasets using standard feature selection measures such as the t-statistic, significance analysis of microarrays (SAM) and signal-to-noise ratio (SNR). The main problem with these ranking methods is choosing the correct genes in the high dimensional feature space during the wrapper process. These ranking metrics affect the classification reliability and true positive rate of the features selected with the t-test, SAM and SNR measures (> 50). The modified versions of the SAM, SNR and t-test measures are summarized below.

1: Assign feature weights using the maximized weights.

2: Read the number of clusters c.

3: Read the number of iterations I.

4: Initialize k random clusters as centroids.

5: Repeat until c clusters

Find the nearest cluster using the following distance metric. Let G1 be the gene set and G2 the document vector:

$$\mathrm{Dist}(G_1, G_2) = \frac{\log\big(\cos(G_1[i], G_2[i])\big)}{\chi^{2}(G_1, G_2) \cdot \sqrt[3]{G_1[i]^{2} - G_2[i]^{2}}}$$

Done

6: Update the cluster centroid using the mean distance.

7: Construct the filtered top k-clusters FC[k].

8: For each ranked feature FC[i] do

Check the distance metric > 0

If (dist(SC[i], C[k]) > 0) Then

Classify the instance SC[i].

End if

9: Done
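The assign-and-update loop of steps 4 to 6 can be sketched as a plain k-means routine. Note the substitution: Euclidean distance stands in for the distance metric defined above, and the function name `kmeans`, the seed and the toy data are hypothetical.

```python
import numpy as np

def kmeans(X, c, iterations, seed=0):
    """Plain k-means; Euclidean distance is a stand-in for the paper's metric."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 4: pick c random samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iterations):
        # Step 5: distance from every sample to every centroid, shape (n, c).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 6: move each centroid to the mean of its assigned samples.
        for k in range(c):
            members = X[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
labels, centroids = kmeans(X, c=2, iterations=10)
```

Swapping in another metric only requires replacing the line that computes `dists`, which keeps the rest of the loop unchanged.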

Using the standard deviation of the class labels, the t-statistic weighting measure is used to find the variation in the gene features. It is the ratio of the difference between the class means to the maximized, size-normalized class variance.

$$W_1 = \frac{\mu_P - \mu_N}{\max\{\sigma_P^{2} / |P|,\ \sigma_N^{2} / |N|\}} \quad ---(1)$$

where $\mu_P$ is the mean of the positive cluster class samples and $\mu_N$ is the mean of the negative cluster class samples.

The signal-to-noise ratio is the ratio of the difference between the class means to the sum of the standard deviations of the positive and negative gene disease classes. Here, for data classification, the genes with the highest signal-to-noise ratio measure are chosen as the highest weighting measure.

$$Wgt_1 = HSNR = \frac{|F(\mu_P^{i}) - F(\mu_N^{j})|}{2\,(F(\sigma_P^{i}) + F(\sigma_N^{i}))} \quad ---(2)$$

where $F(\mu_P)$ and $F(\sigma_P)$ are the mean and standard deviation of the cluster positive class samples, and $F(\mu_N)$ and $F(\sigma_N)$ are the mean and standard deviation of the cluster negative class samples.

$$Wgt_2 = \mathrm{Max}\left\{ \mathrm{Corr}(CFeatures : CF),\ \frac{|F(\mu_P^{i}) - F(\mu_N^{j})|}{2\,(F(\sigma_P) + F(\sigma_N))},\ \frac{F(\mu_P) - F(\mu_N)}{\max\{F(\sigma_P^{2}) / |P|,\ F(\sigma_N^{2}) / |N|\}} \right\} \quad ---(3)$$

Step 1: Initialize the neural network weights using the weight vector generated from Wgt1 and Wgt2 as W[] = Max{Wgt1, Wgt2}.
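The weighting measures can be computed per feature as in the sketch below. It assumes the t-statistic weight is the difference of class means divided by the larger size-normalized variance, and the SNR weight is the absolute mean difference divided by twice the sum of the class standard deviations. The correlation term Corr(CFeatures : CF) is left out because its definition is not given in the text; all function names and the toy data are hypothetical.

```python
import numpy as np

def t_stat_weight(pos, neg):
    """(mu_P - mu_N) / max(var_P / |P|, var_N / |N|), computed per feature."""
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    var_p, var_n = pos.var(axis=0, ddof=1), neg.var(axis=0, ddof=1)
    return (mu_p - mu_n) / np.maximum(var_p / len(pos), var_n / len(neg))

def snr_weight(pos, neg):
    """|mu_P - mu_N| / (2 * (sigma_P + sigma_N)), computed per feature."""
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0, ddof=1), neg.std(axis=0, ddof=1)
    return np.abs(mu_p - mu_n) / (2.0 * (sd_p + sd_n))

def initial_weights(pos, neg):
    """Element-wise maximum of the two measures (correlation term omitted)."""
    return np.maximum(snr_weight(pos, neg), t_stat_weight(pos, neg))

# Rows are samples, columns are features. Feature 0 separates the classes,
# feature 1 has the same mean in both classes.
pos = np.array([[5.0, 1.0], [7.0, 1.2], [6.0, 0.8]])
neg = np.array([[1.0, 1.0], [2.0, 1.1], [3.0, 0.9]])
w = initial_weights(pos, neg)
```

A feature whose class means coincide receives a weight close to zero, so it contributes little to the initial network weights.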

Step 2: Define the input, hidden and output layers for each mapper for parallel processing.

Step 3: Apply the logistic activation function to each hidden layer for weight and error rate optimization.

Step 4: Classify the data using the deep neural network framework until the error rate and weights have converged.
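Steps 1 to 4 can be illustrated with a deliberately small network. This sketch makes several simplifying assumptions that are not taken from the paper: a single hidden layer instead of a deep architecture, no mapper-level parallelism, full-batch gradient descent on the cross-entropy loss, and feature weights applied by scaling the random first-layer weights. The names `train_mlp`, `predict` and `init_w` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, init_w, hidden=8, lr=0.5, max_epochs=3000, tol=1e-8, seed=0):
    """One-hidden-layer logistic network trained by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1 (assumed reading): scale random first-layer weights per feature.
    W1 = rng.normal(0, 0.1, (d, hidden)) * init_w[:, None]
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, 1))
    b2 = np.zeros(1)
    prev_loss = np.inf
    for _ in range(max_epochs):
        H = sigmoid(X @ W1 + b1)              # Step 3: logistic hidden layer
        p = sigmoid(H @ W2 + b2).ravel()      # output probability of class 1
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if abs(prev_loss - loss) < tol:       # Step 4: stop once the error converges
            break
        prev_loss = loss
        grad_out = (p - y)[:, None] / n       # gradient at the output pre-activation
        grad_W2 = H.T @ grad_out
        grad_b2 = grad_out.sum(axis=0)
        grad_H = (grad_out @ W2.T) * H * (1 - H)
        grad_W1 = X.T @ grad_H
        grad_b1 = grad_H.sum(axis=0)
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
    return W1, b1, W2, b2

def predict(X, params):
    """Return hard 0/1 predictions from trained parameters."""
    W1, b1, W2, b2 = params
    return (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel() >= 0.5).astype(int)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 4)), rng.normal(1, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
params = train_mlp(X, y, init_w=np.ones(4))
accuracy = float((predict(X, params) == y).mean())
```

Passing the element-wise maximum of the feature weights as `init_w` would bias the first layer towards the higher-ranked features at the start of training.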

IV. EXPERIMENTAL RESULTS

Various high dimensional datasets from biomedical databases are selected to test the quality of the proposed model against existing models. Table 1 lists the datasets used for experimental evaluation. 10% of the training data is used as test data for performance assessment. The proposed ensemble selection method improves the true positive rate and reliability across the high dimensional datasets. The proposed model uses the entire training dataset to build decision patterns; thus, its predictions in each cross-validation fold appear more reliable than those of conventional ensemble classification models. The experimental results show that the proposed ensemble classification improves the overall true positive and true negative rates. The main advantage of the proposed model is the reduction of the error rate on high dimensional features.

Table 1: Datasets and their characteristics

The proposed model improves the true positive rate and reliability on entire high dimensional datasets. Because it uses the entire training dataset to construct decision patterns, its predictions in each cross-validation fold appear more accurate than those of conventional ensemble classification models.

PROPOSED PATTERNS ---


Table 3: Accuracy (%) of the proposed approach compared with traditional approaches on different high dimensional datasets using a 10% feature set.

Model                                                      DLBCL   Prostate   Lymphoma   BreastCancer

PSO+Ensemble                                               87.45   91.43      90.54      85.35

ACO+Ensemble                                               86.35   89.13      91.33      82.54

Kernel based Deep neural network                           98.93   96.35      97.14      94.92

Non-linear feature selection based semi-supervised model   99.23   97.43      97.75      97.57

The performance of the proposed model on all cancer datasets is listed in Table 3. Here, the average true positive rate and accuracy on the high dimensional datasets are evaluated for the proposed model. The table shows that the proposed approach has a higher true positive rate and accuracy than the existing models.
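The two measures used in this comparison can be computed from predicted and true labels as in the following sketch. The function name and the toy labels are hypothetical and are not taken from the experiments.

```python
def true_positive_rate_and_accuracy(y_true, y_pred):
    """Return (TPR, accuracy) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # TPR = TP / (TP + FN); defined as 0.0 when there are no positive samples.
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    accuracy = (tp + tn) / len(pairs)
    return tpr, accuracy

tpr, acc = true_positive_rate_and_accuracy(
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 1, 0],
)
```

The true positive rate considers only the positive samples, so it can differ from accuracy when the classes are imbalanced.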

V. CONCLUSION

To find the essential feature set in a large feature space, a deep neural network with a weighting function is used. Since the weights of the deep neural network are optimized by the weighting function and the logistic function, the proposed model efficiently classifies large, high dimensional data. An algorithm is needed that can identify valid disease candidates using established gene-disease associations confirmed by biological experiments. To address this, a semi-supervised learning method based on non-linear feature selection is proposed for disease prediction. The hybrid wrapper-based correlation approach is used to partition the feature space into k related feature subsets.

Finally, the deep neural network architecture was designed and implemented to enhance disease prediction on high dimensional datasets. Experimental results show that, compared with existing models, the proposed model is more accurate in terms of true positive rate and receiver operating characteristics (ROC).


REFERENCES

[1] S. Belciug and F. Gorunescu, Learning a single-hidden layer feedforward neural network using a rank correlation-based strategy with application to high dimensional gene expression and proteomic spectra datasets in cancer detection, Journal of Biomedical Informatics 83 (2018) 159–166.

[2] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, J. M. Benítez and F. Herrera, A review of high dimensional datasets and applied feature selection methods, Information Sciences 282 (2014) 111–135.

[3] V. Bolón-Canedo, N. Sánchez-Maroño and A. Alonso-Betanzos, Distributed feature selection: An application to high dimensional data classification, Applied Soft Computing 30 (2015) 136–150.

[4] S. H. Bouazza, K. Auhmani, A. Zeroual and N. Hamdi, Selecting significant marker genes from high dimensional data by filter approach for cancer diagnosis, Procedia Computer Science 127 (2018) 300–309.

[5] S. Chormungea and S. Jena, Correlation based feature selection with clustering for high dimensional data, Journal of Electrical Systems and Information Technology, 2017.

[6] R. Dash, A two stage grading approach for feature selection and classification of high dimensional data using Pareto based feature ranking techniques: A case study, Journal of King Saud University - Computer and Information Sciences.

[7] M. Ghosh, S. Begum, R. Sarkar, D. Chakraborty and U. Maulik, Recursive memetic algorithm for gene selection in high dimensional data, Expert Systems with Applications.

[8] S. Guo, D. Guo, L. Chen and Q. Jiang, A L1-regularized feature selection method for local dimension reduction on high dimensional data, Computational Biology and Chemistry 67 (2017) 92–101.

[9] Q. Hou, Z. Bing, C. Hu, M. Li, K. Yang, Z. Mo, X. Xie, J. Liao, Y. Lu, S. Horie and M. Lou, RankProd combined with genetic algorithm optimized artificial neural network establishes a diagnostic and prognostic prediction model that revealed C1QTNF3 as a biomarker for prostate cancer.

[10] M. Kumar and S. K. Rath, Feature selection and classification of high dimensional data using machine learning techniques.

[11] Y. Liu, Prominent feature selection of high dimensional data, Progress in Natural Science 19 (2009) 1365–1371.

[12] H. Lu, J. Chen, K. Yan, Q. Jin, Y. Xue and Z. Gao, A hybrid feature selection algorithm for gene expression data classification, Neurocomputing.

[13] M. Mollaee and Md. H. Moattar, A novel feature extraction approach based on ensemble feature selection and modified discriminant independent component analysis for high dimensional data classification.

[14] T. M. Nair, Statistical and artificial neural network-based analysis to understand complexity and heterogeneity in preeclampsia, Computational Biology and Chemistry.

[15] M. Panda, Elephant search optimization combined with deep neural network for high dimensional data analysis, Journal of King Saud University - Computer and Information Sciences.

[16] S. P. Potharaju and M. Sreedevi, Distributed feature selection (DFS) strategy for high dimensional gene expression data to improve the classification performance, Clinical Epidemiology and Global Health.

[17] Y. Prasad, K. K. Biswas and M. Hanmandlu, A recursive PSO scheme for gene selection in high dimensional data, Applied Soft Computing.

[18] B. Sahu, S. Dehuri and A. K. Jagadev, Feature selection model based on clustering and ranking in pipeline for high dimensional data, Informatics in Medicine Unlocked.

[19] S. Sun, Q. Peng and X. Zhang, Global feature selection from high dimensional data using Lagrange multipliers, Knowledge-Based Systems.