Error margin analysis for feature gene extraction

(1)

stably discriminate distinct types of disease conditions, e.g. tumors and normals. Since gene microarray data normally involves thousands of genes at, tens or hundreds of samples, the gene extraction process may fall into local optimums if the gene set is optimized according to the maximization of classification accuracy of the classifier built from it.

Results: In this paper, we propose a novel gene extraction method of error margin analysis to optimize the feature genes. The proposed algorithm has been tested upon one synthetic dataset and two real microarray datasets. Meanwhile, it has been compared with five existing gene extraction algorithms on each dataset. On the synthetic dataset, the results show that the feature set extracted by our algorithm is the closest to the actual gene set. For the two real datasets, our algorithm is superior in terms of balancing the size and the validation accuracy of the resultant gene set when comparing to other algorithms.

Conclusion: Because of its distinct features, error margin analysis method can stably extract the relevant feature genes from microarray data for high-performance classification.

Background

Gene expression data commonly involve thousands of genes at, tens or hundreds of samples. In order to reduce the computation cost and complexity of the clas-sification, feature extraction on gene expression pattern is necessary. The objective of feature gene extraction process is to select the gene set that can be used to effectively and stably discriminate distinct types of dis-ease statuses, e.g. tumors and normals.

According to the terminology proposed in [1], one of the major approaches available in feature selection is filter model. It uses statistical techniques over the training patterns to “filter out” irrelevant features. Yet the “filtering” process can be further divided to for-ward selection and backfor-ward elimination. In forfor-ward selection [2], variables are progressively incorporated into larger and larger subsets, whereas in backward elimination, one starts with the set of all variables and progressively eliminates the least relevant ones. In the field of bioinformatics, there is a belief that the class of a gene expression pattern, either normal or

cancerous, correlates to the amount of changes in expression levels of feature genes. Thus, inversely, the gene level difference between normal-class patterns and cancer-class patterns is a promising guidance to identify feature gene. The p-value in t-test between normal-class and cancer-class patterns is a more reli-able guidance as it considers not only the level differ-ence but also the significance of the differdiffer-ence. In [3], a gene is regarded as feature if the corresponding p-value is higher than a pre-determined cutoff p-value. Cao et al. [4] defined the relevance of a gene as the sensi-tivity of the output to the inputs in terms of the partial derivative. Guyon et al. in [5] defined the relevance of a gene in terms of its contribution to the cost function in Support Vector Machine (SVM). The corresponding gene ranking method names Recurrsive Feature Elimi-nation (RFE). Several modifications on RFE, such as SQRT-RFE and Entropy-based RFE [6], were proposed to speed up the rank list construction process. Since the importance of variables is not assessed in the con-text of which other variables are not yet included, weaker subsets found by forward selection. Backward elimination method may outsmart it by eliminating the least promising variables and meanwhile providing the

* Correspondence: [email protected]

1_{Research institute of Innovative Products and Technologies, The Hong Kong}

Polytechnic University, Hong Kong SAR, PR China

© 2010 Chow et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

(2)

best classification from dependent variables (the vari-ables that together perform best classification).

Wrapper is another approach to feature gene selec-tion. In this approach, a feature gene set is found by optimizing certain measure quantities. Examples of these quantities include cross-validation [7] and boot-strap [8]. Shevade and Keerthi in [9] extracted feature gene by optimizing a SVM-liked energy function. Zhu et al. [10] presented a Markov blanket-embedded genetic algorithm (MBEGA) for gene selection problem. They used memetic operators to add or delete features (or genes) from a Genetic Algorithm (GA) solution in order to speed-up the GA convergence. Hong and Cho [11] enhanced the population divergence of a GA-based wrapper model by explicit fitness sharing. They also modified the representation of chromosome in GA to suit for large scale feature selection. Li et al. [12] pre-sented a statistical approach for feature gene selection. Many subsets of genes that can well classify the training samples are identified; using GA, and the most fre-quently appeared genes in the subsets are then pre-sumed as feature genes. Raymer et al. [13] reported a feature extraction algorithm to which feature selection, feature extraction, and classifier training are performed simultaneously, using a GA with the objective function involving training accuracy and the number of feature genes. Huerta et al. [14] suggested combining GA with SVM for the classification of microarray data. GA was used to evolve gene subsets, whereas SVM was used to evaluate the fitness values of the gene subsets in terms of classification accuracy. Shen et. al. in [15] reported a similar feature gene selection algorithm. It combined a discrete Particle Swarm Optimization (PSO) for search and SVM for fitness evaluation.

Gilad-Bachrach et al. [16] introduced a margin based feature selection criterion and applied it to measure the quality of a gene subset. A gene subset is said as optimal if the corresponding classifier has maximum error margin.

Most of the proposed feature selection algorithms [9-15] presume that the performance of feature gene set is associated with the training accuracy of the classifier built from it. However, since the number of training patterns related to the pattern dimension is small, train-ing accuracy is not a representative performance mea-sure. Alternatively, validation accuracy is a more objective and reliable performance measure. Though validation accuracy is never known in the training pro-cess, one can divide a training set of n samples into m non-overlapping subsets of roughly equal size; m - 1 of these subsets are combined as new training set and the remaining 1 subset is as validation set. The correspond-ing error is so-called cross-validated (CV) error. As noted by Ambroise and McLachlan [17], CV error may introduce a bias to the feature gene selection process. In

addition, they proposed to tackle it (i.e. obtain an almost unbiased estimate) by a two-layered cross-validation approach. On the other hand, the validation accuracy relates to the generalization of a classifier whilst the generalization of a classifier is commonly measured from its error margin. It is reasonable to hypothesize that validation accuracy is proportional to the width of error margin. And it is worth to represent the perfor-mance of a feature gene set by its error margin.

In this paper, we proposed a novel feature gene extrac-tion scheme, namely Error-Margin Analysis (EMA). EMA, as the name suggests, equates the performance of a feature gene set to error margin instead of classification accuracy. EMA starts from building an error margin curve representing error margin versus the number of mostly relevant genes. Afterwards, an analysis on the curve is performed to identify the optimal feature gene set. The proposed approach differs from [5] in the senses that the selection criterion is margin-based and para-meter-less. It is also in contrast to [16], in which the fea-ture genes are preferred to solely maximizing the error margin. Though [18] considers error margin in measur-ing the performance of a feature gene set, proper selec-tions of penalty coefficient and the size value are critical. In summary, EMA has an advantage over [7-15] in mea-suring the performance of a feature gene set. Addition-ally, it is superior to [3-5] in the sense that the number of the feature genes extracted EMA is parameter-indepen-dent, whereas others are according to parameter settings.

EMA is based on two assumptions. It is assumes that 1) genes are independently expressed; 2) the distribu-tions of gene expression are in Gaussian.

The rest of this paper is organized as follows: We first present an analysis on the relation between error margin and the number of feature genes. Afterwards, we pro-posed a novel feature gene extraction algorithm based on the error margin analysis. The experimental results are then reported and conclusions are drawn.

Results Datasets

In this section, the performance of EMA is evaluated on three datasets:

i. Synthetic dataset

The Synthetic dataset acts as a control to check whether an algorithm underestimates or over-estimates the number of feature genes. It is assumed that the fea-ture genes are distributed in Gaussian and the non-fea-ture genes are uniformly and randomly distributed. Given an artificial pattern x = [x1, x2,..., x500] with the class y, the distribution pi(x) of the gene xiis shown in

Table 1, where G x a b b x a b ( | , )= exp⎛−( ) ⎝ ⎜ ⎞ ⎠ ⎟ − 1 2 2 2 2 2  . It is

(3)

dataset as possible, in order to maximize the amount of possible pathways to the cancer diagnosis. Thus, the result on the Synthetic dataset indicates the ability of which the feature genes extracted by an algorithm cover the actual feature gene set. In this data set, each artifi-cial pattern consists of 500 genes; the first 20 genes are assigned as desired feature genes and the remaining 480 genes are assigned as non-feature genes.

ii. Gastric cancer dataset [19]

This dataset shows expression levels of 123 samples (Osaka University Medical School Hospital). A hundred and twelve of them are normal-class patterns and the remaining twelve patterns are cancerous-class. It is available at the link: http://lifesciencedb.jp/cged/

iii. Oral cancer multiple datasets

We have available four microarray datasets; the first was measured with HG-U133 Plus2 and it has 11 normal and 50 cancerous samples, the second is from a HG-U133A and it has 22 normal and 22 cancerous samples, the third set comes from a HG-Focus and has only 22 cancerous samples and the fourth has 12 normal and 26 cancerous samples and measured also with HG-U133 Plus2. All the chips are manufactured by Affymetrix (Santa Clara, CA).

Algorithms for Comparison

To evaluate the impact of EMA, we compare its perfor-mance with five algorithms. The designs and settings of EMA and the algorithms for comparison are summar-ized below.

Test algorithm 1 - SVM with Feature Gene Extraction by Error Margin Analysis (SVM-ema)

SVM-ema estimates the number of feature genes f0 through the analysis on error margin. Given the gene relevance list, SVM-ema constructs the corresponding error margin curve and f0 is estimated as the critical point of the curve.

Test algorithm 2 - SVM with t-test based feature gene extraction (SVM-ttt)

In SVM-ttt [3], the relevance of a gene is measured on its p-value in t-test. A gene is indicated as a feature if its relevance is higher than a given cutoff p-value.

the change of g(f) is just smaller than threshold. Test algorithm 4 - SVM with Margin-based Selection Criterion (SVM-msc)

SVM-msc [16] performs selection by search the feature gene set that maximizing a margin-based criterion. Test algorithm 5 - Bayesian Logistic Regression (BLogReg) BLogReg [20] is a gene selection algorithm based on sparse logistic regression (SLogReg). The regularization parameter arising in SLogReg is eliminated, via Bayesian marginalization, without a significant effect on predictive performance. The source code of BLogReg is taken from [21].

Test algorithm 6 - STW feature selection using generalized logistic loss (STW)

STW [22] was implemented exactly the same as SVM-RFE except that the hinge loss in SVM-SVM-RFE is replaced with the generalized logistic loss.

For SVM-ema, the parametric model G(.) for the esti-mation of LOOErM curve is chosen as second-order polynomial. The cutoff p-value of SVM-ttt is assigned as 0.005. For SVM-rfe, as suggested in [6], the threshold for obtaining the optimal number of feature is 0.0001 and the error is based on 3-fold experimental structure. The results of BLogReg and STW are obtained under the default parameters assigned in the corresponding source codes.

Experiment Settings

For the Synthetic dataset, five hundreds patterns are generated in each run. Twenty five of them form train-ing pattern set and the remaintrain-ing four hundreds and seventy-five patterns form validation pattern set for per-formance measure. In each of the pattern sets, half of the patterns belong to negative class and another half belong to positive class.

For the Gastric cancer dataset, suppose n- is the number of normal-class patterns and n+ is number of cancer-class patterns in T, and r is the sampling rate, we randomly pick rn+ positive-class patterns and rn -negative-class patterns in T to form the training set. The remaining (1-r)n+ positive-class patterns and the

(4)

remaining (1-r)n-negative-class patterns in T forms the validation set. The simulation is repeated with the sam-pling rate rising from 0.3 to 0.6.

For the Oral Cancer multiple datasets, the first three datasets form a superset O. Suppose n-is the number of normal-class patterns and n+ is number of cancer-class patterns in O, and r is the sampling rate, we randomly pick rn+ positive-class patterns and rn- negative-class patterns in O to form the training set. Meanwhile, the fourth dataset is regarded as the validation set. The cor-responding accuracy represents the generalization ability of a test algorithm on the oral cancer classification pro-blem. The simulation is repeated with the sampling rate rising from 0.1 to 0.7.

To provide a fair and repeatable comparison amongst the test algorithms, the performance of each test algo-rithm on a particular simulation is evaluated based on statistics obtained from 100 independent runs. For the Synthetic dataset, the patterns in both training set and validation set are randomly generated for each run. For the Gastric cancer dataset, the substituted random number is regenerated for each particular invalid expres-sion in each pattern. For Oral cancer multiple datasets, the patterns in the training set are randomly re-picked for each run. All test algorithms are implemented in MATLAB language.

Simulation Results Synthetic dataset

Table 2 lists the statistics of the numbers of the feature genes extracted by the test algorithms. Table 3 lists the statistics of the validation accuracies of the test algo-rithms. The values inside blankets represent the aver-aged number of actual feature genes (i.e. the 20 predefined feature genes) extracted by the correspond-ing algorithms. The averaged numbers of feature genes extracted by SVM-ema, SVM-ttt, SVM-rfe, SVM-msc, BLogReg and STW are 17.9, 43.27, 65.38, 448, 2.45 and 45.42 respectively. Though BLogReg extracted the smal-lest amount of feature genes, it ranks the last on the accuracy measure. The averaged accuracies of BLogReg and STW are 50% and 93.79% respectively; the

remaining algorithms are with 100% averaged accura-cies. On average 16.63 out 17.9 genes extracted by SVM-ema are actual feature genes. The averaged num-ber of actual feature genes extracted by ttt, SVM-rfe, SVM-msc, BLogReg and STW are 19, 19, 18.93, 1 and 8.51 respectively.

Gastric cancer dataset

Figure 1 shows the averaged numbers of feature genes extracted by the test algorithms against the sampling rate r ranging from 0.3 to 0.6. The y-axis of the figure is in log scale. The results of ema, ttt, SVM-rfe, SVM-mcs, BLogReg and STW are represented by the lines with the markers ‘O’, ‘∇’, ‘▖’, ‘*’, ‘◊’ and ‘Δ’ respectively. Seen from the figure, as the sampling rate increases, the number of feature genes fttt extracted by SVM-ttt increases from 263.1 at r = 0.3 to 458.9 at r = 0.6, which is approximately linearly proportional to r. For SVM-rfe, the number of extracted feature genes frfe slightly increases from 79.5 at r = 0.3 to 84.7 at r = 0.6. For SVM-ema, the number of feature genes fema is insensitive to r. It is in the range [51.4, 55.8]. For SVM-mcs, the number of extracted feature genes fmcs is again insensitive to r but constantly stay at a large value ran-ging in [2002.7, 2033]. In contrary to SVM-mcs, BLo-gReg constantly selects small set of feature genes; the corresponding number of extracted feature genes fBLRis in a small range from 2 to 3. Interestingly, the number of extracted feature gene fSTWby STW is inversely pro-portional to the sampling rate. The value of fSTW decreases from 6.85 at r = 0.3 to 2.83 at r = 0.6.

Figure 2 shows the averaged validation accuracy of the test algorithms against the sampling rate r varying from 0.3 to 0.6. The results of the test algorithms are repre-sented by the lines with the same markers in Figure 1. Seen from the figure, SVM-ema, SVM-ttt and SVM-rfe constantly and accurately classify the validation set, the corresponding accuracies range from 99.34% to 100.0%. The validation accuracy of SVM-mcs is just slightly lower than those of the above three algorithms. It ranges from 96.8% at r = 0.3 to 99.19% at r = 0.6. On the other hand, the validation accuracies of BLogReg and STW decrease along with r. The accuracies of BLogReg and Table 2 The statistics of the numbers of feature genes

extracted by the test algorithms: Synthetic dataset

Mean Std. Median Min. Max.

SVM-ema 17.90 (16.63) 1.40 18.00 15.00 23.00 SVM-ttt 43.27 (19.00) 4.72 43.00 31.00 56.00 SVM-rfe 65.38 (19.00) 4.62 66.00 50.00 75.00 SVM-mcs 448.01 (18.93) 36.92 455.50 356.00 500.00 BLogReg 2.45 (1.00) 0.56 2.00 2.00 4.00 STW 45.42 (8.51) 6.24 47.00 16.00 48.00

The values inside blankets represent the averaged number of actual feature genes extracted by the corresponding algorithms.

Table 3 The statistics of the validation accuracies of the test algorithms: Synthetic dataset

Mean Std. Median Min. Max.

SVM-ema 100.00% 0.00% 100.00% 100.00% 100.00% SVM-ttt 100.00% 0.00% 100.00% 100.00% 100.00% SVM-rfe 100.00% 0.00% 100.00% 100.00% 100.00% SVM-mcs 100.00% 0.00% 100.00% 100.00% 100.00% BLogReg 50.00% 0.02% 50.00% 50.00% 50.22% STW 93.79% 7.71% 97.56% 64.44% 100.00%

(5)

Figure 1 The number of feature genes extracted from Gastric cancer dataset by test algorithms.

(6)

STW are in the ranges [85.96%, 87.14%] and [87.23%, 89.88%] respectively.

Oral cancer multiple datasets

Figure 3 shows the averaged numbers of feature genes extracted by the test algorithms against the sampling rate r ranging from 0.1 to 0.7. The y-axis of the figure is again in log scale. Similar to Figure 1 and Figure 2, the results of SVM-ema, SVM-ttt, SVM-rfe, SVM-mcs, BLo-gReg and STW are represented by the lines with the markers ‘O’, ‘∇’, ‘▖’, ‘*’, ‘◊’ and ‘Δ’ respectively. Seen from the figure, for SVM-ema, SVM-ttt, SVM-rfe and SVM-mcs, the influences of the sampling rate to the number of feature genes are similar to those on the Gastric cancer dataset: As the sampling rate increases, the value of ftttlinearly increases from 890.9 at r = 0.1 to 2826.3 at r = 0.7; the value of frfe is insensitive to r and is in the range [81.5, 88.4]; the value of femaslightly increases from 36.5 at r = 0.1 to 56.48 at r = 0.7; the value of fmcsis again insensitive to r but constantly stay at large values ranging in [5367, 5458]. Comparing between SVM-ema and SVM-rfe, though the grow rate of femais large than that of frfe, femais consistently lower than frfe. And it is also significantly lower than fttt and fmcs. For BLogReg, the number of extracted feature genes is yet in low range from 2.37 to 11.53. The value of fSTWincreases from 5.31 at r = 0.1 to 34.6 at r = 0.7.

Figure 4 shows the averaged validation accuracy of the test algorithms against the sampling rate r varying from

0.1 to 0.7. The results are represented by the lines with the same markers in Figure 3. Seen from the figure, with the exception of BLogReg and STW, the validation accuracies of the test algorithms slightly increase along with r. The range of the accuracy of SVM-ttt is [83.32%, 91.58%]. For SVM-ema, its accuracy ranges from 80.5% to 88.6%. The validation accuracies of SVM-rfe and SVM-mcs are in the ranges [81.9%, 86.0%] and [87.7%, 92.26%] respectively. For BLogReg, its validation accu-racy is insensitive to the sampling rate; the accuaccu-racy keeps at a low value ranging from 68.42% to 68.53%. In contrary to BLogReg, the validation of accuracy of STW is much affected by the size of training set. When the value of r is in between of 0.3 and 0.4, the correspond-ing accuracy is at a relatively low value rangcorrespond-ing from [69.95%, 85.11%]. As the value of r reaches 0.6, the accuracy of STW increases to the same of SVM-rfe but is yet lower than that of SVM-ema.

Discussion

Providing a cancer disease correlates to certain amount of genes (namely the actual feature genes), an ideal fea-ture selection algorithm can extract this set of genes from training set without over-extracting the irrelevant genes or filtering-out some of the actual feature genes. The ideality is due to the fact that as more actual fea-ture genes are extracted, the more pathways are pro-vided to the cancer diagnosis. Thus, under a controlled environment, it is suggested to measure the

(7)

performance of an algorithm according to the ratios between the number of feature genes f extracted by this algorithm, the number of actual feature genes fA extracted, and the number of actual feature genes f0. The so called hitting rate rhof this algorithm is defined as fA/f0; and the missing rate rm is defined as (f - fA)/f. The algorithm J is suggested to be superior to another algorithm K if rh(J) is larger than rh(K) and rm(J) is smal-ler than rm(K). Table 4 lists the hitting rates and the missing rates of the test algorithms measured on the synthetic dataset. Seen from the table, ttt, SVM-rfe and SVM-mcs are with high hitting rates but also high missing rates, which infer there are over-extrac-tions of the features. Alternatively, it is suggested that STW underestimates the number of features as its rela-tively low hitting rate. Moreover, BLogReg extremely underestimates the number of features as its unusual low hitting rate, i.e. 5%. In general, SVM-ema is superior to other algorithms as its hitting rate is high and miss-ing rate is low. The results show that SVM-ema can extract the most relevant set of feature genes.

For the cases of two real datasets, Figure 1 and Figure 3 indicate the number of feature genes extracted by dif-ferent algorithms. We found that SVM-ema, SVM-rfe and SVM-mcs are insensitive to the sampling rate, for which the numbers of feature genes just slightly increase along with the sampling rate r. Though SVM-ema and SVM-mcs both employ error margin on their gene selection criterions, SVM-ema consistently result in much less number of feature genes. As indicated in previous sections, irrelevant genes may also contribute to the error margin. The maximization approach of SVM-mcs tends to extract as more genes as possible. Thus, SVM-mcs overextracts feature genes in order to achieve larger error margin. Seen from Figure 1 and Figure 3, the numbers of feature genes extracted by SVM-mcs are unusually large: For the Gastric cancer dataset, the minimal number is 2002, for which nearly 99% genes are regarded as feature. For the Oral cancer data, the number is more than 5000, in which nearly 87% genes are considered as features. Comparing to the results of SVM-ema, the number of feature genes extracted by SVM-mcs is around 35 times and 149 times more than that of SVM-ema for the Gastric can-cer dataset and the Oral cancan-cer datasets respectively. The reason of this difference is that EMA is able to decompose the contributions of the feature genes from those of the background genes. This also indicates that purely maximizing error margin is not a practical selec-tion criterion.

Figure 4 The validation accuracies evaluated on Oral cancer dataset by test algorithms.

Table 4 The average hitting rater_hand the average redundancy rater_rof the test algorithms: Synthetic dataset

SVM-ema SVM-ttt SVM-rfe SVM-mcs BLogReg STW

rh 83.15% 95.00% 95.00% 94.65% 5.00% 42.55%

(8)

While comparing the validation accuracies amongst the test algorithms, SVM-ttt and SVM-mcs should be ignored as their high accuracies are archive by overex-tracting feature genes. Seen from the results shown in Figure 2 and Figure 4, the performance of SVM-ema is better than that of SVM-rfe in terms of not only the validation accuracy but also the number of feature genes. SVM-ema is also superior to BLogReg and STW. This superiority of SVM-ema suggests that 1) margin-based criterion is more suitable to represent the perfor-mance of a feature gene set; and 2) this criterion is more robust than those of BLogReg and STW in the sense that BLogReg and STW may under-estimate the number of feature genes.

Conclusions

This paper proposes a feature extraction algorithm of error margin analysis that uses margin-based criterion to measuring the quality of a feature set. Error margin is a better indicator than training accuracy in repre-senting the generalization ability of a classifier. How-ever, maximizing the error margin may lead to overextraction of features. Therefore, we propose to make a tradeoff between the performance and the number of features, which is done by analyzing the curve of error margin. Under the assumptions on gene independency and on gene distribution, the analysis shows that the error margin of only involving the rele-vant genes grow faster than that of involving random genes. Based on this observation, we model the extrac-tion process as an estimaextrac-tion of critical point in the error margin curve of error margin versus the number of mostly relevant genes. Compared with existing algo-rithms that use either margin-based selection criterion or “filtering” approach, our algorithm has distinct advantage, which has been proven from theoretical framework.

Computational experiments of comparing EMA with other approaches including wrapper models and filtering models. The experimental results show that:

1) Error margin is a more representative measure to the generalization ability of a classifier than training accuracy;

2) Solely maximizing error margin may lead to over-extraction of features;

3) SVM-ema can make right balance between the performance and the size of resultant feature gene set.

Possible future works include 1) an analysis on the error margin curve when the gene distribution is non-Gaussian, 2) deriving a more accurate parametric model for the margin curve segments wI and wR and 3) an

extension to the analysis on error margin of non-linear classifier.

Methods

Error-Margin as an Indicator to Feature Genes

Gene expression level difference and the p-value of the expression level are promising relevance measures of a gene. The gene rank list sorted according to these mea-sures provides guidance for feature gene selection. On the other hand, margins play a crucial role in modern machine learning research. It represents the generaliza-tion ability of a classifier or the confidence of the deci-sion made by the classifier. It is valuable to investigate the possibility of which uses error margin as a criterion to decide how many genes should be selected from the list. In this section, an analysis on the relation between error margin and the number of mostly relevant genes is presented.

Given a training set S = {[xj| yj]} where xj Î X ⊆ ℜ d and yjÎ {-1, 1}, and [u Î ℜd, l] is the decision hyper-plane of S obtained by SVM, the corresponding error margin w is defined as:

w y v y u x j j j j j i j i i d = = ⎛ + ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ =

∑

min min _,  1 (1)

where {hi} are constants, {xj, i} and hence {vj} are ran-dom variables.

Suppose C- contains the indices of all normal-class patterns (i.e. yj= -1 for jÎ C-) in S and C+contains the indices of all cancer-class patterns (i.e. yj= 1 for jÎ C+) in S, since SVM guarantees that the error margin is maximal, the minimal error margin amongst the nor-mal-class patterns equals to that amongst the cancer-class patterns: w v v j C j j C j = − = ∈ − ∈ + min( ) min ₍₂₎

In the rest of this paper, the analysis considers the minimal error margin amongst of the cancer-class pat-terns {vj} for jÎ C+.

We start the error margin analysis by studying the dis-tribution of error margin of a training pattern. The first assumption made in this analysis is that the probability density function qi(x) of the ithgene xiis Gaussian:

q x i x _i i i( ) exp ( ) = + ⎛− − +₊ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ 1 2 2 2 2     (3) where i j i j C n x + ∈ = ₊ +

∑

1 , and i j i i j C n x + + ∈ = ₊ − +

∑

2 1 ₍ ₎2 , . Figure 5 shows the general classification model of gene expres-sion level. i j i j C n x − ∈ = − −

∑

1 , and i j i i j C n x + + ∈ = − − +

∑

2 1 ₍ ₎2 ,

(9)

shown in the figure represent the mean and the variance of the ithgene amongst the patterns in C-. (the physical meanings of    _i−, _i+, _i−, _i+ and the level difference wi of the ithgene). One important assumption of gene pat-tern in bioinformatics is that the level difference mRfor relevant (feature) gene is much larger than the level dif-ference mIfor irrelevant (non-feature) gene, i.e. mR>>mI.

Since w is translation-invariant, we translate the gene pattern xj Î X to zj = xj - μ- Î Z for all j where

₋_{= [ ,}_{ }− −_, _−_]

1 2  d . Meanwhile, the decision hyper-plane of S in Z is transformed as [h, b] = [u -μ-, l+u·μ-]. Figure 6 and Figure 7 show a 2-dimensional example of the pattern translation.

Figure 6 shows the original 2-dimensional feature space X. The white ellipse represents the region of nor-mal-class patterns whilst the grey-filled ellipse repre-sents the region of the cancer-class patterns. The center

of the normal-class patterns is [ ₁−, ₂−], whereas the center of the cancer-class patterns is [ ₁+, ₂+]. The dotted line represents the decision hyperplane obtained by SVM. Figure 7 shows the translated feature space Z. The centers of the normal-class patterns and of the can-cer-class pattern are translated to [0, 0] and [₁+− ₁−, ₂+−₂−] respectively.

After the translation, the probability density function ri(z) of ziis: r z q z i z _i _i i i i( ) i( i) exp (( ) ) = + = + ⎛− + − − +₊ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = + −        1 2 2 2 2 1 2 2 2 2 2     exp −( − + − −( )) + ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ z _i _i i (4)

Since _i+₋_i− is the level difference mi of the ith gene, the eq. (4) can be further expressed as:

= + ⎛− − + ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ 1 2 2 2 2 _i   z mi i exp ( ) (5)

Let ai = hizi, the corresponding probability density function pi(a) is expressed as:

= ₊ ⎛_{− − +} ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ 1 2 2 2 2 2 h _i a himi hi i   exp  ( ) ( ) (6)

Figure 5 General classification model based on gene expression level.

Figure 6 A two-dimensional example of pattern translation: before translation.

Figure 7 A two-dimensional example of pattern translation: after translation.

(10)

At this stage, we made the second assumption that genes {zi}, and hence {ai}, are independent. Under this assumption, the probability density function pv(v) of v = 〈h, z〉 + b appeared in eq. (2) can now be expressed as:

p v_v( )=(p_i*p₂**p_d* )( ) v (7) where ( )x x b x b = +∞ = ≠ ⎧ ⎨ ⎩ 0

Since the convolution of two Gaussian functions is still a Gaussian function:

i e . . ( * )( ) ( ) . exp ( ( )) ( ) f g m f g m _{w f w g} f g = + − − + + ⎛ ⎝ ⎜ ⎜ ⎜ 1 2 2 2 2 2 2 2      ⎞⎞ ⎠ ⎟ ⎟ ⎟ (8) where f m f m w f f g m g m ( )= exp −( − ) ( ) exp ( ⎛ ⎝ ⎜ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎟ = − 1 2 2 2 2 2 1 2 2   and  −− ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ w g g )2 2 2

pv(v) can be simplified as:

1 2 2 2 2 2 _w  v mw w exp⎛− −( ) ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ (9) where w ii i d h 2 2 1 = =

∑

( ) and mw h mi i b i d = + =

∑

1 .

The analysis on the relation between error margin and the number of mostly relevant genes can be divided into three cases:

Case 1: Linearly separable training set with zero gene variance

It is commonly to assume that microarray pattern set is linearly separable. The linear separability of a pattern set is discussed at Appendix I. When the training set is lin-early separable, the probability of which w is lower than a given value w0is described by the function:

D_w w p t dt_v w n ( ₀)= −1 ⎛1− 0 ( ) ⎝ ⎜ ⎞ ⎠ ⎟ −∞

∫

+ (10)

where n+is the cardinality of C+. The probability den-sity function pw(w) of w is:

p w Dw w w n p w p t dt w v v w n ( ) ( ) ( ) ( ) = ∂ ∂ = ⎛ − ⎝ ⎜ ⎞ ⎠ ⎟ + −∞ −

∫

+ 1 1 (11)

The expected error margin w for linearly separable training set is:

w= wp_w w dw= _w n₊ +m_w

−∞ ∞

∫

( )  ( ) (12)

where h(.) is monotonic increasing and depends on on n+. The details of eq. (12) can be found in Appendix II.

A pattern set is said as ideal if the gene variance approach to zeros, i.e. si ® 0, for all i Î [1, d]. For such case, pw(w) can be simplified as pv(w).

pw( )w =p wv( ) (13)

and the expected error margin w is computed as a weighted sum of the expected gene level differences {mi}iÎ[1, d]: w h m_i _i i d = =

∑

1 (14)

Given a gene relevance list L = {ji} where a gene is at a former position of the list if it has higher relevance, we define w_i, as the expected error margin when the i mostly relevant genes are considered:

w_i h m k i k k = =

∑

  1 (15)

In this paper, the term error margin curve W(i) refers to the curve representing error margin versus the num-ber of mostly relevant genes, i.e. W(i) = wi.

Suppose there are nRfeature genes (i.e. the nRmostly relevant genes are the feature genes), the error margin curve can be divided into two segments: 1) the relevant gene seg-ment WR(i) for iÎ [1, nR] and 2) the irrelevant gene seg-ment WI(i) for iÎ [nR+ 1, d]. As we hypothesize that mRis significantly larger than mI, in addition to that the expected error margin is a weighted sum of the gene level differences of the considered genes, the averaged grow rate of WR(i) must be higher than that of WI(i). Thus, there should be a critical point on the error margin curve, and this point indi-cates the boundary between relevant genes and irrelevant genes. Figure 8 shows a typical error margin curve for ideal pattern set. Seen from the figure, the critical point of the curve is at the boundary between the relevant and the non-relevant genes. In other words, the estimation of the num-ber of feature genes is equivalent to find the critical point on the error margin curve.

Case 2: Linearly separable training set with non-zero gene variance

For the case of which the training set is linearly separable but si> 0, the influence of sito the error margin curve

(11)

can be expressed as follows: When siincrease, gene pat-terns spread wider in X and they have higher chance to get closer to the decision hyperplane. Thus, a narrower error margin is expected. Furthermore, when more genes are considered, mwand swin pv(v) grow in different rates,

in which w , as a weight sum of mwand swaccording to the eq. (12), is neither monotonic increasing nor mono-tonic decreasing. Therefore, the error margin curve for si> 0 is filled with small oscillation. Figure 9 shows a typi-cal error margin curve for si> 0.

Figure 8 A typical error margin curve for ideal pattern set.

(12)

Case 3: Linearly non-separable training set

In case of linearly non-separable training set, the soft-mar-gin idea choose a decision hyperplane that the classifica-tion accuracy is as high as possible, while still maximizing the error margin of the correctly-classified pattern set V’ Î V ={ }vj j C∈ + . Thus, the error margin in this case is measured from V’. Since the excluded patterns from V’ are those with minimal (and negative) error margin vi, it is expected that 1) the mean of V’ is larger than that of V and 2) the variance of V’ is smaller than that of V. Under a practical assumption that the gene distributions in V’ are also Gaussian, the soft-margin idea brings the error mar-gin analysis of linearly non-separable training set back to the case of linearly separable pattern set.

In summary, when a training set is linearly separable and si= 0 for all i, the critical point of the error margin curveis definitely the boundary point between relevant and irrelevant gene sets. However, if 1) si > 0 for at least one gene and/or 2) the training set is linearly non-separable, oscillation is introduced to the curve and blunts the critical point. For such case, feature gene extraction is modeled as the estimation of critical point of the error margin.

Feature Gene Extraction by Error-Margin Analysis

In this section, we report a novel feature gene extraction algorithm, namely Feature Gene Extraction by Error Margin Analysis(EMA). Based on the error margin ana-lysis presented in the previous section, the feature gene extraction can be modeled as the search for the critical point of the error margin curve.

In order to moderate the dependency of error margin on pattern set, Leave-One-Out Error margin (LOOErM) is used. LOOErM, as the name suggests, leaves a single pattern from the training set and compute the error margin of the decision hyperplane defined by the remaining patterns. This is repeated such that each pat-tern in the training set is left once. For a training set S consisting of n patterns, n error margins {gj}jÎ[1, n]are obtained. The LOOErM of S is defined as the average of {gj}. Algorithm A1 summarizes the procedure of LOOErM.

Algorithm A1: Leave-One-Out Error Margin

Input: 1) Pattern set S = {[xj| yj}]}jÎ [1, n], 2) the index set of the considered genes F

1. For j : = 1 to n

1.1 Define the pattern subset Z = {[xk(i)iÎF| yk]}k≠j 1.2 Train SVM on Z: the corresponding decision hyperplane denotes by Hj(z): 〈h·z〉 + b where 〈a·b〉 is the dot-product of the vectors a and b.

1.3 Compute the error margin gj of Hj: gj=min_{k j}_≠ Hj(xk( )ii F∈) 2. Next j 3. w _n g_j j n := =

∑

1 1

Output: the leave-out-one error margin w

Since the error margin curve is filled with small oscil-lation due to gene variations amongst patterns, the criti-cal point of the curve is not as significant as that shown in Figure 7. Thus, a noise reduction on the error margin curveis necessary. It can be done by fitting a parametric function G(i|a) to the curve. Recalling that the error margin curveis composed of two segments: WRfor rele-vant genes and WIfor irrelevant genes, the estimation



W (i) of W(i) consists of two parts: G(i|aR) and G(i|aI). The first part deals with the noisy WRwhilst the second part deals with the noisy WI, i.e. WR(i) ≈ G(i|aR) and WI(i)≈ G(i|aI). In addition, since the error margin curve is expected as a continuous function, WRshould meet WI at the critical point c, i.e. G(c|aR) = G(c|aI). In a whole say, the error margin curve can be estimated as:

W i W i G i i c G i ( ) ( ) ( | ) [ , ] ( | ) ≈ =⎧⎨ ∈ ⎩    R I for otherwise Subject to 1 G c( |_R)=G c( |_I) (16)

and the corresponding estimation errorε is defined as:

( , , ) ( ( | ) ) ( ( | ) ) c G i w G i w i i c i i c d     R I R I = − + − = = +

∑

2 1 2 1 (17)

Seen from eq. (17), ε naturally depends on c, aR and aI. In other words, the performance of an arbitrary criti-cal point c = f can be represented by the error

f = min ( ,_{ }_,  f , )

R I R I . Given that G(.) is sufficient to model WRand WI, the optimal critical point f0 of the error margin curve is defined as the critical point where the estimation error of _{W (i) is minimum, i.e.}

f f d f 0 1 = ∈ min [ , ] .

Given a training set S = {[xj= [xj,1, xj,2,..., xj,d]Î ℜd| yjÎ {-1, 1}]}jÎ[1, n], we first rank the genes according to their relevancies. We denote L = {jk}k= 1,2,..., das the gene rele-vance list to which the relerele-vance of the jathgene is larger or equals to that of the jbthgene for all a < b. The list L is then used to rearrange S as {[xj(L) | yj]}jÎ[1, n]. Afterwards, we compute the error margin curve W(i) = w_i where w_i is the LOOErM computed from Algorithm A1 with F= {1, 2,..., i}.

In this paper, G(.) is chosen to be a polynomial func-tion. The corresponding estimation error εf for an arbi-trary critical point c = f can be obtain by the least square method. The details of the method can be found in Appendix III. As benefitted from the prior-knowledge that the number of feature genes is commonly lower than a pre-determined value fmax, say for example

(13)

j j,1 j,2 j, d Î ℜ j Î {-1, 1}]}jÎ[1, n], 2) maximum number of considered genes fmax, 3) parametric error margin model G(.)

/* Construct the gene relevance list L: BEGIN */ 1. Compute the relevance riof the ith gene:

ri =Ω({ }xk i k C, ∉ −,{xk i k C,}∈ +)

whereΩ(A, B) is the p-value of two point sets A and B, C-contains the indices of all normal-class patterns in Sand C+contains the indices of all cancer-class patterns in S.

2. Define the gene relevance list L = {jj}j = 1,2,..., d where the relevance of the jathgene is larger or equals to that of the jbth gene, i.e. r ≥rb for all a <b.

3. Rearrange the gene order of S according to L: S¬ {[xj(L) | yj]}jÎ[1, n]

/* Construct the gene relevance list L: END */ /* Construct the LOOErM curve { w_i}: BEGIN */ 4. For i : = 1 to fmax

4.1 Compute w_i by Algorithm A1 where the set F used in the algorithm is defined as {1, 2,..., i}.

5. Next i

/* Construct the LOOErM curve { wi}: END */ /* Search for the critical point of the LOOErM curve: BEGIN */

6. For f : = 1 to fmax

6.1 Compute the estimation error

_f _i i f i i f d G i w G i w = − + − = = +

∑

min ( ( | ) ) ( ( | ) ) ,  R I R I 2 1 2 1 . If

G(.) is a polynomial function, the optimalaRandaI can be found by the method listed in Appendix III.

7. Next f

8. Compute the optimal critical point f0 as arg min

[ ,max]

f∈1f f

/* Search for the critical point of the LOOErM curve: END */

Output: The index set of the feature genes

F₀ _{k k} ₁_f

0 ={ } ∈[ , ]

Figure 10 and Figure 11 show two examples of feature gene extraction by error margin analysis. For each

Appendix I

Linearity of Gene Patterns

Given a pattern set S = {[xj= [xj,1, xj,2,..., xj, d]Î X ⊆ ℜd | yjÎ {-1, 1}]}kÎ[1, n]where the first n+ patterns is posi-tive-class and the remaining n- = n - n+ patterns is negative-class patterns, if there exisit a d by n transfor-mation matrix T such that every pattern xj is trans-formed to a point xj’ in n-dimensional Euclidean space, xjT = xj’ = [ , ,0 0 1 0, , ]0 1      j n j n − − ∈ ℜ _:

PT=I where P=[x1;x2;,xn] and is an by identity matrI n n iix(18)

We must be able to found at least one

decision hyperplane hI, for example hI= − −

− +

[_{ }1_,_,_1 1_,, ]1

n n

such that the transformed patterns {xj’} can be linearly separable:

yj x h′j, I =1 (19)

Since T is a linear transformation, the eq. (19) can be rewritten as

yj x Thj, I = 1 (20)

Eq. (20) infers that S can be linearly separated by the hyperplane ThI. In conclusion, S is linearly separable if the transformation matrix T exists.

Existence of the transformation matrix

According to eq. (18), T is defined as the right inverse of P, which can be decomposed as PT(PPT)-1. Thus, T exists if P has the rank m.

When the number of genes d is much larger than the number of training patterns n, i.e. d >>n, the probability of that T exists is higher. Reminding that gene pattern analysis deals with small sample size and high sample dimension, the existence of T can be easily archived. Thus, gene patterns are reasonably assumed to be line-arly separable. Additionally, since support vector

(14)

Figure 10 Gene extraction by error margin analysis on the Gastric cancer dataset.

(15)

w mw ⎝ ⎠ = ( − ++ ⎛ − ⎝ ⎜ ⎞ ⎠ ⎟ = − − + −∞ − −∞ ∞ +

∫

m n p w p t dt + dw w m n p w p w v v w n w v ) ( ) ( ) ( ) ( ) 1 1 1 vv w n w v v w t dt dw m n p w p t dt ( ) ( ) ( ) −∞ − −∞ ∞ + −∞ ∞ −∞

∫

⎛ ⎝ ⎜ ⎞ ⎠ ⎟ + − ⎛ ⎝ ⎜ ⎞ + 1 1 ⎠⎠ ⎟ = − ⎛ − ⎝ ⎜ ⎞ ⎠ ⎟ + + + − + −∞ − −∞ ∞

∫

n w v v w n w w dw w m n p w p t dt dw m D 1 1 1 ( ) ( ) ( ) (ww)_−∞∞ (21)

Considering the first integration part of eq. (21)

I w m w m n p w p t dt dw w m w w w v v w n ( , , ) ( ) ( ) ( ) (  = − ⎛ − ⎝ ⎜ ⎞ ⎠ ⎟ = − + −∞ ∞ −∞ −

∫

1

∫

+ 1 w w v w n n w w mw w p t dt ) exp ( ) . ( ) + −∞ ∞ −∞

∫

− − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ − ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ 1 2 2 2 2 1   _ ++−1 dw (22) Let z w mw w z m w w w = _− , =  + and dw = swdz. Addi-tionally, z = -∞ when w = -∞ and z = ∞ when w = ∞. Thus, eq. (22) is transformed as:

I w m z n w z w t mw w w w ( , , ) exp . exp ( )       = ⎛− ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ − − + −∞ ∞

∫

1 2 2 2 1 2 2 2     _w dt dz n z z z m n w w w w 2 2 2 2 1 ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = + ⎛− ⎝ ⎜ ⎜ ⎞ + ∞ −

∫

+ exp ⎠⎠ ⎟ ⎟ − − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ −∞ ∞ + ∞ −

∫

+ . exp ( ) 1 2 2 2 2   _  w t mw w dt z m n w w 1 1 wdz (23) We further let y t mw w = −_ , t = ysw + mw and dt = swdy. Additionally, y = -∞ when t = -∞ and y = z when t= zsw+ mw. Thus, eq. (23) is further transformed as:

Therefore, the expected error margin for linearly separable training set is:

w= ( )w n+ +mw (24)

Appendix III

Suppose G(.) is a g-order polynomial, the estimations of wRand wI are G(x |aR = [A, B]) = A xi i B

i + =

∑

1  and G x D C xi i D i ( |I=[ , ])C = + =

∑

1  respectively.

Since the estimations are subjected to the condition:

A x B C x D D A x B C x i fi i i if i i f i i i f i i + = + ⇒ = + − = = = =

∑

1 1 1 1     (25)

The estimation errorε of W can be rewritten as:    = ⎛ + − ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ + + − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = = = =

∑

A x B y C x D y i ij j i j f i ij j i j f 1 2 1 1 ++ = = = =

∑

= ⎛ + − ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ + + + − 1 2 1 2 1 1 n i ij j i j f i ij i i ij i A x B y C x A x B   1 1 1 1 2 1 2   

∑

− ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = ⎛ + − ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = = + = C x y A x B y i if j i j f n i ij j i ++ − + + − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = = = = +

∑

j f i ij if i i ij j i j f n C x x A x B y 1 1 1 1 2 ( )  

(16)

The optimal values of A, B and C can be computed from the least square method. Firstly, we set the deriva-tive ofε with respect to {Ak}kÎ[1, g], C and {Ck}kÎ[1, g]as zeros: ∂ ∂ = + − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟+ − = = = +

∑ ∑

∑

  Ak x A x B y x C x x j k j f i ji j i f k j f n i j i f i 1 1 1 ( )) ( ) i i j i j i i i j k i f k i j A x B y A x n f x = = = + +

∑

+ + − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = ⇒ + − 1 1 1 0    == = = +

∑

⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟+ + − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟+ + − 1 1 1 f j k f k j f i i j k i B x n f x C x n f ( ) (  ))x y x y B A x B y f k i j f n j jk j f j j f n i j i j i + = = = + =

∑

⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = + ∂ ∂ = + − 1 1  1 1 1 1 1   

∑

⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟+ − + + − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = = = j f i ij if i i ij j i C x( x ) A x B y jj f n i i j i f i j f i i j i A x n f x Bn C x = + = = =

∑

∑ ∑

∑

= ⇒ ⎛ + − ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟+ + + 1 1 1 1 0   ( ) (( ) ( ) n f x y Ck C x x A x f i j n j j n i j i f i j i j i − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = ∂ ∂ = − + = = =

∑

1 1 1   ++ − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ − = ⇒ − = = + =

∑

B y x x A x x x j i j k f k j f n i i j i j k f k j 1 1 1 0   ( ) ( ) == + = + = = +

∑

∑ ∑

⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟+ − + − f n j k f k j f n i i j i f k j f n B x x C x x 1 1 1 1 ( ) ( )(  xx x y x x j k f k j j k f k j f − ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ = − =

∑

) ( ) 1

Afterwards, we define the matrices M and Y:

M M M M M M M M M M Y Y Y Y AA AB AC BA BB AC CA CB CC A B C = ⎡ ⎣ ⎢ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥ ⎥ = ⎡ ⎣ ⎢ ⎢ ⎢ ⎤ and ⎦⎦ ⎥ ⎥ ⎥ where MAA( , )i k xk ij (n f )xk if MCA( , )i k x x( x) j f j i j k f k j f = + + − + + = − = =

∑

1

∑

1 1 M M M M AB CB AC ( ) ( ) ( ) ( ) ( , ) k x n f x k x x i k x j k f k j f j k f k j f = + − + = − = = =

∑

1

∑

1 1 jj k i f k i j f n j i f i j k f k j f n n f x i k x x x x + + = = + − + = − −

∑

( 1) M ( , )

∑

( )( ) M CC BA(( ) ( ) ( ) ( ) i x n f x k y x y n j i f i j f j kj j f j j f n = + − + = + = + = = =

∑

1

∑

1 1 1 Y M Y A BB BB BC C M Y = + = + − + = − = = =

∑

y y i x n f x k y x x j f j n j i f i j f n j kj fk j 1 1 ( ) ( ) ( ) ( ) 1 1 f

∑

The optimal parameter vectorΨ = [A B C]Tis com-puted as Ψ = M-1Y and the optimal value of D can be found by the eq. (25).

Acknowledgements

The methodology part of this paper is supported by the niche-area fund 1-BB56 of the Hong Kong Polytechnic University. The Experiment part of this paper (for Oral cancer) is supported by the Harvard Catalyst/The Harvard Clinical and Translational Science Center (with NIH Award #UL1 RR 025758 and financial contributions from Harvard University and its affiliated academic health care centers).

Author details

1_{Research institute of Innovative Products and Technologies, The Hong Kong}

Polytechnic University, Hong Kong SAR, PR China.2Laboratory for Innovative Translational Technologies, Harvard Medical School, Boston, MA, USA. Authors’ contributions

HLZ and CKC designed the methodology, devised the study and prepared the manuscript. CKC performed the data analysis and organized the experimental results. WPK designed the microarray experiment of the oral cancer. JL performed the microarray experiment of oral cancer in WPK’s lab. All authors read and approved the final manuscript.

Received: 12 June 2009 Accepted: 11 May 2010 Published: 11 May 2010 References

1. John GH, Kohavi R, Peger KP: Irrelevant features and the subset selection problem. Proceedings of the 11th Int Conf on Mach Learning 1994, 121-129. 2. Xiong M, Li W, Zhao J, Jin L, Boerwinkle E: Feature (Gene) Selection in

Gene Expression-Based Tumor Classification. Molecular Genetics and Metabolism 2001, 73:239-247.

3. Man TK, Chintagumpala M, Visvanathan J, Shen JK, Perlaky L, Hicks J, Johnson M, Davino N, Murray J, Helman L, Meyer W, Triche T, Wong KK,

(17)

9. Shevade SK, Keerthi SS: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 2003, 19(17):2246-2263.

10. Zhua Z, Onga YS, Dasha M: Markov blanket-embedded genetic algorithm

for gene selection. Pattern Recognition 2007, 40:3236-3248.

11. Hong JH, Cho SB: Efficient huge-scale feature selection with speciated genetic algorithm. Pattern Recognition Letters 2006, 27:143-150. 12. Li L, Weinberg CR, Darden TA, Pedersen LG: Gene selection for sample

classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001, 17(12):1131-1142.

13. Raymer ML, Punch WF, Goodman ED, Kuhn LA, Jain AK: Dimensionality

Reduction Using Genetic Algorithms. IEEE Trans on Evolutionary Computation 2000, 4(2):164-171.

14. Huerta EB, Duval B, Hao JK: A Hybrid GA/SVM Approach for Gene

Selection and Classification of Microarray Data. EvoWorkshops LNCS 2006, 3907:34-44.

15. Shen Q, Shi WM, Kong W, Ye BX: A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification. Talanta 2007, 71:1679-1683. 16. GiladBachrach R, Navot A, Tishby N: Margin Based Feature Selection

-Theory and Algorithms. Proc of the 21th Int Conf on Machine Learning 2004, 43-50.

17. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences USA 2002, 99(10):6562-6566.

18. Oh IS, Lee JS, Moon BR: Hybrid Genetic Algorithms for Feature Selection. IEEE Trans on Pattern Analysis and Machine Intelligence 2004,

26(11):1424-1437.

19. Oba S, Kato K, Ishii S: Multi-scale clustering for gene expression data. Proc of the 5th IEEE Symposium on Bioinformatics and Bioengineering 2005, 210-217.

20. Cawley GA, Talbot NLC: Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics 2006, 22:19.

21. Link to the source code of BLogReg. [http://theoval.cmp.uea.ac.uk/~gcc/ cbl/blogreg/].

22. Park C, Koo J-Y, Kin PT, Lee JW: STW feature selection using generalized logistic loss. Computational Statistics and Data Analysis 2008, 53:3709-3718. doi:10.1186/1471-2105-11-241

Cite this article as: Chow et al.: Error margin analysis for feature gene

extraction. BMC Bioinformatics 2010 11:241. _{Submit your next manuscript to BioMed Central}

and take full advantage of:

• Convenient online submission • Thorough peer review

• No space constraints or color figure charges • Immediate publication on acceptance

• Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at