Biomarker Identification Workflows - Evaluation of the Parameter Optimization Plugin

5.3 Evaluation of the Parameter Optimization Plugin

5.3.3 Biomarker Identification Workflows

An important topic in disease diagnosis and biomedical applications is the identification of potential biomarkers, which can, for example, be represented by genes that are differentially expressed in microarray data under a specific disease condition. Most traditional statistical methods can discriminate between normal and differentially expressed genes, but their prediction performance across datasets is poor [Chen2011]. Thus, there is a strong trend toward methods that identify differentially expressed genes robustly.

The selection of relevant genes from large, high-dimensional gene expression data can be performed by feature selection methods. The two feature selection methods implemented here, RFE and EFS, are based on machine learning techniques developed by [Abeel2010]. The Recursive Feature Elimination (RFE) algorithm is a recursive process that uses the absolute value of the weight of each dimension in the hyperplane of a Support Vector Machine (SVM) as the importance of the corresponding feature (gene) in the dataset. Ensemble Feature Selection (EFS) is based on an ensemble concept, in which multiple feature selections are combined to increase the robustness of the results. To obtain a robust and meaningful evaluation result, models are built by sub-sampling the original dataset several times. The result is a ranked feature list from which highly ranked genes are selected as potential biomarkers. These potential biomarkers may be tested in the laboratory in a further step. The biomarker identification workflow was developed in conjunction with the Department of Bioinformatics of the Fraunhofer Institute for Algorithms and Scientific Computing.
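The ensemble idea described above, combining several feature rankings into one consensus ranking, can be illustrated with a minimal sketch. The aggregation rule used here (mean rank across rankings) is an assumption made for illustration; the text does not state which rule the implementation uses. The function name and the toy feature names are hypothetical.

```python
from collections import defaultdict


def aggregate_rankings(rankings):
    """Combine several ranked feature lists into one consensus ranking.

    Each ranking lists the same feature names from most to least important.
    ASSUMPTION: the consensus is formed by mean rank; this rule is chosen
    for illustration and is not taken from the source.
    """
    rank_sums = defaultdict(float)
    for ranking in rankings:
        for position, feature in enumerate(ranking, start=1):
            rank_sums[feature] += position
    n_rankings = len(rankings)
    mean_rank = {f: total / n_rankings for f, total in rank_sums.items()}
    # Lowest mean rank first; ties are broken by name to stay deterministic.
    return sorted(mean_rank, key=lambda f: (mean_rank[f], f))


# Three hypothetical rankings, e.g. obtained from three sub-samples.
rankings = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
]
consensus = aggregate_rankings(rankings)
```

A feature that is ranked highly in only one sub-sample is pulled down by its poorer positions elsewhere, which is the sense in which combining rankings makes the result less sensitive to any single sample.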

Motivation: The main concerns regarding the classification hyperplanes of SVMs in the field of biomarker identification are their interpretability and their robustness for the prediction of potential biomarkers. It is important to have properly calibrated machine learning methods available that identify relevant genes and place them near the top of the ranked biomarker list. Therefore, there is a need to identify a set of classifier features that attains maximum predictive power.

Workflow: Figure 5.10 illustrates the workflow for biomarker identification. The first two components split the original data set into several sub-sample sets. The RFE component performs the machine learning step by executing several SVM instances, each of which consumes one sub-sample data set. The RFE uses the weight of each feature to i) rank the features from most important to least important and ii) iteratively remove the least important features. The SVMs were executed on the Grid, as typically around 100 parallel SVM instances are executed in one workflow run. The RFE can also be substituted by an Ensemble Feature Selection (EFS). The EFS adds an additional level of sampling to the RFE, namely bootstrapping. The fitness value is calculated by calc_ObjFunc and named ES_Score. The workflows are available at myExperiment: http://www.myexperiment.org/workflows/3694.html and http://www.myexperiment.org/workflows/3695.html.

[Figure 5.10 diagram. Workflow input ports: input_file, number_iterations, gold_standard, remove_percentage, cost, epsilon_tolerance. Components: DataSplit_And_Seed, ExtractSeeds_and_Input, RFE, calc_ObjFunc. Workflow output port: ES_Score.]

Figure 5.10: The biomarker identification workflow. The RFE can be substituted by an ensemble feature selection (EFS). The fixed input parameters of the RFE/EFS are not shown.
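The ranking and elimination steps of the RFE component can be sketched as a simple loop. This is a minimal illustration, not the workflow's implementation: the weight function below is a stand-in (difference of class means) used only so that the example runs without an SVM library, and the rule for how many features are dropped per round (a fraction remove_percentage of the remaining features, at least one) is an assumption. All function names are hypothetical.

```python
import math


def mean_difference_weights(samples, labels, features):
    """STAND-IN for SVM hyperplane weights (this is not an SVM).

    Returns, for each feature index, the difference of the two class means.
    """
    weights = {}
    for j in features:
        pos = [row[j] for row, y in zip(samples, labels) if y == 1]
        neg = [row[j] for row, y in zip(samples, labels) if y == 0]
        weights[j] = sum(pos) / len(pos) - sum(neg) / len(neg)
    return weights


def rfe_ranking(samples, labels, remove_percentage=0.2,
                weight_fn=mean_difference_weights):
    """Return feature indices ranked from most to least important."""
    remaining = list(range(len(samples[0])))
    eliminated = []  # collected from least important upwards
    while remaining:
        # Re-fit on the surviving features in every round.
        weights = weight_fn(samples, labels, remaining)
        # Smallest absolute weight means least important.
        by_importance = sorted(remaining, key=lambda j: (abs(weights[j]), j))
        # ASSUMPTION: drop a fixed fraction per round, but at least one.
        n_remove = max(1, math.floor(remove_percentage * len(remaining)))
        dropped = by_importance[:n_remove]
        eliminated.extend(dropped)
        dropped_set = set(dropped)
        remaining = [j for j in remaining if j not in dropped_set]
    # Features eliminated last are the most important ones.
    return eliminated[::-1]


# Hypothetical toy data: 4 samples, 3 features, two classes.
samples = [
    [5.0, 1.0, 0.2],
    [6.0, 1.2, 0.2],
    [1.0, 1.1, 0.2],
    [0.0, 0.9, 0.2],
]
labels = [1, 1, 0, 0]
ranking = rfe_ranking(samples, labels, remove_percentage=0.2)
```

In this toy data set only feature 0 separates the classes clearly, and feature 2 is constant, so it is eliminated first. A smaller remove_percentage re-fits the model more often and removes features more cautiously, at the cost of more iterations.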

Fitness: The objective function is calculated by comparing the ranked gene list against a so-called 'gold standard'. The gold standard was produced using a t-test as described in [Smyth2005] and lists the top 40 ranked genes. The fitness measure is the F-measure obtained by comparing this list with the signature produced by the feature selection (RFE).
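As an illustration of the fitness measure, the following sketch computes an F-measure between a selected gene signature and a gold standard list. It assumes the balanced form (harmonic mean of precision and recall) and treats both lists as sets; the text does not specify the variant or the signature length used by calc_ObjFunc, and the gene names below are hypothetical placeholders.

```python
def f_measure(selected, gold_standard):
    """Balanced F-measure (harmonic mean of precision and recall).

    ASSUMPTION: both inputs are treated as unordered sets of gene names.
    """
    selected, gold_standard = set(selected), set(gold_standard)
    hits = len(selected & gold_standard)
    if hits == 0:
        return 0.0
    precision = hits / len(selected)
    recall = hits / len(gold_standard)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy lists; the real gold standard holds 40 genes.
gold = ["geneA", "geneB", "geneC", "geneD"]
signature = ["geneA", "geneD", "geneX", "geneY"]
score = f_measure(signature, gold)
```

Here two of the four selected genes appear in the gold standard, so precision and recall are both 0.5 and the F-measure is 0.5.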

Data Input: The mouse input data set was taken from the publicly available GEO database [Edgar2002]. The Affymetrix GeneChip microarray data used, covering Huntington disease at 24 months of age (GEO accession GSE18551; [Becanovic2010]), contained 39637 features.

Optimization:

EFS additional parameter: bootstrapping

Fixed parameters: input_file, number_iterations, gold_standard, svm_type, kernel_type, degree, gamma, coef0, nu, epsilon_loss, cache_size, shrinking, probability

User constraints for parameters: remove_percentage ∈ [0.01, 0.33] (double), cost ∈ [0, 256] (integer), epsilon_tolerance ∈ [0, 2] (double), bootstrapping ∈ [1, 50] (integer)

Used data set: 24 month mouse Affymetrix

Default values for parameters: remove_percentage = 0.2, cost = 1, epsilon_tolerance = 0.001, bootstrapping = 40

Fitness for default values: RFE: fitness = 0.3076; EFS: fitness = 0.3846

gene        gold standard   RFE default   RFE optimized   EFS default   EFS optimized
Ddit4l*     2               1             1               2             2
Wt1*        3               27            14              27            7
Dleu7       4               85            25              43            20
Slc45a3*    5               19            6               9             5
Gsg1l*      13              244           138             115           29
Entpd7      17              308           163             183           75
Spata5      20              69            18              38            28
Grasp       25              617           383             362           154
Dmpk        38              611           221             424           147
fitness     -               0.3076        0.4871          0.3846        0.5641

Table 5.6: Comparison of the optimization results. The genes marked with * have been tested in the laboratory.

Results: The optimized RFE fitness was 0.4871 with remove_percentage = 0.11, epsilon_tolerance = 0.5498, and cost = 237. The optimized EFS fitness was 0.5641 with remove_percentage = 0.28, epsilon_tolerance = 1.9762, cost = 69, and bootstrapping = 33. The optimization results are shown in Table 5.6.
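The user constraints, default values, and optimized values reported in this section can be cross-checked mechanically. The sketch below copies the numbers from the text and verifies that each parameter set respects the stated ranges and types; the dictionary layout and the helper name violations are illustrative choices, not part of the optimization plugin.

```python
# (lower bound, upper bound, type), copied from the user constraints.
CONSTRAINTS = {
    "remove_percentage": (0.01, 0.33, float),
    "cost": (0, 256, int),
    "epsilon_tolerance": (0.0, 2.0, float),
    "bootstrapping": (1, 50, int),  # EFS only
}

# Values copied from the text.
DEFAULTS = {"remove_percentage": 0.2, "cost": 1,
            "epsilon_tolerance": 0.001, "bootstrapping": 40}
OPTIMIZED_RFE = {"remove_percentage": 0.11, "epsilon_tolerance": 0.5498,
                 "cost": 237}
OPTIMIZED_EFS = {"remove_percentage": 0.28, "epsilon_tolerance": 1.9762,
                 "cost": 69, "bootstrapping": 33}


def violations(params):
    """Return the names of parameters that break a type or range constraint."""
    bad = []
    for name, value in params.items():
        low, high, kind = CONSTRAINTS[name]
        if not isinstance(value, kind) or not (low <= value <= high):
            bad.append(name)
    return bad


all_valid = not (violations(DEFAULTS) or violations(OPTIMIZED_RFE)
                 or violations(OPTIMIZED_EFS))
```

Both optimized cost values (237 and 69) lie far from the default of 1, and the optimized EFS epsilon_tolerance (1.9762) sits close to the upper bound of 2.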

Scientific use: Table 5.6 shows the results obtained from the biomarker identification workflow optimization. "The original analysis by the authors [Becanovic2010], evaluated the top 10 ranked genes obtained by their analysis which are differentially expressed in 24 months mouse affected by Huntington’s disease. These 10 genes were experimentally tested in the laboratory and the four genes marked with a star were found to be differentially expressed in the Huntington disease. As it was not possible to reproduce the authors results using the same statistical analysis (t-test), an own implemented t-test (top 40) was used as ’gold standard’. In the top 40 of this t-test result, 9 genes out of the authors top 10 list were present, listed in Table 5.6. The optimized RFE was able to rank all genes higher than using the default parameters. It identified 5 out of 10 instead of default 3 within the top 40 gold standard genes. Similarly, the optimized EFS identified 6 out of 10 instead of default 4 within the top 40 gold standard genes. In addition, EFS was able to identify 3 out of 4 validated genes in top 10. The EFS method was expected to perform better than RFE as it applies an additional level of sub-sampling, namely a bootstrapping method. Due to this supplementary level, average weightings are calculated and the final result is more robust and stable. This improvement motivates to optimize additional biomarker identification workflows used with other species, diseases or platform formats." [Bagewadi2013]
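The counts quoted above can be recomputed from the ranks in Table 5.6. The sketch below copies the table values and counts, for each method, how many of the listed genes fall within a given rank cutoff; the variable and function names are illustrative.

```python
COLUMNS = ("gold", "rfe_default", "rfe_optimized",
           "efs_default", "efs_optimized")

# Ranks copied from Table 5.6, in the column order given above.
TABLE = {
    "Ddit4l":  (2, 1, 1, 2, 2),
    "Wt1":     (3, 27, 14, 27, 7),
    "Dleu7":   (4, 85, 25, 43, 20),
    "Slc45a3": (5, 19, 6, 9, 5),
    "Gsg1l":   (13, 244, 138, 115, 29),
    "Entpd7":  (17, 308, 163, 183, 75),
    "Spata5":  (20, 69, 18, 38, 28),
    "Grasp":   (25, 617, 383, 362, 154),
    "Dmpk":    (38, 611, 221, 424, 147),
}
# Genes marked with * in the table (tested in the laboratory).
VALIDATED = {"Ddit4l", "Wt1", "Slc45a3", "Gsg1l"}


def count_within(column, cutoff, genes=None):
    """Count genes whose rank in the given column is at most `cutoff`."""
    idx = COLUMNS.index(column)
    genes = TABLE if genes is None else genes
    return sum(1 for gene in genes if TABLE[gene][idx] <= cutoff)


top40 = {col: count_within(col, 40) for col in COLUMNS[1:]}
efs_validated_top10 = count_within("efs_optimized", 10, VALIDATED)
# True if no gene is ranked worse by the optimized RFE than by the default.
rfe_never_worse = all(r[2] <= r[1] for r in TABLE.values())
```

The recomputed counts agree with the quoted text: 3 and 5 genes within the top 40 for the default and optimized RFE, 4 and 6 for the default and optimized EFS, and 3 of the 4 validated genes within the top 10 for the optimized EFS. Ddit4l is ranked first by both the default and the optimized RFE, so for that gene the optimized rank is equal rather than strictly higher.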

Optimization runtime: The population size was set to 20. Three termination criteria were defined: a maximum of 40 generations, a maximum runtime of 24 hours, and no change within 5 generations. The optimization process terminated after 16 generations and took 3:20 hours. The runtime of one workflow execution was between 3.8 minutes and 5.2 minutes, depending on the input parameters.
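The three termination criteria can be expressed as a single check that is evaluated after each generation. This is a minimal sketch under stated assumptions: "no change within 5 generations" is interpreted here as the best fitness staying identical over the last 5 generations, and the function name and signature are hypothetical rather than taken from the plugin.

```python
def should_terminate(generation, elapsed_hours, best_fitness_history,
                     max_generations=40, max_hours=24.0, patience=5):
    """Return True once any of the three stopping criteria is met.

    best_fitness_history holds the best fitness of each finished generation.
    ASSUMPTION: "no change" means the best fitness stayed exactly the same.
    """
    if generation >= max_generations or elapsed_hours >= max_hours:
        return True
    if len(best_fitness_history) > patience:
        recent = best_fitness_history[-(patience + 1):]
        # The last `patience` values all equal the value preceding them.
        return all(value == recent[0] for value in recent[1:])
    return False


# Hypothetical fitness trace that stalls for five generations.
stalled = [0.30, 0.40, 0.45] + [0.4871] * 6
still_improving = [0.30, 0.40, 0.45, 0.46, 0.47, 0.48]
stop_on_stall = should_terminate(9, 2.0, stalled)
keep_going = should_terminate(6, 1.0, still_improving)
```

With these criteria, a run can stop well before the generation limit once the best fitness stalls, which is consistent with the reported termination after 16 of a possible 40 generations.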

In document Automated Optimization Methods for Scientific Workflows in e-Science Infrastructures (Page 111-114)