Conclusions - Analyzing and Benchmarking Genomic Preprocessing and Batch Effect Removal Methods

In this chapter we used the methods implemented in the InSilico DB platform to analyze the six lung cancer datasets and two prostate cancer datasets. We first validated batch effect removal for each preprocessing method by applying ComBat to the datasets. The batch-corrected datasets were visualized by applying Principal Component Analysis to the gene expression values and plotting the first two principal components. We did not observe any problems with the results. The same approach was used to validate batch effect removal on the combination of microarray and RNA-Seq data, and here too the batch effect appeared to be removed. The disease classes were not as clearly separated as in the lung cancer datasets. Because batch effect removal does not remove biological information, this suggests that the prostate cancer data contains fewer differences in gene expression between prostate cancer samples and controls.

Our second step was to benchmark the different combinations of batch effect removal and preprocessing techniques. Our benchmarking approach uses decision trees to measure how well samples can be separated by disease and by study after batch effect removal. A complex decision tree indicates that the samples are difficult to separate, and a simple tree indicates that they are easy to separate. Studies should always be difficult to separate after batch effect removal, meaning that their decision tree should be complex. The inverse holds for diseases: they should be easily separable and therefore result in simple decision trees.
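The idea of reading separability from tree complexity can be illustrated with a minimal sketch. It is not the benchmark implementation used in this work: the node count as complexity measure, the entropy-based split rule, the function names, and the toy expression values are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, gene index, cut) of the best binary split, or None."""
    base = entropy(labels)
    best = None
    for g in range(len(rows[0])):
        values = sorted(set(r[g] for r in rows))
        # Cutting at every value except the largest keeps both sides non-empty.
        for cut in values[:-1]:
            left = [y for r, y in zip(rows, labels) if r[g] <= cut]
            right = [y for r, y in zip(rows, labels) if r[g] > cut]
            weighted = (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if best is None or gain > best[0]:
                best = (gain, g, cut)
    return best

def tree_size(rows, labels):
    """Number of nodes (leaves included) of a fully grown tree."""
    if len(set(labels)) == 1:
        return 1  # pure leaf
    split = best_split(rows, labels)
    if split is None:
        return 1  # identical samples, nothing left to split on
    _, g, cut = split
    left = [(r, y) for r, y in zip(rows, labels) if r[g] <= cut]
    right = [(r, y) for r, y in zip(rows, labels) if r[g] > cut]
    return (1
            + tree_size([r for r, _ in left], [y for _, y in left])
            + tree_size([r for r, _ in right], [y for _, y in right]))

# Hypothetical toy data: 8 samples, 2 genes, after batch effect removal.
rows = [(1.0, 4.7), (1.2, 4.8), (0.9, 4.9), (1.1, 5.0),
        (3.0, 5.2), (3.2, 5.1), (2.9, 5.4), (3.1, 5.3)]
disease = ["control"] * 4 + ["tumor"] * 4
study = ["A", "B", "A", "B", "B", "A", "B", "A"]

disease_nodes = tree_size(rows, disease)  # one split on gene 0 suffices
study_nodes = tree_size(rows, study)      # studies are interleaved on both genes
print(disease_nodes, study_nodes)
```

In this toy data a single split on the first gene separates the diseases, whereas the study labels are interleaved along both genes and need a larger tree, which is the pattern expected after successful batch effect removal.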

The lung cancer datasets were therefore classified by disease and by study to build the decision trees. The benchmarking results indicate that the preprocessing technique that performs best under our criteria is fRMA. The SCAN preprocessing technique also performs very well with most batch effect removal methods, except DWD. We concluded that UPC was not suitable for our benchmark because of its probabilistic values. For the batch effect removal techniques, we concluded that ComBat performed best because it estimates the batch effect factors and adjusts the gene expression values using these estimates. GENENORM and BMC performed similarly, and DWD was still influenced by batch effect.

These results are specific to the lung cancer datasets we used, but the methodology can be applied to other datasets.

The analysis of the stability of the trees showed that genes with possible biological importance were most of the time the roots of the decision trees classifying the diseases. This motivated us to try an experiment for finding differentially expressed genes (DEGs). We used information gain as the metric for differentiating the gene expressions. For each gene, we used the combined information gain across the different batch effect removal methods to decide whether the gene is differentially expressed. Some optimizations were made to the method to speed up the computations. For example, the homogeneity of the samples is checked before the computation starts, and a gene is not considered a DEG if it is not homogeneous enough.
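A minimal sketch of scoring one gene by information gain is given below. Several details are assumptions because this summary does not specify them: the gains of the batch effect removal methods are combined by taking their mean, the cutoff of 0.8 is hypothetical, and the homogeneity pre-check is left out since its exact definition is not restated here. The expression values are invented toy data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Best information gain over all single-threshold splits of one gene."""
    base = entropy(labels)
    best = 0.0
    # Cutting at every value except the largest keeps both sides non-empty.
    for cut in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= cut]
        right = [y for v, y in zip(values, labels) if v > cut]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        best = max(best, base - weighted)
    return best

def combined_gain(gene_by_method, labels):
    """Assumed combination rule: mean gain over batch effect removal methods."""
    gains = [information_gain(v, labels) for v in gene_by_method.values()]
    return sum(gains) / len(gains)

def is_deg(gene_by_method, labels, cutoff=0.8):
    """Hypothetical decision rule: call a DEG when the combined gain is high."""
    return combined_gain(gene_by_method, labels) >= cutoff

labels = ["control", "control", "control", "tumor", "tumor", "tumor"]

# Toy expression values of one gene after two batch effect removal methods.
gene_x = {"ComBat": [1.0, 1.1, 0.9, 3.0, 3.2, 2.8],
          "BMC": [-1.0, -0.9, -1.1, 1.0, 1.1, 0.9]}
gene_y = {"ComBat": [1.0, 3.0, 2.0, 1.5, 2.5, 3.5],
          "BMC": [-1.5, 0.5, -0.5, -1.0, 0.0, 1.0]}

print(combined_gain(gene_x, labels), is_deg(gene_x, labels))
print(combined_gain(gene_y, labels), is_deg(gene_y, labels))
```

Here gene_x separates tumors from controls with a single threshold under both methods and is flagged, while gene_y has interleaved values and is not.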

The method was applied to each preprocessing technique for microarray data and to the combination of microarray and RNA-Seq data. We then briefly searched for possible biological relevance in the resulting list of differentially expressed genes. We found that some genes in the list were co-expressed in multiple pathways related to lung cancer. The analysis of the combined microarray and RNA-Seq data was also successful, since we found genes related to a pathway that is linked to prostate cancer. Even though the analysis of microarray and RNA-Seq data was rather small-scale, it was important for us to show that the experimental pipeline we built in the previous chapter produces UPC values for RNA-Seq data that can be merged with microarray data. This experiment also showed that it is feasible to analyze legacy microarray data together with newer RNA-Seq data in one experiment and find biologically relevant information.

CHAPTER 7 Conclusions

7.1 Problem Description

One of the biggest challenges in genomics is making experimental data available to the people who have the expertise to interpret it. These people generally do not have the IT background needed to handle all the data preparation steps that make interpretation possible. Scientists have been using microarray technology for experiments for years. Millions of dollars have been invested in this technology, and that investment cannot be thrown away because of the emergence of new technologies such as RNA sequencing. It is therefore crucial to find methods that cope with both data types; otherwise microarray data will be lost.

Solutions have therefore been developed to enable the people with the expertise to interpret the results. InSilico DB is an example of such a solution. It provides multiple services, such as a user-friendly web interface for the combination and analysis of processed and biologically curated data. This makes it possible for biologists and biomedical experts to analyze the data easily without having to face the difficulties that must be tackled by computer science.

In document Analyzing and Benchmarking Genomic Preprocessing and Batch Effect Removal Methods in Big Data Infrastructure (Page 91-94)