7.4.4 Comparison of Correlation between Solid and Hemic Cancer Entities
At the end of the analysis, the graphs of the solid and hemic cancer entities are compared with each other using the structural Hamming distance (SHD). As expected, pairs of graphs within the four solid cancer entities and within the four hemic cancer entities are more similar to each other than pairs that combine a solid with a hemic entity.
Table 7.20 lists the SHD values for all pairs among the eight cancer entities and the KEGG graph for the apoptosis pathway (hsa04210). Because the SHD is symmetric, the full matrix is symmetric as well, so only the upper triangle is filled in and the remaining entries are set to 0. The first column (KEGG) and the last row (LYM) contain no additional information and are omitted to save space.
          BREAST  COLON  LUNG  PROSTATE  ALL  AML  CLL  LYM
KEGG         590    310   405       362  466  409  299  379
BREAST         0    475   540       475  591  577  522  530
COLON          0      0   335       298  412  366  277  329
LUNG           0      0     0       375  469  447  348  400
PROSTATE       0      0     0         0  434  392  313  349
ALL            0      0     0         0    0  438  373  393
AML            0      0     0         0    0    0  345  383
CLL            0      0     0         0    0    0    0  296
Table 7.20: SHD between pairs of KEGG graph, solid and hemic cancer entities for the Apoptosis pathway (hsa04210).
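The distance reported in Table 7.20 can be illustrated with a minimal sketch. It assumes the common definition of the SHD as the number of node pairs whose edge status differs between two graphs (edge missing, edge added, or edge oriented differently). The function name `shd` and the adjacency-matrix encoding are illustrative choices, not the implementation used for the study.

```python
def shd(a, b):
    """Structural Hamming distance between two graphs on the same nodes.

    a and b are square 0/1 adjacency matrices (lists of lists), where
    a[i][j] == 1 encodes an edge i -> j; an undirected edge sets both
    a[i][j] and a[j][i]. Each unordered node pair whose edge status
    differs contributes 1 to the distance (assumed definition).
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("graphs must have the same number of nodes")
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (a[i][j], a[j][i]) != (b[i][j], b[j][i]):
                distance += 1
    return distance


# Toy example with three nodes (hypothetical graphs):
# g1: 0 -> 1 -> 2
g1 = [[0, 1, 0],
      [0, 0, 1],
      [0, 0, 0]]
# g2: 0 -> 1 <- 2, plus 0 -> 2
g2 = [[0, 1, 1],
      [0, 0, 0],
      [0, 1, 0]]

print(shd(g1, g2))  # one flipped edge (1,2) and one added edge (0,2)
```

Under this definition the distance is symmetric and zero for identical graphs, which is consistent with the layout of Table 7.20.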
Figure 7.10 visualizes the distribution of the SHD for the solid cancer, hemic cancer and mixed (solid+hemic) pairs. The data for the solid and the hemic cancer entities include the value 0, because a graph does not differ from itself (SHD=0, e.g., ALL-ALL). In all 11 analysed pathways, the mean value of the mixed pairs is higher than in the other two groups. This indicates that the correlation structure of the genes differs between solid and hemic tumors.
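The grouping into solid, hemic and mixed pairs can be reproduced from the values in Table 7.20 for hsa04210. The following sketch is only an illustration: the names `UPPER`, `pair_shd` and `shd_groups` are hypothetical, and unlike the data behind Figure 7.10 it leaves out the self-pairs with SHD=0.

```python
from itertools import combinations, product

ENTITIES = ["BREAST", "COLON", "LUNG", "PROSTATE", "ALL", "AML", "CLL", "LYM"]
SOLID, HEMIC = ENTITIES[:4], ENTITIES[4:]

# Upper triangle of Table 7.20 (entity rows only, KEGG row left out).
# Row X holds the SHD to every entity listed after X in ENTITIES.
UPPER = {
    "BREAST":   [475, 540, 475, 591, 577, 522, 530],
    "COLON":    [335, 298, 412, 366, 277, 329],
    "LUNG":     [375, 469, 447, 348, 400],
    "PROSTATE": [434, 392, 313, 349],
    "ALL":      [438, 373, 393],
    "AML":      [345, 383],
    "CLL":      [296],
}


def pair_shd(x, y):
    """Look up the SHD of two different entities, in either order."""
    i, j = sorted((ENTITIES.index(x), ENTITIES.index(y)))
    return UPPER[ENTITIES[i]][j - i - 1]


def shd_groups():
    """Collect the SHD values of solid, hemic and mixed pairs."""
    return {
        "solid": [pair_shd(x, y) for x, y in combinations(SOLID, 2)],
        "hemic": [pair_shd(x, y) for x, y in combinations(HEMIC, 2)],
        "mix":   [pair_shd(x, y) for x, y in product(SOLID, HEMIC)],
    }


def group_means():
    """Mean SHD per group (self-pairs excluded)."""
    return {name: sum(v) / len(v) for name, v in shd_groups().items()}


for name, value in group_means().items():
    print(f"{name:5s} {value:.2f}")
```

Even without the zero entries, the mixed pairs have the largest mean for this pathway, which agrees with the observation stated above.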
7.5 Summary
This chapter demonstrates that parallel computing makes it possible to analyse more than 4000 microarrays and reduces the computation time to less than one day. An R package for the data management of microarray experiments was presented and is available at the R-Forge repository: http://ArrayExpressDataManage.R-forge.R-project.org/
[Figure 7.10: two boxplot panels, hsa04210 and hsa05221; groups on the x-axis: solid, hemic, mix; y-axis: SHD]
Figure 7.10: Boxplots of the SHD distribution between solid cancer graphs, between hemic cancer graphs, and between solid and hemic graphs (mix) for the pathways hsa04210 and hsa05221.
Testing the data for differential gene expression shows very strong differential expression for many genes, but no consistent differential expression profile across the analysed entities. The strong differential gene expression (small p values) points to problems with the coarse grouping into only eight entities. It is well known that data from subentities differ considerably. This coarse grouping was chosen to keep the group sizes acceptable (more than 180 arrays).
There is only a small similarity between the calculated graphs (PC algorithm) and the KEGG pathways. This indicates that the paths in biological pathways cannot be recovered directly from expression microarray data. However, when paths (correlations) over three other nodes (genes) are taken into account, about 80% of the paths in the KEGG pathways can be confirmed.
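The idea of confirming a KEGG path through other genes can be sketched as a bounded breadth-first search in the estimated graph. The sketch assumes that "over three other nodes" means at most three intermediate genes and treats the estimated graph as undirected; the function names and the node labels are hypothetical.

```python
from collections import deque


def within_k_intermediates(adj, source, target, max_intermediate=3):
    """True if target is reachable from source in the estimated graph
    via at most max_intermediate other nodes (breadth-first search)."""
    max_edges = max_intermediate + 1
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return True
        if dist == max_edges:
            continue  # do not extend paths beyond the allowed length
        for neighbour in adj.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return False


def recovered_fraction(reference_edges, estimated_edges, max_intermediate=3):
    """Fraction of reference (e.g. KEGG) edges whose end points are
    connected in the estimated graph within the allowed path length."""
    adj = {}
    for u, v in estimated_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    hits = sum(within_k_intermediates(adj, u, v, max_intermediate)
               for u, v in reference_edges)
    return hits / len(reference_edges)


# Toy example: the estimated graph is a chain A-B-C-D-E-F.
estimated = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
reference = [("A", "E"), ("A", "F")]  # A..E needs 3 intermediates, A..F needs 4
print(recovered_fraction(reference, estimated))
```

Setting `max_intermediate=0` reduces the check to a direct edge comparison, so the same function covers both the strict and the relaxed comparison with the reference pathway.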
Comparing the calculated graphs of the entities for each pathway, and using the discussed rates as a measure, up to 60% of the gene correlations are shared between entities. However, these correlations differ from the KEGG pathways. A permutation test on the arrays of the cancer entities, with the SHD as the measure for comparing the calculated graphs, reveals significant differences in the gene-gene correlation structure between the cohorts for several pairs of cancer entities.
The same study could be carried out on publicly available data of the chip type 'HG-U133 Plus 2.0'. This chip type belongs to a newer generation and contains more probes and genes. The number of arrays will be less than 7000, but nearly all genes of the pathways should be available on the chip. Furthermore, because the data are more recent (2003 to today), the quality of the data and of the annotation files should be better. The data could therefore be analysed in more detail (e.g., female versus male) and compared to the large data study presented here.
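The permutation test on the SHD mentioned in this summary can be sketched as follows. This is a minimal illustration under stated assumptions: a thresholded correlation graph (`toy_graph`) stands in for the PC algorithm used in the study, the SHD is computed for undirected graphs only, and all function names, the threshold and the number of permutations are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if one variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def toy_graph(samples, threshold=0.5):
    """Stand-in graph estimator (NOT the PC algorithm): an undirected
    edge joins two genes if |correlation| exceeds the threshold.
    samples is a list of arrays, each a list of expression values."""
    n_genes = len(samples[0])
    cols = [[s[g] for s in samples] for g in range(n_genes)]
    return {(i, j)
            for i in range(n_genes) for j in range(i + 1, n_genes)
            if abs(pearson(cols[i], cols[j])) > threshold}


def shd_undirected(edges_a, edges_b):
    """SHD of two undirected graphs: number of differing edges."""
    return len(edges_a ^ edges_b)


def permutation_pvalue(cohort_a, cohort_b, n_perm=200, seed=0):
    """Permute the cohort labels of the arrays and count how often the
    SHD of the permuted cohorts reaches the observed SHD."""
    rng = random.Random(seed)
    observed = shd_undirected(toy_graph(cohort_a), toy_graph(cohort_b))
    pooled = list(cohort_a) + list(cohort_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(cohort_a)], pooled[len(cohort_a):]
        if shd_undirected(toy_graph(perm_a), toy_graph(perm_b)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

A small p value suggests that the observed SHD between two cohorts is larger than expected if the arrays were exchangeable between them. Adding 1 to the numerator and the denominator keeps the estimated p value strictly positive.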
Chapter 8
Summary and Outlook
Bioinformatics, as an interdisciplinary research area at the interface between computer science and biological science, has been confronted with new challenges in the last few years. In particular, large data sets and the increased computational requirements of more sophisticated methodologies call for new computational and statistical solutions, ideas and approaches. This work demonstrates the usefulness of parallel computing for solving these new challenges, as well as its power and limitations, using two well-known biological examples. Basic results are published or submitted to relevant journals.
This summary chapter describes the state of development and discusses open topics regarding further parallel applications. Finally, two current trends for the future of parallel computing are outlined in detail.
8.1 State of Development
For the open-source programming language R, a software environment for statistical computing and graphics, research has focused on parallel computing techniques in the last decade. Existing packages for different parallel computing hardware environments were compared [SME+09]. The snow and Rmpi packages stand out as particularly useful for general use on computer clusters, and the multicore package for multi-core systems.
For the preprocessing of high-density oligonucleotide microarrays, the affyPara package, based on the snow package, was developed [SM08, SM09]. Existing statistical algorithms and data structures had to be adjusted and reformulated for parallel computing. Using the parallel infrastructure, the known methods could be enhanced and new methods have become available. The parallelized preprocessing methods produce the same results as their serial counterparts up to machine accuracy. Partitioning the data and distributing it to several nodes solves the main memory problems and accelerates the methods by up to a factor of 15 for 300 arrays or more.
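The partition-and-combine pattern described here can be illustrated with a small sketch. It is not the affyPara implementation: it is written in Python rather than R, a local thread pool stands in for the cluster nodes, and the per-probe mean is only a simple example of a statistic that has to be reformulated as partial results plus a combining step. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, rest = divmod(len(items), n_parts)
    chunks, start = [], 0
    for k in range(n_parts):
        end = start + size + (1 if k < rest else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def partial_sums(arrays):
    """Worker step: per-probe sums over the arrays of one chunk."""
    sums = [0.0] * len(arrays[0])
    for array in arrays:
        for probe, value in enumerate(array):
            sums[probe] += value
    return sums


def parallel_probe_means(arrays, n_workers=4):
    """Master step: distribute chunks, then combine the partial sums
    into per-probe means over all arrays."""
    chunks = [c for c in partition(arrays, n_workers) if c]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(partial_sums, chunks))
    n_probes = len(arrays[0])
    return [sum(r[p] for r in results) / len(arrays) for p in range(n_probes)]


# Toy data: 10 arrays with 3 probes each (made-up intensities).
arrays = [[float(i + p) for p in range(3)] for i in range(10)]
print(parallel_probe_means(arrays))
```

Each worker only needs its own chunk in memory, and the master only handles one short vector per worker. Because the summation order differs from a serial loop, the result can differ from the serial one in the last floating-point digits, which is the sense in which parallel and serial results agree up to machine accuracy.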
For next-generation sequence data, improvements could be achieved using parallel computing in the R language with the snow or Rmpi package, but existing data structures and huge amounts of data (network traffic) limit its usefulness. Using the multicore package can accelerate the process considerably. However, a huge amount of main memory is required.
The parallelized preprocessing methods were used to analyse more than 7000 microarrays from more than 60 experiments from the public microarray database ArrayExpress. For data management, a new package called ArrayExpressDataManage was developed. The study proves that managing and analysing this amount of microarray data is feasible. Eight different cancer entities were studied, and the gene-gene interaction of more than ten KEGG pathways was estimated. Adequate similarities between the graphs were found, but it is not possible to rebuild more than 30% of the KEGG pathways using gene expression data.
To conclude, microarrays remain a useful technology for addressing a wide range of biological problems, but the optimal analysis of these data to extract meaningful results still poses many bioinformatic challenges. Sequencing-based characterization of the transcriptome is appealing because it effectively surmounts the limitations of microarrays. As access broadens and the costs of next-generation sequencing continue to drop, it seems probable that a steadily increasing fraction of the community will begin to use sequencing, rather than microarrays, to interrogate biological phenomena at the genomic scale [She08].