Summary, Conclusions, and Future Work

This thesis presents advancements in the collective computational analysis of multiple high-throughput biological datasets, mainly gene expression datasets. These advancements cover both methodology and application: a novel suite of computational methods is proposed, and applying these methods to real datasets elucidates important insights into various biological aspects.

The focal method in the proposed suite is the UNification of CLustering results from multiple datasets by using External Specifications (UNCLES) method (Abu-Jamous, et al., 2015c). This method mines multiple gene expression datasets collectively in order to identify the subsets of genes that are consistently co-expressed (that is, genes with highly correlated expression profiles) over the subject datasets while adhering to some external specifications. Two types of external specifications have been proposed here: type A mines for the genes that are consistently co-expressed in all of the given datasets, while type B mines for the genes that are consistently co-expressed in one subset of datasets while being poorly co-expressed in another subset of datasets. An earlier development of UNCLES is the Binarisation of Consensus Partition Matrices (Bi-CoPaM) method (Abu-Jamous, et al., 2013a), which is equivalent to UNCLES type A.
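The distinction between the two types of external specifications can be illustrated with a minimal sketch. It is a deliberate simplification and not the UNCLES implementation: it assumes that, for one cluster, each gene already carries a hypothetical true/false flag per dataset saying whether it is co-expressed with that cluster there. The function names `type_a` and `type_b`, and the toy matrix, are illustrative assumptions.

```python
import numpy as np

def type_a(membership):
    """Genes flagged as co-expressed in every dataset (columns)."""
    return np.asarray(membership, dtype=bool).all(axis=1)

def type_b(membership, positive, negative):
    """Genes flagged in every 'positive' dataset and in none of the 'negative' ones."""
    m = np.asarray(membership, dtype=bool)
    return m[:, positive].all(axis=1) & ~m[:, negative].any(axis=1)

# Hypothetical flags: rows are genes g0..g3, columns are datasets d0..d3.
membership = np.array([
    [1, 1, 1, 1],  # g0: co-expressed everywhere
    [1, 1, 0, 0],  # g1: co-expressed only in d0 and d1
    [1, 0, 1, 0],  # g2: inconsistent
    [0, 0, 0, 0],  # g3: never co-expressed
])

genes_a = type_a(membership)
genes_b = type_b(membership, positive=[0, 1], negative=[2, 3])
```

Under these assumptions only g0 satisfies type A, while only g1 satisfies type B with d0 and d1 as the positive subset and d2 and d3 as the negative subset.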

Amongst the key aspects of the Bi-CoPaM and the UNCLES methods is that they have tuning parameters which allow unconventional clustering results to be formed. For instance, while clustering a set of genes, any gene has one of three possible outcomes: it may be exclusively assigned to a single cluster, as in conventional clustering methods; it may be simultaneously assigned to multiple clusters; or it may not be assigned to any of the clusters at all. As for the clusters, they may be conventional complementary clusters, tight and focused clusters which leave many genes unassigned to any cluster, or wide and overlapping clusters. Amongst the benefits of this capability is the ability to submit genome-wide datasets (datasets including the entire unfiltered set of genes) to the method without filtering, and then to tighten the resulting clusters so that they are focused while many genes are excluded from all of the clusters. In this way, the method applies the filtering step implicitly while clustering, which is consistent with the biological expectation that most of the genes in an organism's genome are irrelevant to any single given biological context. Most of the experiments detailed in this thesis have this setup and demonstrate its applicability.
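The three possible outcomes for a gene can be made concrete with a small sketch. It assumes a fuzzy consensus matrix of membership values and a plain value threshold as the tuning parameter; this is an illustrative rule chosen for brevity, not necessarily the binarisation used by the Bi-CoPaM or UNCLES, and the matrix values are hypothetical.

```python
import numpy as np

def binarise(consensus, threshold):
    """Boolean genes-by-clusters matrix: True where membership reaches the threshold."""
    return np.asarray(consensus, dtype=float) >= threshold

# Hypothetical fuzzy memberships for 4 genes over 3 clusters (each row sums to 1).
consensus = np.array([
    [0.90, 0.05, 0.05],  # clearly belongs to cluster 0
    [0.50, 0.45, 0.05],  # shared between clusters 0 and 1
    [0.34, 0.33, 0.33],  # no clear affiliation
    [0.10, 0.10, 0.80],  # clearly belongs to cluster 2
])

# Number of clusters each gene is assigned to under a moderate and a strict threshold.
moderate = binarise(consensus, 0.4).sum(axis=1)
strict = binarise(consensus, 0.7).sum(axis=1)
```

With the moderate threshold the second gene falls into two clusters and the third into none; raising the threshold tightens the clusters, so that the second gene is also left unassigned.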

UNCLES and the Bi-CoPaM require various parameters to be set, such as the number of clusters (K) and the tuning parameters δ and (δ⁺, δ⁻). Also, the results of these methods need to be validated. In order to address these aspects, a cluster validation and selection technique based on M-N scatter plots is proposed in this thesis (Abu-Jamous, et al., 2014b; Abu-Jamous, et al., 2015c). This technique favours those clusters which include higher numbers of genes (N) while maintaining lower levels of dispersion as measured by a mean-squared error-based (MSE-based) metric (M). The UNCLES method assisted by the M-N scatter plots technique represents a complete framework of consensus clustering for multiple datasets without the need to set any of the key parameters manually; in other words, it is a parameter-free framework.
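The trade-off between cluster size and dispersion can be sketched as follows. This is a minimal illustration under stated assumptions rather than the M-N technique as published: M is taken here as the mean squared distance of the gene profiles from the cluster's mean profile, and candidate clusters are ranked by their distance from the ideal corner (lowest M, highest N) after min-max scaling of both axes. The function names and toy clusters are hypothetical.

```python
import numpy as np

def mse_dispersion(profiles):
    """Mean squared Euclidean distance of gene profiles from their mean profile."""
    profiles = np.asarray(profiles, dtype=float)
    centroid = profiles.mean(axis=0)
    return float(np.mean(np.sum((profiles - centroid) ** 2, axis=1)))

def _scale(values):
    """Min-max scale to [0, 1]; constant input maps to zeros."""
    span = values.max() - values.min()
    return np.zeros_like(values) if span == 0 else (values - values.min()) / span

def rank_clusters(clusters):
    """Indices of clusters ordered from best to worst (low M and high N first)."""
    m = np.array([mse_dispersion(c) for c in clusters])
    n = np.array([len(c) for c in clusters], dtype=float)
    distance = np.hypot(_scale(m), 1.0 - _scale(n))
    return [int(i) for i in np.argsort(distance)]

# Hypothetical candidates: a large tight cluster and a small dispersed one.
tight_large = [[1.0, 2.0, 3.0]] * 5
loose_small = [[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]]
order = rank_clusters([tight_large, loose_small])
```

Here the large, tight cluster is ranked first because it has both the lower dispersion and the higher gene count.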

In order to test and validate this framework, artificial datasets which meet relevant properties were synthesised by adopting a new approach to expression data synthesis (Abu-Jamous, et al., 2015c). This approach produces datasets with a known ground truth, which is a desirable feature of artificial datasets because it makes them a suitable means to test and validate other methods. However, this is not what distinguishes the proposed approach from other approaches to data synthesis. Its unique feature is that the values within the artificial datasets are borrowed directly from real datasets, which overcomes the issue of how faithfully synthetic datasets represent the properties of real data.

Another technique of cluster assessment and validation has been proposed in this work, namely the F-P scatter plots technique, which validates the results of clustering while taking the known ground truth as a reference (Abu-Jamous, et al., 2015c). This technique has been employed while testing the UNCLES method and the M-N plots technique over the synthetic datasets, for which the ground truth is known, and has shown that the UNCLES method combined with M-N plots can find clusters which closely match the ground truth.

The mature suite of methods, or partially developed versions of it, has been applied to various biological contexts, revealing several biological findings and insights. Two major applications to yeast datasets were conducted and published. The first revealed important insights into the poorly understood yeast gene CMR1 and its relation to cell-cycle and DNA metabolism genes by analysing two yeast cell-cycle datasets (Abu-Jamous, et al., 2013b). The second scrutinised forty yeast gene expression datasets from various contexts and concluded that the well-known subset of ribosome biogenesis genes and a novel subset of genes are each consistently co-expressed over all of the datasets and, more surprisingly, are consistently oppositely expressed with respect to each other. Hypotheses with respect to the functions and the regulation of both subsets of genes were drawn, mainly regarding the novel subset, which was named the anti-phase with ribosome biogenesis (APha-RiB) subset of genes (Abu-Jamous, et al., 2014a).

Five Escherichia coli bacterial datasets from different contexts were mined by the Bi-CoPaM method, identifying two subsets of genes as consistently co-expressed over all of the five datasets. Biological hypotheses regarding the function and regulation of those subsets were drawn and published (Abu-Jamous, et al., 2015b).

In collaboration with the group of Professor David Roberts at the University of Oxford, a research group focusing on the biomedicine of human blood, eight human and murine blood gene expression datasets were analysed by the Bi-CoPaM method. Those datasets were all generated in the context of red blood cell production (erythropoiesis). Five focused subsets of genes, out of the entire human or murine genome, were identified as consistently co-expressed over all of the eight datasets. Interestingly, these five clusters show peak expression values at different stages of development throughout the erythropoiesis process. When this observation was combined with the analysis of the regulation and functions of the clusters, several hypotheses were drawn. These hypotheses and other related ones are under investigation with our collaborators in order to take this research forward.

Finally, the UNCLES method with the M-N plots was applied to two popular malarial datasets as a preliminary experiment. The nine discovered clusters showed a perfect temporal cascade of peaks of expression throughout the blood stages of the malarial parasites. Alongside the analysis of the functions of the genes in those nine clusters, this preliminary experiment demonstrated the applicability of this suite of methods to malarial datasets, and represents a seed for my fellowship/grant applications as well as my prospective collaborations.

The current suite of methods does not answer all of the possible questions with respect to the collective analysis of multiple high-throughput biological datasets. For instance, other types of datasets, such as proteomic, glycomic, and metabolomic datasets, exist abundantly and have not yet been addressed. Moreover, further investigations of the efficiency and reliability of the methods can be conducted. Such concerns constitute subjects for my future work on the design and development of methods. As for the applications, many other areas in biology and biomedicine, such as cancer research, have produced a great number of datasets and can represent targets for future applications of my methods.

As the future of this research is considered, the focus will be on the analysis of the malaria parasite. Malaria causes up to one million deaths annually, and about 40% of the world's population lives in malaria-endemic regions.

Taken together, a mature suite of computational methods with the capability to collectively analyse, validate, and simulate multiple high-throughput gene expression datasets has been described in this thesis alongside a set of real applications to yeast, bacterial, human and murine blood, and malarial datasets. Despite filling many gaps and elucidating many poorly understood aspects in research, this work has raised further questions and opened up potential future work, which keeps the wheels of bioinformatic research and personal career development turning.

Appendix I

Introduction to the Molecular Biology of the
