
Chapter 4 Summary of papers

4.4 Paper IV

Centralization Within Sub-Experiments Enhances the Biological Relevance of Gene Co-expression Networks: A Plant Mitochondrial Case Study

Here we present a novel pre-processing method for expression data. It is a centralisation method that allows the user to combine data from different labs, different conditions, etc., in order to find underlying pathways.

We consider normalised gene-expression data, and for each expression value a CSE value is calculated.
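The defining formula for the CSE value is not reproduced in this summary. The sketch below is therefore only an illustration, assuming that centralization within sub-experiments means subtracting, for each gene, the mean expression taken over the samples of the same sub-experiment. The function name `cse`, the argument layout (genes in rows, samples in columns) and the toy labels `lab1` and `lab2` are hypothetical.

```python
import numpy as np

def cse(expr, sub_experiment):
    """Centre each gene within each sub-experiment (assumed form of CSE).

    expr: array of shape (n_genes, n_samples) with normalised expression.
    sub_experiment: one label per sample naming its sub-experiment.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(sub_experiment)
    centred = np.empty_like(expr)
    for label in np.unique(labels):
        cols = labels == label
        # Subtract the per-gene mean computed over this sub-experiment only.
        centred[:, cols] = expr[:, cols] - expr[:, cols].mean(axis=1, keepdims=True)
    return centred

# Toy data: two genes, two samples from each of two hypothetical labs.
expr = np.array([[1.0, 3.0, 10.0, 14.0],
                 [2.0, 2.0, 5.0, 7.0]])
labels = ["lab1", "lab1", "lab2", "lab2"]
cse_values = cse(expr, labels)
```

In this toy case the large offset between the two labs disappears, while the variation inside each lab is kept.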

In Figure 4, column one illustrates the correlation structure of five simulated genes in two separate networks: gene A affects gene B, and gene C affects genes D and E. The Pearson and partial correlation measures are calculated between all genes, with and without the CSE pre-processing step (columns two to five). To make Pearson's correlation comparable with partial correlation, the correlations were scaled relative to each other: the most correlated edge in each set-up was given the value one.
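The two correlation measures and the scaling step can be sketched as follows. This is a minimal illustration and not the code behind Figure 4: the simulated genes, the noise levels and the function names are assumptions. Partial correlations are obtained here from the inverse of the correlation matrix, which requires more samples than genes.

```python
import numpy as np

def partial_matrix(expr):
    """Partial correlation between genes (rows), given all other genes."""
    precision = np.linalg.inv(np.corrcoef(expr))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def scale_edges(corr):
    """Absolute off-diagonal correlations, scaled so the strongest edge is one."""
    strength = np.abs(corr)
    np.fill_diagonal(strength, 0.0)
    return strength / strength.max()

# Hypothetical simulation of the two networks: A -> B and C -> (D, E).
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + rng.normal(scale=0.5, size=n)
c = rng.normal(size=n)
d = c + rng.normal(scale=0.5, size=n)
e = c + rng.normal(scale=0.5, size=n)
expr = np.vstack([a, b, c, d, e])  # rows are genes A, B, C, D, E

pearson_scaled = scale_edges(np.corrcoef(expr))
partial_scaled = scale_edges(partial_matrix(expr))
```

In this set-up Pearson's correlation gives a strong D-E edge, because both genes depend on C, whereas the partial correlation for D-E is close to zero once C is conditioned on.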

In row one (A), no extra treatments affect the genes, and the correlations calculated with and without CSE do not differ. In row two (B), a treatment effect is applied to gene C. Without CSE, the correlations based on Pearson's method are affected, whereas Pearson's correlations with CSE give the same results as in row one. In row three (C), a treatment effect is applied to both gene A and gene C, and both correlation methods are now affected by the extra treatment effects. With the CSE pre-processing step, however, these treatment effects are accounted for, and the correct edges once more receive the highest correlation values.

Figure 4 Schematic representations of the conclusions that can be drawn from different correlation analysis approaches of gene expression data. (Law, Kellgren, Bjork, Ryden, & Keech, 2020)
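The situation in row three can be imitated with a small simulation. The sketch below is a hedged illustration only: it assumes that each treatment level forms its own sub-experiment, that the treatment adds a constant shift to genes A and C, and that CSE subtracts the per-gene mean within each sub-experiment. All names and parameter values are hypothetical.

```python
import numpy as np

def cse(expr, labels):
    """Assumed form of CSE: centre each gene within each sub-experiment."""
    centred = np.empty_like(expr)
    for label in np.unique(labels):
        cols = labels == label
        centred[:, cols] = expr[:, cols] - expr[:, cols].mean(axis=1, keepdims=True)
    return centred

rng = np.random.default_rng(1)
n_sub, n_per = 8, 25
labels = np.repeat(np.arange(n_sub), n_per)
# Hypothetical treatment shift, constant within a sub-experiment.
shift = np.linspace(-4.0, 4.0, n_sub)[labels]

def noise(scale=1.0):
    return rng.normal(scale=scale, size=labels.size)

a = shift + noise()   # treatment acts on gene A
b = a + noise(0.5)    # A -> B
c = shift + noise()   # treatment acts on gene C
d = c + noise(0.5)    # C -> D
e = c + noise(0.5)    # C -> E
expr = np.vstack([a, b, c, d, e])

raw = np.corrcoef(expr)                   # Pearson without CSE
centred = np.corrcoef(cse(expr, labels))  # Pearson with CSE
```

Without CSE the shared treatment shift produces a strong A-C correlation although the two networks are separate. After centring, the A-C correlation is close to zero while the A-B, C-D and C-E edges remain strong.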

The CSE method was evaluated by constructing gene co-expression networks, with and without CSE, from gene expression data from the mitochondria of the plant Arabidopsis. Because the mitochondrial genes of Arabidopsis and their functions are well studied, our hypothesis was that there would be more connections between genes that are involved in the same process or have similar functions. By controlling the sparsity of the different co-expression networks, it was possible to test whether genes sharing the same function or process were connected more often than expected by chance.
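The summary does not state which test was used. One possible way to carry out such a comparison is sketched below, under the assumption that sparsity is controlled by keeping a fixed number of the strongest edges and that chance is represented by permuting the functional labels. The function names, the labels `respiration` and `translation`, and the simulated data are hypothetical.

```python
import numpy as np

def top_edges(corr, n_edges):
    """Gene pairs with the n_edges largest absolute correlations."""
    i, j = np.triu_indices_from(corr, k=1)
    order = np.argsort(-np.abs(corr[i, j]))[:n_edges]
    return i[order], j[order]

def same_function_count(edges, functions):
    """Number of edges whose two genes carry the same functional label."""
    i, j = edges
    return int(np.sum(functions[i] == functions[j]))

def permutation_p_value(corr, functions, n_edges, n_perm=1000, seed=0):
    """One-sided permutation test for an excess of same-function edges."""
    rng = np.random.default_rng(seed)
    functions = np.asarray(functions)
    edges = top_edges(corr, n_edges)
    observed = same_function_count(edges, functions)
    null = np.array([same_function_count(edges, rng.permutation(functions))
                     for _ in range(n_perm)])
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p

# Toy data: 20 genes in two functional groups, each driven by its own factor.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 10)
functions = np.repeat(["respiration", "translation"], 10)
factors = rng.normal(size=(2, 200))
expr = factors[group] + rng.normal(size=(20, 200))
corr = np.corrcoef(expr)

observed, p_value = permutation_p_value(corr, functions, n_edges=30)
```

Fixing `n_edges` keeps the networks built with and without CSE equally sparse, so their same-function counts can be compared directly.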

Finally, a core network was constructed using CSE and Pearson's correlation, and a clustering algorithm, the walktrap algorithm (Pons & Latapy, 2005), was applied. The genes in the different clusters were then checked for similar functions. CSE can be used in addition to ordinary gene co-expression networks constructed without CSE. Together, these networks can contribute to greater biological insight.

Chapter 5 Discussion and future research

This thesis is based on four papers where the high dimensionality of the data is of great concern. The number of variables usually exceeds the number of samples which makes statistical analysis more demanding. Therefore, it is important to make correct choices both before the study is conducted and during data analysis.

In healthcare, there are many aspects to consider, but the primary focus is usually to find a treatment or to relieve pain for patients. For hereditary diseases, it is common that no treatment is available today. The ability to sequence the genome and search for the genetic variant that causes a disease is important for knowledge in general, but also for affected families seeking answers. However, for rare diseases it is hard to know how many studies have produced a negative result, i.e. studies in which no mutation could be found. One way of minimizing the number of studies that never yield answers is to design the study carefully.

Based on the results of Paper I and other similar studies aimed at identifying the cause of a rare disease, an experimental design study was performed in Paper II. The aim of Paper II was to identify which individuals from a family affected by a rare genetic disease should be selected for exome sequencing. Our study shows that the selection of individuals has a great impact on the number of potential SNVs in the final output. The number of individuals selected for exome sequencing also has an impact on the final number of potential SNVs, but the gain from each additional individual diminishes. Hence, by making smart choices when selecting individuals, time, resources, and money can be saved.

A future research project could be to create a program in which a user can define a pedigree describing the family of interest. Based on the pedigree, on whether the SNV is assumed to be autosomal or located on the allosomes, and on whether it is dominant or recessive, the program could present the best designs, i.e. the best selections of individuals to include from the pedigree. In our pipeline we considered autosomal dominant inheritance, but the pipeline could be developed further so that more inheritance patterns can be considered. There could also be room for handling errors such as misdiagnosed patients and reading errors.

Another concern in healthcare is the increasing number of antibiotic-resistant bacteria. In Paper III, it was shown that there was good agreement between the antimicrobial test and the antibiotic resistance genes found in the sequences. Hence, studying the genomes of bacteria provides information about which antibiotic resistance genes and virulence genes the bacteria carry, which can lead to more correct prescription of antibiotics. Further, studying the relationships between the bacterial genomes might provide evidence of the origin of the bacteria. It can be used to determine whether a bacterium has evolved locally within the hospital or has been transferred from an external environment. In general, information about the origin of the bacteria might lead to guidelines on how to stop the spread. The Staphylococcus epidermidis ST215 clone analyzed in Paper III shows many similarities with the globally known antibiotic-resistant ST2 clone. A fear is that the ST215 clone will also spread over the world. Further research on how to stop the spread is therefore needed, but in the meantime making the right choices when prescribing antibiotics is of great importance.

The complexity of organisms cannot be described solely by the order of the nucleotides in the DNA sequence. The whole body is a system that reacts to its surroundings and to its genetic inheritance. Studying how genes interact with each other is of interest when exploring different processes in an organism. It makes it possible to analyze how different treatments, e.g. stresses, affect a network of genes. Gene expression data can be used to study how genes interact by creating gene co-expression networks.

The amount of expression data produced to date is large, and combining different datasets could be a real strength in research. However, this strength comes with a weakness: the extra technical and biological noise, which can lead to more false positives and false negatives. In Paper IV we present a pre-processing step that makes it possible to gain the strength of combining large datasets while minimizing the noise when creating a core network.

Omics data is high-dimensional and complex, and new methods are needed for better biological insight. This thesis contributes a few steps in the direction of revealing the hidden patterns that define life!

References

Allen, J. D., Xie, Y., Chen, M., Girard, L., & Xiao, G. (2012). Comparing statistical methods for constructing large scale gene networks. PLoS One, 7(1), e29348. doi:10.1371/journal.pone.0029348

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215(3), 403-410. doi:10.1016/S0022-2836(05)80360-2

Bolstad, B. M., Irizarry, R. A., Astrand, M., & Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-193. doi:10.1093/bioinformatics/19.2.185

Carter, S. L., Brechbuhler, C. M., Griffin, M., & Bond, A. T. (2004). Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics, 20(14), 2242-2250. doi:10.1093/bioinformatics/bth234

Chudley, A. E. (1998). Genetic landmarks through philately--Gregor Johann Mendel (1822-1884). Clin Genet, 54(2), 121-123. doi:10.1111/j.1399-0004.1998.tb03713.x

Haendel, M., Vasilevsky, N., Unni, D., Bologa, C., Harris, N., Rehm, H., . . . Oprea, T. I. (2020). How many rare diseases are there? Nat Rev Drug Discov, 19(2), 77-78. doi:10.1038/d41573-019-00180-y

International Human Genome Sequencing, C. (2004). Finishing the euchromatic sequence of the human genome. Nature, 431(7011), 931-945. doi:10.1038/nature03001

Jukes, T. H., & Cantor, C. R. (1969). Evolution of protein molecules. In H. N. Munro (Ed.), Mammalian Protein Metabolism (pp. 21-132). New York: Academic Press.

Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2), 111-120. doi:10.1007/BF01731581

Langfelder, P., & Horvath, S. (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics, 9, 559. doi:10.1186/1471-2105-9-559

Law, S. R., Kellgren, T. G., Bjork, R., Ryden, P., & Keech, O. (2020). Centralization Within Sub-Experiments Enhances the Biological Relevance of Gene Co-expression Networks: A Plant Mitochondrial Case Study. Front Plant Sci, 11, 524. doi:10.3389/fpls.2020.00524

Lazar, C., Meganck, S., Taminau, J., Steenhoff, D., Coletta, A., Molter, C., . . . Nowe, A. (2013). Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform, 14(4), 469-490. doi:10.1093/bib/bbs037

Ma, S., Bohnert, H. J., & Dinesh-Kumar, S. P. (2015). AtGGM2014, an Arabidopsis gene co-expression network for functional studies. Sci China Life Sci, 58(3), 276-286. doi:10.1007/s11427-015-4803-x

Ma, S., Gong, Q., & Bohnert, H. J. (2007). An Arabidopsis gene network based on the graphical Gaussian model. Genome Res, 17(11), 1614-1625.

[...] mendelian disorder. Nat Genet, 42(1), 30-35. doi:10.1038/ng.499

Pons, P., & Latapy, M. (2005). Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005.

Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406-425. doi:10.1093/oxfordjournals.molbev.a040454

Song, L., Langfelder, P., & Horvath, S. (2012). Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC Bioinformatics, 13, 328. doi:10.1186/1471-2105-13-328

Usadel, B., Obayashi, T., Mutwil, M., Giorgi, F. M., Bassel, G. W., Tanimoto, M., . . . Provart, N. J. (2009). Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant Cell Environ, 32(12), 1633-1651. doi:10.1111/j.1365-3040.2009.02040.x

Waggener, W. N. (1995). Pulse code modulation techniques : with applications in communications and data recording. New York: Van Nostrand Reinhold.

Wang, J., Do, K. A., Wen, S., Tsavachidis, S., McDonnell, T. J., Logothetis, C. J., & Coombes, K. R. (2007). Merging microarray data, robust feature selection, and predicting prognosis in prostate cancer. Cancer Inform, 2, 87-97. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/19458761

Watson, J. D., & Crick, F. H. (1953). Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356), 737-738. doi:10.1038/171737a0

Venter, J. C. (2001). The sequence of the human genome (1.0 ed., pp. 1 CD-ROM).

Wille, A., Zimmermann, P., Vranova, E., Furholz, A., Laule, O., Bleuler, S., . . . Buhlmann, P. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol, 5(11), R92. doi:10.1186/gb-2004-5-11-r92

Voelkerding, K. V., Dames, S. A., & Durtschi, J. D. (2009). Next-Generation Sequencing: From Basic Research to Diagnostics. Clinical Chemistry, 55(4), 641-658. doi:10.1373/clinchem.2008.112789

Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol, 4, Article17. doi:10.2202/1544-6115.1128

Papers I-IV

Department of Mathematics and Mathematical Statistics Umeå University, SE-901 87 Umeå, Sweden

www.umu.se

ISBN 978-91-7855-240-5 (print) ISBN: 978-91-7855-241-2 (pdf)
