Hidden patterns that matter Statistical methods for analysis of DNA and RNA data

Therese Kellgren

Department of Mathematics and Mathematical Statistics

Umeå 2020

This work is protected by the Swedish Copyright Legislation (Act 1960:729)

Doctoral Thesis No. 71/20

ISBN: 978-91-7855-240-5 (print)

ISBN: 978-91-7855-241-2 (pdf)

ISSN: 1653-0829

Electronic version available at: http://umu.diva-portal.org/

Printed by: Cityprint i Norr AB

List of papers

This thesis is based on the following papers:

I. Mutations in Collagen, Type XVII, Alpha 1 (COL17A1) Cause Epithelial Recurrent Erosion Dystrophy (ERED). Human Mutation, John Wiley & Sons 2015, Vol. 36, (4): 463-473. Jonsson, Frida; Byström, Berit; Davidson, Alice E.; et al.

II. Experimental designs for finding disease-causing mutations in rare diseases. Kellgren, Therese; Rydén, Patrik. Manuscript.

III. The emergence of an antimicrobial resistant Staphylococcus epidermidis clone in Northern Europe. Kellgren, Therese; Dwibedi, Chinmay; Widerström, Micael; Monsen, Tor; Rydén, Patrik; Johansson, Anders. Manuscript.

IV. Centralization Within Sub-Experiments Enhances the Biological Relevance of Gene Co-expression Networks: A Plant Mitochondrial Case Study. Front Plant Sci 2020, 11, 524. Law, Simon; Kellgren, Therese; Björk, Rafael; Rydén, Patrik; Keech, Olivier.

Papers not included in the thesis

SNX10 gene mutation leading to osteopetrosis with dysfunctional osteoclasts. Scientific Reports, Nature Publishing Group 2017, Vol. 7. Stattin, Eva-Lena; Henning, Petra; Klar, Joakim; et al.

Non-homologous recombination between Alu and LINE-1 repeats results in a 91 kb deletion in MERTK causing severe retinitis pigmentosa. Molecular Vision 2018, Vol. 24: 667-678. Jonsson, Frida; Burstedt, Marie; Kellgren, Therese; et al.

Abstract

Understanding how genetic variation affects the characteristics and function of organisms can help researchers and medical doctors to detect genetic alterations that cause disease and to reveal genes that cause antibiotic resistance. The opportunities and progress associated with such data come, however, with challenges related to statistical analysis. It is only by using properly designed and properly applied tools that we can extract the information about hidden patterns. In this thesis we present three types of such analysis.

First, the genetic variant in the gene COL17A1 that causes corneal dystrophy with recurrent erosions is revealed. By studying next-generation sequencing data, the order of the nucleotides in the DNA sequence was obtained, which enabled us to detect interesting variants in the genome. Further, we present the results of an experimental design study that aims to make the best selection of individuals from a family affected by an inherited disease.

In the second part of the work, we analyzed a novel antibiotic-resistant Staphylococcus epidermidis clone that has only been found in northern Europe. By investigating its genetic data, we revealed similarities to a clone known worldwide for its antibiotic resistance. The antibiotic resistance profile was established from the DNA sequences.

Finally, we focus on the challenges related to the abundance of genetic data from different sources. The increasing number of public gene expression datasets gives us the opportunity to increase our understanding by using information from multiple sources simultaneously. Naturally, this requires merging independent datasets. However, when doing so, the technical and biological variation in the joined data increases. We present a pre-processing method for constructing gene co-expression networks from a large, diverse gene expression dataset.

Abbreviations

bp – base pair

SNP – Single Nucleotide Polymorphism

WES – Whole Exome Sequencing

DNA – Deoxyribonucleic Acid

RNA – Ribonucleic Acid

mRNA – messenger Ribonucleic Acid

GCN – Gene Co-expression Network

DE – Differentially Expressed

CSE – Centralization within Sub-Experiments

cDNA – complementary Deoxyribonucleic Acid

NGS – Next Generation Sequencing

Summary

The information in the genome largely governs how organisms react and look. Part of this information has been mapped, but much remains unknown to humankind. Recent advances in sequencing technology have created remarkable opportunities to understand these patterns, which are hidden to us.

Next-generation sequencing techniques make it possible to read our DNA sequence and thereby to understand how different variants in the genome affect the appearance and function of organisms. This thesis studies genetic mutations that cause disease, where a single change in a gene can alter the protein produced by the gene and thereby affect bodily function. A family with an inherited genetic variant is studied, and by analyzing DNA sequences from family members the mutation could be found. Based on this study, and on similar studies, an experimental design study is carried out with the aim of optimizing the outcome when the purpose is to find genetic variants using DNA sequencing. The number of similar studies in which the mutation is never found is unknown, and a well-considered design for choosing which family members to sequence can optimize the outcome. Furthermore, an antibiotic-resistant bacterium that has so far only been found in northern Europe is studied. The bacterium is characterized through phylogenetic analyses and by examining which antibiotic resistance genes and virulence genes are present in the genome. The bacteria are compared with a well-known antibiotic-resistant bacterium, and the similarities are many.

Finally, a pre-processing method is presented that makes it possible to study large heterogeneous datasets with the aim of building networks of how genes interact with each other. The method is validated by examining networks of well-studied functions such as the electron transport chain in the mitochondria of the plant thale cress. This thesis takes us one step closer to revealing the genome's hidden

Acknowledgement

During my years at the Department of Mathematics and Mathematical Statistics I've had the pleasure of meeting so many intelligent and nice people. I would like to thank all of you for every kind word and all the help you have given me through these years. I would like to give a special thank you to my supervisor Patrik Rydén. Patrik, you have been my rock during this thesis. I have learned so much, and I am truly grateful to you for taking me under your wing. Especially during the last months before printing this thesis, you have been such a good support, pushing me when I just wanted to crawl under a bed and never come out. I would also like to thank my second supervisor Sara Sjöstedt de Luna. I have always known I could come to you if I needed to.

I have been so lucky with the collaborations that this thesis is built upon. I have learnt so much, and it has been a pleasure working together with all of you. Thank you, Frida Jonsson and Irina Golovleva, for teaching me about genetics. Frida, you never gave up and kept digging, which gave good results. I am truly impressed by you. Chinmay Dwibedi, Anders Johansson and Micael Widerström, thank you so much for teaching me about bioinformatics and antibiotic resistance. I have learnt so much from you three, and a special thanks to you, Chinmay, I will always remember all the fun we had. Olivier Keech and Simon Law, the two of you have so much knowledge, and it has been a pleasure collaborating with you. Thank you for taking the time to teach me, for being so nice and for always having time for that extra laugh.

I would also like to thank Johan Strandberg, Mattias Käller, Linda Vidman, David Källberg, Mats Johansson and Kate Bennet. Thank you so much for your extra support and loving words during these years, but thank you most of all for being such good friends. I would like to give a special thank you to Lina Schelin and Konrad Abramowicz. Thanks so much for reading my thesis and giving feedback. I am truly thankful! I am so lucky to have gotten to know you two over the years, and you have become dear friends of mine.

Further, I would like to thank all the members of the department of mathematics and mathematical statistics. Thank you for all the fun conversations, for all the knowledge in statistics and for the interesting talks about pedagogics and didactics.

A big thank you to my new colleagues at RCN for supporting me in finishing this thesis. Thank you, Gabriel Granåsen, Per Liv, Henrik Holmberg, Simon Valin, Oscar Öhman, Jessica Edlund, Annika Persson, Markus Widholm, Christel Häggström, Ewa Marklund and Yulia Blomstedt.

To all my friends, thank you for accepting that I have not been able to hang out as much as before. Your friendship and acceptance have really helped me forward. A special thank you to Tina Holma, Elin Lindkvist, Sara Nilsson, Marie Thorneus, Marielle Lindberg, Louise Karlsson and Jenny Gyll (i.e. pantertanterna). Our friendship is priceless, and thank you for all the fun and for making me forget about work from time to time.

I would also like to mention all the people who made the thesis go forward on a personal level. To my lovely family, thank you for the extra help and loving words. To my mother Birgith Gabrielsson and my father Birger Lundberg, thank you for all your support and for raising me to be who I am today. My sisters, Beatrice Gabrielsson and Johanna Birgersdotter, I love you to the moon and back. To Agneta Kellgren and Ola Kellgren, thank you for your support, for all the help with the kids and for all the pep talks, even when times have been tough. To Lena Lennartsson, Tony Källman, William Gabrielsson, Ludwig Gabrielsson, Bertholof Brännström, John Kellgren, Karolina Kellgren, Elis Brännström, My Brännström, Emma Brännström, Klara Kellgren, Martin Kellgren, Elsy, Rut and Iver, thank you for making this process better and for being in my life.

To the most important person there is, the love of my life, Carl Kellgren: without you taking the extra days of VAB, fixing dinner and putting up with my grumpy mood from time to time, this would never have happened. I am truly grateful. To our children Gabriel and Elma, not really helpful in creating the thesis but instead making the journey so much better.

This final year has been different for many of us, and we could not meet as we used to, but I do not see us as being apart, maybe in distance but never in my heart.

Chapter 1 Introduction

The genome contains all the genetic material in a cell. It describes when and how the cell should develop and function. These descriptions are stored in long chains of molecules called deoxyribonucleic acid (DNA). DNA is a two-stranded molecule: its two chains are bound to each other in a three-dimensional structure called the double helix, first described in 1953 by Watson and Crick (Watson & Crick, 1953). DNA contains four different types of building blocks called nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). The order of these nucleotides is essential for how the cell develops and functions.

Differences in the genome, together with the impact of the environment, explain why organisms differ in traits, development, and responses. An essential part of the genome is the genes. They carry information on how other molecules, the proteins, should be formed and hence take part in determining an organism’s characteristics and functions. Through evolution, organisms have developed through changes in the genome. Changes that affect an organism positively tend to remain in the population, while changes that affect an organism negatively tend to disappear.

Changes in the nucleotide sequence are called mutations, and there are several types. A mutation where one nucleotide has changed into another is called a single nucleotide variant (SNV). In a deletion, a stretch of nucleotides has disappeared from the sequence, and in an insertion, nucleotides have been added to it. Insertions and deletions are together called indels. Mutations that cause changes in the phenotype, i.e. the observable characteristics, can be traced back in time through family trees called pedigrees. Such a mutation is said to be hereditary, and the pattern of how it is inherited can be revealed through the pedigree. These patterns are the foundation for studying genetic mutations that can be traced within a family tree.

Bacteria are an example of organisms that can often adapt quickly to their surroundings. If the environment changes, the bacterial genome can adapt through mutations or through a process called recombination. Recombination is a process of DNA transfer between two organisms, one donor and one recipient, e.g. a sequence of nucleotides can be transferred from the donor to the recipient. A consequence of the ability of bacteria to adapt to new surroundings is the antibiotic resistance we see in healthcare. It is an increasing problem, and a wider understanding of the bacterial genome is thus needed.

To get information about organisms on a genetic level, the order of the nucleotides in the DNA sequence needs to be read. The Human Genome Project, a huge international research project, started in 1990. Its aims were to read the entire sequence of the human genome and to identify and locate the genes functionally and physically. In 2001 the human genome was first published (Venter, 2001), and the last chromosome followed in 2003 (International Human Genome Sequencing, 2004).

Nowadays, sequencing technology is much faster and more accessible. With the parallel sequencing techniques called next-generation sequencing (NGS), introduced in 2005, the price has decreased significantly (Mardis, 2011, 2017). With these techniques, even small mutations such as indels or SNVs can be discovered in the genome. Often the genes are of main interest, and by focusing the sequencing on the protein-coding parts only, the price can decrease even further.

The order of the nucleotides can only tell us which variants of a gene are present, or whether the genes of interest are present in the genome at all. Whether the genes are active is also of great interest when studying the traits or functions of organisms. One way of measuring how active a gene is, is to measure how much the gene is expressed. Microarray technology can measure how much known genes are expressed, while RNA-sequencing technology can also measure unknown genes.

All genes are connected, but some genes are more closely connected and are activated by similar biological processes. Studying the interactions between genes can help detect networks of interacting genes, which can reveal the function of unknown genes and give knowledge about biological processes (Usadel et al., 2009). The main statistical issue when analyzing gene expression data is the high-dimensional setting: the number of variables is very large, but there are often only a few samples. The potential of fusing datasets together to bring more biological insight is exciting. However, merging datasets that were produced in different labs, with different settings and on different platforms, is challenging, since it can introduce additional technical and biological variation.

The genome is filled with secret patterns that explain the function and characteristics of organisms. This thesis contributes to the investigation of such hidden patterns. In Papers I and II, inherited rare diseases are analyzed. In Paper I, an inherited disease is detected in a family, and exome sequencing is performed with the aim of detecting the SNV causing the disease. In Paper II, public DNA sequencing data is used to simulate families with an inherited disease caused by an SNV. The aim of the study is to find the optimal design for choosing individuals from the family in order to detect the SNV. In Paper III, a novel clone of the antibiotic-resistant bacterium Staphylococcus epidermidis is analyzed, with the aim of detecting genetic differences and similarities compared to a clone known worldwide. Finally, in Paper IV, a new normalization method called centralization within sub-experiments (CSE) is presented. CSE makes it possible to fuse large gene expression datasets together to gain biological insight when constructing gene co-expression networks.

Chapter 2 Objectives

The general objective of this thesis is to present a wide framework that helps researchers find hidden patterns. The examples in this thesis include identifying DNA mutations causing rare diseases, finding differences in the genomes of mutated bacteria, and finding gene co-expression networks. The objective of each individual paper is listed below.

Paper I

An inherited eye disease causing corneal dystrophy with recurrent erosions was diagnosed in several individuals from the same family tree. The aim was to locate the disease-causing mutation in the genome using exome sequencing data.

Paper II

The aim was to select individuals from a family tree in which an autosomal dominant disease runs, so that the list of potential mutations of interest is minimized.

Paper III

The overall aim was to study a novel clone of the bacterium Staphylococcus epidermidis, to explore and understand the similarities and differences when compared to a clone that is spread worldwide.

Paper IV

The aim was to develop a method that reduces false positives and false negatives in co-expression networks constructed from large combined gene expression datasets.

Chapter 3 Background

The following chapter describes the biology behind the characteristics and functions observed in organisms. The techniques that can be used to detect the genetic patterns that cause the phenotype are also described. This is followed by a description of methods for rare disease and mutation detection, and for measuring the interaction between genes.

3.1 Omics data

This thesis deals with the analysis of two types of omics data: DNA sequencing data, which gives the order of the nucleotides in the genome, and microarray data, which measures the amount of RNA produced from genes.

DNA

Deoxyribonucleic acid, DNA, is built from four nucleotides. Each nucleotide consists of a nucleobase, which is either adenine, guanine, cytosine or thymine, i.e. “A”, “G”, “C” and “T”, together with a sugar and a phosphate group. DNA is built out of two complementary strands that both carry the genetic information, and this information is replicated when the two strands separate. The replication of DNA is essential for organisms to grow and to reproduce. In eukaryotic cells the DNA is organized in chromosomes, which contain part or all of the genetic material. Some organisms, e.g. humans, have autosomes, which are the body chromosomes, and allosomes, which are the sex chromosomes. In addition to these there is mitochondrial DNA and, for plants, also chloroplast DNA. In prokaryotes the DNA is stored in circular chromosomes.

Next generation sequencing

Next generation sequencing (NGS) is a method to read the order of the nucleotides in the genome. This sequence contains the information of hereditary properties.

There are several different NGS platforms, and they differ in their template preparation, the chemistry, the length of the reads, the run time and how many gigabytes they produce per run. What these technologies have in common is that they use massively parallel sequencing, which allows the sequencing procedure to produce 1 million to 43 billion short reads per run (Voelkerding, Dames, & Durtschi, 2009).

The NGS techniques can be used to read the entire genome, but to reduce the financial cost it is common to sequence only the exome, which is referred to as whole exome sequencing (WES). The exome contains the protein-coding parts of the genes and is of high interest when studying inherited diseases (Ng et al., 2010).

Alignment

When sequencing DNA, short reads of DNA are produced, and bioinformatics is required to put the pieces together. When a reference genome exists, the short reads are mapped back to the reference genome to rebuild the DNA sequence. Sometimes a reference genome does not exist, or the aim is to find large differences from the reference genome. A de novo sequencing approach can then be used, in which the short reads are assembled without a reference genome. A high depth is preferred, i.e. many short reads cover the same positions, so that a consensus sequence can be built from the short reads when reconstructing the original sequence.
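The role of depth when building a consensus can be illustrated with a small sketch. It assumes that the reads have already been placed on a reference coordinate system, which in practice is the job of an alignment or assembly tool. The function name `consensus` and the toy reads are made up for this example.

```python
from collections import Counter


def consensus(reference_length, aligned_reads):
    """Majority-vote consensus from reads already placed on a reference.

    aligned_reads is a list of (start, sequence) pairs with 0-based starts.
    Positions without any coverage are reported as 'N'.
    """
    columns = [Counter() for _ in range(reference_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < reference_length:
                columns[pos][base] += 1
    # Depth is the number of reads covering each position.
    depth = [sum(col.values()) for col in columns]
    calls = "".join(col.most_common(1)[0][0] if col else "N" for col in columns)
    return calls, depth


# Four hypothetical short reads; the last one carries a sequencing error.
reads = [(0, "ACGT"), (2, "GTTA"), (3, "TTAC"), (2, "GATA")]
sequence, depth = consensus(8, reads)
print(sequence)  # ACGTTACN
print(depth)     # [1, 1, 3, 4, 3, 3, 1, 0]
```

At position 3 the depth is 4, so the single erroneous base is outvoted, while the last position has no coverage and cannot be called.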

Alignment can be used not only to piece a sequence together but also to compare different sequences to each other. BLAST (Basic Local Alignment Search Tool) is a common alignment tool that lets the user compare nucleotide or protein sequences to other sequences or to the NCBI database in a simple way (Altschul, Gish, Miller, Myers, & Lipman, 1990).

In Papers I and II the data were aligned against the human reference genome, but for Paper III a de novo sequencing approach was used, and the sequences were further explored using BLAST.

The central dogma

Since the genetic material is passed on through generations, characteristics, i.e. phenotypes, will live on. These phenotypes are encoded in the genome, and therefore the genotype can be followed through generations. Many of these genotypes are found in the protein-coding regions, together called the exome. Genes consist of exons and introns, and a combination of exons is translated into a protein. The flow of information from the DNA through the genes to proteins is called the central dogma, see figure 1. The DNA is transcribed to RNA, ribonucleic acid, a single-stranded molecule that is much smaller than the DNA and can pass out of the nucleus to the ribosomes. The ribosomes then translate the RNA into proteins.

Figure 1: The central dogma in molecular biology

Microarrays

The microarray technology can detect the amount of RNA produced from known genes and hence can be very useful in analyzing how genes work under certain circumstances. The details can vary among different platforms but in general the microarray technology works in the following way:

Microarrays have thousands of spots neatly ordered on an array. Each spot represents one gene and has small pieces of DNA from the gene attached. Messenger RNA (mRNA) from a sample of interest is converted to complementary DNA (cDNA), which is more stable, and then labeled with a fluorescent dye. The cDNA sample is applied to the microarray and binds to the small pieces of DNA attached to the spots on the array; this binding is called hybridization. The cDNA that matches the DNA attached to a spot binds tightly and remains after the array is washed. The amount of cDNA is then measured by scanning the array with lasers that activate the fluorescent dye. The intensity of the color gives a relative measure of the gene expression.

3.2 Inheritance

To understand the mechanism behind hereditary diseases, we first need to introduce the following concepts.

Meiosis and fertilization

Meiosis is the process in which a cell divides its genetic information for the purpose of sexual reproduction. The result of this cell division is a mixture of the two chromosomes in one haploid cell, called a gamete. Two gametes fuse during fertilization, and a new individual is created as a mixture of its parents. This inheritance pattern was first described by Gregor Mendel in 1865, who studied peas (Chudley, 1998). The two alleles of one individual are randomly reduced to one allele, which is paired with an allele obtained in the same way from an unrelated individual, creating the new individual. We use this principle in Paper II to simulate new individuals.
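The principle can be written as a few lines of code. This is a simplified sketch and not the simulation code of Paper II: it treats every locus as independent (no linkage between neighbouring positions), and the function names and genotypes are invented for the illustration.

```python
import random


def make_gamete(genotype, rng):
    """Pick one of the two alleles at each locus with equal probability."""
    return [rng.choice(pair) for pair in genotype]


def fertilize(mother, father, rng):
    """Combine one gamete from each parent into a diploid child genotype."""
    egg = make_gamete(mother, rng)
    sperm = make_gamete(father, rng)
    return list(zip(egg, sperm))


rng = random.Random(1)  # fixed seed so the example is reproducible
# Each genotype is a list of loci; each locus holds the two alleles.
mother = [("A", "G"), ("C", "C")]
father = [("A", "A"), ("C", "T")]
child = fertilize(mother, father, rng)
print(child)
```

At each locus the child receives one allele from the mother (first element) and one from the father (second element).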

Single nucleotide variants

A single nucleotide variant (SNV) occurs when one of the millions of base pairs (bp) has changed into another nucleotide, e.g. an A has become a G. This is one type of variant in the genome. Depending on where the change is located, it can have an impact on a gene and on the protein produced from that gene. This type of variant occurs in all organisms, from humans to plants to bacteria. The variation drives evolution and at the same time provides a natural defense against other organisms, since individuals do not all share the same weak spots.

Hereditary disease

A hereditary disease (or inherited disease) is a disease that is caused by genetic variants such as SNVs and indels and is directly transferred from parent to offspring during meiosis and fertilization.

The human genome has 3 billion base pairs (bp), of which approximately half come from the father and half from the mother. Just one change among those 3 billion bp can have severe effects on an individual. Diseases caused by these changes are called rare diseases. Even if each of them is rare, approximately 10% of the population is affected by a rare disease (Haendel et al., 2020), and 80% of these are Mendelian rare diseases.

After meiosis, a haploid set of chromosomes from the mother is fused with a haploid set of chromosomes from the father. If a genetic variant that causes a disease is transferred from one of the parents to the offspring, the child will either carry or have the disease.

If the offspring inherits one copy of the genetic variant from a diseased parent and gets the disease, the genetic variant is called dominant. If the offspring does not get the disease, the genetic variant is said to be recessive. The genetic variant can be located on the sex chromosomes, in which case it is referred to as X-linked or Y-linked. In some rare cases, the variant is inherited through the mitochondrial DNA. If the variant is located on one of the other chromosomes, it is said to have an autosomal dominant or an autosomal recessive inheritance.

In Papers I and II an autosomal dominant SNV is considered.

Pedigree chart

Pedigrees are used to illustrate ancestry. Usually squares are used to illustrate males and circles to illustrate females. The relationships between the individuals in a family tree are shown with lines: horizontal lines connect the parents, and vertical lines connect the parents to their offspring. When studying Mendelian rare diseases, the individuals with the phenotype of interest are drawn with a filled-in square or circle.

When studying autosomal diseases, the sex of the individual is irrelevant, and there is less need to distinguish between males and females. In figure 2, the pattern of an autosomal dominant rare Mendelian disease is illustrated. In Paper II, Monte Carlo simulations were used to create families affected by an autosomal dominant SNV. The families were simulated according to Mendelian inheritance patterns and followed a pre-decided pedigree.
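A minimal Monte Carlo sketch of this idea is given below. It is an illustration under stated assumptions and not the simulation used in Paper II: the three-generation pedigree, the member labels, full penetrance and the choice of one heterozygous founder are all hypothetical.

```python
import random

# Hypothetical pedigree: child -> (parent 1, parent 2). Founders have no entry.
PEDIGREE = {
    "II-1": ("I-1", "I-2"),
    "II-2": ("I-1", "I-2"),
    "III-1": ("II-1", "II-3"),
}
ORDER = ["II-1", "II-2", "III-1"]  # parents are listed before their children


def simulate_family(rng):
    """Return the number of variant copies (0, 1 or 2) carried by each member."""
    # Assumption: founder I-1 is a heterozygous carrier, other founders are not.
    copies = {"I-1": 1, "I-2": 0, "II-3": 0}
    for child in ORDER:
        # Each parent passes on one of its two alleles, so a parent with
        # n copies transmits the variant with probability n / 2.
        copies[child] = sum(rng.random() < copies[p] / 2 for p in PEDIGREE[child])
    return copies


def affected(copies):
    """Dominant model with full penetrance: one copy gives the phenotype."""
    return {name for name, n in copies.items() if n >= 1}


rng = random.Random(2020)
runs = 10_000
hits = sum("III-1" in affected(simulate_family(rng)) for _ in range(runs))
share_affected = hits / runs
print(share_affected)  # close to the theoretical value 1/4
```

The grandchild III-1 is affected only if the variant is transmitted twice in a row, which has probability 1/2 × 1/2 = 1/4, and the simulated share should land near that value.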

Figure 2: A fabricated family pedigree where an autosomal disease runs in the family. The blue circles represent the affected individuals.

3.3 Genetic variation description

If the relationship between taxa, e.g. species or sequences, is not known, a phylogenetic tree can be constructed. The tree shows how similar or distant the taxa are to each other.

Phylogeny

The phylogenetic tree is commonly used to illustrate relationships between taxa. Phylogeny has become a tool to understand e.g. evolution, ecology, biodiversity, and genomes. A phylogenetic tree can be rooted or unrooted. A rooted tree assumes that there is a root, i.e. a common ancestor of the samples studied. An unrooted tree does not assume a common ancestor and does not show any timeline between the taxa studied.

A method commonly used to construct phylogenetic trees from DNA sequence data is the neighbor-joining method (Saitou & Nei, 1987). It is a bottom-up clustering method that requires a distance matrix giving the pairwise distances between all sequences.
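To give a feeling for how the distance matrix is used, the sketch below carries out only the first joining step: it evaluates the commonly used Q criterion for every pair of taxa and picks the pair with the smallest value. Building the full tree would repeat this step after replacing the joined pair by a new node. The function name `nj_pick_pair` and the five-taxon distance matrix are invented for the example.

```python
def nj_pick_pair(names, dist):
    """Return the pair of taxa that neighbor joining would merge first.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(names)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (names[i], names[j]), q
    return best_pair, best_q


names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q_value = nj_pick_pair(names, dist)
print(pair, q_value)  # ('a', 'b') -50
```

Note that the pair with the smallest raw distance is (d, e), yet (a, b) is joined first, because the criterion also accounts for how far each taxon is from all the others.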

A commonly used distance is the Hamming distance (Waggener, 1995). When studying DNA sequences, the number of positions at which two sequences differ, i.e. the number of SNVs between them, is an example of the Hamming distance. However, the Hamming distance can underestimate the total number of changes, since several mutations might have occurred at the same position. The Jukes-Cantor distance (Jukes & Cantor, 1969) and the Kimura distance (Kimura, 1980) are two other measures that compensate for this when estimating branch lengths. The Kimura distance also takes into account that, due to molecular properties, transitions (changes between A and G, or between C and T) are more likely than other substitutions.
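The three distances can be compared on a toy pair of sequences. The sketch uses the standard formulas: with p the proportion of differing sites, the Jukes-Cantor distance is -3/4 · ln(1 - 4p/3), and with P and Q the proportions of transitions and transversions, the Kimura two-parameter distance is -1/2 · ln((1 - 2P - Q) · sqrt(1 - 2Q)). The function names and the sequences are made up, and the sequences are assumed to be aligned, of equal length and free of gaps.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hamming(s1, s2):
    """Number of positions at which two equally long sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))


def jukes_cantor(s1, s2):
    """Jukes-Cantor distance; undefined (ValueError) when p >= 0.75."""
    p = hamming(s1, s2) / len(s1)
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(s1, s2):
    """Kimura two-parameter distance, separating transitions from transversions."""
    n = len(s1)
    diffs = [(a, b) for a, b in zip(s1, s2) if a != b]
    p_ts = sum(d in TRANSITIONS for d in diffs) / n      # transitions
    p_tv = sum(d not in TRANSITIONS for d in diffs) / n  # transversions
    return -0.5 * math.log((1 - 2 * p_ts - p_tv) * math.sqrt(1 - 2 * p_tv))


seq1 = "AACCGGTTAC"
seq2 = "GACCGGTTAA"  # one transition (A>G) and one transversion (C>A)
print(hamming(seq1, seq2))                 # 2
print(round(jukes_cantor(seq1, seq2), 4))  # 0.2326
print(round(kimura_2p(seq1, seq2), 4))     # 0.2341
```

Both corrected distances are larger than the raw proportion of differing sites (2/10 = 0.2), which reflects the compensation for multiple changes at the same position.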

3.4 Networks

Networks and network analysis have a broad spectrum of applications, such as social networks, airline traffic and gene networks. From a mathematical point of view, networks can be described as graphs with vertices and edges. In a social network such as Facebook, every person is a vertex, and if two people are friends, they have an edge between them. For airline traffic, the airports are the vertices and the edges are the possible routes between them. In genomics, the vertices (also called nodes) represent genes, and if two genes have a connection, they have an edge between them. The connections are measured in different ways, but a common one is to see how differentially expressed (DE) the genes are under different conditions and to calculate how correlated they are.

Gene Co-expression Networks

When constructing gene networks, the graph can be either directed or undirected. A gene co-expression network (GCN) is an undirected graph, represented by an adjacency matrix in which an entry is 1 if there is an edge between two genes and 0 otherwise. A general pipeline to construct a GCN is:

• Pre-processing of the data.

• Calculate the co-expression measure to create a similarity score between the genes.

• Decide on a significance threshold that will work as a cut-off deciding if two genes are connected or not.

• Construct an adjacency matrix that will define the network.
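The four steps above can be sketched in a few lines. This is a minimal illustration and not the pipeline of Paper IV: a log2 transform stands in for real pre-processing, Pearson correlation is used as the co-expression measure, and the function name `build_gcn`, the gene names, the expression values and the cut-off of 0.9 are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def build_gcn(expression, cutoff):
    """Return the gene order and a 0/1 adjacency matrix."""
    genes = sorted(expression)
    # Step 1: pre-processing (here only a log2 transform as a placeholder).
    data = {g: [math.log2(v) for v in expression[g]] for g in genes}
    n = len(genes)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Step 2: co-expression measure between gene i and gene j.
            r = pearson(data[genes[i]], data[genes[j]])
            # Steps 3 and 4: hard threshold, then fill the symmetric matrix.
            if abs(r) >= cutoff:
                adjacency[i][j] = adjacency[j][i] = 1
    return genes, adjacency


expression = {
    "geneA": [2.0, 4.0, 8.0, 16.0],
    "geneB": [1.0, 2.0, 4.0, 8.0],
    "geneC": [8.0, 2.0, 16.0, 4.0],
}
genes, adjacency = build_gcn(expression, cutoff=0.9)
print(genes)      # ['geneA', 'geneB', 'geneC']
print(adjacency)  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

In this toy network only geneA and geneB are connected, since their log-expression profiles are perfectly correlated while geneC is uncorrelated with both.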

Pre-processing

Before performing any analysis on microarray data, pre-processing should be carried out to increase the credibility of the results. A widely used initial pre-processing technique is quantile normalization (Bolstad, Irizarry, Astrand, & Speed, 2003). The aim of quantile normalization is to make two or more distributions identical, and thereby to keep the natural variation in the data that we wish to study while removing variation due to technical artifacts, called batch effects (Lazar et al., 2013).
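The idea behind quantile normalization can be shown on a tiny example: every sample is sorted, the values at each rank are averaged across samples, and each original value is replaced by the average for its rank. The sketch below is a bare-bones version that assumes there are no tied values within a sample. The function name and the data are invented for the illustration.

```python
def quantile_normalize(samples):
    """Give all samples the same distribution (assumes no ties within a sample).

    samples maps a sample name to a list of expression values, one per gene,
    with the genes in the same order in every sample.
    """
    names = list(samples)
    n_genes = len(samples[names[0]])
    sorted_cols = [sorted(samples[s]) for s in names]
    # Mean of the r-th smallest value across all samples.
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(names) for r in range(n_genes)
    ]
    normalized = {}
    for s in names:
        # Gene indices ordered from lowest to highest expression in sample s.
        order = sorted(range(n_genes), key=lambda g: samples[s][g])
        values = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            values[gene_index] = rank_means[rank]
        normalized[s] = values
    return normalized


samples = {
    "s1": [5.0, 2.0, 3.0, 4.0],
    "s2": [4.0, 1.0, 3.0, 2.0],
    "s3": [3.0, 4.0, 6.0, 8.0],
}
normalized = quantile_normalize(samples)
for name, values in normalized.items():
    print(name, [round(v, 2) for v in values])
```

After normalization all three samples contain the same set of values, and only the order of the genes within each sample differs.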

Microarray datasets are high-dimensional: they usually have a small number of samples and many variables, and the statistical power to identify signals in the data is usually low. By merging several microarray datasets from independent studies, a more reliable result can be obtained due to the increased number of samples (Wang et al., 2007). However, when merging several datasets, technical variation due to different platforms, protocols, and procedures may arise (Lazar et al., 2013). Another type of variation that can be introduced is biological variation due to different surroundings, treatments, stresses etc. In Paper IV, this problem is addressed by constructing gene co-expression networks from a diverse gene expression dataset and evaluating a new pre-processing step that we call centralization within sub-experiments (CSE). CSE is developed to create a core network for a better understanding of how genes regularly interact when analyzing large heterogeneous datasets.
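The details of CSE are given in Paper IV and are not reproduced in this chapter. As a rough illustration only, the sketch below assumes that centralization means subtracting, for each gene, the mean expression within each sub-experiment. The function name, the sub-experiment labels and the values are hypothetical.

```python
def centralize_within_subexperiments(values, labels):
    """Subtract the sub-experiment mean from one gene's expression values.

    values holds the expression of a single gene across all samples and
    labels gives the sub-experiment that each sample belongs to.
    Assumption: "centralization" is read here as mean-centering.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    means = {label: sum(vs) / len(vs) for label, vs in groups.items()}
    return [value - means[label] for value, label in zip(values, labels)]


gene = [10.0, 12.0, 3.0, 5.0]
labels = ["exp1", "exp1", "exp2", "exp2"]
centered = centralize_within_subexperiments(gene, labels)
print(centered)  # [-1.0, 1.0, -1.0, 1.0]
```

In this toy case the large difference in level between the two sub-experiments disappears, while the variation within each sub-experiment is kept.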

Co-expression measures

To measure the association between two genes, a co-expression measure is needed. There are roughly four types of statistical methods for measuring the association between random variables: probabilistic network-based approaches, correlation-based methods, partial-correlation-based methods and information-theory-based methods (Allen, Xie, Chen, Girard, & Xiao, 2012). In Paper IV we compare the performance of correlation-based and partial-correlation-based GCNs with and without CSE.

For both correlation-based and partial-correlation-based methods, a correlation matrix is constructed from the differential expression values, in which the pairwise correlation is calculated between all pairs of genes.

A common correlation-based measure is Pearson's correlation coefficient, which quantifies the linear relationship between two random variables. In a large dataset with many mutually dependent variables, this may not give a correct picture, since two genes can appear correlated only because both depend on a third. As a solution, we can use partial correlation, which measures the linear relationship between two random variables while accounting for the linear effects of other random variables. Partial correlation has been used in several studies as a measure when constructing GCN (Ma, Bohnert, & Dinesh-Kumar, 2015; Ma, Gong, & Bohnert, 2007; Wille et al., 2004).
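The difference between the two measures can be illustrated with a small Python sketch. It is a simplification under stated assumptions: `partial_corr` is the first-order partial correlation given a single third variable, whereas the partial-correlation-based GCN methods cited above condition on many genes at once. The function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy example: gene z drives both x and y; the noise terms are constructed
# to be uncorrelated with each other and with z.
z = [1, 2, 3, 4, 5, 6, 7, 8]
noise_x = [1, -1, -1, 1, 1, -1, -1, 1]
noise_y = [1, 1, -1, -1, -1, -1, 1, 1]
x = [a + e for a, e in zip(z, noise_x)]
y = [a + e for a, e in zip(z, noise_y)]

r_xy = pearson(x, y)                 # high: x and y share the driver z
r_xy_given_z = partial_corr(x, y, z)  # close to zero once z is accounted for
```

In this constructed example Pearson's correlation between x and y is high, while the partial correlation given z vanishes, which is the behaviour the paragraph above describes.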

Another correlation method worth mentioning for calculating the co-expression measure is the biweight midcorrelation. In contrast to Pearson's correlation coefficient, this similarity measure is based on the median instead of the mean and is therefore less sensitive to outliers (Song, Langfelder, & Horvath, 2012). The biweight midcorrelation is used in the popular WGCNA R package. WGCNA has proved to be an effective way to identify both GCN and gene regulatory networks (GRN), based on directed graphs (Langfelder & Horvath, 2008).


Significance threshold

All genes will be correlated to some degree, no matter which correlation method is used on the gene expression data. Consider a microarray with 20,000 genes: it yields roughly 200 million pairwise correlations. Some of these correlations are very meaningful for the purpose of the study, but many of them are not. After calculating the correlation measures, a major concern is to choose a significance threshold that decides whether two genes are connected or not. There are two ways of deciding on a threshold: a soft threshold (Zhang & Horvath, 2005), which uses the correlations to weight the edges in the gene network, or a hard threshold (Carter, Brechbuhler, Griffin, & Bond, 2004). In hard-threshold methods, the correlation scores that surpass the cut-off are given the value one, which defines an edge between two nodes (genes), and all other correlation scores are given the value zero. The resulting adjacency matrix defines the graph that describes the GCN. Instead of setting a particular value as a cut-off, the number of edges in the network can be controlled. In Paper IV a sparsity of 0.005 was used as a cut-off: the absolute values of the correlations were ranked and the top correlations were given the value one so that the adjacency matrix had a sparsity of 0.005. In this way, we ensured that all the methods had approximately the same number of edges.
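A hard threshold defined through sparsity can be sketched as follows. This is an illustrative sketch, not the code used in Paper IV: the function `hard_threshold`, the small correlation matrix and the sparsity of 0.5 used in the toy example are assumptions chosen so that the result is easy to follow.

```python
from itertools import combinations

# Number of gene pairs on a 20,000-gene microarray: n(n - 1)/2.
n_genes = 20_000
n_pairs = n_genes * (n_genes - 1) // 2  # 199,990,000, i.e. about 200 million


def hard_threshold(corr, sparsity):
    """Return the edges (i, j), i < j, kept when a fraction `sparsity`
    of all gene pairs, ranked by absolute correlation, become edges."""
    n = len(corr)
    pairs = list(combinations(range(n), 2))
    n_edges = round(sparsity * len(pairs))
    ranked = sorted(pairs, key=lambda p: abs(corr[p[0]][p[1]]), reverse=True)
    return set(ranked[:n_edges])


# Toy symmetric correlation matrix for four genes (hypothetical values).
corr = [
    [1.0, 0.9, 0.1, -0.8],
    [0.9, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, -0.7],
    [-0.8, 0.3, -0.7, 1.0],
]
edges = hard_threshold(corr, sparsity=0.5)  # keep 3 of the 6 possible edges
```

Because the ranking uses absolute values, strong negative correlations become edges as well, and every method compared at the same sparsity ends up with the same number of edges.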

Validation

Often the true co-expression network is unknown, and hence it is hard to validate the identified network. One way of justifying the created network is to examine which genes end up in it. If the network contains many genes that are known from the literature to be associated with the topic of the study, the network is considered to be validated. This holds regardless of the study topic, which could be, e.g., a disease, stress, treatment or tissue.

In Paper IV an alternative approach was used to validate the networks. Instead of looking at particular genes, we hypothesized that genes known to be associated with the same biological process should have more edges between them than expected by chance. For example, the electron transport chain complex (ETC-complex) in the mitochondria is well studied, and the genes associated with the complex are verified. Our hypothesis was that genes associated with the ETC-complex should have more edges between them than expected by chance. To illustrate, if there are 100 genes associated with a complex, they have 4950 possible edges. If the network is constructed to have a sparsity of 0.005, approximately 25 of these edges would be present just by chance.
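The arithmetic in the example above, and one possible way to judge whether an observed number of edges exceeds chance, can be sketched in Python. The binomial tail probability below is an assumption made for illustration: it treats every possible edge as present independently with probability equal to the sparsity, which is not necessarily the test used in Paper IV. The function names are hypothetical.

```python
from math import comb, exp, lgamma, log


def expected_edges(n_genes, sparsity):
    """Expected number of chance edges among n_genes at a given sparsity."""
    return comb(n_genes, 2) * sparsity


def binom_pmf(i, n, p):
    """Binomial probability mass, computed in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
               + i * log(p) + (n - i) * log(1 - p))
    return exp(log_pmf)


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


possible = comb(100, 2)                    # 4950 possible edges
by_chance = expected_edges(100, 0.005)     # about 24.75 edges
# Probability of seeing at least 50 edges among the 100 genes by chance alone.
p_value = binom_upper_tail(50, possible, 0.005)
```

Under this simplified model, observing clearly more than about 25 edges among the 100 genes gives a small tail probability and would support the hypothesis that the genes are more connected than expected by chance.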


Chapter 4 Summary of papers

4.1 Paper I

Mutations in Collagen, Type XVII, Alpha 1 (COL17A1) Cause Epithelial Recurrent Erosion Dystrophy (ERED)

In Paper I, the aim was to find the genetic cause of the disease in families with corneal dystrophy with recurrent erosions. The disease showed an autosomal dominant inheritance pattern in the pedigree, and exome sequence data were obtained from the families.

With the WES data, potential disease-causing missense mutations were found using a pipeline that compares the affected and healthy family members with each other. The pipeline was constructed in the following way:

1. Select all positions where at least one of the bp differs from the reference genome.

2. Remove all the positions located on the X chromosome, Y chromosome or in the mitochondria.

3. Remove all positions where the individuals carrying the genetic variant differ.

4. Compare the remaining positions with those of the individuals not affected by the disease. Variants that were shared at the same positions were removed.

5. All known genetic variants were removed.

6. All synonymous variants were removed.
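The filtering steps above can be sketched with set operations in Python. This is a minimal sketch under stated assumptions, not the pipeline used in Paper I: each individual is represented by the set of variants in which they differ from the reference genome (step 1), a variant is a hypothetical tuple `(chromosome, position, ref, alt)`, step 3 is read as keeping only variants carried by every affected individual, and the sets `known` and `synonymous` stand in for external annotation sources. All coordinates are made up.

```python
def filter_candidates(affected, unaffected, known, synonymous):
    """Return candidate variants; `affected` and `unaffected` are lists of
    per-individual variant sets with tuples (chromosome, position, ref, alt)."""
    # Steps 1 and 3: variants that differ from the reference and are
    # carried by every affected individual.
    candidates = set.intersection(*affected)
    # Step 2: drop variants on the sex chromosomes and in the mitochondria.
    candidates = {v for v in candidates if v[0] not in {"X", "Y", "MT"}}
    # Step 4: drop variants also seen in any unaffected individual.
    for person in unaffected:
        candidates -= person
    # Steps 5 and 6: drop known variants and synonymous variants.
    candidates -= known
    candidates -= synonymous
    return candidates


# Toy variants with hypothetical coordinates.
v_causal = ("10", 100, "A", "G")
v_in_healthy = ("2", 50, "C", "T")
v_on_x = ("X", 7, "G", "A")
v_known = ("5", 9, "T", "C")
v_synonymous = ("7", 3, "G", "C")
v_private = ("1", 11, "A", "T")

affected = [
    {v_causal, v_in_healthy, v_on_x, v_known, v_synonymous, v_private},
    {v_causal, v_in_healthy, v_on_x, v_known, v_synonymous},
]
unaffected = [{v_in_healthy}]
remaining = filter_candidates(affected, unaffected, {v_known}, {v_synonymous})
```

In the toy data only the variant shared by both affected individuals, absent from the healthy relative and not filtered by annotation, survives all six steps.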

The potential disease-causing SNVs and indels that remained after the pipeline were clinically validated, and one gene, COL17A1, was confirmed as causing the disease. The gene was validated through Sanger sequencing and was also found in patients in New Zealand.


4.2 Paper II

Experimental designs for finding disease-causing mutations in rare diseases

In this study we show how the number of potential disease-causing mutations differs depending on which individuals in the family tree are chosen for exome sequencing. DNA sequences from a diverse public population from all over the world were used in the analysis. The diverse population was used to simulate new family trees where an autosomal dominant SNV was detected in the families. The simulations were performed in the following way:

1. Two genomes were drawn without replacement from the diverse population.

2. A mixture of each of their genomes was chosen to replicate the meiosis process and produce a single-stranded DNA sequence. These two new single-stranded sequences represent their offspring. Two offspring were produced.

3. A new genome was drawn without replacement from the diverse population.

4. The new genome was paired with one of the offspring to symbolise new parents. This was done for both offspring.

5. Repeat from step 2 until the desired number of generations in the family has been created.

Different sets of individuals, each containing at least one individual affected by the disease, were chosen from the simulated families. The chosen individuals were analysed through a pipeline that worked in a similar way to the first four steps of the pipeline used in Paper I.
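The simulation steps above can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from Paper II: a genome is a pair of haplotypes coded as lists of 0/1 alleles, meiosis is reduced to a single random crossover point, and the names `gamete`, `make_child` and `simulate_family` are hypothetical.

```python
import random


def gamete(genome, rng):
    """Mix the two haplotypes of a genome at one random crossover point."""
    hap_a, hap_b = genome
    if rng.random() < 0.5:
        hap_a, hap_b = hap_b, hap_a
    cut = rng.randrange(len(hap_a) + 1)
    return hap_a[:cut] + hap_b[cut:]


def make_child(parent_1, parent_2, rng):
    """An offspring receives one mixed haplotype from each parent (step 2)."""
    return (gamete(parent_1, rng), gamete(parent_2, rng))


def simulate_family(population, n_generations, rng):
    """Return a list of generations, each a list of offspring genomes."""
    pool = list(population)
    rng.shuffle(pool)  # popping from a shuffled list draws without replacement
    couples = [(pool.pop(), pool.pop())]  # step 1: two founders
    generations = []
    for g in range(n_generations):
        # Step 2: every couple produces two offspring.
        offspring = [make_child(a, b, rng) for a, b in couples for _ in range(2)]
        generations.append(offspring)
        if g < n_generations - 1:
            # Steps 3 and 4: pair each offspring with a newly drawn genome.
            couples = [(kid, pool.pop()) for kid in offspring]
    return generations


rng = random.Random(1)
n_loci = 20
# Hypothetical population of ten genomes with random alleles.
population = [
    ([rng.randint(0, 1) for _ in range(n_loci)],
     [rng.randint(0, 1) for _ in range(n_loci)])
    for _ in range(10)
]
family = simulate_family(population, n_generations=3, rng=rng)
```

With three generations the sketch produces two, four and eight offspring, and it consumes eight genomes from the population (two founders plus six partners).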

In Figure 3, the mean value from 30 simulated family trees with the same relations as in Figure 2 is shown. Different family members are chosen for analysis, with the requirement that at least one individual in each design carries the SNV.

As family members are added to the analysis, the number of potential disease-causing SNVs decreases. However, in each figure there is a lot of variation, indicating that there are better and worse choices to make when selecting individuals from a family affected by an inherited disease. Red indicates that there are fewer than 100 potential SNVs in the final list, green that there are fewer than 50, and blue that there are fewer than 30 potential disease-causing SNVs after the analysis.


Figure 3: The number of potential disease-causing mutations left when adding individuals to the analysis.


It is hard to give direct guidelines that work in every situation, since the best choice depends heavily on which family members are available and how many can be selected. Still, we propose selecting affected individuals that are as distantly related as possible and non-affected family members that are as closely related as possible to the affected persons, i.e., persons that do not carry the SNV of interest should be closely related to the carriers, and the diseased individuals should be distantly related to each other.


4.3 Paper III

The emergence of an antimicrobial resistant Staphylococcus epidermidis clone in Northern Europe

A novel clone of the bacterium Staphylococcus epidermidis, called ST215, was analysed using samples from Umeå and Östersund. ST215 was compared to a known clone, ST2, with samples from Umeå and Perth. Both clones were antibiotic resistant, and the aim was to identify similarities and differences between the two clones.

A phylogenetic analysis was done using 60 public Staphylococcus epidermidis genomes so that the genetic relationship could be established. ST2 and ST215 formed two separate clusters among the 60 public genomes. This was done using the neighbour-joining algorithm with the Kimura 2-parameter pairwise distance matrix. Virulence and antibiotic resistance genes were searched for in all the ST215 and ST2 genomes using the local alignment tool BLAST. In the ST215 genomes, 28 antibiotic resistance genes and 29 virulence genes were found. The ST2 genomes had 28 antibiotic resistance genes and 35 virulence genes. Among these genes, 23 antibiotic resistance genes and 29 virulence genes were shared between the two clones. The antibiotic resistance genes found in the ST2 and ST215 genomes agreed well with the results of antimicrobial susceptibility testing for 31 antimicrobial drugs.
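As an illustration of the pairwise distance that feeds the neighbour-joining algorithm, the sketch below computes the distance of the Kimura two-parameter model (Kimura, 1980), d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. It assumes that this is the model meant by the distance matrix mentioned above; the function name and the toy sequences are hypothetical, and positions with gaps or ambiguous bases are simply skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Toy alignment: one transition (A->G) and one transversion (A->C) in 10 sites.
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAC")
```

The corrected distance is somewhat larger than the raw proportion of differing sites (0.2 in the toy alignment), since the model accounts for multiple substitutions at the same site.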

A recombination analysis showed that, for the ST215 genomes, regions predicted to be recombinant had significantly more virulence genes than the rest of the genome. The ST2 genomes, however, had significantly more antibiotic resistance genes in the recombinant regions than in the rest of the genome.

A Monte Carlo simulation of the ST215 genomes indicated that there was a relationship between the number of SNVs in the genome and the timeline. The genomes were ordered according to their age, and it was then analysed whether there was a pattern in the SNVs that agreed with time. The analysis indicates that the genomes had evolved from the same ancestor.


4.4 Paper IV

Centralization Within Sub-Experiments Enhances the Biological Relevance of Gene Co-expression Networks: A Plant Mitochondrial Case Study

Here we present a novel pre-processing method for expression data. It is a centralisation method that allows the user to combine data from different labs, different conditions, etc., in order to find underlying pathways.

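We consider normalised gene expression data. For each expression value, the CSE value is calculated as

$$x^{\mathrm{CSE}}_{ijk} = x_{ijk} - \bar{x}_{ij\cdot},$$

for gene $i$, sub-experiment $j$ and the $k$th replicate in sub-experiment $j$, where $\bar{x}_{ij\cdot}$ denotes the mean expression of gene $i$ over the replicates in sub-experiment $j$.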
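A minimal Python sketch of the CSE step is given below. It assumes that the value subtracted from each expression value is the mean of the same gene over the replicates within the same sub-experiment, as the name of the method suggests; the function name `cse`, the dictionary layout and the toy values are hypothetical.

```python
from collections import defaultdict


def cse(values):
    """Centralize within sub-experiments.

    `values` maps (gene, sub_experiment, replicate) to a normalised
    expression value; the mean of each (gene, sub_experiment) group is
    subtracted from every replicate in that group.
    """
    groups = defaultdict(list)
    for (gene, sub, _rep), x in values.items():
        groups[(gene, sub)].append(x)
    means = {key: sum(xs) / len(xs) for key, xs in groups.items()}
    return {(g, s, r): x - means[(g, s)] for (g, s, r), x in values.items()}


# Toy data: one gene measured in two sub-experiments with two replicates each.
expression = {
    ("geneA", "control", 1): 2.0,
    ("geneA", "control", 2): 4.0,
    ("geneA", "heat", 1): 10.0,
    ("geneA", "heat", 2): 12.0,
}
centered = cse(expression)
```

In the toy data the large difference between the two sub-experiments (a treatment effect) disappears after centring, while the variation between replicates within each sub-experiment is retained.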

In Figure 4, the correlation between five simulated genes in two separate networks is illustrated in column one. Gene A affects gene B, and gene C affects genes D and E. The Pearson and partial correlation measures are calculated between all genes, with and without the CSE pre-processing step (columns 2-5). To be able to compare Pearson's correlation against partial correlation, the correlations were scaled relative to each other. The most correlated edge in each set-up was given the value one.

In row one (A), there are no extra treatments affecting the genes, and there are no differences between the correlations calculated with or without CSE. In row two (B), a treatment effect is applied to gene C. The correlations based on Pearson's method are affected without CSE, but Pearson's correlations with CSE give the same results as in row one. In the third row (C), a treatment effect is applied to both gene A and gene C, and now both correlation methods are affected by the extra effects from the treatment. With the CSE pre-processing step, however, these treatment effects are accounted for and the right edges are given the highest correlation values once more.


Figure 4 Schematic representations of the conclusions that can be drawn from different correlation analysis approaches of gene expression data. (Law, Kellgren, Bjork, Ryden, & Keech, 2020)

The CSE method was evaluated by constructing gene co-expression networks, with and without CSE, on gene expression data from the mitochondria of the plant Arabidopsis. The mitochondrial genes of Arabidopsis and their functions are well studied, and our hypothesis was that there would be more connections between genes involved in the same process or having similar functions. By controlling the sparsity of the different co-expression networks, it was possible to test whether there were more connections than expected by chance between genes sharing the same function or involved in the same process.

Finally, a core network was constructed with CSE and Pearson's correlation, and a clustering algorithm, the walktrap algorithm (Pons & Latapy, 2005), was applied. The genes in the different clusters were then checked for similar functions. It is possible to use CSE in addition to ordinary gene co-expression networks constructed without CSE. Together, these networks can contribute to a larger biological insight.


Chapter 5 Discussion and future research

This thesis is based on four papers in which the high dimensionality of the data is of great concern. The number of variables usually exceeds the number of samples, which makes statistical analysis more demanding. Therefore, it is important to make correct choices both before the study is conducted and during data analysis. In healthcare, there are many aspects to consider, but usually the primary focus is finding a treatment or relieving pain for patients. For hereditary diseases, it is common that no treatment is available today. The ability to sequence the genome and search for a genetic variant that causes a disease is important for knowledge in general, but also for affected families to get answers. However, when dealing with rare diseases it is hard to know how many studies have had a negative result, i.e. a mutation could not be found. One way of minimizing the number of studies that never get answers is to design the study carefully.

Based on the results of Paper I and other similar studies with the aim of identifying the cause of a rare disease, an experimental design study was performed in Paper II. The aim of Paper II was to identify which individuals from a family affected by a rare genetic disease should be selected for exome sequencing. Our study shows that the selection of individuals has a great impact on the number of potential SNVs in the final output. Additionally, the number of individuals selected for exome sequencing also has an impact on the final number of potential SNVs. However, the gain from adding an extra individual to the study decreases as more individuals are included. Hence, by making smart choices when selecting individuals, time, resources, and money can be saved.

A future research project could be to create a program in which a user can define a pedigree describing the family of interest. Based on the pedigree, and on whether the SNV is assumed to be inherited autosomally or via the allosomes and whether it is dominant or recessive, the program could present the best designs, i.e. which individuals to include from the pedigree. In our pipeline we considered autosomal dominant inheritance, but the pipeline could be developed further so that more inheritance patterns can be considered. There could also be room for handling errors such as misdiagnosed patients and sequencing read errors.

Another concern in healthcare is the increasing number of antibiotic resistant bacteria. In Paper III, it was shown that there was good agreement between the antimicrobial susceptibility test and the antibiotic resistance genes found in the sequences. Hence, studying the genomes of bacteria gives information about which antibiotic resistance genes and virulence genes the bacteria carry, which can lead to more correct prescription of antibiotics. Further, by studying the relationship between the bacterial genomes there might be evidence of the origin of the bacteria. This can be used to determine whether a clone has evolved locally in the hospital or has been transferred from an external environment. In general, information about the origin of the bacteria might lead to guidelines on how to stop the spread. The Staphylococcus epidermidis ST215 clone analyzed in Paper III shows many similarities with the globally known antibiotic resistant ST2 clone. A fear is that the ST215 clone will also spread over the world. Further research on how to stop the spread is therefore needed, and in the meantime making the right choices when prescribing antibiotics is of great importance.

The complexity of organisms cannot be described by the order of the nucleotides in the DNA sequence alone. The whole body is a system that reacts to its surroundings and genetic inheritance. Studying how genes interact with each other is of interest when exploring different processes in an organism. By studying how genes interact, it is possible to analyze how different treatments, e.g. stresses, affect a network of genes. Gene expression data make it possible to study how genes interact by creating gene co-expression networks. The amount of expression data produced to date is large, and combining different datasets could be a real strength in research. However, the strength comes with a weakness, namely the extra technical and biological noise that can lead to more false positives and false negatives. In Paper IV we present a pre-processing step that makes it possible to gain the strength of combining huge datasets while at the same time minimizing the noise when creating a core network. Omics data are high-dimensional and complex, and new methods are needed for better biological insight. This thesis contributes a few steps in the direction of revealing the hidden patterns that define life!


References

Allen, J. D., Xie, Y., Chen, M., Girard, L., & Xiao, G. (2012). Comparing statistical methods for constructing large scale gene networks. PLoS One, 7(1), e29348. doi:10.1371/journal.pone.0029348

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215(3), 403-410. doi:10.1016/S0022-2836(05)80360-2

Bolstad, B. M., Irizarry, R. A., Astrand, M., & Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185-193. doi:10.1093/bioinformatics/19.2.185

Carter, S. L., Brechbuhler, C. M., Griffin, M., & Bond, A. T. (2004). Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics, 20(14), 2242-2250. doi:10.1093/bioinformatics/bth234

Chudley, A. E. (1998). Genetic landmarks through philately--Gregor Johann Mendel (1822-1884). Clin Genet, 54(2), 121-123. doi:10.1111/j.1399-0004.1998.tb03713.x

Haendel, M., Vasilevsky, N., Unni, D., Bologa, C., Harris, N., Rehm, H., . . . Oprea, T. I. (2020). How many rare diseases are there? Nat Rev Drug Discov, 19(2), 77-78. doi:10.1038/d41573-019-00180-y

International Human Genome Sequencing Consortium. (2004). Finishing the euchromatic sequence of the human genome. Nature, 431(7011), 931-945. doi:10.1038/nature03001

Jukes, T. H., & Cantor, C. R. (1969). Evolution of protein molecules. In H. N. Munro (Ed.), Mammalian Protein Metabolism (pp. 21-132). New York: Academic Press.

Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2), 111-120. doi:10.1007/BF01731581

Langfelder, P., & Horvath, S. (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics, 9, 559. doi:10.1186/1471-2105-9-559

Law, S. R., Kellgren, T. G., Bjork, R., Ryden, P., & Keech, O. (2020). Centralization Within Sub-Experiments Enhances the Biological Relevance of Gene Co-expression Networks: A Plant Mitochondrial Case Study. Front Plant Sci, 11, 524. doi:10.3389/fpls.2020.00524

Lazar, C., Meganck, S., Taminau, J., Steenhoff, D., Coletta, A., Molter, C., . . . Nowe, A. (2013). Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform, 14(4), 469-490. doi:10.1093/bib/bbs037


Ma, S., Bohnert, H. J., & Dinesh-Kumar, S. P. (2015). AtGGM2014, an Arabidopsis gene co-expression network for functional studies. Sci China Life Sci, 58(3), 276-286. doi:10.1007/s11427-015-4803-x

Ma, S., Gong, Q., & Bohnert, H. J. (2007). An Arabidopsis gene network based on the graphical Gaussian model. Genome Res, 17(11), 1614-1625. doi:10.1101/gr.6911207

Mardis, E. R. (2011). A decade's perspective on DNA sequencing technology. Nature, 470(7333), 198-203. doi:10.1038/nature09796

Mardis, E. R. (2017). DNA sequencing technologies: 2006-2016. Nat Protoc, 12(2), 213-218. doi:10.1038/nprot.2016.182

Ng, S. B., Buckingham, K. J., Lee, C., Bigham, A. W., Tabor, H. K., Dent, K. M., . . . Bamshad, M. J. (2010). Exome sequencing identifies the cause of a mendelian disorder. Nat Genet, 42(1), 30-35. doi:10.1038/ng.499

Pons, P., & Latapy, M. (2005). Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005.

Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406-425. doi:10.1093/oxfordjournals.molbev.a040454

Song, L., Langfelder, P., & Horvath, S. (2012). Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC Bioinformatics, 13, 328. doi:10.1186/1471-2105-13-328

Usadel, B., Obayashi, T., Mutwil, M., Giorgi, F. M., Bassel, G. W., Tanimoto, M., . . . Provart, N. J. (2009). Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant Cell Environ, 32(12), 1633-1651. doi:10.1111/j.1365-3040.2009.02040.x

Waggener, W. N. (1995). Pulse code modulation techniques: with applications in communications and data recording. New York: Van Nostrand Reinhold.

Wang, J., Do, K. A., Wen, S., Tsavachidis, S., McDonnell, T. J., Logothetis, C. J., & Coombes, K. R. (2007). Merging microarray data, robust feature selection, and predicting prognosis in prostate cancer. Cancer Inform, 2, 87-97. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/19458761

Watson, J. D., & Crick, F. H. (1953). Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356), 737-738. doi:10.1038/171737a0

Venter, J. C. (2001). The sequence of the human genome (1.0 ed., pp. 1 CD-ROM).

Wille, A., Zimmermann, P., Vranova, E., Furholz, A., Laule, O., Bleuler, S., . . . Buhlmann, P. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol, 5(11), R92. doi:10.1186/gb-2004-5-11-r92

Voelkerding, K. V., Dames, S. A., & Durtschi, J. D. (2009). Next-Generation Sequencing: From Basic Research to Diagnostics. Clinical Chemistry, 55(4), 641-658. doi:10.1373/clinchem.2008.112789

Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol, 4, Article17. doi:10.2202/1544-6115.1128


Department of Mathematics and Mathematical Statistics

Umeå University, SE-901 87 Umeå, Sweden www.umu.se

ISBN 978-91-7855-240-5 (print) ISBN: 978-91-7855-241-2 (pdf)
