
A first glimpse of the extent of structural variation in an avian genome

SV markers are rarely used in animal genetics but may attract more interest once their abundance has been estimated and their putative relation to phenotypic differences is understood. Do SVs influence phenotypes, and what is their impact on animal genome evolution? Methods previously used for CNV screening, such as fluorescence in situ hybridization (FISH) of bacterial artificial chromosomes (BACs) and array-based comparative genomic hybridization (array-CGH), have limited resolution. Because of this limit, the majority of the identified SVs have not yet been resolved to the nucleotide level. For most reported CNVs in animals, the true population frequency is unknown because they have not been genotyped. To study the potential role of SVs in phenotypes and in genome evolution, we need a more complete catalog of SVs in animal genomes and large SV genotype datasets from different populations. Precise de novo CNV mutation rates throughout the genome are required to better understand the contribution of CNVs versus SNPs to genome evolution, particularly with respect to gene duplication/triplication and exon shuffling [22].

Low requirements for obtaining a promising first glimpse

In Chapter 4 we provide a first glimpse of the extent of structural variation in an avian genome at a resolution of ~50 bp. We show that even the analysis of mate-pair information from a paired-end sequenced reduced representation library is sufficient to predict several hundred candidate structural variants (SVs) in the chicken genome. More than 180 of these SVs are very likely to represent true structural variation between four chicken breeds and red jungle fowl. The sequencing and bioinformatics approach we used imposed stringent constraints on SV detection and therefore probably missed true variants. Future validation studies could establish which constraints can be relaxed, at tolerable false positive rates, to increase the sensitivity of the detection method. The majority of SVs identified by our method were small deletions, which is consistent with an earlier study that established an inverse relationship between the number of SVs in the human genome and their size [23]. Our detection strategy allowed insertions to be detected only in a very limited size range (a few tens of base pairs). I expect that the actual number and size distribution of small insertions in the chicken genome are similar to those of the observed small deletions. Based on our findings in Chapter 4, we expect thousands of rearrangements smaller than one kb and hundreds of larger rearrangements in the chicken genome. Furthermore, our study identified SVs in coding regions of the genome, suggesting that some of the small SVs may be related to phenotypes. Based on this first glimpse, I think there is evidence that SVs contribute considerably to phenotypes and genome evolution, and that it is worthwhile to obtain a more complete picture of the extent of this type of genetic variation in animal genomes.

More demanding approaches completing the catalog of SVs

Our SV detection method can be classified as a paired-end resequencing and mapping approach using standard insert libraries. Paired-end mapping approaches combined with high-throughput sequencing ([23-28]; Chapter 4 of this thesis) make it possible to reliably detect SVs that are one to three orders of magnitude smaller than those assayed previously using FISH mapping of BACs, array-CGH or lower-density oligonucleotide arrays. The most recent advances in paired-end sequencing, which were not available at the time of our study, are large insert library kits and increased read lengths. The latter will improve mapping accuracy, whereas the former will allow deep paired-end sequencing of insert libraries in a larger size range.
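The principle behind paired-end mapping can be illustrated with a minimal sketch. It is not the pipeline used in Chapter 4; the function names, the three-standard-deviation cutoff, the clustering distance and the minimum support of two pairs are all assumptions chosen for illustration. A read pair whose mates map further apart on the reference than the library insert size allows suggests a deletion in the sample, and a pair whose mates map closer together suggests an insertion.

```python
from collections import defaultdict


def classify_pair(span, lib_mean, lib_sd, n_sd=3.0):
    """Label one read pair by comparing its mapped span to the library insert size."""
    if span > lib_mean + n_sd * lib_sd:
        return "deletion"   # mates map too far apart on the reference
    if span < lib_mean - n_sd * lib_sd:
        return "insertion"  # mates map too close together on the reference
    return "concordant"


def call_candidates(pairs, lib_mean, lib_sd, min_support=2, max_gap=200):
    """Group discordant pairs with nearby start positions into candidate SVs.

    pairs: iterable of (chrom, leftmost_start, rightmost_end), one tuple per
    uniquely mapped read pair (hypothetical input format).
    """
    discordant = defaultdict(list)
    for chrom, start, end in pairs:
        kind = classify_pair(end - start, lib_mean, lib_sd)
        if kind != "concordant":
            discordant[(chrom, kind)].append((start, end))

    calls = []
    for (chrom, kind), spans in sorted(discordant.items()):
        spans.sort()
        cluster = [spans[0]]
        for span in spans[1:] + [None]:  # the trailing None flushes the last cluster
            if span is not None and span[0] - cluster[-1][0] <= max_gap:
                cluster.append(span)
                continue
            if len(cluster) >= min_support:
                calls.append({
                    "chrom": chrom,
                    "type": kind,
                    "start": min(s for s, _ in cluster),
                    "end": max(e for _, e in cluster),
                    "support": len(cluster),
                })
            cluster = [span]
    return calls


# Toy data: a 300 bp library (sd 30 bp) with two pairs spanning ~900 bp.
pairs = [
    ("chr1", 1000, 1300),  # concordant
    ("chr1", 5000, 5900),  # too far apart
    ("chr1", 5050, 5960),  # too far apart, same region
    ("chr1", 9000, 9100),  # too close together, but unsupported
]
calls = call_candidates(pairs, lib_mean=300, lib_sd=30)
```

In this toy example only the two overlapping long-span pairs yield a candidate (a deletion supported by two pairs); the single short-span pair is discarded for lack of support. The sketch also shows why the insert size bounds the detectable insertion size: the mapped span cannot shrink below zero, so an insertion longer than the library insert cannot be spanned by a single pair.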

Paired-end reads from large insert libraries can span repetitive elements, which likely hold the majority of genomic structural variation [29-31]. When libraries are constructed from a randomly sheared genome, each SV will be predicted by paired-end reads from a variety of genome fragments sampled from that genomic region. This will facilitate breakpoint resolution and reduce the additional PCR and sequencing effort required [32]. In spite of these improvements, SV detection by paired-end sequencing and mapping (PEM) still has a fairly high false negative rate for large structural variants and segmental duplications compared to more laborious techniques such as fosmid paired-end sequencing (FPES) or oligonucleotide arrays comprising millions of optimized probes [33]. Segmental duplications are more difficult to ascertain using PEM because many of the reads in these regions do not map to unique locations in the genome [24]. However, a quantitative NGS approach for detecting segmental duplications can complement the paired-end mapping technique. In this approach, the depth of coverage in the sequence data is analyzed to find genomic regions that differ in copy number between individuals [34].
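The depth-of-coverage idea can be sketched as follows. This is an illustrative simplification, not the method of [34]: the fixed windows, the pseudocount and the symmetric log2 threshold are assumptions, and real analyses additionally correct for GC content and mappability. Read counts per window are normalized by the total read count of each individual, and windows whose log2 ratio between two individuals deviates strongly from zero are flagged as candidate copy number differences.

```python
import math


def log2_ratios(test_counts, ref_counts, pseudocount=1.0):
    """Per-window log2 ratio of library-size-normalized read counts."""
    test_total = sum(test_counts)
    ref_total = sum(ref_counts)
    ratios = []
    for t, r in zip(test_counts, ref_counts):
        t_norm = (t + pseudocount) / test_total  # pseudocount avoids log(0)
        r_norm = (r + pseudocount) / ref_total
        ratios.append(math.log2(t_norm / r_norm))
    return ratios


def flag_windows(ratios, threshold=0.58):
    """Label windows as gain, loss or neutral.

    The default threshold is an assumption: log2(3/2) is about 0.58, the ratio
    expected for a single-copy gain in a diploid genome.
    """
    labels = []
    for ratio in ratios:
        if ratio >= threshold:
            labels.append("gain")
        elif ratio <= -threshold:
            labels.append("loss")
        else:
            labels.append("neutral")
    return labels


# Toy read counts in five equally sized windows for two individuals.
test_counts = [100, 100, 200, 100, 50]
ref_counts = [100, 100, 100, 100, 100]
labels = flag_windows(log2_ratios(test_counts, ref_counts))
```

With these toy counts the third window is flagged as a gain and the fifth as a loss in the test individual relative to the other. Because the method counts reads rather than relying on the placement of individual pairs, it can complement paired-end mapping for copy number differences.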

A shortcoming of reference-based SV detection techniques like PEM and array-CGH is the bias towards the reference genome. In a sequencing context, reads obtained from large genomic regions that are missing in the reference cannot be mapped, whereas in a microarray context these genomic regions are not represented by probes. This lack of genomic information can hide structural variation between the sampled individuals. Therefore, the most versatile strategy for SV detection is sequencing and unbiased de novo assembly of individual genomes [8]. This approach will undoubtedly result in a more accurate and complete catalog of structural variation in a genome. However, it is unclear what sampling depth is needed to reliably capture the majority of SVs. In human, there is evidence that many disease-related SVs are present in the general population at frequencies lower than the classical definition of a polymorphism (>1%) [35].
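A rough lower bound on the required number of sampled individuals can be derived under simplifying assumptions that are mine and not part of the cited studies: chromosomes are sampled independently from the population, and a variant is always detected when it is present in a sequenced individual. The probability that a variant with allele frequency p occurs at least once among n diploid individuals is then 1 - (1 - p)^(2n). The function names below are hypothetical.

```python
import math


def detection_probability(freq, n_individuals, ploidy=2):
    """Probability that an allele of frequency `freq` is carried by at least
    one of the sampled chromosomes (perfect detection assumed)."""
    return 1.0 - (1.0 - freq) ** (ploidy * n_individuals)


def min_individuals(freq, target=0.95, ploidy=2):
    """Smallest number of individuals giving at least `target` probability."""
    chromosomes = math.ceil(math.log(1.0 - target) / math.log(1.0 - freq))
    return math.ceil(chromosomes / ploidy)


# Variant at the classical polymorphism threshold of 1%.
n_needed = min_individuals(0.01, target=0.95)
```

Under these assumptions, about 150 diploid individuals are needed to sample a variant at the 1% threshold at least once with 95% probability. Rarer variants and imperfect detection sensitivity increase this number further.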

Furthermore, the linear representation of a genome that is currently used is not suited to capture and represent all structural variation and therefore needs to be replaced by a higher-level form of data storage and visualization. Because of the costs of whole genome sequencing and the impossibility of reconstructing complex genomes by de novo assembly of NGS data, this approach to SV detection will remain unfeasible in animal sciences for the near future. Even if we had (almost) completed the catalog of structural variation, it would not be possible to genotype CNVs genome-wide because a robust genotyping method is lacking. Currently, the degree of uncertainty in genotype inference reduces the power of association studies and potentially increases the risk of false-positive associations.

SVs and (unraveling) their relation to phenotype

The biological effects of, and the evolutionary processes behind, medium-sized (10-50 kb) and small (<10 kb) SVs, which are thought to represent the majority of SVs in the human genome, remain largely unknown. These SVs have generally been below the reliable detection limit and are thus underrepresented in current databases [35,36].

For large SVs, studies in human have provided evidence for their involvement in gene regulation by various molecular mechanisms, including gene dosage, gene disruption, gene fusion and position effects. A well-known example of the influence of SVs on phenotype is the deletion of the alpha-globin gene, resulting in alpha-thalassaemia in homozygous carriers [37] and protection against malaria in heterozygous carriers [38]. Altered regulation caused by SVs has also been associated with Mendelian [39,40] as well as sporadic traits, and with complex diseases such as Parkinson disease, Alzheimer disease, mental retardation, autism and schizophrenia in human. Furthermore, recent studies have reported that altered expression levels due to CNVs affect susceptibility to HIV, Crohn disease, psoriasis, pancreatitis, systemic lupus erythematosus and glomerulonephritis [41]. Moreover, CNVs can also represent benign polymorphic variants, and gene duplication and exon shuffling in particular are thought to be a predominant mechanism driving gene and genome evolution. As stated in the Introduction of this thesis, only a limited number of animal traits have been linked to CNVs. Currently, a large-scale CNV detection study is being performed at an eight kb resolution using array-based comparative genomic hybridization (R. Crooijmans, personal communication). The study of CNVs, in particular those that result in gene amplification favored by positive selection, may reveal genomic regions that were evolutionarily favored because of their adaptive benefits. Genomic alterations due to major environmental impact (e.g. domestication) can be identified, and modified genomic regions might be linked to traits or hide thus far undiscovered functional genes.

Common SVs in human seem to show patterns of allele frequency, linkage disequilibrium and population differentiation that mirror the properties of SNPs [42]. Cataloging the genomic locations, haplotypes and sequence properties of these alternative structural alleles will therefore also be an important direction for completing databases of common patterns of genetic variation in animals. A complete catalog encompassing SNPs and SVs can be used when attempting to unravel the molecular genetic basis of a given phenotype. In other words, SNP-based linkage and association studies should be supplemented by SV-based linkage and association studies. Traits previously intractable by conventional genetic (SNP) analysis may become tractable by including SVs in the analysis, as was shown for autism spectrum disorders in human [43]. Furthermore, the simultaneous study of SNPs and SVs, both common and rare, will be needed to understand the relative contribution of each form of variation to traits in animal populations.

Discovery of genetic variation in animals, what can be