1.5 Identifying susceptibility genes in complex disease
alleles that were present on the particular chromosomal background on which it arose, and this association is measured by the amount of LD, i.e. alleles within a haplotype show LD and reside between recombination hotspots.
The concept of LD centres on the non-random association of alleles at different loci. Natural selection, or chance, caused the spread of common SNP mutations that arose thousands of generations ago. A second mutation occurring later but close to an earlier one results in both alleles being transmitted together to offspring in subsequent generations. It is this model that is exploited in a GWAS (Xiong and Guo 1997). An increased risk of disease caused by one SNP denotes direct association between that SNP and disease in the population, and indirect association between disease and several nearby SNPs due to LD. It is therefore possible to identify association in a chromosomal region without genotyping every SNP in a GWAS, i.e. by using tagging SNPs. LD decays through recombination (since the probability of recombination increases with distance, the strength of LD between loci declines with distance), recurrence of the same mutation, and gene conversion.
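As an illustration of how the non-random association described above is usually quantified, the following minimal Python sketch computes three standard pairwise LD measures (D, |D'| and r^2) for two biallelic loci from phased haplotype counts. The function name `ld_from_counts` and the allele labels A/a and B/b are hypothetical choices made for this example and do not come from any cited study.

```python
def ld_from_counts(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) for two biallelic loci.

    Inputs are counts of the four phased haplotypes, where A/a are the
    alleles at the first locus and B/b the alleles at the second.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("at least one haplotype is required")

    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A
    p_B = (n_AB + n_aB) / total    # frequency of allele B

    # D: observed haplotype frequency minus the frequency expected
    # if the two loci were independent.
    d = p_AB - p_A * p_B

    # D' rescales D by the largest value it could take given the
    # allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the alleles at the two loci.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0

    return d, d_prime, r2


if __name__ == "__main__":
    # Two SNPs that are partially correlated on 100 haplotypes.
    print(ld_from_counts(40, 10, 10, 40))
```

A pair of SNPs with r^2 close to 1 carry nearly the same information, which is why one of them can act as a tagging SNP for the other; a value near 0 indicates that the loci are effectively independent in the sample.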
1.5.2 International HapMap Project, 1000 Genomes Project and Encode
The International HapMap Project commenced in 2002 with the aim of mapping all common genetic variation (MAF greater than 5%) across 11 populations (1,400 individuals), equating to 3.5 million SNPs. There have been 26 data releases so far, capturing approximately 90% of genetic variation in the Caucasian population using high-throughput genotyping chips (Consortium 2003; Thorisson, Smith et al. 2005). This dataset was the first to describe the different types of variants, where they occur in our DNA and their distribution within and among populations. By comparing DNA from 1,400 individuals, haplotypes could be deciphered by mapping chromosomal regions of shared genetic variants. This preceded the initiation and rise of many GWA studies, as the HapMap provided a detailed measurement of genetic variation and LD patterns across major populations, as well as the identification of tag SNPs that
act as haplotype markers (Smith, Wang et al. 2006). Over the last decade, the proportion of human variation that has been mapped has increased from the 20% discovered by the HGP to 90%, with the help of HapMap and other similar projects.
The 1000 Genomes Project (1000G) was set up in 2007 with the goal of identifying 95% of SNPs present at a frequency of at least 1% in a range of populations (www.1000genomes.org). In the pilot phase, which commenced in 2008, three different strategies were used: high coverage sequencing of family trios to obtain true phasing of the variants detected; low coverage sequencing of many individuals (179) to allow broader detection of variants, which requires statistical phasing; and sequencing of specific exon targets in a larger number of individuals (700) to allow detection of rare variants, which would remain unphased (Durbin, Abecasis et al. 2010). A main goal here was to reconstruct haplotypes using all variants typed from all datasets. The more recently published phase one dataset includes the genomes of 1,092 individuals from 14 populations (Abecasis, Auton et al. 2012). In this paper, functional variation was mapped by a combination of low coverage whole genome sequence data (2-6x read depth), targeted deep exome sequence data (50-100x), and dense SNP genotype data. The phase two dataset, compiled in 2011, includes 1,715 individuals from 19 populations. The final phase three includes an additional 2,500 African and South Asian samples. This public reference catalogue of human genetic variation is already being used for imputation; it will aid in identifying previously missed associations and provide an exclusion filter in studies of Mendelian disease.
Another project, the Encyclopedia of DNA Elements (Encode), published numerous papers in 2012 on the identification of transcribed regions, transcription factor association, chromatin structure and histone modifications in the human genome. This project differs completely from the genotype-based HapMap and 1000G projects: it focuses on functional elements of the genome, giving previously unknown insights into gene regulation and into how statistical associations with disease correspond to these functional elements (Dunham, Kundaje et al. 2012).
1.5.3 Family based studies
Family based designs for the investigation of inherited disease have been used since Mendel’s laws of inheritance came to dominate the fundamental concepts of genetics. Studies of extended pedigrees have several favourable features for novel gene discovery: causative gene pathways are more homogeneous and there is a certain level of phenotypic control against genetic background and environmental exposures (Borecki and Province 2008). Gene mapping strategies utilize linkage and association studies, both of which use family data, but association studies can also be performed with unrelated individuals. A commonly used family based association test is the transmission disequilibrium test (TDT), first introduced in 1993 (Spielman, McGinnis et al. 1993). A TDT uses parents as controls for the cases, who are the affected offspring, so any confounding effects of population stratification are removed. The purpose of the test is to determine whether a marker allele is transmitted from heterozygous parents to affected offspring more often than expected by chance, using genetic markers in nuclear families (trios) to map disequilibrium between the marker allele and the disease locus. If the allele is transmitted to unrelated cases more often than expected by chance, this implicates a linked allele that is associated with the disease mutation. If the allele is only seen in related cases, then it becomes a test of linkage, not association. In essence, the TDT combines linkage and association approaches in cases where either performed separately has failed to provide a positive result. This test has since been extended to include all family members and genotypic information (Abecasis, Cookson et al. 2000).
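The transmission counting behind the TDT can be sketched in a few lines of Python. The example below is a minimal illustration, not the implementation used in any cited study: it assumes a single biallelic marker, complete parent-offspring trios with an affected child, and genotypes coded as the number of copies (0, 1 or 2) of the allele under test. The function names `count_transmissions` and `tdt_statistic` are hypothetical.

```python
import math


def count_transmissions(trios):
    """Count transmissions from heterozygous parents.

    Each trio is (father, mother, child), with genotypes coded as the
    number of copies of the test allele. Returns (b, c): how often
    heterozygous parents transmitted (b) or did not transmit (c) the
    test allele to the affected child.
    """
    b = c = 0
    for father, mother, child in trios:
        het_parents = (father == 1) + (mother == 1)
        if het_parents == 0:
            continue  # homozygous parents are uninformative
        # Copies the child must have received from homozygous carriers.
        from_homozygous = (father == 2) + (mother == 2)
        transmitted = child - from_homozygous
        if not 0 <= transmitted <= het_parents:
            raise ValueError("Mendelian inconsistency in trio")
        b += transmitted
        c += het_parents - transmitted
    return b, c


def tdt_statistic(b, c):
    """McNemar-type TDT statistic (b - c)^2 / (b + c) and its p-value.

    Under the null hypothesis the statistic is approximately
    chi-square distributed with one degree of freedom.
    """
    if b + c == 0:
        raise ValueError("no informative transmissions")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a 1-df chi-square, written with the
    # complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    example_trios = [(1, 0, 1), (1, 0, 0), (1, 1, 2), (1, 1, 1), (2, 1, 2)]
    b, c = count_transmissions(example_trios)
    print(b, c, tdt_statistic(b, c))
```

Because only heterozygous parents contribute, and each transmitted allele is compared with the allele the same parent did not transmit, the comparison is made within families; this is the reason the test is robust to population stratification.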
Whereas association analysis is powerful for the detection of common alleles that confer modest disease risk, linkage analysis is more powerful for identifying high-risk disease alleles. The independent inheritance of traits, as stated by Mendel’s law of independent assortment, does not always hold: there are groups of traits that are linked, and the genes controlling them tend to be inherited together by the offspring as a group, not independently. This is the underlying principle of a linkage study: if two individuals are phenotypically similar, i.e. carry disease, then a genetic marker located near a disease susceptibility gene must also be