Comparing precision and accuracy of inbreeding estimates derived from

Abstract

For studies of inbreeding it is important to know the proportion of an individual’s genome that is identical-by-descent (IBD). This can be calculated from pedigrees (inbreeding coefficient F) or estimated from molecular markers, whereby both estimators contain an error component. Pedigree F is inaccurate at the individual level due to chance events during chromosome segregation (Mendelian noise). Molecular estimates suffer from sampling error of markers plus the error that occurs when a marker is homozygous (identical-by-state, IBS) without reflecting common ancestry (IBD-IBS discrepancy). Here, we quantify these three sources of error via simulations for a captive population of zebra finches, and compare them with estimates from studies on humans. In the zebra finch, where the genome contains large blocks that are rarely broken up by recombination, the Mendelian noise was remarkably large (nearly twofold larger standard deviations compared to humans). On the other hand, the IBD-IBS correspondence for microsatellite markers was relatively high, explaining why a limited number of informative markers can yield useful estimates of genome-wide IBD. Our simulations show how many markers are needed to reach the same precision as pedigree-based predictions, and they illustrate how the various noise components introduce systematic and random errors into inbreeding-fitness relationships.

Submitted to Molecular Ecology Resources as: Knief U, B Kempenaers, W Forstmeier: Comparing precision and accuracy of inbreeding estimates derived from pedigrees versus molecular markers.

62 | C h a p t e r 3

Introduction

Offspring of genetically related individuals are inbred, meaning that they harbor genomic segments homozygous due to common ancestry (identical-by-decent, IBD; Wright 1922). The proportion of an individual’s genome that is IBD (genome-wide IBD, GWIBD) is arguably the best predictor of the homozygous mutation load of an individual because all recessive deleterious mutations that become IBD contribute to inbreeding depression (Keller et al.

2011).

Studies on inbreeding often aim at quantifying the amount of variance in fitness explained by inbreeding or the inbreeding load within a population. These two aims are not the same and we should attempt to reach both of them (Szulkin et al. 2010). For the first, we need to find a way to quantify all relevant inbreeding in order to minimize bias and reduce the random sampling error in our GWIBD estimator. For the second, an accurate and precise estimate of the regression slope of a fitness trait over a measure of inbreeding within a population is required. Both objectives can be addressed by using information from pedigrees or from molecular markers, whereby each has its own limitations in terms of precision and accuracy. Throughout this manuscript we use the term ‘precision’ to refer to random errors around a true value, and the term ‘accuracy’ to refer to systematic errors (bias) away from the true value (Sokal & Rohlf 1995).

In pedigrees, the expected GWIBD (Pedigree F) of a diploid individual whose parents are kth generation linear descendants of a common ancestor is traditionally quantified using Wright’s path method as the inbreeding coefficient F = 2-(2*k+2) and can be extended to also incorporate complex inbreeding loops (Malécot 1948; Wright 1922). As a consequence of limited recombination in meiosis, however, genomes do not get transmitted as independent basepairs but rather in segments of DNA, leading to linkage between adjacent segments and variation around the expected GWIBD (Fisher 1949). In the following, we will call this random error the Mendelian sampling noise (Figure 1). Interestingly, the magnitude of this error does not only change with the degree of inbreeding, but also with the genomic architecture of the species under consideration (Franklin 1977; Hill & Weir 2011; Rasmuson 1993; Stam 1980), as we will explain below.

The variation in GWIBD for a given inbreeding constellation will be smaller, the more segments in a genome segregate independently (law of large numbers; Visscher 2009). Consequently, since chromosomes get inherited as independent units in meiosis the Mendelian noise for a given inbreeding constellation will be smaller in species with more chromosomes (Hill & Weir 2011). Likewise, genomic segments on a given chromosome will be broken up by cross-overs during meiosis, which reduces linkage and consequently increases the number of independently segregating units in a genome. Thus, the longer the genetic map of a genome, which by definition is the expected number of cross-overs in each meiosis, the smaller the Mendelian sampling noise (Hill & Weir 2011). To our knowledge, all

C o m p a r i n g i n b r e e d i n g e s t i m a t o r s | 63 analytical analyses so far have assumed a uniform distribution of cross-overs along chromosomes (e.g. Franklin 1977; Hill & Weir 2011; Stam 1980; but see Suarez et al. (1979) and Libiger & Schork (2007) for Monte Carlo simulations on relatedness). Although this assumption holds more or less for the human genome (Matise et al. 2007), linkage maps from other species have shown that recombination along chromosomes can be highly skewed (Backström et al. 2010; Gore et al. 2009). This will result in a more block-like inheritance pattern of genomic segments on a given chromosome, which in turn will increase the Mendelian noise (Forstmeier et al. 2012; Guo 1995; Risch & Lange 1979).

Another aspect of the use of pedigrees is that pedigree F reflects IBD only within the pedigree (descent from the founders). In other words, only when all founders are unrelated and outbred, pedigree F accurately reflects all inbreeding. Throughout this manuscript we will focus on such an idealized scenario, where all inbreeding is fully defined and captured by the pedigree information. Thus, we further define IBD with reference to the founder population. To emphasize this point we indicate the number of generations of the pedigree as a subscript (e.g. F7, IBD7, GWIBD7 refer to values from a pedigree that is seven

generations long). In real populations, pedigree founders will also be related to some extent, and inbreeding loops that are longer than the pedigree will also contribute to inbreeding depression. Such long inbreeding loops can be captured by molecular information (but not by pedigree information), which generally means that pedigrees underestimate the full amount of inbreeding. Further discussion of this problem lies outside the scope of this study and is treated elsewhere (e.g. Knief et al. 2015; Powell et al. 2010; Speed & Balding 2015; Thompson 2013).

As an alternative to pedigrees one can use molecular markers such as microsatellites or single nucleotide polymorphisms (SNPs) to estimate GWIBD (Broman & Weber 1999). If we could sequence the whole genome of all individuals we are interested in, we could estimate GWIBD more or less exactly (Prado-Martinez et al. 2013). Usually, however, only parts of a genome are covered by molecular markers and IBD within these parts (Marker IBD) is used as a proxy for GWIBD (Powell et al. 2010). This will introduce variation into our estimates of GWIBD, which we call the marker sampling noise (Figure 1). Unlike the Mendelian sampling noise which is an inherent result of the meiotic process, the marker sampling noise is only a consequence of the limited number of markers used; hence, the precision of GWIBD estimates increases when more molecular markers are sampled.

Another problem arises when using molecular markers. Marker homozygosity only equals Marker IBD when each monophyletic haplotype in a population is identifiable by a unique allele of the molecular marker in use (referred to as “ideal marker”). Studies on heterozygosity-fitness correlations combine homozygosity at multiple unlinked microsatellites into a measure of GWIBD, e.g. as the percentage of markers being homozygous. Yet microsatellite homozygosity does not translate directly into IBD for two

64 | C h a p t e r 3

reasons. (1) Two homologous DNA segments may have been inherited from a shared ancestor (IBD), but the microsatellite marker located in this segment may have changed due to mutation since the common ancestor. Then the marker will not be identical-by-state (IBS) even though the segment is IBD. However, if we define IBD with regard to common ancestors that lived rather recently, such cases will be exceedingly rare, because microsatellite mutation rates are too low (Goldstein & Schlötterer 1999). Hence, for simplicity, in the following we ignore this first possibility assuming no mutations. (2) Because a microsatellite marker can adopt only a finite number of states (alleles), two phylogenetically unrelated haplotypes (not IBD) may carry the same allele (IBS) by chance alone. We will call the combination of Marker IBD with this chance homozygosity the Marker identity-by-state (Marker IBS).

Depending on how well Marker IBS reflects Marker IBD this may introduce severe error into the GWIBD estimate, which we call the IBD-IBS discrepancy (Thompson 1976; Figure 1). Generally, this discrepancy becomes smaller by increasing the allelic richness of a molecular marker, because more haplotypes can be distinguished uniquely. Clearly, multiallelic microsatellite markers contain more information on GWIBD than biallelic SNPs. Because genomes get transmitted in segments rather than by individual basepairs, we can consider several closely linked markers jointly as a single marker in order to further decrease the IBD- IBS discrepancy (Kong et al. 2008). By doing this, we generate more diverse and consequently more informative markers for GWIBD estimation (practically ideal markers), because the combined probability of all markers being homozygous just by chance becomes very small (see Knief et al. (2015) for an example of constructing such ideal markers from dense SNP panels).

Although neither pedigrees nor molecular markers estimate GWIBD precisely, a multitude of studies have interpreted a weak correlation between Pedigree F and Marker IBS as a sign of weakness of molecular markers in predicting GWIBD, culminating in the demand for more and better pedigrees (see Forstmeier et al. 2012 for a critical discussion on the subject). Here, we argue that the optimal method depends on the goal of the study. If we are interested in estimating the inbreeding load, Pedigree F may be indeed – as we will show – superior to molecular markers. However, if the aim is to quantify individual GWIBD, molecular markers may be the better choice because both the marker sampling noise and the IBD-IBS discrepancy can be reduced by using a larger number of highly informative markers and no assumptions about the relatedness of the founders have to be made.

Here we illustrate the properties of Pedigree F and molecular markers and address the following questions. (1) How does the amount of random noise introduced into estimates of GWIBD by Mendelian segregation compare to the amount of marker sampling noise and IBD-IBS discrepancy stemming from a limited number of molecular markers? (2) How many molecular markers are needed to become a superior method to pedigree information in

C o m p a r i n g i n b r e e d i n g e s t i m a t o r s | 65 approximating GWIBD (Forstmeier et al. 2012)? (3) How do molecular markers compare to pedigree-based estimates of inbreeding in heterozygosity-fitness correlations/regressions, when we are interested in precise and accurate estimates of the inbreeding load (Szulkin et al. 2010)?

Because GWIBD can be assessed precisely only by sequencing the whole genome of an individual and because for our purpose we also need to look at a large population of individuals, we use a Monte Carlo gene-dropping simulation to answer these questions for two genomes that we expect to show high (zebra finch, Taeniopygia guttata) and low (human) amounts of Mendelian noise. Our expectation is based on the fact that although the zebra finch genome consist of 39 autosomes compared to only 22 in humans, about half of its autosomal genome is made up of only four chromosomes (Tgu1, Tgu1A, Tgu2 and

Tgu3) and shows extremely low recombination rates in the centers of these chromosomes (Backström et al. 2010) whereas in humans recombination is distributed quite uniformly on the megabase scale along chromosomes (Matise et al. 2007). Here, we follow the zebra finch genomes of 159 diploid pedigree founders, who we assume to be unrelated, through an empirical seven-generations pedigree of 3,404 individuals. We estimated the Mendelian sampling noise in Pedigree F7, ignoring inbreeding stemming from relatedness between

pedigree founders. To quantify the amount of marker sampling noise we estimated GWIBD7

from subsets of the simulated genomes (Marker IBD7). We used empirical data from 11

microsatellites genotyped in the seventh generation of our pedigree to estimate the IBD-IBS discrepancy and introduced this noise component into our simulations (Marker IBS7). This

allowed us to compare the precision of Pedigree F7, Marker IBD7 and Marker IBS7 in

predicting GWIBD7. Finally, we demonstrate the effects of the three noise components

(Figure 1) on estimates of heterozygosity-fitness correlations/regressions.

Methods

Empirical linkage and physical maps

For the zebra finch, we used the sex-averaged linkage map described in Backström et al.

(2010) covering 33 chromosomes and based on 1,395 SNPs (data accessible from Schielzeth

et al. 2011, 2012). We excluded the sex chromosome and removed 10 microchromosomes

covered by less than 10 markers to include only those 22 autosomes containing enough information to fit a smoothed line (see below). However, because the variance in relatedness and inbreeding depends on the number of chromosomes and because the zebra finch genome consists of 39 autosomes (Pigozzi & Solari 1998), we added 17 “artificial” chromosomes to our linkage map.

These 17 chromosomes were constructed using both the physical and the genetic length of known chromosomes to be as close to reality as possible. The physical lengths of known chromosomes were taken from the WUSTL v3.2.4 assembly (Warren et al. 2010). First, we

66 | C h a p t e r 3

used the complete published physical genome length (1,222,864,721 bp), subtracted the length of the sex chromosome (72,861,351 bp) and rounded the result up to the nearest 100 kb (11,501 x 100 kb). Then we subtracted the length of the 22 well-characterized autosomes from the linkage map (9,146 x 100 kb), leaving 2,355 x 100 kb.

Calderón & Pigozzi (2006) estimated the total genetic length of the zebra finch genome (containing all 39 autosomes) to be 2,272.5 cM (from MLH1 foci mapping). We subtracted the estimated total genetic length of the 22 well-characterized autosomes (1,391.711 cM, based on our smoothed line fitting function; see below) from the total genetic length of the genome, leaving 880.789 cM. In the zebra finch genome, recombination along the microchromosomes follows a more or less uniform distribution (Backström et al. 2010; Calderón & Pigozzi 2006). Thus, we constructed the 17 additional chromosomes as being 2,355 x 100 kb / 17 = 138.5294 x 100 kb and 880.789 cM / 17 = 51.81114 cM long with a uniform distribution of recombination along the chromosomes (Figure S1). Assuming a uniform distribution of recombination essentially reduces the amount of Mendelian noise stemming from these chromosomes to a minimum (Risch & Lange 1979).

For humans, we used the sex-averaged Rutgers Map v.2 (Matise et al. 2007) covering 24,168 markers and all 22 autosomes and the X chromosome (Figure S2). We removed the sex chromosome from all further analyses. The physical lengths of chromosomes were taken from the hg19/GRCh37 assembly (Collins et al. 2004).

Linkage map smoothing

In order to get a monotonously increasing linkage map for each chromosome we followed Roesti et al. (2013) and fitted a local polynomial regression (loess function in R v3.0.2 (R Core Team 2013) with span parameter = 10 / number of markers on the chromosome, degree = 1, control = loess.control(surface = "direct")) with the genetic positions of the markers as the independent variable and the physical positions of the markers as the predictor. Since we based our simulations on physical 100 kb intervals, we predicted the genetic positions every 100 kb from the loess-smoothing and sorted them in ascending order (Figure S1 and S2).

Finding the optimal interference distance

In meiosis, accurate chromosomal segregation requires at least one cross-over per chromosome (i.e. the minimal length of a chromosome is 50 cM; Petronczki et al. 2003). Because we wanted our simulations to be as realistic as possible, we also required one cross-over per chromosome in each meiosis in our gene-dropping simulations (see below). Yet adding up random and forced cross-overs would lead to an overestimation of the genetic length of the chromosomes. We solved this problem by introducing an interference distance for each chromosome, which suppresses additional cross-overs in a given physical interval surrounding an initial cross-over. We define the optimal interference distance (in

C o m p a r i n g i n b r e e d i n g e s t i m a t o r s | 67 kb) as the distance which returns the distribution of recombination events as close to the actual linkage map as possible. In order to find the optimal interference distance we ran our gene-dropping simulations for each chromosome 10,000 times for interference distances ranging from 0 to maximally 50 Mb (in 100 kb steps) and selected for each chromosome the one with the minimal sum of squared residuals, a residual being the difference between the simulation result and the actual smoothed linkage map (Figure S1 and S2). The optimal interference distances for each chromosome were subsequently used in our gene-dropping simulations. It should be noted that interference actually happens in meiosis (Muller 1916; Sturtevant 1913) and is thus not an artefact introduced by our simulation procedure. We also developed a simulation which does not require interference, which did not change the results (data not shown, but the script is available upon request).

Gene-dropping simulations

Our gene-dropping simulations extend on those described in Libiger & Schork (2007). We first specified a genome by the number and size (both genetically and physically) of chromosomes. Then we split each chromosome in predefined physical segments (we used 100 kb segments) and calculated the recombination probability between segments using a smoothed linkage map (see above) and the Kosambi map function (Kosambi 1943). Finally, we followed each segment through a specified genealogy by using these recombination probabilities and assuming that founders of the pedigree were unrelated.

(1) For each founder we simulated a diploid chromosome set (two unique haplotypes without inbreeding for each chromosome).

(2) The simulation proceeded by creating offspring of the pedigree founders. We tried to simulate meiosis as realistic as possible and implemented the following steps. (a) Prior to meiosis I the homologous chromosomes (2N2C = 2 homologous chromosomes, 2 chromatids) in both the mother and the father duplicate to form two sister chromatids (2N4C) that are identical. (b) In meiosis I (which leads to 1N2C), cross-overs between chromatids of the homologous chromosomes occur with probabilities as defined by the linkage map. Cross-overs may occur between both chromatids of the homologous chromosomes but not between the two sister chromatids of a single chromosome (remember also that the two sister chromatids are identical). Thus, also unrecombined chromatids may get inherited. (c) One of the four chromatids (i.e. 1N1C) in both the mother and the father was chosen randomly to create the offspring which is then 2N2C again.

(3) Within each offspring the total length of all autozygous stretches was determined as homozygosity for a founder haplotype (GWIBD7). The end of an autozygous stretch was

placed at a randomly chosen base pair between the flanking autozygous and non- autozygous segment.

68 | C h a p t e r 3

(4) To show how our simulations of inbreeding can be extended to also quantify relatedness among individuals we did the following. When all individuals in the pedigree had been simulated, the genetic similarity between two individuals was calculated as the proportion of sharing of founder haplotypes between individuals (the pedigree-based equivalent is the additive genetic relatedness, which is twice the coefficient of coancestry). We focus on the genetic similarity matrix because it is commonly used when estimating variance components in animal models (Speed & Balding 2015; Results are presented in the Supplement, Figure S3).

Simulated pedigrees

We ran our gene-dropping simulations 10,000 times on two pedigrees. (1) A designed pedigree comprised of full-sibs and their offspring (full-sib mating; FSM), first-cousins and their offspring (first-cousin mating; FCM) and second-cousins and their offspring (second- cousin mating; SCM; Figure 2). (2) An empirical pedigree from our captive population of zebra finches held at the Max Planck Institute for Ornithology in Seewiesen, Germany, comprised of n = 159 founders and n = 3,404 individuals in total (Figure S4). The pedigree

In document Knief, Johann Ulrich (2016): Quantitative and molecular genetics of phenotypic variation in the zebra finch. Dissertation, LMU München: Fakultät für Biologie (Page 65-103)