Genotypic data with unknown gametic phase

6.3 Tab dialogs

6.3.8 Calculation Settings

6.3.8.4.2 Genotypic data with unknown gametic phase

When gametic phase is unknown, two methods can be used to infer haplotypes: The (maximum-likelihood) EM algorithm or or the (Bayesian) ELB algorithm.

Manual Arlequin ver 3.5.2 Arlequin interface ₇₅

6.3.8.4.2.1 Settings for the ELB algorithm

The ELB algorithm has been described recently in Excoffier et.al (2003).

• Use ELB algorithm to estimate gametic phase [b]: Check this box if you want to estimate the gametic phase of multi-locu genotypes with the ELB algorithm. See methodological section on ELB algorithm (8.1.4.2.3) for a description of the algorithm.

• Dirichlet prior alpha value [f]: Value of the alpha parameter of the prior dirichlet distribution of haplotype frequencies. Recommended value: a small value like 0.01 for all data types has been found to work well (Excoffier et al. 2003). (see section 8.1.4.2.3 details)

• Epsilon value [f]: Value of the parameter controlling how much haplotypes differing by a single mutation from potentially present haplotypes are weighted. Recommended values: 0.1 for microsatellite data, and 0.01 for other data types. (see section 8.1.4.2.3 details)

• Heterozygote site influence zone [i]: Defines the number of sites adjacent to heterozygote sites that need to be taken into account when computing haplotype frequencies in the Gibbs chain. A value of zero implies that gametic phase will be

Manual Arlequin ver 3.5.2 Arlequin interface ₇₆

estimated only on the basis of heterozygote sites. A negative value will indicate that all sites (homozygotes and heterozygotes will be used). This parameter is mostly useful for inferring gametic phase of DNA sequences where there is only a few heterozygote sites among long stretches of homozygous sites. (see section 8.1.4.2.3 details)

• Gamma value [f]: This parameter prevents adaptive windows where gametic phase is estimated to grow too much. It can be set to zero for microsatellite data, and to a small value for other data sets, like 0.01. (see section 8.1.4.2.3 details)

• Sampling interval [i]: It is the number of steps in the Gibbs chain between two consecutive samples of gametic phases.

• Number of samples [i]: It represents the number of samples of gametic phases one wants to draw in the Gibbs chain to get the posterior distribution of gametic phases (and haplotype frequencies) for each individual. (see section 8.1.4.2.3 details)

• Burnin steps [i]: It is the number of steps to perform in the Gibbs chain before sampling gametic phases. The total number of steps in the chain will thus be: Burnin steps + (Number of samples Η Sampling interval). (see section 8.1.4.2.3 details)

• Recombination steps [i]: It is the proportion of steps in the Gibbs chain consisting in implementing a pseudo-recombination phase update instead of a simple phase switch (corresponding to a double recombination around a heterozygous site) (see section 8.1.4.2.3 details).

• Output phase distribution files [b]: Controls if one wants to output Arlequin files with the gametic phase of each sample in the Gibbs chain. The arlequin files (as many as the variable Number of samples defined above) are written in a subdirectory of the result directory called PhaseDistribution. They have the name ELB_EstimatedPhase#<Sample number>.arp. Arlequin also outputs a file called

ELB_Best_Phases.arp containing for each individual the gametic phases

estimated with the ELB algorithm, as well as batch file ELB_PhaseDistribution.arb listing all aforementioned project files.

Manual Arlequin ver 3.5.2 Arlequin interface ₇₇

6.3.8.4.2.2 Settings for the EM algorithm

 Use EM algorithm to estimate ML haplotype frequencies [b]: We estimate the maximum-likelihood (ML) haplotype frequencies from the observed data using an Expectation-Maximization (EM) algorithm for multi-locus genotypic data when the gametic phase is not known, or when recessive alleles are present (see section 8.1.4.2).

Perform EM algorithm at the:

 Haplotype level [m]: Estimate haplotype frequencies for haplotypes defined by alleles at all loci.

 Locus level [m]: Estimate allele frequencies for each locus.

 Haplotype and locus levels [m]: The two previous options are performed one after the other.

 Epsilon value [l]: Threshold for stopping the EM algorithm. After each iteration, Arlequin checks if the current haplotype frequencies are different from those at the previous iteration. If the sum of difference is smaller than epsilon, the algorithm stops.

Manual Arlequin ver 3.5.2 Arlequin interface ₇₈

 Significant digits for output [l]: Precision required for output of haplotype

frequencies. Haplotypes having a zero frequency given the required precisin are not output in the result file.

 Number of starting points for EM algorithm:[i]: Set the number of random initial conditions from which the EM algorithm is started to repeatedly estimate haplotype frequencies. The haplotype frequencies globally maximizing the likelihood of the sample will be kept eventually. Figures of 50 or more are usually in order.  Maximum no. of iterations [i]: Set the maximum number of iterations allowed in

the EM algorithm. The iterative process will have at most this number of iterations, but may stop before if convergence has been reached. Here, convergence is

reached when the sum of the differences between haplotypes frequencies between two successive iterations is smaller than the epsilon value defined above.

 Use Zipper version of EM [b]: Use the zipper version of the EM algorithm consisting in building haplotypes progressively by adding one locus at a time (see section 8.1.4.2.2).

 No. of loci orders [l]: Defines how many random loci orders should be used in the zipper version of the EM algorithm. Results about haplotype frequencies obtained for the locus order leading to the best likelihood is shown in the result file.

 Recessive data [b]: Specify whether a recessive allele is present. This option applies to all loci. The code for the recessive allele can be specified in the project file (see section 4.2.1).

 Estimate standard deviation through bootstrap [b]: Uses a bootstrap approach to estimate the standard deviation of haplotype frequencies.

 No. of bootstrap to perform [i]: Set the number of parametric bootstrap replicates of the EM estimation process on random samples generated from a fictive population having haplotype frequencies equal to previously estimated ML frequencies. This procedure is used to generate the standard deviation of

haplotype frequencies. When set to zero, the standard deviations are not estimated.

 No. of starting points for s.d. estimation [i]: Set the number of initial conditions for the bootstrap procedure. It may be smaller than the number of initial conditions set when estimating the haplotype frequencies, because the bootstrap replicates are quite time-consuming. Setting this number to small values is conservative, in the sense that it usually inflates the standard deviations.

Manual Arlequin ver 3.5.2 Arlequin interface ₇₉ 6.3.8.5 Linkage disequilibrium