Sequencing Complex Diseases With HapMap

(1)

Copyright2004 by the Genetics Society of America DOI: 10.1534/genetics.104.029603

Sequencing Complex Diseases With HapMap

**Tian Liu,* Julie A. Johnson,**

†

**_{George Casella* and Rongling Wu*}**

,1

*Department of Statistics and†_{Department of Pharmacy Practice, University of Florida, Gainesville, Florida 32611}

Manuscript received April 20, 2004 Accepted for publication May 26, 2004

ABSTRACT

Determining the patterns of DNA sequence variation in the human genome is a useful first step toward identifying the genetic basis of a common disease. A haplotype map (HapMap), aimed at describing these variation patterns across the entire genome, has been recently developed by the International HapMap Consortium. In this article, we present a novel statistical model for directly characterizing specific sequence variants that are responsible for disease risk based on the haplotype structure provided by HapMap. Our model is developed in the maximum-likelihood context, implemented with the EM algorithm. We perform simulation studies to investigate the statistical properties of this disease-sequencing model. A worked exam-ple from a human obesity study with 155 patients was used to validate this model. In this examexam-ple, we found that patients carrying a haplotype constituted by allele Gly16 at codon 16 and allele Gln27 at codon 27 genotyped within the␤2AR candidate gene display significantly lower body mass index than patients carrying the other haplotypes. The implications and extensions of our model are discussed.

C

URRENT studies of genetic mapping have been detected with given markers cannot give any information

about the sequence structure and organization of QTL. veloped to a point at which the quantitative

varia-tion of almost any complex trait can be dissected into its Second, the inference of the QTL positions using the

nearby markers is sensitive to marker informativeness, underlying Mendelian factors on the basis of molecular

linkage maps (LanderandBotstein1989) or association marker density, and mapping population type. As a

re-sult of these, only a few QTL detected from markers have

studies (Wuet al.2002;Louet al.2003). Such Mendelian

factors,i.e., the so-called quantitative trait loci (QTL), been successfully cloned (Fraryet al.2000), despite a

considerable number of QTL reported in the literature.

arehypothesized regions with alleles triggering an effect

on quantitative variation. A number of statistical meth- A more accurate and useful approach for the

charac-terization of genetic variants contributing to quantitative ods have been developed to map QTL for various

map-variation is to directly analyze DNA sequences associated ping population structures and different marker and

with a particular disease. If a string of DNA sequence character types, each aimed at precisely identifying QTL

is known to increase disease risk, this risk can be reduced that are close enough to markers for their positional

by the alteration of this string of DNA sequence using a cloning using biotechnology approaches, such as

chro-specialized drug. The control of this disease can be made

mosome walks (WuandCasella2005). These methods

more efficient if all possible DNA sequences determin-have led to the detection of QTL contributing to

com-ing its variation are identified in the entire genome. plex traits in a range of species from microorganisms

With the recent development of the human genome

to plant and animal species to humans (Mackay2001).

project, massive amounts of DNA sequence data have The basic principle for QTL mapping is the

cosegre-been available across the human genome (

Interna-gation of the alleles at a QTL with those at one or a set of

tional HapMap Consortium2003). In particular, sin-known polymorphic markers genotyped on a genome. If

gle-nucleotide polymorphisms (SNPs), being the most a QTL is cosegregating with molecular markers, the

ge-common type of variant in the DNA sequence, provide netic effects of QTL and their genomic positions can

a powerful means for genotyping the whole genome. be estimated from the markers. This approach is robust

This facilitates the complete identification of specific and powerful for the detection of major QTL and

pre-sequence variants responsible for complex diseases. A sents the most efficient way to utilize marker

informa-linear arrangement of alleles (i.e., nucleotides) at

differ-tion when marker maps are sparse. However, this

ap-ent SNPs on a single chromosome, or part of a chromo-proach is limited in two aspects. First, because the markers

some, is called a haplotype. The cosegregation of SNP and QTL bracketed by them are located at different

alleles on haplotypes leads to nonrandom association, genomic positions, the significant linkage of a QTL

de-i.e., linkage disequilibrium (LD), between these alleles in

the population. Empirical analyses of LD for SNPs have shown that nearby SNPs in the human genome tend to

1_{Corresponding author:}_{Department of Statistics, 533 McCarty Hall C,}

University of Florida, Gainesville, FL 32611. E-mail: [email protected] display highly significant levels of LD and are often

(2)

tributed in block-like patterns, rather than displaying Our interest is to search for the haplotype diversity that can explain phenotypic variation in a complex dis-random- or even-spaced distribution as originally

pre-dicted (Patilet al. 2001;Dawsonet al.2002;Gabriel ease. The association between haplotype diversity and

phenotypic variation has been detected in several

stud-et al. 2002). SNPs within haplotype blocks are much

more strongly associated with each other than are those ies of drug responses (Judsonet al.2000;Bader2001).

This allows us to assume that a particular haplotype is between different blocks. Haplotype diversity within each

different from other haplotypes for a given disease. Be-block can be well explained by only a finite number

cause haplotypes (comprising diplotypes) cannot be di-of SNPs, called tag SNPs or representative SNPs. The

rectly observed, the effects of different haplotypes on existence of these tag SNPs means that it is not

neces-the phenotype need be postulated from observed zy-sary to associate a disease with all SNPs in the DNA

gotic genotypes. The inference of diplotypes for a partic-sequence to understand the complete genetic control

ular genotype is statistically a missing data problem that of the disease.

can be formulated by a finite mixture model (Wuand

In this article, we present a novel statistical model for

Casella2005). determining specific DNA sequences that are associated

Likelihood function:In this study, the complete data with the phenotypic variation of disease risk. This model

are diplotype configurations at a given set of SNPs for is derived on the basis of multilocus haplotype analysis

each genotype and for disease outcomes of subjects, using a finite number of tag SNPs. We derived a

closed-whereas the observed data are the genotypes of these form solution for estimating the effects of haplotypes,

SNPs and the disease outcomes. The missing data are at haplotype frequencies, allele frequencies, and the

de-the connection from de-the genotypes to diplotypes. For grees of LD of various orders among tag SNPs

underly-any given genotype, all possible diplotypes can be writ-ing the disease. We performed simulation studies to test

ten out. For example, genotypeA1

1A11/A21A21has one diplo-the statistical behavior of this haplotype-based

sequence-type [A1

1A21][A11A21], as does genotype A11A11/A21A22, the mapping model. A worked example is used to validate

diplotype [A1

1A21][A11A22]. This is because in these

situa-our model, in which a DNA sequence variant is detected

tions where at most one SNP is heterozygous, the geno-to significantly reduce human obesity.

type and diplotype are identical. However, for a

double-heterozygous genotypeA1

1A12/A21A22, we have two diplotypes, [A1

1A21][A12A22] and [A11A22][A12A12].

THEORY

Table 1 lists all possible genotypes and diplotypes

Suppose there is a random sample of size n drawn _{at two SNPs genotyped from a sample of size} _n_{. Each}

from a natural human population at Hardy-Weinberg _{genotype (and therefore each diplotype) is composed}

equilibrium. In this sample, a number of SNPs are geno- _{of two haplotypes, one from the mother and the other}

typed genome-wide, aimed at the identification of DNA _{from the father. Two haplotypes composing a diplotype}

sequences responsible for a complex disease. Consider _{come from four possible haplotypes,} _A1

1A21,A11A22,A12A21,

R (R ⬎ 1) tag SNPs for a haplotype block. Each of _and _A1

2A22, whose frequencies are expressed as p11, p12,

theseRSNPs has two alleles denoted byAr

kr(kr⫽ 1, 2; p₂₁, andp₂₂, respectively. The diplotype frequencies can

r⫽1, . . .R), with allele frequencies denoted byp(r)

k_r for be expressed in terms of the haplotype frequencies

(Ta-therth SNP. We use superscripts and subscripts to distin- _{ble 1). Two diplotypes [}_A1

1A21][A12A22] and [A11A22][A12A21]

guish between different SNPs and different alleles _{of a double heterozygote} _A1

1A12/A21A22 have frequencies

within SNPs, respectively. These SNPs form 2R

possible _p₁₁_p₂₂_and_p₁₂_p₁₂_{, respectively. Thus, the relative}

frequen-haplotypes expressed asA1

k₁A2k₂. . .ARk_R, whose frequencies _{cies of these two diplotypes for this double heterozygote}

are denoted by pk₁k₂...k_R. The haplotype frequencies are _{are a function of haplotype frequencies, which can be}

composed of allele frequencies at each SNP and linkage _{expressed as}_p₁₁_p₂₂_/(_p₁₁_p₂₂_⫹_p₁₂_p₂₁_{) and}_p₁₂_p₂₁_/(_p₁₁_p₂₂_⫹

disequilibria of different orders among SNPs (Louet al. _p₁₂_p₂₁_{), respectively.}

2003). The random combination of maternal and pater- _{Table 1 also gives the relative expected frequencies}

nal haplotypes generates 2R⫺1₍₂R ⫹ _{1) diplotypes}

ex-of haplotypes contained in a given genotype. All

geno-pressed as [A1

k1A

2

k2. . .A R kR][A

1

l₁A2l₂. . .ARlR] (1 ⱕ k1 ⱕ l1 ⱕ types, except for the double heterozygote, contain one

or two known haplotypes. For example, genotypeA1

1A11/

2, . . . , 1ⱕ kRⱕ lRⱕ 2). TheseRSNPs form 3R

observ-able multilocus zygotic genotypes, generally expressed as A2

1A21has one haplotypeA11A21, whereasA11A11/A21A22has one A1

k₁A1l₁/A2k₂A2l₂/. . ./ARk_RARl_R. When at most one SNP is hetero- half haplotype A1₁A2₁ and one half haplotype A1₁A2₂. The

double heterozygote contains four possible haplotypes, zygous, the diplotype is consistent to its zygotic genotype.

However, when two or more SNPs are heterozygous, the with the relative frequenciesp11p22/(p11p22⫹p12p21) for

genotype will have different diplotypes and, therefore, the haplotypesA1

1A21andA12A22andp12p21/(p11p22⫹p12p21) for

number of multilocus genotypes will be less than the num- haplotypesA1

1A22andA21A21.

ber of diplotypes. LetP[k1k2. . . kR][l1l2. . . lR]andPk1l1/k2l2/ . . . /kRlR de- The haplotype frequencies, arrayed by⍀p⫽(p11,p12,

p21,p22), that belong to population genetic parameters

note the diplotype and genotype frequencies, respectively,

can be estimated using the nine observed genotypes andnk₁l₁/k₂l₂/ . . . /k_Rl_Rdenote the observations of various

ge-(G) for two SNPs (Table 1). The log-likelihood function

(3)

505 Sequencing Complex Diseases

TABLE 1

Possible diplotype configurations of nine genotypes at two SNPs and their haplotype composition frequencies

Relative diplotype Haplotype composition

Diplotype frequency within Genotypic

Genotype Diplotype frequency genotypes A1

1A 2 1 A 1 1A 2 2 A 1 2A 2 1 A 1 2A 2

2 Observation mean

A1

1A11/A21A12 [A11A21][A11A21] P[11][11]⫽p211 1 1 0 0 0 n11/11 ␮1⫽ ␮ ⫹a

A1 1A

1 1/A

2 1A

2

2 [A

1 1A

2 1][A

1 1A

2

2] P[11][12]⫽2p11p12 1 1 0 0 n11/12 ␮2⫽ ␮ ⫹d

2

1 2 A1

1A11/A22A22 [A11A22][A11A22] P[12][12]⫽p212 1 0 1 0 0 n11/22 ␮3⫽ ␮ ⫺a

A1 1A

1 2/A

2 1A

2

1 [A

1 1A

2 1][A

1 2A

2

1] P[11][21]⫽2p11p21 1 1 0 0 n12/11 ␮2⫽ ␮ ⫹d

2

1 2 A1

1A12/A21A22 [A11A21][A21A22] P[11][22]⫽2p11p22 ␲ 1 n12/12 ␮2⫽ ␮ ⫹d

2␲ 1 2(1⫺ ␲)

1 2(1⫺ ␲)

1 2␲

[A1 1A

2 2][A

1 1A

2

2] P[12][21]⫽2p12p21 1⫺ ␲ ␮3⫽ ␮ ⫺a

A1

1A12/A22A22 [A11A22][A21A22] P[12][22]⫽2p12p22 0 1 0 n12/22 ␮3⫽ ␮ ⫺a

2 1 2 A1 2A 1 2/A

2 1A

2

1 [A

1 2A

2 1][A

1 2A

2

1] P[21][21]⫽p221 1 0 0 1 0 n22/11 ␮3⫽ ␮ ⫺a

A1

2A12/A21A22 [A12A21][A12A22] P[21][22]⫽2p21p22 1 0 0 1 n22/12 ␮3⫽ ␮ ⫺a

2 1 2 A1 2A 1 2/A

2 2A

2

2 [A

1 2A

2 2][A

1 2A

2

2] P[22][22]⫽p222 1 0 0 0 1 n22/22 ␮3⫽ ␮ ⫺a

␲ ⫽p11p22/(p11p22⫹p12p21), wherep11,p12,p21, andp22are the haplotype frequencies ofA11A21,A11A22,A12A21, andA12A22, respectively.

of unknown haplotype frequencies given observed

geno-types can be written as a multinomial form,i.e.,

f[k₁l₂][l₁k₂](yi)⫽

1

√

2␲␴exp

冤

⫺

(yi⫺ ␮[k₁l₂][l₁k₂]i)2

2␴2

冥

logL(⍀p|G)⬀2n11/11logp11⫹n11/12log(2p11p12)⫹2n11/12logp12

are the probability density functions for subjecti who

⫹n12/11log(2p11p21)⫹n12/12log[2(p11p22⫹p12p21)]

has two possible diplotypes, respectively, with the geno-⫹n12/22log(2p12p22)⫹2n22/11logp21⫹n22/12log(2p21p22)⫹2n22/22logp22.

typic means of␮[k1k2][l1l2]for diplotype [A

1

k1A

2

k2][A

1

l1A

2

l2] and

(1) _␮[

k1l2][l1k2] for diplotype [A

1

k1A

2

l2][A

1

l1A

2

k2] and the common

residual variance of␴2_{. Note that subscript}_i_{is used to}

Assuming that diplotypes are associated with

pheno-describe the genotypic means in the two density func-typic variation in a disease, we formulate a likelihood

tions above because these means are diplotype- (and

for unknown population (⍀p) and quantitative genetic

therefore subject-) dependent although there are only

parameters (⍀q) given observed phenotypes (y) and SNP

a total of 10 genotypic means for two SNPs. These means

genotypes (G). Generally speaking, a given 2-SNP

geno-and variance are contained in vector⍀q ⫽ (␮[k₁k₂][l₁l₂]i,

type,A1

k₁A1l₁/A2k₂A2l₂, can be partitioned into two possible dip- _␮[

k₁l₂][l₁k₂]i,␴2).

lotypes, [A1

k1A

2

k2][A

1

l1A

2

l2] and [A

1

k1A

2

l2] [A

1

l1A

2

k2]. Thus, such a

Suppose there is a particular haplotype,A1

k₁A2k₂, which is

log-likelihood function can be formulated on the basis of

different from the other three haplotypes, A1

k₁A2k¯₂,A1k¯₁A2k₂,

a two-component mixture model,i.e.,

andA1k¯₁A2k¯₂(where a bar means the alternative of that

al-lele), in its effect on a complex disease. The phenotypic

logL(⍀p,⍀q|y,G)⫽

兺

n

i⫽1

log[␲if[k1k2][l1l2](yi) _{means of the three genotypes that contain these two}

dis-tinct groups of haplotypes are denoted as␮1 forA1k1A

1

k1/

⫹ (1⫺ ␲i)f[k₁l₂][l₁k₂](yi)], _A2

k2A

2

k2, ␮2 for A

1

k1A

1

k¯1/A

2

k2A

2

k¯2, and␮3 for A

1

k¯1A

1

k¯1/A

2

k¯2A

2

k¯2. On

(2) _{the basis of quantitative genetic theory, these three} _␮_’s

can be partitioned into the overall mean (␮), the

addi-where the mixture proportion, _{tive effect (}_a_{) due to the substitution of different}

haplo-types, and the dominant effect (d) due to the interaction

between different haplotypes,i.e.,␮1⫽ ␮ ⫹a,␮2⫽ ␮ ⫹

␲i⫽

P[k₁k₂][l₁l₂]i

P[k₁k₂][l₁l₂]i⫹P[k₁l₂][l₁k₂]i

,

d, and ␮3 ⫽ ␮ ⫺ a. In Table 1, the genotypic means

are also given for different genotypes by assuming that

represents the relative frequency of subject i whose _haplotype_A1

1A21 is different from the rest of the

haplo-diplotype is [A1

k₁A2k₂][A1l₁A2l₂], and _types.

The log-likelihood function described by Equation 2 can be expanded to include all possible SNP genotypes,

f[k₁k₂][l₁l₂](yi)⫽

1

√

2␲␴exp

冤

⫺

(yi⫺ ␮[k₁k₂][l₁l₂]i)2

2␴2

冥

(4)

Note that for all the other genotypes, such probabilities logL(⍀p,⍀q|y,G)⫽

兺

n11/11

i⫽1

logf[11][11](yi)

do not exist.

In the M step, the probabilities calculated in the

previ-⫹

_兺

n11/12

_兺

n22/12

i⫽1

logf[21][22](yi) _{from the other three haplotypes, these probabilities are}

used to estimate the haplotype additive and dominant effects and the residual variance by

⫹

_兺

n22/22

i⫽1

logf[22][22](yi), (3)

␮ˆ1⫽

兺

n11/11

i⫽1 yi n11/11

, ₍₁₁₎

nent (diplotype), the double heterozygote is the

mix-ture of two possible diplotypes weighted by␲and 1⫺

␮ˆ3⫽

兺

n¨ i⫽1yi⫹

兺

n12/12

i⫽1 (1⫺

兿

[11][22]i)yi n¨⫹

兺

n12/12

i⫽1 (1⫺

兿

[11][22]i)

, ₍₁₃₎

␲. Thus, an advanced statistical method should be

im-plemented to obtain the MLEs of the underlying

n¨

i⫽1 (yi⫺ ␮ˆ3)2 An integrative EM algorithm: We derived a

closed-form solution for estimating the unknown parameters

⫹

_兺

n12/12

i⫽1

关⌸[12/12]i(yi⫺ ␮ˆ2)2⫹(1⫺ ⌸[12/12]i)(yi⫺ ␮ˆ3)2兴

冧

,(14)

with the EM algorithm (Dempsteret al.1977).

Haplo-type frequencies can be expressed as a function of allelic

wheren˙⫽n11/12⫹n12/11andn¨⫽n11/22⫹n12/22⫹n21/21⫹

frequencies and LD. For a two-SNP haplotype, we have

n21/22 ⫹ n22/22. Iterations including the E and M steps

pk₁k₂ ⫽pk₁pk₂⫹(⫺1)k1⫹k2D, (4) _{are repeated among Equations 5–14 until the estimates}

of the parameters converge to stable values. The

sam-whereDis the linkage disequilibrium between the two

pling errors of these parameters can be estimated by SNPs. Thus, once haplotype frequencies are estimated,

calculating Louis’s (1982) observed information

ma-we can estimate allelic frequencies and LD by solving

trix. Equation 3. The estimates of haplotype frequencies are

Hypothesis tests: We can test two major hypotheses based on the log-likelihood function of Equation 1,

in the following sequence: (1) the association between whereas the estimates of diplotype genotypic means

two SNPs by testing their LD and (2) the difference of (and therefore the overall mean, haplotype additive,

a given haplotype from the rest of the haplotypes in its and dominant effects) and residual variance are based

effect on the disease outcome by testing the significance on the log-likelihood function of Equation 2. These two

of haplotype additive and dominant effects. The LD different types of parameters can be estimated using an

between two given SNPs can be tested using two alterna-integrative EM algorithm.

tive hypotheses:

In the E step, the expected number (␲i) and

probabil-ities (⌸i) of subjectiof a double-heterozygous genotype H₀: D⫽0

who carries diplotype [A1

1A21][A12A22] are calculated by

H1: D⬆0. (15)

␲[11][22]i⫽

p11p22

p11p22⫹p12p21

(5) _{The log-likelihood-ratio test statistic (LR) for the}

sig-nificance of LD is calculated by comparing the

likeli-hood values under H1 (full model) and H0 (reduced

⌸[11][22]i⫽

␲[11][22]if[11][22](yi)

␲[11][22]if[11][22](yi)⫹(1⫺ ␲[11][22]i)f[12][21](yi)

. (6)

(5)

LR1⫽ ⫺2[logL(p˜(1)1 ,p˜(2)1 ,D⫽0,⍀˜q|G)⫺logL(⍀ˆp,⍀ˆq|G)], erable computational load and, also, may not be

neces-sary for the explanation of disease variation. These two (16)

factors should be considered in a further study of the where the tilde and hat denote the maximum-likelihood

determination of a maximal number of SNPs for

se-estimates (MLEs) of unknown parameters under H0and

quencing mapping.

H1, respectively. The LR1is considered to asymptotically

follow a␹2_{distribution with 1 d.f. The MLEs of allelic}

frequencies under H0can be estimated using the EM al- RESULTS

gorithm described above, but with the constraintp11p22⫽

Simulation: We performed Monte Carlo simulation

p12p21.

experiments to examine the statistical properties of the Diplotype or haplotype effects on a complex trait can

model proposed for association studies between varia-be tested using two alternative hypotheses expressed as

tion in DNA sequence and a complex trait. Our simula-tion studies were designed to consider the effects of

H0: a⫽ d⫽0

different heritability levels (H2⫽_0.1_vs._{0.4), different}

H1: at least one equality in H0 does not hold. (17) _{sample sizes (}_n_⫽₁₀₀_vs._{400), and different gene action}

modes (additivevs.dominantvs.overdominant) on the

The log-likelihood-ratio test statistic (LR2) under these

two hypotheses can be similarly calculated. The LR2 estimation precision of parameters from our model.

The ratio of dominant (d) over additive effect (a) is

may asymptotically follow a ␹2 _{distribution with 2 d.f.}

However, the approximation of a ␹2 _{distribution may} _{used to determine the degree of dominant. A haplotype}

is regarded as additive, dominant, or overdominant if be inappropriate when some regularity conditions, such

as normal and uncorrelated residuals, are violated. The this ratio is zero, one, and greater than one, respectively

(LynchandWalsh1998).

permutation test approach proposed by Churchill

andDoerge(1994), which does not rely upon the distri- Suppose two SNPs are segregating in a natural popula-tion at Hardy-Weinberg equilibrium. The allele

frequen-bution of the LR2, may be used to determine the critical

threshold for determining the existence of a QTL. cies and linkage disequilibrium between these two SNPs

are given in Table 2. We assume that one of the four R-SNP sequence model: The idea for sequencing a

complex trait is described for a two-SNP model. It is possible haplotypes for the two SNPs is different from

the rest of the haplotypes in their effects on the pheno-possible that the two-SNP model is too simple to

charac-terize genetic variants for quantitative variation. With type of a complex trait, which leads to three distinct

groups of diplotypes. By assigning the genetic values for the analytical line for the two-SNP sequencing model,

we can readily extend our model to include an arbitrary these two contrasting haplotypes under different gene

action modes (Table 2), we calculate the genotypic val-number of SNPs whose sequences are associated with

the phenotypic variation. ues for all possible diplotypes and further simulate their

phenotypic values on the basis of normal distribution.

ConsiderRSNPs that form 3R_{observable multilocus}

zygotic genotytpes, generally expressed asA1

k₁A1l₁/A2k₂A2l₂/ The residual variance is determined according to

differ-ent heritability levels (0.1vs.0.4) expressed as the

rela-. rela-. rela-. /AR

k_RARl_R. These genotypes are collapsed from a total

tive proportion of genetic variance to total observed

of 2R⫺1₍₂R ⫹ _{1) diplotypes expressed as [}_A1

k₁A2k₂. . .ARk_R]

variance. The genetic variance is determined on the [A1

l₁A2l₂. . .ARl_R] (1 ⱕ k1 ⱕl1 ⱕ 2, . . . , 1ⱕkRⱕ lR ⱕ1).

basis of the genotypic values of all diplotypes and their A key issue for the multi-SNP sequencing model is how

to distinguish among 2r⫺1 _{different diplotypes for the} _frequencies.

Table 2 describes the results of parameter estimation

same genotype heterozygous atrloci. The relative

fre-quencies of these diplotypes can be expressed in terms from our simulation using the proposed model. In

gen-eral, all parameters (including population and quantita-of haplotype frequencies. The integrative EM algorithm

can be employed to estimate the MLEs of haplotype tive genetic) can be reasonably estimated. As expected,

the precision of population parameter estimation is a

frequencies. Lou et al. (2003) provided a general

for-mula for expressing haplotype frequencies in terms of function of sample size and does not rely on the H2

value or the gene action mode. Large sample sizes always allele frequencies and linkage disequilibria of different

orders. The MLEs of the latter can be obtained by solv- lead to more precise estimation of allelic frequencies

and linkage disequilibrium (Table 2). ing a system of equations.

In the multi-SNP sequencing model, we face many The estimation precision of quantitative parameters

is dependent on heritability level, sample size, and gene haplotypes and haplotype pairs. An Akaike information

criterion- or Bayes information criterion-based model action mode. In almost all situations, the additive effects

can be estimated better than the dominant effect. The

selection strategy (BurnhamandAndersson1998) has

been framed to determine the haplotype that is most estimation of additive and dominant effects responds

differently to genetic action mode, depending on the distinct from the rest of the haplotypes in explaining

quantitative variation. However, in practice, a simulta- degree of dominance. For example, given the same

heri-tabilityH2 _⫽ _{0.1 and same sample size} _n _⫽ _{100, the}

(6)

consid-TABLE 2

The MLEs and their square root mean square errors (in parentheses) of population and quantitative genetic parameters of DNA sequences associated with the phenotypic variation estimated from 200 simulation replicates under different

(heritability-sample size-gene action mode) combinations for the 2-SNP sequencing model

0.1-100 0.1-100 dominance 0.4-100 dominance 0.1-400 dominance 0.1-100 additive overdominance

Parameters True MLE True MLE True MLE True MLE True MLE

p(1)

1 0.6 0.6028 0.6 0.600 0.6 0.6005 0.6 0.5988 0.6 0.5976

(0.0037) (0.0025) (0.0013) (0.0026) (0.0028)

p(2)

1 0.6 0.6003 0.6 0.5993 0.6 0.5980 0.6 0.601 0.6 0.5977

(0.0025) (0.0025) (0.0023) (0.0025) (0.0035)

D 0.04 0.0402 0.04 0.0404 0.04 0.0396 0.04 0.0414 0.04 0.0390

(0.0017) (0.0016) (0.0009) (0.0022) (0.0021)

␮ 10 9.9013 10 10.0881 10 10.03795 10 10.1000 10 10.1950

(0.1914) (0.1112) (0.0896) (0.1521) (0.2932)

a 5 4.7502 5 4.9604 5 4.9566 5 5.0670 5 4.8139

(0.2961) (0.0738) (0.0916) (0.1341) (0.2955)

d 5 5.1674 5 4.8548 5 5.0373 0.5 0.7268 10 9.7057

(0.2905) (0.1731) (0.1240) (0.2873) (0.4127)

␴2 _{207.36 202.40} _34.56 _34.13 _{207.36 206.86} _{110.36 112.88} _{419.04 405.75}

(5.4422) (0.5866) (1.1448) (2.5932) (13.8450)

LR2 0.50–33.21 13.49–81.29 9.54–75.78 1.369–42.28 0.01–25.22

Power (␣ ⫽0.05) 71 100 100 77 71

Power (␣ ⫽0.01) 52.5 100 100 62.5 46.5

p(1)

1 ,p(1)2 , andDare the allelic frequencies of allelesA(1)1 andA(1)2 at two SNPs and their linkage disequilibrium, respectively.␮

is the overall mean, andaanddare the additive and dominant effects, respectively, by assuming that haplotypeA(1)

1 A(1)2 is different

from the rest of haplotypes.␴2_{is the residual variance. LR}

2is the test statistics for testing the genetic effect of SNP haplotypes.

“Power” is calculated as the percentages of all simulations that detect significant genetic effects.

sampling error of the additive effect estimation is in- the dominance effect is reduced compared to that from

the 2-SNP model. Second, the power to detect signifi-creased by 121% from additive to dominant modes, but

is stable from dominant to overdominant modes. These cant sequence variants and the estimation precision of

different genetic parameters from the 3-SNP model can two percentage values are 1 and 42% for the sampling

errors of the dominant effect estimation. It appears that be improved with increased heritability levels, increased

sample sizes, and decreased gene interactions. the additive effect estimation is more sensitive than the

dominant effect estimation to the changes of heritability A worked example:A real example from an obesity

study is used to demonstrate the usefulness of our and sample size.

We analyze the power to detect a significant diplotype model. Numerous genes have been investigated as

po-tential obesity-susceptibility genes (Mason et al.1999;

difference from our model under different

combina-tions of H2 _{value, sample size, and gene action mode} _Chagnon_{et al.} _{2003). The}␤_{-2 adrenoceptor (BAR-2)}

is a major lipolytic receptor in human fat cells, whose (Table 2). To do so, we first obtain empirical critical

threshold values for declaring the significance of diplo- function was found to be determined by known

poly-morphisms in codons 16 and 27 (Greenet al.1995). In

type difference through simulations. The thresholds at

the significance level ␣ ⫽ 0.05 or 0.01 are estimated a genetic study of obesity involving 149 women byLarge

et al.(1997), the Arg16Gly polymorphism at codon 16

as the 95th and 99th percentile for the simulated test

statistics calculated from simulated data involving no was found to be associated with altered BAR-2 function

with Gly16 carriers showing a fivefold increased agonist

diplotype difference over 1000 LR2values. It is clear that

increased H2 _{levels and increased sample sizes can in-} _{sensitivity. The Gln27Glu polymorphism at codon 27 is}

markedly associated with obesity traits. The homozygotes crease the power to detect diplotype differences. When

H2 ⫽_{0.4 or}_n⫽_{400, such power is 100% (Table 2).} _{for Glu27 display an average fat mass excess of 20 kg and}

50% larger fat cells than controls. Although the Arg16Gly An additional simulation study was performed to

in-vestigate the statistical behavior of the multi-SNP se- polymorphism is not significantly associated with obesity

and the Gln27Glu polymorphism is not significantly quence model. The results from the 3-SNP sequence

model are summarized in Table 3. First, all genetic associated with the BAR-2 function, genetic variability

in these two sites of the human BAR-2 gene could be parameters can be reasonably estimated from the 3-SNP

(7)

TABLE 3

The MLEs and their square root mean square errors (in parentheses) of population and quantitative genetic parameters of DNA sequences associated with the phenotypic variation estimated from 200 simulation replicates under different

(heritability-sample size-gene action mode) combinations for the 3-SNP sequencing model

0.1-100 0.1-100 dominance 0.4-100 dominance 0.1-400 dominance 0.1-100 additive overdominance

Parameters True MLE True MLE True MLE True MLE True MLE

p(1)

1 0.6 0.5973 0.6 0.5984 0.6 0.5989 0.6 0.5984 0.6 0.5961

(0.0036) (0.0028) (0.0015) (0.0029) (0.0038)

p(2)

1 0.6 0.5956 0.6 0.5986 0.6 0.6003 0.6 0.5988 0.6 0.5936

(0.0050) (0.0028) (0.0013) (0.0027) (0.0054)

p(3)

1 0.6 0.5976 0.6 0.5957 0.6 0.6004 0.6 0.5960 0.6 0.5977

(0.0034) (0.0048) (0.0012) (0.0045) (0.0034)

D12 0.04 0.0413 0.04 0.0408 0.04 0.0389 0.04 0.0402 0.04 0.0412

(0.0021) (0.0018) (0.0014) (0.0016) (0.0022)

D13 0.025 0.0263 0.025 0.0252 0.025 0.0239 0.025 0.0255 0.025 0.0249

(0.0022) (0.0018) (0.0014) (0.0018) (0.0020)

D23 0.025 0.0262 0.025 0.0259 0.025 0.0249 0.025 0.0253 0.025 0.0264

(0.0020) (0.0019) (0.0009) (0.0016) (0.0022)

D123 0.02 0.0187 0.04 0.0190 0.02 0.0202 0.02 0.0190 0.02 0.0190

(0.0017) (0.0015) (0.0005) (0.0015) (0.0017)

␮ 10 10.4424 10 10.1672 10 10.0315 10 10.2563 10 10.6367

(0.4931) (0.1876) (0.1121) (0.2932) (0.7089)

a 5 5.1359 5 5.0572 5 4.9881 5 5.1490 5 5.1918

(0.2514) (0.1018) (0.1004) (0.1803) (0.3587)

d 5 4.5860 5 4.6595 5 5.0308 0.5 0.2804 10 9.3959

(0.5122) (0.3594) (0.1555) (0.3005) (0.7427)

␴2 _223.42 _218.42 _37.23 _37.1198 _223.42 _222.25 _101.93 _99.72 _458.51 _448.39

(5.4770) (0.4722) (1.63) (2.48) (11.11)

LR2 0.06–33.81 12.66–70.47 11.01–62.94 0.11–46.89 0.01–25.22

Power (␣ ⫽0.05) 71 100 100 78 66

Power (␣ ⫽0.01) 57 100 100 57 41

See Table 2 for explanations of parameters.

and lipolytic BAR-2 function in adipose tissue, at least for allele G at codon 16 and 0.38 for allele G at codon

27, respectively, suggesting that both of them have fairly

in women (Large et al.1997).

To determine whether sequence variants at these two high heterozygosity.

Our model is based on the assumption that one haplo-polymorphisms of BAR-2 are associated with obesity

phenotypes, we investigated a group of 155 women of type is different from the rest of the haplotypes. In a

practical situation, as in our example used here, the ages 32 to 86 years old with a large variation in body

fat mass. Each of these patients was determined for her issue of how these haplotypes differ in their effects is not

known. By assuming that any one haplotype is different genotypes at codon 16 with two alleles, Arg16 (A) and

Gly16 (G), and at codon 27 with two alleles, Gln27 (C) from the rest of the haplotypes, we can find a

best-difference pattern that is based on the estimates of the and Glu27 (G), within the BAR-2 gene and measured

for body mass index (BMI). These two SNPs form four LR2’s using Equation 16. When haplotype GG, GC, AG,

or AC is assumed to differ in BMI from the other haplo-haplotypes designated as AC, AG, GC, and GG, which

lead to nine genotypes, AA/CC, AA/CG, AA/GG, AG/ types, the LR2values were estimated as 10.35, 3.11, 1.52,

and 2.32, respectively. Obviously, haplotype GG is con-CC, AG/CG, AG/GG, GG/con-CC, GG/CG, and GG/GG, and

the 10 corresponding diplotypes, [AC][AC], [AC][AG], sidered as a reference haplotype that has an effect on

BMI in a comparison with the other haplotypes. The [AG][AG], [AC][GC], and [AC][GG] or [AG][GC],

[AG][GG], [GC][GC], [GC][GG] and [GG][GG]. Our difference between haplotype GG and the rest of the

haplotypes can significantly explain some variation in model is used to associate diplotype differences with

variation in BMI. The MLEs of the haplotype frequen- BMI. The resultant LR2 for testing the association

be-tween DNA sequence and BMI phenotype (10.35) is cies, allele frequencies, and linkage disequilibrium

be-tween the two SNPs were obtained (Table 4). These two well beyond the critical threshold value at the

signifi-cance level of 1% (6.07) estimated from 1000 simulation SNPs are highly associated with each other, whose LD

(8)

TABLE 4 viduals differ at a single DNA base. Sets of nearby SNPs on the same chromosome are inherited in blocks. Blocks

The maximum-likelihood estimates (MLEs) and their standard

may contain a large number of SNPs, but a few SNPs

errors (SEs) of population and quantitative genetic

are enough to uniquely identify the haplotypes in a

parameters for DNA sequences associated with

phenotypic variation in BMI for 155 patients block (Gabrielet al. 2002). The HapMap is a map of these haplotype blocks constructed by tag SNPs, those

Parameters MLE SE that explain most of haplotype diversity. The HapMap

should be valuable by reducing the number of SNPs

Population genetic parameters

required to examine the entire genome for association

Haplotype frequencies pAC 0.0250 0.0264

with a phenotype from the 10 million SNPs that exist

pAG 0.3686 0.0632

pGC 0.3589 0.0300 to ⵑ500,000 tag SNPs. This will make genome scan pGG 0.2476 0.0361 approaches to finding regions with genes that affect Allele frequencies p(16)

G 0.6065 0.0346 diseases much more efficient and comprehensive, since

and LD _{effort will not be wasted typing more SNPs than}

neces-p(27)

C 0.3839 0.0346 _{sary and all regions of the genome can be included.}

D_AB 0.1261 0.0200

Our model is founded on the discovery of tag SNPs in the genome, thus allowing for a fast scan for the

Quantitative genetic parameters

Overall mean ␮ 30.8269 1.0442 association between variation in DNA sequence and

Additive effect a ⫺1.7732 1.0442 traits. This model has three advantages. First, it can

Dominant effect d ⫺3.0548 1.4758 _{materialize the genetic basis for quantitative variation}

Residual variance ␴2 _71.9070 _20.4749

by directly characterizing specific DNA sequences pre-disposing to a certain disease. The traditional statistical models for genetic mapping attempt to postulate the

dominant effects. Patients with homozygous genotype position of hypothesized QTL that are linked with

GG/GG have significantly lower BMI than those with known markers genotyped from the genome (Lander

homozygous genotype AA/CC, as indicated by a large andBotstein1989;Louet al.2003). The QTL detected

additive effect (a ⫽ ⫺1.77). For double-heterozygote from these models are regarded as “hypothesized”

be-AG/GC, diplotype [GG][AC] reduces BMI by 3.05 (the cause it is not possible to know their DNA sequences

dominant effect) compared to diplotype [AG][GC]. and, therefore, physiological function. As opposed to

The difference between haplotype GG and the rest of the traditional “indirect” approach, our model presents

the haplotypes can explain 6.3% of the observed varia- a “direct” approach. At present, the utility of the direct

tion in BMI. We estimate the standard errors of the approach is limited to sequencing the functional parts

MLEs of the population and quantitative genetic param- of candidate genes with known biochemical or

physio-eters on the basis ofLouis’s (1982) observed informa- logical function. With the release of HapMap, our model

tion matrix, suggesting that all MLEs have reasonable will make the direct approach more useful and more

estimation precision although the estimates of quantita- efficient in searching for causal variants throughout the

tive genetic parameters are not as precise as those of whole genome. Our model is also different from

tradi-population genetic parameters due to a small sample tional association studies (e.g.,Zaykinet al.2002) whose

size used (see the simulation result in Tables 2 and 3). aims are the significance test for the association between

haplotypes and phenotypes (mostly discrete traits) rather than the precise estimation of haplotype effects. Second,

DISCUSSION

our model is statistically simple and computationally fast. The most difficult part for the estimation from The elucidation of the entire human genome using

SNPs will make it possible to develop a haplotype map our model is to construct diplotype configurations for

heterozygous genotypes at two or more SNPs. The esti-of the human genome. Although such a HapMap has

been recently available due to international collabora- mation of population genetic parameters is based on a

multinomial-likelihood function of the observed

geno-tions (International HapMap Consortium 2003),

there is a serious lack of powerful statistical tools for type data, whereas the estimation of quantitative genetic

parameters is based on a mixture-based likelihood func-specific utility of this expensive HapMap to detect genes

and genetic variations that affect health and disease. tion including different diplotypes. These two likelihood

functions can be easily integrated to a unified estimation Such a statistical model has been developed and

pre-sented in this article. The idea behind our statistical framework implemented with the EM algorithm.

Finally, our model is flexible to different genetic and modeling is that differences between SNP-constructed

haplotypes may be associated with differing susceptibil- experimental settings. The results from a simulation study

indicate that the association between DNA sequence

ity to disease (International HapMap Consortium

2003). and phenotype can be well detected when the trait has

(9)

indi-511 Sequencing Complex Diseases

Perusse et al., 2003 The human obesity gene map: the 2002

(100) is used. Our model can also obtain fairly precise

update. Obes. Res.11:313–367.

estimation of parameters when diplotypes display over- _{Churchill, G. A., and} _{R. W. Doerge, 1994} _{Empirical threshold}

values for quantitative trait mapping. Genetics138:963–971.

dominance in the situation with modest heritability and

Dawson, E., G. R. Abecasis, S. Bumpstead, Y. Chen, S. Huntet al.,

sample size. The specific utility of our model to a real ₂₀₀₂ _{A first-generation linkage disequilibrium map of human}

example from an obesity study leads to the successful chromosome 22. Nature418:544–548.

Dempster, A. P., N. M. LairdandD. B. Rubin, 1977 Maximum

detection of a DNA sequence (haplotype) at codons

likelihood from incomplete data via EM algorithm. J. R. Stat.

16 and 27 genotyped within the␤2AR candidate gene _{Soc. Ser. B}_39:_1–38.

(Chagnonet al.2003) for its significant impact on hu- Evans, W. E., andH. L. McLeod, 2003 Pharmacogenomics drug disposition, drug targets, and side effects. N. Engl. J. Med.348:

man obesity. This haplotype, composed of the Gly16

538–549.

form of codon 16 and the Gln27 form of codon 27, _{Frary, A., T. C. Nesbitt, S. Grandillo, E. Knaap, B. Cong}_{et al.}_,

2000 fw2.2: a quantitative trait locus key to the evolution of

tends to reduce BMI when it is combined with itself or

tomato fruit size. Science289:85–88.

any other haplotypes and accounts forⵑ6% of the total

Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy

observed variation in BMI for 155 patient samples. et al., 2002 The structure of haplotype blocks in the human

genome. Science296:2225–2229.

Although our simulation and example were based on

Green, S. A., J. Turki, I. P. HallandS. B. Liggett, 1995

Implica-2- or 3-SNP analyses, our sequencing model has been _{tions of genetic variability of human 2-adrenergic receptor}

struc-developed to allow for the detection of sequence vari- ture. Pulm. Pharmacol.8:1–10.

International HapMap Consortium, 2003 The international

ants involving any number of SNPs within a haplotype

HapMap project. Nature426:789–794.

block. In addition to its use in studying genetic associations _{Judson, R., J. C. Stephens}_and_{A. Windemuth, 2000} _{The predictive}

power of haplotypes in clinical response. Pharmacogenomics

with disease, our sequencing model can be extended to

1:5–16.

study the genetic factors contributing to variation in

re-Lander, E. S., andD. Botstein, 1989 Mapping Mendelian factors

sponse to environmental factors, in susceptibility to infec- underlying quantitative traits using RELP linkage maps. Genetics

121:185–199.

tion, and in the effectiveness of and adverse responses

Large, V., L. Hellstrom, S. Reynisdottir, F. Lonnqvist, P.

Eriks-to drugs and vaccines (EvansandMcLeod2003;Wein- _son _{et al.}_{, 1997} _{Human beta-2 adrenoceptor gene}

polymor-shilboum2003). It can also be modified to estimate the phisms are highly frequent in obesity and associate with altered adipocyte beta-2 adrenoceptor function. J. Clin. Invest. 100:

effects of sequence-sequence interaction on a complex

3005–3013.

trait. It is possible that a haplotype within a candidate _{Lou, X.-Y., G. Casella, R. C. Littell, M. K. C. Yang} _and_{R. L.}

Wu, 2003 A haplotype-based algorithm for multilocus linkage

gene interacts with haplotypes from other candidate

disequilibrium mapping of quantitative trait loci with epistasis in

genes. The characterization of specific DNA sequence

natural populations. Genetics163:1533–1548.

variants for diseases should allow the development of Louis, T. A., 1982 Finding the observed information matrix when

using the EM algorithm. J. R. Stat. Soc. Ser. B44:226–233.

tests to predict which drugs or vaccines would be most

Lynch, M., andB. Walsh, 1998 Genetics and Analysis of Quantitative

effective in individuals with particular genotypes for _Traits_{. Sinauer, Sunderland, MA.}

genes affecting drug metabolism. Mackay, T. F. C., 2001 The genetic architecture of quantitative

traits. Annu. Rev. Genet.35:303–339.

We thank two anonymous referees for their constructive comments _{Mason, D. A., J. D. Moore, S. A. Green}_and_{S. B. Liggett, 1999} on this manuscript. This work is supported by an Outstanding Young _{A gain-of-function polymorphism in a G-protein coupling domain} Investigator Award of the National Natural Science Foundation of of the human ␤1-adrenergic receptor. J. Biol. Chem. 274:

12670–12674. China (30128017), a University of Florida Research Opportunity Fund

Patil, N., A. J. Berno, D. A. Hinds, W. A. Barrett, J. M. Doshi (02050259), and a University of South Florida biodefense grant

et al., 2001 Blocks of limited haplotype diversity revealed by (7222061-12) to R.W. The publication of this manuscript was approved

high-resolution scanning of human chromosome 21. Science as journal series no. R-10063 by the Florida Agricultural Experiment _294:_1719–1723.

Station. _{Weinshilboum, R., 2003} _{Inheritance and drug response. N. Engl.} J. Med.348:529–537.

Wu, R. L., andG. Casella, 2005 Statistical Genomics: A Quantitative Trait Loci Perspective. Springer, New York.

Wu, R. L., C.-X. MaandG. Casella, 2002 Joint linkage and linkage

LITERATURE CITED disequilibrium mapping of quantitative trait loci in natural popu-lations. Genetics160:779–792.

Bader, J. S., 2001 The relative power of SNPs and haplotype as _{Zaykin, D. V., P. H. Westfall, S. S. Young, M. A. Karnoub, M. J.} genetic markers for association tests. Pharmacogenomics2:11–24. _Wagner_{et al.}_{, 2002} _{Testing association of statistically inferred} Burnham, K. P., andD. R. Andersson, 1998 Model Selection and _{haplotypes with discrete and continuous traits in samples of}

unre-Inference. A Practical Information-Theoretic Approach. Springer, New _{lated individuals. Hum. Hered.}_53:_79–91. York.

(10)