The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs

(1)

The genomic distribution of

population

substructure in

four populations using

8,525 autosomal

SNPs

Mark D.

Shriver,

1*

Giulia C.

Kennedy,

2

Esteban

J.

Parra,

3

Heather

A.

Lawson,

1

Vibhor

Sonpar,

1

Jing Huang,

2

Joshua M.

Akey

4

and Keith W.

Jones

2

1_Pe_nn_S_t_a_t_{e U}_n_i_v_e_rs_i_ty_{, U}_n_i_v_e_rs_i_ty_Pa_r_{k, Pe}_nnsylv_a_n_{ia, USA} 2_Aff_ym_e_tr_i_x_{, I}_n_c_._{, Sa}_nt_{a C}_l_a_r_{a, Ca}_l_if_orn_{ia, USA}

3_U_n_i_v_e_rs_i_{ty o}_{f T}_oronto _a_t_Mi_ss_i_ss_a_u_{ga, Mi}_ss_i_ss_a_u_{ga, Ca}_n_ada

4_F_r_{ed H}_ut_chi_nson _Ca_n_ce_r_Re_s_ea_r_{ch Ce}_nt_e_r_{, Sea}_ttl_{e, Wa}_s_hi_n_g_ton_{, USA}

*Correspondenceto: Tel:þ1814 863 1078;E-mail:[email protected]

Datereceived(in revised form):2nd February 2004

Abstract

Understandingthenatureof evolutionary relationshipsamongpersonsandpopulationsisimportantfor the efficientapplication of genome

sciencetobiomedical research.We have analysed 8,525 autosomal singlenucleotidepolymorphisms (SNPs)in84 individualsfromfour populations: African-American, European-American, Chinese and Japanese.Individual relationships werereconstructed usingthe allele

sharing distance andtheneighbour-joiningtreemakingmethod.Trees showclearclustering accordingto population,withtherootbranching from the African-Americanclade.The African-Americanclusteris muchless star-likethanEuropean-Americanand EastAsianclusters,

primarilybecauseof admixture.Furthermore,on the EastAsianbranch, all tenChinese individualscluster togetherand all tenJapanese individualscluster together.Usingpositionalinformation,we demonstratestrong correlationsbetweeninter-markerdistance and both

locus-specific FST(theproportion oftotal variationduetodifferentiation) levelsand branchlengths.Chromosomal maps ofthe distribution oflocus-specific branchlengths were constructed bycombiningthese datawithother published SNPmarkers (total of 33,704 SNPs).

Thesemapsclearlyillustrate anon-uniformdistribution of humangeneticsubstructure, aninstructionalanduseful paradigmforeducation

andresearch.

Keywords:populationgenomics,populationgenetics,microarray, genotyping, evolution, admixture

I

ntro

d

u

c

t

i

on

The completion oftheprimaryhumangenomesequencewas announced in 2003 and millions ofsinglenucleotidepoly -morphisms (SNPs) are alreadyavailable in public databases [eg The SNP Consortium (TSC), dbSNP, HGVbase]. Parallelingthese advancesin ourknowledgeofthe human genome have been remarkable breakthroughsingenotyping technologies,providing.1,000-fold increasesingenotyping capacity.Thus,we areon the brink of an unprecedented understandingof human variationand the evolution of our species.A detailedunderstandingofthe extent,patternand meaning of human variationisfundamental to the effective application of genomics to studies of humanbiology. For example, understanding the amount of geneticstructure presentinhuman populationsis relevant toepidemiological studiesas, if uncontrolled for, it can produce false-positive resultsinassociation studies1 _a_n_d_low_e_{r st}_a_t_i_st_ica_{l pow}_e_r _i_n

linkage analyses.2_Addi_t_i_on_a_lly_,_p_a_tt_e_{rns o}_f_stru_c_tur_e_w_i_t_hi_n_a_n_d betweenhuman populationscanbe important in terms of epidemiological risksand the evaluation of drugresponse.3

Previous studieshaveused relatively small numbers of geneticmarkers toexplorethe geographical patterns of human geneticvariation,resulting inanincompletepictureof human diversityat the genomiclevel.4,5_O_{nly r}_ece_ntly_ha_s_i_t_bec_om_e possibletocarry out studies withthousands ofmarkers ona genome-wide scaleunder the new paradigm ofpopulation genomics,whichmodelsgenetic variationat both a genomic and alocus-specificlevel.6,7_{We a}_n_a_lys_ed_t_{he ge}_n_e_t_ic_v_a_r_ia_t_i_on of 8,525 autosomal markersin four population samplesand twoCentre d’Etude duPolymorphisme Humain (CEPH) family trios.The SNPmultilocusgenotype datawere collected using anew method developed by Affymetrix, called‘whole genome amplification’ (WGA).8 _A_s_a_mpl_e_o_f₇₈_unr_e_l_a_t_ed individuals—38 European-Americans,20African-Americans and 20EastAsians (ten Chinese andtenJapanese)—was

PRIMARY RESEARCH

qHENRY STEWART PUBLICATIONS1473-9542.HUMAN GENOMICS.VOL1.NO 4. 274–286MAY2004

(2)

selected from the TSC corepanel of individualsforanalysis on the WGAmicroarrays.The twoCEPH family trios (mother, fatherand child)areof European-Americanancestry. We combinedthesenewdatawith arecentlyavailable dataset consistingof26,530SNPscompiled from apublic database.9 Positionalinformation wasavailable for 33,704ofthese 36,347 SNPsand was used toinvestigate human population substructure ata locus-specificlevel,wherebyall genomic regions arenot averaged together, butinvestigated as individualdata elements.

Me

t

h

o

d

s

Population samples

The DNAsamples we analysedwere from two publicly availablesamplesetscurated at the Corriell Institute (Camden, NJ): TSC and CEPH. Twofamily trios (mother, fatherand child) wereselected from the CEPH family mapping panels and were European-American.The four population samples weresubsets of a commonly used set ofsamplesassembled by TSC for the purposes of SNP verificationand allele frequencyestimation.From these TSC panels,we included 38 European-Americans,20African -Americans,tenChinese andtenJapanese.The Chinese and Japanesesubjects were ascertained in the USA, but wereof Chinese and Japanese ancestry.

SNP genotyping

WGAtechnology was usedtogenotype individualsin this study. Details ofthis method have been published elsewhere,8_b_ut_, briefly, fractions ofthe genome areobtained by restriction enzyme digestion of genomic DNA,ligatedwith adaptorsand subsequentlyamplifiedwith auniversal primer.The amplified targetisfragmented,labelledwithterminal transferase and biotin-ddATP(dideoxyAdenosine Triposphate)and hybridised overnight to syntheticmicroarrays.10_Ge_notyp_e_s_a_r_{e ca}_ll_{ed b}_y interpretingsignalsfromallele-specificprobes using amodel -based algorithm.The accuracy ofthis method is.99.5percent. SNPs were chosenfrom the TSC databaseon the basis oftheir predictedlocation on400–800basepairfragmentsgenerated by

in silicodigestion of humangenomesequences withvarious restrictionenzymes.

Statistical

analyses

Individualgenetic distances were estimatedusingthe allele sharing distance(ASD).11_The_tr_ee_o_{f i}_n_di_v_id_u_a_ls_{, ba}_s_ed_{on t}_he ASD distance,wasconstructedusingtheneighbour-joining method12_w_i_t_h_t_{he M}_ol_ec_ul_a_r_E_volut_i_on_a_ry _Ge_n_e_t_ic_s_A_n_a_lys_i_s softwarepackage (MEGAversion 2.1).13 _The_tr_{ee b}_r_a_n_chi_n_g pattern wasevaluated bybootstrapping, and wasbasedon 100 replicates.The principalcoordinatesanalysis (PCA) was carriedout with NTSYSsoftware.14_{The c}_omput_e_{r pro}_g_r_a_m STRUCTURE 2.015_w_a_{s us}_ed_to _i_n_fe_{r r}_e_l_a_t_i_v_{e i}_n_di_v_id_u_a_l

admixturelevelsin the sample. The analysis wascarriedout with anadmixturemodel of K¼3 (threepopulations),the model previouslydetermined to show the highest posterior probabilitiesfor these data.Atotal of25,000 simulation iterations wererunfor the burn-in period and 75,000 additionaliterations wererun to get parameterestimates.For estimations of individualadmixture in the African-American sample, we includedonly the European-Americanand African-American subjectsandset K¼2 with independent alphas.The average individualadmixture in the African -American samplewas 0.25.

Locus-specific branchlengths (LSBLs),x,yandz,were calculatedusingpairwise FSTdistances, dAB, dBCand dAC, wherex¼ðdABþdAC2dBCÞ=2;y¼ðdABþdBC2dACÞ=2; z¼ðdACþdBC2dABÞ=2.A, B and C arethethree

populations underconsideration.Figure1 shows these calculations.LSBL correlations were estimated after

transformingthe datato more closelyapproachnormality using the inversetransformationafteradding0.35to make each measurepositive.Computer simulations ofthe coalescent process wereperformedusing Hudson’s program, calledms.16 Comparisonsbetween therealdata distributions of LSBL and thesimulatedresults were conductedusingthe Kolmogorov– Smirnov (KS) test.

Re

sults

Individual-based analyses

Therelativeproportion of geneticvariance duetodifferences between populations wasestimatedusing Weir’sFST.17Figure2

Figure1.Diagramdemonstrating howbranchlengthsare estimated foranetworkwiththreepopulations.Locus-specific branchlengths,x,yandz, are calculatedusingpairwise FST

distances, dAB, dBCand dAC,wherex¼ðdABþdAC2dBCÞ=2,

y¼ðdABþdBC2dACÞ=2,z¼ðdACþdBC2dABÞ=2and A, B

and C arethethreepopulations underconsideration.

The genomic distribution ofpopulation substructure infour populations using 8,525 autosomalSNPs PRIMARY RESEARCHReview

(3)

showsa histogram ofthe FSTdistributionfor this sample. The average level of FSTforautosomalSNPs,0.132, is very similar toFSTvalues previouslydescribed for major continentalgroups,9_c_on_fi_rm_i_n_g_t_he _w_e_ll_-k_nown _fac_{t t}_ha_t most variability inhuman populationsis observedwithin populationsand thata minorfraction of genetic variation (5–15 percent)isduetodifferencesbetween majorcont i-nentalgroups.4,5,18 –21_Wi_t_h_su_{ch a} _l_a_r_ge_num_be_{r o}_f _lo_ci, the distribution of FSTprovidesimportantdetails regarding variationamong loci at the level of genetic differentiation. MostSNPs show lowFSTvalues (66 percent have FST,0:15Þ and onlyasmallfraction showhigh FST (4 percenthave FST.0:40Þ:

Phylogenetic and clustering relationships werestudied amongthe 84persons usingthe ASDmethod11 _to_e_st_i_m_a_t_e the average distance betweenall pairwise combinations of individuals.Matrices of ASDmeasures wereusedto recon -struct individual trees withthe neighbour-joiningmethod12 and prepare atwo-dimensional plot basedona PCA. Additionally,weusedsubsets ofthe total marker panels based onFST level (both high and lowFST markers) toexplorethe effects of allele frequencydifferenceon the results.Finally, weused pairwise FSTmeasures tocalculate LSBLsforeach of the populations.Markers were grouped by the distance between themandtested for thelevel of correlationinbranch lengthvalues.

Figure3A showsaneighbour-joining treeof individuals, constructed withthe ASD measurematrix using 8,525 autosomalSNPs.There arethree clearclusters on thetree showing high bootstrap values.These clusterscoincide with individualgeographical origins: all European-Americans cluster together (98percent bootstrap), allAfrican

-Americans cluster together (100 percentbootstrap) and all EastAsianscluster together (100 percentbootstrap).The East Asianand European-American samplesform tighter,more star-like, branchingpatterns radiating fromfocusedpoints than the African-American sample.Thisdifference inbranching pattern maybe due either to variationinindividualadmixture levels22,23_{or to} _highe_{r l}_e_v_e_{ls o}_{f ge}_n_e_t_{ic di}_v_e_rs_i_ty_i_{n su} b-SaharanAfrica.Notable is that,within the EastAsianbranch, there isa bifurcationbetweenclusters of Japanese and Chinese individuals.Althoughthese clustersaremonophyletic,the bootstrap intervalis only57 percent, indicatingthat there is not significantinternalconsistency supporting this split and there arelikely tobe a limitednumber of loci differentiating thesetwoclosely relatedpopulations. Whencomparingthe Japanese and Chinesesamples,thepairwise FSTis 0.045 (449 SNPs with FST .0.20).Interestingly,one European -American subject, Eu5,standsapartfrom the European -Americanclusterand branchescloser to the internodethan to otherEuropean-Americansin thesample.Themembers ofthe twofamiliesform twoclusters on this tree and, asexpected, the child–parent distance isapproximatelyhalf the distance between unrelatedpersons (datanot shown).

The 8,525 markers wereranked by FSTlevel, calculated usingthe threeprimary groupings (African-American, European-Americanand EastAsian).Weselectedthe 1,000 lociwiththe highest levels of FSTðFST.0:281Þandthe

1,000 lociwith thelowest levels of FST ðFST,0:0143Þand

constructedneighbour-joiningtrees usingthe ASD.These two treesareshowninFigures 3B and3C.There areseveral reasonsfor separating and displayingthese data in thisfashion. First,we caninvestigatethestructureofthetreeusingmarkers thatare ancestryinformativewithrespect tohowFSThasbeen

Figure2.Distribution of FST.FSTwascalculatedusing Weirand Cockerham’s unbiased estimator (1984).17The histogram showsbin

distribution, asindicatedon thex-axis, andthe cumulative distribution,represented byaline.

Shriveretal:

Review

PRIMARY RESEARCH

(4)

defined(Figure3B). Someofthe SNPsin the high-FST category mayhave been subject to,or tightly linked to, markers under recentdirectional selection. Becauseofthe stochastic natureof genetic drift, however,some high FST markerscan resultfroma completely neutralevolutionary history.20_A_s_e_xp_ec_t_ed,_t_hi_{s tr}_{ee e}_x_hibi_{ts mu}_ch_lon_ge_r_i_nt_e_rn_a_l branches separatingpopulationgroupsand shorter terminal branches.Additionally,the previously-noted feathered clusteringof African-Americanindividuals usingthe complete data(Figure3A)isaccentuated in this tree.Secondly,we can

displayand investigatethetopology ofthetree drawnfrom the 1,000 lowest-FSTSNPs (Figure3C).Althoughthese SNPs,or nearby variations, mayhave been subject tobalancingor overdominant selection,the bulkofthe genome is low-FST (Figure2).As such,thesemarkers might be a better

representation of the evolutionaryhistory of ‘average’ regions ofthe genome.6_Thi_{s tr}_{ee i}_{s str}_iki_n_g_{ly st}_a_r_-_l_ike,_w_i_t_h_v_e_{ry l}_i_ttl_e internal structure and allindividuals radiating from the centre.

Althoughtrees provide a useful means ofrepresenting evolutionary relationships amongpopulations and individuals,

Figure3A.Relationshipsamong individuals werereconstructedusingtheneighbour-joiningmethodonamatrix of allelesharing distances (ASDs).Unrelated individualsfromfour populationsareshown: African-American (labelled AfX;n¼20Þ;European-American (labelled EuX;n¼38Þ;Japanese(labelled JpX;n¼10Þ;and Chinese(labelled ChX;n¼10Þ, alongwithtwo nuclearfamily trios (mother, fatherand child,labelled FXM, FXF and FXC,respectively).Thisfigureshows phylogenyconstructedusing all8,525 autosomal singlenucleotidepolymorphisms.Figure3Bshows phylogenyconstructedusingthe1,000highestFSTmarkers.Figure3Cshows the treeof individualsconstructedusingthe1,000 lowestFSTmarkers.Itis notablethat the family relationshipsarestillevidentin thetwo trios when usingmarkers likethese,which clearlycontain noinformation regardingpopulationdifferences.

(5)

there isanother important group ofmethodsindependent of certainassumptions inherent in phylogenetic analyses. For this reason,weperformed a PCAusing ASD.Some 25.4 percent ofthetotal variationin these distancemeasures isexplained by the first two principalcoordinates,which are plotted inFigure 4.Likethetree,the PCAplots show substantialclusteringof individualsby population. The East Asianindividuals form thetightestcluster, followed by the European-Americans.The African-Americans cluster ina linear fashionapproachingthe European-American cluster, suggestingthat Europeangene flow isanimportantaspect of the diversityamong African-Americanindividualsand alikely explanationfor whyAfrican-Americanbranchesaremore widely spacedthan theother two populations in the trees (Figure3A and 3B). Additionally,the European-American individual noted asan outlierin thetreeof individuals (Eu5)is

the European-American outlier on the PCAplot moving away from the Europeancluster.

Locus-specific analyses

Tree-based and PCA approaches quantify the average evolutionary relationshipsamong individualsandpopulations, and FST calculated from many loci quantifies the average amount of geneticvariationdueto differencesamong groups orgeographical regions.Asillustrated inFigures 2,3B and3C, notall loci across the genome have experiencedthesame amount of evolutionarychange. Rather,most loci have undergoneonly marginalchangesinallele frequency,while a smaller number ofloci haveundergonevery large changesin frequency.Although FSTcanbeusedto quantify the degreeof evolutionat aparticular locus, andthisapproach has proven successfulin several studies, itis not without some drawbacks.

Figure3B.Treeof individualsconstructedusingthe1,000highestFSTmarkers.

Shriveretal:

Review

PRIMARY RESEARCH

(6)

Oneparticular drawback is that FSTis sensitiveto changes inany ofthepopulations included in the analysis.Any one (or more) ofthe populations could have different allele frequenciesfrom theothers,leadingtoa higherFST.Withthis in mind,we extended thisapproachto quantify the degreeof evolutionat aparticular locusby calculating LSBLs (see

Figure1 and Methods section).Thisapproach geometrically isolatesallele frequencychange, allowingspecification of not only the amount of evolution thathas occurred, butalso the population(s) that underwent changesat particular loci.

Atest of the appropriateness of LSBL asameasureof evolutionis toexaminethe extent oftherelationshipbetween

Figure3C.Treeof individualsconstructedusingthe1,000 latestFSTmarkers.

(7)

genomicproximityand branch lengthlevel.Figure 5shows therelationshipbetweencorrelationfor pairs ofsyntenic SNPs in terms of branchlength foreachpopulationand fullFST and gap size(distance betweenSNPpairs). X-linked and

Y-linked SNPs were excluded from thisanalysisbecause the differenteffectivepopulation size,the potentially stronger effects of natural selectionandthelower recombination rate for markers on these chromosomes might artificially

Figure 4.Aprincipalcoordinatesanalysis representation ofthe allelesharing distance.Populationsincluded are indicated by the

symbols listed in the key.Notethat subjectEu5 is theopencirclespaced awayfrom theprimarycluster of European-Americans.

Figure 5.Levels of correlationbetweenautosomal singlenucleotidepolymorphism loci asa function of inter-markerdistance.Inter

-markerdistance bin,labelled by upperbincutoff inkilobases, is shown on thex-axis.Markers,ranked bychromosomal positionand

thenindividuallycomparedwithmarkersfurtherdown on thesame chromosome,wereplaced in the appropriate binbasedoninter

-markerdistance.All markersineach bin wereusedtocalculatethe correlationbetween the FSTlevelforadjacent markers.

Shriveretal:

Review

PRIMARY RESEARCH

(8)

(9)

exaggerate correlations.Clearly, closely spaced SNPs show moresimilarity inbranchlength, and thiscorrelationdecays asa function of gap size.Additionally, there isanotable transitionin theshapeofthe curve between 30kilobases (kb) and 50kb,where arapid decrease incorrelationbecomes more gradual.

Summary statisticsdescribingthe distributions of hetero -zygosity, LSBLsand FSTareprovided inTable1.We computed theunbiased heterozygosityand LSBL for the autosomalSNPs and the X-linked SNPs separately,to make comparisons between thesetwoclasses ofmarkers.Interestingly,significant differencesinheterozygosity areseenfor two of thethree populations.European-Americans and EastAsiansboth ex hi-bit lowerheterozygosity levelsforX-linkedmarkerscompared with autosomal markers,whilethe African-American sample shows no significantdifferenceðp¼0:18Þ:Conversely,the WestAfrican-American sampleshowsasubstantiallyhigher average LSBL forX-chromosomal markerscomparedwith autosomal markers (0.113and 0.069, respectively) relativeto European-Americansand EastAsians.

Inaddition toexaminingthe branchlengths usingmeasured allele frequencies,we adjusted for, and analysed,the effects of admixture.The Europeanadmixturerate in this samplewas measuredwith STRUCTURE2.015 _to _be₂₅ _p_e_r_ce_nt_— areasonablelevel, given what hasbeen observed in other African-American populations.22,23_{The effec}_{t o}_{f ge}_n_{e fl}_ow_i_s todecrease genetic distance, in thiscase between the African -Americanand European-American samples.This resultsin shorterAfrican-Americanand European-Americanbranches, and relatively longerEastAsianbranches.Indeed,whencon -trolling forgene flow, average branch lengthschangesu b-stantially: Africanautosomalbranchlength increases to 0.114 fromaraw level of0.069, European-Americanbranchlength increases to 0.046 from 0.039and EastAsianbranchlength decreases to 0.055 from 0.066.

Itisimportant toknowifthe admixture adjustmentaffected all markers similarly.Therefore,we calculated the correlation betweenbranchlengthsfor the unadjusted andthe admixture adjusted branchlengthsand found a high correlation (R2 _l_e_v_e_{ls o}_f _0.9_44,_0.977_a_n_d _0.963_f_{or t}_{he Af}_r_ica_n -American, European-Americanand EastAsianbranchlengths, respectively).

Regions wheremultiple SNPs showing high LSBL are in close genomicproximityindicatelocations that haverecently undergone dramatic changesinallele frequency becauseof either randomgenetic drift or natural selection.Therefore, it maybe instructiveto plot the LSBL estimates relativeto their

chromosomal positions.Genomicpositions wereobtained for atotal of33,704 SNPsfroma combinedset of markers (9,817 total markersfrom the WGA chip and 26,530from using arecent version of the dbSNP9₍_Ja_nu_a_ry_,_2003)._The resultshave been plotted foreachofthe23chromosomesand arepresented in the onlinesupplementary material

(www.anthro.psu.edu/biolab). Figure1 presentschromosome 1asanexampleof theseplots.Patterns of high and lowFST levelsare clarified and decomposed bybranchlengthvalues.It is usually the casethathigh branchlengthsfor linked SNPsin particular populations resultinclusters of high FSTlevels.This is reasonable because branch lengthsare calculated from the threepairwise FSTvalues.As such,the impressionis that the fullFSTplotis noisier, havingmorespikes (single highvalues) than the branchlengthplotsand a higher level of extreme valuesforFST comparedwith branchlengths.

Finally,we conducted coalescent simulations toinvestigate theselectively neutralexpectationsfor the LSBLstatistic.The simulations wereperformedusing anislandmodel,where thelevels ofmigrationbetweendemes were adjustedso that the average LSBLlevels oftheobserved andsimulated data matched.Thesesimulationsaresummarisedrelativeto the distributions oftherealdata inFigure7.Althoughthe distributions ofthesimulationsare differentfrom thereal data,usingthe KStestðKS¼0:145;p,0:0001forAfrican -Americans;KS¼0:133;p,0:0001forEastAsians;KS¼ 0:167;p,0:0001forEuropean-Americans),there aresome similarities.First,the basicshapeofthe distributionsis thesame, with a highly skewed distribution wherethe bulkofthe SNPs havelowLSBL,with decreasingnumbers showing higher LSBLlevels.Secondly,thesamerelative differencesamongthe threepopulationsare found in thesimulations.Namely,the European-Americangroup shows moreloci havingshorter LSBLlevels than theother two populations.Itis likely that morerealisticsimulations,using, forexample, astepping-stone model ofpopulationdivergenceorexpansion,wouldprovide results that showa betterfit to theobserved distribution, allowingus tocontrast observed distributionsandparticular loci orgroups oflociwith aneutral model of evolution.

Di

s

c

uss

i

on

Thelargenumber ofmarkers used in these analyses provides an unprecedentedlevel of resolutionfacilitatingthestudy of humanhistoryat the genomic level. Our investigation of multilocusgenotype dataon8,525 autosomalSNPs reinforces

Figure6(seepage281).Chromosomaldistribution oflocus-specific branchlength and FSTfor 3,258markers localisedtochromo -some1.Thetop panel (A) shows the fullFST, African-American locus specific branchlength(LSBL)in thesecondpanel (B), European

-AmericanLSBL in thethirdpanel (C)and EastAsianLSBL in the fourthpanel (D).Thevalue foreach SNP isindicatedwith a black

pointandlinesare drawnconnecting adjacent points.Notethathigher resolutioncolour plotsforeach chromosome are included in the Online Supplementary material (www.anthro.psu.edu/biolab).

Shriveretal:

Review

PRIMARY RESEARCH

(10)

two observations consistently reported in theliterature.First, the root of the tree branchesfrom within the African -Americanclade, ashasbeen previously observed.24,25_Thi_s_i_s consistent withpalaeontologicalevidence indicating an

African origin ofour species.26_Sec_on_d_ly_,_most_h_um_a_n_ge_n_e_t_ic variationisfound within populations,while aminor

proportion ofthetotal variance isduetodifferencesbetween continental populationgroups. The average FSTin this study

Table1. Summary statisticsforheterozygosity, branchlengthsand FST.

Measure Marker type African-American European-American EastAsian FullFST Heterozygosity1 _A_utosom_a_l _{0.321 (0.027)} _{0.317 (0.031)} _{0.290 (0.03}₅₎ _–

X-linked 0.317 (0.028) 0.280 (0.038) 0.245(0.041) –

p-value 0.180 1:6£10212 ₁_:4_£₁₀215 _–

LSBLrawdata2 _A_utosom_a_l _{0.069 (0.01}₄₎ _{0.039 (0.010)} _{0.066 (0.01}₅₎ _0.130 X-linked 0.113 (0.031) 0.054(0.017) 0.080 (0.023) 0.194 p-value 5:5£10233 ₉_:₀_£₁₀27 _0.0001 ₃_:₆_£₁₀263

LSBL adjusted3 _A_utosom_a_l _0.11₄ _0.0₄₆ _0.0₅₅ _0.163

X-linked 0.170 0.063 0.068 0.242

1_A_v_e_r_age_(v_a_r_ia_n_ce₎_f_{or t}_he_un_bia_s_{ed he}_t_e_rozy_g_os_i_ty_ca_l_c_ul_a_t_{ed f}_or_n_¼₃₂_,5₂₇_a_utosom_a_l_a_n_d_n_¼₁_,₁₇_{5 X-}_l_i_n_ked_s_i_n_g_l_e_nu_c_l_e_ot_ide_polymorp_hi_{sm (}_SNP_{) lo}_ci_._A_utosom_a_l_SNP_s

were comparedwith X-linked SNPs using at-testandp-valuesareshown;

2_A_v_e_r_age_(v_a_r_ia_n_ce₎_f_{or t}_he_lo_c_us_-_sp_{ecific b}_r_a_n_ch_l_e_n_g_t_{h a}_n_{d F}

STshownforn¼32,527autosomalandn¼1,175 X-linked SNPloci.LSBL¼locus-specific branchlength; 3_Ad_m_i_xtur_e-adj_ust_{ed a}_v_e_r_age_._Af_r_ica_n_-A_m_e_r_ica_n_a_ll_e_l_{e f}_r_e_qu_e_n_cie_{s w}_e_r_{e adj}_ust_ed_to_acc_ount_f_or_a_n_a_v_e_r_{age ad}_m_i_xtur_e_l_e_v_e_{l o}_f₂₅_p_e_r_ce_nt_E_urop_ea_{n us}_i_n_g_t_{he f}_ormul_a,

pAf¼(pAA2(m)pEA)/(12m); wheremis the admixturerate(25percentEuropean),pAAis the African-Americanallele frequency,pEAis the European-Americanallele

frequencyandpAfis thepredicted Africanallele frequency.

Figure7.Distribution oflocus-specific branchlength(LSBL) levelsfor realdata and coalescent simulations.Shownarethe histogram

distributions ofthe LSBLlevelsfor thethreemain populationgroups.EastAsian (Japanese and Chinese)areshownindark grey, European-Americanin lightgreyand African-Americaninblack.Coalescent simulationsconditionedon the average LSBL areshownas striped bars ofthesameshades.

(11)

(13.2 percent)is similar to values reported in numerous previous studies,4,5,9,18,12,22,23_begi_nn_i_n_g_w_i_t_h_t_{he c}_l_a_ss_ic_p_a_p_e_r byRichard Lewontinin 1972.21_{The a}_v_e_r_{age F}

STobtained withthis study’s markersis not significantlydifferentfrom the average FSTobtained inarecent independent studybasedon SNPs typed in thesesamethreepopulations (13.2 percent versus 12.3 percent; t-testp.0:05Þ:9 _Addi_t_i_on_a_lly_,_t_he average FSTlevelforX-chromosome SNPsisgreater than that forautosomalSNPs (Table1:19.4per cent vs 13.0 percent; t-testp,3:6£10263_Þ_:_Like_w_i_s_e,_t_{he LSBL}_w_a_{s o}_b_s_e_rv_ed_to besignificantlyhigherforX-chromosomal markerscompared with autosomal markersinall three populations.A faster rate of evolutionforX-chromosomemarkershasbeen noted previously27,28_a_n_d _t_he _pot_e_nt_ia_l_ca_us_e_s_di_s_c_uss_{ed i}_n_de_t_ai_l.29 A higheraverage level of X-chromosomaldifferentiationis consistent withthe action of higher selection pressure, especiallyin the African-American sample,wherethe heterozygosityis notdecreased; however, additional popu -lation samples withoutadmixture areneeded,particularlyWest African samples, before conclusions to thiseffect are drawn.

Geographic distribution

Although geneticvariationbetween majorcontinentalgroups representsaminorfraction ofthe total variation observed in humans, it is misleadingto describevariationbetween world populationsas negligible.There isawide dispersion of FST valuesat the genomiclevel.Most of the 8,525markersan a-lysed in this study show small allele frequency differences between populations;however,there isa subset ofmarkers showingveryhigh FSTvaluesand, asillustrated inFigures 3A, 3B and 4, theseloci canhavemajoreffects on the observed clusteringofpopulationsand individualsaccordingtogeo -graphical origin.At the continental level, bootstrap supportis high for individuals belongingto majorbranches ofthis tree. Japanese and Chinese individualscluster in two separate groups within the EastAsianbranch, but show alower bootstrap level.Demographic factors,restricted gene flowand natural selectiondriving adaptation to differentenvironments haveresulted ingenetic divergence between majorhuman continentalgroups thatcanbe captured atboththepopulation and the individual level using a largenumber ofmarkers. Theseresultsconfirmand extend earlier work by

Cavalli-Sforza and colleagues.4,30_A_s_de_monstr_a_t_{ed b}_y_M_oun -tainand Cavalli-Sforza, clusteringrelationshipsare expectedto change asadditional populationsare analysed and thenumber ofsubjectsisincreased.30_T_{o som}_e_un_k_nown_e_xt_e_nt_,_t_he_l_a_r_ge degreeofseparation observed in the PCAplotandon thetrees isaresult of having datarepresenting geographicallyextreme populations to the exclusion of intermediate groups.

Itisalsoimportant to notethatadmixture canhave apro -found impact on the genetic clusteringof individuals.31,32_B_y contrast withthepatterns observed forEuropean-Americans and EastAsians onboththe PCAplotandthetrees, African

-Americanindividualsdo notcluster tightly orinasimilarly globularfashion.We estimatedrelative individualancestry levelsin the African-American sampleusing STRUCTURE 2.015_a_n_{d c}_omp_a_r_ed_t_he_s_e_w_i_t_h_t_{he b}_r_a_n_chi_n_g_p_a_tt_e_{rn o}_f_t_he_tr_ee of individuals.There is remarkable correspondence between the individualadmixture estimates obtainedwith STRUC-TURE and bothtree branchingorderðr¼0:983;p,0:0001Þ and the PCAresultsðr¼0:988;p,0:0001Þ:Whiletherole of admixture in theorigins of African-American populationsis widelyappreciated,22,23_t_{he e}_xt_e_{nt to w}_hich_non_-E_urop_ea_n ancestryis presentinEuropean-Americanshas receivedmuch lessattention.32_O_n_{e i}_n_di_v_id_u_a_{l (}_E_u_5—C_or_ie_ll#_NA₁₇₂₀₅₎ amongthe 44 European-Americans stands outfrom theothers onboththetreesandthe PCA graph,suggesting asignificant proportion ofnon-Europeanancestry.Indeed,this person clusters with South Asians (fromIndia)in separate analyses (data not shown).Theseresultsemphasisethatin some contempor -ary populations,quantitative descriptions ofthe genetic clusteringof individuals28_m_a_y_be_mor_{e a}_ppropr_ia_t_e_t_ha_n dichotomousclassifications.1,11,33,34

Inaddition toconsidering itsinfluenceongenetic clus -tering, it isimportant to recognise howadmixture canaffect the magnitude of LSBLs. As showninTable1,the effect of Europeanancestryin the African-American sample isbothto shortenAfrican-Americanand European-Americanbranches and to lengthenEastAsianbranches.Since more individuals of Europeanancestry wereused toidentifyandvalidatethe TSC SNPs, ascertainmentbias maybe affectingthe overall magnitude of branchlength.35 _A_lt_h_ou_{gh ad}_m_i_xtur_{e ha}_s_a dramatic effect on average branchlength, however,there isa high correlationbetweenLSBLscalculated withrawallele frequenciesand those calculatedwith ancestry-adjusted allele frequencies.

Genomic distribution

Althoughpotentially useful and descriptive,qualitative and quantitative assessments of individualandpopulationaffil ia-tionsand phylogeniesare heuristicrather than definitive statements regarding geneticvariation.Average values may describe importanthistoricaland demographic aspects of both individualsand populations underconsideration; however, thenatureof humangenomicvariationis suchthat there is no one history.6 _I_n_de_p_e_n_de_nt_a_ssortm_e_nt_,_r_ec_om_bi_n_a_t_i_on_, natural selectionand genetic drift haveresulted in tens of thousands of genomicregions, eachwith aunique history.The identificationand subsequentexclusion ofloci respondingto selectivepressuresallowsforamorerealistic assessment of populationdemographics. Thisapproach, first recognised by Lewontinand Krakauer,36_ha_{s s}_i_n_{ce bee}_n_e_xp_a_n_ded_upon _i_n other analysesinterrogatingthe genome for markersaffected by natural selection.9,37–39_M_u_ch_o_f_t_hi_s_f_o_c_us _ha_s_bee_n_c_on -centratedonFST, whichsummarises the proportion of total variationduetogroupdifferences.We haveusedpairwise

Shriveretal:

Review

PRIMARY RESEARCH

(12)

FSTmeasures tocalculatethe three LSBLs,thuseffectively decomposing the fullFSTintocomponent parts.In this way, we canisolate and evaluatethepopulation-specific changesin allele frequency.

To test whetherLSBL capturesevolutionary history, we compared branchlengthlevelsfor pairs ofmarkers (see Figure 5).Levels of correlation,which are high forclosely spaced SNPs, decrease asa function of inter-marker spacing. Ona genomic scale, nearby regions sharemore in terms of common evolutionaryhistories than do morewidely spaced regions. Arelationship between the correlation of FSTfor marker pairsand inter-markerdistancewasfirst shown by Akeyetal.9 _us_i_n_{g a}_su_b_s_e_{t o}_f_t_{he da}_t_{a a}_n_a_lys_{ed he}_r_e_. _The_s_e researchersfound that the correlation observed betweenFST and inter-markerdistancewas stronger thanasimulated coalescent distributionassuming selectiveneutrality.They interpretedthishighercorrelationas the footprint of adaptive hitchhiking.While adaptiveselectionhas unquestionably occurred at particulargenomic locations,these correlations represent summaries of dataon markersfromacross the genome.Since demographic events willalsoaffect thelevels of linkage disequilibriumand haplotype block characteristics, theyare expectedtoaffect relationshipsbetween levels of evolutionand inter-markerdistance. Thus,more generally, it canbe concluded that these correlationsinbranchlength and FST levelsare functions oftheshared evolutionary histories of closely linked markersandreflectanon-uniform distribution of humangeneticsubstructure across the genome.

The multiform distribution of genetic substructure has significantimplicationsfor research,not onlyinevolutionary, but alsoinbiomedicalcontexts. Using LSBLtoisolate allele frequency change allowsfor the identificationand su b-sequentinvestigation of genomic regions that are candidates forhaving experienced recentdirectional selectionby virtue of containing clusters of outlying SNPs. Additional work is required todevelop statistics which combinepositionaland branch length information so that regions least likely to have been the result of genetic drift alone canbe identified.The humangenome hasamultivariate history;consequently, efforts tocontrolfor population structure(eg genomic con -trol,40 _stru_c_tur_{ed a}_sso_cia_t_i_on41 _a_n_{d c}_om_bi_n_ed _m_e_t_h_o_d_s42₎_ca_n andshould be improved throughmarker selectionefforts.It will ultimatelybepossibleto produce dataon thescalethat we demonstrate hereonaroutine basisindisease association studies. Until that time, however,smaller sets of informative markerscanbeselected from largesurveysbecausetheyare informative for particularaxesacross whichpopulation su b-structure isfound.Such selectedsets of markerscanbeused toefficientlydetectand adjustforeven very highlevels of population substructure.32,42 _I_{n sum}_,_t_he_s_{e a}_n_a_lys_e_s_, _w_hich reveala non-uniform distribution of humangenetic su b-structure, suggestaparadigm relevant to the further explor a-tions of genotype/phenotyperelationships both withinand among populations.

Ack

nowl

edg

m

e

nts

This workwas supported in partbya grant: NIH/NHGRI(HG02154) to

MDS.Wewouldliketoacknowledge helpfuldiscussions with Rick Kittles, Nik Schork, Kateryna Makova and Bruce Lindsey.

Refe

r

e

n

ce

s

1. Risch, N., Burchard, E., Ziv, E.etal.(2002),‘Categorization of humans

inbiomedical research: Genes,race and disease’,Genome Biol.Vol. 3,

pp. 1–12.

2. Schork, N., Fallin, D., Tiwari, H.K.etal.(2001),‘Pharmacogenetics’, in: Balding, D., Bishop, M.and Cannings, C. (eds.),Handbookof Statistical

Genetics, JohnWileyand Sons, Hoboken, NJ,pp. 741–764.

3. Burroughs, V.J., Maxey, R.W.and Levy, R.A. (2002),‘Racialand ethnic differencesin responseto medicines: Towardsindividualizedpharmaceu

-tical treatment’,J.Natl.Med.Assoc.Vol. 94,pp. 1–26.

4. Bowcock, A.M., Ruiz-Linares, A., Tomfohrde, J.etal.(1994),‘High

resolution of humanevolutionary trees withpolymorphicmicrosatellites’, NatureVol. 368,pp.455–457.

5. Rosenberg, N.A., Pritchard, J.K., Weber, J.L.etal.(2002),‘Genetic

structureof human populations’,ScienceVol. 298,pp. 2381–2385. 6. Cavali-Sforza, L.L. (1966),‘Population structure and humanevolution’,

Proc.R.Soc.Lond.B.Biol.Sci.Vol. 164,pp. 362–379.

7. Black, IVth, W.C., Baer, C.F., Antolin, M.F.etal.(2001),‘Population

genomics: Genome-widesamplingof insect populations’,Annu.Rev.

Entomol.Vol.46,pp.441–469.

8. Kennedy, G.C., Matsuzaki, H., Dong, S.etal.(2003),‘Large-scale genotypingof complexDNA’,Nat.Biotechnol.Vol. 21,pp. 1233–1237. 9. Akey, J.M., Zhang, G., Zhang, K.etal.(2002),‘Interrogating a

high-densitySNPmapfor signatures ofnatural selection’,Genome Res.Vol. 12,

pp. 1805–1814.

10.Chee, M., Yang, R., Hubbell, E.etal.(1996),‘Accessing genetic infor

-mation with high-densityDNA arrays’,ScienceVol. 274,pp. 610–614. 11.Chakraborty, R.and Jin, L. (1993), in: Pena, S.D.J., Jefferys, A.J., Epplen,

J.and Chakraborty, R. (eds.),Aunified approachto studyhypervariable

polymorphisms: Statisticalconsiderations of determiningrelatednessandpopulation

distancesDNA Fingerprinting: CurrentStateofthe Science, Birkhauser, Basel, Switzerland, Vol.EXS Vol. 67,pp. 153–175.

12.Saitou, N.and Nei, M. (1987),‘Theneighbor-joiningmethod: Anew method for reconstructingphylogenetictrees’,Mol.Biol.Evol.Vol.4,

pp.406–425.

13.Kumar, S., Tamura, K., Jakobesen, I.B.etal.(2001),‘MEGA2: Molecular

evolutionarygeneticsanalysis software’,BioinformaticsVol. 17,

pp. 1244–1245.

14.Rohlf, F.J. (1992), NTSYS-pcversion 1.70.

15.Pritchard, J.K., Stephens, M.and Donelly, P. (2000),‘Inferenceof

population structure from multilocusgenotype data’,GeneticsVol. 155,

pp. 945–959.

16.Hudson, R.R. (2002),‘Generatingsamples undera Wright-Fisher neutral model’,BioinformaticsVol. 18,pp. 337–338.

17.Weir, B.S.and Cockerham, C.C. (1984),‘Estimating F-statisticsfor the analysis ofpopulation substructure’,EvolutionVol. 38,pp. 1358–1370. 18.Romualdi, C., Balding, D., Nasidze, I.S.etal.(2002),‘Patterns of human

diversity,withinand among continents, inferred frombiallelic DNA

polymorphisms’,Genome Res.Vol. 12,pp. 602–612.

19.Jorde, L.B., Watkins, W.S., Bamshad, M.J.etal.(2000),‘The distr i-bution of humangenetic diversity: A comparison ofmitochondrial, auto

-somal, and Y-chromosomaldata’,Am.J.Hum.Genet.Vol. 66,pp. 979–988. 20.Cavalli-Sforza, L.L., Menozzi, P.and Piazza, A. (1994), The Historyand Geography of HumanGenes, PrincetonUniversityPress, Princeton, NJ. 21.Lewontin, R. (1972),‘The apportionment of humandiversity’,Evol.

Biol.Vol. 6,pp. 381–398.

22.Pfaff, C.L., Parra, E.J., Bonilla, C.etal.(2001),‘Population structure in

admixedpopulations: Effects of admixture dynamics on thepattern of

linkage disequilibrium’,Am.J Hum.Genet.Vol. 68,pp. 198–207.

(13)

23.Parra, E.J., Marcini, A., Akey, J.etal.(1998),‘Estimating African

Americanadmixtureproportionsby useofpopulation specific alleles’,Am.

J.Hum.Genet.Vol. 63,pp. 1839–1851.

24.Watkins, W.S., Ricker, C.E., Bamshad, M.J.etal.(2001),‘Patterns of ancestralhumandiversity: Ananalysis of Aluinsertionandrestriction site

polymorphisms’,Am.J.Hum.Genet.Vol. 68,pp. 738–752.

25.Nei, M.and Takezaki, N. (1996),‘Theroot ofthephylogenetictreeof human populations’,Mol.Biol.Evol.Vol. 13,pp. 170–177.

26.Stringer, C. (2002),‘Modernhuman origins: Progressandprospects’,Phil.

Trans.R.Soc.Lond.Vol. 357,pp.563–579.

27.Charlesworth, B., Coyne, J.A.and Barton, N.H. (1987),‘Therelativerates of evolution ofsexchromosomesand autosomes’,Am.Nat.Vol. 130,

pp. 113–149.

28.Payseur, B.A., Cutter, A.J.and Nachman, M.W. (2002),‘Searching for

evidenceofnatural selectionin the genomeusingmicrosatellitevar ia-bility’,Mol.Biol.Evol.Vol. 19,pp. 1143–1153.

29.Kayser, M., Brauer, S.and Stoneking, M. (2003),‘A genomescan to

detectcandidateregionsinfluenced byLocalNaturalSelectioninHuman

Populations’,Mol.Biol.Evol.Vol. 20,pp.893–900.

30.Mountain, J.and Cavalli-Sforza, L.L. (1997),‘Multilocusgenotypes, atree

of individualsand humanevolutionaryhistory’,Am.J.Hum.Genet.

Vol. 61,pp. 705–718.

31.McKeigue, P.M., Carpenter, J., Parra, E.J.etal.(2000),‘Estimation of admixture and detection oflinkage inadmixedpopulationsbya Bayesian

approachusing Markovchain simulation: Application toAfrican -American populations’,Ann.Hum.Genet.Vol. 64,pp. 171–186. 32.Shriver, M.D., Parra, E.J., Dios, S.etal.(2003),‘Skin pigmentation,

biogeographicalancestryand admixturemapping’,Hum.Genet.Vol. 112,

pp. 387–399.

33.Wilson, J.F., Weale, M.E., Smith, A.C. et al. (2001), ‘Population

genetic structure of variable drug response’, Nat. Genet.Vol. 29,

pp. 265–269.

34.Bamshad, M.J., Wooding, S., Watkins, W.S.etal.(2003),‘Human popu

-lationgeneticstructure and inferenceof group membership’,Am.J.Hum.

Genet.Vol. 72,pp.578–589.

35.Mountain, J.and Cavalli-Sforza, L.L. (1994),‘Inferenceof human

evolution through cladistic analysis ofnuclearDNArestriction polymorphisms’,Proc.Natl.Acad.Sci.USAVol. 91,pp. 6515–6519. 36.Lewontin, R.C.and Krakauer, J. (1973),‘Distribution of gene frequency

asatest ofthetheory oftheselectiveneutrality ofpolymorphisms’, GeneticsVol. 74,pp. 175–195.

37.Bowcock, A.M., Kidd, J.R., Mountain, J.L.etal.(1991),‘Drift, admixture, andselectioninhumanevolution: Astudy with

DNApolymorphisms’,Proc.Natl.Acad.Sci.USAVol.88,pp.839–843. 38.Beaumont, M.A.and Nichols, R.A. (1996),‘Evaluatingloci for use in

the genetic analysis ofpopulation structure’,Proc.Biol.Soc.Vol. 263,

pp. 1619–1626.

39.Vitalis, R., Dawson, K.and Boursot, P. (2001),‘Interpretation ofvariation

across marker loci asevidenceofselection’,GeneticsVol. 158,

pp. 1811–1823.

40.Devlin, B.and Roeder, K. (1999),‘Genomic controlforassociation studies’,BiometricsVol.55,pp. 997–1004.

41.Pritchard, J.K.and Donnelly, P. (2001),‘Case-control studies of association

in structuredoradmixedpopulations’,Theor.Popul.Biol.Vol. 60,

pp. 227–237.

42.Hoggart, C.J., Parra, E.J., Shriver, M.D.etal.(2003),‘Control of confoundingof genetic associationsin stratifiedpopulations’,Am.J.Hum.

Genet.Vol. 72,pp. 1492–1504.

Shriveretal:

Review