Omics and Arrays
3.5 Comparative Genomics Comparative genomics has been used to
3.5.1 Comparative maps
A comparative map aligns two or more species-specific maps using common sets of markers or sequences. It requires the identification of regions of sequence similarity (typically genes) in the genomes of different species or genera. Such similarity can be identified because the sequences share a common evolutionary origin. Gene repertoire and gene order may be conserved over large chromosomal segments between closely related species. The long-term goals of comparative genomics are to establish relationships between map, sequence and functional genomic information across all plant species and to facilitate taxonomic and phylogenetic studies in higher plants.
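As a minimal sketch of what aligning two maps by common markers involves, the following Python fragment pairs up the markers shared by two species-specific maps. The marker names, chromosome labels and cM positions are invented for illustration, and the function name `comparative_map` is hypothetical; real comparative maps are built from far larger marker sets and require orthology to be established first.

```python
# Hypothetical marker maps: marker name -> (chromosome, position in cM).
# All names and positions below are invented for illustration only.
species_a = {
    "m1": ("A1", 5.0),
    "m2": ("A1", 12.5),
    "m3": ("A1", 30.0),
    "m4": ("A2", 8.0),   # mapped only in species A
}
species_b = {
    "m1": ("B3", 40.0),
    "m2": ("B3", 47.0),
    "m3": ("B3", 66.0),
    "m5": ("B1", 3.0),   # mapped only in species B
}


def comparative_map(map_a, map_b):
    """Pair up the markers present in both maps, ordered by location in map_a."""
    shared = sorted(set(map_a) & set(map_b), key=lambda m: map_a[m])
    return [(m, map_a[m], map_b[m]) for m in shared]


rows = comparative_map(species_a, species_b)
for marker, loc_a, loc_b in rows:
    print(marker, loc_a, loc_b)
```

In this toy case the three shared markers lie on one chromosome in each species and in the same order, which is the pattern a comparative map is designed to reveal.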
Importance of comparative maps
The objective of the development of a comparative map is to identify subsets of genes that have remained relatively stable in both sequence and copy number since the radiation of flowering plants from their last common ancestor. Why are comparative maps so important? First, eukaryotic genomes are organized into chromosomes and maps summarize genetic information using chromosomes as the organizational principle. Secondly, conservation of gene identity and gene order along the chromosomes determines potential for sexual reproduction; disruption leads to speciation and major evolutionary change. Thirdly, species maps provide the context for the study of inheritance and chart the history of genetic change. Fourthly, comparative maps are the major tools for ferrying genetic information back and forth across species and genera in a systematic fashion.
Once chromosomal duplications are identified in a genome and the timing of a duplication/polyploidization event has been determined relative to angiosperm divergence nodes, ancestral gene order within the duplicated segments can be inferred. Map comparisons across divergent genera show greater conservation of ancestral gene order and gene repertoire once genome-wide duplication/gene loss within each genome is accounted for. Map comparisons between closely related species are largely unaffected because most duplications pre-date them.
Comparative maps lay the groundwork for asking whether specific ‘linkage blocks’ or gene arrangements are statistically associated with increased fitness and whether there is a relationship between polyploidy and plant adaptation. For example, comparative linkage mapping and chromosome painting in the close relatives of Arabidopsis have been used to infer an ancestral karyotype of these species. In addition, comparative mapping to Brassica has identified genomic blocks that have been maintained since the divergence of the Arabidopsis and Brassica lineages (Schranz et al., 2007).
An example: Arabidopsis–tomato comparative map
DEVELOPMENT OF ARABIDOPSIS–TOMATO COMPARATIVE MAP TO DETECT MACROSYNTENY. Fulton et al. (2002) identified over 1000 conserved orthologous sequences (COS) between tomato and Arabidopsis by comparing the Arabidopsis genomic sequence with 130,000 tomato ESTs (representing 27,000 unigenes or approximately 50% of the tomato gene content). Of the 1025 COS markers developed, 927 were screened against tomato DNA using Southern analysis to classify them as single, low or multiple copy; 85% were considered to be single or low copy (> 95% of the hybridization signal assigned to three or fewer restriction fragments) and 50% matched a gene of unknown function (Gene Ontology classification). A total of 550 COS markers were mapped on to the tomato genome. The size of conserved segments was generally smaller than 10 cM. Results indicated that multiple polyploidization events punctuate the evolution of Arabidopsis and tomato. Distinguishing orthologues from paralogues is difficult due to reciprocal loss of genes and chromosome segments following polyploidization events.
PHYLOGENETIC ANALYSIS OF CHROMOSOMAL DUPLICATION EVENTS TO DETECT MICROSYNTENY. The Arabidopsis genome sequence was used to analyse internal duplication events based on inferred protein matches between 26,028 genes. A total of 34 non-overlapping chromosomal segment pairs were identified consisting of 23,177 (89%) Arabidopsis genes (Bowers et al., 2003b). To relate this ‘alpha’ duplication to the angiosperm family tree, all duplicated syntenic Arabidopsis gene pairs were compared to individual genes from pine, rice, tomato, Medicago, cotton and Brassica. For each comparison, it was determined whether the inferred protein sequences of a duplicated syntenic Arabidopsis gene pair were more similar to one another than to the heterologous protein from the other species.
RELATIVE AGE OF CHROMOSOMAL DUPLICATION EVENTS. It was concluded that the ‘alpha’ duplication event pre-dated divergence from Brassica about 14.5–20.4 million years ago but post-dated divergence from cotton about 83–86 million years ago.
About 50% (49–64%) of Brassica sequences were more similar to one duplicated Arabidopsis sequence than was the other Arabidopsis sequence to its paralogue. Only 6–19% of cotton, rice, pine, etc. sequences clustered internally to the Arabidopsis syntenic duplicates (Bowers et al., 2003b).
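The reasoning behind this relative dating can be sketched in a few lines of Python. The sketch assumes a simple pairwise similarity score for each comparison and treats a sequence from another species as clustering internally when it is more similar to one Arabidopsis duplicate than the two duplicates are to each other. The scores below are invented for illustration and are not data from Bowers et al. (2003b), whose analysis was phylogenetic rather than a simple comparison of scores.

```python
def clusters_internally(sim_a1_a2, sim_x_a1, sim_x_a2):
    """True if gene X from another species is more similar to one Arabidopsis
    duplicate (A1 or A2) than the two duplicates are to each other."""
    return max(sim_x_a1, sim_x_a2) > sim_a1_a2


def internal_fraction(triples):
    """Fraction of (A1-A2, X-A1, X-A2) similarity triples that cluster internally."""
    hits = sum(clusters_internally(*t) for t in triples)
    return hits / len(triples)


# Invented similarity scores (e.g. fraction of identical residues).
brassica_like = [(0.70, 0.85, 0.68), (0.72, 0.66, 0.80),
                 (0.75, 0.70, 0.71), (0.68, 0.60, 0.65)]
cotton_like = [(0.70, 0.55, 0.58), (0.72, 0.60, 0.61),
               (0.75, 0.78, 0.62), (0.68, 0.50, 0.52)]

print(internal_fraction(brassica_like))  # high fraction: duplication pre-dates the split
print(internal_fraction(cotton_like))    # low fraction: duplication post-dates the split
```

A high internal fraction, as reported for Brassica, is what would be expected if the duplication occurred before the two lineages diverged; a low fraction, as for cotton, suggests that the duplication occurred afterwards.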
POLYPLOID ANCESTRY OF MOST PLANT SPECIES. As more data accumulate, the history of angiosperms emerges as a history of genome-wide duplication followed by massive gene loss (and return to diploidy). Only 30% of Arabidopsis genes have retained syntenic copies in the less than 86 million years since the ‘alpha’ duplication. In contrast, mammals appear to harbour fewer polyploidization events and less cycling of duplicated genes; 70% of human and mouse proteins show conserved synteny after 100 million years of evolution.
3.5.2 Collinearity
Orthology and paralogy
Figure 3.9 shows the concepts of orthology and paralogy. Orthologues and paralogues are two types of homologous sequence. Orthology describes genes in different species that derive from a common ancestor. Orthologous genes may or may not have the same function. Paralogy describes genes that have duplicated (tandemly or moved to a new location) within a genome since they descended from a common ancestral gene. The word ‘synteny’ (from the Greek syn, together, and taenie, ribbon) refers to linkage of genes along a chromosome; it is currently used to indicate conservation of gene order across species. From this definition, macrosynteny means conservation of gene order across species detected at low resolution (i.e. genetic maps) while microsynteny means conservation of gene order across species analysed at high resolution (i.e. physical or sequence-based maps).
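One simple way to quantify conservation of gene order, offered here only as an illustrative sketch and not as a method used in the studies cited in this chapter, is to take the genes in their order along one species, record the rank of each orthologue along the other species and find the longest run that preserves order (a longest increasing subsequence). The function name and the input ranks are hypothetical.

```python
from bisect import bisect_left


def longest_collinear_run(order_in_b):
    """Length of the longest strictly increasing subsequence of `order_in_b`.

    `order_in_b` lists, for genes taken in their order along species A,
    the rank of each gene's orthologue along species B.
    """
    tails = []  # tails[k] = smallest last rank of an increasing run of length k + 1
    for rank in order_in_b:
        i = bisect_left(tails, rank)
        if i == len(tails):
            tails.append(rank)   # rank extends the longest run found so far
        else:
            tails[i] = rank      # rank gives a smaller tail for runs of length i + 1
    return len(tails)


# Six orthologous genes; the gene ranked 5 in species B is out of place.
ranks = [1, 2, 5, 3, 4, 6]
print(longest_collinear_run(ranks))  # → 5
```

This sketch only detects order conserved in the same orientation; an inverted segment would have to be checked separately by reversing one of the two orders.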
Macrocollinearity
Significant genomic collinearity in plants has been shown by comparative genetic mapping and genome sequencing, although plant genomes vary greatly in genome size and chromosome number and morphology.
Fig. 3.9. The concepts of orthology and paralogy (from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html). [Diagram labels: homologues; orthologues (frog α, chick α, mouse α); paralogues (mouse α, mouse β); orthologues (mouse β, chick β, frog β); α-chain gene; β-chain gene; early globin gene; gene duplication.]
Comparative mapping of cereal genomes using low copy number, cross-hybridizing genetic markers has provided compelling evidence for a high level of conservation of gene order across regions spanning many megabases (i.e. macrocollinearity). Initial studies of the organization of grass genomes indicated that individual rice chromosomes were highly collinear with those of several other grass species and extensive work has shown a remarkable conservation of large segments of linkage groups within rice, maize, sorghum, barley, wheat, rye, sugarcane and other agriculturally important grasses (e.g. Ahn and Tanksley, 1993; Kurata et al., 1994; van Deynze et al., 1995a; Wilson et al., 1999). These studies led to the prediction that grasses could be studied as a single syntenic genome. The macrocollinearity was summarized by Gale and Devos (1998) for rice and seven other cereals using what is now known as the ‘circle diagram’ (Plate 1). Further studies identified QTL controlling important agronomic traits which showed similarities in locations for the same or similar traits (as reviewed by Xu, 1997). Shattering and plant height are examples that were also mapped to collinear regions among grass genomes (Paterson et al., 1995; Peng et al., 1999).
More recently, Chen et al. (2003) identified four QTL for quantitative resistance to rice blast that showed corresponding map positions between rice and barley, two of which had completely conserved isolate specificity while the other two had partially conserved isolate specificity. Such corresponding locations and conserved specificity suggested a common origin and conserved functionality of the genes underlying the QTL for quantitative resistance, which may be used to discover genes, understand the function of the genomes and identify the evolutionary forces that structured the organization of the grass genomes. Such findings reinforce the notion of collinearity among the cereal genomes.
This unified grass genome model has had a substantial impact upon plant biology but has not yet lived up to its potential. There are some difficulties in evaluating synteny between genomes at the macro-level (Xu et al., 2005). First, the genomic marker data are very incomplete and genomic sequence data are largely lacking for many grass species. Secondly, the data are sometimes biased because the homologous DNA probes used in comparative mapping are selected for simple cross-hybridization patterns. Thirdly, many genes are members of gene families and, accordingly, it is often difficult to determine if a gene mapped in the second species is orthologous or paralogous to that in the first species. Fourthly, the collinearity of gene order and content observed at the recombinational map level is often not observed at the level of local genome structure (Bennetzen and Ramakrishna, 2002). Finally, in most early studies, no statistical analysis was used to evaluate whether the presence of a few markers in the same order on two chromosomal segments in two species occurs by chance or is truly significant.
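The last point can be made concrete with a small calculation. As a sketch under a deliberately simple null model (an assumption, not an analysis taken from the studies cited), suppose that the order of k shared markers in the second species is a uniformly random permutation. The chance that the markers appear in exactly the same order, or in the fully reversed order, is then 2/k! for k of two or more. The function names are hypothetical, and the second function checks the formula by brute-force enumeration.

```python
from itertools import permutations
from math import factorial


def chance_of_same_order(k):
    """Probability that k markers fall in identical or fully reversed order
    if their order in the second species were a uniformly random permutation."""
    if k < 2:
        return 1.0
    return 2 / factorial(k)


def enumerate_chance(k):
    """Same probability obtained by enumerating every permutation (small k only)."""
    ref = tuple(range(k))
    rev = ref[::-1]
    hits = sum(1 for p in permutations(ref) if p == ref or p == rev)
    return hits / factorial(k)


for k in (3, 5, 8):
    print(k, chance_of_same_order(k))
```

Under this null model three markers in conserved order are unremarkable (a probability of one in three), whereas eight would be very unlikely to arise by chance. A realistic test would also need to account for the number of segments compared and for non-random marker selection.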
The genome collinearity of several Camelineae and Brassicaceae species has recently been compared to that of A. thaliana by comparative genetic linkage mapping and comparative chromosome painting (Schranz et al., 2007). A comprehensive study identified 21 syntenic blocks that are shared by the Brassica napus and A. thaliana genomes, corresponding to 90% of the B. napus genome (Parkin et al., 2005).
Microcollinearity
Using the rice genome sequence as the reference for comparison with molecular marker information from other cereals indicated many more rearrangements than had been expected from Gale and Devos’s (1998) concentric circles model. One such comparison involved more than 2600 mapped sequenced markers in maize among which only 656 putative orthologous genes could be identified (Salse et al., 2004). The comparison of the wheat genetic map with the rice sequence also suggests numerous rearrangements between the two genomes with a high frequency of breakdowns in collinearity (Sorrells et al., 2003). Extensive comparisons have also been made between sorghum and rice (Klein et al., 2003; The Rice Chromosome 10 Sequencing Consortium, 2003). To align the sorghum physical map with the rice map, sorghum BAC clones were selected from the minimum tiling path of chromosome 3. Unique partial sequences were obtained from each BAC clone and could be directly compared with the rice sequence. This approach revealed excellent conservation between the overall structure and gene order of sorghum chromosome 3 and rice chromosome 1 but also indicated several rearrangements. Together, these studies indicate a general conservation of large syntenic blocks within cereals but with many more rearrangements and synteny breakdowns than originally anticipated.
This trend is even more obvious when synteny is analysed at the sequence level. Rearrangements may occur that involve regions smaller than a few centimorgans and would be missed by most recombinational mapping studies. Comparative sequence analysis involving large genomic segments can detect these rearrangements. Such analyses reveal the composition, organization and functional components of genomes and provide insight into regional differences in composition between related species.
Recently, the sequencing of genomic segments in the cereals has enabled microcollinearity across genes or gene clusters to be investigated. Sequencing of the domestication locus Q in Triticum monococcum revealed excellent collinearity with the bread wheat genetic map (Faris et al., 2003). Following the sequencing of the leaf-rust-resistance locus Rph7 from barley, it was observed that this locus is flanked by two HGA genes. The orthologous locus in rice chromosome 1 consists of five HGA genes. In barley, only four of the five HGA genes are present, one is duplicated as a pseudogene and six additional pseudogenes have been inserted in between the HGA genes. These six genes have homologues on eight different rice chromosomes (Brunner et al., 2003). The most striking rearrangement was revealed by the comparison of 100 kb around the Bronze locus of two maize lines. Not only does the retrotransposon distribution differ between the two lines but the genes themselves could also be different (Fu and Dooner, 2002). Comparison of the low molecular weight glutenin locus between T. monococcum and Triticum durum also revealed dramatic rearrangements: more than 90% of the sequence diverged because of retro-element insertions and because different genes are present at this locus (Wicker et al., 2003). Therefore collinearity can be lost very rapidly within two genomes from the same species.
With the sequencing of long regions, several studies in cereals have demonstrated incomplete microcollinearity at the sequence level. Song et al. (2002) identified orthologous regions from maize, sorghum and two subspecies of rice. It was found that gross macrocollinearity is maintained but microcollinearity is incomplete among these cereals. Deviations from gene collinearity are attributable to micro-rearrangement or small-scale genomic changes such as gene insertions, deletions, duplications or inversions. In the region under study, the orthologous region was found to contain six genes in rice, 15 in sorghum and 13 in maize. In maize and sorghum, gene amplification caused a local expansion of conserved genes but did not disrupt their order or orientation. As indicated by Bennetzen and Ma (2003), numerous local rearrangements differentiate the structures of different cereal genomes. On average, any comparison of a ten-gene segment between rice and a distant grass relative such as barley, maize, sorghum or wheat shows one or two rearrangements that involve genes. A simple extrapolation to the rice genome of about 40,000 genes (Goff et al., 2002) suggests that about 6000 genic rearrangements occurred which differentiate rice from any of the other cereals. Most of these rearrangements appear to be tiny and thus would not interfere with the macrocollinearity observed by recombinational mapping. There are exceptions however, which include chromosomal arm translocations and movements of single genes to different chromosomes (Bennetzen and Ma, 2003).
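The extrapolation in the preceding paragraph can be reproduced with simple arithmetic. The sketch below uses only the figures quoted in the text (about 40,000 genes and one or two rearrangements per ten-gene segment). Taking the midpoint of the resulting range is an assumption about how the figure of about 6000 was reached.

```python
genes_in_rice = 40_000                 # approximate gene count quoted in the text
segment_size = 10                      # genes per compared segment
rearrangements_per_segment = (1, 2)    # "one or two" per ten-gene segment

segments = genes_in_rice // segment_size            # number of ten-gene segments
low, high = (segments * r for r in rearrangements_per_segment)
midpoint = (low + high) / 2                          # assumed basis of "about 6000"

print(segments, low, high, midpoint)
```

The range of 4000 to 8000 rearrangements brackets the quoted estimate, with 6000 at its midpoint.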
As expected, there is a high degree of gene conservation between the two shotgun-sequenced subspecies of rice, japonica and indica, which diverged more than 1 million years ago. On careful inspection, however, narrow regions of divergence can be found in these genomes (Song et al., 2002). These regions correspond to areas of increased divergence among rice, sorghum and maize, suggesting that the alignment of the two rice subspecies might be useful for identifying regions of cereal genomes that are prone to rapid evolution. Similar comparative analyses of Arabidopsis accessions have shown that both the relocation of genes and the sequence polymorphisms between accessions (in both coding and non-coding regions) are common in the Arabidopsis genome (The Arabidopsis Genome Initiative, 2000). Intraspecific violation of collinearity has also been identified in maize (Fu and Dooner, 2002). Han and Xue (2003) also discovered significant numbers of rearrangements and polymorphisms when comparing indica and japonica genomes in rice. The deviations from collinearity are frequently due to insertions or deletions. Intraspecific sequence polymorphisms commonly occur in both coding and non-coding regions. These variations often affect gene structures and may contribute to intraspecific phenotypic adaptations.
Implications of genome collinearity
Genomics would be much simpler if the order of genes were common (syntenic) across the major groups of plants. The usefulness of the collinearity between the genomes of model plants and important crops can be assessed by the number of failures or successes in its exploitation. For example, the analysis of the Arabidopsis sequence provides information that will facilitate the annotation of the rice sequence and likewise sequencing Medicago provides a resource for research on important crop legumes. Furthermore, the effort put into sequencing and annotating the rice genome has also been rewarded, as this annotation will be transferred to related sequences and used repeatedly in the future. The synteny between the monocots will help decipher the structure and function of the more complex genomes. A fully assembled rice sequence allows more accurate assessment of the macro- and microsynteny of rice with other cereals (Xu et al., 2005).
The advent of technologies for mapping genomes directly at the DNA level has made comparative genetic mapping among sexually incompatible species possible. Extensive comparative maps for marker genes have been constructed for a number of plant taxa, including species in the Poaceae (rice, maize, sorghum, barley and wheat), Solanaceae (tomato, potato and pepper) and Brassicaceae (Arabidopsis, cabbages, mustard, turnip and rape). As a result, the concept of a single genetic or ancestral map for all grasses, with species-specific modifications, is emerging (Moore et al., 1995). The extensive collinearity of wheat, rye, barley, rice and maize suggests that it may be possible to reconstruct a map of the ancestral cereal genome. These conserved gene orders and the possibility of sharing DNA probes and PCR primers across species will greatly extend the power of mapping analysis by facilitating the molecular analysis of the corresponding chromosomal regions in different species and allowing information, and perhaps DNA sequences and genes, to be transferred quickly and efficiently between different species.
The challenge of finding which map, sequence and eventually functional genomic information from one species can be accessed, compared and exploited across all plant species will require the identification of a subset of plant genes that have remained relatively stable in both sequence and copy number since the radiation of flowering plants from their last common ancestor. Identification of such a set of genes would also facilitate taxonomic and phylogenetic studies in higher plants that are presently based on a very small set of highly conserved sequences, such as those of chloroplast and mitochondrial genes. The conserved orthologue set of markers, identified computationally and experimentally, may further studies on comparative genomes and phylogenetics and elucidate the nature of genes conserved throughout plant evolution.
Completed genome sequences provide templates for the design of genome analysis tools in orphan species lacking sequence information. For example, Feltus et al. (2006) designed 384 PCR primers to conserved exonic regions flanking introns using sorghum and millet EST alignments to the rice genome. These conserved-intron scanning primers (CISP) amplified single-copy loci with 37–80% success rates, i.e. sampling most of the approximately 50 million years of divergence among grass species. When evaluating 124 CISPs across rice, sorghum, millet, Bermuda grass, teff, maize, wheat and barley, about 18.5% of them seemed to be subject to rigid intron size constraints that were independent of per-nucleotide DNA sequence variation. Likewise, about 487 conserved non-coding sequence motifs were identified in 129 CISP loci. As pointed out by Feltus et al. (2006), CISP provides the means to effectively explore poorly characterized genomes for both polymorphism and non-coding sequence conservation on a genome-wide or candidate gene basis and also anchor points for comparative genomics across a diverse range of species. After the whole genomes of the major food crops have been sequenced, plant breeders will be able to access new gene tools that will facilitate the selection of outstanding individuals characterized by resistance to biotic and abiotic stresses and good seed quality, thus enabling breeders to produce new cultivars in addition to those currently available.