MATERIALS AND METHODS - Application of comparative genomics for detection of genomic features a

Mapping of query probe sets to model genomes

The goal of MGI is to link gene expression data from one species to the genome locations of corresponding genes in a model genome, typically but not necessarily of another species.

The implementation of this task in MGI involves the following data sets and programs: (1) Query probe sets are represented by the (pseudo-) transcripts (often EST assemblies) that they are derived from (often called “unigenes”). This transcript set will be referred to as probeU in the following. (2) The whole genome sequence of a model species (labeled modelG in the following). (3) The identified transcriptome of the model species, as per

available full-length cDNAs (set referred to as modelT). (4) The annotated proteome of the model species, represented by the translation products of the annotated protein-coding gene structures (set labeled modelP). (5) GeneSeqer (Usuka et al., 2000; Brendel et al., 2004), a spliced alignment program used to map transcripts from probeU to the chromosomes in modelG. (6) BLAST+ (Altschul et al., 1990; Camacho et al., 2009), used to match probeU entries to modelT with the BLASTn program and to modelP with the BLASTx program.

Details of the workflow are given below.

Mapping of probeU onto modelG using GeneSeqer. In the case that probeU and modelG are from the same species (here either Arabidopsis or rice), GeneSeqer was run with recommended strict matching parameters (program options -x 15 -y 30 -z 45 -w 0.7).

Otherwise, parameters for less stringent matching (allowing for more sequence divergence) were used (program options -x 12 -y 16 -z 24 -w 0.2). In both cases, GeneSeqer reported matches were only retained if the fulfilled the following criteria: (1) The product of the GeneSeqer reported similarity and coverage scores exceeds 0.4, and (2) For each probeU entry, non-maximal scoring hits were retained only if their product of similarity by coverage scores was at least 90% as good as the score of the maximal scoring match. (3) If more than ten hits satisfied criteria (1) and (2), then only the top ten were retained. These criteria reduce the possibility of chance matches to non-homologous genes, while reporting matches to multiple locations in case the homologous genes are represented in a multi-gene family and cannot be distinguished reliably.

Matching of probeU to modelT using blastn. In some cases, direct mapping of probeU entries to the model genome modelG may fail because too much sequence divergence does not allow reliable matching to loci on the whole genome scale. As a complementary approach, probeU entries were compared to the smaller search space provided by known model transcripts (modelT) using BLASTn. Hits were retained if the reported E-value was less than 1e-20.

Matching of probeU to modelP using BLASTx. The ProbeU entries were also compared to annotated model protein sequences (modelP) using BLASTx. Hits were retained under the same criteria as for the BLASTn search described above.

Consolidation of matches. Inclusion in the MGI data set requires a GeneSeqer alignment to the model genome. In cases when the BLAST+ searches indicated matches for a probeU entry that did not get aligned in the initial GeneSeqer run, then GeneSeqer was re-run for that entry using less stringent matching conditions (program options -x 12 –y 12 –z 12).

Depending on the comparison being made, this re-analysis loop can result in a significant improvemnet in numbers of matched gene models, although the lower matching threshold can also reduce the reliability of ortholog assignment. The PLEXdb implementation of MGI provides an output table detailing the quality scores of all matches and gives the user the ability to exclude any of the reported matches for further analysis.

Software availability

The MGI software package consists of scripts to construct data sets for genome interrogation and Perl-CGI code to allow Web browsers to execute search, tabular display, graphical visualization, and sequence retrieval tools on a Web server. MGI data set construction requires preinstallation of local versions of GeneSeqer (Usuka et al., 2000; Brendel et al., 2004), available at http://brendelgroup.org/bioinformatics2go/GeneSeqer.php, BLAST+

(Altschul et al., 1990; Camacho et al., 2009), available at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/, and PERL v5.6.1 or later, available at http://www.perl.org/. To host a web implementation of MGI, the CGI and GD packages of PERL are also required. All MGI scripts and code are available for download at (http://www.plexdb.org/modules/MGI/software).

Implementation of MGI at PlexDB

The MGI web interface at PLEXdb (http://www.plexdb.org/modules/MGI/ ) provides pre-computed dataset that match 14 microarray probe sets (listed in Table 1) to both the Arabidopsis genome (TAIR v10.0) and the rice genome (MSU VER6.1). Data sets and

mapping are periodically updated and expanded. The CPU-intensive step involves the spliced alignment of probeU to modelG, which takes several days to run on a modest-size cluster.

Motif searching and comparisons to known cis-regulatory elements

MEME (Bailey et al., 2006) was used for all searches for potential motifs. Motif length was set to be in the range 6 to 25 bases, a typical length for transcription factor binding sites.

Program options were set to search for either “0-1” occurrences per sequence or “any”

number of repetitions. Therefore, each set of promoters was searched twice. For each search, the top five motif results are reported, ordered according to their significance as judged by the E-value from MEME (Tables 2 and 3, Supplemental Data Files SD1-SD4). In order to compare MEME-reported motifs with known motifs, consensuses sequences of the predicted motifs were used as queries to search the plant cis-regulatory element database (PLACE) using the "Signal Scan Search" tool (Higo et al., 1999). In addition, TOMTOM was used to scan the JASPAR and TRANSFAC databases to find known motifs with high similarity to the motifs detected in our use cases (Gupta et al., 2007).

Use case dataset construction

Arabidopsis genes activated by NPR1 were used as MGI queries to retrieve gene models and sequence data. The tabular search results including alignment and annotation details are provided in Supplemental Table ST1, with gene models displayed in Supplemental Data File SD5. The sequences retrieved were used for motif searching are provided in Supplemental Data File SD6. The analogous set of files for the cross-species use case is also provided (Supplemental Table ST2; Supplemental Data Files SD7 and SD8).

Manual construction of gene models

Comparative gene models using rice sequence data were built for more than 100 Barley1 probe sets in order to improve PCR designs for recovery of cDNA ends. EST assemblies for each probe set were identified using HarvEST so that more barley sequence could be used to predict gene structure and to identify rice homologs (Close et al., 2008). We used this

curated set of gene models to test the performance of MGI in identifying putative orthologies between barley and rice genes. In all cases where MGI found a match, the same match was found by manually. However, there were cases where MGI failed to identify a match that could be found manually. Most of these were due to availability of data types for a gene, but in one case, the differing strength of match requirements for MGI versus manual curation caused the discrepancy. Due to its small size (BLU1 has 53 amino acids), no rice homolog for Contig_12219_at was found by MGI as was discussed above (Supplemental Table ST2).

Data Access

All novel materials described in this publication have been made available for non-commercial research purposes subject to the requisite permission from any third-party owners of all or parts of the material. Obtaining any permissions will be the responsibility of the requestor.

REFERENCES

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol. 215: 403-410

Bailey TL, Elkan C (1995) The value of prior knowledge in discovering motifs with MEME Proceedings of the Third International Conference on Intelligent Systems for

Molecular biology: pp. 21-29, AAAI Press, Menlo Park, California

Bailey TL, Williams N, Misleh C, Li WW (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucl. Acids Res. 34: W369-373

Bennetzen JL, Kellogg EA (1997) Do plants have a one-way ticket to genomic obesity?

Plant Cell 9: 1509-1514

Boyle B, Brisson N (2001) Repression of the Defense Gene PR-10a by the Single-Stranded DNA Binding Protein SEBF. Plant Cell 13: 2525-2537

Brendel V, Xing L, Zhu W (2004) Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus. Bioinformatics 20:

1157-1169

Caldo RA, Nettleton D, Wise RP (2004) Interaction-dependent gene expression in Mla-specified response to barley powdery mildew. Plant Cell 16: 2514-2528

Caldo RA, Nettleton D, Peng J, Wise RP (2006) Stage-Specific Suppression of Basal Defense Discriminates Barley Plants Containing Fast- and Delayed-Acting Mla

101

Powdery Mildew Resistance Alleles. Molecular Plant-Microbe Interactions 19: 939-947

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10: 421 Charles M, Tang H, Belcram H, Paterson A, Gornicki P, Chalhoub B (2009) Sixty

million years in evolution of soft grain trait in grasses: Emergence of the softness locus in the common ancestor of Pooideae and Ehrhartoideae, after their divergence from Panicoideae. Mol Biol Evol 26: 1651-1661

Chaw S-M, Chang C-C, Chen H-L, Li W-H (2004) Dating the monocot–dicot divergence and the origin of core eudicots using whole chloroplast genomes. J Mol Evol 58: 424-441

Cheong YH, Yoo CM, Park JM, Ryu GR, Goekjian VH, Nagao RT, Key JL, Cho MJ, Hong JC (1998) STF1 is a novel TGACG-binding factor with a zinc-finger motif and a bZIP domain which heterodimerizes with GBF proteins. Plant J 15: 199-209

Close TJ, Wanamaker S, Roose ML, Lyon M (2008) HarvEST. In Plant Bioinformatics, pp 161-177

Cox B, Kislinger T, Wigle DA, Kannan A, Brown K, Okubo T, Hogan B, Jurisica I, Frey B, Rossant J, Emili A (2007) Integrated proteomic and transcriptomic profiling of mouse lung development and Nmyc target genes. Mol Syst Biol 3: 109

Després C, Chubak C, Rochon A, Clark R, Bethune T, Desveaux D, Fobert PR (2003) The Arabidopsis NPR1 disease resistance protein is a novel cofactor that confers redox regulation of DNA binding activity to the basic domain/leucine zipper transcription factor TGA1. Plant Cell 15: 2181-2191

Dubcovsky J, Ramakrishna W, SanMiguel PJ, Busso CS, Yan L, Shiloff BA, Bennetzen JL (2001) Comparative sequence analysis of colinear barley and rice bacterial

artificial chromosomes. Plant Physiol. 125: 1342-1353

Ducrocq S, Madur D, Veyrieras J-B, Camus-Kulandaivelu L, Kloiber-Maitz M, Presterl T, Ouzunova M, Manicacci D, Charcosset A (2008) Key impact of Vgt1 on flowering time adaptation in maize: Evidence from association mapping and ecogeographical information. Genetics 178: 2433-2437

Durrant WE, Dong X (2004) Systemic acquired resistance. Annu. Rev.Phytopathol. 42:

185-209

Duvick J, Fu A, Muppirala U, Sabharwal M, Wilkerson MD, Lawrence CJ, Lushbough C, Brendel V (2008) PlantGDB: a resource for comparative plant genomics. Nucl.

Acids Res. 36: D959-965

Feuillet C, Keller B (2002) Comparative genomics in the grass family: Molecular characterization of grass genome structure and evolution. Ann Bot 89: 3-10

Goff SA, Ricke D, Lan T-H, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100

Guo H, Moose SP (2003) Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution.

Plant Cell 15: 1143-1158

Gupta S, Stamatoyannopoulos J, Bailey T, Noble W (2007) Quantifying similarity between motifs. Genome Biol. 8: R24

Hertz GZ, Stormo GD (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15: 563-577 Higo K, Ugawa Y, Iwamoto M, Korenaga T (1999) Plant cis-acting regulatory DNA

elements (PLACE) database. Nucl. Acids Res. 27: 297-300

Hu P, Meng Y, Wise RP (2009) Functional Contribution of Chorismate Synthase,

Anthranilate Synthase, and Chorismate Mutase to Penetration Resistance in Barley-Powdery Mildew Interactions. Molecular Plant-Microbe Interactions 22: 311-320 Hudson ME, Quail PH (2003) Identification of promoter motifs involved in the network of

phytochrome A-regulated gene expression by combined analysis of genomic sequence and microarray data. Plant Physiol. 133: 1605-1616

Hughes JD, Estep PW, Tavazoie S, Church GM (2000) Computational identification of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol.296: 1205-1214

Johnson C, Boden E, Desai M, Pascuzzi P, Arias J (2001) In vivo target promoter-binding activities of a xenobiotic stress-activated TGA factor. Plant J. 28: 237-243

Kaplinsky NJ, Braun DM, Penterman J, Goff SA, Freeling M (2002) Utility and

distribution of conserved noncoding sequences in the grasses. Proc. Natl. Acad. Sci.

99: 6147-6151

Kellogg EA (2001) Evolutionary history of the grasses. Plant Physiol. 125: 1198-1205 Kesarwani M, Yoo J, Dong X (2007) Genetic interactions of TGA transcription factors in

the regulation of pathogenesis-related genes and disease resistance in Arabidopsis.

Plant Physiol. 144: 336-346

Lam E, Benfey PN, Gilmartin PM, Fang RX, Chua NH (1989) Site-specific mutations alter in vitro factor binding and change promoter expression pattern in transgenic plants. Proc. Natl. Acad. Sci. 86: 7890-7894

Liang C, Jaiswal P, Hebbard C, Avraham S, Buckler ES, Casstevens T, Hurwitz B, McCouch S, Ni J, Pujar A, et al. (2008) Gramene: a growing plant comparative genomics resource. Nucl. Acids Res. 36: D947-953

Logemann E, Parniske M, Hahlbrock K (1995) Modes of expression and common structural features of the complete phenylalanine ammonia-lyase gene family in parsley. Proc. Natl. Acad. Sci. 92: 5905-5909

Lyons E, Pedersen B, Kane J, Alam M, Ming R, Tang H, Wang X, Bowers J, Paterson A, Lisch D, Freeling M (2008) Finding and comparing syntenic regions among Arabidopsis and the outgroups Papaya, Poplar, and Grape: CoGe with Rosids. Plant Physiol. 148: 1772-1781

Meng Y, Moscou MJ, Wise RP (2009) Blufensin1 negatively impacts basal defense in response to barley powdery mildew. Plant Physiol. 149: 271-285

Mockler TC, Michael TP, Priest HD, Shen R, Sullivan CM, Givan SA, McEntee C, Kay SA, Chory J (2007) The Diurnal Project: Diurnal and circadian expression profiling, model-based pattern matching, and promoter analysis. Cold Spring Harbor Symposia on Quantitative Biology 72: 353-363

Moses AM, Chiang DY, Eisen MB (2004) Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac Symp Biocomput. 2004: 324-335

103

Niggeweg R, Thurow C, Weigel R, Pfitzner U, Gatz C (2000) Tobacco TGA factors differ with respect to interaction with NPR1, activation potential and DNA-binding

properties. Plant Mol. Biol. 42: 775-788

Ouyang S, Buell CR (2004) The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants. Nucl. Acids Res. 32: D360-363 Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, et al. (2007) The TIGR Rice Genome Annotation Resource:

improvements and new features. Nucl. Acids Res. 35: D883-887

Ozeki Y, Chikagawa Y, Kimura S, Soh H-c, Maeda K, Pornsiriwong W, Kato M, Akimoto H, Oyanagi M, Fukuda T, et al. (2003) Putative cis -elements in the promoter region of the carrot phenylalanine ammonia-lyase gene induced during anthocyanin synthesis. J. Plant Res. 116: 155-159

Pavesi G, Mauri G, Pesole G (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17: S207-214

Pavesi G, Mereghetti P, Zambelli F, Stefani M, Mauri G, Pesole G (2006) MoD Tools:

regulatory motif discovery in nucleotide sequences from co-regulated or homologous genes. Nucl. Acids Res. 34: W566-570

Rudd S, Frisch M, Grote K, Meyers BC, Mayer K, Werner T (2004) Genome-wide in silico mapping of scaffold/matrix attachment regions in Arabidopsis suggests correlation of intragenic scaffold/matrix attachment regions with gene expression.

Plant Physiol. 135: 715-722

Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al. (2009) The B73 Maize Genome: Complexity, diversity, and dynamics. Science 326: 1112-1115

Schreiber A, Sutton T, Caldo R, Kalashyan E, Lovell B, Mayo G, Muehlbauer G, Druka A, Waugh R, Wise R, et al. (2009) Comparative transcriptomics in the Triticeae. BMC Genomics 10: 285

Schulte D, Close TJ, Graner A, Langridge P, Matsumoto T, Muehlbauer G, Sato K, Schulman AH, Waugh R, Wise RP, Stein N (2009) The International Barley Sequencing Consortium--At the threshold of efficient access to the barley genome.

Plant Physiol. 149: 142-147

Seki H, Ichinose Y, Ito M, Shiraishi T, Yamada T (1997) Combined effects of multiple cis-acting elements in elicitor-mediated activation of PSCHS1 gene. Plant Cell Physiol. 38: 96-100

Soderlund C, Descour A, Kudrna D, Bomhoff M, Boyd L, Currie J, Angelova A,

Collura K, Wissotski M, Ashley E, et al. (2009) Sequencing, mapping, and analysis of 27,455 maize full-length cDNAs. PLoS Genet 5: e1000740

Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al. (2007) The Arabidopsis Information

Resource (TAIR): Gene structure and function annotation. Nucl. Acids Res.: gkm965 Thibaud-Nissen F, Wu H, Richmond T, Redman JC, Johnson C, Green R, Arias J,

Town CD (2006) Development of Arabidopsis whole-genome microarrays and their application to the discovery of binding sites for the TGA2 transcription factor in salicylic acid-treated plants. Plant J. 47: 152-162

Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouze P, Moreau Y (2002) A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J.Comput. Biol. 9: 447-464

Usuka J, Zhu W, Brendel V (2000) Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16: 203-211

Vardhanabhuti S, Wang J, Hannenhalli S (2007) Position and distance specificity are important determinants of cis-regulatory motifs in addition to evolutionary conservation. Nucleic Acids Res 35: 3203-3213

Wang D, Amornsiripanitch N, Dong X (2006) A genomic approach to identify regulatory nodes in the transcriptional network of systemic acquired resistance in plants. PLoS Pathogens 2: e123

Wang D, Weaver ND, Kesarwani M, Dong X (2005) Induction of protein secretory pathway Is required for systemic acquired resistance. Science 308: 1036-1040

Wise RP, Caldo R, Lu H, Shen L, Cannon EK, Dickerson JA (2007) BarleyBase/PLEXdb:

A unified expression profiling database for plants and plant pathogens. In D Edwards, ed, Methods in Molecular Biology. Vol 406. Plant Bioinformatics: Methods and Protocols. Humana Press, Totowa, NJ, pp 347–363

Xi L, Moscou MJ, Meng Y, Xu W, Caldo RA, Shaver M, Nettleton D, Wise RP (2009) Transcript-based cloning of RRP46, a regulator of rRNA processing and R gene-independent cell death in barley-powdery mildew interactions. Plant Cell: 21: 3280-3295

Xiang C, Miao Z, Lam E (1997) DNA-binding properties, genomic organization and expression pattern of TGA6, a new member of the TGA family of bZIP transcription factors in Arabidopsis thaliana. Plant Mol. Biol. 34: 403-415

Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M (2005) Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 434: 338-345

Yu J, Hu S, Wang J, Wong GK-S, Li S, Liu B, Deng Y, Dai L, Zhou Y, et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92 Zambelli F, Pesole G, Pavesi G (2009) Pscan: finding over-represented transcription factor

binding site motifs in sequences from co-regulated or co-expressed genes. Nucl.

Acids Res. 37: W247-252

Zhou S, Wei F, Nguyen J, Bechner M, Potamousis K, Goldstein S, Pape L, Mehan MR, Churas C, Pasternak S, et al. (2009) A single molecule scaffold for the maize genome. PLoS Genet 5: e1000711

105

Figures

Figure 1. Collage of web interface panels from the PLEXdb implementation of MGI. On the input page (A), users select the source microarray and model genome, paste a gene list into the text field, select output preferences, and submit the query. Map (B) and tabular (C)

outputs then display genomic positions and provide detatiled annotations with direct links to genome browsers. Users may further evaluate the putative orthologies using the gene model display page (D), where they may also specify the FASTA (E) sequences (promoters, exons, introns and UTRs) to extract from their selected gene models.

Figure 2. Spatial distribution of two known motifs found in Arabidopsis genes. The positions of (A) the activation sequence1 motif and (B) the SEBF promoter element are displayed relative to the transcription and translation start sites of the NPR1- activated genes in which they were detected using the 0-1 occurrence method in MEME. These motifs correspond to A03 and A04 in Table 2.

107

Tables

Table 1. Extent of connectivity for 28 microarray-to-model genome interrogations supported by PLEXdb

Loci matched via GeneSeqer^b Structural evidence for models^c Microarray # array

a Arabidopsis (A.t.) or rice (O.s.) is listed as the model genome. Bold typeface indicates which species is evolutionarily closer to the query taxon.

b Numbers of array genes that match at least one, exactly one, exactly two, and more than two model genome loci using GeneSeqer.

c Extent of model gene evidence for matched loci provided by modelP and/or modelT (any), modelP and modelT (both), modelP only, and modelT only data sources.

* Indicates a long-oligo array, whereas all others are Affymetrix GeneChips.

Table 2. Detection and statistical analysis of six- to 25-nucleotide motifs identified in

a Motif identifiers used here are only for the purposes of communicating examples in the text.

b For each base position in a motif, the pictogram graphically displays the relative proportions of each base observed.

c The probability for chance observation of the motif is given as an E-Value by MEME. The best five motifs using the “0-1” and “any” motif search parameters are reported here.

d The number of times the motif was observed in the dataset and the % of the 65 Arabidopsis genes in which it was found are indicated.

e As a means to assess empirical frequency of a motif, position weight matrix searches were conducted on the promoters of the 30,452 A. thaliana genes that have matches to the ATH1 GeneChip. The sum length of these promoters was 26,844kb. The number of times the motif was observed in the dataset is indicated, with frequencies reported on a per promoter or per kb basis for motifs originally detected with

In document Application of comparative genomics for detection of genomic features and transcriptional regulatory elements (Page 101-122)