New technologies in gene expression profiling


1.11.5 Reproductive interplay in seed development

The seed itself consists of three basic units representing three different organisms: the embryo, which is the new sporophyte generated by the fusion of the egg cell and a sperm cell; the endosperm, which is the fertilization product of a second sperm cell and the homodiploid central cell; and the integuments, which are part of the maternal sporophyte. In order to form one functionally integrated whole, these three organisms have to tightly coordinate their growth and development through feedback signalling events (Berger, 2003).

A recent study reflects the interdependency of embryo and endosperm: a signal from the fertilized egg cell triggers endosperm development (Ungru et al., 2008). Furthermore, the authors later reviewed (Nowack et al., 2010) that the globular embryo specifically needs a signal from the endosperm in order to continue its development. Interestingly, when a fis mutant is fertilized by cdka;1 pollen (only the egg cell is fertilized and no nuclear fusion takes place in the central cell), it can complete seed development (Nowack et al., 2010). In these fis × cdka;1 seeds, the fertilized embryo develops next to a diploid endosperm lacking the paternal information.

This indicates that the embryo can not only trigger the differentiation of the endosperm but also, directly or indirectly, communicate with the integuments that sustain seed development and survival.

1.12 New technologies in gene expression profiling

Determination of the nucleotide sequence of the bacteriophage φX174 by Sanger et al. (1978) started the race for genome sequencing. Nowadays, the genomic sequences of 41 eukaryotic organisms are complete, whereas 322 have been assembled and 388 more are in progress. Specifically in plants, 7 genomes are completed, 21 have been assembled and 85 are in progress (Entrez Genome Project). The race for genome sequencing is reaching an unprecedented level of detail, with the aim of making the sequencing of personal genomes a commodity (Venter, 2010).

For expression profiling, it is not enough to know the genome sequence: the genome of newly sequenced organisms must also be annotated with gene models. Some strategies have arisen, such as using RNA-Seq to build de novo gene models in grapevine (Denoeud et al., 2008). However, an accurate functional and biochemical annotation needs complex computational strategies: prediction and annotation of gene function must be based not only on homology methods but also on a combination of chromosomal gene clustering, phylogenetic and gene fusion information (Hsiao et al., 2010).

Only an accurate gene annotation together with knowledge of the genomic sequence makes it possible to study gene expression in an organism from a genome-wide perspective.

1.12.1 Microarrays

Fifteen years ago, the development of cDNA arrays (Schena et al., 1995) allowed the simultaneous expression measurement of thousands of genes. At the same time, Affymetrix developed oligonucleotide microarrays, which are based on photolithographic synthesis of the different oligonucleotide probes (Lipshutz et al., 1995, 1999).

In 2000, Affymetrix, in collaboration with Syngenta, developed the AtGenome1 array before the Arabidopsis sequence was completed. This first design contained 7 000 probe sets designed from several EST databases (Zhu and Wang, 2000). Later on, it was shown that the array covered the different chromosomes unevenly (Borevitz et al., 2003).

After the completion of the sequencing of the Arabidopsis genome, Affymetrix released the ATH1 array, which contains probe sets for identifying 24 000 genomic features (Redman et al., 2004). This array is based on the TIGRv2 annotation and sequence assembly release. Despite the different order of magnitude in the number of misregulated genes identified, both arrays correlate well globally for expression data, with only a few exceptions (Hennig et al., 2003).

In 2007, based on the TIGRv5 assembly, Affymetrix released the Tiling 1.0F and Tiling 1.0R arrays. Each of them contains 3.2 million probe pairs (perfect match and mismatch) tiled through the complete non-repetitive Arabidopsis genome. The Tiling 1.0F array, whose production was discontinued, contains one DNA strand, whereas the Tiling 1.0R array contains the complementary strand. The probes are 25 nt long and are tiled at an average resolution of 35 bp, leaving a gap of approximately 10 bp between probes. Several papers have described the use of these arrays, not only for measuring genome-wide gene expression in Arabidopsis (Naouar et al., 2009; Jones-Rhoades et al., 2007), but also for conducting pioneering studies of protein interactions with the genome (ChIP-on-chip, Ren et al., 2000) or for novel transcript identification (Shoemaker et al., 2001) in other species.
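
The tiling geometry described above can be checked with a few lines of arithmetic. The following Python sketch is only an idealized illustration: it assumes perfectly uniform spacing (the real arrays reach 35 bp only on average, since repetitive regions are skipped), and the function name tile_probes and the 1 000 bp example region are hypothetical, not part of any Affymetrix software.

```python
def tile_probes(region_length, probe_length=25, step=35):
    """Return half-open (start, end) intervals of probes tiled along one strand.

    Assumes idealized, perfectly uniform spacing; real tiling arrays only
    achieve this spacing on average.
    """
    if probe_length <= 0 or step <= 0:
        raise ValueError("probe_length and step must be positive")
    last_start = region_length - probe_length
    return [(s, s + probe_length) for s in range(0, last_start + 1, step)]


# Hypothetical 1 000 bp non-repetitive region
probes = tile_probes(1000)

# Gap between the end of one probe and the start of the next
gap = probes[1][0] - probes[0][1]

# Fraction of the region that is directly interrogated by a probe
coverage = sum(end - start for start, end in probes) / 1000
```

Under these assumptions the gap is 10 bp, as stated in the text, and roughly 70 % of the bases of one strand fall inside a probe.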

The latest development of Affymetrix for Arabidopsis is the AGRONOMICS1 array. The design is based on the TAIR8 sequence assembly and contains the tiled sequence of both DNA strands in the form of perfect match probes. The probes are 25 nt long, with a separation of 7 nt between probes of the same strand and a separation of 16 nt between the centre points of partially overlapping complementary probes. No mismatch probes were included due to the difficulties in data interpretation. In addition, all the perfect match probes from the ATH1 array were added.
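
To visualize how the two strands interleave in this design, the layout can be sketched in Python. This is a simplified model under stated assumptions: the 7 nt separation is interpreted as the gap between the end of one probe and the start of the next on the same strand, spacing is taken as perfectly regular, and all names (strand_layout, overlap, the 200 nt example region) are hypothetical.

```python
PROBE_LEN = 25       # probe length in nt (from the text)
GAP = 7              # assumed: gap between consecutive probes on one strand
CENTRE_OFFSET = 16   # shift between centres of complementary probes


def strand_layout(region_length, first_start=0):
    """Half-open (start, end) probe intervals on one strand, regular spacing."""
    step = PROBE_LEN + GAP
    last_start = region_length - PROBE_LEN
    return [(s, s + PROBE_LEN) for s in range(first_start, last_start + 1, step)]


def overlap(a, b):
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


# Hypothetical 200 nt region; the reverse-strand probes are shifted by 16 nt
forward = strand_layout(200, first_start=0)
reverse = strand_layout(200, first_start=CENTRE_OFFSET)

# Centre positions of the first probe on each strand
centre_fwd = forward[0][0] + PROBE_LEN // 2
centre_rev = reverse[0][0] + PROBE_LEN // 2

# Overlap of a reverse probe with its two forward neighbours
shared_left = overlap(forward[0], reverse[0])
shared_right = overlap(reverse[0], forward[1])
```

In this model each reverse-strand probe shares 9 positions with the forward probe on either side, so the 7 nt gaps left on one strand are bridged by a probe on the other.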

Custom definition files (CDFs) for quantitative transcriptome profiling in R/Bioconductor were made available, not only based on the TAIR8 annotation release, but also on TAIR9 (Rehrauer et al., 2010).

In this work, ATH1 and Tiling 1.0R arrays were used to analyse the transcription profile in an estradiol-regulated conditional complementation system and in the prl1 mutant, respectively.

1.12.2 Next generation sequencing

Further developments of the polymerase chain reaction (PCR) for complementary DNA synthesis allowed the process to be monitored in real time, retrieving the DNA sequence information during the synthesis process (Bentley, 2006; Shendure and Ji, 2008). Novel technologies, still in development (Peng and Ling, 2009), include techniques that do not rely on complementary DNA synthesis, but rather on the long-known property of DNA and RNA strands to sequentially modify an applied electric field as they pass through a nanopore (Kasianowicz et al., 1996).

The different commercialized technologies (as of 2009–2010) include Illumina’s Genome Analyzer (based on the works of Fedurco et al., 2006; Turcatti et al., 2008), Roche’s 454 (based on pyrosequencing, Ronaghi et al., 1998; Margulies et al., 2005), Applied Biosystems’ SOLiD (adapted from Shendure et al., 2005) and Helicos Biosciences’ HeliScope (Harris et al., 2008). These technologies are able to read thousands of millions of bases arranged in sequences in days. The generated sequences are relatively short, typically from a few tens to a few hundred nucleotides in length, with an inverse relation between the total number of read sequences and the read length.

These technologies have allowed the re-sequencing of the genome of the Col-0 accession of Arabidopsis thaliana for the latest TAIR release (v9), which introduced deep changes in already known gene models, especially in their location in the genome (Ossowski et al., 2008). Furthermore, the 1001 Genomes Project is making use of these technologies in order to sequence the genomes of several Arabidopsis thaliana accessions in a short time frame.

The production of an enormous amount of data with these technologies has raised concerns not only over effective analyses (Pop and Salzberg, 2008), but also over data storage and availability (Shumway et al., 2010) and visualization (Nielsen et al., 2010).

For this work the Illumina platform was used to sequence the whole transcriptome (RNA-Seq) of the prl1 and cdc5 mutants.

1.12.3 Data analysis development

The development of the different microarray and sequencing platforms also involves the parallel development of new techniques for a proper data analysis.

The case of the Affymetrix microarrays is worth noting, since this was the first array platform widely used by the scientific community: the default expression measures computed with the Affymetrix Microarray Suite (MAS v5.0) could be significantly improved (Irizarry et al., 2003b). The new analysis workflow and algorithms were published as an R package (Irizarry et al., 2003a; Gautier et al., 2003), which was the origin of Bioconductor (Gentleman et al., 2004): an Open Source project focused on the analysis of genetic data that runs on top of the R environment for statistical computing (Ihaka and Gentleman, 1996).

Pop and Salzberg (2008) commented on the challenges that data analysis of next-generation sequencing may need to overcome. Two years later, the situation has developed to a point where many tools are available to perform multiple analyses. Due to the highly dynamic field of next-generation sequencing technologies, only Open Source solutions keep pace in order to perform a reliable data analysis (Richter and Sexton, 2009). Tools such as Bowtie (Langmead et al., 2009), BFAST (Homer et al., 2009) or BWA (Li and Durbin, 2010) can efficiently map short reads against known genome sequences. Other Open Source tools are specific for studying gene structure, like TopHat (Trapnell et al., 2009), QPalma (Bona et al., 2008) or mGene (Schweikert et al., 2009). Furthermore, Open Source tools for de novo assembly of genomes use state-of-the-art algorithms (Miller et al., 2010). These tools include WGA Assembler/CABOG (Myers et al., 2000), MIRA (Chevreux et al., 2004) and Velvet (Zerbino and Birney, 2008; Zerbino et al., 2009).
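
To illustrate what mapping short reads against a known sequence involves, a toy seed-and-verify mapper is sketched below in Python. It is deliberately naive and is not the algorithm used by Bowtie or BWA, which rely on far more memory-efficient indexes based on the Burrows-Wheeler transform. The function names, the 20 nt reference and the reads are all made up for illustration; only ungapped alignments on the forward strand are considered.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted (position, mismatches) pairs for ungapped alignments.

    Each k-mer of the read is used as a seed; every candidate position
    suggested by a seed hit is verified by counting mismatches.
    """
    seen = set()
    hits = []
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference) or start in seen:
                continue
            seen.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
    return sorted(hits)


# Made-up reference and reads, for illustration only
reference = "GATTACAGATCCGTAGGCTA"
index = build_index(reference, k=4)

exact = map_read("GATCCGTA", reference, index, k=4)        # perfect match
one_off = map_read("GATCCGAA", reference, index, k=4)      # one substitution
strict = map_read("GATCCGAA", reference, index, k=4, max_mismatches=0)
```

Here the exact read and the read carrying one substitution both map to position 7 (0-based), while the strict setting rejects the latter. A hash index of this kind grows with the genome size, which is one reason why the tools cited above use compressed indexes instead.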

Many commercial solutions are being developed nowadays. However, proprietary software does not use the most up-to-date algorithms, its licenses are expensive and restrictive, and the programmes perform poorly and do not keep pace with the latest developments in sequencing technologies.

Therefore, in this work Open Source tools were used to analyse the sequenced transcriptome of the prl1 mutant of Arabidopsis thaliana Col-0.

In document Characterization of PRL1 and its paralogue PRL2 in Arabidopsis thaliana (Page 49-52)