Omics and Arrays
3.3 Functional Genomics
The use of whole-genome information and high-throughput tools has opened up a new field of research called functional genomics. Among its subdisciplines, transcriptomics (the study of the complete set of transcripts produced in a cell) (Zimmerli and Somerville, 2005), proteomics (the study of the complete set of proteins produced in a cell) (Roberts, 2002) and metabolomics (the study of the complete set of metabolites produced in a cell) (Stitt and Fernie, 2003) have been used by the plant science community. Functional genomics refers to the development and application of global (genome-wide or system-wide) experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics. It is characterized by high-throughput or large-scale experimental methodologies combined with statistical and computational analysis (bioinformatics) of the results. The new information provided by all the omics disciplines will lead the plant science community to in silico simulations of plant growth, development and response to environmental change.
3.3.1 Transcriptomics
The transcriptome is the set of all the mRNA molecules, or ‘transcripts’, produced in one cell or a population of cells. The term can be applied to the total set of transcripts in a given organism or to the specific subset of transcripts present in a particular cell type. Unlike the genome, which is roughly fixed for a given cell line (excluding mutations), the transcriptome can vary with external environmental conditions. Because it includes all mRNA transcripts in the cell, the transcriptome reflects the genes that are being actively expressed at any given time, with the exception of mRNA degradation phenomena such as transcriptional attenuation. Transcriptomics is based on the idea that a catalogue of all the transcripts associated with a specific treatment or developmental stage provides a reasonable overview of the underlying biological processes at work. As we have moved from northern blots to tiling arrays, we have advanced from a gene-by-gene world to a full-genome universe.
The study of transcriptomics often uses high-throughput techniques based on DNA microarray or chip technology. Suggested references for this section include Bernot (2004), Bourgault et al. (2005) and Busch and Lohmann (2007).
Gene expression profiling technologies provide a tool for analysing ‘global’ gene expression by viewing the activity of all or (more typically) a substantial part of the genome at a specific time of interest. There are open and closed architecture systems for gene expression profiling. In the open architecture, all genes expressed in a tissue have the possibility of being detected (e.g. cDNA-AFLP, differential display (dd) PCR, SAGE, cDNA subtraction). Advantages include the potential discovery of previously unknown genes, comprehensive coverage and the low requirements by way of equipment. Disadvantages include retrieving only a small part of the gene (since it can be laborious to clone full-length cDNA) and the fact that simple gene identification is limited to sequences that are already in a database (otherwise the corresponding gene must be cloned).
Several alternative technologies have emerged for measuring transcript abundance in a parallel fashion. Essentially, these methods can be divided into three categories according to their underlying principle, namely PCR-, sequencing- or hybridization-based technologies. Strategies that are currently available for analysis of transcriptomes therefore include RT-PCR (qualitative and quantitative), hybridization methods (northern blots, macroarrays, DNA microarrays, oligonucleotide microarrays), cDNA fingerprinting (differential display, cDNA-AFLP), cDNA sequencing (full-length cDNAs, subtracted cDNAs, normalized cDNA libraries, SAGE, massively parallel signature sequencing (MPSS)) and combinations of the above techniques.
The most straightforward and unbiased method of analysing an RNA population is the sequencing of cDNA libraries and quantitative analysis of the resulting ESTs. Traditionally, ESTs with read lengths of about 200–900 nucleotides have been produced by Sanger sequencing but the associated costs have severely limited the resolution of this approach (Busch and Lohmann, 2007). Deep sequencing has become a viable alternative for unbiased large-scale expression profiling because of the development of new protocols and entirely new sequencing techniques. Non-gel-based sequencing techniques promise to deliver greatly increased throughput and a considerable cost reduction. MPSS combines in vitro cloning of millions of template tags on separate microbeads with ligation-mediated sequence detection. In each reaction cycle, a four-base overhang is produced on every tag to which a fluorescently labelled adaptor of defined sequence is ligated. The position and fluorescence of every microbead is monitored by a high-resolution camera in each of the reaction cycles, allowing the sequences of the 17-nucleotide tags to be reconstructed (Brenner et al., 2000). As indicated by Busch and Lohmann (2007), the limited length of the sequenced tags precludes the use of MPSS for de novo sequencing but makes it a very powerful tool for expression profiling of organisms with pre-existing sequence information.
By contrast, two other high-throughput sequencing techniques as described previously, 454 and Solexa™, are ideally suited for expression-profiling purposes. Short tags are sufficient to identify a transcript unambiguously and therefore problems arising from assembling short tags into larger contigs can be ignored.
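The principle that short tags are enough for expression profiling when reference sequences already exist can be sketched in a few lines. This is a minimal illustration only, not the MPSS, 454 or Solexa pipeline: the transcript names and sequences are invented toy data, tags are matched by exact lookup, and an 8-nucleotide tag length is assumed purely to keep the example small (the MPSS tags described above are 17 nucleotides long).

```python
from collections import Counter

def build_tag_index(transcripts, tag_len):
    """Map every tag_len-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - tag_len + 1):
            index.setdefault(seq[i:i + tag_len], set()).add(name)
    return index

def count_tags(tags, index):
    """Count tags per transcript; ambiguous and unknown tags are tallied separately."""
    counts = Counter()
    for tag in tags:
        hits = index.get(tag, set())
        if len(hits) == 1:
            counts[next(iter(hits))] += 1
        elif hits:
            counts["ambiguous"] += 1
        else:
            counts["unmatched"] += 1
    return counts

# Hypothetical reference transcripts and sequenced tags (toy data).
transcripts = {
    "geneA": "ATGGCGTACGTTAGCCTAGGATCC",
    "geneB": "ATGCCCTTTAAAGGGCCCTTTAAA",
}
index = build_tag_index(transcripts, tag_len=8)
tags = ["GCGTACGT", "GCGTACGT", "CCTTTAAA", "TTTTTTTT"]
profile = count_tags(tags, index)
print(dict(profile))
```

The tag counts per transcript serve as a digital measure of expression; tags that match no known sequence are kept in a separate tally rather than discarded silently.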
PCR product-based arrays were heavily used in the early days of global transcriptome analysis. However, the low level of standardization among laboratories, high levels of noise and experimental variation and cross-hybridization between homologous transcripts have eroded the attractiveness of these arrays. Oligonucleotide-based microarrays are now becoming the most popular technology for large-scale expression profiling because they allow the simultaneous detection of tens of thousands of transcripts at a reasonable cost. The expression level of any gene represented on the array can be deduced from the fluorescence intensity of the corresponding probe. However, microarrays only offer linear expression measurements over a range of three orders of magnitude, compared with quantitative RT-PCR, which has a dynamic range of five orders of magnitude. Microarrays perform with less precision and sensitivity than other techniques, particularly when used for measuring low-abundance transcripts, and this is manifested in their greater inter-assay variability (Busch and Lohmann, 2007).
Another major limitation of microarrays designed for expression analysis is that they rely on current genome annotations, which precludes the identification of novel or very small transcription units.
Microarrays and quantitative RT-PCR have dominated expression profiling to date but deep sequencing and whole-genome tiling arrays will become increasingly important because these techniques are not limited to the detection of known transcripts. Tiling arrays, on which the entire genome is represented by evenly spaced probes, provide a novel means of transcript identification. In Arabidopsis, tiling arrays have been used to map transcriptionally active regions by profiling four different tissues (Yamada et al., 2003).
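The notion of representing a genome by evenly spaced probes can be made concrete with a small sketch. The probe length, the spacing and the sequence below are arbitrary values chosen for illustration, not the design of any published array.

```python
def tile_probes(genome, probe_len, step):
    """Return (start, probe) pairs spaced `step` bases apart along the sequence."""
    return [
        (start, genome[start:start + probe_len])
        for start in range(0, len(genome) - probe_len + 1, step)
    ]

# Toy sequence; a step smaller than probe_len gives overlapping probes.
genome = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
probes = tile_probes(genome, probe_len=10, step=7)
for start, probe in probes:
    print(start, probe)
```

Because probe positions depend only on the sequence coordinates and not on gene annotation, a signal at any probe can reveal transcription from a previously unannotated region.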
The interaction transcriptome is the sum of all microbe and host transcripts that are produced during the interaction. The challenges in studying interaction transcriptomes include how to discriminate pathogen from host ESTs, similarity searches to genome/cDNA sequences, GC analyses and determination of hexamer frequency (windows of 6 bp). Systems genomics/transcriptomics can be used to analyse complex transcriptomes, for example the mixtures of mRNAs from different species (e.g. infected tissue, environmental samples such as soil or seawater, etc.). One challenge is to identify the species of origin in the mixtures.
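Two of the sequence statistics mentioned above, GC content and hexamer frequency, are simple to compute. The sketch below is illustrative only: the threshold rule in `assign_by_gc`, and the assumption that the pathogen has the higher GC content, are hypothetical and would need to be calibrated on sequences of known origin.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def hexamer_frequencies(seq, k=6):
    """Relative frequency of each k-mer over all overlapping windows of k bp."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}

def assign_by_gc(seq, threshold):
    """Hypothetical rule: call an EST 'pathogen' if its GC content reaches the threshold."""
    return "pathogen" if gc_content(seq) >= threshold else "host"

est = "GGCCGGATCCGG"  # toy EST
print(gc_content(est), assign_by_gc(est, threshold=0.6))
print(hexamer_frequencies("ATATATAT"))
```

In practice a single statistic is rarely decisive, which is why the text lists GC analysis and hexamer frequency alongside similarity searches.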
3.3.2 Proteomics
Proteomics is the study of the identification, function and regulation of complete sets of proteins in a tissue, cell or subcellular compartment. Such information is crucial to understanding how complex biological processes occur at a molecular level and how they differ in various cell types, stages of development or environmental conditions (Bourgault et al., 2005). Proteomics is important as proteins are active agents in cells and they execute the biological functions encoded by genes. Sequences of genes (or genomes) and transcriptome analyses are not sufficient to elucidate biological functions. Proteomics complements transcriptomics by providing information about the time and place of protein synthesis and accumulation, as well as identifying those proteins and their post-translational modifications. Gene expression does not necessarily indicate whether a protein is synthesized, how fast it is turned over or which possible protein isoforms are synthesized (Mathesius et al., 2003). In some cases, the correlation between gene expression and protein presence is as low as 0.4. First, the level of transcription of a gene gives only a rough estimate of its level of expression into a protein. An mRNA produced in abundance may be degraded rapidly or translated inefficiently, resulting in a small amount of protein. Secondly, many proteins experience post-translational modifications that profoundly affect their activities; for example some proteins are not active until they become phosphorylated. Methods such as phosphoproteomics and glycoproteomics are used to study post-translational modifications. Thirdly, many transcripts give rise to more than one protein through alternative splicing or post-translational modifications. It is generally supposed that if genomes contain tens of thousands of gene sequences, the proteome comprises several hundred thousand proteins as a result of alternative splicing and post-translational modifications. Finally, many proteins form complexes with other proteins or RNA molecules and only function in the presence of these molecules.
Proteomics has become an important approach for investigating cellular processes and network functions. Significant improvements have been made in technologies for high-throughput proteomics, both at the level of data analysis software and mass spectrometry (MS) hardware (Baginsky and Gruissem, 2006). In this section, proteomics will be briefly discussed. For further details, readers are referred to the following review articles: van Wijk (2001), Molloy and Witzmann (2002), de Hoog and Mann (2004), Saravanan et al. (2004), Baginsky and Gruissem (2006), Cravatt et al. (2007) and Zivy et al. (2007).
Protein extraction
Obtaining high-quality protein is the first step in proteomic research. Extracting protein from plant tissue requires tissue disruption by grinding and sonication, separation of proteins from unwanted cell materials (cell wall, water, salt, phenolics, nucleic acids) by centrifugation after precipitation of proteins with acetone–trichloroacetic acid, resolubilization of protein in a solution that dissolves the maximum number of different proteins and inactivation of proteases by acetone–trichloroacetic acid treatment or specific protease inhibitors. Pre-fractionation of tissue is optional for the analysis of proteins from different organelles or microsomal fractions. Solubilization requires urea or, for more hydrophobic proteins, thiourea, as a chaotrope which solubilizes, denatures and unfolds most proteins. Non-ionic or zwitterionic detergents, e.g. 3-[3-cholamidopropyl-dimethyl-ammonio]-1-propane sulfonate (CHAPS), Triton®-X or amidosulfobetaines, are used to solubilize and separate proteins in a mixture. Sodium dodecyl sulphate (SDS) is also a strong detergent and is used to solubilize membrane proteins. However, it confers a negative charge on proteins and, therefore, interferes with isoelectric focusing (Mathesius et al., 2003). Reducing agents (usually dithiothreitol [DTT], 2-mercaptoethanol or tributyl phosphine) are needed to disrupt disulfide bonds.
Protein identification and quantification
N- or C-terminal sequencing has made protein identification possible on a small scale although with limitations. Improvements in MS have made it possible to identify proteins faster, on a larger scale, using smaller amounts of protein. In addition, post-translational modifications can be determined by MS/MS analysis and proteins can be identified even when bound to other proteins in complexes. A standard technique for protein identification with MALDI-TOF MS is peptide mass fingerprinting. Protein spots in a gel can be visualized using a variety of chemical stains or fluorescent markers. Proteins can often be quantified by the intensity with which they stain. Once proteins have been separated and quantified, they can be identified. Individual spots are cut out of the gel and cleaved into peptides with proteolytic enzymes. These peptides can then be identified by MS, specifically MALDI-TOF MS.
The MALDI-TOF analysis will measure very precisely (< 0.1 Da) the mass of peptides formed by this digestion. Since the cleavage sites are known, the digestion can be simulated by informatics, that is, the masses of all the peptides produced by this digestion can be calculated for all the known sequence proteins of a given organism (Zivy et al., 2007). These masses will depend on the length of peptides and their composition since most amino acids have different masses. Thus, masses predicted from sequences stored in databases can simply be compared with masses actually measured by the MALDI-TOF equipment. The greater the number of positive mass matches, the more likely it is that the peptides originate from the same protein, thus facilitating the rapid identification of proteins.
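The in silico digestion and mass matching described above can be sketched as follows. Several simplifying assumptions are made that are not stated in the text: the enzyme is taken to be trypsin (cleavage after K or R, but not before P), no missed cleavages or modifications are considered, neutral monoisotopic masses are compared rather than the protonated ions an instrument reports, and the two-protein database and the observed masses are invented toy data.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide

def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER

def count_matches(observed, protein, tol=0.1):
    """Number of observed masses within `tol` Da of any predicted peptide mass."""
    predicted = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tol for t in predicted) for m in observed)

def best_match(observed, database, tol=0.1):
    """Name of the database protein with the most matching peptide masses."""
    return max(database, key=lambda name: count_matches(observed, database[name], tol))

# Hypothetical sequence database and measured peptide masses (toy data).
database = {"protA": "MKTAYIAKQR", "protB": "GASPVKLLNDER"}
observed = [277.15, 665.37, 302.17]
print(best_match(observed, database))
```

Counting matches is the crudest possible score; it is used here only because it mirrors the statement that more positive mass matches make a common protein of origin more likely.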
Protein profiling
Protein mixtures of considerable complexity can now be routinely characterized in some detail. One measure of technical progress is the number of proteins identified in each study. Such numbers can now reach the thousands for suitably complex samples.
Large-scale proteomic studies are needed to solve three types of biological problem (Aebersold and Mann, 2003): (i) the generation of protein–protein linkage maps; (ii) the use of protein identification technology to annotate and, if necessary, correct genomic DNA sequences; and (iii) the use of quantitative methods to analyse protein expression profiles as a function of the cellular state as an aid to inferring cellular function.
The sequences of many mature proteins in higher eukaryotes after processing and splicing are often not directly apparent from their cognate DNA sequences. Peptide sequence data of sufficient quality provide unambiguous evidence of translation of a particular gene and can, in principle, differentiate between alternatively spliced or translated forms of a protein (Aebersold and Mann, 2003). Thus, it might be tempting to systematically analyse the proteins expressed by a cell or tissue, that is, to generate comprehensive proteome maps.
The more common and versatile use of large-scale MS-based proteomics has been to document the expression of proteins as a function of cell or tissue state. Aebersold and Mann (2003) argued that to be meaningful, such data must be at least semi-quantitative and that a simple list of proteins detected in the different states is insufficient. This is because analyses of complex mixtures are often not comprehensive and therefore the non-appearance of a particular sequence in the list of identified peptides does not indicate that the peptide or protein was not originally present in the sample. Additionally, it is often impossible to prepare a certain cell type, cell fraction or tissue in completely pure form without trace contamination from other fractions. And because the ion current of a peptide is dependent on a multitude of variables that are difficult to control, this measure is not a good indicator of peptide abundance. If stable-isotope dilution has not been used, a rough relative estimate of the quantity of a protein can be obtained by integrating the ion current of its peptide-mass peaks over their elution time and comparing these ‘extracted ion currents’ between states, provided that highly accurate and reproducible methods are used. Increasingly, stable-isotope dilution and LC-MS/MS are used to accurately detect changes in quantitative protein profiles and to infer biological function from the observed patterns (Aebersold and Mann, 2003).
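The rough relative estimate described above, integrating each peptide's ion current over its elution time and comparing the result between states, can be sketched as follows. The traces are invented toy data, trapezoidal integration is assumed as the integration method, and no normalization or background correction is attempted.

```python
def integrate_xic(times, intensities):
    """Trapezoidal area under an ion-current trace over elution time."""
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (intensities[i] + intensities[i - 1]) * (times[i] - times[i - 1])
    return area

def relative_abundance(peaks_a, peaks_b):
    """Ratio of summed peptide peak areas for one protein in state A versus state B."""
    total_a = sum(integrate_xic(t, y) for t, y in peaks_a)
    total_b = sum(integrate_xic(t, y) for t, y in peaks_b)
    return total_a / total_b

# Hypothetical elution profile of one peptide in two cellular states.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
state_a = [(times, [0.0, 10.0, 20.0, 10.0, 0.0])]
state_b = [(times, [0.0, 5.0, 10.0, 5.0, 0.0])]
ratio = relative_abundance(state_a, state_b)
print(ratio)
```

Such a ratio is only as reliable as the reproducibility of the two runs, which is the caveat the text attaches to this approach.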
Protein–protein interactions
Protein–protein interactions occur among most proteins and there are six types of interfaces found in protein–protein interactions: domain–domain, intra-domain, hetero-oligomer, hetero-complex, homo-oligomer and homo-complex. The analysis of protein–protein interactions can be either qualitative or quantitative. Traditional biochemical methods such as co-purification and co-immunoprecipitation have been used to identify the members of protein complexes. Proteomics-based strategies have been used to determine the composition of complexes and to establish interaction networks. The systematic, large-scale, high-throughput approaches now being taken to build maps of the interactions between proteins predicted by genome sequence information have become known as interactomics (Causier et al., 2005).
There are many important characteristics of a protein–protein interaction. Obviously, it is important to know which proteins are interacting. In many experiments and computational studies, the focus is on interactions between two different proteins. However, one protein can interact with other copies of itself (oligomerization) or with three or more different proteins. The stoichiometry of the interaction is also important, that is, how many of each protein involved are present in a given reaction. Some protein interactions are stronger than others because they bind together more tightly. The strength of binding is known as affinity. Proteins will only bind to each other spontaneously if it is energetically favourable. Energy changes during binding are another important aspect of protein interactions. Many of the computational tools that predict interactions are based on the energy of interactions.
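The link between affinity and binding energy can be illustrated with the standard thermodynamic relation ΔG° = RT ln(Kd/c°), where Kd is the dissociation constant and c° is the 1 M standard concentration. This relation is general textbook thermodynamics rather than something taken from this chapter, and the Kd values below are arbitrary examples.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol K)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard free energy of binding in kJ/mol from a dissociation constant in mol/L."""
    return R * temperature_k * math.log(kd_molar) / 1000.0

# Tighter binding (smaller Kd) gives a more negative, more favourable free energy.
for kd in (1e-6, 1e-9):
    print(kd, round(binding_free_energy(kd), 1))
```

A negative value corresponds to the statement that binding occurs spontaneously only when it is energetically favourable.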
Protein interaction maps represent essential components of the post-genomic tool kits needed for understanding biological processes at a systems level. Over the past decade, a wide variety of methods have been developed to detect, analyse and quantify protein interactions, including surface plasmon resonance spectroscopy, nuclear magnetic resonance (NMR), yeast two-hybrid (Y2H) screens, peptide tagging combined with MS and fluorescence-based technologies. Lalonde et al. (2008) and Miernyk and Thelen (2008) reviewed the latest techniques and current limitations of biochemical, molecular and cellular approaches for the detection of protein–protein interactions. In vitro biochemical strategies for identifying and characterizing interacting proteins include co-immunoprecipitation, blue native gel electrophoresis, in vitro binding assays, protein cross-linking and rate-zonal centrifugation. Fluorescence techniques range from co-localization of tags, which may be limited by the optical resolution of the microscope, to fluorescence resonance energy transfer (FRET)-based methods that have molecular resolution and can also report on the dynamics and localization of the interactions within a cell. Proteins interact via highly evolved complementary surfaces with affinities that can vary over many orders of magnitude. Some of the techniques, such as surface plasmon resonance, provide detailed information regarding the physical properties of these interactions. To analyse protein complexes systematically at a sub- or full-genome level, several methods have been adapted for high-throughput screens using robotics: (i) Y2H systems; (ii) the mating-based split-ubiquitin system (mbSUS); and (iii) affinity purification of protein complexes followed by identification of proteins by MS (AP-MS).
One of the first questions usually asked about a new protein, apart from where it is expressed, is to what proteins does it bind? To study this question by MS, the protein itself is used as an affinity reagent to isolate its binding partners. Compared with two-hybrid and array-based approaches, this strategy has the advantages that the fully processed and modified protein can serve as the bait, that the interactions take place in the native environment and cellular location and that multi-component complexes can be isolated and analysed in a single operation (Ashman et al., 2001). However, because many biologically relevant interactions are of low affinity, transient and generally dependent on the specific cellular environment in which they occur, MS-based methods in a straightforward affinity experiment will detect only a subset of the protein interactions that actually occur (Aebersold and Mann, 2003). Bioinformatics methods, correlation of MS data with those obtained