
4. OMICs: principles and application to human hazard assessment, strengths and limitations

4.1. Transcriptomics

4.1.1. Principles of transcriptomics

Transcriptomics deals with the expression level of mRNAs in a given tissue, organ or other cell population, using DNA microarrays and other high-throughput technologies that can estimate the quantities of mRNAs (NRC, 2007).

The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and other non-coding RNA (e.g. microRNA, which is involved in transcriptional and post-transcriptional regulation of gene expression), produced in one cell or a population of cells (Pietu et al., 1999). The term can be applied to the total set of transcripts in a given organism or to the specific subset of transcripts present in a particular cell type. The key aims of transcriptomics are to catalogue all species of transcripts, including mRNAs, non-coding RNAs and small RNAs; to determine the transcriptional structure of genes, splicing patterns and other post-transcriptional modifications; and to quantify the changing expression levels of each transcript during development and under different conditions (Wang et al., 2009).

Two main technologies are used for transcriptomics, namely oligonucleotide microarrays and next-generation sequencing.

Oligonucleotide microarray (OM) technology is hybridisation-based and is the most common approach used for gene expression profiling. It makes use of the information created by genome sequencing (www.genomesonline.org) and of the myriad of Expressed Sequence Tags (ESTs) generated using first-generation Sanger sequencers. Hybridisation-based approaches are high throughput and relatively inexpensive, except for high-resolution arrays that interrogate large genomes. Today, it is possible to design an array of oligomer probes that covers the whole transcriptome of any organism for which the genome sequence is known and for which the possible open reading frames and gene models have been identified using well-established bioinformatics analysis pipelines. However, these methods have several limitations, including their dependency on prior knowledge of the genome sequence, high background levels caused by cross-hybridisation and a limited dynamic range of detection. Moreover, comparing expression levels between experiments is often difficult and requires complicated normalisation methods (Metzker, 2010).
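To make the normalisation step more concrete, the following is a minimal sketch of quantile normalisation, one widely used way of making intensity distributions comparable across arrays. The text above does not name a specific method, so the choice of quantile normalisation, the function name quantile_normalise and the toy log2 intensities are all assumptions made for illustration only.

```python
from statistics import mean


def quantile_normalise(samples):
    """Quantile-normalise equally long intensity lists (one list per array).

    Illustrative sketch: ties are not averaged, they are simply ranked in
    probe order.
    """
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all arrays must measure the same number of probes")

    # Sort each array, then average the values found at each rank.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [mean(s[r] for s in sorted_samples) for r in range(n_probes)]

    normalised = []
    for s in samples:
        # order[r] is the index of the probe holding the r-th smallest value.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalised.append(out)
    return normalised


# Hypothetical log2 intensities for 4 probes measured on 3 arrays.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalised = quantile_normalise(arrays)
```

After this transformation every array has the same set of values, so that differences between arrays reflect the rank of each probe rather than array-wide shifts in intensity. Real pipelines typically combine such a step with background correction and probe summarisation, which are not shown here.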

Next Generation Sequencing (NGS) technologies can deliver fast, high-throughput, inexpensive and accurate genome information, including genomic and epigenomic sequencing. NGS includes methods for determining the sequence content and abundance of mRNAs, non-coding RNAs and small RNAs (collectively called RNA–seq) and methods for measuring genome-wide profiles of immunoprecipitated DNA–protein complexes (ChIP–seq), methylation sites (methyl–seq) and DNase I hypersensitivity sites (DNase–seq). A key feature is the ability to sequence the whole genome of many organisms, which has allowed large-scale comparative and evolutionary studies to be performed (Metzker, 2010). In addition, the entire transcriptome can be queried, down to an individual base, whether or not a reference genome is available (McGettigan, 2013). The capacity of NGS is illustrated by the recent publication of the genomes of 1 092 individuals from 14 human populations, constructed using a combination of low-coverage whole-genome and exome sequencing as part of the 1000 Genomes Project. NGS also allows the genome-scale mapping of epigenomic modifications important for transcriptional control, including DNA methylation and covalent modifications of histone proteins. Several large-scale analysis techniques are available that enable the survey of DNA methylation status at nucleotide resolution throughout the genome. NGS platforms for genome and epigenetic techniques are discussed elsewhere (Metzker, 2010). Overall, NGS is likely to replace OM because of its greater accuracy, which closely matches that of quantitative polymerase chain reaction (PCR), and because it enables gene-expression studies in organisms for which OM are not available. Finally, it is likely to offer a higher throughput than microarrays, as new developments will likely allow for the analysis of thousands of transcriptome samples in a single sequencing run (Sturla et al., 2014). However, the technology is limited by artefacts and biases that still need to be fully identified and controlled for (McGettigan, 2013).

Analysis of transcriptomic data requires a combination of statistical techniques, bioinformatic tools and databases. The huge amount of data produced by NGS platforms requires powerful information technology tools for data storage, tracking, quality control and processing. Data sets are transformed using standardisation, normalisation or scaling so that measurements can be compared within and between studies. The challenge is to turn large data sets, which contain relatively high amounts of noise and have no obvious biological or toxicological meaning, into relevant findings. Advances in bioinformatics and algorithms have recently been reviewed, with a focus on state-of-the-art techniques to support experimental scientists in analysing transcriptomic data (Berger et al., 2013). A number of methods exist for the analysis and interpretation of transcriptomics data. They include mathematical clustering algorithms (e.g. hierarchical clustering, K-means clustering and self-organising maps) and the calculation of a measure of similarity between gene profiles. Clustering creates subsets of similar sequences and makes it possible to select, from among thousands, the sequences with biologically relevant characteristics. Multivariate statistical methods include Principal Component Analysis (PCA) and Partial Least Squares (PLS). PCA is an unsupervised method that determines the intrinsic structure within data sets without prior knowledge and is used to calculate similarity between large data sets, such as microarray measurements. PLS, as well as principal component discriminant analysis, are supervised methods that use additional information (biochemical, histopathological or clinical data) to optimise the discrimination between samples (Draghici et al., 2003). In addition, software tools are under development to enable in-depth analysis of any list of inter-related biological data (pathway analysis tools), and many databases are available (Davies et al., 2010). These databases include the early Protein Data Bank, the US National Center for Biotechnology Information (NCBI) sequence data sets, the University of California, Santa Cruz Genome Browser, and the ENCODE and modENCODE projects. Data sets are usually generated by different laboratories and can have different dimensionalities and organisation. To support the formatting, storage and calibration of data sets, substantial efforts have been made to analyse such databases, and online analysis tools (e.g. Galaxy, DAVID, STRING, Cytoscape, mouseNET) have made it possible to perform a number of integrative analyses on genomic data.
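As an illustration of how a similarity measure between gene profiles can drive clustering, the sketch below groups a handful of expression profiles by average-linkage hierarchical clustering, using 1 minus the Pearson correlation as the distance. The gene names, expression values and function names are hypothetical, and the choice of Pearson correlation and average linkage is an assumption; other similarity measures and linkage rules are equally common in practice.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def hierarchical_clusters(profiles, n_clusters):
    """Average-linkage agglomerative clustering on the distance 1 - r."""
    names = list(profiles)
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names
        for b in names
        if a != b
    }
    # Start with every gene in its own cluster.
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find and merge the two closest clusters.
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical expression values across five conditions (e.g. rising dose).
profiles = {
    "gene_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_B": [2.0, 4.0, 5.0, 8.0, 11.0],
    "gene_C": [9.0, 7.0, 6.0, 3.0, 1.0],
    "gene_D": [5.0, 4.0, 3.0, 3.0, 1.0],
}
clusters = hierarchical_clusters(profiles, n_clusters=2)
```

With these toy values the two rising profiles (gene_A, gene_B) end up in one cluster and the two falling profiles (gene_C, gene_D) in the other. A correlation-based distance groups genes by the shape of their response rather than by absolute expression level, which is why gene_A and gene_B cluster together despite their different magnitudes.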