• No results found

The MeTra pipeline for the characterization of metatranscriptome data

5.2.3 Taxonomic classication of 16S rDNA amplicon sequences

The sequences are taxonomically classified using the RDP Classifier (version 2.3) [Wang et al., 2007]. For the interpretation of the data, only classifications with at least 0.80 assignment confidence are considered. The taxonomic profile is visualized using Krona [Ondov et al., 2011].

5.2.4 Phylogenetic characterization of 16S rDNA sequences

16S rDNA amplicon sequences are suitable for phylogenetic tree reconstructions, since they comprise at least one variable region known to differ between species. Thus, multiple alignments and subsequent calculations of evolutionary distances between reads covering the same region segment are possible. For phylogenetic characterization, only representative OTU sequences obtained by using UCLUST are considered that were assigned to a specific taxon by the RDP Classifier in the previous step. The collected sequences are aligned by MUSCLE [Edgar, 2004a, Edgar, 2004b]. Based thereon, the tree reconstruction is performed in MEGA 5 [Tamura et al., 2007] using the neighbor- joining method [Saitou and Nei, 1987] with genetic distances as defined by Jukes Cantor [Jukes and Cantor, 1969] and a bootstrap value of 1,000 [Tamura et al., 2007].

5.3 The MeTra pipeline for the characterization of

metatranscriptome data

This section concentrates on the study of metatranscriptome data. Metatranscriptome data reveal the active members and transcribed functions within microbial communi- ties. For this purpose, the Perl-based MetaTranscriptome (MeTra) pipeline has been developed, which extracts the ribosomal, messenger and non-coding RNA sequences from a metatranscriptome dataset. Ribosomal and messenger RNAs can be exploited to generate a taxonomic profile of active members, whereas only the messenger RNA can contribute to the functional profile. In the following sections, the databases neces- sary for the identification of relevant RNA types and the procedure for the study of metatranscriptome data will be provided in detail.

5.3.1 Database construction

For the extraction of ribosomal RNA (rRNA) sequences from the metatranscriptome dataset, three databases comprising the small subunit sequences (SSUdb), large sub- unit sequences (LSUdb) and further RNAs (profile hidden Markov model RNAdb, pHMM-RNAdb), e.g., tRNAs and 5S rRNAs, are constructed (Tab. 5.1). Sequences for the large subunit (LSU), which is composed of prokaryotic 23S and eukaryotic

Table 5.1: Databases constructed for the identification of different RNA types

Database RNA-type Data source

LSUdb prokaryotic 23S, eukaryotic 28S rRNAs SILVA SSUdb prokaryotic 16S, eukaryotic 18S rRNAs Greengenes, SILVA, RDP pHMM-RNAdb functional RNAs RFAM, NCBI database

28S rRNAs, are retrieved from SILVA (Release 102) [Pruesse et al., 2007] as an arb file [Ludwig et al., 2004]. The arb file contains LSU as well as adjacent gene sequences (e.g., SSU rDNA) from Archaea, Bacteria and Eukaryota in an alignment. To retrieve only LSU sequences, they are exported as a fasta file including the aligned sequences from position 66,155 to 129,0619 (according to the 23S rDNA gene in E. coli) using the

software package ARB [Wang et al., 2007].

Sequences for the SSUdb are obtained from several public ribosomal databases includ- ing Greengenes [DeSantis et al., 2006], RDP-II (Release 10.21) [Cole et al., 2003] and SILVA (Release 102). SILVA provides 16S/18S rRNAs for all three domains of life, whereas RDP and Greengenes store bacterial and archaeal 16S rRNAs. The small sub- unit sequences from SILVA and Greengenes are retrieved in an arb file format. Likewise the LSU database in SILVA, the SSU sequences include more than the small ribosomal RNAs such as incorrectly annotated large subunit ribosomal RNAs. Therefore, only positions between 986 and 43,332 (according to 16S rDNA gene in E. coli) of the aligned SILVA sequences are considered for the setup of the SSUdb. The trimmed sequences are exported as a fasta file using the ARB software.

For the search of additional functional RNAs, a profile hidden Markov model (HMM) [Durbin et al., 2006] database is built (pHMM-RNAdb). The RFAM 10.0 database [Griffiths-Jones et al., 2005] is utilized for the generation of the pHMM-RNAdb. The alignments for all available RNA families are exported. Using the HMMER3 package [Eddy, 2011], a profile HMM is built for each alignment. Additionally, tRNA sequences from all available NCBI genomes are extracted and filtered for the length ranging from 30 to 180 bp in order to remove non-tRNA fragments. Subsequently, the sequences are separated according to their anticodons and bacterial or archaeal origin and aligned using MUSCLE [Edgar, 2004a, Edgar, 2004b]. Profile HMMs are generated for each alignment and added to the pHMM-RNAdb.

5.3 The MeTra pipeline for the characterization of metatranscriptome data

5.3.2 Pipeline for the identication of dierent RNA types

For the identification of RNA tags within the dataset a three-step analysis pipeline has been developed in MeTra (Fig. 5.10). First, a BLAST [Altschul et al., 1990] search using an E-value cutoff of 10−5 against the custom SSUdb and LSUdb is performed to identify small and large subunit ribosomal sequences in the metatranscriptome dataset (Fig. 5.10, step 1). The sequence complexity filter is explicitly disabled to include regions with low sequence complexity. As the SSUdb comprises both archaeal and bacterial 16S rRNA as well as eukaryotic 18S rRNA gene sequences, BLAST results for eukaryotic sequences are discarded. For this purpose, the BLAST result headers containing the accession number as identifier are examined. Sequences with matches to the bacterial and archaeal entries in the SSUdb databases are classified by means of the RDP Classifier [Cole et al., 2003]. Only classifications with at least 0.80 assignment confidence are considered for the taxonomic profile.

Figure 5.10: Steps involved in the metatranscriptomic MeTra pipeline: The pipeline iden- tifies rRNA tags, remaining RNA tags and mRNA tags in three different steps.

Second, a profile HMM-based approach is applied to identify additional RNA types such as tRNAs and 5S rRNAs in the pHMM-RNAdb database (Fig. 5.10, step 2). Reads

without matches to SSUdb and LSUdb are compared to the pHMM-RNAdb using the HMMER3 package (E-value cutoff 10−5).

In the third step, mRNA tags are identified using CARMA3 [Gerlach and Stoye, 2011] (Fig. 5.10, step 3). All RNA sequences that have neither a BLAST hit to the ribosomal databases SSUdb and LSUdb nor a hit to the pHMM-RNAdb are further functionally and taxonomically characterized using CARMA3, which simultaneously performs BLASTX searches against the non-redundant GenBank protein database and HMM- based searches against the Pfam database [Finn et al., 2006].

Thereafter, the active biological processes within the underlying community are deter- mined. For this purpose, sequences classified as mRNA tags by CARMA3 are compared against the eggNOG [Muller et al., 2010] database using BlastX (E-value cutoff of 10−5, disabled low complexity filter). Finally, the sequences are annotated with COG or NOG accessions according to their best hit.

5.4 A method for the identication of industrially relevant

enzymes

In this section, a method will be outlined that is applicable for the discovery of novel genes encoding target enzymes in metagenome datasets. Reads encoding functions of industrial interest can be identified by exploiting the knowledge of so far described enzymes. Recently, hydrolase genes were identified in metagenome data, which were confirmed in subsequent activity tests (Section 2.2.3). The surveys clearly demonstrated that profile HMMs representing enzymes of interest are suitable to capture specific sequences from metagenome databases. The profile HMMs were obtained from the Pfam database, which is a large collection of protein families [Finn et al., 2006]. Since the Pfam database has a limited number of HMMs, the development of new models is needed in order to search for reads encoding desired enzymes. In the following, an approach for the construction of a novel profile HMM will be described. A requirement for this approach is the knowledge of described enzyme sequences and a conserved domain that is specific for the enzyme.

5.4.1 Construction of a prole hidden Markov model (HMM)

A two-step approach is carried out to construct a novel profile HMM representing an enzyme of industrial interest (Fig. 5.11). First, an initial profile HMM is built. For this purpose, a BLAST search is applied to the NCBI nr protein database using known queries of the target enzyme. The sequences of the hits are collected and aligned using MUSCLE [Edgar, 2004a, Edgar, 2004b]. Sequences that did not cover the required conserved domain are excluded from subsequent analysis. In addition, duplicates are removed to avoid bias in the profile HMM. The remaining aligned sequences are used