2 Material and Methods
2.3 Computational methods
References for the software and tools used are given in Table 2.10. In addition, Table 2.11 lists the scripts used to automate routine tasks.
2.3.1 Homology searches
2.3.1.1 Web-based BLAST searches
DNA homology searches were performed using the web-based BLAST service of NCBI. BNR1 ORF1 homologues were identified by tBLASTn searches, which query a nucleotide database with a protein sequence. The resulting hits were saved as an Extensible Markup Language (XML) file. Subsequently, the script NCBI-tBLASTn-parser.py was used to retrieve the homologous sequences listed in the XML file directly from the NCBI database.
2.3.1.2 Local BLAST searches
Local databases were queried using the BLAST option in BioEdit. In addition to the human-readable output, a tabular output file was generated and analyzed by one of the LocalBlastBioedit-tBLASTn.py scripts. These procedures enabled fast retrieval of either the exact BLAST matches or the matches together with a controlled amount of flanking sequence.
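The retrieval of exact matches or of matches with flanking sequence can be illustrated with a minimal Python sketch. It is not the LocalBlastBioedit-tBLASTn.py script itself. It assumes that the tabular file follows the common 12-column BLAST layout (subject identifier in column 2, subject start and end in columns 9 and 10); the names Hit, parse_tabular_line, with_flanks and extract are hypothetical.

```python
from typing import NamedTuple

# Complement table used for hits on the minus strand.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


class Hit(NamedTuple):
    subject: str
    start: int   # 1-based, always start <= end
    end: int
    strand: str  # "+" or "-"


def parse_tabular_line(line: str) -> Hit:
    """Parse one line of 12-column BLAST tabular output (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    sstart, send = int(fields[8]), int(fields[9])
    # In tBLASTn output, sstart > send indicates a hit on the minus strand.
    strand = "+" if sstart <= send else "-"
    return Hit(fields[1], min(sstart, send), max(sstart, send), strand)


def with_flanks(hit: Hit, flank: int, subject_length: int) -> Hit:
    """Extend a hit by `flank` bp on both sides, clipped to the subject."""
    return hit._replace(start=max(1, hit.start - flank),
                        end=min(subject_length, hit.end + flank))


def extract(sequence: str, hit: Hit) -> str:
    """Return the hit region, reverse-complemented for minus-strand hits."""
    region = sequence[hit.start - 1:hit.end]
    if hit.strand == "-":
        region = region.translate(COMPLEMENT)[::-1]
    return region
```

Calling with_flanks with flank=0 returns the exact match, so one function covers both retrieval modes.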
2.3.2 Multiple sequence alignments and assemblies
To compare large numbers of DNA or protein sequences, multiple sequence alignments were created using the MUSCLE algorithm. For more than 500 sequences, the MUSCLE standalone software was used. For very large alignments that exhausted the main memory of the computer, a cruder alignment was produced with only two iterations (instead of a flexibly allocated number of iterations).
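As an illustration of the reduced-iteration setting, the following sketch assembles a command line for the MUSCLE standalone program. It assumes the MUSCLE 3.x command-line syntax (-in, -out, -maxiters); newer MUSCLE releases use different options. The helper name muscle_command and the file names are hypothetical.

```python
def muscle_command(infile, outfile, max_iterations=None):
    """Build a MUSCLE 3.x command line (assumed syntax).

    If max_iterations is None, MUSCLE chooses the number of
    refinement iterations itself.
    """
    command = ["muscle", "-in", infile, "-out", outfile]
    if max_iterations is not None:
        command += ["-maxiters", str(max_iterations)]
    return command


# Hypothetical file names; pass the list to subprocess.run() to execute it.
print(" ".join(muscle_command("rt_sequences.fasta", "rt_aligned.fasta", 2)))
```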
For comparison and alignment of one sequence to a database of sequences (e.g. small RNAs), the Geneious assembler was applied.
2.3.3 Visualization of multiple sequence alignments
Comparative retrotransposon sequence analysis was conducted using the software MEGA4. Neighbor-Joining consensus trees (Saitou and Nei, 1987) were constructed using the Poisson correction method, and all positions containing alignment gaps and missing data were eliminated only in pairwise sequence comparisons. Alternatively, Geneious was used to build Neighbor-Joining consensus trees if branch-specific access to the underlying sequence data was needed. Dendrograms were exchanged between both programs using the Newick tree format. For very large alignment files, bootstrap support was not calculated in order to shorten computation time.
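The distance measure named above can be made concrete with a short sketch. With p as the proportion of differing amino acid sites between two aligned sequences, the Poisson-corrected distance is d = -ln(1 - p). The function below is an illustration rather than the MEGA4 implementation; it applies pairwise deletion by skipping every column in which either sequence has a gap or missing-data symbol. The choice of "-" and "?" as those symbols is an assumption.

```python
import math

# Assumed symbols for alignment gaps and missing data.
SKIPPED = {"-", "?"}


def poisson_distance(seq_a: str, seq_b: str) -> float:
    """Poisson-corrected distance between two aligned protein sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in SKIPPED or b in SKIPPED:
            continue  # pairwise deletion: ignore this column for this pair only
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = differences / compared
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)
```

For two sequences that differ at one of five comparable sites, p = 0.2 and d = -ln(0.8), which is approximately 0.223.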
2.3.4 Hidden Markov Model (HMM)-based motif search for the genome-wide identification of retrotransposons
For the genome-wide detection of retrotransposon RT sequences, a Hidden Markov Model (HMM)-based approach was applied. A Hidden Markov Model is a statistical model of a multiple sequence alignment that takes into account the conservation of amino acids at a certain position as well as the probability of their neighboring amino acids. The software HMMER3 was used to build HMMs and query local databases, while computer scripts enabled controlled sequence extraction. A typical HMM workflow for identification, annotation and presentation of reverse transcriptase sequences is presented in Figure 2.1.
[Flowchart, steps in workflow order]
Build a HMM (HMMER3, hmmbuild) and sequence database translation (Translate-from-FASTA.py)
-> Scan the AA database with the HMM (HMMER3, hmmsearch)
-> Get the sequences of the HMM output (HMMER-Parse.py)
-> Alignment of the output sequences (MUSCLE)
-> Neighbor joining tree (MEGA4 or Geneious)
Figure 2.1: Workflow for identification, annotation and presentation of HMMER-derived reverse transcriptase sequences.
2.3.4.1 Creation of a HMM
Hidden Markov Models were constructed with the hmmbuild function of HMMER3 from an amino acid alignment of transposon-typical reverse transcriptases. It is crucial that the underlying alignment is balanced in both sequence diversity and organism diversity. For analysis of LINEs, the LINE RT alignment provided by Kapitonov et al. (2009) was shortened to contain only the eight characterized RT domains (Malik et al., 1999; Wright et al., 1996; Xiong and Eickbush, 1990). For analysis of Ty3-gypsy, Ty1-copia and BEL-Pao retrotransposons, RT alignments from the Gypsy Database were used without change (Llorens et al., 2010; gydb.org). These alignments were converted to the HMMER3-compatible Stockholm format using the Format Converter software prior to HMM generation.
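To show what the conversion to Stockholm format involves, the sketch below writes an aligned FASTA file as a minimal Stockholm block: a "# STOCKHOLM 1.0" header, one "name sequence" line per entry, and a closing "//". It is an illustration only and does not reproduce the Format Converter software; the function names are hypothetical.

```python
def read_aligned_fasta(text: str) -> dict:
    """Read aligned FASTA text into {name: sequence}; names end at the first space."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def to_stockholm(records: dict) -> str:
    """Write aligned sequences as a minimal, non-interleaved Stockholm block."""
    if len({len(seq) for seq in records.values()}) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    width = max(len(name) for name in records)
    lines = ["# STOCKHOLM 1.0"]
    for name, seq in records.items():
        lines.append(f"{name.ljust(width)} {seq}")
    lines.append("//")
    return "\n".join(lines) + "\n"
```

The resulting file can then be passed to hmmbuild, for example as `hmmbuild rt_model.hmm rt_alignment.sto` (hypothetical file names).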
2.3.4.2 Searching a local database with a HMM
Plant genomes were translated in all six reading frames using the script Translate-from-FASTA.py. For long contigs, the sequences were partitioned into 100,000 bp fragments with 2000 bp overlaps prior to translation. Sequence fragmentation was performed using the program PartitionSequence.py.
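Both preparation steps can be sketched in a few lines of Python. The code is an illustration under stated assumptions and not the original PartitionSequence.py or Translate-from-FASTA.py: it uses the standard genetic code, translates codons containing ambiguous bases to "X", and returns fragment start positions as 0-based offsets. All function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def partition(sequence, size=100_000, overlap=2_000):
    """Split a sequence into (offset, fragment) pairs with overlapping ends."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    fragments = []
    start = 0
    while True:
        fragments.append((start, sequence[start:start + size]))
        if start + size >= len(sequence):
            break
        start += size - overlap
    return fragments


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

The overlap ensures that a reverse transcriptase domain spanning a fragment boundary is still contained completely in at least one fragment, provided that it is shorter than the overlap.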
The hmmsearch function of HMMER3 was applied to query the amino acid database with the HMM. A machine-readable tabular output (domtblout option) was saved and parsed with HMMER-Parse.py, taking into account the HMMER score and the alignment length. This method retrieves the exact matches to the HMM query in FASTA format. Using one of the get-nt-seq-from-HMMER-parse.py scripts, it was also possible to extract the corresponding nucleotide sequences and, if desired, flanking regions.
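The filtering of the domtblout file by score and alignment length can be sketched as follows. This is not the HMMER-Parse.py script. It assumes the whitespace-separated domtblout layout of HMMER3, in which the per-domain score is the 14th column and the alignment start and end on the target are the 18th and 19th columns. Whether the original script used the per-domain or the full-sequence score is not stated here, so the use of the per-domain score is an assumption, as are the names DomainHit and parse_domtblout.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    target: str    # name of the translated database sequence
    query: str     # name of the HMM
    score: float   # per-domain bit score (assumed to be the relevant score)
    ali_from: int  # alignment start on the target (amino acids, 1-based)
    ali_to: int    # alignment end on the target

    @property
    def ali_length(self) -> int:
        return self.ali_to - self.ali_from + 1


def parse_domtblout(text: str, min_score: float, min_length: int) -> List[DomainHit]:
    """Keep domain hits that reach the given score and alignment length."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header and footer
        fields = line.split(maxsplit=22)  # the last column is free-text description
        hit = DomainHit(fields[0], fields[3], float(fields[13]),
                        int(fields[17]), int(fields[18]))
        if hit.score >= min_score and hit.ali_length >= min_length:
            hits.append(hit)
    return hits
```

The amino acid coordinates ali_from and ali_to delimit the exact match within the translated sequence; together with the reading frame and the fragment offset they can be converted back to nucleotide coordinates.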
2.3.4.3 Calibration of the alignment
For parsing of the HMM output, parameters were calibrated by a search against a set of previously identified retrotransposon ORFs containing reverse transcriptases. This set included LINEs, Ty1-copia, Ty3-gypsy and BEL-Pao retrotransposons, retroviruses and endogenous plant pararetrovirus sequences from Kapitonov et al. (2009) and from the Gypsy Database (Llorens et al., 2010; gydb.org). The HMMER3 score threshold was set to 50, as hits with a higher score included only reverse transcriptases of the desired retrotransposon type.
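The logic of this calibration can be illustrated with a small sketch: given the scores of all hits against the reference set and the known type of each reference sequence, the highest score of any hit of an undesired type marks the lower bound for a threshold that retains only the desired type. The scores and labels below are invented for illustration and are not results of this work; the function names are hypothetical.

```python
def highest_off_target_score(scored_hits, desired_type):
    """Return the best score among hits that are not of the desired type."""
    scores = [score for score, kind in scored_hits if kind != desired_type]
    return max(scores) if scores else None


def retained_types(scored_hits, threshold):
    """Return the set of element types among hits scoring above the threshold."""
    return {kind for score, kind in scored_hits if score > threshold}


# Invented example values (score, type of the reference sequence).
example_hits = [
    (212.4, "LINE"),
    (87.0, "LINE"),
    (55.1, "LINE"),
    (48.3, "Ty3-gypsy"),
    (31.5, "Ty1-copia"),
]

print(highest_off_target_score(example_hits, "LINE"))
print(retained_types(example_hits, 50))
```

In this invented example, any threshold above the highest off-target score retains LINE hits only.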
2.3.5 Annotation of open reading frames, amino acid composition and secondary structure motifs
Specialized computational tools were employed to detect ORFs and define sequence features. These tools and their area of application are listed in Table 2.10.