1.1 Introduction: complexity of transposon annotation

1.1.2 Methods for TE annotation

De novo annotation based on clustering repetitive sequences

One approach, which bundles both discovery and identification, is to exploit the fact that transposons tend to be present in high copy numbers and to scan the genome for repeated sequences without using any prior information about TE structure or similarity to known TE sequences.

This has the advantage of potentially identifying transposons unique to the genome under study, but it also raises several challenges. First, there is the pitfall of mis-annotating other types of repeats as transposons. Indeed, there are many repetitive sequences throughout the genome that are not transposons, for example centromeric repeats, tandem repeats or segmental duplications. Second, TE families composed of largely non-overlapping fragments or present in low copy number will be overlooked by these methods.

The final challenge is classifying the identified sequences into families: the aforementioned diversity of elements within a family makes clustering these sequences difficult.

This strategy has been implemented in software such as RepeatScout (Price, Jones, and Pevzner 2005), which has been shown to be rather unspecific when benchmarked against the curated annotation of the A. thaliana genome and to recover only fragments of the elements it correctly identifies (Flutre et al. 2011).
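The underlying idea, that repeats reveal themselves through copy number alone, can be illustrated with a toy k-mer count. This is only a minimal sketch and not the RepeatScout algorithm; the function name `frequent_kmers` and the parameters `k` and `min_count` are hypothetical choices made for illustration.

```python
from collections import Counter

def revcomp(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def frequent_kmers(genome, k=12, min_count=3):
    """Count k-mers on both strands and keep those seen at least min_count times.

    Each k-mer is collapsed with its reverse complement (the lexicographically
    smaller of the two is used as the key), so copies inserted in either
    orientation count towards the same entry.
    """
    counts = Counter()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[min(kmer, revcomp(kmer))] += 1
    return {kmer: n for kmer, n in counts.items() if n >= min_count}
```

Note that nothing in such a count distinguishes a transposon from a tandem or centromeric repeat, which is exactly the specificity problem described above.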

TE identification based on representative elements

Another approach is to first discover TEs in the genome of interest (for which many methods are possible), then identify all the individual copies that make up the families of these elements. In this case, the discovery step aims to find “representative” elements, which would be the minimal set of sequences that represent the diversity of elements in the genome. A representative sequence would thus be one which, when used as a query to identify similar sequences, could retrieve the maximal number of fragments in its family (Figure 1.1).

Figure 1.1: A TE family is a continuum of similar sequences

RepeatMasker, a widely used tool to identify transposable element sequences, bypasses the discovery step and uses as representatives consensus sequences of elements found in other genomes (usually taken from the RepBase database, http://www.girinst.org/repbase/). While this approach might be sufficient for masking the most conserved regions of TEs before running gene predictors, it has been shown to be “neither the most efficient nor the most sensitive approach” for TE annotation (Juretic, Bureau, and Bruskiewich 2004). Representative sequences that are specific to a given genome are also better queries for recovering their family members than consensus sequences, as the latter tend not to include the specific non-coding sequences and structural characteristics (Buisine, Quesneville, and Colot 2008).

There are several approaches to identifying representative TE sequences in a genome: de novo, homology-based or structure-based.

Methods for representative discovery

De novo methods are based on the identification of repeated sequences (for example by whole-genome self-alignment), followed by clustering and categorization. This approach was evaluated by Flutre et al. (2011) by benchmarking against the A. thaliana and D. melanogaster annotations. These authors compare the performance of different algorithms for whole-genome self-alignment (BLASTER and PALS) and clustering (GROUPER, RECON and PILER). They then classify the elements based on structural characteristics and/or coding capacity, discarding sequences that do not display any TE features as false positives. This final step is essential for increasing the specificity of the annotations, and is what distinguishes their work from previous implementations of this approach. However, it also potentially eliminates any completely novel TE families whose characteristics differ from those of any known TE. They also show that representatives identified in this way perform just as well as, but not better than, well-curated representatives when used to find copies.
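The clustering stage of such a workflow can be pictured as grouping sequences that are connected by pairwise matches. The sketch below uses plain single-linkage clustering with a union-find structure; it is an illustrative stand-in and does not reproduce the GROUPER, RECON or PILER algorithms. The function name `cluster_matches` and the input format are assumptions.

```python
def cluster_matches(pairs):
    """Group sequence identifiers into single-linkage clusters.

    pairs: iterable of (id_a, id_b) tuples, one per pairwise match that
    passed whatever identity and length filter was applied upstream.
    Returns a list of sorted clusters, itself sorted for reproducibility.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())
```

Single linkage merges two groups as soon as one match connects them, so a single chimeric or nested copy can fuse otherwise distinct families; this is one way to see why a classification step after clustering matters.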

Homology-based methods use the knowledge base of the large number of TE sequences that have already been characterized, and the fact that coding sequences tend to be well conserved over certain types of elements. Indeed, the RT proteins of LTR retrotransposons are generally conserved, as are certain domains of the TPase of DNA transposons (Wicker et al. 2007). Transposon-related sequences are available in general databases such as NCBI (http://www.ncbi.nlm.nih.gov), from which they can be retrieved with key terms such as “transposase” or “retrotransposase”. There are also transposon-specific databases such as RepBase for all types of transposons, GyDB (http://gydb.org) for retroelements, and RetrOryza (http://retroryza.fr/) for retroelements in rice. Similarity search is usually implemented by local alignment search algorithms such as BLAST, using protein queries against genomic sequences, or by HMMs constructed from multiple alignments (Juretic, Bureau, and Bruskiewich 2004). The possibility of using HMMs depends on having sufficient TE sequences already characterized in the genome of interest to construct profiles based on alignments, so while HMM-based search tools can be more sensitive than alignment-based tools (Juretic, Bureau, and Bruskiewich 2004), they are not feasible in all cases.

The homology-based strategy has the advantage of generating few false positives and of being capable of retrieving single-copy elements. The drawbacks are that only the well-conserved regions of a given element will be identified, and that older, more degenerate elements or copies that have no coding capacity (such as MITEs or SINEs) will be overlooked. To characterize the full sequence of an element retrieved in this way, one must use other methods to identify the non-coding or less conserved regions surrounding the coding region. This can be done either by aligning multiple genomic hits along with their flanking sequences and defining the borders as the points where the alignment breaks down, or by searching for structural elements such as TIRs in the flanking regions.

Another approach to discovering TEs in genomic sequences exploits characteristics specific to a given type or superfamily; such methods are as numerous and varied as the types of TEs themselves (reviewed in Bergman and Quesneville 2007). These methods are based on identifying a structural characteristic of a TE sequence, such as the long terminal repeats of LTR retrotransposons or the short inverted repeats of MITEs. Multiple tools implement a search for LTR retrotransposons based on identifying direct repeats within a certain window. LTR_FINDER (Xu and Wang 2007) is the most recent of these and has the advantage of allowing user-specified thresholds of divergence between the two LTR sequences as well as identification of ORFs between them, which helps filter out false positives. MITEs also lend themselves to identification by structural characteristics, as they are short sequences bounded by terminal inverted repeats (TIRs), flanked by target site duplications (TSDs), and found in large copy numbers. Methods for identifying these are reviewed in Guermonprez et al. (2013), and the most recent is MITE-hunter (Han and Wessler 2010).

MITE-hunter is the most sophisticated of these tools in that it provides several methods of eliminating false positives at various steps of the algorithm. As in other tools such as MUST (Chen et al. 2009), the first step is to identify candidate MITEs based on TIRs and TSDs. In a subsequent step, candidates are discriminated based on copy number by pairwise comparison: elements that do not align with any other are eliminated as false positives. Then a consensus sequence is generated for each family, and the definition of its borders is verified by multiple sequence alignment with its copies taken with flanking regions. This last step relies on the fact that within a given family, the copies' terminal sequences (i.e. TIRs and TSDs) will be nearly identical and align well, but the alignment will break down in the flanking regions as each element is inserted in a different genomic context (see Figure 1.3).
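The border verification described here can be sketched as a scan of per-column conservation across an alignment of copies with their flanks. This is a minimal illustration under stated assumptions, not MITE-hunter's implementation: the input is taken to be a list of equal-length aligned strings with `-` as the gap character, and the function names and the `threshold` value are hypothetical.

```python
from collections import Counter

def column_conservation(aligned):
    """For each column, the fraction of sequences carrying the most common base.

    Gap characters are ignored when picking the most common base, but the
    denominator stays the total number of sequences, so gappy columns score low.
    """
    n = len(aligned)
    scores = []
    for col in zip(*aligned):
        residues = Counter(c for c in col if c != "-")
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores

def conserved_block(aligned, threshold=0.8):
    """Return (start, end) of the longest run of columns at or above threshold.

    Coordinates are 0-based and end-exclusive; None is returned if no column
    qualifies. The threshold is assumed to be greater than zero.
    """
    best = None
    run_start = None
    scores = column_conservation(aligned) + [0.0]  # sentinel closes a trailing run
    for i, score in enumerate(scores):
        if score >= threshold and run_start is None:
            run_start = i
        elif score < threshold and run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best
```

A single poorly conserved column inside the element would split the run in this simple version; a practical implementation would presumably smooth the scores over a window before calling the borders.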

The key to using these structure-based methods is implementing good filtering strategies to eliminate false positives, as these types of structures can occur by chance in the genome with varying frequency. For this reason, autonomous DNA transposons are not usually discovered with this approach, even though they have TIRs just like MITEs. Indeed, they are more easily retrieved with homology-based methods, after which the search for structural characteristics is limited to their flanking regions.
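The structural signature used to nominate MITE candidates, inverted repeats at both ends plus a duplicated target site, can be written down in a few lines. The sketch below is an assumption-laden illustration rather than the method of any tool cited here: the function `looks_like_mite`, the default lengths (`tir_len=10`, `tsd_len=2`) and the mismatch tolerance are hypothetical.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def looks_like_mite(genome, start, end, tir_len=10, tsd_len=2, max_mismatch=1):
    """Check a candidate genome[start:end] for TIRs and a flanking TSD.

    The candidate passes if its first tir_len bases match the reverse
    complement of its last tir_len bases with at most max_mismatch
    mismatches, and the tsd_len bases immediately upstream are identical
    to the tsd_len bases immediately downstream.
    """
    if start < tsd_len or end + tsd_len > len(genome) or end - start < 2 * tir_len:
        return False  # not enough sequence for the flanks or for two TIRs
    left_tir = genome[start:start + tir_len]
    right_tir = genome[end - tir_len:end]
    tir_ok = hamming(left_tir, revcomp(right_tir)) <= max_mismatch
    tsd_ok = genome[start - tsd_len:start] == genome[end:end + tsd_len]
    return tir_ok and tsd_ok
```

Because a test this permissive will also accept chance palindromic arrangements, it only makes sense as a first pass that is followed by filters such as copy number and border verification.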

Methods for identifying copies of a representative

Once representative TEs have been identified using any of the aforementioned methods, the next step is to identify the copies in their respective families. This step is necessary since a family can be composed of fragments and degenerate elements that would not have been identified by the previous step. The way this is implemented depends on the goal: if the aim is only to mask the genome, a simple similarity search with a program such as BLAST or FASTA, or with an HMM-based tool, is sufficient to identify sequences similar to the queries. However, if the goal is to study the biology of TEs, some of the biological factors of TE evolution must be taken into account to obtain useful data.

The main problem with finding copies of an element is that similarity searches will give fragmented annotations if the target sequence has a large insertion or deletion or has diverged sufficiently. Therefore, to obtain a proper annotation, one has to chain fragments together to properly define a copy. This is implemented within certain pipelines, for example by MATCHER in the REPET pipeline (Quesneville et al. 2005; Flutre et al. 2011), which uses a dynamic programming algorithm to chain collinear fragments and then resolve overlaps. Another program is Greedier (Li, Kahveci, and Settles 2008), which uses a graph-based algorithm to link fragments by maximizing the total alignment score. Both of these algorithms are embedded in their respective pipelines and cannot be used independently.
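The chaining of collinear fragments by dynamic programming can be illustrated as follows. This is a generic quadratic-time chaining sketch and not the MATCHER or Greedier implementation; the hit format, the function name `chain_fragments` and the `max_gap` parameter are assumptions, and the overlap-resolution step mentioned above is not covered.

```python
def chain_fragments(hits, max_gap=5000):
    """Return the highest-scoring chain of collinear, non-overlapping hits.

    hits: list of (q_start, q_end, s_start, s_end, score) tuples on the same
    strand, with half-open coordinates on the query (representative) and
    subject (genome). Two hits can be chained if the second starts after the
    first ends on both sequences and the genomic gap is at most max_gap.
    """
    if not hits:
        return []
    hits = sorted(hits, key=lambda h: (h[2], h[0]))  # order along the genome
    best = [h[4] for h in hits]   # best chain score ending at each hit
    prev = [None] * len(hits)     # back-pointers for the traceback
    for j, (qs_j, _, ss_j, _, score_j) in enumerate(hits):
        for i in range(j):
            _, qe_i, _, se_i, _ = hits[i]
            collinear = qe_i <= qs_j and se_i <= ss_j
            if collinear and ss_j - se_i <= max_gap and best[i] + score_j > best[j]:
                best[j] = best[i] + score_j
                prev[j] = i
    # Trace back from the hit that ends the best-scoring chain.
    j = max(range(len(hits)), key=best.__getitem__)
    chain = []
    while j is not None:
        chain.append(hits[j])
        j = prev[j]
    return chain[::-1]
```

Capping the genomic gap is one simple way to tolerate a large insertion inside a copy while still keeping distant, unrelated hits in separate chains; the appropriate cap depends on the genome and element type.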

In document Genome-wide transposon analyses: annotation, movement and impact on plant function and evolution (Page 34-39)