
Chapter 4 Data repository, integration of proteomic data onto ToxoDB and the

4.4.1 Data integration

The integration of proteomic data with other genomic resources on ToxoDB was carried out at the peptide level. Since the genome sequence remains reasonably stable, mapping the peptide identifications onto the corresponding genome scaffold avoids having to update the peptide mapping each time a new version of the genome annotation is released: peptide expression data can be mapped directly onto the new gene models according to their coordinates. However, as discussed in section 2.4.3, multiple gene model databases were used in this study to maximize protein identification. The benefit of including alternative gene models has been demonstrated in this chapter, where in many cases peptide evidence supports the prediction made by alternative models, but several technical issues in handling multiple gene models were observed during the peptide mapping process.
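The coordinate-based mapping described at the start of this section can be illustrated with a minimal sketch. It assumes that each peptide has already been placed on a scaffold as a single interval and that a gene model is simply a list of exon intervals. The names `Peptide`, `GeneModel` and `genes_for_peptide`, and all coordinates, are hypothetical and are not taken from the script used in this study. Strand and intron-spanning peptides are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peptide:
    scaffold: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int


@dataclass(frozen=True)
class GeneModel:
    gene_id: str
    scaffold: str
    exons: tuple  # tuple of (start, end) pairs, 1-based inclusive


def overlaps_exon(peptide, gene):
    """True if the peptide interval overlaps any exon of the gene model."""
    if peptide.scaffold != gene.scaffold:
        return False
    return any(
        peptide.start <= exon_end and peptide.end >= exon_start
        for exon_start, exon_end in gene.exons
    )


def genes_for_peptide(peptide, annotation):
    """Return the IDs of all gene models in an annotation hit by the peptide."""
    return [g.gene_id for g in annotation if overlaps_exon(peptide, g)]


# The peptide keeps its scaffold coordinates; only the annotation changes.
peptide = Peptide("scaffold_1", 1200, 1245)
older_release = [GeneModel("gene_A1", "scaffold_1", ((1000, 1100),))]
newer_release = [GeneModel("gene_B1", "scaffold_1", ((1000, 1100), (1180, 1300)))]

old_hits = genes_for_peptide(peptide, older_release)
new_hits = genes_for_peptide(peptide, newer_release)
print(old_hits, new_hits)
```

In this toy case the peptide lies outside the older gene model but falls inside an exon added in the newer one, so it is picked up without being re-mapped to the genome.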

Firstly, the script used in this study can efficiently collect peptide identifications from the various gene models and ORFs expressed. However, because the data had to be integrated with other genomic resources already stored on ToxoDB, the peptide mapping algorithm was oriented towards the release 4 genome annotation. Particular examples are steps 2 and 3 of the mapping algorithm, where priority is given to release 4 gene models during mapping.

Step 2 states that if more than 50% of the peptides from an alternative model could be mapped to a release 4 gene, this was considered a valid mapping and the matching peptides were aligned with the corresponding release 4 gene. The remaining non-matching peptides were mounted separately on the scaffold, aligned with the alternative model. Step 3 states that if a set of peptides from an alternative model could be mapped to more than one release 4 gene, the gene that could host the most peptides was reported. Again, the remaining peptides were mounted separately on the scaffold, aligned with the alternative model. As shown in this chapter, peptides identified in the neighbouring region of a release 4 gene model are of great interest to gene expression research as well as genome annotation. They provide strong evidence of alternative splice sites, missing exons and different positioning of start or stop codons, which could alter the function of the gene when expressed. However, because the peptide mapping algorithm was oriented towards the release 4 genome annotation, these important peptides have been mapped separately to alternative gene models. Unfortunately, the current version of the ToxoDB interface does not allow the user to query these alternative gene models. In other words, although they were once readily available, these peptides are no longer searchable.
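Steps 2 and 3 as described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used in this study: the function name `assign_to_release4` and its inputs are hypothetical, ties between genes are broken arbitrarily, and a peptide set that fails the 50% rule is simply returned unassigned (presumably to be handled by the later steps).

```python
def assign_to_release4(alt_peptides, release4_hits):
    """Assign the peptides of one alternative model to a release 4 gene.

    alt_peptides:  set of peptide sequences identified from the alternative model.
    release4_hits: dict mapping a release 4 gene ID to the subset of those
                   peptides that can also be mapped to that gene.
    Returns (gene_id or None, peptides aligned with the release 4 gene,
             peptides left aligned with the alternative model).
    """
    if not alt_peptides or not release4_hits:
        return None, set(), set(alt_peptides)

    # Step 3: if several release 4 genes are hit, report the one that
    # hosts the most peptides (ties follow dict order; an assumption).
    best_gene = max(release4_hits, key=lambda g: len(release4_hits[g] & alt_peptides))
    matched = release4_hits[best_gene] & alt_peptides

    # Step 2: the mapping is valid only if MORE than 50% of the peptides match.
    if len(matched) * 2 <= len(alt_peptides):
        return None, set(), set(alt_peptides)

    # Non-matching peptides stay on the scaffold with the alternative model.
    return best_gene, matched, alt_peptides - matched


# Toy example with made-up peptide sequences and gene IDs.
alt = {"LVEAGK", "TFDSIR", "GEWNVK", "AYLQPR"}
hits = {"gene_1": {"LVEAGK", "TFDSIR", "GEWNVK"}, "gene_2": {"LVEAGK"}}
gene, on_release4, on_alternative = assign_to_release4(alt, hits)
print(gene, sorted(on_release4), sorted(on_alternative))
```

Here three of the four peptides follow `gene_1`, while the fourth remains attached to the alternative model. It is this leftover subset that becomes unsearchable when the interface cannot query alternative gene models.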

Additionally, steps 4, 5 and 6 state that peptides identified from an alternative model can be mapped to an ORF or alternative gene model only if 100% of the peptides can be mapped. This stringent threshold was set because the previous versions of the EST and ORF databases used in MS data searching contain small sequencing errors that are not consistent with the release 4 gene models and ORFs. In total, 220 TgEST sequences and 184 ORF sequences identified in this study could not be mapped onto ToxoDB for this reason, representing 4.9% of the total number of sequences mapped. This reflects the existence of sequencing errors in previous versions of the EST and ORF databases. However, it also resulted in the loss of genuine peptide identifications that mapped to the correct part of the EST or ORF sequence. In addition, it was not possible to map peptides identified from 163 alternative gene models in this study (74 from TgGlmHMM, 58 from TgTwinScan and 31 from TgTigrScan) to the release 4 gene models and ORFs. While these peptides are presented on ToxoDB, it is not possible to query MS evidence for older gene models such as TgTwinScan, which means that this subset of peptide data is effectively “lost” to the wider research community. Solutions to this problem are discussed in the next section.
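The effect of the 100% threshold can be seen in a small sketch. It assumes, purely for illustration, that a peptide "maps" when it is an exact substring of the target protein sequence. The function name `map_all_or_nothing` and the sequences are hypothetical and do not come from the study.

```python
def map_all_or_nothing(peptides, target_sequence):
    """Return all peptides if every one matches the target exactly, else none.

    Exact substring matching is an assumption of this sketch.
    """
    if peptides and all(p in target_sequence for p in peptides):
        return list(peptides)
    return []


target = "MKTLLVAGSRPEPTIDEKQW"  # made-up translated ORF sequence

consistent = ["KTLLVAG", "PEPTIDEK"]
# The third peptide differs from the target by one residue, mimicking a
# sequencing error in the older database it was identified from.
with_error = ["KTLLVAG", "PEPTIDEK", "SRAEPT"]

kept_consistent = map_all_or_nothing(consistent, target)
kept_with_error = map_all_or_nothing(with_error, target)

# Peptides that do match individually but are discarded with the whole set.
discarded_genuine = [p for p in with_error if p in target]
print(kept_consistent, kept_with_error, discarded_genuine)
```

A single inconsistent peptide causes the whole set to be rejected, so the two peptides that do match are discarded along with it.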

Despite the limitations of the peptide mapping approach arising from the multiple gene model and ORF databases used, the integration of the proteomic data generated in this study onto ToxoDB has already assisted the genome annotation process by confirming the correct predictions of 2477 intron-spanning peptides in the official release 4 genome annotation. More importantly, the discrepancies between the proteomic data and the release 4 gene models also demonstrate the incompleteness of the release 4 genome annotation. The peptide identification data provided evidence for the expression of 394 alternative models, which were mapped to 226 ORFs, as well as 421 splice sites that were not predicted by the release 4 genome annotation. In fact, even in the latest release 5.2 genome annotation, published in July 2009, there is strong peptide evidence for the expression of alternative models that were mapped to 203 ORFs. The reduction in the number of ORFs with peptide evidence from 226 to 203 reflects the improvement in the release 5.2 genome annotation, where peptides previously mapped to 23 ORFs can now be mapped to release 5.2 genes. However, the peptides mapped to those 203 ORFs still hold valuable information for the improvement of the release 5.2 genome annotation. Moreover, if the peptide identifications “lost” during the mapping process were to be mapped onto ToxoDB, an even larger discrepancy between expression data and predicted gene models would be evident. This work has highlighted the importance of proteogenomic research, which directly incorporates proteomic data into the genome annotation process. It has also highlighted the ongoing need in proteomic analysis to re-submit raw MS data against the latest genome annotations in order to obtain the highest quality dataset. This is a time-consuming, manual task, but it is of significant importance if one is to avoid “lost”, outdated and inaccessible annotations.

In document Proteomics of Toxoplasma gondii (Page 138-141)