Composition, size reduction and coverage

Chapter 2. LudwigNR: A single comprehensive, non-redundant, species-

2.3 Results and Discussion

2.3.1 Composition, size reduction and coverage

The LudwigNR_Q113 database consists of 19,263,084 sequence entries and is 8.9 GB in size. By comparison, the latest TrEMBL sequence database from UniProtKB consists of ~30 million entries and is growing exponentially (Fig. 2.4).

Figure 2.4: Number of entries in UniProtKB/TrEMBL since its inception. Statistics obtained from http://www.ebi.ac.uk/uniprot/TrEMBLstats/.

A snapshot summary of selected organisms from all database sources “BEFORE” and “AFTER” removal of duplicate sequence entries is shown in Table 2.1.

Table 2.1: The number of sequence entries BEFORE and AFTER removal of duplicate sequence entries at the species-level.

BEFORE

species_taxid species sp tr ref ens jCVI Giardia Micros Plasmo gb Toxo JGI IMGA gemmata sludge epfl Broad TOTAL
9358 Choloepus hoffmanni 2 42 0 12435 0 0 0 0 0 0 0 0 0 0 0 0 12479
6035 Encephalitozoon cuniculi 420 1610 0 0 0 0 7541 0 0 0 0 0 0 0 0 0 9571
33085 Entamoeba invadens 1 64 0 0 11549 0 0 0 0 0 0 0 0 0 0 0 11614
562 Escherichia coli 22948 2134165 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2157113
114 Gemmata obscuriglobus 0 1 0 0 0 0 0 0 0 0 0 0 7763 0 0 0 7764
5741 Giardia intestinalis 63 18542 6502 0 0 15435 0 0 0 0 0 0 0 0 0 0 40542
9606 Homo sapiens 37075 111137 0 101071 0 0 0 0 0 0 0 0 0 0 0 0 249283
3880 Medicago truncatula 55 56314 46092 0 0 0 0 0 0 0 0 33903 0 0 0 0 136364
10090 Mus musculus 24440 59125 0 50892 0 0 0 0 0 0 0 0 0 0 0 0 134457
1773 Mycobacterium tuberculosis 2445 155198 0 0 0 0 0 0 0 0 0 0 0 0 4019 0 161662
4784 Phytophthora capsici 2 188 0 0 0 0 0 0 0 0 159198 0 0 0 0 0 159388
4787 Phytophthora infestans 29 18365 17837 0 0 0 0 0 0 0 0 0 0 0 0 18138 54369
67593 Phytophthora sojae 0 761 47 0 0 0 0 0 0 0 253903 0 0 0 0 0 254711
5821 Plasmodium berghei 25 11851 9824 0 0 0 0 5012 0 0 0 0 0 0 0 0 26712
256318 sludge (environmental samples) 0 0 0 0 0 0 0 0 0 0 0 0 0 73852 0 0 73852
5811 Toxoplasma gondii 60 19333 8013 0 0 0 0 0 0 24294 0 0 0 0 0 0 51700
5722 Trichomonas vaginalis 20 50594 59674 0 0 0 0 0 97873 0 0 0 0 0 0 0 208161

AFTER

species_taxid species sp tr ref ens jCVI Giardia Micros Plasmo gb Toxo JGI IMGA gemmata sludge epfl Broad TOTAL
9358 Choloepus hoffmanni 2 3 0 12396 0 0 0 0 0 0 0 0 0 0 0 0 12401
6035 Encephalitozoon cuniculi 420 1585 0 0 0 0 2148 0 0 0 0 0 0 0 0 0 4153
33085 Entamoeba invadens 1 41 0 0 11460 0 0 0 0 0 0 0 0 0 0 0 11502
562 Escherichia coli 9970 301382 0 0 0 0 0 0 0 0 0 0 0 0 0 0 311352
114 Gemmata obscuriglobus 0 0 0 0 0 0 0 0 0 0 0 0 7763 0 0 0 7763
5741 Giardia intestinalis 61 16546 2 0 0 346 0 0 0 0 0 0 0 0 0 0 16955
9606 Homo sapiens 37034 53458 0 29699 0 0 0 0 0 0 0 0 0 0 0 0 120191
3880 Medicago truncatula 55 54512 844 0 0 0 0 0 0 0 0 33749 0 0 0 0 89160
10090 Mus musculus 24391 33242 0 11153 0 0 0 0 0 0 0 0 0 0 0 0 68786
1773 Mycobacterium tuberculosis 1992 20056 0 0 0 0 0 0 0 0 0 0 0 0 1561 0 23609
4784 Phytophthora capsici 2 48 0 0 0 0 0 0 0 0 87632 0 0 0 0 0 87682
4787 Phytophthora infestans 29 17723 225 0 0 0 0 0 0 0 0 0 0 0 0 355 18332
67593 Phytophthora sojae 0 701 0 0 0 0 0 0 0 0 118137 0 0 0 0 0 118838
5821 Plasmodium berghei 24 4565 5801 0 0 0 0 4045 0 0 0 0 0 0 0 0 14435
256318 sludge (environmental samples) 0 0 0 0 0 0 0 0 0 0 0 0 0 73852 0 0 73852
5811 Toxoplasma gondii 59 19079 31 0 0 0 0 0 0 4393 0 0 0 0 0 0 23562
5722 Trichomonas vaginalis 18 44576 5692 0 0 0 0 0 32848 0 0 0 0 0 0 0 83134
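The TOTAL column of Table 2.1 can be converted into a per-species percentage reduction. The following is a minimal sketch rather than part of the LudwigNR pipeline: the function name `percent_reduction` is illustrative, and only a subset of rows is included, with the BEFORE and AFTER totals copied from the table.

```python
# (BEFORE total, AFTER total) pairs copied from the TOTAL column of Table 2.1.
totals = {
    "Choloepus hoffmanni": (12479, 12401),
    "Escherichia coli": (2157113, 311352),
    "Homo sapiens": (249283, 120191),
    "Mus musculus": (134457, 68786),
    "Mycobacterium tuberculosis": (161662, 23609),
    "sludge (environmental samples)": (73852, 73852),
}


def percent_reduction(before: int, after: int) -> float:
    """Percentage of entries removed by species-level duplicate removal."""
    return 100.0 * (before - after) / before


# Round to one decimal place for reporting.
reductions = {
    species: round(percent_reduction(before, after), 1)
    for species, (before, after) in totals.items()
}

# Print the species with the largest reduction first.
for species, pct in sorted(reductions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{species:32s} {pct:5.1f}%")
```

The same calculation can be applied to any individual source column (e.g. tr or ref) to see which repository contributes most of the removed entries for a given organism.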

The database sources (sp, tr, ref etc.) are listed in Table 2.1 from left to right in the order in which duplicate sequence entries are removed; this order reflects the preference given to UniProtKB entries over specialised sequence repositories (e.g. Broad). Two key points follow from the information in Table 2.1. 1) The reduction in the overall number of sequences (BEFORE versus AFTER) is in almost all cases between 50 and 80%, with many organisms achieving a reduction in excess of 95% (highlighted in yellow). Indeed, the International Nucleotide Sequence Database Collaboration (INSDC) [87] has suggested that the assignment of all strain-level taxonomic identifiers for micro-organism genomes be terminated from 2013. 2) The importance of including multiple data sources is clearly apparent from the large number of sequence entries from smaller, perhaps more focused, sequence repositories (e.g. JGI, IMGA etc.) that are retained after processing. Selected examples are highlighted (in yellow) in Table 2.1 to illustrate the contribution that these specialised sequence repositories make for selected organisms. Two examples are worth noting. Firstly, Ensembl has good protein sequence coverage of Choloepus hoffmanni, which is under-represented in UniProtKB. Secondly, there are 33,381 sequence entries for Drosophila melanogaster (taxonomic identifier 7227) in the LudwigNR_Q113 sequence database. Of these, 4,440 originate from Swiss-Prot (1,267 sequences containing isoforms), 1,926 from Ensembl Genomes (FlyBase), 2,031 from RefSeq and 24,984 from TrEMBL.

The BEFORE and AFTER statistics (Table 2.1) allow informed decisions to be made as to which sequence repositories should be included in the compilation of LudwigNR, since a balance has to be struck between increased proteome coverage on the one hand and increased redundancy, together with the inevitable growth in the overall size of the LudwigNR sequence database, on the other. It could be argued that these novel sequences eventually make their way to the major sequence repositories, but the current trend suggests an increase in the number of smaller, more focused, sequence repositories. Given the latest release statistics for the TrEMBL sequence database (Fig. 2.4), it is envisaged that this trend will continue.
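The priority-ordered removal of duplicates described for Table 2.1 can be sketched as follows. This is an illustration under stated assumptions, not the LudwigNR implementation: a duplicate is assumed to mean an identical sequence string within the same species-level taxonomic identifier, the source order is taken from the column order of Table 2.1, and the accessions and sequences are hypothetical placeholders.

```python
from collections import Counter

# Source order taken from the columns of Table 2.1 (left = highest priority).
SOURCE_PRIORITY = [
    "sp", "tr", "ref", "ens", "jCVI", "Giardia", "Micros", "Plasmo",
    "gb", "Toxo", "JGI", "IMGA", "gemmata", "sludge", "epfl", "Broad",
]
RANK = {source: i for i, source in enumerate(SOURCE_PRIORITY)}


def deduplicate(entries):
    """Keep one entry per (species taxid, sequence), preferring earlier sources.

    Assumption: duplicates are exact sequence matches within a species.
    """
    kept = {}
    # sorted() is stable, so ties within one source keep their input order.
    for entry in sorted(entries, key=lambda e: RANK[e["source"]]):
        key = (entry["taxid"], entry["sequence"])
        kept.setdefault(key, entry)  # the first (highest-priority) entry wins
    return list(kept.values())


# Hypothetical entries; accessions and sequences are placeholders only.
entries = [
    {"source": "Broad", "taxid": 4787, "accession": "hyp_B1", "sequence": "MKTAYIAK"},
    {"source": "tr", "taxid": 4787, "accession": "hyp_T1", "sequence": "MKTAYIAK"},
    {"source": "Broad", "taxid": 4787, "accession": "hyp_B2", "sequence": "MSLLNPR"},
    {"source": "sp", "taxid": 9606, "accession": "hyp_S1", "sequence": "MKTAYIAK"},
]

unique = deduplicate(entries)
after_counts = Counter(entry["source"] for entry in unique)
print(sorted(entry["accession"] for entry in unique))
print(dict(after_counts))
```

In this toy input the Broad entry `hyp_B1` is dropped in favour of the identical TrEMBL entry `hyp_T1`, while `hyp_B2` is retained because no higher-priority source holds that sequence for the species. The Swiss-Prot entry with the same sequence but a different taxonomic identifier is unaffected.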

Since the LudwigNR sequence database is used for MS-based proteomics and peptidomics data analysis, it is illustrative to use an example data set to compare and contrast the results obtained using several commonly used sequence databases. Identical search parameters and tandem mass spectrum peak lists were submitted to the Mascot search algorithm (version 3.3) and searched against the latest versions of the Swiss-Prot, NCBInr and LudwigNR sequence databases using Mus musculus as a taxonomy filter. The sample was obtained from mouse liver and the mass spectrometry data originated from the AO-HUPO MPIS project [88, 89]. A protein-level view of the data searched against the three sequence databases (Swiss-Prot, NCBInr and LudwigNR) can be seen in Fig. 2.5 (A), (B) and (C), respectively.

Figure 2.5: Mascot search result report showing top-ranking proteins.

Mus musculus proteins inferred on the basis of identified tryptic peptides searched against the latest release of the Swiss-Prot (A), NCBInr (B), and LudwigNR_Q113 (C) sequence databases.

Swiss-Prot (part of UniProtKB) is an annotated and manually curated protein sequence database (similar in many respects to RefSeq from the NCBI) featuring canonical “represented” sequences and as such does not comprise all isoform sequences or variants (http://www.uniprot.org/faq/30). Firstly, Fig. 2.5 (A) shows that protein scores are generally higher when using the Swiss-Prot database than when using the NCBInr database (Fig. 2.5 (B)). This result is consistent with the more compact size of Swiss-Prot compared with NCBInr. Secondly, the greater sequence redundancy of NCBInr compared with Swiss-Prot is noticeable from the protein-level sequence clustering evident in Fig. 2.5 (B). The protein identification results from searches against the LudwigNR sequence database (Fig. 2.5 (C)) suggest that LudwigNR is more redundant than Swiss-Prot, but less redundant than NCBInr. These results are not unexpected considering the size and the intended design of the LudwigNR sequence database, which was to provide increased sequence coverage whilst minimising redundancy.

Of concern, however, is the high-scoring false-positive identification of the protein Titin (primary accession: A2ASS6), which is evident in all the search results (Fig. 2.5 (A), (B) and (C)). Titin is ranked 3rd when searched against the NCBInr sequence database (Fig. 2.5 (B)), but 13th when searched against the Swiss-Prot (Fig. 2.5 (A)) and LudwigNR (Fig. 2.5 (C)) sequence databases. This protein is extremely large (35,213 amino acids); when digested with trypsin in silico it yields thousands of tryptic peptide fragments, leading to a number of high-scoring “random” peptide matches (Fig. 2.6).
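The effect of protein length on the number of candidate peptides can be illustrated with a minimal in silico digest. The sketch assumes the commonly used simplified trypsin rule (cleavage C-terminal to K or R, except when the next residue is P) and no missed cleavages; the enzyme settings actually used in the Mascot searches are not stated here, and the sequences below are synthetic rather than Titin itself.

```python
def tryptic_peptides(sequence: str) -> list[str]:
    """Fully tryptic peptides under the simplified rule: cleave after K or R,
    but not when the following residue is P. No missed cleavages."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if residue in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


# Small worked example: the R-P bond is not cleaved.
print(tryptic_peptides("AKRPGKLLR"))  # ['AK', 'RPGK', 'LLR']

# Synthetic sequences: the peptide count grows in proportion to length.
short_protein = "AAAAAK" * 10     # 60 residues
long_protein = "AAAAAK" * 1000    # 6,000 residues
print(len(tryptic_peptides(short_protein)), len(tryptic_peptides(long_protein)))
```

Because every additional peptide is another opportunity for a chance match to a spectrum, a very long sequence accumulates more random matches than a short one, even before missed cleavages and variable modifications are considered.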

Figure 2.6: Mascot search result report showing high-scoring false-positive protein identification.

The protein Titin (A2ASS6) is falsely identified when searched against the NCBInr sequence database.

In contrast, a single protein (Glucose-6-phosphate transporter: Q6WG34) does not appear in the Swiss-Prot search results, but is present in both the NCBInr and LudwigNR search results. Fig. 2.7 (A) and (B) show the identification of a single significant-scoring peptide from which a researcher can infer the identification of the Glucose-6-phosphate transporter protein. This is generally not a problem if multiple peptides from a protein are identified. In this case, however, two factors contribute: firstly, the Swiss-Prot database only retains canonical sequences for well-annotated genomes; secondly, the data-dependent sampling of ions by the mass spectrometer means that only the highest intensity ions are selected for fragmentation and hence sequencing.
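Proteins inferred from a single peptide, as in the case above, can be flagged for closer inspection. The sketch below is a hypothetical illustration: the input is assumed to be a list of (protein accession, peptide sequence) pairs for significant matches, the function names are invented for this example, and the accessions and peptides are placeholders rather than values from the Mascot reports.

```python
from collections import defaultdict


def peptides_per_protein(matches):
    """Group distinct peptide sequences by the protein they were assigned to."""
    by_protein = defaultdict(set)
    for accession, peptide in matches:
        by_protein[accession].add(peptide)
    return by_protein


def single_peptide_proteins(matches):
    """Accessions supported by exactly one distinct peptide sequence."""
    grouped = peptides_per_protein(matches)
    return sorted(acc for acc, peptides in grouped.items() if len(peptides) == 1)


# Placeholder matches; a repeated peptide counts only once per protein.
matches = [
    ("PROT_A", "PEPTIDEK"),
    ("PROT_A", "SAMPLER"),
    ("PROT_A", "PEPTIDEK"),
    ("PROT_B", "MAGICK"),
]

flagged = single_peptide_proteins(matches)
print(flagged)  # ['PROT_B']
```

Counting distinct sequences rather than spectra is a deliberate choice in this sketch: repeated observations of the same peptide do not add independent evidence for the protein.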

Figure 2.7: Identification of the Glucose-6-phosphate transporter protein.

This protein is identified on the basis of a single high-scoring (significant) peptide when searched with Mascot against the LudwigNR (A) and NCBInr (B) sequence databases.

In document Improved bioinformatics tools for the analysis of mass spectrometry-based peptidomics and proteomics data (Page 50-58)