
Chapter 4. Digger: development of a highly sensitive and specific search

4.3.1 Single spectrum examples

The Digger search algorithm and scoring function were designed to be highly sensitive and to identify peptides with confidence against a large background of candidate peptides (a large search space). In this section, two individual spectra are used to illustrate how the Digger scoring function performs. Fig. 4.4 (A) and (B) show the identification of a peptide using the Mascot and Digger search algorithms, respectively. This same peptide was uniquely identified (at 1% FDR) by the program Morpheus [106]. Counter-intuitively, the authors of Morpheus used a liberal precursor ion mass tolerance (±2.1 Da) even though the data were acquired at high mass accuracy. Nevertheless, their claim that Morpheus performs better than Mascot (at least for this one spectrum) is substantiated by the evidence in Fig. 4.4 (A): although the peptide is ranked 1st with a Mascot score of 22, it does not score high enough to be considered a significant match (both the Identity and Homology thresholds are greater than 22). When this spectrum is re-analysed with Digger, the correct peptide receives a very high score (80.27) (Fig. 4.4 (B)) relative to the other high-ranking target and decoy peptide hits. As mentioned previously, Mascot’s scoring function is based on a pre-computed model, whereas Digger’s scoring function relies on empirical observations of “stochastic” fragment ion match statistics collected during the search.

In 2003, Fenyo and Beavis [116] summarised the requirements for using general scoring schemes to assess the statistical significance of MS-based protein identifications using real protein sequence databases. The requirements are (i) that the sequence database is large (meaning that it has thousands of candidate peptides per spectrum); (ii) that the level of sequence redundancy is low; and (iii) that the peptide matching the tandem mass spectrum is effectively removed from the sequence database. If these requirements are met for the “decoy” sequence database, then the stochastic distribution of fragment ion matches collected during the search should be a reasonable proxy for the decoy null model. The Digger scoring function is therefore dependent to some extent on the search parameters and the search space, but this dependency is negligible and should not make it non-compliant with the target-decoy approach. In any case, it is not clear how one could formulate a theoretical model that incorporates all experimental variables and search parameters in order to compute a score.

Figure 4.4: Correct peptide identified by the Mascot and Digger search algorithms. Even though Mascot identifies the correct peptide ranked 1st, its score is below threshold as the Expect value is 8.5 (A). Digger, on the other hand, identifies the correct peptide with a score of 80.27 (well differentiated from the other high-ranking target and decoy peptides) (B).

It should be noted that there is a fundamental difference between the scoring function used by the Digger search algorithm and many other widely used scoring functions that compute a probability. Most scoring functions determine, for a particular spectrum, how many predicted fragment ions match (e.g., ‘b’ and ‘y’ ions) relative to the total number of predicted fragment ions for all candidate peptides, and further assume that this process can be modelled by a discrete probability distribution (examples include the Poisson, binomial, hypergeometric or multivariate hypergeometric distributions). No assumptions are made in the current work with regard to particular distributions, since the underlying “decoy” fragment ion matching statistics are used to calculate the relative frequency of the number of candidate peptides that give rise to a specific number of fragment ion matches (i.e., a p-value-like quantity). Using the number of peptides is somewhat analogous to the Mascot Identity score derivation.
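The idea of a distribution-free, p-value-like quantity can be illustrated with a minimal sketch. It is not Digger's actual scoring formula, which is not given in this section: the function names, the add-one correction, the -10·log10 scaling and the match counts below are all assumptions made purely for illustration.

```python
import math


def empirical_tail_frequency(decoy_match_counts, observed_matches):
    """Relative frequency of candidates with at least `observed_matches` matches.

    Uses the (count + 1) / (N + 1) estimator so the result is never zero;
    that correction is an assumption of this sketch, not taken from Digger.
    """
    n_candidates = len(decoy_match_counts)
    if n_candidates == 0:
        raise ValueError("at least one candidate peptide is required")
    at_least = sum(1 for m in decoy_match_counts if m >= observed_matches)
    return (at_least + 1) / (n_candidates + 1)


def log_scaled_score(frequency):
    """Map a frequency onto a -10*log10 scale (an illustrative choice only)."""
    return -10.0 * math.log10(frequency)


# Hypothetical fragment ion match counts for 100 random (decoy) candidates
# scored against a single spectrum.
decoy_counts = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 4 + [4] * 1

for matches in (2, 4, 9):
    freq = empirical_tail_frequency(decoy_counts, matches)
    print(matches, round(freq, 4), round(log_scaled_score(freq), 2))
```

No Poisson, binomial or hypergeometric model appears anywhere in the sketch: the null distribution is simply whatever the random matches collected during the search happen to look like.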

Mascot’s Identity score reflects the number of candidate peptides matching a particular spectrum. At p = 0.05, the Mascot Identity score is -10 log10(0.05 × 20 × 1/N), where 20 is assumed to be the number of amino acids and N is the number of candidate peptides matching the spectrum. In practice, the Identity score threshold is conservative, especially for large search spaces (e.g., no-enzyme searches or searches for phosphorylated peptides). Mascot’s Homology score threshold, on the other hand, better reflects the distribution of the highest scoring peptides for a spectrum. Precise details of how the Homology score is calculated are unknown, but the largest gap between scores is determined and 13 is added to it (i.e., HS = Gap score + 13). The number 13 is simply the contribution of one extra significant fragment ion match (-10 log10(0.05)). All things being perfect, the base Mascot ions score is simply 0.05^n, where n is the number of fragment ion matches for a peptide. This formulation can be easily verified by creating simulated “perfect” spectra. Mascot was the first and only search algorithm to score individual ion series and combinations of ion series. Only ion series with a “significant” number of fragment ion matches are tested and scored (stratified by level). This scoring feature is undoubtedly Mascot’s strength, because it allows the scoring function to adapt to different mass spectrometry instruments and ionisation conditions.
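The arithmetic in this paragraph can be checked with a short sketch. It evaluates the Identity threshold expression exactly as written above and the per-match increment of about 13. The function names and the candidate counts are hypothetical, and the "perfect" ions score assumes that the base score is a probability of 0.05 raised to the power n, reported on the same -10·log10 scale.

```python
import math


def identity_threshold_as_stated(n_candidates, p=0.05, n_amino_acids=20):
    """Evaluate -10*log10(p * 20 * 1/N) exactly as the text writes it."""
    return -10.0 * math.log10(p * n_amino_acids * (1.0 / n_candidates))


def per_match_increment(p=0.05):
    """Score contributed by one extra significant match: -10*log10(0.05)."""
    return -10.0 * math.log10(p)


def perfect_ions_score(n_matches, p=0.05):
    """-10*log10(p**n), assuming the base ions score is the probability p**n."""
    return n_matches * per_match_increment(p)


# Hypothetical candidate counts: a restrictive search and a much larger one.
for n in (1_000, 1_000_000):
    print(n, round(identity_threshold_as_stated(n), 2))

print(round(per_match_increment(), 2))
print(round(perfect_ions_score(3), 2))
```

Under this expression the threshold rises by 10 units for every tenfold increase in N, which is one way to see why the Identity threshold becomes conservative for very large search spaces.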

Fig. 4.5 (A)-(E) highlight a number of potential issues with the Mascot score derivation compared with that of Digger. Fig. 4.5 (A) shows the correct peptide annotated by Mascot with a good “run” of matching ‘y’ ions and a greater than random component of matching ‘b’ ions. This peptide is ranked 1st when searched in an unrestrictive mode (no-enzyme) using a precursor mass tolerance of 30 ppm and a fragment ion mass tolerance of ±0.5 Da. However, the correct peptide drops to 5th when the precursor mass tolerance is widened to ±3000 ppm (Fig. 4.5 (B)). The new top-ranked “incorrect” peptide (score = 27.4) has a decent “run” of ‘y’ ion matches to level-2 peaks (i.e., 8 matches are made using 16 peaks). To Mascot’s credit, the top-ranked incorrect peptide is not labelled as significant, but the result highlights a lack of sensitivity when the search space is dramatically increased. This example also illustrates a rather subtle point: candidate peptides that have fragment ion matches to the most intense peaks in the spectrum are favoured, irrespective of whether or not those matches make chemical sense (i.e., an expert data analyst would easily discount the hit as incorrect). This behaviour of a scoring function is not totally unexpected, given that we define a perfect ladder spectrum as one where all the ‘y’ or ‘b’ fragment ions of a peptide match the most intense peaks in the spectrum.

When the same spectrum is searched with Digger using identical search parameters (30 ppm precursor mass tolerance and no-enzyme search), the correct peptide is easily identified and ranked 1st (score = 46.77) (Fig. 4.5 (D)). Of interest, 158,607 candidate peptides were scored during this search (see qmatch1, Fig. 4.5 (D)). When the precursor mass tolerance was widened to ±3000 ppm (Fig. 4.5 (E)), 3.26 million candidate peptides were scored and the correct peptide was still ranked 1st, with an even higher score (50.13). Of course, the scores of all highly ranked “incorrect” peptides also increased owing to the increase in random fragment ion matching, but the score of the correct peptide is still well differentiated from the background hits. These results indicate that an increase in the search space should not adversely affect peptide identification results, as long as the information content and signal-to-noise ratio of a spectrum are high enough. These results also illustrate an advantage of scoring schemes that rely on empirical observations rather than on pre-computed, theoretical model-based formulae. Perhaps the only advantage of the model-based approach (as used in Mascot or Andromeda) is that the scoring function is not dependent on the search space.
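To put the two precursor tolerances into perspective, the sketch below converts a ppm tolerance into an absolute mass window and compares the candidate counts reported above. The precursor mass of 1500 Da and the function name are hypothetical, and the wide-search count is the rounded "3.26 million" from the text, so the ratio is approximate.

```python
def ppm_to_da(mass_da, tolerance_ppm):
    """Convert a relative tolerance in ppm into an absolute tolerance in Da."""
    return mass_da * tolerance_ppm / 1e6


# Hypothetical precursor mass; the real value is not stated in the text.
precursor_mass = 1500.0

narrow_da = ppm_to_da(precursor_mass, 30)    # +/- window for the 30 ppm search
wide_da = ppm_to_da(precursor_mass, 3000)    # +/- window for the 3000 ppm search

# Candidate counts reported for the two Digger searches.
candidates_narrow = 158_607
candidates_wide = 3_260_000  # "3.26 million", rounded in the text

window_ratio = wide_da / narrow_da
candidate_ratio = candidates_wide / candidates_narrow

print(narrow_da, wide_da)
print(round(window_ratio), round(candidate_ratio, 1))
```

For the hypothetical 1500 Da precursor, the window grows from ±0.045 Da to ±4.5 Da, while the number of scored candidates grows roughly 20-fold. The correct peptide nevertheless remains ranked 1st in both searches.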

Figure 4.5: Effect of search space on peptide identifications. Mascot results (A-C) and Digger results (D, E). Correct peptide identified by Mascot based on a search using 30 ppm precursor ion mass tolerance (A); correct peptide ranked 5th by Mascot based on a search using 3000 ppm precursor ion mass tolerance (B); incorrect peptide identified by Mascot based on a search using 3000 ppm precursor ion mass tolerance (C); correct peptide identified by Digger based on a search using 30 ppm (D) and 3000 ppm precursor ion mass tolerance (E). The Digger score in (E) is higher than that in (D), even though the search space is dramatically increased.

In document Improved bioinformatics tools for the analysis of mass spectrometry-based peptidomics and proteomics data (Page 91-99)