Re-scoring using multiple peptide matches

Chapter 5. Re-scoring, re-ranking, assisted PTM localisation scoring and

5.3.2 Re-scoring using multiple peptide matches

As shown in section 5.3.1, an extended feature set that includes XCorr metrics and retention time information yields large increases in peptide spectral matches, especially for the Mascot search algorithm. In this section, we explore whether lower-ranked correct peptide hits (based on the search algorithm's primary score) can be re-ranked by Percolator using re-scoring and extended feature sets.

5.3.2.1 Single spectrum examples

We and others [134] have previously demonstrated that unrestrictive (no-enzyme) sequence database searches are challenging due to the increased search space. Similarly, the identification of phosphopeptides is challenging because Ser, Thr and Tyr make up ~13% of the composition of sequence databases and all combinations of these residues have to be tested for potential phosphorylation. Besides the increased search space, phosphorylated peptides tend to fragment in an unpredictable manner, complicating the interpretation of the data. For these reasons, phosphopeptides are often not the top-ranked peptides, and if they are multi-phosphorylated there are often several candidate sequences to choose from, because the phospho-groups can be localised to multiple potential sites. An important point to note is that second-pass database searches (refined or error-tolerant searches) often fail to identify phosphopeptides when samples are heavily enriched for a particular modification. In these cases, a search algorithm has to be sensitive enough, firstly, to identify the peptide and, secondly, to localise the modification sites correctly.
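To make the combinatorial growth concrete, the following minimal sketch enumerates every way of placing a given number of phospho-groups on the Ser, Thr and Tyr residues of a peptide. The function name is hypothetical and is not part of MSPro, Mascot or Digger; the lower-case convention for modified residues follows Table 5.1, and the peptide is the one discussed there.

```python
from itertools import combinations
from math import comb

PHOSPHO_RESIDUES = "STY"  # residues that can carry a phospho-group


def phospho_isoforms(sequence, n_phospho):
    """Return every placement of n_phospho phospho-groups on S/T/Y.

    Modified residues are written in lower case (illustrative helper).
    """
    sites = [i for i, aa in enumerate(sequence) if aa in PHOSPHO_RESIDUES]
    isoforms = []
    for chosen in combinations(sites, n_phospho):
        residues = list(sequence)
        for i in chosen:
            residues[i] = residues[i].lower()
        isoforms.append("".join(residues))
    return isoforms


peptide = "SATSTSASPTLR"
n_sites = sum(aa in PHOSPHO_RESIDUES for aa in peptide)
doubly = phospho_isoforms(peptide, 2)

# Number of candidate site isoforms equals "n_sites choose 2".
print(n_sites, len(doubly), comb(n_sites, 2))
```

For this 12-residue peptide there are 7 candidate sites and therefore 21 doubly phosphorylated isoforms, each of which the search algorithm has to score against the same spectrum.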

Using the enriched phosphopeptide dataset from the ABRF-iPRG 2010 study, we analysed all three fractions (3, 4, 12) that were obtained from strong cation-exchange chromatography and acquired by LC-MS/MS. Fig. 5.4 (A) shows the top 10 peptide hits identified by Mascot for a selected tandem mass spectrum from fraction 3 of the phosphopeptide data set. The correct peptide (highlighted in red) is ranked 4th and carries the correct localisation of the Ser-phosphorylation site amongst 3 possible alternatives (based on iPRG 2010 consensus results).

Figure 5.4: Identification of a phosphopeptide by Mascot.

Correct phosphopeptide ranked 4th based on Mascot analysis, highlighted in red (A); XCorr metrics for the 4th ranked peptide (B); re-scoring by Percolator showing the original 4th ranked peptide ranked 1st (C); and evidence for the phosphopeptide and localisation of the phosphorylation site based on large-scale proteomics data sets from the literature and UniProtKB annotation (D).

It would be impossible to recover this phosphopeptide using the standard Mascot-Percolator workflow, because only the top-ranked peptide is submitted for re-scoring. However, if multiple peptide hits are submitted to Percolator for re-scoring, the potential exists to recover this phosphopeptide based on extended feature sets. In this case, using MSPro, multiple peptide hits per spectrum were submitted to Percolator for re-scoring using a Mascot delta-score of 13 and a minimum Mascot Ion score of 5. The correct peptide hit, its associated orthogonal scores (XCorr metrics) and Percolator’s re-scoring are shown in Fig. 5.4 (B). Once re-scored, the peptide hit originally ranked 4th by Mascot becomes the top-ranked hit with a significant q-value, whilst the incorrect peptide originally ranked 1st is relegated to 2nd with a q-value greater than 0.01 (Fig. 5.4 (C)). The localisation of the phosphorylation site at Ser-1383 is further supported by the work of Olsen et al., based on the analysis of a large-scale mass spectrometry data set (Fig. 5.4 (D)).
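The selection of multiple hits per spectrum can be sketched as follows. This is an illustration only, not the MSPro implementation: it assumes that a "delta-score of 13" means a hit is kept when its Ion score lies within 13 units of the top-ranked hit for the same spectrum, and the function name, peptide labels and scores are hypothetical.

```python
def select_hits(hits, max_delta=13.0, min_ion_score=5.0):
    """Choose which hits of one spectrum to pass on for re-scoring.

    hits: list of (peptide, ion_score) tuples in any order.
    Assumption: a hit is kept if its score is at least min_ion_score and
    no more than max_delta below the best score for the spectrum.
    """
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    top_score = ranked[0][1]
    return [
        (peptide, score)
        for peptide, score in ranked
        if score >= min_ion_score and top_score - score <= max_delta
    ]


# Hypothetical scores for one spectrum; labels are placeholders.
spectrum_hits = [
    ("hit_A", 30.0), ("hit_B", 27.0), ("hit_C", 24.0),
    ("hit_D", 22.0), ("hit_E", 16.0), ("hit_F", 4.0),
]
selected = select_hits(spectrum_hits)
print([peptide for peptide, _ in selected])
```

With these toy values the four best hits are retained, so a correct peptide ranked 4th would still reach Percolator, whereas a top-hit-only workflow would have discarded it.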

Localising phosphorylation sites when there is more than one possibility, and/or when the potential sites are adjacent to each other, is very challenging, especially when the mass spectrum contains too few fragment ions to determine the phosphorylation site(s) clearly. Several software solutions are now available for scoring the localisation of phosphorylation (or indeed any modification). For example, the “Ascore” algorithm [135] was one of the earliest tools to assign a probability to each potential phosphorylation site. The Mascot delta score [136] and PhosphoRS [137] have also been used. In this work, we show that re-scoring using Percolator with the extended feature set also assists with modification localisation. An example is shown in Fig. 5.5 (A), where a multi-phosphorylated peptide is identified with ambiguous site localisation (ranks 1-5). The difference in score between the rank 1 and rank 2 phosphopeptides is negligible, which suggests that widely used methods such as the Ascore or Mascot delta score would not be able to localise the phosphorylation sites unambiguously. However, as can be seen from Fig. 5.5 (B), when the 2nd ranked peptide spectrum match is re-scored using the cross-correlation method and Percolator, it is re-ranked to 1st place with a significant q-value. The localisation of the phosphorylations to the 3rd and 5th Ser residues is also confirmed by the experimental work of Beausoleil et al. (Fig. 5.5 (C)).
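The reason a negligible score difference defeats a delta-score approach can be shown with a small sketch. It assumes the delta score is simply the difference between the best and second-best scoring site isoforms of the same peptide sequence; the threshold, isoform strings and scores below are hypothetical and are not values from Fig. 5.5.

```python
def localisation_delta(isoform_scores):
    """Return (best_isoform, delta) for one spectrum.

    isoform_scores: dict mapping a site isoform to its search score.
    delta is the gap between the two best isoforms; with a single isoform
    the localisation is trivially unambiguous and delta is infinite.
    """
    if not isoform_scores:
        raise ValueError("no isoforms to compare")
    ranked = sorted(isoform_scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0], float("inf")
    return ranked[0][0], ranked[0][1] - ranked[1][1]


# Hypothetical scores for three site isoforms of one peptide.
scores = {"AsPAsLK": 21.4, "ASPAstK": 21.2, "AsPASLK": 12.0}
best, delta = localisation_delta(scores)

MIN_DELTA = 5.0  # hypothetical acceptance threshold
print(best, round(delta, 2), "unambiguous" if delta >= MIN_DELTA else "ambiguous")
```

Here the two best isoforms differ by only 0.2 score units, so the site assignment would be reported as ambiguous; orthogonal features such as the XCorr metrics give the re-scoring step additional evidence with which to separate them.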

Figure 5.5: Identification of a multi-phosphorylated peptide by Mascot.

The correct phosphopeptide sequence is ranked from 1st to 5th by Mascot, but the correctly localised phosphorylation sites belong to the peptide ranked 2nd, which has a slightly lower score than the top-ranked peptide (A); XCorr and Percolator re-scoring of the 2nd ranked peptide (B); and confirmation of the multi-phosphorylated peptide and its sites based on literature and UniProtKB annotation (C).

Based on these findings, it is clear that the fast cross-correlation scoring function implemented within MSPro for re-scoring peptide hits, and the inclusion of the XCorr metrics as an extended feature set for Percolator re-scoring, are beneficial. In order to verify the implementation of the fast cross-correlation scoring function and determine its sensitivity, we decided to score mass spectra against each other (i.e. analogous to a library search). We have implemented this feature within MSPro in order to visualise clusters of mass spectra that are very similar to each other. Further, these clusters of mass spectra can be visualised in the form of a dendrogram (using RStudio) to gain further insight as to which mass spectra are most similar. In its current implementation, acquired spectra that are within a specified mass tolerance of each other are compared using the cross-correlation procedure and scored. Average-linkage clustering and a score threshold, based on a horizontal cut at 0.9, determine which spectra are grouped into a cluster. Based on the cluster dendrogram shown in Fig. 5.6, it can be seen that this library-like search is indeed very sensitive, especially when compared with the database search results provided by Mascot. This is of course to be expected, since it is difficult to model the fragment ion intensities of peptide sequences with any accuracy. Previous work on modelling fragment ion intensities [60, 138] provides feasible solutions, but incorporating these features in a robust manner remains an on-going research problem.
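The clustering step described above can be sketched in a few lines. This is not the MSPro code: cosine similarity of binned spectra is used here as a simple stand-in for the fast cross-correlation score, the distance is assumed to be 1 minus the similarity, the precursor mass tolerance pre-filter is omitted, and all names and peak lists are hypothetical.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins; returns {bin_index: intensity}."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned


def similarity(a, b):
    """Cosine similarity of two binned spectra (stand-in for XCorr)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_linkage(spectra, cut=0.9):
    """Agglomerative average-linkage clustering of binned spectra.

    Merging stops once the closest pair of clusters is further apart
    than `cut`, which mimics a horizontal cut through the dendrogram.
    """
    n = len(spectra)
    dist = [[1.0 - similarity(spectra[i], spectra[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[x] for j in clusters[y])
                d = total / (len(clusters[x]) * len(clusters[y]))
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > cut:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Three toy spectra as (m/z, intensity) lists; the first two share peaks.
toy = [
    [(100.2, 10.0), (200.1, 5.0), (300.3, 8.0)],
    [(100.4, 9.0), (200.3, 6.0), (300.1, 7.0)],
    [(150.0, 10.0), (250.0, 10.0)],
]
clusters = average_linkage([bin_spectrum(s) for s in toy])
print(clusters)
```

The two spectra with matching fragment peaks are merged, while the third stays in its own cluster because its average distance to the merged pair exceeds the cut. The quadratic search for the closest pair is adequate for a sketch but would need a more efficient linkage routine for a full LC-MS/MS run.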

Figure 5.6: Average-linkage clustering of tandem mass spectra.

Cluster members of a doubly phosphorylated peptide spectrum are shown. The Mascot search algorithm correctly identifies and localises the phosphopeptide (blue circle), but these peptides were not top-ranked. The consensus (red circles) is based on the analysis of the data sets by 25 different groups.

Library searches, or hybrid library and database searches, are seen as an attractive option for identifying peptides. However, the former method has difficulty dealing with post-translational modifications, whilst the latter has to be embedded in a suitable statistical framework so that false discovery rates can be computed accurately. In this regard, the Pepitome [125] algorithm appears to capture most of the requirements for processing proteomics data sets. However, it is unclear whether this pipeline is suitable for peptidomics data sets.

Table 5.1 (top and bottom panels) shows additional information based on the database search results from Mascot and Digger, respectively, for the doubly phosphorylated peptide previously visualised in the form of a cluster dendrogram (Fig. 5.6). Peptide sequences that were correctly identified, though not necessarily correctly localised in terms of modifications, have been highlighted (in yellow) in Table 5.1. Closer inspection of Table 5.1 reveals a number of subtleties that are worth discussing in more detail. Firstly, the Mascot scores (top panel) for these PSMs are all very low (below the accepted Mascot score thresholds), which is not unexpected for low-abundance doubly phosphorylated peptides. Secondly, the number of correctly identified peptides using Digger (Table 5.1, bottom panel) is almost twice that of Mascot (top panel). Thirdly, several of the correctly identified peptides were not top-ranked by either of the search algorithms, but were re-ranked based on re-scoring by Percolator using extended feature sets. Fourthly, Digger outperforms Mascot at localising the phosphorylation sites to the most likely sites (Ser-566 and Ser-570), based on a majority of literature citations (Fig. 5.7). Finally, almost 50% of the cluster members are identified successfully using a combination of the Digger search algorithm and re-scoring and re-ranking using Percolator. This suggests that further improvements still have to be made in order for uninterpreted database searches to rival the sensitivity of library searching, where spectra are compared directly with previously acquired spectra.
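The counts discussed above can be re-derived from the Sequence and PepCert columns of Table 5.1. The short script below is only a bookkeeping aid: the sequences are transcribed from the table in table order (rows without a sequence are omitted), the variable and function names are hypothetical, and the isoform SATsTSAsPTLR is the one named in the caption of Fig. 5.7.

```python
TARGET = "SATSTSASPTLR"     # unmodified backbone of the cluster peptide
CONSENSUS = "SATsTSAsPTLR"  # isoform named in the caption of Fig. 5.7
CLUSTER_SIZE = 31           # "Size" column of Table 5.1

# Sequence column of Table 5.1, top panel (Mascot), in table order.
MASCOT = (
    "SATSTsAsPTLR SATSTsAsPTLR SATSTsAsPTLR SATStsASPTLR SATSTsAsPTLR "
    "SATSTsAsPTLR sATSTSASPtLR sATSTSASPtLR SATSTsAsPTLR CVsTVDFSSTR "
    "PsQSQEPSDQR mAIATstQLAR LmsKESGSSmIG sMMSSVSLmGGR PSQsQEPSDQR "
    "LAEyIAsAANR EtGILSsQESK HyFVDIQtR TASQGsSLRsGK PSAAENPTEQsK "
    "TSPGPtHRGSFD LyEIRtGGNR HYCtYDKGSK"
).split()

# Sequence column of Table 5.1, bottom panel (Digger), in table order.
DIGGER = (
    "SATsTSAsPTLR SATsTSAsPTLR SATsTSAsPTLR SATsTSAsPTLR SATStSAsPTLR "
    "SATsTSAsPTLR sATSTSAsPTLR SATSTsAsPTLR sATSTSAsPTLR sATSTSAsPTLR "
    "sATSTSAsPTLR SAtSTSAsPTLR SATStSAsPTLR SAtSTSAsPTLR PSSCPGtSSPGPK "
    "SATsTSAsPTLR PHyRVSyEK PHyRVsYEK sATSTSAsPTLR PHyRVsYEK "
    "SNLDTEVttAK PsELVPsLtR PsQSQEPSDQR ttNKIHtRK TQHsCPICQK "
    "PDYDyLAVmR DAAAHLQtsHK"
).split()

# Rows are sorted by q-value, so the PepCert = Y rows come first.
MASCOT_CERTAIN_ROWS, DIGGER_CERTAIN_ROWS = 4, 15


def backbone_hits(seqs):
    """Count sequences matching the target backbone, ignoring site case."""
    return sum(s.upper() == TARGET for s in seqs)


def consensus_hits(seqs):
    """Count sequences with exactly the consensus site localisation."""
    return sum(s == CONSENSUS for s in seqs)


mascot_backbone = backbone_hits(MASCOT)
digger_backbone = backbone_hits(DIGGER)
digger_certain = backbone_hits(DIGGER[:DIGGER_CERTAIN_ROWS])

print(mascot_backbone, digger_backbone)
print(consensus_hits(MASCOT), consensus_hits(DIGGER))
print(round(digger_certain / CLUSTER_SIZE, 2))
```

As transcribed, the correct backbone appears in 9 Mascot rows and 16 Digger rows, the consensus localisation appears in 0 Mascot rows and 6 Digger rows, and 14 of the 31 cluster members (about 45%) carry the correct backbone with PepCert = Y in the Digger panel.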

Table 5.1: Identification of a doubly-phosphorylated peptide by Mascot and Digger and PSM cluster membership.

Mascot (top panel)

Scan#  ClusterID  Size  ClusterNode  ClusterDist  IonScore  HScore  EValue    Rank  PrecMZ    charge  MassError  Sequence       T/D  qValue    PepCert
1857   3141       31    -801         0.546181     13.84     18.45   1.04E+00  1     669.7703  2       -0.3155    SATSTsAsPTLR   T    0.00E+00  Y
1815   3141       31    -756         0.535062     13.83     19.09   1.08E+00  1     669.7704  2       -0.1331    SATSTsAsPTLR   T    0.00E+00  Y
1782   3141       31    -756         0.535062     11.6      20.9    1.85E+00  1     669.7709  2       0.5966     SATSTsAsPTLR   T    3.42E-03  Y
1725   3141       31    -1363        0.63243      10.66     17.19   2.25E+00  2     669.7705  2       -0.0419    SATStsASPTLR   T    6.42E-03  Y
1569   3141       31    -1716        0.670291     11.32     24.16   1.94E+00  1     669.7708  2       0.323      SATSTsAsPTLR   T    1.96E-02  N
1642   3141       31    -3382        0.811272     7.85      15.12   4.30E+00  1     669.7705  2       -0.0419    SATSTsAsPTLR   T    2.46E-02  N
1707   3141       31    -1363        0.63243      6.27      15.78   6.31E+00  1     669.7709  2       0.5966     sATSTSASPtLR   T    3.39E-02  N
1237   3141       31    -2252        0.720332     6.8       20.23   5.27E+00  3     669.7701  2       -0.6804    sATSTSASPtLR   T    4.33E-02  N
1662   3141       31    -2651        0.75522      5.28      16.5    7.78E+00  1     669.7708  2       0.4142     SATSTsAsPTLR   T    5.88E-02  N
1902   3141       31    -998         0.579237     7.36      18.05   4.95E+00  1     669.7706  2       -4.1831    CVsTVDFSSTR    D    6.32E-02  N
1621   3141       31    -2274        0.722122     8.95      17.74   3.22E+00  4     669.77    2       0.3678     PsQSQEPSDQR    T    1.83E-01  N
1676   3141       31    -2519        0.745033     6.84      17.99   5.23E+00  1     669.7699  2       -14.8728   mAIATstQLAR    T    2.20E-01  N
1520   3141       31    -2821        0.768676     9.09      15.53   3.35E+00  1     669.772   2       3.7345     LmsKESGSSmIG   T    2.20E-01  N
2053   3141       31    -3137        0.79251      5.35      14.17   7.64E+00  1     669.7704  2       6.3244     sMMSSVSLmGGR   T    2.32E-01  N
1536   3141       31    -1940        0.691947     5.75      16.73   7.12E+00  1     669.7709  2       1.8288     PSQsQEPSDQR    T    2.33E-01  N
1420   3141       31    -3039        0.785384     8.67      22.34   3.74E+00  2     669.7711  2       -10.5036   LAEyIAsAANR    D    2.45E-01  N
1586   3141       31    -1167        0.605697     9.94      16.39   2.57E+00  2     669.7703  2       8.0926     EtGILSsQESK    D    2.90E-01  N
1553   3141       31    -2821        0.768676     6.14      19.63   6.55E+00  2     669.7708  2       4.7042     HyFVDIQtR      D    3.06E-01  N
1192   3141       31    -3606        0.829541     9.69      16.03   2.82E+00  1     669.7694  2       -10.0812   TASQGsSLRsGK   D    4.15E-01  N
1749   3141       31    -1780        0.676181     6.45      16.17   6.10E+00  1     669.7706  2       -17.4162   PSAAENPTEQsK   D    4.45E-01  N
1693   3141       31    -1780        0.676181     7.56      15.23   4.60E+00  1     669.7709  2       -9.6932    TSPGPtHRGSFD   T    4.60E-01  N
1497   3141       31    -1167        0.605697     7.07      16.3    5.41E+00  1     669.7709  2       -19.1962   LyEIRtGGNR     D    5.03E-01  N
1605   3141       31    -2274        0.722122     5.23      13.74   7.87E+00  1     669.7706  2       11.6403    HYCtYDKGSK     T    5.20E-01  N
1212   3141       31    -1940        0.691947     0         0       0.00E+00  0     669.7692  2       0          -              T    2.00E+00  N
1474   3141       31    -2364        0.73103      0         0       0.00E+00  0     669.7704  2       0          -              T    2.00E+00  N
1170   3141       31    -3039        0.785384     0         0       0.00E+00  0     669.7708  2       0          -              T    2.00E+00  N
1951   3141       31    -1440        0.641225     0         0       0.00E+00  0     669.7708  2       0          -              T    2.00E+00  N
1261   3141       31    -3584        0.827453     0         0       0.00E+00  0     669.7709  2       0          -              T    2.00E+00  N
1999   3141       31    -2651        0.75522      0         0       0.00E+00  0     669.7712  2       0          -              T    2.00E+00  N
1306   3141       31    -3045        0.785742     0         0       0.00E+00  0     669.7715  2       0          -              T    2.00E+00  N
1452   3141       31    -3216        0.798376     0         0       0.00E+00  0     669.7717  2       0          -              T    2.00E+00  N

Digger (bottom panel)

Scan#  ClusterID  Size  ClusterNode  ClusterDist  IonScore  EValue    Rank  PrecMZ    charge  MassError  Sequence       T/D  qValue    PepCert
1857   3141       31    -801         0.546183     55.65     1.53E-04  1     669.7703  2       -0.3342    SATsTSAsPTLR   T    0.00E+00  Y
1725   3141       31    -1363        0.63243      51.59     4.03E-04  2     669.7705  2       -0.0606    SATsTSAsPTLR   T    0.00E+00  Y
1782   3141       31    -756         0.535064     45.33     1.74E-03  1     669.7709  2       0.5779     SATsTSAsPTLR   T    0.00E+00  Y
1902   3141       31    -998         0.579239     42.54     3.24E-03  1     669.7706  2       0.1219     SATsTSAsPTLR   T    0.00E+00  Y
1951   3141       31    -1440        0.641226     34.79     1.93E-02  3     669.7708  2       0.3955     SATStSAsPTLR   T    0.00E+00  Y
1815   3141       31    -756         0.535064     37.5      1.03E-02  1     669.7704  2       -0.1518    SATsTSAsPTLR   T    0.00E+00  Y
1237   3141       31    -2252        0.720329     22.87     2.92E-01  4     669.7701  2       -0.6991    sATSTSAsPTLR   T    0.00E+00  Y
1569   3141       31    -1716        0.670291     24.38     2.12E-01  2     669.7708  2       0.3043     SATSTsAsPTLR   T    0.00E+00  Y
1693   3141       31    -1780        0.676181     19.79     6.10E-01  3     669.7709  2       0.4867     sATSTSAsPTLR   T    0.00E+00  Y
1536   3141       31    -1940        0.691947     15.6      1.63E+00  4     669.7709  2       0.5779     sATSTSAsPTLR   T    3.24E-04  Y
1642   3141       31    -3382        0.811272     12.49     3.27E+00  5     669.7705  2       -0.0606    sATSTSAsPTLR   T    3.24E-04  Y
1749   3141       31    -1780        0.676181     15.83     1.52E+00  3     669.7706  2       0.1219     SAtSTSAsPTLR   T    3.24E-04  Y
1212   3141       31    -1940        0.691947     16.07     1.39E+00  2     669.7692  2       -1.9775    SATStSAsPTLR   T    3.24E-04  Y
1474   3141       31    -2364        0.73103      8.48      7.98E+00  5     669.7704  2       -0.243     SAtSTSAsPTLR   T    3.24E-04  Y
1306   3141       31    -3045        0.785742     12.03     3.71E+00  3     669.7715  2       -2.8007    PSSCPGtSSPGPK  T    9.30E-04  Y
1999   3141       31    -2651        0.755219     7.75      9.95E+00  3     669.7712  2       0.9428     SATsTSAsPTLR   T    1.94E-02  N
1497   3141       31    -1167        0.605697     10.33     5.49E+00  2     669.7709  2       4.9689     PHyRVSyEK      T    5.06E-02  N
1553   3141       31    -2821        0.768676     10.59     5.08E+00  2     669.7708  2       4.6953     PHyRVsYEK      T    7.29E-02  N
1707   3141       31    -1363        0.63243      5.74      1.58E+01  3     669.7709  2       0.5779     sATSTSAsPTLR   T    8.58E-02  N
1420   3141       31    -3039        0.785384     8.67      8.05E+00  2     669.7711  2       5.2425     PHyRVsYEK      T    9.21E-02  N
2053   3141       31    -3137        0.79251      9.86      6.00E+00  1     669.7704  2       8.2466     SNLDTEVttAK    T    1.34E-01  N
1261   3141       31    -3584        0.827453     11.05     3.90E+00  1     669.7709  2       15.0474    PsELVPsLtR     D    1.42E-01  N
1586   3141       31    -1167        0.605697     8.48      7.98E+00  1     669.7703  2       0.9032     PsQSQEPSDQR    T    1.49E-01  N
1676   3141       31    -2519        0.745033     8.64      6.71E+00  1     669.7699  2       -3.2119    ttNKIHtRK      D    2.21E-01  N
1520   3141       31    -2821        0.768676     5.9       1.53E+01  1     669.772   2       2.7162     TQHsCPICQK     T    3.26E-01  N
1621   3141       31    -2274        0.722122     5.05      1.53E+01  1     669.77    2       -8.1815    PDYDyLAVmR     D    3.27E-01  N
1452   3141       31    -3216        0.798376     5.72      1.59E+01  1     669.7717  2       9.1633     DAAAHLQtsHK    T    3.43E-01  N
1192   3141       31    -3606        0.829541     0         0.00E+00  0     669.7694  2       0          -              T    2.00E+00  N
1605   3141       31    -2274        0.722122     0         0.00E+00  0     669.7706  2       0          -              T    2.00E+00  N
1170   3141       31    -3039        0.785384     0         0.00E+00  0     669.7708  2       0          -              T    2.00E+00  N
1662   3141       31    -2651        0.755219     0         0.00E+00  0     669.7708  2       0          -              T    2.00E+00  N

Figure 5.7: Phosphosite localisation for a doubly phosphorylated peptide.

The two most abundant phosphorylation sites for peptide SATsTSAsPTLR are shown based on a majority of literature references, which indicate that Ser-566 and Ser-570 are phosphorylated. Annotation information was obtained from http://www.phosphosite.org/proteinAction.do?id=19270&showAllsites=true.

5.3.2.2 Large-scale peptidomics dataset

The Mascot search result files that were re-analysed by MSPro using only top-ranking peptide hits in section 5.3.1.1 were re-analysed by MSPro in order to determine whether the number of significant scoring peptide hits could be further increased based on extended feature sets and by re-scoring multiple peptide hits using Percolator. A Mascot delta-score of 13 and a minimum Mascot Ion score of 5 were used. The results are shown in Fig. 5.8, wherein it can be seen that re-scoring of multiple peptide hits per spectrum outperforms re-scoring of only the top-ranking peptide hit per spectrum. Based on ≤5% PEP, 474 proteins with 1228 unique peptides were identified, including 267 single peptide identifications. Using a stricter threshold (≤1% PEP), 368 proteins with 1103 unique peptides were identified. These results demonstrate that the multi-scoring approach significantly increases the number of peptides identified compared with conventional Mascot scoring using the Homology score threshold. Indeed, the increase over conventional scoring is 36% at ≤5% PEP and 29% at ≤1% PEP. Of the 267 single peptide identifications, only 10 PSMs were considered incorrect based on expert manual validation [32].
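The protein, unique peptide and single-peptide counts quoted above follow from a simple tally over the PSMs that pass a posterior error probability (PEP) threshold. A minimal sketch of that bookkeeping is given below; the function name, tuple layout and toy PSMs are hypothetical and do not reproduce the MSPro output format.

```python
from collections import defaultdict


def summarise(psms, max_pep):
    """Tally identifications at a PEP threshold.

    psms: iterable of (protein, peptide, pep) tuples.
    Returns (n_proteins, n_unique_peptides, n_single_peptide_proteins).
    """
    peptides_by_protein = defaultdict(set)
    for protein, peptide, pep in psms:
        if pep <= max_pep:
            peptides_by_protein[protein].add(peptide)
    unique_peptides = set().union(*peptides_by_protein.values())
    singles = sum(len(p) == 1 for p in peptides_by_protein.values())
    return len(peptides_by_protein), len(unique_peptides), singles


# Toy PSMs: (protein, peptide, PEP); values are illustrative only.
toy_psms = [
    ("P1", "AAK", 0.001),
    ("P1", "CCK", 0.030),
    ("P1", "AAK", 0.002),
    ("P2", "DDK", 0.004),
    ("P3", "EEK", 0.200),
]
at_5_percent = summarise(toy_psms, 0.05)
at_1_percent = summarise(toy_psms, 0.01)
print(at_5_percent, at_1_percent)
```

Tightening the threshold from 5% to 1% PEP removes peptide CCK in this toy example, which turns protein P1 into a single peptide identification; the same effect explains why protein and peptide counts fall together at the stricter threshold.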

Figure 5.8: Comparison of top-ranked and multiple peptide hits re-scoring by Percolator for the peptidomics data sets.

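Analysis of the Mascot search results for the peptidomics data sets with re-scoring by Percolator of only top-ranked peptide hits (black line, D0) and all peptide hits per spectrum using a Mascot delta score of 13 (red line, D13). The plot shows the number of identified PSMs (y-axis, 1500 to 3000) against q-value (x-axis, 0.000 to 0.020); the legend labels are Mascot_XCorr_Rt_D0 and Mascot_XCorr_Rt_D13.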

5.3.2.3 Large-scale proteomics dataset

The Digger search result files that were re-analysed by MSPro using only top-ranking peptide hits in section 5.3.1.2 were re-analysed by MSPro in order to determine whether the number of significant scoring peptide hits could be further increased based on extended feature sets and by re-scoring multiple peptide hits using Percolator. As part of the ABRF-iPRG 2013 study, our lab submitted results using the Mascot search algorithm with extended feature sets and Percolator re-scoring (Fig. 5.9, participant’s code 31705). Since the aim of these studies is educational, much can be learnt from how different participants approach the analysis of various data sets. Three aspects of this study are worth noting. Firstly, almost 50% of the sequence entries in the protein sequence database provided to participants consisted of unmapped RNA-Seq data (mostly noise). Secondly, in addition to using the sequence databases provided, our lab was the only participant that re-mapped the RNA-Seq data using alternative software (Subread [139]). Thirdly, the MS-based proteomics data were labelled using TMT tags, which are used for quantitation purposes. Each TMT tag adds 226 Da to the mass of a peptide, and if there is more than one tag (i.e., at the N-terminus of the peptide and on the side chain of Lys (K)) then multiples of 226 Da are added. It so happens that a particular combination of two amino acids is identical in mass to the tag, which is unfortunate, because it means that an incorrect identification can be obtained if the TMT tag is treated as a “potential” rather than a “fixed” modification. Depending on how participants analysed the data, these three issues resulted in a slightly higher than expected number of false positives (the total YD in Fig. 5.9).
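The mass shift caused by the labelling can be sketched as follows, assuming that the tag is treated as a fixed modification so that the N-terminus and every Lys side chain carry exactly one tag. The 226 Da value is the nominal tag mass quoted in the text; the function name and example peptides are illustrative.

```python
def tmt_mass_shift(peptide, tag_mass=226.0):
    """Return (number_of_tags, total_added_mass_in_Da) for a peptide.

    Assumes full labelling: one tag on the N-terminus plus one tag on
    the side chain of every Lys (K) residue.
    """
    n_tags = 1 + peptide.count("K")
    return n_tags, n_tags * tag_mass


# A peptide without Lys carries one tag; each Lys adds another.
for sequence in ("SATSTSASPTLR", "PSAAENPTEQSK", "TTNKIHTRK"):
    print(sequence, tmt_mass_shift(sequence))
```

Treating the tag as a fixed modification removes the unlabelled alternative from the search space, which is why it avoids the spurious match in which a pair of residues with the same combined mass as the tag stands in for the label.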

The Perl scripts that were used to summarise the study and arrive at a consensus based on 3 or more participants’ results were used in this work in order to include the results obtained using Digger with extended feature sets (XCorr metrics and Rt) and Percolator re-scoring of multiple peptide hits. From Fig. 5.9, it can be seen that Digger plus extended features and Percolator re-scoring performs extremely well compared with all open-source or freely available software, as well as many commercial programs. There is nevertheless room for improvement, because the number of false positives is clearly higher than expected at a 1% FDR. Many of the identifications labelled as “incorrect” are probably due to failings of the consensus approach, since they were identified by only one group based on the alternative RNA-Seq mapping strategy that we employed. That said, there are clear cases where modifications such as deamidation cause problems, because deamidated Asn is almost identical in mass to Asp. In these cases the Digger score clearly outweighs the mass error feature within Percolator, suggesting that there is further room for improvement.

Figure 5.9: ABRF-iPRG 2013 study to identify peptides based on RNA-Seq and MS-based proteomics data.

The number of peptide spectral matches is reported for each participant. YS (blue bar) indicates peptide certainty based on a 1% FDR and agreement with at least 3 other participants (consensus). NS (green bar) indicates uncertainty (below threshold) but agreement with at least 3 other participants (consensus). ND (orange bar) indicates uncertainty and disagreement with the consensus of other participants. YD (red bar) indicates peptide certainty (above threshold) but disagreement with the consensus of other participants (i.e. false positive). The participants’ 5-digit codes are shown on the x-axis. Note that code “31705” and “Digger” represent identical analyses, but using two different search algorithms (Mascot and Digger, respectively). Plot title: ABRF-iPRG2013 Study using iPOP RNA-Seq and Proteomics Data; y-axis: Number of Peptide Spectral Matches (0 to 90000); x-axis: Participants code; legend: totalYD, totalND, totalNS, totalYS.
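The four categories in Fig. 5.9 follow from two yes/no decisions per PSM, which can be sketched as below. This is not the Perl code used in the study; the function name, argument names and toy submissions are hypothetical, and the rule of agreement with at least 3 other participants is taken from the caption.

```python
from collections import Counter


def classify(certain, n_other_agreeing, min_consensus=3):
    """Return the two-letter category for one PSM.

    certain: True if the participant reported the PSM above their
    1% FDR threshold (Y), otherwise False (N).
    n_other_agreeing: number of other participants reporting the same
    peptide for the spectrum; at least min_consensus gives S, else D.
    """
    first = "Y" if certain else "N"
    second = "S" if n_other_agreeing >= min_consensus else "D"
    return first + second


# Toy submissions for one participant: (certain, n_other_agreeing).
submissions = [(True, 12), (True, 0), (False, 5), (False, 1), (True, 3)]
totals = Counter(classify(c, n) for c, n in submissions)
print(totals["YS"], totals["YD"], totals["NS"], totals["ND"])
```

Under this scheme a YD call is counted as a false positive even when the peptide is genuinely present but was found by fewer than three other groups, which is the limitation of the consensus approach noted in the text.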

5.4 Summary

In this chapter we have demonstrated that the number of peptides identified at a 1% FDR can be increased even further if additional scoring metrics, such as the cross-correlation score, delta-score and ranking, are included as extended feature sets for re-scoring by the Percolator algorithm. This increase is more modest for the Digger search algorithm because its scoring function is already finely tuned, but for the Mascot search algorithm the increase is very large, even relative to the inclusion of retention time as an extended feature. Based on the analysis of an enriched phosphorylation data set, re-ranking and recovery of lower-ranked correct peptide hits is feasible if multiple peptide hits are submitted for re-scoring by Percolator. In addition, improved localisation of modification sites has been demonstrated. The Mascot developers claim that the statistics do not work if multiple peptide hits are submitted to Percolator. We reason that the statistics do work if additional scoring metrics are included in the feature sets that represent PSMs.

Uninterpreted MS-based conventional sequence database searches using programs such as Digger with re-scoring and post-processing are still not as sensitive as library-based searches. However, the sensitivity gap between the two approaches has narrowed considerably. Further work is required to incorporate intensity-based models as features during post-processing.

An unbiased performance assessment of Digger based on the ABRF-iPRG 2013 study has shown that the combination of Digger, MSPro and Percolator is well suited to challenging applications, in particular peptidomics and phosphoproteomics data sets, where the search space is dramatically increased.


In document Improved bioinformatics tools for the analysis of mass spectrometry-based peptidomics and proteomics data (Page 126-139)