Chapter 6. Comparison of methods for creating decoy sequence

6.2.2 Data analysis

MS/MS peak lists were extracted from the raw data using the Extract-MSn program, part of Bioworks 3.3.1 (Thermo Fisher Scientific). The parameters used to generate peak lists were as follows: minimum mass 700, maximum mass 7000, grouping tolerance 0.001 Da, minimum group count 1, minimum of 10 peaks and total ion current of 100. Peak lists for each LC-MS/MS run were merged into a single Mascot generic file (MGF) for Mascot and Digger searches. Automatic charge state recognition was used because of the high-resolution survey scan (30,000). MGF files were searched using Mascot v2.3.01 and v2.4 (Matrix Science) as well as Digger against the LudwigNR_Q113 protein sequence database with a taxonomy filter for human. The search parameters were as follows: enzyme Trypsin/P with up to two missed cleavage sites, carboxymethylation of cysteine as a fixed modification (+58 Da), and protein N-terminal acetylation (+42 Da) and methionine oxidation (+16 Da) as variable modifications. A peptide mass tolerance of ±30 ppm and a fragment ion mass tolerance of ±0.5 Da were used. The automatic decoy database option was enabled in Mascot (random protein in v2.3, reverse protein in v2.4) to allow false-discovery rate estimation. For the Digger search algorithm, three different methods of decoy sequence generation were used (random protein, reverse protein and random peptide). MSPro was used to run the Digger search algorithm, to collate the search result files from both search algorithms and to extract all peptide identifications. For both search algorithms, all top-ranking peptides greater than 5 residues in length and scoring ≥ 5, from both target and decoy searches, were passed to the post-processing program Percolator (v1.2) for re-scoring using extended feature sets. Based on Percolator’s re-scoring, individual q-values and posterior error probabilities (PEPs) were assigned to PSMs. A peptide significance threshold of 5% (PEP < 0.05) was used to infer protein groups based on the principle of parsimony. Peptides were labelled as unique or degenerate and, if degenerate, whether they were degenerate only within their protein group (i.e., razor peptides). To compare the different decoy strategies, only the top-ranking peptide for each spectrum was re-scored by Percolator. The number of identified PSMs was then plotted at all q-values for both search algorithms.
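
The tallying of PSMs against q-value can be illustrated with a minimal sketch. It assumes a plain target-decoy competition in which the FDR at a score threshold is estimated as the ratio of decoy to target PSMs at or above that threshold. This is a simplification and not the procedure used by Percolator, which re-scores PSMs before assigning q-values. The function names and the toy scores are hypothetical.

```python
from typing import List, Tuple


def q_values(psms: List[Tuple[float, bool]]) -> List[Tuple[float, bool, float]]:
    """Assign q-values to (score, is_decoy) pairs; a higher score is a better match.

    The FDR at each rank is estimated as (#decoys / #targets) among all PSMs
    scoring at least as well. The q-value of a PSM is the smallest FDR at
    which that PSM would still be accepted.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # Make the estimates monotonic by walking from the worst score to the best.
    out = []
    running_min = float("inf")
    for (score, is_decoy), fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        out.append((score, is_decoy, running_min))
    out.reverse()
    return out


def count_target_psms(scored, q_threshold: float) -> int:
    """Number of target PSMs accepted at the given q-value threshold."""
    return sum(1 for _, is_decoy, q in scored if not is_decoy and q <= q_threshold)


# Hypothetical toy data: (score, is_decoy)
toy = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
scored = q_values(toy)
for threshold in (0.01, 0.25):
    print(threshold, count_target_psms(scored, threshold))
```

Evaluating `count_target_psms` over a grid of thresholds gives the kind of curve shown in Figs. 6.2 and 6.3, with the q-value on the x-axis and the number of accepted PSMs on the y-axis.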

6.3 Results and Discussion

As discussed in a previous chapter, and by Fenyo and Beavis [116], albeit in a slightly different context, it is important that the decoy sequence database has the following properties: (a) it should be large enough to generate many thousands of matching peptides to mass spectra; (b) peptide sequence redundancy should be minimised relative to the total number of peptides generated; and (c) the peptide sequence that truly corresponds to the peptide sample used to generate the tandem mass spectrum should be removed from the sequence database. If a decoy sequence database is large enough, it is conceivable that all these properties could be satisfied. Fig. 6.1 illustrates why it is important to ignore short peptide sequences when analysing peptidomics or proteomics data sets. Peptides shorter than seven amino acid residues do not generate sufficient unique fragment ions to discriminate between target and decoy sequences, no matter how the decoy sequences are generated.
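
The point about short peptides can be made concrete by counting backbone fragments. The sketch below assumes that only singly charged b- and y-ions are considered, which is a simplification (search algorithms may also score other ion series and charge states). The example peptide sequences are hypothetical.

```python
def backbone_fragment_count(peptide: str) -> int:
    """Number of singly charged b- and y-ions from backbone cleavage.

    A peptide of n residues has n - 1 peptide bonds, and each bond gives
    one b-ion and one y-ion.
    """
    n = len(peptide)
    return 2 * (n - 1) if n > 1 else 0


# Hypothetical peptides of increasing length
for pep in ("GASPK", "GASPVTK", "GASPVTLNDQEK"):
    print(pep, len(pep), backbone_fragment_count(pep))
```

A five-residue peptide therefore offers only eight b/y fragments with which to separate a target from a decoy, compared with 22 for a twelve-residue peptide.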

Figure 6.1: High-scoring short decoy peptide sequences using Mascot.

The Mascot ionscores for a set of decoy peptides are plotted to illustrate that many short peptide sequences frequently give rise to high scores, which has a negative impact on the sensitivity of the Mascot search algorithm (red line).

Until recently, the Mascot decoy sequence database strategy involved generating, for each real (target) protein sequence, a randomised (decoy) version of the same length using the average amino acid composition of the target sequence database. This was accepted as a reasonable model even though properties of real sequences (such as homology) were ignored. From Fig. 6.2, it can be seen that with this randomisation strategy Mascot identifies ~800 more peptides at 1% FDR than when individual protein sequences are reversed (the default decoy method in the latest version (2.4) of Mascot). This result was surprising at first, but upon closer inspection it became clear that the poorer performance of the protein reversal method was due to a decrease in peptide sequence diversity.
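
The two protein-level strategies, and one simple way of quantifying peptide sequence diversity, can be sketched as follows. This is an illustration under stated assumptions and not Mascot's implementation: the digestion rule (cleavage after K or R, no missed cleavages), the minimum peptide length and the diversity measure (unique peptides divided by total peptides) are all choices made for the example, and every function name is hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def composition(proteins: List[str]) -> Dict[str, float]:
    """Average amino acid frequencies over the whole target database."""
    counts = Counter("".join(proteins))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def random_decoy(protein: str, freqs: Dict[str, float], rng: random.Random) -> str:
    """Random sequence of the same length, drawn from the database composition."""
    residues = sorted(freqs)
    weights = [freqs[aa] for aa in residues]
    return "".join(rng.choices(residues, weights=weights, k=len(protein)))


def reversed_decoy(protein: str) -> str:
    """Reverse the whole protein sequence."""
    return protein[::-1]


def tryptic_peptides(protein: str, min_len: int = 6) -> List[str]:
    """Cleave after every K or R (no missed cleavages); drop short peptides."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def diversity(proteins: List[str]) -> float:
    """Fraction of digested peptides that are unique (1.0 = no redundancy)."""
    peptides = [p for prot in proteins for p in tryptic_peptides(prot)]
    return len(set(peptides)) / len(peptides) if peptides else 0.0


# Hypothetical targets; the first two are identical to mimic homologous entries.
targets = ["MKTAYIAKQR", "MKTAYIAKQR", "GASPVTLNDK"]
rng = random.Random(1)
freqs = composition(targets)
print(diversity([reversed_decoy(t) for t in targets]))
print(diversity([random_decoy(t, freqs, rng) for t in targets]))
```

Reversal maps identical target sequences onto identical decoy sequences, so any redundancy in the target database is carried over to the decoys, whereas independently drawn random sequences will generally differ. This is one plausible way in which reversal could reduce decoy peptide diversity relative to randomisation.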

Figure 6.2: Comparison of decoy sequence strategies at the protein level using Mascot.

The number of identified PSMs, based on Percolator, is plotted at all q-values (FDR) for two different decoy sequence strategies (randomised protein sequences – black line and reversed protein sequences – red line).

Based on these results (Fig. 6.2), the next experiment involved generating decoys at the peptide rather than protein level in order to shed further light on the issue of sequence diversity. Unfortunately, this was not possible with Mascot because it would have required global configuration changes affecting all users. Instead, this experiment was performed using the Digger search algorithm. The same proteomics data set was re-analysed using Digger with identical search parameters. The major difference between Mascot and Digger in terms of decoy strategy is that with Digger individual protein sequences are either randomised or reversed whilst retaining the protein N-terminal and C-terminal amino acid residues. Similarly, individual peptide sequences are randomised such that the randomised version differs from the original peptide both in sequence and in modification position. This implementation was inspired by the one in the Crux software [146]. Fig. 6.3 shows the Digger results using three different decoy strategies. For Digger, the protein sequence reversal method (green line, Fig. 6.3) results in slightly more peptide identifications at 1% FDR than the protein sequence randomisation method (red line). Both of these strategies easily out-perform the peptide randomisation method (black line, Fig. 6.3). The peptide randomisation method is at first sight an attractive option, because the number of decoy candidate peptides is identical to the number of target candidate peptides. Furthermore, the number of potential post-translational modifications (PTMs) is also held constant because PTMs are simply localised at different positions within the peptides. However, based on these results, decoy sequence strategies should be avoided at the peptide level in favour of protein-level sequence randomisation strategies. We suspect that this trend will hold irrespective of the type of search. For example, all the peptidomics analyses that were carried out as part of this work were performed using peptide randomisation decoy strategies for the no-enzyme searches. Based on these observations, it seems likely that Digger would have performed even better if the protein sequence randomisation method had been used for the no-enzyme searches.
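
A minimal sketch of terminus-retaining decoy generation is given below. It is not Digger's or Crux's code. It assumes that the first and last residues are kept fixed for both proteins and peptides, it retries a shuffle a fixed number of times until the result differs from the input, and it ignores modification positions altogether. The function names and example sequences are hypothetical.

```python
import random


def reverse_keep_termini(seq: str) -> str:
    """Reverse the interior of a sequence, keeping the first and last residues."""
    if len(seq) < 3:
        return seq
    return seq[0] + seq[-2:0:-1] + seq[-1]


def shuffle_keep_termini(seq: str, rng: random.Random, max_tries: int = 20) -> str:
    """Shuffle the interior of a sequence, keeping the first and last residues.

    The shuffle is repeated until the result differs from the input. If no
    different arrangement is found (e.g. the interior is a single repeated
    residue), the input is returned unchanged.
    """
    if len(seq) < 4:
        return seq
    interior = list(seq[1:-1])
    for _ in range(max_tries):
        rng.shuffle(interior)
        candidate = seq[0] + "".join(interior) + seq[-1]
        if candidate != seq:
            return candidate
    return seq


rng = random.Random(0)
print(reverse_keep_termini("MABCDK"))        # interior "ABCD" is reversed
print(shuffle_keep_termini("MPEPTIDEK", rng))
print(shuffle_keep_termini("AGGGK", rng))    # no different arrangement exists
```

Applied to a whole protein, the same two functions give protein-level reversed or randomised decoys; applied to each digested peptide, `shuffle_keep_termini` gives a peptide-level decoy with the same composition and length as its target.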

Figure 6.3: Comparison of decoy sequence strategies at the protein and peptide level using Digger.

The number of identified PSMs, based on Percolator, is plotted at all q-values (FDR) for three different decoy sequence strategies (randomised peptide sequences – black line, randomised protein sequences – red line and reversed protein sequences – green line).

Decoy strategies that result in an increase in peptide sequence diversity tend to perform better in terms of overall numbers of peptide identifications. Upon reflection, this should come as no surprise, but there is perhaps a hidden danger in using the total number of peptide identifications as a metric for assessing which decoy method is the “best” or most suitable. Nevertheless, on the basis of these results, a simple experiment was performed to verify the impact of peptide sequence diversity. For each spectrum, a cross-correlation of the top-scoring target versus decoy peptide sequence was performed. Only peptide sequences with a length greater than 5 and a score greater than 5 were used in the cross-correlation. After normalising the cross-correlation values (XCorr) based on the auto-cross-correlation, histograms were plotted for both the “peptide-random” and “protein-reverse” decoy models (Fig. 6.4). From Fig. 6.4, it can be seen that with the “peptide-random” model there are many more target and decoy peptide sequences that are indistinguishable based on their fragment ion patterns (the red bars are higher than the green bars below 0.2 on the x-axis), whereas there are many more peptides that are completely dissimilar when protein sequence reversal is used as the decoy model (the green bars are higher than the red bars above 0.8 on the x-axis). These results support the peptide sequence diversity explanation and explain why the “protein-reverse” model out-performs the “peptide-random” model in terms of the total number of PSMs identified at 1% FDR. Based on these results, two points are worth noting: firstly, the results of any search algorithm could be further improved by removing highly similar decoy peptide sequences prior to re-scoring by post-processing tools; and secondly, the method for generating the decoy sequences becomes less of an issue (data not shown) if these highly similar sequences are removed prior to re-scoring.
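
The normalised cross-correlation between a target and a decoy peptide can be sketched as below. The exact procedure used for Fig. 6.4 is not specified in the text, so the following are assumptions: theoretical spectra contain only singly charged b- and y-ions, they are binned at 1 Da with unit intensity, only the zero-lag cross-correlation is computed, and it is normalised by the geometric mean of the two auto-correlations. The function names and example peptides are hypothetical.

```python
import math
from typing import Set

# Monoisotopic residue masses (Da)
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_bins(peptide: str, bin_width: float = 1.0) -> Set[int]:
    """Bin indices of the singly charged b- and y-ion m/z values."""
    bins = set()
    total = sum(MASS[aa] for aa in peptide)
    prefix = 0.0
    for aa in peptide[:-1]:          # one cleavage per peptide bond
        prefix += MASS[aa]
        b_mz = prefix + PROTON
        y_mz = (total - prefix) + WATER + PROTON
        bins.add(int(b_mz / bin_width))
        bins.add(int(y_mz / bin_width))
    return bins


def nxcorr(target: str, decoy: str) -> float:
    """Zero-lag cross-correlation of two binary theoretical spectra,
    normalised by the geometric mean of their auto-correlations.

    For binary spectra the cross-correlation is the number of shared bins
    and each auto-correlation is the number of occupied bins.
    """
    t, d = fragment_bins(target), fragment_bins(decoy)
    if not t or not d:
        return 0.0
    return len(t & d) / math.sqrt(len(t) * len(d))


# Hypothetical target and decoy differing by one swapped residue pair
print(1 - nxcorr("PEPTIDEK", "PEPTIDEK"))   # identical fragment patterns
print(1 - nxcorr("PEPTIDEK", "PEPTDIEK"))   # most fragments shared
```

The quantity `1 - nxcorr(...)` corresponds to the x-axis of Fig. 6.4: values near 0 indicate a decoy whose fragment ions are nearly indistinguishable from those of the target, and such decoys could be filtered out with a simple threshold before re-scoring.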

[Figure 6.4 plot: histogram titled “Cross-correlation of top-ranking target and decoy”. x-axis: 1 − NXCorr; y-axis: number of PSMs (0 to 9000); series: random peptide, reverse protein. Bar labels:]

1 − NXCorr       0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
Random peptide   252   633   812   1008  1601  2527  3848  5467  6236  6305
Reverse protein  185   175   374   696   1264  2337  4104  5977  7508  8203

Figure 6.4: XCorr of the top-ranking target peptide versus decoy peptide for all MS/MS spectra.

Only peptides with a minimum score of 5 and length greater than 5 were used in the cross-correlation. Highly dissimilar peptide sequences have values of 1 (on the right) and peptide sequences whose fragment ions are indistinguishable have values of 0 (on the left). The red bars reflect the frequencies based on the analysis using randomised peptide sequences as the decoy, whereas the green bars reflect the frequencies based on the analysis using reversed protein sequences as the decoy.

6.4 Summary

We have shown that the target-decoy method for estimating error rates in large-scale proteomics data sets is appropriate. Likewise, the decoy sequence database fulfils the null model role if the database is large and results in thousands of peptide matches to a tandem mass spectrum. The protein sequence randomisation method gives rise to an increase in peptide sequence diversity compared with peptide randomisation methods. We therefore advocate generating decoys at the protein sequence level. Further improvements could be made by removing decoy peptide sequences that display high sequence homology with the target peptide sequence. A simple cross-correlation between target and decoy sequences, applied prior to post-processing, could alleviate this problem. This would allow search algorithms to explore even larger search spaces without compromising sensitivity and specificity, and provide a further benefit to the peptidomics and proteomics research community.

In document Improved bioinformatics tools for the analysis of mass spectrometry-based peptidomics and proteomics data (Page 141-151)