AMASS: Software for Automatically Validating the Quality of MS/MS
Spectrum From SEQUEST Results
Wei Sun 1,2*, Fuxin Li 3, Jue Wang 3, Dexian Zheng 1,2, Youhe Gao 1,2*
1 Proteomics Research Center, 2 National Key Laboratory of Medical Molecular
Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical
Sciences/Peking Union Medical College, Beijing, 100005;
3 Institute of Automation, Chinese Academy of Sciences, Beijing, 100080
Address reprint requests to:Youhe Gao/Wei Sun
5 Dong Dan San Tiao
Institute of Basic Medical Sciences
Chinese Academy of Medical Sciences/Peking Union Medical College
Beijing
People’s Republic of China 100005
Tel: 086-010-6787-2251-206
Fax: 086-010-6787-2251-201
Email: [email protected], [email protected]
Running title: AMASS automatically validates SEQUEST results
Abbreviations
AMASS: Advanced MAss Spectrum Screener
FP: false positives
TP: true positives
MS/MS: tandem mass spectrum
Summary: Time-consuming and experience-dependent manual validation of tandem mass spectra is
usually applied to SEQUEST results, and this inefficient step has become a significant bottleneck in
MS/MS data processing. Here we introduce AMASS (Advanced MAss Spectrum Screener), a program
that filters the tandem mass spectra of SEQUEST results by measuring the match percentage of
high-abundance ions and the continuity of matched fragment ions in the b and y series. Compared with an
Xcorr and DeltaCn filter alone, AMASS increased the number of positives and reduced the number of
negatives in 22 datasets generated from mixtures of 18 known proteins. It effectively removed most noisy
spectra and false interpretations and about half of the poor fragmentation spectra. AMASS also works
synergistically with the Rscore [18] filter. We believe that using AMASS together with Rscore can give more
accurate identification of peptide MS/MS spectra and reduce the time and effort needed for manual validation.
Introduction
With the development of proteomics, tandem mass spectrometry has been used to determine the
protein components of complex mixtures [1-4]. In such an approach, proteins are digested into peptides by
enzymes and subjected to reverse-phase liquid chromatography. Then eluted peptides are ionized and
mainly fragmented into b and y ions. Tandem mass spectra produced by mass spectrometer can be used for
peptide identifications. A common way for peptide identifications is to search tandem mass spectra against
a sequence database to find the best matching peptide in the database [5]. Several database search programs
such as SEQUEST [6], Mascot [7], MS-tag [8] and Sonar [9] have been introduced to assign peptides to
MS/MS spectra. These programs use various scoring schemes to distinguish correct identifications from
false positives, but they are known to produce a significant number of incorrect peptide assignments [10].
The process of validating peptide assignments often relies on time-consuming and experience-dependent
manual verification.
Recently, several groups have applied different algorithms to evaluate SEQUEST database search results
[11-15]. Moore et al. described Qscore [11], an algorithm based on a probability model that includes the
expected number of matches from a given database, the effective database size, a correction for
indistinguishable peptides, and a measurement of match quality. Anderson et al. [12] applied a support
vector machine to distinguish correctly from incorrectly identified peptides, using a vector of parameters
describing each peptide identification: observed data (peptide mass, precursor ion intensity) and
SEQUEST-calculated statistics (such as Xcorr, DeltaCn, Sp and RSp). Keller et al. [13, 14] employed
another machine learning algorithm, expectation maximization. It incorporated four SEQUEST scores
plus the number of tryptic peptide termini present in the matched peptides to estimate a peptide
probability. The probabilities of peptides with correct assignments are then combined to estimate the
probability of the corresponding protein. More recently, Razumovskaya et al. [15] developed a method
that combines a neural network and a statistical model to normalize SEQUEST scores and to provide a
reliability estimate for each SEQUEST hit. These methods improve the separation between correct and
incorrect peptides and reduce the number of SEQUEST protein identifications that have to be validated
manually.
The above approaches are based on different algorithms. Here we address the same problem in a
different way. Manual validation of a peptide match often makes use of various spectral properties to
discriminate positives from negatives [16, 17]. We encoded manual validation rules in a computer program
that filters SEQUEST outputs automatically. Two rules are important for manual validation: the fragment ions
should be clearly above baseline noise, and the spectrum should have continuous b or y ion matches [16].
The facts underlying these rules are that “highly abundant fragment ions are more likely to be signals” and
that “the MS/MS spectrum of an optimally fragmented peptide should theoretically contain continuous
fragment ions of the b or y series”. Based on these two facts, two functions were implemented in AMASS to
calculate the match percentage of high-abundance fragment ions and the continuity of the b or y ion series.
Tandem mass spectra datasets of known protein mixtures searched with SEQUEST were filtered by
AMASS with relaxed Xcorr and DeltaCn settings, and the result was compared with that obtained using the
common Xcorr and DeltaCn settings alone [17].
Experimental Section
Experimental dataset The experimental datasets were obtained as in Ref. 10. The datasets were
produced by analyzing a mixture of 18 proteins by liquid chromatography/tandem mass spectrometry. Two
mixtures, A and B, were obtained by mixing together 18 purified proteins of different physicochemical
properties (Sigma, St. Louis, MO, USA; Prozyme, San Leandro, CA, USA) in the indicated relative molar
amounts (Table 1). The complex peptide mixtures were analyzed by LC/MS using an electrospray ionization
ion trap mass spectrometer (ThermoFinnigan, San Jose, CA, USA) with a standard top-down
data-dependent ion selection approach, wherein the most abundant peak above background level is selected
and a concurrent 3 min of dynamic exclusion is employed to prevent re-selection of previously selected
ions. Peptides were eluted by an acetonitrile gradient (10–35% over 60 min) across a 10 cm × 100 µm C18
column while the ion trap mass spectrometer continuously selected peptides for collision-induced
dissociation via alternating MS and MS/MS modes. To increase duty cycle, the zoom scan function capable
of determining charge state was not employed.
In total, 14 LC/MS/MS runs were performed on mixture A, using 10 µL (A1), 5 µL (A2), 1 µL (A3),
or 2.5 µL (A4-14) of 1:5 diluted mixture. Eight LC/MS/MS runs were performed on mixture B, using 1 µL
(B1-2), 2 µL (B3-4), 5 µL (B5-6), or 7.5 µL (B7-8) of 1:20 diluted mixture.
SEQUEST Search and Xcorr filter The 22 raw files were searched against the protein database
with Bioworks 3.1 from ThermoFinnigan (San Jose, CA, USA). The protein database was composed of
88,374 proteins, including the SWISS-PROT human protein database and the 18 proteins in the mixture. Tryptic
cleavages at only Lys or Arg and up to two missed internal cleavage sites in a peptide were allowed. The
maximal allowed uncertainty in the precursor ion mass was 1.4 m/z. Peptides from 400 to 4500 m/z and
precursor charge states of +1, +2, and +3 were allowed. The minimum total ion current required for
precursor ion fragmentation was 1.0×10^5 and the minimum number of ions was 25. Altogether, 47,907
spectra were searched against the database.
The output files were filtered with the Xcorr filter (Xcorr + DeltaCn). The following values of Xcorr and
DeltaCn were used as the common setting [17]: DeltaCn ≥ 0.1 and
Xcorr ≥ 1.9 for +1 charged peptides with fully tryptic ends
Xcorr ≥ 2.2 for +2 charged peptides with partially or fully tryptic ends
Xcorr ≥ 3.75 for +3 charged peptides with partially or fully tryptic ends
The Xcorr filters used were derived from the common setting with constant DeltaCn. For example, an 80%
Xcorr filter meant 0.8 × (common setting), so the filter was actually Xcorr ≥ 0.8 × 1.9 = 1.52 for +1
charged peptides, and so forth. The Xcorr filters examined in the analysis ranged from 0% to 120% in steps of 10%.
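The scaling of the common setting can be sketched in a few lines of Python. This is only an illustration, not AMASS code: the names `COMMON_XCORR`, `xcorr_cutoff` and `passes_xcorr_filter` are hypothetical, and the tryptic-end requirement attached to each charge state is left out.

```python
# Common Xcorr cutoffs per precursor charge state, taken from the common setting.
COMMON_XCORR = {1: 1.9, 2: 2.2, 3: 3.75}
# DeltaCn cutoff, held constant for every scaled filter.
DELTA_CN_MIN = 0.1


def xcorr_cutoff(charge, percent):
    """Xcorr cutoff for `charge` at `percent` (0-120) of the common setting."""
    return COMMON_XCORR[charge] * percent / 100.0


def passes_xcorr_filter(xcorr, delta_cn, charge, percent=100):
    """True if a hit meets the scaled Xcorr cutoff and the fixed DeltaCn cutoff.

    Assumption: the tryptic-end requirement is not modelled in this sketch.
    """
    return delta_cn >= DELTA_CN_MIN and xcorr >= xcorr_cutoff(charge, percent)
```

For instance, `xcorr_cutoff(1, 80)` gives the 1.52 cutoff quoted for +1 peptides at the 80% filter.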
Positive and Negative peptides Positive and negative peptides were assigned according to whether
the peptide was part of one of the 18 known proteins. Only the top-scoring peptide was used to judge the
presence of a particular protein. If a peptide passing the above Xcorr filter was part of one of the 18 known
proteins, it was counted as a positive peptide; otherwise, it was counted as a negative peptide.
Common contaminants were not counted as positives, which decreased the number of positives.
This conservative strategy was adopted because we wanted to demonstrate the effect of the AMASS
parameters under the most conservative settings.
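The counting rule can be illustrated with a short Python sketch. The function name `count_hits` and the use of protein identifiers as plain strings are assumptions made for the example; the input is taken to be the protein of the top-scoring peptide for each spectrum that already passed the Xcorr filter.

```python
from collections import Counter


def count_hits(top_hit_proteins, known_proteins):
    """Count positives and negatives among filtered top-scoring hits.

    top_hit_proteins: protein identifier of the top-scoring peptide for each
        spectrum that passed the Xcorr filter (hypothetical input format).
    known_proteins: identifiers of the proteins in the control mixture.
    Contaminants are simply absent from `known_proteins`, so they count as
    negatives, which mirrors the conservative strategy described above.
    """
    known = set(known_proteins)
    counts = Counter(
        "positive" if protein in known else "negative"
        for protein in top_hit_proteins
    )
    return counts["positive"], counts["negative"]
```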
Computer Programs: AMASS The following rules are commonly applied in the manual validation
of mass spectra [16]: 1. The MS/MS spectrum must be of good quality, with fragment ions clearly above
baseline noise. 2. There must be some continuity to the b or y ion series.
Based on these rules we propose two functions.
(1) Match percentage, MatchPct:
MatchPct = [number of matched daughter ions with relative abundance higher than RACutoff / total number of
daughter ions with relative abundance higher than RACutoff] × 100%
RACutoff (Relative Abundance Cutoff) is a number between 0 and 100 serving as a relative abundance
cutoff point in MS/MS spectra. For example, when RACutoff is 20, only the ions with relative abundance
higher than 20 are included in the calculation of MatchPct. A lower RACutoff value includes more
fragment ions in the calculation. A higher MatchPct value means that more of the fragment ions
above the RACutoff are matched. In general, the higher the value of MatchPct, the better the quality
of the identification.
(2) Continuity, Cont:
\[
\mathrm{Cont} = \frac{\sum_{i=1}^{l} \bigl( f(i) + b(i) + y(i) \bigr)}{\text{maximum possible value of the sum for a peptide of length } l} \times 100
\]
where
f(i) = 1 if the ith b or y series ion is matched, and 0 otherwise;
b(i) = n^2 if the (i+1)th b series ion is not matched, where n is the number of continuously matched b
ions immediately before and including the ith ion, and 0 otherwise;
y(i) = n^2 if the (i+1)th y series ion is not matched, where n is the number of continuously matched y
ions immediately before and including the ith ion, and 0 otherwise;
l = the number of amino acids in the peptide.
Cont adds the squared lengths of the runs of continuously matched b series and y series ions to
the total number of matched ions, and then normalizes the sum by dividing by its maximum possible
value and multiplying by 100. A higher Cont value means more continuous matching of fragment ions.
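The two functions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the AMASS implementation: peaks are given as (relative abundance, matched) pairs, the b and y series are given as equal-length lists of booleans ordered by ion number, and the normalizer is taken to be the value the sum reaches when every ion in both series is matched. All function names are hypothetical.

```python
def match_pct(peaks, ra_cutoff=20):
    """Percentage of peaks above `ra_cutoff` that are matched.

    peaks: iterable of (relative_abundance, is_matched) pairs (assumed format).
    """
    above = [is_matched for abundance, is_matched in peaks if abundance > ra_cutoff]
    if not above:
        return 0.0
    return 100.0 * sum(above) / len(above)


def run_squares(matched):
    """Sum of squared lengths of each maximal run of consecutive matched ions."""
    total, run = 0, 0
    for is_matched in matched:
        if is_matched:
            run += 1
        else:
            total += run * run  # a run ends when the next ion is unmatched
            run = 0
    return total + run * run  # close a run that reaches the last ion


def cont(b_matched, y_matched):
    """Continuity score from per-ion match flags of the b and y series.

    Assumption: both lists have the same length n, and the normalizer is the
    value obtained when all ions are matched, n + 2 * n**2.
    """
    n = len(b_matched)
    if n == 0 or len(y_matched) != n:
        raise ValueError("b and y series must be non-empty and of equal length")
    # f(i): 1 if the ith b or y ion is matched
    f_sum = sum(1 for b, y in zip(b_matched, y_matched) if b or y)
    numerator = f_sum + run_squares(b_matched) + run_squares(y_matched)
    return 100.0 * numerator / (n + 2 * n * n)
```

Squaring the run lengths is what rewards continuity: four matched b ions in a row contribute 16, whereas the same four ions split into two runs of two contribute only 8.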
When calculating MatchPct and Cont, all matched daughter ions under different charge states were
taken into account. To determine how well AMASS distinguishes positive from negative peptides,
the values of RACutoff, MatchPct and Cont were varied from 0 to 90 in incremental steps of 10 and applied to
SEQUEST results as a secondary filter in addition to the corresponding Xcorr filter. The
proper parameter values should maximize the number of positive peptides without sacrificing the
positive rate. The values of the AMASS parameters RACutoff, MatchPct and Cont were estimated experimentally
as 20, 60 and 40, respectively (data are shown in Supplement 1).
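A parameter scan of this kind could look like the Python sketch below. It is illustrative only: the input format, the name `scan_thresholds`, and the use of "greater than or equal to" for the MatchPct and Cont thresholds are assumptions, since the text does not specify them.

```python
from itertools import product


def scan_thresholds(hits, grid=range(0, 100, 10)):
    """Count positives and negatives for every threshold combination.

    hits: list of (is_positive, match_pct_by_cutoff, cont_value), where
        match_pct_by_cutoff maps each RACutoff in `grid` to the MatchPct
        computed at that cutoff (hypothetical precomputed input).
    Yields (ra_cutoff, match_pct_min, cont_min, positives, negatives).
    """
    for ra_cutoff, match_pct_min, cont_min in product(grid, repeat=3):
        positives = negatives = 0
        for is_positive, match_pct_by_cutoff, cont_value in hits:
            # Assumption: a hit passes when it meets or exceeds both thresholds.
            if match_pct_by_cutoff[ra_cutoff] >= match_pct_min and cont_value >= cont_min:
                if is_positive:
                    positives += 1
                else:
                    negatives += 1
        yield ra_cutoff, match_pct_min, cont_min, positives, negatives
```

With ten values per parameter the scan covers 1000 combinations, from which the positive count and positive rate can be compared.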
Results
The effect of AMASS Figure 1 shows the total numbers of positives and negatives obtained with
four different filters: (1) Xcorr filter, (2) MatchPct+Xcorr filter, (3) Cont+Xcorr filter and (4)
AMASS (MatchPct+Cont)+Xcorr filter, with the Xcorr filter ranging from 70% to 120% of the common setting.
When the Xcorr filter was lowered, the number of negatives increased dramatically and the positive rate
decreased. But when MatchPct, Cont or both were used, more positives and a higher positive rate could be
achieved with almost the same number of negatives, even with Xcorr filter settings lower than the common
setting. For example, the numbers of positives and negatives were 1429 and 99 with the common settings, and
increased to 2034 and 341 with the 80% Xcorr filter. When AMASS was added to the 80% Xcorr filter (with
RACutoff, MatchPct and Cont set to 20, 60 and 40, respectively), the number of positives was 1725, with a
number of negatives (94) similar to that obtained with the common settings.
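As a worked check of these figures, the positive rate can be computed as positives / (positives + negatives), the definition that reproduces the rates listed in the Supplement Table:

\[
\frac{1429}{1429+99} \approx 93.5\%, \qquad
\frac{2034}{2034+341} \approx 85.6\%, \qquad
\frac{1725}{1725+94} \approx 94.8\%
\]

for the common setting, the 80% Xcorr filter alone, and the 80% Xcorr filter with AMASS, respectively.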
Figure 1 also shows that the effects of MatchPct and Cont were similar and that their combination,
AMASS, had an even better effect, which indicates that MatchPct and Cont remove different
types of false identifications.
The effect of each AMASS parameter The above result was based on the assumption that all
peptides belonging to the 18 known proteins were positives. But positives of poor quality would be
considered false positives (FP) on manual validation. So, to further test the effect of AMASS against
manual validation, all peptides from the 22 datasets that passed the common Xcorr filter settings were manually
assigned as true positives (TP) or false positives (FP) according to the above manual validation rules [16]. If a
tandem mass spectrum assigned to a peptide met the manual validation rules, the peptide was considered a
true positive (TP); otherwise it was considered a false positive (FP).
To evaluate the effect of each AMASS parameter, the tandem mass spectra assigned to FP were
classified, according to our experience, into three categories. The first category was poor fragmentation,
with much of the ion current in a few major peaks. The second was noisy spectra, which had a low
signal-to-noise ratio. The third was false interpretation, in which the spectrum had major peaks and a good
signal-to-noise ratio but most of the matched ions were noise. The final list of TP assignments consisted of
1295 peptides confidently identified in the mixture. The list of FP assignments contained 233 peptide hits by
SEQUEST (73, 81 and 79 in the three categories, respectively). We assigned fewer TP peptide identifications
than Keller et al. [10]. The reason is that they assessed all outputs without any
filter, whereas we assessed only the peptides passing the common Xcorr filter.
Figure 2a shows the numbers of TP and FP under different filters, which indicate that AMASS could
decrease the number of FP at little cost in TP. Figure 2b shows the different effects of the AMASS parameters
on the three categories of FP. Cont and MatchPct filtered out most of the noisy and false interpretation FP, but
only about half of the poor fragmentation ones.
The signal-to-noise ratio in noisy MS/MS spectra was low, so most ions were of high relative abundance, while
the number of matched ions was relatively small. The values of MatchPct were therefore lower than in TP, and
such FP could be effectively removed. For false interpretation spectra, because most of the matched ions were
noise, MatchPct was very low and these spectra could also be filtered out by AMASS. Some FP of these two
types might also be filtered out by Cont because of poor continuity. But for some poor fragmentation spectra, if
a few high abundance ions were matched, the value of MatchPct might be higher than 80. Moreover, due to random
matches the continuity might also reach 60 or even higher. Therefore, such poor fragmentation spectra were
difficult for AMASS to filter out.
Combining MatchPct and Cont filtered out more FP (most of the noisy and false interpretation FP and about
half of the poor fragmentation FP), which showed that the effects of the two parameters were different.
Combination of AMASS and Rscore Our previous work, Rscore [18], is a score evaluating the
relative quality in cross-correlation and matched intensity percentage. The notion underlying Rscore is
that true positive peptide identifications should be better than other randomly generated identifications. In
this sense, for poor fragmentation spectra, the few high abundance ions are likely to be matched by both
the first and the second scoring peptides. The relative quality difference between them would then be small,
and such spectra could be filtered out by Rscore. Since AMASS works best on the other two kinds of FP, AMASS
and Rscore should be complementary. Figure 3 shows that when the two filters were used together, the Xcorr
filter could be lowered to 70% of the common settings and more positives (1790) could be obtained with a
similar number of negatives (102) compared with the common settings (99). This result was better than that of
using either filter alone.
Discussion
Different SEQUEST parameters, different algorithms [11-15] and new parameters [12] have been used to
evaluate the quality of SEQUEST results. But how to remove as many negatives as possible while keeping
as many positives as possible is still an open problem.
AMASS was proposed based on the two manual validation rules. In our results, AMASS
dramatically increased the number of positives and the positive rate when used with Xcorr filter
settings lower than the common ones. Manual validation showed that it can filter out most noisy MS/MS
spectra and false interpretations and about half of the poor fragmentation FP at a low cost in TP. When AMASS
and Rscore were both applied, more positives could be obtained with a similar number of negatives. These
results show that high quality positive identifications can be achieved with AMASS, although it does not
completely separate TP from FP.
AMASS makes use of a threshold model. We chose the threshold model because we want TP
results to satisfy all the AMASS criteria. The AMASS criteria are independent, in that a high value in one
parameter cannot compensate for a deficit in another (for instance, a perfect Cont score would not
guarantee that the matched ions are signals). A linear model does not have this property. Other models may also
be used to tackle this problem. A quadratic model would be able to approximate it, but we decided to
preserve the simplicity of the model, since a simple model should have better generalization ability [19]
(Supplement 2).
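The difference between the two kinds of model can be made concrete with a small Python sketch. The threshold values 60 and 40 are the AMASS settings reported in this paper; the linear weights and cutoff are hypothetical numbers chosen only to show the compensation effect, and neither function is taken from the AMASS source.

```python
def threshold_model(match_pct, cont, match_pct_min=60, cont_min=40):
    """Accept only if every criterion is met on its own."""
    return match_pct >= match_pct_min and cont >= cont_min


def linear_model(match_pct, cont, w_match=0.5, w_cont=0.5, cutoff=50):
    """Accept if a weighted sum clears a single cutoff (hypothetical weights)."""
    return w_match * match_pct + w_cont * cont >= cutoff


# A spectrum with perfect continuity but almost no matched abundant ions:
# the threshold model rejects it, while the linear model lets the high Cont
# value compensate for the low MatchPct.
example = {"match_pct": 10, "cont": 100}
rejected_by_threshold = not threshold_model(**example)
accepted_by_linear = linear_model(**example)
```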
To our knowledge, none of the present parameters or algorithms can completely distinguish positives from
negatives. A possible reason is that the search results may not be a binary yes or no answer [11]. Because
many peptide matches are of intermediate quality, using score cutoffs and/or algorithms to force
intermediate quality results into positive or negative categories actually interferes with the goal of
maximizing the data extracted from the system. Even with ideal evaluation parameters that capture the
detailed information of tandem mass spectra, peptide sequences, databases, etc., and with various algorithms,
it is quite possible that positives cannot be completely distinguished from negatives.
Because the final aim of proteomics research is the identification of proteins, the probability that a protein
is correctly identified is more important than that of a peptide. Therefore, several steps may be applied to the
problem. Firstly, new parameters and algorithms are still needed to improve the discrimination. Secondly, the
probability of protein identifications can be estimated based on peptide evaluations, as has been done by the
Keller and Razumovskaya groups [14, 15]. Thirdly, with present parameters and algorithms, one approach to
achieving highly credible protein identifications is to use relatively stringent filters, such as a higher Xcorr
filter setting [17], two or more peptides per protein identification [11] or a combination of different
algorithms. Another is to require that a protein identification be reproducible across multiple experiments
before a result is considered conclusive.
There are two other rules for manual validation [16]: the y ions that correspond to a proline residue
should be intense ions, and unidentified intense fragment ions should correspond to the loss of one or two
amino acids from one of the ends of the peptide. Because these two rules are difficult to quantify with
functions such as MatchPct and Cont, they were not considered in the present AMASS program. Our future work
will take them into account.
Some limitations should be mentioned here. Firstly, our results were based on datasets from mixtures of 18
known proteins, but proteomic results from tissues or protein complexes are much more complex than those from
an 18-protein mixture; whether our results apply to such complex samples remains to be shown.
Secondly, different Xcorr settings are used for different charge states and lengths of the precursor ion, so
various settings are in use [16, 17, 18, 20]. The setting used in our paper was the one producing a relatively
high positive rate [10], but other settings may perform better. Lastly, the database we used was only the
human database rather than the full SWISS-PROT or nr databases, which may produce more random matches.
Conclusion
We programmed manual validation rules into AMASS to distinguish positives from negatives in
SEQUEST results. Our results from known protein mixture datasets showed that AMASS can reduce the
number of negative identifications and improve the positive rate, and that it works synergistically with the
Rscore filter. We believe that AMASS can reduce the time and effort needed for manual validation. AMASS is
freely available to non-profit users on request by e-mail: [email protected].
References
1. Gavin A, Bosche M, Krause R, et al. (2002) Functional organization of the yeast proteome by systematic
analysis of protein complexes. Nature, 415, 141-147.
2. Ho Y, Gruhler A, Heilbut A, et al. (2002) Systematic identification of protein complexes in
Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180-183.
3. Blagoev B, Kratchmarova I, Ong SE, et al. (2003) A proteomics strategy to elucidate functional
protein-protein interactions applied to EGF signaling. Nat Biotechnol, 21, 315-318.
4. Taylor SW, Fahy E, Zhang B, et al. (2003) Characterization of the human heart mitochondrial proteome.
Nat Biotechnol, 21, 281-286.
5. Fenyo D. (2000) Identifying the proteome: software tools. Curr Opin Biotechnol, 11, 391-395.
6. Eng JK, McCormack AL, Yates JR III. (1994) An Approach to Correlate Tandem Mass Spectral Data of
Peptides with Amino Acid Sequences in a Protein Database. J Am Soc Mass Spectrom, 5, 976-989.
7. Perkins DN, Pappin DJC, Creasy DM, et al. (1999) Probability-based protein identification by
searching sequence databases using mass spectrometry data. Electrophoresis, 20, 3551-3567.
8. Clauser KR, Baker P, Burlingame AL. (1999) Role of accurate mass measurement (+/- 10 ppm) in
protein identification strategies employing MS or MS/MS and database searching. Anal Chem, 71,
2871-2882.
9. Field HI, Fenyo D, Beavis RC. (2002) RADARS, a bioinformatics solution that automates proteome mass
spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics, 2,
36-47.
10. Keller AD, Purvine S, Nesvizhskii AI, et al. (2002) Experimental protein mixture for validating tandem
mass spectral analysis. Omics, 6 (2), 207-212.
11. Moore R, Young M, Lee T. (2002) Qscore: an algorithm for evaluating SEQUEST database search
results. J Am Soc Mass Spectrom, 13, 378-386.
12. Anderson DC, Li WQ, Payan DG, et al. (2003) A new algorithm for the evaluation of shotgun peptide
sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST
scores. J Proteome Res, 2, 137-146.
13. Keller A, Nesvizhskii A, Kolker E, et al. (2002) Empirical statistical model to estimate the accuracy of
peptide identifications made by MS/MS and database search. Anal Chem, 74, 5383-5392.
14. Nesvizhskii A, Keller A, Kolker E, et al. (2003) A statistical model for identifying proteins by tandem
mass spectrometry. Anal Chem, 75 (17), 4646-4658.
15. Razumovskaya J, Olman V, Xu D, et al. (2004) A computational method for assessing
peptide-identification reliability in tandem mass spectrometry analysis with SEQUEST. Proteomics, 4, 961-969.
16. Link AJ, Eng J, Schieltz DM, et al. (1999) Direct Analysis of Protein Complexes Using Mass
Spectrometry. Nat Biotechnol, 17, 676-682.
17. Washburn MP, Wolters D, Yates JR 3rd. (2001) Large-Scale Analysis of the Yeast Proteome by
Multidimensional Protein Identification Technology. Nat Biotechnol, 19, 242-247.
18. Li FX, Sun W, Gao YH, et al. (2004) Rscore: a peptide randomicity score for evaluating tandem mass
spectra. Rapid Commun Mass Spectrom, 18, 1-5.
19. Vapnik VN. (1995) The nature of statistical learning theory. Springer-Verlag.
20. Peng JM, Elias JE, Thoreen CC, et al. (2003) Evaluation of multidimensional chromatography coupled
with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J
Proteome Res, 2, 43-50.
Acknowledgement: We thank Andrew Keller of Washington University for providing the datasets and
database. This work was partially supported by grants from Key Project for International Corporation (No.
2002AA229031), Pilot Study for Key Basic Research Project (No. 2002CCA04100), National High
Technology Research and Development Program (No. 2001AA233051), National Natural Science
Foundation (No. 30270657, 30230150).
Figure legend
Figure 1: Comparison of the Xcorr filter and the Xcorr+AMASS filter at different proportions of the common
Xcorr filter setting (70-120%). With a similar number of negatives (94 and 99 for AMASS and the common
setting, respectively), AMASS at the 80% common setting achieves more positives than the common setting
(1725 and 1429, respectively). The values of the AMASS parameters RACutoff, MatchPct and Cont are 20,
60 and 40, respectively.
Figure 2: The effect of different filters on TP and FP. A: The numbers of TP and FP under different filters.
B: The numbers of the three kinds of FP under different filters. The Xcorr filter is at the common setting, and
the values of Cont and MatchPct are 40 and 60, respectively. The AMASS setting is the combination of MatchPct
and Cont with the above values.
Figure 3: Comparison of the Xcorr filter, Xcorr+AMASS filter, Xcorr+Rscore filter and
Xcorr+AMASS+Rscore filter at different proportions of the common Xcorr filter setting (70-120%). With a
similar number of negatives (102 and 99 for AMASS+Rscore and the common setting, respectively),
AMASS+Rscore at the 70% common setting achieves more positives than the common setting (1790 and 1429,
respectively). The values of RACutoff, MatchPct, Cont and Rscore were 20, 60,
40 and 2.7, respectively.
Table 1: Protein components of control mixtures A and B used in the experiments [10].
a Additional accession numbers for rabbit myosin heavy and light chains: P02603, P02602, P24732,
Q28641, P04460, P04461, P35748, Q99105
The values of AMASS parameters The parameter values should maximize the number
of positive peptides without sacrificing the positive rate. The Supplement Figure shows the trend of
positive numbers and positive rate against the AMASS parameters at the 80% common Xcorr filter setting
[17]. The trends at other Xcorr filter settings were similar (detailed data not shown).
According to the trends for the different AMASS parameters, we arbitrarily selected values of 20, 60
and 40 for RACutoff, MatchPct and Cont. Although these values might not be optimal for
AMASS on the present datasets, they distinguished positives from negatives better than the Xcorr
filter alone.
To validate the above AMASS parameter values, different Cont and MatchPct values
under a constant RACutoff value (20) and the 100% common Xcorr filter setting were further checked
against the manual validation results. The Supplement Table shows the results. The values of Cont and
MatchPct were arbitrarily selected as 40 and 60.
Because the values of AMASS parameters were from standard mixtures, whether they were
Supplement Figure: The trend of positive numbers and rate against AMASS parameters with 80%
common Xcorr filter setting. A: Effect of RACutoff value on positive numbers and rate with different
value of MatchPct. B: Effect of MatchPct on positive numbers and rate with different value of
RACutoff. C: Effect of Cont on positive numbers and rate. The setting of AMASS parameters— RACutoff, MatchPct and Cont used in the experiments were 20, 60 and 40, respectively. The arrows
Supplement Table: The total numbers of positive and negative peptides and the positive rate with different
filter settings.
Filter    Value  Positive  Negative  Poor           Noisy    False           Positive
                 peptide   peptide   fragmentation  spectra  interpretation  rate
Cont      30     1291      203       70             65       68              86.41%
Cont      40     1258      150       61             38       51              89.35%
Cont      50     1179      115       55             27       33              91.11%
MatchPct  50     1291      170       70             54       46              88.36%
MatchPct  60     1258      113       51             35       27              91.76%
MatchPct  70     1110      74        41             18       15              93.75%
Supplement Figure: Selection of models, illustrated in a 2D plane. The threshold model, in bold,
features no relevance between different parameters. It cannot be approximated by the linear model
(the dashed line). A quadratic model (grey line) can approximate the line, but a simple model