AMASS: Software for Automatically Validating the Quality of MS/MS Spectrum From SEQUEST Results

Wei Sun 1,2*, Fuxin Li 3, Jue Wang 3, Dexian Zheng 1,2, Youhe Gao 1,2*

1 Proteomics Research Center, 2 National Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, 100005;

3 Institute of Automation, Chinese Academy of Sciences, Beijing, 100080

Address reprint requests to:

Youhe Gao/Wei Sun

5 Dong Dan San Tiao

Institute of Basic Medical Sciences

Chinese Academy of Medical Sciences/Peking Union Medical College

Beijing

People’s Republic of China 100005

Tel: 086-010-6787-2251-206

Fax: 086-010-6787-2251-201

Email: [email protected], [email protected]

Running title: AMASS automatically validates SEQUEST results

Abbreviations

AMASS: Advanced MAss Spectrum Screener

FP: false positives

TP: true positives

MS/MS: tandem mass spectrum

Summary: Manual validation of tandem mass spectra, which is time-consuming and experience-dependent, is usually applied to SEQUEST results. This inefficient method has become a significant bottleneck in MS/MS data processing. Here we introduce AMASS (Advanced MAss Spectrum Screener), a program that filters the tandem mass spectra of SEQUEST results by measuring the match percentage of high-abundance ions and the continuity of matched fragment ions in the b and y series. Compared with the Xcorr and DeltaCn filter, AMASS increased the number of positives and reduced the number of negatives in 22 datasets generated from mixtures of 18 known proteins. It effectively removed most noisy spectra and false interpretations and about half of the poor fragmentation spectra. AMASS also works synergistically with the Rscore filter [18]. We believe that the use of AMASS and Rscore can result in more accurate identification of peptide MS/MS spectra and reduce the time and energy needed for manual validation.

Introduction

With the development of proteomics, tandem mass spectrometry has been used to determine the protein components of complex mixtures [1-4]. In this approach, proteins are digested into peptides by enzymes and separated by reverse-phase liquid chromatography. The eluted peptides are then ionized and fragmented, mainly into b and y ions. The tandem mass spectra produced by the mass spectrometer can be used for peptide identification. A common approach is to search the tandem mass spectra against a sequence database to find the best-matching peptide [5]. Several database search programs, such as SEQUEST [6], Mascot [7], MS-Tag [8] and Sonar [9], have been introduced to assign peptides to MS/MS spectra. These programs use various scoring schemes to distinguish correct identifications from false positives, but they are known to produce a significant number of incorrect peptide assignments [10]. Validating peptide assignments therefore often relies on time-consuming and experience-dependent manual verification.

Recently, several groups have applied different algorithms to evaluate SEQUEST database search results [11-15]. Moore et al. described Qscore [11], an algorithm based on a probability model that includes the expected number of matches from a given database, the effective database size, a correction for indistinguishable peptides, and a measurement of match quality. Anderson et al. [12] applied a support vector machine learning algorithm to distinguish correctly from incorrectly identified peptides, using a vector of parameters describing each peptide identification that includes observed data (peptide mass, precursor ion intensity) and SEQUEST-calculated statistics (such as Xcorr, DeltaCn, Sp and RSp). Keller et al. [13, 14] employed another machine learning algorithm, the expectation maximization algorithm, which incorporates four SEQUEST scores plus the number of tryptic peptide termini present in the matched peptides to estimate a peptide probability. The probabilities of the peptides with correct assignments are then combined to estimate the probability of the corresponding protein. More recently, Razumovskaya et al. [15] developed a method that combines a neural network and a statistical model to normalize SEQUEST scores and to provide a reliability estimate for each SEQUEST hit. These methods improve the separation between correct and incorrect peptide assignments and reduce the number of SEQUEST protein identifications that have to be validated manually.

The above approaches are based on different algorithms. Here we address the same problem in a different way. Manual validation of a peptide match often makes use of various spectral properties to discriminate positives from negatives [16, 17]. We encoded manual validation rules in a computer program that filters SEQUEST outputs automatically. Two rules are important for manual validation: the fragment ions should be clearly above baseline noise, and the spectrum should have continuous b or y ion matches [16]. The facts underlying these rules are that “highly abundant fragment ions are more likely to be signals” and that “the MS/MS spectrum of an optimally fragmented peptide should theoretically contain continuous fragment ions of the b or y series”. Based on these two facts, AMASS implements two functions that calculate the match percentage of high-abundance fragment ions and the continuity of the b or y ion series. Tandem mass spectra datasets of known protein mixtures searched with SEQUEST were filtered by AMASS with relaxed Xcorr and DeltaCn settings, and the result was compared with that obtained using the common Xcorr and DeltaCn settings alone [17].

Experimental Section

Experimental dataset. The experimental datasets were obtained as described in Ref. 10. They were produced by analyzing a mixture of 18 proteins by liquid chromatography/tandem mass spectrometry. Two mixtures, A and B, were obtained by mixing together 18 purified proteins of different physicochemical properties (Sigma, St. Louis, MO, USA; Prozyme, San Leandro, CA, USA) in the indicated relative molar amounts (Table 1). The complex peptide mixtures were analyzed by LC/MS on an electrospray ionization ion trap mass spectrometer (ThermoFinnigan, San Jose, CA, USA) using a standard top-down data-dependent ion selection approach, in which the most abundant peak above background level is selected and a concurrent 3 min of dynamic exclusion is employed to prevent re-selection of previously selected ions. Peptides were eluted by an acetonitrile gradient (10-35% over 60 min) across a 10 cm × 100 µm C18 column while the ion trap mass spectrometer continuously selected peptides for collision-induced dissociation via alternating MS and MS/MS modes. To increase the duty cycle, the zoom scan function capable of determining charge state was not employed.

In total, 14 LC/MS/MS runs were performed on mixture A, using 10 µL (A1), 5 µL (A2), 1 µL (A3), or 2.5 µL (A4-14) of 1:5 diluted mixture. Eight LC/MS/MS runs were performed on mixture B, using 1 µL (B1-2), 2 µL (B3-4), 5 µL (B5-6), or 7.5 µL (B7-8) of 1:20 diluted mixture.

SEQUEST search and Xcorr filter. The 22 raw files were searched against the protein database with Bioworks 3.1 from ThermoFinnigan (San Jose, CA, USA). The protein database was composed of 88,374 proteins, including the SWISS-PROT human protein database and the 18 proteins in the mixture. Tryptic cleavage only at Lys or Arg was allowed, with up to two missed internal cleavage sites per peptide. The maximal allowed uncertainty in the precursor ion mass was 1.4 m/z. Peptides from 400 to 4500 m/z and precursor charge states of +1, +2, and +3 were allowed. The minimum total ion current required for precursor ion fragmentation was 1.0 × 10^5, and the minimum number of ions was 25. Altogether, 47,907 spectra were searched against the database.

The output files were filtered with the Xcorr filter (Xcorr + DeltaCn). The following values of Xcorr and DeltaCn were taken as the common setting [17]: DeltaCn ≥ 0.1, and

Xcorr ≥ 1.9 for +1 charged peptides with fully tryptic ends

Xcorr ≥ 2.2 for +2 charged peptides with partially or fully tryptic ends

Xcorr ≥ 3.75 for +3 charged peptides with partially or fully tryptic ends

The Xcorr filters used were derived from the common setting by scaling the Xcorr thresholds while keeping DeltaCn constant. For example, an 80% Xcorr filter meant 0.8 × (common setting), so the filter was Xcorr ≥ 0.8 × 1.9 = 1.52 for +1 charged peptides, and so forth. The Xcorr filters examined in the analysis ranged from 0% to 120% in steps of 10%.
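The scaled Xcorr filter described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function and constant names are hypothetical, and reading "partially tryptic" as "at least one tryptic terminus" is an assumption.

```python
# Common Xcorr thresholds by precursor charge, as listed in the text.
COMMON_XCORR = {1: 1.9, 2: 2.2, 3: 3.75}
# Assumed encoding of the tryptic-end requirement: +1 needs fully tryptic
# ends (2 termini); +2 and +3 accept partially or fully tryptic ends (>= 1).
MIN_TRYPTIC_ENDS = {1: 2, 2: 1, 3: 1}
MIN_DELTA_CN = 0.1  # kept constant when the Xcorr threshold is scaled


def scaled_xcorr_threshold(charge: int, percent: float) -> float:
    """Xcorr cutoff for a filter at `percent` of the common setting."""
    return COMMON_XCORR[charge] * percent / 100.0


def passes_xcorr_filter(xcorr: float, delta_cn: float, charge: int,
                        tryptic_ends: int, percent: float = 100.0) -> bool:
    """Return True if a peptide hit passes the scaled Xcorr + DeltaCn filter."""
    if charge not in COMMON_XCORR:
        return False  # only charge states +1, +2 and +3 were searched
    return (
        xcorr >= scaled_xcorr_threshold(charge, percent)
        and delta_cn >= MIN_DELTA_CN
        and tryptic_ends >= MIN_TRYPTIC_ENDS[charge]
    )
```

For a +1 peptide, `scaled_xcorr_threshold(1, 80)` gives the 1.52 cutoff worked out in the text, so a hit with Xcorr 1.6 passes the 80% filter but not the common setting.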

Positive and negative peptides. Peptides were classified as positive or negative according to whether they belonged to one of the 18 known proteins. Only the top-scoring peptide was used to judge the presence of a particular protein. If a peptide that passed the Xcorr filter was part of one of the 18 known proteins, it was counted as a positive peptide; otherwise, it was counted as a negative peptide.

Common contaminants were not counted as positives, which decreased the number of positives. We adopted this conservative strategy because we wanted to demonstrate the effect of the AMASS parameters under the most conservative settings.

Computer programs: AMASS. The following rules are commonly applied in the manual validation of mass spectra [16]: (1) the MS/MS spectrum must be of good quality, with fragment ions clearly above baseline noise; (2) there must be some continuity to the b or y ion series.

Based on these rules, we propose two functions.

(1) Match percentage, MatchPct:

MatchPct = [number of matched daughter ions with relative abundance higher than RACutoff / total number of daughter ions with relative abundance higher than RACutoff] × 100%

RACutoff (Relative Abundance Cutoff) is a number between 0 and 100 that serves as a relative abundance cutoff in MS/MS spectra. For example, when RACutoff was 20, only the ions with relative abundance higher than 20 were included in the calculation of MatchPct. A lower RACutoff includes more fragment ions in the calculation. A higher MatchPct value means that a larger share of the fragment ions above the RACutoff were matched. In general, the higher the value of MatchPct, the better the quality of the identification.

(2) Continuity, Cont:

Cont =

\[
\frac{\sum_{i=1}^{l} \bigl( f(i) + b(i) + y(i) \bigr)}{(l-1)^{2}} \times 100,
\]

where

f(i) = 1 if the i-th b or y series ion is matched, and 0 otherwise;

b(i) = n^2 if the (i+1)-th b series ion is not matched, where n is the number of continuously matched b ions immediately before and including the i-th ion, and 0 otherwise;

y(i) = n^2 if the (i+1)-th y series ion is not matched, where n is the number of continuously matched y ions immediately before and including the i-th ion, and 0 otherwise;

l = the number of amino acids in the peptide.

Cont adds the squared lengths of the runs of continuously matched b series and y series ions to the total number of matched ions, and then normalizes the sum by dividing by its maximum possible value and multiplying by 100. A higher Cont value means more continuous matching of fragment ions.
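The un-normalized part of Cont can be illustrated with a short Python sketch. It is a reading of the definitions above under stated assumptions, not the AMASS source code: the function names are hypothetical, each series is given as a list of booleans (one per ion, in order), and the f(i) term is taken to count every matched b ion and every matched y ion. The final normalization to a 0-100 scale is deliberately left out.

```python
from typing import Sequence


def squared_run_sum(matched: Sequence[bool]) -> int:
    """Sum n**2 over every maximal run of n consecutively matched ions.

    This mirrors b(i) and y(i): a run contributes n**2 at its last ion,
    i.e. at the ion whose successor is not matched.
    """
    total = 0
    run = 0
    for is_matched in matched:
        if is_matched:
            run += 1
        else:
            total += run * run  # the run ended at the previous ion
            run = 0
    return total + run * run  # close a run that reaches the last ion


def cont_numerator(b_matched: Sequence[bool], y_matched: Sequence[bool]) -> int:
    """Squared run lengths of both series plus the number of matched ions."""
    # Assumption: the f(i) term counts each matched b ion and each matched y ion.
    matched_count = sum(b_matched) + sum(y_matched)
    return squared_run_sum(b_matched) + squared_run_sum(y_matched) + matched_count
```

Squaring the run lengths rewards contiguity: three matched ions in a row contribute 9, whereas three isolated matches contribute only 3.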

When calculating MatchPct and Cont, all matched daughter ions in the different charge states were taken into account. To determine the effect of AMASS on the numbers of positive and negative peptides, the values of RACutoff, MatchPct and Cont were varied from 0 to 90 in steps of 10 and applied to the SEQUEST results as a secondary filter in addition to the corresponding Xcorr filter. Suitable parameter values should maximize the number of positive peptides without sacrificing the positive rate. The values of the AMASS parameters RACutoff, MatchPct and Cont were estimated experimentally as 20, 60 and 40, respectively (data are shown in Supplement 1).
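The MatchPct definition given earlier, with the RACutoff of 20 chosen here as the default, can be sketched as follows. The function name and the input layout (pairs of relative abundance and a matched flag) are hypothetical, and returning 0 when no ion exceeds the cutoff is an assumption, since the text does not cover that case.

```python
from typing import Iterable, Tuple


def match_pct(peaks: Iterable[Tuple[float, bool]], ra_cutoff: float = 20.0) -> float:
    """Percentage of ions above the relative abundance cutoff that are matched.

    `peaks` holds (relative_abundance, matched) pairs, with relative
    abundance on a 0-100 scale.
    """
    # Keep only the matched flags of ions strictly above the cutoff.
    above = [matched for abundance, matched in peaks if abundance > ra_cutoff]
    if not above:
        return 0.0  # assumption: an empty selection scores 0
    return 100.0 * sum(above) / len(above)


# Five ions; the ion at abundance 10 falls below the cutoff of 20,
# and three of the remaining four are matched.
example_peaks = [(100, True), (50, False), (30, True), (10, True), (25, True)]
example_value = match_pct(example_peaks)
```

With the chosen settings, a spectrum would pass the MatchPct criterion when this value reaches 60; whether the comparison is strict or inclusive is not stated in the text.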

Results

The effect of AMASS. Figure 1 shows the total numbers of positives and negatives obtained with four different filters: (1) the Xcorr filter, (2) the MatchPct + Xcorr filter, (3) the Cont + Xcorr filter and (4) the AMASS (MatchPct + Cont) + Xcorr filter, with the Xcorr filter ranging from 70% to 120% of the common setting. When the Xcorr filter was lowered, the number of negatives increased dramatically and the positive rate decreased. When MatchPct, Cont or both were used, however, more positives and a higher positive rate were achieved with almost the same number of negatives, even at Xcorr filter settings below the common setting. For example, the numbers of positives and negatives were 1429 and 99 with the common setting, and increased to 2034 and 341 with the 80% Xcorr filter. When AMASS was also applied at the 80% Xcorr filter (with RACutoff, MatchPct and Cont set to 20, 60 and 40, respectively), the number of positives was 1725, with a number of negatives (94) similar to that of the common setting.
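The positive rates behind these counts can be checked with a small calculation. The sketch assumes that the positive rate is positives / (positives + negatives); this formula is not stated explicitly in the text, but it reproduces the rates listed in the Supplement Table (for example, 1291 positives and 203 negatives give 86.41%). The function name is hypothetical.

```python
def positive_rate(positives: int, negatives: int) -> float:
    """Positive rate in percent, assumed to be positives / (positives + negatives)."""
    return 100.0 * positives / (positives + negatives)


# (positives, negatives) for the three filter settings quoted in the text.
filters = {
    "common Xcorr setting": (1429, 99),
    "80% Xcorr filter": (2034, 341),
    "80% Xcorr filter + AMASS": (1725, 94),
}

for name, (pos, neg) in filters.items():
    print(f"{name}: {pos} positives, {neg} negatives, "
          f"positive rate {positive_rate(pos, neg):.2f}%")
```

Under this assumption, lowering the Xcorr filter alone trades positive rate for positives, whereas adding AMASS at the 80% filter keeps the rate at least as high as the common setting while retaining more positives.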

Figure 1 also shows that the effects of MatchPct and Cont were similar and that their combination, AMASS, had an even better effect, which indicates that MatchPct and Cont remove different types of false identifications.

The effect of each AMASS parameter. The above result was based on the assumption that all peptides belonging to the 18 known proteins were positives. However, positives with poor spectral quality would be considered false positives (FP) under manual validation. To further examine the effect of AMASS relative to manual validation, all peptide identifications from the 22 datasets that passed the common Xcorr filter settings were manually assigned as true positives (TP) or false positives (FP) according to the manual validation rules above [16]. If the tandem mass spectrum assigned to a peptide met the manual validation rules, the peptide was considered a true positive (TP); otherwise, it was considered a false positive (FP).

To evaluate the effect of each AMASS parameter, the tandem mass spectra assigned to FP were classified, according to our experience, into three categories. The first category was poor fragmentation, with much of the ion current in a few major peaks. The second was noisy spectra, which had a low signal-to-noise ratio. The third was false interpretation, in which the spectrum had major peaks and a good signal-to-noise ratio, but most of the matched ions were noise. The final list of TP assignments consisted of 1295 peptides confidently identified in the mixture. The list of FP assignments contained 233 peptide hits by SEQUEST (73, 81 and 79 in the three categories, respectively). We assigned fewer TP peptide identifications than Keller et al. [10] because they assessed all SEQUEST outputs without any filter, whereas we assessed only the peptides that passed the common Xcorr filter.

Figure 2a shows the numbers of TP and FP under different filters, which indicates that AMASS could decrease the number of FP at little cost in TP. Figure 2b shows the different effects of the AMASS parameters on the three categories of FP. Cont and MatchPct filtered out most of the noisy and false interpretation FP, but only about half of the poor fragmentation ones.

The signal-to-noise ratio in noisy MS/MS spectra was low, so most of the ions were of high relative abundance while the number of matched ions was relatively small. The MatchPct values were therefore lower than in TP, and such FP could be removed effectively. For false interpretation spectra, because most of the matched ions were noise, MatchPct was very low and these spectra could also be filtered out by AMASS. Some FP of these two types might also be filtered out by Cont because of poor continuity. For some poor fragmentation spectra, however, if a few high-abundance ions were matched, the value of MatchPct might be higher than 80. Moreover, because of random matches, the continuity might reach 60 or even higher. Such poor fragmentation spectra were therefore difficult for AMASS to filter out.

Combining MatchPct and Cont filtered out more FP (most of the noisy and false interpretation FP and about half of the poor fragmentation FP), which indicates that the two parameters have different effects.

Combination of AMASS and Rscore. Rscore [18], from our previous work, is a score that evaluates relative quality in terms of cross-correlation and matched intensity percentage. The notion underlying Rscore is that true positive peptide identifications should be better than other, randomly generated identifications. For poor fragmentation spectra, the few high-abundance ions are likely to be matched by both the first- and the second-ranked peptides. The relative quality difference between them is then small, and such spectra can be filtered out by Rscore. Since AMASS works best on the other two kinds of FP, AMASS and Rscore should be complementary. Figure 3 shows that when the two filters were used together, the Xcorr filter could be lowered to 70% of the common setting and more positives (1790) were obtained with a number of negatives (102) similar to that of the common setting (99). This result was better than that obtained with either filter alone.

Discussion

Different SEQUEST parameters, different algorithms [11-15] and new parameters [12] have been used to evaluate the quality of SEQUEST results. However, how to remove as many negatives as possible while keeping as many positives as possible remains an open problem.

AMASS is based on two manual validation rules. In our results, AMASS increased the number of positives and the positive rate at Xcorr filter settings lower than the common setting. The manual validation results showed that it can filter out most noisy MS/MS spectra and false interpretations and about half of the poor fragmentation FP at a low cost in TP. When AMASS and Rscore were both applied, more positives were obtained with a similar number of negatives. These results indicate that high-quality positive identifications can be achieved with AMASS, although it does not completely separate TP from FP.

AMASS uses a threshold model. We chose the threshold model because we want TP results to satisfy all the AMASS criteria. The AMASS criteria are independent, in the sense that a high value of one parameter cannot compensate for a deficit in another (for instance, a perfect Cont score does not guarantee that the matched ions are signals). A linear model does not have this property. Other models could also be used to tackle this problem. A quadratic model would be able to approximate the threshold model, but we decided to preserve the simplicity of the model, since a simpler model should have better generalization ability [19] (Supplement 2).
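The difference between the threshold model and a linear model can be made concrete with a small sketch. The threshold function uses the MatchPct and Cont cutoffs of 60 and 40 from this study; treating the cutoffs as inclusive is an assumption. The linear model, with its weights and cutoff, is a purely hypothetical alternative included for contrast.

```python
def passes_threshold_model(match_pct: float, cont: float,
                           min_match_pct: float = 60.0,
                           min_cont: float = 40.0) -> bool:
    """Accept only if every criterion clears its own cutoff."""
    return match_pct >= min_match_pct and cont >= min_cont


def passes_linear_model(match_pct: float, cont: float,
                        w_match: float = 0.5, w_cont: float = 0.5,
                        cutoff: float = 50.0) -> bool:
    """Hypothetical linear alternative: one weighted sum against one cutoff."""
    return w_match * match_pct + w_cont * cont >= cutoff


# A spectrum with perfect continuity but very few matched high-abundance ions.
suspicious = {"match_pct": 10.0, "cont": 100.0}
threshold_verdict = passes_threshold_model(**suspicious)  # rejected
linear_verdict = passes_linear_model(**suspicious)        # accepted: 0.5*10 + 0.5*100 = 55
```

In the linear model the high Cont value compensates for the low MatchPct, which is exactly the behavior the threshold model is chosen to rule out.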

To our knowledge, none of the present parameters or algorithms can completely distinguish positives from negatives. A possible reason is that a search result may not be a binary yes-or-no answer [11]. Because many peptide matches are of intermediate quality, using score cutoffs and/or algorithms to force intermediate-quality results into positive or negative categories actually interferes with the goal of maximizing the data extracted from the system. Even with ideal evaluation parameters that capture the detailed information in the tandem mass spectra, peptide sequences and database, and with various algorithms, it may well be impossible to completely distinguish positives from negatives.

Because the final aim of proteomics research is the identification of proteins, the probability that a protein is correctly identified is more important than that of a peptide. Several steps may therefore be applied to the problem. First, new parameters and algorithms are still needed to improve discrimination. Second, the probability of protein identifications can be estimated from peptide evaluations, as has been done by the Keller and Razumovskaya groups [14, 15]. Third, with the present parameters and algorithms, high-credibility protein identification can be pursued in two ways. One is to use relatively stringent filters, such as a higher Xcorr filter setting [17], a requirement of two or more peptides per protein identification [11], or a combination of different algorithms. The other is to require that the protein identification be reproducible across multiple experiments before drawing a conclusion.

There are two other rules for manual validation [16]: the y ions that correspond to a proline residue should be intense, and unidentified intense fragment ions should correspond to the loss of one or two amino acids from one of the ends of the peptide. Because these two rules are difficult to quantify with functions such as MatchPct and Cont, they are not considered in the present AMASS program. Our future work will take them into account.

Some limitations should be noted. First, our results were based on datasets from mixtures of 18 known proteins, whereas proteomic analyses of tissues or protein complexes are much more complex; whether our results apply to such complex samples remains to be shown. Second, different Xcorr thresholds are used for different charge states and precursor ion lengths, so various settings exist [16, 17, 18, 20]. The one used in this paper was the one producing a relatively high positive rate [10], but other settings may perform better. Finally, the database we used contained only human proteins rather than the full SWISS-PROT or nr databases, which may produce more random matches.

Conclusion

We programmed manual validation rules into AMASS to distinguish positives from negatives in SEQUEST results. Our results on datasets from known protein mixtures showed that AMASS can reduce the number of negative identifications and improve the positive rate, and that it works synergistically with the Rscore filter. We believe that AMASS can reduce the time and energy needed for manual validation. AMASS is freely available to non-profit users on request by e-mail: [email protected]

References

1. Gavin A, Bosche M, Krause R, et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141-147.

2. Ho Y, Gruhler A, Heilbut A, et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180-183.

3. Blagoev B, Kratchmarova I, Ong SE, et al. (2003) A proteomics strategy to elucidate functional protein-protein interactions applied to EGF signaling. Nat Biotechnol, 21, 315-318.

4. Taylor SW, Fahy E, Zhang B, et al. (2003) Characterization of the human heart mitochondrial proteome. Nat Biotechnol, 21, 281-286.

5. Fenyo D. (2000) Identifying the proteome: software tools. Curr Opin Biotechnol, 11, 391-395.

6. Eng JK, McCormack AL, Yates JR III. (1994) An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J Am Soc Mass Spectrom, 5, 976-989.

7. Perkins DN, Pappin DJC, Creasy DM, et al. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20, 3551-3567.

8. Clauser KR, Baker P, Burlingame AL. (1999) Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal Chem, 71, 2871-2882.

9. Field HI, Fenyo D, Beavis RC. (2002) RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics, 2, 36-47.

10. Keller AD, Purvine S, Nesvizhskii AI, et al. (2002) Experimental protein mixture for validating tandem mass spectral analysis. Omics, 6(2), 207-212.

11. Moore R, Young M, Lee T. (2002) Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom, 13, 378-386.

12. Anderson DC, Li WQ, Payan DG, et al. (2003) A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J Proteome Res, 2, 137-146.

13. Keller A, Nesvizhskii A, Kolker E, et al. (2002) Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem, 74, 5383-5392.

14. Nesvizhskii A, Keller A, Kolker E, et al. (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem, 75(17), 4646-4658.

15. Razumovskaya J, Olman V, Xu D, et al. (2004) A computational method for assessing peptide-identification reliability in tandem mass spectrometry analysis with SEQUEST. Proteomics, 4, 961-969.

16. Link AJ, Eng J, Schieltz DM, et al. (1999) Direct Analysis of Protein Complexes Using Mass Spectrometry. Nat Biotechnol, 17, 676-682.

17. Washburn MP, Wolters D, Yates JR 3rd. (2001) Large-Scale Analysis of the Yeast Proteome by Multidimensional Protein Identification Technology. Nat Biotechnol, 19, 242-247.

18. Li FX, Sun W, Gao YH, et al. (2004) Rscore: a peptide randomicity score for evaluating tandem mass spectra. Rapid Commun Mass Spectrom, 18, 1-5.

19. Vapnik VN. (1995) The nature of statistical learning theory. Springer-Verlag.

20. Peng JM, Elias JE, Thoreen CC, et al. (2003) Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res, 2, 43-50.

Acknowledgement: We thank Andrew Keller of Washington University for providing the datasets and database.

This work was partially supported by grants from the Key Project for International Corporation (No. 2002AA229031), the Pilot Study for Key Basic Research Project (No. 2002CCA04100), the National High Technology Research and Development Program (No. 2001AA233051), and the National Natural Science Foundation (No. 30270657, 30230150).

Figure legend

Figure 1: Comparison of the Xcorr filter and the Xcorr + AMASS filter under different proportions of the common Xcorr filter setting (70-120%). With a number of negatives similar to that of the common setting (94 for AMASS and 99 for the common setting), AMASS obtained more positives at 80% of the common setting (1725 for AMASS and 1429 for the common setting). The values of the AMASS parameters RACutoff, MatchPct and Cont were 20, 60 and 40, respectively.

Figure 2: The effect of different filters on TP and FP. A: The numbers of TP and FP under different filters. B: The numbers of the three kinds of FP under different filters. The Xcorr filter is at the common setting, and the values of Cont and MatchPct are 40 and 60, respectively. The AMASS setting is the combination of MatchPct and Cont with these values.

Figure 3: Comparison of the Xcorr filter, Xcorr + AMASS filter, Xcorr + Rscore filter and Xcorr + AMASS + Rscore filter under different proportions of the common Xcorr filter setting (70-120%). With a number of negatives similar to that of the common setting (102 for AMASS + Rscore and 99 for the common setting), AMASS + Rscore obtained more positives at 70% of the common setting (1790 for AMASS + Rscore and 1429 for the common setting). The values of RACutoff, MatchPct, Cont and Rscore were 20, 60, 40 and 2.7, respectively.

Table 1: Protein components of control mixtures A and B used in the experiments [10].

a Additional accession numbers for rabbit myosin heavy and light chains: P02603, P02602, P24732, Q28641, P04460, P04461, P35748, Q99105

Figure 1

Figure 2 (panels A and B)

Figure 3

The values of AMASS parameters. The parameter values should maximize the number of positive peptides without sacrificing the positive rate. The Supplement Figure shows the trend of the number of positives and the positive rate against the AMASS parameters at the 80% common Xcorr filter setting [17]. Other Xcorr filter settings showed similar trends (detailed data not shown).

According to these trends, we arbitrarily selected values of 20, 60 and 40 for RACutoff, MatchPct and Cont. Although these values might not be optimal for AMASS on the present datasets, they distinguished positives from negatives better than the Xcorr filter alone.

To validate these parameter values, different Cont and MatchPct values at a constant RACutoff value (20) and the 100% common Xcorr filter setting were further evaluated against the manual validation results, as shown in the Supplement Table. The values of Cont and MatchPct were arbitrarily selected as 40 and 60.

Because the values of AMASS parameters were from standard mixtures, whether they were

Supplement Figure: The trend of the number of positives and the positive rate against the AMASS parameters at the 80% common Xcorr filter setting. A: Effect of the RACutoff value on the number of positives and the positive rate at different values of MatchPct. B: Effect of MatchPct on the number of positives and the positive rate at different values of RACutoff. C: Effect of Cont on the number of positives and the positive rate. The settings of the AMASS parameters RACutoff, MatchPct and Cont used in the experiments were 20, 60 and 40, respectively. The arrows

Supplement Figure (panels A1, A2, B1, B2)

Supplement Table: The total numbers of positive and negative peptides and the positive rate with different filter settings.

Filter     Value   Positive   Negative   Poor            Noisy     False            Positive
                   peptides   peptides   fragmentation   spectra   interpretation   rate
Cont       30      1291       203        70              65        68               86.41%
Cont       40      1258       150        61              38        51               89.35%
Cont       50      1179       115        55              27        33               91.11%
MatchPct   50      1291       170        70              54        46               88.36%
MatchPct   60      1258       113        51              35        27               91.76%
MatchPct   70      1110       74         41              18        15               93.75%

Supplement Figure: Selection of models, illustrated in a 2D plane. The threshold model (bold line) involves no interaction between the different parameters. It cannot be approximated by the linear model (dashed line). A quadratic model (grey line) can approximate it, but a simple model
