Isobaric Tag based MS Quantification Algorithms Analysis and Implementation

(1)

Isobaric Tag based MS

Quantification Algorithms

Analysis and Implementation

Master’s degree in Proteomics and Bioinformatics

Written by Sankar Martial

Supervisors: Nicolas Budin1, Pierre-Alain Binz1

Academic year 2007/2008

This thesis was submitted as part of the requirements for the Master’s degree in Proteomics and Bioinformatics

from the University of Geneva.

1

(2)

Abstract

One single gene gives rise to several proteins. This well-known sentence illustrates all of the complexity of the proteome compared to the genome and the transcriptome. As does genomics, proteomics provides a large toolbox of experimental methods to achieve the quantification of proteins. In contrast, the analytical means to obtain a reliable and trusted value of protein relative abundances are less developed. Although more and more tools are released to assess protein ratios, a quick review of proteomics papers reveals that quantitative data analysis is still performed manually.

For this purpose, three algorithms for isobaric tag-based quantitative analysis were implemented. They were successfully applied on experimental datasets provided by the Biomedical Proteomics Research Group (BPRG) and the Clinical Proteomics Research Group (CPRG) of the University of Geneva. Finally, these algorithms were implemented in the quantification module of the Phenyx software.

(4)

Acknowledgements

It was an immense privilege to be guided and to be under the tutelage of my supervisors Nicolas Budin and Pierre-Alain Binz.

I am very grateful to Alexandre Masselot for having given me the opportunity to do this training course at GeneBio, for his availability and his advice.

I would like to thank Nasri Nahas for having permitted me to carry out this training under the optimal conditions.

Many thanks are due to Olivier Evalet, Yann Mauron, Roman Mylonas and Ivan Topolsky for their support and their good mood.

I am very thankful to Alireza Vaezzadeh and Loic Dayon for their collaboration, I thank very much David Bouyssié who did an essential previous work on the reporter ion peaks extraction.

Finally, I would like to thank all the members of GeneBio for having welcoming me during one year.

(5)

1.

Introduction

1.1.

GeneBio SA

Geneva Bioinformatics (GeneBio) SA is a bioinformatics company founded in November 1997. It was created quasi simultaneously with the Swiss Institute of Bioinformatics (SIB). One of the main activities of GeneBio is to act as the privileged commercial arm of the SIB, and therefore bring to market developments done at the SIB in order to provide back revenues to help further developments. Its first product line started in 1998 with the Swiss-Prot database. Swiss-2DPAGE (a 2D gel database), Prosite (a database of protein domains, families and functional sites) and Melanie (a 2D gel analysis software) soon followed. GeneBio also develops and commercialises proper specialized and innovative databases and software on biological molecules. These include Phenyx, a renowned software platform for the identification and characterization of proteins and peptides from mass spectrometry data. Another example is SmileMS, the latest GeneBio software and also developed in collaboration with the SIB. It is a unique platform for the identification and analysis of small molecules by mass spectrometry. Located in Geneva, a centre of excellence in the field of proteomics, GeneBio now has between 15 and 20 employees, including a majority of biologists and computer scientists.

1.2.

Organisation

My training course was achieved within the Phenyx development team. It's a bioinformatics training involving a binomial supervision. Dr Pierre-Alain-Binz has taken the responsibility of the scientific aspects of my project, giving me advice and orientation in proteomics and data analysis. Dr Nicolas Budin directed the informatics part of the project. He has managed the whole development side of the project, initiating me into R language, Java language, and to reliable methods of software development.

The mass-spectrometric-based quantification universe is wide. From the beginning, it was decided that only iTRAQ quantification would be covered in my work.

To deepen my knowledge of the master courses about Mass-Spectrometric-based quantification in proteomics, I started the training by reading papers on iTRAQ reagents and principles of quantification. Later, in order to familiarize with MS-identification and quantification, I focused on testing some of the available quantification tools. These steps permitted me to handle biological data, to see and understand the basis of large amounts of data analysis and to feel the span of the perspectives that the mass-spectrometric-based quantitative analysis offers. Subsequently, I have developed my own tools. Using the R Language, I have replicated the workflow of some of the studied software. Finally, the time came to offer

(6)

In order to obtain real users’ feedback, to collect what was needed in quantification, to see how the data was analysed, to obtain quantification materials and to have an idea of how the developed tools should behave when faced with real data, a collaboration was carried out with the Biochemical Proteomics Research Group and the Clinical Proteomics Group of the Geneva University.

1.3.

Biological Context

Proteomics analysis proposes a large toolbox of analytical methods, instruments and algorithms to identify and characterize proteins. The majority of the published proteomics studies are limited to the identification of the proteins expressed in a biological system. However, this is not sufficient to answer to most biological questions. Does a protein behave significantly different between two samples? Does a protein exhibit time-dependent change? Which proteins behave similarly in the experiment? Thus, quantitative answers are more and more required and populate an increasing number of publications.

Initially, quantitative and comparative proteome analysis was performed with 2D-PAGE. Due to some limitations (low dynamic range, bias against membrane and soluble proteins), “gel-free” methods have complemented and are gradually supplanting “gel-based” quantitative proteomics. Specific techniques are used to address this issue. One solution is based on the employment of stable isotopes. Isotope label can be incorporated to the process in three ways: metabolically (during cell growth), enzymatically, and chemically.

Chemical incorporation of the stable isotope has produced the most of the quantitative proteome data mainly due to its chemical versatility and because it allows the analysis of any biological sample (in contrast to metabolic).

Due to the high amount of data, manual analysis is laborious but remains persistent in the scientific community. Thus, many software tools are made available to support data analysis. Several are open-source, with their own assets and caveats. Most of them are able to handle identification results from one or more different search engine (Mascot, Phenyx, SEQUEST, X!Tandem...). Recently, Mascot (a major player in the identification market) proposed its own quantification module. Phenyx need to meet the concurrent demands of quantitative high throughput MS data by proposing its own quantification module. Following the six month’s work of the Master Student David Bouyssié in 2006 on the quantitative data extraction module for mainly SILAC and iTRAQ, my main objective was to add the missing downstream analyses pieces of the puzzle and provide a complete quantification pipeline for iTRAQ methodology.

(7)

2.

Quantitative Proteomics

2.1.

Global View

Several methods exist to assess the protein abundance in a sample. Classical methods such as western blotting, fluorophores and radioactivity are widely used due to their sensitivity and dynamic range. However, these methods are not generally appropriate due to some constraints for large scale screening, and particularly for biomarker discovery (non-targetted

experiments). Wu et al. has compared MS-based stable isotope labelling methods (iTRAQ,

cICAT) and DIGE (Differential Gel Electrophoresis) and has shown that these problems can be overcomed by MS-based approaches [32]. MS-based strategy coupled with separation methods (2DE or LC) is currently the more efficient mean to perform the identification of a complex mixture of protein. However, due to the fact that protolytic peptides exhibit a wide range of physico-chemical properties (size, charge, hydrophocity...), the relationship between the amount of proteins and the signal intensities is complex. Therefore mass spectrometry is not inherently quantitative, when seen as a tool for absolute quantitation. Therefore, relative quantitation is preferred, where peptides are compared between experiments data points. This can be achieved in a numbers of ways. Thus, high throughput assessment of change in protein expression is usually performed by stable isotope labelling of peptides and proteins either metabolically,

enzymatically (16_0, 18_{0 incorporation by proteolysis) or chemically using external reagents (Figure}

(8)

Figure 1. Overview of quantification in Proteomics, [41]

Metabolic Labelling involves in-vivo incorporation of the stable isotope during cell growth and division. One of the most widely used approaches is the SILAC (Stable Isotope Labelling Amino acid in Cell culture) approach, which was introduced by Mann and co-workers in 2002 [33]. In these methods, the heavy amino acid is incorporated during the protein synthesis. The main advantage is that no multiple steps in the labelling protocol are needed and the experimental error does not affect the ratio [34] as shown in However, this approach is almost exclusively applicable to cell

An other approach is chemical labelling. For example, several papers studied fluids basing on isobaric tag, iTRAQ [4] or TMT [7]. The principle of chemical tagging rests on the reactivity of the N-ter and side chains of lysine (iTRAQ, TMT) and cysteine (ICAT).

ICAT (Isotope Coded Affinity Tag) is the most known of the Cys tagging strategies. Chemical tags were designed to simultaneously allow the enrichment of a subfration (Cys peptides) from a complex mixture of proteins and the quantification of the selected peptides at

the precursor ions level. Gygi et al. [35] developed this approach in which cysteine residues are

specifically derivatized with a reagent containing either zero or eight deuterium atoms as well as a biotin group for affinity purification of cysteine-derivatized peptides and subsequent MS analysis. Modified versions of ICAT are emerged to solve problems of elution (cICAT) and fragment loss

(9)

Figure 2. Common quantitative mass spectrometry workflows, Boxes in blue and yellow represent two experimental conditions.

Horizontal lines indicate when samples are combined. Dashed lines indicate points at which experimental variation and thus

quantification errors can occur [30].

Group of labelling reagents which targets the peptide N-terminus and the epsilon-amino group of lysine residues are the most sensitive. Most of the time, this is realized via the very specific N-hydroxysuccinimide (NHS) chemistry or other active esters and acid anhydride as in, e.g., the isotope coded protein label (ICPL), isotope tags for relative and absolute quantification (iTRAQ) [19], tandem mass tags (TMT) [8], ...

ICPL, ICAT methods and most of the aforementioned chemical modification techniques, relative quantification is achieved by integration of MS signal over ‘heavy’ and ’light’ labels. TMT and iTRAQ introduce the concept of the isobaric tag. Isobaric tags labelled peptides co-migrate in liquid chromatography separations. The different tag can be distinguished by the mass spectrometer only upon peptide fragmentation. This permits the simultaneous determination of both identity and relative abundance of peptide in tandem-mass spectra. ITRAQ and TMT are described more in details below and some examples of application are provided and summarised (Figure 5).

(10)

2.2.

Isobaric Tagging

iTRAQ

ITRAQ reagents are amine specific stable isotope labels. Up to eight biological samples can be labelled simultaneously. Structure of the reagent is supplied in It consists of three groups: the reporter group and the balance group which form the isotobaric tag (145 Da) and the PRG (Peptide Reactive Group which reacts with the peptide primary amine group). The reporter group contains the charge, and gives strong signature ions in MS/MS. The balance group changes according to the reporter group. It undergoes a neutral loss during MS/MS. The basic iTRAQ experimental workflow is displayed in Figure 4. To begin, proteins from one sample are digested using trypsin. As digestion results, N-ter pepides are ready to be derivatized with the sample-specific reagent via an acetylation reaction. Thanks to the isobaric nature of these reagents, one peak is obtained in MS that greatly simplify the MS spectrum. After CID, the balance group is loss, which leads to one peak for each reporter ion (Figure 4) in the region of low mass of the MS/MS spectrum (Figure 5).

(11)

(12)

iTRAQ reagents allow multiplexed quantification of up to eight samples (cell, tissue,

serum). Moreover, it permits PTM analysis. Aggarwal et al. show that the reagent does not

interfere negatively with the fragmentation to the extent that peptides length and amino acid content are similar to those obtained using other MS approaches [36]. Furthermore, iTRAQ is a

highly sensitive approach. Wu et al. demonstrated that iTRAQ covers a large part of the E.coli

proteomes. It helps to identify proteins across extreme pI and MW, it detects a great number of fragment peptides per protein and low abundance proteins are more often discerned. This high sensitivity can be explained by two factors. First, iTRAQ is a global tagging reagent on all primary amine, contrary to ICAT (labels only cysteine). The second one is related the reactivity of lysine which leads to a stronger signal in MALDI-MS. Quantification relies on daughter ions generated during CID. However, a potential pitfall inherent to the Timed Ion Selector (TIS) resolution of the MALDI-TOF/TOF may affect quantification accuracy. Quantification relies on daughter ions generated during CID. TIS allows precursor ion and his fragment to pass through a gate in order to reach the detector and contribute in this way to the quantification ratio [32].A second limitation is that experimental variations can occur during tryptic digestion. In iTRAQ workflow (Figure 4), digestion is prior to the labelling and the mix, contrary to ICAT where the labelling is prior to mix and digestion. This may introduce a potential source of error, especially in sample handling and variable degrees of tryptic digestion between two or more samples [32].

Table 1. Advantages/Disadvantage of iTRAQ reagents

Advantages Disadvantage

– Parallel proteomics: 8-plexing.

– Analyze proteins from cell, tissues or serum.

– PTM analysis.

– _{iTRAQ reagent don't interfere negatively}

with fragmentation.

– _{High sensitivity.}

– _Mass _spectrometer _interference _could

hinder iTRAQ reliability.

– _{iTRAQ labelling after the tryptic digestion.}

TMT

Tandem Mass Tag uses exactly the same approach as iTRAQ. TMT tags are however heavier (TMT 6-plex: 126 to 131 Da).

(13)

2.3.

Experimental Application

The isobaric labelling approaches have been successfully applied to a variety of experiments and to various samples (prokaryotic and eukaryotic samples including Escherichia coli, yeast, human saliva, human fibroblasts and mammary epithelial cells...) [34,36].

The iTRAQ approach can be applied for various purposes. For example, most of time

when searching to characterize proteins from a specific signalling pathway, parallel proteomics approaches such as iTRAQ 4-plex or TMT 6-plex are commonly used in order to obtain time

courses profiles. Schmelzle et al. used iTRAQ to label 4 samples of adipocytes stimulated for

insulin at different time. Then, time course profiles have been plotted and proteins displaying the same behaviour in the same fashion are clustered. Unknown protein Glu-4 functionality was

discovered in this manner [20]. Another study carried out by Zhang et al. used iTRAQ for the

same purpose. Profiles were made and clustered using Spotfiretm _{and the methods of SOM matrix}

[27]. Biomarkers can also be discovered using isobaric tags. Dayon et al and Choe et al analyzed CSF

samples, using respectively TMT 6-plex [5] and iTRAQ 4-plex [7]. Moreover, Desouza et al.

identified five potential markers for endometrial cancer with iTRAQ reagents and a set of four

proteins using ICAT [9]. Cong et al. compared proteomes of human fibroblasts in four different

biological states: replicatively senescent (under permanent growth arrest), stress-induced prematurely senescent, quiescent and young replicating, to identify the signature proteins of each biological state [6]. Figure 5 summarizes the purpose of quantitative proteomics using the isobaric tags.

(14)

2.4.

Experimental Design: Principle of Replicate Analysis

All these experiences can be declined in replicates [37]. Indeed, random variation when working with isobaric tags has many origins. The source of variation is a function of time, manpower, instrument, subject, subject condition, preparation process, etc ... By definition, the variation is a measure of the spread around the expected value. This can be measured in three

different forms: experimental, technical and biological variations. Typically, experimental

replicates are the actual iTRAQ replicates. Two or more experimental iTRAQ sets serves to

label the same samples. Technical replicates are used to assess of the consistency of a measure

over repeated test of sample from a same biological source under identical conditions. It eliminates errors from sample preparation and it is very important to establish the significance of

the protein expression (ANOVA, t-test, LPE test). Biological replicates are used to estimate

the random biological variability associated with the test subject, by repeating the creation of the test subject under the same conditions (Figure 6).

Figure 6. Schematic views of the relationship between technical, experimental and biological replicates in iTRAQ experiments.

(15)

3.

Methods: Study Quantification Workflow

3.1.

Introduction

An important part of my training course was a prospective work. How is the quantification performed in proteomics? What are the tools? What are the best existing tools? What is the difference between them?

Analysing manually a large dataset of MS identification results is time-consuming and not precise (some methods are impossible or difficult to perform manually such as outliers’ detection, quantile-quantile plot...). However, manual analysis is still widely used. More precise results can be obtained by computer-based data treatment. Thus, several Quantification tools (Q-tools) exist with their own properties (Table 2). As shown in Figure 7, the Applied Bioscience software,

ProQuanttm, remains the most used Q-tool in the scientific community.

In this part, I tested three tools to familiarize myself with tandem MS quantification;

Mascottm_{'s Q-tool, the iTRAQ-specific Q-tool of the Trans-Proteomics Pipeline (TPP) and}

i-Trackertm developed at the Cambridge University by Shadford et al [21]. Data from isobaric tag

labelled samples was difficult to find. To overcome this problem, I utilized the on-line database, Peptide Atlas, which is closely linked to the TPP [23], as well as collaboration with the BPRG, which permitted me to obtain additional data (a description of it can be found in 3.2 materials).

I tried to present the tested tools, to highlight their advantages and their limitations, to assess the quality of the quantification results by comparing the protein ratios and finally to determine a reliable quantification workflow.

Figure 7. Pie chart of the number of publication by quantification tools. 14 publications have been read. Scientists still prefer to

(16)

Table 2. Summary of software available for quantification in isobaric-tag-based reagents.

Source Com m ent links

commercial [23] commercial [9] [5] [20] Quant [21] ISB [11] [21] commercial

Tool Nam e Acad /

Commercial Environm ent Re f

ProteinPilot Applied Biosystems w indow s easily distinguish protein isof orms, protein subsets, and suppress f alse positives; and visualize peptide-protein associations and

https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=catNavigate2&catID=600908

ProQuant Applied Biosystems w indow s simultaneously quantitate and identif y iTRAQ™ reagent-labeled peptides f rom MS/MS spectra.

https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=catNavigate2&catID=600908

Pride Wizard Manchester Centre f or integrative systems biology

academic w indow s Submission of mass spectrometry data and Mascot identifications to generate a valid PRIDE XML f ile. It also includes the f acility to add iTRAQ labels, allow ing quantitation data to be added to the PRIDE XML.

[22, 28]

http://w w w .mcisb.org/resources/PrideWizard/index.html

Multi-Q Institute of Information Science and Institute of Chemistry, Academia Sinica, Taiw an

academic w eb server iTRAQ quantitation perf ormed in a mascot w ay. http://ms.iis.sinica.edu.tw /Multi-Q-Web/

MFPaQ Institut de Pharmacologie et de Biologie Structurale, Toulouse, France

academic w indow s Takes mascot (DAT) result f iles as input for parsing and Analyst Wiff f iles f or quantif ication.

http://mfpaq.sourcef orge.net/

University of Wurzburg, German

academic linux/w indow s Off ers data and results visualization (boxplot, error plot), error estimation,

http://sourcef orge.net/project/screenshots.php?group_id=109078

Libra academic linux/w indow s

(cygw in)

Quantification m odule of the TPP. (cf below for m ore inform ation)

http://s ashim i.svn.sourceforge.net/view vc/ sashim i/trunk/trans_proteom ic_pipeline/sr c/Quantitation/Libra/docs/libra_info.htm l i-Track er Cranfield University,UK academic linux/w indow s (cf below for m ore inform ation) http://sourceforge.net/projects/itracker/?abm ode=1

Mascot-Quanti m atrix scie nce linux/w indow s (cf below for m ore inform ation) w w w .m atrixscience.com /

3.2.

Experimental samples

ABRF Data.A sample from the ABRF 2006 study was used. It contains eight proteins spiked in ratio ranging from 1:1 to 1:76 marked with iTRAQ 4-plex (114.1 to 117.1). Theoretical ratio can be found in Appendix 1. Peptides are identified by tandem mass spectrometry analysis with MALDI TOF-TOF (ABI 4700).

Whitehead Data. Proteomic analysis was conducted at three time points (30, 40 and 60 min) for

both control and γ irradiated cultures of Halobacterium salinarium strain NRC-1. Relative

quantification was achieved using shotgun isobaric tagging with iTRAQ reagents (Applied Biosystems, Foster City, CA). Quantification is achieved upon tandem MS, which fragments the iTRAQ reagents unevenly to release daughter products of differing mass (m/z 114, 115, 116 and 117). For direct comparison across multiple runs a common reference sample derivatized with the 114 mass tag was included in each four-plex experiment. (More information is provided by

Whitehead et al. [26]). This data were obtained from Peptide Atlas

(17)

3.3.

Tested Software Presentation

I-Tracker

Three tools were compared; i-Tracker developed by Shadford & al.[21], the iTRAQ

quantification tools packaged with the TTP pipeline [22], Libra and the quantification module of the 2.2 version of the Mascot sofware.

The main goal of the i-Tracker software is to calculate ratios from non-centroid MSMS peaks lists in a format linked to the results of protein identification tools i.e. Mascot and SEQUEST. The i-Tracker process is detailed in Figure 8. The user can define an arbitrary intensity threshold as an unique filter. Such a threshold can lead to the loss of quantifiable peptides. Moreover, the purity correction coefficients can be entered (the iTRAQ reagents are not completely pure and manufacturers therefore provide a correction factor to avoid peaks overlapping (cf Appendix 2 table 2)). Finally, the user can enter an ion tolerance to collect the reporter peaks areas. In its results, in addition to peptide ratios, i-Tracker displays a table of quantization errors for each ratio. This error provides an interesting indication of the confidence we can give to a ratio, especially for ones calculated from very low abundance ions. Advantages and limitations of i-Tracker can be found in Table 3. I-Tracker is limited to quantification at the peptide level. The i-Tracker outputs a csv-formatted result files. Thus, Excel macros or parsing functions are easily applicable. A second disadvantage is that only samples labelled with iTRAQ 4-plex reagents can be analysed. The main advantage of i-Tracker is the quantisation error, which gives an indication of the confidence to give in a ratio of two peaks of low abundances.

INPUT : iTracker Algorithm : OUTPUT :

- Non Centroid MS spectra ( .mgf, .dta) - Ion intensity threshold - Purity correction - Reporter Peak Range

- Reporter ion peak collection - Reporter ion area calculation - Purity correction

- Peak normalisation (sum of all reporter intensities)

- Under threshold checking - Ratio calculation

- Quantisation error calculation

- Relative errors of each reporter ion - Indicative Errors - Ratio for each reporter ion

(18)

Table 3. Summary table of the advantages and limitations of i-Tracker.

Advantages Limitations

– Algorithm & source code for i-Tracker are freely available

– Relative error, indicating the confidence to give to a ratio especially for low peaks ions

– Linked to other protein identification software

– Specific to iTRAQ 4-plex, don't care about the others isobaric tag methods (iTRAQ 8-plex, TMT....)

(19)

Figure 9. Screen shot of i-Tracker OUTPUT. A. i-Tracker outputs a .csv files containing the reporter-normalised area Norm and

UT? which reports a flag UT if the ion peaks area is under the user-entered threshold. B. i-Tracker outputs a .csv files containing

the peptide ratios, with a matrix containing the quantisation errors. A.

(20)

TPP- Libra

Libra is the iTRAQ quantification module of the Trans-Proteomic Pipeline. The TPP is a collection of tools (Figure 10) for MS-based proteomics developed at the Seattle Proteome Center (SPC). It contains software for the quantification (Libra for iTRAQ, XPRESS for ICAT...), converter (search engine format to pepXML...), validation and probability assignment

(ProteinProphet, PeptideProphet

http://tools.proteomecenter.org/wiki/index.php?title=Software:Overview).

Libra is part of the TPP pipeline and therefore relies on it for all quantification

pre-processing steps (Figure 11). The Libra input data consists in (1)_{a database related to analyzed data}

placed in /dbase/, (2)_{a pepXML file,}(3)_{a mzXML file and}(4)_{a condition file.}

The most suitable database was found at

http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/cgi-bin/gen_list.cgi?genome=hb. The pepXML is the standard

format for representing identification results. It stores information concerning PeptideProphet validation and quantification, and it references the mzXML file. The latter is the bedrock of the TPP. It is an open data format for storage and exchange of mass spectroscopy data, developed at the SPC/Institute for Systems Biology. It provides a standard container for MS and MS/MS proteomics data. Several converters are available to convert raw files (proprietary file formats

from the most of vendors) to mzXML format. For example, the T2DExtractortm_{to convert raw}

files from ABI instruments developed at the University of Michigan or Wolftm,which converts

MassLynx native acquisition files. Most of the time, converters must be run in the computer where the data acquisition instrument's software is installed. The last required file is an XML

configuration file supplied by the condition.xml generator

(http://db.systemsbiology.net/webapps/conditionFileApp/). It contains all the requisite parameters to parse the mzXML file in order to extract reporter peaks intensities: the reagent M/Z values (for iTRAQ 4-plex: 114.1 to 117.1), mass tolerance, isotopic correction coefficients (provided by applied), a method of centroiding, a method of normalization (normalization against sum of intensities, against the most intense peaks...) and a minimum intensity threshold.

(21)

Figure 11. Scheme of the quantification within the TPP. XPRESS, ASAP and Libra are the three quantification modules within

the TPP.

Libra performs the protein quantification in a simple way. Usually, protein quantification is derived from the group of peptides associated with the protein. As summarized in the Figure 12, Libra applies a normalisation based on the sum of the reporter intensities. Then, the normalised intensities are averaged over all peptides of a protein. Normalized values of intensities differing by more than two sigma from the mean are considered as outliers and removed. The average of the reporter intensities of the protein is recalculated and the 1-sigma standard errors are calculated using the standard deviation. The software gives in output the average values of the reporters or the ratio if a reporter has been set in denominator (Figure 13).

To resume, Libra employs a simple but accurate method to compute the protein ratio. Owing to converters packaged within the TPP (i.e. out2XML), Libra handles identification results from many search-engines (Mascot, SEQUEST...). Moreover, it can send (via protXML format) the quantification results through various post-quantification tools (Figure 11) such as Cytoscape, SBEAMS... The algorithm can also quantify samples labelled by the iTRAQ 8-plex reagents. However, other isobaric tags are not taken into account for the moment. Moreover, this tool requires the installation of the whole TPP and it is not very user-friendly since all the TPP software only have command line interface (Table 4).

(22)

INPUT : Libra Algorithm : OUTPUT : – PepXML, – MzXML, – m/z tolerance * – Ion intensity treshold * – Purity correction * – Methods of centroiding * – Methods of normalisation. *

– Reporter ion peak collection

– Apply purity correction

– Peak normalisation,

– Means of each reporter channel

– Outliers Removal

– Ratio calculation

– Protein Ratio

– Standard error

Figure 12. Scheme of Libra process. (*) Contained in a configuration.xml file.

Table 4. Summary table of the advantages and limitations of Libra.

– Simple but precise way to quantify

– Quantify protein identified by various search engines

– TPP pipeline allows an easy link with post Quantification tools (Cytoscape, SBAMS...)

– Can be extended to iTRAQ 8-plex

– Maybe too many options displayed in the interface

– Only command line interface (converters, Libra...)

– Specific to iTRAQ, no other isobaric tag can be quantified

(23)

Figure 13. Screenshot of the protXMLviewer shown protXML files which contain the results of the quantification. Means and SD

(24)

Mascot quantification module

Mascot 2.2 includes a quantification module that computes ratios of identified proteins. This module covers most of the quantification methods i.e. reporter-based (iTRAQ, TMT), precursor-based (ICAT, SILAC, Absolute Quantification...) and Label Free. All these methods

are classified in protocols. Reporter protocol takes into account samples labelled with the most of the

isobaric tag (iTRAQ, TMT, ExacTag) except the AMT. All of the information required for the isobaric quantification is contained in the peak list, which is needed in input. Reagents and the MS/MS tolerance can be set in the interface. All the other parameters are set in the XML configuration file. The configuration.xml file encapsulates all users’ parameters. There are many

different parameters, split into groups. For example, the group Methods contains the methods

used to calculate the ratio and the significance level used for the statistical test (default 0.05). The

group Component contains the information concerning each reporter (average and mono-isotopic

mass, values of impurities correction...). In Ratio users can define each ratio they want to display

(numerator, denominator). The group Quality contains the filter parameters, on intensity, on

score, on expect value. Outlier and Normalization include specific methods. Mascot implements

Grubbs, Rosner, Dixon detection methods, and three types of are available; summing intensities, median and geometric mean. Finally, Mascot performs the quantification following protein identification. It displays a summary box containing all the protein ratios (Figure 15A). A detailed view is also provided for each protein (Figure 15B). Each peptide ratio is given and the protein ratio value is displayed in a box coupled with a measure of spread, generally a geometric standard deviation (Figure 15B). Figure 14 summarizes the Mascot Process.

Table 6 shows a comparison between the theoretical ratios of the proteins of the ABRF dataset and the ratios found in Mascot. Parameters are summarised in the Figure 1 of Appendix 3. Mascot fails to quantify three proteins. Carbonic anhydrase was not identified whereas beta-casein and ribonuclease was identified but not quantified. The quantification cannot be

performed due to the outlier’s removal option. When the option is set to none, these two proteins

become quantifiable.

Due to its number of parameters and its very user-friendly interface, Mascot seems to be a very complete tool for quantification (Table 5). However, importing identification results from other search-engine is impossible.

(25)

INPUT : Mascot Algorithm : OUTPUT : – Peak lists – m/z tolerance * – Significance threshold – Impurities correction – Reporter mass – Threshold on peptide score – Threshold on peptide maximum expect – Ion intensity treshold – Methods of integration – Methods of normalization – Methods of outliers removal – Set numerator, denominator

– Reporter ion peaks collection

– Apply filters

– Peaks normalization

– Peptide ratio calculation

– Outliers Removal – Ratio calculation – Significance changes – Quantification summary Box – Individual proteins: peptide ratios, summary box – Indication of the protein change

Figure 14. Scheme of Mascot process. (*) Settable in the interface, all of the other parameters are contained in the

configuration.xml file.

Table 5. Summary Table of the advantages and limitations of Mascot.

– Correction of sample variability

– Strong outliers detection test

– Significance change

– Covers almost all of the isobaric tag-based quantification

– Clear interface

– Coupled to Mascot identification

– Display only the essentials parameters

– Bias in the ratio to the impurities correction

– Only compatible with Mascot identification results

(26)

Figure 15. Screenshots of Mascot result page, A/ Summary box of the quantified protein. B/ Protein summary box containing

the type of methods to averaged peptides ratio, the number of peptides and the geometric SD. Below this box, the details of ratio at the peptide level.

Table 6. Comparison of the theoretical protein ratio of ABRF sample and the Mascot ones.

Protein Name AC Mascot Ratio

Beta casein P00266 1:4 NQ* P24270 1:5 0.4 P00489 76:1 7.58 / 3:1 NI* Peroxidase horseradish P 6782 P04333 1:1 1.51 P61823 1:1 NQ* Theoric Ratio

Catalase bovine liver C 1345 Glycogen phosphorylase rabbit Carbonic anhydrase I

(27)

3.4.

Software comparison

Comparison of the different tools can be made at several levels; at the level of the quantification results and at the level of the algorithm.

Comparison of the Quantification Results

In a first step, quantification results have been compared. ABRF data have been used in order to compare peptide ratios found in i-Tracker and Mascot. Because i-Tracker handles non-centroid data, the peak list file has been modified in order to add some peaks for each iTRAQ reporter (in the reporter m/z tolerance interval). Three peptides matches assigned to protein hits (observed mass: 1422.6224, 1499.7448, 1433.6318) are used in input to i-Tracker and then in Mascot. Parameters used in i-Tracker and Mascot are shown in Figure 1 and Figure 2 of Appendix 3. Results provided in Table 7 show two peptides differentially expressed. The variation can be explained by the two methods of integration, area calculation and sum of reporter peak intensities.

Libra and Mascot results have been compared using the Whitehead data [26]. Since no protein was spiked in this experiment we cannot conclude about which tools give the best result. However, protein ratios seem to follow the same profile for both tools (Table 8).

Table 7. Result comparison table for three peptide matches

1422.6224 6.79 9.61

1499.7448 6.20 7.20

1433.6318 0.49 /

Peptide m/z i-Tracker ratio 117/114

Mascot ratio 117/114

Table 8. Result comparison table Libra VS Mascot

Libra Mascot Protein Name 115 116 117 115 116 117 0.74+/-.01 0.92+/-.01 0.82+/-.01 0.79 0.9 0.86 0.69+/-.01 0.77+/-.01 0.71+/-.01 0.89 0.91 0.88 0.64+/-.03 0.72+/-.03 0.66+/-02 0.55 0.6 0.63 0.96+/-.01 1.15+/-.01 1.04+/-01 1.13 1.15 1.12 1.64+/-99.99* 2.18+/-99.99* 2.07+/-99.99* 1.79 1.65 1.75 Diphosphomevalonate decarboxylase Adenylosuccinate synthetase Proteasome-activating nucleotidase 2 Fumarate hydratase

(28)

A quantitative analysis can be subdivided in 5 cardinal steps. A pre-processing step in other words by which methods of integration the reporters are extracted (summing intensities of the profile or calculating the area under the profile curves) is performed. This is followed by a filtering step. We can imagine various filters; the most observed ones are a threshold on intensities, on score, and on p-value (“expect”). The normalization step is generally applied to correct systematic biases or to avoid giving too much weight to one reporter. This step is followed by the outliers removal step. Finally, the protein ratio can be estimated. The quantification workflow of each tool is compared for each of these steps.

The three tested tools have their own ways to achieve the quantification. The i-Tracker makes the quantification at the peptide level. The user must therefore manually calculate protein ratios.

INPUT/OUTPUT

First, we compare the INPUT file formats. Mascot quantification is clearly paired with the Mascot identification, in the extent that quantification of proteins which are identified using other identification software are impossible. On the contrary, i-Tracker permits the importing of SEQUEST and Mascot results, and Libra can handle many types of identification tools results (Figure 10) due to the availability of tools such as Out2XML and Mascot2XML, which convert, respectively, .out file format from SEQUEST and .dat file format from Mascot in pepXML format. In addition to pepXML file, Libra inputs RAW files converted to mzXML format (for the conversion of raw to mzXML, converters need access to the computer where the instrument-specific software for data acquisition is installed). In OUTPUT, Mascot releases a .dat file. Libra encapsulates its results in a protXML format file, which is read with the protXML viewer tool and exported to post-quantification tools (Cystoscape, SpotFire etc…). I-Tracker chooses to produce two types of .csv file. Output style 1 is designed to be human-readable when imported into programs such as MS Excel as a comma-separated variable file. It is strictly ordered so automated parsing is also straightforward. Output style 2 is designed to allow very easy basic analysis within programs such as MS Excel. All information is outputted on a one row per spectrum basis and thus all human-readability is lost, but to the gain of being able to run functions and macros more easily.

Reporter Peaks Collection

There are two ways to collect peaks. Mascot allows the choice between both via the configuration editor. Libra employs the sum of intensities of the peaks profiles whereas i-Tracker used the trapezoid approximation for calculating the area under a curve.

Filters

After reporter ion peaks collection comes the filtering process. Many filters can be applied. All of them may involve data loss, especially when choosing a threshold on intensity,

(29)

Type of Quantification Workflow

We can now talk about the quantification workflow of each tool. Libra and Mascot have two ways to compute the protein ratio. Mascot computes an average of peptides ratios whereas Libra computes a ratio of averaged peptide reporter intensities.

Outliers Removal

In a quantification workflow, outliers’ removal (as normalisation) is a crucial step that is always in the quantification workflow although their implementation varies among the Q-tools. Mascot implements three methods for outlier removal. Dixon's r11 test, also referred to as N9, is used to detect and remove a single outlier at a time from either the upper or lower extreme of the range [27, 28]. It is applicable to values between 4 and 100. For a greater number of values, Rosner's test is applied [24]. Grubbs detection is applicable for values between 3 and 100 [26, 25]. Libra decides that a value of intensity is an outlier if it is outside of the range:

]µ - 2 * σ, µ + 2 * σ, [ (1)

Where µ is the means and σ is the standard deviation.

Care must be taken when outliers are blindly removed. A rigorous analysis would be to compare data with or without outliers to see to what extend the conclusions are qualitatively different.

Normalisation

A second important thing when working with large biological dataset is normalisation. Normalisation always takes place at peptide level. To remove variability of the sample incorporated during the experimental procedure, and based on the assumption that differential proteins in a biological sample are in a minority, Mascot proposes two methods to normalise the data so as to make the average of the whole population ratios across the entire data set equal to one. This can be done via the median or the geometric mean. Another method is the sum; totals intensities for each reporter across the entire data set are made equal. In contrast, for Libra and i-Tracker, no variability correction is implemented. However, in order to not give too much confidence in one reporter, both tools normalise on the sum of all reporter intensities.

Determination of protein abundance and measure of spread

Libra implements a simple ratio of the reporter means and displays a standard error calculated from the numerator standard deviation. Due to the fact that Mascot treats the peptide ratios, it allows several methods for averaging them; median, geometric mean, and weighted mean associated with a geometric standard deviation. I-Tracker does not provide the ratio at the protein level.

(30)

Bonus

Several functionalities are software specific. I-Tracker for example is the only tool that displays a quantisation error. This value can serve as warning against placing too high confidence on reported ratios when these have been based on peaks with low ion counts [21].

Err(1,2) = (100 * ((0.5 / Peak1Max) + (0.5 / Peak2Max)) (2)

Mascot provides an interesting and robust indication of the relative protein fold change. It

employs a one-sample t-test. The null hypothesis H0_{is that the estimation of the ratios}

x is equal

to one. If H0 is rejected, x is significantly different from 1. The protein ratio is reported in bold.

3.5.

Discussion

Software comparison reveals that some steps are always found. Filters, outliers removal and estimation of the ratio are essentials in a quantitative analysis.

Each tool has its own way to estimate the ratio. Although they display a good estimation of the relative protein abundance, some of them are not user-friendly (i-Tracker, Libra). Moreover, they don’t take into account the experimental design of the experiment (i.e. replicates analysis, time course...). Finally, no mean to visualize the protein expression is implemented.

(31)

4.

Establish Quantification WF

4.1.

Introduction

Based on the previous study, three methods of quantification were implemented using the R language. This programming language is very well appropriate when analysing large amount of data for quantification to the extent that it offers an environment for statistical programming and In fact, descriptive statistics methods and inference tests are all ready when installing R. There are many additional packages that are easy to use and well documented. Moreover, R is able to output graphs and charts and provides therefore an attractive way to represent quantitative results.

Librus, an intensity-based method similar to Libra, and Mascat, a peptide ratio based method that resembles Mascot were first implemented. Furthermore, a novel quantification algorithm, named QI (Quantification Isobaric), was developed.

QI workflow is a least-square regression-based workflow. According to Bantscheff et al.

[30], linear regression could be a good alternative to the filter on intensity and the inherent data loss that such a filter involves. In fact, making the difference between (1) a weak intensity from a low abundance peptide and (2) a weak intensity from the background noise is difficult. A least-square regression line is a straight line that passes through the data so that the sum of the least-square of the vertical distance data points from the line is as small as possible. So, the advantages are double. First of all, we avoid useless loss of peptide matches by applying an arbitrary threshold on intensity, and secondly, we obtain an easy method in order to visualise the protein

quantification. The idea of this method is to create a linear model and coerce the regression line

to pass through the origin. The slope of the line is an estimation of the protein ratio. The R-squared gives an indication of the data spread (Figure 20). Then, strong methods of outliers and influential detection are used. These methods are part of regression diagnostics.

As highlighted in part 3.5, several statistical steps are common to all quantification approaches. These steps are shown in dark blue in the Figure 16 and detailed in the next part.

The validation of the algorithms was effected by using two sets of spiked proteins. The ratios of the proteins contained in the ABRF sample (described in 3.2) were compared between the three methods. Then a dataset (provided by Loïc Dayon) obtained from samples containing 4 spiked proteins [7] was used to measure the root mean square deviation (RMSD) between the expected theoretical ratio and the obtained ratios for each quantification algorithm.

(32)

Figure 16. Steps of the three approaches implemented; steps in common among Librus, Mascat and QI are shown in dark blue.

4.2.

Description

Signal Extraction

The data were pre-processed using the LabelMS2Extractor (LMS2E), a tool designed to extract the intensities of peptides labelled with isobaric tags. The INPUT data consist of (1) the Phenyx identification result (pidres.xml), (2) the path of the configuration file that contains information about the mass of the reagents, and the values for impurities correction and the mass tolerance. Reporters are integrated using the method of sum of intensities along the range defined by the reagent mass and the user-entered mass tolerance. Missing values of reporter intensities are replaced by 0. Another way to compute reporter signals is to integrate the area under the reagent profile curve. The former method of integration is however more accurate and can handle cases where only one peak is detected in the reporter mass range [2]. The OUTPUT is a

(33)

Filters

In all implemented quantification methods, the first step consists of the filtering of the LMS2E results table. Filters can act at peptide level or at protein level. Various filters can be imagined. At protein level, a threshold on the number of peptide matches is implemented. The greater the number of peptides per proteins, the more accurate is the ratio. At peptides level, several filters are implemented; a threshold on the minimum intensity value, on the minimum score, the maximum p-value, and a filter on proteotypic peptides i.e. when a peptide matches for several proteins, the peptide is removed. Thus, only unique peptides are taken into consideration when performing the protein quantification.

Normalisation/Correction step

The second crucial step is the normalisation of the dataset. Librus implements the normalisation of the peptide intensity, as Libra does. Mascat implements the three normalisation methods of the Mascot quantification module. In fact, the Librus normalisation permits one to not give too much confidence to one reporter whereas the Mascat normalisation corrects the systematic bias that occurs during the sample preparation.

Outliers removal

In Librus, if a normalized value of intensity is more than 2σ from the mean of the signals of one reporter, it is considered as an outlier and removed. Mascot uses specific outliers detection tests (Grubbs, Rosner, Dixon). (For more details see part 3.4). QI bases its workflow on strong regression diagnostic methods.

The DFFITS and studentized residuals are important techniques to the detection of outliers and influential points in a regression analysis. The DFFITS of an observation is a measure of the influence of this observation on its own predicted value. The studentized (standardized) residuals are adjusted by dividing them by an estimate of their standard deviation. The Table 9 shows the number of peptides remaining after outliers removal. Due to its tolerance to a large range of values, the Grubbs detection test has been chosen for Mascot. For Libra, the

methods described above are used. For QI, if a DFFITS value is farther than m* standard

deviation σ from the average of the DFFITS, the peptide is considered as an outlier and

removed. We use the factor m equal to 2, 2.5 and 3. Authors recommend 2 [31] but the number

of lost peptides becomes important (Table 9). As a reliable alternative, I propose a factor m equal to 2.5. The number of peptides per protein is greater. Moreover, the resulting ratios are closer to the theoretical ratios (Table 10). The Figure 17 shows the effect of such a method. Regression lines have been plotted before and after outliers removal (m = 2).

(34)

Table 9. Comparison of the number of remaining peptides for each methods of outliers removal. AC # peptides initial "P00432" 10 9 10 7 8 10 "P00433_CHAIN0" 5 5 5 5 5 5 "P00489" 22 21 21 13 15 19 "P02769_WOSIG0" 26 24 24 9 22 23 "P80025_CHAIN0" 17 17 17 9 14 17 # peptides Grubbs # pe ptide s +/- 2σ # peptides DFFITS ( m = 2 ) # peptides DFFITS ( m = 2.5 ) # peptides DFFITS ( m = 3 )

Table 10. Quantification results comparison for each workflow.

AC Theoretical Ratio "P00432" 1:5 0.5* 0.47* 0.53 0.356 * 0.36 "P00433_CHAIN0" 1:1 1.14 1.09 1.09 1.09 1.09 "P00489" 76:1 6.51* 5.27* 4.84 4.13 * 3.86 “P02769_WOSIG0” 1:1 1.13 1.11 1.18 1.09 * 1.1 "P80025_CHAIN0" 1:1 1,22 1.21 1.1 1.26 * 1.26

(35)

Figure 17. Linear regression for four proteins, equations of the regression line and regression coefficient are provided, A/ before

outliers removal, B/ after outliers removal. The outliers removal step improves the regression coefficient by diminishing the spread.

A.

(36)

Protein ratio calculation and measure of spread

Libra performs a ratio of averaged intensities and gives a standard error. Mascat averages the peptide ratios by some descriptive statistics methods (median and MAD, geometric mean and geometric SD, weighted mean and weighted SD…). Being based on a regression model, QI gives the ratio by least square estimation.

OUTPUT and Visualization

The three implemented approaches provide csv sheets and graphs to visualize the quantification. Each workflow has its own presentation of the results and specific graphs.

Thus, Mascat displays a summary table containing the ratio and an indication of the protein fold change (Figure 18A), and a protein box containing all quantitative information of one protein is provided by the R package (Figure 18B). In addition to histogram and density distribution of the peptides ratio, boxplots are used to display the peptide ratio distribution for each quantified protein (Figure 18C). A boxplot is a graph of statistical summary, with the outliers plotted individually. The quartiles are spammed by the central box (first and third quartiles) and the median is displayed by the central line (bold). Observations plotted outside of the range 1.5*IQR are suspected as possible outliers and the whiskers show the largest and smallest peptide ratios that are not considered as outliers.

Librus provides a matrix of result (Figure 19A), and a protein summary box that contains all information relative to one protein (Figure 19B).

(37)

Figure 18. Mascat output, csv files and boxplots A. summary box, contains proteins for the whole run, B. protein box displays

information for one protein; ratios are calculated with the median and the MAD, Median Absolute Deviation. C. Standard boxplot of peptide ratios. Bold horizontal line is the median; outliers are visible as isolated points.

A.

(38)

Figure 19. Librus output, A. all ratios for one protein are displayed in a matrix of ratios. B. individual information is summarized

in the protein box (mean of normalised reporter intensities, SD, SE and the number of peptide matches). A.

(39)

QI workflow proposes individual protein boxes containing the ratios values calculated from the slope of the regression lines (Figure 20), and quantitative scatter plots of the peptides intensities numerator VS denominator with the fitted line (Figure 17A, 17B). A standard scatter plot displays the relationship between two quantitative variables. The x and y axis represent the denominator and nominator intensity values, respectively.

A scatterplot provides information about the form, the direction and the strength of a relationship. In case of quantification by regression analysis, some of this information will be constant. The relationship is linear and peptide intensity values always show a straight-line pattern. The direction of the relationship is always unambiguous and the slope of the fitted line (that corresponds to the ratio) must be positive. It means that, in terms of regression, the association is positive. The strength of the relationship between the two variables is given by the coefficient of correlation. The reported standard error gives an estimation of the standard deviation of the slope. It is inversely proportional to the reliability of the computed ratio. Finally, the R-squared is provided. It gives a better feeling of the strength of the association between the points and the regression line than the correlation coefficient. In fact it can be interpreted as a measure of the spread. The closer the regression coefficient is to one, the better the regression line describes the data, and the better the slope is an estimation of the ratio.

Figure 20. Qi output, individual box summary for one protein. It contains the slope that corresponds to an estimation of the

protein ratio, the standard error for this coefficient, the regression coefficient and the number of peptide matches. A large number of peptides does not guarantee a precise estimation of the ratio.

Quality of the data set

It is generally admitted that high throughput shotgun proteomics data are log-normally

distributed [2, 29]. In the case of peptide quantification, skewed distribution can be observed.

Indeed, the ratios have log values lower or greater than zero, but very seldom have large values. Usually, they vary around zero. So, a non-normal distribution may indicate that the values are meaningless, or an experimental error occurred. However, this may also indicate that the sample contains very differential proteins. Log-normal distribution is required for an outliers detection test in Mascat Workflow and is therefore also checked before a data analysis is started. The test

(40)

This size threshold is almost always exceeded when working with high throughput proteomics data. To overcome this issue, we advise to always look at the peptide ratio distribution in Mascat (Figure 21A) and the reporter intensity distribution in Librus (Figure 21B), or the data quantile-quantile plot (Figure 21C).

Figure 21. Data quality assessment, A/ Mascat workflow,

Distribution of the peptide ratios. Dotted red curves are the density distribution after normalisation (Ali Data). B/ Librus workflow, superposition of the reporter intensities density distribution (Loic Data, Ante-Mortem labelled reporters) C/ Mascat workflow, normal quantile-quantile plot of peptides ratio 117/114 (Ali Data).

B.

C.

(41)

Implementation

The implementation was done using R version 2.6.0. The program files, documentation and example script are contained in supplementary data 1.

Reporter peaks were collected using the LMS2E that performed the integration by sum of intensities and corrected the reporter impurities.

Librus workflow can be applied using the script librus.R. It executes the wrapper function

librus that performs the quantification. It calls the function of normalisaton, firstnorm, the outliers

removal function outlier; getRatio calculated the ratio and the standard error.

Mascat workflow is called by the wrapper mascat found in the module Mascatfunc.R. This

workflow is devised in four modules. The wrapper calls the script Mascat_norm.R which contains

normalisation function, the script Mascat_outliers.R which contains the outliers detection test (this

script needs the installation of an additional library), outliers which clusters a collection of some

tests commonly used for identifying outliers and Mascat_changes.R which contains statistical test

for significance change. The script Mascat_displays.R performs the export of the quantification

results and charts.

The last workflow, QI, is designed as a single module Qifunc.R. It implements a wrapper

qi, which calls slopeRatio, a subroutine that creates the linear regression model and displays the

regression summary. The summary contains the regression statistics of the model, the slope, the standard error, and the regression coefficient. Moreover the wrapper calls regression diagnostic

methods by the function diagnRM. No additional libraries are needed for dffits and studentized

(42)

(43)

4.3.

Validation of the algorithms

The ratios for the five spiked proteins of the ABRF data are compared in the table 10. It shows that ratios resulting from Mascat, Librus and QI are closed to the expected theoretical values. This therefore validates our three quantification algorithms.

A more precise study was performed to determine which quantification workflow provides the more accurate ratios. For this purpose, we used a data set supplied by Loic Dayon that contains a mixture of albumin (ALBU) from bovine serum, myoglobin (MYO) from horse heart, -lactoglobulin (LACB) from bovine milk, and lysozyme (LYS) from hen egg in equal weight. This dataset was originally used to determine the coefficient for impurities correction for

the TMT reagents, Dayon et Al. for more information [7].

Due to its robustness, root mean square deviation was chosen to calculate the deviation from expected theoretical ratio of 1:2:3:3:5:10. As summarized in Figure 22, RMSD for each expected ratios are combined for all the experiments. The profiles clearly show that the higher a ratio, the greater is its error. To obtain a better indication of the relative accuracy of a quantitative approach, a zoom is performed on the array of low (2:1, 3:1) and high (5:1, 10:1) theoretical ratio (Figure 23). This shows that for a relatively weak ratio, the approaches seem to behave equally (Figure 23A). However for ratios higher than 3:1, the workflow based on peptide ratio, Mascat presents the largest error in the extent that its errors calculated from all the jobs are higher than a RMSD value of 2.5 (Figure 23B). On the contrary, only one job presents a big deviation for the intensity-based workflow - Librus. Moreover except for this latter, Librus' error profiles are closely clustered whatever the expected ratios. The regression-based approach generally gives low error but the job profiles are more dispersed, almost parallel. More information is found in supplementary data 7. Detailed profiles per job can be seeing in Appendix 5.

(44)

(45)

0 0,5 1 1,5 2 2,5 3 3,5 5:1 10:1 R M S D Theoretical Ratio

Root Mean Square Deviation for Librus, Mascat & QI

Librus 39572 Librus 39573 Librus 39574 Librus 41245 Librus 41246 Mascat 39572 Mascat 39573 Mascat 39574 Mascat 41245 Mascat 41246

QI 39572 QI 39573 QI 39574 QI 41245 QI 41246 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 2:1 3:1 R M S D Theoretical Ratio

Root Mean Square Deviation for Librus, Mascat & QI

Librus 39572 Librus 39573 Librus 39574 Librus 41245 Librus 41246 Mascat 39572 Mascat 39573 Mascat 39574 Mascat 41245 Mascat 41246

QI 39572 QI 39573 QI 39574 QI 41245 QI 41246

B.

(46)

4.4.

Discussion

Three algorithms were developed and validated using the R language Each of them proposes their own statistical procedure (Appendix 6).

By measuring the RMSD between the expected theoretical ratios and the estimated ratios from the different algorithms, we conclude that Librus or to a less extent, QI give the more

accurate ratios. This corroborates the observations of Carillo et al. [39].

However, care must be taken by giving too much confidence to this conclusion. Indeed, the two analysed datasets are from samples containing only spiked proteins; the normality assumption is therefore not respected. This assumption is important especially in Mascat where the outliers detection test requires that the peptide ratios of the whole dataset are normally distributed. This could explain the fact that Mascat shows the biggest variations from the

expected ratios. Consequently, to say that one of these algorithms gives the more accurate ratios, a study should be done on a dataset from real biological samples where some spiked proteins would be incorporated.

(47)

5.

Application of the Quantification Workflows

5.1.

Alireza collaboration: Peptides Ratios-based Quantification Approach

applied to Characterize Daptomycin Resistance in Staphylococcus

aureus.

A two-month work with Alireza Vaezzadeh, a PHD student from the Biomedical

Proteomics Research Group (BPRG), has been carried out. My part of the job consisted of an analysis of MS/MS Data obtained from quantitative MS based proteomic experiments on

Staphylococcus S.aureus.

5.1.1.

Introduction

S. aureus, also known as golden staph, is the most common cause of staph infections. It is

a spherical bacterium, frequently living on the skin or in the nose of a person where it appears in grape-like clusters when viewed through a microscope.

It infects tissues causing furuncles and severe diseases like Staphylococcal scalded skin syndrome (SSSS) in the infant. In order to stop the rising incidence of this infection, Daptomycin was approved by the FDA (Food and Drug Administration) in 2003 for the treatment of complicated skin and soft tissue infections caused by susceptible strains of S. aureus, including

methicillin-resistant S.aureus (MRSA) strains, and other gram-positive bacteria.

Despite significant efforts over the past 20 years, the mode of action of Daptomycin remains mysterious. Furthermore the bacterium seems to develop resistance to this anti-staphylococcal agent. The exact mechanism of the resistance is not clearly known. As methods for gaining insight into this domain, Proteomic and Transcriptomic approaches have been developed. Transcriptional profiles were performed using a customized and extensively validated oligoarray by Dr. Patrice François at the Geneva University Hospitals.

Figure 24. Microscopic image of Staphylococcus aureus

(48)

5.1.2.

Materials

Experimental Workflow. Three strains were analysed in this study: 616 (initial patient isolate), 629 (first isolate breaking through Daptomycin therapy and demonstrating decreased Daptomycin susceptibility but still within the susceptible range, termed the “transitional strain”), and 701 (subsequently isolated during Daptomycin therapy and non-susceptible to Daptomycin). Quantitative-MS based proteomic experiments were performed in practical triplicates. Samples were prepared according to manufacturer’s protocol (Applied Biosystems, Framingham MA). More details concerning the Experiments can be found in Supplementary Data 2.

For the first proteomic experiment (PR1), strain 601 was labelled with iTRAQ 114, strain 629 with iTRAQ 116 and 701 with iTRAQ 117. For the second practical replicate experiment (PR2), strain t

Isobaric Tag based MS Quantification Algorithms Analysis and Implementation