Practical Analysis of Proteome Data Using Bioinformatics and Statistics

(1)

Simon Barkow-Oesterreicher

Functional Genomics Center Zurich

Dr. Jonas Grossmann

Functional Genomics Center Zurich

(2)

Outline

• Challenges in proteomics data analysis
• Protein identification
  --> visualization and validation
• Scaffold software
• More than one search engine
• Quantitative proteomics
• Beyond protein lists
  --> pathway mapping, over-representation

(3)

Challenges in Proteomics

• Samples are usually very complex
  -> proteins differ widely (size, 3D structure, chemical groups)
  -> dynamic range (different abundances) of proteins (e.g. Rubisco in plants makes up to 50% of the total protein amount in green tissues)
• Unlike in transcriptomics, only the most abundant proteins are detected
• Because of this complexity, samples are usually fractionated (no clear cut)
• The random component in DDA (data-dependent acquisition) experiments makes reproducibility challenging
• Genomic sequence and annotation (predicted proteins) are essential
• Mass spectrometers are complex machines and do not always perform equally well (day-to-day variation)

(4)

ELPPAK

Protein Identification Algorithms

[Figure: scheme for protein identification.
Wet lab: protein of interest -> trypsin -> peptides of convenient size -> 1st MS (MS spectrum) -> selection -> fragmentation -> 2nd MS (MS/MS spectrum with b-ions and y-ions).
In silico (using protein sequence databases): genome sequence -> gene prediction -> protein sequences -> in silico tryptic peptides -> in silico theoretical spectrum.]

Example data shown in the figure:

>Ath_Chr1 ACGTTTAG GAGTTAGG ACCACCA
>At1g1120 MDASISTOK ADELIKAPPL EISTK
>At1g1110 MDASISTALK ADELIKAPPL EISTK
in silico tryptic peptides: MPVCLLSTVK, ELIK, APPLEISTK, MDASISTALK, ADELIK, APPLEISTK

Scheme for protein identification... describe all quite in detail!!
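The in silico digestion step of the scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming the common trypsin rule (cleave after K or R, but not before P); the function name and the `missed_cleavages` parameter are hypothetical and not taken from any search engine.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides of a protein sequence.

    Assumed rule: cleave C-terminal to K or R unless the next residue is P.
    """
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Remaining C-terminal piece that does not end in K or R
        fragments.append(sequence[start:])

    # Join neighbouring fragments to allow for missed cleavage sites
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Protein sequence >At1g1110 from the figure, written without spaces
print(tryptic_digest("MDASISTALKADELIKAPPLEISTK"))
# ['MDASISTALK', 'ADELIK', 'APPLEISTK']
```

The three peptides match those listed for At1g1110 in the figure. Real search engines also apply mass-range filters and modifications, which this sketch leaves out.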

(5)


(6)

Protein Identification Algorithms

[Figure, continued: MS spectrum -> selection -> fragmentation -> MS/MS spectrum (b-ions, y-ions) -> peptide identification -> inference -> protein]
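The b- and y-ions of the scheme form the theoretical spectrum that a search engine compares with the measured MS/MS spectrum. A minimal Python sketch, assuming singly charged fragments, monoisotopic residue masses and no modifications; the function name is hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, unmodified residues)
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of m/z values for charge 1+."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b_i: first i residues plus one proton
        b_ions.append(sum(masses[:i]) + PROTON)
        # y_i: last i residues plus water plus one proton
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b, y = fragment_ions("ELPPAK")
for i, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
    print(f"b{i} = {b_mz:.4f}   y{i} = {y_mz:.4f}")
```

For a tryptic peptide ending in K, the y1 ion sits at about m/z 147.11, which is one of the characteristic fragment ions to look for when a spectrum is inspected manually.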

(7)

Peptide Identification

An example: human mitogen-activated protein kinase-8 (MAPK8), 427 aa

[Figure: MAPK8 sequence with three classes of peptides highlighted]
MS-compatible peptides: tryptic & in MS range (mass)
observed peptides: good flight properties
proteotypic peptides: unambiguous & observed frequently

Nat Rev Mol Cell Biol, 6(7):577–83, 2005; Nat Biotechnol, 25:125-131, 2007

One example of a protein: MAPK8 from human.

When we check which tryptic peptides are in the range of the MS, it looks like this (colored means MS-compatible).

Next: which peptides are actually observed, because they have good flight properties. And finally: which are unambiguous and frequently observed.

(8)


(9)


(10)

Output after peptide identification step

• An incomplete list of peptides which were presumably in the sample
• The identified peptides point to corresponding proteins
• Some peptides are ambiguous (protein inference problem)
• Some proteins are identified with several peptides, others only with a single peptide
• Peptides and proteins each carry a score that indicates how well they are identified

Better get the most accurate hit list

(11)

Why validate?

• Every database search generates false positives and false negatives
• Downstream steps can cost a lot of time and money

Get the most accurate protein hit list with a known false discovery rate (FDR)

                                    Reality: True     Reality: False
Search algorithm prediction: True   True Positive     False Positive
Search algorithm prediction: False  False Negative    True Negative

Better get the most accurate hit list

(12)

FPR vs FDR

False discovery rate (FDR):
e.g. FDR = 5% means that among all the features called positive, 5% are false positives (truly negative) on average.
Example: 500 positives, 25 false positives (5%)

False positive rate (FPR):
e.g. FPR = 5% means that on average 5% of the truly negative features in the study will be called positive.
Example: 10500 features in total, 500 truly positive and 10000 truly negative. An FPR of 5% means 500 false positives (50% of all positives).

source: PNAS; Storey and Tibshirani 100 (16): 9440. (2003)

There is confusion in the proteomics community: FDR and FPR are often used for the same thing.

Biologists are sometimes not too picky about terminology, which adds to this confusion, so here is a definition in words.
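The difference between the two rates becomes obvious when both are computed for the same numbers. A small Python sketch using the figures from this slide; it assumes, as the slide's example does, that all 500 truly positive features are called positive. The function names are illustrative only.

```python
def false_discovery_rate(false_pos, true_pos):
    """Fraction of all positive calls that are wrong: FP / (FP + TP)."""
    return false_pos / (false_pos + true_pos)


def false_positive_rate(false_pos, true_neg):
    """Fraction of all truly negative features called positive: FP / (FP + TN)."""
    return false_pos / (false_pos + true_neg)


# FPR example: 10000 truly negative features, 500 of them called positive,
# plus 500 true positives
fpr = false_positive_rate(false_pos=500, true_neg=9500)
fdr = false_discovery_rate(false_pos=500, true_pos=500)
print(f"FPR = {fpr:.0%}, FDR = {fdr:.0%}")  # FPR = 5%, FDR = 50%

# FDR example: 500 positives, 25 of them false
print(f"FDR = {false_discovery_rate(false_pos=25, true_pos=475):.0%}")  # FDR = 5%
```

The same 5% threshold therefore leads to very different hit lists depending on which rate it refers to.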

(13)

••• Simon Barkow & Jonas Grossmann • FGCZ Proteomics •

Validation of Peptide Identification & Protein inference

From Nesvizhskii et al, Anal. Chem.2003, 75,4646-4658

Peptide Prophet

Protein Prophet

(14)

Issue #1

Peptide Prophet

Protein Prophet

(15)

Peptide validation by algorithm

• Key question: how to determine which identifications are valid
• Typical method: accept all identifications above a chosen discriminant score of a search engine (e.g. Mascot Ion Score)
• Choosing a threshold is problematic; it depends on the sample, the search database, etc.

Use a validation algorithm that is based on experience: PeptideProphet

(16)

Histogram of scores

Once the discriminant scores for all the spectra in a sample are calculated, PeptideProphet makes a histogram of these discriminant scores.

For example, in the sample shown here, 70 spectra have scores around 2.5.

[Figure: histogram. x-axis: discriminant score (D); y-axis: number of spectra in each bin]

(17)

Mixture of distributions

This histogram shows the distributions of correct and incorrect matches.

PeptideProphet assumes that these distributions are standard statistical distributions.

Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.

This histogram shows the standard distributions of correct and incorrect matches, validated manually in a sample with a known set of 18 proteins.

[Figure: histogram with fitted curves labelled “correct” and “incorrect”]

(18)

Bayesian statistics

Once correct and incorrect distributions are drawn, PeptideProphet uses Bayesian statistics to compute the probability p(+|D) that a match is correct, given a discriminant score D.

[Figure: histogram with fitted curves labelled “correct” and “incorrect”]
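The Bayesian step can be illustrated with a short Python sketch. It is not PeptideProphet's implementation: the two score distributions are assumed to be normal densities, and their parameters and the prior fraction of correct matches are made up for illustration. Only Bayes' rule itself, p(+|D) = p(D|+) p(+) / (p(D|+) p(+) + p(D|-) p(-)), is the point here.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(d, prior_correct, correct=(3.0, 1.5), incorrect=(-1.5, 1.0)):
    """p(+|D): probability that a match with discriminant score d is correct.

    `correct` and `incorrect` are hypothetical (mean, sd) pairs standing in
    for the fitted distributions; `prior_correct` is the assumed overall
    fraction of correct matches.
    """
    weighted_pos = normal_pdf(d, *correct) * prior_correct
    weighted_neg = normal_pdf(d, *incorrect) * (1.0 - prior_correct)
    return weighted_pos / (weighted_pos + weighted_neg)


for score in (-1.5, 0.0, 2.5):
    print(f"D = {score:5.1f}  p(+|D) = {posterior_correct(score, prior_correct=0.3):.3f}")
```

With these assumed parameters, a match scoring well inside the "correct" curve receives a probability close to 1, while one under the "incorrect" curve receives a probability close to 0.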

(19)

“correct”

“incorrect”

Probability of a

correct match

The statistical formula looks fierce, but relating it to the histogram shows that the prob of a score of 2.5 being correct is

(20)

(21)

How to get even more confidence?

• Compare peptide patterns seen in each replicate for the same protein
• Manually examine the spectrum for critical or characteristic fragment ions (especially single hits)
• Compare scores from various search engines (Mascot, SEQUEST, X!Tandem, etc.)
• Compare other characteristics for identified peptides (NTT, NMC ...)

(22)

Peptide Prophet features

• Combines database search scores
• Number of tryptic termini (NTT)
• Number of missed cleavage sites (NMC)
• Mass difference between theoretical mass and measured mass
• Peptide retention time (expected vs measured)
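Two of these features, NTT and NMC, follow directly from the peptide sequence and its position in the protein. A minimal Python sketch, assuming the usual trypsin rule (cleavage after K or R, not before P); the function names are hypothetical and only the first occurrence of the peptide in the protein is considered.

```python
def count_missed_cleavages(peptide):
    """NMC: K or R inside the peptide (not at its C-terminus), not followed by P."""
    return sum(
        1
        for i in range(len(peptide) - 1)
        if peptide[i] in "KR" and peptide[i + 1] != "P"
    )


def count_tryptic_termini(protein, peptide):
    """NTT: number of peptide ends (0, 1 or 2) consistent with tryptic cleavage."""
    start = protein.find(peptide)  # first occurrence only
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    # N-terminal end: protein start, or preceded by K/R without P following
    n_term = start == 0 or (protein[start - 1] in "KR" and peptide[0] != "P")
    # C-terminal end: protein end, or peptide ends in K/R without P following
    c_term = end == len(protein) or (peptide[-1] in "KR" and protein[end] != "P")
    return int(n_term) + int(c_term)


protein = "MDASISTALKADELIKAPPLEISTK"
for pep in ("ADELIK", "ADELIKAPPLEISTK", "DELIK"):
    print(pep, count_tryptic_termini(protein, pep), count_missed_cleavages(pep))
```

Fully tryptic peptides without missed cleavages (NTT = 2, NMC = 0) are the most plausible products of a trypsin digest, which is why these counts help to separate correct from incorrect matches.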

(23)

Scaffold Workflow

(24)

Experimental Design

Three hierarchies:

1. Sample Category: disease vs. control, treated vs. untreated, etc.
2. Biosample: drop of blood, tissue sample, etc.
3. MS Sample: each individual spot (MALDI), or one LC fraction

(25)

Scaffold Sample Window

Overview for comparisons

• Lists and summarizes the proteins identified in each biosample or MS sample
• Identification probability
• Number of unique peptides on which the identification is based
• Percentage of the total spectra that this number represents
• Number of unique spectra associated with this protein

(26)

Scaffold Protein Window

• All information about a single protein
• Sequence coverage for this and similar proteins
• Peptide sequence, with identified peptides highlighted in yellow and modifications highlighted in green
• The spectra used to identify each peptide
• Lots of data about the peptides that can be reviewed to gain confidence

(27)

Scaffold Quantify Window

• View spectral count numbers for biosamples (same color) and categories (different color)
• Scatterplots pane shows degree of error associated with the spectral count
• Venn diagram shows relationship between categories of proteins, unique peptides, or unique spectra identifications
• GO (Gene Ontology) mesh terms pane

(28)

Scaffold Statistics Window

Check whether your data meets Scaffold’s assumptions

• Statistical information for each MS sample in your analysis
• Relationship between peptide and protein probabilities
• Histogram demonstrating correct and incorrect peptide assignments (used by the Peptide Prophet)
• Scatterplot comparing two or more search engine results

(29)

Search Algorithms

(30)


(31)

Search Algorithms

• MASCOT
• SEQUEST
• X!TANDEM
• OMSSA
• Spectrum Mill

All of them can be combined with Scaffold

(32)

Why Overlap Small

[Figure: Venn diagram of the spectra identified by SEQUEST, X!Tandem and Mascot. Region percentages shown: 9%, 19%, 7%, 34%, 5%, 4%, 22%. Annotations: considers intensities; probability based scoring; semi-tryptic, no neutral loss fragments]

The reason that they identify different spectra is that each program has different strengths.

(33)

Decoy searches applicable everywhere

>sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1
MKFLVLLFNILCLFPILGADELVMSPIPTTDVQPKVTFDINSEVSSGPLYLNPVEMAGVK
YLQLQRQPGVQVHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLK
EGDQWAPIPEDQYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTP
KNGHICKMVYDKNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLD
DKYAPISVQGYVATIPKLKDFAEPYHPIILDISDIDYVNFYLGDATYHDPGFKIVPKTPQ
CITKVVDGNEVIYESSNPSVECVYKVTYYDKKNESMLRLDLNHSPPSYTSYYAKREGVWV
TSTYIDLEEKIEELQDHRSTELDVMFMSDKDLNVVPLTNGNLEYFMVTPKPHRDIIIVFD
GSEVLWYYEGLENHLVCTWIYVTEGAPRLVHLRVKDRIPQNTDIYMVKFGEYWVRISKTQ
YTQEIKKLIKKSKKKLPSIEEEDSDKHGGPPKGPEPPTGPGHSSSESKEHEDSKESKEPK
EHGSPKETKEGEVTKKPGPAKEHKPSKIPVYTKRPEFPKKSKSPKRPESPKSPKRPVSPQ
RPVSPKSPKRPESLDIPKSPKRPESPKSPKRPVSPQRPVSPRRPESPKSPKSPKSPKSPK
VPFDPKFKEKLYDSYLDKAAKTKETVTLPPVLPTDESFTHTPIGEPTAEQPDDIEPIEES
VFIKETGILTEEVKTEDIHSETGEPEEPKRPDSPTKHSPKPTGTHPSMPKKRRRSDGLAL
STTDLESEAGRILRDPTGKIVTMKRSKSFDDLTTVREKEHMGAEIRKIVVDDDGTEADDE
DTHPSKEKHLSTVRRRRPRPKKSSKSSKPRKPDSAFVPSIIFIFLVSLIVGIL

(34)

Decoy searches applicable everywhere

>sp|REV_Q4U9M9|REV_104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1
LIGVILSVLFIFIISPVFASDPKRPKSSKSSKKPRPRRRRVTSLHKEKSPHTDEDDAETG
DDDVVIKRIEAGMHEKERVTTLDDFSKSRKMTVIKGTPDRLIRGAESELDTTSLALGDSR
RRKKPMSPHTGTPKPSHKTPSDPRKPEEPEGTESHIDETKVEETLIGTEKIFVSEEIPEI
DDPQEATPEGIPTHTFSEDTPLVPPLTVTEKTKAAKDLYSDYLKEKFKPDFPVKPSKPSK
PSKPSKPSEPRRPSVPRQPSVPRKPSKPSEPRKPSKPIDLSEPRKPSKPSVPRQPSVPRK
PSKPSEPRKPSKSKKPFEPRKTYVPIKSPKHEKAPGPKKTVEGEKTEKPSGHEKPEKSEK
SDEHEKSESSSHGPGTPPEPGKPPGGHKDSDEEEISPLKKKSKKILKKIEQTYQTKSIRV
WYEGFKVMYIDTNQPIRDKVRLHVLRPAGETVYIWTCVLHNELGEYYWLVESGDFVIIID
RHPKPTVMFYELNGNTLPVVNLDKDSMFMVDLETSRHDQLEEIKEELDIYTSTVWVGERK
AYYSTYSPPSHNLDLRLMSENKKDYYTVKYVCEVSPNSSEYIVENGDVVKTICQPTKPVI
KFGPDHYTADGLYFNVYDIDSIDLIIPHYPEAFDKLKPITAVYGQVSIPAYKDDLLQFYK
NGIMGRDDIVFINLLLLKLGRFFGIVSTVYENYLAKFIRINKDYVMKCIHGNKPTFVVMK
ISHQFSSVMEYKYNEHQFSLNLSFFSETHIQQRLQQLRALYQDEPIPAWQDGEKLFFILD
PDELLEVYAMYPVENQTVIACTYLPMEENEWIVIDGEVVKHVQVGPQRQLQLYKVGAMEV
PNLYLPGSSVESNIDFTVKPQVDTTPIPSMVLEDAGLIPFLCLINFLLVLFKM
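The decoy entry shown here is the target sequence read backwards, with a REV_ tag added to the accession and entry name. A minimal Python sketch of that transformation, assuming UniProt-style headers of the form `>db|accession|name description`; the function name and the fallback for other header styles are hypothetical.

```python
def make_decoy(header, sequence, tag="REV_"):
    """Return (decoy_header, decoy_sequence) for one FASTA entry."""
    parts = header.split("|", 2)
    if len(parts) == 3:
        # Tag both the accession and the entry name, as in the example entry
        decoy_header = "|".join([parts[0], tag + parts[1], tag + parts[2]])
    elif header.startswith(">"):
        # Fallback for headers without '|' separators (assumption)
        decoy_header = ">" + tag + header[1:]
    else:
        decoy_header = tag + header
    return decoy_header, sequence[::-1]


header = ">sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen"
# First 23 residues of the target entry, shortened for the example
decoy_header, decoy_seq = make_decoy(header, "MKFLVLLFNILCLFPILGADELV")
print(decoy_header)
print(decoy_seq)
```

Reversing keeps the amino acid composition and protein length of the target database, so decoy hits are a reasonable stand-in for random matches.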

(35)

1) Sequest & TPP, No decoy search, PeptideProphet > 0.9

                  # of proteins   # of peps   # of MS/MS
fw proteins       3176            9771        20627
single hits       1148            -           -
REV proteins      -               -           -
REV single hits   -               -           -

[Pie chart: 36%, 64%; Overall ath 801; Total: 3176 proteins]

The regular procedure:
-> only one search engine is taken into account (sometimes even without a decoy database)
--> TPP for statistical evaluation
--> the difference between decoy and non-decoy searches:
-> a different fit of the probability function makes the cutoff slightly more stringent, which results in fewer peptide identifications

(36)

2) Sequest & TPP, w/ decoy search, PeptideProphet > 0.9

                  # of proteins   # of peps   # of MS/MS
single hits       952             -           -
REV proteins      103             104         126
REV single hits   102             -           -
FDR               3.76%           1.17%       0.68%

Peptide-level FDR: 104 / (8994 - 104)

[Pie chart: 3%, 0%, 32%, 64%; Overall ath 801; Total: 2943 proteins]
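The peptide-level FDR of the decoy search follows the expression given on the slide, 104 / (8994 - 104). A short Python sketch of that calculation; it assumes that 8994 is the number of all accepted peptides including the 104 reversed ones, which the slide does not state explicitly. The function name is illustrative.

```python
def decoy_fdr(n_decoy, n_total):
    """FDR estimate: decoy hits divided by the remaining (target) hits."""
    n_target = n_total - n_decoy
    if n_target <= 0:
        raise ValueError("no target hits left after removing decoys")
    return n_decoy / n_target


print(f"{decoy_fdr(104, 8994):.2%}")  # 1.17%
```

The result matches the 1.17% reported for peptides in the decoy search.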

(37)

Decoy searches - Limitations

• Decoy searches can be applied everywhere, BUT the calculation of FDRs only makes sense if a large number of proteins are identified (more than ~200)
• If the calculated FDR is very high, there is a good chance that some search parameters are wrong or that some PTMs are not specified
• Reversed databases are favored over scrambled ones
• Low FDR doesn’t mean perfect results

(38)

Quantitative Proteomics - my critical view

• Is what everybody is looking for
• Is what many people claim to do
• Is definitely the right way to go in the future
• Is absolutely necessary for Systems Biology
• Is essential to really understand the dynamics of the proteome
• Is not really straightforward

(39)

Quantitative Proteomics - What is it?

• Find relative changes of protein abundance between 2 similar samples
  (wild type vs. mutant // condition_1 vs. condition_2)
• Determine absolute protein concentrations in a sample
  (draw conclusions about copy numbers and translation efficiency) -> AQUA peptides ..
• Find regulatory proteins and elucidate regulatory pathways

(40)

Quantitative Proteomics - How can it be achieved?

• Labeling strategy for differential expression
  (ICAT, iTRAQ, TMT, SILAC --> wet lab)
• Label-free approaches for differential expression
  (--> Software solutions)
• Targeted approaches
  (SRM, MRM --> mass spec approach)

(41)

Quantitative Proteomics (differential expression)

label strategy (sample prep solution): iCAT, iTRAQ/TMT, SILAC
-> only ONE run is acquired
-> problematic is sample prep

label-free (software solution): SuperHirn, Progenesis
-> 2 individual runs are acquired
-> problematic are aligning and run to run variation

(42)

ICAT

labels have different weights

Quantification is done on the MS-one level

(43)

iTRAQ

all labels have the same weight
--> all parent ions are the “same”

Quantification is done on the MS/MS level

(44)

Beyond Protein Lists and Quantitation - what else

• Check for over/under representation of GO-terms
• Functional categorization
• Project regulated proteins onto a metabolic pathway map

(45)

Principle of Over-representation Analysis - an easy example

The Principle
- organism with 1000 genes
- binned in 5 equal categories with 200 genes each
- GO-cats 1-5: transcription, translation, energy delivery, nutrients uptake, degradation

The researcher decides to do proteomics (brute-force)
- 200 genes are identified --> 1/5th of all
- statistically you would expect to find approx. 40 genes for each category

In fact you find about 100 genes from the GO:energy delivery category
---> category energy delivery is significantly enriched
---> different statistics can be applied
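One of the statistics that can be applied to this easy example is a one-sided hypergeometric test. The Python sketch below uses the numbers from the slide (1000 genes, 200 per category, 200 identified, 100 of them in one category); the choice of test and the function name are assumptions made for illustration, not a method prescribed in the slides.

```python
from math import comb


def hypergeom_tail(population, category_size, sample_size, observed):
    """P(X >= observed) when drawing sample_size genes without replacement."""
    upper = min(category_size, sample_size)
    favourable = sum(
        comb(category_size, k) * comb(population - category_size, sample_size - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(population, sample_size)


population, category_size, sample_size, observed = 1000, 200, 200, 100

# Expected hits in one category if the identified genes were a random draw
expected = sample_size * category_size / population
p_value = hypergeom_tail(population, category_size, sample_size, observed)

print(f"expected: {expected:.0f}, observed: {observed}")
print(f"P(X >= {observed}) = {p_value:.3g}")
```

The expected count is 40, as stated on the slide, and observing 100 is far out in the tail of the distribution. The `population` argument is the background, so changing the background changes the result.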

(46)

Principle of ORA - in case of proteomics

The number of measured and identified proteins is still far from complete.
Over-representation analysis allows finding pathways or “systems” which are regulated or involved in a certain context
-> but it is important to have the correct background/universe selected

Principle:
- all genes of an organism are binned in categories
- categories are related to gene function (e.g. Gene Ontology categories)
- compare your identifications to randomly drawn genes

Background problem:
- take as background only those proteins ever identified in this species
- or take as background all identified proteins and, as genes of interest (targets), those proteins which seem to be regulated (e.g. iTRAQ experiment)

Tools: R package --> topGO
Web: --> GOTreeMachine (bioinfo.vanderbilt.edu/gotm/)

(47)

Scenario (from HTP proteomics)

• Arabidopsis thaliana: the model plant ---> ~28 000 genes
• Single-cell plant in liquid culture
• Grown in sugar-containing solution with weekly subculturing
• One part grown in the dark (cardboard box)
• One part grown in long-day conditions (16h light)
• Extensive LTQ MS analysis --> 800 LC-MS runs (fractionation & replicates)
• A total of 7983 proteins identified from all samples
  (~30% of all genes encoded in the genome) --> Background
• 6547 from the cell cultures that were kept in the dark
• 6474 from the cell cultures that were illuminated

(48)

[Figure: graphs of over-represented GO biological-process terms for proteins from CC_dark, shown for two backgrounds (BG): full universe of GO, and only proteins identified in CC. Colour legend: Dark, Light. Node labels are truncated GO terms, e.g. GO:0006412 translation, GO:0006810 transport, GO:0008152 metabolic process]

(49)


(50)

Projection onto Metabolic Pathway Maps

(e.g. MapMan Software (Golm))

[Figure: pathway map, same data, panels Dark and Light. Legend: only found in light; only found in dark; found in both]

(51)

Q & A

(52)

Hands on

• your turn now
• feel free to ask

(53)

Scaffold hands on - Example One

• Load your own data into Scaffold before we continue
• Also use X!Tandem to search
• Have a look at the results
• Is it valid to calculate an FDR? How high is your FDR?

(54)

More from Scaffold Q+

hands on ... with iTRAQ data

(55)

Scenario:

• Mouse data
• Liver tissue
• iTRAQ data (Swiss mouse: standard diet VS high fat diet)
• Mouse decoy database search with Mascot -> dat-files
• Labels: 116 -> high fat diet /// 114, 115, 117 -> standard diet
• Check reproducibility (standard diet vs standard diet)
• Find proteins which are regulated in high fat diet / standard diet
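The label assignment in this scenario translates into log2 ratios of reporter ion intensities. A minimal Python sketch, assuming channel 114 as the common reference and made-up intensities for one protein; neither is specified in the slides, and Scaffold Q+ performs its own normalisation, which is not reproduced here.

```python
import math


def log2_ratios(intensities, reference=114):
    """log2(channel / reference) for every non-reference reporter channel."""
    ref = intensities[reference]
    return {
        channel: math.log2(value / ref)
        for channel, value in intensities.items()
        if channel != reference
    }


# Hypothetical reporter ion intensities for one protein
reporter = {114: 1000.0, 115: 1100.0, 116: 2000.0, 117: 950.0}
ratios = log2_ratios(reporter)

# Standard diet vs standard diet: shows how reproducible the measurement is
control_spread = max(abs(ratios[115]), abs(ratios[117]))
# High fat diet vs standard diet: the ratio of interest
high_fat = ratios[116]

print(f"st/st spread: {control_spread:.2f}, high fat/st: {high_fat:.2f}")
print("candidate for regulation:", abs(high_fat) > control_spread)
```

Comparing the high fat ratio with the spread of the standard/standard ratios is only a crude illustration of the reproducibility check; a threshold for real data should come from the distribution over all proteins.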

(56)

Task with Scaffold Q+

• How consistent are peptides of the same protein?
• Find confident thresholds for proteins being over/under expressed
• Which proteins in this example do you consider as being over/under expressed?
• Can you make sense of these proteins?

(57)

What should come out ..

only 2 quant categories:

[Figure: Histogram 2 Categories Liver Ex4. x-axis: log2(Ratio), -1.4 to 1.4; y-axis: Frequency, 0 to 300. Series: StDiet/StDiet, HighFatDiet/StDiet]

(58)

What should come out ..

4 quant categories:

[Figure: Histogram 4 Categories Liver Ex4. x-axis: log2(Ratio), -2 to 2; y-axis: Frequency, 0 to 400. Series: ratio_2 (st/st), ratio_3 (high fat / st), ratio_4 (st/st)]

(59)

Regulated Proteins: The List

• 2 ways of making sense out of this data..
• take the intersection of those 2 lists.. (should be most confident)

[Figure: Venn diagram. 2 categories: 44 regulated proteins; 4 categories: 48 regulated proteins; 37]

(60)

Make sense out of Lists:

this does make sense !!

(61)

Paint it on Reactome-maps

(62)

(63)

Scaffold Similarity Window

• Review and control the peptide/protein mapping
• View protein groups in which peptides are shared
• “check” or “uncheck” the valid box for a peptide sequence
• Peptides identified in particular protein groups are color coded