Practical Analysis of Proteome Data
Using Bioinformatics and Statistics
Simon Barkow-Oesterreicher
Functional Genomics Center Zurich
Dr. Jonas Grossmann
Functional Genomics Center Zurich
Outline
•
Challenges in proteomics data analysis
•
Protein identification
--> visualization and validation
•
Scaffold software
•
More than one search engine
•
Quantitative proteomics
•
Beyond protein lists
--> Pathway mapping, over-representation
Challenges in Proteomics
•
Samples are usually very complex
-> proteins differ widely (size, 3D structure, chemical groups)
-> dynamic range (different abundances) of proteins (e.g. Rubisco in plants makes up to 50% of the total protein amount in green tissues)
•
Unlike in transcriptomics, only the most abundant proteins are detected
•
Because of this complexity, samples are usually fractionated (no clear cut)
•
The random component in DDA (data-dependent acquisition) experiments makes reproducibility challenging
•
Genomic sequence and annotation (predicted proteins) are essential
•
Mass spectrometers are complex machines and do not always perform equally well (day-to-day variation)
Protein Identification Algorithms
[Figure: scheme for protein identification]
Wet lab: protein of interest -> trypsin -> peptides of convenient size -> 1st MS (MS spectrum) -> selection -> fragmentation -> 2nd MS (MS/MS spectrum with b-ions and y-ions)
In silico (using protein sequence databases): genome sequence (>Ath_Chr1 ACGTTTAG GAGTTAGG ACCACCA) -> gene prediction -> protein sequences (>At1g1120 MDASISTOK ADELIKAPPL EISTK, >At1g1110 MDASISTALK ADELIKAPPL EISTK) -> in silico tryptic peptides (MPVCLLSTVK, ELIK, APPLEISTK, MDASISTALK, ADELIK, APPLEISTK) -> in silico theoretical spectrum
MS/MS spectrum vs. theoretical spectrum -> peptide identification -> inference -> protein
Scheme for protein identification... describe all quite in detail!!
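The in silico digestion step of the scheme can be sketched as follows. This is a minimal sketch, assuming the common simplified trypsin rule (cleave after K or R, but not before P) and no missed cleavages; the function name `tryptic_digest` is illustrative and not taken from any tool mentioned in these slides. The example sequence is the toy protein At1g1110 from the slide.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence into fully tryptic peptides.

    Assumed rule: cleave after K or R unless the next residue is P.
    Missed cleavages are not generated in this sketch.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in "KR" and not is_last and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Whatever is left after the last cleavage site is the C-terminal peptide
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


# Toy protein At1g1110 from the slide, written without spaces
print(tryptic_digest("MDASISTALKADELIKAPPLEISTK"))
```

For the toy protein this yields the three peptides shown on the slide (MDASISTALK, ADELIK, APPLEISTK). Real search engines additionally allow missed cleavages and modifications.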
Peptide Identification
An example: human mitogen-activated protein kinase-8 (MAPK8), 427 aa
•
MS-compatible peptides: tryptic & in MS range (mass)
•
observed peptides: good flight properties
•
proteotypic peptides: unambiguous & observed frequently
Nat Rev Mol Cell Biol, 6(7):577–83, 2005; Nat Biotechnol, 25:125-131, 2007
One example of a protein: MAPK8 from human.
When we check which tryptic peptides are in the range of the MS, it looks like this (colored means MS-compatible).
Next: which peptides are actually observed, because they have good flight properties; and finally, which are unambiguous and frequently observed.
Output after peptide identification step
•
An incomplete list of peptides which were presumably in the sample
•
The identified peptides point to corresponding proteins
•
Some peptides are ambiguous (protein inference problem)
•
Some proteins are identified with several peptides, others only with a single peptide
•
The peptides and also the proteins have a score associated with them that tells how well they are identified
Better get the most accurate hit list
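The protein inference problem mentioned above can be illustrated with a small sketch that separates unique from shared peptides. The peptide-to-protein assignment below is a hypothetical example loosely based on the toy peptides of the identification scheme, and the function name `split_unique_and_shared` is illustrative only.

```python
from collections import defaultdict


def split_unique_and_shared(protein_peptides):
    """Return (unique, shared) peptide maps from a protein -> peptides dict."""
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    unique, shared = {}, {}
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            unique[peptide] = next(iter(proteins))   # points to one protein
        else:
            shared[peptide] = sorted(proteins)       # ambiguous evidence
    return unique, shared


# Hypothetical identifications (illustration only)
identified = {
    "At1g1110": ["MDASISTALK", "ADELIK", "APPLEISTK"],
    "At1g1120": ["ELIK", "APPLEISTK"],
}
unique, shared = split_unique_and_shared(identified)
print(unique)
print(shared)
```

In this toy case APPLEISTK supports both proteins, so it cannot by itself decide which of the two was in the sample.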
Why validate?
•
Every database search generates false positives and false negatives
•
Downstream steps can cost a lot of time and money
Get the most accurate protein hit list with a known false discovery rate (FDR)

                                      Reality: True     Reality: False
Search algorithm prediction: True     True Positive     False Positive
Search algorithm prediction: False    False Negative    True Negative
FPR vs FDR
False discovery rate (FDR):
e.g. FDR = 5% means that among all the features called positive, 5% are truly negative (i.e. false positives) on average.
Example: 500 positives, 25 false positives (5%)
False positive rate (FPR):
e.g. FPR = 5% means that on average 5% of the truly false features in the study will be called positive.
Example: 10500 total, 500 true positives; 10000 false means 500 false positives (50% of total positives)
source: PNAS; Storey and Tibshirani 100 (16): 9440. (2003)
There is confusion in the proteomics community: FDR and FPR are often used for the same thing.
As biologists are sometimes not too picky, this leads to confusion, so here is a definition in words.
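The two definitions can be checked with the numbers from the slide. This is a minimal sketch; the function names are illustrative, and the assumption that all 500 truly positive features are detected is taken from the slide's example.

```python
def false_discovery_rate(false_positives, true_positives):
    """Share of false positives among everything CALLED positive."""
    return false_positives / (false_positives + true_positives)


def false_positive_rate(false_positives, true_negatives):
    """Share of the truly FALSE features that were called positive."""
    return false_positives / (false_positives + true_negatives)


# FPR example from the slide: 10500 features, 10000 of them truly false
truly_false = 10000
fp = truly_false * 5 // 100          # FPR = 5% -> 500 false positives
tn = truly_false - fp                # 9500 true negatives
tp = 500                             # assumed: all true features are found

print(false_positive_rate(fp, tn))   # FPR of the example
print(false_discovery_rate(fp, tp))  # FDR of the same example

# FDR example from the slide: 500 positives, 25 of them false
print(false_discovery_rate(25, 475))
```

The same 5% threshold thus gives a hit list that is half wrong when applied as an FPR, but only 5% wrong when applied as an FDR.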
••• Simon Barkow & Jonas Grossmann • FGCZ Proteomics •
Validation of Peptide Identification & Protein Inference
From Nesvizhskii et al., Anal. Chem. 2003, 75, 4646-4658
PeptideProphet
ProteinProphet
Issue #1
Peptide validation by algorithm
•
Key question: how to determine which identifications are valid
•
Typical method: accept all identifications above a chosen discriminant score of a search engine (e.g. Mascot Ion Score)
•
Choosing a threshold is problematic, depending on sample, search database, etc.
Use a validation algorithm that is based on experience: PeptideProphet
Histogram of scores
[Figure: histogram; x-axis: discriminant score (D), y-axis: number of spectra in each bin]
Once the discriminant scores for all the spectra in a sample are calculated, PeptideProphet makes a histogram of these discriminant scores.
For example, in the sample shown here, 70 spectra have scores around 2.5.
Mixture of distributions
[Figure: same histogram with the fitted “correct” and “incorrect” distributions]
This histogram shows the distributions of correct and incorrect matches.
PeptideProphet assumes that these distributions are standard statistical distributions.
Using curve-fitting, PeptideProphet draws the correct and incorrect distributions.
[Figure: “correct” and “incorrect” distributions, validated manually]
This histogram shows the standard distributions of correct and incorrect matches, validated manually in a sample with a known set of 18 proteins.
Bayesian statistics
Once the correct and incorrect distributions are drawn, PeptideProphet uses Bayesian statistics to compute the probability p(+|D) that a match is correct, given a discriminant score D.
Probability of a correct match
The statistical formula looks fierce, but relating it to the histogram shows that the probability of a score of 2.5 being correct is the share of correct matches among all matches with that score.
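The Bayesian step can be sketched in a few lines. This is a simplified illustration, not PeptideProphet's actual model: both score distributions are assumed to be normal here, and the means, standard deviations and prior below are hypothetical values chosen only for the example.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def probability_correct(d, prior_correct, correct, incorrect):
    """Bayes' rule: p(+|D) = p(D|+)p(+) / (p(D|+)p(+) + p(D|-)p(-)).

    `correct` and `incorrect` are (mean, sd) tuples of the two fitted
    score distributions (assumed normal in this sketch).
    """
    weight_pos = prior_correct * normal_pdf(d, *correct)
    weight_neg = (1.0 - prior_correct) * normal_pdf(d, *incorrect)
    return weight_pos / (weight_pos + weight_neg)


# Hypothetical fit: 30% of all matches correct, two well-separated curves
CORRECT = (3.0, 1.5)
INCORRECT = (-1.0, 1.0)
for score in (-1.0, 2.5, 5.0):
    p = probability_correct(score, 0.3, CORRECT, INCORRECT)
    print(f"D = {score:4.1f} -> p(+|D) = {p:.3f}")
```

Relating this to the histogram: at a given score, the probability is the height of the weighted "correct" curve divided by the combined height of both weighted curves.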
How to get even more confidence?
•
Compare peptide patterns seen in each replicate for the same protein
•
Manually examine the spectrum for critical or characteristic fragment ions (especially single hits)
•
Compare scores from various search engines (Mascot, SEQUEST, X!Tandem, etc.)
•
Compare other characteristics for identified peptides (NTT, MCS ...)
PeptideProphet features
•
Combines database search scores
•
Number of tryptic termini (NTT)
•
Number of missed cleavage sites (NMC)
•
Mass difference between theoretical mass and measured mass
•
Peptide retention time (expected vs measured)
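Two of the features listed above, NTT and NMC, can be computed directly from a peptide and its flanking residues. This is a minimal sketch under stated assumptions: the simplified trypsin rule (cleave after K or R, not before P), "-" as the marker for a protein terminus, and illustrative function names that are not taken from PeptideProphet itself.

```python
TRYPTIC_RESIDUES = {"K", "R"}


def count_missed_cleavages(peptide):
    """NMC: internal K/R residues that trypsin should have cut (not before P)."""
    return sum(
        1
        for i in range(len(peptide) - 1)
        if peptide[i] in TRYPTIC_RESIDUES and peptide[i + 1] != "P"
    )


def count_tryptic_termini(peptide, prev_residue, next_residue):
    """NTT (0, 1 or 2): how many peptide ends are consistent with trypsin.

    prev_residue / next_residue are the flanking residues in the protein;
    "-" marks the protein N- or C-terminus (assumed convention).
    """
    ntt = 0
    if prev_residue == "-" or (prev_residue in TRYPTIC_RESIDUES and peptide[0] != "P"):
        ntt += 1
    if next_residue == "-" or (peptide[-1] in TRYPTIC_RESIDUES and next_residue != "P"):
        ntt += 1
    return ntt


print(count_tryptic_termini("ADELIK", "K", "A"), count_missed_cleavages("ADELIK"))
print(count_tryptic_termini("DELIK", "A", "A"), count_missed_cleavages("ADELIKAPPLEISTK"))
```

A fully tryptic peptide without missed cleavages has NTT = 2 and NMC = 0; semi-tryptic matches (NTT = 1) are less likely to be correct in a tryptic digest.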
Scaffold Workflow
Experimental Design
Three hierarchies:
1. Sample Category: disease vs. control, treated vs. untreated, etc.
2. Biosample: drop of blood, tissue sample, etc.
3. MS Sample: each individual spot (MALDI), or one LC fraction
Scaffold Sample Window
Overview for comparisons
•
Lists and summarizes the proteins identified in each biosample or MS sample
•
Identification probability
•
Number of unique peptides on which the identification is based
•
Percentage of the total spectra that this number represents
•
Number of unique spectra associated with this protein
Scaffold Protein Window
•
All information about a single protein
•
Sequence coverage for this and similar proteins
•
Peptide sequence, with identified peptides highlighted in yellow and modifications highlighted in green
•
The spectra used to identify each peptide
•
Lots of data about the peptides that can be reviewed to gain confidence
Scaffold Quantify Window
•
View spectral count numbers for biosamples (same color) and categories (different color)
•
Scatterplots pane shows degree of error associated with the spectral count
•
Venn diagram shows relationship between categories of proteins, unique peptides, or unique spectra identifications
•
GO (Gene Ontology) mesh terms pane
Scaffold Statistics Window
Check whether your data meets Scaffold’s assumptions
•
Statistical information for each MS sample in your analysis
•
Relationship between peptide and protein probabilities
•
Histogram demonstrating correct and incorrect peptide assignments (used by PeptideProphet)
•
Scatterplot comparing two or more search engine results
Search Algorithms
•
MASCOT
•
SEQUEST
•
X!TANDEM
•
OMSSA
•
Spectrum Mill
All of them can be combined with Scaffold
Why is the overlap small?
[Figure: Venn diagram of the spectra identified by SEQUEST, X!Tandem and Mascot; the seven regions are labeled 9%, 19%, 7%, 34%, 5%, 4% and 22%. Annotations on the engines: considers intensities; probability based scoring; semi-tryptic, no neutral loss fragments]
The reason that they identify different spectra is that each program has different strengths.
Decoy searches applicable everywhere
Forward (target) entry:
>sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1
MKFLVLLFNILCLFPILGADELVMSPIPTTDVQPKVTFDINSEVSSGPLYLNPVEMAGVK
YLQLQRQPGVQVHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLK
EGDQWAPIPEDQYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTP
KNGHICKMVYDKNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLD
DKYAPISVQGYVATIPKLKDFAEPYHPIILDISDIDYVNFYLGDATYHDPGFKIVPKTPQ
CITKVVDGNEVIYESSNPSVECVYKVTYYDKKNESMLRLDLNHSPPSYTSYYAKREGVWV
TSTYIDLEEKIEELQDHRSTELDVMFMSDKDLNVVPLTNGNLEYFMVTPKPHRDIIIVFD
GSEVLWYYEGLENHLVCTWIYVTEGAPRLVHLRVKDRIPQNTDIYMVKFGEYWVRISKTQ
YTQEIKKLIKKSKKKLPSIEEEDSDKHGGPPKGPEPPTGPGHSSSESKEHEDSKESKEPK
EHGSPKETKEGEVTKKPGPAKEHKPSKIPVYTKRPEFPKKSKSPKRPESPKSPKRPVSPQ
RPVSPKSPKRPESLDIPKSPKRPESPKSPKRPVSPQRPVSPRRPESPKSPKSPKSPKSPK
VPFDPKFKEKLYDSYLDKAAKTKETVTLPPVLPTDESFTHTPIGEPTAEQPDDIEPIEES
VFIKETGILTEEVKTEDIHSETGEPEEPKRPDSPTKHSPKPTGTHPSMPKKRRRSDGLAL
STTDLESEAGRILRDPTGKIVTMKRSKSFDDLTTVREKEHMGAEIRKIVVDDDGTEADDE
DTHPSKEKHLSTVRRRRPRPKKSSKSSKPRKPDSAFVPSIIFIFLVSLIVGIL
Reversed (decoy) entry:
>sp|REV_Q4U9M9|REV_104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425 PE=3 SV=1
LIGVILSVLFIFIISPVFASDPKRPKSSKSSKKPRPRRRRVTSLHKEKSPHTDEDDAETG
DDDVVIKRIEAGMHEKERVTTLDDFSKSRKMTVIKGTPDRLIRGAESELDTTSLALGDSR
RRKKPMSPHTGTPKPSHKTPSDPRKPEEPEGTESHIDETKVEETLIGTEKIFVSEEIPEI
DDPQEATPEGIPTHTFSEDTPLVPPLTVTEKTKAAKDLYSDYLKEKFKPDFPVKPSKPSK
PSKPSKPSEPRRPSVPRQPSVPRKPSKPSEPRKPSKPIDLSEPRKPSKPSVPRQPSVPRK
PSKPSEPRKPSKSKKPFEPRKTYVPIKSPKHEKAPGPKKTVEGEKTEKPSGHEKPEKSEK
SDEHEKSESSSHGPGTPPEPGKPPGGHKDSDEEEISPLKKKSKKILKKIEQTYQTKSIRV
WYEGFKVMYIDTNQPIRDKVRLHVLRPAGETVYIWTCVLHNELGEYYWLVESGDFVIIID
RHPKPTVMFYELNGNTLPVVNLDKDSMFMVDLETSRHDQLEEIKEELDIYTSTVWVGERK
AYYSTYSPPSHNLDLRLMSENKKDYYTVKYVCEVSPNSSEYIVENGDVVKTICQPTKPVI
KFGPDHYTADGLYFNVYDIDSIDLIIPHYPEAFDKLKPITAVYGQVSIPAYKDDLLQFYK
NGIMGRDDIVFINLLLLKLGRFFGIVSTVYENYLAKFIRINKDYVMKCIHGNKPTFVVMK
ISHQFSSVMEYKYNEHQFSLNLSFFSETHIQQRLQQLRALYQDEPIPAWQDGEKLFFILD
PDELLEVYAMYPVENQTVIACTYLPMEENEWIVIDGEVVKHVQVGPQRQLQLYKVGAMEV
PNLYLPGSSVESNIDFTVKPQVDTTPIPSMVLEDAGLIPFLCLINFLLVLFKM
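A reversed decoy entry like the one shown here can be generated with a few lines of code. This is a minimal sketch, assuming UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME description` and the `REV_` prefix convention seen on the slide; the function name `make_reversed_entry` is illustrative.

```python
def make_reversed_entry(header, sequence):
    """Build a decoy FASTA entry: REV_-prefixed header, reversed sequence."""
    identifier, _, description = header.partition(" ")
    db, accession, entry_name = identifier.split("|")
    decoy_header = f"{db}|REV_{accession}|REV_{entry_name}"
    if description:
        decoy_header += " " + description
    return decoy_header, sequence[::-1]


# First five residues of the forward entry, for illustration only
header, seq = make_reversed_entry(">sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen", "MKFLV")
print(header)
print(seq)
```

Appending all decoy entries to the forward database gives a combined target-decoy database that can be used with any search engine.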
1) Sequest & TPP, no decoy search, PeptideProphet > 0.9

                   # of proteins   # of peps   # of MS/MS
fw proteins        3176            9771        20627
single hits        1148            -           -
REV proteins       -               -           -
REV single hits    -               -           -

[Pie chart "Overall ath 801": slices of 36% and 64%; total: 3176 proteins]

2) Sequest & TPP, w/ decoy search, PeptideProphet > 0.9

                   # of proteins   # of peps   # of MS/MS
fw proteins        2840            8994        18662
single hits        952             -           -
REV proteins       103             104         126
REV single hits    102             -           -
FDR                3.76%           1.17%       0.68%

FDR (peptides) = 104 / (8994 - 104)
[Pie chart "Overall ath 801": slices of 3%, 0%, 32% and 64%; total: 2943 proteins]

The regular procedure:
-> only one search engine is taken into account (sometimes even without a decoy db) --> TPP for statistical evaluation
--> the difference between decoy & non-decoy searches:
-> a different fit of the probability function makes the cutoff slightly more stringent, resulting in fewer peptide identifications
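The FDR row of the second table can be reproduced from the forward and REV counts. This is a minimal sketch that simply applies the formula given on the slide, REV / (fw - REV); the function name `decoy_fdr` is illustrative.

```python
def decoy_fdr(forward_hits, reverse_hits):
    """FDR as computed on the slide: REV / (fw - REV)."""
    return reverse_hits / (forward_hits - reverse_hits)


# (fw, REV) counts from table 2 (Sequest & TPP, with decoy search)
counts = {
    "proteins": (2840, 103),
    "peptides": (8994, 104),
    "MS/MS": (18662, 126),
}
for level, (fw, rev) in counts.items():
    print(f"{level}: FDR = {100 * decoy_fdr(fw, rev):.2f}%")
```

The protein-level FDR is the highest because a single wrong peptide is enough to produce a wrong protein, which is also why almost all REV proteins are single hits.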
Decoy searches - Limitations
•
Decoy searches can be applied everywhere BUT the calculation of FDRs only makes sense if a large number of proteins are identified (more than ~200)
•
If the calculated FDR is very high, there is a good chance that some search parameters are wrong or that some PTMs are not specified
•
Reversed databases are favored over scrambled ones
•
Low FDR doesn’t mean perfect results
Quantitative Proteomics - my critical view
•
Is what everybody is looking for
•
Is what many people claim to do
•
Is definitely the right way to go in the future
•
Is absolutely necessary for Systems Biology
•
Is essential to really understand the dynamics of the proteome
•
Is not really straightforward
Quantitative Proteomics - What is it?
•
Find relative changes of protein abundance between 2 similar samples (wild type VS mutant // condition_1 VS condition_2)
•
Determine absolute protein concentrations in a sample (conclude on copy numbers and translation efficiency) -> AQUA peptides ..
•
Find regulatory proteins and elucidate regulatory pathways
Quantitative Proteomics - How can it be achieved?
•
Labeling strategy for differential expression (ICAT, iTRAQ, TMT, SILAC --> wet lab)
•
Label-free approaches for differential expression (--> software solutions)
•
Targeted approaches (SRM, MRM --> mass spec approach)
Quantitative Proteomics (differential expression)
•
label strategy (sample prep solution): iCAT, iTRAQ/TMT, SILAC
only ONE run is acquired
-> problematic is sample prep
•
label-free (software solution): SuperHirn, Progenesis
2 individual runs are acquired
-> problematic are aligning and run-to-run variation
ICAT
•
labels have different weights
•
Quantification is done on the MS1 level
iTRAQ
•
all labels have the same weight
--> all parent ions are the “same”
•
Quantification is done on the MS/MS level
Beyond Protein Lists and Quantitation - what else
•
Check for over/under representation of GO-terms
•
Functional categorization
•
Project regulated proteins onto a metabolic pathway map
Principle of Over-representation Analysis
The Principle (an easy example)
- organism with 1000 genes
- binned in 5 equal categories with 200 genes each
- GO-cats 1-5: transcription, translation, energy delivery, nutrients uptake, degradation
The researcher decides to do proteomics (brute-force)
- 200 genes are identified --> 1/5th of all
- statistically you would expect to find approx. 40 genes for each category
In fact you find about 100 genes from the GO:energy delivery category
---> category energy delivery is significantly enriched
---> different statistics can be applied
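The easy example can be turned into numbers with a one-sided hypergeometric test, one of the statistics commonly applied to over-representation. This is a minimal sketch using only the Python standard library; choosing the hypergeometric test is an assumption on top of the slide, which leaves the statistic open, and the function name is illustrative.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, drawn, observed):
    """P(X >= observed) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` belong to the category."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


# Numbers from the easy example on the slide
population, annotated, drawn, observed = 1000, 200, 200, 100
expected = drawn * annotated / population
p_value = hypergeom_upper_tail(population, annotated, drawn, observed)
print(f"expected: {expected:.0f} genes, observed: {observed} genes")
print(f"p-value for over-representation: {p_value:.3g}")
```

Finding 100 genes where 40 are expected gives a vanishingly small p-value, which is what "significantly enriched" means here. With many categories tested, a multiple-testing correction would be needed in practice.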
Principle of ORA - in case of proteomics
The number of measured and identified proteins is still far from complete.
Over-representation analysis allows finding pathways or “systems” which are regulated or involved in a certain context
-> but it is important to have the correct background/universe selected
Principle:
- all genes of an organism are binned in categories
- categories are related to gene function (e.g. Gene Ontology categories)
- compare your identifications to randomly drawn genes
Background problem:
- take as background only those proteins ever identified in this species
- or take as background all identified proteins, and as genes of interest (targets) those proteins which seem to be regulated (e.g. iTRAQ experiment)
Tools: R package --> topGO
Web: --> GOTreeMachine (bioinfo.vanderbilt.edu/gotm/)
Scenario (from HTP proteomics)
•
Arabidopsis thaliana: the model plant ---> ~ 28 000 genes
•
Single-cell plant in liquid culture
•
Grown in sugar-containing solution & weekly subculturing
•
One part grown in the dark (cardboard box)
•
One part grown in long-day conditions (16h light)
•
Extensive LTQ MS analysis --> 800 LC-MS runs (fractionation & replicates)
•
A total of 7983 proteins identified from all samples (~ 30% of all genes encoded in the genome) --> Background
•
6547 from the cell cultures that were kept in the dark
•
6474 from the cell cultures that were illuminated
[Figure: over-represented GO biological process terms (dark vs. light) for the proteins from CC_dark, shown for two backgrounds (BG): the full universe of GO, and only the proteins identified in CC. Listed terms include e.g. GO:0006412 translation, GO:0006810 transport, GO:0008152 metabolic process, GO:0009058 biosynthetic process and GO:0009056 catabolic process]
Projection onto Metabolic Pathway Maps
(e.g. MapMan Software (Golm))
[Figure: the same data (dark vs. light) projected onto a pathway map; legend: only found in light, only found in dark, found in both]
Q & A
Hands on
•
your turn now
•
feel free to ask
Scaffold hands on - Example One
•
load your own data into Scaffold before we continue
•
Also use X!Tandem to search
•
Have a look at the results
•
Is it valid to calculate an FDR? How high is your FDR?
More from Scaffold Q+
hands on ... with iTRAQ data
Scenario:
•
Mouse data
•
Liver tissue
•
iTRAQ data (Swiss mouse: standard diet VS high fat diet)
•
Mouse decoy database search with Mascot -> dat-files
•
Labels: 116 -> high fat diet /// 114, 115, 117 -> standard diet
•
Check reproducibility (standard diet vs standard diet)
•
Find proteins which are regulated in high fat diet / standard diet
Task with Scaffold Q+
•
How consistent are the peptides of the same protein?
•
Find confident thresholds for proteins being over/underexpressed
•
Which proteins in this example do you consider as being over/underexpressed?
•
Can you try making sense out of these proteins?
What should come out ..
only 2 quant categories:
[Figure: "Histogram 2 Categories Liver Ex4"; x-axis: log2(Ratio), -1.4 to 1.4; y-axis: Frequency, 0 to 300; series: StDiet/StDiet and HighFatDiet/StDiet]
What should come out ..
4 quant categories:
[Figure: "Histogram 4 Categories Liver Ex4"; x-axis: log2(Ratio), -2 to 2; y-axis: Frequency, 0 to 400; series: ratio_2 (st/st), ratio_3 (high fat / st), ratio_4 (st/st)]
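One way to derive thresholds from such histograms is to use the spread of the standard-vs-standard ratios as a measure of technical noise. This is a minimal sketch under stated assumptions: a cutoff of mean ± 2 standard deviations of the control log2 ratios (an arbitrary choice, not a Scaffold Q+ setting), and made-up ratio values that are for illustration only.

```python
from statistics import mean, stdev


def regulation_thresholds(control_log2_ratios, n_sd=2.0):
    """Lower/upper log2 cutoffs from the st/st (control) ratio distribution."""
    center = mean(control_log2_ratios)
    spread = stdev(control_log2_ratios)
    return center - n_sd * spread, center + n_sd * spread


def classify(log2_ratio, lower, upper):
    """Label a high fat / standard diet log2 ratio against the cutoffs."""
    if log2_ratio > upper:
        return "up"
    if log2_ratio < lower:
        return "down"
    return "unchanged"


# Hypothetical st/st log2 ratios (illustration only, not the course data)
control = [-0.2, -0.1, 0.0, 0.1, 0.2]
lower, upper = regulation_thresholds(control)
for ratio in (1.0, 0.1, -0.9):
    print(ratio, classify(ratio, lower, upper))
```

The narrower the st/st distribution, the smaller the fold changes that can be called regulated with confidence.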
Regulated Proteins: The List
•
2 ways of making sense out of this data..
•
take the intersection of those 2 lists.. (should be most confident)
2 categories: 44 regulated proteins
4 categories: 48 regulated proteins
Intersection: 37 proteins
Make sense out of Lists:
this does make sense !!
Paint it on Reactome-maps
Scaffold Similarity Window
•
Review and control the peptide/protein mapping
•
View protein groups in which peptides are shared
•
“check” or “uncheck” the valid box for a peptide sequence
•
Peptides identified in particular protein groups are color coded