a distance measure for peptides. Hierarchical clustering was then used to either exclude outlier peptides (that are uncorrelated to all other peptides) or to group genes in order to infer isoforms. However, the correlation coefficient is useless when only two conditions are investigated as in standard SILAC experiments. Also, since it directly compares the XIC of peptides between conditions to compute the correlation coefficient, it cannot make use of more sophisticated ways
to compute intensity ratios as for instance implemented in MaxQuant [Cox and Mann, 2008].
Furthermore, excluding peptides that are uncorrelated to all other peptides from the same gene may not always be appropriate, since such a peptide may be the only one specific to an isoform (e.g. if it is located on a cassette exon).
Our goal in this study was to provide a method that is able to detect outlier peptides in standard SILAC experiments. The proposed method was rigorously tested on in-silico simulated data,
where it could detect outlier peptides with high performance (as measured by an AUC > 0.8)
when the true difference was as small as 1.4-fold. The second goal was to investigate reasons for
outlier peptides in experimental data: given that we have identified a set of genes like those in Figure 7.1,
we want to determine which of the reasons from above play a role in this set.
7.3 Materials and methods
7.3.1 Data processing
Experimental data from [Cox and Mann, 2008], in which EGF-stimulated HeLa cells were compared to control cells using SILAC, were downloaded from ProteomeCommons Tranche.
The data were analyzed using MaxQuant [Cox and Mann, 2008] version 1.2.0.18 (June 2011)
against all proteins downloaded from Ensembl v60 (November 2010). Default parameters were used: Oxidation (M) and Acetylation (N-term) as variable modifications, Carbamidomethylation (C) as fixed modification, reverse peptides as decoy database, and matching between runs in a 2 min retention time window. For all further analyses, we use all unique peptides from evidence.txt (produced by MaxQuant), which contains the quantification events of all identified (and matched) SILAC pairs at an FDR of 1% (according to a decoy database approach). To determine uniquely matching peptides, peptide sequences from evidence.txt were mapped to the human genome using position information obtained via Ensembl Biomart, and only uniquely matching peptides were retained. Gene definitions were also taken from Ensembl, with the modification that overlapping genes were clustered into gene clusters using single linkage (i.e. a peptide mapped to the genome always belongs to a single gene cluster). We will refer to these gene clusters as genes in the following. In order to perform statistical tests on quantifications, we furthermore discard all peptides for which fewer than 3 independent measurements are available.
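The two filtering steps at the end of this paragraph can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the tuple layout of the gene records and the choice to ignore strand when testing for overlap are assumptions. Single linkage on overlapping intervals reduces to merging intervals after sorting them by start position.

```python
from collections import defaultdict


def cluster_overlapping_genes(genes):
    """Group overlapping genes into clusters (single linkage).

    genes: iterable of (gene_id, chromosome, start, end) with inclusive
    coordinates. Strand is ignored here, which is an assumption.
    Returns a list of clusters, each a sorted list of gene ids.
    """
    by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, gene_id))

    clusters = []
    for chrom in sorted(by_chrom):
        current, current_end = [], None
        for start, end, gene_id in sorted(by_chrom[chrom]):
            if current and start <= current_end:
                # Overlaps the running cluster: extend it.
                current.append(gene_id)
                current_end = max(current_end, end)
            else:
                if current:
                    clusters.append(sorted(current))
                current, current_end = [gene_id], end
        if current:
            clusters.append(sorted(current))
    return clusters


def drop_sparse_peptides(measurements, min_count=3):
    """Keep only peptides with at least min_count measurements.

    measurements: dict mapping peptide -> list of fold changes.
    """
    return {p: v for p, v in measurements.items() if len(v) >= min_count}
```

Because a chain of pairwise overlaps ends up in one cluster, every genomic position belongs to at most one cluster, which is what makes the peptide-to-gene assignment unambiguous.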
7.3.2 Detecting outlier peptides
The goal of our method is to distinguish measurement noise from other reasons that lead to peptide fold changes that are different from other measurements from the same gene. This
is based on an important property of typical mass spectrometry experiments: many peptides are identified and quantified multiple times because experiments have been done in replicates, because peptides may have been measured in multiple gel slices (which may have been used for a fractionation step before mass spectrometry) or in multiple charge states. Since all these quantifications are technically independent from each other, we can use them to estimate the quantitative precision. The goal then is to determine peptides that are different from other peptides from the same gene and where this difference cannot be explained by a high quantification variance.
The most basic algorithm first computes all peptide and gene fold changes $p_i$ and $g_k$ by taking
the mean or median of all corresponding measured fold changes. Then, genes are ranked by their maximal absolute peptide-from-gene deviation
\[ d_k = \max \{\, |g_k - p_i| \;:\; \text{peptide } i \text{ uniquely belongs to gene } k \,\} \]
Unfortunately, there are two caveats in such a procedure: First, it is difficult to determine a reasonable cutoff without performing permutation tests and second, it inherently assumes that variance due to noise is equal for all peptides in the dataset. This is certainly not true, since the signal-to-noise ratio depends on the expression level of a gene.
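The basic ranking by maximal peptide-from-gene deviation can be sketched as below. The nested dictionary layout and the function name are assumptions made for illustration, and the sketch assumes that fold changes are given on the log2 scale. Passing `mean` or `median` as `summary` gives the two variants of the procedure.

```python
from statistics import mean, median


def rank_genes_by_deviation(data, summary=median):
    """Rank genes by the maximal absolute peptide-from-gene deviation.

    data: dict mapping gene -> dict mapping peptide -> list of measured
    fold changes (assumed log2). summary is statistics.mean or
    statistics.median.
    Returns a list of (gene, d_k), sorted by decreasing d_k.
    """
    scores = []
    for gene, peptides in data.items():
        # Gene fold change: summary over all measurements of the gene.
        all_values = [v for values in peptides.values() for v in values]
        g_k = summary(all_values)
        # Largest deviation of any peptide fold change from g_k.
        d_k = max(abs(g_k - summary(values)) for values in peptides.values())
        scores.append((gene, d_k))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

The sketch makes the second caveat visible: `d_k` never looks at the spread of the replicate measurements, so a noisy peptide and a precisely measured one are treated alike.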
Therefore, we also adapted a classical ANOVA procedure: for each gene, we fit the linear model
$F_{ij} = g + p_i + \epsilon_{ij}$ to all log2 fold changes of a given gene, where $F_{ij}$ is the $j$th log2 fold change
of a repeatedly measured peptide $i$ of the gene, $g$ is the gene fold change, $p_i$ is the residual peptide
fold change and $\epsilon_{ij}$ is the noise in measurement $i, j$. Residual peptide fold changes that are
significantly different from 0 indicate that this peptide behaves differently from other peptides
from the same gene. Therefore, genes can be ranked using the p-value from an F test or by
$\eta^2 = SS_p / SS_g$ from ANOVA (where $SS_p$ is the within peptide sum-of-squares and $SS_g$ is the within
gene sum-of-squares), a classical measure for effect size [Cortina and Nouri, 2000].
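A one-way ANOVA for a single gene, with peptides as groups, can be sketched as follows. The function name and input layout are assumptions. The effect size is computed here in its textbook form, between-peptide sum of squares divided by the total sum of squares of the gene, which is one reading of the ratio given in the text. The sketch returns the F statistic and its degrees of freedom; the p-value is the upper tail of the F distribution at that statistic (for example `scipy.stats.f.sf(F, df_between, df_within)` if SciPy is available).

```python
from statistics import mean


def one_way_anova(peptides):
    """One-way ANOVA over the peptides of one gene.

    peptides: dict mapping peptide -> list of log2 fold changes.
    Returns (F, df_between, df_within, eta_squared).
    """
    groups = [list(values) for values in peptides.values()]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 peptides and replicate measurements")

    grand_mean = mean(v for g in groups for v in g)  # estimate of g
    # Between-peptide variation: residual peptide effects p_i.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-peptide variation: measurement noise.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ss_total = ss_between + ss_within

    df_between, df_within = k - 1, n_total - k
    if ss_within == 0:
        f_stat = float("inf") if ss_between > 0 else 0.0
    else:
        f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / ss_total if ss_total > 0 else 0.0
    return f_stat, df_between, df_within, eta_sq
```

Since the noise term is estimated separately for each gene, a gene with widely scattered replicates needs a larger peptide effect to reach the same F statistic.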
The ANOVA model estimates noise levels gene by gene and therefore deals with different signal-to-noise ratios across genes. Unfortunately, the signal-to-noise ratio may depend not only on the expression levels of genes, but also on properties specific to peptides (e.g. ionization efficiency). The ANOVA model, however, assumes equal variance across peptides. We therefore
also adapted the heteroscedastic ANOVA from [Krishnamoorthy et al., 2007], which can deal
with different variances.
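To illustrate what dropping the equal-variance assumption means in practice, the sketch below implements Welch's ANOVA. This is a different and simpler heteroscedastic test, used here only as a stand-in; it is not the procedure of [Krishnamoorthy et al., 2007]. The function name and input layout are assumptions. Each peptide mean is weighted by the inverse of its own estimated variance, so noisy peptides contribute less to the statistic.

```python
from statistics import mean, variance


def welch_anova(peptides):
    """Welch's ANOVA (stand-in for a heteroscedastic ANOVA).

    peptides: dict mapping peptide -> list of log2 fold changes,
    each with at least 2 values and non-zero variance.
    Returns (F, df1, df2); the p-value is the upper tail of the
    F distribution with (df1, df2) degrees of freedom.
    """
    groups = [list(values) for values in peptides.values()]
    k = len(groups)
    if k < 2:
        raise ValueError("need at least 2 peptides")
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    s2 = [variance(g) for g in groups]  # per-peptide sample variance
    if any(v == 0 for v in s2):
        raise ValueError("zero variance in a peptide; weights undefined")

    w = [ni / vi for ni, vi in zip(n, s2)]  # precision weights
    w_sum = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_sum

    numerator = sum(
        wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, m)
    ) / (k - 1)
    lam = sum((1 - wi / w_sum) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2
```

With two peptides of equal size and equal variance, the statistic and degrees of freedom coincide with those of the classical ANOVA; the two tests diverge as the per-peptide variances become more unequal.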
Thus, we propose five methods to rank genes: Mean distance and Median distance, corresponding to ranking by the maximal peptide-from-gene deviation; ANOVA F test p-value and ANOVA
$\eta^2$, using the classical ANOVA approach; and the heteroscedastic ANOVA p-value. For further
analyses, we define the outlier peptide of a significant gene as the peptide that has the greatest
absolute difference between its log2 fold change median and the log2 fold change median of the
gene. Note that there may be multiple peptides that have fold changes differing from the gene fold change, but for simplicity we only used a single peptide per gene.
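The definition of the outlier peptide translates into a few lines. The function name and input layout are assumptions, and the gene median is taken here over all measurements of the gene, which is one possible reading of the definition.

```python
from statistics import median


def outlier_peptide(peptides):
    """Return the peptide whose median log2 fold change is farthest
    from the median log2 fold change of the gene.

    peptides: dict mapping peptide -> list of log2 fold changes.
    Ties are resolved in favor of the first peptide encountered.
    """
    gene_median = median(v for values in peptides.values() for v in values)
    return max(
        peptides,
        key=lambda p: abs(median(peptides[p]) - gene_median),
    )
```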
7.3.3 In-silico data generation
For the experimental data, no standard of truth is known, i.e. there is no knowledge about differentially regulated isoforms between stimulated and control HeLa cells. Therefore, we