5.1 – Project Overview
5.1.1 – Achievements
The aim of this project was to examine the methods of analysis available to researchers in the field of quantitative label-free proteomics, and to explore ways of increasing confidence in experimental results through the parallel use of multiple post-processing software pipelines. To this end, comparisons were made between popular software pipelines, between mass spectrometry instruments, and between different identification thresholds. From these comparisons arose suggested methods for increasing confidence in both the identification of proteins and the differential expression of those proteins between experimental conditions.
5.1.2 – Key points highlighted by this project
5.1.2.1 - There is a great need for representative standard datasets in which the ground truth is known, and for these datasets to be made available to the bioinformatics community via data repositories. This would allow more benchmarking studies of methods for increasing the confidence and reliability of data obtained from various software pipelines and experimental or post-processing methods. In particular, the “background proteins” in standard samples to which a spike-in has been applied must be truly homogeneous across all replicates of the sample, because variation in this background creates uncertainty in the assignment of differential expression profiles (i.e. a differential expression call for a “background protein” cannot be assumed to be a false positive when there are genuine changes in its abundance between samples).
5.1.2.2 - Global normalisation methods do not appear to be suitable for datasets that contain more than one distribution, for example when both host and parasite cells are present, or when proteins are spiked into a background sample. The skewing of differential expression ratios that results from applying global normalisation to such datasets is an important consideration for scientists studying host-parasite or similar systems. A possible way to manage this is suggested within this thesis, namely to apply normalisation to each set of proteins individually (i.e. to host proteins and parasite proteins separately), and this has been used successfully by colleagues on real data.
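The difference between global and per-set normalisation can be illustrated with a minimal sketch. It assumes median-centring of log2 ratios as the normalisation step, which is an assumption made for illustration and not necessarily the method used in this thesis. The function name `median_centre`, the group labels and all values are hypothetical toy data.

```python
from statistics import median


def median_centre(log_ratios, groups=None):
    """Subtract the median log2 ratio, either globally or within each group.

    With groups=None every protein shares one centre (global normalisation).
    With a list of group labels, each group is centred on its own median.
    """
    if groups is None:
        groups = ["all"] * len(log_ratios)
    # Median log2 ratio of each protein set
    centres = {
        g: median([r for r, label in zip(log_ratios, groups) if label == g])
        for g in set(groups)
    }
    return [r - centres[g] for r, g in zip(log_ratios, groups)]


# Hypothetical log2(condition B / condition A) ratios.
# Host proteins are unchanged; parasite proteins share a load-related shift
# of about 1, and the last parasite protein is genuinely more abundant.
log_ratios = [0.1, -0.1, 0.0, 0.2, -0.2, 1.0, 0.9, 1.1, 3.0]
groups = ["host"] * 5 + ["parasite"] * 4

global_norm = median_centre(log_ratios)          # one centre for all proteins
group_norm = median_centre(log_ratios, groups)   # one centre per protein set

print("host median after global normalisation:  ", median(global_norm[:5]))
print("host median after per-set normalisation: ", median(group_norm[:5]))
```

In this toy case the global centre is pulled towards the parasite distribution, so the unchanged host proteins acquire an apparent shift, whereas per-set centring leaves the host median at zero and the genuinely changed parasite protein still stands out within its own set.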
5.1.2.3 - While statistical post-processing, for example using the QPROT tool, can be very successful in improving sensitivity and reducing FDR, the choice of identification threshold prior to quantitative analysis can also have a great impact on the sensitivity and FDR of the quantitative results. Therefore, where possible, these input thresholds should be optimised for the dataset in question. In the studies presented in this thesis, an input threshold of 1% peptide-level FDR appeared to be the most effective in terms of yielding the correct ratios between conditions and maintaining sensitivity without reducing the number of identified proteins below an acceptable level.
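For a spike-in dataset with known ground truth, the sensitivity and FDR of a set of differential expression calls can be computed as in the sketch below. It assumes the usual definitions, sensitivity = TP / (TP + FN) and FDR = FP / (TP + FP), and it treats every background protein called as differentially expressed as a false positive, which only holds when the background is truly unchanged between conditions. The function name and protein identifiers are hypothetical.

```python
def sensitivity_and_fdr(called, truly_changed):
    """Return (sensitivity, FDR) for a set of differential expression calls.

    called:        proteins reported as differentially expressed
    truly_changed: proteins known to differ between conditions (e.g. spike-ins)
    """
    called, truly_changed = set(called), set(truly_changed)
    tp = len(called & truly_changed)   # spiked proteins correctly called
    fp = len(called - truly_changed)   # background proteins called as changed
    fn = len(truly_changed - called)   # spiked proteins that were missed
    sensitivity = tp / (tp + fn) if truly_changed else 0.0
    fdr = fp / (tp + fp) if called else 0.0
    return sensitivity, fdr


# Hypothetical example: four spiked proteins, three recovered, one background call
spiked = {"S1", "S2", "S3", "S4"}
calls = {"S1", "S2", "S3", "B7"}
sens, fdr = sensitivity_and_fdr(calls, spiked)
print(f"sensitivity = {sens:.2f}, FDR = {fdr:.2f}")
```

Repeating this calculation for the calls obtained at each candidate identification threshold gives a simple way to compare thresholds on a benchmark dataset.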
5.1.2.4 - The parallel analysis of a single biological sample on different instrument platforms yields intensity values that are well correlated at the protein level, with slightly lower correlation at the feature level. This correlation is improved (at least for the dataset studied) by using hi3 data rather than total abundance data. This may be because the most abundant peptides are the most likely to be present at the same abundance as the parent protein, while low abundance peptides may skew or confuse the abundance assigned to the parent protein. In addition, while applying quartile thresholds by abundance is effective in improving the correlation between instruments (particularly at the feature level, as the effect is somewhat masked at the protein level), the effect of score thresholds is less predictable, with an increased score threshold actually reducing the correlation at the feature level in some cases. Presumably this is because thresholds above an (unknown) optimum point remove true positives from the result list.
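The suggested explanation for the hi3 result can be illustrated with a minimal sketch. It assumes that hi3 abundance is the mean intensity of the three most intense peptides of a protein and that total abundance is the sum over all peptides; the exact definitions depend on the software used. The function names and intensity values are hypothetical toy data, not results from this study.

```python
def hi3_abundance(peptide_intensities, n=3):
    """Mean intensity of the n most intense peptides (assumed hi3 definition)."""
    if not peptide_intensities:
        raise ValueError("at least one peptide intensity is required")
    top = sorted(peptide_intensities, reverse=True)[:n]
    return sum(top) / len(top)


def total_abundance(peptide_intensities):
    """Sum of all peptide intensities assigned to the protein."""
    return sum(peptide_intensities)


# Hypothetical protein measured on two instruments: instrument B fails to
# detect the three low abundance peptides that instrument A reports.
instrument_a = [900.0, 800.0, 700.0, 50.0, 30.0, 20.0]
instrument_b = [900.0, 800.0, 700.0]

print("hi3:  ", hi3_abundance(instrument_a), hi3_abundance(instrument_b))
print("total:", total_abundance(instrument_a), total_abundance(instrument_b))
```

In this toy case the hi3 values agree between the two instruments while the total abundances differ, because only the total is affected by which low abundance peptides happen to be detected.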
5.1.3 – Limitations
The intention throughout this project was to use standard procedures for all software packages, and therefore it is possible that the results obtained could be improved if the methods used were further optimised for the data being considered. However, the use of standard procedures was considered more representative of the situation in which a lab scientist is working with experimental data where the ground truth is unknown and therefore the opportunity for optimisation is highly limited.
In the assessment of correlation between instruments, there were some differences in the standard chromatographic conditions for the two instruments. In addition, it was necessary to perform the initial analysis and protein inference for the two datasets in separate software environments, because vendor software was required to analyse the MSE data type. It is therefore not possible to exclude these differences as a potential cause of the variation observed between the data output from the two instruments.
5.1.4 – Suggestions for future work
For both the software and instrument comparison studies, it would be beneficial to repeat the work with a dataset that has a robust background that is truly homogeneous across all conditions, and that contains multiple spike-ins or protein populations of known abundance and/or at known ratios between conditions.
For the instrument study it would also be useful to repeat the study using two or more instruments that produce data formats that allow truly parallel post-processing, and using identical chromatographic parameters. It would also be useful to perform the analysis using different samples to ensure that the conclusions drawn are not specific to the single biological system studied.