A straightforward method was sought to compare the results from the different software packages and assess whether there was a consensus in the data. A consensus between packages would make it possible to test the hypothesis that consensus results give a better measure of differential expression than any single package alone. This hypothesis follows the logic behind improving sensitivity in peptide/protein identification through the use of multiple search engines [3, 104], and extends it to quantitative analyses. The methods chosen to investigate this hypothesis were heatmaps of the data and pseudo-ROC plots generated using the QPROT Z-statistic. Both the heatmaps and the pseudo-ROC plots were generated using the R statistical package.
2.3.9.1 - Heatmap generation
Log10 ratios were calculated between the E/B, E/C and E/D conditions and the maps constructed in R using heatmap.2, a function of the gplots package. The hierarchical clustering function (hclust) was used as the distance/linkage algorithm to generate the dendrograms for the heatmaps. Null and infinity values cause problems when calculating dendrograms. A null value occurs when the protein is undetected in both conditions being compared, and an infinity value is reported when the protein is detected in only one condition. In both cases the value was set to zero, and rows containing only zeros were excluded, as no conclusions could be drawn from the data for these proteins. The scale was fixed so that a zero mean gives black, with up-regulation giving green and down-regulation giving red areas in the heatmap.
No optimisation was performed on the parameters used within R to create the dendrograms, and no additional thresholds were applied to exclude unreliable data. The aim was to test how well this method performs without optimisation, which would be difficult for a standard user to carry out in a real situation where the ground truth is not known. The generated heatmaps serve two purposes: they allow the results from different pipelines to be compared in terms of the presence and magnitude of up/down-regulation, and they act as a consensus method, giving greater confidence to those proteins that multiple pipelines assign as similarly differentially expressed.
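The preprocessing described above can be sketched as follows. This is a minimal illustration in Python rather than the R code actually used, and it makes several assumptions: the input layout (a dictionary of per-condition abundances), the encoding of "not detected" as None or 0, and the function names `log10_ratio` and `build_matrix` are all hypothetical.

```python
import math

def log10_ratio(numerator, denominator):
    """Log10 ratio of two abundances, with undefined cases set to 0.0.

    Assumption: None or 0 means the protein was not detected.
    """
    num = numerator or 0.0
    den = denominator or 0.0
    if num <= 0.0 or den <= 0.0:
        # Undetected in both conditions (null) or in only one (infinite):
        # both cases are set to zero, as described in the text.
        return 0.0
    return math.log10(num / den)

def build_matrix(abundances, reference="E", others=("B", "C", "D")):
    """Return {protein: [log10(E/B), log10(E/C), log10(E/D)]}.

    Rows that contain only zeros are excluded, since they carry no
    information about differential expression.
    """
    rows = {}
    for protein, conds in abundances.items():
        row = [log10_ratio(conds.get(reference), conds.get(o)) for o in others]
        if any(value != 0.0 for value in row):
            rows[protein] = row
    return rows

# Hypothetical example values, for illustration only.
example = {
    "P1": {"E": 100.0, "B": 10.0, "C": 100.0, "D": None},
    "P2": {"E": None, "B": None, "C": 0.0, "D": 5.0},
}
matrix = build_matrix(example)
```

Here P1 is kept because its E/B ratio is informative, whereas P2 is dropped because E was never detected and every ratio collapses to zero. Note that a genuine ratio of exactly 1 also yields zero, so it cannot be distinguished from a missing value under this scheme.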
2.3.9.2 - QPROT Z-statistic
This method of comparison used the more stringent thresholds detailed above in 2.3.6.2. The Z-statistic itself is a measure of the distance (in standard deviations) from the mean in normally distributed data. The QPROT tool analyses the global distribution of the data from each software package and thus multiple Z-statistic values are broadly comparable across the different packages. If a given protein had not been measured by a particular package, the Z-statistic for that package was scored as zero to denote that there was no evidence for differential expression.
As the Z-statistic values are comparable across packages, the results from the four software packages were combined by taking the absolute value of the mean Z-statistic for each protein. This brings all data onto the same scale, since the absolute value approximates the global strength of differential expression (either up or down) as calculated by the different packages. These combined values were also plotted on the pseudo-ROC plots to assess whether this combination method improves sensitivity relative to the plots from each individual package.
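The combination step can be illustrated with a short Python sketch. It assumes one Z-statistic per package per protein, held in a dictionary; the package labels and the function name `combined_z` are hypothetical, and this is not the original R code.

```python
def combined_z(z_by_package, packages):
    """Absolute value of the mean Z-statistic across packages.

    A package that did not measure the protein contributes zero,
    denoting no evidence for differential expression.
    """
    values = []
    for package in packages:
        z = z_by_package.get(package)
        values.append(0.0 if z is None else z)
    return abs(sum(values) / len(values))

# Hypothetical package labels and Z values, for illustration only.
packages = ["pkg1", "pkg2", "pkg3", "pkg4"]
score = combined_z({"pkg1": 2.0, "pkg2": 4.0, "pkg3": -2.0}, packages)
```

Because the mean is taken before the absolute value, Z-statistics of opposite sign partly cancel, so a protein on which packages disagree in direction receives a lower combined score than one on which they agree.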
2.4 - Results
Both datasets were analysed using four different software pipelines in order to test how well each software package could detect the known underlying ratios present within the data, to assess the agreement between pipelines, and to estimate the FDR associated with different methods of determining which proteins were differentially expressed between conditions. Heatmaps and pseudo-ROC plots were used to compare the results obtained from the different software packages and to assess any agreement on the direction and magnitude of differential expression between conditions. An attempt was also made to combine the results from all software packages to determine whether a consensus method would improve sensitivity while reducing the number of associated false positives.
2.4.1 – ABRF data
We analysed the ABRF dataset (yellow/red comparison) to see how well the different software packages agreed. First we looked at the proteins that were scored as significantly changing (p<0.05) by each of the software pipelines and were contained within the answer key. Across the different pipelines, over 500 proteins were scored as significantly changing, but only 22 proteins were common to all pipelines, showing that the general agreement between pipelines was poor. The same analysis was repeated for the fold change data; this gave a slight increase in the number of common proteins, but still showed no general agreement in the overall picture. A heatmap of the log ratios of yellow over red for the five pipelines is shown in Figure 15 (at this stage we were also considering the spectral count values reported from Progenesis, although this metric was dropped from further analysis). Two clusters of proteins emerge clearly from the heatmap, one showing up-regulation (green, 14 proteins) and one showing down-regulation (red, 20 proteins). As the published “answer key” does not include differential expression information, the ground truth for the ABRF dataset is unknown. However, the results from the heatmap are likely to be relatively robust, with a lower FDR than that obtained from a single software package: each package is likely to introduce its own set of errors, and any bias introduced by using a single package should be reduced by considering the consensus results. The heatmap also shows a number of proteins for which different packages give opposite results, and there is therefore little confidence in the identification/quantitation of these proteins. It is possible that one package is “correct” and the other “incorrect”, which could be deduced by careful manual analysis, but creating heatmaps of the results as a first pass allows the lab scientist to focus first on those proteins where there is high confidence that differential expression is truly present.
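The idea of separating proteins with agreeing and opposing results can be sketched programmatically. The Python example below is an illustration only and was not part of the analysis described here, which relied on visual inspection of the clustered heatmap. The function name `classify_direction`, the category labels and the `min_support` threshold are all hypothetical.

```python
def classify_direction(log_ratios, min_support=2):
    """Classify one protein from its per-pipeline log ratios.

    Zero values are treated as "no evidence" from that pipeline.
    min_support is a hypothetical threshold: the number of pipelines
    that must agree before a direction is called.
    """
    up = sum(1 for value in log_ratios if value > 0)
    down = sum(1 for value in log_ratios if value < 0)
    if up and down:
        # Pipelines report opposite directions: low confidence.
        return "conflicting"
    if up >= min_support:
        return "up"
    if down >= min_support:
        return "down"
    return "unsupported"

# Hypothetical log ratios from five pipelines, for illustration only.
calls = {
    "P1": classify_direction([0.5, 0.3, 0.0, 0.8, 0.2]),
    "P2": classify_direction([-0.4, -0.1, 0.0, 0.0, 0.0]),
    "P3": classify_direction([0.5, -0.3, 0.0, 0.0, 0.0]),
    "P4": classify_direction([0.5, 0.0, 0.0, 0.0, 0.0]),
}
```

Proteins flagged as "conflicting" correspond to the cases that would need manual inspection, while "up" and "down" mark candidates for first-pass follow-up.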
Following the analysis of the heatmaps generated from the ABRF dataset, it was concluded that the setup was too artificial and that, as the true answer is unknown, no conclusions about software quality can be drawn from this dataset with any confidence. As such, further stages of analysis were not completed for this dataset.