OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis


Thang V. Pham and Connie R. Jimenez

OncoProteomics Laboratory, Cancer Center Amsterdam, VU University Medical Center

De Boelelaan 1117, 1081 HV Amsterdam, The Netherlands

{t.pham,c.jimenez}@vumc.nl http://www.oncoproteomics.nl/

Abstract. We present a software package for the analysis of MALDI-TOF mass spectrometry data. The software is designed to facilitate a complete exploratory workflow: pre-processing of raw spectral data, specification of study groups for comparison, statistical differential analysis, visualization of peptide peaks, and classification. The software supports various external tools for these tasks. We also pay special attention to the iterative nature of a typical analysis. Finally, we present two proteomics studies where the software has been used for data analysis.

Keywords: data analysis, differential analysis, biomarker discovery, MALDI-TOF, mass spectrometry, OplAnalyzer, proteomics.

1 Introduction

Mass spectrometry is an attractive method in proteomics research because of its ability to identify and quantify a large number of proteins in complex biological samples [1]. However, the pre-processing and analysis of mass spectrometry data are fast becoming a bottleneck in the discovery process. This paper describes OplAnalyzer, a software platform developed in our laboratory that supports the pre-processing and analysis of proteomics mass spectrometry data. Specifically, we deal with MALDI-TOF mass spectrometry, a standard high-throughput platform that can potentially be used for various diagnostic purposes.

There are a number of tasks involved in a typical analysis: pre-processing of raw spectral data, specification of study groups for comparison, statistical differential analysis, visualization of peptide peaks, and classification [2]. Instead of integrating all these components into a single tool for a complete analysis, we develop a flexible platform where various existing tools for different tasks are accommodated. Our design also supports the interactive nature of the analysis process.

Currently, the software supports the analysis of MALDI-TOF MS-1 data only. Tools for the analysis of MS/MS data with protein identification, as well as data from another mass spectrometry platform, namely LC-FTMS, are under active development.

P. Perner and O. Salvetti (Eds.): MDA 2008, LNAI 5108, pp. 73–81, 2008.


Fig. 1. An analysis workflow: (a) data pre-processing; (b) sample grouping; (c) exploratory analysis (differential analysis, visualization, classification); (d) batch processing

The analysis workflow and the system are described in Section 2. In Section 3 we present two proteomics studies where the software has been employed for data analysis.

2 The System

Fig. 1 shows a typical workflow in proteomics mass spectrometry data analysis. The four main steps are: data pre-processing, sample grouping, exploratory analysis, and batch processing.

2.1 Data Pre-processing

The data pre-processing step includes the preparation of metadata and the processing of raw mass spectrometry signals, which consists of peak detection, alignment, normalization, and deisotoping. To facilitate the use of existing tools, we define a common data format between this step and the subsequent steps, which is simply tab-separated text.

For our instrument, a 4800 MALDI-TOF/TOF mass spectrometer (Applied Biosystems, Foster City, USA), we found that the MarkerView software (Applied Biosystems) works well for data produced in the reflectron mode.

For data produced in the linear mode we have implemented a new method. To detect peaks in an individual spectrum, we search for locations of maximal value within a local m/z window. The size of the window is 11 discrete sampling points. This method is similar to the peak detection method employed in [4].
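The windowed local-maximum search can be sketched in Python as follows. This is a minimal illustration, not the OplAnalyzer implementation: the function name `detect_peaks`, the centred window, the skipping of spectrum edges, and the requirement of a strict maximum are all assumptions.

```python
def detect_peaks(intensities, window=11):
    """Return indices whose value is the strict maximum of a centred window.

    Assumptions (not from the paper): the window is centred on the candidate
    point, and points closer than half a window to either edge are skipped.
    """
    half = window // 2
    peaks = []
    for i in range(half, len(intensities) - half):
        neighbourhood = intensities[i - half:i + half + 1]
        value = intensities[i]
        # A strict maximum keeps flat plateaus from producing several peaks.
        if value == max(neighbourhood) and neighbourhood.count(value) == 1:
            peaks.append(i)
    return peaks


# Toy spectrum with two clear maxima (illustrative values only).
spectrum = [0, 1, 2, 3, 4, 9, 4, 3, 2, 1, 0, 1, 2, 3, 4, 5, 7, 5, 4, 3, 2, 1, 0]
print(detect_peaks(spectrum))
```

A wider window suppresses more spurious maxima in noisy regions at the cost of merging closely spaced peaks.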


Fig. 2. Peak alignment. For each common peak pM in the mean spectrum, the closest peak pI in each individual spectrum is located. If the distance d between the two peaks is less than √5, the value at point A is registered for the common peak pM in this particular spectrum. Otherwise, the value at B is registered.

To find peaks that are common to all spectra, we apply peak detection to the mean spectrum, analogously to [5]. Subsequently, peaks in an individual spectrum are aligned to this set of common peaks as follows. For each common peak, its value in an individual spectrum is that of the closest detected peak in that spectrum if the distance between the common peak and the closest peak (along the m/z axis) is less than √5 Da. (A better choice is likely to be based on the actual mass accuracy of the measurement and on the m/z value.) If there is no such peak, the value is simply set to the value of the spectrum at the m/z location of the common peak. Fig. 2 illustrates the procedure. By visual inspection, we found that the quality of our alignment method is comparable to that of the more computationally expensive clustering method in [4] (data not shown).
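The alignment rule can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the function name `align_spectrum` and its arguments are hypothetical, the tolerance is left as a parameter `tol` rather than fixed, and the fallback reads the sampling point nearest to the common peak location.

```python
def align_spectrum(common_mz, mz_axis, intensities, peak_indices, tol):
    """Register one intensity per common peak for a single spectrum.

    common_mz    : m/z locations of the common peaks (from the mean spectrum)
    mz_axis      : m/z value of every sampling point in this spectrum
    intensities  : intensity of every sampling point in this spectrum
    peak_indices : indices of the peaks detected in this spectrum
    tol          : maximum allowed m/z distance (hypothetical parameter)
    """
    aligned = []
    for target in common_mz:
        # Closest detected peak in this spectrum, or None if none was detected.
        closest = min(peak_indices,
                      key=lambda i: abs(mz_axis[i] - target),
                      default=None)
        if closest is not None and abs(mz_axis[closest] - target) < tol:
            aligned.append(intensities[closest])
        else:
            # No peak close enough: read the spectrum at the common location.
            nearest = min(range(len(mz_axis)),
                          key=lambda i: abs(mz_axis[i] - target))
            aligned.append(intensities[nearest])
    return aligned


# Toy example (illustrative values): one detected peak at m/z 1002.
mz_axis = [1000.0 + i for i in range(10)]
intensities = [1, 2, 8, 2, 1, 1, 1.5, 2, 2.5, 3]
print(align_spectrum([1003.0, 1008.0], mz_axis, intensities, [2], tol=2.0))
```

In the toy example the first common peak picks up the detected peak one unit away, while the second has no peak within the tolerance and falls back to the raw value at m/z 1008.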

2.2 Sample Grouping

Typically, researchers are interested in several comparisons in each experiment, for example, comparisons based on gender, age, and clinical outcomes. Also, in an interactive analysis the user might want to modify the sample groups, for instance to include or exclude certain samples. To enable efficient sample grouping, we define a text-based sample selection based on metadata. The strategy is easy to use and particularly suited for batch processing. For example, to specify two groups “Healthy” consisting of samples from healthy individuals and “Cancer” consisting of samples from cancer patients before treatment, the selection is as follows.
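As an illustration only (this is not OplAnalyzer's selection syntax), metadata-driven grouping can be sketched in Python as below. The column names, sample identifiers, and the `select_samples` helper are all hypothetical.

```python
# Hypothetical metadata table: one record per sample.
metadata = [
    {"sample": "S01", "diagnosis": "healthy", "timepoint": "baseline"},
    {"sample": "S02", "diagnosis": "cancer", "timepoint": "pre-treatment"},
    {"sample": "S03", "diagnosis": "cancer", "timepoint": "end-of-treatment"},
    {"sample": "S04", "diagnosis": "healthy", "timepoint": "baseline"},
]

# Each group is declared as a set of metadata conditions that must all hold.
groups = {
    "Healthy": {"diagnosis": "healthy"},
    "Cancer": {"diagnosis": "cancer", "timepoint": "pre-treatment"},
}


def select_samples(rows, criteria):
    """Return the sample names whose metadata match every criterion."""
    return [row["sample"] for row in rows
            if all(row.get(key) == value for key, value in criteria.items())]


selection = {name: select_samples(metadata, criteria)
             for name, criteria in groups.items()}
print(selection)
```

Keeping the group definitions as plain data makes it straightforward to re-run the same comparison after adding or excluding samples.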

Fig. 3. A screenshot of the output of the statistical testing module

2.3 Exploratory Analysis

For data analysis we exploit existing tools in Matlab (The MathWorks, Inc). A typical first step is unsupervised analysis with principal component analysis (PCA) using all peptide intensities. Here all data points are projected onto a two- or three-dimensional space for visualization. The projection does not use any group label information. The purpose is two-fold. First, one can observe whether the data are clustered in a low-dimensional space according to group labels. Second, one can detect possible outliers or unusual patterns in the data by visual inspection.

For differential analysis, we provide interfaces for the t-test, the Mann-Whitney U test, and the Kruskal-Wallis test. The p-values can be adjusted for multiple testing. The peptides are further subjected to intensity filtering, requiring that the median intensity of at least one group be greater than 80 units and that the fold change of the median intensities of the two groups be greater than 1.5. (The numbers can be tuned for each study.) Fig. 3 depicts a screenshot of the result of a comparative study.
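The intensity filter can be sketched in Python as follows. This is a sketch under assumptions: the function name is hypothetical, the fold change is taken as the ratio of the larger to the smaller group median (so the direction of change does not matter), and a zero median is treated as an unbounded fold change.

```python
from statistics import median


def passes_intensity_filter(group_a, group_b, min_median=80.0, min_fold=1.5):
    """Apply the two filtering conditions to one peptide.

    group_a, group_b : intensities of the peptide in the two groups
    min_median       : at least one group median must exceed this value
    min_fold         : the ratio of the group medians must exceed this value
    """
    med_a, med_b = median(group_a), median(group_b)
    if max(med_a, med_b) <= min_median:
        return False
    low, high = sorted((med_a, med_b))
    if low <= 0:
        # Assumption: a zero median counts as an unbounded fold change.
        return True
    return high / low > min_fold


# Illustrative intensities only.
print(passes_intensity_filter([100, 120, 110], [40, 50, 45]))    # strong change
print(passes_intensity_filter([60, 70, 65], [20, 30, 25]))       # too weak
print(passes_intensity_filter([100, 110, 105], [90, 95, 92]))    # small change
```

Both thresholds are exposed as arguments because, as noted above, they are meant to be tuned for each study.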

The candidate peaks are examined visually by spectra overlay. Again, we use the visualization capability of Matlab for this purpose.

Finally, we provide classification model selection with support vector machines [3]. A grid search method is used to find the optimal parameter values. For each point in the grid, the generalization error is estimated either by leave-one-out cross validation or by repeatedly splitting the data randomly into two partitions, one for training and one for testing. The grid point with the lowest estimated generalization error is selected as our model for classification.
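The selection loop (one leave-one-out error estimate per grid point, then the minimum) can be sketched in Python as below. To keep the example free of external dependencies, a k-nearest-neighbour classifier is used as a stand-in for the support vector machine, and the grid runs over k rather than over SVM parameters. The function names and the toy data are hypothetical.

```python
def loocv_error(train_fn, X, y, param):
    """Leave-one-out estimate of the error rate for one grid point."""
    errors = 0
    for i in range(len(X)):
        # Train on everything except sample i, then test on sample i.
        predict = train_fn(X[:i] + X[i + 1:], y[:i] + y[i + 1:], param)
        errors += predict(X[i]) != y[i]
    return errors / len(X)


def grid_search(train_fn, X, y, grid):
    """Return the grid point with the lowest estimated error (first on ties)."""
    return min(grid, key=lambda param: loocv_error(train_fn, X, y, param))


def train_knn(X_train, y_train, k):
    """Stand-in classifier (NOT an SVM): majority vote among k neighbours."""
    def predict(x):
        order = sorted(range(len(X_train)),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(X_train[j], x)))
        votes = [y_train[j] for j in order[:k]]
        return max(set(votes), key=votes.count)
    return predict


# Toy one-feature data with a single mislabelled point at 0.15.
X = [[0.0], [0.1], [0.15], [0.2], [0.3], [1.0], [1.1], [1.2], [1.3]]
y = [0, 0, 1, 0, 0, 1, 1, 1, 1]
best_k = grid_search(train_knn, X, y, grid=[1, 3, 5])
print(best_k)
```

Because `grid_search` only needs a function that trains a model and returns a predictor, the stand-in can be replaced by any classifier with the same interface.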


2.4 Batch Processing

We consider batch processing an important step in data analysis, especially with regard to reproducibility of figures and other results. In addition, batch processing helps produce a large number of figures of peptide peaks in a convenient format for visual examination. Again, we make use of the scripting capability of Matlab for this purpose.

3 Examples

In the following, we describe two studies where the current software has been employed for data analysis.

3.1 Time-Course MALDI-TOF-MS Serum Peptide Profiling of Non-small Cell Lung Cancer Patients Treated with Bortezomib, Cisplatin and Gemcitabine

This study performs serum peptide profiling of non-small cell lung cancer (NSCLC) patients treated with gemcitabine, cisplatin and bortezomib combinations before, during, and at the end of treatment to discover peptide patterns associated with treatment-related effects and clinical outcomes [7].

Fig. 4 shows a three-dimensional PCA plot of serum peptide spectra of 13 healthy individuals and the pre-treatment serum spectra of 27 NSCLC patients.

Fig. 5. (a) Spectra overlay of the eight most differential peaks in the healthy (red) versus NSCLC (blue) comparison according to p-values of the Mann-Whitney U test. All peaks have a p-value less than 0.0001. (b) Heatmap of the 47 differential peaks in the healthy versus NSCLC comparison shown in the natural log scale. The peaks are ordered by median fold change between the two groups.

Here, the MarkerView software was used for preprocessing, resulting in 682 peptide peaks per raw spectrum.

The Mann-Whitney U test was carried out on each of the 682 peptides, resulting in 47 differential peptides. Fig. 5(a) shows the spectra overlay of the eight most differential peaks in the healthy versus NSCLC comparison. Fig. 5(b) shows a heatmap of the 47 differential peaks.

We carried out classification analysis using a support vector machine. A grid search for parameters was employed to find the best model according to leave-one-out cross validation (LOOCV). Using all 682 peptides, a LOOCV accuracy of 93% was achieved. When the 47 peptides selected by the Mann-Whitney U test were used, the LOOCV accuracy was 98% with 100% sensitivity and 96% specificity.

The software has also been used for a large number of other comparisons, such as gender, age, short versus long progression-free survival, and clinical treatment responses.


Fig. 6. Mean spectrum and detected peaks in the 4000-5000 Da range (x-axis: m/z; y-axis: intensity, transformed value)

3.2 Breast Cancer Study with MALDI-TOF Mass Spectrometry Data of Serum Samples

This study is part of the international competition on mass spectrometry proteomic diagnosis [8][9]. The dataset consists of 153 mass spectra of blood samples drawn from control individuals and patients with breast cancer. The aim is to construct a classification rule separating the two groups with a low generalization error.

For this dataset, the baseline correction had been performed by the competition organizer. We used the software to perform further pre-processing: peak detection and alignment. Fig. 6 shows an example of the result of the pre-processing algorithm.

Again, a Mann-Whitney U test was performed to select features that discriminate significantly between the two classes. Furthermore, the Benjamini-Hochberg false discovery rate correction [6] was employed to correct for multiple testing. This resulted in, on average, 117 peaks with a false discovery rate of less than 1%. Fig. 7 shows the distribution of the values of the 16 most discriminative peaks.
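For reference, the Benjamini-Hochberg adjustment [6] can be written in a few lines of Python. This is a generic sketch of the standard step-up procedure, not the code used in the study; the function name and the example p-values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values only.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
# Features whose adjusted value is below the chosen FDR level are kept.
selected = [i for i, q in enumerate(adjusted) if q < 0.03]
print(adjusted, selected)
```

Selecting all features with an adjusted value below a level such as 0.01 controls the expected proportion of false discoveries among the selected peaks at that level.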

We employed grid search with exponential spacing to find the optimal parameter values for support vector machine model selection. The generalization error was estimated by averaging over 200 runs of randomly splitting the given data into two partitions, where the size of the test set is roughly a tenth of the size of the whole dataset. The feature selection was performed anew within each random split, so that fair estimates of classification accuracy were obtained. The final accuracy on a separate validation set of 78 samples was 83%.
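The point about repeating feature selection inside every split can be made concrete with a short Python sketch. The names below are hypothetical, and `fit_best_feature_threshold` is a deliberately simple stand-in (pick the feature with the largest difference in class means, then threshold it) for the Mann-Whitney selection and support vector machine used in the study.

```python
import random


def repeated_holdout_error(X, y, fit, n_runs=200, test_fraction=0.1, seed=0):
    """Average test error over repeated random train/test splits."""
    rng = random.Random(seed)
    n = len(X)
    n_test = max(1, round(n * test_fraction))
    total = 0.0
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        # fit() sees training samples only, so feature selection cannot
        # leak information from the held-out samples.
        predict = fit([X[i] for i in train], [y[i] for i in train])
        total += sum(predict(X[i]) != y[i] for i in test) / n_test
    return total / n_runs


def fit_best_feature_threshold(X_train, y_train):
    """Stand-in model: select one feature on the training data, threshold it."""
    def class_mean(feature, label):
        values = [x[feature] for x, t in zip(X_train, y_train) if t == label]
        return sum(values) / len(values)

    best = max(range(len(X_train[0])),
               key=lambda f: abs(class_mean(f, 1) - class_mean(f, 0)))
    m0, m1 = class_mean(best, 0), class_mean(best, 1)
    cut = (m0 + m1) / 2
    return lambda x: int((x[best] > cut) == (m1 > m0))


# Toy data: feature 0 separates the classes, feature 1 is uninformative.
X = [[0.1 * i, i % 2] for i in range(10)] + [[2 + 0.1 * i, i % 2] for i in range(10)]
y = [0] * 10 + [1] * 10
print(repeated_holdout_error(X, y, fit_best_feature_threshold, n_runs=20))
```

Selecting features once on the full dataset and then splitting would let the test samples influence the model, which tends to make the estimated error optimistic.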


Fig. 7. Panels for the 16 most discriminative peaks, at m/z = 1029.3742, 1030.6579, 1028.9464, 1102.0537, 1021.2623, 980.7667, 1074.3443, 1076.0933, 1022.1147, 1076.5307, 1056.0677, 1017.0058, 1059.9709, 1022.541, 977.0122, and 991.2335


4 Summary

The paper has introduced a software toolbox for the pre-processing and statistical analysis of MALDI-TOF mass spectrometry data. Our current development focuses on support for the analysis of MS/MS data with protein identification and of data from another mass spectrometry platform, namely LC-FTMS.

References

1. Jimenez, C.R., Piersma, S., Pham, T.V.: High-throughput and targeted in-depth mass spectrometry-based approaches for biofluid profiling and biomarker discovery. Biomarkers in Medicine 1(4), 541–565 (2007)

2. Villanueva, J., Martorella, A.J., Lawlor, K., Philip, J., Fleisher, M., Robbins, R.J., Tempst, P.: Serum peptidome patterns that distinguish metastatic thyroid carcinoma from cancer-free controls are unbiased by gender and age. Mol. Cell Proteomics 5, 1840–1852 (2006)

3. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1999)

4. Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Shi, G., Koong, A., Le, Q.-T.: Sample classification from protein mass spectroscopy, by “peak probability contrasts”. Bioinformatics 20(17), 3034–3044 (2004)

5. Karpievitch, Y.V., Hill, E.G., Smolka, A.J., Morris, J.S., Coombes, K.R., Baggerly, K.A., Almeida, J.S.: PrepMS: TOF MS data graphical preprocessing tool. Bioinformatics 23(2), 264–265 (2007)

6. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 57, 289–300 (1995)

7. Voortman, J., Pham, T.V., Knol, J.C., Giaccone, G., Jimenez, C.R.: Time-course MALDI-TOF-MS serum peptide profiling of non-small cell lung cancer patients treated with bortezomib, cisplatin and gemcitabine. In: Proceedings of American Society of Clinical Oncology (ASCO) 2008 Annual Meeting, Chicago, USA (2008)

8. Mertens, B.: International competition on mass spectrometry proteomic diagnosis. Statistical Applications in Genetics and Molecular Biology 7(2), Article 1 (2008)

9. Pham, T.V., van de Wiel, M.A., Jimenez, C.R.: Support vector machine approach to separate control and breast cancer serum samples. Statistical Applications in Genetics and Molecular Biology 7(2), Article 11 (2008)