3.4 Datasets and Preprocessing

3.4.2 Data Preprocessing

Preprocessing of the MS data involves a sequence of operations that are essential for analysing the data successfully. These steps include:

[Figure 3.3 panels: "Original Spectrum ID:1" (Mass/Charge (M/Z) vs. Relative Intensity); "Signal ID: 1 Cutoff Freq: 1.083434" (Separation Units vs. Relative Intensity; original samples and up/down-sampled signal); "Signal ID: 1" (Separation Units vs. Relative Intensity; original signal, regressed baseline and estimated baseline points); "Normalized Using the Area Under the Curve (AUC)" (Mass/Charge (M/Z) vs. Relative Intensity).]

Figure 3.3: Preprocessing steps of the low-resolution ovarian cancer dataset. (a) The original spectrum. (b) Baseline adjustment of the first signal. (c) Resampling of this signal. (d) Normalisation of the samples using AUC.

[Figure 3.4 plot, "After Alignment": Mass/Charge (M/Z), Spectrogram Indices (1 to 10), Relative Intensity.]

Figure 3.4: An example of the alignment of 10 spectra of the low-resolution ovarian cancer dataset.

1. Baseline correction and signal filtering: remove noise and baseline artifacts.

2. Peak picking and extraction: find and extract the real peaks corresponding to molecules and remove the peaks that result from instrumental errors.

3. Multiple map alignment: correct the distortion of the retention time and m/z dimension of multiple raw or feature maps.

4. Intensity normalisation: normalise the spectral counts to remove the fluctuation in the intensity values across the different spectra.
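As an illustration of the last step in the list, the following is a minimal sketch of intensity normalisation by the area under the curve. It assumes a trapezoidal estimate of the area and a target area of 1; the function names `auc` and `normalise_auc` and the toy spectra are hypothetical and are not taken from the toolbox used in this work.

```python
def auc(mz, intensity):
    """Trapezoidal area under the intensity curve along the m/z axis."""
    return sum(
        0.5 * (intensity[i] + intensity[i + 1]) * (mz[i + 1] - mz[i])
        for i in range(len(mz) - 1)
    )


def normalise_auc(mz, intensity, target_area=1.0):
    """Rescale the intensities so that the spectrum's area equals target_area."""
    area = auc(mz, intensity)
    if area <= 0:
        raise ValueError("spectrum has non-positive area; cannot normalise")
    factor = target_area / area
    return [y * factor for y in intensity]


# Two toy spectra with the same shape but different overall intensity.
mz = [1000.0, 1001.0, 1002.0, 1003.0]
weak = [1.0, 4.0, 2.0, 1.0]
strong = [3.0, 12.0, 6.0, 3.0]

weak_n = normalise_auc(mz, weak)
strong_n = normalise_auc(mz, strong)
print(weak_n)
print(strong_n)
```

After normalisation the two toy spectra coincide up to floating-point rounding, which is the sense in which the fluctuation in intensity across spectra is removed.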

LC-MS data is a time series of MS spectra. The preprocessing of non-chromatographic MS data shares some steps with that of LC-MS datasets, although the two preprocessing frameworks are not identical. For example, baseline adjustment, filtering and normalisation are the same for both MS and LC-MS datasets. However, the alignment differs: MS data is aligned on the m/z values, whereas LC-MS data is aligned on both the retention time and the m/z values.
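For the MS case, where alignment acts on the m/z axis only, the idea can be sketched as follows. The sketch assumes a linear correction (a scale and a shift of the m/z axis), a known list of reference peak positions, and a plain grid search that scores each candidate by the summed intensity found at the reference positions. These choices, and the names `interp` and `align_mz`, are illustrative assumptions rather than the procedure used in this work. An LC-MS alignment would need an analogous correction along the retention-time axis as well.

```python
from bisect import bisect_left


def interp(x, xs, ys):
    """Linear interpolation of (xs, ys) at x; 0.0 outside the covered range."""
    if x < xs[0] or x > xs[-1]:
        return 0.0
    i = bisect_left(xs, x)
    if xs[i] == x:
        return ys[i]
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def align_mz(mz, intensity, ref_peaks, scales, shifts):
    """Grid search for the scale and shift of the m/z axis that put the most
    intensity on the reference peak positions (scales must be positive)."""
    best_score, best_a, best_b = float("-inf"), 1.0, 0.0
    for a in scales:
        for b in shifts:
            corrected = [a * m + b for m in mz]
            score = sum(interp(p, corrected, intensity) for p in ref_peaks)
            if score > best_score:
                best_score, best_a, best_b = score, a, b
    return [best_a * m + best_b for m in mz], best_a, best_b


# Toy spectrum: two triangular peaks recorded 2 m/z units too high.
mz = [float(m) for m in range(990, 2013)]
intensity = [
    max(0.0, 10.0 - abs(m - 1002.0)) + max(0.0, 10.0 - abs(m - 2002.0))
    for m in mz
]
ref_peaks = [1000.0, 2000.0]                      # assumed true peak positions
scales = [0.999, 1.0, 1.001]
shifts = [float(s) for s in range(-3, 4)]

aligned_mz, scale, shift = align_mz(mz, intensity, ref_peaks, scales, shifts)
print(scale, shift)
```

In this toy case the search recovers a scale of 1.0 and a shift of -2.0, which moves both peaks back onto the reference positions.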

OVA1 and OVA2 datasets: During the MS analysis, the number of features produced may differ between samples. Therefore, the first step is to equalise the number of features across all samples so that every MS spectrum has the same m/z points [133]. This is done using the resampling algorithm in the toolbox. The second step, baseline adjustment, removes the background and chemical noise, which is usually higher at the low-intensity peaks. To estimate the baseline, a window of size 50 m/z for the high-resolution data is passed across the spectra and the minimum values of the m/z ratios are calculated. For the low-resolution data, the window size is set to 500 m/z points. Afterwards, the baseline is regressed and subtracted [133]. The third step is to remove the fluctuation in the m/z values, which occurs due to miscalibration of the machine. The m/z values are aligned by shifting and scaling the m/z axis until the maximum alignment of intensity values is reached. The final step is to remove the variation among the intensity values, which arises from changes in the levels of compounds or, sometimes, in the sensitivity of the machine's detector. This is performed by normalising each spectrum using the area under the curve (AUC). As an example, Figure 3.3 shows the original spectrum and the spectrum after three steps of preprocessing of the low-resolution ovarian cancer dataset. An example of the result of the alignment of 10 spectra is shown in Figure 3.4.
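The first two steps of this paragraph, resampling onto a common m/z grid and windowed-minimum baseline subtraction, can be sketched as below. This is a simplified stand-in, not the toolbox implementation: it assumes linear interpolation for the resampling, a window size given in number of points, and linear interpolation between the window minima in place of the baseline regression. The names `resample` and `subtract_baseline` and the toy data are hypothetical.

```python
from bisect import bisect_right


def resample(mz, intensity, grid):
    """Linearly interpolate a spectrum onto a shared m/z grid.
    Grid points outside the measured range reuse the nearest end value."""
    out = []
    for g in grid:
        if g <= mz[0]:
            out.append(intensity[0])
        elif g >= mz[-1]:
            out.append(intensity[-1])
        else:
            i = bisect_right(mz, g)               # mz[i - 1] <= g < mz[i]
            t = (g - mz[i - 1]) / (mz[i] - mz[i - 1])
            out.append(intensity[i - 1] + t * (intensity[i] - intensity[i - 1]))
    return out


def subtract_baseline(intensity, window):
    """Take the minimum of each window as a baseline point, interpolate the
    baseline between those points, and subtract it from the signal."""
    xs, ys = [], []
    for start in range(0, len(intensity), window):
        chunk = intensity[start:start + window]
        k = min(range(len(chunk)), key=chunk.__getitem__)
        xs.append(float(start + k))               # position of the window minimum
        ys.append(chunk[k])
    positions = [float(i) for i in range(len(intensity))]
    baseline = resample(xs, ys, positions)
    return [y - b for y, b in zip(intensity, baseline)]


# Resampling: a sparse spectrum mapped onto an 11-point common grid.
grid = [float(m) for m in range(1000, 1011)]
common = resample([1000.0, 1005.0, 1010.0], [2.0, 12.0, 2.0], grid)
print(common)

# Baseline: a slowly rising background with one peak in each window.
signal = [5.0, 5.0, 5.0, 25.0, 5.0, 5.0, 7.0, 7.0, 27.0, 7.0, 7.0, 7.0]
corrected = subtract_baseline(signal, window=6)
print(corrected)
```

Once every spectrum shares the same grid, the remaining steps can treat the dataset as a matrix with one row per sample and one column per m/z point.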

PAN dataset: The first step for this dataset is baseline correction. The baseline is estimated by segmenting the whole spectrum into windows of 200 m/z ratio intensities. The mean of the intensity values in each window is then used as a baseline point, and the baseline is regressed using a piecewise cubic interpolation method [49]. The next step is to filter the noise using a Gaussian kernel filter. The last step is the normalisation of the spectra using the AUC method.
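The noise-filtering step can be illustrated with a minimal Gaussian kernel smoother. The kernel width `sigma`, the truncation radius of three standard deviations and the border handling (repeating the nearest sample) are assumptions made for this sketch; the source does not state the filter parameters used for the PAN dataset.

```python
import math


def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights on [-radius, radius], normalised to sum to 1."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(intensity, sigma=2.0, radius=None):
    """Convolve the signal with a Gaussian kernel; borders reuse the nearest sample."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(intensity)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)    # clamp the index at the borders
            acc += w * intensity[j]
        smoothed.append(acc)
    return smoothed


# A single noise spike on a flat signal is spread out and lowered.
spike = [0.0] * 10 + [10.0] + [0.0] * 10
smoothed = gaussian_smooth(spike, sigma=2.0)
print(max(smoothed))
```

Because the kernel weights sum to 1, the smoother leaves a constant signal unchanged and preserves the total intensity of features that lie away from the borders.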

ARC dataset: This dataset was preprocessed by its providers, who averaged the technical repeats to remove the fluctuation among them and then removed the baseline. Afterwards, the signals were smoothed and aligned.

Figure 3.5: Preprocessing of the raw data spectrum of the DSa dataset. (a) The original raw spectrum. (b) Peak extraction step. (c) Alignment step. (d) Filtering step.

DSa dataset: This dataset is an LC-MS dataset, where each sample consists of retention time, m/z ratios and their corresponding intensity [103]. Ideally, the same compounds detected by the same LC-MS should have the same abundances, m/z ratios and retention times, but this is not usu-

In document Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data (Page 90-95)