The Peak Finding Method - Statistical analysis of proteomic mass spectrometry data

2.3 The Peak Finding Method

Assume we have n spectra with k observations (m/z values) in each. Let Yij

be the observed intensity at the ith _{m/z value in the j}th _{spectrum. The aim is}

to model the peaks in the dataset ensuring common peak locations across all the spectra in order to aid the interpretability of the results.

We now outline a method for peak finding. To find the location of the first

peak, find a location i1 such that i1 = argmaxi

j=1Yij. This is equivalent

to finding the largest peak in the mean spectrum. Place a Gaussian kernel of

the form c1jN(µ1, σ21) = c1jf1 at this location. As suggested by the chemistry

(e.g. Hortin, 2006), we model σi to be proportional to µi, the peak location,

remembering that the constant of proportionality ξ still needs to be chosen.

The scaling parameter c1j then needs to be calculated for each spectrum j in

order that the height of the fitted peak matches the height of the data at that point. In order to find the location of subsequent peaks, first form

Xij = Yij−

P X

p=1

cpjfp(i)

where P is the number of peaks already fitted. This effectively ‘subtracts’

the peaks already accounted for in the method. Then find an ij such that

ij = argmax_iPn_j=1Xij, set µj = ij and find the scaling parameters as before.

Note that as we are only considering positive peak heights it is simpler to sum

the Xij instead of their squares. Repeat this algorithm until the desired num-

ber of peaks is fitted. We thus have a two-step procedure - we iteratively find the best position to fit a peak (conditional on the presence of existing peaks) and then calculate the required scaling parameters for the height of this peak in order that the fitted function matches the data exactly at the peak location. We simplify matters in this model by assuming that ξ is a known constant, although a suitable value of ξ can be determined quite easily by trial and

error. The algorithm takes of the order of seconds to find the peaks and so many values of ξ could be tested to find which appears to best match the data. However, in this thesis ξ is chosen by a least squares method minimising the residual error.

There is a potential problem with this method as it stands concerning peaks that overlap significantly. If all the peaks are far apart from each other then the small area in the tails of the distributions should not affect the heights of the other peaks significantly. However, if the peaks are close together then height contributions from the other peaks could affect how well the modelled peak heights actually fit the data. For an example when just two peaks are being fitted see figure 2.2.

6800 6900 7000 7100 0 5 10 15 20 m/z value data without correction with correction

Figure 2.2: Example showing a peak arrangement with extra contribution to peak heights after a second peak has been fitted.

2.3 The Peak Finding Method

In figure 2.2, the peak around m/z value 7,000 is fitted according to the above algorithm. The green line shows the resulting fitted heights at each m/z value when the second peak is fitted. It is seen that there is an extra

contribution to the peak height at, for example, point µ1 = 7, 000. Also the

previously correctly fitted peak at m/z = 7, 000 contributes some extra height

to peak 2 at, for example, point µ2 = 6, 900. This occurs at all points where

the distributions of the fitted peaks overlap and could result in substantially incorrect fitted heights. This phenomenon gets worse the larger the number of fitted peaks. In order to eliminate this problem we need to consider all peaks simultaneously, not separately as suggested above. In fitting the second peak we wish to match the heights of both peak 1 and peak 2. This can be achieved

by solving the following two simultaneous equations in the two unknowns c1j

and c2j:

c1jf1(i1) + c2jf2(i1) = Yi1,j

c1jf1(i2) + c2jf2(i2) = Yi2,j, (2.1)

where j is the spectrum number and ip is the m/z value location of the pth

peak. This procedure is repeated every time a new peak is added, each time solving p equations in p unknowns for each spectrum. This system of equations

can be solved in matrix notation by C = F−1_{Y where C = [c1j} _{. . . cpj],}

Y = [yi1,j . . . yip,j] and F =      

f1(i1) f2(i1) . . . fp(i1)

f1(i2) f2(i2) . . . fp(i2)

.. . ... . .. ... f1(ip) f2(ip) . . . fp(ip)       .

The red line in figure 2.2 shows the fitted heights at each m/z value when this approach has been used. The heights obtained are closer to the data than previously.

The algorithm can be adapted to either end after a specified number of

peaks have been fitted or to continue until the Xij’s are all less than a certain

tolerance value.

In summary, the peak finding algorithm is as follows

1. For each m/z value, add up the intensities for each spectrum in the dataset at that m/z value to obtain a single overall spectrum.

2. Find the m/z value location of the largest peak in the overall spectrum. 3. For each spectrum, place a Gaussian kernel at this m/z value with stan-

dard deviation proportional to its mean.

4. (a) If you are fitting the first peak: For each spectrum, calculate the scaling parameter which will match the height of the Gaussian kernel to the data at that m/z value.

(b) If you are fitting subsequent peaks: For each spectrum, calculate the scaling parameters for all peaks currently identified by the algorithm by solving simultaneous equations.

5. For each spectrum, subtract the modelled peak(s) from the original data. 6. As in step 1, sum the intensities in this subtracted dataset at each m/z

value to obtain a new overall spectrum

7. Repeat steps 2 to 6 until a specified number of peaks have been fitted.

It should be noted that as we add the kth _{peak the algorithm requires}

the inversion of n, k × k matrices. This results in fast calculations for small (< 50) numbers of peaks but results for a larger number of peaks are more computationally expensive. The algorithm is now applied to the breast cancer and melanoma datasets.

In document Statistical analysis of proteomic mass spectrometry data (Page 64-68)