3.3 Materials and methods
3.3.3 SIM array image data collection
The far-western film images and the dried Ponceau-S stained arrays were photographed with a Nikon digital camera to collect peptide spot intensity data. This section describes the methods used to measure and normalise the intensity data from the images of the SIM peptide arrays. As the far-western images represent interaction data, they are referred to hereafter simply as the interaction images. The Ponceau-S stained images, on the other hand, show total protein levels in the arrays and are referred to hereafter as the protein quantity images.
3.3.3.1 Image preprocessing
All preprocessing was carried out in the image processing software Fiji (Schindelin et al., 2012), a distribution of ImageJ for biological image processing. The colour images were converted to tagged image file format (TIFF) and then to 32-bit greyscale (floating point) images. Artefacts such as dust and speckling were removed locally using a noise filter. The images were then rotated so that the peptide spots aligned with the x and y axes of the image; this was required for correct spot identification in subsequent processing. The images were then scaled by 0.5 in both the x and y directions (a 75% reduction in area) using bilinear interpolation.
The background of the interaction images (but not the protein quantity images) was then normalised by generating an approximation of the background and subtracting it from the original image. To generate the background, the intense regions from the interacting spots were filtered out using a low-pass noise filter and the resulting image was smoothed using a Gaussian kernel with a radius of 50 pixels. These images, along with the protein quantity images, were then reduced to 8-bit greyscale images (256 integer scale values) and saved for further processing.
3.3.3.2 Intensity data collection
The processed array images were analysed using a MATLAB tool with a graphical user interface written specifically for the purpose (see appendix C.1 for source code). Centre points for each peptide spot were calculated based on the grid size and spacing. Discs of the same size as the peptide spots were then drawn over each spot and the average intensity of the pixels beneath each disc was calculated. A pixel $(X_i, Y_j)$ in the image was included in the mean calculation for a particular peptide spot if it satisfied the inequality
\[ (X_i - x)^2 + (Y_j - y)^2 \le r^2 \]
where $(x, y)$ is the centre of the disc over the peptide spot and $r$ is its radius. The intensity data for each spot were then matched with the corresponding peptide sequence and annotation.
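The inclusion test above can be sketched in a few lines. The following Python is an illustration only, not the MATLAB tool from appendix C.1. The function names `grid_centres` and `disc_mean`, the row-major list-of-lists image layout and the assumption of a regular grid with uniform spacing are all hypothetical.

```python
def grid_centres(x0, y0, spacing, n_cols, n_rows):
    """Centre points for a regular grid of spots (assumed layout).

    (x0, y0) is the centre of the top-left spot and `spacing` is the
    centre-to-centre distance in pixels.
    """
    return [(x0 + col * spacing, y0 + row * spacing)
            for row in range(n_rows)
            for col in range(n_cols)]


def disc_mean(image, cx, cy, r):
    """Mean intensity of all pixels (X_i, Y_j) satisfying
    (X_i - cx)^2 + (Y_j - cy)^2 <= r^2.

    `image` is a list of rows, so image[Y_j][X_i] is one pixel value.
    """
    total, count = 0.0, 0
    for yj, row in enumerate(image):
        for xi, value in enumerate(row):
            if (xi - cx) ** 2 + (yj - cy) ** 2 <= r ** 2:
                total += value
                count += 1
    if count == 0:
        raise ValueError("disc does not cover any pixel")
    return total / count
```

Scanning the whole image for every spot is wasteful for large images; restricting the loops to the bounding box of each disc would give the same result faster.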
Background data were also collected from the arrays for use in later normalisation steps. The interaction images and quantity images were treated differently in this process because the interaction images had signal that bled into the inter-spot space on the arrays. For the interaction images, 10 evenly distributed points in areas of the image with no interaction signal were selected and the mean and standard deviation of their intensities were calculated. For the quantity images, the intensity of an area next to each spot was measured so that each peptide value had a corresponding background value.
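As a rough illustration of the interaction-image background estimate, the sketch below takes the mean and standard deviation of the intensities at a set of hand-picked points. The function name `background_stats` and the image layout are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
import statistics


def background_stats(image, points):
    """Mean and sample standard deviation of the intensities at `points`.

    `image` is a list of rows (image[y][x]); `points` is a list of (x, y)
    pixel coordinates chosen in regions with no interaction signal.
    """
    values = [image[y][x] for x, y in points]
    return statistics.mean(values), statistics.stdev(values)
```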
3.3.3.3 Data normalisation, scaling and correction
Once the numeric data had been extracted from the array images, all further processing and analysis was performed in R (R Core Team, 2013), and the R package ggplot2 (Wickham, 2009) was used to generate figures.
To remove discontinuities between the different parts of the arrays, intensities were normalised against the background values and rescaled. The approach differed for the interaction and quantity array values. For each part of the interaction arrays, the mean background value, $\mu$, was subtracted from the matrix of intensity values, $I$, and the data were then rescaled between 0 and 1 using the 0th and 97.5th percentiles. The 97.5th percentile was used rather than the maximum to prevent extreme values distorting the data. The normalisation and rescaling steps are given by:
\[ I_{\text{normalised}} = I_{\text{original}} - \mu, \qquad I_{\text{scaled}} = \frac{I_{\text{normalised}}}{P_{97.5}(I_{\text{normalised}})} \]
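The two steps can be sketched as follows. This is an illustration in Python rather than the R code used for the analysis. The names `percentile` and `scale_interaction` are hypothetical, and the percentile is computed by linear interpolation between order statistics, which is an assumption about the estimator. Following the formula literally, values above the 97.5th percentile end up slightly greater than 1; no clipping is applied here.

```python
def percentile(values, p):
    """p-th percentile (0 to 100) using linear interpolation between
    neighbouring order statistics (assumed estimator)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values supplied")
    pos = (len(ordered) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def scale_interaction(intensities, background_mean):
    """Subtract the mean background, then divide by the 97.5th percentile
    of the background-subtracted values."""
    normalised = [v - background_mean for v in intensities]
    p975 = percentile(normalised, 97.5)
    if p975 == 0:
        raise ValueError("97.5th percentile is zero; cannot rescale")
    return [v / p975 for v in normalised]
```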
Normalisation and scaling of the quantity image parts was performed slightly differently. Rather than subtracting a global background mean, the background intensity that had been estimated earlier for each spot was subtracted from the corresponding peptide intensity, $Q$. The resulting values were then scaled between 0 and 1, again using the 0th and 97.5th percentiles. These steps are given by:
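\[ Q_{\text{normalised}} = Q_{\text{original}} - Q_{\text{background}}, \qquad Q_{\text{scaled}} = \frac{Q_{\text{normalised}}}{P_{97.5}(Q_{\text{normalised}})} \]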
The peptide quantity data were then used to rescale the interaction data, essentially correcting intensity values for spots with low peptide levels or no peptide. Spots with peptide levels close to 0 were removed from the data as these could not be used to extrapolate interaction levels.
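A minimal sketch of this correction step, assuming it is a per-spot division of the scaled interaction value by the scaled quantity value. The function name `correct_by_quantity` and the cutoff `min_quantity` are hypothetical; the text only says "close to 0" and does not give the value used.

```python
def correct_by_quantity(i_scaled, q_scaled, min_quantity=0.05):
    """Divide each scaled interaction value by its scaled quantity value.

    Spots whose quantity is at or below `min_quantity` (a hypothetical
    cutoff) are dropped. Returns a dict mapping spot index to the
    corrected interaction value.
    """
    corrected = {}
    for spot, (i, q) in enumerate(zip(i_scaled, q_scaled)):
        if q <= min_quantity:
            continue  # too little peptide to extrapolate an interaction level
        corrected[spot] = i / q
    return corrected
```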
3.3.3.4 Error estimation and data exclusion
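Correcting the interaction values by dividing by the quantity values introduces error into the final result. The 95% confidence interval for this error, $\hat{\epsilon}$, was estimated as
\[ \hat{\epsilon}_i = \frac{1.96\,\sigma}{Q_{\text{scaled},i}} \]
where $\sigma$ is the calculated standard deviation of the background of the scaled intensity images. A cutoff on $I_{\text{scaled}}$ based on $\hat{\epsilon}$ was chosen because $\hat{\epsilon}$ becomes large for small values of $Q_{\text{scaled}}$:
\[ \lim_{Q_{\text{scaled}} \to 0} \frac{1.96\,\sigma}{Q_{\text{scaled}}} = \lim_{Q_{\text{scaled}} \to 0} \hat{\epsilon} = \infty \]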
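To make the behaviour of the error estimate concrete, the sketch below evaluates 1.96σ divided by the scaled quantity for a single spot. The function name `error_estimate` is hypothetical, and returning infinity for a non-positive quantity is a choice made here to mirror the limit, not something stated in the text.

```python
import math


def error_estimate(sigma, q_scaled):
    """Estimated error 1.96 * sigma / q_scaled for one spot.

    `sigma` is the standard deviation of the background of the scaled
    intensity image. A non-positive quantity is mapped to infinity
    (assumption), matching the limit as the quantity approaches 0.
    """
    if q_scaled <= 0:
        return math.inf
    return 1.96 * sigma / q_scaled
```

For a fixed `sigma`, halving the scaled quantity doubles the estimated error, which is why spots with very little peptide dominate the excluded data.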
As the error associated with an intensity value increases, so does the uncertainty about its true value, so a criterion for excluding values on the basis of this error was required. Rather than using a single error threshold to decide which data to exclude, the intensity was also taken into account, as some values with high error may have such a high intensity that they can still be assumed to represent a true interacting spot. An example of this would be a spot with a very low peptide amount that interacts very strongly with SUMO.
As well as the estimated error, the peptide spot intensities from the negative controls had to be included in the criteria for removing data. The intensity of the control data, which were processed in the same manner as the interaction data, is denoted $C$. The Arabidopsis and human data were analysed using two different methods because of very high noise and a high number of non-specific interactions in the human data. For the Arabidopsis data, three thresholds were defined: a maximum negative control intensity threshold ($T_C$), an interaction intensity threshold ($T_I$) and an error threshold ($T_\epsilon$). For a spot to be retained, the negative control value had to be less than $T_C$, and either the intensity value had to be greater than $T_I$ or the error value had to be less than $T_\epsilon$. More formally, these criteria are given by:
\[ (C_i < T_C) \land [(I_i > T_I) \lor (\hat{\epsilon}_i < T_\epsilon)] \]
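The Arabidopsis retention rule translates directly into a boolean expression. The sketch below is illustrative; the function and parameter names are hypothetical, and no threshold values are implied because none are given in this section.

```python
def keep_arabidopsis_spot(c, i, err, t_c, t_i, t_err):
    """Retention rule for one spot.

    c     negative control intensity for the spot
    i     interaction intensity for the spot
    err   estimated error for the spot
    t_c   maximum negative control intensity threshold
    t_i   interaction intensity threshold
    t_err error threshold
    """
    return c < t_c and (i > t_i or err < t_err)
```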
The criteria for determining which human data to retain were more complicated. Rather than only using a cutoff for the negative control values, a spot with a negative control value above $T_C$ could still be retained if the interaction value was higher than the negative control value by an acceptable margin, given by the constant $\alpha$ where $\alpha > 1$. This gives the formula for deciding which human data to keep:
\[ [(C_i < T_C) \lor (I_i > \alpha C_i)] \land [(I_i > T_I) \lor (\hat{\epsilon}_i < T_\epsilon)] \]
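The human-data rule differs only in its first clause, which lets a spot with a high negative control value through when the interaction exceeds the control by the factor α. A minimal sketch, with hypothetical function and parameter names and no claim about the values actually used:

```python
def keep_human_spot(c, i, err, t_c, t_i, t_err, alpha):
    """Retention rule for one human-array spot.

    A spot passes the control test if its negative control value `c` is
    below `t_c`, or if its interaction value `i` exceeds alpha * c.
    It must also pass the intensity-or-error test.
    """
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    passes_control = c < t_c or i > alpha * c
    passes_signal = i > t_i or err < t_err
    return passes_control and passes_signal
```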
3.3.3.5 Partitioning data
Once the data had been normalised and unreliable values removed, they were partitioned into SIMs and non-SIMs by taking an interaction threshold and defining all peptides above this threshold as interactors. This threshold was set above a large cluster of peptide interaction values near 0. For the Arabidopsis data this value was 0.125 and for the human data 0.15.
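The partitioning step can be sketched as a simple threshold split. The two threshold values are taken from the text; the function name, the dictionary layout and the treatment of a value exactly equal to the threshold as a non-SIM (because the text says "above") are assumptions.

```python
# Interaction thresholds stated in the text.
THRESHOLDS = {"arabidopsis": 0.125, "human": 0.15}


def partition_peptides(interactions, threshold):
    """Split peptides into (sims, non_sims) by interaction value.

    `interactions` maps a peptide identifier to its corrected interaction
    value. Values strictly above `threshold` are classed as SIMs.
    """
    sims = {p: v for p, v in interactions.items() if v > threshold}
    non_sims = {p: v for p, v in interactions.items() if v <= threshold}
    return sims, non_sims
```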