

1.5 Microarray Data Analysis Software (MAANOVA)

1.5.2 Original MAANOVA Implementation

The original implementation of the MAANOVA package provides the functionality described above. These functions include three microarray quality control visualisations, data transformation via a number of user-selectable methods, fitting of models to the microarray data and identification of differentially expressed genes. So that the improvements made to MAANOVA during the course of this PhD can be more easily identified, and to give an overview of the whole package, this section presents the purpose of each original function and, where appropriate, an example of the output it would produce.

1.5.2.1 Loading Data

Data from the scan-analysis is loaded into MAANOVA as a single matrix stored in a tab-delimited text file. This file, with each probe occupying one line, contains information regarding the name of the probe, its position on the microarrays, the intensity of the fluorescence for each of the channels on each microarray and whether it was flagged during the scan-analysis. For each of the microarrays in the senescence experiment, three columns of information describe the fluorescence of Cy5, the fluorescence of Cy3 and the flag status. Fluorescence is measured from 1 to 65536 and the flag status is greater than zero where a problem was identified.
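As a rough illustration of this file layout, the Python sketch below parses a small tab-delimited table and picks out the flagged probes. It is not MAANOVA code (MAANOVA is an R package); the column names (`ProbeID`, `Row`, `Col`, `Cy5_1`, `Cy3_1`, `Flag_1`) and the two example rows are hypothetical and chosen only to mirror the description above.

```python
import csv
import io

# Hypothetical two-probe, one-microarray file; real column names will differ.
EXAMPLE = (
    "ProbeID\tRow\tCol\tCy5_1\tCy3_1\tFlag_1\n"
    "probe_a\t1\t1\t1200\t600\t0\n"
    "probe_b\t1\t2\t90\t4000\t3\n"
)


def read_probes(text):
    """Parse tab-delimited probe records, one probe per line."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    probes = []
    for row in reader:
        probes.append({
            "id": row["ProbeID"],
            "position": (int(row["Row"]), int(row["Col"])),
            "cy5": float(row["Cy5_1"]),
            "cy3": float(row["Cy3_1"]),
            # A flag greater than zero marks a problem found at scan-analysis.
            "flagged": int(row["Flag_1"]) > 0,
        })
    return probes


probes = read_probes(EXAMPLE)
flagged_ids = [p["id"] for p in probes if p["flagged"]]
```

With several microarrays, the three per-array columns would simply repeat with a different suffix.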

MAANOVA, by default, transforms all the expression data to a log2 scale on loading it. The purpose of this transformation is to provide a reasonable spread of features across the intensity range, to provide a constant variability at all intensity levels, to transform experimental errors into a normal distribution and to transform the intensity distribution to be approximately bell-shaped (Stekel, 2003). The added benefit of using a log2 scale is that a single unit of change on the log2 scale represents a doubling or halving of the absolute gene expression, which is easily comparable between genes. An example of the transformation achieved by applying a log2 of the expression data can be found in Figure 1.5.
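The doubling and halving property can be checked with a few lines of Python. This is only a sketch of the arithmetic; the intensity value is made up and the code is unrelated to MAANOVA's own loading routine.

```python
import math

# Hypothetical raw intensity for one probe, then doubled and halved.
base = 1500.0
doubled = 2 * base
halved = base / 2

# On the log2 scale a doubling is +1 unit and a halving is -1 unit,
# whatever the starting intensity.
step_up = math.log2(doubled) - math.log2(base)
step_down = math.log2(halved) - math.log2(base)
```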

1.5.2.2 GridCheck

GridCheck provides a method of comparing log2 intensity levels between the two channels of each pin-tip group of each microarray by plotting them against each other in a scatterplot. For each CATMA microarray, 48 sub-plots are produced in the same 12 row and 4 column arrangement which is present in the sub-grids of the microarrays. These plots use colours to indicate the status of each probe as defined by the flag of each probe when analysed using ImaGene (BioDiscovery). Where the probe was not flagged, the point is plotted in blue, but where the probe is flagged for any reason, the point is plotted in red. This gives a visual indication of the quality of the pin-tip group, as those with high levels of background or large numbers of poorly printed spots will also have high numbers of flagged probe data.

The plots which are output from GridCheck are presented as a separate plotting window per microarray, displayed directly on the screen. In the case of the example microarray data provided with MAANOVA, this results in 16 sub-plots per window in a 4 × 4 arrangement. However, with 48 sub-plots required to display the CATMA microarray data, these are very cramped and an example of 4 of these sub-plots can be found in Figure 1.6(a). Additionally, no permanent storage of the plots is made available without laboriously saving the contents of each plotting window, still resulting in low resolution plots.

(a) Cy5 vs Cy3 Intensities before log2 Transformation

(b) Cy5 vs Cy3 Intensities after log2 Transformation

Figure 1.5 – Log2 transformation of microarray data allows the data to meet a number of expectations

Transforming raw microarray data (a) by taking a log (base 2) of the intensities in both channels (b), allows the data to meet a number of expectations that will be required to allow further transformation to remove systematic bias introduced from experimental sources. The transformed data provides a reasonable spread of features across the intensity range and a reasonably constant variability at all intensity levels. It can be seen, from the trend line, that a small amount of bias still exists towards Cy5 at higher intensities, but this can usually be removed by further transformation steps.

1.5.2.3 RIPlot

RIPlot is a method for producing scatterplots of the log2 ratio of the two channels for each probe of each microarray against the sum of the log2 intensities. This type of visualisation for two-channel microarrays allows biases imposed by experimental factors to be more easily identified. The plot is essentially identical to those of GridCheck but with a 45° rotation in a clockwise direction, providing the added visual benefit that the ratio is now one of the axes (Cui et al., 2003). Where a bias causes a linear regression of the data to have a gradient other than zero, this is easier to see because the human eye and brain are better at processing horizontal lines than the diagonal lines of GridCheck (Stekel, 2003). Another characteristic of RIPlot scatterplots is that they include the probes of the entire microarray rather than those of specific pin-tip groups, allowing the identification of intensity-specific dye biases. Colours are used within the plots to indicate the flag status of the probes being plotted. Where a probe has not been flagged by ImaGene (BioDiscovery), its point is plotted in blue. If a probe has been annotated with a flag, it is plotted using a red point.
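The two plotted quantities can be sketched as follows. This is an illustration in Python, not the RIPlot source; taking the ratio as Cy5 over Cy3 and the function name `ratio_intensity` are assumptions made here.

```python
import math


def ratio_intensity(cy5, cy3):
    """Return (log2 ratio, summed log2 intensity) for one probe.

    The ratio is taken as Cy5 over Cy3, which is an assumption here.
    """
    r = math.log2(cy5)
    g = math.log2(cy3)
    return r - g, r + g


# Equal channels give a ratio of zero at any intensity, so an unbiased
# array lies along a horizontal line in this plot.
balanced = ratio_intensity(1024.0, 1024.0)
# A four-fold excess of Cy5 at the same summed intensity moves the point
# vertically only.
biased = ratio_intensity(2048.0, 512.0)
```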

The plots output by RIPlot are presented as a separate plotting window per microarray, displayed directly on the screen. This provides a lot of room for the plots, unlike the plots of GridCheck, but still results in many overlapping windows which can only be permanently stored by saving each plotting window by hand. An example of a plot produced by RIPlot can be found in Figure 1.6(b).

1.5.2.4 ArrayView

When analysing datasets such as those of microarray experiments, it can be frustrating to biologists to have converted optical images of microarray scans into numerical data, removing the visual aspect of the data. Yet, before normalisation and annotation provided by the analysis process, there is little information to be gained from the observation of the microarray scans directly. For this purpose, ArrayView produces grid-like false-colour heat maps of the ratios for each probe, maintaining their context by plotting them in the same positions as the probes of the original microarray. This provides a very intuitive interpretation of the ratios and can help to visualise the causes of specific artefacts seen in both GridCheck and RIPlot, before and after data transformation techniques have been used. This can help to identify the efficiency of transformations being applied and complements the plots produced by other functions of MAANOVA. Probes which are higher intensity in one channel are displayed as either a red or green dot dependent on the more intense channel. The colours used to indicate these ratios between the channels form a gradient from red through black and then green in a linear scale with black exactly half way between maximum red and maximum green.

Heat-maps produced by ArrayView are displayed in a separate plotting window per microarray, directly on the screen. This poses a problem for CATMA microarrays because they contain 312 rows of probes which means each probe has only 4 pixels of height on even a very good resolution screen. Additionally, the default dimensions of an R plotting window are approximately square despite the microarray having a 1:3 aspect ratio, resulting in each probe being represented by thin bars rather than more aptly shaped squares. Once correctly dimensioned, each window for each microarray requires saving by hand to permanently capture the heat-maps.

The brightest colours used to plot the heat-map of ArrayView are defined by the greatest ratio between the channels, which, on most occasions, can be found in the probes used to align the microarray grid. This results in gene-specific probe ratios being plotted as artificially low, as seen in Figure 1.6(d), making it impossible to identify outliers. Additionally, because the ratio being plotted as bright red is not the mathematical inverse of the ratio being plotted as bright green, black does not necessarily represent a ratio of one. As such, the plot can appear to demonstrate a dye-bias on those occasions when a ratio of one is represented by a shade of green or red, as can be seen in Figure 1.6(e).
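A small worked example makes the second problem concrete. The sketch below assumes the colour gradient is laid linearly over log2 ratios between the smallest and largest value on the array. That is an interpretation of the description above, not the ArrayView source, and the ratios are invented.

```python
def black_point(log_ratios):
    """Log2 ratio drawn as black when the gradient spans min to max linearly."""
    return (min(log_ratios) + max(log_ratios)) / 2.0


# Hypothetical log2 ratios: three gene-specific probes and one extreme
# alignment probe at +5.
log_ratios = [-1.0, 0.0, 0.5, 5.0]
centre = black_point(log_ratios)

# The ratio shown as black is 2 ** centre, which is not one here.
ratio_shown_as_black = 2.0 ** centre
```

Here black falls at a log2 ratio of 2 (a four-fold ratio), so a probe with a true ratio of one is drawn in colour and the array appears dye-biased.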

1.5.2.5 Data Transformation

TransformMAData is a function provided for the purpose of normalising within-arrays. This is necessary to avoid biases from experimental sources introduced at the time of scanning (Stekel, 2003):

• Cy3 and Cy5 labels may be incorporated into the same DNA sequences in different abundances.

• Cy3 and Cy5 dyes may emit different response wavelengths dependent on their abundance.

• Cy3 and Cy5 emissions may be inconsistently measured by the scanner at different abundances.

• Cy3 and Cy5 may be inconsistently focussed if the microarray is not perfectly horizontal during scanning.

To remove these biases, TransformMAData uses regressions of the data to identify non-conformity with expected ideals and then transforms the data so that they conform with those ideals. In a perfect microarray dataset in which none of the sources of variance above can be found, a linear regression of ratio against intensity would have an intercept of zero and a gradient of zero, and the data would lie along a straight regression line. A number of alternative methods are available to choose from; they are presented by Cui et al. (2003), where the theory of each method and its intended application are explained:

• Shift - is a transformation applied to the raw intensity data prior to being log2 transformed so as to effectively move the origin of the RIPlot along the vertical (log2 ratio) axis and minimise the deviation of the mean log2 ratio from zero across all intensities. This is done by the simple addition of a constant to one channel whilst subtracting the same constant from the other channel:

\[
\begin{cases}
Z_{rk} = \log_2(Y_{rk} + C) \\
Z_{gk} = \log_2(Y_{gk} - C)
\end{cases}
\tag{1.1}
\]

where $C$ is the constant, and $Y_{rk}$ and $Y_{gk}$ are the raw intensity values in the red and green channels of probe $k$, respectively. $Z_{rk}$ and $Z_{gk}$ are therefore the transformed intensity values for each of the channels of probe $k$.

This is appropriate where one channel has a higher intensity across all probes than the other channel, causing the linear regression of $Y_r$ versus $Y_g$ to have a slope $\approx 1$ but an intercept $\neq 0$.

• Linear Log - is a transformation in which the data is separated into a lower proportion of intensities which are to be transformed by an additive linear function and an upper proportion of intensities which are to be transformed by a multiplicative log function. This type of transformation is appropriate for data where the low intensity probes are not affected by an intensity dependent effect, but higher intensity ratios become biased towards one dye due to multiplicative effects that may be introduced by the first three experimental sources of variation shown at the beginning of this section. The transformation is defined by the following functions:

\[
Z_{ik} =
\begin{cases}
\log_2(d_i) - \dfrac{1}{\ln 2} + \dfrac{Y_{ik}}{d_i \ln 2} & Y_{ik} < d_i \\
\log_2(Y_{ik}) & Y_{ik} \geq d_i
\end{cases}
\tag{1.2}
\]

where the index $i$ refers to either channel of the microarray, $d_i$ is the threshold between the linear and the log transformation functions, $Y_{ik}$ refers to the raw intensity of probe $k$ in channel $i$ and hence $Z_{ik}$ is the transformed intensity value for probe $k$ in channel $i$. Whilst $d_i$ is calculated from the distribution of the intensity data, it often lies at a value which places 25-30% of the data below the threshold.

• Linear Log Shift - as its name suggests, is a combination of both the linear log method and the shift method above. The data is first processed by the shift method to minimise the deviation of the mean log2 ratio from zero, and then transformed by the linear log method.

• Global LOWESS (glowess) - is a curve fitting transformation which fits a local regression line to the log2 ratio of the probes via a locally weighted least squares estimate which represents genes not differentially expressed. Locality used in the regression is based upon the log2 intensity of the probes across the entire microarray and therefore the regression is an estimate of the log2 ratio for all probes of similar log2 intensity. The LOWESS regression is, effectively, the fit of many linear regressions to the data over small subsets within that data which are then smoothed into a single curve. Locality defines the window of data over which the linear regressions should be performed and is defined by a span parameter, $\alpha$. Increasing this parameter to 1 effectively provides a single linear regression fit to the data, whilst all values below 1 use subsets of the data to fit the curve for each data point and, as $\alpha$ tends to zero, the curve becomes an exact fit for the data and all ratios would tend towards zero in a LOWESS transformation. The span value provides a tricubic weighting for the adjacent log2 intensities and can be defined, for $\alpha < 1$, as:

\[
\text{weighting} \propto \left(1 - \left(\frac{\text{dist}}{\alpha \cdot \text{maxdist}}\right)^{3}\right)^{3}
\tag{1.3}
\]

where dist and maxdist refer to the numeric difference in the predictor variable (log2 intensity) and the range of that variable respectively. The fitted values are then used as a spot-specific constant to transform the channels of the microarray using the following functions:

\[
\begin{cases}
Z_{rk} = \log_2(Y_{rk}) + \dfrac{C_k}{2} \\
Z_{gk} = \log_2(Y_{gk}) - \dfrac{C_k}{2}
\end{cases}
\tag{1.4}
\]

where $C_k$ is the spot-specific constant obtained from the LOWESS regression, $Y_{rk}$ and $Y_{gk}$ are the raw intensity values of probe $k$ in the red and green channels, respectively, and hence $Z_{rk}$ and $Z_{gk}$ are the transformed intensity of probe $k$ in each channel.

This type of transformation is particularly well suited to microarray data because it is rarely obvious what types of variability exist in the data in order to choose from other transformations, whereas the LOWESS curve fit is driven by the data, causing greater fit where the data is most affected by variability. One criticism is that it is a very strong fitting method and can easily over-fit the curve and substantially reduce the significance of some spot ratios if the span parameter is set too low. As a conservative estimate, 0.1 is the default value for span and should only be decreased if it is the opinion of the user that the data has not been sufficiently transformed to meet the expectations given earlier in this section.

• Joint or Regional LOWESS (rlowess) - is also a curve fitting transformation based around the same theory as shown for the intensity-based LOWESS, except that locality is defined by the combined predictor variables intensity, spot-row and spot-column rather than just the intensity. This has the effect of putting constraints on the LOWESS to give priority to spots in the same, or nearby, rows and columns of the microarray when establishing the curve fit and hence providing spatial awareness to the regression. This type of transformation isolates and rectifies the problem caused by the fourth experimental source of variation described at the beginning of this section.
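The closed-form pieces of the methods listed above (Equations 1.1 to 1.3) can be sketched in Python as follows. This is an illustrative sketch rather than the TransformMAData implementation: the function names are invented here, the signs follow the textual description (a constant added to one channel and subtracted from the other), and setting the tricube weight to zero outside the window is the usual convention, assumed rather than stated in the text.

```python
import math


def shift(y_red, y_green, c):
    """Shift transformation (Eq. 1.1): add c to red, subtract c from green."""
    return math.log2(y_red + c), math.log2(y_green - c)


def linlog(y, d):
    """Linear-log transformation (Eq. 1.2) for one channel with threshold d."""
    if y < d:
        # Linear branch; it meets the log branch at y == d with equal slope.
        return math.log2(d) - 1.0 / math.log(2) + y / (d * math.log(2))
    return math.log2(y)


def tricube_weight(dist, span, max_dist):
    """Tricube weight (Eq. 1.3) up to a constant factor.

    Returning zero outside the window is an assumption of this sketch.
    """
    u = abs(dist) / (span * max_dist)
    if u >= 1.0:
        return 0.0
    return (1.0 - u ** 3) ** 3


# A constant of 28 balances a red channel of 100 against a green of 156.
z_red, z_green = shift(100.0, 156.0, 28.0)

# Just below the threshold the linear branch agrees with the log branch.
below = linlog(63.999999, 64.0)
at_threshold = linlog(64.0, 64.0)

# Weights fall from 1 at the target intensity to 0 at the window edge.
w_centre = tricube_weight(0.0, 0.5, 10.0)
w_edge = tricube_weight(5.0, 0.5, 10.0)
```

The linear-log shift method would apply `shift` first and then the linear-log function to the shifted raw intensities; the LOWESS fit itself is not reproduced here.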

An example of the plots produced by the Joint LOWESS method can be seen in Figure 1.6(c) and these are typical of all the transformation methods. For each microarray, a separate plotting window is presented on screen containing an RIPlot before transformation above, and an RIPlot after transformation below. The red line in the upper plot of Figure 1.6(c) is specific to the LOWESS curve fitting methods and represents the fitted curve. The curve crosses over any probes which are then transformed to a log ratio of zero, effectively defining those genes which are to be considered non-differentially expressed. Once the transformation has occurred, this red line would lie along the length of the horizontal axis at a log ratio of zero.

1.5.2.6 Fitting a Model to the Data

MAANOVA provides a function, fitmaanova, which fits a model of gene expression to the microarray data which have just been quality controlled and transformed to minimise systematic artefacts of the microarrays. This function accepts the transformed data object as an input along with a formula describing the model to fit. The formula is composed of experimental terms specified in the design as factors contributing to gene expression. In other words, the simplest formula is “Array+Dye+Sample”, which would indicate that expression of a particular probe on a particular array for a specified sample labelled with a particular dye can be directly identified by the effect contributed by the Array used for hybridisation, the effect contributed by the Dye used for labelling and the effects contributed by the Sample hybridised to that array and labelled with that dye. The result returned from this is an object which shows all the fitted parameters of the model describing the transformed data for each gene probe on the microarrays. Any additional variability not captured by the model terms is assigned to an error term, which might account for unobserved factors such as small differences in laboratory technique.
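Written out for a single probe, the formula “Array+Dye+Sample” corresponds to an additive model of roughly the following form. The symbols, and the inclusion of a per-probe mean, are notation assumed here for illustration; they are not taken from the MAANOVA documentation.

```latex
y_{gijk} = \mu_g + A_{gi} + D_{gj} + S_{gk} + \varepsilon_{gijk}
```

Here $y_{gijk}$ would be the transformed intensity of probe $g$ measured on array $i$ with dye $j$ for sample $k$; $\mu_g$ is the probe's overall mean; $A_{gi}$, $D_{gj}$ and $S_{gk}$ are the Array, Dye and Sample effects; and $\varepsilon_{gijk}$ is the error term that absorbs variability not captured by the other terms.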

Whilst Array and Dye are essential components of the formula, Sample can be further partitioned into characteristics of each sample, such as the time at which it was collected and/or the treatment it received. In this case, the model returned provides the effects of each characteristic on gene expression as model parameters, allowing the separation of otherwise complex interactions of the model terms. A typical example formula in this case may be “~Array+Dye+Time*Treatment” whereby it might be anticipated that the Time of sample collection will have an effect on gene expression,

(a) GridCheck

(b) RIPlot

(c) Joint LOWESS Transformation

(d) Dark ArrayView

(e) Dye-Biased ArrayView

Figure 1.6 – Examples of graphical output produced by functions of MAANOVA

GridCheck, shown in (a), produces scatterplots of log2 intensities between the two channels of the microarrays. Where each microarray has a large number of sub-grids, the plots are almost too small to read, as shown. RIPlot, shown in (b), produces a
