Chapter 2 Data Collection and Preprocessing
2.2 Methods
2.2.1 Data collection and selection
I primarily searched the GEO (http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) microarray data repositories for microarray gene expression datasets using the keyword “memory and brain”. I also searched the PubMed literature database for relevant studies (Figure 2.1). Careful review of the published articles referencing these data revealed that the goals of the studies varied and that they covered many different learning paradigms, test conditions, and tissue types. To minimize heterogeneity among datasets and to obtain biologically meaningful results, data selection criteria were therefore needed for any downstream analysis, and I followed a conservative data selection process (Table 2.2). I focused on datasets generated from carefully designed behavioral studies of hippocampus-dependent ASLI in male rats (Rattus norvegicus) of the Fischer 344 strain using Affymetrix® expression arrays. The selected studies investigated spatial learning in young, adult, and/or old animals using only the Morris water maze as the training and assessment protocol. Affymetrix raw data (CEL files) for the selected studies were either downloaded directly from the GEO website or obtained through personal communication with the original authors.
2.2.2 Quality control
All arrays were first assessed for image quality using the dChip software (Li and Wong, 2001a) (http://biosun1.harvard.edu/complab/dchip/). Minor contamination present in a few of the arrays was corrected using the built-in image gradient correction algorithm in dChip, which adjusts the background brightness of the contaminated area to a level similar to that of the surrounding clean region.
All subsequent data preparation, preprocessing, and statistical analyses were performed in R (http://cran.r-project.org/, a freely available programming language) using appropriate software packages. Data quality was assessed using RNA degradation ratios and relative log expression (RLE) and normalized unscaled standard error (NUSE) plots, generated with the simpleaffy and affyPLM packages in Bioconductor (http://www.bioconductor.org/) and the RMAExpress software in R, following standard procedures (Bolstad et al., 2005).
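As an illustration of the RLE diagnostic, the following is a minimal sketch in Python (the analyses in this chapter were run in R, and this is not the affyPLM implementation). It assumes a small, made-up matrix of log2 expression values; the function name `rle_values` and the toy data are hypothetical.

```python
from statistics import median

def rle_values(log_expr):
    """Return relative log expression (RLE) values, grouped per array.

    log_expr maps a probe set ID to a list of log2 expression values,
    one value per array (same array order for every probe set).
    """
    n_arrays = len(next(iter(log_expr.values())))
    per_array = [[] for _ in range(n_arrays)]
    for values in log_expr.values():
        ref = median(values)  # median of this probe set across all arrays
        for j, v in enumerate(values):
            per_array[j].append(v - ref)
    return per_array

# Toy data (hypothetical): three probe sets on four arrays;
# the fourth array is shifted upward relative to the others.
toy = {
    "ps1": [7.0, 7.1, 6.9, 8.5],
    "ps2": [5.0, 5.2, 5.1, 6.4],
    "ps3": [9.0, 8.9, 9.1, 10.2],
}
rle = rle_values(toy)
rle_medians = [median(a) for a in rle]  # one summary value per array
```

In a plot of these values, an array whose RLE distribution is centred far from zero or is unusually wide (here the fourth array) would be a candidate for closer inspection.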
Table 2.2 Data selection criteria.
Selection category                        Criteria
Learning paradigm                         Spatial learning
Training and diagnostic protocol          Morris water maze
Species/strain                            Rat (Rattus norvegicus) – male Fischer 344 strain
Age category                              Young            Adult             Old
Age                                       3 – 6 months     9 – 14 months     24 – 26 months
Tissue/RNA                                Hippocampus total RNA
Microarray platform                       Affymetrix®
Microarray experiment and data standard   MIAME (Minimum Information About a Microarray Experiment, http://fged.org/projects/miame/)
2.2.3 Data preprocessing for meta-analysis
I performed an initial evaluation of five normalization methods: MAS5, RMA, MBEI PM only, MBEI PM – MM, and a recently developed single-channel microarray normalization method called SCAN (Piccolo et al., 2012). The question was which normalization method would allow batch effects to be removed most effectively. For this purpose, each dataset was normalized with each of these methods and then subjected to ComBat batch correction. RMA consistently removed batch effects better than the other methods in all five datasets (results not shown) and was therefore chosen for all preprocessing in this research.
The overall data preprocessing steps are shown diagrammatically in Figure 2.1. Within-study normalization and expression measurement were performed using the RMA method (Bolstad et al., 2003) with default options in the affy package in R (Gautier et al., 2004). Within-study batch correction was performed using the ComBat method (Johnson et al., 2007). Array hybridization dates were retrieved from the CEL files and used as processing batches for batch correction. Age and spatial learning impairment were used as covariates. I ensured that each group remained well represented in each study during batch correction, even after removal of bad or outlier arrays.
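The bookkeeping around batch correction (not ComBat itself) can be sketched as follows. This is a minimal Python illustration under stated assumptions: hybridization dates are taken to be already parsed into ISO-formatted strings, the function names are hypothetical, and the threshold of two arrays per group is an assumption chosen for the example, not a value taken from this study.

```python
from collections import Counter

def batches_from_dates(scan_dates):
    """Assign a batch number (1, 2, ...) to each distinct hybridization date."""
    ordered = sorted(set(scan_dates.values()))  # ISO dates sort chronologically
    batch_of_date = {d: i + 1 for i, d in enumerate(ordered)}
    return {array: batch_of_date[d] for array, d in scan_dates.items()}

def underrepresented_groups(groups, kept_arrays, min_per_group=2):
    """List groups with fewer than min_per_group arrays left in the study.

    min_per_group is an assumed threshold used only for illustration.
    """
    counts = Counter(groups[a] for a in kept_arrays)
    return sorted(g for g in set(groups.values()) if counts[g] < min_per_group)

# Toy study (hypothetical array names, dates and group labels).
scan_dates = {
    "a1": "2005-03-14", "a2": "2005-03-14",
    "a3": "2005-03-21", "a4": "2005-03-21", "a5": "2005-03-21",
}
groups = {
    "a1": ("young", "unimpaired"), "a2": ("old", "impaired"),
    "a3": ("young", "unimpaired"), "a4": ("old", "impaired"),
    "a5": ("old", "impaired"),
}
batches = batches_from_dates(scan_dates)

# Suppose array a3 was discarded as an outlier: one group becomes too small.
kept = ["a1", "a2", "a4", "a5"]
too_small = underrepresented_groups(groups, kept)
```

The batch labels and the age and impairment labels would then be passed to the batch correction step as the batch variable and covariates, respectively.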
2.2.4 Combining data for meta-analysis: common probe set approach
A gene can be represented by multiple probe sets, and the same probe set is often associated with different gene symbols over time because of changes or updates in the annotation databases. As a result, gene names or symbols are not reliable identifiers for combining data across microarray platforms. Therefore, to combine data across the two platforms (i.e. RAE230A and RGU34A), I decided to combine data at the probe set level rather than at the gene level.
A common probe set file containing the best-matching pairs of probe sets that represent the same gene on the two chip types (i.e. RGU34A and RAE230A) was downloaded from the Affymetrix website (www.affymetrix.com). Using this file and the genefilter package in R, probe sets from all studies on the two chip types were grouped into three categories: i) rgu_exclusive, probe sets exclusive to the RGU34A chip type; ii) all5_common, probe sets common to all five studies; and iii) rae_exclusive, probe sets exclusive to the RAE230A chip type. Control probe sets and probe sets without any annotation were filtered out.
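The grouping into three categories can be sketched with simple set operations. The following Python snippet is a minimal illustration, not the procedure actually run in R: the function name `partition_probe_sets`, the probe set IDs, and the representation of the common file as a dictionary from RGU34A IDs to RAE230A IDs are all assumptions, and the sketch works at the platform level only.

```python
def partition_probe_sets(rgu_ids, rae_ids, best_match):
    """Split probe sets into platform-exclusive and common categories.

    best_match maps an RGU34A probe set ID to its best-matching
    RAE230A probe set ID (assumed form of the common probe set file).
    """
    # Keep only pairs whose two members are present in the data at hand.
    pairs = {rgu: rae for rgu, rae in best_match.items()
             if rgu in rgu_ids and rae in rae_ids}
    return {
        "rgu_exclusive": sorted(set(rgu_ids) - set(pairs)),
        "all5_common": sorted(pairs.items()),
        "rae_exclusive": sorted(set(rae_ids) - set(pairs.values())),
    }

# Hypothetical probe set IDs, for illustration only.
rgu_ids = {"rgu_1_at", "rgu_2_at", "rgu_3_at"}
rae_ids = {"rae_1_at", "rae_2_at", "rae_9_at"}
best_match = {
    "rgu_1_at": "rae_1_at",
    "rgu_2_at": "rae_2_at",
    "rgu_7_at": "rae_9_at",  # partner rgu_7_at is absent from the data
}
categories = partition_probe_sets(rgu_ids, rae_ids, best_match)
```

Keeping the matched pairs as (RGU34A ID, RAE230A ID) tuples means that expression rows from the two chip types can later be aligned without going through gene symbols.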
2.2.5 Data preprocessing for network analysis
Network analysis and module detection can be severely biased by the presence of outlying microarray samples (Miller et al., 2010; Oldham et al., 2008). It is therefore important to identify and remove such samples from each dataset during preprocessing, prior to network construction. Moreover, it is often useful to restrict network analysis to the most connected genes, because analyzing all genes can become computationally very intensive. For these reasons, data selected for network analysis underwent additional preprocessing steps. All datasets were processed identically for consistency, and the overall process is described as follows.
• Removal of outlier arrays
• Data normalization and batch correction
• Filtering of unwanted probe sets
2.2.5.1 Removal of outlier arrays
For each dataset, the original microarray CEL files were read into R and background corrected using the RMA method in the affy package (http://www.bioconductor.org/packages/release/bioc/html/affy.html), and initial un-normalized expression matrices were created. Outlier samples were removed using the inter-array correlation (IAC) approach described previously (Miller et al., 2010; Oldham et al., 2008). Briefly, IAC was defined as the Pearson correlation coefficient of the expression levels for a given pair of microarrays (using all probe sets). The distribution of IACs within a dataset was visualized as a histogram (frequency plot), while the relationships between arrays were visualized as a dendrogram using average linkage hierarchical clustering with 1 − IAC as the distance metric. Samples with low mean IACs (i.e. arrays with a mean IAC more than two to three standard deviations below the average) and/or samples that exhibited divergent clustering were excluded. This process was repeated until no outlier arrays remained.
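The mean-IAC criterion can be sketched as below. This is a minimal Python illustration, not the R code used in this work: it covers only the standard-deviation rule (not the dendrogram inspection), the function names and the cutoff of two standard deviations are assumptions, and the toy arrays are made up.

```python
from math import sqrt
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_iacs(arrays):
    """Mean inter-array correlation of each array with all other arrays."""
    names = list(arrays)
    return {a: mean([pearson(arrays[a], arrays[b]) for b in names if b != a])
            for a in names}

def remove_outliers(arrays, n_sd=2.0):
    """Repeatedly drop arrays whose mean IAC is > n_sd SDs below average."""
    arrays = dict(arrays)
    removed = []
    while len(arrays) > 2:
        iac = mean_iacs(arrays)
        values = list(iac.values())
        centre, spread = mean(values), stdev(values)
        if spread < 1e-12:  # all arrays equally similar: nothing to flag
            break
        flagged = [a for a, v in iac.items() if v < centre - n_sd * spread]
        if not flagged:
            break
        for a in flagged:
            removed.append(a)
            del arrays[a]
    return arrays, removed

# Toy dataset (hypothetical): seven arrays sharing one expression profile
# (shifted by a constant) and one array with a scrambled profile.
base = [1, 2, 3, 4, 5, 6]
toy = {f"good{k}": [v + k for v in base] for k in range(7)}
toy["bad"] = [6, 1, 5, 2, 4, 3]
kept, removed = remove_outliers(toy)
```

Recomputing the mean IACs after each removal matters because dropping one outlier changes both the average and the standard deviation against which the remaining arrays are judged.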
2.2.5.2 Data normalization and batch correction
Following outlier removal, presence and absence call information for all probe sets was extracted directly from the CEL files using the mas5calls() function in the affy package in R. Probe sets that were called “absent” in more than 90% of the samples were filtered out. Next, RMA quantile normalization was performed on each dataset as described above. Batch effects were removed from each dataset using the ComBat batch correction method, as described for the meta-analysis.
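The absent-call filter amounts to a per-probe-set proportion test. The following minimal Python sketch assumes the detection calls have already been exported as lists of "P" (present), "M" (marginal) and "A" (absent) labels; the function name `filter_absent` and the toy calls are hypothetical.

```python
def filter_absent(calls, max_absent_fraction=0.9):
    """Keep probe sets called absent in at most max_absent_fraction of arrays.

    calls maps a probe set ID to a list of detection calls
    ("P", "M" or "A"), one per array.
    """
    kept = []
    for probe_set, per_array in calls.items():
        n_absent = sum(1 for c in per_array if c == "A")
        # Drop only when MORE than 90% of the samples are absent.
        if n_absent / len(per_array) <= max_absent_fraction:
            kept.append(probe_set)
    return kept

# Toy calls (hypothetical) for three probe sets on ten arrays.
toy_calls = {
    "ps_never": ["A"] * 10,          # absent in 100% of arrays: dropped
    "ps_rare": ["A"] * 9 + ["P"],    # absent in exactly 90%: kept
    "ps_always": ["P"] * 10,         # never absent: kept
}
kept_probe_sets = filter_absent(toy_calls)
```

Marginal calls are counted as not absent in this sketch, which is an assumption; the text only specifies the treatment of “absent” calls.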
2.2.5.3 Filtering of unwanted probe sets
Unwanted probe sets, which include control probe sets and probe sets not associated with known genes, were removed. The genefilter package (http://www.bioconductor.org/packages/release/bioc/html/genefilter.html) was used to keep only the probe sets associated with a gene (i.e. probe sets for which annotation was available). Many genes are represented by more than one probe set. To allow comparison across Affymetrix platforms, only a single probe set was kept for each gene, using a function (CollapseGenesRai(…) in Appendix 6.2.1) modified from Miller et al. (2010). If a gene was represented by two or more probe sets, the probe set with the highest connectivity across samples was kept. The remaining probe sets were used for gene network construction using WGCNA.
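The idea of keeping one probe set per gene can be sketched as follows. This minimal Python illustration is not the CollapseGenesRai function from the appendix: connectivity is assumed here to be the sum of absolute Pearson correlations with all other probe sets, which may differ from the definition used in that function, and all names and data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def connectivity(expr):
    """Assumed connectivity: sum of |correlation| with every other probe set."""
    ids = list(expr)
    return {p: sum(abs(pearson(expr[p], expr[q])) for q in ids if q != p)
            for p in ids}

def collapse_to_genes(expr, gene_of):
    """Return, for each gene, its probe set with the highest connectivity."""
    k = connectivity(expr)
    best = {}
    for probe_set, gene in gene_of.items():
        if gene not in best or k[probe_set] > k[best[gene]]:
            best[gene] = probe_set
    return best

# Toy expression profiles across four samples (hypothetical).
expr = {
    "ps1": [1, 2, 3, 4],
    "ps2": [1, 3, 2, 4],
    "ps3": [2, 4, 6, 8],
    "ps4": [4, 3, 2, 1],
}
gene_of = {"ps1": "geneA", "ps2": "geneA", "ps3": "geneB", "ps4": "geneC"}
chosen = collapse_to_genes(expr, gene_of)
```

In this toy example geneA is represented by ps1 and ps2; ps1 is retained because its profile tracks the other probe sets more closely.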