
Chapter 2 Data Collection and Preprocessing

2.2 Methods

2.2.1 Data collection and selection

I primarily used the GEO (http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) microarray data repositories to search for microarray gene expression datasets using the keyword “memory and brain”. I also used the PubMed literature database to search for relevant studies (Figure 2.1). Careful review of the published articles referencing these data revealed that the goals of these studies varied and that they covered many different learning paradigms, test conditions, and tissue types. It was therefore necessary to establish data selection criteria for downstream analysis, in order to minimize heterogeneity among datasets and to obtain biologically meaningful results. In this research I followed a conservative data selection process (Table 2.2). I focused on datasets generated from carefully designed behavioral studies involving hippocampus-dependent ASLI in male rats (Rattus norvegicus) of the Fischer 344 strain using Affymetrix® expression arrays. The selected studies investigated spatial learning tasks in young, adult, and/or old animals using only the Morris water maze as the training and assessment protocol. Affymetrix raw data (CEL files) for the selected studies were either downloaded directly from the GEO website or obtained through personal communication with the original authors.

2.2.2 Quality control

All arrays were first assessed for image quality using dChip software (Li and Wong, 2001a) (http://biosun1.harvard.edu/complab/dchip/). Minor contaminations present in a few of the arrays were corrected using the built-in image gradient correction algorithm in dChip by adjusting the background brightness of the contaminated area to a level similar to the background of the surrounding clean region.

All subsequent data preparation, preprocessing, and statistical analyses were performed in R (http://cran.r-project.org/, a freely available programming language), using appropriate software packages. Data quality was assessed using RNA degradation ratios, relative log expression (RLE) plots, and normalized unscaled standard error (NUSE) plots, using the simpleaffy and affyPLM packages in Bioconductor (http://www.bioconductor.org/) and the RMAExpress software in R following standard procedures (Bolstad et al., 2005).
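To illustrate what an RLE value measures, the following minimal Python sketch subtracts each probe set's median log2 expression (across arrays) from its values and then summarizes each array by its median RLE. This is an illustration under simplifying assumptions, not the affyPLM procedure used here: it starts from already summarized log2 values, whereas affyPLM works from fitted probe-level models. The function names, probe set IDs, and toy values are all hypothetical.

```python
from statistics import median


def relative_log_expression(log_expr):
    """Subtract each probe set's median (across arrays) from its log2 values.

    log_expr maps a probe set ID to a list of log2 expression values,
    one value per array, in the same array order for every probe set.
    """
    rle = {}
    for probe_set, values in log_expr.items():
        centre = median(values)
        rle[probe_set] = [v - centre for v in values]
    return rle


def per_array_median(rle):
    """Median RLE of each array; values far from 0 suggest a problem array."""
    columns = zip(*rle.values())  # regroup the values array by array
    return [median(col) for col in columns]


# Hypothetical toy data: three probe sets measured on four arrays.
toy_log_expr = {
    "ps1": [8.0, 8.1, 7.9, 9.5],
    "ps2": [5.0, 5.2, 4.9, 6.4],
    "ps3": [10.0, 9.9, 10.1, 11.6],
}
array_medians = per_array_median(relative_log_expression(toy_log_expr))
print(array_medians)  # the fourth array sits well above zero
```

In a good-quality dataset the per-array RLE distributions are expected to be centered near zero with similar spread, so an array like the fourth one in this toy example would be a candidate for closer inspection.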

Table 2.2 Data selection criteria.

Selection category                         Criteria

Learning paradigm                          Spatial learning

Training and diagnostic protocol           Morris water maze

Species/strain                             Rat (Rattus norvegicus) – male Fischer 344 strain

Age category                               Young            Adult            Old

Age                                        3 – 6 months     9 – 14 months    24 – 26 months

Tissue/RNA                                 Hippocampus total RNA

Microarray platform                        Affymetrix®

Microarray experiment and data standard    MIAME (Minimum Information About a Microarray Experiment, http://fged.org/projects/miame/)

2.2.3 Data preprocessing for meta-analysis

I performed an initial evaluation of five normalization methods: MAS5, RMA, MBEI PM only, MBEI PM – MM, and a recently developed single channel microarray normalization method called SCAN (Piccolo et al., 2012). The question was which normalization method would remove batch effects most effectively. For this purpose, each dataset was normalized with each of the above methods and then subjected to ComBat batch correction. RMA consistently removed batch effects better than the other methods in all five datasets (results not shown) and was therefore chosen for all preprocessing in this research.

The overall data preprocessing steps are shown diagrammatically in Figure 2.1. Within-study normalization and expression measurement were performed using the RMA method (Bolstad et al., 2003) with default options in the affy package in R (Gautier et al., 2004). Within-study batch correction was performed using the ComBat method (Johnson et al., 2007). Array hybridization dates were retrieved from the CEL files and used as processing batches for batch correction. Age and spatial learning impairment were used as covariates. I ensured that each group remained well represented in each study during batch correction, even after removal of bad or outlier arrays.
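RMA includes a quantile normalization step, which forces every array in a dataset to share the same distribution of intensities. The following minimal Python sketch shows the idea on toy data. It is not the affy implementation used in this research: it ignores background correction and probe set summarization, and ties are broken by position, which real implementations may handle differently. The function name and values are hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of arrays (one list of values per array).

    Each value is replaced by the mean, across arrays, of the values
    that hold the same rank within their own array.
    """
    n_values = len(columns[0])
    n_arrays = len(columns)
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across arrays at each rank.
    reference = [
        sum(col[rank] for col in sorted_cols) / n_arrays
        for rank in range(n_values)
    ]
    normalized = []
    for col in columns:
        # Indices of this array's values, ordered from smallest to largest.
        order = sorted(range(n_values), key=lambda i: col[i])
        out = [0.0] * n_values
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: three arrays, four probe-level values each.
toy_arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(toy_arrays)
print(normalized)
```

After normalization, every array contains the same set of values, and within each array the original ordering of the values is preserved.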

2.2.4 Combining data for meta-analysis: common probe set approach

A gene can be represented by multiple probe sets, and the same probe set is often associated with different gene symbols because of changes or updates in the databases. As a result, gene names or symbols do not serve as good identifiers for combining data across microarray platforms. Therefore, to combine data across the two platforms (i.e. RAE230A and RGU34A), I decided to combine data at the probe set level rather than at the gene level.

A common probe set file that contains the best matching pairs of probe sets representing the same gene in the two chip types (i.e. RGU34A and RAE230A) was downloaded from the Affymetrix website (www.affymetrix.com). Using this file and the genefilter package in R, probe sets from all studies belonging to the two chip types were merged into three categories: i) rgu_exclusive, probe sets exclusive to the RGU34A chip type; ii) all5_common, probe sets common among all five studies; and iii) rae_exclusive, probe sets exclusive to the RAE230A chip type. Control probe sets and probe sets without any annotation were filtered out.
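The three-way split described above can be sketched with simple set operations. The following Python example is only an illustration of the logic and not the genefilter-based R code used in this research. It assumes that a probe set counts as common when both members of a best-match pair are present on their respective chip types, and that control probe sets can be recognized by the "AFFX" prefix of their IDs. The function name and all probe set IDs are hypothetical.

```python
def categorize_probe_sets(rgu_ids, rae_ids, best_match_pairs):
    """Split probe sets into rgu_exclusive, all5_common and rae_exclusive.

    rgu_ids / rae_ids: probe set IDs available for each chip type.
    best_match_pairs: (rgu_id, rae_id) pairs from the common probe set file.
    """
    # Drop control probe sets (assumed here to carry the AFFX prefix).
    rgu = {p for p in rgu_ids if not p.startswith("AFFX")}
    rae = {p for p in rae_ids if not p.startswith("AFFX")}
    # A pair is usable only if both of its members are actually present.
    common = [(g, a) for g, a in best_match_pairs if g in rgu and a in rae]
    matched_rgu = {g for g, _ in common}
    matched_rae = {a for _, a in common}
    return {
        "rgu_exclusive": sorted(rgu - matched_rgu),
        "all5_common": sorted(common),
        "rae_exclusive": sorted(rae - matched_rae),
    }


# Hypothetical IDs, for illustration only.
toy_rgu = ["rgu_0001_at", "rgu_0002_at", "AFFX-ctrl_at"]
toy_rae = ["rae_0001_at", "rae_0003_at", "AFFX-ctrl_at"]
toy_pairs = [("rgu_0001_at", "rae_0001_at"), ("rgu_0009_at", "rae_0003_at")]

categories = categorize_probe_sets(toy_rgu, toy_rae, toy_pairs)
print(categories)
```

In this toy example the second pair is not usable because its RGU34A member is absent, so its RAE230A member falls into the rae_exclusive category.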

2.2.5 Data preprocessing for network analysis

Network analysis and module detection can be severely biased by the presence of outlying microarray samples (Miller et al., 2010; Oldham et al., 2008). It is therefore important to identify and remove such samples from each dataset during preprocessing, prior to network construction. Moreover, it is often useful to restrict network analysis to a reduced set of genes (the most connected genes); otherwise the analysis may become computationally very intensive. Therefore, data selected for network analysis underwent additional preprocessing steps. All datasets were processed identically for consistency, and the overall process is described as follows.

• Removal of outlier arrays

• Data normalization and batch correction

• Filtering of unwanted probe sets

2.2.5.1 Removal of outlier arrays

For each dataset, the original microarray CEL files were read into R and background corrected using the RMA method in the affy package (http://www.bioconductor.org/packages/release/bioc/html/affy.html), and initial unnormalized expression matrices were created. Outlier samples were removed using the inter-array correlation (IAC) approach described previously (Miller et al., 2010; Oldham et al., 2008). Briefly, IAC was defined as the Pearson correlation coefficient of the expression levels for a given pair of microarrays (using all probe sets). The distribution of IACs within a dataset was visualized as a histogram (frequency plot), while the relationships between arrays were visualized as a dendrogram using average linkage hierarchical clustering with 1 − IAC as the distance metric. Samples with low mean IACs (i.e. arrays whose mean IAC was more than two to three standard deviations below the average) and/or samples that exhibited divergent clustering were excluded. This process was repeated until no outlier arrays remained.
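The numerical part of the IAC procedure can be sketched as follows. This minimal Python example computes the mean IAC of each array, flags arrays whose mean IAC lies more than a chosen number of standard deviations below the average, and repeats until nothing is flagged. It is an illustration only: the histogram and dendrogram inspection described above are omitted, the threshold of two standard deviations is just one value from the range given in the text, and the function names, array names, and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_iac(arrays):
    """Mean correlation of each array with all other arrays."""
    names = list(arrays)
    return {
        n: mean([pearson(arrays[n], arrays[m]) for m in names if m != n])
        for n in names
    }


def remove_outlier_arrays(arrays, n_sd=2.0):
    """Repeatedly drop arrays whose mean IAC is > n_sd SDs below average."""
    arrays = dict(arrays)  # work on a copy
    removed = []
    while len(arrays) > 3:  # too few arrays to judge beyond this point
        iac = mean_iac(arrays)
        values = list(iac.values())
        centre, spread = mean(values), stdev(values)
        if spread == 0:
            break
        flagged = [n for n, v in iac.items() if (v - centre) / spread < -n_sd]
        if not flagged:
            break
        for name in flagged:
            del arrays[name]
            removed.append(name)
    return arrays, removed


# Hypothetical toy data: five similar arrays and one divergent array (a6).
toy_arrays = {
    "a1": [2.0, 4.0, 6.0, 8.0, 10.0],
    "a2": [2.1, 4.0, 6.2, 7.9, 10.1],
    "a3": [1.9, 4.2, 5.9, 8.1, 9.8],
    "a4": [2.2, 3.9, 6.1, 8.0, 10.2],
    "a5": [2.0, 4.1, 5.8, 8.2, 9.9],
    "a6": [9.0, 3.0, 7.0, 2.0, 5.0],
}
kept, removed = remove_outlier_arrays(toy_arrays)
print(removed)
```

A purely numerical rule of this kind is sensitive to the number of arrays, because a sample can only lie so many standard deviations from the mean in a small dataset. That is one reason to read it together with the clustering pattern rather than on its own.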

2.2.5.2 Data normalization and batch correction

Following outlier removal, presence and absence call information for all probe sets was extracted directly from the CEL files using the mas5calls() function in the affy package in R. Probe sets that were called “absent” in more than 90% of the samples were filtered out. Next, RMA quantile normalization was performed on each dataset as described before. Batch effects were removed from each dataset using the ComBat batch correction method as described for the meta-analysis.
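The absent-call filter can be illustrated with a short Python sketch. It assumes that detection calls are available as "P" (present), "M" (marginal), or "A" (absent) for each probe set and sample, which is the form of output produced by mas5calls(). The function name, probe set IDs, and toy calls are hypothetical, and the sketch is not the R code used in this research.

```python
def filter_absent_probe_sets(calls, max_absent_fraction=0.9):
    """Keep probe sets that are NOT called absent in more than 90% of samples.

    calls maps a probe set ID to a list of detection calls ("P", "M" or "A"),
    one call per sample.
    """
    kept = []
    for probe_set, sample_calls in calls.items():
        n_absent = sum(1 for call in sample_calls if call == "A")
        if n_absent / len(sample_calls) <= max_absent_fraction:
            kept.append(probe_set)
    return kept


# Hypothetical toy calls for three probe sets across ten samples.
toy_calls = {
    "ps_all_absent": ["A"] * 10,             # absent in 100% of samples
    "ps_mostly_absent": ["A"] * 9 + ["P"],   # absent in exactly 90%
    "ps_present": ["P"] * 8 + ["M", "A"],    # absent in 10%
}
kept_probe_sets = filter_absent_probe_sets(toy_calls)
print(kept_probe_sets)
```

Because the criterion is "more than 90%", a probe set that is absent in exactly 90% of the samples is retained in this sketch.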

2.2.5.3 Filtering of unwanted probe sets

Unwanted probe sets, namely control probe sets and those not associated with known genes, were removed. Next, the genefilter package (http://www.bioconductor.org/packages/release/bioc/html/genefilter.html) was used to keep only the probe sets that were associated with a gene (i.e. probe sets for which annotation was available). Many genes are represented by more than one probe set. To allow comparison across Affymetrix platforms, only a single probe set was kept for each gene, using a function (CollapseGenesRai(…) in Appendix 6.2.1) modified from Miller et al. (2010). If a gene was represented by two or more probe sets, the probe set with the highest connectivity across samples was kept. The remaining probe sets were used for gene network construction using WGCNA.
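The idea of keeping one probe set per gene on the basis of connectivity can be sketched as follows. This Python example is not a transcription of CollapseGenesRai. In particular, the connectivity definition (the sum over all other probe sets of the absolute Pearson correlation raised to a power) and the power of 6 are assumptions borrowed from common weighted-network practice. The function names, gene names, probe set IDs, and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def connectivity(expr, power=6):
    """Sum of |r|**power between each probe set and every other probe set."""
    names = list(expr)
    return {
        p: sum(abs(pearson(expr[p], expr[q])) ** power for q in names if q != p)
        for p in names
    }


def collapse_to_genes(expr, gene_of, power=6):
    """For each gene, keep the probe set with the highest connectivity."""
    k = connectivity(expr, power)
    best = {}
    for probe_set, gene in gene_of.items():
        if gene not in best or k[probe_set] > k[best[gene]]:
            best[gene] = probe_set
    return best


# Hypothetical toy data: four probe sets measured across five samples.
toy_expr = {
    "psA1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "psA2": [3.0, 1.0, 4.0, 1.0, 5.0],
    "psB1": [2.0, 4.0, 6.0, 8.0, 10.5],
    "psC1": [5.0, 4.0, 3.0, 2.0, 1.2],
}
toy_gene_of = {"psA1": "GeneA", "psA2": "GeneA", "psB1": "GeneB", "psC1": "GeneC"}

selected = collapse_to_genes(toy_expr, toy_gene_of)
print(selected)
```

In this toy example psA1 tracks the other probe sets closely while psA2 does not, so psA1 is the probe set retained for the hypothetical GeneA.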
