Computational Methods – Materials and Methods

Chapter 2 – Materials and Methods

2.1 Computational Methods

Computational analyses were performed in either R (v2.15.1) or Matlab (2011b/2012b), depending on the language in which the source code was originally written. Graphs produced in R were created with the ggplot2 package (Wickham 2009). Unless otherwise stated, the latest available versions of software packages were used.

2.1.1 – BioConductor

Microarray analysis was carried out using BioConductor v2.16.0 (Gentleman et al. 2004) for R. The CEL files downloaded from NASC (Scholl et al. 2000) were loaded with the ReadAffy() function, which reads all of the .CEL files in the working directory into a single object. This object was then normalised using GCRMA v2.28.0 (Wu et al. 2005), which normalises probe intensities both between arrays and across each array, and also corrects the reported intensity according to the proportion of GC to AT residues in each probe. In addition, mas5calls (Gautier et al. 2004) was run on the arrays to determine which genes were significantly detected in the experiment (present/absent call). Genes whose intensity was not significant compared to the added controls, and was therefore likely the result of background luminescence, were removed.
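The present/absent filtering step can be illustrated with a minimal sketch. It is written in Python rather than the R used for the analysis, and it is not the mas5calls implementation. The function name filter_detected, the probe identifiers, and the rule "keep a probe set if it is called present on at least min_present arrays" are assumptions made for illustration, as the exact rule is not stated above.

```python
def filter_detected(calls, min_present=1):
    """Return probe-set IDs called present ("P") on at least min_present arrays.

    calls: dict mapping probe-set ID -> list of detection calls, one per array,
           using the labels "P" (present), "M" (marginal) and "A" (absent).
    """
    kept = []
    for probe_id, per_array in calls.items():
        n_present = sum(1 for call in per_array if call == "P")
        if n_present >= min_present:
            kept.append(probe_id)
    return kept


# Hypothetical calls for three probe sets across three arrays.
example_calls = {
    "probe_1": ["P", "P", "P"],
    "probe_2": ["A", "A", "M"],
    "probe_3": ["A", "P", "A"],
}
detected = filter_detected(example_calls, min_present=2)
```

Raising min_present makes the filter stricter; with the default of 1, a probe set is kept if any array detects it.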

2.1.2 – BioDare

Pre-processed Luciferase (2.2.6) and Delayed Fluorescence (2.2.5) results were stored in a large database called BioDare (Moore et al. 2014). This online resource provides a central repository for all of the ROBuST data. It also acts as an interface to multiple normalisation and analysis techniques. Throughout this project I used it to store time course data, to detrend the data, and to normalise gene expression values.

Detrending removed the cumulative luminescence seen in time course data, so that the oscillations, rather than a luciferase effect, became the major component of the curves. Similarly, the normalisation technique scaled each gene to oscillate between 0 and 1 initially, whilst maintaining the damping seen at later time points. This allowed multiple genes to be compared easily with each other without amplitude being treated as a major difference.
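The two steps can be sketched as follows. This is a Python illustration under stated assumptions, not the BioDare implementation: the trend is assumed to be linear and is removed by least squares, and the scaling uses the minimum and maximum of the first 24 h so that later cycles keep their relative (damped) size. The function names and the example series are hypothetical.

```python
import math


def linear_detrend(times, values):
    """Subtract the least-squares straight line from a time series."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return [v - (intercept + slope * t) for t, v in zip(times, values)]


def scale_to_first_cycle(times, values, cycle_hours=24.0):
    """Scale so the first cycle spans 0 to 1; later cycles keep their relative size."""
    start = times[0]
    first_cycle = [v for t, v in zip(times, values) if t - start < cycle_hours]
    lo, hi = min(first_cycle), max(first_cycle)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical hourly series: linear drift plus a damped 24 h oscillation.
times = [float(t) for t in range(72)]
raw = [
    100.0 + 2.0 * t + 50.0 * math.exp(-t / 48.0) * math.cos(2.0 * math.pi * t / 24.0)
    for t in times
]
scaled = scale_to_first_cycle(times, linear_detrend(times, raw))
```

Because only the first cycle defines the scale, a gene whose rhythm damps over time still shows smaller peaks in later cycles after normalisation.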

2.1.3 – ReTrOS

Luciferase expression time series were passed through ReTrOS (Costa et al. 2014) to remove the effects of the LUC reporter gene. Because the reporter is measured through protein activity, the light captured in the screen differs slightly from the underlying mRNA expression. Additionally, the delay between the actual mRNA level and the recorded expression is temperature dependent. Using this software allows expression profiles recovered at different temperatures to be compared more accurately. ReTrOS performs a back-calculation that subtracts this translation effect from the measured luminescence. The subtraction is based on the temperature at which the experiment was conducted, as verified by experimental investigation.

2.1.4 – Cluster Methods

SplineCluster (Heard et al. 2006) was downloaded from http://www2.imperial.ac.uk/~naheard/splinecluster/ and executed using Cygwin. This process was automated using a simple R script, provided by the University of Warwick, that pre-processed the data, submitted the cluster command, and then ran the graphing script. This script was later adapted to also collate the results into a single file. Clustering was performed using default parameters with the exception of the P-value, which was set to 0.0001 after this was found to produce a more manageable number of clusters. FFT-spline (Liverani et al. 2009) was similarly run using Cygwin and a second R wrapper script. Default settings were again used, as the submitted data set was structured similarly to the one previously used to optimise the algorithm.

2.1.5 – Gene Function Analysis

MAPMAN v3.5.1R2 (Thimm et al. 2004) was downloaded from its website, and analysis was performed using the automatically installed figures and the appropriate ATH1 array information. BiNGO v2.44 (Maere et al. 2005) analysis was performed in Cytoscape, using the plant GO slim file and the entire Arabidopsis genome as the background. Differentially expressed gene lists were submitted sequentially for analysis. For both methods, a Benjamini-Hochberg false discovery rate correction was applied within the software to reduce false positives.
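For reference, the Benjamini-Hochberg adjustment can be sketched in a few lines. This is a generic Python illustration of the standard procedure, not the code used inside MAPMAN or BiNGO; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four functional categories.
example_p = [0.01, 0.04, 0.03, 0.005]
example_adjusted = benjamini_hochberg(example_p)
```

A category is then reported as enriched when its adjusted value falls below the chosen false discovery rate.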

2.1.6 – Network Inference and Modeling

Network inference was performed using the VBSSM and CSI software, both made available by the University of Warwick. VBSSM was run primarily through a GUI developed at Warwick, although raw data were exported for analysis. The software was initially run by loading the normalised, detrended data for the genes of interest and using the default parameters. Variation between seeds was investigated by increasing the number of times the algorithm iterated. The number of hidden states, K, was determined by the software, which tested K values between 1 and 20 and selected the value with the optimal likelihood.

CSI exists purely at the code level, without any user interface. The EM algorithm was chosen for all experiments following the advice of its creator, Chris Penfold. By adapting the header code, CSI was able to accommodate the larger data sets being used in this investigation. This adaptation involved adjusting the variables that inform the code of the number of repeats, the number of genes, and the number of time points. Additionally, the code generating possible parental sets (PaSet) was adapted to run on the Liverpool Matlab license by replacing the combntns() command with nchoosek(). These two commands are essentially identical: both produce every unique combination of x items from a list of y items. This code was also modified experimentally to investigate how many simultaneous connections should be considered.
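The enumeration of parental sets can be illustrated with a short sketch. It is written in Python, where itertools.combinations plays the role that nchoosek() plays in Matlab, and it is not the CSI code itself. The function name parent_sets, the gene labels, and the choice to start at sets of size 1 are assumptions for illustration.

```python
from itertools import combinations


def parent_sets(candidates, max_parents):
    """List every combination of 1 to max_parents candidate regulators."""
    sets = []
    for size in range(1, max_parents + 1):
        sets.extend(combinations(candidates, size))
    return sets


# Hypothetical candidate regulators of one target gene.
genes = ["A", "B", "C", "D"]
pairs = list(combinations(genes, 2))   # every unique pair of regulators
up_to_two = parent_sets(genes, 2)      # singles and pairs together
```

The number of sets grows quickly with max_parents, which is why the number of simultaneous connections considered has a large effect on run time.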

Both software packages were run using Matlab 2011b or 2012b. Matlab 2012a did not support some of the required functions, and there was no difference between the output of Matlab 2011b and 2012b. Because of this, calculations performed using the older version of Matlab did not have to be repeated on the newer version. Iterations of the network inference software were run on a server so that the calculations could be performed in parallel, reducing the overall run time. Models and simulations were created and run in Matlab 2012b. Simulations used a probabilistic Boolean algorithm (Savage et al. 2008).
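To give a sense of what a probabilistic Boolean simulation involves, the following is a minimal generic sketch in Python. It is not the algorithm of Savage et al. (2008) and not the Matlab code used here. It assumes a synchronous update in which each gene has one or more candidate Boolean rules, one of which is drawn at each step according to its probability. The function name, the two-gene network, and the probabilities are hypothetical.

```python
import random


def simulate_pbn(rules, initial_state, n_steps, rng):
    """Synchronously simulate a probabilistic Boolean network.

    rules: dict mapping node -> list of (probability, function) pairs. Each
           function takes the current state dict and returns True or False.
    Nodes without rules keep their current value.
    """
    state = dict(initial_state)
    trajectory = [dict(state)]
    for _ in range(n_steps):
        new_state = dict(state)
        for node, options in rules.items():
            weights = [p for p, _ in options]
            functions = [f for _, f in options]
            # Draw one candidate rule for this node according to its probability.
            chosen = rng.choices(functions, weights=weights, k=1)[0]
            new_state[node] = bool(chosen(state))
        state = new_state
        trajectory.append(dict(state))
    return trajectory


# Hypothetical two-gene loop: B represses A, A activates B.
toy_rules = {
    "A": [(0.9, lambda s: not s["B"]), (0.1, lambda s: s["A"])],
    "B": [(1.0, lambda s: s["A"])],
}
trajectory = simulate_pbn(toy_rules, {"A": True, "B": False}, 10, random.Random(0))
```

Passing a seeded random.Random makes a run reproducible, and repeating the simulation with different seeds shows how much the trajectories vary.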

In document Analysing how the Arabidopsis circadian network responds to temperature (Page 45-49)