High Level Analysis - Analyses of DNA Microarray Data

3.2 Analyses of DNA Microarray Data

3.2.2 High Level Analysis

High level analysis is the 'real' statistical analysis and interpretation of microarray data. There are many statistical approaches for analyzing preprocessed data. Two commonly used ones are differential gene expression, which describes differences in the mean expression values of genes between experimental groups, and gene-gene correlation, which represents the correlation between different genes using graphs. The Bioconductor repository contains packages for these two methods and for many others.

Differential Gene Expression

The simplest approach to identifying differentially expressed genes is a fold-change criterion. However, single genes are not, in general, the primary focus of gene expression experiments. The researcher might be more interested in relevant pathways, functional sets, or genomic regions consisting of several genes. For example, the GlobalAncova package [HMM08, MM05] uses gene-wise linear models to formalize the relationship of gene expression with phenotypic or genomic covariates. An ANOVA-based sum of squares summarizes the individual gene-wise linear models into a group statement. A permutation test and an asymptotic distribution of the test statistic under the null hypothesis are available to calculate P-values.
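A fold-change criterion can be sketched in a few lines. The following Python snippet is only an illustration and is not taken from any Bioconductor package; the function name `select_by_fold_change`, the input layout (one list of positive, untransformed intensities per gene and group), and the two-fold cutoff are assumptions made for this example.

```python
import math

def select_by_fold_change(group_a, group_b, min_abs_log2_fc=1.0):
    """Return {gene: log2 fold change} for genes passing the cutoff.

    group_a, group_b: dicts mapping gene name -> list of positive intensities.
    min_abs_log2_fc: hypothetical cutoff; 1.0 corresponds to a two-fold change.
    """
    selected = {}
    for gene in group_a.keys() & group_b.keys():
        mean_a = sum(group_a[gene]) / len(group_a[gene])
        mean_b = sum(group_b[gene]) / len(group_b[gene])
        log2_fc = math.log2(mean_b / mean_a)
        if abs(log2_fc) >= min_abs_log2_fc:
            selected[gene] = log2_fc
    return selected

# Made-up intensities for three genes in two groups.
control = {"gene1": [100, 110, 90], "gene2": [50, 50], "gene3": [200, 200]}
treated = {"gene1": [400, 420, 380], "gene2": [55, 45], "gene3": [50, 50]}
print(select_by_fold_change(control, treated))
```

Such a cutoff looks only at group means and ignores the variability within each group.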

Network Analyses - Correlation

The estimation of graphs or networks for genomic data is an active field of research, and many studies address questions about the interaction between genes. In R and Bioconductor, three useful methods are available: the PC-Algorithm [KB07], Gene Correlation Networks (GeneNet) [SS05], and the Graphical Lasso [FHT08].

PC-Algorithm: The method of [KB07] estimates the skeleton and equivalence class of a very high-dimensional Directed Acyclic Graph (DAG) with corresponding Gaussian distribution. It uses the PC-Algorithm, presented in [SGS01], to estimate a graph defined through conditional dependencies on any subset of the variables. The PC-Algorithm starts from the complete graph and recursively deletes edges based on conditional independencies. In the R language, the method is available in the pcalg package. The algorithm is computationally feasible. Its computational complexity is difficult to evaluate exactly, but the worst case is bounded by O(p^q), where p is the dimensionality (number of variables) and q is the maximal size of the neighbourhoods [KB07].
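The edge-deletion step described above can be sketched as follows. This is a simplified Python illustration of the skeleton phase only, not the pcalg implementation: the function `pc_skeleton` and the independence oracle `chain_oracle` are hypothetical names, and the oracle replaces the statistical tests that a real analysis would run on data.

```python
from itertools import combinations

def pc_skeleton(nodes, is_independent):
    """Estimate an undirected skeleton by deleting edges from the complete graph.

    is_independent(i, j, cond) is a caller-supplied test (here an oracle)
    that returns True if i and j are independent given the set cond.
    """
    nodes = list(nodes)
    adj = {v: set(nodes) - {v} for v in nodes}  # start from the complete graph
    sepsets = {}
    level = 0  # size of the conditioning sets tried in this round
    while True:
        # Ordered pairs that are still adjacent and have enough other
        # neighbours to form a conditioning set of the current size.
        pairs = [(i, j) for i in nodes for j in sorted(adj[i])
                 if len(adj[i] - {j}) >= level]
        if not pairs:
            break
        for i, j in pairs:
            if j not in adj[i]:
                continue  # edge was already removed in this round
            for cond in combinations(sorted(adj[i] - {j}), level):
                if is_independent(i, j, set(cond)):
                    adj[i].discard(j)
                    adj[j].discard(i)
                    sepsets[frozenset((i, j))] = set(cond)
                    break
        level += 1
    edges = {frozenset((i, j)) for i in nodes for j in adj[i]}
    return edges, sepsets

def chain_oracle(i, j, cond):
    """Oracle for the chain A -> B -> C: A and C are independent given B."""
    return {i, j} == {"A", "C"} and "B" in cond

skeleton, sepsets = pc_skeleton(["A", "B", "C"], chain_oracle)
print(sorted(sorted(edge) for edge in skeleton))
```

Because the conditioning sets are drawn only from the current neighbours of a node, sparse graphs keep the number of tests small, which is consistent with the worst-case bound depending on the maximal neighbourhood size.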

GeneNet: GeneNet is an R package for analyzing high-dimensional (time series) data obtained from high-throughput functional genomics assays, such as expression microarrays or metabolic profiling. Specifically, GeneNet infers large-scale gene association networks. These are Graphical Gaussian Models (GGMs), which represent multivariate dependencies in biomolecular networks by means of partial correlation. The output of a GeneNet analysis is therefore a graph in which each gene corresponds to a node and the edges portray direct dependencies between genes [SS05].
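To illustrate how partial correlations separate direct from indirect dependencies, the following Python sketch derives them from the inverse of a correlation matrix. It is not the GeneNet estimator, which uses a shrinkage approach suited to small sample sizes; plain inversion as shown here assumes a well-conditioned correlation matrix. The helper names `invert` and `partial_correlations` and the example matrix are assumptions for this illustration.

```python
import math

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(corr):
    """Partial correlation of each pair given all remaining variables."""
    omega = invert(corr)  # precision (inverse correlation) matrix
    n = len(corr)
    return [[1.0 if i == j
             else -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
             for j in range(n)] for i in range(n)]

# Made-up correlations for a chain gene0 - gene1 - gene2: gene0 and gene2
# are correlated (0.64), but only through gene1.
corr = [[1.00, 0.80, 0.64],
        [0.80, 1.00, 0.80],
        [0.64, 0.80, 1.00]]
pcor = partial_correlations(corr)
print([[round(x, 3) for x in row] for row in pcor])
```

In this example the partial correlation between gene0 and gene2 vanishes, so a GGM would contain no edge between them even though their ordinary correlation is clearly non-zero.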

Graphical Lasso: The glasso R package estimates a sparse inverse covariance matrix using a lasso (L1) penalty. It can be used for estimating a sparse undirected graph. The underlying algorithm, the Graphical Lasso, is simple and fast and uses a coordinate descent procedure for the lasso [FHT08].
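For orientation, the penalized likelihood problem behind the Graphical Lasso can be written as below. The notation follows [FHT08] rather than this thesis and is introduced here as an assumption: S is the empirical covariance matrix, Θ the inverse covariance matrix to be estimated, ρ ≥ 0 the penalty parameter, and ‖Θ‖₁ the sum of the absolute values of the entries of Θ.

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta \succeq 0}
  \Bigl\{ \log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \rho\,\lVert\Theta\rVert_{1} \Bigr\}
```

Entries of the estimate that are exactly zero correspond to missing edges in the undirected graph, and a larger ρ yields a sparser graph.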

Comparison: A comparison and simulation study for Graphical Gaussian Models is available in [VSBH08]. It reports strong differences between the available methods and, for the PC-Algorithm, good control of the False Discovery Rate (FDR) when the parameter α is suitably chosen. In terms of computation time for estimating graphs, the PC-Algorithm is about two times faster than the other methods. The comparison and improvement of the existing methods is ongoing work, especially regarding the convergence criteria and the choice of α, the significance level of the individual partial correlation tests. A fixed value of α = 0.05 is commonly used, but for more than 60 nodes and more than 50% sparseness in the graph there is a large bias. Figure 3.1 shows the Structural Hamming Distance (SHD) between the original graph and the estimated graph. For the simulation, 1500 normally distributed samples were generated from the original graph. The result is independent of the number of observations used, and only small improvements can be achieved with a smaller α.

[Figure 3.1: plot with axes Nodes (20 to 100), Sparseness (0.0 to 1.0), and SHD (0 to 4000).]

Figure 3.1: Visualization of the bias in the PC-Algorithm. The plot shows the Structural Hamming Distance between the original and the estimated graph for different numbers of nodes and different sparseness of the original graph. 1500 normally distributed observations and α = 0.05 are used.
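The role of α in the individual partial correlation tests can be illustrated with a test based on Fisher's z-transform, the kind of test described in [KB07] for the Gaussian case. The Python sketch below is a minimal illustration under that assumption, not the pcalg code; the function name `partial_correlation_test` and the example values are hypothetical.

```python
import math
from statistics import NormalDist

def partial_correlation_test(r, n_samples, cond_set_size, alpha=0.05):
    """Return True if a zero partial correlation is rejected at level alpha.

    r: estimated partial correlation with |r| < 1.
    cond_set_size: number of variables in the conditioning set.
    """
    dof = n_samples - cond_set_size - 3
    if dof <= 0:
        raise ValueError("need n_samples > cond_set_size + 3")
    z = math.atanh(r)  # Fisher's z-transform: 0.5 * log((1 + r) / (1 - r))
    statistic = math.sqrt(dof) * abs(z)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal quantile
    return statistic > critical

# The same estimate can keep or lose its edge depending on alpha.
print(partial_correlation_test(0.3, n_samples=50, cond_set_size=0, alpha=0.05))
print(partial_correlation_test(0.3, n_samples=50, cond_set_size=0, alpha=0.01))
```

An edge is kept whenever independence is rejected, so a smaller α removes more edges.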

The PC-Algorithm was designed for estimating sparse graphs. The KEGG graphs (pathways) analyzed in the large cancer study (see Chapter 7) have a nearly sparse structure: they have less than 20% of the maximal number of edges, and the average number of nodes is 53 (average number of edges: 173). The large cancer study aims to confirm existing graph structures, so good control of the FDR is required. For these reasons, this thesis uses the PC-Algorithm from the pcalg package with a fixed α = 0.05.
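As a rough plausibility check of the sparsity statement, the averages can be plugged into the formula for the maximal number of edges. The Python sketch below treats the two averages as if they described a single graph, which is a simplification, and the function name `edge_density` is an assumption for this example.

```python
def edge_density(n_nodes, n_edges, directed=False):
    """Fraction of the maximal possible number of edges that is present."""
    max_edges = n_nodes * (n_nodes - 1)  # ordered pairs of distinct nodes
    if not directed:
        max_edges //= 2  # unordered pairs
    return n_edges / max_edges

# Averages reported for the KEGG graphs: 53 nodes, 173 edges.
print(round(edge_density(53, 173), 3))
print(round(edge_density(53, 173, directed=True), 3))
```

Counting undirected edges gives a density of roughly 13%, and counting directed edges roughly 6%, both below the 20% stated above.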

Methods for Comparing Graphs: There are only a few methods for comparing graphs, especially graphs arising from microarray data. The pcalg package provides two useful ones.

Structural Hamming Distance: The Structural Hamming Distance (SHD) between two graphs is the number of edge insertions, deletions, or flips needed to transform one graph into the other. The smaller the SHD, the greater the similarity between the two graphs. The SHD is symmetric and can be calculated using the function

shd(graph1,graph2).
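The definition can be made concrete with a small Python sketch. It is not the pcalg implementation: it assumes graphs given as sets of directed (from, to) tuples, with an undirected edge written in both directions, and the function name `structural_hamming_distance` is hypothetical.

```python
def structural_hamming_distance(edges_a, edges_b):
    """Count node pairs whose edge state (absent, direction) differs."""
    def pair_states(edges):
        # Map each unordered node pair to the set of directed edges between it.
        states = {}
        for src, dst in edges:
            states.setdefault(frozenset((src, dst)), set()).add((src, dst))
        return states

    a, b = pair_states(edges_a), pair_states(edges_b)
    # A pair contributes one operation if an edge is missing on one side
    # (insertion or deletion) or if its orientation differs (flip).
    return sum(1 for pair in a.keys() | b.keys() if a.get(pair) != b.get(pair))

true_graph = {("A", "B"), ("B", "C")}
estimated = {("A", "B"), ("C", "B"), ("A", "C")}
print(structural_hamming_distance(estimated, true_graph))
```

Here the estimated graph differs from the true graph by one flipped edge (between B and C) and one additional edge (between A and C).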

Rates: The True Positive Rate (TPR) is the number of correctly found edges in the estimated graph (first argument) divided by the number of true edges in the true graph (second argument). The False Positive Rate (FPR) is the number of incorrectly found edges in the estimated graph divided by the number of true gaps in the true graph. Therefore, a high TPR and a low FPR indicate good similarity between the graphs. The rates can be calculated with the function

compareGraphs(graphEstimated, graphTrue).
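The two rates follow directly from their definitions. The Python sketch below is an illustration on undirected edges, not the pcalg code; the function name `edge_rates`, the edge representation as node pairs, and the choice to return None for an undefined rate are assumptions.

```python
def edge_rates(nodes, edges_estimated, edges_true):
    """Return (TPR, FPR) of an estimated undirected graph against the true one.

    Edges are given as pairs of nodes; direction is ignored.
    A rate is None when its denominator is zero.
    """
    estimated = {frozenset(edge) for edge in edges_estimated}
    true = {frozenset(edge) for edge in edges_true}
    n = len(set(nodes))
    true_gaps = n * (n - 1) // 2 - len(true)  # node pairs without a true edge
    true_positives = len(estimated & true)
    false_positives = len(estimated - true)
    tpr = true_positives / len(true) if true else None
    fpr = false_positives / true_gaps if true_gaps else None
    return tpr, fpr

nodes = ["A", "B", "C", "D"]
graph_true = [("A", "B"), ("B", "C"), ("C", "D")]
graph_estimated = [("A", "B"), ("B", "C"), ("A", "D")]
tpr, fpr = edge_rates(nodes, graph_estimated, graph_true)
print(tpr, fpr)
```

In this example two of the three true edges are found, and one of the three true gaps is wrongly filled.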

Graphical Comparison: Another way to demonstrate differences or similarities between two graphs is to visualize them next to each other. As the graphs in this work show, the visualization of graphs with more than 50 nodes and edges becomes very complex. In particular, there are no tools for the automatic graphical comparison of two graphs. Highlighting identical nodes or edges is possible, but requires the complicated code structure of the Rgraphviz package, and there are no tools for plotting two graphs with the same node structure. Therefore, a graphical comparison of graphs is not yet possible, and a package for this purpose should be developed.

In document Schmidberger, Markus (2009): Parallel Computing for Biological Data. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 42-45)