Cluster Visualization - Implementation of Data Analysis

7.4 Implementation of Data Analysis

7.4.10 Cluster Visualization

The cluster viewer is a Java applet written for EMMA2 and fully embedded in the web interface. The rationale for using a Java applet instead of HTML pages is the increased need for interactivity and efficiency when handling large datasets and trees. Other web-based software for microarray analysis depicts large trees and heatmaps as static web pages with large images and hyperlinks. With static web pages it is almost impossible to achieve free zooming and object manipulation (for example, rotation of trees and online adjustment of colors).

The cluster viewer allows the user to inspect the results of a cluster analysis pipeline that have been stored in the database. The result of a hierarchical cluster analysis is depicted as a rooted tree with a heatmap. Trees with many leaves are often hard to navigate, as they consist of thousands of objects. To address this, the cluster viewer allows the user to zoom freely into the tree, to open subtrees in new windows, and to search for the location of genes within the tree. In addition, any branch can be made invisible and any subtree can be swapped.

Furthermore, the tree can be cut at any height to form an acceptable number of clusters. The clusters can then be inspected with a k-cluster plot. This plot depicts the individual clusters, the expression profiles they contain, and boxplots of the expression profiles for each experimental condition. In addition, one can manually select genes of interest and prepare bar graphs, line graphs, or so-called web plots of these profiles. All graphical visualizations created in the cluster viewer can be exported as pixel graphics or PostScript graphics, ready for inclusion in publications (see Figure 7.14).
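Cutting a tree at a given height can be illustrated with a minimal Python sketch. It is not the code of the cluster viewer (which is a Java applet); the nested-tuple tree representation, the function names, and the gene names are assumptions made for illustration only. Every subtree whose merge height lies at or below the cut becomes one cluster.

```python
from typing import List, Tuple, Union

# Assumed representation: a leaf is a gene name (str),
# an inner node is a tuple (merge_height, left_subtree, right_subtree).
Tree = Union[str, Tuple[float, "Tree", "Tree"]]


def leaves(node: Tree) -> List[str]:
    """Collect the leaf labels of a subtree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cut_tree(node: Tree, height: float) -> List[List[str]]:
    """Return the clusters obtained by cutting the tree at `height`."""
    # A leaf, or a subtree merged at or below the cut, forms one cluster.
    if isinstance(node, str) or node[0] <= height:
        return [leaves(node)]
    # Otherwise the cut separates the two children; descend into both.
    _, left, right = node
    return cut_tree(left, height) + cut_tree(right, height)


# Toy dendrogram with hypothetical gene names and merge heights.
toy = (4.0, (1.0, "geneA", "geneB"), (2.5, "geneC", (0.5, "geneD", "geneE")))
clusters = cut_tree(toy, 2.0)
print(clusters)
```

Lowering the cut height yields more and smaller clusters; cutting above the root height returns a single cluster containing all genes.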

[Figure 7.13 plot area: one violin plot per KEGG pathway, labeled from ‘Lipopolysaccharide biosynthesis’ to ‘gamma-Hexachlorocyclohexane degradation’; value axis with ticks at −10, −5, 0, and 5.]

Figure 7.13: The ‘KEGG chamber orchestra’. This plot is an example of the violin plots generated by EMMA2 using the R function advanced vioplot. For each of the 91 KEGG pathways found in S. meliloti in the example experiment (only the second half is depicted), the data are projected onto the first principal component, and the density of the sample distribution is plotted in combination with a boxplot. The width of each box depends on the number of members within the pathway. The bimodal nature of the distribution of the expression profiles in the group ‘Oxidative phosphorylation’ is directly visible, while the median, spread, group size, and outliers are depicted by the boxplot.


Figure 7.14: Two screenshots of the EMMA2 cluster viewer. The cluster viewer is a Java applet that allows easy and detailed navigation of hierarchical clustering trees (top). The tree can be cut at an individual height, yielding individual clusters that can be further analyzed in the cluster panel (bottom). The ordered expression matrix is depicted as a heatmap with adjustable color coding (for example, to provide a blue-yellow contrast instead of the standard red-green contrast). The cluster panel can also be used for non-hierarchical methods and provides multiple cluster plots, such as bar plots (bottom center), web graphs (not shown), line graphs (left), and boxplots (at the bottom of the window).

7.4.11 3D-SOM Viewer

The 3D-SOM viewer is another Java applet, written specifically to visualize the results of a Self-Organizing Map (SOM) analysis. Unlike partition-based clustering methods, self-organizing maps contain a topological ordering of the nodes. The high-dimensional expression data are projected onto a low-dimensional (usually two-dimensional) grid while trying to preserve the topological relations within the data. When the 2D grid is visualized, the third dimension of the visualization space can be used to convey more information about the organization of the grid nodes or about representative features of the input data.
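The projection step can be sketched as follows, assuming an already trained SOM whose node representative vectors are known. The sketch only shows how each expression profile is assigned to the grid position of its closest node (Euclidean distance); training is not shown, and all names, grid positions, and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
GridPos = Tuple[int, int]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_node(profile: Vector, grid: Dict[GridPos, Vector]) -> GridPos:
    """Grid position of the node whose representative is closest to `profile`."""
    return min(grid, key=lambda pos: euclidean(profile, grid[pos]))


def project(profiles: Dict[str, Vector],
            grid: Dict[GridPos, Vector]) -> Dict[GridPos, List[str]]:
    """Assign every gene to its best-matching node on the 2D grid."""
    members: Dict[GridPos, List[str]] = {pos: [] for pos in grid}
    for gene, profile in profiles.items():
        members[best_matching_node(profile, grid)].append(gene)
    return members


# Hypothetical 2x2 grid of node representatives and four toy profiles.
grid = {(0, 0): [0.0, 0.0, 0.0], (0, 1): [1.0, 1.0, 1.0],
        (1, 0): [-1.0, -1.0, -1.0], (1, 1): [2.0, 2.0, 2.0]}
profiles = {"g1": [0.1, 0.0, -0.1], "g2": [0.9, 1.2, 1.0],
            "g3": [1.9, 2.1, 2.0], "g4": [0.2, -0.1, 0.0]}
members = project(profiles, grid)
print(members)
```

The number of genes per node obtained this way is the kind of quantity that a viewer can map onto the third dimension or onto the node size.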

The SOM viewer contains four different visualization modes for the generated SOMs:

The static net depicts the network as a rectangular grid of connected balls representing the nodes of the SOM. The length of the interconnecting lines is constant, but the line width is inversely proportional to the Euclidean distance between nodes, depicting the connection strength. Shape and color parameters of all nodes can be controlled by user-definable components of the node representative vectors. This results in a total of six components (three RGB components and three axis components) that can be directly visualized by the appearance of the nodes. By default, the number of vectors attributed to a node controls the diameter of the ball.

The dynamic net has the same features as the static net mode. The only difference between the two is that the distance between the node representative vectors is mapped onto the distance between the individual balls, so that nodes which are more similar to each other appear closer together.

The distance matrix mode depicts all nodes as part of a rectangular surface. The distance between nodes is mapped onto the gray value of the surface. A maximum intensity value represents the minimal distance found between node representatives. The surface can either be flat, with the nodes represented by ‘pins’ whose variable lengths represent the number of vectors assigned to each node, or the z-axis of the surface can be taken to represent the number of vectors.

The Manhattan grid is a completely new way to visualize SOMs. It is based on the static net, but with each node represented by a pin. All genes assigned to a node are individually accessible, as they appear as rings making up the ‘stem’ of the pin. The position of the node members from top to bottom is determined by their distance to the node representative vector. The distance is additionally mapped onto a color code for each ring. This visualization method has the advantage that all genes are directly accessible from their nodes and that the cluster quality can be judged by the color code. A high number of nodes with many genes and with large distances to the node representatives can indicate that the SOM grid is too small. The inter-node distance is not visible in the Manhattan grid, to keep the complexity of the plots low. As one can switch between visualization modes immediately, it is easy to use a more appropriate mode for inspecting the grid topology and subsequently identify groups of genes of interest.
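Three of the visual mappings described for the modes above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the applet's code: the scaling constant and width cap of the line width, the linear gray scale, and the choice to place the closest gene first in a pin are all assumptions, as are the function and gene names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def edge_width(dist: float, scale: float = 1.0, max_width: float = 5.0) -> float:
    """Static net: line width inversely proportional to the node distance.

    `scale` and the cap `max_width` (avoids unbounded widths for
    near-identical nodes) are assumptions of this sketch.
    """
    return min(max_width, scale / max(dist, 1e-9))


def gray_value(dist: float, d_min: float, d_max: float) -> float:
    """Distance matrix mode: intensity 1.0 at the minimal distance, 0.0 at the maximal.

    The linear interpolation in between is an assumption of this sketch.
    """
    if d_max == d_min:
        return 1.0
    return 1.0 - (dist - d_min) / (d_max - d_min)


def manhattan_rings(representative: Vector,
                    members: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Manhattan grid: order the genes of one node by distance to its representative.

    Closest gene first is an assumed direction; the returned distance can
    additionally be mapped onto a color code for each ring.
    """
    rings = [(gene, euclidean(vec, representative)) for gene, vec in members.items()]
    return sorted(rings, key=lambda ring: ring[1])


# Hypothetical node representative with three assigned genes.
rep = [0.0, 0.0]
node_members = {"geneA": [3.0, 4.0], "geneB": [0.0, 1.0], "geneC": [0.0, 2.0]}
rings = manhattan_rings(rep, node_members)
print(rings)
```

A pin with many rings and large ring distances corresponds to a node that fits its members poorly, which is the situation the text describes as a hint that the SOM grid is too small.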

Chapter 8 Applications and their Results

There are currently eight major national and international projects utilizing EMMA2. The projects to which the author has contributed by implementing and applying customized analysis functions are described in detail in the following sections. These projects cover a large variety of microarray applications, ranging from bacteria and plants to cancer research and, last but not least, the study of marine ecosystems.

EMMA’s analysis pipelines have also been used for evaluation of several statistical tests and methods for data-integration. As a secondary effect, this setting provides a framework to evaluate the whole software and its flexibility.

8.1 Overview of the Various EMMA2 Projects

The GenoMik project is currently the largest project with respect to the number of users, hybridizations, and array designs included. It is dedicated to bacteria relevant for agriculture, environment, and biotechnology.

The BACDIVERS project is focused on the Rhizobia family of bacteria, which offers high potential in agriculture, natural strain diversity and stress resistance. The projects MEDICAGO, MolMyk and Grainlegumes (GLIP) are all focused on the plant Medicago truncatula, which makes for an excellent model organism for symbiotic root interactions between plants and bacteria (e.g. Rhizobia) and plants and fungi.

The Mamma Carcinoma project is dedicated to human breast cancer research. It aims at improving clinical diagnostic methods and general medical treatment of patients.

tion of transcriptomics software to study marine organisms. With respect to the number of project partners from all over Europe, and also with respect to the studied organisms and array technologies, it is the most diverse project of all. It consists of project nodes dedicated to fish and shellfish, algae, and marine bacteria. For many marine organisms, microarray studies are underway. They involve a large diversity of array technologies and array layouts. Some laboratories use spotted microarrays, others Agilent and Affymetrix arrays. The application of tiling arrays is also planned. Currently, the project has produced only a few hybridizations compared with the other projects, but this is going to change dramatically in the near future.

In total, as of January 2007, there are over 2700 hybridizations in more than 400 experiments across the various EMMA2 projects. All corresponding raw data and protocols were processed and uploaded using the ArrayLIMS.

| Project | Organisms | Sequence type | Array technology | # Arrays |
|---|---|---|---|---|
| MEDICAGO | M. truncatula | ESTs | cDNA macroarrays, cDNA microarrays | 198 |
| MolMyk | M. truncatula, P. tremula | ESTs | cDNA microarrays | 144 |
| GRAINLEGUMES | M. truncatula, P. sativum | ESTs | oligo microarrays | 343 |
| BACDIVERS | S. meliloti | whole genome | cDNA & oligo microarrays | |
| GenoMik | different prokaryotes | whole genome | cDNA & oligo microarrays | 1510 |
| PathoGenoMik | different prokaryotes | whole genome | cDNA & spotted & in-situ oligo arrays | 155 |
| Mamma carcinoma | Homo sapiens | whole genome | cancer oligo theme array | 322 |
| Marine Genomics (prospected) | marine prokaryotes and eukaryotes | ESTs & whole genome | spotted cDNA & oligo, in-situ oligo, tiling in-situ oligo arrays | 48 (>1000) |

Table 8.1: Overview of national and international projects which use EMMA2 as their central transcriptomics platform (figures as of January 2007).
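As a plausibility check, the array counts listed in Table 8.1 can be tallied and compared with the total of over 2700 hybridizations quoted in the text. The short sketch below uses only the figures that appear in the table; no count is given for BACDIVERS in the table as reproduced here, so that entry is left out of the sum rather than guessed, and the prospected figure for Marine Genomics is ignored.

```python
from typing import Dict, Optional

# Array counts as listed in Table 8.1 (figures as of January 2007).
arrays_per_project: Dict[str, Optional[int]] = {
    "MEDICAGO": 198,
    "MolMyk": 144,
    "GRAINLEGUMES": 343,
    "BACDIVERS": None,  # no count available in the table as reproduced here
    "GenoMik": 1510,
    "PathoGenoMik": 155,
    "Mamma carcinoma": 322,
    "Marine Genomics": 48,  # current count; the prospected >1000 is not included
}

# Sum only the projects for which a count is listed.
listed_total = sum(n for n in arrays_per_project.values() if n is not None)
print(listed_total)  # 2720
```

The listed counts alone add up to 2720, which agrees with the statement of over 2700 hybridizations.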

In document EMMA2 : a MAGE-compliant system for the analysis of microarray data in integrated functional genomics (Page 145-152)