GeneX Va: VBC gene expression database and analysis system for
multiple users in clinical research
Jae K. Lee1, Tom Laudeman2, Jodi Kanter2, Teela James1, Mir S. Siadaty1, Brad Freeman3, Daniela Puiu3, Li min Wen3, Gregory A. Buck3, Karen Schlauch4, Jennifer Weller5, Harry
Mangalam6, and William A. Knaus1
1Department of Health Evaluation Sciences, 2Academic Computing Health Sciences, University of Virginia School of Medicine, Charlottesville, Virginia, 22908, 3Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, VA 23284, 4Center of Biomedical Genomics and Informatics, 5School of Computational Sciences, George Mason University, Manassas, VA 20110, 6tacg Informatics, Inc., Irvine, CA, 92612, USA.
Correspondence should be addressed to: Jae K. Lee, Ph.D. ([email protected]) Department of Health Evaluation Sciences Hospital West Complex, Room 3181
University of Virginia School of Medicine, P.O. Box 800717 Charlottesville, VA 22908-0717
ABSTRACT
GeneX Va is an Open Source database and Bioinformatics analysis system for archiving and analyzing Affymetrix GeneChip® data. Supported by the Virginia Bioinformatics Consortium (VBC), GeneX Va provides a set of sample management, sample
documentation, and analysis tools designed to support a range of users, from individual research laboratories to institutional microarray facilities. GeneX Va provides WWW-based access to a PostgreSQL relational database system with a comprehensive security system that can provide data to interactive or scriptable statistical analysis protocols. The security system allows each investigator to manage all array data and analysis output files, and to grant custom access privileges to other users, groups, and internal/external collaborators. The analysis interface uses “Analysis Trees”, an innovative user interface that allows researchers to interactively create a tree-structured flow chart of analysis routines. The complete GeneX Va software and documentation can be freely downloaded from our Sourceforge web site, http://va-genex.sourceforge.net. To allow researchers to access the database and analysis capabilities of the GeneX Va system, microarray data from 17 VBC GeneChip experiments have been deposited into a public section of the GeneX Va system at the University of Virginia. The VBC GeneX Va sites are at http://genes.med.virginia.edu/ (University of Virginia) and http://genex.csbc.vcu.edu/ (Virginia Commonwealth University).
BACKGROUND
Microarray technologies are widely used for genome-wide gene expression studies in biology and the medical sciences (1, 2). Many research institutions have established microarray facilities to support individual researchers and collaborative research groups (3). Data management presents a major barrier to fully exploiting microarray gene expression results. Individual researchers who have performed an experiment seek reliable tools for identifying significant changes in gene expression and extracting the biological pathways that may be responsible for these changes. Thus, these researchers need robust statistical analysis methods and well-annotated databases that link gene sets to biological function. Scientists interested in common biological mechanisms that emerge across multiple experiments must combine data from many sources. Microarray experiments are expensive, so there is considerable interest in developing data management software and protocols that allow microarray data to be shared with additional investigators. Handling and analyzing large expression datasets can be cumbersome and is often overwhelming for individual researchers because of the size and complexity of these datasets. Thus, institutional microarray core facilities typically manage expression data using commercial or custom-built database systems (4, 5). In addition to the drive to share data, there are also requirements for complete reporting, review, and replication of the data and results of experiments.
Unlike sequence data, gene expression data are context-dependent: different interpretations are possible depending on the combination of experimental conditions, RNA samples, array instrumentation, and software processing techniques (6). The problem of sample documentation for gene expression experiments is well recognized; the Microarray Gene Expression (MAGE) working group has proposed the Minimum Information About a Microarray Experiment (MIAME) guidelines, the result of a collaborative effort by many genome research institutes (7). These guidelines specify the types of descriptive information that should be reported for the experiment and its results to be meaningful to and replicable by other researchers. Many public and private microarray databases and scientific journals are adopting these guidelines as their data storage and publication standard. ArrayExpress at the European Bioinformatics Institute (EBI) is a public database of microarray data with online submission, database query, and certain clustering and functional filtering tools that is being developed in compliance with the MIAME guidelines (8).
Two years ago, under the auspices of the Virginia Bioinformatics Consortium, microarray core facilities at the University of Virginia and Virginia Commonwealth University sought to address their expression array data management needs by adopting and extending GeneX, an Open Source gene expression data management and analysis system (9). The GeneX data model is agnostic as to technology platform: unlike the MAGE model, it does not require data to be generated on an “array”-type platform. In the original GeneX, specific tools for uploading and annotating data were developed for cDNA arrays, which produced the first data sets available. Although the initial GeneX software provided a rich database schema and WWW interface for describing cDNA array experiments, it had no tools for handling Affymetrix GeneChip® data and lacked the data security features required by a shared core facility that supports a diverse group of users. Over the next 18 months, GeneX was extended to archive and analyze GeneChip® data, based on the microarray experiment workflow encountered in basic medical and clinical research (Figure 1a), and the needed security features were added. Important features supported by GeneX Va are summarized in Table 1.
Of the major features provided by the GeneX Va database – sample documentation and data management, data security, and statistical analysis tools – sample documentation can be the most burdensome for individual researchers. GeneX Va seeks to reduce this burden by integrating the collection of sample information with the process of submitting samples to the microarray core facility (information must be submitted before the
analysis is performed), and by allowing researchers to group experiments into “studies” and copy required information from one experimental condition or study into a new study. Collection of accurate sample descriptions greatly enhances the value of the expression data. However, individual researchers rarely anticipate and provide the extensive information required by the MIAME guidelines. For example, the researcher may feel that the experimental protocol is simple and obvious and that storing certain details is unnecessary. Moreover, it is time-consuming to record all the options and parameter values for an entire experiment and analysis. Although it can be difficult to archive all the MIAME information in practical array experiments, GeneX Va’s user interface collects such information at the start of an array study, and requires users to enter the essential sample descriptive data (Figure 1b).
Although there is substantial sentiment within the scientific community that gene expression data should be publicly available, most individual researchers feel that data need only be made public after the results are accepted for publication. To gain acceptance by the diverse group of researchers at the University of Virginia, our first modification of the initial GeneX (version 1.0) software package was the addition of a versatile security module. This module implements security at the database level, using row-level security on the data tables, and allows users and groups to be defined such that data and metadata can be shared in many ways. For example, investigators may conduct multiple array studies with different collaborators, and a single microarray data set may be used for several studies, with different subsets shared among researchers at different institutions. Unpublished array data and related patient information must be stored securely under each user’s access and control, with legitimate access monitored. Different levels of access privileges are also needed for each data set; e.g., primary investigators can be allowed full (clinical) information access privileges, while collaborative/associated investigators can be restricted to partial privileges. GeneX Va provides specific data access and management tools for the database curator, microarray core facility personnel, data owner (primary user), and group (associated/collaborative users). This flexibility was not available in the original GeneX system and is missing from several other microarray databases that are based on client-server architecture (5).
While data management, documentation, and security are essential for large-scale microarray core facilities, individual investigators are most interested in flexible, reliable, and informative analysis protocols that allow them to focus on the biologically important changes revealed by the array experiment. Conclusions from microarray data analyses can differ substantially depending on normalization and analysis procedures (10). More importantly, interpretation of microarray results requires a series of analysis procedures from quality control to functional analysis, which can involve many different
combinations of analysis options and parameters. Most current microarray database systems, e.g., ArrayExpress, mAdb (NCI), and SMD (Stanford Microarray Database), provide various analysis routines such as clustering and functional analysis tools, but many of these analysis routines cannot be linked together to produce a complete, consistent, reusable analysis method (4, 5, 8). Flexible analysis strategies become critical when combining analysis of data from diverse sources. To meet these analysis demands, GeneX Va has a powerful and flexible analysis interface, called Analysis Trees, which allows users to interactively create, save, and freely modify a tree-structured flow chart of analysis routines with selectable analysis options.
The GeneX Va system is currently used to manage gene expression data at the University of Virginia (http://genes.med.virginia.edu/) and at the Virginia
Commonwealth University (http://genex.csbc.vcu.edu/). VBC’s GeneX Va sites also provide many public microarray data sets from medical research experiments, with full MIAME documentation.
SYSTEM USE AND USER INTERFACE
GeneX Va comprises a relational database, a web interface, and a statistical analysis suite with Perl DBI and APIs. There is also a repository for each user's files, and a comprehensive security system. All interaction with the system is via secure web pages; no special software is required for end users. Using the GeneX Va system, an investigator initiating an array experiment first creates an “array study” and submits an order to the microarray core facility. The GeneX Va web-based curation tool allows investigators to describe their experimental protocols and samples. During this initial data entry step, users create names for each RNA sample and array that are consistent with those on the actual sample tubes and chips of the array experiment. Having a compact, informative, and consistent naming procedure is one of the most important steps in maintaining a vital microarray database. Specific Perl DBI tools (through the web-based user interface) upload data from the array scanner to the GeneX Va database. The expression data can then be exported locally in text format and/or analyzed using the GeneX Va analysis interface. The analysis interface allows users to build a graphical representation of a flow chart through various analysis routines, which include quality control, normalization, fold-change and statistical differential discovery, clustering, and annotation tools. Under each user’s account, all the analysis results and derived output files are saved in text and graphical formats that can be viewed or downloaded from the web browser. The GeneX Va user interface (UI) can be used to capture the entire workflow of array experiments from the research lab to the microarray core facility (Figure 1a). At each step, a series of relevant procedures is supported by the UI. For example, Figure 1b depicts the sequence of steps that investigators follow as they create and document an array study and orders for specific chip hybridizations. A step-by-step guide for this process is available at the GeneX Va web site (Supplementary Document S1).
SYSTEM DESIGN AND SECURITY
The GeneX Va database is built upon PostgreSQL, a mature, Open Source relational database system. The database schema was originally based on GeneX 1.05, but has been modified and expanded to reflect the workflow and security requirements mentioned above. GeneX Va now contains eight (conceptual) categories of tables to define the RNA samples, experimental information, array layout, array protocol, array measurements, spotted sequence/gene information, citation information, ID and security, and administration notations (Supplementary Figure S1). The last two categories of tables implement the comprehensive security system that can manage access to microarray data from clinical samples. GeneX Va security is implemented both at the web interface, for server login, and at the database level, for accessing database records. The database security model has users, groups, and primary investigators (PIs), the last of which are the users registered in the PI table with primary owner privileges for each array dataset. The Unix-like permission scheme provides different levels of data and database security, allowing different accessibility and management privileges for curators, array center personnel, data owners, users, and groups. To secure the system and its data in transit over the network, GeneX Va uses Secure Sockets Layer (SSL) via https and web security based on Apache access authentication and sessioning (Supplementary Document S1).
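As an illustration of how a Unix-like permission scheme can gate access to individual records, the following Python sketch checks read and write access for a data owner, group members, and all other users. The class names (`User`, `Record`), the field names, and the octal mode layout are assumptions made for this example; they are not taken from the GeneX Va schema or its Perl code.

```python
from dataclasses import dataclass

READ, WRITE = 4, 2  # Unix-style permission bits (assumed layout)


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset


@dataclass(frozen=True)
class Record:
    """A hypothetical database row carrying its own ownership metadata."""
    owner: str
    group: str
    mode: int  # e.g. 0o640: owner read/write, group read, others none


def permitted(user: User, record: Record, wanted: int) -> bool:
    """Return True if `user` holds every permission bit in `wanted`."""
    if user.name == record.owner:
        bits = (record.mode >> 6) & 7   # owner triplet
    elif record.group in user.groups:
        bits = (record.mode >> 3) & 7   # group triplet
    else:
        bits = record.mode & 7          # everyone else
    return (bits & wanted) == wanted


# Hypothetical users and one array data record shared with a study group.
owner = User("alice", frozenset({"study_a"}))
collaborator = User("bob", frozenset({"study_a"}))
guest = User("guest", frozenset())
chip = Record(owner="alice", group="study_a", mode=0o640)
```

With mode `0o640`, the owner can read and modify the record, a collaborator in the same group can only read it, and a guest has no access until the owner relaxes the mode.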
ANALYSIS INTERFACE
Analysis on the GeneX Va system is started by using the web-based query tools to retrieve all or a subset of a relevant array dataset. Data retrieval can be based either on existing array studies and experimental conditions or on a virtual array study and/or
virtual experimental conditions created by researchers. The latter utility is important because researchers often want to analyze gene expression data (either their own or publicly accessible data) by forming an array data set with different experimental groups of samples. Having established an experimental data set, users then construct analysis methods (linked procedures), which can be visualized by using the GeneX Va analysis interface, Analysis Trees. Analysis Trees represent hierarchically linked analysis procedures with branches for several analysis routines. Each node of the analysis tree is defined with its input, output, and connections to other compatible
analysis nodes (Figure 2). Any number of compatible analysis routines can be added at each node with different selectable options and parameter values for each analysis routine. All analysis (intermediate and output) files are saved, together with their history (log) files, to the researcher’s account. A detailed guide for the use of the GeneX Va analysis interface can be found at the GeneX Va web sites (Supplementary Document S2). GeneX Va currently supports ten analysis routines (six in use and four under development as of July 2003): qualControl, diffDiscover, westfallYoung, filter,
permClust, treeDraw, funcFilter, SAM, HEM, and SOM. Each of these analysis routines has various options, as summarized in Table 2. New analysis routines can be written in R, Perl, or C (and many other scripting and programming languages) and can be added to the system through a plug-in architecture (using newly developed Perl utilities, e.g., Rwrapper). The functionality of each newly added analysis routine is carefully tested and validated against the routines that can serve as its input or output nodes.
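The tree structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the GeneX Va implementation (which is written in Perl and R): the class `AnalysisNode`, the file-type labels such as `"normalized"`, and the rule that a child is compatible when its input type equals the parent's output type are all assumptions made for the example. Only the routine names come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisNode:
    routine: str            # routine name, e.g. "diffDiscover"
    consumes: str           # assumed label for the input file type
    produces: str           # assumed label for the output file type
    options: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add(self, child: "AnalysisNode") -> "AnalysisNode":
        """Attach `child` only if it can read this node's output."""
        if child.consumes != self.produces:
            raise ValueError(
                f"{child.routine} cannot follow {self.routine}: "
                f"needs {child.consumes}, got {self.produces}"
            )
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield every root-to-leaf chain of routine names."""
        chain = prefix + (self.routine,)
        if not self.children:
            yield chain
        for child in self.children:
            yield from child.paths(chain)


# A tree with two branches that share the same quality-control step.
root = AnalysisNode("qualControl", "raw", "normalized", {"norm": "IQR"})
diff = root.add(AnalysisNode("diffDiscover", "normalized", "gene_stats"))
filt = diff.add(AnalysisNode("filter", "gene_stats", "gene_list"))
filt.add(AnalysisNode("permClust", "gene_list", "cluster_tree"))
root.add(AnalysisNode("westfallYoung", "normalized", "gene_stats"))
branches = list(root.paths())
```

Each entry in `branches` is one complete analysis method, so alternative tests or options can be compared by adding a sibling branch rather than repeating the upstream steps.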
OPEN SOURCE SOFTWARE AND INSTALLATION
The GeneX Va system is completely based on Open Source software, including a Linux server (we use Red Hat version 7.2, but more recent releases should not pose any problems), Apache web server, and PostgreSQL relational database (version 7.1). The web interface and internal control scripts are written in standard Perl. The statistical analysis routines are written in R, an Open Source statistical programming environment (http://www.r-project.org); some functions are written in C and Perl. These analysis routines are written by the members of the UVa GeneChip/Microarray Bioinformatics (GMB) core or adapted from Open Source analysis routines, such as the Bioconductor packages (http://www.bioconductor.org/) and those of the current GeneX development group based at GMU. GeneX Va software is Open Source and freely available under the Lesser GPL. The up-to-date software package and its detailed installation instructions can be downloaded from our Sourceforge site at http://va-genex.sourceforge.net/ or http://sourceforge.net/projects/va-genex/. The complete GeneX Va software package is compact and its installation can be performed within a few hours. GeneX Va requires pre-installation of some additional Open Source packages, which are documented in the installation support package. GeneX Va software releases (including analysis routines) are validated using a standard testing procedure carried out by a third party who is independent of the development of each component.
PUBLIC MICROARRAY DATA AND ACCESS
We have developed a standard procedure, many of whose steps are automated, to consistently and efficiently post public array data in the GeneX Va system with its relevant MIAME information. For guest users, the GeneX Va site at the University of Virginia (http://genes.med.virginia.edu) currently provides 17 public array data sets from various experiments in biomedical research, including human transcriptional profiles for melanoma, Alzheimer's and Parkinson's diseases, antigen immune response, and various cancers and other human diseases, as summarized in Supplementary Table S1; additional array data will be added on an ongoing basis. We also plan to make our analysis server available via a public Web interface in the near future.
FUTURE DEVELOPMENT
We are currently adding various analysis routines to perform more comprehensive investigations on microarray data, including many Open Source analysis routines, such as affy, genefilter, and multtest, that are well recognized and validated for their usefulness. We also plan to integrate various functional and annotation analysis tools, such as annotate, AnnBuilder, and in-house tools, that can effectively triage gene targets based on each investigation goal. Customized cDNA and oligonucleotide array technologies are currently being used in experiments at VBC institutions; relevant workflows are available in the GeneX Va system. Specific user interfaces will be developed accordingly to provide a full database and analysis service. One challenging task associated with the use of custom arrays is the support of numerous array layouts, which requires archiving various manufacturing and technical aspects of array fabrication. We propose to develop tools to support documenting a large number of layouts. We also plan to support counting-type gene expression technologies such as SAGE and AFLP soon.
We also plan to integrate the GeneX Va database with complementary genomic and clinical data across multiple institutions. In doing so we will ensure that any patient-related data are stored and exchanged in compliance with the Health Insurance Portability and Accountability Act (HIPAA). We thus plan to build the interface between the GeneX Va system and clinical databases by providing anonymity for all patient-identifiable information. Array and personal clinical data can be combined by a properly authorized user to be analyzed as dictated by each pre-defined investigation goal. As our GeneX Va
system is integrated with various medical and clinical data sources, security issues will become increasingly critical with respect to sensitive data. We will monitor the
effectiveness of our security implementation and update it as necessary to remain current with new regulations. We are also in the process of developing tools to monitor and control client use. Through our collaboration with the GeneX version 2 developers at George Mason University, we will take advantage of their experience to decide whether to follow the MAGE-ML and MAGE-OM concepts, including the use of XML exchange formats for data, schemas, and APIs.
Supplementary Materials:
- Figure S1: GeneX Va database schema
- Document S1: Step-by-step user guide for GeneX Va system
- Document S2: Guide to GeneX Va analysis interface (Analysis Trees)
- Table S1: VBC public microarray data sets
Acknowledgements
GeneX Va development is supported by Virginia CTRF grant, CIT grant BIO-02-004, and the University of Virginia Pratt fund. We thank Alyson Prorock, Paul Gallagher, Yongde Bao, and Jay W. Fox at the UVa Biomolecular Research Facility for their feedback and on-site testing of the system, and Tarynn M. Witten, J. Michael Davis, Darrell Mallonee, and J. M. Alves at the VCU Center for the Study of Biological Complexity for their direct and indirect support of this development. We also thank Irene Mullins, Tom Spraggins, and William A. Pearson for valuable comments on earlier versions of this manuscript.
References
1. Sander C (2000). Genomic medicine and the future of health care. Science, Mar 17;287(5460):1977-8.
2. Tusher V, Tibshirani R, and Chu G (2001). Significance analysis of microarrays applied to transcriptional responses to ionizing radiation, Proc. Natl. Acad. Sci., 98: 5116-21.
3. Knudtson KL, Griffin C, Iacobas DA, Johnson K, Khitrov G, Levy S, Massimi A, Nowak N, Viale A, Grills G, Brooks AI. MARG Survey 2003. From the Association of Biomolecular Resource Facilities. ABRF A Current Profile of Microarray Laboratories: the 2002-2003 ABRF Microarray Research Group Survey of Laboratories Using Microarray Technologies. See http://www.abrf.org.
4. Gollub J, Ball CA, Binkley G, Demeter J, Finkelstein DB, Hebert JM, Hernandez-Boussard T, Jin H, Kaloper M, Matese JC, Schroeder M, Brown PO, Botstein D, Sherlock G. (2003). The Stanford Microarray Database: data access and quality assessment tools, Nucleic Acids Res. 1;31(1):94-6.
5. Gardiner-Garden M and Littlejohn TG (2001). A comparison of microarray databases. Brief Bioinform. 2(2):143-58.
6. Bassett DE Jr, Eisen MB, Boguski MS (1999). Gene expression informatics--it's all in your mine. Nature Genetics 21(1 Suppl):51-5.
7. Stoeckert CJ, Causton HC, and Ball CA. (2002). Microarray databases : standards and ontologies, Nature Genetics, 32, 469-473.
8. Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone SA (2003). ArrayExpress--a public repository for microarray gene expression data at the EBI, Nucleic Acids Res. 1;31(1):68-71.
9. Mangalam H, Stewart J, Zhou J, Schlauch K, Waugh M, Chen G, Farmer AD, Colello G, and Weller JW (2001). GeneX: an open source gene expression database and integrated tool sets, IBM Systems J., 40: 2, 552-569.
10. Lee JK (2002). Discovery and validation of microarray gene expression patterns, LabMedica International, 19, 2: 8-10.
11. Jain N, Ley K, Thatte J, O’Connell M, and Lee JK (2003) Local pooled error test for identifying differentially expressed genes with a small number of replicated microarrays, To appear in Bioinformatics.
12. Dudoit S, Yang YH, Callow MJ, and Speed TP (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Statistica Sinica, 12 (1), 111-139.
13. Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, and Weinstein JN (2000). A cDNA microarray gene expression database for the molecular pharmacology of cancer. Nature Genetics, 24 (3), 236-244.
14. Lee JK, Scherf U, Smith LH, Tanabe L, and Weinstein JN (2001). Analysis of gene expression data of the NCI 60 cell lines using Bayesian hierarchical effects model, Proceedings of SPIE, BIOS 2001, Microarrays: Optical Technologies and Informatics, Vol. 2, 23, 228-235.
15. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, and Golub TR (1999) Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci., 96: 2907-2912.
Table 1. GeneX Va gene expression database overview
Feature | Available? | Comments
Data storage:
GeneChip® (Affymetrix) data | Yes | All original data (including image) and derived data are saved (see MIAME compliance below)
cDNA microarray data | Yes | Available from GeneX 1.5 and current data at VBC
Automatic uploader:
GeneChip (Affymetrix) | Yes | Most layouts for human, mouse, rat, and yeast chips
cDNA microarray | No | Under development for some customized cDNA arrays [There are several available for QuantArray data through the GeneX 2 branch]
MIAME compliance:
Experiment design | Yes | Described in array study, experimental conditions, and orders (mandatory)
Sample preparation | Yes/No | Described in description (details are optional)
Hybridization | Yes | For GeneChip®, a standard array-center protocol recorded. For cDNA arrays, under experiment for some cases
Measurement data | Yes | Image data and all derived data are archived. Final output from image software is loaded on the database
Array Design | Yes | For GeneChip®, complete information from Affymetrix. For cDNA arrays, some designs under development
User Interface:
Web tools | Yes | No direct access to database and files is allowed. All data management can be performed via WWW with different tools and privileges for curator, array center, user, and group
Initiation of array study (orders) | Yes | Mandatory
Retrieval of data | Yes | For both individual chips and a whole set
Analysis | Yes | Specific analysis interface (Analysis Tree) is used as below
Security:
Data Management | Yes | Group creation, assignment, and access privilege can be managed for any users on the system
Private account & Access restriction | Yes | Unless owner releases each array data, its access restricted
Analysis:
Quality control and normalization | Yes | QC indices are derived. IQR and Lowess normalization available
Discovery of differential expression | Yes | Fold change, LPE, t-test, and Westfall and Young (SAM under development)
Clustering Analysis | Yes | Hierarchical clustering with permutation-based validation (SOM under development)
Functional and annotation analysis | Yes | Direct links to NCBI databases for gene accession numbers in clustering analysis. GO, COG, and UniGene links for various analysis tools are under development
Integration of multiple analysis routines | Yes | Using Analysis Trees (AT) interface
Storage of analysis and derived files | Yes | Graphical analysis trees and all relevant (log and output) files are saved under each user’s account
Public array data: | Yes | Currently, 21 array data sets in medical research available for public access without a restriction. More sets will be constantly added
Public Access:
Data deposition and posting | Yes/No | Mainly array data obtained from the array center can be loaded and posted for public access. Some external data can be loaded manually
Database query | Yes/No | GeneX Va users can query their data via analysis interface, but not others (see next for analysis tool access)
Analysis Tools | Yes/No | Currently not available for general public. A registered login-based accessibility under development.
Local Installation | Yes | Complete software with all current features can be obtained from the Sourceforge page and can be installed locally on a Linux server
XML exchange format | No | XML formats for some frequently-exchanged output are under development
MAGE-OM and MAGE-ML | No | Planned to be developed under the collaboration with the GeneX 2 project
Table 2. GeneX Va Analysis Routines. Six analysis routines are currently in use (U) for the Analysis Tree interface and four routines are under development (D) and will be available shortly. Many other tools are planned to be added from Bioconductor [http://www.bioconductor.org] and other open source developments.
Analysis Routines | Status | Description
qualityControl | U | Quality control and normalization analysis. QC is performed with graphical and statistical summaries for the whole array data set, and inter-quartile-range (IQR, default) or LOWESS normalization can be performed (10)
diffDiscover | U | Fold change and statistical significance for discovery of differentially expressed genes. The function performs hypothesis testing by the LPE and two-sample t-test methods and computes fold change values in the log scale (11)
westfallYoung | U | Permutation-based multiple testing procedures to identify differentially expressed genes (12)
Filter | U | Gene selection based on statistical criteria: LPE, t-test, and fold change values
permCluster | U | Hierarchical clustering procedure with permutation-based validation (13)
treedraw | U | Graphics for cluster tree and image maps and links to gene annotation information sites. A zoomable pdf graphic is created with the gene names and annotations at the end of the tree branches hyperlinked to genome web sites (3, 13)
funcFilter | D | Gene selection based on functional filtering. The selection can be performed for Gene Ontology, COGs, UniGene, and other functional annotations
SAM | D | Significance analysis of microarrays (2)
HEM | D | Hierarchical Error Model and ANOVA model for estimation of interactive gene expression patterns (14)
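To make the diffDiscover entry in Table 2 concrete, the following Python sketch computes a log-scale fold change and a two-sample t statistic for a single gene. It is an illustration only: the function names and the example intensities are hypothetical, the Welch (unequal-variance) form of the t statistic is an assumption because the table does not say which variant is used, and the LPE test is not reproduced here.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated):
    """Difference of mean log2 intensities (treated minus control)."""
    return (mean(math.log2(x) for x in treated)
            - mean(math.log2(x) for x in control))


def welch_t(control, treated):
    """Welch two-sample t statistic computed on log2 intensities."""
    a = [math.log2(x) for x in control]
    b = [math.log2(x) for x in treated]
    # Standard error of the difference, allowing unequal variances.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


# Made-up intensities for one gene on three control and three treated chips.
control = [100.0, 200.0, 400.0]
treated = [400.0, 800.0, 1600.0]
lfc = log2_fold_change(control, treated)
t_stat = welch_t(control, treated)
```

Working on the log scale makes the fold change symmetric around zero, so a fourfold increase and a fourfold decrease give values of equal magnitude and opposite sign.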
Figure Legends:
Figure 1. Workflow for Microarray Experiments. (a) Information and sample transfer between researchers and the microarray core facility; (b) Steps required to create a new study and describe its experimental conditions, supported by GeneX Va.
Figure 2. An Analysis Tree: the GeneX Va Graphical Analysis Interface. An extended analysis tree is drawn with three branches: (1) a standard procedure of differential discovery, filtering, and clustering analyses, (2) an analysis with a different statistical test (Westfall and Young), and (3) an analysis with different quality control (QC) options.