Tools and Algorithms in Bioinformatics
GCBA815, Fall 2015
Week-8:
WebGestalt, DAVID, Gene Set Enrichment Analysis (GSEA)
Simarjeet K. Negi, Ph.D. candidate
(Guda Lab)
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
Why perform enrichment analysis?
• Large gene lists resulting from high-throughput analysis
• Deciphering the biology
• Organize expression changes into meaningful functional themes
• Gene enrichment analysis increases the likelihood of identifying the
molecular processes/functions most pertinent to the study
• If a biological process is abnormal in a given study, the co-functioning
genes should have a higher (enriched) potential to be selected as a
relevant group by the high-throughput screening technologies
• The analytic conclusion rests on a group of relevant genes rather than on
individual genes, which increases the likelihood of identifying the
biological processes most pertinent to the study
• Enrichment tools map a large number of ‘interesting’ genes to
biological annotation terms (e.g. GO Terms or Pathways)
• The enrichment of user genes for each annotation term is then examined
statistically by comparing the outcome to a control (or reference)
background
• Based on their underlying algorithms, current enrichment tools can be
broadly divided into three classes:
• Singular enrichment analysis (SEA), e.g. WebGestalt
• Gene set enrichment analysis (GSEA), e.g. GSEA
• Modular enrichment analysis (MEA), e.g. DAVID
• Note, some tools with diverse capabilities belong to more than one class
Classification of Enrichment Tools
Overrepresentation approaches
WebGestalt: WEB-based Gene SeT AnaLysis Toolkit
(http://bioinfo.vanderbilt.edu/webgestalt/)
• Input: user’s preselected (e.g. differentially expressed genes selected
between experimental versus control samples) ‘interesting’ genes
• Iteratively tests the enrichment of each annotation term, one by one, in
a linear mode
• Integrates functional enrichment analysis with information
visualization
• Constantly updated
• Efficiently processes large gene lists
• Weakness: the list of output terms can be large, which dilutes the focus and
obscures the interrelationships among relevant terms
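The term-by-term test described above can be sketched with a hypergeometric over-representation test. This is only a minimal illustration under stated assumptions, not WebGestalt's actual code: the function names, the toy gene identifiers (g1 to g20) and the two terms are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes among n genes
    drawn without replacement from N reference genes, K of them annotated."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def score_terms(interesting, reference, term_to_genes):
    """Test each annotation term one by one against the reference background."""
    reference = set(reference)
    interesting = set(interesting) & reference
    results = []
    for term, members in term_to_genes.items():
        members = set(members) & reference
        overlap = len(interesting & members)
        p_value = hypergeom_upper_tail(
            overlap, len(interesting), len(members), len(reference)
        )
        results.append((term, overlap, p_value))
    # Smallest p-value first, i.e. most over-represented term first
    return sorted(results, key=lambda row: row[2])


# Hypothetical toy data: 20 reference genes, 4 'interesting' genes, 2 terms
reference = {f"g{i}" for i in range(1, 21)}
terms = {
    "term_A": {"g1", "g2", "g3", "g4", "g5"},
    "term_B": {f"g{i}" for i in range(6, 16)},
}
interesting = {"g1", "g2", "g16", "g17"}

results = score_terms(interesting, reference, terms)
for term, overlap, p_value in results:
    print(f"{term}\toverlap={overlap}\tp={p_value:.4f}")
```

Because every term is tested independently, the output grows with the number of annotation terms, which is the weakness noted in the last bullet.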
DAVID: Database for Annotation, Visualization and Integrated
Discovery (https://david.ncifcrf.gov/home.jsp)
• DAVID inherits the basic enrichment calculation as found in
WebGestalt
• Input: user defined gene list
• Incorporates extra network discovery algorithms by considering the
term-to-term relationships
• Improves discovery sensitivity and specificity by considering the
inter-relationships of GO terms in the enrichment calculations
• Joint terms may contain unique biological meaning for a given study, not
held by individual terms
• Weakness: not updated in recent years; the user input gene list is limited to
3,000 genes
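One way to quantify a term-to-term relationship is the kappa statistic, which DAVID's published descriptions cite as the agreement measure behind its annotation clustering. The sketch below only illustrates that idea on made-up data; the function name, gene identifiers and terms are hypothetical, and it does not reproduce DAVID's clustering procedure.

```python
def term_kappa(genes, term_a, term_b):
    """Cohen's kappa for how consistently two terms annotate the same genes."""
    genes = list(genes)
    n = len(genes)
    both = sum(1 for g in genes if g in term_a and g in term_b)
    only_a = sum(1 for g in genes if g in term_a and g not in term_b)
    only_b = sum(1 for g in genes if g in term_b and g not in term_a)
    neither = n - both - only_a - only_b

    observed = (both + neither) / n
    # Agreement expected by chance from the marginal annotation frequencies
    expected = (
        (both + only_a) * (both + only_b)
        + (neither + only_b) * (neither + only_a)
    ) / n**2
    if expected == 1.0:
        return 1.0  # both terms annotate all genes or none of them
    return (observed - expected) / (1 - expected)


# Hypothetical gene list and annotation terms
genes = [f"g{i}" for i in range(1, 11)]
term_x = {"g1", "g2", "g3", "g4"}
term_y = {"g1", "g2", "g3", "g5"}   # shares most genes with term_x
term_z = {"g6", "g7"}               # shares no genes with term_x

kappa_xy = term_kappa(genes, term_x, term_y)
kappa_xz = term_kappa(genes, term_x, term_z)
print(f"kappa(x, y) = {kappa_xy:.3f}")
print(f"kappa(x, z) = {kappa_xz:.3f}")
```

Terms with high pairwise kappa are candidates for grouping into one module, so that the joint group is interpreted together rather than as separate entries.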
GSEA: Gene Set Enrichment Analysis
(http://www.broadinstitute.org/gsea/)
• Identifies the enriched pathways/gene sets between two biological states
• The program uses an underlying database (MSigDB) of about 11,000 gene sets
that include KEGG, BIOCARTA pathways, curated sets from disease states, etc.
Seven Broader Collections of GSEA
• Search
• Browse
• Examine gene sets
• Investigate
• Download
GSEA: Gene Set Enrichment Analysis
• GSEA program (download to your PC)
• Input:
• Expression dataset (between two conditions)
• Phenotype labels for the two states
• Gene sets in gmx/gmt format (MSigDB, supplied by GSEA)
• GSEA implements a ‘no-cutoff’ strategy, taking all genes from a microarray
experiment without first selecting significant genes (e.g. genes with
P-value ≤ 0.05 and fold change ≥ 2)
• The GSEA method requires one summarized biological value per gene (e.g. fold change)
• Weaknesses:
• It can be difficult to summarize the many biological aspects of a gene in one
meaningful value, e.g. for SNP arrays or clinical microarray studies
• GSEA has less power to detect a gene set that mixes genes with positive and
negative associations with the phenotype
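The no-cutoff idea and the mixed-direction weakness can both be seen in a simplified running-sum enrichment score. This is a sketch of the general GSEA-style statistic under stated assumptions, not the Broad Institute implementation: the function name, gene names and scores are hypothetical, and the permutation step that real GSEA uses to assess significance is omitted.

```python
def enrichment_score(gene_scores, gene_set, weight=1.0):
    """Walk down the ranked gene list; step up at set members (weighted by
    |score|), step down otherwise; return the largest deviation from zero."""
    ranked = sorted(gene_scores.items(), key=lambda item: item[1], reverse=True)
    hit_weights = [abs(score) ** weight for gene, score in ranked if gene in gene_set]
    n_miss = len(ranked) - len(hit_weights)
    hit_total = sum(hit_weights)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")

    running = 0.0
    best = 0.0
    for gene, score in ranked:
        if gene in gene_set:
            running += abs(score) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical per-gene summary values (e.g. a signed fold-change-like score)
scores = {"A": 3.0, "B": 2.0, "C": 1.0, "D": -1.0, "E": -2.0, "F": -3.0}

es_up = enrichment_score(scores, {"A", "B"})     # set concentrated at the top
es_down = enrichment_score(scores, {"E", "F"})   # set concentrated at the bottom
es_mixed = enrichment_score(scores, {"A", "F"})  # one gene at each extreme

print(f"up: {es_up:+.2f}  down: {es_down:+.2f}  mixed: {es_mixed:+.2f}")
```

In this toy ranking the mixed set gets a smaller absolute score than either one-directional set, even though both of its genes sit at the extremes of the list.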
• 487 colorectal cancer prognosis genes downloaded from Shi et al. 2012
• 11,521 genes as the reference gene set, taken from the protein-protein
interaction network used in the same paper
• Genes are from a human study
WebGestalt: WEB-based Gene SeT AnaLysis Toolkit
(http://bioinfo.vanderbilt.edu/webgestalt/)
Settings used in the example: organism hsapiens; gene ID type hsapiens_gene_symbol; reference set PPI_network
GO Analysis
Nodes with red labels represent enriched categories; nodes with black labels represent their non-enriched parents
KEGG Analysis
Genes highlighted in red in
the pathway map are enriched
in the user input
• 408 genes involved in the cellular responses to HIV envelope protein
infection in resting or suboptimally activated peripheral blood mononuclear
cells; Cicala et al. 2002
• Affymetrix U95A microarray chip (genome wide expression) as the
reference gene set
DAVID: Database for Annotation, Visualization and Integrated
Discovery (https://david.ncifcrf.gov/home.jsp)
HIV_genes
When multiple species pop up, click on the species of interest and press ‘Select Species’
If multiple gene lists are open in the program, select the gene list of interest and click on ‘Use’
Percentage, e.g. 33/398 (involved genes/total genes)
KEGG Pathway
BIOCARTA
User input genes classified into big gene functional groups
Measure of the importance of a gene group in the user’s gene list
Key biology of this gene group
Check if there are any other genes in the gene list or in the genome functionally similar to this gene group
GSEA dataset
• Transcriptional profiles from p53+ and p53 mutant cancer cell lines
• Expression datasets: P53_hgu95av2.gct, P53_collapsed.gct
'Collapsed' refers to datasets whose identifiers (i.e., Affymetrix probe
set IDs) have been replaced with gene symbols
• Phenotype labels (e.g., tumor vs. normal): P53.cls
• Gene set: c1.v2.symbols.gmt
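For readers who want to inspect these input files outside the GSEA program, the two simplest formats can be read with a few lines of code. The sketch assumes the commonly documented layouts: GMT is tab-delimited with the set name, a description, then the member genes; a categorical CLS file has a counts line, a '#' line naming the classes, and a line of per-sample labels. The example strings are hypothetical stand-ins, not the contents of P53.cls or c1.v2.symbols.gmt.

```python
def parse_gmt(text):
    """Return {gene_set_name: [genes]} from tab-delimited GMT text."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


def parse_cls(text):
    """Return (class_names, per_sample_labels) from categorical CLS text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    n_samples, n_classes, _ = (int(value) for value in lines[0].split())
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header does not match the labels that follow")
    return class_names, labels


# Hypothetical file contents, for illustration only
example_gmt = "SET_A\tdescription A\tTP53\tMDM2\nSET_B\tdescription B\tCDKN1A\n"
example_cls = "4 2 1\n# MUT WT\nMUT MUT WT WT\n"

gene_sets = parse_gmt(example_gmt)
class_names, labels = parse_cls(example_cls)
print(gene_sets)
print(class_names, labels)
```

To use this on real files, read the file into a string first, for example with `open(path).read()`.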
GSEA: Gene Set Enrichment Analysis