Tools and Algorithms in Bioinformatics GCBA815, Fall 2015

(1)

Week-8:

WebGestalt, DAVID, Gene Set Enrichment Analysis (GSEA)

Simarjeet K. Negi, Ph.D. candidate

(Guda Lab)

Department of Genetics, Cell Biology and Anatomy

University of Nebraska Medical Center

(2)

Why perform enrichment analysis?

• Large gene lists resulting from high-throughput analysis

• Deciphering the biology

• Organize expression changes into meaningful functional themes

• Gene enrichment analysis increases the likelihood of identifying the molecular processes/functions most pertinent to the study

(3)

• If a biological process is abnormal in a given study, the co-functioning

genes should have a higher (enriched) potential to be selected as a

relevant group by the high-throughput screening technologies

• The analytic conclusion rests on a group of relevant genes rather than on single genes, which increases the likelihood of identifying the biological processes most pertinent to the study

• Enrichment tools map a large number of ‘interesting’ genes to

biological annotation terms (e.g. GO Terms or Pathways)

• Statistical examination of the enrichment of user genes for each of the

annotation terms by comparing the outcome to the control (or reference)

background
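The statistical comparison against a reference background can be illustrated with a hypergeometric upper-tail test, one common choice for overrepresentation analysis. This is a minimal sketch, not the exact procedure of any particular tool; the function name and the toy counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes,
    when K of the N background genes carry the annotation term."""
    total = comb(N, n)  # number of ways to draw n genes from the background
    upper = min(n, K)   # the overlap cannot exceed either group size
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical toy numbers: 20 background genes, 5 carry the term,
# the user list has 4 genes, and 2 of them carry the term.
p_value = hypergeom_enrichment_p(k=2, n=4, K=5, N=20)
print(f"enrichment p-value: {p_value:.4f}")
```

A small p-value suggests that the term occurs in the user list more often than expected from the background alone.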

(4)

• Based on their underlying algorithms, current enrichment tools can be broadly divided into three classes:

• Singular enrichment analysis (SEA); WebGestalt

• Gene set enrichment analysis (GSEA); GSEA

• Modular enrichment analysis (MEA); DAVID

• Note, some tools with diverse capabilities belong to more than one class

Classification of Enrichment Tools

Overrepresentation approaches

(5)

WebGestalt: WEB-based Gene SeT AnaLysis Toolkit

(http://bioinfo.vanderbilt.edu/webgestalt/)

(6)

• Input: the user’s preselected ‘interesting’ genes (e.g. genes differentially expressed between experimental and control samples)

• Iteratively testing the enrichment of each annotation term one-by-one in

a linear mode

• Integrates functional enrichment analysis with information

visualization

• Constantly updated

• Efficiently processes large gene lists

• Weakness: the list of output terms can be large, diluting the focus and obscuring the interrelationships among relevant terms
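The term-by-term testing described above can be sketched as a simple loop that scores each annotation term independently. This is an illustrative sketch only, assuming a hypergeometric test per term; the function names, term names, and gene identifiers are hypothetical and do not reflect WebGestalt's internals.

```python
from math import comb

def upper_tail_p(k, n, K, N):
    """Hypergeometric probability of an overlap of at least k genes."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

def singular_enrichment(gene_list, background, annotations):
    """Test every annotation term one by one; return rows sorted by p-value.
    Each row is (term, overlap size, term size, p-value)."""
    background = set(background)
    genes = set(gene_list) & background          # ignore genes outside the reference
    rows = []
    for term, members in annotations.items():
        members = set(members) & background
        overlap = genes & members
        p = upper_tail_p(len(overlap), len(genes), len(members), len(background))
        rows.append((term, len(overlap), len(members), p))
    return sorted(rows, key=lambda row: row[3])

# Hypothetical toy data: 20 background genes and two annotation terms.
background = [f"g{i}" for i in range(1, 21)]
annotations = {
    "term_A": ["g1", "g2", "g3", "g4", "g5"],
    "term_B": [f"g{i}" for i in range(10, 20)],
}
results = singular_enrichment(["g1", "g2", "g3", "g10"], background, annotations)
for term, k, size, p in results:
    print(f"{term}\toverlap={k}\tterm_size={size}\tp={p:.4f}")
```

Because every term is scored in isolation, a real annotation database yields one row per term, which is why the output can grow long.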

(7)

DAVID: Database for Annotation, Visualization and Integrated Discovery

(https://david.ncifcrf.gov/home.jsp)

(8)

(9)

• DAVID inherits the basic enrichment calculation as found in

WebGestalt

• Input: user defined gene list

• Incorporates extra network discovery algorithms by considering the

term-to-term relationships

• Improves discovery sensitivity and specificity by considering the inter-relationships of GO terms in the enrichment calculations

• Joint terms may contain unique biological meaning for a given study, not

held by individual terms

• Weakness: not updated in recent years; the user input gene list is limited to 3,000 genes

(10)

GSEA: Gene Set Enrichment Analysis

(http://www.broadinstitute.org/gsea/)

• Identifies the enriched pathways/gene sets between two biological states

• The program uses an underlying database (MSigDB) of about 11,000 gene sets

that include KEGG, BIOCARTA pathways, curated sets from disease states, etc.

(11)

Seven Broader Collections of GSEA

• Search

• Browse

• Examine gene sets

• Investigate

• Download

(12)

GSEA: Gene Set Enrichment Analysis

• GSEA program (download to your PC)

• Input:

Expression dataset (between two conditions); Phenotype labels between two states; Gene

sets in gmx/gmt format (MSigDB - supplied by GSEA)

• GSEA implements a ‘no-cutoff’ strategy, taking all genes from a microarray experiment without first selecting significant genes (e.g. genes with P-value ≤ 0.05 and fold change ≥ 2)

• GSEA method requires a summarized biological value (e.g. fold change)

• Weakness:

• Sometimes, it is a difficult task to summarize many biological aspects of a gene into one

meaningful value; example: SNP arrays, clinical microarray studies

• GSEA has less power to detect a gene set that mixes genes positively and negatively associated with the phenotype
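The contrast between a cutoff-based gene selection and the ‘no-cutoff’ ranked list can be sketched as follows. The gene names and values are hypothetical, and the direction of the thresholds (P-value at most 0.05, at least a 2-fold change in either direction) is an assumption about what the example thresholds mean.

```python
from math import log2

# Hypothetical per-gene summary values: gene -> (fold change, p-value)
genes = {
    "GENE_A": (4.0, 0.001),
    "GENE_B": (1.5, 0.20),
    "GENE_C": (0.4, 0.03),
    "GENE_D": (1.1, 0.70),
    "GENE_E": (0.8, 0.04),
}

def cutoff_selection(table, p_max=0.05, fc_min=2.0):
    """Cutoff strategy: keep only genes passing both thresholds.
    A fold change of 0.5 counts as a 2-fold change downward."""
    return sorted(g for g, (fc, p) in table.items()
                  if p <= p_max and abs(log2(fc)) >= log2(fc_min))

def ranked_list(table):
    """No-cutoff strategy: keep every gene, ordered by one summary value
    (here log2 fold change, most up-regulated first)."""
    return sorted(table, key=lambda g: log2(table[g][0]), reverse=True)

selected = cutoff_selection(genes)
ranking = ranked_list(genes)
print("cutoff list:", selected)
print("ranked list:", ranking)
```

The cutoff list discards the modestly changed genes, while the ranked list keeps them so that small but coordinated shifts in a gene set can still contribute.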

(13)

(14)

• 487 colorectal cancer prognosis genes downloaded from Shi et al. 2012

• 11521 genes as the reference gene set from the protein-protein interaction

network used in the same paper

• Genes are from a human study

(15)

WebGestalt : WEB-based Gene SeT AnaLysis Toolkit

(http://bioinfo.vanderbilt.edu/webgestalt/)

hsapiens

hsapiens_gene_symbol

(16)

PPI_network

(17)

GO Analysis

Nodes with red labels represent enriched categories; nodes with black labels represent their non-enriched parents

(18)

KEGG Analysis

Genes highlighted in red in the pathway map are enriched in the user input

(19)

• 408 genes involved in the cellular responses to HIV envelope protein

infection in resting or suboptimally activated peripheral blood mononuclear

cells; Cicala et al. 2002

• Affymetrix U95A microarray chip (genome wide expression) as the

reference gene set

(20)

DAVID: Database for Annotation, Visualization and Integrated

Discovery (

https://david.ncifcrf.gov/home.jsp

)

(21)

HIV_genes

When multiple species pop up, click on the species of interest and press ‘Select Species’

If multiple gene lists are open in the program, select the gene list of interest and click on ‘Use’

(22)

Percentage, e.g. 33/398 (involved genes/total genes)

(23)

(24)

KEGG Pathway

BIOCARTA

(25)

(26)

(27)

User input genes classified into big gene functional groups

Measure of the importance of a gene group in the user’s gene list

Key biology of this gene group

Check if there are any other genes in the gene list or in the genome functionally similar to this gene group

(28)

GSEA dataset

• Transcriptional profiles from p53+ and p53 mutant cancer cell lines

• Expression datasets: P53_hgu95av2.gct, P53_collapsed.gct

'Collapsed' refers to datasets whose identifiers (i.e. Affymetrix probe set IDs) have been replaced with gene symbols

• Phenotype labels (e.g. tumor vs. normal): P53.cls

• Gene set: c1.v2.symbols.gmt
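For orientation, a gene set file in gmt format is plain tab-separated text: each line holds a set name, a description field, and then the member genes. The sketch below reads such text into a dictionary; the function name and the two example sets are hypothetical and are not taken from c1.v2.symbols.gmt.

```python
import io

def parse_gmt(handle):
    """Read gmt-formatted lines into {set_name: [gene, ...]}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:            # need name, description, and >= 1 gene
            continue
        name, _description, *members = fields
        gene_sets[name] = [g for g in members if g]   # drop empty trailing cells
    return gene_sets

# Hypothetical two-line gmt content held in memory instead of a file
example = (
    "SET_ONE\tna\tGENE_A\tGENE_B\tGENE_C\n"
    "SET_TWO\tna\tGENE_D\tGENE_E\n"
)
gene_sets = parse_gmt(io.StringIO(example))
print(gene_sets)
```

To read a real file, the same function can be given an open file handle, e.g. `with open(path) as fh: parse_gmt(fh)`.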

(29)

GSEA: Gene Set Enrichment Analysis

(30)

(31)

(32)

(33)

(34)

(35)

(36)

Interpreting GSEA Results

GSEA Statistics

GSEA computes four key statistics for the gene set enrichment analysis report:

● Enrichment Score (ES)

● Normalized Enrichment Score (NES)

● False Discovery Rate (FDR)

● Nominal P Value

(37)

Enrichment plot; Enrichment Score (ES)

• The ES is the maximum deviation from zero encountered in walking the list

• The enrichment score (ES) reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes

• GSEA calculates the ES by walking down

the ranked list of genes, increasing a

running-sum statistic when a gene is in the

gene set and decreasing it when it is not

• The magnitude of the increment depends on

the correlation of the gene with the

phenotype
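The running-sum walk described above can be sketched in a few lines. This is a simplified illustration rather than the GSEA program's code: the normalization of the steps (hit increments scaled to sum to 1, equal miss decrements summing to 1) and the exponent `p` are assumptions not stated on the slide, and the genes and correlation values are hypothetical.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Walk down the ranked list; return (ES, running-sum profile).
    ranked_genes and correlations are parallel lists, best-ranked first."""
    gene_set = set(gene_set)
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(r) ** p for r, h in zip(correlations, hits) if h)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, profile = 0.0, []
    for r, is_hit in zip(correlations, hits):
        if is_hit:
            running += abs(r) ** p / hit_total   # increment scaled by correlation
        else:
            running -= 1.0 / n_miss              # fixed decrement for a miss
        profile.append(running)

    es = max(profile, key=abs)                   # maximum deviation from zero
    return es, profile

# Hypothetical ranked list of six genes with their phenotype correlations
ranked = ["A", "B", "C", "D", "E", "F"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es, profile = enrichment_score(ranked, corr, gene_set={"A", "B", "E"})
print(f"ES = {es:.3f}")
print([round(x, 3) for x in profile])
```

In this toy case the set members cluster near the top of the list, so the running sum peaks early and the ES is positive.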

(38)

(39)

• Leading edge analysis identifies the subset of genes that actually contribute to the enrichment score (ES)

• The leading-edge subset of a gene set comprises the genes that appear in the ranked list at or before the point at which the running sum reaches its maximum
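Given a running-sum profile, the leading-edge subset defined above can be extracted as sketched below. The function name, gene names, and profile values are hypothetical, and the sketch covers only the case described here, where the peak is the maximum of the running sum (a positive ES).

```python
def leading_edge(ranked_genes, running_sum, gene_set):
    """Return gene-set members ranked at or before the running-sum maximum."""
    gene_set = set(gene_set)
    # Index of the first position where the running sum is highest
    peak = max(range(len(running_sum)), key=lambda i: running_sum[i])
    return [g for g in ranked_genes[:peak + 1] if g in gene_set]

# Hypothetical ranked list and its running-sum profile
ranked = ["A", "B", "C", "D", "E", "F"]
running_sum = [0.43, 0.71, 0.38, 0.05, 0.33, 0.0]
subset = leading_edge(ranked, running_sum, gene_set={"A", "B", "E"})
print(subset)
```

Gene E belongs to the set but falls after the peak, so it is not part of the leading edge in this toy example.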

(40)

Interpreting Leading Edge Analysis Results

HeatMap

Gene in Subsets

Histogram

(41)