(1)

Tools and Algorithms in Bioinformatics

GCBA815, Fall 2015

Week-8:

WebGestalt, DAVID, Gene Set Enrichment Analysis (GSEA)

Simarjeet K. Negi, Ph.D. candidate

(Guda Lab)

Department of Genetics, Cell Biology and Anatomy

University of Nebraska Medical Center

(2)

Why perform enrichment analysis?

• Large gene lists resulting from high-throughput analysis

• Deciphering the biology

• Organize expression changes into meaningful functional themes

• Gene enrichment analysis increases the likelihood of identifying the molecular processes/functions most pertinent to the study

(3)

• If a biological process is abnormal in a given study, the co-functioning genes should have a higher (enriched) potential to be selected as a relevant group by the high-throughput screening technologies

• The analytic conclusion is based on a group of relevant genes rather than on individual genes, which increases the likelihood of identifying the biological processes most pertinent to the study

• Enrichment tools map a large number of ‘interesting’ genes to biological annotation terms (e.g. GO terms or pathways)

• The enrichment of user genes for each annotation term is then examined statistically by comparing the outcome to a control (or reference) background
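
The comparison against a reference background can be written down as a worked formula. The slides do not say which test each tool applies, so the hypergeometric tail below is an assumption: it is one common choice for overrepresentation analysis. The symbols are introduced here for illustration: N is the number of genes in the background, K the number of background genes annotated with the term, n the size of the user gene list, and k the number of user genes annotated with the term.

```latex
% Probability of seeing k or more annotated genes in a list of n genes
% drawn at random from a background of N genes, K of which carry the term.
p \;=\; \sum_{i=k}^{\min(n,K)}
        \frac{\binom{K}{i}\,\binom{N-K}{n-i}}{\binom{N}{n}}
```

A small p means the term is annotated to more of the user genes than random selection from the background would be expected to produce.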

(4)

Classification of Enrichment Tools

• Based on differences in their algorithms, current enrichment tools can be broadly divided into three classes:

• Singular enrichment analysis (SEA), e.g. WebGestalt

• Gene set enrichment analysis (GSEA), e.g. GSEA

• Modular enrichment analysis (MEA), e.g. DAVID

• Note: some tools with diverse capabilities belong to more than one class

Overrepresentation approaches

(5)

WebGestalt: WEB-based Gene SeT AnaLysis Toolkit (http://bioinfo.vanderbilt.edu/webgestalt/)

(6)

• Input: the user’s preselected ‘interesting’ genes (e.g. differentially expressed genes selected between experimental and control samples)

• Iteratively tests the enrichment of each annotation term, one by one, in a linear mode

• Integrates functional enrichment analysis with information visualization

• Constantly updated

• Efficiently processes large gene lists

• Weakness: the output list of terms can be large, diluting the focus and obscuring the interrelationships of relevant terms
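
The term-by-term testing described above can be sketched in a few lines. This is a minimal illustration, not WebGestalt's implementation: the hypergeometric tail test, the function names, and the toy gene and term names are all assumptions made for the example.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def singular_enrichment(gene_list, background, annotations):
    """Test every annotation term independently (hypothetical helper).

    gene_list:   the user's 'interesting' genes
    background:  the reference gene set
    annotations: dict mapping term name -> set of annotated genes
    Returns (term, overlap, p_value) tuples sorted by p-value.
    """
    bg = set(background)
    genes = set(gene_list) & bg  # only genes present in the background count
    results = []
    for term, members in annotations.items():
        members_bg = set(members) & bg
        k = len(genes & members_bg)
        if k == 0:
            continue  # no overlap, nothing to report for this term
        p = hypergeom_tail(k, len(genes), len(members_bg), len(bg))
        results.append((term, k, p))
    results.sort(key=lambda r: r[2])
    return results


# Toy data, purely illustrative.
background = list("ABCDEFGHIJ")
annotations = {"TERM_1": {"A", "B", "C"}, "TERM_2": {"D"}}
print(singular_enrichment(["A", "B", "C"], background, annotations))
```

Because each term is tested on its own, related terms appear as separate rows in the output, which is where the long and redundant term lists mentioned above come from.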

(7)

DAVID: Database for Annotation, Visualization and Integrated Discovery (https://david.ncifcrf.gov/home.jsp)

(8)

DAVID: Database for Annotation, Visualization and Integrated

Discovery

(9)

• DAVID inherits the basic enrichment calculation found in WebGestalt

• Input: user-defined gene list

• Incorporates extra network discovery algorithms by considering term-to-term relationships

• Improves discovery sensitivity and specificity by considering the inter-relationships of GO terms in the enrichment calculations

• Joint terms may carry a unique biological meaning for a given study that is not held by the individual terms

• Weakness: not updated in recent years; the user input gene list is limited to 3000 genes
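
One way to put a number on a term-to-term relationship is an agreement score over the genes two terms share. The sketch below uses Cohen's kappa for that purpose. Treating kappa as the measure, and the function and gene names, are assumptions for illustration; the slides do not describe DAVID's internal calculation.

```python
def term_kappa(genes, term_a, term_b):
    """Cohen's kappa between two annotation terms over a gene universe.

    genes:  iterable of all genes considered
    term_a: set of genes annotated with the first term
    term_b: set of genes annotated with the second term
    """
    genes = list(genes)
    n = len(genes)
    both = sum(1 for g in genes if g in term_a and g in term_b)
    only_a = sum(1 for g in genes if g in term_a and g not in term_b)
    only_b = sum(1 for g in genes if g not in term_a and g in term_b)
    neither = n - both - only_a - only_b

    observed = (both + neither) / n
    # Agreement expected by chance from the marginal totals.
    expected = ((both + only_a) * (both + only_b)
                + (only_b + neither) * (only_a + neither)) / (n * n)
    if expected == 1.0:
        return 1.0  # both terms annotate all genes or no genes
    return (observed - expected) / (1.0 - expected)


universe = ["g1", "g2", "g3", "g4"]
print(term_kappa(universe, {"g1", "g2"}, {"g1", "g2"}))  # identical terms
print(term_kappa(universe, {"g1", "g2"}, {"g3", "g4"}))  # opposite terms
```

Terms with high pairwise agreement could then be grouped and reported together, which is the general idea behind modular enrichment analysis.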

(10)

GSEA: Gene Set Enrichment Analysis (http://www.broadinstitute.org/gsea/)

• Identifies the enriched pathways/gene sets between two biological states

• The program uses an underlying database (MSigDB) of about 11,000 gene sets that include KEGG and BIOCARTA pathways, curated sets from disease states, etc.

(11)

Seven Broader Collections of GSEA

• Search

• Browse

• Examine gene sets

• Investigate

• Download

(12)

GSEA: Gene Set Enrichment Analysis

• GSEA program (download to your PC)

• Input:

Expression dataset (between two conditions); phenotype labels for the two states; gene sets in gmx/gmt format (MSigDB, supplied by GSEA)

• GSEA implements a ‘no-cutoff’ strategy, taking all genes from a microarray experiment without first selecting significant genes (e.g. genes with P-value ≤ 0.05 and fold change ≥ 2)

• The GSEA method requires a summarized biological value (e.g. fold change) for each gene

• Weakness:

It is sometimes difficult to summarize the many biological aspects of a gene into one meaningful value; examples: SNP arrays, clinical microarray studies

GSEA is less powerful at detecting a gene set that mixes genes with positive and negative associations with the phenotype
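
The ‘no-cutoff’ idea can be illustrated by ranking every gene on one summary value instead of filtering first. The sketch below uses log2 fold change because the slide names fold change as an example; the function name, the toy expression values, and the assumption that all values are positive are illustrative only.

```python
from math import log2
from statistics import mean


def rank_by_log2_fold_change(expression, group_a, group_b):
    """Rank all genes by log2(mean of group A / mean of group B).

    expression: dict mapping gene -> list of positive expression values
    group_a, group_b: lists of sample indices for the two conditions
    No gene is discarded; the full ranked list is returned.
    """
    scores = {}
    for gene, values in expression.items():
        a = mean(values[i] for i in group_a)
        b = mean(values[i] for i in group_b)
        scores[gene] = log2(a / b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical expression values for three genes across four samples.
expression = {
    "GENE_1": [8, 8, 2, 2],
    "GENE_2": [2, 2, 8, 8],
    "GENE_3": [4, 4, 4, 4],
}
for gene, score in rank_by_log2_fold_change(expression, [0, 1], [2, 3]):
    print(gene, score)
```

A cutoff-based method would keep only the genes at the extremes of this list; the no-cutoff strategy passes the whole ranking on to the enrichment step.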

(13)
(14)

• 487 colorectal cancer prognosis genes downloaded from Shi et al. 2012

• 11521 genes as the reference gene set, taken from the protein-protein interaction network used in the same paper

• Genes are from a human study

(15)

WebGestalt: WEB-based Gene SeT AnaLysis Toolkit (http://bioinfo.vanderbilt.edu/webgestalt/)

hsapiens

hsapiens_gene_symbol

(16)

PPI_network

(17)

GO Analysis

Nodes with red labels represent enriched categories; nodes with black labels represent their non-enriched parents

(18)

KEGG Analysis

Genes highlighted in red in the pathway map are enriched in the user input

(19)

• 408 genes involved in the cellular responses to HIV envelope protein infection in resting or suboptimally activated peripheral blood mononuclear cells; Cicala et al. 2002

• Affymetrix U95A microarray chip (genome-wide expression) as the reference gene set

(20)

DAVID: Database for Annotation, Visualization and Integrated Discovery (https://david.ncifcrf.gov/home.jsp)

(21)

HIV_genes

When multiple species pop up, click on the species of interest and press ‘Select Species’

If multiple gene lists are open in the program, select the gene list of interest and click on ‘Use’

(22)

Percentage, e.g. 33/398 (involved genes/total genes)

(23)
(24)

KEGG Pathway

BIOCARTA

(25)
(26)
(27)

User input genes classified into big gene functional groups

Measure of the importance of a gene group in the user’s gene list

Key biology of this gene group

Check if there are any other genes in the gene list or in the genome functionally similar to this gene group

(28)

GSEA dataset

• Transcriptional profiles from p53+ and p53 mutant cancer cell lines

• Expression datasets: P53_hgu95av2.gct, P53_collapsed.gct

'Collapsed' refers to datasets whose identifiers (i.e. Affymetrix probe set IDs) have been replaced with gene symbols

• Phenotype labels (e.g. tumor vs. normal): P53.cls

• Gene set: c1.v2.symbols.gmt
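
The input files listed above are plain text and can be inspected before loading them into GSEA. The sketch below reads the two simplest ones, assuming the commonly documented layouts: a .gmt line is a tab-separated set name, description, and gene list, and a categorical .cls file has a counts line, a line of class names starting with "#", and one label per sample. The function names and the sample content are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT text into a dict of gene set name -> list of genes."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


def parse_cls(text):
    """Parse categorical CLS text into (class_names, per-sample labels)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header does not match its contents")
    return class_names, labels


# Made-up file contents, not taken from the P53 example files.
gmt_text = "SET_A\tna\tTP53\tMDM2\nSET_B\tna\tBAX\n"
cls_text = "4 2 1\n# MUT WT\nMUT MUT WT WT\n"
print(parse_gmt(gmt_text))
print(parse_cls(cls_text))
```

Checking that the number of labels in the .cls file matches the number of sample columns in the .gct file catches the most common loading problem early.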

(29)

GSEA: Gene Set Enrichment Analysis

(30)
(31)
(32)
(33)

(34)
(35)

(36)

Interpreting GSEA Results

GSEA Statistics

GSEA computes four key statistics for the gene set enrichment analysis report:

Enrichment Score (ES)

Normalized Enrichment Score (NES)

False Discovery Rate (FDR)

Nominal P Value

(37)

Enrichment plot; Enrichment Score (ES)

The ES is the maximum deviation from zero encountered in walking the list

The enrichment score (ES) reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes

GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not

The magnitude of the increment depends on the correlation of the gene with the phenotype
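
The running-sum walk described above can be sketched as follows. This is a simplified illustration, not the GSEA program's code: the exponent p, the equal-sized decrement for genes outside the set, and all names are assumptions chosen to match the description.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Walk a ranked gene list and return (ES, running-sum profile).

    ranked_genes: genes ordered from most positively to most negatively
                  correlated with the phenotype
    scores:       dict mapping gene -> correlation with the phenotype
    gene_set:     set of genes in the gene set being tested
    p:            weight exponent applied to |correlation| (assumed 1.0)
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = in_set.count(False)
    weights = [abs(scores[g]) ** p if hit else 0.0
               for g, hit in zip(ranked_genes, in_set)]
    total_weight = sum(weights)
    if total_weight == 0 or n_miss == 0:
        raise ValueError("need weighted hits and at least one miss")

    running = 0.0
    profile = []
    for weight, hit in zip(weights, in_set):
        if hit:
            running += weight / total_weight  # increment scales with correlation
        else:
            running -= 1.0 / n_miss           # fixed decrement for a miss
        profile.append(running)

    es = max(profile, key=abs)  # signed maximum deviation from zero
    return es, profile


ranked = ["A", "B", "C", "D"]
scores = {"A": 2.0, "B": 1.0, "C": -1.0, "D": -2.0}
es, profile = enrichment_score(ranked, scores, {"A", "B"})
print(es, profile)
```

A positive ES means the set members cluster at the top of the ranked list; a negative ES means they cluster at the bottom.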

(38)
(39)

Leading edge analysis identifies the subset of genes that actually contribute to the enrichment score (ES)

The leading edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero
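
Given a running-sum profile, the leading edge subset can be extracted with a short helper. The sketch assumes the profile has already been computed (one value per ranked gene). The handling of a negative score, where the genes at or after the trough are taken instead, is an assumption added for completeness and is not stated on the slide.

```python
def leading_edge(ranked_genes, gene_set, profile):
    """Return the gene set members that drive the enrichment score.

    ranked_genes: genes in ranked order
    gene_set:     set of genes in the gene set being tested
    profile:      running-sum value after each ranked gene
    """
    # Position where the running sum deviates most from zero.
    peak = max(range(len(profile)), key=lambda i: abs(profile[i]))
    if profile[peak] >= 0:
        window = ranked_genes[:peak + 1]  # at or before the peak
    else:
        window = ranked_genes[peak:]      # at or after the trough (assumed)
    return [g for g in window if g in gene_set]


# Hypothetical ranked list and running-sum values.
ranked = ["A", "B", "C", "D", "E"]
profile = [0.4, 0.8, 0.3, -0.2, 0.0]
print(leading_edge(ranked, {"A", "B", "E"}, profile))
```

In this toy case gene E belongs to the set but falls after the peak, so it is not part of the leading edge.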

(40)

Interpreting Leading Edge Analysis Results

HeatMap

Gene in Subsets

Histogram

(41)

Heat map shows the (clustered) genes in the leading edge subsets. The expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest)

Set-to-Set graph uses color intensity to show the overlap between subsets: the darker the color, the greater the overlap between the subsets

Gene in subsets graph shows each gene and the number of subsets in which it appears

Histogram shows the distribution of Jaccard values; the Jaccard is the intersection divided by the union for a pair of leading edge subsets. Number of Occurrences is the number of leading edge subset pairs in a particular bin. In this example, most subset pairs have no overlap (Jaccard = 0)
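
The overlap measure behind the histogram is simple to compute by hand. The sketch below calculates it for every pair of leading edge subsets; the subset names and gene members are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Intersection size divided by union size for two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(subsets):
    """Jaccard value for every pair of named leading edge subsets."""
    return {
        (name_a, name_b): jaccard(subsets[name_a], subsets[name_b])
        for name_a, name_b in combinations(sorted(subsets), 2)
    }


# Hypothetical leading edge subsets.
subsets = {
    "SET_1": {"A", "B"},
    "SET_2": {"B", "C"},
    "SET_3": {"X", "Y"},
}
for pair, value in pairwise_jaccard(subsets).items():
    print(pair, round(value, 3))
```

Binning these pairwise values and counting the pairs in each bin gives the histogram's Number of Occurrences axis.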
