Enriment analysis - Data analysis methods

1.2 Data analysis methods

1.2.5 Enriment analysis

Enriment analysis is one of widely used bioinformatics methods for a systematic dis-section of large genes lists, su as hits from systematic RNAi screens or lists of dif-ferentially expressed genes. anks to the biological knowledge gathered in various databases (e.g. Gene Ontology), the enriment analysis makes it possible to assemble a summary of various biological annotations that could be associated with the given set of genes. Su annotations may include not only Gene Ontology terms but also pathways, transcription factor binding sites, epigenetic markers, structural properties of proteins or any other annotations derived from previous studies.

All currently available enrichment analysis tools can be classified into four categories, depending on the algorithm they use: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA), modular enrichment analysis (MEA) (all three reviewed in Huang et al., 2009) and model-based gene set analysis (MGSA). Some tools implement several algorithms and may therefore belong to more than one class.

Singular Enriment Analysis (SEA)

This is the most commonly used approach, in which a list of hits is tested for enrichment of each annotation term, one term at a time. Thereafter, the individual enriched annotation terms that pass the enrichment 𝑝-value threshold are reported in a tabular format ordered by the enrichment probability (enrichment 𝑝-value).

e enriment 𝑝-value calculation, i.e. number of genes in the list that are annotated with a given annotation as compared to pure random ance, can be performed with the aid of some common and well-known statistical methods, including 𝜒^􏷡test, Fisher’s exact test, Binomial probability and Hypergeometric distribution, etc.

Gene Set Enriment Analysis (GSEA)

SEA approa strongly relies on a osen hit selection algorithm and user-defined thresh-olds. Moreover, the experimental results (i.e. level of expression or phenotype strength) are not considered. To overcome these limitations, a gene set enriment analysis (GSEA) method was developed (Mootha et al., 2003). GSEA sorts a complete list of experimental results and seares for annotations enried on its top or boom. is allows even mild effects to contribute to the overall enriment score. To calculate the significance of association, the Kolmogorov–Smirnov or the Mann–Whitney U -test are used (Keller et al., 2008; Subramanian et al., 2005).

However, tools in the GSEA class are also associated with some common limitations.

First, the ‘no-cutoff’ strategy is the key advantage of GSEA, but it is also becoming its major limitation in some biological studies. The GSEA method requires a summarized biological value (e.g. fold change) for each of the genes in the input. However, in quantitative studies, such as cell-based RNAi screens, this limitation is not a problem.

A more extensive introduction to GSEA methods is given in Chapter 5.

Modular Enriment Analysis (MEA)

MEA inherits the basic enrichment calculation found in SEA and incorporates additional algorithms that consider the relationships between annotation terms. An example of such an algorithm is implemented in the ProfCom tool (Antonov & Mewes, 2006; Antonov et al., 2008), which can profile enrichments of whole subgroups of available GO terms assembled in a Boolean fashion. This and other tools, such as Ontologizer (Bauer et al., 2008), topGO (Alexa et al., 2006), GENECODIS (Carmona-Saez et al., 2007; Nogales-Cadenas et al., 2009) and ADGO (Nam et al., 2006), claim to improve discovery sensitivity and specificity.

Model-based Gene Set Analysis (MGSA)

MGSA is a newly emerged class of methods that analyze all annotation categories at once by embedding them in a Bayesian network (Bauer et al., 2010; Zhang et al., 2010). Gene response is modeled as a function of the activation of biological categories. Probabilistic inference is used to identify the active categories. The Bayesian modeling approach naturally takes category overlap into account and avoids the need for multiple testing correction.

Multiple testing correction

Most enriment analysis teniques require multiple statistical tests and thus, the prob-ability of making at least one false discovery (i.e. that a given set of genes is enried for a specific term) increases significantly. In statistics, this probability is called the family-wise error rate (FWER). In general, all multiple testing correction teniques aempt to reduce FWER while keeping the testing power at the same time. e reduction of FWER is aieved by requiring a stronger level of evidence to be observed in order for an individual enriment to be called ‘significant’ (Lehmann & Romano, 2005).

The most commonly used multiple hypothesis correction method is the Bonferroni correction. Assuming 𝑛 independent statistical tests and a significance level for the whole family of tests of (at most) 𝛼, each individual test should be performed at the level 𝛼/𝑛. Equivalently, all individual 𝑝-values can be multiplied by 𝑛 before the significance threshold is applied. It is considered to be the most conservative of all multiple testing correction techniques (Rice et al., 2008).
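A minimal sketch of the Bonferroni adjustment is given below. The function name and the example 𝑝-values are hypothetical, and capping the adjusted values at 1 is a common convention added here rather than part of the description above.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


raw = [0.001, 0.02, 0.04, 0.3]       # hypothetical enrichment p-values
adjusted = bonferroni_adjust(raw)
# Compare the adjusted values with the family-wide significance level.
significant = [p <= 0.05 for p in adjusted]
```

In this example only the smallest 𝑝-value remains below 0.05 after adjustment, although three of the four raw values were below that threshold.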

is tenique proﬁdes the maximum FWER control, but it is considered to be too re-strictive for practical use in bioinformatics. An alternative to the Bonferroni correction was designed by Banjamini and Hoberg (Benjamini & Hoberg, 1995). It is the false discovery rate (FDR) control algorithm, whi correct for the expected number of false discoveries (in contrast to Bonferroni method that assumes the worst-case scenario). It was shown (Benjamini & Yekutieli, 2001; Williams et al., 1999) that their approa yields mu greater power than the Bonferroni tenique. e corrected 𝑝-value (𝑝_{􏸂􏸎􏸑􏸑}) can be calculated from the following formula:

𝑝_{corr} = \frac{𝑝 ⋅ 𝑛}{𝑟}    (1.16)

where 𝑛 is the total number of statistical tests performed and 𝑟 is the rank of the 𝑝-value (𝑝) in the list of all obtained 𝑝-values sorted in ascending order.
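The rank-based correction can be sketched as follows. The function name and the example values are hypothetical. In addition to the formula, the sketch enforces monotonicity, so that a smaller raw 𝑝-value never receives a larger corrected value than a larger one; this step is a commonly used convention and an assumption here, not something stated in the text above.

```python
def benjamini_hochberg_adjust(p_values):
    """Apply p_corr = p * n / r, with r the ascending rank of each p-value."""
    n = len(p_values)
    # Indices of the p-values in ascending order; position + 1 is the rank r.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that a corrected value never
    # exceeds the corrected value of a larger raw p-value (added assumption).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.04, 0.001, 0.03, 0.5]       # hypothetical enrichment p-values
adjusted = benjamini_hochberg_adjust(raw)
```

The corrected values are returned in the order of the input, so they can be matched directly to the corresponding annotation terms.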

Other multiple testing correction algorithms are based on various permutation approaches (Boyle et al., 2004). Boyle’s algorithm repeats the enrichment analysis on randomly picked lists of genes that are of the same size as the original list. The results obtained are used to generate null distributions of 𝑝-values for each annotation term. The null distribution can be constructed from at least 100 permutations. Finally, for a given term, a corrected 𝑝-value is calculated as the fraction of 𝑝-values from the null distribution that are equal to or lower than the observed 𝑝-value.
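The permutation procedure described above can be sketched for a single annotation term as follows. This is an illustration under stated assumptions rather than a reproduction of Boyle’s implementation: the function names, the toy gene identifiers and the use of a hypergeometric 𝑝-value as the underlying enrichment test are all hypothetical choices, and the annotated genes are assumed to be a subset of the background.

```python
import random
from math import comb


def enrichment_pvalue(background, annotated, hits):
    """Upper-tail hypergeometric p-value for one annotation term."""
    n_pop, n_ann, n_hits = len(background), len(annotated), len(hits)
    overlap = len(annotated & hits)
    tail = sum(
        comb(n_ann, k) * comb(n_pop - n_ann, n_hits - k)
        for k in range(overlap, min(n_ann, n_hits) + 1)
    )
    return tail / comb(n_pop, n_hits)


def permutation_corrected_pvalue(background, annotated, hits,
                                 n_permutations=100, seed=0):
    """Fraction of null p-values that are <= the observed p-value."""
    rng = random.Random(seed)
    observed = enrichment_pvalue(background, annotated, hits)
    pool = sorted(background)          # fixed order keeps runs reproducible
    n_as_extreme = 0
    for _ in range(n_permutations):
        # Random gene list of the same size as the original hit list.
        random_hits = set(rng.sample(pool, len(hits)))
        if enrichment_pvalue(background, annotated, random_hits) <= observed:
            n_as_extreme += 1
    return n_as_extreme / n_permutations


# Hypothetical toy data: 30 background genes, 6 annotated, 5 hits.
background = {f"gene{i}" for i in range(30)}
annotated = {f"gene{i}" for i in range(6)}
hits = {"gene0", "gene1", "gene2", "gene3", "gene20"}
p_corrected = permutation_corrected_pvalue(background, annotated, hits)
```

With 100 permutations the corrected 𝑝-value can only take values in steps of 0.01, so more permutations are needed when finer resolution is required.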

Examples of the application of enrichment analysis are given in the following chapters, especially in Chapter 5, which is devoted exclusively to a novel GSEA algorithm that uses protein structure-derived information as annotation terms.

In document Integration and analysis of phenotypic data from functional screens (Page 34-37)