1.2 Data analysis methods
1.2.5 Enrichment analysis
Enrichment analysis is a widely used bioinformatics method for the systematic dissection of large gene lists, such as hits from systematic RNAi screens or lists of differentially expressed genes. Thanks to the biological knowledge gathered in various databases (e.g. Gene Ontology), enrichment analysis makes it possible to assemble a summary of the biological annotations associated with a given set of genes. Such annotations may include not only Gene Ontology terms but also pathways, transcription factor binding sites, epigenetic markers, structural properties of proteins or any other annotations derived from previous studies.
All currently available enrichment analysis tools can be classified into four categories, depending on the algorithm they use: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA), modular enrichment analysis (MEA) (all three reviewed in Huang et al., 2009) and model-based gene set analysis (MGSA). Some tools implement several algorithms and may therefore belong to more than one class.
Singular Enrichment Analysis (SEA)
This is the most commonly used approach, in which a list of hits is tested for enrichment of each annotation term, one term at a time. Thereafter, the individual enriched annotation terms that pass the enrichment 𝑝-value threshold are reported in a tabular format ordered by the enrichment probability (enrichment 𝑝-value).
The enrichment 𝑝-value calculation, i.e. the comparison of the number of genes in the list that carry a given annotation with the number expected by pure random chance, can be performed with common and well-known statistical methods, including the 𝜒² test, Fisher’s exact test, the binomial probability and the hypergeometric distribution.
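As an illustration, the hypergeometric variant of this calculation can be sketched as follows. This is a minimal sketch and not the implementation of any particular tool; the function name and its parameters are hypothetical. It returns the probability of drawing at least the observed number of annotated genes when the hit list is sampled at random from the background, which is equivalent to a one-sided Fisher's exact test.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_background, n_annotated, n_hits, n_hits_annotated):
    """Upper-tail hypergeometric probability P(X >= n_hits_annotated).

    n_background     -- number of genes in the background (assumed universe)
    n_annotated      -- background genes carrying the annotation term
    n_hits           -- size of the hit list
    n_hits_annotated -- hits carrying the annotation term
    """
    # Number of ways to draw any hit list of this size from the background.
    total = comb(n_background, n_hits)
    # The overlap cannot exceed either the term size or the hit list size.
    upper = min(n_annotated, n_hits)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_hits - k)
        for k in range(n_hits_annotated, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 000 background genes, 200 annotated with the term,
# 100 hits, 8 of which carry the term.
p_value = hypergeom_enrichment_pvalue(20000, 200, 100, 8)
```

In an SEA workflow this function would be called once per annotation term, and the resulting 𝑝-values compared against the chosen threshold.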
Gene Set Enrichment Analysis (GSEA)
The SEA approach relies strongly on the chosen hit selection algorithm and on user-defined thresholds. Moreover, the experimental results (i.e. level of expression or phenotype strength) are not considered. To overcome these limitations, the gene set enrichment analysis (GSEA) method was developed (Mootha et al., 2003). GSEA sorts the complete list of experimental results and searches for annotations enriched at its top or bottom. This allows even mild effects to contribute to the overall enrichment score. To calculate the significance of the association, the Kolmogorov–Smirnov test or the Mann–Whitney U-test is used (Keller et al., 2008; Subramanian et al., 2005).
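The idea of searching a ranked list for annotations concentrated at its top or bottom can be sketched with a simplified, unweighted Kolmogorov–Smirnov-like running sum. This is an illustrative assumption and not the exact statistic of any published GSEA tool; the function name is hypothetical, and the significance estimation is not included.

```python
def running_sum_enrichment_score(ranked_genes, gene_set):
    """Signed maximum deviation of an unweighted running sum.

    ranked_genes -- all measured genes, sorted by the experimental value
    gene_set     -- genes carrying the annotation term of interest
    """
    members = set(gene_set)
    n_hits = sum(1 for gene in ranked_genes if gene in members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    score = 0.0
    for gene in ranked_genes:
        # Step up on a set member, step down otherwise; the sum ends at zero.
        running += 1.0 / n_hits if gene in members else -1.0 / n_miss
        if abs(running) > abs(score):
            score = running
    return score


ranking = ["a", "b", "c", "d", "e", "f"]
top_score = running_sum_enrichment_score(ranking, {"a", "b"})
bottom_score = running_sum_enrichment_score(ranking, {"e", "f"})
```

A positive score indicates that the set members accumulate near the top of the ranking, and a negative score that they accumulate near the bottom.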
However, tools in the GSEA class are also associated with some common limitations.
First, the ‘no-cutoff’ strategy is the key advantage of GSEA, but it also becomes a major limitation in some biological studies: the GSEA method requires a summarized biological value (e.g. fold change) for each of the genes in the input. In quantitative studies such as cell-based RNAi screens, however, this requirement is not a problem.
A more extensive introduction to GSEA methods is given in Chapter 5.
Modular Enrichment Analysis (MEA)
MEA inherits the basic enrichment calculation found in SEA and incorporates additional algorithms that consider the relationships between annotation terms. An example of such an algorithm is implemented in the ProfCom tool (Antonov & Mewes, 2006; Antonov et al., 2008), which can profile enrichments of whole subgroups of available GO terms assembled in a Boolean fashion. This and other tools, such as Ontologizer (Bauer et al., 2008), topGO (Alexa et al., 2006), GENECODIS (Carmona-Saez et al., 2007; Nogales-Cadenas et al., 2009) and ADGO (Nam et al., 2006), claim to improve discovery sensitivity and specificity.
Model-based Gene Set Analysis (MGSA)
MGSA is a newly emerged class of methods that analyze all annotation categories at once by embedding them in a Bayesian network (Bauer et al., 2010; Zhang et al., 2010). Gene response is modeled as a function of the activation of biological categories. Probabilistic inference is used to identify the active categories. The Bayesian modeling approach naturally takes category overlap into account and avoids the need for multiple testing correction.
Multiple testing correction
Most enrichment analysis techniques require multiple statistical tests, and thus the probability of making at least one false discovery (i.e. wrongly concluding that a given set of genes is enriched for a specific term) increases significantly. In statistics, this probability is called the family-wise error rate (FWER). In general, all multiple testing correction techniques attempt to reduce the FWER while retaining testing power. The reduction of the FWER is achieved by requiring a stronger level of evidence to be observed in order for an individual enrichment to be called ‘significant’ (Lehmann & Romano, 2005).
The most commonly used multiple hypothesis correction method is the Bonferroni correction. Assuming 𝑛 independent statistical tests and a significance level of (at most) 𝛼 for the whole family of tests, each individual test should be performed at the level 𝛼/𝑛. Equivalently, all individual 𝑝-values can be multiplied by 𝑛 before the significance threshold is applied. It is considered to be the most conservative of all multiple testing correction techniques (Rice et al., 2008).
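The Bonferroni adjustment described above can be sketched in a few lines. The function name is hypothetical, and capping the adjusted values at 1 is an added assumption (a probability cannot exceed 1) that is not stated in the text.

```python
def bonferroni_adjust(p_values):
    """Multiply every p-value by the number of tests, capped at 1.0."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


# Three hypothetical enrichment p-values tested at a family-wise level of 0.05.
adjusted = bonferroni_adjust([0.01, 0.02, 0.5])
significant = [p <= 0.05 for p in adjusted]
```

Comparing the adjusted values with 𝛼 gives the same decisions as comparing the raw 𝑝-values with 𝛼/𝑛.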
This technique provides the maximum FWER control, but it is considered too restrictive for practical use in bioinformatics. An alternative to the Bonferroni correction was designed by Benjamini and Hochberg (Benjamini & Hochberg, 1995). It is the false discovery rate (FDR) control algorithm, which controls the expected proportion of false discoveries among the reported enrichments (in contrast to the Bonferroni method, which assumes the worst-case scenario). It was shown (Benjamini & Yekutieli, 2001; Williams et al., 1999) that their approach yields much greater power than the Bonferroni technique. The corrected 𝑝-value (𝑝_corr) can be calculated from the following formula:
\[ p_{\mathrm{corr}} = \frac{p \cdot n}{r} \qquad (1.16) \]
where 𝑛 is the total number of statistical tests performed and 𝑟 is the rank of the 𝑝-value (𝑝) in the list of all obtained 𝑝-values sorted in ascending order.
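A minimal sketch of this correction is given below; the function name is hypothetical. Each 𝑝-value is scaled according to Eq. 1.16. In addition, the sketch takes a running minimum from the largest rank downwards, so that the corrected values never decrease with rank and never exceed 1. This extra step is not part of the formula above; it is included because common implementations of the Benjamini–Hochberg adjustment apply it.

```python
def benjamini_hochberg_adjust(p_values):
    """Benjamini-Hochberg corrected p-values, returned in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted in ascending order (rank 1 = smallest).
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        scaled = p_values[index] * n / rank  # Eq. 1.16
        # Assumed extra step: keep corrected values monotone and at most 1.
        running_min = min(running_min, scaled)
        adjusted[index] = running_min
    return adjusted


corrected = benjamini_hochberg_adjust([0.01, 0.04, 0.03, 0.5])
```

In this example the 𝑝-value 0.03 (rank 2) would be scaled to 0.06 by Eq. 1.16 alone, but the running minimum lowers it to the value obtained for rank 3 (about 0.053).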
Other multiple testing correction algorithms are based on various permutation approaches (Boyle et al., 2004). Boyle’s algorithm repeats the enrichment analysis on randomly picked lists of genes that are of the same size as the original list. The results obtained are used to generate a null distribution of 𝑝-values for each annotation term. The null distribution can be constructed from at least 100 permutations. Finally, for a given term, the corrected 𝑝-value is calculated as the fraction of 𝑝-values from the null distribution that are equal to or lower than the observed 𝑝-value.
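The permutation procedure can be sketched for a single annotation term as follows. This is an illustration under stated assumptions, not a reproduction of Boyle's implementation: the function names are hypothetical, the per-list enrichment 𝑝-value is assumed to be an upper-tail hypergeometric probability, and the random lists are drawn without replacement from a user-supplied background.

```python
import random
from math import comb


def hypergeom_pvalue(n_background, n_annotated, n_list, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap)."""
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_list - k)
        for k in range(n_overlap, min(n_annotated, n_list) + 1)
    )
    return tail / comb(n_background, n_list)


def permutation_corrected_pvalue(background, annotated, hits,
                                 n_permutations=100, seed=0):
    """Fraction of null p-values that are <= the observed p-value."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    background = list(background)
    term_genes = set(annotated) & set(background)

    def enrichment_pvalue(gene_list):
        overlap = sum(1 for gene in gene_list if gene in term_genes)
        return hypergeom_pvalue(len(background), len(term_genes),
                                len(gene_list), overlap)

    observed = enrichment_pvalue(hits)
    # Null distribution: random gene lists of the same size as the hit list.
    null = [enrichment_pvalue(rng.sample(background, len(hits)))
            for _ in range(n_permutations)]
    return sum(1 for p in null if p <= observed) / n_permutations


genes = [f"g{i}" for i in range(50)]
strong = permutation_corrected_pvalue(genes, genes[:10], genes[:10])
absent = permutation_corrected_pvalue(genes, genes[:10], genes[40:])
```

With 100 permutations the corrected 𝑝-value can only take values in steps of 0.01, so a larger number of permutations gives a finer resolution.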
Examples of the application of enrichment analysis are given in the following chapters, especially in Chapter 5, which is devoted exclusively to a novel GSEA algorithm utilising protein structure-derived information as annotation terms.