3.2 Methods
3.2.2 Experiment specific methods
3.2.2.1 Experiment 1: Pathway based analysis of genes and interactions
The major aim of this analysis was to identify and study the dominant metabolic pathways and interactions involved in the maintenance and regulation of steroidogenesis and androstenone biosynthesis in porcine testicular tissues. For this purpose, an integrative, knowledge-driven approach was used that merged interaction network and pathway information from the KEGG database with gene expression data from RNA-seq experiments. However, a current limitation of this approach for studying androstenone metabolism is that none of the major pathway databases contains data on the metabolic reaction steps or gene interactions involved in androstenone biosynthesis. As a workaround, androstenone biosynthesis was considered an offshoot of the steroid hormone (testosterone) synthesis pathway in testis, under the assumption that the pathways and interaction events that affect steroid hormone biosynthesis could also affect androstenone biosynthesis.
This analysis section has three subsections: (i) identification of significant interactions, (ii) KEGG pathway enrichment and (iii) variant calling. The first subsection describes the methods used to identify statistically significant pathway interactions, the second details the steps followed in interaction-based pathway enrichment, and the last describes the gene polymorphism analysis.
Identification of significant interactions
The major objective of this analysis was to identify significant pathway interactions by merging RNA-seq gene expression data with the KEGG pathway interaction network (section 3.1.1.3). Only the testis samples from the RNA-seq expression data (see section 3.1.1.1) were used. As noted in Table 3.1, the normalized porcine testis expression matrix contained expression measurements of 16,760 genes in 10 samples, whereas the KEGG gene interaction network contains interactions for only 3,510 genes (section 3.1.1.3). Hence, the first step was to trim the testis gene expression data set to the genes present in the KEGG interaction network. After trimming, only the 2,871 genes common to the gene expression data set and the KEGG interaction network were retained. The KEGG interaction network was likewise trimmed down to these 2,871 genes and contained 23,198 edges.
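The trimming step can be illustrated with a minimal sketch. The thesis performed this step with custom R functions; the Python below is only an illustration, and the data layout (a dictionary of expression vectors and a list of gene pairs), the function name `trim_to_common_genes` and the toy gene identifiers are all assumptions.

```python
def trim_to_common_genes(expression, edges):
    """Keep only genes present in both the expression data and the network.

    expression: dict mapping gene id -> list of expression values (assumed layout)
    edges: list of (gene_a, gene_b) pairs from the interaction network
    """
    network_genes = {gene for edge in edges for gene in edge}
    common = network_genes & set(expression)
    trimmed_expression = {gene: expression[gene] for gene in common}
    # An edge survives only if both of its endpoints are in the common gene set.
    trimmed_edges = [(a, b) for a, b in edges if a in common and b in common]
    return trimmed_expression, trimmed_edges


# Toy data, not taken from the thesis.
expression = {"g1": [1.0, 2.0], "g2": [3.0, 1.0], "g4": [0.5, 0.7]}
edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g4")]
trimmed_expression, trimmed_edges = trim_to_common_genes(expression, edges)
```

In this toy case the edge involving `g3` is dropped because `g3` has no expression measurements.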
In the next analysis phase, Pearson correlation coefficients (PCC) of gene expression values were calculated separately for the HA and LA testis samples, and the edges of the trimmed pathway interaction network were weighted with these correlation values. This step gave rise to two pathway interaction networks: in the first network, the edges were weighted with correlation coefficients derived from LA testis expression data (“LA network”), and in the second network, the edges were weighted with correlation coefficients derived from HA testis expression data (“HA network”). Both the LA and HA networks comprised 2,871 nodes and 15,960 edges.
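As a rough illustration of the edge weighting, the sketch below computes a Pearson correlation per edge for each sample group. It is written in Python rather than the R used in the thesis, and the function names, the column layout of the toy matrix and the gene identifiers are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def weight_edges(edges, expression, sample_idx):
    """Return {edge: PCC}, computed on the given sample columns only."""
    weights = {}
    for a, b in edges:
        values_a = [expression[a][i] for i in sample_idx]
        values_b = [expression[b][i] for i in sample_idx]
        weights[(a, b)] = pearson(values_a, values_b)
    return weights


# Toy matrix: columns 0-2 stand in for HA samples, columns 3-5 for LA samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "g2": [2.0, 4.0, 6.5, 3.0, 2.0, 0.5],
}
edges = [("g1", "g2")]
ha_network = weight_edges(edges, expression, [0, 1, 2])
la_network = weight_edges(edges, expression, [3, 4, 5])
```

The same edge list is reused for both groups, so the two resulting networks share their topology and differ only in their edge weights.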
To identify the interactions that differed significantly between the LA and HA networks, the edge weights (correlation coefficients) of both networks were transformed to z-scores using the Fisher r-to-z transformation:
\[ z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right), \quad \text{where } r \text{ is the PCC} \tag{3.1} \]
Following the calculation of z-scores for the interactions in both networks, the differences between the z-scores were calculated. For each edge z-score in the LA network, the corresponding edge z-score from the HA network was retrieved and the difference between the two was calculated as:
\[ \mathrm{zscore}_{\mathrm{DIFF}} = \mathrm{zscore}_{\mathrm{LA}} - \mathrm{zscore}_{\mathrm{HA}} \tag{3.2} \]
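Equations 3.1 and 3.2 can be checked numerically with a short sketch. This is an illustrative Python version, not the R code used in the thesis; the function names and the two example correlations are assumptions.

```python
import math


def fisher_z(r):
    """Fisher r-to-z transformation (Equation 3.1); requires -1 < r < 1."""
    return 0.5 * math.log((1 + r) / (1 - r))


def zscore_diff(r_la, r_ha):
    """Difference of the transformed LA and HA correlations (Equation 3.2)."""
    return fisher_z(r_la) - fisher_z(r_ha)


# Toy correlations for a single edge, not taken from the thesis.
diff = zscore_diff(0.8, -0.3)
```

The transformation is the inverse hyperbolic tangent of r, so `fisher_z(r)` agrees with `math.atanh(r)`.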
In the following analysis step, a two-step evaluation based on permutation and random sampling (Ripley, 1987) was used to identify significant zscoreDIFF values. Permutation and random sampling based methods for estimating significance thresholds have already been used in high-throughput studies (Gatti et al., 2009; Zhang et al., 2012). The evaluation criteria used in this step were:
(i) zscoreDIFF should be significant at an empirical p-value threshold of < 0.05 against a set of z-scores generated by randomly sampling the original expression data set.
(ii) At least one of the correlations used to calculate zscoreDIFF, either from the LA or from the HA expression set, must be significant at an empirical p-value threshold of < 0.05 against a set of correlations generated by randomly sampling the original expression data set.
To generate the set of z-scores used in evaluation criterion (i), a random expression matrix was generated by randomly shuffling the whole testis gene expression values and assigning them to two sample groups. The purpose of the shuffling and reassignment was to break up the original ordering and classification of the expression values and samples as belonging to either the HA or the LA sample set, and so to generate two completely random expression matrices (expression sets). Pearson correlation coefficients, z-scores and z-score differences were calculated on these random expression matrices following the previously described steps, and the entire process was repeated 10,000 times to generate a set of random z-score differences (zscoreRAND) for each interaction. The empirical p-value for each zscoreDIFF was calculated as:
\[ \mathrm{Pval}_{\mathrm{Empirical}} = \frac{\#\,(\mathrm{zscore}_{\mathrm{RAND}} > \mathrm{zscore}_{\mathrm{DIFF}})}{N}, \quad \text{where } N = 10{,}000 \tag{3.3} \]
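The resampling procedure and Equation 3.3 can be sketched as follows. This is a simplified Python illustration and not the R implementation used in the thesis: it shuffles sample labels for a single edge (one possible reading of the shuffling step), uses 1,000 instead of 10,000 permutations to keep the toy example fast, and clips correlations away from ±1 as an added safeguard. All names and data values are assumptions.

```python
import math
import random


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)


def fisher_z(r, eps=1e-12):
    # Clip r away from +/-1 so the logarithm stays finite (added safeguard).
    r = max(min(r, 1 - eps), -1 + eps)
    return 0.5 * math.log((1 + r) / (1 - r))


def empirical_p(observed, random_values):
    """Equation 3.3: fraction of random values exceeding the observed one."""
    return sum(v > observed for v in random_values) / len(random_values)


def permuted_diffs(x, y, n_la, n_perm, rng):
    """z-score differences after randomly reassigning samples to two groups."""
    idx = list(range(len(x)))
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(idx)
        la, ha = idx[:n_la], idx[n_la:]
        z_la = fisher_z(pearson([x[i] for i in la], [y[i] for i in la]))
        z_ha = fisher_z(pearson([x[i] for i in ha], [y[i] for i in ha]))
        diffs.append(z_la - z_ha)
    return diffs


# Toy expression values for one edge: first 5 samples "LA", last 5 "HA".
x = [1.0, 2.1, 2.9, 4.2, 5.0, 1.1, 2.0, 3.2, 3.9, 5.1]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.0, 3.8, 3.1, 2.2, 0.9]
observed = fisher_z(pearson(x[:5], y[:5])) - fisher_z(pearson(x[5:], y[5:]))
random_diffs = permuted_diffs(x, y, n_la=5, n_perm=1000, rng=random.Random(0))
p_value = empirical_p(observed, random_diffs)
```

As written, Equation 3.3 counts only random values larger than the observed difference; a two-sided variant would compare absolute values, which the source does not state.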
A similar procedure was followed to calculate the empirical p-values for the correlations in evaluation criterion (ii), where the correlation coefficients from randomly sampled expression data were compared with the original correlation coefficients from the LA or HA data sets. Once the identification of significant interactions (correlations) was complete, the significant interactions were further classified into eight correlation types: HA positive, HA positive significance, HA negative, HA negative significance, LA positive, LA positive significance, LA negative and LA negative significance. The rules used to classify these correlation types, together with the edge colors and line styles used to visualize them, are given in Table 3.2. These classification rules were mainly used in the visualization step, and all the interaction networks in this work were visualized using Cytoscape. All of the above analysis procedures were carried out in the statistical computing platform R, and several custom R functions were written to perform these analysis steps. The R package igraph (Csardi and Nepusz, 2006) was used for network analysis and manipulation.
Table 3.2: Interaction edge classification rules. Set of rules used for the classification of interactions (correlations) and for assigning correlation types, edge colors and line styles.

| Correlation type | Correlation coefficient in HA testis samples | Correlation coefficient in LA testis samples | Edge color for visualization | Edge line style for visualization |
|---|---|---|---|---|
| HA positive | positive and significant | negative | red | solid line |
| HA positive significance | positive and significant | positive | red | dashed line |
| HA negative | negative and significant | positive | light green | solid line |
| HA negative significance | negative and significant | positive or negative | light green | dashed line |
| LA positive | negative | positive and significant | green | solid line |
| LA positive significance | positive | positive and significant | green | dashed line |
| LA negative | positive | negative and significant | orange | solid line |
| LA negative significance | positive or negative | negative and significant | orange | dashed line |
KEGG pathway enrichment analysis
Once the identification of significant interactions was completed, the next step was to identify the pathways enriched for significant interactions. In this step, rather than performing a conventional gene enrichment analysis, an interaction enrichment analysis was performed, following the school of thought that the interactions of a gene reveal more about the functions of that gene in a phenotype. A custom function was written in R to perform the hypergeometric test and assess which pathways were over-represented for significant interactions. The p-values generated by the R phyper function were then corrected for multiple testing using the Benjamini–Hochberg procedure. Finally, pathways with an adjusted p-value < 0.05 were considered significantly enriched (over-represented) for the identified interactions.
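The enrichment test can be sketched as below. The thesis used a custom R function built on phyper; this Python version is only an illustration of a hypergeometric upper-tail test followed by Benjamini–Hochberg adjustment. The exact parameterization used in the thesis is not given in the text, so the counts, the pathway names and the choice of P(X >= k) are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when drawing N items from M, of which n are 'successes'.

    Here (as an assumption): M = interactions in the network,
    n = interactions in the pathway, N = significant interactions,
    k = significant interactions that fall in the pathway.
    """
    total = comb(M, N)
    upper = min(n, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy counts, not from the thesis: (interactions in pathway, significant ones).
M, N = 100, 10
pathways = {"pathway_A": (20, 6), "pathway_B": (30, 3), "pathway_C": (15, 1)}
names = list(pathways)
raw = [hypergeom_upper_tail(pathways[p][1], M, pathways[p][0], N) for p in names]
adjusted = benjamini_hochberg(raw)
enriched = [p for p, q in zip(names, adjusted) if q < 0.05]
```

Testing interactions rather than genes only changes what is counted; the test itself is the same hypergeometric over-representation test used in gene-based enrichment.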
Variant calling
This section describes the analysis methods used in the variant calling pipeline. The pipeline used utilities and tools implemented in the GATK and Picard software suites and the SAMTools function mpileup.
The input data used in this pipeline were:
(i) BAM format sequence alignments from TopHat (see section 3.2.1.1),
(ii) Sscrofa10.2 DNA sequences in FASTA format and
(iii) SNP annotations in VCF format (see section 3.1.1.4).
The variant calling pipeline described below was adapted from the GATK guideline on best practices for variant calling. The pipeline used GATK algorithms and the Picard function MarkDuplicates for realigning and re-indexing the BAM files, and the SAMTools function mpileup for variant calling. Figure 3.4 shows the workflow followed for the variant calling pipeline in this thesis. In the final step of the pipeline, the realigned and recalibrated reads in BAM format were used for variant calling with the SAMTools function mpileup. The initial set of polymorphisms obtained from SAMTools was further filtered with the following parameters: Root Mean Square (RMS) Phred quality score greater than 20, read depth greater than 50 and SNP quality score greater than 20. Furthermore, all polymorphisms mapped to intronic positions of genes were excluded from this analysis. The chromosomal positions and reference alleles of the final filtered set of polymorphisms were cross-checked against the dbSNP database (Build 136) to identify variants that were already represented in the SNP database. The possible amino acid coding effects of these polymorphisms, such as synonymous mutation, non-synonymous mutation and start/stop codon gain or loss, and their genomic positions, such as upstream, downstream or in a UTR (untranslated region), were predicted using the SnpEff software.
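The post-calling filter can be illustrated with a small sketch. It does not parse real VCF files and is not the procedure used in the thesis; the record layout and the field names `rms_quality`, `depth`, `snp_quality` and `region` are hypothetical stand-ins for the three quality measures and the genomic annotation described above.

```python
def passes_filters(variant, min_rms_quality=20, min_depth=50, min_snp_quality=20):
    """Return True if a variant exceeds all three thresholds and is not intronic."""
    return (
        variant["rms_quality"] > min_rms_quality
        and variant["depth"] > min_depth
        and variant["snp_quality"] > min_snp_quality
        and variant["region"] != "intron"
    )


# Toy records with hypothetical field names; real VCF parsing is not shown.
variants = [
    {"id": "v1", "rms_quality": 35, "depth": 120, "snp_quality": 60, "region": "exon"},
    {"id": "v2", "rms_quality": 35, "depth": 40, "snp_quality": 60, "region": "exon"},
    {"id": "v3", "rms_quality": 35, "depth": 120, "snp_quality": 60, "region": "intron"},
    {"id": "v4", "rms_quality": 20, "depth": 120, "snp_quality": 60, "region": "UTR"},
]
kept = [v["id"] for v in variants if passes_filters(v)]
```

The comparisons are strict, matching the "greater than" wording of the thresholds, so a variant with a quality of exactly 20 is excluded.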
[Figure 3.4 flow chart, steps in reading order: .bam; RealignerTargetCreator (GATK); IndelRealigner (GATK); MarkDuplicates (Picard); CountCovariates (GATK); TableRecalibration (GATK); processed .bam; mpileup (SAMTools); SNPs per .bam (VCF).]
Figure 3.4: Flow chart of variant calling pipeline used in this experiment.
[Figure 3.5 workflow diagram. Elements shown: data Q.C of raw reads; alignment (TopHat/BWA) with Sscrofa10.2 gene annotation (NCBI); read count (BEDTools); normalization of gene expression data (R package: limma voom); KEGG interaction network and pathway information; pruning of the interaction network and normalized data for common genes; Pearson correlation giving HA and LA interaction networks with correlation coefficient weighted edges; z-score calculation (HA and LA edges separately) and z-score differences between HA and LA edges; resampling-based pruning for significant z-score differences (p<0.05) with at least one significant correlation (p<0.05); enrichment analysis by hypergeometric test (p<0.05); candidate genes from enriched pathways; variant calling (samtools mpileup); dbSNP (build 137) porcine SNP annotation; SnpEff (SNP effect prediction). Result 1: significant interactions. Result 2: pathways enriched for significant interactions. Result 3: polymorphisms on genes involved in significant interactions.]
Figure 3.5: Pathway based analysis workflow. Legend: White parallelograms with grey outline: Input/output data and results. White cylinders with red outline: data from external databases. Rectangles with light blue shades: various tools and analysis processes used in this workflow.
3.2.2.2 Experiment 2: Identification of gene co-expression clusters in liver tissues