An initial model for gene expression data

CHAPTER 4. CONFOUNDING EFFECTS IN DOUBLE SHRINKAGE

4.2 An initial model for gene expression data

This section presents the RNA-seq expression data set used as a motivating example and the initial hierarchical Bayesian model used to analyze it.

A diploid genome has two sets of chromosomes, one from each parent, so every gene has two copies. One advantage of next-generation sequencing is that it makes it possible to measure the expression of each gene copy separately; we refer to this measure as allele-specific expression (ASE).

ASE can be obtained using single nucleotide polymorphisms (SNPs), which make it possible to distinguish the expression of the two alleles (Sun and Hu, 2014). We assume that ASE counts are available for a single hybrid variety; the main interest is to detect genes whose alleles are differentially expressed.

Let $B_{gi}$ and $M_{gi}$ be the transcript abundance counts for gene $g$ in sample $i$ for the reference and non-reference alleles, respectively. In this paper we do not model the observed transcript abundance of each allele directly; instead, we apply a logarithmic transformation and then use a normal data model.

The log-transformed allele ratio, $d_{gi}$, and its centered version, $y_{gi}$, are defined as follows:

$$d_{gi} = \log\frac{B_{gi}+1}{M_{gi}+1}, \qquad y_{gi} = d_{gi} - \frac{\sum_{g,i} d_{gi}}{\sum_{g} n_g}$$

where $n_g$ is the number of samples for gene $g$, and the addition of 1 read to each count in the ratio $d_{gi}$ ensures that the transformation is well defined.
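The transformation above can be sketched in a few lines. This is a minimal illustration, not the code used for the analysis: the function name `centered_log_ratio` is hypothetical, the natural logarithm is assumed (the base is not stated here), and the counts are assumed to form a complete genes-by-samples array, so that the grand mean over all entries equals the ratio of sums in the definition of $y_{gi}$.

```python
import numpy as np

def centered_log_ratio(B, M):
    """Return (d, y) for allele count arrays of shape (genes, samples).

    B: reference allele counts, M: non-reference allele counts.
    Assumes a complete rectangular array (same samples for every gene).
    """
    B = np.asarray(B, dtype=float)
    M = np.asarray(M, dtype=float)
    # Add 1 read to each count so the ratio is defined when a count is zero.
    d = np.log((B + 1.0) / (M + 1.0))
    # Subtract the grand mean over all gene-sample pairs.
    y = d - d.mean()
    return d, y

# Toy example with 2 genes and 2 samples (made-up counts).
B = [[0, 3], [1, 1]]
M = [[0, 1], [3, 3]]
d, y = centered_log_ratio(B, M)
```

After centering, the values of `y` sum to zero across all genes and samples, so any shift common to all genes is removed before modeling.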

The response variable is the centered log allele expression ratio, $y_{gi}$. It is important to center the allele ratio to remove any systematic difference affecting all genes; such effects are not relevant for detecting interesting genes and are usually contaminated with bias.

It could be argued that it is better to model the ASE counts directly rather than their log-transformed version. However, the main interest here is to illustrate confounding effects and explore their causes. Normal hierarchical models are better suited to this goal because they are more analytically tractable. In addition, it has been suggested that transformation-based methods show competitive performance in detecting differentially expressed genes (Soneson and Delorenzi, 2013). Nevertheless, in Section 4.6 we also analyze the ASE counts with a Poisson-lognormal mixture model, which shows similar results in terms of confounding effects.

As an initial model, we assume that the response variable is normally distributed, that the group means are also normally distributed, and that the group variances follow an inverse gamma distribution; this model is presented in (4.2). As mentioned in the previous section, the normal data model with both group-specific means and variances has received attention in recent years for the analysis of microarray data (Gene Hwang et al., 2009; Hwang and Liu, 2010; Zhao, 2010). By using hierarchical distributions for the group mean and group variance parameters, we can borrow information across groups, which is especially appealing when there are few observations per group and thousands of groups.

$$y_{gi} \overset{ind}{\sim} N(\mu_g, \sigma_g^2), \qquad \mu_g \overset{ind}{\sim} N(\mu_0, \sigma_0^2), \qquad \sigma_g^2 \overset{ind}{\sim} IG(\nu\tau/2, \tau/2) \tag{4.2}$$
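To make the hierarchy concrete, the following sketch draws one synthetic data set from a model of this form. It is an illustration under stated assumptions rather than part of the original analysis: the function `simulate_hierarchical_normal` and its argument names are hypothetical, and the inverse gamma is taken in the shape and scale convention, with density proportional to $x^{-a-1} e^{-b/x}$, so the caller decides how the two arguments of the IG distribution in (4.2) map to `ig_shape` and `ig_scale`.

```python
import numpy as np

def simulate_hierarchical_normal(n_genes, n_samples, mu0, sigma0,
                                 ig_shape, ig_scale, seed=0):
    """Draw (y, mu, sigma2) from a normal model with gene-specific
    means and variances.

    Assumed convention: sigma2 ~ IG(ig_shape, ig_scale) means that
    1 / sigma2 ~ Gamma(shape=ig_shape, rate=ig_scale).
    """
    rng = np.random.default_rng(seed)
    # Gene-specific means from the normal hierarchical distribution.
    mu = rng.normal(mu0, sigma0, size=n_genes)
    # Gene-specific variances: invert gamma draws (NumPy uses scale = 1 / rate).
    precision = rng.gamma(shape=ig_shape, scale=1.0 / ig_scale, size=n_genes)
    sigma2 = 1.0 / precision
    # Observations: n_samples independent normal draws per gene.
    y = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None],
                   size=(n_genes, n_samples))
    return y, mu, sigma2

# Many genes, few samples per gene: the setting where borrowing
# information across groups is most useful. Values are made up.
y, mu, sigma2 = simulate_hierarchical_normal(
    n_genes=1000, n_samples=4, mu0=0.0, sigma0=1.0,
    ig_shape=3.0, ig_scale=2.0)
```

Simulated data of this kind can be used to check how often genes with large true means but large variances end up with credible intervals that cover zero.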

In terms of the log-transformed ASE counts described above, the main interest is to identify groups (genes) where $\mu_g \neq 0$, since this means that the gene shows different expression levels between alleles. Figure 4.1 shows the results of full Bayesian inference for model (4.2) (details on how inference is performed are described later). Each panel is a bivariate histogram of posterior expectations against observed group means. The top row facets show the posterior expectation of the group means, and the bottom row facets show the square root of the posterior expectation of the group variances. A gene is declared differentially expressed (DE) if a 95% credible interval for $\mu_g$ does not contain zero, and non-DE otherwise.
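The decision rule just described can be sketched as follows, assuming posterior draws of the group means are already available as an array. The function name `flag_de` is hypothetical, and an equal-tailed interval is assumed, since the type of credible interval is not specified here.

```python
import numpy as np

def flag_de(mu_draws, level=0.95):
    """Flag genes whose credible interval for mu_g excludes zero.

    mu_draws: array of shape (n_draws, n_genes) of posterior draws.
    Uses an equal-tailed interval (an assumption of this sketch).
    """
    alpha = 1.0 - level
    lower, upper = np.quantile(mu_draws, [alpha / 2, 1 - alpha / 2], axis=0)
    # DE when the whole interval lies on one side of zero.
    return (lower > 0) | (upper < 0)

# Toy draws: gene 0 is centered at zero, gene 1 is clearly positive.
draws = np.column_stack([np.linspace(-1.0, 1.0, 1001),
                         np.linspace(2.0, 3.0, 1001)])
is_de = flag_de(draws)
```

Note that the rule depends on the width of the interval as well as its center, so a gene with a large posterior variance can be declared non-DE even when its observed mean is far from zero.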

Figure 4.1 suggests that groups (genes) with an observed sample mean close to zero are non-differentially expressed, while genes with larger observed sample means yield credible intervals that do not contain zero. However, a few genes with large sample means are not flagged as differentially expressed. The reason can be found in the bottom row panels: those genes have the largest estimated variances in the data set. This might constitute a reasonable explanation of the signals present in the data. Genes with weak signals, or with large signals but too much noise, are found to be non-differentially expressed, while genes with strong signals and low levels of noise are flagged as DE.
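The link between a large estimated variance and a small posterior mean can be illustrated with the standard conjugate result for a normal mean. Conditional on $\sigma_g^2$, $\mu_0$ and $\sigma_0^2$, the posterior mean of $\mu_g$ is a precision-weighted average of the sample mean and $\mu_0$. The sketch below is only an illustration: the function name and the numerical values are made up, and the full Bayesian inference reported in Figure 4.1 also averages over the uncertainty in the variances and hyperparameters.

```python
def conditional_posterior_mean(ybar, n, sigma2_g, mu0, sigma0_2):
    """Posterior mean and variance of mu_g given sigma2_g, mu0, sigma0_2.

    ybar: sample mean of the n observations for the gene.
    Standard normal-normal conjugate update.
    """
    prec_data = n / sigma2_g      # precision contributed by the data
    prec_prior = 1.0 / sigma0_2   # precision of the hierarchical distribution
    w = prec_data / (prec_data + prec_prior)
    post_mean = w * ybar + (1.0 - w) * mu0
    post_var = 1.0 / (prec_data + prec_prior)
    return post_mean, post_var

# Same observed mean, two different gene variances (made-up values).
low_noise = conditional_posterior_mean(3.0, 4, 0.5, 0.0, 1.0)
high_noise = conditional_posterior_mean(3.0, 4, 50.0, 0.0, 1.0)
```

With a small variance the posterior mean stays close to the sample mean, while with a large variance it is pulled almost all the way to $\mu_0$. This is the mechanism by which a gene with a large observed mean can still be declared non-DE.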

However, it seems suspicious that all genes with the largest observed means end up with small posterior expectations and large posterior variances. A similar effect is pointed out by Cook et al. (2007): the most interesting genes detected using plots had very large adjusted p-values. In the rest of this paper, we make a case against the explanation of the data provided by model (4.2), and suggest that sometimes strong signals are confounded with noise when we use a

[Figure 4.1: bivariate histograms. Column facets: non-DE, DE. Row facets: means, variances. Axes: observed mean and posterior mean. Color scale: frequency (1 to 1000).]

Figure 4.1 Bivariate histograms of model results for the ASE counts of the Paschold et al. (2012) hybrid data, for the model with normal and inverse gamma hierarchical distributions. Posterior expectation of the group mean against the group sample mean (top facets), and square root of the posterior expectation of the group variance against the group sample mean (bottom facets). Column facets indicate whether a gene has its alleles differentially expressed (DE) or not (non-DE).

In document Bayesian analysis of high-dimensional count data (Page 74-78)