ABSTRACT
SU, YUHUA. Mixture Models for Gene Expression Experiments with Two Species. (Under the direction of Dr. Jason Osborne.)
A bivariate mixture model utilizing information across two species is proposed to solve the
fundamental problem of identifying differentially expressed genes in microarray experiments.
Orthologs, or genes from two different species that originated from a common ancestor, have the
potential to exploit similarities between species to better understand the genetic basis of disease
and treatment. The proposed approach intuitively models the distribution of the estimated
treatment effects with minimal assumptions. The mixture model posits up to nine components,
four of which include groups in which genes are differentially expressed in both species. An
EM algorithm is developed to accomplish the nontrivial likelihood maximization, along with
methodology for handling singular covariance matrices that arise during the implementation
of the algorithm. A comprehensive simulation to evaluate the model performance and two
applications on real world data sets, a dog and human lymphoma data set prepared by a group
of scientists in the College of Veterinary Medicine at North Carolina State University and a
mouse and human type II diabetes experiment sponsored by GlaxoSmithKline, suggest that
the proposed model, though highly structured, can handle various situations and is practically
useful, especially when the magnitude of differential expression due to the different treatment
intervention is weak. In both applications, the proposed 9-component mixture model is able
to eliminate unimportant genes and identify a list of genes that are potential candidate
biomarkers. Though the primary motivation for the development of the bivariate mixture
model is to enable identification of genes whose differential expression extends from humans to
another species, possible extension to classification/prediction of cancer type or drug response
is also initiated in the two case studies. In the dog and human lymphoma study, a very small
number of genes are identified as being differentially expressed in both species, and the human
genes among them cluster the human patients into two subgroups, the germinal-center B-cell-like diffuse large B-cell lymphoma and
the activated B-cell-like diffuse large B-cell lymphoma. Additionally, the two subgroups defined
by this cluster of human genes have significantly different survival functions, indicating that
the stratification based on gene-expression profiling using the proposed 9-component mixture
model provides better insight into the clinical differences between the two types of cancer.
The application of the 9-component mixture model on the mouse and human type II diabetes
experiment is less successful. While the mixture model is able to separate differentially expressed
genes from those non-differentially expressed ones, attempts at predicting human drug response
status using the genes identified as being differentially expressed in both species did not lead
to the same success as the lymphoma experiment. This may be due to the fact that there is
little evidence of any differential expression. The linear model for week 8 expression in human
genes was one of many possible models, but it did not uncover much evidence of a treatment
effect. Nonetheless, a potential multi-gene predictor may still be developed according to the
genes identified by the proposed 9-component mixture model to benefit patients in therapeutic
© Copyright 2011 by Yuhua Su
Mixture Models for Gene Expression Experiments with Two Species
by Yuhua Su
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
Statistics
Raleigh, North Carolina
2011
APPROVED BY:
Dr. Lei Zhu Dr. Jacqueline Hughes-Oliver
Dr. Dahlia Nielsen Dr. Jung-Ying Tzeng
DEDICATION
BIOGRAPHY
Yuhua was born in Kaohsiung, Taiwan. She received her Bachelor of Arts degree in Public
Finance from National Chengchi University in 2001 and her Master’s degree in Economics from
the University of North Carolina at Chapel Hill in 2003. She decided to pursue a career in
Statistics and joined the Department of Statistics at North Carolina State University in 2004.
ACKNOWLEDGEMENTS
I have been truly fortunate to have Dr. Jason Osborne as an advisor. His unwavering faith in
my ability has helped me stay the course even when I doubted myself the most. His guidance
has been priceless, and it truly has been a pleasure and honor to work with him.
I greatly appreciate all the support and encouragement I received from my committee
member and manager, Dr. Lei Zhu. She has been a wonderful role model for me.
I am deeply indebted to Dr. Dahlia Nielsen for answering many questions for the two-species
data and to Dr. Chris Smith, Dr. Kristy Richards, and Dr. Matthew Breen for sharing the
data.
I am very grateful to Dr. Jacqueline Hughes-Oliver and Dr. Jung-Ying Tzeng for their
invaluable suggestions on my research.
I want to thank my parents, Er-Chen Su and Hsiu-Chu Su-Liu, and my sisters, Li-Wen Su
and Pi-Hwa Su, who kept believing in me throughout this journey, even after all this time.
Last but certainly not least, I wish to thank my husband, Andrew Cigna, for his support
and devotion through this roller coaster called graduate school, and my three precious babies,
Winry, Yuna and Duke for helping me keep my eye on what is truly important in life.
TABLE OF CONTENTS
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Gene expression and drug development
1.3 Homology and multiple species gene expression analysis
1.4 Statistical issues in microarray data analysis
1.4.1 Conventional search for differentially expressed genes
1.4.2 Correction for multiple testing
Chapter 2 Joint modeling across species
2.1 A 9-component bivariate mixture model for two species
2.1.1 Comparison with Ogorek (2008)
2.2 Mixture models and the EM algorithm
2.2.1 Mixture models
2.2.2 The EM algorithm
2.2.3 The EM algorithm for mixture models
2.3 The EM algorithm with constraints
2.4 Regularized covariance matrices in the EM algorithm
Chapter 3 Simulation
3.1 Simulation studies: Case I and Case II
3.1.1 Parameter determination
3.1.2 Data generation
3.1.3 Starting values of the EM algorithm
3.1.4 Results of simulation studies
3.2 Simulation study: Case III
3.3 Simulation study to evaluate the choice of the regularization parameter for regularized covariance matrices
3.4 Conclusions
Chapter 4 Applications
4.1 Application I: Gene selection and cancer type classification on the dog and human diffuse large-B-cell lymphoma study
4.1.1 Introduction
4.1.2 Data description
4.1.3 Data analysis
4.1.4 Results
4.2 Application II: The mouse and human type II diabetes experiment
4.2.1 Introduction
4.2.2 Data analysis
4.2.3 Results
4.2.4 Conclusions
Chapter 5 Concluding remarks
References
Appendix
Appendix A Derivation of the distribution of the least squares estimators $(\hat\beta_{1ai}, \hat\beta_{1hi})$
LIST OF TABLES
Table 1.1 Table of outcomes. FP = false positive, TP = true positive, FN = false negative, and TN = true negative.
Table 2.1 Possible categories of treatment effects: a prior for $(\beta_{1ai}, \beta_{1hi})^T$
Table 3.1 Number of genes in the $k$th category for simulation studies Case I and Case II
Table 3.2 Combination of parameters for simulation studies Case I and Case II
Table 3.3 Summary of the parameter estimates for the 8 different scenarios under simulation study Case I. Averaged over the 500 simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations. NE: not estimated.
Table 3.4 Summary of the parameter estimates for the 8 different scenarios under simulation study Case II. Averaged over the 500 simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations. NE: not estimated.
Table 3.5 Monte Carlo mean squared error for each estimator under the 8 different scenarios in simulation study Case I. NE: not estimated.
Table 3.6 Monte Carlo mean squared error for each estimator under the 8 different scenarios in simulation study Case II. NE: not estimated.
Table 3.7 Proportions of genes in category 0 classified into category $q$ under simulation studies Case I and Case II
Table 3.8 Under simulation studies Case I and Case II, the number of genes selected based on (a) the 9-component bivariate mixture model, (b) the conventional one-species approach, and (c) Ogorek (2008). Numbers in parentheses are the observed FDRs. Averaged over the 500 simulated data sets. Tukey’s HSD for an $\alpha$ level of 0.05 is included beneath each set of eight simulation cases. NA: not available. ∗: numbers averaged after exclusion of data sets due to convergence criteria not met in PROC NLP. ∗∗: numbers averaged after exclusion of data sets due to convergence criteria not met in PROC NLP and data sets where all genes are classified as I-orthologs.
Table 3.9 A three-way analysis of variance (ANOVA) table to quantify the variability among the results (gene counts and observed FDRs) obtained using the proposed 9-component mixture model in Table 3.8 for the 16 different simulated situations under simulation studies Case I and Case II. ANOVA was performed independently for simulation studies Case I and Case II.
Table 3.10 Results of Tukey’s HSD test for pairwise comparison of the results (gene counts and observed FDRs) obtained using the proposed 9-component mixture model in Table 3.8 for the 16 different simulated cases under simulation studies Case I and Case II. ∗: significantly different at the 0.05 level (gene counts), •: significantly different at the 0.05 level (observed FDRs), NA: not available.
Table 3.11 Combination of parameters for simulation study Case III
Table 3.12 Summary of the parameter estimates for the 9-component mixture model under simulation study Case III. Averaged over the 500 simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations. NE: not estimated.
Table 3.13 Summary of the parameter estimates for Ogorek (2008) under simulation study Case III. Averaged over the corresponding simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations.
Table 3.14 Number of genes, averaged over the 500 simulated data sets, selected using (a) the 9-component mixture model and (b) Ogorek (2008) under simulation study Case III. Numbers in parentheses are the observed FDRs, averaged over the 500 simulated data sets for each case. ∗: numbers averaged after exclusion of data sets due to the failure of convergence in PROC NLP. ∗∗: numbers averaged after exclusion of data sets due to the failure of convergence in PROC NLP and data sets where all genes are classified as I-orthologs.
Table 3.15 Divergence measures for various values of the regularization constant $\alpha$. Averaged over the 500 simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations. NE: not estimated.
Table 3.16 Gene identification based on various values of $\alpha$, the regularization parameter. Numbers in parentheses are the observed FDRs.
Table 4.1 The $k$-means clustering results for the dog and human lymphoma study
Table 4.2 Parameter estimates of the bivariate 9-component mixture model for the dog and human lymphoma study. Averaged over the 156 LOOCV outcomes. NE: not estimated.
Table 4.3 The bootstrap standard errors for parameters of interest in the 9-component mixture model for the dog and human lymphoma study
Table 4.4 Summary of gene counts in each category based on the results of the
Table 4.5 Misclassification tables for the dog and human lymphoma study using different criteria. (a) and (b) are classification results of human subjects based on two-species analysis, the 9-component mixture model. (a) is the misclassification table for human subjects using genes in category (1, 2, 3, 4); (b) is the misclassification table using genes in category (1, 2, 3, 4, 5, 6). (c) and (d) are misclassification tables for human subjects based on single-species (human) analysis. (c) is the misclassification table for human subjects from genes selected using FDR = 0.00001; and (d) is the misclassification table from genes selected using FDR = 0.01.
Table 4.6 Overall misclassification rate for the dog and human lymphoma study
Table 4.7 Mean and median survival time, measured in years (standard errors in parentheses), for the dog and human lymphoma study. (a) no stratification, and stratified into two subgroups, ABC DLBCL and GCB DLBCL, by (b) gene-expression profiling performed in Lenz et al. (2008a) and Lenz et al. (2008b) and (c) gene-expression profiling using the proposed 9-component mixture model
Table 4.8 Summary of the gene-specific information for the 21 human genes in categories (1, 2, 3, 4) determined by the 9-component mixture model for the dog and human lymphoma study. Retrieved from Entrez Gene, NCBI’s database for gene-specific information.
Table 4.9 Parameter estimates (bootstrap (B = 1000) standard errors in parentheses) of the bivariate 9-component mixture model for the mouse and human type II diabetes experiment. NE: not estimated.
Table 4.10 Summary of the 9-category gene counts for the mouse and human type II diabetes experiment
Table 4.11 Misclassification table for the prediction results of human drug response for the mouse and human type II diabetes experiment. Gene expression used: $X_{hil}^{\text{week8}} - X_{hil}^{\text{week0}}$
Table 4.12 Misclassification table for the prediction results of human drug response for the mouse and human type II diabetes experiment. Gene expression
used: Xhweek8
LIST OF FIGURES
Figure 1.1 The drug development and approval chain (adopted from Bolten and DeGregorio (2002))
Figure 2.1 Scatter plots of $(\hat\beta_{1ai}, \hat\beta_{1hi})^T$ from (a) a simulated situation and (b) an analysis of real data (the dog-and-human lymphoma study discussed in Chapter 4)
Figure 3.1 An example of the determination of the starting values for the EM algorithm
Figure 3.2 Simulation study Case I: Plots of proportions of genes in category $k$ classified into category $q$
Figure 3.3 Simulation study Case II: Plots of proportions of genes in category $k$ classified into category $q$
Figure 3.4 Plots of PPV and sensitivity for simulation studies Case I and Case II
Figure 3.5 Under simulation studies Case I and Case II, the number of genes selected based on (1) the 9-component bivariate mixture model, (2) the conventional one-species approach, (3) Ogorek (2008), and the associated observed FDRs
Figure 3.6 Scatter plots of $(\hat\beta_{1ai}, \hat\beta_{1hi})^T$ under simulation study Case III
Figure 3.7 Divergency scores for parameter estimation based on different values of regularization constant $\alpha$. Comparison between the reference group $\alpha = 0.0001$ and $\alpha = (0.1, 0.01, 0.001, 0.00001)^T$.
Figure 4.1 Plots of the bootstrap standard errors for parameters of interest in the 9-component mixture model for the dog and human lymphoma study
Figure 4.2 Scatter plots of $(\hat\beta_{1a}, \hat\beta_{1h})^T$ for the dog and human lymphoma study: (a) all orthologs, (b) orthologs in category (1, 2, 3, 4), (c) orthologs in category (1, 2, 3, 4, 5, 6), (d) orthologs selected based on human data only with FDR cutoff value = 0.00001, and (e) orthologs selected based on human data only with FDR cutoff value = 0.01.
Figure 4.3 Kaplan-Meier survival probability estimates for the dog and human lymphoma study. (a) no stratification, (b) stratification based on the results of gene-expression profiling performed in Lenz et al. (2008a) and Lenz et al. (2008b), and (c) stratification based on gene-expression profiling using the proposed 9-component mixture model.
Figure 4.4 Histograms of p-values from tests of no treatment effects for the mouse and human type II diabetes experiment
Figure 4.5 Scatter plots of the estimated treatment effects $(\hat\beta_{1ai}, \hat\beta_{1hi})^T$ before and after gene membership identification for the mouse and human type II
Chapter 1
Introduction
1.1 Background
Pharmaceutical medicine is an industry with huge up-front investment for rewards that may
or may not come years later. The drug development process is lengthy, expensive, and risky.
According to the US Food and Drug Administration (FDA, 2004), the average total cost of
developing a drug is about $1.9 billion. The typical development time is 10 to 15 years. The
overall attrition rate¹ of a drug compound from first-in-man to registration is approximately
80%–90% (Bolten and DeGregorio, 2002; Kola and Landis, 2004). Figure 1.1 (adopted from
Bolten and DeGregorio (2002)) depicts a complete drug development process, including drug
discovery, preclinical research (on animals) and clinical trials (on humans).
FDA (2004) refers to preclinical and clinical research together as the “critical path”
development phase, where most of the investment required for a successful drug launch occurs. Currently,
this development phase is inherently inefficient. The goal of preclinical research is to assess how
a drug is absorbed, distributed, metabolized, and excreted in animals, and to use the findings
to predict potential human outcomes before starting clinical trials. Yet the rate of success
after a drug candidate enters Phase I is undesirably low. Lack of efficacy and lack of safety are the
major causes of attrition. As mentioned in FDA (2004) and Kola and Landis (2004), animal
models with poor clinical relevance may account for this problem.
¹ Attrition rates describe the rate at which investigational drugs fall out of testing in the various clinical
A common limitation of the current drug development strategy is that prior information
is partly or completely ignored when analyzing and interpreting the results of the most recent
clinical trial. When a drug shows promising efficacy and safety in animal models, but fails later
in clinical studies, it is important to identify what causes the translation failure between the
two species. It is commonly known that there are drugs that work well in humans but not in
animals, and vice versa, which is one of the major reasons for drug attrition. Once a failure
occurs, the candidate compound is discarded and the discovery team is pressed to come up with
a new candidate. However, without a precise understanding of why the first candidate failed,
there is no assurance that the second candidate will perform any better. Hence, improving
translation between two species is of tremendous value to drug discovery and development.
By modeling across species, this research aims to reduce the attrition of the large
percentage of compounds that fail because of poor translation from animal models to human
clinical trials, and hence to improve the predictive power of animal models for human studies.
1.2 Gene expression and drug development
In a project undertaken by GlaxoSmithKline (GSK) to investigate the treatment effect of
rosiglitazone on type II diabetes, both preclinical (mice) and clinical (humans) experiments were
implemented. In order to evaluate the treatment intervention, blood glucose, insulin, hemoglobin
A1c (HbA1c), and other diabetes-related laboratory measurements were taken for both mice and
humans. While it is possible to evaluate the efficacy of the drug compound in terms of these
lab measurements, a good explanation of the mechanism for drug activities common to both
species is still needed. If a drug candidate works on animals, it will be applied to humans in
clinical trials. This same compound may or may not work on humans. In this case, an approach
that provides any insight into the pharmaceutical and biomedical differences between species
Figure 1.1: The drug development and approval chain (adopted from Bolten and DeGregorio (2002))
development. This is the essential objective of this research.
Microarrays are tools for gene expression analysis and have been used successfully in a
wide range of applications. As summarized in Slonim (2002), some of the common themes in
microarray data analysis include detection of differential expression, clustering, and predicting
disease status. An advantage of microarray technology is that it can assist researchers in better
defining and understanding the expression profile of a given genotype associated with disease or
the effects from exposure to certain stimuli. For example, Golub et al. (1999) developed a class
discovery procedure based on microarray gene expression to discover the distinction between
acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), whose appearances
are highly similar. Gray et al. (1998) used microarrays to identify differences in yeast gene
expression before and after treatment with various kinase inhibitors.
activities that translates across species. Under a single species experiment, Dhiman et al. (2002)
concluded that gene expression data can provide logical answers to problems of vaccine failure
or give important leads to identification of novel vaccine candidates as scientists will be able to
unravel the gene expression data into fundamental biological principles and design vaccines to
target pathogens specific to given human genotypes. The utility of microarray information in
the drug development process is reviewed by Braxton and Bedilion (1998) who embrace the idea
that gene expression analysis can be a surrogate marker for the interaction between compounds
and cells, and should yield information about efficacy.
Hence, it is believed that, as discussed in Debouck and Goodfellow (1999), by measuring
the expression patterns of thousands of genes in response to drug treatments, microarrays can
be used to generate clues to patterns of gene function that can help improve the efficiency of
drug development.
1.3 Homology and multiple species gene expression analysis
Homology designates a relationship of common descent between any species, and there are two
types of homologous sequences: paralogous and orthologous (Koonin, 2001). Paralogs are genes
related via duplication; they often belong to the same species, but this is not necessary (Koonin,
2001). Orthologs are two genes from two different species that derive from a single gene in the
last common ancestor of the species (Sonnhammer and Koonin, 2002). Orthologs are, typically,
functional counterparts in different species. However, some orthologs are highly conserved whereas
the similarity between others is barely detectable (Koonin, 2001; Theißen, 2002). Orthologous
relationships can be one-to-one, one-to-many, or many-to-many (Theißen, 2002).
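The three kinds of orthologous relationship can be illustrated with a small sketch. It assumes that orthology is supplied as a list of (gene in species A, gene in species B) pairs and labels each pair by the size of the connected group it belongs to. The function name and the gene identifiers are hypothetical and are not taken from any ortholog database.

```python
from collections import defaultdict

def classify_orthology(pairs):
    """Label each (gene_a, gene_b) pair by the shape of its ortholog group."""
    a_to_b, b_to_a = defaultdict(set), defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    labels, seen = {}, set()
    for start in a_to_b:
        if start in seen:
            continue
        # Collect every gene connected to `start` through shared partners.
        group_a, group_b, stack = set(), set(), [start]
        while stack:
            a = stack.pop()
            if a in group_a:
                continue
            group_a.add(a)
            for b in a_to_b[a]:
                group_b.add(b)
                stack.extend(b_to_a[b] - group_a)
        seen |= group_a
        if len(group_a) == 1 and len(group_b) == 1:
            kind = "one-to-one"
        elif len(group_a) == 1 or len(group_b) == 1:
            kind = "one-to-many"   # also covers many-to-one
        else:
            kind = "many-to-many"
        for a in group_a:
            for b in a_to_b[a]:
                labels[(a, b)] = kind
    return labels

# Hypothetical gene identifiers, for illustration only.
pairs = [("A1", "B1"), ("A2", "B2"), ("A2", "B3"),
         ("A3", "B4"), ("A4", "B4"), ("A4", "B5")]
labels = classify_orthology(pairs)
for pair in pairs:
    print(pair, labels[pair])
```

Grouping by connected component, rather than by counting the partners of each gene separately, keeps a pair such as (A3, B4) in the many-to-many class even though A3 itself has a single partner.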
As stated before, one key challenge of drug development is to successfully translate the
results of preclinical findings in animal models to human beings in the clinic. Pre-clinical
experiments assume that the effect of the drug tested on animals is comparable to that on
humans, which can only be true if a functional equivalent of the human drug target exists
therefore provides the best functional annotation of experimentally undetermined genes across
species. Holbrook and Sanseau (2007) remarked that the use of orthologs has the potential to
improve the understanding of biological differences between species (animals and humans).
Many of the successful applications of cross-species microarray gene expression analysis
involve orthology. In Grigoryev et al. (2004), orthologous genes exhibiting similar patterns
of expression across species were selected as ventilator-associated lung injury (VALI) related
candidates. Grigoryev et al. (2004) also claimed that the use of orthologs increased the statistical
power of their gene expression analysis and allowed them to identify candidate genes that would
otherwise remain unnoticed. Batzoglou et al. (2008) compared the human and mouse genomic
loci and discovered that the exon number and exon length for the 117 identified orthologs are
strongly preserved. Taher et al. (2004) successfully used homology information from
cross-species alignments of genomic sequences to perform new gene-finding. Ogorek (2008) adopted
the idea of orthologs to increase human differential gene-finding power.
Additionally, over the past decade, researchers have tried to use orthology and gene
expression data for cross-species comparison in order to understand how genes interact to perform
particular biological processes. Lelandais et al. (2006) proposed a multi-dimensional scaling
(MDS) method to directly compare gene expression across two species and, by using this
approach, they extracted some common properties and differences between budding and fission
yeasts. “Essentiality” refers to the functional importance of a gene to an
organism. An essential gene is a gene that is absolutely required for survival. As it is not
practical to experimentally estimate essentiality of human genes, Park et al. (2008) utilized
the existing knowledge on essentiality of mouse orthologs to estimate essentiality of the
corresponding human genes. By investigating the characteristics of human disease genes through a
comparative analysis with mouse mutant phenotype data, they confirmed the idea that human disease genes
have properties of essential genes.
These findings all support the idea that orthologs could be a useful tool for researchers to
Over the past decade, many ortholog databases have been established. Alexeyenko et al.
(2006) offer a nice review of available ortholog databases and the methods used to build
orthologous relationships. For the most commonly used animals in preclinical experiments, mice,
Bult et al. (2008) provided comparative genomic information particularly for human and rat
genomes. Mouse Genome Database (MGD) is a core component of the Mouse Genome
Informatics (MGI) database resource (http://www.informatics.jax.org) hosted at the Jackson
Laboratory (http://www.jax.org). Comprehensive orthology information for other organisms
can be found through HomoloGene (http://www.ncbi.nlm.nih.gov/homologene), a tool of
the National Center for Biotechnology Information (NCBI).
1.4 Statistical issues in microarray data analysis
In the analysis of microarray data, statistical methodologies need to be carefully planned out
as researchers often have to deal with massive amounts of data and adjust for various sources
of variability in order to identify the important genes from a large pool. Some of the statistical
issues in microarray data analysis have been summarized in Smyth et al. (2003), including
experimental design, image analysis, graphical presentation, normalization, quality measure,
multiple testing and the search for differentially expressed genes. In this paper, only statistical
issues in multiple testing and the search for differentially expressed genes are discussed since
the rest are beyond the scope of this research.
1.4.1 Conventional search for differentially expressed genes
A nice introductory review of various statistical tests for differentially expressed genes in
microarray experiments has been given by Cui and Churchill (2003). If the purpose is to
compare two conditions, the following test statistics are commonly used in practice. Define
$x_{1ij}$ and $x_{2ij}$ as the log2 expression levels of gene $i$ in replicate $j$ in the control and treatment,
respectively. The data are collected from two random samples of independent observations,
is defined as
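\[ t_i = \frac{\bar{x}_{1i} - \bar{x}_{2i}}{se_i}, \tag{1.1} \]
where $\bar{x}_{1i} = \frac{1}{n_1}\sum_{j=1}^{n_1} x_{1ij}$ and $\bar{x}_{2i} = \frac{1}{n_2}\sum_{j=1}^{n_2} x_{2ij}$. Here $n_1$ and $n_2$ are the numbers of replicates in
the control and treatment group, respectively, and $se_i$ is the estimated standard error for gene $i$.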
Another version of the ordinary t statistic, termed the global t test in Cui and Churchill (2003),
replaces $se_i$ by $se$, where $se$ is the standard error computed by combining data across
all genes. Neither of the ordinary t statistics is ideal. For the gene-specific t statistic, the
variances estimated from each gene are not stable, and the global t statistic does not adjust
for individual gene variability. In addition, an unrealistically small $se_i$ can result in a large t
statistic, so genes with small sample variances stand a good chance of being declared
differentially expressed even if they are not.
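The contrast between the two ordinary t statistics can be seen in a minimal sketch. The text does not specify how the standard errors are computed, so two assumptions are made here: the gene-specific standard error is the unpooled (Welch-type) one, and the global standard error is the root mean square of the gene-specific ones. The function names and the expression values are hypothetical.

```python
import math
from statistics import mean, variance

def gene_specific_t(x1, x2):
    """Ordinary t statistic (1.1) for one gene; returns (t, se)."""
    n1, n2 = len(x1), len(x2)
    # Assumption: unpooled (Welch-type) standard error of the mean difference.
    se = math.sqrt(variance(x1) / n1 + variance(x2) / n2)
    return (mean(x1) - mean(x2)) / se, se

def global_t(control, treatment):
    """Global t: every gene is scaled by one standard error shared across genes."""
    ses = [gene_specific_t(x1, x2)[1] for x1, x2 in zip(control, treatment)]
    # Assumption: combine genes by the root mean square of the per-gene standard errors.
    se_global = math.sqrt(mean([se ** 2 for se in ses]))
    return [(mean(x1) - mean(x2)) / se_global for x1, x2 in zip(control, treatment)]

# Hypothetical log2 expression values: two genes, four replicates per group.
# Both genes have the same mean difference but very different variability.
control = [[8.0, 8.1, 7.9, 8.0], [8.0, 9.0, 7.0, 8.0]]
treatment = [[8.2, 8.3, 8.1, 8.2], [8.2, 9.2, 7.2, 8.2]]

t_specific = [gene_specific_t(x1, x2)[0] for x1, x2 in zip(control, treatment)]
t_global = global_t(control, treatment)
print(t_specific)  # the low-variance gene receives a much larger |t|
print(t_global)    # one shared denominator, so both genes receive about the same t
```

With these made-up numbers, the gene-specific statistic rewards the first gene purely for its small sample variance, while the global statistic ignores the difference in variability altogether, which mirrors the two shortcomings described above.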
As a result, modifications of the t test have been widely proposed to address the
shortcomings of the ordinary t test. Tusher et al. (2001) proposed a regularized t statistic,
the SAM (Significance Analysis of Microarrays) test statistic, of the form
\[ t_i^{\text{SAM}} = \frac{\bar{x}_{1i} - \bar{x}_{2i}}{s_0 + se_i}, \tag{1.2} \]
where the value of the constant $s_0$ is chosen to minimize the coefficient of variation of the $t_i^{\text{SAM}}$
statistic. Genes with $t_i^{\text{SAM}}$ greater than a threshold are deemed potentially significant. The
threshold can be adjusted to select smaller or larger sets of genes based on the estimated false
discovery rate (FDR), which is calculated by random permutation of gene expressions among the
different experimental units, i.e., permuting the treatment labels for the entire arrays, thereby
preserving any correlation among the genes. Zhang (2007) offers a detailed description of how
SAM estimates FDR and a comprehensive evaluation of SAM.
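A simplified sketch of the regularized statistic (1.2) and of a permutation-based FDR estimate follows. It is not the SAM procedure itself: $s_0$ is passed in as a fixed number rather than chosen by minimizing the coefficient of variation, genes are called by comparing the absolute statistic with a single threshold, and the standard error is assumed to be the pooled two-sample one. All function names and data are hypothetical.

```python
import math
import random
from statistics import mean, variance

def sam_statistic(x1, x2, s0):
    """Regularized t statistic (1.2): the constant s0 is added to the standard error."""
    n1, n2 = len(x1), len(x2)
    # Assumption: pooled two-sample standard error of the mean difference.
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / (s0 + se)

def sam_all_genes(expr, labels, s0):
    """Compute the statistic for every gene; expr[gene][array], labels[array] in {0, 1}."""
    out = []
    for row in expr:
        x1 = [v for v, g in zip(row, labels) if g == 0]  # control arrays
        x2 = [v for v, g in zip(row, labels) if g == 1]  # treatment arrays
        out.append(sam_statistic(x1, x2, s0))
    return out

def estimated_fdr(expr, labels, s0, threshold, n_perm=200, seed=0):
    """Average number of permutation calls divided by the number of observed calls."""
    rng = random.Random(seed)
    called = sum(abs(d) > threshold for d in sam_all_genes(expr, labels, s0))
    if called == 0:
        return 0.0
    false_calls = []
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)  # permute labels of whole arrays, keeping gene correlation intact
        false_calls.append(sum(abs(d) > threshold for d in sam_all_genes(expr, perm, s0)))
    return min(1.0, mean(false_calls) / called)

# Hypothetical data: 20 genes on 8 arrays, the first three genes shifted under treatment.
rng = random.Random(1)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = [[rng.gauss(8.0, 0.3) for _ in labels] for _ in range(20)]
for row in expr[:3]:
    for j in range(4, 8):
        row[j] += 1.5
fdr = estimated_fdr(expr, labels, s0=0.1, threshold=3.0)
print(fdr)
```

Shuffling the label vector once per permutation and reusing it for every gene is what preserves the correlation among genes, as described in the paragraph above.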
A more recent example of a modified t statistic is the “shrinkage t” statistic introduced
by Opgen-Rhein and Strimmer (2007). For given data with $g$ genes, first compute the empirical
is obtained by
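\[ \upsilon_i^{\star} = \hat{\lambda}^{\star} \upsilon_{\text{median}} + (1 - \hat{\lambda}^{\star}) \upsilon_i \]
with optimal estimated pooling parameter
\[ \hat{\lambda}^{\star} = \min\left(1, \frac{\sum_{i=1}^{g} \widehat{\mathrm{Var}}(\upsilon_i)}{\sum_{i=1}^{g} (\upsilon_i - \upsilon_{\text{median}})^2}\right), \]
where $\widehat{\mathrm{Var}}(\upsilon_i)$ is the estimated variance of the empirical gene variances. The “shrinkage t”
statistic is given by
\[ t_i^{\text{shrinkage}} = \frac{\bar{x}_{1i} - \bar{x}_{2i}}{\sqrt{\dfrac{\upsilon_{1i}^{\star}}{n_1} + \dfrac{\upsilon_{2i}^{\star}}{n_2}}}. \tag{1.3} \]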
Two variants of this statistic are considered by Opgen-Rhein and Strimmer (2007). In the first,
variances are estimated separately in each group, which results in two different shrinkage
estimators, $\upsilon_{1i}^{\star}$ and $\upsilon_{2i}^{\star}$. The second uses pooled variance estimates and gives one common
shrinkage estimator $\upsilon_i^{\star}$. The advantage of the “shrinkage t” statistic is that it is derived
without any specific distributional assumptions and is not computationally intensive. The “shrinkage
t” statistic is implemented in the R package “st” which is available from the CRAN archive
(http://cran.r-project.org).
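The shrinkage step and statistic (1.3) can be sketched as below for the variant with separate variances in each group. This is an illustration, not the implementation in the “st” package. The estimator of $\widehat{\mathrm{Var}}(\upsilon_i)$ is not given in the text; the form used here, $\frac{n}{(n-1)^3}\sum_k (w_{ik} - \bar{w}_i)^2$ with $w_{ik} = (x_{ik} - \bar{x}_i)^2$, is an assumption, as are the function names and data.

```python
import math
from statistics import mean, median

def variance_and_its_variance(x):
    """Return the empirical variance v of one gene and an estimate of Var(v)."""
    n = len(x)
    xbar = mean(x)
    w = [(value - xbar) ** 2 for value in x]
    wbar = mean(w)
    v_hat = n / (n - 1) * wbar  # usual unbiased sample variance
    # Assumed estimator of Var(v): n / (n - 1)^3 * sum_k (w_k - wbar)^2.
    var_v = n / (n - 1) ** 3 * sum((wk - wbar) ** 2 for wk in w)
    return v_hat, var_v

def shrink_variances(samples):
    """Shrink each gene's variance toward the median variance; returns (variances, lambda)."""
    stats = [variance_and_its_variance(x) for x in samples]
    v = [s[0] for s in stats]
    v_median = median(v)
    denom = sum((vi - v_median) ** 2 for vi in v)
    lam = 1.0 if denom == 0 else min(1.0, sum(s[1] for s in stats) / denom)
    return [lam * v_median + (1 - lam) * vi for vi in v], lam

def shrinkage_t(control, treatment):
    """Statistic (1.3) with variances shrunk separately within each group."""
    v1, _ = shrink_variances(control)
    v2, _ = shrink_variances(treatment)
    return [
        (mean(x1) - mean(x2)) / math.sqrt(a / len(x1) + b / len(x2))
        for x1, x2, a, b in zip(control, treatment, v1, v2)
    ]

# Hypothetical log2 expression values: three genes, four replicates per group.
control = [[8.0, 8.1, 7.9, 8.0], [8.0, 9.0, 7.0, 8.0], [8.0, 8.5, 7.5, 8.0]]
treatment = [[8.2, 8.3, 8.1, 8.2], [8.2, 9.2, 7.2, 8.2], [8.2, 8.7, 7.7, 8.2]]

raw = [variance_and_its_variance(x)[0] for x in control]
shrunk, lam = shrink_variances(control)
print(lam, raw, shrunk)
print(shrinkage_t(control, treatment))
```

Because each shrunken variance lies between the gene's own variance and the median variance, a gene with an unusually small sample variance no longer produces an extreme statistic on that basis alone.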
All methods introduced above are for two-sample problems. When there are more than two
conditions in an experiment, a more general method, such as analysis of variance (ANOVA) can
be used to detect differential expression. Wolfinger et al. (2001) used two interconnected mixed
models to assess gene significance. It is a two-step approach. At the first step, a mixed model
is fit for all genes and residuals are obtained from this model. This is called the normalization
model in Wolfinger et al. (2001), and its purpose is to adjust for any bias that arises
from variation in the microarray technology rather than from biological differences between the
genes. At the second step, a mixed model is fit separately for each gene, using the normalized
data obtained from the first step; this is called the gene model in the paper. Inference for testing for
A moderated t statistic to address the problem of assessing differential expression in microarray experiments with more than two treatments was discussed in Smyth (2004). Smyth (2004) applied the empirical Bayes approach and linear models to microarray data analysis in the sense of shrinking the estimated sample variances towards a pooled estimate. In a single species experiment, let $\beta_{1i}$ be the unknown coefficient associated with the treatment effect for the $i$th gene, and let the conjugate prior distribution of the variances $\sigma_i^2$ for $\beta_{1i}$ (which vary across genes) be
\[
\sigma_i^2 \sim \text{Inverse Gamma}\left(\frac{d_0}{2}, \frac{d_0 s_0^2}{2}\right), \tag{1.4}
\]
where the hyperparameters $d_0$ and $s_0^2$ determine the shape $d_0/2$ and scale $d_0 s_0^2/2$ of the inverse gamma distribution and need to be estimated. Smyth (2004) took an empirical Bayes approach to estimate $d_0$ and $s_0^2$ from the data. Let $\hat{\beta}_{1i}$ be the least squares estimate of $\beta_{1i}$ from the linear model for the $i$th gene and $s_i^2$ be the observed variance for $\hat{\beta}_{1i}$, obtained from the data. Then,
\[
d_i s_i^2 \mid \sigma_i^2 \sim \text{Gamma}\left(\frac{d_i}{2}, 2\sigma_i^2\right), \tag{1.5}
\]
where $d_i$ is the error degrees of freedom for the linear model for gene $i$. The moderated t statistic proposed by Smyth (2004) has the following form:
\[
t_i^{\text{smyth}} = \frac{\hat{\beta}_{1i}}{\tilde{s}_i / \sqrt{n}}, \tag{1.6}
\]
where $n$ is the number of replicates in the experiment, and $\tilde{s}_i^2$ is the posterior mean of $\sigma_i^2$ given $s_i^2$. Under the above hierarchical setting (1.4) and (1.5), Smyth (2004) claimed that
\[
\tilde{s}_i^2 = \frac{d_0 s_0^2 + d_i s_i^2}{d_0 + d_i}.
\]
Indeed, the denominator of this formula should have been d0 +di + 2. But this erroneous
the distribution of $t_i^{\text{smyth}}$ under the null hypothesis. It turns out that
\[
t_i^{\text{smyth}} \mid \beta_{1i} = 0 \sim t_{d_0 + d_i}, \tag{1.7}
\]
i.e., a t distribution with inflated degrees of freedom. This is useful for getting the p-values
associated with the test.
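The computation in (1.6) and (1.7) can be sketched in a few lines of Python. This is only an illustration under the assumption that the hyperparameters $d_0$ and $s_0^2$ are already available (in Smyth (2004) they are estimated from the data by empirical Bayes, which is not shown here); the function names and input values are hypothetical.

```python
import math

def moderated_variance(s2, d, s0_sq, d0):
    """Shrink the observed variance s2 (d error df) toward s0_sq (d0 prior df)."""
    return (d0 * s0_sq + d * s2) / (d0 + d)

def moderated_t(beta_hat, s2, d, n, s0_sq, d0):
    """Return the moderated t statistic (1.6) and its null degrees of freedom (1.7)."""
    s_tilde = math.sqrt(moderated_variance(s2, d, s0_sq, d0))
    t_stat = beta_hat / (s_tilde / math.sqrt(n))
    return t_stat, d0 + d

# Hypothetical gene: estimated effect 1.0, observed variance 4.0 on 6 df,
# 4 replicates, and assumed hyperparameters s0_sq = 1.0, d0 = 2.
t_stat, df = moderated_t(1.0, 4.0, 6, 4, 1.0, 2)
```

Because the moderated variance is a weighted average of $s_0^2$ and $s_i^2$, genes with unusually small observed variances no longer produce extreme statistics, and the reference distribution gains $d_0$ degrees of freedom.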
From a different point of view, Allison et al. (2002) developed a sequence of procedures to address issues that arise when searching for differentially expressed genes. To answer the fundamental question in microarray experiments, namely whether any of the genes under study exhibit a difference in expression across the treatments, Allison et al. (2002) proposed a procedure involving a finite mixture model and bootstrap inference. The idea behind the approach is that the information contained in the distribution of the many test statistics and corresponding p-values can be used to detect differentially expressed genes. This set of procedures allows for non-normality and heteroscedasticity, and may be adapted for data with small sample sizes.
1.4.2 Correction for multiple testing
When conducting a single hypothesis test, two types of errors may be committed. A type I error
(false positive) occurs when a gene is declared as differentially expressed when in fact it is not.
A type II error (false negative) occurs when a differentially expressed gene fails to be declared
significant from the analysis of data. Typically, a statistical test is constructed to control the
type I error probability at level $\alpha$. Consider the problem of testing $g$ hypotheses simultaneously; Table 1.1 illustrates the situation in a traditional form. The problem of multiple comparisons
arises when using the test repeatedly in order to produce a list of rejected hypotheses. One
approach to multiple testing is to control the familywise error rate (FWER), which is the
probability of one or more false positive results over a number of statistical tests. For $g$ tests, each conducted at level $\alpha$, the FWER is the probability of at least one false positive, i.e., $\text{FWER} = \Pr(\text{FP} \geq 1)$.

Table 1.1: Table of outcomes. FP=false positive, TP=true positive, FN=false negative and TN=true negative.

                  H0 true    H0 not true
reject H0         FP         TP
not reject H0     TN         FN

Controlling the FWER means calculating the $\alpha$ level needed for each individual gene, say $c$, in order to ensure that the experiment-level type I error is less than or equal to $\alpha_e$. The simplest procedure to control the FWER at level $\alpha_e$ is the Bonferroni correction:
\[
c = \frac{\alpha_e}{g}.
\]
With this significance cutoff value $c$, the FWER will be no larger than $\alpha_e$ for any family of $g$ tests.
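A minimal Python sketch of the Bonferroni rule follows. The function names and the example p-values are hypothetical and serve only to show how the per-gene cutoff $c = \alpha_e / g$ is applied.

```python
def bonferroni_cutoff(alpha_e, g):
    """Per-test significance level c = alpha_e / g."""
    return alpha_e / g

def bonferroni_reject(p_values, alpha_e):
    """Flag the tests whose p-value does not exceed the Bonferroni cutoff."""
    c = bonferroni_cutoff(alpha_e, len(p_values))
    return [p <= c for p in p_values]

# Four hypothetical p-values tested at experiment-level alpha_e = 0.05,
# so the per-test cutoff is 0.05 / 4 = 0.0125.
decisions = bonferroni_reject([0.001, 0.02, 0.04, 0.2], 0.05)
```

With thousands of genes the cutoff becomes very small (for example, about $5 \times 10^{-6}$ for $g = 10000$ and $\alpha_e = 0.05$), which is why this procedure is regarded as stringent.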
Controlling the FWER is a very stringent approach to multiple testing, particularly because a microarray experiment often contains thousands of genes. It can be argued that controlling the FWER is not as important for microarray experiments, since falsely selecting a handful of genes as differentially expressed may not be a serious problem if the majority of significant genes are correctly chosen. Hence, controlling the false discovery rate (FDR), which was introduced by Benjamini and Hochberg (1995), is a less conservative alternative that can be useful for microarray experiments. The FDR is formally defined as
\[
\text{FDR} = E\left(\frac{\text{FP}}{\text{FP} + \text{TP}}\right),
\]
in other words, the FDR is the expected proportion of the rejected null hypotheses which are erroneously rejected. With the p-values ordered as $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(g)}$, the procedure of Benjamini and Hochberg (1995) is to reject the tests corresponding to $p_{(1)}, p_{(2)}, \dots, p_{(\hat{k})}$, where $\hat{k} = \max\{k : p_{(k)} \leq k\alpha/g\}$. This procedure controls the FDR at level $\alpha$ and has been implemented in SAS PROC MULTTEST.
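Assuming independence of the test statistics, the FDR adjusted p-values are defined as
\[
\begin{aligned}
\tilde{p}_{(g)} &= p_{(g)}, \\
\tilde{p}_{(g-1)} &= \min\left\{\tilde{p}_{(g)}, \frac{g}{g-1}\, p_{(g-1)}\right\}, \\
&\;\;\vdots \\
\tilde{p}_{(1)} &= \min\left\{\tilde{p}_{(2)}, g\, p_{(1)}\right\}.
\end{aligned}
\]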
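The recursion for the adjusted p-values, together with the rejection rule $\hat{k} = \max\{k : p_{(k)} \leq k\alpha/g\}$, can be sketched in Python as follows. The function names and the example p-values are hypothetical; this is an illustration of the step-up logic, not the SAS implementation.

```python
def bh_adjust(p_values):
    """FDR adjusted p-values, returned in the original order of the input."""
    g = len(p_values)
    order = sorted(range(g), key=lambda i: p_values[i])  # ascending p-values
    adjusted = [0.0] * g
    running_min = 1.0
    # Work from the largest p-value (rank g) down to the smallest (rank 1).
    for rank in range(g, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, g / rank * p_values[i])
        adjusted[i] = running_min
    return adjusted

def bh_reject(p_values, alpha):
    """Reject the k_hat smallest p-values, k_hat = max{k : p_(k) <= k*alpha/g}."""
    g = len(p_values)
    order = sorted(range(g), key=lambda i: p_values[i])
    k_hat = 0
    for rank in range(1, g + 1):
        if p_values[order[rank - 1]] <= rank * alpha / g:
            k_hat = rank
    rejected = [False] * g
    for rank in range(1, k_hat + 1):
        rejected[order[rank - 1]] = True
    return rejected

p_example = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
adjusted = bh_adjust(p_example)
rejected = bh_reject(p_example, 0.03)
```

Rejecting the tests whose adjusted p-value is at most $\alpha$ gives the same set of rejections as the step-up rule, which is why the adjusted values are convenient to report.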
Instead of fixing $\alpha$ and estimating $\hat{k}$, i.e., estimating the rejection region (Benjamini and Hochberg, 1995), Storey (2002) proposed a different approach to false discovery rates: fixing the rejection region and then estimating $\alpha$.
Recall that, in multiple testing, the p-value for an individual test can be defined as the smallest significance level (the smallest FWER) for which the null hypothesis can be rejected. Analogously, the q-value (Storey, 2002) is the smallest estimated FDR at which the test may be rejected. For the $g$ hypothesis tests with corresponding p-values $p_1, \dots, p_g$, and any rejection region of interest $[0, \gamma]$, $\gamma \leq 1$, the q-value is defined as
\[
q\text{-value}(p_i) = \min_{\gamma \geq p_i} \widehat{\text{FDR}}(\gamma),
\]
where $\widehat{\text{FDR}}(\gamma)$ is the estimated FDR. Storey (2002) derived the explicit form of $\widehat{\text{FDR}}(\gamma)$:
\[
\widehat{\text{FDR}}(\gamma) = \frac{\hat{\pi}_0\, \gamma\, g}{\#\{p_i \leq \gamma\}},
\]
where ˆπ0 = #(1{p−iα≤)γg}, and ˆπ0 is defined as the estimated value of the probability of the null
hypothesis $i$ being true. Storey (2002) has also shown that this approach rejects more
approach is implemented in the R package “qvalue” which is available from the CRAN archive
Chapter 2
Joint modeling across species
As mentioned in Section 1.3, gene information from multiple species has been used successfully in different applications. Since the goal of this research is to make animal models more predictive of human models in drug development, utilizing the gene expression data from both species jointly and discovering the patterns of differentially expressed genes across species is one approach to overcoming the current difficulty that animal models often have poor clinical pertinence. Investigating the activities of the selected orthologous genes with different patterns of differential expression can help scientists decode the causes of drug attrition and thus make the development of drugs in the same class more efficient in the future.
Let $X_{aij}$ and $X_{hil}$ denote gene expression measurements from the $i$th orthologous gene pair for the $j$th animal and the $l$th human. The following independent linear models describe the association between gene expression and treatment:
\[
X_{aij} = \beta_{0ai} + \beta_{1ai} T_{aj} + e_{aij}, \tag{2.1}
\]
\[
X_{hil} = \beta_{0hi} + \beta_{1hi} T_{hl} + e_{hil}, \tag{2.2}
\]
where $T_{aj}$ and $T_{hl}$ are $\{0,1\}$ treatment indicators, and $e_{aij}$ and $e_{hil}$ are independent $N(0, \sigma_a^2)$ and $N(0, \sigma_h^2)$ random variables, so that $\sigma_a^2$ and $\sigma_h^2$ are the variances of $e_{aij}$ and $e_{hil}$, respectively. In drug development, the animal research and human experiments are conducted independently; the results of one do not affect the other. However, the treatment effects are expected to have some kind of association between the two species. This motivates our choice of two independent models for the two species to capture the effects of treatment on gene expression.
The joint behavior of $\beta_{1ai}$ and $\beta_{1hi}$ is of interest, as they describe the differential expression of the $i$th orthologous animal and human genes due to a treatment intervention. The quantities $\beta_{1ai}$ and $\beta_{1hi}$ are not observable; nonetheless, the gene expression data for mice and humans are observed. Separate least squares estimates $\hat{\beta}_{1ai}$ and $\hat{\beta}_{1hi}$ can be obtained from linear models (2.1) and (2.2). Recall that the goal of this research is to bridge the animal models and the human models in drug development by discovering the patterns of relevant gene expression across species. One question in particular calls for an answer: why do drugs have opposite effects on mice and humans when they are expected to work in the same direction? Therefore, modeling $(\beta_{1ai}, \beta_{1hi})^T$ jointly to identify the patterns of differentially expressed genes across species is expected to help resolve the current problems of poor translation from animal experiments to clinical research. Understanding the activities of the relevant orthologs (those differentially expressed in response to treatment) across species helps explain the poor translation between animal and human models, and hence helps reduce the failure rate of drugs going from preclinical trials to clinical trials.
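Because $T_{aj}$ and $T_{hl}$ are $\{0,1\}$ indicators, the least squares estimate of the slope in (2.1) or (2.2) reduces to the difference between the treated and control group means for that gene. The following Python sketch shows this for one species and one gene; the function name and the expression values are hypothetical.

```python
from statistics import mean

def treatment_effect(expression, treatment):
    """Least squares estimates (beta0_hat, beta1_hat) for a {0,1} treatment."""
    treated = [x for x, t in zip(expression, treatment) if t == 1]
    control = [x for x, t in zip(expression, treatment) if t == 0]
    beta0_hat = mean(control)              # intercept: control group mean
    beta1_hat = mean(treated) - beta0_hat  # slope: difference in group means
    return beta0_hat, beta1_hat

# Hypothetical expression values for one gene: three control and three
# treated subjects of the same species.
beta0_hat, beta1_hat = treatment_effect(
    [1.0, 2.0, 3.0, 5.0, 6.0, 7.0], [0, 0, 0, 1, 1, 1]
)
```

Applying the same function separately to the animal and human measurements of an orthologous pair yields the pair of estimates that the joint model is built on.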
Section 2.1 describes the joint (prior) distribution for $(\beta_{1ai}, \beta_{1hi})^T$ and the corresponding marginal distribution of $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$. Section 2.2 introduces mixture models and the EM
algorithm in general. Section 2.3 discusses the specific solution of the EM algorithm for the
proposed models in particular. Section 2.4 presents the problem of singularity for the covariance
2.1 A 9-component bivariate mixture model for two species
$\beta_{1ai}$ and $\beta_{1hi}$ quantify the differential expression of the $i$th orthologous animal and human genes due to a treatment intervention. A given gene can be classified as non-differentially expressed (NDE), showing no sign of a treatment effect; positively differentially expressed (pDE), showing a positive treatment effect; or negatively differentially expressed (nDE), showing a negative treatment effect. Therefore, for a human and animal gene pair, there are 9 possibilities for categorizing the pair. Furthermore, dependency is assumed between differentially expressed orthologs, i.e., association is posited only for $(\beta_{1ai}, \beta_{1hi})^T$ in categories 1, 2, 3, and 4, and zero correlation is presumed for $(\beta_{1ai}, \beta_{1hi})^T$ in categories 0, 5, 6, 7, and 8. Table 2.1 illustrates the 9 possible categories of $(\beta_{1ai}, \beta_{1hi})^T$, where $(\mu_{\beta_{1ai}}, \mu_{\beta_{1hi}})^T$ is the vector of population means of $(\beta_{1ai}, \beta_{1hi})^T$ under each category.
Table 2.1: Possible categories of treatment effects: a prior for $(\beta_{1ai}, \beta_{1hi})^T$

category   $(\beta_{1ai}, \beta_{1hi})$   $(\mu_{\beta_{1ai}}, \mu_{\beta_{1hi}})$   $\mathrm{Corr}(\beta_{1ai}, \beta_{1hi})$
0          (NDE, NDE)                     (0, 0)                                     0
1          (pDE, pDE)                     (+, +)                                     $\rho_1$
2          (nDE, nDE)                     (−, −)                                     $\rho_2$
3          (pDE, nDE)                     (+, −)                                     $\rho_3$
4          (nDE, pDE)                     (−, +)                                     $\rho_4$
5          (NDE, pDE)                     (0, +)                                     0
6          (NDE, nDE)                     (0, −)                                     0
7          (pDE, NDE)                     (+, 0)                                     0
8          (nDE, NDE)                     (−, 0)                                     0
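The category labels of Table 2.1 can be written as a simple lookup, sketched below in Python. The sketch assigns a category from the signs of a pair of true treatment effects, which are unobservable in practice, so it only illustrates the labeling scheme; the names `CATEGORY`, `status`, and `category` are hypothetical.

```python
# Category index for each (animal status, human status) pair, as in Table 2.1.
CATEGORY = {
    ("NDE", "NDE"): 0, ("pDE", "pDE"): 1, ("nDE", "nDE"): 2,
    ("pDE", "nDE"): 3, ("nDE", "pDE"): 4, ("NDE", "pDE"): 5,
    ("NDE", "nDE"): 6, ("pDE", "NDE"): 7, ("nDE", "NDE"): 8,
}

def status(beta):
    """Label a single treatment effect by its sign."""
    if beta > 0:
        return "pDE"
    if beta < 0:
        return "nDE"
    return "NDE"

def category(beta_a, beta_h):
    """Category of an orthologous pair from its animal and human effects."""
    return CATEGORY[(status(beta_a), status(beta_h))]
```

For example, a pair with a positive animal effect and a negative human effect falls in category 3, one of the four categories in which a nonzero correlation is allowed.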
As a consequence of these possible patterns of $(\beta_{1ai}, \beta_{1hi})^T$, mixture models (McLachlan and Basford, 1988; McLachlan and Peel, 2000) are adopted to deal with the correlation and distribution of each subgroup of genes across species. An additional advantage of mixture models is that posterior probabilities of component membership can be computed for each observation to give a probabilistic clustering. As a result, the pooling of information for genes across species can be exploited to better understand the underlying relationship between the effects of the treatment intervention in the two species.
In practice, finite mixture models (mixture models with finite numbers of components) (McLachlan and Basford, 1988; McLachlan and Peel, 2000) are often fitted with the component densities of the mixture taken to be normal. The following normal mixture model is adopted as the prior distribution of the vector $(\beta_{1ai}, \beta_{1hi})^T$:
\[
\begin{pmatrix} \beta_{1ai} \\ \beta_{1hi} \end{pmatrix} \sim \sum_{k=0}^{8} \pi_k\, N\!\left( \begin{pmatrix} \mu_{ak} \\ \mu_{hk} \end{pmatrix}, \begin{pmatrix} \eta_{ak}^2 & \rho_k \eta_{ak} \eta_{hk} \\ \rho_k \eta_{ak} \eta_{hk} & \eta_{hk}^2 \end{pmatrix} \right), \tag{2.3}
\]
where $\pi_k$ is the probability that an observation belongs to the $k$th component, with $\sum_{k=0}^{8} \pi_k = 1$ and $\pi_k \geq 0$.
parameters may be made according to Table 2.1, and non-differentially expressed genes may be
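assumed to have treatment effects that are deterministically zero, i.e., $(\beta_{1ai}, \beta_{1hi})^T = (0, 0)^T$. This leads to the following two-species mixture model:
\[
\begin{aligned}
\begin{pmatrix} \beta_{1ai} \\ \beta_{1hi} \end{pmatrix} \sim{}& \pi_0 N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \right) + \sum_{k=1}^{4} \pi_k N\!\left( \begin{pmatrix} \mu_{ak} \\ \mu_{hk} \end{pmatrix}, \begin{pmatrix} \eta_{ak}^2 & \rho_k \eta_{ak} \eta_{hk} \\ \rho_k \eta_{ak} \eta_{hk} & \eta_{hk}^2 \end{pmatrix} \right) \\
&+ \pi_5 N\!\left( \begin{pmatrix} 0 \\ \mu_{h5} \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 0 & \eta_{h5}^2 \end{pmatrix} \right) + \pi_6 N\!\left( \begin{pmatrix} 0 \\ \mu_{h6} \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 0 & \eta_{h6}^2 \end{pmatrix} \right) \\
&+ \pi_7 N\!\left( \begin{pmatrix} \mu_{a7} \\ 0 \end{pmatrix}, \begin{pmatrix} \eta_{a7}^2 & 0 \\ 0 & 0 \end{pmatrix} \right) + \pi_8 N\!\left( \begin{pmatrix} \mu_{a8} \\ 0 \end{pmatrix}, \begin{pmatrix} \eta_{a8}^2 & 0 \\ 0 & 0 \end{pmatrix} \right),
\end{aligned} \tag{2.4}
\]
where $\mu_{a1} \geq 0$, $\mu_{h1} \geq 0$, $\mu_{a2} \leq 0$, $\mu_{h2} \leq 0$, $\mu_{a3} \geq 0$, $\mu_{h3} \leq 0$, $\mu_{a4} \leq 0$, $\mu_{h4} \geq 0$, $\mu_{h5} \geq 0$, $\mu_{h6} \leq 0$, $\mu_{a7} \geq 0$, and $\mu_{a8} \leq 0$, $\rho_0 = 0$, $\rho_5 = 0$, $\rho_6 = 0$, $\rho_7 = 0$, and $\rho_8 = 0$.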
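To make the structure of (2.4) concrete, the following Python sketch draws pairs of treatment effects from a prior of this form. All mixing proportions, means, standard deviations, and correlations below are hypothetical values chosen only to satisfy the sign constraints; they are not estimates from any data set. A component with a zero standard deviation in one coordinate is degenerate in that coordinate, which is how the NDE cases arise.

```python
import math
import random

# (mu_a, mu_h, eta_a, eta_h, rho) for components 0..8; all values hypothetical.
PARAMS = [
    (0.0, 0.0, 0.0, 0.0, 0.0),    # 0: (NDE, NDE), point mass at the origin
    (1.0, 1.0, 0.5, 0.5, 0.6),    # 1: (pDE, pDE)
    (-1.0, -1.0, 0.5, 0.5, 0.6),  # 2: (nDE, nDE)
    (1.0, -1.0, 0.5, 0.5, -0.6),  # 3: (pDE, nDE)
    (-1.0, 1.0, 0.5, 0.5, -0.6),  # 4: (nDE, pDE)
    (0.0, 1.0, 0.0, 0.5, 0.0),    # 5: (NDE, pDE)
    (0.0, -1.0, 0.0, 0.5, 0.0),   # 6: (NDE, nDE)
    (1.0, 0.0, 0.5, 0.0, 0.0),    # 7: (pDE, NDE)
    (-1.0, 0.0, 0.5, 0.0, 0.0),   # 8: (nDE, NDE)
]
WEIGHTS = [0.6, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]  # pi_0..pi_8

def draw_effects(rng):
    """Draw (component, beta_a, beta_h) from the hypothetical prior above."""
    k = rng.choices(range(9), weights=WEIGHTS)[0]
    mu_a, mu_h, eta_a, eta_h, rho = PARAMS[k]
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    beta_a = mu_a + eta_a * z1
    # Combining z1 and z2 this way gives correlation rho between the effects.
    beta_h = mu_h + eta_h * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return k, beta_a, beta_h

rng = random.Random(2024)
draws = [draw_effects(rng) for _ in range(1000)]
```

Under these settings every draw from component 0 is exactly $(0, 0)^T$, and draws from components 5 to 8 are exactly zero in the NDE coordinate, whereas the observed estimates would still vary because of sampling error.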
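$(\beta_{1ai}, \beta_{1hi})^T$ is in general unknown. However, the estimates $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ can be obtained by the method of least squares using the linear models (2.1) and (2.2). The marginal distribution of $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ is as follows:
\[
\begin{aligned}
\begin{pmatrix} \hat{\beta}_{1ai} \\ \hat{\beta}_{1hi} \end{pmatrix} \sim{}& \pi_0 N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{a0}^2 & 0 \\ 0 & \sigma_{h0}^2 \end{pmatrix} \right) + \sum_{k=1}^{4} \pi_k N\!\left( \begin{pmatrix} \mu_{ak} \\ \mu_{hk} \end{pmatrix}, \begin{pmatrix} \sigma_{ak}^2 & \rho_k \sigma_{ak} \sigma_{hk} \\ \rho_k \sigma_{ak} \sigma_{hk} & \sigma_{hk}^2 \end{pmatrix} \right) \\
&+ \pi_5 N\!\left( \begin{pmatrix} 0 \\ \mu_{h5} \end{pmatrix}, \begin{pmatrix} \sigma_{a5}^2 & 0 \\ 0 & \sigma_{h5}^2 \end{pmatrix} \right) + \pi_6 N\!\left( \begin{pmatrix} 0 \\ \mu_{h6} \end{pmatrix}, \begin{pmatrix} \sigma_{a6}^2 & 0 \\ 0 & \sigma_{h6}^2 \end{pmatrix} \right) \\
&+ \pi_7 N\!\left( \begin{pmatrix} \mu_{a7} \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{a7}^2 & 0 \\ 0 & \sigma_{h7}^2 \end{pmatrix} \right) + \pi_8 N\!\left( \begin{pmatrix} \mu_{a8} \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{a8}^2 & 0 \\ 0 & \sigma_{h8}^2 \end{pmatrix} \right),
\end{aligned} \tag{2.5}
\]
where $\mu_{a0} = 0$, $\mu_{h0} = 0$, $\mu_{a1} \geq 0$, $\mu_{h1} \geq 0$, $\mu_{a2} \leq 0$, $\mu_{h2} \leq 0$, $\mu_{a3} \geq 0$, $\mu_{h3} \leq 0$, $\mu_{a4} \leq 0$, $\mu_{h4} \geq 0$, $\mu_{a5} = 0$, $\mu_{h5} \geq 0$, $\mu_{a6} = 0$, $\mu_{h6} \leq 0$, $\mu_{a7} \geq 0$, $\mu_{h7} = 0$, $\mu_{a8} \leq 0$, $\mu_{h8} = 0$, $\rho_0 = 0$, $\rho_5 = 0$, $\rho_6 = 0$, $\rho_7 = 0$, and $\rho_8 = 0$. Note that the marginal distribution of $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ has means equal to the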
prior means of (β1ai, β1hi)