Mixture Models for Gene Expression Experiments with Two Species.

ABSTRACT

SU, YUHUA. Mixture Models for Gene Expression Experiments with Two Species. (Under the direction of Dr. Jason Osborne.)

A bivariate mixture model utilizing information across two species is proposed to solve the

fundamental problem of identifying differentially expressed genes in microarray experiments.

Orthologs, or genes from two different species that originated from a common ancestor, have the

potential to exploit similarities between species to better understand the genetic basis of disease

and treatment. The proposed approach intuitively models the distribution of the estimated

treatment effects with minimal assumptions. The mixture model posits up to nine components,

four of which include groups in which genes are differentially expressed in both species. An

EM algorithm is developed to accomplish the nontrivial likelihood maximization, along with

methodology for handling singular covariance matrices that arise during the implementation

of the algorithm. A comprehensive simulation to evaluate the model performance and two

applications on real world data sets, a dog and human lymphoma data set prepared by a group

of scientists in the College of Veterinary Medicine at North Carolina State University and a

mouse and human type II diabetes experiment sponsored by GlaxoSmithKline, suggest that

the proposed model, though highly structured, can handle various situations and is practically

useful, especially when the magnitude of differential expression due to the different treatment

intervention is weak. In both applications, the proposed 9-component mixture model is able

to eliminate unimportant genes and identify a list of genes that are potential biomarker

candidates. Though the primary motivation for the development of the bivariate mixture

model is to enable identification of genes whose differential expression extends from humans to

another species, possible extension to classification/prediction of cancer type or drug response

is also initiated in the two case studies. In the dog and human lymphoma study, a very small

number of genes are identified as being differentially expressed in both species and the human

patients into two subgroups, the germinal-center B-cell-like diffuse large B-cell lymphoma and

the activated B-cell-like diffuse large B-cell lymphoma. Additionally, the two subgroups defined

by this cluster of human genes have significantly different survival functions, indicating that

the stratification based on gene-expression profiling using the proposed 9-component mixture

model provides better insight into the clinical differences between the two types of cancer.

The application of the 9-component mixture model on the mouse and human type II diabetes

experiment is less successful. While the mixture model is able to separate differentially expressed

genes from those non-differentially expressed ones, attempts at predicting human drug response

status using the genes identified as being differentially expressed in both species did not lead

to the same success as the lymphoma experiment. This may be due to the fact that there is

little evidence of any differential expression. The linear model for week 8 expression in human

genes was one of many possible models, but it did not uncover much evidence of a treatment

effect. Nonetheless, a potential multi-gene predictor may still be developed according to the

genes identified by the proposed 9-component mixture model to benefit patients in therapeutic

Mixture Models for Gene Expression Experiments with Two Species

by Yuhua Su

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Statistics

Raleigh, North Carolina

2011

APPROVED BY:

Dr. Lei Zhu Dr. Jacqueline Hughes-Oliver

Dr. Dahlia Nielsen Dr. Jung-Ying Tzeng

DEDICATION

BIOGRAPHY

Yuhua was born in Kaohsiung, Taiwan. She received her Bachelor of Arts degree in Public

Finance from National Chengchi University in 2001 and her Master’s degree in Economics from

the University of North Carolina at Chapel Hill in 2003. She decided to pursue a career in

Statistics and joined the Department of Statistics at North Carolina State University in 2004.

ACKNOWLEDGEMENTS

I have been truly fortunate to have Dr. Jason Osborne as an advisor. His unwavering faith in

my ability has helped me stay the course even when I doubted myself the most. His guidance

has been priceless, and it truly has been a pleasure and honor to work with him.

I greatly appreciate all the support and encouragement I received from my committee member

and manager, Dr. Lei Zhu. She has been a wonderful role model for me.

I am deeply indebted to Dr. Dahlia Nielsen for answering many questions for the two-species

data and to Dr. Chris Smith, Dr. Kristy Richards, and Dr. Matthew Breen for sharing the

data.

I am very grateful to Dr. Jacqueline Hughes-Oliver and Dr. Jung-Ying Tzeng for their

invaluable suggestions on my research.

I want to thank my parents, Er-Chen Su and Hsiu-Chu Su-Liu, and my sisters, Li-Wen Su

and Pi-Hwa Su, who kept believing in me throughout this journey, even after all this time.

Last but certainly not least, I wish to thank my husband, Andrew Cigna, for his support

and devotion through this roller coaster called graduate school, and my three precious babies,

Winry, Yuna and Duke for helping me keep my eye on what is truly important in life.

TABLE OF CONTENTS

List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background
1.2 Gene expression and drug development
1.3 Homology and multiple species gene expression analysis
1.4 Statistical issues in microarray data analysis
1.4.1 Conventional search for differentially expressed genes
1.4.2 Correction for multiple testing
Chapter 2 Joint modeling across species
2.1 A 9-component bivariate mixture model for two species
2.1.1 Comparison with Ogorek (2008)
2.2 Mixture models and the EM algorithm
2.2.1 Mixture models
2.2.2 The EM algorithm
2.2.3 The EM algorithm for mixture models
2.3 The EM Algorithm with constraints
2.4 Regularized covariance matrices in the EM algorithm
Chapter 3 Simulation
3.1 Simulation studies: Case I and Case II
3.1.1 Parameter determination
3.1.2 Data generation
3.1.3 Starting values of the EM algorithm
3.1.4 Results of simulation studies
3.2 Simulation study: Case III
3.3 Simulation study to evaluate the choice of the regularization parameter for regularized covariance matrices
3.4 Conclusions
Chapter 4 Applications
4.1 Application I: Gene selection and cancer type classification on the dog and human diffuse large-B-cell lymphoma study
4.1.1 Introduction
4.1.2 Data description
4.1.3 Data analysis
4.1.4 Results
4.2 Application II: The mouse and human type II diabetes experiment
4.2.1 Introduction
4.2.2 Data analysis
4.2.3 Results
4.2.4 Conclusions
Chapter 5 Concluding remarks
References
Appendix
Appendix A Derivation of the distribution of the Least Squares estimators $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})$

LIST OF TABLES

Table 1.1 Table of outcomes. FP=false positive, TP=true positive, FN=false negative and TN=true negative.

Table 2.1 Possible categories of treatment effects: a prior for $(\beta_{1ai}, \beta_{1hi})^T$

Table 3.1 Number of genes in the $k$th category for simulation studies Case I and Case II

Table 3.2 Combination of parameters for simulation studies Case I and Case II

Table 3.3 Summary of the parameter estimates for the 8 different scenarios under simulation study Case I. Averaged over the 500 simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations. NE: not estimated.

Table 3.4 Summary of the parameter estimates for the 8 different scenarios under simulation study Case II. Averaged over the 500 simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations. NE: not estimated.

Table 3.5 Monte Carlo mean squared error for each estimator under the 8 different scenarios in simulation study Case I. NE: not estimated.

Table 3.6 Monte Carlo mean squared error for each estimator under the 8 different scenarios in simulation study Case II. NE: not estimated.

Table 3.7 Proportions of genes in category 0 classified into category $q$ under simulation studies Case I and Case II

Table 3.8 Under simulation studies Case I and Case II, the number of genes selected based on (a) the 9-component bivariate mixture model, (b) the conventional one-species approach, (c) Ogorek (2008). Numbers in parentheses are the observed FDRs. Averaged over the 500 simulated datasets. Tukey’s HSD for an $\alpha$ level of 0.05 is included beneath each set of eight simulation cases. NA: not available. ∗: numbers averaged after exclusion of data sets due to convergence criteria not met in PROC NLP. ∗∗: numbers averaged after exclusion of data sets due to convergence criteria not met in PROC NLP and data sets where all genes are classified as I-orthologs.

Table 3.9 A three-way analysis of variance (ANOVA) table to quantify the variability among the results (gene counts and observed FDRs) obtained using the proposed 9-component mixture model in Table 3.8 for the 16 different simulated situations under simulation studies Case I and Case II. ANOVA was performed independently for simulation studies Case I and Case II.

Table 3.10 Results of Tukey’s HSD test for pairwise comparison of the results (gene counts and observed FDRs) obtained using the proposed 9-component mixture model in Table 3.8 for the 16 different simulated cases under simulation studies Case I and Case II. ∗: significantly different at the 0.05 level (gene counts), •: significantly different at the 0.05 level (observed FDRs), NA: not available.

Table 3.11 Combination of parameters for simulation study Case III

Table 3.12 Summary of the parameter estimates for the 9-component mixture model under simulation study Case III. Averaged over the 500 simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations. NE: not estimated.

Table 3.13 Summary of the parameter estimates for Ogorek (2008) under simulation study Case III. Averaged over the corresponding simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations.

Table 3.14 Number of genes, averaged over the 500 simulated datasets, selected using (a) the 9-component mixture model, (b) Ogorek (2008) under simulation study Case III. Numbers in parentheses are the observed FDRs, averaged over the 500 simulated data sets for each case. ∗: numbers averaged after exclusion of data sets due to the failure of convergence in PROC NLP. ∗∗: numbers averaged after exclusion of data sets due to the failure of convergence in PROC NLP and data sets where all genes are classified as I-orthologs.

Table 3.15 Divergence measures for various regularization constant $\alpha$. Averaged over the 500 simulated data sets. Numbers in parentheses are the Monte Carlo standard deviations. NE: not estimated.

Table 3.16 Gene identification based on various $\alpha$, the regularization parameter. Numbers in parentheses are the observed FDR.

Table 4.1 The $k$-means clustering results for the dog and human lymphoma study

Table 4.2 Parameter estimates of the bivariate 9-component mixture model for the dog and human lymphoma study. Averaged over the 156 LOOCV outcomes. NE: not estimated.

Table 4.3 The bootstrap standard errors for parameters of interest in the 9-component mixture model for the dog and human lymphoma study

Table 4.4 Summary of gene counts in each category based on the results of the

Table 4.5 Misclassification tables for the dog and human lymphoma study using different criteria. (a) and (b) are classification results of human subjects based on two-species analysis, the 9-component mixture model. (a) is the misclassification table for human subjects using genes in category (1, 2, 3, 4); (b) is the misclassification table using genes in category (1, 2, 3, 4, 5, 6). (c) and (d) are misclassification tables for human subjects based on single-species (human) analysis. (c) is the misclassification table for human subjects from genes selected using FDR = 0.00001; and (d) is the misclassification table from genes selected using FDR = 0.01.

Table 4.6 Overall misclassification rate for the dog and human lymphoma study

Table 4.7 Mean and median survival time, measured in years (standard errors in parentheses), for the dog and human lymphoma study. (a) no stratification, and stratified into two subgroups: ABC DLBCL and GCB DLBCL, by (b) gene-expression profiling performed in Lenz et al. (2008a) and Lenz et al. (2008b) and (c) gene-expression profiling using the proposed 9-component mixture model

Table 4.8 Summary of the gene-specific information for the 21 human genes in categories (1,2,3,4) determined by the 9-component mixture model for the dog and human lymphoma study. Retrieved from Entrez Gene, an NCBI database for gene-specific information.

Table 4.9 Parameter estimates (bootstrap (B = 1000) standard errors in parentheses) of the bivariate 9-component mixture model for the mouse and human type II diabetes experiment. NE: not estimated.

Table 4.10 Summary of the 9-category gene counts for the mouse and human type II diabetes experiment

Table 4.11 Misclassification table for the prediction results of human drug response for the mouse and human type II diabetes experiment. Gene expression used: $X_{hil}^{\text{week8}} - X_{hil}^{\text{week0}}$

Table 4.12 Misclassification table for the prediction results of human drug response for the mouse and human type II diabetes experiment. Gene expression used: $X_{hil}^{\text{week8}}$

LIST OF FIGURES

Figure 1.1 The drug development and approval chain (adopted from Bolten and DeGregorio (2002))

Figure 2.1 Scatter plots of $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ from (a) a simulated situation and (b) an analysis of real data (the dog-and-human lymphoma study discussed in Chapter 4)

Figure 3.1 An example of the determination of the starting values for the EM algorithm

Figure 3.2 Simulation study Case I: Plots of proportions of genes in category $k$ classified into category $q$

Figure 3.3 Simulation study Case II: Plots of proportions of genes in category $k$ classified into category $q$

Figure 3.4 Plots of PPV and sensitivity for simulation studies Case I and Case II

Figure 3.5 Under simulation studies Case I and Case II, the number of genes selected based on (1) the 9-component bivariate mixture model, (2) the conventional one-species approach, (3) Ogorek (2008), and the associated observed FDRs

Figure 3.6 Scatter plots of $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ under simulation study Case III

Figure 3.7 Divergency scores for parameter estimation based on different values of regularization constant $\alpha$. Comparison between the reference group $\alpha = 0.0001$ and $\alpha = (0.1, 0.01, 0.001, 0.00001)^T$

Figure 4.1 Plots of the bootstrap standard errors for parameters of interest in the 9-component mixture model for the dog and human lymphoma study

Figure 4.2 Scatter plots of $(\hat{\beta}_{1a}, \hat{\beta}_{1h})^T$ for the dog and human lymphoma study: (a) all orthologs, (b) orthologs in category (1,2,3,4), (c) orthologs in category (1,2,3,4,5,6), (d) orthologs selected based on human data only with FDR cutoff value = 0.00001, and (e) orthologs selected based on human data only with FDR cutoff value = 0.01.

Figure 4.3 Kaplan-Meier survival probability estimates for the dog and human lymphoma study. (a) no stratification, (b) stratification based on the results of gene-expression profiling performed in Lenz et al. (2008a) and Lenz et al. (2008b), and (c) stratification based on gene-expression profiling using the proposed 9-component mixture model.

Figure 4.4 Histograms of p-values from tests of no treatment effects for the mouse and human type II diabetes experiment

Figure 4.5 Scatter plots of the estimated treatment effects $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ before and after gene membership identification for the mouse and human type II

Chapter 1

Introduction

1.1 Background

Pharmaceutical medicine is an industry that requires huge up-front investment for rewards that may or may not come years later. The drug development process is lengthy, expensive, and risky. According to the US Food and Drug Administration (FDA, 2004), the average total cost of developing a drug is about $1.9 billion. The typical development time is 10 to 15 years. The overall attrition rate¹ of a drug compound from first-in-man to registration is approximately 80%–90% (Bolten and DeGregorio, 2002; Kola and Landis, 2004). Figure 1.1 (adopted from Bolten and DeGregorio (2002)) depicts a complete drug development process, including drug discovery, preclinical research (on animals) and clinical trials (on humans).

FDA (2004) refers to preclinical and clinical research together as the “critical path” development phase, where most of the investment required for a successful drug launch occurs. Currently, this development phase is inherently inefficient. The goal of preclinical research is to assess how a drug is absorbed, distributed, metabolized, and excreted in animals, and to use the findings to determine potential human outcomes before starting clinical trials. Yet the success rate after a drug candidate enters Phase I is undesirably low. Lack of efficacy and safety are the major causes of attrition. As mentioned in FDA (2004) and Kola and Landis (2004), animal models with poor clinical relevance may be responsible for this problem.

¹ Attrition rates describe the rate at which investigational drugs fall out of testing in the various clinical

A common limitation of the current drug development strategy is that prior information

is partly or completely ignored when analyzing and interpreting the results of the most recent

clinical trial. When a drug shows promising efficacy and safety in animal models, but fails later

in clinical studies, it is important to identify what causes the translation failure between the

two species. It is commonly known that there are drugs that work well in humans but not in

animals, and vice versa, which is one of the major reasons for drug attrition. Once a failure

occurs, the candidate compound is discarded and the discovery team is pressed to come up with

a new candidate. However, without a precise understanding of why the first candidate failed,

there is no assurance that the second candidate will perform any better. Hence, improving

translation between two species is of tremendous value to drug discovery and development.

By modeling across species, this research aims to help reduce the attrition of the large percentage of compounds that fail because of poor translation from animal models to human clinical trials, and hence to improve the predictive power of animal models for human studies.

1.2 Gene expression and drug development

In a project undertaken by GlaxoSmithKline (GSK) to investigate the treatment effect of

rosiglitazone on type II diabetes, both preclinical (mice) and clinical (humans) experiments were

implemented. In order to evaluate the treatment intervention, blood glucose, insulin, hemoglobin

A1c (HbA1c) and other diabetes related lab measurements were measured for both mice and

humans. While it is possible to evaluate the efficacy of the drug compound in terms of these

lab measurements, a good explanation of the mechanism for drug activities common to both

species is still needed. If a drug candidate works on animals, it will be applied to humans in

clinical trials. This same compound may or may not work on humans. In this case, an approach

that provides any insight into the pharmaceutical and biomedical differences between species

Figure 1.1: The drug development and approval chain (adopted from Bolten and DeGregorio (2002))

development. This is the essential objective of this research.

Microarrays are tools for gene expression analysis and have been used successfully in a

wide range of applications. As summarized in Slonim (2002), some of the common themes in

microarray data analysis include detection of differential expression, clustering, and predicting

disease status. An advantage of microarray technology is that it can assist researchers in better

defining and understanding the expression profile of a given genotype associated with disease or

the effects from exposure to certain stimuli. For example, Golub et al. (1999) developed a class

discovery procedure based on microarray gene expression to discover the distinction between

acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), whose appearances

are highly similar. Gray et al. (1998) used microarrays to identify differences in yeast gene

expression before and after treatment with various kinase inhibitors.

activities that translates across species. Under a single species experiment, Dhiman et al. (2002)

concluded that gene expression data can provide logical answers to problems of vaccine failure

or give important leads to identification of novel vaccine candidates as scientists will be able to

unravel the gene expression data into fundamental biological principles and design vaccines to

target pathogens specific to given human genotypes. The utility of microarray information in

the drug development process is reviewed by Braxton and Bedilion (1998) who embrace the idea

that gene expression analysis can be a surrogate marker for the interaction between compounds

and cells, and should yield information about efficacy.

Hence, it is believed that, as discussed in Debouck and Goodfellow (1999), by measuring

the expression patterns of thousands of genes in response to drug treatments, microarrays can

be used to generate clues to patterns of gene function that can help improve the efficiency of

drug development.

1.3 Homology and multiple species gene expression analysis

Homology designates a relationship of common descent between any species, and there are two types of homologous sequences: paralogous and orthologous (Koonin, 2001). Paralogs are genes related via duplication; they often belong to the same species, but this is not necessary (Koonin, 2001). Orthologs are genes from two different species that derive from a single gene in the last common ancestor of the species (Sonnhammer and Koonin, 2002). Orthologs are, typically, functional counterparts in different species. However, some orthologs are highly conserved whereas the similarity between others is barely detectable (Koonin, 2001; Theiben, 2002). Orthologous relationships can be one-to-one, one-to-many, or many-to-many (Theiben, 2002).

As stated before, one key challenge of drug development is to successfully translate the

results of preclinical findings in animal models to human beings in the clinic. Pre-clinical

experiments assume that the effect of the drug tested on animals is comparable to that on

humans, which can only be true if a functional equivalent of the human drug target exists

therefore provides the best functional annotation of experimentally undetermined genes across

species. Holbrook and Sanseau (2007) remarked that the use of orthologs has the potential to

improve the understanding of biological differences between species (animals and humans).

Many of the successful applications of cross-species microarray gene expression analysis

involve orthology. In Grigoryev et al. (2004), orthologous genes exhibiting similar patterns

of expression across species were selected as ventilator-associated lung injury (VALI) related

candidates. Grigoryev et al. (2004) also claimed that the use of orthologs increased the statistical

power of their gene expression analysis and allowed them to identify candidate genes that would

otherwise remain unnoticed. Batzoglou et al. (2008) compared the human and mouse genomic

loci and discovered that the exon number and exon length for the 117 identified orthologs are

strongly preserved. Taher et al. (2004) successfully used homology information from

cross-species alignments of genomic sequences to perform new gene-finding. Ogorek (2008) adopted

the idea of orthologs to increase human differential gene-finding power.

Additionally, over the past decade, researchers have tried to use orthology and gene expression data for cross-species comparison in order to understand how genes interact to perform particular biological processes. Lelandais et al. (2006) proposed a multi-dimensional scaling (MDS) method to directly compare gene expression across two species and, by using this approach, extracted some common properties and differences between budding and fission yeasts. “Essential” is a term referring to the functional importance of a gene to an organism: an essential gene is a gene that is absolutely required for survival. As it is not practical to experimentally estimate essentiality of human genes, Park et al. (2008) utilized the existing knowledge on essentiality of mouse orthologs to estimate essentiality of the corresponding human genes. By investigating the characteristics of human disease genes through a comparative analysis with mouse mutant phenotype data, the idea that human disease genes have properties of essential genes was confirmed.

These findings all support the idea that orthologs could be a useful tool for researchers to

Over the past decade, many ortholog databases have been established. Alexeyenko et al. (2006) offer a nice review of available ortholog databases and the methods used to build orthologous relationships. For the most commonly used animals in preclinical experiments, mice, Bult et al. (2008) provided comparative genomic information particularly for human and rat genomes. Mouse Genome Database (MGD) is a core component of the Mouse Genome Informatics (MGI) database resource (http://www.informatics.jax.org) hosted at the Jackson Laboratory (http://www.jax.org). Comprehensive orthology information for other organisms can be found through HomoloGene (http://www.ncbi.nlm.nih.gov/homologene), a tool of the National Center for Biotechnology Information (NCBI).

1.4 Statistical issues in microarray data analysis

In the analysis of microarray data, statistical methodologies need to be carefully planned out

as researchers often have to deal with massive amounts of data and adjust for various sources

of variability in order to identify the important genes from a large pool. Some of the statistical

issues in microarray data analysis have been summarized in Smyth et al. (2003), including

experimental design, image analysis, graphical presentation, normalization, quality measure,

multiple testing and the search for differentially expressed genes. In this paper, only statistical

issues in multiple testing and the search for differentially expressed genes are discussed since

the rest are beyond the scope of this research.

1.4.1 Conventional search for differentially expressed genes

A nice introductory review of various statistical tests for differentially expressed genes in

microarray experiments has been given by Cui and Churchill (2003). If the purpose is to

compare two conditions, the following test statistics are commonly used in practice. Define

$x_{1ij}$ and $x_{2ij}$ as the $\log_2$ expression levels of gene $i$ in replicate $j$ in the control and treatment,

respectively. The data are collected from two random samples of independent observations,

is defined as

$$t_i = \frac{\bar{x}_{1i} - \bar{x}_{2i}}{se_i}, \tag{1.1}$$

where $\bar{x}_{1i} = \frac{1}{n_1}\sum_{j=1}^{n_1} x_{1ij}$ and $\bar{x}_{2i} = \frac{1}{n_2}\sum_{j=1}^{n_2} x_{2ij}$. $n_1$ and $n_2$ are the number of replicates in the control and treatment group, respectively. $se_i$ is the estimated standard error of gene $i$.

Another version of the ordinary t statistic, termed the global t test in Cui and Churchill (2003), replaces $se_i$ by $se$, where $se$ is the standard error computed by combining data across all genes. Neither of the ordinary t statistics is ideal. For the gene-specific t statistic, the variances estimated from each gene are not stable, and the global t statistic does not adjust for individual gene variability. In addition, an unrealistically small $se_i$ can result in a large t statistic, and therefore genes with small sample variances stand a good chance of being declared differentially expressed even if they are not.
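The contrast between the two ordinary t statistics can be illustrated with a minimal Python sketch. The text does not give formulas for $se_i$ or $se$, so the sketch assumes a pooled-variance standard error for each gene and, for the global version, the average of the pooled variances across genes; the function names and the toy data are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_variance(x1, x2):
    """Pooled sample variance of two groups (assumption: equal group variances)."""
    n1, n2 = len(x1), len(x2)
    return ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)

def t_statistics(control, treatment):
    """Return (gene-specific t, global t) for log2 expression data.

    control[i] and treatment[i] hold the replicates of gene i; every gene is
    assumed to have the same number of replicates per group.
    """
    diffs = [mean(x1) - mean(x2) for x1, x2 in zip(control, treatment)]
    pooled = [pooled_variance(x1, x2) for x1, x2 in zip(control, treatment)]
    n1, n2 = len(control[0]), len(treatment[0])
    scale = math.sqrt(1 / n1 + 1 / n2)
    global_var = mean(pooled)  # assumed way of combining data across genes
    gene_t = [d / (math.sqrt(v) * scale) for d, v in zip(diffs, pooled)]
    global_t = [d / (math.sqrt(global_var) * scale) for d in diffs]
    return gene_t, global_t

# Toy data: gene 0 has a tiny mean difference and a tiny variance,
# gene 2 has a large mean difference and a moderate variance.
control = [[1.00, 1.01, 0.99], [2.0, 3.0, 1.0], [5.0, 4.0, 6.0]]
treatment = [[1.05, 1.06, 1.04], [2.2, 3.1, 0.9], [8.0, 7.0, 9.0]]
gene_t, global_t = t_statistics(control, treatment)
```

In this toy example the gene-specific statistic of gene 0 is larger in magnitude than that of gene 2 purely because its estimated variance is very small, while the global statistic ranks gene 2 first but ignores gene-level variability altogether.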

As a result, modifications of the t test have been widely proposed to address the shortcomings of the ordinary t test. Tusher et al. (2001) proposed a regularized t statistic, the SAM (Significance Analysis of Microarrays) test statistic, of the form

$$t_i^{\text{SAM}} = \frac{\bar{x}_{1i} - \bar{x}_{2i}}{s_0 + se_i}, \tag{1.2}$$

where the value of the constant $s_0$ is chosen to minimize the coefficient of variation of the $t_i^{\text{SAM}}$ statistic. Genes with $t_i^{\text{SAM}}$ greater than a threshold are deemed potentially significant. The threshold can be adjusted to select smaller or larger sets of genes based on the estimated false discovery rate (FDR), which is calculated by random permutation of gene expressions among the different experimental units, i.e., permuting the treatment labels for the entire arrays, thereby preserving any correlation among the genes. Zhang (2007) offers a detailed description of how SAM estimates FDR and a comprehensive evaluation of SAM.
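The regularized statistic (1.2) and the label-permutation idea can be sketched in Python as follows. This is a simplified illustration, not the SAM procedure itself: $s_0$ is passed in as a fixed value rather than chosen by minimizing the coefficient of variation, $se_i$ is assumed to be a pooled-variance standard error, and the FDR estimate is simply the average number of genes called under permuted labels divided by the number called in the observed data. All function names and data are hypothetical.

```python
import math
import random
from statistics import mean, variance

def sam_statistics(control, treatment, s0):
    """Regularized t statistic (x1bar - x2bar) / (s0 + se_i) for each gene."""
    stats = []
    for x1, x2 in zip(control, treatment):
        n1, n2 = len(x1), len(x2)
        pooled = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
        se = math.sqrt(pooled * (1 / n1 + 1 / n2))  # assumed form of se_i
        stats.append((mean(x1) - mean(x2)) / (s0 + se))
    return stats

def permutation_fdr(control, treatment, s0, threshold, n_perm=200, seed=0):
    """Crude FDR estimate at |t| > threshold obtained by permuting array labels."""
    rng = random.Random(seed)
    n1 = len(control[0])
    observed = sum(abs(t) > threshold for t in sam_statistics(control, treatment, s0))
    if observed == 0:
        return 0.0
    rows = [c + t for c, t in zip(control, treatment)]  # one row per gene
    n_arrays = len(rows[0])
    false_calls = 0
    for _ in range(n_perm):
        # The same shuffle of arrays is applied to every gene, which keeps
        # the correlation among genes intact.
        order = rng.sample(range(n_arrays), n_arrays)
        perm_c = [[row[j] for j in order[:n1]] for row in rows]
        perm_t = [[row[j] for j in order[n1:]] for row in rows]
        false_calls += sum(abs(t) > threshold
                           for t in sam_statistics(perm_c, perm_t, s0))
    return min(1.0, false_calls / n_perm / observed)

control = [[1.00, 1.01, 0.99], [2.0, 3.0, 1.0], [5.0, 4.0, 6.0]]
treatment = [[1.05, 1.06, 1.04], [2.2, 3.1, 0.9], [8.0, 7.0, 9.0]]
stats = sam_statistics(control, treatment, s0=0.5)
fdr = permutation_fdr(control, treatment, s0=0.5, threshold=2.0)
```

With $s_0 > 0$ in the denominator, the gene with a tiny variance and a tiny mean difference (gene 0) no longer dominates the gene with a large mean difference (gene 2).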

A more recent example of the modified t statistic is the “shrinkage t” statistic introduced

by Opgen-Rhein and Strimmer (2007). For given data with $g$ genes, first compute the empirical

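is obtained by

$$\upsilon_i^{\star} = \hat{\lambda}^{\star}\,\upsilon_{\text{median}} + (1 - \hat{\lambda}^{\star})\,\upsilon_i$$

with optimal estimated pooling parameter

$$\hat{\lambda}^{\star} = \min\left(1, \frac{\sum_{i=1}^{g} \widehat{\mathrm{Var}}(\upsilon_i)}{\sum_{i=1}^{g} (\upsilon_i - \upsilon_{\text{median}})^2}\right),$$

where $\widehat{\mathrm{Var}}(\upsilon_i)$ is the estimated variance of the empirical gene variances. The “shrinkage t” statistic is given by

$$t_i^{\text{shrinkage}} = \frac{\bar{x}_{1i} - \bar{x}_{2i}}{\sqrt{\dfrac{\upsilon_{1i}^{\star}}{n_1} + \dfrac{\upsilon_{2i}^{\star}}{n_2}}}. \tag{1.3}$$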

Two variants of this statistic are considered by Opgen-Rhein and Strimmer (2007). In the first, variances are estimated separately in each group, which results in two different shrinkage estimators, $\upsilon_{1i}^{\star}$ and $\upsilon_{2i}^{\star}$. In the second, pooled variance estimates are used, which gives one common shrinkage estimator $\upsilon_i^{\star}$. The advantages of the “shrinkage t” statistic are that it is derived without any specific distributional assumptions and that it is not computationally intensive. The “shrinkage t” statistic is implemented in the R package “st”, which is available from the CRAN archive (http://cran.r-project.org).
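The shrinkage of gene-wise variances toward their median can be sketched in Python for a single group of replicates. The sketch takes $\upsilon_i$ to be the usual unbiased sample variance and estimates $\widehat{\mathrm{Var}}(\upsilon_i)$ as $\frac{n}{(n-1)^3}\sum_k (w_{ik} - \bar{w}_i)^2$ with $w_{ik} = (x_{ik} - \bar{x}_i)^2$; both choices are assumptions made for illustration and are not stated in the text above. The function name and data are hypothetical, and the sketch is not the implementation in the “st” package.

```python
from statistics import mean, median

def shrink_variances(data):
    """Shrink gene-wise variances toward their median.

    data[i] holds the replicates of gene i within one group.
    Returns (lambda_hat, list of shrunken variances).
    """
    v, var_v = [], []
    for x in data:
        n = len(x)
        xbar = mean(x)
        w = [(xk - xbar) ** 2 for xk in x]
        wbar = mean(w)
        v.append(n / (n - 1) * wbar)  # unbiased sample variance of gene i
        # Assumed estimator of the variance of the empirical variance.
        var_v.append(n / (n - 1) ** 3 * sum((wk - wbar) ** 2 for wk in w))
    target = median(v)
    denom = sum((vi - target) ** 2 for vi in v)
    lam = 1.0 if denom == 0 else min(1.0, sum(var_v) / denom)
    shrunk = [lam * target + (1 - lam) * vi for vi in v]
    return lam, shrunk

data = [
    [1.0, 1.1, 0.9, 1.0],      # very small variance
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 10.0, 20.0, 30.0],   # very large variance
    [3.0, 3.5, 2.5, 3.0],
]
lam, shrunk = shrink_variances(data)
```

Every shrunken variance lies between the gene's own empirical variance and the median, so very small variances are pulled up and very large ones are pulled down before they enter the denominator of (1.3).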

All methods introduced above are for two-sample problems. When there are more than two conditions in an experiment, a more general method, such as analysis of variance (ANOVA), can be used to detect differential expression. Wolfinger et al. (2001) used two interconnected mixed models to assess gene significance in a two-step approach. At the first step, a mixed model is fit for all genes and residuals are obtained from this model. This is called the normalization model in Wolfinger et al. (2001), and its purpose is to adjust for any bias which arises from variation in the microarray technology rather than from biological differences between the genes. At the second step, a mixed model, called the gene model in the paper, is fit separately for each gene using the normalized data obtained from the first step. Inference for testing for


A moderated t statistic to address the problem of assessing differential expression in microarray experiments with more than two treatments was discussed in Smyth (2004). Smyth (2004) applied an empirical Bayes approach and linear models to microarray data analysis, shrinking the estimated sample variances towards a pooled estimate. In a single species experiment, let $\beta_{1i}$ be the unknown coefficient associated with the treatment effect for the $i$th gene, and let the conjugate prior distribution of the variance $\sigma_i^2$ for $\beta_{1i}$ (which varies across genes) be

\[
\sigma_i^2 \sim \text{Inverse Gamma}\left(\frac{d_0}{2},\ \frac{d_0 s_0^2}{2}\right), \tag{1.4}
\]

where $d_0$ and $s_0^2$ are the hyperparameters of the inverse gamma distribution (which has shape $d_0/2$ and scale $d_0 s_0^2/2$) and need to be estimated. Smyth (2004) took an empirical Bayes approach to estimate $d_0$ and $s_0^2$ from the data. Let $\hat{\beta}_{1i}$ be the least squares estimate of $\beta_{1i}$ from the linear model for the $i$th gene and $s_i^2$ be the observed variance for $\hat{\beta}_{1i}$, obtained from data. Then,

\[
d_i s_i^2 \mid \sigma_i^2 \sim \text{Gamma}\left(\frac{d_i}{2},\ 2\sigma_i^2\right), \tag{1.5}
\]

where $d_i$ is the error degrees of freedom for the linear model for gene $i$. The moderated t statistic proposed by Smyth (2004) is in the following form:
\[
t_i^{\mathrm{smyth}} = \frac{\hat{\beta}_{1i}}{\tilde{s}_i/\sqrt{n}}, \tag{1.6}
\]

where $n$ is the number of replicates in the experiment, and $\tilde{s}_i^2$ is the posterior mean of $\sigma_i^2$ given $s_i^2$. Under the above hierarchical setting (1.4) and (1.5), Smyth (2004) claimed that
\[
\tilde{s}_i^2 = \frac{d_0 s_0^2 + d_i s_i^2}{d_0 + d_i}.
\]

Indeed, the denominator of this formula should have been d0 +di + 2. But this erroneous


the distribution of $t_i^{\mathrm{smyth}}$ under the null hypothesis. It turns out that
\[
t_i^{\mathrm{smyth}} \mid \beta_{1i} = 0 \ \sim\ t_{d_0 + d_i}, \tag{1.7}
\]
i.e., a t distribution with inflated degrees of freedom. This is useful for getting the p-values associated with the test.
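As a small numerical illustration of (1.6) and (1.7), the sketch below computes the shrunken variance as stated by Smyth (2004), the moderated t statistic, and the degrees of freedom of its null distribution. The function name is hypothetical, and $d_0$ and $s_0^2$ are assumed to be already estimated; the empirical Bayes estimation of these hyperparameters is not shown.

```python
import numpy as np

def moderated_t(beta_hat, s2, d, d0, s0_sq, n):
    """Moderated t statistic in the form of equation (1.6).

    beta_hat : least squares treatment-effect estimates, one per gene
    s2       : observed variances s_i^2, one per gene
    d        : error degrees of freedom d_i
    d0, s0_sq: prior hyperparameters (assumed known here)
    n        : number of replicates
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    # Shrink each gene's variance toward the prior value s0_sq.
    s2_tilde = (d0 * s0_sq + d * s2) / (d0 + d)
    t = beta_hat / (np.sqrt(s2_tilde) / np.sqrt(n))
    # Under the null, t is referred to a t distribution with d0 + d df.
    return t, s2_tilde, d0 + d
```

For example, with $d_0 = 4$, $s_0^2 = 1$, $d_i = 6$ and $s_i^2 = 2$, the shrunken variance is $(4 + 12)/10 = 1.6$ and the reference distribution has 10 degrees of freedom.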

From a different point of view, Allison et al. (2002) developed a sequence of procedures to address issues that arise when searching for differentially expressed genes. To answer the fundamental question in microarray experiments, namely whether any of the genes under study exhibit a difference in expression across the treatments, Allison et al. (2002) proposed a procedure involving a finite mixture model and bootstrap inference. The idea behind the approach is that the information contained in the distribution of the many test statistics and corresponding p-values can be used to detect differentially expressed genes. This set of procedures allows for non-normality and heteroscedasticity, and may be adapted for data with small sample sizes.

1.4.2 Correction for multiple testing

When conducting a single hypothesis test, two types of errors may be committed. A type I error (false positive) occurs when a gene is declared differentially expressed when in fact it is not. A type II error (false negative) occurs when a differentially expressed gene fails to be declared significant in the analysis of the data. Typically, a statistical test is constructed to control the type I error probability at level $\alpha$. Consider the problem of testing $g$ hypotheses simultaneously; Table 1.1 illustrates the situation in a traditional form. The problem of multiple comparisons arises when the test is used repeatedly in order to produce a list of rejected hypotheses. One approach to multiple testing is to control the familywise error rate (FWER), which is the probability of one or more false positive results over a number of statistical tests. For $g$ test statistics with level $\alpha$, the FWER is defined as


Table 1.1: Table of outcomes. FP = false positive, TP = true positive, FN = false negative, and TN = true negative.

                   H0 true    H0 not true
reject H0          FP         TP
not reject H0      TN         FN

i.e., FWER = Pr(FP ≥ 1). Controlling the FWER means calculating the $\alpha$ level needed for each individual gene, say $c$, in order to ensure that the experiment-level type I error is less than or equal to $\alpha_e$. The simplest procedure to control the FWER at level $\alpha_e$ is the Bonferroni correction:
\[
c = \frac{\alpha_e}{g}.
\]
With this significance cutoff value $c$, the FWER will be no larger than $\alpha_e$ for any family of $g$ tests.
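The reason the Bonferroni cutoff works can be written out in one line using the union bound. The sketch below assumes each p-value $p_i$ is valid under its null hypothesis, i.e., $\Pr(p_i \le c) \le c$ when $H_{0i}$ is true; the numerical values of $g$ and $\alpha_e$ are hypothetical and chosen only for illustration.

```latex
\[
\mathrm{FWER}
  = \Pr\Bigl(\bigcup_{i:\,H_{0i}\ \mathrm{true}} \{p_i \le c\}\Bigr)
  \le \sum_{i:\,H_{0i}\ \mathrm{true}} \Pr(p_i \le c)
  \le g\,c
  = \alpha_e .
\]
% Illustration with hypothetical values g = 10{,}000 and \alpha_e = 0.05:
\[
c = \frac{0.05}{10{,}000} = 5 \times 10^{-6}.
\]
```

No independence assumption is needed for this bound, which is why the cutoff holds for any family of $g$ tests, at the price of a very small per-gene level when $g$ is large.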

Controlling the FWER is a very stringent approach to multiple testing, particularly because a microarray experiment often contains thousands of genes. It can be argued that controlling the FWER is not as important for microarray experiments, since falsely selecting a handful of genes as differentially expressed may not be a serious problem if the majority of significant genes are correctly chosen. Hence, controlling the false discovery rate (FDR), which was introduced by Benjamini and Hochberg (1995), is a less conservative alternative that can be useful for microarray experiments. It is formally defined as
\[
\mathrm{FDR} = E\left(\frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TP}}\right);
\]
in other words, the FDR is the expected proportion of the rejected null hypotheses which are erroneously rejected. This procedure has been implemented in SAS PROC MULTTEST. When


to reject tests corresponding to $p_{(1)}, p_{(2)}, \ldots, p_{(\hat{k})}$, where $\hat{k} = \max\{k : p_{(k)} \le k\alpha/g\}$. This procedure controls the FDR at level $\alpha$.
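The step-up rule above can be sketched in Python as follows. This is a minimal illustration with a hypothetical function name, not the SAS implementation; it returns $\hat{k}$ and a rejection indicator in the original order of the p-values.

```python
import numpy as np

def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up rule: reject the k_hat smallest p-values,
    where k_hat = max{k : p_(k) <= k * alpha / g}."""
    p = np.asarray(pvalues, dtype=float)
    g = p.size
    order = np.argsort(p)                       # indices of ordered p-values
    thresholds = np.arange(1, g + 1) * alpha / g
    passing = np.nonzero(p[order] <= thresholds)[0]
    k_hat = 0 if passing.size == 0 else int(passing[-1]) + 1
    reject = np.zeros(g, dtype=bool)
    reject[order[:k_hat]] = True                # reject the k_hat smallest
    return k_hat, reject
```

Note that the rule takes the largest $k$ satisfying the inequality, so an ordered p-value that misses its own threshold is still rejected when a larger one meets its threshold.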

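Assuming independence of the test statistics, the FDR adjusted p-values are defined as
\[
\begin{aligned}
\tilde{p}_{(g)} &= p_{(g)},\\
\tilde{p}_{(g-1)} &= \min\left\{\tilde{p}_{(g)},\ \frac{g}{g-1}\,p_{(g-1)}\right\},\\
&\ \ \vdots\\
\tilde{p}_{(1)} &= \min\left\{\tilde{p}_{(2)},\ g\,p_{(1)}\right\}.
\end{aligned}
\]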
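The recursion for the adjusted p-values amounts to a running minimum taken from the largest ordered p-value downward. A minimal Python sketch, with a hypothetical function name, is:

```python
import numpy as np

def bh_adjust(pvalues):
    """FDR-adjusted p-values following the recursion in the text."""
    p = np.asarray(pvalues, dtype=float)
    g = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by g / k.
    scaled = p[order] * g / np.arange(1, g + 1)
    # Running minimum from the largest ordered p-value down to the smallest.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(g)
    adjusted[order] = adjusted_sorted           # back to the input order
    return adjusted
```

For the p-values (0.01, 0.04, 0.03, 0.5), the adjusted values are (0.04, 0.0533, 0.0533, 0.5): the second smallest p-value, 0.03, would scale to 0.06 but is capped by the adjusted value of the next larger one.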

Instead of fixing $\alpha$ and estimating $\hat{k}$, i.e., estimating the rejection region (Benjamini and Hochberg, 1995), Storey (2002) proposed a different approach to false discovery rates: fixing the rejection region and then estimating $\alpha$.

Recall that, in multiple testing, the p-value for an individual test can be defined as the smallest significance level (the smallest FWER) for which the null hypothesis can be rejected. Analogously, the q-value (Storey, 2002) is the smallest estimated FDR at which the test may be rejected. For the $g$ hypothesis tests with corresponding p-values $p_1, \ldots, p_g$ and any rejection region of interest $[0, \gamma]$, $\gamma \le 1$, the q-value is defined as

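\[
q\text{-value}(p_i) = \min_{\gamma \ge p_i} \widehat{\mathrm{FDR}}(\gamma),
\]
where $\widehat{\mathrm{FDR}}(\gamma)$ is the estimated FDR. Storey (2002) derived the explicit form of $\widehat{\mathrm{FDR}}(\gamma)$:
\[
\widehat{\mathrm{FDR}}(\gamma) = \frac{\hat{\pi}_0\,\gamma\,g}{\#\{p_i \le \gamma\}},
\]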

where $\hat{\pi}_0 = \dfrac{\#\{p_i > \alpha\}}{(1-\alpha)\,g}$ for a tuning parameter $0 \le \alpha < 1$, and $\hat{\pi}_0$ is defined as the estimated value of the probability of a null hypothesis being true. Storey (2002) has also shown that this approach rejects more


approach is implemented in the R package “qvalue” which is available from the CRAN archive


Chapter 2

Joint modeling across species

As mentioned in Section 1.3, gene information from multiple species has been used successfully in several applications. The goal of this research is to make animal models more predictive of human outcomes in drug development. Utilizing the gene expression data from both species jointly and discovering the patterns of differentially expressed genes across species is one approach to overcoming the current difficulty that animal models often have poor clinical relevance. Investigating the activities of the selected orthologous genes with different patterns of differential expression can help scientists understand the causes of drug attrition and thus make drug development for the same class of drug more efficient in the future.

Let $X_{aij}$ and $X_{hil}$ denote gene expression measurements from the $i$th orthologous gene pair for the $j$th animal and the $l$th human. The following independent linear models describe the association between gene expression and treatment:
\[
X_{aij} = \beta_{0ai} + \beta_{1ai} T_{aj} + e_{aij}, \tag{2.1}
\]
\[
X_{hil} = \beta_{0hi} + \beta_{1hi} T_{hl} + e_{hil}, \tag{2.2}
\]

where $T_{aj}$ and $T_{hl}$ are $\{0,1\}$ treatment indicators, and $e_{aij}$ and $e_{hil}$ are independent $N(0, \sigma_a^2)$ and $N(0, \sigma_h^2)$ random variables, so that $\sigma_a^2$ and $\sigma_h^2$ are the variances of $e_{aij}$ and $e_{hil}$, respectively. In drug development, the animal research and human experiments are conducted independently: the results of one do not affect the other. However, the treatment effects are expected to be associated in some way between the two species. For this reason, two independent models, one per species, are used to capture the effects of treatment on gene expression.
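Because the treatment indicator in (2.1) and (2.2) takes only the values 0 and 1, the least squares estimate of the treatment effect for each gene is the difference between the treated and control group means. The following Python sketch fits the model for all genes of one species at once; the function name and the data layout (genes in rows, samples in columns) are assumptions made for illustration.

```python
import numpy as np

def treatment_effects(X, T):
    """Least squares fit of X_ij = beta0_i + beta1_i * T_j + e_ij per gene.

    X : array of shape (genes, samples) with expression measurements
    T : 0/1 treatment indicator of length `samples`
    Returns (beta0_hat, beta1_hat), one value per gene.
    """
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    design = np.column_stack([np.ones_like(T), T])   # intercept and treatment
    coef, *_ = np.linalg.lstsq(design, X.T, rcond=None)
    return coef[0], coef[1]
```

Running this once on the animal data and once on the human data gives the pair of estimated treatment effects for each orthologous gene pair.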

The joint behavior of $\beta_{1ai}$ and $\beta_{1hi}$ is of interest because they describe the differential expression of the $i$th orthologous animal and human genes due to a treatment intervention. The quantities $\beta_{1ai}$ and $\beta_{1hi}$ are not observable; nonetheless, the gene expression data for mice and humans are observed, and separate least squares estimates $\hat{\beta}_{1ai}$ and $\hat{\beta}_{1hi}$ can be obtained from linear models (2.1) and (2.2). Recall that the goal of this research is to bridge the animal models and the human models in drug development by discovering the patterns of relevant gene expression across species. A central question is why drugs have opposite effects on mice and humans when they are expected to work in the same direction. Therefore, modeling $(\beta_{1ai}, \beta_{1hi})^T$ jointly to identify the patterns of differentially expressed genes across species is expected to help resolve the current problem of poor translation from animal experiments to clinical research. Understanding the activities of the relevant orthologs (those differentially expressed in response to treatment) across species can help explain the poor translation between animal and human models, and hence help reduce the failure rate of drugs going from preclinical trials to clinical trials.

Section 2.1 describes the joint (prior) distribution for $(\beta_{1ai}, \beta_{1hi})^T$ and the corresponding marginal distribution of $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$. Section 2.2 introduces mixture models and the EM algorithm in general. Section 2.3 discusses the specific solution of the EM algorithm for the proposed models in particular. Section 2.4 presents the problem of singularity for the covariance


2.1 A 9-component bivariate mixture model for two species

$\beta_{1ai}$ and $\beta_{1hi}$ quantify the differential expression of the $i$th orthologous animal and human genes due to a treatment intervention. A given gene can be classified as non-differentially expressed (NDE), showing no sign of a treatment effect; positively differentially expressed (pDE), showing a positive treatment effect; or negatively differentially expressed (nDE), showing a negative treatment effect. Therefore, for a human and animal gene pair, there are 9 possibilities for categorizing the pair. Furthermore, dependency is assumed between differentially expressed orthologs, i.e., association is posited only for $(\beta_{1ai}, \beta_{1hi})^T$ in categories 1, 2, 3, and 4, and zero correlation is presumed for $(\beta_{1ai}, \beta_{1hi})^T$ in categories 0, 5, 6, 7, and 8. Table 2.1 illustrates the 9 possible categories of $(\beta_{1ai}, \beta_{1hi})^T$, where $(\mu_{\beta_{1ai}}, \mu_{\beta_{1hi}})^T$ is the vector of population means of $(\beta_{1ai}, \beta_{1hi})^T$ under each category.

Table 2.1: Possible categories of treatment effects: a prior for $(\beta_{1ai}, \beta_{1hi})^T$

category   $(\beta_{1ai}, \beta_{1hi})$   $(\mu_{\beta_{1ai}}, \mu_{\beta_{1hi}})$   Corr$(\beta_{1ai}, \beta_{1hi})$
0          (NDE, NDE)                     (0, 0)                                     0
1          (pDE, pDE)                     (+, +)                                     $\rho_1$
2          (nDE, nDE)                     (−, −)                                     $\rho_2$
3          (pDE, nDE)                     (+, −)                                     $\rho_3$
4          (nDE, pDE)                     (−, +)                                     $\rho_4$
5          (NDE, pDE)                     (0, +)                                     0
6          (NDE, nDE)                     (0, −)                                     0
7          (pDE, NDE)                     (+, 0)                                     0
8          (nDE, NDE)                     (−, 0)                                     0
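The category numbering in Table 2.1 can be encoded as a simple lookup, which is convenient when labelling simulated gene pairs whose true treatment effects are known. The sketch below is illustrative only: the names are hypothetical, and it classifies true effects by sign, which is not how categories are assigned to real data, where the effects are unobserved.

```python
# Category index for each (animal status, human status) pair, as in Table 2.1.
CATEGORY = {
    ("NDE", "NDE"): 0,
    ("pDE", "pDE"): 1,
    ("nDE", "nDE"): 2,
    ("pDE", "nDE"): 3,
    ("nDE", "pDE"): 4,
    ("NDE", "pDE"): 5,
    ("NDE", "nDE"): 6,
    ("pDE", "NDE"): 7,
    ("nDE", "NDE"): 8,
}

def status(beta):
    """Expression status implied by the sign of a true treatment effect."""
    if beta > 0:
        return "pDE"
    if beta < 0:
        return "nDE"
    return "NDE"

def category(beta_animal, beta_human):
    """Table 2.1 category for a pair of true treatment effects."""
    return CATEGORY[(status(beta_animal), status(beta_human))]
```

Categories 1 to 4 are the ones in which both orthologs are differentially expressed, which is where a nonzero correlation is allowed.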

As a consequence of these possible patterns of $(\beta_{1ai}, \beta_{1hi})^T$, mixture models (McLachlan and Basford, 1988; McLachlan and Peel, 2000) are adopted to deal with the correlation and distribution of each subgroup of genes across species. An additional advantage of mixture models is


observation to give a probabilistic clustering. As a result, the pooling of information for genes

across species can be exploited to better understand the underlying relationship between the

treatment intervention for both species.

In practice, finite mixture models (mixture models with finite numbers of components)

(McLachlan and Basford, 1988; McLachlan and Peel, 2000) are often fitted with the component

densities of the mixture taken to be normal. The following normal mixture model is adopted

as the prior distribution of the vector $(\beta_{1ai}, \beta_{1hi})^T$:
\[
\begin{pmatrix}\beta_{1ai}\\ \beta_{1hi}\end{pmatrix}
\sim \sum_{k=0}^{8} \pi_k\, N\!\left(
\begin{pmatrix}\mu_{ak}\\ \mu_{hk}\end{pmatrix},
\begin{pmatrix}\eta_{ak}^2 & \rho_k\eta_{ak}\eta_{hk}\\ \rho_k\eta_{ak}\eta_{hk} & \eta_{hk}^2\end{pmatrix}
\right), \tag{2.3}
\]
where $\pi_k$ is the probability that an observation belongs to the $k$th component, with
\[
\sum_{k=0}^{8}\pi_k = 1 \quad\text{and}\quad \pi_k \ge 0.
\]
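To make the structure of (2.3) concrete, the sketch below draws treatment-effect pairs from a bivariate normal mixture: a component label is drawn with probabilities $\pi_k$, and the pair is then drawn from that component's bivariate normal. The function names, the seed argument, and any parameter values passed in are assumptions for illustration, not values used in this work.

```python
import numpy as np

def component_cov(eta_a, eta_h, rho):
    """2x2 covariance matrix of one component in equation (2.3)."""
    off = rho * eta_a * eta_h
    return np.array([[eta_a ** 2, off], [off, eta_h ** 2]])

def sample_mixture(n, pi, means, covs, seed=0):
    """Draw n pairs (beta_1a, beta_1h) from a bivariate normal mixture.

    pi    : mixing proportions, summing to 1
    means : list of length-2 mean vectors, one per component
    covs  : list of 2x2 covariance matrices, one per component
    Returns (labels, draws) with draws of shape (n, 2).
    """
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    labels = rng.choice(len(pi), size=n, p=pi)   # latent component labels
    draws = np.empty((n, 2))
    for k in range(len(pi)):
        idx = np.nonzero(labels == k)[0]
        if idx.size:
            draws[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.size)
    return labels, draws
```

A component with an all-zero covariance matrix simply returns its mean vector for every draw, so degenerate components can be represented in the same way.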


parameters may be made according to Table 2.1, and non-differentially expressed genes may be assumed to have treatment effects that are deterministically zero, i.e., $(\beta_{1ai}, \beta_{1hi})^T = (0, 0)^T$. This leads to the following two-species mixture model:
\[
\begin{aligned}
\begin{pmatrix}\beta_{1ai}\\ \beta_{1hi}\end{pmatrix}
\sim{}& \pi_0\, N\!\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}0 & 0\\ 0 & 0\end{pmatrix}\right)
+ \sum_{k=1}^{4} \pi_k\, N\!\left(\begin{pmatrix}\mu_{ak}\\ \mu_{hk}\end{pmatrix},
\begin{pmatrix}\eta_{ak}^2 & \rho_k\eta_{ak}\eta_{hk}\\ \rho_k\eta_{ak}\eta_{hk} & \eta_{hk}^2\end{pmatrix}\right)\\
&+ \sum_{k=5}^{6} \pi_k\, N\!\left(\begin{pmatrix}0\\ \mu_{hk}\end{pmatrix}, \begin{pmatrix}0 & 0\\ 0 & \eta_{hk}^2\end{pmatrix}\right)
+ \sum_{k=7}^{8} \pi_k\, N\!\left(\begin{pmatrix}\mu_{ak}\\ 0\end{pmatrix}, \begin{pmatrix}\eta_{ak}^2 & 0\\ 0 & 0\end{pmatrix}\right),
\end{aligned} \tag{2.4}
\]
where $\mu_{a1} \ge 0$, $\mu_{h1} \ge 0$, $\mu_{a2} \le 0$, $\mu_{h2} \le 0$, $\mu_{a3} \ge 0$, $\mu_{h3} \le 0$, $\mu_{a4} \le 0$, $\mu_{h4} \ge 0$, $\mu_{h5} \ge 0$, $\mu_{h6} \le 0$, $\mu_{a7} \ge 0$, and $\mu_{a8} \le 0$, $\rho_0 = 0$, $\rho_5 = 0$, $\rho_6 = 0$, $\rho_7 = 0$, and $\rho_8 = 0$.

$(\beta_{1ai}, \beta_{1hi})^T$ is in general unknown. However, its estimate $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ can be obtained by the method of least squares using the linear models (2.1) and (2.2). The marginal distribution of $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ is as follows:
\[
\begin{aligned}
\begin{pmatrix}\hat{\beta}_{1ai}\\ \hat{\beta}_{1hi}\end{pmatrix}
\sim{}& \pi_0\, N\!\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}\sigma_{a0}^2 & 0\\ 0 & \sigma_{h0}^2\end{pmatrix}\right)
+ \sum_{k=1}^{4} \pi_k\, N\!\left(\begin{pmatrix}\mu_{ak}\\ \mu_{hk}\end{pmatrix},
\begin{pmatrix}\sigma_{ak}^2 & \rho_k\sigma_{ak}\sigma_{hk}\\ \rho_k\sigma_{ak}\sigma_{hk} & \sigma_{hk}^2\end{pmatrix}\right)\\
&+ \sum_{k=5}^{6} \pi_k\, N\!\left(\begin{pmatrix}0\\ \mu_{hk}\end{pmatrix}, \begin{pmatrix}\sigma_{ak}^2 & 0\\ 0 & \sigma_{hk}^2\end{pmatrix}\right)
+ \sum_{k=7}^{8} \pi_k\, N\!\left(\begin{pmatrix}\mu_{ak}\\ 0\end{pmatrix}, \begin{pmatrix}\sigma_{ak}^2 & 0\\ 0 & \sigma_{hk}^2\end{pmatrix}\right),
\end{aligned} \tag{2.5}
\]
where $\mu_{a0} = 0$, $\mu_{h0} = 0$, $\mu_{a1} \ge 0$, $\mu_{h1} \ge 0$, $\mu_{a2} \le 0$, $\mu_{h2} \le 0$, $\mu_{a3} \ge 0$, $\mu_{h3} \le 0$, $\mu_{a4} \le 0$, $\mu_{h4} \ge 0$, $\mu_{a5} = 0$, $\mu_{h5} \ge 0$, $\mu_{a6} = 0$, $\mu_{h6} \le 0$, $\mu_{a7} \ge 0$, $\mu_{h7} = 0$, $\mu_{a8} \le 0$, $\mu_{h8} = 0$, $\rho_0 = 0$, $\rho_5 = 0$, $\rho_6 = 0$, $\rho_7 = 0$, and $\rho_8 = 0$. Note that the marginal distribution of $(\hat{\beta}_{1ai}, \hat{\beta}_{1hi})^T$ has means equal to the

prior means of (β1ai, β1hi)