This paper presents a model for finding associations between a gene’s expression and time-to-event data for cDNA microarrays that accounts for the substantial measurement error. The model for the microarray probes is parametric and creates a latent GEI instead of using the log-ratio. The model for the time-to-event data is a Bayesian semiparametric piecewise constant hazards model. We fit the model using an MCMC algorithm in a two-stage process. The first stage estimates the measurement error parameters, and the second stage uses these estimates in the survival model on a gene-by-gene basis. A case study with a breast cancer dataset is performed with and without adjusting for clinical covariates. The new model is shown to be generally consistent with a conventional model that uses the log-ratios in a Cox proportional hazards model, and potentially important genes selected only by the proposed model are found to have known connections with breast cancer. That is, conventional models that do not account for measurement error may fail to detect the associations between event and gene expression for these genes. Beyond failing to detect associations, the conventional models may underestimate the strength of these associations, because models that do not account for measurement error are known to be biased towards the null; this bias may be avoided in the proposed model. The model was shown to be robust to some parametric assumptions for inference about the parameter of interest, and the new GEIs are found to be highly correlated with the log-ratios. Further, the model is demonstrated to have good operating characteristics concerning type I and type II error rates as well as accurate coverage of the parameter values by the HPDs. However, the issue of False Discovery Rates (FDR) is not addressed
here. Conceivably, permutation of the survival times could be applied to the data in order to estimate the false discovery rate. Permutation is regularly applied with the Cox model and with other frequentist approaches to microarray data (Sorlie et al., 2001), yet such permutations would not be computationally feasible for a Bayesian analysis using this model, and permutation is only valid under exchangeability, which excludes more complex models with clinical covariates. The problem of estimating FDR for Bayesian models is one of current research (Efron et al., 2001; Ibrahim et al., 2002; Newton et al., 2004a; Tadesse et al., 2005), and an estimate of the FDR can be obtained by using the mean posterior probability. If one is interested in which genes are most likely to be associated with the time-to-event data, an ordering of the genes in terms of association is required. In the frequentist setting, the p-values for the test statistics can generate the ordering. One may easily derive such an ordering from the model presented here by calculating the posterior probability that γg = 0 as in Tadesse et al. (2005). Overall, this model has an important advantage over the conventional one in that it accounts for measurement error, which is a significant additional source of variation.
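The ordering and FDR estimate described above can be illustrated with a minimal sketch. It assumes that a posterior probability of no association is already available for each gene (the input list `post_null` is hypothetical and is not produced by the code), and it follows one common reading of the mean-posterior-probability approach: the estimated FDR of a gene list is the average posterior null probability of the genes in it. The function names are illustrative and are not part of the model presented here.

```python
def bayesian_fdr(post_null):
    """Order genes by posterior null probability and estimate the FDR
    of every top-k list as the mean posterior null probability in it.

    post_null: list of per-gene posterior probabilities of no association
               (assumed to be given, e.g. estimated from MCMC output).
    Returns (order, fdr): gene indices from most to least likely associated,
    and the estimated FDR of the list made of the first k genes.
    """
    order = sorted(range(len(post_null)), key=lambda g: post_null[g])
    fdr = []
    running_sum = 0.0
    for k, g in enumerate(order, start=1):
        running_sum += post_null[g]
        fdr.append(running_sum / k)
    return order, fdr


def select_genes(post_null, target_fdr):
    """Return the largest top-k list whose estimated FDR is <= target_fdr."""
    order, fdr = bayesian_fdr(post_null)
    # The running mean of an ascending sequence never decreases, so the
    # qualifying lists form a prefix of the ordering.
    n_keep = sum(1 for f in fdr if f <= target_fdr)
    return order[:n_keep]


# Hypothetical posterior null probabilities for five genes.
example_post_null = [0.90, 0.01, 0.50, 0.03, 0.20]
example_selected = select_genes(example_post_null, target_fdr=0.10)
```

In this toy input the three genes with the smallest posterior null probabilities are kept, since adding a fourth would push the average above the 0.10 target.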
3 Microarrays and Genetics
The second paper derives an enhanced method for finding associations between genotype and gene expression. Microarrays represent high-dimensional complex traits that can be influenced by the genotype of the cells. The purpose of genetic analysis of microarray data is to understand the influence of genotype on gene expression as an intermediary between genotype and directly observable complex traits such as blood pressure, cholesterol, and obesity, and disease states such as diabetes. Linking genotype and expression may also help to elucidate genetic networks. Jansen and Nap (2001) proposed that the combined analysis of gene expression and genetic variation be called “genetical genomics”. Others have called it eQTL analysis, for expression quantitative trait loci. eQTL analysis methods are closely related to quantitative trait loci (QTL) methods that have been developed for a single trait or a few traits (Lynch and Walsh, 1998). The genetic analysis of quantitative traits has a very long history dating back to Francis Galton in 1869 (Galton, 1892).
3.1 Fundamentals of Genetics
The basic aim of eQTL and QTL analysis is to find associations between the genotype, which is a set of positively correlated categorical variables, and the phenotype, which is a continuous response. For a review of QTL methods, see Lynch and Walsh (1998).
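As a simple illustration of associating a categorical genotype with a continuous response, the following sketch computes a one-way ANOVA F statistic for a single marker. This is only one elementary single-marker test, not the method developed in this paper; it ignores the correlation between markers, and the function name, variable names and data are hypothetical.

```python
def single_marker_f(expression, genotype):
    """One-way ANOVA F statistic for expression grouped by marker genotype.

    expression: list of continuous phenotype values (one per subject).
    genotype:   list of genotype labels at one marker (one per subject).
    """
    if len(expression) != len(genotype):
        raise ValueError("expression and genotype must have equal length")

    # Group the phenotype values by genotype class.
    groups = {}
    for y, g in zip(expression, genotype):
        groups.setdefault(g, []).append(y)

    n, k = len(expression), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 genotype classes and n > classes")

    grand_mean = sum(expression) / n
    means = {g: sum(v) / len(v) for g, v in groups.items()}

    # Between-genotype and within-genotype sums of squares.
    ss_between = sum(len(v) * (means[g] - grand_mean) ** 2
                     for g, v in groups.items())
    ss_within = sum((y - means[g]) ** 2
                    for g, v in groups.items() for y in v)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical data: nine subjects, three genotype classes at one marker.
example_expression = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
example_genotype = ["AA", "AA", "AA", "AB", "AB", "AB", "BB", "BB", "BB"]
example_f = single_marker_f(example_expression, example_genotype)
```

A large F statistic relative to the F distribution with (k - 1, n - k) degrees of freedom suggests that mean expression differs between genotype classes at that marker.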
Experimental or observational design plays a pivotal role in the analysis techniques used in mapping or detecting eQTL and QTL. The main consideration is whether the population tested is inbred or outbred. Inbred populations are those whose parents are closely related. Specifically, recombinant inbred lines (RILs or RI strains) are the results of multiple generations of brother-sister mating (Lynch and Walsh, 1998). Through recombination and repeated inbreeding, the offspring become almost completely homozygous, meaning that the maternal and paternal chromosomes have the same genotypes. The offspring within a line will have identical genotypes except for the differences between sexes. Two RILs can be crossed in different ways depending on the experimental design. For example, F1 designs compare offspring from the cross of 2 RILs. F2 designs involve the offspring of the F1 generation, and so on. The backcross design examines the offspring of a cross between the F1 line and one of the parents.
The observational designs of outbred populations are very different from those of inbred populations. Outbred parents and offspring are those whose ancestors are not closely related. This poses additional analytical challenges compared to inbred populations, but many important studies of humans involve outbred subjects. Lynch and Walsh (1998) stress that the outbred designs examine within-population trait variability while the inbred designs examine between-population variability, and they give the major differences between the two. The variability of genetic markers is not well controlled in outbred populations. For example, markers may not be informative, meaning that the genotypes are not polymorphic (show no variation) for the subjects in the study. On the other hand, outbred parents could have excess variability at a locus. For example, if there are 4 or more genotypes, then the analysis can become less powerful to detect QTLs.
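To make the F1, F2 and backcross designs described above concrete, the following sketch computes expected offspring genotype proportions at a single locus under Mendelian segregation. It assumes two fully homozygous parental lines with hypothetical genotypes AA and BB, no selection, and random pairing of individuals between the two groups being crossed; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product


def cross(pop1, pop2):
    """Expected offspring genotype distribution from crossing two groups.

    pop1, pop2: dicts mapping a two-letter genotype (e.g. "AB") to its
    proportion in the group. Each parent passes on either of its two
    alleles with probability 1/2.
    """
    offspring = {}
    for (g1, p1), (g2, p2) in product(pop1.items(), pop2.items()):
        for allele1, allele2 in product(g1, g2):
            child = "".join(sorted(allele1 + allele2))
            offspring[child] = offspring.get(child, 0) + p1 * p2 * Fraction(1, 4)
    return offspring


# Two hypothetical inbred parental lines, homozygous for different alleles.
line_a = {"AA": Fraction(1)}
line_b = {"BB": Fraction(1)}

f1 = cross(line_a, line_b)       # every F1 offspring is heterozygous
f2 = cross(f1, f1)               # F1 x F1
backcross = cross(f1, line_a)    # F1 x parental line A
```

Under these assumptions the F1 generation is entirely AB, the F2 generation segregates 1/4 AA, 1/2 AB and 1/4 BB, and the backcross yields 1/2 AA and 1/2 AB, which is why the F2 design has three genotype classes at a marker while the backcross has only two.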