Methods - Enhancing eQTL analysis techniques with special attention to the transcript dependency structure

The methods we describe here rely on relatively few assumptions, so they can be employed in a wide variety of experiments. For the transcript data, we assume that each transcript is normally distributed. Normality is easily obtained by transformation in most available analysis and expression data management software packages. For the marker data, studies that record bi-allelic markers will use the basic unpaired t-test statistic, which can be expressed in the framework of the linear model. For more detailed records, such as the number of alleles at a given position, an additive effect will be assumed and evaluated using the linear model.

In the model, G represents the vector of individual genotypes, taking values 0 or 1 in the case of linkage data and values 0, 1, or 2 in the case of diploid data.

Extreme observations in the expression data, as presented earlier, can bias the analysis. Studies with smaller sample sizes are especially vulnerable to the effect of outlying observations. To remove the effect of outliers in expression data without removing observations, we propose the use of Van der Waerden transformations (Van der Waerden, 1952, 1953). The transformation uses the ranks (empirical quantiles) of the data and the normal quantile function, that is, the inverse of the normal cumulative distribution function. For $y_{(r)}$, the $r$th ordered value of $y$,

$$y'_{(r)} = \psi\left(\frac{r}{n+1}\right) \qquad (2.2)$$

where $\psi$ is the inverse of the cumulative normal distribution function and $n$ is the number of observations. The Van der Waerden transformation provides a result that is equally or more powerful than that of a Wilcoxon signed rank test (Van der Waerden, 1952). The quantile transformation imposes a normal distribution on the responses. The normal distribution restricts the variability in the data and reduces the deviation of the outlying values. Additionally, the expression values now have a distribution that meets the assumptions of the linear model.
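The rank-to-normal-score mapping in equation (2.2) can be illustrated with a minimal Python sketch. The function name `van_der_waerden`, the toy expression values, and the average-rank handling of ties are assumptions made for illustration; the text does not specify how ties are treated.

```python
from statistics import NormalDist


def van_der_waerden(values):
    """Replace each value by the normal score psi(r / (n + 1)) of its rank r."""
    n = len(values)
    # Indices sorted by value, so order[0] points at the smallest observation.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Find the run of tied values beginning at position `start`.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Assumption: tied values share the average of their 1-based ranks.
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    inverse_cdf = NormalDist().inv_cdf  # psi in equation (2.2)
    return [inverse_cdf(r / (n + 1)) for r in ranks]


# Made-up expression values for one transcript; 97.0 plays the role of an outlier.
expression = [5.1, 4.8, 97.0, 5.0, 4.9]
scores = van_der_waerden(expression)
```

Because only the rank of each observation enters the score, the outlier is mapped to the largest normal score for a sample of this size, regardless of how extreme its raw value is.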

We propose the use of principal components to adjust for the effects of population substructures present in the data. The principal components are calculated from the genetic marker data of the subjects. Principal components have often been employed in single-trait QTL analysis (Paschou et al., 2007; Bauchet et al., 2007; Price et al., 2006) but have not been seen in the eQTL literature. When compiling the components, several assumptions about the data must be made to ensure proper description of the populations. Calculation of the principal components should include only unrelated individuals so as not to bias population contributions. Imputation can then be used to calculate components for the individuals excluded from the original component calculations, and is available in most software packages. The use of multiple eigenvectors to describe different populations allows for adjustments based on multiple population layers for each individual. This is an advantage in describing participants of unknown or mixed populations.

EIGENSOFT (Price et al., 2006) is proposed to calculate the principal components for eQTL data. Different approaches exist to determine the appropriate number of eigenvectors to include in the analysis. The initial naive approach is to select vectors for eigenvalues that are greater than 1. This method can still result in the selection of too many eigenvectors.

To compute the principal components, EIGENSOFT utilizes the singular value decomposition of $X$, where $X$ is the $m$ nucleotides by $n$ individuals matrix of marker genotypes. The singular value decomposition provides the following relationship:

$$X = V D U' \qquad (2.3)$$

In this expression, $V$ is the matrix of orthonormal eigenvectors of $XX'$, $U$ is the matrix of orthonormal eigenvectors of $X'X$, and $D$ is a diagonal matrix consisting of the square roots of the eigenvalues. The $i$th row of the $j$th column of $U$ represents the ancestry component of the $i$th individual along the $j$th principal component. The vectors representing the principal components can be used as covariates in the linear model. For populations containing related individuals or highly correlated clusters of subjects, $X$ can be formed using a unique subset of the entire population. The resulting $V$ and $D$ matrices can be used as loadings on the remaining individuals to project their values onto the components and assign them a score.

To further reduce the number of vectors selected, each eigenvalue can be tested for significance using Tracy-Widom statistics. In studies with smaller sample sizes, the number of components is bounded by the degrees of freedom. For our analysis we will identify the number of eigenvectors to adjust for by using the Tracy-Widom statistics. A significance level of $\alpha = 0.05$ will determine whether or not a component is used, with a minimum of two components included as additional covariates in our models. The distribution of the eigenvalues is presented briefly below; a complete description can be found in Patterson, Price and Reich (2006).
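The decomposition and the projection of excluded individuals can be sketched with NumPy as follows. This is an illustration on simulated genotypes, not EIGENSOFT's implementation: the matrix sizes, the split into reference and held-out individuals, the choice of two components, and the use of marker centering alone (any further normalization EIGENSOFT applies is omitted) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 12  # m markers by n individuals (toy sizes)
genotypes = rng.integers(0, 3, size=(m, n)).astype(float)  # simulated 0/1/2 calls

reference = genotypes[:, :10]  # unrelated individuals used to form X
held_out = genotypes[:, 10:]   # individuals excluded from the decomposition

# Center each marker using the reference individuals only.
marker_means = reference.mean(axis=1, keepdims=True)
X = reference - marker_means

# Thin SVD, X = V D U'. NumPy returns the third factor already transposed.
V, d, U_t = np.linalg.svd(X, full_matrices=False)

k = 2  # number of components kept (arbitrary in this sketch)
pcs_reference = U_t[:k].T  # row i = individual i, column j = component j

# Project held-out individuals with the loadings: u' = x' V D^{-1}.
pcs_held_out = ((held_out - marker_means).T @ V[:, :k]) / d[:k]
```

Applying the same projection to the reference individuals reproduces their columns of $U$, which is a convenient check that the loadings are being used consistently.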

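Define the following population rescaling parameters based on the dimensions of $X$:

$$\mu(n,m) = \frac{\left(\sqrt{m-1} + \sqrt{n}\right)^2}{m} \qquad (2.4)$$

$$\sigma(n,m) = \frac{\sqrt{m-1} + \sqrt{n}}{m}\left(\frac{1}{\sqrt{m-1}} + \frac{1}{\sqrt{n}}\right)^{1/3} \qquad (2.5)$$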

Then for the eigenvalue $\lambda_1$, let

$$x_1 = \frac{\dfrac{n\lambda_1}{\sum_i \lambda_i} - \mu(n,m)}{\sigma(n,m)} \qquad (2.6)$$

Then the statistic $x_1$ follows a Tracy-Widom distribution. This statistic is computed for each eigenvalue and compared to the Tracy-Widom distribution to determine whether the eigenvector is to be included in our model, based on the previously defined $\alpha$.
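As a small numerical illustration, equations (2.4) to (2.6) can be evaluated for the leading eigenvalue as below. The function name `tracy_widom_statistic` and the example eigenvalues are hypothetical. The sketch stops at the standardized statistic: the critical value of the Tracy-Widom distribution for the chosen $\alpha$ has to be taken from a table or from software, and the procedure for testing subsequent eigenvalues is described in Patterson, Price and Reich (2006) and is not reproduced here.

```python
import math


def tracy_widom_statistic(eigenvalues, n, m):
    """Standardize the leading eigenvalue following equations (2.4)-(2.6).

    n is the number of individuals and m the number of markers.
    """
    root_sum = math.sqrt(m - 1) + math.sqrt(n)
    mu = root_sum ** 2 / m                                   # equation (2.4)
    sigma = (root_sum / m) * (
        1 / math.sqrt(m - 1) + 1 / math.sqrt(n)
    ) ** (1 / 3)                                             # equation (2.5)
    leading = max(eigenvalues)
    scaled = n * leading / sum(eigenvalues)                  # n * lambda_1 / sum_i lambda_i
    return (scaled - mu) / sigma                             # equation (2.6)


# Hypothetical eigenvalues for a toy data set with n = 4 individuals and m = 10 markers.
x1 = tracy_widom_statistic([3.0, 1.0, 1.0, 1.0], n=4, m=10)
```

A component would be retained when its statistic exceeds the Tracy-Widom critical value for the chosen significance level.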

A t-test is proposed to evaluate each pairwise marker/expression comparison. The t-test is available within the framework of standard linear regression. After performing the Van der Waerden transformation, the expression values approximately follow a standard normal distribution, which again meets the requirements of the linear model. By remaining within the framework of the linear model, additional covariates can easily be incorporated into the analysis. These covariates include the principal components and other data of interest when available.

$$Y = \mu + G\beta_1 + PC_1\beta_2 + \cdots + PC_k\beta_{k+1} + \epsilon \qquad (2.7)$$

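The t-test will be carried out for $\beta_1$, describing the significance of the genotype on the transcript after adjustment for the principal components. Let $\lambda$ be a vector of length $k+2$, with one entry per coefficient in the model, such that $\lambda = (0, 1, 0, \ldots, 0)$, and let $X$ here denote the design matrix of the model; then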

$$\hat{\beta}_1 = \lambda'(X'X)^{-1}X'Y \qquad (2.8)$$

$$\frac{\hat{\beta}_1}{\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_1)}} \sim t(df_E)$$
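The test for the genotype coefficient can be sketched with ordinary least squares in NumPy. The function name `genotype_t_test`, its arguments, and the toy data are hypothetical. The variance estimate used here, the mean squared error times $\lambda'(X'X)^{-1}\lambda$, is the standard linear-model form and is assumed rather than taken from the text.

```python
import numpy as np


def genotype_t_test(y, g, pcs=None):
    """Return (beta1_hat, t statistic, error df) for the genotype term.

    y   : transformed expression values, length n
    g   : genotype codes, length n
    pcs : optional array of shape (n, k) holding principal component scores
    """
    y = np.asarray(y, dtype=float)
    g = np.asarray(g, dtype=float)
    columns = [np.ones_like(y), g]  # intercept and genotype
    if pcs is not None:
        pcs = np.asarray(pcs, dtype=float).reshape(len(y), -1)
        columns.extend(pcs.T)       # one design column per component
    X = np.column_stack(columns)    # design matrix of the linear model

    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y        # least-squares coefficient estimates
    residuals = y - X @ beta
    df_error = len(y) - X.shape[1]
    mse = residuals @ residuals / df_error

    lam = np.zeros(X.shape[1])
    lam[1] = 1.0                    # lambda = (0, 1, 0, ..., 0) selects beta_1
    beta1 = lam @ beta
    t_stat = beta1 / np.sqrt(mse * (lam @ xtx_inv @ lam))
    return beta1, t_stat, df_error


# Toy example without covariates: equivalent to an unpaired two-group t-test.
beta1, t_stat, df_error = genotype_t_test([1.0, 2.0, 3.0, 5.0], [0, 0, 1, 1])
```

With no principal components supplied, the estimate of $\beta_1$ is simply the difference between the two genotype group means, which shows how the unpaired t-test sits inside the linear model.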

To evaluate the effectiveness of the proposed adjustments, the cis enrichment will be calculated (McClurg et al., 2007). As stated previously, the enrichment of cis associations can be used to validate the analysis. The cis enrichment score is calculated as the ratio of significant cis associations to total significant associations. This ratio is then compared to the proportion of cis associations occurring in the entire genome. Associations are classified as cis when the marker lies within 500 kb of the starting position of the transcript, which gives a 1 Mb window for markers to be considered cis in the results.
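The cis classification and the two proportions being compared can be sketched as follows. The record layout, the function names, the requirement that marker and transcript share a chromosome, and the inclusive 500 kb boundary are assumptions for illustration. How the two proportions are combined into a single score is left to McClurg et al. (2007) and is not implemented here.

```python
CIS_WINDOW_BP = 500_000  # 500 kb on either side of the transcript start


def is_cis(marker_chrom, marker_pos, transcript_chrom, transcript_start):
    """Assumed rule: same chromosome and within 500 kb of the transcript start."""
    return (
        marker_chrom == transcript_chrom
        and abs(marker_pos - transcript_start) <= CIS_WINDOW_BP
    )


def cis_proportions(tests):
    """Return (cis share among significant tests, cis share among all tests).

    Each test is a tuple:
    (marker_chrom, marker_pos, transcript_chrom, transcript_start, significant)
    """
    tests = list(tests)
    cis_flags = [is_cis(*t[:4]) for t in tests]
    significant = [flag for flag, t in zip(cis_flags, tests) if t[4]]
    share_significant = sum(significant) / len(significant)
    share_all = sum(cis_flags) / len(tests)
    return share_significant, share_all


# Made-up marker/transcript pairs.
tests = [
    ("1", 1_200_000, "1", 1_000_000, True),   # cis, significant
    ("1", 5_000_000, "1", 1_000_000, True),   # trans, significant
    ("2", 1_100_000, "1", 1_000_000, False),  # trans, not significant
    ("1", 900_000, "1", 1_000_000, False),    # cis, not significant
    ("3", 10, "1", 1_000_000, False),         # trans, not significant
]
share_significant, share_all = cis_proportions(tests)
```

A larger cis share among significant associations than among all tested pairs indicates enrichment of cis associations.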

In document Enhancing eQTL analysis techniques with special attention to the transcript dependency structure (Page 35-39)