Inverse Average Method - : CONTROL OF FALSE DISCOVERIES IN

CHAPTER 3 : CONTROL OF FALSE DISCOVERIES IN

3.11 Inverse Average Method

There are a number of methods that can be used to estimate the lfdr at the gene- SNP level. Therefore it will be useful if those gene-SNP level lfdr’s within a gene can be combined in some way to obtain the gene level lfdr. Given the set up in Section 3.3, Equation 3.4 and Equation 3.5 indicate that the inverse average or harmonic mean of the λij’s can be equal to the gene level lfdr λi except for the difference in the priors

(π0 and ˜π0) and the difference between f0(.) and f0ij(.). In fact, it can be shown that

the inverse average of λij’s is an upper bound for the gene level lfdr λi. To show this,

consider λij = f(H0ij, Yi, X (i) j ) f(H0ij, Yi, X (i) j ) +f(H0cij, Yi, X (i) j ) ≥ f(H0i, Yi, X (i) j ) f(H0i, Yi, Xj(i)) +f(H0cij, Yi, Xj(i)) = π0f0(Yi) π0f0(Yi) + (1−π˜0)f1(Yi|X (i) j ) ≥ π0f0(Yi) π0f0(Yi) + (1−π0)f1(Yi|X (i) j )

The first inequality follows since H0i ⊆ H0ij, and the second inequality follows from

the fact ˜π0 ≥π0. Therefore, using Equation 3.4, we obtain Inequality 1. 1 1 mi Pmi j=1 1 λij ≥λi.

However, Figure 3.10 shows that such upper bound might not be sharp enough for feeding it into the adaptive thresholding procedure for controlling FDR. The difference is believed to be due the difference in the priors which, under the set up in Section 3.3, are related by the equation

˜ π0 =π0+ mi−1 mi (1−π0) (3.11) 0 5000 10000 15000 20000 0.0 0.2 0.4 0.6 0.8 1.0 Genes

Sorted FDR and Inverse Average

Figure 3.10: Showing the sharpness of the Inverse Average bound using a simulated data. The black line shows the sorted true gene lfdr’s and the red dots are the inverse average of the corresponding gene-SNP level lfdr’s. The simulation procedure used is similar to the scheme described in Section 3.5.

We propose an adjustment factor to adjust the inverse average which addresses the difference between ˜π0 andπ0. The proposal is to use the adjusted inverse average given by 1 AF mi Pmi j=1 1 λij+(1−AF)

whereAF is the adjustment factor defined as

AF = π˜0(1−π0) π0(1−π˜0)

When such adjustment is used, the performance of the inverse average improves greatly in terms of the sharpness of the bound (See Figure 3.11). The realized FDR while controlling FDR at 5% level is 0.046. The results from the simulations under var- ious conditions show that once adjusted for the differences in the priors, the difference between f0(.) and f0ij(.) does not have much effect.

0 5000 10000 15000 20000 0.0 0.2 0.4 0.6 0.8 1.0 Genes

Sorted FDR and Adjusted Inverse Average

Figure 3.11: Showing the sharpness of the Inverse Average bound after adjustment. The black line shows the sorted true gene lfdr’s and the red dots are the adjusted inverse average of the corresponding gene-SNP level lfdr’s.

However, the use of the inverse average method for eQTL data has a serious diffi- culty. It is difficult to obtain the lfdr’sλij for the gene-SNP level hypotheses that tests

whether the SNP is causal for the gene. The SNPs that are in linkage disequilibrium with the causal SNP will also show high association with the expression of the tran- script and that will lead to under-estimation of the lfdr’s in general. Therefore, even after the adjustment, the inverse average may not be an upper bound for the gene level lfdr.

Even though there is no guarantee in the real data scenario for the inverse average to be an upper bound for the gene level lfdr, it might still be useful in some cases. For instance, consider a hypothetical situation where the SNPs are divided into several blocks with the correlations within block being very high and those across blocks near

0 5000 10000 15000 20000 0.0 0.2 0.4 0.6 0.8 1.0 Genes

Sorted FDR and Adjusted Inverse Average

Figure 3.12: Showing the sharpness of the Inverse Average bound for a blocked data structure. The black line shows the sorted true gene lfdr’s and the red dots are the adjusted inverse average for the hypothesis of causality. The blue dots are the adjusted inverse average for the gene-SNP level significance test.

zero. Simulations show that in such a situation, the adjusted inverse average serves as an approximate upper bound of the gene level lfdr (Figure 3.12).

0 5000 10000 15000 20000 0.0 0.2 0.4 0.6 0.8 1.0 Genes

Sorted FDR and Adjusted Inverse Average

Figure 3.13: Showing the sharpness of the Inverse Average bound for a window type data structure. The black line shows the sorted true gene lfdr’s and the red dots are the adjusted inverse average for the hypothesis of causality. The blue dots are the adjusted inverse average for the gene-SNP level significance test.

Figure 3.13 illustrates that if the data is such that all the SNPs within a window around the causal SNP have significantly small gene-SNP level lfdr, then the adjusted

inverse average is an upper bound when the lfdr are small, but is not an upper bound for larger values. Therefore, in such a scenario, the method may be useful if the target FDR level is small. However, one needs to look at the sorted inverse average values carefully and decide from its shape whether or not to use them as approximate gene level lfdrs. For more details of the simulation schemes, see Appendix C.

Correlation of estimated FDR by Inverse Average and estimated FDR with true parameters

Frequency 0.984 0.986 0.988 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 B

lfdr estimates by Inverse Average

lfdr estimates by REG-FDR 0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20 C

Estimated FDR by Inverse Average

Estimated FDR by REG-FDR

Figure 3.14: Showing the comparison of inverse average method with REG-FDR Figure 3.14 shows the behavior of the inverse averages for a real SNP data while the expressions are simulated using our model. The comparison withREG-FDRshows that the inverse average method might be useful in this case. For the particular example, the realized FDR using inverse average method was 0.0652 implying that it is somewhat anti-conservative, but can still be used by setting the target FDR level slightly lower. However, one needs to know or estimate both the gene level and gene-SNP level priors in order to calculate the adjustment factor.

3.12 Discussion and future work

The REG-FDR method is the true maximum likelihood approach for the assumed model. However, the Z-REG-FDR method is computationally much efficient, and as shown above, produces estimates very similar to the REG-FDRprocedure. Therefore, it enjoys the desirable properties of the maximum likelihood estimator with an improved

computation speed. It was also shown that theZ-REG-FDRmethod remains valid even when the assumptions are not fully true. Its performance under simulations and real data seem satisfactory in terms of controlling the FDR and at the same time achieving higher power as compared to some other methods.

Our model may be useful to analyse other similar data types. For instance, a similar version can be proposed for genome-wide association stuides (GWAS). However, GWAS data seldom have a trueσ as large as what we observed in eQTL data. When the true σ is small, both REG-FDRand Z-REG-FDR seem to be inefficient. In particular, the estimation using both methods perform poorly when σ is smaller than 1. Therefore, even though this model may be applicable to many types of data, it is not advisable to use it unless the true σ is expected to be large.

The method might be slightly anti-conservative in some situations and future research is needed to understand the bias of the estimators so that the procedure can be adjusted to take care of such anti-conservativeness. Also, the current research focuses only on cis-eQTL. Analysis of trans-eQTL is an interesting statistical problem that is beyond the scope of the model in its current form. Further studies are required to modify the model in order to apply it to trans-eQTL problem.

The inverse average method is a simple tool to combine individual level lfdr to obtain the group level lfdr. However, in order to gurantee that the inverse average method will work, one needs to know the individual level lfdr for the hypothesis of causality, which is difficult to model statistically. Regardless, the inverse average has been shown to have the potential to work under different circumstances. However, it is inferior to the Z-REG-FDR in the sense that it is expected to be more anti-conservative.

Future research is required to verify if there are situations other than the eQTL data where a simple adjusted inverse average can be applied. One such hypothetical situation is described below.

we have an n-dimensional y-vector for each of N groups. Corresponding to a group i, we have a matrix X(i) _with _n _{columns and} _m

i rows. Let us call the mi rows

X₁(i), X₂(i), ..., Xm(i)i. We won’t model the correlation structure of the X

(i)

j ’s and assume

that Yi may have a causal relationship with at most one of the X

(i)

j ’s. Suppose each

X_j(i) consists of two parts given by

X_j(i)=w_j(i)+e(_ji).

Whilew(_ji)’s might be considered either fixed or random,e(_ji)’s are random errors and they have a correlation structure that generates the correlation structure of the X_j(i)’s. w_j(i)’s are assumed to be independent of the e(_ji)’s and independent among themselves. The causal relationship ofYi with the causal Xj(i) is given by

Yi =βw

(i)

causal+i.

Under this situation, the sample correlations r(_ji) = Cor(Yi, X

(i)

j ) will have a spike

only at the causal location even though theX_j(i)’s are correlated. Note however that the r(_ji)’s are not independent, the alternativeness is not “transferred” to the other r_j(i)’s.

If the lfdr’s are known for each location, one can combine them using inverse average to obtain the lfdr for each group. Verifying the existence of such a problem in practice and finding methods to estimate the priors (to calculate the adjustment factor) requires more research on this topic.

In document Rudra_unc_0153D_15683.pdf (Page 90-97)