2. BAYESIAN APPROACH: SPLIT AND MERGE STRATEGY AND BIG

2.4 Real Data Examples

The first example concerns a metabolic quantitative trait loci (mQTL) experiment, which links SNP data to metabolomics data [16]. The predictors come from a genome-wide analysis of candidate genes for ALAT enzyme elevation in liver, and the response comes from mass spectroscopy metabolomics data [38]. The spectra are divided into regions, or bins, to reduce the dimension of the spectral data, and a log10 transformation is applied to normalize the signal. A total of 10,000 SNPs are preselected as candidate predictors based on the following criteria: no missing values, not monomorphic, and close to known regulatory regions. The dataset includes 50 subjects. The genotype of each SNP is coded as 0, 1, and 2 for homozygous rare, heterozygous, and homozygous common alleles, respectively. As in [8], one particular metabolite bin that discriminates between the disease statuses of the clinical trial's participants is chosen as the response.
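The first two preselection criteria above (no missing values, not monomorphic) can be illustrated with a minimal sketch. The function name `preselect_snps`, the use of `None` for a missing genotype, and the toy matrix are assumptions made for illustration; the third criterion (proximity to known regulatory regions) needs annotation data and is not covered here.

```python
def preselect_snps(genotypes):
    """Return indices of SNP columns that pass two of the preselection criteria.

    genotypes: list of rows, one per subject; each entry is 0, 1, 2,
    or None for a missing call (the None encoding is an assumption).
    """
    n_snps = len(genotypes[0])
    keep = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        if any(g is None for g in column):
            continue  # criterion 1: drop SNPs with any missing value
        if len(set(column)) < 2:
            continue  # criterion 2: drop monomorphic SNPs (one genotype only)
        keep.append(j)
    return keep


# Toy data: 3 subjects, 4 SNPs (hypothetical values).
toy = [
    [0, 2, 1, None],
    [1, 2, 2, 0],
    [2, 2, 1, 1],
]
print(preselect_snps(toy))  # [0, 2]
```

In the toy matrix, the second SNP is monomorphic and the fourth has a missing call, so only the first and third are kept.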

[Plot for Figure 2.4. Title: Comparison. Horizontal axis: model size (0 to 35). Vertical axis: CV median square error (0 to 50).]

Figure 2.4: Results comparison for the real mQTL data set. The plot compares the SaM result with Lasso, SIS-Lasso, and ISIS-Lasso for the mQTL data: leave-one-out cross-validation median square error of the MAP_1 to MAP_35 models produced by SaM (black dot with dashed line), the models selected by Lasso (triangle with dashed line), the models selected by SIS-Lasso ("+" with solid line), and the models selected by ISIS-Lasso (hollow dot with solid line).

The SNPs were divided into 20 subsets with s = 500, and SaM was then run with exactly the same setting as for the massive data example of Section 2.3.2. In stage I, SaM reduced the number of SNPs from 10,000 to 536, and in stage II, SaM selected two SNPs, rs17041311 and rs17392161, whose marginal inclusion probabilities were 0.98 and 0.93, respectively. These two SNPs also compose the MAP model for this example. We note that, in the dataset, the SNP rs7896824 has the same genotypes as the SNP rs17041311 across all 50 subjects. Similarly, the SNPs rs17390419, rs12328732, rs2164473, rs322664, rs17415876, rs16950829, rs6607364, rs829156, rs829157, and rs2946537 share the same genotypes with the SNP rs17392161.
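As a minimal sketch of the two bookkeeping steps mentioned above, the code below partitions predictor indices into subsets of size s and groups SNPs whose genotype vectors coincide across all subjects. The function names are hypothetical, and the consecutive (non-random) partition is an assumption; the text does not state how the subsets were formed.

```python
from collections import defaultdict


def split_into_subsets(p, s):
    """Partition indices 0..p-1 into consecutive subsets of size at most s.

    Consecutive blocks are an assumption; a random partition would
    serve equally well for this illustration.
    """
    return [list(range(start, min(start + s, p))) for start in range(0, p, s)]


def identical_genotype_groups(genotypes):
    """Group SNP column indices whose genotypes agree for every subject."""
    groups = defaultdict(list)
    for j in range(len(genotypes[0])):
        key = tuple(row[j] for row in genotypes)  # the full genotype vector
        groups[key].append(j)
    return [idx for idx in groups.values() if len(idx) > 1]


subsets = split_into_subsets(10_000, 500)
print(len(subsets), len(subsets[0]))  # 20 500

# Toy data: 3 subjects, 4 SNPs; columns 0 and 2 are identical.
toy = [[0, 1, 0, 2], [1, 1, 1, 0], [2, 0, 2, 0]]
print(identical_genotype_groups(toy))  # [[0, 2]]
```

SNPs in the same group cannot be told apart by any method that uses only this dataset, which is why the text lists them alongside the two selected SNPs.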

For comparison, SIS and ISIS were applied to this example. SIS-SCAD and ISIS-SCAD first reduced the number of SNPs from 10,000 to 25 and then applied SCAD to refine the selection, but both yielded the null model. To assess the performance of the different methods, we used leave-one-out cross-validation. The median cross-validation square error of the SaM model is 1.8, and that of the null model is 9.74.
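The leave-one-out criterion used above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the reported numbers: it fits ordinary least squares with an intercept via the normal equations, assumes a full-rank design, and uses hypothetical toy data. The null model corresponds to an empty predictor set, so its prediction is the training mean.

```python
from statistics import median


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_fit(X, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    k = len(Z[0])
    xtx = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    return solve(xtx, xty)


def loo_median_sq_error(X, y):
    """Median of squared leave-one-out prediction errors."""
    errors = []
    for i in range(len(y)):
        beta = ols_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = beta[0] + sum(b * x for b, x in zip(beta[1:], X[i]))
        errors.append((y[i] - pred) ** 2)
    return median(errors)


# Toy data (hypothetical): response is exactly linear in one genotype.
X_toy = [[0], [1], [2], [0], [1], [2]]
y_toy = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]
X_null = [[] for _ in y_toy]  # null model: intercept only

print(loo_median_sq_error(X_toy, y_toy))
print(loo_median_sq_error(X_null, y_toy))
```

Using the median rather than the mean of the squared errors makes the summary less sensitive to a few poorly predicted subjects, which matters with only 50 observations.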

We also compared the MAP_i models produced by SaM with the models produced by Lasso, SIS-Lasso, and ISIS-Lasso along their regularization paths. Here the MAP_i model refers to the maximum a posteriori model containing i SNPs. Figure 2.4 shows the leave-one-out median square error of these models. Lasso and SIS-Lasso failed to select the SNP rs17041311, which is one of the two important SNPs identified by SaM, and thus yielded much larger median square errors. ISIS-Lasso successfully selected both rs17041311 and rs17392161, and thus yielded median square errors similar to those of SaM.
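To make the marginal inclusion probability and the MAP_i model concrete, the sketch below estimates both from a list of sampled models. It assumes that posterior draws of models are available (for example, from an MCMC run) and that frequencies of visits approximate posterior probabilities; the function names and the toy draws are hypothetical, and this is not a description of how SaM computes these quantities internally.

```python
from collections import Counter


def marginal_inclusion(samples, p):
    """Fraction of sampled models that contain each of the p variables."""
    counts = Counter(j for model in samples for j in set(model))
    return [counts[j] / len(samples) for j in range(p)]


def map_by_size(samples):
    """For each model size i, return the most frequently sampled model (MAP_i)."""
    freq = Counter(frozenset(model) for model in samples)
    best = {}
    for model, count in freq.items():
        size = len(model)
        if size not in best or count > best[size][1]:
            best[size] = (model, count)
    return {size: sorted(model) for size, (model, _) in best.items()}


# Toy posterior draws (hypothetical): each tuple lists the included variables.
samples = [(0, 3), (0, 3), (0,), (0, 3), (3,), (0, 3, 4), (0,)]

print(marginal_inclusion(samples, 5))
print(map_by_size(samples))  # {2: [0, 3], 1: [0], 3: [0, 3, 4]}
```

Indexing the MAP model by size gives a sequence of nested-looking candidates that can be compared with the models along a Lasso regularization path at matching sizes.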

2.4.2 PCR Example

The second example relates to a PCR dataset. [45] conducted an experiment that examines the genetics of two inbred mouse populations (C57BL/6J and BTBR). A total of 60 F2 samples, from 31 female and 29 male mice, were used to monitor the expression levels of 22,575 genes. Some physiological phenotypes, including numbers of phosphoenolpyruvate carboxykinase (PEPCK), glycerol-3-phosphate acyltransferase (GPAT), and stearoyl-CoA desaturase 1 (SCD1), were measured by quantitative real-time PCR. In this example, we study the relationship between PEPCK (as the response) and the gene expression levels. The gene expression data and the phenotype data are published at GEO (http://www.ncbi.nlm.nih.gov/geo; accession number GSE3330). The gene expression data were normalized before statistical analysis.

SaM was first applied to this example. The data were divided into 45 subsets with s = 502. In stage I, SaM selected 1,113 genes from the 22,575 genes. In stage II, SaM selected six genes under the setting r_n = p_n^{0.3}. The six selected genes are 1429089_s_at, 1430779_at, 1432745_at, 1437871_at, 1440699_at, and 1459563_x_at. The first five genes also compose the MAP model. The leave-one-out cross-validation mean square error of the six-gene model is 0.084. In comparison, SIS-SCAD selected 17 genes with a leave-one-out mean square error of 0.204, and ISIS-SCAD selected 9 genes with a leave-one-out mean square error of 0.112.
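The subset count quoted for this example follows from simple arithmetic, sketched below. The helper name `subset_layout` is hypothetical, and placing the leftover genes in one smaller final subset is an assumption; the text only gives the number of subsets and the value of s.

```python
def subset_layout(p, s):
    """Sizes of the subsets when p predictors are cut into blocks of size s.

    Assumes all blocks are full except possibly the last one.
    """
    full, remainder = divmod(p, s)
    return [s] * full + ([remainder] if remainder else [])


sizes = subset_layout(22_575, 502)
print(len(sizes), sizes[-1])  # 45 487
```

Since 45 × 502 = 22,590 exceeds 22,575, at least one subset must hold fewer than 502 genes; under the assumption above, the last subset holds 487.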

Figure 2.5 shows the leave-one-out mean square errors of the MAP_i models selected by SaM and of the models selected by Lasso, SIS-Lasso, and ISIS-Lasso along their regularization paths. For this example, SIS-Lasso and ISIS-Lasso first reduced the number of genes to 30 and then applied Lasso to refine the selection. Figure 2.5 shows that SaM clearly outperforms the penalized likelihood approaches for this example.

[Plot for Figure 2.5. Title: Comparison. Horizontal axis: model size (0 to 35). Vertical axis: CV mean square error (0.0 to 0.4).]

Figure 2.5: Results comparison for the real PCR data set. The plot compares the SaM result with Lasso, SIS-Lasso, and ISIS-Lasso for the PCR data: leave-one-out cross-validation mean square error of the MAP_1 to MAP_35 models produced by SaM (black dot with dashed line), the models selected by Lasso (triangle with dashed line), the models selected by SIS-Lasso ("+" with solid line), and the models selected by ISIS-Lasso (hollow dot with solid line).

In document Variable Selection for Ultra High Dimensional Data (Page 49-54)