Real Data Application - Bayesian Modeling of Complex High-Dimensional Data

2.4 Real Data Application

We apply the proposed variational functional mixed model (VFMM) and the basis-space testing procedure to the two real datasets reviewed in Section 1.2.1 and Section 1.2.2. For both datasets, we conduct parameter estimation and statistical inference in the wavelet basis. Our goal is to identify important regions of biomedical signals and images that reflect differences across groups. The performance of VFMM is compared with that of the benchmark, the Bayesian functional mixed model (FMM).

2.4.1 1D Organ-Cell Line

We applied both VFMM and FMM to the pre-processed dataset using the same basis transformations and design matrices. In particular, we applied a discrete wavelet transform to each spectrum using Daubechies wavelets with eight vanishing moments, the periodic boundary extension mode, and 9 resolution levels. We used the same setting as in Zhu et al. [74]. Specifically, we set the fixed-effect design matrix X using the cell mean model for the factorial design, with an additional column for the laser intensity effect, so that X is a 32 × 5 matrix. Columns one to four indicate the four treatment groups: brain-A375P, brain-PC3MM2, lung-A375P, and lung-PC3MM2, respectively. Column five records whether an observation was taken at low (coded as -1) or high (coded as 1) laser intensity. The random-effect design matrix Z was a 32 × 16 binary matrix with Z_ib = 1 indicating that spectrum i came from the b-th animal. Based on the estimated fixed effects, we detected nonzero regions on three contrast effects: the organ effect C₁(t) = (B₁(t)+B₂(t)−B₃(t)−B₄(t))/2, the cell-line effect C₂(t) = (B₁(t)−B₂(t)+B₃(t)−B₄(t))/2, and the organ-by-cell-line interaction effect C₃(t) = (B₁(t)−B₂(t)−B₃(t)+B₄(t))/2. The region detection results were compared between VFMM and FMM.
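As an illustration of the design just described, the following Python sketch builds X and Z and evaluates the three contrasts. The layout of the 32 spectra is an assumption made only for this example: 16 animals, 4 per treatment group, each measured once at low and once at high laser intensity. The text fixes only the matrix dimensions and the column meanings, and the function names are hypothetical.

```python
# Sketch of the 1-D design matrices and contrasts.
# ASSUMPTION: 4 animals per treatment group, 2 spectra (low/high laser
# intensity) per animal. Only the 32 x 5 and 32 x 16 shapes come from the text.
N_GROUPS = 4
ANIMALS_PER_GROUP = 4  # assumed


def build_design():
    """Return (X, Z): X is 32 x 5 (cell means + intensity), Z is 32 x 16."""
    n_animals = N_GROUPS * ANIMALS_PER_GROUP
    X, Z = [], []
    for animal in range(n_animals):
        group = animal // ANIMALS_PER_GROUP
        for intensity in (-1, 1):          # low = -1, high = 1
            x_row = [0] * (N_GROUPS + 1)
            x_row[group] = 1               # cell-mean indicator
            x_row[N_GROUPS] = intensity    # laser intensity column
            z_row = [0] * n_animals
            z_row[animal] = 1              # Z_ib = 1 if spectrum i is from animal b
            X.append(x_row)
            Z.append(z_row)
    return X, Z


# Contrast weights acting on (B1, B2, B3, B4, B5); the intensity effect B5
# does not enter any of the three contrasts.
CONTRASTS = {
    "organ":       [0.5,  0.5, -0.5, -0.5, 0.0],
    "cell_line":   [0.5, -0.5,  0.5, -0.5, 0.0],
    "interaction": [0.5, -0.5, -0.5,  0.5, 0.0],
}


def contrast_curve(weights, B):
    """Apply contrast weights to B, a list of 5 curves on a common grid."""
    n_grid = len(B[0])
    return [sum(w * B[a][t] for a, w in enumerate(weights)) for t in range(n_grid)]
```

Each contrast vector sums to zero, so a shift common to all four group means cancels in C₁, C₂, and C₃.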

To detect nonzero regions on the contrast effects, for Bayesian FMM we applied Bayesian FDR control with δ = log₂(1.5) on the measurement grid in the data domain, following Meyer et al. [39]; for VFMM, we applied the basis-space testing procedure described in Section 2.2.2 with ϵ = 0.07 and δ = log₂(1.5). The significance level for both approaches was set to α = 0.05. In Figure 2.11, we compare the significant regions flagged by FMM and VFMM on the cell-line effect. The flagged regions are marked in color on the mean estimate of the cell-line effect obtained by VFMM, with different colors denoting whether a location was flagged by VFMM only, FMM only, or both. From Figure 2.11, we see that most flagged regions were shared by VFMM and FMM, while VFMM tended to flag many more regions than FMM. No contiguous region flagged by FMM on the cell-line effect was missed by VFMM, whereas 28 regions spanning more than 10 contiguous grid points were flagged by VFMM only. Some examples of contiguous regions on the data grid are [172, 184], [1470, 1528], [4549, 4677], and [5837, 5944]; all regions can be seen in Figure 2.11. For the organ main effect, all regions flagged by FMM were also flagged by VFMM, and 38 regions were flagged by VFMM only. For the organ-by-cell-line interaction, 7 regions were flagged by both FMM and VFMM, and every region flagged by FMM was also detected by VFMM.

Figure 2.11: Cancer proteomics data analysis: significant nonzero regions flagged by VFMM and FMM on the cell line effect. The regions were flagged on the mean estimate obtained by VFMM. Red, blue, and green colors denote locations flagged by VFMM only, FMM only, or both VFMM and FMM, respectively.

In addition to the plots, we also calculated the Bayesian expected sensitivity, false negative rate (FNR), and specificity following the approach described in Section 2.2.2. Note that for VFMM, since the test was performed in basis space, these statistics were calculated in the wavelet domain, i.e., they characterize the proportion of wavelet components that were correctly or incorrectly detected, which differs from the definitions used in the simulation study. For comparison purposes, we also performed the Bayesian FDR test in basis space for FMM. The results are listed in Table 2.2. From Table 2.2, we see that for the 1-D proteomic data VFMM achieves higher sensitivity and a slightly lower FNR than FMM, with comparable specificity. To estimate around 8000 parameters, FMM takes about 6 hours while VFMM takes only about 1 hour. Visualizations of the organ effect and the organ-by-cell-line interaction can be found in A.5.
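The region counts reported above (for example, regions of more than 10 contiguous grid points flagged by VFMM only) can be obtained from two boolean flag vectors on the data grid. The sketch below shows one way to do this; the function names and the 0-based indexing are choices made for this example, not taken from the analysis code.

```python
def contiguous_runs(flags):
    """Return inclusive (start, end) index pairs of maximal runs of True."""
    runs, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i                      # a new run begins
        elif not flagged and start is not None:
            runs.append((start, i - 1))    # the current run just ended
            start = None
    if start is not None:                  # run reaching the end of the grid
        runs.append((start, len(flags) - 1))
    return runs


def vfmm_only_regions(vfmm_flags, fmm_flags, min_points=10):
    """Runs flagged by VFMM but not FMM that span more than min_points points."""
    only = [v and not f for v, f in zip(vfmm_flags, fmm_flags)]
    return [(s, e) for s, e in contiguous_runs(only) if e - s + 1 > min_points]
```

Swapping the two arguments gives the regions flagged by FMM only, which is the set reported to be empty for all three contrast effects.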

2.4.2 3D ADNI

In this analysis, we consider the 3-D brain imaging data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) introduced in Section 1.2.2.

By analyzing the TBM data, we aim to estimate the contrast effects between groups (i.e., the differences between group means) and detect local brain regions with systematic volumetric expansion or compression across patient groups with different diagnostic status or genders.

We adopt the cell mean design by setting X to be an 816 × 6 binary matrix, in which the single 1 in each row indicates the subject’s diagnosis stage and gender. In particular, columns 1-6 of X correspond to the subgroups Normal-Male, Normal-Female, MCI-Male, MCI-Female, AD-Male, and AD-Female, respectively. Under this design, the contrast effect between AD and Normal is obtained by pre-multiplying the fixed effects B (or B^∗) by [−1/2, −1/2, 0, 0, 1/2, 1/2], and the gender effect is obtained by pre-multiplying by [1/3, −1/3, 1/3, −1/3, 1/3, −1/3].
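To make the contrast vectors concrete, the following Python sketch applies them to a toy 6-row matrix of group means. The AD-Normal and Male-Female vectors are the ones given in the text; the MCI-Normal and AD-MCI vectors are written by analogy and are an assumption of this example, as are the function and variable names.

```python
# Column order of X, as stated in the text.
SUBGROUPS = ["Normal-Male", "Normal-Female", "MCI-Male",
             "MCI-Female", "AD-Male", "AD-Female"]

CONTRASTS = {
    "AD-Normal":   [-0.5, -0.5, 0.0, 0.0, 0.5, 0.5],          # from the text
    "Male-Female": [1/3, -1/3, 1/3, -1/3, 1/3, -1/3],         # from the text
    "MCI-Normal":  [-0.5, -0.5, 0.5, 0.5, 0.0, 0.0],          # assumed, by analogy
    "AD-MCI":      [0.0, 0.0, -0.5, -0.5, 0.5, 0.5],          # assumed, by analogy
}


def design_row(subgroup):
    """One row of the cell-mean design matrix X: a single 1 at the subgroup."""
    return [1 if name == subgroup else 0 for name in SUBGROUPS]


def apply_contrast(weights, B):
    """Pre-multiply B (6 rows of group means, one column per location)."""
    n_loc = len(B[0])
    return [sum(w * row[j] for w, row in zip(weights, B)) for j in range(n_loc)]
```

With this coding, the AD-Normal contrast is the average of the two AD group means minus the average of the two Normal group means, and the Male-Female contrast is the average of the three male group means minus the average of the three female group means.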

Since there is only one image per subject, no random effect was modeled. We applied VFMM and FMM using the same basis transformations and design matrix. In particular, we applied a 3-D discrete wavelet transform to each image using Daubechies wavelets with four vanishing moments, the periodic boundary extension mode, and four resolution levels. To further reduce the dimension, we applied an efficient wavelet compression algorithm in the wavelet domain, which reduces the dimension from 10,657,241 to 40,112 while retaining 99% of the total energy.
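The compression algorithm itself is not described in this section. As a rough illustration of what "retaining 99% of the total energy" means, the sketch below keeps the smallest set of coefficients of a single vector whose squared magnitudes account for a given fraction of the total. This simple thresholding rule and the name `keep_by_energy` are assumptions for illustration only and should not be read as the algorithm actually used.

```python
def keep_by_energy(coefs, fraction=0.99):
    """Indices of the largest-magnitude coefficients that together hold
    at least `fraction` of the total energy (sum of squares)."""
    total = sum(c * c for c in coefs)
    # Visit coefficients from largest to smallest energy.
    order = sorted(range(len(coefs)), key=lambda j: coefs[j] ** 2, reverse=True)
    kept, accumulated = [], 0.0
    for j in order:
        if accumulated >= fraction * total:
            break
        kept.append(j)
        accumulated += coefs[j] ** 2
    return sorted(kept)
```

In a multi-subject analysis a single index set has to be shared by all images so that every subject is represented by the same wavelet components; how that common set is chosen is not specified here.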

                 Region Detection
Data   Model   SEN     FNR     SPEC    Time (hrs)
1-D    FMM     0.645   0.029   0.997   6.2
1-D    VFMM    0.759   0.028   0.995   1.2
3-D    FMM     0.993   0.014   0.896   3.42
3-D    VFMM    0.995   0.010   0.895   0.67

Table 2.2: Real data application results: Bayesian expected sensitivity (SEN), false negative rate (FNR), and specificity (SPEC) for region detection, calculated in the wavelet domain, together with computation time in hours.

Based on the group means obtained from VFMM and FMM in the wavelet domain, we calculated pairwise contrast effects between the AD (N=192), MCI (N=396), and normal (N=228) groups, as well as the contrast effect between the male and female groups. To identify significantly nonzero regions on these contrast effects, we performed basis-space testing in the compressed wavelet domain following the procedure proposed in Section 2.2.2. For both VFMM and FMM, we set ϵ = 0.02 and controlled the overall FDR at significance level α = 0.001.
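The basis-space test of Section 2.2.2 is not reproduced here. The sketch below illustrates only the generic idea of Bayesian FDR control and of Bayesian expected sensitivity, FNR, and specificity, given a vector of posterior probabilities that each component is truly nonzero. The thresholding rule and the rate definitions follow a commonly used form and are assumptions of this example; they may differ in detail from the definitions in Section 2.2.2.

```python
def bayes_fdr_flags(post_prob, alpha):
    """Flag the largest set of components whose average posterior probability
    of being null, mean(1 - p) over the flagged set, is at most alpha."""
    order = sorted(range(len(post_prob)), key=lambda j: post_prob[j], reverse=True)
    running, n_flag = 0.0, 0
    for k, j in enumerate(order, start=1):
        running += 1.0 - post_prob[j]
        if running / k <= alpha:           # running mean is nondecreasing in k
            n_flag = k
    flags = [False] * len(post_prob)
    for j in order[:n_flag]:
        flags[j] = True
    return flags


def expected_rates(post_prob, flags):
    """Bayesian expected SEN, FNR, and SPEC under the assumed definitions."""
    tp = sum(p for p, d in zip(post_prob, flags) if d)          # expected true positives
    fn = sum(p for p, d in zip(post_prob, flags) if not d)      # expected false negatives
    fp = sum(1.0 - p for p, d in zip(post_prob, flags) if d)    # expected false positives
    tn = sum(1.0 - p for p, d in zip(post_prob, flags) if not d)
    n_unflagged = sum(1 for d in flags if not d)

    def ratio(a, b):
        return a / b if b > 0 else float("nan")

    return {"SEN": ratio(tp, tp + fn),
            "FNR": ratio(fn, n_unflagged),
            "SPEC": ratio(tn, tn + fp)}
```

Under these definitions the FNR is an average over unflagged components rather than 1 − SEN, which is why a moderate sensitivity can coexist with a small FNR when most components are null.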

Significant local regions were flagged in the data domain using the threshold δ = 20. In Figure 2.12, we show the regions flagged by VFMM for each of the four contrast effects using sliced 2D plots. The flagged regions (colored red or blue) are plotted on top of the MDT background image (the grayscale image). For each contrast effect, we show the flagged regions in three views: the axial, sagittal, and coronal views, sliced through the middle of the 3D brain along the three directions. Similar 2D plots for FMM are available in Figure 2.13.

Flagged regions on the contrast effects reveal local volumetric tissue change in one group relative to another. From Figure 2.12, we observe that, on the AD-Normal contrast effect, there is a profound positive contrast effect in the lateral ventricle region, which indicates cerebrospinal fluid (CSF) inflation in AD patients. In addition, positive contrast effects are seen bilaterally in the circular sulcus of the insula, suggesting brain volume expansion in these regions. Furthermore, we observe negative contrast effects in the temporal and parietal regions and the hippocampus, which suggests brain atrophy in these regions in AD patients. The contrast effects for AD-MCI and MCI-Normal show similar patterns but with lower values and smaller regions. This indicates gradual volumetric tissue changes from Normal to MCI and from MCI to AD.

On the Male-Female contrast effect, we observe positive contrast effects in the lateral ventricle region, which indicates more CSF inflation for males relative to females. We also observe positive contrast effects in the top portion of the frontal region, which suggests higher brain volume (i.e., less tissue loss) for males in this region. Finally, we observe negative contrast effects in the parietal and temporal regions, indicating lower brain volume (i.e., more brain atrophy) for males in these regions. The results of FMM are similar to those of VFMM, and the plots are shown in Figure 2.13. In addition to the plots, the bottom section of Table 2.2 lists the Bayesian expected SEN, FNR, and SPEC, calculated as in the 1-D case described in Section 2.4.1. These results demonstrate that VFMM achieves Bayesian expected statistics very close to those of FMM. Regarding computation, we split the wavelet components into 10 blocks and performed the posterior calculation in parallel for both VFMM and FMM. FMM took 3.42 hours to finish 4000 MCMC iterations with 1000 burn-in samples, whereas VFMM took only 39.9 minutes to converge.
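The block-wise parallel computation can be organized along the following lines. This is only a schematic: the nearly equal contiguous split, the placeholder `fit_block`, and the use of a thread pool are assumptions made to keep the example self-contained. The actual block sizes and per-block computation are not specified in the text, and CPU-bound fitting would normally run in separate processes or on separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_blocks(n_items, n_blocks):
    """Split range(n_items) into n_blocks contiguous, nearly equal ranges."""
    base, extra = divmod(n_items, n_blocks)
    blocks, start = [], 0
    for b in range(n_blocks):
        size = base + (1 if b < extra else 0)   # spread the remainder
        blocks.append(range(start, start + size))
        start += size
    return blocks


def fit_block(block):
    """Hypothetical placeholder for the per-block posterior computation.
    Here it only reports how many wavelet components the block holds."""
    return len(block)


def run_parallel(n_items=40112, n_blocks=10, workers=4):
    """Process all blocks concurrently and collect the per-block results."""
    blocks = split_blocks(n_items, n_blocks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_block, blocks))
```

Splitting by wavelet component presupposes that the blocks can be updated independently of one another, which is what allows both the MCMC and the variational computations to be distributed.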

Figure 2.12: TBM brain imaging data analysis of VFMM: plots of regions detected for four contrast effects: (A) AD-Normal; (B) AD-MCI; (C) MCI-Normal; (D) Male-Female. Each row illustrates three 2D images according to three views, the axial (sliced at z = 110), sagittal (sliced at x = 110), and coronal (sliced at y = 110) views, from left to right. (Color scale: -200 to 400.)


Figure 2.13: TBM brain imaging data analysis of FMM: plots of regions detected for four contrast effects: (A) AD-Normal; (B) AD-MCI; (C) MCI-Normal; (D) Male-Female. Each row illustrates three 2D images according to three views, the axial (sliced at z = 110), sagittal (sliced at x = 110), and coronal (sliced at y = 110) views, from left to right. (Color scale: -100 to 400.)

Figure 2.14: Running time in each block. (Plot title: Computational Scalability; x-axis: Number of Parameters; y-axis: Run-Time (min); legend: VFMM, FMM.)

Chapter 3 Model Data Heterogeneity via Dirichlet Diffusion Tree

3.1 Introduction

With remarkable advancements in technology, various types of complex, high-dimensional data can now be collected. Examples include medical images, genomics data, and engineering signals. The availability of data and the improvement of computational power have dramatically driven the development of statistics. New methodologies have been proposed to process, visualize, and analyze high-dimensional data. Despite this progress, many critical questions in biomedical research remain unanswered due to the lack of effective statistical tools. One challenge is the difficulty of accounting for complex data heterogeneity structures caused by sub-populations or latent factors. Taking the brain tumor data as an example, different heterogeneity patterns of tumor images may be associated with demographic factors, the stage of disease progression, genetic characteristics, or other variables. It is highly desirable to develop a flexible statistical framework to model latent heterogeneous data structures and to study the association between data heterogeneity and variables of interest.

Modeling data heterogeneity is difficult, especially when the data are complex and high dimensional. Currently, the major approaches to this problem are based on ad-hoc methods, for example, using summary statistics such as the skewness or kurtosis of the estimated probability density [29]. Although such approaches are straightforward and can be carried out with available statistical software, they are often limited because they are based only on the distribution of the data and ignore potentially more sophisticated data structures, such as latent hierarchical structures.

In high-dimensional settings where quantifying the relationships between variables is difficult, nonparametric tree procedures can be adopted to characterize, summarize, and visualize latent data structures. The hierarchical nature of trees allows the relationships between variables to be represented in a flexible framework, leading to meaningful interpretations in various scientific applications. Examples include phylogenetic trees [15] for biological evolution, hierarchical clustering [25, 59], and decision trees. In Bayesian statistics, tree-based methods have also received a great deal of attention. MCMC has been adopted to generate random tree samples in posterior inference, and these methods have been shown to be effective in accommodating complex, hierarchical data structures. For example, random-walk MCMC algorithms have been used extensively to model complex biological evolution processes in Bayesian phylogenetic inference [26, 37, 66]. In the Bayesian regression tree framework, Chipman et al. [9] developed the Bayesian classification and regression tree (BCART), which laid the foundation for later developments. Gramacy and Lee [21] introduced the treed Gaussian process to model nonstationary data, with application to response surfaces. Such tree models rely on partitioning the feature space hierarchically to achieve regression or classification.

Beyond regression models, Aldous [1] developed parametric probability models for trees to capture topological features and branch length information. Such models establish the basis for modeling heterogeneous data with nonparametric tree priors. Neal [51] proposed the Dirichlet Diffusion Tree (DDT) prior, a top-down stochastic process that generates a rooted binary tree, and showed the effectiveness of the DDT prior in density estimation. Later, Knowles and Ghahramani [31] extended the Dirichlet Diffusion Tree to allow multiple children for each node, which provides more flexibility. Compared to the DDT, dealing with trees with an arbitrary number of children requires significant changes in the probability model and the computation. In recent years, nonparametric hierarchical tree models have shown effectiveness in various unsupervised learning tasks [19, 61]. Despite the progress in using trees to model complex data structures, some critical questions remain unanswered: for example, how to use latent trees to characterize data heterogeneity for a group of samples, and how to associate latent trees with covariates in a regression setup.

Motivated by the brain tumor data in Section 1.2.3, in this chapter we consider modeling data heterogeneity using latent trees. We treat each observation as a point cloud (a set of data points whose indices are exchangeable, with the number of data points varying across observations). We adopt Dirichlet Diffusion Trees to model the latent hierarchical data structure underlying the observations, and propose a regression framework by introducing covariates into the hyper-parameters of the latent trees. To perform posterior inference, we propose a Markov chain Monte Carlo algorithm that alternately updates the latent tree structures and the regression coefficients. Compared with existing approaches to modeling data heterogeneity, our proposed methodology offers several distinctive advantages: (1) Unlike the ad-hoc approaches that depend on density estimation, our latent tree model offers a more flexible framework that can capture hidden hierarchical structures in observed data. (2) Posterior samples of the latent trees can be used to summarize the heterogeneity structure of each observation. (3) By introducing covariates, our proposed model can be used to discover associations between data heterogeneity and other variables of interest, or to test for differences in data heterogeneity across groups of observations. (4) While sampling latent trees can be computationally expensive, our proposed MCMC algorithm can be performed partially in parallel on multicore computers, which leads to improved computational scalability.

We will demonstrate the performance of the proposed method through a simulation study and a real-data application using brain Glioblastoma Multiforme (GBM) images. In the GBM data analysis, we will focus on characterizing the heterogeneity in pixel intensities of the brain tumor images. We will also investigate the differences in heterogeneity across two groups of patients: the short-survival and long-survival patients.

The rest of this chapter is organized as follows. Section 3.2 introduces our proposed Bayesian latent tree models. Section 3.3 presents the results of a simulation study. In Section 3.4, we apply our proposed model to the GBM image data described in Section 1.2.3.
