Shuning Huo
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in
Statistics
Hongxiao Zhu, Chair Xinwei Deng Robert B. Gramacy
Inyoung Kim
October 27th, 2020 Blacksburg, Virginia
Keywords: Variational Inference, Bayesian Variable Selection, Functional Mixed Model, Parallel Computing, Bayesian Hierarchical Clustering, Dirichlet Diffusion Tree
Bayesian Modeling of Complex High-Dimensional Data
Shuning Huo
(ABSTRACT)
With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional complex data in different forms, such as medical images and genomics measurements. However, acquisition of more data does not automatically lead to better knowledge discovery. One needs efficient and reliable analytical tools to extract useful information from complex datasets. The main objective of this dissertation is to develop innovative Bayesian methodologies to enable effective and efficient knowledge discovery from complex high-dimensional data. It contains two parts: the development of computationally efficient functional mixed models and the modeling of data heterogeneity via Dirichlet Diffusion Trees. The first part focuses on tackling the computational bottleneck in Bayesian functional mixed models. We propose a computational framework called the variational functional mixed model (VFMM). This new method facilitates efficient data compression and high-performance computing in basis space. We also propose a new multiple testing procedure in basis space, which can be used to detect significant local regions. The effectiveness of the proposed model is demonstrated through two datasets, a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer’s disease study. The second part is about modeling data heterogeneity by using Dirichlet Diffusion Trees. We propose a Bayesian latent tree model that incorporates covariates of subjects to characterize the heterogeneity and uncover the latent tree structure underlying data. This innovative model may reveal the hierarchical evolution process through branch structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the model through a simulation study and a real brain tumor dataset.
Shuning Huo
(GENERAL AUDIENCE ABSTRACT)
With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional data in different forms, such as engineering signals, medical images, and genomics measurements. However, acquisition of such data does not automatically lead to efficient knowledge discovery. The main objective of this dissertation is to develop novel Bayesian methods to extract useful knowledge from complex high-dimensional data. It has two parts: the development of an ultra-fast functional mixed model and the modeling of data heterogeneity via Dirichlet Diffusion Trees. The first part focuses on developing approximate Bayesian methods in functional mixed models to estimate parameters and detect significant regions. Two datasets demonstrate the effectiveness of the proposed method: a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer’s disease study. The second part focuses on modeling data heterogeneity via Dirichlet Diffusion Trees. The method helps uncover the underlying hierarchical tree structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the method through brain tumor imaging data.
Dedication
To my dearest family.
I would like to express my deepest gratitude to my advisor, Dr. Hongxiao Zhu, for her persistent mentoring, guidance, and encouragement during my Ph.D. journey. She is always passionate about research and devoted to her students. She has given me tremendous help in pursuing independent study and in overcoming difficulties in research as well as in my life. I could never have completed this work without her consistent support.
I would like to extend my sincere gratitude to my committee members, Dr. Inyoung Kim, Dr. Xinwei Deng, and Dr. Robert Gramacy, for their constructive suggestions, inspiring questions, and valuable advice. I would like to thank Julia Armstrong from the writing center for her kind help in proof-reading my dissertation.
I would like to express my gratitude to my collaborators, who have generously shared their valuable data and advice. Thanks to Dr. Jeffrey S. Morris from the biostatistics department at the University of Pennsylvania for sharing cancer proteomics data. Thanks to Dr. Karthik Bharath from the School of Mathematical Sciences at the University of Nottingham and Dr. Veerabhadran Baladandayuthapani from the Department of Computational Medicine and Bioinformatics at the University of Michigan for sharing Glioblastoma Multiforme tumor data and providing precious advice. This dissertation could never have been done without them. My journey at Virginia Tech would never have been so unforgettable without the company of all my friends. Thanks to all my best friends, especially Shuting Sun, for the joy and sorrow we have been through.
Last but not least, I would like to give my sincere appreciation to my family, especially my mother, Meiying Pan, my father, Jianzhong Huo, and my grandparents, Xingdi Zhu and Yulan Zhong, for their endless, unconditional love and support.
Contents
List of Figures
List of Tables
1 Introduction and Background
1.1 Overview
1.2 Motivation Examples
1.2.1 Detecting Biomarkers in Cancer Proteomics Data
1.2.2 Modeling Tensor-Based Morphometry Images of Human Brain
1.2.3 Modeling Heterogeneity in Brain Tumor Data
1.3 Background
1.3.1 Overview of Functional Data and Bayesian Functional Mixed Models
1.3.2 An Overview of Variational Inference
1.3.3 Dirichlet Diffusion Trees
1.4 Dissertation Structure
2 Ultra-Fast Approximate Inference Using Variational Functional Mixed Models
2.1 Introduction
2.2.1 Variational Functional Mixed Model Framework
2.2.2 Region Detection via Basis Space Testing
2.3 Simulation Study
2.3.1 Simulation Setup
2.3.2 Evaluation Criteria
2.3.3 Results
2.4 Real Data Application
2.4.1 1D Organ-Cell Line
2.4.2 3D ADNI
3 Model Data Heterogeneity via Dirichlet Diffusion Tree
3.1 Introduction
3.2 Model
3.2.1 Model Setup
3.2.2 Posterior Sampling with Markov Chain Monte Carlo
3.3 Simulation Study
3.3.1 Simulation Setting
3.3.2 Simulation Results
3.4 Real Data Application
3.5 Discussion
4 Conclusion and Discussion
Bibliography
Appendices
Appendix A
A.1 Derivation for (2.4)
A.2 Derivation for (2.6), (2.9) and (2.10)
A.3 Derivation for (2.11), (2.12)
A.4 ELBO Under the Blockwise Design
A.5 Application Results in Chapter 2.4
List of Figures
1.1 Plot of one mass spectrometry curve under the original scale (a) and the log2 scale (b).
1.2 Example images of 3D TBM at the (z = 114) slice. The image on the left panel (a) is for a patient with Alzheimer’s disease and the one on the right panel (b) corresponds to the normal patient.
1.3 T2-weighted FLAIR MRIs with segmented tumor (outlined in black) of two patients diagnosed with GBM. The left image (a) shows tumor for a long-survival patient with GBM. The right image corresponds to the brain tumor for a short-survival patient.
1.4 Sample (N = 4) generated from the Dirichlet Diffusion Tree.
2.1 A diagram for determining the Bayesian FDR threshold ϕα.
2.2 The 1-D simulation case: estimation and region detection results for the simulated cell line effect C(t) = (B1(t) − B2(t) + B3(t) − B4(t))/2.
2.3 The 1-D simulation case: estimation and region detection results for the simulated organ effect C(t) = (B1(t) + B2(t) − B3(t) − B4(t))/2.
2.4 The 1-D simulation case: estimation and region detection results for the simulated organ-cell-line interaction C(t) = (B1(t) − B2(t) − B3(t) + B4(t))/2.
2.5 The 3-D simulation case: region detection results for the contrast effects (B1(t) − B2(t)), along with the truth. Only one 2-D slice of the 3-D image is plotted. White areas are regions that are not flagged. Colors represent estimated values. Left, middle and right figures correspond to results of VFMM, FMM and the truth respectively.
2.6 The 3-D simulation case: region detection results for the contrast effects (B1(t) − B3(t)), along with the truth. Only one 2-D slice of the 3-D image is plotted. White areas are regions that are not flagged. Colors represent estimated values. Left, middle and right figures correspond to results of VFMM, FMM and the truth respectively.
2.7 The 3-D simulation case: region detection results for the contrast effects (B1(t) − B4(t)), along with the truth. Only one 2-D slice of the 3-D image is plotted. White areas are regions that are not flagged. Colors represent estimated values. Left, middle and right figures correspond to results of VFMM, FMM and the truth respectively.
2.8 The 3-D simulation case: region detection results for the contrast effects (B2(t) − B3(t)), along with the truth. Only one 2-D slice of the 3-D image is plotted. White areas are regions that are not flagged. Colors represent estimated values. Left, middle and right figures correspond to results of VFMM, FMM and the truth respectively.
2.9 The 3-D simulation case: region detection results for the contrast effects (B2(t) − B4(t)), along with the truth. Only one 2-D slice of the 3-D image is plotted. White areas are regions that are not flagged. Colors represent estimated values. Left, middle and right figures correspond to results of VFMM, FMM and the truth respectively.
2.10 The 3-D simulation case: region detection results for the contrast effects (B3(t) − B4(t)), along with the truth. Only one 2-D slice of the 3-D image is plotted. White areas are regions that are not flagged. Colors represent estimated values. Left, middle and right figures correspond to results of VFMM, FMM and the truth respectively.
2.11 Cancer proteomics data analysis: significant nonzero regions flagged by VFMM and FMM on the cell line effect. The regions were flagged on the mean estimate obtained by VFMM. Red, blue, and green colors denote locations flagged by VFMM only, FMM only, or both VFMM and FMM, respectively.
2.12 TBM brain imaging data analysis of VFMM: plots of regions detected for four contrast effects: (A) AD-Normal; (B) AD-MCI; (C) MCI-Normal; (D) Male-Female. Each row illustrates three 2D images according to three views: the axial (sliced at z = 110), sagittal (sliced at x = 110), and coronal (sliced at y = 110) views, from left to right.
2.13 TBM brain imaging data analysis of FMM: plots of regions detected for four contrast effects: (A) AD-Normal; (B) AD-MCI; (C) MCI-Normal; (D) Male-Female. Each row illustrates three 2D images according to three views: the axial (sliced at z = 110), sagittal (sliced at x = 110), and coronal (sliced at y = 110) views, from left to right.
2.14 Running time in each block.
3.1 The procedure for proposing a new tree in the Metropolis-Hastings sampler adopted from Neal [50]. Vertical axes denote divergence time and the horizontal axis denotes location of nodes. Black nodes are leaves, which correspond to random observations yi in our model.
3.2 Plots of two simulated point clouds, one from each group. The left panel shows a point cloud in the heterogeneous group and the right panel shows one from the homogeneous group.
3.3 Histograms of divergence time in posterior tree structures in the homogeneous group and the heterogeneous group.
3.4 One posterior sample of a latent tree in the homogeneous group.
3.5 One posterior sample of a latent tree in the heterogeneous group.
3.6 Histograms of divergence time in posterior tree structures.
3.7 One posterior latent tree sample of one randomly selected long-survival patient.
3.8 One posterior latent tree sample of one randomly selected short-survival patient.
List of Tables
2.1 Simulation results of FMM and VFMM for both 1-D and 3-D cases.
2.2 Real data application results: Bayesian expected sensitivity, false negative rate and specificity for region detection, calculated in wavelet domain.
3.1 Summary of posterior samples in simulation: {t1} denote divergence time in the homogeneous group and {t2} denote that in the heterogeneous group.
3.2 Summary of posterior samples in real data application. Here, {t1} denote the divergence time of latent trees for observations in the long survival group, and {t2} denote that in the short survival group.
Chapter 1
Introduction and Background
1.1 Overview
Rapid advancements in technology have led to vast amounts of high-dimensional data in various forms. Typical examples include spectral curves, medical images, and many other digital measurements. The availability of such data constitutes the foundation for gaining deeper insights into the underlying mechanisms of many complex phenomena. However, the acquisition of more data does not automatically lead to efficient and accurate knowledge discovery. High-dimensional data usually display various complex structures, which include but are not limited to hierarchical structures, heterogeneity patterns, local clusters, and sharp changes. These special features need to be investigated in detail and taken into account in data analysis. With the availability of high-performance computational resources, Bayesian methods have become a cornerstone of modern statistical development for dealing with complex high-dimensional data. Bayesian methods are attractive owing to their flexibility to build in prior beliefs, their ability to incorporate complex data structures using hierarchical priors, and their computational convenience with the help of Markov chain Monte Carlo (MCMC) sampling. Posterior inference of unknown parameters is usually straightforward once posterior samples are obtained. Despite these advantages, Bayesian methods also encounter new challenges in dealing with high-dimensional data. The first challenge is making Bayesian methods scalable to the high volume and high dimensionality of modern datasets. Despite recent developments, existing Bayesian methods often fail to scale to high dimensionality, and their running times can be unacceptable on large-scale datasets. For example, commonly adopted posterior estimation procedures rely heavily on MCMC sampling for parameter estimation and uncertainty quantification. Although MCMC algorithms offer statistical guarantees, they are notoriously expensive due to the need for repeated sampling and the requirement of storing posterior samples.
Another challenge lies in the lack of suitable approaches to characterize complex heterogeneous data structures in high-dimensional settings. Existing approaches are often based on ad hoc statistics, such as histograms and the skewness or kurtosis of the data points. Although such approaches are simple and supported by easily accessible software, they are insufficient to describe the complex heterogeneity structures underlying data. Detecting heterogeneity in complex systems is often extremely hard due to the difficulty of searching a large parameter space and the non-Euclidean geometry of the latent structure.
In this dissertation, I mainly focus on developing flexible Bayesian methods for dealing with different types of complex high-dimensional data. Two specific problems will be considered:
• Approximate Inference for Functional Mixed Models. Bayesian functional regressions usually result in running times that render the procedure unusable on large-scale data. We propose an approximate inference approach under the functional mixed model framework. This approach approximates posterior distributions after transforming the functional mixed models to basis space. Dimension reduction can be achieved with a lossless or near-lossless data compression in basis space. We also propose a basis-space multiple testing procedure to detect statistically significant local regions. Estimation and region detection results can be easily transformed and visualized in the original data domain.
• Model Data Heterogeneity via Dirichlet Diffusion Trees. Motivated by brain tumor data, we consider data that consist of observations in the form of point clouds. Each observation is a point cloud with a different number of points. Furthermore, we expect a latent heterogeneity structure underlying each observation. The goal is to detect differences in heterogeneity structures across groups of observations. We adopt Dirichlet Diffusion Trees to model the latent heterogeneity structure and incorporate covariates in a regression setup. The latent tree structures can describe the heterogeneity patterns in data, and regression coefficients can be used to detect differences across groups. When applied to the brain tumor data, the proposed approach may help uncover the underlying tree structures that reflect the evolution of tumor cells and test the association between the latent trees and covariates.
1.2 Motivation Examples
This section presents three motivating examples of complex high-dimensional data in the fields of biomedical sciences and neuroscience. The proposed methods to analyze these datasets will be investigated in Chapter 2 and Chapter 3.
1.2.1 Detecting Biomarkers in Cancer Proteomics Data
The first motivating example we consider is inference for high-dimensional proteomic data from a cancer study. The dataset consists of mass spectrometry curves collected during the study. The goal was to find peaks on these curves that may serve as biomarkers for cancer diagnosis, assessment, or treatment.
In this study, two cell lines were considered: a human melanoma cancer cell line with low metastatic potential and a human prostate cancer cell line with high metastatic potential. A total of 16 nude mice were the subjects in the experiment. Specifically, each mouse’s brain or lung was implanted with a tumor from one of the two cell lines. Data were collected by drawing a blood serum sample from each mouse and running it through a MALDI-TOF mass spectrometer. Each sample collected is a proteomic spectrum y(t), a function supported on a 1-D domain.
In Figure 1.1, we illustrate a mass spectrometry curve under the original scale (a) and the log2 scale (b). The spectral curve consists of many peaks. Location t denotes a molecular mass of t Daltons, and a peak at location t corresponds to a protein in the sample. Therefore, the curve y(t) provides an estimate of protein abundance.
[Figure 1.1 panels: (a) Intensity versus m/z (kDaltons); (b) log2 Intensity versus m/z (kDaltons).]
Figure 1.1: Plot of one mass spectrometry curve under the original scale (a) and the log2 scale (b).
Each mouse was measured under two setups, one with a low laser intensity and one with a high laser intensity. This results in 32 spectra in total. We followed the setup proposed in Morris et al. [47]. Specifically, we only considered the spectrum between t = 2,000 and t = 14,000 Daltons. This gives T = 7,985 points per curve. Details of the preprocessing steps, including background correction, normalization of the mass spectra, and the log2 transformation of the intensities, are described in Morris et al. [47]. The goal is to find peaks that differ across organ implant sites or implanted cell line types so that we can identify biomarkers that are differentially expressed under different experimental conditions.
1.2.2 Modeling Tensor-Based Morphometry Images of Human Brain
Another motivating example is a 3-D brain imaging dataset. In this analysis, we consider 3-D brain imaging data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). ADNI is a multisite study that aims to develop imaging, clinical, genetic, and biochemical biomarkers for the early detection and tracking of Alzheimer’s disease (AD). The website for the ADNI study is at http://adni.loni.usc.edu/.
Tensor-based morphometry (TBM) is an image analysis technique that measures brain structural differences relative to a common anatomical template, as summarized by Frackowiak et al. [16]. To generate TBM images, a minimal deformation template (MDT) was first created based on Magnetic Resonance (MR) scans of 40 randomly selected normal subjects. All other brain MR images were aligned to the MDT using a nonlinear inverse-consistent elastic intensity-based registration algorithm [57]. For each brain image, a Jacobian matrix field was derived from the gradients of the deformation field that warped the brain image to the MDT. Volumetric tissue differences were then assessed at each voxel by calculating the determinant of the Jacobian matrix. The determinant value encodes local volume deficit or excess relative to the MDT. The preprocessed TBM data consist of 816 subjects, among which 228 were healthy elderly controls (118 Male, 110 Female), 396 were diagnosed with mild cognitive impairment (MCI; 255 Male, 141 Female), and 192 were diagnosed with Alzheimer’s disease (AD; 101 Male, 91 Female). Each TBM image was measured on a common 220×220×220 grid in 3-D.
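The Jacobian-determinant step described above can be illustrated with a small sketch. This is a 2-D toy example under stated assumptions, not the ADNI preprocessing pipeline: the function name `jacobian_determinant_2d`, the unit grid spacing, and the synthetic mappings are all hypothetical choices made for illustration.

```python
import numpy as np

def jacobian_determinant_2d(phi_x, phi_y):
    """Determinant of the Jacobian of a 2-D mapping (x, y) -> (phi_x, phi_y).

    phi_x, phi_y: arrays of shape (nx, ny) holding the mapped coordinates,
    sampled on a regular grid with unit spacing (an assumption of this sketch).
    """
    # np.gradient returns the derivatives along axis 0 (x) and axis 1 (y).
    dphix_dx, dphix_dy = np.gradient(phi_x)
    dphiy_dx, dphiy_dy = np.gradient(phi_y)
    return dphix_dx * dphiy_dy - dphix_dy * dphiy_dx

# Toy mappings on an 8 x 8 grid (hypothetical, for illustration only).
x, y = np.meshgrid(np.arange(8.0), np.arange(8.0), indexing="ij")

det_identity = jacobian_determinant_2d(x, y)            # no deformation
det_scaled = jacobian_determinant_2d(1.2 * x, 1.2 * y)  # uniform 20% stretch
```

In this toy setting the identity map gives a determinant of 1 at every grid point, while the uniform stretch gives 1.2 squared; departures from 1 are what a voxelwise map of local volume change summarizes. Whether values above 1 mean excess or deficit depends on the direction of the registration map, which this sketch does not model.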
In Figure 1.2, we show an axial view of two TBM brain images, one from the AD group (a) and the other from the Normal group (b). The plotted images correspond to 2-D slices at z = 114 of the 3-D images. The colors in Figure 1.2 denote the volumetric changes relative to a common template. In Figure 1.2, we observe that the AD patient shows different volumetric changes compared to the normal patient. The volumetric differences vary across regions.
The purpose of this study is to design robust and scalable algorithms to estimate and identify volumetric brain differences between the following group pairs: NL vs. MCI, NL vs. AD, AD vs. MCI, and Male vs. Female. Due to the high dimensionality of the brain imaging data, MCMC-based procedures suffer from long running times and require large storage space for posterior samples. It is imperative to develop an algorithm that scales to the dimensionality of the data while retaining statistical guarantees. Results of this study may help scientists understand how brain atrophy happens at different stages of disease, which may further assist the development of strategies for prevention, diagnosis, and treatment.
1.2.3 Modeling Heterogeneity in Brain Tumor Data
The third example that motivated my dissertation work is a brain tumor dataset for patients with Glioblastoma Multiforme (GBM), a common malignant brain cancer in adults. Patients with GBM usually have a poor prognosis and short survival time if left untreated. Even with treatment, the median survival time for GBM patients is between 12 and 15 months [70]. Brain tumors usually display heterogeneity in terms of geometric shape, spatial location, and temporal scale [40]. Recent progress has been made in the molecular/genetic field to analyze brain tumors in qualitative ways [8, 27]. Quantitatively, Bharath et al. [3] proposed a geometric approach to measure brain tumors in terms of their shapes. Zhou et al. [73] proposed to characterize the spatial variation among tumors and identify specific disease-related sub-regions. They further use the discovered sub-regions to predict survival time. Extensive work has been done on tumor segmentation [10, 38], feature extraction [54], and classification of brain tumor grade in machine learning [69]. Although recent progress has been made in this area, there is still a lack of quantitative approaches to characterize heterogeneity and identify its association with disease status, gender, or any other characteristics of interest.
[Figure 1.2 panels: (a) TBM Image Slice for AD; (b) TBM Image Slice for Normal.]
Figure 1.2: Example images of 3D TBM at the (z = 114) slice. The image on the left panel (a) is for a patient with Alzheimer’s disease and the one on the right panel (b) corresponds to the normal patient.
In this study, we use the image data from The Cancer Genome Atlas (TCGA) database. The website for the GBM study is at The Cancer Imaging Archive (https://www.cancerimagingarchive.net/). In addition to the brain images, we have acquired patients’ clinical characteristics, including survival time, gender, age, and genetic variables. The clinical variables were retrieved from the website http://www.cbioportal.org/. The complete data, including images (T2-weighted/FLAIR images) and clinical variables, contain 63 samples, of which 21 were from females and 42 were from males. The raw MRIs were first preprocessed using the Medical Image Processing, Analysis and Visualization software; steps include spatial registration and intensity bias correction. Brain tumors were then segmented using the Medical Imaging Interaction Toolkit (MITK.org). The preprocessed segmented tumor retains the high signal intensity and clearly differentiates between the edema and the tumor. Details of the preprocessing steps are described in Bharath et al. [2].
In Figure 1.3, we show two examples of T2-weighted brain images with segmented tumor regions. One (a) corresponds to a long-survival (> 12 months) patient, and the other (b) comes from a short-survival (≤ 12 months) patient. The color in Figure 1.3 denotes the pixel intensities. From Figure 1.3, we observe that the pixel intensity values in the tumor regions show non-homogeneous patterns: different areas of the tumor have different intensity values. Our analysis aims to characterize latent heterogeneity patterns of the tumor pixels and investigate the association between these structures and covariates of interest. In particular, we will focus on a binary covariate on survival time, classifying patients into long-survival (> 12 months) and short-survival (≤ 12 months) groups. The results of this study may reveal heterogeneous patterns of tumor cells caused by factors such as different evolution stages or etiologies.
1.3 Background
In this section, we review background information relevant to the methods proposed in Chapter 2 and Chapter 3. Specifically, we review functional data and Bayesian functional mixed models in Section 1.3.1, the idea of approximate Bayesian inference in Section 1.3.2, and Dirichlet Diffusion Trees in Section 1.3.3.
[Figure 1.3 panels: (a) Segmented Tumor; (b) Segmented Tumor.]
Figure 1.3: T2-weighted FLAIR MRIs with segmented tumor (outlined in black) of two patients diagnosed with GBM. The left image (a) shows tumor for a long-survival patient with GBM. The right image corresponds to the brain tumor for a short-survival patient.
1.3.1 Overview of Functional Data and Bayesian Functional Mixed Models
Functional data is a type of high-dimensional data whose basic observational units are curves or surfaces defined on a continuous domain. Usually, we consider functional data as realizations of a set of stochastic processes, denoted by $X_1(t), \ldots, X_n(t)$. While often measured on discrete grids, functional data is intrinsically infinite dimensional. In recent years, functional data has attracted a large amount of attention in the statistics community. Extensive efforts have been made to develop theory and methods to extract important information and uncover systematic patterns. One well-studied topic is functional regression, which I briefly introduce in this section. Functional regression investigates the relationship between functional responses and a vector of predictors. There is a large literature on this topic, and a comprehensive review can be found in Morris [42]. Morris [42] also gives the commonly used regression form with a functional response and scalar predictors:
\[
Y_i(t) = \sum_{a=1}^{p} X_{ia} B_a(t) + E_i(t), \quad t \in \mathcal{T}, \tag{1.1}
\]
where $\{Y_1(t), \ldots, Y_N(t)\}$ denotes the $N$ functional responses and $\{X_{i1}, \ldots, X_{ip}\}$ denotes the scalar predictors for the $i$th response. In (1.1), $B_a(t)$ refers to the effect of the $a$th predictor on the response at grid point $t$. Here, $E_i(t)$ denotes the error term; the errors are assumed to be i.i.d. Gaussian processes with mean 0 and covariance $S(t_1, t_2)$. Technically, the goal of the functional regression model is to estimate $B_a(t)$, $a = 1, \ldots, p$, and to perform statistical inference to select any grid point $t$ where $B_a(t) \neq 0$. To achieve this goal, functional data is usually represented by a basis, e.g., splines [13], Fourier series, wavelets, or functional principal components [12]. These bases constitute the foundation of functional regression, and each basis is suitable for functions with certain features. There is a large literature on the characteristics of bases in functional regression [6, 24, 49]. Among the above bases, wavelet bases are well suited to modeling non-stationary, discontinuous functions with local spikes and dips [6]. We will adopt a wavelet basis in our analysis of functional data.
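To make the structure of model (1.1) concrete, the following minimal sketch simulates functional responses on a grid and estimates the coefficient functions by ordinary least squares applied at every grid point. It is only an illustration under simplifying assumptions (independent white-noise errors, no basis expansion, made-up coefficient functions and sample sizes); it is not the estimation procedure developed in this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

N, p, T = 40, 2, 100                 # curves, predictors, grid points (illustrative)
t = np.linspace(0.0, 1.0, T)

# Hypothetical true coefficient functions B_a(t); the second is zero on half the domain.
B_true = np.vstack([np.sin(2 * np.pi * t),
                    np.where(t > 0.5, 1.0, 0.0)])      # shape (p, T)

X = rng.normal(size=(N, p))                            # scalar predictors X_ia
E = 0.1 * rng.normal(size=(N, T))                      # independent errors (a simplification)
Y = X @ B_true + E                                     # model (1.1) evaluated on the grid

# Least squares applied to every grid point at once: solves X B = Y column by column.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)          # shape (p, T)

max_abs_error = np.max(np.abs(B_hat - B_true))
```

Pointwise least squares ignores the within-function covariance $S(t_1, t_2)$ and does not by itself identify where $B_a(t) \neq 0$, which is part of the motivation for working with basis representations and sparsity-inducing priors.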
Although the assumptions on the error terms in (1.1) are straightforward, model (1.1) is inadequate to capture the between-function correlations induced by replications in the experimental design. To address this issue, Guo [22] introduced random effects into functional regression based on smoothing splines in a frequentist framework. However, this approach cannot model the between-function correlation or accommodate multilevel random effects. Instead, Morris and Carroll [46] proposed the Functional Mixed Model (FMM), which introduces multilevel random effects into functional regression. Next, I give a brief introduction to the FMM.
The FMM introduced by Morris and Carroll [46] takes the following general form:
\[
Y(t) = XB(t) + ZU(t) + E(t), \quad t \in \mathcal{T}, \tag{1.2}
\]
where $B(t) = (B_1(t), \ldots, B_p(t))^T$ is a vector of fixed effect coefficient functions associated with an $N \times p$ design matrix $X$, $U(t) = (U_1(t), \ldots, U_M(t))^T$ is a vector of random effect coefficient functions associated with an $N \times M$ design matrix $Z$, and $E(t) = (E_1(t), \ldots, E_N(t))^T$ is the vector of random errors. Here, the response $Y(t)$ and the covariates $X$, $Z$ are treated as known, and the goal is to infer $B(t)$ and $U(t)$. Morris and Carroll [46] assumed that $U(t)$ and $E(t)$ are mutually independent multivariate Gaussian processes. In particular, if we use $E(t) \sim \mathcal{N}(R, S)$ to denote that $E(t)$ is a multivariate mean-zero Gaussian process with an $N \times N$ between-function covariance matrix $R$ and a within-function covariance surface $S(\cdot, \cdot)$, then $\mathrm{cov}\{E_i(t_1), E_j(t_2)\} = R_{ij} S(t_1, t_2)$, for $i, j \in \{1, \ldots, N\}$ and $t_1, t_2 \in \mathcal{T}$. Similarly, they assumed that $U(t) \sim \mathcal{N}(P, Q)$.
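The separable covariance $\mathrm{cov}\{E_i(t_1), E_j(t_2)\} = R_{ij} S(t_1, t_2)$ can be checked numerically on a grid. The sketch below is an illustration under assumed values: the matrix `R`, the squared-exponential form of `S`, and the sizes are hypothetical choices, not quantities taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)

N, T = 3, 4                                   # functions and grid points (illustrative)

# Hypothetical between-function covariance R (N x N).
R = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Hypothetical within-function covariance surface S(t1, t2) on the grid:
# a squared-exponential kernel plus a small jitter for numerical stability.
t = np.linspace(0.0, 1.0, T)
S = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.1) + 1e-8 * np.eye(T)

A = np.linalg.cholesky(R)                     # A @ A.T == R
C = np.linalg.cholesky(S)                     # C @ C.T == S

# One draw of the N x T error matrix with cov{E[i, t1], E[j, t2]} = R[i, j] * S[t1, t2].
Z = rng.normal(size=(N, T))
E = A @ Z @ C.T

# Flattening E row by row is a linear map kron(A, C) of the white noise Z,
# so the implied covariance of the flattened draw is kron(R, S).
M = np.kron(A, C)
implied_cov = M @ M.T
```

Entry `(i * T + t1, j * T + t2)` of `np.kron(R, S)` equals `R[i, j] * S[t1, t2]`, which is the grid version of the covariance stated above.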
Specifically, to fit FMM, Morris and Carroll [46] proposed to discretize the functions in (1.2) on a dense grid T and apply a discrete wavelet transform to the discretized functions. This is equivalent to representing Y(t), B(t), U(t) and E(t) through expansions on a common wavelet basis {ϕjk}, e.g., Yi(t) = PJj=1PKj
k=1dijkϕjk(t) where j is the scale index, k is the
location index and {dijk} are wavelet coefficients of Yi(t). Model (1.2) can therefore be transferred to the dual space of wavelet coefficients:
$$D = XB^* + ZU^* + E^*, \tag{1.3}$$
where the rows of $D$, $B^*$, $U^*$ and $E^*$ contain the wavelet coefficients of the entries of $Y(t)$, $B(t)$, $U(t)$ and $E(t)$, respectively. In the dual space (wavelet domain) model (1.3), the random effects $U^*$ and the errors $E^*$ are both zero-mean normal matrices, denoted by $U^* \sim \mathcal{N}(P, Q^*)$ and $E^* \sim \mathcal{N}(R, S^*)$, where $Q^*$ and $S^*$ denote the covariances between the columns of $U^*$ and $E^*$, respectively. Since different columns of $D$, $B^*$, $U^*$ and $E^*$ correspond to wavelet coefficients at different resolution levels ($j$) and locations ($k$), the whitening property of the wavelet transform enables a simplified independence assumption for the covariance matrices $Q^*$ and $S^*$, i.e., $Q^* = \mathrm{diag}(\{q^*_{jk}\})$, $S^* = \mathrm{diag}(\{s^*_{jk}\})$. Furthermore, Morris and Carroll [46] assumed that $P$ and $R$ are both identity matrices.
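As a rough illustration of the passage from (1.2) to (1.3), the following Python sketch builds an orthonormal Haar transform matrix, applies it to discretized functions, and checks numerically that the linear model carries over to the coefficient domain. The matrix sizes, the random inputs, the helper name `haar_matrix`, and the choice of the Haar wavelet are all assumptions made for illustration; they are not the setup used by Morris and Carroll [46].

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix of size n x n (n must be a power of two)."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                    # coarser-scale rows, stretched
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-scale differences
    return np.vstack([top, bottom]) / np.sqrt(2.0)

# Hypothetical sizes: N curves on a grid of T points, p fixed and M random effects.
rng = np.random.default_rng(0)
N, p, M, T = 12, 2, 3, 16
X = rng.normal(size=(N, p))
Z = rng.normal(size=(N, M))
B = rng.normal(size=(p, T))          # discretized fixed-effect functions
U = rng.normal(size=(M, T))          # discretized random-effect functions
E = 0.1 * rng.normal(size=(N, T))    # discretized errors
Y = X @ B + Z @ U + E                # discretized version of model (1.2)

W = haar_matrix(T)
# Each row (one function) is transformed to its wavelet coefficients.
D, B_star, U_star, E_star = (A @ W.T for A in (Y, B, U, E))

# The transform is linear and orthonormal, so the model holds in the dual space
# and the original functions can be recovered by the inverse transform.
assert np.allclose(W @ W.T, np.eye(T))
assert np.allclose(D, X @ B_star + Z @ U_star + E_star)
assert np.allclose(D @ W, Y)
```

Because the transform acts on the right of each matrix, the design matrices $X$ and $Z$ are untouched, which is why the same regression structure reappears column by column in the coefficient domain.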
To encourage sparsity in the representation of functional data, sparse regularization is imposed on the basis coefficients. Specifically, sparsity of the basis coefficients can be achieved either by imposing an L1 penalty, e.g., LASSO [60], or by conducting stochastic search with sparse priors [17, 20] in a Bayesian framework. In the FMM, Morris and Carroll [46] placed a “spike-slab” prior on each component of the fixed effects $B^* = (B^*_{i,jk})$. Furthermore, inverse gamma priors were assumed for the random effect and residual variances $\{q^*_{jk}\}$, $\{s^*_{jk}\}$. The assumptions of the FMM can be written as
$$B^*_{i,jk} \sim \gamma^*_{i,jk} \mathcal{N}(0, \tau_{i,jk}) + (1 - \gamma^*_{i,jk}) \delta_0, \quad \gamma^*_{i,jk} \sim \mathrm{Bernoulli}(\pi_{ij}), \tag{1.4}$$
$$q^*_{jk} \sim \mathrm{IG}(a^q_{jk}, b^q_{jk}), \quad s^*_{jk} \sim \mathrm{IG}(a^s_{jk}, b^s_{jk}),$$
where $\delta_0$ is a point mass at 0, and $\tau_{i,jk}$, $\pi_{ij}$, $a^q_{jk}$, $b^q_{jk}$, $a^s_{jk}$ and $b^s_{jk}$ denote hyperparameters. These prior setups lead to posterior inference via Markov chain Monte Carlo (MCMC) sampling. The MCMC sampling procedure of the FMM is summarized in Algorithm 1.

Algorithm 1: The FMM sampling procedure
    Initialize all parameters;
    for b from 1 to B do
        for all jk do
            Sample $B^*_{i,jk} \mid D_{jk}, B^*_{(-i)jk}, q^*_{jk}, s^*_{jk}$, for $i = 1, \ldots, p$;
            Sample $q^*_{jk}$, $s^*_{jk}$ using Metropolis-Hastings (M-H);
            Sample $U^*_{jk} \mid D_{jk}, B^*_{jk}, q^*_{jk}, s^*_{jk}$.
        end
    end
In the above algorithm, $B^*_{(-i)jk}$ denotes the fixed effects at level $j$ and location $k$, excluding the $i$th fixed effect corresponding to predictor $X_i$.
Later in Chapter 2, we focus on functional mixed models (FMMs) and develop the variational Bayes functional mixed model. The latter provides substantial computational convenience for high-dimensional functional data.
1.3.2 An Overview of Variational Inference
While MCMC sampling provides a standard way to estimate the parameters of the FMM, its computation can be slow and its convergence can be difficult to diagnose, especially in large-scale settings. Variational Bayes provides an alternative that approximates the posterior distribution through optimization rather than sampling. In this dissertation, we develop the variational functional mixed model inspired by variational Bayes.
Here, we briefly review the basic idea of variational Bayes. A comprehensive review can be found in Blei et al. [5]. Let $\theta$ denote all model parameters and $y$ denote the observed data. Unlike MCMC sampling, which provides a stochastic approximation of the exact posterior $p(\theta \mid y)$ using a set of samples, variational Bayes finds an analytical proxy $q_v(\theta)$ that is closest to $p(\theta \mid y)$. In particular, one estimates the parameters $v$ of $q_v(\theta)$ by minimizing the Kullback-Leibler divergence between $q_v(\theta)$ and $p(\theta \mid y)$,
$$\mathrm{KL}(q \| p) = \int q_v(\theta) \log \frac{q_v(\theta)}{p(\theta \mid y)} \, d\theta.$$
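Directly minimizing $\mathrm{KL}(q \| p)$ is often difficult. Fortunately, one can decompose $\log p(y)$ as
$$\log p(y) = \int q_v(\theta) \log \frac{q_v(\theta)}{p(\theta \mid y)} \, d\theta + \int q_v(\theta) \log \frac{p(y, \theta)}{q_v(\theta)} \, d\theta = \mathrm{KL}(q \| p) + E_{q_v}\left[\log \frac{p(\theta, y)}{q_v(\theta)}\right]. \tag{1.5}$$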
As $\log p(y)$ is a constant with respect to $q_v$, minimizing $\mathrm{KL}(q \| p)$ is equivalent to maximizing the second term in (1.5), which is referred to as the evidence lower bound (ELBO). Thus, inference with variational Bayes boils down to solving an optimization problem. Often, $q_v(\theta)$ is restricted to take a simpler form than $p(\theta \mid y)$. One common restriction is the mean-field assumption, i.e., assuming that $\theta$ can be partitioned into independent blocks so that the factorization $q_v(\theta) = \prod_i q_{v_i}(\theta_i)$ holds. Such simplification makes it possible to calculate $q_v(\theta)$ analytically. Calculations become more convenient still under exponential family assumptions, i.e., assuming that (i) each conditional distribution $p(\theta_i \mid \theta_{(-i)}, y)$ belongs to the exponential family, and (ii) the approximate distribution $q_{v_i}(\theta_i)$ belongs to the same exponential family as $p(\theta_i \mid \theta_{(-i)}, y)$. In particular, write $p(\theta_i \mid \theta_{(-i)}, y)$ in canonical form
$$p(\theta_i \mid \theta_{(-i)}, y) = \exp\{\eta_i(\theta_{(-i)}, y)^T t(\theta_i) - A(\eta_i)\};$$
the optimal $q_{v_i}(\theta_i)$ then has natural parameter $\eta(v_i) = E_{q_{\setminus i}}[\eta_i(\theta_{(-i)}, y)]$, where $q_{\setminus i} = \prod_{j \neq i} q_{v_j}(\theta_j)$. Detailed derivations can be found in Appendix A of Blei [4]. The exponential family assumptions reduce the estimation of $q_{v_i}(\theta_i)$ to the estimation of natural parameters, making the calculations much more straightforward.
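The decomposition of $\log p(y)$ into a KL term and the ELBO can be checked numerically on a toy problem. The sketch below assumes a deliberately simple conjugate model that is not part of this dissertation: a single observation $y \mid \theta \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$, so that the posterior is $N(y/2, 1/2)$ and the evidence is $N(0, 2)$, and a Gaussian approximation $q = N(m, s^2)$. The function names are hypothetical.

```python
import numpy as np

def log_evidence(y):
    """log p(y) for the toy model: marginally y ~ N(0, 2)."""
    return -0.5 * np.log(2 * np.pi * 2.0) - y ** 2 / 4.0

def elbo(y, m, s2):
    """ELBO of q = N(m, s2): E_q[log p(y | theta)] + E_q[log p(theta)] + entropy of q."""
    e_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + s2)
    e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return e_loglik + e_logprior + entropy

def kl_to_posterior(y, m, s2):
    """KL(q || p(theta | y)) with exact posterior N(y / 2, 1 / 2)."""
    post_mean, post_var = y / 2.0, 0.5
    return (0.5 * np.log(post_var / s2)
            + (s2 + (m - post_mean) ** 2) / (2.0 * post_var) - 0.5)

y = 1.3
for m, s2 in [(0.0, 1.0), (0.5, 0.2), (y / 2.0, 0.5)]:
    # The two terms always add up to the constant log p(y).
    assert np.isclose(kl_to_posterior(y, m, s2) + elbo(y, m, s2), log_evidence(y))
```

In this toy case the ELBO attains $\log p(y)$ exactly when $q$ equals the posterior, since the KL term is then zero; for any other $(m, s^2)$ the ELBO is strictly smaller.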
1.3.3 Dirichlet Diffusion Trees
Complex, high-dimensional data, such as curves and images, can be represented in different forms. One approach is to use a tree to represent the hierarchical structure in the data. The advantage of a tree lies in its flexible, hierarchical structure. In this dissertation, we focus on binary trees due to their flexibility in estimation and uncertainty quantification. Priors over binary trees are usually defined through a stochastic generative process. For example, Kingman [30] developed the coalescent, an infinite binary tree generated by random merges. Another stochastic generative tree prior can be constructed using the Dirichlet Diffusion Tree (DDT) framework introduced by Neal [51]. In general, a DDT is a top-down stochastic model that generates a rooted binary tree by random splits. The leaves of a DDT are the data points. Data points diverge in the tree through a diffusion process under the control of a divergence function. Knowles and Ghahramani [31] extended DDTs to allow an arbitrary number of children per node instead of binary splits. In this dissertation, we focus on using Dirichlet Diffusion Trees to model complex, high-dimensional data. We now briefly review the basics of DDTs.
Here, we summarize the generative process of a DDT using the example shown in Figure 1.4, which displays a sample DDT for $N = 4$ data points with leaf values $x_1$, $x_2$, $x_3$ and $x_4$; the tree has four paths ending in four leaf nodes. In this sample, the first path starts at $t = 0$ and follows a Brownian motion with variance $\sigma^2$, leading to a leaf node at $t = 1$ with value $x_1$. If the first data point is at position $x_1(t)$ at time $t$, it will arrive at position $x_1(t + dt) = x_1(t) + \mathcal{N}(0, \sigma^2 dt)$ at time $t + dt$. It is easy to show that $x_1(t) \sim \mathcal{N}(0, \sigma^2 t)$. The second path also starts at $t = 0$ and initially follows the path of $x_1$. At time $t_b$, it diverges from the path of $x_1$, which creates an internal node with divergence time $t_b$ and location $x_b$. This divergence time is controlled by a divergence function, which here takes the form $a(t) = c/(1 - t)$. Once diverged, the second path follows a Brownian motion that is independent of the first path until $t = 1$, where it forms a leaf node with value $x_4$. Similarly, the third path follows the previous two paths until time $t_b$, where it has to choose between the left subtree and the right subtree. Neal [50] proved that the probability distribution generated by a DDT is exchangeable; hence, the order of the paths does not affect the probability. We simply assume that the third path follows the path of $x_1$ after $t_b$. Later, at time $t_a$, the third path diverges from the first path and continues until $t = 1$, reaching the leaf node with value $x_2$. In general, the generative process of a subsequent path is summarized as follows. Initially, it follows the path of the previous data points until it diverges. Neal [50] pointed out that at any time $t$, the probability of divergence in the next infinitesimal interval $dt$ is $a(t)dt/n$, where $n$ denotes the number of data points that have previously traversed the path. If the path reaches an internal node without having diverged, it has to choose whether to follow the left subtree or the right subtree. This choice is governed by the branching probability, which is proportional to the number of times each branch has been traversed before.
Figure 1.4: Sample ($N = 4$) generated from the Dirichlet Diffusion Tree.
In the above generative process, several parameters of the DDT influence the branching behavior of the latent tree, including $c$ in the divergence function and $\sigma^2$ in the Brownian motion. Here, $c$ controls the splitting times, whereas $\sigma^2$ sets the diffusion variance. In this dissertation, we consider the divergence function $a(t) = c/(1 - t)$ with $c > 0$. Larger values of $c$ (e.g., $c > 1$) produce a more homogeneous data pattern and result in less dependence within the data. Smaller values (e.g., $c < 1$) usually generate more heterogeneous structures, i.e., sub-clusters or local clusters [50].
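To make the role of $c$ concrete, the sketch below computes the probability of not diverging on a branch and draws divergence times by inverting that probability. It assumes the divergence function $a(t) = c/(1 - t)$ used in this dissertation and the standard DDT result that a path on a segment already traversed by $n$ points survives from time $s$ to time $t$ with probability $\exp\{(A(s) - A(t))/n\}$, where $A(t) = -c \log(1 - t)$. The function names are hypothetical, and the sampler covers a single segment only; it ignores the branch choice made when an internal node is reached.

```python
import numpy as np

def cumulative_divergence(t, c):
    """A(t) = integral of c / (1 - u) from 0 to t = -c * log(1 - t)."""
    return -c * np.log1p(-t)

def prob_no_divergence(s, t, n, c):
    """Probability of not diverging between times s <= t on a branch
    previously traversed by n data points."""
    return np.exp((cumulative_divergence(s, c) - cumulative_divergence(t, c)) / n)

def sample_divergence_time(s, n, c, rng):
    """Inverse-CDF draw: solve ((1 - t) / (1 - s)) ** (c / n) = u for t."""
    u = rng.uniform()
    return 1.0 - (1.0 - s) * u ** (n / c)

rng = np.random.default_rng(0)
for c in (0.5, 2.0):
    draws = [sample_divergence_time(0.0, 1, c, rng) for _ in range(5000)]
    print(f"c = {c}: mean first divergence time = {np.mean(draws):.3f}")
```

Under these assumptions, a larger $c$ pushes the first divergence toward $t = 0$, so the paths spend more time diffusing independently, which is consistent with the weaker dependence described above.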
To estimate the parameters of a DDT, we often need to calculate the likelihood of a latent tree given the observations. In general, the probability of a given latent tree can be decomposed into a product of two factors: a tree factor and a data factor. The tree factor accounts for the probability of obtaining the given hierarchical organization and internal divergence times. The data factor accounts for the probability of obtaining the latent internal node locations and the leaf nodes. Here, we introduce the basics of calculating the probability of a tree based on the sample in Figure 1.4. More details can be found in Neal [50]. First, the probability that a path does not diverge between times $s$ and $t$ ($s \leq t$) on a branch that has previously been followed by $n$ data points is
$$P(\text{not diverging}) = \exp\left(\frac{A(s) - A(t)}{n}\right),$$
where $A(t) = \int_0^t a(u) \, du$ is the cumulative divergence function. Specifically, for a divergence function of the form $a(t) = c/(1 - t)$, the cumulative divergence is $A(t) = -c \log(1 - t)$. The branching probability is determined by the number of data points that have traversed each segment. Given the above two facts, the probability of the tree factor for obtaining the corresponding structure in Figure 1.4 can be calculated by
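$$\exp(-A(t_c)) \, a(t_c) \times \exp\left(-\frac{A(t_b)}{2}\right) \frac{a(t_b)}{2} \times \exp\left(-\frac{A(t_b)}{3}\right) \frac{1}{3} \exp(A(t_b) - A(t_a)) \, a(t_a). \tag{1.6}$$
Given the branching structure and divergence times, one can simply calculate the data factor as
$$\mathcal{N}(x_b \mid 0, \sigma^2 t_b) \times \mathcal{N}(x_c \mid x_b, \sigma^2 (t_c - t_b)) \times \mathcal{N}(x_a \mid x_b, \sigma^2 (t_a - t_b)) \times \mathcal{N}(x_4 \mid x_c, \sigma^2 (1 - t_c))$$
$$\times \, \mathcal{N}(x_3 \mid x_c, \sigma^2 (1 - t_c)) \times \mathcal{N}(x_1 \mid x_a, \sigma^2 (1 - t_a)) \times \mathcal{N}(x_2 \mid x_a, \sigma^2 (1 - t_a)).$$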
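The data factor is a product of Gaussian transition densities, one per edge of the tree, so it can be accumulated edge by edge on the log scale. The sketch below does this for a tree with the same shape as the data factor above (root, internal nodes $b$, $a$, $c$, and four leaves). The node times and locations are made-up scalar values chosen only for illustration, and the names `normal_logpdf`, `log_data_factor`, `nodes` and `edges` are hypothetical.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    """Log density of a univariate N(mean, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_data_factor(nodes, edges, sigma2):
    """Sum of Brownian transition log densities over all edges.

    nodes maps a node name to (time, location); edges lists (parent, child).
    """
    total = 0.0
    for parent, child in edges:
        t_par, x_par = nodes[parent]
        t_ch, x_ch = nodes[child]
        total += normal_logpdf(x_ch, x_par, sigma2 * (t_ch - t_par))
    return total

# Hypothetical times and locations; leaves sit at t = 1.
nodes = {
    "root": (0.0, 0.0), "b": (0.3, 0.2), "a": (0.6, 0.5), "c": (0.7, -0.4),
    "x1": (1.0, 0.7), "x2": (1.0, 0.4), "x3": (1.0, -0.6), "x4": (1.0, -0.3),
}
edges = [("root", "b"), ("b", "c"), ("b", "a"),
         ("c", "x4"), ("c", "x3"), ("a", "x1"), ("a", "x2")]

log_factor = log_data_factor(nodes, edges, sigma2=1.0)
print(log_factor)
```

Working on the log scale avoids numerical underflow when the tree has many edges or when the observations are multivariate, in which case the univariate density would be replaced by a product over coordinates.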
Later in Chapter 3, we will utilize Dirichlet Diffusion Trees to develop a Bayesian latent tree model to model the heterogeneity of complex, high-dimensional data.
1.4 Dissertation Structure
The rest of this dissertation is organized as follows. In Chapter 2, we propose variational functional mixed models to resolve the computational bottleneck of MCMC in dealing with ultra-high dimensional functional data. We introduce a multiple testing procedure in basis space to select important regions. In Chapter 3, we consider modeling the heterogeneity structure of data using Dirichlet Diffusion Trees. We aim to characterize heterogeneity and estimate the systematic differences between groups. Chapter 4 summarizes the proposed models and discusses related future work.
Ultra-Fast Approximate Inference Using Variational Functional Mixed Models
2.1 Introduction
Modern high-throughput technologies enable the collection of high-dimensional data in functional form, with ideal observational units being curves or surfaces defined on some continuous domain and sampled on a discrete grid. Typical examples include longitudinal measurements, spectral curves, engineering signals, brain images, and many other digital measurements. While these data often provide a rich source of information, they also pose extraordinary challenges to statistical methodology, mostly due to the ultra-high dimensionality/volume and the complex data structures. It is not just desirable but essential that analytical tools be flexible enough to accommodate complex data structures and scalable to increasing size and dimensionality. Extensive research has been done in the field of functional data analysis to process and model functional data, among which the most studied is functional regression, which focuses on characterizing the relationship between functional observations and other variables. Notable work includes functional linear models [7, 11, 23, 67, 68], generalized functional linear models [28, 48, 76], functional additive models [14, 79], etc. Comprehensive reviews can be found in Ramsay and Silverman [52], Morris [42] and Wang et al. [63]. Among these models, functional mixed models (FMMs) provide a flexible regression framework to model complex data structures. Furthermore, compared with the frequentist FMM [22], Bayesian functional mixed models [46, 72] have achieved a great deal of success due to their convenience in inference through Markov chain Monte Carlo (MCMC) sampling and their ability to fully characterize the uncertainty of the unknown parameters. As a result, Bayesian FMMs have been widely applied in areas such as the analysis of mass spectrometry data [44, 45], accelerometer data [43], acoustic signals [35], and medical images [33, 34, 78]. Further extensions of Bayesian FMM have been made to allow robust regression [74, 75], function-on-function regression [39], spatial correlations between functions [71, 78], multivariate functional responses [77], as well as quantile functional regression [65]. These methods constitute a rich family of Bayesian FMM-based approaches that cover a large scope of analysis involving complex structured functional data. Despite their effectiveness, Bayesian FMMs become computationally demanding for data with extraordinarily high volume and dimensionality. The computational challenges primarily come from running MCMC sampling and the need to store a large number of posterior samples. Various strategies have been suggested to improve computational scalability, for example, performing data compression in advance to reduce dimension or calculating good MCMC initial values to start with [75]. However, even with these strategies, the computation can still be challenging for large-scale data, as it still requires running MCMC until convergence.
In this chapter, we aim to develop an ultra-fast Bayesian FMM computational framework that is suitable for large-scale functional data, avoiding expensive MCMC sampling and the hassle of storing posterior samples. The scale of such data suggests that strategies for Big Data computation should be useful. For example, techniques such as divide-and-conquer [62] lead to embarrassingly parallel algorithms [64], and approximation is often adopted to improve computational efficiency [41, 53, 55]. Despite the progress in Big Data computation, the fast computation of large-scale functional data has not received much attention. In this chapter, we propose a variational functional mixed model (VFMM) framework that takes advantage of several attractive strategies from Big Data computation. In particular, we adopt the divide-and-conquer strategy by first representing functional observations with a parsimonious basis and then performing statistical inference for each component in basis space. The parsimonious basis representation enables efficient compression and embarrassingly parallel computation. It also facilitates an efficient multiple testing procedure in basis space, which can be used to identify significant local regions of functional data in the original data domain. Instead of performing MCMC sampling, we rely on approximate inference [58]. In particular, we approximate the posterior distribution using variational Bayes, a method from machine learning that approximates posterior distributions through optimization [5]. While this approximation sacrifices some of MCMC's accuracy, it provides large gains in computational feasibility, especially in ultra-high dimensional settings. We design a fast iterative algorithm to estimate the parameters of the approximate posterior distribution. Owing to its speed, our approach is ideal for obtaining quick initial estimates from large-scale data. If desired, it can be combined with full MCMC schemes to achieve more accurate Bayesian inference on the exact posterior distribution. Relative to existing functional regression approaches, our proposed VFMM approach brings several advantages: (1) it enables distributed inference and is thus scalable to large-scale functional data; (2) it avoids the hassle of running MCMC and storing posterior samples; (3) it facilitates the detection of significant local regions, and these regions can be visualized directly in the data domain. Our results for the simulated and real data demonstrate the effectiveness of the proposed VFMM in estimating parameters and in saving computation time and storage. The outline for the rest of this chapter is as follows. In Section 2.2, we introduce the proposed VFMM framework. Specifically, we review the Bayesian functional mixed model framework in Section 1.3.1 and describe VFMMs in Section 2.2.1. An approach to detect significant regions by performing basis space tests is discussed in Section 2.2.2. Estimation results and computational gains of VFMM are demonstrated by simulations in Section 2.3 using both curves and three-dimensional images. Two case studies are used to demonstrate the effectiveness of VFMM in Section 2.4.
2.2 Model
2.2.1 Variational Functional Mixed Model Framework
We adopt a parsimonious basis representation to transform the FMM to basis space, which leads to divide-and-conquer computation. Consider the general FMM in (1.2). We assume that the responses $\{Y_i(t), i = 1, \ldots, N\}$ take values in $L^2(\mathcal{T})$, where $\mathcal{T}$ is a closed subset of $\mathbb{R}^d$, $d \geq 1$. Let $\{\phi_j\}_{j=1}^{\infty}$ denote a compactly supported, orthonormal basis of $L^2(\mathcal{T})$. We can expand $Y_i(t)$ as $Y_i(t) = \sum_{j=1}^{\infty} d_{ij} \phi_j(t)$, where $d_{ij} = \langle Y_i, \phi_j \rangle = \int_{\mathcal{T}} Y_i(t) \phi_j(t) \, dt$. The coefficient sequence $(d_{i1}, d_{i2}, \ldots)$ lies in the space of square-summable sequences, denoted by $\ell^2 = \{(d_j) : \sum_{j=1}^{\infty} d_j^2 < \infty\}$. Since $Y(t)$ is written as a linear combination of $B(t)$, $U(t)$ and $E(t)$, it is natural to assume that these unobserved functional components also take values in the same $L^2(\mathcal{T})$ space. With this assumption, all functional objects in the FMM can be represented by a common basis. Specifically, denote $\Phi = (\phi_1, \phi_2, \ldots)^T$; then all basis expansions can be represented by linear operations: $Y = D\Phi$, $B = B^*\Phi$, $U = U^*\Phi$, and $E = E^*\Phi$. Model (1.2) becomes $D\Phi = XB^*\Phi + ZU^*\Phi + E^*\Phi$. Since the basis expansion preserves linear operations, the above model is equivalent to the dual space model $D = XB^* + ZU^* + E^*$.
This transforms the FMM from the functional space $L^2(\mathcal{T})$ to the dual space $\ell^2$. Since the dual space model consists of discrete sequences only, estimation is much easier to perform. Furthermore, as the correlations between basis coefficients are often substantially reduced, one can make independence approximations between the columns of $D$, $B^*$, $U^*$, and $E^*$ in the dual space model, following the idea of Morris and Carroll [46]. This further divides the dual space model into many independent regular mixed effect models. We denote the $j$th model by
$$d_j = Xb^*_j + Zu^*_j + e^*_j, \quad j = 1, \ldots, n, \tag{2.1}$$
where $d_j$, $b^*_j$, $u^*_j$ and $e^*_j$ denote the $j$th columns of $D$, $B^*$, $U^*$ and $E^*$, respectively. While the above divide-and-conquer strategy is suitable for any orthonormal basis, we focus only on compactly supported orthonormal bases such as Haar wavelets, Daubechies wavelets, and spherical wavelets. These bases can capture local features of functional data, enable parsimonious representations that allow further compression, and have discrete transform versions that are fast to compute. They are generally applicable to curves, images, surfaces, etc. As we will explain in Section 2.2.2, using a compactly supported orthonormal basis allows us to identify interesting local regions by performing testing in basis space only, avoiding the need to inverse-transform many posterior samples back to the data domain. Consider the $j$th model in (2.1). We slightly modify the priors and the random effect/residual distributions proposed by Morris and Carroll [46] to enable efficient variational Bayes computation. In particular, denote $d_j = (d_{1j}, \ldots, d_{Nj})^T$, $b^*_j = (b^*_{1j}, \ldots, b^*_{pj})^T$, $u^*_j = (u^*_{1j}, \ldots, u^*_{Mj})^T$, and $e^*_j = (e^*_{1j}, \ldots, e^*_{Nj})^T$. Our model can be written as
$$b^*_{i,j} \sim \gamma^*_{i,j} N(0, q^*_j \tau_{i,j}) + (1 - \gamma^*_{i,j}) \delta_0, \quad \gamma^*_{i,j} \sim \mathrm{Bernoulli}(\pi_j), \quad q^*_j \sim \mathrm{IG}(a_j, b_j),$$
$$u^*_j \sim N(0, q^*_j I), \quad e^*_j \sim N(0, q^*_j \zeta_j I).$$
Here, we have factored out the random effect variance $q^*_j$ from the prior variance of the fixed effect $b^*_{i,j}$ and from the residual variance. This new parameterization allows for a convenient update of the approximate distribution of $q^*_j$. Based on the above model setup, the joint posterior distribution of $\{b^*_{i,j}\}$, $\{\gamma^*_{i,j}\}$ and $q^*_j$ can be written as
$$p(\{b^*_{i,j}\}, \{\gamma^*_{i,j}\}, q^*_j \mid d_j, \zeta_j, \tau_{i,j}, \pi_j) \propto p(d_j \mid \{b^*_{i,j}\}, q^*_j, \zeta_j) \, p(q^*_j) \prod_{i=1}^{p} p(b^*_{i,j} \mid \gamma^*_{i,j}, \tau_{i,j}) \, p(\gamma^*_{i,j} \mid \pi_j). \tag{2.2}$$
We treat $\pi_j$, $\tau_{i,j}$, $\zeta_j$, and $(a_j, b_j)$ as hyperparameters and make mean-field assumptions for the approximate distributions of $\{b^*_{i,j}\}$, $\{\gamma^*_{i,j}\}$, and $q^*_j$. This enables an efficient variational EM algorithm. In particular, we assume that the approximate distribution can be factored as follows:
$$q(\{b^*_{i,j}\}, \{\gamma^*_{i,j}\}, q^*_j) = q(q^*_j) \prod_{i=1}^{p} q(b^*_{i,j} \mid \gamma^*_{i,j}) \, q(\gamma^*_{i,j}). \tag{2.3}$$
As the conditional posteriors of $\{b^*_{i,j}\}$, $\{\gamma^*_{i,j}\}$, and $q^*_j$ all fall in the exponential family, we follow Blei [4] by assuming that each factor in the above approximate distribution also falls in the same exponential family. Therefore, in the E-step, the estimation of the approximate distribution boils down to the estimation of the natural parameters of the exponential family. This facilitates fast calculation of the approximate distributions.
In the M-step, conditional on the estimated approximate distributions, we update the values of the hyperparameters $\pi_j$, $\tau_{i,j}$ and $\zeta_j$ by directly maximizing the ELBO. Specifically, we can solve for $\pi_j$ and $\tau_{i,j}$ analytically by setting the first derivative of the ELBO to zero. The value of $\zeta_j$, however, needs to be found using an optimization algorithm.
In general, ELBO is neither a convex nor concave function of ζj because ζj is contained in
optimum. The values of the hyperparameters $(a_j, b_j)$ are determined by matching the mean of the inverse-Gamma prior with the initial estimate of $q^*_j$ while setting the prior variance to be fairly large (e.g., $10^3$).
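Based on the full conditional distribution in (2.2), the conditional distribution of $d_j$ is:
$$d_j \mid (\{b^*_{\setminus i,j}\}, \gamma^*_{\setminus i,j}, \gamma^*_{i,j} = 1) \sim N(X_{(-i)} b^*_{(-i,j)}, \; \Sigma_j + X_{(i)} X_{(i)}^T \tau_{i,j}), \tag{2.4}$$
$$d_j \mid (\{b^*_{\setminus i,j}\}, \gamma^*_{\setminus i,j}, \gamma^*_{i,j} = 0) \sim N(X_{(-i)} b^*_{(-i,j)}, \; \Sigma_j). \tag{2.5}$$
Rather than modeling the distributions of $\gamma^*_{i,j}$ and $b^*_{i,j}$ separately, it is more reasonable to estimate the joint distribution of $(\gamma^*_{i,j}, b^*_{i,j})$. We model this joint distribution through $q(\gamma^*_{i,j})$ and $q(b^*_{i,j} \mid \gamma^*_{i,j})$. Based on the distributions in (2.4) and (2.5) and the prior distribution of $\gamma^*_{i,j}$, the approximate distribution of $\gamma^*_{i,j}$ is a Bernoulli distribution, i.e.,
$$q(\gamma^*_{i,j}) = \mathrm{Bernoulli}(\tilde{\pi}_{i,j}), \tag{2.6}$$
where
$$\tilde{\pi}_{i,j} = \frac{1}{1 + \exp(-O_{i,j})}, \tag{2.7}$$
$$O_{i,j} = \log\left\{\frac{\pi_j}{1 - \pi_j}\right\} - \frac{1}{2} \log(1 + X_{(i)}^T \Sigma_j^{-1} X_{(i)} \tau_{i,j}) + E_q\left[\frac{1}{2} d_j^{*T} \frac{\Sigma_j^{-1} X_{(i)} X_{(i)}^T \Sigma_j^{-1} \tau_{i,j}}{q^*_j (1 + \tau_{i,j} X_{(i)}^T \Sigma_j^{-1} X_{(i)})} d^*_j\right], \tag{2.8}$$
where $d^*_j = d_j - \sum_{l \neq i} X_{(l)} b^*_{l,j}$ and $E_q$ denotes the expectation of a random variable with respect to the approximate distribution $q$.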
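Based on (2.6) and (2.2), the distribution $q(b^*_{i,j} \mid \gamma^*_{i,j} = 1)$ can be derived as
$$b^*_{i,j} \mid \gamma^*_{i,j} = 1 \sim N(\tilde{\mu}_{i,j}, \tilde{\sigma}^2_{i,j}), \tag{2.9}$$
where
$$\tilde{\sigma}^2_{i,j} = \left\{\frac{1}{\tau_{i,j}} + X_{(i)}^T \Sigma_j^{-1} X_{(i)}\right\}^{-1} E_q^{-1}\left[\frac{1}{q^*_j}\right], \qquad \tilde{\mu}_{i,j} = E_q\left[X_{(i)}^T (q^*_j \Sigma_j)^{-1} \left(d_j - \sum_{l \neq i} X_{(l)} b^*_{l,j}\right)\right] \tilde{\sigma}^2_{i,j}.$$
By (2.2), and taking expectations with respect to (2.6) and (2.9), the approximate distribution for $q^*_j$ can be written as
$$q(q^*_j) = \mathrm{IG}(\tilde{a}_j, \tilde{b}_j), \tag{2.10}$$
where
$$\tilde{a}_j = \frac{N}{2} + \frac{\sum_{i=1}^{p} \tilde{\pi}_{i,j}}{2} + a_j, \qquad \tilde{b}_j = \frac{1}{2} E_q\left[\left(d_j - \sum_{i=1}^{p} X_{(i)} b^*_{i,j}\right)^T \Sigma_j^{-1} \left(d_j - \sum_{i=1}^{p} X_{(i)} b^*_{i,j}\right) + \sum_{i=1}^{p} \frac{(\gamma^*_{i,j} b^*_{i,j})^2}{\tau_{i,j}} + 2 b_j\right].$$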
The above steps constitute the E-step, which updates the approximate distributions of the parameters. To finish one iteration of the algorithm, we still need to update the hyperparameters $\pi_j$, $\tau_{i,j}$ and $\zeta_j$, which we do by optimization. Specifically, maximizing
$$\mathrm{ELBO} = E_q[\log p(d_j, \{b^*_{i,j}\}, \{\gamma^*_{i,j}\}, q^*_j)] - E_q[\log q(\{b^*_{i,j}\}, \{\gamma^*_{i,j}\}, q^*_j)]$$
yields explicit solutions for $\pi_j$ and $\tau_{i,j}$:
$$\pi_j = \frac{\sum_{i=1}^{p} \tilde{\pi}_{i,j}}{p}, \tag{2.11}$$
$$\tau_{i,j} = \frac{\tilde{a}_j (\tilde{\sigma}^2_{i,j} + \tilde{\mu}^2_{i,j})}{\tilde{b}_j}. \tag{2.12}$$
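The closed-form updates (2.11) and (2.12) translate directly into array operations for a single basis component $j$. The sketch below is a minimal transcription under the assumption that the E-step quantities $\tilde{\pi}_{i,j}$, $\tilde{\mu}_{i,j}$, $\tilde{\sigma}^2_{i,j}$, $\tilde{a}_j$ and $\tilde{b}_j$ have already been computed; the function name `m_step_pi_tau` and the example numbers are hypothetical.

```python
import numpy as np

def m_step_pi_tau(pi_tilde, mu_tilde, sigma2_tilde, a_tilde, b_tilde):
    """Closed-form hyperparameter updates for one basis component j.

    pi_tilde, mu_tilde, sigma2_tilde are length-p arrays from the E-step;
    a_tilde and b_tilde are the scalar parameters of q(q_j*) = IG(a_tilde, b_tilde).
    """
    pi_j = float(np.mean(pi_tilde))                            # update (2.11)
    tau = a_tilde * (sigma2_tilde + mu_tilde ** 2) / b_tilde   # update (2.12)
    return pi_j, tau

# Made-up E-step output for p = 3 fixed effects.
pi_tilde = np.array([0.9, 0.1, 0.5])
mu_tilde = np.array([1.0, 0.0, -2.0])
sigma2_tilde = np.array([0.5, 1.0, 0.25])
pi_j, tau = m_step_pi_tau(pi_tilde, mu_tilde, sigma2_tilde, a_tilde=4.0, b_tilde=2.0)
```

The factor $\tilde{a}_j/\tilde{b}_j$ is the mean of $1/q^*_j$ under an inverse-Gamma distribution with parameters $(\tilde{a}_j, \tilde{b}_j)$, and $\tilde{\sigma}^2_{i,j} + \tilde{\mu}^2_{i,j}$ is the second moment of $b^*_{i,j}$ under (2.9), which is one way to read (2.12).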
After taking expectations and substituting (2.11) and (2.12) into the ELBO, the remaining hyperparameter $\zeta_j$ can be obtained by maximizing the following objective function:
$$-\log|2\pi\Sigma_j| - \frac{\tilde{a}_j}{\tilde{b}_j} E_q\left[\left(d_j - \sum_{i=1}^{p} X_{(i)} b^*_{i,j}\right)^T \Sigma_j^{-1} \left(d_j - \sum_{i=1}^{p} X_{(i)} b^*_{i,j}\right)\right]. \tag{2.13}$$
Since there is no explicit solution for (2.13), we use a hill-climbing algorithm in MATLAB to perform the optimization. Since computing the matrix inverse in (2.13) is expensive and sometimes unstable in terms of convergence, we show in Appendix A.4 that the matrix inversion can be avoided through simplification, which speeds up the computation. We list the steps of the VFMM algorithm in Algorithm 2. More technical details can be found in Appendix A.4. To ensure fast convergence, we adopt Henderson's Mixed Model equations [56, pages 275-286] to initialize parameters. While options for more parameter settings and detailed tuning are available, the only required inputs are the observed data Y and the design matrices X, Z. As the calculation is performed independently across the index j, Algorithm 2 can either be implemented in vectorized form or be distributed across multiple computing cores.
Algorithm 2: The VFMM algorithm
    Initialize all parameters;
    while ELBO(t) − ELBO(t−1) > δ do
        for all j do
            Update $q(\gamma^*_{i,j}, b^*_{i,j})$ in (2.6) and (2.9);
            Update $q(q^*_j)$ in (2.10);
            Update ELBO(t);
            Update $\pi_j$, $\tau_{i,j}$ in (2.11) and (2.12);
            Update $\zeta_j$ in (2.13) by the hill-climbing algorithm;
        end
        Update ELBO(t);
    end
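Because the updates for different basis components do not interact, the outer loop over $j$ can be dispatched to independent workers. The sketch below shows only this dispatch pattern in Python. The per-column routine `fit_column` is a placeholder that runs ordinary least squares; it stands in for the VFMM updates and is not the algorithm above. The data sizes and the use of a thread pool are likewise assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def fit_column(args):
    """Placeholder per-component fit: least squares of d_j on X.

    In a real implementation this would run the E- and M-step updates for
    component j until the ELBO converges.
    """
    X, d_j = args
    coef, *_ = np.linalg.lstsq(X, d_j, rcond=None)
    return coef

def fit_all_columns(X, D, max_workers=4):
    """Fit every column of D independently and stack the results (p x n)."""
    tasks = [(X, D[:, j]) for j in range(D.shape[1])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        columns = list(pool.map(fit_column, tasks))   # order of j is preserved
    return np.column_stack(columns)

# Hypothetical basis-space data: N = 50 functions, n = 8 basis coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
D = X @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(50, 8))
coefs = fit_all_columns(X, D)
```

For CPU-bound pure-Python updates a process pool or a cluster scheduler would usually be a better fit than threads; the point here is only that no communication between components is needed until the results are collected.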
2.2.2 Region Detection via Basis Space Testing
In addition to estimating unknown parameters, another key inferential objective is to identify local regions that reflect significant differences across groups of samples. In Bayesian FMM, since posterior samples are available, detection of local regions can be achieved by controlling the family-wise error rate or false discovery rate across a grid of $\mathcal{T}$, as done by Morris et al. [45] and Meyer et al. [39]. In the VFMM framework, since we aim to avoid posterior sampling for the sake of computational efficiency, we propose a new basis-space testing strategy to detect local regions. Let $C(t)$ denote a contrast effect of interest. For example, $C(t) = B_i(t) - B_j(t)$ represents the contrast effect between groups $i$ and $j$ if $B_i(t)$ and $B_j(t)$ are the respective group means. We focus on detecting regions on which $C(t) \neq 0$. To achieve this, we assume that $C(t)$ can be represented by the basis expansion $C(t) = \sum_{j=1}^{\infty} c_j \phi_j(t)$. Instead of performing multiple testing on a grid of $\mathcal{T}$, we aim to obtain a region identifier $\tilde{C}(t) = \sum_{j=1}^{\infty} c_j 1_{\{|c_j| > \epsilon\}} \phi_j(t)$, where $\epsilon$ is a small positive threshold. We see that $\tilde{C}(t) \to C(t)$ as $\epsilon \to 0$ for all $t$. The component $c_j 1_{\{|c_j| > \epsilon\}}$ represents a significantly nonzero component. When using a compactly supported basis, we expect a subset of $\{c_j\}$ to be zero (i.e., $\{H_{0j}: c_j = 0\}$ holds for some $j$) if $C(t)$ is sparse. Therefore, we identify nonzero components in basis space by performing a sequence of basis space tests $\{H_{0j}: |c_j| \leq \epsilon$ vs. $H_{aj}: |c_j| > \epsilon;\ j = 1, \ldots, K\}$ while controlling the Bayesian false discovery rate. Here we have assumed that there exists a finite truncation $K$ beyond which no $H_{0j}$ will be rejected. This is a reasonable assumption because, for most smooth functions, larger $j$ corresponds to high-frequency components, which are usually predominantly noise. We propose a testing procedure which consists of the following steps:
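1. For each component $j = 1, \ldots, K$ in basis space, estimate the probability discovery function $p_\epsilon(j) = \mathrm{pr}\{|c_j| > \epsilon \mid \text{Data}\}$ by $\hat{p}_\epsilon(j)$ based on the approximate distributions $q(b^*_{i,j} \mid \gamma^*_{i,j})$.
2. Sort $\{\hat{p}_\epsilon(j)\}$ in descending order to obtain $\{\hat{p}_{\epsilon,(l)}, l = 1, \ldots, K\}$.
3. Set $\phi_\alpha = \hat{p}_{\epsilon,(s)}$, where $s = \max\{l^* : (l^*)^{-1} \sum_{l=1}^{l^*} \{1 - \hat{p}_{\epsilon,(l)}\} \leq \alpha\}$.
4. Suppose $J \subset \{1, \ldots, K\}$ are the indices that are significant. We reconstruct $\tilde{C}(t)$ by $\tilde{C}^\dagger(t) = \sum_{j \in J} c_j \phi_j(t)$ and flag regions on which $|\tilde{C}^\dagger(t)| > \delta$ for a pre-specified threshold $\delta$.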
The above method of determining the Bayesian FDR threshold $\phi_\alpha$ is sketched in the diagram in Figure 2.1, where the solid decreasing line denotes the ordered $\hat{p}_\epsilon(j)$. The areas marked by A, B, C, D are the estimated proportions of true positives, false positives, false negatives and true negatives, respectively. The threshold $\phi_\alpha$ is determined by constraining $B/(A + B) \leq \alpha$. The set of locations $\psi = \{l : \hat{p}_{\epsilon,(l)} > \phi_\alpha\}$ corresponds to the set of discoveries. The threshold $\phi_\alpha$ is a cut-point on the estimated posterior probabilities that corresponds to an expected Bayesian FDR of $\alpha$. In addition to the Bayesian FDR calculated by $B/(A + B)$, one can further calculate the corresponding Bayesian expected false negative rate (FNR) by $C/(C + D)$, sensitivity (SEN) by $A/(A + C)$, and specificity (SPEC) by $D/(B + D)$.
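Steps 2 and 3 of the procedure amount to a sort followed by a running average. The sketch below is one possible implementation, assuming the estimated discovery probabilities from step 1 are already available as an array. The function name `bayesian_fdr_threshold` and the example probabilities are hypothetical, and the sketch returns the $s$ top-ranked components as the flagged set.

```python
import numpy as np

def bayesian_fdr_threshold(p_hat, alpha):
    """Return (threshold, flagged indices) for a target Bayesian FDR alpha.

    p_hat[j] estimates pr{|c_j| > eps | Data}. The threshold is the s-th
    largest probability, where s is the largest l such that the average of
    (1 - p) over the l largest probabilities does not exceed alpha.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    order = np.argsort(-p_hat, kind="stable")        # descending order
    p_sorted = p_hat[order]
    running_fdr = np.cumsum(1.0 - p_sorted) / np.arange(1, p_sorted.size + 1)
    ok = np.nonzero(running_fdr <= alpha)[0]
    if ok.size == 0:                                 # nothing can be flagged
        return None, np.array([], dtype=int)
    s = ok[-1] + 1
    return p_sorted[s - 1], np.sort(order[:s])

# Made-up discovery probabilities for K = 5 basis components.
p_hat = [0.99, 0.20, 0.95, 0.60, 0.90]
threshold, flagged = bayesian_fdr_threshold(p_hat, alpha=0.10)
```

With these made-up inputs the running averages of $1 - \hat{p}$ are approximately 0.01, 0.03, 0.053, 0.14 and 0.272, so $s = 3$, the threshold is 0.9, and components 0, 2 and 4 (zero-based) are flagged.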
Note that in step 4, when flagging regions in data domain, we need to apply a threshold
δ. This is because for smooth functions, there is often no clear cut-off between zero and
Figure 2.1: A diagram for determining the Bayesian FDR threshold ϕα (horizontal axis: index of ordered probabilities; areas labeled A, B, C and D).
2.3 Simulation Study
2.3.1 Simulation Setup
We designed a simulation study to assess the computational benefit of VFMM and to evaluate the potential loss of accuracy when using approximate inference in VFMM. To mimic realistic inter- and intra-function correlations, we generated simulated data that resemble two real datasets described in Section 2.4: the cancer organ-by-cell-line proteomics data and the tensor-based morphometry (TBM) brain imaging data. The first dataset provides an example of functional data supported on a one-dimensional domain, which we refer to as the 1-D case; the second dataset represents functional data measured on a three-dimensional domain, which we refer to as the 3-D case. More details of the reference datasets are available in Section 2.4. In the 1-D case, we simulated data by first fitting the Bayesian FMM to the reference dataset, from which we obtained posterior means of several key parameters in basis space, including the fixed effects $\{b^*_j\}$, the random effect variances $\{q^*_j\}$, and the residual variances $\{s^*_j\}$. We then simulated data by treating these values as the underlying truth.