Variational Functional Mixed Model Framework

We adopt a parsimonious basis representation to transform the FMM to the basis space, which leads to divide-and-conquer computation. Consider the general FMM in (1.2). We assume that the responses {Y_i(t), i = 1, . . . , N} take values in L²(T), where T is a closed subset of R^d, d ≥ 1. Let {ϕ_j}_{j=1}^∞ denote a compactly supported orthonormal basis of L²(T).

We can expand $Y_i(t)$ as $Y_i(t) = \sum_{j=1}^{\infty} d_{ij}\phi_j(t)$, where $d_{ij} = \langle Y_i, \phi_j \rangle = \int_{T} Y_i(t)\phi_j(t)\,dt$. The coefficient sequence $(d_{i1}, d_{i2}, \ldots)$ lies in the space of square-summable sequences, denoted by $\ell^2 = \{ d_j : \sum_{j=1}^{\infty} d_j^2 < \infty \}$. Since Y(t) is written as the linear combination of B(t), U(t) and E(t), it is natural to assume that these unobserved functional components also take values in the same L²(T) space. With this assumption, all functional objects in FMM can be represented by a common basis. Specifically, denote Φ = (ϕ_1, ϕ_2, . . . )^T, then all basis expansions can be represented by linear operations: Y = DΦ, B = B^∗Φ, U = U^∗Φ, and E = E^∗Φ. Model (1.2) becomes DΦ = XB^∗Φ + ZU^∗Φ + E^∗Φ. Since Φ preserves linear operation, the above model is equivalent to the dual space model D = XB^∗ + ZU^∗ + E^∗.

This transforms the FMM from the functional space L²(T) to the dual space ℓ². Since the dual space model consists of discrete sequences only, estimation is much easier to perform.
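As a rough numerical illustration of this equivalence, the sketch below builds a discrete orthonormal Haar basis on a grid of 8 points and checks that projecting Y = XB + ZU + E onto the basis gives D = XB^∗ + ZU^∗ + E^∗. The grid discretization (inner products as plain sums), the dimensions, the random inputs, and all function names are assumptions made for illustration only; they are not taken from the source.

```python
import math
import random


def haar_matrix(n):
    """Rows form the discrete orthonormal Haar basis of length n (n a power of 2)."""
    if n == 1:
        return [[1.0]]
    half = haar_matrix(n // 2)
    s = 1.0 / math.sqrt(2.0)
    # Coarse rows: each half-size basis vector stretched over pairs of grid points.
    top = [[s * v for v in row for _ in (0, 1)] for row in half]
    # Detail rows: local +/- differences on each pair of grid points.
    bottom = []
    for k in range(n // 2):
        row = [0.0] * n
        row[2 * k], row[2 * k + 1] = s, -s
        bottom.append(row)
    return top + bottom


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def add(*matrices):
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


def max_abs_gap(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, T, p, m = 4, 8, 2, 2            # hypothetical sizes: curves, grid points, effects
X, Z = rand_matrix(N, p, rng), rand_matrix(N, m, rng)
B, U, E = rand_matrix(p, T, rng), rand_matrix(m, T, rng), rand_matrix(N, T, rng)

Y = add(matmul(X, B), matmul(Z, U), E)      # data-space model on the grid
W = haar_matrix(T)                          # rows play the role of phi_j
Wt = transpose(W)

D = matmul(Y, Wt)                           # d_ij = <Y_i, phi_j>
dual_rhs = add(matmul(X, matmul(B, Wt)),    # X B*
               matmul(Z, matmul(U, Wt)),    # Z U*
               matmul(E, Wt))               # E*

dual_gap = max_abs_gap(D, dual_rhs)         # D versus X B* + Z U* + E*
recon_gap = max_abs_gap(matmul(D, W), Y)    # Y versus D Phi
```

Both gaps should be at the level of floating-point rounding, which reflects that an orthonormal transform is linear and invertible.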

Furthermore, as the correlations between basis coefficients are often substantially reduced, one can make independence approximations between columns of D, B^∗, U^∗, and E^∗ in the dual space model following the idea of Morris and Carroll [46]. This further divides the dual space model into many independent regular mixed effect models. We denote the jth model by

d_j = Xb^∗_j + Zu^∗_j + e^∗_j, j = 1, . . . , n (2.1)

where d_j, b^∗_j, u^∗_j and e^∗_j denote the jth columns of D, B^∗, U^∗ and E^∗ respectively. While the above divide-and-conquer strategy works with any orthonormal basis, we focus only on compactly supported orthonormal bases such as Haar wavelets, Daubechies wavelets, and spherical wavelets. These bases capture local features of functional data, enable parsimonious representations that allow further compression, and have discrete transforms that are fast to compute. They are generally applicable to curves, images, surfaces, etc. As we will explain in Section 2.2.2, using a compactly supported orthonormal basis allows us to identify interesting local regions by performing testing in the basis space only, avoiding the need to inverse-transform many posterior samples back to the data domain.

For the jth model in (2.1), we slightly modify the priors and the random effect/residual distributions proposed by Morris and Carroll [46] to enable efficient variational Bayes computation. In particular, denote d_j = (d_1j, . . . , d_Nj)^T, b^∗_j = (b^∗_1j, . . . , b^∗_pj)^T, u^∗_j = (u^∗_1j, . . . , u^∗_mj)^T, and e^∗_j = (e^∗_1j, . . . , e^∗_Nj)^T. Specifically, our model can be written as

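$$
b^*_{i,j} \sim \gamma^*_{i,j}\, N(0,\, q^*_j \tau_{i,j}) + (1 - \gamma^*_{i,j})\,\delta_0, \qquad
\gamma^*_{i,j} \sim \mathrm{Bernoulli}(\pi_j), \qquad
q^*_j \sim \mathrm{IG}(a_j, b_j),
$$
$$
u^*_j \sim N(0,\, q^*_j I), \qquad
e^*_j \sim N(0,\, q^*_j \zeta_j I).
$$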
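To make the hierarchy concrete, here is a minimal sketch that draws one realization of (q^∗_j, γ^∗_{i,j}, b^∗_{i,j}, u^∗_j, e^∗_j) from the distributions above for a single index j. It assumes that IG(a_j, b_j) is the shape/scale parameterization (the source does not state which one is used), and the function name and all numerical values are hypothetical.

```python
import math
import random


def sample_prior(p, m, N, pi_j, tau, zeta_j, a_j, b_j, rng):
    """Draw one realization from the spike-and-slab hierarchy for a fixed j."""
    # Assumption: IG(a, b) is shape/scale, i.e. 1/q ~ Gamma(shape=a, rate=b).
    # random.gammavariate takes (shape, scale), so the scale is 1/b.
    q = 1.0 / rng.gammavariate(a_j, 1.0 / b_j)
    # Inclusion indicators gamma_i ~ Bernoulli(pi_j).
    gamma = [1 if rng.random() < pi_j else 0 for _ in range(p)]
    # Slab N(0, q * tau_i) when gamma_i = 1, point mass at zero otherwise.
    b = [rng.gauss(0.0, math.sqrt(q * tau[i])) if gamma[i] else 0.0
         for i in range(p)]
    # Random effects and residuals share the variance factor q.
    u = [rng.gauss(0.0, math.sqrt(q)) for _ in range(m)]
    e = [rng.gauss(0.0, math.sqrt(q * zeta_j)) for _ in range(N)]
    return {"q": q, "gamma": gamma, "b": b, "u": u, "e": e}


draw = sample_prior(p=5, m=3, N=10, pi_j=0.3, tau=[4.0] * 5,
                    zeta_j=0.5, a_j=3.0, b_j=2.0, rng=random.Random(1))
```

Every excluded coefficient (γ = 0) is exactly zero in the draw, which is the sparsity the point mass δ₀ is meant to encode.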

2.2. Model 25

Here, we have factored out the random effect variance q^∗_j from the prior variance of fixed effect b^∗_i,j and the residual variance. This new parameterization allows for convenient update of the approximate distribution of q_j^∗. Based on the above model setup, the joint posterior distribution of {b^∗i,j}, {γi,j^∗ } and q^∗j can be written as

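$$
p(\{b^*_{i,j}\}, \{\gamma^*_{i,j}\}, q^*_j \mid d_j, \zeta_j, \tau_{i,j}, \pi_j)
\propto p(d_j \mid \{b^*_{i,j}\}, q^*_j, \zeta_j)\, p(q^*_j) \prod_{i=1}^{p} p(b^*_{i,j} \mid \gamma^*_{i,j}, \tau_{i,j})\, p(\gamma^*_{i,j} \mid \pi_j). \tag{2.2}
$$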

We treat π_j, τ_i,j, ζ_j, and (a_j, b_j) as hyperparameters and make mean-field assumptions for the approximate distributions of {b^∗_i,j}, {γ_i,j^∗}, and q_j^∗. This enables an efficient variational EM algorithm. In particular, we assume that the approximate distribution can be factored as follows:

$$
q(\{b^*_{i,j}\}, \{\gamma^*_{i,j}\}, q^*_j) = q(q^*_j) \prod_{i=1}^{p} q(b^*_{i,j} \mid \gamma^*_{i,j})\, q(\gamma^*_{i,j}). \tag{2.3}
$$

As the conditional posteriors of {b^∗_i,j}, {γ_i,j^∗}, and q_j^∗ all fall in the exponential family, we follow Blei [4] by assuming that each factor in the above approximate distribution also falls in the same exponential family. Therefore, in the E-step, estimating the approximate distribution reduces to estimating the natural parameters of the exponential family. This facilitates fast calculation of the approximate distributions.

In the M-step, conditional on the estimation of the approximate distributions, we update values of the hyperparameters πj, τi,j and ζj by directly maximizing the ELBO. Specifically, we are able to analytically solve the value of π_j and τ_i,j by setting the first derivative of ELBO to zero. The value of ζ_j, however, needs to be searched by using an optimization algorithm.

In general, the ELBO is neither a convex nor a concave function of ζ_j because ζ_j is contained in the inverse of a covariance matrix. Thus, an optimization algorithm may converge to a local optimum. The values of the hyperparameters (a_j, b_j) are determined by matching the mean of the inverse-Gamma prior with the initial estimate of q^∗_j while setting the prior variance to be fairly large (e.g., 10³).
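The moment matching for (a_j, b_j) can be sketched as follows. The sketch assumes the shape/scale parameterization of the inverse-Gamma distribution, for which the mean is b/(a − 1) (for a > 1) and the variance is b²/((a − 1)²(a − 2)) (for a > 2). Solving these two equations for a target mean and variance gives a = 2 + mean²/variance and b = mean(a − 1). The function name and the example value of the initial estimate are hypothetical.

```python
def inverse_gamma_hyperparameters(q_init, prior_var=1e3):
    """Match IG(a, b) (shape/scale, an assumption) to a given mean and variance."""
    if q_init <= 0 or prior_var <= 0:
        raise ValueError("q_init and prior_var must be positive")
    a = 2.0 + q_init ** 2 / prior_var   # from variance = mean^2 / (a - 2)
    b = q_init * (a - 1.0)              # from mean = b / (a - 1)
    return a, b


a_j, b_j = inverse_gamma_hyperparameters(q_init=2.0)   # hypothetical initial estimate
prior_mean = b_j / (a_j - 1.0)
prior_variance = b_j ** 2 / ((a_j - 1.0) ** 2 * (a_j - 2.0))
```

With a large prior variance the shape a_j sits just above 2, so the prior is centered on the initial estimate but remains diffuse.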

Based on the full conditional distribution in (2.2), the conditional distribution of dj is:

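$$
d_j \mid (\{b^*_{\setminus i,j}\}, \gamma^*_{\setminus i,j}, \gamma^*_{i,j} = 1) \sim N\big(X_{(-i)} b^*_{(-i),j},\; \Sigma_j + X_{(i)} X_{(i)}^{T} \tau_{i,j}\big), \tag{2.4}
$$
$$
d_j \mid (\{b^*_{\setminus i,j}\}, \gamma^*_{\setminus i,j}, \gamma^*_{i,j} = 0) \sim N\big(X_{(-i)} b^*_{(-i),j},\; \Sigma_j\big). \tag{2.5}
$$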
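The two Gaussian distributions in (2.4) and (2.5) differ only by a rank-one term in the covariance, so their log density ratio has a closed form. The sketch below uses the shorthand x = X_(i), τ = τ_{i,j}, Σ = Σ_j and r = d_j − X_(−i) b^∗_{(−i),j}; this shorthand is introduced here for illustration and is not the source's notation. It applies the matrix determinant lemma and the Sherman-Morrison formula.

```latex
\begin{align*}
|\Sigma + \tau x x^{T}| &= |\Sigma|\,\bigl(1 + \tau\, x^{T}\Sigma^{-1}x\bigr), \\
(\Sigma + \tau x x^{T})^{-1} &= \Sigma^{-1}
  - \frac{\tau\,\Sigma^{-1} x x^{T} \Sigma^{-1}}{1 + \tau\, x^{T}\Sigma^{-1}x}, \\
\log\frac{N(r;\,0,\,\Sigma + \tau x x^{T})}{N(r;\,0,\,\Sigma)}
  &= -\tfrac{1}{2}\log\bigl(1 + \tau\, x^{T}\Sigma^{-1}x\bigr)
   + \frac{\tau\, r^{T}\Sigma^{-1} x x^{T} \Sigma^{-1} r}{2\,\bigl(1 + \tau\, x^{T}\Sigma^{-1}x\bigr)}.
\end{align*}
```

These are the same two data-dependent terms that appear in the log-odds (2.8). The additional factor 1/q^∗_j in (2.8) would arise if both covariances carry a common scale q^∗_j, which is a reading of the model specification rather than something stated explicitly in (2.4) and (2.5).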

Rather than modeling the distributions of γ_i,j^∗ and b^∗_i,j separately, it is more reasonable to estimate the joint distribution of (γ_i,j^∗, b^∗_i,j). We model this joint distribution through q(γ_i,j^∗) and q(b^∗_i,j | γ_i,j^∗). Based on the distributions in (2.4) and (2.5) and the prior distribution of γ_i,j^∗, the approximate distribution of γ_i,j^∗ is a Bernoulli distribution, i.e.,

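$$
q(\gamma^*_{i,j}) = \mathrm{Bernoulli}(\tilde{\pi}_{i,j}), \tag{2.6}
$$

where

$$
\tilde{\pi}_{i,j} = \frac{1}{1 + \exp(-O_{i,j})}, \tag{2.7}
$$

$$
O_{i,j} = \log\left\{\frac{\pi_j}{1 - \pi_j}\right\}
- \frac{1}{2}\log\bigl(1 + X_{(i)}^{T}\Sigma_j^{-1}X_{(i)}\tau_{i,j}\bigr)
+ E_q\left[\frac{d_j^{*T}\,\Sigma_j^{-1}X_{(i)}X_{(i)}^{T}\Sigma_j^{-1}\,\tau_{i,j}\; d^*_j}{2\,q^*_j\bigl(1 + \tau_{i,j}X_{(i)}^{T}\Sigma_j^{-1}X_{(i)}\bigr)}\right], \tag{2.8}
$$

where $d^*_j = d_j - \sum_{l \neq i} X_{(l)} b^*_{l,j}$ and $E_q$ denotes the expectation of a random variable with respect to the distribution q.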
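A minimal sketch of how (2.7) and (2.8) could be evaluated is given below. It is a simplification under a stated assumption: the expectation E_q is replaced by a plug-in evaluation at point values of d^∗_j and q^∗_j, which ignores the posterior uncertainty that the full update accounts for. It also uses the fact that d^{∗T}Σ⁻¹X_(i)X_(i)^TΣ⁻¹d^∗ equals (X_(i)^TΣ⁻¹d^∗)², since X_(i)^TΣ⁻¹d^∗ is a scalar. The function and argument names are hypothetical.

```python
import math


def log_odds(pi_j, tau_ij, s, r, q_j):
    """Plug-in version of O_ij in (2.8); NOT the full expectation E_q.

    s   : scalar X_(i)^T Sigma_j^{-1} X_(i)
    r   : scalar X_(i)^T Sigma_j^{-1} d*_j
    q_j : point value standing in for q*_j (assumption)
    """
    prior_term = math.log(pi_j / (1.0 - pi_j))
    complexity_penalty = -0.5 * math.log1p(tau_ij * s)
    fit_term = tau_ij * r * r / (2.0 * q_j * (1.0 + tau_ij * s))
    return prior_term + complexity_penalty + fit_term


def inclusion_probability(o):
    """Logistic transform in (2.7), written to avoid overflow for large |o|."""
    if o >= 0:
        return 1.0 / (1.0 + math.exp(-o))
    z = math.exp(o)
    return z / (1.0 + z)


# Hypothetical inputs: a weak and a strong projected residual signal.
weak = inclusion_probability(log_odds(pi_j=0.2, tau_ij=5.0, s=1.0, r=0.1, q_j=1.0))
strong = inclusion_probability(log_odds(pi_j=0.2, tau_ij=5.0, s=1.0, r=4.0, q_j=1.0))
```

The log term acts as a penalty on including the effect, while the quadratic term rewards effects whose design column aligns with the partial residual, so a larger r gives a larger inclusion probability.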


By (2.2), taking expectations with respect to (2.6) and (2.9), the approximate distribution for q_j^∗ can be written as

The above steps constitute the E-step, which updates the approximate distributions of the parameters. To finish one iteration of the algorithm, we still need to update the hyperparameters π_j, τ_i,j and ζ_j. Specifically, we use optimization techniques to update these hyperparameters. For example, maximizing ELBO = E_q[log p(d_j, {b^∗_i,j}, {γ_i,j^∗}, q^∗_j)] − E_q[log q({b^∗_i,j}, {γ_i,j^∗}, q^∗_j)] yields explicit solutions for π_j and τ_i,j, which are given by

π_j =

After integrating and substituting (2.11) and (2.12) into the ELBO, the remaining hyperparameter ζ_j can be obtained by maximizing the following objective function:

− log | 2πΣj | −ae_j

Since there is no explicit solution for (2.13), we use a hill-climbing algorithm in MATLAB to perform the optimization. Because the matrix inversion in (2.13) is computationally expensive and sometimes unstable in terms of convergence, we show in Appendix A.4 that the expression can be simplified to avoid matrix inversion, which speeds up the computation. We list the steps of the VFMM algorithm in Algorithm 2; more technical details can be found in Appendix A.4. To ensure fast convergence, we adopt Henderson’s mixed model equations [56, pages 275-286] to initialize the parameters. While options for additional parameter settings and detailed tuning are available, the only required inputs are the observed data Y and the design matrices X and Z. As the calculation is performed independently across the index j, Algorithm 2 can either be implemented with vectorized calculations or be distributed across multiple cores.
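For readers unfamiliar with hill climbing, the following is a generic one-dimensional sketch with step halving. It is not the MATLAB routine used by the authors, and because the objective in (2.13) is not reproduced here, a toy concave function stands in for it. The positivity bound on the search variable is an assumption that ζ_j behaves like a variance ratio. All names are hypothetical.

```python
def hill_climb(objective, x0, step=1.0, tol=1e-8, lower=1e-12, max_iter=10_000):
    """Maximize a scalar objective by moving to a better neighbor, halving the
    step when neither neighbor improves. Finds a local optimum only."""
    x, fx = x0, objective(x0)
    for _ in range(max_iter):
        if step < tol:
            break
        moved = False
        for candidate in (x + step, x - step):
            if candidate <= lower:          # keep the parameter positive (assumption)
                continue
            f_candidate = objective(candidate)
            if f_candidate > fx:            # accept strict improvements only
                x, fx, moved = candidate, f_candidate, True
                break
        if not moved:
            step *= 0.5                     # refine the search around x
    return x, fx


# Toy stand-in for the objective in (2.13): maximized at z = 2.
best_zeta, best_value = hill_climb(lambda z: -(z - 2.0) ** 2, x0=0.5)
```

Because only local moves are accepted, the result depends on the starting point when the objective has several local optima, which is consistent with the caveat that the ELBO is neither convex nor concave in ζ_j.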

Algorithm 2: The VFMM algorithm

Initialize all parameters;
while ELBO^(t) − ELBO^(t−1) > δ do
    for all jk do
        Update q(γ_i,jk^∗, b_i,jk^∗) in (2.6) and (2.9);
        Update q(q_jk^∗) in (2.10);
        Update ELBO^(t);
        Update π_jk, τ_i,jk in (2.11) and (2.12);
        Update ζ_jk in (2.13) by hill-climbing algorithm;
    end
    Update ELBO^(t);
end
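The control flow of the algorithm can be sketched as below. This is a structural illustration only: the actual E-step and M-step updates are not reproduced, so a toy update and a toy ELBO stand in for them. As a simplifying variant, each coefficient index is iterated to convergence on its own ELBO instead of using one global stopping rule. All function names are hypothetical, and threads are used only to show how independent indices could be dispatched; in CPython a process pool would be needed for real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_one_coefficient(init_state, update, elbo, delta=1e-6, max_iter=500):
    """Iterate `update` until the ELBO gain drops to delta or below."""
    state, previous = init_state, elbo(init_state)
    for _ in range(max_iter):
        state = update(state)               # placeholder for E-step + M-step
        current = elbo(state)
        if current - previous <= delta:     # same stopping rule as the while loop
            break
        previous = current
    return state


def fit_all(states, update, elbo, workers=4):
    """Run the per-coefficient fits independently across indices."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: fit_one_coefficient(s, update, elbo), states))


# Toy stand-ins (assumptions): the "ELBO" peaks at 3 and each update moves
# the state halfway toward that peak.
def toy_elbo(x):
    return -(x - 3.0) ** 2


def toy_update(x):
    return x + 0.5 * (3.0 - x)


fitted = fit_all([0.0, 1.0, 5.0], toy_update, toy_elbo)
```

Since the models for different indices share no parameters, the results do not depend on the order or grouping in which the indices are processed.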


In document Bayesian Modeling of Complex High-Dimensional Data (Page 36-42)