Simulation Results - Bayesian Modeling of Complex High-Dimensional Data


3.3.2 Simulation Results

Here, we summarize the simulation results using the statistics introduced in Section 3.3.1. Table 3.1 reports the posterior estimates: the 2.5%, 50%, and 97.5% quantiles of the regression coefficients c₀ and c₁ after the burn-in period. The 95% credible interval for the regression parameter c₁ is [−2.2270, −1.4199], which does not cover 0. The significance of c₁ reveals differences in data heterogeneity across the two groups. Specifically, data in the homogeneous group yield the estimate E(exp(c₀)) = 2.376, whereas data in the heterogeneous group yield E(exp(c₀ + c₁)) = 0.409, which corresponds to more locally clustered, clumpy patterns than in the homogeneous group. We conclude that, in our simulation study, the significance of the regression coefficient c₁ indicates a difference in heterogeneity patterns between the two groups.
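The two group-level estimates above are posterior means taken on the exponential scale. As a minimal sketch, assuming the retained draws of c₀ and c₁ are available as equal-length arrays and that z = 0 corresponds to the homogeneous group and z = 1 to the heterogeneous group (inferred from the estimates quoted above), the quantities could be computed as follows. The function name `group_exp_means` is hypothetical and not taken from the original implementation.

```python
import numpy as np

def group_exp_means(c0_draws, c1_draws):
    """Posterior means of exp(c0) and exp(c0 + c1) from paired MCMC draws.

    Assumption: draw i of c0 and draw i of c1 come from the same iteration,
    so the sum c0 + c1 is formed per draw before exponentiating.
    """
    c0 = np.asarray(c0_draws, dtype=float)
    c1 = np.asarray(c1_draws, dtype=float)
    if c0.shape != c1.shape:
        raise ValueError("c0 and c1 draws must have the same shape")
    return {
        "homogeneous": float(np.mean(np.exp(c0))),        # z = 0
        "heterogeneous": float(np.mean(np.exp(c0 + c1))),  # z = 1
    }
```

Averaging exp(·) over draws is generally not the same as exponentiating the posterior median, which is one reason E(exp(c₀)) need not equal exp of the 50% quantile of c₀ in Table 3.1.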

                     Quantile
Parameter    2.5%       50%        97.5%
c₀            0.6386     0.8041     1.2013
c₁           -2.2270    -1.6980    -1.4199
{t1}          0.2236     0.7631     0.9492
{t2}          0.6462     0.9963     0.9998

Table 3.1: Summary of posterior samples in the simulation. Here, {t1} denotes the divergence time in the homogeneous group, and {t2} denotes that in the heterogeneous group.
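A quantile summary of this kind can be obtained directly from the retained posterior draws. The sketch below is illustrative only: the draws are synthetic stand-ins generated from a normal distribution, not the samples behind Table 3.1, and the helper names `summarize_draws` and `interval_excludes_zero` are hypothetical.

```python
import numpy as np

def summarize_draws(draws, probs=(0.025, 0.5, 0.975)):
    """Return the requested quantiles of a 1-D array of posterior draws."""
    draws = np.asarray(draws, dtype=float)
    return np.quantile(draws, probs)

def interval_excludes_zero(lower, upper):
    """True if the interval [lower, upper] lies entirely on one side of 0."""
    return lower > 0.0 or upper < 0.0

# Synthetic stand-in for post-burn-in draws of c1 (assumption, not real output).
rng = np.random.default_rng(0)
c1_draws = rng.normal(loc=-1.7, scale=0.2, size=600)

lower, median, upper = summarize_draws(c1_draws)
print(f"2.5%: {lower:.4f}  50%: {median:.4f}  97.5%: {upper:.4f}")
print("interval excludes 0:", interval_excludes_zero(lower, upper))
```

Applied to the interval reported in the table, `interval_excludes_zero(-2.2270, -1.4199)` evaluates to `True`, matching the statement that the 95% credible interval for c₁ does not cover 0.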

Besides the summary statistics for c₀ and c₁, Table 3.1 also displays the same quantiles for the divergence time of the posterior latent trees across all posterior samples. The 95% credible interval for the divergence time is [0.6462, 0.9998] in the heterogeneous group and [0.2236, 0.9492] in the homogeneous group. Figure 3.3 plots histograms of the divergence time. The distribution for the heterogeneous group is visibly more skewed, with its mode concentrated near 1. Latent trees thus tend to diverge later (near the leaf nodes) in the heterogeneous group and earlier (near the root) in the homogeneous group. Intuitively, as shown in the left panel of Figure 3.2, data points in the heterogeneous group should have an intensity function with more local modes, which corresponds to shorter segment lengths near the terminal nodes of a latent tree.

3.3. Simulation Study 65

[Figure: "Histogram of Divergence Time". x-axis: Divergence Time; y-axis: Frequency; legend (group): heterogeneous, homogeneous.]

Figure 3.3: Histograms of divergence time in posterior tree structures in the homogeneous group and the heterogeneous group.

[Figure: latent tree plot with a branch.length color legend.]

Figure 3.4: One posterior sample of a latent tree in the homogeneous group.


[Figure: latent tree plot with a branch.length color legend.]

Figure 3.5: One posterior sample of a latent tree in the heterogeneous group.

In addition to the summary statistics shown in Table 3.1, we plotted one posterior sample of a latent tree from the homogeneous group in Figure 3.4 and one from the heterogeneous group in Figure 3.5. In both figures, a heat map indicates the branch lengths: green represents shorter branches, while red highlights longer branches. Comparing Figures 3.4 and 3.5, we see that the latent tree from the homogeneous group has longer branches and sub-branches overall, and its splits seem to happen earlier. In contrast, the latent tree from the heterogeneous group tends to diverge at a later stage and to have shorter sub-branches. These results demonstrate that both the regression coefficients and the latent tree structures can help reveal data heterogeneity patterns and identify differences across groups of samples.

Regarding computation, we parallelized the tree-structure updates over 30 cores on a Linux server equipped with an Intel(R) Xeon(R) CPU E5-4627v2 @ 3.30GHz and 252 GB of RAM. Running 15,000 iterations took about 4.84 hours.

3.4 Real Data Application

We applied the proposed model to the brain tumor data described in Section 1.2.3. We focused on the tumor regions segmented from T2-weighted images. Each observation consists of a point cloud of pixel intensities, stored as a vector. These pixel intensities are extracted from a 2D slice of the tumor region, taken along the axial direction of the brain. In this study, each observation contains around 15% of the pixel intensities, randomly sampled from the 2D tumor region. As a result, the number of points per observation varies between 66 and 1043. Our data consist of 63 segmented tumor samples, of which 37 belong to patients with long survival time and 26 belong to patients with short survival time. By applying the proposed model, we aim to characterize the heterogeneity structures of the pixel intensities and to identify differences in data heterogeneity between the long-survival and short-survival patient groups.

We implemented the proposed MCMC algorithm with the DDT parameter fixed at σ = 5; fixing σ helps improve the mixing of the chain. We set z_i = 1 if the sample belongs to a patient with long survival time and z_i = 0 otherwise, so the regression coefficient c₁ measures the difference between the long-survival and short-survival groups. The step that updates the latent trees was parallelized. A total of 15,000 MCMC iterations were run, of which the first 12,000 were treated as the burn-in period. We then thinned the chain by keeping one sample from every 5 iterations, which yields 600 posterior samples for each parameter. The computation was performed on a 30-core Linux server equipped with an Intel(R) Xeon(R) CPU E5-4627v2 @ 3.30GHz and 252 GB of RAM. We summarized the posterior estimates in the same way as in the simulation study in Section 3.3.2. Table 3.2 displays the 2.5%, 50%, and 97.5% quantiles of the model parameters c₀ and c₁ and of the divergence time of the latent trees after the burn-in period.
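The sample count follows from the burn-in and thinning settings: (15,000 − 12,000) / 5 = 600. A minimal sketch of this post-processing step is given below, assuming the chain is stored as an array indexed by iteration. The helper `burn_and_thin` is hypothetical, and whether the original code keeps the first or the last draw of each block of 5 is not stated in the text; this version keeps the first.

```python
import numpy as np

def burn_and_thin(chain, burn_in, thin):
    """Drop the first `burn_in` iterations, then keep every `thin`-th draw."""
    chain = np.asarray(chain)
    return chain[burn_in::thin]

n_iter, burn_in, thin = 15_000, 12_000, 5

# Stand-in chain: iteration indices instead of real parameter draws.
chain = np.arange(n_iter)
kept = burn_and_thin(chain, burn_in, thin)
print(len(kept))  # 600
```

The same slicing applies unchanged to a two-dimensional array whose first axis indexes iterations, so all parameters can be thinned in a single call.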

                     Quantile
Parameter    2.5%       50%        97.5%
c₀            0.3222     0.5282     0.8427
c₁            0.2143     0.4721     0.9579
{t1}          0.4658     0.8439     0.9790
{t2}          0.5138     0.9019     0.9927

Table 3.2: Summary of posterior samples in the real data application. Here, {t1} denotes the divergence time of latent trees for observations in the long-survival group, and {t2} denotes that in the short-survival group.

From Table 3.2, we observe that the 95% credible interval for c₁ is [0.2143, 0.9579]. This reveals a slightly positive effect for z_i = 1, i.e., observations from the long-survival group

[Figure: "Histogram of Divergence Time". x-axis: Divergence Time; y-axis: Frequency; legend (name): long survival, short survival.]

Figure 3.6: Histograms of divergence time in posterior tree structures.


[Figure: latent tree plot with a branch.length color legend.]

Figure 3.7: One posterior latent tree sample of one randomly selected long-survival patient.

[Figure: latent tree plot with a branch.length color legend.]

Figure 3.8: One posterior latent tree sample of one randomly selected short-survival patient.
