3.2 Model
3.3.2 Simulation Results
Here, we summarize the simulation results using the statistics introduced in Section 3.3.1. Table 3.1 reports the posterior estimates, namely the 2.5%, 50%, and 97.5% quantiles of the regression coefficients c0 and c1 after the burn-in period. The 95% credible interval for the regression parameter c1 is [−2.2270, −1.4199], which does not cover 0. The significance of c1 reveals differences in data heterogeneity across the two groups. Specifically, data in the homogeneous group produce an estimate E(exp(c0)) = 2.376, whereas data in the heterogeneous group produce an estimate E(exp(c0 + c1)) = 0.409, which corresponds to more locally clustered, clumpy patterns than in the homogeneous group. We conclude that, in our simulation study, the significance of the regression coefficient c1 indicates a difference in heterogeneity patterns between the two groups.
            Quantile
Parameter   2.5%      50%       97.5%
c0          0.6386    0.8041    1.2013
c1         -2.2270   -1.6980   -1.4199
{t1}        0.2236    0.7631    0.9492
{t2}        0.6462    0.9963    0.9998
Table 3.1: Summary of posterior samples in the simulation. Here, {t1} denotes the divergence time in the homogeneous group, and {t2} denotes that in the heterogeneous group.
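The quantile summaries and the credible-interval check reported above can be sketched as follows. This is a minimal illustration, not the code used in the study: the draws are synthetic stand-ins generated from normal distributions chosen only to resemble Table 3.1, and the helper names `summarize` and `excludes_zero` are hypothetical.

```python
import numpy as np

def summarize(draws, probs=(2.5, 50.0, 97.5)):
    """Return the requested percentiles of a 1-D array of posterior draws."""
    return np.percentile(np.asarray(draws, dtype=float), probs)

def excludes_zero(lower, upper):
    """True if the interval [lower, upper] lies strictly on one side of 0."""
    return lower > 0.0 or upper < 0.0

# Synthetic stand-ins for post-burn-in draws (assumption: NOT the study's samples).
rng = np.random.default_rng(0)
c0 = rng.normal(0.8, 0.15, size=600)
c1 = rng.normal(-1.7, 0.2, size=600)

q_c0 = summarize(c0)
q_c1 = summarize(c1)
print("c0 quantiles:", np.round(q_c0, 4))
print("c1 quantiles:", np.round(q_c1, 4))
print("95% interval for c1 excludes 0:", excludes_zero(q_c1[0], q_c1[2]))

# Posterior means of the exponentiated predictors are averages over draws,
# not the exponential of a single point estimate.
mean_hom = np.exp(c0).mean()       # analogue of E(exp(c0))
mean_het = np.exp(c0 + c1).mean()  # analogue of E(exp(c0 + c1))
print("E(exp(c0)) ~", round(mean_hom, 3), " E(exp(c0 + c1)) ~", round(mean_het, 3))
```

Applying `excludes_zero` to the interval reported for c1, [−2.2270, −1.4199], returns `True`, which matches the statement that the interval does not cover 0.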
Besides the summary statistics for c0 and c1, Table 3.1 also displays the same quantiles for the divergence time of the latent trees across all posterior samples. The 95% credible interval for the divergence time is [0.6462, 0.9998] in the heterogeneous group and [0.2236, 0.9492] in the homogeneous group. Figure 3.3 plots histograms of the divergence time for better visualization. The distribution for the heterogeneous group is clearly more skewed, with its mode near 1. Our investigation reveals that latent trees tend to diverge later (near the leaf nodes) in the heterogeneous group and earlier (near the root) in the homogeneous group. Intuitively, as shown in the left panel of Figure 3.2, data points in the heterogeneous group should have an intensity function with more local modes, which corresponds to shorter segment lengths near the terminal nodes of a latent tree.
[Figure: "Histogram of Divergence Time"; x-axis: Divergence Time; y-axis: Frequency; legend (group): heterogeneous, homogeneous.]
Figure 3.3: Histograms of divergence time in posterior tree structures in the homogeneous group and the heterogeneous group.
[Figure: latent tree plot; color legend: branch.length.]
Figure 3.4: One posterior sample of a latent tree in the homogeneous group.
[Figure: latent tree plot; color legend: branch.length.]
Figure 3.5: One posterior sample of a latent tree in the heterogeneous group.
In addition to the summary statistics in Table 3.1, we plot one posterior sample of a latent tree from the homogeneous group in Figure 3.4 and one from the heterogeneous group in Figure 3.5. In both figures, a heat map indicates branch length: green represents shorter branches, while red highlights longer branches. Comparing Figures 3.4 and 3.5, we see that overall the latent tree from the homogeneous group has longer branches and sub-branches, and its splits tend to happen earlier. In contrast, the latent tree from the heterogeneous group tends to diverge at a later stage and to have shorter sub-branches. These results demonstrate that both the regression coefficients and the latent tree structures can help reveal data heterogeneity patterns and identify differences across groups of samples.
Regarding computation, we parallelized the tree-structure updates over 30 cores on a Linux server equipped with an Intel(R) Xeon(R) CPU E5-4627v2 @ 3.30GHz and 252G of RAM. It took about 4.84 hours to finish 15,000 iterations.
3.4 Real Data Application
We applied the proposed model to the brain tumor data described in Section 1.2.3. We focused on the tumor regions segmented from T2-weighted images. Each observation consists of a point cloud of pixel intensities, stored as a vector. These pixel intensities are extracted from a 2D slice of the tumor region, taken along the axial direction of the brain. In this study, each observation contains around 15% of the pixel intensities, randomly sampled from the 2D tumor region. As a result, the number of points per observation varies between 66 and 1043. Our data consist of 63 segmented tumor samples, of which 37 belong to patients with long survival time and 26 belong to patients with short survival time. By applying the proposed model, we aim to characterize the heterogeneity structures of the pixel intensities and identify differences in data heterogeneity between the long-survival and short-survival patient groups.
We implemented the proposed MCMC algorithm with the DDT parameter fixed at σ = 5. Fixing σ helps improve the mixing of the chain. We set zi = 1 if the sample belongs to a patient with long survival time and zi = 0 otherwise. Therefore, the regression coefficient c1 measures the difference between the long-survival and short-survival groups. The step that updates the latent trees was parallelized.
A total of 15,000 MCMC iterations were run, of which the first 12,000 were treated as the burn-in period. We then thinned the chain by keeping every 5th iteration, which yields 600 posterior samples for each parameter. The computation was performed on a 30-core Linux server equipped with an Intel(R) Xeon(R) CPU E5-4627v2 @ 3.30GHz and 252G of RAM. We summarize the posterior estimates in the same way as in the simulation study in Section 3.3.2. Table 3.2 displays the 2.5%, 50%, and 97.5% quantiles of the model parameters c0 and c1 and of the divergence time of the latent trees after the burn-in period.
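The burn-in and thinning arithmetic above can be checked with a short sketch. It assumes 0-based iteration indices and that the first retained draw is the first iteration after burn-in; the text does not specify the exact offset, and the function name `retained_indices` is hypothetical.

```python
import numpy as np

N_ITER = 15_000   # total MCMC iterations
BURN_IN = 12_000  # iterations discarded as burn-in
THIN = 5          # keep every 5th post-burn-in iteration

def retained_indices(n_iter, burn_in, thin):
    """0-based indices of the iterations kept after burn-in and thinning."""
    return np.arange(burn_in, n_iter, thin)

kept = retained_indices(N_ITER, BURN_IN, THIN)
print(len(kept))  # 600

# Usage on a stored chain (a placeholder array here, not real output):
chain = np.zeros(N_ITER)
posterior_sample = chain[kept]
```

With these settings, (15,000 − 12,000) / 5 = 600 draws are retained per parameter, in agreement with the count stated above.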
            Quantile
Parameter   2.5%      50%       97.5%
c0          0.3222    0.5282    0.8427
c1          0.2143    0.4721    0.9579
{t1}        0.4658    0.8439    0.9790
{t2}        0.5138    0.9019    0.9927
Table 3.2: Summary of posterior samples in the real data application. Here, {t1} denotes the divergence time of latent trees for observations in the long-survival group, and {t2} denotes that in the short-survival group.
From Table 3.2, we observe that the 95% credible interval for c1 is [0.2143, 0.9579]. This reveals a slightly positive effect for zi = 1, i.e., observations from the long-survival group
[Figure: "Histogram of Divergence Time"; x-axis: Divergence Time; y-axis: Frequency; legend (name): long survival, short survival.]
Figure 3.6: Histograms of divergence time in posterior tree structures.
[Figure: latent tree plot; color legend: branch.length.]
Figure 3.7: One posterior latent tree sample of one randomly selected long-survival patient.
[Figure: latent tree plot; color legend: branch.length.]
Figure 3.8: One posterior latent tree sample of one randomly selected short-survival patient.