CHAPTER 3: JOINT MODELING OF MIXED SCALE VARIABLES USING MODULARIZED
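where $\tilde{f}_j(y_{ij}) = \sum_{h_j=1}^{d'_j} \hat{\pi}_{h_j} K_j(y_{ij} \mid \hat{\theta}_{h_j j})$.
A mutual information measure is computed at each iteration of a sampler and summarized by the mean, $\hat{\zeta}_{j_1 j_2} = \frac{1}{T} \sum_{t=1}^{T} \zeta_{j_1 j_2, t}$, for all pairs of variables. For testing, $\widehat{\Pr}(H_{1 j_1 j_2} \mid Y) = \frac{1}{T} \sum_{t=1}^{T} 1(\zeta_{j_1 j_2, t} > 0)$, where $H_{1 j_1 j_2}$ is the alternative hypothesis to $H_{0 j_1 j_2}: Y_{j_1} \perp Y_{j_2}$. For defining dependence, we flag pairs of variables as dependent if $\widehat{\Pr}(H_{1 j_1 j_2} \mid Y) > 1 - \alpha$.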
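The summary and flagging rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `flag_dependent_pairs`, the array layout (one row per stored iteration, one column per variable pair), and the toy numbers are all hypothetical and are not taken from the study.

```python
import numpy as np

def flag_dependent_pairs(zeta, alpha=0.05):
    """Summarize mutual information draws and flag dependent pairs.

    zeta  : array of shape (T, P); zeta[t, k] is the mutual information
            measure for variable pair k at stored iteration t (assumed layout).
    alpha : flag a pair when the estimated Pr(H1 | Y) exceeds 1 - alpha.
    """
    zeta = np.asarray(zeta, dtype=float)
    zeta_hat = zeta.mean(axis=0)         # posterior mean per pair
    prob_h1 = (zeta > 0).mean(axis=0)    # share of draws with zeta > 0
    flagged = prob_h1 > 1 - alpha        # decision rule from the text
    return zeta_hat, prob_h1, flagged

# Toy draws (hypothetical): 4 stored iterations, 2 variable pairs.
zeta_demo = np.array([[0.2, 0.0],
                      [0.3, 0.0],
                      [0.1, 0.1],
                      [0.4, 0.0]])
zeta_hat, prob_h1, flagged = flag_dependent_pairs(zeta_demo, alpha=0.05)
print(zeta_hat, prob_h1, flagged)
```

In this toy input the first pair is positive in every draw and is flagged, while the second is positive in only one of four draws and is not.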
3.3 Simulation Study
The performance of the proposed method was assessed via a simulation study in which we compared MOTEF with the MPK. There is no standard or straightforward way of modeling a mixed-scale distribution, so we defaulted to the MPK as the comparator because of its robustness, simplicity of definition, and excellent computational performance. The aim of the simulation study was to compare the two methods on: 1) the ability to adequately model the dependence structure; 2) convergence diagnostics; and 3) the posterior number of clusters chosen.
The aims of our simulation study were rooted in our interest in determining how a collection of mixed-scale variables interrelate and in jointly profiling or clustering observations. The multivariate dependence was assessed via the empirical mutual information measure defined in the previous section for MOTEF, with an analogous version defined for MPK. Since posterior mutual information measures were used for determining the dependence structure, the posterior samples for all pairs were assessed for convergence with the multivariate potential scale reduction factor and effective sample size (Brooks and Gelman, 1998).
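As an illustration of the convergence check mentioned above, the following is a minimal sketch of a multivariate potential scale reduction factor, assuming the form given by Brooks and Gelman (1998): (n - 1)/n + (m + 1)/m times the largest eigenvalue of W^{-1}(B/n), for m chains of n draws each. The function name `multivariate_psrf` and the array layout are assumptions, and this is not the code used in the study. Some software reports the square root of this quantity instead.

```python
import numpy as np

def multivariate_psrf(chains):
    """Multivariate PSRF sketch for an array of shape (m, n, p):
    m chains, n stored draws per chain, p monitored quantities
    (here, one column per pairwise mutual information measure)."""
    chains = np.asarray(chains, dtype=float)
    m, n, p = chains.shape
    # W: within-chain covariance, averaged over chains
    W = np.mean([np.atleast_2d(np.cov(c, rowvar=False)) for c in chains], axis=0)
    # B/n: covariance of the chain means across chains
    chain_means = chains.mean(axis=1)
    B_over_n = np.atleast_2d(np.cov(chain_means, rowvar=False))
    # largest eigenvalue of W^{-1} (B/n)
    lam1 = np.max(np.linalg.eigvals(np.linalg.solve(W, B_over_n)).real)
    return (n - 1) / n + (m + 1) / m * lam1

# Toy check (hypothetical data): well-mixed chains versus shifted chains.
rng = np.random.default_rng(0)
mixed = rng.normal(size=(5, 1000, 3))
shifted = mixed + np.arange(5)[:, None, None]   # each chain centered elsewhere
print(multivariate_psrf(mixed), multivariate_psrf(shifted))
```

Note that W must be invertible, which requires more stored draws per chain than monitored quantities.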
Given that we are using a distributional approach to clustering, it is important to monitor some measure of the number of classes chosen. We chose to monitor the posterior median number of clusters, defined as the number of occupied labels in the module 2 latent allocation variable z. Unlike MOTEF, the MPK has only one latent allocation variable, which is used for defining the number of clusters. Additionally, we monitor the number of clusters in an optimal partition of the data computed using the least-squares approach of Dahl (2006).
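Both cluster-count summaries can be sketched as follows. The sketch assumes the stored allocations are available as an integer array with one row per stored draw; the function names and the toy draws are hypothetical. The least-squares step follows the general idea of Dahl (2006): average the pairwise co-clustering indicators over draws, then pick the draw whose own indicator matrix is closest to that average in squared error. Building the full (T, n, n) array is only practical for small toy inputs.

```python
import numpy as np

def occupied_clusters(z_draws):
    """Number of occupied labels in each stored allocation vector."""
    return np.array([np.unique(z).size for z in z_draws])

def least_squares_partition(z_draws):
    """Return (index, allocation) of the draw closest to the average
    pairwise co-clustering matrix in squared error."""
    z_draws = np.asarray(z_draws)
    # assoc[t, i, k] = 1 if observations i and k share a label in draw t
    assoc = (z_draws[:, :, None] == z_draws[:, None, :]).astype(float)
    pi_hat = assoc.mean(axis=0)                       # average co-clustering
    loss = ((assoc - pi_hat) ** 2).sum(axis=(1, 2))   # squared distance per draw
    best = int(np.argmin(loss))
    return best, z_draws[best]

# Toy allocations (hypothetical): 5 stored draws for 4 observations.
z_demo = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [1, 1, 0, 0],   # same partition, labels switched
                   [0, 1, 1, 1],
                   [0, 1, 2, 2]])
n_clusters = occupied_clusters(z_demo)
best_index, best_partition = least_squares_partition(z_demo)
print(np.median(n_clusters), best_index, best_partition)
```

Because the co-clustering matrix depends only on which observations share a label, the selection is unaffected by label switching across draws.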
The simulated data consisted of p = 20 mixed-scale variables: 11 dichotomous 0/1 variables and three variables each on the polytomous, continuous, and ordered categorical scales. We generated 500 data sets with the binary, polytomous, continuous, and ordered categorical variable blocks arranged adjacently in that order as variables 1 to 20, assuming dependence among variables {2, 3, 7, 8, 12, 15, 18}. That is, dependence was induced among four binary variables and one variable of each of the other types. We assumed each data set was composed of three subpopulations, where a three-class latent subpopulation indicator was generated with probabilities (0.20, 0.55, 0.25). For binary and polytomous variables, the probability vector for each variable differed within each subpopulation for j = 2, 3, 7, 8, 12. Continuous variables were generated from two-component (j = 15) and three-component (j = 16, 17) mixture models, with different mixture probabilities within each subpopulation for j = 15. Similarly, ordinal variables were generated by applying the floor function to a latent continuous variable generated from a three-component mixture, with different mixture probabilities within each subpopulation for j = 18. Lastly, all data sets had a sample size of 1000.
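To make the generating mechanism concrete, here is a minimal sketch for one dependent binary variable and one dependent ordinal variable. Only the sample size and the subpopulation probabilities (0.20, 0.55, 0.25) come from the text. Every other number (the class-specific success probabilities, the mixture means, weights, and standard deviation) is a hypothetical placeholder, since the actual simulation parameters are not given here.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 1000                                      # sample size, from the text
class_probs = np.array([0.20, 0.55, 0.25])    # subpopulation probabilities, from the text
s = rng.choice(3, size=n, p=class_probs)      # latent subpopulation indicator

# Binary variable whose success probability depends on the subpopulation.
p_binary = np.array([0.2, 0.5, 0.8])          # hypothetical values
y_binary = rng.binomial(1, p_binary[s])

# Ordinal variable: floor of a latent continuous draw from a three-component
# mixture whose mixing weights depend on the subpopulation.
mix_means = np.array([0.5, 2.5, 4.5])         # hypothetical component means
mix_weights = np.array([[0.7, 0.2, 0.1],      # hypothetical weights, one row
                        [0.2, 0.6, 0.2],      # per subpopulation
                        [0.1, 0.2, 0.7]])
comp = np.array([rng.choice(3, p=mix_weights[k]) for k in s])
latent = rng.normal(mix_means[comp], 0.5)     # hypothetical common sd of 0.5
y_ordinal = np.floor(latent).astype(int)

print(np.bincount(s) / n, y_binary.mean(), np.unique(y_ordinal))
```

Dependence between the two variables arises only through the shared indicator `s`, which is the mechanism the paragraph describes for the dependent set.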
Each data set was analyzed separately using the Gibbs sampling scheme detailed in the previous section. For both the MOTEF and MPK procedures, the respective samplers were run for 5,000 iterations with a 1,000-iteration burn-in, and every 4th iteration was stored. Additionally, five separate chains were initialized at different starting allocation values, for a total posterior sample size of 5,000.
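The stored-sample count follows from simple arithmetic, sketched below. It assumes the 5,000 iterations include the burn-in, which is the reading that reproduces the stated total of 5,000; the variable names are illustrative.

```python
n_iter, burn_in, thin, n_chains = 5000, 1000, 4, 5

# Indices of iterations kept after burn-in, taking every 4th one.
kept_per_chain = len(range(burn_in, n_iter, thin))
total_kept = kept_per_chain * n_chains

print(kept_per_chain, total_kept)  # prints: 1000 5000
```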
The results of the simulation study indicate that MOTEF and MPK differ slightly in elucidating the underlying dependence structure among the mixed-scale variables. Figure 3.1 displays the proportion of simulations in which each variable pair was flagged as dependent. The proportions for flagging truly dependent pairs ranged from 0.42 to 1.00 for MOTEF and from 0.73 to 1.00 for MPK. MOTEF notably outperforms MPK at location (15,18), which it flagged as dependent in 99% of the data sets compared with 73% for MPK. MPK notably outperforms MOTEF at locations (7,18) and (8,18), with proportions of 0.97 and 0.86, respectively, compared to MOTEF proportions of 0.42 and 0.74. False discovery proportions ranged from 0.00 to 0.12 for MOTEF and from 0.00 to 0.08 for MPK. Overall, MPK marginally outperforms MOTEF in selecting the true dependence structure, with slightly better performance at correctly flagging true dependence and lower false discovery proportions.
Figure 3.1: Results of simulations for MOTEF and MPK, displaying the percentage of simulations in which each variable pair was flagged as dependent, $\widehat{\Pr}(H_{1jj'}: \zeta_{jj'} > 0 \mid Y) > 0.95$.
Table 3.1 displays simulation results for various diagnostics comparing MOTEF with MPK. The medians of the multivariate potential scale reduction factors for MOTEF and MPK were estimated to be 1.14 and 2.40, respectively. Generally, values less than 1.2 indicate that the chains have achieved stationarity and mixed well; the MOTEF median falls below this threshold while the MPK median does not. MOTEF thus achieves good diagnostics for the pairwise empirical mutual information measures, treated jointly as a multivariate quantity, which are used to characterize the dependence structure of the simulated data sets. The median of the scaled multivariate effective sample size estimates also shows the superiority of MOTEF over the MPK, with values of 0.75 and 0.29, respectively. Scaled multivariate effective sample size values close to one indicate that the multivariate posterior sample is akin to an independent and identically distributed sample. Thus, the characteristics of the MOTEF posterior samples are much better than those of the MPK.
MOTEF also outperformed the MPK in selecting a number of clusters closer to the true number of subpopulations. Table 3.1 also displays the proportion of data sets by posterior median number of clusters and by the number of clusters (i.e., occupied components) in the least-squares optimal partition for MOTEF and MPK. In the overwhelming majority of data sets, MOTEF correctly estimated the number of clusters as three: in 98.8% of the 500 data sets using the posterior median and in 96.6% using the optimal partition. MPK, on the other hand, selected a greater number of clusters, ranging from four to seven for both the posterior median number of clusters and the number of clusters in the optimal partition. The modal posterior median number of clusters for MPK was five, occurring in 58.2% of the data sets, while the optimal partition contained five clusters in 51.2% of the data sets.
Table 3.1: Simulation study results for various diagnostics comparing MOTEF with MPK; median (IQR) for continuous variables, percentages for integer-valued diagnostics (out of 500 data sets)

Diagnostic                    MOTEF            MPK
PSRF¹                         1.14             2.40
                              (1.05 – 1.29)    (1.35 – 3.56)
Eff. Sample Size²             0.75             0.29
                              (0.70 – 0.79)    (0.19 – 0.43)
Median no. clusters
  2                           1.2 %            0 %
  3                           98.8 %           0 %
  4                           0.0 %            21.2 %
  5                           0.0 %            58.2 %
  6                           0.0 %            20.4 %
  7                           0.0 %            0.2 %
No. clusters - opt. part.
  2                           3.4 %            0 %
  3                           96.6 %           0 %
  4                           0 %              18.8 %
  5                           0 %              51.2 %
  6                           0 %              26.8 %
  7                           0 %              3.2 %

¹ Multivariate potential scale reduction factor for empirical mutual information measures from all pairs of variables
² Multivariate effective sample size scaled by 5000 (1000 samples per chain)

Based on our simulation scenario, MOTEF performs better than the MPK in efficiency and subpopulation estimation, with somewhat comparable performance in elucidating the multivariate dependence structure. The proposed method characterizes multivariate dependence correctly in most cases and offers the ability to identify latent classes for profiling. Additionally, MOTEF has the potential to be computationally feasible for moderate to large data sets when coupled with parallelization of the first module components. Lastly, in addition to providing a smaller number of subclasses, which is useful for investigators seeking to jointly profile multiple mixed-scale variables, the modularization has the added feature of providing information about the marginal clustering of the non-nominal categorical variables. This feature is not unique to MOTEF, since it is also inherent in the ITM, but MOTEF avoids the potential for undue feedback from the multivariate clustering mechanism that can be observed in the ITM.