8.3 Experiments and Analysis
8.3.2 Direct Model Comparisons
Weak and Strong Predictive Power
We test the models’ abilities to make accurate predictions. We used IPIP100, which many subjects had taken multiple times; we refer to each repeat as a ‘sitting’ of the questionnaire. We only used subjects with two or more complete sittings. The repeated tests serve two purposes. First, they provide a model-free estimate of how well one can predict responses. Second, they allow us to evaluate the predictive power both on held-out questions from within a sitting, which we call the weak predictive power, and across sittings, which we call the strong predictive power. We compute the predictive powers using both exponentiated log likelihood, Equation (8.6), and ‘fraction correct’, the proportion of times that the most probable (MAP) response predicted by the model is correct. The first sitting for each subject was used for training. 20% of the ratings were held out from the training set to evaluate the weak predictive power, and the entire second sitting was used to evaluate the strong predictive power.
Test-retest Reliability
The test-retest reliability is the proportion of times that the subject’s response is the same across questionnaire sittings. It is an estimate of the best possible predictive performance. For example, a subject who responds uniformly at random will have a reliability of $1/R = 0.2$, and no model can perform better than that. In particular, if a subject provides response $k$ with probability $p_k$, their reliability is
\[ \text{reliability} = \mathbb{E}\left[ \sum_{k=1}^{R} p_k^2 \right], \tag{8.11} \]
where the expectation is over the users and items in the dataset. The best possible fraction correct is achieved by selecting the most likely response based on $p_k$. The score is then
\[ \text{best fraction correct} = \mathbb{E}\left[ \max_{k} p_k \right]. \tag{8.12} \]
Equation (8.12) is an upper bound on Equation (8.11), because $\sum_k p_k^2 \le \max_k p_k \sum_k p_k = \max_k p_k$. Therefore the reliability is a lower bound on the best fraction correct achievable. An upper bound can also be derived from the reliability [Neri & Levi, 2006]. However, this bound may be loose; it returned values greater than one on our data, so we do not use it.
Subjects may be more consistent within a sitting, so the weak fraction correct cannot be compared to the reliability. Furthermore, it is harder to relate the reliability to the log likelihood, which will always be smaller than or equal to the fraction correct. However, we can provide a chance baseline for both metrics, presented in the next section.
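As a small illustration of Equations (8.11) and (8.12), the sketch below evaluates both quantities for a single response distribution and checks the agreement rate by simulating two independent sittings. The distribution `peaked`, the function names and the simulation size are hypothetical choices made for this example, not values from the experiments.

```python
import numpy as np

def reliability(p):
    """Probability that two independent draws from p agree: sum_k p_k^2."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p ** 2))

def best_fraction_correct(p):
    """Accuracy of always predicting the most likely response: max_k p_k."""
    p = np.asarray(p, dtype=float)
    return float(np.max(p))

R = 5
uniform = np.full(R, 1.0 / R)                       # subject answering uniformly at random
peaked = np.array([0.05, 0.10, 0.60, 0.20, 0.05])   # hypothetical subject

# Simulate two independent sittings of one item and measure the agreement rate.
rng = np.random.default_rng(0)
n_draws = 200_000
sitting1 = rng.choice(R, size=n_draws, p=peaked)
sitting2 = rng.choice(R, size=n_draws, p=peaked)
simulated_agreement = float(np.mean(sitting1 == sitting2))

for name, p in [("uniform", uniform), ("peaked", peaked)]:
    print(name, reliability(p), best_fraction_correct(p))
print("simulated agreement (peaked):", simulated_agreement)
```

For the uniform subject both quantities equal 1/R, so the bound is tight. For the peaked subject the reliability is strictly below the best fraction correct, which is why the reliability is only a lower bound on the achievable accuracy.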
Frequency-weighted Chance
A naïve chance baseline is $1/R = 0.2$. However, a better baseline takes into account the imbalance in the responses in Y. This may still be considered ‘chance’ because the identities of the subject and question are ignored when making predictions. Under fraction correct, the best constant predictor assigns a point mass to the most frequent response, $p(y = k) = \mathbb{I}[k = \arg\max_i \hat{p}_i]$, where $\mathbb{I}[\cdot]$ is the indicator function and $\hat{p}_i$ is the empirical proportion of response $i$. This predictor attains a chance level of $\max_i \hat{p}_i$.
Intuitively, if the data is more imbalanced, predictions are easier, and chance increases. Under log likelihood, the best constant predictor assigns the empirical proportion to each response, $p(y = k) = \hat{p}_k$. The chance likelihood is then $\exp(-H[\{\hat{p}_k\}])$, where $H[\cdot]$ is the entropy function. These chance levels are improved further by making different predictions for each question, but ignoring the identity of the subjects. The predictions are now based on the empirical statistics in the corresponding columns of Y. There are too few questions to compute a good user-specific baseline. In our experiments we computed these chance levels using the entire first sittings of the questionnaires.
[Figure 8.9 plot. Left panel: fraction correct; right panel: exponentiated log likelihood. Each panel groups results by training, test-weak and test-strong. Legend: HOMF, GFA, MIRT-graded, MIRT-GPCM, GFA-BIG5, GFA-rand, reliability, chance (per question).]
Figure 8.9: Weak (intra-questionnaire) and strong (inter-questionnaire) predictive powers using D = 5 dimensions in all models. Error bars indicate ±1 s.d. across experimental repeats. Left: Fraction correct, solid horizontal line is the test-retest reliability, dashed line is the mean question-specific baseline. Right: Exponentiated log likelihood.
Methods
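The frequency-weighted chance levels described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code used for the experiments. It assumes responses are integers 1..R stored in a complete subjects-by-questions matrix, and that the exponentiated log likelihood is the exponential of the mean per-rating log likelihood, which is consistent with the chance value $\exp(-H)$ but is an assumption about Equation (8.6). All function names are hypothetical.

```python
import numpy as np

def empirical_proportions(y, R):
    """Empirical proportion of each response 1..R in an integer array."""
    counts = np.bincount(np.asarray(y).ravel(), minlength=R + 1)[1:R + 1]
    return counts / counts.sum()

def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def global_chance(Y, R):
    """Chance levels of the best constant predictor over the whole matrix."""
    p_hat = empirical_proportions(Y, R)
    return float(p_hat.max()), float(np.exp(-entropy(p_hat)))

def per_question_chance(Y, R):
    """Chance levels when a separate constant predictor is used per column."""
    cols = [empirical_proportions(Y[:, j], R) for j in range(Y.shape[1])]
    frac = float(np.mean([p.max() for p in cols]))
    # Assumption: average the log likelihood over ratings, then exponentiate.
    lik = float(np.exp(-np.mean([entropy(p) for p in cols])))
    return frac, lik

# Synthetic responses: each question has its own (random) response distribution.
rng = np.random.default_rng(0)
R, n_subjects, n_questions = 5, 500, 20
Y = np.column_stack([
    rng.choice(np.arange(1, R + 1), size=n_subjects, p=rng.dirichlet(np.ones(R)))
    for _ in range(n_questions)
])

g_frac, g_lik = global_chance(Y, R)
q_frac, q_lik = per_question_chance(Y, R)
print("global chance:       fraction correct %.3f, exp. log lik. %.3f" % (g_frac, g_lik))
print("per-question chance: fraction correct %.3f, exp. log lik. %.3f" % (q_frac, q_lik))
```

On a complete matrix the per-question levels are never below the global ones, because the global proportions are the average of the column proportions, the maximum is convex and the entropy is concave.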
In Section 7.4.1 we showed that HOMF outperforms a number of models for ordinal matrices developed in machine learning. In this section we focus on models used in psychometrics: GFA, MIRT-graded and MIRT-GPCM (see Section 8.2.1). The MIRT models were implemented using the R package described in Chalmers [2012]. We also investigate how well the data can be predicted using the Big-Five traits. To do this we use a unidimensional confirmatory GFA model with a fixed loading matrix corresponding to the Big-Five measurements, as depicted in Figure 8.8, centre (GFA-BIG5). GFA-BIG5 only learns the latent traits X and the item noise levels Σ, and is constrained to five dimensions. As a baseline for GFA-BIG5, we run the same algorithm but with a random loading matrix with i.i.d. standard normal elements (GFA-rand). We used N = 5000 subjects, and repeated the entire procedure, including sampling of the subjects and dataset splits, five times.
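The evaluation protocol (sample subjects, train on the first sitting with 20% of its ratings held out for the weak test, use the whole second sitting for the strong test, repeat five times) might be sketched as below. The arrays `first` and `second`, the function `make_split` and the small sizes are hypothetical stand-ins. Holding out ratings uniformly at random over the whole matrix is also an assumption, since the text does not say whether the 20% was drawn per subject or overall.

```python
import numpy as np

def make_split(first, second, n_subjects, holdout_frac=0.2, rng=None):
    """Sample subjects, then hold out a fraction of first-sitting ratings."""
    rng = np.random.default_rng() if rng is None else rng
    rows = rng.choice(first.shape[0], size=n_subjects, replace=False)
    sit1, sit2 = first[rows], second[rows]

    # Pick exactly holdout_frac of the first-sitting entries for the weak test.
    n_hold = int(round(holdout_frac * sit1.size))
    held = np.zeros(sit1.size, dtype=bool)
    held[rng.permutation(sit1.size)[:n_hold]] = True
    held = held.reshape(sit1.shape)

    return {
        "first_sitting": sit1,       # ratings; only entries in train_mask are used to fit
        "train_mask": ~held,         # observed during training
        "weak_test_mask": held,      # held-out ratings within the first sitting
        "strong_test": sit2,         # entire second sitting of the same subjects
    }

# Synthetic stand-in data: 200 subjects with two complete sittings of 100 items.
rng = np.random.default_rng(0)
first = rng.integers(1, 6, size=(200, 100))
second = rng.integers(1, 6, size=(200, 100))

# Five experimental repeats, each with its own subject sample and split.
splits = [make_split(first, second, n_subjects=50, rng=rng) for _ in range(5)]
print([int(s["weak_test_mask"].sum()) for s in splits])
```

Resampling the subjects inside each repeat, rather than only re-drawing the held-out mask, is what makes the spread across repeats reflect variation in both the sample and the split.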
[Figure 8.10 plot. Left panel: fraction correct against number of factors (D); right panel: exponentiated log likelihood against number of factors (D). Legend: HOMF, GFA, MIRT-graded, MIRT-GPCM, reliability, chance (per question).]
Figure 8.10: Training (dashed line, × markers), weak-test (dash-dot line, ◦ markers) and strong-test (solid line, + markers) predictive powers using different latent dimensions (GFA-BIG5 and GFA-rand not plotted because they are constrained to D = 5). Left: Fraction correct. Right: Exponentiated log likelihood.
Results
Figures 8.9 and 8.10 show the results using five latent traits and across dimensionalities, respectively. HOMF performs best by a substantial margin with both metrics. The MIRT models are significantly outperformed by HOMF at D = 5. Beyond five dimensions, the MIRT models perform very poorly, and they failed to run with D > 20. Furthermore, they only beat GFA at very low dimensionalities, D < 5. MIRT-graded has a similar likelihood to HOMF (see Section 8.2.1), which indicates that the MHRM inference algorithm used by the MIRT models is ineffective as the dimensionality grows.
GFA-BIG5 substantially outperforms the baseline GFA-rand. As noted in the previous sections, the Big-Five dimensions are highly prevalent in IPIP data, so they provide useful basis vectors. However, exploratory GFA improves upon GFA-BIG5, which indicates that some questions provide information about multiple traits, which the multidimensional model can exploit.
According to log likelihood, HOMF is the only model that makes robust inter-questionnaire predictions at larger dimensionalities. The heteroscedasticity is likely to be contributing to HOMF’s robustness. We use the re-tests to assess whether HOMF learns the noise levels correctly. We correlate the reliability of each subject with the MAP noise level for each subject returned by HOMF, $\gamma_{\mathrm{row}}$ in Equation (7.2). We do the same for the items. Note that HOMF only observes a single sitting of the questionnaire and so does not directly observe inconsistent behaviour.
[Figure 8.11 plot. Correlation coefficient against number of factors (D), with one series per subject and one per question.]
Figure 8.11: Pearson’s correlation coefficient ρ between the test-retest reliability for each user and question and their corresponding noise level inferred using HOMF. Points marked with an × have significant correlation (p < 0.05); points marked with a ◦ are not significant.
Figure 8.11 shows the correlation coefficients using different dimensionalities. For the subjects, the learnt noise levels correlate negatively (p < 0.05) with the reliabilities, indicating that HOMF learns the noise correctly. For the questions, there is negative correlation at low dimensionalities, but not at high dimensionalities. This may be because the intra-questionnaire response entropy for each item (empirical entropy of the columns of Y) correlates negatively with the reliability ($\rho = -0.85$, $p < 10^{-20}$). At low dimensionalities the model cannot capture the response patterns, so it models high-entropy items as having high noise. Therefore, the noise also correlates with the unreliable items. However, at high dimensionalities the model decouples response entropy from noise, and there are insufficient questions to attain a strong correlation with reliability.
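The per-subject check above (correlating test-retest reliability with a noise level) can be illustrated on synthetic data. In this sketch the array `gamma` is a hypothetical stand-in for the MAP noise levels returned by HOMF. It is used both to generate the two sittings and as the quantity correlated with reliability, so the sketch shows only the analysis step, not HOMF inference. The p-value comes from a permutation test, which is swapped in here to keep the example NumPy-only; the text does not state which significance test was used.

```python
import numpy as np

def per_subject_reliability(sit1, sit2):
    """Fraction of items on which each subject gives the same response twice."""
    return np.mean(sit1 == sit2, axis=1)

def pearson_with_permutation_p(x, y, n_perm=2000, rng=None):
    """Pearson correlation and a two-sided permutation p-value."""
    rng = np.random.default_rng() if rng is None else rng
    rho = np.corrcoef(x, y)[0, 1]
    null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                     for _ in range(n_perm)])
    p = (1 + np.sum(np.abs(null) >= abs(rho))) / (n_perm + 1)
    return float(rho), float(p)

rng = np.random.default_rng(0)
n_subjects, n_items, R = 300, 100, 5

# Hypothetical per-subject noise levels and noise-free underlying responses.
gamma = rng.uniform(0.2, 1.5, size=n_subjects)
latent = rng.integers(1, R + 1, size=(n_subjects, n_items)).astype(float)

def simulate_sitting():
    """One sitting: latent response plus subject-specific Gaussian noise."""
    noisy = latent + gamma[:, None] * rng.standard_normal(latent.shape)
    return np.clip(np.rint(noisy), 1, R).astype(int)

sit1, sit2 = simulate_sitting(), simulate_sitting()
rel = per_subject_reliability(sit1, sit2)
rho, p_value = pearson_with_permutation_p(rel, gamma, rng=rng)
print("rho = %.3f, permutation p = %.4f" % (rho, p_value))
```

Noisier synthetic subjects repeat their answers less often, so the correlation is negative, which is the same sign as reported for the subjects in Figure 8.11.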