5.3 Comparing aggregate and personalised models (AT-PT and PT-PT)
5.3.3 Comparison with common Android conventions
5.3.3.2 Volume state baseline
As the Android devices in the dataset allow a degree of rule-based interruption management through manually setting the volume state, the second baseline involves training and testing models based only on this feature. For example, a user is conceptually unlikely to be interruptible when the device is in silent mode. The rationale for this baseline is to determine whether other contextual features provide additional utility in reducing the variance between users and in improving typical accuracy.

[Figure 5.7 panels: PPV and sensitivity (left) and NPV and specificity (right) for Rc, Eg, Rv, and Rv*; y-axis from 0.0 to 1.0.]
(a) Volume state baseline performance (AT-PT).
(b) Multi-modal model performance (same as Figure 5.4a).
Figure 5.7: A comparison of the volume state baseline against the multi-modal models trained from aggregated data (AT-PT). Rv* refers to receptivity when the device is in use. Y-axes represent prediction performance. Figures from [124].
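The volume state baseline described above can be sketched as a single-feature model. The text does not state which classifier was used, so the majority-vote lookup below is an assumption, as are the function names `fit_volume_baseline` and `predict`, the volume state strings, and the toy data. It is a minimal illustration of training on other users' data and testing on one user (AT-PT), not the method used in the study.

```python
from collections import Counter, defaultdict


def fit_volume_baseline(samples):
    """Learn the majority label for each volume state.

    samples: non-empty iterable of (volume_state, interruptible) pairs.
    Returns (lookup, default), where default is the overall majority label
    used for volume states that never appeared in training.
    """
    counts = defaultdict(Counter)
    overall = Counter()
    for state, label in samples:
        counts[state][label] += 1
        overall[label] += 1
    lookup = {state: c.most_common(1)[0][0] for state, c in counts.items()}
    default = overall.most_common(1)[0][0]
    return lookup, default


def predict(model, state):
    """Predict interruptibility from the volume state alone."""
    lookup, default = model
    return lookup.get(state, default)


# Hypothetical, made-up samples (not taken from the dataset).
other_users = [
    ("silent", False), ("silent", False), ("silent", True),
    ("vibrate", True), ("vibrate", False), ("vibrate", True),
    ("normal", True), ("normal", True),
]
target_user = [("silent", False), ("normal", True), ("vibrate", False)]

# AT-PT: train on the aggregate of other users, test on one user.
at_pt_model = fit_volume_baseline(other_users)
predictions = [predict(at_pt_model, state) for state, _ in target_user]
```

A PT-PT variant of the same sketch would fit the lookup on the target user's own earlier samples instead of `other_users`.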
Figure 5.7a shows the performance of the baseline for AT-PT models. Compared with the AT-PT multi-modal model (Figure 5.4a, shown again in Figure 5.7b for a side-by-side comparison), the baseline performs slightly worse overall at correctly classifying interruptible moments (PPV, shown on the left side) when considering the median and upper quartile values for reachability, engageability, and receptivity when in use, and performs similarly for receptivity when not in use. For sensitivity (also shown on the left side), the baseline performs better for reachability and engageability, but worse for receptivity.
For NPV and specificity (shown on the right side of Figures 5.7a and 5.7b), the general trend in performance is the inverse of that for PPV and sensitivity. The results suggest that the median performance of the baseline for NPV across interruptibility labels and use states is higher than or similar to that of the multi-modal model. This suggests that using the volume state alone may be a better choice than a multi-modal approach if NPV is the sole priority; however, the multi-modal approach offers less variation between users and higher specificity. For specificity, the median performance of the baseline is considerably worse than the multi-modal model when the device is not in use, and only marginally better when the device is in use.
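For reference, the four metrics compared in these figures can be computed from a binary confusion matrix as in the sketch below. It assumes the positive class is "interruptible", consistent with PPV being described as correctly classifying interruptible moments; the function names and the example labels are hypothetical and are not taken from the dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for boolean labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the metric is undefined."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "PPV": ratio(tp, tp + fp),          # predicted interruptible and was
        "sensitivity": ratio(tp, tp + fn),  # interruptible moments found
        "NPV": ratio(tn, tn + fn),          # predicted uninterruptible and was
        "specificity": ratio(tn, tn + fp),  # uninterruptible moments found
    }


# Made-up labels for illustration only.
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, True, True, False, True, True, False, False]
metrics = binary_metrics(y_true, y_pred)
```

PPV and sensitivity both depend on true positives, while NPV and specificity both depend on true negatives, which is why a model that predicts "interruptible" more readily tends to gain on one pair of metrics and lose on the other.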
Overall, these results for AT-PT suggest that users may not always base their decisions in response to a notification purely on the volume state they have set, and that the inclusion of other contextual features can aid in correctly predicting opportunities to interrupt. This is useful because, while multi-modal AT-PT models were shown to largely underperform against multi-modal PT-PT models, they still offer utility over this baseline.
For PT-PT, Figure 5.8a shows the performance of the baseline. Compared with the multi-modal PT-PT models (Figure 5.4b, shown again in Figure 5.8b for a side-by-side comparison), a general trend is that the baseline has much less stability between user performances for all labels and use states. This alone is a favourable consideration for the use of the multi-modal models, as they reduce the variability between different definitions of interruptibility (as in the comparison with AT-PT models discussed in Section 5.3.2). From this, it could be said that users likely manage the volume state differently, and that there may be cases where users unintentionally forget to change the volume state at the exact moment their interruptibility changes, which other contextual data can help to improve upon.

[Figure 5.8 panels: PPV and sensitivity (left) and NPV and specificity (right) for Rc, Eg, Rv, and Rv*; y-axis from 0.0 to 1.0.]
(a) Volume state baseline performance (PT-PT).
(b) Multi-modal model performance (same as Figure 5.4b).
Figure 5.8: A comparison of the volume state baseline against the multi-modal models for personalised models (PT-PT). Rv* refers to receptivity when the device is in use. Y-axes represent prediction performance. Figures from [124].
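One way to quantify the stability between users discussed above is to summarise per-user scores by their median and interquartile range (IQR). The sketch below assumes one score per user for a given metric, label, and use state; the function name `spread_summary` and both score lists are made up for illustration and are not results from the study.

```python
from statistics import median, quantiles


def spread_summary(per_user_scores):
    """Return the median and interquartile range of per-user scores."""
    q1, _, q3 = quantiles(per_user_scores, n=4, method="inclusive")
    return {"median": median(per_user_scores), "iqr": q3 - q1}


# Made-up per-user scores, chosen only to show a wide and a narrow spread.
baseline_scores = [0.2, 0.4, 0.6, 0.8, 1.0]
multi_modal_scores = [0.6, 0.65, 0.7, 0.75, 0.8]

baseline_summary = spread_summary(baseline_scores)
multi_modal_summary = spread_summary(multi_modal_scores)

# A smaller IQR indicates more consistent performance across users.
more_stable = (
    "multi-modal"
    if multi_modal_summary["iqr"] < baseline_summary["iqr"]
    else "baseline"
)
```

Comparing the IQR alongside the median separates "typically better" from "more consistent across users", which are the two properties contrasted in this comparison.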
Additionally, the multi-modal model matches or outperforms the baseline in terms of the median performance for most metrics, interruptibility labels, and use states. The exceptions are sensitivity for reachability and engageability (shown on the left side of Figures 5.8a and 5.8b) and NPV for receptivity (shown on the right side of Figures 5.8a and 5.8b) when the device is not in use. Coupling these results with the baseline's sole reliance on the human effort required to manage the volume state, the results suggest that the use of a multi-modal trained interruptibility system is more worthwhile, particularly if the objective is to find opportune moments to interrupt, regardless of whether reachability, engageability, or receptivity is used.
Summary: Aggregate and personalised models are useful for different use cases
The primary findings from exploring the use of aggregate and personalised training data can be defined as:
• The relative differences in predictive performance across DOIG labels are larger than for the typical user model (AT-AT, Section 5.2), suggesting that individual differences in interruption habits likely exist between users;
• If a hypothetical application is seeking to predict opportune moments to interrupt, by prioritising true-positive classifications and minimising false-negative classifications, the results showed that personalised models typically outperformed models trained from the data of other users;
• Whereas if an application is seeking to avoid issuing notifications that are unlikely to produce their desired response behaviour (e.g., being at least reachable) by prioritising true-negative classifications and minimising false-positive classifications, the results showed that models trained from aggregate data typically