5.1.1 Machine learning approach

In this chapter, the extent to which reachability, engageability, and receptivity are predictable is explored using multiple contextual features (multi-modal). The feature vectors and labels used for prediction are the same as those created for the analysis in Chapter 4, Section 4.3. The primary aim of the analysis is to explore the relative differences in predictive performance across the DOIG labels for different machine learning methods; however, where appropriate, the scope is refined to prune the worse-performing solutions (e.g., classifier choice).

The methods used for each component in the predictive modelling are outlined as follows.

5.1.1.1 Pre-processing

Analysis of the dataset reveals that the label (class) distribution is imbalanced, since the majority of notifications are null-responses, i.e., users were often unreachable (as discussed in Chapter 3, Section 3.4). Without pre-processing, this could lead to misleading reports of model performance. For example, if a model always predicts a single class and 80% of the data is labelled with that class, then the model is trivially correct 80% of the time but practically useless. To prevent this, random under-sampling (RUS) [41] was used to produce 100 evenly distributed training datasets for each model.
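The under-sampling step can be illustrated with a minimal sketch. It assumes that RUS discards randomly chosen cases from the larger classes until every class matches the size of the rarest one, and that the 100 datasets are drawn independently. The function names, the seed, and the toy data are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict

def random_under_sample(samples, labels, rng):
    """Return a class-balanced subset by randomly discarding majority-class cases."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    # Every class is cut down to the size of the rarest class.
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_min):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced

def make_rus_datasets(samples, labels, n_datasets=100, seed=0):
    """Draw n_datasets independent balanced datasets (100, following the text)."""
    rng = random.Random(seed)
    return [random_under_sample(samples, labels, rng) for _ in range(n_datasets)]

# Toy data (hypothetical): 80% null-responses (label 0), 20% reachable (label 1).
toy_samples = list(range(50))
toy_labels = [0] * 40 + [1] * 10
datasets = make_rus_datasets(toy_samples, toy_labels)
```

Because each draw discards a different random subset of the majority class, repeating the draw many times reduces the dependence of the results on any single sample.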

5.1.1.2 Classifier choice

The choice of classifier algorithms used to train the models will be examined as part of the initial analysis of predictive performance for a typical user (Section 5.2). This is due to the widely varying success that previous works have reported with different classifiers (as discussed in Chapter 2). This analysis will also explore the suitability of creating either independent binary classification models for each label and use state (e.g., at least reachable/not, engageable/not, receptive/not) or multi-class models (e.g., the user is either reachable/engageable/receptive/not at all). The results from this analysis are then used to prune the analysis space for exploring the performance of individual users.
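The difference between the two labelling schemes can be sketched as follows. The sketch assumes that the response levels are ordered (receptive implies engageable, which implies reachable), as the phrase "at least reachable" suggests; this ordering, the level names, and the function names are illustrative assumptions rather than the encoding used in the study.

```python
# Assumed ordering of response levels, from weakest to strongest (hypothetical).
LEVELS = ["not at all", "reachable", "engageable", "receptive"]

def binary_labels(level):
    """One independent yes/no target per label, e.g. 'at least reachable'."""
    rank = LEVELS.index(level)
    return {
        "reachable": rank >= 1,
        "engageable": rank >= 2,
        "receptive": rank >= 3,
    }

def multiclass_label(level):
    """A single target whose value identifies the highest level reached."""
    return LEVELS.index(level)

example_binary = binary_labels("engageable")
example_multiclass = multiclass_label("engageable")
```

Under the binary scheme, three separate models would each be trained on one of the yes/no targets; under the multi-class scheme, one model predicts the single combined target.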

5.1.1.3 Training and testing models

For each DOIG label, three approaches are used for splitting the data where relevant (visualised in Figure 5.1): Aggregate Trained and Aggregate Tested (AT-AT), where training and testing data are split from the same aggregated dataset from all users; Aggregate Trained and Personally Tested (AT-PT), where, for each user, the models are trained on the data of all other users and tested only against that selected user’s data; and Personally Trained and Personally Tested (PT-PT), where training and testing data both come from the data of each individual user. However, as the level of participation varied between users, some users may not have data for all classes (such as if no notifications occurred when the device was in use); these users are excluded where relevant.
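The three splitting approaches can be sketched as below. The sketch assumes the data is held as a mapping from a user identifier to a list of (features, label) cases, and it uses a single hold-out split for AT-AT and PT-PT purely for brevity. All names, the test fraction, and the toy data are hypothetical.

```python
import random

def at_at_split(data_by_user, test_fraction=0.1, seed=0):
    """AT-AT: pool the cases of all users, then split the pool."""
    pooled = [(user, case) for user, cases in data_by_user.items() for case in cases]
    random.Random(seed).shuffle(pooled)
    n_test = max(1, int(len(pooled) * test_fraction))
    return pooled[n_test:], pooled[:n_test]

def at_pt_split(data_by_user, target_user):
    """AT-PT: train on every other user, test on the target user only."""
    train = [(user, case) for user, cases in data_by_user.items()
             if user != target_user for case in cases]
    test = [(target_user, case) for case in data_by_user[target_user]]
    return train, test

def pt_pt_split(data_by_user, target_user, test_fraction=0.1, seed=0):
    """PT-PT: train and test on the target user's own cases."""
    own_data = {target_user: data_by_user[target_user]}
    return at_at_split(own_data, test_fraction, seed)

def has_all_classes(cases, classes=(0, 1)):
    """Users lacking data for any class are excluded from personal models."""
    return set(classes) <= {label for _, label in cases}

# Toy data (hypothetical): each case is (feature vector, binary label).
toy = {
    "user1": [([0.1], 0), ([0.2], 1), ([0.3], 0), ([0.4], 1)],
    "user2": [([0.5], 0), ([0.6], 0)],
    "user3": [([0.7], 1), ([0.8], 0), ([0.9], 1)],
}
eligible_users = [user for user, cases in toy.items() if has_all_classes(cases)]
at_pt_train, at_pt_test = at_pt_split(toy, "user1")
```

In each function the training and testing sets are disjoint, so no data point appears in both.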

[Figure 5.1 panels: Aggregate-Trained Aggregate-Tested (AT-AT); Aggregate-Trained Personally-Tested (AT-PT); Personally-Trained Personally-Tested (PT-PT). Each panel shows the response data for user₁, user₂, ..., user_n.]

Figure 5.1: Visualisation of the training and testing approaches (as described in Section 5.1.1.3). Personally tested approaches are visualised using an example user (user₁). Additionally, each data point cannot be in both training and testing datasets. ▮ = the training data used and ▮ = the testing data used.

For testing, 10-fold cross-validation was used for the AT-AT and PT-PT models. As AT-PT models use separate training and testing datasets, cross-validation would not be suitable. However, this issue is mitigated as the above analysis is performed on 100 RUS datasets (as defined in Section 5.1.1.1).
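The fold construction behind 10-fold cross-validation, as applied to the AT-AT and PT-PT models, can be sketched as follows. The sketch assumes the cases have already been shuffled and that there are at least as many cases as folds; the function name and the example size of 25 cases are hypothetical.

```python
def k_fold_indices(n_cases, k=10):
    """Yield (train_indices, test_indices) pairs; each case is tested exactly once."""
    indices = list(range(n_cases))
    # Deal the indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

# Example: 25 cases produce 10 train/test partitions.
splits = list(k_fold_indices(25))
```

Each partition trains on nine folds and tests on the remaining one, and the reported performance would typically be averaged over the ten partitions.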

5.1.1.4 Evaluating model performance

                                        Actually True (e.g., reachable)    Actually False (e.g., not reachable)
Predicted True (e.g., reachable)        True Positives (TP)                False Positives (FP)                    PPV = TP/(TP+FP)
Predicted False (e.g., not reachable)   False Negatives (FN)               True Negatives (TN)                     NPV = TN/(FN+TN)
                                        Sensitivity = TP/(TP+FN)           Specificity = TN/(FP+TN)

Figure 5.2: Visualisation of the PPV, NPV, sensitivity and specificity metrics used. Weighted precision is the average of the PPV and NPV performance, and weighted recall is the average of the sensitivity and specificity performance.

Different applications may have different priorities for predictive performance (e.g., minimising missed opportunities to interrupt (false negatives) or minimising ineffective interruptions (false positives)). To account for this, models are evaluated using different standardised metrics, which are derived from the confusion matrix produced in the evaluation (visualised in Figure 5.2):

PPV : The positive predictive value (PPV) is a precision metric that refers to the proportion of cases in the testing dataset classified as reachable, engageable, or receptive that were correctly classified, i.e., TP/(TP+FP).

NPV : The negative predictive value (NPV) is a precision metric that refers to the proportion of cases in the testing dataset classified as not reachable, not engageable, or not receptive that were correctly classified, i.e., TN/(FN+TN).

Sensitivity : The sensitivity recall metric refers to the proportion of positive cases (e.g., reachable) that were correctly identified against the total number of positive cases that exist in the testing dataset, i.e., TP/(TP+FN). This metric can be paired with PPV.

Specificity : The specificity recall metric refers to the proportion of negative cases (e.g., not reachable) that were correctly identified against the total number of negative cases that exist in the testing dataset, i.e., TN/(FP+TN). This metric can be paired with NPV.

Weighted Precision : The weighted precision value refers to the average of the PPV and NPV metrics, weighted by the number of cases of each class if unbalanced.

Weighted Recall : The weighted recall value refers to the average of the sensitivity and specificity metrics, weighted by the number of cases of each class if unbalanced.
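The metrics above can be computed from paired lists of actual and predicted labels, as in the following minimal sketch. It assumes that the weighted averages use the number of actual cases of each class as weights, which is one common reading of "weighted by the number of cases of each class". The function name and the toy labels are hypothetical.

```python
def evaluate(actual, predicted):
    """Compute the confusion-matrix metrics for one binary label (True = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)

    def ratio(numerator, denominator):
        # Undefined (e.g., PPV when nothing is predicted positive) is reported as NaN.
        return numerator / denominator if denominator else float("nan")

    ppv = ratio(tp, tp + fp)
    npv = ratio(tn, fn + tn)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, fp + tn)

    # Assumed weighting: the number of actual positive and negative cases.
    n_pos, n_neg = tp + fn, tn + fp
    total = n_pos + n_neg
    return {
        "ppv": ppv,
        "npv": npv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "weighted_precision": ratio(n_pos * ppv + n_neg * npv, total),
        "weighted_recall": ratio(n_pos * sensitivity + n_neg * specificity, total),
    }

# Toy labels (hypothetical): 4 reachable cases and 6 not-reachable cases.
actual = [True] * 4 + [False] * 6
predicted = [True, True, True, False, True, False, False, False, False, False]
metrics = evaluate(actual, predicted)
```

With this choice of weights, weighted recall reduces to (TP+TN) divided by the total number of cases, i.e., overall accuracy. On a balanced dataset, such as one produced by RUS, both weighted values equal the plain averages.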

In document Decomposing responses to mobile notifications (Page 113-117)