
5.3 Data and Labels

5.4.1 Utilizing Expert-Engineered Features

The first set of models uses expert-engineered features to detect each label of student affect, off-task behavior, and gaming the system, using methods similar to those implemented in previous work. As described in the Data and Labels section, the expert features are first generated from the raw action-level log data. In both feature sets, the features describe the actions that occur within 20 seconds of observation, but also include neighboring actions that extend beyond those 20 seconds in order to capture their full context (e.g. a student may take over a minute to respond after receiving help feedback, and we include that response). Clips are therefore not completely uniform in duration and can describe intervals longer than 20 seconds, particularly if a student exhibits idle periods while interacting with the system.

In the case of the engineered features used in the affect and off-task behavior detectors, 23 distinct features are created from the raw logs for each action, and an average, sum, min, and max are then applied across the actions of each clip to aggregate these features (23 distinct features multiplied by the four functions yields the final 92 features). Each set of features describes one or more actions and includes such measures as time on task, hint usage, correctness, and other similar descriptives of student performance and interaction with the system, as well as skill-based features (e.g. the number of problems previously seen by the student pertaining to a given knowledge component) and recent performance history (e.g. the number of incorrect responses over the last 5 problems).
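As an illustration of the aggregation step described above, the following is a minimal sketch, assuming each action in a clip is represented as a dictionary of numeric action-level features. The function name `aggregate_clip` and the feature names `time_on_task` and `hint_count` are hypothetical stand-ins chosen for this example and are not the names used in the original feature set.

```python
from statistics import mean

# The four aggregation functions applied to every action-level feature.
AGGREGATES = {"avg": mean, "sum": sum, "min": min, "max": max}


def aggregate_clip(actions):
    """Collapse a clip's per-action feature dicts into one clip-level dict."""
    clip_features = {}
    for name in sorted(actions[0]):
        values = [action[name] for action in actions]
        for suffix, fn in AGGREGATES.items():
            clip_features[f"{name}_{suffix}"] = fn(values)
    return clip_features


# Hypothetical clip with three actions and two action-level features.
clip = [
    {"time_on_task": 4.0, "hint_count": 0},
    {"time_on_task": 10.0, "hint_count": 1},
    {"time_on_task": 7.0, "hint_count": 1},
]
features = aggregate_clip(clip)
```

With two action-level features the sketch yields eight clip-level features; with the 23 features described above, the same procedure yields 92.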

The engineered features used in the gaming detector similarly aggregate student actions into 20-second clips, but then apply several behavior- and pattern-matching techniques to generate the 33 distinct features. These features attempt to measure gaming behavior through estimates of student timing information (e.g. apparent lack of time spent thinking before asking for help), repetitive actions (e.g. providing the same incorrect response multiple times), and the Levenshtein distance [Lev66] applied to the entered text of student responses, which is used to identify a specific form of guessing behavior (e.g. providing similar incorrect answers).
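To make the Levenshtein-based guessing feature concrete, the sketch below computes the standard edit distance between two strings and counts consecutive incorrect answers that differ only slightly. The helper `count_similar_retries` and its `max_distance` threshold are assumptions made for illustration; the exact rule used by the gaming detector is not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def count_similar_retries(wrong_answers, max_distance=1):
    """Count consecutive incorrect answers that are near-duplicates.
    Identical answers (distance 0) are excluded, since exact repeats
    are covered by a separate repetition feature.
    The threshold is a hypothetical choice."""
    return sum(
        1
        for prev, cur in zip(wrong_answers, wrong_answers[1:])
        if 0 < levenshtein(prev, cur) <= max_distance
    )
```

For example, a student entering "12", "13", and "14" in succession produces two near-duplicate retries under this sketch, a pattern consistent with systematic guessing.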

Previous work exploring each of these labels applied a large range of rule-, tree-, and regression-based models. For the purpose of the comparisons described in this work, we apply a Naive Bayes classifier, a REP tree classifier (a type of decision tree classifier with reduced error pruning [EK01]), and a Long Short-Term Memory (LSTM) deep learning network [HS97] for the gaming, off-task behavior, and affect detection tasks, respectively, in accordance with previous work. To the authors' knowledge, these models represent the highest-performing previously published models of their respective outcome measures and were chosen for comparison for this reason. The use of a deep learning model for affect inherently conflates expert-engineered features (used as input to the model) with machine-learned features (through the hidden layer of the network); we nevertheless compare it alongside the other models utilizing expert-engineered features, as it is this set of features that is used as input to the model.

As was the case in previous work, each model uses only the clips with corresponding labels as input and produces a continuous-valued output representing the probability that each affective state or unproductive behavior is exhibited within the supplied clip. In the case of the off-task behavior and gaming models, each clip is supplied to the respective REP tree and Naive Bayes model and the result is compared to the binary label, with positive labels corresponding to each case of off-task behavior and gaming and negative labels corresponding to a lack of each behavior (e.g. on-task behavior and non-gaming behavior). Due to the large number of features generated and the likely collinear relationships among some of the engineered features, forward feature selection is applied directly prior to each model training procedure to select at most the 10 best features to use in each model.
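The forward feature selection step can be sketched as a greedy search that adds one feature at a time and stops once 10 features are selected or no candidate improves the score. This is a minimal sketch under stated assumptions: the scoring criterion is not specified in the text, so `score_fn` is a hypothetical callable standing in for whatever model-performance measure is used, and `toy_score` with its `GAIN` table exists purely for demonstration.

```python
def forward_select(candidates, score_fn, max_features=10):
    """Greedy forward selection: repeatedly add the single feature that
    most improves score_fn, up to max_features, stopping early when no
    remaining feature yields an improvement."""
    selected = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining and len(selected) < max_features:
        top_score, top_feature = max(
            (score_fn(selected + [feature]), feature) for feature in remaining
        )
        if top_score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Hypothetical per-feature gains; a real score_fn would train and
# evaluate the model on the training folds for each candidate subset.
GAIN = {"f1": 0.30, "f2": 0.20, "f3": 0.05, "f4": 0.00}


def toy_score(subset):
    # Small per-feature penalty so that uninformative features are rejected.
    return sum(GAIN[f] for f in subset) - 0.01 * len(subset)


chosen = forward_select(GAIN, toy_score, max_features=10)
```

In this toy setting the search keeps `f1`, `f2`, and `f3` and rejects `f4`, which adds nothing beyond its penalty.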

This paradigm differs for the affect detector, as the four affective states are modeled simultaneously as a multinomial classification task through the use of the LSTM model. As a type of recurrent neural network, LSTMs attempt to model sequential relationships within the data; the labeled clips are therefore not treated as independent samples by the model, but rather as a sequence for which a sequence of 4-valued predictions is generated in a many-to-many (or, as it is more commonly called, sequence-to-sequence) manner. As was done in [BBH17] to ensure better temporal consistency within each sequence of clips, student sequences are partitioned such that each clip in the observed sequence occurs no more than approximately 5 minutes after the previous clip; wherever the span between clips exceeds this threshold, the sequence is split into two (or more) sequences of student interaction for input to the model.
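The sequence partitioning described above can be sketched as follows, assuming each clip carries a timestamp in seconds and that one student's clips are already sorted by time. The function name `split_sequences` and the tuple representation are hypothetical; the 300-second default reflects the approximate 5-minute threshold.

```python
def split_sequences(clips, max_gap_seconds=300):
    """Split one student's time-ordered clips into sub-sequences wherever
    the gap between consecutive clips exceeds max_gap_seconds.
    Each clip is a (timestamp_seconds, payload) tuple."""
    sequences = []
    current = []
    for clip in clips:
        if current and clip[0] - current[-1][0] > max_gap_seconds:
            sequences.append(current)
            current = []
        current.append(clip)
    if current:
        sequences.append(current)
    return sequences


# Hypothetical timestamps (seconds) with two gaps longer than 5 minutes.
clips = [(0, "a"), (60, "b"), (200, "c"), (900, "d"), (950, "e"), (2000, "f")]
sequences = split_sequences(clips)
```

Each resulting sub-sequence would then be supplied to the LSTM as a separate input sequence.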

Each of the models is trained and evaluated using stratified 10-fold student-level cross validation. Given the large imbalance among the labels, we stratified the folds based on the number of occurrences of positive labels of each outcome at the student level. This helps to ensure that each fold contains a representative distribution of labels; because the stratification is performed at the student level, it is difficult to produce perfectly balanced folds in which each contains a fully representative set of labels, but the stratification method is an effort toward this property. All subsequent models described in this work use the same student folds described here for better comparability between methods. Each method is trained and evaluated on the same student data and labels within each respective fold.
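One simple way to approximate student-level stratification is to rank students by their number of positive labels and deal them into folds in round-robin order, so that all of a student's clips fall into a single fold and each fold receives students from across the range of label counts. This is a sketch of one possible procedure, not necessarily the one used in this work; the function name `student_level_folds` and the example counts are hypothetical, and it handles a single outcome label for simplicity.

```python
def student_level_folds(positive_counts, n_folds=10):
    """Assign whole students to folds, approximately balancing the number
    of positive labels per fold.
    positive_counts maps student id -> count of positive labels."""
    # Rank by descending positive count; break ties by id for determinism.
    ordered = sorted(positive_counts, key=lambda s: (-positive_counts[s], s))
    folds = [[] for _ in range(n_folds)]
    for rank, student in enumerate(ordered):
        folds[rank % n_folds].append(student)
    return folds


# Hypothetical label counts for 25 students.
positive_counts = {f"s{i:02d}": i % 7 for i in range(25)}
folds = student_level_folds(positive_counts, n_folds=10)
```

Because whole students are assigned rather than individual clips, fold-level label totals are only approximately equal, which mirrors the limitation noted above.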