Finally, we apply the hierarchical personalization framework to the problem of student-specific learning outcome prediction. Students may vary significantly in how they display their emotional states during learning, while engaged with and reacting to the ITS. We, therefore, wish to explore whether there are any benefits of learning slightly different mappings to the learning outcome label for each individual student, as opposed to learning a generic classification function for all students pooled together.
Figure 7.11: The mean F1-scores and their standard deviations for an HBNN model trained with all data pooled together and a student-specific HBNN model.
7.3.1 Dataset
We used the same dataset of 1596 problem outcome video-clips of students engaged with MathSpring. Each input video is represented by the same action unit-based summary statistic feature descriptor, as described in Chapter 5.
7.3.2 Experiments
Here, we report results of experiments conducted to investigate the benefits of student-specific modeling for learning outcome prediction. We trained and tested all our models on 5 random, stratified 75/25 splits of the data. All HBNN models had 1 hidden layer with 100 activation nodes and were trained for the SOF-vs-all binary classification task, as this was the only binary classification task where all students had examples for both classes.
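The evaluation protocol described above (repeated stratified 75/25 splits, scored with F1) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `stratified_split`, `f1_score` and `mean_and_std` are hypothetical, labels are assumed to be binary with 1 marking the positive class, and the model training step is left out entirely.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class (assumed 25% here).
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_and_std(values):
    """Mean and population standard deviation of per-split scores."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5
```

In use, one would call `stratified_split` with five different seeds, train a model on each training partition, score each held-out partition with `f1_score`, and report the result of `mean_and_std` over the five scores.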
Compared to a baseline model (HBNN-pooled), which was trained with the data from all 30 students pooled together (obtaining a mean F1-score of 0.61), we found that training a student-specific model did not bring any improvement in classifier performance (mean F1-score of 0.59).
The lack of improvement, we posit, can be attributed to two reasons. First, some students do not have a sufficient number of training examples; moreover, for many students, the examples are highly imbalanced across the 2 classes (e.g. some students solved almost all problems on the first attempt). Second, the underlying assumption of our personalization framework, that between-group variance is high and within-group variance is low, does not hold as strongly as in the gesture recognition problem.
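One simple way to probe the second assumption is to compare between-group and within-group variance for a single scalar feature. The sketch below is an illustrative diagnostic and not part of the framework itself; the function name `variance_ratio` and the choice of a plain variance decomposition are assumptions made for this example.

```python
from collections import defaultdict


def variance_ratio(values, groups):
    """Ratio of between-group to within-group variance for a scalar feature.

    values: one feature value per example.
    groups: the group (e.g. student) identifier of each example.
    A large ratio suggests that group-specific models have something to gain.
    """
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)

    n = len(values)
    grand_mean = sum(values) / n
    between = 0.0
    within = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        # Spread of the group means around the overall mean.
        between += len(members) * (group_mean - grand_mean) ** 2
        # Spread of the examples around their own group mean.
        within += sum((v - group_mean) ** 2 for v in members)
    between /= n
    within /= n
    return between / within if within > 0 else float("inf")
```

A ratio close to zero indicates that groups are nearly indistinguishable on that feature, in which case a pooled model is unlikely to be outperformed by group-specific ones.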
7.3.3 Summary
In this chapter, we first extensively evaluated the hierarchical Bayesian neural network model, introduced in Chapter 6, on the problem of personalized gesture recognition. We illustrated the benefits of the hierarchical model over baselines that ignore subject-specific gesture variations and demonstrated the scalability of the model’s capacity to learn complex feature-label mappings. We used the inferred posterior distributions over weights to guide active learning procedures for personalizing pre-trained models to new users. Our posterior-driven active learning algorithm consistently outperformed selecting gestures at random and outperformed or was competitive with existing methods. We then extended the framework to support recurrent architectures, demonstrating their benefits in modeling gestures.
Second, we illustrated the benefits of using a framework that adapts to contextual information. Our hierarchical Bayesian model trained on a dataset divided according to the sentiment expressed in the interviews outperformed a baseline model that ignored this contextual information in both classification and regression scenarios. We then used the HBRNN framework to model temporal dynamics in expressivity prediction but found no empirical benefits compared to results obtained using fully-connected feed-forward architectures trained on aggregated features.
Third, we investigated whether there are any benefits of learning slightly different mappings from the raw video input to the learning outcome label for each individual student, as opposed to learning a generic classification function for all students pooled together, but found no empirical benefits.
Conclusions and Future Work
In this thesis, we focused on the following challenges within face and gesture analysis: a) the classification of hand and body gestures along with the temporal localization of their occurrence in a continuous stream, b) the recognition of facial expressivity levels in people with Parkinson’s Disease using multimodal feature representations, c) the prediction of student learning outcomes in intelligent tutoring systems using affect signals, and d) the personalization of models that can adapt to subject- and group-specific nuances in facial and gestural behavior.
8.1 Contributions
Here, we summarize the major contributions of this thesis:
• We presented an analysis of methods for gesture spotting and classification by comparing two methods. The first method trains a single random forest model to recognize gestures from a given vocabulary, as presented in a training dataset of video plus 3D body joint locations, as well as out-of-vocabulary (non-gesture) instances. The second method employs a cascaded approach, training a binary random forest model to distinguish gestures from background and a multi-class random forest model to classify segmented gestures. Given a test input video stream, both frameworks are applied using sliding windows at multiple temporal scales. We evaluated our formulation in segmenting and recognizing gestures on two different benchmark datasets: the NATOPS dataset of 9600 gesture instances from a vocabulary of 24 aircraft handling signals, and the ChaLearn dataset of 7754 gesture instances from a vocabulary of 20 Italian communication gestures. The performance of our method compares favorably with state-of-the-art methods that employ Hidden Markov Models or Hidden Conditional Random Fields on the NATOPS dataset.
• We investigated how to computationally predict an accurate and objective score for facial expressivity in people with Parkinson’s Disease. We first presented a baseline method that trains a random forest regressor based on geometric shape features of the face. We provided insight on the geometric features that are important in this prediction task by computing variable importance scores for our features. We then built improved models on more informative facial action unit-based features, providing interpretations based on their aggregated feature importance. We demonstrated the utility of extracting features from not only the visual domain but also the audio domain in order to accurately predict facial expressivity, finding that a model trained on a combined audio-visual feature representation outperformed models trained on features extracted from a single modality. We evaluated our formulation on a dataset of 772 20-second interview video clips of PD patients using 9-fold cross validation.
• We described the process by which a novel multimodal dataset used in this study was collected and annotated, with the aim of filling an existing gap in the affective tutoring systems literature: a benchmark, publicly available facial affect dataset in an educational setting. We provided an exploratory analysis of the different problem outcome classes using average facial action unit activations, discussing some interesting observed trends. Based on this novel dataset, we then developed baseline models to predict the problem outcome labels of students solving math problems, demonstrating their effectiveness in accurately forecasting several problem outcome labels.
• We developed hierarchical Bayesian neural networks for personalized modeling of face and gesture signals in the presence of inter-group and inter-subject variations. Leveraging recent work on learning Bayesian neural networks, we built fast, scalable, variational inference-based algorithms for inferring the posterior distribution over all network weights in the hierarchy. We also developed methods for adapting our model to new groups when only a small amount of group-specific personalization data is available. We proposed to utilize active learning algorithms for interactively labeling personalization data in resource-constrained scenarios. We also implemented recurrent variants of our hierarchical Bayesian model, given their suitability for building models involving sequential signals.
• We applied our hierarchical Bayesian framework to three tasks: subject-specific gesture recognition, context-specific facial expressivity prediction and student-specific learning outcome prediction.
First, we illustrated the benefits of the hierarchical model over baselines that ignore subject-specific gesture variations and demonstrated the scalability of the model’s capacity to learn complex feature-label mappings, testing our framework on three widely used gesture recognition datasets. We used the inferred posterior distributions over weights to guide active learning procedures for personalizing pre-trained models to new users, showing that our posterior-driven active learning algorithm consistently outperformed selecting gestures at random. We demonstrated the suitability of applying hierarchical Bayesian recurrent neural networks in the gesture recognition task, achieving comparable or improved model performance at a fraction of the parameter cost.
Second, we illustrated the benefits of using a framework that adapts to contextual information in the task of facial expressivity prediction. Our hierarchical Bayesian model trained on a dataset divided according to the sentiment expressed in the interviews outperformed baseline models that ignored this contextual information in both classification and regression scenarios.
Third, we applied our personalization framework to the problem of student-specific problem outcome prediction. However, unlike in subject-specific gesture recognition and context-specific expressivity prediction, we did not find empirical benefits of using our personalization framework over a generic classifier.
8.2 Strengths, Limitations and Future Research Directions
Here, we discuss the strengths of the methods we have proposed and address their weaknesses, suggesting ideas for research directions that could further improve our work.
8.2.1 Gesture Spotting and Recognition
We presented an analysis of methods for gesture spotting and classification by comparing a framework that employs a single multi-class random forest classification model to distinguish gestures from a given vocabulary in a continuous video stream with a framework that uses a cascaded approach. The strengths of the two methods we proposed lie in their simplicity to train and their capacity to generalize well to variations in user size, distance to the sensor, and speeds at which the gestures are performed, as well as our methods’ robustness to the effects of sensor noise. One area of the framework that can be improved is the process of selecting and creating better feature sets. Many additional features, such as joint-pair distances used by Yao et al. [134], can be experimented with in order to improve the accuracy of our framework. Additionally, selecting a small group of features over an interval of frames to split a node in a decision tree, instead of selecting a single feature at a single frame, might be better suited to the purpose of learning complex spatio-temporal objects such as gestures. However, computing more features may hamper the random forest framework’s speed during test time.
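As an illustration of the kind of additional feature mentioned above, joint-pair distances for a single frame can be computed from 3D joint locations as sketched below. This is a generic sketch, not the formulation of Yao et al. [134]; the function name `joint_pair_distances` and the input layout (one `(x, y, z)` tuple per joint) are assumptions made for this example.

```python
import math
from itertools import combinations


def joint_pair_distances(joints):
    """Euclidean distance for every unordered pair of joints in one frame.

    joints: list of (x, y, z) tuples, one per tracked body joint.
    Returns a flat list with J * (J - 1) / 2 entries for J joints.
    """
    return [math.dist(a, b) for a, b in combinations(joints, 2)]
```

The feature count grows quadratically with the number of joints, which is one concrete way in which richer feature sets can slow down test-time computation.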
In gesture recognition, there are often ambiguities between similar gesture pairs in both datasets, which our random forest classifier cannot differentiate well. A potential idea for further exploration is to use another layer of tree-forest classifiers to identify the features that can differentiate the ambiguities in order to further refine classification results. In general, gesture classification can be performed in a hierarchical framework, where random forests at the top-most level will accurately separate a dynamically-defined set of super-classes, each of which will be subject to further classification by classifiers at subsequent layers, until all classes are well-separated.
Moreover, feature engineering approaches have generally been replaced by feature learning approaches across many large-scale computer vision tasks, including gesture recognition. Novel neural network architectures based on CNNs, LSTMs, 3D-CNNs and their unique combinations learn discriminative feature representations directly from input skeletal, RGB and depth data and have been shown to obtain good results on numerous gesture recognition benchmark datasets [83, 92, 80, 71]. For example, Neverova et al. [83] presented a gesture localization and recognition scheme based on a multimodal deep learning architecture that leverages audio signals to take advantage of the fact that gestures are often accompanied by speech or sounds.
Another drawback of our current approach lies in the use of a sliding window mechanism. Exhaustive, multi-scale sliding window search is not very computationally efficient and cannot predict flexible gesture boundaries. Workarounds to sliding window approaches have been proposed in the object detection [50, 57] and activity detection [128] literature. However, these approaches rely on the entire input (e.g. the complete input image or complete input video) being available to the algorithm during test time. In real-time gesture recognition applications, where the model must respond with its prediction as the input stream arrives, sliding window approaches are still appropriate.
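The cost of exhaustive multi-scale search is easy to see from a sketch of how candidate windows are enumerated. The function below is illustrative only; the name `sliding_windows`, the fixed stride shared by all scales, and the half-open `(start, end)` convention are assumptions made for this example.

```python
def sliding_windows(num_frames, window_lengths, stride):
    """Return (start, end) frame ranges, end exclusive, for every scale.

    num_frames: length of the input stream seen so far.
    window_lengths: the temporal scales to search, in frames.
    stride: number of frames between consecutive window starts.
    """
    windows = []
    for length in window_lengths:
        # Windows longer than the stream produce an empty range.
        for start in range(0, num_frames - length + 1, stride):
            windows.append((start, start + length))
    return windows
```

Every returned window requires one classifier evaluation, so the total work scales with the number of temporal scales times the stream length divided by the stride. Gesture boundaries are also restricted to the discrete set of window lengths, which is why such a search cannot predict flexible boundaries.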
We should also note that deep learning approaches are not always suitable for gesture recognition applications. First, gesture recognition applications often require low-latency computation on resource-constrained devices, e.g. real-time gesture recognition in AR/VR settings, where only a fraction of the on-device computation resources can be devoted to gesture recognition. Second, gesture recognition systems are often designed for specific applications, where data collection and annotation at the scale required for most deep learning methods can be prohibitively expensive.
8.2.2 Predicting Active Facial Expressivity in People with Parkinson’s Disease
We presented an interpretable system that computes facial expressivity scores in people with Parkinson’s Disease using multimodal audio-visual feature descriptors extracted from a video sequence. Automated assessment of facial expressivity in Parkinson’s Disease patients has the potential to be a useful tool for clinicians in this field. Human coders have successfully coded facial expression in people with PD [54], but the costs associated with the manual assessment of all patients with PD can be prohibitively high. Comprehensive manual coding of 20 seconds of video can take upwards of an hour, and often two coders are needed to establish that the coding is reliable. Most existing works in the domain of computational facial analysis of PD patients are limited to small-scale pilot studies comparing the characteristics and dynamics of facial expressions exhibited by a small group of PD patients against those of a separate control group. By utilizing a dataset of 772 short audio-video clips of 117 PD patients along with their facial expressivity labels, we demonstrated the feasibility of using a machine learning model to predict the facial expressivity ratings of new audio-video clips.
A potential weakness of our current approach lies in the simplicity of our feature representation. Although summary statistics-based feature representations, such as the ones we have used, provide concise, easy-to-interpret features that were appropriate for our application, we forgo a significant amount of signal from the raw input, which could potentially prove useful for more complex classification/regression frameworks. However, utilizing larger, more complex models is challenging, given the relatively small size of our dataset (consisting of fewer than 700 training samples).
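To make the trade-off concrete, a summary statistics-based descriptor of the kind discussed above can be sketched as follows. The particular statistics (mean, standard deviation, minimum, maximum) and the function name `summarize_clip` are assumptions chosen for illustration and may differ from the descriptor defined earlier in this thesis.

```python
from statistics import mean, pstdev


def summarize_clip(frames):
    """Collapse a variable-length clip into a fixed-length descriptor.

    frames: list of per-frame feature vectors of equal length,
            e.g. one action unit intensity per dimension.
    Returns [mean, std, min, max] concatenated over feature dimensions.
    """
    descriptor = []
    # zip(*frames) iterates over one time series per feature dimension.
    for series in zip(*frames):
        descriptor.extend(
            [mean(series), pstdev(series), min(series), max(series)]
        )
    return descriptor
```

The descriptor length depends only on the number of feature dimensions and not on the clip duration, which keeps models small and interpretable. The ordering of frames is discarded entirely, which is the loss of signal referred to above.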
Considering that PD is widespread and affects millions of people around the world, the benefits of an accurate, interpretable and automated facial analysis for patients are beyond doubt. One avenue for further research is to extend this work to a larger scale. However, obtaining real patient data on a large scale can be a challenge. An interesting research question to ask then is: can the vast amounts of audio-video interview data widely and freely available on the Internet be leveraged to learn better facial expressivity models? With some expense for expert annotation, one could train deep, multimodal models on the large, non-PD data and fine-tune them on the smaller target dataset of PD patients. Given that the distribution of the source domain of interview clips might differ from that of the target domain of interview clips of PD patients, domain adaptation methods might be useful [117].
8.2.3 Affect-driven Learning Outcomes Prediction in Intelligent Tutoring Systems
We investigated the problem of predicting the learning outcome of students from facial affect signals, based on a novel dataset of videos of students interacting with MathSpring, a popular web-based ITS. The dataset was collected with the intention of releasing it as a benchmark affect dataset in an educational setting, the likes of which are currently missing in the literature. Based on this novel dataset, we developed models to directly predict the learning outcomes of the students from concise action unit-based feature representations that capture the facial affect dynamics of the input video. This is different from most existing work, which maps the input video to the student’s emotional state, such as happiness, anger and level of engagement.
The results we provided are those of baseline models, and there are several avenues for improvement. First, we have so far ignored two rich streams of information while building our predictive models: the GoPro video stream that captures the students’ faces when they are facing down and therefore not visible to the webcam, and the mouse-coordinate clickstream, which can often be very informative about the students’ internal state. A multimodal model that utilizes signals from all streams will probably result in better predictive performance.
Despite the relatively large size of the raw dataset, the problem outcome labels are quite sparse. It is therefore challenging to build accurate models that map very high dimensional, highly variable spatio-temporal affect signals into a single problem outcome label using only a few examples. Moreover, the raw input signals are mostly dominated by non-informative neutral facial expressions. Obtaining denser labels around times of high facial activity could help provide an improved understanding of the relationship between facial affective signals and the final problem outcome.
Finally, the biggest challenge in ATSs is to then utilize these affect-sensitive models to provide appropriate and effective interventions that quantifiably improve the learning experience. There have been some recent works that have ventured in this direction. For example, Gordon et al. [51] combined the valence and engagement values inferred from
facial affect of students and inputted them into a social robot’s reinforcement learning algorithm, which allowed the social robot tutor to personalize its motivational strategies according to the observed facial affect of students. In future work, our research team plans to provide personalized interventions in MathSpring based on the proposed affect analysis models, and then conduct experiments to validate the effectiveness of the interventions, as well as analyze affect signals that result after the interventions are applied.
8.2.4 Hierarchical Bayesian Neural Networks and Applications
Group-specific variations in data can pose a significant challenge to building robust and reliable classifiers. We developed hierarchical Bayesian neural networks for personalized modeling of face and gesture signals in the presence of inter-group and inter-subject variations. When group-membership labels are available, we showed that they can be leveraged to build group-specific models within a hierarchical framework. We demonstrated the utility of this hierarchical approach to three tasks: subject-specific gesture recognition, context-specific facial expressivity prediction and student-specific learning outcome