5.4 Study 2: Evaluation with Children
6.1.1 Engagement Detection
In general, two directions for automatically tracking the user’s engagement can be found in the literature on Human-Machine Interaction (HMI): first, tracking engagement directly, e.g., by training classification systems on large datasets; or, second, tracking the affective and cognitive states that are known to be indicative of engagement and affective learning (see Sections 2.1.4 and 2.1.2).
6.1. Engagement and the Dimension of Affective Learning
These states can then be used to make decisions about the course of the interaction (e.g., D’Mello and Graesser, 2012) or to infer engagement later on (e.g., Altuwairqi et al., 2018).
Most of the commonly used approaches rely on machine learning techniques to automatically analyze children’s behavioral cues and to derive their corresponding affective and cognitive states, or their engagement level. Sensory inputs (e.g., from cameras and microphones) are used to capture the different modalities, such as facial expressions, body postures, the voice, or information from the interaction itself. Many of these techniques, e.g., Support Vector Machines (SVMs), are supervised and need pre-processed input data: the input values are first transformed into numerical representations, and each data point is then labeled, e.g., with the corresponding engagement level. From such data, an SVM, for instance, can learn a function that maps the input data to the corresponding output labels, which can then be used to classify the user’s engagement level from new input data during an interaction. Although this process increases the time needed to prepare the machine learning environment and can make it harder to transfer an approach to different contexts, it is still widely used in the community.
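The supervised pipeline described above (numerical features, labels, training, classification of new data) can be illustrated with a minimal sketch. It is not the method of any work cited here: it fits a linear SVM by sub-gradient descent on the hinge loss using only the Python standard library. The two features (a gaze-on-task score and a forward-lean score, both assumed to be standardized to zero mean), the toy data, and all hyperparameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def train_linear_svm(samples: List[Vector], labels: List[int],
                     epochs: int = 200, lr: float = 0.1,
                     reg: float = 0.01) -> Tuple[List[float], float]:
    """Fit weights w and bias b by sub-gradient descent on the
    L2-regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Margin violated: step toward classifying x correctly.
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the regularizer shrinks w.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Return +1 (engaged) or -1 (not engaged) for feature vector x."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, already standardized features: [gaze_on_task, forward_lean].
X_train = [[0.8, 0.6], [0.6, 0.9], [0.9, 0.7],
           [-0.7, -0.8], [-0.9, -0.5], [-0.6, -0.7]]
# Hypothetical labels from human coders: +1 engaged, -1 not engaged.
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
print(predict(w, b, [0.9, 0.9]))  # classify a new observation
```

In practice, the costly steps are the ones the sketch takes for granted: extracting the features from sensor data and obtaining reliable labels.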
Castellano et al. (2012), for instance, used this technique to directly assess the user’s engagement based on pre-recorded sessions of a chess game interaction. They compared several SVM-based models trained with different information about the game (e.g., game state and game evolution), the social context of the interaction, and the game turn level (e.g., encouraging comments and scaffolding). Their evaluation showed that the SVM incorporating the social game context as well as interactional information about the game achieved the best performance, with an average accuracy of 80%. However, basing the classification process on specific game state information can complicate its generalization and transfer to other games and contexts.
A different approach by Sanghvi et al. (2011) used the postural expressions of children to train five other supervised classification approaches in Weka¹ (ADTree, OneR, LogitBoost, MultiClassClassifier and logistic functions). The postural expression data consisted of features of the users’ full upper-body silhouette that were automatically extracted by a computer vision algorithm and subsequently labeled by three trained coders. Their results demonstrated an accuracy of up to 82% for two of the five classifiers, namely ADTree and OneR.
Many other approaches that directly classify users’ engagement can be found, incorporating, e.g., conditional random fields applied to audio-visual data (Foster et al., 2017) or personalized deep learning frameworks applied to audio-visual and physiological data (Rudovic et al., 2018). However, the majority of approaches seem to focus only on tracking users’ affective or cognitive states.
Since cognitive states are mainly internal and often not expressed through overt behavior, most research focuses on tracking only the user’s attention. This is generally done by analyzing the user’s gaze behavior during the interaction, e.g., directly with an eye-tracker (El Haddioui and Khaldi, 2012; Yang et al., 2013) or via the head pose (Lemaignan et al., 2016a). For example, Lemaignan et al. (2016a) used the OpenCV² framework to calculate children’s attention and “with-me-ness” during a robot-teaching task. Here, the term with-me-ness represents a measure of “how much the user is with the robot during a task” (Lemaignan et al., 2016a, p. 163). For both measures, their evaluation yielded results fairly close to the ground truth provided by human raters.
¹ https://www.cs.waikato.ac.nz/ml/weka/index.html ² https://opencv.org/
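One plausible way to operationalize such a measure is sketched below. It assumes that a head-pose estimate is available for every frame and that a set of expected attentional targets is known for each phase of the task; the score is then the fraction of frames in which the estimated focus of attention is one of the expected targets. This is an illustration under those assumptions, not the implementation of Lemaignan et al. (2016a). The target names, phase names, target directions, and the 20-degree threshold are hypothetical, and the angular distance is a simple planar approximation.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical target directions as (yaw, pitch) in degrees,
# expressed relative to the child's neutral head orientation.
TARGETS: Dict[str, Tuple[float, float]] = {
    "robot": (0.0, 0.0),
    "tablet": (0.0, -35.0),
    "experimenter": (60.0, 0.0),
}


def focus_of_attention(yaw: float, pitch: float,
                       targets: Dict[str, Tuple[float, float]] = TARGETS,
                       max_angle: float = 20.0) -> Optional[str]:
    """Return the target nearest to the head direction, or None if
    no target lies within max_angle degrees (planar approximation)."""
    best_name, best_dist = None, max_angle
    for name, (t_yaw, t_pitch) in targets.items():
        dist = math.hypot(yaw - t_yaw, pitch - t_pitch)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    return best_name


def with_me_ness(frames: List[Tuple[float, float, str]],
                 expected: Dict[str, Set[str]]) -> float:
    """Fraction of frames whose estimated focus is an expected target.

    Each frame is (yaw, pitch, task_phase); expected maps a task
    phase to the set of targets the child is expected to attend to.
    """
    if not frames:
        return 0.0
    hits = sum(
        1 for yaw, pitch, phase in frames
        if focus_of_attention(yaw, pitch) in expected.get(phase, set())
    )
    return hits / len(frames)


# Hypothetical task phases and head-pose samples.
expected_targets = {"explain": {"robot"}, "write": {"tablet", "robot"}}
frames = [
    (2.0, -3.0, "explain"),   # looking at the robot: expected
    (58.0, 4.0, "explain"),   # looking at the experimenter: not expected
    (1.0, -30.0, "write"),    # looking at the tablet: expected
    (-80.0, 10.0, "write"),   # looking away from all targets
]
score = with_me_ness(frames, expected_targets)
print(score)  # 0.5
```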
In the scope of affect detection, however, a broader variety of modalities and approaches has already been used. One widely applied method, also in commercial products such as Affectiva Affdex (McDuff et al., 2016), is the analysis of facial expressions to detect the affective state of a user (e.g., McDaniel et al., 2007; Wang et al., 2018). But these classifiers are often trained on “very expressive and acted” emotions, for which they yield classification accuracies of around 97%, which makes their applicability to real-world interactions questionable (cf. Stöckli et al., 2018). In fact, the accuracy of emotion detection based on facial features is often low in real-world applications, and the recognition rate depends strongly on the individual expressiveness of each person (cf. Benta and Veida, 2015; Stöckli et al., 2018).
An alternative approach is the detection of affective states from the user’s voice (Devillers and Vidrascu, 2006; Kim et al., 2017; Tzirakis et al., 2018). Classifiers that analyze the voice are often trained on datasets of spontaneous speech, which makes them more suitable for real-world applications. Kim et al. (2017), for instance, used an SVM to assess the affective states of users interacting and playing with robots such as Robotis OP2, Robotis Mini or Romo. To train their model, they used the IEMOCAP database³, which contains approximately 12 hours of audio-visual data from adult speakers, and their evaluation showed a reasonably good classification performance (Kim et al., 2017). However, in a cHRI setting, affect detection through speech analysis is difficult, because speech input is not always included, since ASR systems for young children still have low accuracy (Kennedy et al., 2017b).
Other attempts analyzed written text to detect the user’s affective states (Alm et al., 2005; Kahn et al., 2007), for instance by analyzing the usage of adjectives and adverbs. However, this approach is not applicable to kindergarten children, since they are usually not yet able to read and write, and even adults usually do not use written text in natural face-to-face interactions.
A more extensive approach to affective state detection is the tracking of the whole body posture and movements, e.g., by using a Microsoft Kinect (McColl and Nejat, 2012) or a body pressure mat placed on a chair (D’Mello and Graesser, 2009). A limitation of the latter is that it assumes the user stays seated without moving around, which is not easy for young children aged 4-6 years. The Kinect, in contrast, allows the user to move around, but may have problems detecting smaller events, such as small gestures or postural shifts.
In addition, approaches based on human physiology have been developed. In this realm, techniques such as ECG, EEG, EMG (Wagner et al., 2005; Villon and Lisetti, 2006), and brain imaging (Immordino-Yang and Damasio, 2007) are applied to “read” the affective state from users’ physiological signals. The results of these methods are promising. However, the applicability of such obtrusive approaches (e.g., wires and patches on the body) in tutoring interactions with young children is clearly limited, and the interaction itself no longer feels natural.
Finally, multi-modal approaches have also been studied with the goal of overcoming some of the previously mentioned limitations and increasing the detection accuracy. Many combinations can be found that incorporate, e.g., facial expressions and voice (Busso et al., 2004; Esposito, 2009); facial expressions, voice and body posture (Bänziger et al., 2009); facial expressions, body postures and context-dependent activity logs (Kapoor and Picard, 2005); or speech and text (Arroyo et al., 2009). Indeed, most of these systems demonstrated that a multi-modal approach to detecting affective states results in higher accuracy rates.
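One common way to combine modalities is decision-level (late) fusion, in which each modality has its own classifier and the resulting class probabilities are merged afterwards. The sketch below uses a weighted average. It is a generic illustration and does not reproduce the fusion strategy of any of the systems cited above; the modality names, class labels, weights, and probabilities are hypothetical.

```python
from typing import Dict


def late_fusion(per_modality: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted average of per-modality class probabilities.

    Modalities without a positive weight are ignored, so a sensor
    that is unavailable or unreliable can simply be given weight 0.
    """
    fused: Dict[str, float] = {}
    total = 0.0
    for modality, probs in per_modality.items():
        weight = weights.get(modality, 0.0)
        if weight <= 0:
            continue
        total += weight
        for label, p in probs.items():
            fused[label] = fused.get(label, 0.0) + weight * p
    if total == 0:
        raise ValueError("no modality with a positive weight")
    return {label: value / total for label, value in fused.items()}


# Hypothetical outputs of three single-modality classifiers.
predictions = {
    "face": {"engaged": 0.4, "bored": 0.6},
    "posture": {"engaged": 0.8, "bored": 0.2},
    "gaze": {"engaged": 0.7, "bored": 0.3},
}
# Hypothetical weights: the face classifier is trusted least.
modality_weights = {"face": 1.0, "posture": 2.0, "gaze": 2.0}

fused = late_fusion(predictions, modality_weights)
decision = max(fused, key=fused.get)
print(decision, fused)
```

In this toy case the face classifier alone would vote for "bored", while the fused estimate favors "engaged", which shows how a weak modality can be outweighed by the others.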
In summary, many different approaches covering a broad variety of modalities have been developed so far. However, not all of them are applicable when trying to provide an interaction that is as natural as possible for young kindergarten children learning with a SAR. Most of them have special requirements, are very obtrusive, or have a low accuracy when applied in the wild. Furthermore, they often have not been verified to work with young kindergarten children and, consequently, are not necessarily applicable to this particular age group. In fact, observations from Study 2 revealed that most children neither provide much speech input nor show many facial expressions during the interaction with a SARTS. Instead, approaches that are based on learners’ gaze direction, body posture or movement might be successful in tracking their engagement. However, most affect detectors are trained on large amounts of specifically annotated data to identify the important cues for each affective state, which also holds for commercial products such as Affectiva Affdex. Establishing such a dataset in a natural way, without using an artificial setting for data recording, is very complicated and time-consuming, especially when it is not known in advance which features are important. Moreover, due to the fast development of children (cf. Mooney, 2013, Chap. 4), datasets for different age ranges might be needed to train separate classifiers that can autonomously and reliably track children’s engagement based on their behavioral cues.