Chapter 5: Automated Analysis Techniques for Multi-Modal Corpora
4- Repeat process: Track the complete ten-minute extract, compare and assess the accuracy of the results seen for each speaker, and summarise the findings.
5.3.3. Results
In order to assess the relative accuracy of the tracker, that is, the proficiency of the system, a logical ‘point of entry’ into the data is to focus on specific movements that are detected as being significantly higher or lower than the mean y-axis position of the head. It is hypothesised that those y-axis movements which differ significantly from the mean are of a head-up or head-down nature, in other words a sequence of movements that potentially corresponds to a head nod.
To explore this hypothesis further, the tracking outputs generated from each speaker can be plotted graphically. Figure 5.3 maps the relative y-axis position of the head for <$M>, showing the up-and-down motion of the head over time, denoted by the progression of frames of the video. A more detailed, ‘raw’, frame-by-frame breakdown of the head tracking results for <$M> can be found in file 5.1 on the data disk (the left hand side of the table), and real-time video records of the tracker in action for this speaker can also be found on the data disk (file 5.2).
Figure 5.3: Tracking the head movements of <$M> throughout the ten-minute case study data.
Although a head nod is theoretically seen as an up-and-down movement, for the initial analyses of the tracked output it is appropriate to also explore situations where an intense up or intense down movement occurs in isolation. These are situations in which there is no preceding and/or following up or down movement of the head. This is because, as seen in Chapters 2 and 4, a nod does not necessarily comprise identical forms of movement in both directions. So, in instances where an intense movement is used, this may only be evident in one direction or the other.
In order to investigate the most intense head-up and head-down tracked movements of <$M>, those frames which have a y-axis position more than circa 2 standard deviations (S.D.) above or below the mean head location can be examined in greater detail. This includes frames that have a y-axis output reading of y ≤ 25 or y ≥ 35 (refer to Appendix 5.1 for details of these specific frames), and are movements which are considerably above or below the average and/or ‘no-movement’ value (mean value = 30.135, 1*S.D.= 2.0223, 2*S.D.= circa 5.0446, rounded down to 5). It is hypothesised that 2 S.D. from the mean is an appropriate figure to test since such emphatic movements are more likely to be attributed to some form of gesticulation, though not necessarily a nod, rather than simply a shuffle or a fidget. Behaviours such as fidgeting and shuffling are instead assumed to cause more subtle differences in the head position than a head nod perhaps would.
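The frame selection described above can be sketched as a simple threshold filter. This is a minimal illustration rather than the procedure actually used in the study: the function name `intense_frames`, the zero-based frame numbering and the toy trace are assumptions, and only the thresholds of 25 and 35 are taken from the text.

```python
def intense_frames(y_positions, lower=25, upper=35):
    """Flag frames whose y-axis reading is at or beyond the thresholds.

    Returns a list of (frame_index, y, kind) tuples, where kind is
    "peak" for y >= upper and "trough" for y <= lower.
    Frame indices start at 0 here, which is an assumption.
    """
    flagged = []
    for frame, y in enumerate(y_positions):
        if y >= upper:
            flagged.append((frame, y, "peak"))
        elif y <= lower:
            flagged.append((frame, y, "trough"))
    return flagged


# Hypothetical toy trace, not data from the study.
trace = [30, 31, 36, 70, 29, 25, 3, 30]
flagged = intense_frames(trace)
print(flagged)
```

Readings strictly between the two thresholds are treated as the ‘no-movement’ band and are ignored.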
Since the raw tracking output deals with a frame-by-frame account of the tracking results (see file 5.1 on the data disk), it is useful to group ‘clusters’ of frames that are located within this range of y-axis positions in order to make the analysis of the data more manageable. As a working benchmark, a ‘cluster’ is taken as a collection of up-or-down outputted movements that lie within the span of 25 frames of each other, and within the y-axis range given above. These are, therefore, groups of movements that are above or below 2 S.D. of the mean and which exist within a 1 second time frame of each other. Although a 1 second margin appears slight, head nods can range extensively in terms of intensity and duration. In order to identify the full range of head nod movements, from short to long and from type A to type E, a margin this small, and possibly even smaller, is required.
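One way to read the clustering benchmark is as gap-based grouping: a flagged frame joins the current cluster if it lies no more than 25 frames after the previous flagged frame. The sketch below follows that reading, which is an assumption, since the text does not specify whether the 25 frame span is measured between neighbouring frames or across the whole cluster. The function name and the example frame numbers are hypothetical.

```python
def cluster_frames(frames, max_gap=25):
    """Group flagged frame numbers into clusters.

    A frame is added to the current cluster when it is at most
    `max_gap` frames after the last frame already in that cluster;
    otherwise it starts a new cluster.
    """
    clusters = []
    for f in sorted(frames):
        if clusters and f - clusters[-1][-1] <= max_gap:
            clusters[-1].append(f)
        else:
            clusters.append([f])
    return clusters


# Hypothetical flagged frames, not data from the study.
peak_frames = [100, 101, 110, 400, 420, 900]
clusters = cluster_frames(peak_frames)
print(clusters)
```

Peaks and troughs would be clustered separately under this sketch, by passing in the frame numbers for each kind in turn.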
For <$M>, 599 frames, from a total of 15403, outputted y-axis coordinates at or beyond these thresholds (y ≤ 25 or y ≥ 35). These results can be clustered into a total of 40 intense head-up sequences (regarded as peaks hereafter; where the y-axis position is ≥35 for some or all of the frames across the cluster range), which are marked with red nodes in Figure 5.3. In addition, there is a total of 21 intense head-down clusters (regarded as troughs hereafter; where the y-axis position is ≤25 for some or all of the frames across the cluster range). These are marked with green nodes in Figure 5.3. Appendix 5.1 provides details of the tracking outputs for all intense peak and trough clusters seen in this video excerpt (see Table 1 of Appendix 5.1 for details on the most intense peaks, Table 2 for the most intense troughs), suggesting a range of movements that fluctuate between the y-axis positions of 3 and 70 (refer to ‘min’ and ‘max’ values in Appendix 5.1).
Table 3 of Appendix 5.1 provides a list of the ten clusters of frames where the head-up and head-down movements overlap or correspond to one another, within a 25 frame span. These are marked with black nodes in Figure 5.3 (refer to the ‘max’ and ‘min’ values detailed in Table 3 of Appendix 5.1 for specific y-axis values of these 10 clusters). These ten instances are, therefore, assumed to show where an intense head-up motion(s) is followed by a head-down motion(s), or the reverse, mirroring the movements that we assume to be outputted in the case of emphatic head nodding behaviour.
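The pairing of peaks and troughs into ‘combined’ clusters can be sketched as follows, assuming each cluster is summarised by its first and last frame and that two clusters are combined when the gap between them is no more than 25 frames. The function name and this exact gap rule are assumptions. The ranges 4711 to 4721 and 4722 to 4732 are the trough and peak reported for <$M> in this section; the other two ranges are hypothetical.

```python
def combined_clusters(peaks, troughs, max_gap=25):
    """Pair peak and trough clusters that overlap or lie close together.

    Each cluster is a (start_frame, end_frame) tuple. The gap between
    two clusters is zero or negative when their ranges touch or overlap.
    """
    pairs = []
    for p_start, p_end in peaks:
        for t_start, t_end in troughs:
            gap = max(p_start, t_start) - min(p_end, t_end)
            if gap <= max_gap:
                order = ("trough-then-peak" if t_start < p_start
                         else "peak-then-trough")
                pairs.append(((p_start, p_end), (t_start, t_end), order))
    return pairs


peaks = [(4722, 4732), (9000, 9010)]    # second range is hypothetical
troughs = [(4711, 4721), (100, 110)]    # second range is hypothetical
pairs = combined_clusters(peaks, troughs)
print(pairs)
```

Recording the order of the two movements keeps head-down-then-up sequences distinguishable from head-up-then-down sequences.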
The most emphatic peak type movement used by <$M>, as detected by the tracker, is marked as point A on Figure 5.3. This occurs between frames 4722 and 4732, ranging from 36 to 70 on the y-axis across this frame range. This peak is immediately preceded by the most emphatic trough movement seen in the data, as shown by point B on Figure 5.3. This trough, ranging from 3 to 25 on the y-axis, exists between frames 4711 and 4721 on the figure. This close succession of a trough and a peak therefore forms the most pronounced up-and-down movement detected by the tracker, marked as ‘combined peak and trough 3’ in Table 3 of Appendix 5.1. It is relevant to note that this episode does in fact correspond to a backchanneling head nod, as defined in Chapter 4. The juncture at which this nod is used, in the context of the rest of the conversation, is shown in Figure 5.4 (refer to Appendix 4.1 for a key to the coding used in this transcript; also refer to Figure 4.1 in Chapter 4):
Figure 5.4: Exploring the most ‘intensive’ nod from <$M> in the case study data.
point of speech by <$F> and continues while <$M> backchannels with the IR token right, and the subsequent CNV string yeah yeah. This specific nod was initially classified as being of type E, a ‘multiple nod, comprising of a combination of types A and B, with a longer duration than types A and C’ (see 4.2.3.1 for details).
Despite this initial success, agreement between manual and automatic detection is not sustained across the remaining ‘combined peak and trough’ sequences. In fact, in only 2 of the 10 intense clusters seen in Appendix 5.1 (20%) do the automatically tracked nods correspond to manually ascribed nods.
This low success rate also holds true for the majority of the single head-up and head-down movements. Of the 40 ‘intense’ peaks detected by the tracker (see Appendix 5.1), 11 were manually pre-coded as non-verbal backchannels, with a further 1 as a non-backchanneling head nod (so 12/40, i.e. 30%), whereas the rate of successful detection stands at only 15% for the intense troughs (3/20 instances; see Table 2 of Appendix 5.1 for further details).
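The agreement rates quoted in this section are simple ratios of manually coded nods to tracker detections. The short check below recomputes them from the counts given in the text; the helper name `agreement_rate` is an assumption introduced for illustration.

```python
def agreement_rate(matched, detected):
    """Percentage of tracker detections that match a manually coded nod."""
    return 100 * matched / detected


combined_rate = agreement_rate(2, 10)   # combined peak and trough clusters
peak_rate = agreement_rate(12, 40)      # intense peaks (11 + 1 matches)
trough_rate = agreement_rate(3, 20)     # intense troughs, as 3/20

print(combined_rate, peak_rate, trough_rate)
```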
In around half of the successfully tracked episodes, the nods that were detected were actually of types C, D and E, that is, of the most intense types (5, 2 and 3 respectively). However, at this point it is relevant to note that multiple peaks and troughs may combine as part of an extended nod movement (see Appendix 5.1 for further details on nod numbers and codes). So of the total uses of these three nod types, as determined in Chapter 4 (see Figure 4.4), a mere 35% (6 different nods across the peaks and troughs) were correctly