
4.2 System Design

4.2.1 Gesture Classifier

Metatone Classifier classifies gestures by calculating descriptive statistics from each performer’s touch data using a sliding window of five seconds’ duration. The statistics are shown in Table 4.1, and include frequency of movement, frequency of touch starts, mean location of touches, standard deviation of touch location, and mean velocity. These statistics are similar to those used in touch-interface performance applications such as the TUIO protocol (Kaltenbrunner et al., 2005). Similar feature vectors were calculated by Swift (2012) in post-hoc analysis of musical smartphone improvisations and were found to distinguish between musicians’ personal styles of performance.

1The format of these logs and the OSC messages are discussed in more detail in


#  Label           Description
1  centroid x      Mean X position
2  centroid y      Mean Y position
3  std x           S.D. of X position
4  std y           S.D. of Y position
5  freq            Total number of touch messages
6  movement freq   Total number of moving touch messages (velocity ≠ 0)
7  touchdown freq  Total number of touch-down messages (velocity = 0)
8  velocity        Mean velocity

Table 4.1: The feature vector of descriptive statistics used in Metatone Classifier to classify touch gestures.
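The eight statistics in Table 4.1 could be computed from one window of touch data roughly as follows. This is a minimal sketch, not the Metatone Classifier implementation: the `(time, x, y, velocity)` record layout, the underscored feature names, the use of the population standard deviation, and the all-zero vector for an empty window are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Names follow Table 4.1; the underscores are added here so that they
# can serve as identifiers (an assumption, not the logged field names).
FEATURE_NAMES = [
    "centroid_x", "centroid_y", "std_x", "std_y",
    "freq", "movement_freq", "touchdown_freq", "velocity",
]


def feature_vector(touches):
    """Summarise one window of touches as the eight statistics of Table 4.1.

    `touches` is a list of (time, x, y, velocity) tuples. This record
    layout is hypothetical.
    """
    if not touches:
        # Assumption: a window with no touches maps to an all-zero vector.
        return [0.0] * len(FEATURE_NAMES)
    xs = [t[1] for t in touches]
    ys = [t[2] for t in touches]
    vs = [t[3] for t in touches]
    moving = sum(1 for v in vs if v != 0)  # messages with velocity != 0
    return [
        mean(xs),
        mean(ys),
        pstdev(xs),  # population S.D.; the text does not say which is used
        pstdev(ys),
        float(len(touches)),           # all touch messages
        float(moving),                 # moving touch messages
        float(len(touches) - moving),  # touch-down messages (velocity = 0)
        mean(vs),
    ]
```

Keeping the output as a plain list in the order of Table 4.1 means that it can be passed directly to a classifier that expects fixed-length numeric vectors.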

Gestures are identified from these feature vectors by a Random Forest classifier (Breiman, 2001) from Python’s scikit-learn package (Pedregosa et al., 2011). This particular ML algorithm was selected from the numerous approaches available for classification tasks due to its proven utility in Swift’s (2012) smartphone performance analysis. In that research, Random Forests was shown to outperform alternative algorithms such as Naive Bayes or Support Vector Machines in tasks with similar touch data.
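As an illustration of this step, the sketch below fits scikit-learn's `RandomForestClassifier` to eight-value vectors and predicts a gesture code for new vectors. The training data here is synthetic and deliberately well separated. The number of trees, the random seed, and the helper `synthetic_vector` are assumptions for the example and are not taken from Metatone Classifier.

```python
import random

from sklearn.ensemble import RandomForestClassifier

rng = random.Random(0)  # fixed seed so the synthetic data is repeatable


def synthetic_vector(centre):
    """Return an 8-value vector scattered around `centre` (illustrative only)."""
    return [centre + rng.uniform(-0.05, 0.05) for _ in range(8)]


# Two artificial classes standing in for labelled gesture examples.
X_train = ([synthetic_vector(0.1) for _ in range(20)]
           + [synthetic_vector(0.9) for _ in range(20)])
y_train = ["N"] * 20 + ["FT"] * 20

# Hyperparameters are placeholders; the text does not state the values used.
classifier = RandomForestClassifier(n_estimators=50, random_state=0)
classifier.fit(X_train, y_train)

# Classify two previously unseen vectors, one near each cluster.
predicted = classifier.predict([synthetic_vector(0.1), synthetic_vector(0.9)])
```

In real use, the training vectors would be the labelled feature vectors described later in this section, and the class labels would be the gesture codes of Table 4.2.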

The Random Forest classifier in the present system was used to identify gestures from the vocabulary of nine continuous touch-screen gestures shown in Table 4.2. These gestures were chosen from the vocabulary characterised in Section 3.6 (p. 54). Some of those gestures, such as random phrases (see §3.6.1, p. 56), were deemed too idiosyncratic to include in this classification, but others, such as short swipes, were divided into swipes that are fast and regular, and those that specifically accelerate to produce an emphatic rhythm. Three kinds of swirls were included to cover the small and large swirls, as well as the very slow, soft swirls, observed in the original characterisation. Interaction with the iPad volume control was omitted because it does not appear in the touch-screen data. The nine gestures included “nothing”, which, although not strictly a touch gesture, represents the “space” that was discussed as an important component of free-improvisation structure (§3.6.6, p. 59).

#  Code  Description                    Group
0  N     Nothing                        0
1  FT    Fast Tapping                   1
2  ST    Slow Tapping                   1
3  FS    Fast Swiping                   2
4  FSA   Accelerating Fast Swiping      2
5  VSS   Very Slow Swirling             3
6  BS    Big Swirling                   3
7  SS    Small Swirling                 3
8  C     Combination of Swirls and Taps 4

Table 4.2: The nine touch-screen gestures that Metatone Classifier is trained to identify during performances. These were influenced by the characterisation developed in Section 3.6.
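For reference, the vocabulary in Table 4.2 can be written down as a small lookup structure. The sketch below is one possible encoding. The `Gesture` type, the `BY_CODE` index, and the `group_of` helper are hypothetical names introduced here, while the numbers, codes, descriptions, and groups are copied from the table.

```python
from typing import NamedTuple


class Gesture(NamedTuple):
    number: int
    code: str
    description: str
    group: int


# Rows copied from Table 4.2.
GESTURES = [
    Gesture(0, "N", "Nothing", 0),
    Gesture(1, "FT", "Fast Tapping", 1),
    Gesture(2, "ST", "Slow Tapping", 1),
    Gesture(3, "FS", "Fast Swiping", 2),
    Gesture(4, "FSA", "Accelerating Fast Swiping", 2),
    Gesture(5, "VSS", "Very Slow Swirling", 3),
    Gesture(6, "BS", "Big Swirling", 3),
    Gesture(7, "SS", "Small Swirling", 3),
    Gesture(8, "C", "Combination of Swirls and Taps", 4),
]

# Index by short code for convenient lookup.
BY_CODE = {g.code: g for g in GESTURES}


def group_of(code):
    """Return the gesture group (0 to 4) for a short gesture code."""
    return BY_CODE[code].group
```

The group column makes it possible to compare gestures at a coarser level, for example treating all three swirl classes as group 3.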

The gestures listed in Table 4.2 are continuous gestures, rather than the time-delimited command gestures that are typical of HCI research. How much time, then, is required to distinguish each of these gestures? Intuitively, if the window is too short, it may be difficult to distinguish multiple classes of gesture, for instance, between swirls and swipes or slow and fast taps. A very long window could contain multiple kinds of gestures, which might be classified too frequently as combinations. As mentioned above, five seconds of touch data was chosen as the classification window for Metatone Classifier by an early process of trial and error. This choice might also be justified on psycho-acoustic grounds. Five seconds has been considered to be an upper bound for perception of local sound objects in music (Godøy et al., 2010), so it would appear to be a sensible choice for a low-level gesture classification. Another timing parameter is the frequency of sampling touch data for gestural classifications. Early post-hoc analysis sampled every five seconds; however, trial-and-error experimentation found that sampling at one-second intervals (while retaining a five-second window) provided a more responsive system for real-time application and better captured slight local variations in the performance.
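The two timing parameters described above, a five-second window evaluated at one-second intervals, can be sketched as follows. The constants come from the text. The function names, the half-open interval convention, and the assumption that each touch record begins with a timestamp in seconds are illustrative choices.

```python
WINDOW_SECONDS = 5.0  # window length stated in the text
STEP_SECONDS = 1.0    # classification interval stated in the text


def window_at(touches, now, window=WINDOW_SECONDS):
    """Return the touches whose timestamps lie in (now - window, now].

    Each touch is a tuple whose first element is a timestamp in seconds
    (a hypothetical layout).
    """
    return [t for t in touches if now - window < t[0] <= now]


def classification_times(start, end, step=STEP_SECONDS):
    """Yield the times from `start` to `end` at which a window is classified."""
    n = 0
    while start + n * step <= end:
        yield start + n * step
        n += 1
```

With these values, consecutive windows share four seconds of data, which is why successive classifications tend to change gradually rather than abruptly.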

As with many ML algorithms, Random Forests requires explicit training with known examples of each class that it is expected to identify. The earliest prototype of Metatone Classifier used data from one of the Ensemble Metatone performances described in Section 3.5 (p. 47). This data was classified by hand every five seconds directly from the performance video and touch-data animation. The time-stamped classifications were then matched with the touch-data recording to produce labelled feature vectors for each class. While this training data suggested that gesture classification was possible, it suffered from inaccuracies due to matching the video with the touch data and from the uneven relative frequency of particular gestures. Since then, two improved sets of training data have been collected: one in a studio session and one using a computer-conducted survey. The accuracy of these datasets will be discussed in Section 4.4.1.
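The matching of time-stamped hand classifications with recorded touch data might look like the sketch below. It assumes that each label applies to the five seconds preceding its timestamp, which the text does not state explicitly. The names `label_windows`, `class_counts`, and the `summarise` parameter are hypothetical, and `summarise` stands for any function that turns a window of touches into a feature vector.

```python
from collections import Counter


def label_windows(touches, labels, summarise, window=5.0):
    """Pair each hand label with a summary of the touches preceding it.

    `touches`: list of tuples whose first element is a timestamp in seconds.
    `labels`: list of (time, gesture_code) pairs classified by hand.
    `summarise`: function mapping a list of touches to a feature vector.
    """
    dataset = []
    for label_time, code in labels:
        # Assumption: the label describes the window ending at its timestamp.
        in_window = [t for t in touches
                     if label_time - window < t[0] <= label_time]
        dataset.append((summarise(in_window), code))
    return dataset


def class_counts(dataset):
    """Count the examples per gesture code to reveal class imbalance."""
    return Counter(code for _, code in dataset)
```

Inspecting `class_counts` before training is one way to notice the kind of uneven gesture frequencies mentioned above.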

Metatone Classifier’s gesture identification system has been used in a post-hoc context to generate gestural scores of performances such as the one illustrated in Figure 4.1 (p. 65). In graphical form, a human viewer can easily break up an ensemble performance into sections and identify ensemble behaviours. In a real-time context, the gesture classifications are further analysed by Metatone Classifier to identify the start of new segments in the performance automatically, as will be discussed in the next section. The agent also immediately sends identified gestures back to the performers’ iPad apps, which can potentially respond to sequences of similar gestures with supportive functionality.