4.2 Observation study
4.2.2 Materials and methods
Materials
We used KinectVizz1 for the recordings. To this end, we added a new feature to the application that lets the user select an audio file and play it while recording video and motion capture data from the Kinect, aligned with this audio. For the analysis, we only used information from the nine upper-body joints: head, neck, torso, shoulders, elbows and hands.
We selected excerpts from a performance of the 1st movement of Beethoven’s 3rd Symphony (Eroica) by the Royal Concertgebouw Orchestra2, for which multimodal data (including high-quality audio for every section, multi-perspective video and aligned score) were made available within the PHENICX project3. This movement, Allegro con brio, is in 3/4 time.
1 https://github.com/asarasua/KinectVizz
2 http://www.concertgebouworkest.nl/
We selected 35-second fragments, long enough to provide sufficient data yet short enough for participants to memorize them quickly. Fragments were chosen to contain some variation in dynamics and tempo. All files were converted to mono so that participants did not have to pay attention to spatialization. Beat annotations are available in the dataset and were used as ground-truth beat locations. Loudness values were computed from the audio using Essentia’s (Bogdanov et al., 2013) Loudness4 algorithm. This algorithm computes the loudness of the audio signal following Stevens’ power law (Stevens, 1975), as the energy of the signal raised to the power of 0.67. The computed values were resampled to 30 Hz (the rate of the MoCap data) to allow a frame-by-frame comparison with the MoCap descriptors.
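As an illustration of the loudness computation described above, the following is a minimal sketch rather than the code used in the study. It does not call Essentia; it only reproduces the stated definition (frame energy raised to the power of 0.67) with NumPy. The frame size, hop size and the use of linear interpolation for resampling are assumptions made for the example, since the text does not specify them, and the function names are hypothetical.

```python
import numpy as np


def frame_loudness(signal, frame_size=1024, hop_size=512, exponent=0.67):
    """Loudness per frame as (frame energy) ** exponent (Stevens' power law).

    frame_size and hop_size are illustrative values, not taken from the study.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    loudness = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_size:i * hop_size + frame_size]
        loudness[i] = np.sum(frame ** 2) ** exponent
    return loudness


def resample_to_rate(values, src_rate, dst_rate=30.0):
    """Resample a descriptor curve to dst_rate by linear interpolation (assumption)."""
    values = np.asarray(values, dtype=float)
    src_t = np.arange(len(values)) / src_rate
    n_out = int(len(values) / src_rate * dst_rate)
    dst_t = np.arange(n_out) / dst_rate
    return np.interp(dst_t, src_t, values)


# Usage on a synthetic one-second signal of constant amplitude 0.5.
sample_rate = 44100
hop = 512
audio = np.full(sample_rate, 0.5)
loud = frame_loudness(audio, frame_size=1024, hop_size=hop)
loud_30hz = resample_to_rate(loud, src_rate=sample_rate / hop, dst_rate=30.0)
```

Resampling to the MoCap rate gives one loudness value per motion frame, which is what makes the frame-by-frame comparison possible.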
During the study, participants used over-ear headphones and stood approximately two meters from the Kinect sensor, placed on a flat speaker stand, approximately 1.4 m from the floor. The experimenter read instructions to participants and controlled the application from a laptop to which the Kinect sensor and headphones were connected. The recorded data and the scores of the fragments used in the study are available online5.
Methods
Motion Capture descriptors. Here we detail all MoCap descriptors that were extracted from the raw position data (x, y, z) of the nine upper-body joints. They are classified into joint descriptors, computed for every joint, and body descriptors, describing general characteristics of the whole-body movement.
• Joint descriptors:
– (vx, vy, vz), (ax, ay, az): Velocity and acceleration components, computed by fitting a second-order polynomial to 7 consecutive points centered at each frame and taking the first and second derivatives of the polynomial. We used the polyfit6 function from Python’s scientific tools package SciPy (Jones et al.) for this.
– v, a: Velocity and acceleration magnitudes.
– vmean, vstd: Velocity magnitude mean and standard deviation, computed from the 31 values (1.03 seconds) around each frame. They are expected to account for the “quantity” and “regularity” of the joint movement, respectively.
– (xtor, ytor, ztor): Components of the position relative to the torso. These are computed from the position of the joint and the position of the torso. On the x axis, points to the left of the torso (from the subject’s perspective) are positive and points to the right are negative. On the y axis, points above the torso are positive and points below are negative. On the z axis, points in front of the torso are positive and points behind are negative. Appropriate weights were empirically estimated so that the values are approximately 1 when the hands are completely extended along the corresponding axis.
– dtor: Distance to torso, computed from the position of the torso and the joint of interest.
These last two sets of descriptors, (xtor, ytor, ztor) and dtor, are related to the shape component in LMA (see Section 2.1). Even though they are computed using information from two different joints, we list them as joint descriptors because they indicate how stretched a joint is with respect to the torso.
3 https://repovizz.upf.edu/phenicx/datasets/
4 http://essentia.upf.edu/documentation/reference/std_Loudness.html
5 http://mtg.upf.edu/download/datasets/phenicx-conduct
• Body descriptors:
– QoM: Quantity of Motion. It is computed as the average velocity magnitude of all tracked joints.
– CI: Contraction Index. It is computed by looking at the maximum and minimum values along every axis. We used Equation 3.1, empirically derived to make its value approximately 1 when the arms are completely stretched out and 0 for a very contracted pose.
– Ymax: Maximum hand height. This simple descriptor takes, at each frame, the higher of the y positions of the two hands.
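The descriptors listed above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in the study: it uses NumPy's `polyfit`, leaves the frames at the edges of a recording as NaN (the text does not say how they were handled), and all function names are hypothetical. The weighted relative positions and the Contraction Index are omitted because the empirical weights and Equation 3.1 are not reproduced in this section.

```python
import numpy as np

FPS = 30.0  # MoCap frame rate stated in the text


def derivatives(coord, fps=FPS, window=7):
    """Velocity and acceleration of one coordinate via local quadratic fits."""
    coord = np.asarray(coord, dtype=float)
    half = window // 2
    t = np.arange(-half, half + 1) / fps  # time axis centered on the frame
    vel = np.full(len(coord), np.nan)     # edge frames stay NaN (assumption)
    acc = np.full(len(coord), np.nan)
    for i in range(half, len(coord) - half):
        c2, c1, _ = np.polyfit(t, coord[i - half:i + half + 1], 2)
        vel[i] = c1        # first derivative of the fitted polynomial at t = 0
        acc[i] = 2.0 * c2  # second derivative of the fitted polynomial
    return vel, acc


def magnitude(x, y, z):
    """Magnitude of a 3D vector given its components (v or a)."""
    return np.sqrt(np.square(x) + np.square(y) + np.square(z))


def sliding_mean_std(values, window=31):
    """Mean and standard deviation over `window` values around each frame."""
    values = np.asarray(values, dtype=float)
    half = window // 2
    mean = np.full(len(values), np.nan)
    std = np.full(len(values), np.nan)
    for i in range(half, len(values) - half):
        segment = values[i - half:i + half + 1]
        mean[i], std[i] = segment.mean(), segment.std()
    return mean, std


def distance_to_torso(joint_xyz, torso_xyz):
    """Euclidean distance between a joint and the torso, per frame."""
    diff = np.asarray(joint_xyz, dtype=float) - np.asarray(torso_xyz, dtype=float)
    return np.linalg.norm(diff, axis=-1)


def quantity_of_motion(speeds):
    """Average velocity magnitude over joints; speeds is (n_frames, n_joints)."""
    return np.asarray(speeds, dtype=float).mean(axis=1)


def max_hand_height(y_left, y_right):
    """Higher of the two hands' y positions at each frame."""
    return np.maximum(y_left, y_right)


# Usage: x(t) = t^2 has velocity 2t and constant acceleration 2.
frames = np.arange(60)
x_demo = (frames / FPS) ** 2
vel_demo, acc_demo = derivatives(x_demo)
qom_demo = quantity_of_motion([[1.0, 3.0], [2.0, 4.0]])
ymax_demo = max_hand_height(np.array([1.0, 3.0]), np.array([2.0, 0.0]))
```

Fitting a local polynomial before differentiating smooths the sensor noise that a plain finite difference would amplify, which is presumably why this approach was chosen over frame-to-frame differences.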
Loudness analysis. In order to study the relationship between the participants’ movement and the loudness of the fragments, we performed least squares linear regression, using movement descriptors as predictors and the computed loudness as the dependent variable. As opposed to the case of the real conductor in the performance studied in Section 3.2.3, where we preselected three descriptors based on expert knowledge (QoM, CI and Ym), we now follow a blind approach with a larger set of descriptors, allowing the model to identify the relevant ones. We created different linear models for different levels of specificity (from general to subject-specific). In all cases, we started from maximal models including all descriptors and simplified them step by step by removing non-significant explanatory variables, until the resulting model only contained descriptors with a significant effect on loudness. This yields the minimal adequate model.
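The simplification from a maximal model to a minimal adequate model can be illustrated with a backward-elimination sketch. It is only one possible reading of the procedure: the text does not state the significance test or threshold, so the sketch assumes a two-sided test on each coefficient's t statistic with a large-sample critical value of 1.96 (roughly p < 0.05), and removes the least significant descriptor at each step. The function name is hypothetical.

```python
import numpy as np


def backward_eliminate(X, y, t_crit=1.96):
    """Return the column indices of X kept in the minimal model.

    X: (n_frames, n_descriptors) predictors; y: (n_frames,) loudness.
    t_crit is an assumed large-sample threshold, not a value from the study.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    kept = list(range(X.shape[1]))
    while kept:
        # Ordinary least squares with an intercept column.
        A = np.column_stack([np.ones(len(y)), X[:, kept]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        dof = len(y) - A.shape[1]
        sigma2 = resid @ resid / dof
        cov = sigma2 * np.linalg.inv(A.T @ A)
        t_abs = np.abs(beta / np.sqrt(np.diag(cov)))[1:]  # skip the intercept
        worst = int(np.argmin(t_abs))
        if t_abs[worst] >= t_crit:
            break  # every remaining descriptor is significant
        kept.pop(worst)  # drop the least significant descriptor and refit
    return kept


# Usage on synthetic data: loudness depends on descriptor 0 only.
n = 900
k = np.arange(n)
d0 = np.sin(2 * np.pi * 3 * k / n)
d1 = np.cos(2 * np.pi * 5 * k / n)
loudness = 1.0 + 2.0 * d0 + 0.1 * np.sin(2 * np.pi * 11 * k / n)
kept_demo = backward_eliminate(np.column_stack([d0, d1]), loudness)
```

Refitting after each removal matters because the significance of the remaining descriptors can change once a correlated descriptor is dropped.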
Beat analysis. We took maxima in the vertical acceleration of the hand, ay, as beat position estimates. For each participant, we automatically selected the conducting hand as the one that showed more activity, estimated as the total energy of ay for each hand during the n frames under analysis: $E = \sum_{n} a_y(t_n)^2$. We then used the manual annotations of beat positions as ground truth to build an error distribution7 e, following the procedure detailed in Algorithm 1. The mean of the distribution is used as an estimate of the lag between the detected and the annotated beats.
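A minimal sketch of the hand selection and beat estimation steps follows. Algorithm 1 is not reproduced in this section, so the error computation below (signed difference between each detected beat and its nearest annotated beat) is an assumption about what that procedure does, and the function names are hypothetical.

```python
import numpy as np

FPS = 30.0  # MoCap frame rate stated in the text


def energy(ay):
    """Total energy of the vertical acceleration: sum of squared samples."""
    return float(np.sum(np.square(ay)))


def select_hand(ay_left, ay_right):
    """Pick the hand whose vertical acceleration carries more energy."""
    return "left" if energy(ay_left) > energy(ay_right) else "right"


def local_maxima(ay):
    """Indices of frames where ay is a local maximum (endpoints excluded)."""
    ay = np.asarray(ay, dtype=float)
    is_peak = (ay[1:-1] > ay[:-2]) & (ay[1:-1] >= ay[2:])
    return np.where(is_peak)[0] + 1


def signed_errors(detected_s, annotated_s):
    """Detected minus nearest annotated beat time, in seconds (assumption)."""
    detected_s = np.asarray(detected_s, dtype=float)
    annotated_s = np.asarray(annotated_s, dtype=float)
    nearest = np.abs(detected_s[:, None] - annotated_s[None, :]).argmin(axis=1)
    return detected_s - annotated_s[nearest]


# Usage: a synthetic right hand beating once per second, 3 frames late.
frame_idx = np.arange(90)
ay_right = np.cos(2 * np.pi * (frame_idx - 3) / 30)
ay_left = 0.1 * ay_right
hand_demo = select_hand(ay_left, ay_right)
beat_frames_demo = local_maxima(ay_right)
errors_demo = signed_errors(beat_frames_demo / FPS, [0.0, 1.0, 2.0])
lag_demo = float(errors_demo.mean())
```

In this synthetic case the estimated lag equals the three-frame delay built into the signal, which is the quantity the mean of the error distribution is meant to capture.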
We compute the F-measure as the harmonic mean of the precision and recall values: $F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$. Precision refers to the proportion of estimated beats that are correct; recall corresponds to the proportion of annotated beats that are correctly estimated. In our case, we consider that an annotated beat $a_n$ has been correctly detected if its closest predicted beat is closer than 66 ms (equivalent to 2 frames at 30 fps). The estimated lag is used to correct the position of detected beats before computing F. An F-measure of 1 indicates that the position of annotated music beats can be perfectly estimated from hand movement acceleration with a ±66 ms precision.
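The evaluation just described could be sketched as below. The criterion for an annotated beat follows the text; treating an estimated beat as correct when its closest annotated beat lies within the same tolerance is an assumption made here to compute precision, and the function name is hypothetical.

```python
import numpy as np


def beat_f_measure(detected_s, annotated_s, lag_s=0.0, tol_s=0.066):
    """F-measure of detected beats against annotations, after lag correction."""
    detected = np.asarray(detected_s, dtype=float) - lag_s
    annotated = np.asarray(annotated_s, dtype=float)
    if len(detected) == 0 or len(annotated) == 0:
        return 0.0
    dist = np.abs(detected[:, None] - annotated[None, :])
    # Annotated beats whose closest detected beat is within the tolerance.
    recall = float(np.mean(dist.min(axis=0) < tol_s))
    # Detected beats whose closest annotated beat is within the tolerance.
    precision = float(np.mean(dist.min(axis=1) < tol_s))
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Usage: three beats detected 100 ms late plus one spurious detection.
f_demo = beat_f_measure([0.1, 1.1, 2.1, 2.6], [0.0, 1.0, 2.0], lag_s=0.1)
```

Here all three annotated beats are recovered (recall 1) but one of the four detections is spurious (precision 0.75), so the resulting F-measure is 6/7.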
Participants
Participants were recruited via convenience sampling through department members and their students. They subsequently signed an informed consent form describing the type of data being recorded and the intention to make it publicly available, filled out a brief pre-questionnaire, performed the tasks of the study, and filled out a post-questionnaire. The study took approximately 10 minutes per participant.
Procedure
Participants were first allowed to listen to each of the 35-second fragments twice (so they could focus on learning the music). Then, they were asked to “conduct” the fragment three times (so they could keep learning the music while already practicing their conducting). For each of the fragments, the analysis was performed only on the last of the three takes, in which participants are assumed to be better able to anticipate changes (i.e., closer to “conducting” than to merely “reacting” to changes). Also, to remove the effects of initial synchronization, the analysis was done on the 30 seconds from second 4 to second 34. This makes a total of 90 seconds (30 seconds per fragment) of conducting data for each participant.
7 Although we speak of an “error distribution”, a beat estimated from hand movement that appears far from an annotated beat does not imply that anything is wrong, since participants did not have to accomplish any task.
At the end of the recording, participants filled out another questionnaire with Yes/No questions about how they had approached the study and the problems they had encountered. Specifically, we asked the following questions:
• Were you able to recognize the time signature (3/4) in the excerpts?
• In general, did you use rhythm information to guide your conducting movements?
• In general, did you use loudness information to guide your conducting movements?
• Were you able to anticipate changes in the last take of each excerpt?