3.2 Analyzing conducting movements during performance
3.2.1 Initial tests
As a first step, we ran some tests to identify technical issues that needed to be solved before subsequent recordings.
One of the goals of the PHENICX project was to make and share multimodal recordings including multichannel audio, video and MoCap from the conductor (also enriched with metadata such as the aligned score). The challenge in this kind of recording is to have all streams aligned in time. To test the concrete issues we would face using a Kinect V1 as a MoCap device, we attended two rehearsals of the student orchestra at ESMUC and made recordings including audio from a ZOOM H4n hand recorder and video, audio and MoCap from the Kinect.
For the Kinect data, we developed KinectVizz¹, an OpenFrameworks² application that uses OpenNI for skeleton tracking. The application can record up to four data streams: audio from the Kinect microphone array or the laptop’s built-in microphone, RGB video, RGB-D video and MoCap. The MoCap information is stored in a Tab-Separated Values (TSV) file where each line contains the information of one frame: its ID (which increases by 1 at each frame), a timestamp and the x, y and z positions of the fifteen joints provided by OpenNI+NITE skeleton tracking (i.e. forty-five position values per frame in total). The application includes functionality to export the recorded data to Repovizz³, an integrated online system developed by Mayor et al. (2011) capable of structural formatting and remote storage, browsing, exchange, annotation, and visualization of synchronous multimodal, time-aligned data. In this sense, Repovizz has served not only as a platform for sharing the data recorded during this work, but also as a visualization tool for all recorded streams.
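To make the MoCap row layout concrete, the following is a minimal sketch of a parser for one TSV line. It is not taken from KinectVizz: the column order (ID, timestamp, then x, y, z for each joint in turn), the absence of a header row, and the names `MocapFrame` and `parse_mocap_line` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_JOINTS = 15  # joints provided by OpenNI+NITE, as stated in the text


@dataclass
class MocapFrame:
    frame_id: int                              # increases by 1 at each frame
    timestamp: float                           # unit not specified in the text
    joints: List[Tuple[float, float, float]]   # one (x, y, z) per joint


def parse_mocap_line(line: str) -> MocapFrame:
    """Parse one TSV row. The column order is an assumption, not a spec."""
    fields = line.rstrip("\n").split("\t")
    expected = 2 + 3 * NUM_JOINTS  # ID + timestamp + 45 position values
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")
    frame_id = int(fields[0])
    timestamp = float(fields[1])
    values = [float(v) for v in fields[2:]]
    # Group the 45 flat values into 15 (x, y, z) triples.
    joints = [
        (values[i], values[i + 1], values[i + 2])
        for i in range(0, len(values), 3)
    ]
    return MocapFrame(frame_id, timestamp, joints)
```

Checking the column count up front means that a truncated or malformed row fails loudly instead of silently shifting joint coordinates.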
To avoid the need to have a person close to the conductor running the application, we used RealVNC⁴ for remote control over a local network. This way, we can have a laptop capturing the Kinect data close to the conductor’s podium and control it from outside the orchestra. Figure 3.2 shows a laptop controlling another one running KinectVizz during the recording of a rehearsal by Orquestra Simfònica del Vallès. These recordings are explained below, in Section 3.2.3.
Provided that all the streams captured by KinectVizz are aligned with one another, our assumption was that a simple manual alignment of the Kinect audio stream to the audio captured by the hand recorder would be enough to align all streams. These rehearsals were useful to test the application and to make sure that the setup was easy and unobtrusive for all musicians, including the conductor. The location of the Kinect camera during one of these rehearsals is shown in Figure 3.1.
The subsequent analysis of these recordings showed that, for long recordings, some frames were dropped in the recorded Kinect streams⁵. This shortens the duration of the resulting recorded data and dramatically affects the alignment to other streams recorded
¹ https://github.com/asarasua/KinectVizz
² http://openframeworks.cc/
³ https://repovizz.upf.edu/
⁴ https://www.realvnc.com/
⁵ We tested this with different computers and operating systems and the effect persisted. The laptop we used for all reported recordings has a 2.9 GHz Intel Core i7 processor and 8 GB of RAM, which is well above the Kinect requirements.
Figure 3.2: Laptop remotely controlling another laptop on stage, running KinectVizz for recording aligned multimodal data during a rehearsal by Orquestra Simfònica del Vallès.
externally, such as the audio from the hand recorder in this case. We overcame this issue using the information stored in the MoCap TSV file. If no frames are dropped, the IDs of consecutive frames are consecutive numbers. When the IDs of two consecutive frames stored in the TSV file are not consecutive, we can tell how many frames were dropped between them by inspecting the difference between their IDs. Using this information, we generate corrected video and MoCap streams with the process illustrated in Figure 3.3:
• For video, we use ffmpeg⁶ to decompose the original video into frames. The corrected video is generated, also using ffmpeg, by locating the times where frames were dropped and repeating the frame preceding the drop as many times as frames were dropped. For example, if the TSV file has two consecutive rows with IDs 100 and 106, frame 100 is repeated 5 times in the resulting video (once for each of the missing frames 101 to 105). The video freezes for a moment at the times where frames had been dropped, but it can be perfectly aligned with other streams, and we can automatically annotate where the freezing occurs.
• For MoCap, we follow a similar procedure but linearly interpolate the joint positions across the dropped frames. This way, the resulting MoCap file
Figure 3.3: Recording setup scheme. Video and MoCap streams recorded from Kinect cannot be directly aligned to other streams due to dropped frames. Information about dropped frames is used to generate corrected video and MoCap data.
does not freeze as the video does. However, the interpolated positions might differ from the motions actually performed during recording, especially if these movements were fast (and increasingly so as the number of dropped frames grows).
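The two correction steps above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the recordings: the function names are hypothetical, frame IDs are assumed to be strictly increasing, each frame's joint positions are assumed to be given as a flat list of coordinates, and the ffmpeg calls that split and reassemble the video are left out.

```python
from typing import List, Sequence, Tuple


def dropped_between(prev_id: int, next_id: int) -> int:
    """Frames missing between two consecutively stored frame IDs."""
    return next_id - prev_id - 1


def extra_repeats(frame_ids: Sequence[int]) -> List[Tuple[int, int]]:
    """For each stored frame, how many extra copies of it the corrected
    video needs so that it covers the frames dropped right after it."""
    plan = []
    for i, fid in enumerate(frame_ids):
        extra = 0
        if i + 1 < len(frame_ids):
            extra = dropped_between(fid, frame_ids[i + 1])
        plan.append((fid, extra))
    return plan


def interpolate_mocap(
    frame_ids: Sequence[int],
    positions: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Fill gaps in the MoCap stream by linear interpolation over frame IDs.

    positions[i] holds the coordinates stored for frame_ids[i]
    (hypothetical flat layout, e.g. 45 values per frame).
    """
    out_ids: List[int] = []
    out_pos: List[List[float]] = []
    for i, (fid, pos) in enumerate(zip(frame_ids, positions)):
        out_ids.append(fid)
        out_pos.append(list(pos))
        if i + 1 == len(frame_ids):
            break
        next_id, next_pos = frame_ids[i + 1], positions[i + 1]
        gap = next_id - fid
        # Synthesize one frame per missing ID, evenly spaced between
        # the two stored neighbours.
        for k in range(1, gap):
            t = k / gap
            out_ids.append(fid + k)
            out_pos.append([a + t * (b - a) for a, b in zip(pos, next_pos)])
    return out_ids, out_pos
```

With stored IDs 99, 100, 106 and 107, `extra_repeats` assigns 5 extra copies to frame 100 and none to the others, which matches the example given for the video stream. The interpolation sketch does not reconstruct timestamps for the synthesized frames; how those should be filled in is left open here.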