
3. Methods, Models and Features

3.7 Datasets

In the Music Information Retrieval (Music-IR) community, large datasets for training and evaluating models are notoriously hard to obtain and share due to the commercial nature of the content. This difficulty is compounded for multi-track sources for two reasons. First, music production using DAWs was not commonplace until the recent past, and a significant amount of multi-track source audio older than 15 years is archived on analog tape or digital audio tape (DAT). Second, record labels had little incentive to release source audio while the home studio was still rather expensive to own. In the past decade, as technology advanced and home music production became common, bands have released multi-track sources for fans to remix and create derivative works. This section describes two datasets used in the subsequent experiments. The first is a set of stems from the RockBand® video game, and the second is a collection of multi-track audio from a variety of publicly available sources.

3.7.1 RockBand Dataset

There are 48 artists in the RockBand® dataset, and one song was selected randomly from each artist, resulting in a total of 48 songs. Only one song was chosen per artist due to time constraints encountered in generating the data and to prevent over-representation of any artist in the dataset. The ‘final mix’ experienced during gameplay was acquired by recording the optical audio output of the game console onto a computer and aligning it to the source tracks. The game console mix was used, as opposed to the radio/album release, due to synchronization issues between the source files and the commercial version. It was evident that time stretching/compression was performed on many of the RockBand® releases, since the commercial release of a song was often not the same length as the version from the game console. Most likely this was done to align the beats so that they occur at exactly regular intervals to facilitate gameplay.

Preprocessing and Normalization

There were several inconsistencies in the dataset that we had to account for in order to make comparisons between songs more accurate and to facilitate modeling in the system described in Section 3.2. The number and type of sources varied from song to song, with a minimum track count of eight and a maximum of 14. For example, many songs had individual stereo (L and R) waveforms for each instrument, whereas other songs had mono tracks for some instruments and stereo tracks for others. Additionally, not all songs had individual tracks for the kick drum, snare drum or overhead drum microphones.

To deal with this discrepancy, we opted to form five mono tracks for each song: bass, drums, guitar, vocals and backup. The instruments in the backup track vary from song to song and may include vocal harmonies, synthesizers, percussion, guitar or a variety of other instruments; however, the content of the backup track within a song is fairly consistent. Given the variance in the dataset, this method created more uniformity between the content of each song.

To create a single mono track for each instrument class, we mixed all audio that belonged to the given instrument class according to the track weights computed using the method described in Section 6.1. A diagram of the preprocessing step is shown in Figure 3.4.
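The grouping and mixdown step described above can be sketched as follows. This is only a minimal illustration, assuming each track is a list of float samples of equal length and that the per-track weights have already been estimated; the weight estimation method of Section 6.1 is not reproduced here. The track names, the `TRACK_TO_CLASS` mapping, the function `mix_to_classes` and the choice to send unmapped tracks to the backup class are all hypothetical.

```python
# Hypothetical mapping from raw track names to the five instrument classes.
TRACK_TO_CLASS = {
    "bass_l": "bass", "bass_r": "bass",
    "kick_l": "drums", "kick_r": "drums", "snare_r": "drums",
    "guitar": "guitar",
    "vocal_l": "vocals", "vocal_r": "vocals", "vocal": "vocals",
}

CLASSES = ("bass", "drums", "guitar", "vocals", "backup")


def mix_to_classes(tracks, weights, track_to_class=TRACK_TO_CLASS):
    """Sum weighted tracks into one mono track per instrument class.

    tracks:  dict of track name -> list of float samples (equal lengths)
    weights: dict of track name -> gain (assumed to be given)
    """
    lengths = {len(samples) for samples in tracks.values()}
    if len(lengths) != 1:
        raise ValueError("all tracks must have the same number of samples")
    n = lengths.pop()

    mixes = {cls: [0.0] * n for cls in CLASSES}
    for name, samples in tracks.items():
        # Assumption: anything not explicitly mapped goes to the backup class.
        cls = track_to_class.get(name, "backup")
        gain = weights[name]
        out = mixes[cls]
        for i, x in enumerate(samples):
            out[i] += gain * x
    return mixes


# Toy two-sample example with made-up values.
tracks = {
    "bass_l": [0.5, 0.25],
    "bass_r": [0.25, 0.75],
    "vocal": [1.0, -1.0],
    "synth": [0.5, 0.5],
}
weights = {"bass_l": 0.5, "bass_r": 0.5, "vocal": 1.0, "synth": 2.0}
mixes = mix_to_classes(tracks, weights)
```

In this toy example the two bass channels are combined into a single mono bass track, and the unmapped synthesizer track ends up in the backup class. Classes with no contributing tracks remain silent.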

3.7.2 Multiple Genre Dataset

The second dataset consists of 135 songs across a variety of genres, including Acoustic, Alternative, Country, Dance, Electronic, Hip-Hop, Indie, Jazz, Rock and Metal. The songs were obtained from three primary sources: Weathervane Music¹, Sound on Sound² and a multi-track dataset used for song structure segmentation [31]. Each track is converted to a monaural source at a 44.1 kHz sampling rate and labeled with the instrument present in the track.
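As an illustration of the monaural conversion, the sketch below averages the channels of each frame. The text does not state how the downmix was performed, so equal-weight averaging is an assumption, and the function name `downmix_to_mono` is hypothetical. Resampling to 44.1 kHz is not shown.

```python
def downmix_to_mono(frames):
    """Average the channels of each frame into one mono sample.

    frames: list of tuples, one tuple of channel samples per time step.
    Assumption: channels are weighted equally.
    """
    return [sum(frame) / len(frame) for frame in frames]


# Toy stereo signal with made-up values: (left, right) per frame.
stereo = [(0.5, 0.25), (-1.0, 1.0), (0.0, 0.5)]
mono = downmix_to_mono(stereo)
```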

¹ http://weathervanemusic.org/
² http://www.soundonsound.com/

Figure 3.4: Diagram of dataset preprocessing for each song in the RockBand dataset.
[Figure labels: source tracks (e.g. Kick L, Kick R, Snare R, Vocal L, Vocal R, Bass L, Bass R); Estimate Weight; Average; Least Squares; Kalman Smoothing; Final Weight Estimation; output tracks Bass, Drums, Guitar, Vocal, Backup.]

The tracks in every song were labeled by three individuals, and the majority label for each track was retained as ground truth. The labelers are students in the music industry program at Drexel University and the author. The filenames for each audio track are used when possible and normalized to a standard label for a single instrument class. Instrument classes are differentiated at a fine level (clean/distorted electric guitar) and may be combined into superclasses (electric guitar) if desired. The electric guitar is a specific example where fine-level labels are desired, since the distorted and clean versions are treated very differently by engineers and have very different roles in the mix. The dataset is publicly available online³.
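The majority-vote labeling and the optional merge into superclasses can be sketched as below. This is a hedged illustration rather than the procedure actually used: the label strings, the `SUPERCLASS` table and both function names are hypothetical, and returning `None` when no label has a strict majority is an assumption, since the text does not say how such cases were resolved.

```python
from collections import Counter

# Hypothetical fine-level labels mapped to their superclasses.
SUPERCLASS = {
    "clean electric guitar": "electric guitar",
    "distorted electric guitar": "electric guitar",
}


def majority_label(labels):
    """Return the label given by more than half of the annotators.

    Assumption: returns None when no label has a strict majority.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def to_superclass(label):
    """Map a fine-level label to its superclass, if one is defined."""
    return SUPERCLASS.get(label, label)


# Three annotators label the same track (made-up labels).
votes = [
    "clean electric guitar",
    "clean electric guitar",
    "distorted electric guitar",
]
fine = majority_label(votes)
coarse = to_superclass(fine)
```

Keeping the fine-level label as the stored ground truth and deriving the superclass on demand means the clean/distorted distinction is never lost.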

There is much more variation in this dataset than in the one described in Section 3.7.1. All of the material in the RockBand dataset possesses similar instrumentation and was commercially released. In addition to spanning multiple genres, the open dataset includes material that was not commercially released and varies in the quality of the signal capture (i.e. the experience of the recording engineer), as many of the songs come from novice home studio users.

In document Computational Modeling and Analysis of Multi-timbral Musical Instrument Mixtures (Page 67-70)