Feature Extraction and Enriched Access Modules for Musical Audio Data


Internal Note


Version 1.0 Draft
Date: 15 February 2007
Editor: QMUL


Introduction

This document enumerates the modules for the extraction of musical features from recorded audio files to be integrated within the EASAIER framework.

Additional modules dedicated to the implementation of Enriched Access features will also be described here.

1. Architectural Notes

The EASAIER framework requires applications with DSP capability both on the content provider side (“Archiver”) and on the end user side (“Browser/navigator”).

Initially the project envisaged a complete separation between feature extraction and enriched access tools, the former being exclusively assigned to the server side application and the latter to be used by the client side application.

In practice, following the initial systems architecture meeting (07/09/2006), it was found that both sides of the system might benefit from a certain degree of interoperability between the two sets of tools.

The following sections describe the generic feature extraction and audio processing functionality of the Server and Client-Side applications.

(note, these two sections are mostly generated by brainstorming and guessing, so do modify/add/complain as you see fit)

1.1. Server Side Archiving Software

The server side archiving application is a tool that allows content providers to manually enter and/or automatically extract meta-data from musical audio/video assets and archive them within the EASAIER system.

1.1.1. Audio analysis

The musical audio asset is submitted to the application and, whenever necessary, undergoes restoration. A simplified system diagram is proposed in figure 1. Also, a compressed version of the audio asset is generated and submitted, along with the original, to the audio files repository: this “lower quality” copy can then be used by the EASAIER server for the purpose of streaming audio to the end user without using excessive amounts of bandwidth.

The process of sound source separation may also be performed at this stage, although there is limited confidence that this will significantly improve the performance of the musical features extractors. However, the inclusion of this algorithm within the “Archiver” would allow an expert operator to choose an optimal set of separation parameters uniquely associated with the audio file, which can be transmitted to the enriched access tools on the client-side application as default settings.

Following restoration and source separation, the audio data passes through a number of modules for the extraction of mid- and high-level musical features, which will be included in the meta-data associated with the audio file under analysis for classification and search purposes.

The modules have been divided into two categories: mid-level extractors and high-level extractors. Broadly speaking, mid-level extractors return time-synchronous (frame-based) information such as harmonic and timbre profiles, chord sequences or the position of beats, and are particularly suitable for spawning transcriptions and performing similarity-based searches within the EASAIER archives.

High-level extractors, on the other hand, aim to describe global, and mostly single-valued, information regarding a piece of music, such as the tempo, meter, global key, mode or the presence of a particular instrument within the audio file. These descriptors can be employed to perform a parameter-based search such as: “find an audio file exhibiting a tempo of 120 bpm at 4/4 time signature and containing the instrument conga”.
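As an illustration only, the parameter-based query quoted above could be expressed as a filter over high-level tags. The record layout, field names and tempo tolerance below are assumptions made for this sketch; the actual metadata format is still to be defined.

```python
from dataclasses import dataclass, field


@dataclass
class TrackTags:
    """Hypothetical high-level tag record; field names are illustrative only."""
    title: str
    tempo_bpm: float
    meter: str
    instruments: set = field(default_factory=set)


def parametric_search(tracks, tempo_bpm, meter, instrument, tempo_tol=2.0):
    """Return the tracks whose high-level tags satisfy all three criteria."""
    return [
        t for t in tracks
        if abs(t.tempo_bpm - tempo_bpm) <= tempo_tol   # tolerance is an assumption
        and t.meter == meter
        and instrument in t.instruments
    ]


archive = [
    TrackTags("Track A", 120.4, "4/4", {"conga", "bass"}),
    TrackTags("Track B", 120.0, "3/4", {"conga"}),
    TrackTags("Track C", 96.0, "4/4", {"piano"}),
]
hits = parametric_search(archive, tempo_bpm=120, meter="4/4", instrument="conga")
print([t.title for t in hits])
```

A tolerance on the tempo is used because an estimated tempo will rarely match the requested value exactly.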

The mid-level features are extracted by the relevant algorithm (see section 2 for a description) and stored in a suitable format (TBD) in a repository within the EASAIER system. As well as being utilised by the server for search purposes, these features can also be used by the client-side navigation and playback tool to provide specialised visualisations of the music under analysis (e.g. an intensity envelope) and markers on points of interest within the waveform (e.g. position of beats, verse/chorus boundary, etc).

Mid level descriptors are also used within the archiving application by a second level of software modules for the generation of high level features.

Unfortunately, high-level feature extractors are not robust enough at this stage of development to guarantee consistent results. We therefore envisage the use of a “reliability metric” that can prompt the operator to double-check the results and, if necessary, to populate the relevant high-level tags manually.

Figure 1: server side musical audio archiving.

(Figure 1 block labels: Input Audio File (PCM); Manual Entry Tags/Data; De-Noising / Restoration; Source Separation; Compression; Mid-Level Feature Extractors / Transcript.; High-Level Features Extractors; Reliability Metric; High-level features (parametric search); Mid-level descriptors & transcript (similarity search); Optimal source separation & denoising parameters; Manual Tags & Manual High Level Features; To PCM & compressed audio assets repository; To Metadata Repository.)

1.1.2. Video analysis

The video asset is submitted to the EASAIER server and undergoes any necessary transcoding. A compressed version of the video asset is generated and submitted, along with the original, to the video files repository; this “lower quality” copy can then be used by the EASAIER server for the purpose of streaming video to the end user without using excessive amounts of bandwidth. In this process the audio stream is extracted from the video for the audio analysis shown in figure 1. The video stream then undergoes automatic analysis as shown in figure 2. All of this processing of the video/audio assets will be carried out using open source software such as ffmpeg [FFMPEG]. ffmpeg is known as a fast and reliable open source transcoding tool that integrates the majority of popular audio/video codecs.

Figure 2: server side video archiving.

QMUL will also provide video segmentation and keyframe extraction modules. The modules take MPEG-2 video as input and output temporal information about the start and duration of video segments, as well as keyframe images and their positions within the video file. The modules are already available as Linux binaries, and cross-platform versions are under development. In the current implementation only one feature, the ColorLayout, is extracted for each video frame. ColorLayout is a simple representation of the layout of colour within a frame, using a DCT to represent the feature. One DCT is created for each colour component (one luminance and two chrominance components in the case of a video frame).

The difference metric for each component is the weighted Euclidean distance between the corresponding DCT values of the two frames in that colour component. This leads to fast matching, and scalability can be improved by using fewer DCT values at the cost of accuracy. The resulting feature vector can be used for a variety of applications. Simple shot cuts can be detected by looking for peaks in the rate of change of the feature between subsequent frames; this gives a robust method for detecting abrupt shot changes that is reasonably accurate even in sequences with high visual activity. At present we are working on expanding the feature set used for cut detection and keyframe extraction, and on more sophisticated difference metrics such as N-Cut (Normalized Cut).
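The difference metric and the cut detection described above can be sketched as follows. This is not the QMUL implementation: the coefficient counts, the weights, the threshold and the simple local-maximum rule are all assumptions chosen to keep the example short.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two lists of DCT coefficients."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def frame_difference(f1, f2, weights):
    """Sum of the per-component distances (one luminance, two chrominance)."""
    return sum(weighted_distance(f1[c], f2[c], weights[c]) for c in ("Y", "Cb", "Cr"))


def detect_cuts(frames, weights, threshold):
    """Return the indices of frames that start a new shot.

    A cut is reported where the frame-to-frame difference is a local
    maximum above the (hypothetical) threshold.
    """
    diffs = [frame_difference(frames[i - 1], frames[i], weights)
             for i in range(1, len(frames))]
    cuts = []
    for i, d in enumerate(diffs):
        left = diffs[i - 1] if i > 0 else 0.0
        right = diffs[i + 1] if i + 1 < len(diffs) else 0.0
        if d > threshold and d >= left and d > right:
            cuts.append(i + 1)          # diffs[i] compares frame i with frame i + 1
    return cuts


# Toy data: three identical frames of one shot followed by three of another.
shot_a = {"Y": [10.0, 2.0, 1.0], "Cb": [5.0, 0.0], "Cr": [5.0, 0.0]}
shot_b = {"Y": [40.0, -5.0, 3.0], "Cb": [9.0, 1.0], "Cr": [2.0, 1.0]}
weights = {"Y": [2.0, 1.0, 1.0], "Cb": [1.0, 1.0], "Cr": [1.0, 1.0]}
frames = [shot_a] * 3 + [shot_b] * 3
cuts = detect_cuts(frames, weights, threshold=10.0)
```

Using fewer DCT values, as the text suggests, simply shortens the coefficient lists and therefore the inner sum.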

The extracted keyframes are further processed in order to extract a set of MPEG-7 low-level descriptors [MPEG7], which will be used in the EASAIER cross-retrieval engine, in addition to audio similarity searches, to extend searches to non-audio assets. For this purpose the MPEG-7 eXperimentation Model (XM) software [MP7XM] will be used. This is the standard reference software used by the MPEG standardisation body; it is open source, and both Linux and Windows versions exist and have been tested and used at QMUL.

(Figure 2 block labels: Input Video File; Compression; Audio Stream Extraction; Video Segmentation and Keyframe extraction; Keyframe Analysis; Manual Annotation; Audio stream analysis (figure 1); Keyframes; PCM; Original video file; Streaming video file (e.g. MPEG-4); Multimedia assets repository; KF temporal data; Video segments temporal data; Metadata; Features; Temporal data; Video segments metadata; KF Extracted Features; Metadata repository.)

The starting set of features that will be extracted for the purposes of EASAIER is defined in the EASAIER metadata document [EMD2006] and in Deliverable 3.1 [ED312006], but is still to be refined during the implementation and testing phases of the EASAIER project.

1.2. Client Side Search and Browsing Software

The end user will be able to access the content of the EASAIER archive by means of an application (figure 3) that can retrieve an audio asset and its associated meta-data using a variety of non mutually exclusive query methodologies, such as:

- Queries based on general tags: e.g. find material by author/title, genre and year.

- Musical parameter-based queries: e.g. find songs by key, orchestration or tempo range.

- Similarity-based queries: e.g. once a musical audio asset has been retrieved, find other assets that exhibit some degree of similarity in terms of macroscopic structure, timbre and harmonic profile.

The audio is delivered by the server (either by streaming or download of the entire compressed file) to the client application and then buffered and converted to a suitable format for further processing and visualisation of its time-domain waveform.

Following the decoding stage, a suite of real-time audio processing modules allows restoration, source separation and enhancement of the incoming audio stream. The associated meta-data retrieved from the server contains a set of default parameters for both the source separation and restoration algorithms; alternatively, the user can override these parameters manually through an advanced menu/interface on the client application (enriched access UI).

The default source separation parameters can be associated with the tags generated by the instrument recognition algorithms to provide a “click and play” list of the various orchestral components of the musical audio asset.

A time-scale modification algorithm that the user can operate in real time is included in the enriched access tool set, allowing the audio playback to be slowed down or sped up without affecting the pitch content.

As well as providing default operational parameters to the enriched access tool set, the meta-data also contain:

1) General and music-specific tags providing comprehensive information regarding the audio asset under analysis (displayed in the “Browsing and Searching UI”)

2) Mid-level features that can be used to deliver technical visualisations of the audio asset as well as markers for advanced playback and looping functionalities (displayed in the “Looping and Visualisation UI”)

Although high and mid-level musical descriptors are generated by the archiving application on the server side, an enhancement to the functionality offered by the EASAIER system can be identified in the ability to provide similarity-based searches using audio files residing on the client’s hard drive.


As shown at the bottom of figure 3, this functionality will require the deployment of a scaled-down version of the archiving application, allowing the generation of data that can be used to search the contents of the EASAIER server.

Figure 3: client side musical audio browser/navigator.

(Figure 3 block labels: Browsing application (musical audio); Streaming Audio; Metadata; Buffer / Decode; De-Noising / Restoration; Source Separation; Equalisation; Time & pitch Scale Modification; Default Enriched Access Parameters; User-Defined Parameters; Mid-Level Features; High-Level Features; Textual/General Tags; Enriched Access UI; Browsing & Searching UI; Looping & visualisation UI; Query; Audio Out; EASAIER SERVER; QUERY ENGINE; “Mini Archiver” (musical audio); Local Audio File; Mid-Level Feature Extractors / Transcript.; High-Level Features Extractors; High-level features (parametric search); Mid-level descriptors & transcript (similarity search); Mid & High Level Features.)

2. Software Modules

The software modules described in this section are included in the following EASAIER work packages:

1) WP4 – Sound Object Representation: This work package deals with the identification of features within the archived audio assets. As far as the musical audio is concerned, the tools will enable the extraction of high and mid-level descriptors for classification and search purposes as well as modules capable of providing information regarding the musical structure of the audio asset for visualisation and looping purposes.

2) WP5 – Enriched Access Tools: Tools developed within this work package will allow the user to apply useful modifications to the audio content at access time and in real-time, enabling an “enriched” exploration of the musical audio asset.

2.1. Enriched access

2.1.1. Time-scale Modification / Pitch-scale Modification

Provided by DIT :

The TSM algorithm will allow the user to vary the playback rate of the audio in real time without affecting the local pitch content. The module will use both time-domain and frequency-domain algorithms. The appropriate algorithm will be chosen automatically depending on the metadata provided with the audio content; the user should also be able to choose the algorithm manually. Pitch-scale modification independent of the time base is achievable in a similar manner.

Provided by QMUL:

An alternative TSM algorithm based on a phase vocoder implementation. The algorithm allows for excellent transient preservation and robust stereo performance but requires a-priori knowledge of the transients within the audio file; this can be provided by the extracted mid-level features.
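To make the idea of time-scale modification concrete, the following is a minimal sketch of plain overlap-add (OLA), the simplest time-domain approach. It is neither the DIT algorithm nor the QMUL phase vocoder: frames are read from the input at one hop size and written to the output at another, with no phase or transient handling, so tonal material will exhibit audible artefacts. The frame and hop sizes are assumptions.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def ola_time_stretch(x, rate, frame=1024, synth_hop=256):
    """Time-stretch the mono samples x by plain overlap-add.

    rate > 1 speeds playback up, rate < 1 slows it down. Frames are read
    every synth_hop * rate input samples and written every synth_hop
    output samples, so the local waveform (and hence pitch) is kept.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    ana_hop = synth_hop * rate                  # input advance per frame
    win = hann(frame)
    n_frames = max(1, int((len(x) - frame) / ana_hop) + 1)
    out_len = (n_frames - 1) * synth_hop + frame
    y = [0.0] * out_len
    norm = [0.0] * out_len                      # summed window, for normalisation
    for k in range(n_frames):
        a = int(round(k * ana_hop))             # read position in the input
        s = k * synth_hop                       # write position in the output
        for i in range(frame):
            if a + i >= len(x):
                break
            y[s + i] += win[i] * x[a + i]
            norm[s + i] += win[i]
    return [v / w if w > 1e-9 else 0.0 for v, w in zip(y, norm)]


# One second of a 440 Hz tone at 8 kHz, played back at half speed.
tone = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(8000)]
slow = ola_time_stretch(tone, rate=0.5)
```

The phase discontinuities that this naive scheme introduces between overlapping frames are what phase vocoder and transient-preserving methods are designed to avoid.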

2.1.2. Sound Source Separation

Provided by DIT :

A real-time separation algorithm which is capable of separating multiple sources from 2 channel mixtures. At present this tool requires the user to set some parameters based on visual and audio feedback from the GUI in order to achieve meaningful separations. This version of the algorithm will be deployed as an enriched access tool for WP5. An automated version of this algorithm may also be provided as a pre-processor for transcription and instrument recognition in WP4. Some other work on single channel separation is ongoing within the group at DIT.

2.1.3. Equalisation and Noise Reduction

Provided by DIT :

DIT may also be able to provide some rudimentary real-time noise reduction and equalisation tools for the purposes of audio enhancement. QMUL will provide support in the generation of C++ libraries for these tools.

2.2. Sound Object Representations

2.2.1. Segmentation

Provided by DIT :

Some segmentation routines, such as a “Novel Event Detector”, may be incorporated if desired.


A module for the segmentation and thumbnailing of recorded musical audio using a hierarchical timbre model (SoundBite) is available.

2.2.2. Mid Level Descriptors and Music Transcription

Provided by DIT :

The transcription algorithm will perform a non real-time analysis which will result in a musical transcription of the audio content. Harmony features may also be extracted during this analysis. It is also intended that some time aligned visual indication of harmony be provided. Alternative representations of transcribed audio will also be provided such as melodic contours for the purposes of melodic similarity queries. This tool will be deployed at the server side and will provide meta-data for the purposes of indexing. The tool may also be deployed at the client side for the case where the user wishes to query by example, where the example audio comes from outside the database.

Provided by QMUL:

The Centre for Digital Music can provide the following Feature Extraction Modules:

− Detection Function : A module for the generation of a function describing the local structure of an audio signal.

− Peak Picking : Module for the estimation of onsets from the detection function. Also contains a class for Detection Function processing.

− Onset Detection : A module for estimating onsets from audio files, incorporating the detection function and peak picking classes.

− Multi-Band Onset Detection : (Released after 31/10/2006) Module for estimating tonal and percussive onsets from audio files.

− Chroma Class : A module for logarithmic frequency analysis.

− Beat Tracker : A module for beat tracking of musical audio.

− Harmonic Change Detection Function (HCDF) : Module for the detection of harmonic change in musical audio files.

− Chord Estimation : (Ongoing Research) Module for the estimation of musical chords from audio files.

− Harmonic Content Estimation : The module is intended to provide a mid-level representation of the harmonic and rhythmic information from audio files. The algorithm returns a robust description of musical attributes that is intended to be used for similarity matching rather than for transcription and information retrieval.

− Key Estimation : (Ongoing Research) Module for the estimation of the key in a musical file (frame-based).

− Tempo Estimator : (Ongoing Research) The module estimates tempo from a musical audio file using information returned by the beat tracking algorithm.

− Meter Estimator : (Ongoing Research) The module estimates the time signature from a musical audio file using information returned by the beat tracking algorithm.

− Global Key Estimator : (Ongoing Research) Module for the estimation of the predominant key in a musical file using information returned by the frame-based key estimation algorithm.
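The note does not state how the predominant key is derived from the frame-based estimates. One simple possibility, given here purely as an assumption, is a majority vote over the frame labels; the share of frames that agree with the winner could then feed the reliability metric envisaged in section 1.1.1.

```python
from collections import Counter


def global_key(frame_keys):
    """Return (key, share): the most frequent frame-level key label and the
    fraction of frames carrying it. Majority voting is an assumption of this
    sketch, not a documented part of the Global Key Estimator."""
    if not frame_keys:
        raise ValueError("no frame-level key estimates")
    key, count = Counter(frame_keys).most_common(1)[0]
    return key, count / len(frame_keys)


# Hypothetical frame-level output: mostly C major with a brief excursion.
frames = ["C major"] * 6 + ["A minor"] * 3 + ["G major"]
key, share = global_key(frames)
```

A low share would indicate a modulating or ambiguous piece, which is exactly the case in which an operator should be prompted to check the tag.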

2.2.4. Musical Instrument Recognition

Provided by DIT :

DIT has very recently begun work in this field. We expect to be able to integrate this work into EASAIER at a later stage. QMUL will provide legacy code (Instrument Identification Libraries) and knowledge gained from previous research carried out at the Centre for Digital Music.


3. Current status of Software Modules

Feature extractor name: Detection Function
Type of module: Low Level Descriptor
Underlying technology / [references]: A number of techniques are covered. [JF2000], [JPB2005]
Input / Output / Scope: Input is a dense frequency domain frame. Outputs a single value per input frame.
Lang.: Matlab, C++
Current development status: C++ module completed and deployed. VAMP plugin is available.

Feature extractor name: Peak Picking
Type of module: Low Level Descriptor
Underlying technology / [references]: The detection function undergoes DC removal, smoothing and median filtering [IK2002]. Peak selection is based on a quadratic fit [reference needed].
Input / Output / Scope: Input is the detection function. Output is a vector indicating the location of estimated onsets. Time base is relative to the detection function.
Lang.: C++
Current development status: Completed and deployed.

Feature extractor name: Onset Detection
Type of module: Low Level Descriptor
Underlying technology / [references]: The module links the detection function and peak-picking classes to provide a complete onset estimator.
Input / Output / Scope: Input is a pointer to a location containing samples of the audio file under analysis. Output is a vector indicating the location of estimated onsets. Time base is relative to the original audio file.
Lang.: C++
Current development status: Completed and deployed. VAMP plugin is available.

Feature extractor name: Multi-Band Onset Detection
Type of module: Low Level Descriptor
Underlying technology / [references]: The module splits the signal into four sub-bands using a constant-Q filterbank prior to onset detection [CD2004]. Tonal and percussive components are discriminated on the basis of the presence of onsets in the different sub-bands. [ER2005]
Input / Output / Scope: Input is raw audio data. Outputs are vectors indicating the location of estimated tonal and percussive onsets. Time base is relative to the original audio file.
Lang.: Matlab, C++
Current development status: A C++ version is available.

Feature extractor name: Chroma
Type of module: Low Level Descriptor
Underlying technology / [references]: Based on an FFT, utilises a sparse kernel approach for the calculation of a constant-Q transform. The Chromagram (HPCP) is then calculated from the constant-Q data. [JB1991], [JB1992], [CH2005]
Input / Output / Scope: Input is raw audio data. Output is a dense matrix containing the Chromagram bins of the file under analysis. Time base depends on the resolution of the constant-Q transform.
Lang.: Matlab, C++
Current development status: Complete. C++ deployed but needs revision. VAMP plugin is available.
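The processing chain listed in the Peak Picking entry above (DC removal, smoothing, median filtering, quadratic fit) can be sketched as follows. The window lengths, the threshold offset `delta` and the use of a moving average for smoothing are assumptions of this example and are not taken from the deployed C++ module.

```python
import statistics


def moving_average(x, width):
    """Smooth x with a centred moving average of the given odd width."""
    half = width // 2
    out = []
    for i in range(len(x)):
        seg = x[max(0, i - half): i + half + 1]
        out.append(sum(seg) / len(seg))
    return out


def pick_peaks(df, smooth_width=3, median_width=9, delta=0.1):
    """Return fractional frame indices of peaks in a detection function."""
    mean = sum(df) / len(df)
    centred = [v - mean for v in df]                 # DC removal
    smooth = moving_average(centred, smooth_width)   # smoothing
    half = median_width // 2
    peaks = []
    for i in range(1, len(smooth) - 1):
        local = smooth[max(0, i - half): i + half + 1]
        threshold = statistics.median(local) + delta  # adaptive median threshold
        a, b, c = smooth[i - 1], smooth[i], smooth[i + 1]
        if b > threshold and b > a and b >= c:
            # Vertex of the parabola through the three points (quadratic fit).
            offset = 0.5 * (a - c) / (a - 2.0 * b + c)
            peaks.append(i + offset)
    return peaks


# Toy detection function with two symmetric bumps at frames 10 and 30.
df = [0.0] * 40
for centre in (10, 30):
    df[centre - 1], df[centre], df[centre + 1] = 0.5, 1.0, 0.5
onsets = pick_peaks(df)
```

The returned positions are expressed in detection-function frames, which matches the time base stated for the module; converting them to seconds requires the hop size used to compute the detection function.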

Feature extractor name: Beat Tracker
Type of module: Mid Level Descriptor
Underlying technology / [references]: Beat times are recovered by passing the output of an onset detection function through comb filterbank matrices to identify the beat period and alignment. The module uses a two state model for tracking tempo changes and for maintaining continuity within a single tempo hypothesis. [MD2004], [MD2005]
Input / Output / Scope: Output is either a sparse vector with the non-zero elements denoting an estimated beat or a vector containing the temporal location of the identified beats. Time base is relative to the original audio file.
Lang.: Matlab, C++
Current development status: C++ version is deployed. VAMP plugin is available.

Feature extractor name: Harmonic Change Detection Function (HCDF)
Type of module: Low/Mid Level Descriptor
Underlying technology / [references]: A 12-bin chromagram is mapped to a 6-D space using a tonal centroid transform and smoothed using a Gaussian window. The HCDF is defined as the rate of change of the smoothed tonal centroid signal. Transition times between harmonically stable regions can be obtained by peak picking the HCDF.
Input / Output / Scope: Output is a dense vector representing the peak change between tonal centroid frames.
Lang.: Matlab, C++
Current development status: Complete. C++ version is deployed. VAMP plugin is available.

Feature extractor name: Chord Estimation
Type of module: Mid Level Descriptor
Underlying technology / [references]: The algorithm relies on a 36-bin tuned chromagram obtained from a constant-Q transform. The identification is performed using chord templates. [CH2005], [MC2004], [BP2002]
Input / Output / Scope: Standard Matlab I/O. Output is a sequence of estimated chord symbols.
Lang.: Matlab
Current development status: A complete C/C++ implementation is not currently available.

Feature extractor name: Key Estimation
Type of module: Mid/High Level Descriptor
Underlying technology / [references]: The key space is modelled by a 24-state HMM. Each state represents one of the 24 major or minor keys and each observation represents a chord transition. [KN2006]
Input / Output / Scope: Standard Matlab I/O. Input is a sequence of estimated chord symbols. Output is the estimated key, either on a frame or per-track basis.
Lang.: Matlab
Current development status: A complete C/C++ implementation is not currently available.

Feature extractor name: Harmonic Content Estimation
Type of module: Mid Level Descriptor / Similarity Retrieval
Underlying technology / [references]: A 36-bin tuned chromagram is averaged between detected beats. The resulting averaged chromagram is further reduced to 12 bins by summing all three bins for each pitch class. The state transition matrix, mean vector and covariance matrix of an HMM are initialised using musical knowledge and selectively trained using the 12-bin chromagram. The chord sequence is then inferred from the HMM using Viterbi decoding. [JPB2005A]
Input / Output / Scope: Standard Matlab I/O. Input is raw audio data. The output represents a sequence of major and minor triads. Time base consists of detected beats (tactus).
Lang.: Matlab
Current development status: A complete C/C++ implementation is not currently available.

Feature extractor name: Instrument Identification Libraries
Type of module: High Level Descriptor
Underlying technology / [references]: The instrument identifier relies on a mono-feature timbre modelling approach, using Line Spectrum Frequencies (LSF) as the unique identifier. Various classifiers are implemented, in particular k-means, Gaussian mixture models and Support Vector Machines. [NC2005], [NC2006], [FI1975], [PK1986]
Input / Output / Scope: Undetermined.
Lang.: Matlab / C
Current development status: Status unknown. Code is allegedly in the DSPMac repository.

Feature extractor name: SoundBite
Type of module: Similarity Retrieval / Enriched Access
Underlying technology / [references]: For a given track, the space of possible timbres is divided into N timbre types, each of which generates timbre features according to a Gaussian distribution. The sequence of timbre features through the track is modelled by an N-state Hidden Markov Model where the hidden states correspond to the N timbre types. The most likely sequence of timbre types to have generated the features is Viterbi decoded from the HMM. The most likely segmentation is found by clustering histograms of the timbre types. The feature vector consists of the first 20 PCA components extracted from the normalised constant-Q spectrum of the audio under analysis, along with the normalised envelope. Analysis hop size is chosen as the estimated beat length of the audio under analysis. [ML2006A], [ML2006B]
Input / Output / Scope: Output is a sequence of labelled segments.
Lang.: C/C++
Current development status: A C/C++ demonstrator is available for Mac OS X. VAMP plugin is available.
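The first two steps of the Harmonic Content Estimation entry above (averaging a 36-bin chromagram between beats, then summing the three bins of each pitch class) can be sketched as follows. The bin ordering, with bins 3k to 3k+2 belonging to pitch class k, and the representation of beats as frame indices are assumptions of this example.

```python
def beat_average(chroma_frames, beat_frames):
    """Average the chroma frames lying between consecutive beat positions.

    chroma_frames is a list of equal-length bin vectors; beat_frames is an
    increasing list of frame indices (an assumed representation of beats).
    """
    segments = []
    for start, end in zip(beat_frames[:-1], beat_frames[1:]):
        block = chroma_frames[start:end]
        if not block:
            continue                      # skip empty inter-beat intervals
        n_bins = len(block[0])
        segments.append([sum(f[b] for f in block) / len(block)
                         for b in range(n_bins)])
    return segments


def fold_to_12(chroma36):
    """Sum the three bins of each pitch class (bins 3k..3k+2 -> class k)."""
    return [sum(chroma36[3 * k: 3 * k + 3]) for k in range(12)]


# Toy input: two beats, each spanning two identical 36-bin frames.
frame_a = [1.0 if b < 3 else 0.0 for b in range(36)]          # energy in class 0
frame_b = [2.0 if 21 <= b < 24 else 0.0 for b in range(36)]   # energy in class 7
chroma = [frame_a, frame_a, frame_b, frame_b]
beat_sync = [fold_to_12(seg) for seg in beat_average(chroma, [0, 2, 4])]
```

The resulting 12-bin, beat-synchronous vectors are the kind of observation sequence on which the HMM training and Viterbi decoding described in the entry would operate.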

Feature extractor name: Tempo & Meter
Type of module: High Level Descriptor
Underlying technology / [references]: The algorithm is based on the beat tracker described above. Meter estimation is currently limited to 4/4 and 3/4. The tempo value is estimated by analysing the beat histogram generated using tempo tracking across the audio file, and a measure of reliability can be inferred from the distribution of bins in the histogram.
Input / Output / Scope: Input is raw audio data. Outputs are:
- Histogram of detected tempos
- Estimated main tempo
- Estimated time signature
Lang.: C++
Current development status: C++ code completed and deployed. Some further experimental work is needed.

Feature extractor name: Time-Scaling
Type of module: Enriched Access
Underlying technology / [references]: Time scaling is performed using an FFT-based phase vocoder. Percussive onsets are identified using a multi-band onset detection algorithm and only steady state portions of the signals are time scaled, thus preserving the integrity of transients. Coherence in stereo signals is maintained by using a single reference channel for the identification of transient and steady state frames. [ER2005]
Input / Output / Scope: Output is the time-scaled audio data.
Lang.: Matlab, C++
Current development status: A non-optimal C++ implementation is available.
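For the Tempo & Meter entry above, the following sketch shows one way in which a tempo histogram, a main tempo and a reliability value could be derived from beat times. Building the histogram from inter-beat intervals, the 2 bpm bin width, and defining reliability as the share of intervals in the modal bin are all assumptions; the deployed C++ code may differ.

```python
from collections import Counter


def tempo_histogram(beat_times, bin_width=2.0):
    """Histogram of local tempi in bpm, keyed by bin centre.

    Each inter-beat interval (seconds) contributes one tempo value.
    """
    hist = Counter()
    for t0, t1 in zip(beat_times[:-1], beat_times[1:]):
        if t1 > t0:
            bpm = 60.0 / (t1 - t0)
            hist[round(bpm / bin_width) * bin_width] += 1
    return hist


def main_tempo(hist):
    """Return (tempo, reliability): the modal bin and its share of all intervals."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty tempo histogram")
    bpm, count = hist.most_common(1)[0]
    return bpm, count / total


# Hypothetical beat times: eight intervals at 120 bpm, then two at 100 bpm.
beats = [0.5 * i for i in range(9)] + [4.6, 5.2]
hist = tempo_histogram(beats)
tempo, reliability = main_tempo(hist)
```

A reliability value well below 1 indicates a spread-out histogram, for instance a piece with tempo changes or unstable beat tracking, and is the kind of signal that could prompt the operator to verify the tempo tag manually.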


4. References:

[JF2000] J. Foote, “Automatic audio segmentation using a measure of audio novelty,” in Proc. IEEE Int. Conf. Multimedia and Expo (ICME2000), vol. I, New York, Jul. 2000, pp. 452–455.

[JPB2005] J.P.Bello et al, “A Tutorial on Onset Detection in Music Signals”, in IEEE Transactions on speech and audio processing, vol. 13, no. 5, September 2005.

[IK2002] I. Kauppinen, “Methods for detecting impulsive noise in speech and audio signals,” in Proc. 14th Int. Conf. Digital Signal Processing (DSP2002), vol. 2, Santorini, Greece, Jul. 2002, pp. 967–970.

[CD2004] C Duxbury. Signal Models for Polyphonic Music. PhD Thesis, 2004.

[ER2005] E.Ravelli et al, “Fast implementation for non-linear time-scaling of stereo signals”, in Proc. of the 8th Int. Conference on Digital Audio Effects (DAFx’05), Madrid, Spain, September 20-22, 2005

[JB1991] Judith Brown, “Calculation of a Constant Q Spectral Transform“,Journal of the Acoustical Society of America, vol. 89, no. 1, 425–434, 1991.

[JB1992] Judith C. Brown, Miller S. Puckette, “An Efficient Algorithm for the Calculation of a Constant Q Transform”, Journal of the Acoustical Society of America, vol. 92, no. 5, 2698–2701, 1992.

[CH2005] C. Harte, M.B. Sandler, “Automatic Chord Identification Using a Quantised Chromagram”, in Proc. of the 118th AES Convention, Barcelona, Spain, 2005 May 28–31.

[MD2004] M. E. P. Davies and M. D. Plumbley, “Causal tempo tracking of audio,” in 5th International Symposium on Music Information Retrieval, October 2004.

[MD2005] M. E. P. Davies and M. D. Plumbley, “Beat tracking with a two state model,” in Proceedings of ICASSP, Philadelphia, USA, March 18–23, 2005

[CH2006] C. Harte, M. Gasser, M.B. Sandler, “Detecting Harmonic Change in musical audio”, in Proc. of AMCMM’06, Santa Barbara, USA, October 27, 2006.

[MC2004] Markus Cremer and Claus Derboven, “A System for Harmonic Analysis of Polyphonic Music”, Proceedings of the AES 25th International Conference, 2004, London, UK, 115–120.

[BP2002] Bryan Pardo and William P. Birmingham, “Algorithms for Chordal Analysis”, Computer Music Journal, vol. 26, no. 2, 27–49, 2002.

[KN2006] Katy Noland, Mark Sandler, “Key Estimation using a Hidden Markov Model”, in Proc of ISMIR, Victoria, Canada, 2006

[JPB2005A] J.P. Bello, J. Pickens, “A Robust Mid-level Representation for Harmonic Content in Music Signals”, in 6th International Symposium on Music Information Retrieval, London, 2005.

[PK1986] P. Kabal and R.P. Ramachandran, “The Computation of line spectral frequencies using Chebyshev polynomials”, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-34, no. 6, 1419–1426, 1986.

[FI1975] F. Itakura, “Line spectrum representation of linear predictive coefficients of speech signals”, J. Acoust. Soc. Amer., vol. 57, S35, 1975.

[NC2005] N. Chetry et al, “Musical Instrument Identification using LSF and K-means”, in Proc. AES 118th Convention, Barcelona, Spain, 2005 May 28–31.

[NC2006] N. Chetry et al, “Computer Models for Musical Instrument Identification”, PhD Thesis, 2006.

[ML2006A] M. Levy et al, “New methods in structural segmentation of musical audio”, in Proc. Eusipco 2006.

[ML2006B] M. Levy et al, “Extraction of High-Level Musical Structure from audio data and its application to thumbnail generation”, in Proc. ICASSP 2006.

[MPEG7] ISO/IEC JTC1/SC29/WG11, “Information Technology – Multimedia Content Description Interface – Part 3: Multimedia Description Schemes”, ISO/IEC FDIS 15938-5, 2001-10-23.

[MP7XM] MPEG-7 eXperimentation Model, http://www.lis.ei.tum.de/research/bv/topics/mmdb/e_mpeg7.html

[EMD2006] Dan Barry et al,“EASAIER Metadata & Descriptors”, Internal Note, ver. 1.0 Draft, 2006.

[ED312006] “EASAIER Deliverable 3.1: Retrieval System Functionality and Specifications”, ver. 1.12, November 1, 2006.