Internal Note
Feature Extraction and Enriched Access Modules for Musical Audio Data
Version 1.0 Draft
Date: 15 February 2007
Editor: QMUL
Introduction
This document enumerates the modules for the extraction of musical features from recorded audio files to be integrated within the EASAIER framework.
Additional modules dedicated to the implementation of Enriched Access features will also be described here.
1. Architectural Notes
The EASAIER framework requires applications with DSP capability both on the content provider side (“Archiver”) and on the end user side (“Browser/navigator”).
Initially the project envisaged a complete separation between feature extraction and enriched access tools, the former being exclusively assigned to the server side application and the latter to be used by the client side application.
In practice, following the initial systems architecture meeting (07/09/2006), it was found that both sides of the system might benefit from a certain degree of interoperability between the two sets of tools.
The following sections describe the generic feature extraction and audio processing functionality of the Server and Client-Side applications.
(note, these two sections are mostly generated by brainstorming and guessing, so do modify/add/complain as you see fit)
1.1. Server Side Archiving Software
The server side archiving application is a tool that allows content providers to manually enter and/or automatically extract meta-data from musical audio/video assets and archive them within the EASAIER system.
1.1.1. Audio analysis
The musical audio asset is submitted to the application and, whenever necessary, undergoes restoration. A simplified system diagram is proposed in figure 1. Also, a compressed version of the audio asset is generated and submitted, along with the original, to the audio files repository: this “lower quality” copy can then be used by the EASAIER server for the purpose of streaming audio to the end user without using excessive amounts of bandwidth.
The process of sound source separation may also be performed at this stage, although there is limited confidence that this will significantly improve the performance of the musical features extractors. However, the inclusion of this algorithm within the “Archiver” would allow an expert operator to choose an optimal set of separation parameters uniquely associated with the audio file, which can be transmitted to the enriched access tools on the client-side application as default settings.
Following restoration and source separation, the audio data passes through a number of modules for the extraction of mid and high-level musical features, which are included in the meta-data associated with the audio file for classification and search purposes.
The modules are divided into two categories: mid-level extractors and high-level extractors. Broadly speaking, mid-level extractors return time-synchronous (frame-based) information such as harmonic and timbre profiles, chord sequences or the position of beats, and are particularly suitable for generating transcriptions and performing similarity-based searches within the EASAIER archives.
High-level extractors, on the other hand, aim to describe global, and mostly single-valued, information regarding a piece of music, such as the tempo, meter, global key, mode or the presence of a particular instrument within the audio file. These descriptors can be employed to perform a parameter-based search such as: “find an audio file exhibiting a tempo of 120 bpm at 4/4 time signature and containing the instrument conga”.
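As an illustration only, the parameter-based search quoted above could be sketched as a filter over per-asset high-level tags. The record layout, the field names (asset_id, tempo_bpm, meter, instruments) and the tempo tolerance are assumptions made for this example; the actual metadata format is still to be defined.

```python
from dataclasses import dataclass, field


@dataclass
class HighLevelTags:
    """Hypothetical per-asset record of single-valued descriptors."""
    asset_id: str
    tempo_bpm: float
    meter: str
    instruments: set = field(default_factory=set)


def parametric_search(archive, tempo_bpm, meter, instrument, tempo_tolerance=2.0):
    """Return the ids of assets that satisfy all three criteria."""
    return [
        asset.asset_id
        for asset in archive
        if abs(asset.tempo_bpm - tempo_bpm) <= tempo_tolerance
        and asset.meter == meter
        and instrument in asset.instruments
    ]


# Toy archive with invented values, used only to exercise the filter.
archive = [
    HighLevelTags("a1", 120.4, "4/4", {"conga", "bass"}),
    HighLevelTags("a2", 120.0, "3/4", {"conga"}),
    HighLevelTags("a3", 96.0, "4/4", {"piano"}),
]
matches = parametric_search(archive, 120, "4/4", "conga")
```

A tolerance on the tempo is used because an estimated tempo will rarely match the requested value exactly.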
The mid-level features are extracted by the relevant algorithm (see section 2 for a description) and stored in a suitable format (TBD) in a repository within the EASAIER system. As well as being utilised by the server for search purposes, these features can also be used by the client-side navigation and playback tool to provide specialised visualisations of the music under analysis (e.g. an intensity envelope) and markers on points of interest within the waveform (e.g. position of beats, verse/chorus boundary, etc).
Mid level descriptors are also used within the archiving application by a second level of software modules for the generation of high level features.
High-level feature extractors are not yet robust enough at this stage of development to guarantee consistent results; hence we envisage the use of a “reliability metric” that can prompt the operator to double-check the results and, if necessary, to manually populate the relevant high-level tags.
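One possible reading of the “reliability metric” is a per-tag confidence score compared against a threshold, with low-scoring tags queued for the operator. The sketch below assumes a score in the range 0 to 1 and an arbitrary threshold of 0.7; neither is specified in this note.

```python
def review_queue(estimates, threshold=0.7):
    """Return the tags whose reliability falls below the threshold.

    estimates maps a tag name to a (value, reliability) pair, where
    reliability is assumed to lie in [0, 1].
    """
    return sorted(
        tag for tag, (_value, reliability) in estimates.items()
        if reliability < threshold
    )


# Invented extractor output for a single audio file.
estimates = {
    "tempo": (120.0, 0.92),
    "meter": ("4/4", 0.55),
    "global_key": ("C major", 0.40),
}
to_check = review_queue(estimates)
```

Tags returned by review_queue would be presented to the operator for confirmation or manual entry; the remaining tags would be accepted as extracted.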
Figure 1: server side musical audio archiving.
[Figure 1 block labels: Input Audio File (PCM); Manual Entry Tags/Data; De-Noising / Restoration; Source Separation; Compression; Mid-Level Feature Extractors / Transcript.; High-Level Features Extractors; Reliability Metric; High-level features (parametric search); Mid-level descriptors & transcript (similarity search); Optimal source separation & denoising parameters; Manual Tags & Manual High Level Features; To PCM & compressed audio assets repository; To Metadata Repository.]
1.1.2. Video analysis
The video asset is submitted to the EASAIER server and undergoes the necessary transcoding. A compressed version of the video asset is generated and submitted, along with the original, to the video files repository; this “lower quality” copy can then be used by the EASAIER server to stream video to the end user without using excessive amounts of bandwidth. In this process the audio stream is extracted from the video for the audio analysis shown in figure 1. The video stream then undergoes automatic analysis as shown in figure 2. All these processes on the video/audio assets will be accomplished using open source software, such as ffmpeg [FFMPEG]. ffmpeg is widely regarded as a fast and reliable open source transcoder and integrates the majority of popular audio/video coders.
Figure 2: server side video archiving.
QMUL will also provide video segmentation and keyframe extraction modules. The modules take video in MPEG-2 format as input and output the start and duration of each video segment, together with the keyframe images and their positions within the video file. The modules are already available as Linux binaries, and cross-platform versions are under development. In the current implementation only one feature, ColorLayout, is extracted for each video frame. ColorLayout is a simple representation of the layout of colour within a frame, using a DCT to represent the feature. One DCT is created for each colour component (one luminance and two chrominance components in the case of a video frame).
The difference metric for each component is the weighted Euclidean distance between the corresponding DCT values of two frames. This leads to fast matching, and scalability can be improved by using fewer DCT values at the cost of accuracy. The resulting feature vector can be used for a variety of applications. Simple shot cuts can be detected by looking for peaks in the rate of change of the feature between subsequent frames; this gives a robust method for detecting abrupt shot changes that remains reasonably accurate even in sequences with high visual activity. At present we are working on expanding the feature set used for cut detection and keyframe extraction, and on more sophisticated difference metrics such as N-Cut (Normalized Cut).
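The difference metric and the peak-based cut detection described above can be sketched as follows. This is a minimal illustration and not the QMUL module: it assumes each frame is given as a dictionary of DCT coefficient lists keyed by component ("Y", "Cb", "Cr"), that the per-component distances are simply summed, and that a cut is a local maximum of the frame-to-frame distance above a fixed threshold. The weights and threshold are placeholders.

```python
import math

COMPONENTS = ("Y", "Cb", "Cr")  # one luminance and two chrominance components


def component_distance(a, b, weights):
    """Weighted Euclidean distance between two lists of DCT coefficients."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def frame_distance(f1, f2, weights):
    """Sum of the per-component distances (summation is an assumption)."""
    return sum(component_distance(f1[c], f2[c], weights[c]) for c in COMPONENTS)


def detect_cuts(frames, weights, threshold):
    """Return the indices of frames that start a new shot."""
    d = [frame_distance(frames[i - 1], frames[i], weights)
         for i in range(1, len(frames))]
    cuts = []
    for i, value in enumerate(d):
        left = d[i - 1] if i > 0 else 0.0
        right = d[i + 1] if i + 1 < len(d) else 0.0
        # A cut is a local peak of the distance that exceeds the threshold.
        if value > threshold and value > left and value >= right:
            cuts.append(i + 1)
    return cuts


# Two synthetic "shots" of three identical frames each (invented values).
shot_a = {"Y": [10.0, 2.0], "Cb": [5.0], "Cr": [5.0]}
shot_b = {"Y": [40.0, 8.0], "Cb": [9.0], "Cr": [3.0]}
weights = {"Y": [1.0, 1.0], "Cb": [1.0], "Cr": [1.0]}
frames = [shot_a] * 3 + [shot_b] * 3
cuts = detect_cuts(frames, weights, threshold=10.0)
```

Using fewer coefficients per component shortens the inner sum, which is the scalability trade-off mentioned in the text.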
The extracted keyframes are further processed in order to extract a set of MPEG-7 low-level descriptors [MPEG7], which will be used in the EASAIER cross-retrieval engine, in addition to audio similarity searches, to extend searches to non-audio assets. For this purpose the MPEG-7 eXperimentation Model (XM) software [MP7XM] will be used. This is the standard reference software used by the MPEG standardisation body; it is open source, and both Linux and Windows versions exist and have been tested and used at QMUL.
[Figure 2 block labels: Input Video File; Compression; Audio Stream Extraction; Video Segmentation and Keyframe Extraction; Keyframe Analysis; Manual Annotation; Audio stream analysis (figure 1); Keyframes; PCM; Original video file; Streaming video file (e.g. MPEG-4); Multimedia assets repository; KF temporal data; Video segments temporal data; Metadata; Features; Temporal data; Video segments metadata; KF Extracted Features; Metadata repository.]
The starting set of features that will be extracted for the purpose of EASAIER is defined in EASAIER metadata document [EMD2006] and Deliverable 3.1 [ED312006], but is still to be refined during implementation and testing phases of the EASAIER project.
1.2. Client Side Search and Browsing Software
The end user will be able to access the content of the EASAIER archive by means of an application (figure 3) that can retrieve an audio asset and its associated meta-data using a variety of non mutually exclusive query methodologies, such as:
- Queries based on general tags: e.g. find material by author/title, genre and year.
- Musical parameters-based queries: e.g. find songs by key, orchestration, tempo range.
- Similarity-based queries: e.g. once a musical audio asset has been retrieved, find other assets that exhibit some degree of similarity in terms of macroscopic structure, timbre and harmonic profile.
The audio is delivered by the server (either by streaming or download of the entire compressed file) to the client application and then buffered and converted to a suitable format for further processing and visualisation of its time-domain waveform.
Following the decoding stage, a suite of real-time audio processing modules allows restoration, source separation and enhancement of the incoming audio stream. The associated meta-data retrieved from the server contains a set of default parameters for both the source separation and restoration algorithms; alternatively, the user can override these parameters manually through an advanced menu/interface on the client application (enriched access UI).
The default source separation parameters can be associated with the tags generated by the instrument recognition algorithms to provide a “click and play” list of the various orchestral components of the musical audio asset.
A time-scale modification algorithm that can be operated in real time by the user is included in the enriched access set of tools, allowing the user to slow down or speed up the audio playback without affecting the pitch content.
As well as providing default operational parameters to the enriched access tool set, the meta-data also contain:
1) General and music-specific tags providing comprehensive information regarding the audio asset under analysis (displayed in the “Browsing and Searching UI”)
2) Mid-level features that can be used to deliver technical visualisations of the audio asset as well as markers for advanced playback and looping functionalities (displayed in the “Looping and Visualisation UI”)
Although high and mid-level musical descriptors are generated by the archiving application on the server side, an enhancement to the functionality offered by the EASAIER system can be identified in the ability to provide similarity-based searches using audio files residing on the client’s hard drive.
As shown in the bottom of figure 3, this functionality will require the deployment of a scaled-down version of the archiving application, allowing the generation of data that can be used to search the contents of the EASAIER server.
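A query-by-example search of this kind might rank archived assets by the similarity between their stored feature vectors and the vector computed locally by the scaled-down archiver. The sketch below uses cosine similarity over fixed-length vectors; the choice of measure, the vector layout and the asset ids are assumptions made for illustration, since the note does not specify how similarity is computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, archive, top_n=3):
    """Return the ids of the top_n archived assets most similar to the query."""
    scored = [(cosine_similarity(query, features), asset_id)
              for asset_id, features in archive.items()]
    # Highest similarity first; ties are broken by asset id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [asset_id for _score, asset_id in scored[:top_n]]


# Invented feature vectors standing in for server-side mid-level descriptors.
archive = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
local_features = [1.0, 0.0, 0.0]  # features of a file on the client's hard drive
ranking = rank_by_similarity(local_features, archive, top_n=2)
```

Only the feature vector, not the audio itself, would need to be sent to the server in such a scheme.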
Figure 3: client side musical audio browser/navigator.
[Figure 3 block labels: EASAIER SERVER; QUERY ENGINE; Browsing application (musical audio); Streaming Audio; Metadata; Buffer / Decode; De-Noising / Restoration; Source Separation; Equalisation; Time & pitch Scale Modification; Default Enriched Access Parameters; User-Defined Parameters; Mid-Level Features; High-Level Features; Textual/General Tags; Enriched Access UI; Browsing & Searching UI; Looping & visualisation UI; Audio Out; “Mini Archiver” (musical audio); Local Audio File; Mid-Level Feature Extractors / Transcript.; High-Level Features Extractors; High-level features (parametric search); Mid-level descriptors & transcript (similarity search); Mid & High Level Features Query.]
2. Software Modules
The software modules described in this section are included in the following EASAIER work packages:
1) WP4 – Sound Object Representation: This work package deals with the identification of features within the archived audio assets. As far as the musical audio is concerned, it will provide tools for the extraction of high and mid-level descriptors for classification and search purposes, as well as modules that describe the musical structure of the audio asset for visualisation and looping purposes.
2) WP5 – Enriched Access Tools: Tools developed within this work package will allow the user to apply useful modifications to the audio content at access time and in real-time, enabling an “enriched” exploration of the musical audio asset.
2.1. Enriched access
2.1.1. Time-scale Modification / Pitch-scale Modification
Provided by DIT:
The TSM algorithm will allow the user to vary the playback rate of the audio in real-time without affecting the local pitch content. The module will use both time domain algorithms and frequency domain algorithms. The appropriate algorithm will be chosen automatically depending on metadata provided with the audio content. The user should also be able to choose the algorithm manually. Pitch-scale modification independent of the time base is achievable in a similar manner.
Provided by QMUL:
An alternative TSM algorithm based on a phase vocoder implementation. The algorithm preserves transients well and is robust on stereo material, but requires a priori knowledge of the transient locations within the audio file; this can be provided by the extracted mid-level features.
2.1.2. Sound Source Separation
Provided by DIT:
A real-time separation algorithm which is capable of separating multiple sources from two-channel mixtures. At present this tool requires the user to set some parameters based on visual and audio feedback from the GUI in order to achieve meaningful separations. This version of the algorithm will be deployed as an enriched access tool for WP5. An automated version of this algorithm may also be provided as a pre-processor for transcription and instrument recognition in WP4. Some other work on single-channel separation is ongoing within the group at DIT.
2.1.3. Equalisation and Noise Reduction
Provided by DIT:
DIT may also be able to provide some rudimentary real-time noise reduction and equalisation tools for the purposes of audio enhancement. QMUL will provide support in the generation of C++ libraries for these tools.
2.2. Sound Object Representations
2.2.1. Segmentation
Provided by DIT:
Some segmentation routines, such as a “Novel Event Detector”, may be incorporated if desired.
A module for the segmentation and thumbnailing of recorded musical audio using a hierarchical timbre model (SoundBite) is available.
2.2.2. Mid Level Descriptors and Music Transcription
Provided by DIT:
The transcription algorithm will perform a non real-time analysis which will result in a musical transcription of the audio content. Harmony features may also be extracted during this analysis. It is also intended that some time aligned visual indication of harmony be provided. Alternative representations of transcribed audio will also be provided such as melodic contours for the purposes of melodic similarity queries. This tool will be deployed at the server side and will provide meta-data for the purposes of indexing. The tool may also be deployed at the client side for the case where the user wishes to query by example, where the example audio comes from outside the database.
Provided by QMUL:
The Centre for Digital Music can provide the following Feature Extraction Modules:
− Detection Function: A module for the generation of a function describing the local structure of an audio signal.
− Peak Picking: Module for the estimation of onsets from the detection function. Also contains a class for Detection Function processing.
− Onset Detection: A module for estimating onsets from audio files, incorporating the detection function and peak picking classes.
− Multi-Band Onset Detection: (Released after 31/10/2006) Module for estimating tonal and percussive onsets from audio files.
− Chroma Class: A module for logarithmic frequency analysis.
− Beat Tracker: A module for beat tracking of musical audio.
− Harmonic Change Detection Function (HCDF): Module for the detection of harmonic change in musical audio files.
− Chord Estimation: (Ongoing Research) Module for the estimation of musical chords from audio files.
− Harmonic Content Estimation: The module is intended to provide a mid-level representation of the harmonic and rhythmic information from audio files. The algorithm returns a robust description of musical attributes that is intended to be used for similarity matching rather than for transcription and information retrieval.
− Key Estimation: (Ongoing Research) Module for the estimation of the key in a musical file (frame-based).
− Tempo Estimator: (Ongoing Research) The module estimates tempo from a musical audio file using information returned by the beat tracking algorithm.
− Meter Estimator: (Ongoing Research) The module estimates the time signature from a musical audio file using information returned by the beat tracking algorithm.
− Global Key Estimator: (Ongoing Research) Module for the estimation of the predominant key in a musical file using information returned by the frame-based key estimation algorithm.
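The way the detection function, peak picking and onset detection modules listed above fit together can be sketched as follows. This is a simplified stand-in rather than the QMUL code: the detection function used here is a half-wave rectified frame-energy difference (chosen for brevity, not taken from the modules), and the peak picker keeps only DC removal and a median-based threshold, omitting the smoothing and quadratic fit mentioned in the status table of section 3. The frame size, hop size and threshold offset are arbitrary.

```python
from statistics import median


def detection_function(samples, frame_size=4, hop=4):
    """One value per frame: the positive change in frame energy."""
    energies = [
        sum(x * x for x in samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]
    return [0.0] + [max(0.0, b - a) for a, b in zip(energies, energies[1:])]


def pick_peaks(df, median_len=5, delta=0.1):
    """Return frame indices of local maxima above a moving median threshold."""
    mean = sum(df) / len(df)
    centred = [value - mean for value in df]  # DC removal
    half = median_len // 2
    peaks = []
    for i in range(1, len(centred) - 1):
        window = centred[max(0, i - half):i + half + 1]
        threshold = median(window) + delta
        if (centred[i] > threshold
                and centred[i] > centred[i - 1]
                and centred[i] >= centred[i + 1]):
            peaks.append(i)
    return peaks


def detect_onsets(samples, frame_size=4, hop=4):
    """Chain the two stages; onsets are returned as sample positions."""
    df = detection_function(samples, frame_size, hop)
    return [frame * hop for frame in pick_peaks(df)]


# Synthetic signal: two bursts starting at samples 16 and 40.
signal = [0.0] * 16 + [1.0] * 8 + [0.0] * 16 + [1.0] * 8 + [0.0] * 8
onsets = detect_onsets(signal)
```

The peak picker works on the time base of the detection function (frame indices), while detect_onsets converts back to the time base of the original audio, mirroring the scope of the Peak Picking and Onset Detection modules.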
2.2.4. Musical Instrument Recognition
Provided by DIT:
DIT has very recently begun work in this field. We expect to be able to integrate this work into EASAIER at a later stage. QMUL will provide legacy code (Instrument Identification Libraries) and knowledge gained from previous research carried out at the Centre for Digital Music.
3. Current status of Software Modules
Each module is listed with the following fields: Type of Module; Feature Extractor Name; Underlying Technology / [references]; Input / Output / Scope; Lang.; Current Development Status.

Type of Module: Low Level Descriptor
Feature Extractor Name: Detection Function
Underlying Technology / [references]: A number of techniques are covered. [JF2000], [JPB2005]
Input / Output / Scope: Input is a dense frequency domain frame. Output is a single value per input frame.
Lang.: Matlab, C++
Current Development Status: C++ module completed and deployed. VAMP plugin is available.

Type of Module: Low Level Descriptor
Feature Extractor Name: Peak Picking
Underlying Technology / [references]: Detection function undergoes DC removal, smoothing and median filtering [IK2002]. Peak selection is based on quadratic fit [reference needed].
Input / Output / Scope: Input is the detection function. Output is a vector indicating the location of estimated onsets. Time base is relative to the detection function.
Lang.: C++
Current Development Status: Completed and deployed.

Type of Module: Low Level Descriptor
Feature Extractor Name: Onset Detection
Underlying Technology / [references]: The module links the detection function and peak-picking classes to provide a complete onset estimator.
Input / Output / Scope: Input is a pointer to a location containing samples of the audio file under analysis. Output is a vector indicating the location of estimated onsets. Time base is relative to the original audio file.
Lang.: C++
Current Development Status: Completed and deployed. VAMP plugin is available.

Type of Module: Low Level Descriptor
Feature Extractor Name: Multi-Band Onset Detection
Underlying Technology / [references]: The module splits the signal into four sub-bands using a constant-Q filterbank prior to onset detection [CD2004]. Tonal and percussive components are discriminated on the basis of the presence of onsets in the different sub-bands. [ER2005]
Input / Output / Scope: Input is raw audio data. Outputs are vectors indicating the location of estimated tonal and percussive onsets. Time base is relative to the original audio file.
Lang.: Matlab, C++
Current Development Status: A C++ version is available.

Type of Module: Low Level Descriptor
Feature Extractor Name: Chroma
Underlying Technology / [references]: Based on an FFT, utilises a sparse kernel approach for the calculation of a constant-Q transform. The Chromagram (HPCP) is then calculated from the constant-Q data. [JB1991], [JB1992], [CH2005]
Input / Output / Scope: Input is raw audio data. Output is a dense matrix containing the Chromagram bins of the file under analysis. Time base depends on the resolution of the constant-Q transform.
Lang.: Matlab, C++
Current Development Status: Matlab version complete. C++ version deployed but needs revision. VAMP plugin is available.

Type of Module: Mid Level Descriptor
Feature Extractor Name: Beat Tracker
Underlying Technology / [references]: Beat times are recovered by passing the output of an onset detection function through comb filterbank matrices to identify the beat period and alignment. The module uses a two state model for tracking tempo changes and for maintaining continuity within a single tempo hypothesis. [MD2004], [MD2005]
Input / Output / Scope: Input is raw audio data. Output is either a sparse vector with the non-zero elements denoting an estimated beat or a vector containing the temporal location of the identified beats. Time base is relative to the original audio file.
Lang.: Matlab, C++
Current Development Status: C++ version is deployed. VAMP plugin is available.

Type of Module: Low/Mid Level Descriptor
Feature Extractor Name: Harmonic Change Detection Function (HCDF)
Underlying Technology / [references]: A 12-bin chromagram is mapped to a 6-D space using a tonal centroid transform and smoothed using a Gaussian window. The HCDF is defined as the rate of change of the smoothed tonal centroid signal. Transition times between harmonically stable regions can be obtained by peak picking the HCDF.
Input / Output / Scope: Input is raw audio data. Output is a dense vector representing the peak change between tonal centroid frames.
Lang.: Matlab, C++
Current Development Status: Complete. C++ version is deployed. VAMP plugin is available.

Type of Module: Mid Level Descriptor
Feature Extractor Name: Chord Estimation
Underlying Technology / [references]: The algorithm relies on a 36-bin tuned chromagram obtained from a constant-Q transform. The identification is performed using chord templates. [CH2005], [MC2004], [BP2002]
Input / Output / Scope: Standard Matlab I/O. Input is raw audio data. Output is a sequence of estimated chord symbols.
Lang.: Matlab
Current Development Status: A complete C/C++ implementation is not currently available.

Type of Module: Mid/High Level Descriptor
Feature Extractor Name: Key Estimation
Underlying Technology / [references]: The key space is modelled by a 24-state HMM. Each state represents one of the 24 major or minor keys and each observation represents a chord transition. [KN2006]
Input / Output / Scope: Standard Matlab I/O. Input is a sequence of estimated chord symbols. Output is the estimated key, either on a frame or per-track basis.
Lang.: Matlab
Current Development Status: A complete C/C++ implementation is not currently available.

Type of Module: Mid Level Descriptor / Similarity Retrieval
Feature Extractor Name: Harmonic Content Estimation
Underlying Technology / [references]: A 36-bin tuned chromagram is averaged between detected beats. The resulting averaged chromagram is further reduced to 12 bins by summing all three bins for each pitch class. The state transition matrix, mean vector and covariance matrix of an HMM are initialised using musical knowledge and selectively trained using the 12-bin chromagram. The chord sequence is then inferred from the HMM using Viterbi decoding. [JPB2005A]
Input / Output / Scope: Standard Matlab I/O. Input is raw audio data. The output represents a sequence of major and minor triads. Time base consists of detected beats (tactus).
Lang.: Matlab
Current Development Status: A complete C/C++ implementation is not currently available.

Type of Module: High Level Descriptor
Feature Extractor Name: Instrument Identification Libraries
Underlying Technology / [references]: The instrument identifier relies on a mono-feature timbre modelling approach, using Line Spectrum Frequencies (LSF) as the unique identifier. Various classifiers are implemented, in particular k-Means, Gaussian mixture models and Support Vector Machines. [NC2005], [NC2006], [FI1975], [PK1986]
Input / Output / Scope: Undetermined.
Lang.: Matlab / C
Current Development Status: Status unknown. Code is allegedly in DSPMac repository.

Type of Module: Similarity Retrieval / Enriched Access
Feature Extractor Name: SoundBite
Underlying Technology / [references]: For a given track, the space of possible timbres is divided into N timbre types, each of which generates timbre features according to a Gaussian distribution. The sequence of timbre features through the track is modelled by an N-state Hidden Markov Model where the hidden states correspond to the N timbre types. The most likely sequence of timbre types to have generated the features is Viterbi decoded from the HMM. The most likely segmentation is found by clustering histograms of the timbre types. The feature vector consists of the first 20 PCA components extracted from the normalised constant-Q spectrum of the audio under analysis along with the normalised envelope. Analysis hop size is chosen as the estimated beat length of the audio under analysis. [ML2006A], [ML2006B]
Input / Output / Scope: Input is raw audio data. Output is a sequence of labelled segments.
Lang.: C/C++
Current Development Status: A C/C++ demonstrator is available for Mac OS X. VAMP plugin is available.

Type of Module: High Level Descriptor
Feature Extractor Name: Tempo & Meter
Underlying Technology / [references]: The algorithm is based on the beat tracker described above. Meter estimation is currently limited to 4/4 and 3/4. The tempo value is estimated by analysing the beat histogram generated using tempo tracking across the audio file, and a measure of reliability can be inferred from the distribution of bins in the histogram.
Input / Output / Scope: Input is raw audio data. Outputs are: histogram of detected tempos; estimated main tempo; estimated time signature.
Lang.: C++
Current Development Status: C++ code completed and deployed. Some further experimental work is needed.

Type of Module: Enriched Access
Feature Extractor Name: Time-Scaling
Underlying Technology / [references]: Time scaling is performed using an FFT-based phase vocoder. Percussive onsets are identified using a multi-band onset detection algorithm and only steady state portions of the signals are time scaled, thus preserving the integrity of transients. Coherence in stereo signals is maintained by using a single reference channel for the identification of transient and steady state frames. [ER2005]
Input / Output / Scope: Input is raw audio data. Output is the time-scaled audio data.
Lang.: Matlab, C++
Current Development Status: A non-optimal C++ implementation is available.
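The Tempo & Meter entry above derives the main tempo and a reliability measure from a histogram of detected tempos. A minimal sketch of that idea is given below, assuming beat times in seconds, one histogram entry per inter-beat interval, and reliability defined as the fraction of entries in the winning bin. The bin width and the reliability definition are assumptions for this example and are not taken from the deployed C++ code.

```python
from collections import Counter


def tempo_histogram(beat_times, bin_width_bpm=1.0):
    """Histogram of instantaneous tempos derived from inter-beat intervals."""
    hist = Counter()
    for previous, current in zip(beat_times, beat_times[1:]):
        interval = current - previous
        if interval > 0:
            bpm = 60.0 / interval
            hist[round(bpm / bin_width_bpm) * bin_width_bpm] += 1
    return hist


def main_tempo(hist):
    """Return (tempo of the fullest bin, fraction of entries in that bin)."""
    if not hist:
        raise ValueError("at least two distinct beat times are needed")
    total = sum(hist.values())
    tempo, count = max(hist.items(), key=lambda item: item[1])
    return tempo, count / total


# Invented beat times: four intervals of 0.5 s and one of 0.6 s.
beats = [0.0, 0.5, 1.0, 1.5, 2.0, 2.6]
hist = tempo_histogram(beats)
tempo, reliability = main_tempo(hist)
```

A histogram concentrated in one bin gives a reliability close to 1, while a spread-out histogram gives a low value that could be used to prompt the operator to check the tempo tag.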
4. References:
[JF2000] J. Foote, “Automatic audio segmentation using a measure of audio novelty”, in Proc. IEEE Int. Conf. Multimedia and Expo (ICME2000), vol. I, New York, Jul. 2000, pp. 452–455.
[JPB2005] J. P. Bello et al., “A Tutorial on Onset Detection in Music Signals”, IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, September 2005.
[IK2002] I. Kauppinen, “Methods for detecting impulsive noise in speech and audio signals”, in Proc. 14th Int. Conf. Digital Signal Processing (DSP2002), vol. 2, Santorini, Greece, Jul. 2002, pp. 967–970.
[CD2004] C. Duxbury, “Signal Models for Polyphonic Music”, PhD Thesis, 2004.
[ER2005] E. Ravelli et al., “Fast implementation for non-linear time-scaling of stereo signals”, in Proc. of the 8th Int. Conference on Digital Audio Effects (DAFx’05), Madrid, Spain, September 20–22, 2005.
[JB1991] Judith Brown, “Calculation of a Constant Q Spectral Transform”, Journal of the Acoustical Society of America, vol. 89, no. 1, 425–434, 1991.
[JB1992] Judith C. Brown, Miller S. Puckette, “An Efficient Algorithm for the Calculation of a Constant Q Transform”, Journal of the Acoustical Society of America, vol. 92, no. 5, 2698–2701, 1992.
[CH2005] C. Harte, M. B. Sandler, “Automatic Chord Identification Using a Quantised Chromagram”, in Proc. of the 118th AES Convention, Barcelona, Spain, May 28–31, 2005.
[MD2004] M. E. P. Davies and M. D. Plumbley, “Causal tempo tracking of audio”, in 5th International Symposium on Music Information Retrieval, October 2004.
[MD2005] M. E. P. Davies and M. D. Plumbley, “Beat tracking with a two state model”, in Proceedings of ICASSP, Philadelphia, USA, March 18–23, 2005.
[CH2006] C. Harte, M. Gasser, M. B. Sandler, “Detecting Harmonic Change in Musical Audio”, in Proc. of AMCMM’06, Santa Barbara, USA, October 27, 2006.
[MC2004] Markus Cremer and Claus Derboven, “A System for Harmonic Analysis of Polyphonic Music”, Proceedings of the AES 25th International Conference, London, UK, 2004, 115–120.
[BP2002] Bryan Pardo and William P. Birmingham, “Algorithms for Chordal Analysis”, Computer Music Journal, vol. 26, no. 2, 27–49, 2002.
[KN2006] Katy Noland, Mark Sandler, “Key Estimation using a Hidden Markov Model”, in Proc. of ISMIR, Victoria, Canada, 2006.
[JPB2005A] J. P. Bello, J. Pickens, “A Robust Mid-level Representation for Harmonic Content in Music Signals”, in 6th International Symposium on Music Information Retrieval, London, 2005.
[PK1986] P. Kabal and R. P. Ramachandran, “The Computation of line spectral frequencies using Chebyshev polynomials”, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-34, no. 6, 1419–1426, 1986.
[FI1975] F. Itakura, “Line spectrum representation linear predictive coefficients of speech signals”, J. Acoust. Soc. Amer., vol. 57, S35, 1975.
[NC2005] N. Chetry et al., “Musical Instrument Identification using LSF and K-means”, in Proc. AES 118th Convention, Barcelona, Spain, May 28–31, 2005.
[NC2006] N. Chetry et al., “Computer Models for Musical Instrument Identification”, PhD Thesis, 2006.
[ML2006A] M. Levy et al., “New methods in structural segmentation of musical audio”, in Proc. Eusipco 2006.
[ML2006B] M. Levy et al., “Extraction of High-Level Musical Structure from audio data and its application to thumbnail generation”, in Proc. ICASSP 2006.
[MPEG7] ISO/IEC JTC1/SC29/WG11, “Information Technology – Multimedia Content Description Interface – Part 5: Multimedia Description Schemes”, ISO/IEC FDIS 15938-5, 2001-10-23.
[MP7XM] MPEG-7 eXperimentation Model, http://www.lis.ei.tum.de/research/bv/topics/mmdb/e_mpeg7.html
[EMD2006] Dan Barry et al., “EASAIER Metadata & Descriptors”, Internal Note, ver. 1.0 Draft, 2006.
[ED312006] “EASAIER Deliverable 3.1: Retrieval System Functionality and Specifications”, ver. 1.12, November 1, 2006.