records, and a set of audio tapped and symbolic queries available for download2. The question here is whether the temporal structure of music can be adequately represented by rhythmic (onset) information alone [Jang et al., 2001]. This has implications for score-alignment algorithms where robustness may be increased by ignoring pitch information rather than including it.
Applying the models here is straightforward. C= 1andnc(t)is the number of observed onsets provided in the query up to frame t. ρc(t)is set to the number of note onsets in the Midi le up to frame t. The examples in the Mirex task are monophonic, so note onsets do not happen at exactly the same time.
The system ranks all of the Midi les in the database according to the likelihood (8.11) for each query.
Figure 8.3 on page 151 shows an example of the inter-onset timings which are used to match the observed onsets to the note onsets in Midi. To evaluate how well the system performs, for each queryq, we compute the rank rq of the correct score when the database scores are ranked according to likelihood. If the system correctly matches query q by assigning the highest likelihood, then rq = 1. If the system assigned two incorrect scores with a higher likelihood than the correct score, then rq = 3. The evaluation method published by Mirex is the mean reciprocal rank (MRR) over all the queriesq∈Q:
MRR=X
q∈Q
1 rq
such that a perfect system which returns the correct score for every query, i.e.,rq = 1∀q has an MRR of1, and systems which make more mistakes have lower MRR values.
. The best performing algorithm by Typke and Walczak-Typke [2008] for the 2008 task, which calculates an Earth mover's distance (EMD) between the observed and score onsets, achieved an average MRR of0.52.
Even without using any Bayesian techniques, we obtain an average MRR of 0.54on the Mirex database, outperforming more elaborate and computationally expensive algorithms on the symbolic onset data.
5 10 15 20 25 30 0
500 1000
Time/ms
Query Onsets
5 10 15 20 25 30
0 500 1000 1500
MIDI Onsets
Time/ms
Figure 8.3: Inter-onset timings in a query-by-tapping problem. The timings in the tapped query have the same rhythmic structure as the Midi timings, which enables the query to be matched with a high likelihood to the score using the Poisson process model generated in this chapter.
how training data, for example a synthesized version of the score, can be used to infer the parameters of the generative model, and how to update the model using the structure inferred within the observed audio itself. A standard approach to score alignment is to match a synthesized track with the observed audio using DTW. By applying the prior model of the score pointer position and the improved inference techniques, we show how the accuracy of the matching can be improved, especially at the position of note onsets where timing errors are perceptually most critical.
The nal applications are based on an event-based observation of the audio, which may be obtained through an onset detector or a principal pitch transcription algorithm. We develop a novel event counting generative model of the events using a non-homogeneous Poisson process, which replaces a generative model of the signal itself, and may use the same inference techniques as the other models in this chapter. We describe simple priors for the event processes which allow and infer the probability of missed event detections and clutter for each event type. The approach is applied to a query-by-tapping example where it is shown to be more accurate than more elaborate algorithms.
The work of Degara et al. [2010] shows that signicant improvement in note onset detection can be achieved by applying a prior model of rhythmic structure and fusing onset detection with rhythmic structure constraints. A natural extension of the work in Section 8.3 would be to jointly infer the event positions and the score structure using Bayesian inference.
Throughout this chapter we have mostly outlined the prior models and inference algorithms that can be applied using this framework, whereas in previous work [Peeling et al., 2007a] we have expressed a specic approach in more detail. The hidden Markov model framework is general and well studied, and many software packages exist for inference in HMMs which only require the specication of the observation model and the dynamics of the hidden state. Our contribution in this chapter has been to illustrate how several important applications in musical signal processing may be carried out through these inference algorithms.
Chapter 9
Conclusion
9.1 Summary
In this thesis we have presented a variety of Bayesian models and modelling methods aimed at tackling dicult applications of the processing of musical audio signals. A hierarchy of Bayesian priors is a feasible way to represent the complex structure exhibited in musical audio that is known from musical theory and has been obtained by experiments on the physics of musical instruments. Generative models of music are attractive as they are not tied to a particular application. Instead, the same generative model and prior structure may be used for music transcription, source separation, synthesis and reconstruction, simply by identifying which parameters of the model are known and unknown in a particular context, and applying an appropriate inference algorithm to the model to infer the values of the unknown parameters.
For each model we have developed in this thesis, we have described how it diers from existing work or generalizes an existing model, and provide a theoretical basis and justication to the accuracy observed in the experimental results presented. We have focused particularly on the problem of multiple pitch detection in frames of audio, particularly because it is an exacting measure of how appropriate and accurate a model of a musical signal is, especially for mixtures of three or more notes. Our assumption is that a generative model which is capable of performing the dicult classication task of identifying multiple pitches with high accuracy will also produce faithful and realistic reconstructions when used for its original purpose in a synthesis application, as far as the limitations of the particular model allow1.
Another goal of this thesis was to produce algorithms and inference methods using these generative models that are ultimately feasible for real-time applications. Musical signals are inherently complex, and even a simple model requires many parameters to model a single frame of audio. Full Bayesian inference by reversible jump MCMC when the number of parameters is itself unknown is slow despite substantial improvements to inference in the Bayesian harmonic model in previous work. It is important however that any compromises made should be with the inference algorithm rather than by oversimplifying the model, because musical audio requires a rich prior structure to capture the information we are extracting from the signal.
1The matrix factorization models in Chapter 7 have no explicit model for the phase of the source coecients, hence the phase of a synthetic signal would need to be constructed using a phase vocoder [Flanagan et al., 1965, Cemgil and Godsill,
In Chapter 5 we embellished the existing Bayesian harmonic model with a justication for using the Hilbert transform to improve the models ability to capture frequency and amplitude modulations in a partial frequency. We also saw a small improvement when using sinc windowed Gabor basis functions to model signals from instruments with vibrato. We also provided a result for the mode of the posterior distribution of the signal-to-noise ratio parameter which reduces the number of parameters that need to be simulated in the MCMC algorithm. By making these modications to the existing model, we were able to show improvement in the accuracy of multiple pitch detection for two-note mixtures, although there was no improvement demonstrated for mixtures of three or more notes, and we claimed this was due to not estimating the frequency values to sucient accuracy. In Chapter 6 however, we make this model practical for applications demanding faster computation, by splitting inference into two stages: the rst stage to detect partial frequency positions in a frame using the generative model with a vague prior over possible frequency positions, using numerical techniques to improve the accuracy of the frequency estimates; and the second stage to t a harmonic model to the estimated partial frequencies. Both stages were implemented using greedy algorithms, but we showed that the generative signal model and the prior harmonic model are powerful enough to detect the number and frequencies of pitches in a frame to the same level of accuracy as a full Bayesian inference scheme, but with computation reduced dramatically.
In Chapter 7 each frame of audio is modelled by projection onto a xed basis, rather than inferring the number and frequencies of the bases. The harmonic spectrum of a note is represented as a prior on the relative amplitudes of the basis functions in each frame, and the amplitude envelope of a note is represented as a prior on the relative energy of the signal across frames. The model is linear in the amplitude parameters, allowing multiple notes to be superimposed, and importantly multiple frames can be processed in parallel.
A simple polyphonic transcription algorithm was implemented for this model, and was shown to have a competitive level of accuracy to other transcription schemes on a large set of classical music.
In Chapter 8 we present a unied framework for musical signal processing applications which interact with a score of the music. This framework denes the dynamics of a virtual score pointer, such that the generative model of the signal in each frame depends on the properties of the score at the position indicated by the score pointer. By treating the dynamics and the generative model together as a hidden Markov model, we have a large and exible class of inference algorithms available to estimate the movement of the score pointer through the signal and to iteratively train and learn the parameters of the generative models on past and currently observed data.