5.1.2 Automatic segmentation using the Amplitude/Centroid Trajectory model
It has been shown that, in order to better understand the temporal evolution of sounds, it is also necessary to consider the way in which the audio spectrum changes over time [45]. Hajda proposed a new model for the partitioning of isolated non-percussive musical sounds [44], based on observations by Beauchamp that for certain signals the root mean square (RMS) amplitude and spectral centroid have a monotonic relationship during the steady state region [7]. An example of this relationship is shown for a clarinet sample in Figure 5.2. The spectral centroid is given by Equation 5.4, where $f_b$ is the frequency (in Hz) and $a_b$ is the linear amplitude of frequency band $b$ (up to $m$ bands), both computed by Fast Fourier Transform. The Fourier transform is performed on Bartlett-windowed analysis frames that are 64 samples in duration. This results in 32 evenly spaced frequency bands (up to 11025 Hz), each with a bandwidth of about 345 Hz.
$$\mathrm{centroid}(t) = \frac{\sum_{b=1}^{m} f_b(t) \times a_b(t)}{\sum_{b=1}^{m} a_b(t)} \tag{5.4}$$
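As an illustration, Equation 5.4 could be computed along the following lines with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `spectral_centroid`, the 22050 Hz sample rate (inferred from the 11025 Hz upper band edge), the non-overlapping hop size and the decision to drop the DC bin so that exactly 32 bands remain are all assumptions.

```python
import numpy as np

def spectral_centroid(signal, sample_rate=22050, frame_size=64, hop_size=64):
    """Per-frame spectral centroid in Hz (Equation 5.4).

    sample_rate, hop_size and the exclusion of the DC bin are
    assumptions made for this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.bartlett(frame_size)
    # Band centre frequencies f_b for b = 1..m, with m = frame_size / 2
    bands = np.arange(1, frame_size // 2 + 1)
    freqs = bands * sample_rate / frame_size
    n_frames = max(1 + (len(signal) - frame_size) // hop_size, 0)
    centroids = np.zeros(n_frames)
    for t in range(n_frames):
        start = t * hop_size
        frame = signal[start:start + frame_size] * window
        # Linear amplitudes a_b(t), DC bin dropped
        amps = np.abs(np.fft.rfft(frame))[1:]
        total = amps.sum()
        # Guard against the denominator reaching zero in silent frames
        centroids[t] = (freqs * amps).sum() / total if total > 0 else 0.0
    return centroids
```

The explicit guard on the denominator reflects the behaviour noted for the decay region, where very low amplitudes make the centroid unreliable.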
Figure 5.2: The full-wave-rectified version of a clarinet sample, the RMS amplitude envelope (dashed line) and the spectral centroid (dotted line). The RMS amplitude envelope and the spectral centroid have both been normalised and scaled by the maximum signal value.
Hajda’s model, called the Amplitude/Centroid Trajectory (ACT), identifies the boundaries for four contiguous regions in a musical tone:
Attack: the portion of the signal in which the RMS amplitude is rising and the spectral centroid is falling after an initial maximum. The attack ends when the centroid slope changes direction (centroid reaches a local minimum).
Attack/steady state transition: the region from the end of the attack to the first local maximum in the RMS amplitude envelope.
Steady state: the segment in which the amplitude and spectral centroid both vary around mean values.
Decay: the section during which the amplitude and spectral centroid both rapidly decrease. At the end of the decay (near the note offset), however, the centroid value can rise again, as the signal amplitude can become so low that the denominator in Equation 5.4 approaches 0. This can be seen in Figure 5.2 (starting at approximately sample number 100200).
Hajda initially applied the ACT model only to non-percussive sounds. However, Caetano et al. introduced an automatic segmentation technique based on the ACT model [19] and proposed that it could be applied to a large variety of acoustic instrument tones. It uses cues taken from a combination of the amplitude envelope and the spectral centroid. The amplitude envelope is calculated using a technique called the true amplitude envelope, which is a time domain implementation of the true envelope.
The True Envelope and the True Amplitude Envelope
The true envelope [50, 101, 21] is a method for estimating a spectral envelope by iteratively calculating the filtered cepstrum, then modifying it so that the original spectral peaks are maintained while the cepstral filter is used to fill the valleys between the peaks. The real cepstrum of a signal is the inverse Fourier transform of the log magnitude spectrum and is defined by Equation 5.5, where $X$ is the spectrum produced by an $N$-point DFT.

$$C(n) = \sum_{k=0}^{N-1} \log(|X(k)|)\, e^{j 2 \pi k n / N} \tag{5.5}$$
The cepstrum can be low-pass filtered (also known as liftering) to produce a smoother version of the log magnitude spectrum. This smoothed version of the spectrum can be calculated according to Equation 5.6, where $w$ is a low-pass window in the cepstral domain (defined by Equation 5.7) with a cut-off frequency of $n_c$.

$$\hat{X}(n) = \sum_{k=0}^{N-1} w(k)\, C(k)\, e^{-j 2 \pi k n / N} \tag{5.6}$$

$$w(n) = \begin{cases} 1 & |n| < n_c \\ 0.5 & |n| = n_c \\ 0 & |n| > n_c \end{cases} \tag{5.7}$$
If $\hat{X}_i(n)$ is the smoothed version of the spectrum at iteration $i$, then the true envelope is found by iteratively updating the current envelope $A_i$ according to Equation 5.8.

$$A_i(n) = \max\left(A_{i-1}(n),\, \hat{X}_{i-1}(n)\right) \tag{5.8}$$

The algorithm is initialised by setting $A_0(n) = \log(|X(n)|)$ and $\hat{X}_0(n) = -\infty$.
This process results in the envelope gradually growing to cover the spectral peaks, with the areas between spectral peaks being filled in by the cepstral filter. An additional parameter $\Delta$ is specified in order to stop the algorithm, specifying the maximum value that a spectral peak can have above the envelope.
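The iteration described above could be sketched as follows. This is an illustrative reading of Equations 5.5 to 5.8 rather than a reference implementation: the names `lifter_window` and `true_envelope`, the default value of `delta` and the `max_iter` safety limit are assumptions. Note also that `np.fft.ifft` applies a 1/N normalisation that is not written in Equation 5.5; it is kept here so that filtering with an all-pass window returns the input unchanged.

```python
import numpy as np

def lifter_window(N, n_c):
    """Low-pass cepstral window of Equation 5.7 for an N-point cepstrum."""
    n = np.arange(N)
    q = np.minimum(n, N - n)  # |n|, treating the upper half as negative indices
    w = np.zeros(N)
    w[q < n_c] = 1.0
    w[q == n_c] = 0.5
    return w

def true_envelope(log_mag, n_c, delta=0.1, max_iter=200):
    """Estimate a spectral envelope from a full N-point log magnitude spectrum.

    delta is the largest amount by which a spectral peak may exceed the
    envelope before the iteration stops; max_iter is an assumed safety limit.
    """
    log_mag = np.asarray(log_mag, dtype=float)
    N = len(log_mag)
    w = lifter_window(N, n_c)
    A = log_mag.copy()           # A_0(n) = log|X(n)|
    V = np.full(N, -np.inf)      # smoothed spectrum, initially -infinity
    for _ in range(max_iter):
        A = np.maximum(A, V)                 # Equation 5.8
        C = np.fft.ifft(A)                   # cepstrum of the current envelope
        V = np.fft.fft(w * C).real           # liftered spectrum, Equation 5.6
        if np.max(log_mag - V) <= delta:     # all peaks within delta of envelope
            break
    return V
```

A larger `n_c` keeps more cepstral coefficients and so follows the spectrum more closely; a smaller value gives a smoother envelope at the cost of more iterations before the peaks are covered.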
The true envelope can also be applied in the time domain to create the true amplitude envelope (TAE) [21]. The first step in the TAE is to obtain a rectified version of the audio waveform so that there are no negative amplitude values. The signal is then zero-padded to the nearest power of two, a time-reversed version of it is appended to the end of the signal and the amplitude values are exponentiated, as the true envelope assumes that the envelope curve is being created in the log spectrum. Finally, the true envelope algorithm is applied to the time domain signal instead of the Fourier spectrum so that the resulting envelope accurately follows the time domain amplitude peaks.
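The TAE steps listed above might be sketched as follows. This is a hedged illustration, not the implementation from [21]: the function name and the default values of `n_c`, `delta` and `max_iter` are assumptions, and zero-padding is taken to mean padding up to the next power of two. Under the reading adopted here, exponentiating the amplitudes and then taking the log inside the true envelope cancel out, so the iteration starts directly from the rectified signal.

```python
import numpy as np

def true_amplitude_envelope(x, n_c=64, delta=0.01, max_iter=100):
    """Time domain amplitude envelope via true envelope style smoothing.

    n_c, delta and max_iter are illustrative defaults (assumptions).
    """
    x = np.abs(np.asarray(x, dtype=float))        # full-wave rectification
    n = len(x)
    if n == 0:
        return x
    n_pad = 1 << (n - 1).bit_length()             # next power of two >= n
    padded = np.concatenate([x, np.zeros(n_pad - n)])
    # Append a time-reversed copy so the sequence resembles a symmetric spectrum
    target = np.concatenate([padded, padded[::-1]])
    M = len(target)
    idx = np.arange(M)
    q = np.minimum(idx, M - idx)
    w = np.where(q < n_c, 1.0, np.where(q == n_c, 0.5, 0.0))  # Equation 5.7
    A = target.copy()            # log(exp(|x|)) is simply |x|
    V = np.full(M, -np.inf)
    for _ in range(max_iter):
        A = np.maximum(A, V)                          # Equation 5.8
        V = np.fft.fft(w * np.fft.ifft(A)).real       # cepstral smoothing
        if np.max(target - V) <= delta:               # peaks covered
            break
    return V[:n]                 # discard padding and the mirrored half
```

Because the window is symmetric, the smoothed sequence is real up to rounding error even though the mirrored signal is not exactly even-symmetric in the DFT sense, which is why only the real part is kept.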
Automatic segmentation using the ACT model
For each musical tone the method presented by Caetano et al. locates onset, end of attack, start of sustain, start of release and offset boundaries as follows:
Onset: start of the note, found by using the automatic onset detection method described in [100]³.
End of attack: position of the first local minimum in the spectral centroid that is between the onset and the start of sustain.
Start of sustain: boundary detected using a modified version of Peeters’ weakest effort method.
Start of release: also detected using a version of the weakest effort method, but starting at the offset and working backwards.
Offset: the last point at which the TAE attains the same energy (amplitude squared) as the onset.
³This technique involves looking for signal regions in which the center of gravity of the instantaneous energy of the windowed signal is above a given threshold. In other words, if most of the energy in a spectral frame is located towards the leading edge of the analysis window, then the frame is likely to contain a note onset.
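Two of the boundaries listed above, the end of attack and the offset, depend only on quantities already defined and can be sketched briefly. The function names and the convention of returning the start of sustain when no local minimum is found are assumptions made for illustration; the weakest effort method used for the sustain and release boundaries is not covered here.

```python
import numpy as np

def end_of_attack(centroid, onset, sustain_start):
    """Index of the first local minimum of the spectral centroid that lies
    between the onset and the start of sustain.

    If no local minimum exists, sustain_start is returned so that the two
    boundaries coincide (an assumed convention for this sketch).
    """
    for i in range(onset + 1, sustain_start):
        if centroid[i] < centroid[i - 1] and centroid[i] <= centroid[i + 1]:
            return i
    return sustain_start

def find_offset(envelope, onset):
    """Last index at which the envelope energy (amplitude squared) is at
    least the energy at the onset."""
    energy = np.asarray(envelope, dtype=float) ** 2
    candidates = np.nonzero(energy >= energy[onset])[0]
    return int(candidates[-1])
```

For example, with `centroid = [5, 4, 3, 2, 3, 4, 4, 4]`, an onset at index 0 and the start of sustain at index 6, `end_of_attack` returns 3.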
Notably, they allow the same point to define the boundary of two distinct contiguous regions. This signifies that the region is too short to be detected as a separate segment and makes the model more robust in dealing with different types of sounds. A plot of a clarinet sample and the boundaries detected by our implementation of this segmentation method is shown in Figure 5.3. From a visual inspection of the waveform, the attack and sustain sections appear to be well detected. There are some differences between our implementation and the technique described here (discussed in more detail in Section 5.3), which partly account for the lack of precision in the identification of the onset and offset. The identification of the release region for this sample does not seem accurate, however.
Evaluation of the ACT model
Caetano et al. compare the performance of their automatic segmentation technique to that of the one described by Peeters [95]. They do this by visual inspection of plots of the waveform, spectrogram and detected boundaries produced by both methods, showing 16 analysed samples consisting of isolated tones from western orchestral instruments (plus the acoustic guitar). They found that their model outperformed the Peeters method in all cases, although for one sample (a marimba recording) the amplitude envelope and spectral centroid do not behave in the manner that is assumed by the model and so neither method gives good results. Nevertheless, these results provide strong evidence that the ACT model assumptions can be applied to a wide variety of sounds, and show that using a combination of the amplitude envelope and the spectral centroid can lead to more accurate note segmentation than methods based on the amplitude envelope alone.

Figure 5.3: A clarinet sample and the boundaries (vertical dashed lines) detected by our implementation of the automatic segmentation technique proposed by Caetano et al. [19]. From left to right, they are the onset, end of attack, start of sustain, start of release and offset.
However, the automatic segmentation technique proposed by Caetano et al. cannot be used to improve the performance of real-time synthesis by analysis systems, as the method for detecting the start of sustain and start of release boundaries requires knowledge of future signal values. The spectral centroid has been shown to be a useful indirect indicator of the extent of the attack region. However, in order to help reduce synthesis artifacts in real-time sinusoidal modelling systems, it is desirable to have a more accurate and direct measure of the attack transient duration, obtained by locating signal regions in which the spectral components are changing rapidly or unpredictably. Both of these issues are addressed by the new segmentation model that is proposed in Section 5.2.