
8.2 ML parameters from music

8.2.1 ML parameter estimates from music convolved with simulated room impulse responses, 1 kHz octave band results

A music signal is convolved with each of the 100 simulated room impulse responses (RIRs) in the database. The ML estimation algorithm (method c) is applied to each case and the acoustic parameters are estimated. As observed with speech signals, the real room responses produced results similar to those from comparable artificial responses. However, because the simulated RIR database covers a wider range of acoustic parameters, only the simulated results are presented here; the real room results are presented in Appendix F.3.
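The convolution step described above can be sketched as follows. This is only an illustration: the exponentially decaying Gaussian noise used for `synthetic_rir` is an assumed stand-in for the simulated RIR database, the two clicks in `dry` stand in for anechoic music, and all function names and parameter values are hypothetical.

```python
import math
import random


def synthetic_rir(rt60, fs, length_s, seed=0):
    """Assumed RIR model: Gaussian noise whose level falls 60 dB in rt60 seconds."""
    rng = random.Random(seed)
    # Amplitude envelope exp(-k t) chosen so that 20*log10(exp(-k*rt60)) = -60 dB.
    k = 3.0 * math.log(10.0) / rt60
    n = int(length_s * fs)
    return [rng.gauss(0.0, 1.0) * math.exp(-k * i / fs) for i in range(n)]


def convolve(x, h):
    """Direct-form linear convolution (adequate for short illustrative signals)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # skip silent samples of the dry signal
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


fs = 1000                      # illustrative sample rate in Hz
dry = [0.0] * 200              # hypothetical dry signal: two isolated clicks
dry[0] = 1.0
dry[100] = 0.5

# One reverberant signal per (hypothetical) room in the database.
rirs = {rt: synthetic_rir(rt, fs, length_s=0.5, seed=7) for rt in (0.2, 0.4)}
reverberant = {rt: convolve(dry, h) for rt, h in rirs.items()}
```

Each entry of `reverberant` would then be passed to the estimation algorithm, with the known `rt` key serving as the reference value.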

The resulting estimation accuracy for Rt, using eight five-minute segments of music, is depicted in Figure 8-8. These results show accuracy comparable to that achieved using speech, with a tendency to underestimate in a small number of cases. As for speech, there is a small tendency for overestimation at low Rts. This is because the rate of decay of the reverberation becomes comparable to the release of the musical note (the rate of decay of the anechoic note).

Chapter 8 : Applications of the Maximum Likelihood Estimation Method 162

The underestimation of a number of the results, relative to speech, is due to the non-broadband excitation provided by music. Music excites only certain portions of the frequency band, depending on the particular notes being played. Because of this increased stochastic variability in the excitation, the ML estimated decay phases also show greater stochastic variability than those estimated from speech. The ML algorithm chooses the fastest decaying phases and therefore selects estimates from the lower bound of this variability; this is the cause of the underestimation bias. This phenomenon was also encountered when the music spectrum filter was applied to the simulated RIRs in Section 6.8.2, but the error is more pronounced here because the decay phases are often the result of single note excitation and the spectrum therefore deviates even further from the desired broadband response.
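The link between variability and underestimation can be illustrated with a toy simulation. It is not the ML algorithm itself: it assumes that each decay phase yields a Gaussian-distributed estimate around the true value, that the fastest (smallest) estimate in each segment is kept, and that the median is taken over segments. The spread values and phase counts are arbitrary illustrative choices.

```python
import random
import statistics


def segment_estimate(true_rt, spread, phases_per_segment, rng):
    """Keep the fastest (smallest) of several noisy per-phase decay estimates."""
    return min(
        rng.gauss(true_rt, spread * true_rt) for _ in range(phases_per_segment)
    )


def median_over_segments(true_rt, spread, n_segments, phases_per_segment, seed=1):
    """Median of the per-segment selections, as a stand-in for the final estimate."""
    rng = random.Random(seed)
    return statistics.median(
        segment_estimate(true_rt, spread, phases_per_segment, rng)
        for _ in range(n_segments)
    )


# Assumed relative spreads: small for speech-like, larger for music-like excitation.
speech_like = median_over_segments(1.0, spread=0.02, n_segments=8, phases_per_segment=50)
music_like = median_over_segments(1.0, spread=0.10, n_segments=8, phases_per_segment=50)
```

Under these assumptions both values fall below the true value of 1.0 s, and `music_like` falls further below it than `speech_like`, because selecting the minimum samples the lower tail of a wider distribution.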

Figure 8-8. Comparison of estimated and true reverberation time. Estimates were obtained from the application of the ML method to simulated impulse responses convolved with 40 minutes of anechoic music windowed into eight five-minute segments.

By decreasing the segment length (and therefore increasing the number of averages used to calculate the decay curve), the number of selected decay phases from which the median is estimated is increased. Assuming that the spectrum of each decay phase varies randomly, computing the response from more decay phases makes the result more representative of a broadband response, and the tendency for underestimation is reduced. This is shown in Figure 8-9, where 20 two-minute segments are used.

Figure 8-9 also highlights another problem: the shorter segment length increases the number of averages when computing the decay estimate, but places less constraint on the algorithm to select the cleanest possible decay. The overestimation at short Rts has therefore increased, as more sub-optimal decay phases containing significant residual tails of musical notes are used in the averaging. This overestimation is even more problematic for the other early reflection based parameters (EDT, C80, etc.), as the increased residual tails of musical notes bias the early part of the response. The solution to this problem is to increase the overall length of the recording to increase the number of averages, or to use music which has more pauses after short notes suitable for ML estimation.

Figure 8-9. Comparison of estimated and true reverberation time. Estimates were obtained from the application of the ML method to simulated impulse responses convolved with 40 mins of anechoic music windowed into 20 two-minute segments, results presented.

Figure 8-10 shows the EDT results. There is a tendency to overestimate EDT, due to two factors: the decay of the musical notes being included with the decay of the room in the ML estimation, and the positive bias that is introduced when the excitation is not impulsive (as explained for speech signals). Once again the EDT accuracy is below that for Rt. This is because the EDT is very sensitive to changes in the early part of the response, and the early part is more heavily influenced by persisting tails of musical notes and by the ambiguity between impulsive and sustained excitation.
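To make concrete why EDT depends only on the early part of the response, the sketch below computes EDT and a reverberation time from a backward-integrated (Schroeder) decay curve. The evaluation ranges are assumptions taken from common practice (0 to -10 dB for EDT, -5 to -35 dB for the reverberation time); the exact ranges used for Rt in this work are not restated here. The function names and the ideal exponential test decay are illustrative.

```python
import math


def schroeder_db(energy):
    """Backward-integrated energy decay curve in dB, normalised to 0 dB at t = 0."""
    total = 0.0
    tail = []
    for e in reversed(energy):
        total += e
        tail.append(total)
    tail.reverse()
    return [10.0 * math.log10(v / tail[0]) for v in tail]


def decay_time(edc_db, fs, start_db, stop_db):
    """Least-squares line fit between start_db and stop_db, extrapolated to 60 dB."""
    pts = [(i / fs, d) for i, d in enumerate(edc_db) if stop_db <= d <= start_db]
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    slope = sum((t - mean_t) * (d - mean_d) for t, d in pts) / sum(
        (t - mean_t) ** 2 for t, _ in pts
    )
    return -60.0 / slope


fs = 1000
rt_true = 0.8
k = 6.0 * math.log(10.0) / rt_true          # energy falls 60 dB in rt_true seconds
energy = [math.exp(-k * i / fs) for i in range(int(3 * rt_true * fs))]

edc = schroeder_db(energy)
edt = decay_time(edc, fs, 0.0, -10.0)       # uses only the first 10 dB of decay
rt30 = decay_time(edc, fs, -5.0, -35.0)     # assumed evaluation range for Rt
```

For this ideal single-slope decay both values equal the true decay time. Any extra energy in the first 10 dB of the curve, such as a persisting note tail, alters `edt` directly, whereas `rt30` is fitted over a later and wider range.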


Figure 8-10. Comparison of estimated and true EDT. Estimates were obtained from the application of the ML method to simulated impulse responses convolved with 40 mins of anechoic music windowed into eight five-minute segments.

Section 8.1 demonstrated that, for speech, increasing the segment length improves the prospects of finding a clean decay phase. A disadvantage of a longer segment length is that, with a fixed length of recorded audio, the number of averages decreases. Figure 8-11 shows that increasing the segment length to 10 minutes (thereby increasing the chance of finding ‘cleaner’ decay phases) removes the overestimation trend. However, Figure 8-11 also shows a significant variation in the parameter estimates, which is due to the small number of decay curves used in computing the median (only four segments).
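The trade-off between segment length and the number of averages can be illustrated numerically. The segment counts follow directly from the 40 minute recording; the variability model is an assumption (independent, unit-variance Gaussian errors per segment) and is intended only to show how the scatter of a median shrinks as more segments are used.

```python
import random
import statistics


def n_segments(total_min, segment_min):
    """Number of whole segments available in a fixed-length recording."""
    return total_min // segment_min


def median_spread(n_seg, trials=2000, seed=3):
    """Standard deviation of the median of n_seg unit-variance Gaussian errors."""
    rng = random.Random(seed)
    medians = [
        statistics.median(rng.gauss(0.0, 1.0) for _ in range(n_seg))
        for _ in range(trials)
    ]
    return statistics.pstdev(medians)


# Segment lengths in minutes mapped to the number of segments in 40 minutes.
counts = {seg: n_segments(40, seg) for seg in (10, 5, 2)}
spreads = {n: median_spread(n) for n in counts.values()}
```

With four segments the median scatters noticeably more than with eight or 20, which is consistent with the variation seen when ten-minute segments are used.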


Figure 8-11. Comparison of estimated and true EDT. Estimates were obtained from the application of the ML method to simulated impulse responses convolved with 40 minutes of anechoic music windowed into four ten-minute segments.

Figure 8-12 and Figure 8-13 show the results for C80 and ts using four ten-minute segments. The underestimation at high C80 and low ts values is similar to that seen with speech, and is thought to arise because the decay of the anechoic musical notes causes inaccuracies in the estimates. Improving the accuracy for EDT, C80 and ts requires a longer signal. The variation in the estimates is due to the small number of decay phases used to compute the estimate and the highly stochastic nature of the source signal. It is postulated that a recording of two hours would be sufficient to ensure plenty of suitable decay phases.
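For reference, C80 and ts can be computed from the squared impulse response as in the sketch below. It assumes the usual definitions (C80 as the ratio in dB of the energy arriving before 80 ms to the energy arriving after it, and ts as the energy-weighted mean arrival time); the function names and the ideal exponential test decay are illustrative.

```python
import math


def clarity_c80(energy, fs):
    """Early-to-late energy ratio in dB with the split at 80 ms."""
    split = int(round(0.080 * fs))
    early = sum(energy[:split])
    late = sum(energy[split:])
    return 10.0 * math.log10(early / late)


def centre_time(energy, fs):
    """Energy-weighted mean arrival time in seconds."""
    weighted = sum(i / fs * e for i, e in enumerate(energy))
    return weighted / sum(energy)


fs = 8000
rt_true = 1.0
k = 6.0 * math.log(10.0) / rt_true          # energy falls 60 dB in rt_true seconds
energy = [math.exp(-k * i / fs) for i in range(int(3 * rt_true * fs))]

c80 = clarity_c80(energy, fs)
ts = centre_time(energy, fs)
```

Both quantities are dominated by the first tens of milliseconds of the response, so residual note energy in that region shifts them in the same way that it shifts EDT.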


Figure 8-12. Comparison of estimated and true C80. Estimates were obtained from the application of the ML method to simulated impulse responses convolved with 40 mins of anechoic music windowed into 10 minute segments.

Figure 8-13. Comparison of estimated and true ts. Estimates were obtained from the application of the ML method to simulated impulse responses convolved with 40 mins of anechoic music windowed into 10 minute segments.

In document Blind estimation of room acoustic parameters from speech and music signals (Page 183-188)