
8.1 ML Acoustic parameter estimates from speech

8.1.1 ML Parameter estimates from speech convolved with simulated room impulse responses, 1 kHz octave band results

Nine minutes of reverberated speech were divided, using rectangular windows with no overlap, into 1½-minute segments. This produces seven decay curve estimates, one per segment, each calculated as the minimum energy decay curve over that segment (method c). The median decay curve was then calculated from these seven estimates. The optimal segment length for a signal (1½ minutes in this case) depends on how often the speech contains a gap suitable for a good estimate of the reverberant decay. Ideal decay phases have a gap long enough to reveal the room decay and are preceded by an impulsive sound.
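The segment-and-median step described above can be sketched as follows. This is a minimal illustration, not the implementation used for the results: the function names `split_into_segments` and `median_decay_curve`, the toy sampling rate and the toy decay curves are all assumptions, and the per-segment decay curve estimation itself (method c) is not reproduced here.

```python
import numpy as np

def split_into_segments(x, fs, segment_seconds):
    """Cut x into equal, non-overlapping rectangular windows (one per row).

    Trailing samples that do not fill a whole segment are dropped.
    """
    seg_len = int(round(segment_seconds * fs))
    n_segments = len(x) // seg_len
    return np.asarray(x)[: n_segments * seg_len].reshape(n_segments, seg_len)

def median_decay_curve(curves_db):
    """Pointwise median over per-segment decay curves (one curve per row, in dB)."""
    return np.median(np.asarray(curves_db, dtype=float), axis=0)

# Toy signal (assumption): 65 s at a hypothetical rate of 10 Hz, 20 s segments.
fs = 10
x = np.arange(65 * fs, dtype=float)
segments = split_into_segments(x, fs, segment_seconds=20.0)

# Toy per-segment decay curves (assumption): four near-identical estimates
# and one contaminated estimate that decays too slowly.
t = np.linspace(0.0, 1.0, 5)
clean = -60.0 * t
curves = np.stack([clean, clean - 1.0, clean + 1.0, clean + 20.0 * t, clean])
combined = median_decay_curve(curves)
```

Taking the median rather than the mean means that a single segment without a clean decay phase does not pull the combined curve towards a slower decay.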

The results shown in Figure 8-1 to Figure 8-6 were generated using speech convolved with 100 simulated impulse responses. The impulse responses were chosen at random from the database of responses generated as described in Chapter 4. The figures show parameter estimates extracted from the reverberated speech signals, plotted against the actual parameters calculated directly from the impulse responses. The difference limens for each parameter are shown as dotted lines.
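For reference, the "actual" Rt and EDT of an impulse response are conventionally obtained from the backward-integrated (Schroeder) decay curve. The sketch below assumes the usual ISO 3382 evaluation ranges (0 to -10 dB for EDT, -5 to -35 dB for Rt); the source does not state which ranges were used, and the function names and the synthetic exponential response are illustrative only.

```python
import numpy as np

def schroeder_decay_db(h):
    """Backward-integrated energy decay curve of h, in dB re its initial value."""
    energy = np.asarray(h, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]
    return 10.0 * np.log10(edc / edc[0])

def decay_time(edc_db, fs, upper_db, lower_db):
    """Time for a 60 dB decay, extrapolated from a straight-line fit to the
    decay curve between upper_db and lower_db."""
    t = np.arange(len(edc_db)) / fs
    mask = (edc_db <= upper_db) & (edc_db >= lower_db)
    slope, _intercept = np.polyfit(t[mask], edc_db[mask], 1)
    return -60.0 / slope

# Synthetic response (assumption): a noise-free exponential envelope whose
# energy falls by 60 dB in 0.5 s.
fs = 1000
t = np.arange(2 * fs) / fs
h = 10.0 ** (-3.0 * t / 0.5)

edc_db = schroeder_decay_db(h)
rt = decay_time(edc_db, fs, upper_db=-5.0, lower_db=-35.0)   # assumed Rt range
edt = decay_time(edc_db, fs, upper_db=0.0, lower_db=-10.0)   # assumed EDT range
```

For a purely exponential decay the two values coincide; they differ when the early and late parts of the decay have different slopes.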

Figure 8-1 shows that excellent accuracy can be achieved over a wide range of Rts. Most of the estimates are within the subjective difference limens. For reverberation times below about 0.4 s some overestimation of the Rt occurs. This happens because the ML method encounters the minimum decay time of the speech utterances, which places a lower limit on the Rt estimate; below this limit, overestimation is observed. The error also increases at long Rts, exhibiting a slight positive bias. As the Rt increases, the number of decay phases providing sufficient dynamic range decreases, so the estimate is calculated from a smaller sample and is more prone to error. In this case the overestimation arises because the only available decays with sufficient dynamic range contain noise, which slows the apparent rate of decay.
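The proportion of estimates falling within a difference limen can be quantified as in the sketch below. The limen values used (5 % relative for Rt, 1 dB absolute for C80) are commonly quoted figures and are an assumption here, since this section does not state them; the numbers in the arrays are made-up illustrative values, not results from the figures.

```python
import numpy as np

def fraction_within_limen(true, est, limen, relative):
    """Fraction of estimates whose error is no larger than the difference limen.

    relative=True  : limen is a fraction of the true value (e.g. 0.05 for 5 %).
    relative=False : limen is an absolute tolerance in the parameter's units.
    """
    true = np.asarray(true, dtype=float)
    est = np.asarray(est, dtype=float)
    tol = limen * np.abs(true) if relative else limen
    return float(np.mean(np.abs(est - true) <= tol))

# Illustrative (made-up) reverberation times in seconds.
rt_true = [0.3, 0.5, 1.0, 2.0]
rt_est = [0.40, 0.51, 1.04, 2.2]
rt_fraction = fraction_within_limen(rt_true, rt_est, limen=0.05, relative=True)

# Illustrative (made-up) clarity values in dB.
c80_true = [2.0, 5.0, 10.0]
c80_est = [2.5, 5.8, 12.0]
c80_fraction = fraction_within_limen(c80_true, c80_est, limen=1.0, relative=False)
```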

Overall this shows very promising results for reverberation time measurement using speech signals. Appendix F.1 shows the same results for real room impulse responses and these are also in good agreement with the simulated results.

Chapter 8 : Applications of the Maximum Likelihood Estimation Method 153

Figure 8-1. Rt estimated using ML method plotted against the true Rt for 100 simulated impulse responses and speech.

Figure 8-2 shows the EDT estimated by the ML method plotted against the true EDT. The EDT estimates are less accurate than those for Rt, and a number of factors contribute to this.

Firstly, Figure 7-12 shows that even when ML estimation is performed directly on impulse responses, the EDT accuracy is significantly lower than the Rt accuracy. As previously mentioned, this is due to the limited complexity of the model. The model is less valid in the early region because the early-order reflections are non-random, whereas the later reflections are more appropriately modelled as Gaussian noise.

Secondly, for EDTs above 2 s there is a tendency towards overestimation. This can be explained by comparing the estimated EDT values with the true Rt values: the EDT estimates generally lie somewhere between the true EDT and the true Rt. Late reflections from previous sounds or utterances may be present in the decay phases and, if the early decay is faster than the late decay, these reflections will mask the early decay rate. Rt is greater than EDT in many rooms, and this is certainly true for many of the simulated RIRs.


Figure 8-2. EDT estimated using ML method plotted against the true EDT for 100 simulated impulse responses and speech.

Each decay phase is the result of neither a purely impulsive excitation nor a sustained signal being switched off; it lies somewhere between the two. (Sustained means that the signal is present long enough for the level to become constant before it is switched off and the decay curve is recorded, as in the interrupted noise method.) The algorithm used (method c, Section 7.4.3) searches for decay phases resulting from impulsive-like rather than sustained excitation. The accuracy of the recorded decay curve depends on the availability of impulsive-like excitations and on how different the early and late decays are.

Figure 8-3 compares decay curves from impulsive and sustained excitation. The early part of the decay from the sustained excitation does not fall as quickly as when the room is excited impulsively, because slow late reverberation masks the quickly decaying early reverberation. In most RIRs the late reverberation decays more slowly than the early reverberation, which skews the EDT towards the Rt value. As all the decay phases result from excitation that is in between impulsive and sustained, this bias is prevalent, especially when the decay is non-uniform.
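The masking effect can be reproduced numerically. In the sketch below the decay after a sustained (interrupted) excitation is taken to be the backward integral of the squared impulse response, which is the standard Schroeder relation, while the impulsive decay is the squared envelope itself. The two-slope energy envelope and its parameters (`T_EARLY`, `T_LATE`, `LATE_WEIGHT`) are assumptions chosen for illustration, not values from the simulated database.

```python
import numpy as np

fs = 1000
t = np.arange(3 * fs) / fs

# Assumed two-slope energy envelope: a fast early decay plus a weak,
# slowly decaying late component.
T_EARLY, T_LATE, LATE_WEIGHT = 0.3, 1.5, 0.05
energy = 10.0 ** (-6.0 * t / T_EARLY) + LATE_WEIGHT * 10.0 ** (-6.0 * t / T_LATE)

# Impulsive excitation: the decay follows the energy envelope directly.
impulsive_db = 10.0 * np.log10(energy / energy[0])

# Sustained excitation switched off at t = 0: the decay follows the
# backward-integrated energy (Schroeder integration).
tail = np.cumsum(energy[::-1])[::-1]
sustained_db = 10.0 * np.log10(tail / tail[0])

def time_to_fall(curve_db, fs, drop_db=10.0):
    """Time at which a monotonically decaying curve first reaches -drop_db."""
    return int(np.argmax(curve_db <= -drop_db)) / fs

t10_impulsive = time_to_fall(impulsive_db, fs)
t10_sustained = time_to_fall(sustained_db, fs)
```

Integration weights each exponential component by its decay time, so the slow late component contributes proportionally more to the sustained decay and the first 10 dB takes longer to fall than in the impulsive case.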


Figure 8-3. Comparison of the envelope of the reverberant decay when the RIR is excited using impulsive excitation and when the excitation is a sustained noise source switched off after a long period of time.

The algorithm automatically searches for the ‘cleanest’ decay phases; however, if a segment of signal contains no suitable decay, the overall accuracy of the method is limited. A longer segment makes it more likely that the algorithm will find a clean decay and so can improve the accuracy. Figure 8-4 shows the EDT estimates obtained using longer signal segments (3 minutes). The overestimation at longer decay times is reduced, but the overall accuracy is roughly two difference limens. This accuracy can be increased by using longer recordings and increasing the number of segments.


Figure 8-4. EDT estimated using ML method plotted against the true EDT. 100 RIRs were chosen at random from a database. Each impulse response was convolved with 9 minutes of speech. For the ML estimation, the reverberated signal was windowed into three 3-minute segments.

Figure 8-5 and Figure 8-6 compare the true and estimated values of clarity (C80) and centre time (ts). The estimated values were obtained from reverberated speech signals, but using 3-minute segments of speech (rather than 1½-minute segments). These results show trends similar to those seen for EDT and Rt. When C80 is large, i.e. in spaces with low reverberation times or a high direct-to-reverberant ratio, the values are underestimated. Once again, the natural decay of the speech utterances prevents accurate estimation when the room decays are short. Centre time estimates are accurate above 0.03 s; the overestimation at lower ts values is due to the reverberant decay time being shorter than or comparable to that of the fastest-decaying speech utterance. The clarity otherwise shows a trend towards overestimation. This is due to the ambiguity between the late and early decay rates, where the late reverberation masks the early reflections. This causes an increase in the energy in the first 80 ms when compared with the later energy, and therefore an increase in the clarity index.
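For reference, clarity and centre time follow from the squared impulse response using the standard definitions: C80 is the ratio in dB of the energy arriving in the first 80 ms to the energy arriving afterwards, and ts is the energy-weighted mean arrival time. The sketch below is illustrative only; the function names and the synthetic exponential responses are assumptions, and time zero is taken to be the first sample.

```python
import numpy as np

def clarity_c80(h, fs):
    """C80 in dB: energy in the first 80 ms over the energy after 80 ms."""
    energy = np.asarray(h, dtype=float) ** 2
    n80 = int(round(0.080 * fs))
    return float(10.0 * np.log10(energy[:n80].sum() / energy[n80:].sum()))

def centre_time(h, fs):
    """Centre time in seconds: energy-weighted mean arrival time."""
    energy = np.asarray(h, dtype=float) ** 2
    t = np.arange(len(energy)) / fs
    return float((t * energy).sum() / energy.sum())

def exponential_response(rt, fs, seconds=3.0):
    """Noise-free envelope whose energy falls by 60 dB in rt seconds (assumption)."""
    t = np.arange(int(seconds * fs)) / fs
    return 10.0 ** (-3.0 * t / rt)

fs = 8000
h_long = exponential_response(1.0, fs)    # reverberant space
h_short = exponential_response(0.3, fs)   # drier space

c80_long, ts_long = clarity_c80(h_long, fs), centre_time(h_long, fs)
c80_short, ts_short = clarity_c80(h_short, fs), centre_time(h_short, fs)
```

For an exponential decay, a shorter reverberation time gives a larger C80 and a smaller ts, which is the region where the text above reports that the decay of the speech itself limits the estimates.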


Figure 8-5. C80 estimated using ML method plotted against the true C80. 100 impulse responses were chosen at random from a database. Each impulse response was convolved with 9 minutes of speech. For the ML estimation, the reverberated signal was windowed into three 3-minute segments.

Figure 8-6. ts estimated using ML method plotted against the true ts. 100 impulse responses were chosen at random from a database. Each impulse response was convolved with 9 minutes of speech. For the ML estimation, the reverberated signal was windowed into three 3-minute segments.


In document Blind estimation of room acoustic parameters from speech and music signals (Page 174-180)