Chapter 3. Overlapped-Speech Detection based-on Stochastic Properties
3.4 Experiments
To investigate its actual abilities, the algorithm has been tested step by step, taking most of the possibilities into account. 24 speakers take part in the required conversations: 10 females and 14 males. Two of the speakers are taken from the TIMIT standard library for speech and audio DSP [20]; one of them is female and the other is male (the F and the M used in the description text).
The rest of the speakers are narrators, chosen arbitrarily from well-known audiobook websites. The narrators of these books are famous (e.g. Dick Estell) and have deterministic and well-defined speech. For each of these speech databases, the long periods of silence are removed, but the short periods of silence are kept because the RASTA-PLP requires such short silences. A long period of silence could cause the following conflict with the algorithm. When two speakers are talking simultaneously (mixture speech), if one of them talks continuously but the other pauses and remains silent for a long period, the speech is a mixture at some times but a dialogue at other times. The period of silence should therefore be shorter than the longest neglected group.
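The silence rule described above can be sketched as follows. This is a minimal illustration in Python, not the MATLAB code used in this work; the per-frame silence flags and the threshold `max_silent_frames` are hypothetical inputs, and removing a long silent run completely (rather than shortening it) is an assumption.

```python
def drop_long_silences(is_silent, max_silent_frames):
    """Return the indices of the frames to keep.

    Voiced frames are always kept. A run of consecutive silent frames is kept
    only if it is no longer than max_silent_frames (a hypothetical threshold);
    longer runs are dropped completely (an assumption).
    """
    keep = []
    run = []  # indices of the current run of silent frames
    for i, silent in enumerate(is_silent):
        if silent:
            run.append(i)
        else:
            if len(run) <= max_silent_frames:
                keep.extend(run)  # short silence: keep it
            run = []
            keep.append(i)
    if len(run) <= max_silent_frames:
        keep.extend(run)  # trailing short silence
    return keep


flags = [False, True, True, False, True, True, True, True, False]
kept = drop_long_silences(flags, max_silent_frames=2)
```

In this example the two-frame silence is kept and the four-frame silence is removed.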
Figure 3.14 Flowchart of Chapter 3 algorithm. The steps shown in the flowchart are:
1. Input the spontaneous conversation speech signal.
2. Divide the speech signal into overlapping-window frames; each frame is 32 ms with 22 ms overlapping (10 ms hopping time); the number of frames is Nf.
3. Extract 13 features from each frame; the features are arranged as a [13-by-Nf] array.
4. By the k-means, cluster the array into 3 clusters: F, M and FM. The output is the labels-vector of Nf elements: {F, M, FM}.
5. Collect the elements in the 32 orders of 0.1-s (10 frames), taking their PDFs and their variances; for each order, the variances reside in one row of Nf/10 variances.
6. Arrange the 32 orders to produce the [32-by-(Nf/10)] V-array.
7. Cluster each order (row) into 2 clusters, high variances and low variances, to produce the Decision-array.
8. Define a concept of the PR recognition Goodness by using the relationship between the overlapped area and the distance between the centroids, the spread of the patterns (their variances) and the durations.
9. According to the Goodness formula, define the useful and the harmful groups by clustering the 32 orders into 2 clusters.
10. By including the useful and rejecting the harmful ordered groups, cluster the Decision-array into correct and false decisions. Neglect the false decisions and adopt the correct decisions.
11. For the adopted correct decisions, rectify the fragmented wrongs by the Hierarchical Clustering Scenarios. Mask the undesired output decisions.
12. Output the segregated speech signals.
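The grouping and variance steps of the flowchart in Figure 3.14 can be illustrated for a single row. This is a minimal sketch in Python under stated assumptions: the labels are treated as the numbers 1, 2 and 3, the groups are non-overlapping blocks of 10 frames, and the high/low split is done by a simple one-dimensional two-cluster k-means. The function names are hypothetical, and how the 32 orders are formed is not covered here.

```python
from statistics import pvariance


def group_variances(labels, group_size=10):
    """Variance of the frame labels inside each non-overlapping group."""
    n_groups = len(labels) // group_size
    return [
        pvariance(labels[g * group_size:(g + 1) * group_size])
        for g in range(n_groups)
    ]


def split_high_low(values, iterations=20):
    """One-dimensional two-cluster k-means.

    Returns True for each value assigned to the high-variance cluster.
    """
    lo, hi = min(values), max(values)  # initial centroids (an assumption)
    for _ in range(iterations):
        high = [v for v in values if abs(v - hi) < abs(v - lo)]
        low = [v for v in values if abs(v - hi) >= abs(v - lo)]
        if high:
            hi = sum(high) / len(high)
        if low:
            lo = sum(low) / len(low)
    return [abs(v - hi) < abs(v - lo) for v in values]


# Three groups of 10 frames: steady, fluctuating, steady.
labels = [1] * 10 + [1, 3] * 5 + [2] * 10
row = group_variances(labels)
decisions = split_high_low(row)
```

A group whose labels keep changing gets a high variance, while a group with one steady label gets a variance of zero.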
Because the standard PCM format stores the speech without any compression or expansion, the database is stored as PCM files (FileName.wav). For cross-checking, sampling rates of 16000 samples/s and 8000 samples/s are used. The original resolution of each file is 32 bit/sample, which has been reduced to the accepted resolution of 16 bit/sample.
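The reduction of the resolution can be sketched as below. This is only an illustration in Python and it assumes signed 32-bit integer PCM samples; if the 32-bit files hold floating-point samples, a scaling step would be needed instead. The function name is hypothetical.

```python
def to_16_bit(samples_32):
    """Requantise signed 32-bit integer PCM samples to signed 16-bit.

    The 16 least-significant bits are dropped by an arithmetic right shift
    (assumption: integer PCM, no dithering).
    """
    return [s >> 16 for s in samples_32]


reduced = to_16_bit([2**31 - 1, -2**31, 65536, 0])
```

The full 32-bit range maps onto the full 16-bit range, from -32768 to 32767.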
All the conversations have the same sequence of segments: F, FM, M, F, M, FM, F, FM, M then FM. The F is the speech of the female alone (the dialogue speech), the M is the speech of the male alone (the dialogue speech) and the FM is the speech of the female and the male simultaneously (the mixture speech). The period of each segment is 30 s, so each conversation is 10 times 30 s, i.e. 300 s (five minutes).
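The segment sequence above fixes the reference label at every instant of a conversation. The following Python sketch builds such a reference with one label per 10 ms hop; the per-hop resolution and the names used are assumptions made for illustration only.

```python
SEQUENCE = ["F", "FM", "M", "F", "M", "FM", "F", "FM", "M", "FM"]
SEGMENT_SECONDS = 30


def reference_labels(hop_seconds=0.010):
    """One reference label per hop for the whole conversation."""
    hops_per_segment = round(SEGMENT_SECONDS / hop_seconds)
    labels = []
    for name in SEQUENCE:
        labels.extend([name] * hops_per_segment)
    return labels


reference = reference_labels()
total_seconds = len(SEQUENCE) * SEGMENT_SECONDS
```

The sequence holds three F segments, three M segments and four FM segments, so 120 s of each 300 s conversation are mixture speech.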
In order to avoid any power mismatch between the speakers during the processing, the segments have been power-normalised by comparing them with the standard energy of the segments of the F speech recording (the female of the TIMIT [20]).
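One common way to carry out such a normalisation is to scale each segment so that its mean power matches that of the reference segment. The Python sketch below shows this idea; it is an assumption about the procedure, not the exact method used in this work, and the function names are hypothetical.

```python
import math


def mean_power(samples):
    """Mean of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def normalise_power(segment, reference):
    """Scale segment so that its mean power equals that of reference."""
    p_segment = mean_power(segment)
    if p_segment == 0:
        raise ValueError("cannot normalise an all-zero segment")
    gain = math.sqrt(mean_power(reference) / p_segment)
    return [gain * s for s in segment]


reference = [0.5, -0.5, 0.5, -0.5]   # stands in for the TIMIT F segment
segment = [0.1, -0.2, 0.1, -0.2]     # stands in for another speaker
scaled = normalise_power(segment, reference)
```

Because all segments have the same duration of 30 s, matching the mean power is equivalent to matching the energy.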
For the audio files, the Audacity environment has been used throughout: for the preparation of the files, for the quick subjective tests during the running of the executable files, and for the final testing and playing of the resulting speech files. For the ASCII text, Notepad++ has been used for editing the source (FileName.m) files and the other data. For the tabular data, MS-Excel has been used to arrange the required database tables. The main speech-DSP is implemented in the MATLAB environment, which offers extensive DSP and statistics facilities.
After the preparation of the required files, the implementation of the algorithm starts by writing the SourceFile.m for the MATLAB DSP environment. To execute the first step of the algorithm, the required FileName.wav speech file is read from its physical location, i.e. the conversation speech is fetched. The conversation speech signal is divided into frames by a 32 ms window with a hop of 10 ms. These frames are processed by the RASTA-PLP technique to produce 13 features per frame. The total number of frames for the conversation is Nf. The output of the RASTA-PLP implementation is the feature array of [13-by-Nf] elements; see (b)/Figure 3.7. Initially, these features have been investigated by clustering them into 2 clusters, the dialogue speech and the mixture speech, using the k-means clustering algorithm. The output is a vector of Nf elements, each of which is either 1 or 2. For cross-checking, the investigation is repeated by clustering the features-array into 3 clusters: the F speech, the FM speech and the M speech. The output is the K-vector of Nf elements, each of which is either 1, 2 or 3; see (e)/Figure 3.7. These two vectors have been plotted in (c) and (d)/Figure 3.7. The above procedure is then carried out, in sequence, for all the 300 FileName.wav files.
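The framing arithmetic and the clustering step can be illustrated with a short Python sketch. It is not the MATLAB implementation used in this work: the RASTA-PLP feature extraction is omitted, the feature vectors are stored one per frame (the transpose of the [13-by-Nf] array), and the deterministic initialisation of the centroids is an assumption made so that the example is repeatable. It also assumes at least k input vectors.

```python
def frame_count(n_samples, rate, frame_ms=32, hop_ms=10):
    """Number of complete frames Nf for the given window and hop."""
    frame_len = rate * frame_ms // 1000
    hop_len = rate * hop_ms // 1000
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50):
    """Plain k-means; returns one label in 1..k per input vector."""
    # Assumption: initial centroids are k vectors spread evenly over the data.
    step = len(vectors) // k
    centroids = [list(vectors[i * step]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [lab + 1 for lab in labels]


nf_16k = frame_count(300 * 16000, 16000)
nf_8k = frame_count(300 * 8000, 8000)

# Toy 2-D "features" forming three well-separated groups.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
       (20, 0), (20, 1), (21, 0)]
k_vector = kmeans(toy, k=3)
```

With these settings a 300 s conversation gives the same number of frames at both sampling rates, because the window and the hop are defined in milliseconds.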