k-means Clustering of the Features - An Algorithm of Overlapped-Speech Detection

Chapter 3. Overlapped-Speech Detection based-on Stochastic Properties

3.3 An Algorithm of Overlapped-Speech Detection

3.3.3 k-means Clustering of the Features

[Flow chart of the feature-extraction steps: input speech signal; Short Time Fourier Transform (STFT); square of the magnitude (|.|^2); integration of Bark-scale critical bands; equal-loudness pre-emphasis filter; RASTA filter; nonlinearity ((.)^0.33 power function); Inverse Short Time Fourier Transform (ISTFT) / Inverse Discrete Fourier Transform (IDFT); Linear Predictive Coding Coefficients (LPCC) by Levinson-Durbin recursion; cepstral recursion; mean normalization.]

The next step of the algorithm clusters the extracted features twice: once into 2 clusters and once into 3 clusters. In the 2-cluster case, the 1st cluster denotes the dialogue speech features and the 2nd cluster denotes the mixture speech features, so every element of the resulting label vector takes a value in the set {1, 2}. In the 3-cluster case, the 1st cluster denotes the M speech features, the 2nd cluster denotes the FM speech features and the 3rd cluster denotes the F speech features, so every element of the resulting label vector takes a value in the set {1, 2, 3}. The values in {1, 2} and {1, 2, 3} carry no numerical meaning; they are arbitrary labels for the clusters {dialogue, mixture} and {M, FM, F}.
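Because the label values are arbitrary, any comparison of a label vector against reference labels has to be invariant to relabelling. The following is a minimal sketch of such a comparison, assuming a reference labelling is available. The function name `best_label_agreement` and the toy data are illustrative assumptions, not part of the thesis. It tries every permutation of the labels and keeps the best agreement.

```python
from itertools import permutations

def best_label_agreement(k_vector, reference, k):
    """Fraction of frames matching `reference` under the best relabelling."""
    best = 0.0
    for perm in permutations(range(1, k + 1)):
        # Map cluster label i to the candidate reference label perm[i - 1].
        mapping = dict(zip(range(1, k + 1), perm))
        hits = sum(mapping[lab] == ref for lab, ref in zip(k_vector, reference))
        best = max(best, hits / len(reference))
    return best

# Hypothetical data: 9 frames, 3 clusters.
reference = [1, 1, 1, 2, 2, 2, 3, 3, 3]
k_vector = [3, 3, 1, 1, 1, 1, 2, 2, 3]
agreement = best_label_agreement(k_vector, reference, k=3)
```

For k = 2 or k = 3 the exhaustive search over permutations is cheap (2 or 6 cases), so no assignment solver is needed.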

Many algorithms and techniques exist for clustering data into a specific number of clusters, whether that number is known in advance or not. They follow different approaches, such as connectivity-based clustering (hierarchical clustering), centroid-based clustering, statistical distribution-based clustering and statistical density-based clustering [138].

The centroid-based clustering algorithms include k-means and k-medoids. The k-means technique is efficient and one of the simplest forms of cluster analysis, where k is the known number of required clusters. It finds the k means (centroids) of the clusters and assigns each input data point to the cluster whose centroid is nearest. Appendix C has been added to the thesis to expand the description.
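The centroid update and nearest-centroid assignment described above can be sketched as a Lloyd-style iteration. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the initial centroids are taken from evenly spaced input points to keep the example deterministic, whereas practical implementations usually use random or k-means++ seeding. The toy 2-D points stand in for the 13-coefficient feature frames.

```python
def kmeans(points, k, iterations=100):
    """Lloyd-style k-means; returns (labels, centroids) with labels in 1..k."""
    n = len(points)
    # Assumed deterministic seeding: k evenly spaced input points.
    centroids = [list(points[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            new_labels.append(dists.index(min(dists)) + 1)
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j + 1]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data: two well-separated groups of 2-D points.
frames = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
k_vector, centroids = kmeans(frames, k=2)
```

The returned `k_vector` holds one label per input point, which mirrors the role of the K-vector in the text.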

Appendix A of the thesis describes k-means: it gives a brief description of this well-known algorithm and a historical overview of its best-known implementations (e.g. Lloyd's algorithm) [139]. It also presents the main problems posed by these algorithms and the recent solutions that overcome them [140, 141].

The output of the k-means clustering is a vector of Nf elements (the K-vector), one label per frame. When the frames' features are clustered into dialogue or mixture speech, the elements of the K-vector are either 1 or 2 (Figure 3.7(c)). When they are clustered into M, FM or F speech, the elements are 1, 2 or 3 (Figure 3.7(d)). The figure shows that most of the labels are wrong and only a minority are correct. The subjective and objective tests of this step confirm that this clustering is poor and that the labels contain many errors, so the crude clustering on its own is not capable of proper detection. Rather than using it directly, the clustering can be improved, and the following paragraphs and steps present the proposed improvements. Figure 3.7, Figure 3.9, Table 3.2 and Table 3.3 compare this initial clustering step with the modified clustering.

[Figure 3.7 panels: (a) Input: M FM F M F FM M FM F FM. (b) RASTA-PLP features: 13 coefficients by Nf frames. (c) Clustering the RASTA-PLP features into 2 clusters: dialogue (high) or mixture (low). (d) Clustering the RASTA-PLP features into 3 clusters: F, M or FM. Lower line: Nf labelling elements (K-vector) from the clustering into F, M or FM.]

Figure 3.7 Audio feature extraction and initial crude clustering. (a) is the input spontaneous conversation. (b) is an array containing the [13-by-Nf] features extracted by RASTA-PLP. (c) is the clustering of the features into 2 clusters: label-1 for the mixture and label-2 for the dialogue. (d) is the clustering of the features into 3 clusters: label-1 for M, label-2 for F and label-3 for FM. The lower line is the K-vector, which contains the k-means clustering results of (d), i.e. into 3 clusters: M, FM or F. All the sketches share the same horizontal time axis.

The modification starts from the clustering of the features into M, FM or F labels, i.e. the elements of the K-vector are 1, 2 or 3. Suppose these are the labels of M, FM and F respectively; see Figure 3.7. Assume that the switching instants, from each speech segment to the following one, are known (this assumption is very important for the proposed modification). Each speech segment lasts 30 s. There are 10 speech segments of 3000 frames each, so F speech has 9000 frames (3 × 3000) and M speech has the same number, while FM speech has 12000 frames because it has 4 segments (4 × 3000).

The Probability Density Function (PDF) is calculated for each M, FM and F segment. The calculation counts the occurrences of 1, 2 and 3 among the corresponding values of the K-vector; these counts form the histogram of the segment, and the PDF of the segment is the per-unit normalization of that histogram. Statistically, the PDFs are discrete and finite; Figure 3.8 illustrates the PDFs of the 1st, the 2nd and the 3rd segments of the conversation. The variances of these 10 PDFs are easy to calculate, and the variance of the mixture speech (FM) differs widely from the variances of the dialogue speech (F or M): the mixture has the higher variance and the dialogue has the lower variance. Given that wide gap, the conclusion is that the above features can be clustered successfully with any well-known technique such as k-means or k-medoids. This clustering introduces the segregation task directly by using binary masks.

The above conclusion has been investigated on a sample of 24 arbitrary female and male speakers. Since each conversation includes 2 speakers, the number of investigated conversations is N × (N + 1)/2 = 24 × (24 + 1)/2 = 300. Of these, 297 conversations (99%) succeeded and only 3 (1%) failed. Figure 3.9(c) and Figure 3.9(d) show the results of the clustering for a successful and a failed conversation respectively. The clustering divides the segments into 2 clusters according to the values of the variances: high or low.

This conclusion holds well when the switching instants are known, but in practice these important instants are unknown and not easy to predict. The first stage of the speaker diarization process is called Speaker Segmentation, and it can track and estimate the locations of these instants. Speaker diarization, however, deals only with input dialogue conversations, whereas in this chapter the input is a hybrid conversation that contains both dialogue and mixture speech signals. Instead of traditional speaker diarization, the above approach has been adopted for this research because of its excellent ability to segment those spontaneous conversations; its high efficiency is the main motivation. Tracking and capturing these instants is the key to the solution, and estimating an instant close to each actual instant is the suggestion that leads to very good results.
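The per-segment histogram, PDF and variance steps for known switching instants can be sketched as follows. This is an illustration under stated assumptions, not the thesis code. The text does not spell out how the variance of a PDF is defined, so the sketch assumes the variance of the label value under that PDF, and its numbers are therefore on a different scale from those quoted for Figure 3.8. A midpoint threshold stands in for the 2-cluster k-means or k-medoids step, and the toy K-vector is invented for the example.

```python
def split_segments(k_vector, switch_frames):
    """Cut the K-vector at the known switching instants (frame indices)."""
    bounds = [0] + list(switch_frames) + [len(k_vector)]
    return [k_vector[a:b] for a, b in zip(bounds, bounds[1:])]

def label_pdf(labels, k=3):
    """Histogram of the labels 1..k, normalised per unit (a discrete PDF)."""
    return [labels.count(v) / len(labels) for v in range(1, k + 1)]

def label_variance(pdf):
    """Variance of the label value under the PDF (assumed definition)."""
    mean = sum(v * p for v, p in enumerate(pdf, start=1))
    return sum(p * (v - mean) ** 2 for v, p in enumerate(pdf, start=1))

# Toy K-vector: an M-like, an FM-like and an F-like segment, 10 frames each.
k_vector = [1] * 9 + [2] + [1, 2, 3, 1, 3, 2, 3, 1, 2, 1] + [3] * 8 + [2] * 2
segments = split_segments(k_vector, switch_frames=[10, 20])
variances = [label_variance(label_pdf(seg)) for seg in segments]

# Simplified 2-way split: midpoint threshold instead of k-means on variances.
threshold = (min(variances) + max(variances)) / 2
kinds = ["mixture" if v > threshold else "dialogue" for v in variances]
```

In this toy case the middle segment, whose labels are spread over all three clusters, gets the highest variance and is marked as mixture.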

The difference between an estimated switching time and its actual time of occurrence produces an error. The accumulation of these errors reduces the overall efficiency of the system, but the final tests show that this reduction is acceptable.

To estimate those instants, the solution searches for an instant in the range of 0.1 s to 3.2 s (10 to 320 frames). The lower limit is 0.1 s because a single error over a period shorter than this limit is negligible. The upper limit of 3.2 s was chosen by trial and error. A second reason for this choice is that, if 2 or more switches occur during this period, the resulting errors are insignificant, because the estimation algorithm tolerates them. A further reason is that the duration of the period is taken into account in the optimal formula used in the next subsection (the relationship is formulated as inversely proportional). Inside that range (0.1 s to 3.2 s), 32 switching times are suggested and then investigated carefully according to the basics and principles of machine learning and pattern recognition. The algorithm is then repeated to find the next switching instant, and so on for the remaining ones. The details are presented in the following optimization algorithm.
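The search range can be made concrete with a short sketch. It assumes that the 32 suggested switching times are evenly spaced, one per 10-frame group, which is consistent with the 0.1 s to 3.2 s range and the frame rate implied by 3000 frames per 30 s segment, but is not stated explicitly in the text. The function name is illustrative, and the scoring of the candidates is left out because it belongs to the optimization algorithm that follows.

```python
FRAMES_PER_SECOND = 100   # 3000 frames per 30 s segment
GROUP_FRAMES = 10         # one candidate every 10 frames (0.1 s), assumed

def candidate_switch_frames(start_frame, n_candidates=32, group=GROUP_FRAMES):
    """Candidate switching instants (frame indices) after `start_frame`."""
    return [start_frame + group * g for g in range(1, n_candidates + 1)]

# Candidates for the first switching instant, measured from frame 0.
candidates = candidate_switch_frames(0)
candidates_s = [c / FRAMES_PER_SECOND for c in candidates]

# After a switching instant is chosen, the search restarts from it.
next_candidates = candidate_switch_frames(candidates[-1])
```

Here `candidates` runs from frame 10 to frame 320, i.e. 0.1 s to 3.2 s after the starting frame.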

Figure 3.8 Specimens of three PDFs, one each for the clustered labels of the extracted features of M (left, variance = 4), FM (middle, variance = 17) and F (right, variance = 2.2) speech.

[Figure 3.9 panels: (a) Input: M FM F M F FM M FM F FM. (b) Grouping: Nf/10 groups of the K-vector. (c) For known switching instants between speech segments: clustering of the variances into 2 clusters, high variance and low variance. (d) Per-unit variances of the PDFs, each PDF covering a 2-second period of the K-vector (i.e. 20 groups). (e) For unknown switching instants between speech segments: clustering of the variances into 2 clusters, high variance and low variance.]

Figure 3.9 Grouping concept. Each group of 10 frames (0.1 seconds) is the fundamental group. The concept facilitates the task of finding the switching instants. (c) is the perfect clustering of the variances when the switching instants are known. (d) is for supposed switching instants that are regular, one every 2 seconds. (e) is the worst clustering of the variances, when the switching instants are unknown. All the sketches share the same horizontal time axis.
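The grouping in Figure 3.9 can be sketched as a simple partition of the K-vector. This is a minimal illustration with assumed function names and a placeholder K-vector of 30000 frames (10 segments of 3000 frames): it splits the labels into 10-frame groups and then joins 20 groups into each regular 2-second window.

```python
GROUP_FRAMES = 10        # fundamental group: 10 frames = 0.1 s
GROUPS_PER_WINDOW = 20   # 20 groups = 2 s

def group_k_vector(k_vector, group=GROUP_FRAMES):
    """Split the K-vector into consecutive groups of `group` frames."""
    return [k_vector[i:i + group] for i in range(0, len(k_vector), group)]

def regular_windows(groups, per_window=GROUPS_PER_WINDOW):
    """Join consecutive groups into fixed windows of `per_window` groups."""
    windows = []
    for i in range(0, len(groups), per_window):
        windows.append([lab for g in groups[i:i + per_window] for lab in g])
    return windows

k_vector = [1, 2, 3] * 10000          # placeholder labels, 30000 frames
groups = group_k_vector(k_vector)     # Nf / 10 groups
windows = regular_windows(groups)     # one window per 2 s
```

Each window can then be turned into a PDF and a variance in the same way as a segment with known boundaries.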

In document Single channel overlapped-speech detection and separation of spontaneous conversations (Page 76-82)