
Chapter 4. Blind Speech Separation by Filter-Bank, Non-negative Matrix Factorization and

4.3 Functional Block Diagrams and Waveforms

4.3.3 Non-negative Matrix Factorization NMF

Mathematically, a matrix is a powerful arrangement for data, whether the entries are numbers or functions. The matrix, the determinant and the vector are terms of linear algebra. In computer programming, vectors and matrices are stored as "arrays": a vector is a one-dimensional array and a matrix is a two-dimensional rectangular array, while the determinant is a scalar computed from a square matrix. The array is a suitable container for tabulating data that must be manipulated by a machine (e.g. a computer), and each datum resides at a meaningful location inside a specific matrix. The amount of such data is increasing dramatically. This growth demands a large expansion in the capacity of the storage devices that hold the data, and it also lengthens the processing time. There is therefore a need for a reduced representation that is equivalent to the original large data set. A factorization technique can be used to find such a representation.

Factorization replaces the original large matrix [S] by a product of smaller matrices:

[Figure: block diagram; labels: i/p, o/p, Filter-Bank Analysis (Sub-band 1 to Sub-band 65), Filter-Bank Synthesis (Sub-band 1 to Sub-band 65), 24 Sub-signals, NMF, Speaker Clustering]


$[S] \approx [W][H]$ (4-6)

$[S] \in \mathbb{R}_{\geq 0}^{r \times c}$, $[W] \in \mathbb{R}_{\geq 0}^{r \times ss}$, $[H] \in \mathbb{R}_{\geq 0}^{ss \times c}$

where the data matrix [S] has r rows and c columns, [W] has r rows and ss columns, and [H] has ss rows and c columns.

$[e] = [S] - [W][H]$ (4-7)

where [e] is the error matrix. The factorization algorithms are iterative: each pass feeds the current error back into the next update. The computed error norm || [e] || serves as the convergence criterion. When it falls within the accepted tolerance, the run stops and the current [W] and [H] are accepted as the approximate factors.

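$\min_{[W],\,[H]} \sum_{i=1}^{r} \sum_{j=1}^{c} \left| S_{ij} - ([W][H])_{ij} \right|^{2}$ (4-8)

with $[S] \in \mathbb{R}_{\geq 0}^{r \times c}$, $[W] \in \mathbb{R}_{\geq 0}^{r \times ss}$, $[H] \in \mathbb{R}_{\geq 0}^{ss \times c}$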

For the required storage capacity, [S] occupies (r × c) memory locations, while [W] and [H] together occupy (r × ss) + (ss × c) locations. The saving depends on how much smaller ss is than the smaller of r and c. When ss is far below min(r, c) and [S] is very large, the reduction can reach several orders of magnitude, possibly a factor of a million. NMF is applicable only when all the elements of [S] are non-negative (i.e. ≥ 0); the resulting [W] and [H] then have non-negative elements as well. To solve the minimization in equation (4-8), the following well-known algorithms and methods are used:

• Lee algorithm [74, 75].
• Brunet NMF algorithm [153].
• KL-NMF algorithm [154].
• Frobenius-Norm NMF [155].
• The Offset NMF method.
• The ns-NMF, the ls-NMF, the pe-NMF and the si-NMF algorithms.
• The SNMF/R and the SNMF/L [156].
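As an illustration of the iterative minimization in equation (4-8), the following is a minimal sketch of the multiplicative-update rules commonly attributed to Lee and Seung for the Frobenius-norm objective. It is not taken from the implementation used in this chapter. The function names `nmf_multiplicative` and `storage_counts`, the random initialization, the iteration limit and the relative-improvement stopping rule are all assumptions made for this example.

```python
import numpy as np


def nmf_multiplicative(S, ss, n_iter=200, tol=1e-4, eps=1e-12, seed=0):
    """Approximate a non-negative S (r x c) as W (r x ss) @ H (ss x c).

    Hypothetical helper for illustration; parameter names are assumptions.
    Returns W, H and the list of error norms ||S - W H|| per iteration.
    """
    S = np.asarray(S, dtype=float)
    if np.any(S < 0):
        raise ValueError("NMF requires a non-negative input matrix")
    r, c = S.shape
    rng = np.random.default_rng(seed)
    # Random positive starting factors (an assumed initialization).
    W = rng.random((r, ss)) + eps
    H = rng.random((ss, c)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates: factors stay non-negative because every
        # term in the numerators and denominators is non-negative.
        H *= (W.T @ S) / (W.T @ W @ H + eps)
        W *= (S @ H.T) / (W @ H @ H.T + eps)
        # Frobenius norm of the error matrix [e] = [S] - [W][H].
        errors.append(np.linalg.norm(S - W @ H))
        # Assumed stopping rule: relative improvement below the tolerance.
        if len(errors) > 1 and errors[-2] - errors[-1] <= tol * errors[-2]:
            break
    return W, H, errors


def storage_counts(r, c, ss):
    """Memory locations for [S] versus for [W] and [H] together."""
    return r * c, r * ss + ss * c
```

For example, with r = 65, c = 1000 and ss = 4 (values chosen only for illustration), `storage_counts` gives 65000 locations for [S] against 4260 for the two factors.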


To explain how NMF has been exploited for speech separation, the main idea is described next. The total duration of the mixture speech segment is Tt, the duration of each overlapping-window speech frame is Tw, and the hop duration is Th. With sampling frequency fs, the corresponding numbers of samples are Nt, Nw and Nh respectively, where:

$N_t = f_s \times T_t$; $N_w = f_s \times T_w$ and $N_h = f_s \times T_h$ (4-9)

The total number of processed frames is: $N_f \approx \mathrm{floor}\left(\frac{T_t}{T_h}\right) \approx \mathrm{floor}\left(\frac{N_t}{N_h}\right)$ (4-10)

where floor(.) is the floor function, which rounds its argument down to the largest integer not exceeding it. Since Nw is the number of samples per input frame in the time domain, the number of frequency-domain sub-bands is 1+(Nw/2). The FFTs of all the frames of the speech segment can be arranged as the (1+(Nw/2)) × Nf spectrum matrix [S]. In speech DSP, because the effect of phase variations is negligible, the frequency-domain description uses the absolute values of the spectrum, i.e. its magnitude. The elements of [S] are therefore non-negative values that represent the magnitudes of their sub-bands. Owing to this property, NMF is applicable to [S], i.e. [S] can be factorized into two matrices, [W] and [H]. The ith row of [S] is the vector [Di], which contains the spectral analysis of the ith sub-band. The jth element of the [Di] vector is:

$S_{ij} = \sum_{n=1}^{ss} W_{in} \times H_{nj}$ (4-11)

where Win is the element in the ith row and nth column of [W], and Hnj is the element in the nth row and jth column of [H]. According to Equation (4-11), each sub-band of each frame is the sum of the products of its spectral bases and their activation weights. The full spectrum (i.e. all the sub-bands) of that frame is the concatenation of all these sub-bands. For that full spectrum, [W] is the Spectral-Basis matrix of the filter-bank analysis and [H] is the Activation-Weights matrix of the filter-bank analysis. This matrix manipulation can therefore increase the resolution of the frequency-domain analysis. It appears that the resulting ss components of each sub-band can produce ss corresponding sub-waveforms in the time domain. This waveform generation splits the original waveform of each sub-band, i.e. it can separate the components of each sub-band. In general, NMF is widely used for audio separation based on the above spectral analysis of the mixture audio signal. In this chapter, filter-bank analysis is used to sustain the ability of NMF to separate a mixture speech signal. Although NMF separates mixture audio signals well, it has poor capability for separating a mixture speech signal. Therefore, instead of the mixture speech signal itself, NMF is applied in this chapter to the sub-band signals, sub-band by sub-band. Since NMF can split its input signal into ss signals and there are 1+(Nw/2) sub-bands, the total number of split signals is ss × (1+(Nw/2)).
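The framing arithmetic of equations (4-9) and (4-10) and the construction of the magnitude matrix [S] can be sketched as follows. This is a hedged example, not the chapter's implementation: the Hann window, the zero-padding of the final frames, the function names, and the values fs = 8000 Hz, Tw = 16 ms and Th = 8 ms used in the note below are assumptions chosen for illustration.

```python
import numpy as np


def magnitude_spectrogram(x, fs, Tw, Th):
    """Build the (1 + Nw/2) x Nf magnitude matrix [S] of a signal x.

    Hypothetical helper: fs is the sampling frequency in Hz, Tw and Th
    are the frame and hop durations in seconds.
    """
    x = np.asarray(x, dtype=float)
    Nt = len(x)
    Nw = int(round(fs * Tw))  # samples per frame, equation (4-9)
    Nh = int(round(fs * Th))  # samples per hop, equation (4-9)
    Nf = Nt // Nh             # floor(Nt / Nh), equation (4-10)
    if Nf == 0:
        raise ValueError("signal is shorter than one hop")
    # Assumption: zero-pad so that the last frame has a full Nw samples.
    needed = (Nf - 1) * Nh + Nw
    if needed > Nt:
        x = np.concatenate([x, np.zeros(needed - Nt)])
    window = np.hanning(Nw)  # assumed analysis window
    # One column per frame, each frame starting Nh samples after the last.
    frames = np.stack(
        [x[k * Nh:k * Nh + Nw] * window for k in range(Nf)], axis=1
    )
    # Real FFT of each column; keep only the magnitude, so S >= 0.
    return np.abs(np.fft.rfft(frames, axis=0))


def element_from_factors(W, H, i, j):
    """Element S_ij as the sum over n of W_in * H_nj, as in (4-11)."""
    return sum(W[i, n] * H[n, j] for n in range(W.shape[1]))
```

With the assumed values, Nw = 128, so [S] has 1 + 128/2 = 65 rows, and a one-second signal gives Nf = floor(8000/64) = 125 columns. `element_from_factors` simply spells out one entry of the matrix product [W][H].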

These several hundred signals are split, but identifying their speakers directly produces many errors, because each waveform signal contains contributions from multiple speakers. Instead of identifying each whole signal, the process therefore deals with the frames one by one inside each waveform signal. This identification can be carried out successfully with speaker clustering algorithms. Speaker clustering is the second phase of the speaker diarization process. To implement the clustering in this chapter, an existing reliable speaker diarization toolbox has been used. The toolbox is an open-source package available on GitHub. Since each conversation has Nf frames, the total number of frames is (ss × Nf × (1+(Nw/2))), Figure 4.5.

The identified sub-frames are summed to produce the desired speech signal of a specific speaker. The other, unwanted signals are masked for this speaker but are considered for the other speakers. There are two types of mask: binary and soft. The binary mask assigns the whole of a given signal to one specific speaker and nulls it for the other speakers. The soft mask shares the signal among all the speakers, according to the distances of the signal parameters from their references [22, 87].
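The difference between the two mask types can be illustrated with a short sketch. It assumes that each sub-frame already has a distance to every speaker's reference, with a smaller distance meaning a closer match. The inverse-distance weighting used for the soft mask is an assumption made for this example, as are the function names; the text above does not specify the exact sharing rule.

```python
import numpy as np


def binary_mask(distances):
    """One-hot mask: each unit goes entirely to its nearest speaker.

    distances: (n_speakers, n_units) array, smaller means closer.
    """
    d = np.asarray(distances, dtype=float)
    mask = np.zeros_like(d)
    # Put a 1 in the row of the nearest speaker for every column (unit).
    mask[np.argmin(d, axis=0), np.arange(d.shape[1])] = 1.0
    return mask


def soft_mask(distances, eps=1e-12):
    """Shared mask: each unit is split among all the speakers.

    Assumption: weights are proportional to the inverse distance and
    are normalized so that each unit's weights sum to 1.
    """
    d = np.asarray(distances, dtype=float)
    weights = 1.0 / (d + eps)
    return weights / weights.sum(axis=0, keepdims=True)


def apply_mask(mask, units):
    """Sum the masked units for each speaker.

    units: (n_units, n_samples) array of sub-frames or sub-signals.
    Returns an (n_speakers, n_samples) array.
    """
    return np.asarray(mask) @ np.asarray(units, dtype=float)
```

For two speakers and two units with distances [[1, 3], [3, 1]], the binary mask is the identity matrix, while the soft mask gives each unit a 0.75 share to its nearer speaker and a 0.25 share to the other.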