Sample preprocessing & dimension reduction

Chapter 6 Phylogenetic analysis of Romance languages

6.2 Methods & Implementations

6.2.1 Sample preprocessing & dimension reduction

As shown in section 2.1, spectrograms can be assumed to provide a full two-dimensional characterization of a syllable’s phonetic properties within the limits of their physical characteristics (sampling frequency, window length and window type). Here we utilize them under the assumption that all phonetic characteristics of syllables, starting with the zeroth harmonic F0 and going all the way up to the higher formants¹ (e.g. F2 or F3) assumed to be of importance in Indo-European languages [97], are reflected in those syllables’ spectrograms.

The original acoustic dataset was first resampled at 16 kHz; the spectrograms were then computed using a window length of 10 ms, which results in a window size of 160 samples per frame. Because we used a 16 kHz sampling rate, the maximal effective frequency detected is 8 kHz, the Nyquist frequency of our sampling procedure. A Gaussian window was applied to each frame. The power spectral density is shown after a 10 log₁₀(·) transform, so that it is expressed in decibels (dB).
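The preprocessing just described can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the function name `log_spectrogram`, the 50% frame overlap, the standard deviation of the Gaussian window and the small offset inside the logarithm are all assumptions that the text does not specify.

```python
import math

import numpy as np
from scipy import signal


def log_spectrogram(x, fs_in, fs_out=16_000, win_ms=10.0, gauss_std_frac=0.25):
    """Resample x to fs_out and return (freqs, times, PSD in dB)."""
    # Rational-rate resampling to 16 kHz with a polyphase filter.
    g = math.gcd(int(fs_in), int(fs_out))
    x_rs = signal.resample_poly(x, int(fs_out) // g, int(fs_in) // g)

    # A 10 ms window at 16 kHz corresponds to 160 samples per frame.
    nperseg = int(round(fs_out * win_ms / 1000.0))

    # Gaussian window; its standard deviation is an assumed value.
    window = ("gaussian", gauss_std_frac * nperseg)

    freqs, times, psd = signal.spectrogram(
        x_rs,
        fs=fs_out,
        window=window,
        nperseg=nperseg,
        noverlap=nperseg // 2,  # assumed overlap, not stated in the text
        scaling="density",
        mode="psd",
    )

    # 10 log10(.) transform to decibels; the offset avoids log(0).
    psd_db = 10.0 * np.log10(psd + 1e-12)
    return freqs, times, psd_db
```

With these settings the frequency axis has 81 bins spaced 100 Hz apart, the last one sitting at the 8 kHz Nyquist frequency.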

¹ F0 is not a formant, as it does not refer to an acoustic resonance.

Figure 6.1: Unsmoothed and smoothed spectrogram of a male Portuguese speaker saying “un” (ũ(ŋ)). Throughout all frequencies there is small-scale unstructured variation that the smoothing algorithm filters out.

Despite having an otherwise perfectly balanced grid with no missing values, we cannot exclude instances of noise corruption, because of the rather heterogeneous sample quality and the non-laboratory recording conditions under which the samples were generated. For this reason we employ a penalized least squares filtering technique for gridded data [93], which is based on the discrete cosine transform in two dimensions; this is in contrast with our work in the previous sections, where we used a kernel smoother. Here, because we wanted to keep our implementation fast and efficient, we chose a parametric basis for our data. The basic idea behind this parametric assumption stems from the use of Eq. 3.11 as a smoother: effectively, the smoothed data are the projections of the original data in another domain. Choosing to penalize the roughness of our data through their second-order differences (their second derivative in the case of functional data), Eq. 3.11 can be re-expressed as a penalized regression system of the form:

$$(I + s B^{T} B)\,\hat{y} = y \tag{6.2}$$

where $s$ corresponds to the smoothing parameter used, $B$ to the second-order differencing matrix and, as always, $I$ is the identity matrix. The tridiagonal square matrix $B$ is defined as:

$$B_{i,i-1} = \frac{-2}{r_{i-1}(r_{i-1}+r_i)}, \quad B_{i,i} = \frac{2}{r_{i-1}\,r_i}, \quad B_{i-1,i} = \frac{-2}{r_i(r_{i-1}+r_i)} \tag{6.3}$$

for $2 \le i \le N-1$, where $N$ is the number of elements in $\hat{y}$ and $r_i$ represents the step between $\hat{y}_i$ and $\hat{y}_{i+1}$. Assuming repeating border elements ($y_0 = y_1$ and $y_{N+1} = y_N$), then $B_{1,1} = -B_{1,2} = r_1^{-2}$ and $-B_{N,N-1} = B_{N,N} = r_{N-1}^{-2}$. When $r_i = 1$ for $i = 1, \dots, N$, matrix $B$ is of the form:

$$B = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ -1 & 2 & -1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & -1 & 2 & -1 \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix} \tag{6.4}$$

Obviously, if $s \to 0$ no smoothing takes place and one retrieves the original signal directly, while if $s \to \infty$ one just recovers the second-order polynomial fit to the data [324]. Given that $B$ has the eigendecomposition $B = U \Lambda U^{T}$, $\Lambda$ being the diagonal matrix with the eigenvalues of $B$, Eq. 6.2 can be rewritten as:

$$\hat{y} = U (I + s\Lambda^{2})^{-1} U^{T} y. \tag{6.5}$$

The computational efficiency of this approach comes from the realization that, as Garcia presents, “$U^T$ and $U$ are actually n-by-n type-2 discrete cosine transform (DCT) and inverse DCT matrices, respectively” [93], the orthogonal form of the type-2 DCT kernel matrix being:

$$[C_2]_{i,j} = \sqrt{\frac{2}{N}}\;\xi(i)\cos\!\left(\frac{i\,(j+\tfrac{1}{2})\,\pi}{N}\right), \quad i, j = 0, 1, \dots, (N-1) \tag{6.6}$$

$$\xi(p) = \begin{cases} \sqrt{\tfrac{1}{2}} & \text{if } p = 0 \text{ or } p = N, \\ 1 & \text{if } p = 1, 2, \dots, N-1 \end{cases} \tag{6.7}$$

and thus resulting in the equation:

$$\hat{y} = [C_2^{-1}]\big((I + s\Lambda^{2})^{-1}[C_2]\,y\big). \tag{6.8}$$

Then, taking advantage of the known eigenvalue formulas for tridiagonal matrices like $B$ [339], the $i$-th diagonal element of $(I + s\Lambda^{2})$ can also be written as $1 + s\,(2 - 2\cos((i-1)\pi/n))^{2}$, where $i$ corresponds to the $i$-th eigenvalue of the original matrix $B$. We then define $\Gamma = (I + s\Lambda^{2})^{-1} = \mathrm{diag}\big([1 + s\,(2 - 2\cos((i-1)\pi/n))^{2}]^{-1}\big)$, giving the final estimate $\hat{y}$ as:

$$\hat{y} = [C_2^{-1}]\big(\Gamma\,[C_2]\,y\big). \tag{6.9}$$
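The equivalence between the penalized regression system of Eq. 6.2 and the DCT form of Eq. 6.9 can be checked numerically in the one-dimensional, unit-step case. The following is a minimal sketch under those assumptions; the function names are illustrative and are not taken from Garcia's code [93].

```python
import numpy as np
from scipy.fft import dct, idct


def second_difference_matrix(n):
    """Matrix B of Eq. 6.4 (unit steps, repeated border elements)."""
    B = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    B[0, 0] = 1.0
    B[-1, -1] = 1.0
    return B


def dct_smooth_1d(y, s):
    """Eq. 6.9: y_hat = IDCT(Gamma * DCT(y)), orthonormal type-2 DCT."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Eigenvalues of B; the zero-based index k plays the role of (i - 1).
    lam = 2.0 - 2.0 * np.cos(np.arange(n) * np.pi / n)
    gamma = 1.0 / (1.0 + s * lam**2)
    coeffs = dct(y, type=2, norm="ortho")
    return idct(gamma * coeffs, type=2, norm="ortho")


def direct_smooth_1d(y, s):
    """Reference solution of Eq. 6.2 by a dense linear solve."""
    y = np.asarray(y, dtype=float)
    B = second_difference_matrix(y.size)
    return np.linalg.solve(np.eye(y.size) + s * B.T @ B, y)
```

Both routes return the same vector up to floating-point error; the DCT route avoids forming or factorizing any n-by-n matrix.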

One can immediately see the computational efficiency of Garcia’s algorithm compared with standard smoothing techniques as well as with standard matrix decompositions. Regarding this second claim in particular, even the most “efficient” matrix decomposition for the solution of a least squares problem, the Cholesky decomposition, is of order (1/3)n³ complexity [225], while the 2-D DCT² (and IDCT) is of order n² log(n) [297], yielding significant speed-ups even for small datasets. While Garcia advocates the use of generalized cross-validation for the choice of s, the current implementation used s = 0.5, a value determined by qualitatively examining the resulting smoothed spectrograms. The generalization of this technique to two-dimensional objects simply employs the two-dimensional DCT instead of the one-dimensional one; the two-dimensional DCT is especially popular as it is the backbone of the well-known JPEG format [272] for digital pictures. Finally, after smoothing is conducted, the sample is interpolated over a common time grid assumed to represent “word time”.
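A two-dimensional version of the smoother can be sketched along the same lines. The form of the filter used here, in which the one-dimensional eigenvalues of the two axes are added before squaring, is our reading of the multidimensional extension in [93] and should be treated as an assumption; the function name is illustrative, and the default s = 0.5 mirrors the value quoted above.

```python
import numpy as np
from scipy.fft import dctn, idctn


def dct_smooth_2d(z, s=0.5):
    """Penalized least squares smoothing of a complete 2-D grid via the DCT.

    z : array of shape (n_time, n_freq) with no missing values.
    s : smoothing parameter.
    """
    z = np.asarray(z, dtype=float)
    n_t, n_f = z.shape

    # One-dimensional eigenvalues of the differencing matrix along each axis.
    lam_t = 2.0 - 2.0 * np.cos(np.arange(n_t) * np.pi / n_t)
    lam_f = 2.0 - 2.0 * np.cos(np.arange(n_f) * np.pi / n_f)

    # Assumed 2-D filter: eigenvalues are summed across axes, then squared.
    lam = lam_t[:, None] + lam_f[None, :]
    gamma = 1.0 / (1.0 + s * lam**2)

    coeffs = dctn(z, type=2, norm="ortho")
    return idctn(gamma * coeffs, type=2, norm="ortho")
```

Under this assumption the result coincides with the dense solution of (I + s L²) vec(ẑ) = vec(z), where L is the Kronecker sum of the two one-dimensional matrices B; a constant surface passes through unchanged.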

² The two-dimensional DCT takes the one-dimensional DCT of each column followed by a one-dimensional DCT of each row.

Two important caveats need to be mentioned. First, using Garcia’s method we enforce a discrete transformation on functional data. Second, this smoothing methodology is based on the theoretical assumption that a function is periodic and extends outside the domain over which it is observed.

The first caveat is an oversimplification that, as mentioned, is made for the sake of computational efficiency. It cannot hide the fact, though, that higher-order fluctuations might be truncated, as only 64 two-dimensional basis functions are used. What can be argued is that, given the relatively small sample from which we want to draw conclusions, the choice of 64 basis functions, which are highly informative in the case of two-dimensional patterns, does not limit the insights behind our analysis; it does not “meaningfully” exclude information.

The second caveat concerns the theoretical foundations of this type-2 DCT smoothing framework and is more ambiguous. For standard periodic signals the assumption of “extending outside the observable domain” might be a non-restrictive one; in the current case, though, and especially when examining a frequency continuum, where the concept of negative values is conceptually highly non-trivial (even granting that one can interpret “negative time” as going back in time), this approach can be questioned. Our counter to this second caveat is based on the dynamics of the physical system we investigate. In the case of frequencies, one has practically no fluctuations below a very low threshold: frequencies below 20 Hz are effectively outside our vocal range. Thus assuming that the border of “zero-th” fluctuations extends “into negative frequencies” does not meaningfully alter the boundary condition we employ. These two caveats are not raised to diminish the efficiency or the elegance of Garcia’s smoothing method; they are raised because one should not naively move methodologies from a discrete domain to a continuous one, and anyone who chooses to do so must be able to offer a meaningful interpretation of the assumptions imposed.

In addition to noise distortions, as mentioned earlier, phase distortions are almost certain to exist in any acoustic signal.
Here, using spectrograms as our acoustic units of analysis, we are presented with two-dimensional rather than one-dimensional objects. While in general phase variation in a two-dimensional object cannot be assumed to influence a single dimension exclusively, under specific circumstances all variation can be assumed to occur along a single “relevant axis”. In particular, when one focuses on the analysis of spectrograms, inherently two-dimensional objects over a frequency and a time axis, phase variations are relevant only in the context of time; frequency can be taken as given on an absolute scale, as the phonation procedure of a speaker affects only the timing of the sound excitation and not its amplitude (at least not directly). One can therefore reformulate the original pairwise warping criterion from simple one-dimensional objects such as curves (as in chapter 5, where pairwise curve synchronization was utilized) to slightly more complex two-dimensional objects. Assuming $y_i(t, f)$ and $y_k(t, f)$ to be two spectrograms with frequency indices of equal size, their “discrepancy” cost function $D'_{\lambda}$ is:

$$D'_{\lambda}(y_k, y_i, g') = E\Big\{ \int_{f=0}^{F_{Nyq}} \int_{t=0}^{1} \big[ (y_k(g'(t), f; T_k) - y_i(t, f; T_i))^{2} + \lambda\,(g'(t) - t)^{2} \big]\, dt\, df \;\Big|\; y_k, y_i, T_k, T_i \Big\}, \tag{6.10}$$

or in its discretised version:

$$D'_{\lambda}(y_k, y_i, g') = E\Big\{ \sum_{f=0}^{r} \sum_{t=0}^{1} \big[ (y_k(g'(t), f; T_k) - y_i(t, f; T_i))^{2} + \lambda\,(g'(t) - t)^{2} \big] \;\Big|\; y_k, y_i, T_k, T_i \Big\}, \tag{6.11}$$

where, as in Sect. 3.2.2, $\lambda$ is an empirically evaluated non-negative regularization constant, $T_i$ and $T_k$ are used to normalize the spectrograms’ time lengths and $g'_{k,i}(\cdot)$ is the pairwise warping function mapping the time evolution of $y_i(t, f)$ to that of $y_k(t, f)$. Thus we are led to the one-dimensional reformulation of the cost function $D'_{\lambda}$ as:

$$D'_{\lambda}(y_k, y_i, g') = E\Big\{ \sum_{t=0}^{1} \big[ (\vec{y}_k(g'_r(t); T_k) - \vec{y}_i(t; T_i))^{2} + \lambda\,(g'(t) - t)^{2} \big] \;\Big|\; \vec{y}_k, \vec{y}_i, T_k, T_i \Big\}, \tag{6.12}$$

where $\vec{y}_k$ is the vectorized form of the spectrogram $y_k$, concatenated across frequencies, and $g'_r$ is the pairwise warping function $g'_{k,i}(\cdot)$ repeated $r$ times, $r$ being the number of discrete points along the frequency axis $f$. This is ultimately a two-dimensional version of Eq. 5.27. Thus, similarly to the one-dimensional case of the pairwise warping curves, Eq. 3.21 is used to recover the final warping function by taking advantage of the law of large numbers, giving a two-dimensional version of the pairwise synchronization framework presented in Sect. 3.2.2. Fig. 6.2 shows the subtle changes warping induces in a spectrogram’s structure in our dataset. With the completion of this step we are presented with 219 smoothed and warped spectrograms. Importantly, the warping itself was done within digit and gender clusters. That means that the speakers of each gender uttering a specific digit (irrespective of their language) had their utterances time-registered only among themselves. We made this choice for two reasons: first, we know from previous findings that intonation dynamics differ significantly between speakers of opposite sexes [117]; second, we also know that registration of completely unrelated data will produce spurious results. For example, the word “un” ([ɛ̃]) and the word “cuatro” ([ˈkwatro]) (French for one and Spanish for four respectively) will exhibit different inclination patterns, and the time-registration procedure will fail to recognize meaningful similarities to exploit. For modelling purposes the “word time”
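For a single pair of spectrograms and a candidate warping function, the discretised cost of Eq. 6.11 can be evaluated as sketched below. This is a hedged illustration only: the expectation is dropped (one realization is scored), both spectrograms are assumed to be already sampled on the same normalized time grid, linear interpolation is an assumed choice, and the function name is hypothetical.

```python
import numpy as np


def pairwise_warp_cost(y_k, y_i, g, lam):
    """Empirical version of the cost in Eq. 6.11 for one spectrogram pair.

    y_k, y_i : arrays of shape (n_time, n_freq) on a common grid over [0, 1].
    g        : warped time points g'(t) in [0, 1], shape (n_time,).
    lam      : non-negative regularization constant.
    """
    y_k = np.asarray(y_k, dtype=float)
    y_i = np.asarray(y_i, dtype=float)
    g = np.asarray(g, dtype=float)
    n_time, n_freq = y_i.shape
    t = np.linspace(0.0, 1.0, n_time)

    # The same warping is applied to every frequency band, which is what
    # repeating g' r times in the vectorized form of Eq. 6.12 amounts to.
    y_k_warped = np.column_stack(
        [np.interp(g, t, y_k[:, f]) for f in range(n_freq)]
    )

    # Squared discrepancy plus the penalty on departures from identity;
    # as in Eq. 6.11, the penalty sits inside the sum over frequencies.
    penalty = lam * (g - t)[:, None] ** 2
    return float(np.sum((y_k_warped - y_i) ** 2 + penalty))
```

Minimizing this quantity over a family of monotone functions g would give one pairwise warping; how that family is parametrized is outside the scope of this sketch.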

Spectrograms are almost by definition objects with a complex internal structure; as instances of functional data they appear as two-dimensional functions of time and frequency. While it is possible to work directly in this function space, for computational efficiency and conceptual conciseness, given a dataset $y$ of function-valued traits as shown in Eq. 6.1, we would like to find appropriate estimates $\hat{Q}$ and $\hat{\phi}$ of the mixing matrix $Q$ and the basis set $\phi$ respectively. The first task is to identify a good linear subspace $S$ of the space of all continuous functions by choosing basis functions appropriately. Evidently, in the case of spectrogram data these basis functions will be two-dimensional. The purpose of this task is to work not with the function-valued data directly, but with their projections in $S$. As formalized in Sect. 3.3, we may say that the chosen subspace $S$ is good if the projected data approximate the original data well while the number of basis functions is not unnecessarily large, so that $S$ has the “effective” dimension of the data. The warped spectrograms $W$, as in previous sections, are assumed to be adequately expressed as:

$$W_i(u, f) = \mu_W(u, f) + \sum_{k=1}^{\infty} A_{i,k}\,\phi_k(u, f), \quad \text{where } \mu_W(u, f) = E\{W(u, f)\}, \tag{6.13}$$

where, as before, $u \in [0,1]$ is the absolute time-scale on which the spectrograms are assumed to evolve and $f$ is the frequency domain (here modelled as the domain between 0 and 8 kHz in 100 Hz intervals).
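A truncated empirical counterpart of the expansion in Eq. 6.13 can be obtained by vectorizing each spectrogram and taking a singular value decomposition of the centred data. The sketch below assumes a regular grid, ignores quadrature weights, and uses hypothetical names; it illustrates the decomposition and is not the estimation procedure of Sect. 3.3.

```python
import numpy as np


def fpca_2d(W, n_components):
    """Empirical FPCA of spectrograms sampled on a common regular grid.

    W : array of shape (n_samples, n_time, n_freq).
    Returns the mean surface, the leading components, the scores and the
    fraction of variance explained by each retained component.
    """
    W = np.asarray(W, dtype=float)
    n, n_t, n_f = W.shape

    # Vectorize each surface and remove the sample mean (mu_W in Eq. 6.13).
    X = W.reshape(n, n_t * n_f)
    mu = X.mean(axis=0)
    Xc = X - mu

    # Rows of Vt are orthonormal discretised eigen-surfaces (phi_k).
    _, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    phi = Vt[:n_components]

    # Scores A_{i,k}: projections of the centred data on each component.
    scores = Xc @ phi.T

    var_ratio = sv**2 / np.sum(sv**2)
    return (
        mu.reshape(n_t, n_f),
        phi.reshape(n_components, n_t, n_f),
        scores,
        var_ratio[:n_components],
    )
```

The returned variance fractions are the quantities one would compare against a per-sample threshold when deciding how many components to retain.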

Figure 6.2: Unwarped and warped spectrogram of a male Portuguese speaker saying “un” (ũ(ŋ)). Notice how the warped instance of the spectrogram is registered on a universal “word time” rather than absolute time; ridges among formant frequencies appear more prominently.

Before applying FPCA to our sample we recognize that the ultimate goal of this work is to provide language-specific descriptions: scalar estimates that can be utilized within the context of a phylogenetic tree. Additionally, we know that digit-wide FPCs would be unrealistic, as they would combine non-comparable variation patterns, and that it would be beneficial to incorporate the minimal prior knowledge that the sex of the speaker has at least “some influence” on the phonetic characteristics encoded by the spectrogram. In a manner similar to section 5.3, given $W^{d}$, the spectrograms for a given digit $d$, one formulates:

$$E\{w^{d}_{i}(u, f) \mid X^{d}_{i}\} = \mu_{w,d}(u, f) + \sum_{k=1}^{\infty} E\{A^{d}_{i,k} \mid X^{d}_{i}\}\,\phi^{d}_{k}(u, f). \tag{6.14}$$

Figure 6.3: Functional Principal Components for the digit one spectrograms. Two different views are shown. The top row shows the viewing angle from a (−50, 50) azimuth rotation and vertical elevation; the bottom row shows the viewing angle from a (0, 90) viewpoint (directly from above). It is immediately seen that the majority of variation is encapsulated by the first two FPCs. The mean spectrogram is shown in Appendix, Fig. A.11.

Given the structure of our data, we use a fixed-effects rather than a mixed-effects model to account for speaker variation within a given language $l$. The reason for this design choice is that we do not have enough speaker realizations to provide meaningful estimates in certain cases. For example, we have a single male speaker in Spanish and in Portuguese; a random-effects model could not meaningfully decompose the variation due to the sex of the speaker and the variation due to the speaker’s unique characteristics. Taking that into account, our final estimates for the language-specific FPC scores $\beta_{0}^{d,l}$ are given by the language-gender interaction model:

$$E\{A^{d,l}_{i,k} \mid X^{d,l}_{i}\} = X^{d,l}_{i}\,\beta^{d,l} \tag{6.15}$$

where $(\beta^{d,l})^{T} = [\beta_{0}^{d,l_1}, \beta_{0}^{d,l_2}, \beta_{0}^{d,l_3}, \beta_{0}^{d,l_4}, \beta_{0}^{d,l_5}, \beta_{1}^{d,l_1}, \beta_{1}^{d,l_2}, \beta_{1}^{d,l_3}, \beta_{1}^{d,l_4}, \beta_{1}^{d,l_5}]$ and the design matrix $X^{d,l}_{i}$ is simply an $n \times m$ indicator matrix, where $n$ equals the number of all speakers uttering digit $d$ and $m$ equals $2 \cdot 5$, such that:

$$X = \big[\, \delta_{l_1} \;\dots\; \delta_{l_5} \;\; \delta^{sex}_{l_1} \;\dots\; \delta^{sex}_{l_5} \,\big] \tag{6.16}$$

where the column vectors $\delta_{l_i}$ and $\delta^{sex}_{l_i}$ are defined respectively as:

$$\delta_{l_i} = \begin{cases} 1 & \text{for language } l_i, \\ 0 & \text{otherwise} \end{cases} \tag{6.17}$$

and

$$\delta^{sex}_{l_i} = \begin{cases} 1 & \text{for language } l_i \text{ iff the speaker is male,} \\ 0 & \text{otherwise} \end{cases} \tag{6.18}$$

where $i \in \{1,2,3,4,5\}$ corresponds to each of the five languages represented in the current tree. In that way, using these averaged scores³, we obtain effective representatives of language $l$ for the specific digit $d$ investigated, by combining all our digit-specific readings. This allows us to create, in a way, “language exemplar” scores ($\beta_{0}^{d,l_i}$), these scores being the ones used for the phylogenetic analysis. Clearly one could also construct “language exemplar” spectrograms⁴ against which the final protolanguage could then be compared. Notably, given the design matrix $X$ used, the protolanguage estimates will correspond to female speakers, as the male gender effects are encapsulated in the $\beta_{1,k}^{d,l}$ term, which is not carried forward in the analysis.

Examining these artificial spectrograms, it is interesting that American Spanish speakers appear to downplay the effect of the second vowel in their utterance compared with the other “two-vowel” languages, while on the contrary French speakers (somewhat expectedly, given the phonation of “un” in French) have a strong, almost single-peaked spectrogram. The interpretation of the first three FPCs is almost obvious: the first FPC captures the variation due to the phonation of the first vowel present in the utterances of digit one⁵. Even if another vowel exists (as in the cases of Italian and Spanish), that vowel is not as strongly stressed as the first one; it is therefore expected that the major point of variance will be at the beginning of the word. This finding is in accordance with the findings of the previous chapters, where in all cases the beginning of a syllable exerts greater influence on the syllable’s dynamics than the other parts. The second and the third FPCs encapsulate the presence of the second vowel. They reflect a phonation event occurring in the second half of the word utterance. It can also be argued, upon investigating the second FPC’s shape, that it partially complements the first FPC; it allows the difference between the two vowels to come forward more strongly. In a similar but less pronounced manner, the third FPC also complements the first FPC, but in a more localized way: the highly localized frequency drop in the amplitude of the third FPC, occurring approximately in the center of the word’s first half, is counter-balanced by an overall amplitude increase in the lower frequencies of the word’s half. For the fourth FPC it could be argued that the long ridge exhibited approximately at the 6 kHz band is a speaker-specific construct. One would not expect phylogenetically attributed phonetic variation in that range, as it is highly speaker dependent, in the sense that it might be due to a specific speaker’s dynamics or (more worryingly) to speaker-specific recording equipment. For that reason we do not examine higher-order FPCs. As seen in Table 6.1, these components exhibit variation that is rather small in percentage terms and, taking into account that we have “just” 22 instances of the digit one, it is not reasonable to believe that these FPCs generalize to sample-wide variation patterns. In particular, the individual variation reflected by each FPC quickly falls below a value (1/22 ≈ 4.5%) that could be attributed to a variational pattern present to a single

³ See Appendix, Table A.12 for actual values.

⁴ See Appendix, Fig. A.13: $E\{w^{d,l}\} = \mu_{w,d}(u, f) + \sum_{k=1}^{\infty} \beta^{d,l}_{0,k}\,\phi^{d}_{k}(u, f)$.

⁵ In IPA these are encoded as: [ũ(ŋ)] in Portuguese, [ˈuːno] in Italian, [ˈuːno] in Spanish and [ɛ̃] in French.
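The language-gender interaction model of Eqs. 6.15 to 6.18 can be sketched as an ordinary least squares fit on an indicator design matrix. The helper names, the integer coding of languages and the use of `numpy.linalg.lstsq` are assumptions made for illustration; the text does not state how the fixed-effects fit was computed.

```python
import numpy as np


def design_matrix(lang_idx, is_male, n_lang=5):
    """Indicator matrix of Eq. 6.16 for the speakers of one digit.

    lang_idx : integer language code (0 .. n_lang - 1) for each speaker.
    is_male  : 1 if the speaker is male, 0 otherwise.
    The first n_lang columns are the language indicators (Eq. 6.17) and
    the last n_lang columns the language-by-male indicators (Eq. 6.18).
    """
    lang_idx = np.asarray(lang_idx, dtype=int)
    is_male = np.asarray(is_male, dtype=float)
    n = lang_idx.size
    X = np.zeros((n, 2 * n_lang))
    rows = np.arange(n)
    X[rows, lang_idx] = 1.0
    X[rows, n_lang + lang_idx] = is_male
    return X


def language_scores(X, A):
    """Least squares estimate of beta in E{A | X} = X beta (Eq. 6.15).

    A : FPC scores of shape (n_speakers, n_components).
    Rows 0 .. n_lang - 1 of the result are the female-baseline language
    scores (beta_0); the remaining rows are the male offsets (beta_1).
    """
    beta, *_ = np.linalg.lstsq(X, A, rcond=None)
    return beta
```

In this coding only the first five rows of the estimate would be carried forward as “language exemplar” scores, which is why the resulting estimates correspond to female speakers.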

In document Functional data analysis in phonetics (Page 134-142)
