The Data Source - SELF-ORGANIZATION OF SPEECH SOUND INVENTORIES IN THE FRAMEWORK OF COMPLEX NET


3.1.2 The Data Source

The source of data for this work is the UCLA Phonological Segment Inventory Database (UPSID) [102]. The choice of this database is motivated by the large number of typological studies [44,69,88,96] that have been carried out on UPSID in the past. We have selected UPSID mainly for two reasons: (a) it is the largest database of this type that is currently available, and (b) it has been constructed by selecting languages from moderately distant language families, which ensures a considerable degree of genetic balance.

Figure 3.1: A hypothetical example illustrating the nodes and edges of PlaNet

The languages that are included in UPSID have been chosen in a way to approximate a properly constructed quota rule based on the genetic groupings of the world’s extant languages. The quota rule is that only one language may be included from each small language family (e.g., one from the West Germanic and one from the North Germanic) but that each such family should be represented. Eleven major genetic groupings of languages along with several smaller groups have been considered while constructing the database. All these together add up to make a total of 317 languages in UPSID. Note that the availability as well as the quality of the phonological descriptions have been the key factors in determining the language(s) to be included from within a group; however, neither the number of speakers nor the phonological peculiarity of a language has been considered.

Table 3.1: Some of the important features listed in UPSID

| Manner of Articulation | Place of Articulation | Phonation |
|---|---|---|
| tap | velar | voiced |
| flap | uvular | voiceless |
| trill | dental | |
| click | palatal | |
| nasal | glottal | |
| plosive | bilabial | |
| r-sound | alveolar | |
| fricative | retroflex | |
| affricate | pharyngeal | |
| implosive | labial-velar | |
| approximant | labio-dental | |
| ejective stop | labial-palatal | |
| affricated click | dental-palatal | |
| ejective affricate | dental-alveolar | |
| ejective fricative | palato-alveolar | |
| lateral approximant | | |
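The quota rule described above (one language per small family, chosen by the availability and quality of its phonological description) can be illustrated with a minimal sketch. The `Candidate` record, the numeric `description_quality` score, and the placeholder language names are all assumptions made for illustration; UPSID's actual selection was made by expert judgment, not by a numeric score.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    language: str               # placeholder name, not a real UPSID entry
    family: str                 # small language family the language belongs to
    description_quality: float  # hypothetical score for the available description


def quota_sample(candidates: List[Candidate]) -> Dict[str, Candidate]:
    """Keep exactly one language per family: the best-described one.

    The number of speakers and any phonological peculiarity are
    deliberately not used, mirroring the criteria stated in the text.
    """
    best: Dict[str, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.family)
        if current is None or cand.description_quality > current.description_quality:
            best[cand.family] = cand
    return best


# Toy candidates with invented scores.
CANDIDATES = [
    Candidate("WG-1", "West Germanic", 0.9),
    Candidate("WG-2", "West Germanic", 0.7),
    Candidate("NG-1", "North Germanic", 0.6),
    Candidate("NG-2", "North Germanic", 0.8),
]

sample = quota_sample(CANDIDATES)
for family, cand in sorted(sample.items()):
    print(family, "->", cand.language)
```

Every family that has at least one candidate is represented exactly once in `sample`, which is the property the quota rule is meant to guarantee.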

Each consonant in UPSID is characterized by a set of articulatory features (i.e., place of articulation, manner of articulation and phonation) that distinguishes it from the other consonants. Certain languages in UPSID also have consonants that make use of secondary articulatory features in addition to the basic ones. There are around 52 features listed in UPSID; the important ones are noted in Table 3.1. Note that in UPSID the features are assumed to be binary-valued (1 meaning the feature is present and 0 meaning it is absent); therefore, each consonant can be represented by a binary vector.
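The binary-vector representation can be sketched as follows. This is only an illustration: the feature list is an abbreviated, hypothetical subset of the roughly 52 UPSID features, and the feature assignments for the five consonants are simplified assumptions rather than entries copied from the database.

```python
from typing import Dict, FrozenSet, List

# Abbreviated, hypothetical feature list (UPSID lists around 52 features).
FEATURES: List[str] = [
    "plosive", "fricative", "nasal",          # manner of articulation
    "bilabial", "dental-alveolar", "velar",   # place of articulation
    "voiced", "voiceless",                    # phonation
]

# Simplified, illustrative feature sets for a few consonants.
CONSONANTS: Dict[str, FrozenSet[str]] = {
    "p": frozenset({"plosive", "bilabial", "voiceless"}),
    "b": frozenset({"plosive", "bilabial", "voiced"}),
    "t": frozenset({"plosive", "dental-alveolar", "voiceless"}),
    "k": frozenset({"plosive", "velar", "voiceless"}),
    "m": frozenset({"nasal", "bilabial", "voiced"}),
}


def to_binary_vector(features: FrozenSet[str]) -> List[int]:
    """Return one 0/1 entry per feature in FEATURES (1 = present, 0 = absent)."""
    unknown = set(features) - set(FEATURES)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return [1 if name in features else 0 for name in FEATURES]


vectors = {symbol: to_binary_vector(fs) for symbol, fs in CONSONANTS.items()}

for symbol, vector in vectors.items():
    print(f"/{symbol}/", vector)

# Each consonant must be distinguishable from every other one.
all_distinct = len({tuple(v) for v in vectors.values()}) == len(vectors)
print("all vectors distinct:", all_distinct)
```

In this toy setting /p/ and /b/ differ only in the two phonation entries, which shows how a single feature contrast separates two otherwise identical consonants.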

Over 99% of the UPSID languages have bilabial (e.g., /p/), dental-alveolar (e.g., /t/) and velar (e.g., /k/) plosives. Furthermore, voiceless plosives outnumber the voiced ones (92% vs. 67%). According to [101], languages are most likely to have 8 to 10 plosives; nevertheless, the scatter is quite wide and only around 29% of the languages fall within these limits. 93% of the languages have at least one fricative (e.g., /f/). However, as [101] points out, the most likely number of fricatives is between 2 and 4 (around 48% of the languages fall within this range). 97% of the languages have at least one nasal (e.g., /m/); the most likely range reported in [101] is 2 to 4, and around 48% of the languages in UPSID are in this range. In 96% of the languages there is at least one liquid (e.g., /l/), but languages most likely have 2 liquids (around 41%) [101]. Approximants (e.g., /j/) occur in fewer than 95% of the languages; however, languages are most likely to have 2 approximants (around 69%) [101]. About 61% of the languages in UPSID have the consonant /h/, which is not included in any of the categories mentioned above. Some of the most frequent consonants in UPSID are /p/, /b/, /t/, /d/, /tʃ/, /k/, /g/, /ʔ/, /f/, /s/, /ʃ/, /m/, /n/, /ñ/, /ŋ/, /w/, /l/, /r/, /j/ and /h/; together they are often termed the ‘modal’ inventory [101].
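Statistics of the kind quoted above (the share of languages with at least one consonant of a category, and the share whose count falls within a given range) can be computed as in the following sketch. The four toy inventories are invented for illustration and do not correspond to any UPSID language, so the resulting figures are not the ones reported in [101].

```python
from typing import Dict, List

# Hypothetical toy data: language -> manner category of each of its consonants.
TOY_INVENTORIES: Dict[str, List[str]] = {
    "lang_a": ["plosive"] * 6 + ["fricative"] * 2 + ["nasal"] * 2,
    "lang_b": ["plosive"] * 8 + ["nasal"] * 3,
    "lang_c": ["plosive"] * 9 + ["fricative"] * 3 + ["nasal"] * 2,
    "lang_d": ["plosive"] * 4 + ["fricative"] * 5 + ["nasal"] * 1,
}


def share_with_at_least_one(inventories: Dict[str, List[str]], category: str) -> float:
    """Fraction of languages that have at least one consonant of `category`."""
    if not inventories:
        raise ValueError("no inventories given")
    hits = sum(1 for manners in inventories.values() if category in manners)
    return hits / len(inventories)


def share_within_range(
    inventories: Dict[str, List[str]], category: str, low: int, high: int
) -> float:
    """Fraction of languages whose count of `category` lies in [low, high]."""
    if not inventories:
        raise ValueError("no inventories given")
    hits = sum(
        1 for manners in inventories.values()
        if low <= manners.count(category) <= high
    )
    return hits / len(inventories)


fricative_share = share_with_at_least_one(TOY_INVENTORIES, "fricative")
fricative_2_to_4 = share_within_range(TOY_INVENTORIES, "fricative", 2, 4)
plosive_8_to_10 = share_within_range(TOY_INVENTORIES, "plosive", 8, 10)

print(f"at least one fricative: {fricative_share:.0%}")
print(f"2 to 4 fricatives:      {fricative_2_to_4:.0%}")
print(f"8 to 10 plosives:       {plosive_8_to_10:.0%}")
```

The two functions separate the two kinds of claims made in the paragraph: presence of a category at all, and how tightly languages cluster around the most likely count.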

It is important to mention here that there are certain criticisms of this database, especially related to the representation of the phonemes [156]. The phoneme inventories in UPSID are represented using a feature-based classificatory system developed by phoneticians primarily through the inspection of various observable facts about language (see [88] for a vivid description of this design methodology). Although there are questions regarding the authenticity of this representation, mostly related to the existence of (abstract) features [156], in the absence of any alternative resource for the validation of our computational models we had to resort to UPSID. Note that it is hard to determine what exactly a true representation of the phonemes would be, and there is no consensus on this issue even among the experts in the field. Nevertheless, the representation of UPSID can at least be said to be quite “faithful”, if not “true”. This is because numerous studies on this database show that various patterns reflected by it actually correlate perfectly with what is observed in nature. The results presented in this thesis further bring forth certain universal qualities of the feature-based classificatory system for describing the consonant and the vowel inventories, which do not manifest in the case of randomly constructed inventories. Most importantly, the structural regularities reported in the thesis were not presumed by the phoneticians while designing the classificatory system. Therefore, these non-trivial findings possibly point to universal properties of real languages that are reflected only because the classificatory system turns out to be a very appropriate way of representing the inventories. We understand that the statistics that we present might change if the experiments are carried out on a different data set. Therefore, we do not claim that the inferences drawn here are sacrosanct; rather, they are only indicative. In this context, the trends in the results outlined here are more important than the exact values. We believe that the trends should remain similar for any choice of data set. Whether they do is an interesting question for future research on the evolution of sound inventories, and our results have a crucial role in propelling it forward.

In document SELF-ORGANIZATION OF SPEECH SOUND INVENTORIES IN THE FRAMEWORK OF COMPLEX NETWORKS. Animesh Mukherjee (Page 67-71)