
Nidaba : A segment distribution database for measuring language

4.3 Nidaba overview

A set of typological parameters used as input to the Hamming Distance metric ideally has the following characteristics. Firstly, there is a reproducible methodology for deciding parameter values, which gives consistent results when applied by different researchers and is extensible to new languages. Secondly, the parameter set is flexible and can be adapted to different theoretical positions, so that the consequences of those positions for Hamming Distance can be contrasted.

To aid in this, I have written a database and lexical analysis tool, called Nidaba. Its core functions are the search and comparison of segmental patterns in transcribed lexicons.

Nidaba contains wordlists drawn from a variety of sources, which have been transcribed phonemically, either by me or by the original authors. (The principles used in determining phonemic representation for a given analysis of each language are stored in Nidaba, and alternative mappings can be uploaded by other researchers if they prefer another analysis.) For each language where such data is available, the frequency of each lexical item is listed, principally drawn from film subtitle data (see Section 4.7). From this source data, consonant or vowel sequences can be extracted from different positions within the word (see Subsection 4.3.2). The syllabic parameter values can be derived from these sequences. The values of the vowel and consonant parameters are derived from the phonemic transcription chosen.

The values so derived have been manually checked against other sources where these exist, and any discrepancies noted.

4.3.1 Input data

To analyse a language with Nidaba, two sets of input data are required: firstly, a list of lexical items in some transcription system, together with any data the researcher would like to tag items with (e.g. English gloss, part of speech, origin of loan items, frequency in some corpus); secondly, a conversion to IPA transcription.

Initially, this conversion will be a simple phonetic mapping. This stage allows the researcher to confirm the phonetic inventory of their initial transcription, identifying any typographical errors (e.g. [c] in place of [k]). The mapping system can handle combinations of characters, using a longest-match-first approach. This allows for lexicons derived from semi-regular orthographic systems, containing digraphs or loan words which follow different pronunciation rules.
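The longest-match-first behaviour can be illustrated with a minimal sketch. It is not Nidaba's own code: the function name `build_converter`, the dictionary format and the toy mapping are all assumptions made for illustration. Characters with no rule are passed through and reported, which is one way typographical errors such as a stray [c] might surface.

```python
def build_converter(mapping):
    """Return a function converting a source transcription to IPA.

    `mapping` is a hypothetical dict from source strings (one or more
    characters) to IPA strings; it is not Nidaba's storage format.
    """
    # Try longer keys first, so that digraphs win over single characters.
    keys = sorted(mapping, key=len, reverse=True)

    def convert(word):
        out, unmapped, i = [], [], 0
        while i < len(word):
            for key in keys:
                if word.startswith(key, i):
                    out.append(mapping[key])
                    i += len(key)
                    break
            else:
                # No rule applies: keep the character and flag it for checking.
                unmapped.append(word[i])
                out.append(word[i])
                i += 1
        return "".join(out), unmapped

    return convert


# Toy mapping for an English-like orthography (illustrative only).
toy_mapping = {"sh": "ʃ", "th": "θ", "s": "s", "h": "h",
               "t": "t", "i": "ɪ", "p": "p", "n": "n"}
convert = build_converter(toy_mapping)

print(convert("ship"))  # the digraph "sh" is matched before "s" and "h"
print(convert("cit"))   # "c" has no rule, so it is reported as unmapped
```

Sorting the keys by length once, rather than searching for the longest match at every position, keeps the sketch short; a trie would be a natural alternative for large mappings.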

Once a lexicon has been uploaded, the researcher can compare the occurrence of different segments in different positions (word initial, medial and final), which can assist in identifying allophones. Once the researcher has completed a phonemic analysis, the list of lexical items can be retranscribed with a new, phonemic, mapping, for use in further analysis.

By combining word lists with transcription conversions, we derive a ‘doculect’, a particular documentation of a dialect, which is transcribed in the IPA (nidaba.co.uk/Contents/Doculect).

Since a word list can be associated with multiple conversions, this allows a choice of analysis without any data loss; for example, I have chosen not to use the linking R of the DISC transcription in my IPA representation of English, but another researcher can include it in their own analysis by using a different conversion (nidaba.co.uk/Contents/TranscriptionConversion).

4.3.2 Pattern retrieval

Since IPA symbols have static values, they can be pre-assigned place and manner values, and sorted into vowels and consonants³. Using pre-constructed regular expressions⁴, Nidaba automatically locates certain combinations of word edges, vowels and consonants.

For any given doculect, the researcher can view word initial, medial or final sequences of vowels or consonants. These sequences are displayed with the number of lexical items in which they are found, and a link to all known examples. This latter feature can help in discovering commonalities, such as all examples of a given sequence deriving from the same morpheme.
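As a rough sketch of how such positional searches might be written, the following uses regular expressions over a toy inventory. The vowel and consonant sets, the pattern definitions and the function name `consonant_sequences` are assumptions for illustration, not Nidaba's pre-constructed expressions; each symbol is treated as one segment.

```python
import re
from collections import defaultdict

VOWELS = "aeiouɪɛæʌə"            # toy inventory (assumption)
CONSONANTS = "ptkbdgfvszmnlrʃθ"  # toy inventory (assumption)

PATTERNS = {
    # consonants at the left word edge
    "initial": re.compile(f"^[{CONSONANTS}]+"),
    # consonants flanked by vowels on both sides
    "medial": re.compile(f"(?<=[{VOWELS}])[{CONSONANTS}]+(?=[{VOWELS}])"),
    # consonants at the right word edge
    "final": re.compile(f"[{CONSONANTS}]+$"),
}


def consonant_sequences(lexicon, position):
    """Map each consonant sequence found in `position` to its source words."""
    found = defaultdict(set)
    for word in lexicon:
        for match in PATTERNS[position].finditer(word):
            found[match.group()].add(word)
    return found


lexicon = ["strɪp", "stɪk", "mɪst", "æspɪn", "tɪp"]
initial = consonant_sequences(lexicon, "initial")
for sequence, words in sorted(initial.items()):
    # sequence, number of lexical items, and the items themselves
    print(sequence, len(words), sorted(words))
```

Keeping the set of source words alongside each sequence mirrors the link from a sequence to all of its known examples.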

From this basic overview, more detailed searches can be conducted. The researcher can specify properties of sequences such as length, number of items, or sonority profile; place, manner and/or voicing features; and part of speech or other lexical tags, such as loan words of a particular origin.

Nidaba also generates composite properties for each sequence from the relevant lexical items. For example, if corpus frequency data is available, Nidaba will give the total frequency of a sequence summed over all items.

If lexical items have associated frequency data, Nidaba can produce total and average frequency statistics for any given pattern retrieved. If token frequency is not available - for example, in an unwritten and unbroadcast language - Nidaba also produces the number of items in which a given sequence occurs in the lexicon, and what those items are. Each sequence is linked to its list of source words, to verify the original context.

³ Mapping IPA symbols to a user’s feature set of choice is an extension goal.

⁴ i.e. search patterns to be located in a longer text
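The distinction between item counts and token frequency can be sketched as follows. The function name `sequence_statistics`, the input format and the frequency figures are all invented for illustration; they are not Nidaba's data or interface.

```python
from collections import defaultdict


def sequence_statistics(sequence_of, frequencies=None):
    """Summarise sequences by item count and, if available, token frequency.

    sequence_of: dict mapping each word to the sequence it contains.
    frequencies: optional dict mapping each word to a corpus count.
    """
    items = defaultdict(list)
    for word, sequence in sequence_of.items():
        items[sequence].append(word)

    stats = {}
    for sequence, words in items.items():
        entry = {"items": len(words), "words": sorted(words)}
        if frequencies is not None:
            total = sum(frequencies.get(word, 0) for word in words)
            entry["total_frequency"] = total
            entry["mean_frequency"] = total / len(words)
        stats[sequence] = entry
    return stats


# Word-initial consonants of three toy items, with invented counts.
initial_of = {"ðə": "ð", "ðɪs": "ð", "tɪp": "t"}
toy_counts = {"ðə": 1000, "ðɪs": 200, "tɪp": 6}

with_frequency = sequence_statistics(initial_of, toy_counts)
without_frequency = sequence_statistics(initial_of)
print(with_frequency["ð"])
print(without_frequency["ð"])
```

When no frequency data is supplied, the summary falls back to the number of items and the items themselves, as described above.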

By filtering out sequences only found in relatively few lexical items, or with very low frequency, the researcher can exclude noise arising from errors in input data, loan words or regional variants (nidaba.co.uk/Tools/CompareSets). Because this data is not excluded automatically, users can compare marginal sequences - such as [sf] in sphere - with non-existent sequences.
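A threshold filter of this kind might look like the sketch below. The function name, the threshold and the item counts are invented for illustration. Marginal sequences are set aside rather than discarded, so they remain available for comparison with sequences that are not attested at all.

```python
def partition_by_support(item_counts, min_items):
    """Split sequences into well-supported and marginal sets.

    item_counts: dict mapping each sequence to the number of lexical
    items containing it. Nothing is deleted: marginal sequences are
    returned separately so they can still be inspected.
    """
    robust = {seq for seq, n in item_counts.items() if n >= min_items}
    marginal = set(item_counts) - robust
    return robust, marginal


# Invented item counts for three word-initial sequences.
toy_item_counts = {"st": 120, "sp": 85, "sf": 3}
robust, marginal = partition_by_support(toy_item_counts, min_items=5)
print(sorted(robust), sorted(marginal))
```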

Nidaba has a default set of binary features for every IPA segment known to the database, covering place, manner and voicing. These features are not hard-coded, and can be straightforwardly replaced with alternatives; I hope to make this functionality available through the web interface in a future version. These features are available to the pattern retrieval tool, simplifying the task of examining the contexts in which segments appear.
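To illustrate how a replaceable table of binary features can drive a search, here is a small sketch. The feature names, their values and the function `segments_with` are hypothetical and much cruder than any real feature system; the table is passed as an argument so that an alternative could be substituted.

```python
# Hypothetical binary feature table for six segments (illustrative only).
TOY_FEATURES = {
    "p": {"labial": True,  "coronal": False, "stop": True,  "voiced": False},
    "b": {"labial": True,  "coronal": False, "stop": True,  "voiced": True},
    "t": {"labial": False, "coronal": True,  "stop": True,  "voiced": False},
    "d": {"labial": False, "coronal": True,  "stop": True,  "voiced": True},
    "s": {"labial": False, "coronal": True,  "stop": False, "voiced": False},
    "z": {"labial": False, "coronal": True,  "stop": False, "voiced": True},
}


def segments_with(table, **wanted):
    """Return the segments in `table` whose features match all of `wanted`."""
    return {
        segment
        for segment, features in table.items()
        if all(features.get(name) == value for name, value in wanted.items())
    }


voiceless_stops = segments_with(TOY_FEATURES, stop=True, voiced=False)
voiced_coronals = segments_with(TOY_FEATURES, coronal=True, voiced=True)
print(sorted(voiceless_stops), sorted(voiced_coronals))
```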

4.3.3 Comparison

The results of the detailed searches can be automatically compared, making it easy to see which sequences occur word-initially but not word-finally; in nouns but not in verbs; or with high token frequency but in few lexical items (e.g. English [ð], which is the most frequent word-initial consonant, but occurs in only a couple of dozen items).
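At its simplest, such a comparison reduces to set operations over two search results. The sketch below assumes each result is a collection of sequences; the function name `compare_sets` and the example sequences are illustrative only.

```python
def compare_sets(left, right):
    """Partition two search results into left-only, shared and right-only."""
    left, right = set(left), set(right)
    return {
        "left_only": left - right,
        "both": left & right,
        "right_only": right - left,
    }


# Toy results of a word-initial and a word-final consonant search.
word_initial = {"st", "str", "t", "m"}
word_final = {"st", "t", "p", "k", "n"}

comparison = compare_sets(word_initial, word_final)
for label, sequences in comparison.items():
    print(label, sorted(sequences))
```

The same function applies unchanged when the two sets come from different parts of speech, frequency bands or doculects.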

These comparisons are not limited to a single doculect or even language. As well as customisable sequence set comparisons, Nidaba also has two default comparison pages designed to give a quick overview of the similarities and differences between multiple doculects. The first presents sequences located by a set of default searches, including multi-consonant initial sequences, word final consonants and sonority violating sequences. The second automatically calculates parameter values from the results of such searches, and provides researchers with links to the relevant lexical items.

Finally, Nidaba has a tool for locating subsequences. For example, the researcher can divide all word-medial consonant sequences into sequences also found word-finally (‘codas’), and any following consonants (‘onsets’)⁵. This data can then be fed into the set comparison tool mentioned above, and word-internal and word-edge ‘onset’ and ‘coda’ sequences compared.

⁵ ‘Onset’ and ‘coda’ here being terms of convenience for particular subsets of consonant sequences, not commitments to a particular syllabification.
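One possible reading of this division is sketched below. It assumes that the ‘coda’ is the longest prefix of the medial sequence that is also attested word-finally; that choice, the function name `split_medial` and the toy data are assumptions, and the text above does not say which prefix Nidaba selects. Each character is treated as one segment.

```python
def split_medial(sequence, final_sequences):
    """Split a medial consonant sequence into ('coda', 'onset').

    The 'coda' is taken to be the longest prefix of `sequence` that is
    also found word-finally (an assumption); the rest is the 'onset'.
    """
    for cut in range(len(sequence), 0, -1):
        if sequence[:cut] in final_sequences:
            return sequence[:cut], sequence[cut:]
    # No prefix is attested word-finally: the whole sequence is 'onset'.
    return "", sequence


toy_final = {"n", "s", "st", "k"}  # toy word-final sequences
for medial in ["nstr", "stk", "pr"]:
    print(medial, split_medial(medial, toy_final))
```

The resulting ‘coda’ and ‘onset’ sets could then be compared with word-edge sequences using ordinary set operations.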

4.3.4 Accessing Nidaba

The principal use case is through a web interface, with data stored centrally and potentially made accessible to other researchers. It is available at the URL nidaba.co.uk.

Data which has been uploaded unrestricted can be viewed by anyone. By contrast, to upload data, users need to register for an account. The uploader then maintains control over the accessibility of their data. They can choose to make their data available to all, or they can share it with only named collaborators.

The software is open source, and is available at bitbucket.org/selizabetheden/nidaba. Users can also download the source code to run a local copy of Nidaba, for use with very sensitive data or without a reliable internet connection. However, this removes access to inter-language comparisons, because the database itself is not downloadable. Uploaders are nevertheless encouraged to provide URLs to public domain lexicons elsewhere, which could then be imported into the local copy of Nidaba.

Finally, users may also wish to run a local copy to make custom modifications, but I would prefer to receive suggestions for any useful modifications so that they can be implemented in the main web app.

4.3.5 Further applications

The set of computational tools I have outlined here was primarily designed for collating and analysing data to provide syllable structure parameter values. However, by making every step explicit and configurable, Nidaba has several secondary uses, including in experimental set-up and field work. For example, it can be used to:

• create a set of experimental stimuli from a set of constraints, such as ‘words with a minimum frequency of x with branching onsets’

• locate possible errors in transcriptions via unique distributional patterns

• locate cognates using shared glosses or phonemic features

• generate minimal pairs

• collaborate with other researchers during data collection, editing and analysis

• compare the effects of different phonemicisations

• find data, sources and collaborators for new languages

Nidaba contains functions for set comparison. Rather than manually comparing the results of searches, a user can specify two separate sets of criteria, and receive a list of segments which match either or both. The most basic use of this tool is in locating or verifying positional allophones. However, criteria for comparison can also include type or token frequency, part of speech, or other factors which may contribute to variation. Comparisons can also be performed with other doculects.

The segmental properties of a language are not interesting only in isolation, but in how they relate to languages of the same family, or with which they exchange lexical items. Nidaba contains multiple tools to aid in the investigation of cognacy and loanword adaptation.

Using the custom tagging system, lexical items can be glossed in multiple languages. Properties of these glosses can be used in filtering results to establish correspondences between putatively cognate items. Nidaba also contains a “word comparison” tool for comparing lexical items across multiple doculects, based on whole or partial overlap in transcription or gloss.

Nidaba contains a tool for generating minimal pairs. This tool provides examples of all instances in which transcriptions differ by only a single segment. Examples are grouped by contrasting segments, illustrating not just minimal pairs, but minimal triplets or larger sets.
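A minimal sketch of one way to generate such sets follows. It is not Nidaba's algorithm: the function name `minimal_sets` is hypothetical, each character is assumed to be one segment, and only substitutions are considered (not insertions or deletions). Each word is indexed by every "frame" obtained by removing one segment; words sharing a frame differ only in that position.

```python
from collections import defaultdict


def minimal_sets(lexicon):
    """Group words that differ in exactly one segment, by contrast.

    Returns a dict mapping a tuple of contrasting segments to a list of
    word sets; each word set is ordered like its contrast tuple.
    """
    frames = defaultdict(dict)
    for word in lexicon:
        for i, segment in enumerate(word):
            # The frame is the word with position i left open.
            frames[(word[:i], word[i + 1:])][segment] = word

    by_contrast = defaultdict(list)
    for members in frames.values():
        if len(members) > 1:  # at least two segments share this frame
            contrast = tuple(sorted(members))
            by_contrast[contrast].append([members[s] for s in contrast])
    return dict(by_contrast)


toy_lexicon = ["pɪt", "bɪt", "kɪt", "pɪn", "pæt"]
sets = minimal_sets(toy_lexicon)
for contrast, examples in sets.items():
    print(contrast, examples)
```

Because all words sharing a frame are collected together, a three-way contrast such as p/b/k is reported as a single triplet rather than as three separate pairs.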

In document Measuring phonological distance between languages (Page 44-48)