Nidaba : A segment distribution database for measuring language
4.5 Similar databases and tools
In this section, I discuss nine existing databases and computational tools that are similar in function to Nidaba, and explain what makes Nidaba unique.
4.5.1 AusPhon-Lexicon
The AusPhon-Lexicon project (Round, 2017) is a ‘data warehouse’ currently containing normalised lexicons for 166 Australian language varieties, with data querying tools including an extended regular expression language.
Nidaba is effectively an application of this idea, trading depth of analysis for universality: its users must clean and normalise their own data, but are not restricted to a given language family.
4.5.2 World Phonotactics Database
The World Phonotactics Database has the broadly similar aim of providing a typology of parameters (termed ‘features’) which describe syllable structure. However, it does not have any parameters dealing with sonority, which forms the basis for many phonotactic formulations (e.g. Blevins, 1995). The raw data is not available, so users cannot verify how parameter value choices were made; this also limits flexibility in adding extra parameters, or in making alternative choices using different cues.
Nidaba, by contrast, is primarily concerned with distributional data, including place and manner information. It aims to provide the tools necessary for users to replicate my results. It is also intended to be sufficiently flexible that users can make different assumptions about valid input data, phonemic representation, sonority, or syllable structure, or add new parameters.
4.5.3 P-base
P-base (Mielke, 2008) “is a database of several thousand sound patterns in 500+ languages”.
However, these are not distributional patterns but processes such as nasalisation or devoicing.
Again, the data on which these patterns are based is not available to the user.
Nidaba can be used to duplicate some of the functionality of P-base, by inputting a narrowly transcribed wordlist, and searching for particular combinations of properties. In this way, the results of P-base can be verified, and specific examples of its sound patterns found in a lexicon.
However, the primary purpose of Nidaba is to look at more static distributional patterns.
4.5.4 TalkBank
The TalkBank project (MacWhinney, 2000) comprises CHILDES (Child Language Data Exchange System) and other corpora. Each corpus contains audio and/or video recordings and a transcription of the data in CHAT format. This is the input format for the accompanying analysis program CLAN, which performs various kinds of discourse analysis. Among the analyses is token frequency, which is a useful input to Nidaba. CLAN also includes PHONFREQ, which performs similar functions to Nidaba’s segment search, but with much less powerful search tools.
Another accompanying analysis program is Phon (Rose et al., 2006). Phon contains tools for searching by features, like Nidaba; but its use case is analysing a spoken corpus, not a lexicon, and it does not contain tools for comparison between different phonemic analyses or languages.
4.5.5 Phonology Assistant
Phonology Assistant (SIL, 2008) provides tools for inventory analysis, given a corpus of transcription data. Whilst Nidaba provides a basic inventory tool, its main focus is instead on distributional data.
4.5.6 PHOIBLE
PHOIBLE (Moran, McCloy and Wright, 2014) “is a repository of cross-linguistic phonological inventory data”. Its two guiding principles have also been applied to Nidaba, namely that all data should be encoded in Unicode IPA, and that data from multiple doculects should be faithfully included. Nidaba also includes much information beyond inventory data, e.g. it cross-references all phonemes with lexical items, to aid in the treatment of marginal items.
4.5.7 CLTS
CLTS (List, 2017) is “a cross-linguistic database of phonetic notation systems”. When complete, this will be a useful source for generating or verifying transcription conversions for Nidaba, a task which is currently a manual process for individual researchers.
4.5.8 ILSP PsychoLinguistic Resource
ILSP PsychoLinguistic Resource (Protopapas et al., 2012, located at speech.ilsp.gr/iplr/) provides computational tools for in-depth search and analysis of Greek, based on two printed text corpora. Many of its tools are similar in function to Nidaba tools: returning subsets of a corpus based on length, frequency, and syllable structure. The available data for Greek is more extensive than that in Nidaba, including orthographic / phonological ‘neighbours’ of lexical items, as measured by Levenshtein distance; stress; and ‘orthographic transparency’ (predictability of grapheme/phoneme correspondence); but it is limited to Greek.
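To make the notion of Levenshtein ‘neighbours’ concrete, the following is a minimal sketch in Python. It is not the ILSP implementation: the function names `levenshtein` and `neighbours`, the distance threshold of 1, and the toy English wordlist are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, counting single-symbol
    insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def neighbours(word, lexicon, max_distance=1):
    """Return the lexicon entries within max_distance edits of word
    (hypothetical helper; the threshold is an assumption)."""
    return [w for w in lexicon
            if w != word and levenshtein(word, w) <= max_distance]


# Invented toy wordlist, for illustration only.
toy_lexicon = ["cat", "cot", "coat", "dog", "cart"]
print(neighbours("cat", toy_lexicon))  # ['cot', 'coat', 'cart']
```

The sketch compares strings one character at a time, so it suits orthographic neighbours directly; for phonological neighbours, each word would presumably need to be represented as a sequence of segments rather than characters.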
4.5.9 SYLLABARIUM
SYLLABARIUM (Duñabeitia et al., 2010) is a web tool for examining syllables in Spanish and Basque. It provides similar functions to Nidaba in locating type and token frequency of different syllables, but is limited to orthographic data, and only in those two languages.
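The distinction between type and token frequency of syllables can be sketched as follows. This is an illustration under stated assumptions, not the method of SYLLABARIUM or Nidaba: words are assumed to be already syllabified with hyphens, type frequency is taken as the number of syllable occurrences across lexical entries, token frequency weights each occurrence by the word's corpus count, and the counts in `toy_lexicon` are invented.

```python
from collections import Counter


def syllable_frequencies(lexicon):
    """Compute type and token frequency for each syllable.

    lexicon maps a hyphen-syllabified word to its corpus count.
    Both the input format and the definitions of the two measures
    are assumptions made for this sketch.
    """
    type_freq = Counter()
    token_freq = Counter()
    for syllabified, count in lexicon.items():
        for syllable in syllabified.split("-"):
            type_freq[syllable] += 1       # one per occurrence in a lexical entry
            token_freq[syllable] += count  # weighted by the word's corpus count
    return type_freq, token_freq


# Invented counts, for illustration only.
toy_lexicon = {"ca-sa": 10, "ca-ma": 5, "me-sa": 3}
type_freq, token_freq = syllable_frequencies(toy_lexicon)
print(type_freq["ca"], token_freq["ca"])  # 2 15
print(type_freq["me"], token_freq["me"])  # 1 3
```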
4.6 Languages
Nidaba contains phonemically transcribed⁷ word lists for the following languages:
• Ambel, an Austronesian language (fieldwork of Laura Arnold)
• Cheke Holo, an Oceanic language (White, Kokhonigita and Pulomana, 1988)
• Dutch (CELEX: Baayen, Piepenbrock and Rijn, 1993)
• English (CELEX: Baayen, Piepenbrock and Rijn, 1993)
• French (Lexique3: New, Pallier et al., 2001)
• German (CELEX: Baayen, Piepenbrock and Rijn, 1993)
• Greek (GreekLex: Ktori, Heuven and Pitchford, 2008)
• Hrusso Aka, a Tibeto-Burman language (fieldwork of Vijay D’Souza: D’Souza, 2015)
• Lithuanian (Tang and Harris, In prep(a))
• Matbat, an Austronesian language (Remijsen, 2015)
• Portuguese (PorLex: Gomes and Castro, 2003)
• Polish (Tang and Harris, In prep(b), Howell et al., 2017)
• Romanian (Tang and Harris, In prep(c), Howell et al., 2017)
• Spanish (EsPal: Duchon et al., 2013)
• Sylheti, an Indo-Aryan language (SOAS Sylheti Project, 2015)
• Welsh (Ellis et al., 2001)
⁷ The exact type of transcription varies widely between projects. Many, including Greek, Lithuanian, Polish, Romanian and Spanish, have been derived by applying pronunciation rules to orthography, with resulting oddities.
4.6.1 Phonemic inventories
For the following languages, the source (or at least reference) of the phonemic transcription is separate from the source of the lexicon: Cheke Holo (Corretta, p.c.); Dutch (CELEX: Burnage, 1990); English (CELEX: Burnage, 1990); German (CELEX: Burnage, 1990); Sylheti (Eden, in press); and Welsh (pronunciation data from Williams, Jones and Uemlianin, 2006, converted into transcription by Florian Breit). In the case of English, I adapted the DISC transcription system to remove nasal vowels: æ̃ː → ɒ; ɑ̃ː → ɒ; æ̃ → ɑː; ɒ̃ː → æ.
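A replacement of this kind can be sketched in Python as below. This is only an illustration of the four substitutions listed above, not the code used for Nidaba: it assumes the transcriptions are IPA strings in which nasalisation is written with the combining tilde (U+0303), whereas the DISC encoding in CELEX may use different symbols, and the names `NASAL_VOWEL_MAP` and `remove_nasal_vowels` are hypothetical.

```python
import re

# The four substitutions listed in the text, written with explicit
# code points so that the combining tilde (U+0303) is unambiguous.
NASAL_VOWEL_MAP = {
    "\u00e6\u0303\u02d0": "\u0252",  # long nasal ae      -> ɒ
    "\u0251\u0303\u02d0": "\u0252",  # long nasal ɑ       -> ɒ
    "\u00e6\u0303": "\u0251\u02d0",  # short nasal ae     -> ɑː
    "\u0252\u0303\u02d0": "\u00e6",  # long nasal ɒ       -> æ
}

# Longer keys come first so that the long nasal ae is matched
# before its short counterpart, which is a prefix of it.
_NASAL_PATTERN = re.compile("|".join(
    re.escape(key)
    for key in sorted(NASAL_VOWEL_MAP, key=len, reverse=True)
))


def remove_nasal_vowels(transcription: str) -> str:
    """Replace each nasal vowel by its oral substitute in a single pass."""
    return _NASAL_PATTERN.sub(
        lambda match: NASAL_VOWEL_MAP[match.group(0)], transcription
    )
```

Performing all substitutions in a single regular-expression pass means that the output of one rule is never re-processed by another, and sorting the keys by length avoids the short vowel rule firing on the first two code points of the long vowel.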