1.2 Introduction to Language Identification
1.2.4 Classification of LID Systems
Once we are aware of the main speech aspects that distinguish languages, the generic system presented above can be developed to capture one or several of those aspects. LID systems can be classified into groups according to the type of information they use, which basically depends on the type of features extracted from speech. We can distinguish the following types of LID systems:
• Raw waveform: these systems use the speech signal directly, without any transformation.
• Acoustic LID: it refers to the lowest level in the prosodic hierarchy of speech (presented below), and the goal is to represent the smallest stationary parts of the speech. Acoustic features are extracted in short windows and lack any linguistic meaning. They just model acoustic (commonly spectral) aspects of the speech.
• Acoustic-Phonetic LID: in these systems the basic features are different, such as articulatory features. However, acoustic features are normally extracted first and modified later to obtain the features of this group. For example, we can model each phoneme or phonetic transcription individually by grouping acoustic features in a continuous interval, if we have the transcribed text, or simply perform an unsupervised modelling by grouping similar consecutive acoustic features.
• Phonotactic LID: usually in this approach, a phoneme recognizer is used to extract the phonemes of each utterance, and for each language a model collecting how these phonemes are combined is built. During test, the phonemes extracted from each utterance are evaluated over each of the trained models. Sometimes, phonemes are grouped according to the articulatory features they belong to. These are more basic phonological units like manner or place of articulation. It is the most common token-based approach.
• Syllable-Based LID: systems where the basic units for classification are syllables. The token-based version is the syllabotactic approach, which follows the same idea as the phonotactic approach but with syllables instead of phonemes.
• Lexical LID: systems that extract lexical information like morphemes or words.
• Syntax LID: systems that extract syntactic information or are based on language rules.
• Prosodic LID: prosodic information capturing the rhythm, stress, and intonation of the speech is extracted and modeled for each target language. Prosodic information includes supra-segmental information, unlike the acoustic approach, which is focused on segmental information. Prosody can be modeled from the syllable level to the whole utterance.
• Large vocabulary continuous speech recognition (LVCSR) LID: this is the simplest idea but the most difficult to implement. Ideally, if we had a perfect automatic speech recognizer (ASR) for each target language, we could recognize the audio with all of them and observe which one gives the best output. The cost of building such a system is very high, because we would need many hours of labeled data for all the target languages, and very good ASR accuracies. Thus, this alternative is infeasible in most cases today.
• Hybrid Systems: systems that mix some of the previous approaches.
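As an illustration of the phonotactic approach in the list above, the following minimal sketch trains one bigram model per language on phoneme sequences and assigns a test utterance to the language whose model gives the highest average log-probability. It assumes the phoneme recognizer has already produced the sequences; the toy phoneme strings, the language labels `lang_a` and `lang_b`, the add-one smoothing, and all function names are hypothetical choices made only for this example, not the method of any particular system.

```python
import math
from collections import Counter


def train_bigram_model(sequences, vocab_size):
    """Return a function giving add-one-smoothed bigram log-probabilities."""
    bigrams = Counter()
    contexts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)  # "<s>" marks the utterance start
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    def log_prob(prev, cur):
        return math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))

    return log_prob


def score(log_prob, seq):
    """Average bigram log-probability of a phoneme sequence under one model."""
    padded = ["<s>"] + list(seq)
    total = sum(log_prob(p, c) for p, c in zip(padded, padded[1:]))
    return total / max(len(seq), 1)


def identify(models, seq):
    """Pick the language whose phonotactic model scores the sequence highest."""
    return max(models, key=lambda lang: score(models[lang], seq))


# Hypothetical toy training data: decoded phoneme sequences per language.
training = {
    "lang_a": [list("kasa"), list("mesa")],
    "lang_b": [list("stri"), list("spri")],
}
vocab = {ph for seqs in training.values() for seq in seqs for ph in seq}
models = {
    lang: train_bigram_model(seqs, len(vocab)) for lang, seqs in training.items()
}

print(identify(models, list("kasa")))
print(identify(models, list("stri")))
```

The same structure would apply to the syllabotactic approach by replacing phonemes with syllable tokens.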
The hybrid approach is a typical solution used successfully by many researchers because it combines different types of information which are complementary. The most used systems are the acoustic, the phonotactic, and the prosodic. The results are generally improved with respect to a single system alone. The fusion strategy can differ. Some authors fuse at the feature level, by concatenating features from different sources, but the most widespread strategy is to fuse at the score level, where two or more complete LID classifiers belonging to different categories are built, and the scores of each of those systems are combined in a controlled way.
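Score-level fusion as described above can be sketched as a weighted sum of the per-language scores produced by each subsystem. This is only one simple possibility; the system names, the scores, and the weights below are hypothetical values chosen for illustration, and in practice the weights would typically be estimated on held-out development data rather than fixed by hand.

```python
def fuse_scores(system_scores, weights):
    """Linear score-level fusion: weighted sum of per-language scores."""
    languages = next(iter(system_scores.values())).keys()
    return {
        lang: sum(
            weights[name] * scores[lang] for name, scores in system_scores.items()
        )
        for lang in languages
    }


# Hypothetical log-likelihood-style scores from three complete LID systems.
system_scores = {
    "acoustic": {"english": -1.2, "spanish": -0.8},
    "phonotactic": {"english": -0.5, "spanish": -1.5},
    "prosodic": {"english": -1.0, "spanish": -0.9},
}
# Hypothetical fusion weights (assumed, not taken from any real system).
weights = {"acoustic": 0.5, "phonotactic": 0.3, "prosodic": 0.2}

fused = fuse_scores(system_scores, weights)
decision = max(fused, key=fused.get)
print(fused, decision)
```

In this toy case the acoustic and prosodic systems alone would prefer "spanish", while the fused score prefers "english" because the phonotactic system strongly disagrees, which illustrates how complementary information can change the decision.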
Other ideas have arisen over the years that could be considered as belonging to other categories not mentioned above. We will make an extensive review of the literature in Chapter 3 and analyze the different approaches in more depth.
In addition to the previous classification, LID systems can also be grouped into two categories, depending on the nature of the extracted features. We find:
• Vector-based methods: every utterance is represented by a set of continuous vectors. Those vectors are directly used for classification.
• Token-based methods: the speech signal is segmented into a set of discrete units, like phonemes or Gaussian component indices. The token stream is used to obtain a model of the frequency of occurrence of tokens and combinations of tokens, like n-gram counts, that will be used as features. Phonotactic and syllabotactic systems fall into this category, but they are not the only ones, since other levels of information in the speech, like prosody, could also be tokenized.
The levels of the features that give rise to the different types of systems presented above are normally associated with the levels of analysis of language [Ambikairajah et al., 2011], from deeper fields focused primarily on form to surface fields focused primarily on meaning. Figure 1.6 shows the different levels of linguistic structure and the LID systems corresponding to each category. To date, there are no known systems using semantic or higher-level information, mainly due to the complexity that such a system would require.
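To make the token-based representation mentioned above more concrete, the following minimal sketch turns a stream of discrete tokens into a fixed-length vector of relative n-gram frequencies. The token values (standing in for Gaussian component indices), the inventory, and the function names are hypothetical and serve only as an illustration.

```python
from collections import Counter
from itertools import product


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the stream."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_feature_vector(tokens, inventory, n):
    """Relative n-gram frequencies, ordered over all n-grams of the inventory."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    return [
        counts[gram] / total if total else 0.0
        for gram in product(inventory, repeat=n)
    ]


# Hypothetical token stream, e.g. the index of the top Gaussian per frame.
tokens = [3, 1, 3, 1, 2]
inventory = [1, 2, 3]

features = ngram_feature_vector(tokens, inventory, n=2)
print(features)
```

The vector has one entry per possible n-gram, so its length grows as the inventory size raised to the power n; this is one reason why small values of n are the usual choice in such a sketch.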
Although the linguistic levels and the LID system groups are well interconnected, we think that there are still some overlaps which make this association unclear. Actually, we have observed that the classification of LID systems matches better with the prosodic hierarchy theory of prosodic phonology [Selkirk, 1978]. This theory postulates that syntactic and phonological representations are not isomorphic and that there is a distinct level of representation called prosodic structure which contains a hierarchically organized set of prosodic constituents [Elordieta, 2008]. The prosodic hierarchy is the name for an ordered set of prosodic category types [Selkirk, 2011]. These types develop a syntactic structure that triggers the phonological rules [Hayes, 1989]. The levels in prosodic and syntactic hierarchies of speech are shown in Figure 1.7, together with the corresponding LID approach.

Figure 1.6: LID Systems classification according to linguistic structure - Classification of LID systems according to the linguistic level they are based on.