
Shibboleth

A Multilingual Language Identifier

Edvin Ullman

Uppsala University

Department of Linguistics and Philology

Master’s Programme in Language Technology

Master’s Thesis in Language Technology

October 6, 2014

Supervisors:


Abstract

This thesis presents a system for multilingual language identification, the purpose of which is to provide accurate detection of foreign language inclusions in multilingual documents. The presented system was developed with the intention of being used together with a polyglot speech synthesis system, but it can be reimplemented to suit any language identification task. The system employs a novel algorithm, which recursively analyses, tags, and reanalyses until guaranteed convergence. The analysis relies on an interpolation of character quadgram probabilities, bi-directional word bigram probabilities, and frequency dictionary unigram probabilities. In the final step, a single simple cleanup heuristic is employed to aid in correcting words that are particularly troublesome to tag. The system is trained to identify five languages: English, French, German, Spanish and Swedish. Testing was performed on the Stockholm-Umeå Corpus, as well as five synthetic multilingual corpora, using twelve different weighting schemes. The results were evaluated, and precision, recall, F1-score, and true negative rate were calculated for the main language in each corpus.

The evaluation shows that the system, whilst maintaining high precision throughout the experiments for all of the main languages, suffers from either low recall or low true negative rate.


List of Tables

4.1 Size of training sets

4.2 Size of development and test sets

4.3 Different system settings

4.4 Results SUC

4.5 Confusion matrix SUC

4.6 Results Synthetic English

4.7 Confusion matrix Synthetic English

4.8 Results Synthetic French

4.9 Confusion matrix Synthetic French

4.10 Results Synthetic German

4.11 Confusion matrix Synthetic German

4.12 Results Synthetic Spanish

4.13 Confusion matrix Synthetic Spanish

4.14 Results Synthetic Swedish

4.15 Confusion matrix Synthetic Swedish


Acknowledgements

This thesis is the end product of many days and nights of hard work, and it could not have been done without the ongoing support, feedback and help from my beloveds, my friends, my colleagues, my supervisors.

Thank you Hanna, for your support and understanding.

Thank you Jimmy, for your continuous interest in my progress, and help with some of my worst theoretical headaches.

Thank you Andreas, Erik and Filip, for your keen interest in my project and the welcoming atmosphere at ReadSpeaker. A special thank you to Erik for helping me generate the synthetic corpora.

Thank you Kåre, for giving me the possibility of doing this study, and for your enthusiasm and encouragement.

Finally, thank you Joakim, for all the help and support I got, and for all the help and support I would have gotten, should I have asked for it.


1 Introduction

In this truly international day and age, completely monolingual text data is becoming increasingly rare. Languages borrow from one another, international companies brand their products in their own native tongue, and day-to-day interaction through social media often takes place in a multilingual context. In other words, code switching and code mixing are quickly becoming the new norm.

This makes the task of language identification an important building block for information gathering and natural language processing applications. In the text-to-speech industry, this multilingual environment poses a two-fold problem.

Firstly, language identification is crucial for accurately processing the text from which speech is to be produced. Secondly, the speech synthesizer needs to be able to handle multiple phonotactic systems in order to produce intelligible and fluent speech.

The first of these two tasks is the focal point of this master’s thesis, and a novel system is developed with the intention of being used together with a polyglot text-to-speech application. The theoretical framework, however, can easily be adapted to practically any task in which multilingual language identification is needed.

1.1 Purpose

The purpose of this study is to develop a system for multilingual language identification on word level. Testing and evaluation of the system is performed on the Stockholm-Umeå Corpus (SUC), as well as five synthetically generated multilingual corpora. SUC is, at least partially, tagged with language information, and as such it is suitable to evaluate a multilingual language identification system on, but due to the sparseness of foreign language inclusions in SUC, further experiments were conducted on the synthetic corpora. The generation of these synthetic corpora is described in section 4.1.

The question the experiments aim to answer is the following:

• How accurately can the developed system detect foreign language inclusions?

To evaluate the developed system, a series of experiments using a number of different parameter setups is performed. The relevant metrics are precision, recall, F1-score and true negative rate, all calculated for the detected main language.

The developed system uses a number of pretrained models for language identification, together with a novel algorithm for dynamically and recursively recalculating scores based on context.


1.2 A brief note on the name

The name of the developed system is Shibboleth. The term is originally a Hebrew word, the meaning of which is irrelevant to its use in modern English.

In an account in the Hebrew Bible, the word is used by the Gileadites to identify Ephraimites, whose dialect lacked the phoneme /ʃ/. Today, the term denotes some form of discriminant used to distinguish in-groups from out-groups. Normally, this is some linguistic feature, but in an extended use of the term other aspects of life can be used as discriminants.

The Shibboleth multilingual language identification system makes extensive use of the concept of in-groups versus out-groups. A contextual language is identified as the in-group, and the detection of out-groups in this in-group context is fundamental in the system, which will be described in detail in chapter 3.

1.3 Outline

In chapter 2, a brief overview of the current state of language identification, monolingual as well as multilingual, is given. An overview of different methods for language identification is presented, with a more in-depth look at the specific methods used in the Shibboleth multilingual language identification system. In chapter 3, the Shibboleth multilingual language identification system is presented in its entirety. In chapter 4, the experiments will be described in detail, and the results will be presented. In chapter 5, the results will be discussed, error analysis will be performed, and potential areas for future research will be covered. Finally, in chapter 6, concluding remarks will be made.


2 Background

Language identification for written material has a history of mainly focusing on monolingual identification (Hughes et al., 2006). This is not to say that multilingual language identification is a completely novel idea, nor is it to say that the methods applied in multilingual language identification, or in this thesis for that matter, have not been used in monolingual language identification.

There is an abundance of studies with different approaches to monolingual language identification, many of which can claim a near-perfect result under their specific testing conditions. In a study by Cavnar and Trenkle (1994), monolingual language identification using character n-grams is shown to reliably identify the language of both shorter and longer texts.

There are a number of relatively recent studies on multilingual language identification. Linguistic approaches, targeted at identifying linguistic characteristics such as function words or word endings, have been shown to produce reliable multilingual sentence-level language identification (Giguet, 1995).

Rule-based identification using multiple layered monolingual grammars together with inclusion grammars has been shown by Romsdorfer and Pfister (2007) to reliably identify not only foreign inclusions, but also, to some extent, mixed-lingual words.

Other approaches regard language identification as a pattern identification problem, relying to a large extent on machine learning methods, and less on genuinely linguistically motivated ones. Stensby et al. (2010) use mixed order n-grams to generate a feature profile for each language the system is supposed to identify, and then classify each sentence based on the distance between the feature vectors of its constituents and these feature profiles.

In Nguyen and Dogruoz (2013), word-level language identification is performed on online discussions, accurately identifying Dutch and Turkish using a number of different methods, including character n-grams, word n-grams, logistic regression and more.

Many different methods have been applied to the task of language identification. In this section, a number of these methods will be described, some briefly, and the ones relevant to the Shibboleth system in a more comprehensive fashion.

2.1 Word N-Gram Modelling

A way of modelling language is to establish conditional probabilities for a string of words. Unlike a frequency dictionary, the conditional probability of a word given another results in fewer problems with bilingual or multilingual homographs.

However, as a multilingual environment is expected, monolingual language models are bound to fail when a foreign language inclusion occurs. Thus, word-for-word language identification cannot rely solely on word n-grams. In a recent study by Zampieri et al. (2013), different varieties of Spanish were accurately identified using a system which combined character-level n-gram models, word-level language models and POS distributions. In the Shibboleth system, bi-directional word bigram models are used as one way of identifying the language of a word.

2.2 Character N-Gram Modelling

A common approach to language identification, character n-gram modelling is a way of extracting graphotactical information from a text by means of partitioning each token to be tagged into its constituent character n-grams.

These are then compared to a number of pre-trained character n-gram models, one for each of the languages the system is supposed to identify. These character n-gram models are trained in a similar fashion. By partitioning a large corpus into its constituent character n-grams and applying the same methodological approach as with word n-gram modelling, language-specific character n-gram models are obtained that are also conditionally probabilistic in nature.
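The partitioning step described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems; the function name `char_ngrams` is hypothetical, and no padding at the token boundaries is assumed.

```python
def char_ngrams(token: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a token, left to right.

    Tokens shorter than n yield an empty list (no boundary padding is assumed).
    """
    return [token[i:i + n] for i in range(len(token) - n + 1)]


print(char_ngrams("språk", 4))  # ['språ', 'pråk']
```

The same function applied to a whole training corpus, one token at a time, would give the n-gram counts from which a language-specific model can be estimated.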

In Cavnar and Trenkle (1994), near-perfect identification of a multitude of languages is obtained by essentially creating a feature profile of n-grams of varying values for n for each language, and comparing these to a similarly generated feature profile for the text to be tagged.

In a recent study on identifying multilingual documents (Lui et al., 2014), a robust system using information gain to extract features from a similarly generated set of character n-grams is shown to outperform the winner of the shared task on multilingual language identification at ALTW 2010.

2.3 Frequency Dictionaries

Perhaps the most intuitive way of doing language identification is to rely on dictionary lookups. Frequency dictionaries for each language of interest can provide some information about the likelihood of a token belonging to that language.

However, as intuitive as it might seem, relying on information on word level has been shown to be notoriously unreliable. Even “common words” lists have long been regarded as a dead end in terms of language identification (Dunning, 1994). “Common words” lists, or frequency dictionaries, will not work reliably in the ever-changing environment that is language, and language-specific, distinctive and common suffixes or inflections would be caught by a character n-gram model either way.

Nevertheless, when disambiguating between two possible candidate languages, a frequency dictionary could potentially still be of relevance, and since a frequency dictionary lookup is O(log n), it is hardly harmful to consider it.


2.4 Other Methods

In Stensby et al. (2010), classification was done on data from real life electronic applications, in which language is seldom consistent, and text lengths often short. This language environment prompted the authors to use weak estimators, rather than strong estimators, or estimators that converge with probability 1.

They compiled language profiles with mixed order character n-grams and, using their weak estimator, continuously updated the probability vectors for a given sentence. The resulting system accurately identified even sentences of ten tokens or fewer.

With enough linguistic knowledge, an explicit grammar can be written to describe in what way a foreign language inclusion may occur. Romsdorfer and Pfister (2007) made use of such knowledge, and in their study they were able to detect inclusions even on a morpheme level. The developed system was devised around a mixed-lingual morphological and syntactical analyzer, with inclusion grammars for each combination of languages the system was trained to handle. The authors note that as the inclusion grammars were only about 5% of the size of a corresponding monolingual grammar, expansion of the system does not require much data.

The world wide web is becoming an increasingly promising resource for many natural language processing tasks, language identification included, and in Alex (2005), Google lookup is used together with readily available linguistic resources to identify English inclusions in German text.


3 Shibboleth System Description

The Shibboleth multilingual language identifier takes as input a string of words. A sentence start token and a sentence end token are added at the respective ends. The string is then analyzed, word for word, relying on three separate models. An interpolated probability is obtained for each language the system is trained to handle, for each word in the string. The geometrical mean is calculated for all languages, and each word in the string is assigned the language with the highest geometrical mean. The probabilities are then reweighted based on the geometrical means, effectively lowering the probabilities of languages with a low geometrical mean of probabilities, and raising the probabilities of languages with a high geometrical mean of probabilities. The system then tries to identify disagreements within the string. A disagreement is found when the language with the highest probability for a specific word differs from the assigned language. If there are more disagreeing words directly after the found disagreement, these are joined together to form a substring of disagreeing words.

When a non-disagreeing word is found, a new geometrical mean of probabilities is calculated for the substring, each word in the substring is assigned the language with the highest geometrical mean, the probabilities are reweighted, and finally, all words are checked for disagreements. If a disagreement is one word in length, the language with the highest probability is assigned to it. This process is iterative, and guaranteed to terminate with either no more identified disagreements, or a disagreement of length one. Once all disagreements have been handled, a simple cleanup heuristic is applied to handle irregularities that may occur with shorter words.

The Shibboleth system relies on probabilities from three sources, one of which is a frequency dictionary. The frequency dictionaries are calculated on monolingual corpora beforehand, smoothed using Laplace smoothing, and normalized, resulting in a probability distribution in the range 0–1 (1). Each word in the input string is looked up in the frequency dictionaries of all languages the system is trained to handle, thus obtaining one unigram probability for the word at index $i$ for each language.

$$P(w_i) = \frac{C(w_i) + 1}{N + V} \tag{1}$$
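As a rough illustration of equation (1), the sketch below builds a Laplace-smoothed frequency dictionary from a toy token list. It assumes, as is conventional for Laplace smoothing but not stated explicitly in the text, that N is the number of training tokens and V the number of distinct word types. The function name `train_unigram` and the toy data are hypothetical.

```python
from collections import Counter


def train_unigram(tokens):
    """Return a function giving the Laplace-smoothed unigram probability of a word."""
    counts = Counter(tokens)
    n_tokens = len(tokens)  # N: assumed to be the number of training tokens
    vocab = len(counts)     # V: assumed to be the number of distinct word types

    def prob(word):
        # Equation (1); a Counter returns 0 for unseen words.
        return (counts[word] + 1) / (n_tokens + vocab)

    return prob


p_sv = train_unigram("jag och du och vi".split())
print(p_sv("och"))  # (2 + 1) / (5 + 4)
print(p_sv("the"))  # (0 + 1) / (5 + 4)
```

One such function per language would give the per-language unigram probabilities that are looked up for each input word.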

However, as frequency dictionaries are relatively unreliable in the context of language identification, the system also relies on a bi-directional word bigram model. For each language the system is trained to handle, two bigram models are calculated on monolingual corpora as before, one for the conditional probability of the word at index $i$ given the word at index $i-1$, and one for the conditional probability of the word at index $i$ given the word at index $i+1$. The two models are then smoothed using Laplace smoothing and normalized, resulting in a probability distribution in the range 0–1 (2)(3). The input string is partitioned into bigrams, starting with the sentence start token and ending with the sentence end token, and each bigram is subsequently looked up in all bigram models.

The two probabilities are evenly interpolated, thus obtaining one bi-directional conditional bigram probability for the word at index $i$ for each language.

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V} \tag{2}$$

$$P(w_i \mid w_{i+1}) = \frac{C(w_i, w_{i+1}) + 1}{C(w_{i+1}) + V} \tag{3}$$
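Equations (2) and (3) and their even interpolation can be sketched as follows. This is an illustrative reading rather than the thesis implementation. The names `train_bigrams` and `bidirectional_prob`, the boundary symbols `<s>` and `</s>`, and the choice of V as the number of distinct types (boundary tokens included) are all assumptions.

```python
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical sentence boundary tokens


def train_bigrams(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        padded = [START, *sent, END]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams, len(unigrams)  # V assumed to be the type count


def bidirectional_prob(prev, word, nxt, model):
    """Even interpolation of P(word | prev), eq. (2), and P(word | nxt), eq. (3)."""
    unigrams, bigrams, v = model
    given_prev = (bigrams[(prev, word)] + 1) / (unigrams[prev] + v)
    given_next = (bigrams[(word, nxt)] + 1) / (unigrams[nxt] + v)
    return 0.5 * given_prev + 0.5 * given_next


model_sv = train_bigrams([["jag", "ser", "dig"]])
print(bidirectional_prob("jag", "ser", "dig", model_sv))  # seen in both directions
print(bidirectional_prob("jag", "the", "dig", model_sv))  # unseen, smoothed only
```

A word that fits neither its left nor its right neighbour in a given language falls back to the smoothing mass alone, which is what makes a bilingual bigram detectable.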

One of the more common ways of doing language identification is to analyze graphotactical information. In a similar fashion as with a word n-gram model, a character n-gram model can be obtained from monolingual corpora to be used in language identification. For the Shibboleth system, a character quadgram model is generated for each language the system is trained to handle. To calculate the conditional probability of a character quadgram, four separate models need to be obtained. The first is a model for the probability of the character at index $j$ (4), which essentially is a frequency dictionary for characters. The second is a model for the conditional probability of the character at index $j$, given the character at index $j-1$ (5). The third is a model for the conditional probability of the character at index $j$, given the string of characters from index $j-2$ to $j-1$ (6). The fourth is a model for the conditional probability of the character at index $j$, given the string of characters from index $j-3$ to $j-1$ (7).

Probabilities from these models are then interpolated to create a character quadgram model (8). In the experiments run, the interpolation was symmetric, with even weight given to unigrams, bigrams, trigrams, and quadgrams. Each input word is partitioned into quadgrams, starting with the first character as the character at index $j$, and ending with the last character as the character at index $j-3$. Probabilities are retrieved from the models of all languages, and the geometrical mean of the retrieved probabilities is calculated, thus obtaining a single probability for the word at index $i$ for each language.

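$$P(c_j) = \frac{C(c_j)}{N} \tag{4}$$

$$P(c_j \mid c_{j-1}) = \frac{C(c_{j-1}, c_j)}{C(c_{j-1})} \tag{5}$$

$$P(c_j \mid c_{j-2}, c_{j-1}) = \frac{C(c_{j-2}, c_{j-1}, c_j)}{C(c_{j-2}, c_{j-1})} \tag{6}$$

$$P(c_j \mid c_{j-3}, c_{j-2}, c_{j-1}) = \frac{C(c_{j-3}, c_{j-2}, c_{j-1}, c_j)}{C(c_{j-3}, c_{j-2}, c_{j-1})} \tag{7}$$

$$\begin{aligned}
P(c_j \mid c_{j-3}, c_{j-2}, c_{j-1}) ={} & P(c_j) \times \lambda_1 + P(c_j \mid c_{j-1}) \times \lambda_2 \\
& + P(c_j \mid c_{j-2}, c_{j-1}) \times \lambda_3 + P(c_j \mid c_{j-3}, c_{j-2}, c_{j-1}) \times \lambda_4, \\
& \lambda_i > 0, \quad \lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1
\end{aligned} \tag{8}$$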
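The interpolation in equation (8) and the per-word geometrical mean can be sketched as below. This is one possible reading, not the thesis code. It assumes that a term whose history does not fit inside the word, or was never seen in training, simply contributes zero, whereas the thesis may pad word boundaries instead. The function names are hypothetical and the equal weights follow the symmetric setting described above.

```python
from collections import Counter
from math import prod


def train_char_counts(text, max_n=4):
    """Count all character n-grams of order 1..max_n in a training text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts, len(text)


def interpolated_prob(history, ch, counts, total,
                      lambdas=(0.25, 0.25, 0.25, 0.25)):
    """Equation (8): weighted sum of the estimates in equations (4) to (7)."""
    p = lambdas[0] * counts[ch] / total
    for k in range(1, len(lambdas)):
        ctx = history[-k:]
        # Assumption: unavailable or unseen histories contribute nothing.
        if len(history) >= k and counts[ctx] > 0:
            p += lambdas[k] * counts[ctx + ch] / counts[ctx]
    return p


def word_score(word, counts, total):
    """Geometrical mean of the interpolated probabilities of a word's characters."""
    probs = [interpolated_prob(word[max(0, j - 3):j], c, counts, total)
             for j, c in enumerate(word)]
    if not probs:
        return 0.0
    return prod(probs) ** (1 / len(probs))


counts, total = train_char_counts("abab")
print(word_score("ab", counts, total) > word_score("ba", counts, total))  # True
```

In practice, a product of many small probabilities is usually computed in log space to avoid underflow; the direct product is kept here only to mirror the description.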


Once all words from the input string have been run through the aforementioned three models, the probabilities from the separate models are interpolated according to an adjustable weighting scheme. For each word, the languages are then ordered from the language with the highest interpolated probability to the language with the lowest interpolated probability. The geometrical mean of the probabilities is then calculated for each language over the whole input string, producing a second ordered list from the language with the highest probability to the language with the lowest probability. The language with the highest probability in this list is called the contextual language. This list is used for two things. First, every word in the input string is assigned the contextual language as its language. Second, the probabilities are used as a reweighting scheme, adding and removing weight in accordance with this list of probabilities, for all languages of all words in the input string. The amount of reweighting is governed by an adjustable weighting coefficient.

For the experiments run, four different weighting schemes for the interpolation of the probabilities, and three different weighting coefficients for the reweighting were used, producing a total of twelve separate experimental setups.
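The contextual language and the reweighting step can be sketched as follows. The sketch follows rows 5 to 17 of Algorithm 1 as far as they can be read: the contextual probability is taken to be the geometrical mean over the words, normalized across languages, and the update adds γ times that value to every interpolated probability. The function names and the toy probabilities are hypothetical, and the normalization detail is an assumption.

```python
from math import prod


def contextual_probs(p_int):
    """p_int: one dict per word, mapping language to interpolated probability."""
    langs = p_int[0].keys()
    geo = {l: prod(w[l] for w in p_int) ** (1 / len(p_int)) for l in langs}
    z = sum(geo.values())  # normalize across languages (assumed)
    return {l: g / z for l, g in geo.items()}


def reweight(p_int, gamma=0.1):
    """Return the contextual language and the reweighted per-word probabilities."""
    p_con = contextual_probs(p_int)
    contextual = max(p_con, key=p_con.get)
    updated = [{l: p + gamma * p_con[l] for l, p in w.items()} for w in p_int]
    return contextual, updated


toy = [{"sv": 0.6, "en": 0.4}, {"sv": 0.7, "en": 0.3}, {"sv": 0.4, "en": 0.6}]
contextual, updated = reweight(toy)
print(contextual)  # sv
```

With γ set to 0, as in settings 9 to 12 of the experiments, `updated` is identical to the input and only the language assignment has any effect.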

Once all words in the input string have been assigned a language in accordance to the contextual language, the search for disagreements begins. A word is said to disagree if the assigned language differs from the language with the highest interpolated probability for that word, to within a certain threshold.

The threshold is there to allow for some flexibility in the cases where a word is almost equally probable to belong to one language or another. A disagreement is said to be verified if it falls outside said threshold. If one or more disagreeing words are identified and verified directly after another disagreeing word, these are joined together to form a disagreeing substring. Once a non-disagreeing word is identified, the system restarts the process of identifying a contextual language, as previously described, for the words in the disagreeing substring. Once the geometrical mean of the probabilities has been computed for all languages in the substring, language assignment and reweighting take place. Finally, the substring is again checked for disagreements. If there is only one disagreeing word, the language with the highest interpolated probability is directly assigned to that word. This process is guaranteed to converge, with either one disagreeing word, which is then assigned the language with the highest interpolated probability, or with no disagreeing words left. The entirety of this process is described in algorithm 1.


Algorithm 1 Contextual Language

Require: Languages l ∈ L
Require: String of words w_0, ..., w_n
Require: Interpolated probabilities P_INT(l|w_i)
Output: String of languages l_0, ..., l_n

1: procedure ContextualLanguage(w_0, ..., w_n)
2:     ComputeContextualProbability(w_0, ..., w_n)
3:     RecalculateScoresAndAssignLanguage(w_0, ..., w_n)
4:     IdentifyVerifyAndHandleDisagreements(w_0, ..., w_n)
5: procedure ComputeContextualProbability(w_0, ..., w_n)
6:     for each l in L do
7:         P_CON(l|w_0, ..., w_n) ← 1
8:         for each w_i in w_0, ..., w_n do
11:         for each w_i in w_0, ..., w_n do
12:             P_CON(l|w_i) ← P_CON(l|w_0, ..., w_n) / Σ_{l=1}^{L} P_CON(l|w_0, ..., w_n)
13: procedure RecalculateScoresAndAssignLanguage(w_0, ..., w_n)
14:     for each w_i in w_0, ..., w_n do
15:         for each l in L do
16:             P_INT(l|w_i) ← P_INT(l|w_i) + P_CON(l|w_i) × γ        ▷ Default γ ← 0.1
17:         l_i ← max_l P_CON(l|w_i)
18: procedure IdentifyVerifyAndHandleDisagreements(w_0, ..., w_n)
19:     disagr_start ← −1
20:     disagr_end ← −1
21:     for each w_i in w_0, ..., w_n do
22:         if l_i ≠ max_l P_INT(l|w_i) then
23:             if P_INT(l|w_i)_max − P_CON(l|w_i)_max < t then
24:                 if disagr_start = −1 then
25:                     disagr_start ← i
26:             else
27:                 if disagr_start ≠ −1 then
28:                     disagr_end ← i
29:         else
30:             if disagr_start ≠ −1 then
31:                 disagr_end ← i
32:         if disagr_start ≠ −1 and disagr_end ≠ −1 then
33:             ContextualLanguage(disagr_start, ..., disagr_end)
34:             disagr_start ← −1
35:             disagr_end ← −1
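The grouping of disagreeing words into substrings can be sketched in isolation as below. This covers only the detection part of the procedure, not the recursive call. It also adopts one particular reading of the threshold, which is an assumption: a word counts as a verified disagreement when its best language differs from the assigned one and the margin between the two interpolated probabilities exceeds t. The function name `disagreeing_runs`, the value of t and the toy probabilities are hypothetical.

```python
def disagreeing_runs(assigned, p_int, t=0.05):
    """Return inclusive (start, end) index pairs of consecutive disagreeing words.

    assigned: one language label per word (the contextual assignment).
    p_int:    one dict per word, mapping language to interpolated probability.
    t:        hypothetical threshold; smaller margins are not treated as disagreements.
    """
    runs, start = [], None
    for i, (lang, probs) in enumerate(zip(assigned, p_int)):
        best = max(probs, key=probs.get)
        disagrees = best != lang and probs[best] - probs[lang] > t
        if disagrees and start is None:
            start = i                    # a new disagreeing substring begins
        elif not disagrees and start is not None:
            runs.append((start, i - 1))  # a non-disagreeing word closes it
            start = None
    if start is not None:                # substring reaching the end of the input
        runs.append((start, len(assigned) - 1))
    return runs


assigned = ["sv"] * 5
p_int = [
    {"sv": 0.80, "en": 0.20},
    {"sv": 0.30, "en": 0.70},
    {"sv": 0.20, "en": 0.80},
    {"sv": 0.50, "en": 0.52},  # within the threshold: not a verified disagreement
    {"sv": 0.90, "en": 0.10},
]
print(disagreeing_runs(assigned, p_int))  # [(1, 2)]
```

Each returned run would then be handed back to the contextual language step, while a run of length one is resolved directly by taking the language with the highest interpolated probability.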


4 Experiments

The experiments were conducted using the Stockholm-Umeå Corpus (SUC), as well as five synthetically generated multilingual corpora. The process of generating these synthetic corpora is outlined in section 4.1. SUC is composed of Swedish text from, among other sources, news articles, biographies and scientific articles.

The data is lemmatized and annotated with POS tags, morphological tags and more. Foreign inclusions are to an extent tagged with language information, but unfortunately, much language information is lost, as institutions, names and places carry no language information. To effectively use SUC for the intended experiments, all words tagged as institutions, names, or places needed to be excluded from the evaluation. This is, in part at least, not an unusual sacrifice, as names cannot really be said to carry language information, since many names are found across different countries and languages.

4.1 Synthetic Corpora

The synthetic corpora were made up from monolingual text from Wikipedia.

They were generated by setting one language as the main language and the remaining languages as inclusion languages. The inclusions were of varying length, and were both intersentential, meaning they occurred monolingually in between sentences of the main language, and intrasentential, meaning they occurred within sentences of the main language. The intrasentential inclusions were at least one word, and at most five words in length. The resulting proportions of languages in the corpora are $(N-1)/N$ tokens of the main language, and $1/(N \times (N-1))$ tokens for each of the inclusion languages, where $N$ is the total number of languages.
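As a quick check of these proportions for the five languages used here, the following sketch computes both fractions exactly. The function name `language_proportions` is hypothetical.

```python
from fractions import Fraction


def language_proportions(n_languages):
    """Return (main-language share, share of each inclusion language)."""
    n = n_languages
    main = Fraction(n - 1, n)
    inclusion = Fraction(1, n * (n - 1))
    return main, inclusion


main, inclusion = language_proportions(5)
print(main, inclusion)  # 4/5 1/20
```

With N = 5, the main language makes up four fifths of the tokens, and the four inclusion languages share the remaining fifth equally.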

Grammaticality is not an issue, as the only model relying on more than one token at a time is the Bi-Directional Word Bigram Model, and the inability to detect a bilingual word bigram is part of its use, whether or not the inclusion is grammatically well formed.

4.2 Experimental Set-Up

The system was trained on monolingual corpora, courtesy of ReadSpeaker, for English, French, German, Spanish and Swedish, see table 4.1. A series of tests were carried out on the tuning corpus, see table 4.2, and twelve configurations were chosen to be used with the actual testing corpora. The different configurations for the interpolation between character quadgrams, word bigrams and frequency dictionary lookups are outlined in table 4.3.


| Corpus | Size |
|---|---|
| English | 47,706,488 |
| French | 61,680,602 |
| German | 45,218,081 |
| Spanish | 54,641,747 |
| Swedish | 47,479,254 |

Table 4.1: Size of training sets

| Corpus | Size |
|---|---|
| SUC development set | 93,139 |
| SUC test set | 927,108 |
| Synthetic English | 5,231 |
| Synthetic French | 4,580 |
| Synthetic German | 5,034 |
| Synthetic Spanish | 3,227 |
| Synthetic Swedish | 4,716 |

Table 4.2: Size of development and test sets

In table 4.3, the different experimental weighting schemes are described. $\lambda_{BWB}$ is the weight given to the backward word bigram model, and $\lambda_{FWB}$ is the weight given to the forward word bigram model. $\lambda_{CQ}$ is the weight given to the character quadgram model, $\lambda_{FD}$ is the weight given to the frequency dictionary, and finally $\gamma$ is the reweighting coefficient used in the experiments.

Note that throughout the experiments the backward and forward word bigram models are assigned equal weight. Note also that in settings 2–4, 6–8 and 10–12, the frequency dictionary receives no weight at all. This was partly due to the relatively poor performance of configurations which gave higher weight to frequency dictionary lookups, partly due to the fact that unigram counts are taken into account in the bi-directional word bigram models, see equations 2 and 3, and partly due to the fact that high frequency terms of length ≤ 4, which tends to be the case for many function words, are also covered by the character quadgram models, see equation 7. Nevertheless, in configurations 1, 5 and 9, frequency dictionary lookups are given a weight of $\lambda_{FD} = 0.3$. Finally, note also that configurations 1–4 are the same as configurations 5–8, the only exception being the reweighting coefficient, $\gamma$, for the reweighting of probabilities, see algorithm 1, rows 13–17. The same goes for configurations 9–12, with the exception that the reweighting coefficient here is set to 0, meaning no reweighting of probabilities at all. The system performs very differently without the reweighting, but for research purposes it would be irresponsible not to cover it.

| Setting | $\lambda_{BWB}$ | $\lambda_{FWB}$ | $\lambda_{CQ}$ | $\lambda_{FD}$ | $\gamma$ |
|---|---|---|---|---|---|
| 1 | 0.2 | 0.2 | 0.3 | 0.3 | 0.1 |
| 2 | 0.25 | 0.25 | 0.5 | 0.0 | 0.1 |
| 3 | 0.35 | 0.35 | 0.3 | 0.0 | 0.1 |
| 4 | 0.15 | 0.15 | 0.7 | 0.0 | 0.1 |
| 5 | 0.2 | 0.2 | 0.3 | 0.3 | 0.05 |
| 6 | 0.25 | 0.25 | 0.5 | 0.0 | 0.05 |
| 7 | 0.35 | 0.35 | 0.3 | 0.0 | 0.05 |
| 8 | 0.15 | 0.15 | 0.7 | 0.0 | 0.05 |
| 9 | 0.2 | 0.2 | 0.3 | 0.3 | 0.0 |
| 10 | 0.25 | 0.25 | 0.5 | 0.0 | 0.0 |
| 11 | 0.35 | 0.35 | 0.3 | 0.0 | 0.0 |
| 12 | 0.15 | 0.15 | 0.7 | 0.0 | 0.0 |

Table 4.3: Different system settings

4.3 Evaluation

For each corpus, precision, recall, F1 and true negative rate were calculated for the main language. True negative rate is a metric that works well for measuring how well out-groups are distinguished from the in-group. For clarity, it is defined in equation 9, but it can be described as an inverted precision. The higher the true negative rate, the more accurately the system identifies an inclusion as an actual inclusion. True negative rate is as such a very appropriate metric with which to determine whether or not the system is suited for detection of foreign inclusions.

A confusion matrix covering predicted languages versus actual languages is supplied for each test corpus. Each confusion matrix covers all twelve configurations through which the corpus was run. The confusion matrices also provide a more detailed view of where potential errors and shortcomings lie, and as such are a valuable resource in finding flaws in the system.

$$\text{True Negative Rate} = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}} \tag{9}$$
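The four metrics can be computed from the main-language counts as in the sketch below. Precision, recall and F1 use their standard definitions, and the true negative rate follows equation (9). The function name `main_language_metrics` is hypothetical, and the counts in the example are invented for illustration; they are not taken from the experiments.

```python
def main_language_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and true negative rate for the main language.

    tp: main-language tokens tagged as the main language
    fp: inclusion tokens wrongly tagged as the main language
    fn: main-language tokens tagged as some other language
    tn: inclusion tokens not tagged as the main language
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    tnr = tn / (tn + fp)  # equation (9)
    return precision, recall, f1, tnr


# Invented counts, for illustration only.
precision, recall, f1, tnr = main_language_metrics(tp=90, fp=5, fn=10, tn=15)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(tnr, 3))
```

Because the true negative rate only looks at tokens that are not in the main language, it stays informative even when such tokens are a tiny fraction of the corpus.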


4.4 Results

The first tests were run with all twelve configurations on the SUC test set. For all twelve configurations, precision for Swedish was approximately 1.0, to within 0.002. However, recall was as low as 0.65 with configuration 4, and as high as 0.88 with configuration 11. Interestingly enough, these extremes correspond inversely to the extremes for true negative rate, 0.34 in configuration 4 and 0.13 in configuration 11. This correspondence is somewhat intuitive when one regards the composition of the SUC test set. The system is trained to treat every language as a likely candidate, and since the system is not free from fault, erroneous classifications will be made. The composition of the SUC test set is highly biased towards Swedish, with a total of 2,485 non-Swedish tokens out of 927,108. Thus, when recall is high, true negative rate is low, and conversely, when recall is low, true negative rate is high.

In SUC, some languages are not explicitly specified, instead the language tag "Other" is applied to low frequency foreign language inclusions. Since the system is not trained to handle any other languages than English, French, German, Spanish and Swedish, in the confusion matrix, table 4.5, there is no corresponding column for the row labeled "Other".

In the results in the confusion matrix, table 4.5, a noticeable drop in correctly identified foreign inclusions can be seen in configurations 5–8, and an even larger drop in configurations 9–12. The lowered weight for the recalculation of scores is apparently harmful for identifying sparse inclusions, while on the other hand beneficial for lowering the number of misclassifications of the main language. It is also apparent from the generally low number of correctly identified foreign inclusions that the system, although performing well in terms of monolingual language identification, is lacking in correctly identifying foreign language inclusions.

As such, it is worth pointing out that if no reweighting is done during classification, and the system relies more on word bigrams, less on character quadgrams and not at all on frequency dictionaries, the system achieves an F1-score of 0.94, again giving justification to the effectiveness and accuracy of relying on such models for monolingual language identification.

Results and confusion matrix for all twelve configurations are covered in tables 4.4, and 4.5.


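| Setting | Precision | Recall | F1 | TNR |
|---|---|---|---|---|
| 1 | ∼1.0 | 0.67 | 0.80 | 0.31 |
| 2 | ∼1.0 | 0.67 | 0.80 | 0.32 |
| 3 | ∼1.0 | 0.68 | 0.81 | 0.33 |
| 4 | ∼1.0 | 0.65 | 0.79 | 0.34 |
| 5 | ∼1.0 | 0.76 | 0.86 | 0.25 |
| 6 | ∼1.0 | 0.75 | 0.86 | 0.26 |
| 7 | ∼1.0 | 0.78 | 0.87 | 0.23 |
| 8 | ∼1.0 | 0.71 | 0.83 | 0.29 |
| 9 | ∼1.0 | 0.86 | 0.92 | 0.16 |
| 10 | ∼1.0 | 0.85 | 0.92 | 0.17 |
| 11 | ∼1.0 | 0.88 | 0.94 | 0.13 |
| 12 | ∼1.0 | 0.81 | 0.89 | 0.22 |

Table 4.4: Results SUC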

In table 4.4, and in all following similar tables, precision, recall, F1 and true negative rate (TNR) are calculated for the main language of the relevant corpus. In SUC, the main language is Swedish, and in the synthetically generated corpora, the main language is that of the corpus.
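As a minimal sketch of how these four figures follow from a confusion matrix, the Python snippet below recomputes them for Swedish from the configuration 11 counts in table 4.5. The function name `main_language_metrics` and the dictionary layout are illustrative assumptions, not the system's actual implementation. The row labelled "Other" has no matching column and therefore only contributes to the negatives.

```python
# Counts for configuration 11, copied from table 4.5 (SUC test set).
# Key: actual language; list: counts per predicted language, in LANGS order.
LANGS = ["English", "French", "German", "Spanish", "Swedish"]
CONFUSION_11 = {
    "English": [27, 17, 3, 36, 682],
    "French": [1, 2, 0, 3, 40],
    "German": [18, 6, 1, 1, 137],
    "Spanish": [0, 0, 0, 1, 14],
    "Swedish": [27848, 40572, 9023, 29953, 791783],
    "Other": [18, 48, 5, 14, 461],
}


def main_language_metrics(confusion, main, langs=LANGS):
    """Precision, recall, F1 and true negative rate for `main` (hypothetical helper)."""
    col = langs.index(main)
    tp = confusion[main][col]                 # main-language tokens tagged correctly
    fn = sum(confusion[main]) - tp            # main-language tokens tagged as something else
    others = [row for actual, row in confusion.items() if actual != main]
    fp = sum(row[col] for row in others)      # foreign tokens tagged as the main language
    tn = sum(sum(row) - row[col] for row in others)  # foreign tokens not tagged as main
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    tnr = tn / (tn + fp)
    return precision, recall, f1, tnr


precision, recall, f1, tnr = main_language_metrics(CONFUSION_11, "Swedish")
print(f"P={precision:.3f} R={recall:.2f} F1={f1:.2f} TNR={tnr:.2f}")
```

Rounded to two decimals, the recall, F1 and true negative rate obtained this way agree with the row for setting 11 in table 4.4 (0.88, 0.94 and 0.13).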

                          Predicted language
Actual language  Setting  English  French  German  Spanish  Swedish
English                1       63      25      64       87      526
                       2       68      27      61       88      521
                       3       59      17      66       86      537
                       4       68      40      61       83      513
                       5       56      35      27       73      574
                       6       56      33      32       76      568
                       7       49      31      19       72      594
                       8       63      44      37       80      541
                       9       34      26      10       44      651
                      10       34      28      10       47      646
                      11       27      17       3       36      682
                      12       39      51      21       49      605

French                 1        2       2       2        7       33
                       2        4       2       2        6       32
                       3        2       0       2        8       34
                       4        2       3       2        8       31
                       5        1       2       2        5       36
                       6        1       2       2        5       36
                       7        1       2       2        5       36
                       8        2       1       2        5       36
                       9        1       1       0        5       39
                      10        1       1       0        5       39
                      11        1       2       0        3       40
                      12        1       1       0        5       39

German                 1       37       0       7       12      107
                       2       40       0       6       12      105
                       3       61       2      23       12       65
                       4       49       0       7        8       99
                       5       34       2       5       11      111
                       6       34       2       4       12      111
                       7       34       2       3       11      113
                       8       29       7       2       13      112
                       9       19       6       1        2      135
                      10       21       6       1        3      132
                      11       18       6       1        1      137
                      12       26       6       1        6      124

Spanish                1        0       0       2        4        9
                       2        0       0       4        3        8
                       3        0       0       1        5        9
                       4        0       0       3        3        9
                       5        0       1       1        2       11
                       6        0       1       1        2       11
                       7        0       1       2        2       10
                       8        0       1       1        3       10
                       9        1       1       0        1       12
                      10        1       1       0        1       12
                      11        0       0       0        1       14
                      12        1       1       0        1       12

Swedish                1    79965   50050   78495    84760   605909
                       2    80780   51200   77020    85978   604201
                       3    81962   44110   78980    82752   611375
                       4    82803   60254   78721    93263   584138
                       5    57260   53030   39328    66540   683021
                       6    58831   56393   40537    68516   674902
                       7    54018   47435   37211    62483   698032
                       8    64918   70083   46684    78487   639007
                       9    31304   49076   12684    33584   772531
                      10    32891   52690   14485    35212   763901
                      11    27848   40572    9023    29953   791783
                      12    39480   67318   24128    42372   725881

Other                  1       36      30      60       43      377
                       2       32      38      58       44      374
                       3       39      35      51       43      378
                       4       34      42      59       53      358
                       5       30      44      21       28      423
                       6       34      49      23       30      410
                       7       32      42      19       28      425
                       8       41      56      24       38      387
                       9       20      55       4       15      452
                      10       20      57       4       15      450
                      11       18      48       5       14      461
                      12       22      69      15       20      420

Table 4.5: Confusion matrix SUC

Setting  Precision  Recall  F1    TNR
1        0.90       0.63    0.74  0.66
2        0.91       0.63    0.75  0.66
3        0.90       0.63    0.75  0.66
4        0.91       0.60    0.72  0.68
5        0.92       0.72    0.80  0.66
6        0.92       0.72    0.80  0.66
7        0.91       0.72    0.81  0.64
8        0.91       0.70    0.79  0.65
9        0.93       0.92    0.93  0.63
10       0.93       0.92    0.92  0.63
11       0.93       0.93    0.93  0.62
12       0.93       0.91    0.92  0.64

Table 4.6: Results Synthetic English

The first of the synthetic corpora to be tested was the Synthetic English corpus. Unlike the results from SUC, recall and true negative rate do not seem to be inversely correlated, at least not to the same extent. Instead, true negative rate is fairly steady between 0.62 and 0.68, regardless of recall.
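One way to quantify this difference is to fit a least-squares line to true negative rate as a function of recall over the twelve configurations. The sketch below does so with the rounded values from tables 4.4 and 4.6. The helper `slope` and the variable names are illustrative assumptions, not part of the evaluation reported here.

```python
# Recall and true negative rate for settings 1-12, copied from tables 4.4 and 4.6.
SUC_RECALL = [0.67, 0.67, 0.68, 0.65, 0.76, 0.75, 0.78, 0.71, 0.86, 0.85, 0.88, 0.81]
SUC_TNR = [0.31, 0.32, 0.33, 0.34, 0.25, 0.26, 0.23, 0.29, 0.16, 0.17, 0.13, 0.22]
ENG_RECALL = [0.63, 0.63, 0.63, 0.60, 0.72, 0.72, 0.72, 0.70, 0.92, 0.92, 0.93, 0.91]
ENG_TNR = [0.66, 0.66, 0.66, 0.68, 0.66, 0.66, 0.64, 0.65, 0.63, 0.63, 0.62, 0.64]


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


suc_slope = slope(SUC_RECALL, SUC_TNR)
eng_slope = slope(ENG_RECALL, ENG_TNR)
print(f"SUC: {suc_slope:.2f}, Synthetic English: {eng_slope:.2f}")
```

On these figures the slope is close to -0.9 for SUC but only about -0.1 for the Synthetic English corpus: a gain in recall still costs some true negative rate, but far less than on SUC.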

The highest recall, 0.93, was achieved with configuration 11, closely followed by configurations 9 and 10 at 0.92. Configurations 9–12 also achieve the highest precision, 0.93, see table 4.6.

In the results from the SUC test set, a noticeable drop in identified foreign inclusions could be observed when lowering the weight of recalculating scores (configurations 5–12). Here, instead, the number of correctly identified foreign inclusions is approximately the same over all configurations, with the exception of correctly identified French inclusions, which benefit from no reweighting (configurations 9–12), see table 4.7.

The number of misclassifications of English tokens as French is uncomfortably high. It is hard to discern exactly why this is the case, but a qualified guess is that in terms of orthography, not counting the more specific diacritics of French, French and English are all too similar. The number of English tokens incorrectly classified as French is never lower than 222, and the corresponding number of French tokens misclassified as English is never lower than 57.

                          Predicted language
Actual language  Setting  English  French  German  Spanish  Swedish
English                1     2751     384     326      387      514
                       2     2760     382     323      384      513
                       3     2767     414     332      429      420
                       4     2614     287     491      427      543
                       5     3130     471     224      299      238
                       6     3137     477     226      285      237
                       7     3155     517     222      329      139
                       8     3070     342     301      428      221
                       9     4033     243      51       24       11
                      10     4018     258      51       24       11
                      11     4052     222      51       26       11
                      12     3991     257      59       45       10

French                 1       60      31      14       64       40
                       2       60      30      14       62       43
                       3       60      31      14       64       40
                       4       57      29      14       62       47
                       5       93      46       5       63        2
                       6       93      46       5       63        2
                       7       93      48       3       63        2
                       8       96      45       3       61        4
                       9       78     126       0        5        0
                      10       78     126       0        5        0
                      11       78     126       0        5        0
                      12       80     124       0        5        0

German                 1       55      14     123       16        5
                       2       55      14     123       17        4
                       3       56      14     121       17        5
                       4       47       2     126       33        5
                       5       56      20     119       14        4
                       6       56      21     117       15        4
                       7       56      20     119       14        4
                       8       56       8     119       26        4
                       9       73      20     117        3        0
                      10       73      20     117        3        0
                      11       73      20     117        3        0
                      12       70      20     120        3        0

Spanish                1       91       2       4      102       15
                       2       89       6       4      104       11
                       3       90       9       5      102        8
                       4       84       4      14      102       10
                       5       67      11       4      129        3
                       6       67      11       4      129        3
                       7       87       9       4      111        3
                       8       67       9       4      131        3
                       9       74      21       0      116        3
                      10       74      21       0      116        3
                      11       83      12       0      116        3
                      12       69      17       4      121        3

Swedish                1       86       3      37       20       66
                       2       82       5      41       20       64
                       3       85       3      37       20       67
                       4       82       3       4       55       68
                       5       71      52       1       14       74
                       6       71      47       1       19       74
                       7       71      22      33       11       75
                       8       80      45       1       19       67
                       9       88      52       0        0       72
                      10       88      53       0        0       71
                      11       88      52       0        0       72
                      12       88      48       1        4       71

Table 4.7: Confusion matrix Synthetic English

Setting  Precision  Recall  F1    TNR
1        0.92       0.53    0.67  0.77
2        0.92       0.53    0.67  0.77
3        0.93       0.55    0.69  0.77
4        0.91       0.49    0.64  0.76
5        0.92       0.64    0.75  0.71
6        0.92       0.64    0.75  0.73
7        0.92       0.65    0.76  0.71
8        0.93       0.63    0.75  0.74
9        0.93       0.97    0.95  0.64
10       0.94       0.97    0.95  0.64
11       0.93       0.97    0.95  0.64
12       0.93       0.97    0.95  0.64

Table 4.8: Results Synthetic French

With the Synthetic French corpus, precision is again relatively high, spanning 0.91–0.94. Recall is on the low end of the scale for all configurations but 9–12, which all achieve a recall of 0.97. The system erroneously classifies between approximately 800 and 1000 French tokens as Spanish in configurations 1–8, a number which is lowered to approximately 90 misclassifications in configurations 9–12, see table 4.9.

Moreover, as with the Synthetic English corpus, there is, at least with configurations 1–8, an uncomfortable number of misclassifications between English and French, with approximately 300 French tokens being classified as English. Like the Spanish misclassifications, these numbers drop to a more tolerable level with configurations 9–12. Misclassifications of French tokens as Swedish are also very frequent with configurations 1–8, but again drop to a very tolerable 2 with configurations 9–12.
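To make the redistribution of errors concrete, the following sketch normalises the French row of table 4.9 for configurations 1 and 9, so that each entry becomes the share of French tokens assigned to a given language. The names `FRENCH_ROW` and `row_shares` are illustrative assumptions.

```python
LANGS = ["English", "French", "German", "Spanish", "Swedish"]

# French row of table 4.9: counts per predicted language, in LANGS order.
FRENCH_ROW = {
    1: [298, 2037, 201, 930, 369],
    9: [23, 3713, 6, 91, 2],
}


def row_shares(counts):
    """Turn one confusion-matrix row into proportions that sum to 1."""
    total = sum(counts)
    return [count / total for count in counts]


for setting, counts in FRENCH_ROW.items():
    shares = row_shares(counts)
    summary = ", ".join(f"{lang} {share:.1%}" for lang, share in zip(LANGS, shares))
    print(f"Setting {setting}: {summary}")
```

With configuration 1, roughly a quarter of the French tokens are tagged as Spanish and about a tenth as Swedish. With configuration 9 the Spanish share falls to about 2 per cent and the French share rises to about 97 per cent, which is the recall reported in table 4.8.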

Apart from these misclassifications, as with the Synthetic English corpus, all configurations with lower or no weight for recalculating scores outperform the ones with higher weight, in all languages but German.

The true negative rate is at its lowest in configurations 9–12, with 0.64, and at its highest in configurations 1–3, with 0.77. An inverse correspondence between lower recall and higher true negative rate is again observed, though to a less extreme extent than with the SUC test set, see table 4.8.

                          Predicted language
Actual language  Setting  English  French  German  Spanish  Swedish
English                1       38      38      53       48        7
                       2       38      40      53       48        5
                       3       40      39      51       48        6
                       4       36      35      55       48       10
                       5       59      46      37       42        0
                       6       59      46      37       42        0
                       7       59      46      41       38        0
                       8       59      40      37       48        0
                       9      100      72       1       11        0
                      10      100      72       1       11        0
                      11      100      72       1       11        0
                      12      100      72       1       11        0

French                 1      298    2037     201      930      369
                       2      298    2040     222      898      377
                       3      220    2124     148      982      361
                       4      272    1872     321      898      472
                       5      314    2448      31      830      212
                       6      393    2439      39      820      144
                       7      357    2479      43      789      167
                       8      366    2422      39      862      146
                       9       23    3713       6       91        2
                      10       23    3710       6       94        2
                      11       25    3711       6       91        2
                      12       23    3708       6       96        2

German                 1       29      66      68       12       10
                       2       24      71      68       12       10
                       3       24      71      68       12       10
                       4       27      66      70       12       10
                       5       21      74      65       16        9
                       6       25      68      67       16        9
                       7       17      74      69       16        9
                       8       25      68      69       15        8
                       9        2      84      95        4        0
                      10        2      84      95        4        0
                      11        2      84      95        4        0
                      12        2      85      94        4        0

Spanish                1       76      22       1       74       12
                       2       76      19       4       74       12
                       3       76      22       1       74       12
                       4       78      20       5       72       10
                       5       63      27       1       86        8
                       6       68      22       1       86        8
                       7       63      27       1       86        8
                       8       68      25       1       83        8
                       9        5      41       0      139        0
                      10        5      41       0      139        0
                      11        5      41       0      139        0
                      12        7      41       0      137        0

Swedish                1       17      41      22       57       47
                       2       20      38      22       58       46
                       3       16      39      21       59       49
                       4       20      59       3       59       43
                       5        0      64       3       65       52
                       6        0      64       3       65       52
                       7        4      64       3       47       66
                       8        0      62       1       71       50
                       9        0      67       9       28       80
                      10        0      67       9       28       80
                      11        0      67       9       26       82
                      12        0      67      24       28       65

Table 4.9: Confusion matrix Synthetic French

Setting  Precision  Recall  F1    TNR
1        0.94       0.76    0.84  0.74
2        0.94       0.75    0.83  0.77
3        0.94       0.76    0.84  0.74
4        0.92       0.72    0.81  0.68
5        0.95       0.82    0.88  0.76
6        0.95       0.82    0.88  0.76
7        0.95       0.82    0.88  0.77
8        0.93       0.81    0.86  0.68
9        0.95       0.94    0.95  0.76
10       0.95       0.94    0.94  0.77
11       0.95       0.95    0.95  0.76
12       0.95       0.93    0.94  0.75

Table 4.10: Results Synthetic German

When testing the system on the Synthetic German corpus, precision is still high. Many of the configurations yield a precision of 0.95. More interesting, however, is that recall is overall the highest in this series of experiments, and, more interesting still, the true negative rate is also good throughout, with its lowest score, 0.68, in configurations 4 and 8, and its highest score, 0.77, in configurations 2, 7 and 10, see table 4.10.

In the confusion matrix, table 4.11, it is apparent that performing no recalculation of scores helps with correctly classifying all languages. It seems the system has a hard time disambiguating Swedish and German, much like the high number of misclassifications of French as Spanish in the Synthetic French corpus.

Another point of interest is the number of English tokens misclassified as German. If recalculation is not turned off, and if a higher weight is given to the character quadgrams than to anything else, the number of misclassifications of English tokens as German goes up tremendously. This exemplifies how orthographical similarity affects language identification.
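The jump is easy to isolate from table 4.11. The sketch below lists the settings in which the number of English tokens tagged as German exceeds ten times the median across all settings. The factor of ten and the variable names are arbitrary choices made for illustration only.

```python
from statistics import median

# English tokens predicted as German, per setting, copied from table 4.11.
ENGLISH_AS_GERMAN = {
    1: 1, 2: 1, 3: 1, 4: 83, 5: 1, 6: 1,
    7: 1, 8: 73, 9: 0, 10: 0, 11: 0, 12: 0,
}

# Use the median as a robust baseline; guard against a baseline of zero.
baseline = max(median(ENGLISH_AS_GERMAN.values()), 1)
spikes = [setting for setting, count in ENGLISH_AS_GERMAN.items() if count > 10 * baseline]
print(spikes)  # [4, 8]
```

Only settings 4 and 8 are flagged, with 83 and 73 misclassified tokens against at most 1 in every other setting.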

                          Predicted language
Actual language  Setting  English  French  German  Spanish  Swedish
English                1       99       0       1       10      106
                       2       99       0       1       10      106
                       3       99       0       1       10      106
                       4       17       0      83       10      106
                       5      127       0       1        8       80
                       6      127       0       1        8       80
                       7      129       0       1       86        0
                       8       50       0      73        9       84
                       9      212       0       0        4        0
                      10      212       0       0        4        0
                      11      212       0       0        4        0
                      12      212       0       0        4        0

French                 1        7      99      62       32        2
                       2        3     102      63       32        2
                       3        3      97      69       32        1
                       4        2      94      56       40       10
                       5        1     103      55       40        3
                       6        1     104      54       40        3
                       7        1     100      58       40        3
                       8        4     106      49       40        3
                       9        0     141      58        3        0
                      10        0     144      55        3        0
                      11        0     137      62        3        0
                      12        0     147      52        3        0

German                 1      301     127    3201      228      361
                       2      275     139    3158      249      397
                       3      291     123    3207      217      380
                       4      328     181    3049      281      379
                       5      263     176    3479      160      140
                       6      271     177    3478      162      130
                       7      260     160    3470      159      169
                       8      303     235    3402      148      130
                       9       89     104    3970       35       20
                      10       97     125    3947       37       12
                      11       83      87    3992       34       22
                      12       90     160    3921       35       12

Spanish                1       31      16      63       80       10
                       2       31      16      63       77       13
                       3       31      15      64       81        9
                       4       31      18      62       76       13
                       5       29      18      64       86        3
                       6       31      18      64       84        3
                       7       31      16      66       86        1
                       8       31      22      64       79        4
                       9        1       5      70      124        0
                      10        1       5      70      124        0
                      11        1       3      72      124        0
                      12        1       7      70      122        0

Swedish                1       19      15      81        6       74
                       2       20      15      60        6       94
                       3       13      16      79        7       80
                       4       34      15      60       25       61
                       5       14      22      74        5       80
                       6       14      22      75        5       79
                       7       14      20      58        5       98
                       8       14      23      73        6       79
                       9        4      28      66        0       97
                      10        4      28      66        0       97
                      11        4      26      65        0      100
                      12        9      29      84        0       73

Table 4.11: Confusion matrix Synthetic German

Setting  Precision  Recall  F1    TNR
1        0.91       0.63    0.74  0.69
2        0.91       0.65    0.76  0.68
3        0.92       0.64    0.76  0.70
4        0.92       0.68    0.78  0.68
5        0.92       0.80    0.85  0.64
6        0.92       0.80    0.86  0.64
7        0.92       0.79    0.85  0.66
8        0.92       0.79    0.85  0.66
9        0.93       0.96    0.95  0.61
10       0.93       0.96    0.95  0.61
11       0.93       0.96    0.95  0.61
12       0.93       0.96    0.94  0.61

Table 4.12: Results Synthetic Spanish

With the Synthetic Spanish corpus, the results are somewhat similar to the ones obtained when testing on the other corpora. Precision is consistently high, at 0.91–0.93, and exceeds all other metrics in configurations 1–8. Recall is lower and fluctuates more, being as low as 0.63 with configuration 1, and as high as 0.96 with configurations 9–12, see table 4.12.

Misclassifications of Spanish as French are relatively plentiful; however, the number does decrease with a lower weight for recalculating scores, and with no weight the decrease is even greater, see table 4.13.
