The frequency dictionary unigram probabilities made little or no impact during the tuning stage, and were therefore disregarded in all but 3 of the 12 configurations finally put through the tests. Whether frequency dictionary lookups are a viable way of disambiguating between two candidate languages is, however, hard to determine at this stage. Stemming lexica would be a great resource to investigate together with the Shibboleth system. They would provide much broader coverage while mitigating overrepresented word forms that might otherwise receive a disproportionate score, making the lexicon an actual complement to the other models rather than a hit-and-miss, redundant one.
The character quadgram models could potentially be of even more use if interpolated or weighted differently. With, for example, a tf-idf weighting scheme for the quadgrams, in which the candidate languages are regarded as the documents, the more language-specific character sequences would receive a higher weight.
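The weighting idea in the paragraph above can be sketched as follows. This is a minimal illustration, not the Shibboleth implementation: the function names, the toy training texts, and the choice of relative frequency times log(N/df) as the tf-idf variant are all assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_weights(corpora, n=4):
    """Map each language to {quadgram: tf-idf weight}.

    corpora: dict mapping a language code to its training text.
    Each language is treated as one 'document' (assumed setup).
    """
    counts = {lang: Counter(char_ngrams(text, n)) for lang, text in corpora.items()}
    num_langs = len(counts)

    # Document frequency: in how many languages does each quadgram occur?
    df = Counter()
    for lang_counts in counts.values():
        df.update(lang_counts.keys())

    weights = {}
    for lang, lang_counts in counts.items():
        total = sum(lang_counts.values())
        weights[lang] = {
            gram: (freq / total) * math.log(num_langs / df[gram])
            for gram, freq in lang_counts.items()
        }
    return weights


# Toy data, for illustration only.
toy_weights = tfidf_weights({"en": "the thing", "sv": "the sak"})
```

A quadgram that occurs in every language receives weight zero under this variant, while a quadgram seen in only one language keeps a positive weight, which is the effect the paragraph describes.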
There are better-performing smoothing techniques than standard add-one smoothing. The bi-directional word bigram conditional probabilities could therefore potentially perform even better with, for example, modified Kneser-Ney smoothing, a technique that reportedly works very well in language modelling.
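To give a sense of what such smoothing involves, the sketch below implements plain interpolated Kneser-Ney for word bigrams with a single fixed discount. It is deliberately simpler than the modified variant mentioned above, which uses several count-dependent discounts. The function name, the discount value of 0.75, and the toy token sequence are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_kn_bigram(tokens, discount=0.75):
    """Return prob(w, v) = P_KN(w | v) estimated from a token list.

    Interpolated Kneser-Ney with one fixed discount (a simplification
    of modified Kneser-Ney).
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a bigram")

    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()      # c(v): how often v occurs as a context
    followers = defaultdict(set)   # distinct words seen after v
    preceders = defaultdict(set)   # distinct words seen before w
    for (v, w), c in bigrams.items():
        context_count[v] += c
        followers[v].add(w)
        preceders[w].add(v)
    num_bigram_types = len(bigrams)

    def prob(w, v):
        # Continuation probability: in how many distinct contexts does w appear?
        p_cont = len(preceders.get(w, ())) / num_bigram_types
        c_v = context_count.get(v, 0)
        if c_v == 0:
            return p_cont  # unseen context: fall back entirely
        discounted = max(bigrams.get((v, w), 0) - discount, 0.0) / c_v
        backoff_mass = discount * len(followers[v]) / c_v
        return discounted + backoff_mass * p_cont

    return prob


# Toy data, for illustration only.
kn_prob = train_kn_bigram("a b a c a b".split())
```

For the reverse direction of a bi-directional model, the same function could be trained on the reversed token list.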
The optimal configuration could be learned, and parameter tuning done automatically, for example by expectation-maximization (EM). Instead of interpolating the different models with manually tuned weights, the models could be initialized with equal weights, and the EM algorithm would then find an optimal weighting scheme for them.
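A minimal sketch of how EM could estimate interpolation weights is given below. It assumes a held-out set in which each row lists the probability that each model assigns to the correct language of one token; this data layout, the function name, and the iteration count are assumptions, not part of the system described here.

```python
def em_interpolation_weights(model_probs, iterations=50):
    """Estimate mixture weights for a linear interpolation of models.

    model_probs: list of rows; row i holds, for each model, the probability
    that model assigns to the correct label of held-out item i.
    """
    num_models = len(model_probs[0])
    weights = [1.0 / num_models] * num_models  # start from equal weights

    for _ in range(iterations):
        expected = [0.0] * num_models
        for row in model_probs:
            mix = sum(w * p for w, p in zip(weights, row))
            if mix == 0.0:
                continue  # no model explains this item; skip it
            # E-step: responsibility of each model for this item.
            for k in range(num_models):
                expected[k] += weights[k] * row[k] / mix
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: new weights are the normalized expected counts.
        weights = [e / total for e in expected]
    return weights


# Toy held-out data: model 0 is consistently better than model 1.
learned = em_interpolation_weights([[0.9, 0.1]] * 5)
```

With the toy data, nearly all weight moves to the first model, since it always assigns the higher probability to the correct label.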
Experimenting with the ambiguity threshold could also improve performance. In the final twelve configurations, the same threshold was used to determine whether two candidate languages were almost equally likely for a token. With the results in mind, the system could potentially benefit from a lower threshold, since it currently overgenerates.
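The effect of the threshold can be illustrated with a small sketch. It assumes that the threshold is the maximum allowed difference between the best score and the score of another candidate; the actual definition used in the system may differ, and the function name and scores are hypothetical.

```python
def candidate_languages(scores, threshold):
    """Return all languages whose score lies within `threshold` of the best.

    scores: dict mapping language code to a score for one token.
    Assumption: ambiguity is measured as an absolute score difference.
    """
    best = max(scores.values())
    return sorted(lang for lang, s in scores.items() if best - s <= threshold)


# Hypothetical scores for a single token.
token_scores = {"en": 0.50, "sv": 0.45, "de": 0.05}
loose = candidate_languages(token_scores, threshold=0.1)
strict = candidate_languages(token_scores, threshold=0.01)
```

Under this assumption, the looser threshold keeps both "en" and "sv", while the stricter one keeps only "en", so fewer tokens receive more than one language.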
Another measure that might help reduce the overgeneration is to perform a two-step identification. By mirroring the methods in Lui et al. (2014), language proportions in a document could be obtained. With these proportions in place, the system could disregard all languages outside this distribution, removing languages that do not occur in the document from the equation entirely.
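The second step of such a procedure could look roughly like the sketch below. It assumes that a document-level identifier has already produced language proportions; the function name, the `min_proportion` parameter, and the example numbers are hypothetical.

```python
def restrict_to_document_languages(token_scores, doc_proportions, min_proportion=0.0):
    """Drop languages absent from the document and renormalize the rest.

    token_scores: dict mapping language code to a score for one token.
    doc_proportions: dict mapping language code to its estimated share
    of the document (assumed to come from a separate first step).
    """
    allowed = {lang for lang, p in doc_proportions.items() if p > min_proportion}
    kept = {lang: s for lang, s in token_scores.items() if lang in allowed}
    total = sum(kept.values())
    if total == 0:
        return kept  # nothing to renormalize
    return {lang: s / total for lang, s in kept.items()}


# Hypothetical values: German is estimated not to occur in the document.
filtered = restrict_to_document_languages(
    {"en": 0.4, "sv": 0.4, "de": 0.2},
    {"en": 0.9, "sv": 0.1, "de": 0.0},
)
```

Raising `min_proportion` above zero would also remove languages with only a marginal estimated share, at the risk of discarding short genuine inclusions.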
The simple heuristic applied in the very last stage of the tagging process could easily be expanded to include a larger rule set. This inclusion grammar approach might not do much to reduce the overgeneration, but it would serve as a form of guarantee for reasonable sequences of languages.
With little work, the system can be adapted to output any number of languages, each with a certainty score. This adaptation might be of more help in certain fields, but it has no bearing on the actual performance of the system.
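One way such output could be produced is sketched below. The certainty score is assumed here to be the token's score normalized over all candidate languages; the function name and the example scores are hypothetical.

```python
def top_k_with_certainty(scores, k):
    """Return the k best languages with a normalized certainty score.

    scores: dict mapping language code to a non-negative score for one token.
    Assumption: certainty is the score divided by the sum of all scores.
    """
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    return [(lang, score / total) for lang, score in ranked]


# Hypothetical scores for a single token.
best_two = top_k_with_certainty({"en": 6.0, "sv": 3.0, "de": 1.0}, k=2)
```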
Further experiments including languages from other language families and with different writing systems would be interesting to perform. Redesigning the character quadgram model as a byte sequence quadgram model is a trivial task, so languages such as Arabic or Chinese could easily be included.
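The change from character to byte quadgrams amounts to encoding the text before extracting n-grams, as in the sketch below. The function name and the choice of UTF-8 as encoding are assumptions.

```python
def byte_ngrams(text, n=4, encoding="utf-8"):
    """Return all overlapping byte n-grams of a string."""
    data = text.encode(encoding)
    return [data[i:i + n] for i in range(len(data) - n + 1)]


# ASCII text yields the same windows as a character model would.
ascii_grams = byte_ngrams("abcde")

# A two-character Chinese string is six bytes in UTF-8, so it still
# yields quadgrams even though it has fewer than four characters.
chinese_grams = byte_ngrams("中文")
```

A single byte quadgram covers fewer characters in scripts that need several bytes per character, so a comparable amount of context may require longer byte sequences for such languages.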
6 Conclusion
In this paper, a novel algorithm has been described with the purpose of identifying foreign language inclusions. The algorithm is incorporated in a system using interpolated probabilities from a number of separate models. The system achieves high precision and recall for the majority language of all tested corpora with at least one configuration. The true negative rate is at worst unusably low, and at best definitely noteworthy. As such, the problem of identifying foreign language inclusions is still not fully solved.
The Shibboleth multilingual language identifier is not a definitive answer to word-for-word language identification; there are many ways the system could be improved or altered to perform even better. However, the foundations are laid, and hopefully this work will continue to grow even after this thesis is presented and archived.
Bibliography
Alex, Beatrice (2005). “An unsupervised system for identifying English inclusions in German text”. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 133–138.
Cavnar, William B. and John M. Trenkle (1994). “N-gram-based text categorization”. In: Proceedings of the 1994 Symposium on Document Analysis and Information Retrieval, pp. 161–175.
Dunning, Ted (1994). Statistical identification of language. Computing Research Laboratory, New Mexico State University.
Giguet, Emmanuel (1995). “Multilingual sentence categorization according to language”. In: Proceedings of the European Chapter for Computational Linguistics SIGDAT Workshop, pp. 73–76.
Hughes, Baden, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay (2006). “Reconsidering language identification for written language resources”. In: Proceedings of the International Conference on Language Resources and Evaluation, pp. 485–488.
Lui, Marco, Jey Han Lau, and Timothy Baldwin (2014). “Automatic Detection and Language Identification of Multilingual Documents”. Transactions of the Association for Computational Linguistics 2, pp. 27–40.
Nguyen, Dong-Phuong and A. Seza Dogruoz (2013). “Word level language identification in online multilingual communication”. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 857–862.
Romsdorfer, Harald and Beat Pfister (2007). “Text analysis and language identification for polyglot text-to-speech synthesis”. Speech Communication 49.9, pp. 697–724.
Stensby, Aleksander, B. John Oommen, and Ole-Christoffer Granmo (2010). “Language detection and tracking in multilingual documents using weak estimators”. In: Structural, Syntactic, and Statistical Pattern Recognition, pp. 600–609.
Zampieri, Marcos, Binyam Gebrekidan Gebre, and Sascha Diwersy (2013). “N-gram language models and POS distribution for the identification of Spanish varieties”. In: Proceedings of TALN 2013, pp. 580–587.