Building tools to access multilingual collections of documents poses all the tool-design problems familiar from monolingual information access systems, but adds several new issues. Cross-language information retrieval systems allow users to retrieve documents written in one language using a query written in another (4): in general, people are able to read several more languages than they are able to formulate queries in. However, assessing the worth of documents in a foreign language is more complex than in one's first language, and building a system to present results in several languages is a complex design issue.
Cross-language information retrieval today is dominated by techniques that rely principally on context-independent token-to-token mappings despite the fact that state-of-the-art statistical machine translation systems now have far richer translation models available in their internal representations. This paper explores combination-of-evidence techniques using three types of statistical translation models: context-independent token translation, token translation using phrase-dependent contexts, and token translation using sentence-dependent contexts. Context-independent translation is performed using statistically-aligned tokens in parallel text, phrase-dependent translation is performed using aligned statistical phrases, and sentence-dependent translation is performed using those same aligned phrases together with an n-gram language model. Experiments on retrieval of Arabic, Chinese, and French documents using English queries show that no one technique is optimal for all queries, but that statistically significant improvements in mean average precision over strong baselines can be achieved by combining translation evidence from all three techniques. The optimal combination is, however, found to be resource-dependent, indicating a need for future work on robust tuning to the characteristics of individual collections.
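One simple way to realize such a combination of evidence is to linearly interpolate the per-term translation probability distributions produced by the three models. The sketch below is illustrative only: the distributions, the example term, and the interpolation weights are invented for the example, and the paper notes that the best weights are resource-dependent.

```python
# Hypothetical sketch: combine term-translation evidence from three models
# by linear interpolation of their P(target | source) distributions.
# All probabilities and weights below are illustrative assumptions.

def combine_translation_evidence(distributions, weights):
    """Merge per-model P(target | source) dicts into one distribution."""
    combined = {}
    for dist, w in zip(distributions, weights):
        for target, p in dist.items():
            combined[target] = combined.get(target, 0.0) + w * p
    # Renormalize so the combined values form a probability distribution.
    total = sum(combined.values())
    return {t: p / total for t, p in combined.items()}

# Illustrative distributions for one English query term, "bank".
token_model    = {"banque": 0.6, "rive": 0.4}   # context-independent tokens
phrase_model   = {"banque": 0.8, "rive": 0.2}   # phrase-dependent context
sentence_model = {"banque": 0.9, "rive": 0.1}   # sentence context + LM

combined = combine_translation_evidence(
    [token_model, phrase_model, sentence_model],
    weights=[0.2, 0.4, 0.4],   # in practice, tuned per collection
)
```

The combined distribution can then be used as structured-query translation weights in any probabilistic retrieval model.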
Hull and Grefenstette (1996), Pirkola (1998), and Ballesteros and Croft (1996) perform cross-language information retrieval through dictionary-based approaches. Littman et al. (1998) perform Latent Semantic Indexing on the term-document matrix. Statistical machine translation has also been applied (Schamoni et al., 2014; Türe et al., 2012b; Türe et al., 2012a; Sokolov et al., 2014). Padariya et al. (2008) and Chinnakotla et al. (2008) use transliteration for out-of-vocabulary words; their method combines the dictionary-based technique with a transliteration scheme and a PageRank algorithm. We report their work as one of the baselines. Herbert et al. (2011) use Wikipedia concepts along with Google Translate to translate the queries: a translation table is built by mining the cross-lingual links from Wikipedia articles and is then coupled with translations from Google. Franco-Salvador et al. (2014) leverage BabelNet, a multilingual semantic network, for CLIR. Hosseinzadeh Vahid et al. (2015) use Google and Bing to translate the queries and show how performance varies with translations from the two different online systems.
One problem we can see from our experimental results is that the accuracy of retrieving English documents from a Japanese query (Japanese to English) is lower than that of English to Japanese in almost all cases. On the other hand, when KCCA was applied to an English-French corpus for cross-language information retrieval, the results were very similar whether English documents were used as queries for retrieving French documents or vice versa. Note that the main difference between processing English and Japanese documents lay in the procedure for collecting terms. The English (or French) terms were basically stemmed words. In Japanese, however, unlike in English or French, there is no delimiter between words in a sentence, so we had to employ a procedure to segment each Japanese sentence into a sequence of words and then select Japanese terms according to their POS tags. This procedure may introduce more errors than the collection of English terms. We therefore think that the lower accuracy of Japanese queries for retrieving English documents may be due to the quality of the Japanese terms we collected not being as good as that of the English terms. Searching for a better method of collecting Japanese terms is left as future work.
Electronically available multilingual information can be divided into two major categories: (1) alphabetic-language information (English-like alphabetic languages) and (2) ideographic-language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages, as well as in ideographic languages (especially Japanese and Chinese), has grown at an incredibly high rate in recent years. Due to the ideographic nature of Japanese and Chinese, complicated by the existence of several encoding standards in use, efficient processing (representation, indexing, retrieval, etc.) of such information is a tedious task. In this paper, we propose a Han Character (Kanji) oriented Interlingua model for indexing and retrieving Japanese and Chinese information. We report the results of mono- and cross-language information retrieval on a Kanji space where documents and queries are represented as Kanji-oriented vectors. We also employ a dimensionality reduction technique to compute a Kanji Conceptual Space (KCS) from the initial Kanji space, which can facilitate conceptual retrieval of both mono- and cross-language information for these languages. Similar indexing approaches for multiple European languages, through term association (e.g., latent semantic indexing) or through conceptual mapping (using a lexical ontology such as WordNet), are being intensively explored. The Interlingua approach investigated here for Japanese and Chinese and the term (or concept) association model investigated for the European languages are similar, and these approaches can easily be integrated. Therefore, the proposed Interlingua model can pave the way for handling multilingual information access and retrieval efficiently and uniformly.
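The core of such a Kanji-oriented index can be sketched very simply: represent each document or query as a vector of the Han characters it contains, and rank by cosine similarity in that shared character space. The example below is a minimal illustration under those assumptions (the example documents are invented, and the dimensionality-reduction step that yields the KCS is omitted).

```python
# Illustrative sketch of Kanji-oriented vector indexing: texts are
# represented by counts of the Han (CJK ideographic) characters they
# contain, and retrieval uses cosine similarity in that shared space.
import math
import unicodedata
from collections import Counter

def kanji_vector(text):
    """Count only Han (CJK unified ideograph) characters in the text."""
    return Counter(ch for ch in text
                   if unicodedata.name(ch, "").startswith("CJK UNIFIED"))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy example: a Japanese document and a Chinese document that share
# some Han characters with a Japanese query.
doc_ja = "日本語の文書検索"    # kana "の" is excluded from the vector
doc_zh = "中文文档检索系统"
query  = "文書検索"

sim_ja = cosine(kanji_vector(query), kanji_vector(doc_ja))
sim_zh = cosine(kanji_vector(query), kanji_vector(doc_zh))
```

Because the query shares all four of its Kanji with the Japanese document but only some with the Chinese one, `sim_ja` exceeds `sim_zh`; the KCS step in the paper additionally maps near-synonymous characters closer together.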
Much attention has recently been paid to natural language processing in information storage and retrieval. This paper describes how the application of natural language processing (NLP) techniques can enhance cross-language information retrieval (CLIR). Using a semi-experimental technique, we took Farsi queries to retrieve relevant documents in English. For translating the Persian queries, we used a bilingual machine-readable dictionary. NLP techniques such as tokenization, morphological analysis, and part-of-speech tagging were used in the pre- and post-translation phases. Results showed that applying NLP techniques yields more effective CLIR performance.
With the increasing availability of machine-readable bilingual dictionaries, dictionary-based automatic query translation has become a viable approach to Cross-Language Information Retrieval (CLIR). In this approach, resolving term ambiguity is a crucial step. We propose a sense disambiguation technique based on a term-similarity measure for selecting the right translation sense of a query term. In addition, we apply a query expansion technique, also based on the term-similarity measure, to improve the effectiveness of the translated queries. The results of our Indonesian-to-English and English-to-Indonesian CLIR experiments demonstrate the effectiveness of the sense disambiguation technique. The query expansion technique is shown to be effective as long as the term ambiguity in the queries has been resolved. In the effort to solve the term ambiguity problem, we discovered that differences in word-formation patterns between the two languages make query translation from one language to the other difficult.
Bilingual dictionaries have always been an important source of query translation in cross-language information retrieval. Among other issues, bilingual translation suffers from ambiguity. To resolve this issue, several recent works have recommended the use of term co-occurrence statistics. Our work described here builds on the same concept with a major modification: it is based on the fact that not all terms in a query have the same discriminating power. Our algorithm therefore gives more weight to discriminating terms in the query and treats co-occurrences of useful terms as more valuable than those of frequent terms. The paper also takes the concept of local context into account when formulating the co-occurrence statistics. In our experiments, the method achieved 85% of the monolingual baseline in terms of mean average precision (MAP). The results are quite encouraging compared to other methods used for cross-language information retrieval for Indian languages.
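The weighted co-occurrence idea can be sketched as follows: each candidate translation of an ambiguous term is scored by its co-occurrence with the translations of the other query terms, and each supporting term contributes in proportion to an IDF-style discrimination weight. This is a hedged illustration, not the paper's exact formula; all counts, weights, and the example words are invented.

```python
# Hypothetical sketch of weighted co-occurrence sense disambiguation.
# Candidate translations are scored by co-occurrence with the other
# query terms' translations, weighting discriminating terms more
# heavily (an IDF-style weight). All numbers here are illustrative.

def pick_translation(candidates, context_terms, cooc, idf):
    """Choose the candidate with the highest IDF-weighted co-occurrence."""
    def score(cand):
        return sum(idf.get(t, 0.0) * cooc.get((cand, t), 0)
                   for t in context_terms)
    return max(candidates, key=score)

# Toy data: disambiguating between two English translations given the
# context translations "river" and "water"; "river" is rarer, hence
# more discriminating and more heavily weighted.
cooc = {("bank", "river"): 3, ("bank", "water"): 5,
        ("shore", "river"): 40, ("shore", "water"): 25}
idf = {"river": 2.1, "water": 1.3}

best = pick_translation(["bank", "shore"], ["river", "water"], cooc, idf)
```

In the paper's local-context variant, the co-occurrence counts would be gathered within a window around the terms rather than over whole documents.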
This paper describes our Korean-Chinese cross-language information retrieval system. Our system uses a bilingual dictionary to perform query translation, which we expand by extracting words and their translations from Wikipedia, an online encyclopedia. To resolve the problem of translating Western people's names into Chinese, we propose a transliteration mapping method. We translate queries from Korean to Chinese using a co-occurrence method. When evaluated on the NTCIR-6 test set, our system achieves a mean average precision (MAP) of 0.1392 (relax score) for the title query type and 0.1274 (relax score) for the description query type.
As mentioned earlier, it is essential that designers gain as much prior understanding of users and usage as possible. As there are few examples of operational cross-language information retrieval systems, there are limited opportunities for gaining an understanding of the nature of the retrieval task in the multilingual context. Hence the Clarity project team considered it of paramount importance to find ways of eliciting the necessary in-depth understanding of how a cross-language IR system would be used. Indeed, users have different knowledge and expertise, may perform different tasks, and may interact with different people (e.g. colleagues, customers) or work alone in the course of information seeking. Different users may need different features or may make different use of the same features. Moreover, the place where the interaction takes place may affect its use and effectiveness.
using Korean queries, referred to as Korean-Chinese cross-language information retrieval (KCIR). The main challenge involves translating named entities (NEs) because they are usually the main concepts of queries. In Chen (1998), the authors romanized Chinese NEs and selected their English transliterations from English NEs extracted from the Web by comparing their phonetic similarities with the Chinese NEs. Al-Onaizan and Knight (2002) transliterated an NE in Arabic into several candidates in English and ranked the candidates by comparing their occurrences in several English corpora. In the above works, the target languages are alphabetic; in Korean-Chinese translation, however, the target language is Chinese, which uses an ideographic writing system. Korean-Chinese NE translation is much more difficult than the NE translation considered in previous works because, in Chinese, one syllable may map to tens or hundreds of characters. For example, if an NE written in Korean comprises three syllables, there may be thousands of translation candidates in Chinese.
Cross-language information retrieval is difficult for languages with few processing tools or resources, such as Urdu. An easy way of translating content words is provided by Google Translate, but due to lexicon limitations, named entities (NEs) are transliterated letter by letter. The resulting NE errors (zynydyny zdn for Zinedine Zidane) hurt retrieval. We propose to replace English non-words in the translation output. First, we determine phonetically similar English words with the Soundex algorithm. Then, we choose among them using a modified Levenshtein distance that models correct transliteration patterns. This strategy yields an improvement of 4% MAP (from 41.2 to 45.1; monolingual 51.4) on the FIRE-2010 dataset.
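The two-stage repair can be sketched with standard components: Soundex to gather phonetically similar vocabulary words, then an edit distance to rank them. Note the hedge: the code below uses the classic Soundex and a plain Levenshtein distance as a stand-in for the paper's modified, transliteration-aware distance, and the toy vocabulary is invented.

```python
# Sketch of the non-word repair strategy: find vocabulary words with the
# same Soundex code as the garbled transliteration, then pick the one
# with minimal edit distance. Plain Levenshtein is used here as a
# stand-in for the paper's modified distance.

def soundex(word):
    """Classic four-character Soundex code (e.g. 'Robert' -> 'R163')."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":          # h and w do not reset the previous code
            continue
        code = codes.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]

def levenshtein(a, b):
    """Standard edit distance via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def repair(nonword, vocabulary):
    """Replace a non-word with its closest same-Soundex vocabulary word."""
    code = soundex(nonword)
    candidates = [w for w in vocabulary if soundex(w) == code]
    return min(candidates, key=lambda w: levenshtein(nonword, w),
               default=nonword)

fixed = repair("zdn", ["zidane", "sudan", "seden"])  # toy vocabulary
```

Here "zdn" and "zidane" share the Soundex code Z350, so the garbled token is mapped back to the correct name.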
This paper presents the ITEM multilingual search engine. This search engine performs full lexical processing (morphological analysis, tagging and Word Sense Disambiguation) on documents and queries in order to provide language-neutral indexes for querying and retrieval. The indexing terms are the EuroWordNet/ITEM InterLingual Index records that link wordnets in 10 languages of the European Community (the search engine currently supports Spanish, English and Catalan). The goal of this application is to provide a way of comparing in context the behavior of different Natural Language Processing strategies for Cross-Language Information Retrieval (CLIR) and, in particular, different Word Sense Disambiguation strategies for query translation and conceptual indexing.
We have presented an approach to mine nearly parallel or comparable data from a large corpus of Twitter messages to adapt an existing SMT system for Twitter translation, by extracting a new in-domain phrase table either based on alignments generated in the vicinity of known words, or by treating candidate pairs as parallel for unsupervised word alignment. The data mining approach relies on a cross-language information retrieval model that uses a lexical translation table to map terms between two languages. This translation table is created as a side product of the baseline SMT model training and is thus bound to the lexical knowledge of the baseline SMT system and its general-domain data. Since the retrieval function only orders documents in the collection by score and lacks a component for classifying parallelism, one must define precision-oriented constraints to ensure parallelism or comparability of the returned candidate pairs in a post-retrieval step. Still, the mined data contains a lot of noise, and a positive adaptation result for method E2 may not always be guaranteed. One way to approach this is to adapt the SMT system incrementally, performing smaller adaptation steps while iteratively re-training the SMT model. Since the retrieval step depends on the lexical coverage of the SMT system, one can imagine a bootstrapping process that incrementally and jointly adapts both the SMT and IR models. For example, by first adding new words in the vicinity of known words more conservatively, one can then re-run retrieval with the updated translation model to either learn more distant words or boost the retrieval score of existing candidate pairs that were previously not positioned in the top k. With this iterative scheme, smaller portions of the found in-domain data can be used per iteration. In the next section we propose such an iterative approach to domain adaptation for Twitter translation.
The experimental results have shown that KCCA consistently and significantly outperformed LSI for cross-language information retrieval. We can also see that similar results were obtained for the English-Japanese bilingual corpus as for English-French. However, compared with the high retrieval accuracy obtained when training documents are used as queries, the accuracy is low when documents not used in training serve as queries. This may be due to the small number of training documents we used. With KCCA we extracted a semantic correspondence between the two languages from the training documents; if the training set is too small to be representative, the extracted semantic correspondence is generally poor.
Parallel corpora are invaluable resources in many areas of natural language processing (NLP). They are used in multilingual NLP as a basis for the creation of translation models (Brown et al., 1990) and for lexical acquisition (Gale and Church, 1991), as well as for cross-language information retrieval (Chen and Nie, 2000). Parallel corpora can also benefit monolingual NLP via the induction of monolingual analysis tools for new languages or the improvement of tools for languages where tools already exist (Hwa et al., 2005; Padó and Lapata, 2005; Yarowsky and Ngai, 2001).
Many users have some foreign language knowledge, but their proficiency may not be good enough to formulate queries that appropriately express their information needs. Such users will benefit enormously if they can enter their query in their native language, because they are then able to examine relevant documents in other languages even if those documents have not been translated. Users with no target-language knowledge can send relevant retrieved material to a translation service. The key issue is to be able to find relevant information in the first place, and to know that it is relevant. For this reason, much attention has been given over the last few years to the study and development of tools for cross-language information retrieval, i.e. tools that allow users of document collections in multiple languages to formulate queries in their preferred language and retrieve relevant information, in whatever language it is stored.
A novel and complex form of information access is cross-language information retrieval: searching for texts written in foreign languages based on native-language queries. Although the underlying technology for achieving such a search is relatively well understood, the appropriate interface design is not. This paper presents three user evaluations conducted during the iterative design of Clarity, a cross-language retrieval system for rare languages, and shows how the user interaction design evolved depending on the results of the usability tests. The first test was instrumental in identifying weaknesses in both functionality and interface; the second was run to determine whether the query translation should be shown or not; the final one was a global assessment focused on user satisfaction criteria. Lessons were learned at every stage of the process, leading to a much more informed view of what a cross-language retrieval system should offer its users.
The retrieval task becomes more difficult in the setting of cross-language information retrieval (CLIR), because of the additional uncertainty introduced in the cross-lingual matching process. This paper introduces a CLIR framework that combines the state-of-the-art keyword-based approach with a latent semantic-based retrieval model (Fig. 1). To capture and analyze the hidden semantics of source-language queries and target-language documents, we construct latent semantic analysis models that map texts in the source and target languages into a shared semantic space, in which the similarities of a query and documents are measured. In addition to the traditional keyword-based CLIR system, our proposed framework consists of deep belief network (DBN)-based semantic analysis models for each language and a canonical correlation analysis (CCA) model for inter-lingual similarity computation. The DBN and CCA models are trained on a large-scale comparable corpus and use low-dimension vectors to represent the semantics of texts. The proposed approach is evaluated on a standard ad hoc CLIR dataset from the CLEF workshop, with English as the source language and German as the target language.
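Once the DBN encoders and the CCA projection have mapped texts into the shared space, the latent-semantic side of the framework reduces to cosine ranking, optionally interpolated with the keyword score. The sketch below assumes the projection has already been applied; the vectors, keyword scores, and the mixing weight `alpha` are all toy assumptions, not the paper's trained values.

```python
# Minimal sketch of the latent-semantic scoring step: rank target-language
# documents by cosine similarity to the query in the shared semantic space
# (produced upstream by the DBN + CCA models), and interpolate with a
# keyword-based score. All numbers here are illustrative.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_score(query_vec, doc_vec, keyword_score, alpha=0.5):
    """Interpolate semantic and keyword evidence (alpha is a guess)."""
    return alpha * cosine(query_vec, doc_vec) + (1 - alpha) * keyword_score

# Toy shared-space vectors: an English query and two German documents.
q  = [0.9, 0.1, 0.0]
d1 = [0.8, 0.2, 0.1]   # semantically close to the query
d2 = [0.1, 0.2, 0.9]   # semantically distant
scores = [combined_score(q, d, kw) for d, kw in [(d1, 0.3), (d2, 0.4)]]
```

Even though the distant document has a slightly higher keyword score in this toy setup, the semantic evidence dominates and the close document ranks first.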