Procedia Computer Science 57 ( 2015 ) 815 – 820
1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of organizing committee of the 3rd International Conference on Recent Trends in Computing 2015 (ICRTC-2015) doi: 10.1016/j.procs.2015.07.484
3rd International Conference on Recent Trends in Computing 2015 (ICRTC-2015)
Monolingual Information Retrieval using Terrier: FIRE 2010
Experiments based on n-gram indexing
Santosh K. Vishwakarma a, Kamaljit I Lakhtaria b, Divya Bhatnagar c, Akhilesh K Sharma d
a Gyan Ganga Institute Of Technology & Sciences, Jabalpur, Madhya Pradesh, India
b Auro University, Surat, Gujarat, India
c,d SPSU, Rajasthan, Udaipur, 313001, India
Abstract
N-gram based indexing has proved to be a useful technique for efficient document retrieval. We applied the n-gram approach to Hindi text, running our experiments on the FIRE 2010 Hindi text collection with the Terrier open source search engine. Our experiments show that 4-grams give the best results among the n-gram lengths tested (n = 2 to 6), with the highest mean average precision (MAP) for every retrieval model considered.
Keywords: Information Retrieval; N-gram; MAP; Pruning; Hindi Monolingual
1. Introduction
The n-gram based indexing approach aims to improve the effectiveness of the retrieval task. Its purpose is to replace each whole term with multiple n-grams in the vector space model. An n-gram based system is easy to develop and requires less time for morphological processing. With this approach, only a fixed number of n-grams exist for a given value of n [1]. We performed our experiments on a Hindi corpus, as Hindi is the official language of India and the most spoken language in the country. It is mainly spoken in the northern and central parts of the country. In this paper, we describe how the n-gram based approach can be used for efficient retrieval in a Hindi text collection. The experiments are carried out on the FIRE 2010 data set for Hindi.
This paper is organized as follows. Section 2 discusses related work carried out using the n-gram approach for different languages. Section 3 introduces the basic functions of the open source search engine Terrier and the main reasons behind our decision to use it for our experiments and evaluation. Section 4 describes the structure of the corpus, the document and topic files, and the relevance judgements. Section 5 reports the experiments and the analysis of our evaluation results. The paper concludes with possible research directions to improve IR performance for Indian languages.
* Corresponding author. Tel.: +91-9329487050. E-mail address: santoshscholar@gmail.com
2. Related Work
D’Amore and Mah [1] introduced the concept of n-grams by replacing whole terms with n-grams in the vector space model. Their contention was that a large document contains more n-grams than a small document. They compute the weight of an n-gram from its number of occurrences in a document, under the assumption that n-grams occur with equal likelihood and follow a binomial distribution. Damashek [2] expanded on the work of D’Amore and Mah [1] by implementing a five-gram based measure of relevance. His algorithm relies on the vector space model but computes relevance in a different manner. It processes n-grams without any parsing, which makes it language independent.
Pearce and Nicholas [3] extended the work of Damashek [2], using n-grams to generate hypertext links. The links are obtained by computing similarity measures between a selected body of text and the remainder of the document: five-grams are identified, a vector representing the selected text is constructed, and cosine similarities are computed to rank the document. Teufel [4] studied n-gram techniques based on an indirect similarity measure.
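The five-gram and cosine similarity idea described above can be illustrated with a minimal sketch. It is not the algorithm of Damashek [2] or of Pearce and Nicholas [3]; it assumes raw n-gram counts as vector weights, and the function names are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def ngram_profile(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in the text, without any parsing."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(selected: str, passages: list[str], n: int = 5) -> list[tuple[int, float]]:
    """Return (passage index, score) pairs, most similar passage first."""
    query = ngram_profile(selected, n)
    scores = [(i, cosine_similarity(query, ngram_profile(p, n)))
              for i, p in enumerate(passages)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Because the profile is built directly from the character sequence, the same code applies unchanged to text in any language, which is the language independence property noted above.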
McNamee and Mayfield [6] performed research in the area of IR using the “haircut” technique. They report that, among n-grams of different lengths, 4-grams produced the best results for most European languages. The thesis of Paul McNamee [7] describes how the performance of an IR system is affected when long words are broken into smaller parts, and how word-spanning n-grams capture associations in the text. Lexical analysis of documents for the Marathi language was studied by Ashish Almeida and Pushpak Bhattacharyya [8].
3. The Terrier Toolkit
As a newly established research group, we chose to adopt one of the available Information Retrieval toolkits for our research and experimentation. Terrier [9], the open source search engine provided by the University of Glasgow, became our primary research tool in preference to other available tools such as Lemur, Smart and Lucene. Terrier is written in Java and is designed and developed by researchers from the Computer Science Department at the University of Glasgow. The open source nature of Terrier is critical, since it enables researchers to build their own research on top of it rather than treating it as a black box. Terrier strives to provide state-of-the-art efficient indexing and effective retrieval mechanisms. There are several reasons to consider Terrier, including:
• Terrier supports a wide variety of weighting models such as DFR_BM25, BB2, TF_IDF, PL2, InL2, In_expB2 and In_expC2. It also supports several field-based weighting models such as PL2F, BM25F, ML2 and MDL2, as well as two proximity (dependence) models, DFRDependenceScoreModifier and MRFDependenceScoreModifier.
• Terrier is designed as an IR research platform and conveniently supports parsing of documents in UTF-8 format.
• The toolkit is under constant development for performance improvements as well as feature additions. The latest version is Terrier 3.5, which is compatible with FIRE text collections.
• The toolkit is extensible and adaptable, with source code available. Various IR functions such as cross-lingual information retrieval and question answering are supported.
4. Dataset
The experiments were carried out on the FIRE 2010 data set [http://www.isical.ac.in/~fire/] for Hindi. The corpus contains documents from the Hindi news domain. These news articles were extracted from the websites of two widely read newspapers, Amar Ujala [http://www.amarujala.com/] and Jagran [http://www.jagran.com/]. There are 54271 documents from Amar Ujala and 95216 documents from Jagran. The corpus was created to support research experiments in the information retrieval domain.
4.1 Document Format
The FIRE dataset adopts the TREC document format [http://trec.nist.gov]. Each text document is stored in a separate file. For the Hindi text collection, the documents use the UTF-8 encoding. Each document has three fields: DOC, DOCNO and TEXT. DOCNO is a unique identifier assigned to every document in the corpus. The TEXT field contains the actual news article in plain text. An example of a text file is shown below.
<DOC>
<DOCNO>default_cur_1_date_1_5_2005.utf8</DOCNO>
<TEXT>
बहार मसमयरहते रा यपाल केसामने अपनाबहुमत सा बतनह ं करपाए राजग नेशु वार को रा प तभवनमभंग वधानसभाकेसद य क परेडकरवाई।भाजपाअ य लालकृ णआडवाणी केनेतृ वमरा प तएपीजेअ दुल
</TEXT>
</DOC>
Table 1. FIRE 2010 hi file
4.2 Topic Format
The test data set contains a total of 50 queries, numbered 76 to 125. The queries were selected so that they cover the news published within the time span of the collection. An example of a topic file is shown below.
<top lang='hi'>
<num>81</num>
<title>भारतमजापानीए सेफलाइ टसके लयेअसं ा यकायसूचीक सम याएँ</title>
<desc>भारतीयब च काजापानीए सेफलाइ टससेर ाकरनेके लयेभारतीय वा यमं ालय नेिजस कमसूचीको हण कया, उसे लागू करने के लये कन-कन बाधाओं का सामना करना पड़ा?</desc>
<narr>भारतीयब च कोजापानीए सेफलाइ टसक असं ा यट क केदेनेसे या- यासम याएँ उ प नहु ? भारतमज़ रतकेमुता बकट क केनबननेसे (वशेषतःचीनसे) ट क कोखर दने क प रक पना। ासं गक लेखमइससेसंबं धतचचाह रहनेचा हये</narr>
</top>
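As a rough illustration of how a topic file in this format could be read outside Terrier, the sketch below extracts the num, title, desc and narr fields with regular expressions. The Topic class and the parse_topics function are hypothetical names, and the choice to build the query from all three text fields is an assumption of this sketch.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Topic:
    num: str
    lang: str
    title: str
    desc: str
    narr: str

    def query_text(self) -> str:
        """Concatenate the non-empty text fields (assumed query construction)."""
        return " ".join(part for part in (self.title, self.desc, self.narr) if part)


# One <top lang='..'> ... </top> element per topic, as in the example above.
_TOP_RE = re.compile(
    r"<top\s+lang=['\"](?P<lang>[^'\"]*)['\"]\s*>(?P<body>.*?)</top>", re.DOTALL
)


def _field(body: str, tag: str) -> str:
    """Return the stripped content of <tag>...</tag>, or '' if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_topics(raw: str) -> list[Topic]:
    """Parse every <top> element found in the raw topic file content."""
    topics = []
    for match in _TOP_RE.finditer(raw):
        body = match.group("body")
        topics.append(Topic(
            num=_field(body, "num"),
            lang=match.group("lang"),
            title=_field(body, "title"),
            desc=_field(body, "desc"),
            narr=_field(body, "narr"),
        ))
    return topics
```

The file content should be read as UTF-8 before it is passed to parse_topics, since the topics are written in Devanagari script.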
4.3 Relevance Judgements
The qrels file contains the relevance judgements for queries 76 to 125. It contains 22572 judgement lines for this set of queries.
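The layout of the qrels file is not described here. The sketch below assumes the usual four-column TREC layout (topic number, iteration, document number, relevance), which is an assumption rather than something confirmed for the FIRE 2010 file, and load_qrels is a hypothetical helper name.

```python
from typing import Dict, Iterable, Set


def load_qrels(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each topic number to the set of document numbers judged relevant.

    Assumes whitespace-separated lines of the form:
        <topic> <iteration> <docno> <relevance>
    Lines with any other number of columns are skipped.
    """
    relevant: Dict[str, Set[str]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # blank or malformed line
        topic, _iteration, docno, judgement = parts
        docs = relevant.setdefault(topic, set())  # keep topics with no relevant docs
        if int(judgement) > 0:
            docs.add(docno)
    return relevant
```

Any iterable of lines works as input, for example a file object opened with UTF-8 encoding.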
5. Experiments
We performed our experiments with Terrier 3.5, which has all the code necessary to support experiments on the FIRE dataset. We made some changes to the terrier.properties file. We indexed the documents with n-grams for different values of n, creating indexes for n = 2 to 6. N-grams were generated from a stream of characters from which all punctuation marks had been removed. For every index, five retrieval models were used to evaluate the results; these models are available in Terrier 3.5. The results are evaluated in terms of mean average precision (MAP). The reported scores are the MAP values obtained using the title, description and narration fields of the topics.
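The n-gram generation step can be sketched as follows. This is an illustration rather than the code used inside Terrier: it assumes that punctuation is replaced by a space so that neighbouring words are not glued together, that runs of whitespace collapse to a single space, that n-grams may span word boundaries, and that n is counted in Unicode code points. The function names are hypothetical.

```python
import unicodedata
from typing import List


def strip_punctuation(text: str) -> str:
    """Replace punctuation (Unicode categories P*) with spaces and normalise whitespace.

    Combining vowel signs used in Devanagari are marks (M*), not punctuation,
    so they are kept.
    """
    kept = (" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    return " ".join("".join(kept).split())


def char_ngrams(text: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of the punctuation-free stream."""
    if n < 1:
        raise ValueError("n must be at least 1")
    stream = strip_punctuation(text)
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]


def index_terms(text: str, sizes=range(2, 7)) -> dict:
    """Build one list of index terms per n-gram length (n = 2 to 6 by default)."""
    return {n: char_ngrams(text, n) for n in sizes}
```

For example, char_ngrams("ab, cd", 4) first reduces the text to "ab cd" and then yields the two 4-grams "ab c" and "b cd".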
MAP value for n-grams
Models     n=2     n=3     n=4     n=5     n=6
TF_IDF     0.1640  0.3305  0.3729  0.3482  0.3012
BM25       0.1689  0.3428  0.3772  0.3496  0.3003
DFR_BM25   0.1684  0.3429  0.3787  0.3521  0.3026
PL2        0.1675  0.3443  0.3790  0.3524  0.3077
InL2       0.1658  0.3389  0.3748  0.3534  0.3089
Table 3. MAP scores for different n values
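For readers unfamiliar with the measure, MAP is the mean over topics of average precision, where average precision is the mean of the precision values at the ranks of the relevant documents retrieved, divided by the total number of relevant documents for the topic. The sketch below is a plain textbook implementation with hypothetical function names; it assumes each ranked list contains no duplicate documents and it averages over every judged topic, which may differ in detail from the evaluation tool used for the reported scores.

```python
from typing import Dict, List, Set


def average_precision(ranked_docnos: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against the set of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(topic, []), rel) for topic, rel in qrels.items())
    return total / len(qrels)
```

A topic whose only relevant document is returned at rank 2 has an average precision of 0.5, so a run with one such topic and one perfect topic has a MAP of 0.75.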
Fig. 1. MAP scores for different n-grams (MAP value, on a scale from 0 to 0.4, for TF_IDF, BM25, DFR_BM25, PL2 and InL2 at n=2 to n=6)
Fig. 2. MAP score for 4-gram unique value
The results show that the PL2 model performs best with 4-grams, giving the highest MAP value of 0.3790. This is slightly higher than the values of the other models, both the term frequency based model and the other probabilistic models. The results of the experiments clearly show that 4-grams give the maximum precision values for Hindi text documents. Therefore, the 4-gram approach can be taken as a probabilistic approach to generating sentences in the Hindi language.
6. Conclusion
Based on our experiments, we found that, among n-grams of different lengths, 4-grams produce the best results. They give the maximum MAP score for every retrieval model we considered. In future work, we will extend these experiments to more values of n and to other retrieval models. We will also apply this approach to other Indian languages such as Marathi, Gujarati and Bengali.
7. Acknowledgements
Our sincere thanks to the Forum for Information Retrieval & Evaluation (FIRE) group for allowing us to use the data for our experiments. Thanks also to the Terrier development group [9] for providing open source software for research purposes.
References
1. D’Amore, R. and Mah, C. One time complete indexing of text: Theory and practise. Eighth Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 155-164
2. Damashek, M. (1995). Gauging similarity via n-grams: Language independent categorization of text. Science, 267(5199):843-848
3. Pearce, C. and Nicholas, C. (1993). Generating a dynamic hypertext environment with n-gram analysis. In Proceedings of the Second International Conference on Information and Knowledge Management, pages 148-153
4. Teufel, B. (1998). Statistical n-gram indexing of natural language documents. International Forum of Information and Documentation, 16(4):15-19
5. Ljiljana Dolamic and Jacques Savoy, UniNE at FIRE 2010: Hindi, Bengali, and Marathi IR
6. Paul McNamee and James Mayfield, Character N-gram Tokenization for European Language Text Retrieval. Information Retrieval, 7:73-97, 2004.
7. Paul McNamee, Textual Representations for Corpus-Based Bilingual Retrieval, PhD Thesis, University of Maryland Baltimore County, December 2008
8. Ashish Almeida and Pushpak Bhattacharyya, Using Morphology to Improve Marathi Monolingual Information Retrieval, FIRE 2008, Kolkata, India
9. Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Christina Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006). 10th August, 2006. Seattle, Washington, USA.
10. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999
11. David A. Grossman, Ophir Frieder, Information Retrieval: Algorithms and Heuristics, Springer, 2004
12. Vishwakarma, Santosh K., Kamaljit I. Lakhtaria, Divya Bhatnagar, and Akhilesh K. Sharma. "An efficient approach for inverted index pruning based on document relevance." In Communication Systems and Network Technologies (CSNT), 2014 Fourth International Conference on, pp. 487-490. IEEE, 2014.
13. Dolamic, Ljiljana, and Jacques Savoy. "Indexing and stemming approaches for the Czech language." Information Processing & Management 45.6 (2009): 714-720.
14. Vishwakarma, Santosh K., Divya Bhatnagar, Kamaljit Lakhtaria, and Yashoverdhan Vyas. "A distance based static index pruning method for phrase terms."