
Procedia Computer Science 57 (2015) 815–820

1877-0509 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Peer-review under responsibility of the organizing committee of the 3rd International Conference on Recent Trends in Computing 2015 (ICRTC-2015). doi: 10.1016/j.procs.2015.07.484


3rd International Conference on Recent Trends in Computing 2015 (ICRTC-2015)

Monolingual Information Retrieval using Terrier: FIRE 2010 Experiments based on n-gram indexing

Santosh K. Vishwakarma a, Kamaljit I. Lakhtaria b, Divya Bhatnagar c, Akhilesh K. Sharma d

a Gyan Ganga Institute Of Technology & Sciences, Jabalpur, Madhya Pradesh, India
b Auro University, Surat, Gujarat, India
c,d SPSU, Udaipur, Rajasthan, 313001, India

Abstract

N-gram based indexing has proved to be a useful technique for efficient document retrieval. We applied the n-gram approach and performed experiments on Hindi language text collections, using the dataset of the FIRE 2010 Hindi text collection and the Terrier open source search engine. Our experiments show that 4-grams give the best results among n-grams of different lengths, with a clear increase in mean average precision.


Keywords: Information Retrieval; N-gram; MAP; Pruning; Hindi Monolingual

1. Introduction

The n-gram based indexing approach aims at improving the effectiveness of the retrieval task. Its purpose is to replace a whole term with multiple n-grams in the vector space model. An n-gram based system is easy to develop and requires less time for morphological processing; with this approach, only a fixed number of n-grams exist for a given value of n [1]. We performed our experiments on a Hindi corpus, as Hindi is the official language of India and the most widely spoken language in the country, used mainly in its northern and central parts. In this paper, we describe how an n-gram based approach can be used for efficient retrieval in a Hindi text collection. The experiments are carried out on the FIRE 2010 data set for Hindi. This paper is organized as follows. Section 2 discusses related work on the n-gram approach for different languages. Section 3 introduces the basic functions of the open source search engine Terrier and the main reasons behind our decision to use it for our experiments and evaluation. Section 4 discusses the structure of the corpus, the dataset, and the relevance judgement and topic files. Section 5 reports the experiments and the analysis of our evaluation results. The paper concludes with possible research directions to improve IR performance for Indian languages.

* Corresponding author. Tel.: +91-9329487050. E-mail address: santoshscholar@gmail.com
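The replacement of a whole term by overlapping character n-grams, as described above, can be sketched in a few lines of Python (a minimal illustration of the technique; the function is ours, not part of Terrier):

```python
def char_ngrams(term, n):
    """Return the overlapping character n-grams of a term.

    A term shorter than n yields the term itself, so no token is lost.
    For a term of length L >= n there are exactly L - n + 1 n-grams.
    """
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]

# A whole term is replaced by its 4-grams in the index vocabulary.
print(char_ngrams("retrieval", 4))
# → ['retr', 'etri', 'trie', 'riev', 'ieva', 'eval']
```

The same sliding-window idea applies unchanged to Devanagari strings, which is what makes the approach largely language independent.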

2. Related Work

D’Amore and Mah [1] introduced the concept of n-grams by replacing whole terms with n-grams in the vector space model. Their contention was that a large document contains more n-grams than a small one. They computed the weight for an n-gram using its number of occurrences in a document, under the assumption that n-grams occur with equal likelihood and follow a binomial distribution. Damashek [2] expanded on D’Amore and Mah’s [1] work by implementing a five-gram based measure of relevance. His algorithm relies upon the vector space model but computes relevance in a different manner; it traces n-grams without any parsing, which makes it language independent.

Pearce and Nicholas [3] extended the work of Damashek [2] by using n-grams to generate hypertext links. The links are obtained by computing similarity measures between a selected body of text and the remainder of the document: five-grams are identified, a vector representing the selected text is constructed, and finally cosine similarities are computed to rank the documents. Teufel [4] studied n-gram techniques based on an indirect similarity measure.

Paul McNamee [6] performed research in this area of IR with the "haircut" system. He reports that, among n-grams of different lengths, 4-grams produced the best results for most European languages. The thesis work of Paul McNamee [7] examines how the performance of an IR system is affected when long words are broken into small parts, and how word-spanning n-grams capture associations in the text. Lexical analysis of documents for the Marathi language is studied by Ashish Almeida and Pushpak Bhattacharyya [8].

3. The Terrier Toolkit

As a newly established research group, we considered adopting one of the available information retrieval toolkits for our research and experimentation. Terrier [9], the open source search engine provided by the University of Glasgow, became our primary research tool in preference to other available tools such as Lemur, SMART, and Lucene. Terrier is written in Java and was designed and developed by researchers from the Computer Science Department at the University of Glasgow. Moreover, the open source nature of Terrier is critical, since it enables researchers to build their own research on top of it rather than treating it as a black box. Terrier strives to provide state-of-the-art efficient indexing and effective retrieval mechanisms. There are several reasons to consider Terrier, including:

- Terrier supports a huge variety of weighting models, such as DFR_BM25, BB2, TF_IDF, PL2, InL2, In_expB2, and In_expC2. It also supports several field based weighting models, such as PL2F, BM25F, ML2, and MDL2. Two proximity (term dependence) models, DFRDependenceScoreModifier and MRFDependenceScoreModifier, are also supported.

- Terrier is designed as an IR research platform, and it conveniently supports parsing of UTF-8 encoded documents.

(3)

- The toolkit is under constant development, with performance improvements as well as feature additions. The latest version, Terrier 3.5, is compatible with the FIRE text collections.

- The toolkit is expandable and adaptable, with source code available. Various IR functions, such as cross lingual information retrieval and question answering, are supported.
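The weighting model used for a batch retrieval run is selected through the terrier.properties configuration file. A minimal fragment might look like the following (property names follow Terrier 3.5 conventions; the exact values shown are illustrative, not the paper's actual configuration):

```properties
# terrier.properties (fragment) -- illustrative values only

# weighting model applied during batch (TREC-style) retrieval
trec.model=PL2

# tokenise UTF-8 (Devanagari) text and read the collection files as UTF-8
tokeniser=UTFTokeniser
trec.encoding=UTF-8
```

Switching from PL2 to, say, DFR_BM25 for another run requires changing only the trec.model line, which is what makes model comparisons like those in Section 5 convenient.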

4. Dataset

The experiments have been carried out on the FIRE 2010 data set [http://www.isical.ac.in/~fire/] for Hindi. The corpus contains documents from the Hindi news domain. These news articles were extracted from the websites of two widely read newspapers, Amar Ujala [http://www.amarujala.com/] and Jagran [http://www.jagran.com/]: 54271 documents from Amar Ujala and 95216 documents from Jagran. The corpus was created to support experiments in the information retrieval research domain.

4.1 Document Format

The FIRE dataset adopts the TREC document style format [http://trec.nist.gov]. Each text document is stored in a separate file; for the Hindi collection, documents use the UTF-8 encoding. Each document has three fields: DOC, DOCNO, and TEXT. DOCNO is a unique identifier assigned to every document in the corpus, and the TEXT field contains the actual news article in plain text. An example of a text file is shown below:

<DOC>
<DOCNO>default_cur_1_date_1_5_2005.utf8</DOCNO>
<TEXT>
बहार मसमयरहते रा यपाल केसामने अपनाबहुमत सा बतनह ं करपाए राजग नेशु वार को रा प तभवनमभंग वधानसभाकेसद य क परेडकरवाई।भाजपाअ य लालकृ णआडवाणी केनेतृ वमरा प तएपीजेअ दुल
</TEXT>
</DOC>
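Extracting the DOCNO and TEXT fields from such a file can be sketched with a small Python parser (a toy illustration only; Terrier's own TREC collection classes handle this in practice, and the regex approach below assumes well-formed, non-nested tags):

```python
import re

# One TREC-style document: <DOC> wrapping a <DOCNO> and a <TEXT> field.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>(?P<docno>.*?)</DOCNO>\s*<TEXT>(?P<text>.*?)</TEXT>\s*</DOC>",
    re.DOTALL,
)

def parse_trec_docs(raw):
    """Yield (docno, text) pairs from a TREC-style document stream."""
    for m in DOC_RE.finditer(raw):
        yield m.group("docno"), m.group("text").strip()

sample = "<DOC> <DOCNO>doc_1.utf8</DOCNO> <TEXT> some news text </TEXT> </DOC>"
print(list(parse_trec_docs(sample)))
# → [('doc_1.utf8', 'some news text')]
```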

Table 1. FIRE 2010 hi file 12

4.2 Topic Format

The test data set contains a total of 50 queries, numbered 76 to 125. The queries were selected so that they cover all the news within that time segment. An example of a topic is as follows:

<top lang='hi'>
<num>81</num>
<title>भारतमजापानीए सेफलाइ टसके लयेअसं ा यकायसूचीक सम याएँ</title>
<desc>भारतीयब च काजापानीए सेफलाइ टससेर ाकरनेके लयेभारतीय वा यमं ालय नेिजस कमसूचीको हण कया, उसे लागू करने के लये कन-कन बाधाओं का सामना करना पड़ा?</desc>
<narr>भारतीयब च कोजापानीए सेफलाइ टसक असं ा यट क केदेनेसे या- यासम याएँ उ प नहु ? भारतज़ रतकेमुता बकट क केबननेसे (वशेषतःचीनसे) ट क कोखर दने क प रक पना। ासं गक लेखमइससेसंबं धतचचाह रहनेचा हये</narr>
</top>

(4)

4.3 Relevance Judgements

The qrels file contains the relevance judgements for queries 76 to 125, with 22572 judgement lines for this set of queries.
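In the TREC convention, each qrels line has the form `qid iteration docno relevance`. Loading such a file into a per-query lookup can be sketched as follows (the field layout is the standard TREC one; the sample lines are invented, not excerpts from the FIRE file):

```python
from collections import defaultdict

def load_qrels(lines):
    """Map query id -> set of relevant document ids.

    Each line: '<qid> <iteration> <docno> <relevance>'.
    Only positive relevance values are kept.
    """
    relevant = defaultdict(set)
    for line in lines:
        qid, _iteration, docno, rel = line.split()
        if int(rel) > 0:
            relevant[qid].add(docno)
    return relevant

qrels = load_qrels(["76 0 doc_a 1", "76 0 doc_b 0", "77 0 doc_c 1"])
print(sorted(qrels["76"]))
# → ['doc_a']
```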

5. Experiments

We performed our experiments with Terrier 3.5, which has all the necessary code to support experiments on the FIRE dataset. We made some changes in the terrier.properties file and indexed the documents with n-grams for different values of n, creating indexes for n = 2 to 6. N-grams are generated from a stream of characters from which all punctuation marks were removed. For every index, five retrieval models available in Terrier 3.5 were used to evaluate the results. The results are evaluated in terms of MAP, i.e. mean average precision; the scores represent the MAP values on title, description, and narration.

Model      n=2     n=3     n=4     n=5     n=6
TF_IDF     0.1640  0.3305  0.3729  0.3482  0.3012
BM25       0.1689  0.3428  0.3772  0.3496  0.3003
DFR_BM25   0.1684  0.3429  0.3787  0.3521  0.3026
PL2        0.1675  0.3443  0.3790  0.3524  0.3077
InL2       0.1658  0.3389  0.3748  0.3534  0.3089

Table 3. MAP scores for different values of n
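MAP, the measure reported in Table 3, is the mean over queries of per-query average precision. A compact sketch of its computation (our own reference implementation for illustration, not Terrier's evaluation code) is:

```python
def average_precision(ranked, relevant):
    """Average precision for one query's ranked result list."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            score += hits / rank  # precision at this relevant hit
    return score / len(relevant)

def mean_average_precision(runs, qrels):
    """runs: qid -> ranked doc list; qrels: qid -> set of relevant docs."""
    aps = [average_precision(runs[q], qrels.get(q, set())) for q in runs]
    return sum(aps) / len(aps)

# Tiny worked example: the only relevant doc is retrieved at rank 2,
# so AP = (1/2) / 1 = 0.5.
print(average_precision(["d1", "d2"], {"d2"}))
# → 0.5
```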

Fig. 1. MAP scores for different n-grams


(5)

Fig. 2. MAP scores of the different models for 4-grams

The results show that the PL2 model performs best for n = 4, with the highest value of 0.3790. This is slightly higher than the other models, both the term frequency based model and the various probabilistic models. The results of the experiments clearly indicate that 4-grams give the maximum precision values for Hindi textual documents, so the 4-gram approach can be taken as the preferred n-gram length for indexing Hindi text.

6. Conclusion

Based on our experiments, we found that among n-grams of different lengths, 4-grams produce the best results, giving the maximum MAP scores for every retrieval model we considered. For future work, we will extend this work to more values of n and evaluate additional retrieval models. We will also apply this approach to other Indian languages such as Marathi, Gujarati, and Bengali.

7. Acknowledgements

Our sincere thanks to the Forum for Information Retrieval Evaluation (FIRE) group for allowing us to use the data for our experiments. Thanks also to the Terrier development group [9] for providing open source software for research purposes.

References

1. D’Amore, R. and Mah, C. (1985). One-time complete indexing of text: Theory and practice. In Proceedings of the Eighth Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 155-164.

2. Damashek, M. (1995). Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843-848.

3. Pearce, C. and Nicholas, C. (1993). Generating a dynamic hypertext environment with n-gram analysis. In Proceedings of the Second International Conference on Information and Knowledge Management, pages 148-153.

4. Teufel, B. (1998). Statistical n-gram indexing of natural language documents. International Forum of Information and Documentation, 16(4):15-19.

5. Dolamic, L. and Savoy, J. UniNE at FIRE 2010: Hindi, Bengali, and Marathi IR.

6. McNamee, P. and Mayfield, J. (2004). Character n-gram tokenization for European language text retrieval. Information Retrieval, 7:73-97.

(6)

7. McNamee, P. (2008). Textual Representations for Corpus-Based Bilingual Retrieval. PhD thesis, University of Maryland Baltimore County, December 2008.

8. Almeida, A. and Bhattacharyya, P. (2008). Using morphology to improve Marathi monolingual information retrieval. FIRE 2008, Kolkata, India.

9. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., and Lioma, C. (2006). Terrier: A high performance and scalable information retrieval platform. In Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), August 2006, Seattle, Washington, USA.

10. Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press.

11. Grossman, D. A. and Frieder, O. (2004). Information Retrieval: Algorithms and Heuristics. Springer.

12. Vishwakarma, S. K., Lakhtaria, K. I., Bhatnagar, D., and Sharma, A. K. (2014). An efficient approach for inverted index pruning based on document relevance. In Fourth International Conference on Communication Systems and Network Technologies (CSNT), pages 487-490. IEEE.

13. Dolamic, L. and Savoy, J. (2009). Indexing and stemming approaches for the Czech language. Information Processing & Management, 45(6):714-720.

14. Vishwakarma, S. K., Bhatnagar, D., Lakhtaria, K., and Vyas, Y. A distance based static index pruning method for phrase terms.
