NOVEL IMPLEMENTATION OF SEARCH ENGINE FOR TELUGU DOCUMENTS WITH SYLLABLE N-GRAM MODEL

DR. B. PADMAJA RANI* AND DR. A. VINAY BABU¹

*Associate Professor, Department of CSE, JNTUCEH, Hyderabad, A.P., India. http://jntuceh.ac.in/csstaff.htm

¹Director of Admissions, JNTUH, and Professor, Department of CSE, JNTUCEH, Hyderabad, A.P., India. http://jntu.ac.in/director-admissions.php

Abstract:

As technology grows day by day, there is an enormous increase in the number of documents posted on the web, and users need applications that retrieve the information they want efficiently. Search engines are the key to finding specific information on the vast expanse of the World Wide Web; without sophisticated search engines, it would be virtually impossible to locate anything on the web without knowing a specific URL. A search engine is a program that searches documents for specified keywords and returns a list of the documents in which the keywords were found. The keywords are given as a query, and the search engine returns the list of documents matching the query based on certain algorithms, ranking the documents so that the more relevant ones are placed first in the retrieved results. Recently there has also been an enormous increase in non-English web documents. The largest share of these is Chinese, but Indian language text documents are also gaining in volume, and they need to be organized so that retrieval against a query is fast. Telugu is the third most spoken language in India, one of the fifteen most spoken languages in the world, and the official language of the state of Andhra Pradesh. The number of Telugu text documents is likewise growing rapidly, and because of the complexity of the Telugu language it is very difficult to search for and retrieve the documents needed. Hence, there is a need for an application that provides the user with efficient retrieval of the required information.

Keywords: Search Engine; Telugu Documents; Syllable N-gram Model.

1. Introduction

A typical search engine methodology goes like this:

Step One: Make a List of URLs and Crawl Them

Before anything can be done, a list of URLs to crawl initially needs to be assembled. The most popular option for this is to load the URLs in the DMOZ database. These aren't the only sites that will be crawled: the pages linked to by sites in the DMOZ directory are also crawled, since the crawler follows the links. It certainly helps to be in DMOZ, especially if you don't have enough links from other sites to be sure that you'll be sufficiently crawled. Now, a group of computers is set up to download all of the pages on the list. These are called the "crawlers." They will also look at the links on those pages and crawl those URLs as well (the crawlers will continue following links until their hard drives are full).
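To make the crawl loop concrete, the following is a minimal breadth-first sketch in Python; the seed list, page limit, politeness delay, and the choice of the requests and BeautifulSoup libraries are illustrative assumptions, not part of the methodology described here.

```python
# Minimal breadth-first crawler sketch for Step One (illustrative only).
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Fetch pages starting from seed_urls, following links breadth-first."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}                                # url -> html text
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                          # skip unreachable pages
        pages[url] = resp.text
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])    # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(1)                         # politeness delay
    return pages
```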

Step Two: Analyze the Page

The crawlers now go through each page and look at its content. First, the crawler makes a table of every unique word on the page. It gives "points" to each word based on how many times it is used on the page; words in bold, in the title, in meta tags, or in headers are given extra points.
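As a sketch of this word table, the following counts "points" per word with extra weight for privileged positions; the specific weight values are assumptions chosen only for illustration.

```python
# Per-page word scoring sketch for Step Two; the weights are assumptions.
from collections import Counter

WEIGHTS = {"body": 1, "bold": 2, "header": 3, "meta": 3, "title": 5}

def score_words(sections):
    """sections maps a section name ('body', 'title', ...) to its text;
    returns a Counter of word -> points for one page."""
    points = Counter()
    for section, text in sections.items():
        weight = WEIGHTS.get(section, 1)
        for word in text.lower().split():
            points[word] += weight            # extra points for bold/title/...
    return points

page = {"title": "Telugu search engine",
        "body": "a search engine for Telugu documents"}
print(score_words(page).most_common(3))
```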


Unnecessary text that uses one term a lot will raise your percentage for that term, but will also lower the percentage for other terms.

More advanced engines will also cross-reference each word to other major words based on where they are relative to each other. As a result, the placement of words relative to each other does matter. This is why targeting phrases is usually better than targeting a variety of single words.

Calculate Link Popularity:

The crawlers now take their lists of the URLs that each page links to and combine them. So for each page there is now a list of the links on it, as well as the text of each link. The list is then reversed, so that instead of showing the links on each page, it shows for each page the sites that link to it.

Some search engines stop here and simply store the number of links pointing to a given page, but other search engines take it a little further.

For every page in its database, the search engine gives it “points” based on how many links are going to it just like any other search engine. Then, it re-calculates the number of links pointing to each page, but gives more points to links that had a higher point-value themselves in the first count. It then repeats the process about 100 times, each time making the points more accurate. So:

1. Points are assigned based on the number of links going to a page.

2. Points are calculated again, but pages get more points if the links going to a page had more points in the last step.

3. The original point values are thrown out and replaced with the points just calculated. Now the points are re-calculated again, this time starting from the values produced in Step 2 rather than the original counts. This is repeated approximately 100 times, and every time the points become more accurate (because the calculation looks further down the line at where links are coming from).

Now, the search engine takes the point values, which could be extraordinarily large, and converts them to a PageRank on a scale of 0 to 10. However, it does not simply convert, for example, 1,000 to 1 and 2,000 to 2. The scale is logarithmic, which means that higher PageRanks require many more points.
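The repeated recalculation described above is essentially the power iteration behind PageRank. The sketch below assumes a damping factor of 0.85 and a toy three-page link graph; the scaling constant in the 0-10 conversion is likewise an assumption.

```python
# Power-iteration sketch of the link-popularity calculation above.
import math

def pagerank(links, rounds=100, d=0.85):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(rounds):                   # "repeat about 100 times"
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)   # pass points along links
        rank = new
    return rank

def to_scale(score, base=10):
    """Logarithmic 0-10 scale: each step up needs many more points."""
    return min(10.0, max(0.0, math.log(score * 1e6, base)))

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for page, score in pagerank(graph).items():
    print(page, round(score, 3), round(to_scale(score), 2))
```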

Search engines put the databases into a specialized format, and then write the search software.

When a search is made, every site containing the relevant terms is pulled up. The ranking is based on a combination of the points for each relevant term, the site’s link popularity (PageRank), and other smaller factors. Each engine weighs these differently.

2. Our Model for Telugu Search Engine:

Our search engine uses the syllable n-gram model, in which each word in every document is divided into n-grams of length 1 to n, where n is the length of the word in syllables. The query to be searched is also divided into n-grams.

Every document is processed under these steps:

i) Each word in the document is divided into prefixes (n-grams) of decreasing length, starting from n down to 1, i.e., from the original word itself down to a prefix of length 1.

ii) A frequency count is attached to every gram according to the number of occurrences of that n-gram.

iii) These n-grams are sorted to facilitate faster retrieval.

The query to be searched is processed under this step:

i) Each word of the query string is likewise divided into n-grams.

The search for a given query is done as follows (a code sketch is given after these steps):

i) Starting with the largest n-gram, i.e., the word itself, a match is searched for in each document.

ii) The documents in which a match is found for the current n-gram are ranked according to the frequency count, i.e., the documents with the highest frequency count are ranked first, to facilitate the retrieval of the most relevant documents.

iii) The above two steps are repeated for the remaining n-grams, from the n-gram of length n-1 down to the n-gram of length 1.
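The following is a compact sketch of these indexing and search steps. For readability it forms prefixes over characters, whereas the model itself works over syllables (Section 3.1); all names are illustrative.

```python
# Prefix n-gram index and search, following steps (i)-(iii) above.
from collections import Counter

def prefixes(word):
    """All prefixes of word, from the full word down to length 1."""
    return [word[:k] for k in range(len(word), 0, -1)]

def index_document(text):
    """Build a sorted n-gram -> frequency count table for one document."""
    counts = Counter()
    for word in text.split():
        counts.update(prefixes(word))
    return dict(sorted(counts.items()))       # sorted for faster retrieval

def search(query, indexes):
    """indexes maps a document name to its n-gram table. For each query
    gram, longest first, rank matching documents by frequency count."""
    results = []
    for word in query.split():
        for gram in prefixes(word):
            hits = [(doc, table[gram]) for doc, table in indexes.items()
                    if gram in table]
            hits.sort(key=lambda h: -h[1])    # highest frequency first
            if hits:
                results.append((gram, hits))
    return results

docs = {"d1.txt": index_document("telugu telugu text"),
        "d2.txt": index_document("telugu search")}
for gram, hits in search("telugu", docs):
    print(gram, hits)
```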

3. Problem Definition and Methodology:

Telugu script is an abugida from the Brahmic family of scripts. The writing systems that employ Devanagari and other Indic scripts constitute a cross between syllabic writing systems and phonetic writing systems. The effective unit of these scripts is the orthographic syllable, consisting of a consonant-vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of ((C)C)CV. The orthographic syllable need not correspond exactly with a phonological syllable, especially when a consonant cluster is involved.


Apart from the CV core, a large number of conjunct formations are found in all Indic scripts. The conjunct formations are simply a combination of one or two consonants preceding the CV core, which provides a closer association with phonetic syllables. The basic structure is decomposed into vowels, consonants, CV cores, conjunct formations, and dead consonants. For all these formations there exists a nasal sound, represented with the addition of the 'anuswara' sign. A few character combinations are found with a special symbol, 'visarga', which is a rare occurrence. In real usage of the script, the above character combinations are found with certain percentages of occurrence.

In the present work, words are segmented into syllables, and stems are derived from the words in the text file. A similar procedure is adopted for the query: the stemmed query is searched (traced) for a match in the text file, with all the words replaced by their stems.

3.1 METHODOLOGY

The search engine model basically follows these methods:

Syllable Tokenization:

A finite state machine describes the set of all valid aksharas of Telugu script based on the ((C)C)CV canonical structure. Vowels and consonants alone form independent syllables. A vowel sign can follow consonants. On the other hand, complex conjuncts are formed by one or two occurrences of a consonant-halant combination, followed either by a consonant alone or by a consonant and a vowel sign. All of these akshara combinations can optionally be followed by an anuswara. It is also observed that Latin words written in Telugu script are often found in text. Such syllables usually end in dead consonants, which are formed by a consonant followed by a halant once, twice, or thrice. (Refer to Figure 1 below.)

V: vowel; C: consonant; VS: vowel sign (modifier); H: halant; SS: special symbol (anuswara)

Syllables are segmented from words using the above state machine. Combining n syllables together forms syllable n-grams.
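A simplified Python reading of this state machine is sketched below. The character classes use the standard Unicode Telugu block assignments; the greedy left-to-right traversal is an assumption about how the machine in Figure 1 is walked, and visarga and the rarer signs are omitted for brevity.

```python
# Finite-state syllable tokenizer sketch for the ((C)C)CV structure.
def char_class(ch):
    cp = ord(ch)
    if 0x0C05 <= cp <= 0x0C14:
        return "V"            # independent vowel
    if 0x0C15 <= cp <= 0x0C39:
        return "C"            # consonant
    if 0x0C3E <= cp <= 0x0C4C:
        return "VS"           # vowel sign (modifier)
    if cp == 0x0C4D:
        return "H"            # halant
    if cp == 0x0C02:
        return "SS"           # anuswara
    return "X"                # anything else

def syllables(word):
    """Greedy left-to-right segmentation into orthographic syllables."""
    out, cur, pending_halant = [], "", False
    for ch in word:
        cls = char_class(ch)
        if cls in ("V", "C") and cur and not pending_halant:
            out.append(cur)   # a new core begins a new syllable
            cur = ""
        cur += ch
        pending_halant = (cls == "H")   # halant glues the next consonant
    if cur:
        out.append(cur)
    return out

print(syllables("తెలుగు"))    # ['తె', 'లు', 'గు']
```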

For example, a bigram, which is of length 2, is formed from two consecutive syllables extracted from a word.

N-gram formation:

An n-gram is a sub-sequence of n items from a given sequence. An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram"; and size 4 or more is simply called an "n-gram".

After the syllable tokenization, n-grams are formed for each word. If a word of length n (in syllables) is found in a file, then n-grams of length n, n-1, n-2, ..., 1 are formed. N-grams are formed by concatenating adjacent syllables produced in the syllable tokenization step.
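Building on the syllables() sketch above, the n-grams of every length from n down to 1 can then be formed as prefixes over adjacent syllables, mirroring step (i) of Section 2.

```python
# Syllable n-grams from length n down to 1 (reuses syllables() above).
def syllable_ngrams(word):
    syls = syllables(word)
    return ["".join(syls[:k]) for k in range(len(syls), 0, -1)]

print(syllable_ngrams("తెలుగు"))   # ['తెలుగు', 'తెలు', 'తె']
```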

Figure 1: Finite state machine for valid Telugu aksharas, with transitions over V, C, VS, H, and SS


Now the n-grams are sorted, and any duplicates among the n-grams are removed by attaching a frequency count to each n-gram. This frequency count represents the number of times the n-gram appears in the document.

After the n-grams are sorted and the frequency counts are calculated, the file contents are written out. The number next to each n-gram represents the frequency count of that n-gram, that is, the number of times that particular n-gram occurred in the file.

For each document, a temporary file called the n-gram file is created, which consists of the n-grams of all words present in the document, in sorted order, with a frequency count attached to each n-gram.

Processing the Query:

The query which is given by the user is stored in a text file.

N-grams are formed for the query given by the user. If the query is of length n, then n-grams of length n, n-1, n-2, ..., 1 are formed.

These are sorted in descending order of length.


The procedure is repeated for each gram in the query. Once all the grams in the query have been searched in the n-gram file, the next n-gram file is taken and the same process is continued.

After this process, the frequency count file is organized as follows: the first row contains the n-grams of the query, the file names are listed in the first column, and the corresponding frequency counts for each file are displayed in decreasing order of n-gram length.

Page Count:

Now the page count is calculated from the frequency count file obtained in the above step. The results are displayed in decreasing order of n-gram length and, within each length, in decreasing order of frequency count. That is, the documents having the longest n-gram (of length n) with the highest frequency count are displayed first, then the documents with the next highest frequency count, and so on until no match is found for that n-gram. Then the n-gram of length n-1 is taken, and this process is continued down to the n-gram of length 2.
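One reading of this ordering is sketched below; listing each document only once, at its longest matching gram, is an assumption made here for clarity.

```python
# Page-count ordering sketch: longest query gram first, then by frequency.
def rank_pages(query_grams, freq):
    """query_grams: query n-grams sorted by decreasing length.
    freq maps a document name to its {gram: frequency count} table."""
    shown, order = set(), []
    for gram in query_grams:                  # length n, n-1, ..., 2
        hits = [(doc, table[gram]) for doc, table in freq.items()
                if gram in table and doc not in shown]
        for doc, count in sorted(hits, key=lambda h: -h[1]):
            order.append((doc, gram, count))
            shown.add(doc)
    return order

freq = {"a.txt": {"తెలుగు": 3, "తెలు": 3},
        "b.txt": {"తెలు": 5}}
print(rank_pages(["తెలుగు", "తెలు"], freq))
```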


4. Implementations and Validations:

This is an example Telugu text document. (Refer to Figure 2 below)

Figure 2: Input Telugu document

The program is executed in terminal. (Refer to Figure 3 below)

Figure 3: Program executed in terminal


Figure 4: n-grams found and frequency count calculated

This is the file where the user gives the query. (Refer to Figure 5 below)

Figure 5: User query


Figure 6: Final implementation screenshot

4.1 Efficiency Calculation:

The efficiency of the n-gram model is calculated for each n-gram length as efficiency = (matches - errors) / matches, and these per-length values are then averaged. (A short verification sketch follows the examples below.)

Example 1:

The word is: అమెరికాలో

Refer to Table 1 below.

Table 1

             query (5-gram)   4-gram   3-gram   2-gram   1-gram
matches      1                4        7        7        89
errors       0                0        3        3        85
efficiency   100%             100%     57.14%   57.14%   4.49%

Average efficiency without uni-grams and bi-grams=85.71%

Average efficiency with uni-grams and bi-grams=63.75%

Example 2:

The word is: అ ి కి

Refer to Table 2 below.

Table 2

             4-gram   3-gram   2-gram   1-gram
matches      6        8        10       87
errors       0        2        4        81
efficiency   100%     75%      60%      6.9%

Average efficiency without uni-grams and bi-grams=87.50%

Average efficiency with uni-grams and bi-grams=60.47%

Example 3:

Refer to Table 3 below.

Table 3

             3-gram   2-gram   1-gram
matches      5        10       66
errors       0        5        61
efficiency   100%     50%      7.58%

Average efficiency without uni-grams and bi-grams=100%

Average efficiency with uni-grams and bi-grams=52.52%
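The figures above can be reproduced in a few lines, taking efficiency per length as (matches - errors) / matches and averaging with and without the uni- and bi-gram columns (small rounding differences aside).

```python
# Verifying the efficiency tables above.
def efficiency(matches, errors):
    return [(m - e) / m * 100 for m, e in zip(matches, errors)]

def averages(eff):
    without_uni_bi = sum(eff[:-2]) / len(eff[:-2])   # drop 2- and 1-grams
    with_all = sum(eff) / len(eff)
    return round(without_uni_bi, 2), round(with_all, 2)

# Example 1: columns are 5-gram, 4-gram, 3-gram, 2-gram, 1-gram
eff1 = efficiency([1, 4, 7, 7, 89], [0, 0, 3, 3, 85])
print(averages(eff1))    # ~ (85.71, 63.76); the paper rounds to 63.75
```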

For detailed information on this implementation, please refer to the web site http://sites.google.com/site/upendramgitcse

5. CONCLUSION AND FUTURE SCOPE

In the present work, a search engine for Telugu document retrieval is attempted using the syllable n-gram model. Words in a text file are stemmed by varying the n-gram length from 1 to 6, and a similar procedure is adopted for the query: the stemmed prefixes of the query are searched for a match in the n-gram files. Results show that n-grams of length 3 improve the search capability. The searching procedure is implemented as the back-end module. As future work, a front-end GUI tool is envisaged in which a Telugu word can be typed as the query and the list of matching documents is displayed; for this, a proper transliteration scheme is to be adopted, where the keyboard has English characters and the display is in Telugu.

ACKNOWLEDGEMENTS

The authors wish to thank the following students of JNTUCEH for implementing these concepts: B. Naveena Devi, S. Hima Bindu, C. Sireesha and B. Sowjanya.

