DEVELOPING TF IDF VECTOR SPACE MODEL (VSM) ALGORITHM FOR INFORMATION RETRIEVAL FROM INDONESIA TRANSLATION VERSION OF AL QUR'AN


1 LILY WULANDARI, 2 TRISTYANTI YUSNITASARI, 3 DIANA IKASARI, 4 IRFAN HUMAINI

1,2,3,4 Faculty of Computer Science and Information Technology, Gunadarma University, Depok, Indonesia

Email: 1 lily@staff.gunadarma.ac.id, 2 tyusnita@staff.gunadarma.ac.id, 3 d_ikasari@staff.gunadarma.ac.id, 4 irfan_humaini@staff.gunadarma.ac.id

Abstract: Information Retrieval (IR) is the search for information, usually within text documents. This study applies IR to the Indonesian translation of the Al Quran, which consists of 6236 verses and serves as a guideline for Muslims, so the information it contains is very important to them. A synonym corpus (thesaurus) was built to support information retrieval so that search results become wider. The method used is the TF-IDF Vector Space Model (VSM), developed further in its keyword weighting and query processes: the document ranked first in the initial search result is used as the query for the next search process. Cosine similarity is used to calculate document similarity. The synonym corpus (thesaurus) database is built automatically by a system developed for that purpose.

In the testing phase, keywords were entered as a single word and as two or more words (a sentence). Testing with one word reached a success rate of 100%. For searches using more than one word or a sentence, the success rate within the top 10 ranked documents reached 95.6%. This research shows that information retrieval using a synonym corpus (thesaurus), with added weight on the keyword originally searched, increases relevance, because it significantly expands the search results and eliminates irrelevant documents.

Keywords: Al Quran, Corpus, Information Retrieval, TF-IDF, VSM, Cosine Similarity, Thesaurus

1. INTRODUCTION

It is the duty of Muslims to conduct daily life according to the instructions of the Qur'an. Before doing so, the contents of the Qur'an must first be studied. Almost all Muslims know which things are forbidden and which are allowed under the Qur'an. Many, however, know this only from hearsay, without knowing exactly where what they have heard is written in the Qur'an. For example, almost all Muslims know what they are forbidden to consume, but many do not know in which surah and verse of the Qur'an the prohibition appears. There are many other examples, such as the virtue of prayer. Limited time is one reason for this, as is the difficulty of finding the desired words in the Qur'an: the Al Quran consists of 30 juz, 114 surahs and 6236 verses, so searching for words that fit a desired theme is very difficult. In some existing software, searching for the word "lie" returns the surahs and verses that contain that word, whereas the translation of the Qur'an and Hadith contains many words with a similar meaning, such as "lies", "trickery" and "slander", which the software will not find because its search method is based only on keywords. There are also many words popular in the community that do not appear in the translation of the Qur'an, for example the word "corruption", so a search for that keyword returns no results. Based on these examples, a synonym corpus (thesaurus) is needed to support the information retrieval process so that search results become wider and more relevant.


obtaining a document that has the highest level of similarity (rank 1), the document is then used as a query plus the first keyword entered along with the synonym.

2. LITERATURE REVIEW

2.1 Information Retrieval

An information retrieval (IR) system automatically retrieves, from a collection of information, the information relevant to a user's needs [1,2]. Several experts define Information Retrieval as follows:

a. Manning (2007) defines Information Retrieval as the process of finding material (documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) [13].

b. Baeza-Yates (1999): Information Retrieval is the part of computer science that studies data collection and the retrieval of documents [4].

c. Greengrass (2000): Information Retrieval is a discipline that deals with searching unstructured data, especially textual documents, in response to a query or topic statement [7].

Information Retrieval is the part of computer science that deals with retrieving information from documents based on the content and context of the documents themselves. According to these references, information retrieval is a search for information, based on a query, that is expected to satisfy the user's need from the existing documents. An information retrieval system works on a collection of documents and a query (request) formulated by a user. The answer to the query is the set of relevant documents, with irrelevant documents discarded [19].

2.2 Vector Space Model ( VSM )

Representing a set of documents as vectors in a common vector space is called the vector space model. It is the basis for a number of information retrieval operations, ranging from scoring documents against a query to document classification and clustering (Manning, 2009). The vector space model recognizes that the binary weights of the Boolean model are too limiting and offers a framework in which partial matching is possible, by assigning non-binary weights to the index terms in queries and documents.

In the vector space model, the documents in the collection and the user query are each represented by a multi-dimensional vector.
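As a minimal sketch of this representation, the Python fragment below maps tokenized documents and a query onto term-count vectors over a shared vocabulary and scores them with a dot product. The toy documents, the query and the helper names (`to_vector`, `dot`) are illustrative assumptions, not part of the system described in this paper.

```python
from collections import Counter


def to_vector(tokens, vocabulary):
    """Map a token list to a term-count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy collection, already tokenized.
docs = {
    "d1": ["truth", "patience", "truth"],
    "d2": ["lies", "slander"],
}
query = ["truth", "slander"]

# One dimension per distinct term in the collection and the query.
vocabulary = sorted({t for tokens in docs.values() for t in tokens} | set(query))
doc_vectors = {name: to_vector(tokens, vocabulary) for name, tokens in docs.items()}
query_vector = to_vector(query, vocabulary)

# Non-binary weights give graded scores even on a partial match.
scores = {name: dot(vec, query_vector) for name, vec in doc_vectors.items()}
```

Neither toy document contains both query terms, so a strict Boolean AND query would return nothing, while the vector scores still rank `d1` above `d2`.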

2.2.1 TF IDF

The TF-IDF method is a term weighting method that is widely used as the baseline against which new weighting methods are compared. In this method, the weight of a term t in a document is calculated by multiplying its Term Frequency value by its Inverse Document Frequency value.

In Term Frequency (tf), there are several types of formulas that can be used [11]:

a. Binary tf considers only whether a word occurs in the document: the value is one if it does and zero if it does not.

b. Raw tf is the number of occurrences of a word in the document. For example, a word that appears five times has a tf of five.

c. Logarithmic tf avoids the dominance of documents that contain few of the query words but with a high frequency.

$$tf = 1 + \log(tf) \tag{1}$$

d. Normalized tf uses a comparison between the frequency of a word and the total number of words in the document.

$$tf = 0.5 + 0.5 \times \frac{tf}{\max(tf)} \tag{2}$$

Inverse Document Frequency (idf) is calculated using the formula

$$idf_j = \log\left(\frac{D}{df_j}\right) \tag{3}$$

where $D$ is the number of all documents in the collection and $df_j$ is the number of documents containing term $t_j$.

$$W_{ij} = tf_{ij} \times idf_j = tf_{ij} \times \log\left(\frac{D}{df_j}\right) \tag{4}$$

where $W_{ij}$ is the weight of term $t_j$ in document $d_i$, $tf_{ij}$ is the number of occurrences of term $t_j$ in document $d_i$, and $df_j$ is the number of documents containing term $t_j$ (that is, documents with at least one occurrence of $t_j$).

$$W_{ij} = tf_{ij} \times \log\left(\frac{D}{df_j}\right) + 1 \tag{5}$$
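The weighting formulas above can be sketched in Python as follows. This is an illustrative sketch and not the authors' implementation: the function names are hypothetical, and the logarithm base is assumed to be 10 because the text does not specify it. Only the raw-tf weight of Eq. (4) is combined into a full weighting function.

```python
import math
from collections import Counter


def tf_binary(count):
    """Binary tf: 1 if the term occurs, otherwise 0."""
    return 1 if count > 0 else 0


def tf_log(count):
    """Logarithmic tf, Eq. (1); base-10 logarithm is an assumption."""
    return 1 + math.log10(count) if count > 0 else 0.0


def tf_normalized(count, max_count):
    """Normalized tf, Eq. (2), relative to the largest count in the document."""
    return 0.5 + 0.5 * count / max_count


def idf(num_docs, doc_freq):
    """Inverse document frequency, Eq. (3)."""
    return math.log10(num_docs / doc_freq)


def tfidf_weights(docs):
    """Raw-tf weights W_ij = tf_ij * log(D / df_j), Eq. (4), one dict per document."""
    num_docs = len(docs)
    # df_j: number of documents that contain term j at least once.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append(
            {term: count * idf(num_docs, doc_freq[term]) for term, count in counts.items()}
        )
    return weights


# Hypothetical two-document collection.
example = tfidf_weights([["a", "a", "b"], ["b", "c"]])
```

In this toy collection the term `b` occurs in every document, so its idf, and therefore its weight under Eq. (4), is zero; the term `a` occurs twice in one document only and receives the largest weight.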

3. RESULT AND DISCUSSION

The proposed method develops Information Retrieval using the TF-IDF vector space model (VSM) technique, with ranking by cosine similarity.

The development increases the relevance and precision of document search results. This is made possible by the synonym corpus (thesaurus) and by extending the TF-IDF VSM algorithm with additional weighting on the terms that are keywords. At this stage, the development of Information Retrieval consists of:

1. The TF-IDF vector space model (VSM) technique is used to calculate the weights that serve as the initial filter on the data. The weights obtained are the basis for ranking. The verse of the Qur'an that ranks first is then used as the keywords for a further search of the verses of the Qur'an. This increases the relevance of the search results. The synonym database is used in this process.

2. Next, the cosine similarity technique ranks the documents from the highest similarity value to the lowest. Additional weight is given to the keyword entered by the user; this is done after the query process has obtained the rank 1 document, which is then used as the next query. The additional weight ensures that, in the cosine similarity calculation, the keywords entered by the user carry a higher value than the synonyms (thesaurus) or other words that are also part of the query, so that documents containing the keywords rank at the top of the search results.
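The effect of the additional keyword weight on a cosine similarity ranking can be illustrated with the sketch below. The sparse-dictionary vectors, the function names and the boost factor of 2.0 are assumptions chosen for illustration; the paper does not state the size of the additional weight.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts of term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def boost_keywords(query_vector, user_keywords, boost=2.0):
    """Scale the weights of user-entered keywords by a hypothetical boost factor."""
    return {
        term: weight * boost if term in user_keywords else weight
        for term, weight in query_vector.items()
    }


def rank(doc_vectors, query_vector):
    """Return document names ordered from highest to lowest cosine similarity."""
    scored = [
        (cosine_similarity(vec, query_vector), name)
        for name, vec in doc_vectors.items()
    ]
    return [name for _, name in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical verses: v1 contains the keyword, v2 contains only synonyms.
doc_vectors = {
    "v1": {"lie": 1.0},
    "v2": {"slander": 1.0, "deceit": 1.0},
}
query = {"lie": 1.0, "slander": 1.0, "deceit": 1.0}

plain_ranking = rank(doc_vectors, query)
boosted_ranking = rank(doc_vectors, boost_keywords(query, {"lie"}))
```

With equal weights the synonym-only verse `v2` ranks first; after boosting the user's keyword, the verse `v1` that contains it moves to the top, which is the behaviour the additional weighting is meant to produce.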

3.1 Developing TF IDF Vector Space Model (VSM) Algorithm.

In general, information retrieval with the VSM method covers a number of operations, ranging from scoring documents against queries to document classification and clustering, within a framework that allows partial matching by assigning non-binary weights to the index terms in queries and documents. Take the word "corruption" from the earlier example: it has the same meaning as "eating wealth" or "seizing", but not every occurrence of "seizing" connotes an act of corruption. Partial matching must therefore be considered, and weights assigned according to which verses of the Qur'an and Hadith are closest in meaning to the input word, in this case "corruption". The verses with the highest weight are considered the most relevant and are displayed first in the search results.

The query is expanded in the information retrieval process by including the synonyms of the keywords that form the query. To widen the results further, the words in the verse of the Qur'an with the highest TF-IDF VSM weight in the query result were used to carry out the next information retrieval process. The trial shows that the results are indeed broader, but the top-ranked results are verses of the Qur'an and Hadith that do not contain the keywords from the query. This leads users to think that none of the results use the keywords they entered. For example, for the keyword "lie", the top results are verses of the Qur'an and Hadith that contain synonyms such as "lies", "denial", "slander" and "deception". This is because the TF-IDF VSM weight generated by the keyword "lie" is no higher than that of its synonyms.
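The two-stage query expansion described here, synonyms first and then the terms of the top-ranked verse, might look like the following sketch. The thesaurus contents and the function names are hypothetical, and the sketch only builds the query term lists; weighting and ranking are outside its scope.

```python
def expand_with_synonyms(keywords, thesaurus):
    """Add each keyword's synonyms, preserving order and skipping duplicates."""
    expanded = []
    for word in keywords:
        for term in [word] + thesaurus.get(word, []):
            if term not in expanded:
                expanded.append(term)
    return expanded


def second_pass_query(keywords, thesaurus, top_document_tokens):
    """Combine the keywords, their synonyms and the terms of the rank-1 document."""
    query = expand_with_synonyms(keywords, thesaurus)
    for term in top_document_tokens:
        if term not in query:
            query.append(term)
    return query


# Hypothetical thesaurus entry and rank-1 verse tokens.
thesaurus = {"lie": ["slander", "deceit"]}
first_query = expand_with_synonyms(["lie"], thesaurus)
next_query = second_pass_query(["lie"], thesaurus, ["slander", "tongue", "lie"])
```

Keeping the user's keywords at the front of the term list makes it straightforward to identify them later when the additional keyword weight is applied.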


Fig. 3.1 Process Flow Calculation for TF-IDF VSM and Cosine Similarity

Fig. 3.2 Process Flow Developing TF IDF VSM Algorithm results

a) Step 1 reads the keyword.

b) Steps 2-5 perform preprocessing, so that only base words remain.

c) Steps 6-9 search for synonyms of the keywords; any synonyms found are added to the search query.

d) Steps 10-11 calculate similarity and rank the documents retrieved for the keyword.

e) Steps 12-13 use the first-ranked document, together with the keywords and synonyms (thesaurus), as the query.

f) Step 14 gives additional weight to the keywords used in the query.

g) Steps 15-16 calculate the weights, similarities and ranking of each document.

h) Steps 18-19 perform the search based on the classification theme associated with the query, using the Indonesian-language Al Quran translation database and the classification themes.

i) Step 20 displays the Information Retrieval results as verses of the Quran, along with the themes contained in each verse.
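The preprocessing in steps 2-5 is not detailed in the text, so the sketch below is only an assumption of what reducing a keyword to base words might involve: case folding, tokenization, stopword removal and a toy suffix stripper. The stopword list and the `strip_suffix` rule are hypothetical stand-ins; a real system would use a complete Indonesian stopword list and a proper stemmer such as the confix-stripping approach cited in the references.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"yang", "dan", "di", "itu"}


def strip_suffix(word):
    """Toy stand-in for a stemmer: removes one common Indonesian suffix."""
    for suffix in ("lah", "kah", "nya", "kan", "an", "i"):
        # Keep at least four characters so short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stopwords and reduce tokens to base words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(token) for token in tokens if token not in STOPWORDS]


base_words = preprocess("Makanan yang haram itu")
```

The resulting base words would then be passed to the synonym lookup in steps 6-9.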

4. CONCLUSIONS


synonym (thesaurus) has been successfully carried out.

A synonym (thesaurus) formation algorithm has been produced for compiling popular terms or words along with their synonyms, drawn from several URLs and from expert input, so that a good synonym corpus (thesaurus) is formed. Information retrieval trials on 20 words using the synonym corpus successfully retrieved a fairly extensive set of verses from the Al Quran, covering both the keywords entered and their synonyms. Without the synonym corpus, information retrieval returns fewer verses from the surahs of the Qur'an, because only words identical to the keyword are detected. Retrieval with the synonym corpus therefore gives significantly broader results than retrieval without it. In the search trial on the 20 words, using the synonym corpus added 553 verses.

A TF-IDF VSM algorithm with increased keyword weighting has been produced, which can improve the relevance and precision of information retrieval search results.

ACKNOWLEDGMENT

The authors would like to thank Gunadarma University for its support of this research. The authors also thank the resource persons, Al Quran experts and Hadith experts for their time in providing information related to this research, and colleagues for their input and support.

REFERENCES

1. Adriani, M., Asian, J., Nazief, B., Tahaghoghi, S.M.M., Williams, H.E. 2007. Stemming Indonesian: A Confix-Stripping Approach. Transaction on Asian Language Information Processing.

2. Agusta, Ledy. Comparison of Porter Stemming Algorithm with Nazief & Adriani Algorithm for Stemming Indonesian Text Document. Satya Wacana Christian University. 2009.

3. Akram Roshdi, Akram Roohparvar. Review: Information Retrieval Techniques and Applications, International Journal of Computer Networks and Communications Security, Vol. 3, No. 9, 373-377, September 2015.

4. Baeza-Yates, R., Ribeiro-Neto, B., Modern Information Retrieval, Addison Wesley-Pearson international edition, Boston, USA, 1999.

5. Berry, M.W. & Kogan, J. 2010. Text Mining: Applications and Theory.

6. Broto Poernomo T.P, Ir. Gunawan, Information Retrieval System Search Similarities AlQur'an Translation Version in Indonesian with Query Expansion from Tafsirnya, IDeaTech, ISSN: 2089-1121, 2015.

7. Bridge, C. 2011. Unstructured Data and the 80 Percent Rule.

8. Fatkhul Amin, Information Retrieval System with Vector Space Model Method, Journal of Business Information Systems 02, 2012.

9. Feldman, R. & Sanger, J. 2007. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press: New York.

10. Jasman Pardede, Mira M Barmawi, Wildan D Pramono, Implementation of Generalized Vector Space Model Method In Information Retrieval Applications, No. 1, Vol. 4, ISSN: 2008-5266, January - April 2013.

11. Jovita, Linda, Andrei Hartawan, 2015, Using Vector Space Model in Question Answering System, International Conference on Computer Science and Computational Intelligence (ICCSCI 2015).

12. Kendall, J.E. & Kendall, K.E. 2010. Analisis dan Perancangan Sistem. Jakarta: Indeks.

13. Lukman Fakih Lidimilah, 2017, Question Answering Terjemah Al qur'an Menggunakan Named Entity Recognition, Jurnal Ilmiah Informatika, Volume 2, No. 2.

14. Mandala, Rila dan Hendra Setiawan. Peningkatan Performansi Sistem Temu Kembali Informasi dengan Perluasan Query Secara Otomatis, Laboratorium Keahlian Informatika Teori, Department Teknik Informatika, Institut Teknologi Bandung, 2006.

15. Manning, Christopher D., Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009.

16. McEnery, A.M., Wilson, A. 2001. Corpus Linguistics. Edinburgh: Edinburgh University Press.

17. Moral, C., Antonio, A., Imbert, R., Ramirez, J.: A survey of stemming algorithms in information retrieval. Inf. Res.: Int. Electron. J. 19(1), 2014.

18. Nesdi E. Rozanda, Arif Marsal, Kiki Iswanti, Design of Hadist Information Systems Using Technique of Retrieval of Vector Space Model Information, ejournal.uin-suska.ac.id, 20014.

19. Salton, G., Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523. https://doi.org/10.1016/0306-4573(88)90021-0

20. Saraswati, N. W, 2011. Text Mining dengan Metode Naive Bayes Classifier dan Support Vector Machines untuk Sentiment Analysis. Universitas UDAYANA.

21. Subari, Ferdinandus, Health Information Retrieval System For Medical Treatment Using Space Vector Method (VSM) Method Based on WebGis, ISSN 2089-1083, Snatika 2015.

22. Surya Agustian, Imelda Sukma Wulandari, Qur'an Retrieval System Web-based Indonesian Translation with Reorganization of Corps, KNSI 2013, ISBN 978-602-17488-0, 2013.