Stemming algorithm in Quranic verses - Word recognition in Qur‘anic verses

6.1 Word recognition in Qur‘anic verses

6.1.7 Stemming algorithm in Quranic verses

In deciding on which methodology to adopt for stemming algorithm, the related issue that is crucial as a determinant is the fact that the main objective is to support word recognition. To learn words in the stem form means that there are more words to learn and more words to process, however this approach gives better meaning of words in the context of a sentence. For quicker learning and simpler implementation root word form is better. Here in this study, the approach is to adopt the algorithm based on root words. Next is to decide on how this should be implemented. There are two studies conducted in this research as in the following §s. The first is a case study on the 30th part of the Qur‘an (Juz Amma). The second is an actual implemented algorithm based on the word patterns identified in Table 2.4 for sura 29. This algorithm is referred to as the Rule-Based Stemming Engine (RES) (Yusof R, Zainudin, & Yusoff, 2010). Both studies were done on a dual 2.4-GHz processor with 2.99 GB of memory.

(a) Case study on Juz Amma

The Case Study was conducted using Juz Amma consisting of short suras form chapters 78 to 114 of the Qur’an. There are approximately 3000 words or 2000 over

126 word tokens in Juz Amma which are taken as the dataset to conduct the case study. These words are stored in a MySQL database and queries on the related tables were made based on the test cases below using Regression Expression to find the percentage of correctness in removing prefixes and suffixes.

The tested cases are divided into 2 categories, i.e. removing one of ا, ٖ, ٞ, ْ, ٛ, ّ, خ, ي, ء, س (A, h, y, n, w, m, t, l, ’, s) (10AL - see § 2.3.1 for clarification) and pronoun/ particle. The statements are written based on following; + ٞ is equivalent to y+, where y is the transcription of the Arabic letter ٞ (see Table 2.1). The + sign indicates that ٞ y is a prefix since other letters can be added after the + symbol: i. Prefixes of only one or two alphabets

i. + ٞ y+, + خ t+, + َ m+ (in the category of 10AL) ii. + ي ي ll+, + َف f+ , +ٚ w+ (in the category of particles) ii. Suffixes of one or two alphabets

i. ْ ا + +An, ْ ٚ+ +wn, ْ ٞ+ +yn (in the category of 10AL) ii. َ خ + +tm, َ ن + +km and ن + +k (in the category of pronouns)

(b) Implementing word pattern identifier

The word pattern identifier algorithm is implemented in this research using a Rule- based Stemming Engine (RSE). The stemming process is to determine the root word of the input token from Dukes (2008) Qur‘an Corpus which contains the Arabic Qur‘anic words, the word information (word, ayah and sura number) and also the part-of-speech tagset which is written in roman scripts. This tagset include the root word of all words in a verse, words such as particles does not contain a root word. Our algorithm caters for three letter root word since this is the most common type.

127

The main reference for the implementation of this word pattern identifier is the table

2.4 § 2.5.1. The word pattern identified in this table were further categorised into

number of characters (4, 5, 6, 7 and 8). The RSE algorithm is created based on this categorization. After each the stemming process, each word left is compared to the root word information in the tagset written in roman script. Therefore each Arabic stemmed words must be transliterated using the table 2.1 shown in §2.5.1 to

determine whether the stemming had been done correctly.

The algorithm is as follows:

i) Get word tokens from the corpus.

a. Separate word token from word tag set into two vector lists  Qur‘an word list (Arabic script)

 Part-of-speech tag set list (Roman script)

ii) Remove particle prefixes (+ِِي l+,+َف f+ ,+ِب b+ , +َٚ w+,+ٱ A+) iii) Remove diacritization from all word tokens

iv) Remove prefixes (the prefixes in table 2.4 § 2.5.1 excluding the ones in ii) v) Remove suffixes (the suffixes in table 2.4 § 2.5.1 )

vi) Determine length of word, stem the words by removing one of 10AL as in table 2.4 in § 2.5.1 and put them into a new vector list

a. If word length = 1, 2, 3 or >813, return the word without stemming b. If word length = 4, use algorithm to detect word pattern of length 4

 Delete one of 10AL