2.18 Arabic Text Classification

2.18.1 Arabic language

Arabic has 28 letters and is written from right to left. In contrast with English, Arabic has a richer morphology, which makes developing automatic processing systems for it a highly challenging task. For text classification, the basic nature of the language is similar to English in that we can hope to rely on the frequency distributions of content terms to underpin automatic text classification. However, the high degree of inflection, word gender, and number (Arabic has singular, dual, and plural forms) means that the pre-processing stage (e.g. stemming) is more complex than in the English case [27, 84, 87].
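As a minimal sketch of why inflection complicates frequency-based classification, the Python fragment below counts surface forms in a toy token list. The token list and the helper name `term_frequencies` are assumptions made for illustration only; they are not taken from the datasets used in this work.

```python
from collections import Counter

# Toy tokens (assumed for illustration): inflected forms of the verb "to study".
tokens = ["درس", "درست", "يدرس", "درس"]


def term_frequencies(tokens):
    """Count how often each surface form occurs in a list of tokens."""
    return Counter(tokens)


freqs = term_frequencies(tokens)

# Without stemming, the three inflected forms are counted as unrelated terms,
# so the evidence for a single underlying content term is split across them.
for term, count in freqs.items():
    print(term, count)
```

Here four tokens that share one root yield three distinct terms, which is the kind of fragmentation that the pre-processing stage is meant to reduce.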

Arabic has three genders: feminine, masculine and neuter [128]. In general, Arabic words are classified into three main groups: nouns, verbs, and particles. A noun in Arabic is defined as a word that describes a person, thing, place or idea [127]. Nouns can be derived from other nouns, verbs, or particles [127]. Verbs are divided into perfect, imperfect and imperative. The particle category includes pronouns, adjectives, adverbs, conjunctions, prepositions, interjections and interrogatives [127]. Most Arabic words can be obtained from the stem or root of a word by attaching prefixes, suffixes and infixes to it, following fixed patterns called "Awzan" [128, 131]. Arabic roots are composed of three, four, or, in some cases, five letters [130].

In contrast with phonetic symbols in English, Arabic has a set of diacritics that are used to pronounce words correctly. Diacritic marks can be written below or above letters. They are short vowel marks. The main Arabic diacritics include Fatha, Dama, Kasra, Shada, Sukun and Tanween [118, 129, 130]. For instance, Table 2.6 presents different pronunciations of the letter (Sean) (س) [118].

Pronunciation | Written form
/sa/ | سَ
/si/ | سِ
/su/ | سُ
/s/ | سْ
/ssa/ | سَّ
/ssi/ | سِّ
/ssu/ | سُّ
/ss/ | سّ

Table 2.6: Different pronunciations of the letter (Sean)

classification in particular is related to the nature of the Arabic language. Here, in comparison with other languages such as English, we list some of the aspects that make automatic processing of Arabic a challenging task [126, 129, 130, 132]:

Arabic has a complex morphology in comparison with English. An Arabic word is usually built up from a root with attached affixes. As an example, Table 2.7 presents different morphological forms of the word "study" (دراسة) [129].

Word | Tense | Number | Meaning | Gender
درس | Past | Single | He studied | Masculine
درست | Past | Single | She studied | Feminine
يدرس | Present | Single | He studies | Masculine
تدرس | Present | Single | She studies | Feminine
درسا | Past | Dual | They studied | Masculine
درستا | Past | Dual | They studied | Feminine
يدرسان | Present | Dual | They study | Masculine
تدرسان | Present | Dual | They study | Feminine
درسوا | Past | Plural | They studied | Masculine
درسن | Past | Plural | They studied | Feminine
يدرسون | Present | Plural | They study | Masculine
يدرسن | Present | Plural | They study | Feminine
سيدرس | Future | Single | He will study | Masculine
ستدرس | Future | Single | She will study | Feminine
سيدرسا | Future | Dual | They will study | Masculine
ستدرسا | Future | Dual | They will study | Feminine
سيدرسون | Future | Plural | They will study | Masculine
سيدرسن | Future | Plural | They will study | Feminine
يدرسا | Present | Dual | They study | Masculine
تدرسا | Present | Dual | They study | Feminine

Table 2.7: Different morphological forms of the word (Darasa)

suffixes. In Arabic, infixes can be added inside the word. For instance, in English, the word "write" is the root of the word "writer". In Arabic, the word "writer" (كاتب) is formed differently: the letter Alef (ا) is added inside the root (كتب). In such cases, especially in processes like stemming, it is difficult to distinguish between the root letters and infixes [126].
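The difficulty with infixes can be sketched with a deliberately naive light stemmer that only strips one prefix and one suffix. The affix lists, the `min_len` threshold and the function name `light_stem` are hypothetical choices made for this illustration; they do not describe the stemmer used in this work or any published algorithm.

```python
# Hypothetical, deliberately tiny affix lists; real stemmers use much larger ones.
PREFIXES = ["سي", "ست", "ي", "ت"]
SUFFIXES = ["ون", "ان", "وا", "ت", "ن", "ا"]


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


# Prefixes and suffixes are removed, so these inflected forms share one stem.
for form in ["درسوا", "سيدرسون", "يدرسان", "درست"]:
    print(form, light_stem(form))

# The infix Alef sits inside the word, so affix stripping cannot reach it:
# the stem of "writer" stays different from the root.
print(light_stem("كاتب") == "كتب")
```

Under these assumptions the inflected verb forms collapse to a common stem, while "writer" (كاتب) is left unchanged because its extra letter is neither a prefix nor a suffix. Recovering the root in such cases needs pattern-based analysis rather than simple affix removal.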

The semantics, morphology, and syntax of Arabic differ from, and are more complex than, those of Indo-European languages [132].

Some Arabic words have different meanings depending on the context in which they appear. In digital Arabic text in particular, diacritics are mostly not used, so the proper meaning of a word has to be determined from the context. For instance, the word (ذهب) could be the noun "gold" (ذهب) or the verb "went" (ذهب), depending on the context [130].
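The ambiguity can be illustrated with a short Python sketch that removes diacritics from two vocalised words. The vocalised spellings below and the helper name `strip_diacritics` are assumptions added for illustration. The character range U+064B to U+0652 covers the Tanween, short-vowel, Shada and Sukun marks in Unicode.

```python
import re

# Unicode combining marks U+064B..U+0652: Tanween forms, Fatha, Dama, Kasra,
# Shada and Sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving only the base letters."""
    return DIACRITICS.sub("", text)


# Assumed vocalisations, written with explicit code points for clarity.
gold = "\u0630\u064E\u0647\u064E\u0628"        # noun "gold"
went = "\u0630\u064E\u0647\u064E\u0628\u064E"  # verb "went"

# The vocalised forms differ, but the unvocalised forms are identical.
print(gold == went)
print(strip_diacritics(gold) == strip_diacritics(went))
```

Once the marks are removed, as is typical of digital text, both words reduce to the same three letters, so a classifier working on surface forms cannot tell them apart without context.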

In Arabic, irregular plurals and synonyms are widespread [101, 129]. Another challenge for automatic processing of Arabic text is dealing with proper nouns: since Arabic letters have no lower and upper case, proper nouns do not begin with capital letters as they do in English, so capturing such words in Arabic text is more difficult than in English [126].
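A small sketch of the proper-noun problem follows. The capitalisation test `looks_capitalised` is a hypothetical heuristic of the kind sometimes used as a first cue for English proper nouns. It is shown only to illustrate that the cue is unavailable in Arabic.

```python
def looks_capitalised(token):
    """Naive proper-noun cue: does the token start with an upper-case letter?"""
    return token[:1].isupper()


english = "Cairo"
arabic = "القاهرة"  # "Cairo" in Arabic

print(looks_capitalised(english))
print(looks_capitalised(arabic))

# Arabic letters are caseless, so case conversion leaves the word unchanged.
print(arabic.upper() == arabic.lower() == arabic)
```

Because the Arabic script carries no case information, this cue never fires, and proper nouns have to be identified by other means, such as context or name lists.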

For Arabic TC, no Arabic corpus with precisely defined training and testing portions is publicly available for research purposes. This makes direct comparison between Arabic TC approaches impossible [129, 132]. To overcome this issue, in this work we have formed three Arabic datasets (each split into training and test portions) and made them available so that other researchers can compare directly with our results.

Arabic text requires a suitable encoding. Using an unsuitable encoding results in improperly displayed Arabic text. The most commonly used Arabic text encodings are UTF and CP-1256 (Arabic Windows) [129].
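The effect of an unsuitable encoding can be sketched in Python, whose standard codecs include `utf-8`, `cp1256` and `latin-1`. The sample word and the choice of `latin-1` as the "wrong" encoding are assumptions made for illustration, and UTF-8 is assumed as the UTF variant.

```python
word = "درس"

# The same three letters have different byte representations per encoding.
utf8_bytes = word.encode("utf-8")      # two bytes per Arabic letter
cp1256_bytes = word.encode("cp1256")   # one byte per Arabic letter
print(len(utf8_bytes), len(cp1256_bytes))

# Reading CP-1256 bytes with the wrong encoding yields unreadable text.
garbled = cp1256_bytes.decode("latin-1")
print(garbled == word)

# Decoding with the encoding that was used for writing restores the text.
restored = cp1256_bytes.decode("cp1256")
print(restored == word)
```

When the source encoding of a corpus is known, passing it explicitly when reading files (for example through the `encoding` argument of `open`) avoids this kind of corruption.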