2.18 Arabic Text Classification
2.18.1 Arabic language
Arabic has 28 letters and is written from right to left. In contrast with English, Arabic has a richer morphology, which makes developing automatic processing systems for it a highly challenging task. In the context of text classification, the basic nature of the language is similar to English in that we can hope to rely on the frequency distributions of content terms to underpin automatic text classification. However, the high degree of inflection, together with word gender and number (Arabic has singular, dual, and plural forms), means that the pre-processing stage (e.g. stemming) is more complex than in the English case [27, 84, 87].
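To give a feel for why stemming is harder here, the following is a minimal light-stemming sketch in Python. It is not the stemmer used in this work: the affix lists, the minimum stem length, and the name light_stem are assumptions made purely for illustration. It reduces several inflected forms of the verb "to study" to the same three-letter stem, but it also damages a word whose first letter happens to look like a prefix.

```python
# Hypothetical affix lists chosen for illustration only; real Arabic stemmers
# use larger, carefully ordered lists and additional rules.
PREFIXES = ["ال", "و", "س", "ي", "ت"]
SUFFIXES = ["ون", "ان", "ات", "وا", "ن", "ت", "ا"]  # longer suffixes first
MIN_STEM_LEN = 3  # assumption: never cut a word below three letters


def light_stem(word: str) -> str:
    """Strip listed prefixes repeatedly, then at most one listed suffix."""
    stem = word
    stripped = True
    while stripped:
        stripped = False
        for prefix in PREFIXES:
            if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LEN:
                stem = stem[len(prefix):]
                stripped = True
                break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
            stem = stem[: -len(suffix)]
            break
    return stem


if __name__ == "__main__":
    # Inflected forms of "to study" all collapse to the same stem.
    for form in ["درس", "درست", "يدرسان", "سيدرسون"]:
        print(form, "->", light_stem(form))
    # Over-stemming: the first letter of this word (peace) belongs to the
    # root, yet the sketch removes it as if it were the future prefix.
    print("سلام", "->", light_stem("سلام"))
```

The last call shows the core difficulty: without lexical knowledge, a purely rule-based procedure cannot tell a root letter from an affix.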
The Arabic language has three genders: feminine, masculine and neuter [128]. In general, Arabic words are classified into three main groups: nouns, verbs, and particles. A noun in Arabic is defined as a word that describes a person, thing, place or idea [127]. Nouns in Arabic can be derived from other nouns, verbs, or particles [127]. Verbs in Arabic are divided into perfect, imperfect and imperative. The Arabic particle category includes pronouns, adjectives, adverbs, conjunctions, prepositions, interjections and interrogatives [127]. Most Arabic words can be obtained from a stem or root by attaching prefixes, suffixes and infixes to it according to fixed patterns called "Awzan" [128, 131]. Arabic roots are composed of three, four, or, in some cases, five letters [130].
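The way a pattern combines with a root can be sketched in a few lines of Python. The sketch assumes the traditional convention in which the letters ف, ع and ل stand for the first, second and third root letters of a pattern; the function name apply_pattern and the restriction to three-letter roots are our own simplifications, not part of any cited method.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill a pattern with a three-letter root.

    Assumption: the pattern uses the placeholder letters ف, ع and ل for the
    first, second and third root letters; every other letter in the pattern
    is an affix that is copied unchanged.
    """
    if len(root) != 3:
        raise ValueError("this sketch only handles three-letter roots")
    slots = dict(zip("فعل", root))
    return "".join(slots.get(letter, letter) for letter in pattern)


if __name__ == "__main__":
    # Active-participle pattern: an Alef is inserted after the first root letter.
    print(apply_pattern("كتب", "فاعل"))   # كاتب (writer)
    # Passive-participle pattern: a prefix plus an infix.
    print(apply_pattern("كتب", "مفعول"))  # مكتوب (written)
```

Stemming has to invert this mapping without knowing which pattern was applied, which is one reason affix removal alone is often insufficient for Arabic.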
In contrast with phonetic symbols in English, Arabic has a set of diacritics that are used to pronounce words correctly. Diacritics are short vowel marks written below or above letters. The main Arabic diacritics include Fatha, Damma, Kasra, Shadda, Sukun and Tanween [118, 129, 130]. For instance, Table 2.6 presents different pronunciations of the letter Seen (س) [118].

/sa/   /si/   /su/   /s/   /ssa/   /ssi/   /ssu/   /ss/
Table 2.6: Different pronunciations of the letter Seen
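Because the same letter may or may not carry diacritics in a given text, a common normalisation step is to remove them before counting terms. The following is a minimal sketch, assuming the diacritics of interest are the Unicode code points U+064B to U+0652 (the three Tanween forms, Fatha, Damma, Kasra, Shadda and Sukun); the name strip_diacritics is hypothetical.

```python
import re

# Assumption: U+064B..U+0652 covers Tanween (three forms), Fatha, Damma,
# Kasra, Shadda and Sukun. Other, rarer marks are not handled here.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return the text with the short-vowel marks listed above removed."""
    return ARABIC_DIACRITICS.sub("", text)


if __name__ == "__main__":
    seen_with_fatha = "\u0633\u064E"          # Seen + Fatha, pronounced /sa/
    seen_with_shadda_kasra = "\u0633\u0651\u0650"  # Seen + Shadda + Kasra, /ssi/
    print(strip_diacritics(seen_with_fatha) == "\u0633")         # True
    print(strip_diacritics(seen_with_shadda_kasra) == "\u0633")  # True
```

After this step, differently vocalised spellings of a word map to one surface form, at the cost of discarding the information that distinguished them.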
classification in particular is related to the nature of the Arabic language. Here, in comparison with other languages such as English, we list some of the aspects that make the automatic processing of Arabic a challenging task [126, 129, 130, 132]:
Arabic has a complex morphology in comparison with English. An Arabic word is usually built up from a root with affixes attached. As an example, Table 2.7 presents different morphological forms of the word study (دراسة) [129].

Word      Tense    Number    Meaning          Gender
درس       Past     Singular  He studied       Masculine
درست      Past     Singular  She studied      Feminine
يدرس      Present  Singular  He studies       Masculine
تدرس      Present  Singular  She studies      Feminine
درسا      Past     Dual      They studied     Masculine
درستا     Past     Dual      They studied     Feminine
يدرسان    Present  Dual      They study       Masculine
تدرسان    Present  Dual      They study       Feminine
درسوا     Past     Plural    They studied     Masculine
درسن      Past     Plural    They studied     Feminine
يدرسون    Present  Plural    They study       Masculine
يدرسن     Present  Plural    They study       Feminine
سيدرس     Future   Singular  He will study    Masculine
ستدرس     Future   Singular  She will study   Feminine
سيدرسا    Future   Dual      They will study  Masculine
ستدرسا    Future   Dual      They will study  Feminine
سيدرسون   Future   Plural    They will study  Masculine
سيدرسن    Future   Plural    They will study  Feminine
يدرسا     Present  Dual      They study       Masculine
تدرسا     Present  Dual      They study       Feminine
Table 2.7: Different morphological forms of the word (Darasa)

suffixes. In Arabic, infixes can be added inside the word. For instance, in English, the word write is the root of the word writer. In Arabic, the word writer (كاتب) is formed differently: the letter Alef (ا) is added inside the root (كتب). In such cases, especially in processes like stemming, it is difficult to distinguish between the root letters and infixes [126].

The semantics, morphology, and syntax of Arabic differ from, and are more complex than, those of Indo-European languages [132].
Some Arabic words may have different meanings depending on the context in which they appear. In digital Arabic text in particular, diacritics are mostly omitted, so the proper meaning of a word has to be determined from the context. For instance, the word (ذهب) could be the noun gold (ذهب) or the verb went (ذهب), depending on the context [130].

In Arabic, irregular plurals and synonyms are widespread [101, 129].

An example of the challenges of automatic Arabic text processing is the problem of dealing with proper nouns. Since Arabic letters do not have lower and upper case, proper nouns in Arabic do not begin with capital letters as in English, so capturing such words in Arabic text is more difficult than in English [126].
For Arabic TC, no Arabic corpus with well-defined training and testing portions is publicly available for research purposes. This makes a direct comparison between Arabic TC approaches impossible [129, 132]. To overcome this issue, in this work we have formed three Arabic datasets (each split into training and test portions) and made them available so that other researchers can compare directly with our results.
Arabic text requires a suitable character encoding. Using an unsuitable encoding results in Arabic text being displayed incorrectly. The most commonly used Arabic text encodings are UTF and CP-1256 (Arabic Windows) [129].
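The effect of an unsuitable encoding can be reproduced with Python's standard codecs. The sketch below assumes UTF-8 as the UTF variant and uses the codec names "utf-8" and "cp1256"; the helper reinterpret is hypothetical and simply writes a string in one encoding and reads it back in another.

```python
def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one codec and decode the bytes with another.

    Undecodable bytes are replaced with U+FFFD instead of raising an error,
    which mimics what a viewer shows when it guesses the wrong encoding.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    word = "درس"
    # The same three letters occupy a different number of bytes.
    print(len(word.encode("utf-8")))   # 6
    print(len(word.encode("cp1256")))  # 3
    # Matching encodings round-trip cleanly.
    print(reinterpret(word, "cp1256", "cp1256") == word)  # True
    # Mismatched encodings produce unreadable text.
    print(reinterpret(word, "utf-8", "cp1256") == word)   # False
    print(reinterpret(word, "cp1256", "utf-8") == word)   # False
```

For a corpus collected from mixed sources, this suggests converting every document to a single encoding before any tokenisation or term counting takes place.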