Natural Language Processing (NLP) applications such as text categorization, machine translation and sentiment analysis need annotated corpora and lexicons to check quality and performance. This paper describes the development of resources for sentiment analysis specifically for Arabic text in social media. A distinctive feature of the corpora and lexicons developed is that they are derived from informal Arabic that does not conform to grammatical or spelling standards. We refer to Arabic social media content of this sort as Dialectal Arabic (DA) - informal Arabic originating from and potentially mixing a range of different individual dialects. The paper describes the process adopted for developing corpora and sentiment lexicons for sentiment analysis within different social media and their resulting characteristics. In addition to providing useful NLP data sets for Dialectal Arabic, the work also contributes to understanding the approach to developing corpora and lexicons.
We present a single clitic segmentation model that is accurate on both MSA and informal Arabic. The model is an extension of the character-level conditional random field (CRF) model of Green and DeNero (2012). Our work goes beyond theirs in three aspects. First, we handle two Arabic orthographic normalization rules that commonly require rewriting of tokens after segmentation. Second, we add new features that improve segmentation accuracy. Third, we show that dialectal data can be handled in the framework of domain adaptation. Specifically, we show that even simple feature space augmentation (Daumé, 2007) yields significant improvements in task accuracy.
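The feature space augmentation mentioned in this abstract can be illustrated with a minimal sketch. It assumes features are stored as a name-to-value dictionary and that each training example carries a domain label such as "msa" or "dialect"; the function name `augment` and the feature names are hypothetical and are not taken from the paper. Each feature is copied once into a shared "general" space and once into a domain-specific space, so a learner can weight shared and domain-specific behaviour separately.

```python
def augment(features, domain):
    """Return a general copy and a domain-specific copy of every feature.

    `features` maps feature names to values; `domain` is a label such as
    "msa" or "dialect" (hypothetical labels chosen for illustration).
    """
    augmented = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value   # shared across all domains
        augmented[f"{domain}:{name}"] = value  # active only for this domain
    return augmented


# Hypothetical character-level feature for one position in a token.
example = augment({"char=b": 1.0}, "dialect")
```

With two domains, every original feature yields two active entries per example, which is the cost of this simple scheme.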
In this work, we adapt a previously proposed system for automatic detection of code switching in informal Arabic text to handle Twitter data. We experiment with several setups and report the results on two Twitter datasets and a surprise-genre test set, all of which were generated for the shared task at the EMNLP workshop for Computational Approaches to Code Switching. In the future we plan on handling other Arabic dialects such as Levantine, Iraqi and Moroccan Arabic as well as adapting the system to other genres.
unigrams. Hamouda and Akaichi (2013) studied Facebook status updates written in English and posted by Tunisian users during the Arab Spring. The objective of this research was to analyse the social behaviour of Tunisians during a critical event and to determine whether textual data can be used to gauge public opinion during that event. Status updates were collected from randomly selected Facebook users of different ages, genders, occupations and social statuses, and two ML classifiers were used, namely SVM and NB. Their approach consists of five phases: collection of comments, creation of sentiment lexicons, pre-processing, feature extraction, and classification. To test their approach, 260 status updates posted within a week during the Tunisian revolution were collected in phase one. In phase two, three different lexicons were created: emoticons or smiley faces; acronyms such as “gr8”, which means “great”, and “lol”, which means “laugh out loud”; and interjections such as “haha” and “Wow”. Approximately 30 different lexemes were used. In the third phase, pre-processing included removing stop words, which are neutral and do not affect sentiment, and stemming was used to enhance system performance. Moreover, the roots of opinionated words were used. POS and n-grams were used as features in phase four. In the last phase, a status update is classified as positive or negative. The highest accuracy achieved was 75.31% using SVM, which outperformed NB at 74.05%.
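The lexicon step described above can be sketched as a simple lookup that counts how many tokens of a status update fall into each of the three lexicon types. This is only an illustration under stated assumptions: the entries "gr8", "lol", "haha" and "wow" come from the description, while the emoticon entries, the whitespace tokenization and the name `lexicon_counts` are hypothetical and do not reproduce the authors' implementation.

```python
# Hypothetical mini-lexicons. Only "gr8", "lol", "haha" and "wow" are taken
# from the description above; the emoticons are assumed examples.
LEXICONS = {
    "emoticons": {":)", ":("},
    "acronyms": {"gr8", "lol"},
    "interjections": {"haha", "wow"},
}


def lexicon_counts(status):
    """Count how many whitespace-separated tokens hit each lexicon."""
    tokens = status.lower().split()
    return {
        name: sum(token in entries for token in tokens)
        for name, entries in LEXICONS.items()
    }


counts = lexicon_counts("Gr8 news lol haha :)")
```

Counts of this kind could be appended to the POS and n-gram features before classification; whether the original study combined them this way is not stated in the excerpt.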
The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which were marked as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.
Notable among our results were scores on the Morphology and Proper Name Assessments. Predictably, the average Morphology score over all systems on Formal input (.89) exceeded (by close to 15%) the same score calculated on output from Informal input (.75). Moreover, differences in Morphology scores for the poorer systems on Informal input approached .20 with the standard deviation among scores on this output more than doubling that of the Formal category. These marked differences suggest that PLATO accurately reflects the inability of standard MT systems to account for the widely varying morphological phenomena common to Informal Arabic.
(b) Woooooow. Ur car is cooooooool. Due to these factors, classifying a word as Arabizi or English has to be done in-context. Thus, we employed sequence labeling using Conditional Random Fields (CRF) to detect Arabizi in context. The CRF was trained using word-level and sequence-level features. For converting Arabizi to Arabic script, we used transliteration mining in combination with a large Arabic language model that covers both MSA and other Arabic dialects to properly choose the best transliterations in context.
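To make the idea of word-level features with surrounding context more concrete, here is a minimal sketch of a per-token feature extractor of the kind typically fed to a CRF. The excerpt does not list the actual features, so every feature below is an assumption: elongation handling is motivated by example (b), and the digit flag reflects the common Arabizi practice of writing some Arabic letters as digits. The names `collapse_elongation` and `word_features` are hypothetical.

```python
import re


def collapse_elongation(word):
    """Shorten any run of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)


def word_features(tokens, i):
    """Build an assumed feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "collapsed": collapse_elongation(word.lower()),
        "is_elongated": collapse_elongation(word) != word,
        "has_digit": any(ch.isdigit() for ch in word),
        # Neighbouring words give the sequence model its context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


sentence = ["Ur", "car", "is", "cooooooool"]
features = [word_features(sentence, i) for i in range(len(sentence))]
```

A list of such dictionaries, one per token, is the input shape that common CRF toolkits accept; which toolkit the authors used is not stated here.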
In general, natural language processing for spoken and written English and other languages has been the subject of most studies in the last fifty years. However, Arabic language research has been growing very slowly in comparison to English language research. This slow growth is due to the lack of recent studies on the nature of the acoustic-phonetics of the Arabic language, resulting from the lack of a database of Arabic dialects (ibid). In addition, assessing the similarities and differences between dialects of a language is a challenge in natural language processing. Most research in Arabic dialectology focuses on phonetic variation based on audio recordings and listening to dialect speakers [3, 1, 5, 4]. Horesh and Cotter (2016) confirmed that past and current research is focussed on phonetic and phonological variation between Arabic dialects: all the examples that they presented are of phoneme variation, and they did not mention any work on text, corpus-based research, or lexical, morpho-syntactic or grammatical variation. Therefore, most Arabic dialectology research has collected audio recordings as its primary data.
Methodology proposed by Kouloumpis et al. The authors investigated the utility of linguistic features for detecting the sentiment of Twitter messages and evaluated the usefulness of existing lexical resources as well as features that capture information about the informal and creative language used in microblogging. They used three datasets (hashtag, emoticon and iSieve) from Twitter. In this study, different features are used to classify the data: unigrams, bigrams, features representing information from a sentiment lexicon, and POS features.
The notion of markedness is applied indirectly in the work of the grammarians in all areas of the grammar (Owens 1988), and, in phonology, Sībawayh identifies ṣaḥiiḥ ‘strong’ and muʿtall ‘weak’ elements in many of the phonological oppositions that he proposes. For example, he distinguishes sounds as either mutaḥarrik (CV, followed by a short vowel) or saakin (C, closing a syllable), and argues that the mutaḥarrik is the strong member of the pair. Identification of syllable-initial (onset) position as strong, and syllable-final (coda) position as weak, is argued to account for the differing range of phenomena observed in each position (Al-Nassir 1993:111), mirroring contemporary approaches to onset-coda asymmetries in general (e.g. Lombardi 1999), and directly matching claims made about the underlying syllabic structure of Arabic (Lowenstamm 1996, see 2.2.3 below). Other phonological phenomena discussed by Sībawayh range from the optimal size of the verbal root (3-5 ḥuruuf) (Al-Nassir 1993:26) to variation in the realisation of particular sounds or lexical items across definable groups of speakers (Al-Nassir 1993:116-7, cf. Owens 2006 ch.7). We even find discussion of the potential phonological effects of word frequency (“they dare change what occurs more frequently in their speech” Al-Nassir 1993:117) which prefigure exemplar-based approaches to phonology (see 2.2.3 below).
The software used to develop this Arabic courseware is Adobe Flash. The design of the courseware is suitable for early childhood because it uses the concept of the e-flashcard, persuasive principles, children’s learning styles and multimedia components. E-flashcards have been used as an academic tool to help educators in the learning process. The courseware was developed based on a combination of children’s learning styles, which are visual, audio and kinesthetic. The audio element is the sound of each basic Arabic letter. The visual element is a related picture displayed to represent the sound of the Arabic letter. The kinesthetic element uses animation to represent an action related to the Arabic letter. For example, ‘ba’ represents ‘bahu’ and an animation of moving the shoulder is shown. The colours chosen for this courseware are black, white and red, because babies are only able to see these three main colours (Muhamamad & Nawi, 2011). A baby’s voice is used in the courseware because babies are more familiar with that sound.
Objective: linguistic validation in Moroccan Arabic dialect of the overactive bladder questionnaire OAB-q, developed and validated initially in English. Materials and Methods: The Moroccan Arabic dialect version of the OAB-q questionnaire was obtained after a double translation (English-Arabic), a back translation (Arabic-English), a review of the translations by three experts, comprehension tests and cultural adaptation on a sample of 10 patients with overactive bladder syndrome. Results: The OAB-q has two parts. The first, with eight items, assessed the discomfort associated with symptoms of overactive bladder (urinary frequency day and night, urgency and urge incontinence). The second included 25 items and measured the impact on quality of life (coping behavior, embarrassment, sleep, social interactions). The conceptual and cultural adaptation was performed on a sample of 10 patients (5 with multiple sclerosis, 5 with spinal cord injuries); the average age was 42.4 ± 7.6 and the sex ratio was 1. Discussion and Conclusion: The OAB-q has been validated in men and women with symptoms of overactive bladder with or without incontinence (neurological or not). Internal consistency and discriminative construct validity were demonstrated. It has been well validated in patients with multiple sclerosis and spinal cord injury. The linguistic validation in Moroccan Arabic dialect is the initial step pending a psychometric validation with a larger number of patients.
content in RA and RB, it is necessary to properly recognize the languages. Otherwise, there is a large risk of getting misleading information. Moreover, it is crucial to be able to distinguish between them. The reason is that RA and RB coexist in North Africa, which is a rich multilingual region, and they share a considerable amount of vocabulary due to the close contact between them. Undoubtedly, this type of tool will help to build NLP applications for both. There is some work done to automatically transliterate RA into Arabic script (Al-Badrashiny et al., 2014). However, this is very limited because RA perfectly adheres to the principle ‘write as you speak’, i.e. there is no standardized orthography. Furthermore, the Arabic Chat Alphabet (ACA), designed for Romanized Arabic used in social media, is just a suggested writing system and not necessarily a Natural Language Processing (NLP) tool for RA. To overcome the various challenges faced when dealing with RA automatic processing, namely the use of non-standardized orthography, spelling errors and the lack of linguistic resources, we believe that it is better to consider RA as a stand-alone language and try to find better ways to deal with it instead of using only transliteration. RB is already a stand-alone language. It is important to clarify that considering both
There are utterances in the language of advertising which could be classified as mitigating what is being said. Some researchers have classified this as ‘controlling the addressee’. Often in the case of mitigating messages, the advertiser tries to persuade the consumers that this product is less demanding. Mitigating utterances in Algerian advertisements are generally related to what concerns people most: the value and the price of the product. An example of this category is found in the following advertisement for the ‘Force Express’ washing product. In this television commercial a formal level of Arabic is used to talk about the product, as the following excerpt shows:
Adapting SMT resources for other Arabic dialects: Many researchers have explored the potential of using MSA as a pivot language for improving SMT of Arabic dialects (Bakr et al., 2008; Sawaf, 2010; Salloum and Habash, 2011; Sajjad et al., 2013a; Jeblee et al., 2014). This often involves DA-MSA conversion schemes as an alternative in the absence of DA-MSA parallel resources. In contrast, limited work has been done on leveraging available resources for other dialects. Recently, Zbib et al. (2012) have shown that using a small amount of dialectal data could yield great improvements for SMT. Here, we investigate the potential of improving the resource adaptability of Arabic dialects. Our work is different as we use an unsupervised segmenter that helps in improving the lexical overlap between dialects and MSA.
The test and development sets contained spelling errors (mostly run-on words). The most common of these involves the vocative particle yA, which is usually attached to the following word (e.g. yArAjl, (you man, ياراجل)). It is not clear whether it should be treated as a proclitic, since it also occurs as a separate word, which is the standard way of writing. The same holds true for the variation between the letters * and z (ذ and ز in Arabic), which are pronounced exactly the same way in CEA, to the extent that the substitution may not be considered a spelling error.
The unemployment rate in Kosovo is estimated to be 30%. However, the number of people employed in the informal sector is significant and it changes the unemployment indicators (Adrian 2001). Kosovo’s labor market is characterized by some special features in comparison to neighboring and other developing countries. Most data show that one third of Kosovo’s population is below the age of 16, while around 50% are estimated to be below the age of 24. This puts great pressure on job creation, which cannot keep up with the inflow of new workers (Aleance of Kosovar Business 2007). It must be mentioned that the unemployment data vary depending on the source. The unreliability of data has been an ongoing issue in post-war Kosovo, which has made the picture very unclear and has thus affected policy drafting.
With the growth of social media and online blogs, people express their opinion and sentiment freely by providing product reviews, as well as comments about celebrities, and political and global events. These texts reflecting opinions are of great interest to companies and individuals who base their decisions and actions upon them (Feldman, 2013; Taboada et al., 2011). In particular, there is an increased interest in easy access to Arabic opinion from mobiles. In fact, around “10.8 million tweets come from the Arab region every day. 73.6% of all the tweets from the region are now in Arabic” (Radcliffe, 2013).
This article analyses aspects of the greater use of coordination in Modern Standard Arabic as compared to English, illustrating this through Arabic>English translation. It argues that Arabic ‘favours’ coordination linguistically, textually and rhetorically, as follows: 1. The linguistic resources of Arabic favour coordination while those of English favour subordination – whether these are lexical (Arabic wa- and fa- vs. English ‘and’), or semantic (the possibility of backgrounding coordinated clauses in Arabic compared to the marginality of backgrounded coordinated clauses in English); 2. Accompanying Arabic textual norms, e.g. (near-)synonym repetition and chained coordination, favour coordination while those of English favour subordination; 3. Further associated ‘rhetorical semantic’ uses of coordination are found in Arabic, e.g. hyperonym-hyponym repetition and associative repetition, which do not exist in English; 4. These extended usages further entrench coordination as a norm in Arabic as compared to English.
units of weight, volume, distance, and the number. The word restraint is often inaccurate and comes with revenue. One of the most commonly used agreements in Arabic is the revenue agreement, which, as we have stated above, is divided into two, namely, the original and the original. The names in the original revenue agreement include the Mafools, and the Muful, the absolute Masjid, also belong to the group of names in the original revenue, which Arabic linguists describe: ًّ نمال هوصًّذّ ثا ا ًّ ةترح ًّ تترح ًّ هوترح ثترح :ًّين و اها راصمال ًّا ًّ قذامال. Absolutely masdar, he is, in fact, a masdar. For example: I hit one, two, and sat down. Clearly, the verb scope can sometimes be used in the sentence to reinforce the meaning that is understood from that verbي Masks in this position are called "ma‘ful mutlaq" or "mutton masdar" and come at the end of the sentence, in vague terms, in proseي For example, لًّترح وبرح ررذال ل-"The thief was severely beaten." Linguist B.M Grande says this about absolute masdar: ―It is clear that every verb, whether it is temporal or mundane, has its own masdar. These are the things that come with the yield, that is absolute Masdar (وفعًّ وا tI ي‖sa ot derrefer era (tcejbo etulosba naق should also be noted that in the place of absolute mascot, the meaning of the synonym verb, rather than the exact verb may be used. For example, هوفًّاًّ فاًّ instead of هومه ا فاًّ - ―he woke up‖ي هوصًّذّ سذّ instead of لواًّ ا سذّ - ―he sat down‖ We also want to draw your attention to the views of Arab linguists. Fuat Ne‘mat said that, ل وفعًّ ل وا ق إسو و رًّب و فن ل فع (ورار) يل ر وعه ثن ياه لًّ بيل ًّعه لًّ عااهي ―Maful is a noun in a pronunciation of an absolute verb (masd) and the emphasis is on the expression of its type or numberي‖ي