The last six chapters are geographically arranged: Iraq, Iran, India (this is a very short chapter), North Africa plus Spain, Egypt, and Syria. In all, there are 442 entries on individual physicians, of varying length: some get only a line or two, others thousands of words: 15,000 on Galen, more than seven thousand on Ibn Sīnā (these are Arabic words; the English translation would have many more). Normally, the length of a section reflects the importance of the physician in question. But there are exceptions. Understandably, near-contemporaries of the author tend to have longer entries simply because the author knows more about them. There are other reasons, however, why an entry can grow. We know that Ibn Abī Uṣaybiʿa also made poetry, like so many others in pre-modern Arab society. In a short entry on him in a 14th-century biographical dictionary he is described as “the excellent physician” but the author adds: “he was a lettered man, a physician, and a poet”. Ibn Abī Uṣaybiʿa had literary inclinations. Although his prose style is simple, devoid of the rhetorical artifices that make the prose of many of his contemporaries rather difficult to read, he could not resist quoting reams of poetry, including some of his own. The book contains some 3,600 lines of verse in all. This is also the reason why I became involved in the preparation of a new edition and translation of the work. I must say something about this and the earlier editions of the work.
based method, is used in the third part of this thesis. Finally, this thesis proposes and implements a variant term frequency-inverse document frequency weighting method, and investigates how the choice of features in document representation (words, stems, or roots, optionally extended with their respective phrases) affects single-label text classification performance. This thesis applies forty-seven classifiers to all proposed representations and compares their performance. One challenge for researchers in Arabic text processing is that the root extraction techniques reported in the literature are either not accessible or take a long time to reproduce, while no labeled benchmark Arabic text corpus is fully available online. Also, to date few machine learning techniques have been investigated on Arabic, typically with only the usual preprocessing steps applied before classification. These challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques.
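For orientation, the textbook term frequency-inverse document frequency weighting on which such a variant builds can be sketched as follows. This is only a minimal sketch of the standard scheme, not the variant proposed in the thesis; the function name `tf_idf` and the toy transliterated documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus; the tokens could equally be words, stems, or roots.
docs = [["kitab", "ilm"], ["kitab", "tibb"], ["shir", "tibb", "tibb"]]
w = tf_idf(docs)
```

The same function can be applied unchanged to word, stem, or root tokens, which is one simple way to compare the representations discussed above.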
Let’s begin with the first words (خثي البقر), which are about cows discharging remains. The same meaning is served by the word (الخراءة), which is the dirtiest word in the entire Arabic dictionary, as it is about the remains discharged by human beings. The word is highly offensive when used in vulgar speech to insult someone by calling him or her mere (خرا-khara). The rest of the words describe rottenness, such as words number 5, 6, 9 and 11, which are all about things getting rotten, like meat or plants. But there are two words, 10 and 11, that describe a negative physical human feature. The word (المستخس) is about a person with an ugly face, while the other word (الخُنابس من الرجال) is about a gigantic man with ugly features. The letter [خ-kha] is also employed against animals. The last two words are about ‘pigs’. The pig in Arabic literature is a dirty animal. Islam forbids Muslims from eating its meat, so ‘pork’, as this animal’s meat is called, is not consumed by the Muslim community. The word can also sometimes be used as an insult: if someone is called (pig=خنزير), this is a great insult. But the most surprising thing here is that the word ‘pig’ begins with [خ-kha] in Arabic, as does the word (خنَّوص), which stands for the newborn pig.
Another critical remark concerns the 3 overall objectives, one of which states: ‘Giving learners adequate opportunities for the use of the language in speech and in writing’. One is compelled to ask why, then, the bulk of the curriculum content is concentrated on reading comprehension, grammar and literature. Modern language learning theory holds that learning to speak a language is much less cognitively complex than learning to read and write it, and that students who are exposed to an immersive learning environment first learn to communicate orally and then go on to achieve more in reading and writing, following the developmental pattern of native language acquisition. As suggested by Nergis (2011), foreign language learning is most effective when it takes place in an interactive and meaningful way. Loading the curriculum with many reading, grammar and literature activities will not produce a communicatively competent student whose main aim in studying Arabic is to interact. It can, hence, be concluded that the SSAC overall objectives are not in consonance with the selected expected classroom behavioural objectives.
Arabs and Islam had a significant influence in Asia and North Africa, and at the height of the Arab Empire, groups of Arabs went out to do business or preach, spreading Islam around the world. With the spread of Islam and the great idea that "all Muslims are brothers" stipulated in the Koran, the foundation of Arab clan society, which was linked by consanguineous relations, was greatly shaken, and different Muslim ethnic groups were able to unite with each other and develop together. Hundreds of millions of Muslims who advocate Islamic tolerance, freedom, and equality sincerely support and follow Islam, and safeguard its dignity even with their blood and lives. The contribution of the Arab people is essential, mainly in literature, mathematics, history, and other fields, as well as in cultural exchange, development and inheritance.
8). Obviously, the main difficulty is the expectation that Arab children who are fluent only in a spoken (vernacular) Arabic would learn efficiently from textbooks written in the (as yet) unknown to them (“foreign”) language of standard Arabic. Some educators who recognize this difficulty often make matters worse by writing textbooks in ad hoc “simplified” or “modernized” versions of standard Arabic, thus confusing pupils even more (Haeri 2009: 429). However, it appears that even in an Arab (Arabicphone) state where high rates of formal literacy in standard Arabic are achieved, Arabic-language books are not used more widely. Something is lacking. Perhaps the gap that exists between standard Arabic and the local vernacular prevents most Arabs with an average command of this “antique-holy-modern” language of standard Arabic from enjoying fiction. Khaled Al Khamissi’s 2006 novel Taxi is composed of stories narrated by Cairo taxi drivers. Memorably, it is the first-ever novel written (almost) entirely in the Egyptian vernacular, the speech of the Egyptian capital’s common people. It is so because the author lets the taxi drivers speak in their own voices. Dialogs and plays, among others, are the preserve of vernaculars in Arabic literature, pushing writers of fiction to mix standard and vernacular Arabic creatively (Mejdell 2006). Resultant “mixed styles” bring to mind the Russian polymath Mikhail Lomonosov’s mid-18th-century plan for
The lexicographic method has been chosen as the primary research method since it makes it possible to systematically analyse the representation of linguistic units in dictionaries, ways of giving their definitions, and procedures for compiling dictionaries, as well as to study in great detail the peculiarities and linguistic functions of the objects in such dictionaries (synonyms, antonyms, phraseological units, etc.) (Rybalkin, 1984). We also use other methods, such as the contrastive-comparative method, which is used to identify the similar and distinguishing aspects of the concepts and structures adopted in "The Large Arabic Dictionary" and other thesauruses of the Arabic literary language. The statistical method makes it possible to estimate and evaluate supporting data within an entry in "The Large Arabic Dictionary". In addition, the modelling method helps us to analyse word-building patterns and the ways new words are adopted and defined in a dictionary compiled by the Academy in Cairo. With the parametric method, the essential components of a dictionary entry are described. Component analysis contributes to the description of defining methods. Using the above techniques, the research revealed a number of innovations introduced by the Academy. This fact undoubtedly points to the transition of the Arabic monolingual
Language model vocabulary size (LM VOC Size) and the unknown stem ratio (OOV ratio) of various segmenters are given in Table 6. For unsupervised stem acquisition, we have set the frequency threshold at 10 for every 10-15 million word corpus, i.e. any new morphemes occurring more than 10 times in a 10-15 million word corpus are considered to be new stem candidates. The prefix, suffix, and prefix-suffix likelihood score threshold used to further filter out illegitimate stem candidates was set at 0.5 for the segmenters developed from 10K, 20K, and 40K manually segmented corpora, whereas it was set at 0.85 for the segmenter developed from a 110K manually segmented corpus. Both the frequency threshold and the optimal prefix, suffix, and prefix-suffix likelihood scores were determined on empirical grounds. The Contextual Filter stated in (8) has been applied only to the segmenter developed from the 110K manually segmented training corpus. A comparison of Tables 5 and 6 indicates a high correlation between the segmentation error rate and the unknown stem ratio.
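The two-stage filtering described above can be illustrated with a short sketch. It assumes the likelihood score of each candidate has already been computed elsewhere (the excerpt does not define it), and that a candidate passes when its score is at least the threshold; the function name `select_stem_candidates` and the toy morphemes are hypothetical.

```python
def select_stem_candidates(freq, score, min_count=10, min_score=0.5):
    """Keep morphemes seen more than `min_count` times whose precomputed
    affix-likelihood score is at least `min_score` (the >= is an assumption)."""
    return sorted(
        morpheme
        for morpheme, count in freq.items()
        if count > min_count and score.get(morpheme, 0.0) >= min_score
    )

# Hypothetical counts from a 10-15 million word corpus and hypothetical scores.
freq = {"ktb": 25, "drs": 10, "qlm": 40, "xyz": 12}
score = {"ktb": 0.9, "drs": 0.95, "qlm": 0.6, "xyz": 0.3}

loose = select_stem_candidates(freq, score, min_score=0.5)    # 10K-40K setting
strict = select_stem_candidates(freq, score, min_score=0.85)  # 110K setting
```

Raising the score threshold from 0.5 to 0.85 trades recall of new stems for fewer illegitimate candidates, which matches the stricter setting reported for the segmenter trained on the largest corpus.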
Period (1500 to 1800 A.D.) and 3 Modern Period (1800 A.D. to present time). The ancient period is considered undeveloped from a historical point of view: the accepted language of stone inscriptions (Shilalekh) and copper sheets (Tamrapatra) was Sanskrit or Prakrit, and the use of Hindi seems insignificant at this stage. During this period two languages are attested: Padvarti Apabhramsa or Avahatta, and the native language or Dingal-Pingal. The eminent poet Vidyapati wrote Kirtilata and Kirtipataka in Avahatta and Rasokavya in Dingal-Pingal. The middle period of the Hindi language began as a golden era for the further advancement of Hindi dialects. A number of texts were written in Avadhi and Vrajabhasa. The great sage Tulsidas wrote Ramacharitmanas (1574) in Avadhi, and Surdas as well as Nandadas wrote in Vrajabhasa. Vrajabhasa received wide recognition as a literary form of the language during this time. Literary works were also composed in Khadi Boli. In the middle period Hindi was freed from the influence of Apabhramsh, and Sanskrit vocabulary was also used in the dialects. The eighteenth century is considered the era of the ‘ruin of Vrajabhasa’; henceforth Khadi Boli (Urdu) remained the centre of attention and acquired importance among Muslims. The British conducted experiments in the prose form of Khadi Boli for Hindus at the beginning of the eighteenth century. Writers like Bharatendu and Dayananda were the chief propagandists, and this period is recognized as ‘Harishchandri Hindi’, owing to which the art of printing was introduced. Owing to varying factors such as time and place, the Hindi language took various forms in the course of time: Magahi, Maithili, Bhojpuri, Kannoji, Badhelkhandi, Bundelkhandi, Vraj, Khadiboli, Bangaru, Mevati, Hadauti, Marvadi, Mevadi, Malvi, Bhili, Khandeshi, etc. The literature written in all these forms is also called Hindi literature.
Hindi literature has also been classified under various names, for example Siddha Literature, Jain Literature, Nath Literature, Raso Literature, Laukik Literature and Prose Literature. Hindi literature depicts the social, political, religious, cultural and literary background of diverse times and the spirit of its people. The states where the Hindi language is used are: Himachal Pradesh, some parts of Punjab, Haryana, Rajasthan, Delhi, Uttar Pradesh, Madhya Pradesh and Bihar. There are five sub-languages and ten dialects in the entire field of this language (Dr. Nagendra 6-17).
We compare a language model built on multiple segmentations as determined by the FSMs described above to two baseline models. We call our experimental model FSM-LM; the baseline models use word-based n-grams (WORD), and pre-defined affix segmentations (AFFIX). Our data set in this study is the TDT4 Arabic broadcast news transcriptions (Kong and Graff, 2005). Because of time and memory constraints, we built and evaluated all models on only a subsection of the training data, 100 files of TDT4, balanced across the years of collection, and containing files from each of the 4 news sources. We use 90 files for training, comprising about 6.3 million unvoweled word tokens, and 10 files for testing, comprising about 700K word tokens, and around 5K sentences. The size of the vocabulary is 104757. We use ten-fold cross-validation in our evaluations.
4.1 Experimental Model
Firstly, it provides an explanation of positive and negative views toward utilizing literature as a resource for language teaching. Secondly, it sketches out different methodological issues regarding the use of literature. Finally, some empirical studies carried out to examine the role of literature in language instruction are presented.
There is no doubt that the development of intercultural skills is closely connected to the particular stages of teaching a language. When a foreigner whose message contains a number of mistakes also fails to observe specific cultural norms, a native speaker will accept this with a certain degree of understanding. However, fluent and correct usage of a foreign language usually raises expectations of a better knowledge of the foreign culture. On the basis of the guidelines formulated in the CEFR in this area, the basic objective in selecting the themes and content taught at the basic level is, on the one hand, to enable a beginner student to communicate efficiently in daily situations and, on the other hand, to develop the skill of expressing basic communicative intent. An essential element of the latter is knowledge of the most important social and cultural conventions operating in communication in a given language. Hence, a successful graduate of an A2 language course should be able to participate in social talk and organize their utterances in such a way that they are understood by other interlocutors with regard to the linguistic and socio-cultural criteria that apply to verbal contact and social rituals. Additionally, the course graduate should know the basic facts about the country (or countries) of a given language area, which will help them function in that country and understand the patterns of behaviour, mentality and identity of its citizens.
Arabic ontology is the foundation for creating the Semantic Web in the Arabic language. The basic categorization of terminologies and meanings in a domain gives the semantics. The interrelationships between one word and the other words that match its meaning can also yield the stems and branches of semantics. An ontology can be built by domain experts or learned from information available in a corpus of the domain. The goal of ontology learning is to automatically extract relevant concepts and relations from the given corpus or other kinds of data sets to form an ontology. There are six parts in the life cycle of ontology development: Creation, Population, Validation, Deployment, Maintenance and Evolution. These 6 parts can also be subdivided into the following: extracting terms, discovering synonyms, obtaining concepts, extracting concept hierarchies, defining relations among concepts, and deducing rules or axioms. These processes are used to make ontology matching possible and to make the related branches of topics available to any user.
In this paper, we present our work on detecting abusive language on Arabic social media. We extract a list of obscene words and hashtags using common patterns used in offensive and rude communications. We also classify Twitter users according to whether they use any of these words or not in their tweets. We expand the list of obscene words using this classification, and we report results on a newly created dataset of classified Arabic tweets (obscene, offensive, and clean). We make this dataset freely available for research, in addition to the list of obscene words and hashtags. We are also publicly releasing a large corpus of classified user comments that were deleted from a popular Arabic news site due to violations of the site’s rules and guidelines.
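The user classification and list expansion steps can be illustrated with a small sketch. The abstract does not say which statistic is used to expand the list, so the add-one-smoothed log frequency ratio below is an assumption chosen for illustration; the function names, user ids, and placeholder tokens are likewise hypothetical.

```python
import math
from collections import Counter

def split_users(tweets_by_user, seed_words):
    """Partition users by whether any of their tweets contains a seed word."""
    flagged, clean = {}, {}
    for user, tweets in tweets_by_user.items():
        tokens = {tok for tweet in tweets for tok in tweet.split()}
        (flagged if tokens & seed_words else clean)[user] = tweets
    return flagged, clean

def rank_candidates(flagged, clean, seed_words):
    """Rank non-seed words by a smoothed log ratio of their relative frequency
    among flagged users versus clean users (assumed scoring, see lead-in)."""
    f_counts = Counter(t for ts in flagged.values() for tw in ts for t in tw.split())
    c_counts = Counter(t for ts in clean.values() for tw in ts for t in tw.split())
    f_total, c_total = sum(f_counts.values()), sum(c_counts.values())
    scores = {
        word: math.log(((f_counts[word] + 1) / (f_total + 1))
                       / ((c_counts[word] + 1) / (c_total + 1)))
        for word in f_counts
        if word not in seed_words
    }
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder data: "badword" stands in for a seed obscene word.
tweets_by_user = {
    "a": ["you badword slur2", "slur2 again"],
    "b": ["nice day", "good day"],
    "c": ["badword slur2"],
}
seed = {"badword"}
flagged, clean = split_users(tweets_by_user, seed)
ranked = rank_candidates(flagged, clean, seed)
```

Words near the top of `ranked` are candidates for manual review before being added to the list; the sketch does not replace that review.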
To gain an overview of the research that has been done in this area, we went through as many papers as possible and tried to specify the main contributions of each paper. We identified four main categories, each with some subcategories. The main categories are basic language analyses, building language resources, semantic-level analysis and synthesis, and identifying Arabic dialects. Then, we mapped each paper to categories and subcategories as well as to the addressed dialect or dialects, in matrix form, as given in Table 1. In this way it can easily be seen what has been done in Arabic NLP, by whom, and for what dialects.
Jakobson’s classic 1960s linguistic essay on ‘poetic language’ was arguably the beginning of modern stylistics and has been much discussed elsewhere in print. There Jakobson posited, in essence, that literary language is found predominantly in literature, where it is to be understood functionally as language that in one way or another draws attention to itself. Standard devices for doing this are repetition or near repetition (‘parallelism’ is Jakobson’s term) and deviant uses of language (neologism, archaic language, unusual collocations, innovative uses of figurative language, dialect forms in standard contexts (or vice versa), and so on). (See for example discussions in Goodman and O’Halloran 2006; Jeffries 1996; Tambling 1988.) Many of these ideas remain very fertile for the understanding of how language in literature works. Cognitive poetics has taken up such ideas as ‘foregrounding’ as a characteristic of literature: salient forms or uses are likely to be noticed and found meaningful by the literary reader. (See Stockwell 2002 for an introduction to foregrounding and cognitive poetics more widely.) Also interesting, for example, is Jakobson’s idea that because of this concentration on language forms literature can be driven by language as much as by meaning (unlike the weather forecast or sales letter). Thus popular musician Ian Dury wrote lyrics for his songs with the aid of a rhyming dictionary, with the rhyme word then driving the backward formation of the rest of the line. Sound driving sense, as Jakobson would say. Most of us who have tried to do any extended writing recognise the truth of E. M. Forster’s widely quoted adage, ‘how do I know what I think until I see what I say’. It is basic to literature to be particularly concerned with form, perhaps especially when reading and writing poetry.
A genre like the sonnet can be extended and played with (16-line sonnets in sequence in the cases of George Meredith and Tony Harrison, rather than the usual 14 lines), but here as elsewhere (rhyme, rhythm, metre) form is found by many to be both creatively enabling and sometimes constraining. Creative literary writing can be a key space for such experimentation and play with form and meaning.
The survey (cf. sect. 3) will identify existing Machine Translation and Multilingual Information Retrieval tools, both at universities and provided by industry. In order to ensure that such tools are usable for Arabic, a task will focus on identifying all the obstacles that may prevent such use. This will include identification of language resources that are needed but also of basic NLP tools that may be required and that could be language-dependent (lemmatizer, POS tagger, vowelizer, etc.). The survey will cover language resources (LRs) currently available to build MT and MLIR systems and benchmark them.
In this study, there is a significant variation in language maintenance between the age groups within the same family. Older parents maintain their Arabic language more successfully than their younger, second-generation children. The interview and participant observation data revealed that the Arabic-speaking parents are more proficient in the Arabic language than their children. Results showed that children tend to adopt English in most of their interactions, especially outside the home environment. The reason is clear: the parents came to Australia as adults, whereas their children were either born here or came with their parents at an early age. Also, the Arabic-speaking children are exposed to the English language through their peers at school. This finding is consistent with Hatoss et al.’s (2011) research, which found that language loss is prevalent among young children who migrated with their parents at an early age. It also parallels many studies that found that language maintenance is very common among the first generation, while the shift to English is prevalent among second- and third-generation children (Clyne, 2003, 2005; Fishman, 1966). It is important to mention at this stage that most of the Arabic-speaking children in this study can speak Arabic, but they prefer English, as they are more proficient in it than in Arabic because of the influence of the school environment.
researchers. Classical Arabic texts, in particular the Quran and Hadith, are a specialised genre. The Classical Arabic Quran has been analysed, translated, interpreted and annotated by scholars for over a thousand years, resulting in many knowledge sources for rich corpus linguistic annotation. Modern Standard Arabic is the common written standard used throughout the Arab world; but our research with Arabic corpora has covered wider genre and language variation. AI researchers at Leeds University have collaborated with Arabic linguists to develop a number of Classical Arabic corpus resources: the Quranic Arabic Corpus with several layers of linguistic annotation; the QurAna Quran pronoun anaphoric co-reference corpus; the QurSim Quran verse similarity corpus; the Qurany Quran corpus annotated with English translations and verse topics; the Boundary-Annotated Quran Corpus; the Quran Question and Answer Corpus; the Multilingual Hadith Corpus; the King Saud University Corpus of Classical Arabic; and the Corpus for teaching about Islam. We have also developed Modern Arabic corpus resources spanning a range of genres and language types: Arabic By Computer; the Corpus of Contemporary Arabic; the Arabic Internet Corpus; the World Wide Arabic Corpus; the Arabic Discourse Treebank; the Arabic Learner Corpus; the Arabic Children’s Corpus; and the Arabic Dialect Text Corpus. Modern Arabic corpus researchers harvest online news, web pages, and internet social media; these might seem to differ markedly from religious texts in language and genre. However, Quran verses are short text snippets, analogous to Twitter tweets or Amazon customer reviews. Quran verses annotated with analyses derived from traditional exegesis or scholars’ commentaries can provide rich training data for supervised Machine Learning of language models in Artificial Intelligence research.
So, the language of the Quran may still inform Modern Arabic corpus linguistics and artificial intelligence research, and development of Modern Arabic text analytics tools.
In this section, we present a description of the corpus for the four domains. For University Schooling Management, which is a DBMS Information Retrieval domain, we collected data from around 300 students who formulated their requests to access their information from the education office. After discarding the repeated requests, we obtained a corpus of 127 different requests expressed in French. The collected corpus, which was initially in French, was translated manually by experts into Arabic (?). Some examples of these queries are given in Table 1. These queries express what students request from the education office, such as Marks, Certificates and Diplomas. For the second domain, Medical Diagnostic, we collected a corpus from a medical care forum known as Doctissimo (Alexandre, 2000). Some examples of these queries are also given in Table 1. These queries express the symptoms and feelings of ill people describing their state of health to a doctor on the forum, so that the doctor can prescribe treatment or give advice. We chose seven diseases, namely: Allergy, Anemia, Bronchitis, Diarrhea, Fatigue, Flu and Stress. For the Consultation domain, we collected the dataset from the Islamtoday website (Today, 2000). It contains four main tasks: Educational, Psychological, Social and Religion Consulting. An example from this corpus is presented in Table 1. We have shared the first two corpora (University Schooling Management and Medical Diagnostic) in a GitHub repository.