Compilation of an Arabic Children’s Corpus

0
1
6
4 months ago
PDF Preview
Full text
(1)This is a repository copy of Compilation of an Arabic Children’s Corpus. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/100839/ Version: Accepted Version Proceedings Paper: Al-Sulaiti, L, Abbas, N, Brierley, C et al. (2 more authors) (2016) Compilation of an Arabic Children’s Corpus. In: LREC 2016 Proceedings. LREC 2016: 10th Language Resources and Evaluation Conference, 23-28 May 2016, Portorož, Slovenia. . ISBN 978-2-9517408-9-1 Reuse Unless indicated otherwise, fulltext items are protected by copyright with all rights reserved. The copyright exception in section 29 of the Copyright, Designs and Patents Act 1988 allows the making of a single copy solely for the purpose of non-commercial research or private study within the limits of fair dealing. The publisher or other rights-holder may allow further reproduction and re-use of this version - refer to the White Rose Research Online record for this item. Where records identify the publisher as the copyright holder, users can verify any specific terms of use on the publisher’s website. Takedown If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing eprints@whiterose.ac.uk including the URL of the record and the reason for the withdrawal request. eprints@whiterose.ac.uk https://eprints.whiterose.ac.uk/

(2) Compilation of an Arabic Children’s Corpus Latifa Al-Sulaiti, Noorhan Abbas, Claire Brierley, Eric Atwell, Ayman Alghamdi University of Leeds School of Computing, University of Leeds, LS2 9JT, UK E-mail: lsulaiti9@hotmail.com, noorhanabbas@yahoo.co.uk, C.Brierley@leeds.ac.uk, E.S.Atwell@leeds.ac.uk , scaaa@leeds.ac.uk Abstract Inspired by the Oxford Children's Corpus, we have developed a prototype corpus of Arabic texts written and/or selected for children. Our Arabic Children's Corpus of 2950 documents and nearly 2 million words has been collected manually from the web during a 3month project. It is of high quality, and contains a range of different children's genres based on sources located, including classic tales from The Arabian Nights, and popular fictional characters such as Goha. We anticipate that the current and subsequent versions of our corpus will lead to interesting studies in text classification, language use, and ideology in children's texts. Keywords: Arabic, Children's Corpus, Genre Classification 1. Introduction When the Oxford Children's Corpus (OCC) was published in 2012, it was heralded as the first of its kind (Wild, Kilgarriff and Tugwell 2012). The preferred definition of children's literature adopted for that corpus is material written especially for children, and/or selected for children by parents, teachers, and publishers. Version 1.0 of the OCC contains some 30 million tokens of English text for 5-14 year olds, organised into four basic genres (fiction; non-fiction; children's own writing; and unclassified), and with an emphasis on 21st-century material (ibid). Children's corpora as defined above appear to be few and far between. A small corpus of children's texts for English (less than 1 million words) was extracted from the British National Corpus (BNC) by Thompson and Sealey (2007) for corpus-based insight into the distinguishing features of children's versus adult fiction. There is also a children's literature section in the Corpus of Translated Finnish (Mauranen 2000), where the main source languages are English and Russian; this has been used in studies of translationese in Finnish children's texts (Puurtinen 2003; 1998). Finally, children's fiction is included in the parallel Bulgarian-Polish Corpus (Dimitrova and Koseska-Toszewa 2009). Rather like the OCC for English, the Arabic Children's Corpus (ACC) introduced in this paper is a first for the Arabic language. Surveys of Arabic children's literature do exist. Perhaps the most comprehensive is Al-Hajji's Bibliographical Guide (1990), which covers an estimated 12,000 books published for children across the Arab region between 1950-1999 (Mdallel 2003). Al-Hazza (2006) offers an update on new and selected publications which genuinely reflect contemporary Arab culture; and Peterson (2005) offers an intriguing survey of Arabic children's magazines, most notably: Majid; Alaa Eldin; Al-Arabi Alsaghir; and Bolbol. Compilation (and later annotation) of the ACC is an ongoing project. However, version 1.0 of the corpus, with some 1,877,615 word tokens, and drawn exclusively from the Internet, already attempts to classify a wider range of children's genres than any of the above datasets. 2. Why Collect a Corpus of Children's Literature? The OCC was primarily compiled for the purposes of lexicography, to inform children's dictionaries at Oxford University Press (OUP). A large corpus was needed to help refine headword lists, and to identify collocates, senses, and naturally-occurring examples of target terms used in context (Wild, Kilgarriff and Tugwell 2012). Corpus comparison of children's versus adult literature (where a representative sample of the latter was drawn from the BNC) in SketchEngine (Kilgarriff et al 2014) established that children's dictionaries should not just be simplified versions of adult dictionaries, and should be based on children's corpora since these properly reflect the language and content that children are exposed to (Wild, Kilgarriff and Tugwell 2012). Critical Linguistics or Critical Discourse Analysis (Wodak and Meyer 2001) is a major field of enquiry with regard to children's literature. Researchers are interested in the interrelationship of syntax, readability and ideology (Puurtinen 1998; Mdallel 2003). Strategies such as nominalisation and passivisation in translations of children's texts are subtle effects which tend to obscure agency, and hence responsibility (Puurtinen 1998). At the same time, they introduce linguistic complexity, and may detract from readability, defined as the ease and naturalness with which such texts may be read aloud by children themselves (ibid). Children's texts in general are noted for their pedagogic and didactic overtones (ibid), and this pertains to Arabic children's literature as well (Mdallel 2003). Based on AlHajji's bibliography (1999), one of the most popular genres besides overtly religious texts (e.g. stories about the Prophet's life and the lives of other prophets such as Moses and Jesus) is historical fiction featuring Islamic heroes (Mdallel 2003). This is corroborated in a study of Arabic children's magazines (Peterson 2005), where comic characters are said to model the 'hybrid identity' of young Arabs as both Muslim and modern, via juxtaposition of Islamic and consumer values. Building on Puurtinen's work cited above, Mdallel

(3) (2003) offers further evidence of translation as a vehicle for enculturation. She notes that while Russian and Chinese publishers specialise in translating their children's classics into Arabic, there are far fewer Arabic translations of modern Western literature for children. The influx of translated books is also said to be detrimental to the spread of local literature in the region (ibid). However, on a positive note, Al-Hazza and Lucking (2012) welcome English translations of high-quality, contemporary Arab children's fiction in American classrooms to encourage cultural pluralism. Titles cited include: The Day of Ahmed's Secret plus Sami and the Time of the Troubles (Heide and Gilliland 1990; 1992); and also historical fiction such as: A Peddler's Dream (Shefelman 1992) and Saladdin: Noble Prince of Islam (Stanley 2002). 3. Sourcing the Corpus The first stage of designing a corpus is to decide on the sources of texts. Since the Internet is rich in Arabic texts and easily accessible, it was the main source for corpus collection. Two native Arabic speakers on the team identified around 70 websites from which to collect texts. Websites that contained mainly audio and video files were excluded. Scanned PDF files were also excluded since there is no sufficiently accurate Arabic Optical Character Recognition (OCR) system for the low resolution files we came across; although most of the OCRs available online claim high accuracy, tests reveal that their output needs editing (Abbas, Atwell and Al-Sulaiti 2016). The list of websites was filtered using specified context words such as: English Arabic children’s stories Arabic children’s magazines Arabic children’s plays Arabic children’s songs Arabic children’s blog Arabic children’s forums Arabic ‫ا ل‬ ‫تا ل‬ ‫تا ل‬ ‫ا ل‬ ‫ا‬ ‫و أ ل‬ ‫تا ل‬ Figure 1: Search terms used for locating sources Most of the websites found were online children’s magazines, children’s websites, forums which include small sections for children’s stories, and blogs. These websites are created by individuals who either include their own writing or collect stories written by others. Usually these websites contain a good selection of stories written by well-known authors. The texts identified cover a variety of categories such as folk tales, religious stories, fiction, biographies, plays, poems, instructions, scientific, medical, general knowledge and warfare texts et cetera. Most of the texts found were suitable for older children, aged seven and above. However, texts for beginners are very few and some are scanned PDF files or audio files. To make the corpus more balanced, it was decided to manually type these texts for younger children. Once suitable websites were determined, each researcher embarked on text collection, making sure to keep a record of the title and source of each text in an Excel file. This ensured that the list of texts collected by the two researchers was not the same because some websites copy stories from other sources. 4. Collecting the Corpus Our initial plan for collecting the corpus was to use bespoke web-as-corpus software, namely: WebBootCat (Baroni et al 2006). This offers two alternatives for corpus collection: (i) via seed terms used as queries in Google, where subsequent hits then constitute a first-pass specialist corpus; (ii) and by uploading URLs. In practice, however, we found that sometimes, when the corpus was downloaded from the website, the text was not complete; when compared to the original document either the beginning or the end of the text was missing. This is related to the fact that when the tool tries to remove unwanted data such as navigation menus, links, ads, headers and footers – all referred to as boilerplate – a slice of the text is also removed; the tool cannot separate out redundant from pertinent content, hence some of the latter is lost as well. We also found that, on occasion, the tool failed to process potentially relevant websites which instead came up as errors. In terms of the second option (i.e. uploading URLs), extracting from a website means that all the data on the website will be extracted. This worked well for websites that were solely/mainly for children, and for longer stories, where the URLs only pointed to children’s material; but some websites are forums and contain all sorts of data. Therefore, using URLs was not appropriate for our purposes since everything would be extracted. Based on the tests conducted in WebBootCat, and the problems outlined above, we found both tool options for automatic corpus collection unsuitable for our purposes. However, the ‘Create Corpus’ tool in SketchEngine (Kilgarriff et al 2014) enables researchers to gather texts in a range of different formats and then upload them into SketchEngine format. If it is a .doc(x) file, output text appears without paragraph tags, but if it is from a website, Create Corpus copies paragraph tags from the html. We therefore decided that the best method of collecting a children’s corpus was to do this manually. In this way, we ensured that texts would be complete and no unwanted material included. Our approach was to examine each site for suitable texts, and to copy and paste material into a file, making note of provenance data such as URL, author, and title of text. Texts were then reformatted and saved as Word files (see Section 6). 5. Classification of Genres Our corpus presents a new and diverse snapshot of Arabic Children’s Literature in the 21st century. The genre categorisation scheme emerged from corpus collection, and reflects the variety of children’s texts available on the web. Version 1.0 of our corpus classifies texts via two overarching categories: Fiction or Non-Fiction, and further specifies genres according to the following types (Figure 2).

(4) Fiction Non-Fiction Biography History Informative Texts Religion Science War Other Adventure Stories Animal Stories Comic Books Contemporary Fiction Detective Stories Educational Fairy Tales Fantasy Fiction Folk Tales Ghost Stories Historical Fiction Humour Moral Tales Nature Stories Plays Poetry & Nursery Rhymes Riddles Science-Fiction identifier/name on the system. It contains basic information about the corpus such as language and encoding, and also definition of the attributes that correspond to the metatags added to the story files (cf. Figure 3). The structure and attribute names in the actual data have to correspond to the corpus configuration file. These attribute definitions are constructed by the corpus owners to enable SketchEngine to handle the metadata properly. The beginning and the end of each story text was also marked with the paragraph html tags <p> and </p>. Marking the text in this way facilitated using the Onion tool (provided by SketchEngine) to remove duplicate texts when compiling the corpus. All the corpus files were also checked manually to remove any duplicate texts. Figure 4 shows paragraph mark-up for a single document mapped to the header information given in Figure 3. <p> Figure 2: Genre classification scheme used in the ACC However, classifying texts in terms of their primary genre does not preclude further sub-categorisation. How, for instance, would one categorise the Harry Potter series? They are certainly contemporary; but as well as depicting ordinary life as we know it (school, teachers, holidays, home, parents, siblings etc), they also conjure a fantasy world of magical powers and beneficent/malicious creatures; and they are not even just children’s books, since mums and dads (and people on the train) read them too! Therefore, our genre categorisation scheme is likely to be refined in subsequent versions of the corpus, since the ACC is an ongoing project. One idea is to introduce a more differentiated hierarchy of genres and metadata. < doc url = “http://vb.3dlat.net/showthread.php?t=183153” Title = “ ” Author = “Unknown” Genre = “Fiction,Fiction::Moral,Fiction::Animal” Dialect = “Egyptian” > Figure 3: Example of story header for one corpus file 6. Presentation of the Corpus: Formatting and Storage Version 1.0 of our Arabic Children’s Corpus has been uploaded into the SketchEngine corpus query tool in preparation for further investigation and analysis. This involved several steps as follows. Stories were collected from different websites and all text formatting, including bold, italics, highlights, et cetera, was removed (see Section 4). Then, for each story, a header was added containing the following fields: URL of the story; story title; author; genre; and dialect (see Figure 3). Sketch Engine creates a configuration file for each uploaded corpus. This configuration file is located in the registry directory with a filename which is the corpus </p> </doc> Figure 4: Paragraph mark-up for the same document defined in Figure 3 A simple hierarchy for the genre field has been introduced for this pilot version of the corpus. Multiple values in the genre field can be structured into a hierarchy, so headers in the story files were therefore modified to create this treelike structure (cf. Figure 5). < doc url = “http://al- hakawati.net/arabic/stories_Tales/story23.asp” Title = “ ” Author = “Unknown” Genre = “Fiction,Fiction::Folktale,Fiction::Folktale::Arabian Nights” Dialect = “MSA” > Figure 5: Story header for one corpus file showing 3-level hierarchy for the genre: Folktales Most genres can be defined via 2-levels, but the example in

(5) Figure 5 shows an Arabian Nights story classified via 3 levels for the Folktales genre: Fiction, Folktale, and Arabian Nights. Figure 6 displays the hierarchy of genres in this current version of the corpus. It is unusual for collectors of corpora to proofread corpus texts, as normally the corpus is used as evidence of real language usage; but corpora aimed at language teaching and learning may be an exception, as we need correct exemplars of the language to be taught (Alfaifi et al 2013; Atwell 1987). Another issue is vowelisation: some texts are fully vowelled (i.e. contain short vowels and other diacritics), while others are not; this may create problems for analysis later (e.g. use of a concordancer). Finally, though we consider genre to be an important feature for inclusion in the corpus, texts are very difficult to categorise. For example, a biography can contain anecdotal stories, or a story with animals as characters may preach a moral lesson. 8. Figure 6: Screenshot from SketchEngine showing simple tree structure for text genres As previously mentioned, our Arabic Children’s Corpus is an ongoing project; but current statistics for version 1.0 are as follows (Figure 7). Text Type Number of Tokens Corpus 1, 877,615 Fiction 711,615 Non Fiction 310, 000 Arabian Nights 806, 000 Goha Stories 50,000 No of stories/documents 2950 Figure 7: Statistics for the Arabic Children’s Corpus (version 1.0) 7. Further Issues in Text Collection and Classification We have encountered a number of issues in text collection and classification during this 3-month project. One issue is provenance data: some texts have no authorship attribution; some sites copy from other sites so the actual source of the text cannot be identified; and it is also difficult to ascertain whether some texts are original or translated. There are also issues to do with written quality of texts: some texts contain spelling mistakes (e.g. two words are connected, or a word is divided into two, or an incorrect letter is used). We have also found missing prepositions, and redundant words added in a sentence. Errors such as these necessitate further proofreading as another aspect of manual corpus collection. Conclusions and Further Work We have compiled the first corpus of Arabic children's texts, defined as texts written or selected especially for children. The data, being collected manually, is of a high quality and covers a wide range of genres. Collecting the corpus presented a major manual challenge. There is no straightforward way of characterising children's texts, and automated methods in WebBootCat (e.g. using seed terms to identify and upload URLs) were not always successful in filtering out unsuitable and/or unwanted material. The platform chosen for corpus formatting and storage is SketchEngine, a market leader enabling export of corpus data in a choice of different formats. We plan to develop and refine this corpus further in the context of a major, applications-based project involving automatic corpus annotation and analysis. Some of our immediate ideas on corpus composition include:  addition of classic texts by well-known Arabic children's writers;  revision and further differentiation of genre categories and hierarchy;  approaching copyright holders if we decide to make this corpus freely available at some point (though it may be impractical to get copyright for each site);  collecting a subcorpus of children's language, following the design of the Oxford Children's Corpus, to enable psycholinguistic and pedagogic research into children's language use. We plan to use our Arabic Children’s Corpus for research on language suitable for use in books for child readers ; for example, we can compare it with other Arabic corpora representing adult language, by extracting vocabulary and formulaic sequences (Alghamdi et al 2016) in texts written for children, and comparing these with vocabulary and formulaic sequences in Arabic texts written for adults. In conclusion, we believe that with further development, our Arabic Children's Corpus will constitute an excellent source of data for a range of Arabic language teaching and linguistics research, and development of reading books for Arab children.

(6) 9. References Abbas, N., Atwell, E. and Al-Sulaiti, L. 2016. ‘An Empirical Evaluation of 10 Arabic Optical Character Recognition Tools’. International Journal of Computational Linguistics (IJCL), 7. 1-14. Alfaifi, A., Atwell E, Abuhakema G. 2013. ‘Error Annotation of the Arabic Learner Corpus: A New Error Tagset’ in: Language Processing and Knowledge in the Web, LNCS Lecture Notes in Computer Science vol. 8105, pp.14-22. Springer. Alghamdi, A., Atwell, E. and Brierley, C. 2016. ‘An empirical study of Arabic formulaic sequence extraction methods’ in Proceedings of LREC Language Resources and Evaluation Conference. Al-Hajji, F.A. 1990. ‘al-Dalil al-Biblioghrafi Likitab at-Tifl al-‘Arabi. Bibliographical Guide to Arab Children’s Books’. Sharjah. Dairatu al-Thakafa wal-‘alam. Al-Hazza, T.C. 2006. ‘Arab Children’s Literature: An Update’. In Book Links. 15.3. 11-12. Al-Hazza, T.C. and Lucking, B. 2012. ‘Celebrating Diversity through Explorations of Arab Children’s Literature’. In Childhood Education. 83.3. 132-135. Atwell, E. 1987. ‘How to detect grammatical errors in a text without parsing it’ in: Maegaard, B., (editor) Proceedings of EACL '87 Third conference on European chapter of the Association for Computational Linguistics, 38-45. Baroni, M., Kilgarriff, A., Pomikálek, J., and Rychlý, P. 2006. ‘WebBootCaT: instant domain-specific corpora to support human translators’. In Proceedings of EAMT European Association for Machine Translation. 247-252. Dimitrova, L. and Koseska-Toszewa, V. 2009. ‘BulgarianPolish Corpus’. In Cognitive Studies. 9. 133-141. Heide, F.P. and Gilliland, J. 1990. The day of Ahmed’s secret. New York. Lothrop, Lee and Shepard Books. Heide, F.P. and Gilliland, J. 1992. Sami and the time of the troubles. New York. Clairon Books. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Vojt ch, K., Michelfeit, J., Rychlý, P. and Suchomel, V. 2014. ‘The Sketch Engine: ten years on’. In Lexicography. 1.1. 7-36. Mauranen, A. 2000. ‘Strange strings in translated language. A study on corpora’. In Olohan, M. (ed.), Intercultural Faultlines: Research Models in Translation Studies. I. Textual and Cognitive Aspects. Manchester: St. Jerome. 119-141. Mdallel, S. 2003. ‘Translating Children’s Literature in the Arab World’. In Meta: Translators’ Journal. 48.1-2. 298306. Peterson, M.A. 2005. ‘The Jinn and the Computer: Consumption and identity in Arabic children’s magazines’. In Childhood. 12.2. 177-200. Puurtinen, T. 2003. ‘Genre-specific Features of Translationese? Linguistic Differences between Translated and Non-translated Finnish Children’s Literature’. In Literary and Linguistic Computing. 18.4. 389-406. Puurtinen, T. 1998. ‘Syntax, Readability and Ideology in Children’s Literature’. In Meta: Translators’ Journal. 43.4. 524-533. Thompson, P. and Sealey, A. 2007. ‘Through children’s eyes?’ In International Journal of Corpus Linguistics. 12.1. 1-23. Shefelman, J. 1992. A peddler’s dream. Austin, Texas. Houghton Mifflin. Stanley, D. 2002. Saladin: Noble prince of Islam. New York. HarperCollins. Wild, K., Kilgarriff, A. and Tugwell, D. 2012. ‘The Oxford Children’s Corpus: Using a Children’s Corpus in Lexicography’. In International Journal of Lexicography. doi: 10.1093/ijl/ecs017. OUP. 1-29. Wodak, R. and Meyer, M. (eds). 2001. Methods of Critical Discourse Analysis. London. SAGE.

(7)

New documents

A perchlorate salt of 1 [(2 di­methyl­amino­ethyl­imino)meth­yl]naphthalen 2 ol
Data collection: SMART Bruker, 1998; cell refinement: SAINT Bruker, 1998; data reduction: SAINT; programs used to solve structure: SHELXS97 Sheldrick, 1997a; programs used to refine
“I just don’t want to get bullied anymore, then I can lead a normal life”; Insights into life as an obese adolescent and their views on obesity treatment
Many studies have confirmed the significant physical impact on health in the short- and long-term,2 particularly in a world which places increasing emphasis on body size as a marker of
1 (4,6 Di­methyl­pyrimidin 2 yl­sulfanyl) 3,3 di­methyl­butan 2 one
Data collection: SMART Bruker, 1998; cell refinement: SAINT Bruker, 1999; data reduction: SAINT; programs used to solve structure: SHELXS97 Sheldrick, 1997; programs used to refine
Redetermination of the Diels–Alder diadduct of 1,4 benzo­quinone and cyclo­penta­diene: 1,4:5,8 di­methano 1,1a,4,4a,5,5a,8,8a octa­hydro­anthracene 9,10 dione
parameters; the range of C—H was 0.964 17–1.023 17 A Data collection: SMART Bruker, 1998; cell refinement: SAINTPlus Bruker, 1998; data reduction: SAINT-Plus; programs used to solve
3 Cyclo­hexyl 4 (3 methyl 3 phenyl­cyclo­butyl) 1,3 thia­zole 2(3H) thione
were fixed at 0.040 A Data collection: COLLECT Bruker–Nonius, 2002; cell refinement: EVALCCD Bruker–Nonius, 2002; data reduction: EVALCCD; programs used to solve structure: SHELXTL
Methyl β N (3 nitro­phenyl­methyl­ene)­di­thio­carb­aza­te
Data collection: PROCESS-AUTO Rigaku, 1998; cell refinement: PROCESS-AUTO; data reduction: CrystalStructure Rigaku/ MSC, 2002; programs used to solve structure: SIR92 Altomare et al.,
Treatment of severe obesity in adolescents; a mixed method approach
To achieve this, there are several research objectives to be met; 1 Better understand the unique characteristics and needs of adolescents as a distinct population and explore how this
2 Bromo­benzo­phenone
Data collection: P3 Siemens, 1989; cell refinement: P3; data reduction: XDISK and XPREP Siemens, 1991; programs used to solve structure: SHELXS97 Sheldrick, 1990; programs used to
The impact of menu label design on visual attention, food choice and recognition: an eye tracking study
The impact of menu label design on visual attention, food choice and recognition: an eye tracking study REALE, Sophie and FLINT, Stuart Available from Sheffield Hallam University
2,4 Di­chloro 6 morpholino 1,3,5 triazine
The cell e.s.d.'s are taken into account individually in the estimation of e.s.d.'s in distances, angles and torsion angles; correlations between e.s.d.'s in cell parameters are only

Related documents

Modern Arabic Calligraphic-Based Logos: The Influence of Traditional Arabic Calligraphy on Modern Arabic Calligraphic-based Logo Designs
Application Thesis Graphic Design MFA Program College of Imaging Arts and Sciences Amer Alkharoubi Spring 2013 School of Design Rochester Institute of Technology
0
0
147
Arabic Courseware Using Constructivism
LIST OF TABLES TABLE TITLE Table 2.1 How instructional design principles can complement the operational capabilities of the computer to ensure learning Table 2.2
0
0
24
The relationship between nonword repetition, root and pattern effects, and vocabulary in Gulf Arabic speaking children
Abbreviations AEVT Arabic Expressive Vocabulary test APVT Arabic Picture Vocabulary Test ASD Autistic Spectrum Disorder CL Children with Language Impairment CNRep The Children’s Test of
0
1
315
Assessing early sociocognitive and language skills in young Saudi children
ABBREVIATIONS APS-RVT Arabic Preschool-Receptive Vocabulary Test ARA-LUI Arabic Research Adaptation of the LUI ASD Autism spectrum disorders BCa Bias corrected and accelerated
0
0
241
Joint Alignment of Segmentation and Labelling for Arabic Morphosyntactic Taggers
We combined four Arabic POS-taggers: MADAMIRA MA, Stanford Tagger ST, AMIRA AM, Farasa FA; and as the target output used two Classical Arabic gold standards: Quranic Arabic Corpus QAC
0
0
13
Web-based Annotation Tool for Inflectional Language Resources
5-8 out of 25-24 morphemes 6 shows the high number of homographs in the Quranic Arabic Corpus a case of highly inflectional language.. 7.5 General Case: Sunnah Arabic Corpus We have
0
0
7
Developing Bilingual Arabic-English Ontologies of Al-Quran
Additionally, the majority of Al-Quran ontologies are part of, or dependent upon, the Quranic Topics dataset QT, the Arabic Quran Corpus AQC, the Ontology of Quranic Concepts OQC, or
0
0
7
Arabic dialects annotation using an online game
• The level of dialectal content – MSA for document written in MSA – Little bit of dialect for document written in MSA but it contains some words of dialect – Mix of MSA and dialect
0
0
6
Classical and modern Arabic corpora: Genre and language change
To illustrate the range of language and linguistic annotation in Classical Arabic corpora, here are some of the Classical Arabic corpus resources developed by the Artificial
0
0
28
Variation in polar interrogative contours within and between Arabic dialects
Code moca tuns egca joka syda irba kwur ombu Dialect Moroccan Arabic Casablanca Tunisian Arabic Tunis Egyptian Arabic Cairo Jordanian Arabic Karak Syrian Arabic Damascus Iraqi Arabic
0
0
6
Sunnah Arabic Corpus: Design and Methodology.
5th International Conference on Islamic Applications in Computer Science And Technology, 26 Ð 28 December 2017, Semarang, Indonesia Sunnah Arabic Corpus: Design and Methodology
0
0
7
Tagging Classical Arabic Text using Available Morphological Analysers and Part of Speech Taggers
Abdulrahman Alosaimy, Eric Atwell Tagging Classical Arabic Text using Available Morphological Analysers and Part of Speech Taggers Abstract Focusing on Classical Arabic, this paper in
0
0
26
Exploring Twitter as a Source of an Arabic Dialect Corpus
CONCLUSION AND FUTURE WORK Most of Arabic corpora are audio recording, so in this paper we explored Twitter as a source of Arabic dialect texts to create written corpus of Arabic
0
0
8
The pervasiveness of coordination in Arabic, with reference to Arabic>English translation
Article about the lack of competence of educated Arabic speakers in Standard Arabic: ‫د‬ ‫ د‬luğatu-na al-jam la Our Beautiful Language, published in the Saudi-owned international
0
0
29
A comparative analysis of the Arabic and English verb systems using a Quranic Arabic corpus
Making a comparison using a parallel corpus such as a Quran corpus which includes several translations in English can help us to understand the differences in meaning and grammar in
0
0
11
Show more