Collocation in Arabic-Thesis


COLLOCATION AND SYNONYMY IN CLASSICAL ARABIC

A CORPUS-BASED STUDY

A thesis submitted to the University of Manchester Institute of Science and Technology (UMIST) for the degree of Doctor of Philosophy

2004

Abdel-Hamid Elewa

Centre for Computational Linguistics

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university,


Acknowledgements

First and foremost, I thank God Almighty, Who teaches man what he does not know. Then, I would like to express my gratitude to my supervisor Dr. Paul Bennett who throughout the years I have spent doing my research showed me an unequivocal perseverance, gave me so much time and enriched my work with his invaluable comments.

I am deeply grateful to Mona Baker, Professor of Translation Studies and Director of CTIS, University of Manchester, who provided me with the first drops of genuine knowledge.

I am also indebted to Paul Johnston, Department of Computation, University of Manchester for all the technical support he gave me and also for the statistical programs he wrote specifically for my research.

During this work I have collaborated with many colleagues for whom I have great regard, and I wish to extend my warmest thanks to all those who have helped me with my work in the Department of Language Engineering, particularly Sattar Izwaini and Amin Almuhanna; through our discussions and commentary on the Arabic language we managed together to raise a lot of interesting points.

I would like also to thank my examiners, Prof. Harold Somers, Dept. of Informatics, University of Manchester, and Dr. James Dickens, Dept. of Middle Eastern Studies, University of Durham, for their criticism and helpful comments that gave my thesis its academic form.

My thanks are also due to my wife, Iman Refaey who helped me in assembling the electronic corpus for use in this research.


5.1 INTRODUCTION
5.2 DEFINITION OF COLLOCATION
5.3 COLLOCATION AND COLLIGATION
5.4 TYPES OF COLLOCATION
5.5 SPANS
5.6 SEMANTIC PROSODY
5.7 EXTRACTION OF COLLOCATION
5.7.1 Using statistics in collocation extraction
5.7.1.1 Lemmatisation
5.7.1.2 Concordances
5.7.1.3 Frequency
5.7.1.4 T-test: a measure of difference

CHAPTER SIX: SYNONYMY: AN OVERVIEW

6.1 INTRODUCTION
6.2 DEFINITION
6.2.1 Synonymy - Four Approaches
6.2.2 Degrees of Synonymy
6.2.2.1 Absolute synonymy
6.2.2.2 Propositional synonymy
6.2.2.3 Near-synonymy
6.3 SYNONYMY IN ARABIC
6.4 THE REPETITION OF SYNONYMS IN ARABIC
6.5 CONCLUSION

CHAPTER SEVEN: COLLOCATIONAL TREATMENT OF SYNONYMY IN ARABIC

7.1 INTRODUCTION
7.2 DATA CHOICE
7.3 DATA ANALYSIS
7.4 A CASE STUDY: THE WORD PAIR JAA’A AND ATA ‘COME’
7.4.1 Summary
7.5 A CASE STUDY: THE WORD PAIR ITHM AND DHANB ‘SIN’
7.5.1 A Few Remarks
7.5.2 Summary
7.6 A CASE STUDY: THE WORD PAIR ḤASIBA AND ẒANNA ‘THINK’
7.6.1 Summary
7.7 A CASE STUDY: THE WORD PAIR ḤBB AND WDD ‘LOVE’

CHAPTER EIGHT: CONCLUSION

APPENDICES

APPENDIX 1: COPYRIGHTS
APPENDIX 2: THE CONTENTS OF THE CAC ARE SUMMARISED IN THE FOLLOWING CHARTS (1) & (2)
APPENDIX 3: GENRES AND TEXTS INCLUDED IN CAC
APPENDIX 4
APPENDIX 5


Abstract

I am concerned in this study with applying the corpus linguistics methodology that concentrates on investigating language use, with particular reference to Classical Arabic. I do not wish to undermine what has been done on the basis of intuition, but the time is now opportune to use modern tools to discover new facets of linguistic behaviour in relation to Classical Arabic and to demonstrate the potential impact of computational methods on Arabic linguistic studies.

One of our main aims will be to demonstrate the usefulness of the corpus methodology in describing Classical Arabic by examining lexical collocations. To do this, I have assembled a Classical Arabic corpus which covers the early period of Islam, because the available Arabic corpora are limited to the present-day form of Classical Arabic, which is called Modern Standard Arabic.

This study is also an attempt at explaining some issues in semantic relations, particularly synonymy, which can be accounted for in terms of collocations by using a computerized concordancer that enables large quantities of text to be searched for all occurrences of a particular lexical item. Through lexical collocational analysis I can compare and contrast the characteristic uses of semantically related words such as synonyms. According to Cruse (1986), two lexical units would be absolute synonyms (i.e. would have identical meanings) if and only if all their contextual relations were identical. Through corpus analysis we can show whether two items are indeed absolute synonyms or not by checking their relations in all available contexts.

By this technique, it is possible to compare seemingly synonymous words and find out whether they are real synonyms or not. I will argue that absolute synonyms do not exist in terms of their collocational patterns. Through collocation we can distinguish one sense of a word from another and know whether a seemingly synonymous pair are real synonyms or not. Collocation is, therefore, a device with which a particular sense of a word is activated.

In order to prove that subtle differences can be brought out by collocation, the collocates for a list of synonymous pairs are analysed. This will be explored through the analysis of these seemingly synonymous Arabic words, aiming to show that many synonyms are partial or incomplete, and none can be called true (absolute) synonyms.


Notes on Transliteration

There are two common ways to represent the Arabic script in the Roman script: transliteration and transcription. The former is based on graphemic mapping and the latter is phonemic. There are some Arabic consonants and vowels which have equivalent letters in the Roman alphabet. These are easy to transliterate or transcribe; it depends on what purpose one has for rendering them in either way.

For the letters which have no Roman equivalent, linguists or Arabic users sometimes adopt a set of symbols which are mainly transcriptions. Such a process yields a mixed system of transliteration and transcription. ‘This leaves plenty of scope for scholarly debate, with the result that there are now many supposedly international standards’ (Whitaker, 2002). Among the most common systems are the one adopted by the International Convention of Orientalist Scholars in 1936, the British Standard, BS 4280, the US Library of Congress and the American Library Association. The latter have issued “Romanisation tables” for more than 150 non-Roman written languages and dialects including Arabic (ibid).

One of the reasons given by Whitaker (ibid) for the inefficiency of these Romanisation systems is that they are not easy to key due to the sophisticated figures they use like dots, lines and other marks.

For a practical reason, I tried to use a transliteration system which makes the utmost use of the English alphabet. This is dependent to a great extent on the one adopted by the US Library of Congress with some modifications as shown below:


Arabic Transliteration Chart

Name of letter    Arabic letter shape    Symbol in Transliteration
hamza             ء                      ‘
ba:               ب                      b
ta:               ت                      t
θa:               ث                      th
ji:m              ج                      j
ha:               ح                      ḥ
xa:               خ                      kh
da:               د                      d
dha:l             ذ                      dh
ra:               ر                      r
za:               ز                      z
si:n              س                      s
shi:n             ش                      sh
sa:d              ص                      ṣ
da:d              ض                      ḍ
ta:               ط                      ṭ
ẓa:               ظ                      ẓ
ᶜayn              ع                      ᶜ
ġayn              غ                      gh
fa:               ف                      f
qa:f              ق                      q
ka:f              ك                      k
la:m              ل                      l
mi:m              م                      m
nu:n              ن                      n
ha:               ه                      h
wa:w              و                      w
ya:               ي                      y

Such a chart is easy to use because it is familiar to both Arabic and English speakers. For Arabic consonants that do not have equivalents in English we used the most common system. This applies to two types of sounds: emphatic and pharyngeal. For the former we put a dot under the symbol to show emphasis and for the latter we used two symbols (ᶜ and ‘). This makes it difficult to represent the doubling of consonants, as in dhdh or khkh. We would rather ignore doubling with such consonants. This is much easier than struggling with new symbols. The Arabic definite article al ‘the’, which sometimes takes another form when assimilated to the following sound, is represented as it is, without showing any sort of assimilation. The long vowels are marked by doubling the short vowel to avoid putting more figures on the symbols, except for Proper Nouns which are commonly used among Arabs and Arabists.
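The consonant chart and the conventions just described amount to a simple lookup table. The following is a minimal sketch in Python, assuming that the emphatics and ح are written with a dot under the letter and that ع is written as a raised ᶜ, as described above. The names `CHART` and `transliterate` are hypothetical, and vowels, long-vowel doubling and the definite article are deliberately not modelled, since the unvocalised script does not record short vowels.

```python
# Consonant symbols taken from the transliteration chart above.
# Assumption: a dotted letter stands for an emphatic (or for ح), and ᶜ stands for ع.
CHART = {
    "ء": "‘", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "ḥ",
    "خ": "kh", "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s",
    "ش": "sh", "ص": "ṣ", "ض": "ḍ", "ط": "ṭ", "ظ": "ẓ", "ع": "ᶜ",
    "غ": "gh", "ف": "f", "ق": "q", "ك": "k", "ل": "l", "م": "m",
    "ن": "n", "ه": "h", "و": "w", "ي": "y",
}


def transliterate(word: str) -> str:
    """Replace each charted consonant by its symbol; leave anything else unchanged."""
    return "".join(CHART.get(ch, ch) for ch in word)


if __name__ == "__main__":
    # Only the consonant skeleton is produced; vowels must be supplied by the reader.
    print(transliterate("كتب"))
```

Because each letter is mapped independently, a doubled consonant would have to be handled by a separate rule, which is consistent with the decision above to ignore doubling for the two-letter symbols.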


Chapter One: Introduction

1.1 The Rationale Behind the Study

A general motivation for many recent linguistic studies has been the desire to automate some descriptive processes and to employ scientific observation in the study of language.

Linguistic studies in Arabic were first introduced and established by Al-Khalil, who was the first lexicographer to give lexical order in the collection of his dictionary (cf. Haywood 1965), and his outstanding pupil, Sibawayh in the late 8th century. What Al-Khalil and Sibawayh did was to investigate language use to formulate rules and describe linguistic devices.

Although Arab lexicographers were the first to integrate corpus-analysis into the dictionary-making process, with Al-Khalil’s manual corpus discussed below in Chapter Two, a corpus-based approach is certainly not used in contemporary lexicography in the Arab world. The mainstream lexicography is undoubtedly intuition-based.

Employing modern technology in investigating language use should enable us to research more aspects of linguistic behaviour, in more detail. We can investigate how people exploit the resources of their language and how they use it to achieve their communicative goals.

1.2 Goals

The current study will provide the resources for accurate descriptions of the way words co-occur in Classical Arabic. For that purpose, the major activity of the study has been the assembly and analysis of a corpus comprising samples of different types of written Arabic: biography, religion, poetry, etc.

With this in mind, I decided to work toward the compilation of a comprehensive corpus of written Classical Arabic in order to facilitate research in a range of disciplines concerned with Arabic and with the general methodology of Corpus Linguistics. I would like to emphasise that the Classical Arabic Corpus will be available for any potential user for her or his needs.


1.3 Corpus-driven or Corpus-based

Two approaches can be at play when working with corpora: corpus-based and corpus-driven (Tognini Bonelli, 2000). When a linguist describing a language with this methodology observes a phenomenon without prior knowledge of the validity of a particular theory, i.e. when he/she finds something unexpected, the approach is called corpus-driven. For instance, the subtle differences that occur between synonymous pairs and the semantic features extracted for every word that distinguish it from another (as shown in Chapter Seven) are neither obvious from casual observation nor available in the literature I have examined. On the other hand, when we use corpus linguistics methodology to support or invalidate an existing hypothesis or theory, the approach is called corpus-based. For example, in Chapter Five we test a collocation assumed to be fixed and find out that it is not a collocation at all.

1.4 Lexical Collocation

Lexical collocation has become a popular topic in linguistic research. The phenomenon gained this currency after computational corpus-based methodology had been adopted as an accurate and effective way of text analysis.¹

Collocation was recognised early by Arab linguists, but the phenomenon was just referred to between the lines and did not get an extensive study. Al-Sakkaki, for example, in Miftah al-ᶜUlum defined it as ‘likull kalimah maᶜa ṣaaḥibatiha maqaam’ (every word has with its companion a position [lit. trans.]). This roughly means that every word has a different sense with a different adjacent word. Emery (1988: 51) regards this quotation as equivalent to Firth’s (1957: 179) definition of collocation, which is the company that a word keeps. He also considers the classification of Thaᶜalibi’s lexicon, Fiqh Al-Lughah², as showing his awareness of how significant collocational relations are.

Linguistic units can be combined with each other phonologically, morphologically, syntactically, lexically, or semantically. We are only concerned with combinations on the lexical level. This is what is traditionally called collocation³. In this sense, ‘collocation is restricted to idiosyncratic relationships between words’ (Wouden, 1997: 24).

1 Corpus-based methodology has been widely used for other linguistic fields (Biber, 1998, Meyer, 2002).

2 This lexicon, which was written ten centuries ago, classifies the types of actions with their specific doers and the types of words with their specific predicates. So, it can be considered like Benson’s (1997) work on collocation, The BBI Dictionary of English Word Combinations.

1.5 Synonymy

One of the main goals of this study is to check the synonymy or non-synonymy of a given pair of items. We will use the corpus-based analysis and the computer technology that can help us identify easily the relative frequency of words, whether throughout the whole corpus or in a particular genre. Subsequently, we can explore the collocates of words and further isolate the various meanings, or senses, a word has. This is especially interesting for words which are considered synonyms, since an investigation may reveal differences in syntactic and/or stylistic distribution. Such research might show that near synonymous words or structures are used in different ways.
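The procedure outlined above (relative frequency first, then the collocates of each member of a pair) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus is taken to be an already tokenised list of word forms, the span of four words on either side is an arbitrary choice, and all function names are hypothetical. It is not the concordancing software used in this study.

```python
from collections import Counter


def relative_frequency(tokens, word):
    """Occurrences of `word` divided by the number of running words."""
    return tokens.count(word) / len(tokens) if tokens else 0.0


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on either side of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


def compare_pair(tokens, a, b, span=4):
    """Return (shared collocates, collocates only of a, collocates only of b)."""
    ca = set(collocates(tokens, a, span))
    cb = set(collocates(tokens, b, span))
    return ca & cb, ca - cb, cb - ca


if __name__ == "__main__":
    # Invented toy tokens, not corpus data.
    toy = ["a", "p", "q", "b", "p", "r"]
    print(relative_frequency(toy, "p"))
    print(compare_pair(toy, "a", "b", span=1))
```

Under Cruse's criterion, a pair would only qualify as absolute synonyms if the two "only of" sets stayed empty across all available contexts; raw set comparison ignores frequency, which is why statistical measures are normally added on top of such counts.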

Synonymy is understood as a gradual cline along which we may locate different degrees of synonymy: near, cognitive and absolute. However, there is a widely held opinion among semanticists that strict or absolute synonymy is rare in human languages (see Cruse: 1986). A further step is taken here in this study to demonstrate that absolute synonymy does not exist in Arabic. The study will argue that Arabic never has two words that mean nearly the same thing and are used in the same range of grammatical and lexical patterns.

Chapter Two discusses the scope of Arabic linguistics and pinpoints some technical problems in digitising Arabic. Chapter Three gives a brief account of the methodology of corpus linguistics and surveys its historical background. Chapter Four describes the corpus compiled especially for this study and gives an account of the tools used for analysis. Chapter Five discusses lexical collocations with a particular emphasis on Arabic. Chapter Six addresses the concept of synonymy in English and Arabic. Chapter Seven tries to find differences between seemingly synonymous word pairs by studying their collocations and suggests that applying corpus linguistics methodology to Arabic can help us become aware of lexical matters. Chapter Eight is dedicated to findings and conclusions.


Chapter Two: Some Aspects of the Arabic Language

2.1 Introduction

The Arabic language originated in Arabia in pre-Islamic times, and spread rapidly across the Middle East. Today it is spoken as an official language by almost 200 million people, Muslims and Christians, in more than twenty-two countries, from Morocco in Africa to Iraq in Asia, and as far south as Somalia and Sudan. As the language of the Qur’an, the Holy book of Islam, it is to some extent familiar throughout the Muslim world, rather as Latin was in the lands of the Roman church. It is taught as a first language in all Arab countries and as a second language in non-Arab Muslim states. It is the liturgical language of about one billion Muslims. In addition, Modern Standard Arabic is the lingua franca used and respected by educated Muslims throughout the entire world.

2.2 The Status of Arabic

Arabic⁴ is the oldest language which is still used for communication and culture in the Arab world. There are many varieties of Arabic: Classical Arabic (CA), Modern Standard Arabic (MSA), and colloquial Arabic, which differs from country to country. Classical written Arabic, however old, has changed little over the centuries. Classical Arabic is still employed today as the written language, but it is restricted to formal usage as a spoken tongue. It differs considerably from its descendant, the modern colloquial Arabic that is the medium of general conversation. Modern Standard Arabic is the variety of Arabic which is essentially a continuation of Classical Arabic as it was passed down to us throughout the ages and which is partly a modernised form of expression of contemporary ideas, concepts, science and technology.

Although it is widely used throughout the Arab world, with different vernaculars, in everyday language, language of communication and entertainment, Modern Standard Arabic is still adopted as the formal language of the press, writing and speeches. Because the Qur’an was revealed in Arabic, most Arabs think that this language must be perpetuated and kept alive (Haeri, 2003). They always emphasise that Classical Arabic, as a living language, should be used in formal written and spoken language. Bakalla (1983) argued that a ‘living’ language is by definition the language acquired by children at an early age, and this is not the case with Classical Arabic. However, the general desire among educated Arabs is to write and read literary works, Islamic and general books in an elegant language, and nothing can be more beautiful than Classical Arabic. ‘In that sense Classical Arabic is [a] ‘living’ language, but it is not a ‘living’ in the sense of colloquial’ (Bakalla 1983: xvii).

4 The term ‘Arabic’ is applied to a number of speech-forms which, in spite of many and sometimes substantial differences, are reckoned as dialectal varieties of a single language. The term Classical Arabic is sometimes used as a synonym of Standard Arabic. However, I will use the former to refer to the early Classical Arabic which extends over the first four centuries of Islam, i.e. until the early eleventh century, whereas the latter is used to refer to the modern Classical Arabic. These two varieties are sometimes interchangeable; they can be used in formal situations such as schools, universities, textbooks, lectures (whether religious or academic), mass-media and personal writing as in letters and autobiography.

2.3 Factors in the Survival of Classical Arabic

One of the main characteristics of language is change. If a language does not change through time, it is likely to become obsolete, or extinct in terms of its usage. This could make one wonder how Classical Arabic has been preserved over so many centuries. The obvious connection between the Holy Qur’an and the language in which it was revealed to Prophet Muhammad explains the preservation of this language. Below we will give three reasons that made the Classical Arabic language survive throughout the past centuries.

1. Belief in its divinity

Most Arab grammarians and theologians regarded Arabic as a divine language. Explaining Allah’s saying, “And He taught Adam all the names (of everything)” (Qur’an: Sura 2, 31, trans. by Mohamed Khan), Ibn Abbas [a well-known exegete of the Qur’an] said, ‘Allah taught him all common names [i.e. all generic nouns] such as animal, earth, valley, mountain, donkey etc.’ (Ibn Faris, ṣaḥibi: 33).

This is an important question in linguistic study because if we believed that Arabic is God-given, we would stick to the Qur’anic language and the expressions used by the ancient Arabs and the early Muslims. Ibn Faris (ṣaḥibi, p. 17) said, ‘We are not entitled to-day to innovate, to use expressions which they did not use, or to develop analogies which they did not know; for this would mean corrupting the language and annihilating its essence.’


Unlike the case of English and other languages, there was no detailed discussion in Arabic literature concerning the origin of speech. Arab linguists did not concern themselves with this question because, owing to the aforementioned Qur’anic verses, they thought that Arabic was revealed by Allah. This question was considered theological rather than linguistic. Even those who thought that Arabic was not revealed by Allah gave up investigating this question since there was no conclusive evidence for either position. Most grammarians, however, regarded Arabic as a God-given language. Therefore, Arabs had to stick to the usage of their predecessors to whom the Qur’an was revealed. All they could do was to describe this usage for Arab and non-Arab people in order to stick to the genuine Arabic, the language of the Qur’an. As a point of departure, we can realise how Islam influenced the study of language. Arabic itself was very limited before the advent of Islam in terms of use by a large number of people. The introduction of Arabic grammar was motivated by Islamic incentives to protect the language from being corrupted by converts.

2. Belief in its Supremacy

As a God-given language, Arabs believe that Arabic is the most perfect, the noblest, the clearest and the richest language. In the introduction of his Lisan Al-Arab, Ibn Manzur says, “Allah made the Arabic language superior to all other languages and enhanced it further by revealing the Qur’an through it and by making it the language of the people of Paradise. The Prophet was reported to have said, ‘I am an Arab; the Qur’an is Arabic; and the language of the people of Paradise is Arabic.’” This is why Arabs believe in the supremacy of Arabic as a God-given language.

Arabic is of supreme importance for all Muslims and for those who are interested in the study of the Orient; for the former it is their religious language, which contains the Qur’an, the Prophetic traditions and the early Muslim works, and for the latter it is the medium of Arabic culture.


translate the word sayf (sword), for example, into Persian we would have only one word as equivalent. In Arabic, we can have many words for ‘sayf’, each with a specific connotation. To most Arabs, Arabic has a magical effect on their souls. Hitti (1958: 90) said,

No people in the world, perhaps, manifest such enthusiastic admiration for literary expression and are so moved by the word, spoken or written, as the Arabs. Hardly any language seems capable of exercising over the minds of its users such irresistible influence as Arabic. Modern audiences in Baghdad, Damascus and Cairo can be stirred to the highest degree by the recital of poems, only vaguely comprehended, and by the delivery of orations in the classical tongue, though it be only partially understood. The rhythm, the rhyme, the music produce on them the effect of what they call ‘lawful magic’ (siḥr ḥalaal).

3. It has a long standing and genuine linguistic heritage

After the expansion of the Muslim Empire and the increase in the number of foreign people who embraced Islam, Arabic became corrupted in the course of being used by the new converts. Those new converts made mistakes when reading the Qur’an. Muslim scholars began to fear lest the language become completely corrupted. They had to put an end to such a situation to protect the Holy Qur’an. On the one hand, they wanted to preserve their language from the distortion and the solecism introduced by non-Arabic speakers and, on the other hand, to teach those converts Arabic to help them perform their Islamic rituals properly, since prayers can only be performed in Arabic. Thus, the main motivation for the introduction of Arabic descriptive models was to preserve the knowledge of Classical Arabic.

There is no consensus among Arab or foreign linguists with regard to who is the founder of Arabic grammar. Some argued that Ali (the fourth Caliph) is the true founder of Arabic grammar as a science. He gave the first glimpse by dividing the word classes into a ‘noun’, a ‘verb’ or a ‘particle’; others said that Abu Al-Aswad Ad-Du’ali was the first to write a treatise of Arabic grammar, on the basis of what Ali or Ziyad Ibn Abihi, who was the governor of Iraq at that time, supposedly told him.

Although people differ as to who introduced Arabic grammar, they are unanimous in asserting that it was introduced to preserve the language of the Qur’an. Al-Anbari (Nuzhat: 11) concluded that the first founder of grammar was Ali ibn Abi Talib, because all stories referred to him and Abu al-Aswad referred to Ali ibn Abi Talib. Abu al-Aswad himself admitted that he learned grammar from Ali ibn Abi Talib.

The first written treatises in Arabic grammar appeared at the end of the eighth century when Al-Khalil ibn Ahmad and his outstanding pupil Sibawayh wrote their influential and pioneering books describing the Arabic language. The former wrote his dictionary of Arabic Al-ᶜAyn and the latter wrote his grammatical description of Arabic.

The science introduced by Abu al-Aswad dealt with all branches of modern linguistics as a whole. There was no separation among the different fields of linguistics as in the modern time. Many of the early Arab scholars had the ability to write in all branches of linguistics. For example, Sibawayh’s Kitab, dealt with phonetics, syntax, morphology and phonology. Moreover, Al-Zamakhshari had outstanding works in the field of syntax and lexicography, in addition to his pioneering work in the exegesis of the Qur’an.

2.4 The Development of Arabic Linguistics

It is well known that Arabic linguistics emerged in the seventh century for a religious motivation: to preserve the language of the Holy Qur’an from the mistakes made by the new foreign converts. Some modern linguists assumed that the beginning of Arabic linguistics was influenced by Indian or Greek linguistics, but there is no concrete evidence for such a theory. The science was founded before the beginning of the great movement of translation from other languages into Arabic in the Umayyad and Abbasid eras. Therefore, Arabic linguistics was introduced by Arabs since Ali Ibn Abi Talib, the true founder of Arabic linguistics, had no contact with Indian or Greek culture at that time.


The golden age of Arabic linguistics was between the eighth and the eleventh century. Chejne (1969: 170) notes that “in the 12th and 13th centuries Arabic was looked upon with admiration by the West, in the same manner the Arab of today looks at the more developed Western languages.”

Owens (1998, ch. 9) argued that Arabic linguistics reached its highest methodology and its most sophisticated level with Jurjani (d. 1078). There are many contributions made by later linguists until the end of the eleventh century, but they were mainly interested in reworking what had been done by their predecessors.

Little contribution has been made in the past millennium. Linguists throughout this period used only to remodel or to add relatively slight changes to what has been done in the early ages of Islam. However this little contribution, based on the same corpus used by their predecessors, was still within the general framework introduced by the early linguists as ‘...the major preoccupation of grammarians… (after 1077)… was to find ever new ways of saying the same thing’ (Carter 1985a: p. 270, quoted by Owens, 1988: p. 8). In other words, ‘Sibawayh had, in fact, laid down the basic rules and methods of grammar, while the later grammarians’ contribution consisted only in expounding his theory in a more explicit and systematic form, or in finding new applications for it’ (Bohas, Guillaume and Kouloughli: 1990, p.5). They were mainly concerned with codifying and preserving the literature of their predecessors.

2.4.1 Recent Contributions to Arabic Linguistics

There is still something to be done in the study of the Arabic language, especially with the introduction of scientific approaches and modern technology in the field of linguistic investigation. The early Arab linguists felt that their contribution was not enough. Al-Khalil ibn Ahmad for example said, “If someone has in mind another cause for grammar than the one I mentioned, let him come forth with it!” (Al-Iiḍaaḥ, p. 66, quoted in Versteegh 1997: 74).

In the early 20th century the current trend was to rely totally on what has been formulated in verifying and editing the grammatical manuscripts left by the Arab grammarians. On the other hand, it tries to explain and interpret such work in modern linguistic terms.

During the last four decades the study of Arabic language has increased dramatically. The current tendency has been to enrich Arabic with modern theories of linguistics through comparative or applied linguistic studies. There are two main features which characterise modern Arabic linguistics of the last decades. First, the tendency towards the application of linguistic theories and methodologies, especially to the teaching of Arabic as a first language. Secondly, the use of modern techniques in linguistic research, as in computational linguistics and corpus linguistics.

Much of the work in this field was done in thesis or dissertation form, both in the universities of the Arab world and abroad. Very few of these studies have been published. Straley (1989) listed the dissertations done in the American universities in the field of Arabic linguistics from 1967 to 1987 in an annotated bibliography. He noticed that these dissertations, in general, cover a wide variety of topics: phonology, grammar, comparative linguistics, language planning, sociolinguistics and pedagogy. Bakalla (1983: p. xxxvii) pointed out that much of the work on Arabic linguistics ‘has been influenced by developments within linguistic theory and that many studies have been formed in, and reflect, contemporaneous theory’.

There are also indications of the same interest in engaging with developments in linguistic theory, a very dominant paradigm in all branches of science. These include the establishment of some Arabic teaching centres in the Arab world and abroad and the appearance of some periodicals and journals interested in Arabic linguistics, like the Journal of Arabic and Islamic Studies (JAIS), Journal of Arabic Linguistics (in Germany), Arabica and Al-ᶜArabiyya (Arabic). Moreover, a number of the big universities all over the world are now engaged in organising conferences, workshops and seminars devoted to Arabic linguistics for many purposes: scientific, commercial, or others.


With the introduction of computational techniques into the field of linguistics in the USA and Europe, a corresponding interest in the use of computers to investigate the Arabic language grew, as was also the case for theoretical linguistics. Academic centres, companies and conferences specialised in Natural Language Processing flourished in the Arab countries and abroad⁵. Research in this domain is currently under development.

2.5 Some Features of Arabic Grammar

So far I have briefly outlined some aspects of the status and development of Arabic, Classical Arabic in particular, in order to acquaint the reader with the variety I am going to use in this study. To pursue the notion, I will illustrate the main features of Arabic grammar to help those who are to construct a computational system for Arabic know what kind of complexities they may face. More importantly, this section serves as an introduction to the problems encountered when attempting to search the Arabic texts by lemmas. Below are some of these features:

1. Unlike English, Arabic is written from right to left.

2. Arabic script has twenty-eight letters representing the consonants in addition to three long vowels; the shape of each letter depends on what position it occurs in a word: initial, middle, or final.

3. Arabic short vowels are written in a diacritical form, under or above the preceding consonant. ‘For technical reasons the diacritisation is impossible when using the computer. This results in compound cases of morphological-lexical and morphological-syntactical ambiguities’ (Khalid et al 1974: 29). This has been sorted out recently with programs that can handle all diacritics in Arabic (cf. 4.2.1).

4. Arabic, like Latin, is a synthetic (inflectional) language. English, on the other hand, is non-synthetic. Arabic has three cases: nominative, accusative and genitive. The use of cases in Arabic is complicated by the fact that they are mainly represented by short vowels and the Arabic script only allows the writer to show consonants and long vowels. Diacritics which are traditionally used for case endings are computationally problematic.

5 The Institute for the Languages & Cultures of the Middle East, University of Nijmegen, currently focuses on Arabic Natural Language Processing. It recently produced an Arabic/Dutch dictionary based on a large Arabic corpus. Some companies, such as Sakhr (based in Egypt), are also developing computational solutions for Arabic, and there are conferences worldwide which specialise in Arabic.


5. Arabic words are formed from roots, based on fixed morphological patterns, where vowels, suffixes, prefixes, or infixes can be added to form new words. Once we know these patterns, it is easy to form any possible word without making mistakes. More interestingly, we can add to the base form other linguistic units marking person, tense, mood, participle, case, and verbal noun. English words, on the other hand, are generated from stems. Therefore, the key word for searching the traditional lexicon in Arabic is the root,⁶ whereas in English it is the stem (the basic word form).

6. As Arabic is a synthetic language, it allows pronouns to combine with words forming one single word. Such personal pronouns can be suffixed to nouns, verbs or particles. We may form an Arabic word representing a whole sentence. Consider the following word in (1) below.

(1) ضربوك ḍarabuuka (they hit you).

This property raises another problem for analysing Arabic computationally. When searching for a word in an electronic text, we have to search for every possible form of this word. This is because, if we look for the stem of the word, as in English, we will retrieve a huge number of unwanted results. In Arabic, adding characters to a string can yield words that belong to different roots. For example, a search for ᶜam (year) can also return ᶜamer (populated), naᶜam (ostrich) and ᶜamel (worker), which are derived from different roots. Searching for all the forms of each word, by contrast, can give a good result even in a simple word search program which is not trained on Arabic idiosyncrasies, and the result will not need laborious hand-editing.

7. Word order in Arabic is more flexible than in English. There are two types of word order in Arabic: VSO and SVO.
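The search problem described in point 6 can be illustrated with a minimal sketch. It assumes a toy token list in the transliteration used above, with a plain "c" standing in for the superscript c; the function names are hypothetical and are not taken from any tool used in this study.

```python
# Toy token list in the thesis's transliteration; plain "c" stands in for
# the superscript c (an assumption made to keep the strings ASCII).
words = ["cam", "camer", "nacam", "camel", "cam"]


def substring_search(term, tokens):
    """Naive search: return every token that contains the search string."""
    return [t for t in tokens if term in t]


def form_search(forms, tokens):
    """Return only the tokens that exactly match one of the listed forms."""
    wanted = set(forms)
    return [t for t in tokens if t in wanted]


# Searching by string retrieves words that belong to other roots.
naive_hits = substring_search("cam", words)
# Searching by an explicit list of forms retrieves only the intended word.
exact_hits = form_search(["cam"], words)

print(naive_hits)
print(exact_hits)
```

In this sketch the naive search returns all five tokens, whereas the form-based search returns only the two occurrences of the intended word. A real search would need the full list of inflected forms, which is the laborious part.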

6 By the word ‘root’ I mean the three or four nuclear consonantal letters from which we can generate all possible word forms in Arabic by adding suffixes, prefixes or infixes.


Chapter Three: Corpus Linguistics

3.1 Introduction

Corpus is a Latin word which means ‘body’; hence any collection of texts, linguistic or non-linguistic, can be called a corpus, such as the Corpus Juris Civilis, which was a collection of early Roman laws and legal principles compiled in the sixth century, and the corpus Manuscript of Chaucer (1400), which included Chaucer’s works. In 1731 Alexander Cruden used the Bible (King James Version) as a corpus to show that the Bible is consistent (Kennedy 1998: 14). In modern linguistic terms, a corpus is a designed collection of written data, spoken data, or a mixture of both, which can be used for linguistic investigation. In this sense, not every collection of texts can be called a corpus, since there is a big difference between a corpus and a text database; the former has to be ‘a systematic, planned, and structured compilation of text’ (ibid: 4).

Linguists throughout the history of linguistic research used to rely on textual resources as a source of evidence, at least, to prove the correctness of their theories about language. ‘It is obvious that if someone sets about writing a grammar of English, he must have a suitable body of material from which he is to elicit his rules, whether they be purely descriptive, or, as is more common, prescriptive or even pedagogical. These bodies of material may be considered corpora, with some extension of the term’ (Francis 1992: 28).

The study of language in general, whether in the context of modern linguistics or in the context of earlier linguistic studies has also been largely based on empirical research. This empirical approach to language is basically dominated by the observation of naturally occurring data, as linguists tended to gather evidence for the grammaticality of a given word or a sentence. This is partly what corpus linguistics deals with. However, corpus linguistics goes beyond the use of corpora as a source of evidence in linguistic description. ‘Corpus linguistics, like all linguistics, is concerned primarily with the description and explanation of the nature, structure and use of language and languages and with particular matters such as language acquisition, variation and change’ (Kennedy 1998: 8).


Nowadays, two main objectives can be met via corpus collection: linguistic investigation and language processing. As Souter and Atwell (1993: i-ii) explained,

Two primary research applications of corpora can be identified. On the one hand, linguists hope to exploit computer technology to explore linguistic data for the purpose of identifying linguistic trends and developing new theories. On the other, computer scientists and practitioners of artificial intelligence hope to use the linguistic information (including frequencies) present in and derivable from machine-readable corpora to develop software tools and systems for the automatic analysis, understanding and generation of natural languages like English. In some cases, of course, they will also employ the frameworks developed by the linguists, but this is by no means always the case.

3.2 Intuition vs. Empiricism

A general motivation for many linguistic studies before the 1950s was the desire to deal with linguistics on the basis of a positivist and behaviourist view of science. Linguists like Harris and Hill regarded the corpus as the ‘primary explicandum of linguistics’. For such linguists, the corpus was sufficient for this approach, whereas intuition could, if need be, be used as a secondary source (Leech 1991: 8).

With the advent of Chomskyan theories in the 1950s, less emphasis was placed on empirical observations. With the authority of his works, Chomsky directed linguistics away from empiricism and the study of language use towards rationalism for many years. Following de Saussure, he made a distinction between two approaches to looking at language: a theory of the language system and a theory of language use. These two approaches are characterised (1965) as competence and performance.⁷ Chomsky, rejecting the corpus linguistics approach, argued that:

Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [of language] would be no more than a mere list.

(Chomsky, 1962, quoted in Leech 1991: 8)

7 Competence can be defined as ‘the speaker-hearer’s knowledge of his language’ whereas performance is ‘the actual use of language in concrete situations’ (Chomsky 1965: 4). Competence both explains and characterises one’s internalised knowledge of a language. The only way to investigate competence is through introspection.

In the course of invalidating corpus-based studies, he gave a lecture at the Linguistic Society of America Summer Institute in 1964, in which he rejected any kind of quantitative (statistical) data. To prove his argument, he gave the examples in (1a) and (1b) below:

1a. I live in New York.
1b. I live in Dayton, Ohio.

Sentence (1a) above is likely to occur more frequently, just for demographic reasons! Following Chomsky, Horrocks (1987: 13-14) argues that although performance is the only available evidence to the linguist, it is not a transparent reflection of competence. He (ibid: 16) expounded that an observationally adequate grammar cannot simply list all the well-formed sentences of a given language. This is because our mind has a finite storage capacity, whereas the sentences we can produce are infinite in number. Only by positing competence can we account for a finite system with the capacity to define the membership of an infinite set. Therefore, Chomsky suggested that ‘the corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance’ (McEnery and Wilson, 1996: 5).

Horrocks (1987: 16-17) further argued that relying on a corpus to derive grammatical rules will lead to rules with predictive power, which can generate strings not available in the corpus itself. However, we can only test the validity of such strings by referring to the intuition of a native speaker.

In fact, the approaches based on Chomsky’s theories, which were considered mainstream in linguistics, do not cope with vast areas in language study, most notably register variation, where probability plays a major role in selecting certain combinations of meaning with certain frequencies. However, the bitter criticism of corpus data arising from the tradition which Chomsky established has led corpus linguists to remedy the drawbacks of corpus data, such as lack of balance and representativeness. To pursue the premise, I would suggest, following Francis (1992), that if someone sets about writing a grammar of a given language, he must have a corpus from which he is to derive his rules. Hence, the grammatical rules are derived by analysis and generalisation of a corpus.

Makkai (1987) considers the total reliance on intuition a serious disease affecting modern linguistics, which he called textphobia and which needs radical surgery. A useful cure for this disease, he proposes, is reading Malinowski, Firth and Halliday.

It is worth stressing that eliminating observation from the study of language was fervently criticised by linguists even before Chomsky. Criticising de Saussure’s approach, Malinowski in 1936 suggested overlooking the question of langue and parole and paying more attention to the living speech in a context of situation, which is the main object of linguistic study (Roulet, 1975: 78).

Firth (1957) also discredited the introspection of the native speaker as a reliable source of data. He observed that the language we produce is governed to a large extent by particular conventions (social, situational, etc.).

Sinclair (1991) also criticised the reliance on intuitive data, especially in the field of word meaning, lexis. He argued that ‘we may see formal patterns being used overtly as criteria for analysing meaning, which is a more secure and less eccentric position for a discipline which aspires to scientific seriousness’ (Sinclair, 1991: 6-7).

Instead of treating corpus-based and intuition-based linguistics as two contradictory disciplines, we would rather make use of both of them in a more interactive way. Fillmore (1992) argued that the two approaches can interface with and complement each other, since a corpus, however large, is inadequate to cover all aspects of language. On the other hand, a corpus, however small, can pinpoint interesting facts. He emphasised the role of the native speaker’s introspective judgement as a subsequent step.


theoretical model put forward through intuition or to investigate a language with an emphasis on what is typical in this language or what is called norms of use.

3.3 Historical Survey

We have to bear in mind that the manual collection of textual resources was the regular means before the invention of computers. With the introduction of the computer into the field, the interest in corpora has grown and continues to increase. This is because the accurate manipulation of large corpora is quite hard without the use of computer techniques. The computer made the process easier and more reliable. Thus we can distinguish between two stages of corpus collection: pre-computational and computational corpus linguistics.

3.3.1 Pre-computational Corpus Linguistics

The definition of a corpus as a designed collection of texts for linguistic investigation subsumes all early corpora compiled in this respect. However, most studies of corpus linguistics are mainly focused on English, although corpora in this sense are deeply rooted in the history of linguistics, as most of the great civilizations have long traditions of the study of language. For instance, Panini’s grammar of Sanskrit, Thrax’s grammar of Greek and early Arab linguistics were definitely based on textual resources. However, apart from Arabic, we do not know exactly what form of corpus they used, since none of them has left an account of the methodology used.

The early Arab linguists relied mainly on three sources of linguistic data to describe their language: the Holy Qur’an, poetry and nomad proverbs. This is obvious in their use of quotations from these sources as linguistic evidence. Such quotations were certainly taken from a corpus they designed for their inquiry about language. They have postulated certain selection criteria for designing such a corpus. Versteegh explained, ‘on the one hand, the corpus used by the grammarians was closed, being limited to the text of the Qur’an and the pre-Islamic poetry, but on the other hand, the grammarians upheld the fiction of native speakers whose judgement could be trusted’ (1997: 42).

They made it as representative as possible. Ditters (1990: 130) described this corpus as consisting of specific media, registers, genres, styles and varied topics including poetry and prose. He (ibid: 133) pointed out the way early Arab grammarians employed the corpus they assembled:

Originally corpus-information constituted the basis for a grammar of the Arabic language, but instead of the grammar being tested out again and again on corpus-data in a cyclic process as is the case in modern corpus linguistics, this grammar became the norm for language use.

As for English language corpora, Francis (1992) gives a full description of English pre-computer corpora. He divided these corpora into three types: lexicographical, dialectological and grammatical. But he pinpointed some drawbacks in these collections: (1) the editors of lexicographical collections, the Oxford English Dictionary and Webster’s Dictionary in particular, encountered a big problem, as they did not have enough citations for function words and simple words such as prepositions, articles and pronouns; (2) the major difficulty with collections assembled for grammatical investigation is that ‘they are inevitably skewed in the direction of the unusual and interesting constructions that the readers encounter, at the expense of the normal core of the language’ (Francis 1992: 28). Commenting on this, Johansson (1995) suggested, ‘the natural solution to this problem is to collect texts in a systematic manner and subject them to the principle of “total accountability”’ (Johansson, 1995: 244).

Quirk, in an attempt to avoid the shortcomings of the other corpora, collected a more representative corpus (spoken and written), taken from a wide range of genres, as a basis for describing English grammar. Therefore, his Survey of English Usage (SEU) is considered a landmark in corpus-based grammatical description in the 20th century. It is important to note that ‘the spoken part of SEU corpus was, however, later computerised yielding the London-Lund Corpus’ (Svartvik, 1990 quoted in Kenny 1999: 32). Accordingly, Kennedy (1990: 17) pointed out that the SEU corpus, which was initially assembled manually, is considered a transitional point between non-computerised corpora and modern corpus linguistics. Undoubtedly, working on such large corpora was tedious and exhausting, because processing corpora without the assistance of computer techniques is time-consuming, banal, error-prone, boring and very expensive (McEnery and Wilson, 1996: 10). It now takes a matter of minutes to process such corpora accurately by computer.

As a point of departure we can conclude that the methodology of corpus linguistics, however unrepresentative of the actual use of language, was widespread in linguistics for a long time. Corpora remained a source of data for linguistic research, in spite of the difficulties raised above, until the 1950s, when corpus-based research suffered a severe blow at the hands of Chomsky, who rejected it as a reliable methodology (see 3.2).

3.3.2 Computational Corpus Linguistics

With the introduction of computers to the field of corpus linguistics, much attention has been given to this methodology. The electronic corpus became widely recognised and exploited after Francis and Kucera launched their pioneering corpus (the Brown Corpus) in 1961. Linguists then began to realise that electronic corpora can offer new insight and a reliable methodology for natural language processing, as they found that computers have made possible the collection, storage and processing of very large and varied texts. Unlike manual corpora, computerised corpora can be well designed and representative, and are easy to process in a few minutes. This can reveal unexpected features of language. More importantly, ‘the ability to examine large text corpora in a systematic manner allows access to a quality of evidence that has not been available before.’ (Sinclair, 1991a: 4)

Computerised English Corpora

Today, there are many electronic corpora available on either punched cards or CD-ROMs in various languages, such as the Lancaster/Oslo-Bergen Corpus (LOB), the London-Lund Corpus, the Lancaster/IBM Spoken English Corpus (SEC), the Longman/Lancaster English Language Corpus, and the British National Corpus (BNC).

Below I am going to give a brief account of two major English corpora: Brown Corpus as the first computerised corpus and Birmingham Collection as the first major computerised corpus used for dictionary-making based on a thorough study of the language use.

Brown Corpus

This was, undoubtedly, a pioneering corpus not only because it was the first computerised corpus of English, but also because it went against the mainstream, which was intuition-oriented. The corpus consisted of about one million words of written English printed in the US in 1961, comprising 500 text samples of about 2,000 words each. The samples were taken from a variety of genres excluding verse and drama. The project started in 1961 and only three years later (in 1964) was the corpus ready for distribution on magnetic tape.

Birmingham Collection

The starting point of this corpus goes back to the 1960s in the form of research carried out at Birmingham University where Sinclair (1969) issued his early computational British corpus: OSTI project (135,000 running words of informal conversation transcribed and computerised). The collection undertaken at Birmingham University is made up of written texts and transcribed speech. It was intended to provide raw language data for a variety of purposes, relevant to the needs of the learners and teachers, lexicographic in particular (Renouf, 1984: 4-5). Since 1980 Cobuild, which is a joint venture between Collins and the School of English at Birmingham University, has been collecting a corpus for dictionary compilation and language study, making use of the Birmingham collection.

In October 2000 the latest release of the corpus amounted to 415 million words and it continues to grow with the constant addition of new material. Research at COBUILD over the last fifteen years has shown that very large samples of text are necessary for good linguistic study, since the vocabulary of English is so large (well over half a million different words) and there is such variety in current usage. In order to draw statistically valid conclusions from computerised analysis of a corpus, researchers need to have adequate data samples at their disposal (http://titania.cobuild.collins.co.uk/).

In addition to the corpora mentioned above, there are ‘a number of initiatives that have aimed at collecting and disseminating textual material amongst the international research community’ (Kenny 1999: 34). One example is the ACL/DCI (the Association for Computational Linguistics’ Data Collection Initiative), which produced a CD-ROM containing just plain orthographic text. It consists of the Collins English Dictionary; selections from the Wall Street Journal; the Penn Treebank of skeleton-parsed data compiled by Mitch Marcus and his team at the University of Pennsylvania; and a database of scientific abstracts. There are also other initiatives such as the ECI (European Corpora Initiative), the LDC (Linguistic Data Consortium) and ELRA (European Language Resources Association).

3.4 Corpus Design

The corpora we have mentioned above were not assembled haphazardly, since a corpus is defined as a designed collection of texts. Prior to the process of collecting a corpus there should be theoretical research to specify the type, time period, language variety or state, size and design method of the corpus (Sinclair 1987; Atkins et al. 1992; Biber 1993; McEnery & Wilson 1996; Kennedy 1998; Meyer 2002).

3.4.1 The purpose of the corpus

From the many corpora we have discussed above we can conclude that corpora can be designed for several purposes: as a basis for a dictionary; to create a word frequency list; to study some linguistic phenomenon; to study the language of a particular author or time period; to study language change; to train an NLP system; as a teaching resource for non-native speakers; to study language acquisition. Due to the diversity of corpus purposes, there is no consensus among corpus linguists as to the procedures or the selection criteria to be followed in corpus design. For example, the selection criteria for Cobuild excluded poetry, drama and technical language (Renouf, 1984: 6). In addition to excluding poetry and drama, the Brown Corpus is designed to be a synchronic corpus: it contains written texts of American English published in 1961. If the purpose of the corpus is to highlight the features of a language over a period of time, we will need criteria that allow that purpose to be met. Moreover, specialist corpora may introduce different criteria to study a certain aspect of the language.

One of the first considerations in constructing a corpus is to specify for whom and for what the corpus is designed: for personal research, or to serve as a general resource. Kennedy (1998: 70) argued, ‘the optimal design of a corpus is highly dependent on the purpose for which it is intended to be used.’ Nevertheless, Atkins et al. (1992) and Meyer (2002) drew up the principal features of corpus design for whatever purpose. They discussed the practical stages in building a corpus: selection of sources, text annotation, copyright permission, in addition to some extra-linguistic variables.


3.4.2 Text Sampling

The next step after deciding the type, purpose and content of a corpus is to select and sample the actual texts which will make up the corpus. Biber (1993: 243) pointed out that any selection of texts is considered a sample, irrespective of being representative or not, but he noted that ‘a corpus must be ‘representative’ in order to be appropriately used as the basis for generalisations concerning a language as a whole.’ However, we have to bear in mind, in the first place, that a corpus may be designed to represent not the language as a whole but one particular genre or the complete works of an author, for example. Secondly, it is feasible to assemble the complete Old English corpus or the complete Early Middle English corpus, but a complete 20th-century British or American English corpus is not feasible, because it is too difficult to access all the publications in a given language, let alone speech.

There are two ways of sampling a language: by language reception or by language production, i.e. whether to sample the language that is heard and read or the language that is spoken and written (Atkins et al., 1992: 5). We can hardly achieve a representative sample of total language production, given the vast demographic and contextual variation among people. In addition, a corpus, however big, is small when compared with the entire population of the language under investigation.

Moreover, ‘the value of a corpus as a research tool cannot be measured in terms of brute size. The diversity of the corpus, in terms of the variety of registers or text types it represents, can be an equally important (or even more important) criterion’ (Garside, Leech and McEnery, 1997: 2).

With this in mind, Garside, Leech and Sampson (1987: 6) noted that Sinclair (1982) defined the problem of corpus compilation as a problem of selecting the right sample from the existing massive quantities of machine-readable texts. The main challenge in sampling the population⁸ of a given language lies in representing all the relevant genres, topics or registers while keeping the corpus at a manageable size. Therefore, sampling has to be conducted according to statistical measures, so that the sample will be qualitatively and quantitatively representative of the entire publication and population.

More importantly, in order to achieve accurate representativeness of the samples in general corpora, we have to ensure the diversity of the selected data. With a diverse corpus, we can avoid the pervasiveness of a certain genre or the style of a single author. Sampling from various genres can reduce the possibility of the corpus being dominated by the stylistic idiosyncrasies of a particular author (Atkins et al. 1992: 2).

Sampling all data randomly, where all texts have a chance to be represented, can also reduce the stylistic idiosyncrasies of authors. However, Biber (1993: 244) argued that the process of random sampling is mostly used within each subgenre to ensure a representative selection of texts.
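Random sampling within each subgenre, as Biber describes it, can be sketched as follows. This is only an illustration under stated assumptions: the catalogue of genres and text identifiers is invented, and the function name and the fixed seed are hypothetical choices made so that the example is reproducible.

```python
import random


def sample_within_genres(texts_by_genre, per_genre, seed=0):
    """Draw a random sample of texts separately from each genre."""
    rng = random.Random(seed)  # fixed seed so the selection can be repeated
    sample = {}
    for genre, texts in texts_by_genre.items():
        # A genre with fewer texts than requested contributes all of them.
        k = min(per_genre, len(texts))
        sample[genre] = rng.sample(texts, k)
    return sample


# Invented catalogue: genre names mapped to text identifiers.
catalogue = {
    "poetry": ["p1", "p2", "p3", "p4"],
    "prose": ["r1", "r2", "r3"],
    "proverbs": ["v1"],
}

chosen = sample_within_genres(catalogue, per_genre=2)
for genre, texts in chosen.items():
    print(genre, texts)
```

Because texts are drawn genre by genre, no single genre can dominate the sample, while the choice of texts inside each genre is still left to chance.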

Sinclair (1995: 27-28) made a distinction between a ‘whole text’ corpus and a ‘sample corpus’. He noted that ‘samples are small, in relation to texts such as newspapers, books, radio programmes, and of a constant size, hence not qualifying as texts.’ Unlike many corpus linguists, such as Francis and Kucera in their pioneering corpus (the Brown Corpus) in 1961, he thinks that the ‘whole text corpus’ should be the default for anyone building a corpus. To him, ‘the use of small samples is just a remnant of the early restraints on corpus building’ (ibid). Stubbs (1993: 11) also argues in favour of whole texts being the unit of study. He also quoted Sinclair as saying that ‘few linguistic features of a text are distributed evenly throughout’, a fact which could be overlooked when sample texts are used.

8 To statisticians, the word ‘population’ does not necessarily refer to human beings, as in common usage. We may have a population of anything to be counted, such as people, animals, trees, companies, books, cars, etc. (Stuart, 1968: 10).

3.4.3 Text Typology

Atkins et al. (1992) distinguished between two kinds of criteria for constructing a corpus: external (non-linguistic) and internal (linguistic). The former are the first to be considered when compiling a corpus, whereas the latter cannot be established until the corpus becomes available for analysis (ibid: 5).

In sampling written texts, the designer of the corpus has to take into account some important information about both the author and the reader who differ in regard to certain author-related and work-related criteria. Such considerations, in addition to contextual criteria, are also required when sampling spoken data. These criteria are by definition non-linguistic.

Atkins et al. (1992) have given a full systematic account of non-linguistic characteristics in corpus design. Work-related criteria include, among other things, mode (written, spoken, written to be read, written to be spoken), text origin, preparedness, participants, genre, style, setting, factuality, topic, date of publication. Author-related criteria are those associated with authors. These criteria are mainly demographic: geographical, ethnic, socioeconomic, and social (age, education, sex, profession, nationality, age and size of intended audience or readership, etc.). Contextual criteria refer to situationally-defined varieties such as conversation (face-to-face vs. telephone (informal)), monologue vs. dialogue, personal vs. impersonal.

3.5 Technical Requirements

In addition to the criteria mentioned above, there are also some considerations one has to keep in mind when designing a corpus such as getting permission, data capturing, marking-up.

Before starting the process of creating a corpus, the designer may have to get permission from the publishers of his selected works, national or international, to use the text in an electronic form for language research. Having got permission, he needs to capture the data.


Written corpora are easy to capture by keyboarding, scanning or downloading from the Internet. However, proofreading is still needed to make sure of the reliability of the data. Spoken material, on the other hand, is difficult to capture. Spoken materials need to be recorded and then transcribed before processing. Producing a reliable transcribed text is, undoubtedly, time-consuming, expensive and error-prone. This is because people’s perception of speech may differ in respect of prosodic features, situations, homophonous words, etc. Once a text, written or spoken, is captured electronically, some information can be added to indicate text features such as titles, chapters, paragraphs, sentence boundaries, headings, various types of hyphenation, etc. This process is called marking-up. Other information can also be added to the text to show the part of speech of each word (as in tagged corpora), or the sentence structure and the function of each word in the sentence (as in parsed corpora).
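As a rough illustration of marking-up, the sketch below wraps a captured text in structural tags for the title, paragraphs and sentences. The tag names (text, title, p, s) and the function name are assumptions chosen for the example, not the scheme of any particular corpus, and the sentence splitter is deliberately naive.

```python
import re
from xml.sax.saxutils import escape


def mark_up(title, paragraphs):
    """Add simple structural markup: title, numbered paragraphs, sentences."""
    lines = ["<text>", f"  <title>{escape(title)}</title>"]
    for p_no, paragraph in enumerate(paragraphs, start=1):
        lines.append(f'  <p n="{p_no}">')
        # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        for s_no, sentence in enumerate(sentences, start=1):
            lines.append(f'    <s n="{s_no}">{escape(sentence)}</s>')
        lines.append("  </p>")
    lines.append("</text>")
    return "\n".join(lines)


marked = mark_up("Sample", ["First sentence. Second one?", "A new paragraph."])
print(marked)
```

A tagged or parsed corpus would add further layers of annotation on top of this structure, which normally requires a tagger or parser rather than a splitting rule.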

3.6 Corpus Processing

Once a corpus is available in electronic form, it needs to be processed by computer for use in linguistic research. Since most corpora are very large, it is impractical to search a corpus without the help of software that can find what we are looking for quickly and accurately. Hence, we need tools to turn the electronic texts into databases which can be searched. Many tools have been designed for this purpose.

Barnbrook (1996), Meyer (2002) and Kenny (2001) gave an overview of how to process such a corpus. The first thing the computer techniques can do with texts is to provide word frequency lists for the whole contents of the texts.

Frequency Lists

These lists can be made by identifying every word form in the text, counting identical forms and classifying them according to a particular order: alphabetical, or according to their frequency. This can be done in descending or ascending order. Listing words according to their frequencies can show how often every single word form occurs in the text. Therefore, ‘by examining a list, one can get an idea of what further information would be worth acquiring: or one can make guesses about the structure of the text, and so focus on investigation’ (Sinclair, 1991: 31).
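The procedure just described can be sketched in a few lines. The example assumes a whitespace-tokenised English sentence as a stand-in for a corpus; the function name and its order parameter are hypothetical.

```python
from collections import Counter


def frequency_list(text, order="frequency"):
    """Count identical word forms and return them in the requested order."""
    tokens = text.lower().split()  # simplistic tokenisation on whitespace
    counts = Counter(tokens)
    if order == "alphabetical":
        return sorted(counts.items())
    # Descending frequency; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = "the cat saw the dog and the dog saw the cat"
by_freq = frequency_list(sample)
by_alpha = frequency_list(sample, order="alphabetical")

for word, count in by_freq:
    print(f"{count:>3}  {word}")
```

Note that the sketch counts word forms, not lemmas, so inflected forms of the same word would be listed separately.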


Concordances

A concordance can be defined as a list of all occurrences of a search-word in the text, each with a short section of the context that precedes and follows it. Unlike in word frequency lists, the search-word is presented within its contextual environment; this can give more information about the nature and behaviour of words. This format is also called KWIC (key word in context). The search-word can be highlighted by putting it in the centre of each line, with a space on each side. The concordance lines can be arranged alphabetically according to the left-hand or the right-hand context. Barnbrook (1996) describes the main features of concordance programs in detail.
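A minimal sketch of a KWIC concordance, assuming the text has already been split into a list of tokens, might look as follows. The names `kwic` and `format_kwic` and the parameters `width`, `sort_by` and `col` are hypothetical and are not taken from any particular concordance program.

```python
def kwic(tokens, keyword, width=4, sort_by="right"):
    """Collect (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    if sort_by == "left":
        # Compare the word nearest the keyword first, then move leftwards.
        lines.sort(key=lambda line: line[0][::-1])
    else:
        lines.sort(key=lambda line: line[2])
    return lines

def format_kwic(lines, col=30):
    """Right-align the left context so that the keyword forms a central column."""
    return [f"{' '.join(left):>{col}}  {kw}  {' '.join(right)}"
            for left, kw, right in lines]

tokens = "the old man saw the young man and the man left".split()
by_right = kwic(tokens, "man", width=2)
by_left = kwic(tokens, "man", width=2, sort_by="left")
for line in format_kwic(by_right):
    print(line)
```

Sorting on the right-hand context groups together lines in which the search-word is followed by the same words; sorting on the left-hand context does the same for the words that precede it.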

Collocation

In addition to KWIC and word frequency lists, most programs also offer the possibility of searching for word combinations within a specified range of words. More sophisticated programs may also provide the user with lists of collocates based on statistical tests. Collocation is discussed in detail in Chapter Five.
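Searching for word combinations within a specified range can be illustrated with a simple co-occurrence count. The sketch below is an assumption-laden example, not a description of any existing program: the function name `collocates` and the parameter `span` are hypothetical, and the function returns raw counts only, without any statistical test.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words occurring within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Window of up to `span` tokens to the left and to the right of the node.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

tokens = "strong tea and strong coffee but weak tea".split()
tea_collocates = collocates(tokens, "tea", span=2)
```

In this toy sample, `tea_collocates.most_common()` ranks the words found near "tea" by raw co-occurrence frequency.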

3.7 Summary

This chapter has given a brief account of the methodology of corpus linguistics and has surveyed its historical background. We have examined several aspects of corpus linguistics in order to acquaint the reader with the state of the art. These aspects include the criteria for creating a corpus (representativeness, size, sampling, etc.), the types of corpora, and the technical requirements for utilising corpora.

Chapter Four: Description of the Corpus and Tools of Analysis

4.1 Introduction

Based on the information given in the previous chapter, we embarked on building a computerised Arabic corpus to use in our linguistic study of lexical collocations and synonymy in Arabic, taking into consideration the state of the art of Arabic, which we discuss below. We attempted to meet all the design criteria for corpus compilation so that we could conduct a methodical study based on it and make it available as a resource for other researchers to use in the future.

4.2 Arabic for Computational Analysis

Work in Arabic computing did not start as early as work on European languages. Attempts have been made, but due to some technical problems with Arabic script (orthography) and grammar there has been far less development than for English and other languages written with the Roman alphabet. This is because ‘the native Arabic grammar [which is produced by early Arab linguists], although one of the most sophisticated systems of linguistic analysis ever devised, was developed by scholars who lacked the concepts of consonant, vowel, and syllable’ (Koenraad et al, 1999: 162-63). This raises some problems in digitising Arabic which require laborious computational work. For instance, the absence of vowels in Arabic⁹ makes the process of tagging or any morphological analysis quite hard and sometimes ambiguous. Consider for example the three-letter word ورد wrd, which can be lexicalised as a verb وَرَدَ warada ‘come, be mentioned’, a noun وَرْدٌ ward ‘flower’, or a noun وِرْدٌ wird ‘watering place’. For more details about the difficulties of analysing Arabic computationally see Goweder and Roeck (2001), Khoja, Garside and Knowles (2001), Van Mol (2002).
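The ambiguity can be made concrete with a short sketch. It assumes that the vowel signs are encoded as Unicode combining marks (category Mn), which holds for the standard Arabic short-vowel, sukun and nunation signs. The helper `strip_vowel_marks` is a hypothetical name introduced only for this illustration, and the three words are written with Unicode escapes to avoid any display-order problems.

```python
import unicodedata

def strip_vowel_marks(word):
    """Remove combining marks (Unicode category Mn), including Arabic vowel signs."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

# The three vocalised forms discussed above, keyed by their transliteration.
vocalised = {
    "warada": "\u0648\u064E\u0631\u064E\u062F\u064E",  # verb: 'come, be mentioned'
    "ward": "\u0648\u064E\u0631\u0652\u062F\u064C",    # noun: 'flower'
    "wird": "\u0648\u0650\u0631\u0652\u062F\u064C",    # noun: 'watering place'
}

# Once the vowel marks are removed, all three collapse into one written form.
skeletons = {strip_vowel_marks(form) for form in vocalised.values()}
```

The set `skeletons` contains a single string, the unvowelled form wrd, which is all that a tagger or morphological analyser sees in ordinary unvocalised text.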

4.2.1 Progress in machine-readable Arabic language

The Sakhr Company has been working on digitising Arabic since 1985. Two years later they managed to produce the first Arabic morphological analyser. Not until 2001 did they manage

9 A few written Arabic texts contain vowels; the most famous is the Qur’an, which has a fully detailed vowel system. Some old Arabic poems and some primary schoolbooks also contain vowels, but only those that mark the cases of words.