• No results found

Corpus Linguistics Resources and Tools for Arabic Lexicography

N/A
N/A
Protected

Academic year: 2019

Share "Corpus Linguistics Resources and Tools for Arabic Lexicography"

Copied!
32
0
0

Loading.... (view fulltext now)

Full text

(1)

This is a repository copy of Corpus Linguistics Resources and Tools for Arabic Lexicography.

White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/81627/

Proceedings Paper:

Sawalha, M and Atwell, ES (2011) Corpus Linguistics Resources and Tools for Arabic Lexicography. In: Proceedings of the 2011 Workshop on Arabic Corpus Linguistics. 2011 Workshop on Arabic Corpus Linguistics, 11 - 12 Apr 2011, Lancaster University, UK. UCREL .

[email protected] https://eprints.whiterose.ac.uk/

Reuse

Unless indicated otherwise, fulltext items are protected by copyright with all rights reserved. The copyright exception in section 29 of the Copyright, Designs and Patents Act 1988 allows the making of a single copy solely for the purpose of non-commercial research or private study within the limits of fair dealing. The publisher or other rights-holder may allow further reproduction and re-use of this version - refer to the White Rose Research Online record for this item. Where records identify the publisher as the copyright holder, users can verify any specific terms of use on the publisher’s website.

Takedown

If you consider content in White Rose Research Online to be in breach of UK law, please notify us by

(2)

Corpus linguistics resources and

tools for Arabic lexicography

Workshop on Arabic Corpus Linguistics

11-12 April 2011, Lancaster University

tools for Arabic lexicography

Majdi Sawalha and Eric Atwell

School of Computing,

University of Leeds, Leeds, LS2 9JT, UK

(3)

Outline

Introduction

– Oxford University Dictionaries

• Monolingual Dictionaries • Bilingual Dictionaries

– Arabic & Arabic NLPArabic & Arabic NLP

• Difficulties

• Morphological analysis

– Traditional Arabic lexicography

– Constructing the Arabic broad-coverage lexical resource

– Key notes

– Conclusions

(4)

Oxford Dictionaries

• Searching for a word

• Dictionary entries are the lemmas of the word.

• Lemmatising of input words is needed to direct the search for

Lemmatising University

University University Universities Universities

• Lemmatising of input words is needed to direct the search for

the input word to the correct dictionary entry.

• What happens if the user entered a miss-spelled (unknown) word?

• Gives

suggestions of

Univercity

Univercity

(5)

Dictionary Entry

Dictionary Entry

Pronunciation POS + Plural form

Dictionary Dictionary

entries

• Dictionary entry: Information provided

POS + Plural form

Meaning Examples

Phrases

Origin

Meaning Examples

(6)

What information can be provided

by the Arabic dictionaries?

POS Dictionary entry

Position in dictionary

Meaning

Pronunciation Pronunciation

Lemma Root Pattern Plural form Examples

Related words list

A list of the words derived from the

same root or words that have the same lemma

Morphological features

Origin (relation to traditional Arabic Dictionaries)

Examples

Using a suitable font that shows the letters and the diacritics Using colours to illustrate clitics attached to the word

Phrases, Collocations, Idioms

(meaning and examples)

(7)
(8)

Traditional Arabic

Lexicography

• Arabic lexicography is one of the original and deep-rooted arts

of Arabic literature.

• The first lexicon constructed was kitāb al-‘ayn ﲔﻌﻟﺍ ﺏﺎﺘﻛ‘ al-‘ayn

lexicon’ by al-farāh

d

(died in 791).

lexicon’ by al-farāh d

• Over the past 1200 years, many different kinds of Arabic

language lexicons were constructed; these lexicons are different in ordering, size and aim or goal of construction.

• Many Arabic language linguists and lexicographers studied the

(9)

Ordering lexical entries in

the Arabic lexicons

al-ẖalῑl

methodology

[5 lexicons]

– Listed the lexical entries based on the pronunciation of the

letters starting from the farthest in the mouth to the nearest

abῑ

‘ubayd

methodology

[3 lexicons]

– Listed the lexical entries based on similarity in meaning.

al-

ğ

awharῑ

methodology

[4 lexicons]

– Listed the lexical entries based on last letter of the word.

al-barmakῑ

methodology

[11 lexicons]

– Listed the lexical entries alphabetically.

• Modern dictionaries uses a combination of root/word as

(10)
(11)

The use of Corpora in building

dictionaries – Example 1

Lexicographers

Selecting examples

(12)

Lemmatized Arabic Internet

Corpus

(13)
(14)

The use of Corpora in building

dictionaries – Example 2, cont.

• Frequency lists

/measurements can compare

between two words, and find words, and find how these

words are

(15)
(16)

Oxford bilingual dictionaries

POS POS

French French

English English

Examples from Examples from parallel corpora

French French

(17)
(18)

Oxford bilingual dictionaries

Dictionary entry Dictionary entry

Meaning in Meaning in

English

Word list Word list

Language Dictionary entry

(19)

Oxford bilingual dictionaries

Do users need to know the meaning of word in

many different languages?

Users need to translate from one language to

another.

Connecting terms with a central language

Connecting terms with a central language

(English) can connect two languages together.

specific linguistic information to each word from

(20)

Arabic & Arabic NLP

200 million people speaking Arabic as first

language.

More than 1 billion Muslims need Arabic to

recite the Quran ( the holy book of Muslims).

recite the Quran ( the holy book of Muslims).

One of the UN official languages.

Increased potential for learning Arabic recently.

Many commercial software companies invest

(21)

Why is Arabic NLP difficult?

• Complex morphology

– Words consist of multi-morphemes of 5 kinds

– Proclitics, prefixes, stem/root, suffixes, enclitics.

َو

َس

َي

ُ ُْآ

َنُو

َه

• ﺎـﻬـﻧﻮــﺒـﺘـﹾﻜـــﻴـﺳﻭ [ wasayaktubūnhā ] (And they will write it)

• Vowels & Diacritic Marks

– 3 long vowels ( ا alif, و wāw, ي yā )

– 3 short vowels ( fathah َ, dammah ُ, kasrah ِ)

– Other diaratics: sukūn ـْــ, šaddah ـّــ , tanwīn ( ً, ٌ, ٍ) – tawīl character ( ـ).

hamzah (ﺉ ﺅ ﺇ ﺃ ﺀ), tā’ marbūṭah ( ﺓ ) and hā ( ﻩ ), yā ( ﻱ ) and alif maqṣūrah( ), and maddah ( ).

َو

َس

َي

ُ ُْآ

َنُو

َه

Conjunction Particle of futurity

Progressive letter

Root / Stem Relative Pronoun (Plural/Subject)

Relative Pronoun (Object)

(22)

Morphological Analyses of

Arabic text

• Morphological analysis is

essential for processing text corpora and building dictionaries.

• Existing Arabic

morphological analyzers

Morphological Analyzer for Arabic text

Step 1: Tokenization

- Different text types

- Spell-checking

Step 2: Function words

morphological analyzers failed to achieve accuracy rates more than 75%.

(Sawalha & Atwell, 2008)

• We can not rely on such

analyzers for further analysis such as

part-of-speech tagging and parsing.

Step 2: Function words

Step 3: Clitics, Affixes & Stems Step 4: Root/Lemma extraction Step 5: Pattern generation

Step 6: Vowelization

Step 7: Assigning detailed

(23)

Example of Analyzed word

ﺕﺍ

ﻊﻣﺎﺟ

ﹾﻝﺍ

ﻭ

Feminine plural letters

stem Definite article conjunction

ﺕﺎﻌﻣﺎﺠﹾﻟﺍﻭ wa al-ğāmi‘āt (And the universities)

Lemma <link>

ﹲﺔﻌﻣﺎﺟ

Root

ﻊﻤﺟ

<link>

Pronunciation

Lemma root POS Pattern Meaning

Word’s list of similar root ﻊﻤﹶﳉﺍ ﻊﻴﻤﺠﺘﻟﺍ ﻉﺎﻤﺘﺟﺍ ﻉﺎﻤﺟﹺﺇ ﻊﻤﺟ

ﹲﺔﻌﻣﺎﺟ

ﻊﻤﺟ

Pattern

ﹲﺔﹶﻠﻋﺎﹶﻓ

Meaning Examples

Phrases, collocations, idioms Origin (links to traditional Arabic lexicons) ﻊﻤﺟ ﻊﻤﺠﺗ ﻊﻣﺎﺟ ﻲﻌﻣﺎﺟ ﹶﻥﻮﻴﻌﻣﺎﺟ ﺔﻴﻌﻣﺎﺟ ﺕﺎﻴﻌﻣﺎﺟ ﺔﻴﻌﻤﺟ ﻊﻤﺠﻣ ﻊﻤﺠﻣ ﻉﻮﻤﺠﻣ ﻊﻤﺠﻳ Word’s morphemes

ﻭ p--c--- Particle, conjunction (clitic)

ﹾﻝﺍ r---d--- definite article (clitic)

ﻊﻣﺎﺟ np----flp-vndd---ncat-s collective noun, feminine

(M/S), varied, non-human …

ﺕﺍ r---l--- plural feminine letters

Plural form

ﺕﺎﻌﻣﺎﺟ

(24)

Samples of traditional Arabic

lexicons

ﺐﺘﻛ : ﺏﺎﺘﻜﻟﺍ : ،ﻑﻭﺮﻌﻣ ﻊﻤﳉﺍﻭ ﺐﺘﹸﻛ ﻭ ﺐﺘﹸﻛ . ﺐﺘﹶﻛ َﺀﻲﺸﻟﺍ ﻪﺒﺘﹾﻜﻳ ﹰﺎﺒﺘﹶﻛ ﻭ ﹰﺎﺑﺎﺘﻛ ﻭ ﹰﺔﺑﺎﺘﻛ ، ﻭ ﻪﺒﺘﹶﻛ : ؛ﻪﱠﻄﺧ ﻝﺎﻗ ﻮﺑﹶﺃ ﻢﺠﻨﻟﺍ : ﺖﹾﻠﺒﹾﻗﹶﺃ ﻦﻣ ﺪﻨﻋ ﺩﺎﻳﺯ ،ﻑﹺﺮﹶﳋﺎﻛ ﱡﻂﺨﺗ ﻱﻼﺟﹺﺭ ﱟﻂﲞ ،ﻒﻠﺘﺨﻣ ﻥﺎﺒﺘﹶﻜﺗ ﰲ ﹺﻖﻳﺮﱠﻄﻟﺍ ﻡﻻ ﻒﻟﹶﺃ ﻝﺎﻗ : ﺖﻳﹶﺃﺭﻭ ﰲ ﺾﻌﺑ ﹺﺦﺴﻨﻟﺍ ﻥﺎﺒﺘﻜﺗ ، ﺮﺴﻜﺑ ،ﺀﺎﺘﻟﺍ ﻲﻫﻭ ﺔﻐﻟ ،َﺀﺍﺮﻬﺑ ﻥﻭﺮِﺴﹾﻜﻳ ،ﺀﺎﺘﻟﺍ ﻥﻮﻟﻮﻘﻴﻓ : ،ﹶﻥﻮﻤﹶﻠﻌﺗ ﰒ ﻊﺒﺗﹶﺃ ﻑﺎﻜﻟﺍ ﹶﺓﺮﺴﻛ ﺀﺎﺘﻟﺍ . ﻭ ﺏﺎﺘﻜﻟﺍ ﹰﺎﻀﻳﹶﺃ : ،ﻢﺳﻻﺍ ﻦﻋ ﱐﺎﻴﺤﻠﻟﺍ . ﻱﺮﻫﺯَﻷﺍ : ﺏﺎﺘﻜﻟﺍ ﻢﺳﺍ ﺎﳌ ﺐﺘﹸﻛ ؛ﹰﺎﻋﻮﻤﺠﻣ ﻭ ﺏﺎﺘﻜﻟﺍ ؛ﺭﺪﺼﻣ ﻭ ﹸﺔﺑﺎﺘﻜﻟﺍ ﻦﻤـﻟ ﹸﻥﻮﻜﺗ ﻪﻟ ،ﹰﺔﻋﺎﻨﺻ ﻞﺜﻣ ﺔﻏﺎﻴﺼﻟﺍ ﺔﻃﺎﻴـﳋﺍﻭ . ﻭ ﹸﺔﺒﺘﻜﻟﺍ : ﻚﺑﺎﺘﺘﹾﻛﺍ ﹰﺎﺑﺎﺘﻛ ﻪﺨﺴﻨﺗ . ﻝﺎﻘﻳﻭ : ﺐﺘﺘﹾﻛﺍ ﹲﻥﻼﻓ ﹰﺎﻧﻼﻓ ﻱﹶﺃ ﻪﻟﹶﺄﺳ ﻥﹶﺃ ﺐﺘﹾﻜﻳ ﻪﻟ ﹰﺎﺑﺎﺘﻛ ﰲ ﺔﺟﺎﺣ . ﻭ ﻪﺒﺘﹾﻜﺘﺳﺍ َﺀﻲﺸﻟﺍ ﻱﹶﺃ ﻪﻟﹶﺄﺳ ﻥﹶﺃ ﻪﺒﺘﹾﻜﻳ ﻪﻟ . ﻦﺑﺍ ﻩﺪﻴﺳ : ﻪﺒﺘﺘﹾﻛﺍ ﻪﺒﺘﹶﻜﻛ . ﻞﻴﻗﻭ : ﻪﺒﺘﹶﻛ ؛ﻪﱠﻄﺧ ﻪﺒﺘﺘﹾﻛﺍﻭ : ،ﻩﻼﻤﺘﺳﺍ ﻚﻟﺬﻛﻭ ﻪﺒﺘﹾﻜﺘﺳﺍ . ﻪﺒﺘﺘﹾﻛﺍﻭ : ﻪﺒﺘﹶﻛ ، ﻭ ﻪﺘﺒﺘﺘﹾﻛﺍ : ﻪﺘﺒﺘﹶﻛ . ﰲﻭ ﻞﻳﱰﺘﻟﺍ ﺰﻳﺰﻌﻟﺍ : ﺎﻬﺒﺘﺘﹾﻛﺍ ﻲﻬﻓ ﻰﻠﻤﺗ ﻪﻴﻠﻋ ﹰﺓﺮﹾﻜﺑ ؛ﹰﻼﻴـﺻﹶﺃﻭ

ﻱﹶﺃ ؛ﹰﻼﻴـﺻﹶﺃﻭ ﹰﺓﺮﹾﻜﺑ ﻪﻴﻠﻋ ﻰﻠﻤﺗ ﻲﻬﻓ ﺎﻬﺒﺘﺘﹾﻛﺍ : ﺰﻳﺰﻌﻟﺍ ﻞﻳﱰﺘﻟﺍ ﰲﻭ . ﻪﺘﺒﺘﹶﻛ : ﻪﺘﺒﺘﺘﹾﻛﺍﻭ ، ﻪﺒﺘﹶﻛ : ﻪﺒﺘﺘﹾﻛﺍﻭ . ﻪﺒﺘﹾﻜﺘﺳﺍ ﻚﻟﺬﻛﻭ ﻱﹶﺃ ﺎﻬﺒﺘﹾﻜﺘﺳﺍ . ﻝﺎﻘﻳﻭ : ﺐﺘﺘﹾﻛﺍ ﹸﻞﺟﺮﻟﺍ ﺍﺫﹺﺇ ﺐﺘﹶﻛ ﻪﺴﻔﻧ ﰲ ﻥﺍﻮﻳﺩ ﻥﺎﻄﹾﻠﺴﻟﺍ ...

(25)

Constructing the Arabic

broad-coverage lexical resource

Analyzing Lexicon’s Text Separately

• Convert each lexicon text into a unified format.

*** ﺐﺘﻛ : ﺏﺎﺘﻜﻟﺍ : ،ﻑﻭﺮﻌﻣ ﻊﻤﳉﺍﻭ ﺐﺘﹸﻛ ﻭ ﺐﺘﹸﻛ . ﺐﺘﹶﻛ َﺀﻲﺸﻟﺍ ﻪﺒﺘﹾﻜﻳ ﹰﺎﺒﺘﹶﻛ ﻭ ﹰﺎﺑﺎﺘﻛ ﻭ ﹰﺔﺑﺎﺘﻛ ، ﻭ ﻪﺒﺘﹶﻛ : ؛ﻪﱠﻄﺧ ﻝﺎﻗ ﻮﺑﹶﺃ ﻢﺠﻨﻟﺍ : ﺖﹾﻠﺒﹾﻗﹶﺃ ﻦﻣ ﺪﻨﻋ ﺩﺎﻳﺯ ،ﻑﹺﺮﹶﳋﺎﻛ ﱡﻂﺨﺗ ﻱﻼﺟﹺﺭ ﱟﻂﲞ ،ﻒﻠﺘﺨﻣ ﻥﺎﺒﺘﹶﻜﺗ ﰲ ﹺﻖﻳﺮﱠﻄﻟﺍ ﻡﻻ ﻒﻟﹶﺃ ﻝﺎﻗ : ﺖﻳﹶﺃﺭﻭ ﰲ ﺾﻌﺑ ﹺﺦﺴﻨﻟﺍ ﻥﺎﺒﺘﻜﺗ ، ﺮﺴﻜﺑ ،ﺀﺎﺘﻟﺍ ﻲﻫﻭ ﺔﻐﻟ ،َﺀﺍﺮﻬﺑ ﻥﻭﺮِﺴﹾﻜﻳ ،ﺀﺎﺘﻟﺍ ﻥﻮﻟﻮﻘﻴﻓ : ،ﹶﻥﻮﻤﹶﻠﻌﺗ ﰒ ﻊﺒﺗﹶﺃ ﻑﺎﻜﻟﺍ ﹶﺓﺮﺴﻛ ﺀﺎﺘﻟﺍ . ﻭ ﺏﺎﺘﻜﻟﺍ ﹰﺎﻀﻳﹶﺃ : ،ﻢﺳﻻﺍ ﻦﻋ ﱐﺎﻴﺤﻠﻟﺍ . ﻱﺮﻫﺯَﻷﺍ : ﺏﺎﺘﻜﻟﺍ ﻢﺳﺍ ﺎﳌ ﺐﺘﹸﻛ ؛ﹰﺎﻋﻮﻤﺠﻣ ﻭ ﺏﺎﺘﻜﻟﺍ ؛ﺭﺪﺼﻣ ﻭ ﹸﺔﺑﺎﺘﻜﻟﺍ ﻦﻤـﻟ ﹸﻥﻮﻜﺗ ﻪﻟ ،ﹰﺔﻋﺎﻨﺻ ﻞﺜﻣ ﺔﻏﺎﻴﺼﻟﺍ ﺔﻃﺎﻴـﳋﺍﻭ . ﻭ ﹸﺔﺒﺘﻜﻟﺍ : ﻚﺑﺎﺘﺘﹾﻛﺍ ﹰﺎﺑﺎﺘﻛ ﻪﺨﺴﻨﺗ . ﻝﺎﻘﻳﻭ : ﺐﺘﺘﹾﻛﺍ ﹲﻥﻼﻓ ﹰﺎﻧﻼﻓ ﻱﹶﺃ ﻪﻟﹶﺄﺳ ﻥﹶﺃ ﺐﺘﹾﻜﻳ ﻪﻟ ﹰﺎﺑﺎﺘﻛ ﰲ ﺔﺟﺎﺣ . ﻭ ﻪﺒﺘﹾﻜﺘﺳﺍ َﺀﻲﺸﻟﺍ ﻱﹶﺃ ﻪﻟﹶﺄﺳ ﻥﹶﺃ ﻪﺒﺘﹾﻜﻳ ﻪﻟ . ﻦﺑﺍ ﻩﺪﻴﺳ : ﻪﺒﺘﺘﹾﻛﺍ ﻪﺒﺘﹶﻜﻛ . ﻞﻴﻗﻭ : ﻪﺒﺘﹶﻛ ؛ﻪﱠﻄﺧ ﻪﺒﺘﺘﹾﻛﺍﻭ : ،ﻩﻼﻤﺘﺳﺍ ﻚﻟﺬﻛﻭ ﻪﺒﺘﹾﻜﺘﺳﺍ . ﻪﺒﺘﺘﹾﻛﺍﻭ : ﻪﺒﺘﹶﻛ ، ﻭ ﻪﺘﺒﺘﺘﹾﻛﺍ : ﻪﺘﺒﺘﹶﻛ . ﰲﻭ ﻞﻳﱰﺘﻟﺍ ﺰﻳﺰﻌﻟﺍ : ﺎﻬﺒﺘﺘﹾﻛﺍ ﻲﻬﻓ ﻰﻠﻤﺗ ﻪﻴﻠﻋ ﹰﺓﺮﹾﻜﺑ ؛ﹰﻼﻴـﺻﹶﺃﻭ ﻱﹶﺃ ﺎﻬﺒﺘﹾﻜﺘﺳﺍ . ﻝﺎﻘﻳﻭ : ﺐﺘﺘﹾﻛﺍ ﹸﻞﺟﺮﻟﺍ ﺍﺫﹺﺇ ﺐﺘﹶﻛ ﻪﺴﻔﻧ ﰲ ﻥﺍﻮﻳﺩ ﻥﺎﻄﹾﻠﺴﻟﺍ ... ﻥﺎﻄﹾﻠﺴﻟﺍﻥﺍﻮﻳﺩ ...

• A bag of words is extracted from the definition text.

) ، ﻝﺎﻗ ﺐﺘﻛ

( ( ﺐﺘﻛ , ﻥﺎﺒﺘﹶﻜﺗ) ( ﺐﺘﻛ , ﻑﹺﺮﹶﳋﺎﻛ ) ( ﺐﺘﻛ , ﻢﺠﻨﻟﺍ) ( ﺐﺘﻛ , ﹰﺔﺑﺎﺘﻛﻭ ) ( ﺐﺘﻛ , ﺐﺘﹶﻛ ) ( ﺐﺘﻛ , ﺏﺎﺘﻜﻟﺍ ) )

ﺐﺘﻛ ،ﺖﻳﺃﺭﻭ

( ( ﺐﺘﻛ , ﰲ) ( ﺐﺘﻛ , ﱡﻂﺨﺗ ) ( ﺐﺘﻛ , ﺖﹾﻠﺒﹾﻗﹶﺃ) ( ﺐﺘﻛ , ﻪﺒﺘﹶﻛﻭ ) ( ﺐﺘﻛ ، َﺀﻲﺸﻟﺍ) ( ﺐﺘﻛ , ﻑﻭﺮﻌﻣ) )

ﰲ ﺐﺘﻛ ،

( ( ﺐﺘﻛ , ﹺﻖﻳﺮﱠﻄﻟﺍ) ( ﺐﺘﻛ , ﻱﻼﺟﹺﺭ ) ( ﺐﺘﻛ , ﻦﻣ) ( ﺐﺘﻛ , ﻪﱠﻄﺧ ) (ﺐﺘﻛ ، ﻪﺒﺘﹾﻜﻳ ) ( ﺐﺘﻛ , ﻊﻤﳉﺍﻭ) )

ﺐﺘﻛ ، ﺾﻌﺑ

( ( ﺐﺘﻛ , ﻡﻻ) ( ﺐﺘﻛ , ﱟﻂﲞ ) ( ﺐﺘﻛ , ﺪﻨﻋ ) ( ﺐﺘﻛ , ﻝﺎﻗ) (ﺐﺘﻛ ، ﹰﺎﺒﺘﹶﻛ) ( ﺐﺘﻛ , ﺐﺘﹸﻛ ) )

ﹺﺦﺴﻨﻟﺍ ﺐﺘﻛ ،

( ( ﺐﺘﻛ , ﻒﻟﹶﺃ) ( ﺐﺘﻛ , ﻒﻠﺘﺨﻣ ) ( ﺐﺘﻛ , ﺩﺎﻳﺯ) ( ﺐﺘﻛ , ﻮﺑﹶﺃ) ( ﺐﺘﻛ , ﹰﺎﺑﺎﺘﻛﻭ) ( ﺐﺘﻛ , ﺐﺘﹸﻛ )

• A normalization analysis that verifies the word-root pairs by

applying linguistic knowledge. )

، ﻝﺎﻗ ﺐﺘﻛ

( ( ﺐﺘﻛ , ﻥﺎﺒﺘﹶﻜﺗ) ( ﺐﺘﻛ , ﻑﹺﺮﹶﳋﺎﻛ ) ( ﺐﺘﻛ , ﻢﺠﻨﻟﺍ) ( ﺐﺘﻛ , ﹰﺔﺑﺎﺘﻛﻭ ) ( ﺐﺘﻛ , ﺐﺘﹶﻛ ) ( ﺐﺘﻛ , ﺏﺎﺘﻜﻟﺍ ) )

ﺐﺘﻛ ،ﺖﻳﺃﺭﻭ

( ( ﺐﺘﻛ , ﰲ) ( ﺐﺘﻛ , ﱡﻂﺨﺗ ) ( ﺐﺘﻛ , ﺖﹾﻠﺒﹾﻗﹶﺃ) ( ﺐﺘﻛ , ﻪﺒﺘﹶﻛﻭ ) ( ﺐﺘﻛ ، َﺀﻲﺸﻟﺍ) ( ﺐﺘﻛ , ﻑﻭﺮﻌﻣ) )

ﰲ ﺐﺘﻛ ،

( ( ﺐﺘﻛ , ﹺﻖﻳﺮﱠﻄﻟﺍ) ( ﺐﺘﻛ , ﻱﻼﺟﹺﺭ ) ( ﺐﺘﻛ , ﻦﻣ) ( ﺐﺘﻛ , ﻪﱠﻄﺧ ) (ﺐﺘﻛ ، ﻪﺒﺘﹾﻜﻳ ) ( ﺐﺘﻛ , ﻊﻤﳉﺍﻭ) )

ﺐﺘﻛ ، ﺾﻌﺑ

( ( ﺐﺘﻛ , ﻡﻻ) ( ﺐﺘﻛ , ﱟﻂﲞ ) ( ﺐﺘﻛ , ﺪﻨﻋ ) ( ﺐﺘﻛ , ﻝﺎﻗ) (ﺐﺘﻛ ، ﹰﺎﺒﺘﹶﻛ) ( ﺐﺘﻛ , ﺐﺘﹸﻛ ) )

ﹺﺦﺴﻨﻟﺍ ﺐﺘﻛ ،

( ( ﺐﺘﻛ , ﻒﻟﹶﺃ) ( ﺐﺘﻛ , ﻒﻠﺘﺨﻣ ) ( ﺐﺘﻛ , ﺩﺎﻳﺯ) ( ﺐﺘﻛ , ﻮﺑﹶﺃ) ( ﺐﺘﻛ , ﹰﺎﺑﺎﺘﻛﻭ) ( ﺐﺘﻛ , ﺐﺘﹸﻛ )

(26)

ﻪﺒﺘﻛﺃ ’aktabahu ﺏﺎﺘﻜﻟﺍ al-kitāb ﹸﺔﺒﺘﹸﻜﻟﺍ al-kutbatu

ﺐﺘﹾﻛﹶﺃ ’aktaba ﺔﺑﺎﺘﻜﻟﺍ al-kitābat ﹸﺔﺒﺘﹸﻜﻟﺍ al-kutbatu

ﺖﺒﺘﹾﻛﹶﺃ ’aktabtu ﹶﺔﺑﺎﺘﻜﻟﺍ al-kitābata ﺏﺎﺘﻜﻟﺍ al-kitāb

ﻲﹺﻨﺒﺘﹾﻛﹶﺃ ’aktibnῑ ﺔﺑﺎﺘﻜﻟﺍ al-kitābat ﹸﺔﺑﺎﺘﻜﻟﺍ al-kitābatu

ﹰﺎﺑﺎﺘﹾﻛﹺﺇ ’iktāban ﺐﻴﺗﺎﺘﻜﻟﺍ al-katātb ﺏﺎﺘﻜﻟﺍ al-kitāba

ﻪﺒﺘﻜﺘﺳﺍ ’istaktabahu ﺔﺒﺘﻜﻟﺍ al-kitbat ﹸﺔﺑﺎﺘﻜﻟﺍ al-kitābatu

ﻪﺒﺘﹾﻜﺘﺳﺍ ’istaktabahu ﺔﺒﻴﺘﻜﻟﺍ al-katbat ﺏﺎﺘﻜﻟﺍ al-kitābu

ﺎﻬﺒﺘﹾﻜﺘﺳﺍ ’istaktabahā ﺔﺒﻴﺘﹶﻛﻭ wa katbat ﹺﺏﺎﺘﻜﻟﺍ al-kitābi

ﺐﺘﺘﻛﺍ ’iktataba ﺐﺋﺎﺘﹶﻜﻟﺍ al-katā’iba ﺐﺗﺎﻜﳌﺍ al-mukātib

ﺐﺘﺘﹾﻛﺍ ’iktataba ﺐﺋﺎﺘﹶﻜﻟﺍ al-katā’ibu ﺔﺒﺗﺎﻜﳌﺍ al-mukātibat

ﻪﺒﺘﺘﹾﻛﺍ ’iktatabahu ﹸﺔﺒﻴﺘﹶﻜﻟﺍ al-katbata ﺐﺘﻜﳌﺍ al-maktab

ﻪﺒﺘﺘﹾﻛﺍ ’iktatabahu ﹸﺔﺒﻴﺘﹶﻜﻟﺍ al-katba ﺐﺘﻜﳌﺍ al-maktab

ﺎﻬﺒﺘﺘﹾﻛﺍ ’iktatabahā ﺐﺋﺎﺘﹶﻜﻟﺍ al-katā’iba ﺔﺒﺘﻜﳌﺍ al-maktabat

ﺐﺘﹾﻛﺍ ’uktub ﺔﺒﺘﹶﻜﻟﺍ al-katabat ﺔﺑﻮﺘﻜﳌﺍ al-maktūbat

ﺖﺒﺘﺘﹾﻛﺍ ’uktutibtu ﺐﺘﹶﻜﻟﺍ al-katbu ﺏﺎﺘﹸﻜﹾﻟﺍ al-kuttābu

ﻚﺑﺎﺘﺘﹾﻛﺍ ’iktitābuk ﹺﺐﺘﹶﻜﻟﺍ al-katbi ﺏﺎﺘﻜﹾﻟﺍ al-kitāba

ﻚﺑﺎﺘﺘﹾﻛﺍ ’iktitābuka ﺐﺘﹸﻜﻟﺍ al-kutabu ﹸﺔﺑﺎﺘﻜﹾﻟﺍ al-kitābatu

ﺏﺎﺘﺘﹾﻛﻻﺍ al-’iktitābu ﹸﺔﺒﻴﺘﹸﻜﻟﺍ al-kutaybatu ﺔﺑﺎﺘﻜﹾﻟﺍ al-kitābati

ﺐﺗﺎﻜﺘﻟﺍ at-takātubu ﺏﺎﺘﹸﻜﻟﺍ al-kuttāba ﺐﺘﹾﻜﻤﹾﻟﺍ al-maktabu

ﺐﺗﺎﻜﻟﺍ al-kātib ﹺﺏﺎﺘﹸﻜﻟﺍ al-kuttābi ﹸﺔﺑﻮﺘﹾﻜﻤﹾﻟﺍ al-maktūbatu

(27)

Combining into One

Broad-Coverage Lexical Resource

• After analyzing each lexicon, a combination algorithm is

applied to construct the broad-coverage lexicon.

# Lexicon Word types

[B]

Records inserted [A]

Percentage (A/B)% (A/C)%

1 lisān al-‘rab 207,992 207,992 100.00% 47.80%

2 mu’ğam al-muḥῑṭf

al-luat 74,507 61,113 82.02% 14.04%

3 tağal-‘arūs min ğawāhir

al-qāmūs 128,119 95,415 74.47% 21.93%

4 mutār a-iḥāḥ 19,540 16,573 84.82% 3.81%

5 al-muğrab ftartb

al-mu‘rab 12,396 9,805 79.10% 2.25%

(28)

The Corpus of Traditional

Arabic Lexicons

• Lexicons’ text can be used as a

corpus of traditional Arabic lexicons.

• Different domain than existing

corpora.

Number of files 247

Size 178.32 MB

Vowelized words analysis

# of words 14,369,570

# of word types 2,184,315

Non-vowelized # of words 14,369,570

• The Arabic corpus of

dictionaries covers a period of more than 1200 years.

• Consists of large number of

words and word types.

• Has both vowelized and

non-vowelized text.

Non-vowelized word analysis

# of words 14,369,570

# of word types 569,412

Partially-vowelized Non-vowelized Word Frequency Word Frequency

ﰲ 292,396 ﻦﻣ 322,239

ﻦﻣ 269,200 ﰲ 301,895

ﻝﺎﻗ 172,631 ﻝﺎﻗ 190,918

ﻭ 120,060 ﻱﺃ 132,635

ﻰﻠﻋ 108,252 ﻭ 130,809

ﺎﻣ 89,195 ﻰﻠﻋ 119,639

ﻝﺎﻗﻭ 88,233 ﺍﺫﺇ 115,842

ﻦﻋ 82,027 ﻝﺎﻗﻭ 99,601

(29)

Key notes

• When do users start using dictionaries?

• What are the users’ needs of monolingual or bilingual

dictionaries?

• What is the best arrangement methodology for Arabic

dictionaries?

• What are the technologies needed to search for a word in

dictionary?

• What linguistic information that can be provided in a

dictionary for each dictionary entry?

• How is dictionary entry connected to other dictionary entries?

• How are examples of the dictionary entry selected?

• What are the phrases, collocations, idioms that are related to

dictionary entry? 28

(30)

Conclusions

• Morphological analysis is essential technology for constructing

dictionaries.

• The Arabic morphological analyzer provides detailed information

about the processed words.

• Traditional Arabic lexicons are important because they provide the Traditional Arabic lexicons are important because they provide the

correct linguistic information about the words, examples from the Qur’an and poetry, collocation used in the past 1200 years, etc…

• Arrangement of Arabic dictionary entries is a challenge for Arabic

lexicography.

• Three types of dictionaries can be provided according the user

(31)

References

• Sawalha, MS; Atwell, ES Constructing and Using Broad-coverage Lexical Resource for Enhancing

Morphological Analysis of Arabic in: Proceedings of the Seventh conference on International

Language Resources and Evaluation (LREC'10), pp.282-287. European Language Resources

Association (ELRA). 2010.

• Sawalha, M; Atwell, ES Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic

Textin: Proceedings of the Seventh conference on International Language Resources and

Evaluation (LREC'10), pp.1258-1265. European Language Resources Association (ELRA). 2010.

• Sawalha, M; Atwell, ES ا ء!" ف $ او % " ا &'ا%( ) *%+Adapting Language

Grammar Rules for Building a Morphological Analyzer for Arabic Text)in: Proceedings of ALECSO

Arab League Educational Cultural and Scientific Organization workshop on Arabic morphological

analysis. 2009.

• Sawalha, M; Atwell, ES Linguistically Informed and Corpus Informed Morphological Analysis of

Arabicin: Proceedings of CL2009 International Conference on Corpus Linguistics. 2009.

• Sawalha, M; Atwell, E Comparative evaluation of Arabic language morphological analysers and

stemmersin: Proceedings of COLING 2008 22nd International Conference on Computational

(32)

References

Related documents

In Procambarus clarkii, it was shown by extracellular recording from the statocyst nerve that the sensory neurones attached to the hairs had the same directional sensitivity,

IT4var19 and 3173-S (DC8-EPCR binders) presented a reduction in binding to activated endothelium, ITGICAM-1 (dual CD36 and high ICAM-1 affinity) presented an increase of binding

response from separate heterogeneous voltage and optical pulses followed by a pulse 350. consisting of both the optical and voltage pulse coincident on the nanowire simultaneously (2

If commitment to hedge is important in reducing agency costs of debt and interest rate protection covenants serve as a credible commitment mechanism, we expect mandatory users to

Thus, it is evident (from Table 8.5 &amp; 8.6) that with the access to online media, readers get better access to the international news whereas with the printed form of

Stevi´c, S., Sharma, A.K., Bhat, A.: Essential norm of products of multiplication composition and differentiation operators on weighted Bergman spaces.. Zhu, X.: Essential norm

1 Abdus Salam School of Mathematical Sciences, GC University Lahore, Gulberge, Lahore 54660, Pakistan 2 Faculty of Textile Technology, University of Zagreb, 10000 Zagreb, Croatia..

Gli interventi sul sistema della mobilità sono stati orientati soprattutto al miglioramento dell’accessibilità all’area espositiva; in particolare sono tre gli