
A Czech Morphological Lexicon


Academic year: 2020


Hana Skoumalová

Institute of Theoretical and Computational Linguistics, Charles University

Celetná 13, Praha 1, Czech Republic
hana.skoumalova@ff.cuni.cz

Abstract

In this paper, a treatment of Czech phonological rules in the two-level morphology approach is described. First the possible phonological alternations in Czech are listed, and then their treatment in a practical application, a Czech morphological lexicon, is presented.

1 Motivation

In this paper I describe how I treated the phonological changes that occur in Czech conjugation, declension and derivation. My work concerned the written language, but as the spelling of Czech is based on phonological principles, most statements will be true about phonology, too.

My task was to encode an existing Czech morphological dictionary (Hajič, 1994) as a finite state transducer. The existing lexicon was originally designed for simple C programs that only attach "endings" to the "stems". The quotation marks in the previous sentence mean that the terms are not used in their linguistic meaning but rather technically: stem means any part of a word that is not changed in declension/conjugation. Ending means the real ending and possibly also another part of the word that is changed. When I started the work on converting this lexicon to a two-level morphology system, the first idea was that it should be linguistically more elegant and accurate. This required me to redesign the set of patterns and their corresponding endings. From the original number of 219 paradigms I got 159 that use 116 sets of endings. By the term paradigm I mean the set of endings that belong to one lemma (e.g. noun endings for all seven cases in both numbers) and possible derivations with their corresponding endings (e.g. possessive adjectives derived from nouns in all possible forms). That is why the number of paradigms is higher than the number of sets of endings.

In this approach, it is necessary to deal with the phonological changes that occur at boundaries between the stem and the suffix/ending or between the suffix and the ending. There are also changes inside the stem (e.g. přítel 'friend' x přátelé 'friends', or hnát 'to chase' x ženu 'I chase'), but I will not deal with them, as they are rather rare and irregular. They are treated in the lexicon as exceptions. I also will not deal with all the changes that may occur in a verb stem; this would require reconstructing the forms of the verbs back in the 14th century, which is outside the scope of my work. Instead, I work with several stems of these irregular verbs. For example the verb hnát ('to chase') has three different stems, hná- for the infinitive, žen- for the present tense, imperative and present participles, and hna- for the past participles. The verb vést ('to lead') has two stems, vés- for the infinitive and ved- for all finite forms and participles. The verb tít ('to cut') has the stem tn- in the present tense, and the stem ťa- in the past tense; the participles can be formed both from the present and the past stem. For practical reasons we work either with one verb stem (for regular verbs) or with six stems (for irregular verbs). These six stems are the stems for the infinitive, present indicative, imperative, past participle, transgressive and passive participle. In fact, there is no verb in Czech with six different stems, but this division is made because of the various combinations of endings with the stems.
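The one-stem versus six-stem convention can be pictured as a small record. The following Python sketch is only an illustration: the class name `VerbStems`, its field names and the helper methods are assumptions introduced here and do not reflect the actual format of the lexicon. The verb lovit is treated as regular with the single stem lov- purely for the sake of the example.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class VerbStems:
    """Six stem slots, in the order listed in the text (hypothetical layout)."""
    infinitive: str
    present_indicative: str
    imperative: str
    past_participle: str
    transgressive: str
    passive_participle: str

    @classmethod
    def regular(cls, stem: str) -> "VerbStems":
        # A regular verb has one stem; copy it into all six slots.
        return cls(*([stem] * 6))

    def distinct(self) -> set:
        # The set of different stems the verb actually uses.
        return set(astuple(self))


# vést 'to lead': vés- for the infinitive, ved- for all finite forms and participles.
vest = VerbStems("vés", "ved", "ved", "ved", "ved", "ved")
# lovit 'to hunt', assumed regular here, with the single stem lov-.
lovit = VerbStems.regular("lov")
```

Keeping six slots even when most of them repeat means that every set of endings can address its stem by position, which is the practical motivation given above.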

2 Types of phonological alternations in Czech

We will deal with three types of phonological alternations: palatalization, assimilation and epenthesis. Palatalization occurs mainly in declension and partly also in conjugation. Assimilation occurs mainly in conjugation. Epenthesis occurs both in declension and in conjugation.

2.1 Epenthesis

An epenthetic e occurs in a group of consonants before a Ø-ending. The final group of consonants can consist of a suffix (e.g. -k or -b) and a part of the stem; in this case the epenthesis is obligatory (e.g. kousek x kousku 'piece', malba x maleb 'painting'). In cases when the group is morphologically inseparable, the application of epenthesis depends on whether the group of consonants is phonetically admissible at the end of a word. In loan words, the epenthetic e may occur if the final group of consonants resembles a Czech suffix (e.g. korek x korku 'cork', but alba x alb 'alb'). In declension, two situations can occur:

• The base form contains an epenthetic e; the rule has to remove it if the form has a non-Ø ending, e.g. chlapec 'boy', chlapci dative/locative sg or nominative pl.

• The base form has a non-Ø ending; the rule has to insert an epenthetic e if the ending is Ø, e.g. chodba 'corridor', chodeb genitive pl.

In conjugation, an epenthetic e occurs in the past participle, masculine sg, of the verb jít 'to go' (and its prefixed derivations): šel 'he-gone', šla 'she-gone', šlo 'it-gone'. The rule has to insert an epenthetic e if the form has a Ø-ending.
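The two declension situations can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism used in the lexicon: the function names are hypothetical, and the sketch assumes the caller already knows that the word is subject to epenthesis (in the lexicon this knowledge is carried by special markers).

```python
def insert_epenthetic_e(stem: str) -> str:
    """Insert e before the last consonant of a stem that takes a zero ending."""
    return stem[:-1] + "e" + stem[-1]


def delete_epenthetic_e(base: str) -> str:
    """Remove the e standing before the final consonant of the base form."""
    if len(base) >= 2 and base[-2] == "e":
        return base[:-2] + base[-1]
    return base


# chodba 'corridor': stem chodb- with a zero ending (genitive pl)
genitive_pl = insert_epenthetic_e("chodb")
# chlapec 'boy': base form with epenthetic e, followed by the ending -i
dative_sg = delete_epenthetic_e("chlapec") + "i"
# kousek 'piece': base form with epenthetic e, followed by the ending -u
genitive_sg = delete_epenthetic_e("kousek") + "u"
```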

2.2 Palatalization and assimilation

Palatalization or assimilation at the morpheme boundaries occurs when an ending/suffix starts with a soft vowel. The alternations are different for different types of consonants. The types of consonants and vowels are as follows:

• hard consonants: d, (g,) h, ch, k, n, r, t

• soft consonants: c, č, ď, j, ň, ř, š, ť, ž

• neutral consonants: b, l, m, p, s, v, z

• hard vowels: a, á, e, é, o, ó, u, ú, y, ý and the diphthong ou

• soft vowels: ě, i, í

The vowel ó cannot occur in the ending/suffix, so it will not be interesting for us. I also will not discuss what happens with the 'foreign' consonants f, q, w and x; they would be treated as v, k, v and s, respectively. The only borrowing from foreign languages that I included in the above lists is g. This sound existed in Old Slavonic but in Czech it changed into h. However, when new words with g were later adopted from other languages, this sound behaved phonologically as h (e.g. hloh, hlozích, from Common Slavonic glog 'hawthorn', and katalog, katalozích 'catalog').

The phonological alternations are reflected in writing, with one exception: if the consonants d, n and t are followed by a soft vowel, they are palatalized, but the spelling is not changed:

spelling:  dě, di    phonology:  /ďe/, /ďi/
           ně, ni                /ňe/, /ňi/
           tě, ti                /ťe/, /ťi/

In other cases the spelling reflects the phonology. In what follows I will use { } for the morpho-phonological level, / / for the phonological level and no brackets for the orthographical level. In the cases where the orthography and phonology are the same I will only use the orthographical level. Let us look at the possible types of alternation of consonants:

• Soft consonant and ě: The soft consonant is not changed, the soft ě is changed to e. {číčě} → číče 'pussycat' dative sg

• Soft or neutral consonant and i/í: No alternations occur.


• Hard consonant and a soft vowel: The alternations differ depending on when and how the soft vowel originated.

Assimilation:

- {kj} → č
  tlak 'pressure' → tlačen 'pressed'

- {hj} → ž
  mnoho 'much, many' → množení 'multiplying'

- {gj} → ž
  It is not easy to find an example of this sort of alternation, as g only occurs in loan words that do not use the old types of derivation. In colloquial speech it would perhaps be possible to create the following form:
  pedagog 'teacher' → pedagožení 'working as a teacher'

- {dj} → z
  sladit 'to sweeten' → slazení 'sweetening'
  This sort of alternation is not productive any more; in newer words palatalization applies:
  sladit 'to tune up' → sladění 'tuning up'
  In some cases both variants are possible, or the different variants exist in different dialects: the eastern (Moravian) dialects tend to keep this phonological alternation, while the western (Bohemian) dialects have often abandoned it.

- {tje} → ce
  platit 'to pay' → placení 'paying'
  This alternation is also not productive any more. The newest word that I found which shows this sort of phonological alternation is fotit 'to take a photo' → focení 'taking a photo'.

Palatalization:

During the historical development of the language several sorts of palatalization occurred: the first and second Slavonic palatalization and further Czech palatalizations.

- {kě/ki} → če/či (1st palat.)
  matka 'mother' → matčin possessive adjective

- {kě/ki} → ce/ci (2nd palat.)
  matka → matce dative/locative sg

- {hě/hi} → že/ži (1st palat.)
  Bůh 'God' → Bože vocative sg

- {hě/hi} → ze/zi (2nd palat.)
  Bůh → Bozi nominative/vocative pl

- {gě/gi} → že/ži (1st palat.)
  Jaga, a witch from Russian tales → Jažin possessive adjective

- {gě/gi} → ze/zi (2nd palat.)
  Jaga → Jaze dative/locative sg

- {dě} → /ďe/ → dě
  rada 'council' → radě dative/locative sg

- {tě} → /ťe/ → tě
  teta 'aunt' → tetě dative/locative sg

Both palatalization and assimilation yield the same result:

- {ch} → š
  moucha 'fly' → mouše dative/locative sg, muší derived adjective

- {n} → /ň/
  hon 'chase' → honit 'to chase', honěný 'chased'

- {r} → ř
  var 'boil' → vařit 'to cook', vaření 'cooking'

• Neutral consonant and ě: The alternations differ depending on when and how the soft vowel originated.

Assimilation:

- {bje} → be
  zlobit 'to irritate' → {zlobjení} → zlobení 'irritating'

- {mje} → me
  zlomit 'to break' → {zlomjený} → zlomený 'broken'

- {pje} → pe

- {vje} → ve
  lovit 'to hunt' → {lovjení} → lovení 'hunting'

- {sje} → še
  prosit 'to ask' → {prosjení} → prošení 'asking'
  This type of assimilation is not productive any more. In newer derivations {sje} → se (e.g. kosit 'to mow' → kosení 'mowing').

- {zje} → že
  kazit 'to spoil' → {kazjení} → kažení 'spoiling'
  This type of assimilation is also not productive any more. In newer derivations {zje} → ze (e.g. řetězit 'to concatenate' → řetězení 'concatenating').

Palatalization:

With b, m, p and v no alternation occurs ({vrbě} 'willow' dative/locative sg → vrbě).

- {sě} → se
  vosa 'wasp' → {vosě} → vose dative/locative sg

- {zě} → ze
  koza 'goat' → {kozě} → koze dative/locative sg

Both palatalization and assimilation yield the same result:

- {lje} → le
  školit 'to school' → {školjení} → školení 'schooling'

- {lě} → le
  škola 'school' → {školě} → škole dative/locative sg

• Group of hard consonants and a soft vowel: Here again either palatalization or assimilation occurs.

Assimilation:

- {stj} → /šť/
  čistit 'to clean' → čištění 'cleaning'

- {slj} → šl
  myslit 'to think' → myšlení 'thinking'

Palatalization:

- {sk} → /šť/
  kamarádský 'friendly' → kamarádští masculine animate, nominative pl, kamarádštější 'more friendly'

- {ck} → /čť/
  čacký 'brave' → čačtí masculine animate, nominative pl, čačtější 'braver'

- {čk} → /čť/
  žluťoučký 'yellowish' → žluťoučtější 'more yellowish', but žluťoučcí masculine animate, nominative pl

The alternations also affect the vowel ě. When it causes palatalization or assimilation of the previous consonant, it loses its 'softness', i.e. ě → e:

{matkě} → matce
{sestrě} → sestře
{školě} → škole
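The behaviour of a single stem-final consonant before {ě} can be summarized as a lookup table. The Python sketch below is a simplified illustration restricted to the cases listed above (single consonants and the digraph ch before ě); the function name `attach_soft_e` and the table names are assumptions made for this example, and consonant groups such as sk or ck are deliberately left out.

```python
# Outcomes for k, h, g before a soft vowel, as listed in the text.
FIRST_PALATALIZATION = {"k": "č", "h": "ž", "g": "ž"}
SECOND_PALATALIZATION = {"k": "c", "h": "z", "g": "z"}
# r has a single outcome; ch → š is handled separately because it is a digraph.
SINGLE_OUTCOME = {"r": "ř"}
# s, z, l and the soft consonants stay unchanged, but ě still becomes e.
CONSONANT_KEPT = set("szl") | set("cčďjňřšťž")


def attach_soft_e(stem: str, palatalization: int = 2) -> str:
    """Return the spelling of stem + {ě} for a stem ending in a consonant."""
    if stem.endswith("ch"):
        return stem[:-2] + "še"
    head, last = stem[:-1], stem[-1]
    table = FIRST_PALATALIZATION if palatalization == 1 else SECOND_PALATALIZATION
    if last in table:
        return head + table[last] + "e"
    if last in SINGLE_OUTCOME:
        return head + SINGLE_OUTCOME[last] + "e"
    if last in CONSONANT_KEPT:
        return stem + "e"
    # d, t, n are palatalized in pronunciation only, and b, m, p, v do not
    # alternate: in both cases the spelling keeps ě.
    return stem + "ě"
```

For instance, `attach_soft_e("matk")` gives matce and `attach_soft_e("rad")` gives radě, matching the dative/locative forms quoted above.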

3 Phenomena treated by two-level rules in the Czech lexicon

As the Czech lexicon should serve practical applications, I did not try to solve all the problems that occur in Czech phonology. I concentrated on dealing with the alternations that occur in declension and regular conjugation, and the most productive derivations. The rest of the alternations occurring in conjugation are treated by inserting several verb stems in the lexicon. The list of alternations and other changes covered by the rules:

• epenthesis

• palatalization in declension

• palatalization in conjugation

• palatalization in derivation of feminine nouns from masculines

• palatalization in derivation of possessive adjectives

• palatalization in derivation of adverbs

• palatalization in derivation of comparatives

• palatalization or assimilation in derivation of passive participles

• shortening of the vowel in suffixes -ík (in derivation of a feminine noun from a masculine) and -ův (in declension of possessive adjectives)

For the Czech lexicon I used the software tools for two-level morphology developed at Xerox (Karttunen and Beesley, 1992; Karttunen, 1993). The lexical forms are created by attaching the proper ending/suffix to the base form in a separate program. To help the two-level rules find where they should operate, I also marked morpheme boundaries by special markers. These markers have two further functions:

• They bear the information about the length of the ending (or suffix and ending) of the base form, i.e. how many characters should be removed before attaching the ending.

• They bear the information about the kind of alternation.

Besides the markers for morpheme boundaries I also use markers for an epenthetic e. As I said before, e is inserted before the last consonant of a final consonant group if the last consonant is a suffix, or if the consonant group is not phonetically admissible. However, as I generally deal neither with derivation nor with phonetics, I am not able to recognize what is a suffix and what is phonetically admissible. That is why I need these special markers.

Another auxiliary marker is used for marking the suffix -ík, which needs special treatment in the derivation of feminine nouns and their possessive adjectives. The long vowel í must be shortened in the derivation, and the final k must be palatalized even if a Ø-ending follows. I need a special marker, as -ík- allows two realizations for both sounds in the same contexts:

Two realizations of í

úředník 'clerk' → úřednice 'she-clerk', but rybník 'pond' → rybníce locative sg

Two realizations of k

úředník x úřednic (genitive pl of the derived feminine)

In the previous section, I described all possible alternations concerning single consonants. When I work with the paradigms or with the derivations, it is necessary to specify the kind of the alternation for all consonants that can occur at the boundary. For this purpose I introduced four types of markers:

^1P: 1st palatalization for g, h and k, or the only possible (or no) palatalization for other consonants. I use this marker also for the palatalization c → č in the vocative sg of the paradigm chlapec. The final c is in fact a palatalized k, so there is even a linguistic motivation for this.

^2P: 2nd palatalization for g, h and k, or the only possible (or no) palatalization for other consonants.

^A: Assimilation (or nothing).

^N: No alternation.

These markers are followed by a number that denotes how many characters of the base form should be removed before attaching the ending/suffix. Thus there are markers ^1P0, ^2P0, ^1P1, etc. The markers starting with ^N only denote the length of the ending of the base form, and instead of using ^N0 I attach the suffix/ending directly to the base form. Fortunately, nearly all paradigms and derivations cause at most one type of alternation, so it is possible to use one marker for the whole paradigm.

The markers for an epenthetic e are ^E1 (for an e that should be deleted) and ^E2 (for an e that should be inserted). The marker for the suffix -ík in derivations is ^IK.
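To make the marker format concrete, the following Python sketch splits a lexical string into morphemes using only the information carried by the markers. It is an illustration under stated assumptions: the function name `morphemes` and the two regular expressions are introduced here, and cutting characters by count is a simplification, since in the lexicon the removal is performed by two-level rules rather than by counting.

```python
import re

# Boundary markers: kind of alternation followed by the number of
# characters of the base form to remove (e.g. ^1P1, ^2P0, ^N2).
BOUNDARY = re.compile(r"\^(1P|2P|A|N)(\d)")
# Auxiliary markers for epenthesis and for the suffix -ík.
AUXILIARY = re.compile(r"\^(?:E1|E2|IK)")


def morphemes(lexical: str) -> list:
    """Split a lexical string at boundary markers, trimming the base form."""
    text = AUXILIARY.sub("", lexical)
    parts, pos = [], 0
    for match in BOUNDARY.finditer(text):
        chunk = text[pos:match.start()]
        cut = int(match.group(2))
        parts.append(chunk[:len(chunk) - cut])
        pos = match.end()
    parts.append(text[pos:])
    return parts


segments = morphemes("doktorka^1P1in^2P0ých")
```

Joining `segments` with hyphens gives doktork-in-ých, the sequence of morphemes before any alternation is applied.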

Here are some examples of lexical items and the rules that transduce them to the surface form:

(1) doktorka^1P1in^2P0ých

[...] be removed from the word form and the possible alternation concerns k). The marker ^2P0 means that the derived possessive adjective has a Ø-ending and the possible alternation at this morpheme boundary is palatalization. If we rewrite this string as a sequence of morphemes we get the following string: doktork-in-ých. The sound k in front of i is palatalized, so the correct final form is doktorčiných, which is the genitive plural of the possessive adjective derived from the word doktorka.

Let us now look at the two-level rules that transduce the lexical string to the surface string. We need four rules in this example: two for deleting the markers, one for deleting the ending -a, and one for palatalization. The rules for deleting auxiliary markers are very simple, as these markers should be deleted in any context. The rules can be included in the definition of the alphabet of symbols:

    Alphabet
    %^1P0:0 %^1P1:0
    %^2P0:0 %^2P1:0 %^2P2:0 %^2P3:0 %^A2:0
    %^N1:0 %^N2:0 %^N3:0 %^N4:0 %^E1:0 %^E2:0 %^IK:0

This notation means that the auxiliary markers are always realized as zeros on the surface level.

The rule for deleting the ending -a looks as follows:

    "Deletion of the ending -a-"
    a:0 <=> _ [ %^N1: | %^1P1: | %^2P1: ] ;
            _ t: [ %^N2: | %^N4: ] ;

The first line of the rule describes the context of a one-letter nominal ending a, and the second line describes the context of an infinitive suffix with the ending -at or -ovat.

The rule for palatalization k → č looks as follows:

    "First palatalization k -> č"
    k:č <=> _ (%^IK:) [ a: | ě: ] %^1P1: i ;
            NonCCS: _ (End) %^1P1: ě: ;

The first line describes two possible cases: either the derivation of a possessive adjective from a feminine noun (doktorka → doktorčin), or the derivation of a possessive adjective from a feminine noun derived from a masculine that ends with -ík (úředník → (úřednice →) úředničin).

The second context describes a comparative of an adjective, or a comparative of an adverb derived from that adjective (hořký → hořčejší, hořčeji). The set NonCCS contains all characters except c, č and s and it is defined in a special section. This context condition is introduced because the groups of consonants ck, čk and sk have a different 1st palatalization.

The label End denotes any character that can occur in an ending and that is removed from the base form.
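The combined effect of the rules needed for example (1) can be imitated with ordinary regular expressions. The Python sketch below is only an approximation made for illustration: two-level rules apply in parallel, whereas the substitutions here run one after another, and only the doktorka case of the palatalization rule is covered (no ^IK and no comparative context). The function name `surface_form` is hypothetical.

```python
import re


def surface_form(lexical: str) -> str:
    """Approximate the rules for example (1) by sequential substitutions."""
    # k:č in the context _ a ^1P1 i (applied first, because the context
    # refers to the lexical a that the next step deletes).
    s = re.sub(r"k(?=a\^1P1i)", "č", lexical)
    # a:0 in the context _ [ ^N1 | ^1P1 | ^2P1 ] (one-letter nominal ending).
    s = re.sub(r"a(?=\^(?:N1|1P1|2P1))", "", s)
    # All auxiliary markers are realized as zero on the surface.
    s = re.sub(r"\^(?:(?:1P|2P|A|N)\d|E1|E2|IK)", "", s)
    return s


genitive_pl = surface_form("doktorka^1P1in^2P0ých")
```

Here `genitive_pl` is doktorčiných, the form discussed in example (1); only the second k is changed, because the first k of doktorka is not followed by the ending and the marker.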

(2) korek^2P0^E1em

The base form of this word form is korek 'cork'; the marker ^2P0 means that the possible alternation is (second) palatalization and that the length of the ending of the base form is 0. The marker ^E1 means that the base form contains an epenthetic e, and em is the ending of the instrumental singular. The correct final form is korkem. The rule for deleting an (epenthetic) e follows:

    "Deletion of e"
    e:0 <=> Cons _ c: %^N2: ;
            _ [ %^1P1: | %^2P1: | %^N1: | %^N2: ] ;
            Cons _ Cons: ( [ %^1P0: | %^2P0: ] ) %^E1: Vowel: ;
            _ t:0 [ %^2P2: | %^N2: ] ;

The first line describes the context for deletion of the suffix -ec in the derivation of the type vědec 'scientist' → vědkyně 'she-scientist'. The second context is the context of the ending -e or the suffix -ce. This suffix must be removed in the derivation of the type soudce 'judge' → soudkyně 'she-judge'.

The third context is the context of an epenthetic e that is present in the base form and must be removed from a form with a non-Ø ending. The sets Cons and Vowel contain all consonants and all vowels, respectively.

The fourth line describes the context for deletion of the infinitive ending -et.
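The third context of the rule, the one that matters for example (2), can be imitated in Python as follows. This is a sketch under assumptions: the character inventories `CONS` and `VOWEL` are written out here for the example and may differ from the sets Cons and Vowel defined in the actual rule file, and the other three contexts of the rule are not covered.

```python
import re

# Assumed inventories for the sketch (not taken from the rule file).
CONS = "bcčdďfghjklmnňpqrřsštťvwxzž"
VOWEL = "aáeéěiíoóuúůyý"

# e between consonants, followed by an optional ^1P0/^2P0 marker,
# the marker ^E1 and an ending that starts with a vowel.
EPENTHETIC_E = re.compile(
    rf"(?<=[{CONS}])e(?=[{CONS}](?:\^[12]P0)?\^E1[{VOWEL}])"
)
MARKERS = re.compile(r"\^(?:(?:1P|2P|A|N)\d|E1|E2|IK)")


def surface(lexical: str) -> str:
    """Delete a marked epenthetic e before a vowel-initial ending."""
    without_e = EPENTHETIC_E.sub("", lexical)
    return MARKERS.sub("", without_e)


instrumental_sg = surface("korek^2P0^E1em")
nominative_sg = surface("korek^2P0^E1")
```

With an ending that starts with a vowel the e is dropped (korkem); with no ending after the markers the context is not satisfied and the base form korek is kept.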

The whole program contains 35 rules. Some of the rules concern morphology rather than phonology, namely the rules that remove endings or suffixes. One rule is purely technical; it is one of the two rules for the alternation ch → š, as c and h must be treated separately (though ch is considered one letter in the Czech alphabet). Six rules are forced by the Czech spelling rules (e.g. rules for treating /ď/, /ť/ and /ň/ in various contexts, or a rule for rewriting y → i after soft consonants). 18 rules deal with the actual phonological alternations and they cover the whole productive phonological system of the Czech language. The lexicon using these rules was tested on a newspaper text containing 2,978,320 word forms, with the result of more than 96% analyzed forms.

4 Acknowledgements

My thanks to Ken Beesley, who taught me how to work with the Xerox tools, and to my father, Jan Skoumal, for fruitful discussions on the draft of this paper.

References

Jan Hajič. 1994. Unification Morphology Grammar. Ph.D. dissertation, Faculty of Mathematics and Physics, Charles University, Prague.

Josef Holub and Stanislav Lyer. 1978. Stručný etymologický slovník jazyka českého (Concise etymological dictionary of Czech language). SPN, Prague.

Lauri Karttunen and Kenneth R. Beesley. 1992. Two-Level Rule Compiler. Xerox Palo Alto Research Center, Palo Alto.

Lauri Karttunen. 1993. Finite-State Lexicon Compiler. Xerox Palo Alto Research Center, Palo Alto.

Kimmo Koskenniemi. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Publication No. 11, University of Helsinki.

Arnošt Lamprecht, Dušan Šlosar, and Jaroslav Bauer. 1986. Historická mluvnice češtiny (Historical Grammar of Czech). SPN, Prague.

Jan Petr et al. 1986. Mluvnice češtiny (Grammar of Czech). Academia, Prague.

Jana Weisheitelová, Květa Králíková, and Petr Sgall. 1982. Morphemic Analysis of Czech. No. VII in Explizite Beschreibung der Sprache und automatische Textbearbeitung, Faculty of Mathematics and Physics, Charles University, Prague.
