Structural Lexical Heuristics in the Automatic Analysis of Portuguese


Academic year: 2020



Eckhard Bick

Department of Linguistics, Århus University, Willemoesgade 15 D, DK-8200 Århus N

tel: +45 - 8942 2131, fax: +45 - 8628 1397, e-mail: lineb@hum.aau.dk
http://visl.hum.ou.dk/Linguistics.html

Abstract

The paper discusses, on the lexical level, the integration of heuristic solutions into a lexicon based and rule governed system for the automatic analysis of unrestricted Portuguese text. In particular, a morphology based analytic approach to lexical heuristics is presented and evaluated.

The tagger involved uses a 50,000 entry base form lexicon as well as prefix-, suffix- and inflexion endings lexica to assign part of speech and other morphological tags to every wordform in the text, with recall rates between 99.6% and 99.7%. Multiple readings are subsequently disambiguated by using grammatical rules formulated in the Constraint Grammar formalism. On the next level of analysis, tags for syntactical form and function alternatives are mapped onto the wordforms and disambiguated in a similar way. In spite of using a highly differentiated tag set, the parser yields correctness rates - on running unrestricted and unknown text - of over 99% for morphology/PoS and 97-98% for syntax. A test site with a variety of applications (parsing, corpus searches, interactive grammar teaching and - experimental - MT) has been established at http://visl.hum.ou.dk/Linguistics.html.

1. Background

In corpus linguistics, most systems of automatic analysis can be classified by measuring them against the bipolarity of rule based versus probabilistic approaches. Thus Karlsson (1995) distinguishes between "pure" rule based or probabilistic systems, hybrid systems and compound systems, i.e. rule based systems supplemented with probabilistic modules, or probabilistic systems with rule based "bias" or postprocessing. As a second parameter, lexicon dependency might be added, since both rule based and probabilistic systems differ internally as to how much use they make of extensive lexica, both in terms of lexical coverage and granularity of lexical information.

The constraint grammar (CG) formalism (e.g. Karlsson et al., 1995), which I have been using in my own system¹ for the automatic analysis of unrestricted Portuguese text (Bick, 1996 [1] and 1997 [2]), is both rule governed and lexicon based, focusing on disambiguation of multiply assigned lexical and structural readings as the main tool of analysis. Readings are expressed as sets of word based modular tags. Syntactic structure is covered by using function tags and dependency markers (Bick, 1997 [1]), but I will here concentrate on the lexico-morphological level. Before any Constraint Grammar rules can apply, all (morphologically) possible readings have to be identified, and I have to this end developed a preprocessor that identifies word forms, polylexical units and sentence boundaries, as well as a morphological analyser for Portuguese using an adapted electronic version of a previously published dictionary (Bick, 1993) in combination with affix- and inflection endings lexica supplemented by corresponding alternation rules for word formation (Bick, 1995). In the analyser's output, every word form is followed by as many tag lines as there are potential readings:

(1) "<revista>"
        "revista" <+n> <CP> <rr> N F S
        "revestir" <vt> <de^vtp> <de^vrp> V PR 1/3S SUBJ VFIN
        "revistar" <vt> V IMP 2S VFIN
        "revistar" <vt> V PR 3S IND VFIN
        "rever" <vt> <vi> V PCP

In CG terms, such an ambiguous list of readings is called a cohort. In the example, the word form 'revista' has one noun reading (feminine singular) and four (!) verb readings, the latter covering three different base forms, with subjunctive, imperative, indicative present tense and participle readings. Conventionally, PoS and morphological features are regarded as primary tags and coded by capital letters. In addition there can be secondary lexical information about valency and semantical class, marked by bracketing.
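
As an illustration of these tag conventions, the following Python sketch reads analyser output of the kind shown in (1) into a cohort and separates bracketed secondary tags from capitalised primary tags. It is only a reading aid: the names `Reading`, `parse_reading` and `parse_cohort` are hypothetical and not part of the system described here, and the participle tag is assumed to be written PCP.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Reading:
    base: str                                       # base form, e.g. "revistar"
    secondary: list = field(default_factory=list)   # bracketed tags: valency, semantics
    primary: list = field(default_factory=list)     # capitalised tags: PoS, morphology


def parse_reading(line):
    """Split one tag line into base form, secondary tags and primary tags."""
    match = re.match(r'\s*"([^"]+)"\s*(.*)', line)
    if match is None:
        raise ValueError(f"not a reading line: {line!r}")
    base, rest = match.groups()
    secondary = re.findall(r"<([^>]+)>", rest)
    primary = re.sub(r"<[^>]+>", " ", rest).split()
    return Reading(base, secondary, primary)


def parse_cohort(text):
    """Return (word form, list of readings) for one analysed word."""
    lines = [line for line in text.splitlines() if line.strip()]
    wordform = lines[0].strip().strip('"<>')
    return wordform, [parse_reading(line) for line in lines[1:]]


EXAMPLE = '''
"<revista>"
    "revista" <+n> <CP> <rr> N F S
    "revestir" <vt> <de^vtp> <de^vrp> V PR 1/3S SUBJ VFIN
    "revistar" <vt> V IMP 2S VFIN
    "revistar" <vt> V PR 3S IND VFIN
    "rever" <vt> <vi> V PCP
'''

wordform, cohort = parse_cohort(EXAMPLE)
verb_readings = [r for r in cohort if r.primary[0] == "V"]
```

Representing each reading as a separate record mirrors the CG view that disambiguation operates on whole cohort lines rather than on individual tags.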

A constraint grammar rule brings the ambiguity problem to the foreground by specifying which reading (out of a cohort of ambiguous readings for a given word) is impossible (and thus to be discarded) or mandatory (and thus to be chosen) in a given sentence-context. For instance, a rule might discard a finite verb reading after a preposition (2a), or when another - unambiguous - finite verb is already found in the same clause, with no coordinators present (2b).²

(2a) REMOVE (VFIN) IF (-1 PRP)

[discard finite verb readings (VFIN) if the first word to the left (-1) is a preposition (PRP)]

(2b) REMOVE (VFIN) IF (*1C VFIN BARRIER CLB OR KC) (NOT *-1 CLB-WORD)

[discard VFIN if there is another unambiguous (C) finite verb (VFIN) anywhere to the right (*1) with no clause-boundary (CLB) and coordinating conjunction (KC) interfering (BARRIER). Discard only if there is no subordinator (CLB-WORD) anywhere to the left (*-1)]
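
The effect of a rule like (2a) can be sketched in a few lines of Python. This is a simplified illustration and not the compiled finite state parser used in the system: readings are modelled as plain sets of tags, the function names are hypothetical, and the context test (-1 PRP) is assumed to hold as soon as one reading of the word to the left carries the PRP tag. The sketch also shows the safeguard that the last reading of a word is never discarded, and the repeated application of the rule set until nothing changes.

```python
def remove_vfin_after_prp(sentence):
    """Simplified rule (2a): REMOVE (VFIN) IF (-1 PRP).

    `sentence` is a list of cohorts; each cohort is a list of readings,
    and each reading is a set of tags. Returns True if a reading was removed.
    """
    changed = False
    for i in range(1, len(sentence)):
        left, cohort = sentence[i - 1], sentence[i]
        # Context (-1 PRP): some reading of the word to the left is a preposition.
        if not any("PRP" in reading for reading in left):
            continue
        survivors = [r for r in cohort if "VFIN" not in r]
        # Never discard the last surviving reading of a word.
        if survivors and len(survivors) < len(cohort):
            cohort[:] = survivors
            changed = True
    return changed


def run_until_stable(sentence, rules):
    """Re-apply all rules until no rule removes a reading any more."""
    while any([rule(sentence) for rule in rules]):
        pass
    return sentence


# A preposition followed by the ambiguous word form 'revista' from example (1).
sentence = [
    [{"PRP"}],
    [
        {"N", "F", "S"},
        {"V", "PR", "1/3S", "SUBJ", "VFIN"},
        {"V", "IMP", "2S", "VFIN"},
        {"V", "PR", "3S", "IND", "VFIN"},
        {"V", "PCP"},
    ],
]
run_until_stable(sentence, [remove_vfin_after_prp])
```

After the run, only the noun reading and the participle reading of 'revista' remain; choosing between them would be left to further rules.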

2 Ordinarily, this disambiguation process works on whole cohort lines, i.e. distinguishes between PoS, base form and inflection, but tolerates competing valency options. However, on a higher level of analysis, I have introduced valency and semantical disambiguation, too. This can be very useful for polysemy resolution, as in "rever", where the transitive <vt> - intransitive <vi> distinction has a meaning correlate: 'tornar a ver' [see again] vs. 'transudar' [leak through]. Likewise, "revista" followed by a name <+n> or being read (semantical class <rr>) is more likely to be a newspaper than an inspection (semantical class <CP> for action; +CONTROL, +PERFECTIVE).

With current software, before an analysis run, all rules are translated into a finite state network by a compiler program, yielding the actual parser. The Portuguese grammar was originally written in the formalism suggested by Pasi Tapanainen's first compiler implementation, but later rewritten to match the notation used in his new CG-2 parser compiler (Tapanainen, 1996).

By applying the rule set several times, the parser renders more and more words in the sentence unambiguous, and in the end, only one reading is left for every word. Since the individual rule can be made very "cautious" by adding more context conditions, and since the last surviving reading will never be discarded, the formalism is very robust. Even imperfect input will yield some parse. Unlike probabilistic systems, where "manual interference" as in the introduction of bias on behalf of irregular phenomena often has an adverse side-effect on the overall performance of the parser (due to interference with the ordinary statistical "rules" based on the regular "majority" phenomena), Constraint Grammar tolerates and even encourages the incremental "piecemeal" addition of exceptions and context conditions for individual rules (for a comparison of statistical and constraint-based methods see Chanod & Tapanainen, 1994).

2. System Performance

If they can be made to work on free text, rule based systems can achieve very low error rates. While state-of-the-art probabilistic taggers still have error rates of over three percent³ even for PoS tagging, CG based systems fare somewhat better. For English, word class error rates of under 0.3% have been reported at a disambiguation level of 94-97% (Voutilainen, 1992). For my own Portuguese CG system, test runs with near 100% disambiguation on fiction and news texts suggest a correctness rate of over 99% for morphology and part of speech, when analysing unknown unrestricted text⁴. For syntax the figures are 98% for classical literary prose (Eça de Queiroz, "O tesouro") and 97% for the more inventive "journalese" of news magazine texts (VEJA, 9.12.1992), as shown in table (3):

3 Compare, for English, (Garside et al., 1987) on the HMM based CLAWS system, (Francis and Kucera, 1992) on recovering PoS tags from the Brown corpus, Ratnaparkhi's maximum-entropy tagger trained on the Penn Treebank (Marcus et al., 1993) or Brill's stochastic tagger using automated learning (Brill, 1992). For German the Morphy system described in (Lezius et al., 1996) achieved an accuracy of 95.9%.

(3) System performance on the PoS and syntactic levels:

Text:                            O tesouro              VEJA 1                 VEJA 2
                                 ca. 2500 words         ca. 4800 words         ca. 3140 words
Error types:                     errors  correctness    errors  correctness    errors  correctness
Part-of-speech errors            16                     15                     24
Base-form & flexion errors       1                      2                      2
All morphological errors         17      99.3%          17      99.7%          26      99.2%
syntactic: word & phrases        54                     118                    101
syntactic: subclauses            10                     11                     13
All syntactic errors             64      97.4%          129     97.3%          114     96.4%
"local" syntactic errors due to
PoS/morphological errors         -27                    -23                    -28
Purely syntactic errors          37      98.5%          106     97.8%          86      97.3%
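
The correctness figures in table (3) appear to be one minus the ratio of errors to words. The following Python sketch recomputes them under that assumption. Since the word counts are only given approximately ("ca."), a recomputed value may differ from the reported one in the last digit; the variable and function names are chosen for illustration only.

```python
# Approximate text sizes and error counts, copied from table (3).
WORDS = {"O tesouro": 2500, "VEJA 1": 4800, "VEJA 2": 3140}

ERRORS = {
    "All morphological errors": {"O tesouro": 17, "VEJA 1": 17, "VEJA 2": 26},
    "All syntactic errors": {"O tesouro": 64, "VEJA 1": 129, "VEJA 2": 114},
    "Purely syntactic errors": {"O tesouro": 37, "VEJA 1": 106, "VEJA 2": 86},
}

# Correctness rates as reported in table (3), in percent.
REPORTED = {
    "All morphological errors": {"O tesouro": 99.3, "VEJA 1": 99.7, "VEJA 2": 99.2},
    "All syntactic errors": {"O tesouro": 97.4, "VEJA 1": 97.3, "VEJA 2": 96.4},
    "Purely syntactic errors": {"O tesouro": 98.5, "VEJA 1": 97.8, "VEJA 2": 97.3},
}


def correctness(errors, words):
    """Percentage of words that did not receive an erroneous analysis."""
    return 100 * (1 - errors / words)


for row, per_text in ERRORS.items():
    for text, errors in per_text.items():
        value = correctness(errors, WORDS[text])
        print(f"{row:26s} {text:10s} {value:6.2f}%  (reported {REPORTED[row][text]}%)")
```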

3. Lexico-morphological heuristics

Yet even in a rule based CG system, heuristics can be quite useful (for English, see Karlsson et al., 1995). Thus rules are usually grouped according to their "safety", i.e. their statistical tendency to make errors. Less safe rules can be added as a heuristic level on top of a kernel of safe rules, and will be applied after these. Also, statistically inspired "rarity tags" (<Rare>) can be added to certain less probable readings in the lexicon, and then referred to by contextual disambiguation rules. A third field for the application of heuristics is on the analyser level, i.e. concerns the (lexico-morphological) input of the disambiguation rule system. It is this third type of heuristics I am concerned with here.

Since the higher levels of the parsing system (for example, PoS and syntax) are technically rule based disambiguators, they need some reading for every word to work on, which is why even unanalyzable word forms (i.e. word forms that cannot be reduced to a root found in the analyser's lexicon) need to be given one or more heuristic readings with regard to word class and flexion morphology. The majority of such cases is accounted for by unknown proper nouns (1-2% of all words, depending on text type), and can be handled by assigning heuristic PROP tags to all capitalised words in certain contexts, and then adding any competing analytical analysis, especially in the case of sentence initial position, leaving the final decision to the rule based disambiguation module. This way, though 80% of all nouns in my corpus need heuristical treatment, the error rate for the PROP class as a whole can be kept at 2%, not too far from the tagger's overall PoS error rate (< 1%).

3.1 "Unanalyzable words"⁵: typology and statistics

Though accounting for only 0.3-0.5% of word forms in running text, other types of analysis failures (i.e. word forms that cannot be reduced to a root found in the analyser's lexicon) are more difficult to handle, due to their functional diversity and the lack of a clear morphological marker. Table (4) provides an error typology for a 131,981 word literature and secondary literature corpus (the RNP depository of Brazilian literature), containing 604 unanalyzable words in the test run.

Three main groups may be distinguished, each comprising roughly one third of the cases:

a) orthographical errors (shaded in the table, and partially corrected before heuristics proper by an accent module recognising regular regional spelling variations)

b) unknown and underivable Portuguese words or abbreviations

c) unknown foreign loan words

(4) Error types in "unanalyzable" words:

DOMAIN                                            NUMBER OF TOKENS   PERCENTAGE
Foreign                                           232                38.4
orthographic variation (European/accentuation)    125                20.7         Correctables
other port. orthographic                          74                 12.3         Misspellings
non-capitalised names and abbreviations           37                 6.1          Encyclopaedic lexicon failures
    names and name roots                          18                 3.0
    abbreviations                                 19                 3.1
root not found in lexicon                         119                19.7         Core lexicon failures
    found in Aurelio⁶                             91                 15.1
    not found in Aurelio                          28                 4.6
derivation/flexion problem                        15                 2.5          Affix lexicon failures
    suffix                                        8                  1.3
    prefix                                        3                  0.5
    flexion ending                                2                  0.3
    alternation information                       2                  0.3
other                                             2                  0.3
SUM                                               604                100.0

3.2 Analytical morphological heuristics

For optimal performance, the three groups mentioned above would require different strategies. Foreign words appearing in running Portuguese text are typically nouns or noun phrases, and trying to identify verbal elements only causes trouble. In "real" Portuguese words without spelling errors, structural clues - like flexion endings and suffixes - should be emphasised. These will be meaningful in misspelled Portuguese words, too, but, in addition, specific rules about letter manipulation (doubling of letters, missing letters, letter inversion, missing blanks etc.) and even knowledge about keyboard characteristics might make a difference.

5 In this paper I intend "unanalysable word forms" to mean word forms that cannot - by derivation and/or inflexional analysis - be reduced to a root found in the analyser's lexicon. Of course, only part of these - typing errors and foreign language quotes - are really unanalysable, while others might be covered by enlarging the lexicon or enhancing the scientific derivation list.

Motivated by a grammatical perspective rather than probabilistics, my approach has been to emphasise groups (a) and (b) and look for Portuguese morphological clues in words with unknown stems. Since prefixes have very little bearing on the probability of a word's word class or flexional categories, only the flexion endings and suffix lexica are used. As it also does in ordinary morphological analysis, the tagger tries to identify a word from the right, i.e. backwards, cutting off potential endings or suffixes and checking for the remaining stem in the root lexicon (the main lexicon). Normally, for (multiply) analysed words, using Karlsson's law (Karlsson 1992, 1995)⁷, the Portuguese analyser would try to make the root as long as possible, and to use as few derivational layers⁸ as possible. For (system-internally) unanalyzable words, however, I use the opposite strategy: Since I am looking for a hypothetical root, flexion endings and suffixes are all I've got, and I try to make their half of the word (the right hand part) as large as possible.

Working with a minimal root length of 3 letters, and calling my hypothetical root 'xxx', I will start by replacing only the first 3 letters of the word in question by 'xxx' and try for an analysis, then I will replace the first 4 letters by 'xxx', and so on, until - if necessary - the whole word is replaced by 'xxx'.⁹ For a word like ontogeneticamente the rewriting record will yield the chain below. Here, the full chain is given, with all readings it would encounter on its way. In the real case, however, the tagger - preferring long derivations/endings to short ones - would stop searching at the xxxticamente-level, where the first group of readings is found. In fact, the adverbial use of an adjectively suffixed word is much more likely than hitting upon, say, a "root-only" noun whose last 9 letters happen to include both the '-ico' and the '-mente' letter chains by chance.

(5) ontogeneticamente    -> no analysis
    xxxogeneticamente
    xxxgeneticamente
    xxxeneticamente
    xxxneticamente
    xxxeticamente
    xxxticamente         -> suffix '-ico' (variation '-tico') + adverbial ending '-mente'
                            "ontogene" <DERS -ico [ATTR]> <deadj> ADV
    xxxicamente          -> suffix '-ico' + adverbial ending '-mente'
                            "ontogene" <DERS -ico [ATTR]> <deadj> ADV
    xxxcamente
    xxxamente            -> adverbial ending '-mente' (variation '-amente')
                            "ontogenetico" <xxxo> <deadj> ADV
    xxxmente
    xxxente              -> "present participle" - suffix '-ente'
                            "ontogeneticamer" <DERS -ente [PART.PR]> ADJ M/F S
                            "ontogeneticamer" <DERS -ente [AGENT]> N M/F S
                         -> causative suffix '-entar'¹⁰ + verbal flexion ending '-e'
                            "ontogeneticam" <DERS -ar [CAUSE]> V PR 1/3S SUBJ VFIN
    xxxnte
    xxxte
    xxxe                 -> verbal flexion ending '-e'
                            "ontogeneticamenter" <xxxer> V IMP 2S VFIN
                            "ontogeneticamentir" <xxxir> V IMP 2S VFIN ###
                            "ontogeneticamenter" <xxxer> V PR 3S IND VFIN
                            "ontogeneticamentir" <xxxir> V PR 3S IND VFIN ###
                            "ontogeneticamentar" <xxxar> V PR 1/3S SUBJ VFIN
    xxx                  -> no derivation or flexion
                            "ontogeneticamente" <xxx> N F S
                            "ontogeneticamente" <xxx> N M S

7 Karlsson's law states that of two morphological analyses of different derivational complexity, the one with fewer elements is almost always the correct one.

8 Karlsson's law can be applied to any string of free (i.e. compounding), derivational or inflexional morphemes, but the frequency of ambiguity types with respect to these three elements will differ from language to language - thus, in Portuguese, compounding is much rarer than in most Germanic languages, while Swedish, the language for which Karlsson's law was originally formulated, does have compounding, but not as rich an inflexion morphology.

9 A similar method of partial morphological recognition and circumstantial categorization might be responsible for a human being's successful inflectional and syntactic treatment of unknown words in a known language; the Portuguese word games "collorido" (president Collor & colorido - 'coloured') and "tucanagem" (the party of the tucanos & sacanagem - 'dirty work'), for instance, will not be understood by a cultural novice in Brazil, even if he is a native speaker of European Portuguese - but he will still be able to identify both as singular, the first as a past participle ('-do') and the second as an abstract noun ('-agem') of the feminine gender.

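The rewriting chain itself is easy to reproduce. The Python sketch below generates the hypothetical word forms for a given word, assuming (as stated in the text) a minimal root length of 3 letters and the placeholder root 'xxx'; the function name `xxx_chain` is hypothetical and only serves the illustration.

```python
def xxx_chain(word, min_root=3, placeholder="xxx"):
    """Replace ever longer initial parts of `word` by a hypothetical root.

    The first form replaces the first `min_root` letters, the last form
    replaces the whole word. Words shorter than `min_root` yield no forms.
    """
    return [placeholder + word[i:] for i in range(min_root, len(word) + 1)]


chain = xxx_chain("ontogeneticamente")
for form in chain:
    print(form)
```

Each form in the list is then offered to the ordinary analyser, which looks it up against the suffix and flexion ending lexica with 'xxx' standing in for the unknown root.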
Roots with 'xxx' are present in the core lexicon alongside the "real" roots, including the necessary stem alternations¹¹ for verbs (here, BbCc for different root-stressed forms and AaiD for endings-stressed forms):

(6)
root      word class   alternation   lexeme ID   target of analysis
          subclass
xxx       <sm>                       54573       masculine noun, typically foreign
xxxar     <adj>                      59547       Portuguese '-ar'-adjective *
xxxar-    <vt>         AaiD          54578       endings-stressed forms of '-ar'-verbs
xxxer     <sm>                       54666       masculine noun, typically English *
xxxia     <sf>                       54665       feminine noun, Latin-Portuguese *
xxxo      <sm>                       54582       masculine noun, typically Portuguese

10 This suffix is regarded as a variant of '-ar', and therefore normalized in the DER-tag: <DERS -ar [CAUSE]>.

Besides the typical stems ending in '-o', '-a' and '-r', default stems consisting of a plain 'xxx' have been entered to accommodate foreign nouns with "un-Portuguese" spelling. Like many other languages, Portuguese will force its own gender system even onto foreign loan words, so a masculine and a feminine case must be distinguished, for later use in the tagger's disambiguation module.

Since the analyser's heuristics for unknown words prefers readings with endings (or suffixes) to those without, and longer ones to shorter ones, verbal readings (especially those with inflexion morphemes in 'r', 'a' or 'o') have a "natural" advantage over what really should be nouns or adjectives, especially when these appear in their uninflected singular base form. Lexicon-wise, this tendency is countered by adding three of the most commonly ignored nominal cases specifically into the lexicon: (a) English '-er' nouns otherwise only taken as Portuguese infinitives, (b) Latin-Portuguese '-ia' nouns otherwise only read as verbal forms in the imperfeito tense, and (c) '-ar' adjectives otherwise analysed only as infinitives.

Rule-wise, verbal readings alone are not allowed to stop the heuristics-machine; it will proceed until it finds a reading with another word class. So, the process is set to pass over verbal readings on its way down the chain of hypothetical word forms with ever shorter suffix/endings-parts. Thus, the heuristics-machine will record verbal readings, but only stop if a noun, adjective or adverb reading is found in that level's cohort (list of readings). In this context, participles and gerunds - though verbal - are treated as "adjectives" and "adverbs", respectively, because they feature very characteristic endings ('-ado', '-ido', '-ando', '-endo', '-indo').
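
The stopping behaviour described here can be sketched as follows. The Python code is an illustration under stated assumptions, not the tagger's actual implementation: the names are hypothetical, the analyser is replaced by a toy lookup table whose entries are copied from the readings shown for 'ontogeneticamente' and 'entaente' in this paper, and each reading is reduced to a word class plus a tag string.

```python
# Word classes that stop the search; participles and gerunds count as
# "adjectives" and "adverbs" for this purpose.
STOP_CLASSES = {"N", "ADJ", "ADV", "PCP", "GER"}

# Toy stand-in for the suffix and ending lexica (assumption for illustration).
TOY_LEXICON = {
    "xxxticamente": [("ADV", "<DERS -ico [ATTR]> <deadj> ADV")],
    "xxxaente": [("V", "<DERS -(ent)ar [CAUSE]> V PR 1/3S SUBJ VFIN")],
    "xxxe": [
        ("V", "<xxxer> V IMP 2S VFIN"),
        ("V", "<xxxer> V PR 3S IND VFIN"),
        ("V", "<xxxar> V PR 1/3S SUBJ VFIN"),
    ],
    "xxx": [("N", "<xxx> N F S"), ("N", "<xxx> N M S")],
}


def xxx_chain(word, min_root=3):
    """Hypothetical word forms with ever shorter suffix/ending parts."""
    return ["xxx" + word[i:] for i in range(min_root, len(word) + 1)]


def heuristic_readings(word, lexicon):
    """Walk down the chain, recording all readings found on the way.

    Verbal readings are recorded but do not stop the search; the search
    stops at the first level whose cohort contains a non-verbal reading.
    """
    recorded = []
    for form in xxx_chain(word):
        cohort = lexicon.get(form, [])
        recorded.extend((form, word_class, tags) for word_class, tags in cohort)
        if any(word_class in STOP_CLASSES for word_class, _ in cohort):
            break
    return recorded


adverb_case = heuristic_readings("ontogeneticamente", TOY_LEXICON)
loan_case = heuristic_readings("entaente", TOY_LEXICON)
levels_visited = list(dict.fromkeys(form for form, _, _ in loan_case))
```

For 'ontogeneticamente' the search ends at the first level with an adverb reading, while for 'entaente' verbal readings from two levels are collected before the noun readings at the plain 'xxx' level end the search.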

This raises the possibility of the heuristics-machine progressing from multi-derived analyses (with one or more suffixes) to simple analyses (without suffixes) before it encounters a non-verbal reading. In this case, the application of Karlsson's law does still make sense, and when the heuristics-machine hands its results over to the local disambiguation module, this will select the readings of lowest derivational complexity, weeding out all (read: verbal!) readings containing more (read: verbal!) suffixes than the group selected. In the misspelled French word 'entaente', for example, the verbal reading:

(7a) "enta" <DERS -(ent)ar [CAUSE]> V PR 1/3S SUBJ VFIN,

from the 'xxxaente'-level, is removed, leaving only underived verbal readings from the 'xxxe'-level, along with the desired noun singular reading from the 'xxx'-level.

(7b) entaente ALT xxxaente ALT xxxe ALT xxx
        "entaenter" <xxxer> V IMP 2S VFIN
        "entaenter" <xxxer> V PR 3S IND VFIN
        "entaentar" <xxxar> V PR 1/3S SUBJ VFIN
        "entaente" <xxx> N F S
        "entaente" <xxx> N M S

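The local disambiguation step for 'entaente' can be sketched in Python as a filter that keeps only the readings of lowest derivational complexity. This is an illustrative simplification: the function names are hypothetical, and complexity is assumed to be measurable by counting `<DERS ...>` tags in a reading's tag string.

```python
def derivational_complexity(tags):
    """Count derivation tags such as <DERS -ico [ATTR]> in a tag string."""
    return tags.count("<DERS")


def karlsson_filter(cohort):
    """Keep the readings with the fewest derivational elements."""
    if not cohort:
        return []
    lowest = min(derivational_complexity(tags) for _, tags in cohort)
    return [
        (base, tags)
        for base, tags in cohort
        if derivational_complexity(tags) == lowest
    ]


# Readings recorded for 'entaente': (7a) plus the cohort shown in (7b).
recorded = [
    ("enta", "<DERS -(ent)ar [CAUSE]> V PR 1/3S SUBJ VFIN"),
    ("entaenter", "<xxxer> V IMP 2S VFIN"),
    ("entaenter", "<xxxer> V PR 3S IND VFIN"),
    ("entaentar", "<xxxar> V PR 1/3S SUBJ VFIN"),
    ("entaente", "<xxx> N F S"),
    ("entaente", "<xxx> N M S"),
]

selected = karlsson_filter(recorded)
```

The filter only weeds out the suffixed reading (7a); the remaining choice between verb and noun, and between genders, is left to the contextual rules.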
Since all disambiguation not related to Karlsson's law is referred to the CG-module, the word class choice between V and N will be contextual (and rule based), as well as the morphological sub-choice of mode (IMP - PR) for the verb, and gender (M - F) for the noun. In the prototypical case of a preceding article, the verb reading is ruled out by:

(8a) REMOVE (V) IF (-1 ART)

and the gender choice is then made by agreement rules such as:

(8b) REMOVE (N M) IF (-1C DET) (NOT -1 M)
     REMOVE (N F) IF (-1C DET) (NOT -1 F)
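
How rules (8a) and (8b) interact can be sketched in Python for a single word and its left neighbour. The sketch makes several assumptions that are not stated in the text: readings are modelled as sets of tags, an article is assumed to carry both an ART and a DET tag, the careful context (-1C DET) is read as "all readings of the left word are DET", and the function names are hypothetical.

```python
def remove(cohort, target):
    """Drop readings containing all target tags, but never empty the cohort."""
    survivors = [reading for reading in cohort if not target <= reading]
    return survivors if survivors else cohort


def disambiguate_after_article(left, cohort):
    """Apply simplified versions of rules (8a) and (8b) to one cohort."""
    # (8a) REMOVE (V) IF (-1 ART)
    if any("ART" in reading for reading in left):
        cohort = remove(cohort, {"V"})
    # (8b) agreement: the left word must be an unambiguous determiner (-1C DET)
    if left and all("DET" in reading for reading in left):
        if not any("M" in reading for reading in left):     # (NOT -1 M)
            cohort = remove(cohort, {"N", "M"})
        if not any("F" in reading for reading in left):     # (NOT -1 F)
            cohort = remove(cohort, {"N", "F"})
    return cohort


# Heuristic readings of 'entaente' as in (7b), reduced to their primary tags.
entaente = [
    {"V", "IMP", "2S", "VFIN"},
    {"V", "PR", "3S", "IND", "VFIN"},
    {"V", "PR", "1/3S", "SUBJ", "VFIN"},
    {"N", "F", "S"},
    {"N", "M", "S"},
]

feminine_article = [{"DET", "ART", "F", "S"}]
masculine_article = [{"DET", "ART", "M", "S"}]

after_feminine = disambiguate_after_article(feminine_article, entaente)
after_masculine = disambiguate_after_article(masculine_article, entaente)
```

With a feminine article to the left only the N F S reading survives, with a masculine article only N M S.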

Consider the following examples of "unanalyzable" words from real corpus sentences, where the final output, after morphological contextual disambiguation, is given:

(9a) inventimanhas ALT xxxas (also: one ADJ and three rare V-readings)
        "inventimanha" <xxx> N F P 'tricks'

(9b) araraquarenses ALT xxxenses (3 other ADJ readings removed by local disambiguation)
        "araraquar" <DERS -ense [PATR]> <jh> <jn> ADJ M/F P 'from Araraquara'

(9c) sombrancelhas ALT xxxas (also: one ADJ and three rare V-readings)
        "sombrancelha" <xxx> N F P '= sobrancelhas - eyebrows'

(9d) cast ALT xxx (also: N F S)
        "cast" <*1> <*2> <xxx> N M S 'English: cast'

In (9a) and (9b) the parser assigns correct readings to unknown, but wellformed Portuguese words. Depending on the orthodoxy of the fusion process, these affixes may be recognised (9b), or not (9a). That affix recognition is important can be seen from the fact that all competing analyses in (9b) - but not in (9a) - have the correct PoS tag. What is special about (9c) is the (phonetical?) misspelling ('sombrancelhas') of an otherwise ordinary Portuguese word. Even so, with the help of the surviving morphological clues and contextual disambiguation, the parser is able to assign the right analysis in most cases, especially if the words still look Portuguese. (9d), finally, is the hard case - foreign loan words. English 'cast' does not fit with any Portuguese flexion ending; therefore the default reading N is assigned, with gender disambiguation relying on NP-context.
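The fallback behaviour illustrated in (9a-d) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the parser's actual implementation: the ending and suffix tables, the name `guess_reading` and the minimum stem length are hypothetical, and only one reading is returned, whereas the real tagger passes several alternative readings on to contextual disambiguation.

```python
# Hypothetical mini-lexica, for illustration only.
# (inflexion ending, base form ending, gender/number tags)
INFLEXION_ENDINGS = [
    ("enses", "ense", "M/F P"),
    ("as", "a", "F P"),
    ("os", "o", "M P"),
    ("a", "a", "F S"),
    ("o", "o", "M S"),
]
# derivational suffix -> word class it signals
SUFFIX_CLASS = {"ense": "ADJ", "ístico": "ADJ"}
MIN_STEM = 3  # assumed minimum length of the unknown ("xxx") part


def guess_reading(wordform):
    """Return (base form, word class, morphology, evidence) for an unknown word."""
    for ending, base_ending, morph in INFLEXION_ENDINGS:
        stem = wordform[: -len(ending)]
        if wordform.endswith(ending) and len(stem) >= MIN_STEM:
            base = stem + base_ending
            for suffix, word_class in SUFFIX_CLASS.items():
                if base.endswith(suffix):
                    return base, word_class, morph, "suffix -" + suffix
            # Ending recognised, but no suffix: fall back on the noun reading.
            return base, "N", morph, "ending -" + ending
    # No Portuguese ending fits (e.g. a foreign loan word): default noun,
    # gender left open for later disambiguation by NP context.
    return wordform, "N", "M/F S", "default"


for word in ["inventimanhas", "araraquarenses", "sombrancelhas", "cast"]:
    print(word, guess_reading(word))
```

In this sketch a recognised suffix overrides the default noun reading, which mirrors the contrast between (9b) and (9a); a word such as 'cast' that matches no ending falls through to the default.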

In order to test the parser's performance and to identify the strengths and weaknesses of its heuristics strategy, I have manually inspected 757 "running" instances of lower case word forms where the parser's disambiguation module received its input from the tagger's heuristics module. In table (10), the first column shows the word class analysis chosen, and inside the three groups (errors, Portuguese, foreign) the left column gives the number of correct analyses, whereas the right column offers statistics about the mistakes, specifying - and quantifying - what the analysis should have been.


(10) Word class distribution and parser performance in "unanalyzable" words (VEJA news text)

            A) orthographical errors   B) Portuguese words   C) foreign words    all
analysis    correct   other            correct   other       correct   other     correct   other
N           119       ADJ 8            212       ADJ 3       226       ADV 11    557       43
                      ADV 8                                            ADJ 3
                      VFIN 3                                           PRON 2
                      PRON 1                                           PRP 2
                      DET 1
                      PRP 1
ADJ         25        N 8              95        N 7         8         -         128       17
                      GER 2
ADV         3         -                5         -           -         -         8         -
VFIN        13        N 4              9         N 4         -         N 7       22        20
                      PCP 1                      ADJ 2                 ADJ 1
                      ADV 1
PCP         10        -                16        -           -         -         26        -
GER         3         -                -         -           -         -         3         -
INF         9         -                4         -           -         N 4       13        4
            182       38               341       16          234       30        757       84
                      (17.3%)                    (4.5%)                (11.4%)             (10.0%)

The table shows that, when using lexical heuristics, the parser performs best - not entirely surprisingly - for wellformed Portuguese words (B). Of 323 nouns and adjectives in group B, only 16 (5%) were misanalysed as false positives or false negatives. The probability for an assigned N-tag being correct is as high as 98.6%, for the underrepresented adverb and non-finite verbal classes even 100%. All false positive nominal readings (N and ADJ) are still in the nominal class, a fact that is quite favourable for later syntactic analysis.

Figures are lower for group C, unknown loan words, where the chance of an N-tag being correct is only 92.6%, even when allowing for a name-chain-like N-analysis of English adjectives integrated in noun clusters of the type 'big boss'. Finite verb readings, though rare (due to lacking flexion indicators), are of course all failures, and only the little adjective group was a hit, the few cases being triggered by morphologically "Portuguesish" Spanish or Italian words.
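The percentages quoted for the individual groups follow directly from the counts in table (10). As a worked check, the short Python sketch below recomputes them; the counts are transcribed from the table, while the dictionary layout and the function names `tag_precision` and `error_rate` are merely illustrative choices.

```python
# Counts transcribed from table (10): (correct analyses, misanalyses)
# for the N and ADJ tags assigned in each group.
TAG_COUNTS = {
    "A": {"N": (119, 22), "ADJ": (25, 10)},  # orthographical errors
    "B": {"N": (212, 3), "ADJ": (95, 7)},    # Portuguese words
    "C": {"N": (226, 18), "ADJ": (8, 0)},    # foreign words
}
# Bottom row of table (10): (correct, other) summed over all word classes.
GROUP_TOTALS = {"A": (182, 38), "B": (341, 16), "C": (234, 30), "all": (757, 84)}


def tag_precision(correct, wrong):
    """Share (in %) of assigned tags that were correct."""
    return 100 * correct / (correct + wrong)


def error_rate(correct, wrong):
    """Share (in %) of heuristic analyses that were wrong."""
    return 100 * wrong / (correct + wrong)


for group, tags in TAG_COUNTS.items():
    for tag, (correct, wrong) in tags.items():
        print(group, tag, round(tag_precision(correct, wrong), 1))
for group, (correct, wrong) in GROUP_TOTALS.items():
    print(group, "error rate", round(error_rate(correct, wrong), 1))
```

The N-tag values for groups B and C come out as 98.6 and 92.6, and the group error rates as 17.3, 4.5, 11.4 and 10.0, matching the figures in the text and in the last row of the table.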

The results in group A (misspellings) resemble those of group B, with a good performance for classes with clear endings, i.e. non-finite verbs and '-mente'-adverbs, and a bad performance for finite verb forms. For the large nominal groups figures are somewhat lower: 84.4% of N-tags, and only 71.4% of ADJ-tags are correct - though most false positive ADJ-tags are still within the nominal range. The lower figures can be partly explained by the fact that misspelled closed class words (adverbs, pronouns and the like) will get the (default, but wrong) noun reading - a technique that works somewhat better and more naturally for foreign loan words (C), which often are "terms" imported together with the thing or concept they stand for, or names. Also, the percentage of "simplex" words without affixes is much higher among the misspellings in group A than in group B, where all simplex words - being spelled correctly - would have been recognised in the lexicon anyway, due to the good lexicon coverage before getting to the heuristics module. Therefore, nouns and adjectives in group A lack the structural information of suffixes that helps the parser in group B: 'xxxo' looks definitely less adjectival than 'xxxístico'. In particular, 'xxxo' invites the N/ADJ-confusion, whereas many suffixes are clearly N or ADJ. Thus, '-ístico' yields a safe adjective reading,

Note to table (10), group C: Only individual words and short integrated groups are treated; foreign language sentences or syntactically complex quotations are treated as "corpus fall-out" in this table.

4. Special - ”deviant” - word class probabilities for the heuristics module

Is it possible, apart from morphological-structural clues, to use "probabilistics pure" for deciding on word class tags for "unanalyzable" words? In order to answer this question, I will - in table (11) - rearrange information from table (10) and compare it to whole text data (in this case, from a 197.029 word stretch of the mixed genre Borba-Ramsey corpus). Here, I will only be concerned with the open word classes: nominal, verbal and '-mente'-adverbial.

(11) Open word class frequency for "unanalyzable" words as compared to whole text figures

        whole text   "unanalysable" words
                     orthographical errors   Portuguese words   foreign words      all heuristics analyses
        %            cases     %             cases     %        cases     %        cases     %
N       47.38        131       63.59         232       63.39    237       95.18    600       73.08
ADJ     12.79        33        16.02         100       27.32    12        4.82     145       17.66
ADV¹⁵   1.26         3 (+9)    1.46          5         1.37     - (+11)   -        8         0.97
VFIN    24.96        16        7.77          9         2.46     -         -        25        3.05
PCP     4.96         11        5.34          16        4.37     -         -        27        3.29
GER     2.47         3         1.46          -         -        -         -        3         0.37
INF     6.17         9         4.37          4         1.09     -         -        13        1.58
                     206                     366                249                821

Among other things, the table shows that the noun bias in "unanalyzable" words is much stronger than in Portuguese text as a whole, the difference being most marked in foreign loan words. The opposite is true of finite verbs, which show a strong tendency to be analysable. Finite verbs are virtually absent from the unknown loan word group. For the non-finite verbal classes the distribution pattern is fairly uniform, again with the exception of foreign loan words.

"Simplex" words are here defined as words that can be found in the root lexicon without prior removal of prefixes or suffixes. Of course, the larger the lexicon, the higher the likelihood of an (etymologically) affix-bearing word appearing in the lexicon, and thus not needing "live" derivation from the parser.


As might be expected, among the "unanalyzable" words, orthographical errors and correct Portuguese words show a remarkably similar word class distribution.
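The comparison in table (11) can be retraced with a small Python sketch. The case counts and whole text percentages are transcribed from the table; the helper names `shares`, `pooled` and `bias`, and the idea of expressing the skew as a simple ratio against the whole text baseline, are illustrative assumptions rather than part of the original evaluation.

```python
TAGS = ["N", "ADJ", "ADV", "VFIN", "PCP", "GER", "INF"]
# Whole text percentages and heuristic case counts, transcribed from table (11).
WHOLE_TEXT_PCT = dict(zip(TAGS, [47.38, 12.79, 1.26, 24.96, 4.96, 2.47, 6.17]))
CASES = {
    "errors": dict(zip(TAGS, [131, 33, 3, 16, 11, 3, 9])),
    "Portuguese": dict(zip(TAGS, [232, 100, 5, 9, 16, 0, 4])),
    "foreign": dict(zip(TAGS, [237, 12, 0, 0, 0, 0, 0])),
}


def shares(cases):
    """Percentage of each word class within one group of cases."""
    total = sum(cases.values())
    return {tag: 100 * n / total for tag, n in cases.items()}


def pooled(groups):
    """Sum the case counts of all groups, class by class."""
    return {tag: sum(g[tag] for g in groups.values()) for tag in TAGS}


def bias(cases, baseline):
    """Ratio of a class's share among heuristic cases to its whole text share."""
    return {tag: pct / baseline[tag] for tag, pct in shares(cases).items()}


for group, cases in CASES.items():
    print(group, {tag: round(pct, 2) for tag, pct in shares(cases).items()})
print("bias", {tag: round(r, 2) for tag, r in bias(pooled(CASES), WHOLE_TEXT_PCT).items()})
```

Under these counts, nouns are roughly one and a half times as frequent among heuristically analysed words as in the whole text, while finite verbs reach only about an eighth of their whole text share.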

A lesson from the above findings might be to opt for noun readings and against finite verb readings in "unanalysable" words, when in doubt, especially where no Portuguese flexion ending or suffix can be found, suggesting foreign material. As a matter of fact, this strategy has since been implemented in the system, in the form of heuristical disambiguation rules that discard VFIN readings and choose N readings for <MORP-HEUR> words, where lower level (i.e. safe) CG-rules haven't been able to decide the case contextually.
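The effect of such last-resort rules can be sketched as follows. This is a Python illustration of the strategy just described, not the Constraint Grammar rules actually used in the system; the function name `heuristic_fallback`, the representation of readings as a list of word class tags, and the decision never to discard the last remaining reading are assumptions made for the example.

```python
def heuristic_fallback(readings, is_morph_heur):
    """Apply the noun-over-finite-verb preference to still ambiguous words.

    readings: word class tags left after the safe contextual rules.
    is_morph_heur: True if the word was analysed by the heuristics module.
    """
    # Only heuristically analysed words that are still ambiguous are touched.
    if not is_morph_heur or len(readings) <= 1:
        return readings
    # Discard finite verb readings, unless nothing else would remain.
    non_finite = [r for r in readings if r != "VFIN"]
    if non_finite:
        readings = non_finite
    # Choose the noun reading if one is available.
    if "N" in readings:
        return ["N"]
    return readings


print(heuristic_fallback(["N", "ADJ", "VFIN"], True))
print(heuristic_fallback(["ADJ", "VFIN"], True))
print(heuristic_fallback(["N", "VFIN"], False))
```

Words found in the lexicon are left untouched, so the preference only applies where the frequency data above suggest it is justified.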

5. Conclusion

It can be shown that lexico-morphological heuristics - at least for a morphology-rich language like Portuguese - can be based on structural clues and the systematic exploitation of derivational and inflectional sublexica. Applied to improve analyser recall on the input level of a Constraint Grammar system, the described technique positively contributed to the overall performance of a lexicon based rule governed tagger/parser. Correctness rates of more than 99% were achieved for the morphological/PoS tagger module, with heuristic error rates running at 2% for proper name heuristics and 4.5% for the heuristical analysis of other unrecognised, but correctly spelled Portuguese word forms. In all, heuristic analysis was needed for 80% of all proper nouns (amounting to ca. 2% of running word forms in news text), but for less than 0.4% of non-name word forms. Finally, word class frequency counts suggest that PoS probabilities for "unanalyzable" words in Portuguese texts are quite different from those for the language on the whole.

