A statistical and structural approach to extracting
collocations likely to be of relevance in relation to an LSP
sub-domain text
Bjarne Blom
Department of Lexicography and Computational Linguistics, The Aarhus Business School
bb@lng.hha.dk
1. Background.
This article* sketches a method by means of which likely text-relevant collocational word strings may be extracted from sub-domain LSP texts. Because of the repetitive nature of collocations, a statistical and structural approach is suggested. The goal of this project is two-fold: (a) to explore the degree to which computational methods are suitable for extracting collocations; (b) to explore the degree to which it is possible to extract information particular to the overall topic or domain of an LSP text, i.e. terminology, by means of knowledge-poor techniques. As this study has run on a time-limited basis, most importance has been placed on (a).
2. What is understood by "collocation".
A "collocation" is often defined as either "an arbitrary and recurrent word combination" (Benson 1990) or "the co-occurrence of two or more words within a short space of each other" (Sinclair 1991). However, such definitions tend to leave two important questions out of consideration, cf. the two following examples:
(a) ... shouldn't have done that. This prompted ...
(b) ... seems unlikely that this should be the case
The two definitions above, which are in effect fairly representative of the major part of the definitions offered, do not take two points into consideration: the distinction between lexical collocations like prime minister and syntactic collocations like that this, and the issue whether words should be allowed to cross sentence boundaries. Relying on either of the above definitions would result in massive overgeneration, so we will define the concept of "collocation" so as to better suit a quantitative approach:
A collocation is a word string consisting of a minimum of two words with the following characteristics:
*This is a brief description of the method designed and the findings made in connection with the research project UDOG (Udforskelse af Dansk Ordforråd og Grammatik - Exploration of Danish Vocabulary and Grammar) under the auspices of The Danish Research Council for the Humanities. A broader outline is given in Danish in Blom (1997) and Blom (1998 - forthcoming).
(1) the entire word string occurs within a given textual segment a;
(2) the least frequent word(s) in the word string occur(s) at least b times in the corpus;
(3) all words in the word string occur within a span of c words of their neighbouring collocate;
(4) all words in the word string occur in d particular sequences.
This definition is rather a set of parameters to be adjusted according to the particular scientific context in which they are to be utilised. Thus (1) refers to whether a word string is considered to be a phrase, a syntagma, an entire sentence, or is even permitted to cross sentence boundaries; (2) refers to a lower threshold value decided on for a given purpose in order to filter off infrequent occurrences; (3) refers to the issue whether interrupted structures should be taken into account; (4) refers to the question whether inverted structures should be taken into account. In this article we let a be sentence level, i.e. we accept a clausal structure as a collocation rather than imposing some arbitrary value as a maximum word string length; we let b be four, i.e. a word occurring three times or less is not considered to be significant; we let c be zero, i.e. we do not include interrupted structures, as such structures require quite an extensive statistical setup (cf. Ikehara, Shirai & Uchino 1996) and are outside the scope of this study; we let d be left-to-right, in the sense that we do not lemmatize bigram structures (i.e. treat afgift betales = betales afgift) in order to secure the inclusion of underrepresented items. However, in case inverted structures are significantly represented in the corpus, they will be included as valid collocations.
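The parameter settings chosen above (a = sentence, b = 4, c = 0, d = left-to-right) can be illustrated with a minimal Python sketch. The function name `candidate_bigrams` and the toy sentences are hypothetical and serve illustration only; they are not taken from the project's implementation or corpus.

```python
from collections import Counter

def candidate_bigrams(sentences, b=4):
    """Count adjacent, left-to-right bigrams inside sentences.

    sentences: list of token lists, one per sentence, so that no bigram
    crosses a sentence boundary (parameter a). Only adjacent words are
    paired (c = 0) and word order is kept as found (parameter d).
    A bigram is kept only if its least frequent word occurs at least
    b times in the whole input (parameter b).
    """
    unigram = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter()
    for sent in sentences:
        for left, right in zip(sent, sent[1:]):
            if min(unigram[left], unigram[right]) >= b:
                bigrams[(left, right)] += 1
    return bigrams

# Hypothetical toy input, not taken from the VAT Act.
toy = [["afgift", "betales", "af", "virksomheden"]] * 4 + [["betales", "afgift"]]
result = candidate_bigrams(toy, b=4)
```

In this toy input the inverted order betales afgift is counted separately from afgift betales, which mirrors the decision not to merge inverted structures.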
3. Finding likely relevant words (unigram level).
The approach is founded on the assumption that a word has both statistical and grammatical properties which may serve as a clue to its terminologicalness; at least this is believed to be the case for very hard-core LSP texts. As an example of such a text, the Danish VAT Act has been chosen. The legislative genre has been carefully selected, as legislative terminology tends to be unambiguous in the sense that one particular concept is usually backed by one particular orthography. As opposed to LGP, the use of synonymy, paraphrases and other stylistic ways of representing a concept orthographically is minimised in order to avoid interpretative scope on the legal circumstances relating to a given concept. As a bonus, we may expect polysemy not to constitute a serious problem for this kind of study.
Zipf (1949) suggested that there seems to exist an inversely proportional interdependence between frequency and rank in a text, in that a very limited number of words occur with very high frequencies, whereas a very large number of words occur with very low frequencies. Damerau (1965), Dennis (1967) and Stone & Rubinoff (1968) discovered a link between frequency, typicality and word class, so that so-called function words (which tend to account for quite a considerable part of the words found at the high-frequency end of the Zipfian distribution) and content words have different statistical behaviour in texts, as the former category follows a Poisson distribution closely, whereas the latter does not. These works provide a theoretical foundation for the intuition that a word like "that" has a greater intertextual dispersion than a word like "thermodynamics". Bookstein & Swanson (1974, 1975) and Harter (1975) sharpened this point by offering an explanation why certain content words (e.g. red, say, man) have greater intertextual dispersion than other content words (e.g. enterprise, sub-clause, taxable).
Their claim was that words which show a random distribution in a Poisson process are likely not to contain information about the text in which they occur, whereas words which do not follow a Poisson distribution tend to contain information about the text in which they occur.
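The Poisson argument can be made concrete with a small Python sketch. It compares the share of texts in which a word is actually observed with the share expected if the same number of occurrences were scattered at random (Poisson with rate λ = occurrences per text, so the expected share is 1 - e^(-λ)). The function names and the per-text counts are hypothetical illustrations and are not taken from the cited works.

```python
import math

def poisson_doc_share(total_occurrences, n_docs):
    """Expected share of texts containing the word at least once,
    if its occurrences were distributed over texts as a Poisson process."""
    lam = total_occurrences / n_docs
    return 1.0 - math.exp(-lam)

def observed_doc_share(doc_counts):
    """Share of texts in which the word actually occurs."""
    return sum(1 for c in doc_counts if c > 0) / len(doc_counts)

# Hypothetical occurrence counts over ten texts (ten occurrences each).
function_like = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]   # evenly scattered
content_like = [10, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # concentrated in one text

expected = poisson_doc_share(10, 10)
gap_function = abs(expected - observed_doc_share(function_like))
gap_content = abs(expected - observed_doc_share(content_like))
```

A small gap suggests Poisson-like behaviour (little information about the text); a large gap suggests that the word is tied to particular texts.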
For this study, we will rely on the very simple assumption that, given that we are dealing with a very specialised sub-domain LSP text, i.e. a legislative genre text within the domain of taxation, text-relevant words are likely to be found among high-frequency content words. To this end, three steps have been necessary: (1) the text has been tagged for word class, (2) words sharing an identical stem have been lemmatised to avoid insignificance arising from the underrepresentation of a given inflected form, and (3) a very simple and limited stoplist has been applied to filter off spatio-temporal items (e.g. names of various EU countries or words referring to a deadline by which a given task is to be carried out), which tend to be represented with some frequency owing to the nature of the text, but are likely to be irrelevant for the particular legal domain in question. In order to determine a suitable threshold value for filtering off words with no or very little relevance, a number of sub-sections of the VAT Act have been randomly picked in order to compare the frequency of occurrence of words in the sub-section with their frequency of occurrence in the entire text. This was done in order to find a suitable cut-off point below which words are likely not to be of relevance to the text (for an example of this, see Blom 1997), and there seemed to be a case for omitting items below the frequency of four.
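The three steps and the cut-off can be sketched as follows, assuming that tagging and lemmatisation have already produced (lemma, tag) pairs. The tag names in `CONTENT_TAGS`, the function name and the toy data are assumptions made for illustration; the actual tag set and stoplist of the study are not specified here.

```python
from collections import Counter

# Assumed inventory of content-word tags (hypothetical).
CONTENT_TAGS = {"noun", "vb", "adj"}

def relevant_unigrams(tagged_lemmas, stoplist=frozenset(), threshold=4):
    """Keep content-word lemmas that are not on the stoplist and occur
    at least `threshold` times (items below four are omitted)."""
    counts = Counter(
        lemma
        for lemma, tag in tagged_lemmas
        if tag in CONTENT_TAGS and lemma not in stoplist
    )
    return {lemma: n for lemma, n in counts.items() if n >= threshold}

# Hypothetical toy input.
toy = (
    [("afgift", "noun")] * 5
    + [("af", "prep")] * 9          # function word: ignored
    + [("Danmark", "noun")] * 4     # spatio-temporal item: stoplisted
    + [("registrere", "vb")] * 3    # below the cut-off
)
kept = relevant_unigrams(toy, stoplist={"Danmark"})
```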
4. Finding words with collocational potential (unigram level).
The output consists of a list of unigrams rendered text-relevant by the method. What we need to find out now is which of these unigrams tend to occur as single words and which as part of a multi-word unit.
There are existing statistical methods for testing the "bondness" of word pairs, the most prominent one being mutual information, cf. Church & Hanks (1990). Mutual information is a widely used frequency-based formalism that compares the probability of a word pair occurring together with the probability of its words occurring independently in a text. However, this approach is not adequate for this study, as we place our focus on content words. In mutual information statistics, both words of a bigram receive equal weight irrespective of grammatical status. This would mean that function words might exercise undue influence over the overall MI value and thus bias the measure. This is no unreasonable assumption, if we consider the fact that the most frequent words in a Zipfian distribution tend to be function words. In this study, we restrict our scope to keeping content words as "nodes" and then examine the way they combine with their either left or right adjacent "collocates".
We will use the heuristic that a content word's ability to enter into a "bond" with another word depends on its contextual distribution, i.e. its number of adjacent either left or right words in a text. The fewer such adjacent words, the more a word's unigram frequency will be distributed upon particular collocates. In case a content word has a large number of collocates, its unigram frequency will be watered down. Consider a content word which occurs fourteen times in a text and has an equal number of collocates; then each bigram gets a frequency of one. On the basis of these statistics, we can reasonably assume that this content word constitutes a single-word unit. If, on the other hand, a content word occurring fourteen times is represented two times with one particular adjacent word and twelve times with another particular adjacent word, we can deduce that this content word is predominantly part of a multi-word unit. For the purposes of this heuristic, we apply a formula in which u is the unigram frequency of the content word and c is the number of collocates of the content word.
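The two cases described above can be reproduced with a short Python sketch that computes the quantities u and c from a table of collocate counts. The formula itself is not reproduced in this text, so the sketch does not attempt to reconstruct it; the extra value `top_share` (the share of u taken by the most frequent collocate) is an illustrative quantity introduced here, not the measure used in the study.

```python
from collections import Counter

def profile(collocates):
    """Return (u, c, top_share) for a node word.

    collocates: Counter mapping each adjacent word to the number of
    times it occurs next to the node word.
    u: unigram frequency of the node (sum of the collocate counts)
    c: number of distinct collocates
    top_share: illustrative only, share of u held by the top collocate
    """
    u = sum(collocates.values())
    c = len(collocates)
    top_share = max(collocates.values()) / u
    return u, c, top_share

# Case 1: fourteen occurrences, fourteen different collocates.
scattered = Counter({f"w{i}": 1 for i in range(14)})
# Case 2: fourteen occurrences, two collocates (2 + 12).
concentrated = Counter({"x": 2, "y": 12})

print(profile(scattered))
print(profile(concentrated))
```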
5. Finding word-pairs with collocational strength (bigram level).
Our present output consists of the unigram list from the previous step, exclusive of words which the method does not render potential multi-word items. We will now explore the above idea a bit further and focus on the strength of any given bigram in which a particular content word occurs, in order to find the strength of the bond of the word pair. The bond strength depends on the number of times a word occurs in a given bigram compared to how often the word occurs as a unigram. Our assumption is that content word n exercises good "attractive power" on word x if the bigram frequency (fq nx) accounts for a major part of the unigram frequency of content word n (fq n). We will try to make this statistical parameter differentiate between words with the same unigram/bigram ratio but with different frequencies, e.g. the instance where two words with ten collocates each occur with a unigram frequency of ten and a hundred, respectively. We will use a formula, in which u is the unigram frequency and b is the bigram frequency, to discriminate between superior and less superior collocational strength.
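The need for a frequency-sensitive measure can be seen from a minimal sketch of the plain bigram/unigram ratio. The function `bigram_share` is a hypothetical helper that computes only this raw ratio b/u; it is not the formula used in the study, which is not reproduced in this text.

```python
def bigram_share(u, b):
    """Share of a content word's unigram frequency u that is taken up
    by one particular bigram with frequency b (plain ratio b / u)."""
    if u <= 0 or not 0 <= b <= u:
        raise ValueError("expected 0 <= b <= u and u > 0")
    return b / u

# Two words with ten evenly used collocates each, with unigram
# frequencies of ten and a hundred: each bigram occurs 1 and 10 times.
low = bigram_share(10, 1)
high = bigram_share(100, 10)
```

Both calls give the same ratio, although the second bigram is attested ten times and the first only once. This is why the ratio alone cannot discriminate between the two cases.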
6. Expanding the bigrams (sentence level).
The output from the last two steps consists of a list of bigrams which are, on the basis of the statistics applied, considered to be of a multi-word nature rather than a single-word nature. Now we will expand each bigram to find each possible word string in which the bigram occurs. Instead of merely extracting the longest possible string, like for instance Smadja (1993), we will permit all syntactically well-formed word strings in which a candidate bigram occurs. To this end a contextual array is applied, in which a given bigram is shown in all its contexts, so that the horizontal axis accounts for syntagmatic values, while the vertical axis accounts for paradigmatic values. The following rules apply to the expansion of orthographic word strings:
(1) Only word strings (non-expanded as well as expanded bigrams) occurring at least three times are accepted;
(2) A bigram may be expanded by ± one position only if the same word occurs at the same syntagmatic position in the contextual array;
(3) A bigram is left out of consideration in case a longer word string occurs with at least the bigram frequency minus one occurrence. This is to avoid syntactically incomplete structures, the frequent occurrence of which can be ascribed to syntactic reasons only. The longer word string which has the same or nearly the same frequency as the bigram is likely to be the valid one.
An easier solution would be to favour certain syntactic structures, typically noun syntagmas or predicative-like structures, like Smadja (1993), or to limit the study to bigrams containing only content words; however, this would idiosyncratically favour betale afgift {C-C} over betaling af afgift {C-F-C}.
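The expansion rules can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions, not the contextual array used in the study: it counts every contiguous word string within a sentence that contains the bigram, so that rule (2) is satisfied implicitly (a string is only counted again when the same words recur at the same positions). The function name and the toy sentences are hypothetical.

```python
from collections import Counter

def expand_bigram(sentences, bigram, min_freq=3):
    """Return word strings containing `bigram` with their frequencies."""
    counts = Counter()
    for sent in sentences:
        spans = set()
        for i in range(len(sent) - 1):
            if (sent[i], sent[i + 1]) != bigram:
                continue
            # every contiguous span of the sentence covering i and i+1
            for start in range(i + 1):
                for end in range(i + 2, len(sent) + 1):
                    spans.add((start, end))
        for start, end in spans:
            counts[tuple(sent[start:end])] += 1
    # rule (1): keep strings occurring at least min_freq times
    kept = {s: n for s, n in counts.items() if n >= min_freq}
    # rule (3): drop the bare bigram if a longer string is (nearly) as frequent
    bigram_freq = kept.get(bigram, 0)
    longer = [n for s, n in kept.items() if len(s) > 2]
    if longer and max(longer) >= bigram_freq - 1:
        kept.pop(bigram, None)
    return kept

# Hypothetical toy sentences.
toy = [["betaling", "af", "afgift", "sker"]] * 3 + [["betaling", "af", "moms"]]
expanded = expand_bigram(toy, ("betaling", "af"))

toy2 = [["afgift", "betales", w] for w in ("nu", "ikke", "her")] + [["afgift", "betales"]]
unexpanded = expand_bigram(toy2, ("afgift", "betales"))
```

In the first toy input the bare bigram betaling af is dropped in favour of the longer strings; in the second, no longer string reaches the threshold, so the bigram afgift betales is retained.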
In a Zipfian distribution of words, the high-frequency end is dominated by grammatical words like prepositions, conjunctions and the like, which is why a great amount of syntactic collocations is to be expected in any given bigram output. In order to reduce the likely amount of overgeneration, we will take into account only bigrams that contain at least one content word and which do not include a punctuation mark. Having taken this preventive step, we will assume that syntactic noise is likely to occur at the left- and rightmost positions in a word string where a function word may be found. Instead of merely leaving an expert with the tedious task of filtering off items irrelevant owing to syntactic noise, we will apply a set of syntactic rules to serve as a filter for such syntactic noise.
There are three syntactic main rule classes (1-3), one syntactic exception rule class (4) and one statistical exception rule class (5), as follows: (1) a particular tag (or tag sequence) cannot take up the leftmost slot (or slots) of a word string; (2) a particular tag (or tag sequence) cannot take up the rightmost slot (or slots) of a word string; (3) a particular tag sequence cannot form either an entire word string or a part of a given word string counting from the second left- or rightmost slot; (4) a particular tag sequence discharged by a syntactic rule may qualify if it matches a specific syntactic pattern; (5) a particular tag sequence left out by a syntactic rule may qualify if a content word is immediately succeeded and/or preceded by a preposition in at least 80% of all cases. The latter rule is used to statistically identify phrasal verbs (berettige til), as well as complex prepositions (i henhold til). All occurrences of {vb-prep} and {prep-sb-prep} have been examined in order to evaluate the performance of this 80% rule. All instances of phrasal verbs and complex prepositional phrases are correctly identified. Consequently, this technique might even prove to be a theoretical spin-off of this project.
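Rule classes (1) and (2) amount to trimming forbidden tag sequences from the edges of a word string. The following Python sketch illustrates this with the two tag sequences used in the Class (1) and Class (2) examples ({det-prep} on the left, {konj} on the right). The rule tables and the function name `trim` are hypothetical; the full rule inventory of the study is not listed here.

```python
# Hypothetical rule tables holding only the two example rules.
FORBIDDEN_LEFT = [("det", "prep")]   # class (1): cannot start a word string
FORBIDDEN_RIGHT = [("konj",)]        # class (2): cannot end a word string

def trim(words, tags):
    """Strip forbidden tag sequences from both edges of a word string."""
    words, tags = list(words), list(tags)
    changed = True
    while changed:
        changed = False
        for seq in FORBIDDEN_LEFT:
            if tuple(tags[:len(seq)]) == seq:
                words, tags = words[len(seq):], tags[len(seq):]
                changed = True
        for seq in FORBIDDEN_RIGHT:
            if tags and tuple(tags[-len(seq):]) == seq:
                words, tags = words[:-len(seq)], tags[:-len(seq)]
                changed = True
    return words, tags

class1 = trim(["den", "i", "stk.", "#num#"],
              ["det", "prep", "noun.abbr", "num"])
class2 = trim(["registrere", "som", "landbrug", "og"],
              ["vb.inf", "konj", "noun", "konj"])
```

The exception classes (4) and (5) would have to be checked before a reduction is carried out; they are left out of this sketch.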
Example of a Class (1) rule:
input: "den i stk. #num#" {det-prep-noun.abbr-num}
main rule: a word string cannot be initiated by tags {det-prep} - reduce word string by these words.
output: "stk. #num#" {noun.abbr-num}
Example of a Class (2) rule:
input: "registrere som landbrug og" {vb.inf-konj-noun-konj}
main rule: a word string cannot be ended by tag {konj} - reduce word string by this word.
output: "registrere som landbrug" {vb.inf-konj-noun}
Example of a Class (3) rule:
input: "opgøres på grund af" {vb.pres.pass-prep-noun-prep}
main rule: a word string cannot be equivalent to tags {vb.opt.opt-prep-noun-prep}², where *prep-noun-prep* is a complex prepositional phrase - omit entire word string.
output: 0
Example of a Class (4) rule:
input: "den i stk. #num# omhandlede afgift" {det-prep-noun.abbr-num-vb.part.adj-noun}
main rule: a word string cannot be initiated by tags {det-prep} - reduce word string by these words.
exception: a word string initiated by tags {det-prep} cannot be reduced if the word string is ended by tags {num-vb.part.adj-noun}³.
output: "den i stk. #num# omhandlede afgift" {det-prep-noun.abbr-num-vb.part.adj-noun}
Example of a Class (5) rule:
input: "fritage for" {vb.inf-prep}
main rule: a word string cannot be ended by tag {prep} - reduce word string by this word.
exception: a word string ended by tag {prep} cannot be reduced if {prep} has in its immediate left position the tag {vb.opt.opt}, and {vb.opt.opt} has in all its lemmatized forms the orthographic representation of {prep} in 80% or more of all cases.
output: "fritage for" {vb.opt.opt-prep}
²This notation refers to optional filling of slot. The slots may be filled by a tag referring to a verbal tense, voice and
  ...mstændigheder   fritage        en virksomhed for...
  ...kan indrømme    fritagelse     for...
  ...er kan meddele  fritagelse     for forhøjelse af...
  ...omhandlende     fritagelser.   d. Spørgsmål 1...
  ...f virksomheder  fritage        for at svare afgift...
  ...arer ville være fritaget       for afgift efter §...
  ...dre EF-lande er fritaget       for afgift : 1...
  ...heder ville være fritaget      for afgift . Stk. 5...
  ...ndet ville være fritaget       for afgift . 2)...
  ...stk. 1 og 3, er fritaget       for at svare afgift...
Fig. 1. Concordance for the lemma "fritage[-]" (frequency of occurrence: 10) and its immediate right collocates. In eight out of ten cases, the lemma "fritage" has "for" as its adjacent right collocate. In this case, the {prep} tag is deemed to be a particle rather than a preposition, and thus a valid part of a phrasal verb.
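The 80% rule behind the Class (5) exception can be checked against the concordance in Fig. 1 with a short Python sketch. The list below holds the ten immediate right neighbours as read from the figure; the function name and the default threshold argument are illustrative choices, not part of the study's implementation.

```python
def passes_80_percent_rule(right_neighbours, prep, threshold=0.8):
    """True if `prep` is the immediate right neighbour of the lemma
    in at least `threshold` of all its occurrences."""
    if not right_neighbours:
        return False
    hits = sum(1 for word in right_neighbours if word == prep)
    return hits / len(right_neighbours) >= threshold

# Immediate right neighbours of "fritage[-]" in Fig. 1, row by row.
fig1_right = ["en", "for", "for", "d.", "for",
              "for", "for", "for", "for", "for"]

is_phrasal = passes_80_percent_rule(fig1_right, "for")
```

With eight hits out of ten the threshold is met exactly, so "fritage for" is retained rather than reduced.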
7. Evaluating the syntactic filter.
The syntactic filter was created on the basis of the orthographic output strings. The output was examined in order to distinguish between word strings which should either be permitted as syntactically well-formed or omitted as syntactically ill-formed. This was done in order to deduce syntactic rules to be generalised with a view to creating the best possible recall-precision rate.
48% of all orthographic word strings were syntactically ill-formed, whereas 52% were well-formed. This near 50-50 ratio might seem to imply a degree of randomness that would make syntactic formalization impossible. However, if we examine syntactic tag sequences among the group of omitted word strings, we discover that some 96% are actually unique, whereas some 4% share a syntactic sequence with a member of the group of syntactically well-formed word strings. Within the group of omitted word strings with a unique tag sequence, 99% may be subject to formalization, whereas only 1% could not be formalised into any operational rule. This one per cent accounts for a minor loss of precision (-0.6%). The formalization of the 99% word strings did not conflict with rules applicable to the syntactically well-formed word strings. All of the remaining 4% ambiguous word strings could be formalised, however with a moderate loss of recall (-3.4%). The syntactic filter works with 96.6% recall and 99.4% precision, and provides some indication that syntax usage in sub-domain LSP texts like the present one is of a nature which accommodates formalisation rules.
³This rule aims at including a syntactic pattern particular to the legal domain in Danish in which the head of a nominal phrase is pre-modified by a determiner followed by a preposition.
8. How to evaluate the terminological relevance of the word strings extracted.
The syntactic filter ensures a well-formed output, but this is of course no guarantee that the output meets a certain qualitative standard as to text typicality. How can we know that the word strings extracted are not merely commonplace collocations found in virtually any text? One obvious way of judging the output would be to have an expert evaluate it; however, one obvious drawback to this approach is the circumstance that experts tend to have different and sometimes even conflicting opinions. Another method might be to test how the word strings are dispersed on a cross-section of texts from various domains, in order to find out whether the same collocations tend to appear in one or more other texts, or whether they tend to be restricted to the Danish VAT Act. Obviously, we cannot look for the collocations in all texts ever written, but we can ascertain with statistical certainty whether a given word string is restricted to the VAT Act or not. Evaluation of LSP relevance has been left out of consideration for this present study, but is an interesting issue for further research.
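The dispersion test suggested above could start from a simple look-up of each extracted word string in a cross-section of texts. The following Python sketch only reports in which texts a word string occurs; it makes no statistical claim, and the function name, text names and token lists are hypothetical.

```python
def dispersion(word_string, texts):
    """Return the names of the texts in which the word string occurs
    as a contiguous sequence of tokens."""
    target = tuple(word_string)
    n = len(target)
    hits = []
    for name, tokens in texts.items():
        if any(tuple(tokens[i:i + n]) == target
               for i in range(len(tokens) - n + 1)):
            hits.append(name)
    return hits

# Hypothetical cross-section of tokenised texts.
texts = {
    "vat_act": ["virksomheden", "er", "fritaget", "for", "afgift"],
    "novel": ["han", "gik", "i", "henhold", "til", "planen"],
    "manual": ["i", "henhold", "til", "vejledningen"],
}
restricted = dispersion(["fritaget", "for", "afgift"], texts)
widespread = dispersion(["i", "henhold", "til"], texts)
```

A word string found in one text only is a candidate for domain-specific status, whereas a string found across several domains is more likely a commonplace collocation.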
References
Benson, M. (1990): Collocations and general-purpose dictionaries. International Journal of Lexicography 2, pp. 1-14.
Blom, B. (1997): Om statistisk og strukturel afgrænsning af sandsynlige teksttypiske kollokationer i Momsloven. In: UDOG-rapport 5, pp. 3-23.
Blom, B. (1998 - forthcoming): A method for identifying collocations likely to be relevant in relation to a sub-domain LSP-text. UDOG-rapport 7.
Bookstein, A. & Swanson, D. R. (1974): Probabilistic models for automatic indexing. Journal of the American Society for Information Science 25, pp. 312-318.
Bookstein, A. & Swanson, D. R. (1975): A decision-theoretic foundation for indexing. Journal of the American Society for Information Science 26, pp. 45-50.
Church, K. W. & Hanks, P. (1990): Word association norms, mutual information and lexicography. Computational Linguistics 16-1, pp. 22-29.
Dennis, S. F. (1967): The design and testing of a fully automatic indexing-search system for documents consisting of expository text. In: G. Schechter (ed.) Information Retrieval: A critical review, pp. 67-94. Thompson Books Co., Washington.
Frantzi, K. T. & Ananiadou, S. (1996): Extracting nested collocations. Proceedings from COLING '96, pp. 41-46.
Harter, S. P. (1975): A probabilistic approach to automatic keyword indexing. Part 1: On the distribution of specialty words in a technical literature, Part 2: An algorithm for probabilistic indexing. Journal of the American Society for Information Science 26, pp. 197-206, pp. 280-289.
Ikehara, S., Shirai, S. & Uchino, H. (1996): A statistical method for extracting uninterrupted and interrupted collocations from very large corpora. Proceedings from COLING '96, pp. 574-579.
Sinclair, J. (1991): Corpus, concordance and collocation. Oxford University Press.
Smadja, F. (1993): Retrieving collocations from text: Xtract. Computational Linguistics 19-1, pp. 143-178.
Stone, D. C. & Rubinoff, M. (1968): Statistical generation of technical vocabulary. American Documentation, pp. 411-412.
Zipf, G. K. (1949): Human behaviour and the principle of least effort. Addison-Wesley, Cambridge, Massachusetts.