American Journal of Computational Linguistics
Microfiche 4

Nicholas V. Findler and Heino Viil
Department of Computer Science
State University of New York at Buffalo
We describe a branch of dictionary science that deals with the mathematical and statistical aspects of dictionaries, and recommend the term lexicometry for it. It is related to both lexicography and lexicology, the former denoting the description of lexical material and the latter its analysis and study.
Many problems in computational linguistics require the use of a stored dictionary easily accessible to a computer program. In the course of an investigation, such a dictionary may have to be expanded, reduced, rearranged, or modified in various ways. Also, several nonlinguistic disciplines using the computer, such as psychology, biology, medicine, and sociology, often need a large data base in the form of a dictionary. The relevant structural properties of a dictionary, however, have not yet been sufficiently and systematically investigated. Research in this area is needed in order to optimize the construction of stored dictionaries and to manipulate them in efficient ways.
1 A considerably extended version of this paper was submitted to the State University of New York at Buffalo in partial satisfaction of the requirements for the degree of ... of Science of Heino Viil. The project represents the continuation of an earlier work by Nicholas V. Findler. Many ideas and all the programming effort are due to Heino Viil. The write-up is a joint effort. The work reported here was supported by National Science Foundation Grant GJ-658.
First, we review critically the problems of meaning and its representation, and the questions relating to lexical definitions, to polysemy, homonymy, semantic depletion, synonymy, and to lexicography and lexicology in general. We also discuss the concept of lexical valence and elaborate a novel idea, coverage, which is of both theoretical and practical importance. In this context, relationships are established among three variables: the size of the covered set, the size of the covering set, and the maximum definition length. Both the size of the covering set and the maximum definition length should be small for economic considerations. But decreasing one will increase the other. It is therefore important to establish these relationships empirically. The knowledge so gained will constitute a basis for optimizing the structure of a dictionary for a specified size of the covered set and a specified machine.
The present pilot project in this virgin field has the objective of verifying some conjectures. It establishes some principles of constructing, formatting, and storing a large data base in dictionary form. It develops programs for displaying, handling, and modifying such a data base. The paper offers an example of how a conceptually continuous operation on large amounts of data can be reduced to operating on a fraction of the whole data base at a time, in successive small increments of time. Finally, we demonstrate the feasibility of solving lexicometric problems on the computer and, at the same time, show the cost involved in doing such work in terms of both human effort and machine time, and the results that were obtained in using an existing dictionary of computer terminology of more than 1,800 entries.

The effort required was considerable: 6 man-months of work and about 14 hours of CDC 6400 computer time. Programming was done
TABLE OF CONTENTS

Some problems of lexical relatedness
    1. Polysemy and homonymy
    2. Synonymy
    3. Definitions
Aspects of the science of dictionary
    1. General concepts
    2. The problem of coverage
On lexicometric relationships among the size of defining set, the size of the defined set and the maximum length of definitions
    1. Some measures of coverage
    2. Construction of the data base
    3. The results of the computations
Acknowledgement
References
Appendix I
    Program development
Appendix II
    Some ideas for the program to investigate the relationship

INTRODUCTION

Since the early days of electronic computing, two kinds of associations have existed between computers and dictionaries: either the computer uses, for various purposes, a stored dictionary of some sort (lexicon, vocabulary, glossary, thesaurus) or the computer is employed for constructing and analyzing a dictionary. The latter activity was given a strong impetus in the late 1950's by the formation of the Centre d'Etudes du Vocabulaire Français and its publication, the Cahiers de Lexicologie. Thus lexicography was among the first non-mathematical disciplines to make use of the symbol manipulating capability of computers.

While formal theories of syntax have been successful in describing the rules of grammatical acceptability of natural language utterances, the study of meaning, usually called semantics, has not yet produced a theory of the semantic structure of languages based on observation and analysis. It is beyond the scope of this paper to discuss, even superficially, the various viewpoints concerned with the concept of meaning. One of us, Viil (1974), has, however, compiled a reasonably exhaustive critical survey of the relevant literature. For the purposes of this work, it suffices to present the following categories of meaning.

1. Logical meaning applies to such attempts to deal with meaning as symbolic logic and mathematics. The meanings with which the signals of such systems correlate are unique outside-world referents or unique meanings within the logical system that eventually have outside-world referents.

2. General-semantic meanings are also unique in their reference to the outside world, but the semanticists are less stringent in scope than the logicians. Nevertheless, their scope is an idealized language, much more limited than ordinary language.

3. Communication-theory meaning is equivalent to the amount of information that can be transmitted per unit time in a communication system.

4. Lexicographical meaning is that of "words," and the outside-world reference is what we ordinarily call "meaning."

5. Psychological meaning has so great a scope that the part involving ordinary language becomes nearly trivial. It encompasses overt or covert behavior of any organism as responses to stimuli.

6. Word-mind meaning has the scope equivalent to that of conceptual categories. To ordinary meanings (in the lexical sense) here correspond signals by which mental states are ascertained.

7. Linguistic meaning refers to signals as the pieces out of which language is made, i.e. microlinguistic, phonological, and syntactic signals.

In the framework of our particular topic we shall be mainly concerned with categories 4 and 7.

According to Weinreich (1966), unilingual defining dictionaries appear to be based on a model that assumes a distinction between meaning proper (signification, comprehension, intension) and the thing meant by a sign (denotation, reference, extension). On the basis of what is meant by a sign, Osgood, Suci, and Tannenbaum (1957) distinguish three kinds of meaning.

1. Pragmatical (sociological) meaning: the relation of signs to situations and behaviors.

2. Syntactical (linguistic) meaning: the relation of signs to other signs.

3. Semantical meaning: the relation of signs to their significates.

It is easy to see that these classes are in [...]

Homing onto our primary target, we may now restrict our interests somewhat further and concentrate on the two last classes of meaning, known under various designations but, by the majority of writers, distinguished as structural meaning and lexical meaning.

Mackey (1965) finds structural meanings in (1) structure words, (2) inflectional forms, and (3) types of word order. Examples of structure words are articles and prepositions, and these, he insists, although often called meaningless or empty, may have a large number of meanings. Similarly, the inflectional forms, such as the genitive case and present tense, may have a number of meanings, and so may some types of word order.

Lexical meanings, on the other hand, refer to the meanings of the content words, in which the differences in meaning are most easily seen.

In Russell's view (1967) the structure words, such as "than," "or," "however," have meaning only in a suitable verbal context and cannot stand alone. The content words, which he calls object words, such as proper names, class names of animals, and names of colors, do not presuppose other words and can be used in isolation. Their meaning is learnt by confrontation with objects that are what they mean or instances of what they mean. As soon as the association between an object word and what it means has been established by the learner's hearing it frequently pronounced in the presence of the object, the word is understood [...] excludes words that denote abstract entities, which are not object-like and usually cannot have a "presence." It also denies that every structure word inherently denotes one or a few definite relationships even in isolation. If this were not so, one could not understand what kind of relationship it designates if used in a context.

Lyons (1969), quite sensibly, distinguishes between three different kinds of structural, or grammatical, meaning.

1. The meaning of grammatical items, such as prepositions and conjunctions.

2. The meaning of grammatical functions, such as subject and object, i.e. syntactical relations.

3. The meaning associated with notions such as declarative, interrogative, imperative, i.e. syntactical types.

He further rightly observes that grammatical items belong to closed sets, which have a fixed, small membership, e.g. personal pronouns. Lexical items, on the other hand, belong to open sets, which have an unrestricted, large membership, e.g. nouns. Moreover, lexical items have both lexical (material) and [...]

In our work, the distinction between structure words and content words is essential. This fact is clearly seen in the preparation of the dictionary used for our experiments.

SOME PROBLEMS OF LEXICAL RELATEDNESS

1. Polysemy and Homonymy

While the problem of meaning is complex in itself, the difficulty increases by another order of magnitude if one has to deal with words of many meanings or different words with different meanings that have identical spellings or pronunciations. And the decision as to whether a given case represents one polysemous word or two (or more) homonyms is far from being well defined.

The separation can be based on morphological criteria. First of all, two graphematically identical word forms with different meanings are regarded as homographs and separated if they display a phonematic difference or if they belong to different word classes. They are also homographs even if they belong to the same word class but possess different inflection systems. Otherwise, they represent the same word. More than one meaning [...] appearance. A distinction between the two can only be made, if at all, on the basis of the historical origin of the words involved.
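
The morphological criteria above amount to a small decision procedure, which can be sketched as follows. This is only an illustration: the record type `WordForm`, its field names, and the example entries are assumptions introduced here, not part of the programs described in this paper, and the two forms are taken to differ in meaning already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordForm:
    """Hypothetical record for one word form; field names are illustrative."""
    spelling: str
    pronunciation: str
    word_class: str
    inflection: str  # label of the inflection system the form follows


def are_homographs(a: WordForm, b: WordForm) -> bool:
    """Apply the three criteria to two identically spelled forms that are
    assumed to differ in meaning. True means 'separate homographs',
    False means 'the same (polysemous) word'."""
    if a.spelling != b.spelling:
        raise ValueError("the criteria apply only to identically spelled forms")
    return (
        a.pronunciation != b.pronunciation   # phonematic difference
        or a.word_class != b.word_class      # different word classes
        or a.inflection != b.inflection      # different inflection systems
    )


lead_metal = WordForm("lead", "led", "noun", "regular-noun")
lead_guide = WordForm("lead", "li:d", "verb", "lead-led-led")
lie_recline = WordForm("lie", "lai", "verb", "lie-lay-lain")
lie_untruth = WordForm("lie", "lai", "verb", "lie-lied-lied")
bank_river = WordForm("bank", "bank", "noun", "regular-noun")
bank_money = WordForm("bank", "bank", "noun", "regular-noun")

print(are_homographs(lead_metal, lead_guide))
print(are_homographs(lie_recline, lie_untruth))
print(are_homographs(bank_river, bank_money))
```

Under these criteria alone the two senses of "bank" fall together as one word; as the text notes, only the historical origin could separate them.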
Direct, transferred and specialized senses of a word can be listed along one dimension of meaning; dominant and basic senses represent certain measures along another dimension.

Another concept is semantic depletion, in which case the word occurs in scores of expressions. Here, the verbal or situational context adds substantially to the meaning of the word in question. With polysemy, however, the context eliminates those senses of the word that do not apply and thereby disambiguates the polysemous word. It is, therefore, important from the lexicographical point of view to distinguish between the degrees of interaction between the context and the meaning of individual words:

(a) in case of weak influence, we talk about autosemantic or semantically autonomous words;
(b) a strong influence performs a disambiguation of polysemous or homonymous words;
(c) the context defines the meaning of synsemantic or semantically depleted words.

Needless to say that the above, as innumerable other, [...] it could be noted that, in exceptional cases, even the immediate context cannot resolve the ambiguity and two or more interpretations are acceptable. This phenomenon is the [...]

2. Synonymy

It is clear even to the casual observer that total interchangeability in all contexts, and identity in both cognitive and emotive senses, of two lexical units (words, in the simplest case) are not possible in general. The semantic relationship between [...] synonymy is based on and measured by a level of similarity.

Rather than distinguishing between the "meaning" and the "usage" of a word, one should assume the view that the former is the sum total of the possibilities of the latter. This is basically what justifies the existence of any monolingual (and, possibly, bilingual) dictionary.

The entries in the dictionaries we are concerned with are both words (the interpretation and definition of which units are less than clear-cut) and multi-word lexical units. The two are of the same standing and function, and they will be treated identically.

3. Definitions

Definition is the most fundamental concept associated with dictionaries. We shall be concerned with both classical Aristotelian definitions, based on "class" and "characteristics", and operational definitions, which use sentential or generative terms. In fact, it is often difficult or impossible to separate equivalence or paraphrase definitions, on one hand, and those that are process-oriented reproductions, on the other.

In general, the lexical meaning can be rendered by four basic instruments and their various combinations:

(a) the lexicographic definition enumerates the most important features of the lexical unit being defined, in the simplest possible terms;
(b) qualified synonyms provide a system of semantically most related words;
(c) exemplification puts the defined unit in functional combination with other units;
(d) a gloss is an explanatory or descriptive comment related to the dictionary entry; it may also state similarities to [...]

ASPECTS OF THE SCIENCE OF DICTIONARY

1. General Concepts

Although definitions abound, a reasonable distinction seems to be to say that the semantic description of individual terms, the inventory of words, is the customary province of lexicography, whereas lexicology refers to the study of the lexical material, of the recurrent patterns of semantic relationships, and of any formal devices, such as phonological and grammatical systems, that generate the latter.

To construct a dictionary of a given size, one could choose the entries on the basis of their frequency of occurrence or by relying on some measure of utility that is vaguely tied to the semantic generality of the candidates. No solution is perfect or even uniformly useful over the whole dictionary.

Even the arrangement of meanings of a given entry is moot. We talk about logical, historical and empirical orders. (The latter starts with the common and current usage, followed by obsolete, colloquial, provincial, slang and technical meanings.) [...] about some component of the extralinguistic world.

Our work derives its data base from an encyclopedic dictionary. It should be noted that the highly polysemous nature of the entries in a linguistic dictionary would have constituted an additional complication in this pilot project, which has now been avoided without affecting the general validity of the results.

We propose to introduce the term lexicometry to designate the discipline which investigates and analyzes the quantitative aspects of dictionaries, the vocabulary of a language and various subsets of the latter. Lexicometry would count, weigh and measure, and express the results in statistical and mathematical terms.

Many such studies are widely known. Such is the one reported by Guiraud (1959): The most frequent words are:

(a) the shortest,
(b) the oldest,
(c) the morphologically simplest,
(d) the semantically most extended, i.e. possessing the greatest number of meanings.

As to the measure of frequency,

    the first  100 words cover 60%   of an average text,
    the first 1000 words cover 85%   of an average text,
    the first 4000 words cover 97.5% of an average text.

Thus the remaining X (?) thousand words cover only 2.5% of the text. However, from an information theoretic point of view,

    the first  100 words comprise 30% of the information,
    the first 1000 words comprise 50% of the information,
    the first 4000 words comprise 70% of the information.

Consequently, rare words convey a great deal of information. We could say that a frequent word is most useful in the aggregate, and a rare word in a particular case.

Other studies in glottochronology concern themselves with the rate of change in language and in basic vocabulary. Further, the distribution of the frequencies of occurrence, with or without reference to any particular vocabulary, has also been studied.

Finding relations of the above kind is not just an academic exercise to satisfy the curiosity of a few linguists; these relationships may have various practical applications. For example, Maas (1972) asserts that the knowledge of a functional relation between the length of a text and the size of the vocabulary used in it would be desirable in order to estimate the effort needed for extension of a machine dictionary or in comparison of vocabulary contents of texts of different lengths. In the latter case, one can standardize or normalize the texts under investigation by reducing them to a common minimal length through computational methods and then compare the resulting [...]
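
Coverage figures of the kind quoted from Guiraud can be recomputed for any machine-readable text. The sketch below is a minimal illustration only: the function name `coverage_of_top_words`, the crude tokenizer, and the sample sentence are assumptions made here, not the procedure used in the studies cited.

```python
import re
from collections import Counter


def coverage_of_top_words(text: str, k: int) -> float:
    """Fraction of the running words of `text` accounted for by its
    k most frequent distinct words."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


sample = "the cat and the dog and the bird saw the cat"
for k in (1, 2, 6):
    print(k, round(coverage_of_top_words(sample, k), 3))
```

On a text of realistic size, plotting this fraction against k gives the steeply rising, then flattening curve that the percentages above summarize.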

Let $V$ be the number of elements (words) in a text and $N$ the length of the text. Then we surmise, says Maas, a functional relationship to exist between $N$ and $V$:

$$V = f(N)$$

Muller (1964) reported a relation between $V$ and $N$ such that the ratio of their logarithms is constant:

$$\frac{\log N}{\log V} = a, \quad \text{or} \quad V^{a} = N,$$

or, if we set $1/a = k$,

$$V = N^{k}$$

Since the vocabulary of a language, however, is supposed to be restricted, so argues Maas, the existence of a limiting value is to be postulated:

$$V_{\infty} = \lim_{N \to \infty} f(N)$$

As the derivative of $f$ at a given value of $N$ represents the relative increase in $V$, it is to be stated that $f'(N)$ approaches 0 with increasing $N$. The derivative of $f$ at the point 1 is assumed to be 1 because [...] Therefore $f'$ is a function that decreases monotonically from 1 to 0.

As a consequence of the above speculations, in the expression $V = N^{k}$, $k$ cannot be constant. Statistical investigations of the dramas by Corneille have resulted in the relationship

$$\log \frac{1}{k} = 0.0137 \, (\log N)^{1/3}$$

Thus, if $N$ is given, $k$ can be determined, and $V$ can be calculated from $V = N^{k}$.

Another noteworthy concept is that of the repetition factor [...], which shows how often a word has occurred in a text on the average. The following relationship has been determined: [...], which displays a very good agreement with reality.

No single empirical law seems to exist between $N$ and $V$ for all $N$.

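
The quantities discussed in this section can be measured directly on a text. The sketch below computes $N$, $V$, the exponent $k$ in $V = N^{k}$, and an average repetition count. It rests on assumptions made here: the tokenizer is crude, the name `text_statistics` is illustrative, and the repetition factor is taken to be $N/V$, which is one natural reading of "how often a word has occurred on the average" and not a formula recovered from the text.

```python
import math
import re


def text_statistics(text: str) -> dict:
    """Return text length N, vocabulary size V, the exponent k for which
    V = N**k, and the average number of occurrences per distinct word."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    n = len(tokens)
    v = len(set(tokens))
    if n < 2:
        raise ValueError("need at least two running words to compute k")
    k = math.log(v) / math.log(n)   # from V = N**k, i.e. k = log V / log N
    return {"N": n, "V": v, "k": k, "repetition": n / v}  # N/V is an assumption


sample = "the cat and the dog and the bird saw the cat"
stats = text_statistics(sample)
print(stats["N"], stats["V"], round(stats["k"], 3), round(stats["repetition"], 3))
```

Computing `k` on successively longer prefixes of one text is a simple way to observe that it does not stay constant as $N$ grows.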
2. The Problem of Coverage

We are now coming close to the core subject matter of this paper. Mackey (1965) states that "the coverage or covering capacity of an item is the number of things one can say with it. It can be measured by the number of other items which it can displace." According to him, words can displace other words by four means: (1) inclusion, (2) extension, (3) combination, and (4) definition.

1. A word that already includes the meaning of other words can be used instead of these (e.g., seat includes chair, bench, stool, and place).

2. Words the meanings of which are easily extended metaphorically can be used to eliminate others (e.g., [...]).

3. Certain simple words can displace others by combining either together or with simple word endings (e.g., news + paper + man = journalist; hand + book = manual).

4. Certain words can be replaced by simple definition (e.g., breakfast can be defined as morning meal; pony as small horse).

As
an example of the application of the above principle, in the derivation of Basic English (by definition), the language was first reduced to 7500 words and, by redefinition, cut down to 1500. These were further reduced to the eventual 850 by a technique of "panoptic" definition (eliminate each word on the grounds that it is some sort of modification of other words, e.g. a modification in time, number, or size).

Basic English, which was founded essentially on the principle of coverage, was a conscious reaction against the over-application of the principle of frequency in selection. For Ogden (1933), it was not the frequency of a word which makes it useful; it was its usefulness which makes it frequent.

In the
following part of this section, we attempt to present some of the salient points of Savard (1970). [...] not sufficient to select words for a restricted vocabulary for the purpose of teaching a foreign language, such as French, to beginners.

An objective criterion is lexical valence. It would allow

1. to obtain a novel principle of vocabulary selection,
2. to assist the investigators in setting up a base vocabulary for French,
3. to provide a usable definition, combination, inclusion, and extension vocabulary,
4. to correct all the already existing scales of French vocabulary,
5. to provide a valid working tool for the analysis of teaching material.

The valence problem is a problem of verbal economy. What he calls valence is the fundamental capability of a word to be substituted for another word. It is Mackey's coverage that he [...]

Like Mackey (1965), he maintains that the substitution of one word for another can be made by virtue of four criteria: (1) definition, (2) inclusion, (3) combination, (4) extension.

Definition has already been discussed previously.
Linguists do not talk specifically about inclusion; rather, they deal with synonymy or lexical parallelism. Synonyms are words that have nearly the same meaning, e.g. lieu and endroit. For Savard, the basic criterion that permits one to establish a series of synonyms is the possibility of substituting one term for another.

One of the simplest among all the procedures of vocabulary enrichment consists of joining two words in order to make compound words. The principle of combination appears as another phenomenon common to all languages.

It is not necessary that the number of simple words be unbounded, because almost all verbs have a potential of undetermined sense, and so do the adjectives. A word is said to have more or less extension according to whether it can "cover" a greater or smaller number of fully or partially different notions.

Polysemy is the exact opposite of synonymy. Polysemy becomes [...] homonymy constitute two very rich sources of lexical economy. Together they form Savard's last criterion of lexical valence: the semantic extension.

Although the valence itself has never been mathematically measured and although there exist no scientific means of showing its existence, it has, nevertheless, been proven that four formal procedures of lexical economy permit one to replace certain words by other words, and that is what Savard calls lexical valence.

The postulated existence hypothesis of lexical valence leads to the calculation of a global index of valence for every word.

To evaluate the power of definition of a word, one inspects, in the dictionary, each element of the general list and counts how many times a word enters into the definition of another.

To measure the power of combination of a lexical unit, one inspects in the dictionary all the compound words joined by a hyphen, all the Gallicisms (in English, these would be Anglicisms) and, in general, all the word groups.

With a view to appraising the power of inclusion, one inspects the units of the general list in two synonym dictionaries and takes the higher number. The number of synonyms that a word possesses constitutes a measure of the number of words [...]

To measure the power of semantic extension, one inspects each of the elements of the general list in the dictionary and counts the number of meanings given by the author to such a word in the list. The number of meanings of a word is considered as a measure of its power of semantic extension.

The global index of lexical valence is the sum of the four normalized counts. The two criteria having the highest correlation are definition and combination.
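
The counting procedure for the power of definition and the summation into a global index can be sketched as follows. Several things here are assumptions, since the text does not specify them: each raw count is normalized by dividing by the largest count for that criterion, the function names are illustrative, and all dictionary entries and counts are toy values invented for the example.

```python
def power_of_definition(dictionary: dict) -> dict:
    """For every word, count the entries (other than itself) whose
    definition contains it. `dictionary` maps entry -> list of words."""
    counts: dict = {}
    for entry, definition in dictionary.items():
        for word in set(definition):      # count each entry at most once
            if word != entry:
                counts[word] = counts.get(word, 0) + 1
    return counts


def global_valence(counts_by_criterion: dict) -> dict:
    """Sum, over the criteria, of each word's count divided by the largest
    count for that criterion (the normalization is an assumption)."""
    words = set().union(*(c.keys() for c in counts_by_criterion.values()))
    index = {word: 0.0 for word in words}
    for counts in counts_by_criterion.values():
        top = max(counts.values(), default=0)
        if top == 0:
            continue
        for word in words:
            index[word] += counts.get(word, 0) / top
    return index


# Toy data, invented for illustration only.
toy_dictionary = {
    "pony": ["small", "horse"],
    "foal": ["young", "horse"],
    "breakfast": ["morning", "meal"],
}
criteria = {
    "definition": power_of_definition(toy_dictionary),
    "combination": {"horse": 4, "meal": 2},
    "inclusion": {"horse": 3, "meal": 3},
    "extension": {"horse": 2, "meal": 1},
}
valence = global_valence(criteria)
print(valence["horse"], valence["meal"])
```

Any other normalization (ranks, standard scores) could be substituted in `global_valence` without changing the structure of the computation.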
In the beginning of the study, it was assumed that the four variables were entirely independent of each other. The results of a factor analysis indicate that they are not completely so. A factor rotation shows, however, that the variables are sufficiently independent to make it necessary to retain the four criteria of lexical valence.

A comparison of the rank of the first 40 content words on the valence scale with the same words on the frequency list allows one to frame a hypothesis that the correlation between valence and frequency would be rather weak. A more complete study would show without doubt that we have there two very different selection principles.

In conclusion, it can be stated with confidence that the measure of valence is no less valid than that of frequency, distribution, and availability. These concepts will eventually lead to more efficient dictionaries with respect to precision, [...]

ON LEXICOMETRIC RELATIONSHIPS AMONG THE SIZE OF DEFINING SET, THE SIZE OF DEFINED SET AND THE MAXIMUM LENGTH OF DEFINITIONS

1. Some Measures of Coverage

A dictionary may be considered efficient and economical if it uses a reasonably small set of words to define a relatively large set of entries. We have, however, only a very vague idea about what size vocabulary is needed to cover a given number of dictionary entries. (The related problem of circular definitions seems to have to wait for a computer solution.)

It is known, for example, that Basic English, Ogden (1933), involves a list of 850 English words and 50 international words, which were eventually used to define the 20,000 English words of the Basic English Dictionary. This gives a ratio of the number of covering words to that of defined words of 0.045.

West studied the problem of what constitutes a simple definition and established a minimum defining vocabulary of 1,490 words. The meaning of some 18,000 words and 6,000 idioms, i.e. about 24,000 expressions, was explained exclusively by these 1,490 words, which were not defined themselves. The results were published in 1961 as The New Method English Dictionary by M. P. West and J. G. Endicott. The corresponding size ratio here is 0.062.
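
The two ratios just quoted follow directly from the figures given in the text, as this short check shows. The function name `size_ratio` is introduced here for illustration only.

```python
def size_ratio(covering_size: int, covered_size: int) -> float:
    """Number of covering (defining) words per covered (defined) entry."""
    return covering_size / covered_size


# Basic English: 850 English + 50 international words define 20,000 words.
basic_english = size_ratio(850 + 50, 20_000)
# West and Endicott: 1,490 words explain about 24,000 expressions.
west = size_ratio(1_490, 24_000)

print(round(basic_english, 3), round(west, 3))  # prints: 0.045 0.062
```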
[...] can define a set of about 20 times that size, but in general the behavior of these variables has not been investigated and is not known in any detail.

One of us, in Findler (1970), has formulated the problem in definite terms. Three variables were considered: (1) the covered set $S$ of size $v_S$, (2) the covering set $R$ of size $v_R$, and (3) the maximum definition length $N$, such that each word in $S$ can be defined by at most $N$ ordered words. The task is to find:

(a) $v_R$ as a function of $v_S$ at different values of $N$ as a parameter, and
(b) $v_R$ as a function of $N$ at different values of $v_S$ as a parameter.

Using the terminology of increment ratio for $\Delta v_R / \Delta v_S$ and size ratio for $v_R / v_S$, it was postulated for case (a) that

* the increment ratio is, in general, less than one,
* the increment ratio, in general, decreases as $v_S$ increases,
* for large values of $N$, $v_R$ asymptotically approaches a limiting value as $v_S$ increases,
* the increment ratio will never exceed the size ratio.

An exception to this rule would occur in a dictionary system which does not treat homonyms as individual entries.

It was further assumed that for $N = 1$ the covering set and the covered set are of the same size, i.e. both the increment ratio and the size ratio equal one. We must now correct this statement, because not every word is defined by itself only. If a new word is introduced that already has a synonym in the covering set, it will be defined by that synonym. Then the increment ratio is 0 and the size ratio becomes less than 1.

For the second case, (b), it is postulated that

* $v_R$ monotonically decreases as $N$ increases,
* for any fixed $v_S$ value, $v_R$ asymptotically approaches a lower limit as $N$ increases without bound.

It was finally pointed out that $v_R$ should be small to minimize storage requirements, and $N$ should be small to minimize processing time and output volume. A compromise on these conflicting requirements is needed. The ultimate question is: "What are the optimum $v_R$ and $N$ values for a given $v_S$ for certain computer applications on a machine with a given cost structure?"
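
Once $v_R$ has been measured as a function of $N$ for a fixed $v_S$, the compromise can be framed as a small minimization. The sketch below is purely hypothetical: the linear cost model, the unit costs, the function name `optimum`, and every number in `measured` are invented for illustration and are not results of this study.

```python
def optimum(points, storage_cost: float, time_cost: float):
    """Pick the (N, v_R) pair with the lowest total cost under an assumed
    linear cost model: storage_cost per covering word plus time_cost per
    word of maximum definition length."""
    return min(points, key=lambda p: storage_cost * p[1] + time_cost * p[0])


# Hypothetical (N, v_R) pairs for one fixed v_S; v_R falls as N grows.
measured = [(1, 1800), (2, 1200), (4, 900), (8, 800)]

print(optimum(measured, storage_cost=1.0, time_cost=100.0))
print(optimum(measured, storage_cost=1.0, time_cost=400.0))
```

Raising the cost of processing time shifts the optimum toward shorter definitions and a larger covering set, which is the trade-off the question above asks about.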

It is reasonable to assume that the behavior of the three variables, and therefore the answer to the last question, will largely depend on the semantic index of the elements of the covered set and on the lexical valence of the elements of the