
A Few Steps Towards Computer Lexicometry


American Journal of Computational Linguistics

Nicholas V. Findler and Heino Viil
Department of Computer Science
State University of New York at Buffalo

Microfiche 4

We describe a branch of dictionary science, and recommend the term lexicometry for it, that deals with the mathematical and statistical aspects of dictionaries. It is related to both lexicography and lexicology, the former denoting the description of lexical material and the latter its analysis and study.

Many problems in computational linguistics require the use of a stored dictionary easily accessible to a computer program. In the course of an investigation, such a dictionary may have to be expanded, reduced, rearranged, or modified in various ways. Also, several nonlinguistic disciplines using the computer, such as psychology, biology, medicine, and sociology, often need a large data base in the form of a dictionary. The relevant structural properties of a dictionary, however, have not yet been sufficiently and systematically investigated. Research in this area is needed in order to optimize the construction of stored dictionaries and to manipulate them in efficient ways.

1 A considerably extended version of this paper was submitted to the State University of New York at Buffalo in partial satisfaction of the requirements for the degree of [...] of Science of Heino Viil. The project represents the continuation of an earlier work by Nicholas V. Findler. Many ideas and all the programming effort are due to Heino Viil. The write-up is a joint effort. The work reported here was supported by National Science Foundation Grant GJ-658.

First, we review critically the problems of meaning and its representation, the questions relating to lexical definitions, to polysemy, homonymy, semantic depletion, synonymy, and lexicography and lexicology in general. We also discuss the concept of lexical valence and elaborate a novel idea, coverage, which is of both theoretical and practical importance. In this context, relationships are established among three variables: the size of the covered set, the size of the covering set, and the maximum definition length. Both the size of the covering set and the maximum definition length should be small for economic considerations. But decreasing one will increase the other. It is therefore important to establish these relationships empirically. The knowledge so gained will constitute a basis for optimizing the structure of a dictionary for a specified size of the covered set and a specified machine.

The present pilot project in this virgin field has an objective of verifying some conjectures. It establishes some principles of constructing, formatting, and storing a large data base in dictionary form. It develops programs for displaying, handling, and modifying such a data base. The paper offers an example of how a conceptually continuous operation on large amounts of data can be reduced to operating on a fraction of the whole data base at a time by successive small increments of time. We finally demonstrate the feasibility of solving lexicometric problems on the computer and, at the same time, show the cost involved in doing such work in terms of both human effort and machine time, and the results that were obtained in using an existing dictionary of computer terminology of more than 1,800 entries.

The effort required was considerable: 6 man-months' work and about 14 hours of CDC 6400 computer time. Programming was done

TABLE OF CONTENTS

Some problems of lexical relatedness
  1. Polysemy and homonymy
  2. Synonymy
  3. Definitions
Aspects of the science of dictionary
  1. General concepts
  2. The problem of coverage
On lexicometric relationships among the size of defining set, the size of the defined set and the maximum length of definitions
  1. Some measures of coverage
  2. Construction of the data base
  3. The results of the computations
Acknowledgement
References
Appendix I: Program Development
Appendix II: Some ideas for the program to investigate the relationship

INTRODUCTION

Since the early days of electronic computing, two kinds of associations have existed between computers and dictionaries: either the computer uses, for various purposes, a stored dictionary of some sort (lexicon, vocabulary, glossary, thesaurus) or the computer is employed for constructing and analyzing a dictionary. The latter activity was given a strong impetus in the late 1950's by the formation of the Centre d'Etudes du Vocabulaire Français and its publication, the Cahiers de Lexicologie. Thus lexicography was among the first non-mathematical disciplines to make use of the symbol manipulating capability of computers.

While formal theories of syntax have been successful in describing the rules of grammatical acceptability of natural language utterances, the study of meaning, usually called semantics, has not yet produced a theory of the semantic structure of languages, based on observation and analysis.

It is beyond the scope of this paper to discuss, even superficially, the various viewpoints concerned with the concept of meaning. One of us, Viil (1974), has, however, compiled a reasonably exhaustive critical survey of the relevant literature. For the purposes of this work, it suffices to present the following categories of meaning.

1. Logical meaning applies to such attempts to deal with meaning as symbolic logic and mathematics. The meanings with which the signals of such systems correlate are unique outside-world referents or unique meanings within the logical system that eventually have outside-world referents.

2. General-semantic meanings are also unique in their reference to the outside world, but the semanticists are less stringent in scope than the logicians. Nevertheless, their scope is an idealized language, much more limited than ordinary language.

3. Communication-theory meaning is equivalent to the amount of information that can be transmitted per unit time in a communication system.

4. Lexicographical meaning is that of "words," and the outside-world reference is what we ordinarily call "meaning."

5. Psychological meaning has so great a scope that the part involving ordinary language becomes nearly trivial. It encompasses overt or covert behavior of any organism as responses to stimuli.

6. Word-mind meaning has the scope equivalent to that of conceptual categories. To ordinary meanings (in the lexical sense) here correspond signals by which mental states are ascertained.

7. Linguistic meaning refers to signals as the pieces out of which language is made, i.e. microlinguistic, phonological, and syntactic signals.

In the framework of our particular topic we shall be mainly concerned with categories 4 and 7.

According to Weinreich (1966), unilingual defining dictionaries appear to be based on a model that assumes a distinction between meaning proper (signification, comprehension, intension) and the thing meant by a sign (denotation, reference, extension).

On the basis of what is meant by a sign, Osgood, Suci, and Tannenbaum (1957) distinguish three kinds of meaning.

1. Pragmatical (sociological) meaning: the relation of signs to situations and behaviors.

2. Syntactical (linguistic) meaning: the relation of signs to other signs.

3. Semantical meaning: the relation of signs to their significates.

It is easy to see that these classes are in

Homing onto our primary target, we may now restrict our interests somewhat further and concentrate on the two last classes of meaning, known under various designations but, by the majority of writers, distinguished as structural meaning and lexical meaning.

Mackey (1965) finds structural meanings in (1) structure words, (2) inflectional forms, and (3) types of word order. Examples of structure words are articles and prepositions, and these, he insists, although often called meaningless or empty, may have a large number of meanings. Similarly, the inflectional forms, such as the genitive case and present tense, may have a number of meanings, and so may some types of word order. Lexical meanings, on the other hand, refer to the meanings of the content words, in which the differences in meaning are most easily seen.

In Russell's view (1967) the structure words, such as "than," "or," "however," have meaning only in a suitable verbal context and cannot stand alone. The content words, which he calls object words, such as proper names, class names of animals, names of colors, do not presuppose other words and can be used in isolation. Their meaning is learnt by confrontation with objects that are what they mean or instances of what they mean. As soon as the association between an object word and what it means has been established by the learner's hearing it frequently pronounced in the presence of the object, the word is understood

excludes words that denote abstract entities, which are not object-like and usually cannot have a "presence." It also denies that every structure word inherently denotes one or a few definite relationships even in isolation. If this were not so, one could not understand what kind of relationship it designates if used in a context.

Lyons (1969), quite sensibly, distinguishes between three different kinds of structural, or grammatical, meaning.

1. The meaning of grammatical items, such as prepositions and conjunctions.

2. The meaning of grammatical functions, such as subject and object, i.e. syntactical relations.

3. The meaning associated with notions such as declarative, interrogative, imperative, i.e. syntactical types.

He further rightly observes that grammatical items belong to closed sets, which have a fixed, small membership, e.g. personal pronouns. Lexical items, on the other hand, belong to open sets, which have an unrestricted, large membership, e.g. nouns. Moreover, lexical items have both lexical (material) and

In our work, the distinction between structure words and content words is essential. This fact is clearly seen in the preparation of the dictionary used for our experiments.

SOME PROBLEMS OF LEXICAL RELATEDNESS

1. Polysemy and Homonymy

While the problem of meaning is complex in itself, the difficulty increases by another order of magnitude if one has to deal with words of many meanings or different words with different meanings that have identical spellings or pronunciations. And the decision as to whether a given case represents one polysemous word or two (or more) homonyms is far from being well defined.

The separation can be based on morphological criteria. First of all, two graphematically identical word forms with different meanings are regarded as homographs and separated if they display a phonematic difference or if they belong to different word classes. They are also homographs even if they belong to the same word class but possess different inflection systems. Otherwise, they represent the same word. More than one meaning

appearance. A distinction between the two can only be made, if at all, on the basis of the historical origin of the words involved.

Direct, transferred and specialized senses of a word can be listed along one dimension of meaning; dominant and basic senses represent certain measures along another dimension.

Another concept is semantic depletion, in which case the word occurs in scores of expressions. Here, the verbal or situational context adds substantially to the meaning of the word in question. With polysemy, however, the context eliminates those senses of the word that do not apply and thereby disambiguates the polysemous word. It is, therefore, important from the lexicographical point of view to distinguish between the degrees of interaction between the context and the meaning of individual words:

(a) in case of weak influence, we talk about autosemantic or semantically autonomous words;

(b) a strong influence performs a disambiguation of polysemous or homonymous words;

(c) the context defines the meaning of synsemantic or semantically depleted words.

Needless to say that the above, as innumerable other,

it could be noted that, in exceptional cases, even the immediate context cannot resolve the ambiguity and two or more interpretations are acceptable. This phenomenon is the

It is clear even to the casual observer that total interchangeability in all contexts, and identity in both cognitive and emotive senses, of two lexical units (words, in the simplest case) are not possible in general. The semantic relationship between synonyms is based on and measured by a level of similarity.

Rather than distinguishing between the "meaning" and the "usage" of a word, one should assume the view that the former is the sum total of the possibilities of the latter. This is basically what justifies the existence of any monolingual (and, possibly, bilingual) dictionary.

The entries in the dictionaries we are concerned with are both words (the interpretation and definition of which units are less than clear-cut) and multi-word lexical units. The two are of the same standing and function, and they will be treated identically.

Definition is the most fundamental concept associated with dictionaries. We shall be concerned with both classical Aristotelian definitions, based on "class" and "characteristics", and operational definitions which use sentential-generative terms. In fact, it is often difficult or impossible to separate equivalence or paraphrase definitions, on one hand, and those that are process-oriented reproductions, on the other.

In general, the lexical meaning can be rendered by four basic instruments and their various combinations:

(a) the lexicographic definition enumerates the most important features of the lexical unit being defined, in the simplest possible terms;

(b) qualified synonyms provide a system of semantically most related words;

(c) exemplification puts the defined unit in functional combination with other units;

(d) a gloss is an explanatory or descriptive comment related to the dictionary entry; it may also state similarities to

ASPECTS OF THE SCIENCE OF DICTIONARY

1. General Concepts

Although definitions abound, a reasonable distinction seems to be to say that the semantic description of individual terms, the inventory of words, is the customary province of lexicography, whereas lexicology refers to the study of the lexical material, of the recurrent patterns of semantic relationships, and of any formal devices, such as phonological and grammatical systems, that generate the latter.

To construct a dictionary of a given size, one could choose the entries on the basis of their frequency of occurrence or by relying on some measure of utility that is vaguely tied to the semantic generality of the candidates. No solution is perfect or even uniformly useful over the whole dictionary. Even the arrangement of meanings of a given entry is moot. We talk about logical, historical and empirical orders. (The latter starts with the common and current usage followed by obsolete, colloquial, provincial, slang and technical meanings.)

about some component of the extralinguistic world. Our work derives its data base from an encyclopedic dictionary. It should be noted that the highly polysemous nature of the entries in a linguistic dictionary would have constituted an additional complication in this pilot project, which has now been avoided without affecting the general validity of the results.

We propose to introduce the term lexicometry to designate the discipline which investigates and analyzes the quantitative aspects of dictionaries, the vocabulary of a language and various subsets of the latter. Lexicometry would count, weigh and measure, and express the results in statistical and mathematical terms.

Many such studies are widely known. Such is the one reported by Guiraud (1959): The most frequent words are:

(a) the shortest,
(b) the oldest,
(c) the morphologically simplest,
(d) the semantically most extended, i.e. possessing the greatest number of meanings.

As to the measure of frequency,

the first 100 words cover 60% of an average text,
the first 1000 words cover 85% of an average text,
the first 4000 words cover 97.5% of an average text.

Thus the remaining X (?) thousand words cover only 2.5% of the text. However, from an information theoretic point of view,

the first 100 words comprise 30% of the information,
the first 1000 words comprise 50% of the information,
the first 4000 words comprise 70% of the information.

Consequently, rare words convey a great deal of information. We could say that a frequent word is most useful in the aggregate, and a rare word in a particular case.
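The coverage figures quoted above can be recomputed for any tokenized text. The following is a minimal sketch, not the procedure used by Guiraud: the function name `coverage_of_top_words` and the toy sentence are assumptions introduced here for illustration only.

```python
from collections import Counter


def coverage_of_top_words(tokens, k):
    """Fraction of the running words accounted for by the k most frequent word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(k) returns at most k (word, frequency) pairs, highest frequency first.
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / total


# Toy text, hypothetical and far too small to reproduce the quoted percentages.
text = "the cat saw the dog and the dog saw the bird".split()
for k in (1, 2, 3):
    print(k, round(coverage_of_top_words(text, k), 3))
```

On a large corpus, evaluating the same function at k = 100, 1000 and 4000 would give figures that can be set beside the percentages quoted in the text.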

Other studies in glottochronology concern themselves with the rate of change in language and in basic vocabulary. Further, the distribution of the frequencies of occurrence, with or without reference to any particular vocabulary, has also been studied.

Finding relations of the above kind is not just an academic exercise to satisfy the curiosity of a few linguists, but these relationships may have various practical applications. For example, Maas (1972) asserts that the knowledge of a functional relation between the length of a text and the size of the vocabulary used in it would be desirable in order to estimate the effort needed for extension of a machine dictionary or in comparison of vocabulary contents of texts of different lengths. In the latter case, one can standardize or normalize the texts under investigation by reducing them to a common minimal length through computational methods and then compare the resulting

Let $V$ be the number of different elements (words) in a text and $N$ the length of the text. Then we surmise, says Maas, a functional relationship to exist between $N$ and $V$:

$$V = f(N)$$

Muller (1964) reported a relation between $V$ and $N$ such that the ratio of their logarithms is constant:

$$\frac{\log N}{\log V} = a, \quad \text{or} \quad V^{a} = N,$$

or, if we set $1/a = k$,

$$V = N^{k}$$
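The relation attributed to Muller can be checked numerically for a single text: given the text length and the vocabulary size, the exponent follows from the ratio of logarithms. The sketch below is only illustrative; the function names and the sample figures are assumptions, not data from the studies cited.

```python
import math


def muller_exponent(n_tokens, n_types):
    """Return k = log V / log N, so that V = N ** k for this one text."""
    if n_tokens <= 1 or n_types < 1:
        raise ValueError("need N > 1 and V >= 1")
    return math.log(n_types) / math.log(n_tokens)


def vocabulary_size(n_tokens, k):
    """Vocabulary size predicted by V = N ** k."""
    return n_tokens ** k


# Hypothetical text: 10,000 running words using 100 different words.
k = muller_exponent(10_000, 100)
print(k)
print(vocabulary_size(10_000, k))
```

Computing k for texts of different lengths is one way to see whether it stays constant in practice.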

Since the vocabulary of a language, however, is supposed to be restricted, so argues Maas, the existence of a limiting value is to be postulated:

$$V_{\infty} = \lim_{N \to \infty} f(N)$$

As the derivative of $f$ at a given value of $N$ represents the relative increase in $V$, it is to be stated that $f'(N)$ approaches 0 with increasing $N$. The derivative of $f$ at the point 1 is assumed to be 1 because

Therefore $f'$ is a function that decreases monotonically from 1 to 0.

As a consequence of the above speculations, in the expression $V = N^{k}$, $k$ cannot be constant. Statistical investigations of the dramas by Corneille have resulted in the relationship

1

log

E

=

0 . 0 1 3 7 . ( l o g N)

1/3

Thus, if $N$ is given, $k$ can be determined, and $V$ can be calculated from $V = N^{k}$.

Another noteworthy concept is that of the repetition factor, $N/V$, which shows how often a word has occurred in a text on the average. The following relationship has been determined:

[...]

which displays a very good agreement with reality. No single empirical law seems to exist between $N$ and $V$ for all $N$.

2. The Problem of Coverage

We are now coming close to the core subject matter of this paper.

Mackey (1965) states that "the coverage or covering capacity of an item is the number of things one can say with it. It can be measured by the number of other items which it can displace." According to him, words can displace other words by four means: (1) inclusion, (2) extension, (3) combination, and (4) definition.

1. A word that already includes the meaning of other words can be used instead of these (e.g., seat includes chair, bench, stool, and place).

2. Words the meanings of which are easily extended metaphorically can be used to eliminate others (e.g.,

3. Certain simple words can displace others by combining either together or with simple word endings (e.g., news + paper + man = journalist; hand + book = manual).

4. Certain words can be replaced by simple definition (e.g., breakfast can be defined as morning meal; pony as small horse).

As an example of the application of the above principle, in the derivation of Basic English (by definition), the language was first reduced to 7500 words, and, by redefinition, cut down to 1500. These were further reduced to the eventual 850 by a technique of "panoptic" definition (eliminate each word on the grounds that it is some sort of modification of other words, e.g. a modification in time, number, or size).

Basic English, which was founded essentially on the principle of coverage, was a conscious reaction against the over-application of the principle of frequency in selection. For Ogden (1933), it was not the frequency of a word which makes it useful, it was its usefulness which makes it frequent.

In the following part of this section, we attempt to present some of the salient points of Savard (1970).

not sufficient to select words for a restricted vocabulary for the purpose of teaching a foreign language, such as French, to beginners. An objective criterion is lexical valence. It would allow

1. to obtain a novel principle of vocabulary selection,

2. to assist the investigators in setting up a base vocabulary for French,

3. to provide a usable definition, combination, inclusion, and extension vocabulary,

4. to correct all the already existing scales of French vocabulary,

5. to provide a valid working tool for the analysis of teaching material.

The valence problem is a problem of verbal economy. What he calls valence is the fundamental capability of a word to be substituted for another word. It is Mackey's coverage that he

Like Mackey (1965), he maintains that the substitution of one word for another can be made by virtue of four criteria: (1) definition, (2) inclusion, (3) combination, (4) extension.

Definition has already been discussed previously. Linguists do not talk specifically about inclusion; rather, they deal with synonymy or lexical parallelism. Synonyms are words that have nearly the same meaning, e.g. lieu and endroit. For Savard, the basic criterion that permits establishing a series of synonyms is the possibility of substituting one term for another.

One of the simplest among all the procedures of vocabulary enrichment consists of joining two words in order to make compound words. The principle of combination appears as another phenomenon common to all languages.

It is not necessary that the number of simple words be unbounded because almost all verbs have a potential of undetermined sense, and so do the adjectives. A word is said to have more or less extension according to whether it can "cover" a more or less great number of fully or partially different notions.

Polysemy is the exact opposite of synonymy. Polysemy becomes

homonymy constitute two very rich sources of lexical economy. Together they form Savard's last criterion of lexical valence, the semantic extension.

Although the valence itself has never been mathematically measured and although there exists no scientific means of showing its existence, it has, nevertheless, been proven that four formal procedures of lexical economy permit the replacement of certain words by other words, and that is what Savard calls lexical valence.

The postulated existence hypothesis of lexical valence leads to the calculation of a global index of valence for every word.

To evaluate the power of definition of a word, one inspects, in the dictionary, each element of the general list and counts how many times a word enters into the definition of another.

To measure the power of combination of a lexical unit, one inspects in the dictionary all the compound words joined by a hyphen, all the Gallicisms (in English, these would be Anglicisms) and, in general, all the word groups.

With a view to appraising the power of inclusion, one inspects the units of the general list in two synonym dictionaries and takes the higher number. The number of synonyms that a word possesses constitutes a measure of the number of words


To measure the power of semantic extension, one inspects each of the elements of the general list in the dictionary and counts the number of meanings given by the author to such a word in the list. The number of meanings of a word is considered as a measure of its power of semantic extension.

The global index of lexical valence is the sum of the four normalized counts. The two criteria having the highest correlation are definition and combination.
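As an illustration only, the summation just described can be sketched in Python. The text does not say how the four counts were normalized, so the scheme used here (dividing each count by the largest count for that criterion) is an assumption, as are the function names and the sample counts, which are invented for the example.

```python
from typing import Dict

# The four criteria of lexical valence named in the text.
CRITERIA = ("definition", "combination", "inclusion", "extension")


def normalize(counts: Dict[str, float]) -> Dict[str, float]:
    """Scale raw counts to [0, 1] by dividing by the largest count.

    ASSUMPTION: the source does not specify the normalization scheme.
    """
    top = max(counts.values())
    if top == 0:
        return {word: 0.0 for word in counts}
    return {word: count / top for word, count in counts.items()}


def global_valence(raw: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """raw[criterion][word] is a count; return word -> sum of normalized counts."""
    scaled = {criterion: normalize(raw[criterion]) for criterion in CRITERIA}
    words = raw[CRITERIA[0]].keys()
    return {word: sum(scaled[c][word] for c in CRITERIA) for word in words}


# Invented counts for three words, for illustration only.
raw_counts = {
    "definition":  {"make": 40, "thing": 20, "blue": 0},
    "combination": {"make": 10, "thing": 5,  "blue": 5},
    "inclusion":   {"make": 8,  "thing": 8,  "blue": 2},
    "extension":   {"make": 20, "thing": 10, "blue": 5},
}
index = global_valence(raw_counts)
```

With this normalization, each criterion contributes at most 1, so the global index of a word lies between 0 and 4.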

In the beginning of the study, it was assumed that the four variables were entirely independent of each other. The results of a factor analysis indicate that they are not completely so. A factor rotation shows, however, that the variables are sufficiently independent to make it necessary to retain the four criteria of lexical valence.

A comparison of the rank of the first 40 content words on the valence scale with the rank of the same words on the frequency list allows one to frame the hypothesis that the correlation between valence and frequency would be rather weak. A more complete study would show without doubt that we have there two very different selection principles.
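One common way to quantify such a comparison of two rankings is Spearman's rank correlation coefficient. The text does not say which statistic, if any, was computed, so the sketch below is an assumption introduced for illustration. It uses the standard formula for rankings without ties, and the sample ranks are invented.

```python
from typing import Sequence


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman's rank correlation for two rankings without ties.

    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference.
    """
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have the same length, at least 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented ranks of five words on the two scales, for illustration only.
valence_rank = [1, 2, 3, 4, 5]
frequency_rank = [2, 1, 4, 3, 5]
rho = spearman_rho(valence_rank, frequency_rank)
```

A value near 1 would mean the two scales order the words almost identically. A value near 0 would support the hypothesis of two different selection principles.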

In conclusion, it can be stated with confidence that the measure of valence is no less valid than that of frequency, distribution, and availability. These concepts will eventually lead to more efficient dictionaries with respect to precision,


ON LEXICOMETRIC RELATIONSHIPS AMONG THE SIZE OF DEFINING SET, THE SIZE OF DEFINED SET AND THE MAXIMUM LENGTH OF DEFINITIONS

1. Some Measures of Coverage

A dictionary may be considered efficient and economical if it uses a reasonably small set of words to define a relatively large set of entries. We have, however, only a very vague idea of what size of vocabulary is needed to cover a given number of dictionary entries. (The related problem of circular definitions seems to have to wait for a computer solution.)

It is known, for example, that Basic English, Ogden (1933), involves a list of 850 English words and 50 international words, which were eventually used to define the 20,000 English words of the Basic English Dictionary. This gives a ratio of the number of covering words to that of defined words of 0.045.

West studied the problem of what constitutes a simple definition and established a minimum defining vocabulary of 1,490 words. The meaning of some 18,000 words and 6,000 idioms, i.e. about 24,000 expressions, was explained exclusively by these 1,490 words, which were not defined themselves. The results were published in 1961 as The New Method English Dictionary by M. P. West and J. G. Endicott. The corresponding size ratio here is 0.062,


can define a set of about 20 times that size, but in general the behavior of these variables has not been investigated and is not known in any detail.
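The two size ratios quoted above follow directly from the counts given in the text. The short Python check below reproduces them. The function and variable names are chosen here for illustration.

```python
def size_ratio(covering: int, covered: int) -> float:
    """Ratio of the number of covering (defining) words to covered (defined) words."""
    return covering / covered


# Basic English: 850 English + 50 international words define 20,000 words.
basic_english = size_ratio(850 + 50, 20_000)

# New Method English Dictionary: 1,490 words explain 18,000 words + 6,000 idioms.
new_method = size_ratio(1_490, 18_000 + 6_000)

print(round(basic_english, 3), round(new_method, 3))  # prints: 0.045 0.062
```

The reciprocals of these ratios, roughly 22 and 16, are consistent with the rough figure of a covered set about 20 times the size of the covering set.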

One of us, in Findler (1970), has formulated the problem in definite terms. Three variables were considered: (1) the covered set S of size $v_S$, (2) the covering set R of size $v_R$, and (3) the maximum definition length N, such that each word in S can be defined by at most N ordered words. The task is to find:

(a) $v_R$ as a function of $v_S$ at different values of N as a parameter, and

(b) $v_R$ as a function of N at different values of $v_S$ as a parameter.

Using the terminology of increment ratio for $\Delta v_R / \Delta v_S$ and size ratio for $v_R / v_S$, it was postulated for case (a) that

* the increment ratio is, in general, less than one,

* the increment ratio, in general, decreases as $v_S$ increases,

* for large values of N, $v_R$ asymptotically approaches a limiting value as $v_S$ increases,

* the increment ratio will never exceed the size ratio.

An exception to this rule would occur in a dictionary system that does not treat homonyms as individual entries.
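To make the three variables and the two ratios concrete, the following Python sketch measures them on a toy dictionary. The data structure (entries mapped to ordered lists of defining words), the function names, and the sample entries are all assumptions made for illustration. Covering words are not themselves entries here, as in West's scheme described earlier.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: entry -> ordered list of defining words.
Dictionary = Dict[str, List[str]]


def measure(d: Dictionary) -> Tuple[int, int, int]:
    """Return (v_S, v_R, N): covered-set size, covering-set size, longest definition."""
    v_s = len(d)
    v_r = len({word for definition in d.values() for word in definition})
    n = max((len(definition) for definition in d.values()), default=0)
    return v_s, v_r, n


def ratios(before: Dictionary, after: Dictionary) -> Tuple[float, float]:
    """Return (increment ratio, size ratio) when `before` grows into `after`."""
    s0, r0, _ = measure(before)
    s1, r1, _ = measure(after)
    if s1 == s0:
        raise ValueError("the covered set did not grow")
    return (r1 - r0) / (s1 - s0), r1 / s1


# Invented entries, for illustration only.
small: Dictionary = {
    "puppy": ["young", "dog"],
    "kitten": ["young", "cat"],
}
large: Dictionary = dict(small)
large["kennel"] = ["dog", "house"]   # adds one new covering word, "house"
large["cattery"] = ["cat", "house"]  # adds no new covering word

increment_ratio, size_ratio = ratios(small, large)
```

In this toy case two new entries require only one new covering word, so the increment ratio is 0.5 and does not exceed the size ratio of 1.0. With so few entries the size ratio is far from the values reported for real dictionaries.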


It was further assumed that for N=1, the covering set and the covered set are of the same size, i.e. both the increment ratio and the size ratio equal one. We must now correct this statement because not every word is defined by itself only. If a new word is introduced that already has a synonym in the covering set, it will be defined by that synonym. Then the increment ratio is 0 and the size ratio becomes less than 1.
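The correction can be checked on a small example. In the Python sketch below, each entry is defined by exactly one word (N=1). The words and variable names are invented for illustration.

```python
# N = 1: each entry maps to its single defining word.
# Initially every word is defined by itself only.
dictionary = {"big": "big", "small": "small", "fast": "fast"}


def sizes(d):
    """Return (v_S, v_R) for a dictionary with one-word definitions."""
    return len(d), len(set(d.values()))


s0, r0 = sizes(dictionary)   # covered and covering sets have the same size

# Introduce a new word that already has a synonym in the covering set.
dictionary["large"] = "big"
s1, r1 = sizes(dictionary)

increment_ratio = (r1 - r0) / (s1 - s0)  # no new covering word was needed
size_ratio = r1 / s1                     # now below one
```

Before the addition both sets contain three words. Afterwards the covered set has four words and the covering set still has three, giving an increment ratio of 0 and a size ratio of 0.75.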

For the second case, (b), it is postulated that

* $v_R$ monotonically decreases as N increases,

* for any fixed $v_S$ value, $v_R$ asymptotically approaches a lower limit as N increases without bound.

It was finally pointed out that $v_R$ should be small to minimize storage requirements, and N should be small to minimize processing time and output volume. A compromise on these conflicting requirements is needed. The ultimate question is: "What are the optimum $v_R$ and N values for a given $v_S$ for certain computer applications on a machine with a given cost structure?"
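One simple way to frame this compromise is to attach a cost to each requirement and pick the cheapest feasible combination. The Python sketch below assumes a linear cost model, which is not taken from the text. The candidate pairs and the cost weights are invented, and in practice the feasible pairs for a given covered-set size would have to come from measurements on a real dictionary.

```python
from typing import Iterable, Tuple


def total_cost(v_r: int, n: int, storage_cost: float, processing_cost: float) -> float:
    """ASSUMED linear model: storage grows with v_R, processing with N."""
    return storage_cost * v_r + processing_cost * n


def best_pair(
    candidates: Iterable[Tuple[int, int]],
    storage_cost: float,
    processing_cost: float,
) -> Tuple[int, int]:
    """Return the (v_R, N) pair with the lowest total cost."""
    return min(
        candidates,
        key=lambda pair: total_cost(pair[0], pair[1], storage_cost, processing_cost),
    )


# Hypothetical feasible (v_R, N) pairs for one fixed covered-set size:
# a larger covering set allows shorter definitions, and vice versa.
candidates = [(3000, 5), (1500, 10), (900, 20)]

costly_processing = best_pair(candidates, storage_cost=1.0, processing_cost=100.0)
cheap_processing = best_pair(candidates, storage_cost=1.0, processing_cost=10.0)
```

Under these invented weights, expensive processing favors the pair (1500, 10), while cheap processing favors the smaller covering set with longer definitions, (900, 20).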

It is reasonable to assume that the behavior of the three variables, and therefore the answer to the last question, will largely depend on the semantic index of the elements of the covered set and on the lexical valence of the elements of the

TABLE II. Covered-Covering Relationships

Fig. 3. Flow

